Object detection is an important computer vision task, with many applications in autonomous driving, smart surveillance, robotics, and other domains. Single-shot detectors (SSDs) coupled with a convolutional neural network (CNN) for feature extraction can efficiently detect, classify, and localize various objects in an input image with very high accuracy. In such systems, the convolution layers extract features and predict the bounding box locations for the detected objects as well as their confidence scores. Then, a non-maximum suppression (NMS) algorithm eliminates partially overlapping boxes and selects the bounding box with the highest score per class. However, these two components are strictly sequential; a conventional NMS algorithm must wait for all box predictions to be produced before processing them. This prohibits any overlap between the execution of the convolutional layers and NMS, resulting in significant latency overhead and throughput degradation. In this paper, we present a novel NMS algorithm that alleviates this bottleneck and enables a fully-pipelined hardware implementation. We also implement an end-to-end system for low-latency SSD-MobileNet-V1 object detection, which combines a state-of-the-art deeply-pipelined CNN accelerator with a custom hardware implementation of our novel NMS algorithm. As a result of our new algorithm, the NMS module adds a minimal latency overhead of only 0.13 μs to the SSD-MobileNet-V1 convolution layers. Our end-to-end object detection system implemented on an Intel Stratix 10 FPGA runs at a maximum operating frequency of 350 MHz, with a throughput of 609 frames per second and an end-to-end batch-1 latency of 2.4 ms. Our system achieves 1.5× higher throughput and 4.4× lower latency compared to the current state-of-the-art SSD-based object detection systems on FPGAs.
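To illustrate the sequential bottleneck described above, the following is a minimal sketch of conventional greedy NMS (not the paper's pipelined algorithm): because candidates are sorted globally by score, no box can be emitted until every prediction has arrived. The box format, function names, and IoU threshold here are illustrative assumptions, shown single-class for brevity.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: the global sort over *all* scores is what forces this
    step to run only after the CNN has produced every box prediction."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)             # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two heavily overlapping boxes and one distant box:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the lower-scoring overlap is suppressed
```

A fully-pipelined hardware variant must avoid this up-front global sort so that suppression decisions can begin while box predictions are still streaming out of the convolution layers.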