TY - GEN
T1 - End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression
AU - Anupreetham, Anupreetham
AU - Ibrahim, Mohamed
AU - Hall, Mathew
AU - Boutros, Andrew
AU - Kuzhively, Ajay
AU - Mohanty, Abinash
AU - Nurvitadhi, Eriko
AU - Betz, Vaughn
AU - Cao, Yu
AU - Seo, Jae Sun
N1 - Funding Information:
This work is partially supported by NSF grant 1652866, the Intel ISRA program on FPGA, JUMP C-BRIC (an SRC program sponsored by DARPA), the Intel/NSERC Industrial Research Chair in Programmable Silicon, and the Vector Institute for Artificial Intelligence. Any opinions, findings, conclusions or recommendations are those of the authors and not of the funding institutions.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Object detection is an important computer vision task, with many applications in autonomous driving, smart surveillance, robotics, and other domains. Single-shot detectors (SSD) coupled with a convolutional neural network (CNN) for feature extraction can efficiently detect, classify and localize various objects in an input image with very high accuracy. In such systems, the convolution layers extract features and predict the bounding box locations for the detected objects as well as their confidence scores. Then, a non-maximum suppression (NMS) algorithm eliminates partially overlapping boxes and selects the bounding box with the highest score per class. However, these two components are strictly sequential; a conventional NMS algorithm needs to wait for all box predictions to be produced before processing them. This prohibits any overlap between the execution of the convolutional layers and NMS, resulting in significant latency overhead and throughput degradation. In this paper, we present a novel NMS algorithm that alleviates this bottleneck and enables a fully-pipelined hardware implementation. We also implement an end-to-end system for low-latency SSD-MobileNet-V1 object detection, which combines a state-of-the-art deeply-pipelined CNN accelerator with a custom hardware implementation of our novel NMS algorithm. As a result of our new algorithm, the NMS module adds a minimal latency overhead of only 0.13 μs to the SSD-MobileNet-V1 convolution layers. Our end-to-end object detection system implemented on an Intel Stratix 10 FPGA runs at a maximum operating frequency of 350 MHz, with a throughput of 609 frames-per-second and an end-to-end batch-1 latency of 2.4 ms. Our system achieves 1.5× higher throughput and 4.4× lower latency compared to the current state-of-the-art SSD-based object detection systems on FPGAs.
UR - http://www.scopus.com/inward/record.url?scp=85125759116&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125759116&partnerID=8YFLogxK
U2 - 10.1109/FPL53798.2021.00021
DO - 10.1109/FPL53798.2021.00021
M3 - Conference contribution
AN - SCOPUS:85125759116
T3 - Proceedings - 2021 31st International Conference on Field-Programmable Logic and Applications, FPL 2021
SP - 76
EP - 82
BT - Proceedings - 2021 31st International Conference on Field-Programmable Logic and Applications, FPL 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st International Conference on Field-Programmable Logic and Applications, FPL 2021
Y2 - 30 August 2021 through 3 September 2021
ER -