TY - GEN
T1 - Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks
AU - Ma, Yufei
AU - Cao, Yu
AU - Vrudhula, Sarma
AU - Seo, Jae-sun
N1 - Publisher Copyright:
© 2017 ACM.
PY - 2017/2/22
Y1 - 2017/2/22
N2 - As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing end-to-end VGG-16 CNN model and achieved 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2x enhancement compared to state-of-the-art FPGA implementations of VGG model.
AB - As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing end-to-end VGG-16 CNN model and achieved 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2x enhancement compared to state-of-the-art FPGA implementations of VGG model.
KW - Convolutional neural networks
KW - FPGA
KW - Hardware acceleration
UR - http://www.scopus.com/inward/record.url?scp=85016023112&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85016023112&partnerID=8YFLogxK
U2 - 10.1145/3020078.3021736
DO - 10.1145/3020078.3021736
M3 - Conference contribution
AN - SCOPUS:85016023112
T3 - FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
SP - 45
EP - 54
BT - FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
PB - Association for Computing Machinery, Inc
T2 - 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2017
Y2 - 22 February 2017 through 24 February 2017
ER -