Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks

Yufei Ma; Yu Cao; Sarma Vrudhula; Jae-sun Seo

doi:10.1145/3020078.3021736

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

315 Scopus citations

Abstract

As convolution layers account for most of the operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing an end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2x enhancement compared to state-of-the-art FPGA implementations of the VGG model.
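To make the "four levels of loops" concrete, the following is a minimal software sketch of a direct convolution layer and of tiling one of its loops. It is an illustration only, not the accelerator described in the paper: the grouping of the loops into four levels, the function names (`conv_layer`, `conv_layer_tiled`), the parameter `tile_of`, and the absence of padding and bias are all assumptions made for this example.

```python
def conv_layer(inp, weights, stride=1):
    """Direct convolution as a four-level loop nest (no padding, no bias).

    inp:     [n_if][height][width]     input feature maps
    weights: [n_of][n_if][n_ky][n_kx]  one 3-D kernel per output feature map
    Returns the output feature maps and the number of MAC operations.
    """
    n_of = len(weights)
    n_if = len(inp)
    n_ky = len(weights[0][0])
    n_kx = len(weights[0][0][0])
    n_oy = (len(inp[0]) - n_ky) // stride + 1
    n_ox = (len(inp[0][0]) - n_kx) // stride + 1

    out = [[[0] * n_ox for _ in range(n_oy)] for _ in range(n_of)]
    macs = 0
    for of in range(n_of):                  # level 4: across output feature maps
        for oy in range(n_oy):              # level 3: scan over output pixels
            for ox in range(n_ox):
                for c in range(n_if):       # level 2: across input feature maps
                    for ky in range(n_ky):  # level 1: MACs inside one kernel window
                        for kx in range(n_kx):
                            pixel = inp[c][oy * stride + ky][ox * stride + kx]
                            out[of][oy][ox] += pixel * weights[of][c][ky][kx]
                            macs += 1
    return out, macs


def conv_layer_tiled(inp, weights, tile_of=2, stride=1):
    """Same result, with the output-feature-map loop split into tiles.

    Each tile only needs the weights of `tile_of` output maps at a time,
    which mimics fitting a slice of the layer into a small on-chip buffer.
    """
    out = []
    for start in range(0, len(weights), tile_of):
        tile_out, _ = conv_layer(inp, weights[start:start + tile_of], stride)
        out.extend(tile_out)
    return out


# Small demo: 2 input maps of 4x4 pixels, 3 output maps, 3x3 kernels of ones.
demo_inp = [[[(c + 1) * (y * 4 + x) for x in range(4)] for y in range(4)]
            for c in range(2)]
demo_weights = [[[[1] * 3 for _ in range(3)] for _ in range(2)]
                for _ in range(3)]
demo_out, demo_macs = conv_layer(demo_inp, demo_weights)
print(demo_macs, demo_out[0])
```

Tiling changes the order in which data is touched, not the result, so `conv_layer_tiled` returns the same feature maps as `conv_layer`. In hardware, choices like the tile size and which loops are unrolled determine how much on-chip buffering and off-chip memory access a design needs, which is the kind of design space the abstract refers to.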

Original language	English (US)
Title of host publication	FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Publisher	Association for Computing Machinery, Inc
Pages	45-54
Number of pages	10
ISBN (Electronic)	9781450343541
DOIs	https://doi.org/10.1145/3020078.3021736
State	Published - Feb 22 2017
Event	2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2017 - Monterey, United States
Duration	Feb 22 2017 → Feb 24 2017

Publication series

Name	FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Conference

Conference	2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2017
Country/Territory	United States
City	Monterey
Period	2/22/17 → 2/24/17

Keywords

Convolutional neural networks
FPGA
Hardware acceleration

ASJC Scopus subject areas

Hardware and Architecture
Electrical and Electronic Engineering

Access to Document

10.1145/3020078.3021736

Cite this

Ma, Y., Cao, Y., Vrudhula, S., & Seo, J. (2017). Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 45-54). (FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays). Association for Computing Machinery, Inc. https://doi.org/10.1145/3020078.3021736

@inproceedings{e088384c927e49dfa03db21569e0c766,

title = "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks",

keywords = "Convolutional neural networks, FPGA, Hardware acceleration",

author = "Yufei Ma and Yu Cao and Sarma Vrudhula and Jae-sun Seo",

note = "Publisher Copyright: {\textcopyright} 2017 ACM.; 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2017 ; Conference date: 22-02-2017 Through 24-02-2017",

year = "2017",

month = feb,

day = "22",

doi = "10.1145/3020078.3021736",

language = "English (US)",

series = "FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays",

publisher = "Association for Computing Machinery, Inc",

pages = "45--54",

booktitle = "FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays",

}

TY - GEN

T1 - Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks

AU - Ma, Yufei

AU - Cao, Yu

AU - Vrudhula, Sarma

AU - Seo, Jae-sun

PY - 2017/2/22

Y1 - 2017/2/22

KW - Convolutional neural networks

KW - FPGA

KW - Hardware acceleration

UR - http://www.scopus.com/inward/record.url?scp=85016023112&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85016023112&partnerID=8YFLogxK

U2 - 10.1145/3020078.3021736

DO - 10.1145/3020078.3021736

M3 - Conference contribution

AN - SCOPUS:85016023112

T3 - FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

SP - 45

EP - 54

BT - FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

PB - Association for Computing Machinery, Inc

T2 - 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2017

Y2 - 22 February 2017 through 24 February 2017

ER -
