Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

Yufei Ma; Yu Cao; Sarma Vrudhula; Jae-sun Seo

doi:10.1109/TVLSI.2018.2815603

Research output: Contribution to journal › Article › peer-review

245 Scopus citations

Abstract

Because convolution accounts for most of the operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
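The loop structure the abstract refers to can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common decomposition into four loop levels (MACs within one kernel window, across input feature maps, across positions within a feature map, and across output feature maps). The function names, argument layout, and loop order are all hypothetical, and no unrolling, tiling, or interchange is applied.

```python
# Illustrative sketch only: names, data layout, and loop order are assumptions
# made for this example and are not taken from the paper.

def conv_layer(inputs, weights, stride=1):
    """Direct convolution written as four nested loop levels (no padding).

    inputs:  [n_if][n_iy][n_ix]         input feature maps
    weights: [n_of][n_if][n_ky][n_kx]   one kernel stack per output feature map
    returns: [n_of][n_oy][n_ox]         output feature maps
    """
    n_if, n_iy, n_ix = len(inputs), len(inputs[0]), len(inputs[0][0])
    n_of, n_ky, n_kx = len(weights), len(weights[0][0]), len(weights[0][0][0])
    n_oy = (n_iy - n_ky) // stride + 1
    n_ox = (n_ix - n_kx) // stride + 1

    outputs = [[[0.0] * n_ox for _ in range(n_oy)] for _ in range(n_of)]
    for of in range(n_of):                  # level 4: across output feature maps
        for oy in range(n_oy):              # level 3: scan positions within a map
            for ox in range(n_ox):
                acc = 0.0
                for i_f in range(n_if):     # level 2: across input feature maps
                    for ky in range(n_ky):  # level 1: MACs inside one kernel window
                        for kx in range(n_kx):
                            acc += (inputs[i_f][oy * stride + ky][ox * stride + kx]
                                    * weights[of][i_f][ky][kx])
                outputs[of][oy][ox] = acc
    return outputs


def mac_count(n_of, n_if, n_oy, n_ox, n_ky, n_kx):
    """Number of multiply-accumulate operations the loop nest above performs."""
    return n_of * n_if * n_oy * n_ox * n_ky * n_kx
```

Because every loop bound is a free parameter, each level can in principle be unrolled, tiled, or reordered independently, which is one way to see why the design space grows quickly. The product in `mac_count` gives the amount of work a layer requires regardless of how the loops are arranged.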

Original language	English (US)
Pages (from-to)	1354-1367
Number of pages	14
Journal	IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Volume	26
Issue number	7
DOIs	https://doi.org/10.1109/TVLSI.2018.2815603
State	Published - Jul 2018

Keywords

Accelerator architectures
convolutional neural networks (CNNs)
field-programmable gate array (FPGA)
neural network hardware

ASJC Scopus subject areas

Software
Hardware and Architecture
Electrical and Electronic Engineering

Cite this

@article{0d34a25305d2454584542e97b5a288b8,

title = "Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA",

abstract = "As convolution contributes most operations in convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.",

keywords = "Accelerator architectures, convolutional neural networks (CNNs), field-programmable gate array (FPGA), neural network hardware",

author = "Yufei Ma and Yu Cao and Sarma Vrudhula and Jae-sun Seo",

note = "Funding Information: Manuscript received October 27, 2017; revised February 3, 2018; accepted March 6, 2018. Date of publication April 3, 2018; date of current version June 26, 2018. This work was supported in part by the NSF I/UCRC Center for Embedded Systems through NSF under Grant 1230401, Grant 1237856, Grant 1701241, Grant 1361926, Grant 1535669, Grant 1652866, and Grant 1715443; and in part by the Intel Labs, and in part by the Samsung Advanced Institute of Technology. (Corresponding author: Yufei Ma.) Y. Ma, Y. Cao, and J.-s. Seo are with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA (e-mail: yufeima@asu.edu; yu.cao@asu.edu; jaesun.seo@asu.edu). Publisher Copyright: {\textcopyright} 2018 IEEE.",

year = "2018",

month = jul,

doi = "10.1109/TVLSI.2018.2815603",

language = "English (US)",

volume = "26",

pages = "1354--1367",

journal = "IEEE Transactions on Very Large Scale Integration (VLSI) Systems",

issn = "1063-8210",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "7",

}

TY - JOUR

T1 - Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

AU - Ma, Yufei

AU - Cao, Yu

AU - Vrudhula, Sarma

AU - Seo, Jae-sun

N1 - Funding Information: Manuscript received October 27, 2017; revised February 3, 2018; accepted March 6, 2018. Date of publication April 3, 2018; date of current version June 26, 2018. This work was supported in part by the NSF I/UCRC Center for Embedded Systems through NSF under Grant 1230401, Grant 1237856, Grant 1701241, Grant 1361926, Grant 1535669, Grant 1652866, and Grant 1715443; and in part by the Intel Labs, and in part by the Samsung Advanced Institute of Technology. (Corresponding author: Yufei Ma.) Y. Ma, Y. Cao, and J.-s. Seo are with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA (e-mail: yufeima@asu.edu; yu.cao@asu.edu; jaesun.seo@asu.edu). Publisher Copyright: © 2018 IEEE.

PY - 2018/7

Y1 - 2018/7

N2 - As convolution contributes most operations in convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.

KW - Accelerator architectures

KW - convolutional neural networks (CNNs)

KW - field-programmable gate array (FPGA)

KW - neural network hardware

UR - http://www.scopus.com/inward/record.url?scp=85049389987&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049389987&partnerID=8YFLogxK

U2 - 10.1109/TVLSI.2018.2815603

DO - 10.1109/TVLSI.2018.2815603

M3 - Article

AN - SCOPUS:85049389987

SN - 1063-8210

VL - 26

SP - 1354

EP - 1367

JO - IEEE Transactions on Very Large Scale Integration (VLSI) Systems

JF - IEEE Transactions on Very Large Scale Integration (VLSI) Systems

IS - 7

ER -
