TY - GEN
T1 - Inference engine benchmarking across technological platforms from CMOS to RRAM
AU - Peng, Xiaochen
AU - Kim, Minkyu
AU - Sun, Xiaoyu
AU - Yin, Shihui
AU - Rakshit, Titash
AU - Hatcher, Ryan M.
AU - Kittl, Jorge A.
AU - Seo, Jae-sun
AU - Yu, Shimeng
N1 - Funding Information:
This work was supported by ASCENT, one of the SRC/DARPA JUMP centers, by NSF CCF-1903951, and by the NSF/SRC E2CDA program.
Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/9/30
Y1 - 2019/9/30
N2 - State-of-the-art deep convolutional neural networks (CNNs) are widely used in current AI systems, and achieve remarkable success in image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engines based on various approaches, including the systolic architecture, near memory processing, and the processing-in-memory (PIM) approach with emerging technologies such as resistive random access memory (RRAM). However, a comprehensive comparison of these approaches in a unified framework is missing, and the benefits of new designs or emerging technologies are mostly based on qualitative projections. In this paper, we evaluate the energy efficiency and frame rate of a VGG-like CNN inference accelerator on the CIFAR-10 dataset across technological platforms from CMOS to post-CMOS, under a hardware resource constraint, i.e., comparable on-chip area. We also investigate the effects of off-chip DRAM access and interconnect during data movement, which are the bottlenecks of CMOS platforms. Our quantitative analysis shows that the peripheries (ADCs), rather than the memory array, dominate energy consumption and area in the digital RRAM-based parallel-readout PIM architecture. Despite the presence of ADCs, this architecture shows a >2.5× improvement in energy efficiency (TOPS/W) over systolic arrays or near memory processing, with a comparable frame rate due to reduced DRAM access, high throughput, and optimized parallel readout. Further >10× improvements can be achieved by implementing a bit-count-reduced XNOR network and pipelining.
AB - State-of-the-art deep convolutional neural networks (CNNs) are widely used in current AI systems, and achieve remarkable success in image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engines based on various approaches, including the systolic architecture, near memory processing, and the processing-in-memory (PIM) approach with emerging technologies such as resistive random access memory (RRAM). However, a comprehensive comparison of these approaches in a unified framework is missing, and the benefits of new designs or emerging technologies are mostly based on qualitative projections. In this paper, we evaluate the energy efficiency and frame rate of a VGG-like CNN inference accelerator on the CIFAR-10 dataset across technological platforms from CMOS to post-CMOS, under a hardware resource constraint, i.e., comparable on-chip area. We also investigate the effects of off-chip DRAM access and interconnect during data movement, which are the bottlenecks of CMOS platforms. Our quantitative analysis shows that the peripheries (ADCs), rather than the memory array, dominate energy consumption and area in the digital RRAM-based parallel-readout PIM architecture. Despite the presence of ADCs, this architecture shows a >2.5× improvement in energy efficiency (TOPS/W) over systolic arrays or near memory processing, with a comparable frame rate due to reduced DRAM access, high throughput, and optimized parallel readout. Further >10× improvements can be achieved by implementing a bit-count-reduced XNOR network and pipelining.
KW - Deep convolutional neural network
KW - Hardware accelerator
KW - Near memory processing
KW - Processing in memory
KW - Resistive random access memory
KW - Systolic architecture
UR - http://www.scopus.com/inward/record.url?scp=85075852947&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075852947&partnerID=8YFLogxK
U2 - 10.1145/3357526.3357566
DO - 10.1145/3357526.3357566
M3 - Conference contribution
AN - SCOPUS:85075852947
T3 - ACM International Conference Proceeding Series
SP - 471
EP - 479
BT - MEMSYS 2019 - Proceedings of the International Symposium on Memory Systems
PB - Association for Computing Machinery
T2 - 2019 International Symposium on Memory Systems, MEMSYS 2019
Y2 - 30 September 2019 through 3 October 2019
ER -