TY - JOUR
T1 - PIMCA
T2 - A Programmable In-Memory Computing Accelerator for Energy-Efficient DNN Inference
AU - Zhang, Bo
AU - Yin, Shihui
AU - Kim, Minkyu
AU - Saikia, Jyotishman
AU - Kwon, Soonwan
AU - Myung, Sungmeen
AU - Kim, Hyunsoo
AU - Kim, Sang Joon
AU - Seo, Jae Sun
AU - Seok, Mingoo
N1 - Publisher Copyright:
© 1966-2012 IEEE.
PY - 2023/5/1
Y1 - 2023/5/1
N2 - This article presents a programmable in-memory computing accelerator (PIMCA) for low-precision (1-2 b) deep neural network (DNN) inference. The custom 10T1C bitcell in the in-memory computing (IMC) macro uses four additional transistors and one capacitor to perform capacitive-coupling-based multiply-and-accumulate (MAC) operations in the analog-mixed-signal (AMS) domain. A macro containing 256 × 128 bitcells can activate all rows simultaneously and can therefore perform a vector-matrix multiplication (VMM) in one cycle. PIMCA integrates 108 such IMC static random-access memory (SRAM) macros with a custom six-stage pipeline and a custom instruction set architecture (ISA) for instruction-level programmability. The results of the IMC macros are fed to a single-instruction-multiple-data (SIMD) processor for other computations such as partial-sum accumulation, max-pooling, and activation functions. To use the IMC and SIMD datapaths effectively, we customize the ISA, notably by adding hardware loop support, which reduces program size by up to 73%. The accelerator is prototyped in 28-nm technology and integrates a total of 3.4-Mb IMC SRAM and 1.5-Mb off-the-shelf activation SRAM, making it one of the largest IMC accelerators to date. It achieves a system-level energy efficiency of 437 TOPS/W and a peak throughput of 49 TOPS at a 42-MHz clock frequency and 1-V supply for VGG9 and ResNet-18 on the CIFAR-10 dataset.
AB - This article presents a programmable in-memory computing accelerator (PIMCA) for low-precision (1-2 b) deep neural network (DNN) inference. The custom 10T1C bitcell in the in-memory computing (IMC) macro uses four additional transistors and one capacitor to perform capacitive-coupling-based multiply-and-accumulate (MAC) operations in the analog-mixed-signal (AMS) domain. A macro containing 256 × 128 bitcells can activate all rows simultaneously and can therefore perform a vector-matrix multiplication (VMM) in one cycle. PIMCA integrates 108 such IMC static random-access memory (SRAM) macros with a custom six-stage pipeline and a custom instruction set architecture (ISA) for instruction-level programmability. The results of the IMC macros are fed to a single-instruction-multiple-data (SIMD) processor for other computations such as partial-sum accumulation, max-pooling, and activation functions. To use the IMC and SIMD datapaths effectively, we customize the ISA, notably by adding hardware loop support, which reduces program size by up to 73%. The accelerator is prototyped in 28-nm technology and integrates a total of 3.4-Mb IMC SRAM and 1.5-Mb off-the-shelf activation SRAM, making it one of the largest IMC accelerators to date. It achieves a system-level energy efficiency of 437 TOPS/W and a peak throughput of 49 TOPS at a 42-MHz clock frequency and 1-V supply for VGG9 and ResNet-18 on the CIFAR-10 dataset.
KW - Capacitive coupling computing
KW - deep neural network (DNN)
KW - in-memory computing (IMC)
KW - programmable accelerator
UR - http://www.scopus.com/inward/record.url?scp=85140791733&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140791733&partnerID=8YFLogxK
U2 - 10.1109/JSSC.2022.3211290
DO - 10.1109/JSSC.2022.3211290
M3 - Article
AN - SCOPUS:85140791733
SN - 0018-9200
VL - 58
SP - 1436
EP - 1449
JO - IEEE Journal of Solid-State Circuits
JF - IEEE Journal of Solid-State Circuits
IS - 5
ER -