TY - GEN
T1 - Transmuter
T2 - 2020 ACM International Conference on Parallel Architectures and Compilation Techniques, PACT 2020
AU - Pal, Subhankar
AU - Feng, Siying
AU - Park, Dong Hyeon
AU - Kim, Sung
AU - Amarnath, Aporva
AU - Yang, Chi Sheng
AU - He, Xin
AU - Beaumont, Jonathan
AU - May, Kyle
AU - Xiong, Yan
AU - Kaszyk, Kuba
AU - Morton, John Magnus
AU - Sun, Jiawen
AU - O'Boyle, Michael
AU - Cole, Murray
AU - Chakrabarti, Chaitali
AU - Blaauw, David
AU - Kim, Hun Seok
AU - Mudge, Trevor
AU - Dreslinski, Ronald
N1 - Funding Information:
We thank the anonymous reviewers for their helpful feedback. The material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7864. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.
Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/9/30
Y1 - 2020/9/30
N2 - With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build hardware for emerging applications that meet power and performance targets, while remaining flexible and programmable for end users. This is particularly true for domains that have frequently changing algorithms and applications involving mixed sparse/dense data structures, such as those in machine learning and graph analytics. To overcome this, we present a flexible accelerator called Transmuter, in a novel effort to bridge the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs). Transmuter adapts to changing kernel characteristics, such as data reuse and control divergence, through the ability to reconfigure the on-chip memory type, resource sharing and dataflow at run-time within a short latency. This is facilitated by a fabric of light-weight cores connected to a network of reconfigurable caches and crossbars. Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications. Finally, in order to support programmability and ease-of-adoption, we prototype a software stack composed of low-level runtime routines, and a high-level language library called TransPy, that cater to expert programmers and end-users, respectively. Our evaluations with Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels predominant in graph analytics, scientific computing and machine learning. Transmuter achieves energy-efficiency gains averaging 3.4× and 2.0× over prior FPGA and CGRA implementations of the same kernels, while remaining on average within 9.3× of state-of-the-art ASICs.
AB - With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build hardware for emerging applications that meet power and performance targets, while remaining flexible and programmable for end users. This is particularly true for domains that have frequently changing algorithms and applications involving mixed sparse/dense data structures, such as those in machine learning and graph analytics. To overcome this, we present a flexible accelerator called Transmuter, in a novel effort to bridge the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs). Transmuter adapts to changing kernel characteristics, such as data reuse and control divergence, through the ability to reconfigure the on-chip memory type, resource sharing and dataflow at run-time within a short latency. This is facilitated by a fabric of light-weight cores connected to a network of reconfigurable caches and crossbars. Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications. Finally, in order to support programmability and ease-of-adoption, we prototype a software stack composed of low-level runtime routines, and a high-level language library called TransPy, that cater to expert programmers and end-users, respectively. Our evaluations with Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels predominant in graph analytics, scientific computing and machine learning. Transmuter achieves energy-efficiency gains averaging 3.4× and 2.0× over prior FPGA and CGRA implementations of the same kernels, while remaining on average within 9.3× of state-of-the-art ASICs.
KW - Dataflow reconfiguration
KW - General-purpose acceleration
KW - Hardware acceleration
KW - Memory reconfiguration
KW - Reconfigurable architectures
UR - http://www.scopus.com/inward/record.url?scp=85094212659&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85094212659&partnerID=8YFLogxK
U2 - 10.1145/3410463.3414627
DO - 10.1145/3410463.3414627
M3 - Conference contribution
AN - SCOPUS:85094212659
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 175
EP - 190
BT - PACT 2020 - Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
PB - Association for Computing Machinery
Y2 - 3 October 2020 through 7 October 2020
ER -