TY - GEN
T1 - Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs
AU - Taka, Endri
AU - Gourounas, Dimitrios
AU - Gerstlauer, Andreas
AU - Marculescu, Diana
AU - Arora, Aman
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - FPGAs are a promising platform for accelerating Deep Learning (DL) applications due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., the AMD/Xilinx Versal ACAP and the Intel Stratix 10 NX, employ significantly different architectural approaches. This paper presents novel systematic frameworks to optimize the performance of General Matrix Multiplication (GEMM), a fundamental operation in DL workloads, by exploiting the unique and distinct architectural characteristics of each FPGA. Our evaluation on int8 GEMM workloads shows up to 77 and 68 TOPs of throughput, with up to 0.94 and 1.35 TOPs/W energy efficiency, for the Versal VC1902 and the Stratix 10 NX, respectively. This work provides insights and guidelines for optimizing GEMM-based applications on both platforms, while also delving into their programmability trade-offs and associated challenges.
AB - FPGAs are a promising platform for accelerating Deep Learning (DL) applications due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., the AMD/Xilinx Versal ACAP and the Intel Stratix 10 NX, employ significantly different architectural approaches. This paper presents novel systematic frameworks to optimize the performance of General Matrix Multiplication (GEMM), a fundamental operation in DL workloads, by exploiting the unique and distinct architectural characteristics of each FPGA. Our evaluation on int8 GEMM workloads shows up to 77 and 68 TOPs of throughput, with up to 0.94 and 1.35 TOPs/W energy efficiency, for the Versal VC1902 and the Stratix 10 NX, respectively. This work provides insights and guidelines for optimizing GEMM-based applications on both platforms, while also delving into their programmability trade-offs and associated challenges.
KW - ACAP
KW - AI Engine
KW - AI Tensor Blocks
KW - Deep Learning
KW - FPGA
KW - GEMM
KW - Hardware Acceleration
KW - Stratix
KW - Versal
UR - http://www.scopus.com/inward/record.url?scp=85204400142&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204400142&partnerID=8YFLogxK
U2 - 10.1109/FCCM60383.2024.00015
DO - 10.1109/FCCM60383.2024.00015
M3 - Conference contribution
AN - SCOPUS:85204400142
T3 - Proceedings - 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2024
SP - 54
EP - 65
BT - Proceedings - 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2024
Y2 - 5 May 2024 through 8 May 2024
ER -