Memory performance estimation of CUDA programs

Yooseong Kim; Aviral Shrivastava

doi:10.1145/2514641.2514648

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider numerous architectural details, and small changes in source code, especially to the memory access pattern, can affect performance significantly. This makes CUDA programs very difficult to optimize. This article presents CuMAPz, a tool to analyze and compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories and optimize their programs for efficient memory behavior. CuMAPz models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflict, channel skew, and branch divergence. Experimental results show that CuMAPz can accurately estimate performance, with a correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 30% more than the previous approach [Hong and Kim 2010].
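Two of the factors named in the abstract, global memory access coalescing and shared memory bank conflict, can be illustrated with a small counting sketch. This is not the CuMAPz model; it is a minimal illustration under assumed parameters (32 threads per warp, 4-byte elements, 128-byte global memory segments, 32 shared memory banks), and the function names are hypothetical.

```python
def transactions_per_warp(stride, warp_size=32, elem_bytes=4, segment_bytes=128):
    """Count the distinct global memory segments a warp touches
    when thread t reads element t * stride (assumed access pattern)."""
    segments = {(t * stride * elem_bytes) // segment_bytes for t in range(warp_size)}
    return len(segments)


def bank_conflict_degree(stride, warp_size=32, num_banks=32):
    """Return the largest number of distinct shared memory words that
    map to a single bank when thread t accesses word t * stride."""
    words_per_bank = {}
    for t in range(warp_size):
        word = t * stride
        # Threads reading the same word are served together, so a set
        # keeps only distinct words per bank.
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(stride, transactions_per_warp(stride), bank_conflict_degree(stride))
```

Under these assumptions, a unit stride keeps a warp within one segment and one word per bank, while larger strides spread the same accesses over more segments or pile them onto fewer banks, which is the kind of source-level change the abstract says can shift performance significantly.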

Original language	English (US)
Article number	21
Journal	Transactions on Embedded Computing Systems
Volume	13
Issue number	2
DOIs	https://doi.org/10.1145/2514641.2514648
State	Published - Oct 21 2013

Keywords

CUDA
GPGPU
Memory performance
Performance estimation
Program optimization

ASJC Scopus subject areas

Software
Hardware and Architecture

Cite this

@article{f7176c5a9153421c8a2958ca0700d47a,
  title = "Memory performance estimation of CUDA programs",
  abstract = "CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface to program on GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider numerous architectural details, and small changes in source code, especially on the memory access pattern, can affect performance significantly. This makes it very difficult to optimize CUDA programs. This article presents CuMAPz, which is a tool to analyze and compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories, and optimize their program for efficient memory behavior. CuMAPz models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflict, channel skew, and branch divergence. Experimental results show that CuMAPz can accurately estimate performance with correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 30% more than the previous approach [Hong and Kim 2010].",
  keywords = "CUDA, GPGPU, Memory performance, Performance estimation, Program optimization",
  author = "Yooseong Kim and Aviral Shrivastava",
  year = "2013",
  month = oct,
  day = "21",
  doi = "10.1145/2514641.2514648",
  language = "English (US)",
  volume = "13",
  journal = "Transactions on Embedded Computing Systems",
  issn = "1539-9087",
  publisher = "Association for Computing Machinery (ACM)",
  number = "2",
}

TY - JOUR
T1 - Memory performance estimation of CUDA programs
AU - Kim, Yooseong
AU - Shrivastava, Aviral
PY - 2013/10/21
Y1 - 2013/10/21
AB - CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface to program on GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider numerous architectural details, and small changes in source code, especially on the memory access pattern, can affect performance significantly. This makes it very difficult to optimize CUDA programs. This article presents CuMAPz, which is a tool to analyze and compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories, and optimize their program for efficient memory behavior. CuMAPz models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflict, channel skew, and branch divergence. Experimental results show that CuMAPz can accurately estimate performance with correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 30% more than the previous approach [Hong and Kim 2010].
KW - CUDA
KW - GPGPU
KW - Memory performance
KW - Performance estimation
KW - Program optimization
UR - http://www.scopus.com/inward/record.url?scp=84885650614&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84885650614&partnerID=8YFLogxK
U2 - 10.1145/2514641.2514648
DO - 10.1145/2514641.2514648
M3 - Article
AN - SCOPUS:84885650614
SN - 1539-9087
VL - 13
JO - Transactions on Embedded Computing Systems
JF - Transactions on Embedded Computing Systems
IS - 2
M1 - 21
ER -
