TY - GEN
T1 - CUDA-DTM
T2 - 7th International Conference on Networked Systems, NETYS 2019
AU - Irving, Samuel
AU - Chen, Sui
AU - Peng, Lu
AU - Busch, Costas
AU - Herlihy, Maurice
AU - Michael, Christopher J.
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - We present CUDA-DTM, the first Distributed Transactional Memory framework written in CUDA for large-scale GPU clusters. Transactional Memory has become an attractive auto-coherence scheme for GPU applications with irregular memory access patterns due to its ability to avoid serializing threads while still maintaining programmability. We extend GPU Software Transactional Memory to allow threads across many GPUs to access a coherent distributed shared memory space and propose a scheme for GPU-to-GPU communication using CUDA-Aware MPI. The performance of CUDA-DTM is evaluated using a suite of seven irregular memory access benchmarks with varying degrees of compute intensity, contention, and node-to-node communication frequency. Using a cluster of 256 devices, our experiments show that GPU clusters using CUDA-DTM can be up to 115x faster than CPU clusters.
AB - We present CUDA-DTM, the first Distributed Transactional Memory framework written in CUDA for large-scale GPU clusters. Transactional Memory has become an attractive auto-coherence scheme for GPU applications with irregular memory access patterns due to its ability to avoid serializing threads while still maintaining programmability. We extend GPU Software Transactional Memory to allow threads across many GPUs to access a coherent distributed shared memory space and propose a scheme for GPU-to-GPU communication using CUDA-Aware MPI. The performance of CUDA-DTM is evaluated using a suite of seven irregular memory access benchmarks with varying degrees of compute intensity, contention, and node-to-node communication frequency. Using a cluster of 256 devices, our experiments show that GPU clusters using CUDA-DTM can be up to 115x faster than CPU clusters.
KW - CUDA
KW - Distributed Transactional Memory
KW - GPU cluster
UR - http://www.scopus.com/inward/record.url?scp=85075595144&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075595144&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-31277-0_12
DO - 10.1007/978-3-030-31277-0_12
M3 - Conference contribution
AN - SCOPUS:85075595144
SN - 9783030312763
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 183
EP - 199
BT - Networked Systems - 7th International Conference, NETYS 2019, Revised Selected Papers
A2 - Atig, Mohamed Faouzi
A2 - Schwarzmann, Alexander A.
PB - Springer
Y2 - 19 June 2019 through 21 June 2019
ER -