TY - GEN
T1 - Scaling sparse matrix multiplication on CPU-GPU nodes
AU - Xia, Yang
AU - Jiang, Peng
AU - Agrawal, Gagan
AU - Ramnath, Rajiv
N1 - Funding Information:
This work is motivated by the observation that, despite significant interest in SpGEMM implementation on GPUs, the limited memory of GPUs hinders execution when the matrices involved are large, as is often the case in practice. To address this gap, we proposed an iterative out-of-core GPU implementation. A number of optimizations and design choices were critical to this process. Since data transfer overhead is a dominant factor in performance, we overlapped execution with data transfers. In the process, we had to create an implementation that does not require dynamic memory (de)allocation and is judicious about data transfers. We extend our iterative implementation to a hybrid implementation, which offloads a fraction of chunks to the CPU to further improve performance. Our evaluation with 9 large matrices shows that we outperform a state-of-the-art CPU implementation, that our hybrid implementation further achieves speedups of up to 3.74X, and that reordering and asynchronous execution turn out to be important for performance. Acknowledgements. This research was partially supported by NSF awards CCF-1629392, CCF-2007793, and OAC-2034850.
Publisher Copyright:
© 2021 IEEE.
PY - 2021/5
Y1 - 2021/5
N2 - Multiplication of two sparse matrices (SpGEMM) is a popular kernel behind many numerical solvers and also features in implementing many common graph algorithms. Though many recent research efforts have focused on implementing SpGEMM efficiently on a single GPU, none of the existing work has considered the case where the memory requirements exceed the size of GPU memory. Similarly, the use of the aggregate computing power of CPU and GPU has also not been addressed for such large matrices. In this paper, we present a framework for scaling SpGEMM computations to matrices that do not fit into GPU memory. We address how the computation and data can be partitioned across kernel executions on GPUs. An important emphasis in our work is overlapping data movement and computation. We achieve this by addressing many challenges, such as avoiding dynamic memory allocations and rescheduling data transfers alongside the computation of chunks. We extend our framework to make efficient use of both the GPU and the CPU by developing an effective work distribution strategy. Our evaluation on 9 large matrices shows that our out-of-core GPU implementation achieves 1.98-3.03X speedups over a state-of-the-art multi-core CPU implementation, that our hybrid implementation further achieves speedups of up to 3.74X, and that our design choices directly contribute to this performance.
AB - Multiplication of two sparse matrices (SpGEMM) is a popular kernel behind many numerical solvers and also features in implementing many common graph algorithms. Though many recent research efforts have focused on implementing SpGEMM efficiently on a single GPU, none of the existing work has considered the case where the memory requirements exceed the size of GPU memory. Similarly, the use of the aggregate computing power of CPU and GPU has also not been addressed for such large matrices. In this paper, we present a framework for scaling SpGEMM computations to matrices that do not fit into GPU memory. We address how the computation and data can be partitioned across kernel executions on GPUs. An important emphasis in our work is overlapping data movement and computation. We achieve this by addressing many challenges, such as avoiding dynamic memory allocations and rescheduling data transfers alongside the computation of chunks. We extend our framework to make efficient use of both the GPU and the CPU by developing an effective work distribution strategy. Our evaluation on 9 large matrices shows that our out-of-core GPU implementation achieves 1.98-3.03X speedups over a state-of-the-art multi-core CPU implementation, that our hybrid implementation further achieves speedups of up to 3.74X, and that our design choices directly contribute to this performance.
UR - http://www.scopus.com/inward/record.url?scp=85113526372&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113526372&partnerID=8YFLogxK
U2 - 10.1109/IPDPS49936.2021.00047
DO - 10.1109/IPDPS49936.2021.00047
M3 - Conference contribution
AN - SCOPUS:85113526372
T3 - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
SP - 392
EP - 401
BT - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021
Y2 - 17 May 2021 through 21 May 2021
ER -