TY - GEN
T1 - End-to-End LU Factorization of Large Matrices on GPUs
AU - Xia, Yang
AU - Jiang, Peng
AU - Agrawal, Gagan
AU - Ramnath, Rajiv
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/2/25
Y1 - 2023/2/25
N2 - LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13 - 32.65X. Further, our out-of-core implementation achieves a speedup of 1.2 - 2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.
AB - LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13 - 32.65X. Further, our out-of-core implementation achieves a speedup of 1.2 - 2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.
KW - GPU acceleration
KW - LU factorization
KW - memory limits
UR - http://www.scopus.com/inward/record.url?scp=85149325232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85149325232&partnerID=8YFLogxK
U2 - 10.1145/3572848.3577486
DO - 10.1145/3572848.3577486
M3 - Conference contribution
AN - SCOPUS:85149325232
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 288
EP - 300
BT - PPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
T2 - 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023
Y2 - 25 February 2023 through 1 March 2023
ER -