End-to-End LU Factorization of Large Matrices on GPUs

Yang Xia; Peng Jiang; Gagan Agrawal; Rajiv Ramnath

doi:10.1145/3572848.3577486

Computer & Cyber Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

LU factorization of sparse matrices is an important computational step in many engineering and scientific problems, such as circuit simulation. There have been many efforts to parallelize and scale this algorithm, including recent work targeting GPUs. However, deploying a complete sparse LU factorization workflow on a GPU remains challenging because of high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, which removes the bottleneck caused by large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on GPUs. Finally, for the numeric factorization phase, we increase the degree of parallelism by removing, relative to existing implementation approaches, the memory limits for large matrices. Experimental results show that, compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13x to 32.65x. Further, our out-of-core implementation achieves a speedup of 1.2x to 2.2x over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization are effective.
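The abstract mentions Kahn's algorithm for topological sort. As a rough illustration only, the sketch below shows a sequential, level-by-level form of Kahn's algorithm in Python. It is not the paper's GPU dynamic parallelism implementation, which the abstract does not detail. The function name `kahn_levels` and the small dependency graph are hypothetical, chosen for this example. The assumption is that each node stands for a unit of work (for instance, a matrix column) and each edge `(u, v)` means `u` must be processed before `v`, so that nodes in the same level do not depend on each other.

```python
def kahn_levels(num_nodes, edges):
    """Group the nodes of a DAG into dependency levels with Kahn's algorithm.

    edges is a list of (u, v) pairs meaning "u must be processed before v".
    Nodes that end up in the same level have no dependencies on each other.
    """
    successors = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # The first level holds every node with no incoming edges.
    frontier = [n for n in range(num_nodes) if in_degree[n] == 0]
    levels = []
    visited = 0
    while frontier:
        levels.append(frontier)
        visited += len(frontier)
        next_frontier = []
        for u in frontier:
            for v in successors[u]:
                # Removing u's outgoing edge may free v for the next level.
                in_degree[v] -= 1
                if in_degree[v] == 0:
                    next_frontier.append(v)
        frontier = next_frontier

    if visited != num_nodes:
        raise ValueError("dependency graph contains a cycle")
    return levels


# Hypothetical dependency graph with five nodes (illustration only).
example_edges = [(0, 2), (1, 2), (2, 3), (1, 4)]
example_levels = kahn_levels(5, example_edges)
print(example_levels)  # [[0, 1], [2, 4], [3]]
```

In this sketch each pass over the frontier is an independent batch of work, which is the property a parallel implementation would presumably exploit; how the paper maps this onto GPU kernels is not described in the abstract.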

Original language	English (US)
Title of host publication	PPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Publisher	Association for Computing Machinery
Pages	288-300
Number of pages	13
ISBN (Electronic)	9798400700156
DOIs	https://doi.org/10.1145/3572848.3577486
State	Published - Feb 25 2023
Event	28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023 - Montreal, Canada Duration: Feb 25 2023 → Mar 1 2023

Publication series

Name	Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference	28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023
Country/Territory	Canada
City	Montreal
Period	2/25/23 → 3/1/23

Keywords

GPU acceleration
LU factorization
memory limits

ASJC Scopus subject areas

Software

Access to Document

10.1145/3572848.3577486

Cite this

Xia, Y., Jiang, P., Agrawal, G., & Ramnath, R. (2023). End-to-End LU Factorization of Large Matrices on GPUs. In PPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (pp. 288-300). (Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP). Association for Computing Machinery. https://doi.org/10.1145/3572848.3577486

@inproceedings{517e4c6d64a8465ebef7399431e16dae,
title = "End-to-End LU Factorization of Large Matrices on GPUs",
abstract = "LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13 - 32.65X. Further, our out-of-core implementation achieves a speedup of 1.2 - 2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.",
keywords = "GPU acceleration, LU factorization, memory limits",
author = "Yang Xia and Peng Jiang and Gagan Agrawal and Rajiv Ramnath",
note = "Publisher Copyright: {\textcopyright} 2023 ACM.; 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023 ; Conference date: 25-02-2023 Through 01-03-2023",
year = "2023",
month = feb,
day = "25",
doi = "10.1145/3572848.3577486",
language = "English (US)",
series = "Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP",
publisher = "Association for Computing Machinery",
pages = "288--300",
booktitle = "PPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming",
}

TY - GEN
T1 - End-to-End LU Factorization of Large Matrices on GPUs
AU - Xia, Yang
AU - Jiang, Peng
AU - Agrawal, Gagan
AU - Ramnath, Rajiv
PY - 2023/2/25
Y1 - 2023/2/25
N2 - LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13 - 32.65X. Further, our out-of-core implementation achieves a speedup of 1.2 - 2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.

AB - LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13 - 32.65X. Further, our out-of-core implementation achieves a speedup of 1.2 - 2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.

KW - GPU acceleration
KW - LU factorization
KW - memory limits
UR - http://www.scopus.com/inward/record.url?scp=85149325232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85149325232&partnerID=8YFLogxK
U2 - 10.1145/3572848.3577486
DO - 10.1145/3572848.3577486
M3 - Conference contribution
AN - SCOPUS:85149325232
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 288
EP - 300
BT - PPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
T2 - 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023
Y2 - 25 February 2023 through 1 March 2023
ER -
