TY - GEN
T1 - Landing CG on EARTH - A Case Study of Fine-Grained Multithreading on an Evolutionary Path
T2 - 2000 ACM/IEEE Conference on Supercomputing, SC 2000
AU - Theobald, Kevin B.
AU - Agrawal, Gagan
AU - Kumar, Rishi
AU - Heber, Gerd
AU - Gao, Guang R.
AU - Stodghill, Paul
AU - Pingali, Keshav
N1 - DBLP's bibliographic metadata records provided through http://dblp.org/search/publ/api are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.
PY - 2000
Y1 - 2000
N2 - We report on our work in developing a fine-grained multithreaded solution for the communication-intensive Conjugate Gradient (CG) problem. In our recent work, we developed a simple yet efficient program for sparse matrix-vector multiply on a multithreaded system. This paper presents an effective mechanism for the reduction-broadcast phase, which is integrated with the sparse MVM, resulting in a scalable implementation of the complete CG application. Three major observations from our experiments on the EARTH multithreaded testbed are: (1) The scalability of our CG implementation is impressive, e.g., absolute speedup is 90 on 120 processors for the NAS CG class B input. (2) Our dataflow-style reduction-broadcast network based on fine-grain multithreading is twice as fast as a serial reduction scheme on the same system. (3) By slowing down the network by a factor of 2, no notable degradation of overall CG performance was observed.
UR - http://www.scopus.com/inward/record.url?scp=77955203874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955203874&partnerID=8YFLogxK
U2 - 10.1109/SC.2000.10011
DO - 10.1109/SC.2000.10011
M3 - Conference contribution
T3 - Proceedings of the International Conference on Supercomputing
SP - 4
BT - SC 2000 - Proceedings of the 2000 ACM/IEEE Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 4 November 2000 through 10 November 2000
ER -