Abstract
Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that precludes empirical auto-tuning. Here, we focus on the challenge posed by applications that require auto-tuning not of a small number of distinct kernels, but of a large number of kernels that exhibit similar computation and memory access characteristics and require optimization over similar problem spaces. We propose an auto-tuning method for tensor contraction functions on GPUs, based on parameterized micro-benchmarks. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2x over the version that used default optimizations without auto-tuning.
Original language | English (US) |
---|---|
Pages | 181-182 |
Number of pages | 2 |
State | Published - 2011 |
Externally published | Yes |
Event | 20th International Conference on Parallel Architectures and Compilation Techniques, PACT 2011 - Galveston, TX, United States; Duration: Oct 10 2011 → Oct 14 2011 |
Conference
Conference | 20th International Conference on Parallel Architectures and Compilation Techniques, PACT 2011 |
---|---|
Country/Territory | United States |
City | Galveston, TX |
Period | 10/10/11 → 10/14/11 |
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture