Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Ahmad Abdelfattah (1), David Keyes (1), Hatem Ltaief (2)

(1) Division of Mathematical and Computer Sciences and Engineering
(2) Supercomputing Laboratory
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

HeteroPar 2012, Rhodes Island, Greece, Aug 27, 2012
Outline

1. Motivations
2. Related Work
3. Kernel Design
4. Performance Results
5. Conclusion
Basic Linear Algebra Subroutines

• Level 1 BLAS: vector-vector operations
• Level 2 BLAS: matrix-vector operations (see the example after this slide)
• Level 3 BLAS: matrix-matrix operations
• These are the basic building-block operations for most scientific applications.
• Dense linear algebra routines (structured data access) are a very good match for GPU architectures (high memory bandwidth and compute power).
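As a concrete illustration of a Level 2 BLAS operation (the kind of memory-bound kernel targeted in this talk), the sketch below calls the cuBLAS single-precision GEMV, y = alpha*A*x + beta*y, on device data. It is a minimal sketch: the helper name gemv_example and the assumption that A, x, and y are already allocated and filled on the device are illustrative, not taken from the slides.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Minimal sketch: y = alpha*A*x + beta*y with cuBLAS (Level 2 BLAS).
   d_A (m x n, column-major), d_x, and d_y are assumed to be already
   allocated and initialized on the device. */
void gemv_example(const float *d_A, const float *d_x, float *d_y,
                  int m, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    /* Non-transposed GEMV; the leading dimension equals the row count m. */
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m, d_x, 1, &beta, d_y, 1);

    cublasDestroy(handle);
}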
Related Work
• CUBLAS is the vendor implementation of BLAS on GPUs. It is often regarded as the reference implementation.
• MAGMA is a free library for BLAS and LAPACK routines on GPUs and hybrid architectures.
  • Developed by the Innovative Computing Laboratory (ICL), University of Tennessee, Knoxville.
  • Some of its BLAS routines, such as the SYMV and GEMM kernels, outperform CUBLAS.
• CULA is a commercial package that implements BLAS and LAPACK routines on NVIDIA GPUs.
  • Recently, CULA made an academic edition available for free.
Related Work (cont.)
• For productivity, auto-tuning frameworks are needed to retune a kernel for a new architecture.
• For example, the ASTRA framework (UTK) was initially proposed to tune GEMM on Fermi.
  • It has since been extended to Kepler and to kernels other than GEMM.
• Sometimes, major architectural innovations force a redesign of the GPU kernel.
SSYMV

[Figure: SSYMV performance results]

DSYMV

[Figure: DSYMV performance results]
Symmetric Matrix-Vector Kernel
• Achieves 80% of the achievable peak performance on a Fermi C2070 GPU (accepted at VECPAR'12).
• Outperforms all existing implementations (including commercial packages).
• The key idea is to hide global memory latency.
• Performance profilers (PAPI/CUDA) were used to analyze the kernel.
Design Concepts

• The matrix is divided into square blocks.
• Two levels of granularity:
  • Block-level strategy: how thread blocks (TBs) move through the matrix.
  • Thread-level strategy: how the threads inside a TB process a single matrix block.
• The block-level strategy of SYMV is tricky: each thread block travels vertically through the matrix blocks of either the lower or the upper half.
• The block-level strategy of GEMV is simpler: each TB is responsible for a whole row (or column) of matrix blocks.
• The thread-level strategy is the key to high performance; it is typically the same for SYMV and GEMV.
Block-level strategy of GEMV

• For a matrix with M rows and N columns, we launch:
  • M / BlockSize (+1 if there is a remainder) TBs in the non-transposed case, or
  • N / BlockSize (+1 if there is a remainder) TBs in the transposed case.
• TBs move horizontally in the non-transposed mode and vertically in the transposed mode (a host-side sketch of the grid setup follows).
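The grid-size rule above maps directly to a launch configuration. The following is a minimal host-side sketch under assumed names: gemvn_kernel and BLOCK_SIZE (64) are illustrative, and the naive per-thread loop inside the kernel is only there to make the example self-contained; it is not the optimized thread-level strategy of the talk.

#include <cuda_runtime.h>

#define BLOCK_SIZE 64   /* assumed thread-block size: one thread per row */

/* Naive illustration of the block-level mapping for non-transposed GEMV:
   each TB owns BLOCK_SIZE rows and sweeps horizontally across the columns
   of the column-major matrix A. */
__global__ void gemvn_kernel(int m, int n, const float *A, int lda,
                             const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[row + (size_t)col * lda] * x[col];
    y[row] = sum;
}

void launch_gemvn(int m, int n, const float *dA, int lda,
                  const float *dx, float *dy)
{
    /* M / BlockSize TBs, plus one extra TB when M is not a multiple of
       the block size (the "+1" on the slide). */
    int grid = m / BLOCK_SIZE + (m % BLOCK_SIZE ? 1 : 0);
    gemvn_kernel<<<grid, BLOCK_SIZE>>>(m, n, dA, lda, dx, dy);
}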
Thread-level strategy of GEMV

[Figure: thread-level strategy of GEMV]
Thread-level strategy of GEMV

• Three main ideas are incorporated (a simplified sketch follows):
  • ILP at the thread level through double buffering.
  • Maximum latency hiding for stalled warps by maximizing parallelism within a single thread block.
  • The role of shared memory is restricted to a final reduction step, whenever necessary (experimentally, shared memory is up to 6 times slower than registers on Fermi).
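The double-buffering idea can be sketched as follows: each thread prefetches the next chunk of its data into one register buffer while it multiplies and accumulates the chunk already held in the other buffer, so the prefetch latency overlaps with the arithmetic. This is a simplified per-row sketch of the idea only, not the kernel from the talk; the chunk width CHUNK, the kernel name, and the assumption that n is a positive multiple of CHUNK are all illustrative.

#define CHUNK 4   /* illustrative register-buffer width */

/* Minimal sketch of register double buffering for one GEMV row:
   load the next CHUNK elements of A and x into one register buffer
   while accumulating the products of the previous buffer.
   Assumes column-major A and n a positive multiple of CHUNK. */
__global__ void gemvn_dbuf(int m, int n, const float *A, int lda,
                           const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float a_buf[2][CHUNK], x_buf[2][CHUNK];
    float sum = 0.0f;

    /* Prime the first buffer with columns 0..CHUNK-1. */
    for (int k = 0; k < CHUNK; ++k) {
        a_buf[0][k] = A[row + (size_t)k * lda];
        x_buf[0][k] = x[k];
    }

    int cur = 0;
    for (int col = CHUNK; col < n; col += CHUNK) {
        int nxt = cur ^ 1;
        /* Issue the loads for the next chunk first ... */
        for (int k = 0; k < CHUNK; ++k) {
            a_buf[nxt][k] = A[row + (size_t)(col + k) * lda];
            x_buf[nxt][k] = x[col + k];
        }
        /* ... then do the arithmetic on the chunk already in registers,
           so the loads above overlap with these multiply-adds. */
        for (int k = 0; k < CHUNK; ++k)
            sum += a_buf[cur][k] * x_buf[cur][k];
        cur = nxt;
    }
    /* Drain the last buffered chunk. */
    for (int k = 0; k < CHUNK; ++k)
        sum += a_buf[cur][k] * x_buf[cur][k];

    y[row] = sum;
}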
Performance: Non-transposed case

[Performance figures, non-transposed case]
Performance: Non-transposed case
• The main improvement comes for relatively small/medium matrices:
  • an average 60% improvement in single precision, and
  • an average 25% improvement in double precision.
• For larger matrices, performance is almost the same as other implementations (≈80% of the achievable peak performance).
Performance: Transposed case

[Performance figures, transposed case]
Performance: Transposed case
• For large matrices, performance is almost identical to other implementations, again at 80% of the achievable peak performance.
• For relatively small matrices, we lose some performance.
  • The work per thread in the transposed case is larger than in the non-transposed case.
• Currently, both computation modes are configured with the same parameters; separate auto-tuning for each mode is a possible remedy.
Performance: Impact on SP Bidiagonal Reduction

[Figure: impact on single-precision bidiagonal reduction]

Performance: Impact on DP Bidiagonal Reduction

[Figure: impact on double-precision bidiagonal reduction]
Performance: Impact on Bidiagonal Reduction
• The reduction drivers are taken from MAGMA.
• For matrices of dimension ≤8000, the average performance improvement is 25% in SP and 20% in DP.
• For larger matrices, the improvement drops to an average of 7% in SP and 12% in DP.
• CULA reduction routines were not available for free when the paper was written (an academic edition is now available for free).
Performance: Impact on SP Hessenberg Reduction

[Figure: impact on single-precision Hessenberg reduction]

Performance: Impact on DP Hessenberg Reduction

[Figure: impact on double-precision Hessenberg reduction]
Performance: Impact on Hessenberg Reduction
• The reduction drivers are taken from MAGMA.
• For matrices of dimension ≤8000, the average improvement is 35% in SP, then drops to 7% for larger matrices.
• In DP, the average improvement is 17% for matrices ≤6000, then drops to 2%.
Summary
• A new GEMV implementation that offers better performance for relatively small/medium-size matrices.
• The improvement has a strong impact on higher-level LAPACK reduction routines, such as the bidiagonal and Hessenberg reduction algorithms.
• KAUST BLAS (K-BLAS) is freely available at http://ksl.kaust.edu.sa/Pages/HatemLtaief.aspx
Future work
• For specific dimensions, we see a huge performance dip.
• Most probably, this happens with a worst-case mapping of TBs onto SMs (when #TBs mod #SMs is very low, the last wave of TBs leaves most SMs idle).
• Breaking down the work on a whole row/column of matrix blocks among several TBs is an obvious solution; however, the overhead of a global reduction should be monitored carefully (a sketch of such a split follows).
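One way to split a row across several TBs, as suggested above, is to let each TB compute a partial dot product over a slice of the columns and combine the partials with atomic additions. The sketch below illustrates only that idea; the kernel name, the one-row-per-block grid shape, and the assumptions that y is zero-initialized, the block size is a power of two, and n divides evenly among the TBs are all illustrative, not the scheme planned in the talk.

/* Minimal sketch: gridDim.y TBs cooperate on each output row y[row].
   Each TB reduces its partial sums in shared memory, then one thread
   per TB adds the block's contribution to y[row] atomically. */
__global__ void gemvn_split(int m, int n, const float *A, int lda,
                            const float *x, float *y)
{
    extern __shared__ float part[];

    int row   = blockIdx.x;          /* one row per blockIdx.x */
    int slice = n / gridDim.y;       /* columns per TB (assumed exact) */
    int start = blockIdx.y * slice;

    /* Each thread strides over its TB's slice of the row. */
    float sum = 0.0f;
    for (int col = start + threadIdx.x; col < start + slice; col += blockDim.x)
        sum += A[row + (size_t)col * lda] * x[col];
    part[threadIdx.x] = sum;
    __syncthreads();

    /* Tree reduction in shared memory within the TB
       (blockDim.x assumed to be a power of two). */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            part[threadIdx.x] += part[threadIdx.x + s];
        __syncthreads();
    }

    /* Global reduction across the TBs sharing this row: this atomic
       traffic is the overhead the slide warns about. */
    if (threadIdx.x == 0)
        atomicAdd(&y[row], part[0]);
}

A launch would look like gemvn_split<<<dim3(m, slices), threads, threads * sizeof(float)>>>(...) after zeroing y; how much the atomic traffic costs in practice is exactly what would need to be measured.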
Before lining up...
Questions?