
Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Ahmad Abdelfattah¹, David Keyes¹, Hatem Ltaief²

¹ Division of Mathematical and Computer Sciences and Engineering
² Supercomputing Laboratory
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

HeteroPar 2012, Rhodes Island, Greece, August 27, 2012


Outline

1. Motivations
2. Related Work
3. Kernel Design
4. Performance Results
5. Conclusion




Basic Linear Algebra Subroutines

• Level 1 BLAS: vector-vector operations
• Level 2 BLAS: matrix-vector operations
• Level 3 BLAS: matrix-matrix operations
• Basic building blocks for most scientific applications
• Dense linear algebra routines (structured data access) are a very good match for GPU architectures (high memory bandwidth and compute power).
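As a concrete illustration (not part of the original slides), the sketch below shows what a Level 2 BLAS operation, y = alpha*A*x + beta*y, looks like when invoked through the vendor cuBLAS library. The wrapper name sgemv_example and the assumption of column-major, device-resident data are mine.

// Minimal sketch: single-precision GEMV via cuBLAS; sgemv_example is a
// hypothetical wrapper, A is an m x n column-major matrix on the device.
#include <cublas_v2.h>

void sgemv_example(cublasHandle_t handle, int m, int n,
                   const float *dA, int lda,   // device pointer to A
                   const float *dx,            // device pointer to x (length n)
                   float *dy)                  // device pointer to y (length m)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Non-transposed GEMV: y = alpha*A*x + beta*y
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, dA, lda, dx, 1, &beta, dy, 1);
}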


Related Work

• CUBLAS is the vendor implementation of BLAS on NVIDIA GPUs. It is often regarded as the reference implementation.
• MAGMA is a free library of BLAS and LAPACK routines for GPUs and hybrid architectures.
  • Developed by the Innovative Computing Laboratory (ICL), University of Tennessee, Knoxville.
  • Some of its BLAS routines, such as the SYMV and GEMM kernels, outperform CUBLAS.
• CULA is a commercial package that implements BLAS and LAPACK routines on NVIDIA GPUs.
  • Recently, CULA released a free academic edition.


Related Work (cont.)

• For productivity, auto-tuning frameworks are needed to tune a kernel for a new architecture.
  • For example, the ASTRA framework (UTK) was initially proposed to tune GEMM on Fermi.
  • It has since been extended to Kepler and to kernels other than GEMM.
• Sometimes, major architectural innovations force a redesign of the GPU kernel.


SSYMV [performance figure]


DSYMV [performance figure]


Symmetric Matrix-Vector Kernel

• Achieves 80% of the attainable peak performance on a Fermi C2070 GPU (accepted at VECPAR'12).
• Outperforms all existing implementations (including commercial packages).
• The key idea is to hide global memory latency.
• Development was guided by performance profilers (PAPI/CUDA).


Design Concepts

• The matrix is divided into square blocks.
• There are two levels of granularity:
  • Block-level strategy: how thread blocks (TBs) move through the matrix.
  • Thread-level strategy: how the threads inside a TB process a single matrix block.
• The block-level strategy of SYMV is tricky: each thread block travels vertically through the matrix blocks in either the lower or the upper half (see the sketch after this list).
• The block-level strategy of GEMV is simpler: each TB is responsible for a whole row (or column) of matrix blocks.
• The thread-level strategy is the key to high performance; it is typically the same for SYMV and GEMV.
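A minimal sketch of the SYMV block-level traversal described above, for the lower-half case. This is my own illustration: the kernel name is hypothetical and the per-block arithmetic is deliberately omitted.

// Illustrative only: one TB per block column; TB "bx" starts at the diagonal
// block (bx, bx) and moves vertically down its block column in the lower half.
__global__ void symv_lower_traversal_sketch(int nblocks)
{
    int bx = blockIdx.x;                      // block column owned by this TB
    for (int brow = bx; brow < nblocks; ++brow) {
        // process matrix block (brow, bx); by symmetry this block also
        // contributes to block row bx of the result (arithmetic omitted)
    }
}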


Block level strategy of GEMV

• For a matrix with M rows and N columns, we launch:
  • M / BlockSize (+1) TBs in the non-transposed case, or
  • N / BlockSize (+1) TBs in the transposed case.
• TBs move horizontally in the non-transposed mode and vertically in the transposed mode (see the launch sketch below).
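The launch configuration implied by these counts can be sketched as follows. BLOCK_SIZE, launch_gemv, and the two kernel names are hypothetical placeholders, not the actual K-BLAS interface.

#define BLOCK_SIZE 64   // assumed blocking factor

// Hypothetical kernels implementing the two computation modes.
__global__ void gemv_kernel_n(int M, int N, const float *A, int lda,
                              const float *x, float *y);
__global__ void gemv_kernel_t(int M, int N, const float *A, int lda,
                              const float *x, float *y);

void launch_gemv(bool transposed, int M, int N,
                 const float *dA, int lda, const float *dx, float *dy)
{
    // One TB per row (non-transposed) or column (transposed) of matrix blocks,
    // plus one extra TB when the dimension is not a multiple of BLOCK_SIZE.
    int dim   = transposed ? N : M;
    int numTB = (dim + BLOCK_SIZE - 1) / BLOCK_SIZE;

    dim3 grid(numTB), threads(BLOCK_SIZE);
    if (transposed)
        gemv_kernel_t<<<grid, threads>>>(M, N, dA, lda, dx, dy);
    else
        gemv_kernel_n<<<grid, threads>>>(M, N, dA, lda, dx, dy);
}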


Thread level strategy of GEMV

• Three main ideas are incorporated:
  • Instruction-level parallelism (ILP) at the thread level through double buffering.
  • Maximum latency hiding for stalled warps by maximizing parallelism within a single thread block.
  • The role of shared memory is restricted to a final reduction step, whenever one is necessary (experimentally, shared memory is up to 6 times slower than registers on Fermi).
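A minimal sketch of the first and third ideas for the non-transposed mode: each thread owns one row of A, prefetches the next element into a second register while the current product is issued, and never touches shared memory because no reduction is needed. This is my own illustration under those assumptions, not the K-BLAS kernel; the name gemv_n_sketch and column-major storage are assumed.

// Illustrative non-transposed GEMV thread loop with register double buffering.
__global__ void gemv_n_sketch(int M, int N, const float * __restrict__ A, int lda,
                              const float * __restrict__ x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row >= M) return;

    float sum = 0.0f;
    float a = A[row];                // prefetch A(row, 0) into register buffer "a"
    float b;                         // second register buffer
    for (int j = 0; j < N - 1; ++j) {
        // Load the next column while the current product is being issued:
        // the load of "b" and the FMA on "a" can overlap (double buffering).
        b = A[row + (j + 1) * lda];
        sum += a * x[j];
        a = b;
    }
    sum += a * x[N - 1];             // drain the pipeline

    y[row] = sum;                    // each thread owns one y element, so no
                                     // reduction and no shared memory are needed
}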


Performance: Non-transposed case [performance figures]


Performance: Non-transposed case

• The main improvement comes for relatively small and medium-size matrices:
  • an average 60% improvement in single precision, and
  • an average 25% improvement in double precision.
• For larger matrices, performance is almost the same as other implementations (≈80% of the attainable peak).


Performance: Transposed case [performance figures]


Performance: Transposed case

• Performance is almost identical for large matrices, again at about 80% of the attainable peak.
• For relatively small matrices, we lose performance:
  • the work per thread in the transposed case is larger than in the non-transposed case.
• Currently, both computation modes are configured with the same parameters; auto-tuning each mode separately is a possible remedy (see the sketch below).
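One way such per-mode tuning could be exposed, sketched under my own assumptions (the struct and the particular numbers are purely illustrative, not tuned values):

// Purely illustrative: expose the blocking parameters per computation mode so an
// offline auto-tuning sweep can pick them independently.
struct GemvConfig {
    int block_size;        // threads per TB
    int elems_per_thread;  // matrix elements handled by each thread per step
};

// Hypothetical results of two independent sweeps; real values would come from tuning.
static const GemvConfig cfg_non_transposed = {64, 4};
static const GemvConfig cfg_transposed     = {128, 2};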


Performance: Impact on SP Bidiagonal Reduction [figure]


Performance: Impact on DP Bidiagonal Reduction [figure]


Performance: Impact on Bidiagonal Reduction

• The reduction drivers are taken from MAGMA.
• For matrices of dimension ≤ 8000, the average performance improvement is 25% in SP and 20% in DP.
• For larger matrices, the improvement drops to an average of 7% in SP and 12% in DP.
• CULA reduction routines were not available for free when the paper was written (an academic edition is now available for free).


Performance: Impact on SP Hessenberg Reduction [figure]


Performance: Impact on DP Hessenberg Reduction [figure]


Performance: Impact on Hessenberg Reduction

• The reduction drivers are taken from MAGMA.
• In SP, the average performance improvement is 35% for matrices of dimension ≤ 8000, then drops to 7% for larger ones.
• In DP, the average performance improvement is 17% for matrices of dimension ≤ 6000, then drops to 2%.


Summary

• A new GEMV implementation that offers better performance for relatively small and medium-size matrices.
• The improvement has a strong impact on higher-level LAPACK reduction routines, such as the bidiagonal and Hessenberg reduction algorithms.
• KAUST BLAS (K-BLAS) is freely available at http://ksl.kaust.edu.sa/Pages/HatemLtaief.aspx


Future work

• For specific dimensions, we see a huge performance dip.
  • Most probably, this happens under a worst-case mapping of TBs onto SMs (when #TBs mod #SMs is very low).
• Breaking down the work on a whole row/column of matrix blocks among several TBs is an obvious solution; however, the overhead of a global reduction should be monitored carefully (see the sketch below).
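A minimal sketch of that split, under my own assumptions: the kernel name, column-major layout, a zero-initialized y, and atomicAdd as the global reduction are illustrative choices, not the authors' plan.

// Several TBs (indexed by blockIdx.y) share one row of matrix blocks and combine
// their partial dot products with a global atomic reduction into y.
__global__ void gemv_n_split_sketch(int M, int N, const float * __restrict__ A, int lda,
                                    const float * __restrict__ x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // row owned by this thread
    if (row >= M) return;

    // Slice the columns evenly among the TBs along the y dimension of the grid.
    int chunk  = (N + gridDim.y - 1) / gridDim.y;
    int jstart = blockIdx.y * chunk;
    int jend   = min(N, jstart + chunk);

    float partial = 0.0f;
    for (int j = jstart; j < jend; ++j)
        partial += A[row + j * lda] * x[j];            // column-major access

    // Global reduction across the TBs sharing this row (y must be zeroed first);
    // this extra memory traffic is the overhead that has to be watched.
    atomicAdd(&y[row], partial);
}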


Before lining up...


Questions?
