Parallel Multigrid Solvers using OpenMP/MPI Hybrid Programming Models on Multi-Core/Multi-Socket Clusters

Kengo Nakajima
Information Technology Center, The University of Tokyo
Japan Science & Technology Agency (JST)

9th International Meeting on High Performance Computing for Computational Science (VECPAR 2010)
June 22-25, 2010, Berkeley, California, USA
T2K/Tokyo (1/2)
• "T2K Open Supercomputer Alliance"
  – http://www.open-supercomputer.org/
  – Tsukuba, Tokyo, Kyoto
• "T2K Open Supercomputer (Todai Combined Cluster)"
  – by Hitachi, operation started in June 2008
  – 952 nodes (15,232 cores) in total, 141 TFLOPS peak
    • Quad-core Opteron (Barcelona)
  – 53rd in the TOP500 list (June 2010)
  – Fat-tree network with Myrinet-10G
T2K/Tokyo (2/2)
• AMD Quad-core Opteron (Barcelona), 2.3 GHz
• 4 "sockets" per node
  – 16 cores/node
• Multi-core, multi-socket system
• cc-NUMA architecture
  – careful configuration is needed
  – local data should be kept on local memory

[Figure: one T2K/Tokyo node with four sockets; each socket has four cores with private L1/L2 caches, a shared L3 cache, and locally attached memory]
Flat MPI, Hybrid (4x4, 8x2, 16x1)
• HB AxB denotes A OpenMP threads per MPI process and B MPI processes per node; Flat MPI uses 16 MPI processes per node without threading.
• Higher performance of HB 16x1 is important.

[Figure: mapping of MPI processes and OpenMP threads onto the four sockets (0-3) of a node for Flat MPI, Hybrid 4x4, Hybrid 8x2, and Hybrid 16x1]
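A minimal start-up sketch (not taken from the original code) for such a hybrid configuration, assuming MPI_THREAD_FUNNELED support, which is consistent with the communication kernels shown later where MPI calls are issued outside the OpenMP regions; the program and variable names are illustrative only. For HB 4x4 on T2K/Tokyo this would be launched with 4 MPI processes per node and OMP_NUM_THREADS=4.

program hybrid_init
  use mpi
!$ use omp_lib
  implicit none
  integer :: ierr, provided, myrank, nprocs, nthreads

!C
!C-- FUNNELED: only the master thread of each process issues MPI calls
  call MPI_INIT_THREAD (MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, nprocs, ierr)

  nthreads= 1
!$ nthreads= omp_get_max_threads()

  if (myrank.eq.0) then
    write (*,*) 'MPI processes:', nprocs, '  threads/process:', nthreads
  endif

  call MPI_FINALIZE (ierr)
end program hybrid_init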
Domain Decomposition
• Inter-domain: MPI (block Jacobi)
• Intra-domain: OpenMP threads (with re-ordering)
• Example: 6 nodes, 24 sockets, 96 cores

[Figure: domain partitioning of the same mesh for Flat MPI, HB 4x4, and HB 16x1]
First Touch Data Placement
Local data – local memory
• The most common NUMA page-placement algorithm is the "first touch" algorithm, in which the PE that first references a region of memory has the page holding that memory assigned to it.
• A very common technique in OpenMP programs is therefore to initialize data in parallel, using the same loop schedule as will be used later in the computations (a minimal generic sketch follows; the multigrid-specific initialization is shown on the next slide).

[Figure: node diagram as on the previous slides; with first touch, each core's data ends up in the memory attached to its own socket]
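As a minimal, generic illustration of the technique quoted above (the arrays X and Y and the subroutine name are hypothetical, not from the slides): both the initialization and the later computation use the same static loop schedule, so each page is first touched, and therefore placed, by the thread that will use it.

subroutine first_touch_sketch (N, A, X, Y)
  implicit none
  integer, intent(in)         :: N
  real(kind=8), intent(in)    :: A
  real(kind=8), intent(inout) :: X(N), Y(N)
  integer :: i

!C
!C-- initialization in parallel: the thread that first touches X(i)/Y(i)
!C   gets the corresponding page placed in its own socket's memory
!$omp parallel do schedule(static) private(i)
  do i= 1, N
    X(i)= 1.d0
    Y(i)= 0.d0
  enddo
!$omp end parallel do

!C
!C-- computation with the same (static) schedule: each thread now works
!C   on data resident in its local memory
!$omp parallel do schedule(static) private(i)
  do i= 1, N
    Y(i)= A*X(i) + Y(i)
  enddo
!$omp end parallel do

end subroutine first_touch_sketch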
First Touch Data Placement: Method of Initialization
• Arrays are initialized with the same loop structure and thread assignment as in the computation:

do lev= 1, LEVELtot
  do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i,j,isL,ieL,isU,ieU)
    do ip= 1, PEsmpTOT
      do i= STACKmc(ip,ic-1,lev)+1, STACKmc(ip,ic,lev)
        RHS(i)= 0.d0; X(i)= 0.d0; D(i)= 0.d0
        isL= indexL(i-1)+1
        ieL= indexL(i)
        do j= isL, ieL
          itemL(j)= 0; AL(j)= 0.d0
        enddo
        isU= indexU(i-1)+1
        ieU= indexU(i)
        do j= isU, ieU
          itemU(j)= 0; AU(j)= 0.d0
        enddo
      enddo
    enddo
!$omp end parallel do
  enddo
enddo
Further Re-Ordering for Continuous Memory Access: Sequential
Example: 5 colors, 8 threads

[Figure: initial vector, coloring into 5 colors, and re-ordering of the colored vector]

• Coalesced (original): data are stored color by color; within each color the elements of the 8 threads are interleaved
  thread IDs in memory: 12345678 12345678 12345678 12345678 12345678
• Sequential: data are re-ordered so that the elements handled by each thread are contiguous in memory
  thread IDs in memory: 11111 22222 33333 44444 55555 66666 77777 88888
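A hedged sketch of how the sequential ordering is exploited in the solver loops, reusing the array and variable names from the initialization code above (the loop body itself is only illustrative): with the re-ordering, each thread ip sweeps a contiguous index range for every color, which gives continuous memory access.

do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i)
  do ip= 1, PEsmpTOT
!C  with the sequential ordering, thread ip owns the contiguous range
!C  STACKmc(ip,ic-1,lev)+1 ... STACKmc(ip,ic,lev) of color ic
    do i= STACKmc(ip,ic-1,lev)+1, STACKmc(ip,ic,lev)
      X(i)= D(i)*RHS(i)     ! illustrative operation only
    enddo
  enddo
!$omp end parallel do
enddo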
Effect of First Touch + Sequential Data Access
16,777,216 (= 64×64³) cells, 64 cores, CM-RCM(2)
• Time for the linear solvers; HB 4x4 is the fastest.

[Figure: elapsed time (sec.) of the linear solvers for Flat MPI, HB 4x4, HB 8x2, and HB 16x1 with "Initial", "NUMA control", and "Full Optimization" settings; lower is better]
Strong Scaling
• 512x256x256 = 33,554,432 cells
• Up to 1,024 cores (64 nodes)
• CM-RCM(2)
Strong Scaling (T2K)
512x256x256 = 33,554,432 cells; parallel performance is based on the performance of Flat MPI with 16 cores.

[Figure: left panel shows the number of iterations (60-90) vs. core count (10-1000, log scale) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1; right panel shows parallel performance (%) (0-120) vs. core count (16-1024)]
OpenMP Overhead on SEND/RECV
• With more nodes/cores:
  – smaller number of vertices per thread
  – more OpenMP overhead
• More significant for the higher (coarser) levels of the multigrid operations.

!C
!C-- SEND
do neib= 1, NEIBPETOT
  istart= levEXPORT_index(lev-1,neib) + 1
  iend  = levEXPORT_index(lev  ,neib)
  inum  = iend - istart + 1
!$omp parallel do private(ii)
  do k= istart, iend
    ii= 3*EXPORT_ITEM(k)
    WS(3*k-2)= X(ii-2)
    WS(3*k-1)= X(ii-1)
    WS(3*k  )= X(ii  )
  enddo
!$omp end parallel do

  call MPI_ISEND (WS(3*istart-2), 3*inum, MPI_DOUBLE_PRECISION,    &
 &                NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
enddo
OpenMP Overhead on SEND/RECV
• Optimization: serial computation (no OpenMP) for the memory copy in SEND/RECV
  – OpenMP is used only for the finest mesh

!C
!C-- SEND
do neib= 1, NEIBPETOT
  istart= levEXPORT_index(lev-1,neib) + 1
  iend  = levEXPORT_index(lev  ,neib)
  inum  = iend - istart + 1

  do k= istart, iend
    ii= 3*EXPORT_ITEM(k)
    WS(3*k-2)= X(ii-2)
    WS(3*k-1)= X(ii-1)
    WS(3*k  )= X(ii  )
  enddo

  call MPI_ISEND (WS(3*istart-2), 3*inum, MPI_DOUBLE_PRECISION,    &
 &                NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
enddo
Effect of Optimization at 1,024 cores
• If LEVEL (the multigrid level) ≥ LEVcri, the serial memory copy (without OpenMP) is applied.
• LEVcri=1 means no OpenMP at all.
• LEVcri=2 (OpenMP only at the finest level) is the best.
(A sketch of this switch is shown after the figure.)

[Figure: elapsed time (sec., 0-6) split into computation and communication for HB 4x4 (left) and HB 16x1 (right), comparing the original code with LEVcri = 1 through 6]
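A hedged sketch of the LEVcri switch described above, reusing the SEND kernel and names from the previous slides (lev and LEVcri come from the slide text; the exact control flow is an assumption): the thread-parallel copy is applied only on levels finer than LEVcri, otherwise the serial copy is used.

!C
!C-- SEND: gather boundary values into WS, with or without OpenMP
do neib= 1, NEIBPETOT
  istart= levEXPORT_index(lev-1,neib) + 1
  iend  = levEXPORT_index(lev  ,neib)
  inum  = iend - istart + 1

  if (lev.lt.LEVcri) then
!C  fine levels: enough work per thread, use OpenMP
!$omp parallel do private(ii)
    do k= istart, iend
      ii= 3*EXPORT_ITEM(k)
      WS(3*k-2)= X(ii-2); WS(3*k-1)= X(ii-1); WS(3*k)= X(ii)
    enddo
!$omp end parallel do
  else
!C  coarse levels (LEVEL >= LEVcri): serial copy avoids OpenMP overhead
    do k= istart, iend
      ii= 3*EXPORT_ITEM(k)
      WS(3*k-2)= X(ii-2); WS(3*k-1)= X(ii-1); WS(3*k)= X(ii)
    enddo
  endif

  call MPI_ISEND (WS(3*istart-2), 3*inum, MPI_DOUBLE_PRECISION,    &
 &                NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
enddo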
Memory Copy Processes in Communication: Optimized
512x256x256 = 33,554,432 cells; parallel performance is based on the performance of Flat MPI with 16 cores.

[Figure: parallel performance (%) vs. core count (16-1024) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1, before (left) and after (right) the optimization of the memory copy]
Strong Scaling: Parallel Performance
512x256x256 = 33,554,432 cells; parallel performance is based on the performance of Flat MPI with 16 cores.
• The improved coarse-grid smoother does not work well here, because its cost per iteration is larger for the many-core cases.

[Figure: parallel performance (%) vs. core count (16-1024) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1; higher is better]