Parallel Multigrid Solvers using OpenMP/MPI Hybrid Programming Models on Multi-Core/Multi-Socket Clusters

Kengo Nakajima
Information Technology Center, The University of Tokyo
Japan Science & Technology Agency (JST)

9th International Meeting on High Performance Computing for Computational Science (VECPAR 2010)
June 22-25, 2010, Berkeley, California, USA


T2K/Tokyo (1/2)

• "T2K Open Supercomputer Alliance"
  – http://www.open-supercomputer.org/
  – Tsukuba, Tokyo, Kyoto
• "T2K Open Supercomputer (Todai Combined Cluster)"
  – built by Hitachi, operation started June 2008
  – 952 nodes (15,232 cores) in total, 141 TFLOPS peak
  – Quad-core Opteron (Barcelona)
  – 53rd in the TOP500 list (June 2010)
  – Fat-tree network with Myrinet-10G


T2K/Tokyo (2/2)

• AMD Quad-core Opteron (Barcelona), 2.3 GHz
• 4 sockets per node
  – 16 cores/node
• Multi-core, multi-socket system
• cc-NUMA architecture
  – careful configuration needed
  – local data should be placed on local memory

[Figure: node diagram — four sockets, each with four cores, per-core L1/L2 caches, a shared L3 cache, and a local memory attached to each socket]


Flat MPI, Hybrid (4x4, 8x2, 16x1)

• "HB AxB" denotes A OpenMP threads per MPI process and B MPI processes per node (A x B = 16 cores/node); Flat MPI runs 16 single-threaded MPI processes per node.
• Higher performance of HB 16x1 is important.

[Figure: mapping of MPI processes and OpenMP threads onto the four sockets (0-3) of a node for Flat MPI, Hybrid 4x4, Hybrid 8x2, and Hybrid 16x1]
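As a minimal illustration of how these configurations differ at run time (this program is not from the original slides; its name and output format are illustrative), each MPI process below reports its rank and OpenMP thread count: under Flat MPI one would see 16 ranks per node with 1 thread each, under HB 4x4 four ranks per node with 4 threads each, and so on.

program hybrid_check
  use mpi
  !$ use omp_lib
  implicit none
  integer :: ierr, provided, myrank, nprocs, nthreads

  ! MPI_THREAD_FUNNELED: only the master thread calls MPI,
  ! which matches the hybrid usage in these solvers
  call MPI_Init_thread (MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank (MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size (MPI_COMM_WORLD, nprocs, ierr)

  nthreads= 1
  !$ nthreads= omp_get_max_threads()

  ! Flat MPI: 16 processes/node, 1 thread each
  ! HB 4x4  :  4 processes/node, 4 threads each, etc.
  write (*,'(a,i6,a,i6,a,i4,a)') 'rank ', myrank, ' of ', nprocs,  &
 &      ': ', nthreads, ' OpenMP thread(s)'

  call MPI_Finalize (ierr)
end program hybrid_check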


Domain Decomposition

• Inter-domain: MPI (block Jacobi)
• Intra-domain: OpenMP threads (with re-ordering)
• Example: 6 nodes, 24 sockets, 96 cores

[Figure: decomposition of the same mesh for Flat MPI (one domain per core), HB 4x4 (one domain per socket), and HB 16x1 (one domain per node)]


First Touch Data Placement

• Local data should reside in local memory.
• The most common NUMA page-placement algorithm is "first touch": the page holding a region of memory is assigned to the PE that first references that region.
• A very common technique in OpenMP programs is therefore to initialize data in parallel, using the same loop schedule that will be used later in the computations.

[Figure: same four-socket cc-NUMA node diagram as before — data first touched by a core is placed in that socket's local memory]
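As a minimal sketch of this idea (the solver's own initialization loop is on the next slide; array names and sizes here are illustrative), the arrays are first touched inside an OpenMP loop with exactly the same static schedule as the later compute loop, so each thread's portion of the data is paged into its own socket's memory:

program first_touch_sketch
  implicit none
  integer, parameter :: N= 10000000
  real(8), allocatable :: X(:), RHS(:), D(:)
  integer :: i

  allocate (X(N), RHS(N), D(N))

!C-- first touch: initialize in parallel with the SAME static schedule
!C   as the compute loop; allocation alone does not touch the pages
!$omp parallel do schedule(static)
  do i= 1, N
    X(i)= 0.d0; RHS(i)= 1.d0; D(i)= 2.d0
  enddo
!$omp end parallel do

!C-- compute loop: identical bounds and schedule, so each thread
!C   accesses the data it first touched, i.e. its local memory
!$omp parallel do schedule(static)
  do i= 1, N
    X(i)= X(i) + RHS(i)/D(i)
  enddo
!$omp end parallel do

  write (*,*) 'X(1)=', X(1)
end program first_touch_sketch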


First Touch Data Placement: Method of Initialization

Arrays are initialized in parallel, with the same loop structure (level, color, thread block) as the computation:

      do lev= 1, LEVELtot
        do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i,j,isL,ieL,isU,ieU)
          do ip= 1, PEsmpTOT
            do i= STACKmc(ip,ic-1,lev)+1, STACKmc(ip,ic,lev)
              RHS(i)= 0.d0; X(i)= 0.d0; D(i)= 0.d0
              isL= indexL(i-1)+1
              ieL= indexL(i)
              do j= isL, ieL
                itemL(j)= 0; AL(j)= 0.d0
              enddo
              isU= indexU(i-1)+1
              ieU= indexU(i)
              do j= isU, ieU
                itemU(j)= 0; AU(j)= 0.d0
              enddo
            enddo
          enddo
!$omp end parallel do
        enddo
      enddo


Further Re-Ordering for Continuous Memory Access: "Sequential"
(example: 5 colors, 8 threads)

• Initial vector → coloring (5 colors) + ordering.
• Coalesced (original) numbering: within each color block the entries are assigned to the 8 threads cyclically (1 2 3 4 5 6 7 8 | 1 2 3 4 5 6 7 8 | ... for the 5 colors), so each thread touches strided, scattered addresses.
• Sequential numbering: the vector is renumbered so that each thread's entries for all colors form one contiguous block (1 1 1 1 1 | 2 2 2 2 2 | ... | 8 8 8 8 8), giving continuous memory access per thread.

[Figure: initial vector, colored/ordered vector, and the coalesced vs. sequential thread assignments]
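A minimal sketch of the two numberings, under the assumptions of the example above (5 colors, 8 threads, one entry per thread and color); the program and variable names are illustrative, not those of the solver:

program reordering_sketch
  implicit none
  integer, parameter :: NCOLOR= 5, NTHRD= 8
  integer :: ic, ip, pos_coalesced, pos_sequential

  ! entry owned by thread ip in color block ic
  do ic= 1, NCOLOR
    do ip= 1, NTHRD
      ! coalesced (original): color blocks are outermost, so thread
      ! ip's entries are strided NTHRD apart
      pos_coalesced = (ic-1)*NTHRD + ip
      ! sequential: thread blocks are outermost, so each thread's
      ! entries for all colors are contiguous in memory
      pos_sequential= (ip-1)*NCOLOR + ic
      write (*,'(a,i2,a,i2,a,i3,a,i3)') 'color', ic, ' thread', ip,  &
 &        ' : coalesced pos=', pos_coalesced,                        &
 &        '   sequential pos=', pos_sequential
    enddo
  enddo
end program reordering_sketch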


Effect of First Touch + Sequential Data Access
16,777,216 (= 64x64^3) cells, 64 cores, CM-RCM(2)

• Time for linear solvers; HB 4x4 is the fastest.

[Figure: bar chart of solver time (sec., 0-100, lower is better) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1, each with "Initial", "NUMA control", and "Full Optimization" configurations]


Strong Scaling

• 512x256x256 = 33,554,432 cells
• Up to 1,024 cores (64 nodes)
• CM-RCM(2)


Strong Scaling (T2K)
512x256x256 = 33,554,432 cells; parallel performance is relative to Flat MPI with 16 cores

[Figure: left panel — number of iterations (60-90) vs. core count for Flat MPI, HB 4x4, HB 8x2, and HB 16x1; right panel — parallel performance (%) (0-120) vs. core count (16, 32, 64, 128, 256, 512, 1,024)]


OpenMP Overhead on SEND/RECV

• More nodes/cores means:
  – fewer vertices per thread
  – larger relative OpenMP overhead
  – more significant at the higher (coarser) levels of the multigrid hierarchy
• In the original code, the memory copy into the send buffer is an OpenMP parallel loop at every level:

!C
!C-- SEND
      do neib= 1, NEIBPETOT
        istart= levEXPORT_index(lev-1,neib) + 1
        iend  = levEXPORT_index(lev  ,neib)
        inum  = iend - istart + 1
!$omp parallel do private (ii)
        do k= istart, iend
          ii= 3*EXPORT_ITEM(k)
          WS(3*k-2)= X(ii-2)
          WS(3*k-1)= X(ii-1)
          WS(3*k  )= X(ii  )
        enddo
!$omp end parallel do

        call MPI_ISEND (WS(3*istart-2), 3*inum, MPI_DOUBLE_PRECISION, &
     &       NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
      enddo


OpenMP Overhead on SEND/RECV: Optimization

• Serial computation for the memory copy in SEND/RECV on coarser levels.
• OpenMP is applied only for the finest mesh:

!C
!C-- SEND
      do neib= 1, NEIBPETOT
        istart= levEXPORT_index(lev-1,neib) + 1
        iend  = levEXPORT_index(lev  ,neib)
        inum  = iend - istart + 1
        do k= istart, iend
          ii= 3*EXPORT_ITEM(k)
          WS(3*k-2)= X(ii-2)
          WS(3*k-1)= X(ii-1)
          WS(3*k  )= X(ii  )
        enddo

        call MPI_ISEND (WS(3*istart-2), 3*inum, MPI_DOUBLE_PRECISION, &
     &       NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
      enddo


Effect of Optimization at 1,024 cores

• If LEVEL (the multigrid level) >= LEVcri, the serial memory copy (without OpenMP) is applied.
• LEVcri=1 means no OpenMP at all in the memory copy.
• LEVcri=2 (OpenMP only at the finest level) is the best.
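One compact way to express this switch (a sketch, not necessarily how the original code implements it) is OpenMP's if clause, which runs the region serially whenever the current level reaches LEVcri; the loop body is the same buffer copy shown on the previous slides:

!C-- buffer copy in SEND: parallel only on fine levels (lev < LEVcri);
!C   otherwise the "if" clause makes the region execute serially
!$omp parallel do private (ii) if (lev < LEVcri)
      do k= istart, iend
        ii= 3*EXPORT_ITEM(k)
        WS(3*k-2)= X(ii-2)
        WS(3*k-1)= X(ii-1)
        WS(3*k  )= X(ii  )
      enddo
!$omp end parallel do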

[Figure: stacked bars of computation and communication time (sec., 0-6) for HB 4x4 (left) and HB 16x1 (right), comparing the original code with LEVcri = 1 through 6]


Memory Copy in Communication: Before/After Optimization
512x256x256 = 33,554,432 cells; parallel performance is relative to Flat MPI with 16 cores

[Figure: parallel performance (%) (0-120) vs. core count (16-1,024) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1, before (left) and after (right) the memory-copy optimization]


Strong Scaling: Parallel Performance
512x256x256 = 33,554,432 cells; parallel performance is relative to Flat MPI with 16 cores

• The improved coarse-grid smoother does not work well here, because its cost per iteration becomes larger in the many-core cases.

[Figure: parallel performance (%) (0-120, higher is better) vs. core count (16, 32, 64, 128, 256, 512, 1,024) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1]