Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT
Paul Hargrove, Dan Bonachea, Michael Welcome, Katherine Yelick

UPC Review. July 22, 2009.

GASNet at UC Berkeley / LBNL

What is GASNet?
• GASNet is:
- A high-performance, one-sided communication layer
- A portable abstraction layer for the network
- Can run over portable network interfaces (MPI, UDP)
- Native ports to a wide variety of low-level network APIs

- Designed as a compilation target for PGAS languages
- UPC, Co-Array Fortran, Titanium, Chapel, ...
- On the Cray XT, GASNet is targeted by 7 separate parallel compiler efforts and counting:
– 3 UPC: Berkeley UPC, GCC UPC, Cray XT UPC
– 2 CAF: Rice CAF, Cray XT CAF
– Berkeley Titanium, Cray Chapel
– Numerous prototyping efforts

PGAS Compiler System Stack

[Figure: layered stack, top to bottom — PGAS Code (UPC, Titanium, CAF, etc.) → PGAS Compiler → Compiler-generated code (C, asm) → Language Runtime system → GASNet Communication System → Network Hardware. Side labels: platform-independent, network-independent, compiler-independent, language-independent.]

GASNet Design Overview: System Architecture
• Two-level architecture is the mechanism for portability

[Figure: Compiler-generated code → Compiler-specific runtime system → GASNet Extended API → GASNet Core API → Network Hardware]

• GASNet Core API
- Most basic required primitives; narrow and general
- Implemented directly on each network
- Based on the Active Messages lightweight RPC paradigm

• GASNet Extended API
– Wider interface that includes higher-level operations
– Puts and gets w/ flexible sync, split-phase barriers, collective operations, etc.
– A reference implementation of the Extended API exists in terms of the Core API
– Selected subsets of the interface are implemented directly for performance
– Leverages hardware support for higher-level operations


GASNet Design Progression on XT
• Pure MPI: mpi-conduit
- Fully portable implementation of GASNet over MPI-1
- "Runs everywhere, optimally nowhere"
• Portals/MPI Hybrid
- Replaced Extended API (put/get) with Portals calls
- Zero-copy RDMA transfers using SeaStar support
• Pure Portals: portals-conduit
- Native Core API (AM) implementation over Portals
- Eliminated reliance on MPI
• Firehose integration
- Reduces memory registration overheads

Portals Message Processing

[Figure: Portals message processing on the NIC — an incoming message indexes the Portal Table, traverses a Match List of Match Entries (MEs), and lands via a Memory Descriptor (MD) in an Application Memory Region, optionally posting to an Event Queue (EQ).]

- The lowest-level software interface to the XT network is Portals
- All data movement is via Put/Get between pre-registered memory regions
- Provides sophisticated receive-side processing of all incoming messages
- Designed to allow NIC offload of MPI message matching
- Provides (more than) sufficient generality for our purposes

GASNet Put in Portals-conduit

[Figure: Node 0 sources region A of its GASNet segment via the RARSRC MD (SAFE EQ). The Put targets the RAR PTE on Node 1, where the RAR ME in the Match List steers it into region B of Node 1's GASNet segment via the RAR MD (no EQ). A SEND_END event on Node 0's SAFE EQ signals local completion; the ACK signals remote completion.]

Node 0's gasnet_put of A to B becomes:
PortalsPut(RARSRC, offset(A), RARME | op_id, offset(B))

The operation identifier is smuggled through ignored match bits.

GASNet Get in Portals-conduit

[Figure: Node 0's Get targets the RAR PTE on Node 1, where the RAR ME in the Match List steers it to read region B of Node 1's GASNet segment via the RAR MD (no EQ). The data lands in Node 0's buffer C through a TMPMD, a dynamically-created MD for a large out-of-segment local reference; a REPLY_END event on Node 0's SAFE EQ signals Get completion.]

Node 0's gasnet_get of B to C becomes:
PortalsGet(TMPMD, 0, RARME | op_id, offset(B))

GASNet AM Request in Portals-conduit

[Figure: Node 0 builds the AM Request in one of its AM Request Send Buffers (ReqSB MD, SAFE EQ). The Put targets the AM PTE on Node 1, where the Req ME in the Match List steers it into one of three AM Request Recv Buffers (ReqRB MDs, triple buffered) inside Node 1's GASNet segment. A PUT_END event on the AM EQ triggers execution of the AM Request handler.]

Node 0's gasnet_AMRequestMedium becomes:
PortalsPut(ReqSB_MD, offset(sendbuffer), Req_ME | op_id, 0)

The ReqRB has a locally-managed offset, so the destination offset in the Put is 0.

GASNet AM Reply in Portals-conduit

[Figure: Node 1 builds the AM Reply in one of its AM Reply Send Buffers (RplSB MD, SAFE EQ). The Put targets the AM PTE on Node 0, where the Rpl ME in the Match List steers it back into the matching AM Request Send Buffer (ReqSB MD, SAFE EQ). A PUT_END event triggers execution of the AM Reply handler.]

Node 1's gasnet_AMReplyMedium becomes:
PortalsPut(RplSB_MD, offset(sendbuffer), Rpl_ME | op_id, request_offset)

Portals-conduit Data Structures

MD     | PTE  | Match Bits | Ops Allowed | Offset Mgt. | Event Queue | Description
RAR    | RAR  | 0x0        | PUT/GET     | REMOTE      | NONE        | Remote segment: dst of Put, src of Get
RARAM  | RAR  | 0x1        | PUT         | REMOTE      | AM_EQ       | Remote segment: dst of RequestLong payload
RARSRC | RAR  | 0x2        | PUT         | REMOTE      | SAFE_EQ     | Remote segment: dst of ReplyLong payload; local segment: src of Put/Long payload, dst of Get
ReqRB  | AM   | 0x3        | PUT         | LOCAL       | AM_EQ       | Dest of AM Request Header (triple buffered)
ReqSB  | AM   | 0x4        | PUT         | REMOTE      | SAFE_EQ     | Bounce buffers for out-of-segment Put/Long/Get; AM Request Header src, AM Reply Header dst
RplSB  | none | none       | N/A         | N/A         | SAFE_EQ     | Src of AM Reply Header
TMPMD  | none | none       | N/A         | N/A         | SAFE_EQ     | Large out-of-segment local addressing: src of Put/AM Long payload, dest of Get

• RAR PTE: covers the GASNet segment with 3 MDs with different EQs
• AM PTE: Active Message buffers
- 3 MDs: Request Send/Reply Recv, Request Recv, and Reply Send
- EQ separation for deadlock-free AM
• TMPMDs created dynamically for transfers with an out-of-segment local side

Performance: Small Put Latency

[Figure: Latency of Blocking Put (µs) vs. Payload Size (bytes), 1 B to 1 KiB; down is good. Series: mpi-conduit Put, MPI Ping-Ack, portals-conduit Put.]

• All performance results taken on Franklin, a quad-core XT4 at NERSC
• Portals-conduit outperforms GASNet-over-MPI by about 2x
- Avoids the semantically-induced costs of implementing put/get over message passing
- Leverages the Portals-level acknowledgement for remote completion
• Outperforms a raw MPI ping/pong by eliminating software overheads

Performance: Large Put Bandwidth

[Figure: Bandwidth of Non-Blocking Put (MB/s) vs. Payload Size (bytes), 2 KiB to 2 MiB; up is good. Series: portals-conduit Put, OSU MPI BW test, mpi-conduit Put.]

• Portals-conduit exposes the full zero-copy RDMA bandwidth of the SeaStar
- Meets or exceeds the achievable bandwidth of a raw MPI flood test
- Mpi-conduit bandwidth suffers due to a 2-copy path for the payload

Portals-conduit Flow Control
• Most significant challenge in the AM implementation
- Prevent overflowing recv buffers at the target
- Prevent overflowing EQ space at either end
• Local-side resources managed using send tokens
- Request injection acquires EQ and buffer space for the send and the Reply recv
- Still need to prevent overflows at the remote (target) end
• Initial approach: statically partition recv resources between peers
- Reserve worst-case space at the target for each sender to get full B/W
- Initiator-managed, per-target credit system
- Requests consume credits (based on payload size), Replies return them
- Downside: non-scalable buffer memory utilization
• Final approach: dynamic credit redistribution
- Reserve space for each receiver to get full B/W
- Each peer starts with minimal credits, the rest banked at the target
- Target loans additional credits to "chatty" peers

Performance: Active Message Latency

[Figure: AM Medium Round-trip Latency (µs) vs. Payload Size (bytes), 1 B to 1 KiB; down is good. Series: mpi-conduit, portals-conduit.]

• Shows the benefit of implementing AM natively
• Portals-conduit AMs outperform mpi-conduit
- Less per-message metadata, a big advantage under 1 packet
- Beyond one packet, less software overhead without MPI

Performance: Out-of-segment Put Bandwidth (Firehose)

[Figure: Bandwidth of Blocking Put (MB/s) vs. Payload Size (bytes), 2 KiB to 2 MiB; up is good. Series: portals-conduit w/Firehose, portals-conduit w/TMPMD.]

• Blocking put test (no overlap) exaggerates software overheads
• TMPMD pays a synchronous MD create/destroy on every transfer
- Incurs a pinning cost linear in the page count (on CNL)
• Firehose exploits spatial/temporal locality to reuse local MDs
- LRU algorithm with region coalescing quickly discovers the working set
- Provides 4% to 8% bandwidth improvement

Conclusions
• Portals-conduit delivers good GASNet performance on the Cray XT
- Outperforms generic GASNet-over-MPI by about 2x
- Microbenchmark performance competitive with raw MPI
- A solid communication foundation for many PGAS compilers
• Future Work
- Expand Firehose integration to include remote memory
• Funding / Machine acknowledgements:
- Office of Science, DOE Contract DE-AC02-05CH11231
- NERSC, DOE Contract DE-AC02-05CH11231
- ORNL, DOE Contract DE-AC05-00OR22725
- NSF TeraGrid & PSC

http://gasnet.cs.berkeley.edu
http://upc.lbl.gov
