Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT. Paul Hargrove, Dan Bonachea, Michael Welcome, Katherine Yelick
UPC Review. July 22, 2009.
GASNet at UC Berkeley / LBNL
What is GASNet?
• GASNet is:
- A high-performance, one-sided communication layer
- A portable abstraction layer for the network
- Can run over portable network interfaces (MPI, UDP)
- Native ports to a wide variety of low-level network APIs
- Designed as a compilation target for PGAS languages
- UPC, Co-Array Fortran, Titanium, Chapel, ...
• On the Cray XT, GASNet is targeted by 7 separate parallel compiler efforts and counting:
- 3 UPC: Berkeley UPC, GCC UPC, Cray XT UPC
- 2 CAF: Rice CAF, Cray XT CAF
- Berkeley Titanium, Cray Chapel
- Numerous prototyping efforts
PGAS Compiler System Stack
[Figure: layered stack - PGAS code (UPC, Titanium, CAF, etc.) feeds the PGAS compiler, which emits compiler-generated code (C, asm) running over the language runtime system, the GASNet communication system, and the network hardware. The layers above GASNet are platform- and network-independent; GASNet itself is compiler- and language-independent.]
GASNet Design Overview: System Architecture
• Two-level architecture is the mechanism for portability
[Figure: client stack - compiler-generated code atop a compiler-specific runtime system, over the GASNet Extended API, GASNet Core API, and network hardware]
• GASNet Core API
- Most basic required primitives; narrow and general
- Implemented directly on each network
- Based on the Active Messages lightweight RPC paradigm
• GASNet Extended API
- Wider interface that includes higher-level operations
- Puts and gets with flexible synchronization, split-phase barriers, collective operations, etc.
- Reference implementation of the Extended API exists in terms of the Core API
- Directly implement a selected subset of the interface for performance
- Leverage hardware support for higher-level operations
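To make the two-level split concrete, here is a sketch (in Python, purely as a model; `Node`, `gasnet_put_over_am`, and the handler names are illustrative, not GASNet identifiers) of how the reference Extended API can express a put entirely in terms of the Core API's Active Messages: a request carries the payload to the target, whose handler copies it into the segment and replies to signal remote completion.

```python
# Model of the reference Extended API: put implemented over core AM.
class Node:
    def __init__(self, size):
        self.segment = bytearray(size)   # the node's GASNet segment
        self.pending = {}                # op_id -> completed?

NODES = [Node(1024), Node(1024)]

def put_handler(src_node, dst_node, op_id, dst_off, payload):
    # AM request handler on the target: copy payload into the segment...
    NODES[dst_node].segment[dst_off:dst_off + len(payload)] = payload
    # ...then send the AM reply that signals remote completion.
    ack_handler(src_node, op_id)

def ack_handler(src_node, op_id):
    NODES[src_node].pending[op_id] = True   # reply handler: mark op complete

def gasnet_put_over_am(src, dst, dst_off, payload, op_id):
    NODES[src].pending[op_id] = False
    put_handler(src, dst, op_id, dst_off, payload)   # "send" the AM request

gasnet_put_over_am(0, 1, 8, b"hello", op_id=7)
assert bytes(NODES[1].segment[8:13]) == b"hello"   # data landed remotely
assert NODES[0].pending[7]                         # initiator saw completion
```

A conduit that can do better (as portals-conduit does) overrides this path with a direct RDMA implementation; the reference version above is the portable fallback.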
GASNet Design Progression on the XT
• Pure MPI: mpi-conduit
- Fully portable implementation of GASNet over MPI-1
- “Runs everywhere, optimally nowhere”
• Portals/MPI hybrid
- Replaced the Extended API (put/get) with Portals calls
- Zero-copy RDMA transfers using SeaStar support
• Pure Portals: portals-conduit
- Native Core API (AM) implementation over Portals
- Eliminated the reliance on MPI
• Firehose integration
- Reduces memory registration overheads
Portals Message Processing
[Figure: an incoming message arriving at the NIC indexes the Portal Table, traverses an optional Match List of Match Entries (MEs), and lands in the Memory Descriptor (MD) covering an application memory region; completion events may be posted to an Event Queue (EQ).]
• The lowest-level software interface to the XT network is Portals
- All data movement is via Put/Get between pre-registered memory regions
- Provides sophisticated receive-side processing of all incoming messages
- Designed to allow NIC offload of MPI message matching
- Provides (more than) sufficient generality for our purposes
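The receive-side processing can be modeled in a few lines (illustrative Python, not the Portals C API; the dict-based ME records are an assumption for clarity): a message selects a Portal Table entry, walks that entry's match list, and is deposited into the first MD whose match bits agree under the ME's ignore mask.

```python
# Model of Portals receive-side matching: Portal Table -> Match List -> MD.
def deliver(portal_table, pt_index, match_bits, offset, payload):
    for me in portal_table[pt_index]:            # walk the match list in order
        if (match_bits & ~me["ignore"]) == (me["bits"] & ~me["ignore"]):
            md = me["md"]                        # matching entry -> its MD
            md[offset:offset + len(payload)] = payload
            return me
    return None                                  # no match: message dropped

segment = bytearray(64)
table = {2: [{"bits": 0x1, "ignore": 0x0, "md": bytearray(16)},
             {"bits": 0x0, "ignore": 0xFF, "md": segment}]}  # wildcard ME

me_hit = deliver(table, pt_index=2, match_bits=0x42, offset=4, payload=b"data")
assert bytes(segment[4:8]) == b"data"            # landed via the wildcard ME
```

The ignore mask is the key generality: it lets an ME match a whole family of match-bit values, which the conduit exploits below.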
GASNet Put in Portals-conduit
[Figure: Node 0's GASNet segment holds the source A, covered by the RARSRC MD; its SAFE EQ delivers a SEND_END event for local completion and an ACK event for remote completion. Node 1's Portal Table RAR entry (RAR PTE) leads through the RAR ME to the RAR MD (no EQ) covering the GASNet segment that holds the destination B.]
Node 0’s gasnet_put of A to B becomes:
PortalsPut(RARSRC, offset(A), RARME | op_id, offset(B))
• The operation identifier is smuggled through the ignored match bits
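The "smuggling" works because the RAR ME only matches on the low bits that select it; the upper match bits fall under the ignore mask, so the initiator can carry an operation identifier through them for free and recover it from the completion event. A sketch of that packing (the field width and `RAR_ME` value are illustrative, not the conduit's actual layout):

```python
# Packing an op_id into the ignored portion of the 64-bit match bits.
ME_SELECT_BITS = 8                       # low bits: which ME to match (assumed width)
RAR_ME = 0x00                            # illustrative match value for the RAR ME

def pack_match_bits(me_bits, op_id):
    return (op_id << ME_SELECT_BITS) | me_bits

def unpack_op_id(match_bits):
    return match_bits >> ME_SELECT_BITS  # recovered from the event at the initiator

bits = pack_match_bits(RAR_ME, op_id=0x1234)
assert bits & ((1 << ME_SELECT_BITS) - 1) == RAR_ME   # target still matches RAR
assert unpack_op_id(bits) == 0x1234                    # op_id rides along free
```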
GASNet Get in Portals-conduit
[Figure: Node 0's memory holds the destination C, covered by a TMPMD; its SAFE EQ delivers a REPLY_END event signalling Get completion. Node 1's Portal Table RAR entry (RAR PTE) leads through the RAR ME to the RAR MD (no EQ) covering the GASNet segment that holds the source B.]
Node 0’s gasnet_get of B to C becomes:
PortalsGet(TMPMD, 0, RARME | op_id, offset(B))
• TMPMD is a dynamically-created MD for a large out-of-segment reference
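The choice of local-side MD follows a simple rule: if the local address lies inside the pre-registered GASNet segment, reuse the persistent segment MD (RARSRC); otherwise create a temporary MD for the duration of the transfer. A sketch of that decision (the segment bounds and tuple return are illustrative assumptions):

```python
# Choosing the local-side MD for a put/get: segment MD vs. temporary MD.
SEG_BASE, SEG_LEN = 0x10000, 0x4000      # illustrative GASNet segment bounds

def local_md_for(addr, nbytes):
    if SEG_BASE <= addr and addr + nbytes <= SEG_BASE + SEG_LEN:
        return ("RARSRC", addr - SEG_BASE)     # in-segment: reuse MD + offset
    return ("TMPMD", 0)                        # out-of-segment: dynamic MD

assert local_md_for(0x10100, 512) == ("RARSRC", 0x100)
assert local_md_for(0x90000, 512)[0] == "TMPMD"
```

The TMPMD path pays a registration cost per transfer, which is what the Firehose work later in the talk targets.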
GASNet AM Request in Portals-conduit
[Figure: Node 0's GASNet segment includes the AM Request send buffers (ReqSB MD) with a SAFE EQ. Node 1's Portal Table AM entry leads through the Req ME to triple-buffered AM Request receive buffers (ReqRB MDs) in its GASNet segment; a PUT_END event on the AM EQ causes the AM Request handler to execute.]
Node 0’s gasnet_AMRequestMedium becomes:
PortalsPut(ReqSB_MD, offset(sendbuffer), Req_ME | op_id, 0)
• The ReqRB MDs use a locally-managed offset
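A locally-managed offset means the *target* decides where each incoming request lands: the MD appends it at its current offset, so senders never coordinate placement, and when a buffer fills the conduit rotates to the next ReqRB (hence the triple buffering). A toy model of one such buffer (the class and sizes are illustrative):

```python
# Model of a locally-managed-offset receive buffer (one ReqRB).
class ReqRB:
    def __init__(self, size):
        self.buf, self.off = bytearray(size), 0
    def try_append(self, msg):
        if self.off + len(msg) > len(self.buf):
            return None                       # full: caller rotates to next ReqRB
        start = self.off
        self.buf[start:start + len(msg)] = msg
        self.off += len(msg)                  # locally-managed offset advances
        return start

rb = ReqRB(16)
assert rb.try_append(b"req-one!") == 0        # lands at offset 0
assert rb.try_append(b"req-two!") == 8        # appended after the first
assert rb.try_append(b"req-three") is None    # would overflow: rotate buffers
```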
GASNet AM Reply in Portals-conduit
[Figure: Node 1's GASNet segment includes the AM Reply send buffers (RplSB MD) with a SAFE EQ. Node 0's Portal Table AM entry leads through the Rpl ME back to the ReqSB MD, so the AM Request send buffers also receive the corresponding replies; a PUT_END event causes the AM Reply handler to execute.]
Node 1’s gasnet_AMReplyMedium becomes:
PortalsPut(RplSB_MD, offset(sendbuffer), Rpl_ME | op_id, request_offset)
Portals-conduit Data Structures

MD     | PTE  | Match Bits | Ops Allowed | Offset Mgt. | Event Queue | Description
-------|------|------------|-------------|-------------|-------------|------------
RAR    | RAR  | 0x0        | PUT/GET     | REMOTE      | NONE        | Remote segment: dst of Put, src of Get
RARAM  | RAR  | 0x1        | PUT         | REMOTE      | AM_EQ       | Remote segment: dst of RequestLong payload
RARSRC | RAR  | 0x2        | PUT         | REMOTE      | SAFE_EQ     | Remote segment: dst of ReplyLong payload; local segment: src of Put/Long payload, dst of Get
ReqRB  | AM   | 0x3        | PUT         | LOCAL       | AM_EQ       | Dest of AM Request header (double-buffered)
ReqSB  | AM   | 0x4        | PUT         | REMOTE      | SAFE_EQ     | Bounce buffers for out-of-segment Put/Long/Get; AM Request header src, AM Reply header dst
RplSB  | none | none       | N/A         | N/A         | SAFE_EQ     | Src of AM Reply header
TMPMD  | none | none       | N/A         | N/A         | SAFE_EQ     | Large out-of-segment local addressing: src of Put/AM Long payload, dst of Get
• RAR PTE: covers the GASNet segment with 3 MDs with different EQs
• AM PTE: Active Message buffers
- 3 MDs: Request Send/Reply Recv (ReqSB), Request Recv (ReqRB), and Reply Send (RplSB)
- EQ separation for deadlock-free AM
• TMPMDs are created dynamically for transfers with an out-of-segment local side
Performance: Small Put Latency
[Figure: latency of blocking Put (µs, down is good) vs. payload size (1 to 1024 bytes) for mpi-conduit Put, MPI ping-ack, and portals-conduit Put]
• All performance results taken on Franklin, a quad-core XT4 at NERSC
• Portals-conduit outperforms GASNet-over-MPI by about 2x
- Avoids the semantically-induced costs of implementing put/get over message passing
- Leverages the Portals-level acknowledgement for remote completion
• Outperforms a raw MPI ping/pong by eliminating software overheads
Performance: Large Put Bandwidth
[Figure: bandwidth of non-blocking Put (MB/s, up is good) vs. payload size (2K to 2M bytes) for portals-conduit Put, the OSU MPI BW test, and mpi-conduit Put]
• Portals-conduit exposes the full zero-copy RDMA bandwidth of the SeaStar
- Meets or exceeds the achievable bandwidth of a raw MPI flood test
- mpi-conduit bandwidth suffers due to a 2-copy transfer of the payload
Portals-conduit Flow Control
• Most significant challenge in the AM implementation
- Prevent overflowing recv buffers at the target
- Prevent overflowing EQ space at either end
• Local-side resources managed using send tokens
- Request injection acquires EQ and buffer space for the send and the Reply recv
- Still need to prevent overflows at the remote (target) end
• Initial approach: statically partition recv resources between peers
- Reserve worst-case space at the target for each sender to get full B/W
- Initiator-managed, per-target credit system
- Requests consume credits (based on payload size); Replies return them
- Downside: non-scalable buffer memory utilization
• Final approach: dynamic credit redistribution
- Reserve space for each receiver to get full B/W
- Each peer starts with minimal credits; the rest are banked at the target
- Target loans additional credits to “chatty” peers
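The initiator-managed credit scheme above can be sketched in a few lines (illustrative Python; the credit granularity, `Sender` class, and `loaned` parameter are assumptions, not the conduit's actual constants): a Request consumes credits proportional to its payload, a Reply returns them, and a target can attach a loan from its banked pool for chatty peers.

```python
# Model of initiator-managed, per-target flow-control credits.
CHUNK = 1024                                   # credit granularity (assumed)

class Sender:
    def __init__(self, credits):
        self.credits = credits
    def send_request(self, payload_sz):
        need = -(-payload_sz // CHUNK)         # ceil: credits for this payload
        if need > self.credits:
            return False                       # stall until a Reply returns credits
        self.credits -= need
        return True
    def on_reply(self, payload_sz, loaned=0):
        # Reply returns the request's credits, plus any loan from the target's bank.
        self.credits += -(-payload_sz // CHUNK) + loaned

s = Sender(credits=2)
assert s.send_request(1500)       # consumes 2 credits (ceil(1500/1024))
assert not s.send_request(100)    # out of credits: must wait
s.on_reply(1500, loaned=3)        # credits returned, plus a loan for a chatty peer
assert s.credits == 5
```

The dynamic-redistribution insight is visible in `loaned`: total buffer space at the target stays fixed, but its division among peers follows observed traffic instead of a worst-case static partition.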
Performance: Active Message Latency
[Figure: AM Medium round-trip latency (µs, down is good) vs. payload size (1 to 1024 bytes) for mpi-conduit and portals-conduit]
• Shows the benefit of implementing AM natively
• Portals-conduit AMs outperform mpi-conduit
- Less per-message metadata; a big advantage below one packet
- Beyond one packet, less software overhead without MPI
Performance: Out-of-segment Put Bandwidth (Firehose)
[Figure: bandwidth of blocking Put (MB/s, up is good) vs. payload size (2K to 2M bytes) for portals-conduit with Firehose and portals-conduit with TMPMD]
• Blocking put test (no overlap) exaggerates software overheads
• TMPMD pays a synchronous MD create/destroy on every transfer
- Incurs a pinning cost linear in the page count (on CNL)
• Firehose exploits spatial/temporal locality to reuse local MDs
- LRU algorithm with region coalescing quickly discovers the working set
- Provides a 4% to 8% bandwidth improvement
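The core of the Firehose local-side optimization can be modeled as an LRU table of pinned regions: repeated transfers against the same pages hit the table and skip the MD create/destroy cost. A toy model (page size, capacity, and the `FirehoseCache` class are illustrative; real Firehose also coalesces adjacent regions, which is omitted here):

```python
# Toy LRU cache of pinned regions, modeling Firehose reuse of local MDs.
from collections import OrderedDict

PAGE = 4096

class FirehoseCache:
    def __init__(self, capacity):
        self.cap, self.table, self.pin_calls = capacity, OrderedDict(), 0
    def acquire(self, addr):
        page = addr // PAGE
        if page in self.table:
            self.table.move_to_end(page)       # hit: refresh LRU position
            return
        if len(self.table) == self.cap:
            self.table.popitem(last=False)     # evict the least-recently-used region
        self.pin_calls += 1                    # miss: pay the pinning cost once
        self.table[page] = True

fh = FirehoseCache(capacity=2)
for addr in (0x0000, 0x2000, 0x0100, 0x2100):  # working set of 2 pages, reused
    fh.acquire(addr)
assert fh.pin_calls == 2                       # reuse avoids re-pinning
```

Once the working set fits, every transfer hits the table, which is why the bandwidth gap over TMPMD appears at all sizes rather than only for small payloads.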
Conclusions
• Portals-conduit delivers good GASNet performance on the Cray XT
- Outperforms generic GASNet-over-MPI by about 2x
- Microbenchmark performance competitive with raw MPI
- Solid communication foundation for many PGAS compilers
• Future work
- Expand Firehose integration to include remote memory
• Funding / machine acknowledgements:
- Office of Science, DOE Contract DE-AC02-05CH11231
- NERSC, DOE Contract DE-AC02-05CH11231
- ORNL, DOE Contract DE-AC05-00OR22725
- NSF TeraGrid & PSC
http://gasnet.cs.berkeley.edu
http://upc.lbl.gov