GHz Asynchronous SRAM in 65nm
Jonathan Dama, Andrew Lines
Fulcrum Microsystems
Context
• Three Generations in Production, including:
  • Lowest latency 24-port 10G L2 Ethernet Switch
  • Lowest latency 24-port 10G L3 Switch/Router
• Higher Frequencies
• Lower Latencies
Design Methodology
• Mostly Quasi Delay-Insensitive (QDI)
  • PCHB, PCFB, WCHB templates
  • 18 transitions per cycle (tpc)
• Islands of Synchronous Standard Flow (GALS)
• Additional timing assumptions in key circuits
  • Register Files (unacknowledged bit-writes)
  • Dense SRAM (many)
  • TCAM (trickiest)
Outline
• 10T Register Files
• 6T SRAM Bank and Analog Verification
• Multibank 6T SRAMs (SDP/DDP/CDP)
• Dual-ported SRAMs
• Design for Test (scan)
• Design for Yield (repair)
• Soft Error Tolerance
• Performance Analysis
10T Memories: Fast, Safe
• 10T state-bit (11T including reset)
• Uses foundry 6T ratios
• Design Rule Correct
• Up to 32 bits and 32 addresses
• Supports masked writes (see the sketch below)
• Single- and Dual-Ported Control Versions
• Custom Handshakes replace control for particular purposes
  • FIFOs and “SHELFs”
[Figure: 10T state-bit schematic with read rails _r.0/_r.1, write rails _w.0/_w.1, and word-line controls JW/JR]
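The masked-write port can be summarized with a small behavioral model. The following is a minimal sketch in Python, assuming the 32-bit, 32-entry configuration above; the class and its bit-mask convention are illustrative, not the actual interface.

```python
# Behavioral sketch of masked writes on a 32-entry, 32-bit register file
# (models the write semantics only, not the 10T circuit or its handshakes).
WIDTH, DEPTH = 32, 32

class RegFile:
    def __init__(self):
        self.mem = [0] * DEPTH

    def write(self, addr, data, mask):
        # Only bit positions set in `mask` are overwritten; the rest keep
        # their old value.
        old = self.mem[addr]
        self.mem[addr] = ((old & ~mask) | (data & mask)) & (2**WIDTH - 1)

    def read(self, addr):
        return self.mem[addr]

rf = RegFile()
rf.write(3, 0xDEADBEEF, 0xFFFFFFFF)   # full-width write
rf.write(3, 0x00000055, 0x000000FF)   # masked write: low byte only
assert rf.read(3) == 0xDEADBE55
```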
10T Memories: Structure
[Figure: block diagram: WRITE, READ, and DECODE blocks around the BIT ARRAY; the DI interface consists of e1ofN JW and JR address channels, e1of1 KW and KR control channels, and the W/R data ports; bit rails _w.0/_w.1 and _r.0/_r.1 run into the array]
6T Memories: Dense
• 6T state-bit (TSMC)
  • (Carefully) violates DRC
  • Different implant than normal logic
  • Validated ratio assumptions
• Bank: up to 16 bits and 1024 addresses
  • 4-way set muxing
  • 8-way 2nd-level buses
  • 32 bits per bit-line (see the organization check below)
• Fully pipelined to arbitrary width and depth
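The bank parameters above fit together arithmetically. The following is a small sanity check, under the assumption (my reading, not stated on the slide) that the 8-way 2nd-level buses join eight 32-cell bit-line segments per column and that 4-way set muxing multiplies the row count by four:

```python
# Assumed 6T bank organization implied by the stated parameters.
SETS = 4                 # 4-way set muxing
CELLS_PER_BITLINE = 32   # 32 bits on each local bit-line
BUS_SEGMENTS = 8         # segments joined by the 8-way 2nd-level buses

ROWS = CELLS_PER_BITLINE * BUS_SEGMENTS   # 256 physical rows
assert SETS * ROWS == 1024                # maximum bank depth
```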
6T Bank: Bit and “SET”
[Figure: schematic of the 6T state-bit with its write drivers (W0/W1), bit-lines (B0/B1), address line (A0), precharge, set-mux (S0, Go), and read rails (R0/R1) feeding the READ circuit]
6T Bank: Two Chunks
[Figure: bank floorplan: 2x 128 addresses in 4 sets]
6T Bank: Top-level Structure
[Figure: block diagram: a central CTRL block receives the e1of4[2] R and W data channels, an e1of2 I channel, and an e1of4[5] A address channel; it drives DEMUX blocks that feed address and set lines into the CHUNK arrays, with DATA blocks carrying the e1of4[2] read and write data to and from the chunks]
6T Bank: Address Decoding “DEMUX”
• 5 1of4s as input
• 256 address lines decoded with AND4s
• 8 groups (half chunk) of 4 set lines
• Decoder transitions are treated as digitally isochronic
• CHUNKs are power-gated
(a behavioral decode sketch follows below)
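The decode is a product of 1-of-4 codes. The following behavioral sketch, in Python, assumes four of the five 1of4 digits select one of 256 address lines through 4-input ANDs and the fifth selects one of the 4 set lines; the grouping of set lines into 8 half-chunk groups simply replicates those rails and is not modeled.

```python
from itertools import product

def decode_1of4(value):
    # One 1of4 digit: exactly one of four rails is asserted.
    return [1 if i == value else 0 for i in range(4)]

def demux(addr_digits):
    # addr_digits: five values in 0..3 (the five input 1of4 codes).
    rails = [decode_1of4(d) for d in addr_digits]
    # 256 address lines, each an AND4 of one rail from four of the digits.
    address_lines = [rails[0][a] & rails[1][b] & rails[2][c] & rails[3][d]
                     for a, b, c, d in product(range(4), repeat=4)]
    set_lines = rails[4]          # replicated into 8 groups in the layout
    return address_lines, set_lines

rows, sets = demux([1, 0, 3, 2, 2])
assert sum(rows) == 1 and sum(sets) == 1   # exactly one row and one set fire
```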
6T SRAM: Bank
6T Bank: Analog Assumptions
• Common concerns:
  • Bit-line pull-down can overpower the state-bit while the pass-gate is open
  • Bit-lines held at or floating near Vdd don’t write the state-bit while the pass-gate is open
  • Cap-coupling, slews, leakage
• Additional assumptions arise from implementation decisions:
  • Precharge interference with reads: unselected sets’ bit-lines must stay above the switching threshold of the set-muxing NAND
  • Bit-lines float at Vdd briefly before the address lines are asserted
6T Bank: Analog Assumptions
[Figure: SPICE waveform, “Write Overpowers State-Bit” (opposite state-bit rail s’ forced to GND): traces of the bit-line rails (b, b’), state-bit rails (s, s’), and address line (A); voltage (V) vs. time (ns), with levels annotated at 15% and 6% of Vdd]
6T Bank: Timing Assumptions
• Read-Data is fully Delay-Insensitive (DI)
• Writes are not checked (~2:1 race)
• Bit-line precharge is not checked (~2:1 race)
• Neutrality of address decoding is implied by input neutrality; the decoded control is not checked
• Everything else is DI!
[Figure: write path annotated with transition counts (3T, 4T, 6T, 8T, 11T) across _w, the bit-line, and the 2nd-level bus]
6T Bank: Timing Assumptions
Write and Precharge Margins
[Figure: SPICE waveform of the write and precharge margins: the write pull-down acts while the pass-gate is open (the bit-line reads initially); once the write begins, the opposing bit-line and state node are restored and the state-bit writes successfully; the set/address line then closes, the bit-lines precharge, and the pass-gate opens again; the write and precharge margins are annotated; voltage (V) vs. time (ns)]
Dense Dual-Ported Memory Design
• 8T state-bit is 1.8x larger than the 6T
• Address and bit-lines lengthen, reducing read performance
• Overhead increases with fewer bits per line
• Overall scaling is worse than 1.8x at high frequencies: bit-line slew dominates
[Figure: “Dual-Ported Memories”: Area/Bit (um^2) vs. frequency (0.8–1.3 GHz) for the 6T and 8T banks, with the gap reaching roughly 2x at the high end]
6T SRAM: Multi-Bank Structure
[Figure: block diagram: multiple 6T banks hang off shared read and write data buses (RD/WD) and an address/control crossbar carrying the write (WI, WA) and read (RI, RA) index/address channels, with an I/A address/control bus per bank]
Cached Dual-Ported (CDP) SRAM
• Uses the same 6T high-current state-bits
• Dual-ported buses, single-ported banks
  ‣ Can read and write different banks at once
• Sideband cache SRAM of one bank in size (e.g. 1024 addresses)
• When attempting to read and write the same bank, divert the write to the cache (see the sketch below)
  ‣ Must victimize the old cache entry to the main banks, but this won’t conflict with the read
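The conflict-avoidance rule can be captured in a small behavioral model. The following Python sketch assumes one read and one write per cycle and a direct-mapped sideband cache of one bank’s depth; the class name, methods, and tag layout are illustrative, not the actual design.

```python
class CDP:
    """Behavioral sketch: a dual-ported SRAM built from single-ported
    banks plus a one-bank-deep sideband cache (data + tags)."""

    def __init__(self, banks=2, depth=1024):
        self.banks = [[None] * depth for _ in range(banks)]
        self.cache = [None] * depth     # cached data, one entry per index
        self.tag   = [None] * depth     # which bank owns the cached entry
        self.valid = [False] * depth

    def _read(self, bank, index):
        # The freshest copy may live in the cache rather than the bank.
        if self.valid[index] and self.tag[index] == bank:
            return self.cache[index]
        return self.banks[bank][index]

    def access(self, read_bank, read_index, write_bank, write_index, data):
        """One cycle: a read and a write that may target the same bank."""
        if write_bank == read_bank:
            # Conflict: divert the write to the cache.  First victimize any
            # older entry at this index back to its bank; that bank differs
            # from the one being read, so the flush cannot collide with it.
            if self.valid[write_index] and self.tag[write_index] != write_bank:
                victim_bank = self.tag[write_index]
                self.banks[victim_bank][write_index] = self.cache[write_index]
            self.cache[write_index] = data
            self.tag[write_index] = write_bank
            self.valid[write_index] = True
        else:
            # No conflict: write straight into the single-ported bank and
            # drop any stale cached copy of that line.
            if self.valid[write_index] and self.tag[write_index] == write_bank:
                self.valid[write_index] = False
            self.banks[write_bank][write_index] = data
        return self._read(read_bank, read_index)
```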
Cached Dual-Ported (CDP) SRAM
[Figure: block diagram: an SDP core flanked by Cache Data 0 and Cache Data 1, with a Cache Tags and Control block mediating the write (WI, WA, WD) and read (RI, RA, RD) channels]
CDP: Operation
• Write ‘red’ to 0b10
  ‣ Directed to core
• Write ‘green’ to 0b01
  ‣ Directed to cache
  ‣ No eviction needed
[Figure: CDP diagram: cache tags, Cache Data 0/1, SDP core with Bank 0 and Bank 1, and the WI/WA/WD and RI/RA/RD channels, showing ‘red’ landing in the core and ‘green’ in the cache]
CDP: Operation
• Scenario: Read Bank 1, Index 0 while writing ‘blue’ to Bank 1, Index 1
• ‘Green’ evicted from cache
• ‘Blue’ written to cache to allow read of ‘red’ from bank
(a trace of this scenario follows below)
[Figure: CDP diagram with the read, write, and flush paths highlighted through the cache tags and control, cache data, Bank 0, and Bank 1]
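Running the sketch from the CDP slide above through this scenario reproduces the walkthrough. The bank/index assignments follow the slides; the assumption that ‘green’ was diverted by a concurrent Bank 0 read is mine, since the slide does not say why it went to the cache.

```python
cdp = CDP(banks=2, depth=2)

# Setup from the previous slide: 'red' to Bank 1, Index 0 goes straight to
# the core; 'green' to Bank 0, Index 1 is diverted to the cache (assuming a
# concurrent Bank 0 read forces the diversion), with no eviction needed.
cdp.access(read_bank=0, read_index=0, write_bank=1, write_index=0, data='red')
cdp.access(read_bank=0, read_index=0, write_bank=0, write_index=1, data='green')

# This slide: read Bank 1, Index 0 while writing 'blue' to Bank 1, Index 1.
value = cdp.access(read_bank=1, read_index=0,
                   write_bank=1, write_index=1, data='blue')
assert value == 'red'               # read served from the bank
assert cdp.banks[0][1] == 'green'   # 'green' flushed (evicted) to Bank 0
assert cdp.cache[1] == 'blue'       # 'blue' diverted to the cache
```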
CDP: Area Scaling
[Figure: Area/Bit at 1.1 GHz, 1 V, 125 C vs. number of banks (4, 8, 16) for a reference 8T design, the Fulcrum SDP (6T), and the Fulcrum CDP (6T); DUALSRAM16K_16]
Post Silicon: Simulation vs. Silicon
[Figure: simulated and measured read and write frequency (GHz, left axis) and simulated read and write power (mW, right axis) vs. supply voltage (0.7–1.5 V)]
Conclusions
• Quasi Delay-Insensitive design works as a cost-competitive, production-compatible methodology
• Targeted timing assumptions are still useful for aggressive frequency targets and area reduction
• We can build asynchronous SRAMs as dense as synchronous ones, and faster at similar densities
• 65nm development was successful and its fruits are soon going into production