GHz Asynchronous SRAM in 65nm - ASYNC Symposium

GHz Asynchronous SRAM in 65nm
Jonathan Dama, Andrew Lines, Fulcrum Microsystems

Context • Three Generations in Production, including: • Lowest latency 24-port 10G L2 Ethernet Switch • Lowest Latency 24-port 10G L3 Switch/Router • Higher Frequencies • Lower Latencies

Design Methodology • Mostly Quasi Delay-Insensitive • PCHB, PCFB, WCHB templates • 18 transitions per cycle (tpc) • Islands of synchronous standard flow (GALS) • Additional timing assumptions in key circuits: • Register Files (unacknowledged bit-writes) • Dense SRAM (many) • TCAM (trickiest)
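
The QDI templates above communicate over delay-insensitive 1-of-N channels, the same e1of4 and e1of1 channels that label the later block diagrams. As a minimal behavioral sketch of the standard 1-of-4 code (illustrative only, not Fulcrum's circuits; rail naming is assumed), two bits travel as exactly one of four raised data rails, returning to the all-low neutral state between tokens:

```python
# Minimal behavioral sketch of a 1-of-4 delay-insensitive code (illustrative;
# rail ordering and names are assumptions, not Fulcrum's implementation).

def encode_1of4(value):
    """Encode a 2-bit value as four rails with exactly one rail high."""
    assert 0 <= value < 4
    return tuple(i == value for i in range(4))

def decode_1of4(rails):
    """Return the value once the code is valid; None while it is neutral."""
    if sum(rails) == 1:          # valid: exactly one rail asserted
        return rails.index(True)
    if not any(rails):           # neutral: all rails low (between tokens)
        return None
    raise ValueError("illegal 1-of-4 code")

# A sender alternates valid and neutral phases; the receiver acknowledges each.
token = encode_1of4(2)
assert decode_1of4(token) == 2
assert decode_1of4((False, False, False, False)) is None
```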

Outline • 10T Register Files • 6T SRAM Bank and Analog Verification • Multibank 6T SRAMs (SDP/DDP/CDP) • Dual-ported SRAMs • Design for Test (scan) • Design for Yield (repair) • Soft Error Tolerance • Performance Analysis

10T Memories: Fast, Safe • 10T state-bit (11T including reset) • Uses foundry 6T ratios • Design Rule Correct • Up to 32 bits and 32 addresses • Supports masked writes • Single & Dual Ported Control Versions • Custom Handshakes replace control for particular purposes • FIFOs and “SHELFs”
[Figure: 10T state-bit schematic with rails _w.0/_w.1 and _r.0/_r.1 and signals JW and JR]
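
At the behavioral level, the register file above amounts to an array of up to 32 words of up to 32 bits with per-bit write masks. A minimal sketch of that assumed read/write semantics (it models none of the 10T circuit or its handshakes):

```python
class RegisterFile10T:
    """Behavioral sketch of a small 10T register file: up to 32 words of up
    to 32 bits, with masked (per-bit) writes. Illustrative model only."""

    def __init__(self, depth=32, width=32):
        self.width = width
        self.mem = [0] * depth

    def read(self, addr):
        return self.mem[addr]

    def write(self, addr, data, mask=None):
        # The mask selects which bit positions are written; unmasked bits
        # keep their old value (only the masked cells see a bit-write).
        if mask is None:
            mask = (1 << self.width) - 1
        self.mem[addr] = (self.mem[addr] & ~mask) | (data & mask)

rf = RegisterFile10T()
rf.write(3, 0xFFFFFFFF)
rf.write(3, 0x0, mask=0x0000FFFF)   # masked write clears only the low half
assert rf.read(3) == 0xFFFF0000
```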

10T Memories: Structure
[Figure: block diagram showing the BIT ARRAY with rails _w.0/_w.1 and _r.0/_r.1 and signals JW/JR; WRITE (W) and READ (R) blocks on e1ofN JW and e1ofN JR channels; a DECODE block on e1of1 KW and e1of1 KR channels; the external channels are labeled as the DI Interface]

6T Memories: Dense • 6T state-bit (TSMC) • (Carefully) violates DRC • Different implant than normal logic • Validated ratio assumptions • Bank: up to 16 bits and 1024 addresses • 4-way set muxing • 8-way 2nd-level buses • 32 bits per bit-line • Fully pipelined to arbitrary width and depth
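
One consistent reading of those organization numbers (my arithmetic; the slide does not spell out this decomposition): 32 cells per bit-line times the 8-way second-level buses gives 256 word lines, and 4-way set muxing then yields the 1024-address bank depth.

```python
# Illustrative check of the 6T bank organization described above
# (assumed interpretation of how the slide's numbers combine).
bits_per_bitline = 32      # cells sharing one local bit-line
second_level_ways = 8      # local bit-line groups merged by a 2nd-level bus
set_ways = 4               # 4-way set muxing

word_lines = bits_per_bitline * second_level_ways   # 256 word lines
addresses = word_lines * set_ways                    # 1024 addresses
assert addresses == 1024
```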

6T Bank: Bit and “SET”
[Figure: schematic of the 6T STATE-BIT with address line A0, bit-lines B0/B1, signals W0/W1 and R0/R1, set select S0 and output Go, and the PRECHARGE, WRITE, SET-MUX, and READ circuits]

6T Bank: Two Chunks • 2x 128 addresses in 4 sets

6T Bank: Top-level Structure
[Figure: bank block diagram: a CTRL block receives the address (e1of4[5] A), a read/write instruction (e1of2 I), and the read and write data channels (e1of4[2] R, e1of4[2] W); DEMUX blocks decode the address for the CHUNK arrays, whose data are merged and distributed by DATA blocks]

6T Bank: Address Decoding “DEMUX” • 5 1of4s as input • 256 address lines decoded with AND4s • 8 groups (half chunk) of 4 set lines • Decoder transitions are treated as digitally isochronic • CHUNKs are power-gated
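
An illustrative model of that decode (the split of the five 1-of-4 codes into word-line and set-line groups is my assumption): four codes select one of 256 word lines through an AND4 of one rail from each code, and the remaining code selects one of the four set lines.

```python
from itertools import product

def demux_decode(codes):
    """Sketch of the DEMUX: 'codes' is five 1-of-4 rail groups (tuples of four
    bools, exactly one high). Four codes pick one of 256 word lines via AND4s;
    the fifth picks one of 4 set lines. Illustrative grouping only."""
    word_codes, set_code = codes[:4], codes[4]
    word_lines = [all(code[i] for code, i in zip(word_codes, idx))
                  for idx in product(range(4), repeat=4)]   # 256 AND4 outputs
    set_lines = list(set_code)                               # 4 set lines
    return word_lines, set_lines

one_hot = lambda v: tuple(i == v for i in range(4))
word_lines, set_lines = demux_decode([one_hot(3), one_hot(0), one_hot(0),
                                      one_hot(0), one_hot(1)])
assert sum(word_lines) == 1 and set_lines.index(True) == 1
```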

6T SRAM: Bank

6T Bank: Analog Assumptions • Common Concerns: • Bit-line pull-down can overpower state-bit while pass-gate open • Bit-lines held at or floating near Vdd don’t write state-bit while pass-gate open • Cap-coupling, Slews, Leakage • Arise from implementation decisions: • Precharge interference with reads of unselected sets must hold those bitlines above the switching threshold of the set-muxing NAND • Bit-lines float at Vdd briefly before address-lines asserted

6T Bank: Analog Assumptions
[Figure: simulated write overpowering the state-bit, with the opposite state-bit rail (s') forced to GND: voltage (V) vs. time (2.0 to 2.5 ns) waveforms of the bit-line rails (b, b'), the state-bit rails (s, s'), and A, with levels annotated at 15% of Vdd and 6% of Vdd]

6T Bank: Timing Assumptions • Read-Data is fully Delay-Insensitive (DI) • Writes are not checked (~2:1 race) • Bit-line precharge is not checked (~2:1 race) • Neutrality of address decoding implied by input neutrality; the decoded control is not checked • Everything else is DI!
[Figure: write-path schematic (2nd-level bus, _w, bit-line, 6T state-bit) annotated with transition counts (3T, 4T, 8T, 11T) on the raced and checked paths]
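
The two unchecked races above lean on a roughly 2:1 margin between the raced action and the acknowledged path that closes its window. A hedged sketch of that bookkeeping (the pairing of transition counts is my assumption, borrowed from the figure's annotations; the real margins were confirmed in analog simulation, as the next figure shows):

```python
# Illustrative 2:1 race bookkeeping (assumed pairing of paths; the actual
# verification used analog simulation of the extracted circuit).

def race_margin(raced_transitions, checked_transitions, required_ratio=2.0):
    """Ratio of the acknowledged (checked) path to the raced path, and
    whether it meets the ~2:1 rule of thumb from the slide."""
    ratio = checked_transitions / raced_transitions
    return ratio, ratio >= required_ratio

# e.g. a 4-transition write pull-down raced against an 8-transition checked path
ratio, ok = race_margin(raced_transitions=4, checked_transitions=8)
print(f"margin {ratio:.1f}:1 -> meets 2:1: {ok}")
```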

6T Bank: Timing Assumptions • Write and Precharge Margins
[Figure: simulated write/precharge timing margins, voltage (V) vs. time (11.4 to 12.4 ns): the write pull-down drives the bit-line until the state-bit writes successfully, the set/address pass-gate closes and later opens again, and the bit-lines precharge; the write margin and precharge margin are marked. Once the write begins, the opposing bit-line and state node are restored with the pass-gate open; the bit-line reads initially]

Dense Dual-Ported Memory Design • 8T state-bit is 1.8x larger than the 6T • Address and bit-lines lengthen, reducing read performance • Overhead increases with fewer bits per line • Overall scaling is worse than 1.8x at high frequencies: bit-line slew dominates
[Figure: “Dual-Ported Memories”: area/bit (um^2, 0 to 10) vs. frequency (0.8 to 1.3 GHz) for a 6T bank and an 8T bank, with a 2x annotation at high frequency]
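
A hedged way to see why the dual-ported penalty exceeds the raw 1.8x cell ratio at high frequency (a toy model with hypothetical numbers, not the authors' analysis): per-bit area is roughly the cell area plus periphery amortized over the cells on a bit-line, and the slower 8T bit-lines force fewer cells per line at a given frequency.

```python
# Toy area model: area/bit ~ cell_area + periphery_area / bits_per_bitline.
# All numbers below are hypothetical, chosen only to illustrate the trend.

def area_per_bit(cell_area, periphery_area, bits_per_bitline):
    return cell_area + periphery_area / bits_per_bitline

a_6t = area_per_bit(cell_area=1.0, periphery_area=32.0, bits_per_bitline=32)
# Assume the 8T cell is 1.8x larger and bit-line slew halves the cells per
# line needed to hold the same frequency:
a_8t = area_per_bit(cell_area=1.8, periphery_area=32.0, bits_per_bitline=16)
print(a_8t / a_6t)   # ~1.9x per bit, worse than the 1.8x cell ratio alone
```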

6T SRAM: Multi-Bank Structure
[Figure: multi-bank block diagram: an address/control crossbar (write WI/WA, read RI/RA, bank-side I/A) steers requests to multiple 6T banks over shared read and write buses carrying RD and WD]

Cached Dual-Ported (CDP) SRAM • Uses same 6T High Current State-bits • Dual-ported buses, single-ported banks ‣ Can read and write different banks at once • Sideband cache SRAM of one bank in size (e.g. 1024 addresses) • When attempting to read and write the same bank, divert the write to the cache ‣ Must victimize the old cache entry to the main banks, but this won’t conflict with the read
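
A behavioral sketch of that policy (my reconstruction from the slides; class and port names are assumptions, not Fulcrum's implementation): the cache is one bank deep and indexed by the in-bank address, a write that would collide with the concurrent read is diverted into the cache, and the victim it displaces necessarily belongs to a different bank than the read, so flushing it cannot conflict.

```python
class CachedDualPortSRAM:
    """Behavioral sketch of the CDP SRAM: single-ported banks plus a
    one-bank sideband cache that absorbs read/write bank conflicts."""

    def __init__(self, num_banks, bank_depth):
        self.banks = [[0] * bank_depth for _ in range(num_banks)]
        self.cache = [None] * bank_depth     # cached data per index
        self.tags = [None] * bank_depth      # which bank each entry shadows

    def _lookup(self, bank, index):
        if self.tags[index] == bank:
            return self.cache[index]         # freshest copy is in the cache
        return self.banks[bank][index]

    def access(self, r_bank, r_index, w_bank, w_index, w_data):
        """One cycle: read (r_bank, r_index) while writing (w_bank, w_index)."""
        read_data = self._lookup(r_bank, r_index)

        if w_bank != r_bank and self.tags[w_index] != w_bank:
            # No conflict: write straight into the core bank.
            self.banks[w_bank][w_index] = w_data
        else:
            # Conflict with the read (or the entry is already cached): divert
            # the write into the cache, flushing any victim to its own bank.
            victim_bank = self.tags[w_index]
            if victim_bank is not None and victim_bank != w_bank:
                # The victim's bank differs from the read's bank, so this
                # flush never collides with the concurrent read.
                self.banks[victim_bank][w_index] = self.cache[w_index]
            self.cache[w_index], self.tags[w_index] = w_data, w_bank

        return read_data

# Replaying the slides' scenario with 2 banks of 2 entries (the concurrent
# reads in this replay are assumed; the slides only give the writes):
cdp = CachedDualPortSRAM(num_banks=2, bank_depth=2)
cdp.access(r_bank=0, r_index=0, w_bank=1, w_index=0, w_data="red")    # to core
cdp.access(r_bank=0, r_index=1, w_bank=0, w_index=1, w_data="green")  # to cache
assert cdp.access(r_bank=1, r_index=0, w_bank=1, w_index=1, w_data="blue") == "red"
```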

Cached Dual-Ported (CDP) SRAM
[Figure: CDP block diagram: an SDP core with Cache Data 0, Cache Data 1, and a Cache Tags and Control block; write port WI/WA/WD and read port RI/RA/RD]

CDP: Operation • Write ‘red’ to 0b10 ‣ Directed to core • Write ‘green’ to 0b01 ‣ Directed to cache ‣ No eviction needed
[Figure: CDP block diagram (SDP core with Bank 0 and Bank 1, cache data arrays, cache tags and control) with the tag bits initially 0]

CDP: Operation • Scenario: Read Bank 1, Index 0; Write ‘blue’ to Bank 1, Index 1 • ‘Green’ evicted from cache • ‘Blue’ written to cache to allow read of ‘red’ from bank
[Figure: CDP block diagram animating the read from Bank 1, the write diverted to the cache, and the flush of the victim to Bank 0 (legend: Read, Write, Flush)]

CDP: Area Scaling
[Figure: DUALSRAM16K_16; chart of area/bit @ 1.1 GHz, 1 V, 125 C (0 to 4) vs. number of banks (4, 8, 16) for a Reference 8T design, the Fulcrum SDP (6T), and the Fulcrum CDP (6T)]

Post Silicon: Simulation vs. Silicon
[Figure: simulated and measured read and write frequency (GHz, left axis, 0 to 2.5) and simulated read and write power (mW, right axis, 0 to 40) vs. supply voltage (0.7 to 1.5 V)]

2.5

Conclusions • Quasi Delay-Insensitive design works as a cost-competitive, productioncompatible methodology • Targeted timing assumptions still useful for aggressive frequency targets and area reduction • We can build asynchronous SRAMs as dense as synchronous and faster at similar densities • 65nm development successful and the fruits are soon doing into production