Drowsy Caches - EECS @ UMich

Drowsy Caches: Simple Techniques for Reducing Leakage Power

Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, Trevor Mudge


Motivation
• Ever-increasing leakage power
– as feature size shrinks
• Vt scales down
– exponential increase in leakage power
• On-chip caches
– responsible for 15%~20% of the total power
– leakage power can exceed 50% of total cache power according to our projection using Berkeley Predictive Models
[Figure: normalized leakage power (0 to 1200) versus minimum gate length (0.2 to 0.05 µm), with curves for 25 ºC, 50 ºC, 75 ºC, and 105 ºC]

Processor power trends
[Figure: leakage power and dynamic power consumption (W, 0 to 1000) by processor generation: Pentium II, Pentium III, Pentium 4, One Gen, Two Gen, Three Gen]
• Based on ITRS roadmap and transistor count estimates.
• Total power in this projection cannot come true.

An observation about data caches
• L1 data caches
– Working set: fraction of cache lines accessed in a time window.
– Window size = 2000 cycles.
– Only a small fraction of lines are accessed in a window.
[Figure: working set (0% to 50% of cache lines) of the current window, and of the current + 1, 8, and 32 previous windows, for crafty, vortex, bzip, vpr, mcf, parser, gcc, facerec, equake, and mesa]
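The working-set measurement described on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: the trace format, the function name working_set_fractions, and the history parameter (number of previous windows merged in) are assumptions made for the example.

```python
def working_set_fractions(line_trace, num_lines, window=2000, history=0):
    """Fraction of cache lines touched per window.

    line_trace: iterable of (cycle, line_index) pairs (hypothetical format).
    history:    number of previous windows whose lines are also counted.
    """
    touched = {}
    for cycle, line in line_trace:
        touched.setdefault(cycle // window, set()).add(line)
    if not touched:
        return []
    fractions = []
    for w in range(max(touched) + 1):
        lines = set()
        # Union of the current window and up to `history` earlier windows.
        for prev in range(max(0, w - history), w + 1):
            lines |= touched.get(prev, set())
        fractions.append(len(lines) / num_lines)
    return fractions


# Toy trace: a 1024-line cache where each 2000-cycle window touches 4 lines.
trace = [(c, (c // 500) % 8) for c in range(0, 8000, 100)]
current_only = working_set_fractions(trace, num_lines=1024)
with_previous = working_set_fractions(trace, num_lines=1024, history=1)
print(current_only)
print(with_previous)
```

In the toy trace, each window touches well under 1% of the lines, which is the kind of observation the slide relies on.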

The Drowsy Cache approach
Instead of being sophisticated about predicting the working set, reduce the penalty for being wrong.
Algorithm:
• Periodically put all lines in cache into drowsy mode.
• When accessed, wake up the line.
• Optimize across circuit-microarchitecture boundary:
– Use of the appropriate circuit technique enables simplified microarchitectural control.
• Requirement: state preservation in low leakage mode.
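A minimal sketch of the algorithm above, assuming all lines start drowsy, that every wake-up costs a fixed number of cycles, and that the access trace is a list of (cycle, line) pairs. The function name and parameters are hypothetical, chosen only for illustration.

```python
def simulate_simple_policy(accesses, num_lines, window=2000, wake_penalty=1):
    """Periodically put every line into drowsy mode; wake a line on access.

    accesses: list of (cycle, line_index) pairs sorted by cycle.
    Returns (extra_cycles, drowsy_fraction_at_end_of_each_window).
    """
    awake = set()            # lines currently at full voltage
    current_window = 0
    wakeups = 0
    drowsy_fractions = []
    for cycle, line in accesses:
        # Crossing a window boundary puts all lines back into drowsy mode.
        while current_window < cycle // window:
            drowsy_fractions.append(1 - len(awake) / num_lines)
            awake.clear()
            current_window += 1
        if line not in awake:
            awake.add(line)  # wake the line; its state was preserved
            wakeups += 1
    drowsy_fractions.append(1 - len(awake) / num_lines)
    return wakeups * wake_penalty, drowsy_fractions


extra_cycles, fractions = simulate_simple_policy(
    [(0, 1), (10, 1), (20, 2), (2000, 1), (2500, 3)], num_lines=8
)
print(extra_cycles, fractions)
```

No per-line counter or predictor is needed in this sketch: the only state is one awake/drowsy flag per line plus the global window position.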

Access control flow – Awake tags
[Flowchart: an access checks the awake tags. Hit path: awake tag match, line wake up, line access. Miss path: awake tag miss, line wake up, replacement, memory.]
• Drowsy hit / miss adds at most 1 cycle latency
• Access to awake line is not penalized
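The two latency statements for awake tags can be captured in a small model. This is a sketch under stated assumptions: the base hit and miss latencies are hypothetical placeholders, and the drowsy penalty is charged in full, which is an upper bound since the slide says "at most" 1 cycle.

```python
def awake_tag_access_latency(hit, line_is_drowsy,
                             hit_latency=1, miss_latency=10, wake_penalty=1):
    """Upper-bound access latency in cycles when tags are always awake.

    hit_latency and miss_latency are hypothetical values for illustration.
    """
    base = hit_latency if hit else miss_latency
    # Only a drowsy line pays the wake-up cost; awake lines are not penalized.
    return base + (wake_penalty if line_is_drowsy else 0)


print(awake_tag_access_latency(hit=True, line_is_drowsy=False))
print(awake_tag_access_latency(hit=True, line_is_drowsy=True))
print(awake_tag_access_latency(hit=False, line_is_drowsy=True))
```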

Access control flow – Drowsy tags
[Flowchart: access with drowsy tags. Hit path: awake tag match, tag wake up, line wake up, line access. Miss path: awake tag miss, tag wake up, line wake up, replacement, memory. Unneeded tags and lines go back to drowsy.]
• Drowsy tags implementation is more complicated
• Is the complexity worth it?
– Tags use about 7% of data bits (32 bit address)
– Only small incremental leakage reduction
• Worst case: 3 cycle extra latency
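The "about 7%" figure can be checked with a short calculation. The cache geometry below (32 KB, 4-way, 32-byte lines) is an assumption chosen for illustration and is not stated on the slide; status bits such as valid and dirty are ignored.

```python
def tag_overhead(cache_bytes, ways, line_bytes, address_bits=32):
    """Return (tag_bits, tag_bits / data_bits_per_line) for a set-associative cache.

    Assumes power-of-two sizes; ignores valid/dirty/replacement bits.
    """
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, tag_bits / (line_bytes * 8)


# Hypothetical configuration: 32 KB, 4-way, 32-byte lines, 32-bit addresses.
tag_bits, ratio = tag_overhead(32 * 1024, ways=4, line_bytes=32)
print(tag_bits, round(ratio * 100, 1))  # prints: 19 7.4
```

With tags this small relative to the data array, making them drowsy can only remove a small additional share of leakage, which is the point of the slide.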

Low-leakage circuit techniques
• Gated-VDD
– Pros: largest leakage reduction, fast mode switching, easy implementation
– Cons: loses cell state
• ABB-MTCMOS
– Pros: retains cell state
– Cons: slow mode switching
• DVS
– Pros: retains cell state, fast mode switching, more power reduction than ABB
– Cons: more SEU noise susceptible

Drowsy memory using DVS
• Low supply voltage for inactive memory cells
– Low voltage reduces leakage current too! P↓↓ = I↓ × V↓
– Quadratic reduction in leakage power
[Figure: memory cell schematic showing the supply voltage for normal mode, the supply voltage for drowsy mode, and the leakage path]
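The "quadratic" claim follows from P = I × V if the leakage current is assumed to fall roughly in proportion to the supply voltage. The sketch below makes that first-order assumption explicit through a hypothetical current_exponent parameter; real leakage depends on voltage in a more complex, device-specific way, so the result is only an illustration, evaluated at the 1V and 0.3V supply levels that appear elsewhere in these slides.

```python
def leakage_power_ratio(v_drowsy, v_normal, current_exponent=1.0):
    """Drowsy-to-normal leakage power ratio for P = I * V.

    Assumes I is proportional to V ** current_exponent (an illustrative
    first-order model, not a device-level one).
    """
    scale = v_drowsy / v_normal
    current_ratio = scale ** current_exponent
    return current_ratio * scale


ratio = leakage_power_ratio(v_drowsy=0.3, v_normal=1.0)
print(f"leakage power falls to about {ratio:.2f} of normal")
```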

Leakage reduction using DVS
• High-Vt devices for access transistors
– reduce leakage power
– increase access time of cache
• Right trade-off point
– 91% leakage reduction
– 6% cycle time increase
[Figure: performance (85% to 100%) versus leakage reduction (76% to 94%), with points labeled 0.2V, 0.25V, 0.3V, and 0.35V. Projections for 0.07µm process.]

Drowsy cache line architecture
[Figure: circuit diagram of a drowsy cache line. Labeled elements: drowsy bit with drowsy (set) and wake up (reset) inputs, voltage controller, power line switched between VDD (1V) and VDDLow (0.3V), SRAMs, row decoder, word line driver, word line gate, word line, and drowsy signal.]

Energy reduction
[Figure: stacked energy bars (0% to 100%) for a regular cache (leakage, dynamic) and a drowsy cache (drowsy, high leakage, dynamic)]
• Projections for 0.07µm process
• High leakage: lines have to be powered up when accessed.
• Drowsy circuit
– Without high-Vt device (in SRAM): 6x leakage reduction, no access delay.
– With high-Vt device: 10x leakage reduction, 6% access time increase.

1 cycle vs. 2 cycle wake up
[Figure: drowsy fraction (70% to 100%) versus run-time increase (0.00% to 2.20%), comparing 1 cycle and 2 cycle wake-up; simple policy, awake tags, 4000 cycle window. Benchmarks: ammp00, applu00, apsi00, art00, bzip200, crafty00, eon00, equake00, facerec00, fma3d00, galgel00, gap00, gcc00, gzip00, lucas00, mcf00, mesa00, mgrid00, parser00, sixtrack00, swim00, twolf00, vortex00, vpr00, wupwise00]
• Fast wakeup is important, but easy to accomplish
– Cache access time: 0.57ns (for 0.07µm from CACTI using 0.18µm baseline).
– Speed dependent on voltage controller size: 64 x Leff: 0.28ns (half cycle at 4 GHz), 32 x Leff: 0.42ns, 16 x Leff: 0.77ns.
• Impact of drowsy tags is quite similar to double-cycle wake up.
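To put the wake-up latencies in context, the sketch below expresses each one as a fraction of the 0.57ns cache access time quoted above. Using the access time as the reference period is an assumption made for this illustration; the constant and function names are hypothetical.

```python
ACCESS_TIME_NS = 0.57  # cache access time quoted on the slide

# Voltage controller size (in multiples of Leff) -> wake-up latency in ns.
WAKE_UP_NS = {64: 0.28, 32: 0.42, 16: 0.77}


def wake_up_fraction(latency_ns, reference_ns=ACCESS_TIME_NS):
    """Wake-up latency as a fraction of the reference period."""
    return latency_ns / reference_ns


for size, latency in WAKE_UP_NS.items():
    print(f"{size} x Leff: {wake_up_fraction(latency):.2f} of access time")
# prints:
# 64 x Leff: 0.49 of access time
# 32 x Leff: 0.74 of access time
# 16 x Leff: 1.35 of access time
```

Under this reading, only the smallest controller (16 x Leff) takes longer to wake a line than a full cache access.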

Policy comparison
[Figure: drowsy fraction (70% to 100%) versus run-time increase (0.00% to 1.40%) for the series simple 2000, simple 4000, and noaccess 4000, with the gap between the noaccess and simple policies marked. Caption: 1 cycle wakeup, awake tags, simple policy: 2000 and 4000 cycle window, noaccess policy: 2000 cycle window. Labeled points: lucas, gcc, twolf, gzip, parser, facerec, gap, vortex, sixtrack, eon, applu, crafty, art, mgrid, galgel. Benchmarks: ammp00, applu00, apsi00, art00, bzip200, crafty00, eon00, equake00, facerec00, fma3d00, galgel00, gap00, gcc00, gzip00, lucas00, mcf00, mesa00, mgrid00, parser00, sixtrack00, swim00, twolf00, vortex00, vpr00, wupwise00]

Energy reduction
              Normalized Total Energy     Normalized Leakage Energy   Run-time
              DVS     Theoretical min.    DVS     Theoretical min.    increase
Awake tags    0.46    0.35                0.29    0.15                0.41%
Drowsy tags   0.42    0.31                0.24    0.09                0.84%
• > 50% total energy reduction
• > 70% leakage energy reduction
• Theoretical minimum assumes zero leakage in drowsy mode
• Total energy reduction within 0.1 of theoretical minimum
– Diminishing returns for better leakage reduction techniques
• Above figures assume 6x leakage reduction, 10x possible with small additional run-time impact
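The leakage columns can be related with a simple model. Since the theoretical minimum assumes zero leakage in drowsy mode, it equals the leakage of the lines that stay awake; the remaining lines then leak at 1/6 (or 1/10) of the normal rate. The sketch below is an illustrative simplification that ignores transition energy and run-time increase; the function and variable names are hypothetical.

```python
# Normalized energies taken from the table above.
RESULTS = {
    "awake tags":  {"total": 0.46, "total_min": 0.35, "leak": 0.29, "leak_min": 0.15},
    "drowsy tags": {"total": 0.42, "total_min": 0.31, "leak": 0.24, "leak_min": 0.09},
}


def modeled_leakage(awake_share, reduction):
    """Normalized leakage energy: awake part leaks fully, drowsy part at 1/reduction."""
    return awake_share + (1 - awake_share) / reduction


for name, r in RESULTS.items():
    awake_share = r["leak_min"]  # zero drowsy leakage => only awake lines leak
    at_6x = modeled_leakage(awake_share, 6)
    at_10x = modeled_leakage(awake_share, 10)
    gap = r["total"] - r["total_min"]
    print(f"{name}: 6x -> {at_6x:.3f}, 10x -> {at_10x:.3f}, "
          f"total energy gap to minimum {gap:.2f}")
```

Under this model, the 6x figures land close to the reported DVS leakage values (0.29 and 0.24), and moving to 10x improves the result only modestly, consistent with the diminishing-returns remark.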

Conclusions
• Simple circuit technique
– Need high-Vt transistors, low Vdd supply
• Simple architecture
– No need to keep counter/predictor state for each line
– Periodic global counter asserts drowsy signal
– Window size (for periodic drowsy transition) depends on core: ~4000 cycles has good E-delay trade-off
• Technique also works well on in-order processors
– Memory subsystem is already latency tolerant
• Drowsy circuit is good enough
– Diminishing returns on further leakage reduction
– Focus is again on dynamic energy