Drowsy Caches: Simple Techniques for Reducing Leakage Power
Krisztián Flautner Nam Sung Kim Steve Martin David Blaauw Trevor Mudge
Motivation
• Ever-increasing leakage power as feature size shrinks
  – Vt scales down, giving an exponential increase in leakage power
• On-chip caches
  – responsible for 15%~20% of the total power
  – leakage power can exceed 50% of total cache power according to our projection using the Berkeley Predictive Models
[Figure: normalized leakage power (0 to 1200) vs. minimum gate length (0.2 µm down to 0.05 µm), with curves for 25 ºC, 50 ºC, 75 ºC and 105 ºC]
Processor power trends
[Figure: power consumption (W, 0 to 1000), split into leakage power and dynamic power, by processor generation: Pentium II, Pentium III, Pentium 4, One Gen, Two Gen, Three Gen]
• Based on ITRS roadmap and transistor count estimates.
• Total power in this projection cannot come true.
An observation about data caches
• L1 data caches
  – Working set: fraction of cache lines accessed in a time window.
  – Window size = 2000 cycles.
  – Only a small fraction of lines are accessed in a window.
[Figure: working set (0% to 50%) per benchmark (crafty, vortex, bzip, vpr, mcf, parser, gcc, facerec, equake, mesa); series: working set of the current window, and working set of the current + 1, 8, and 32 previous windows]
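The working-set definition above can be made concrete with a minimal sketch. It assumes a hypothetical access trace given as `(cycle, line_index)` pairs; the function name, the trace format, and the `previous` parameter (which unions in earlier windows, as in the figure's second series) are illustrative and not taken from the slides.

```python
from collections import defaultdict

def working_set_fractions(trace, num_lines, window=2000, previous=0):
    """Fraction of cache lines touched per window.

    trace: iterable of (cycle, line_index) pairs (hypothetical format).
    previous: number of earlier windows to union with the current one.
    """
    touched = defaultdict(set)
    for cycle, line in trace:
        touched[cycle // window].add(line)
    if not touched:
        return []
    fractions = []
    for w in range(max(touched) + 1):
        lines = set()
        for v in range(max(0, w - previous), w + 1):
            lines |= touched[v]
        fractions.append(len(lines) / num_lines)
    return fractions

# Toy trace on an 8-line cache: two windows of 2000 cycles.
trace = [(0, 1), (10, 1), (1500, 2), (2500, 3)]
print(working_set_fractions(trace, num_lines=8))              # per-window working set
print(working_set_fractions(trace, num_lines=8, previous=1))  # current + 1 previous window
```

On a real trace, a small per-window fraction is what motivates keeping most lines in a low-leakage state.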
The Drowsy Cache approach
Instead of being sophisticated about predicting the working set, reduce the penalty for being wrong.
Algorithm:
• Periodically put all lines in cache into drowsy mode.
• When accessed, wake up the line.
• Optimize across circuit-microarchitecture boundary:
  – Use of the appropriate circuit technique enables simplified microarchitectural control.
• Requirement: state preservation in low leakage mode.
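The two-step algorithm above can be sketched as a small trace-driven simulation. This is only an illustration under stated assumptions: the trace format `(cycle, line_index)`, the function name, and the choice that all lines start awake at cycle 0 are hypothetical and not specified in the slides.

```python
def simulate_simple_policy(trace, num_lines, window, total_cycles):
    """Return (wakeups, drowsy_fraction) for the periodic drowsy policy.

    Every `window` cycles all lines go drowsy; an access to a drowsy
    line wakes it up. Assumes all lines start awake (an assumption).
    """
    awake_since = [0] * num_lines  # None marks a drowsy line
    awake_line_cycles = 0
    wakeups = 0

    def sweep(now):
        # Put every awake line into drowsy mode and bank its awake time.
        nonlocal awake_line_cycles
        for line, since in enumerate(awake_since):
            if since is not None:
                awake_line_cycles += now - since
                awake_since[line] = None

    next_sweep = window
    for cycle, line in sorted(trace):
        while next_sweep <= cycle:
            sweep(next_sweep)
            next_sweep += window
        if awake_since[line] is None:
            wakeups += 1
            awake_since[line] = cycle
    while next_sweep <= total_cycles:
        sweep(next_sweep)
        next_sweep += window
    # Lines still awake at the end of the run.
    for since in awake_since:
        if since is not None:
            awake_line_cycles += total_cycles - since
    drowsy_fraction = 1 - awake_line_cycles / (num_lines * total_cycles)
    return wakeups, drowsy_fraction

trace = [(10, 0), (150, 0), (160, 1), (250, 0)]
print(simulate_simple_policy(trace, num_lines=4, window=100, total_cycles=300))
```

No per-line predictor state is kept here: a single global window counter drives the drowsy transition, and a mispredicted line only costs a wake-up.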
Access control flow – Awake tags
[Flowchart nodes: Awake tags; Hit; Miss; Awake tag match; Awake tag miss; Line wake up; Replacement; Line access; Memory]
• Drowsy hit / miss adds at most 1 cycle latency
• Access to awake line is not penalized
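The latency rules for awake tags can be illustrated with a toy model. This sketch assumes a direct-mapped organization and a one-cycle wake-up purely for brevity; the class name, method names, and addressing scheme are hypothetical, and the slides do not specify the cache organization used in the flowchart.

```python
class AwakeTagCache:
    """Toy direct-mapped cache whose tags stay awake (illustrative only)."""

    def __init__(self, num_lines, wake_cycles=1):
        self.tags = [None] * num_lines
        self.drowsy = [False] * num_lines
        self.wake_cycles = wake_cycles

    def make_all_drowsy(self):
        self.drowsy = [True] * len(self.drowsy)

    def access(self, line_address):
        """Return (hit, extra_cycles) for one access."""
        index = line_address % len(self.tags)
        tag = line_address // len(self.tags)
        hit = self.tags[index] == tag  # tags are readable without wake-up
        extra = 0
        if self.drowsy[index]:
            # Drowsy hit or drowsy miss: wake the data line first.
            self.drowsy[index] = False
            extra = self.wake_cycles
        if not hit:
            self.tags[index] = tag  # replacement from memory
        return hit, extra

cache = AwakeTagCache(num_lines=4)
print(cache.access(5))   # cold miss on an awake line
cache.make_all_drowsy()
print(cache.access(5))   # drowsy hit: pays the wake-up
print(cache.access(5))   # awake hit: no penalty
```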
Access control flow – Drowsy tags
[Flowchart nodes: Drowsy tags; Hit; Miss; Awake tag match; Awake tag miss; Tag wake up; Line wake up; Replacement; Line access; Memory; Unneeded tags and lines back to drowsy]
• Drowsy tags implementation is more complicated
• Is the complexity worth it?
  – Tags use about 7% of data bits (32 bit address)
  – Only small incremental leakage reduction
• Worst case: 3 cycle extra latency
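The "about 7%" figure can be checked with simple arithmetic. The sketch below assumes a 32 KB, 4-way cache with 32-byte lines; that configuration is an assumption chosen for illustration (the slides only state the 32-bit address), and status bits such as valid or dirty are ignored.

```python
import math

def tag_overhead(cache_bytes, line_bytes, ways, address_bits=32):
    """Tag bits per line as a fraction of data bits per line."""
    num_sets = cache_bytes // (line_bytes * ways)
    offset_bits = int(math.log2(line_bytes))   # assumes power-of-two sizes
    index_bits = int(math.log2(num_sets))
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits / (line_bytes * 8)

# Assumed configuration: 32 KB, 32-byte lines, 4-way set associative.
overhead = tag_overhead(32 * 1024, 32, 4)
print(f"{overhead:.1%}")  # 19 tag bits over 256 data bits
```

Because the tags are such a small share of the array, putting them into drowsy mode as well can only recover a small additional amount of leakage.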
Low-leakage circuit techniques
Circuit      Pros                               Cons
Gated-VDD    • Largest leakage reduction        • Loses cell state
             • Fast mode switching
             • Easy implementation
ABB-MTCMOS   • Retains cell state               • Slow mode switching
DVS          • Retains cell state               • More SEU noise susceptible
             • Fast mode switching
             • More power reduction than ABB
Drowsy memory using DVS
• Low supply voltage for inactive memory cells
  – Low voltage reduces leakage current too! P↓↓ = I↓ × V↓
  – Quadratic reduction in leakage power
[Figure: memory cell showing the supply voltage for normal mode, the supply voltage for drowsy mode, and the leakage path]
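The P = I × V relation can be illustrated with a first-order calculation. The sketch assumes leakage current scales linearly with supply voltage, which is a simplification made here to show where a quadratic trend comes from; real devices depend on threshold voltage, temperature, and process details, so this is not a substitute for circuit-level projections. The 1 V and 0.3 V values are the VDD and VDDLow levels shown in the line architecture slide.

```python
def relative_leakage_power(v_drowsy, v_normal):
    """Leakage power in drowsy mode relative to normal mode.

    Assumes I is proportional to V (a simplification), so P = I * V
    scales with the square of the voltage ratio.
    """
    current_ratio = v_drowsy / v_normal
    voltage_ratio = v_drowsy / v_normal
    return current_ratio * voltage_ratio

ratio = relative_leakage_power(0.3, 1.0)
print(f"drowsy leakage is about {ratio:.0%} of normal-mode leakage")
```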
Leakage reduction using DVS
• High-Vt devices for access transistors
  – reduce leakage power
  – increase access time of cache
• Right trade-off point
  – 91% leakage reduction
  – 6% cycle time increase
• Projections for 0.07µm process
[Figure: performance (85% to 100%) vs. leakage reduction (76% to 94%), with points labeled 0.2V, 0.25V, 0.3V and 0.35V]
Drowsy cache line architecture
[Figure: circuit diagram of a drowsy cache line. Labeled parts: drowsy bit with drowsy (set) and wake up (reset) inputs; voltage controller switching the power line between VDD (1V) and VDDLow (0.3V); SRAMs; row decoder; word line driver; word line gate controlled by the drowsy signal; word line]
Energy reduction
[Figure: stacked energy bars (0% to 100%) for Regular Cache and Drowsy Cache, with segments labeled Leakage, Drowsy, High leakage, and Dynamic]
• Projections for 0.07µm process
• High leakage: lines have to be powered up when accessed.
• Drowsy circuit
  – Without high-Vt device (in SRAM): 6x leakage reduction, no access delay.
  – With high-Vt device: 10x leakage reduction, 6% access time increase.
1 cycle vs. 2 cycle wake up
[Figure: drowsy fraction (70% to 100%) vs. run-time increase (0.00% to 2.20%), 1 cycle vs. 2 cycle wakeup; simple policy, awake tags, 4000 cycle window; benchmarks: ammp00, applu00, apsi00, art00, bzip200, crafty00, eon00, equake00, facerec00, fma3d00, galgel00, gap00, gcc00, gzip00, lucas00, mcf00, mesa00, mgrid00, parser00, sixtrack00, swim00, twolf00, vortex00, vpr00, wupwise00]
• Fast wakeup is important – but easy to accomplish!
  – Cache access time: 0.57ns (for 0.07µm from CACTI using 0.18µm baseline).
  – Speed dependent on voltage controller size: 64 x Leff – 0.28ns (half cycle at 4 GHz), 32 x Leff – 0.42ns, 16 x Leff – 0.77ns.
• Impact of drowsy tags is quite similar to double-cycle wake up.
Policy comparison
[Figure: drowsy fraction (70% to 100%) vs. run-time increase (0.00% to 1.40%), noaccess vs. simple policy. Legend: simple 2000, simple 4000, noaccess 4000. Caption: 1 cycle wakeup, awake tags, simple policy: 2000 and 4000 cycle window, noaccess policy: 2000 cycle window. Benchmarks: ammp00, applu00, apsi00, art00, bzip200, crafty00, eon00, equake00, facerec00, fma3d00, galgel00, gap00, gcc00, gzip00, lucas00, mcf00, mesa00, mgrid00, parser00, sixtrack00, swim00, twolf00, vortex00, vpr00, wupwise00. Labeled points: lucas, gcc, twolf, gzip, parser, facerec, gap, vortex, sixtrack, eon, applu, crafty, art, mgrid, galgel]
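The difference between the two policies can be sketched by counting wake-ups on a toy trace. The slides do not define the noaccess policy; the sketch assumes it means that only lines not accessed during the last window are put into drowsy mode, which requires one access bit per line. The function name and trace format are hypothetical.

```python
def count_wakeups(trace, num_lines, window, policy="simple"):
    """Count line wake-ups for the 'simple' or 'noaccess' policy.

    simple:   every line goes drowsy at each window boundary.
    noaccess: only lines not accessed in the last window go drowsy
              (assumed definition, not spelled out in the slides).
    """
    drowsy = [False] * num_lines
    accessed = [False] * num_lines
    wakeups = 0
    next_sweep = window
    for cycle, line in sorted(trace):
        while next_sweep <= cycle:
            for i in range(num_lines):
                if policy == "simple" or not accessed[i]:
                    drowsy[i] = True
                accessed[i] = False
            next_sweep += window
        if drowsy[line]:
            wakeups += 1
            drowsy[line] = False
        accessed[line] = True
    return wakeups

trace = [(10, 0), (90, 0), (110, 0), (120, 1), (210, 0), (310, 1)]
print(count_wakeups(trace, num_lines=2, window=100, policy="simple"))
print(count_wakeups(trace, num_lines=2, window=100, policy="noaccess"))
```

In this toy trace the noaccess variant avoids wake-ups on the frequently used line, at the cost of keeping that line awake and of the extra per-line state.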
Energy reduction
              Normalized Total Energy     Normalized Leakage Energy    Run-time increase
              DVS     Theoretical min.    DVS     Theoretical min.
Awake tags    0.46    0.35                0.29    0.15                 0.41%
Drowsy tags   0.42    0.31                0.24    0.09                 0.84%
• > 50% total energy reduction
• > 70% leakage energy reduction
• Theoretical minimum assumes zero leakage in drowsy mode
• Total energy reduction within 0.1 of theoretical minimum
  – Diminishing returns for better leakage reduction techniques
• Above figures assume 6x leakage reduction, 10x possible with small additional run-time impact
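The leakage columns can be related to each other with a simple model, offered here as a sketch rather than the authors' method: awake lines leak at the full rate and drowsy lines leak at 1/r of it, where r is the leakage reduction factor. Under that assumption the theoretical minimum (zero drowsy leakage) equals the awake fraction, so drowsy fractions of 0.85 and 0.91 are inferred from the table and are not stated in the slides. Transition energy and the extra run time are ignored.

```python
def normalized_leakage_energy(drowsy_fraction, reduction_factor):
    """Leakage energy relative to a cache with all lines awake.

    Assumed model: awake lines leak at rate 1, drowsy lines at
    1 / reduction_factor.
    """
    awake_fraction = 1.0 - drowsy_fraction
    return awake_fraction + drowsy_fraction / reduction_factor

# Drowsy fractions inferred from the theoretical-minimum column (assumption).
for name, fraction in [("Awake tags", 0.85), ("Drowsy tags", 0.91)]:
    six = normalized_leakage_energy(fraction, 6)
    ten = normalized_leakage_energy(fraction, 10)
    print(f"{name}: 6x -> {six:.2f}, 10x -> {ten:.2f}")
```

With a 6x reduction this model gives values that round to the 0.29 and 0.24 entries in the DVS column, and moving to 10x changes them only modestly, which is consistent with the diminishing-returns bullet.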
Conclusions
• Simple circuit technique
  – Need high-Vt transistors, low Vdd supply
• Simple architecture
  – No need to keep counter/predictor state for each line
  – Periodic global counter asserts drowsy signal
  – Window size (for periodic drowsy transition) depends on core: ~4000 cycles has good E-delay trade-off
• Technique also works well on in-order processors
  – Memory subsystem is already latency tolerant
• Drowsy circuit is good enough
  – Diminishing returns on further leakage reduction
  – Focus is again on dynamic energy