
Comparative Analysis of Conventional and Statistical Design Techniques

Steven M. Burns, Mahesh Ketkar, Noel Menezes
Strategic CAD Lab, Intel Corporation, Hillsboro, OR, USA

Keith A. Bowman, James W. Tschanz, Vivek De
Circuit Research Lab, Intel Corporation, Hillsboro, OR, USA

Abstract— We explore the power benefits of changing a microprocessor path histogram through circuit sizing based on statistical timing analysis and optimization (STAO) versus a deterministic timing approach that uses statistical design to establish a global guardband followed by conventional optimization (SDGG). Using an analytical modeling approach, we quantify the differences in total power between the two approaches while maintaining an equivalent performance distribution. For a relative 1σ random WID stage delay variation of 5% and representative microprocessor critical paths, the analysis indicates that the STAO approach enables ∼2% power reduction over the SDGG approach. To achieve a 4% and 6% power reduction through the STAO approach, the process variation needs to increase by a factor of 2x and 4x, respectively.

Categories and Subject Descriptors
B.8.2 [Hardware]: Performance and Reliability - Performance Analysis and Design Aids

General Terms
Algorithms, performance, design, reliability

Keywords
Leakage, statistical optimization, timing guardbands

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2007, June 4–8, 2007, San Diego, California, USA. Copyright 2007 ACM 978-1-59593-627-1/07/0006 ...$5.00.

I. INTRODUCTION

A. Process Variation Characteristics

Variations in transistor and interconnect characteristics are becoming more significant as process technology is scaled to 65nm and beyond. Adverse impacts of these variations on the maximum clock frequency (Fmax) and chip power are also becoming more pronounced. The variations can be classified into two categories: Die-to-Die (D2D) and Within-Die (WID). D2D variations, resulting from lot-to-lot, wafer-to-wafer and within-wafer variations, affect all transistors and interconnects on a die equally. On the other hand, WID variations produce different electrical characteristics across a die [7]. A portion of WID variations is uncorrelated (independent) across transistor and interconnect elements. This is referred to as “local uncorrelated random” WID variation. The remaining WID variation component is spatially correlated across devices, with the degree of correlation becoming smaller as separation between devices increases. Although these variations exhibit a correlated behavior, the profile of these variations can change randomly from die to die [9]. This is referred to as “systematic” or “regionally correlated” or “smooth and continuous” random WID variation [12].

Historically, chip Fmax and power distributions have been dictated primarily by D2D variations. Fmax binning techniques have been used effectively for high performance microprocessors to partially mitigate D2D variation impacts. Recently, post-silicon, chip-level variation compensation techniques such as adaptive supply voltage (ASV) and adaptive body bias (ABB) [16], [17] have been proposed to further alleviate impacts of D2D variation. At the same time, with process technologies scaling to 65nm and beyond, WID variations are expected to worsen [3], [9]. Therefore, impacts of WID variations on both Fmax and power distributions need to be considered carefully.

B. Conventional Variation-Aware Timing Analysis and Design Optimization

Conventional variation-aware design approaches use a “worst-case (WC) process skew corner” for timing analysis. Design optimizations such as transistor width sizing and multi-performance device assignment are performed at this skew corner to minimize power for a target clock cycle time (tcycle = 1/Fmax). In the simplest deterministic or Conventional Worst-Case (CWC) approach, the WC corner is generated by varying the physical and electrical parameters of transistor and interconnect elements simultaneously in a direction that degrades performance [11]. The timing yield target dictates kσ (the number of σ’s used, usually 3, corresponding to a 99.87% yield) for all parameters. In the statistical skew corner generation or Statistical Worst-Case (SWC) approach [11], realistic correlations among the various parameters are captured accurately. Statistical Monte-Carlo (MC) simulations or analytical methods are then used to create delay distributions of representative circuit elements. The WC corner is selected from this distribution to achieve the desired timing yield target. SWC produces a less pessimistic and more realistic skew corner than CWC since variations in many of the parameters are uncorrelated or weakly correlated. Neither CWC nor SWC, however, accounts for the distinctions between D2D and WID components of parameter variations, in terms of variation statistics or Fmax impacts.

C. SSTA and Design Optimization

Impacts of D2D parameter variations are captured accurately by SWC. Statistical Static Timing Analysis (SSTA) accounts for timing impacts of WID variations more accurately. SSTA is combined with statistical design optimization for a full-blown Statistical Timing Analysis and Optimization (STAO), which optimizes the path delay histogram to obtain power or yield benefits. The WID variation effects comprehended by statistical techniques are:
1) averaging of gate delay fluctuations due to uncorrelated random WID variations along a path;
2) for paths traversing multiple correlated regions, averaging of path segment delay fluctuations due to systematic WID variations from region to region;
3) for multiple critical paths within a correlated region, the slowest path delays due to random WID variations limiting tcycle;
4) among groups of critical paths in multiple correlated regions, the slowest path delays due to systematic WID variations from region to region limiting tcycle;
5) differences in delay variations of different paths depending on the number and types of stages, types of transistors and interconnects, transistor widths and lengths; and
6) differences in power/area vs. delay sensitivities of different paths due to differences in the number of gates, types of gates, transistor types, switching activity factor, etc.

Effects (1)-(5) are fully comprehended by an SSTA tool. STAO fully utilizes all of the effects (1)-(6) to minimize power while meeting timing yield requirements.

In a simpler approach, an appropriate global delay guardband for the target tcycle can be used to capture WID variation effects (1)-(4). A conventional transistor sizing and device selection optimizer can then be used to minimize power for this guardbanded tcycle target at a statistically generated WC corner that accounts for D2D variations. This approach is termed Statistical Design with Global Guardband (SDGG). The appropriate guardband can be found by sweeping its value, and repeating conventional design optimization then SSTA at each point, until timing yield is met.

To further simplify SDGG, the desired guardband can be estimated using general critical path delay statistics. For example, the mean (µ) tcycle degradation from WID variations can be calculated using a tcycle distribution model that is based on general critical path delay statistics [4], [5], and used as the guardband. Alternately, general critical path delay impacts due to WID variations can be derived statistically and included appropriately in the WC skew corner for timing analysis, along with D2D variations [10]. These simpler versions of SDGG capture WID variation effects (1)-(4) and allow the use of a deterministic timing tool.

D. Comparisons of different approaches

Presumably, compared to optimization at CWC, STAO provides a 15% performance benefit [2], [13], or a 20%-30% power reduction at a target timing yield [8], [15], [6]. In this paper, we compare the power and timing yield of CWC, SWC, SDGG and STAO design approaches. Comparisons are performed using a newly developed analytical modeling framework which comprehends different path types of varying delay statistics and power sensitivities. The framework includes a new analytical model for tcycle µ degradation due to WID variations. The model is validated using MC simulations. The model allows us to examine numerous combinations of target path delays for the different path types, subject to a fixed degradation in tcycle µ, to identify the combination which minimizes power for STAO. In contrast, all path types are constrained to have equal delay targets for SDGG. Simplifying assumptions are made in an attempt to create the best-case scenario for STAO and determine an upper bound for its potential advantages. Only transistor sizing optimizations are considered. Impacts of both power and delay variations are comprehended. STAO benefit trends are then examined for a wide range of design and process technology characteristics.

Figure 1: Path σ/µ vs. the number of path stages within a correlated region. (Plot: σ/µ in % versus the number m of path stages, with curves for systematic and random WID, systematic WID, and random WID.)
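The averaging in effect (1) and the trend in Figure 1 can be illustrated with a minimal Python sketch. It assumes that the random per-stage component is independent from stage to stage, that the systematic component moves all stages together, and that the two add in quadrature; the function name and the 5% per-stage values are hypothetical and are not taken from the paper.

```python
import math

def path_sigma_over_mu(m, sigma_random, sigma_systematic=0.0):
    """Relative delay sigma of an m-stage path inside one correlated region.

    Assumption: the random per-stage component is independent from stage to
    stage (so it averages as 1/sqrt(m)), the systematic component shifts all
    stages together (so it does not average), and the two are independent.
    """
    if m < 1:
        raise ValueError("a path needs at least one stage")
    return math.hypot(sigma_random / math.sqrt(m), sigma_systematic)

# Hypothetical per-stage values, chosen only for illustration.
for m in (1, 6, 11, 16, 21):
    both = path_sigma_over_mu(m, 0.05, 0.05)
    rand = path_sigma_over_mu(m, 0.05)
    print(f"m={m:2d}  random={rand:.2%}  systematic=5.00%  both={both:.2%}")
```

Under these assumptions the random contribution falls as the path gets longer while the systematic contribution stays flat, which is the qualitative behavior the figure caption describes.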

II. VARIATION IMPACTS

A. Impacts on tcycle distribution

Path delay σ/µ due to uncorrelated random WID variations is proportional to 1/√m, where m is the number of gates in the path (Figure 1). This is due to averaging of delay fluctuations along the path. For paths within a correlated region, delay σ/µ due to systematic WID variations is independent of m since delays of all gates change by the same percentage. However, for paths traversing mr correlated regions, path segment delay fluctuations due to systematic WID variations average along the path across regions. Thus, overall path delay σ/µ goes down as mr increases. Since SWC uses a skew corner to account for WID parameter variation impacts on path delays, it ignores both of these averaging effects. Thus, it overestimates this aspect of the WID variation impacts on the tcycle distribution.

Figure 2 shows the individual and combined impacts of D2D and WID variations on tcycle distributions. As the number of statistically independent critical paths (Ncp) increases, the tcycle µ degrades and the contribution of WID variations to tcycle σ reduces. This is because the slowest critical path delay limits tcycle. In contrast, D2D variations impact delays of all paths on the chip equally, and are the primary contributors to tcycle σ for large Ncp values. Therefore, the SWC approach accurately accounts for D2D variation impacts. However, SWC uses the WC skew corner to account for WID parameter variation impacts as well. As a result, it ignores tcycle µ degradation and its dependence on Ncp. Also, it overestimates impacts on tcycle σ when Ncp > 1.

Since SWC ignores both “path delay averaging” and “slowest path delay” impacts of WID variations, the tcycle distribution presumed by SWC is unrealistic (Figure 3). This can result in either timing yield loss (when the real tcycle distribution is worse than presumed), or in overdesign, with unnecessary area and power overheads (when the real distribution produces a better timing yield).

Figure 2: Individual contributions of D2D and WID variations to the maximum critical path delay distribution. (The D2D distribution and the WID Ncp = 1 distribution are identical since both have a σ/µ of 5%.) (Plot: probability density versus normalized maximum critical path delay, with curves for D2D, WID with Ncp = 1 and Ncp = 10^4, and D2D + WID with Ncp = 1 and Ncp = 10^4.)

Figure 3: The statistical worst-case corner (SWC) can have either higher or lower yield than the real distribution depending on the number of critical paths (Ncp) and on the number of stages in a path (m). (Plot: probability density versus standard deviations of the SWC distribution, with curves for SWC, Real: Overdesign, and Real: Underdesign.)
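The "slowest path" effect on the tcycle distribution can be sketched numerically. The following Python fragment is a simplified illustration, not the model of [4], [5]: it assumes Ncp independent, identically distributed normal path delays with mean 1, only random WID variation, and a path sigma of sigma_stage/√m. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance

def max_delay_quantile(q, n_cp, sigma_stage, m=1):
    """q-quantile of the normalized max delay over n_cp independent paths.

    Assumption: every path delay is normal with mean 1 and relative sigma
    sigma_stage / sqrt(m); only random WID variation is modeled.
    """
    sigma_path = sigma_stage / math.sqrt(m)
    # Prob[max <= x] = Phi(z)**n_cp = q  =>  z = Phi^-1(q**(1/n_cp))
    z = STD_NORMAL.inv_cdf(q ** (1.0 / n_cp))
    return 1.0 + sigma_path * z

def median_max_delay(n_cp, sigma_stage, m=1):
    return max_delay_quantile(0.5, n_cp, sigma_stage, m)

def spread(n_cp, sigma_stage, m=1):
    """Half-width between the 15.87% and 84.13% quantiles (1 sigma if normal)."""
    hi = max_delay_quantile(STD_NORMAL.cdf(1.0), n_cp, sigma_stage, m)
    lo = max_delay_quantile(STD_NORMAL.cdf(-1.0), n_cp, sigma_stage, m)
    return 0.5 * (hi - lo)

for n_cp in (1, 100, 10_000):
    print(n_cp, round(median_max_delay(n_cp, 0.05), 4), round(spread(n_cp, 0.05), 4))
```

In this sketch the median of the maximum delay moves up as Ncp grows while the spread of the maximum shrinks, which matches the qualitative trend described for Figure 2.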

Figure 4: Individual contributions of D2D and WID variations to the leakage distribution. (Plot: CDF versus normalized leakage, with curves for nominal, D2D, WID, and D2D and WID.)

B. Impacts on power distribution

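Figure 5: Cumulative BLK leakage distributions based on statistical MC simulations and a simple model of 1 BLK that shifts the distribution via $F_{BLK-N}(P) = F_{BLK-1}\left(P \, \frac{P_{leak,nom,BLK-1}}{P_{leak,nom,BLK-N}}\right)$. (Plot: CDF versus BLK leakage power, with curves for MC FBLK−1, MC FBLK−2, MC FBLK−8, Model FBLK−2, and Model FBLK−8.)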
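The distribution shift in the Figure 5 caption can be sketched in a few lines of Python. This is only an illustration of the scaling of the leakage variable by the ratio of nominal leakages: the log-normal samples stand in for block-level MC results, and the helper names and numeric values are hypothetical.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return F(p) = fraction of samples <= p."""
    ordered = sorted(samples)
    return lambda p: bisect.bisect_right(ordered, p) / len(ordered)

def shifted_cdf(block_cdf, p_nom_block, p_nom_target):
    """F_target(P) = F_block(P * p_nom_block / p_nom_target), as in the Figure 5 model."""
    scale = p_nom_block / p_nom_target
    return lambda p: block_cdf(p * scale)

# Hypothetical stand-in for block-level MC leakage results (log-normal draws).
rng = random.Random(0)
block_samples = [rng.lognormvariate(0.0, 0.3) for _ in range(5000)]
f_blk1 = empirical_cdf(block_samples)

# A die made of 8 such blocks has 8x the nominal leakage of one block.
f_blk8 = shifted_cdf(f_blk1, p_nom_block=1.0, p_nom_target=8.0)
print(f_blk1(1.0), f_blk8(8.0))  # the two values coincide by construction
```

The only inputs beyond the single-block distribution are the two nominal leakage values, which is why the method avoids a full-chip statistical simulation.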

The distribution of the switching component of maximum active or thermal design power (TDP) is dictated (for a fixed product frequency) by the distribution of total switched capacitance. Random and systematic WID capacitance variations average linearly across the millions of switching nodes on die and across the number of correlated regions on die, respectively, and thus have negligible impact on switching power. The µ and σ for switching power are therefore dictated primarily by D2D variations, which impact switched capacitances of all elements on a die equally.

Since leakage has become a large fraction of total TDP, variations in leakage power (Pleak) can dominate TDP variability [3], [8], [15], [6], [14]. Contributions of D2D and WID variations to Pleak distributions are examined using statistical MC simulations (Figure 4). Channel length (Leff) fluctuations and the corresponding exponential changes in subthreshold leakage current (Ioff) due to short-channel effects are the primary contributors to Pleak variations. In logic designs, average transistor widths are large enough so that contributions from threshold voltage (Vth) fluctuations due to random dopant fluctuations (RDF) can be neglected. For Gaussian Leff distributions, Ioff distributions of transistors are log-normal. The impact of random WID variations on the Pleak distribution is determined by convolving the log-normal Ioff distributions of the millions of individual transistors on die. The impact of systematic WID variations on the Pleak distribution is determined by convolving the leakage distributions of the correlated regions on die. Since the µ of log-normal distributions is larger than the median, WID variations increase the Pleak median from the nominal value.
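The log-normal argument can be checked with standard closed-form results. The sketch below assumes that ln(Ioff) of a single transistor is Gaussian with standard deviation s and that transistors are independent and identical; the value s = 0.5 and the function names are hypothetical.

```python
import math

def mean_over_median(s):
    """Mean/median ratio of a log-normal whose log has standard deviation s."""
    return math.exp(0.5 * s * s)

def relative_sigma_of_sum(s, n):
    """sigma/mean of the sum of n independent, identical log-normal leakages."""
    return math.sqrt(math.expm1(s * s) / n)

s = 0.5  # hypothetical spread of ln(Ioff) for a single transistor
for n in (1, 10**3, 10**6):
    print(n, round(mean_over_median(s), 4), f"{relative_sigma_of_sum(s, n):.2e}")
```

Because the sum over many transistors concentrates around its mean, and that mean exceeds the nominal (median) per-transistor value by the factor exp(s²/2), the total leakage shifts upward while its relative spread shrinks as 1/√n.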
At the same time, since leakage variability is averaged over a large number of transistors for random WID variations and averaged across the number of correlated regions for systematic WID variations, contributions of WID variations to Pleak σ are minimal. On the other hand, D2D variations impact Ioff of all transistors on a die equally, and are the primary contributors to Pleak σ as illustrated in Figure 4. Thus, WID variations primarily impact the Pleak median, and D2D variations primarily impact the Pleak variance.

Skew corner based design approaches can comprehend power impacts due only to D2D variations by considering WC corners for switched capacitance and Ioff. This is useful in design optimizations where power constraints are applied on top of timing yield requirements. Strong correlations between Pleak and Fmax need to be considered in setting realistic power and timing constraints, and for determining the joint Pleak, Fmax distributions [1], [3].

Full-chip statistical leakage power analysis is needed to accurately determine the increase in the Pleak median due to WID variations. A more efficient method of estimating the shift in the Pleak median is discussed next. In this method, MC leakage simulations are performed on a large circuit block, instead of the entire chip, to get the Pleak distribution of a single block (FBLK−1). The Pleak distribution of the die, containing many such blocks, can then be obtained by simply shifting the Pleak variable by the ratio of the nominal die leakage to the nominal leakage of a single block (Figure 5). Since nominal leakages of circuit blocks can be computed easily, this method is more efficient than full-chip statistical leakage simulation. Good agreement of the Pleak distributions obtained using this simple method with MC leakage simulation results demonstrates its validity.

C. Simplifying assumptions

The following simplifying assumptions are made to create the best-case scenario for STAO and allow comparisons with conventional approaches using an analytical modeling framework.
1) All of the WID variation is uncorrelated random (the systematic component is zero). Since systematic WID variations affect transistors and interconnects equally for paths contained within a correlated region and random WID variations average across the number of stages in a path as discussed in Section II-A, the impact of both systematic and random WID variations on critical path delay is similar from path to path within the correlated region. In relative terms, systematic WID variation significantly diminishes the fifth effect in Section I-C. Therefore, under this assumption, differences between results of statistical and conventional design optimizations would be the highest, thus creating the best-case scenario to evaluate benefits of STAO.
2) tcycle µ degradation due to WID variations is the key performance metric. This assumption is reasonable since (i) tcycle µ degradation is dictated mainly by WID variations; and (ii) statistical techniques differ from conventional approaches only in how they comprehend and utilize the WID variation effects.
3) Nominal leakage and switching power comparisons are sufficient. As shown in Section II-B, switching capacitance µ is not impacted by WID variations. Furthermore, the Pleak median impact due to WID variations scales proportionally to the nominal value. Since statistical techniques differ from conventional approaches only in how they comprehend the power impacts of WID variations, this assumption is valid.

III. MODEL DERIVATION

A. Power-Delay Relationship

We develop this model from a prototype circuit (a chain of identically sized constant-fanout inverters) and then discuss how this applies to more general circuit structures. We model delay using d = d0 + d1 CL/z, where d0 is the intrinsic delay of the inverter, d1 is the delay coefficient of normalized load, CL is a normalized capacitive load of both the interconnect and fanout gates (in units of transistor size), and z is proportional to the inverter transistor sizes. Representing both capacitances and drive strengths (conductances) in units of transistor width, the stage delay for the circuit is:

$$d_{stage} = d_0 + d_1 (kz + \ell)/z = (d_0 + d_1 k) + d_1 \ell / z = d'_0 + d_1 \ell / z, \qquad (1)$$

where ℓ is the interconnect capacitance per stage and k is the fanout per stage. The path delay of m inverters is m·dstage. To achieve a particular path delay of dnom, we can invert a combination of the previous equations to get: z = m d1 ℓ/(dnom − m d′0). Since each path is m stages long, the total transistor width for the path is proportional to mz. Thus, per-path normalized power and per-path delay are related by the function:

$$P_{nom} = \frac{m^2 d_1 \ell}{d_{nom} - m d'_0}, \qquad (2)$$

and this relation is shown in Figure 6 for two different stage counts: mhard = 8 and measy = 6. The delay target is chosen so that the power-delay point for the hard paths corresponds to a 3% increase in power for a 1% decrease in delay. This metric is called hardware intensity (η) [18]. Using our power-delay relationship (2), we can relate η to the parameters m, dnom, and d′0 as follows:

$$\eta = -\frac{\partial P_{nom} / P_{nom}}{\partial d_{nom} / d_{nom}} = \frac{d_{nom}}{d_{nom} - m d'_0}. \qquad (3)$$

A dnom of 280ps together with a d′0 = 23.3ps leads to ηhard ≈ 3 and ηeasy ≈ 2. Our analytical model of power sensitivity to delay facilitates simple description and analysis. However, more complex modeling of power sensitivity to delay can be performed with qualitatively similar results to those shown in Figure 6.

To ease the following graphical analysis, we restrict the number of different path types to two. We do this in a way that bounds the best-case benefits for STAO over SDGG. We empirically validated this claim by dividing 20000 paths into three path types with path stage counts m1 < m2 < m3, where the benefits did not reduce after moving all the paths with stage count m2 into either the m1 grouping or the m3 grouping. By picking the smallest and largest values of m among all critical paths, we can therefore bound this benefit. For the case of two path types, the total nominal power is the per-path power (Pnom,hard, Pnom,easy) times the number of paths (nhard, neasy) of each type:

$$P_{nom,total} = n_{hard} P_{nom,hard} + n_{easy} P_{nom,easy}. \qquad (4)$$

Figure 6: Power sensitivity to delay of hard (mhard = 8) and easy (measy = 6) paths. (Plot: normalized power versus nominal path delay in ps, with curves for hard and easy paths.)

Figure 7: Constant median max delay locus superimposed on top of power level curves. The square (circular) marker corresponds to the SDGG (STAO) point. (Plot: nominal target delay in ps for easy paths, dnom,easy, versus nominal target delay in ps for hard paths, dnom,hard.)
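The power-delay relationship (2) and the hardware intensity (3) can be checked with a short Python sketch. The code below normalizes d1 and ℓ to 1 purely for illustration, and the function names are hypothetical. It uses the values quoted in the text (dnom = 280ps, d′0 = 23.3ps, m = 8 and m = 6) and compares the closed form against a numerical derivative.

```python
def stage_size(m, d_nom, d0p, d1=1.0, ell=1.0):
    """Size z that gives an m-stage path the delay d_nom, from (1)."""
    slack = d_nom - m * d0p
    if slack <= 0:
        raise ValueError("d_nom must exceed the intrinsic path delay m*d0'")
    return m * d1 * ell / slack

def path_power(m, d_nom, d0p, d1=1.0, ell=1.0):
    """Normalized per-path power (2): total width m*z."""
    return m * stage_size(m, d_nom, d0p, d1, ell)

def hardware_intensity(m, d_nom, d0p):
    """Closed form (3): % power increase per % delay decrease."""
    return d_nom / (d_nom - m * d0p)

def hardware_intensity_numeric(m, d_nom, d0p, h=1e-6):
    """Central-difference estimate of -(dP/P)/(dd/d), as a cross-check."""
    up = path_power(m, d_nom * (1 + h), d0p)
    down = path_power(m, d_nom * (1 - h), d0p)
    return -(up - down) / (2 * h * path_power(m, d_nom, d0p))

for name, m in (("hard", 8), ("easy", 6)):
    print(name, round(hardware_intensity(m, 280.0, 23.3), 3))
```

Longer paths spend a larger share of the target delay on intrinsic delay, so the same target leaves them less slack and a steeper power-delay trade-off.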

B. Maximum Delay Distribution for Two Path Types

We now derive the maximum delay distribution for two collections of paths: nhard paths with mhard stages and neasy paths with measy stages. To compute the parameters describing the distribution of max delay, we find the median using the product of the CDFs (because the path delays are independent) corresponding to the max delay of the hard and easy paths:

$$\mathrm{Prob}[d \le d_{median}] = CDF_{all\ paths}(d_{median}) = \left(CDF_{path,hard}(d_{median})\right)^{n_{hard}} \left(CDF_{path,easy}(d_{median})\right)^{n_{easy}} = \Phi\left(\frac{d_{median} - d_{nom,hard}}{\sigma_{path,hard}}\right)^{n_{hard}} \Phi\left(\frac{d_{median} - d_{nom,easy}}{\sigma_{path,easy}}\right)^{n_{easy}} = \frac{1}{2}. \qquad (5)$$

Here Φ(x) represents the CDF of the standard normal distribution (zero mean, unit variance). The value of dmedian that makes the equality hold is the median max delay for the collection of both hard and easy paths. In our case, the median of this distribution closely approximates the mean [5]. For a fixed resulting median max delay (dmedian), this equation represents an implicit function connecting dnom,hard and dnom,easy. Thus, the model (5) allows us to examine numerous combinations of dnom,hard and dnom,easy, subject to a fixed degradation in tcycle µ.

We plot this locus of hard and easy target delays described implicitly by (5) in Figure 7 using dmedian = 300ps, WID random variations σstage,hard = σstage,easy = 5%, nhard = 1000, and neasy = 10000. In Figure 8, we show the nominal power for all designs along this locus. There is a clear optimal point (marked with a circle) and this represents the best possible STAO design point. For this case, there is less than 1% improvement over an equal-target delay or SDGG point (marked with a square).

By superimposing level curves corresponding to nominal power in Figure 7, we can dispense with the need for cross-sectional power plots and use these contour lines (separated by 2% intervals) to determine the same power benefit information. Notice that there is less than half a level curve spacing between the STAO and SDGG points, meaning there is less than 1% power benefit. The STAO point itself can be determined as the point where the median max delay locus is tangent to the power level curves. The effect of changing parameters such as nhard and neasy (with a fixed ratio) and the variation σ values can easily be visualized on Figure 7 as changes to the position of the median max delay locus, as these parameters do not affect the power contours. Changes to the ratio of nhard to neasy affect the slope of the power contour lines, and an increase in hardware intensity increases the density of these contour lines.

Figure 8: Power increase from the minimum power design (STAO) for the design choices with constant median max delay as described in Figure 7. (Plot: power increase in % versus dnom,easy − dnom,hard.)

Figure 9: Cumulative delay distributions for STAO and SDGG design points based on the analytical model. Notice the STAO and SDGG delay distributions are essentially identical. (Plot: number of samples versus max delay in ps, with curves for STAO rand+d2d, SDGG rand+d2d, STAO rand, and SDGG rand.)

Figure 10: Cumulative power distributions for STAO and SDGG design points based on the analytical model. Notice the STAO and SDGG power distributions differ only by the median value, which is in close agreement with model predictions. (Plot: number of samples versus normalized power, with curves for STAO rand+d2d, SDGG rand+d2d, STAO rand, and SDGG rand.)

Figure 11: Power increase from using SWC as a function of Ncp for different m values. The bold line represents entry into the yield loss region. (Plot: % power increase versus Ncp, with curves for m = 1, 4, 9, 16, and 25.)
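The comparison along the constant-median locus of model (5) can be sketched in Python as follows. This is a simplified reading of the model, not the authors' implementation: it assumes σpath = σstage·dnom/√m (independent stages whose sigma scales with the nominal delay), normalizes d1·ℓ to 1 in (2), and finds the minimum-power point by a plain grid search. All function names are hypothetical, and the numbers it prints depend on these assumptions.

```python
import math
from statistics import NormalDist

PHI = NormalDist()            # standard normal CDF and inverse CDF
M_HARD, M_EASY = 8, 6         # stage counts used in the text
D0P = 23.3                    # d0' in ps, from the text
LOG_HALF = math.log(0.5)

def sigma_path(m, d_nom, s_stage):
    # ASSUMPTION: independent stages whose sigma scales with the nominal delay.
    return s_stage * d_nom / math.sqrt(m)

def median_max_delay(d_hard, d_easy, n_hard, n_easy, s_hard, s_easy):
    """Solve (5) for d_median by bisection on the log of the CDF product."""
    def log_cdf(d):
        z_h = (d - d_hard) / sigma_path(M_HARD, d_hard, s_hard)
        z_e = (d - d_easy) / sigma_path(M_EASY, d_easy, s_easy)
        return n_hard * math.log(PHI.cdf(z_h)) + n_easy * math.log(PHI.cdf(z_e))
    lo = max(d_hard, d_easy)  # product is at most 1/2 here
    hi = 2.0 * lo             # deep in the upper tail, product is ~1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_cdf(mid) < LOG_HALF:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def easy_target_on_locus(d_hard, d_median, n_hard, n_easy, s_hard, s_easy):
    """Easy-path target that keeps the median max delay at d_median, or None."""
    if d_hard >= d_median:
        return None
    z_h = (d_median - d_hard) / sigma_path(M_HARD, d_hard, s_hard)
    log_p = (LOG_HALF - n_hard * math.log(PHI.cdf(z_h))) / n_easy
    p = math.exp(log_p)
    if log_p >= 0.0 or p >= 1.0:   # hard paths alone use up the whole budget
        return None
    z_e = PHI.inv_cdf(p)
    return d_median / (1.0 + z_e * s_easy / math.sqrt(M_EASY))

def total_power(d_hard, d_easy, n_hard, n_easy):
    """Total nominal power (4) using (2) with d1 * l normalized to 1."""
    p_hard = M_HARD ** 2 / (d_hard - M_HARD * D0P)
    p_easy = M_EASY ** 2 / (d_easy - M_EASY * D0P)
    return n_hard * p_hard + n_easy * p_easy

def sdgg_target(d_median, n_hard, n_easy, s_hard, s_easy):
    """Common target delay for both path types (equal-target design)."""
    lo, hi = 0.5 * d_median, d_median
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if median_max_delay(mid, mid, n_hard, n_easy, s_hard, s_easy) < d_median:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def stao_point(d_median, n_hard, n_easy, s_hard, s_easy, step=0.01):
    """Grid search along the locus for the minimum-power (power, d_hard, d_easy)."""
    best = None
    d_hard = M_HARD * D0P + step
    while d_hard < d_median:
        d_easy = easy_target_on_locus(d_hard, d_median, n_hard, n_easy, s_hard, s_easy)
        if d_easy is not None and d_easy > M_EASY * D0P:
            power = total_power(d_hard, d_easy, n_hard, n_easy)
            if best is None or power < best[0]:
                best = (power, d_hard, d_easy)
        d_hard += step
    return best

params = (1000, 10000, 0.05, 0.05)   # n_hard, n_easy, sigma_stage hard/easy
d_sdgg = sdgg_target(300.0, *params)
p_sdgg = total_power(d_sdgg, d_sdgg, 1000, 10000)
p_stao, d_h, d_e = stao_point(300.0, *params)
print(f"SDGG target {d_sdgg:.1f} ps; STAO targets {d_h:.1f}/{d_e:.1f} ps; "
      f"power ratio SDGG/STAO = {p_sdgg / p_stao:.4f}")
```

Because the equal-target point lies on the same locus that the grid search scans, the minimum-power point found this way cannot be meaningfully worse than the SDGG point; the size of the gap is what the sketch is meant to expose under the stated assumptions.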

C. Monte-Carlo Validation

We have validated our power improvement results and the constancy of the delay distributions by MC simulations of STAO and SDGG design points based on our model. For this comparison we use: dmedian = 300ps, nhard = 10000, neasy = 10000, σstage,hard = 3%, σstage,easy = 5%, and σD2D = 5%, where the nominal power increase from STAO to SDGG is 7.2% (exaggerated to make the power shift visible). The delay distributions are plotted in Figure 9. Notice the delay distributions are essentially identical and validate the max delay model (5) and the model delay assumption in Section II-C. The power distributions are plotted in Figure 10. The STAO and SDGG power distributions differ only in the median power values; the shapes of the distributions are nearly identical. The median power shifted from an STAO value of 6.74 to an SDGG value of 7.24 in normalized power units, resulting in a 7.4% power increase, which is in close agreement with our model and validates the model assumptions in Section II-C.
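A Monte-Carlo check of the max delay median can be sketched as below. This is not the authors' simulation setup: it assumes normal path delays, models D2D variation as one common multiplicative factor per die, and uses equal 280ps targets for both path types as a hypothetical design point, since the actual STAO and SDGG targets are not listed in the text.

```python
import math
import random
from statistics import NormalDist, median

PHI = NormalDist()
P_MIN, P_MAX = 2.0 ** -53, 1.0 - 2.0 ** -53   # keep inv_cdf inside (0, 1)

def sample_group_max(rng, n_paths, d_nom, sigma):
    """One draw of the max of n_paths i.i.d. N(d_nom, sigma) path delays.

    Uses the fact that the max of n uniforms is distributed as U**(1/n),
    so a single inverse-CDF call replaces n separate draws.
    """
    p = rng.random() ** (1.0 / n_paths)
    return d_nom + sigma * PHI.inv_cdf(min(max(p, P_MIN), P_MAX))

def sample_max_delay(rng, groups, sigma_d2d=0.0):
    """groups: iterable of (n_paths, d_nom, sigma_path) tuples.

    ASSUMPTION: D2D variation scales every path delay on a die by one
    common factor 1 + sigma_d2d * N(0, 1).
    """
    wid_max = max(sample_group_max(rng, n, d, s) for n, d, s in groups)
    return wid_max * (1.0 + sigma_d2d * rng.gauss(0.0, 1.0))

def mc_median(groups, sigma_d2d=0.0, n_dies=4000, seed=1):
    rng = random.Random(seed)
    return median(sample_max_delay(rng, groups, sigma_d2d) for _ in range(n_dies))

# Hypothetical design point: equal 280 ps targets, stage sigmas from the text.
groups = [
    (10000, 280.0, 0.03 * 280.0 / math.sqrt(8)),   # hard paths, m = 8
    (10000, 280.0, 0.05 * 280.0 / math.sqrt(6)),   # easy paths, m = 6
]
print(mc_median(groups), mc_median(groups, sigma_d2d=0.05))
```

Sampling the maximum through a single inverse-CDF call keeps the cost independent of the number of paths, which makes it practical to compare a sampled median against the value predicted by (5).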

Figure 12: Power benefit from SDGG to STAO design points over a wide range of nhard values, where r is the ratio of σstage,hard to σstage,easy. (Plot: % power benefit versus number of hard paths nhard, with curves for r = 1.0 and r = 0.6.)

IV. RESULTS: POWER BENEFITS

A. SDGG over SWC and CWC

As argued in Section II-A, the SWC technique for determining the design tcycle target can lead to either overdesign (resulting in more power) or underdesign (resulting in yield loss). In Figure 11, we show the results of comparing an SWC targeted design with its real distribution. Here we assume a relative stage D2D σ of 5% and a relative stage WID random σ of 5%, and sweep Ncp over a wide range of values. We compute the actual yield for multiple stage counts (m) and compare it to the expected tcycle yield of 99.87%. If we meet the yield target, we report the relative power increase due to overdesign. Here we use a hardware intensity value of 2. Note that for Ncp = 1 and m = 1, the SWC and SDGG results are identical. As m increases with Ncp = 1, the averaging effect reduces the impact of random WID variations on yield, resulting in an overdesign for SWC. As Ncp increases with m = 1, SWC underestimates the impact of random WID variations on yield, leading to a tcycle yield loss. Thus, setting design targets using SWC is dangerous because of the potential for tcycle yield loss. Using CWC results in extreme overdesign; for example, an 84% power increase for Ncp = 1000 and m = 9.

B. STAO over SDGG

We now repeat the benefit analysis by sweeping various parameters around the representative values for today’s microprocessor designs: dmedian = 300ps, nhard = 1000, σstage,hard = 5% or 3%, neasy = 10000, and σstage,easy = 5%. Aggregate values of η should be near 2 for a power efficient design [18]. This necessitates a smaller number of hard paths than easy paths, or a smaller value for ηhard (closer to 2 than 3). Both scenarios reduce the benefit of STAO optimizations. Here, as a center point for our investigations, we assume a smaller number of hard paths (1000) versus easy paths (10000) and keep ηhard at 3.
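The overdesign and yield-loss regimes of SWC discussed in Section IV-A can be illustrated with a deliberately reduced Python sketch. Unlike the analysis behind Figure 11, it ignores D2D variation and considers only random WID variation with independent stages and independent critical paths; the function name is hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()

def swc_actual_yield(n_cp, m, k_sigma=3.0):
    """Timing yield actually obtained when t_cycle is set at a k-sigma stage corner.

    ASSUMPTION: only random WID variation is modeled (no D2D), with i.i.d.
    stages and independent critical paths, so the corner sits k*sqrt(m) path
    sigmas above nominal and all n_cp paths must meet it.
    """
    return PHI.cdf(k_sigma * math.sqrt(m)) ** n_cp

target = PHI.cdf(3.0)  # the yield the corner is presumed to deliver
for n_cp, m in ((1, 1), (1, 9), (1000, 1), (1000, 9)):
    actual = swc_actual_yield(n_cp, m)
    verdict = "yield loss" if actual < target - 1e-12 else "met (possible overdesign)"
    print(f"Ncp={n_cp:5d} m={m}: actual={actual:.4%} target={target:.4%} -> {verdict}")
```

Even in this reduced setting, a single one-stage path reproduces the presumed yield, longer paths exceed it through averaging, and many short critical paths fall below it.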
In Figure 12, we hold the number of easy paths constant at 10000 and sweep the number of hard paths for both σstage,hard = 5% and σstage,hard = 3%, corresponding to ratios (r) of σstage,hard to σstage,easy of 1.0 and 0.6, respectively. We chose two different values of σstage,hard to consider the case of similar relative stage variations and a second situation where the variation of the hard paths is deliberately reduced, perhaps by using multiple leg transistors. Note that the benefit improves with increasing nhard until a maximum is reached due to a competition between the sixth and third effects in Section I-C. The initial improvement is due to the power level curves tilting to favor hard paths (changing the power sensitivity to delay). At some point this effect saturates and is countered by the slow movement of the optimal STAO design point back toward the SDGG design point due to movement of the max delay locus caused by an increase in nhard.

In Figure 13, we sweep the median max delay. For smaller values of the median max delay, the power sensitivity to delay increases, as described in Figure 6, amplifying the sixth effect listed in Section I-C, and thus providing more potential benefit. However, the region to the left of dmedian = 300ps corresponds to ηhard > 3 and is not likely to be used in significant portions of a microprocessor design.

In Figure 14, we sweep the stage delay standard deviation, keeping the σstage,hard to σstage,easy ratio fixed. As we sweep this value, we maintain a constant SDGG dnom,hard = dnom,easy = 280ps (by increasing dmedian as σstage,easy is increased), ensuring a constant ηhard value of 3. This comparison examines the benefits of STAO over SDGG by capturing the fifth effect listed in Section I-C. For large and equal stage delay standard deviations (σstage,hard = σstage,easy = 20%), we get only a 2% benefit from STAO. Large and different values (σstage,hard = 12%, σstage,easy = 20%) provide a 6% benefit.

Finally, it should be noted that our analytical exploration of the potential benefits of STAO made several ideal assumptions to obtain the best possible gains. These cannot all be realized in practice. Major issues include: the discreteness of the cell library, which limits the achievable gain both because the exact delay values determined by STAO cannot be achieved and because, since the discrete optimization problem is no longer convex, the heuristics used to size the design do not always find the optimal design; the slope and minimum size limits employed in real designs, which do not allow paths to be fully downsized to desired targets; and the error in static timing analysis from inadequate handling of effects such as non-ideal waveforms and multiple-input switching, which is comparable in magnitude to the STAO benefits. In addition, as discussed in Section II-C, the inclusion of systematic WID variations would further reduce the power gains.

Figure 13: Power benefit from SDGG to STAO design points while varying the median max delay, which changes the power sensitivity to delay. (Plot: % power benefit versus median max delay dmedian, with curves for r = 1.0 and r = 0.6.)

Figure 14: Power benefit from SDGG to STAO design points while varying σstage,easy. (Plot: % power benefit versus stage sigma σstage,easy, with curves for r = 1.0 and r = 0.6.)

V. SUMMARY

In this paper, we explored the power benefits of tuning a microprocessor path histogram through STAO versus SDGG. A new analytical modeling approach, based on understanding different power and variation sensitivities for different path types, was developed to obtain these benefits under a constant timing yield constraint. The model and modeling assumptions were successfully validated using Monte Carlo analysis. For a 1σ random WID stage delay variation of 5%, the analytical modeling approach indicated that STAO enables approximately a 2% power reduction over SDGG. By increasing the 1σ random WID stage delay variation to 10% and 20%, the power benefits from STAO increased to 4% and 6%, respectively. Finally, we outlined some of the practical considerations that are likely to reduce these potential best-case benefits of performing STAO over SDGG.

While a solid understanding of the statistical nature of process variation is essential to the design of quality microprocessors, we claim, through the arguments presented in this paper, that we can account for this variation up front in a global guardband and then complete the design using traditional design techniques without a significant loss in overall quality. This means that the benefit of improved design quality is unlikely to exceed the cost of building new statistical timing and optimization tools and deploying methodologies around these tools to a large design force.

REFERENCES

[1] Y. Abulafia and A. Kornfeld. Estimation of FMAX and ISB in microprocessors. IEEE Trans. VLSI Syst., pages 1205–1209, Oct. 2005.
[2] A. Agarwal, K. Chopra, D. Blaauw, and V. Zolotov. Circuit optimization using statistical static timing analysis. In DAC, pages 321–324, 2005.
[3] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. In DAC, pages 338–342, June 2003.
[4] K. A. Bowman, S. G. Duvall, and J. D. Meindl. Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE J. Solid-State Circuits, pages 183–190, Nov. 2002.
[5] K. A. Bowman, S. B. Samaan, and N. Z. Hakim. Maximum clock frequency distribution model with practical VLSI design considerations. In IEEE ICICDT, pages 183–191, May 2004.
[6] S. Choi, B. Paul, and K. Roy. Novel sizing algorithm for yield improvement under process variation in nanometer technology. In DAC, pages 454–459, June 2004.
[7] S. G. Duvall. Statistical circuit modeling and optimization. In 5th Intl. Workshop on Statistical Metrology, pages 56–63, June 2000.
[8] M. Mani, A. Devgan, and M. Orshansky. An efficient algorithm for statistical minimization of total power under timing yield constraints. In DAC, pages 309–314, 2005.
[9] H. Masuda, S. Ohkawa, A. Kurokawa, and M. Aoki. Challenge: Variability characterization and modeling for 65- to 90-nm processes. In IEEE CICC, pages 593–600, Sept. 2005.
[10] F. Najm and N. Menezes. Statistical timing analysis based on a timing yield model. In DAC, pages 460–465, June 2004.
[11] S. Nassif, A. Strojwas, and S. Director. A methodology for worst-case analysis of integrated circuits. IEEE Trans. Computer-Aided Design, pages 104–113, Jan. 1986.
[12] S. B. Samaan. The impact of device parameter variations on the frequency and performance of microprocessor circuits. In IEEE ISSCC Microprocessor Circuit Design Forum: Managing Variability on Sub-100nm Designs, Feb. 2004.
[13] J. Singh, V. Nookala, Z. Luo, and S. Sapatnekar. Robust gate sizing by geometric programming. In DAC, pages 315–320, 2005.
[14] A. Srivastava and D. Sylvester. A general framework for probabilistic low-power design space exploration considering process variation. In ICCAD, pages 808–813, Nov. 2004.
[15] A. Srivastava, D. Sylvester, and D. Blaauw. Statistical optimization of leakage power considering process variations using dual-vth and sizing. In DAC, pages 773–778, June 2004.
[16] J. Tschanz, J. Kao, S. Narendra, R. Nair, D. Antoniadis, A. Chandrakasan, and V. De. Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage. IEEE J. Solid-State Circuits, pages 1396–1402, Nov. 2002.
[17] J. Tschanz, S. Narendra, R. Nair, and V. De. Effectiveness of adaptive supply voltage and body bias for reducing impact of parameter variations in low power and high performance microprocessors. IEEE J. Solid-State Circuits, pages 826–829, May 2003.
[18] V. Zyuban and P. Strenski. Balancing hardware intensity in microprocessor pipelines. IBM J. Res. and Dev., pages 585–598, Sept/Nov 2003.
