
Slack-aware Timing Margin Redistribution Technique Utilizing Error Avoidance Flip-Flops and Time Borrowing

Mini Jayakrishnan1,2, Alan Chang2, Jose Pineda De Gyvez2, Kim Tae Hyoung1
1 VIRTUS, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
2 NXP Semiconductors, Singapore
1,2 [email protected]; [email protected]

Abstract— There is much focus on timing error resilience for the speed critical paths of processors. In the context of growing parameter variations with technology scaling and voltage scaling, resilience helps to ensure functional correctness. Moreover, it allows the chip to stretch its operating voltage and frequency beyond the conventional limits to meet the demand for high performance and low power. Conventionally, timing error resilience is achieved through variation tolerant circuitry at the cost of undesirable power, area and throughput overheads. Such overheads are aggravated by the presence of a large number of critical timing paths in the design. In this paper, we propose a slack-aware timing margin redistribution technique for error resilience using time borrowing error avoidance flip-flops (EAFFs) while minimizing overheads. The proposed algorithm designs the processor critical paths from the ground up by inserting EAFFs at places where positive slack is available in the subsequent fan-out stage. Experimental results on an industrial processor design show that a timing margin improvement of 11% of the clock period can be achieved on 64% of the critical paths and a 55% timing margin on 45% of the critical paths without any throughput degradation. The area and power overheads of the additional flip-flops are 0.2% and 5.4%, respectively.

Keywords— Timing error resilience; time borrowing; error avoidance flip-flop

978-1-4673-9140-5/15/$31.00 ©2015 IEEE

I. INTRODUCTION

With technology scaling, the nature of variations has changed from static to temporal and from die-to-die to within-die. We therefore need additional timing margins for the speed critical paths in the design [1], [2]. Low power requirements tend to increase variations and affect the yield of chips [3]. The spread of variations has increased significantly, with leakage spread being larger than frequency spread [4]. Traditional chips have been designed to address worst case variations through conservative guard bands in operating voltage and frequency. However, providing enough guard bands becomes unacceptably expensive in deeply scaled process technology. Better than Worst Case (BTWC) design techniques together with resilient architectures try to recover the wasted design margins through typical case designs with fewer resources, lower voltage and higher frequency compared to their worst case counterparts [5], [6]. The effectiveness of BTWC designs is limited by the wall of slack, which leads to massive timing errors on the near critical paths with voltage and frequency scaling. Variation tolerant design has become a one stop solution to achieve reliability as well as voltage and frequency scalability. We need to optimize performance, power, reliability and cost at all design levels to meet the design targets, as summarized in Fig. 1. There has been considerable research on new variation aware sequential circuits, but we lack a proper design methodology which can maximize the gains offered by such circuits. This paper compares some of the existing resilience circuits and architectures, discusses their pitfalls, and shows how they can leverage the large slack margins offered by designs.

Fig. 1. Resilience, Power & Performance add-ons at different stages

In this paper, we propose a technique to improve the timing margin of the critical paths in a processor pipeline in proportion to the amount of slack available in the consecutive pipeline stages. We use a time borrowing flip-flop called the error avoidance flip-flop (EAFF), based on TIMBER (Time Borrowing and Error Relay) [7], to replace the selected critical flip-flops. This method does not require the complex design iterations and area overheads of cell resizing [8]. Compared to re-timing, our approach uses opportunistic time borrowing, which can deal with dynamic process and environmental variations. In the proposed approach, slack redistribution is done by simply replacing the selected flip-flops with EAFFs having a clock delay proportional to the amount of slack available. Compared to the TIMBER design flow, we claim better timing margin improvement for the critical flip-flops, as there is no need to divide the margin among multiple pipeline stages. TIMBER introduces additional overheads in terms of error propagation logic and error consolidation latency. In contrast, the proposed method does not require error propagation to the next stages for multistage time borrowing. The critical operating point of the endpoints is improved by the timing margin available to them, which helps in aggressive voltage or frequency scaling. The key contributions of this paper are as follows:

1) Our approach is based on available slack, which recovers better timing margins and does not need error relay logic for multi-stage time borrowing as in TIMBER.
2) The proposed approach improves the timing margin of the selected endpoints without any throughput degradation.
3) Opportunistic time borrowing is done effectively without the costly area overheads and complex design iterations of resizing and/or retiming the combinational logic.
4) The critical operating point of the design is improved, which has applications in aggressive voltage scaling.

The subsequent sections of this paper are organized as follows. Section II describes the related work at the circuit, architecture and algorithmic levels. Section III explains the motivation of the proposed approach. Section IV discusses the implementation details, the algorithm and the design methodology. Finally, Section V presents simulation results and Section VI draws the conclusions.

II. STATE OF THE ART METHODOLOGIES

Due to the rise in process, voltage and temperature variations, chip reliability faces more challenges. Moreover, reliability is a prerequisite for BTWC design techniques. Resilience improvement techniques need to be extended to the algorithm, architecture and circuit levels to reap maximum benefits with minimum overheads. In this section, we briefly summarize the state-of-the-art error resilience techniques used in various design phases.

A. Circuit Level

We need timing error monitors at the circuit level which can detect, predict or mask timing errors. Several papers propose error monitors in the form of flip-flops or latches. Most of them share a common architecture and can be categorized as Error Detection Sequential (EDS) circuits and Error Masking Sequential (EMS) circuits. EDS has a data path similar to a conventional flip-flop and a shadow path which captures input data using a delayed clock. Timing error detection is done by comparing the data path with the shadow path. Razor I [9], Bubble Razor [10], Razor II [11], DSTB and TDTB [12] belong to this category. EMS is similar to EDS except that the shadow path samples the data with a delayed clock and masks the timing error. EMS has an additional clock control block that provides the delayed clock for the data input. The width of the time borrowing window can be adjusted by the clock control circuitry. TIMBER flip-flops [7] and Soft Edge flip-flops [13] belong to this category. Using TIMBER, timing violations can be masked without the need for complex error recovery mechanisms. However, this design approach leads to shorter time borrowing intervals even at places where a large slack could be borrowed. With reliability, low power and performance demands on the rise, any circuit level error mitigation technique should maximize the gains with minimal overheads.

B. Architecture Level

For EDS circuits, once an error is detected at the circuit level, it needs to be corrected at the architecture level. Previous works used architectural replay and counter-flow pipelining to recover from the erroneous state, which incurs substantial latencies and leads to throughput degradation [9]. Architecture level solutions have to perform voltage and frequency scaling to mitigate high error rates [9], [12]. TIMBER based error masking avoids the need for complex architectural recovery and throughput loss, but it has additional overheads of error propagation logic and error consolidation latency for multistage time borrowing. TIMBER also suffers from metastability issues in the data path, which affect the clock signal to the next pipeline stage.

C. Algorithmic Level

Several algorithms for error resilience and BTWC have been published so far. EVAL (Environment for Variation-Afflicted Logic) speeds up the timing of critical paths through Adaptive Body Bias (ABB) as well as Adaptive Supply Voltage (ASV) [14]. It reshapes signal paths by speeding up slower paths and slowing down faster paths, which saves overall energy consumption but has significant area overheads. Another work, BlueShift, uses On-demand Selective Biasing (OSB) and Path Constrained Tuning (PCT) to optimize selected critical paths [15]. The tuning knobs select frequently executed critical paths, apply forward body bias to some of the logic gates, and selectively tighten their timing constraints to achieve performance gains at the cost of significant power overheads. Power aware slack distribution (SlackOptimizer) [16] is another significant work, which uses cell sizing to distribute slack evenly in a power and cost efficient manner, but the benefits are small compared to the optimization effort. Further work in this area comprises Selective End Point Optimization (SEOpt), Clock Skew Optimization (SkewOpt) and Combined Optimization (CombOpt) [17]. SEOpt claims to reduce the cost of resilience by replacing error tolerant registers with conventional ones using additional margin insertion. SkewOpt migrates available timing slack from non-critical paths to critical paths. CombOpt is the combination of SEOpt and SkewOpt, and claims significant power savings. These techniques perform selective optimizations of the critical paths for performance and reliability enhancement while reducing the cost of resilience in the form of area and power overheads. The efforts and benefits vary with the design and the optimization approaches employed.
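As a rough illustration of the circuit-level distinction drawn in Section II-A, the following Python sketch contrasts how an EDS cell and an EMS cell respond to a late data arrival. It is a simplified behavioural model under stated assumptions, not a description of any of the cited circuits: a single arrival time stands in for the data path delay, and the names `CaptureResult`, `eds_capture` and `ems_capture` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureResult:
    output_correct: bool   # stage output holds the intended data
    error_flagged: bool    # the cell raised an error signal
    needs_recovery: bool   # the architecture must replay or stall


def eds_capture(arrival_ns: float, period_ns: float, window_ns: float) -> CaptureResult:
    """Error Detection Sequential (assumed model): the main path samples at
    the clock edge, the shadow path samples window_ns later, and a mismatch
    flags an error that has to be repaired at the architecture level."""
    if arrival_ns <= period_ns:
        return CaptureResult(True, False, False)
    detected = arrival_ns <= period_ns + window_ns
    return CaptureResult(False, detected, detected)


def ems_capture(arrival_ns: float, period_ns: float, window_ns: float) -> CaptureResult:
    """Error Masking Sequential (assumed model): the shadow path value is
    forwarded to the output, so a late arrival inside the time borrowing
    window is masked and no architectural recovery is needed."""
    if arrival_ns <= period_ns:
        return CaptureResult(True, False, False)
    masked = arrival_ns <= period_ns + window_ns
    return CaptureResult(masked, masked, False)


if __name__ == "__main__":
    # Data arriving 0.3 ns after a 5.5 ns clock edge, with a 0.6 ns window.
    print(eds_capture(5.8, 5.5, 0.6))
    print(ems_capture(5.8, 5.5, 0.6))
```

In this model the only difference between the two cells is what happens inside the window: EDS hands the problem to the architecture, whereas EMS absorbs it by delaying the output, which is why the following stage must have slack to spare.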

III. MOTIVATION

For our experiments we took an industrial processor core in 40nm LP CMOS technology running at a clock period of 5.5ns. After synthesis and static timing analysis (STA), we filtered the ~1000 most critical paths, with worst slack of up to 0.1ns, for slack improvement. Fig. 2 shows ~100 of these 1000 critical paths in the processor pipeline, with slack up to 6ps, and Fig. 3 shows the corresponding consecutive stage paths, with slack of up to 4000ps. Our experimental results show that 85% of the critical paths have positive slack available in the consecutive stages to borrow from. Of the critical paths, 64% can do coarse grain time borrowing with 11% to 77% timing margin improvement based on their available slack. We assign those critical paths to 7 different time borrow bands from TB1 (11% timing margin) to TB7 (77% timing margin). A further 12% of the critical paths can do fine grain time borrowing with less than 11% timing margin improvement. Thus slack is redistributed effectively without the time consuming iterations of [14], [15], [16] and [17]. In addition, multistage time borrowing can be used for those critical paths which do not have enough slack to borrow from, as shown in Fig. 4. Our approach masks variation induced timing errors opportunistically and does not degrade the throughput as in [9], [10], [11] and [12]. Since we place EAFFs based on slack analysis, we do not need error relay as in TIMBER [7]. The error signal can be used to monitor the error rate of the design. Compared to the soft edge flip-flop [13], the delay clock buffer of the EAFF is not embedded inside the standard library cell, which leads to lower clock buffer overhead as the amount of time borrowed grows. Thus the proposed technique minimizes the throughput, power and area overheads.

Fig. 2. Critical path slacks of the real time processor.

Fig. 3. Consecutive stage slacks of the real time processor.

Fig. 4. Single stage and multi stage time borrowing.

IV. IMPLEMENTATION OF THE PROPOSED APPROACH

In this section, we explain the implementation of the proposed technique. We aim to improve the timing margin on selected critical paths in the design to make it more robust against process and environmental variations. We ran synthesis on an industrial processor design in 40nm LP CMOS technology and performed timing analysis of the pipeline stages. This gave us potential locations which could use time borrowing to improve slack margins. We developed a standard cell library which has different flavours of time borrowing EAFF. Based on the results of static timing analysis, the proposed algorithm replaces the critical flip-flops with EAFFs. We used RTL Compiler for synthesis and timing analysis, and to generate power and area reports. The library of EAFFs was generated using Cadence Liberate, a characterisation tool.

A. EAFF Library

Error Avoidance flip-flops (EAFFs) are based on the TIMBER flip-flop design [7], as shown in Fig. 5. Additional control logic (not shown here) is added to the TIMBER flip-flop to generate the sampled error flag. The master latch (LATCH0) samples DATA while the shadow latch (LATCH1) captures DATA with the delayed clock DCK. The presence of the shadow latch avoids data path metastability issues in the design. ERROR_FLAG indicates whether the data in LATCH0 is identical to that in LATCH1. ERROR_FLAG is sampled to filter out false errors and glitches, which gives ERROR_SAMPLED. Fig. 6 shows simulation waveforms of the EAFF. When a timing error occurs, LATCH0 and LATCH1 store different values. However, the output is corrected by LATCH1, P0 and P1. The output signal Q is delayed by the delay value of DCK. Thus, although the output is corrected when a timing error occurs, the design has to ensure enough slack in the stage following the EAFF so that the delayed output Q does not affect the chip functionality. The control signals P0 and P1 are derived from the clock CK and the delayed clock DCK. To implement the proposed algorithm, a library of various EAFFs was generated using Cadence Liberate, and the library validations were also performed. There can be hold time issues due to the delayed sampling of input data. These can be rectified by adding delay buffers in the corresponding short paths, which is not part of this work.

Fig. 5. Error Avoidance Flip-Flop (EAFF) based on TIMBER [7].

Fig. 6. Simulated waveforms of EAFF.

B. Design Flow and Algorithm

The design flow used in our experiment is similar to the standard IC design flow. We use the post synthesis net list of an industrial processor designed for the worst case for slack analysis. We also generate the standard cell library of a set of EAFFs. Our optimization algorithm finds the set of flip-flops which are to be replaced with EAFFs, along with the optimum time borrow value to be assigned to them. We confine our analysis to the most critical or near critical paths, which are bound to fail when there is process or environmental variation. Algorithm 1 shows the pseudo code for the EAFF replacement based on the time borrow capability of the respective critical flip-flops. The algorithm is generic and can be used at any design stage from front end to back end. The critical paths analysed can be changed according to the analysis requirements.

For the slack analysis, we consider the most critical paths, with slack less than 2% of the clock period, using the Find Analysis Paths function. We then report the subsequent stage timing for the critical paths using the function Find Consecutive Slack, which shows the time borrowing capability of each path. Based on this we group the critical flip-flops into time borrowing bands TB1 to TB7 (0.6ns ~ 4.2ns) using Update TB Map. Replace Sequential replaces the selected flip-flops in the critical paths with the corresponding EAFFs. The proposed analysis starts with coarse grain time borrowing (CoarseGrainTB), which targets a timing margin improvement in the range TB1 to TB7. To further improve the critical path coverage, we use fine grain time borrowing (FineGrainTB), which targets a margin of 0.1ns ~ 0.6ns. For paths which are not covered by CoarseGrainTB and FineGrainTB, we use multi-stage time borrowing (MultistageTB). The above three techniques ensure that the available slack is redistributed to the critical paths with minimum design iterations. The flip-flop groups thus formed, along with their time borrow values, can be used to construct the delay clock tree in the later design phase, which is out of the scope of this work. After the flip-flop replacement, the area, the power and the timing are analysed in RTL Compiler.

ALGORITHM 1: PSEUDO CODE FOR COARSE GRAIN TB, FINE GRAIN TB AND MULTI-STAGE TB

    Procedure CoarseGrainTB (Initial Net list)
        # Find slack of consecutive stage for all critical paths
        P ← Find Analysis Paths ( )
        for all p ∈ P do
            S ← Find Consecutive Slack (p)
        end for
        # Replace all registers with consecutive slack by EAFF
        for all TB = TB1, TB < […]
            […]
            if s >= TB then
                Replace Sequential (p)
                Update CoarseGrainTB Map (p)
                Delete path (p)
            end if
        end for
        return (net list2, P2)

    Procedure FineGrainTB (net list2, P2)
        for all TB = TB1/grain size, TB < […]
            […]
            if s >= TB then
                Replace Sequential (p)
                Update FineGrainTB Map (p)
                Delete path (p)
            end if
        end for
        return (net list3, P3)

    Procedure MultistageTB (net list3, P3)
        for all p ∈ P3 do
            S2 ← Find Next Consecutive Slack (p)
            if s2 >= TB then
                Replace Sequential (p)
            end if
        end for
        return (net list final)

Fig. 7 shows the design environment of the proposed optimization technique. The main inputs are the timing library generated by Synopsys, the EAFF library generated by Cadence Liberate, the SDC design constraints and the initial gate level net list. The RTL Compiler engine is extended with additional wrapper scripts for the replacement of flip-flops, the optimum time borrow calculation for each flip-flop and the analysis of results. We use the same engine for STA and power/area reports. We do not use toggle rate or activity information of the critical flip-flops in our design; with such information, only those flip-flops with a sufficient activity factor would need to be replaced, which can reduce the power and area overheads further.

Fig. 7. Design Environment of the proposed optimization technique.

Fig. 8. Slack distribution after time borrowing

Fig. 9. % of critical paths improved using the proposed optimization scheme.

Fig. 10. Safety (timing) margin improvement and corresponding critical path coverage along with % chip area overhead caused by the proposed technique at different time borrowing values
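The three procedures of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' scripts: the loop bounds of the pseudo code are not fully recoverable, so the sketch assumes that each path receives the largest time borrow value its slack supports, that the fine-grain step is 0.1ns, and that multi-stage borrowing may use any band that fits the next consecutive slack. The names `CriticalPath`, `largest_band` and `redistribute` are hypothetical, and Replace Sequential is modelled simply as an entry in the returned map.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Band values follow Section IV-B: TB1..TB7 = 0.6 ns .. 4.2 ns.
COARSE_BANDS_NS: List[float] = [round(0.6 * k, 1) for k in range(1, 8)]
# Assumed 0.1 ns grain size for the 0.1 ns ~ 0.6 ns fine-grain range.
FINE_BANDS_NS: List[float] = [round(0.1 * k, 1) for k in range(1, 6)]


@dataclass
class CriticalPath:
    endpoint: str                      # name of the capturing flip-flop
    consecutive_slack_ns: float        # slack of the stage the endpoint feeds
    next_consecutive_slack_ns: float   # slack one stage further downstream


def largest_band(slack_ns: float, bands_ns: List[float]) -> Optional[float]:
    """Return the largest time borrow value not exceeding the slack, if any."""
    fitting = [tb for tb in bands_ns if slack_ns >= tb]
    return max(fitting) if fitting else None


def redistribute(paths: List[CriticalPath]) -> Dict[str, Tuple[str, float]]:
    """Map each covered endpoint to (technique, time borrow value in ns)."""
    tb_map: Dict[str, Tuple[str, float]] = {}
    remaining = list(paths)
    phases = [
        ("coarse", COARSE_BANDS_NS, lambda p: p.consecutive_slack_ns),
        ("fine", FINE_BANDS_NS, lambda p: p.consecutive_slack_ns),
        ("multistage", FINE_BANDS_NS + COARSE_BANDS_NS,
         lambda p: p.next_consecutive_slack_ns),
    ]
    for label, bands, slack_of in phases:
        uncovered = []
        for path in remaining:
            tb = largest_band(slack_of(path), bands)
            if tb is None:
                uncovered.append(path)              # try the next technique
            else:
                tb_map[path.endpoint] = (label, tb)  # stands in for Replace Sequential
        remaining = uncovered
    return tb_map


if __name__ == "__main__":
    demo = [
        CriticalPath("ff_a", 4.5, 0.0),
        CriticalPath("ff_b", 0.35, 0.0),
        CriticalPath("ff_c", 0.0, 1.3),
    ]
    for endpoint, (technique, tb_ns) in redistribute(demo).items():
        print(endpoint, technique, tb_ns)
```

Running the three techniques in sequence over a shrinking set of uncovered paths mirrors the Delete path step of the pseudo code: once a flip-flop has been assigned a band, it is not revisited by the later techniques.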

V. RESULTS AND ANALYSIS

Fig. 8 shows the slack distribution before and after EAFF replacement. The critical operating point shifts to the right as we employ time borrowing in the design. This shows that a significant number of critical and near-critical paths can be protected from variations and voltage/frequency scaling by the proposed optimization scheme.

A. Timing Margin Improvement

The aim of the proposed algorithm is to improve the timing margin of the critical paths and to distribute the slack of the design more evenly with the least overheads. Fig. 9 shows the percentage of paths, out of the ~1000 paths, which get a timing margin improvement using the proposed techniques. Results show that 7 out of 9 processor modules get a timing margin improvement for all the critical paths. Modules M3 and M8, which have the largest number of critical paths, get an added timing margin on 85% and 74% of the paths respectively, as shown in Fig. 9. Overall, we get a timing margin improvement on 85% of the critical paths using the three time borrowing techniques, with coarse grain alone contributing 64%, fine grain contributing 12% and multi-stage contributing 9%. The remaining 15% of the paths fan out mostly to the memory write back stage, where some slack can be created for timing margin improvement. Fig. 10 shows the trade-off between timing margin improvement and the number of critical paths that can be improved. As shown in Fig. 10, a safety (timing) margin improvement of 11% of the clock period (at TB1=0.6ns) covers 70% of the critical paths. If we use higher time borrowing values, we get a safety margin improvement of 55% of the clock period (at TB5=3.0ns) with 45% of the critical paths improved.

B. Overheads

The proposed technique incurs overheads because of the additional elements used in EAFFs for error resilience. The module wise area and power comparisons are shown in Fig. 11 and Fig. 12, respectively. The peak area and power overheads occur for the module M3, which has the maximum number of EAFF insertions. The chip wise area and power overheads of the EAFFs for the timing margin improvement range 0.6ns ~ 4.8ns are depicted in Fig. 10 and Fig. 13, respectively. The maximum area and power overheads are around 0.2% and 6%, respectively, for this range. There are negligible architectural, error propagation or clock control overheads, since we only need a clock delay network for the EAFFs with the delay corresponding to the slack value calculated using the algorithm.

Fig. 11. Area overheads of various critical modules caused by the proposed technique.

Fig. 12. Power overheads of various critical modules caused by the proposed technique.

Fig. 13. Overall power overhead caused by the proposed technique at different time borrowing values.

VI. CONCLUSIONS

The proposed technique is able to improve the timing margins of the critical paths in a fast and efficient way by using EAFFs with optimum time borrowing based on the available slack in the subsequent stages. Simulation results show significant timing margin improvement using fewer design iterations. Architectural overhead is minimal and there is no throughput degradation, because we use available slack and mask timing errors instead of clock stretching or instruction replay. There is minimal chip area overhead, which makes it a viable choice compared to cell resizing. Finally, the proposed technique helps in opportunistic time borrowing, which can tolerate static and dynamic variations as and when required.
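The timing margin percentages quoted in Sections III and V appear to be the borrowed time expressed as a fraction of the 5.5ns clock period. The short Python check below reproduces that arithmetic; it is an illustrative calculation, and the helper names `band_value_ns` and `margin_percent` are hypothetical. Note that the exact ratio for TB7 (4.2ns) comes out slightly below the quoted 77%, which is consistent with the bands being labelled in steps of 11%.

```python
CLOCK_PERIOD_NS = 5.5   # clock period reported in Section III
TB_STEP_NS = 0.6        # TB1 = 0.6 ns, incremented by 0.6 ns per band


def band_value_ns(index: int, step_ns: float = TB_STEP_NS) -> float:
    """Time borrow value of band TB<index> in nanoseconds."""
    return index * step_ns


def margin_percent(tb_ns: float, clock_period_ns: float = CLOCK_PERIOD_NS) -> float:
    """Borrowed time as a percentage of the clock period."""
    return 100.0 * tb_ns / clock_period_ns


# Coverage contributions reported in Section V-A (percent of critical paths).
coverage = {"coarse grain": 64, "fine grain": 12, "multi-stage": 9}

if __name__ == "__main__":
    for index in (1, 5, 7):
        tb = band_value_ns(index)
        print(f"TB{index} = {tb:.1f} ns, {margin_percent(tb):.1f}% of the clock period")
    print("total coverage:", sum(coverage.values()), "%")
```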

REFERENCES

[1] S. Ghosh and K. Roy, “Parameter variation tolerance and error resiliency: New design paradigm for the nanoscale era,” Proc. IEEE, vol. 98, no. 10, pp. 1718–1751, Oct. 2010.
[2] D. Bull, S. Das, K. Shivashankar, G. Dasika, K. Flautner, and D. Blaauw, “A power-efficient 32 bit ARM processor using timing-error detection and correction for transient-error tolerance and adaptation to PVT variation,” IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 18–31, Jan. 2011.
[3] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proc. IEEE, vol. 98, no. 2, Feb. 2010.
[4] P. Gupta et al., “Underdesigned and Opportunistic Computing in Presence of Hardware Variability,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 1, Jan. 2013.
[5] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge, “Opportunities and Challenges for Better Than Worst-Case Design,” Proc. Asia and South Pacific Design Automation Conf., 2005, pp. 2–7.
[6] S. Moreno and J. Pineda de Gyvez, “A better than worst case circuit design using timing-error speculation and frequency adaptation,” in Proc. IEEE Int. SOC Conf., 2012, pp. 15–20.
[7] M. Choudhury et al., “TIMBER: Time borrowing and error relaying for online timing error resilience,” Proc. of DATE, Dresden, Germany, March 2010, pp. 1554–1559.
[8] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack Redistribution for Graceful Degradation Under Voltage Overscaling,” Proc. ASPDAC, 2010, pp. 825–831.
[9] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, T. Mudge, and K. Flautner, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” Proceedings of the 36th Symposium on Microarchitecture (MICRO-36), San Diego, CA, 2003.
[10] M. Fojtik et al., “Bubble Razor: Eliminating timing margins in an ARM Cortex-M3 processor in 45 nm CMOS using architecturally independent error detection and correction,” IEEE J. Solid-State Circuits, vol. 48, no. 1, pp. 66–81, Jan. 2013.
[11] S. Das, C. Tokunaga, S. Pant, W. Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw, “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 32–48, 2009.
[12] K. Bowman, J. Tschanz, N. Kim, J. Lee, C. Wilkerson, S. Lu, T. Karnik, and V. De, “Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 49–63, 2009.
[13] V. Joshi, D. Blaauw, and D. Sylvester, “Soft-edge flip-flops for improved timing yield: design and optimization,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, Nov. 2007, pp. 667–673.
[14] S. R. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas, “EVAL: Utilizing processors with variation-induced timing errors,” in International Symposium on Microarchitecture, Nov. 2008.
[15] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. Chen, and C. Zilles, “BlueShift: Designing Processors for Timing Speculation from the Ground Up,” IEEE International Symposium on High Performance Computer Architecture, 2009, pp. 213–224.
[16] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Designing a Processor From the Ground Up to Allow Voltage/Reliability Tradeoffs,” IEEE International Symposium on High-Performance Computer Architecture, Jan. 2010.
[17] A. B. Kahng, S. Kang, and J. Li, “A New Methodology for Reduced Cost of Resilience,” Proc. GLSVLSI, 2014, pp. 157–162.