
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 11, NOVEMBER 2006

Sleepy Stack Leakage Reduction

Jun Cheol Park and Vincent J. Mooney III, Senior Member, IEEE

Abstract: Leakage power consumption of current CMOS technology is already a great challenge. The International Technology Roadmap for Semiconductors projects that leakage power consumption may come to dominate total chip power consumption as the technology feature size shrinks. Leakage is a serious problem particularly for CMOS circuits in nanoscale technology. We propose a novel ultra-low leakage CMOS circuit structure which we call “sleepy stack.” Unlike many previous approaches, sleepy stack can retain logic state during sleep mode while achieving ultra-low leakage power consumption. We apply the sleepy stack to generic logic circuits. Although the sleepy stack incurs some delay and area overhead, the sleepy stack technique achieves the lowest leakage power consumption among known state-saving leakage reduction techniques, thus providing circuit designers with new choices to handle the leakage power problem.

Index Terms: Dual-$V_{th}$, low-leakage power dissipation, transistor stacking.

I. INTRODUCTION

Power consumption is one of the top concerns of VLSI circuit design, for which CMOS is the primary technology. Today’s focus on low power is not only because of the recent growing demands of mobile applications. Even before the mobile era, power consumption was a fundamental problem. To solve the power dissipation problem, many researchers have proposed different ideas from the device level to the architectural level and above. However, there is no universal way to avoid tradeoffs between power, delay, and area, and thus designers are required to choose appropriate techniques that satisfy application and product needs.

Power consumption of CMOS consists of dynamic and static components. Dynamic power is consumed when transistors are switching, and static power is consumed regardless of transistor switching. Dynamic power consumption was previously (at 0.18-µm technology and above) the single largest concern for low-power chip designers since dynamic power accounted for 90% or more of the total chip power. Therefore, many previously proposed techniques, such as voltage and frequency scaling, focused on dynamic power reduction. However, as the feature size shrinks, e.g., to 0.09 and 0.065 µm, static power has become a great challenge for current and future technologies. Based on the International Technology Roadmap for Semiconductors (ITRS) [1], Kim et al. report that subthreshold leakage power dissipation of a chip may exceed dynamic power dissipation at the 65-nm feature size [2].

Manuscript received August 5, 2005; revised July 7, 2006. J. C. Park is with the Mobility Group, Intel Corporation, Folsom, CA 95630 USA (e-mail: [email protected]). V. J. Mooney III is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TVLSI.2006.886398

One of the main reasons for the increase in leakage power is the increase of subthreshold leakage power. When technology feature size scales down, supply voltage and threshold voltage also scale down. Subthreshold leakage power increases exponentially as threshold voltage decreases. Furthermore, the structure of the short channel device lowers the threshold voltage even further. In addition to subthreshold leakage, another contributor to leakage power is gate-oxide leakage power due to the tunneling current through the gate-oxide insulator. Since gate-oxide thickness may be reduced as the channel length decreases, in sub-0.1-µm technology, gate-oxide leakage power may be comparable to subthreshold leakage power if not handled properly. However, we assume other techniques will address gate-oxide leakage; for example, high-κ dielectric gate insulators may provide a solution to reduce gate leakage [2]. Therefore, this paper focuses on reducing subthreshold leakage power consumption.

In this paper, we provide a new circuit structure named “sleepy stack” as a remedy for static power consumption. The sleepy stack has a novel structure that uniquely combines the advantages of two major prior approaches, the sleep transistor technique and the forced stack technique. However, unlike the sleep transistor technique, the sleepy stack technique retains the original state; furthermore, unlike the forced stack technique, the sleepy stack technique can utilize high-$V_{th}$ transistors to achieve up to two orders of magnitude leakage power reduction compared to the forced stack. Unfortunately, the sleepy stack technique comes with delay and area overheads. Therefore, the sleepy stack technique provides new Pareto points [3] to designers who require ultra-low leakage power consumption and are willing to pay some area and delay cost.
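The exponential dependence of subthreshold leakage on threshold voltage noted above is what makes high-$V_{th}$ transistors attractive. As a rough illustration only, the sketch below uses the common textbook proportionality $I_{sub} \propto \exp(-V_{th}/(n v_T))$, which is not taken from this paper; the subthreshold slope factor `n = 1.5`, the temperature, and the function names are all assumptions chosen for the example.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELECTRON_CHARGE = 1.602176634e-19   # C


def thermal_voltage(temp_kelvin=300.0):
    """Thermal voltage v_T = kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def leakage_ratio(vth_low, vth_high, n=1.5, temp_kelvin=300.0):
    """Ratio I_sub(vth_low) / I_sub(vth_high).

    Assumes the textbook model I_sub ~ exp(-Vth / (n * v_T)); the slope
    factor n is an illustrative assumption, not a value from the paper.
    """
    return math.exp((vth_high - vth_low) / (n * thermal_voltage(temp_kelvin)))


if __name__ == "__main__":
    # How much leakier a low-Vth device is than a device with doubled Vth,
    # under the assumed model and parameters.
    print(leakage_ratio(0.2, 0.4))
```

Under this simplified model, doubling a 0.2-V threshold reduces subthreshold current by more than two orders of magnitude; real devices also depend on drain-induced barrier lowering, body effect, and temperature, which the sketch ignores.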
The main contributions of this paper are as follows: 1) introduction of a sleepy stack structure that can reduce leakage power by up to two orders of magnitude for circuits that require extremely low leakage power consumption and 2) analysis of example sleepy stack logic circuits in terms of the various parameters (transistor scaling, threshold voltage, and transistor width) that circuit design engineers can adjust to adopt the sleepy stack technique as necessary.

This paper is organized as follows. In Section II, prior work on low-leakage logic design is discussed. In Section III, the sleepy stack structure is explained and an analytical delay model is discussed. In Section IV, an empirical methodology applying the sleepy stack to generic logic is explained. In Section V, the experimental results of the sleepy stack for generic logic are presented. In Section VI, conclusions are given.

II. PREVIOUS WORK

In this section, we discuss previous low-power techniques that primarily target reducing leakage power consumption of CMOS circuits. Techniques for leakage power reduction can

1063-8210/$20.00 © 2006 IEEE

PARK AND MOONEY III: SLEEPY STACK LEAKAGE REDUCTION

be grouped into the following two categories: 1) state-saving techniques where circuit state (present value) is retained and 2) state-destructive techniques where the current Boolean output value of the circuit might be lost [2]. A state-saving technique has an advantage over a state-destructive technique in that with a state-saving technique the circuitry can immediately resume operation at a point much later in time without having to somehow regenerate state. We characterize each low-leakage technique according to this criterion.

State-destructive techniques cut off transistor (pull-up or pull-down or both) networks from supply voltage or ground using sleep transistors [4]. These types of techniques are also called gated-$V_{dd}$ and gated-Gnd (note that a gated clock is generally used for dynamic power reduction). Mutoh et al. propose a technique they call multithreshold-voltage CMOS (MTCMOS) [4], which adds high-$V_{th}$ sleep transistors between pull-up networks and $V_{dd}$ and between pull-down networks and ground, while logic circuits use low-$V_{th}$ transistors in order to maintain fast logic switching speeds. The sleep transistors are turned off when the logic circuits are not in use. By isolating the logic networks using sleep transistors, the sleep transistor technique dramatically reduces leakage power during sleep mode. However, the additional sleep transistors increase area and delay. Furthermore, during sleep mode, the pull-up and pull-down networks will have floating values and, thus, will lose state. These floating values significantly impact the wake-up time and energy of the sleep technique due to the requirement to recharge transistors which lost state during sleep (this issue is nontrivial, especially for registers and flip-flops).

To reduce the wake-up cost of the sleep transistor technique, the zigzag technique is introduced [5]. The zigzag technique reduces the wake-up overhead by choosing a particular circuit state (e.g., corresponding to a “reset”) and then, for the exact circuit state chosen, turning off the pull-down network for each gate whose output is high while conversely turning off the pull-up network for each gate whose output is low. By applying the particular input pattern chosen prior to chip fabrication before going to sleep, the zigzag technique can prevent floating values. Although the zigzag technique retains the particular state chosen prior to chip fabrication, any other arbitrary state during regular operation is lost in power-down mode.

Another technique to reduce leakage power is transistor stacking. Transistor stacking exploits the stack effect; the stack effect results in substantial subthreshold leakage current reduction when two or more stacked transistors are turned off together. Narendra et al. study the effectiveness of the stack effect including effects from increasing the channel length [6]. Since forced stacking of what previously was a single transistor increases delay, Johnson et al. propose an algorithm that finds circuit input vectors that maximize stacked transistors of existing complex logic [7]. As a variation of stacking transistors, Hanchate and Ranganathan introduce self-controlled stacked transistors which are inserted between pull-up and pull-down networks and reduce leakage power by increasing internal resistance [8]. Our sleepy stack structure can achieve more power savings than the forced stack technique and the self-controlled stacked transistors (e.g., 100× compared with 10× for the forced


Fig. 1. (a) Forced stack technique applied to an inverter. (b) Sleep transistor technique applied to an inverter.

stack transistor or the self-controlled stacked transistors). Furthermore, the sleepy stack can save exact logic state unlike gated-$V_{dd}$ and gated-Gnd techniques (conventional sleep transistor technique) and the zigzag technique. In Section III, we will discuss the sleepy stack structure and sleepy stack operation.

III. SLEEPY STACK STRUCTURE

We introduce our new leakage power reduction technique, which we name “sleepy stack.” The sleepy stack technique has a combined structure of the forced stack technique and the sleep transistor technique. However, unlike the sleep transistor technique, the sleepy stack technique retains exact logic state when in sleep mode; furthermore, unlike the forced stack technique, the sleepy stack technique can utilize high-$V_{th}$ transistors without 5× (or greater) delay penalties. Therefore, far better than any prior approach known to the authors of this paper, the sleepy stack technique can achieve ultra-low leakage power consumption while saving state.

We first explain the structure of the sleepy stack technique using an inverter. Then, we describe the details of sleepy stack operation in active mode and sleep mode. The advantages of the sleepy stack technique over the forced stack technique and the sleep transistor technique are explored. Finally, we derive a first-order delay model that compares the sleepy stack technique to the forced stack technique analytically.

A. Sleepy Stack Approach

In this section, we explain our sleepy stack structure by comparing it to the forced stack technique and the sleep transistor technique. The details of the sleepy stack inverter are described as an example. Two operation modes, active mode and sleep mode, of the sleepy stack technique are explored.

1) Sleepy Stack Structure: The sleepy stack structure has a combined structure of the forced stack and the sleep transistor techniques. Although we mentioned these two techniques in Section II, we focus on explaining forced stack and sleep transistor inverters here for the purposes of comparison with a sleepy stack inverter. Fig. 1(a) depicts a forced stack inverter and Fig. 1(b) depicts a sleep transistor inverter. The forced stack inverter breaks each existing transistor into two transistors and forces a stack structure to take advantage of the stack effect; this is shown in Fig. 1(a). Meanwhile, the sleep transistor inverter shown in


Fig. 3. (a) Inverter circuit schematic. (b) RC equivalent circuit.

Fig. 2. (a) Sleepy stack inverter with $W/L$ of each transistor and active mode $S$, $S'$ assertion. (b) Sleep mode $S$, $S'$ assertion.

Fig. 1(b) isolates existing logic networks using sleep transistors. The stack structure in Fig. 1(b) saves leakage power consumption during sleep mode. This sleep transistor technique frequently uses high-$V_{th}$ sleep transistors (the transistors controlled by $S$ and $S'$) to achieve larger leakage power reduction.

The sleepy stack technique has a structure merging the forced stack technique and the sleep transistor technique. Fig. 2 shows a sleepy stack inverter. The sleepy stack technique divides each existing transistor into two transistors, each typically with the same width, which is half of the original single transistor’s width, thus maintaining equivalent input capacitance. The sleepy stack inverter in Fig. 2(a) uses the $W/L$ values annotated in the figure for the pull-up and pull-down transistors, which are half of the values that a conventional inverter with the same input capacitance would use for its pull-up and pull-down transistors, respectively. Then sleep transistors are added in parallel to one of the transistors in each set of two stacked transistors. We use a transistor sized as half the width of the original transistor for the sleep transistor of the sleepy stack. Although we exclusively use half the original width for the sleep transistor, changing the sleep transistor width in various ways may provide additional tradeoffs between delay, power, and area. However, in this paper, we mainly focus on applying the sleepy stack structure with half-width sleep transistors to generic logic circuits while varying technology feature size, threshold voltage, and temperature. Please note that halving transistor width is not possible for a circuit that uses minimum size transistors. However, many circuits use nonminimum size transistors to gain driving strength. In any case, if we cannot halve transistor width, then we simply use minimum width.

2) Sleepy Stack Operation: Now we explain how the sleepy stack works during active mode and during sleep mode. Also, we explain leakage power savings using the sleepy stack structure. The sleep transistors of the sleepy stack operate similarly to the sleep transistors used in the sleep transistor technique, in which sleep transistors are turned on during active mode and turned off during sleep mode. Fig. 2 depicts the sleepy stack operation using a sleepy stack inverter. During active mode [Fig. 2(a)], the sleep signals $S$ and $S'$ are asserted as shown, and, thus, all sleep transistors
are turned on. This sleepy stack structure can potentially reduce circuit delay in two ways. First, since the sleep transistors are always on during active mode, the sleepy stack structure achieves faster switching time than the forced stack structure; specifically, in Fig. 2(a), at each sleep transistor drain, the voltage value connected to the sleep transistor source is always ready and available at the sleep transistor drain, and thus, current flow is immediately available to the low-$V_{th}$ transistors connected to the gate output regardless of the status of each transistor in parallel to the sleep transistors. Furthermore, we can use high-$V_{th}$ transistors (which are slow but 1000× or so less leaky) for the sleep transistors and the transistors parallel to the sleep transistors (see Fig. 2) without incurring a large (e.g., 2× or more) delay increase.

During sleep mode [Fig. 2(b)], $S$ and $S'$ are asserted as shown, and so both of the sleep transistors are turned off. Although the sleep transistors are turned off, the sleepy stack structure maintains exact logic state. The leakage reduction of the sleepy stack structure occurs in two ways. First, leakage power is suppressed by high-$V_{th}$ transistors, which are applied to the sleep transistors and the transistors parallel to the sleep transistors. Second, stacked and turned off transistors induce the stack effect [11], which also suppresses leakage power consumption. By combining these two effects, the sleepy stack structure achieves ultra-low leakage power consumption during sleep mode while retaining exact logic state. The price for this, however, is increased area.

We will derive an analytical delay model of the sleepy stack inverter and compare the sleepy stack technique to the forced stack inverter in the next section. This analytical comparison, Section III-B, can be skipped if desired. The detailed experimental methodology and the results will be presented in Section IV.

B. Analytical Comparison of Sleepy Stack Inverter Versus Forced Stack Inverter

In this section, an analytical delay model of a sleepy stack inverter is explained and compared to a forced stack inverter, the best prior state-saving leakage reduction technique we could find. Generally, the transistor delay of a conventional inverter shown in Fig. 3 driving a load of $C_L$ can be expressed using the following equation:

$$T_d = C_L R_t \tag{1}$$

where $C_L$ is the load capacitance and $R_t$ is the transistor resistance. $C_{in}$ in Fig. 3(b) indicates input capacitance. Although the nonsaturation mode equation is complicated, we can predict the adequate first-order gate delay from (1) [14].

Now we derive the delay of the inverter with the forced stack technique shown in Fig. 4. Since we assume that we break each existing transistor into two half sized transistors (see Section III-A1), the resistance of each transistor of the forced stack technique is doubled, i.e., $2R_t$, compared to the standard inverter; furthermore, in this way, we can maintain input capacitance equal to Fig. 3(b). In Fig. 4, $C_{x1}$ is the internal node capacitance between the two pull-down transistors. Using the Elmore equation [10], we can express the delay of the forced stack inverter as follows:

$$T_{d1} = (2R_t + 2R_t)\,C_L + 2R_t C_{x1} \tag{2}$$

$$T_{d1} = 4R_t C_L + 2R_t C_{x1}. \tag{3}$$

Fig. 4. (a) Forced stack technique inverter circuit schematic. (b) RC equivalent circuit.

Similarly, we can depict the sleepy stack inverter and its resistance-capacitance (RC) equivalent circuit as shown in Fig. 5. Two extra sleep transistors are added and each sleep transistor has a resistance of $2R_t$ (as discussed in Section III-A1, please note that increasing sleep transistor width reduces the sleep transistor resistance further; however, let us continue with the approach of Section III-A). Each sleep transistor is in parallel with a stacked transistor of resistance $2R_t$, so the pair contributes a resistance of $R_t$. The internal node capacitance is $C_{x2}$. Using the Elmore equation, we can derive the transistor delay of the sleepy stack inverter as follows:

$$T_{d2} = (R_t + 2R_t)\,C_L + R_t C_{x2} \tag{4}$$

$$T_{d2} = 3R_t C_L + R_t C_{x2}. \tag{5}$$

Fig. 5. (a) Sleepy stack technique inverter schematic. (b) RC equivalent circuit.

We assume that the internal node capacitance $C_{x2}$ is 50% larger than $C_{x1}$ because $C_{x2}$ is the capacitance from three transistors connected, while $C_{x1}$ is the capacitance from two transistors connected. Then

$$T_{d2} = 3R_t C_L + 1.5 R_t C_{x1} \tag{6}$$

$$T_{d2} = 0.75\,T_{d1}. \tag{7}$$

Therefore, $T_{d2}$ is 25% faster than $T_{d1}$ if we use the same $V_{th}$ for the forced stack inverter and the sleepy stack inverter. Alternatively, we may increase $V_{th}$ of the sleepy stack inverter and make the delay of the sleepy stack inverter and the delay of the forced stack inverter the same. Let us take an example. The gate delay of a CMOS circuit can be expressed as shown in the following approximated equation:

$$T_d \propto \frac{C_L V_{dd}}{(V_{dd} - V_{th})^{\alpha}} \tag{8}$$

where $T_d$, $V_{th}$, and $\alpha$ denote the gate delay in a CMOS circuit, the threshold voltage, and the velocity saturation index of a transistor, respectively. Using (8), the delay of the forced stack $T_{d1}$ and the delay of the sleepy stack $T_{d2}$ can be expressed as follows:

$$T_{d1} = K_1 \frac{C_L V_{dd}}{(V_{dd} - V_{th1})^{\alpha}} \tag{9}$$

$$T_{d2} = K_2 \frac{C_L V_{dd}}{(V_{dd} - V_{th2})^{\alpha}} \tag{10}$$

where $K_1$ and $K_2$ are delay coefficients of the forced stack inverter and the sleepy stack inverter, respectively. When the threshold voltage of the forced stack $V_{th1}$ is the same as the threshold voltage of the sleepy stack $V_{th2}$, we calculate $K_2 = 0.75 K_1$ from (7). If we assume values for $\alpha$, $V_{dd}$, and $V_{th1}$, we can make $T_{d2}$ equal to $T_{d1}$ by applying a $V_{th2}$ that is 69% higher than the $V_{th}$ of the forced stack inverter. This higher $V_{th}$ can potentially result in large leakage power reduction (e.g., 10×).

In this section, we introduced the sleepy stack technique for leakage power reduction. By combining the forced stack technique and the sleep transistor technique, the sleepy stack can achieve smaller transistor delay than the forced stack technique while retaining state unlike the sleep transistor technique. The main advantage of the sleepy stack approach is the ability to use high-$V_{th}$ for both the sleep transistors and the transistors in parallel with the sleep transistors. The increased threshold voltage transistors of the sleepy stack technique potentially bring much larger (on the order of 10×) leakage power reduction than the forced stack technique while achieving the same transistor delay. From the analytical model of the sleepy stack inverter, we observe that the sleepy stack inverter can reduce delay by 25%, which alternatively can be used to increase $V_{th}$ by 69%. Using this increased threshold voltage, the sleepy stack inverter can potentially achieve a large (e.g., 10×) leakage power reduction compared to the forced stack inverter.

In this section, we explained the sleepy stack structure and sleepy stack operation. We also described a first-order delay model of the sleepy stack (please note that all power and delay results reported in Section V are based, however, on


Fig. 6. Chain of four inverters with $W/L$ of each transistor.

HSPICE; see Section IV-C). In the next sections, we apply the sleepy stack structure to generic logic circuits, explaining in detail our methodology.

IV. APPLYING SLEEPY STACK TO LOGIC CIRCUITS

In this section, we first explain the target benchmark circuits, focusing on generic logic, that we use to evaluate our sleepy stack technique [11]. Then we explain the low-leakage techniques we consider for purposes of comparison; although the basic ideas of the compared techniques have been covered in Section II, this section gives the detailed structure with transistor sizing for each prior technique to be compared to our sleepy stack approach. Finally, we explain the experimental methodology that we use to compare our technique to the previous techniques we consider.

A. Benchmark Circuits

To show that the sleepy stack technique is applicable to general logic design, we choose three benchmark circuits, which are as follows: 1) a chain of four inverters; 2) a 4:1 multiplexer; and 3) a 4-bit adder.

1) Chain of Four Inverters: A chain of four inverters shown in Fig. 6 is chosen because an inverter is one of the most basic CMOS circuits and is typically used to study circuit characteristics. We size each transistor of the inverter to have equal rise and fall times in each stage. Instead of using the minimum possible size of the transistor in a given technology, we use the $W/L$ ratios shown in Fig. 6 for the pMOS and nMOS transistors. Please refer to [12] for a layout of the chain of four inverters in TSMC 0.18-µm technology using the widths shown in Fig. 6; note that in Fig. 6, for 0.18-µm technology, all pMOS transistors have the same width and length, as do all nMOS transistors.

2) 4:1 Multiplexer: A possible implementation of a 4:1 multiplexer is shown in Fig. 7, which has data input signals, selection signals, and an enable signal. The multiplexer consists of an inverter, two-input NAND gates, and two-input NOR gates. All gates are sized to have rise and fall times equal to those of an inverter with specified pMOS and nMOS $W/L$ ratios. Although the 4:1 multiplexer shown in Fig. 7 is not the most efficient way to implement a 4:1 multiplexer, we use the design of Fig. 7 to show that the sleepy stack is applicable to a combination (a logic network) of typical CMOS gates. Please refer to [12] for the NAND and NOR layouts used in this 4:1 multiplexer.

3) 4-Bit Adder: By use of the 1-bit full adder shown in Fig. 8, we implement a 4-bit adder. A full adder is an example of a typical complex CMOS gate. The full adder in Fig. 8 has two operand inputs and a carry input, and it produces a sum output and a carry output. The transistor

Fig. 7. 4:1 multiplexer with delay critical path along the dashed line.

sizing of the full adder is noted in Fig. 8. Please refer to [12] for the full adder layout we use.

These three benchmark circuits (chain of four inverters, 4:1 multiplexer, and 4-bit adder) designed in a conventional CMOS structure are used as our base case. They are also implemented using the low-leakage techniques explained in the next section, Section IV-B, to which we compare our sleepy stack technique.

B. Prior Low-Leakage Techniques Considered for Comparison Purposes

The sleepy stack technique is compared to a conventional CMOS approach, which is our base case, and three other well-known previous approaches, i.e., the forced stack, sleep, and zigzag techniques explained in Section II. We also explore the impact of $V_{th}$ and transistor width on the sleepy stack technique.

1) Base Case: In this paper, we use the phrase “base case” to refer to the conventional CMOS technique shown in Fig. 9 and described in a classic textbook by Weste and Eshraghian [13]. Fig. 9 shows a pull-up network and a pull-down network using as few transistors as possible to implement the Boolean logic function desired. The base cases of the chain of four inverters, the 4:1 multiplexer, and the 4-bit adder are sized as explained in Sections IV-A1, IV-A2, and IV-A3, respectively.

2) Sleepy Stack Technique: Fig. 10 shows the sleepy stack technique applied to a conventional CMOS design. When we apply the sleepy stack technique, we replace each existing transistor with two half sized transistors and add one extra sleep transistor as shown in Fig. 10. If dual-$V_{th}$ values are available, high-$V_{th}$ transistors are used for sleep transistors and transistors that are parallel to the sleep transistors.

3) Forced Stack Technique: Fig. 11 shows the forced stack technique, which forces a stack structure by breaking down an


Fig. 8. 1-bit full adder with $W/L$ of each transistor.

Fig. 9. Base case (conventional CMOS) circuit structure.

Fig. 10. Sleepy stack technique circuit structure.

existing transistor into two half size transistors. When we apply the forced stack technique, we replace each existing transistor with two half sized transistors as shown in Fig. 11.

4) Sleep Transistor Technique: The sleep transistor technique shown in Fig. 12 uses sleep transistors between both $V_{dd}$ and the pull-up network as well as between Gnd and the pull-down network. Generally, the width/length ($W/L$) ratio is sized based on a tradeoff between area, leakage reduction, and delay. For simplicity, we size the sleep transistor to the size of the largest transistor in the network (pull-up or pull-down) connected to the sleep transistor. The size noted in Fig. 12 shows an example when the sleep transistors are applied to one of the inverters from Fig. 6. The pMOS and nMOS sleep transistors in Fig. 12 have the same $W/L$ as the pull-up and pull-down transistors in Fig. 6, respectively. If dual-$V_{th}$ values are available, high-$V_{th}$ transistors are used for sleep transistors.

5) Zigzag Technique: The zigzag technique in Fig. 13 uses one sleep transistor in each logic stage, either in the pull-up or pull-down network, according to a particular input pattern. In this paper, we use an input vector that can achieve the lowest measured (simulated) leakage power consumption. Then, we either assign a sleep transistor to the pull-down network if the output is “1” or else assign a sleep transistor to the pull-up network if the output is “0.” For Fig. 13, we assume that the output of the first stage is “1” and the output of the second stage is “0”


Fig. 11. Forced stack technique circuit structure.

Fig. 13. Zigzag technique circuit structure.

Fig. 12. Sleep transistor technique circuit structure.

Fig. 14. Experimental flow with V of each process technology.

when minimum leakage inputs are asserted. Therefore, we apply a pull-down sleep transistor for the first stage and a pull-up sleep transistor for the second stage. Similar to the sleep transistor technique, we size the sleep transistors to the size of the largest transistor in the network (pull-up or pull-down) connected to the sleep transistor. The transistor sizing in Fig. 13 shows an example where the zigzag technique is applied to two inverters from Fig. 6. If dual-$V_{th}$ values are available, high-$V_{th}$ transistors are used for the sleep transistors.

The low-leakage techniques explained in this section, Section IV-B, are implemented using the three benchmark circuits described in Section IV-A. In the next section, we explain our experimental methodology.

C. Experimental Methodology

The implemented circuits are simulated to measure delay, power, and area. For power measurement, we consider both dynamic power and static power. We first explain the experimental infrastructure, and then we explain the detailed measurement methodology.

1) Simulation Setup: We use an empirical methodology to evaluate the five techniques which are the base case, zigzag, sleep, stack, and sleepy stack techniques. Each benchmark circuit implemented using each of the five techniques is evaluated in terms of delay, dynamic power, static power, and area. Our experimental procedure, which is shown in Fig. 14, is as follows. We first design each target benchmark circuit with each specific technique using Cadence Virtuoso, a custom layout tool,1 and the North Carolina State University (NCSU) Cadence design kit targeting TSMC 0.18- m technology.2 When we design a circuit using Cadence Virtuoso, we implement schematics as well as layouts. Then, we extract schematics from layout to obtain transistor circuit netlists. The extracted netlists are fed into the HSPICE simulation to estimate delay and power of the target benchmark designed with a specific technique; we use Synopsys HSPICE.3 We use TSMC 0.18- m parameters obtained from MOSIS,4 and we also use the Predictive Technology Model (PTM) param1Cadence

Design Systems. [Online]. Available: http://www.cadence.com

2NC State Univ. Cadence Tool Information. [Online]. Available: http://www.

cadence.ncsu.edu 3Synopsys Incorporated. [Online]. Available: http://www.synopsys.com 4The MOSIS Service. [Online]. Available: http://www.mosis.org

PARK AND MOONEY III: SLEEPY STACK LEAKAGE REDUCTION

Fig. 15. Inputs and the critical path (dashed line) for 4-bit adder delay measurement.

eters for the technologies below 0.18 μm in order to estimate the changes in power and delay as technology shrinks⁵ [14]. The chosen technologies, i.e., 0.07, 0.10, 0.13, and 0.18 μm, use supply voltages of 0.8, 1.0, 1.3, and 1.8 V, respectively. We assume that only a single supply voltage is used in the chip designs we target. We do consider both single- and dual-$V_{th}$ technology for the sleep, zigzag, and sleepy stack techniques. For the forced stack technique, we apply high-$V_{th}$ to one of the stacked transistors while fixing the technology to 0.07 μm to observe delay and leakage variations (we find that high-$V_{th}$ causes a dramatic delay increase, greater than 5×, with the forced stack technique; see Section V-B). For the logic circuits, we set all high-$V_{th}$ transistors to have a $V_{th}$ 2.0× higher than the $V_{th}$ of a normal (low-$V_{th}$) transistor.

2) Delay: We measure the worst case propagation delay of each benchmark. Input vectors and input and output triggers are chosen to measure delay across a given circuit’s critical path. The propagation delay is measured between the trigger input edge reaching 50% of the supply voltage value and the circuit output edge reaching 50% of the supply voltage value. Input waveforms have a 4-ns period (i.e., a 250-MHz rate) and rise and fall times of 100 ps. We use [...] as the output load capacitance.

For the chain of four inverters, we measure two different propagation delay values: one when an input goes high and another when an input goes low. We take the larger value as the worst case propagation delay of the chain of four inverters.

For the 4:1 multiplexer, we measure the worst case propagation delay of the path [...]-[...]-NAND-NOR-NOR-NAND-output shown in Fig. 7 (note that several other paths exist with equal delay). We measure this critical path delay when the output changes from “0” to “1.” To generate this signal transition, we pick initial input values as [...]; the result is that the output is equal to “0.” Then we set [...] to make the output equal to “1.” We measure the propagation delay between the falling edge of [...] and the rising edge of the output.

We form a 4-bit adder as shown in Fig. 15 using four 1-bit full adders, all of which are identical in size. The critical path of our 4-bit adder is the path [...]. To measure the worst case propagation delay, we initially force input signals as shown in Fig. 15. Then we assert [...] and measure the delay from [...] to [...].

5 Predictive Technology Model (PTM). [Online]. Available: http://www.eas.asu.edu/~ptm
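The 50%-to-50% delay definition above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' HSPICE setup: the function names and the piecewise-linear sample waveforms are assumptions made for illustration, and the 0.8-V supply in the example is simply the 0.07-μm value quoted in the text.

```python
def crossing_time(times, volts, level, rising):
    """Return the first time at which the waveform crosses `level`.

    Linear interpolation is used between adjacent samples. `rising`
    selects whether a low-to-high or high-to-low crossing is wanted.
    """
    for i in range(1, len(times)):
        v0, v1 = volts[i - 1], volts[i]
        crossed = (v0 < level <= v1) if rising else (v0 > level >= v1)
        if crossed:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(times, v_in, v_out, vdd, in_rising, out_rising):
    """Delay from the input edge at 50% of vdd to the output edge at 50% of vdd."""
    half = 0.5 * vdd
    t_in = crossing_time(times, v_in, half, in_rising)
    t_out = crossing_time(times, v_out, half, out_rising)
    return t_out - t_in


def worst_case_delay(delays):
    """Worst case is the larger of the measured delays (e.g., rise and fall)."""
    return max(delays)


# Hypothetical waveforms: times in ps, voltages in V, supply of 0.8 V.
TIMES_PS = [0.0, 100.0, 300.0, 400.0]
V_IN = [0.8, 0.0, 0.0, 0.0]   # input falls with a 100-ps fall time
V_OUT = [0.0, 0.0, 0.8, 0.8]  # output rises between 100 ps and 300 ps
DELAY_PS = propagation_delay(
    TIMES_PS, V_IN, V_OUT, vdd=0.8, in_rising=False, out_rising=True
)
```

For the chain of four inverters, `worst_case_delay` would be applied to the two delays obtained for a rising and a falling input.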


Fig. 16. Waveforms of 1-bit adder for dynamic power measurement.

3) Active Power: Active power is measured by asserting semirandom input vectors and calculating the average power dissipation during this time. Input vectors are chosen so that a large number of possible input combinations are included in the set. We take the average power dissipation reported by HSPICE as our estimate of active power consumption. This active power includes dynamic power as well as static power during the time we measure; we do not attempt to subtract out static power consumption to calculate pure dynamic power consumption. All sleep transistors are turned on when we measure active power for the sleep, zigzag, and sleepy stack techniques.

We measure the active power of the chain of four inverters by asserting “0” and “1” repeatedly. For the 4:1 multiplexer, the input vectors are chosen to represent a sample of possible inputs, with a change of at least four of the seven input bits at every input change (details are available in [12]). For the 4-bit adder, we assert input vectors covering every possible input. The waveform in Fig. 16 shows the input vectors asserted for each 1-bit adder, where the input vector changes every 4 ns.

Please note that we use the same signal timing while scaling technology from 0.18 to 0.07 μm. We do not customize signal timing to each particular technology (e.g., 0.13 μm) because in this way we can observe the effect of technology scaling on a fixed clock. However, we are aware that reducing cycle time along with technology feature size is possible and may reveal additional insights and tradeoffs.

4) Static Power: We also use HSPICE to measure static power consumption. Since static power varies according to input state, we consider either the full combination of input vectors or a subset of possible input combinations. When we measure static power, we first assert an input vector and measure power consumption after signals become stable (e.g., after 30 ns). The static power values measured for the individual input vectors (each over 30 ns) are averaged to derive the static power consumption of each benchmark circuit. For the chain of four inverters, we consider two input vectors, “0” and “1.” For the 4:1 multiplexer, we choose eight input vectors out of 128 possible input combinations. The chosen input combinations are shown in Table I. For the 4-bit adder, we consider all eight possible input vectors of a 1-bit adder for leakage power measurement.

The sleep transistors of the sleep, zigzag, and sleepy stack techniques are turned off during sleep mode, in which we measure the leakage power consumption. For the zigzag technique, we take the lowest static power dissipation instead of averaging each measured power result for each input tested; in short, we assume that the zigzag technique applied an input vector that achieves the lowest possible leakage power by analyzing circuitry as explained in Sections II and IV-B5.

TABLE I INPUT SETS FOR A 4:1 MULTIPLEXER STATIC POWER MEASUREMENT

5) Area: The area of the 0.18-μm technology version of each target circuit in a particular design style (e.g., zigzag) is measured using layout. For the chain of four inverters and the 4-bit adder, we measure directly from a full layout that we created for each (see layouts in [12]). For the 4:1 multiplexer, we directly measure the area of the gates used (i.e., NAND, NOR, and [...]; see [12]) and estimate total area. Although the gates used to build the 4:1 multiplexer, i.e., NAND, NOR, and [...], have different heights, we assume that all gates have identical height so that they share the same supply and ground rails. Therefore, we estimate the area of the 4:1 multiplexer by multiplying the height of the tallest gate and the sum of all gate widths. For example, if we use an [...] (width [...] μm, height [...] μm), a NAND (width [...] μm, height [...] μm), and a NOR (width [...] μm, height [...] μm), then the area is 2.1 μm².

Area when utilizing technologies below 0.18 μm is estimated by scaling the area of each benchmark layout for each particular design style, where TSMC 0.18-μm technology is taken as a starting point. To estimate the area of layouts using 0.13-, 0.10-, and 0.07-μm technologies, we do not take into account the extra area needed to wire gates (even though it is needed, e.g., to connect the gates comprising the 4:1 multiplexer or to connect the 1-bit adders into 4 bits), but the absence of a wiring penalty equally affects all techniques considered (i.e., base case, sleep, zigzag, forced stack, and sleepy stack).

6) Experiments: We perform two different experiments. We first compare the sleepy stack to the base case and three well-known techniques, i.e., sleep, zigzag, and forced stack, while scaling transistor technologies. For this experiment, we use all three benchmark circuits explained in Section IV-A. For experimental results, see Section V-A.

Second, we compare the sleepy stack technique only to the state-saving techniques, i.e., the forced stack technique and the base case with high-$V_{th}$. In this experiment, we consider various $V_{th}$ values, various transistor widths, and two different temperatures. We exclusively use a chain of four inverters for this experiment. For the base case with high-$V_{th}$, we vary $V_{th}$ of all transistors. For the forced stack technique, we vary $V_{th}$ of transistors connected to either the supply rail or ground. For the sleepy stack technique, we vary $V_{th}$ of sleep transistors and transistors in parallel with the sleep transistors. We use the “Delvto” option of HSPICE to change $V_{th}$. We are well aware that changing $V_{th}$ with the Delvto option is not as accurate as the real process models. However, this is the best we can do and is also quite commonly used (e.g., in [15]). We also consider two different temperatures because leakage power is highly dependent on temperature. The two temperatures are 25 °C and 110 °C. These experimental results will be presented in Section V-B.

In this section, we explained the application of the sleepy stack to generic logic circuits and described our experimental methodology. In the next section, we present experimental results for generic logic circuits obtained using the methodology of Section IV.
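Two of the bookkeeping steps described above, the aggregation of per-vector static power and the area estimate for the 4:1 multiplexer, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, and the power values and gate dimensions are made-up placeholders rather than measurements from the paper.

```python
def static_power_estimate(per_vector_power, use_minimum=False):
    """Combine settled static-power readings taken for several input vectors.

    By default the readings are averaged. With use_minimum=True the lowest
    reading is returned, mirroring the treatment of the zigzag technique,
    which is assumed to be put to sleep with its best-case input vector.
    """
    if not per_vector_power:
        raise ValueError("at least one input vector is required")
    if use_minimum:
        return min(per_vector_power)
    return sum(per_vector_power) / len(per_vector_power)


def row_area_estimate(gates_um):
    """Estimate the area of gates placed in one row sharing power rails.

    `gates_um` is a list of (width, height) pairs in micrometers. Every gate
    is assumed to occupy the height of the tallest gate, so the estimate is
    tallest height times the sum of the widths (wiring area is ignored).
    """
    tallest = max(height for _, height in gates_um)
    total_width = sum(width for width, _ in gates_um)
    return tallest * total_width


# Placeholder readings in nW for four input vectors (not data from the paper).
SETTLED_POWER_NW = [1.0, 3.0, 2.0, 6.0]
AVERAGE_NW = static_power_estimate(SETTLED_POWER_NW)
ZIGZAG_NW = static_power_estimate(SETTLED_POWER_NW, use_minimum=True)

# Placeholder (width, height) pairs in micrometers for three gates.
GATES_UM = [(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)]
AREA_UM2 = row_area_estimate(GATES_UM)
```

Using the tallest gate for every cell overestimates the area of the shorter gates, which is the conservative choice when all cells must share one pair of rails.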

V. EXPERIMENTAL RESULTS FOR GENERAL LOGIC CIRCUITS

In this section, we explain the experimental results for generic logic circuits. We utilize the three logic designs presented in Section IV.

A. Impact of Technology Scaling

First, we explore the impact of technology scaling. Fig. 17 shows the experimental results for the chain of four inverters (see Section IV-A1), 4:1 multiplexer (see Section IV-A2), and 4-bit adder (see Section IV-A3) from 0.18 to 0.07 μm. We considered five different techniques: base case (standard low-$V_{th}$ CMOS), forced stack, sleep, zigzag, and sleepy stack. Please note that in Fig. 17, a “[...]” next to a technique name means that the technique was implemented utilizing high-$V_{th}$ transistors appropriately.

We can observe from Fig. 17(a), (e), and (i) that static power increases as technology feature size shrinks. We can also observe from Fig. 17(b), (f), and (j) that dynamic power decreases as technology feature size shrinks. Finally, we can observe from Fig. 17(c), (g), and (k) that propagation delay decreases as technology feature size shrinks.

Table II presents data in normalized numbers. For the raw data used to generate Table II, please see [12]. Let us focus on the single-$V_{th}$ 0.07-μm technology implementation of each benchmark shown in Table II: we see that our sleepy stack approach with single-$V_{th}$ results in leakage power roughly equivalent to the other three leakage-reduction approaches, i.e., forced stack, sleep, and zigzag, when each uses single-$V_{th}$ technology. Compared to the sleep and zigzag approaches, which do not save state, the sleepy stack approach results in up to 68% delay increase and up to 138% area increase. Furthermore, compared to the forced stack approach, which saves state, the sleepy stack approach results in up to 118% area increase, but the sleepy stack is up to 31% faster. Thus, we recommend the sleepy stack approach with single-$V_{th}$ when state preservation is needed, dual-$V_{th}$ is not available, the speedup over forced stack is important, and the area penalty for sleepy stack is acceptable.

In addition to single-$V_{th}$ technology, the zigzag, sleep, and sleepy stack approaches are also implemented using dual-$V_{th}$ technology in which high-$V_{th}$ transistors are used as explained in Sections IV-B2, IV-B4, and IV-B5. Compared to the sleep and zigzag approaches with dual-$V_{th}$ technology, the sleepy stack approach can save state. This is the main advantage of the sleepy stack over the sleep and zigzag techniques.
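The downward trend in dynamic power noted for Fig. 17(b), (f), and (j) is consistent with a first-order estimate. As a rough sketch, assume the standard switching-power model, which this section does not state explicitly, with activity factor $\alpha$, switched capacitance $C$, supply voltage $V_{dd}$, and clock frequency $f$. Holding $\alpha$, $C$, and $f$ fixed (the signal timing is the same for all technologies) and changing only the supply voltage from 1.8 V at 0.18 μm to 0.8 V at 0.07 μm gives:

```latex
P_{dyn} = \alpha \, C \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{dyn}(0.07\,\mu\mathrm{m})}{P_{dyn}(0.18\,\mu\mathrm{m})}
\approx \left( \frac{0.8\,\mathrm{V}}{1.8\,\mathrm{V}} \right)^{2}
\approx 0.20
```

Capacitance also shrinks with feature size, so the actual reduction in switching power can be larger than this voltage-only estimate. Conversely, the active power reported here includes static power, which grows as technology scales.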


Fig. 17. Experimental results while scaling technology ([...] dual $V_{th}$). (a) Static power (W). (b) Dynamic power (W). (c) Propagation delay (s). (d) Area (μm²). (e) Static power (W). (f) Dynamic power (W). (g) Propagation delay (s). (h) Area (μm²). (i) Static power (W). (j) Dynamic power (W). (k) Propagation delay (s). (l) Area (μm²).

Let us compare in 0.07-μm technology the state-saving techniques, which are the base case with single $V_{th}$, forced stack with single $V_{th}$, and sleepy stack with dual-$V_{th}$, highlighted as shaded rows in Table II. The results from a chain of four inverters in Table II(a) show that the sleepy stack achieves 3440× leakage reduction over the base case. Furthermore, the sleepy


TABLE II NORMALIZED EXPERIMENTAL RESULTS FOR (a) CHAIN OF FOUR INVERTERS, (b) 4:1 MULTIPLEXERS, AND (c) 4-BIT ADDERS (0.07 μm)

stack achieves 215× leakage power reduction over the best prior state-saving approach, the forced stack, while reducing delay by 6% and increasing area by 51%. The results from a 4:1 multiplexer in Table II(b) show that the sleepy stack achieves 2680× leakage reduction over the base case. Compared to the forced stack, the sleepy stack achieves 202× leakage power reduction while increasing delay by 7% and increasing area by 118%. Finally, the results from a 4-bit adder in Table II(c) show that the sleepy stack achieves 2490× leakage reduction over the base case and 190× leakage power reduction over the forced stack while increasing delay by 6% and increasing area by 113%. In short, our sleepy stack technique achieves up to 215× leakage power reduction with up to 7% delay overhead compared to the best prior state-saving approach, the forced stack. Not surprisingly, the sleepy stack approach has between 51% and 118% larger area as compared to the forced stack approach. Therefore, our sleepy stack approach with dual-$V_{th}$ can be used where state preservation and ultra-low leakage power consumption are needed and are judged to be worth the area overhead.

B. Impact of $V_{th}$

Choosing the right $V_{th}$ value of the sleepy stack technique is very important in terms of delay and power consumption. Therefore, using a chain of four inverters with 0.07-μm technology, we compare dynamic power, leakage power, and delay of the state-saving techniques, i.e., base case (conventional CMOS technique), forced stack, and sleepy stack, while varying $V_{th}$. We vary $V_{th}$ of transistors as follows: all the transistors in the base case, one of the stacked transistors in the forced stack case, and the sleep transistors plus the transistors parallel to the sleep transistors in the sleepy stack case. Although the base case in Section V-A uses single $V_{th}$, in this section, we vary $V_{th}$ of the base case for the purposes of comparison.

Fig. 18 shows the measured results while varying $V_{th}$; the three graphs on the left-hand side of Fig. 18 are for 25 °C while the three graphs on the right-hand side of Fig. 18 plot values for 110 °C. From Fig. 18(a), we can see the forced stack inverter increases delay dramatically as $V_{th}$ increases (e.g., with $V_{th}$ = [...], 6.2× delay increase over the base case). While varying $V_{th}$ of the sleepy stack, we can find the $V_{th}$ of the sleepy stack that achieves the same delay as the forced stack with $V_{th}$ = [...] V, and dotted lines in Fig. 18(a) indicate the values found. At 25 °C, the sleepy stack with $V_{th}$ = [...] V has almost exactly the same delay as the forced stack with $V_{th}$ = [...] V, and, at 110 °C, the sleepy stack with $V_{th}$ = [...] V has exactly the same delay as the forced stack with $V_{th}$ = [...] V while the sleepy stack achieves more than 100× leakage reduction. Similarly, at 25 °C, the sleepy stack with $V_{th}$ = [...] V has the same delay as the base case with $V_{th}$ = [...], and, at 110 °C, the sleepy stack with $V_{th}$ = [...] V has the same delay as the base case with $V_{th}$ = [...] V while the sleepy stack achieves 2.22× and 2.98× leakage reduction, respectively, over the base case with high-$V_{th}$. From Fig. 18(b), we can observe that the base case with $V_{th}$ = [...] V consumes unacceptable active power when the temperature is 110 °C. This is because large leakage power


Fig. 18. Results from a chain of four inverters while varying $V_{th}$. (a) Delay (s). (b) Active power (W). (c) Static power (W).

consumption of the base case severely hurts active power consumption. This result emphasizes the importance of the leakage power reduction techniques in sub-0.1-μm technology.

C. Impact of Transistor Width

The sleepy stack technique comes with some area overhead. Therefore, we explore the impact of transistor width variation using three state-saving techniques, i.e., base case (conventional CMOS), forced stack, and sleepy stack. Although increasing transistor width reduces gate internal resistance, the increased transistor width increases gate input capacitance. Therefore, we need to carefully size transistor width to reduce overall delay. We set $V_{th}$ of the base case and the sleepy stack technique to [...] V while using [...] V for the forced stack technique, since the forced stack technique with high-$V_{th}$ increases delay dramatically as observed in Fig. 19(a). We set the temperature to 25 °C. Also, we use [...] as total load capacitance. The results show that inverter chain delay decreases as transistor width increases. However, delay reduction saturates due to the increased gate input capacitance. In Fig. 19(a), initially the delays of the base case and the sleepy stack inverter are different. However, as transistor width increases, sleepy stack shows noticeable


Fig. 19. Results from a chain of four inverters while varying width. (a) Delay (s). (b) Area (μm²). (c) Active power (W). (d) Static power (W).

delay reduction, and the sleepy stack and the base case achieve similar delay using 5× transistor width. From Fig. 19(b), the sleepy stack inverter with 1× width is 72% and 51% larger than the base case and the forced stack, respectively. Since the sleepy stack technique comes with some area penalties, we find, by increasing transistor width, a forced stack design that has the same area as the sleepy stack design. The forced stack inverter has similar area to the sleepy stack when 2× transistor widths are applied. The forced stack with 2× width shows similar delay to the sleepy stack technique with 1× width, but also shows 430× larger leakage power consumption.

D. Summary of Experimental Results for Generic Logic Circuits

We compare the sleepy stack technique to existing techniques in terms of delay, dynamic power, leakage power, and area. The empirical analysis in Section V-B shows that we can increase sleepy stack $V_{th}$ up to 2.1× while maintaining equal or less delay than the forced stack technique; with [...], the sleepy stack achieves 102× less leakage power consumption than the forced stack approach. We apply the sleepy stack technique to a chain of four inverters, a 4:1 multiplexer, and a 4-bit adder, achieving up to 200× leakage power reduction compared to the forced stack technique with between 50% and 120% area penalty.

VI. CONCLUSION

In sub-0.1-μm CMOS technology, subthreshold leakage power consumption can be nearly equal to dynamic power consumption; thus, effective handling of leakage power is a great challenge. In this paper, we present a new circuit structure named “sleepy stack” to help tackle the leakage problem. The sleepy stack has a combined structure of two well-known low-leakage techniques: the forced stack and sleep transistor techniques. However, unlike the forced stack technique, the sleepy stack technique can utilize high-$V_{th}$ transistors without incurring large delay overhead. Also, unlike the sleep transistor technique, the sleepy stack technique can retain exact logic state while achieving similar leakage power savings. In short, our sleepy stack structure achieves ultra-low leakage power consumption while retaining state.

In conclusion, we have explored a high-impact and heavily researched area: low-power VLSI design. The sleepy stack has been shown to have significant impact. For systems spending a large percentage of time in sleep mode yet requiring ultra-fast wakeup through maintenance of precise logic state, sleepy stack may provide the best solution currently known in VLSI design, typically resulting in approximately two orders of magnitude less leakage power than the best of all prior known state-saving VLSI design approaches.

REFERENCES

[1] “International Technology Roadmap for Semiconductors,” Semiconductor Industry Association, 2005. [Online]. Available: http://public.itrs.net
[2] N. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, “Leakage current: Moore’s Law meets static power,” IEEE Comput., vol. 36, no. 12, pp. 68–75, Dec. 2003.
[3] G. D. Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994.
[4] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS,” IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 847–854, Aug. 1995.


[5] K.-S. Min, H. Kawaguchi, and T. Sakurai, “Zigzag super cut-off CMOS (ZSCCMOS) block activation with self-adaptive voltage level controller: An alternative to clock-gating scheme in leakage dominant era,” in IEEE Int. Solid-State Circuits Conf., 2003, pp. 400–401.
[6] S. Narendra, V. De, S. Borkar, D. Antoniadis, and A. Chandrakasan, “Scaling of stack effect and its application for leakage reduction,” in Proc. Int. Symp. Low Power Electron. Des., 2001, pp. 195–200.
[7] M. Johnson, D. Somasekhar, L.-Y. Chiou, and K. Roy, “Leakage control with efficient use of transistor stacks in single threshold CMOS,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 1, pp. 1–5, Feb. 2002.
[8] N. Hanchate and N. Ranganathan, “A new technique for leakage reduction in CMOS circuits using self-controlled stacked transistors,” in Proc. 17th Int. Conf. VLSI Des., 2004, pp. 228–233.
[9] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473–484, Apr. 1992.
[10] J. P. Uyemura, CMOS Logic Circuit Design Second Edition. Norwell, MA: Kluwer, 1999.
[11] J. Park, V. J. Mooney, and P. Pfeiffenberger, “Sleepy stack reduction in leakage power,” in Proc. Int. Workshop Power Timing Modeling, Optimiz. Simulation, 2004, pp. 148–158.
[12] J. Park, “Sleepy stack: A new approach to low power VLSI and memory,” Ph.D. dissertation, Sch. Elect. Comput. Eng., Georgia Inst. Technol., Atlanta, 2005. [Online]. Available: http://etd.gatech.edu/theses/available/etd-07132005-131806/
[13] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Santa Clara, CA: Addison-Wesley, 1992.
[14] Y. Cao, T. Sato, D. Sylvester, M. Orshansky, and C. Hu, “New paradigm of predictive MOSFET and interconnect modeling for early circuit design,” in Proc. IEEE Custom Integr. Circuits Conf., 2000, pp. 201–204.
[15] N. Azizi, A. Moshovos, and F. Najm, “Low-leakage asymmetric-cell SRAM,” in Proc. Int. Symp. Low Power Electron. Des., 2002, pp. 48–51.


Jun Cheol Park received the B.E. degree in electrical engineering from Soongsil University, Seoul, Korea, in 1993, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 2003 and 2005, respectively. He is currently a Component Design Engineer at the Intel Corporation, Folsom, CA, where he is developing low power and high-speed digital circuits. His research interests include low-power VLSI circuits and computer architectures and computer-aided design (CAD) of VLSI.

Vincent J. Mooney III (S’94–M’98–SM’04) received the B.S. degree from Yale University, New Haven, CT, in 1991, where he double majored in electrical engineering and computer science, and the M.S. degree in electrical engineering, M.A. degree in philosophy, and Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 1994, 1997, and 1998, respectively. He has worked at Bell Laboratories (Lucent), Allied Signal Aerospace VLSI Design Group, Hughes Network Systems, and Redwood Design Automation (Cadence). He is currently an Associate Professor in the School of Electrical and Computer Engineering and an Adjunct Associate Professor in the College of Computing, Georgia Institute of Technology, Atlanta. During the 1991–1992 school year he did research on real-time vision systems at the “Centro de Estudios e Investigaciones Tecnicas” (CEIT), San Sebastian, Spain. He is Codirector of the Center for Research in Embedded Systems and Technology (CREST) at Georgia Tech. His research interests include computer-aided design (CAD) of integrated circuits with a particular emphasis on hardware-software codesign, real-time operating systems, and power-aware architectures and compilers. Prof. Mooney was awarded the NCAA Postgraduate Scholarship upon his graduation in 1991. He was also a recipient of the National Science Foundation CAREER Award.