INTEGRATION, the VLSI journal 44 (2011) 175–184
Low power finite state machine synthesis using power-gating
Sambhu Nath Pradhan (a,*), M. Tilak Kumar (b), Santanu Chattopadhyay (b)
(a) Department of ECE, NIT Agartala, 799055 Tripura, India
(b) Department of E and ECE, IIT Kharagpur 721302, West Bengal, India
Article history: Received 2 March 2010; received in revised form 26 November 2010; accepted 3 March 2011; available online 21 March 2011.
Keywords: Low power; Finite state machine; Power-gating; Genetic algorithm
Abstract
Power-gating turns off the power supply of a portion of the circuit completely, resulting in total elimination of power consumption for that part. However, it also necessitates that the sub-circuit to be activated should be charged for some time before its activation. This critical issue can influence the decomposition of a finite state machine (FSM) for its power gated implementation. In this paper we have presented a power-gating method that integrates FSM partitioning with state encoding, thus providing a total solution to the problem of power-aware FSM synthesis. It shows better results, in terms of dynamic and leakage power consumption, compared to the existing techniques reported in the literature. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Many emerging computing and communication devices are control dominated. Even in designs containing a good number of datapath elements, a sizeable portion is occupied by the controller. As the devices are mostly portable and hand-held, reducing power dissipation (and the associated heat generation and cooling overheads) has emerged as the primary concern of today's VLSI designers. While the datapath elements can be shut down when they are not being used, controllers are always active. As a result, a good amount of system power is consumed by the controller. Controllers are mostly implemented as finite state machines (FSMs). Thus, power-efficient synthesis of FSMs has become a very important problem domain, attracting many researchers. This paper develops an approach to synthesize FSMs that consume less power. It uses the power-gating technique, which can reduce both the dynamic and the leakage power consumed by the circuit.

State assignment is one of the most crucial steps in the synthesis of FSMs. As a result, a lot of work has been done on FSM synthesis and state encoding targeting low power [1–5]. A physical partitioning of the FSM has been proposed in [1], with a considerable increase in circuit area. The traded-off area results in dynamic power saving via switching activity reduction. However, leakage power, being dependent on the area of the circuit, may increase significantly. The work [2] presents a state encoding algorithm that
(Corresponding author: S. Nath Pradhan. 0167-9260/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2011.03.003)
minimizes the Boolean distance between the codes of states with a high transition probability, using a probabilistic description of the FSM. This work also does not handle leakage power. The works [6,7] discuss low power FSM synthesis using an appropriate mix of D- and T-flip-flops. As a judicious mix is used, combinational area and power are reduced. However, T-flip-flops require more area than D-flip-flops. Thus, the area may increase, affecting the leakage power consumption. In [8,9], partition directed state-encoding algorithms have been presented. Both approaches target only dynamic power saving. Moreover, in [8], the area overhead is about 77% of the normal FSM implementation. State decomposition has been addressed in [10]. That work is not generic in nature, as it only looks into FPGA based realization of the resulting circuit. The work presented in [11] exploits a fundamental and important source of power reduction: shutting down useless parts of a sequential circuit. This can lead to both dynamic and leakage power reduction. But the work does not provide any detailed experimental results on power savings. In [12,13], the clock gating technique has been reported for low power FSM decomposition. Clock-gating can reduce dynamic power considerably, but leakage power cannot be controlled via this technique.

All these works ignore the leakage power consumed by the circuit, on the premise that leakage power is insignificant compared to the dynamic power. On the other hand, the International Technology Roadmap for Semiconductors (ITRS) projects an exponential increase in leakage power with the miniaturization of devices [14]. The present paper addresses both dynamic and leakage power saving using power-gating. Power-gating is a technique for saving both leakage and switching power by shutting off the idle blocks of the circuit.
Nowadays, integrated power gating has emerged as a primary knob for balancing the needs for high performance and low standby power during periods of circuit inactivity. Many works have incorporated
power-gating as the tool to reduce power. An analysis of the efficiency of power-gating for Clocked Storage Elements (CSEs) has been presented in [15]. It examines the energy savings of standard clock gated CSEs versus their power gated counterparts. The application of power gating circuits to semicustom design based on standard-cell elements is given in [16]. The authors propose a new power network architecture that enables the use of conventional standard-cell elements for the combinational logic. Power gating for the structured ASIC, implemented using existing standard cell design tools, has been introduced in [17]. Designing a power-gating structure with high performance in the active mode, and low leakage and short wakeup time in the standby mode, is an important and challenging task. The works [18,19] address this issue. The authors of [19] present a tri-modal switch cell that enables the implementation of multimodal power gating, including active, data-retentive drowsy, and deep sleep modes. A per-core power gating architecture for multicore processors is presented in [18]. The impact of power gating depends on the workload running on the processor. Power saving employing power gating with varying workload on processors has been explored in [18,20]. Another variation in power gating design, namely reconfigurable power gating, has been utilized in the design of a precision multiplier in [21]. The ITRS roadmap has projected that in the submicron era, the delay of an interconnect wire segment is expected to dominate over the delay of logic blocks. Power gating for wires in the form of bus segmentation has been explained in [22]. A power-gating structure for the reduction of peak current and voltage glitch in a System-on-Chip environment has been presented in [23]. Power gating with multiple sleep modes has been proposed in [24]. All the power gating techniques discussed above consider the microarchitectural level of circuit and system design.
There is not much work reported on the application of power gating at the higher level of system design, in the form of finite state machine (FSM) synthesis. The work [25] proposes a power-gating technique for FSM decomposition targeting low power. The FSM is partitioned into two or more sub-machines, only one of which is active most of the time. The power supply of the other sub-machines can be cut off to save energy. Since adjustment of the supply voltage may not finish instantly, the supply voltage of a sub-machine must be raised ahead of the time at which it is activated. This makes the partitioning problem complicated. In [25], a simulated annealing based algorithm has been proposed to perform the decomposition without timing penalty. However, the method presented in [25] does not consider state encoding. Detailed experimental results are also not presented. In our paper, we consider the problems of FSM decomposition and state encoding together, targeting low power dissipation with the power-gating technique.

In the light of the above discussion, the salient contributions of this paper are as follows:
1. It integrates the two problems, partitioning and state encoding, into a genetic algorithm based formulation.
2. For the resulting partitions and encoding, it compares the following:
   (a) Dynamic power consumed by the combinational part of the circuit, as estimated by SIS [26], with the same resulting from NOVA [27].
   (b) Total power for the partitions, as obtained via synthesis using the Synopsys Design Vision tool, with the same for circuits resulting from state encoding techniques like NOVA [27] and JEDI [28]. This includes both dynamic and leakage power.
   (c) Assuming the combinational logic to be realized as a 2-level Pseudo-NMOS PLA, the leakage and switching activity have been computed and compared with the NOVA encoded circuit. The same can be done with other design styles as well, such as static and dynamic CMOS, once the corresponding library characterizations are performed.
   (d) Power has also been compared with other existing low-power state encoding approaches.

The rest of the paper is organized as follows. Section 2 presents the power-gating structure. Section 3 illustrates the FSM partitioning and encoding strategy using a genetic algorithm (GA) approach. Section 4 gives the modeling strategy for dynamic and leakage power. Section 5 details the experimental results of our power-gating technique and compares them with the existing works. Finally, Section 6 concludes the paper.
2. Power-gating implementation

A common way to shut down the supply voltage for part of a circuit is to use sleep transistors (Fig. 1). A sleep transistor is added between Vdd and the circuit (or between the circuit and ground). The drain electrode of the sleep transistor forms a virtual Vdd, the voltage of which is grid-controlled.

An FSM is defined as a tuple $M = \{S, \Sigma, \Delta, \delta, \Theta, s_0\}$, where $S$ is the finite set of states, $\Sigma$ is the finite set of inputs, $\Delta$ is the finite set of outputs, $\delta: S \times \Sigma \to S$ is the state transition function, $\Theta: S \times \Sigma \to \Delta$ is the output function and $s_0$ is the initial state. A finite state machine can be represented using a state transition graph (STG), denoted as $G = (V, E)$, where $V$ is the set of vertices and $E = \{(a, b) \mid a, b \in V\}$ is the set of edges. Each node in $V$ corresponds to a unique state and the edges represent transitions between the states.

Since FSMs are used mostly to realize the controller portion of a system, a good power-aware design of them is essential. The FSM may consist of a large number of states, and the states often form clusters. The controller normally operates within a cluster, switching to a new cluster at some point of time. Thus, a power-aware FSM synthesis strategy can exploit this situation and realize the clusters separately. This enables shutting down the power to the cluster(s) that is (are) not active. However, in case of inter-cluster transitions, special care needs to be taken to avoid the delay in waking up the circuit. In this work, we have considered a bipartitioning strategy to achieve power minimization via power-gating.

2.1. Partitioning strategy

The scheme is shown in Fig. 2. Here, a given FSM F with n states $S = \{s_1, s_2, \ldots, s_n\}$ is partitioned into two subFSMs F1 (with states $S_1 = \{s_{11}, s_{12}, \ldots, s_{1k}\}$) and F2 (with states $S_2 = \{s_{21}, s_{22}, \ldots, s_{2l}\}$) such that $S_1 \cup S_2 = S$ and $S_1 \cap S_2 = \emptyset$. As shown in Fig. 2, there is a common state register holding the state bits for both the FSMs.

If set S1 has k states and S2 has l states, then the number of state bits needed to represent the states is given by $1 + \max(\lceil \log_2 k \rceil, \lceil \log_2 l \rceil)$. The extra bit is used to identify the machine whose output determines the primary outputs and the next-state values. The logic COMB1 realizes the combinational part of F1 while that of F2 is realized by COMB2. Power-gating transistors are inserted in the supply lines of these blocks. As we will see later, it is sometimes
Fig. 1. PMOS and NMOS based power-gating design.

Fig. 2. Proposed circuit architecture for power-gating.
necessary that both the machines be ON simultaneously for some states of the FSM, while for the remaining states, only one of the two submachines needs to be ON. This is accomplished by the "Enable logic" block. Depending upon the values on the present state lines, it asserts EP1 and/or EP2, allowing the Vdd supply to reach the corresponding combinational logic. For the submachine active at a time instant, the latch at its input is enabled (via signal LE) to allow the primary input changes and state transitions to reach the corresponding combinational logic. It may be noted that even when both the submachines are powered, only one of them computes the primary output and the next-state values. Multiplexers are used at the outputs of the combinational blocks to select the proper primary outputs and next state bits. Next, we state some definitions [25] that enable us to formulate the partitioning and state assignment problem.

Inner transition: a transition from one state to another, where both the states belong to the same submachine.
Cross transition: a transition from one state to another, where the states belong to different submachines.
c-Expansion: the c-expansion of a state is the set of states that can be reached from it within c steps.
Boundary depth: the number of clock cycles needed to turn ON the submachine to be activated.
Boundary states: a boundary state between two submachines is a state in one submachine that is within the boundary depth of the other machine. We use D(F1, F2) to denote the set of boundary states in F1 leading to F2. The sum of the boundary state probabilities of D(F1, F2) is denoted as PD(F1, F2). Similarly, the sum of the boundary state probabilities of D(F2, F1) is denoted as PD(F2, F1).
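The definitions of c-expansion and boundary states can be illustrated with a minimal Python sketch. It assumes the STG is given as an adjacency dictionary and reads "within the boundary depth" as "some state of the other submachine lies in the c-expansion"; the function names and the five-state ring used as data are hypothetical and are not taken from the paper.

```python
from collections import deque

def c_expansion(stg, start, c):
    """States reachable from `start` in at least 1 and at most c transitions."""
    reached = set()
    depth_of = {start: 0}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        if depth_of[state] == c:
            continue  # do not expand beyond c steps
        for nxt in stg.get(state, ()):
            reached.add(nxt)
            if nxt not in depth_of:  # BFS: first visit is the shortest path
                depth_of[nxt] = depth_of[state] + 1
                frontier.append(nxt)
    return reached

def boundary_states(stg, own, other, depth):
    """D(own, other): states of `own` whose c-expansion touches `other`."""
    return {s for s in own if c_expansion(stg, s, depth) & other}

# Hypothetical 5-state ring A -> B -> C -> D -> E -> A, split into two parts.
ring = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["A"]}
part1, part2 = {"A", "B", "C"}, {"D", "E"}
print(sorted(boundary_states(ring, part1, part2, depth=2)))  # ['B', 'C']
```

With a boundary depth of 1 only state C would remain in the boundary set, which shows how a longer wake-up time enlarges the region in which both submachines must be powered.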
Example. Fig. 3 is an example showing the bipartitioning of a state transition graph (STG) into F1 and F2. For this example, the transitions between the states of F1 = {S1, S2, S4, S5, S8, S9} and the transitions between the states of the submachine F2 = {S3, S6, S7, S10, S11} are the inner transitions. The transitions from state S3 to S2, S3 to S4, S6 to S1, S5 to S6, S5 to S10, S9 to S10 and S11 to S2 are the cross transitions. If the boundary depth is taken as 2, then the set of boundary states in the subFSM F1 is D(F1, F2) = {S5, S9, S8, S1} and for F2, D(F2, F1) = {S3, S6, S7}.

Fig. 3. Bipartition of a simple STG.

When the circuit is in a boundary state, both of the submachines need to be turned ON because of the recovery time needed to turn ON the machine. In such cases, excessive energy will be consumed. Therefore, it is important to reduce the probability that the FSM is in a boundary state. In order to evaluate the quality of a bipartition, we estimate the power consumption of the overall circuit. The total power can be stated as

$$\text{Gated power} = P(F_1)\,\mathrm{Power}(F_1) + P(F_2)\,\mathrm{Power}(F_2) + P_D(F_1, F_2)\,\mathrm{Power}(F_2) + P_D(F_2, F_1)\,\mathrm{Power}(F_1) \qquad (1)$$
where, P(F1) is the probability that the current state belongs to F1, P(F2) is the probability of the present state being in F2. PD(F1, F2) and PD(F2, F1) are as defined earlier. The first two terms of Eq. (1) are the contributions of submachines that are active. The last two terms represent the power consumed when a submachine is not active but is in the recovery process. The quality of the power-gating circuit depends on the partitioning and encoding of the states of FSM. We have used a genetic algorithm [29] based strategy for partitioning of FSM states and their encoding. The following section describes this partitioning and state encoding approach.
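As a worked illustration of Eq. (1), the following minimal Python sketch evaluates the gated power for one bipartition. The function name and the numeric values are hypothetical (chosen only to make the arithmetic easy to follow) and are not results from the paper.

```python
def gated_power(p_f1, p_f2, pd_12, pd_21, power_f1, power_f2):
    """Eq. (1): power of the active submachine plus the power spent
    on a submachine that is being woken up in a boundary state."""
    active = p_f1 * power_f1 + p_f2 * power_f2
    recovery = pd_12 * power_f2 + pd_21 * power_f1
    return active + recovery

# Hypothetical values: F1 is active 75% of the time, F2 25% of the time.
total = gated_power(p_f1=0.75, p_f2=0.25, pd_12=0.125, pd_21=0.0625,
                    power_f1=400.0, power_f2=800.0)
print(total)  # 300 + 200 + 100 + 25 = 625.0
```

The last two terms grow with the boundary state probabilities, which is why a partition with few, rarely visited boundary states is preferred.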
3. Genetic algorithm for partitioning

Genetic algorithms (GAs) [29] are a class of related stochastic search and optimization techniques. GAs are capable of operating within domains that are traditionally thought to be difficult to optimize. A GA is a global optimization technique that is able to identify globally optimal or near optimal solutions. The key idea behind GAs is to emulate the way nature uses evolution. By mimicking natural genetic processes, GAs are able to evolve solutions to real world problems, if they have been suitably encoded. GAs work with a population of individuals, each of which represents a possible solution to a given problem. A population of a fixed number of chromosomes is maintained. Each chromosome is assigned a fitness score according to the merit of the corresponding solution. Individuals with a high fitness score are given opportunities to reproduce with other members of the population. This produces new individuals as 'offspring', who display features that are taken from each parent. The least fit members of the population are less likely to be selected for reproduction, and so they will die out. By favouring mating between the more fit individuals, the most promising areas of the search space are explored. If the GA has been designed correctly, the population will converge to an optimal solution to the problem. The problem solution using an evolutionary GA is shown in Fig. 4. A standard GA proceeds as follows:
1. Randomly generate an initial population.
2. Define the selection criteria of the chromosomes for reproduction.
3. Compute the fitness for each chromosome in the current population.
4. Produce the offspring in the presence of variation inducing operators such as mutation and crossover.
5. Repeat steps 2, 3 and 4 until a satisfactory solution is obtained.
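The five steps above can be sketched as a generic loop. This is a minimal illustration on a toy bit-counting problem, not the partitioning GA of this paper: the selection rule (fitter half of the population), the operators and all names are assumptions made for the example.

```python
import random

def run_ga(fitness, new_individual, crossover, mutate,
           pop_size=20, generations=30, seed=0):
    """Generic GA loop; returns the fittest individual seen so far."""
    rng = random.Random(seed)
    population = [new_individual(rng) for _ in range(pop_size)]   # step 1
    best = max(population, key=fitness)
    for _ in range(generations):                                  # step 5
        ranked = sorted(population, key=fitness, reverse=True)    # step 3
        parents = ranked[: max(2, pop_size // 2)]                 # step 2 (assumed rule)
        children = []
        while len(children) < pop_size:                           # step 4
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
        best = max(population + [best], key=fitness)
    return best

# Toy problem (hypothetical): maximize the number of 1s in a 16-bit string.
N_BITS = 16

def one_max(bits):
    return sum(bits)

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(N_BITS)]

def one_point(a, b, rng):
    cut = rng.randrange(1, N_BITS)
    return a[:cut] + b[cut:]

def flip_one(bits, rng):
    child = list(bits)
    child[rng.randrange(N_BITS)] ^= 1
    return child

best = run_ga(one_max, random_bits, one_point, flip_one)
print(one_max(best), best)
```

Because the best individual seen so far is always retained, the returned fitness can never be worse than that of the initial random population.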
In order to apply a GA to any optimization problem, it is necessary to define an individual solution representation (that is, a chromosome), a mutation operator, a crossover operator, a fitness function and a selection procedure. Next, we enumerate these for the partitioning and state assignment problem.

Solution structure: we have used a chromosome with two parts, the partition part and the state encoding part. For an N-state FSM, the partition part is an N-bit array, whereas the state encoding part is an array of N integers. Corresponding to the ith state, the partition bit identifies the partition to which the state belongs (partition 1 if the bit is '0' and partition 2 otherwise). The state encoding part identifies the codes assigned to the states. The most significant bit (MSB) of the state code is set to '0' for states in partition 1, whereas the bit is set to '1' for states in partition 2. Apart from this MSB, the number of bits needed for encoding is given by $\max(\lceil \log_2 k \rceil, \lceil \log_2 l \rceil)$, where k is the number of states in partition 1 and l is the number of states in partition 2.

Genetic operators: three genetic operators, namely crossover, mutation and direct copy, have been used to evolve the new generations.

Crossover: to select the chromosomes participating in crossover, first, the whole population is sorted according to the fitness value. A certain percentage of the population (here, 20%) with better fitness values is considered to be the "best class". The population size is taken to be 100. To select a chromosome participating in crossover, first a random number between 0 and 9 is generated. If the number is less than 6, a chromosome from the best class is selected randomly. Otherwise a chromosome is selected from the entire population. This approach of selecting more fit chromosomes to participate in crossover leads to the generation of better offspring than truly random selection. The other parent is selected similarly.
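The code width and the biased parent selection described above can be sketched in Python as follows. The helper names are hypothetical, the low-order part of a state code is modelled as a plain integer, and the population is assumed to be already sorted from best to worst fitness.

```python
import random

def ceil_log2(n):
    """Smallest b with 2**b >= n (0 for n <= 1)."""
    return max(n - 1, 0).bit_length()

def state_bits(k, l):
    """One partition-select MSB plus enough bits for the larger partition."""
    return 1 + max(ceil_log2(k), ceil_log2(l))

def full_code(partition_bit, low_code, k, l):
    """State code with the MSB set to the partition bit (0 or 1)."""
    width = max(ceil_log2(k), ceil_log2(l))
    return (partition_bit << width) | low_code

def select_parent(ranked, rng, best_percent=20):
    """Draw 0..9; below 6 pick from the best class, else from everyone."""
    best_class = ranked[: max(1, len(ranked) * best_percent // 100)]
    if rng.randint(0, 9) < 6:
        return rng.choice(best_class)
    return rng.choice(ranked)

# Partition sizes of the Fig. 3 example: 6 states in F1 and 5 in F2.
print(state_bits(6, 5))                      # 4
print(format(full_code(1, 0b010, 6, 5), "04b"))  # 1010
parent = select_parent(list(range(100)), random.Random(0))
```

With these settings a parent comes from the best class with probability 0.6 directly, plus the chance of drawing a best-class member from the whole population in the remaining cases.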
Next, a parametric crossover [29] is used to create an offspring. A random number between 0 and 9 is generated. If the number is above 6, the first position of the new chromosome is taken from the first parent; otherwise, it is taken from the first position of the second parent. Similarly, all the positions of the newly created chromosome are filled by generating random numbers. The resultant chromosome is modified to satisfy the requirement that each entry in the state code part be unique. 70% of the total population has been generated via the crossover operation.

Mutation: to perform the mutation, two random numbers are generated between 1 and the width of the chromosome. Between these two generated positions, the entries are changed. Positions in the first part are toggled from 0 to 1 or from 1 to 0, while entries in the second part are divided by 2. The resulting chromosome is modified to satisfy the uniqueness criterion. Here, we have taken the mutation rate as 10% of the total population.

Direct copy: the basic GA has been modified so that the solution does not degrade between generations. To ensure this, the best 20% of the chromosomes are directly copied to the next generation.

Termination: the GA terminates when there is no improvement in the result over the previous 40 generations.

Fitness measure: as we are doing power optimization, the fitness should reflect the power consumption. For this purpose, the combinational logic parts corresponding to the two partitions are extracted and minimized using the two-level minimizer Espresso [30], and the corresponding dynamic power values are computed using the "power_estimate" command of SIS [26] with appropriate options. These two values are then combined using Eq. (1). It may be noted that though the total power of a circuit has contributions from both dynamic and leakage components, we have used the estimate of dynamic power only. This simplifies the fitness calculation. Moreover, as the idle component is switched off, the leakage power is also reduced considerably. This is reflected in our experimental results.

Fig. 4. Problem solution using genetic algorithm (blocks: coding of solutions, objective function, genetic operators, problem specific knowledge, fitness assignment, selection, crossover, mutation, replication).
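The crossover and mutation operators can be sketched as below. The text does not say how a chromosome is modified to restore uniqueness, so the `repair` step here (give each duplicate or out-of-range code the smallest unused code of its partition) is purely an assumption for illustration, as are all function names. A chromosome is modelled as a pair (partition bits, low-order codes), and positions are 0-indexed.

```python
import random

def bits_for(n):
    """Smallest b with 2**b >= n (0 for n <= 1)."""
    return max(n - 1, 0).bit_length()

def repair(partition, codes):
    """Assumed repair rule: make codes unique within each partition."""
    k = partition.count(0)
    width = max(bits_for(k), bits_for(len(partition) - k))
    fixed = list(codes)
    for part in (0, 1):
        used, pending = set(), []
        for i, p in enumerate(partition):
            if p != part:
                continue
            if fixed[i] < (1 << width) and fixed[i] not in used:
                used.add(fixed[i])      # keep the first valid occurrence
            else:
                pending.append(i)       # duplicate or out of range
        free = (c for c in range(1 << width) if c not in used)
        for i in pending:
            fixed[i] = next(free)
    return fixed

def parametric_crossover(p1, p2, rng):
    """Per position: draw 0..9, above 6 copy parent 1, else parent 2."""
    def pick(a, b):
        return a if rng.randint(0, 9) > 6 else b
    part = [pick(a, b) for a, b in zip(p1[0], p2[0])]
    codes = [pick(a, b) for a, b in zip(p1[1], p2[1])]
    return part, repair(part, codes)

def mutate(chrom, rng):
    """Between two random positions: flip partition bits, halve codes."""
    part, codes = list(chrom[0]), list(chrom[1])
    n = len(part)
    lo, hi = sorted(rng.randrange(2 * n) for _ in range(2))
    for pos in range(lo, hi + 1):
        if pos < n:
            part[pos] ^= 1
        else:
            codes[pos - n] //= 2
    return part, repair(part, codes)

rng = random.Random(1)
parent1 = ([0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2])
parent2 = ([0, 1, 0, 1, 0, 1], [2, 0, 1, 1, 0, 2])
child = mutate(parametric_crossover(parent1, parent2, rng), rng)
print(child)
```

Since the MSB of a full state code is the partition bit, uniqueness of the low-order codes within each partition is enough to make the full codes unique.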
4. Power modeling

In this section we present our model for estimating the dynamic and leakage power consumption of the final circuit, for the sake of comparison with other FSM synthesis techniques. We assume that the combinational logic has been implemented in a two-level Pseudo-NMOS style. It may be noted that our fitness calculation targets two-level realization, and that PLAs are mostly implemented in Pseudo-NMOS NOR–NOR style. However, other design styles can be followed as well with proper power modeling.

4.1. Switching power estimation

4.1.1. Dynamic power of combinational part

To estimate the dynamic power of the combinational logic, we need to compute the expected switching activity of the logic gates. We assume that the primary inputs are uncorrelated and statistically independent of each other. In this case, the probability of a primary input being ON is 0.5. To compute the ON-probabilities of the present state lines, we first compute the steady-state probabilities of the individual states of the FSM. For this, the FSM is modeled as a Markov chain [31], and a set of linear Chapman–Kolmogorov equations is formulated with a normalization criterion and solved. This is given in the following.
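The Markov-chain step described above can be sketched in plain Python. The sketch assumes the FSM has already been reduced to a row-stochastic matrix `T`, where `T[i][j]` is the probability of moving from state i to state j under the input probabilities; the matrix, the function names and the two-state example are hypothetical. The linear system (balance equations plus normalization) is solved by Gauss–Jordan elimination.

```python
def steady_state(T):
    """Solve pi = pi*T with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(T)
    # Balance equations for the first n-1 states, augmented with the RHS.
    A = [[T[i][j] - (1.0 if i == j else 0.0) for i in range(n)] + [0.0]
         for j in range(n - 1)]
    A.append([1.0] * n + [1.0])          # normalization: probabilities sum to 1
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("no unique steady state for this chain")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [x / pv for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

def bit_on_probability(pi, codes, bit):
    """Sum of steady-state probabilities of states whose code has `bit` set."""
    return sum(p for p, c in zip(pi, codes) if (c >> bit) & 1)

# Hypothetical two-state chain with state codes 0 and 1.
T = [[0.5, 0.5],
     [0.25, 0.75]]
pi = steady_state(T)
print(pi)                                 # approximately [1/3, 2/3]
print(bit_on_probability(pi, [0, 1], 0))  # approximately 2/3
```

The second function corresponds to the rule that a state-bit line is ON with a probability equal to the summed steady-state probabilities of all states whose code has that bit set.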
4.1.1.1. Steady state probability calculation. It is possible to compute the steady state probabilities of the states and the transition probabilities from the specific input line probabilities. The steady state probability P(si) is the probability of the finite state machine being in the state si at a time instant. The state transition probability for the transition from si to sj is defined as P(sij) and it is computed as follows:

$$P(s_{ij}) = P(s_i)\,P(k) \qquad (2)$$

where P(k) represents the probability of the primary input combination holding true for which the transition from si to sj takes place. The steady state probabilities of the FSM states are calculated using the following equations:

$$\sum_i P(s_i) = 1, \qquad P(s_i) = \sum_j P(s_{ji})$$

This system of equations is known as the Chapman–Kolmogorov equations [32] for a discrete-time discrete-transition Markov process. By solving this set of linear equations using the Gauss–Jordan elimination method, the steady state probabilities have been determined.

Now, the probability of a particular state-bit line being 1 is taken to be equal to the sum of the steady-state probabilities of all those states whose codes have this particular bit turned ON. Moreover, while computing the power of the combinational logic of the first machine, the states belonging to it and the boundary states of the second machine are considered. It may be noted that the machine is ON only in these states. Similarly, for the combinational logic of the second machine, its own states along with the boundary states of the first machine are considered. A similar procedure has been used to compute the OFF probabilities of the state bits feeding the individual combinational logic. It may be noted that the sum of the ON and OFF probabilities of a present state line is not equal to 1, but to the machine ON probability. All the present state line probabilities are then normalized with the machine ON probability to make the sum of the ON and OFF probabilities of the state lines equal to 1.

The ON-probability of a NOR gate in the first level of combinational logic can be computed from the OFF probabilities of all inputs feeding the gate. For this, the OFF probabilities of the primary inputs are taken as 0.5, while for the state-bit lines, they are computed by the procedure noted above. Once the ON-probabilities of each of the NOR gates in the first level are known, the switching activities can be computed. Summing up the switching of all such gates, we get the total switching activity of the first level NOR gates. For the second level NOR gates, the ON-probabilities are computed from the ON-probabilities of the first level NOR gates feeding them. Once the probabilities are known, the switching activities can be obtained. Summing up the switching activities of the first and second levels gives the total switching activity.

4.1.2. Dynamic power of sequential part

To find the dynamic power of the sequential part of the individual machines, all unreachable states are first eliminated, and the graph is transformed into a weighted undirected graph G by collapsing all multiple directed edges between states si and sj into a single undirected edge whose weight wij is the sum of the directed transition probabilities between these two states. The dynamic power consumption is then given by [33]

$$C_{dynamic} = \sum_{ij} w_{ij}\, H(C_i, C_j) \qquad (3)$$

where Ci and Cj are the binary codes assigned to states si and sj, respectively, and H(Ci, Cj) is the Hamming distance between Ci and Cj.

4.2. Leakage power estimation
4.2.1. Leakage power of combinational part

To calculate the leakage of a NOR gate, we have used the runtime mode leakage considering all input probabilities. Simulation for leakage power has been carried out using Cadence Spectre at 90 nm UMC technology. All the transistors are of the minimum size supported at the 90 nm technology. The transistors are special purpose, low power, low leakage transistors available in the UMC 90 nm technology library. Probability dependent leakage power is given by

$$P_{leakage} = V_{dd} \sum_k S_k I_k \qquad (4)$$

where k ranges over all possible input states of the NOR gate, Sk is the probability of state k, Ik is the leakage current of state k and Vdd is the supply voltage. If the number of inputs to a NOR gate is very large, direct application of Eq. (4) to estimate leakage power becomes infeasible. So we have estimated leakage in a different way, as follows. Simulation of NOR gates with up to 30 inputs has been carried out. The length and width of the pull up transistor are 500 and 120 nm, respectively. All the pull down transistors have a length and width of 90 and 135 nm, respectively. In all the cases, it has been observed that when all the inputs are zero, the leakage power is approximately equal to the number of inputs times 4.518 pW, which is the off state leakage power of a single input NOR gate. It is also observed that the leakage powers of all non-zero patterns are very close to each other (about 5.92 µW for a 3-input gate). This is expected because, with the output being 0, the effective drain voltage is very small. Turning on more transistors cannot increase the current flow significantly. Thus, we have computed the average leakage power over the input patterns of a gate other than the all zero pattern. Then, the average leakage value is extrapolated to get the average leakage power of a NOR gate with any higher number of inputs. The leakage power of any NOR gate is thus given by

$$P_{NOR} = p_0\, P_{zero} + (1 - p_0)\, P_{avg}$$

where $p_0$ is the all zero probability, $P_{zero}$ is the leakage power due to all zero inputs and $P_{avg}$ is the average leakage value.

4.2.2. Leakage power of sequential part

Consider a general sequential circuit with I inputs, O outputs, $V = \{v_1, v_2, \ldots, v_n\}$ present-state variables, and $U = \{u_1, u_2, \ldots, u_n\}$ next-state variables. A transition from state sa to state sb implies the leakage power [33] in the sequential part of the FSM as

$$\mathrm{Leakage}(s_a, s_b) = \sum_i \{ X \cdot \mathrm{LeakTable}(clk = 1, output = v_i, input = u_i) + (1 - X) \cdot \mathrm{LeakTable}(clk = 0, output = v_i, input = u_i) \}$$

Here, X is the fractional duty cycle, vi is bit i of sa, and ui is bit i of sb. LeakTable provides values for the static power consumption in a given flip-flop as a function of inputs and state. A static cost metric for the FSM (sequential part) can be formed as the sum of all leakage values between two states sa and sb, each weighted with the probability of being at state sa and having sb as the next state. This probability, represented as Pab, can be computed as the steady state probability of state sa multiplied by the transition probability from state sa to state sb. The leakage power of the sequential part is calculated as follows:

$$C_{static} = \sum_a \sum_b \left( P_{ab}\, \mathrm{Leakage}(s_a, s_b) \right) \qquad (5)$$

Since we are targeting low power realization, instead of an ordinary flip-flop, we have used the Hybrid Latch Flip Flop (HLFF), shown in Fig. 5 [34]. The structure has been simulated in Cadence at 90 nm technology and the corresponding leakage values are shown in Table 1.

Fig. 5. Schematic of Hybrid Latch Flip Flop (HLFF).

Table 1
Leakage of HLFF.
CLK  D  Q  Leakage power (nW)
0    0  0  387.9
0    0  1  339.9
0    1  0  475.4
0    1  1  427.0
1    0  0  411.1
1    0  1  356.2
1    1  0  490.3
1    1  1  435.0
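The leakage model of Section 4.2 can be illustrated with a short Python sketch. The table values are those of Table 1, under the assumption that its rows are listed in the order (CLK, D, Q) = 000 to 111; the function names, the 50% duty cycle and the example probabilities are hypothetical. Present-state bits are taken as the flip-flop outputs Q and next-state bits as the inputs D, and the all-zero probability of a NOR gate is taken as the product of the input OFF probabilities, which assumes independent inputs.

```python
# Table 1 values in nW, keyed by (clk, d, q); the row order is an assumption.
HLFF_LEAK_NW = {
    (0, 0, 0): 387.9, (0, 0, 1): 339.9, (0, 1, 0): 475.4, (0, 1, 1): 427.0,
    (1, 0, 0): 411.1, (1, 0, 1): 356.2, (1, 1, 0): 490.3, (1, 1, 1): 435.0,
}

def transition_leakage(sa, sb, n_bits, duty=0.5):
    """Leakage(sa, sb): sum over state bits, weighted by the clock duty cycle."""
    total = 0.0
    for i in range(n_bits):
        v = (sa >> i) & 1   # bit i of the present state -> flip-flop output Q
        u = (sb >> i) & 1   # bit i of the next state    -> flip-flop input D
        total += (duty * HLFF_LEAK_NW[(1, u, v)]
                  + (1.0 - duty) * HLFF_LEAK_NW[(0, u, v)])
    return total

def static_cost(p_ab, n_bits, duty=0.5):
    """Eq. (5): leakage of each transition weighted by its probability."""
    return sum(p * transition_leakage(a, b, n_bits, duty)
               for (a, b), p in p_ab.items())

def nor_leakage(off_probs, leak_per_input_all_zero, avg_leak_nonzero):
    """NOR-gate leakage: all-zero term plus averaged non-zero term."""
    p0 = 1.0
    for p in off_probs:
        p0 *= p             # all inputs OFF, assuming independence
    all_zero_leak = len(off_probs) * leak_per_input_all_zero
    return p0 * all_zero_leak + (1.0 - p0) * avg_leak_nonzero

print(transition_leakage(0b01, 0b10, n_bits=2))   # about 830.9 nW
print(static_cost({(0, 1): 0.5, (1, 0): 0.5}, n_bits=1))
print(nor_leakage([0.5, 0.5], 4.0, 100.0))        # 0.25*8 + 0.75*100 = 77.0
```

The two leakage arguments of `nor_leakage` must be given in the same unit; the values in the last call are round numbers chosen for readability, not measured data.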
5. Experimental results The proposed power-gating decomposition technique has been implemented in C language on a Pentium 4 machine with 3 GHz clock frequency and 1 GB main memory. Table 2 presents the combinational power results obtained by our approach. The details of the columns of Table 2 are as follows. Column I/O/S notes the number of primary inputs, outputs and states. Column PD(F1, F2) is the sum of the steady state probabilities of all the boundary states in F1. Similarly, PD(F2, F1) represents the sum of the steady state probabilities of all the boundary states in F2. Column P(F1) represents the steady state probability of the states in F1. The power-gating results are obtained by estimating the power consumed by the combinational logic of F1 and F2, respectively, and summing them using Eq. (1). For the power calculation we have used the ‘power_estimate’ command of SIS [26]. SIS assumes a supply voltage of 5 V and operating frequency of 20 MHz. It may be noted that SIS gives only the dynamic power estimation (it does not provide the leakage power). Hence, Table 2 notes the dynamic power only for 2-level realization of the resulting combinational circuits. Column NOVA notes the power required by the combinational logic generated by the state assignment tool NOVA [27] with ‘‘ ÿe ioh’’ option. The result shows on an average 35% saving in dynamic power. The maximum savings are for circuits like s820 (78%), s832 (76.9%), s510 (58.6%). Saving is higher for the cases with lower boundary state probabilities. For FSMs with higher boundary state probabilities, both the machines are always on, thus requiring higher power. Circuits, such as s820, s832 and s510 have lower boundary state probabilities, so lower power dissipation. The last column gives the number of generations used in GA beyond which no further improvement in the fitness function could be observed. 
To see the impact of power-gating, we have formulated another GA in which only state encoding is performed for the whole FSM; it does not do any partitioning. The dynamic power consumed by the combinational part of the resulting circuit has been estimated using SIS and noted in the column marked GA of Table 2. As can be computed from the table, the GA along with power-gating requires 25.5% less dynamic power than the GA that performs state encoding only. This clearly establishes the impact of power-gating. The column COMB delay notes the delay of the combinational block corresponding to the state assignment given by NOVA. The column GATED delay notes the maximum of the delays
S. Nath Pradhan et al. / INTEGRATION, the VLSI journal 44 (2011) 175–184
Table 2 State probabilities and combinational power of our power-gating technique.

| Circuit | I/O/S | PD(F1, F2) | PD(F2, F1) | P(F1) | NOVA power (mW) | COMB delay (ns) | GATED power (mW) | GATED delay (ns) | GA power (mW) | GA generation |
|---|---|---|---|---|---|---|---|---|---|---|
| bbara | 4/2/10 | 0.33 | 0.05 | 0.96 | 963.8 | 2.21 | 724 | 2.15 | 738.9 | 93 |
| bbtas | 2/2/6 | 0.42 | 0.35 | 0.42 | 207.1 | 1.86 | 171.25 | 1.63 | 172 | 23 |
| dk14 | 3/5/7 | 0.57 | 0.43 | 0.57 | 1355 | 3.14 | 1255.9 | 2.70 | 1289 | 24 |
| dk16 | 2/3/27 | 0.55 | 0.45 | 0.55 | 3001.6 | 3.58 | 2434.6 | 3.82 | 2442.8 | 241 |
| dk512 | 1/3/15 | 0.43 | 0.43 | 0.57 | 611.1 | 2.49 | 290.21 | 2.49 | 328.2 | 178 |
| dk27 | 1/2/7 | 0.67 | 0.33 | 0.67 | 296.9 | 2.15 | 107.3 | 1.30 | 138.89 | 124 |
| donfile | 2/1/24 | 0.67 | 0.33 | 0.67 | 1549.6 | 3.55 | 1084.2 | 2.67 | 1265 | 138 |
| ex4 | 6/9/14 | 0.39 | 0.26 | 0.65 | 540 | 2.64 | 485.7 | 2.46 | 637.5 | 173 |
| modulo12 | 1/1/12 | 0.17 | 0.17 | 0.67 | 229.4 | 2.40 | 185.28 | 2.13 | 174 | 289 |
| sand | 11/9/32 | 0.33 | 0.12 | 0.41 | 11373.1 | 3.45 | 5022.8 | 3.30 | 8627.7 | 223 |
| s1 | 8/6/20 | 0.39 | 0.16 | 0.84 | 4921.9 | 3.69 | 3474.69 | 3.02 | 4363.3 | 218 |
| s208 | 11/2/18 | 0.06 | 0.004 | 0.99 | 422.5 | 1.90 | 264.6 | 1.35 | 824 | 85 |
| s386 | 7/7/13 | 0.24 | 0.007 | 0.99 | 1663.9 | 3.31 | 813.05 | 2.69 | 637.5 | 184 |
| s510 | 19/7/47 | 0.17 | 0.122 | 0.53 | 6787.7 | 3.45 | 2608.45 | 3.62 | 4574 | 228 |
| s820 | 18/19/25 | 0.03 | 0.01 | 0.99 | 3057.3 | 4.32 | 662.3 | 3.48 | 3238.8 | 152 |
| s832 | 18/19/25 | 0.03 | 0.01 | 0.99 | 2932.3 | 3.10 | 664.94 | 2.52 | 3304.1 | 209 |
Table 3 Total dynamic and leakage power results of the final power-gated circuit.

| Circuit | GATED_SW (mW) | NOVA_SW (mW) | JEDI_SW (mW) | GATED_LEAK (nW) | NOVA_LEAK (nW) | JEDI_LEAK (nW) |
|---|---|---|---|---|---|---|
| bbtas | 19.83 | 36.35 | 27.49 | 145.77 | 157.96 | 135.84 |
| beecount | 88.99 | 127.52 | 127.93 | 948.58 | 500.19 | 527.55 |
| ex4 | 117.41 | 53.45 | 47.09 | 267.57 | 370.96 | 400.96 |
| dk14 | 189.85 | 224.54 | 170.45 | 1094.57 | 679.65 | 577.7 |
| dk16 | 304.48 | 408.79 | 250.42 | 1271.44 | 1418.8 | 1117.1 |
| dk512 | 126.2 | 67.47 | 60.38 | 384.39 | 370.31 | 361.33 |
| dk27 | 49.55 | 38.12 | 34.8 | 201.78 | 161.89 | 170.0124 |
| donfile | 211.18 | 374.65 | 159.68 | 1977.75 | 2617.1 | 714.92 |
| sand | 205.55 | 639.27 | 454.38 | 1056.73 | 2408.6 | 2059.5 |
| s1 | 306.76 | 250.52 | 77.39 | 1198.02 | 1315.6 | 492.86 |
| s208 | 21.6 | 95.67 | 107.39 | 146.16 | 449.66 | 580.35 |
| s386 | 95.12 | 237.02 | 115.47 | 411.51 | 862.32 | 663.75 |
| s510 | 34.12 | 95.82 | 87.98 | 724.58 | 1397.5 | 1029.4 |
| s820 | 69.31 | 122.85 | 155.33 | 1062.11 | 1124.1 | 1102.3 |
| Average w.r.to gated | 1.0 | 1.77 | 1.45 | 1.0 | 1.38 | 1.20 |
of two combinational blocks, COMB1 and COMB2, corresponding to the two partitions generated by our tool. In 13 out of 16 circuits, the partitioned realization has resulted in delay reduction.

Next, we have carried out another set of experiments taking the final encoded submachines of each FSM. Before synthesis, both the submachines and the original machine are converted into Verilog HDL code. The two submachines of each FSM are synthesized separately in Synopsys Design Vision [35] to get the dynamic and leakage power of the synthesized circuit. Using Eq. (1), the dynamic and leakage power of the total FSM are calculated. This dynamic power is given in the column GATED_SW of Table 3. The leakage power is given in the column GATED_LEAK. The technology used in the Synopsys Design Vision synthesis tool is 90 nm Faraday, where the global operating voltage is 1 V. It may be noted that the tool synthesizes the circuit into a multilevel one. The original FSM is then encoded with the state assignment tool NOVA [27]. This NOVA encoded circuit is also synthesized in the Synopsys Design Vision tool at 90 nm Faraday technology. The switching and leakage power of the NOVA encoded circuit after synthesis are given in columns NOVA_SW and NOVA_LEAK of Table 3, respectively. For the circuits s208, sand, s386 and s510, the switching power savings are 77.41%, 67.84%, 59.86% and 64.38%, respectively. Some circuits, like dk27, ex4, s1 and dk512, do not exhibit switching power saving. For dk27 the leakage dissipation is also high. This happens because the GA produced solutions are optimized for two-level realization. However, since a good two-level minimization can also act as the basis of multilevel minimization, as shown in Table 3, we can get 15.3% average switching power saving. The circuits which show maximum switching power saving also give maximum leakage power saving. The leakage power savings for the circuits s208, sand, s386 and s510 are 67.5%, 56%, 52% and 48%, respectively. The average leakage power saving in circuits synthesized using our approach is about 9% compared to the circuits encoded using NOVA.

The fourth and the last columns of Table 3 present the power of the FSM encoded using the JEDI [28] assignment tool, which targets multilevel realization. Our approach shows power improvement over JEDI in some of the circuits, like s208 (79.87%), s510 (61.2%) and sand (54.4%). The power improvement could be observed only in those circuits that have low boundary state probabilities. This happens because in our GA based approach we have targeted a two-level realization. On the other hand, JEDI targets a multi-level realization. The Synopsys tool also generates a multi-level implementation. Thus, it is expected that the JEDI output will be more amenable to synthesis using the Synopsys tool.

Table 4(a) compares the combinational area (the number of product terms) of the power-gating approach with the area needed for NOVA [27] encoding and GA encoding without power-gating. While the GA approach without power-gating requires on average 8.7% extra product terms over NOVA, the power-gating approach needs about 10% extra product terms. In Table 4(b), we have compared the area of the power-gating approach with that of the multilevel state assignment tool JEDI in 90 nm technology using the Synopsys synthesis tool. In this technology, a 2-input NAND gate has an area of 4 units and an inverter has an area of 3 units. The average area overhead in this case is about 5.7%.
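The last row of Table 3 ("Average w.r.to gated") appears to be a normalized average. One plausible reading, assumed here rather than stated in the text, is that each entry is the mean over the circuits of the per-circuit ratio to the gated value. The Python sketch below tests that reading on the switching power columns; the function name `mean_ratio_to_gated` is hypothetical and the data are copied from Table 3.

```python
from statistics import mean

# Switching power in mW copied from Table 3: (GATED_SW, NOVA_SW, JEDI_SW).
TABLE3_SW = {
    "bbtas": (19.83, 36.35, 27.49),
    "beecount": (88.99, 127.52, 127.93),
    "ex4": (117.41, 53.45, 47.09),
    "dk14": (189.85, 224.54, 170.45),
    "dk16": (304.48, 408.79, 250.42),
    "dk512": (126.2, 67.47, 60.38),
    "dk27": (49.55, 38.12, 34.8),
    "donfile": (211.18, 374.65, 159.68),
    "sand": (205.55, 639.27, 454.38),
    "s1": (306.76, 250.52, 77.39),
    "s208": (21.6, 95.67, 107.39),
    "s386": (95.12, 237.02, 115.47),
    "s510": (34.12, 95.82, 87.98),
    "s820": (69.31, 122.85, 155.33),
}


def mean_ratio_to_gated(column):
    """Mean over circuits of (power in `column`) / (gated power).

    Assumed interpretation of the 'Average w.r.to gated' row.
    """
    return mean(row[column] / row[0] for row in TABLE3_SW.values())


nova_avg = mean_ratio_to_gated(1)
jedi_avg = mean_ratio_to_gated(2)
print(f"NOVA_SW / GATED_SW: {nova_avg:.2f}")
print(f"JEDI_SW / GATED_SW: {jedi_avg:.2f}")
```

Under this assumption both values come out close to the 1.77 and 1.45 entries of Table 3. A mean of ratios is dominated by circuits with large ratios, such as s208, so it is a different quantity from the average percentage saving quoted in the text.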
Table 4 (a) Combinational area comparison (number of product terms).

| Circuit | Power-gating approach | NOVA | GA |
|---|---|---|---|
| bbara | 34 | 26 | 26 |
| bbtas | 34 | 26 | 17 |
| dk14 | 36 | 28 | 37 |
| dk16 | 65 | 62 | 63 |
| dk512 | 18 | 20 | 17 |
| dk27 | 8 | 9 | 10 |
| donfile | 39 | 47 | 43 |
| ex4 | 17 | 15 | 18 |
| modulo12 | 14 | 12 | 11 |
| sand | 96 | 99 | 108 |
| s1 | 72 | 75 | 65 |
| s208 | 29 | 24 | 27 |
| s386 | 36 | 32 | 34 |
| s820 | 76 | 66 | 78 |
| s510 | 71 | 66 | 73 |
| s832 | 81 | 64 | 79 |

Table 4 (b) Area comparison at 90 nm technology.

| Circuit | POWER GATED_AREA | JEDI_AREA |
|---|---|---|
| beecount | 323 | 389 |
| dk14 | 318 | 493 |
| dk16 | 902 | 861 |
| dk512 | 291 | 294 |
| dk27 | 126 | 136 |
| donfile | 530 | 580 |
| ex4 | 281 | 331 |
| sand | 1386 | 1696 |
| s1 | 805 | 405 |
| s208 | 431 | 480 |
| s386 | 548 | 540 |
| s510 | 1027 | 908 |
| s820 | 1569 | 923 |

Table 5 Total dynamic power results of the power-gated circuit with new power model.

| Circuit | dyn0 | dyn1 | tot_dyn | Nova_dyn | Seq_dyn | Nova_seq_dyn |
|---|---|---|---|---|---|---|
| bbara | 0.946 | 1.3503 | 1.46 | 3.87 | 0.369 | 0.44 |
| bbtas | 0.783 | 0.86 | 1.46 | 2.42 | 0.613 | 0.769 |
| dk14 | 1.24 | 2.054 | 3.29 | 3.79 | 0.8767 | 1.51 |
| dk16 | 2.02 | 2.471 | 4.49 | 4.39 | 2.4 | 2.53 |
| dk512 | 1.51 | 1.38 | 2.42 | 2.85 | 1.97 | 2.31 |
| dk27 | 1.562 | 0.834 | 2.39 | 2.25 | 2.047 | 1.59 |
| donfile | 2.43 | 1.47 | 3.9 | 3.72 | 1.79 | 1.75 |
| ex4 | 0.696 | 1.245 | 1.54 | 3.39 | 0.717 | 0.739 |
| modulo12 | 0.864 | 0.633 | 1.04 | 2.47 | 0.9166 | 0.75 |
| sand | 2.31 | 2.99 | 4.14 | 4.71 | 1.07 | 1.1 |
| sse | 1.069 | 0.312 | 1.15 | 3.46 | 0.81 | 1.88 |
| s1 | 2.94 | 0.679 | 3.53 | 5.5 | 1.389 | 1.56 |
| s208 | 0.604 | 1.068 | 0.67 | 2.18 | 0.79 | 0.688 |
| s386 | 2.357 | 2.323 | 2.77 | 3.925 | 1.283 | 1.54 |
| s820 | 1.171 | 0.483 | 1.29 | 3.21 | 1.02 | 1.35 |
| s510 | 0.977 | 0.71 | 1.005 | 4.2 | 0.559 | 0.919 |
| s832 | 0.98 | 0.567 | 1.002 | 4.14 | 0.559 | 0.87 |
Next, in Tables 5 and 6, we present the results of the new power model explained in Section 4. Here, we have considered both the combinational and the sequential power of the resulting circuit. Table 5 gives the dynamic power results of the circuits using our power model. dyn0 and dyn1 are the combinational switching activities of the first and second machine, respectively. tot_dyn is the total switching activity of the machine calculated using Eq. (1). Nova_dyn is the total switching activity of the combinational part of the original machine after encoding using NOVA. Seq_dyn and Nova_seq_dyn are the dynamic powers calculated using Eq. (1), after our encoding and NOVA encoding of the machine, respectively. The results show, on average, 38.5% savings in dynamic power. The maximum savings in dynamic power are for circuits like s820 (76%), s832 (75.1%) and s208 (69.6%). The saving in dynamic power is higher for the cases with lower boundary state probabilities. For FSMs with higher boundary state probabilities, both the machines are always on, thus requiring higher power. Circuits such as s820, s832 and s208 have lower boundary state probabilities and hence lower power dissipation.

Leak0 and Leak1 are the combinational leakage powers of the first and second machine of the final encoded circuit, estimated using our leakage model given in Section 4.2. The total leakage is calculated using Eq. (1) and is given in the column tot_Leak of Table 6. This leakage is in mW. Nova_Leak (mW) is the leakage power of the combinational part of the machine after encoding using NOVA. Seq_Leak and Nova_seq_Leak are the leakage results of the sequential part of the machine, calculated using Eq. (5), after our encoding and NOVA encoding, respectively. The results show, on average, a 9% saving in leakage power. The maximum savings in leakage power are for circuits like s820 (87.1%) and s832 (87.1%).
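Eq. (1) is defined earlier in the paper and is not repeated in this section. As a hedged illustration of how tot_dyn can be obtained from dyn0 and dyn1, the Python sketch below assumes that each submachine's power is weighted by the probability that it is powered on: its own steady state probability plus the probability of the other submachine being in a boundary state, when the inactive block has to be charged ahead of its activation. The function name `gated_power` and this particular form are assumptions; the probabilities are taken from Table 2 and the powers from Table 5.

```python
def gated_power(p_f1, pd_12, pd_21, power_f1, power_f2):
    """Weighted total power of a two-way power-gated FSM.

    Assumed form of Eq. (1), not quoted from the paper:
        (P(F1) + PD(F2,F1)) * power_f1 + (P(F2) + PD(F1,F2)) * power_f2
    with P(F2) = 1 - P(F1).
    """
    p_f2 = 1.0 - p_f1
    on_f1 = p_f1 + pd_21  # F1 active, or charged while F2 sits in a boundary state
    on_f2 = p_f2 + pd_12  # F2 active, or charged while F1 sits in a boundary state
    return on_f1 * power_f1 + on_f2 * power_f2


# circuit: (P(F1), PD(F1,F2), PD(F2,F1), dyn0, dyn1, reported tot_dyn)
ROWS = {
    "bbara": (0.96, 0.33, 0.05, 0.946, 1.3503, 1.46),
    "bbtas": (0.42, 0.42, 0.35, 0.783, 0.86, 1.46),
    "dk14": (0.57, 0.57, 0.43, 1.24, 2.054, 3.29),
}

estimates = {}
for name, (p1, pd12, pd21, dyn0, dyn1, reported) in ROWS.items():
    estimates[name] = gated_power(p1, pd12, pd21, dyn0, dyn1)
    print(f"{name}: estimated {estimates[name]:.3f}, reported {reported}")
```

For these three circuits the assumed weighting agrees with the tabulated tot_dyn to within rounding, which suggests, but does not prove, that Eq. (1) has this structure. It also makes the observed trend explicit: a small boundary probability keeps the weight of the idle submachine small.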
Table 6 Total leakage power results of the power-gated circuit with new power model.

| Circuit | Leak0 | Leak1 | tot_Leak | Nova_Leak | Seq_Leak | Nova_seq_Leak |
|---|---|---|---|---|---|---|
| bbara | 695.9 | 232.9 | 784.22 | 921.71 | 1.71 | 1.71 |
| bbtas | 76.04 | 81.62 | 139.79 | 120.04 | 1.35 | 1.37 |
| dk14 | 447.95 | 642.1 | 1090.05 | 739 | 1.37 | 1.52 |
| dk16 | 505.2 | 515.3 | 1020.5 | 1036.28 | 2.48 | 2.51 |
| dk512 | 113.56 | 112.08 | 187.42 | 216.17 | 2.22 | 2.05 |
| dk27 | 54.04 | 53.86 | 107.9 | 103.34 | 1.6 | 1.52 |
| donfile | 348 | 335 | 683 | 800.5 | 2.4 | 2.36 |
| ex4 | 158.4 | 272.6 | 340.05 | 165.05 | 0.95 | 0.958 |
| modulo12 | 97.88 | 66.67 | 114.91 | 121.99 | 1.8 | 1.783 |
| sand | 3748 | 262.2 | 2454.4 | 3996.2 | 2.37 | 2.23 |
| sse | 386.2 | 747.4 | 572.3 | 1072.28 | 2.511 | 2 |
| s1 | 1839.9 | 373.89 | 2163.39 | 2873.58 | 2.3 | 2.33 |
| s208 | 204.2 | 531.4 | 237.07 | 136.08 | 2.15 | 2.23 |
| s386 | 505.72 | 747.01 | 691.52 | 1158.01 | 1.8 | 1.91 |
| s820 | 625.56 | 7621.4 | 930.42 | 7190.51 | 2.5 | 2.2 |
| s510 | 340.2 | 437.01 | 466.84 | 710.2 | 2.64 | 2.75 |
| s832 | 600 | 7172.4 | 886.89 | 6890.61 | 2.5 | 2.2 |
Table 7 presents a comparison of power savings in our power-gating approach with other works: GA-D [6], LPSA [36], RELPSA [37], IITG8 [9] and [38]. The values noted in the table are the percentage reductions in power consumption in the benchmark FSMs over their NOVA encoded implementations. On average, the power-gating approach shows the maximum improvement among all the techniques reported. For most of the circuits, like bbara (62.24%), modulo12 (58.03%), sse (67%), s208 (69.26%), s386 (59.78%), s820 (76%) and s832 (76%), the saving is high. Here, the boundary state probabilities are low. In the few cases where the boundary state probability is high, like dk16, dk27 and donfile, the saving is low. Table 8 presents a comparison of percentage savings in area in various approaches over NOVA. Our approach requires 9.4% extra area, on average. The approach [38] shows the maximum area savings. This happens as the technique also considers area as a parameter to optimize. However, as noted in Table 7, that approach has lower power savings compared to power-gating.

Table 7 Comparison of % savings in power with respect to NOVA.

| Circuit | GA-D [6] | LPSA [36] | RELPSA [37] | IITG8 [9] | [38] | Our approach |
|---|---|---|---|---|---|---|
| bbara | 33.64 | 6.12 | 20.8 | 18.87 | – | 62.24 |
| bbtas | 49.74 | 22.8 | – | – | 40.0 | 39.71 |
| dk14 | 8.3 | 0.58 | – | -12.5 | 17.0 | 13.08 |
| dk16 | 29.06 | 1.35 | 14.09 | -8.3 | – | -2.3 |
| dk512 | -3.01 | 6.95 | – | 14.6 | – | 15.11 |
| dk27 | 1.39 | 1.27 | – | 4.9 | 38.0 | -6.48 |
| donfile | 49.87 | 32.55 | – | 0.0 | 12.0 | -4.83 |
| ex4 | 57.25 | 26.56 | – | 48.48 | – | 54.71 |
| modulo12 | 26.8 | – | – | – | – | 58.03 |
| sand | 9.87 | 4.83 | 18.9 | 15.25 | – | 12.2 |
| sse | 28.4 | 20.82 | – | 5.3 | 37.0 | 66.85 |
| s1 | – | – | 32.2 | 24.72 | 30.0 | 35.85 |
| s208 | – | – | – | – | – | 69.26 |
| s386 | – | – | 14.6 | – | 19.0 | 59.78 |
| s510 | – | – | 49.3 | 47.86 | – | 29.28 |
| s820 | – | – | 22.5 | 47.29 | – | 76.06 |
| s832 | – | – | 32.9 | 33.45 | – | 75.78 |
| Average | 26.48 | 12.38 | 25.66 | 18.46 | 27.57 | 38.49 |

Table 8 Comparison of % savings in area with respect to NOVA.

| Circuit | [6] | LPSA [36] | RELPSA [37] | [9] | [38] | Our approach |
|---|---|---|---|---|---|---|
| bbara | 11.54 | 0.00 | -0.8 | – | – | -30.77 |
| bbtas | 10.00 | 23.08 | – | 18.18 | 31.0 | -30.77 |
| dk14 | -13.79 | -3.70 | – | -24.14 | 3.0 | -28.57 |
| dk16 | -15.87 | -6.56 | -7.70 | -8.62 | – | -2.3 |
| dk512 | -37.50 | -5.56 | – | 0.00 | – | 10.0 |
| dk27 | 0.00 | -28.57 | – | -12.50 | 12.0 | 11.11 |
| donfile | 33.33 | 33.93 | – | 38.10 | 14.0 | 17.02 |
| ex4 | 10.53 | 15.79 | – | 10.53 | – | -13.33 |
| modulo12 | 18.18 | – | – | 15.38 | – | -16.67 |
| sand | -1.01 | 0.00 | -9.20 | -9.09 | – | 3.03 |
| sse | -16.67 | 6.45 | – | 12.50 | 15.0 | 0.00 |
| s1 | – | – | 1.80 | – | 2.0 | 4.0 |
| s208 | – | – | – | – | – | -20.83 |
| s386 | – | – | -6.70 | – | 3.0 | -12.50 |
| s510 | – | – | -9.80 | – | – | -7.58 |
| s820 | – | – | -2.70 | – | – | -15.15 |
| s832 | – | – | -9.70 | – | – | -26.57 |
| Average | -11.45 | 3.49 | -5.6 | 4.03 | 11.43 | -9.40 |

5.1. Impact on performance

On comparing the delay of the circuits synthesized by our approach with that of the NOVA encoded circuits, the difference is found to be insignificant. This happens for the following reason. From Fig. 2, it can be observed that the extra delay that may come in our approach has three components.

1. The enable logic generating the LE signal.
2. The delays of COMB1 and COMB2.
3. The multiplexers at the outputs of COMB1 and COMB2.

Out of these, the LE signal generation does not have any logic elements in it. The most significant bit of the state register directly determines the LE. Hence, the associated delay is negligible. The output multiplexers have a delay of 41.7 ps at 90 nm technology. Due to partitioning of states, the combinational logic blocks COMB1 and COMB2 are expected to be simpler than in the original NOVA encoded FSM. This often leads to delay reduction as well (as noted in Table 2), which compensates for the multiplexer delay.

6. Conclusion

We have presented an efficient technique for synthesizing FSMs using power-gating targeting total power saving. The idea of combined partitioning and state encoding is introduced for the first time in the synthesis process in the genetic algorithm formulation. The technique worked well, as verified by experimentation with a number of benchmark circuits. The benefit of the scheme presented in this work is best realized by circuits having a smaller number of boundary states.
References

[1] L. Benini, G. De Micheli, State assignment for low power dissipation, IEEE Journal on Solid State Circuits (1994) 32–40.
[2] W. Noeth, R. Kolla, Spanning tree-based state encoding for low power dissipation, Design Automation and Test in Europe (1999).
[3] S. Devadas, H.K.T. Ma, A.R. Newton, A.S. Vincentelli, Mustang: state assignment of finite state machines for optimal multilevel logic implementation, IEEE Transactions on Computer Aided Design 7 (12) (1988) 1290–1300.
[4] Y. Xia, A.E.A. Almaini, Genetic algorithm based state assignment for power and area optimization, IEE Proceedings on Computer and Digital Techniques 126 (4) (2002) 128–133.
[5] P. Surti, L.F. Chao, A. Tyagi, Low power FSM design using Huffman-style encoding, in: Proceedings of IEEE EDTC-97, Paris, France, March 1997, pp. 521–525.
[6] S. Chattopadhyay, Low power state assignment and flip-flop selection for finite state machine synthesis: a genetic algorithm approach, IEE Proceedings on Computer and Digital Techniques 125 (4/5) (2001) 124–151.
[7] A. Iranli, P. Rezvani, M. Pedram, Low power synthesis of finite state machines with mixed D and T flip-flops, in: Proceedings of IEEE ASP-DAC-2003, Kitakyushu, Japan, pp. 803–808.
[8] G. Venkataraman, S.M. Reddy, I. Pomeranz, GALLOP: genetic algorithm based low power FSM synthesis by simultaneous partitioning and state assignment, in: Proceedings of 16th IEEE Conference on VLSI Design 2003.
[9] S. Chattopadhyay, P.N. Reddy, Finite state machine state assignment targeting low power consumption, IEE Proceedings on Computer and Digital Techniques 151 (1) (2004).
[10] G. Sutter, E. Todorovich, S. Lopez-Buedo, E. Boemo, FSM decomposition for low power in FPGA, Lecture Notes in Computer Science, in: Proceedings of the Reconfigurable Computing is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications, vol. 2438, pp. 350–359.
[11] H. Lensen, M. Kruus, A. Sudnitson, Synthesis of Sequential Circuits with Dynamic Power management, RTUCET’01, Riga, Latvia, 2001, pp. 81–86.
[12] S.H. Chow, Y.C. Ho, T. Hwang, C.L. Liu, Low power realization of finite state machines: a decomposition approach, ACM Transactions on Design, Automation and Electrical Systems 1 (3) (1996) 315–340.
[13] J.C. Monteiro, A.L. Oliveira, Finite state machine decomposition for low power, in: Proceedings of DAC 98, pp. 758–763.
[14] S. Thompson, P. Packan, M. Bohr, MOS scaling: transistor challenges for the 21st century, Intel Technology Journal Q3 (1998).
[15] C. Giacomotto, M. Singh, M. Vratonjic, V.G. Oklobdzija, Energy efficiency of power-gating in low-power clocked storage elements, Lecture Notes in Computer Science 5349/2009 (2009) 268–276.
[16] H. Kim, Y. Shin, H. Kim, I. Eo, Physical design methodology of power gating circuits for standard-cell-based design, in: Proceedings of the 43rd Annual Design Automation Conference, 2006, pp. 109–112.
[17] S.Y. Chen, R.B. Lin, H.H. Tung, K.W. Lin, Power gating design for standard-cell-like structured ASICs, in: Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), 2010, pp. 514–519.
[18] J. Leverich, M. Monchiero, V. Talwar, P. Ranganathan, C. Kozyrakis, Power management of datacenter workloads using per-core power gating, Computer Architecture Letter (IEEE Computer Society) 8 (2) (2009).
[19] E. Pakbaznia, M. Pedram, Design and application of multimodal power gating structures, in: Proceedings of the Tenth International Symposium on Quality of Electronic Design, 2009, pp. 120–126.
[20] K. Shi, Z. Lin, Yi.-M. Jian, L. Yuan, Simultaneous sleep transistor insertion and power network synthesis for industrial power gating designs, Journal of Computers 3 (3) (2008) 6–13.
[21] M. Sjalander, M. Drazdziulis, P. L. Edefors, H. Eriksson, A low-leakage twin-precision multiplier using reconfigurable power gating, in: Proceedings of IEEE International Symposium on Circuits and Systems, vol. 2, 2005, pp. 1654–1657.
[22] K. Heyrman, A. Papanikolaou, F. Catthoor, P. Veelaert, W. Philips, Control for power gating of wires, IEEE Transactions on Very Large Scale Integration Systems 18 (9) (2010) 1287–1300.
[23] S. Kim, S.V. Kosonocky, D.R. Knebel, K. Stawiasz, D. Heidel, M. Immediato, Minimizing inductive noise in system-on-a-chip with multiple power gating structures, in: Proceedings of the 29th European Solid-State Circuits Conference, 2003, pp. 635–638.
[24] K. Agarwal, H. Deogun, D. Sylvester, K. Nowka, Power gating with multiple sleep modes, in: Proceedings of the Seventh International Symposium on Quality Electronic Design, 2006, pp. 633–637.
[25] B. Liu, Y. Cai, Q. Zhou, J. Bian, X. Hong, FSM Decomposition For Power Gating Design Automation in Sequential Circuits, ASICON (2005) 862–865.
[26] E.M. Sentovich, K.J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P.R. Stephan, R.K. Brayton, A.L. Sangiovanni-Vincentelli, SIS: A System for Sequential Circuit Synthesis, www.eecs.berkeley.edu/Pubs/TechRpts/1992/2010.html.
[27] T. Villa, A.S. Vincentelli, NOVA: state assignment of finite state machines for optimal two-level logic implementation, IEEE Transactions on CAD 9 (9) (1990) 905–924.