Design of Dual Threshold Voltages Asynchronous Circuits

Behnam Ghavami

Hossein Pedram

Amirkabir University of Technology 424, Hafez Ave., Tehran, Iran

Amirkabir University of Technology 424, Hafez Ave., Tehran, Iran


ABSTRACT

This paper introduces a framework for minimizing the leakage power consumption of asynchronous circuits using the dual-threshold-voltage technique. The circuit model is an extended Timed Petri-Net that captures the dynamic behavior of the circuit. We propose a heuristic method based on a quantum genetic algorithm that searches for the optimal assignment of high and low threshold voltages. Experimental results are given for a number of ISCAS benchmark circuits in a 90 nm process. The results show that combining asynchronous design with multiple threshold voltages is an effective way to achieve a low leakage power budget in high-performance asynchronous circuits.

Categories and Subject Descriptors

B.6.1 [Logic Design]: Design Styles – Asynchronous circuits

General Terms

Design.

Keywords

Asynchronous Circuits, Leakage Power, Dual Threshold Voltage, Timed Petri-Net, Genetic Quantum Algorithm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED’08, August 11–13, 2008, Bangalore, India. Copyright 2008 ACM 978-1-60558-109-5/08/08...$5.00.

1. INTRODUCTION

In asynchronous circuits, local signaling eliminates the need for global synchronization, which brings several potential advantages: besides eliminating clock skew and tolerating interconnect delays, asynchronous circuits are more tolerant of process variations and external voltage fluctuations. They are also more modularly synthesizable and potentially faster [1]. Asynchronous design reduces dynamic power consumption because activity is triggered by requests rather than by clock edges. On the other hand, the circuitry that receives requests and emits acknowledgments has a cost in the number of transistors, and in deep sub-micron technologies the leakage current is becoming more and more significant [2]. Reducing this leakage power dissipation is an overriding constraint for asynchronous circuit design. The exponential dependence of transistor leakage on threshold voltage has led today's processes to offer multiple threshold voltages. There are many techniques for designing dual-threshold-voltage (dual-Vth hereafter) synchronous circuits. However, dual-Vth cannot be applied to asynchronous circuits in the same way as to synchronous circuits, because it is difficult to define or identify a critical path in an asynchronous circuit (where it starts and where it stops), at least with CAD tools that have been designed for synchronous circuits. This paper proposes a method to synthesize a dual-Vth asynchronous design while preserving the performance requirements. To the best of our knowledge, this is the first leakage power optimization framework for asynchronous circuits.

Section 2 discusses leakage power reduction techniques. Section 3 introduces the threshold voltage assignment framework. The Vth assignment algorithm is described in detail in Section 4, and Section 5 gives experimental results on a set of benchmarks. Finally, conclusions are drawn in Section 6.

2. DUAL-Vth CIRCUIT DESIGN

The dual-Vth design technique uses two kinds of transistors in the same circuit: some transistors have a high threshold voltage, while others have a low threshold voltage. High-threshold-voltage transistors dissipate less subthreshold leakage power but have a larger delay than low-threshold-voltage transistors. In a dual-threshold-voltage implementation of a custom VLSI design, the gates on noncritical paths are assigned high Vth, and the gates on the critical path are assigned low Vth. The objective is to maximize the number of transistors with a high threshold voltage without sacrificing the performance of the circuit. The impact of this approach relies heavily on the efficiency of the threshold voltage assignment algorithm. Researchers have recently proposed many design techniques for selecting and assigning threshold voltages to the gates of a circuit so as to reduce leakage power under performance constraints [3]. However, the dual-threshold-voltage design techniques proposed in the literature for custom VLSI designs cannot be used for asynchronous ones, because the performance analysis of an asynchronous circuit is completely different from that of a synchronous one, owing to the dependencies between highly concurrent events. While synchronous performance estimation is based on a static critical path analysis affected only by the delay of components and interconnecting wires, it has been shown that the performance of an asynchronous circuit depends on dynamic factors such as the number of tokens in the circuit. In the clocked case, the critical path has a clear beginning and a clear end because all paths are broken by latches. Importantly, no such clear separation is available in asynchronous circuits.
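For contrast with the asynchronous case, the clocked dual-Vth idea described above can be sketched in a few lines. This is only an illustrative greedy heuristic, not one of the published algorithms cited in [3]; the netlist format (`fanin`), the per-gate delay tables and the function names are assumptions made for the example.

```python
def critical_path_delay(fanin, delay):
    """Longest input-to-output delay of a combinational DAG.

    fanin maps each gate to the list of gates driving it (hypothetical format).
    """
    arrival = {}

    def arrival_time(gate):
        if gate not in arrival:
            latest_input = max((arrival_time(p) for p in fanin[gate]), default=0.0)
            arrival[gate] = latest_input + delay[gate]
        return arrival[gate]

    return max(arrival_time(gate) for gate in fanin)


def greedy_dual_vth(fanin, low_delay, high_delay, max_delay):
    """Switch gates to high Vth one at a time while the timing bound holds."""
    delay = dict(low_delay)                 # start with every gate at low Vth
    vth = {gate: "low" for gate in fanin}
    for gate in fanin:
        delay[gate] = high_delay[gate]      # tentatively slow the gate down
        if critical_path_delay(fanin, delay) <= max_delay:
            vth[gate] = "high"              # the slack absorbed the extra delay
        else:
            delay[gate] = low_delay[gate]   # revert: the gate is timing critical
    return vth


# Toy netlist: gates a and b both drive gate c (made-up delays).
fanin = {"a": [], "b": [], "c": ["a", "b"]}
low_delay = {"a": 3.0, "b": 1.0, "c": 2.0}
high_delay = {"a": 4.0, "b": 2.0, "c": 3.0}
assignment = greedy_dual_vth(fanin, low_delay, high_delay, max_delay=5.0)
print(assignment)
```

In this toy netlist only gate b has slack, so it is the only gate moved to high Vth. The sketch relies on a single static critical path with a clear start and end, which is exactly what is missing in the asynchronous setting.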

3. DUAL Vth ASYNCHRONOUS CIRCUITS DESIGN FRAMEWORK

While the first generation of asynchronous synthesis tools focused mainly on control synthesis, which made them impractical for real circuits, the new generation of synthesis methods targets pre-designed asynchronous buffer templates. The design process usually starts with a high-level circuit specification. By applying a set of function-preserving transformations to the high-level specification, the Data Driven Decomposition method [4] divides the original circuit into a set of communicating fine-grain pipeline templates. The Template Synthesizer receives CSP [4] source code containing a number of pre-designed template-compatible modules, and optionally a top-level netlist, and generates a netlist of standard-cell elements with dual-rail ports that can be used to create the final layout. The resulting circuit can then be implemented directly down to the layout. Usually, each pipeline template is composed of a number of previously laid-out standard cells that can be connected to each other. Figure 1 shows the general structure of the proposed leakage power optimization scheme and its interface with a generic asynchronous synthesis flow. To model the dual-threshold design of asynchronous circuits as an optimization problem, a suitable circuit and performance model of the asynchronous circuit is required. In this work, the output of Decomposition is translated into a Timed Petri-Net model [5] for performance analysis, and a low or high Vth is assigned to each template. A Petri-Net simulator then runs the circuit model and provides dynamic information about the original circuit, such as the token assignment.

The Vth-assignment engine assigns a high or low Vth to each template of the circuit using a heuristic method. The optimized circuit is then given as input to the Template Synthesizer to generate a netlist of standard-cell elements.

3.1 TIMED PETRI-NET: BACKGROUND

Petri-Nets are an elegant formalism for modeling concurrency and synchronization in many applications, including asynchronous circuits. A Petri-Net is a four-tuple $N = (P, T, F, m_0)$, where $P$ is a finite set of places, $T$ is a finite set of transitions, $F \subseteq (P \times T) \cup (T \times P)$ is a flow relation, and $m_0 \in \mathbb{N}^{|P|}$ is the initial marking. A marking is an assignment of tokens to the places and represents the state of the system. A Timed Petri-Net (TPNT hereafter) is a Petri-Net in which some transitions or places can be annotated with delays.

A cycle $c$ is a sequence of places $p_1 p_2 \ldots p_1$, connected by arcs and transitions, whose first and last places are the same. The cycle metric $CM(c)$ is the sum of the delays of all places along the cycle, $d(c)$, divided by the number of tokens that reside in the cycle, $m_0(c)$:

$$CM(c) = \frac{d(c)}{m_0(c)} \qquad (1)$$

The cycle time of a TPNT is defined as the largest cycle metric among all cycles in the TPNT [7].
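The definition above can be illustrated with a short sketch that computes the largest cycle metric of a small net. Two things are assumptions of the example rather than part of the paper: the net is encoded as a marked graph in which every place is an edge `(source transition, target transition, delay, initial tokens)`, and the search uses a binary search with a Bellman-Ford style positive-cycle check instead of Karp's algorithm [7]. The sketch also assumes non-negative delays and at least one token on every cycle.

```python
def has_cycle_above(num_transitions, places, ratio):
    """True if some cycle has total delay > ratio * total tokens."""
    dist = [0.0] * num_transitions
    for _ in range(num_transitions):
        changed = False
        for src, dst, delay, tokens in places:
            candidate = dist[src] + delay - ratio * tokens
            if candidate > dist[dst] + 1e-9:
                dist[dst] = candidate
                changed = True
        if not changed:
            return False        # longest paths settled: no positive cycle
    return True                 # still improving after n passes: positive cycle


def cycle_time(num_transitions, places, iterations=60):
    """Largest cycle metric d(c) / m0(c), found by bisection on the ratio."""
    lo, hi = 0.0, sum(delay for _, _, delay, _ in places)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if has_cycle_above(num_transitions, places, mid):
            lo = mid
        else:
            hi = mid
    return hi


# Made-up three-stage ring: forward places carry one token in total,
# backward places carry two.
places = [
    (0, 1, 2.0, 0), (1, 2, 2.0, 0), (2, 0, 2.0, 1),   # forward places
    (1, 0, 1.0, 1), (2, 1, 1.0, 1), (0, 2, 1.0, 0),   # backward places
]
t = cycle_time(3, places)
print("cycle time:", round(t, 6), "throughput:", round(1.0 / t, 6))
```

In this ring the forward cycle has total delay 6 and one token, so its cycle metric of 6 dominates the backward cycle (3/2) and the three local loops (3/1 each).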

3.2 Template Model

We model the network of templates with a novel Timed Petri-Net. The main advantage of our model is that it can be used for simulation of hierarchical circuits in addition to static performance analysis. The simplest form of a full buffer is a simple buffer that only reads a value from its input and writes it to its output. This behavior can be modeled as shown in Figure 2. The intuition behind a channel net is that each template instance is modeled as a transition and each channel between ports of template instances becomes three places. Transition tW is analogous to the write statement, while place pWa emulates the write acknowledge. Similarly, pF can be seen as the dual for the read statement, while tRa is the corresponding acknowledge. We place delays on the places; therefore the forward delay and the backward delay can be put on pF and pB. This model is very similar to the FBCN model presented in [6]; the only difference is that we added tRa (the resulting model is called EFBCN hereafter). The reason is that the definition of hierarchical Petri-Nets used here restricts the input and output ports: all outputs must be transitions and all inputs must be places.

… `READ(F,a) z = f(a); `WRITE(W,z) …

Figure 1- Leakage power reduction framework

Figure 2- A buffer template modeled using Timed Petri-Nets

The proposed optimizer includes a static performance analyzer, which provides performance information, and a Vth-assignment engine.

This convention ensures that unwanted choices or merge constructs cannot be formed when connecting Petri-Net modules to each other.

3.3 Performance models

The performance of the network is represented by modeling the performance of the channels as they are connected to the templates. Here d ( f ) represents the forward latency of a channel, while the corresponding d ( p ) represents its backward latency. We place delays on the places; therefore the forward delay and the backward delay can be put on pF and pB. We define these values as follows:

• The forward latency represents the delay through an empty channel (and the associated cell).
• The backward latency represents the time it takes the handshaking circuitry within the neighboring cells to reset, enabling a second token to flow.

The cycle time of the circuit is captured by the maximum cycle metric of the corresponding EFBCN, which can be computed using traditional maximum mean-cycle algorithms [7]. The throughput of the circuit is the reciprocal of this value.

In the Vth-assignment algorithm of Section 4, an update operation is applied at each generation to the surviving chromosomes to produce a new set of chromosomes, that is, a new set of circuit configurations. The update procedure introduced in [9] is used here. The iteration continues until the termination criterion is met.

3.4 Leakage and delay model

The leakage current of a CMOS circuit is the sum of the leakage currents of all of its gates (templates in our case). In our study, we calculate the template leakage under each input pattern, and the leakage values under the different input patterns are combined into a weighted average. We use the analysis and simulations performed in [8], which deliver the delay of the basic gates in the library under a load-dependent delay model (LDDM).
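The weighted averaging described above can be sketched as follows. The paper does not state how the weights are chosen, so the sketch assumes they are input-pattern probabilities; the function names and the leakage numbers are purely illustrative.

```python
def template_leakage(leakage_by_pattern, pattern_weight):
    """Weighted average leakage of one template over its input patterns."""
    total_weight = sum(pattern_weight[p] for p in leakage_by_pattern)
    weighted_sum = sum(leakage_by_pattern[p] * pattern_weight[p]
                       for p in leakage_by_pattern)
    return weighted_sum / total_weight


def circuit_leakage(templates):
    """Circuit leakage as the sum of the per-template averages."""
    return sum(template_leakage(leak, weight) for leak, weight in templates)


# Made-up leakage values (arbitrary units) for a two-input template.
buffer_leakage = {"00": 1.0, "01": 2.0, "10": 3.0, "11": 4.0}
uniform = {pattern: 0.25 for pattern in buffer_leakage}   # assumed weights

average = template_leakage(buffer_leakage, uniform)
total = circuit_leakage([(buffer_leakage, uniform), (buffer_leakage, uniform)])
print(average, total)
```

With uniform weights the average is simply the mean of the four pattern values; skewed weights would model templates whose inputs spend most of their idle time in one state.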

4. A Vth-ASSIGNMENT ALGORITHM

The power optimization flow uses a quantum genetic algorithm and is shown in Figure 3. The quantum genetic algorithm (QGA) is characterized by principles of quantum computing, including the concepts of qubits and superposition of states [8]. The algorithm belongs to quantum-inspired evolutionary computing for classical computers [9]. It is a probabilistic algorithm similar to classical genetic algorithms, and it maintains a population of $n$ qubit individuals at generation $g$:

$$Q(g) = \{ q_1^g, q_2^g, \ldots, q_n^g \}, \qquad (2)$$

where $q_j^g$ is a qubit individual defined as

$$q_j^g = \begin{bmatrix} \alpha_{j1}^g & \alpha_{j2}^g & \cdots & \alpha_{jm}^g \\ \beta_{j1}^g & \beta_{j2}^g & \cdots & \beta_{jm}^g \end{bmatrix}, \qquad (3)$$

where $m$ is the number of templates in the circuit and $q_j^g$ represents a linear superposition of solutions to the Vth assignment problem. In effect, $q_j^g$ is a $2 \times m$ matrix whose first and second rows hold the square roots of the probabilities of choosing the low and the high threshold voltage for each of the $m$ templates, respectively. A qubit may be in the '1' state, the '0' state, or any superposition of the two. The optimization flow begins with a randomly generated initial population, which consists of many randomly generated circuit configurations. The flow is iterative: the chromosomes with better fitness survive at each generation.

Figure 3- The Vth assignment flow

4.1 Fitness Function

The fitness of a chromosome should be related to the leakage power consumption of that particular configuration. It should be noted that the leakage optimization is subject to a specified timing constraint. If the global cycle metric of a circuit is larger than the required metric, the configuration is not desirable and the corresponding chromosome should have little chance to survive. Therefore, we evaluate $Pop(G)$ using the fitness function below:

$$\Phi = \frac{1}{P_{max} - P_{min}} P_{Leakage} + \alpha \left( \frac{1}{CM_{max} - CM_{min}} CM \right) \qquad (4)$$

where $P_{Leakage}$ is the leakage power of the circuit, $CM$ is the largest cycle metric, $P_{max}$, $P_{min}$, $CM_{max}$ and $CM_{min}$ are used for normalization, and $\alpha$ is a user-specified constant. A tight timing constraint prevents the algorithm from producing good results. An efficient method based on Karp’s algorithm [7] is used here to find the largest cycle metric of potential solutions.
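A minimal sketch of the fitness evaluation follows. It assumes that Equation (4) is the sum of a normalized leakage term and an α-weighted normalized cycle-metric term, and that a smaller Φ marks a better chromosome; neither assumption is spelled out in the text, and all numbers are made up.

```python
def fitness(p_leakage, cm, p_min, p_max, cm_min, cm_max, alpha):
    """Phi = P_leakage / (P_max - P_min) + alpha * CM / (CM_max - CM_min)."""
    power_term = p_leakage / (p_max - p_min)     # normalized leakage power
    timing_term = cm / (cm_max - cm_min)         # normalized largest cycle metric
    return power_term + alpha * timing_term


# Normalization bounds and alpha are hypothetical user-supplied values.
bounds = dict(p_min=0.0, p_max=20.0, cm_min=0.0, cm_max=12.0, alpha=2.0)

all_low_vth = fitness(20.0, 6.0, **bounds)    # fast but leaky
all_high_vth = fitness(5.0, 12.0, **bounds)   # frugal but slow
mixed_vth = fitness(10.0, 6.0, **bounds)      # high Vth only where slack exists
print(all_low_vth, all_high_vth, mixed_vth)
```

Under these assumptions the mixed configuration scores lowest, because it cuts leakage without increasing the cycle metric, while a larger α penalizes slow configurations more strongly.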

4.2 Control Parameters

We initialize the elements of $q_j^g$ to $1/\sqrt{2}$. We have conducted extensive simulations on a wide range of functions and concluded that a small population of 10 to 15 individuals performs very well. The termination of the iterative evolution can be user defined. We set a maximum number of generations and specify that if the power reduction is less than 0.005% during the last 110 generations, the evolution stops without going through all generations.
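The control parameters above can be made concrete with a small sketch of initialization, observation and the stopping rule. The function names, the seeded random generator and the use of the best leakage per generation as the stopping signal are assumptions of the example; the update step of [9] is deliberately left out.

```python
import math
import random


def init_individual(num_templates):
    """2 x m qubit individual with every amplitude set to 1/sqrt(2)."""
    amplitude = 1.0 / math.sqrt(2.0)
    return [[amplitude] * num_templates, [amplitude] * num_templates]


def observe(individual, rng):
    """Collapse an individual into one concrete Vth assignment per template."""
    low_row, _high_row = individual
    # The first row holds square roots of the low-Vth probabilities.
    return ["low" if rng.random() < a * a else "high" for a in low_row]


def should_stop(best_leakage_history, window=110, min_reduction=0.00005):
    """Stop when leakage improved by less than 0.005% over the last window."""
    if len(best_leakage_history) <= window:
        return False
    old = best_leakage_history[-window - 1]
    new = best_leakage_history[-1]
    return (old - new) / old < min_reduction


rng = random.Random(0)                       # seeded only for repeatability
population = [init_individual(5) for _ in range(10)]
configurations = [observe(q, rng) for q in population]
print(configurations[0])
```

With all amplitudes at 1/√2, every template starts with an equal chance of low and high Vth, so the initial population samples the search space uniformly.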

5. EXPERIMENTAL RESULTS

To test our method, we constructed a multiple-Vth standard cell library using a 90 nm process. For NMOS (PMOS) transistors, the high threshold voltage and the low threshold voltage are 0.22V (−0.22V) and 0.12V (−0.12V), respectively. The library was characterized using the Berkeley 90 nm BSIM predictive model [10]. The ISCAS’85 and ISCAS’89 benchmark circuits were mapped to our library, and an asynchronous synthesis toolset [11] was employed to synthesize them. The circuits were first optimized for maximum speed and then for lowest leakage power consumption. We observe that, on average, dual-Vth asynchronous circuits reduce leakage power in standby mode by 74%. The runtime for our benchmarks ranges from 5s to 500s, depending on circuit size and timing constraints. To verify the performance of the generated dual-Vth circuits, we measured the dynamic performance of both the synthesized single-Vth and dual-Vth circuits with a Timed Petri-Net simulator. Figure 5 shows the circuit throughput with single-Vth and dual-Vth. As shown, the performance penalty of the dual-Vth circuits is only 4%.

Figure 4- Leakage Power ( μW ) for Single/Dual Vth Circuits (benchmarks shown: s838, s1196, s820, s713, s641, s420, s400, s382, c1908, c880, c1355, c499, c17, c432; series: Single-Vth, Dual-Vth)

Figure 5- Throughput (Token/Second) of Single/Dual Vth circuits (series: Single-Vth, Dual-Vth)

We also implemented a contemporary Vth assignment method based on a genetic algorithm, which we call GA-Vth Assignment. Our experiments show that GA-Vth Assignment does not converge to good solutions in some cases, while QG-Vth Assignment finds better solutions, by 10% on average. QG-Vth Assignment is also 2.8 to 3.5 times faster than GA-Vth Assignment.

6. CONCLUSION

This paper presented an efficient method for exploiting the dual-threshold-voltage technique to reduce the leakage power of asynchronous circuits while maintaining their high performance. The decomposed circuit is used to generate a Timed Petri-Net model. The proposed method for assigning high and low threshold voltages is based on a quantum genetic algorithm. The experimental results on a set of ISCAS benchmark circuits show that the proposed technique achieves an average leakage power saving of 74% with a performance penalty of only 4%. Our future work includes extending the model.

7. REFERENCES

[1] A. J. Martin et al., “The Lutonium: A Sub-Nanojoule Asynchronous 8051 Microcontroller,” ASYNC 2003.

[2] S. G. Narendra and A. Chandrakasan, Eds., Leakage in Nanometer CMOS Technologies, Springer, 2005.

[3] L. Wei, Z. Chen, K. Roy, M. C. Johnson, Y. Ye, and V. K. De, “Design optimization of dual-threshold circuits for low-voltage low-power applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1, pp. 16–24, Mar. 1999.

[4] C. G. Wong and A. J. Martin, “High-Level Synthesis of Asynchronous Systems by Data Driven Decomposition,” Proc. of 40th DAC, Anaheim, CA, USA, 2003.

[5] J. L. Peterson, Petri Net Theory and the Modeling of Systems. Englewood Cliffs, N.J.: Prentice-Hall, 1981.

[6] P. A. Beerel, N.-H. Kim, A. Lines, and M. Davies, “Slack Matching Asynchronous Designs,” Proceedings of the 12th IEEE International Symposium on Asynchronous Circuits and Systems, Washington, DC, USA, 2006.

[7] R. M. Karp, “A characterization of the minimum cycle mean in a digraph,” Discrete Mathematics, 23:309–311, 1978.

[8] V. Khandelwal, A. Davoodi, and A. Srivastava, “Simultaneous Vth selection and assignment for leakage optimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 762–765, 2005.

[9] K.-H. Han, “Quantum-inspired Evolutionary Algorithm,” Ph.D. dissertation, Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, Daejeon, Korea, 2003.

[10] B. J. Sheu, D. L. Scharfetter, P. K. Ko, and M. C. Teng, “BSIM: Berkeley Short-Channel IGFET Model for MOS Transistors,” IEEE J. Solid-State Circuits, SC-22, no. 4, pp. 558–566, 1987.

[11] http://www.async.ir/asynctool.php