DESIGN AND EXPERIMENTAL RESULTS OF A CMOS FLIP-FLOP FEATURING EMBEDDED THRESHOLD LOGIC

Marius Padure, Sorin Cotofana, and Stamatis Vassiliadis
Computer Engineering Laboratory, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
E-mail: {marius,sorin,stamatis}@duteppO.et.tudelft.nl

ABSTRACT

This paper describes a semi-dynamic CMOS flip-flop family featuring embedded Threshold Logic functions. First, we present the concept of a flip-flop featuring embedded Threshold Logic, and then we describe the circuit and its operation. Subsequently, we present the design issues and the experimental results of such Threshold Logic flip-flops, obtained in 0.25 µm CMOS technology. We show that we successfully manufactured and tested flip-flops having embedded Threshold functions with up to 16 data inputs. The proposed flip-flop is well suited for high-performance pipelined arithmetic units, since embedding Threshold Logic greatly reduces the pipeline overhead by allowing one or more levels of logic to be eliminated from the path leading to the flip-flop.

Keywords: CMOS digital integrated circuits, flip-flops, Threshold logic, computer arithmetic

1. INTRODUCTION

The continual push for higher clock rates and higher performance has led microprocessor designers in recent years to design superpipelined machines with multiple functional units that can execute operations concurrently. High clock rates in these machines are often achieved with high-granularity pipelining, in which there are relatively few levels of logic gates per pipeline stage. One direct consequence of this design trend is that pipeline overhead is becoming more significant. This overhead is primarily due to the latency of the flip-flop or latch used and to the clock skew of the system. While clock skew varies from system to system, the latency of the flip-flop is always present and cannot be hidden.
As an example, assuming that the flip-flop latency is four gate delays and that the clock cycle of a state-of-the-art microprocessor is 20 gate delays, the flip-flop overhead amounts to 20% of the cycle time. This is a substantial penalty that degrades the overall performance of the system, since no useful logic operation is performed on the data while it is being latched. The idea of incorporating logic functions into storage elements to improve the critical-path latency has emerged in the last decade as a potential way of meeting the cycle-time goals of processors [2, 6]. The challenge has been to develop latch structures that can do this efficiently, in terms of both total latency (defined as the sum of setup time and clock-to-output latency) and area. While previously published flip-flops have embedded simple Boolean functions (AND/OR of a few inputs) [2, 6], no attempt has been made to incorporate Threshold Logic functions into storage elements.
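The overhead estimate above is a simple ratio of flip-flop latency to cycle time. A minimal Python sketch of that arithmetic follows; it assumes, as the example does, that both quantities are expressed in gate delays, and the function names are illustrative only.

```python
def flip_flop_overhead(ff_latency: float, cycle_time: float) -> float:
    """Fraction of the clock cycle consumed by the flip-flop latency.

    Both arguments must use the same unit (gate delays in the example above).
    """
    if cycle_time <= 0:
        raise ValueError("cycle_time must be positive")
    return ff_latency / cycle_time


def logic_budget(ff_latency: float, cycle_time: float) -> float:
    """Gate delays left per stage for useful logic after paying the flip-flop latency."""
    return cycle_time - ff_latency


print(f"{flip_flop_overhead(4, 20):.0%}")  # 20%
print(logic_budget(4, 20))                 # 16
```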

0-7803-7761-3/03/$17.00 ©2003 IEEE

It is well known that Threshold Logic (TL) is fundamentally more powerful than Boolean logic, since a TL gate (when envisioned as a combinational element) can compute more complex and wider functions than the usual Boolean gates (e.g., NAND, OR) can. More formally, a Threshold Logic Gate (TLG) is defined as an n-input processing element whose output computes the following Boolean function¹:

$$F(X) = \mathrm{sgn}\Big\{\sum_{i=0}^{n-1} w_i x_i - T\Big\}, \qquad \mathrm{sgn}\{z\} = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases}$$

where $X = [x_0, x_1, \ldots, x_{n-1}]$, $W = [w_0, w_1, \ldots, w_{n-1}]$, and $T$ are the set of Boolean input variables, the set of fixed signed integer weights associated with the data inputs, and the fixed signed integer threshold, respectively [3].

Several recent theoretical investigations [1, 7, 5] have indicated that computer arithmetic building blocks (e.g., adders and multipliers) can be implemented in TL with fewer logic gates and fewer logic stages than their traditional Boolean logic counterparts. Therefore, embedding TL functions in the storage elements can have a direct impact on the pipeline overhead. In this paper we present a new class of flip-flops featuring embedded Threshold Logic functions to reduce the pipeline overhead. The main features of the basic design are short latency and a single-phase clock scheme. Furthermore, this flip-flop can incorporate Threshold Logic functions at almost no extra delay cost. This feature greatly reduces the pipeline overhead, since each flip-flop can be viewed as a special logic gate that also serves as a synchronization element. Thus, more data processing can be performed within the same cycle time or, for the same processing requirements, the cycle time can be reduced. Taken together, these features make the proposed flip-flop well suited for high-performance microprocessor designs (e.g., computer arithmetic building blocks).
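The TLG definition of F(X) can be checked with a small behavioral model. The Python sketch below is illustrative only (the name `threshold_gate` is ours) and assumes the convention sgn{z} = 1 for z ≥ 0 and 0 otherwise. It also shows how, with unit weights, a single TLG becomes an OR, an AND, or a majority gate just by changing T.

```python
from typing import Sequence


def threshold_gate(x: Sequence[int], w: Sequence[int], t: int) -> int:
    """Evaluate a Threshold Logic Gate: 1 if sum(w_i * x_i) - T >= 0, else 0.

    x: Boolean inputs (0/1); w: signed integer weights; t: signed integer threshold.
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    # sgn{z} is taken as 1 for z >= 0 and 0 otherwise.
    return 1 if weighted_sum - t >= 0 else 0


# With unit weights: T = 1 gives OR, T = n gives AND, T = 2 (for n = 3) gives majority.
n = 3
ones = [1] * n
print(threshold_gate([0, 1, 0], ones, 1))  # 1 (OR)
print(threshold_gate([0, 1, 1], ones, n))  # 0 (AND)
print(threshold_gate([0, 1, 1], ones, 2))  # 1 (majority)
```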
In order to obtain a proof of concept and to evaluate the maximum fan-in that can be reliably achieved for the Threshold Logic embedded in the flip-flop, we manufactured and tested four Threshold Logic flip-flop (TLFF) prototypes in 0.25 µm CMOS. We show that TLFFs with up to 16 inputs, each having unit weight, were successfully tested. Moreover, TLFFs with up to 64 inputs were functionally tested and debugged; the causes of their failures and the corresponding remedies are discussed in this paper. The paper is organized as follows: Section 2 explains the basic operation of the proposed Threshold Logic flip-flop. Section 3 presents the experimental results obtained on a test chip in 0.25 µm CMOS. Section 4 presents some concluding remarks.


¹All the operators are algebraic.

V-253


Figure 1: CMOS flip-flop with embedded Threshold Logic

2. FLIP-FLOP DESCRIPTION AND OPERATION

A schematic diagram of the Threshold Logic flip-flop (TLFF) is presented in Figure 1. The circuit is composed of a semi-dynamic front-end, a differential current-switch threshold logic (DCSTL) gate [4], followed by a static back-end, an RS latch. The DCSTL front-end comprises a fast latched comparator and two parallel-connected sets of unit nMOS transistors (MX0-2 and MT0-2), referred to herein as the input data bank and the threshold mapping bank. The TLFF in Figure 1 has 3 data inputs, X0, X1, X2, and 3 threshold mapping inputs, T0, T1, T2. The weights are implemented using parallel-connected sets of unit transistors (i.e., the nMOS transistors MX0, MX1, and MX2 have the multiplicity factors m = 1, 3, and 4, respectively). The transistors in both banks operate in the linear region; therefore, their drain current depends mainly on m, VGS - Vth, and VDS, where VGS is the gate-source voltage, Vth is the threshold voltage, and VDS is the drain-source voltage. The total currents generated by the two transistor banks are compared by the latched comparator, so that node S is logic zero if the current generated by the data bank, I_data, is greater than the current generated by the threshold mapping bank, I_T, and logic one otherwise.

Note that, by design, the data bank can never draw the same current as the threshold mapping bank, not even when the threshold is exactly reached, because an nMOS transistor with weight 0.5 is always on. This prevents the latched comparator from entering a metastable state. Moreover, an additional transistor was introduced for increased accuracy: it equalizes the voltages on both banks and therefore eliminates the dependency of I_data and I_T on the drain-source voltages of the nMOS transistors. This transistor is kept small to avoid a significant increase in total delay.

The circuit operates as follows. On the falling edge of the clock, the flip-flop enters the precharge phase: M10 and M11 are on, nodes S and R are precharged high, and the outputs Q and Q̄ hold their previous values; since S and R are high, M6 and M7 are on, pulling their sources to a weak high level. On the rising edge of the clock, the flip-flop enters the evaluation phase: the clocked evaluation transistors turn on, and M6 and M7 (the shutoff devices) start drawing current from nodes S and R. If I_data > I_T, the voltage at node S drops faster than the voltage at node R. Therefore, S crosses the latch switching threshold first, and the latch regenerates rapidly to S low and R high, causing Q to go high. Conversely, if I_data < I_T, R goes low and S goes high, causing Q to go low. At the end of the evaluation phase, whichever of S and R has risen high is decoupled from ground by one of the shutoff transistors M6, M7 turning off. Therefore, no DC power is dissipated at the end of the evaluation phase. Additionally, any change on the inputs after the gate has completed evaluation does not affect nodes S and R; consequently, the whole structure acts as an edge-triggered flip-flop. We stress that, since the basic operation of this flip-flop relies on tight matching between the nMOS transistors of the data bank and those of the threshold mapping bank, the electrical and physical designs of such TLFFs require careful matching. In the next section we present the experimental results obtained in 0.25 µm CMOS and several implementation issues we took into account for the experimental test chip.

3. EXPERIMENTAL RESULTS

A simple test structure was implemented in a 0.25 µm feature-size vanilla CMOS technology. Our main goals were to obtain a proof of concept and to evaluate the maximum fan-in that can be reliably achieved. Performance measurement was also targeted, but not as a main goal, since the implemented TLFFs were not delay-optimized. The microphotograph of the manufactured test chip is shown in Figure 2. Additional buffering for the output pads was provided on-chip, since the functional tests were performed by applying test vectors with an HP82000 IC evaluation system² and reading the outputs off-chip with a digitizing oscilloscope. The large buffers shown in Figure 2 were designed as 6 stages with a step-up ratio of 3 to drive approximately 6 pF of external capacitance.

The main data regarding the implemented TLFFs are presented in Table 1. The total sum of weights of each TLFF is 8 (TLFF8), 16 (TLFF16), 32 (TLFF32), and 64 (TLFF64); this reflects the fan-in intended for the Threshold functions (having unit weights) embedded in the flip-flops. We implemented 4 TLFFs, each having 3 data inputs and 3 threshold mapping inputs, with all 6 inputs available off-chip. The reasons behind these decisions are as follows. First, we targeted flip-flops with embedded Threshold Logic having fan-ins of up to 64 data inputs (each with unit weight). Second, for subsequent debugging, it was preferable for all inputs to be available off-chip. Therefore, due to the limited pin count of the available package, we implemented 4 TLFFs, each having only 3 data inputs and 3 threshold mapping inputs, with the weights chosen such that, if n is the fan-in, then w0 = 1, w1 = n/2 - 1, and w2 = n/2. Since w0 + w1 + w2 = n, this choice provides great functional testing flexibility for all 4 TLFFs (weights and T values ranging from 0 up to n).

²Since the available testing facilities were limited, the functionality of the TLFFs was verified only for clock frequencies up to 25 MHz.
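The current-comparison principle of Section 2 and the weight assignment just described can be combined into an idealized behavioral model of the 3-input TLFFs. The Python sketch below is ours, not the authors' design flow, and all names in it are hypothetical. It assumes perfectly matched unit transistors whose currents add linearly, that the threshold mapping bank uses the same multiplicities as the data bank, and that the always-on weight-0.5 device sits in the data bank (the placement under which T = 1 yields an OR, as stated in the text). Under these assumptions it reproduces the T sets of Table 1 and gives the expected outputs for an exhaustive functional sweep.

```python
from __future__ import annotations

from itertools import product

# Always-on half-unit device. Assumption: it sits in the data bank, so that
# Q = 1 exactly when sum(w_i * x_i) >= T (T = 1 then yields an OR).
HALF_UNIT = 0.5


def tlff_weights(n: int) -> tuple[int, int, int]:
    """Weights of the 3 inputs for fan-in n: (1, n/2 - 1, n/2), summing to n."""
    if n < 4 or n % 2:
        raise ValueError("n must be an even number >= 4")
    return (1, n // 2 - 1, n // 2)


def tlff_evaluate(x, t_inputs, weights) -> int:
    """Idealized evaluation phase: Q = 1 if I_data > I_T, else 0.

    Currents are in units of one unit-transistor current; random mismatch,
    V_DS dependence and timing are all ignored.
    """
    i_data = sum(w * xi for w, xi in zip(weights, x)) + HALF_UNIT
    i_t = sum(w * ti for w, ti in zip(weights, t_inputs))
    return 1 if i_data > i_t else 0


def threshold_values(n: int) -> list[int]:
    """Distinct T values reachable with the 8 threshold-input combinations."""
    w = tlff_weights(n)
    return sorted({sum(wi * ti for wi, ti in zip(w, t)) for t in product((0, 1), repeat=3)})


def truth_table(n: int, t_inputs) -> list[int]:
    """Q for all 8 data vectors (X0, X1, X2), in binary counting order."""
    w = tlff_weights(n)
    return [tlff_evaluate(x, t_inputs, w) for x in product((0, 1), repeat=3)]


print(threshold_values(8))        # [0, 1, 3, 4, 5, 7, 8]
print(truth_table(8, (1, 0, 0)))  # T = 1, OR:  [0, 1, 1, 1, 1, 1, 1, 1]
print(truth_table(8, (1, 1, 1)))  # T = 8, AND: [0, 0, 0, 0, 0, 0, 0, 1]
```

Because the model ignores random mismatch, it predicts correct operation for every fan-in; the TLFF32 and TLFF64 failures reported later in this section are precisely the effect it leaves out.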


Table 1: Brief description of TLFFs implemented on-chip

Name     W             Possible values of T         (header lost in extraction)
TLFF8    {1, 3, 4}     {0, 1, 3, 4, 5, 7, 8}        1180
TLFF16   {1, 7, 8}     {0, 1, 7, 8, 9, 15, 16}      1470
TLFF32   {1, 15, 16}   {0, 1, 15, 16, 17, 31, 32}   2050
TLFF64   {1, 31, 32}   {0, 1, 31, 32, 33, 63, 64}   3208

Table 2: Transistor geometries for the on-chip TLFFs
[Table body not recoverable from the extracted text.]

Table 3: Qualitative description of the functional tests

Name     Functional test passed   Obs.
TLFF8    yes
TLFF16   yes
TLFF32   yes                      if VH(X1) = 2.2 V
TLFF64   yes                      if VH(T2) = 2.0 V

Figure 2: Microphotograph of manufactured TLFFs

The TLFFs implemented on-chip have the transistor geometries presented in Table 2. Since the main goal was to test correct functionality, the TLFFs were sized to optimize matching and therefore to maximize the effective fan-in. First, the latch transistors have channel lengths of 1 µm, which greatly increases the clk-Q delay but improves matching and gain. Second, the unit nMOS transistor of both banks was sized approximately 5 times larger than minimum for better matching. Third, each gate layout was designed for better matching, using a symmetrical structure for the latch and the transistor banks. Taken together, these choices make our TLFF design less than optimal³ when speed is of particular interest.

During the functional tests, we applied all 8 possible input vectors while sweeping through all 8 possible T input combinations (only 7 distinct threshold values are possible, since T = n/2 can be obtained in two different ways, as w2 or as w0 + w1). Note that when T = 1 all TLFFs perform an OR function, while T = 8, 16, 32, 64 for TLFF8, TLFF16, TLFF32, TLFF64, respectively, makes all TLFFs perform an AND function. Figures 3 and 4 show the experimental waveforms at the output of TLFF8 when configured to embed an OR and an AND function, respectively. The results of the functional tests performed on all TLFFs are summarized in Table 3. While TLFF8 and TLFF16 perform correctly for all 7 distinct threshold values with all 8 possible vector combinations applied on the data inputs, TLFF32 and TLFF64 experience failures on 4 out of 7 T values. The main reason for these failures is the random mismatch between the currents generated in the data bank and in the threshold mapping bank. However, reliable behavior of TLFF32 and TLFF64 was achieved by reducing the logic-one voltage level (designated herein as VH) applied individually on each of the X0-2 and T0-2 inputs. The last column of Table 3 gives the logic-one voltage levels applied on the X1 and T2 pins, respectively, to correct the malfunctioning of TLFF32 and TLFF64. During those tests, all other pins kept the nominal logic-one voltage level (VH = VDD).

The clk-Q delay was measured indirectly for each TLFF, using the experimental setup of Figure 5. To obtain equal rise and fall times for both TLFF outputs, and to reduce measurement errors, two dummy inverters were connected to the Q output. A long chain of 28 inverters was implemented for each TLFF, serving as a delay line between the output of the tested TL flip-flop and the data input of a conventional flip-flop (for signal-capturing purposes). Additionally, a similar stand-alone inverter chain was manufactured on-chip; its measured delay, $t_{\mathrm{chain}}$, is about 1.12 ns. During the test, the clock frequency was kept constant ($f_{\mathrm{clk}}$ = 1 MHz) while the skew between clk1 and clk2 was decreased with high resolution⁴ until the conventional flip-flop receiving the signal from the inverter chain failed to capture the correct logic value of the signal originating from the TLFF output. At skew times long enough for the conventional flip-flop to capture the data successfully, the output of buffer B2 is a delayed replica of the output of buffer B1. As the skew is decreased, the conventional flip-flop eventually fails to capture the data due to a setup violation; at that point the following relation holds: $t_{\mathrm{skew}} = t_{\mathrm{clk1}\text{-}Q} + t_{\mathrm{chain}}$.

³We also simulated, with SPICE, an aggressive design having smaller transistors; the clk-Q delay indicated by the simulator is between 218 ps and 316 ps, i.e., between 5 and 6 times faster than the delay measured for the on-chip design (see Table 4 for details).
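Solving the relation above for the flip-flop delay gives t_clk1-Q = t_skew - t_chain, where t_skew is the smallest skew at which the conventional flip-flop still captures correctly. The Python sketch below is our own illustration with hypothetical names: the `captures` callback stands in for one tester measurement, and `fake_dut` is a made-up device with a chosen delay, not measured data. It works in integer picoseconds to avoid floating-point drift while stepping the skew.

```python
from typing import Callable

T_CHAIN_PS = 1120  # measured stand-alone inverter-chain delay, about 1.12 ns


def clk_to_q_ps(t_skew_ps: int, t_chain_ps: int = T_CHAIN_PS) -> int:
    """Solve t_skew = t_clk1-Q + t_chain for the TLFF clk-Q delay (setup time ignored)."""
    return t_skew_ps - t_chain_ps


def min_passing_skew_ps(captures: Callable[[int], bool], start_ps: int,
                        step_ps: int = 100) -> int:
    """Shrink the clk1-clk2 skew in fixed steps while capture still succeeds.

    `captures(skew_ps)` is a hypothetical stand-in for one measurement: True if
    the conventional flip-flop latched the correct value. Returns the smallest
    passing skew seen, so the result is resolved only to within one step.
    """
    if not captures(start_ps):
        raise ValueError("capture already fails at the starting skew")
    skew = start_ps
    while skew - step_ps > 0 and captures(skew - step_ps):
        skew -= step_ps
    return skew


# Made-up device under test whose true clk-Q delay is 1000 ps.
def fake_dut(skew_ps: int) -> bool:
    return skew_ps >= 1000 + T_CHAIN_PS


skew = min_passing_skew_ps(fake_dut, start_ps=5000)
print(skew, clk_to_q_ps(skew))  # 2200 1080
```

With a 100 ps step, the estimate is resolved only to within one step, so in this made-up example the 1000 ps delay is reported as 1080 ps.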

The simulated and measured clk-Q delay figures for all 4 TLFFs are presented in Table 4; no measured delay figures are available for TLFF32 and TLFF64. Note that, according to Table 4, the minimum clk-Q delay is obtained when the TLFFs perform the AND function, while the maximum clk-Q delay is obtained when they perform the OR function. Regarding TLFF performance, while the simulations indicated that the clk-Q delay of TLFF8 ranges from 0.99 ns up to 1.90 ns, the measurements indicated a delay of between 1.0 ns (best case) and 2.2 ns (worst case). Note that the clk-Q delay of a TLFF depends mainly on the function performed (i.e., on T): the greater the value of T, the greater the current drawn from the threshold mapping bank, and the faster nodes S and R are discharged. Therefore, a TLFF with an embedded AND function is faster than a TLFF with an embedded OR function.

⁴The HP82000 allows a resolution down to 100 ps.

Table 4: Simulated and measured clk-Q delay figures for the implemented TLFFs @ VDD = 2.5 V and room temperature

Name     Sim. min [ns]   Sim. max [ns]   Meas. min [ns]   Meas. max [ns]
TLFF8    0.99            1.88            1.0              2.2
TLFF16   0.93            1.88            1.0              2.1
TLFF32   0.91            1.88            N.A.             N.A.
TLFF64   0.90            1.87            N.A.             N.A.

Figure 3: Experimental output waveform for TLFF8 with T = 1 (output buffered by B1)

Figure 4: Experimental output waveform for TLFF8 with T = 8 (output buffered by B1)

Figure 5: Experimental setup for delay measurements

4. CONCLUSIONS

This paper presented a semi-dynamic CMOS flip-flop family featuring embedded Threshold Logic functions. First, we introduced the concept of a flip-flop featuring embedded Threshold Logic, and then we described the circuit and its operation. Subsequently, we presented the design issues and the experimental results of such Threshold Logic flip-flops, obtained in 0.25 µm CMOS technology. We have shown that flip-flops having embedded Threshold functions with up to 16 data inputs were successfully manufactured and tested. Moreover, TLFFs with up to 64 inputs were functionally tested and debugged, and the causes of their failures, as well as the corresponding remedies, were discussed.

Acknowledgments

The authors wish to thank O2Micro International Ltd. for covering the costs of processing and packaging the test chip, and Mr. Jan Koopmans, of Delft University of Technology, for his help during testing.

5. REFERENCES

[1] S. Cotofana and S. Vassiliadis. Periodic symmetric functions, serial addition and multiplication with neural networks. IEEE Transactions on Neural Networks, 9(6):1118-1128, October 1998.

[2] F. Klass. Semi-dynamic and dynamic flip-flops with embedded logic. Symposium on VLSI Circuits, pages 108-109, 1998.

[3] S. Muroga. Threshold Logic and Its Applications. Wiley and Sons Inc., 1971.

[4] M. Padure, S. Cotofana, C. Dan, S. Vassiliadis, and M. Bodea. A low-power Threshold logic family. IEEE International Conference on Electronics, Circuits and Systems (ICECS 2002), 2:657-660, September 2002.

[5] M. Padure, S. Cotofana, and S. Vassiliadis. High-speed hybrid Threshold-Boolean counters and compressors. 45th IEEE Midwest Symposium on Circuits and Systems, in press, 2002.

[6] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper. Flow-through latch and edge-triggered flip-flop hybrid elements. IEEE International Solid-State Circuits Conference, pages 138-139, 1996.

[7] S. Vassiliadis, S. Cotofana, and K. Bertels. 2-1 addition and related operations with Threshold logic. IEEE Transactions on Computers, 45(9):1062-1068, September 1996.
