Self-characterization of Combinatorial Circuit Delays in FPGAs
Justin S. J. Wong, Pete Sedcole and Peter Y. K. Cheung
Department of Electrical & Electronic Engineering, Imperial College London South Kensington campus, London SW7 2AZ, UK
[email protected] [email protected] [email protected]

Abstract- This paper proposes a built-in self-test (BIST) method to accurately measure combinatorial circuit delays on an FPGA. The flexible on-chip clock generation capability found in modern FPGAs is employed to step through a range of frequencies until timing failure in the combinatorial circuit is detected. In this way, the delay of any combinatorial circuit can be determined with a timing resolution of 1 ps or finer. A parallel implementation of the method for self-characterization of the delay of all the LUTs on an FPGA is also proposed. The method was applied to an Altera Cyclone-II FPGA (EP2C35). A complete self-characterization was achieved in 3 seconds, utilizing only 13 kbit of block RAM to store the results. This self-characterization method paves the way for matching timing requirements in designs to FPGAs as a means of combating the problem of process variations.

I. INTRODUCTION
Process variability as a consequence of shrinking feature size is generally acknowledged as a serious future problem for the integrated circuit industry. Its impact on FPGAs is even more serious than on ASICs. Unlike an ASIC, where the design, and hence the critical paths, are known, an FPGA is uncommitted to any specific design until it is configured. However, as suggested in [1], the configurable nature of FPGAs could be exploited to alleviate the curse of process variability by matching the timing requirements of different parts of a design to the hardware resources on a device-by-device basis during power-up configuration. For such an approach to be taken, an efficient and accurate method to measure the delay of individual or small groups of configurable logic blocks (CLBs) or logic elements (LEs) is needed.

This paper reports a method that achieves accurate self-characterization of delay on an FPGA through the use of a novel built-in self-test (BIST) design. The proposed system provides the following novel features: 1) delay in any combinatorial components, LUTs or paths can be measured; 2) a timing resolution of around 1 ps can be achieved with 90 nm technology; 3) delay measurement happens concurrently on many CLBs or LEs, resulting in fast characterization; 4) the method potentially allows self-characterization without any circuits external to the FPGA.

The paper is organised as follows: Section II reviews existing work in process variation measurement and delay testing of FPGAs. In Section III, the key idea of the novel BIST design is
1-4244-1472-5/07/$25.00 © 2007 IEEE
explained. Section IV describes the delay characterization system in detail and how the on-chip clock generation circuits on a modern FPGA are exploited to facilitate self-characterization. In Section V, the BIST design is extended to allow efficient parallel delay measurement. The BIST system is applied to an actual FPGA and results are presented with a discussion of their significance. Section VI concludes the paper and proposes possible future work.

II. BACKGROUND

Existing process variation measurement methods are mainly based on arrays of ring oscillators (ROs) [1], [2]. Frequency counters are used to determine the maximum operating frequencies of groups of LEs or CLBs on a given FPGA die. Using ring oscillators for characterization, while simple, is limiting in many ways. To produce an oscillation frequency low enough for the frequency counter, several inverter stages (typically 5 or more) must be used. The granularity of the delay measurement is therefore limited to groups of inverters. Moreover, an RO with its inherent feedback path does not resemble the structure and behaviour of circuits found in real designs.

Alternative delay measurement methods can be found in [3], [4]. However, these are designed for application-specific testing on FPGAs and cannot be used directly as a general delay variation test method. Other general delay testing methods for FPGAs, such as those found in [5], [6] and [7], are designed for discovering delay faults in abnormal paths or logic, and are unsuitable for delay characterization of non-faulty devices.

In ASICs, at-speed scan testing is widely adopted to test for propagation time [8]. A predefined test vector is propagated from a launch flip-flop through a combinatorial circuit to a capture flip-flop, and the results are compared to the expected output. By altering the time between the launch and capture clock edges, it is possible to determine the propagation time.
However, at-speed testing in general is used to measure delays on custom designs and requires an external clock source and test equipment to generate and compare the test vectors. Results have to be processed externally before a definitive propagation time is determined. Most existing at-speed scan tests are designed to obtain only pass/fail information for a specific custom chip at specific frequencies instead of precise propagation delay information for every combinatorial path on the chip. At-speed scan testing is therefore not an ideal option for a fully built-in self-test method to characterize the delay variability on an FPGA.

Fig. 1. Basic principle of the delay measurement method.

III. PRINCIPLE OF OPERATION
Fig. 2. Detailed FRD circuit.
Most previous methods to measure on-chip delay on FPGAs are based on direct measurements of frequency or time [1], [2], [9], [10]. The method proposed in this work performs indirect measurement based on stepping the frequency of an on-chip clock generator. Fig. 1 depicts the basic idea of our method using a Failure Rate Detection (FRD) circuit, which is particularly suitable for measuring delay between two pipeline stages.

The circuit-under-test (CUT) is sandwiched between two pipeline registers which are clocked by a test clock generator (TCG), the frequency of which is stepped between a lower and an upper bound. A test stimuli generator (TSG) provides stimuli S to the CUT such that the output signal D toggles after $t_{delay}$, which is to be measured. As the clock frequency is stepped from the lower to the upper bound, the CUT transits from operating correctly to operating with timing error. The error detection circuit (EDC) monitors particular internal signals in the CUT in such a way that the signal Error is high if a timing error is detected. The error histogram accumulator (EHA) stores the error count at each frequency to facilitate the building of a cumulative error histogram against the test clock frequency. From this, the frequency at which the circuit fails, and hence the delay of the CUT, can be found.

Since modern FPGAs contain very flexible on-chip clock generation resources, any combinatorial circuit and path delay can be measured without the need for external circuitry. The accuracy of the delay measurement is determined by how finely the frequency of the TCG can be stepped. If the CUT works at frequency $f$ but fails at frequency $f + \Delta f$, the delay of the circuit is between $t_1 = 1/f$ and $t_2 = 1/(f + \Delta f)$. The delay time resolution is:

$$\Delta t = t_1 - t_2 = \frac{1}{f} - \frac{1}{f + \Delta f} \tag{1}$$

For example, with $f$ = 500 MHz and $\Delta f$ = 0.25 MHz, the delay resolution achieved is 1 ps.

The frequency at which the circuit fails is derived from the cumulative histogram instead of the spot frequency at which the circuit first fails. This statistical approach provides better measurement accuracy while reducing the effect of phase jitter in the test clock signal. This will be demonstrated in more detail later.

A. Circuit Detail

The detailed implementation of the FRD is shown in Fig. 2. The launch register LR and the sampling register SR are clocked at opposite phases of the test clock. This implies that the stimuli S must propagate across the CUT in approximately half the clock period. The EDC compares the delayed signal D and the output Q of the sampling register with an XOR gate and latches any error E with the capture register CR on the rising edge of the test clock to produce a late signal L. This in turn causes a toggle flip-flop to provide a transition, signaling an error to the EHA circuit with the signal Error.

The output toggle flip-flop serves a number of useful purposes. Firstly, it reduces the error count frequency by half, reducing the self-heating effect of the EHA, which can be placed some distance away from the FRD. Secondly, it serves as a synchronous-to-asynchronous interface circuit. The EHA is implemented as an asynchronous counter to avoid the need for synchronization between the test clock and the error counter clock.

B. Timing Considerations

Fig. 3 illustrates the operation of the FRD over three test clock cycles. In cycle 1, the CUT operates without timing error. This error-free condition (valid zone) occurs when:

$$t_{hold\_Q} - t_{clk\_S} < t_{delay1} < \frac{T}{2} - (t_{setup\_Q} + t_{clk\_S}) \tag{2}$$

where $t_{delay1}$ is the delay to be measured, $T$ is the clock period, $t_{setup\_Q}$ is the setup time of the sampling register SR, and $t_{clk\_S}$ is the clock-to-output delay of the launch register LR. Furthermore, in order for the late signal L to be interpreted correctly, the following must hold:

$$t_{delay1} < T - (t_{clk\_S} + t_{setup\_C} + t_g) \tag{3}$$

where $t_{setup\_C}$ is the setup time of the capture register CR and $t_g$ is the propagation delay of the XOR gate. Cycle 2 depicts the condition where a timing error occurs (invalid zone). This happens when the following condition is satisfied:

$$\frac{T}{2} + t_{hold\_Q} - t_{clk\_S} < t_{delay2} < T - (t_{clk\_S} + t_{setup\_C} + t_g) \tag{4}$$

Fig. 3. Timing diagram of the FRD over three test clock cycles (Cycle 1, Cycle 2, Cycle 3), showing the signals Clock, D, Q and E (D xor Q). $t_{clk\_S}$ is the clock-to-Q delay between the clock edge and the output (S) of the Launch Toggle Flip-flop. $t_g$ is the propagation delay of the XOR gate. $t_{setup\_Q}$ and $t_{hold\_Q}$ are the setup and hold times of the sampling register SR. $t_{setup\_C}$ and $t_{hold\_C}$ are the setup and hold times of the capture register CR.

Fig. 4. Detailed FRD on a Cyclone-II FPGA.

$\ldots\; 2\,(t_{delay\_min} + t_{clk\_S} - t_{hold\_Q})$ (7), (8)

where $t_{delay\_max}$ and $t_{delay\_min}$ are the delays of the slowest and fastest CUT respectively.
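The timing zones of Section III-B can be illustrated with a minimal Python sketch. It checks the valid-zone condition (2), the late-signal condition (3), and the lower bound of the invalid-zone condition as read from the text. The class and function names, the zone labels "unobservable" and "metastable", and all numeric values are hypothetical and chosen for illustration only; they are not taken from the measured device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrdTiming:
    """Register and gate timing parameters in seconds (illustrative names)."""
    t_clk_s: float    # clock-to-output delay of the launch register LR
    t_setup_q: float  # setup time of the sampling register SR
    t_hold_q: float   # hold time of the sampling register SR
    t_setup_c: float  # setup time of the capture register CR
    t_g: float        # propagation delay of the XOR gate


def classify_zone(t_delay: float, period: float, p: FrdTiming) -> str:
    """Classify a CUT delay for one test clock period."""
    half = period / 2.0
    # Condition (3): the late signal L can only be interpreted correctly
    # if the error reaches the capture register CR in time.
    if not t_delay < period - (p.t_clk_s + p.t_setup_c + p.t_g):
        return "unobservable"
    # Condition (2): valid zone, D settles before the sampling edge.
    if p.t_hold_q - p.t_clk_s < t_delay < half - (p.t_setup_q + p.t_clk_s):
        return "valid"
    # Invalid zone: D changes after the sampling edge plus the hold time.
    if t_delay > half + p.t_hold_q - p.t_clk_s:
        return "invalid"
    # Between the two zones the sampling register may be metastable.
    return "metastable"


# Hypothetical parameters: 2 ns period (500 MHz test clock).
params = FrdTiming(t_clk_s=100e-12, t_setup_q=50e-12, t_hold_q=20e-12,
                   t_setup_c=50e-12, t_g=100e-12)
for delay in (800e-12, 900e-12, 950e-12, 1800e-12):
    print(f"{delay * 1e12:.0f} ps -> {classify_zone(delay, 2e-9, params)}")
```

With these assumed values, the gap between the valid and invalid zones is the setup-plus-hold window of the sampling register, which is where the failure rate is expected to be neither 0% nor 100%.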
Fig. 6. Test Clock Generator using a Virtex-4 and a Cyclone-II. The signal "ReProgram Data & Control" from the DE2 is a 22-bit bus containing the 20-bit M and D data for both DCMs and the ReProgram enable and reset signals. The Clk locked signal informs the DE2 that the clock is ready and that testing can begin.

Fig. 7. Frequency step size from 128 to 840 MHz.

Fig. 9. The plot shows the dynamic resolution of timing measurements.
Due to the constraints imposed by the DCMs on the Virtex-4, the possible frequencies produced by the TCG are discrete and not equally spaced. With the clock generation circuit shown in Fig. 6, the TCG is able to generate a test clock from 128 MHz to 840 MHz with an average frequency step of 0.038 MHz and a worst-case (i.e. largest) step of 0.854 MHz. Fig. 7 shows all the possible frequency steps that can be produced.

A. Delay Results

Fig. 8(a) shows the relationship between the test clock frequency and the failure rate of a typical CUT containing 2 cascaded LUTs configured as inverters. The plot shows that the CUT transits from 0% failure at 520 MHz (Region A) to 100% failure at 590 MHz (Region C), as expected. However, there is a range of frequencies (Region B) where the failure rate settles at exactly 50%. This behaviour can be explained with the help of the timing diagram shown in Fig. 8(b).

Fig. 8(b) shows the delay behaviour of the CUT in the three regions: Region A is the fault-free range, Region C is the faulty range, and Region B is the 50% range. The 50% failure rate arises when the delays for the positive and negative transitions through the CUT are different. In that case, there exists a range of frequencies where timing failure occurs for only one type of transition and not the other. Since, by definition, 50% of transitions are positive and 50% are negative, this results in the plateau in the failure rate versus frequency plot.
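The origin of the 50% plateau can be illustrated with a minimal Python sketch. It is a simplified model, not the authors' measurement procedure: a transition is assumed to fail when its total delay exceeds half the clock period, setup and hold times are ignored, and the delay values `t_pos` and `t_neg` as well as the optional Gaussian `jitter_sigma` are hypothetical.

```python
import math


def failure_rate(freq_hz: float, t_pos: float, t_neg: float,
                 jitter_sigma: float = 0.0) -> float:
    """Expected fraction of failing transitions at one test frequency.

    Assumes half of all transitions are positive and half are negative,
    and that a transition fails when its delay exceeds T/2.
    """
    half_period = 1.0 / (2.0 * freq_hz)

    def p_fail(t_total: float) -> float:
        if jitter_sigma == 0.0:
            return 1.0 if t_total > half_period else 0.0
        # Gaussian uncertainty on the sampling instant (normal CDF).
        z = (t_total - half_period) / (jitter_sigma * math.sqrt(2.0))
        return 0.5 * (1.0 + math.erf(z))

    return 0.5 * p_fail(t_pos) + 0.5 * p_fail(t_neg)


# Hypothetical delays: positive transitions slower than negative ones.
T_POS, T_NEG = 0.95e-9, 0.88e-9
for f_mhz in (520, 550, 590):
    rate = failure_rate(f_mhz * 1e6, T_POS, T_NEG)
    print(f"{f_mhz} MHz -> {rate:.0%}")
```

In this model the plateau spans the frequencies whose half period lies between the two transition delays, so it narrows as `t_pos` and `t_neg` approach each other. With non-zero jitter, the failure rate at the frequency where T/2 equals the slower delay is about 25%.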
As the difference between the positive and negative transition delays decreases, the plateau gets narrower. In order to give a single value of delay for the CUT, the faster of the two transitions is ignored, and the delay is calculated from the Region A to Region B curve.

The transition between the three regions is not a step, as would be expected if failure always occurred at a spot frequency. This is caused by phase jitter in the clock signal and by the signal transition occurring in the metastable window of the clock signal. If both of these are treated as stochastic in nature with symmetrical probability distributions, the actual CUT delay can be estimated from the point at which a 50% failure rate for the slow type of transition occurs, i.e., the 25% failure rate point. The delay through the CUT is therefore estimated as:

$$t_{delay} = \frac{1}{2 \times f_{25\%}} \tag{9}$$

where $f_{25\%}$ is the test clock frequency at which the failure rate is 25%.
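A minimal Python sketch of this estimate follows. It assumes the failure rates are available as a list sampled at ascending test frequencies and uses linear interpolation to locate the 25% point; the interpolation choice, the function names and the sample data are assumptions for illustration, not details given in the paper.

```python
from typing import Sequence


def frequency_at_failure_rate(freqs_hz: Sequence[float],
                              rates: Sequence[float],
                              target: float = 0.25) -> float:
    """Linearly interpolate the first upward crossing of `target`.

    `freqs_hz` must be in ascending order; `rates` are fractions in [0, 1].
    """
    points = list(zip(freqs_hz, rates))
    for (f0, r0), (f1, r1) in zip(points, points[1:]):
        if r0 < target <= r1:
            return f0 + (target - r0) * (f1 - f0) / (r1 - r0)
    raise ValueError("failure rate never crosses the target")


def estimate_delay(freqs_hz: Sequence[float], rates: Sequence[float]) -> float:
    """CUT delay as half the clock period at the 25% failure point."""
    f25 = frequency_at_failure_rate(freqs_hz, rates, target=0.25)
    return 1.0 / (2.0 * f25)


# Hypothetical failure-rate samples around the Region A to Region B edge.
freqs = [520e6, 530e6, 540e6, 550e6]
rates = [0.0, 0.1, 0.4, 0.5]
print(f"estimated delay: {estimate_delay(freqs, rates) * 1e12:.1f} ps")
```

Interpolating between measured points is one way to avoid being limited to the discrete TCG frequencies; whether it improves accuracy in practice depends on how smooth the measured transition is.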
B. Measurement Accuracy

The accuracy of the delay measurements is affected by a number of factors. When used at frequencies above 267 MHz, the current TCG circuit provides a worst-case timing resolution of 1.33 ps at 400 MHz, as shown in Fig. 9. However, at the operating frequencies of the FRD circuit, around 530 MHz and above, the delay resolution is under 0.5 ps.

There are also uncertainties due to clock jitter and self-heating effects. To evaluate such uncertainties, 500 delay measurements of the same circuit were made over time; the results are shown in Fig. 10(a). The best-fit curve shows the self-heating effect, which settles after around 120 measurements. Each test lasts for approximately 720 ms, so the self-heating reaches equilibrium after 720 ms x 120 = 86.4 s. The random scattering around the best-fit curve is depicted in Fig. 10(b) as a histogram. This random error, mostly due to clock jitter, is approximately Gaussian with a small standard deviation of 0.61 ps.
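The dependence of timing resolution on frequency and step size follows from the relation $\Delta t = 1/f - 1/(f + \Delta f)$ given in Section III. The short Python sketch below evaluates it for the worked example in the text (500 MHz with a 0.25 MHz step); the function name is illustrative.

```python
def delay_resolution(freq_hz: float, step_hz: float) -> float:
    """Delay resolution, in seconds, between two adjacent test frequencies."""
    return 1.0 / freq_hz - 1.0 / (freq_hz + step_hz)


# Worked example from the text: f = 500 MHz, step = 0.25 MHz.
print(f"{delay_resolution(500e6, 0.25e6) * 1e12:.4f} ps")

# For a fixed step size, the resolution improves as the frequency rises.
for f_mhz in (400, 530, 800):
    print(f_mhz, "MHz:", f"{delay_resolution(f_mhz * 1e6, 0.25e6) * 1e12:.3f} ps")
```

Because $\Delta t$ is approximately $\Delta f / f^2$ for small steps, the same frequency step gives a finer delay resolution at higher test frequencies.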
C. Delay Variation Across the Chip

Fig. 11 shows the surface plot of the CUT delay map estimation on an EP2C35 FPGA. Similar to the finding in [1], there is an observable correlated delay variation on the plot, seen as the curvature of the overall surface. On top of the correlated variation, a clear stochastic delay variation can be seen superimposed on the surface, resulting in the apparent "roughness". Compared with the ring oscillator method in [1], the spatial resolution of the current method is much better: LUTs can now be characterized in pairs instead of chains of 5.

Fig. 8. (a) The failure rate obtained from a PUT containing 2 LUTs. (b) A timing diagram showing the conditions resulting in Regions A, B and C in the failure rate plot. $t_{clk\_s\_p}$ and $t_{clk\_s\_n}$ are the clock-to-Q delays of the Launch Toggle Flip-flop for the positive and negative edges respectively. $t_{Delay\_p}$ and $t_{Delay\_n}$ are the delays of the positive and negative edges through the PUT, where $t_{pos} = t_{clk\_s\_p} + t_{Delay\_p}$ and $t_{neg} = t_{clk\_s\_n} + t_{Delay\_n}$.

Fig. 10. (a) Scatter plot of the half clock cycle (T/2) at 25% failure rate with exponential best fit. (b) Histogram of residuals of the failure rate measurements around the exponential best fit.

Fig. 11. Delay map estimation for the CUT across the entire FPGA based on the half clock period (T/2) at 25% failure rate.

V. PARALLEL BIST METHOD

The method described in the last section allows the delay across any combinatorial circuit to be measured with accuracy. The detailed information the method provides may help manufacturers to gain insights into how they can make useful improvements in the circuit architecture or in the manufacturing process. However, the use of the cumulative failure rate as described is not very efficient, because each FRD is enabled in turn and multiple measurements are made at each frequency in order to build the failure rate characteristic as shown in Fig. 10(a). For the purpose of power-up self-characterization in order to optimize the placement of circuits on a device-by-device basis, a parallel characterization method is needed.

A. Detailed Circuit

The FRD circuit shown in Fig. 4 is modified to the first-fail detection (FFD) circuit shown in Fig. 12(a). The EDC now provides a sticky status output which goes high the first time a timing error is detected and remains high until it is reset. Since the FFD measurement records the first failure point, the estimated delay of the CUT will be more conservative than that calculated from the failure rate measurements. If necessary, this can be compensated for by initially calibrating the first point of failure against the 25% failure point on a small number of CUTs, and then allowing for the appropriate margin in the FFD measurements.

Multiple FFDs (16 in this case) are grouped together to form sectors, which are then arranged in an array as shown in Fig. 12(b). The entire self-characterization sequence is controlled by the on-chip test control FSM. The status of all the FFDs from each sector at each test frequency forms a 16-bit status word which is written to the on-chip block RAM.

Fig. 12. (a) Modified first-fail detector (FFD) circuit; (b) Sector-based parallel BIST system schematic.

For the EP2C35 on the DE2 board, the device is organized into 52 sectors, each containing 16 FFDs. The floorplan of the device is shown in Fig. 13. Part of the device is devoted to the implementation of the FSM and decoder circuits. Characterizing this part of the device is easily achieved by relocating the controlling circuits elsewhere on the device.

Fig. 13. An array of sectors on the Cyclone-II EP2C35; each sector contains an array of 16 FFD test blocks.

With this parallel timing error detection, self-characterization is achieved in approximately 3 seconds, and the entire characterization data, i.e., maps of timing failure on the chip at different frequencies, are stored in 13 kbit of block RAM. The upper bound of the test frequency range is adaptive for each sector, so the test time may be shorter depending on the actual device under test.

To store the results efficiently, the characterization data set is stored as a frequency index map covering every FFD location. Each frequency index points to an output frequency from the TCG that caused the FFD at that particular location to give a failure status. The 13 kbit storage requirement is based on a frequency index size of 16 bits for each FFD location, with 52 sectors and 16 FFDs per sector (i.e. 52 x 16 x 16 bit = 13 kbit).

B. Results
Fig. 14 shows how failure occurs progressively as the test frequency is gradually increased. With the hierarchical organization of the FFDs into sectors, it is possible to detect the worst-case delay of fine-grain FFDs (each involving 2 LUTs), as demonstrated in Fig. 14 (a)-(d). Alternatively, a more coarse-grain, sector-based characterization can be obtained (Fig. 14 (e)-(g)).
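The frequency index map and the two levels of granularity can be sketched in Python as follows. This is an illustrative software model of the data, not the on-chip FSM: the bit-to-FFD assignment within a status word, the function names and the example status words are assumptions, while the sector count, FFD count and 16-bit index size are taken from Section V-A.

```python
from typing import List, Optional, Sequence


def first_fail_indices(status_words: Sequence[int],
                       n_ffd: int = 16) -> List[Optional[int]]:
    """Frequency index at which each FFD of one sector first fails.

    status_words[k] is the sector's status word at frequency index k
    (ascending frequency). Bit i set is assumed to mean FFD i has failed.
    Returns None for FFDs that never fail in the tested range.
    """
    first: List[Optional[int]] = [None] * n_ffd
    for k, word in enumerate(status_words):
        for i in range(n_ffd):
            if first[i] is None and (word >> i) & 1:
                first[i] = k
    return first


def sector_first_fail(ffd_indices: Sequence[Optional[int]]) -> Optional[int]:
    """Coarse-grain view: index at which the slowest FFD of a sector fails."""
    seen = [k for k in ffd_indices if k is not None]
    return min(seen) if seen else None


def storage_bits(n_sectors: int = 52, n_ffd: int = 16,
                 index_bits: int = 16) -> int:
    """Storage needed for one frequency index per FFD location."""
    return n_sectors * n_ffd * index_bits


# Hypothetical sticky status words for a 4-FFD sector at 4 frequencies.
words = [0b0000, 0b0001, 0b0101, 0b0111]
per_ffd = first_fail_indices(words, n_ffd=4)
print(per_ffd, sector_first_fail(per_ffd), storage_bits())
```

Storing one index per FFD location rather than one status word per frequency keeps the result size independent of the number of test frequencies, which is consistent with the 52 x 16 x 16 bit figure quoted above.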
Fig. 14. (a-g) Progressive failure maps of FFDs. Recovered panel labels: (e) 545.3 MHz, (f) 550.0 MHz, (g) 560.0 MHz.
As can be seen in the failure maps, at 515 MHz no timing failure is detected anywhere on the device. Beyond that frequency, failure starts to occur from the right side of the device and gradually moves left. At 560 MHz, all sectors on the device have errors. It can be seen clearly that the failure pattern follows an overall trend caused by the correlated variation but at the same time contains some randomness caused by the stochastic variations.

VI. CONCLUSIONS AND FUTURE WORK

This paper presented a technique to measure the delay time of combinatorial circuits on FPGAs. The measurement includes circuit as well as path delays, and provides a delay time resolution that improves with frequency. This means that the method will track the advancement of technology: as operating speed increases, so will the timing resolution. The method was demonstrated on the EP2C35 FPGA, which was successfully characterized. Although a Virtex-4 FPGA is used to provide the necessary clock generation in this demonstration, this limitation does not apply to more recent FPGAs such as the Stratix-II and Virtex-5.

This work is the beginning of the interesting area of run-time self-characterization of FPGAs. As such, it has many limitations and much potential for future work. Firstly, it can be extended to measure register timing. Modern FPGAs have many embedded blocks such as multipliers and DSP blocks; we have not so far attempted to measure delays in these components. The combinatorial circuits tested in the experiment are very simple, and the delay measurements are far from exhaustive. The method should be tested on much more complicated circuits and applied more extensively to the entire FPGA, even exhaustively. Notwithstanding, this work provides a very useful advancement towards the eventual goal of matching the needs of a design to the individual device's timing characteristics so as to optimize performance.

ACKNOWLEDGEMENTS
The authors would like to acknowledge EPSRC (UK Engineering and Physical Sciences Research Council) for Platform Grant EP/C549481/1, Terasic Technologies Inc. for providing software and hardware support on the USB BLASTER interface on the DE2 Cyclone-II board to enable high speed communication with a PC, Steve Brown from Altera Corp. for the donation of the DE2 board and Patrick Lysaght from Xilinx Inc. for the donation of the ML401 board.

REFERENCES

[1] P. Sedcole and P. Y. K. Cheung, "Within-die delay variability in 90nm FPGAs and beyond," Jun. 2006.
[2] X.-Y. Li, F. Wang, T. La, and Z.-M. Ling, "FPGA as process monitor - an effective method to characterize poly gate CD variation and its impact on product performance and yield," IEEE Trans. Semiconduct. Manufact., vol. 17, no. 3, pp. 267-272, Aug. 2004.
[3] P. R. Menon, W. Xu, and R. Tessier, "Design-specific path delay testing in lookup-table-based FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 5, pp. 867-877, May 2006.
[4] M. B. Tahoori and S. Mitra, "Application-dependent delay testing of FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 553-563, Mar. 2007.
[5] C.-C. Wang, J.-J. Liou, Y.-L. Peng, C.-T. Huang, and C.-W. Wu, "A BIST scheme for FPGA interconnect delay faults," 2005.
[6] P. Girard, O. Heron, S. Pravossoudovitch, and M. Renovell, "High quality TPG for delay faults in look-up tables of FPGAs," IEEE International Workshop on Electronic Design, Test and Applications (DELTA), 2004.
[7] M. Abramovici and C. E. Stroud, "BIST-based delay-fault testing in FPGAs," Journal of Electronic Testing: Theory and Applications, vol. 19, no. 5, pp. 549-558, Oct. 2003.
[8] I. Pomeranz and S. Reddy, "At-speed delay testing of synchronous sequential circuits," Proc. 29th ACM/IEEE Design Automation Conference, pp. 177-181, Jun. 1992.
[9] R. Szplet, J. Kalisz, and R. Szymanowski, "Interpolating time counter with 100 ps resolution on a single FPGA device," IEEE Transactions on Instrumentation and Measurement, vol. 49, no. 4, pp. 879-883, Aug. 2000.
[10] A. Chan and G. Roberts, "A synthesizable, fast and high-resolution timing measurement device using a component-invariant vernier delay line," Proc. International Test Conference, pp. 858-867, Nov. 2001.