Characterisation of FPGA Clock Variability
Pete Sedcole, Justin S. Wong and Peter Y. K. Cheung
Department of Electrical & Electronic Engineering, Imperial College London
South Kensington campus, London SW7 2AZ, UK
Abstract
As integrated circuits are scaled down it becomes difficult to maintain uniformity in process parameters across each individual die. The resulting performance variation requires new design strategies to avoid pessimistic overdesign. A quantified understanding of the contribution different circuit components make to performance variation is a necessary part of such strategies. This paper proposes a technique for quantifying variability in clock skew in FPGAs based on a novel differential delay measurement circuit. The technique is capable of isolating the effects on clock skew from different components in the clock network. Results from a 65nm FPGA show that clock skew variation is significant, being comparable in magnitude to signal path delay variation.

1. Introduction
The fabrication of integrated circuits involves processes and materials that cannot be perfectly controlled. The result is manufacturing variation in devices, where performance and power consumption vary. This occurs both between dice and, more recently, between circuit elements within a single die. This variability is expected to increase as transistor sizes are scaled down [4].

Field-Programmable Gate Arrays (FPGAs), often on the cutting edge of technology scaling, are susceptible to process and material variations, possibly more than other high-performance integrated circuits. Unlike in ASICs, the critical paths of the circuit that the FPGA implements are not known until after fabrication, which results in particularly pessimistic circuit timing. Since variability cannot be eliminated by improving the fabrication process, new design techniques are required that are aware of variability and able to compensate for its effects, such as those proposed in [6]. It is therefore necessary to quantify the levels of performance variation exhibited.

In our previous work, we reported on measurements of logic and routing variation in FPGAs using ring oscillators [5] and delay measurements using at-speed testing techniques [7]. An example of a set of delay measurements made on nominally identical paths using the latter technique is shown in Fig. 1 for a 65nm Virtex-5 XC5VLX50-1 FPGA.

Figure 1. Nominally identical signal paths exhibit different actual delays, as measured on a 65nm Virtex-5 FPGA. (Plot of delay in ps against row and column.)

In this paper, we focus on determining how much of the variation in such measured delays is due to variability in the clock network and registers. It is important to quantify this, as techniques for compensating for variability will be different depending on where the variation occurs.

The effect of process variability on clock trees has been examined in ASIC devices by a number of researchers. These studies include work employing Monte Carlo simulations [3, 10] as well as approaches based on canonical or numerical analysis of the classical H-tree clock structure [1, 2]. To date, there is no published report on the measurement or analysis of clock tree variability in FPGAs.

There are three main contributions presented in this paper: (a) we report, in Section 3, the design and implementation of a built-in self-test technique for measuring differential delay in FPGAs, which compensates for variation in the measurement circuit and is able to achieve a resolution in the order of picoseconds; (b) we show how differential measurements can be used to isolate the contributions to clock skew variability of different components of the clock network; (c) in Section 4, we present and analyse measurements made on a Virtex-5 device to illustrate the method.
Figure 2. The spine-and-branch clock tree structure in a Virtex-4/5 type of device. The device is divided into a number of clock regions. (Diagram labels: central buffer, region buffer, branch buffer, switch block, horizontal spine, branch, clock region.)

Figure 3. The propagation delay through the clock network in a Virtex-5 XC5VLX50-1 device, as reported by the vendor timing tools. (Plot of delay in ns against slice X and slice Y.)
2. FPGA Clock Trees
The clock network in an integrated circuit is generally designed to manage the skew between any two points in the device. A design with zero nominal skew can be achieved by employing the well-known H-tree structure. An FPGA clock network must balance the minimal-skew requirement with sufficient flexibility to implement the clocking requirements of many different circuits. Inevitably, providing this flexibility reduces the symmetry in the clock distribution, which has implications for the sensitivity of the clock to variations.

Clock networks in FPGAs generally come in two flavours. A spine-and-branch approach is typified by the Xilinx Virtex-4 and Virtex-5 devices, and is represented by the diagram in Fig. 2. The clock is distributed on a hierarchical network of linear spines, where each spine taps directly off the higher-level spine. In the Virtex-4 and -5 architectures, all clock regions are of equal size: larger devices have a higher number of separate clock regions.

The Stratix-II and Stratix-III devices from Altera favour a structure that resembles the traditional H-tree design. Although not illustrated here, the structure is again hierarchical: the higher levels of the hierarchy use an H-tree network, which minimises delay differences. At the lower levels, the clock is distributed to rows of logic blocks along linear branches. With this structure the device is divided into clock octants (or sixteen parts for the Stratix-III) regardless of the size of the device.

Although the clock networks in Altera devices are more balanced than those in Xilinx devices, FPGAs from both vendors exhibit definite differences in clock routing delay across the chip. Fig. 3 shows a plot of the clock distribution timings in a Virtex-5 XC5VLX50-1 device as reported by the vendor timing tools. It can be seen that the deterministic clock skew is of the order of 250ps. This would increase in larger devices.

In all cases, the clock network comprises duplicate resources to enable multiple clocks to be distributed throughout the device. A Virtex-5 XC5VLX50 device, for example, has 32 central buffers, each of which drives a separate vertical spine, and each region has 10 horizontal spine and branch lines [9]. Hierarchical levels are connected by some form of crossbar switch.
3. Measuring Variation in Clock Skew
It is desirable to quantify the variability in clock skew. To avoid the cost and limitations imposed by external test equipment, a self-test measurement system can be built from the FPGA fabric resources. Built-in self-test (BIST) circuitry can be designed to accurately measure register-to-register delays, and by replicating the circuitry over the device it is possible to characterise variability in the measured delays. However, extracting the variability caused by the clock network and isolating this from the variation in the signal path logic is a non-trivial task.

This section proposes an approach for isolating sources of stochastic variability in delay measurements. The technique calculates differences in the delay of signal path pairs, and is therefore robust to environmental factors such as changes in temperature and voltage. By making incremental changes in the test circuitry, and comparing the resulting measurements, it is possible to determine the contribution that different parts of the FPGA make to overall register-to-register delay variation. It is emphasised that the objective is to estimate stochastic delay variation; spatially correlated delay variation can be quantified using an array of ring oscillators.

Figure 4. The circuits used in the delay and clock skew measurements. (a) The basic circuit used for measuring the delay of a signal path. (b) The measurement system for detecting clock skew variation. (Diagram labels, (a): launch stage, path under test, capture and compare stage with main and shadow flip-flops, enable, error, clock φ. Diagram labels, (b): launch circuit u, capture circuits v1 and v2, common signal path, common clock path, launch clock, clock source, differential paths p1, p2, c1 and c2.)

We have previously reported on the BIST method and circuitry used to achieve accurate signal path delay measurements between registers [7]. The method is summarised below.
3.1. Measurement circuit
Measuring a signal propagation delay between two registers is achieved using an at-speed test technique combined with a finely-adjustable clock source. The circuit, shown in Fig. 4(a), comprises a launch stage, the path under test (PUT), and a capture-and-compare stage. The launch stage, when enabled, sends a series of signal transitions along the PUT. The capture stage registers the signal half a cycle later (in the ‘main’ flip-flop) as well as a full cycle later (in the ‘shadow’ flip-flop). The two captured values are compared and used to generate an ‘error’ signal. This indicates when the signal propagation delay is longer than half a clock cycle: thus by changing the clock frequency the propagation delay can be estimated.

With $T$ denoting the clock period, the condition for no error to be indicated is given by

$$t_{\mathrm{hold}} < t_{\mathrm{CLK\text{-}Q}} + t_{\mathrm{PUT}} + t_{\mathrm{setup}} < T/2 \qquad (1)$$

and an error is indicated when

$$t_{\mathrm{CLK\text{-}Q}} + t_{\mathrm{PUT}} + t_{\mathrm{setup}} > T/2 \qquad (2)$$
The propagation delay can be estimated by finding the boundary condition between (1) and (2). There are two complicating factors which must be considered. The first is clock jitter, which causes T to change from cycle to cycle. This can be accounted for by measuring the rate of reported error. For a symmetrically distributed clock jitter, the boundary between (1) and (2) will correspond to the case where the error rate reaches 50%. Secondly, the values tCLK-Q, tPUT and tsetup can differ depending on the polarity of the signal transition (i.e., whether it is low-to-high or high-to-low). In most cases, the slowest propagation delay is the quantity of interest. Therefore the slower of the two transitions is identified and used in the above equations.

The delay measurement method requires a clock source whose frequency can be finely adjusted. Modern FPGAs contain highly configurable digital and analogue clock synthesisers which are capable of generating a wide range of frequencies. Run-time dynamic reconfiguration of the Digital Clock Managers (DCMs) in a Virtex-5 has been implemented to produce the clock source for the measurement circuit above. A single DCM is capable of generating 637 unique output frequencies for a given input frequency [9]. In order to increase the frequency resolution, we have cascaded two DCMs; after taking into account the frequency limitations of the Virtex-5 [8], the cascaded pair is able to generate over 21000 unique output frequencies. The mean increment between adjacent frequencies is 0.01%.
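The 50% error-rate criterion described in this section can be illustrated with a short sketch. It assumes the measurement yields a sweep of (clock period, error rate) pairs and that the crossing point can be located by linear interpolation between adjacent points; the function name `estimate_delay_ps`, the interpolation step and the sample numbers are all hypothetical and are not taken from the measurement system itself.

```python
def estimate_delay_ps(sweep):
    """Estimate t_CLK-Q + t_PUT + t_setup from (clock_period_ps, error_rate) pairs.

    Assumption: the error rate falls monotonically as the clock period grows,
    and the 50% point is found by linear interpolation between neighbours.
    """
    points = sorted(sweep)  # ascending clock period
    for (t_lo, e_lo), (t_hi, e_hi) in zip(points, points[1:]):
        if e_lo >= 0.5 >= e_hi and e_lo != e_hi:
            frac = (e_lo - 0.5) / (e_lo - e_hi)
            t_cross = t_lo + frac * (t_hi - t_lo)
            # The main flip-flop samples half a cycle after launch,
            # so the delay at the boundary equals half the clock period.
            return t_cross / 2.0
    raise ValueError("error rate does not cross 50% within the sweep")


# Hypothetical sweep: clock periods in ps and the observed fraction of errors.
sweep = [(7000.0, 1.0), (7100.0, 0.9), (7120.0, 0.6), (7140.0, 0.4), (7200.0, 0.0)]
delay = estimate_delay_ps(sweep)
print(f"estimated delay: {delay:.1f} ps")
```

In practice the sweep would be run for both transition polarities and the slower result retained, as noted above.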
3.2. Differential paths
The technique described above enables precise measurement of register-to-register delays. An estimate of the clock skew from these measurements would be possible, in theory, by eliminating the signal path delay ($t_{\mathrm{PUT}}$). Practically this is not feasible, as the measurement would require an unrealistically high clock frequency: indeed, it is advantageous to increase the $t_{\mathrm{PUT}}$ delay to minimise the noise in the measurement. However, with a large $t_{\mathrm{PUT}}$ the variation in the signal path delay then obscures the subject of interest: variation in the clock network.

In order to isolate the effect of the clock network, a method of delay differencing is proposed, as depicted in Fig. 4(b). The test signal generated by a launch circuit passes through a delay path and is then captured by two separate capture circuits. The difference in the measured delays is then calculated. The divergent parts of the signal and clock paths, labelled $p_1$, $p_2$, $c_1$ and $c_2$, we term the differential path. The fundamental idea of the proposed approach is that by repeating the measurements while manipulating the differential path, information can be extracted on the variation caused by the clock tree.

As well as enabling a lower test clock frequency to be used, the proposed approach has the advantage of removing correlated sources of variation from the measurement. Provided that the capture circuits are not significantly separated, variations in delay in the signal and clock paths due to spatially correlated process variation will cancel out. Moreover, environmental effects such as temperature and voltage fluctuations can be assumed to affect both measured delays similarly, and therefore will also be eliminated. This is highly important when taking measurements of different circuits over a period of several minutes or longer, as changes in ambient temperature could otherwise swamp the effects under study.

Mathematically, the difference measurements can be analysed as follows. The delay through any particular clock or signal path element is modelled as the nominal delay $t_0$ summed with zero-mean random variables, as is the convention in the literature:

$$t_{\mathrm{path}} = t_0 + X = t_0 + X_s + X_r \qquad (4)$$

The variables $X_s$ and $X_r$ represent spatially correlated and purely random variations respectively. With reference to Fig. 4(b), the calculated difference in the two measured delay values is:

$$d = d_0 + [X(p_1) - X(c_1)] - [X(p_2) - X(c_2)] \qquad (5)$$

Note that the correlated variables can be ignored, as $X_s(p_1) \approx X_s(p_2)$ and $X_s(c_1) \approx X_s(c_2)$ if the paths are closely spaced. From here on, $X$ refers to uncorrelated random variation only. Since the four remaining terms are uncorrelated, their variances add, and the variance of the difference is:

$$\mathrm{var}(d) = 2\,\mathrm{var}[X(p)] + 2\,\mathrm{var}[X(c)] \qquad (6)$$

Here, we assume that $\mathrm{var}[X(p_1)] \approx \mathrm{var}[X(p_2)] \approx \mathrm{var}[X(p)]$ and $\mathrm{var}[X(c_1)] \approx \mathrm{var}[X(c_2)] \approx \mathrm{var}[X(c)]$.
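As an informal check of the variance relation (6), the following Monte Carlo sketch draws the random terms of the differential measurement (5) and compares the sample variance of d with the predicted value. The Gaussian distributions, the sigma values and the function name `simulate_var_d` are illustrative assumptions only; the spatially correlated term is idealised as exactly equal for the two closely spaced paths.

```python
import random
import statistics


def simulate_var_d(sigma_p, sigma_c, sigma_s, trials=100_000, seed=1):
    """Monte Carlo estimate of var(d) for the differential measurement of Eq. (5)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        # Spatially correlated terms: identical for the two closely spaced paths.
        xs_p = rng.gauss(0.0, sigma_s)
        xs_c = rng.gauss(0.0, sigma_s)
        # Purely random terms: drawn independently for each path element.
        x_p1 = xs_p + rng.gauss(0.0, sigma_p)
        x_p2 = xs_p + rng.gauss(0.0, sigma_p)
        x_c1 = xs_c + rng.gauss(0.0, sigma_c)
        x_c2 = xs_c + rng.gauss(0.0, sigma_c)
        # The nominal offset d0 is constant and does not affect the variance.
        diffs.append((x_p1 - x_c1) - (x_p2 - x_c2))
    return statistics.pvariance(diffs)


sigma_p, sigma_c, sigma_s = 5.0, 7.0, 30.0  # hypothetical values in ps
measured = simulate_var_d(sigma_p, sigma_c, sigma_s)
expected = 2 * sigma_p**2 + 2 * sigma_c**2  # Eq. (6)
print(f"simulated var(d) = {measured:.1f} ps^2, Eq. (6) predicts {expected:.1f} ps^2")
```

Even though the correlated component is chosen to be much larger than the random components, it cancels in the difference, which is the property the differential measurement relies on.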
3.3. Comparative analysis
Further information can be learnt if a change is made to the test circuit, and the differential measurements are compared before and after the change. If the circuit modification results in a stochastic change in the delay, statistical techniques can be used to analyse the comparative measurement.

Assume a measurement $d$ is taken, and then a change is made to either the signal path or the clock path, and a new differential measurement $d'$ is taken. The covariance of the two differential measurements is:

$$\mathrm{cov}(d, d') = \mathrm{cov}[X(p_1), X(p'_1)] + \mathrm{cov}[X(c_1), X(c'_1)] + \mathrm{cov}[X(p_2), X(p'_2)] + \mathrm{cov}[X(c_2), X(c'_2)] \qquad (7)$$

Each of the covariance terms equates to the variance of the part of the differential clock path or signal path which is the same in both measurements. Therefore, to estimate the variance due to a particular differential path element (for either the signal or clock path), two designs are created in which the differential paths have only the path element of interest in common. The covariance of the differential measurements will equal the variance in delay of the element of interest.
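The covariance argument can be illustrated numerically. In the sketch below, two hypothetical test designs share exactly one element on each side of the differential path, while everything else is redrawn between designs; under that assumption the covariance of the two differential measurements should approach twice the variance of the shared element (one contribution per side). The Gaussian model, the sigma values and the helper names are assumptions made for the illustration.

```python
import random


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def simulate_cov(sigma_common, sigma_other, trials=100_000, seed=2):
    """Covariance of d and d' when only one element per side is kept."""
    rng = random.Random(seed)
    d_first, d_second = [], []
    for _ in range(trials):
        # Element retained in both designs: one instance on each side.
        c1 = rng.gauss(0.0, sigma_common)
        c2 = rng.gauss(0.0, sigma_common)
        # Elements changed between designs: a fresh draw per design and side.
        o1a, o2a, o1b, o2b = (rng.gauss(0.0, sigma_other) for _ in range(4))
        d_first.append((o1a - c1) - (o2a - c2))
        d_second.append((o1b - c1) - (o2b - c2))
    return covariance(d_first, d_second)


sigma_common, sigma_other = 7.0, 8.0  # hypothetical values in ps
cov_est = simulate_cov(sigma_common, sigma_other)
print(f"cov(d, d') = {cov_est:.1f} ps^2, expected {2 * sigma_common**2:.1f} ps^2")
```

The factor of two arises because the shared element appears in both halves of the differential path, which is why a covariance has to be halved to recover the variance of a single element.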
4. Experiments
In this section measurements made on a single Virtex-5 XC5VLX50-1 FPGA are presented and analysed. To verify that the results are statistically meaningful, the measurements should be repeated on a significant number of FPGAs. However, the focus here is on demonstrating a method for extracting clock skew variability from delay measurements, rather than on the actual quantitative values.
4.1. Method
As described above, the contribution of the different elements in a clock network to perceived delay variability can be isolated by computing the covariance of differential measurements made on different test designs. The implementation of incrementally different test designs was achieved partly through changes to implementation constraints (via the Xilinx UCF file). Such constraints are insufficient for fine control of clock and signal routing; for such changes, the implemented test designs were modified directly using simple automated scripts, either using the Xilinx text format (XDL file) or the Xilinx FPGA Editor tool.

The purpose of the experiment is to quantify the variance of clock skew. A sufficiently large sample set is required to calculate the statistical variance. For this purpose, an array of 336 differential path measurement circuits was created in the FPGA under test, spanning 10 clock regions. The expected differential delays of all test path pairs were extracted from the vendor timing tools and compared, to ensure that the observed variance was not due to deterministic sources.

The sources of clock variation are depicted in Fig. 5, and modelled as the following lumped random variables:
– Vertical spine, V-H switch and horizontal buffer: $V$.
– Regional clock wiring, switches and buffers, from the H-spine buffer to the CLB clock mux input: $R$.
– Clock mux: $M_c$.
– Local clock interconnect and register setup time: $L_c$.

Figure 5. A model of the clock and signal routing architecture, illustrating the lumped components of variation. (Diagram labels: clock source, central buffer, V-spine, V-H switch, H-spine, H-branch, switch box, routing channel, and the lumped variables $V$, $R$, $M_c$, $L_c$, $M_p$, $L_p$.)

The two main capture registers were located within the same logic block (CLB). Therefore the differential signal path only diverges within the CLB. Thus, $p_1$ and $p_2$ differ in the signal multiplexers and local interconnect only. These are modelled by variables $M_p$ and $L_p$.

Two options were explored for the clock signals for each of the two main capture registers. One case used a common clock signal, which diverged only within the CLB. In the second, separated clock scenario, the clock signal was routed through two separate central clock buffers. In the latter case $c_1$ and $c_2$ can differ substantially, corresponding to different central buffer, spine and branch routing in addition to differences within a CLB.

To determine the relative effect of each of the possible sources of delay variability, the following items were explored, either separately or in combinations:
– External routing of the common signal path. The routing of the signal path outside the CLB is changed, which affects the routing through the signal mux but not the signal routing local to the CLB.
– Internal (local) routing of the signal path. Internally to a logic element the differential signal path can be routed through a LUT or directly to the input of the capture flip-flop.
– The horizontal spine and branch routing used. The routing of a clock within each region is changed to a different horizontal spine, which also changes the branch routing and the use of the clock mux.
– The placement of the central clock buffer for one of the two clocks. The central buffer used for one of the test clocks is changed, while the clock routing for the horizontal spines and branches is kept constant.

Table 1. Summary of initial measurements.
Quantity                                Ring Osc.   Reg.-reg.
Path length                             5 LUTs      5 LUTs
Mean path delay                         3537ps      3559ps
Mean stage delay                        707ps       712ps
Est. correlated var. (max)              6.82%       6.88%
Residual var. in path (std. dev.)       0.70%       1.08%
Residual var. per stage (std. dev.)     11.0ps      -

Figure 6. The differential measured delay for one test. (Plot of delay difference in ps against row and column.)
4.2. Results
To begin with, an array of ring oscillators was placed on the FPGA to find the base level of logic delay variability, as per the method described in [5]. A second measurement was performed using the register-to-register method of [7] (see Fig. 1). Correlated variation is modelled as a two-dimensional quadratic in both cases. A summary of the initial results is listed in Table 1. Note that although the measured mean delay per stage appears very similar, this is coincidental, as the signal routing and LUT programming were substantively different in the two cases.

The initial measurements indicate that the stochastic variation exhibited by the register-to-register measurements is significantly higher than in the ring oscillator based test. However, variation which results from clock skew variability will depend on the differences in wires and buffering used to route the clock signal. In other words, the skew variation between registers is highly dependent on the relative placement of the registers.

The noise in the measurement was estimated from consecutive measurements made on the same design. The repeatability of the measurements is high, the standard error being estimated as 1.07ps.

An example differential measurement using common clock routing is plotted in Fig. 6. There is no significant spatially correlated variability in the differential measurement. The variance of the differential delay, calculated from the average of six such measurements, was 236.8 ps$^2$.

Following the procedure outlined above, the measurements were repeated after changes were made to the signal and clock paths. A summary of the experiments is given in Table 2. The original common-clock differential measurement is listed as test 1. For each common clock signal experiment test i, the covariance with the test 1 measurement is given, with a list of the variables that are common between test i and test 1. Note that these variables appear in both sides of the differential paths (i.e., in both $p_1$ and $p_2$ or $c_1$ and $c_2$): therefore the covariance value includes each of the variables twice.

Table 2. Comparison experiments.
Test  Description of changes                           Common variables                          Cov. (ps$^2$)
Common clock signal
1     None                                             $2M_c$, $2L_c$, $2M_p$, $2L_p$            236.8
2     External signal routing                          $2M_c$, $2L_c$, $2L_p$                    202.7
3     Internal signal routing                          $2M_c$, $2L_c$                            137.2
4     Regional clock routing                           $2L_c$, $2M_p$, $2L_p$                    209.3
5     Regional clock routing and ext. signal routing   $2L_c$, $2L_p$                            163.3
6     Regional clock routing and int. signal routing   $2L_c$                                    102.6
Separated clock signals
7     None                                             $2R$, $2M_c$, $2L_c$, $2M_p$, $2L_p$      256.9
8     Regional clock routing for one clock             $R$, $M_c$, $2L_c$, $2M_p$, $2L_p$        228.7

Next, experiments were run with the separated clock signals. In these tests, one of the two clock paths, $c_1$, used a buffer at location X0Y3 and was routed on regional clock resource number 4. In the first two of these experiments (tests 7 and 8) the second clock path $c_2$ used a buffer at location X0Y4. The regional clock resource used for the second clock path was varied.

As predicted by the model, the differential delays have regional bias due to the central clock resources $V$. The average regional bias in the differential measurements is plotted in Fig. 7. To estimate the variation due to $R$, the regional clock routing excluding the central clock resources, the regional bias was subtracted from the measured results for tests 7 and 8 before calculating the variance (for test 7) or covariance (between tests 7 and 8).

Figure 7. The mean offset in measured differential delay for each clock region. The separate clock paths ($c_1$, $c_2$) are routed on H-spine resources (m, n) in regions 3 to 10. (Plot of mean delay offset in ps against region enumeration, with one series for each of (m, n) = (4,5), (4,6), (4,7), (4,8), (4,9).)

The variance due to each of the sources can be found by solving a regression equation formed from Table 2: the standard deviations of ($R$, $M_c$, $L_c$, $M_p$, $L_p$) are thus found to be (2.8, 4.1, 7.2, 4.4, 5.6) picoseconds respectively. Note that the stochastic skew between registers within the same clock region will have variance $\mathrm{var}(R) + 2\,\mathrm{var}(M_c) + 2\,\mathrm{var}(L_c)$, which corresponds to a standard deviation of 12.0ps. This is similar to the stochastic variation in delay of a single LUT stage.

Test 8 and Fig. 7 provide some insight into the variation due to the central clock resources. Further insight is gained in the last set of experiments, which investigated the effect of changing the central clock buffer used by clock path $c_2$. This also changes the vertical spine and V-H switch used in $c_2$, while all other variables are kept constant. An example of the relative change in measured differential delay caused by changing the clock buffer for $c_2$ is shown in Fig. 8. Here, the regional offset caused by the different delays in the vertical spines and V-H switches is clearly apparent. The mean offset in differential delay for various buffer placements is plotted in Fig. 9.

Figure 8. Relative change in measured differential delay due to moving the central clock buffer from location X0Y4 to X0Y11. (Plot of delay difference in ps against row and column.)
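The regression over Table 2 mentioned above can be sketched as follows. The paper does not state which regression procedure was used, so this example assumes an ordinary, unweighted least-squares fit, solved through the normal equations with a small Gaussian elimination routine; the helper names `solve_linear` and `least_squares` are hypothetical. Each row encodes how many times the variance of ($R$, $M_c$, $L_c$, $M_p$, $L_p$) appears in the measured covariance.

```python
import math

# Table 2 rows: multiplicities of var(R), var(Mc), var(Lc), var(Mp), var(Lp)
# and the measured covariance in ps^2.
TABLE2 = [
    ((0, 2, 2, 2, 2), 236.8),
    ((0, 2, 2, 0, 2), 202.7),
    ((0, 2, 2, 0, 0), 137.2),
    ((0, 0, 2, 2, 2), 209.3),
    ((0, 0, 2, 0, 2), 163.3),
    ((0, 0, 2, 0, 0), 102.6),
    ((2, 2, 2, 2, 2), 256.9),
    ((1, 1, 2, 2, 2), 228.7),
]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(rows):
    """Unweighted least squares via the normal equations (A^T A) x = A^T y."""
    n = len(rows[0][0])
    ata = [[sum(c[i] * c[j] for c, _ in rows) for j in range(n)] for i in range(n)]
    aty = [sum(c[i] * y for c, y in rows) for i in range(n)]
    return solve_linear(ata, aty)


variances = least_squares(TABLE2)
std_devs = [math.sqrt(v) for v in variances]
# Skew between registers in the same region: var(R) + 2 var(Mc) + 2 var(Lc).
same_region_skew = math.sqrt(variances[0] + 2 * variances[1] + 2 * variances[2])

for name, sd in zip(("R", "Mc", "Lc", "Mp", "Lp"), std_devs):
    print(f"sigma({name}) = {sd:.2f} ps")
print(f"same-region skew sigma = {same_region_skew:.2f} ps")
```

Under this assumption the fitted standard deviations come out within about 0.1ps of the values quoted above, and the same-region skew within about 0.1ps of 12.0ps. With eight measurements and five unknowns the system is overdetermined, so a least-squares fit also averages out some of the measurement noise.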
Figure 9. The mean offset in measured differential delay for each clock region for different choices of buffer location. The clock routing resources used within each region are kept constant. (Plot of mean delay offset in ps against region enumeration, with one series for each buffer location X0Y4, X0Y9, X0Y10, X0Y11, X0Y12, X0Y13.)

While there are too few data points in these last tests to calculate meaningful statistics, it can be observed that the central clock spine, V-H switch and H-spine buffer are significant contributors to clock skew variability between regions. From Fig. 7, it can be seen that the relative skew between regions 6 and 9 could change by up to 78ps in this case, depending on the resource used to route the clock in each region. To put this in perspective, the stochastic delay variation measured for one LUT stage of logic has a standard deviation σ of 11.0ps. Therefore, a 78ps change in skew is equivalent to the ±3.5σ range in delay for one LUT. In Fig. 9, the use of a different central buffer can result in a change in relative skew between regions 1 and 2 of up to 67ps, equivalent to the ±3.0σ range of delay for one LUT.

5. Conclusion
Process variation increasingly causes performance variation in integrated circuits, including FPGAs. It is important to quantify the relative contributions of different circuit elements to performance variation in order to create design strategies that compensate for variability.

This paper proposed a technique of differential path delay measurement, which is robust to the effects of variation in the measurement circuitry. Moreover, it was shown how an array of differential delay measurements can be combined with incremental changes to the test circuitry and simple covariance calculations to isolate the contributions of circuit components to measured delay variation.

The proposed technique and analysis have been used to quantify the delay variation caused by different components of the clock network in a 65nm Virtex-5 FPGA. The level of clock delay variation was shown to be significant compared with the variation exhibited by the logic fabric.

Acknowledgements
This work has been supported by the EPSRC through Platform Grant EP/C549481/1.

References
[1] A. Agarwal, V. Zolotov, and D. T. Blaauw. Statistical clock skew analysis considering intradie-process variations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(8):1231–1242, Aug 2004.
[2] M. Hashimoto, T. Yamamoto, and H. Onodera. Analysis of clock skew variation in H-tree structure. In Proc. IEEE International Symposium on Quality Electronic Design, 2005.
[3] V. Mehrotra and D. Boning. Technology scaling impact of variation on clock skew and interconnect delay. In International Interconnect Technology Conference, 2001.
[4] S. R. Nassif. Design for variability in DSM technologies. In Proc. IEEE International Symposium on Quality Electronic Design, 2000.
[5] P. Sedcole and P. Y. K. Cheung. Within-die delay variability in 90nm FPGAs and beyond. In Proc. IEEE International Conference on Field Programmable Technology, 2006.
[6] P. Sedcole and P. Y. K. Cheung. Parametric yield in FPGAs due to within-die delay variations: A quantitative analysis. In Proc. ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2007.
[7] J. S. Wong, P. Sedcole, and P. Y. K. Cheung. Self-characterization of combinatorial circuit delays in FPGAs. In Proc. IEEE International Conference on Field Programmable Technology, Dec. 2007.
[8] Xilinx Inc. Virtex-5 Data Sheet: DC and Switching Characteristic, July 2007.
[9] Xilinx Inc. Virtex-5 User Guide v3.0, February 2007.
[10] S. Zanella, A. Nardi, A. Neviani, M. Quarantelli, S. Saxena, and C. Guardiani. Analysis of the impact of process variations on clock skew. IEEE Transactions on Semiconductor Manufacturing, 13(4):401–407, Nov 2000.