Design of Resonant Global Clock Distributions

Steven C. Chan, Kenneth L. Shepard, and Phillip J. Restle†
Department of Electrical Engineering, Columbia University, New York, NY
†IBM T.J. Watson Research Center, Yorktown Heights, NY

Abstract

This paper presents a new approach to global clock distribution in which traditional tree-driven grids are augmented with on-chip inductors to resonate the clock capacitance at the fundamental frequency of the clock node. Rather than being dissipated as heat, the energy of the fundamental resonates between electric and magnetic forms. The clock drivers must only provide the energy necessary to overcome losses. As a result, power reduction of over 80% is possible, depending on the Q of the resonant system. Clock latency is also improved because the effective capacitance of the grid is lower, and fewer buffer stages are necessary to drive the grid. Skew and jitter reductions come about because of this reduced buffer latency.
1 Introduction

Clocking large digital chips with a single high-frequency global clock is becoming an increasingly difficult task. There are two main issues involved: skew/jitter and power.

Skew/jitter. Relative to the clock cycle time, clock latency is increasing, becoming larger in many cases than the cycle time. This increasing latency is due to two main factors. First, interconnect bandwidth is not scaling as fast as gate delays, resulting in longer wire delays as a fraction of cycle time for chips of equivalent size. Second, the capacitive loading of the clock distribution is generally increasing, which requires more levels of buffering to build up adequate gain to achieve slew time targets of generally no more than 10-20% of system cycle time. Clock designers have become increasingly skilled in matching nominal silicon and wire delays to achieve very-low-skew balanced distributions[1]. Nevertheless, increasing clock latency in distributing the global clock from a single synchronizing source leads directly to problems with uncompensated skew and jitter, primarily due to intrachip variability and power-supply noise coupling through buffers.

Power. While jitter and skew have traditionally been the dominant concerns in clock design, particularly in the highest-performance microprocessors, power may soon gain primacy. The 72-W 600-MHz Alpha processor[2] dissipates more than half of its power in the clock distribution. Clock capacitance and frequency are increasing with each technology generation, offset only slightly by decreasing power supply voltages, resulting in exploding dynamic power dissipation. Every cycle, the entire clock capacitance is charged to the supply voltage and this charge is subsequently dumped to ground, with all the stored energy lost as heat.

The work at Columbia University was supported by an IBM PhD Fellowship, the National Science Foundation under grant CCR-00-86007, by the MARCO C2S2 Center, by the SRC, and by gifts from IBM and Intel.
Proceedings of the 21st International Conference on Computer Design (ICCD’03) 1063-6404/03 $ 17.00 © 2003 IEEE
In Section 2, we review recent approaches to these issues in globally synchronous design. Section 3 presents our novel approach to global clock distribution based on a global resonant grid. Time-domain and frequency-domain simulation results are presented in Section 4. Section 5 concludes and offers directions for future work.
2 Approaches for global clock distribution

2.1 Grid and tree clock networks

Most of the work in clock distribution has been focused on addressing the issues of skew and jitter. There are two general approaches to clock wiring: trees and grids. Tunable trees consume less wiring and, therefore, represent less capacitance, lower wiring track usage, lower power, and lower latency. Trees must, however, be carefully tuned, and this tuning is a very strong function of load. Grids, in contrast, can represent large capacitances and significant use of wiring resources, but provide relative load independence by connecting nearby points directly. This latter property has proven irresistible, and most recent global clock distributions in high-end microprocessors utilize some sort of global clock grid.

Early grid distributions were driven by a single effective global clock driver positioned at the center of the chip[3].¹ Most modern distributions use a balanced H-tree to build up and distribute the gain required to drive the grid. The grid drive points are distributed across the entire chip, rather than being concentrated at a single point; this means that the grid can be less dense than a grid that is driven in a less distributed fashion, resulting in less capacitance and less consumption of wiring resources. The shunting properties of the grid then help to cancel out skew and jitter from imperfections in the tree distribution, as well as balance out uneven clock loads.

The actual clock load (all the latches in the design) is usually far too large to be directly driven from the global grid. Additional levels of buffering from the clock grid provide the local clocks to the latches. The partitioning of the gain stages between global and local differs between microprocessor implementations. The Alpha[2] seems to build up more gain in driving a larger global clock distribution, while in Power4[1], more of the gain is relegated to the local distribution.
2.2 Resonant clocks

The virtues of LC-type oscillators for achieving lower power and better phase stability than oscillators based on delay elements have long been recognized. The adiabatic logic community has already considered the importance of resonant clock generation, since the clocks are used to power the circuits and such resonance is fundamental to the energy recovery[4, 5, 6].²

To combine the clock generation and distribution, distributed LC oscillators in the form of transmission line systems have been considered.³ In salphasic clock distribution[8], a sinusoidal standing wave is established in an unterminated transmission line. As a result, each receiver along the line receives a sine wave of identical phase but different amplitude, with periodic nulls appearing throughout the distribution. Coupled standing-wave oscillators of this type are used in [9] to distribute a high-frequency clock across a chip. On-chip transmission lines are very lossy. As a result, active negative resistance elements must be distributed throughout the grid. Another similar approach uses travelling waves in a set of coupled transmission line rings[10], driven by periodically distributed cross-coupled inverters. The propagation time around the rings determines the oscillation frequency, and different points around the ring have different phases.

Travelling and standing wave clock distributions require very regular clock network structures and must contend with either varying phase or varying amplitude. Neither approach has, as yet, shown a power advantage for clock distribution.

¹ In the 200 MHz Alpha, the single clock driver is actually implemented as 145 inverters shorted together at the driver point. A five-deep binary fanning tree is used to build up the gain to drive the final clock driver.
3 Resonant clock grids

Our approach is to instead engineer a resonant clock network that operates more like a "conventional" clock distribution, providing uniform phase and amplitude across the entire chip. Our resonant network augments the traditional tree-driven grid with a set of on-chip spiral inductors. The large clock capacitance then resonates with this inductance. This approach promises to significantly reduce the power necessary to drive the grid, since the energy of the fundamental resonates back and forth between electric and magnetic forms rather than being dissipated as heat. Consequently, the clock drivers must only supply the energy needed to overcome losses at the fundamental. Furthermore, because the effective capacitance of the clock network is dramatically reduced, the number of gain stages and the associated latency required to drive the clock are reduced as well, resulting in considerable improvement in skew and jitter.

We designed a resonant clock network in a TSMC 0.18 µm, six-level-metal, mixed-signal CMOS technology. The design is presently being fabricated; simulation results are presented in Section 4. Our clock network, shown in Figure 1, might represent one sector in a much larger microprocessor clock distribution containing several dozen such sectors. There is a single clock driver, which consists of a chain of inverters, at the root of an H-tree. The four spiral inductors have one end attached to the clock tree and the other end attached to a large decoupling capacitance, which establishes a mid-rail dc voltage around which the grid oscillates. These thin-oxide capacitors are positioned adjacent to the spiral inductors, as shown in Figure 1. The total load of the clock tree and mesh, which occupy an area of approximately 2,500 µm × 2,500 µm, is approximately 9.5 pF.

The clock tree and grid occupy the top two metal layers, M6 and M5. In order to minimize resistance and inductance in the clock lines, each wire in our distribution is shielded, and each line is split into multiple fingers. Figure 2 shows the shielding and fingering of the clock tree wires on M6 and M5. We use 16 µm wide segments, spaced 4 µm apart. For the M6 and M5 clock grid wires, we use a single 16 µm wide segment surrounded by two 8 µm ground shields spaced 4 µm apart.

[Figure 1 labels: buffer chain, clock tree, clock grid, inductor, decap, point A]

Figure 1: Resonant clock network consists of a buffer chain driving a clock tree and grid. Spiral inductors are attached at four points in the tree. Decoupling capacitors are attached to the other end points of the spiral inductors.

[Figure 2 labels: GND, CLK]

Figure 2: Fingering and shielding of clock tree wires on M6/M5.

² These generators generally produce sinusoidal or near-sinusoidal clock waveforms.
³ These also bear resemblance to distributed oscillators[7].
A simple lumped circuit representation⁴ of the system is shown in Figure 3, where $C_{clock}$ is the capacitance of the clock, $C_{decap}$ is the clock decoupling capacitance, $R_{ind}$ and $R_{cap}$ are the resistances associated with losses in the clock network, and $R_{driver}$ is the effective resistance of the driver. $C_{decap}$ must be chosen large enough to ensure that $f_{decap} = 1/(2\pi\sqrt{L C_{decap}})$ is much less than the desired clock resonance frequency ($f_{clock}$). For $f \gg f_{decap}$ (and $C_{decap} \gg C_{clock}$), the driving point admittance of the clock network is given by:

$$Y_{driver} = j\omega\left(C_{clock} - \frac{1}{L\omega^{2}}\right) \qquad (1)$$

At $f_{clock} = 1/(2\pi\sqrt{L C_{clock}})$, the capacitive reactance of the clock load is cancelled by the inductive reactance.

The circuit topology of Figure 3 is similar to that of a buck converter[11], a resonant circuit commonly used for dc-dc power conversion. In a buck converter, the duty cycle of the clock node (node 1 in Figure 3, shunted by the snub capacitor $C_{clock}$) would be varied, and a different filtered average voltage would appear across $C_{decap}$ (node 2 in Figure 3). The amount of ripple depends on the ratio $f_{clock}/f_{decap}$, which we find in practice must be at least three to realize the benefits of resonant clocking. This corresponds to a decoupling capacitance that is approximately ten times greater than the clock capacitance; such a quantity of decoupling capacitance is no more than what is typically required on the power-ground network in the case of non-resonant clocking to prevent more than a 10% collapse of the supply rails during clock switching.

In this application, the spiral inductors exist in an environment quite different from that of typical RF applications. Specifically, the inductors are embedded in the metal-rich environment of a digital integrated circuit. Careful attention must be paid to limit eddy current losses due to neighboring wires; this is important both to prevent Q degradation and to prevent inductive noise in the power-ground distribution or in neighboring signal lines. Because the spiral inductors are much larger than the power grid, most of the potential deleterious coupling will be to the underlying power grid. To impede eddy current formation in the underlying grid, the vias in the grid are dropped and small cuts are made in the wires, analogous to the ground plane laminations used for spiral inductors in RF circuits[12].

By virtue of the spiral inductors, and assuming the grid itself is designed to have low inductance, we have engineered an eigenmode of the grid in which it rigidly oscillates as a contiguous unit with an $f_{clock}$ resonance; this is in contrast to the standing or travelling wave patterns that characterize the clock distributions described in Section 2. By making the grid low inductance, we deliberately push other resonances associated with the distribution to high frequencies so that they do not interfere with the dominant engineered resonance.

Further intuition can be derived with a mechanical analogy, in which the inductors (of inductance $L$) correspond to springs (with spring constant $k$) and capacitance ($C$) corresponds to mass ($m$). The smaller the inductance, the larger the spring constant of the associated spring ($L \propto 1/k$). Because the clock grid itself has very low inductance, the grid is rigid, but is "suspended" on a set of springs corresponding to the explicit inductance built into the clock grid. The grid amplitude corresponds to the voltage on the grid and is centered at $V_{DD}/2$, the voltage of the clock supply decoupling capacitors. Additional masses attached to the grid correspond to grid loading (in addition to the mass of the grid itself). The grid then oscillates (up and down), driven by the clock drivers. The springs, however, do the bulk of the work in driving the grid, storing energy when going down and delivering it to the load when coming up.

[Figure 3 labels: $v_{driver}$, $R_{ind}$, $R_{cap}$, $C_{decap}$, $C_{clock}$, nodes 1 and 2]

Figure 3: Simple lumped circuit model of the resonant clock distribution.

⁴ A distributed model, such as that used in Section 4, is necessary for detailed understanding of the system.

4 Simulation results
We have performed an RLCK partial-element equivalent circuit (PEEC) model extraction of the clock distribution network described in Section 3 using Cadence's Assura RCX-PL extractor and simulated the extracted netlists in Cadence's Spectre simulator. PEEC models are tantamount to a quasi-static solution of Maxwell's equations for the network. Each of the spiral inductors as extracted has an approximate inductance of 9 nH. This results in an $f_{clock}$ value of approximately 1.1 GHz. Each of the four decoupling capacitances is sized to be 60 pF, consuming about 150 µm × 150 µm of chip area.
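As a rough cross-check of these numbers, the lumped resonance formulas from Section 3 can be evaluated directly. The sketch below is not the extraction flow used in the paper; it assumes that the four 9 nH spirals act in parallel against the 9.5 pF clock load and that the four 60 pF decoupling capacitors likewise add in parallel. The function and variable names are illustrative.

```python
import math


def resonant_frequency(inductance_h, capacitance_f):
    """Resonance of an ideal LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


# Values quoted in the text.
L_SPIRAL = 9e-9        # inductance of each spiral (H)
N_SPIRALS = 4          # spirals attached to the tree
C_CLOCK = 9.5e-12      # total clock tree and mesh load (F)
C_DECAP_EACH = 60e-12  # each decoupling capacitor (F)

# Assumption: in the lumped model the spirals combine in parallel,
# and so do the decoupling capacitors.
l_eff = L_SPIRAL / N_SPIRALS
c_decap = C_DECAP_EACH * N_SPIRALS

f_clock = resonant_frequency(l_eff, C_CLOCK)
f_decap = resonant_frequency(l_eff, c_decap)
ratio = f_clock / f_decap

print(f"f_clock ~ {f_clock / 1e9:.2f} GHz")
print(f"f_decap ~ {f_decap / 1e6:.0f} MHz")
print(f"f_clock / f_decap ~ {ratio:.1f}")
```

Under these assumptions the estimate lands at about 1.09 GHz, close to the 1.1 GHz quoted from the extracted network, and the ratio of about five sits above the minimum of three discussed in Section 3.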
4.1 Driving point admittance

[Figure 4 plot: magnitude of driving point admittance (S) versus frequency (Hz) on logarithmic axes, with curves labeled Resonant and Nonresonant and point A marked.]
Figure 4: The driving point admittance of the resonant and non-resonant clock distribution networks. The point A corresponds to the fundamental resonance defined by the spiral inductors and clock capacitance.

In Figure 4, we first compare, in the frequency domain, the driving point admittance of the clock distribution (from the point at which the central clock driver is attached to the H-tree) for resonant and non-resonant networks. In the non-resonant case, we have open-circuited the inductors in the distribution of Figure 1, but have otherwise kept the tree and grid the same. It is important to note that any real application of resonant clocking will have to support returning to the non-resonant case to allow for low-frequency test. The resonance at A corresponds to the desired clock frequency $f_{clock}$. We estimate the quality factor of this resonance ($Q = f_{clock}/BW$) to be about 8. Other resonances in the system, probably corresponding to standing wave patterns, are evident at frequencies beyond 10 GHz.

These admittance curves demonstrate that, for a given driver strength, it is far easier (lower admittance) to drive the resonant clock network at $f_{clock} = 1.1$ GHz than it is to drive the non-resonant network. This will translate to the power, jitter, and skew advantages observed in the time domain.
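The same comparison can be illustrated with the lossless lumped expression of Equation (1). The sketch below is only a first-order illustration, not the PEEC result of Figure 4: it ignores $R_{ind}$, $R_{cap}$, and all distributed effects, and it assumes an effective inductance of four 9 nH spirals in parallel. The function names are hypothetical.

```python
import math

L_EFF = 9e-9 / 4   # assumption: four 9 nH spirals in parallel (H)
C_CLOCK = 9.5e-12  # clock load quoted in Section 3 (F)


def admittance_resonant(freq_hz, l=L_EFF, c=C_CLOCK):
    """Lossless Equation (1): Y = j*omega*(C - 1/(L*omega^2))."""
    omega = 2.0 * math.pi * freq_hz
    return 1j * omega * (c - 1.0 / (l * omega ** 2))


def admittance_nonresonant(freq_hz, c=C_CLOCK):
    """Inductors open-circuited: only the clock capacitance remains."""
    omega = 2.0 * math.pi * freq_hz
    return 1j * omega * c


f = 1.1e9
y_res = abs(admittance_resonant(f))
y_non = abs(admittance_nonresonant(f))

# Bandwidth implied by Q = f_clock / BW with the estimated Q of about 8.
bandwidth_hz = f / 8.0

print(f"|Y| resonant     ~ {y_res * 1e3:.2f} mS")
print(f"|Y| non-resonant ~ {y_non * 1e3:.2f} mS")
print(f"bandwidth for Q = 8 ~ {bandwidth_hz / 1e6:.1f} MHz")
```

Because the model is lossless, the resonant admittance it predicts near 1.1 GHz is lower than anything a real network with finite Q would show; the point is only that the inductive term cancels most of the capacitive susceptance.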
[Figure 5 plot: magnitude of the transfer function versus frequency (Hz) on logarithmic axes. Panel (a): non-resonant network with strong and weak drivers. Panel (b): resonant network with strong and weak drivers, with points A and B marked.]
Figure 5: Magnitude of the transfer function from the driver to point A in the clock distribution.

4.2 Response of clock network

Before considering skew, jitter, and power issues, it is instructive to consider (in both the time and frequency domains) the response of both a resonant and a non-resonant clock network to different driver sizes. We consider two different driver widths, 700 µm (strong driver) and 87.5 µm (weak driver). In the frequency-domain analysis, we linearize these to driver resistances of 3.5 Ω and 28 Ω, respectively.⁵ We consider the clock network response at point A in Figure 1. In Figure 5, we compare the magnitude of the transfer function from the driver to this point on the clock grid for both the resonant and non-resonant cases. The corresponding transient responses are shown in Figure 6 for square-wave driver-input waveforms with a slew time of 100 psec.

In the non-resonant case with the weaker driver, there is inadequate bandwidth to distribute the clock (Figure 5(a)); in the time domain, the clock is unable to actually swing full rail (Figure 6(a)). To support slews of 200 psec for a 1.1 GHz clock requires a bandwidth of at least $0.5/t_r = 2.5$ GHz. With the larger driver, the −3-dB bandwidth improves to 3 GHz and the time-domain response shows a more "square-wave" behavior. With the larger driver, more energy is also transferred to the higher-order resonances in the distribution beyond 10 GHz. This probably manifests in the time domain as the slight overshoot and undershoot in the response.

In the resonant case, point A in Figure 5(b) corresponds to the $f_{decap}$ resonance, which in this case is set to a factor of five lower than the $f_{clock}$ resonance at B. As in the non-resonant case, the stronger driver also excites more high-frequency components, including the higher-frequency resonances beyond 10 GHz. The time-domain behavior shown in Figure 6(b) shows the sinusoidal "natural" response of the resonant network pumped by the action of the driver. The waveform begins to come down as energy is transferred from the clock capacitance to the spiral inductors (with charge transferred from the clock capacitor to the decoupling capacitor) until the point at which the clock driver completes the transition by actively driving the clock node down. Similarly, on the rising edge, the inductor begins charging the clock capacitance (by transferring charge from the decoupling capacitance) until the point at which the clock driver actively pulls the clock node up.

The Q of the resonance has an important impact on the quality of the results. When the Q is higher, the drivers can be made weaker since there is less loss that must be overcome at the fundamental. This results in a more sinusoidal distribution, closer to the natural oscillations of the resonant system, and more power savings. When the Q is poor, the drivers must be larger to overcome these losses. More power is burned in the distribution not only because more energy must be provided at the fundamental to overcome losses, but also because lossy higher-frequency components will be driven into the clock network by the drivers.

From this discussion, it is clear that resonant clock distributions favor the distribution of more sinusoidal clocks. Clock frequencies are increasing faster than device speeds (as measured by $f_T$) and wire bandwidths are scaling; clock periods are becoming a decreasing number of FO4 delays in a given technology. The result is that it is increasingly difficult to get fast slew rates, and clock waveforms are becoming increasingly sinusoidal, even in non-resonant distributions. A resonant scheme provides the basis for a "naturally" sinusoidal clock distribution.

⁵ Our technology has an nFET $f_T$ of approximately 50 GHz with a fanout-of-four delay of approximately 68 psec.

4.3 Skew and jitter reduction

[Figure 6 plot: clock waveform (V) versus time, from 8.0e−09 to 1.0e−08. Panel (a): non-resonant network with strong and weak drivers. Panel (b): resonant network with strong and weak drivers.]
Figure 6: Transient response of the clock network at point A. The square-wave driver input waveform has a slew of 100 psec.

Skew and jitter in real clock distribution networks come about because of spatial and temporal variation, respectively, in the clock latency. A significant component of skew and jitter is variation in the latency of the buffering stages needed to drive the large capacitive load of the clock network. Intrachip variability is a significant source of skew, while power-supply noise, coupled through the buffers, is a significant source of jitter. Resonant clock distributions have the potential to significantly reduce this component of clock latency because of the reduction in the number of gain stages required to drive the load. This should bring reductions in skew and jitter to the clock network. Furthermore, the reduced size of the final driver means that it is delivering less energy to the clock distribution, reducing the effect of power-supply noise on this stage as well.

To explore this last effect, we examined the cycle-to-cycle jitter induced in the clock network by power-supply noise acting on the clock driver. In Figure 7, we show the results for power-supply noise having a 180-mV, 100-MHz square wave characteristic. The peak-to-peak jitter is reduced from 9 psec for the non-resonant, strong-driver case to 2 psec for the resonant, weak-driver case.
[Figure 7 plot: cycle-to-cycle jitter (psec, from −10.0 to 10.0) versus cycle number (0 to 20) for the resonant, weak-driver and non-resonant, strong-driver cases.]

Figure 7: Cycle-to-cycle jitter in the presence of power supply noise in the form of a 180-mV, 100-MHz square wave.
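For readers who want to reproduce this kind of measurement from simulated waveforms, the sketch below shows one common way to compute cycle-to-cycle jitter from a list of rising-edge crossing times. The paper does not spell out its exact post-processing, so the definition used here (the difference between consecutive clock periods) and the synthetic edge times are assumptions for illustration only.

```python
def cycle_to_cycle_jitter(edge_times):
    """Differences between consecutive clock periods.

    edge_times: rising-edge crossing times in any consistent unit.
    """
    periods = [t1 - t0 for t0, t1 in zip(edge_times, edge_times[1:])]
    return [p1 - p0 for p0, p1 in zip(periods, periods[1:])]


def peak_to_peak(values):
    """Spread between the largest and smallest value."""
    return max(values) - min(values)


# Synthetic edge times in picoseconds (made up, not simulation data):
# periods alternate between 900 ps and 910 ps.
edges_ps = [0, 900, 1810, 2710, 3620]
jitter_ps = cycle_to_cycle_jitter(edges_ps)

print(jitter_ps)
print(peak_to_peak(jitter_ps))
```

In practice the edge times would come from threshold crossings of the simulated clock waveform at the point of interest.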
4.4 Power versus frequency

One of the most significant benefits of resonant clocking is the potential power savings. In Figure 8, we compare the power dissipated in driving the clock for the strong-driver, non-resonant case and the weak-driver, resonant case. The non-resonant power scales linearly with frequency. The resonant power is fairly constant, with better-than-80% power savings at the desired resonance frequency of 1.1 GHz.

To minimize energy dissipation at the fundamental, there might be some need to tune the grid resonance to the clock frequency with MOS capacitors that can be switched onto the clock load. We emphasize that it is not important that the resonance of the grid exactly match the fundamental frequency of the clock, since the grid resonance does not determine the clock frequency. Furthermore, because the Q is only 8, the resonance is not particularly sharp, and the benefits of resonant clocking can be achieved over a fairly large frequency band around 1.1 GHz without tuning.

Local buffering would, of course, not be resonant and would dissipate the same amount of power as in a non-resonant distribution. Hence, with resonant clocking there would be a desire to shift more of the clock load to the resonant grid. This would carry the benefit of reduced power as well as reduced skew and jitter.
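A back-of-the-envelope estimate shows why savings of this order are plausible. The sketch below compares the conventional $C V^2 f$ switching power with the loss implied by the definition of Q (energy dissipated per cycle equals $2\pi$ times the peak stored energy divided by Q). It assumes a purely sinusoidal swing of amplitude $V_{DD}/2$ about mid-rail and a supply of 1.8 V, which is a nominal value for 0.18 µm processes and is not stated in the text. It covers only the grid capacitance, not the buffer chain or harmonics, so it does not reproduce the absolute values of Figure 8.

```python
import math

C_CLOCK = 9.5e-12  # clock load quoted in Section 3 (F)
F_CLOCK = 1.1e9    # clock frequency (Hz)
Q_FACTOR = 8.0     # quality factor estimated in Section 4.1
V_DD = 1.8         # assumption: nominal 0.18 um supply (V), not from the text


def nonresonant_power(c, v_dd, f):
    """Full-swing switching power of a capacitive load: C * V^2 * f."""
    return c * v_dd ** 2 * f


def resonant_power(c, v_dd, f, q):
    """Loss of a sinusoidal swing of amplitude V_DD/2 in a tank of quality Q.

    Peak stored energy is 0.5 * C * (V_DD/2)^2; by the definition of Q the
    energy lost per cycle is 2*pi times that divided by Q.
    """
    peak_energy = 0.5 * c * (v_dd / 2.0) ** 2
    return f * 2.0 * math.pi * peak_energy / q


p_non = nonresonant_power(C_CLOCK, V_DD, F_CLOCK)
p_res = resonant_power(C_CLOCK, V_DD, F_CLOCK, Q_FACTOR)
savings = 1.0 - p_res / p_non

print(f"non-resonant grid power ~ {p_non * 1e3:.1f} mW")
print(f"resonant grid power     ~ {p_res * 1e3:.1f} mW")
print(f"savings                 ~ {savings * 100:.0f} %")
```

Under these idealized assumptions the ratio of resonant to non-resonant power reduces to $\pi/(4Q)$, independent of the supply voltage, which is consistent with the better-than-80% savings reported for a Q of about 8.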
4.5 Scaling

This approach can easily scale to higher clock frequencies for a given clock load by the addition of more inductors to the network, reducing the effective L in Figure 3. Adding more spirals to the grid is preferable to reducing the inductance of each spiral because the addition of more "attach" points helps to suppress standing waves in the grid and preserve uniform phase and amplitude across the distribution.
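The scaling argument can be made concrete with the lumped model: if N identical spirals act in parallel, the effective inductance is L/N and the resonance frequency grows as the square root of N. The sketch below assumes the 9 nH spirals and 9.5 pF load quoted earlier and ignores the extra capacitance and loss that additional spirals would contribute; the function name is illustrative.

```python
import math

L_SPIRAL = 9e-9    # inductance of each spiral (H)
C_CLOCK = 9.5e-12  # clock load (F)


def frequency_with_spirals(n_spirals, l_spiral=L_SPIRAL, c_clock=C_CLOCK):
    """Lumped resonance with n identical spirals assumed to act in parallel."""
    l_eff = l_spiral / n_spirals
    return 1.0 / (2.0 * math.pi * math.sqrt(l_eff * c_clock))


for n in (4, 8, 16):
    print(f"{n:2d} spirals -> {frequency_with_spirals(n) / 1e9:.2f} GHz")
```

Quadrupling the number of spirals therefore doubles the resonance frequency for the same clock load in this simplified picture.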
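[Figure 8 plot: power (W) versus frequency (GHz, from 0.7 to 1.5) for the non-resonant, strong-driver and resonant, weak-driver cases.]

Figure 8: The resonant grid shows 80% power savings over the non-resonant grid at 1.1 GHz.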
5 Conclusions and future work

In this paper, we have presented a new approach to global clock distribution in which traditional tree-driven grids are augmented with on-chip inductors to "resonate" the clock capacitance at the fundamental frequency of the clock node. The energy of the fundamental resonates back and forth between electric and magnetic forms rather than being dissipated as heat. We have shown the significant power, skew, and jitter savings possible with this approach. Such a resonant clock distribution benefits most from a sinusoidal clock, produced by a "minimally" sized clock driver that works primarily to add the energy lost at the fundamental.

We would like to investigate the possibility of turning the resonant grid into a true sinusoidal tuned oscillator with distributed active negative resistance elements (e.g., a single-ended Colpitts topology). Such a resonant grid could act as a voltage-controlled oscillator (VCO), phase-locked to an external reference.
References

[1] P. J. Restle et al. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36:792-799, May 2001.
[2] D. W. Bailey and B. J. Benschneider. Clocking design and analysis for a 600-MHz Alpha microprocessor. IEEE Journal of Solid-State Circuits, pages 1627-1633, November 1998.
[3] D. Dobberpuhl et al. A 200 MHz 64b dual-issue CMOS microprocessor. IEEE Journal of Solid-State Circuits, 27(11):1555-1567, 1992.
[4] W. C. Athas, L. J. Svensson, J. G. Koller, N. Tzartzanis, and Y. Chou. Low-power digital systems based on adiabatic-switching principles. IEEE Transactions on VLSI, pages 398-406, December 1994.
[5] W. Athas, N. Tzartzanis, L. J. Svensson, L. Peterson, H. Li, X. Jiang, P. Wang, and W.-C. Liu. AC-1: A clock-power microprocessor. In Proc. Int. Symp. Low-Power Electronics and Design, August 1997.
[6] Suh Kim and Marios C. Papaefthymiou. True single-phase adiabatic circuitry. IEEE Transactions on VLSI, pages 52-63, February 2001.
[7] B. Kleveland et al. Monolithic CMOS distributed amplifier and oscillator. Digest Technical Papers, International Solid-State Circuits Conference, 36:70-71, February 1999.
[8] V. L. Chi. Salphasic distribution of clock signals for synchronous systems. IEEE Transactions on Computers, 43:597-602, May 1994.
[9] F. O'Mahony, C. P. Yue, M. Horowitz, and S. S. Wong. 10 GHz clock distribution using coupled standing-wave oscillators. In Digest Technical Papers, International Solid-State Circuits Conference, 2003.
[10] John Wood, Terence C. Edwards, and Steve Lipa. Rotary traveling-wave oscillator arrays: a new clock technology. IEEE Journal of Solid-State Circuits, 36:1654-1665, November 2001.
[11] A. J. Stratakos, S. R. Sanders, and R. W. Brodersen. A low-voltage CMOS dc-dc converter for a portable battery-operated system. In Power Electronics Specialists Conference, 1994.
[12] C. Patrick Yue and S. Simon Wong. On-chip spiral inductors with patterned ground shields for Si-based RF IC's. IEEE Journal of Solid-State Circuits, 33:743-752, 1998.