SRAM Cell Optimization for Low AVT Transistors Lawrence T. Clark SuVolta Inc. 130-D Knowles Dr. Los Gatos, CA USA (408)429-6070
[email protected] Samuel Leshner SuVolta Inc. 130-D Knowles Dr. Los Gatos, CA USA (408)429-6043
[email protected] ABSTRACT In this paper, we describe a six-transistor static random access memory (SRAM) cell optimization methodology for transistors with significantly improved matching, while maintaining compatibility with the baseline design. We briefly describe the reduced AVT transistors and show that they allow substantially improved minimum SRAM operating voltage (Vmin) and improved array leakage. Using an efficient design of experiments (DOE) factorial as a pseudo-Monte Carlo generator, points on the tail of the distribution are directly simulated. The highly efficient method is shown to allow optimization and 'what if' scenario investigations. Simulation and silicon results on a 65-nm process as well as simulation results on a 28-nm process are shown. Categories and Subject Descriptors B.7.1 [VLSI]: Advanced technologies – memory technologies, Static memory (SRAM), performance analysis and design aids, simulation. General Terms Algorithms, Design, Verification. Keywords Low power SRAM, mismatch, statistical design. 1. INTRODUCTION Transistor mismatch, predominantly caused by random dopant fluctuation (RDF) where the number of dopants affects the transistor threshold voltage (VT) local mismatch, has led to diverging minimum logic and SRAM operating voltages. Local matching is AVT [11], the area normalized VT, i.e., the variance of the VT difference measured between nearby, identical transistors where √
(1)
and |
| .
(2)
Since the number of dopant atoms is proportional to the transistor area (W×L), small devices, e.g., those used in SRAM memories, are the most affected [4]. 6-T SRAM write-ability and read stability are ensured by the cell transistor drive ratios. Consequently, SRAM yield falls off rapidly as VDD is lowered towards Vmin, the minimum yielding SRAM VDD, due to increasing variability in the drive ratios as the transistor gate overdrive VGS − VT (where VGS = VDD) diminishes relative to the wide VT variations.
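The 1/√(W×L) dependence in (1) can be made concrete with a short calculation; the AVT coefficient and device dimensions below are illustrative values, not taken from this paper.

```python
import math

def sigma_delta_vt(avt_mv_um, w_um, l_um):
    """Pelgrom model, equation (1): sigma of the VT difference between
    two nearby identical devices, in mV, for AVT in mV*um."""
    return avt_mv_um / math.sqrt(w_um * l_um)

# Illustrative numbers: a minimum-size SRAM device vs. a 4x wider logic device
sram_sigma = sigma_delta_vt(3.0, 0.1, 0.05)
logic_sigma = sigma_delta_vt(3.0, 0.4, 0.05)
print(sram_sigma, logic_sigma)  # the smaller device varies 2x more
```

Doubling both W and L halves the mismatch sigma, which is why SRAM cells, sized for density, sit at the worst point of the trade-off.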
George Tien SuVolta Inc. 130-D Knowles Dr. Los Gatos, CA USA (408)429-6084
[email protected] (a)
(b)
Figure 1. DDC transistor structure (a) and 65-nm SRAM baseline and DDC transistor σVT for access, pull-down and pull-up devices (b).
In this paper we discuss a bulk CMOS deeply depleted channel (DDC) transistor architecture that provides significantly improved mismatch, i.e., lower AVT, resulting in greatly improved SRAM low voltage characteristics. Since improving the transistor AVT allows different SRAM optimization points, an efficient way to explore the design/yield space and target the SRAM transistor sizes and threshold voltages is needed, without resorting to fabrication and measurement. To this end, we present a stratified sampling DOE factorial based pseudo-Monte Carlo method for simulation of SRAM yield. Fabricated 65-nm SRAM Vmin measurements confirm the simulation predictions. We also demonstrate its use to optimize SRAM transistor VT targets and dimensions to provide the best Vmin, subject to read current, array leakage, and retention voltage (Vretain) constraints, through circuit simulation.

Section 1 describes the DDC transistor structure and fabrication. Section 2 presents the device impact on SRAM and silicon results at 65-nm. Section 3 describes the simulation based optimization methodology, while results on 65 and 28-nm are presented in Section 4. Section 5 concludes.

1.1 Transistor Structure and Fabrication
The DDC technology used is built in planar bulk CMOS. Layout rules for the baseline and DDC version of the process are identical. The device architecture used is shown in Fig. 1(a). An un-doped region forms the depleted MOS channel that dramatically reduces RDF. The DDC devices for this study are fabricated using a shallow trench isolation (STI) last process. This allows the formation of the channel regions
Figure 2. 6-T SRAM cell.
Figure 3. 65-nm 9.4M-bit non-redundant SRAM DDC Vmin vs. temperature.
by using a blanket un-doped epitaxial layer deposited across the entire wafer after a series of well implants. Implant species choices and processing temperatures combine to produce a very abrupt drop in doping between the screen and VT setting layers to the channel. In addition to improved AVT, the essentially un-doped channel also provides higher mobility and thus higher performance, as well as reduced vertical electric fields that improve reliability [7]. The fabricated devices use no halo implants, which can also adversely affect matching [6]. Furthermore, an advantage of the DDC scheme is that well proximity effects [6] are mitigated, since the wells are implanted before formation of the un-doped channels, so scattered dopants are isolated below the channel region. The device architecture directly supports the multiple threshold voltages that are essential to low power circuit design by controlling the VT with the VT setting offset region doping. The highly doped screening region terminates the gate and drain electric fields for short channel effect (SCE) control and contributes to improved AVT by reducing depletion region depth variation. Additionally, the screen layer provides significantly greater body coupling. Transistor gate formation, implantation of source drain regions, and back-end flow are conventional. The mask cost adder for DDC over the conventional process can be minimized to be as low as one added initial pre-well alignment mask, which is not required in the conventional process.

1.2 DDC SRAM Transistor Electrical Characteristics and Impact on SRAM
The DDC body effect is 2-3× greater than that of a conventional halo transistor, depending on technology node and device dimensions. The gate delay impact of higher
body effect is cancelled out by improved effective drive current (IEFF) at low drain voltages and thus does not reduce performance [3]. The strong body effect allows effective process corner pull-in, using body biasing. On the 65-nm process the DDC threshold voltages are chosen so that reverse body bias (RBB) can be used to retarget silicon at process corners back towards the typical device response. To accomplish this, a large (e.g., 0.6 V) RBB is applied to fast die, moderate (0.3 V) RBB is used on typical corner die, and no RBB is applied to slow die. The actual bias is a function of the measured die average device response. Since the PMOS and NMOS responses as-fabricated are independent, different body biases may be applied to PMOS and NMOS. Results on 65-nm DDC designs using RBB for systematic corner pull-in demonstrate nearly 50% reduced leakage (at the fast-fast corner) and active power at the same (slow process corner) performance over the baseline, after process corner pull-in [3].

With DDC transistors, the SRAM PMOS σVT is reduced to 20 mV from 37 mV in the baseline process and the SRAM NMOS σVT is reduced to 14 mV from 35 mV, an improvement of 40% and 60% for PMOS and NMOS respectively (Fig. 1(b)). In addition to the large AVT improvement, the DDC transistor architecture reduces drain induced barrier lowering (DIBL) by as much as 40%, resulting in less loss of IEFF as VDD is reduced. The IEFF gain ranges from approximately 10% at full VDD, to over 50% at VDD = 0.6 V. The nominal operating VDD is 0.9 V, as opposed to 1.2 V for the baseline process, providing nearly 50% active power savings. The DDC AVT improvement directly improves matching, significantly impacting SRAM Vmin. However, improvement in Vmin is also a result of the improved systematic variation, IEFF, and greater body coefficient. Referring to Fig. 2, assume that node CN stores a logic 0 and C stores a logic 1.
Read SNM is a function of the ratio of MPD1 and MPG1, where the former must be stronger to maintain node CN low. During read, WL = VDD; as CN rises, RBB is effectively applied to the access transistor MPG1, reducing the MPG1 drive strength, particularly as CN rises high enough that the cell may reach metastability. The reduced DIBL, producing greater IEFF, most significant at low VDS, increases the on-state holding current at low VDD. Quantitative results comprise Section 3.5.1.

2. MEASURED SRAM RESULTS ON 65-nm
Measurements of early silicon demonstrated a usable static noise margin (SNM) over a significantly wider VDD range, and the mean SNM is over 5σ from VDD = 0.4 V to VDD = 1 V using the same SRAM circuits [7]. The measured 9.4 M-bit SRAM array Vmin with no redundancy is shown in Fig. 3. The (95% array yield) Vmin for -40ºC, 25ºC, and 125ºC is 500 mV, 450 mV, and 600 mV, respectively. These values are each approximately 200 mV lower than those measured on the baseline (non-DDC) process with identical circuits. Since the leakiest transistors contribute to the array leakage nonlinearly, random variability manifests as a higher overall (IC or block level) leakage than the average device VT would suggest. Lower VT transistors have exponentially greater leakage [12]. As a result, reducing AVT results in improved
Figure 4. Basic rare events simulation setup. The simulator acts as a black box, transforming input points xi (left) into the margin or pass/fail space (right) via observed values gi generated by the simulator.
SRAM array level leakage. Reducing this variation component can significantly impact the overall IC leakage, which tends to be dominated by SRAM due to their large area component despite use of higher VT's. It also reduces the performance variation, although long logic paths benefit from averaging across multiple stages as well as using larger transistors. Since read bit line delay is significantly greater than one logic stage, delay variation impact on SRAM is commensurately higher. Reduced VDD for state retentive SRAM low power modes has been reported, with and without application of body biasing [15]. One drawback of reduced DIBL is that the VT increase as SRAM supply voltage is reduced is also diminished. Thus, some of the power benefit from improved Vretain is lost. Junction leakage is mitigated by reduced Vretain; however, this is generally limiting on slow corner die only and does not affect the device leakage (standby power) specification.

3. METHODOLOGY FOR SIMULATION-BASED OPTIMIZATION
3.1 Prior Work on Rare Events Simulation
Simulation of yield far onto the tail of the device response distribution, i.e., the determination of the likelihood of rare events, is required to analyze the yield of SRAMs [2]. The range of σ where failures occur is typically greater than 6σ. However, direct application of Monte Carlo (MC) simulation methods is futile, since the number of points required to achieve a reasonable estimate increases as the square of the resolution. Consequently, there has been substantial effort to design algorithms that more efficiently perform simulations to determine SRAM yield. Methods include response surface methodologies (RSM) [9], importance sampling (IS) [5][8], statistical blockade (SB) [14], and stratified sampling [16]. All of these methods rely on the simulator as a black box that, provided inputs, returns a circuit behavior or similar pass/fail results based on the varied inputs.
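Of the methods listed, importance sampling is the easiest to demonstrate on a toy problem: samples are drawn from a density shifted toward the failure region and reweighted by the ratio of the true to the shifted density. The sketch below estimates a Gaussian tail probability this way; the shift choice and sample counts are illustrative, not from the paper.

```python
import math
import random

random.seed(1)

def phi(x, mu=0.0):
    """Normal(mu, 1) probability density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def tail_prob_mc(n, t=4.0):
    """Plain Monte Carlo estimate of P(X > t) for X ~ N(0, 1)."""
    return sum(random.gauss(0.0, 1.0) > t for _ in range(n)) / n

def tail_prob_is(n, t=4.0):
    """Importance sampling: draw from the shifted density h = N(t, 1),
    weight each pass sample by f(x)/h(x)."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(t, 1.0)          # sample from the shifted density h
        if x > t:                         # g(x): pass/fail indicator
            total += phi(x) / phi(x, t)   # weight f(x)/h(x)
    return total / n

# Plain MC with 10k points usually sees zero failures; IS resolves the tail.
print(tail_prob_mc(10_000), tail_prob_is(10_000))  # exact value is 3.167e-5
```

The same reweighting idea underlies SRAM tail-yield estimation, with the circuit simulator standing in for the indicator function.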
X is a random variable with probability distribution function f(x) and g(X) is a function of X. Thus, a MC estimate of the distribution of g(X) is obtained by generating points xi approximating f, while computing gi = g(xi). The observed values gi are used as an estimate of the distribution of g(X), with equal weights (Fig. 4). For instance, for SRAM analysis, the inputs xi are the variations in the individual devices (basically their ki). The result gi is the read noise margin
Figure 5. Stratified sampling. The inputs are chosen to match the required overall circuit response sigma in bins (strata).
(RNM), read or retain mode SNM, write margin (WM) or the pass/fail based on these. If X is discrete, the expected value is

E[g(X)] = Σx g(x) f(x),    (3)

and the Monte Carlo estimator of g(X) is
ĝn = (1/n) Σi=1..n g(xi).    (4)

From that data, the yield of an array is straightforward to calculate based on simple yield models. Monte Carlo estimation error decreases with the square root of the number of samples. Ideally the points chosen have good uniformity (low discrepancy) across the hypercube being investigated. IS uses a modified sampling distribution emphasizing the region of interest on the probability tail in place of the "true" distribution to improve the efficiency of the simulation testing. IS can allow many orders of magnitude gains in efficiency over traditional MC. The IS samples Xi are drawn from a "sampling distribution" with density h(x) that emphasizes the region of interest, e.g., the high sigma tail. Then the MC estimator is
ĝn = (1/n) Σi=1..n g(Xi) f(Xi)/h(Xi),    (5)

where h(Xi) is the weighting density that provides the bulk of the data points in the region of interest. A good sampling function is one that is close to proportional to g(x)f(x). In contrast, SB builds a classifier that allows points outside the region of interest to be thrown out before simulating (which is the slowest step) to determine whether they are pass or fail points. While building the classifier model can be fast, a guard band is required that varies with the smoothness of the response.

3.2 Stratified Sampling Methodology
While methods such as mixture importance sampling and statistical blockade shift the bulk of the randomly generated points to be in the range of σCELL that spans from the cutoff from full yield to zero yield, stratified sampling [16] generates samples at known strata in the distribution (see Fig. 5). Since we have significant engineering understanding of the problem at hand, this is very beneficial in reducing simulation overhead. Assuming that the variation is dominated by RDF, the device responses are independent, allowing determination that the overall circuit σCELL is
Figure 6. Factorials for 2 inputs (a) and 6 inputs (b), basically all combinations of -1, 0, 1 for each parameter ai.

Figure 8. Worst-case 6σ 65-nm SRAM read noise margin and write margin vs. VDD as determined by the proposed SS-FPMC approach for baseline and DDC.
Figure 7. 65-nm SRAM yield vs. VDD as determined by the proposed SS-FPMC approach at 25ºC. High accuracy to the measured data (from Fig. 3) is achieved.
obtained as the root sum of squares of the σDEVICE of the six constituent devices as

σCELL = √( Σj=1..6 σDEVICE,j² ).    (6)
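Equation (6) is a root-sum-of-squares combination of independent per-device contributions; a minimal sketch, with an illustrative per-device sigma:

```python
import math

def sigma_cell(device_sigmas):
    """Equation (6): independent per-device variations combine as a
    root sum of squares into the overall cell sigma."""
    return math.sqrt(sum(s * s for s in device_sigmas))

# Six cell transistors with an equal, illustrative per-device sigma of 1
print(sigma_cell([1.0] * 6))  # sqrt(6), about 2.449
```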
Consequently, rather than generating random points with a shifted density or screening them, points at the desired σCELL can be generated directly by appropriately setting the various σDEVICE in (6). The array yield Y for an array of N cells is simply calculated from the Poisson distributed failures as

Y = e^(−N·A),    (7)
where A is the defective cell density; in this case, the fraction of cells that fail, i.e., that do not meet the required margins. In any sampling experiment it is important that the input points xi chosen adequately cover the circuit response, formally referred to as the granularity required [10][11]. For the problem at hand, a sequential strategy can be adopted when using stratified sampling, since the response of the devices is monotonic. As larger σ is explored, failures at low σ are retained and new failures are added. Simply stated, the σCELL that provides the required yield, e.g., 95%, can be solved by inverting (7) to determine the allowed fail rate A for the array size being investigated. Using this cumulative fail rate, tests at decreasing VDD can determine the Vmin or Vretain. Note that this step of rapidly zeroing in on the bounds of the steep part of the yield curve (see Fig. 3) is analogous to building a classifier in the SB approach.
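The inversion of (7) can be sketched as follows. The exponential-yield form and the inverse-normal step are our reading of the text, and the 9.4 M-bit array size is taken from Section 2; treat this as a sketch rather than the paper's exact yield model.

```python
import math
from statistics import NormalDist

def allowed_fail_fraction(yield_target, n_cells):
    """Invert the Poisson yield model (7), Y = exp(-N*A), for A."""
    return -math.log(yield_target) / n_cells

def fail_sigma(fail_fraction):
    """Gaussian sigma corresponding to a one-sided tail probability."""
    return NormalDist().inv_cdf(1.0 - fail_fraction)

N = 9_400_000                        # 9.4 M-bit non-redundant array (Section 2)
A = allowed_fail_fraction(0.95, N)   # allowed per-cell fail fraction at 95% yield
print(A, fail_sigma(A))              # ~5.5e-9, i.e. roughly a 5.7 sigma event
```

The result illustrates the claim in Section 3.1 that SRAM failures of interest sit near or beyond the 6σ tail, far outside the reach of plain Monte Carlo.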
Figure 9. Worst case 9σ 65-nm conventional halo and DDC SRAM RNM and WM vs. VDD as determined by the SS-FPMC analysis. The worst-case RNM for the conventional transistor SRAM at 9σ is 0 for all VDD.
3.3 DOE Factorial Based Pseudo-MC
A key issue with MC methods, besides the large number of points that must be investigated, is that random sampling of the region may not uncover all the important features of the circuit response, resulting in an under or overestimate of the yield and a commensurate mis-estimate of the error. It is necessary that the input points adequately cover the circuit response, formally referred to as the space filling and projective properties of the inputs. Essentially, the first criterion requires that all the salient properties of the response are explored, and the latter that every point is unique. When choosing the points randomly, a sufficiently good algorithm provides uniformity. Design of experiments (DOE) approaches provide appropriate space filling and projective properties with relatively few points [10]. Full factorials (see Fig. 6) in the case here require positive and negative points ai (e.g., a transistor VT may increase or decrease with variation) which, including cases where some devices exhibit no variation (or negligible variation compared to the others), requires 3^K − 1 points for K devices. Note that the zero variation point is independent of the target σCELL. In
Figure 10. DDC improvement in RNM for 64 M-bit array 95% yield vs. VDD assuming both devices retain the conventional transistor AVT.
experimental design, full factorials rapidly become untenable for large problems, but are quite efficient for simulation, especially in the SRAM design space, with only six input variables. The resulting factorial requires 3^6 − 1 = 728 points. To obtain each transistor's required variation for a simulation trial, each ai is multiplied by the required σDEVICE for that row, obtained by

σDEVICE = σCELL / √( Σj=1..6 aj² ),    (8)
where σCELL is determined by the strata. Thus, the same σCELL is obtained regardless of the number of transistors that experience variation in that trial.

3.4 Simulation and Application to SRAM
While it is straightforward to simulate read and hold mode SNM [13], we use read noise margin (RNM) [1] as the read stability criterion. Write margin (WM) is determined by the bit line voltage at which the cell state flips, when lowered from VDD with WL = VDD. Cells that are not write stable (they will not hold the written data after WL is de-asserted) in a given direction are by definition not read stable in the same configuration. Conversely, a cell that is write stable may not have positive RNM. Thus, by this choice of metrics, the response is naturally dominated by write failures, i.e., their number always exceeds the read failures. Using the stratified sampling factorial pseudo-MC (hereafter SS-FPMC) method, a single σCELL, VDD pair (all 728 trials) can be run in one simulator run, which is parallelizable. Consequently the method is fast enough to allow not just Vmin determination, but exploration of the design space and circuit optimization.

3.5 65-nm SRAM Validation and Analysis
A validation of the method compares the experimental results of Fig. 3 with the model output at room temperature (Fig. 7). The agreement with the measured data is very good, particularly at the 95% yield point, and using the SS-FPMC method, the DDC Vmin is estimated to be 450 mV (at 95% array yield). Using the same analysis, the baseline halo SRAM Vmin is estimated to be 675 mV. The method does not comprehend non-parametric yield loss and thus predicts 100% yield above Vmin, while the actual fabricated 9.4 Mb array yield is slightly less at this point in development, as shown.
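The trial generation of Section 3.3, with the per-row scaling of (8), can be sketched as below; the σCELL value is illustrative, and the circuit-simulator hookup that evaluates each trial is omitted.

```python
import itertools
import math

def factorial_trials(sigma_cell, k=6):
    """Full 3^k - 1 factorial (Section 3.3): each row assigns a_i in
    {-1, 0, +1} to each of k devices, then scales per (8) so every
    row lands at the same overall sigma_cell."""
    trials = []
    for row in itertools.product((-1, 0, 1), repeat=k):
        if not any(row):
            continue  # the all-zero row carries no variation
        sigma_device = sigma_cell / math.sqrt(sum(a * a for a in row))
        trials.append(tuple(a * sigma_device for a in row))
    return trials

trials = factorial_trials(6.0)
print(len(trials))  # 3**6 - 1 = 728 trial rows
```

Each tuple is one set of six device VT shifts to hand to the simulator; by construction every row sits exactly on the target σCELL shell, which is what makes one 728-trial run per (σCELL, VDD) pair sufficient.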
Figure 11. DDC iso-Vmin vs. PMOS (vertical axis) and NMOS (horizontal axis) VT shifts from the baseline values.
Fig. 8 shows the 6σ worst-case RNM and WM vs. VDD on the baseline and DDC processes. Fig. 9 shows the same results at 9σ, where the conventional devices have vanishing SNM at all voltages. The impact on all responses is essentially a downward shift as σ increases, as expected.

3.5.1 Impact of DDC without AVT improvement
As mentioned in Section 1.2, the IEFF and body coefficient improvements also directly impact the SRAM stability, independent of the large AVT reduction. Fig. 10 plots the worst case 6σ SNM for the DDC and baseline (conventional halo) device 65-nm SRAM cell, assuming the same (baseline) AVT. The SNM is improved by 32% at the baseline nominal VDD of 1.2 V.

4. DESIGN OPTIMIZATION
The efficiency of the SS-FPMC approach allows the SRAM array parameters to be optimized. We optimize by gradient descent. To aid in engineering understanding, the circuit response can be plotted vs. device parameters. Plotting the response topology also allows determination of the variation of the response due to global, e.g., wafer to wafer and die to die, variability in addition to the random variability. Moreover, the factorial directly provides the specific device variability combinations that are limiting near the Vmin point. This allows engineering analysis and points to useful optimization directions for Vmin improvement and the effects on other parameters such as array leakage and Iread.

4.1 65-nm SRAM Analysis and Optimization
Fig. 11 shows the iso-Vmin vs. NMOS and PMOS VT with the baseline target VT's as the origin. The graph shows directly how much the VT's can be modified, subject to meeting the other requirements (not shown), to obtain the lowest Vmin. The dashed oval represents within wafer variability, i.e., the expected cross-wafer spread, showing a worst-case average Vmin of 0.47 V. The line superimposed is the iso-leakage power line of 100 pW/cell at VDD = 0.9 V
Figure 12. NMOS only DDC improvement in Vmin for 50% DDC AVT improvement over the baseline NMOS FET, before and after SRAM cell device retargeting.
(with RBB of -0.3 V) at 25ºC. Further VT reduction will violate the leakage specification. The SRAM stability enhancements provided by the DDC transistor also allow a more lithographically friendly cell. The baseline cell pull down (MPD1 in Fig. 2) is wider than the access device (MPG1 in Fig. 2) as is standard. Narrowing MPD1 to the width of MPG1 and reducing the L of the PMOS pull up transistor (MPU1 in Fig. 2) removes the diffusion notch, while retaining the same cell footprint. The DDC characteristics, including better matching for the smaller PMOS area, allow a 5% Vmin improvement, with a more "litho friendly" topology with no diffusion notches.

4.2 28-nm SRAM Analysis and Optimization
On a 28-nm high-k metal gate (HKMG) process, the SS-FPMC methodology was shown to match well with the foundry analysis and silicon, using 64 M-bit arrays, again assuming 95%, non-redundant array yield. The analysis here assumes that the array leakage target remains constant, which with lower AVT does allow a lower target VT when using the DDC devices. The SS-FPMC methodology allows investigation of what-if scenarios as illustrated here. If the process is not limited by RDF, or if RDF is a fractional component of one transistor type, e.g., PMOS, then the DDC AVT improvement may be very asymmetrical. For example, PMOS transistors may have significant mismatch contribution from the embedded SiGe (e-SiGe) source/drains, limiting the matching improvement that can be obtained (again, at that point in the development). Using DDC only for NMOS, having 50% AVT improvement with no other changes, e.g., in transistor dimensions or VT targets, on the 28-nm SRAM improves the Vmin to 650 mV from 760 mV (Fig. 12). While the read and write failure rates of this baseline design at Vmin are balanced, after NMOS only AVT improvement, the write failures predominate (low PMOS VT being the root cause).
At this point, write margin can be improved by weakening the PMOS devices and strengthening the NMOS devices, i.e., raising the PMOS and lowering the NMOS VT target by 20 mV and 40 mV, respectively (see Fig. 13). The latter is limited by the target SRAM array leakage. This reduces the write failure rate, while increasing the RNM
Figure 13. Iso-Vmin plot for 28-nm NMOS only DDC before (A, Vmin = 650 mV) and after cell device retargeting (B, Vmin = 540 mV). Ovals are the range of Vmin that will be produced due to within wafer variation.
failures, and reducing the Vmin to 540 mV, with a worst-case within wafer value just above VDD = 580 mV. Further analysis shows that improving NMOS matching is more valuable than improving PMOS matching. To reiterate, device matching improvements require cell VT target or size ratio re-optimization to obtain the best Vmin with the improved devices.

5. CONCLUSIONS
Improving SRAM Vmin requires improved AVT. For the examples presented here, where DDC SRAM devices exhibit improved systematic variability, AVT, and IEFF, considerable latitude exists to optimize the circuits to obtain better SRAM array leakage, Vmin, Vretain, and IREAD simultaneously. Improving the device leads to a measurable difference in circuit parameters, and with changes to the circuit design, optimal circuit level results can be obtained. The SS-FPMC predictions of array Vmin match measured results. This approach to rare events simulation is efficient enough to allow design exploration, engineering analysis, and optimization of SRAM design points, and their relations to array yield. The high efficiency allows the mapping of the response vs. parameters such as constituent transistor target VT, instead of providing point results, providing useful insights into the root causes of the response. The method, which is confirmed by silicon results, also allows separation of the device characteristics and determination of their relative impacts. We believe it will also prove useful for statistical analysis of other circuits, e.g., sense amplifiers and analog circuits, as well.

6. REFERENCES
[1] Agarwal, K., Nassif, S., Statistical Analysis of SRAM Cell Stability, Proc. Design Automation Conference (July 2006) 57-62.
[2] Aitken, R., Idgunji, S., Worst-Case Design and Margin for Embedded SRAM, Proc. DATE (2007) 1289-1294.
[3] Clark, L.T., et al., A Highly Integrated 65-nm SoC Process with Enhanced Power/Performance of Digital and Analog Circuits, Proc. IEDM (Dec. 2012) 14.3.1-14.4.4.
[4] De, V., Tang, X., Meindl, J., Random MOSFET Parameter Fluctuation Limits to Gigascale Integration (GSI), VLSI Tech. Symp. Dig. (June 1996) 198-199.
[5] Doorn, T.S., et al., Importance Sampling Monte Carlo Simulations for Accurate Estimation of SRAM Yield, Proc. ESSCIRC (Sept. 2008) 230-233.
[6] Faricelli, J., Layout-Dependent Proximity Effects in Deep Nanoscale CMOS, Proc. CICC (Sept. 2010) 1-8.
[7] Fujita, K., et al., Advanced Channel Engineering Achieving Aggressive Reduction of VT Variation for Ultra-Low-Power Applications, Proc. IEDM (Dec. 2011) 32.3.1-32.3.4.
[8] Kanj, R., Joshi, R., Nassif, S., Mixture Importance Sampling and Its Application to the Analysis of SRAM Designs in the Presence of Rare Failure Events, Proc. DAC (July 2006) 69-72.
[9] Myers, R.H., Montgomery, D.C., Response Surface Methodology, Wiley, 2002.
[10] Montgomery, D.C., Design and Analysis of Experiments, Wiley, 2001.
[11] Pelgrom, M., Duinmaijer, A., Welbers, A., Matching Properties of MOS Transistors, IEEE J. Solid-State Circuits, 24, 5 (Oct. 1989) 1433-1439.
[12] Saxena, S., et al., Variation in Transistor Performance and Leakage in Nanometer-Scale Technologies, IEEE Trans. Elec. Dev., 55, 1 (Jan. 2008) 131-144.
[13] Seevinck, E., List, F., Lohstroh, J., Static-Noise Margin Analysis of MOS SRAM Cells, IEEE J. Solid-State Circuits, SC-22, 5 (Oct. 1987) 748-754.
[14] Singhee, A., Rutenbar, R., Statistical Blockade: Very Fast Statistical Simulation and Modeling of Rare Circuit Events and Its Application to Memory Design, IEEE Trans. CAD of ICs and Systems, 28, 8 (Aug. 2009) 1176-1189.
[15] Wang, J., Singhee, A., Rutenbar, R., Calhoun, B., Statistical Modeling for the Minimum Standby Supply Voltage of a Full SRAM Array, Proc. ESSCIRC (Sept. 2007) 400-403.
[16] Brandimarte, P., Numerical Methods in Finance and Economics, Wiley, 2006.