Power Benefit Study for Ultra-High Density Transistor-Level Monolithic 3D ICs

Young-Joon Lee, Daniel Limbrick, and Sung Kyu Lim
School of ECE, Georgia Institute of Technology, Atlanta, GA
[email protected], [email protected], [email protected]

ABSTRACT
The nano-scale 3D interconnects available in monolithic 3D IC technology enable ultra-high density device integration at the individual transistor-level. In this paper we demonstrate the power benefits of transistor-level monolithic 3D designs. We first build a cell library that consists of 3D gates and model their timing/power characteristics. Next, we build timing-closed, full-chip GDSII layouts and perform sign-off iso-performance power comparisons with 2D IC designs. We also study the characteristics of benchmark circuits that maximize the power benefits in monolithic 3D designs. Lastly, our study is extended to predict the power benefits of monolithic 3D designs built with future devices.

Categories and Subject Descriptors B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

General Terms Design

Keywords 3D IC, monolithic 3D, transistor-level, power analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC'13, May 29 - June 07 2013, Austin, TX, USA. Copyright 2013 ACM 978-1-4503-2071-9/13/05 ...$15.00.

1. INTRODUCTION
To better exploit the benefits of 3D die stacking, monolithic 3D technology is currently being investigated as a next-generation technology. In a monolithic 3D IC, the device layers are fabricated sequentially. When the top layer is attached to the bottom layer, the top layer is blank silicon. Alignment precision is determined by lithography stepper accuracy, which is around 10nm today. Also, the top layer can be made very thin, around 30nm [1]. Thus, monolithic inter-tier vias (MIVs) for vertical connections are very small, about two orders of magnitude smaller than through-silicon vias (TSVs), with almost negligible parasitic RC. With these small MIVs, designers can truly exploit the benefit of the vertical dimension.

The early works on monolithic 3D ICs were technology-driven [6, 4, 9]. Recently, logic design methodologies for monolithic 3D ICs were demonstrated [2, 8, 7]. In these works, the authors presented various comparisons among monolithic 3D ICs, TSV-based 3D ICs, and conventional 2D ICs in terms of footprint, timing, and power. However, timing was not closed in these works, which makes the studies impractical. In addition, all these works assume that the timing and power characteristics of 3D monolithic gates are the same as those of 2D gates and do not demonstrate why that is a reasonable assumption. The authors also did not provide in-depth analyses and discussions on why monolithic 3D technology reduces power consumption and what factors affect the power reduction margin. This knowledge is crucial to maximize the benefit and to justify on-going and future research on fabrication and design technologies for monolithic 3D ICs.

As discussed in [2, 8], monolithic 3D technology enables very fine-grained 3D circuit partitioning. We can divide standard cells into PMOS and NMOS parts, place them in different layers, and connect them using MIVs, which we call transistor-level monolithic 3D integration (T-MI) in this paper. Or, as in TSV-based 3D ICs, we may place planar cells in different layers and connect them using MIVs, which is named gate-level monolithic 3D integration (G-MI). In this paper we focus on transistor-level integration, which allows the highest integration density possible. The T-MI designs differ from G-MI in three ways: (1) Most of the 3D interconnects are embedded in the cells. (2) PMOS and NMOS transistors are on different layers, thus manufacturing processes can be optimized separately. (3) Physical layout (placement, routing, optimization, etc.) can be performed using existing 2D electronic design automation (EDA) tools, with modifications. In this paper, we study the power benefit of T-MI based on timing-closed, detailed-routing-completed GDSII-level layouts and sign-off analysis on timing and power.
Our comprehensive work encompasses device and interconnect-level study, gate-level modeling and optimization, and full-chip layout construction, optimization, and timing/power analysis for the current and future technology nodes. With our layout-based simulations and in-depth analyses, we demonstrate how to maximize the power benefit of T-MI technology. For fair comparisons between 3D and 2D designs, timing is closed on all designs (iso-performance), and power consumption is compared. We also investigate the circuit characteristics that affect the power benefit of monolithic 3D ICs. Our major contributions are as follows:
(1) To the best of our knowledge, this is the first work to characterize the timing and power of the individual transistor-level monolithic 3D cells. We extract the internal parasitic RC of our T-MI cells and characterize their timing and power. We then compare T-MI cells with 2D counterparts.
(2) We study the design aspects that significantly affect the power benefit of monolithic 3D ICs. We discuss what kind of logic circuits are suitable for power reduction in monolithic 3D ICs. In addition, we demonstrate that the power reduction rate also depends on the target clock period.
(3) We build the libraries and full-chip layouts for monolithic 3D ICs implemented using 7nm devices. The goal is to predict the future trend of power saving with monolithic 3D technology and study how the smaller dimensions and varying parasitic RC affect the power benefit.

[Figure 1 flow diagram, box labels: Extend layer definitions & metal layers; Design T-MI cells; Create physical cell library & interconnect RC library; 2D timing & power library; Benchmark circuit RTL; WLM; Synthesis; Placement; Pre-route optimization; Routing; Post-route optimization; Timing/power analysis.]

2. DESIGN AND ANALYSIS FLOW
One of the major benefits of T-MI is that existing 2D EDA tools can be used, with simple modifications if needed. We extensively use commercial EDA tools in this study. Our design and analysis flow, summarized in Fig. 1, consists of four parts: (1) library preparation, (2) synthesis, (3) layout, and (4) analysis. In the library preparation part, we prepare T-MI-specific library files. We synthesize the RTL code of benchmark circuits using Synopsys Design Compiler.1 In the layout part, we perform placement, routing, and optimizations using Cadence Encounter (v10.12). Finally, we perform static timing analysis and statistical power analysis.

Our major efforts for the T-MI design flow are spent on T-MI cell library construction and characterization, T-MI interconnect structure modeling, and T-MI wire load modeling. We modify the technology files and design rules to account for additional layers on the bottom tier as well as additional metal layers on the top tier (see Section 3.3). Using Cadence Virtuoso, we create our T-MI cells by modifying existing 2D cells. The cells are then abstracted to create the T-MI physical cell library. We also build interconnect RC libraries using Cadence capTable generator and QRC Techgen. For synthesis, we create the T-MI wire load models (see Section 3.4) that guide synthesis optimizations.

During layout construction, we first run the Encounter placer. The tool recognizes T-MI cells as cells with pins on multiple layers. For routing, we set up Encounter to utilize the additional metal layers on the bottom and top tiers. Since our T-MI cells contain routing blockages on the MIV layer, the router avoids 3D routing through the top tier part of the cells using MIVs. Using our T-MI interconnect library that reflects the T-MI metal layer structures and materials, we perform RC extraction on all the nets in the layout. Our full-chip timing/power optimizations and analyses for T-MI and 2D are the same, because the entire T-MI design (top/bottom tiers) is captured in a single Encounter session. We perform statistical power analysis with the switching activity of the primary inputs and sequential cell outputs set to 0.2 and 0.1, respectively.2
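To make the role of the switching activity settings concrete, the following minimal sketch estimates the switching power of a single net. It assumes the common convention that activity is the average number of toggles per clock cycle, so that each toggle dissipates C*V^2/2 on average; this convention, the function name, the 10 fF load, and the 1 GHz clock are illustrative assumptions and not part of the paper's flow. The 1.1 V supply and the 0.2/0.1 activities are the values used in this study.

```python
def switching_power_w(activity, cap_f, vdd_v, freq_hz):
    """Average switching power in watts.

    Assumed model: P = 0.5 * activity * C * VDD^2 * f, where activity is the
    average number of toggles per clock cycle (an assumption of this sketch).
    """
    return 0.5 * activity * cap_f * vdd_v ** 2 * freq_hz


CAP_F = 10e-15   # hypothetical net load: 10 fF
VDD_V = 1.1      # 45nm supply voltage used in this study
FREQ_HZ = 1e9    # hypothetical 1 GHz clock

# Activities used in the study: 0.2 for primary inputs, 0.1 for sequential outputs.
p_primary_input = switching_power_w(0.2, CAP_F, VDD_V, FREQ_HZ)
p_sequential_out = switching_power_w(0.1, CAP_F, VDD_V, FREQ_HZ)

print(f"primary-input net:     {p_primary_input * 1e6:.3f} uW")
print(f"sequential-output net: {p_sequential_out * 1e6:.3f} uW")
```

Under this model the power of a net scales linearly with both its activity and its capacitance, which is why shorter T-MI wires translate directly into lower net power.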

1 Our benchmark circuits and the synthesis results are shown in Section S4.
2 The impact of switching activity is shown in Section S10.

Figure 1: Overall design and analysis flow. Shaded boxes highlight differences in T-MI. The WLM means wire load model.

3. 45NM TECHNOLOGY SETUP

3.1 Monolithic 3D Cell Design

Figure 2: The layout of an inverter from (a) Nangate 45nm library (2D cell), and (b) our T-MI library (our T-MI cell). P, M, and CT represent poly, metal, and contact. The suffix 'B' means the bottom tier. Top/bottom tier silicon substrate and p/n-wells are not shown for simplicity. Numbers in parentheses mean thickness in nm. [Labels in the figure: VDD, VSS, A, Z, fold, CT, CTB, P(85), PB(85), M1(130), MB1(130), MIV(140).]

We design our T-MI 3D cells using the (2D) standard cells in Nangate 45nm library [10] as our baseline. As shown in Fig. 2, we fold the 2D standard cells into 3D and create T-MI 3D cells. The thicknesses of top/bottom tier silicon substrates and inter-layer dielectric (ILD) are 30nm and 110nm, respectively. The diameter of MIV is 70nm. Note that by folding, each input/output pin is on both tiers. We prefer to place the PMOS transistors on the bottom tier and the NMOS on the top tier. In Nangate 45nm library, P/NMOS transistors show hole/electron mobility skew. To compensate for the difference, a PMOS is larger than the corresponding NMOS. Since extra silicon space on the top tier is required for MIVs (not on the bottom tier; see Fig. 2(b)), placing PMOS transistors on the bottom tier balances top/bottom silicon area usage. However, we should also consider manufacturing aspects in deciding the P/NMOS layer assignment.3

After folding the cell, VDD and VSS strips are overlapping, as shown in Fig. 2. The power to VDD on the bottom tier can be delivered down through arrays of MIVs, placed apart from the VSS strip. We may need extra space for these VDD MIVs. Yet, power delivery network design and IR-drop analysis are outside our scope. Also, since VDD and VSS strips are overlapping, they may act as a small decoupling capacitor. However, in the extracted cell internal RC data for our inverter cell, the coupling capacitance (or cap) between VDD and VSS strips is around 0.01fF, which is small compared with other cell internal parasitic capacitances.

The transistor model in Nangate 45nm library is ASU PTM 45nm with bulk silicon technology. In monolithic 3D technology, because of the structure, top tier transistors are similar to silicon-on-insulator (SOI) devices [1]. However, in this study we assume the same transistor model for T-MI and 2D cells, because (1) the original Nangate 45nm library is based on bulk silicon technology, and (2) if we assume both devices and interconnect structures in T-MI are different from 2D, it becomes harder to understand which factor contributes to power reduction, and by how much.

3.2 Comparison with 2D Cells

Our T-MI cells preserve the same transistor sizes as in the original 2D cells.4 The T-MI cell height is 0.84µm, which is 40% smaller than the original 2D cell height (1.4µm). Thus, the cell footprint reduces by 40%. The reasons why it is not 50% are (1) the P/NMOS size mismatch incurs extra space on the NMOS side, and (2) MIVs require extra space on the top tier.

3 In sub-32nm nodes, thanks to advanced channel engineering techniques, the hole/electron mobility is about the same.
4 Our T-MI cell layouts are presented in Fig. 5 in the supplement.

Table 1: Cell internal parasitic RC values. The 3D-c means 3D with top tier silicon modeled as a conductor.

          R (kΩ)                    C (fF)
cell      2D      3D      3D-c      2D      3D      3D-c
INV       0.186   0.107   0.107     0.363   0.368   0.349
NAND2     0.372   0.237   0.237     0.561   0.586   0.547
MUX2      1.133   0.975   0.975     1.823   1.938   1.796
DFF       2.876   3.045   3.045     4.108   5.101   4.740

When designing T-MI cells, care should be taken to reduce cell internal parasitic RC. As shown in Fig. 2(b), the connection from the PMOS on the bottom tier to the NMOS on the top tier needs to go through CTB, MB1, MIV, CT, M1, then CT to diffusion. This 3D path may become longer than the original 2D path and may increase cell internal parasitic RC. Similarly, the path from the PB on the bottom tier to the P on the top tier goes through multiple layers. To reduce cell internal parasitic RC, it is important to minimize the lengths of 3D paths. To achieve shorter 3D paths, we should place MIVs close to the connecting transistors. We also need to utilize direct source/drain (S/D) contacts (see Fig. 5(c) in the supplement). The direct S/D contacts reduce the detour in the 3D paths and unnecessary parasitic RC.

We examine the cell internal parasitic RC of 3D and 2D cells and the impact on timing/power. In previous works [2, 8, 7], the authors assumed that the delay and power of 3D cells are the same as 2D cells and used a 2D timing/power library. In [1], the authors fabricated a transistor-level monolithic 3D IC and measured the top/bottom transistor performances. They reported that the differences between 3D transistors and baseline 2D transistors were negligible. Yet, the delay and power of cells are also affected by cell internal parasitic RC. From Fig. 2(b), we can conjecture that there are coupling capacitances among PB, CTB, MB1, MIV, CT, and M1. Using Mentor Graphics Calibre XRC with EM-simulation-based extraction rules, we extract these capacitance values as well as resistances and transistors from our T-MI cell layout. Then, we generate a SPICE netlist of the cell that consists of transistors and parasitic RC components. Since Calibre XRC is designed for 2D ICs, it can only model one diffusion layer.
Due to this tool limitation, the top tier diffusion layer can be modeled as either dielectric or conductor. Even though the top tier silicon is doped (low resistivity) and the bodies of top tier transistors are tied to the ground, we expect that some amount of electric field may penetrate the top tier silicon and coupling among top and bottom tier objects (M1, MB1, P, PB, etc.) may exist. When we assume that the top tier silicon is dielectric, the coupling between top and bottom tier objects would be overestimated; when it is conductor, the coupling would be underestimated. The real case would be between these two extreme cases.

The total cell internal RC values, extracted from the original 2D cells and our 3D (T-MI) cells, are shown in Table 1. For the 3D case, the results with top tier silicon as both dielectric (3D) and conductor (3D-c) are shown. From the results, we observe the following: (1) For INV, NAND2, and MUX2, the R values of 3D are noticeably smaller than their 2D counterparts, because we reduce the length of poly and metal lines inside the cells, using 3D interconnects. (2) The C values of 3D are comparable with those of 2D: the 2D value is between 3D and 3D-c. (3) For DFF, both R and C of 3D are larger than their 2D counterparts. Due to the complex internal connections, we could not create a 3D cell layout that matches the parasitic RC of 2D. In summary, depending on the cell layout complexity, the internal RC ratio between 3D and 2D may vary. Yet, the delay and power of the cells are more important metrics.

Table 2: Delay and internal power consumption of cells with various input slew and load capacitance conditions. The library uses different input slew settings for DFF. The values in the parentheses mean the percentage ratio of 3D to 2D.

          delay (ps)                  power (fJ)
cell      2D      3D                  2D      3D
fast case: input slew=7.5ps (5ps for DFF), load cap.=0.8fF
INV       17.2    16.9 (98.3%)        0.383   0.351 (91.6%)
NAND2     21.2    20.9 (98.6%)        0.616   0.583 (94.6%)
MUX2      59.8    58.2 (97.3%)        2.113   2.060 (97.5%)
DFF       108.8   113.4 (104.2%)      6.341   6.735 (106.2%)
medium case: input slew=37.5ps (28.1ps for DFF), load cap.=3.2fF
INV       51.1    50.8 (99.4%)        0.362   0.343 (94.8%)
NAND2     56.2    55.9 (99.5%)        0.604   0.581 (96.2%)
MUX2      97.0    95.3 (98.2%)        2.239   2.168 (96.8%)
DFF       142.6   147.0 (103.1%)      6.358   6.756 (106.3%)
slow case: input slew=150ps (112.5ps for DFF), load cap.=12.8fF
INV       188.3   188.0 (99.8%)       0.449   0.431 (96.0%)
NAND2     195.9   195.5 (99.8%)       0.698   0.675 (96.7%)
MUX2      215.1   212.5 (98.8%)       2.555   2.487 (97.3%)
DFF       237.4   243.3 (102.5%)      7.303   7.659 (104.9%)

Table 3: Summary of metal layers. Unit is nm.

level          metal layers           width   spacing   thickness
global         2D:M7-8, 3D:M10-11     400     400       800
intermediate   2D:M4-6, 3D:M7-9       140     140       280
local          2D:M2-3, 3D:M2-6       70      70        140
M1             2D:M1, 3D:MB1,M1       70      65        130

We perform cell timing/power characterizations using commercial software. The SPICE netlists obtained from the previous RC extractions are fed into Cadence Encounter Library Characterizer, which runs SPICE simulations to characterize delay and power of cells under various input slew and load capacitance conditions. The delay/power of 3D and 2D cells are shown in Table 2. The values are obtained from the data tables in the characterized Liberty library. The delay is the cell internal delay including load effect, and the power is the dynamic power consumed within the cell boundary (including short circuit power and power for gate/parasitic capacitances). We observe that for INV, NAND2, and MUX2, the delay and power of 3D are slightly better than 2D, whereas for DFF, they are a little worse. In addition, as the input slew and load capacitance condition changes from the fast to the slow case, the difference between T-MI and 2D becomes smaller. Note that depending on cell design quality and manufacturing technology, the results may change. We believe that with proper cell designs, the delay and power of 3D cells could be similar to their 2D counterparts.
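As a quick consistency check on Table 2, the 3D-to-2D percentage ratios can be recomputed from the raw delay and power values. The sketch below does this for the fast case only; the dictionary layout and function name are illustrative choices, and the numbers are copied from Table 2.

```python
# Fast-case entries from Table 2: (2D, 3D) delay in ps and internal power in fJ.
FAST_CASE = {
    "INV":   {"delay": (17.2, 16.9),   "power": (0.383, 0.351)},
    "NAND2": {"delay": (21.2, 20.9),   "power": (0.616, 0.583)},
    "MUX2":  {"delay": (59.8, 58.2),   "power": (2.113, 2.060)},
    "DFF":   {"delay": (108.8, 113.4), "power": (6.341, 6.735)},
}


def ratio_percent(value_2d, value_3d):
    """Return the 3D value as a percentage of the 2D value."""
    return 100.0 * value_3d / value_2d


for cell, metrics in FAST_CASE.items():
    delay_ratio = ratio_percent(*metrics["delay"])
    power_ratio = ratio_percent(*metrics["power"])
    print(f"{cell:6s} delay {delay_ratio:6.1f}%  power {power_ratio:6.1f}%")
```

Ratios below 100% mean the T-MI cell is faster or consumes less internal power than the 2D cell; only the DFF exceeds 100% on both metrics.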

Table 4: Summary of layout results for 45nm node. The values represent the percentage difference of T-MI over 2D.

                     total      power
circuit  footprint   wirelen.   total     cell      net       leakage
FPU      -41.7%      -26.3%     -14.5%    -9.4%     -19.5%    -11.1%
AES      -42.4%      -23.6%     -10.9%    -7.6%     -13.9%    -9.5%
LDPC     -43.2%      -33.6%     -32.1%    -12.8%    -39.2%    -21.7%
DES      -40.9%      -21.5%     -4.1%     -1.6%     -7.7%     -1.4%
M256     -43.4%      -28.4%     -17.5%    -10.7%    -22.2%    -12.9%

Table 5: Summary of design results in our work and previous works. The [2]-3D means their INTRACEL method with timing driven + IPO, which corresponds to transistor-level monolithic 3D design. The [7]-3D means their 3TM setup.

circuit  design     total wirelength   longest path   total power
name     type       (m)                delay (ns)     (mW)
AES      ours-2D    0.260              0.770          13.69
AES      ours-3D    0.199 (-23.5%)     0.775          12.20 (-10.9%)
AES      [7]-2D     0.271              1.310          13.7
AES      [7]-3D     0.214 (-21.0%)     1.165          12.8 (-6.6%)
LDPC     ours-2D    3.806              2.400          54.79
LDPC     ours-3D    2.528 (-33.6%)     2.388          37.22 (-32.1%)
LDPC     [2]-2D     1.83               2.461          1,554
LDPC     [2]-3D     1.60 (-12.6%)      2.421          1,461 (-6.0%)
DES      ours-2D    0.611              0.976          63.88
DES      ours-3D    0.479 (-21.6%)     0.968          61.24 (-4.1%)
DES      [2]-2D     0.671              1.132          620.2
DES      [2]-3D     0.581 (-13.4%)     0.971          608.2 (-1.9%)
DES      [7]-2D     0.849              1.086          134.9
DES      [7]-3D     0.682 (-19.7%)     0.923          130.7 (-3.1%)

3.3 Monolithic Interconnect Setup
Our T-MI interconnect structure is an extension of the Nangate (2D) 45nm library. As shown in Table 3, we use 8 out of 10 metal layers in the Nangate 45nm library. For T-MI, we make two modifications: We add (1) a new metal layer on the bottom tier (MB1), and (2) three local metal layers on the top tier (M4-6).5 With T-MI cell folding, the cells become 40% smaller than 2D (see Section 3.2). This results in about 40-50% smaller core footprint area. As a result, the cell pin density in T-MI becomes about 1.7-2X larger than in 2D, leading to a higher routing demand per unit area (or routing tile). To satisfy the high routing demand, we need to increase the routing capacity (#routing tracks per routing tile). The most area-efficient way is to add local metal layers, because of the small pitch. We found that adding 3 local metal layers increases routing capacity sufficiently.

Due to manufacturing issues (low thermal budget), in [2] the authors suggest tungsten is suitable for bottom tier metal. However, in this work we assume copper, because a copper-based manufacturing process may be developed. Besides, MB1 is mostly used for short interconnects such as within cells or short nets.6 In our benchmark circuit M256 (see Table 12), the wirelength of MB1 (for net routing) is only 0.3% of the total wirelength. Thus, the impact of MB1 material on the timing and power of a whole circuit is minimal. When tungsten is used, IR-drop on the VDD strips could be an issue, which is outside our scope.

5 Our 2D and T-MI metal layers are shown in Fig. 9 in the supplement.
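The pin density figure quoted in Section 3.3 follows from simple arithmetic: if the same cell pins occupy a footprint that is 40-50% smaller, the pins per unit area grow by 1/(1 - reduction). The sketch below works this out under the assumption that the total pin count is unchanged between 2D and T-MI; the function name is hypothetical.

```python
def pin_density_gain(footprint_reduction):
    """Pin density multiplier when the footprint shrinks by the given fraction.

    Assumes the total number of cell pins stays the same (an assumption of
    this sketch), so density scales with the inverse of the remaining area.
    """
    if not 0.0 <= footprint_reduction < 1.0:
        raise ValueError("footprint_reduction must be in [0, 1)")
    return 1.0 / (1.0 - footprint_reduction)


for reduction in (0.40, 0.50):
    gain = pin_density_gain(reduction)
    print(f"{reduction:.0%} smaller footprint -> {gain:.2f}X pin density")
```

The two endpoints give roughly 1.67X and 2X, consistent with the "about 1.7-2X" range stated in the text.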

Figure 3: Snapshots of routing results for LDPC and DES. (a) LDPC: footprint = 457.83x456.4um, wirelength = 3.806m. (b) DES: footprint = 331.88x330.4um, wirelength = 0.611m.

3.4 Monolithic 3D Wire Load Model
In T-MI designs, the wires are about 20-30% shorter than in 2D designs (see Table 4). We provide this information to the synthesis step by modifying wire load models (WLM). A WLM defines the statistical average of unit length resistance, capacitance, area of wires, as well as the fanout vs. wirelength tables. For each net, according to the fanout, the synthesis engine finds the corresponding wirelength and the capacitance/resistance/area from the WLM. We reflect the reduced wirelength of T-MI designs in the fanout vs. wirelength tables. With these WLMs, the synthesized netlists for 2D and T-MI are different.7
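The wire load model idea of Section 3.4 can be sketched as a fanout-indexed lookup table whose wirelengths are shortened for T-MI. The fanout-to-wirelength entries and the 25% reduction below are hypothetical placeholders, not the paper's WLM data, and the function names are illustrative; the unit-length R and C are the 45nm M2 values quoted in Section 5.

```python
# Hypothetical 2D wire load model: fanout -> average wirelength in um.
WLM_2D = {1: 10.0, 2: 18.0, 3: 27.0, 4: 35.0}

RES_PER_UM = 3.57    # ohm/um, 45nm M2 value quoted in Section 5
CAP_PER_UM = 0.106   # fF/um, 45nm M2 value quoted in Section 5


def derive_tmi_wlm(wlm_2d, wirelength_reduction):
    """Shorten every table entry by the given fraction (e.g. 0.25 for 25%)."""
    return {fanout: length * (1.0 - wirelength_reduction)
            for fanout, length in wlm_2d.items()}


def estimate_net_rc(wlm, fanout):
    """Look up the wirelength for a fanout and return (R in ohm, C in fF)."""
    length_um = wlm[fanout]
    return length_um * RES_PER_UM, length_um * CAP_PER_UM


WLM_TMI = derive_tmi_wlm(WLM_2D, 0.25)

r_2d, c_2d = estimate_net_rc(WLM_2D, 2)
r_tmi, c_tmi = estimate_net_rc(WLM_TMI, 2)
print(f"fanout 2, 2D:   R = {r_2d:.1f} ohm, C = {c_2d:.2f} fF")
print(f"fanout 2, T-MI: R = {r_tmi:.1f} ohm, C = {c_tmi:.2f} fF")
```

Because the synthesis engine sees smaller estimated wire loads with the T-MI table, it can pick smaller cells or fewer buffers, which is one reason the 2D and T-MI netlists differ.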

4. 45NM RESULTS

4.1 Design Analysis Results
The layout simulation results for 45nm node are summarized in Table 4.8 With T-MI, the footprint reduces by 40.9-43.4%, which is larger than the cell footprint reduction rate, 40%. With T-MI, timing is better because of shorter wirelengths, and the optimizer may downsize cells and use fewer buffers while still meeting the target clock period. Thus, the footprint of the whole T-MI design can be reduced further than the individual cell footprint reduction rate. With T-MI, total wirelength reduces by 21.5-33.6%. Depending on the circuit characteristics, the wirelength reduction rate varies. We observe that a circuit with a larger wirelength reduction rate tends to show a larger power reduction rate. All designs met the timing. The power reduction was the largest in LDPC, 32.1%, whereas in DES, it was only 4.1%. In LDPC, the net power is much larger than the cell power, thus a large net power reduction with T-MI leads to a large total power reduction. We also observe that with T-MI, not only net power but also cell power reduces; with better timing, cells are downsized and fewer buffers are used, which reduces cell power.

6 The impact of MB1 on optimization quality is discussed in Section S5.
7 Our WLM is further presented in Section S2. The impact of T-MI WLM on design quality is presented in Section S7.
8 Our detailed layout results for 45nm node are presented in Section S6. GDSII layouts of our AES design are shown in Fig. 8 in the supplement.

4.2 Comparison with Existing Works
Our results and the results from previous works ([2][7]) are summarized in Table 5.9 All three works use Nangate 45nm library as baseline 2D. The footprint reduction rates of 3D over 2D in this work, [2], and [7] are about 42.3%, 30%, and 40%, respectively. This footprint reduction rate is the main factor in the overall design quality of 3D designs, because the timing and power reduction in monolithic 3D designs comes from reduced footprint and wirelength. Our results show larger wirelength reduction than these previous works. In [2, 7], the authors intentionally chose small target clock periods, thus timing was not closed. Note that power values in different works vary widely. For AES and LDPC, our results show a larger power reduction rate than previous works. Interestingly, in all three works, the power reduction rates for the DES circuit are low (only 2-4%).

9 Note that the purpose of this study is not to directly compare the design quality of ours to the previous works; due to different setup, design, and analysis flow, it is not possible to provide fair comparisons.

Figure 4: Power reduction rate (T-MI over 2D) under various target clock periods. (a) AES with slow (1.0ns), medium (0.8ns), and fast (0.72ns) target clock periods. (b) M256 with slow (2.6ns), medium (2.4ns), and fast (2.0ns) target clock periods. Vertical axis: reduction (%). Series: total power, cell power, net power, leakage.

Table 6: Comparison of our 45nm and 7nm node setup.

                                 45nm      7nm
transistor                       planar    multi-gate
VDD (V)                          1.1       0.7
transistor length (drawn, nm)    50        11
transistor width                 varies    fixed
back-end-of-line ILD k           2.5       2.2
M2 width (nm)                    70        10.8
MIV diameter (nm)                70        10.8
ILD thickness (nm)               110       50
standard cell height (um)        1.4       0.218

Table 7: Summary of layout results for 7nm node.

                     total      power
circuit  footprint   wirelen.   total     cell      net       leakage
FPU      -47.0%      -34.2%     -37.3%    -32.4%    -44.4%    -21.0%
AES      -62.0%      -47.8%     -19.8%    -10.3%    -28.4%    -28.5%
LDPC     -42.9%      -27.7%     -19.1%    -3.7%     -26.6%    -3.5%
DES      -40.8%      -21.9%     -3.4%     -1.3%     -7.3%     -3.0%
M256     -44.6%      -23.0%     -17.8%    -14.1%    -23.0%    -2.4%

4.3 Circuit Characteristics Study
As shown in Table 4, LDPC and DES show very different power reduction rates with T-MI. By contrasting these two designs, we explain for what kind of circuits T-MI provides a large power benefit. With T-MI, the buffer count reduces by 48.6% (in LDPC) vs. 3.2% (in DES), total wirelength reduces by 33.6% vs. 21.5%, total power reduces by 32.1% vs. 4.1%, cell power reduces by 12.8% vs. 1.6%, and net power reduces by 39.2% vs. 7.7%. Compared with LDPC, the buffer count reduction for DES is very small, which leads to a very small cell power reduction. Although the wirelength reduction in DES is not so small, the net power reduction rate is significantly smaller than in LDPC. The net capacitance/power consists of wire and (cell input) pin parts.10 For most nets in DES, wires are very short. This difference is also observed in Fig. 3. In the DES layout, there are many small regions where cells are tightly connected inside but not so much to the outside. For these short nets, pin capacitances dominate wire capacitances, thus reducing wirelength does not reduce net power as much. Although these two circuits are similar in size (#cells, nets) and average fanout, because of the inherent difference in circuit characteristics, the power benefit of T-MI differs widely.
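The pin-versus-wire argument of Section 4.3 can be illustrated with a small calculation. The sketch assumes that net capacitance is the sum of wire and pin capacitance, that wire capacitance scales with wirelength, and that net power is proportional to net capacitance at fixed voltage, frequency, and activity. The capacitance values and the function name are hypothetical and chosen only to contrast a wire-dominated net with a pin-dominated one.

```python
def net_cap_reduction(wire_cap_ff, pin_cap_ff, wirelength_reduction):
    """Fractional reduction of total net capacitance when only the wire part shrinks.

    Assumes wire capacitance is proportional to wirelength and pin capacitance
    is unaffected by shorter wires (assumptions of this sketch).
    """
    total = wire_cap_ff + pin_cap_ff
    saved = wire_cap_ff * wirelength_reduction
    return saved / total


# Hypothetical nets with the same total capacitance (10 fF) and the same 30% shorter wire.
wire_dominated = net_cap_reduction(wire_cap_ff=8.0, pin_cap_ff=2.0, wirelength_reduction=0.30)
pin_dominated = net_cap_reduction(wire_cap_ff=2.0, pin_cap_ff=8.0, wirelength_reduction=0.30)

print(f"wire-dominated net: {wire_dominated:.0%} lower net capacitance")
print(f"pin-dominated net:  {pin_dominated:.0%} lower net capacitance")
```

With identical wirelength savings, the pin-dominated net gains far less, which mirrors the behavior reported for DES relative to LDPC.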

4.4 Impact of Target Clock Period
The power benefit of T-MI also depends on the target clock period. For AES and M256, we vary the target clock period and perform full designs, from synthesis to layout optimizations. The power reduction rate is shown in Fig. 4. The trend is clear: when the target clock is faster, the power benefit of T-MI becomes larger. At faster clock speeds, timing is harder to meet for the 2D design than for T-MI because of its longer wires. The optimization engine uses more buffers and larger cells, leading to a steep increase in cell power. Thus, the cell power reduction rate increases noticeably as the clock becomes faster. With faster clock speeds, core footprint and wirelengths also become larger, leading to a larger net power reduction rate with T-MI.

5. 7NM TECHNOLOGY SETUP
Another major aspect that affects the power benefit of T-MI is the technology node. As the technology advances, devices and wires shrink at different rates, affecting timing/power of the circuit and changing the power benefit of T-MI. According to the latest ITRS 2011 roadmap [5], the 7nm node is near the end of the roadmap.11 In the ITRS projection for the 7nm node, devices become dramatically more efficient; wires, however, do not. The copper effective resistivity in 7nm is 3.7X larger than in 45nm, due to size effects (edge scattering, etc.). We now predict how the power benefit of T-MI changes in the future 7nm node.

The comparison between our 45nm and 7nm setup is shown in Table 6. Since there is no real 7nm node data available today, we scale down our 45nm library data as well as use data from the ITRS projection. As a transistor model, we use the ASU PTM-MG HP 7nm model [11]. The interconnect dimensions are scaled down to (7/45)X = 0.156X, and the interconnect RC libraries are rebuilt, with a lower dielectric k (=2.2). We scale down the physical shapes of cells to 0.156X. Based on preliminary SPICE simulations,12 we also scale down cell input capacitance to 0.179X, cell delay to 0.471X, output slew to 0.420X, cell power to 0.084X, and cell leakage power to 0.678X. We apply these scaling factors to the 45nm Liberty library and create our 7nm Liberty library. Since the transistors in the 7nm node are not planar but multi-gate (e.g. FinFET), the coupling between top/bottom tier transistors would be much smaller. Thus, we can reduce ILD thickness to keep the aspect ratio of MIV reasonable.

The interconnect RC characteristics for 45nm and 7nm are obtained from the capTable built with Cadence Encounter, which runs EM simulations. The unit length resistances (Ω/µm) of 45nm and 7nm nodes for a local metal layer (M2) are 3.57 and 638, respectively, whereas for a global metal layer (M8), they are 0.188 and 2.650, respectively. The unit length capacitances (fF/µm) of 45nm and 7nm nodes for M2 are 0.106 and 0.153, respectively, whereas for M8, they are 0.100 and 0.095, respectively. We observe that in the 7nm node, the local metal layers become very resistive, due to the larger copper effective resistivity and the smaller metal width/thickness. Yet, in the 7nm node, the wirelengths of the nets on local metal layers become shorter, thus the resistances of the net wires do not increase as dramatically. The capacitance per unit length increases for local metal layers, even though the dielectric k becomes smaller.
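The scaling numbers in Section 5 can be checked with a few lines of arithmetic. The sketch below computes the linear shrink factor and, for the quoted M2 and M8 unit-length values, the 7nm-to-45nm ratio of a wire's total resistance and capacitance. It assumes, purely for illustration, that a given wire's length shrinks by the same 0.156X factor as the layout; the names are hypothetical.

```python
SCALE = 7 / 45  # linear shrink applied to interconnect dimensions and cell shapes

# Unit-length values quoted in Section 5 as (45nm, 7nm).
RES_PER_UM = {"M2": (3.57, 638.0), "M8": (0.188, 2.650)}   # ohm/um
CAP_PER_UM = {"M2": (0.106, 0.153), "M8": (0.100, 0.095)}  # fF/um


def scaled_wire_ratio(per_um_45, per_um_7, length_scale=SCALE):
    """7nm/45nm ratio of a wire's total R or C if its length shrinks by length_scale."""
    return (per_um_7 / per_um_45) * length_scale


print(f"linear scale factor: {SCALE:.3f}")
for layer in ("M2", "M8"):
    r45, r7 = RES_PER_UM[layer]
    c45, c7 = CAP_PER_UM[layer]
    print(f"{layer}: per-um R x{r7 / r45:.1f}, "
          f"scaled-wire R x{scaled_wire_ratio(r45, r7):.1f}, "
          f"scaled-wire C x{scaled_wire_ratio(c45, c7):.2f}")
```

Under this assumption the M2 per-length resistance grows by roughly 179X, but a proportionally shorter wire sees a much smaller increase in total resistance, which matches the qualitative observation in the text.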

6. 7NM RESULTS
The layout simulation results for the 7nm node are summarized in Table 7.13 Compared with the results in Table 4, we see that the footprint reduction rate is larger, especially for AES, where 62% footprint reduction was achieved. In the AES case, the target clock period is very small, 0.27ns. For the 2D design, Encounter performed high-effort optimization techniques to meet the timing, while for the T-MI design it did not. As a result, the buffer count of the T-MI design is 84.5% smaller. We also observed similar optimization differences for FPU. Wirelength reduction is 21.9-47.8%. In the FPU case, total power reduction is the largest, 37.3%. For DES, the power reduction is the smallest, 3.4%.

For LDPC, the power reduction rate in the 7nm node is smaller than in 45nm. In LDPC, there are many long wires across the core area. Considering the unit length metal resistance, the router prefers intermediate/global layers over local metal layers for long nets. However, in T-MI we added 3 metal layers to only the local layers; on intermediate/global layers, T-MI suffers more routing congestion than 2D.14 Thus, in the 7nm node, the extremely high resistance on local layers (see Section 5) reduces the power reduction rate, because of worse timing (the local metal resistance was not so high in the 45nm node). In summary, depending on circuit characteristics, in the 7nm node, the power benefit may become larger or smaller.

10 We provide wire vs. pin power breakdown in Section S8.
11 A summary of 45nm and 7nm node device and interconnect characteristics from ITRS projections is shown in Table 10 in the supplement.
12 Our 7nm cell characterizations are presented in Section S3.
13 Our detailed layout results for 7nm node are presented in Section S6.

Table 8: Impact of lower cell pin cap in 7nm node. The 'p20/40/60' mean 20/40/60% reduced pin cap cases.

design        total WL (mm)     total power (mW)   cell (mW)   net (mW)   leak (mW)
DES-2D        81.2              15.11              9.49        5.03       0.60
DES-3D        63.5 (-21.9%)     14.60 (-3.4%)      9.36        4.67       0.58
DES-2D-p20    81.3              14.38              9.48        4.30       0.60
DES-3D-p20    63.5 (-21.9%)     14.12 (-1.8%)      9.42        4.09       0.60
DES-2D-p40    81.2              13.54              9.39        3.56       0.59
DES-3D-p40    63.2 (-21.8%)     13.17 (-2.7%)      9.31        3.27       0.59
DES-2D-p60    81.3              12.74              9.35        2.81       0.59
DES-3D-p60    63.5 (-21.9%)     12.45 (-2.3%)      9.32        2.55       0.59

6.1 Impact of Pin Cap Reduction Rate

As mentioned in Section 5, when we compare the 7nm node with the 45nm node, the cell pin cap reduces by 82.1%, which is smaller than the wirelength reduction rate of about 85% (compare the total wirelength of the designs in Tables 13 and 14). Thus, in the 7nm node, the (pin cap)/(wire cap) ratio may become larger than in the 45nm node. Then, the wire cap reduction with T-MI reduces the total net cap by a smaller percentage in the 7nm node. However, depending on the materials and manufacturing technology, the pin cap of cells may reduce further than our projection. Thus, we explore how the power benefit of T-MI changes when the pin cap reduces more. For this study, we choose DES as the test circuit, because it showed the largest (pin cap)/(wire cap) ratio among our circuits; we therefore expect it to be the most sensitive to the pin cap setting. Our simulation results are summarized in Table 8. Surprisingly, the power benefit of T-MI does not increase with a larger pin cap reduction rate. As the pin cap reduces, the net power reduces. The cell power then becomes the more dominant factor, because it does not decrease much with smaller pin caps. Thus, the power reduction rate with T-MI becomes smaller.
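The trend in Table 8 can be checked with simple arithmetic. The following Python sketch copies the power values from Table 8 and recomputes, for each pin cap case, the total power reduction of T-MI and the share of cell power in the 2D total. The dictionary layout and function names are illustrative choices only and are not part of our design flow.

```python
# Power values copied from Table 8 (7nm node, DES), in mW.
# Each tuple is (total, cell, net, leakage).
TABLE8 = {
    "base": {"2D": (15.11, 9.49, 5.03, 0.60), "3D": (14.60, 9.36, 4.67, 0.58)},
    "p20":  {"2D": (14.38, 9.48, 4.30, 0.60), "3D": (14.12, 9.42, 4.09, 0.60)},
    "p40":  {"2D": (13.54, 9.39, 3.56, 0.59), "3D": (13.17, 9.31, 3.27, 0.59)},
    "p60":  {"2D": (12.74, 9.35, 2.81, 0.59), "3D": (12.45, 9.32, 2.55, 0.59)},
}


def reduction_pct(p2d, p3d):
    """Percentage by which the 3D value is lower than the 2D value."""
    return 100.0 * (p2d - p3d) / p2d


def summarize(table):
    """Per pin cap case: total power reduction and 2D cell power share."""
    rows = {}
    for case, designs in table.items():
        total_2d, cell_2d, net_2d, _ = designs["2D"]
        total_3d, _, net_3d, _ = designs["3D"]
        rows[case] = {
            "total_reduction_pct": round(reduction_pct(total_2d, total_3d), 1),
            "cell_share_2d_pct": round(100.0 * cell_2d / total_2d, 1),
            "net_saving_mw": round(net_2d - net_3d, 2),
        }
    return rows


if __name__ == "__main__":
    for case, row in summarize(TABLE8).items():
        print(case, row)
```

In this sketch the cell power share of the 2D design grows as the pin cap shrinks, while the absolute net power saving from T-MI gets smaller, which is consistent with the explanation above.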

6.2 Impact of Lower Metal Resistivity

As discussed in Section 5, in the 7nm node, the effective resistivity of copper becomes very high. However, in the future, thanks to better interconnect materials and manufacturing processes, the resistivity of interconnects may be lower than expected. In this scenario, we may expect the timing benefit of 3D to become smaller, because the nets are longer in 2D designs and the lower resistivity would reduce the net delay in 2D more than in 3D. As a case study, we reduce the resistivity of the local and intermediate layers by 50% (see footnote 15). We choose M256 as the test circuit, because it is the largest of our benchmark circuits and is more affected by net delay changes. The impact of the reduced metal resistivity is shown in Table 9. All designs met timing. With lower resistivity, the power consumption reduces, because with better timing smaller cells are used. However, there is not much difference in the wirelength and total power reduction percentages. The cell and net power reduction rates went down slightly, whereas the leakage power reduction rate went up. Thus, we conclude that lower metal resistivity does not necessarily lead to smaller power reductions in monolithic 3D ICs.

Footnote 14: The impact of a different metal layer setup is discussed in Section S9.
Footnote 15: The resistivity of global metal layers is not changed, because the wires on the global layers are large and the resistivity is not too high.

Table 9: Impact of the lower metal resistivity in 7nm node for M256. The '-m' suffix means reduced metal resistivity.

design       total WL (mm)   total power (mW)   cell (mW)   net (mW)   leak (mW)
M256-2D      795             30.55              13.26       15.21      2.07
M256-3D      612 (-23.0%)    25.12 (-17.8%)     11.39       11.71      2.02
M256-2D-m    795             27.57              12.10       13.67      1.80
M256-3D-m    613 (-22.9%)    22.67 (-17.8%)     10.42       10.69      1.57
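The per-component statement above (cell and net reduction rates slightly lower, leakage reduction rate higher) can be recomputed from Table 9. The Python sketch below is only a convenience check on the published numbers; the data structure and function name are illustrative assumptions.

```python
# Power values copied from Table 9 (7nm node, M256), in mW.
TABLE9 = {
    "M256-2D":   {"total": 30.55, "cell": 13.26, "net": 15.21, "leak": 2.07},
    "M256-3D":   {"total": 25.12, "cell": 11.39, "net": 11.71, "leak": 2.02},
    "M256-2D-m": {"total": 27.57, "cell": 12.10, "net": 13.67, "leak": 1.80},
    "M256-3D-m": {"total": 22.67, "cell": 10.42, "net": 10.69, "leak": 1.57},
}


def reduction_rates(design_2d, design_3d):
    """Percentage reduction of each power component going from 2D to 3D."""
    return {
        key: round(100.0 * (design_2d[key] - design_3d[key]) / design_2d[key], 1)
        for key in design_2d
    }


if __name__ == "__main__":
    print("original resistivity:", reduction_rates(TABLE9["M256-2D"], TABLE9["M256-3D"]))
    print("reduced resistivity: ", reduction_rates(TABLE9["M256-2D-m"], TABLE9["M256-3D-m"]))
```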

7. CONCLUSIONS

In transistor-level monolithic 3D ICs, reduced footprints lead to shorter wirelength, better performance, and lower power consumption. With carefully designed T-MI 3D cells, we performed layout simulations for the benchmark circuits and demonstrated total power reductions of up to 32.1% in the 45nm node and 37.3% in the 7nm node. In addition, we discussed other factors that affect the power benefit of T-MI, such as circuit characteristics and target clock periods. We expect to see larger power benefits with T-MI in future technology nodes, where wires become a serious problem.

8. ACKNOWLEDGMENTS

This material is based upon the work supported by Intel, Qualcomm, and the CISS funded by the MEST Global Frontier Project of the South Korean Government (CISS-2-3).

9. REFERENCES

[1] P. Batude et al. Advances in 3D CMOS Sequential Integration. In Proc. IEEE Int. Electron Devices Meeting, pages 1–4, 2009.
[2] S. Bobba et al. CELONCEL: Effective Design Technique for 3-D Monolithic Integration targeting High Performance Integrated Circuits. In Proc. Asia and South Pacific Design Automation Conf., pages 336–343, 2011.
[3] K. D. Boese, A. B. Kahng, and S. Mantik. On the Relevance of Wire Load Models. In Proc. Int. Workshop on System-Level Interconnect Prediction, pages 91–98, 2001.
[4] N. Golshani et al. Monolithic 3D Integration of SRAM and Image Sensor Using Two Layers of Single Grain Silicon. In Proc. IEEE Int. Conf. on 3D System Integration, pages 1–4, 2010.
[5] International Technology Roadmap for Semiconductors. ITRS 2011 Edition.
[6] S.-M. Jung et al. The Revolutionary and Truly 3-Dimensional 25F² SRAM Technology with the smallest S³ (Stacked Single-crystal Si) Cell, 0.16µm², and SSTFT (Stacked Single-crystal Thin Film Transistor) for Ultra High Density SRAM. In Proc. Symposium on VLSI Technology, pages 228–229, 2004.
[7] Y.-J. Lee, P. Morrow, and S. K. Lim. Ultra High Density Logic Designs Using Transistor-Level Monolithic 3D Integration. In Proc. IEEE Int. Conf. on Computer-Aided Design, pages 539–546, 2012.
[8] C. Liu and S. K. Lim. A Design Tradeoff Study with Monolithic 3D Integration. In Proc. Int. Symp. on Quality Electronic Design, pages 531–538, 2012.
[9] T. Naito et al. World’s first monolithic 3D-FPGA with TFT SRAM over 90nm 9 layer Cu CMOS. In Proc. Symposium on VLSI Technology, pages 219–220, 2010.
[10] Nangate. Nangate 45nm Open Cell Library.
[11] S. Sinha et al. Exploring Sub-20nm FinFET Design with Predictive Technology Models. In Proc. ACM Design Automation Conf., pages 283–288, 2012.

Table 10: Summary of the ITRS projection on high performance logic devices and interconnects. The 45nm and the 7nm projection data are from ITRS 2008 and 2011, respectively. The copper effective resistivity and unit length capacitance are for local/intermediate metal layers.

node                                  45nm      7nm
year                                  2010      2025
device type                           bulk Si   multi-gate
NMOS drive current (µA/µm)            1,210     2,228
Cu effective resistivity (µΩ·cm)      4.08      15.02
Cu unit length capacitance (fF/µm)    0.19      0.15

Table 11: The 7nm cell characterization results. The cell delay, output slew, and cell power are obtained by averaging the rise/fall transition cases, when input slew is 19ps and load capacitance is 3.2fF.

[Figure 5 image: panels (a) INV, (b) NAND2, (c) MUX2, (d) DFF; labels: top tier, bot tier, MIV, NMOS, PMOS, direct S/D contact]

Figure 5: GDSII layouts of our T-MI cells. The S/D means source/drain. The p/nwell and implants are not shown for simplicity.

[Figure 6 plot: x-axis fanout (0 to 20), y-axis wirelength (um, 0 to 400); curves for FPU, AES, LDPC, DES, and M256]

Figure 6: Fanout vs. wirelength in 2D wire load models.

SUPPLEMENT

S1 T-MI Cell Layouts

We created a total of 66 T-MI cells. Some of our T-MI cells are shown in Fig. 5. The internal connections of the DFF cell are rather complex. We found that direct S/D contact helps reduce the cell internal parasitic RC of some cells. Note that we preserve the transistor locations of the baseline 2D cells; further reductions in cell internal parasitic RC may be possible if transistors are allowed to be relocated within a cell or if the cells are completely redesigned.

S2 Wire Load Model for Monolithic 3D

The fanout vs. wirelength trends for our benchmark circuits are shown in Fig. 6. From preliminary layout simulations, we extract a WLM for T-MI as well as for 2D for each circuit. Note that the curves differ from circuit to circuit, which is related to the circuit characteristics discussed in Section 4.3.
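A wire load model of the kind discussed above can be viewed as a lookup from fanout to estimated wirelength. The sketch below is a minimal illustration in Python, assuming a piecewise-linear table with a slope used for extrapolation beyond the last entry. The class name, the table points, the slope, and the 0.75 scaling used for the T-MI variant are all hypothetical values chosen for illustration; they are not extracted from our layouts.

```python
from bisect import bisect_left


class WireLoadModel:
    """Piecewise-linear fanout-to-wirelength lookup (illustrative only)."""

    def __init__(self, fanout_length, slope):
        # fanout_length: list of (fanout, wirelength_um) pairs
        # slope: extra wirelength (um) per fanout beyond the last table entry
        self.points = sorted(fanout_length)
        self.slope = slope

    def length(self, fanout):
        fanouts = [f for f, _ in self.points]
        if fanout <= fanouts[0]:
            return self.points[0][1]
        if fanout >= fanouts[-1]:
            # Extrapolate linearly past the end of the table.
            return self.points[-1][1] + self.slope * (fanout - fanouts[-1])
        # Interpolate between the two neighbouring table entries.
        i = bisect_left(fanouts, fanout)
        (f0, l0), (f1, l1) = self.points[i - 1], self.points[i]
        return l0 + (l1 - l0) * (fanout - f0) / (f1 - f0)


# Hypothetical 2D table, and a T-MI table with every length scaled by 0.75.
wlm_2d = WireLoadModel([(1, 10.0), (2, 20.0), (4, 40.0)], slope=10.0)
wlm_3d = WireLoadModel([(1, 7.5), (2, 15.0), (4, 30.0)], slope=7.5)

if __name__ == "__main__":
    for fanout in (1, 3, 6):
        print(fanout, wlm_2d.length(fanout), wlm_3d.length(fanout))
```

Using a separate table per circuit and per design style, as in this sketch, is what allows synthesis to see shorter estimated nets for T-MI than for 2D.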

S3 Scaling Factors of 7nm Standard Cells

To obtain the scaling trends of 7nm cell characteristics, we first create SPICE netlists of 7nm cells. Starting from the SPICE netlists of the Nangate 45nm cells, we replace the transistor models with the ASU PTM-MG HP 7nm model [11]. The transistor fin height, width, and length of the ASU model are 18, 7, and 11nm, respectively. We assume the number of fins per MOS transistor is 1, because the original cells are of X1 strength; the results may change if we use multiple fins. We also scale the cell internal parasitic R and C components in the original SPICE netlists by 7.7X and 0.156X, respectively, because:

(1) The resistance of a metal interconnect is R = ρ·L/(W·t) = ρs·L/W. The sheet resistance (ρs = ρ/t) becomes 7.7X, because the M1 thickness (t) becomes 0.156X and we increase the effective resistivity (ρ) by 20% to account for size effects and barrier thickness. Both the length (L) and width (W) of the cell internal interconnects become 0.156X. Thus, the R components become 7.7X of the original.
(2) The unit length capacitance does not change much, and the length of the cell internal interconnects becomes 0.156X. Thus, the C components become 0.156X of the original.

With the SPICE netlists of our 7nm cells, we run Cadence Encounter Library Characterizer (ELC) to obtain the Liberty timing and power library. The ELC runs SPICE simulations for various input slew and load capacitance conditions and builds a library with timing and power data. The characterization results are shown in Table 11. For each cell, we calculate the scaling ratio, then average the ratios over all cells to obtain the final scaling trend.

Table 11 values:

                    INV               NAND2             DFF
                    45nm     7nm      45nm     7nm      45nm      7nm
input cap (fF)      0.463    0.125    0.523    0.082    0.877     0.097
cell delay (ps)     44.27    25.56    49.24    30.50    124.70    27.07
output slew (ps)    31.35    15.13    35.89    19.29    34.55     8.25
cell power (fJ)     0.446    0.020    0.680    0.020    3.425     0.604
leakage (pW)        2,844    2,583    4,962    2,906    42,965    23,241
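The two scaling arguments above reduce to a short calculation, sketched here in Python. The function names are illustrative. The second part averages per-cell ratios using only the three cells listed in Table 11, whereas our final scaling trend averages over all cells, so that result is an illustration of the procedure rather than our reported scaling factor.

```python
def parasitic_scale(dim_scale, resistivity_scale):
    """Scaling of cell-internal wire R and C.

    Assumes length, width, and thickness all shrink by dim_scale and that
    the unit-length capacitance stays unchanged.
    """
    sheet_scale = resistivity_scale / dim_scale       # rho_s = rho / t
    r_scale = sheet_scale * (dim_scale / dim_scale)   # R = rho_s * L / W
    c_scale = dim_scale                               # C scales with L only
    return r_scale, c_scale


# Input capacitance (fF) from Table 11 as (45nm, 7nm) pairs.
INPUT_CAP_FF = {
    "INV": (0.463, 0.125),
    "NAND2": (0.523, 0.082),
    "DFF": (0.877, 0.097),
}


def average_scaling_ratio(table):
    """Average of per-cell 7nm/45nm ratios."""
    ratios = [v7 / v45 for v45, v7 in table.values()]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    r_scale, c_scale = parasitic_scale(dim_scale=0.156, resistivity_scale=1.2)
    print(f"R scale: {r_scale:.2f}X, C scale: {c_scale:.3f}X")
    print(f"average input cap ratio (3 cells): {average_scaling_ratio(INPUT_CAP_FF):.3f}")
```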

S4 Benchmark Circuits and Synthesis Results

Our benchmark circuits and synthesis results for the 45nm and 7nm nodes are summarized in Table 12. The FPU is a double precision floating point unit. The AES and the DES are encryption engines. The LDPC is a low-density parity-check engine for the IEEE 802.3an standard. The M256 is a simple partial-sum-add-based 256-bit integer multiplier. The circuits vary in size. Note that the target clock periods for the 7nm node are smaller than those for the 45nm node. We use Synopsys Design Compiler (ver. F-2011.09) for synthesis. The synthesis results shown are for the 2D designs. All synthesized designs (2D and T-MI, in 45nm and 7nm) met the target clock periods.

S5 Concerns in Layout Optimizations

In the post-route optimization step, the Encounter optimization engine tries to preserve routed wires. In T-MI designs, the MB1 wires and the routing MIVs block cell placement, so the optimizer cannot place cells at (nor move cells to) such locations. For example, in Fig. 7, the white spaces (dotted boxes) cannot be used for optimizations such as buffering or gate sizing. To see whether these MIV/MB1 blockages degrade design quality, we perform a layout simulation. For this case study, we use AES as the target circuit, because it showed a high placement utilization with many densely packed placement regions. From the layout simulations, we observe negligible differences in design quality in terms of wirelength (+0.1%), timing (WNS = +25ps in the original vs. +21ps without MB1 and MIV), and total power (-0.1%). Thus, we conclude that under our settings (placement, routing, optimization options, final utilization, etc.), the routing on MB1 and MIV does not degrade design quality noticeably. Note that the utilization of the above AES design is around 80%; the MIV/MB1 blockages may cause problems when utilization is very high. In general, however, it is customary not to exceed 80% utilization for various reasons (placement and routing quality, optimization quality, decap area, etc.).

Table 12: Benchmark circuits and synthesis results.

                            FPU       AES       LDPC      DES       M256
45nm node
target clock period (ns)    1.8       0.8       2.4       1.0       2.4
#cells                      9,694     13,891    38,289    51,162    202,877
cell area (µm²)             19,123    16,756    60,590    85,526    293,636
#nets                       11,345    14,218    44,153    54,724    222,569
average fanout              2.35      2.40      2.38      2.33      2.23
7nm node
target clock period (ns)    0.72      0.27      0.9       0.3       1.0
#cells                      11,378    12,541    37,322    50,833    191,543
cell area (µm²)             447.1     362.3     1456.4    2061.3    6788.8
#nets                       12,484    12,811    43,183    54,426    209,545
average fanout              2.44      2.57      2.41      2.33      2.30

Table 15: Layout results with/without our T-MI WLMs. The '-n' suffix means without our T-MI WLM.

design       total WL (mm)      WNS (ps)   total power (mW)
FPU-3D       149.1              +4         7.22
FPU-3D-n     152.0 (+1.9%)      +11        7.20 (-0.3%)
AES-3D       198.8              +25        12.20
AES-3D-n     199.0 (+0.1%)      +21        12.19 (-0.1%)
LDPC-3D      2527.8             +12        37.22
LDPC-3D-n    2782.2 (+10.1%)    +16        40.99 (+10.1%)
DES-3D       479.1              +32        61.24
DES-3D-n     481.7 (+0.5%)      +29        61.79 (+0.9%)
M256-3D      4760.2             0          160.5
M256-3D-n    5020.6 (+5.5%)     +3         166.8 (+3.9%)

Table 16: Wire vs. pin capacitance breakdown of LDPC and DES in 45nm node. The values are for the entire circuit.

design     wire cap. (pF)   pin cap. (pF)   wire power (mW)   pin power (mW)
LDPC-2D    558.0            134.4           30.73             9.04
LDPC-3D    310.3            123.6           15.88             8.32
DES-2D     64.4             127.4           8.88              17.80
DES-3D     50.1             126.6           6.87              17.76

[Figure 7 image: labels: VDD/VSS, MB1, M1, MIV, "cells cannot be placed"]

Figure 7: A zoom-in shot of T-MI design for AES. Skyblue rectangles are standard cells. For clarity, only MB1, M1, and MIV layers are shown.

S6 Detailed Layout Results

The detailed layout simulation results for the 45nm node are shown in Table 13. We set the target utilization to around 80%, which is common in industry designs. Since we observed severe wire congestion in LDPC (see Fig. 3(a)), its target utilization was lowered to about 33%; the 2D design was barely routable with this setting. We also observed significant wire congestion in M256, so its target utilization was lowered to 68%. All designs met timing (WNS≥0). The detailed layout simulation results for the 7nm node are shown in Table 14. We set target utilizations similar to those for the 45nm node. All designs met timing.

S7 Impact of T-MI Wire Load Model

As mentioned in Section 3.4, we create custom WLMs for T-MI designs. There has been debate on whether WLMs help the final layout results [3]. Since our target circuits are small to medium sized, we may expect WLMs to be helpful to some extent. To see the impact of the custom WLMs on design quality, we synthesize the T-MI designs with the 2D WLMs instead of our T-MI WLMs. As a result, the synthesized netlists for T-MI and 2D become similar. The layout results with and without the custom WLMs for T-MI designs are shown in Table 15. For FPU, AES, and DES, the design quality difference is negligible. However, for LDPC and M256, we observe significant increases in wirelength (+10.1% and +5.5%) and total power (+10.1% and +3.9%) without the T-MI WLM. Thus, we conclude that for some designs, T-MI WLMs help obtain larger power benefits with T-MI.

S8 Breakdown of Net Power

We break net power into wire and pin power components (net = wire + pin). Wire refers to the metal wires and vias used for connecting cell pins, and pin refers to the input pins of cells. As shown in Table 16, in LDPC the wire cap is much larger than the pin cap, and so is the wire power. Most of the net power reduction comes from reduced wirelength, as seen in the wire power reduction. In contrast, in DES the pin cap is much larger than the wire cap. Thus, the reduced wirelength and wire power remove only a small portion of the net power. In fact, most of the nets in DES are short, whereas most are long in LDPC; the average wirelengths of LDPC-2D and DES-2D are 72.0µm and 10.5µm, respectively.
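The contrast between LDPC and DES can be quantified directly from Table 16. The following Python sketch computes the wire share of the net capacitance and the resulting net power reduction for both circuits; the dictionary keys and function names are illustrative assumptions, and net power is taken as wire power plus pin power as defined above.

```python
# Values copied from Table 16 (45nm node): capacitance in pF, power in mW.
TABLE16 = {
    "LDPC": {
        "2D": {"wire_cap": 558.0, "pin_cap": 134.4, "wire_pw": 30.73, "pin_pw": 9.04},
        "3D": {"wire_cap": 310.3, "pin_cap": 123.6, "wire_pw": 15.88, "pin_pw": 8.32},
    },
    "DES": {
        "2D": {"wire_cap": 64.4, "pin_cap": 127.4, "wire_pw": 8.88, "pin_pw": 17.80},
        "3D": {"wire_cap": 50.1, "pin_cap": 126.6, "wire_pw": 6.87, "pin_pw": 17.76},
    },
}


def wire_cap_share(design):
    """Fraction of the total net capacitance that comes from wires."""
    return design["wire_cap"] / (design["wire_cap"] + design["pin_cap"])


def net_power(design):
    """Net power = wire power + pin power."""
    return design["wire_pw"] + design["pin_pw"]


def net_power_reduction_pct(circuit):
    """Net power reduction of the 3D design relative to the 2D design."""
    p2d, p3d = net_power(circuit["2D"]), net_power(circuit["3D"])
    return 100.0 * (p2d - p3d) / p2d


if __name__ == "__main__":
    for name, circuit in TABLE16.items():
        share = 100.0 * wire_cap_share(circuit["2D"])
        print(f"{name}: wire share of 2D net cap = {share:.1f}%, "
              f"net power reduction = {net_power_reduction_pct(circuit):.1f}%")
```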

Table 13: Layout results of 2D and monolithic 3D designs for 45nm node. The #cells is the total number of cells, and #buffers is the number of inverting/non-inverting buffers. The #cells include #buffers. The utilization is the final cell placement density, after all optimizations. WL and WNS mean wirelength and worst negative slack, respectively. A positive WNS value means timing is met with a positive slack. The values in parentheses show the percentage ratio to the 2D designs.

circuit   design   footprint (µm²)   #cells    #buffers          utilization (%)
FPU       2D       24,839 (100)      10,959    1,644 (100)       80.4
FPU       3D       14,476 (58.3)     9,922     1,240 (75.4)      79.5
AES       2D       25,375 (100)      19,577    4,952 (100)       79.9
AES       3D       14,613 (57.6)     18,996    5,157 (104.1)     79.7
LDPC      2D       208,954 (100)     47,017    13,374 (100)      32.6
LDPC      3D       118,758 (56.8)    42,831    6,868 (51.4)      32.4
DES       2D       109,652 (100)     54,402    8,436 (100)       79.9
DES       3D       64,830 (59.1)     53,534    8,170 (96.8)      80.5
M256      2D       478,077 (100)     245,935   62,970 (100)      68.2
M256      3D       270,748 (56.6)    216,956   48,125 (76.4)     67.3

circuit   design   total WL (m)    WNS (ps)   total power (mW)   cell power (mW)   net power (mW)   leakage (mW)
FPU       2D       0.202 (100)     +6         8.44 (100)         3.98 (100)        4.21 (100)       0.25 (100)
FPU       3D       0.149 (73.7)    +4         7.22 (85.5)        3.61 (90.6)       3.39 (80.5)      0.23 (88.9)
AES       2D       0.260 (100)     +30        13.69 (100)        6.36 (100)        6.94 (100)       0.40 (100)
AES       3D       0.199 (76.4)    +25        12.20 (89.1)       5.87 (92.4)       5.97 (86.1)      0.36 (90.5)
LDPC      2D       3.806 (100)     0          54.79 (100)        14.17 (100)       39.78 (100)      0.85 (100)
LDPC      3D       2.528 (66.4)    +12        37.22 (67.9)       12.36 (87.2)      24.20 (60.8)     0.66 (78.3)
DES       2D       0.611 (100)     +24        63.88 (100)        36.17 (100)       26.68 (100)      1.03 (100)
DES       3D       0.479 (78.5)    +32        61.24 (95.9)       35.60 (98.4)      24.62 (92.3)     1.02 (98.6)
M256      2D       6.647 (100)     0          194.6 (100)        74.73 (100)       115.2 (100)      4.70 (100)
M256      3D       4.760 (71.6)    0          160.5 (82.5)       66.70 (89.3)      89.66 (77.8)     4.10 (87.1)

Table 14: Layout results of 2D and monolithic 3D designs for 7nm node.

circuit   design   footprint (µm²)   #cells    #buffers          utilization (%)
FPU       2D       639 (100)         17,306    3,931 (100)       80.9
FPU       3D       339 (53.0)        11,371    1,368 (34.8)      78.9
AES       2D       724 (100)         29,153    11,496 (100)      79.2
AES       3D       275 (38.0)        12,687    1,778 (15.5)      79.6
LDPC      2D       5,208 (100)       47,503    11,689 (100)      30.9
LDPC      3D       2,972 (57.1)      43,453    7,936 (67.9)      31.4
DES       2D       2,612 (100)       50,878    6,851 (100)       79.1
DES       3D       1,546 (59.2)      50,758    6,693 (97.7)      80.1
M256      2D       11,411 (100)      255,364   59,153 (100)      68.6
M256      3D       6,172 (55.4)      213,272   40,997 (69.3)     67.9

circuit   design   total WL (mm)   WNS (ps)   total power (mW)   cell power (mW)   net power (mW)   leakage (mW)
FPU       2D       33.1 (100)      +2         2.87 (100)         1.37 (100)        1.34 (100)       0.17 (100)
FPU       3D       21.8 (65.8)     +1         1.80 (62.7)        0.92 (67.6)       0.74 (55.6)      0.13 (79.0)
AES       2D       45.5 (100)      +9         2.85 (100)         1.35 (100)        1.27 (100)       0.23 (100)
AES       3D       23.8 (52.2)     +6         2.29 (80.2)        1.21 (89.7)       0.91 (71.6)      0.16 (71.5)
LDPC      2D       608 (100)       +2         8.68 (100)         2.43 (100)        5.83 (100)       0.41 (100)
LDPC      3D       439 (72.3)      +4         7.02 (80.9)        2.34 (96.3)       4.28 (73.4)      0.40 (96.5)
DES       2D       81.2 (100)      0          15.11 (100)        9.49 (100)        5.03 (100)       0.60 (100)
DES       3D       63.5 (78.1)     0          14.60 (96.6)       9.36 (98.7)       4.67 (92.7)      0.58 (97.0)
M256      2D       795 (100)       +23        30.55 (100)        13.26 (100)       15.21 (100)      2.07 (100)
M256      3D       612 (77.0)      +14        25.12 (82.2)       11.39 (85.9)      11.71 (77.0)     2.02 (97.6)

S9 Impact of the Metal Layer Setup

To see the impact of the metal layer setup on the power benefit of T-MI, we modify the metal layer stack of T-MI. Instead of adding 3 local metal layers on the top tier, we add 2 to the local and 2 to the intermediate metal layers. The original and modified metal stacks are shown in Fig. 9. We use LDPC and M256 for this case study. The results are summarized in Table 17. With the modified metal layer stack, the total wirelength decreases by 1.6% for LDPC and increases by 1.0% for M256 compared with our original T-MI results. The cell power, net power, and leakage power all decrease, and the total power of LDPC and M256 reduces by 2.4% and 2.8%, respectively. Thus, we conclude that the metal layer structure of T-MI affects the power benefit and should be chosen carefully. The local, intermediate, and global metal layer usage for the LDPC and M256 designs is shown in Fig. 10. We observe that both local and intermediate layers are heavily used. On the global layers, we see many long wires. LDPC used more global metal than M256. Note that a net uses combinations of these layers; the line segments in the snapshot do not represent the whole net.

Table 17: Impact of the different metal layer setup for T-MI. The '+M' suffix means the modified metal layer stack.

design        total WL (mm)   total power (mW)   cell (mW)   net (mW)   leak (mW)
LDPC-3D       439             7.02               2.34        4.28       0.40
LDPC-3D+M     432 (-1.6%)     6.85 (-2.4%)       2.27        4.23       0.36
M256-3D       612             25.12              11.39       11.71      2.02
M256-3D+M     618 (+1.0%)     24.42 (-2.8%)      11.11       11.47      1.83

S10 Impact of Switching Activity Factor

Another major factor that affects power consumption is the switching activity factor, defined as the number of signal transitions (0 to 1 or 1 to 0) per clock period. The power of cells and nets is linearly proportional to the related switching activities. Depending on various factors (architecture, usage scenario, etc.), the actual switching activity values may vary. For statistical power analyses, we provide switching activity factors to the primary input ports and the outputs of sequential cells (e.g., flip-flops). Our default settings for primary inputs and sequential cell outputs are 0.2 and 0.1, respectively. The given switching activity values are then propagated to the rest of the circuit, based on the netlist connectivity and the functionality of the cells. Since the switching activities of primary inputs affect only the paths up to the first sequential cells, and these paths are usually short, changing the switching activity factor of primary inputs changes the power by only a small amount. In this case study, we therefore vary the switching activity factors of the sequential cell outputs only. The total power of the 2D and 3D designs for M256 under various switching activity factors is shown in Fig. 11(a). Although the total power increases with a larger switching activity factor, the power reduction rate does not change much, as shown in Fig. 11(b). The other circuits also show negligible differences in power reduction rate under various switching activity factors. Thus, we conclude that the power benefit of T-MI is not largely affected by the switching activity level.
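Why the reduction rate is insensitive to the switching activity can be seen with a toy model. The Python sketch below assumes that all cell and net power scales linearly with the sequential-output activity relative to the default of 0.1, and that leakage is constant. This is a simplification made for illustration: in our actual analysis the primary input activity stays fixed and activities are propagated through the netlist, so the absolute values produced by this sketch should not be read as our results. The starting values are the M256 entries from Table 13.

```python
BASE_ACTIVITY = 0.1  # default sequential-output switching activity

# M256 power components at the default activity (45nm node, Table 13), in mW.
M256_45NM = {
    "2D": {"cell": 74.73, "net": 115.2, "leak": 4.70},
    "3D": {"cell": 66.70, "net": 89.66, "leak": 4.10},
}


def total_power(design, activity):
    """Toy model: dynamic power scales with activity, leakage does not."""
    scale = activity / BASE_ACTIVITY
    return scale * (design["cell"] + design["net"]) + design["leak"]


def reduction_pct(activity):
    """Total power reduction of 3D relative to 2D at the given activity."""
    p2d = total_power(M256_45NM["2D"], activity)
    p3d = total_power(M256_45NM["3D"], activity)
    return 100.0 * (p2d - p3d) / p2d


if __name__ == "__main__":
    for activity in (0.1, 0.2, 0.3, 0.4):
        print(f"activity={activity:.1f}  reduction={reduction_pct(activity):.2f}%")
```

Under these assumptions both designs scale by nearly the same factor, so their ratio, and hence the reduction rate, moves very little; only the small constant leakage term makes it change at all.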

[Figure 8 images: (a) 2D-placement, (b) T-MI-placement, (c) 2D-routing, (d) T-MI-routing; dimension labels: 170.53x168.24um and 127.70x126.20um]

Figure 8: The placement and routing snapshots of AES designs. The figures reflect the relative sizes of 2D vs. T-MI designs.

[Figure 9 diagram: (a) 2D: local M1-3, intermediate M4-6, global M7-8; (b) T-MI: local MB1 and M1-6, intermediate M7-9, global M10-11; (c) T-MI+M: local MB1 and M1-5, intermediate M6-10, global M11-12]

Figure 9: Metal layer stack diagrams for (a) 2D, (b) T-MI, and (c) T-MI+M. The '+M' means modified metal layer stack.

[Figure 10 images: rows for global layers (M11-12), intermediate layers (M6-10), and local layers (MB1, M1-5); columns (a) LDPC and (b) M256]

Figure 10: GDSII snapshots of local, intermediate, and global metal layers for (a) LDPC and (b) M256.

[Figure 11 plots: (a) total power (mW, 0 to 500) vs. switching activity (0.1 to 0.4) for M256-2D and M256-3D; (b) power reduction (%) vs. switching activity (0.1 to 0.4) for FPU, AES, LDPC, DES, and M256]

Figure 11: Power dependency on switching activity factor. (a) Total power of M256 with various switching activity factors, and (b) power reduction rate under various switching activity factors. All results are from 45nm node.