Buffer Design and Optimization for LUT-based Structured ASIC Design Styles∗

Po-Yang Hsu†, Shu-Ting Lee†, Fu-Wei Chen‡, Yi-Yu Liu†

† Department of Computer Science and Engineering, Yuan Ze University, Chungli, Taiwan, 320, R.O.C.
‡ Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, 300, R.O.C.

ABSTRACT

The interconnection delay of pre-fabricated design styles dominates circuit delay due to the heavy downstream capacitance. Buffer insertion is a widely used technique that splits a long wire into several buffered wire segments to improve circuit performance. In this paper, we investigate the buffer insertion issues in the LUT-based structured ASIC design style. We design the layouts of two dedicated buffers and extract the technology-dependent parameters for evaluation. We then propose a channel migration technique, which employs both intra-channel migration and inter-channel migration, to alleviate the sub-channel saturation problem. The experimental results demonstrate that dedicated buffers are essential for the structured ASIC design style.

Categories and Subject Descriptors: B.7.1 [Integrated Circuits]: VLSI
General Terms: Performance
Keywords: Structured ASIC, Buffer Insertion, Interconnection

1. INTRODUCTION

In the nanoscale CMOS era, mask cost increases dramatically due to lithographic difficulties. The one-time-use mask cost is no longer affordable for small-volume ASIC designs, which raises the threshold for conventional standard cell ASIC designs. To amortize mask cost, several pre-fabricated design styles have been proposed. These design styles provide different levels of pre-fabrication: the device level, the gate level, the semi-chip level, and the full-chip level. Among them, structured ASIC is a semi-chip-level pre-fabricated design style. Most of the masks are fixed except the contact-mask and some via-masks. The functionalities and the interconnections are specified by properly assigning the contacts and the vias. Structured ASIC is also known as the via patterned gate array (VPGA) [4] [5] [6] [8] [9] [10]. With low NRE cost and medium circuit performance compared with the conventional standard cell design, structured ASIC provides a new alternative for circuit designers in terms of mask cost, programmability, and performance.

Since the interconnection delay dominates circuit delay, buffer insertion has become an important technique in modern VLSI design. The interconnection delay is due to the parasitic resistance and capacitance of a wire. The downstream capacitance can be greatly reduced by buffer insertion. Hence, the interconnection delay can be reduced at the expense of extra buffer area. Gao and Wong solve the buffer planning problem by using a graph-based algorithm [3]. In the structured ASIC design style, the downstream capacitance would become a crucial problem unless there are dedicated buffers. These issues motivate us to design dedicated buffers for the interconnections. We design the layouts of two dedicated buffers and extract the technology-dependent parameters for evaluation.

The rest of this paper is organized as follows. The preliminary background and the motivation of this paper are given in Section 2. Section 3 presents two dedicated buffers and evaluates them. The experimental results are presented in Section 4. Section 5 concludes this paper.
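As a rough illustration of why splitting a wire with buffers reduces interconnection delay, the following Python sketch applies the standard Elmore delay model to a uniform RC wire. The choice of model and every numeric value in `PARAMS` (driver and buffer resistance, per-micron wire resistance and capacitance, buffer delay) are assumptions made for illustration; they are not parameters extracted in this paper.

```python
def wire_delay(r_drv, r_per_um, c_per_um, length_um, c_load):
    """Elmore delay of a driver plus a uniform RC wire that ends in c_load."""
    r_wire = r_per_um * length_um
    c_wire = c_per_um * length_um
    # The driver sees all downstream capacitance; the wire sees half of its own.
    return r_drv * (c_wire + c_load) + r_wire * (c_wire / 2.0 + c_load)


def buffered_delay(n_segments, r_drv, r_buf, c_buf, t_buf,
                   r_per_um, c_per_um, length_um, c_load):
    """Delay of the same wire split into n_segments equal buffered segments."""
    seg_len = length_um / n_segments
    total = 0.0
    for i in range(n_segments):
        driver = r_drv if i == 0 else r_buf            # first segment uses the source
        load = c_load if i == n_segments - 1 else c_buf  # others end at a buffer input
        total += wire_delay(driver, r_per_um, c_per_um, seg_len, load)
        if i < n_segments - 1:
            total += t_buf                              # intrinsic buffer delay
    return total


# Hypothetical values, chosen only to make the trend visible.
PARAMS = dict(r_drv=1000.0, r_buf=1000.0, c_buf=5e-15, t_buf=20e-12,
              r_per_um=0.1, c_per_um=0.2e-15, length_um=10000.0, c_load=5e-15)

unbuffered = buffered_delay(1, **PARAMS)
four_segments = buffered_delay(4, **PARAMS)
print(f"unbuffered: {unbuffered:.3e} s, 4 segments: {four_segments:.3e} s")
```

With these assumed values the wire term grows quadratically with length, so cutting the wire into four segments lowers the total delay even after paying for three buffers.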

∗This work is supported in part by the National Science Council of Taiwan under Grant NSC-96-2221-E-155-070 and NSC97-2221-E-155-071-MY2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’09, May 10–12, 2009, Boston, Massachusetts, USA. Copyright 2009 ACM 978-1-60558-522-2/09/05 ...$5.00.

2. MOTIVATION

2.1 Island-style Structured ASIC

Structured ASIC is proposed to balance the mask cost and the circuit performance gap between standard cell and FPGA. Several research papers use island-style pre-fabrication in their structured ASIC designs [9] [10] [12]. The logic fabric replaces the FPGA logic component, while the routing fabric replaces the FPGA routing component. Both SRAMs and transmission-gates are replaced by mask-programmable vias. Therefore, structured ASIC provides mask-level programmability instead of field programmability. In a structured ASIC design style, the device-masks and the metal-masks are pre-fabricated so that they can be shared among different circuits. In contrast to the device and metal masks, the via-masks must be re-designed. Hence, the NRE cost of structured ASIC is relatively low compared with that of standard cell ASIC.

Since the transmission-gates are replaced by vias, the diffusion capacitance in the interconnections is substantially reduced. The switch-buffers commonly used in a traditional FPGA design are no longer required in a structured ASIC design. Removing the transistors of the SRAMs, transmission-gates, and switch-buffers greatly improves the chip density as well as the circuit performance. Therefore, structured ASIC trades field programmability for circuit performance.

Since the architectures of FPGA and structured ASIC are similar, many research papers utilize the FPGA design flow to synthesize structured ASICs [6] [9] [10]. The structured ASIC design flow can be roughly divided into two phases, front-end synthesis and back-end synthesis. High-level synthesis and logic synthesis techniques perform technology-independent optimization so that a design uses fewer hardware resources under given timing constraints. Commonly used components are extracted and shared to reduce circuit area. Once the optimization is done, technology mapping maps the circuit onto the available cells (LUTs and FFs). After that, clustering/packing aggregates the mapped cells into CLBs. Finally, CLB placement and routing track assignment are performed in back-end synthesis.

2.2 Buffer Insertion Issues for Structured ASIC

In front-end synthesis, commonly used sub-circuits are extracted to reduce the circuit area. The fanouts of these extracted sub-circuits increase the capacitive load and hence may degrade circuit performance. In back-end synthesis, the parasitic resistance and capacitance of long wires result in large interconnection delay. Both the fanout and the parasitic capacitance must be carefully taken into account in modern VLSI design. In the standard cell design style, a cell library contains logic gates and buffers of multiple driving strengths. Hence, the capacitive problem can be alleviated by using the concept of logical effort. In the FPGA design style, the capacitive problem is directly neutralized by switch-buffers, since the switch-buffers split the high capacitive load into several buffered wire segments. However, this would be a serious problem for a cost-efficient structured ASIC design style with neither multiple driving-strength gates nor switch-buffers. Therefore, buffer planning is a key issue for a high-performance structured ASIC design.

Zhang and Sapatnekar propose a statistical scheme that estimates the distribution of pre-fabricated buffers by using Rent’s rule [12]. However, the exact buffer locations are not addressed in their study. According to the island-style architecture, there are two candidate locations, the logic fabric and the routing fabric. Hence, we investigate buffer insertion in both the logic fabric and the routing fabric. For simplicity, a buffer in the logic fabric is denoted as a logic fabric buffer (LFB), while a buffer in the routing fabric is denoted as a routing fabric buffer (RFB). We design a pilot logic fabric layout with an LFB and a corresponding routing fabric layout with RFBs for the evaluations in Section 3.

3. BUFFER DESIGN AND EVALUATION

3.1 Logic Fabric Design

The N-LUT design can be easily implemented as a 2^N-to-1 MUX. Patel et al. propose an N-LUT structured ASIC logic fabric that uses a 2^(N−1)-to-1 MUX implementation with one diffusion input [9]. By using the diffusion input, their design significantly reduces the number of required transmission-gates as well as the required device area. We first perform 4-LUT pre-layout simulations on both implementations with a 0.18µm process technology. The pre-sim results of both implementations are reported in Table 1. According to the pre-sim results, the delay ratio of the 8-to-1 MUX is much larger than that of the 16-to-1 MUX due to the diffusion input. Hence, we use a 4-LUT without diffusion input in our pilot structured ASIC logic fabric. Since the input capacitances of the LUT inputs differ from each other, we fine-tune the input-buffers such that the 4-LUT delay is minimized and all LUT input capacitances are identical. After that, we add an FF to the output of the 4-LUT to form a CLB and place a via-programmable 4X-buffer at the output of the CLB. Our pilot CLB layout in TSMC 0.18µm technology is shown in Figure 1. The CLB layout dimension is 33µm × 29µm. We use the same information for the routing fabric design.

Table 1: Pre-sim results on different LUTs

4-LUT                                BestCase   WorstCase   Ratio (%)
8-to-1 MUX (with diffusion input)    9.31E-11   4.08E-10    439
16-to-1 MUX (w/o diffusion input)    1.66E-10   4.38E-10    264

Figure 1: Pilot layout of a logic fabric

3.2 Routing Fabric Design

Since the connectivity of the routing fabric is programmed by vias, the underlying unused device area can be directly used for buffer insertion. We design via-programmable RFB-pairs under the routing channels so that the routing tracks above them can use the RFBs when they suffer from heavy downstream capacitance. Each RFB-pair has two opposite RFBs in order to support signal propagation in different directions. The dimensions of our RFB-pairs are 8.1µm × 4.5µm and 6µm × 4.5µm in the horizontal channel and the vertical channel, respectively. Combining the layouts of the RFB and the CLB, we can see that 6 (⌊4.5/0.66⌋) routing tracks share 4 RFB-pairs (⌊33/8.1⌋ and ⌊29/6⌋). Therefore, we integrate 4 RFB-pairs into a sub-channel. In our design, we use metal2/metal4 for horizontal tracks and metal3/metal5 for vertical tracks. Hence, there are in total 12 routing tracks and 4 RFB-pairs (8 RFBs) in a sub-channel. To improve the area utilization, sub-channels are connected back-to-back to form a routing channel. The sub-channels are similar to standard cell rows with power/ground sharing. Finally, we use crossbar switch-blocks to interconnect tracks in different routing channels [9]. A simplified layout (metal4 and metal5 are omitted for brevity) of the routing channel as well as the RFB is shown in Figure 2.

Figure 2: The RFB in routing channel and sub-channels

3.3 Simulations on Buffer Insertion

Before we perform the simulations on LFB and RFB insertion, we propose an algorithm that inserts buffers for delay optimization. The algorithm is illustrated in Figure 3. C represents the set of all possible buffer insertion candidates. For LFB insertion, C is the set of all CLBs, while for RFB insertion, C is the set of all 2-pin wires. P represents the set of processed buffer insertion candidates. S represents the set of candidates on the critical paths. U represents the set of critical candidates that have not been processed before. C_th is the threshold capacitance for buffer insertion. We set C_th ≈ 30fF in our simulation. Our algorithm iteratively selects the candidates on critical paths. Once the downstream capacitance of a selected candidate exceeds the threshold C_th, we insert a buffer to split the downstream capacitance. The complexity of this algorithm is O(n), where n is the number of all possible buffer insertion candidates in C.

Algorithm Buffer-Insertion-for-Delay-Optimization
Init
    C = all possible buffer insertion candidates c_i
    P = ∅
Begin
    do
        static timing analysis
        S = {c_i ∈ C | Crit(c_i) = 1}
        U = S \ P
        foreach c_i in U
            if (downstream capacitance of c_i > C_th)
                buffer insertion for c_i
        end foreach
        P = P ∪ U
    while (U ≠ ∅)
End

Figure 3: The algorithm of Buffer-Insertion-for-Delay-Optimization

We use an FPGA placement and routing tool, VPR [1], as our simulation platform. The technology-dependent parameters are obtained from SPICE simulation and the predictive technology model (PTM) [7]. All the benchmark circuits from MCNC are optimized, mapped, and packed by using SIS [11], FlowMap [2], and T-Vpack [1], respectively. After that, we perform placement and routing by using VPR. We first conduct a baseline simulation, in which no additional buffers are inserted. After that, we perform simulations on the following three scenarios: LFB insertion, RFB insertion, and combined LFB and RFB insertion. The combined LFB and RFB insertion is the result of LFB insertion followed by RFB insertion. Notice that the number of RFBs is unlimited for both the RFB insertion and the combined LFB and RFB insertion. The simulation results are summarized in Table 2. Baseline, LFB, RFB, and LFB + RFB represent the results without any buffer insertion, with LFB insertion, with RFB insertion, and with combined LFB and RFB insertion, respectively. Column R_Delay represents the average ratio of the circuit delay to the baseline delay. Column D_W/D_L represents the ratio of the wire delay to the logic delay. In Table 2, the ratio of the wire delay to the logic delay is low after performing the RFB insertion. The circuit delay of the combined LFB and RFB insertion is similar to that of the RFB insertion only.

Table 2: Simulation results of different buffer insertions

Buffer Insertion   R_Delay (%)   D_W/D_L
Baseline           100           14.17
LFB                29.46         7.47
RFB                10.45         0.98
LFB + RFB          10.39         0.77

3.4 Simulation Summary

In this section, we design two buffers for our structured ASIC and perform simulations on them. From the viewpoint of area overhead, there is only one 4X-buffer per CLB in an LFB design, while there are eight 1X-inverters per 12 routing tracks in an RFB design. Hence, the LFB is the most area-efficient, while the combined LFB and RFB is the least area-efficient. From the viewpoint of circuit performance, both the RFB insertion and the combined LFB and RFB insertion outperform the LFB insertion. From the viewpoint of the downstream capacitance, we can repeatedly insert RFBs in a long wire to split the downstream capacitance and hence maintain a low ratio of the wire delay to the logic delay. However, the LFB is incapable of dealing with heavy downstream capacitance due to the limitation of its maximum driving strength. Therefore, taking the area overhead, the circuit performance, and the balanced ratio of the wire delay to the logic delay into account, we suggest using only the RFB for the structured ASIC design style.
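The iterative procedure of Figure 3 can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the `Candidate` class, the `critical_names` callback that stands in for static timing analysis, and the `toy_sta` model are hypothetical names introduced here, and the actual implementation inside VPR is not shown in the paper. Only the threshold of roughly 30fF comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

C_TH_FF = 30.0  # threshold capacitance in fF (the text sets C_th to about 30fF)


@dataclass
class Candidate:
    """Hypothetical buffer insertion candidate (a CLB or a 2-pin wire)."""
    name: str
    downstream_cap_ff: float
    buffered: bool = False


def insert_buffers(candidates: List[Candidate],
                   critical_names: Callable[[List[Candidate]], Set[str]],
                   c_th: float = C_TH_FF) -> List[str]:
    """Follow Figure 3: repeat timing analysis until no new critical candidate."""
    processed: Set[str] = set()           # P
    inserted: List[str] = []
    while True:                           # do ... while (U is not empty)
        s = critical_names(candidates)    # S: candidates on critical paths
        u = s - processed                 # U = S \ P
        for cand in candidates:
            if cand.name in u and cand.downstream_cap_ff > c_th:
                cand.buffered = True      # "buffer insertion for c_i"
                inserted.append(cand.name)
        processed |= u                    # P = P ∪ U
        if not u:
            return inserted


def toy_sta(cands: List[Candidate]) -> Set[str]:
    """Toy stand-in for timing analysis: a buffer halves a candidate's cost."""
    def cost(c: Candidate) -> float:
        return c.downstream_cap_ff / 2 if c.buffered else c.downstream_cap_ff
    worst = max(cost(c) for c in cands)
    return {c.name for c in cands if cost(c) == worst}
```

Because every candidate enters `processed` at most once, the loop body handles each candidate a bounded number of times, which matches the intuition behind the stated O(n) bound; the cost of each timing analysis pass is not modeled here.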

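The sub-channel resource counts quoted in Sections 3.2 and 3.4 follow directly from the layout dimensions. The short Python check below reproduces them; interpreting 0.66 as the routing track pitch in µm is an assumption, since the paper gives only the expression ⌊4.5/0.66⌋, and all constant names are illustrative.

```python
import math

CLB_WIDTH_UM, CLB_HEIGHT_UM = 33.0, 29.0  # pilot CLB layout dimension
RFB_PAIR_WIDTH_H_UM = 8.1                 # RFB-pair length in a horizontal channel
RFB_PAIR_WIDTH_V_UM = 6.0                 # RFB-pair length in a vertical channel
RFB_PAIR_HEIGHT_UM = 4.5                  # RFB-pair height in both channels
TRACK_PITCH_UM = 0.66                     # assumed meaning of the 0.66 divisor
METAL_LAYERS_PER_DIRECTION = 2            # metal2/metal4 or metal3/metal5

tracks_per_layer = math.floor(RFB_PAIR_HEIGHT_UM / TRACK_PITCH_UM)
pairs_horizontal = math.floor(CLB_WIDTH_UM / RFB_PAIR_WIDTH_H_UM)
pairs_vertical = math.floor(CLB_HEIGHT_UM / RFB_PAIR_WIDTH_V_UM)

tracks_per_subchannel = tracks_per_layer * METAL_LAYERS_PER_DIRECTION
rfbs_per_subchannel = 2 * min(pairs_horizontal, pairs_vertical)  # two RFBs per pair

print(tracks_per_layer, pairs_horizontal, pairs_vertical,
      tracks_per_subchannel, rfbs_per_subchannel)
```

The results agree with the text: 6 tracks per layer, 4 RFB-pairs along each CLB edge, and 12 routing tracks sharing 8 RFBs per sub-channel.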
4. CONSTRAINED RFB INSERTION AND EXPERIMENTAL RESULTS

According to the simulation results in Section 3, we suggest using the RFB for the structured ASIC design style. Since the number of RFBs is fixed, we cannot use an unlimited number of RFBs in the routing channels. A constrained RFB insertion technique is required to optimize the interconnection delay. During RFB insertion, if there is no available RFB in a sub-channel (a saturated sub-channel), we need to assign another available RFB to that critical wire. Hence, we propose a channel migration technique to alleviate the sub-channel saturation problem. There are two phases in our channel migration technique, intra-channel migration and inter-channel migration. Figure 4 demonstrates the ideas of both intra-channel and inter-channel migration. In intra-channel migration, we move the critical wire from a saturated sub-channel to the other, unsaturated sub-channel. Since the inserted RFB is in the same routing channel, there is no performance degradation. If intra-channel migration is not available, we apply inter-channel migration. In inter-channel migration, we seek unused inverters near the original saturated channel for the RFB insertion, at the cost of some performance degradation. The migration procedure terminates when there is no available unused inverter close to the original pre-computed RFB location.

Figure 4: Intra-channel and inter-channel migrations

We conduct experiments on constrained RFB insertion. Both the buffer insertion algorithm and the channel migration technique are integrated into VPR. In our experiments, the priority of RFB insertion is the same as in the proposed algorithm in Figure 3 until a sub-channel overflow occurs. Then, the channel migration technique is performed to seek unused inverters. The results of RFB insertion with and without channel migration are listed in Table 3. Columns Delay, D_W/D_L, Imp, RFB, Overflow, Intra, and Inter represent the circuit delay, the ratio of the wire delay to the logic delay, the delay improvement compared with the baseline results, the number of inserted RFBs, the number of RFB insertions in a saturated sub-channel, the number of intra-channel migrations, and the number of inter-channel migrations, respectively. From the experimental results, we can see that RFB insertion with our proposed channel migration technique improves the circuit performance by 89.6% on average under the RFB constraint, compared with 81.2% without channel migration. Additionally, the ratio of the wire delay to the logic delay is reduced from 32.6 to 0.98.

Table 3: Constrained RFB insertion for 16-to-1 MUX-based CLB

            Baseline              Without Channel Migration
Circuit     Delay      D_W/D_L    Delay      Imp (%)   D_W/D_L   RFB
apex4       5.06E-08   13.9       7.22E-09   85.7      1.12      6657
ex5p        5.36E-08   13.0       1.50E-08   72.0      3.40      2400
bigkey      1.01E-07   49.8       4.77E-09   95.3      1.40      2442
des         8.99E-08   29.3       9.46E-09   89.5      1.78      4309
spla        9.46E-08   21.1       4.01E-08   57.6      8.37      8348
frisc       1.74E-07   59.7       1.55E-08   91.1      2.95      6569
clma        1.97E-07   29.9       1.38E-08   93.0      0.97      43453
diffeq      6.16E-08   24.4       8.64E-09   86.0      0.32      1909
dsip        1.22E-07   60.6       4.85E-09   96.0      1.45      2024
elliptic    1.47E-07   73.3       1.38E-08   90.6      5.96      6898
ex1010      1.30E-07   29.4       5.90E-08   54.7      12.79     5330
seq         4.86E-08   11.7       1.47E-08   69.7      3.33      3602
misex3      4.60E-08   12.5       1.22E-08   73.5      2.18      2850
tseng       5.71E-08   27.8       9.91E-09   82.7      1.27      1001
average                32.6                  81.2      3.38

            With Channel Migration
Circuit     Delay      Imp (%)   D_W/D_L   RFB     Overflow   Intra   Inter
apex4       7.22E-09   85.7      1.12      6673    33         15      17
ex5p        6.85E-09   87.2      0.79      4671    22         7       14
bigkey      4.77E-09   95.3      1.40      2442    0          0       0
des         6.20E-09   93.1      1.09      6439    13         6       5
spla        1.10E-08   88.4      1.57      16968   33         26      6
frisc       1.52E-08   91.3      0.45      6771    1          1       0
clma        1.38E-08   93.0      0.97      43561   222        113     107
diffeq      8.64E-09   86.0      0.32      1909    0          0       0
dsip        4.85E-09   96.0      1.45      2024    0          0       0
elliptic    1.12E-08   92.4      0.60      7530    3          1       2
ex1010      1.07E-08   91.8      1.49      13743   17         4       13
seq         7.28E-09   85.0      0.90      6470    12         4       8
misex3      7.16E-09   84.4      1.11      5194    17         12      5
tseng       9.07E-09   84.1      0.48      1134    1          0       1
average                89.6      0.98
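The two-phase channel migration described in Section 4 can be sketched as a simple fallback search. The Python code below is a hedged illustration only: the `Channel` class, the `place_rfb` function, the `neighbors` map, and the assumption of exactly two sub-channels per routing channel are hypothetical modeling choices made here, and the ordering of nearby channels is left to the caller. The capacity of 8 RFBs per sub-channel is taken from Section 3.2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

RFBS_PER_SUBCHANNEL = 8  # 4 RFB-pairs per sub-channel (Section 3.2)


@dataclass
class Channel:
    """Hypothetical routing channel made of two back-to-back sub-channels."""
    free: List[int] = field(
        default_factory=lambda: [RFBS_PER_SUBCHANNEL, RFBS_PER_SUBCHANNEL])


def place_rfb(channels: Dict[int, Channel],
              neighbors: Dict[int, List[int]],
              ch: int, sub: int) -> Optional[Tuple[str, int, int]]:
    """Try the home sub-channel, then intra-channel, then inter-channel migration.

    Returns (kind, channel, sub_channel), or None if no unused RFB is nearby.
    """
    home = channels[ch]
    if home.free[sub] > 0:                 # no overflow: use the planned location
        home.free[sub] -= 1
        return ("direct", ch, sub)
    other = 1 - sub
    if home.free[other] > 0:               # intra-channel: same routing channel
        home.free[other] -= 1
        return ("intra", ch, other)
    for near in neighbors.get(ch, []):     # inter-channel: nearby channels only
        for s in (0, 1):
            if channels[near].free[s] > 0:
                channels[near].free[s] -= 1
                return ("inter", near, s)
    return None                            # migration terminates without an RFB


# Small demonstration with nearly saturated channels.
demo = {0: Channel(free=[0, 1]), 1: Channel(free=[1, 0])}
adjacent = {0: [1], 1: [0]}
print(place_rfb(demo, adjacent, 0, 0))
```

Trying the intra-channel option first mirrors the text, which notes that it causes no performance degradation, whereas the inter-channel option accepts some degradation in exchange for finding an unused inverter.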

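The Imp values in Table 3 appear consistent with the usual relative delay reduction with respect to the baseline. This definition is inferred from the table values rather than stated explicitly in the text, so it should be read as an assumption:

```latex
\mathrm{Imp} = \left(1 - \frac{D}{D_{\mathrm{baseline}}}\right) \times 100\%,
\qquad
\text{e.g. apex4: }
\left(1 - \frac{7.22 \times 10^{-9}}{5.06 \times 10^{-8}}\right) \times 100\% \approx 85.7\%.
```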
5. CONCLUSION

Buffer insertion is a widely used technique that splits a long wire into several buffered wire segments to reduce interconnection delay. In this paper, we discuss the buffer insertion issues for the LUT-based structured ASIC design style. A pilot CLB layout with an LFB and a corresponding routing channel layout with RFBs are used for evaluation. According to the experimental results, we conclude that the RFBs are sufficient for a buffered structured ASIC design. After that, a channel migration technique, which employs both intra-channel migration and inter-channel migration, is proposed to alleviate the sub-channel saturation problem. The experimental results demonstrate that our proposed structured ASIC design and optimization technique improves the circuit performance by 89.6%. Furthermore, our proposed design brings the ratio of the wire delay to the logic delay from 32.6 down to 0.98, balancing the wire delay and the logic delay.

6. REFERENCES

[1] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.
[2] J. Cong and Y. Ding. FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(1):1–12, January 1994.
[3] Y. Gao and D. F. Wong. A graph based algorithm for optimal buffer insertion under accurate delay models. In Proceedings of Design Automation and Test in Europe, pages 535–539, March 2001.
[4] S. Gopalani, R. Garg, S. P. Khatri, and M. Cheng. A lithography-friendly structured ASIC design approach. In Proceedings of Great Lakes Symposium on VLSI, pages 315–320, May 2008.
[5] K. Gulati, N. Jayakumar, and S. P. Khatri. A structured ASIC design approach using pass transistor logic. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 1787–1790, May 2007.
[6] B. Hu, H. Jiang, Q. Liu, and M. M. Sadowska. Synthesis and placement flow for gain-based programmable regular fabrics. In Proceedings of International Symposium on Physical Design, pages 197–203, April 2003.
[7] Nanoscale Integration and Modeling (NIMO) Group. The predictive technology model (PTM). http://www.eas.asu.edu/~ptm/, 2007.
[8] N. Jayakumar and S. P. Khatri. A metal and via maskset programmable VLSI design methodology using PLAs. In Proceedings of International Conference on Computer-Aided Design, pages 590–594, May 2004.
[9] C. Patel, A. Cozzie, H. Schmit, and L. Pileggi. An architectural exploration of via patterned gate arrays. In Proceedings of International Symposium on Physical Design, pages 184–189, April 2003.
[10] Y. Ran and M. Marek-Sadowska. Via-configurable routing architectures and fast design mappability estimation for regular fabrics. In Proceedings of International Conference on Computer-Aided Design, pages 25–32, May 2005.
[11] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, and A. Saldanha. SIS: A system for sequential circuit synthesis. Electronics Research Laboratory, Memorandum No. UCB/ERL M92/41, March 1992.
[12] T. Zhang and S. S. Sapatnekar. Buffering global interconnects in structured ASIC design. In Proceedings of Conference on Asia and South Pacific Design Automation, pages 23–26, January 2005.