Techniques for Improved Placement-Coupled Logic Replication

Hosung (Leo) Kim a,∗, John Lillis a, and Miloš Hrkić b

a University of Illinois at Chicago, Dept. of Computer Science, Chicago, IL 60607
b IBM Corporation, East Fishkill, NY 12533
Abstract

Several recent papers have demonstrated the potential of logic replication driven by placement-level timing analysis for improving clock period. In this paper we propose a number of techniques aimed at more fully realizing this potential within the framework employed in Hrkić, Lillis, and Beraudo (2004, DAC). There are situations in which the basic approach fails to yield significant additional improvement, due largely to the effects of reconvergence in the netlist. We suggest the use of rectilinear Steiner arborescence embedding as a tool for overcoming this limitation. We also propose techniques for fanout partitioning and cell relocation which are cognizant of both wirelength and timing impact for improved solution quality. We have implemented and experimented with these techniques in the FPGA domain. Promising experimental results are reported, with an average 17.4% (up to 39.9%) clock period reduction compared with the timing-driven placement from VPR.

Key words: Timing Optimization, Logic Replication, Placement, FPGA
1 Introduction
∗ Corresponding author: Hosung (Leo) Kim. Preprint submitted to Elsevier, 14 December 2006.

Logic replication has been shown to be a useful technique for achieving certain design goals while maintaining the logical behavior of the netlist. This idea has been exploited in different contexts including min-cut partitioning (e.g., [8] and [9]), high fanout logic cell replication (e.g., [10] and [11]), and physical-level interconnect-dominated delay optimization (e.g., [1],
[2], [4], [12], and [13]). It is this last category of optimization which is the focus of this paper.

A typical placement contains combinational paths which are non-monotone, i.e., when tracing the path in the placement there are detours with respect to the locations of the endpoints of the path. This situation implies additional wiring delay, and often it cannot be improved by timing-driven placement. The reason for the non-monotonicity is often that a critical path and near-critical paths share some logic cells. By replicating certain cells it is possible to decouple the paths, and thereby enable “straightening” of the paths.

In [1], a simple replication heuristic was proposed based on this idea. The heuristic made local cell duplications with the goal of straightening signal paths. This proof-of-concept work showed the potential of the replication technique: in many cases, significant improvements in clock period could be obtained with a relatively small number of logic replications.

Subsequently, in [2] a stronger approach was proposed. The key components of the approach are the replication tree and the fanin tree embedder. A replication tree is derived from a subgraph of the netlist containing only combinational paths. By replication, a fanin tree (that is, a reconvergence-free tree) is derived while maintaining the logical behavior. This idea is illustrated in Figure 1. In Figure 1-(a), static timing analysis determines that a is the critical sink of the timing graph. The solid edges represent a slowest-paths tree: the thick arrows are the slowest path, the thin solid arrows represent the other fanins of the critical nodes, and the dotted arrows are other circuit interconnects which are not in the slowest-paths tree. Notice that this tree is not a valid fanin tree due to the convergence formed by (c, d) and (b, d). To extract a valid fanin tree, the internal nodes on the slowest path are replicated into b′, c′, d′, and e′ in Figure 1-(b).
The edges in the replication tree are formed as follows. Let v be an original node and v′ be its replica, and let $u_1, \ldots, u_k$ be the inputs to v. If an edge ($u_i$, v) is on the slowest path, then v′ receives its i-th input from $u_i'$; otherwise it receives its i-th input from $u_i$. The constructed replication tree is shown in Figure 1-(b). The reconvergence is broken since c′ receives a signal from d′ and b′ receives a signal from d. Note that the modified circuit is functionally equivalent to the original circuit, and the replicated nodes form the internal vertices of a legitimate fanin tree which can be embedded (Figure 1-(c)).

Fig. 1. Replication Tree Construction. (a) Sub-circuit containing the slowest path. (b) Extracted replication tree. (c) Functionally equivalent circuit.

Once a fanin tree is extracted, a powerful timing-driven fanin tree embedding algorithm is applied to embed the structure into a layout area, leaving the rest of the circuit fixed. This algorithm is based on dynamic programming and is adapted from buffer tree synthesis [3]. It optimally solves the following problem: given a fanin tree with fixed leaves (inputs) and a root (sink), arrival times at the leaves, a target embedding graph, and cost metrics including placement and wiring cost, find a minimum cost embedding achieving a specified arrival time upper-bound. The algorithm starts from the leaves and propagates costs and arrival times toward the root. A candidate solution (embedding) for a sub-tree with its root u placed at vertex v in the target embedding graph is represented by a signature (cost, time). At each candidate placement, the sub-solutions of the fanin nodes of u are combined and only non-dominated solutions are kept. These lists of solutions are propagated over the embedding graph for the parent of u by a generalized version of Dijkstra’s shortest path algorithm (this approach is similar to [19]). At the root of the fanin tree, a set of solutions (trade-offs between cost and arrival time) is available. Among the trade-offs, the fastest solution that does not exceed the minimally-achievable circuit clock period is chosen.

The philosophy behind tree embedding is that trees have separability properties which make them, in general, easier to optimize than more general graph structures. While some techniques have been developed for general directed acyclic graph (DAG) optimization (e.g., [14]), these techniques are limited in the generality of the problem formulation, e.g., they assure only minimum delay, but are weak when considering cost/performance trade-offs. In addition to this separability property, a unique trait of the replication tree approach is that it does not limit itself to pre-existing tree structures as in some technology mappers [18].

There are a couple of other techniques that utilize logic replication to improve performance. [17] presented an algorithm that performs clustering and duplication during placement: it introduced the notions of feasible region and super feasible region to improve the critical path monotonicity. [16] proposed a packing algorithm that leaves empty basic logic elements (BLEs) in timing-critical clusters, and a placement algorithm that performs logic replication to reduce the critical path length.
In contrast to these approaches, which perform local modifications to the netlist and placement, the replication tree approach effectively performs a timing-driven placement and global routing of relatively large sub-circuits which can include many I/O paths.

While the fanin tree embedding approach showed promise, we identified some limitations, some of which were implementation oriented while others were more fundamental in nature. As reported in [4], the effects of reconvergence in the netlist can prevent the embedding algorithm of [2] from straightening a critical path. We present a rectilinear Steiner arborescence ([7], [5], and [6]) embedding approach that addresses this limitation. We also describe techniques for fanout partitioning and cell relocation which are aware of both wirelength and timing impact. Other wirelength management techniques for improved solution quality are discussed.

This paper is organized as follows. Section 2 discusses lower-bounding of the clock period. In Section 3 we describe the effect of reconvergence, a rectilinear Steiner arborescence embedding, and a generalized Steiner arborescence embedding. Complementary techniques are presented in Section 3.4. The experimental results are discussed in Section 4.
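The replication-tree edge rule described in this introduction can be illustrated with a short sketch. It is not the implementation of [2]: the function name, the data layout, and the small netlist (only loosely modeled on Figure 1) are assumptions made for illustration.

```python
def build_replication_tree(fanins, tree_edges, replicated, root):
    """Return {node: [input sources]} for the root and for every replica.

    fanins      -- {node: ordered list of its input nodes} (original netlist)
    tree_edges  -- set of (u, v) edges that lie on the slowest path(s)
    replicated  -- set of internal nodes that receive a replica
    root        -- the critical sink, which stays fixed
    """
    def replica(name):
        return name + "'"

    tree = {}
    for v in sorted(replicated | {root}):
        target = replica(v) if v in replicated else v
        sources = []
        for u in fanins.get(v, []):
            # A slowest-path edge is re-sourced from the replica of u;
            # every other input keeps its original (fixed) driver.
            if (u, v) in tree_edges and u in replicated:
                sources.append(replica(u))
            else:
                sources.append(u)
        tree[target] = sources
    return tree


# Hypothetical netlist: d drives both c and b, which creates reconvergence.
fanins = {"a": ["b"], "b": ["c", "d"], "c": ["d", "g"], "d": ["e", "i"], "e": ["h"]}
slow_edges = {("e", "d"), ("d", "c"), ("c", "b"), ("b", "a")}
tree = build_replication_tree(fanins, slow_edges, {"b", "c", "d", "e"}, "a")
for node, sources in tree.items():
    print(node, "<-", sources)
```

In this example each replica drives exactly one node, so the replicas form the internal vertices of a tree, while the original d keeps driving b′ as a fixed leaf.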
2 Lower-bounding Clock Period
The achievable clock period of a logic path between two fixed flip-flop (FF)/pad points must be accurately estimated. We use a new lower-bounding technique that guides the embedder not to over-optimize the paths in a replication tree. Let $D_i$ be the lower bound on the delay from input node $i$ to the root, $n$ be the number of look-up tables (LUTs) along the path from $i$ to the root, and $l$ be the rectilinear distance between $i$ and the root. The lower-bounding technique in the experimental setup of [1] and [2] was based on the following formula:

$$D_i = d_1 \cdot n + d_2 \cdot l \qquad (1)$$

where $d_1$ is a cell delay and $d_2$ is a unit wire delay. This formula correctly estimates the lower bound for most of the paths. For example, in Figure 2-(a), the LUT count of the path is 2 and the distance between the endpoints is 5: $D_i = d_1 \cdot 2 + d_2 \cdot 5$. We noted, however, that the rectilinear distance between a source and a sink can be so short that the logic on the path is forced to detour, and Formula (1) does not capture this detour. For example, in Figure 2-(b), Formula (1) gives $D_j = d_1 \cdot 6 + d_2 \cdot 5$, but a better estimate is $d_1 \cdot 6 + d_2 \cdot 7$. We introduce a new formula that tightens the bound:

$$D_i = d_1 \cdot n + d_2 \cdot (n + 1) + d_2 \cdot \max(0,\; l - (n + 1)) \qquad (2)$$
Fig. 2. Delay estimate on various paths (panels (a), (b), and (c); path endpoints labeled i, j, and k).
In this formula, we break down the delay for a logic path into three components: (1) intrinsic LUT delay, (2) intrinsic LUT-to-LUT interconnect delay (since LUTs cannot be placed on top of each other), and (3) extra interconnect delay (if there is more distance between the endpoints than required for LUT abutments, then there is extra interconnect delay). The new estimate of the path in Figure 2-(b) is $D_j = d_1 \cdot 6 + d_2 \cdot 7 + d_2 \cdot \max(0, 5 - 7) = d_1 \cdot 6 + d_2 \cdot 7$. The formula also computes the correct delay for the path in Figure 2-(a): $D_i = d_1 \cdot 2 + d_2 \cdot 3 + d_2 \cdot \max(0, 5 - 3) = d_1 \cdot 2 + d_2 \cdot 5$. When the endpoints are pads on the same boundary, we adjust $l$ to $l \leftarrow l + 2$ so that only the distance of routable paths is considered (Figure 2-(c)).
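Formulas (1) and (2) can be checked numerically with a small sketch. The function names and the delay values d1 = 10 and d2 = 1 are illustrative assumptions; the paper does not state concrete delay values.

```python
def lower_bound_basic(n, l, d1, d2):
    """Formula (1): n LUT delays plus wire delay over the rectilinear distance l."""
    return d1 * n + d2 * l


def lower_bound_tight(n, l, d1, d2, same_boundary_pads=False):
    """Formula (2): LUT delay, LUT-to-LUT abutment wiring, and any extra wiring."""
    if same_boundary_pads:
        l = l + 2                      # adjustment l <- l + 2 for pads on one boundary
    abutment = n + 1                   # n LUTs in a row need at least n + 1 wire units
    return d1 * n + d2 * abutment + d2 * max(0, l - abutment)


D1, D2 = 10, 1                         # assumed cell delay and unit wire delay

# Figure 2-(a): n = 2, l = 5; both formulas agree.
print(lower_bound_basic(2, 5, D1, D2), lower_bound_tight(2, 5, D1, D2))
# Figure 2-(b): n = 6, l = 5; Formula (2) accounts for the forced detour.
print(lower_bound_basic(6, 5, D1, D2), lower_bound_tight(6, 5, D1, D2))
```

With these values the two bounds coincide for the path of Figure 2-(a), while for Figure 2-(b) the tightened bound is larger by 2·d2, matching the worked examples in the text.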
3 Steiner Arborescence
As noted in [4], there are situations in which the basic approach in [2] fails to yield significant additional improvement, because near-critical paths that are not in the slowest-paths tree can dominate once a small reduction in the delays of the most critical paths is achieved. Since these near-critical paths may not have many edges in the slowest-paths tree (particularly as many paths become near-critical), there is no further improvement.
3.1 Effect of Reconvergence
The limitation of the basic approach is illustrated in Figure 3. Figure 3-(a) is a sub-circuit that came from an actual run on circuit misex3 in the MCNC benchmarks, at a point when the basic embedder could not improve the clock period further. Node a is the chosen critical sink. The path from h to a is the slowest path. The arrival time of the signal from i to d is very close to the arrival time of the signal from e to d, so the path that goes through i (i.e., f → j → i → d) is also included in the slowest-paths tree. Edges (i, f) and (j, e) are other incoming signals to the internal nodes (for clarity of explanation, the other non-critical nodes and edges that provide signals to the internal nodes are not shown in the figure). Note that the paths converge at f. The replication tree of the sub-circuit produced by the tree construction procedure is shown in Figure 3-(b). There are two copies of f: a movable f′ and a fixed f, where the convergence breaks.

Fig. 3. Reconvergence effect on a replication tree. (a) Selected sub-circuit (simplified by omitting some of the non-critical branches). (b) Replication tree.

The basic tree embedding algorithm computes cost/delay trade-off solutions in a bottom-up fashion. The intermediate solution set for the subtree consisting of {d′, e′, f′, g′, h, i, j} contains some improved embeddings. These embeddings, however, are discarded, as the path from f to d will not be changed (it is already monotone) and the arrival time of the signal from i dominates most of the improved arrival times of the signal from e′. The final embedding that the embedder returns places the movable nodes at their original locations; the placement of the sub-circuit remains unchanged (Figure 3-(a)).

In [4], this reconvergence issue was addressed by using a modified timing objective: a lexicographic ordering on the largest arrival times was used so that some paths can still be improved in a single iteration even if the arrival time at the output is not reduced. Thus, over multiple iterations more paths can be sped up and the clock period reduced. The lexicographic approach, however, incurs a runtime overhead. Also, the reconvergence issue is not well understood in general, so we have studied and experimented with simpler strategies.
3.2 Steiner Arborescence Embedding
The rectilinear Steiner arborescence (RSA) problem was investigated in [7] and [5], and reviewed in [6]. The RSA is of interest since it straightens not only the critical path but also the other paths in a tree. In our context, the topological structure is fixed, so we modify the RSA problem as follows:

Formulation 1. Given a non-embedded tree with fixed input nodes and a fixed root, find an embedding in the layout area such that each path from a leaf to the root is monotone.

The Steiner arborescence embedding is illustrated in Figure 4. The topological structure of a fanin tree is shown in Figure 4-(a). Suppose the critical path of this tree is the path from c to f. The placement of the minimum wirelength embedding is shown in Figure 4-(b); square nodes are fixed and circular nodes are movable. The basic tree embedding algorithm in [2] will return this embedding as the best solution because the critical path delay is minimum and the wiring cost is also minimum. In this embedding, however, not all the source-to-sink paths are monotone, e.g., the path from d to f is not monotone. Steiner arborescence embedding, in contrast, will produce a solution where all the paths are monotone (Figure 4-(c)).

Fig. 4. Fanin tree with a min wirelength embedding and an arborescence embedding. (a) Topological structure. (b) Min-WL embedding. (c) Arborescence embedding.

An arborescence need not be tied to the geometric interpretation implied in the figure; if one can determine the minimum achievable delay, $D_i$, with respect to each input $i$, the embedding formulation can be to minimize cost subject to the minimum delay being a constraint for each input. We therefore solve the Steiner arborescence embedding problem with the existing tool, the basic tree embedder. We replace the arrival time of each input node $i$ with $-D_i$, where $D_i$ is the minimum possible delay from the input node $i$ to the root (Section 2 explains how we compute this delay). When we run the existing algorithm on the instance with the replaced arrival times, the algorithm returns the min-cost solution achieving an arrival time of 0 (in addition to solutions with larger arrival times and lower cost). In Figure 3, we have seen that the basic embedder could not optimize the critical path.
Now, if we invoke the new Steiner arborescence embedder on the same tree, we can obtain an embedding that optimizes the critical path (Figure 5-(a)).
Fig. 5. Effect of Steiner arborescence embedding. (a) Arborescence embedding. (b) Generalized arborescence embedding.
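The arrival-time substitution of Section 3.2 can be illustrated with a minimal sketch. It does not reproduce the embedder itself; the helper names and the numeric delays are assumptions. The sketch shows why an arrival time of 0 at the root corresponds to every leaf-to-root path meeting its lower bound, and how monotonicity of an embedded path can be checked under a Manhattan-distance model.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def is_monotone(path):
    """A path through the given cell locations is monotone when its total
    rectilinear length equals the distance between its two endpoints."""
    hops = sum(manhattan(a, b) for a, b in zip(path, path[1:]))
    return hops == manhattan(path[0], path[-1])


def arborescence_arrivals(min_delay):
    """Replace each input's arrival time by -D_i."""
    return {i: -d for i, d in min_delay.items()}


def root_arrival(arrivals, path_delay):
    """Arrival time at the root: latest input arrival plus its path delay."""
    return max(arrivals[i] + path_delay[i] for i in arrivals)


min_delay = {"c": 25, "d": 14}                        # assumed lower bounds D_i
arrivals = arborescence_arrivals(min_delay)
print(root_arrival(arrivals, {"c": 25, "d": 14}))     # every path at its bound
print(root_arrival(arrivals, {"c": 25, "d": 16}))     # the path from d detours
print(is_monotone([(0, 0), (2, 0), (2, 3)]), is_monotone([(0, 0), (3, 0), (2, 3)]))
```

Any embedding in which some path exceeds its lower bound yields a positive root arrival time, so asking the existing embedder for arrival time 0 selects embeddings in which all paths are at their bounds.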
3.3 Generalized Arborescence Embedding
As one can see in Figure 3, Steiner arborescence embedding incurs more wiring and replication costs. In order to avoid over-optimization, we loosen the delay constraint on early-arriving inputs. Let $A_i$ be the arrival time of an input node $i$, and $LB$ be the minimum achievable clock period of a given circuit. We replace the arrival time of $i$ with $-\max(D_i, LB - A_i)$. This means that if the delay of the signal path that goes through $i$ to the root does not exceed $LB$, then we allow the path to detour within the extra budget. Figure 5-(b) shows the generalized Steiner arborescence embedding on the replication tree. Here the critical path from h to a detours because the clock period of the path does not violate $LB$, and the detour saves the wiring costs of (i, f) and (j, e′).

In the new flow, we invoke a generalized Steiner arborescence embedding when the conventional flow saturates (i.e., when the circuit clock period is not improved over several iterations). This optimization reduces the delay of the critical path and the number of paths that are near-critical. After application of an arborescence embedding, we return to the conventional formulation and further improve the clock period.
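The loosened constraint can be made concrete with a short sketch. The function names and the numbers are illustrative assumptions; only the expression −max(D_i, LB − A_i) comes from the text.

```python
def generalized_arrivals(min_delay, arrival, lb):
    """Replace the arrival time of each input i by -max(D_i, LB - A_i)."""
    return {i: -max(min_delay[i], lb - arrival[i]) for i in min_delay}


def detour_slack(min_delay, arrival, lb):
    """Extra delay each input's path may spend on detours beyond its bound D_i."""
    return {i: max(min_delay[i], lb - arrival[i]) - min_delay[i] for i in min_delay}


# Assumed example: LB = 100; input k arrives early, inputs h and l arrive late.
min_delay = {"h": 20, "k": 25, "l": 20}
arrival = {"h": 80, "k": 30, "l": 90}
print(generalized_arrivals(min_delay, arrival, 100))
print(detour_slack(min_delay, arrival, 100))
```

In this example only the early-arriving input k receives a detour budget (45 delay units); inputs whose paths cannot finish before LB even at their bound D_i keep the strict constraint of Section 3.2.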
3.4 Complementary Techniques
The fanin tree embedder in [1] and [4] was not very sophisticated about assessing wirelength impact. In addition, the new Steiner arborescence embedding incurs more wiring cost. We make further enhancements to the embedder so as to better manage wirelength during the course of the algorithm.

Fig. 6. Fanout partitioning.
3.5 Fanout Partitioning
When we are embedding a replication tree, we need to decide whether a cell in the tree can be moved or should be replicated. The fanouts of the cell can get a signal either from the original cell or from the (temporary) replicated and optimally placed cell. If all the fanouts can get a signal from the replicated cell without violating certain criteria such as clock period, we can delete the original cell: the subject cell is moved. If not, we should keep both copies: the subject cell has been replicated. Distributing the fanouts among the logically equivalent cells is called fanout partitioning and is illustrated in Figure 6. Cell u′ is a clone of cell u that is optimally placed; cells $v_i$ are the fanouts of u. In this example, fanouts v1 and v2 stay with u, and fanouts v3 and v4 get a signal from u′.

The partitioning approach in [1] and [4] was based on delay only: it moves a fanout of the original cell to the replicated cell if the move does not degrade the arrival time of the fanout. This approach is simple and fast, but it usually degrades the wirelength of a circuit. In the new partitioning approach, we take the half-perimeter wirelength (HPWL) into account. First, we move fanouts $v_i$ of cell u to u′ as long as the arrival time of $v_i$ is not worsened. This step lets us check whether we can move all the fanouts, which eliminates the cost of keeping u and reduces the HPWL of its outgoing and incoming nets. Second, among the fanouts that have moved to u′, we pick the fanout v whose move back to u yields the maximum HPWL gain, and move it back. We perform this processing for all the fanouts of u′. After all the possible moves, we pick the maximum-gain move sequence. Last, we repeat the second step for the remaining fanouts of u for any further wirelength improvement.
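The first two steps of the HPWL-aware partitioning can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: delay is linear in Manhattan distance, only the two output nets are measured, a move back to u is allowed only if the fanout still meets a hypothetical required time, and the final repetition step is omitted. All names and coordinates are invented for the example.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def hpwl(src, sinks):
    """Half-perimeter wirelength of the net driven by src (0 for an empty net)."""
    if not sinks:
        return 0
    xs = [src[0]] + [s[0] for s in sinks]
    ys = [src[1]] + [s[1] for s in sinks]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def partition_fanouts(u, u_new, arr_u, arr_new, fanouts, required, d2=1):
    """Split fanouts between the original cell u and its replica u_new."""
    def arrival(src, t_src, v):
        return t_src + d2 * manhattan(src, fanouts[v])

    def total(old, new):
        return (hpwl(u, [fanouts[v] for v in old]) +
                hpwl(u_new, [fanouts[v] for v in new]))

    # Step 1 (delay only): move a fanout if its arrival time is not worsened.
    new = [v for v in fanouts if arrival(u_new, arr_new, v) <= arrival(u, arr_u, v)]
    old = [v for v in fanouts if v not in new]
    if not old:
        return [], new                 # u can be deleted: the cell is simply moved

    # Step 2: tentatively move fanouts back in max-gain order, keep the best prefix.
    best = (total(old, new), list(old), list(new))
    candidates = [v for v in new if arrival(u, arr_u, v) <= required[v]]
    while candidates:
        v = min(candidates,
                key=lambda c: total(old + [c], [w for w in new if w != c]))
        old.append(v)
        new.remove(v)
        candidates.remove(v)
        cost = total(old, new)
        if cost < best[0]:
            best = (cost, list(old), list(new))
    return best[1], best[2]


fanouts = {"v1": (1, 0), "v2": (0, 2), "v3": (7, 1), "v4": (5, 3)}
required = {v: 20 for v in fanouts}
stay, moved = partition_fanouts((0, 0), (6, 0), 10, 5, fanouts, required)
print(sorted(stay), sorted(moved))
```

In this example the delay-only step moves v1 to the replica, and the HPWL step moves it back, which gives the same partition as in Figure 6 (v1 and v2 stay with u, v3 and v4 are driven by u′).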
3.6 Cell Relocation
Once a replication occurs, it is often the case that a simple move of the source cell of the replication can reduce wirelength. For example, consider the fanout partition {v1, v2} and {v3, v4} in Figure 6: when fanouts v3 and v4 are no longer tied to cell u, we can relocate u to a better location where the wirelength is reduced without degrading the clock period. We use a simple heuristic that relocates the source cell. We first limit the target region to the region bounded by the fanouts and fanins of the original cell. Then we scan the region and pick the location where the HPWL of the incoming and outgoing nets of the cell is reduced and the arrival time of $v_i$ is not worsened.

3.7 Cell Unification
As we perform replication tree embeddings and placement legalizations over iterations, the placement of cells is perturbed and some logically equivalent cells migrate toward each other. In [2], the embedder unified the equivalent cells only when one of them was on the selected critical path. It, however, left some of the equivalent cell sets untouched, as they were not selected. In our new embedder, we go through all of the equivalent cell sets and perform pair-wise fanout partitioning so that better fanout partitions or cell unifications are obtained. We invoke the unification procedure as a post-processing step when the conventional flow saturates.
3.8 Replication Cost
One of the capabilities of the embedding algorithm is its ability to incorporate various costs, including wiring cost, placement cost, and replication cost. The replication cost is intended to prevent excessive cell duplication. During embedding, we compute a region where a subject cell can be placed without incurring replication. In the computation, we consider the HPWL as well as the delay. Once a region is found, we impose a high cost for a placement outside the region, and a low cost for a placement within the region.
3.9 Wirelength Estimation
In [2], only cell-to-cell wiring was considered when wiring costs were computed. We note, however, that a cell with high fanout has a chance to save some wiring cost. For example, in Figure 7, node h has 5 fanouts and some part of the h-to-e wiring could be shared with other wiring in the fanout net. In the new embedder, a node (e.g., e) is allowed to receive a signal not just from the source pin (e.g., h) but from any one of the valid pins (e.g., h, f1, f2, f3, and f4), with appropriate changes made to the arrival times.

Fig. 7. Signal can be connected to any pin in a fanout net.
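The idea of tapping any pin of an existing fanout net can be sketched as follows. The pin coordinates, the required time, and the assumption that each existing fanout pin is reached from the source by a direct route with linear delay are all hypothetical; the source only states that arrival times are adjusted appropriately.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def pin_arrivals(src_name, src, t_src, fanout_pins, d2=1):
    """Arrival time at every pin of the net, assuming direct routes from the source."""
    pins = {src_name: (src, t_src)}
    for name, loc in fanout_pins.items():
        pins[name] = (loc, t_src + d2 * manhattan(src, loc))
    return pins


def best_tap(sink, pins, required, d2=1):
    """Pick the pin that adds the least wire while meeting the required time.

    Returns (pin name, added wirelength, arrival time at the sink), or None.
    """
    feasible = []
    for name, (loc, t) in pins.items():
        wire = manhattan(loc, sink)
        arrival = t + d2 * wire
        if arrival <= required:
            feasible.append((wire, arrival, name))
    if not feasible:
        return None
    wire, arrival, name = min(feasible)
    return name, wire, arrival


# Hypothetical coordinates for the net of Figure 7: source h, fanouts f1..f4, new sink e.
pins = pin_arrivals("h", (0, 0), 0, {"f1": (3, 0), "f2": (3, 2), "f3": (1, 4), "f4": (5, 1)})
print(best_tap((4, 3), pins, required=7))
```

In this example, connecting e to f2 adds 2 units of wire instead of the 7 units needed for a separate connection from h, with the same arrival time at e.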
4 Experimentation
4.1 Delay Model
In our experimentation we use a placement-level delay estimator that is related to VPR [15] and is similar to [1] and [2]. The target architecture is an FPGA in which all the switches are buffered and interconnect resources are uniform. With buffered switches, RC effects are localized to switch-to-switch connections. Thus the delay of an interconnection can be approximated by a linear function of the Manhattan length of the interconnection. As an aside, because the target of the embedder is an arbitrary graph in which edges can have arbitrary delays, the embedder is also well suited to routing architectures with pre-defined and non-uniform routing resources.
4.2 Optimization Flow
First, initial placements are obtained by invoking the VPR placer in timing-driven mode. Second, in each iteration of the optimization, we start with static timing analysis in order to identify the critical sink, and we extract a fanin tree whose root is the critical sink. This tree is passed to the new embedder, which produces a set of solutions that trade off between cost and delay. We select a solution from the trade-off curve, and analyze the circuit for possible post-processing. After the post-processing, the circuit is legalized by the timing-driven placement legalizer ([1] and [2]). Third, the optimization phase is iterated a given number of times, and the best placement is selected. Last, the selected placement is routed by the VPR router in timing-driven mode.

Table 1 Comparison between Timing-Driven VPR, RT, Lex-3, and Arbor (circuit data and Timing-Driven VPR results; critical path in ns)

| Circuit | Size | Density | I/O | FF | VPR W∞ [ns] | VPR Wls [ns] | VPR wirelength | VPR blocks |
|---|---|---|---|---|---|---|---|---|
| ex5p | 35x35 | 0.87 | 71 | 0 | 65.80 | 66.47 | 20086 | 1135 |
| tseng | 35x35 | 0.85 | 174 | 385 | 53.53 | 54.84 | 9692 | 1221 |
| apex4 | 38x38 | 0.87 | 28 | 0 | 72.81 | 74.20 | 21660 | 1290 |
| misex3 | 40x40 | 0.87 | 28 | 0 | 76.32 | 78.90 | 22239 | 1425 |
| alu4 | 41x41 | 0.91 | 22 | 0 | 76.00 | 76.73 | 21573 | 1544 |
| diffeq | 41x41 | 0.89 | 103 | 377 | 62.71 | 64.65 | 14614 | 1600 |
| dsip | 54x54 | 0.47 | 426 | 224 | 65.38 | 66.61 | 17642 | 1796 |
| seq | 44x44 | 0.90 | 76 | 0 | 80.42 | 80.76 | 27789 | 1826 |
| apex2 | 46x46 | 0.89 | 41 | 0 | 100.05 | 100.87 | 30995 | 1919 |
| s298 | 47x47 | 0.87 | 10 | 8 | 123.82 | 125.78 | 21844 | 1941 |
| des | 63x63 | 0.40 | 501 | 0 | 90.44 | 91.31 | 27861 | 2092 |
| bigkey | 54x54 | 0.59 | 426 | 224 | 62.77 | 64.23 | 20562 | 2133 |
| frisc | 63x63 | 0.90 | 136 | 886 | 121.64 | 125.46 | 61130 | 3692 |
| spla | 64x64 | 0.90 | 62 | 0 | 117.04 | 121.06 | 68663 | 3752 |
| elliptic | 64x64 | 0.88 | 245 | 1122 | 108.95 | 112.08 | 51240 | 3849 |
| ex1010 | 72x72 | 0.89 | 20 | 0 | 171.08 | 175.05 | 70632 | 4618 |
| pdc | 71x71 | 0.91 | 56 | 0 | 146.78 | 149.39 | 108292 | 4631 |
| s38417 | 84x84 | 0.91 | 135 | 1463 | 97.80 | 99.09 | 63968 | 6541 |
| s38584.1 | 85x85 | 0.89 | 342 | 1260 | 94.96 | 95.54 | 58034 | 6789 |
| clma | 97x97 | 0.89 | 144 | 33 | 240.90 | 240.52 | 141747 | 8527 |
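The trade-off set produced by the embedder in each iteration is built from (cost, time) signatures, of which only non-dominated ones are kept. A minimal sketch of that pruning and of combining the solution lists of two fanin sub-trees is given below; the function names, the additive cost model, and the sample numbers are assumptions, and the propagation over the embedding graph is omitted.

```python
def prune(solutions):
    """Keep only non-dominated (cost, time) signatures; lower is better in both."""
    kept = []
    for cost, time in sorted(set(solutions)):
        # Sorted by cost, so a solution survives only if it is strictly faster
        # than every cheaper solution kept so far.
        if not kept or time < kept[-1][1]:
            kept.append((cost, time))
    return kept


def combine(list_a, list_b, cell_delay=0):
    """Join the solution lists of two fanin sub-trees at a common parent cell.

    Costs add up; the parent's output time is the later input plus cell_delay.
    """
    joined = [(ca + cb, max(ta, tb) + cell_delay)
              for ca, ta in list_a
              for cb, tb in list_b]
    return prune(joined)


print(prune([(10, 50), (12, 45), (11, 52), (15, 45), (20, 40)]))
print(combine([(1, 10), (3, 6)], [(2, 8), (5, 4)]))
```

The resulting list is ordered by increasing cost and decreasing time, which is the trade-off curve from which one solution is selected.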
Table 2 Comparison between Timing-Driven VPR, RT, Lex-3, and Arbor (continued): RT Embedding and Lex-3 Embedding (all values normalized to VPR)

| Circuit | RT W∞ | RT Wls | RT wirelength | RT blocks | Lex-3 W∞ | Lex-3 Wls | Lex-3 wirelength | Lex-3 blocks |
|---|---|---|---|---|---|---|---|---|
| ex5p | 0.918 | 1.225 | 1.203 | 1.042 | 0.890 | 0.968 | 1.295 | 1.085 |
| tseng | 0.939 | 0.935 | 1.057 | 1.009 | 0.939 | 0.946 | 1.117 | 1.020 |
| apex4 | 0.863 | 0.880 | 1.216 | 1.025 | 0.853 | 1.041 | 1.244 | 1.032 |
| misex3 | 0.782 | 0.782 | 1.054 | 1.004 | 0.730 | 0.842 | 1.217 | 1.027 |
| alu4 | 0.840 | 1.037 | 1.149 | 1.016 | 0.855 | 0.944 | 1.127 | 1.012 |
| diffeq | 0.954 | 0.955 | 1.099 | 1.004 | 0.948 | 0.922 | 1.069 | 1.006 |
| dsip | 0.744 | 0.817 | 1.436 | 1.001 | 0.731 | 1.185 | 1.577 | 1.001 |
| seq | 0.780 | 0.961 | 1.127 | 1.012 | 0.795 | 0.819 | 1.102 | 1.007 |
| apex2 | 0.803 | 0.815 | 1.100 | 1.010 | 0.785 | 0.799 | 1.102 | 1.011 |
| s298 | 0.913 | 0.924 | 1.084 | 1.002 | 0.872 | 0.882 | 1.129 | 1.002 |
| des | 0.896 | 0.909 | 1.020 | 1.000 | 0.876 | 0.972 | 1.018 | 1.002 |
| bigkey | 0.842 | 0.972 | 1.230 | 1.000 | 0.819 | 1.009 | 1.307 | 1.000 |
| frisc | 0.974 | 0.963 | 1.012 | 1.001 | 0.964 | 0.946 | 1.031 | 1.006 |
| spla | 0.824 | 0.864 | 1.179 | 1.024 | 0.780 | 0.844 | 1.176 | 1.025 |
| elliptic | 0.778 | 0.779 | 1.081 | 1.008 | 0.753 | 0.896 | 1.128 | 1.011 |
| ex1010 | 0.806 | 1.190 | 1.267 | 1.030 | 0.770 | 1.061 | 1.245 | 1.010 |
| pdc | 0.807 | 0.858 | 1.058 | 1.014 | 0.708 | 0.794 | 1.152 | 1.016 |
| s38417 | 0.878 | 0.965 | 1.037 | 1.004 | 0.842 | 0.929 | 1.040 | 1.007 |
| s38584.1 | 0.863 | 0.878 | 1.023 | 1.000 | 0.832 | 0.862 | 1.129 | 1.001 |
| clma | 0.630 | 0.661 | 1.102 | 1.010 | 0.622 | 0.655 | 1.102 | 1.006 |
| average | 0.842 | 0.918 | 1.127 | 1.011 | 0.818 | 0.916 | 1.165 | 1.014 |
4.3 Experimental Results
The VPR placement tool in default mode sets the number of rows and the number of columns in the FPGA logic array to the minimum required to fit a circuit. This minimum logic array size was used in the experiments of [1] and [2]. It is, however, rare that all logic and routing resources are 100% utilized. We adjusted the FPGA size so that the circuits have at least 10% white space: the VPR placement options -nx and -ny were used in order to adjust the number of rows and the number of columns. The circuits were routed with a number of tracks per channel that is about 20% more than the minimum required, which follows the definition of low-stress routing in [15].

Table 3 Comparison between Timing-Driven VPR, RT, Lex-3, and Arbor (continued): Arbor Embedding (W∞, Wls, wirelength, and blocks normalized to VPR)

| Circuit | W∞ | Wls | wirelength | blocks | Delay/LB |
|---|---|---|---|---|---|
| ex5p | 0.876 | 0.924 | 1.142 | 1.031 | 1.265 |
| tseng | 0.916 | 0.919 | 1.104 | 1.002 | 1.162 |
| apex4 | 0.822 | 0.863 | 1.142 | 1.018 | 1.254 |
| misex3 | 0.720 | 0.729 | 1.167 | 1.020 | 1.072 |
| alu4 | 0.852 | 0.885 | 1.098 | 1.019 | 1.295 |
| diffeq | 0.885 | 0.889 | 1.130 | 1.001 | 1.008 |
| dsip | 0.687 | 0.776 | 1.245 | 1.000 | 1.000 |
| seq | 0.769 | 0.801 | 1.130 | 1.005 | 1.071 |
| apex2 | 0.764 | 0.771 | 1.137 | 1.009 | 1.159 |
| s298 | 0.831 | 0.834 | 1.174 | 1.000 | 1.685 |
| des | 0.877 | 0.902 | 1.034 | 1.000 | 1.000 |
| bigkey | 0.800 | 0.889 | 1.210 | 1.005 | 1.000 |
| frisc | 0.891 | 0.878 | 1.072 | 1.002 | 1.093 |
| spla | 0.770 | 0.803 | 1.095 | 1.003 | 1.240 |
| elliptic | 0.734 | 0.737 | 1.123 | 1.002 | 1.082 |
| ex1010 | 0.775 | 0.819 | 1.088 | 1.002 | 1.718 |
| pdc | 0.708 | 0.746 | 1.073 | 1.004 | 1.346 |
| s38417 | 0.820 | 0.878 | 1.046 | 1.001 | 1.021 |
| s38584.1 | 0.842 | 0.872 | 1.064 | 1.000 | 1.000 |
| clma | 0.582 | 0.601 | 1.092 | 1.001 | 1.349 |
| average | 0.796 | 0.826 | 1.118 | 1.006 | |
In our experiments, we compared the generalized Steiner arborescence embedding, named Arbor, with the timing-driven VPR [15], the basic embedding (RT) [2], and the Lex-3 embedding [4]. In Arbor, the generalized arborescence embeddings were invoked only when the conventional flow saturated. The main criteria of interest are the clock period, wirelength, the number of logic blocks, and the delay lower bound. Tables 1 through 3 show the experimental results for 20 MCNC benchmark circuits. We used the timing-driven VPR placer to obtain initial placements of the benchmark circuits. The first data set shows the design density and the I/O information of the circuits. In the second data set, we ran the timing-driven VPR router on the initial placement. In the third and fourth data sets, the initial placements were optimized by RT and Lex-3 before invoking the VPR router. These values are normalized to VPR. The last data set shows the optimization data for the Arbor embedder. W∞ denotes the estimated delay where infinite routing resources are assumed to be available. Wls represents the low-stress routed delay. The average W∞ reduction of RT, Lex-3, and Arbor over VPR was 15.8%, 18.2%, and 20.4%, respectively. The difference between Lex-3/RT and Arbor looks small, but note that some of the circuits already meet or are close to the delay LB. The column labeled “Delay/LB” is the ratio of the placement-level delay achieved by Arbor to the placement-level delay lower bound. Circuits dsip, des, bigkey, and s38584.1 reached the theoretical LB; they could not be improved further for the given fixed FFs. Circuits diffeq and s38417 were very close to the LB. The largest W∞ reduction of Arbor over VPR was 41.8% (clma, which is the biggest circuit in the benchmarks). The average Wls reduction of RT, Lex-3, and Arbor over VPR was 8.2%, 8.4%, and 17.4%, respectively.
Arbor showed higher improvement in routed delays due to its better wirelength management (RT and Lex-3 used excessive wiring on low design density circuits, like dsip and bigkey). We observed that the new cell unification technique effectively merges the logically equivalent cells. We also think that Arbor optimizes near-critical paths more, so the router has less difficulty routing the most critical path.
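The percentage reductions quoted in this section follow directly from the normalized averages in the tables, as the short check below illustrates. The dictionary layout and the function name are assumptions made for the example; the numbers are the "average" and clma rows of the tables.

```python
# Normalized averages (relative to VPR) taken from the "average" rows of the tables.
averages = {
    "RT":    {"W_inf": 0.842, "W_ls": 0.918},
    "Lex-3": {"W_inf": 0.818, "W_ls": 0.916},
    "Arbor": {"W_inf": 0.796, "W_ls": 0.826},
}


def reduction_percent(normalized):
    """Convert a value normalized to VPR into a percentage reduction over VPR."""
    return round((1.0 - normalized) * 100.0, 1)


for name, row in averages.items():
    print(name, reduction_percent(row["W_inf"]), reduction_percent(row["W_ls"]))

# Best case for Arbor (circuit clma): normalized W_inf 0.582 and W_ls 0.601.
print(reduction_percent(0.582), reduction_percent(0.601))
```

The last line reproduces the 41.8% largest W∞ reduction reported above and the 39.9% largest routed-delay reduction quoted in the abstract.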
5 Discussion
The runtime overhead of our algorithm was modest: it was less than 15% of the runtime of the VPR placer. We made a couple of experimental observations. When the strict Steiner arborescence embedding was invoked over all the iterations, it optimized the critical path at the expense of a lot of resources: it incurred many replicated cells and much wiring overhead. It was not able to improve the clock period of the circuits. When we invoked the generalized Steiner arborescence embedding over all the iterations, we could obtain better W∞ than the reported data. It, however, still used a good number of resources, and the routed delay (Wls) was not as good as the estimated delay.
6 Conclusions
We have presented techniques for improved timing-driven, placement-coupled logic replication. We described and generalized the rectilinear Steiner arborescence for the fanin tree embedding problem. The Steiner arborescence was shown to be useful for overcoming the issues caused by reconvergence in a circuit netlist specification. Fanout partitioning (cell unification), cell relocation, and wirelength estimation techniques were discussed as complementary improvement techniques. These techniques were implemented and experimented with in the FPGA domain. In many cases we were able to approach a fixed flip-flop lower bound on the achievable clock period. Promising experimental results were reported: an average 17.4% delay reduction compared with the timing-driven VPR and an average 9.3% reduction compared with the basic embedder.
References
[1] G. Beraudo and J. Lillis, “Timing Optimization of FPGA Placements by Logic Replication,” ACM/IEEE DAC, 2003.
[2] M. Hrkić, J. Lillis, and G. Beraudo, “An Approach to Placement-Coupled Logic Replication,” ACM/IEEE DAC, 2004.
[3] M. Hrkić and J. Lillis, “S-Tree: A technique for buffered routing tree synthesis,” ACM/IEEE DAC, 2002.
[4] M. Hrkić and J. Lillis, “Addressing the Effects of Reconvergence on Placement-Coupled Logic Replication,” IWLS, 2004.
[5] J. Cong, K. S. Leung, and D. Zhou, “Performance-Driven Interconnect Design Based on Distributed RC Delay Model,” ACM/IEEE DAC, 1993.
[6] A. Kahng and G. Robins, “On Optimal Interconnections for VLSI,” Kluwer Academic Publishers, 1995.
[7] S. K. Rao, P. Sadayappan, F. K. Hwang, and P. W. Shor, “The Rectilinear Steiner Arborescence Problem,” Algorithmica, 1992.
[8] L. T. Liu, M. T. Kuo, C. K. Cheng, and T. C. Hu, “A Replication Cut for Two-Way Partitioning,” IEEE Transactions on CAD, 1995.
[9] W. K. Mak and D. F. Wong, “Minimum replication min-cut partitioning,” IEEE Transactions on CAD, 1997.
[10] J. Lillis, C.-K. Cheng, and T.-T. Y. Lin, “Algorithms for Optimal Introduction of Redundant Logic for Timing and Area Optimization,” IEEE ISCAD, 1995.
[11] A. Srivastava, R. Kastner, and M. Sarrafzadeh, “Timing Driven Gate Duplication: Complexity Issues and Algorithms,” ICCAD, 2000.
[12] W. Gosti, A. Narayan, R.K. Brayton, and A.L. Sangiovanni-Vincentelli, “Wireplanning in Logic Synthesis,” ICCAD, 1998.
[13] W. Gosti, S.P. Khatri, and A.L. Sangiovanni-Vincentelli, “Addressing The Timing Closure Problem By Integrating Logic Optimization and Placement,” ICCAD, 2001.
[14] Y. Kukimoto, R.K. Brayton, and P. Sawkar, “Delay-optimal technology mapping by DAG covering,” DAC, 1998.
[15] A. Marquardt, V. Betz, and J. Rose, “Timing-Driven Placement for FPGAs,” ACM/SIGDA ISFPGAs, 2000.
[16] K. Schabas and S. D. Brown, “Using Logic Duplication to Improve Performance in FPGAs,” ACM/SIGDA ISFPGAs, 2003.
[17] G. Chen and J. Cong, “Simultaneous timing-driven placement and duplication,” ACM/SIGDA ISFPGAs, 2005.
[18] K. Keutzer, “DAGON: Technology Binding and Local Optimization by DAG Matching,” ACM/IEEE DAC, 1987.
[19] S.-W. Hur, A. Jagannathan, and J. Lillis, “Timing-Driven Maze Routing,” IEEE Transactions on CAD, 2000.