Techniques for Improved Placement-Coupled Logic Replication


Hosung (Leo) Kim a,∗, John Lillis a, and Miloš Hrkić b

a University of Illinois at Chicago, Dept. of Computer Science, Chicago, IL 60607
b IBM Corporation, East Fishkill, NY 12533

Abstract

Several recent papers have demonstrated the potential of logic replication driven by placement-level timing analysis for improving clock period. In this paper we propose a number of techniques aimed at more fully realizing this potential within the framework employed in Hrkić, Lillis, and Beraudo (2004, DAC). There are situations in which the basic approach fails to yield significant additional improvement, due largely to the effects of reconvergence in the netlist. We suggest the use of rectilinear Steiner arborescence embedding as a tool for overcoming this limitation. We also propose techniques for fanout partitioning and cell relocation which are cognizant of both wirelength and timing impact for improved solution quality. We have implemented and experimented with these techniques in the FPGA domain. Promising experimental results are reported, with an average 17.4% (up to 39.9%) clock period reduction compared with the timing-driven placement from VPR.

Key words: Timing Optimization, Logic Replication, Placement, FPGA

1 Introduction

Logic replication has been shown to be a useful technique for achieving certain design goals while maintaining the logical behavior of the netlist. This idea has been exploited in different contexts including min-cut partitioning (e.g., [8] and [9]), high fanout logic cell replication (e.g., [10] and [11]), and physical-level interconnect-dominated delay optimization (e.g., [1], [2], [4], [12], and [13]). It is this last category of optimization which is the focus of this paper.

∗ Corresponding author. Email addresses: [email protected] (Hosung (Leo) Kim), [email protected] (John Lillis), [email protected] (Miloš Hrkić).

Preprint submitted to Elsevier, 14 December 2006

A typical placement contains combinational paths which are non-monotone, i.e., when tracing the path in the placement there are detours with respect to the locations of the endpoints of the path. This situation implies additional wiring delay, and often it cannot be improved by timing-driven placement. The reason for the non-monotonicity is often that a critical path and near-critical paths share some logic cells. By replicating certain cells it is possible to decouple the paths, and thereby enable "straightening" of the paths.

In [1], a simple replication heuristic was proposed based on this idea. The heuristic made local cell duplications with the goal of straightening signal paths. This proof-of-concept work showed the potential of the replication technique: in many cases, significant improvements in clock period could be obtained with a relatively small number of logic replications.

Subsequently, in [2] a stronger approach was proposed. The key components of the approach are the replication tree and the fanin tree embedder. A replication tree is derived from a subgraph of the netlist containing only combinational paths. By replication, a fanin tree (that is, a reconvergence-free tree) is derived while maintaining the logical behavior. This idea is illustrated in Figure 1. In Figure 1-(a), static timing analysis determines that a is the critical sink of the timing graph. The solid edges represent a slowest-paths tree: the thick arrows are the slowest path, the thin solid arrows represent the other fanins of the critical nodes, and the dotted arrows are other circuit interconnects which are not in the slowest-paths tree. Notice that this tree is not a valid fanin tree due to the convergence by (c, d) and (b, d). To extract a valid fanin tree, the internal nodes on the slowest path are replicated into b′, c′, d′, and e′ in Figure 1-(b).
The edges in the replication tree are formed as follows. Let v be an original node and v′ be its replica. Also let $u_1, \dots, u_k$ be the inputs to v. If an edge $(u_i, v)$ is on the slowest path, then v′ receives its i-th input from the replica $u_i'$. Otherwise it receives its i-th input from the original $u_i$. The constructed replication tree is shown in Figure 1-(b). The reconvergence is broken since c′ receives a signal from d′ and b′ receives a signal from d. Note that the modified circuit is functionally equivalent to the original circuit and the replicated nodes form the internal vertices of a legitimate fanin tree which can be embedded (Figure 1-(c)).

[Figure 1: (a) Sub-circuit containing the slowest path; (b) Extracted replication tree; (c) Functionally equivalent circuit.]

Fig. 1. Replication Tree Construction.

Once a fanin tree is extracted, a powerful timing-driven fanin tree embedding algorithm is applied to embed the structure into a layout area, leaving the rest of the circuit fixed. This algorithm is based on dynamic programming and adapted from buffer tree synthesis [3]. It optimally solves the following problem: Given a fanin tree with fixed leaves (inputs) and a root (sink), arrival times at the leaves, a target embedding graph, and cost metrics including placement and wiring cost, find a minimum cost embedding achieving a specified arrival time upper-bound.

The algorithm starts from the leaves and propagates costs and arrival times toward the root. A candidate solution (embedding) for a sub-tree with its root u placed at vertex v in the target embedding graph is represented by a signature (cost, time). At each candidate placement, the sub-solutions of the fanin nodes of u are combined and only non-dominated solutions are kept. These lists of solutions are propagated over the embedding graph for the parent of u by a generalized version of Dijkstra's shortest path algorithm (this approach is similar to [19]). At the root of the fanin tree, a set of solutions (trade-offs between cost and arrival time) is available. Among the trade-offs, the fastest solution that doesn't exceed the minimally-achievable circuit clock period is chosen.

The philosophy behind tree embedding is that trees have separability properties which make them, in general, easier to optimize than more general graph structures. While some techniques have been developed for general directed acyclic graph (DAG) optimization (e.g., [14]), these techniques are limited in the generality of problem formulation, e.g., they assure only minimum delay, but are weak when considering cost/performance trade-offs. In addition to this property, a unique trait of the replication tree approach is that it does not limit itself to pre-existing tree structures as in some technology mappers [18].

There are a couple of other techniques that utilize logic replication to improve performance. [17] presented an algorithm that performs clustering and duplication during placement: it introduced the notions of feasible region and super feasible region to improve the critical path monotonicity. [16] proposed a packing algorithm that leaves empty basic logic elements (BLEs) in timing-critical clusters, and a placement algorithm that performs logic replication to reduce the critical path length.
In contrast to these approaches, which perform local modifications to the netlist and placement, the replication tree approach effectively performs a timing-driven placement and global routing of relatively large sub-circuits which can include many I/O paths.

While the fanin tree embedding approach showed promise, we identified some limitations, some of which were implementation oriented while others were more fundamental in nature. As reported in [4], the effects of reconvergence in the netlist can prevent the embedding algorithm of [2] from straightening a critical path. We present a rectilinear Steiner arborescence ([7], [5], and [6]) embedding approach that addresses this limitation. We also describe techniques for fanout partitioning and cell relocation which are aware of both wirelength and timing impact. Other wirelength management techniques for improved solution quality are discussed.

This paper is organized as follows. Section 2 describes the lower-bounding of clock period. In Section 3 we describe the effect of reconvergence, a rectilinear Steiner arborescence embedding, and a generalized Steiner arborescence embedding. Complementary techniques are presented in Section 3.4. The experimental results are discussed in Section 4.
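The (cost, time) signatures used by the fanin tree embedder can be illustrated with a small Python sketch. It shows only the pruning of dominated solutions and the combination of fanin sub-solutions at one candidate placement. The function names `prune` and `combine`, and the way a node's own cost and delay are added, are assumptions made for illustration; the propagation of these lists over the embedding graph in [2] is not shown.

```python
from itertools import product

def prune(solutions):
    """Keep only non-dominated (cost, time) pairs; lower is better for both."""
    kept = []
    # Sorted by cost, then time: a pair survives only if it is strictly
    # faster than every cheaper (or equally cheap) pair already kept.
    for cost, time in sorted(set(solutions)):
        if not kept or time < kept[-1][1]:
            kept.append((cost, time))
    return kept

def combine(child_lists, node_cost=0, node_delay=0):
    """Pick one sub-solution per fanin: costs add, arrival is the latest input.

    node_cost and node_delay are hypothetical per-node terms for this sketch.
    """
    joined = []
    for combo in product(*child_lists):
        cost = node_cost + sum(c for c, _ in combo)
        time = node_delay + max(t for _, t in combo)
        joined.append((cost, time))
    return prune(joined)

if __name__ == "__main__":
    left = [(1, 5), (2, 3)]
    right = [(1, 4), (3, 2)]
    print(combine([left, right], node_cost=1, node_delay=1))
```

The resulting list is a cost/arrival-time trade-off curve of the kind the text describes at the root of the fanin tree.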

2 Lower-bounding Clock Period

The achievable clock period of a logic path between two fixed flip-flop (FF)/pad points must be accurately estimated. We use a new lower-bounding technique that guides the embedder not to over-optimize the paths in a replication tree. Let $D_i$ be the lower-bound on delay from input node i to the root, n be the number of look-up tables (LUTs) along the path from i to the root, and l be the rectilinear distance between i and the root. The lower-bounding technique in the experimental setup of [1] and [2] was based on the following formula:

$$D_i = d_1 \cdot n + d_2 \cdot l \qquad (1)$$

where $d_1$ is a cell delay and $d_2$ is a unit wire delay. This formula correctly estimates the lower-bound for most of the paths. For example, in Figure 2-(a), the LUT count of the path is 2 and the distance between the endpoints is 5: $D_i = d_1 \cdot 2 + d_2 \cdot 5$. We noted, however, that the rectilinear distance between a source and a sink can be so short that the logic path must detour, and Formula (1) does not capture this detouring. For example, in Figure 2-(b), Formula (1) gives $D_j = d_1 \cdot 6 + d_2 \cdot 5$, but a better estimate is $d_1 \cdot 6 + d_2 \cdot 7$. We introduce a new formula that tightens the bound:

$$D_i = d_1 \cdot n + d_2 \cdot (n + 1) + d_2 \cdot \max(0,\, l - (n + 1)) \qquad (2)$$

[Figure 2: three example paths in panels (a), (b), and (c); path labels i, j, and k.]

Fig. 2. Delay estimate on various paths.

In this formula, we break down the delay for a logic path into three components: (1) intrinsic LUT delay, (2) intrinsic LUT-to-LUT interconnect delay (since LUTs cannot be placed on top of each other), and (3) extra interconnect delay (if there is more distance between the endpoints than required for LUT abutments, then there is extra interconnect delay). The new estimate of the path in Figure 2-(b) is $D_j = d_1 \cdot 6 + d_2 \cdot 7 + d_2 \cdot \max(0, 5 - 7) = d_1 \cdot 6 + d_2 \cdot 7$. The formula also computes the correct delay for the path in Figure 2-(a): $D_i = d_1 \cdot 2 + d_2 \cdot 3 + d_2 \cdot \max(0, 5 - 3) = d_1 \cdot 2 + d_2 \cdot 5$. When the endpoints are pads on the same boundary, we adjust l to $l \leftarrow l + 2$ so that the distance of only routable paths is considered (Figure 2-(c)).
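The two lower-bound formulas can be checked numerically with a short Python sketch. The function names and the `pads_on_same_boundary` flag are hypothetical, and the values chosen for `d1` and `d2` are arbitrary illustration values, not delays from the experiments.

```python
def lower_bound_v1(n, l, d1, d2):
    """Formula (1): cell delay per LUT plus wire delay over the distance l."""
    return d1 * n + d2 * l

def lower_bound_v2(n, l, d1, d2, pads_on_same_boundary=False):
    """Formula (2): LUT delay, LUT-to-LUT abutment wiring, extra wiring."""
    if pads_on_same_boundary:
        l = l + 2  # adjustment described for pads on the same boundary
    abutment = n + 1
    return d1 * n + d2 * abutment + d2 * max(0, l - abutment)

if __name__ == "__main__":
    d1, d2 = 10, 1  # arbitrary illustration values
    # Figure 2-(a): n = 2, l = 5; both formulas agree.
    print(lower_bound_v1(2, 5, d1, d2), lower_bound_v2(2, 5, d1, d2))
    # Figure 2-(b): n = 6, l = 5; Formula (2) is tighter.
    print(lower_bound_v1(6, 5, d1, d2), lower_bound_v2(6, 5, d1, d2))
```

With these illustration values the two formulas coincide for the path of Figure 2-(a), while for Figure 2-(b) Formula (2) adds the two extra units of wire that the LUT chain cannot avoid.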

3 Steiner Arborescence

As noted in [4], there are situations in which the basic approach in [2] fails to yield significant additional improvement because near-critical paths that are not in the slowest-paths tree can dominate once a small reduction in the delays of the most critical paths is achieved. Since these near-critical paths may not have many edges in the slowest-paths tree (particularly as many paths become near-critical), there is no improvement.

3.1 Effect of Reconvergence

The limitation of the basic approach is illustrated in Figure 3. Figure 3-(a) is a subcircuit that came from an actual run on circuit misex3 in the MCNC benchmarks, when the basic embedder could not improve the clock period further. Node a is the chosen critical sink. The path from h to a is the slowest path. The arrival time of the signal from i to d is very close to the arrival time of the signal from e to d, so the path that goes through i (i.e., f → j → i → d) is also included in the slowest-paths tree. Edges (i, f) and (j, e) are other incoming signals to the internal nodes (for clarity of explanation, the other non-critical nodes and edges that provide signals to the internal nodes are not shown in the figure). Note that the paths converge at f.

[Figure 3: (a) Selected sub-circuit (simplified by omitting some of the non-critical branches); (b) Replication tree.]

Fig. 3. Reconvergence effect on a replication tree.

The replication tree of the sub-circuit produced by the tree construction procedure is shown in Figure 3-(b). There are two copies of f: a movable f′ and a fixed f, where the convergence is broken. The basic tree embedding algorithm computes cost/delay trade-off solutions in a bottom-up fashion. The intermediate solution set for the subtree consisting of {d′, e′, f′, g′, h, i, j} contains some improved embeddings. These embeddings, however, are discarded, as the path from f to d won't be changed (it is already monotone) and the arrival time of the signal from i dominates most of the improved arrival times of the signal from e′. The final embedding that the embedder returns places the movable nodes at their original locations; the placement of the subcircuit remains unchanged (Figure 3-(a)).

In [4], this reconvergence issue was addressed by using a modified timing objective, a lexicographic ordering on the largest arrival times, so that some paths can still be improved in a single iteration even if the arrival time at the output is not reduced. Thus, over multiple iterations more paths can be sped up and the clock period reduced. The lexicographic approach, however, incurs a runtime overhead. Also, the reconvergence issue is not well understood in general, so we have studied and experimented with simpler strategies.

3.2 Steiner Arborescence Embedding

The rectilinear Steiner arborescence (RSA) problem was investigated in [7] and [5], and reviewed in [6]. The RSA is of interest since it straightens not only the critical path but also the other paths in a tree. In our context, the topological structure is fixed, so we modify the RSA problem as follows:

Formulation 1. Given a non-embedded tree with fixed input nodes and a fixed root, find an embedding in the layout area such that each path from a leaf to the root is monotone.

[Figure 4: (a) Topological structure; (b) Min-WL embedding; (c) Arborescence embedding.]

Fig. 4. Fanin tree with a min wirelength embedding and an arborescence embedding.

The Steiner arborescence embedding is illustrated in Figure 4. The topological structure of a fanin tree is shown in Figure 4-(a). Suppose the critical path of this tree is the path from c to f. The placement of the minimum wirelength embedding is shown in Figure 4-(b), where square nodes are fixed and circular nodes are movable. The basic tree embedding algorithm in [2] will return this embedding as the best solution because the critical path delay is minimum and the wiring cost is also minimum. In this embedding, however, not all the source-to-sink paths are monotone, e.g., the path from d to f is not monotone. Steiner arborescence embedding, in contrast, will produce a solution where all the paths are monotone (Figure 4-(c)).

An arborescence need not be tied to the geometric interpretation implied in the figure; if one can determine the minimum achievable delay, $D_i$, with respect to each input i, the embedding problem can be formulated as minimizing cost subject to the minimum delay being a constraint for each input. We therefore solve the Steiner arborescence embedding problem with the existing tool, the basic tree embedder. We replace the arrival time of each input node i with $-D_i$, where $D_i$ is the minimum possible delay from the input node i to the root (Section 2 explains how we compute this delay). When we run the existing algorithm on the instance with replaced arrival times, the algorithm returns the min-cost solution achieving an arrival time of 0 (in addition to solutions with larger arrival time and lower cost).

In Figure 3, we have seen that the basic embedder could not optimize the critical path.
Now, if we invoke the new Steiner arborescence embedder on the same tree, we can obtain an embedding that optimizes the critical path (Figure 5-(a)).

[Figure 5: (a) Arborescence embedding; (b) Generalized arborescence embedding.]

Fig. 5. Effect of Steiner arborescence embedding.

3.3 Generalized Arborescence Embedding

As one can see in Figure 3, Steiner arborescence embedding incurs more wiring and replication costs. In order to avoid over-optimization, we loosen the delay constraint on early arriving inputs. Let $A_i$ be the arrival time of an input node i, and LB be the minimum achievable clock period of a given circuit. We replace the arrival time of i with $-\max(D_i, LB - A_i)$. This means that if the delay of the signal path that goes through i to the root does not exceed LB, then we allow the path to detour within the extra budget. Figure 5-(b) shows the generalized Steiner arborescence embedding on the replication tree. Here the critical path, h ⇝ a, is detouring because the clock period of the path does not violate LB, and the detour saves the wiring costs of (i, f) and (j, e′).

In the new flow, we invoke a generalized Steiner arborescence embedding when the conventional flow saturates (i.e., when the circuit clock period is not improved over several iterations). This optimization reduces the delay of the critical path and the number of paths that are near-critical. After application of an arborescence embedding, we return to the conventional formulation and further improve the clock period.
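The arrival-time substitutions for the strict and generalized arborescence formulations can be written down directly. The Python sketch below uses hypothetical function names, and the `meets_target` check is an assumption that simply restates "arrival time of 0 at the root" for a single input-to-root path.

```python
def strict_arrival(D_i):
    """Strict arborescence: the input must reach the root in its minimum delay."""
    return -D_i

def generalized_arrival(D_i, A_i, LB):
    """Generalized arborescence: early inputs may use the slack up to LB."""
    return -max(D_i, LB - A_i)

def meets_target(replaced_arrival, path_delay):
    """True if the path reaches the root at time 0 or earlier."""
    return replaced_arrival + path_delay <= 0

if __name__ == "__main__":
    D_i, A_i, LB = 10, 5, 20            # arbitrary illustration values
    print(strict_arrival(D_i))           # budget equals the minimum delay
    print(generalized_arrival(D_i, A_i, LB))  # budget grows to LB - A_i
    # A detoured path of delay 14 is rejected by the strict form
    # but accepted by the generalized form.
    print(meets_target(strict_arrival(D_i), 14),
          meets_target(generalized_arrival(D_i, A_i, LB), 14))
```

For a late-arriving input, where LB - A_i is smaller than D_i, the two substitutions coincide, so only inputs with slack are allowed to detour.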

3.4 Complementary Techniques

The fanin tree embedder in [1] and [4] was not very sophisticated about assessing wirelength impact. In addition, the new Steiner arborescence embedding incurs more wiring cost. We make further enhancements to the embedder so as to better manage wirelength during the course of the algorithm.

[Figure 6: cell u, its clone u′, and fanouts v1, v2, v3, and v4.]

Fig. 6. Fanout partitioning.

3.5 Fanout Partitioning

When we are embedding a replication tree, we need to decide whether a cell in the tree can be moved or should be replicated. The fanouts of the cell can get a signal either from the original cell or from the (temporary) replicated and optimally-placed cell. If all the fanouts can get a signal from the replicated cell without violating certain criteria such as clock period, we can delete the original cell: the subject cell is moved. If not, we should keep both copies: the subject cell has been replicated. Distributing the fanouts among the logically-equivalent cells is called fanout partitioning and is illustrated in Figure 6. Cell u′ is a clone of cell u that is optimally placed; cells $v_i$ are the fanouts of u. In this example, fanouts v1 and v2 stay with u and fanouts v3 and v4 get a signal from u′.

The partitioning approach in [1] and [4] was based on delay only: it moves a fanout of the original cell to the replicated cell if the move doesn't degrade the arrival time of the fanout. This approach is simple and fast, but it usually degrades the wirelength of a circuit. In the new partitioning approach, we take the half-perimeter wirelength (HPWL) into account. First, we move fanouts $v_i$ of cell u to u′ as long as the arrival time of $v_i$ is not worsened. This step lets us check whether we can move all the fanouts, eliminate the cost of keeping u, and reduce the HPWL of its outgoing and incoming nets. Second, among the fanouts that have moved to u′, we pick a fanout v and move it back to u if the move yields the maximum HPWL gain. We perform this processing for all the fanouts of u′. After all the possible moves, we pick a max-gain move sequence. Last, we repeat the second step for the remaining fanouts of u for any further wirelength improvement.
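A simplified version of this two-step partitioning can be sketched in Python. Several things here are assumptions made for illustration and are not taken from the paper: delay is modeled as a linear function of Manhattan distance, only the two output nets are counted in the HPWL, a per-fanout required time (`required`) guards the move-back step, and a plain greedy loop stands in for the max-gain move sequence.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def hpwl(points):
    """Half-perimeter of the bounding box of a net's pins."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def partition_fanouts(u, clone, fanouts, arr_u, arr_clone, wire_delay, required):
    """fanouts maps a name to (x, y); returns (stay_with_u, moved_to_clone)."""
    def arrival(driver, driver_time, v):
        return driver_time + wire_delay * manhattan(driver, fanouts[v])

    def total_hpwl(on_u, on_clone):
        return (hpwl([u] + [fanouts[v] for v in on_u])
                + hpwl([clone] + [fanouts[v] for v in on_clone]))

    # Step 1: delay-only move to the clone when arrival is not worsened.
    on_clone = {v for v in fanouts
                if arrival(clone, arr_clone, v) <= arrival(u, arr_u, v)}
    on_u = set(fanouts) - on_clone
    if not on_u:
        return [], sorted(on_clone)  # u has no fanouts left and can be deleted

    # Step 2 (greedy simplification): move back the fanout with the largest
    # positive HPWL gain, as long as its required time is still met.
    while True:
        base = total_hpwl(on_u, on_clone)
        best_v, best_gain = None, 0
        for v in sorted(on_clone):
            if arrival(u, arr_u, v) > required[v]:
                continue
            gain = base - total_hpwl(on_u | {v}, on_clone - {v})
            if gain > best_gain:
                best_v, best_gain = v, gain
        if best_v is None:
            break
        on_clone.remove(best_v)
        on_u.add(best_v)
    return sorted(on_u), sorted(on_clone)

if __name__ == "__main__":
    fanouts = {"v1": (0, 2), "v2": (1, 1), "v3": (10, 2), "v4": (4, 0)}
    required = {v: 10 for v in fanouts}
    print(partition_fanouts((0, 0), (10, 0), fanouts, 5, 0, 1, required))
```

In this toy instance v4 first moves to the clone on delay alone and is then moved back to u because doing so shortens the total HPWL while still meeting its assumed required time.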

3.6 Cell Relocation

Once a replication occurs, a simple move of the source cell of the replication can often reduce wirelength. For example, consider the fanout partition {v1, v2} and {v3, v4} in Figure 6: when fanouts v3 and v4 are no longer tied to cell u, we can relocate u to a better location where the wirelength is reduced without degrading clock period. We use a simple heuristic that relocates the source cell. We first limit the target region to the region bounded by the fanouts and fanins of the original cell. Then we scan the region and pick the location where the HPWL of the incoming and outgoing nets of the cell is reduced and the arrival time of $v_i$ is not worsened.

3.7 Cell Unification

As we perform replication tree embeddings and placement legalizations over iterations, the placement of cells is perturbed and some logically equivalent cells migrate toward each other. In [2], the embedder unified the equivalent cells only when one of them was on the selected critical path. It therefore left some of the equivalent cell sets untouched, as they were not selected. In our new embedder, we go through all of the equivalent cell sets and perform pair-wise fanout partitioning so that better fanout partitions or cell unifications are obtained. We invoke the unification procedure as a post-processing step when the conventional flow saturates.

3.8 Replication Cost

One capability of the embedding algorithm is its ability to incorporate various costs, including wiring cost, placement cost, and replication cost. The replication cost is used to prevent excessive cell duplications. During embedding, we compute a region where a subject cell can be placed without incurring replication. In the computation, we consider the HPWL as well as the delay. Once a region is found, we impose a high cost for a placement outside the region, and a low cost for a placement within the region.

3.9 Wirelength Estimation

In [2], only cell-to-cell wiring was considered when wiring costs were computed. We note, however, that a cell with high fanout has a chance to save some wiring cost. For example, in Figure 7, node h has 5 fanouts and some part of the h-to-e wiring could be shared with other wiring in the fanout net. In the new embedder, a node (e.g., e) is allowed to receive a signal not just from the source pin (e.g., h) but from any one of the valid pins (e.g., h, f1, f2, f3, and f4), with appropriate changes made to the arrival times.

[Figure 7: node h and its fanout net with pins f1, f2, f3, and f4, together with nodes d, e, and g.]

Fig. 7. Signal can be connected to any pin in a fanout net.
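The idea of tapping an existing fanout net can be sketched in Python as follows. This is a simplification made for illustration: the name `best_tap`, the linear Manhattan delay model, and the rule of choosing the pin with the least added wire (ties broken by arrival time) are assumptions, whereas the embedder itself weighs cost against arrival time in its trade-off lists.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_tap(sink, pins, wire_delay):
    """Choose the pin of a fanout net that a sink should connect to.

    pins maps a pin name to ((x, y), arrival time of the signal at that pin).
    Returns (pin name, added wirelength, arrival time at the sink).
    """
    def key(name):
        pos, pin_time = pins[name]
        dist = manhattan(pos, sink)
        return (dist, pin_time + wire_delay * dist)

    name = min(sorted(pins), key=key)
    added_wire, arrival = key(name)
    return name, added_wire, arrival

if __name__ == "__main__":
    # Source pin h at the origin; f1 and f2 are pins already on its net,
    # with arrival times consistent with a unit wire delay.
    pins = {"h": ((0, 0), 0), "f1": ((4, 5), 9), "f2": ((9, 9), 18)}
    print(best_tap((5, 5), pins, wire_delay=1))
```

In this example connecting the sink to pin f1 adds one unit of wire instead of the ten units needed for a dedicated connection from h, while the arrival time at the sink is unchanged.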

4 Experimentation

4.1 Delay Model

In our experimentation we use a placement-level delay estimator that is related to VPR [15] and is similar to [1] and [2]. The target architecture is an FPGA in which all the switches are buffered and interconnect resources are uniform. With buffered switches, RC effects are localized to switch-to-switch connections. Thus the delay of an interconnection can be approximated by a linear function of the Manhattan length of the interconnection. As an aside, because the target of the embedder is an arbitrary graph in which edges can have arbitrary delays, it is also well-suited to routing architectures with pre-defined and non-uniform routing resources.

4.2 Optimization Flow

First, initial placements are obtained by invoking the VPR placer in timing-driven mode. Second, in each iteration of the optimization, we start with static timing analysis in order to identify the critical sink, and we extract a fanin tree whose root is the critical sink. This tree is passed to the new embedder, which produces a set of solutions that trade off between cost and delay. We select a solution from the trade-off curve, and analyze the circuit for possible post-processing. After the post-processing, the circuit is legalized by the timing-driven placement legalizer ([1] and [2]). Third, the optimization phase is iterated a given number of times, and the best placement is selected. Last, the selected placement is routed by the VPR router in timing-driven mode.

Table 1 Comparison between Timing-Driven VPR, RT, Lex-3, and Arbor

| Circuit name | size | dens | I/O | FF | VPR crit path W∞ [ns] | VPR crit path Wls [ns] | VPR wire length | VPR block |
|---|---|---|---|---|---|---|---|---|
| ex5p | 35x35 | 0.87 | 71 | 0 | 65.80 | 66.47 | 20086 | 1135 |
| tseng | 35x35 | 0.85 | 174 | 385 | 53.53 | 54.84 | 9692 | 1221 |
| apex4 | 38x38 | 0.87 | 28 | 0 | 72.81 | 74.20 | 21660 | 1290 |
| misex3 | 40x40 | 0.87 | 28 | 0 | 76.32 | 78.90 | 22239 | 1425 |
| alu4 | 41x41 | 0.91 | 22 | 0 | 76.00 | 76.73 | 21573 | 1544 |
| diffeq | 41x41 | 0.89 | 103 | 377 | 62.71 | 64.65 | 14614 | 1600 |
| dsip | 54x54 | 0.47 | 426 | 224 | 65.38 | 66.61 | 17642 | 1796 |
| seq | 44x44 | 0.90 | 76 | 0 | 80.42 | 80.76 | 27789 | 1826 |
| apex2 | 46x46 | 0.89 | 41 | 0 | 100.05 | 100.87 | 30995 | 1919 |
| s298 | 47x47 | 0.87 | 10 | 8 | 123.82 | 125.78 | 21844 | 1941 |
| des | 63x63 | 0.40 | 501 | 0 | 90.44 | 91.31 | 27861 | 2092 |
| bigkey | 54x54 | 0.59 | 426 | 224 | 62.77 | 64.23 | 20562 | 2133 |
| frisc | 63x63 | 0.90 | 136 | 886 | 121.64 | 125.46 | 61130 | 3692 |
| spla | 64x64 | 0.90 | 62 | 0 | 117.04 | 121.06 | 68663 | 3752 |
| elliptic | 64x64 | 0.88 | 245 | 1122 | 108.95 | 112.08 | 51240 | 3849 |
| ex1010 | 72x72 | 0.89 | 20 | 0 | 171.08 | 175.05 | 70632 | 4618 |
| pdc | 71x71 | 0.91 | 56 | 0 | 146.78 | 149.39 | 108292 | 4631 |
| s38417 | 84x84 | 0.91 | 135 | 1463 | 97.80 | 99.09 | 63968 | 6541 |
| s38584.1 | 85x85 | 0.89 | 342 | 1260 | 94.96 | 95.54 | 58034 | 6789 |
| clma | 97x97 | 0.89 | 144 | 33 | 240.90 | 240.52 | 141747 | 8527 |

Table 2 Comparison between Timing-Driven VPR, RT, Lex-3, and Arbor (continued): RT Embedding and Lex-3 Embedding (normalized to VPR)

| Circuit | RT crit path W∞ | RT crit path Wls | RT wire length | RT block | Lex-3 crit path W∞ | Lex-3 crit path Wls | Lex-3 wire length | Lex-3 block |
|---|---|---|---|---|---|---|---|---|
| ex5p | 0.918 | 1.225 | 1.203 | 1.042 | 0.890 | 0.968 | 1.295 | 1.085 |
| tseng | 0.939 | 0.935 | 1.057 | 1.009 | 0.939 | 0.946 | 1.117 | 1.020 |
| apex4 | 0.863 | 0.880 | 1.216 | 1.025 | 0.853 | 1.041 | 1.244 | 1.032 |
| misex3 | 0.782 | 0.782 | 1.054 | 1.004 | 0.730 | 0.842 | 1.217 | 1.027 |
| alu4 | 0.840 | 1.037 | 1.149 | 1.016 | 0.855 | 0.944 | 1.127 | 1.012 |
| diffeq | 0.954 | 0.955 | 1.099 | 1.004 | 0.948 | 0.922 | 1.069 | 1.006 |
| dsip | 0.744 | 0.817 | 1.436 | 1.001 | 0.731 | 1.185 | 1.577 | 1.001 |
| seq | 0.780 | 0.961 | 1.127 | 1.012 | 0.795 | 0.819 | 1.102 | 1.007 |
| apex2 | 0.803 | 0.815 | 1.100 | 1.010 | 0.785 | 0.799 | 1.102 | 1.011 |
| s298 | 0.913 | 0.924 | 1.084 | 1.002 | 0.872 | 0.882 | 1.129 | 1.002 |
| des | 0.896 | 0.909 | 1.020 | 1.000 | 0.876 | 0.972 | 1.018 | 1.002 |
| bigkey | 0.842 | 0.972 | 1.230 | 1.000 | 0.819 | 1.009 | 1.307 | 1.000 |
| frisc | 0.974 | 0.963 | 1.012 | 1.001 | 0.964 | 0.946 | 1.031 | 1.006 |
| spla | 0.824 | 0.864 | 1.179 | 1.024 | 0.780 | 0.844 | 1.176 | 1.025 |
| elliptic | 0.778 | 0.779 | 1.081 | 1.008 | 0.753 | 0.896 | 1.128 | 1.011 |
| ex1010 | 0.806 | 1.190 | 1.267 | 1.030 | 0.770 | 1.061 | 1.245 | 1.010 |
| pdc | 0.807 | 0.858 | 1.058 | 1.014 | 0.708 | 0.794 | 1.152 | 1.016 |
| s38417 | 0.878 | 0.965 | 1.037 | 1.004 | 0.842 | 0.929 | 1.040 | 1.007 |
| s38584.1 | 0.863 | 0.878 | 1.023 | 1.000 | 0.832 | 0.862 | 1.129 | 1.001 |
| clma | 0.630 | 0.661 | 1.102 | 1.010 | 0.622 | 0.655 | 1.102 | 1.006 |
| average | 0.842 | 0.918 | 1.127 | 1.011 | 0.818 | 0.916 | 1.165 | 1.014 |

4.3 Experimental Results

The VPR placement tool in default mode sets the number of rows and the number of columns in the FPGA logic array to the minimum required to fit a circuit. This minimum logic array size was used in the experiments of [1] and [2]. It is rare, however, that all logic and routing resources are 100% utilized. We adjusted the FPGA size so that the circuits have at least 10% white space: the VPR placement options -nx and -ny were used in order to adjust the number of rows and the number of columns. The circuits were routed with a number of tracks per channel that is about 20% more than the minimum required, which follows the definition of low-stress routing in [15].

Table 3 Comparison between Timing-Driven VPR, RT, Lex-3, and Arbor (continued): Arbor Embedding (normalized to VPR)

| Circuit | crit path W∞ | crit path Wls | wire length | block | Delay/LB |
|---|---|---|---|---|---|
| ex5p | 0.876 | 0.924 | 1.142 | 1.031 | 1.265 |
| tseng | 0.916 | 0.919 | 1.104 | 1.002 | 1.162 |
| apex4 | 0.822 | 0.863 | 1.142 | 1.018 | 1.254 |
| misex3 | 0.720 | 0.729 | 1.167 | 1.020 | 1.072 |
| alu4 | 0.852 | 0.885 | 1.098 | 1.019 | 1.295 |
| diffeq | 0.885 | 0.889 | 1.130 | 1.001 | 1.008 |
| dsip | 0.687 | 0.776 | 1.245 | 1.000 | 1.000 |
| seq | 0.769 | 0.801 | 1.130 | 1.005 | 1.071 |
| apex2 | 0.764 | 0.771 | 1.137 | 1.009 | 1.159 |
| s298 | 0.831 | 0.834 | 1.174 | 1.000 | 1.685 |
| des | 0.877 | 0.902 | 1.034 | 1.000 | 1.000 |
| bigkey | 0.800 | 0.889 | 1.210 | 1.005 | 1.000 |
| frisc | 0.891 | 0.878 | 1.072 | 1.002 | 1.093 |
| spla | 0.770 | 0.803 | 1.095 | 1.003 | 1.240 |
| elliptic | 0.734 | 0.737 | 1.123 | 1.002 | 1.082 |
| ex1010 | 0.775 | 0.819 | 1.088 | 1.002 | 1.718 |
| pdc | 0.708 | 0.746 | 1.073 | 1.004 | 1.346 |
| s38417 | 0.820 | 0.878 | 1.046 | 1.001 | 1.021 |
| s38584.1 | 0.842 | 0.872 | 1.064 | 1.000 | 1.000 |
| clma | 0.582 | 0.601 | 1.092 | 1.001 | 1.349 |
| average | 0.796 | 0.826 | 1.118 | 1.006 | |

In our experiments, we compared the generalized Steiner arborescence embedding, named Arbor embedding, with the timing-driven VPR [15], the basic embedding (RT) [2], and the Lex-3 embedding [4]. In Arbor, the generalized arborescence embeddings were invoked only when the conventional flow saturated. The main criteria of interest are the clock period, wirelength, the number of logic blocks, and the delay lower-bound.

Tables 1 through 3 show the experimental results for 20 MCNC benchmark circuits. We used the timing-driven VPR placer to obtain initial placements of the benchmark circuits. The first data set shows the design density and the I/O information of the circuits. In the second data set, we ran the timing-driven VPR router on the initial placement. In the third and fourth data sets, the initial placements were optimized by RT and Lex-3 before invoking the VPR router. These values are normalized to VPR. The last data set shows the optimization data for the Arbor embedder. W∞ denotes the estimated delay where infinite routing resources are assumed to be available. Wls represents the low-stress routed delay.

The average W∞ reduction of RT, Lex-3, and Arbor over VPR was 15.8%, 18.2%, and 20.4%, respectively. The difference between Lex-3/RT and Arbor looks small, but note that some of the circuits already meet or are close to the delay LB. The column labeled "Delay/LB" is the ratio between the placement-level delay achieved by Arbor and the placement-level delay lower-bound. Circuits dsip, des, bigkey, and s38584.1 reached the theoretical LB, so they could not be improved further for the given fixed FFs. Circuits diffeq and s38417 were very close to the LB. The largest W∞ reduction of Arbor over VPR was 41.8% (clma, which is the biggest circuit in the benchmarks).

The average Wls reduction of RT, Lex-3, and Arbor over VPR was 8.2%, 8.4%, and 17.4%, respectively. Arbor showed higher improvement in routed delays due to its better wirelength management (RT and Lex-3 used excessive wiring on low design density circuits, like dsip and bigkey). We observed that the new cell unification technique effectively merges the logically equivalent cells. We also think that Arbor optimizes near-critical paths more, so the router has less difficulty routing the most critical path.
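The quoted Arbor averages can be reproduced from the normalized columns of Table 3 with a few lines of Python. The values below are transcribed from that table; the variable and function names are only illustrative.

```python
# Normalized Arbor critical-path results from Table 3, in table row order
# (ex5p ... clma). 1.000 means "same as timing-driven VPR".
arbor_w_inf = [0.876, 0.916, 0.822, 0.720, 0.852, 0.885, 0.687, 0.769, 0.764,
               0.831, 0.877, 0.800, 0.891, 0.770, 0.734, 0.775, 0.708, 0.820,
               0.842, 0.582]
arbor_w_ls = [0.924, 0.919, 0.863, 0.729, 0.885, 0.889, 0.776, 0.801, 0.771,
              0.834, 0.902, 0.889, 0.878, 0.803, 0.737, 0.819, 0.746, 0.878,
              0.872, 0.601]

def mean(values):
    return sum(values) / len(values)

def reduction_percent(normalized):
    """Convert a value normalized to VPR into a percentage reduction."""
    return (1.0 - normalized) * 100.0

if __name__ == "__main__":
    print(f"average W_inf reduction: {reduction_percent(mean(arbor_w_inf)):.1f}%")
    print(f"average W_ls reduction:  {reduction_percent(mean(arbor_w_ls)):.1f}%")
    print(f"largest W_inf reduction: {reduction_percent(min(arbor_w_inf)):.1f}%")
    print(f"largest W_ls reduction:  {reduction_percent(min(arbor_w_ls)):.1f}%")
```

The averages are arithmetic means of the per-circuit normalized values, which is consistent with the "average" row of the table.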

5 Discussion

The runtime overhead of our algorithm was modest: it was less than 15% of the runtime of the VPR placer.

There were a couple of experimental observations. When the strict Steiner arborescence embedding was invoked over all the iterations, it optimized the critical path at the expense of a lot of resources: it incurred many replicated cells and considerable wiring overhead, and it was not able to improve the clock period of the circuits. When we invoked the generalized Steiner arborescence embedding over all the iterations, we could obtain better W∞ than the reported data. It still used a considerable amount of resources, however, and the routed delay (Wls) was not as good as the estimated delay.

6 Conclusions

We have presented techniques for improved timing-driven, placement-coupled logic replication. We described and generalized the rectilinear Steiner arborescence for the fanin tree embedding problem. The Steiner arborescence was shown to be useful for overcoming the issues caused by reconvergence in a circuit netlist specification. Fanout partitioning (cell unification), cell relocation, and wirelength estimation techniques were discussed as complementary improvement techniques. These techniques were implemented and evaluated in the FPGA domain. In many cases we were able to approach a fixed flip-flop lower-bound on achievable clock period. Promising experimental results were reported: an average 17.4% delay reduction compared with the timing-driven VPR and an average 9.3% reduction compared with the basic embedder.

References

[1] G. Beraudo and J. Lillis, "Timing Optimization of FPGA Placements by Logic Replication," ACM/IEEE DAC, 2003.
[2] M. Hrkić, J. Lillis, and G. Beraudo, "An Approach to Placement-Coupled Logic Replication," ACM/IEEE DAC, 2004.
[3] M. Hrkić and J. Lillis, "S-Tree: A technique for buffered routing tree synthesis," ACM/IEEE DAC, 2002.
[4] M. Hrkić and J. Lillis, "Addressing the Effects of Reconvergence on Placement-Coupled Logic Replication," IWLS, 2004.
[5] J. Cong, K. S. Leung, and D. Zhou, "Performance-Driven Interconnect Design Based on Distributed RC Delay Model," ACM/IEEE DAC, 1993.
[6] A. Kahng and G. Robins, "On optimal Interconnections for VLSI," Kluwer Academic Publishers, 1995.
[7] S. K. Rao, P. Sadayappan, F. K. Hwang, and P. W. Shor, "The Rectilinear Steiner Arborescence Problem," Algorithmica, 1992.
[8] L. T. Liu, M. T. Kuo, C. K. Cheng, and T. C. Hu, "A Replication Cut for Two-Way Partitioning," IEEE Transaction on CAD, 1995.
[9] W. K. Mak and D. F. Wong, "Minimum replication min-cut partitioning," IEEE Transaction on CAD, 1997.
[10] J. Lillis, C.-K. Cheng, and T.-T Y. Lin, "Algorithms for Optimal Introduction of Redundant Logic for Timing and Area Optimization," IEEE ISCAD, 1995.
[11] A. Srivastava, R. Kastner, and M. Sarrafzadeh, "Timing Driven Gate Duplication: Complexity Issues and Algorithms," ICCAD, 2000.
[12] W. Gosti, A. Narayan, R.K. Brayton, and A.L. Sangiovanni-Vincentelli, "Wireplanning in logic Synthesis," ICCAD, 1998.
[13] W. Gosti, S.P. Khatri, and A.L. Sangiovanni-Vincentelli, "Addressing The Timing Closure Problem By Integrating Logic Optimization and Placement," ICCAD, 2001.
[14] Y. Kukimoto, R.K. Brayton, and P. Sawkar, "Delay-optimal technology mapping by DAG covering," DAC, 1998.
[15] A. Marquardt, V. Betz, and J. Rose, "Timing-Driven Placement for FPGAs," ACM/SIGDA ISFPGAs, 2000.
[16] K. Schabas and S. D. Brown, "Using Logic Duplication to Improve Performance in FPGAs," ACM/SIGDA ISFPGAs, 2003.
[17] G. Chen and J. Cong, "Simultaneous timing-driven placement and duplication," ACM/SIGDA ISFPGAs, 2005.
[18] K. Keutzer, "DAGON: Technology Binding and Local Optimization by DAG Matching," ACM/IEEE DAC, 1987.
[19] S.-W. Hur, A. Jagannathan, and J. Lillis, "Timing-Driven Maze Routing," IEEE Transaction on CAD, 2000.
