Technology Mapping of Sequential Circuits for LUT-based FPGAs for Performance

Peichen Pan
Dept. of Electrical & Computer Eng., Clarkson University, Potsdam, NY 13699

C. L. Liu
Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801

* The work was partially supported by the National Science Foundation under grant MIP-9222408.

Abstract

In this paper, we study the technology mapping problem for sequential circuits for LUT-based FPGAs. The conventional approach to this problem is based on a technology mapping algorithm for combinational circuits and assumes that the positions of the flip-flops (FFs) are fixed. We propose a new approach in which FFs can be arbitrarily repositioned by retiming. We present an efficient technology mapping algorithm that produces a mapping solution with the minimum clock period for a circuit without loops under the unit delay model. The algorithm is also extended to the general delay model, in which case it produces a mapping solution with a clock period at most an interconnect or LUT delay away from the minimum one. Note that the algorithm can also be used for circuits with loops by removing some of the FFs to break the loops before the application of the algorithm. The superiority of our approach is further demonstrated experimentally.

Keywords: FPGAs, technology mapping, retiming, logic replication, look-up table, sequential circuits, clock period

1 Introduction

Field programmable gate arrays (FPGAs) have evolved rapidly to become an important ASIC technology. The most conspicuous features of FPGAs are low manufacturing cost for low-volume designs, short design cycle, and reprogrammability. These features make FPGAs particularly attractive for such applications as design prototyping and hardware emulation.

In this paper we confine ourselves to look-up table (LUT) based FPGA architectures [19]. A LUT-based FPGA consists of an array of programmable logic blocks (PLBs) together with programmable interconnections. The core of a PLB is a k-input LUT (k-LUT), which can implement any combinational logic with up to k inputs and a single output, where k is a positive integer usually ranging from 3 to 9. There are also a few flip-flops (FFs) in each PLB, which can be connected to the inputs and outputs of the LUT to realize sequential behavior.

The technology mapping problem for LUT-based FPGAs is to produce, for a given circuit, an equivalent circuit composed of k-LUTs. This problem has been studied extensively. However, almost all mapping algorithms were designed for combinational circuits. Mapping algorithms for combinational circuits have been proposed for various optimization criteria: performance [2, 3, 7, 11, 15, 20], area [4, 5, 6, 8, 13, 14, 18], and routability [1, 16]. We mention here particularly FlowMap [3]. For a k-bounded combinational circuit, FlowMap is able to produce a mapping solution with the minimum delay under the unit delay model.

The conventional approach to technology mapping of sequential circuits is to use a mapping algorithm for combinational circuits to map the combinational logic between FFs. Specifically, the FFs in a circuit are removed to obtain a combinational network. Then the combinational network is mapped. Finally, the FFs are placed back. This approach has two obvious shortcomings: (i) it fails to consider signal dependencies across FF boundaries, and (ii) it does not consider the possibility of exposing the combinational logic between the FFs in different ways. Note that the FFs in a circuit can be repositioned by a technique called retiming [10]. Recently, a couple of direct technology mapping methods for sequential circuits were proposed [12, 17]. However, they also assume that the initial positions of the FFs are fixed, though the heuristic algorithm in [17] uses retiming as a post-processing step.

1.1 Motivations

We notice that even with an optimal mapping algorithm for combinational circuits, the conventional approach may not produce an optimal mapping solution for a sequential circuit. As an example, consider the circuit in Figure 1(a). Suppose k = 4. If the FFs are not repositioned, it can be verified that either c or d must be an input to any 4-LUT for gate e. As a result, any mapping solution has a clock period of at least two and uses at least four LUTs. One possible mapping solution is indicated by the dashed polygons in Figure 1(a). On the other hand, if we move both f1 and f2 backward (by retiming) as shown in Figure 1(b), we have the mapping solution with one LUT shown in Figure 1(c), where the LUT is formed by all gates in Figure 1(b). Notice that this mapping solution has a clock period of one.

Figure 1: Advantage of retiming. Panels (a), (b), and (c).

We further notice that to fully exploit the potential of retiming, logic replication is also necessary. Replication can help produce mapping solutions which are otherwise impossible. Consider the circuit in Figure 2(a). Assume k = 4. First, we notice that on the two input edges to each of the gates d, e, and f, if a FF is removed (by retiming) from one edge, a negative FF will be introduced on the other edge. For instance, if we move FF f1 out of edge x, a negative FF will be introduced on edge y. Hence, the resultant circuit is symmetrical. As a result, we can assume that the FF positions are all fixed. In this case, it can be verified that at least two of the outputs of d, e, and f must be inputs to any 4-LUT for gate g. Consequently, any mapping solution must use at least six LUTs and have a clock period of at least two. However, if we duplicate a (to become a and a'), b (to become b and b'), and c (to become c and c'), then retime the FFs across gates a', b', and c' as shown in Figure 2(b), we can map all the gates (including the duplicated ones) to a single 4-LUT as shown in Figure 2(c). Note that this mapping solution has a clock period of one.

Figure 2: Advantage of logic replication. Panels (a), (b), and (c).

1.2 Contributions

In this paper we examine the technology mapping problem for sequential circuits for LUT-based FPGAs in the most general setting. Conceptually, our problem formulation can be described by the diagram in Figure 3. That is, the space of mapping solutions consists of all the circuits that can be obtained by retiming and replicating the given initial circuit (Step 1), then mapping the combinational logic between the FFs (Step 2), followed by another retiming and replication (Step 3). Note that in the conventional approach, Step 1 and Step 3 in Figure 3 are missing.

Figure 3: Solution space after integrating retiming and replication. Flow: given sequential circuit; replication and retiming; mapping the combinational logic; replication and retiming; solutions in solution space.

The objective of the technology mapping problem addressed in this paper is to obtain a mapping solution with a small clock period, which is the maximum delay between any two successive FFs in a circuit. We will present an efficient algorithm that produces a mapping solution with the minimum clock period for a loopless sequential circuit under the unit delay model. Furthermore, the algorithm is extended to the general delay model. In that case, it produces a mapping solution with a clock period provably close to the minimum one.

We should emphasize that what Figure 3 shows is the definition of the solution space, not the steps that our algorithm takes to solve the problem. In fact, trying to obtain a mapping solution by carrying out the steps in the figure sequentially is what an algorithm based on the conventional approach does, and it inevitably arrives at a sub-optimal solution, since there is no way to know what retiming and replication to use before mapping the combinational logic between FFs. Another way to understand our approach is that it integrates retiming, replication, and mapping logic globally to obtain the best mapping solutions.

The solution space in our formulation is enormous, since there are many ways to retime and replicate a circuit. One contribution of this work is identifying a much smaller subset of the solution space that is sufficient for obtaining a minimum clock period mapping solution.

An important issue in technology mapping for LUT-based FPGAs is the formation of LUTs. Though this issue is relatively simple for combinational circuits, it is complicated for sequential circuits because of the inclusion of retiming and replication. Another contribution of this work is a method for forming LUTs in sequential circuits, derived by introducing the concept of expanded circuits.

Our algorithm is also based on the general idea of node labeling [9]. We will introduce a labeling scheme that takes into consideration both delays and FFs. This work is another significant step toward the understanding of technology mapping for LUT-based FPGAs.

2 Preliminaries and problem definition

A (sequential) circuit can be modeled as an edge-weighted directed (multi-)graph. The nodes are the primary inputs (PIs), the primary outputs (POs), and the combinational processing elements (PEs). (A PE is either a gate or a k-LUT, depending on whether the circuit is the initial one or a mapping solution.) The edges represent the interconnections. There is an edge from u to v with weight t if the output of u, after passing through t FFs, is an input to v. We will use w(e) to denote the weight of edge e. For a node v in a sequential circuit N, we use N_v to denote the subcircuit induced by the nodes that can reach v. The output of N_v is that of v. The clock period of N is the maximum total delay (of gates and interconnections) on the combinational paths (paths without FFs) in N.

Retiming is a technique for repositioning the FFs without changing the functionality or the structure of the circuit [10]. Retiming a node by a value i means removing i FFs from each fanout and adding i FFs to each fanin of the node. Figure 4 shows the cases i = +1 and i = -1. In general, all nodes in a circuit can be retimed simultaneously (a retiming of the circuit). It can be shown that the retimed circuit and the original circuit have the same functionality if no retiming is performed at the PIs and POs (i.e., the retiming values for the PIs and POs are all zero).

Figure 4: Retiming a node (by +1 and by -1).

The retimed clock period of N is defined to be the minimum clock period of all the circuits that can be obtained by retiming N. Efficient algorithms for computing the retimed clock period, as well as a retiming that achieves it, are presented in [10].

Refer again to Figure 3. Suppose N is the circuit to be mapped. We assume that N is k-bounded, i.e., each gate has at most k inputs. Let N' be a circuit obtained from N using replication and retiming. Let N'' be a mapping solution of the combinational logic in N', and let S be the circuit obtained by placing the removed FFs back into N'' and then applying another retiming (footnote 1). S is a mapping solution of N. The technology mapping problem addressed in this paper is as follows:

Problem 1 Find a mapping solution with the minimum clock period.

Finally, we list several graph-theoretic concepts. Let G be a DAG with one sink but possibly several sources. A cut $(X, \bar{X})$ is a partition of the nodes in G such that the sink is in $\bar{X}$ and all the sources are in $X$. The edge-set $E(X, \bar{X})$ of the cut is the set of edges from $X$ to $\bar{X}$, and the node-set $V(X, \bar{X})$ of the cut is the set of nodes in $X$ which are connected to one or more nodes in $\bar{X}$. The cone of the cut is the subgraph induced by $\bar{X}$. If $|V(X, \bar{X})| \le k$, $(X, \bar{X})$ is further called a k-feasible cut, or k-cut for short.
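The retiming operation and the clock period defined earlier in this section can be illustrated with a minimal Python sketch. The edge-list representation, the function names `retime` and `clock_period`, and the three-node example circuit are assumptions made for illustration; they do not come from the paper. The sketch uses the convention stated above (retiming v by i removes i FFs from each fanout of v and adds i FFs to each fanin), so an edge u -> v ends up with w(e) + r(v) - r(u) FFs, and it measures the clock period under the unit delay model.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, int]  # (u, v, number of FFs on the edge u -> v)


def retime(edges: List[Edge], r: Dict[str, int]) -> List[Edge]:
    """Retime every node v by r[v] (0 if absent) and return the new edge list."""
    retimed = []
    for u, v, w in edges:
        new_w = w + r.get(v, 0) - r.get(u, 0)
        if new_w < 0:
            # Design choice of this sketch: reject retimings that leave negative FFs.
            raise ValueError(f"edge {u}->{v} would carry {new_w} FFs")
        retimed.append((u, v, new_w))
    return retimed


def clock_period(edges: List[Edge], pes: Set[str]) -> int:
    """Unit delay model: largest number of PEs on any FF-free path.

    Assumes the FF-free part of the circuit is acyclic.
    """
    fanin: Dict[str, List[str]] = {}
    for u, v, w in edges:
        if w == 0:  # only FF-free edges extend a combinational path
            fanin.setdefault(v, []).append(u)
    depth: Dict[str, int] = {}

    def visit(v: str) -> int:
        if v not in depth:
            before = max((visit(u) for u in fanin.get(v, [])), default=0)
            depth[v] = before + (1 if v in pes else 0)
        return depth[v]

    return max(visit(n) for e in edges for n in e[:2])


# Hypothetical circuit: PI i -> gate a -> gate b -> (one FF) -> PO o.
circuit = [("i", "a", 0), ("a", "b", 0), ("b", "o", 1)]
pes = {"a", "b"}
moved = retime(circuit, {"b": 1})  # pull the FF from b's output to b's input
print(clock_period(circuit, pes), clock_period(moved, pes))
```

In this example the retiming values of the PI and the PO are zero, as required for the retimed circuit to keep the original functionality.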

3 Formation of LUTs

Recall that in a combinational circuit, a k-LUT for a node v is formed by the cone of a k-cut in N_v [3]. In this section, we derive a way to examine k-LUTs for nodes in sequential circuits. In our problem formulation, a sequential circuit can conceptually be retimed and replicated arbitrarily. As a result, there is no fixed circuit for LUT formation. To overcome this difficulty, we will study an equivalent problem in which only a subset of special mapping solutions is considered. A mapping solution in which the output signals of the LUTs are from the original circuit is referred to as a simple mapping solution. Note that due to retiming, the output signal of a LUT in a mapping solution is in general a signal in the original circuit delayed or advanced by a few clock cycles. The set of simple mapping solutions may not contain a mapping solution with the minimum clock period. However, we have the following result:

Theorem 1 There is a simple mapping solution whose retimed clock period is equal to the minimum clock period of all mapping solutions. □

If we have a simple mapping solution with the minimum retimed clock period among all simple mapping solutions, we can then retime the simple mapping solution using a retiming that achieves the clock period to obtain a mapping solution with the minimum clock period. As a result, instead of studying Problem 1 we will study the following equivalent problem:

Problem 2 Find a simple mapping solution whose retimed clock period is minimum.

Footnote 1 (to the definition of S in Section 2): Though we can also apply replication in this final retiming step, it turns out to be unnecessary.

The importance of simple mapping solutions is that by restricting ourselves to them, we only need to study how to form LUTs for the nodes in N. To form a LUT for a node v, a straightforward approach is to use the cone of a cut in N_v and then move the FFs within the cone to the boundaries of the cone using retiming and logic replication. This approach has two problems. First, the number of inputs to the final LUT is not directly related to the size of the node-set of the cut, which makes it difficult to select cuts in the first place. Second, it may not generate all possible LUTs for v. As an example, for the circuit in Figure 5(a), if we replicate a (to become a1 and a2) and b (to become b1 and b2), then retime b2 by 1, we arrive at the circuit in Figure 5(b). The logic delineated by the dashed polygon forms a 3-LUT as shown in Figure 5(c). However, this LUT cannot be derived without replication first.

Figure 5: A LUT formed only through replication. Panels (a), (b), and (c).

We now introduce the concept of expanded circuits, which we will use to derive LUTs. The expanded circuit $E_v$ for v is constructed by replicating the nodes in $N_v$. The intuition behind the construction is as follows. For a path from u to v in $N_v$:

$$ (u =)\; u_1 \xrightarrow{e_1} u_2 \xrightarrow{e_2} \cdots \xrightarrow{e_t} u_{t+1} \;(= v), $$

we apply replication to create a unique corresponding path in the expanded circuit:

$$ u_1^{d_1} \xrightarrow{e_1'} u_2^{d_2} \xrightarrow{e_2'} \cdots \xrightarrow{e_t'} u_{t+1}^{0}, $$

where $d_i = \sum_{i \le j \le t} w(e_j)$, $u_i^{d_i}$ is a replicated copy of $u_i$, and $e_i'$ is a copy of $e_i$ (with the same weight), for $1 \le i \le t$.

Let $r(u, v) = \{ w(p) \mid p \text{ is a path from } u \text{ to } v \text{ in } N \}$. That is, $r(u, v)$ is the set of different path weights of all the paths from u to v in N. $E_v$ is obtained by successively replicating the nodes in $N_v$ in topological order starting from v. At the beginning, $N_v^{\{v\}}$ is the same as $N_v$ except that we rename v as $v^0$. Suppose we have constructed $N_v^T$, and x is in T for each edge (u, x) in $N_v$. Then $N_v^{T \cup \{u\}}$ is obtained from $N_v^T$ by replicating u into $|r(u, v)|$ copies as follows:

(1) For each $d \in r(u, v)$, introduce a node $u^d$.
(2) For each edge $(u, x^{d_1})$ with weight t in $N_v^T$, add an edge $(u^d, x^{d_1})$ with weight t, where $d = d_1 + t$.
(3) For each edge $(y, u)$ with weight t in $N_v^T$, add an edge $(y, u^d)$ with weight t for every $d \in r(u, v)$.
(4) Remove node u and the edges incident to it.

When T consists of all the nodes in $N_v$, the resulting circuit is $E_v$. As an example, for gate g in the circuit in Figure 2(a), Figure 6(a) shows its expanded circuit.

The importance of the expanded circuits is that a k-LUT for a node can be derived from each k-cut in the expanded circuit for the node. As an example, for the 6-cut delineated by the dashed circle in Figure 6(a), the corresponding 6-LUT is shown in Figure 6(b). In general, if $u^d$ is in the node-set of a k-cut in $E_v$, then u after passing through d FFs is an input to the corresponding k-LUT. The logic of the k-LUT is simply the cone of the k-cut less the FFs. We can further show that these are the only k-LUTs that need to be examined. In summary, we have

Theorem 2 To examine all k-LUTs for a node v, it suffices to examine all the k-LUTs that can be derived from the k-cuts in $E_v$.

Finally, we estimate the numbers of nodes and edges in Ev . Hereafter, we will use n to denote the number of nodes and f to denote the number of FFs in N . Since N is k-bounded, the number of edges in N is O(kn). Obviously, the maximum value in r(u; v) for any pair of nodes u and v is at most f . Consequently, the number of distinct values in r(u; v) is at most f . Therefore, the number of nodes in Ev is O(nf ) and the number of edges is O(knf ) (note that Ev is also k-bounded). In practice, we expect these numbers to be much smaller.
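The construction of the expanded circuit can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `expanded_circuit`, the `(u, x, t)` edge triples, and the small example circuit are hypothetical. Instead of replicating nodes one at a time, the sketch computes the sets r(u, v) directly by recursion over a loopless circuit and then emits one copy (u, d) per value d in r(u, v), with an edge from (u, d1 + t) to (x, d1) for every original edge (u, x) of weight t, which matches steps (1) and (2) of the construction.

```python
from collections import defaultdict


def expanded_circuit(edges, v):
    """Return (nodes, edges) of the expanded circuit for v.

    edges: (u, x, t) triples of a loopless circuit, t = number of FFs on u -> x.
    A node of the result is a pair (u, d), standing for the copy u^d.
    """
    fanout = defaultdict(list)
    nodes = set()
    for u, x, t in edges:
        fanout[u].append((x, t))
        nodes.update((u, x))

    r = {v: {0}}  # r[u] is the set r(u, v) of path weights from u to v

    def weights(u):
        if u not in r:
            # Nodes that cannot reach v get the empty set and produce no copies.
            r[u] = {d1 + t for x, t in fanout[u] for d1 in weights(x)}
        return r[u]

    exp_nodes = {(u, d) for u in nodes for d in weights(u)}
    exp_edges = [((u, d1 + t), (x, d1), t)
                 for u, x, t in edges for d1 in weights(x)]
    return exp_nodes, exp_edges


# Hypothetical circuit: PI i feeds gate a; a feeds gate g twice,
# once directly and once through one FF.
example = [("i", "a", 0), ("a", "g", 0), ("a", "g", 1)]
exp_nodes, exp_edges = expanded_circuit(example, "g")
print(sorted(exp_nodes))
print(sorted(exp_edges))
```

Here r(a, g) = {0, 1}, so a and i each appear in two copies, in line with the bound of at most |r(u, v)| copies per node discussed above.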

4 An optimal clock period mapping algorithm

In this section we present an algorithm for solving Problem 2. We mainly focus on the unit delay model, namely, LUTs have one unit delay and interconnections have zero delay. Later, we will briefly describe how to extend the algorithm to the general delay model. As was mentioned earlier, we assume that the initial circuit N is loopless and k-bounded. The algorithm is based on solving the decision version of Problem 2:

Problem 3 Given a target clock period c, find a simple mapping solution whose retimed clock period is c or less, if such a mapping solution exists.

If we have an algorithm for solving Problem 3, we can do a binary search on c to solve Problem 2. In the remainder of this section, we present an algorithm for solving Problem 3. Our algorithm employs a labeling procedure and is based on network flow techniques.

Figure 6: Obtaining LUTs in expanded circuits. Panel (a): expanded circuit with a 6-cut. Panel (b): the corresponding 6-LUT.

Before presenting the algorithm, we examine this question: given a circuit M and an integer c, can M be retimed to a clock period of c or less? Though the algorithms in [10] can be used to answer this question, we want a method that is more suitable for solving our problem. To this end, we construct a graph whose topology is the same as that of M, but with the edge weights redefined: for an edge $u \xrightarrow{e} v$, the new weight is $-c \cdot w(e) + 1$. The l-value of v is defined to be the maximum weight of the paths from the PIs to v according to the new edge weights. We have

Theorem 3 M can be retimed to a clock period of c or less if and only if the l-value of each PO is less than or equal to c + 1.

Our algorithm for solving Problem 3 has two phases: the labeling phase and the mapping phase. In the labeling phase, we compute a label for each node in N. The label of a node is the minimum l-value of the k-LUTs for the node among all simple mapping solutions of the circuit. With the labels, we can easily determine whether N has a simple mapping solution with a retimed clock period of c or less: if the label of each PO is less than or equal to c + 1, there is such a mapping solution; otherwise, there is not. If the mapping solution exists, we generate one such solution in the mapping phase. In the next two subsections, we describe the two phases in further detail.
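The retiming test of Theorem 3 can be sketched as a longest-path computation. The following Python is a minimal sketch under stated assumptions: the circuit is loopless, every non-PI node has at least one fanin, and the names `l_values`, `can_retime_to`, `min_retimed_period`, and the example chain are hypothetical. Each edge gets weight -c*w(e) + 1, PIs get l-value 0, and the check is that every PO has l-value at most c + 1. The binary search at the end applies only to retiming a fixed circuit M; it mirrors, but is not, the binary search over Problem 3 described above.

```python
from collections import defaultdict


def l_values(edges, pis, c):
    """l-value of every node: longest path from the PIs with weights -c*w(e) + 1."""
    fanin = defaultdict(list)
    for u, v, w in edges:
        fanin[v].append((u, w))
    l = {p: 0 for p in pis}  # the empty path to a PI has weight 0

    def visit(v):
        if v not in l:
            l[v] = max(visit(u) - c * w + 1 for u, w in fanin[v])
        return l[v]

    for v in list(fanin):
        visit(v)
    return l


def can_retime_to(edges, pis, pos, c):
    """Theorem 3: retimable to clock period <= c iff every PO has l-value <= c + 1."""
    l = l_values(edges, pis, c)
    return all(l[o] <= c + 1 for o in pos)


def min_retimed_period(edges, pis, pos, upper):
    """Binary search for the smallest feasible c in [1, upper]; upper must be feasible."""
    lo, hi = 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if can_retime_to(edges, pis, pos, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical chain: PI i -> a -> b -> c -> (one FF) -> PO o, three unit-delay PEs.
chain = [("i", "a", 0), ("a", "b", 0), ("b", "c", 0), ("c", "o", 1)]
print(l_values(chain, ["i"], 2))
print(min_retimed_period(chain, ["i"], ["o"], upper=3))
```

For this chain, a clock period of 1 is infeasible because the single FF can split the three PEs into segments no shorter than 2 and 1.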

4.1 The labeling phase

For a node v in N, let $l_{opt}(v)$ denote its label, namely, the minimum l-value of the k-LUTs for v among all simple mapping solutions of N. We determine $l_{opt}(v)$ for each node v in N in this phase of the algorithm. Given a k-cut $(X, \bar{X})$ in $E_v$, the minimum l-value of the corresponding k-LUT is

$$ \max\{\, l_{opt}(u) - c \cdot d + 1 \mid u^d \xrightarrow{e'} x \in E(X, \bar{X}) \,\}. \qquad (1) $$

By Theorem 2, $l_{opt}(v)$ is equal to the minimum of the quantity in (1) among all k-cuts in $E_v$. We can compute the labels of the nodes in N in topological order starting from the PIs. For each PI v, $l_{opt}(v) = 0$. Suppose we now want to determine $l_{opt}(v)$ of a non-PI node v, given that the labels of all its predecessors have already been determined. To compute $l_{opt}(v)$, we again consider the decision problem, that is,

Problem 4 Determine whether $l_{opt}(v) \le L$ for a given integer L.

We use network flow techniques to solve Problem 4. A flow network G is constructed from $E_v$ by applying to $E_v$ a standard network transformation called node-splitting. Each node except $v^0$ in $E_v$ is split into two nodes with a bridging edge between them. A supersource is added and connected to all the sources in the network. The bridging edge for a node $u^d$ has unit capacity if $l_{opt}(u) - c \cdot d + 1 \le L$. All other edges in G have infinite capacity. The following result can be readily shown:

Lemma 1 $l_{opt}(v) \le L$ if and only if G has a cut with edge capacity less than or equal to k. □

Based on the classical Max-flow Min-cut Theorem, we can use an augmenting path algorithm for the max-flow problem to determine whether G has a cut with capacity less than or equal to k in $O(k \cdot |E(G)|) = O(k^2 n f)$ time. Though $l_{opt}(v)$ can now be determined using binary search, we actually only need to solve Problem 4 once. This is because of the following result:

Lemma 2 $l_{opt}(v) = L_v - 1$ or $L_v$, where $L_v = \max\{\, l_{opt}(u) - c \cdot w(e) + 1 \mid u \xrightarrow{e} v \text{ is in } N_v \,\}$. □

To determine $l_{opt}(v)$, we check whether $l_{opt}(v) \le L_v - 1$. If so, $l_{opt}(v) = L_v - 1$; otherwise, $l_{opt}(v) = L_v$. Based on these discussions, we have the following result:

Theorem 4 The minimum l-values of all nodes in N can be determined in $O(k^2 n^2 f)$ time. □
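The decision procedure for Problem 4 can be sketched as follows. This Python is an illustrative sketch, not the authors' code: the function name `label_at_most`, the representation of expanded-circuit nodes as (u, d) pairs, and the example data are assumptions. It builds the node-split flow network described above (unit capacity on the bridging edge of a copy (u, d) when l_opt(u) - c*d + 1 <= L, infinite capacity elsewhere), then runs breadth-first augmenting paths and stops as soon as the flow exceeds k.

```python
import math
from collections import defaultdict, deque


def label_at_most(exp_nodes, exp_edges, sink, l_opt, c, L, k):
    """True iff the expanded circuit has a k-cut whose value in (1) is at most L."""
    cap = defaultdict(dict)  # cap[a][b] = residual capacity of a -> b

    def add(a, b, amount):
        cap[a][b] = cap[a].get(b, 0) + amount
        cap[b].setdefault(a, 0)

    has_fanin = {x for _, x, _ in exp_edges}
    for n in exp_nodes:
        if n == sink:  # the copy v^0 is not split
            continue
        u, d = n
        cuttable = l_opt[u] - c * d + 1 <= L
        add(("in", n), ("out", n), 1 if cuttable else math.inf)
        if n not in has_fanin:  # sources hang off the supersource "S"
            add("S", ("in", n), math.inf)
    for a, b, _ in exp_edges:
        add(("out", a), sink if b == sink else ("in", b), math.inf)

    flow = 0
    while flow <= k:
        parent = {"S": None}
        queue = deque(["S"])
        while queue and sink not in parent:
            a = queue.popleft()
            for b, residual in cap[a].items():
                if residual > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return True  # max flow = min cut <= k
        path, b = [], sink
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        delta = min(cap[a][b] for a, b in path)
        if delta == math.inf:
            return False  # a path that no allowed node can cut
        for a, b in path:
            cap[a][b] -= delta
            cap[b][a] += delta
        flow += delta
    return False


# Hypothetical expanded circuit for a gate g fed by gate a directly and through one FF.
nodes = {("i", 0), ("i", 1), ("a", 0), ("a", 1), ("g", 0)}
edges = [(("i", 0), ("a", 0), 0), (("i", 1), ("a", 1), 0),
         (("a", 0), ("g", 0), 0), (("a", 1), ("g", 0), 1)]
labels = {"i": 0, "a": 1}
print(label_at_most(nodes, edges, ("g", 0), labels, c=1, L=1, k=2))
```

In this example L_g = 2 by Lemma 2, and the call above tests L = L_g - 1 = 1; with k = 2 the cut through the copies of i and a of height at most 1 exists, so the label of g would be 1.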

4.2 The mapping phase

The purpose of this phase is to generate a mapping solution with a clock period of c or less if it is determined that there is one. This phase is relatively simple. The first step is to generate a simple mapping solution whose retimed clock period is c or less. To do so, we keep two lists D and U. D is the set of nodes in N for which we have included their k-LUTs in the simple mapping solution, and U is the set of nodes whose k-LUTs are inputs to some k-LUTs in D and have not yet been included. At the beginning, D consists of the PIs and U consists of the POs. At each iteration, a node v in U is removed and added to D. Let the k-LUT that realizes $l_{opt}(v)$ be $L_v$, as determined in the labeling phase. Then, if u after passing through d FFs is an input to $L_v$, we create an edge from $L_u$ to $L_v$ with weight d, and add u to U if it is not in D or U. This process stops when U becomes empty. Let S denote the resulting mapping solution. Note that there is at most one LUT in S for each node v in N.

Finally, to obtain a mapping solution with a clock period of c or less, we simply apply the following retiming to S:

$$ r(L_v) = \begin{cases} 0 & \text{if } v \text{ is a PI or PO}, \\ \left\lceil l_{opt}(v) / c \right\rceil - 1 & \text{otherwise.} \end{cases} $$

The retimed solution is guaranteed to have a clock period of c or less, provided that the label of each PO is no more than c + 1.
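The two steps of the mapping phase can be sketched as a worklist followed by a closed-form retiming. The Python below is a minimal sketch under stated assumptions: `lut_inputs[v]` is a hypothetical table listing, for each non-PI node v, the inputs (u, d) of the LUT chosen for v in the labeling phase (for a PO, its driving signal), and the function names and example values are illustrative only.

```python
import math


def generate_mapping(pis, pos, lut_inputs):
    """Build the simple mapping solution S as a list of (u, v, d) edges."""
    done = set(pis)   # the list D of the text
    todo = list(pos)  # the list U of the text
    s_edges = []
    while todo:
        v = todo.pop()
        done.add(v)
        for u, d in lut_inputs[v]:
            s_edges.append((u, v, d))  # u, after d FFs, is an input of L_v
            if u not in done and u not in todo:
                todo.append(u)
    return s_edges


def final_retiming(l_opt, pis, pos, c):
    """r(L_v) = 0 for PIs and POs, ceil(l_opt(v) / c) - 1 otherwise."""
    return {v: 0 if v in pis or v in pos else math.ceil(l_opt[v] / c) - 1
            for v in l_opt}


# Hypothetical result of the labeling phase for target clock period c = 1.
lut_inputs = {"o": [("y", 1)], "y": [("x", 0)], "x": [("i", 0)]}
l_opt = {"i": 0, "x": 1, "y": 2, "o": 2}
s_edges = generate_mapping(["i"], ["o"], lut_inputs)
r = final_retiming(l_opt, {"i"}, {"o"}, c=1)
# Retiming v by r(v) removes r(v) FFs from each fanout and adds r(v) to each fanin.
retimed = [(u, v, d + r[v] - r[u]) for u, v, d in s_edges]
print(sorted(retimed))
```

In this example the label of the PO is 2 = c + 1, so a solution with clock period 1 exists; the retiming moves the FF from the output of y to its input, which leaves one LUT per combinational path.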

4.3 Extension to the general delay model

In the general delay model, each edge e has an associated delay value (denoted d(e)). Note that all k-LUTs have the same delay (denoted $d_2$). For an edge in a mapping solution, its delay is assumed to be the maximum of the delays of all edges in the initial circuit that are mapped to this edge. The algorithm for the general delay model is almost the same as the one for the unit delay model, except that the following formula is used to compute $l_{opt}(v)$:

$$ \min_{(X, \bar{X})} \max\{\, l_{opt}(u) - c \cdot d + d(e) + d_2 \mid u^d \xrightarrow{e'} x \in E(X, \bar{X}) \,\}, $$

where the minimization is over all k-cuts in $E_v$. We can show that the extended algorithm produces a mapping solution with a clock period that exceeds the minimum one by at most the maximum of $d_2$ and the interconnection delays.
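As a quick consistency check (our own observation, not a statement from the paper), the general-delay term reduces to the unit-delay quantity in (1) when every interconnection has zero delay and every LUT has unit delay. Here $l_{opt}(u)$, $c$, $d$, $d(e)$, and $d_2$ are used as in the text above:

```latex
\[
  \bigl( l_{opt}(u) - c \cdot d + d(e) + d_2 \bigr)\Big|_{d(e) = 0,\; d_2 = 1}
  \;=\; l_{opt}(u) - c \cdot d + 1 .
\]
```

So the unit delay model is the special case $d(e) = 0$, $d_2 = 1$ of the general delay model.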

5 Experimental results

We implemented our technology mapping algorithm (referred to as SeqMap) and carried out some experiments. In this section, we describe our experiments and summarize the results.

Since there are no loopless sequential benchmark circuits, we derived our test examples from the multi-level combinational and sequential benchmark circuits in the LGSynth91 suite from MCNC. For a combinational benchmark circuit, we added one stage of pipeline to it. The resulting sequential circuit was then retimed to its retimed clock period to obtain a test example. For a sequential benchmark circuit, we removed a set of gates to cut the loops, then again retimed it to its retimed clock period to obtain a test example. Some standard SIS commands such as speed_up, sweep, and tech_decomp were used in the construction of the test examples.

For comparison, we implemented a technology mapping algorithm based on the conventional approach, called ComMap. ComMap maps a sequential circuit by first removing all the FFs. Then, it maps the resulting combinational logic using FlowMap, a delay-optimal technology mapping algorithm for combinational circuits. Finally, the removed FFs are connected back to form a mapping solution of the original circuit. ComMap also uses retiming as a post-processing step by retiming the initial mapping solution to its retimed clock period. The resulting circuit is then the final output of ComMap. In our current implementation, neither ComMap nor SeqMap contains any post-processing operations for LUT reduction.

We tested both ComMap and SeqMap on a set of example circuits using 5-LUTs. The results are summarized in Table 1. In Table 1, under column Initial we list the number of gates, the number of FFs, and the clock period of each test example; under column ComMap, we list the number of LUTs, the number of FFs, and the clock period of the mapping solution produced by ComMap. The same quantities are also listed for SeqMap in column SeqMap. Note that the clock periods of the mapping solutions produced by SeqMap are optimal. From the table, it is clear that ComMap quite often produced sub-optimal solutions. It can also be seen that the mapping solutions produced by SeqMap usually have fewer LUTs than those produced by ComMap. This was also expected, since SeqMap may extend across FF boundaries to form LUTs. The running times of SeqMap were about 5 times those of ComMap for the test examples.
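The comparison can be made concrete by tallying the per-circuit figures reported in Table 1. The short Python sketch below only re-adds the published numbers; the variable names are ours, and the percentages are derived from the table rather than stated in the paper.

```python
# (ComMap #LUTs, ComMap period, SeqMap #LUTs, SeqMap period) per test example in Table 1
table1 = {
    "C3540": (477, 6, 471, 5), "C2670": (386, 4, 364, 4),
    "cm85a": (15, 2, 13, 1), "comp": (56, 3, 52, 2),
    "count": (140, 3, 119, 2), "cu": (26, 2, 23, 1),
    "f51m": (61, 3, 55, 2), "i10": (1348, 7, 1287, 6),
    "pm1": (36, 2, 28, 1), "term1": (306, 4, 275, 3),
    "x2": (23, 2, 17, 1), "mult32a": (68, 2, 66, 1),
    "s1196": (306, 5, 309, 5), "s5378": (747, 5, 572, 4),
    "s9234": (1164, 6, 911, 5), "s15850": (2157, 7, 1618, 7),
}

com_luts = sum(row[0] for row in table1.values())
com_period = sum(row[1] for row in table1.values())
seq_luts = sum(row[2] for row in table1.values())
seq_period = sum(row[3] for row in table1.values())
faster = [name for name, row in table1.items() if row[3] < row[1]]

print(f"LUTs: {com_luts} -> {seq_luts} ({100 * (com_luts - seq_luts) / com_luts:.1f}% fewer)")
print(f"summed clock periods: {com_period} -> {seq_period}")
print(f"{len(faster)} of {len(table1)} circuits have a strictly smaller clock period")
```

The sums reproduce the total row of Table 1, which is a useful check that the per-circuit entries were read in the right order.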

6 Summary

In this paper, we have studied the FPGA technology mapping problem for sequential circuits. The problem is studied in the most general setting. In our formulation, the mapping solution space is much larger than what the conventional approach (based on a mapping algorithm for combinational circuits) is able to explore. We have presented an optimal clock period mapping algorithm for loopless sequential circuits. This algorithm can also be used for circuits with loops by removing some of the FFs to break the loops before the application of our algorithm. Another way to understand our algorithm is that it combines retiming, replication, and mapping of combinational logic globally, while in the conventional approach they are carried out separately. Our algorithm has been implemented. Experimental results also confirm the superiority of our approach over the conventional one. Currently, we are in the process of extending our results to sequential circuits with loops. We are also considering using the ideas developed here for LUT minimization.

Table 1: Experimental results (period = clock period; the total row sums #LUTs and clock periods).

test        |        Initial        |        ComMap        |        SeqMap
example     | #gates   #FFs  period | #LUTs   #FFs  period | #LUTs   #FFs  period
C3540       |   1078     56      19 |   477     53       6 |   471     58       5
C2670       |    996    173      12 |   386    171       4 |   364    157       4
cm85a       |     48      8       4 |    15      7       2 |    13     10       1
comp        |    134     12       6 |    56     10       3 |    52     14       2
count       |    264     53       6 |   140     29       3 |   119     40       2
cu          |     62     12       4 |    26     13       2 |    23     17       1
f51m        |    226     22       5 |    61     13       3 |    55     17       2
i10         |   2774    330      21 |  1348    308       7 |  1287    341       6
pm1         |     73     24       3 |    36     18       2 |    28     23       1
term1       |    735     88       8 |   306     45       4 |   275     54       3
x2          |     59     13       4 |    23     11       2 |    17     15       1
mult32a     |    533     33       6 |    68     33       2 |    66     34       1
s1196       |    671     19      16 |   306     19       5 |   309     22       5
s5378       |   2779    430      21 |   747    411       5 |   572    309       4
s9234       |   5597    498      27 |  1164    483       6 |   911    338       5
s15850      |   9785    944      54 |  2157    944       7 |  1618    694       7
total       |                       |  7316              63 |  6180              50

References

[1] N. Bhat and D. Hill. Routable technology mapping for FPGAs. In ACM/SIGDA Workshop on FPGAs, pages 143-148, 1992.
[2] J. Cong and Y. Ding. Beyond the combinational limit in depth minimization for LUT-based FPGA designs. In Digest Intl. Conf. on Computer-Aided Design, pages 110-114, 1993.
[3] J. Cong and Y. Ding. FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Trans. on Computer-Aided Design, 13:1-11, 1994.
[4] A. H. Farrahi and M. Sarrafzadeh. Complexity of the lookup-table minimization problem for FPGA technology mapping. IEEE Trans. on Computer-Aided Design, 13:1319-1332, 1994.
[5] R. J. Francis, J. Rose, and K. Chung. Chortle: A technology mapping for lookup table-based field programmable gate arrays. In Proc. ACM/IEEE Design Automation Conf., pages 613-619, 1990.
[6] R. J. Francis, J. Rose, and Z. Vranesic. Chortle-crf: Fast technology mapping for lookup table-based FPGAs. In Proc. ACM/IEEE Design Automation Conf., pages 227-233, 1991.
[7] R. J. Francis, J. Rose, and Z. Vranesic. Technology mapping for lookup table-based FPGAs for performance. In Digest Intl. Conf. on Computer-Aided Design, pages 568-571, 1991.
[8] K. Karplus. Xmap: A technology mapper for table-lookup FPGAs. In Proc. ACM/IEEE Design Automation Conf., pages 240-243, 1991.
[9] E. L. Lawler, K. N. Levitt, and J. Turner. Module clustering to minimize delay in digital networks. IEEE Trans. on Computers, 18:47-57, 1969.
[10] C. E. Leiserson, F. M. Rose, and J. B. Saxe. Optimizing synchronous circuitry by retiming. In Proc. 3rd Caltech Conf. on VLSI, pages 87-116, 1983.
[11] A. Mathur and C. L. Liu. Performance driven technology mapping for lookup-table based FPGAs using the general delay model. In ACM/SIGDA Workshop on Field Programmable Gate Arrays, 1994.
[12] R. Murgai, R. K. Brayton, and A. Sangiovanni-Vincentelli. Sequential synthesis for table look up programmable gate arrays. In Proc. ACM/IEEE Design Automation Conf., pages 224-229, 1993.
[13] R. Murgai, Y. Nishizaki, N. Shenoy, R. K. Brayton, and A. Sangiovanni-Vincentelli. Logic synthesis algorithms for table look up programmable gate arrays. In Proc. ACM/IEEE Design Automation Conf., pages 620-625, 1990.
[14] R. Murgai, N. Shenoy, R. K. Brayton, and A. Sangiovanni-Vincentelli. Improved logic synthesis algorithms for table look up architectures. In Digest Intl. Conf. on Computer-Aided Design, pages 564-567, 1991.
[15] P. Sawkar and D. Thomas. Performance directed technology mapping for look-up table based FPGAs. In Proc. ACM/IEEE Design Automation Conf., pages 208-212, 1993.
[16] M. Schlag, J. Kong, and P. K. Chan. Routability-driven technology mapping for lookup table-based FPGA's. IEEE Trans. on Computer-Aided Design, 13:13-26, 1994.
[17] U. Weinmann and W. Rosenstiel. Technology mapping for sequential circuits based on retiming techniques. In Proc. European Design Automation Conf., pages 318-323, 1993.
[18] N.-S. Woo. A heuristic method for FPGA technology mapping based on the edge visibility. In Proc. ACM/IEEE Design Automation Conf., pages 248-251, 1991.
[19] Xilinx. The Programmable Gate Arrays Data Book. Xilinx, San Jose, CA, 1993.
[20] H. Yang and D. F. Wong. Edge-Map: Optimal performance driven technology mapping for iterative LUT based FPGA designs. In Digest Intl. Conf. on Computer-Aided Design, pages 150-155, 1994.