
A BEHAVIORAL SYNTHESIS APPROACH FOR DISTRIBUTED MEMORY FPGA ARCHITECTURES

Ashutosh Pal*        M. Balakrishnan†

* CoWare India Pvt. Ltd., Tower-B, Logix Techno Park, Noida, India, [email protected]

† Department of Computer Science and Engineering, Indian Institute of Technology, Hauz Khas, New Delhi, India, [email protected]

ABSTRACT

This paper presents an approach for efficiently mapping loop- and array-intensive applications onto FPGA architectures with distributed RAMs, multipliers and logic. We perform a data-dependency-based, two-level partitioning of the application's iteration space under target FPGA architectural constraints to achieve better performance. It is shown that this approach can result in a super-linear speedup: linear speedup due to concurrent computation on multiple compute elements, and additional speedup due to improvement in the clock frequency (up to 30%). The clock period reduction is made possible because computation and accesses are now localized, i.e., the compute elements interact only with nearby memories.

1 INTRODUCTION

One of the main aspects of behavioral synthesis is the efficient utilization of target architecture information while performing transformations. In this work, we propose to utilize the physical closeness between the embedded RAMs and compute units present in current FPGA architectures like Xilinx Virtex II [7] and Altera's Stratix II [8] to create distributed data paths. This is integrated with the apparent parallelism available in FPGA architectures due to the presence of replicated resources. We have adopted the loop partitioning techniques reported in the domain of parallelizing compilers for multiprocessors to partition the application's iteration space.

There have been some efforts previously in the behavioral synthesis domain targeting FPGA architectures. FPGA compilers like the SA-C compiler [2] mainly performed parallelizing code transformations to utilize the huge arrays of CLBs. Baradaran et al. [3] present a technique to utilize the embedded RAMs as caches, to enable data reuse. Ouaiss et al. [4] target the hierarchical memory mapping problem on an RC board aiming to optimize performance, but the analysis assumes a single processing unit. Baradaran et al. [5] present a custom array mapping approach for generalized configurable architectures and report improvement in clock cycles, but not in the clock period. In this work, we target specialized FPGA architectures with distributed RAMs and associated compute units to achieve concurrency in computation and localization of memory accesses. We report improvement both in clock period and in clock cycles to achieve super-linear speedup.

2 MOTIVATIONAL EXAMPLE

We have used the Xilinx Virtex II [7] FPGA architecture for performing our experiments. The architecture comprises embedded block RAMs (BRAMs) with multipliers adjacent to them, as shown in fig. 2.1. To establish the motivation, we have used Dot Product as the input application, whose kernel is shown below:

    for (i = 0; i < N; i++)
        dprod = dprod + A[i] * B[i];
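For intuition, the transformed design discussed next can be mirrored in software: each of the four partitions computes a partial dot product over its own slice of the arrays (its local BRAM contents), and the partial results are combined afterwards, mirroring the collate step of section 4. The following C sketch is an illustration of this structure only, not the synthesized hardware; the array size and data values are invented for the example.

    #include <stdio.h>

    #define N 1024   /* total number of elements (illustrative) */
    #define P 4      /* four partitions, one per BRAM/multiplier pair */

    int main(void) {
        static int A[N], B[N];
        int partial[P] = {0};   /* one partial product per compute element */
        int dprod = 0;

        /* Invented sample data. */
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

        /* Each partition p accesses only its own slice of A and B,
           mimicking localized BRAM accesses; in hardware the P inner
           loops run concurrently on separate multipliers. */
        for (int p = 0; p < P; p++)
            for (int i = p * (N / P); i < (p + 1) * (N / P); i++)
                partial[p] += A[i] * B[i];

        /* Combine the partial results (the collate step). */
        for (int p = 0; p < P; p++)
            dprod += partial[p];

        printf("dprod = %d\n", dprod);
        return 0;
    }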



Fig. 2.1(a) shows the synthesized design of dot product when taken through the normal synthesis flow. Each of the arrays gets mapped to the BRAMs, with only a single multiplier unit being used for the computation. It is seen that the memory access path to the multiplier is the critical path of the design. As the array sizes increase, the critical path delay increases, which in turn increases the clock period, and the performance degrades. The transformed dot product design is shown in fig. 2.1(b). It is obtained by partitioning the iteration space into four partitions (each mapped to a BRAM and a multiplier). It can be seen that the memory accesses are now localized, and thus the critical path delay (or the clock period) will not increase with the array sizes.

A

B

B

BRAMS

*

*

*

*

Multiplier

(a) Normal Design utilizing single compute unit A

B *

A

B *

A

B *

A

B *

(b) Transformed Design utilizing multiple compute units

Fig. 2-1: Dot Product synthesized designs

However, as can be seen in fig. 2.2, there is still some increase in the clock period, primarily due to the increase in routing overheads. It can be inferred from fig. 2.2 that up to 30% clock period improvement is achieved.


Fig. 2-2: Clock period variation for the two designs

3 PROBLEM DESCRIPTION


Consider an application with an N-dimensional perfectly nested [6] iteration space with constant loop bounds and uniformly generated array access functions (as described in Chen et al. [1]). Further, consider a target FPGA architecture with p identical embedded RAMs, p associated compute units, and a latency description Lat(i, j), which represents the latency of accessing an element from RAM i at compute unit j. The aim of this work is to map the input application's iteration and data space onto the resources of the target architecture to achieve better performance. The latency descriptions were obtained empirically by placing a logical RAM onto the different BRAM positions on the device, while keeping the position of the multiplier fixed. The values ranged from 8.310 ns to 14.706 ns for the XC2V6000 Xilinx Virtex II [7] device (128 BRAMs organized in 6 columns).
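Such a latency description can be represented as a simple p x p table indexed by RAM and compute unit. The C sketch below is an illustration only: the table dimensions and all values except the quoted measurement range are invented, and picking the nearest RAM for each compute unit is merely one way such a description could guide the mapping.

    #include <stdio.h>

    #define P 4   /* illustrative number of RAM/compute-unit pairs */

    /* Lat[i][j]: access latency (ns) from RAM i to compute unit j.
       Values are invented, loosely spanning the measured 8.310 ns
       to 14.706 ns range reported above. */
    static const double Lat[P][P] = {
        {  8.31, 10.20, 12.50, 14.70 },
        { 10.20,  8.31, 10.20, 12.50 },
        { 12.50, 10.20,  8.31, 10.20 },
        { 14.70, 12.50, 10.20,  8.31 }
    };

    /* Return the RAM with the smallest access latency for compute unit j;
       mapping a partition's data to this RAM localizes its accesses. */
    static int nearest_ram(int j) {
        int best = 0;
        for (int i = 1; i < P; i++)
            if (Lat[i][j] < Lat[best][j]) best = i;
        return best;
    }

    int main(void) {
        for (int j = 0; j < P; j++) {
            int i = nearest_ram(j);
            printf("compute unit %d -> RAM %d (%.3f ns)\n", j, i, Lat[i][j]);
        }
        return 0;
    }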

4 PARTITIONING MODEL

We propose a partitioning model according to which the given input application will be partitioned. The model is hierarchical in nature and comprises three different kinds of partitions. We use the notation It(X) to represent the set of iterations mapped to partition X, Data(X) to represent the set of data elements accessed by partition X, while Comm(X, Y) represents the set of data elements transferred between partitions X and Y during computation.

4.1 Types of Partitions

1. Logical level-1 (L1) partitions $\{L1_1, L1_2, \dots, L1_n\}$ are at the topmost level of the hierarchy. Formally, we have

$$It(L1_i) \cap It(L1_j) = \emptyset; \quad Data(L1_i) \cap Data(L1_j) = \emptyset; \quad Comm(L1_i, L1_j) = \emptyset, \quad \forall i, j,\ 1 \le i, j \le n,\ i \ne j.$$

These partitions are obtained by grouping the data-dependent iterations together in one partition, and hence there is no communication between them. We represent these partitions using the data-dependency-based vector space formulation of an iteration partition from Chen et al. [1]. So we have

$$It(L1_i) = b_i + a_1 \cdot d_1 + a_2 \cdot d_2 + \dots + a_u \cdot d_u,$$

where $b_i$ is an initial point and $d_1, d_2$, etc. are data-dependence distance vectors [6]. We call $a_1, a_2$, etc. the dependency control variables, as they control the extent to which the corresponding dependencies apply. Each partition has a unique initial point, and by varying the $a_i$s, all the iterations in a partition can be obtained.

2. Logical level-2 (L2) partitions $\{L2_1, L2_2, \dots, L2_k\}$ are at the next level of the hierarchy. Formally, we have

$$It(L2_i) \cap It(L2_j) = \emptyset; \quad Data(L2_i) \cap Data(L2_j) \ne \emptyset; \quad Comm(L2_i, L2_j) = \emptyset, \quad \forall i, j,\ 1 \le i, j \le k,\ i \ne j.$$

These are obtained by further partitioning the L1 partitions by relaxing the false dependencies [6] and duplicating the shared data, as described in section 5. Each of these k L2 partitions will again comprise t physical-level partitions, described next.

3. Physical-level partitions $\{P_1, P_2, \dots, P_t\}$ are at the lowest level of the hierarchy. Each of these partitions comprises an embedded RAM and its associated computation unit. Formally, we have

$$It(P_i) \cap It(P_j) = \emptyset; \quad Data(P_i) \cap Data(P_j) \ne \emptyset; \quad Comm(P_i, P_j) \ne \emptyset, \quad \forall i, j,\ 1 \le i, j \le t,\ i \ne j.$$


Fig. 4-1: Partitioning Model

4.2 Description of the Model

As can be seen in fig. 4.1, the model comprises three main units: an Init unit, a Compute unit and a Collate unit. The Init unit initializes all the embedded RAMs with the appropriate data elements using the Address Map and the Input Data, both of which reside in the external memory. The Address Map contains the mapping of the array elements to the locations in the physical partitions. The Compute unit encapsulates all three levels of partitions discussed in the previous subsection. There is also a hierarchical FSM (LFSM, PFSM, and TFSM, respectively, at each of the three levels) in place for providing control at each level. Finally, there is a Collate unit to collate all the results after computation and transfer them to the Output memory. Note that we make use of two clocks, viz. the compute-clock and the init-clock. The init-clock is a slower clock which triggers the Init unit, the Collate unit and the top-level FSM (TFSM), as they are involved in global communication all over the chip, while the compute-clock is a faster clock which triggers the RAMs, compute units and local FSMs.
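As a sketch of what an Address Map entry might contain (the field names below are assumptions for illustration, not taken from the paper), each array element is associated with a physical partition and a local address within that partition's embedded RAM:

    /* Illustrative Address Map entry; field names are invented. */
    struct addr_map_entry {
        int array_id;    /* which source array (e.g., A = 0, B = 1)     */
        int index;       /* element index within the source array       */
        int partition;   /* physical partition (BRAM/compute-unit pair) */
        int local_addr;  /* address inside that partition's BRAM        */
    };

    /* Example: element A[5] resides at address 5 of partition 0's BRAM. */
    struct addr_map_entry example = { 0, 5, 0, 5 };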

5 PARTITIONING APPROACH

The aim of our approach is to capture those design points where there is maximum parallelism and minimum clock period, implying maximum performance. This is achieved by increasing the number of L2 partitions within one L1 partition and minimizing the number of physical partitions within an L2 partition. Algorithms 1 and 2 describe the complete exploration process. We will go over the main steps of the algorithms using the Matrix Multiplication example, whose kernel is shown below:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                R[i][j] = R[i][j] + P[i][k] * Q[k][j];

________________________________________________
Algorithm 1: Level1_Partitioning

We obtain L1 partitions using the approach proposed by Chen et al. [1] for multiprocessor architectures, in which all the data-dependent iterations are grouped together into a set of partitions (Step 2 of Algorithm 1). Using the representation of an iteration partition described in subsection 4.1, the L1 partition for matrix multiplication obtained after Step 2 of Algorithm 1 is:

$$(0, 0, 0) + a_1 \cdot (0, 1, 0) + a_2 \cdot (1, 0, 0) + a_3 \cdot (0, 0, 1)$$

Here (0, 0, 0) is the initial point, while (0, 1, 0), (1, 0, 0) and (0, 0, 1) are data-dependence distance vectors [6] corresponding to the array accesses P[i][k], Q[k][j] and R[i][j], respectively. One can see that there will be only one L1 partition in this case, as by varying the $a_i$s from 1 to N all iterations can be obtained. If the loop-bound N is 32, then 0
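To make the vector-space formulation concrete, the following C sketch (an illustration, not the paper's Algorithm 1) enumerates the iteration points of this single L1 partition from its initial point and distance vectors, and checks that every point of the N x N x N iteration space is covered. The control variables are varied over 0..N-1 here, which is an assumption about the intended ranges.

    #include <stdio.h>

    #define N 32   /* loop bound from the example */

    int main(void) {
        /* Initial point (0,0,0) and the distance vectors for the
           accesses P[i][k], Q[k][j] and R[i][j], respectively. */
        const int d1[3] = {0, 1, 0};
        const int d2[3] = {1, 0, 0};
        const int d3[3] = {0, 0, 1};
        long covered = 0;

        /* Vary the dependency control variables a1, a2, a3; each
           choice yields one iteration point (i, j, k) = (a2, a1, a3). */
        for (int a1 = 0; a1 < N; a1++)
            for (int a2 = 0; a2 < N; a2++)
                for (int a3 = 0; a3 < N; a3++) {
                    int it[3];
                    for (int x = 0; x < 3; x++)
                        it[x] = a1 * d1[x] + a2 * d2[x] + a3 * d3[x];
                    if (it[0] < N && it[1] < N && it[2] < N)
                        covered++;
                }

        printf("points covered: %ld of %ld\n", covered, (long)N * N * N);
        return 0;
    }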