A Methodology for Implementing FIR Filters and CAD Tool Development for Designing RNS-Based Systems
D. Soudris, K. Sgouropoulos, K. Tatas, V. Padidis* and A. Thanailakis
Dept. of Electrical and Computer Engineering, Democritus University of Thrace, GR-67100 Xanthi, Greece
{dsoudris, ksgourop, ktatas}@ee.duth.gr
*Dept. of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
[email protected]

ABSTRACT
The goal of this research is twofold. First, a design methodology is derived for implementing FIR filters based on the Residue Number System (RNS), aiming at reduced power, delay, and hardware complexity compared with conventional binary implementations. Second, a CAD tool is developed that automatically generates a synthesizable VHDL description of any RNS system design. The tool can derive RNS Full Adder (FA)-based DSP architectures consisting of FIR, Scaling, Converter, and Multiplication and Accumulation units.

1. INTRODUCTION
The demand for real-time Digital Signal Processing (DSP) under power consumption constraints has forced researchers to look for efficient arithmetic algorithms that can implement high-speed DSP units. Among others, systems based on the Residue Number System (RNS) have become some of the most popular, as they take advantage of the benefits of parallelism and carry-free computation. The use of RNS allows a given dynamic range to be decomposed into channels of smaller range, on which computation can be efficiently implemented in parallel without the need for carry propagation between them [1], [2]. These features make RNS beneficial for DSP applications, particularly when a large word length and a high throughput rate are required. The simplest implementation approach for RNS is the use of look-up tables, usually realized as ROMs [3]. The fundamental drawback of look-up table-based architectures is the exponential growth of table size with the input word length; therefore, they can be inefficient for large moduli. In [4], a systematic FA-based methodology was developed for designing RNS-based inner product processor units. That methodology outperforms conventional look-up table-based approaches, with a much higher throughput rate and lower area cost, while providing a systematic design framework for deriving array architectures for any modulo.
This methodology has already been extended to the implementation of binary-to-RNS/QRNS converters and vice versa [5], multiplication and accumulation [4], and scaling computations [6]. Existing RNS FIR filter implementations were presented in [7] using ROM-based computations, which lead to high memory requirements and a low throughput rate for large word lengths. FA-based implementations and comparisons with binary ones, in terms of area and power consumption, were presented in [8] and [9], although the design procedure presented there is not automated.

This paper introduces, first, a methodology for designing FA-based RNS FIR filters that are efficient compared with traditional binary FIR systems and, second, a unified Computer Aided Design (CAD) tool for designing VLSI RNS-based DSP units. An RNS system may comprise a combination of FIR filters, scaling units, and converters. The developed tool takes into consideration the characteristics common to the design methodology of each RNS operation and automatically derives a synthesizable VHDL description of complex RNS systems.

0-7803-7761-3/03/$17.00 ©2003 IEEE
2. RNS FIR FILTER METHODOLOGY
2.1 RNS Basics
An RNS is defined by a set of relatively prime moduli $\{m_0, m_1, \ldots, m_i, \ldots, m_p\}$. Any integer $X$ in the integer ring $R(M)$, with $M = \prod_{i=0}^{p} m_i$, has a unique representation $X = (x_0, x_1, \ldots, x_i, \ldots, x_p)$, according to the simple rule $x_i = \langle X \rangle_{m_i}$, where $\langle X \rangle_{m_i}$ denotes the operation $X$ modulo $m_i$. Any of the allowed operations, $Y = W \circ Z$, can be encoded as $Y = \{y_0, y_1, \ldots, y_i, \ldots, y_p\}$, where each member of the tuple can be calculated by $y_i = \langle w_i \circ z_i \rangle_{m_i}$ and $\circ$ denotes addition, subtraction, or multiplication. It is worth mentioning that RNS-based operations are performed in parallel within small, non-communicating channels whose word length is bounded by $n_i = \lceil \log_2 m_i \rceil$. Since each channel's word length is constant and smaller than that of the binary representation, the hardware needed for the computations may reach much higher throughput rates while reducing area cost compared with designs using traditional binary arithmetic.

2.2 Design Methodology
A traditional N-tap FIR filter with impulse response coefficients $h(k)$ can be described by
$$y(n) = \sum_{k=0}^{N} h(k)\, x(n-k) \tag{1}$$
Figure 1: FIR filter in transposed form

A possible realization of a FIR filter in transposed form is shown in Figure 1. Each tap of the FIR filter requires one multiplication and accumulation with the proper bit precision, which is determined by the input data and the coefficients. Within the RNS-decomposed channels, a FIR filter implementation differs slightly from a binary one in the sense that modular arithmetic must be used. Figure 2 shows the implementation of the RNS-based FIR filter. Each channel implements a FIR filter modulo $m_i$, whose inputs and coefficients are in their residue representation. The filter's coefficients, which are provided by some type of memory, must be residues too. To reduce the power consumed by fetching coefficients from memory, we store the coefficients in a certain order, determined as in [10].
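Figure 2: RNS FIR filter

Figure 3: Tap's implementation using IPSP$_{m_i}$

The formula associated with (1) for RNS FIR filters can be expressed as:

$$y_i(n) = \left\langle \sum_{k=0}^{N} \langle h(k) \rangle_{m_i}\, \langle x(n-k) \rangle_{m_i} \right\rangle_{m_i} \tag{2}$$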
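The channel-wise computation can be illustrated with a short behavioural sketch. It encodes the inputs as residues, evaluates the FIR sum of (1) independently in each channel modulo $m_i$, and recovers the binary result with the Chinese Remainder Theorem. The moduli, coefficients, samples, and function names are illustrative assumptions and are not taken from the paper; the sketch also assumes non-negative data whose results stay below $M$.

```python
from math import prod

def to_rns(x, moduli):
    """Residue representation: x_i = x mod m_i for every modulus."""
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction into the range [0, M)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(a, -1, m) is the modular inverse (Python 3.8+)
    return total % M

def channel_bits(m):
    """Channel word length n_i = ceil(log2 m_i), valid for m >= 2."""
    return (m - 1).bit_length()

def fir_binary(h, x, n):
    """Eq. (1): y(n) = sum_k h(k) x(n-k), taking x(j) = 0 outside the sample list."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))

def fir_channel(h, x, n, m):
    """The same sum evaluated entirely modulo m on residues of h and x."""
    acc = 0
    for k in range(len(h)):
        if 0 <= n - k < len(x):
            acc = (acc + (h[k] % m) * (x[n - k] % m)) % m
    return acc

# Illustrative (assumed) base: pairwise coprime, M = 29 * 31 * 32 = 28768.
moduli = [29, 31, 32]
h = [3, 1, 4, 1]
x = [5, 9, 2, 6, 5, 3]

for n in range(len(x)):
    residues = [fir_channel(h, x, n, m) for m in moduli]
    # Each channel works independently; CRT recovers the binary result.
    assert from_rns(residues, moduli) == fir_binary(h, x, n)
```

Each of the three channels here needs only 5-bit arithmetic, while the binary result may need the full range of $M$.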
It is sufficient to study the design of a FIR filter modulo $m_i$. Eq. (2) is actually a sum-of-products algorithm; that is, each tap needs one multiplication and accumulation unit. The cell required for this operation is an Inner Product Step Processor modulo $m_i$ (IPSP$_{m_i}$). The IPSP$_{m_i}$, with three inputs $(X, h(k), Y_{k-1})$ and one output $Y_k$, can be described by

$$Y_k = \langle Y_{k-1} + h(k) \cdot X \rangle_{m_i} \tag{3}$$

where $Y_k$, $h(k)$, $Y_{k-1}$ and $X \in R(m_i) = \{0, 1, \ldots, m_i - 1\}$. The proposed methodology computes $Y_k$ in three stages: i) the Preprocessing stage, ii) the Bit Reduction stage, and iii) the Final Mapping stage. In particular, the first stage adds $Y_{k-1}$ to the partial products of $h(k) \times X$, which are calculated by $n_i^2$ AND gates. The second stage reduces the output word length of the first stage using the properties of modular arithmetic.
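A behavioural sketch of the three stages may help fix ideas. It is a software model only, not the FA array of the methodology. The reduction rule it uses (replacing every bit of weight $2^j$, $j \ge n_i$, by the constant $\langle 2^j \rangle_{m_i}$) is an assumption in the spirit of [4], and the function name `ipsp_mod` is hypothetical.

```python
def ipsp_mod(x, h, y_prev, m):
    """Behavioural model of Y_k = (Y_{k-1} + h(k) * X) mod m.

    All operands are assumed to lie in [0, m) with m >= 2.
    """
    n = (m - 1).bit_length()  # channel word length, ceil(log2 m)

    # Stage 1 (preprocessing): n * n AND gates form the partial products,
    # which are accumulated together with Y_{k-1}.
    s = y_prev
    for i in range(n):
        for j in range(n):
            bit = ((x >> i) & 1) & ((h >> j) & 1)  # one AND gate
            s += bit << (i + j)

    # Stage 2 (bit reduction, assumed rule): every set bit of weight 2^j with
    # j >= n is replaced by the constant (2^j mod m). This keeps the value
    # unchanged modulo m and strictly shrinks it, so the loop terminates.
    mask = (1 << n) - 1
    while s >> n:
        high = s >> n
        s = (s & mask) + sum(
            pow(2, j + n, m) for j in range(high.bit_length()) if (high >> j) & 1
        )

    # Stage 3 (final mapping): now s < 2^n < 2m, so a single conditional
    # subtraction maps the result into [0, m).
    return s - m if s >= m else s

# Example for the modulus used in Table I: (5 + 23 * 17) mod 29 = 396 mod 29 = 19.
assert ipsp_mod(17, 23, 5, 29) == 19
```

In hardware, the constants $\langle 2^j \rangle_{m_i}$ are fixed once the modulus is chosen, which is what lets the reduction be wired as an array of full adders instead of a general divider.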
Since the proposed architecture is based on FAs, some of its favorable characteristics are worth mentioning. The derived graphs can be used in a variety of complex DSP applications, since they lead to two-dimensional arrays that implement a number of the required calculations simultaneously while providing efficient placement and interconnection in the array processor. FA array processors can achieve a variety of throughput rates because they can be implemented with a number of different pipelining techniques. Using bit-parallel loading, it is possible to obtain a design whose critical path equals the delay of a single FA.

2.3 Hardware Implementation
In this section, the hardware realization results of both RNS and binary FIR filters are presented. The architectures have been mapped onto a Xilinx Virtex-II FPGA. The aim is to present a number of favorable characteristics of the proposed architecture, such as its ability to reach a high throughput rate through pipelining, and to show that the choice of the RNS base is of great importance. Finally, traditional binary filters were implemented, and the results are presented in comparison with identical RNS filters, i.e. both filters have the same order and dynamic range.

Table I shows comparison results of three pipelining techniques in terms of area, delay, and latency of a FIR tap modulo $m_i = 29$. In particular, Type I represents the basic pipelining scheme, where one register is located between the taps of the filter. In Type II, pipeline registers are placed between the first and second stages in order to reduce the critical path (Figure 3). This approach gave the best overall results with respect to area, delay, latency and performance. Type III uses a highly pipelined technique with registers placed after each FA. It can be seen that Type III speeds up the throughput rate by more than 5 times compared with the Type I pipelining scheme. Thus, the designer can choose the architecture whose features meet an application's constraints.

Table I: Area, delay, and latency of a FIR tap modulo $m_i = 29$ for three pipelining techniques.

| Pipelining type | Area (equiv. gates) | Delay (ns) | Latency (clock cycles) |
|-----------------|---------------------|------------|------------------------|
| Type I          | 742                 | 17.11      | 1                      |
| Type II         | 846                 | 11.13      | 2                      |
| Type III        | 4501                | 3.3        | 28                     |
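The trade-off in Table I can be made explicit with a little arithmetic. The sketch below assumes that the clock period equals the reported delay, so that throughput is inversely proportional to it; the helper names are illustrative.

```python
# Table I figures for a FIR tap modulo 29, as reported above.
TABLE_I = {
    "Type I":   {"area_gates": 742,  "delay_ns": 17.11, "latency_cycles": 1},
    "Type II":  {"area_gates": 846,  "delay_ns": 11.13, "latency_cycles": 2},
    "Type III": {"area_gates": 4501, "delay_ns": 3.3,   "latency_cycles": 28},
}

def speedup_over_type_i(name):
    """Throughput gain relative to Type I, assuming clock period == delay."""
    return TABLE_I["Type I"]["delay_ns"] / TABLE_I[name]["delay_ns"]

def area_ratio_over_type_i(name):
    """Area cost relative to Type I."""
    return TABLE_I[name]["area_gates"] / TABLE_I["Type I"]["area_gates"]

def fill_time_ns(name):
    """Time until the first result appears: latency cycles times clock period."""
    row = TABLE_I[name]
    return row["latency_cycles"] * row["delay_ns"]

for name in TABLE_I:
    print(
        f"{name}: speed-up {speedup_over_type_i(name):.2f}x, "
        f"area {area_ratio_over_type_i(name):.2f}x, "
        f"fill time {fill_time_ns(name):.1f} ns"
    )
```

Under this assumption Type III runs about 5.2 times faster than Type I at about 6.1 times the area, while Type II gains roughly 1.5 times the throughput for about 14% more area.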
Table IV gives additional results comparing RNS FIR and binary FIR filters for various word lengths under the $A \cdot T^2$ criterion, where $A$ is the area cost and $T$ is the filter tap delay. All measurements were made per filter tap. The total area occupied by an N-tap RNS FIR filter is given by:

$$A_{total} = A_{conv} + N \sum_{i=0}^{p} A_i \tag{5}$$

where $A_i$ and $A_{conv}$ represent the area occupied by a tap modulo $m_i$ and by the converters, respectively. Both binary and RNS filters were implemented using the same pipelining method in order to achieve the same latency, although the RNS filter needs 4 more clock cycles than the binary one due to the input and output conversions. The performance model $A \cdot T^2$ in Table IV does not take into account the area overhead of the converters, because FIR filters are usually of high order (N > 10), so the contribution of the converters becomes negligible. Area and delay measurements were performed up to a word length of 32 input bits. Figure 4 illustrates the relation of $A \cdot T^2$ to word length. It can be deduced that the derived RNS-based FIR filter architectures exhibit better features than binary FIR filters for word lengths of 16 bits and above.

Table IV: Area, delay and efficiency model $A \cdot T^2$ per tap for equivalent binary and RNS filters.

| Word length | Binary area | Binary delay | Binary $A \cdot T^2$ | RNS area | RNS delay | RNS $A \cdot T^2$ |
|-------------|-------------|--------------|----------------------|----------|-----------|-------------------|
| 10          | 1968        | 13.68        | 368300               | 2346     | 9.21      | 199000            |
| 16          | 5021        | 15.10        | 1237400              | 4988     | 15.24     | 1158500           |
| 20          | 6623        | 24.94        | 4118750              | 6467     | 18.41     | 2206160           |
| 32          | 15611       | 39.68        | 24683470             | 12618    | 19.56     | 4821560           |

The research on filters in transposed form can easily be extended to lattice FIR filters. Since a lattice tap consists of two multiplication and accumulation units, two IPSPs are required. Thus, the hardware complexity of a lattice filter is double that of the transposed one.
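The area model of (5) and the $A \cdot T^2$ comparison can be sketched as follows. The $A \cdot T^2$ figures are the tabulated values from Table IV; the per-channel tap areas and the converter area in the second part are hypothetical numbers chosen only to show how the converter share shrinks as the filter order grows.

```python
def total_area(tap_areas, n_taps, converter_area):
    """Eq. (5): A_total = A_conv + N * sum_i A_i."""
    return converter_area + n_taps * sum(tap_areas)

def at2(area, delay):
    """Efficiency criterion A * T^2 (lower is better)."""
    return area * delay ** 2

def converter_share(tap_areas, n_taps, converter_area):
    """Fraction of the total area spent on converters."""
    return converter_area / total_area(tap_areas, n_taps, converter_area)

# Tabulated A*T^2 per tap from Table IV: word length -> (binary, RNS).
TABLE_IV_AT2 = {
    10: (368_300, 199_000),
    16: (1_237_400, 1_158_500),
    20: (4_118_750, 2_206_160),
    32: (24_683_470, 4_821_560),
}

for bits, (binary, rns) in TABLE_IV_AT2.items():
    print(f"{bits:2d} bits: binary / RNS = {binary / rns:.2f}")

# Hypothetical areas (not from the paper): three channels and one converter pair.
tap_areas = [700, 750, 800]
converter_area = 3000
for n_taps in (4, 16, 64):
    share = converter_share(tap_areas, n_taps, converter_area)
    print(f"N = {n_taps:2d}: converters take {100 * share:.1f}% of the area")
```

With the tabulated values, the binary-to-RNS ratio of $A \cdot T^2$ exceeds 5 at a word length of 32 bits, which is where the carry-free channels pay off most.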
Figure 4: Efficiency model $A \cdot T^2$ versus word length.

Table III: Example for different base RNS implementations

3. UNIFIED CAD TOOL
The developed design environment for RNS DSP units is presented here. The environment, built in C++, is a CAD tool that implements the FA-based RNS design methodology and automates the design procedure, starting from the function (algorithm) expression and ending with the VHDL description. The tool is able to describe a number of RNS DSP units in the VHSIC Hardware Description Language (VHDL), while providing important information about hardware and delay costs. Based on this information, it is also possible to make design and performance tradeoffs. The tool can compose complex RNS-based DSP systems from common architectural characteristics. The tool receives as inputs: i) the RNS base, ii) the word length of the binary data inputs, $l$, iii) the desired function, iv) the scaling factor, and v) the filter's order. The last two inputs are optional, depending on the function being implemented.

The developed tool consists of three main modules: i) the Elementary Component Core Module (ECCM), ii) the Single Channel Architecture Module (SCAM), and iii) the Integrated System Module (ISM), as shown in Figure 5. In particular, ECCM includes three parts: a) the Elementary Component Library, b) the hardware architecture of each elementary component, obtained by applying the RNS design methodology, and c) the VHDL description of each elementary component. The purpose of ECCM is to derive the VHDL of any element of the component library. Specifically, an RNS system may comprise any combination of the preprocessing stage, the bit reduction stage, the final mapping stage, and the multiplication stage, each of which is described by the developed RNS methodologies [4], [5], [6]. First, we specify the Dependence Graph of each stage. Then, we derive the VHDL description of each library element. Having determined the VHDL of each component, and taking into account the RNS function, we produce the FA-based RNS architecture (VHDL description) for any modulo $m_i$, which is the task of SCAM. The third module derives the whole architecture, with the appropriate binary-to-RNS and RNS-to-binary converters.
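To make the tool's interface concrete, the following sketch models the five inputs listed above and emits the VHDL of one elementary component, a full adder. It is written in Python for brevity (the actual tool is built in C++) and does not reproduce the tool's code or output: the class, field, and function names, the accepted function strings, and the check that the dynamic range $M$ covers $2^l$ are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import gcd, prod
from typing import Optional, Tuple

@dataclass(frozen=True)
class ToolInputs:
    """Hypothetical container for the five tool inputs."""
    rns_base: Tuple[int, ...]             # i) the RNS base (moduli)
    word_length: int                      # ii) word length l of the binary inputs
    function: str                         # iii) desired function, e.g. "fir" (assumed names)
    scaling_factor: Optional[int] = None  # iv) optional, used by scaling
    filter_order: Optional[int] = None    # v) optional, used by FIR

    def validate(self) -> None:
        base = self.rns_base
        for a in range(len(base)):
            for b in range(a + 1, len(base)):
                if gcd(base[a], base[b]) != 1:
                    raise ValueError(f"moduli {base[a]} and {base[b]} are not coprime")
        # Assumed check: the dynamic range must at least cover the input range.
        if prod(base) < 2 ** self.word_length:
            raise ValueError("dynamic range M is smaller than 2^l")
        if self.function == "fir" and self.filter_order is None:
            raise ValueError("FIR requires a filter order")
        if self.function == "scaling" and self.scaling_factor is None:
            raise ValueError("scaling requires a scaling factor")

def channel_word_lengths(inputs: ToolInputs) -> dict:
    """Word length ceil(log2 m_i) of every channel."""
    return {m: (m - 1).bit_length() for m in inputs.rns_base}

def full_adder_vhdl(name: str = "full_adder") -> str:
    """VHDL text for one full adder using only std_logic_1164."""
    return f"""library ieee;
use ieee.std_logic_1164.all;

-- Elementary component: 1-bit full adder
entity {name} is
  port (a, b, cin : in std_logic; sum, cout : out std_logic);
end entity {name};

architecture rtl of {name} is
begin
  sum  <= a xor b xor cin;
  cout <= (a and b) or (cin and (a xor b));
end architecture rtl;
"""

inputs = ToolInputs(rns_base=(29, 31, 32), word_length=14, function="fir", filter_order=16)
inputs.validate()
print(channel_word_lengths(inputs))
print(full_adder_vhdl())
```

Rejecting an invalid base before any VHDL is produced mirrors the role of the early estimates the tool reports: problems surface before synthesis is attempted.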
Figure 5: CAD Tool Structure

The developed CAD tool can design any FA-based DSP system, which may consist of Converter, FIR, Scaling, Multiplication and Accumulation blocks. The tool's output is a series of files containing VHDL code that describes the system. The component declarations and architectures need no editing for simulation and synthesis. The only library used for the component descriptions is std_logic_1164. The VHDL code is easy to read and analyze thanks to automatically generated comments. Finally, a file is generated that contains the number of FAs and an estimated delay of the implemented system. This information gives the designer a good estimate of the hardware requirements early in the design phase. Thus, it is not necessary to reach the layout stage, which shortens the design cycle.

4. CONCLUSIONS
A methodology for implementing FIR filters using RNS was presented, together with a CAD tool that provides an environment for the automatic implementation of several DSP units based on the same methodology. Additionally, it has been shown that the proposed FIR methodology leads to efficient implementations compared with binary ones, in terms of area and delay, when a large word length is required.

REFERENCES
[1] F.J. Taylor, “Residue Arithmetic: A Tutorial with Examples”, IEEE Trans. Computers, vol. 17, no. 5, pp. 50-62, May 1984.
[2] M.A. Soderstrand, W.K. Jenkins, G.A. Julien and F.J. Taylor, “Residue Number System Arithmetic: Modern Applications in Digital Signal Processing”, New York: IEEE Press, 1986.
[3] N.S. Szabo and R.I. Tanaka, “Residue Arithmetic and its Applications in Computer Technology”, McGraw-Hill, 1967.
[4] D.J. Soudris, V. Paliouras, T. Stouraitis and C.E. Goutis, “A VLSI Design Methodology for RNS Full Adder Based Inner Product Architectures”, IEEE Trans. on Circuits and Systems-II, vol. 44, no. 4, pp. 315-318, April 1997.
[5] D.J. Soudris, M.M. Dasygenis and A.T. Thanailakis, “VLSI Methodology for the Design of RNS and QRNS Full Adder Based Converters”, to appear in IEE Proc. Circuits Devices Systems.
[6] D. Soudris, M. Dasygenis, K. Mitroglou, K. Tatas and A. Thanailakis, “A Full Adder Based Methodology for Scaling Operation In Residue Number System”, in Proc. of ICECS 2002, Croatia, pp. 891-894.
[7] W.K. Jenkins and G.A. Julien, “The Theory and Application of Modular Arithmetic in VLSI Digital Signal Processing”, IEEE Proceedings, September 1988.
[8] W.L. Freking and K.K. Parhi, “Low Power FIR Digital Filters Using Residue Arithmetic”, Proc. 31st Asilomar Conf. Signal System and Computers, Nov. 1997, pp. 739-743.
[9] Gian Carlo Cardarilli, Alberto Nannarelli and Marco Re, “Reducing Power Dissipation in FIR Filters using the Residue Number System”, in Proc. of 43rd IEEE MWCAS, vol. 1, pp. 320-323, Lansing, MI, USA, Aug. 8-11, 2000.
[10] K. Masselos, S. Theoharis, P.K. Merakos, T. Stouraitis and C.E. Goutis, “Low Power Synthesis of Sum-Of-Products Computation”, in Proc. of ISLPED, July 2000.
[11] V. Paliouras and T. Stouraitis, “Area-time performance of VLSI FIR filters based on Residue Arithmetic”, in Proceedings of Euromicro, 1997.