IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 8, AUGUST 2004
Hardware Efficient Fast Parallel FIR Filter Structures Based on Iterated Short Convolution

Chao Cheng, Member, IEEE, and Keshab K. Parhi, Fellow, IEEE
Abstract—This paper presents an iterated short convolution (ISC) algorithm, based on the mixed radix algorithm and the fast convolution algorithm. The ISC-based linear convolution structure is transposed to obtain a new hardware-efficient fast parallel finite-impulse response (FIR) filter structure, which saves a large amount of hardware, especially when the length of the FIR filter is large. For example, for a 576-tap filter, the proposed structure saves 17% to 42% of the multiplications, 17% to 44% of the delay elements, and 3% to 27% of the additions required by prior fast parallel structures, as the level of parallelism varies from 6 to 72. The regular structures also facilitate automatic hardware implementation of parallel FIR filters.

Index Terms—Fast convolution, iterated short convolution, parallel finite-impulse response (FIR), tensor product.
I. INTRODUCTION
MANY efforts have been directed toward deriving fast parallel filter structures in the past decade [1]–[6]. These algorithms usually first derive small-length fast parallel filters and then cascade or iterate them to design parallel finite-impulse response (FIR) filters with long block sizes. An approach to increase the throughput of FIR filters with reduced-complexity hardware was presented in [1]. This approach starts with short convolution algorithms, which are transposed to obtain computationally efficient parallel filter structures. In [2], parallel FIR filters are implemented using polyphase decomposition and fast FIR algorithms (FFAs); the FFAs are iterated to obtain fast parallel FIR algorithms for larger block sizes. Although the small-sized parallel filter structures in [1] and [2] are computationally efficient, the number of required delay elements increases with the level of parallelism. However, the transpose of the linear convolution structure in [1] is an optimal parallel FIR filter structure in terms of the required delay elements. While the Toeplitz-matrix factorization procedure in [1] buries additional delay elements inside the diagonal subfilter matrix, and the algorithm in [2] places additional delay elements in the postaddition matrix, a parallel FIR filtering structure based on the transpose of the linear convolution structure requires no additional delays inside the convolution matrix. Furthermore, the positions of the delay elements in this transposed
Manuscript received August 10, 2003; revised March 17, 2004. This paper was recommended by Associate Editor Y. Wang.
C. Cheng is with VIA Technologies (China), Inc., Ltd., Beijing 100085, China (e-mail: [email protected]).
K. K. Parhi is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSI.2004.832784
linear convolution structure are nicely placed, and thus the structure is more regular. In [3], a set of fast block filtering algorithms is derived based on fast short-length linear convolution algorithms to realize the parallel processing of subfilters. However, when the convolution length increases, the number of additions increases dramatically, which leads to complex preaddition and postaddition matrices that are not practical for hardware implementation.

Therefore, if we can use fast convolution algorithms to decompose the convolution matrix with simple preaddition and postaddition matrices, we can obtain a computationally efficient parallel FIR filter with a reduced number of required delay elements. Fortunately, we can use the mixed radix algorithm in [7], which decomposes the convolution matrix, via the tensor product, into two short convolutions. This algorithm is combined with fast two- and three-point convolution algorithms to obtain a general iterated short convolution algorithm (ISCA). Although a fast convolution of any length can be derived from the Cook–Toom algorithm or the Winograd algorithm [4], their preaddition or postaddition matrices may contain elements outside the set of 0 and signed powers of two, which makes them unsuitable for hardware implementation of the iterated convolution algorithm.

This paper is organized as follows. Section II investigates the iterated short convolution algorithm in matrix form. In Section III, new ISC-based fast parallel FIR filter structures are presented. Section IV presents complexity comparisons of ISC-based filters and existing FFA-based filters. Efficient short convolutions for the proposed hardware-efficient fast parallel filters are defined in Section V. Automatic hardware implementation aspects are described in Section VI.

II. ISCA

A long convolution can be decomposed into several levels of short convolutions. After fast convolution algorithms for the short convolutions are constructed, they can be used iteratively to implement the long convolution [4].
In this section, the mixed radix algorithm [7] is used to derive the generalized iterated short convolution algorithm in matrix form, using the tensor product operator. An (n1 × n2) convolution can be decomposed into an n1 × n1 convolution and an n2 × n2 convolution, whose short convolution algorithms can be constructed with fast convolution algorithms such as the Cook–Toom algorithm or the Winograd algorithm [4] and represented as s_{n1} = Q_{n1} H_{n1} P_{n1} x_{n1} and s_{n2} = Q_{n2} H_{n2} P_{n2} x_{n2}, respectively. P_{n1} and P_{n2} are preaddition matrices, and Q_{n1} and Q_{n2} are postaddition matrices. H_{n1} and H_{n2} are diagonal matrices, which can be denoted as
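The s = Q·H·P factorization can be made concrete with the classic 3-multiplication algorithm for the 2 × 2 convolution. The matrices below are one common choice and are assumptions of this sketch, since the paper's own matrix entries do not survive in this text:

```python
# A 3-multiplication fast algorithm for the 2x2 linear convolution,
# written in the s = Q * H * P * x form used in the text.
# P2/Q2 below are one standard (Karatsuba-style) choice, not necessarily
# the exact matrices of the paper.
P2 = [[1, 0], [1, 1], [0, 1]]                 # preaddition matrix
Q2 = [[1, 0, 0], [-1, 1, -1], [0, 0, 1]]      # postaddition matrix

def matvec(M, v):
    # plain matrix-vector product over lists
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def fast_conv2(h, x):
    # H is diagonal, so the elementwise product of the preprocessed
    # sequences realizes it: only 3 general multiplications are performed.
    products = [a * b for a, b in zip(matvec(P2, h), matvec(P2, x))]
    return matvec(Q2, products)

print(fast_conv2([2, 5], [3, 7]))  # [6, 29, 35], same as direct convolution
```

The three diagonal entries of H (h0, h0 + h1, h1) are exactly the multiplications counted by the diagonal matrix in the complexity analysis of Section IV.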
1057-7122/04$20.00 © 2004 IEEE
respectively; these diagonal matrices determine the number of required multiplications in the iterated short convolution algorithm. x_{n1} and h_{n1} are two column vectors containing the two input sequences of the n1 × n1 convolution, and x_{n2} and h_{n2} are two column vectors containing the two input sequences of the n2 × n2 convolution. These two convolutions result in two outputs, s_{n1} and s_{n2}, of length 2n1 − 1 and 2n2 − 1, respectively. Using the mixed radix algorithm [7], the resulting iterated short convolution algorithm can be represented as

s = A (Q_{n1} ⊗ Q_{n2}) (H_{n1} ⊗ H_{n2}) (P_{n1} ⊗ P_{n2}) x   (1)
Fig. 1. Representation for A.
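The tensor (Kronecker) product obeys the mixed-product property (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), which is what allows the pre-, diagonal, and postmatrices of the two short convolutions to be combined factor by factor as in (1). A minimal pure-Python sketch, for illustration only:

```python
def kron(A, B):
    # Kronecker (tensor) product of two matrices given as lists of rows:
    # entry at (i*rows(B)+k, j*cols(B)+l) is A[i][j] * B[k][l]
    return [[a * b for a in arow for b in brow]
            for arow in A for brow in B]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [0, 1]]; B = [[2, 1], [1, 3]]
C = [[1, 1], [2, 0]]; D = [[0, 1], [1, 1]]

# mixed-product property: (A (x) B)(C (x) D) == (AC) (x) (BD)
print(matmul(kron(A, B), kron(C, D)) == kron(matmul(A, C), matmul(B, D)))  # True
```

This property is the reason the iterated structure needs only the short P, Q, and H factors in hardware rather than the full long-convolution matrices.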
Matrices A and P are defined in (2) and (3), respectively, shown at the bottom of the page; P is composed of unity (identity) submatrices as defined in (3). As an example, a 6 × 6 convolution can be decomposed into a 2 × 2 convolution and a 3 × 3 convolution, with one set of fast convolution algorithms s2 = Q2 H2 P2 x2 and s3 = Q3 H3 P3 x3 for the 2 × 2 and 3 × 3 convolutions, respectively.

If we perform the 2 × 2 convolution first, then, according to (1), the 6 × 6 convolution can be expressed in matrix form as

(4)

where A is shown in Fig. 1. If we perform the 3 × 3 convolution first, then, according to (1), the 6 × 6 convolution can be described by the matrix form in

(5)

where A is shown in Fig. 2.
(2)

(3)
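The 6 × 6 example above can be sketched numerically: the length-6 sequences are split into two length-3 blocks, the 3-multiplication 2-point algorithm operates on the blocks, each block "multiplication" is itself a 3-point convolution, and the partial results are recombined by overlap-add. The block offsets (0, 3, 6) play the role of the A matrix here and are an assumption of this illustration, not the paper's exact recombination matrix:

```python
def conv(h, x):
    # direct linear convolution, output length len(h) + len(x) - 1
    y = [0] * (len(h) + len(x) - 1)
    for i, hv in enumerate(h):
        for j, xv in enumerate(x):
            y[i + j] += hv * xv
    return y

def vec_add(a, b):
    return [p + q for p, q in zip(a, b)]

def conv6_iterated(h, x):
    # 2-point fast algorithm over length-3 blocks: h(z) = H0 + z^3 * H1, etc.
    H0, H1 = h[0:3], h[3:6]
    X0, X1 = x[0:3], x[3:6]
    p0 = conv(H0, X0)                                  # three 3-point
    p2 = conv(H1, X1)                                  # convolutions replace
    p1 = conv(vec_add(H0, H1), vec_add(X0, X1))        # the 3 multiplications
    s1 = [a - b - c for a, b, c in zip(p1, p0, p2)]    # middle term H0X1 + H1X0
    y = [0] * 11                                       # 6 + 6 - 1 outputs
    for offset, block in zip((0, 3, 6), (p0, s1, p2)): # overlap-add (the "A" step)
        for i, v in enumerate(block):
            y[offset + i] += v
    return y

h = [1, 2, 3, 4, 5, 6]
x = [6, 5, 4, 3, 2, 1]
print(conv6_iterated(h, x) == conv(h, x))  # True
```

Using a 6-multiplication 3-point convolution for each of the three block products gives 18 multiplications in total, matching the count implied by the iterated diagonal matrices.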
Fig. 2. Representation for A.

An (n1 × n2 × n3) convolution can be further decomposed into three convolutions: n1 × n1, n2 × n2, and n3 × n3. One of the resulting iterated convolution algorithms can be represented by

(6)

We can decompose a long convolution into any combination of two- or three-point short convolutions by iterated convolution using (6). An (n1 n2 ⋯ nr)-point linear convolution can be decomposed into r short convolutions. One of the resulting iterated convolution algorithms can be represented by

(7)

In order to simplify notation, we write the matrix defined in (3) as P. Equation (7) is the proposed iterated short convolution algorithm. The mixed radix algorithm [7] combines two short convolutions to obtain a longer convolution, while our iterated short convolution algorithm can combine any number of short convolutions, and it is thus more efficient.

III. FAST PARALLEL FIR ALGORITHM BASED ON ISC

The iterated convolution structure can be transposed to obtain a fast parallel FIR filter. An L-parallel (L = n1 n2 ⋯ nr) N-tap FIR filter based on iterated short convolutions can be expressed as

(8)

where the subfilters contain the filter coefficients, each subfilter having N/L taps, and the preprocessing and postprocessing matrices, which are the transposes of the postaddition and preaddition matrices of (7), determine the manner in which the L inputs and L outputs are combined, respectively [4].

Consider the 4-parallel FIR filter as an example. We start with the 4 × 4 convolution implemented by iterating two 2 × 2 short convolutions as

(9)

This 4-parallel FIR architecture is shown in Fig. 3.
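The transposition step can be made concrete in the simplest case, a 2-parallel filter derived by transposing the 2 × 2 short convolution: three half-length subfilters, one block delay on the input side, and pre/post adders. This is a sketch of the standard 2-parallel fast-FIR form, shown only to make the subfilter and pre/postprocessing roles concrete; it is not the paper's 4-parallel structure of Fig. 3:

```python
def fir(h, x):
    # streaming FIR: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def fir_2parallel(h, x):
    # 2-parallel FIR from three half-length subfilters:
    #   y[2m]   = (H0*X0)[m] + (H1*X1)[m-1]          (one block delay)
    #   y[2m+1] = ((H0+H1)*(X0+X1))[m] - (H0*X0)[m] - (H1*X1)[m]
    h0, h1 = h[0::2], h[1::2]          # polyphase filter components
    x0, x1 = x[0::2], x[1::2]          # polyphase input components
    a = fir(h0, x0)
    b = fir(h1, x1)
    c = fir([p + q for p, q in zip(h0, h1)],
            [p + q for p, q in zip(x0, x1)])
    y = []
    for m in range(len(x0)):
        y.append(a[m] + (b[m - 1] if m > 0 else 0))  # even output
        y.append(c[m] - a[m] - b[m])                 # odd output
    return y

h = [1, 2, 3, 4]                 # 4-tap filter -> three 2-tap subfilters
x = [5, -1, 0, 2, 7, 3, 1, 1]
print(fir_2parallel(h, x) == fir(h, x))  # True
```

Note that the multiplications all live inside the three subfilters, while the pre/post combinations use only adders, mirroring the structure of (8).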
Fig. 3. Four-parallel FIR filter implementation.
IV. COMPLEXITY COMPUTATION

The number of required multiplications is determined by the diagonal matrix in (8) and is given by

(10)

where r is the number of short convolutions used, the number of multiplications in each n_i × n_i convolution is determined by its diagonal matrix, and N is the length of the original filter. All multiplications lie in subfilters of the same length; the number of subfilters is determined by the diagonal matrices, and the length of each subfilter is N/L. Since each row of matrix A has only one "1," it does not increase the number of adders used. The number of required adders is determined by the preprocessing matrix, the postprocessing matrix, and the adders used in the subfilters, and is given by

(11)

where the adder-count function gives the minimum number of adders needed to implement a matrix; an example matrix and its adder count are given in

(13)

The number of required delay elements is counted as the delay elements on the input side plus the ones used in the subfilters, and is given by

(12)

In the following example, the numbers of multiplications and additions required to implement a 24-tap filter as a 6-parallel filter are calculated with (10) and (11), respectively. The calculation is performed for both iteration orders

(14)

(15)

Note that the number of required additions depends on the order of iteration; the 3 × 3 convolution is iterated ahead of the 2 × 2 convolution, as this leads to the lowest adder complexity.
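The multiplication count of (10) can be checked numerically. The formula below (number of subfilters, i.e., the product of the short-convolution multiplication counts, times the subfilter length N/L) is inferred from the surrounding text; it reproduces the 72- and 60-multiplication figures quoted for the 24-tap, 6-parallel example:

```python
from math import prod

def isc_multiplications(N, factors, mults_per_conv):
    # N: filter length; factors: short-convolution sizes n1..nr;
    # mults_per_conv: multiplications M_i of each short convolution
    # (the size of its diagonal matrix H), e.g. 3 for the fast 2x2
    # convolution, 6 or 5 for a 3x3 convolution.
    L = prod(factors)                    # level of parallelism L = n1*...*nr
    num_subfilters = prod(mults_per_conv)
    subfilter_length = N // L            # each subfilter has N/L taps
    return num_subfilters * subfilter_length

# 24-tap filter, 6-parallel (one 2x2 and one 3x3 short convolution):
print(isc_multiplications(24, [2, 3], [3, 6]))  # 72 with a 6-mult 3x3 algorithm
print(isc_multiplications(24, [2, 3], [3, 5]))  # 60 with the 5-mult 3x3 of Section V
```

The 12-multiplication saving between the two calls matches the saving reported for the new 3 × 3 structure in Section V.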
(16)
V. EFFICIENT SHORT CONVOLUTION DEFINITION

The computational efficiency of the proposed parallel FIR filter structures depends on that of the selected short convolution algorithms. Efficient short convolution algorithms, which lead to highly efficient parallel FIR filter structures, are defined in this section.

A. 3 × 3 Short Convolution

A Winograd convolution algorithm for the 3 × 3 convolution can be derived as (17) by a suitable choice of interpolation points

(17)

Therefore, we can get a new structure (18) for the computation of (16)

(18)

In (18), the elements of the preprocessing and postprocessing matrices are extended, and most of them lie in the set of 0 and signed powers of two. Since multiplying a datum by a power of two simply left-shifts it, no hardware cost is incurred, while the low complexity of the structure is maintained. For example, Q can be implemented as shown in Fig. 4.

Fig. 4. Hardware implementation of Q.

After sharing some additions, we can implement the preprocessing and postprocessing matrices with 9 and 7 additions, respectively. Furthermore, (18) requires 5 multiplications. Using this new structure, (15) can be computed using 60 multiplications and 106 additions.

Therefore, for a 24-tap filter implemented as a 6-parallel filter, the new structure saves 12 multiplications while using the same number of additions; (18) is more computationally efficient than (14).

B. 4 × 4 Short Convolution

A 4 × 4 fast convolution algorithm can be constructed by inspection as

(19)
Fig. 5. Hardware implementation for the postprocessing matrix of a 6-parallel FIR filter.
The corresponding 4-parallel filter algorithm is given by

(20)

After sharing some additions, we can implement the preprocessing and postprocessing matrices using 11 and 10 additions, respectively. When we use (19) as a short convolution for the proposed ISC-based parallel filter, it is more efficient than iterating two 2 × 2 convolutions, since (20) uses one less adder than (9). Therefore, when the level of parallelism has 4 as one of its factors, we use structure (20).

C. Another Efficient 4 × 4 Short Convolution

A Winograd convolution algorithm for the 4 × 4 convolution can be derived as

(21)
Fig. 6. FIR filter structure for subfilters.
Fig. 7. Four-input carry-save adder.
The corresponding 4-parallel filter algorithm is

(22)

After sharing some additions, we can implement the preprocessing and postprocessing matrices with 15 and 13 additions, respectively. Note that (22) requires eight multiplications.

VI. HARDWARE IMPLEMENTATION

With the proposed parallel FIR filter algorithm, the preprocessing, postprocessing, and subfilter matrices are easily calculated with Matlab. Matlab is then used to automatically generate Verilog code for the hardware implementation of the algorithm. This automation is very efficient when the filter coefficients, word length, or level of parallelism change, especially when the length of the FIR filter is large.

A. Implementation of Preprocessing and Postprocessing Matrices

The preprocessing and postprocessing matrices cannot be used directly to combine the inputs and outputs. This is because the number of adders they require is not minimal, and at large levels of parallelism the redundant adders would contribute a significant amount of hardware overhead. The preprocessing and postprocessing matrices represent the tensor product implementation, since they are the tensor products of the transposed postaddition and preaddition matrices, respectively. An implementation with the minimum number of adders is obtained by first implementing each factor matrix with the minimum number of adders, then sharing the obtained implementations among the nonzero elements of each column, and finally combining the outputs of each implementation according to the addition operations of each row.

As an example, the implementation of the postprocessing matrix of a 6-parallel FIR filter is shown in Fig. 5. Note that the permutation matrix is not accounted for, since it merely rearranges the inputs and delays and adds no adders to the preprocessing matrix. From Fig. 5, we can see that further optimization is possible; for example, some inputs pass through two consecutive subtraction operators.
This optimization is performed while the Verilog code is being automatically generated. Carry-save adders are used to accumulate consecutive additions; if these include subtraction operations, the corresponding input is inverted and additional "1"s are added to the carry bits of the resulting carry-save adders, the number of "1"s being determined by the number of subtraction operations.

B. Implementation of Subfilters

The subfilters are implemented with canonical signed digit (CSD) coefficient-based FIR structures. Given the coefficients of one subfilter, its CSD-coefficient FIR filter takes the form shown in Fig. 6, which is a transposed direct-form FIR filter [8]. The basic arithmetic operations are addition and CSD multiplication. Addition is implemented with carry-save adders, which convert the sum of many numbers into the sum of two numbers; for example, a 4-input carry-save adder can be implemented as shown in Fig. 7.

Before CSD multiplication, coefficient quantization is performed using a look-ahead maximum absolute difference (MAD) algorithm [5]. After the look-ahead MAD quantization of the subfilter coefficients, a CSD multiplier is automatically generated according to the nonzero bits of the quantization result. During CSD multiplication, Horner's rule [4] is used to improve computational accuracy. Carry-save adders accumulate all shifted numbers in each tap of the subfilters; the number of inputs of each carry-save adder equals the number of nonzero bits of the corresponding coefficient.

VII. ALGORITHM ANALYSIS

Compared with FFA-based fast parallel FIR filter structures, the ISC-based algorithm saves a large amount of hardware. The numbers of required multiplications, additions, and delay elements for a 144-tap filter are summarized for different levels of parallelism in Table I.
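The CSD recoding that fixes the number of carry-save-adder inputs per tap (one input per nonzero digit) can be sketched as follows. This is a generic CSD (non-adjacent form) recoding of an integer coefficient; the paper's look-ahead MAD quantization step [5] is not modeled here:

```python
def csd(x):
    # Canonical signed-digit (non-adjacent form) digits of x >= 0, LSB first.
    # Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    digits = []
    while x:
        if x & 1:
            d = 2 - (x & 3)   # +1 if x = 1 (mod 4), -1 if x = 3 (mod 4)
        else:
            d = 0
        digits.append(d)
        x = (x - d) >> 1      # x - d is always even here
    return digits

# Coefficient 7 = 8 - 1 needs only two nonzero digits (a 2-input
# carry-save accumulation) instead of three for plain binary 111.
print(csd(7))                        # [-1, 0, 0, 1]
print(sum(d != 0 for d in csd(7)))   # 2
```

Each nonzero digit corresponds to one shifted (and possibly negated) copy of the input, which is exactly what the carry-save adders in each subfilter tap accumulate.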
TABLE I NUMBER OF REQUIRED MULTIPLICATIONS (R.M.), ADDITIONS (R.A.) AND DELAY ELEMENTS (R.D.) FOR A 144-TAP FIR FILTER
TABLE II NUMBER OF REQUIRED MULTIPLICATIONS (R.M.), ADDITIONS (R.A.) AND DELAY ELEMENTS (R.D.) FOR A 576-TAP FIR FILTER
From Table I, we can see that the proposed ISC-based algorithm saves 11% to 42% of the multiplications and 11% to 49% of the delay elements used by the FFA-based structure. Although the ISC-based algorithm uses more additions than the FFA-based structure as the level of parallelism increases, it leads to large savings in the numbers of multiplications and delay elements. For example, for a 72-parallel filter, the ISC-based algorithm requires 1292 more additions, but reduces the numbers of multiplications and delays by 594 and 617, respectively.

The advantage of the proposed structures becomes more obvious when the length of the FIR filter increases to 576 taps. The numbers of required multiplications, additions, and delay elements for a 576-tap filter are summarized for different levels of parallelism in Table II. In this case, there is no tradeoff between multiplication and addition: from Table II, the proposed ISC-based algorithm saves 17% to 42% of the multiplications, 17% to 44% of the delay elements, and 3% to 27% of the additions, compared with the FFA structure.

Note that the number of required additions depends on the order of iterations. The iteration order for short convolutions should be 4 × 4, then 3 × 3, then 2 × 2, as this leads to the lowest implementation cost; in the FFA-based algorithm, by contrast, the 2-parallel FFA is always applied first [4].

VIII. CONCLUSION

In this paper, a novel ISC-based fast parallel FIR filter algorithm has been presented. The algorithm is very efficient in reducing hardware cost, especially when the length of the FIR filter is large. Tensor products are used to express the iterated short convolution algorithm in matrix form. Since the preprocessing and postprocessing matrices are tensor products without delay elements, the presented ISC-based algorithm facilitates automatic hardware implementation of parallel FIR filters, which is very efficient when the filter coefficients, word length, or level of parallelism change, especially when the length of the FIR filter and the level of parallelism are large.

REFERENCES

[1] J. I. Acha, "Computational structures for fast implementation of L-path and L-block digital filters," IEEE Trans. Circuits Syst., vol. 36, pp. 805–812, June 1989.
[2] D. A. Parker and K. K. Parhi, "Low-area/power parallel FIR digital filter implementations," J. VLSI Signal Processing Syst., vol. 17, no. 1, pp. 75–92, 1997.
[3] I.-S. Lin and S. K. Mitra, "Overlapped block digital filtering," IEEE Trans. Circuits Syst. II, vol. 43, pp. 586–596, Aug. 1996.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[5] J. G. Chung and K. K. Parhi, "Frequency-spectrum-based low-area low-power parallel FIR filter design," EURASIP J. Appl. Signal Processing, vol. 2002, no. 9, pp. 444–453, 2002.
[6] Z.-J. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Trans. Signal Processing, vol. 39, pp. 1322–1332, June 1991.
[7] J. Granata, M. Conner, and R. Tolimieri, "A tensor product factorization of the linear convolution matrix," IEEE Trans. Circuits Syst., vol. 38, pp. 1364–1366, Nov. 1991.
[8] H. Lee, J. Chung, and G. Sobelman, "FPGA-based digit-serial CSD FIR filter for image signal format conversion," in Proc. Int. Conf. Signal Processing Applicat. Technol. (ICSPAT'98), Toronto, ON, Canada, Sept. 1998.
Chao Cheng (M'03) was born in Sichuan, China, in 1976. He received the B.E. degree from the China University of Geosciences, Wuhan, China, in 1998, and the M.E. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2001. In 2001, he joined VIA Technologies, Beijing, China, as a Digital Communication Engineer. His present research interest is in VLSI signal processing algorithms and their implementation.
Keshab K. Parhi (S’85–M’88–SM’91–F’96) received the B.Tech., M.S.E.E., and Ph.D. degrees from the Indian Institute of Technology, Kharagpur, India, the University of Pennsylvania, Philadelphia, and the University of California at Berkeley, in 1982, 1984, and 1988, respectively. Since 1988, he has been with the University of Minnesota, Minneapolis, where he is currently a Distinguished McKnight University Professor in the Department of Electrical and Computer Engineering. His research addresses VLSI architecture design and implementation of physical layer aspects of broadband communications systems. He is currently working on error-control coders and cryptography architectures, high-speed transceivers, ultra wideband systems, and quantum error-control coders and quantum cryptography. He has published over 350 papers, has authored the text book VLSI Digital Signal Processing Systems (New York: Wiley, 1999) and coedited the reference book Digital Signal Processing for Multimedia Systems (New York: Marcel Dekker, 1999). Dr. Parhi is the recipient of numerous awards including the 2003 IEEE Kiyo Tomiyasu Technical Field Award, the 2001 IEEE W.R.G. Baker prize paper award, and a Golden Jubilee award from the IEEE Circuits and Systems Society in 1999. He has served on Editorial Boards of IEEE TRANSACTIONS ON VLSI SYSTEMS, IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE SIGNAL PROCESSING LETTERS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II, currently serves on Editorial Boards of the IEEE Signal Processing Magazine and Journal of VLSI Signal Processing Systems, and is the current Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS, for 2004–2005. He served as Technical Program Cochair of the 1995 IEEE VLSI Signal Processing Workshop and the 1996 ASAP Conference and as the General Chair of the 2002 IEEE Workshop on Signal Processing Systems. 
He was a Distinguished Lecturer for the IEEE Circuits and Systems Society from 1997 to 1999.