Low-Cost Parallel FIR Filter Structures With 2-Stage Parallelism


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 54, NO. 2, FEBRUARY 2007

Chao Cheng, Student Member, IEEE, and Keshab K. Parhi, Fellow, IEEE

Abstract—Based on recently published low-complexity parallel finite-impulse response (FIR) filter structures, this paper proposes a new parallel FIR filter structure with lower hardware complexity. The subfilters in the previous parallel FIR structures are replaced by a second stage parallel FIR filter. The proposed 2-stage parallel FIR filter structures can efficiently reduce the number of required multiplications and additions at the expense of delay elements. For a 32-parallel 1152-tap FIR filter, the proposed structure can save 5184 multiplications (67%) and 2612 additions (30%) compared to previous parallel FIR structures, at the expense of 10089 additional delay elements (-133%). The proposed structures will lead to significant hardware savings because the hardware cost of a delay element is only a small portion of that of a multiplier, even without counting the savings in the number of additions.

Index Terms—Fast convolution, iterated short convolution (ISC), parallel finite-impulse response (FIR), VLSI.

Manuscript received February 8, 2006; revised June 20, 2006. This work was supported in part by National Science Foundation Grant CCF-0429979. This paper was recommended by Associate Editor J. R. Chen. The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCSI.2006.885976

I. INTRODUCTION

Fast parallel filter structures have been discussed in detail in [1]–[9]. Their basic idea is the same: first derive short-length fast parallel filters, and then cascade or iterate these short-length filters for long block sizes. Their starting points, however, differ. The designs in [1]–[4] are based on polyphase decomposition, where additional delay elements are integrated into the post-addition matrix. These designs require a large number of delay elements and are irregular when block sizes are large. In contrast, the matrix form of linear convolution is used in [5]–[8]. In these structures, the delay elements are regularly placed and the fast linear convolution algorithm can be used to reduce the hardware cost, especially the number of multiplications. Recently, an iterated short-convolution algorithm (ISCA) was proposed for long block size linear convolution and was used to implement the fast parallel finite-impulse response (FIR) filter in [5]. This approach leads to a large amount of hardware savings compared to previous designs.

According to the ISCA in [5], a long linear convolution can first be decomposed into short convolutions, which can be computed by the Cook–Toom or Winograd algorithm [3], [10]–[12], and then combined by (1) to get the linear convolution

(1)

where $\otimes$ is the tensor product and the remaining terms are defined in [5]. An $L$-parallel $N$-tap FIR filter based on iterated short convolutions can be expressed as

(2)

where the terms of (2) include the inputs and the subfilters containing the coefficients.
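The idea of building a long convolution from iterated short convolutions can be illustrated with a small sketch. It assumes the well-known 2 × 2 short convolution that uses three multiplications instead of four and iterates it recursively; this is only an illustration of the principle, not the matrix form of (1), and the function names `fast_conv` and `direct_conv` are hypothetical.

```python
def direct_conv(a, b):
    """Reference linear convolution of two equal-length sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def fast_conv(a, b, counter):
    """Linear convolution by iterating a 3-multiplication 2 x 2 short convolution.

    Assumes len(a) == len(b) and that the length is a power of two.
    counter[0] is incremented once per scalar multiplication.
    """
    n = len(a)
    if n == 1:
        counter[0] += 1
        return [a[0] * b[0]]
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    # Three half-length convolutions replace the four of the direct split.
    p0 = fast_conv(a0, b0, counter)
    p2 = fast_conv(a1, b1, counter)
    p1 = fast_conv([x + y for x, y in zip(a0, a1)],
                   [x + y for x, y in zip(b0, b1)], counter)
    mid = [m - x - y for m, x, y in zip(p1, p0, p2)]
    out = [0] * (2 * n - 1)
    for i, v in enumerate(p0):
        out[i] += v
    for i, v in enumerate(mid):
        out[i + h] += v
    for i, v in enumerate(p2):
        out[i + 2 * h] += v
    return out


count = [0]
result = fast_conv([1, 2, 3, 4], [5, 6, 7, 8], count)
```

For a 4 × 4 convolution the sketch performs 9 scalar multiplications instead of 16, at the price of extra additions before and after the multiplications.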

When $L$ is large, this ISCA-based parallel filter involves many subfilters, which require a large number of multiplication operations but share the same hardware structure. Designing an efficient core to share the computation of all these subfilters in different time slots can reduce the hardware cost. This idea has been applied in [8], where the computation of the subfilters is shared by a linear convolution processing core, but the resulting structure is irregular in some cases. We will show that the structure in [8] is a special case of the parallel FIR filter proposed in this paper.

If we assume that an $L$-parallel $N$-tap FIR filter contains $M$ subfilters of length $N/L$, then previous ISCA-based structures can process $LM$ input samples in $M$ clock cycles with all the subfilters working simultaneously. $M$ is also the number of output samples from the preprocessing matrix in (2) when $L$ input samples are input in each clock cycle, and thus the $M$ subfilters will process $M^2$ intermediate data, corresponding to the outputs of the preprocessing matrix, in $M$ clock cycles. If we can finish the computation of these $M^2$ intermediate data in $M$ clock cycles by using one $(N/L)$-tap FIR filtering core, this core must be able to process $M$ data in one clock cycle, i.e., the core must implement an $M$-parallel $(N/L)$-tap FIR subfilter. The hardware complexity of this $M$-parallel FIR subfilter is far less than that of the $M$ subfilters, especially for high parallelism levels and long filters. This is the basic idea of the proposed 2-stage parallel FIR filter structures.

This paper is organized as follows. In Section II, we illustrate the proposed new parallel FIR filter structures by examples of the parallel implementation of a 36-tap FIR filter. We first analyze its implementation with previous parallel FIR filter structures, and then present the proposed parallel FIR filter structure. The second stage parallel filtering core, the delay element matrix (DEM) for rearrangement of the input-output data flow, and the timing of the proposed designs are also discussed in Section II. Direct implementation of the proposed parallel FIR structures in Section II requires too many delay elements when the desired parallelism level is high. We therefore improve the proposed structures in Section III. In Section IV, we discuss the computational complexity of the proposed structures. A clear reduction in hardware cost is shown by the comparison between the new design and the previous one in Section V.

II. PROPOSED 2-STAGE PARALLEL FIR FILTER STRUCTURES (METHOD-1)

We illustrate the proposed new parallel FIR filter structure by an example.

1) Example 1: Consider the implementation of a 3-parallel 36-tap FIR filter.

A. Previous Design

According to (2), a 3-parallel 36-tap FIR filter can be represented as

(3)

Fig. 1. Implementation of 3-parallel FIR filter.
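As an illustration of how a 3-parallel filter can be built from six subfilters of one third the length, the following sketch uses the textbook 3-parallel fast FIR formulation (see [3]) on polyphase components. It is an assumption that this formulation is an acceptable stand-in here: its six subfilters and its pre/post-addition networks need not coincide with the matrix form of (3). All function names are hypothetical.

```python
def conv(a, b):
    """Plain linear convolution (one subfilter acting on one data stream)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b, sign=1):
    """Element-wise a + sign * b with zero padding to the longer length."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [u + sign * v for u, v in zip(a, b)]


def delay(a):
    """One block delay (three sample periods of the original sequence)."""
    return [0] + a


def parallel3_fir(h, x):
    """3-parallel FIR filtering with six subfilters of length len(h) / 3."""
    assert len(h) % 3 == 0 and len(x) % 3 == 0
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    # Six subfilter outputs (nine would be needed without the fast algorithm).
    p0, p1, p2 = conv(h0, x0), conv(h1, x1), conv(h2, x2)
    p01 = conv(add(h0, h1), add(x0, x1))
    p12 = conv(add(h1, h2), add(x1, x2))
    p012 = conv(add(add(h0, h1), h2), add(add(x0, x1), x2))
    # Post-additions that recover the three output phases.
    a = add(p0, delay(p2), -1)
    b = add(p12, p1, -1)
    c = add(p01, p1, -1)
    y0 = add(a, delay(b))
    y1 = add(c, a, -1)
    y2 = add(add(p012, c, -1), b, -1)
    n = max(len(y0), len(y1), len(y2))
    y = []
    for i in range(n):
        for yk in (y0, y1, y2):
            y.append(yk[i] if i < len(yk) else 0)
    return y


h = list(range(1, 37))                      # a 36-tap filter: six 12-tap subfilters
x = [(7 * i) % 11 - 5 for i in range(12)]   # arbitrary test input
y_parallel = parallel3_fir(h, x)
y_direct = conv(h, x)
```

The interleaved output agrees with the direct convolution, and each of the six calls to `conv` with a 12-tap argument corresponds to one 12-tap subfilter.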

The filter in (3) can be implemented as shown in Fig. 1. In Fig. 1, the six subfilters are all 12-tap subfilters and require a total of 72 multiplications, 66 additions, and 66 delay elements. They have the same hardware structure and thus can be implemented, in consecutive time slots, with the same structure.

B. Proposed New Design

A 3-parallel 36-tap FIR filter contains six 12-tap subfilters. Previous designs can process 18 input data in 6 clock cycles with all 6 subfilters working simultaneously. Six output data of the preprocessing matrix are generated when 3 input data are input in each clock cycle. Each of these 6 output data is processed by one of the 6 subfilters. Therefore, in 6 clock cycles, 18 data will enter the preprocessing matrix and the 6 subfilters will process the generated 36 output data of the preprocessing matrix, with each subfilter processing 6 data. If we can use a 12-tap FIR subfilter processing core to process, in one clock cycle, the 6 data that would enter one subfilter in a row, then in 6 consecutive clock cycles we can finish the 36 output data of the preprocessing matrix and maintain the same processing speed. The FIR processing core, which can process 6 data in one clock cycle, is actually a 6-parallel FIR filter. The hardware cost of a 6-parallel 12-tap FIR filter is less than that of six 12-tap subfilters. This is the basic idea of the proposed 2-stage parallel FIR structures. The hardware efficiency of the proposed parallel FIR filter structure will be shown when we discuss the implementation of this 6-parallel 12-tap FIR subfilter structure.

The proposed 2-stage parallel FIR filter structure for a 3-parallel 36-tap FIR filter is shown in Fig. 2. We will illustrate the function of each module in the rest of this section.

Fig. 2. Proposed parallel FIR filter structure for a 3-parallel 36-tap FIR filter.

C. Preloading 6 × 6 DEM

The 6 × 6 DEM is shown in Fig. 3. The data flow of the DEM in Fig. 3(a) is "horizontal in, vertical out" or "vertical in, horizontal out" and is controlled by the C0, C1, and C2 signals. The C0 signal controls whether the data are "horizontal in" or "vertical in." The C1 signal controls whether the data are "horizontal out" or "vertical out." The C2 signal controls whether the data flow horizontally or vertically in the DEM. The data flow is illustrated in Figs. 4 and 5.

Fig. 3. (a) 6 × 6 DEM. (b) Delay element function.

In Fig. 4, the data in the preloading DEM are the outputs of the preprocessing matrix in Fig. 2. Data with the same time index enter the DEM in the same clock cycle, while data with the same subfilter index will be processed by the same subfilter. In previous parallel FIR structures, the data with the same time index are processed simultaneously, one each, by 6 independent subfilters. However, the proposed new parallel FIR structures process the data with the same subfilter index by a shared filtering core in one clock cycle. Both design structures process 18 data in 6 clock cycles, leading to an effective 3-parallel processing. Fig. 5 shows the data flow in the preloading DEM when time ranges from 6 to 11. When time is 12, the pattern of data flow will return to that of Fig. 4. Every 6 clock cycles, the pattern of data flow will switch between Figs. 4 and 5. We next discuss the implementation of the shared filtering core.
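The "horizontal in, vertical out" behaviour of the DEM described in Section II-C can be sketched with a small behavioural model. This is a simplification under stated assumptions: the three control signals C0, C1, and C2 are collapsed into a single mode that toggles every $m$ clock cycles, and the shift directions and the order in which elements leave the array are chosen for illustration and are not taken from Fig. 3. The function name `simulate_dem` is hypothetical.

```python
def simulate_dem(vectors, m):
    """Behavioural model of an m x m delay-element matrix (corner turner).

    vectors: one length-m input vector per clock cycle.
    Returns the length-m vector leaving the array in each clock cycle
    (entries are None until the array has filled).
    """
    grid = [[None] * m for _ in range(m)]
    outputs = []
    for t, vec in enumerate(vectors):
        horizontal = (t // m) % 2 == 0   # mode toggles every m cycles
        if horizontal:
            # Last column leaves, data shift one column, new column enters.
            outputs.append([grid[r][m - 1] for r in range(m)])
            for r in range(m):
                grid[r] = [vec[r]] + grid[r][:m - 1]
        else:
            # Last row leaves, data shift one row, new row enters.
            outputs.append(list(grid[m - 1]))
            grid = [list(vec)] + grid[:m - 1]
    return outputs


# Element r of the vector entering at cycle t is encoded as 10 * t + r.
m = 3
vectors = [[10 * t + r for r in range(m)] for t in range(9)]
outs = simulate_dem(vectors, m)
```

In this model, every vector leaving the array after the first $m$ cycles collects the same element position from $m$ consecutive input vectors, which is the regrouping the shared filtering core needs, and a new group of $m$ vectors is loaded while the previous group is read out.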

Fig. 4. Preloading 6 × 6 DEM when (a) k = 0 and i = 0 (i.e., t = 0), and (b) k = 0 and i = 5 (i.e., t = 5).

Fig. 5. Preloading 6 × 6 DEM when (a) k = 6 and i = 0 (i.e., t = 6), and (b) k = 6 and i = 5 (i.e., t = 11).

D. 6-Parallel FIR Subfilter as a Shared Filtering Core

An ISCA-based [5] 6-parallel FIR filter is described by

(4)

where the subfilter terms in (4) are derived from one of the 6 subfilters in (3).

The VLSI structure of a 6-parallel FIR subfilter as a shared filtering core is shown in Fig. 6. This 6-parallel FIR subfilter is derived from the ISCA-based 6-parallel FIR filter by replacing each delay element $z^{-1}$ with $z^{-6}$, because this 6-parallel FIR subfilter will be shared by the 6 subfilters in (3). Therefore, the hardware cost of this 6-parallel FIR subfilter is the same as that of the ISCA-based 6-parallel FIR filter except for the 6-fold increase in the number of delay elements. The subfilter length of the proposed 6-parallel 12-tap FIR subfilter is $12/6 = 2$, and its preprocessing and post-processing matrices require 52 additions.

Fig. 6. (a) 6-parallel FIR subfilter as a shared filtering core. (b) Block diagram of (a).

From Figs. 1 and 2, we can see that the preprocessing and post-processing computations of the previous and the proposed designs are exactly the same; the differences lie in the subfilter part. In the previous structure shown in (3), the total numbers of multiplications, additions, and delay elements for the subfilters are 72, 66, and 66, respectively. Therefore, the proposed 3-parallel 36-tap FIR filter can save 36 multiplications at the cost of 4 additions and additional delay elements. A more detailed comparison of hardware savings will be shown in Section IV.

E. Post-Loading 6 × 6 DEM

The post-loading 6 × 6 DEM has the same structure as shown in Fig. 3 and works the same way as the preloading DEM. The only difference is that the data in Figs. 7 and 8 with the same subfilter index enter the post-loading DEM together, and those with the same time index will be processed by the post-processing matrix at the same time. In Fig. 7(a), the first six data enter the post-loading DEM when t = 12, because of the latency of 12 clock cycles, which will be shown in the timing analysis. Fig. 7 also shows the data flow in the post-loading DEM when time ranges from 12 to 17. When time is 18, the pattern of data flow will switch to that of Fig. 8. Every 6 clock cycles, the pattern of data flow will switch between Figs. 7 and 8.

F. Timing Analysis of Proposed Parallel FIR Filter

Timing of the proposed 3-parallel 36-tap FIR filter is shown in Fig. 9 to illustrate how the proposed design works. From Fig. 9, we can see that there is a latency of 12 clock cycles, which is two times the number of subfilters in the first stage parallelism of the proposed 3-parallel 36-tap FIR filter.

G. Generalization of Proposed Parallel FIR Filter Structures (Method-1)

The proposed structures for a given $L$-parallel $N$-tap FIR filter can be generalized as follows.

1) Form an ISCA-based $L$-parallel FIR filter by (2).
2) Replace its subfilters with a second stage $M$-parallel FIR subfilter, where $M$ is the number of subfilters involved in the first stage $L$-parallel implementation, and two DEMs of size $M \times M$ each to arrange the input and output of the $M$-parallel FIR subfilter.
3) Implement the $M$-parallel FIR subfilter by first forming an ISCA-based $M$-parallel FIR filter from (2), and then replacing each delay element $z^{-1}$ with $z^{-M}$.

Fig. 7. Post-loading 6 × 6 DEM when (a) k = 12 and i = 0 (i.e., t = 12), and (b) k = 12 and i = 5 (i.e., t = 17).

Fig. 8. Post-loading 6 × 6 DEM when (a) k = 18 and i = 0 (i.e., t = 18), and (b) k = 18 and i = 5 (i.e., t = 23).

III. PROPOSED 2-STAGE PARALLEL FIR FILTER STRUCTURES (METHOD-2)

Direct application of the proposed parallel FIR filter structures will have problems when $L$ is large, because the number of subfilters of the first stage $L$-parallel implementation, $M$, will increase dramatically. In this case, the additions required for the preprocessing and post-processing matrices of the second stage $M$-parallel FIR subfilter will dominate the total number of required additions and make it large. Furthermore, a large $M$ will also lead to a dramatic increase in the number of required delay elements, because of the replacement of $z^{-1}$ with $z^{-M}$ in the implementation of the second stage $M$-parallel FIR subfilter and because of the two DEMs of size $M \times M$. Finally, the latency of the design will also be long, since the computation latency is $2M$ clock cycles.

From the above analysis, we can see that we must control the increase of $M$ when $L$ is large. We will give another example to illustrate how to control $M$.

1) Example 2: Consider the implementation of a 6-parallel 36-tap FIR filter.

We start the first stage parallel implementation with a parallelism of 3, as in Example 1, instead of 6. The first stage 3-parallel FIR filter has 6 subfilters ($M = 6$). If we can process 36 data in 6 clock cycles, we get an equivalent 6-parallel implementation. In this case, 36 input data will generate 72 output data of the preprocessing matrix, and the 6 subfilters of length 12 will process the generated 72 output data, with each subfilter processing 12 data. These 12 data will be processed by one of the 6 subfilters in one clock cycle. When a shared filtering core is designed, we therefore need a 12-parallel FIR subfilter of length 12. The proposed 2-stage parallel FIR filter structure for a 6-parallel 36-tap FIR filter with the first stage as $L = 3$ is shown in Fig. 10. As shown in the above analysis, the 36 data must enter the 6 × 6 preloading DEM in 6 clock cycles, but one 3-parallel preprocessing matrix can only process 18 inputs in 6 clock cycles. Therefore, two preprocessing matrices and two 6 × 6 preloading DEMs are used on the input side in Fig. 10. Likewise, two post-processing matrices and two 6 × 6 post-loading DEMs are used on the output side for the same reason.
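The sizing argument used in Examples 1 and 2 is simple bookkeeping, sketched below. The function name `two_stage_sizing` and its parameter names are hypothetical, and the number of first-stage subfilters is passed in rather than derived, since it depends on the short-convolution algorithm chosen for the first stage.

```python
def two_stage_sizing(parallelism, first_stage, num_subfilters, taps):
    """Bookkeeping for a 2-stage parallel FIR structure.

    parallelism:    desired overall parallelism level
    first_stage:    first stage parallelism (must divide `parallelism`)
    num_subfilters: subfilters in one first-stage parallel filter
    taps:           length of the overall FIR filter
    """
    if parallelism % first_stage != 0:
        raise ValueError("first stage parallelism must divide the overall level")
    blocks = parallelism // first_stage   # first-stage blocks working side by side
    m = num_subfilters
    return {
        "inputs_per_period": parallelism * m,        # inputs accepted in m cycles
        "intermediate_per_period": blocks * m * m,   # preprocessing outputs in m cycles
        "second_stage_parallelism": blocks * m,      # data the shared core takes per cycle
        "subfilter_taps": taps // first_stage,       # length of each first-stage subfilter
        "preloading_dems": blocks,
        "postloading_dems": blocks,
        "latency_cycles": 2 * m,
    }


example1 = two_stage_sizing(parallelism=3, first_stage=3, num_subfilters=6, taps=36)
example2 = two_stage_sizing(parallelism=6, first_stage=3, num_subfilters=6, taps=36)
```

With these inputs the sketch reproduces the figures quoted in the text: 18 inputs and 36 intermediate data per 6 cycles with a 6-parallel core for Example 1, and 36 inputs and 72 intermediate data per 6 cycles with a 12-parallel core and two DEMs on each side for Example 2.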

Fig. 9. Timing of the proposed 3-parallel 36-tap FIR filter.

Fig. 10. The proposed 2-stage parallel FIR filter structure for a 6-parallel 36-tap FIR filter with the first stage as L = 3.

The second stage 12-parallel FIR subfilter module also requires delay elements on its input side. Since its 54 subfilters are all 1-tap, the subfilters require 54 multiplications and no additions or delay elements. The first stage preprocessing and post-processing matrices require further additions, and the second stage preprocessing and post-processing matrices require 192 additions. Thus the proposed 6-parallel 36-tap FIR filter requires 54 multiplications, plus the additions of the preprocessing and post-processing matrices and the delay elements of the two stages and the DEMs.

Timing of the proposed 6-parallel 36-tap FIR filter is shown in Fig. 11. We can see that its latency is still 12 clock cycles.

Fig. 11. Timing of the proposed 6-parallel 12-tap FIR filter.

The improved implementation of the proposed parallel FIR filter structures (Method-2) can be generalized as follows.

1) Form an ISCA-based $L$-parallel FIR filter by (2), where $L$ is the first stage parallelism and divides the desired parallelism level.
2) Replace its subfilters with a second stage $l$-parallel FIR subfilter, where $l$ is $M$ times the ratio of the desired parallelism level to $L$ and $M$ is the number of subfilters involved in the first stage $L$-parallel implementation, and add the DEMs of size $M \times M$ needed to arrange the input and output of the $l$-parallel FIR subfilter.
3) Implement the $l$-parallel FIR subfilter by first forming an ISCA-based $l$-parallel FIR filter from (2), and then replacing each delay element $z^{-1}$ with $z^{-M}$.

For example, in Example 2, we have $L = 3$, $M = 6$, and $l = 12$.
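The effect of replacing each unit delay by an $M$-cycle delay, as in step 3), can be illustrated with a scalar analogy. The sketch below assumes a core that receives one sample per clock cycle from $M$ interleaved streams and selects the coefficient set of the stream being served in each time slot; the core in the paper instead processes a whole block of data per cycle, so this shows only the time-sharing principle. The names `fir` and `shared_core` are hypothetical.

```python
def fir(h, x):
    """Direct-form FIR filter: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def shared_core(coeff_sets, interleaved):
    """One FIR core shared by M streams in consecutive time slots.

    Every unit delay of the FIR filter is replaced by an M-cycle delay, so
    the sample at time t only ever meets earlier samples of the same stream
    (t % M). The coefficient set is switched with the time slot.
    """
    m = len(coeff_sets)
    out = []
    for t in range(len(interleaved)):
        h = coeff_sets[t % m]
        acc = 0
        for k, hk in enumerate(h):
            if t - k * m >= 0:
                acc += hk * interleaved[t - k * m]
        out.append(acc)
    return out


streams = [[1, 2, 3, 4, 5], [2, 0, -1, 3, 1], [5, 5, 5, 5, 5]]
coeff_sets = [[1, 2], [3, -1], [0, 4]]
interleaved = [streams[t % 3][t // 3] for t in range(15)]
shared_out = shared_core(coeff_sets, interleaved)
```

De-interleaving `shared_out` gives the same results as running three independent filters, which is why the arithmetic of the core is unchanged and only its delay elements grow $M$-fold.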

IV. COMPLEXITY COMPUTATION

For the proposed parallel $N$-tap FIR filter, where the parallelism level has only 2 and/or 3 as its prime factors, the number of subfilters $M$ of its first stage $L$-parallel structure can be given as

(5)

in terms of the number of short convolutions used and the number of multiplications used in each short convolution, which is determined by the short-convolution algorithm chosen in (2). In this paper, the short convolutions are fixed to simplify the hardware structure. The second stage parallelism $l$ then also has only 2 and/or 3 as its prime factors and can be further decomposed into short convolutions. The resulting number of subfilters of the second stage $l$-parallel FIR subfilter is also the total number of subfilters of the proposed parallel $N$-tap FIR filter. Therefore, the total number of required multiplications, which is the product of this number of subfilters and the final subfilter length, is given by

(6)

where the quantities involved are defined in (5).

The number of required additions is made up of three parts.

1) Additions required for the first stage $L$-parallel preprocessing and post-processing matrices

(7)

where the function in (7) gives the minimum number of adders needed to compute a product with the given preprocessing or post-processing matrix.

2) Additions required for the second stage $l$-parallel preprocessing and post-processing matrices

(8)

3) Additions required for the subfilters in the second stage $l$-parallel FIR filter.

Therefore, the total number of required additions is given by

(9)

where the first two terms are defined in (7) and (8), respectively.
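As a sanity check on this kind of counting, the subfilter figures quoted for Example 1 can be reproduced with a few lines. The sketch assumes direct-form subfilters (a $t$-tap subfilter uses $t$ multiplications, $t - 1$ additions, and $t - 1$ delay elements) and assumes that the 6-parallel core is built from a 2 × 2 short convolution with 3 multiplications and a 3 × 3 short convolution with 6 multiplications, giving 18 subfilters; that factorization is an assumption chosen because it agrees with the savings stated in the text. The 52 pre/post-processing additions are taken from Section II-D. Delay elements of the core are left out because they also depend on the $z^{-6}$ replacement and the DEMs.

```python
def subfilter_bank_cost(count, taps):
    """Cost of `count` independent direct-form FIR subfilters of `taps` taps."""
    return {
        "mult": count * taps,
        "add": count * (taps - 1),
        "delay": count * (taps - 1),
    }


# Previous design: six 12-tap subfilters working side by side.
previous = subfilter_bank_cost(count=6, taps=12)

# Proposed design: one shared 6-parallel 12-tap core.
CORE_SUBFILTERS = 3 * 6   # assumption: 2x2 (3 mult) combined with 3x3 (6 mult)
CORE_TAPS = 12 // 6       # final subfilter length
PRE_POST_ADDS = 52        # additions of the core's pre/post-processing (from the text)

core = subfilter_bank_cost(CORE_SUBFILTERS, CORE_TAPS)
proposed_mult = core["mult"]
proposed_add = core["add"] + PRE_POST_ADDS

saved_mult = previous["mult"] - proposed_mult
extra_add = proposed_add - previous["add"]
```

Under these assumptions the sketch gives 72 multiplications and 66 additions for the previous subfilters, and a saving of 36 multiplications for 4 extra additions, matching the numbers stated for Example 1.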


TABLE I COMPARISON OF THE NUMBER OF REQUIRED MULTIPLICATIONS (R.M.), ADDITIONS (R.A.) AND DELAY ELEMENTS (R.D.) FOR A 144-TAP FIR FILTER WITH THE FIRST STAGE PARALLELISM L = 4. (THE SECOND STAGE PARALLELISM IS l)

TABLE II COMPARISON OF THE NUMBER OF REQUIRED MULTIPLICATIONS (R.M.), ADDITIONS (R.A.) AND DELAY ELEMENTS (R.D.) FOR A 1152-TAP FIR FILTER WITH THE FIRST STAGE PARALLELISM L = 4. (THE SECOND STAGE PARALLELISM IS l)

The number of required delay elements is made up of four parts.

1) Delay elements on the input side of the first stage $L$-parallel FIR filter.
2) Delay elements on the input side of the second stage $l$-parallel FIR filter.
3) Delay elements used in the two DEMs.
4) Delay elements required for the subfilters in the second stage $l$-parallel FIR filter.

Therefore, the total number of required delay elements is given by

(10)

Note that when the first stage parallelism equals the desired parallelism level, we get the direct implementation of the proposed 2-stage parallel FIR filter structures (Method-1).

V. COMPARISON AND ANALYSIS

Compared with previous ISCA-based fast parallel FIR filter structures, the proposed ones can save a large amount of hardware cost. The numbers of required multiplications, additions, and delay elements for 144-tap and 1152-tap filters are summarized for different levels of parallelism in Tables I–III.

In Table I, around 60% of the multiplications of ISCA-based parallel FIR structures are saved when the parallelism level of a 144-tap FIR filter varies from 4 to 16. Although the proposed design saves fewer additions, and can even cost more additions, as the parallelism level increases, the number of additional delay elements decreases. The cost of the additional delay elements and additions is small compared to the large savings in multiplications. This is because the hardware cost of an addition is usually several times that of a D flip-flop and the hardware cost of a multiplication is usually several times that of an addition.

From Table II, we can see that the proposed 2-stage parallel FIR structures with $L = 4$ can save 75% to 79% of the multiplications and 4.3% to 58% of the additions that the ISCA-based structure uses when the parallelism level of a 1152-tap FIR filter varies from 4 to 32. Although the proposed 2-stage parallel FIR structures use more delay elements than ISCA-based structures, the area paid for the additional delay elements is still small compared to the large number of saved multiplications and additions.

Increasing the first stage parallelism will lead to more savings in required multiplications. Furthermore, the increase in required delay elements can be kept within a reasonable range. From Table III, we can see that the proposed design (Method-2) with $L = 8$ can save 6156 (79%) multiplications and 373 (4.3%) additions, compared to those required by ISCA-based parallel FIR structures, at the cost of 12285 delay elements, which is just two times the number of saved multiplications. This will lead to a large amount of overall hardware savings.

TABLE III COMPARISON OF THE NUMBER OF REQUIRED MULTIPLICATIONS (R.M.), ADDITIONS (R.A.) AND DELAY ELEMENTS (R.D.) FOR A 1152-TAP FIR FILTER WITH THE FIRST STAGE PARALLELISM L = 8. (THE SECOND STAGE PARALLELISM IS l)

From Tables I–III, we can also see that the proposed 2-stage parallel FIR design can on average save 1 multiplier at the cost of only 1 to 6 delay elements, not considering the large number of saved adders. The hardware cost of 6 delay elements is much smaller than that of a multiplier.

Although direct implementation of the 2-stage parallel FIR structures (Method-1) requires a large number of delay elements, an interesting property of the 2-stage parallel FIR structure (Method-1) is that the number of required multiplications is always less than or equal to the filter length and does not increase as the parallelism level increases. This is shown in the first rows of Tables I–III. The property can be checked for the proposed FIR structure by applying (6); the authors have done extensive computations, which confirm that increasing the filter length and/or the parallelism level does not change it. The serial processing of an FIR filter also requires $N$ multiplications, which shows that the proposed 2-stage parallel FIR filter can reduce the number of required multiplications to a very low level.

Note that the proposed parallel FIR structure has a latency of $2M$ clock cycles, where $M$ is defined in (5). The latency is caused by the preloading and post-loading DEMs, as shown in Example 1 and Example 2. The latency of the 144-tap and 1152-tap filters at different levels of parallelism is also shown in the last column of Tables I–III. We can see that the latency is decided only by the first stage parallelism $L$, and that the proposed parallel FIR structure (Method-2) efficiently controls the increase of latency when the parallelism level increases.
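The trade-off between saved multipliers and added delay elements discussed above can be made concrete with a short calculation on the figures quoted in the abstract and in the Table III discussion. The relative cost weights in the sketch are purely hypothetical placeholders standing in for "several times"; they are not taken from the paper and should be replaced by numbers for the target technology. The function names are also hypothetical.

```python
def delays_per_saved_multiplier(saved_mult, extra_delay):
    """How many extra delay elements are paid for each multiplier saved."""
    return extra_delay / saved_mult


def net_saving(saved_mult, saved_add, extra_delay,
               mult_cost, add_cost, delay_cost=1.0):
    """Net hardware saving in units of one delay element (weights are inputs)."""
    return saved_mult * mult_cost + saved_add * add_cost - extra_delay * delay_cost


# Figures quoted in the text.
abstract_case = {"saved_mult": 5184, "saved_add": 2612, "extra_delay": 10089}
table3_case = {"saved_mult": 6156, "saved_add": 373, "extra_delay": 12285}

ratio_abstract = delays_per_saved_multiplier(5184, 10089)
ratio_table3 = delays_per_saved_multiplier(6156, 12285)

# HYPOTHETICAL weights: an adder costs 5 delay elements, a multiplier 25.
HYP_ADD_COST = 5.0
HYP_MULT_COST = 25.0
net_abstract = net_saving(mult_cost=HYP_MULT_COST, add_cost=HYP_ADD_COST, **abstract_case)
net_table3 = net_saving(mult_cost=HYP_MULT_COST, add_cost=HYP_ADD_COST, **table3_case)
```

Both quoted cases pay roughly two delay elements per saved multiplier, so the net saving stays positive for any multiplier cost above about two delay elements, even before the saved adders are counted.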
Note that the ISCA-based parallel FIR designs have no latency: output data are generated in the same clock cycle in which the corresponding input data are injected.

In [8], a linear convolution-based processing core is used as a parallel FIR filter for the processing of subfilters. It requires the number of subfilters to be equal to the subfilter length. This requirement leads to irregular preprocessing and post-processing DEMs with low utilization efficiency and complex control signals. However, when the processing of subfilters is assigned to a parallel filter design, this restriction does not exist. Even if the filter length is not divisible by the parallelism level, zeros can be appended to the filter coefficients to make the total length divisible by the parallelism level. The computation results will be the same as those of the original filter. The linear convolution-based processing core in [8] can actually be seen as a special case of parallel FIR implementation in which the parallelism level is equal to the filter length. For example, the comparison results in [8, Table V.1] for the 16-parallel FIR filter are very close to those of the proposed 16-parallel FIR filter in Table I.

One problem with the large increase in required delay elements is the increase in switching activity, which will cause large power consumption. However, the power saved by removing multipliers and adders can potentially compensate for this power consumption.

VI. CONCLUSION

A parallel FIR filter has many subfilters of the same length and structure, which account for most of the overall hardware cost of the parallel FIR filter. This paper utilized these features and developed a second stage parallel FIR filter as a shared processing core to reduce the hardware cost of the overall parallel FIR filter design. The proposed 2-stage parallel FIR filter structures can efficiently reduce the number of required multiplications and additions at the cost of delay elements. The proposed parallel FIR structures also have regular structures and simple control signals, which facilitate their VLSI implementation.

REFERENCES

[1] D. A. Parker and K. K. Parhi, “Low-area/power parallel FIR digital filter implementations,” J. VLSI Signal Process. Syst., vol. 17, no. 1, pp. 75–92, Sep. 1997.
[2] J. G. Chung and K. K. Parhi, “Frequency-spectrum-based low-area low-power parallel FIR filter design,” EURASIP J. Appl. Signal Process., vol. 2002, no. 9, pp. 444–453, 2002.
[3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[4] Z.-J. Mou and P. Duhamel, “Short-length FIR filters and their use in fast nonrecursive filtering,” IEEE Trans. Signal Process., vol. 39, no. 6, pp. 1322–1332, Jun. 1991.
[5] C. Cheng and K. K. Parhi, “Hardware efficient fast parallel FIR filter structures based on iterated short convolution,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004.
[6] J. I. Acha, “Computational structures for fast implementation of L-path and L-block digital filters,” IEEE Trans. Circuits Syst., vol. 36, no. 6, pp. 805–812, Jun. 1989.
[7] I.-S. Lin and S. K. Mitra, “Overlapped block digital filtering,” IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process., vol. 43, pp. 586–596, Aug. 1996.
[8] C. Cheng and K. K. Parhi, “Further complexity reduction of parallel FIR filters,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS 2005), Kobe, Japan, May 2005.
[9] R. C. Agarwal and J. W. Cooley, “New algorithms for digital convolution,” IEEE Trans. Acoust. Speech, Signal Process., vol. ASSP-25, no. 5, pp. 392–410, Oct. 1977.
[10] S. Winograd, “Some bilinear forms whose multiplicative complexity depends on the field of constants,” Math. Syst. Theory, vol. 10, pp. 169–180, 1977.
[11] R. E. Blahut, Fast Algorithms for Digital Signal Processing. Reading, MA: Addison-Wesley, 1985.
[12] H. J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms. Berlin, Heidelberg, New York: Springer-Verlag, 1982.

Chao Cheng (M’03–S’04) received the M.S.E.E. degree from Huazhong University of Science and Technology, Wuhan, China, in 2001. He is currently working toward the Ph.D. degree at the University of Minnesota, Twin Cities, MN. He also has three years of industrial experience as a Digital Communication Engineer at VIA Technologies. His present research interest is in VLSI digital signal processing algorithms and their implementation.

Keshab K. Parhi (S’85–M’88–SM’91–F’96) received the B.Tech., M.S.E.E., and Ph.D. degrees from the Indian Institute of Technology, Kharagpur, India, the University of Pennsylvania, Philadelphia, and the University of California at Berkeley, in 1982, 1984, and 1988, respectively. He has been with the University of Minnesota, Minneapolis, since 1988, where he is currently Distinguished McKnight University Professor in the Department of Electrical and Computer Engineering. His research addresses VLSI architecture design and implementation of physical layer aspects of broadband communications systems. He is currently working on error control coders and cryptography architectures, high-speed transceivers, and ultra wideband systems. He has published over 400 papers, has authored the text book VLSI Digital Signal Processing Systems (Wiley, 1999), and coedited the reference book Digital Signal Processing for Multimedia Systems (Marcel Dekker, 1999). Dr. Parhi is the recipient of numerous awards including the 2004 F.E. Terman award by the American Society of Engineering Education, the 2003 IEEE Kiyo Tomiyasu Technical Field Award, the 2001 IEEE W.R.G. Baker prize paper award, and a Golden Jubilee award from the IEEE Circuits and Systems Society in 1999. He has served on the editorial boards of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS —I: REGULAR PAPERS and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS —II: EXPRESS BRIEFS, VLSI Systems, Signal Processing, Signal Processing Letters, and Signal Processing Magazine, and currently serves as the Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS —I: REGULAR PAPERS (2004–2005 term), and serves on the Editorial Board of the Journal of VLSI Signal Processing. He has served as technical program cochair of the 1995 IEEE VLSI Signal Processing workshop and the 1996 ASAP conference, and as the general chair of the 2002 IEEE Workshop on Signal Processing Systems. 
He was a distinguished lecturer for the IEEE Circuits and Systems society during 1996–1998.