IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 54, NO. 4, APRIL 2007
Low-Cost Fast VLSI Algorithm for Discrete Fourier Transform
Chao Cheng, Student Member, IEEE, and Keshab K. Parhi, Fellow, IEEE
Abstract—A prime N-length discrete Fourier transform (DFT) can be reformulated into an (N-1)-length complex cyclic convolution and then implemented by systolic array or distributed arithmetic. In this paper, a recently proposed hardware-efficient fast cyclic convolution algorithm is combined with the symmetry properties of the DFT to obtain a new hardware-efficient fast algorithm for small-length DFTs, and the Winograd Fourier transform algorithm (WFTA) is then used to control the increase of the hardware cost when the transform length is large. Compared with previously proposed low-cost DFT and FFT algorithms with computational complexity of O(log₂ N), the new algorithm saves 30% to 50% of the multipliers on average and improves the average processing speed by a factor of 2 when the DFT length varies from 20 to 2040. Compared with previous prime-length DFT designs, the proposed design saves a large amount of hardware cost at the same processing speed when the transform length is long. Furthermore, the proposed design offers many more choices of applicable DFT transform lengths, and the processing speed can be flexibly balanced against the hardware cost.
Index Terms—Discrete Fourier transforms (DFTs), systolic array, VLSI, cyclic convolution.
I. INTRODUCTION

Systolic-array-based discrete Fourier transforms (DFTs) have been studied extensively in the past [1]-[8]. The input-output (I/O) cost in [1] and [3] increases with the DFT length, which causes problems for packaging and reliability. Although the DFT design in [2] can control the I/O cost, it requires too much hardware for multiplication and addition operations compared with the cyclic-convolution-based systolic array designs for DFT in [4]-[6].

The computation complexity of [1]-[4] is O(N), which means that, when the DFT length is large, the hardware cost increases dramatically. The DFT designs in [5], [6] remove the multiplication operations, but they require a large number of adders and RAM/ROM resources, which increase proportionally and/or exponentially with N. Furthermore, the number of required adders depends on the word length: when high-resolution computations are required, a larger number of adders is needed. In [7]-[9], the computational complexity has been reduced to O(log₂ N), which makes single-chip implementation possible. However, for those high-speed applications where the DFT transform length is large, the hardware cost needs to be cut down further and a higher processing speed is desirable. Furthermore, these algorithms require transform lengths that are powers of two.

In [10], a hardware-efficient fast cyclic convolution algorithm was presented. It can efficiently control the number of required multipliers at the cost of a reasonable number of adders. Furthermore, the I/O cost can be kept low and the throughput rate is high. Thus, it is much more efficient than previous systolic-array-based cyclic convolution implementation methods. However, independently applying this algorithm to a prime N-length DFT still requires a huge amount of hardware when N is large. In this paper, we first formulate a prime N-length DFT into an (N-1)-point complex cyclic convolution, and then use the symmetry properties and the hardware-efficient cyclic convolution algorithm in [10] to simplify the derived (N-1)-point complex cyclic convolution. The Winograd Fourier transform algorithm (WFTA) [12] is finally combined to obtain a low-cost algorithm for long-length DFTs. Compared with the previous O(log₂ N) algorithms, the proposed algorithm offers many more choices of applicable transform lengths, because N can be decomposed into any small numbers that are mutually prime. Furthermore, the processing speed can be improved by a factor of 2 and 30% to 50% of the multipliers can be saved.

This paper is organized as follows. The algorithm derivation for small prime N-length DFTs is discussed in Section II. The hardware implementation structures are presented in Section III. In Section IV, the computational complexity of the proposed algorithm for real and complex inputs is discussed. The proposed algorithm is combined with the WFTA [12] to obtain a low-cost implementation of long-length DFTs in Section V. Algorithm analysis and comparison with the O(log₂ N) DFT algorithm [9] are given in Section VI.

Manuscript received May 6, 2006; revised October 22, 2006. This paper was recommended by Associate Editor K. Chakraborty. The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCSI.2006.888772

II. ALGORITHM DERIVATION

The 1-D DFT of the input sequence x(n) can be generally expressed as
$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1$   (1)

where $W_N$ is assumed to be $e^{-j 2\pi/N}$ and $N$ is prime. Equation (1) can be formulated as

$X(0) = \sum_{n=0}^{N-1} x(n)$   (2a)

and

$X(k) = x(0) + \sum_{n=1}^{N-1} x(n)\, W_N^{nk}, \quad k = 1, 2, \ldots, N-1$   (2b)
where

$X(k) - x(0) = \sum_{n=1}^{N-1} x(n)\, W_N^{nk}, \quad k = 1, 2, \ldots, N-1$   (2c)

$n = \langle g^{i} \rangle_N, \quad k = \langle g^{-j} \rangle_N, \quad i, j = 0, 1, \ldots, N-2$   (2d)

where $g$ is a primitive root modulo $N$, $\langle \cdot \rangle_N$ denotes reduction modulo $N$, and $\langle g^{i} \rangle_N$, for $i = 0, 1, \ldots, N-2$, runs over all of $1, 2, \ldots, N-1$. With the index mapping of (2d), (2c) is equal to

$X(\langle g^{-j} \rangle_N) - x(0) = \sum_{i=0}^{N-2} x(\langle g^{i} \rangle_N)\, W_N^{\langle g^{\,i-j} \rangle_N}, \quad j = 0, 1, \ldots, N-2.$   (2e)

Equation (2e) is the cyclic convolution of the sequences $x(\langle g^{i} \rangle_N)$ and $W_N^{\langle g^{-j} \rangle_N}$. Take a 13-point DFT as an example. When $g = 2$, we can get the index permutation by (2d). By using formulations (2a), (2b), and (2e) and the symmetries of $W_{13}$, a 13-point DFT can be expressed in cyclic convolution form as

(3)

where the 12-point cyclic convolution matrix is shown at the bottom of the page. Next, we will use the fast cyclic convolution algorithm in [10] to compute the 12-point complex cyclic convolution by choosing r = 3. In this way, (3) can be decomposed as
(4)

where each block in (4) is a 4-point cyclic convolution matrix. Equation (4) is equal to
(5)

where the six 4-point sub-convolutions and their input and output sequences are defined by the block decomposition in (4).

Equation (5) contains six 4-point cyclic convolutions, whose results can be combined to get the final 12-point cyclic convolution. These six 4-point cyclic convolutions can be computed by the same hardware structure for the 4-point cyclic convolution in six consecutive clock cycles, which is the basic idea of the fast cyclic convolution algorithm in [10]. Therefore, the hardware efficiency of the 4-point cyclic convolution is significant. We will use one of these sub-convolutions to show how to implement the 4-point complex cyclic convolution as a processing core of the 13-point DFT. It can be decomposed using the 4-point short fast cyclic convolution in Appendix B as

(6)

where the two matrices in (6) are both 4 × 4 matrices and are defined in Table I, and where the subscripts R and I denote the real part and the imaginary part, respectively. From (6), the 4-point complex cyclic convolution can be represented as

(7a)

After applying the above process to the other five 4-point cyclic convolutions, we get

(7b)

where $R_j(i)$, $j = 1, 2$, and $I_k(i)$, $k = 1, 2, 3$, are as defined in Table I. If we define the upper half part and the lower half part of the matrix in (7b) separately, then from (7b) we get

(8)

where $(\cdot)^{*}$ is the conjugate of $(\cdot)$.
TABLE I
DEFINITION OF $R_j(i)$, $j = 1, 2$, AND $I_k(i)$, $k = 1, 2, 3$, IN (7b)
Fig. 1. 4-point complex cyclic convolution structure for 13-point DFT when r = 3.
From (9), we can simplify (5) to (10a) and (10b), where a large number of post-addition operations are eliminated.
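As a cross-check of the reformulation in this section, the following minimal sketch (an illustration only: it assumes g = 2 as the primitive root modulo 13 and evaluates the 12-point cyclic convolution with NumPy FFTs rather than with the fast algorithm of [10] or the structures of Figs. 1-3) verifies numerically that a 13-point DFT reduces to the summation (2a) plus a 12-point cyclic convolution, as in (2b)-(2e) and (3).

```python
import numpy as np

def dft13_via_cyclic_conv(x):
    # 13-point DFT through a 12-point cyclic convolution (Rader-type map).
    # Assumption: g = 2 is a primitive root modulo 13; pow(g, -j, N) needs Python 3.8+.
    N, g = 13, 2
    W = np.exp(-2j * np.pi / N)
    a = np.array([x[pow(g, i, N)] for i in range(N - 1)])      # x(<g^i>_13)
    b = np.array([W ** pow(g, -j, N) for j in range(N - 1)])   # W^(<g^-j>_13)
    c = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))             # 12-point cyclic convolution
    X = np.empty(N, dtype=complex)
    X[0] = x.sum()                                             # (2a)
    for j in range(N - 1):
        X[pow(g, -j, N)] = x[0] + c[j]                         # (2b) under the map (2d)
    return X

x = np.random.randn(13) + 1j * np.random.randn(13)
assert np.allclose(dft13_via_cyclic_conv(x), np.fft.fft(x))
```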
III. HARDWARE IMPLEMENTATION
Using (8), the 4-point complex cyclic convolution can be implemented by the structure in Fig. 1. Note that in Fig. 1 the output signal is used for the computation in (3), and the multiplier coefficients are defined in (5). As illustrated in Appendix B, the 4-point short cyclic convolution requires five multiplications, and from (6) the same count carries over to the complex structure. Therefore, the 4-point complex cyclic convolution structure requires 5 multipliers and 11 adders. Using this same structure, we can implement the other five 4-point complex cyclic convolutions in (5). The outputs of these 4-point cyclic convolutions can be combined to get the result of the 12-point cyclic convolution by the structure illustrated in Fig. 2. This combination unit uses four adders and four delay elements. The 12-point complex cyclic convolution for the 13-point DFT can then be implemented as shown in Fig. 3. Using this structure, we need five multipliers, together with the adders and delay elements shown in Fig. 3.
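To make the five-multiplier count concrete, the sketch below shows a classical CRT-based 4-point cyclic convolution that uses exactly five general multiplications, with the h-side coefficients precomputed as in Appendix B. It is given only as an illustration of the technique; it is not claimed to be the exact short algorithm of Appendix B or the adder arrangement of Fig. 1.

```python
import numpy as np

def cyclic4(x, h):
    # 4-point cyclic convolution y[n] = sum_k x[k] h[(n - k) mod 4]
    # with exactly five general multiplications (CRT over u^4 - 1).
    x0, x1, x2, x3 = x
    h0, h1, h2, h3 = h
    # h-side coefficients: precomputed in hardware since h is fixed.
    hs = (h0 + h1 + h2 + h3) / 4.0
    ha = (h0 - h1 + h2 - h3) / 4.0
    he, ho = (h0 - h2) / 2.0, (h1 - h3) / 2.0
    # Pre-additions on the data side.
    xs = x0 + x1 + x2 + x3
    xa = x0 - x1 + x2 - x3
    xe, xo = x0 - x2, x1 - x3
    # The five multiplications.
    m1, m2 = xs * hs, xa * ha
    m3, m4, m5 = xe * he, xo * ho, (xe + xo) * (he + ho)
    # Post-additions (Chinese-remainder reconstruction).
    p0, p1 = m3 - m4, m5 - m3 - m4
    s, d = m1 + m2, m1 - m2
    return np.array([s + p0, d + p1, s - p0, d - p1])

x, h = np.random.randn(4), np.random.randn(4)
direct = np.array([sum(x[k] * h[(n - k) % 4] for k in range(4)) for n in range(4)])
assert np.allclose(cyclic4(x, h), direct)
```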
Fig. 2. Combination Units for the output of 4-point complex cyclic convolution for 13-point DFT when r = 3.
Fig. 3. 12-point complex cyclic convolution for 13-point DFT.
Fig. 4. Proposed structure for 13-point DFT when input data are real.
Fig. 5. Proposed real–input structure for 13-point DFT used for the computation of complex input 13-point DFT.
Fig. 6. Proposed structure for 13-point DFT, when input data are complex.
After computing the 12-point complex cyclic convolution in (3) by the structure in Fig. 3, the 13-point DFT can be implemented by the structure in Fig. 4. The total numbers of required multipliers, adders, delay elements, and clock cycles are 5, 31, 25, and 6, respectively.

If the input data are complex, we define the real part and the imaginary part of the input separately; then (5) is equal to

(11)

and

(12)

Equations (11) and (12) show that a complex-input DFT can be computed by two real-input DFTs, with the real part and the imaginary part of the complex input each treated as a real input, followed by some post-addition operations. We first redraw the real-input DFT structure of Fig. 4 as Fig. 5, where the real and imaginary parts of the output data are listed on separate lines. The proposed structure for the 13-point DFT when the input data are complex is then shown in Fig. 6.
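The fact underlying (11) and (12) is the linearity of the DFT: a complex-input DFT can be assembled from two real-input DFT passes followed by post-additions. The short check below illustrates only this underlying property (using NumPy's FFT as the reference transform), not the specific post-addition network of Figs. 5 and 6.

```python
import numpy as np

# Complex-input DFT assembled from two real-input DFTs (linearity check).
N = 13
x = np.random.randn(N) + 1j * np.random.randn(N)
X = np.fft.fft(x.real) + 1j * np.fft.fft(x.imag)   # two real-input passes + post-additions
assert np.allclose(X, np.fft.fft(x))
```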
IV. ALGORITHM COMPLEXITY COMPUTATION

For a prime N-length DFT, the algorithm derivation process in Section II has the following restrictions: the (N-1)-length linear convolution must be computable by the iterated short convolution algorithm [11], and the (N-1)-length cyclic convolution must be decomposable by the i-point short cyclic convolutions listed in Appendix B. Observing the 2-point, 4-point, and 8-point short cyclic convolution algorithms in Appendix B, we can see that, when the block size i is even, the left i/2 columns of each row have the same values, or the same absolute values with opposite signs, as those of the right i/2 columns. This property persists as the block size grows. Combined with the symmetry properties of the twiddle factors in the (N-1) × (N-1) cyclic matrix
Fig. 7. Input channel for preaddition when r = 4 (2 × 2).
Fig. 8. New input channel for preaddition when r = 4.
obtained from the prime N-length DFT, this property is the basis of the algorithm derivation process in Section II. The computational complexity of the proposed DFT algorithm is given as follows.

A. When Input Data Are Real

The number of required multipliers is the same as that used in the (N-1)-point fast cyclic convolution algorithm in [10] and is given as
(13)

where the per-block term is the number of multipliers required for the i-point short cyclic convolution and is given in Appendix B. The number of required adders is given as
(14)

where the first term is the number of required adders used in the (N-1)-point fast cyclic convolution algorithm in [10] and the second term is the number of parallel levels when the (N-1)-length linear convolution is implemented using iterated short convolution, counted over all short-convolution factors of N-1; for example, for N-1 = 12 the factors of Section II are used.

When we use the input channel design method in [10], the input channel for r = 4 is shown in Fig. 7. This implementation requires two adders and four delay elements. However, we can implement the input channel for r = 4 with the structure shown in Fig. 8 at a lower hardware cost. This new implementation requires two adders and three delay elements. When r gets large, this new design can save a large number of delay elements.

The number of required delay elements is given as (15), shown at the bottom of the page, where the factors of N-1 are ordered as 2, 4, 3, with as many 4s as possible in the group. For example, the corresponding groupings are defined in the following table.

(15)
The number of required clock cycles is determined by

(16)

which is the number of required multiplications for the (N-1)-length linear convolution using the iterated short convolution algorithm [11]; the number of multiplications required for each i-point short linear convolution is given in Appendix A. The required I/O cost is determined by

(17)

where L is the wordlength.

B. When Input Data Are Complex

The number of required multipliers is given as

(18)

where the corresponding real-input quantity is defined in (13). The number of required adders is given as

(19)

where the corresponding real-input quantity is defined in (14). The number of required delay elements is given as

(20)

where the corresponding real-input quantity is defined in (15). The number of required clock cycles is given as

(21)

where the corresponding real-input quantity is defined in (16). The required I/O cost is determined by

(22)

where the corresponding real-input quantity is defined in (17).

V. USE OF WFTA FOR LONG-LENGTH DFT IMPLEMENTATION

From the WFTA [12], when N1 and N2 are relatively prime, an N = N1N2-point DFT can be decomposed as

(23)

where ⊗ is the tensor product and the factors are the N1-point and N2-point DFTs.

A. When N = N1 × N2

The tensor product in (23) can be decomposed from [13] and [14] into a sequence of vector and parallel operations

(24)

where one factor denotes a vector operation and the other denotes a parallel operation. A vector operation can be changed into a parallel operation by

(25)

where the stride permutation denotes the reordering action on the elements of a vector of size N1N2 by striding through the vector with a fixed stride. From (25), we can see that the computation of (24) can be transformed into that of an N2-parallel N1-point DFT and an N1-parallel N2-point DFT. Fast short DFT algorithms can be used, as shown in [5], to reduce the hardware cost. However, when N1N2 is large, the hardware cost will still be very high. Another problem is the I/O cost: the parallel input and output lead to too much I/O cost. We can instead implement the N2-parallel N1-point DFT by applying the N1-point DFT in N2 consecutive clock cycles, and the N1-parallel N2-point DFT by applying the N2-point DFT in N1 consecutive clock cycles. This reduces the hardware cost and the I/O cost by the corresponding factors, which results in a practical implementation of long-length DFTs. For example, when N = 20 = 5 × 4, from (24) and (25) we can get

(26)

where the permuted input and output sequences are defined by the corresponding stride permutations.
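The conversion in (25) rests on the standard tensor-product commutation property: a vector operation A ⊗ I can be turned into the parallel operation I ⊗ A by stride permutations. The sketch below verifies this identity numerically; stride_perm is a helper written only for this illustration, and the random matrix A merely stands in for a short DFT block (m and n stand in for N1 and N2).

```python
import numpy as np

def stride_perm(m, n):
    # Permutation that views a length m*n vector as an m-by-n row-major
    # array and transposes it (i.e., maps u (x) v to v (x) u).
    P = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            P[j * m + i, i * n + j] = 1.0
    return P

m, n = 4, 5                                  # stand-ins for N1, N2
A = np.random.randn(m, m)                    # stands in for an m-point DFT block
vector_op = np.kron(A, np.eye(n))            # "vector operation"  A (x) I_n
parallel_op = np.kron(np.eye(n), A)          # "parallel operation" I_n (x) A
assert np.allclose(vector_op,
                   stride_perm(n, m) @ parallel_op @ stride_perm(m, n))
```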
Fig. 9. Proposed 20-point DFT structure.
Since the input and output permutations only reorder the data and involve no computational operations, we ignore them here. Thus, from (26), the 20-point DFT can be implemented by the structure in Fig. 9. D5 is computed by our proposed 5-point DFT systolic array structure, which requires one clock cycle to finish processing five input samples; thus, D5 needs four clock cycles to finish the computation of 20 input data samples. D4 is implemented by the fast DFT algorithm in [12], which can implement a 4-point DFT in one clock cycle and requires only 16 real additions. The permutation operation is implemented through the delay element matrix (DEM) in the middle of Fig. 9. The squares in the array denote delay elements. The output data of D5 are injected into the DEM from the two indicated directions alternately. The permuted data leave the permutation array and enter D4 from the two indicated directions alternately. Since it takes four clock cycles for the 20 input data to get into the permutation array and five clock cycles for these 20 data to get out of the array, the data can get into and out of the permutation array concurrently and continuously. Five clock cycles are required for the computation of our proposed 20-point DFT structure.

The data flow of the proposed 20-point DFT is shown in Fig. 10. From Fig. 10(b), we can see that four clock cycles are needed to load the 20 outputs of D5 into the DEM and five clock cycles are needed to unload the permuted data. Note that loading and unloading occur concurrently. After the data flow in Fig. 10(c), the data flow returns to Fig. 10(b) and the pattern is repeated. When the squares, which denote delay elements, are blank, these delay elements are empty. The number in each square is the group number, which has the same meaning as the group index in (26). This structure is hardware efficient and fast. Considering the O(log₂ N) algorithms, we know that 4 complex multipliers and 8 complex adders, which correspond to 16 real multipliers and 24 real adders, are required for the computation of a 16-point DFT
Fig. 10. Data flow of Fig. 9.
and can complete the computation in 16 clock cycles. However, our structure needs only 10 real multipliers and 58 real adders for a 20-point DFT and can complete the computation in five clock cycles.
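As a numerical cross-check of the 20-point decomposition, the sketch below applies the textbook Good-Thomas (prime-factor) index maps that underlie (23)-(26): the input is arranged into a 4 × 5 array, 4-point and 5-point DFTs are applied along the two dimensions without twiddle factors, and the output is read out by residues. The index-map conventions here are the standard ones and are not claimed to reproduce the exact DEM permutation of Fig. 9.

```python
import numpy as np

def dft_matrix(n):
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

N1, N2 = 4, 5
N = N1 * N2
x = np.random.randn(N) + 1j * np.random.randn(N)

# Good-Thomas input map: x[(n1*N2 + n2*N1) mod N] -> xin[n1, n2]
xin = np.empty((N1, N2), dtype=complex)
for n1 in range(N1):
    for n2 in range(N2):
        xin[n1, n2] = x[(n1 * N2 + n2 * N1) % N]

# Short DFTs along both dimensions; no twiddle factors are needed.
Xout = dft_matrix(N1) @ xin @ dft_matrix(N2).T

# Output map: X[k] is found at (k mod N1, k mod N2).
X = np.array([Xout[k % N1, k % N2] for k in range(N)])
assert np.allclose(X, np.fft.fft(x))
```

In the proposed structure of Fig. 9, the two short transforms are instead reused over consecutive clock cycles (D5 and D4), with the DEM realizing the intermediate permutation.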
B. When N = N1 × N2 × N3

In this case, we can get

(27)

where the terms are defined analogously to (24)-(26).
For example, when N = 60 with factors 5, 3, and 4, from (27) we can get

(28)

where
Fig. 11. (a) and (b) Data flow of proposed 60-point DFT.
The data flow of this proposed 60-point DFT algorithm is shown in Fig. 11(a)-(d). From Fig. 11(a)-(d), we can see that three clock cycles and five clock cycles are needed to load and unload the DEM between D5 and D3. The outputs of D3 enter and leave the DEMs between D3 and D4 from the upper DEM to the lower DEM one by one in consecutive clock cycles. In Fig. 11(a)-(d), the numbers after "t" and the arrows show when and from where these data are loaded and unloaded concurrently. After Fig. 11(d), the data flow goes back to Fig. 11(b) and the pattern is repeated. In Fig. 11(a)-(d), when the squares, which denote delay elements, are blank, these delay elements are empty. The number in each square is the group number, which has the same meaning as the group index in (28). This proposed 60-point DFT can complete its computation in 20 clock cycles and requires 14 real multipliers and 74
Fig. 11. Continued (c) and (d) Data flow of proposed 60-point DFT.
real adders. However, its 64-point counterpart by the O(log₂ N) algorithm needs 64 clock cycles, 24 real multipliers, and 36 real adders. It is obvious that our design is both fast and hardware efficient.

C. When N = N1 × N2 × ⋯ × Nk

We will have

(29)

(30)

Combining (29) and (30), we can get an N-length DFT structure similar to those derived in the above examples for any N that is decomposed into relatively prime factors.
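The same index-map idea nests for any set of pairwise coprime factors. The following self-contained sketch (again using the standard Good-Thomas maps, not the scheduled D5/D3/D4 hardware and DEMs of Fig. 11) checks a 60-point DFT built from 3-, 4-, and 5-point transforms; pfa_dft and dft_matrix are helpers written only for this illustration.

```python
import numpy as np
from itertools import product

def dft_matrix(n):
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def pfa_dft(x, factors):
    # DFT of length prod(factors) for pairwise coprime factors via the
    # Good-Thomas prime-factor mapping (a brute-force check, not hardware).
    N = int(np.prod(factors))
    M = [N // f for f in factors]
    xin = np.empty(factors, dtype=complex)
    for idx in product(*[range(f) for f in factors]):
        xin[idx] = x[sum(i * m for i, m in zip(idx, M)) % N]
    # Apply a short DFT along every dimension; no twiddle factors needed.
    for axis, f in enumerate(factors):
        xin = np.moveaxis(np.tensordot(dft_matrix(f), xin, axes=(1, axis)), 0, axis)
    # Output map: X[k] sits at (k mod N1, ..., k mod Nk).
    return np.array([xin[tuple(k % f for f in factors)] for k in range(N)])

x = np.random.randn(60) + 1j * np.random.randn(60)
assert np.allclose(pfa_dft(x, (3, 4, 5)), np.fft.fft(x))
```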
VI. ALGORITHM ANALYSIS AND COMPARISON

We now analyze the computational complexity and compare our proposed DFT structures with the recently reported hardware-efficient and fast DFT structures [9], whose computational complexity is O(log₂ N). Based on (29) and (30), the computational complexity of our proposed design for complex-input DFT, in terms of the number of required multipliers (R.M.), adders (R.A.), and delay elements (R.D.), is given as follows:

(31)
TABLE II
COMPARISON BETWEEN PROPOSED DFT STRUCTURES AND PREVIOUS DFT DESIGN WITH COMPUTATIONAL COMPLEXITY O(log₂ N), IN TERMS OF THE NUMBER OF REQUIRED REAL MULTIPLIERS (R.M.), REAL ADDERS (R.A.), AND DELAY ELEMENTS (R.D.) FOR REAL DATA, AND THE COMPUTATION TIME IN CYCLES FOR A COMPLETE DFT (R.T.)

(32)
TABLE III
HARDWARE COST OF THE PRIME-LENGTH DFT [6]; ALL NUMBERS ARE FOR REAL COMPUTATION WITH 16-BIT INPUT DATA
(33)

where the quantities in (31)-(33) are the numbers of required multipliers, adders, and delay elements, respectively, for the Ni-length DFTs; if Ni is a prime number, (18)-(20) can be used to compute them. The computation time complexity of the N-length DFT is determined by

(34)

where the clock-cycle count of each prime-length stage is defined in (16). Note that N can only be decomposed into prime numbers and at most one additional small factor whose fast DFT algorithm is taken from [12], because the fast DFT algorithms with these lengths are hardware efficient and can be computed in one clock cycle [12].

As shown by the examples in Section V, the proposed new DFT design is fast and hardware efficient compared with the previous DFT structure with computational complexity O(log₂ N). A more detailed comparison is shown in Table II. From Table II, we can see that the proposed design is both fast and hardware efficient. When the transform length varies from 16 to 2048, our average computation speed is nearly 2 times faster or more. While the DFT algorithm with complexity O(log₂ N) computes one point per clock cycle on average, our design requires only 0.20 to 0.53 cycles per point. Furthermore, our DFT structure can reduce the average number of required real multipliers by 30% to 50%. Although the number of required adders and delay elements increases, they are controlled within a reasonable range, which does not offset the hardware saved by the reduced number of required multipliers. Note that all the numbers in Table II are for real-number computations.

We also compare the hardware cost with the prime-length DFT design [6] in Table III. From Table III, we can see that the DFT design in [6] has a slower processing speed for short-length DFTs and a comparable processing speed for long-length DFTs. Although no multipliers are needed, the DFT design in [6] requires a large number of adders and delay elements, especially when the transform length is long. For example, when the transform length is 2039, the DFT design in [6] requires 12049 extra adders, whereas our design requires only 28 additional multipliers. The hardware saving of our design is obvious.

VII. CONCLUSION

Systolic-array-based DFT algorithms have regular structures. But when the DFT length increases, the number of
required multiplications increases dramatically. Although memory-based systolic array designs can be used to remove the multiplication operations, a large number of adders and RAM/ROM resources are needed, and the structure is also complicated. Although the O(log₂ N) DFT algorithms can cut the computation complexity down to O(log₂ N), their throughput rate is only 1/N and they restrict the transform length, which must be a power of two. Since the hardware-efficient fast cyclic convolution algorithm can be used to save a large number of multipliers, it is combined with the symmetry properties of the DFT to implement prime-length DFTs in this paper. The WFTA is also used to control the increase of hardware cost when N is large and to increase the applicable transform lengths of the DFT. The results show that our proposed DFT design can not only save 30% to 50% of the multipliers compared with the O(log₂ N) algorithms, but also improve the processing speed by a factor of 2.
APPENDIX A
SHORT LINEAR CONVOLUTION ALGORITHMS

Efficient short linear convolution algorithms for the iterated short convolution algorithm are given with the number of multipliers M and the number of adders A. The operations involving h are not counted, since they can be precomputed and stored.

Algorithm (A.1)

Algorithm (A.2)
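As an illustration of the style of these listings, here is the familiar 2-point short linear convolution that uses three multiplications instead of four; it is a hedged example only and is not claimed to be identical to algorithm (A.1) as originally listed.

```python
import numpy as np

def linear2(x, h):
    # 2-point linear convolution with three general multiplications;
    # the h-side pre-addition (h[0] + h[1]) can be precomputed.
    m0 = x[0] * h[0]
    m1 = x[1] * h[1]
    m2 = (x[0] + x[1]) * (h[0] + h[1])
    return np.array([m0, m2 - m0 - m1, m1])

x, h = np.random.randn(2), np.random.randn(2)
assert np.allclose(linear2(x, h), np.convolve(x, h))
```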
APPENDIX B
SHORT CYCLIC CONVOLUTION ALGORITHMS

Algorithm (B.1)

Algorithm (B.2)
Algorithm (B.3)

where the equations at the top of the previous page are true.

Algorithm (B.4)

Algorithm (B.5)

where the set of equations at the top of the page are true.

Algorithm (B.6)

where the equations at the top of the next page are true.

REFERENCES

[1] L. W. Chang and M. Y. Chen, “A new systolic array for discrete Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 10, pp. 1665–1666, Oct. 1988.
[2] N. Rama Murthy and M. N. S. Swamy, “On the real-time computation of DFT and DCT through systolic architectures,” IEEE Trans. Signal Process., vol. 42, no. 4, pp. 988–991, Apr. 1994.
[3] D. C. Kar and V. V. Bapeswara Rao, “A new systolic realization for the discrete Fourier transform,” IEEE Trans. Signal Process., vol. 41, no. 5, pp. 2008–2010, May 1993.
[4] C. M. Liu and C. W. Jen, “A new systolic array algorithm for discrete Fourier transform,” in Proc. IEEE Int. Conf. Circuits Syst., May 1991, pp. 2212–2215.
[5] J. I. Guo, C. M. Liu, and C. W. Jen, “The efficient memory-based VLSI array designs for DFT and DCT,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, pp. 723–733, Oct. 1992.
[6] T. S. Chang, J. I. Guo, and C. W. Jen, “Hardware-efficient DFT designs with cyclic convolution and subexpression sharing,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 47, no. 9, pp. 886–892, Sep. 2000.
[7] J. Choi and V. Boriakoff, “A new linear systolic array for FFT computation,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 4, pp. 236–239, Apr. 1992.
[8] V. Boriakoff, “FFT computation with systolic arrays, a new architecture,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 41, no. 4, pp. 278–284, Apr. 1994.
[9] C.-H. Chang, C.-L. Wang, and Y.-T. Chang, “Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse,” IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3206–3216, Nov. 2000.
[10] C. Cheng and K. K. Parhi, “Hardware efficient fast DCT based on novel cyclic convolution structures,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4419–4434, Nov. 2006.
[11] C. Cheng and K. K. Parhi, “Hardware efficient fast parallel FIR filter structures based on iterated short convolution,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004. [12] H. S. Silverman, “An introduction to programming the Winograd Fourier transform algorithm,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-25, no. 2, pp. 152–165, Feb. 1977. [13] P. Lavoie, “A high-speed CMOS implementation of the Winograd Fourier transform algorithm,” IEEE Trans. Signal Process., vol. 44, no. 8, pp. 2121–2126, Aug. 1996. [14] R. Tolimieri, M. An, and C. Lu, Algorithms for Discrete Fourier Transform and Convolution. New York: Springer-Verlag, 1989.
Chao Cheng (M’03–S’04) received the M.S.E.E. degree from Huazhong University of Science and Technology, Wuhan, China, in 2001. He is working toward the Ph.D. degree at the University of Minnesota, Twin Cities. He has three years of industrial experience as a digital communication engineer at VIA Technologies. His present research interest is in VLSI digital signal processing algorithms and their implementation.
Keshab K. Parhi (S’85–M’88–SM’91–F’96) received the B.Tech., M.S.E.E., and Ph.D. degrees from the Indian Institute of Technology, Kharagpur, India, the University of Pennsylvania, Philadelphia, and the University of California at Berkeley, in 1982, 1984, and 1988, respectively. He has been with the University of Minnesota, Minneapolis, since 1988, where he is currently Distinguished McKnight University Professor in the Department of Electrical and Computer Engineering. His research addresses VLSI architecture design and implementation of physical layer aspects of broadband communications systems. He is currently working on error control coders and cryptography architectures, high-speed transceivers, and ultra-wide-band systems. He has authored over 400 papers, has authored the text book VLSI Digital Signal Processing Systems (Wiley, 1999) and coedited the reference book Digital Signal Processing for Multimedia Systems (Marcel Dekker, 1999).
Dr. Parhi is the recipient of numerous awards including the 2004 F. E. Terman Award by the American Society of Engineering Education, the 2003 IEEE Kiyo Tomiyasu Technical Field Award, the 2001 IEEE W. R. G. Baker prize paper award, and a Golden Jubilee award from the IEEE Circuits and Systems Society in 1999. He has served on the Editorial Boards of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VLSI Systems, Signal Processing, Signal Processing Letters, and Signal Processing Magazine. He served as the Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS from 2004 to 2005, and serves on the Editorial Board of the Journal of VLSI Signal Processing. He has served as technical program cochair of the 1995 IEEE VLSI Signal Processing workshop and the 1996 ASAP conference, and as the general chair of the 2002 IEEE Workshop on Signal Processing Systems. He was a distinguished lecturer for the IEEE Circuits and Systems society during 1996–1998.