The 1D Discrete Cosine Transform For Large Point Sizes Implemented On Reconfigurable Hardware

Sherman Braganza, Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA
[email protected]

Miriam Leeser, Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA
[email protected] Abstract
Gradient (PCG) technique in applications like adaptive filtering [4] and phase unwrapping [3]. This paper presents an algorithm, first developed by Makhoul [7], and an implementation of it that utilizes a Fast Fourier Transform (FFT) core to compute a DCT without significantly increasing overall latency as compared to just a FFT core. The advantage of this approach is the ready availability of a large number of FFT cores in both fixed- point [11] and floatingpoint [1] formats which can be easily dropped in with minimal modifications to the overall design.
The Discrete Cosine Transform (DCT) is used in place of the Discrete Fourier Transform (DFT) in a wide variety of audio and image processing applications due to its energy compaction properties which approach those of the optimal Karhunen-Love transform. Previous work in reconfigurable hardware has focused on implementations of 2D 8x8 transforms of the type commonly used in the JPEG and MPEG standards. Several applications for larger DCTs exist, including those involving the extraction of features from image data, solving partial differential equations (PDEs) and those utilizing the Preconditioned Conjugate Gradient (PCG) method such as phase-unwrapping. This paper presents an indirect algorithm and implementation on a Xilinx FPGA that performs 1D DCTs on large block sizes using a block floating point format. The DCT was designed to use fewer resources than other popular approaches due to the larger point sizes supported which would otherwise consume all available chip area, but at the cost of higher latency. This latency is similar to that required for an identically sized FFT. A 512-point DCT has been shown to take 1771 cycles or 13.3 us at 133 MHz as compared to a similarly sized FFT that takes 1757 cycles or 13.2 us (including all component transfer times).
1
The popularity of the JPEG and MPEG standards has resulted in a proliferation of hardware implementations of the DCT targeted towards specific sizes, most commonly the 8x8 2D implementation. This is usually accomplished by multiplying an 8x8 block from the image data by an 8x8 coefficient matrix, resulting in an output matrix that contains the component frequencies. This is a fairly expensive operation, requiring 4096 additions and another 4096 multiplications. There have been many publications detailing the implementation of such 2D matrix multiplications using distributed arithmetic (DA) which reduces the number of multiplications required, but which still results in relatively large, low latency hardware. Woods et. al. [9] used a combination of 1D DCTs, distributed arithmetic, and transpose buffers on a Xilinx XC6264 to generate such a design that utilized 30 percent of total chip area. Bukhari et. al. [5] investigated the implementation of a modified Loeffler algorithm for 8x8 DCTs on various FPGAs. Siu and Chan proposed a multiplierless VLSI implementation[2] for an 11x11 2D transform and many other variations on small 2D DCTs exist.
Introduction and Related work
The Discrete Cosine Transform (DCT) is used in a wide variety of applications such as image and audio processing due to its compaction of energy into the lower frequencies. This property is exploited to produce efficient frequencybased compression methods in various image and audio codecs such as JPEG and MPEG. However, the DCT is also used in other applications that require larger sized transforms such as those using the Preconditioned Conjugate
1-4244-1027-4/07/$25.00 ©2007 IEEE
Larger 2D DCTs can be implemented using 1D DCTs by taking advantage of the DCT’s separability property. This is accomplished by first taking the DCT of all the rows and then of all the columns. However, there do not exist many implementations of large 1D DCTs for reconfigurable systems. In [10], an 8 to 32 point core with a maximum of 24
101
bit precision was implemented using distributed arithemetic for the vector-coefficient matrix multiplication. A 32 point, 24 bit instance of this core has a latency of only 32 cycles, but consumes 10588 LUTs which makes the approach impractical for large designs. In contrast, our approach is much more compact and therefore enables larger transform sizes at the cost of higher latency. Leong [6] implements a variable radix, bit serial DCT using a systolic array but only describes area requirements for designs up to N=32 which consumes between 457 and 1363 adders and has a high worst case error. In comparison, our approach supports much larger transform sizes with demonstrated designs of up to 1024 points. The rest of the paper is organized as follows: First the algorithm used to implement the DCT is described along with further possible optimizations. Next the implementation is discussed on a system level and on an individual component level. The results are then analyzed and various characteristics of the implementation pointed out. Last of all, conclusions are drawn and the direction of future efforts discussed.
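As a concrete illustration of the matrix formulation described above, the following NumPy fragment (ours, not from the paper) builds an unnormalized 8x8 DCT-II coefficient matrix and applies it to a block; separability appears as one matrix product per dimension:

```python
import numpy as np

# Unnormalized 8x8 DCT-II coefficient matrix: C[k, n] = 2*cos((2n+1)*pi*k/2N).
# (Illustrative only -- codecs apply additional normalization factors.)
N = 8
n, k = np.meshgrid(np.arange(N), np.arange(N))
C = 2 * np.cos((2 * n + 1) * np.pi * k / (2 * N))

X = np.random.rand(N, N)   # one 8x8 block of image data
Y = C @ X @ C.T            # DCT of every row, then of every column (separability)
```

Evaluating `C @ X @ C.T` directly is the expensive dense form; distributed arithmetic reorganizes exactly these multiply-accumulate operations.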
2 Algorithm

The general algorithm presented here was first discussed by Makhoul [7]. It is an indirect algorithm for computing DCTs using FFTs. The steps are presented below, as well as their correspondence to the computation done in hardware. Given an input signal x(n), the DFT of that signal is given by:

X(k) = \sum_{n=0}^{N-1} x_n e^{-j 2\pi nk/N}, \quad k = 0 \ldots N-1

The cosine transform can be viewed as the real part of X(k), which is equivalent to saying that it is the Fourier transform of the even extension of x(n), given that x(n) is causal (i.e. x(n) = 0 for n < 0). This is the inspiration for the usual technique for implementing the DCT: mirroring the set of real inputs and taking the real DFT of the resulting sequence. This mirroring can be performed in any of four ways: around the n = -0.5 and n = N-0.5 sample points, around n = 0 and n = N, around n = -0.5 and n = N, and finally around n = 0 and n = N-0.5. All of these methods result in slightly different DCTs. The most commonly used even extension is the one depicted in Figure 1, and it is the focus of the algorithm and implementation presented here. This category of DCT, obtained by taking the DFT of a 2N-point even extension, is known as a DCT Type II or DCT-II and is defined as:

X(k) = 2 \sum_{n=0}^{N-1} x_n \cos\left( \frac{(2n+1)\pi k}{2N} \right), \quad k = 0 \ldots N-1

Figure 1. Even extension around n=-0.5 and n=N-0.5

with the even extension defined as:

x'(n) = \begin{cases} x(n) & n = 0 \ldots N-1 \\ x(2N-n-1) & n = N \ldots 2N-1 \end{cases}

The DCT-II can be shown to be solvable via the DFT by noting that:

X'(k) = \sum_{n=0}^{2N-1} x'_n e^{-j\pi nk/N}
      = \sum_{n=0}^{N-1} x_n e^{-j\pi nk/N} + \sum_{n=N}^{2N-1} x(2N-n-1) e^{-j\pi nk/N}
      = e^{j\pi k/2N} \sum_{n=0}^{N-1} x_n \left[ e^{-j(2n+1)\pi k/2N} + e^{j(2n+1)\pi k/2N} \right]
      = 2 e^{j\pi k/2N} \sum_{n=0}^{N-1} x_n \cos\left( \frac{(2n+1)\pi k}{2N} \right), \quad k = 0 \ldots N-1

which is identical to the definition of the DCT above except for a multiplicative factor of e^{j\pi k/2N}. A similar method can be used to write an IDCT in terms of a length-2N complex IDFT. Full details can be found elsewhere [7].

The performance of the DCT in terms of latency and area can be improved further such that an N-point real DFT/IDFT may be used. The method for accomplishing this is outlined below. A sequence v(n) is constructed from x(n) such that it follows the restriction:

v(n) = \begin{cases} x(2n) & n = 0 \ldots N/2 - 1 \\ x(2N-2n-1) & n = N/2 \ldots N-1 \end{cases} \quad (1)

When the DFT of v(n) is computed and the result multiplied by 2e^{-j\pi k/2N}, the resulting sequence can be written as:

X(k) = 2 \sum_{n=0}^{N-1} v_n \cos\left( \frac{(4n+1)\pi k}{2N} \right), \quad k = 0 \ldots N-1

which is an alternative version of the DCT based on v(n).

Again, for the IDCT, the real sequence X(k) can be rearranged to form a complex Hermitian-symmetric sequence V(k), where Hermitian symmetry is defined as X(N-k) = X^*(k) and the sequence itself as:

V(k) = \frac{1}{2} e^{j\pi k/2N} \left[ X(k) - jX(N-k) \right], \quad k = 0 \ldots N-1

An IDFT on V(k) generates the v(n) sequence described earlier, which can then be rearranged to form x(n). This is the method used in the implementation discussed in the later sections of this paper. However, for both the size-N DCT and IDCT it should be noted that the input sequences are either entirely real or Hermitian symmetric, and can thus be computed using FFTs with a point size of N/2 [8, 7]. This can be done by setting alternating elements of v(n) to the real and imaginary parts of a new sequence t. That is,

t(n) = v(2n) + jv(2n+1), \quad n = 0 \ldots N/2 - 1

The DFT of this sequence can then be computed and the original V(k) extracted by using:

V(k) = \frac{1}{2} \left[ T(k) + T^*\left( \frac{N}{2} - k \right) \right] - \frac{j}{2} e^{-j 2\pi k/N} \left[ T(k) - T^*\left( \frac{N}{2} - k \right) \right]

This gives the original V(k), which can subsequently be used to generate X(k). A similar method can be applied to the real IFFT to realize similar savings. The implementation described in this paper does not use this last FFT optimization (which reduces the required transform size from N to N/2), since it involves an extra multiplication step that would reduce the accuracy of the results, given that the data being used is fixed-point. However, for a floating-point or low-precision implementation, this would be a feasible optimization.
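The algorithm can be checked in software. The following NumPy sketch (ours, not from the paper; all function names are hypothetical) computes the DCT-II directly from its definition, via the N-point FFT of v(n), and via the N/2-point packed FFT described above; the hardware performs the middle variant on 24-bit fixed-point data:

```python
import numpy as np

def dct2_direct(x):
    # Direct DCT-II definition: X(k) = 2 * sum_n x(n)*cos((2n+1)*pi*k / 2N)
    N = len(x)
    n = np.arange(N)
    return np.array([2 * np.sum(x * np.cos((2 * n + 1) * np.pi * k / (2 * N)))
                     for k in range(N)])

def shuffle(x):
    # v(n) = x(2n) for n < N/2, v(n) = x(2N-2n-1) for n >= N/2 (Equation 1)
    N = len(x)
    v = np.empty(N)
    v[:N // 2] = x[0::2]
    v[N // 2:] = x[::-1][0::2]
    return v

def dct2_via_fft(x):
    # Indirect method: N-point FFT of v(n), rotate by 2*exp(-j*pi*k/2N),
    # keep the real part (the hardware's rebuild rotate step).
    N = len(x)
    V = np.fft.fft(shuffle(x))
    k = np.arange(N)
    return np.real(2 * np.exp(-1j * np.pi * k / (2 * N)) * V)

def dct2_via_half_fft(x):
    # The further optimization (not used in the hardware): pack v(n) into
    # t(n) = v(2n) + j*v(2n+1) and use an N/2-point FFT, then unpack V(k).
    N = len(x)
    v = shuffle(x)
    T = np.fft.fft(v[0::2] + 1j * v[1::2])
    k = np.arange(N)
    Tk = T[k % (N // 2)]
    Tc = np.conj(T[(N // 2 - k) % (N // 2)])
    V = 0.5 * (Tk + Tc) - 0.5j * np.exp(-2j * np.pi * k / N) * (Tk - Tc)
    return np.real(2 * np.exp(-1j * np.pi * k / (2 * N)) * V)
```

All three variants agree to floating-point precision for, e.g., a random 512-point input.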
3 Implementation
The description of the algorithm for both the forward and inverse DCT lends itself to a clearly defined component breakdown in terms of hardware. For the DCT, the first component creates v(n) by re-ordering the input sequence and writing it to memory. The second component is the FFT itself, which transforms the shuffled input data into the frequency domain. Last is the component that multiplies the output V(k) by 2e^{-j\pi k/2N} and extracts the desired output values from the complex FFT output. Roughly the same components are required for the inverse DCT, but in reverse order. First, a multiplication of a re-arranged sequence Y(k) by 0.5e^{j\pi k/2N} is performed, where Y(k) = X(k) - jX(N-k). The data is then passed through an inverse FFT of size N, followed by the mapping of v(n) to x(n). This organization of the components is depicted in Figure 2 and Figure 3 for the forward and inverse transforms respectively.
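The inverse path just described can be sketched in NumPy (our own sketch with a hypothetical function name; X(N) is taken as zero, which holds for the DCT-II of any real sequence). It inverts the unnormalized forward transform of the algorithm section:

```python
import numpy as np

def idct2_via_ifft(X):
    # Rotate Y(k) = X(k) - j*X(N-k) by 0.5*exp(j*pi*k/2N) to form the
    # Hermitian sequence V(k), run an N-point IFFT to recover v(n),
    # then un-shuffle v(n) back into x(n) (Equation 1 in reverse).
    N = len(X)
    k = np.arange(N)
    Xe = np.append(X, 0.0)               # X(N) = 0 for the DCT-II
    V = 0.5 * np.exp(1j * np.pi * k / (2 * N)) * (Xe[k] - 1j * Xe[N - k])
    v = np.real(np.fft.ifft(V))
    x = np.empty(N)
    x[0::2] = v[:N // 2]                 # v(n) = x(2n)
    x[1::2] = v[:N // 2 - 1:-1]          # v(n) = x(2N-2n-1)
    return x
```

The hardware performs the same three steps (rotate, inverse FFT, un-shuffle) on 24-bit fixed-point data.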
Figure 2. Components and dataflow for the forward DCT transform
3.1 Shuffle
As input data is sent sequentially into the DCT core, the first stage of processing is the generation of the v(n) sequence. This occurs within the shuffle component. The shuffle has a latency of one clock cycle and calculates output indices from the input index according to Equation 1. Since the shuffle component only affects index values, all addition and subtraction performed within it is of bit-width log2(N). Based on these output indices, the sample values are written to block RAM in shuffled order, as shown in Figure 4. This step of writing to block RAM is necessary because the FFT component takes its input in sequential order while the shuffle produces output non-sequentially. For an inverse DCT, the shuffle re-arranges the FFT output data in the opposite direction, forming x(n) from v(n). This is also written to block RAM before being transmitted back to the host, since the data is not generated in sequential order.
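The index mapping the shuffle computes can be expressed as a short function (ours, with a hypothetical name): each arriving sample index n is turned into the block-RAM write address that realizes Equation 1.

```python
def shuffle_addr(n, N):
    # Block-RAM write address for input sample x(n), so the memory holds
    #   v(m) = x(2m)          for m = 0 .. N/2 - 1
    #   v(m) = x(2N - 2m - 1) for m = N/2 .. N - 1   (Equation 1)
    if n % 2 == 0:
        return n // 2
    return (2 * N - n - 1) // 2

# For N = 8 the input stream x(0)..x(7) lands in memory as
# v = [x0, x2, x4, x6, x7, x5, x3, x1]:
N = 8
x = list(range(N))
v = [None] * N
for n in range(N):
    v[shuffle_addr(n, N)] = x[n]
```

Only log2(N)-bit adders and a 1-bit test are needed, matching the single-cycle latency described above.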
3.2 Fast Fourier Transform
The complex FFT used was provided by Xilinx LogiCORE and generated using Coregen [11]. It allows for a range of options, including a parameterized bit-width of 8 to 24 bits for both input and phase factors, the use of either
block RAM or distributed RAM, the choice of algorithm and rounding used, and the ability to set the output ordering. For the applications envisaged for larger DCTs, it was necessary to set the bit-width to a large size to maximize precision. To this end, a 24-bit signed input was used along with a block floating-point exponent for each 1D transform completed. This exponent field reduces the need to increase the output bit-width after each FFT. Block RAM, a radix-4 block transform, and bit-reversed ordering were also selected. Since the algorithm optimization for computing a real or Hermitian-symmetric FFT using a transform of length N/2, as mentioned in the previous section, was not used due to precision issues, the imaginary input of the FFT was tied to zero. In addition, the FFT core was set up to support both forward and inverse transforms simultaneously, as well as to have a run-time configurable transform length.

Figure 4. Re-ordering pattern in a forward shuffle

Figure 3. Components and dataflow for the inverse DCT transform

3.3 Rebuild rotate

The rebuild rotate component implements the multiplication by 2e^{-j\pi k/2N} for the forward transform and by 0.5e^{j\pi k/2N} for the inverse. These complex exponentials can be converted to a format consisting of sines and cosines by using Euler's formula. For example, the forward factor is equivalent to 2(\cos(-\pi k/2N) + j\sin(-\pi k/2N)). The Coregen Sine Cosine Lookup Table 5.0 component used has a mapping between the input integer angle T and the calculated angle \theta of \theta = 2\pi T / 2^{TWIDTH}. Thus, for \theta = -\pi k/2N, and noting that 2^{TWIDTH} = N, the input angle works out to be -k/4 and k/4 for the forward and inverse transforms respectively. The sine and cosine of the index k are generated and then multiplied by the result of the FFT using a complex multiply with a latency of six cycles. This generates a 48-bit output, of which only the first 24 bits are retained. The overall dataflow of this component is depicted in Figure 5. Note that for the forward DCT transform, only the real output of the complex multiplication is used. The full functionality of the complex multiplier is retained, however, since the inverse transform requires it for the initial step, as shown in Figure 3.

Figure 5. The rebuild component - Forward Transform

4 Results

The hardware was implemented on a Wildstar II Pro by Annapolis Microsystems Inc., connected via PCI to a 3.0 GHz Xeon. The Wildstar II Pro has two Virtex-II Pro FPGAs (V2P70) with a speed grade of -6, of which only one was used. The core design was verified on this platform, and all synthesis and place-and-route results target it. We report three sets of results: one detailing latency, another describing area, and finally a comparison to the Xilinx 1D DCT core.

The algorithm used for the DCT implemented above is highly dependent on the performance of its FFT core. This allows all the work done in optimizing FFT cores (a heavily researched topic) to translate directly into faster DCTs for a fixed latency overhead. This overhead is constant at fourteen cycles, independent of the size of the transform, since the shuffle and rebuild rotate steps happen in a pipelined, streaming manner. The latency of the design for various sizes is given in Table 1.

Component   N=32   N=64   N=256   N=512   N=1024
DCT          234    249     858    1771     3435
FFT          220    235     844    1757     3421

Table 1. DCT and FFT latencies in clock cycles for different transform sizes

Overall latency of the synthesized component (not including data transfer times from the host to the 1D DCT component) scales at approximately the same rate as the latency in clock cycles, since the clock frequency is relatively constant, between 140 and 150 MHz. The transform time for various sizes of N, including the full time to transfer data to and from the component from BRAM, is shown in Figure 6.

Figure 6. DCT transform time in µs

The area figure for the implemented design is significantly larger than the DCT core area alone, since there is a fixed amount of overhead required by the Annapolis interface and by the controller that manages the flow of data between the host and the component on the board. The relationship between area (in LUTs) and transform length is depicted in Figure 7.

Figure 7. Design area

In addition to these LUTs, embedded multipliers and BRAMs are used as well. These scale with transform size as published in the datasheets for the FFT [11] and the Sine Cosine Lookup Table [12]. The requirements of the complex multiplier remain constant for a fixed bit-width and do not depend on transform length. The buffering stage also requires BRAMs to store the block of data to be transformed. For the values of N discussed here, the totals range from 24 embedded multipliers and 8 BRAM blocks for a 32-point transform to 48 multipliers and 14 BRAM blocks for a 1024-point transform.
Although the authors have not been able to discover many implementations of larger DCT transforms, Xilinx does produce an 8 to 32 point DCT core that uses distributed arithmetic, and Leong [6] uses systolic arrays. For our implementation, it should be noted that a 32-point transform is not supported by the radix-4 FFT, so a radix-2 implementation, which requires fewer resources, was used. A clearer comparison between our implementation and the Xilinx core may be drawn from Table 2. A comparison against Leong's implementation could not be made, since insufficient size and performance numbers were available. Note that our implementations go up to 1024 points, while the other cores are 32 points at most.
5 Conclusions and Future Work
The DCT algorithm originally presented by Makhoul [7] was implemented on an Annapolis Wildstar II Pro, and the results discussed here show it to
Component      Bit-width    N     Registers   LUTs    Latency
DCT - Xilinx       24       32      13978     10588        32
DCT - Ours         24       32       8223      4584       234
DCT - Ours         24      256      10957      6695       858
DCT - Ours         24     1024      11486      7119      3435

Table 2. Area and latency comparison of our implementation against the Xilinx DCT core
have various attractive properties for large transform sizes. The fixed latency overhead when compared to a similarly sized FFT is a strong reason to use this method, since it means that fast FFT cores developed by others can be easily integrated into this design. Future improvements in FFT performance on reconfigurable hardware are likely, given the proliferation of such cores and the research effort focused on them. Another strength of our approach is its relatively low area requirement. The distributed arithmetic (DA) approach used by others cannot be expanded to large DCTs due to size constraints, but even for 1024-point transforms, the implementations discussed here consume only around 30 percent of the resources of an XC2VP70-6 FPGA. For small transform sizes requiring high performance, where area is not an issue, the highly parallelizable matrix multiplication approach works best; for large transform sizes, our implementation is clearly superior. This implementation thus represents another point on the area-speed tradeoff curve.

The work done here was accomplished as a preliminary step towards speeding up the minimum L^p-norm [3] phase unwrapping technique, often used for phase-related data because of the high quality of its results. The next step is to apply the 1D transform to compute a 2D DCT on image data and to observe the performance and quality of the results. Further analysis into minimizing the size of the FFT for real or symmetric data will also be carried out, and its effects on accuracy gauged.
6 Acknowledgements

This work was supported in part by CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation (award number EEC-9986821).
References

[1] 4DSP Inc. IEEE-754 compliant floating-point FFT core for FPGA. http://www.4dsp.com/fft.htm, Last accessed March 2007.
[2] Y. Chan and W. Siu. On the realization of discrete cosine transform using the distributed arithmetic. IEEE Transactions on Circuits and Systems, 39(9):705–712, Sept. 1992.
[3] D. C. Ghiglia and M. D. Pritt. Two-Dimensional Phase Unwrapping: Theory, Algorithms and Software. Wiley InterScience, New York, NY, 1998.
[4] A. Hull and W. Jenkins. Preconditioned conjugate gradient methods for adaptive filtering. In IEEE International Symposium on Circuits and Systems, pages 540–543, June 1991.
[5] K. Bukhari, G. Kuzmanov, and S. Vassiliadis. DCT and IDCT implementations on different FPGA technologies. In Program for Research on Integrated Systems and Circuits (ProRISC), pages 232–235, November 2002.
[6] M. P. Leong and P. H. W. Leong. A variable-radix digit-serial design methodology and its application to the discrete cosine transform. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(1):90–104, Feb. 2003.
[7] J. Makhoul. A fast cosine transform in one and two dimensions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1):27–34, February 1980.
[8] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.
[9] R. Woods and D. T. J. Heron. Applying an XC6200 to real-time image processing. IEEE Design and Test of Computers, 15(1):30–38, Jan–Mar 1998.
[10] Xilinx Inc. 1-D discrete cosine transform (DCT) v2.1. http://www.xilinx.com/ipcenter/catalog/logicore/docs/da_1d_dct.pdf, Last accessed March 2007.
[11] Xilinx Inc. Fast Fourier Transform v3.2. http://www.xilinx.com/ipcenter/catalog/logicore/docs/xfft.pdf, Last accessed March 2007.
[12] Xilinx Inc. Sine-Cosine Look-Up Table v5.0. http://www.xilinx.com/ipcenter/catalog/logicore/docs/sincos.pdf, Last accessed March 2007.