Annihilation-Reordering Look-Ahead Pipelined CORDIC Based RLS Adaptive Filters and Their Application to Adaptive Beamforming

Jun Ma, Keshab K. Parhi, and Ed F. Deprettere

Abstract
The novel annihilation-reordering look-ahead technique is proposed as an attractive technique for pipelining of Givens rotation (or CORDIC) based adaptive filters. Unlike the existing relaxed look-ahead, the annihilation-reordering look-ahead does not depend on the statistical properties of the input samples. It is an exact look-ahead and is based on CORDIC arithmetic, which is known to be numerically stable, whereas the conventional look-ahead is based on multiply-add arithmetic. The annihilation-reordering look-ahead technique transforms an orthogonal sequential adaptive filtering algorithm into an equivalent orthogonal concurrent one by creating additional concurrency in the algorithm. Parallelism in the transformed algorithm is explored, and different implementation styles including pipelining, block processing, and incremental block processing are presented. Their complexities are also studied and compared. The annihilation-reordering look-ahead is employed to develop fine-grain pipelined QR decomposition based RLS adaptive filters. Both implicit and explicit weight extraction algorithms are considered. The proposed pipelined architectures can be operated at arbitrarily high sample rates without degrading the filter convergence behavior. Stability under finite-precision arithmetic is studied and proven for the proposed architectures. The pipelined CORDIC based RLS adaptive filters are then employed to develop high-speed linearly constrained minimum variance (LCMV) adaptive beamforming algorithms. Both the QR decomposition based minimum variance distortionless response (MVDR) realization and the generalized sidelobe canceller (GSC) realization are presented. The complexities of the pipelined architectures are analyzed and compared. The proposed architectures can be operated at arbitrarily high sample rates and consist only of Givens rotations, which can be scheduled onto CORDIC arithmetic based processors.
EDICS: 5-VLSI / 5-PARA, 5-PARI. Contact author: Jun Ma, Department of Electrical and Computer Engineering, University of Minnesota, 200 Union St. S.E., Minneapolis, MN 55455. Phone: (612) 626-7217. Fax: (612) 625-4583. Email: junma at ece.umn.edu.
I. Introduction

[Footnote: J. Ma and K.K. Parhi are with the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, MN 55455. E.F. Deprettere is with the Department of Electrical Engineering at the Delft University of Technology, 2628 CD Delft, The Netherlands.]

The digital signal processing (DSP) technology has been and continues to be driven by the progress in VLSI circuit technology, by the increasing bandwidth requirements of various applications such as multimedia and wireless communications, and more recently by the desire to reduce the overall power consumed by the target applications. One of the important ways to design efficient algorithms for high-speed/low-power applications is through pipelining [1]-[2] and parallel processing [3]. Look-ahead techniques for pipelining of recursive fixed-coefficient filters have been proposed [2], [4]-[5] and have been successfully applied to two-dimensional recursive filtering [6], dynamic programming [7]-[9], algorithms with quantizer loops [10], finite state machines [8], and Huffman decoders [11]. Relaxed look-ahead techniques for pipelining of adaptive filters have been proposed [12], and have been successfully applied to LMS adaptive filters [13], the stochastic gradient lattice filter [14], the adaptive differential vector quantizer
[15], and the adaptive differential pulse code modulation codec [16]. Both look-ahead and relaxed look-ahead techniques are based on multiply-add arithmetic. In this paper, we propose novel annihilation-reordering look-ahead techniques for pipelining of Givens rotation or CORDIC arithmetic based adaptive filters, which have been successfully applied to the QR decomposition based recursive least squares (QRD-RLS) algorithm [17], the adaptive inverse QR algorithm [18], and the QR decomposition based minimum variance distortionless response (MVDR-QR) algorithm [19]. Recursive least squares (RLS) [20] based adaptive filters have wide applications in channel equalization, voiceband modems, high-definition TV (HDTV), digital audio broadcast (DAB) systems, beamforming, and speech and image processing. Historically, least mean squares (LMS) based adaptive filters have been preferred in practical applications due to their simplicity and ease of implementation. A limitation of the LMS algorithm is its very slow convergence rate. The convergence of the LMS algorithm is also very sensitive to the eigenvalue spread of the correlation matrix of the input data. In applications such as HDTV equalizers and DAB systems, only a limited number of data samples is available, and an LMS based equalizer may not be able to converge. The convergence of the RLS algorithm is an order of magnitude faster than that of the LMS algorithm, but its complexity is an order of magnitude higher. However, with rapid advances in scaled VLSI technologies, it is possible to implement RLS adaptive filters on single chips, which will make them attractive due to their rapid convergence behavior.
The QR decomposition based RLS (QRD-RLS) algorithm [20], also referred to as the Givens rotation or CORDIC based RLS algorithm in this paper, is the most promising RLS algorithm since it possesses desirable properties for VLSI implementations such as regularity and good finite word-length behavior, and can be mapped onto CORDIC arithmetic based processors [21]-[24]. The QRD-RLS algorithm can be summarized as follows. At each sample time instance n, evaluate a residual error:
$$e(n) = y(n) - u^T(n)\, w(n), \qquad (1)$$
where u(n) and y(n) denote the p-element vector of signal samples and the reference signal at time instance n, respectively, and w(n) is the p-element vector of weights which minimize the quantity
$$\xi(n) = \left\| \Lambda^{1/2}(n)\, e(n) \right\|^2 = \left\| \Lambda^{1/2}(n) \left( y(n) - A(n)\, w(n) \right) \right\|^2, \qquad (2)$$
where y(n) = [y(1), ..., y(n)]^T denotes the sequence of all reference signal samples obtained up to time instance n, A(n) = [u(1), u(2), ..., u(n)]^T is the input data matrix, and Λ(n) = diag[λ^{n-1}, ..., λ, 1] is the diagonal matrix of forgetting factors. Here, we assume that all the data are real. The extension to the complex case does not present any particular difficulties. The optimum weight vector w_ls of the QRD-RLS solution can be obtained by solving the following equation
$$R(n)\, w_{ls}(n) = p(n), \qquad (3)$$
where R(n) and p(n) are a p-by-p matrix and a p-by-1 vector, respectively, obtained by applying a QR decomposition to the weighted data matrix Λ^{1/2}(n)A(n) and the weighted reference vector Λ^{1/2}(n)y(n), respectively. R(n) is usually chosen to be an upper triangular matrix. In practice, the QR decomposition is implemented in a recursive manner. With each incoming data sample set, a new row u^T(n) is appended to the data matrix A(n-1) to yield A(n). An orthogonal transformation matrix Q(n) is determined as the product of p Givens rotation matrices to null the last row of A(n). Thus the triangular matrix R(n-1) gets updated to R(n). The determined matrix Q(n) is then used to update p(n-1) to p(n). The QR
update procedure can be described by the following equation
$$\begin{bmatrix} R(n) & p(n) \\ 0_p^T & \nu(n) \end{bmatrix} = Q(n) \begin{bmatrix} \lambda^{1/2} R(n-1) & \lambda^{1/2} p(n-1) \\ u^T(n) & y(n) \end{bmatrix}. \qquad (4)$$
A signal flow graph (SFG) representation of the QR update procedure is shown in Fig. 1. In this figure, the circle and square cells denote CORDIC operations, with circle cells operating in vectoring mode and square cells operating in rotating mode. The functionality of the recursive update with M feedback delays (M = 1 for the triangular update) is shown at the bottom of Fig. 1. Here c and s denote cos θ and sin θ, respectively, chosen to annihilate x_1(n). The determined rotation angle θ is then applied to rotate the second column vector, which consists of λ^{M/2} r_2(n-M) and x_2(n).
[Figure: triangular SFG with boundary (circle) and internal (square) cells, inputs u_1(n), ..., u_4(n) and y(n), rotation angles θ_1, ..., θ_4, and delays D (MD in the general case). The boundary/internal cell pair computes]
$$\begin{bmatrix} r_1(n) & r_2(n) \\ 0 & x_2'(n) \end{bmatrix} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} \lambda^{M/2}\, r_1(n-M) & \lambda^{M/2}\, r_2(n-M) \\ x_1(n) & x_2(n) \end{bmatrix}$$
Fig. 1. The signal flow graph representation of the QR update procedure.
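The recursive update (4) can be sketched numerically. The following Python sketch is our own illustration (function names `givens` and `qr_update` are not from the paper), using floating-point Givens rotations in place of actual CORDIC iterations; it rotates the new row [u^T(n), y(n)] into the weighted triangular factor exactly as the boundary and internal cells of Fig. 1 do:

```python
import numpy as np

def givens(a, b):
    """Vectoring mode: return (c, s) with [c s; -s c] [a; b]^T = [hypot(a,b); 0]^T."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def qr_update(R, p, u, y, lam=0.99):
    """One QR update, eq. (4): rotate the new row [u^T(n), y(n)] into
    [lam^(1/2) R(n-1), lam^(1/2) p(n-1)] using p Givens rotations."""
    R, p = np.sqrt(lam) * R, np.sqrt(lam) * p   # apply the forgetting factor
    u = np.array(u, dtype=float)
    y = float(y)
    for i in range(len(u)):
        c, s = givens(R[i, i], u[i])            # boundary cell: annihilate u[i]
        # rotate row i of R against the data row (internal cells)
        R[i, i:], u[i:] = c * R[i, i:] + s * u[i:], -s * R[i, i:] + c * u[i:]
        # rotate the reference column [p; y] with the same angle
        p[i], y = c * p[i] + s * y, -s * p[i] + c * y
    return R, p
```

After a stream of updates, solving R(n) w = p(n) recovers the exponentially weighted least-squares weights of eq. (3).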
In practice, there are two types of QRD-RLS algorithms. One is the implicit weight extraction based RLS algorithms, which are found useful in applications such as adaptive beamforming. In these algorithms, the residual error e(n) is obtained without the explicit computation of the weight vector w(n). A popular implicit weight extraction algorithm is due to McWhirter et al. [25]. The other is the explicit weight extraction based RLS algorithms, which are found useful in applications such as channel equalization. One such algorithm is due to Gentleman and Kung [26]; it involves a triangular update part and a linear array for triangular back-solving. The linear array part does not make use of Givens rotations and thus cannot be efficiently combined with the triangular update part. To overcome this problem, alternative QRD-RLS algorithms with inverse updating have been developed to achieve explicit weight extraction while also making use of Givens rotations. A typical structure is the double-triangular type adaptive inverse QR algorithm [27]-[28]. This algorithm performs a QR update in an upper triangular matrix and an inverse QR update for weight extraction in a lower triangular matrix. Both implicit and explicit weight extraction based QRD-RLS algorithms can be easily pipelined at the cell level (also referred to as coarse-grain pipelining). However, the speed or sample rate of the algorithms is limited by the recursive operations in the individual cells, as shown in Fig. 1. In many applications, such as image coding and beamforming, very high data rates are required, and the sequential QRD-RLS algorithms may not be able to operate at such high data rates. In this paper, we create parallelism in the QRD-RLS algorithms and exploit it for pipelining at a finer level such as the bit or multi-bit level (also referred to as fine-grain pipelining). Notice
that apart from being used to increase speed, pipelining can also be used to reduce power dissipation in low to moderate speed applications [29]. To exploit the parallelism and increase the speed of the QRD-RLS, look-ahead techniques or block processing techniques can be applied. The look-ahead techniques and the so-called STAR rotation have been used in [30] to allow fine-grain pipelining with little hardware overhead. However, this is achieved at the cost of degraded filtering performance due to the approximations in the algorithms. Block processing was used to speed up the QRD-RLS in [31], with large hardware overhead. Both of these approaches are based on multiply-add arithmetic. If one insists on not using multiply-add arithmetic for the implementation, there is no trivial extension of the look-ahead technique to the QRD-RLS algorithm. There are other fast QRD-RLS algorithms, which are computationally more efficient than the original algorithm [32]-[38]. Square-root free forms of QRD-RLS are presented in [32]-[36]. A unified approach to the square-root QRD-RLS algorithm is presented in [33]. A low-complexity square-root free algorithm is developed in [34]. In [35], a scaled version of the fast Givens rotation [32] is developed that prevents overflow and underflow. Recently, a division as well as square-root free algorithm has been proposed in [37]. In [38], a fast QRD-RLS algorithm based on Givens rotations was introduced. However, all these fast algorithms suffer the same pipelining difficulty as the QRD-RLS algorithm, i.e., they cannot be pipelined at the fine-grain level. In this paper, we propose a novel annihilation-reordering look-ahead technique to achieve fine-grain pipelining in QRD-RLS adaptive filters. It is an exact look-ahead and is based on CORDIC arithmetic. One of the nice properties of this technique is that it can transform an orthogonal sequential recursive DSP algorithm into an equivalent orthogonal concurrent one by creating additional concurrency in the algorithm.
The resulting transformed algorithm possesses pipelinability, stability (if the original one is stable), and good finite word-length behavior, which are attractive for VLSI implementations. This paper is organized as follows. The annihilation-reordering look-ahead technique is presented in section II. The derivation of pipelined CORDIC based RLS adaptive filters using the proposed look-ahead technique is presented in section III. Section IV presents the application of pipelined RLS adaptive filters to adaptive beamforming. Finally, section V draws conclusions and discusses future research directions. Appendix A and appendix B provide the derivation and proof of some key formulas presented in the paper.

II. The Annihilation-Reordering Look-Ahead Technique
In this section, we introduce the annihilation-reordering look-ahead technique as an exact look-ahead based on CORDIC arithmetic. Similar to the traditional look-ahead, it transforms a sequential recursive algorithm into an equivalent concurrent one by creating additional parallelism in the algorithm. It is based on CORDIC arithmetic and is suitable for pipelining of Givens rotation based adaptive filtering algorithms. The annihilation-reordering look-ahead technique can be derived from two points of view. One is the block processing point of view, and the other is the iteration point of view. The former is practical in real applications, while the latter shows the connection with the traditional look-ahead technique. In our derivation, the forgetting factor is omitted for clarity. This section is organized as follows. The derivations of the annihilation-reordering look-ahead through block processing and through iteration are presented in section II-A and section II-B, respectively. The relationship with the conventional multiply-add look-ahead is shown in section II-C. Section II-D explores the parallelism in the proposed look-ahead transformed algorithm. Different implementation styles are then presented in section II-E. Finally, a
lemma for stability invariance is stated and proved in section II-F.

A. Look-Ahead Through Block Processing

In this section, we derive the annihilation-reordering look-ahead transformation for Givens rotation based adaptive filtering algorithms via a block-processing formulation. The annihilation-reordering look-ahead technique can be summarized as the following two-step procedure.
1. Formulate the block updating form of the recursive operations with block size equal to the pipelining level M.
2. Choose a sequence of Givens rotations to perform the updating in such a way that it first operates on the block data and then updates the recursive variables.
The aim is to reduce the computational complexity of a block update inside the feedback loop to the same complexity as a single-step update. Assume that a 3-time speedup is desired for the QR update shown in Fig. 1. Consider the block update form of the QR update procedure shown in Fig. 2 with block size equal to the desired pipelining level 3. A sequence of Givens rotations is then chosen to annihilate the block data u(n-2), u(n-1), and u(n), and update R(n-3) to R(n). In this figure, the traditional sequential update operation is used. The sample data is annihilated in a row-by-row manner and the diagonal r elements are involved in each update. The signal flow graph of a typical r element update is shown in Fig. 3. It can be seen that the number of rotations inside the feedback loop increases linearly with the number of delay elements in the loop. Therefore, there is no net improvement in the sample or clock speed.
[Figure: the block data rows u(n-2), u(n-1), u(n) are annihilated row by row by the rotation sequence Q_1, ..., Q_5, updating the triangular array from R(n-3) to R(n).]
Fig. 2. QRD-RLS block update.
[Figure: in (b), the three CORDIC operations on u(n-2), u(n-1), u(n) all sit inside the feedback loop from r(n-3) to r(n).]
Fig. 3. (a) The sequential QR update procedure. (b) The block update procedure with block size 3.
The novel annihilation-reordering look-ahead technique is illustrated in Fig. 4. In this figure, the sample data is annihilated in a column-by-column manner and the diagonal r elements are updated only at the last step. The signal flow graph of a typical r element update is shown in Fig. 5. It can be seen that, without increasing the loop computational complexity, the number of delay elements in the feedback loop is increased from one to three. These three delay elements can then be redistributed around the loop using the retiming technique [39] to achieve fine-grain pipelining by 3 levels. The two CORDIC units outside the feedback loop are the
computation overhead due to the look-ahead transformation. Since they are feed-forward, cutset pipelining [40] can be applied to speed them up. Furthermore, the overhead CORDIC units outside the loop can be arranged in a tree structure to exploit the parallelism and reduce the overall latency.
[Figure: the block data is annihilated column by column by the rotation sequence Q_1, ..., Q_5; the diagonal r elements are touched only by the final rotations.]
Fig. 4. QRD-RLS annihilation-reordering look-ahead.
[Figure: in (b), two CORDIC units pre-combine u(n-2), u(n-1), u(n) feed-forward; a single CORDIC unit remains in the loop, which now holds three delays.]
Fig. 5. (a) A sequential QR update procedure. (b) The 3-level pipelined architecture using annihilation-reordering look-ahead.
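For the scalar boundary cell of Fig. 5, the equivalence of the two annihilation orders can be checked directly. The sketch below is our own illustration (the vectoring-mode CORDIC update is idealized as a `hypot`): the sequential order places all three rotations inside the feedback loop, while the reordered version combines the block data feed-forward and leaves a single rotation in the loop.

```python
import numpy as np

def rot_update(r, u):
    """Vectoring-mode Givens/CORDIC update of a boundary-cell element."""
    return np.hypot(r, u)

def sequential(r_old, u):
    """Row-by-row annihilation (Fig. 3): every rotation sits in the loop."""
    r = r_old
    for uk in u:
        r = rot_update(r, uk)
    return r

def reordered(r_old, u):
    """Annihilation order (3,4), (2,3), (1,2): the block data u(n-2),
    u(n-1), u(n) is combined feed-forward, and only the final rotation
    touches the recursive variable r(n-3) (Fig. 5(b))."""
    v = rot_update(u[1], u[2])      # feed-forward: combine u(n-1), u(n)
    v = rot_update(u[0], v)         # feed-forward: combine with u(n-2)
    return rot_update(r_old, v)     # the only rotation in the feedback loop
```

Both orders compute the same magnitude sqrt(r² + u(n-2)² + u(n-1)² + u(n)²), which is why the reordering is an exact look-ahead.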
B. Look-Ahead Through Iteration

Alternatively, the annihilation-reordering look-ahead can be derived through matrix iterations. From Fig. 1 and equation (4), the basic QR recursion is given as follows
$$\begin{bmatrix} r(n) \\ 0 \end{bmatrix} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} r(n-1) \\ u(n) \end{bmatrix}, \qquad (5)$$
where r(n) and u(n) correspond to the boundary element and the input data to the boundary element in Fig. 1, respectively. A direct look-ahead by iterating equation (5) two times can be performed by the following embedding procedure. Equation (5) can be rewritten as
$$\begin{bmatrix} r(n) \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & 0 & s_1 \\ 0 & 1 & 0 \\ -s_1 & 0 & c_1 \end{bmatrix} \begin{bmatrix} r(n-1) \\ 0 \\ u(n) \end{bmatrix}. \qquad (6)$$
From equation (5), we also have
$$\begin{bmatrix} r(n-1) \\ 0 \\ u(n) \end{bmatrix} = \begin{bmatrix} c_2 & s_2 & 0 \\ -s_2 & c_2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}. \qquad (7)$$
Substituting equation (7) into equation (6) leads to
$$\begin{bmatrix} r(n) \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & 0 & s_1 \\ 0 & 1 & 0 \\ -s_1 & 0 & c_1 \end{bmatrix} \begin{bmatrix} c_2 & s_2 & 0 \\ -s_2 & c_2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}. \qquad (8)$$
This is the one-step iterated version of equation (5). Iterating (8) once more leads to the following two-step iterated version of (5):
$$\begin{bmatrix} r(n) \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & 0 & 0 & s_1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -s_1 & 0 & 0 & c_1 \end{bmatrix} \begin{bmatrix} c_2 & 0 & s_2 & 0 \\ 0 & 1 & 0 & 0 \\ -s_2 & 0 & c_2 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} c_3 & s_3 & 0 & 0 \\ -s_3 & c_3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r(n-3) \\ u(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}. \qquad (9)$$
The signal flow graph of (9) is shown in Fig. 3. Notice that this transformation does not help much, since all three CORDIC operations involve updating the r element. Although the feedback loop contains three delays, the computation time in the loop is also increased by a factor of three. Therefore, the overall sample rate remains unaltered. In order to increase the sample rate, the following transformation is considered. Notice that in (9), instead of vectoring the input vector in the order (1, 2), (1, 3), and (1, 4), we could apply the Givens matrices in a different order so that the input vector is annihilated in the order (3, 4), (2, 3), and (1, 2), where the notation (i, j) represents a pair of row indexes of the input matrix in (9). For example, (3, 4) denotes that the Givens matrix will operate on the input vector [u(n-1), u(n)]^T. According to this scheme, the input samples are pre-processed and the r elements are updated only at the last step. This leads to the following 3-level annihilation-reordering look-ahead transformation for CORDIC based RLS adaptive filters:
$$\begin{bmatrix} r(n) \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & s_1 & 0 & 0 \\ -s_1 & c_1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & c_2 & s_2 & 0 \\ 0 & -s_2 & c_2 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & c_3 & s_3 \\ 0 & 0 & -s_3 & c_3 \end{bmatrix} \begin{bmatrix} r(n-3) \\ u(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}.$$
The SFG of the above transformation is shown in Fig. 5, which is the same as the one derived from the block processing point of view. Therefore, without increasing the loop computation time, we increase the number of delays in the feedback loop from one delay element to three delay elements. These three delay elements can then be redistributed around the loop to achieve pipelining by 3 levels. The above derivation is similar to the traditional multiply-add look-ahead [2] procedure in the sense that both perform look-ahead through iteration. However, it can be seen that the block processing derivation in section II-A is simpler and more practical in real applications.

C. Relationship with Multiply-Add Look-Ahead

It is worth mentioning here that, for the first-order case, there exists a strong similarity between the transformed flow graphs of the annihilation-reordering look-ahead and the multiply-add look-ahead. Consider the first-order IIR digital filter described by the following equation
$$y(n) = a\, y(n-1) + u(n). \qquad (10)$$
Fig. 6. (a) A first-order IIR digital filter. (b) The 3-level pipelined architecture using multiply-add look-ahead.
The SFG of (10) is shown in Fig. 6(a). After applying the multiply-add look-ahead transformation with pipelining level 3, the resulting equation is given as
$$y(n) = a^3\, y(n-3) + a^2\, u(n-2) + a\, u(n-1) + u(n). \qquad (11)$$
The SFG of (11) is shown in Fig. 6(b). The filter sample rate can be increased by a factor of 3 after redistributing the 3 delay elements in the feedback loop.
Fig. 7. (a) A sequential QR update procedure. (b) The 3-level pipelined architecture using annihilation-reordering look-ahead.
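The analogy between eqs. (10) and (11) can be made concrete. The sketch below is our own illustrative code (not from the paper); it checks that the 3-level multiply-add look-ahead recursion produces the same output sequence as the original first-order filter, with zero initial state and u(k) = 0 for k < 0:

```python
def iir_sequential(a, u):
    """First-order IIR, eq. (10): y(n) = a*y(n-1) + u(n)."""
    y, out = 0.0, []
    for un in u:
        y = a * y + un
        out.append(y)
    return out

def iir_lookahead3(a, u):
    """3-level multiply-add look-ahead, eq. (11):
    y(n) = a^3*y(n-3) + a^2*u(n-2) + a*u(n-1) + u(n)."""
    state = [0.0, 0.0, 0.0]              # three interleaved y(n-3) states
    out = []
    for n, un in enumerate(u):
        fwd = un                         # feed-forward taps (cutset-pipelinable)
        if n >= 1:
            fwd += a * u[n - 1]
        if n >= 2:
            fwd += a * a * u[n - 2]
        y = a ** 3 * state[n % 3] + fwd  # the loop now holds three delays
        state[n % 3] = y
        out.append(y)
    return out
```

The three entries of `state` correspond to the three delay elements that retiming redistributes around the loop in Fig. 6(b).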
On the other hand, Fig. 5 can be redrawn as shown in Fig. 7. Comparing Fig. 7 to Fig. 6, it is seen that the two graphs are essentially the same except that the multiply-add units are replaced by CORDIC units.

D. Parallelism in Annihilation-Reordering Look-Ahead

In this section, we show explicitly how the annihilation-reordering look-ahead technique explores the parallelism in the recursive algorithm and creates additional concurrency.
Fig. 8. The dependence graph of the sequential QR update procedure.
Consider the sequential QR update procedure shown in Fig. 5(a). Its dependence graph (DG) representation is shown in Fig. 8. In this figure, the little circles denote CORDIC operations. The arrows denote dependencies between signals. The k direction is the direction of increasing time. The arrows along the k direction represent the dependencies between iterations. For example, r(n+1) is dependent on r(n), r(n) is dependent on r(n-1), and so on. As we mentioned in section I, it is this kind of dependency that limits the speed of the QR update and thus limits the sample rate. The annihilation-reordering look-ahead actually breaks this dependency and transforms the original DG into an equivalent DG which consists of M independent sub-DGs, where M is the desired pipelining level. Since these M sub-DGs are independent, they can be executed in parallel. Therefore, the M independent sub-DGs are the additional concurrency created by the look-ahead transformation. For M = 3, the 3 sub-DGs DG-I, DG-II, and DG-III are shown in Fig. 9. Fig. 9 is the DG representation of Fig. 5(b). From Fig. 9, it is seen that the computation of r(n) is no longer dependent on r(n-1); instead it depends on the r element 3 iterations
Fig. 9. The dependence graph of the pipelined QR update with pipelining level 3.
back, which is r(n-3). Similarly, r(n+1) is dependent on r(n-2), and r(n+2) depends on r(n-1). Therefore, after the look-ahead transformation, the dependency between consecutive iterations is broken down into 3 independent operation sequences, with each sequence having a dependency between every three iterations. For each sub-DG, the dependency which is perpendicular to the k direction does not cause a problem, since an index transformation [41] (which is equivalent to cutset pipelining for an SFG) can be performed to resolve these dependencies. Therefore, 3 independent sequences, which consist of 3² = 9 independent CORDIC operations in total, are created for one iteration. In general, for pipelining level M, M² independent CORDIC operations are created in the algorithm for one iteration. As M increases, infinite parallelism can be created in the algorithm, and thus arbitrarily high sample rates can be achieved.

E. Pipelining and Block Processing Implementations

In this section, we present three concurrent realizations of CORDIC based RLS adaptive filters. They are named pipelining, block processing, and incremental block processing.
E.1 Pipelined Realization

Consider the three sub-DGs in Fig. 9. If they are mapped along the k direction, we obtain the SFG representation. Since the three sub-DGs are independent, they can be mapped onto the same CORDIC operation resources and operated in a pipeline interleaving fashion [2]. This leads to the pipelined realization of CORDIC based adaptive filters shown in Fig. 10. In this figure, the input data samples are processed in a block manner through a tapped delay line as shown in Fig. 11(a). Since consecutive blocks of samples are shift-overlapped, all filtering outputs can be obtained consecutively. The implementation complexity in terms of CORDIC units for the pipelined realization is linear with respect to the pipelining level M, i.e., M = 3 CORDIC units in Fig. 10.
Fig. 10. Pipelined realization with pipelining level 3.
Fig. 11. Serial-to-parallel conversion for (a) pipelining and (b) block processing.
E.2 Block Processing Realization

In block processing, the three sub-DGs are mapped independently along the k direction to obtain the block processing realization shown in Fig. 12. In block realizations, input samples are processed in the form of non-overlapping blocks to generate non-overlapping output samples. The block of multiple inputs is derived from the single serial input by using a serial-to-parallel converter at the input as shown in Fig. 11(b), and the serial output is derived from the block of outputs by a parallel-to-serial converter at the output. The implementation complexity in terms of CORDIC units for the block processing realization is quadratic with respect to the pipelining level M, i.e., M² = 3² = 9 CORDIC units in Fig. 12. To reduce complexity, the incremental block processing technique can be used.
Fig. 12. Block processing realization with block size 3.
E.3 Incremental Block Processing Realization

Consider the annihilation-reordering look-ahead transformed DG shown in Fig. 9. Instead of using u(n+1), u(n), u(n-1), and r(n-2) to obtain r(n+1), r(n+1) can be computed incrementally using u(n+1) and r(n) once r(n) is available. Similarly, r(n+2) can be computed incrementally using u(n+2) and r(n+1) once r(n+1) is available. The DG of the incremental block QR update is shown in Fig. 13. Mapping the DG along the k direction gives the SFG representation of the incremental block processing realization shown in Fig. 14. The implementation complexity in terms of CORDIC units for incremental block processing is linear with respect to the pipelining level M, i.e., 2M - 1 = 2·3 - 1 = 5 CORDIC units in Fig. 14. Notice that the incremental computation parts do not contain feedback loops, thus cutset pipelining can be employed to speed them up.
Fig. 13. The dependence graph of the incremental block QR update with block size 3.
Fig. 14. Incremental block processing realization with block size 3.
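For the scalar boundary cell, the 2M - 1 rotation count of the incremental block realization can be illustrated as follows. This is a sketch under our own idealization of the vectoring-mode CORDIC as a `hypot` update (the function names are ours, not from the paper): per block of M samples, M - 1 rotations combine the block data feed-forward, one rotation advances the loop state, and M - 1 rotations produce the remaining outputs incrementally.

```python
import numpy as np

def rot(r, u):
    """Vectoring-mode CORDIC update of the boundary-cell magnitude."""
    return np.hypot(r, u)

def sequential(r, u):
    """Reference: one rotation per sample inside the feedback loop."""
    out = []
    for uk in u:
        r = rot(r, uk)
        out.append(r)
    return out

def incremental_block(r, u, M=3):
    """Incremental block processing (Figs. 13-14): 2M - 1 rotations per block."""
    out = []
    for k in range(0, len(u), M):
        blk = u[k:k + M]
        ri = r
        for uk in blk[:-1]:          # incremental outputs r(n+1), ..., r(n+M-1)
            ri = rot(ri, uk)
            out.append(ri)
        v = abs(blk[0])              # feed-forward combination of the block data
        for uk in blk[1:]:
            v = rot(v, uk)           # M - 1 rotations, pipelinable as a tree
        r = rot(r, v)                # the single rotation in the feedback loop
        out.append(r)
    return out
```

Both realizations emit the same output stream; the incremental outputs hang off feed-forward paths, so only one rotation remains in the recursive loop.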
Therefore, in terms of the number of CORDIC units used in the implementation, the pipelined realization is better than incremental block processing and block processing, and incremental block processing is better than block processing. In practice, the choice of implementation style depends on the target application and the available hardware resources.

F. Invariance of Bounded Input/Bounded Output

In this section, we show a property of the annihilation-reordering look-ahead transformation. It will be useful in the proof of the stability of the pipelined QRD-RLS algorithm in section III-B.
Lemma 1: Consider the compound CORDIC cell denoted by the dashed circle in Fig. 10. Under finite-precision arithmetic, if each individual CORDIC cell is bounded input and bounded output (BIBO), then the compound CORDIC cell is also BIBO.
Proof: Assume the pipelining level is M. From Fig. 10, the look-ahead transformed compound CORDIC cell consists of a cascade connection of M CORDIC units. Since each of them is BIBO under finite-precision arithmetic, the compound cell is also BIBO.

III. Pipelined CORDIC Based RLS Adaptive Filters
In this section, we apply the annihilation-reordering look-ahead techniques to CORDIC based RLS adaptive filters and derive fine-grain pipelined topologies. We consider RLS with both implicit and explicit weight extraction. This section is organized as follows. The pipelined QRD-RLS with implicit weight extraction is presented in section III-A. Its stability under finite-precision arithmetic is studied and proved in section III-B. Finally, the pipelined adaptive inverse QR algorithm for explicit weight extraction is presented in section III-C.

A. Pipelined QRD-RLS with Implicit Weight Extraction

Consider the QRD-RLS formulation given in equations (1)-(4). After the triangular matrix R(n) and the corresponding vector p(n) are generated, the optimum weight vector w(n) can be obtained by solving equation (3). The residual error e(n) is then computed as
$$e(n) = y(n) - u^T(n)\, R^{-1}(n)\, p(n). \qquad (12)$$
However, for some applications such as adaptive beamforming, this proves to be unnecessary since, in these cases, the residual error e(n) is usually the only variable of interest, and it is not necessary to compute the weight vector w(n) explicitly. In [25], it is shown that the estimation error e(n) may be obtained directly as the product of two variables, the angle-normalized residual α(n) and the likelihood factor γ(n). α(n) and γ(n) are obtained by applying the same orthogonal transformation matrix Q(n) to the vector [p(n-1), y(n)]^T and to the pinning vector π = [0, ..., 0, 1]^T [25]. Therefore, the adaptive recursive least squares algorithm can be summarized as
\[
\begin{bmatrix} R(n) & p(n) & s(n) \\ 0_p^T & \alpha(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{1/2} R(n-1) & \lambda^{1/2} p(n-1) & 0_p \\ u^T(n) & y(n) & 1 \end{bmatrix},
\qquad (13)
\]
where 0_p is the p-by-1 null vector. The SFG representation of the algorithm is shown in Fig. 15, where the problem size p is chosen to be 4. The circle and square cells in Fig. 15 denote CORDIC operations and follow the same notation as in Fig. 1. The circle cell with the letter G inside denotes a Gaussian rotation (or a linear CORDIC operation); its functionality is shown in the figure. Notice that the converting factor cells, which generate the likelihood factor, do not contain recursive operations. In Fig. 15, the recursive operation in each cell limits the throughput of the input samples. To increase the sample rate, the annihilation-reordering look-ahead technique is applied. The recursive updating formula for the QRD-RLS with implicit weight extraction is given in equation (13). Its block updating form with block size M is given as follows
\[
\begin{bmatrix} R(n) & p(n) & s(n) \\ O_{M\times p} & \alpha(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{M/2} R(n-M) & \lambda^{M/2} p(n-M) & 0_p \\ U_M(n) & y_M(n) & \pi_M \end{bmatrix},
\qquad (14)
\]
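A minimal NumPy sketch of the block update (14) may help fix ideas. `block_update` is a hypothetical helper name, `lam` is the forgetting factor, and the rotations are applied column by column here; the annihilation-reordering of Fig. 4 reorders the same rotations for pipelining but produces the same result:

```python
import numpy as np

def block_update(R, p_vec, U, yv, lam=1.0):
    """Block QR update (14): annihilate the M new data rows in U against
    lam^{M/2} R with Givens rotations, column by column."""
    M, n = U.shape
    A = np.zeros((n + M, n + 1))
    A[:n, :n] = lam ** (M / 2) * R
    A[:n, n] = lam ** (M / 2) * p_vec
    A[n:, :n] = U
    A[n:, n] = yv
    for j in range(n):                     # zero column j of every new row
        for i in range(n, n + M):
            a, b = A[j, j], A[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            Aj, Ai = A[j].copy(), A[i].copy()
            A[j] = c * Aj + s * Ai         # rotated R row stays upper triangular
            A[i] = -s * Aj + c * Ai        # new-data row is annihilated
    return A[:n, :n], A[:n, n]

# Demo: with lam = 1 the block update reproduces the batch least-squares fit.
rng = np.random.default_rng(0)
A0, y0 = rng.normal(size=(6, 3)), rng.normal(size=6)
Q0, R = np.linalg.qr(A0)
p0 = Q0.T @ y0
U, yv = rng.normal(size=(2, 3)), rng.normal(size=2)    # M = 2 new samples
R1, p1 = block_update(R, p0, U, yv)
w = np.linalg.solve(R1, p1)
w_ref = np.linalg.lstsq(np.vstack([A0, U]),
                        np.concatenate([y0, yv]), rcond=None)[0]
assert np.allclose(w, w_ref)
```

With \lambda = 1, the updated pair (R, p) yields the same least-squares weight vector as a batch solve over all the data seen so far, which is the exactness property of the annihilation-reordering look-ahead.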
Fig. 15. Signal flow graph representation for recursive least squares minimization.
where U_M(n) is an M-by-p matrix defined as
\[
U_M(n) = [\, u(n-M+1), \ldots, u(n-1), u(n) \,]^T,
\]
and y_M(n) is an M-by-1 vector defined as
\[
y_M(n) = [\, y(n-M+1), \ldots, y(n-1), y(n) \,]^T.
\]
In (14), O_{M\times p} and 0_p denote the M-by-p null matrix and the p-by-1 null vector, respectively, \alpha(n) and \gamma(n) are M-by-1 vectors, and \pi_M is an M-by-1 constant vector defined as \pi_M = [\,0, \ldots, 0, 1\,]^T. The estimation error e(n) can be calculated as the inner product of the angle-normalized residual vector \alpha(n) and the likelihood vector \gamma(n), i.e.
\[
e(n) = \alpha^T(n)\, \gamma(n). \qquad (15)
\]
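The single-update case (M = 1) of this identity, e(n) = \alpha(n)\gamma(n), can be checked numerically. The sketch below (with \lambda = 1 and a hypothetical helper name `qrd_rls_step`) runs the array (13) and compares \alpha\gamma against the directly computed a posteriori residual y(n) - u^T(n)w(n):

```python
import numpy as np

def qrd_rls_step(R, p_vec, u, y, lam=1.0):
    """One update of the array (13): Givens rotations annihilate u^T against
    lam^{1/2} R; returns updated R, p, plus (alpha, gamma) from the last row."""
    n = len(u)
    A = np.zeros((n + 1, n + 2))
    A[:n, :n] = np.sqrt(lam) * R
    A[:n, n] = np.sqrt(lam) * p_vec
    A[n, :n], A[n, n], A[n, n + 1] = u, y, 1.0   # bottom row [u^T(n), y(n), 1]
    for i in range(n):
        a, b = A[i, i], A[n, i]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        Ai, An = A[i].copy(), A[n].copy()
        A[i] = c * Ai + s * An
        A[n] = -s * Ai + c * An
    return A[:n, :n], A[:n, n], A[n, n], A[n, n + 1]

R, pv = np.zeros((2, 2)), np.zeros(2)
for u, y in [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 2.0)]:
    R, pv, alpha, gamma = qrd_rls_step(R, pv, np.array(u), y)
w = np.linalg.solve(R, pv)                  # w(n) = R^{-1}(n) p(n)
res = 2.0 - np.array([1.0, 1.0]) @ w        # a posteriori residual y(n) - u^T(n) w(n)
assert np.isclose(alpha * gamma, res)       # e(n) = alpha(n) * gamma(n), eq. (15)
```

For this data the residual works out to -1/3, and the product \alpha\gamma from the rotation array matches it without ever forming w(n) explicitly.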
The proof is given in Appendix A. We now determine a sequence of Givens rotations, whose product forms the orthogonal transformation matrix Q(n) in (14), to annihilate the block input data matrix U_M(n). The order of the Givens rotations is chosen such that the input data are pre-processed and the block-data update is finished with the same complexity as a single-data update. This procedure is illustrated in detail in Fig. 4. A 3-level fine-grain pipelined QR update topology is shown in Fig. 5. After applying the annihilation-reordering look-ahead, the concurrent QRD-RLS algorithm can be realized using different implementation styles such as pipelining, block processing, and incremental block processing, as discussed in section II-E. In the rest of the paper, we only show the topologies for the pipelined realization; the other realizations can be derived similarly. A fine-grain pipelined implementation, with pipelining level 3, of the CORDIC based QRD-RLS adaptive filter with implicit weight extraction is shown in Fig. 16. In this figure, all cell notations follow those of Fig. 15 except that they are compound versions. The internal structure of each compound cell is shown at the
bottom part of Fig. 16. Compared to Fig. 15, the 3-level pipelined architecture triples the number of CORDIC units and the communication bandwidth, which is linear with respect to the pipelining level. Thus, in general, the total complexity is O(Mp^2/2) CORDIC units per sample time, where p is the input sample size and M is the pipelining level.
Fig. 16. A 3-level fine-grain pipelined CORDIC based implicit weight extraction QRD-RLS adaptive filter architecture.
B. Stability Analysis It is generally recognized that QR decomposition based algorithms have good numerical properties, meaning that they perform acceptably in short word-length environments. This is due to the fact that the algorithms consist of only orthogonal transformations, which leads to inherent stability under finite-precision implementation. From section II-A and section II-B, we see that the annihilation-reordering look-ahead transformation involves only orthogonal transformations and does not alter the orthogonality of the algorithm. This implies that the pipelined algorithms also maintain the good numerical properties. Let us define the stability of the QRD-RLS algorithm in the sense of bounded input/bounded output (BIBO), i.e., under finite-precision arithmetic, if the input signals are bounded, then the output residual error e(n) is also bounded. We have the following result.
Theorem 1: Under finite-precision arithmetic, given a pipelining level M, the M-level fine-grain pipelined CORDIC based RLS adaptive filter algorithm with implicit weight extraction is stable. Proof: In [42], it is shown that for the sequential QRD-RLS algorithm shown in Fig. 15, a CORDIC cell operating with finite-precision arithmetic constitutes a BIBO subsystem of the array. From Fig. 16, the pipelined algorithm has the same architecture as the sequential one except that all CORDIC cells are compound versions. Therefore, by Lemma 1, a compound CORDIC cell operating with finite-precision arithmetic constitutes a BIBO subsystem of the array. Thus, if the desired response y(n) and the input samples u(n) in Fig. 16 are bounded, the quantized value of the input of the final linear CORDIC cell is also bounded, which leads to a bounded residual error e(n). This completes the proof of Theorem 1. The stability of the other CORDIC based RLS adaptive filtering and beamforming algorithms presented in this paper can be proved using the same approach as in Theorem 1 and will not be repeated in the rest of the paper. C. Pipelined QRD-RLS With Explicit Weight Extraction In applications such as channel equalization, RLS based equalization algorithms, e.g., decision-directed schemes [43] and orthogonalized constant modulus algorithms [44], require the explicit availability of the filter weight vector w(n). The standard (Gentleman-Kung type) QRD-RLS update scheme involves two computational steps which cannot be efficiently combined on a pipelined array. To circumvent this difficulty, inverse updating based algorithms have been developed [44]-[46]. Here, we focus on a double-triangular type adaptive inverse QR algorithm [27]. Consider the least squares formulation given in (1)-(4). Define the (p+1)-by-(p+1) upper triangular compound matrix \tilde{R}(n) as
\[
\tilde{R}(n) = \begin{bmatrix} \lambda^{1/2} R(n) & \lambda^{1/2} p(n) \\ 0_p^T & \delta(n) \end{bmatrix},
\]
where \delta(n) is a scalar factor, and R(n), p(n), and 0_p^T are defined as in section I. Using equation (3), \tilde{R}^{-1} is then given as
\[
\tilde{R}^{-1}(n)
= \begin{bmatrix} \lambda^{-1/2} R^{-1}(n) & -R^{-1}(n)\,p(n)/\delta(n) \\ 0_p^T & 1/\delta(n) \end{bmatrix}
= \begin{bmatrix} \lambda^{-1/2} R^{-1}(n) & -w(n)/\delta(n) \\ 0_p^T & 1/\delta(n) \end{bmatrix}.
\]
Notice that \tilde{R}^{-1} remains upper triangular, and that the optimal weight vector w(n) appears explicitly in the rightmost column of \tilde{R}^{-1} up to the scaling factor -1/\delta. Now, consider the QR update of the upper triangular compound matrix \tilde{R}. From (4), we have
\[
\begin{bmatrix} \tilde{R}(n) \\ 0_{p+1}^T \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-1) \\ \tilde{u}^T(n) \end{bmatrix},
\qquad (16)
\]
where \tilde{u}^T(n) = [\, u^T(n), y(n) \,], and \tilde{Q}(n) is determined as the product of (p+1) Givens rotation matrices which null the input sample vector \tilde{u}^T(n) and update the matrix \tilde{R}. Extending the (p+2)-by-(p+1) matrix on the right hand side of (16) to a (p+2)-by-(p+2) square matrix, by appending the extra column vector [\, 0_{p+1}^T, 1 \,]^T on its right, leads to
\[
\begin{bmatrix} \tilde{R}(n) & v(n) \\ 0_{p+1}^T & d(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-1) & 0_{p+1} \\ \tilde{u}^T(n) & 1 \end{bmatrix},
\qquad (17)
\]
where the vector v(n) and the scalar d(n) correspond to the QR update of the vector 0_{p+1} and the scalar 1. Inverting the matrices on both sides of equation (17) (the matrix is non-singular since \tilde{R} is non-singular) and noticing that Q^{-1} = Q^T leads to
\[
\begin{bmatrix} \tilde{R}^{-1}(n) & v'(n) \\ 0_{p+1}^T & d'(n) \end{bmatrix}
= \begin{bmatrix} \tilde{R}^{-1}(n-1) & 0_{p+1} \\ -\tilde{u}^T(n)\,\tilde{R}^{-1}(n-1) & 1 \end{bmatrix}
\tilde{Q}^T(n).
\qquad (18)
\]
Taking the transpose of both sides of (18), we obtain
\[
\begin{bmatrix} \tilde{R}^{-T}(n) & 0_{p+1} \\ v'^T(n) & d'(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}^{-T}(n-1) & -\tilde{R}^{-T}(n-1)\,\tilde{u}(n) \\ 0_{p+1}^T & 1 \end{bmatrix}.
\]
Thus, we have the following inverse updating formula
\[
\begin{bmatrix} \tilde{R}^{-T}(n) \\ v'^T(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}^{-T}(n-1) \\ 0_{p+1}^T \end{bmatrix}.
\qquad (19)
\]
Notice that the orthogonal transformation matrix \tilde{Q}(n), which updates the upper triangular matrix \tilde{R} in (16), also updates the lower triangular matrix \tilde{R}^{-T} in (19). Thus, the double-triangular adaptive inverse QR algorithm can be summarized as follows
\[
\begin{bmatrix} \tilde{R}(n) & \tilde{R}^{-T}(n) \\ 0_{p+1}^T & v'^T(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-1) & \tilde{R}^{-T}(n-1) \\ \tilde{u}^T(n) & 0_{p+1}^T \end{bmatrix}.
\qquad (20)
\]
The important point lies in noticing that the scaled weight vector -w/\delta sits explicitly in the bottom row of the lower triangular matrix \tilde{R}^{-T}, or equivalently in the rightmost column of the upper triangular matrix \tilde{R}^{-1}, as shown before. Therefore, we can achieve parallel weight extraction by taking the last-row elements of \tilde{R}^{-T}(n) and multiplying them by the scaling factor \delta(n) sitting in the lower right corner of the upper triangular matrix \tilde{R}(n). An efficient SFG representation of the CORDIC based double-triangular adaptive inverse QR algorithm is shown in Fig. 17. In this figure, the notation follows that of Fig. 1. The operation for updating the r^{-1} elements is shown in the bottom part of Fig. 17. The residual error e(n) is computed according to (1) as shown in the figure. The element in the lower right corner of the lower triangular matrix \tilde{R}^{-T} contains the value 1/\delta(n) and is not shown in the figure. This algorithm has complexity O(p^2) Givens rotations per sample period, where p is the size of the array. We now apply the annihilation-reordering look-ahead technique to derive a concurrent adaptive inverse QR algorithm for high-speed CORDIC based parallel RLS weight extraction. The block updating form with block size M of the sequential updating equation (20) is given as follows
\[
\begin{bmatrix} \tilde{R}(n) & \tilde{R}^{-T}(n) \\ O_{M\times(p+1)} & V(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-M) & \tilde{R}^{-T}(n-M) \\ \tilde{U}_M^T(n) & O_{M\times(p+1)} \end{bmatrix},
\qquad (21)
\]
where \tilde{U}_M^T(n) is an M-by-(p+1) matrix defined as
\[
\tilde{U}_M^T(n) = [\, \tilde{u}(n-M+1), \ldots, \tilde{u}(n-1), \tilde{u}(n) \,]^T,
\]
O_{M\times(p+1)} denotes the M-by-(p+1) null matrix, and V(n) is an M-by-(p+1) matrix.
The derivation of (21) essentially follows the algebraic manipulation in (16)-(20), provided that we start from the following block update equation
\[
\begin{bmatrix} \tilde{R}(n) \\ O_{M\times(p+1)} \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-M) \\ \tilde{U}_M^T(n) \end{bmatrix}.
\qquad (22)
\]
Notice that the \tilde{Q}(n) matrix in (22) is different from the \tilde{Q}(n) in (16), though we use the same notation here. Applying the annihilation-reordering procedure described in Fig. 4, we obtain the concurrent architecture shown in Fig. 5. A complete 3-level fine-grain pipelined topology for CORDIC based QRD-RLS with explicit parallel weight
Fig. 17. Signal flow graph representation of the double-triangular type adaptive inverse QR algorithm.
extraction is shown in Fig. 18. In this figure, all cell notations follow those of Fig. 17 except that some of them are compound versions. The internal structure of each compound cell is shown in the bottom part of Fig. 18. Compared to Fig. 17, the number of CORDIC units and the communication bandwidth are tripled, which is linear with respect to the pipelining level. In general, the total complexity for the pipelined realization of CORDIC based QRD-RLS with explicit weight extraction is O(Mp^2), where M is the pipelining level and p is the size of the input samples. IV. Application to Adaptive Beamforming
In this section, we apply the annihilation-reordering look-ahead technique to CORDIC based adaptive beamforming algorithms and derive fine-grain pipelined topologies. A beamformer is a processor which forms a scalar output signal as a weighted combination of the data received at an array of sensors. The weights determine the spatial filtering characteristics of the beamformer and enable separation of signals having overlapping frequency content if they originate from different locations. The weights in a data independent beamformer are chosen to provide a fixed response independent of the received data. Statistically optimum beamformers select the weights to optimize the beamformer response based on the statistics of the data. The statistically optimum beamformers include the multiple sidelobe canceller (MSC) [47], beamformers with reference signal [48], the maximum signal to noise ratio (Max SNR) beamformer [49], and the linearly constrained minimum variance (LCMV) beamformer [50]. Among them, the LCMV beamformer is the most flexible, since it does not require
Fig. 18. A 3-level fine-grain pipelined topology of the CORDIC based double-triangular adaptive inverse QR algorithm.
the absence of the desired signal as in the MSC, generation of the desired signal as in reference signal beamformers, or estimation of signal and noise covariance matrices as in Max SNR beamformers. The basic idea behind LCMV beamforming is to constrain the response of the beamformer so that signals from the direction of interest are passed with specified gain and phase. The weights are chosen to minimize the output variance or power subject to the response constraint. This has the effect of preserving the desired signal while minimizing contributions to the output due to interfering signals and noise arriving from directions other than the direction of interest. LCMV beamforming is a constrained minimization problem. Solving the constrained minimization problem directly leads to the minimum variance distortionless response (MVDR) beamforming realization [28]. An alternative formulation is to change the constrained minimization problem into unconstrained form, which leads to the generalized sidelobe canceller realization [51]-[52],[47]. In practice, the data statistics are often unknown and may change with time, so adaptive algorithms are used to obtain weights that converge to the statistically optimum solution. Least mean squares (LMS) and recursive least squares (RLS) based adaptive algorithms are the two most popular ones for beamforming. Among RLS algorithms, QR decomposition based schemes are numerically stable and therefore attractive in practical applications. A good survey of various beamforming approaches is given in [53]. In this paper, we consider high-speed QR decomposition based LCMV adaptive beamforming, which includes the MVDR and GSC realizations. This section is organized as follows. The LCMV adaptive beamforming problem is stated in section IV-A. The pipelined QRD-MVDR realization is then presented in section IV-B. Finally, section IV-C addresses the pipelined QRD-GSC realization and its comparison with the pipelined MVDR beamformer.
A. LCMV Adaptive Beamforming Problem Consider a linear array of p uniformly spaced sensors whose outputs are individually weighted and then summed to produce the beamformer output corresponding to the kth desired look direction
\[
e^{(k)}(n) = u^T(n)\, w^{(k)}(n), \qquad (23)
\]
where u(n) is the p-element vector of signal samples received by the array at time instant n, and w^{(k)}(n) is the p-element vector of weights corresponding to the kth desired look direction. Let c^{(k)} be the kth steering vector, which represents the kth desired look direction, and \beta^{(k)} be the corresponding beamforming gain, which is usually a constant. The linear constraint minimum variance (LCMV) beamforming problem may be summarized as follows. Minimize the cost function
\[
\xi^{(k)}(n) = \| \Lambda^{1/2}(n)\, e^{(k)}(n) \|^2 = \| \Lambda^{1/2}(n)\, A(n)\, w^{(k)}(n) \|^2, \qquad (24)
\]
subject to the linear constraint
\[
c^{(k)T}\, w^{(k)}(n) = \beta^{(k)}, \qquad (25)
\]
where A(n) = [\, u(1), u(2), \ldots, u(n) \,]^T is the input data matrix, and \Lambda(n) = \mathrm{diag}[\, \lambda^{n-1}, \ldots, \lambda, 1 \,] is the diagonal matrix of forgetting factors. Here, we assume that all the data are real; the extension to the complex case does not present any particular difficulties. B. Pipelined QRD-MVDR Adaptive Beamforming Solving the constrained least squares minimization problem given in (24) and (25) directly, using the method of Lagrange multipliers, leads to the minimum variance distortionless response (MVDR) beamforming realization. The solution to the constrained least squares minimization problem is given by the following formula
\[
w^{(k)}(n) = \frac{\beta^{(k)}\, \Phi^{-1}(n)\, c^{(k)}}{c^{(k)T}\, \Phi^{-1}(n)\, c^{(k)}}, \qquad (26)
\]
where \Phi is the covariance matrix defined by
\[
\Phi(n) = A^T(n)\, \Lambda(n)\, A(n). \qquad (27)
\]
Assuming that a QR decomposition has been carried out on the weighted data matrix \Lambda^{1/2}(n) A(n), so that
\[
Q(n)\, \Lambda^{1/2}(n)\, A(n) = \begin{bmatrix} R(n) \\ O \end{bmatrix}, \qquad (28)
\]
where R(n) is a p-by-p upper triangular matrix, it follows that
\[
\Phi(n) = R^T(n)\, R(n), \qquad (29)
\]
and so R(n) is the Cholesky square root factor of the covariance matrix \Phi(n). Equation (26) may therefore be written in the form
\[
w^{(k)}(n) = \frac{\beta^{(k)}\, R^{-1}(n)\, R^{-T}(n)\, c^{(k)}}{c^{(k)T}\, R^{-1}(n)\, R^{-T}(n)\, c^{(k)}}
= \frac{\beta^{(k)}\, R^{-1}(n)\, a^{(k)}(n)}{\| a^{(k)}(n) \|^2}, \qquad (30)
\]
where
\[
a^{(k)}(n) = R^{-T}(n)\, c^{(k)}. \qquad (31)
\]
It follows that the beamformer output for the kth desired look direction at time instant n is given by
\[
e^{(k)}(n) = u^T(n)\, w^{(k)}(n)
= \frac{\beta^{(k)}\, u^T(n)\, R^{-1}(n)\, a^{(k)}(n)}{\| a^{(k)}(n) \|^2}. \qquad (32)
\]
Let
\[
e'^{(k)}(n) = u^T(n)\, R^{-1}(n)\, a^{(k)}(n). \qquad (33)
\]
That is, e'^{(k)}(n) is a scaled version of e^{(k)}(n) with the scaling factor \beta^{(k)} / \| a^{(k)}(n) \|^2. The QR decomposition of the data matrix A(n) can be implemented in a recursive manner as shown in (4), repeated here as
\[
\begin{bmatrix} R(n) \\ 0_p^T \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{1/2} R(n-1) \\ u^T(n) \end{bmatrix}. \qquad (34)
\]
After some algebraic manipulations, it can be shown that the orthogonal matrix Q in (34) also updates the auxiliary vector a^{(k)} in (31) [28]. Therefore, the QR update for the MVDR adaptive beamforming algorithm can be summarized as follows.
\[
\begin{bmatrix} R(n) & a^{(k)}(n) & g(n) \\ 0_p^T & \alpha^{(k)}(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{1/2} R(n-1) & \lambda^{-1/2} a^{(k)}(n-1) & 0_p \\ u^T(n) & 0 & 1 \end{bmatrix}. \qquad (35)
\]
The insertion of the third column in (35) is used to generate the converting factor \gamma(n). It can be shown that the scaled residual e'^{(k)}(n) can be computed using the following equation [25]
\[
e'^{(k)}(n) = -\alpha^{(k)}(n)\, \gamma(n). \qquad (36)
\]
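The chain (26)-(31) can be checked with a direct batch computation (\lambda = 1; all variable names are illustrative): the MVDR weights obtained through the Cholesky-type factor R match the direct covariance-inverse formula, and the distortionless constraint c^T w = \beta holds by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(20, p))               # data matrix A(n)
c = rng.normal(size=p)                     # steering vector c^{(k)}
beta = 2.0                                 # beamforming gain beta^{(k)}
Phi = A.T @ A                              # covariance, eq. (27) with lambda = 1
R = np.linalg.qr(A, mode='r')              # Phi = R^T R, eqs. (28)-(29)
a = np.linalg.solve(R.T, c)                # a^{(k)} = R^{-T} c^{(k)}, eq. (31)
w = beta * np.linalg.solve(R, a) / (a @ a)          # eq. (30)
Pc = np.linalg.solve(Phi, c)
w_direct = beta * Pc / (c @ Pc)                     # eq. (26)
assert np.allclose(w, w_direct)
assert np.isclose(c @ w, beta)             # distortionless response constraint (25)
```

Note that only triangular solves against R and R^T are needed, which is exactly what the systolic array computes with rotations instead of explicit inverses.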
An efficient SFG representation of the CORDIC based QRD-MVDR adaptive beamforming algorithm is shown in Fig. 19. In this figure, the circle and square cells follow the notation of Fig. 1. There are 3 desired look directions in the figure, with scaled beamformer outputs e'^{(1)}(n), e'^{(2)}(n), and e'^{(3)}(n). The circle cell with a shaded right-angle and the letter G inside denotes a Gaussian rotation; its functionality is shown in the figure. Notice that the sign is subtraction instead of addition as in Fig. 15. The algorithm complexity is O(p^2/2 + Kp) Givens rotations per sample time, where p is the number of adjustable weights or the size of the antenna array, and K is the number of look direction constraints. In Fig. 19, the recursive operation in each cell limits the throughput of the input samples. Beamforming applications usually require very high sample rates, and the sequential QRD-MVDR algorithm might not be able to operate at such high sample rates. To increase the throughput, the annihilation-reordering look-ahead technique is applied. The recursive updating formula for the QRD-MVDR algorithm is given in (35). Its block update form with block size M is given as follows
\[
\begin{bmatrix} R(n) & a^{(k)}(n) & g(n) \\ O_{M\times p} & \alpha^{(k)}(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{M/2} R(n-M) & \lambda^{-M/2} a^{(k)}(n-M) & 0_p \\ U_M^T(n) & 0_M & \pi_M \end{bmatrix},
\qquad (37)
\]
where U_M^T(n) is an M-by-p matrix defined as
\[
U_M^T(n) = [\, u(n-M+1), \ldots, u(n-1), u(n) \,]^T,
\]
Fig. 19. Signal flow graph representation of the QRD-MVDR adaptive beamforming algorithm.
O_{M\times p} and 0_M denote the M-by-p null matrix and the M-by-1 null vector, respectively, \alpha^{(k)}(n) and \gamma(n) are M-by-1 vectors, and \pi_M is an M-by-1 constant vector defined as \pi_M = [\,0, \ldots, 0, 1\,]^T. The scaled beamformer output for the kth look direction, e'^{(k)}(n), is then given as
\[
e'^{(k)}(n) = -\alpha^{(k)T}(n)\, \gamma(n). \qquad (38)
\]
The derivation of equations (37) and (38) is shown in Appendix B. Notice that the Q(n) matrix in (37) is different from the Q(n) in (35), though we use the same notation here. A sequence of Givens rotations, whose product forms the orthogonal transformation matrix Q(n) in (37), is then determined to annihilate the block input data matrix U_M^T(n) and update R(n-M) and a^{(k)}(n-M), as shown in Fig. 4 and Fig. 5. A final fine-grain pipelined QRD-MVDR beamforming topology with pipelining level 3 is shown in Fig. 20. In this figure, all cell notations follow those of Fig. 19 except that they are compound versions. The internal structure of each compound cell is shown in the bottom part of Fig. 20. Notice that, compared to Fig. 19, the 3-level pipelined architecture triples the number of CORDIC units and the communication bandwidth, which is linear with respect to the pipelining level. Thus, the total complexity is O(M(p^2/2 + Kp)) Givens rotations per sample time, where p is the number of antenna elements, K is the number of look direction constraints, and M is the pipelining level. C. Pipelined CORDIC Based Generalized Sidelobe Canceller An alternative realization of the LCMV beamforming problem is to transform the constrained least squares minimization problem into unconstrained form. This leads to the generalized sidelobe canceller (GSC) realization. Consider the LCMV problem given in equations (24) and (25). The weight vector w^{(k)}(n) can be decomposed into two orthogonal parts which lie in the range and null space of c^{(k)}:
\[
w^{(k)} = w_q^{(k)} + B^{(k)} w_b^{(k)},
\]
Fig. 20. A 3-level pipelined CORDIC based QRD-MVDR adaptive beamforming architecture.
where w_q^{(k)} and w_b^{(k)} are p-by-1 and (p-1)-by-1 vectors, respectively. The p-by-(p-1) blocking matrix B^{(k)} satisfies
\[
c^{(k)T}\, B^{(k)} = 0. \qquad (39)
\]
By (39), we then have
\[
c^{(k)T}\, w^{(k)}(n) = c^{(k)T}\, w_q^{(k)} + c^{(k)T}\, B^{(k)}\, w_b^{(k)}(n)
= c^{(k)T}\, w_q^{(k)} = \beta^{(k)}.
\]
Thus,
\[
w_q^{(k)} = c^{(k)} \left( c^{(k)T}\, c^{(k)} \right)^{-1} \beta^{(k)}.
\]
Furthermore,
\[
e^{(k)}(n) = A(n)\, w^{(k)}(n)
= A(n)\, w_q^{(k)} + A(n)\, B^{(k)}\, w_b^{(k)}(n)
= y^{(k)}(n) - A^{(k)}(n)\, w_b^{(k)}(n), \qquad (40)
\]
where
\[
y^{(k)}(n) = A(n)\, w_q^{(k)}, \qquad A^{(k)}(n) = -A(n)\, B^{(k)}. \qquad (41)
\]
Fig. 21. The block diagram of a generalized sidelobe canceller.
Substituting (40) into (24) and comparing with (2), we see that the GSC problem is actually an unconstrained RLS problem with the error equation given in (40) and a pre-processing procedure given in (41). A complete block diagram of a serial generalized sidelobe canceller is shown in Fig. 21. From Fig. 21, we see that the GSC consists of two parts. The first part is a pre-processing procedure which consists of a matrix-vector multiplication operation and a vector inner product operation, given in (41). We will show that both operations can be implemented using CORDIC rotations and consist of only feed-forward paths, which can be pipelined to fine-grain level using cutset pipelining. The second part is the recursive least squares adaptive filtering shown in the dashed box, which can be realized using the QRD-RLS algorithm. Furthermore, it can be fine-grain pipelined using the annihilation-reordering look-ahead technique shown in section II. Therefore, fine-grain pipelined QRD-GSC architectures are developed which can be operated at arbitrarily high sample rates. The pre-processing part can be implemented using CORDIC arithmetic in the following way. A QR decomposition of the vector c^{(k)} gives us
\[
c^{(k)T}\, Q = [\, \| c^{(k)} \|, 0, \ldots, 0 \,].
\]
Decompose the orthogonal matrix Q in the following way
\[
Q = \left[\, w_q'^{(k)} \mid B^{(k)} \,\right].
\]
Then
\[
c^{(k)T}\, w_q'^{(k)} = \| c^{(k)} \|, \qquad c^{(k)T}\, B^{(k)} = 0. \qquad (42)
\]
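A complete QR decomposition of c^{(k)} (here via `numpy.linalg.qr` rather than a cascade of CORDIC vectoring stages) produces exactly such a Q: its first column is the scaled quiescent vector and the remaining columns form the blocking matrix. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
c = rng.normal(size=p)                          # steering vector c^{(k)}
# Complete QR of c (as a p x 1 matrix) yields an orthogonal Q with
# c^T Q = [ +/-||c||, 0, ..., 0 ].
Q, _ = np.linalg.qr(c.reshape(p, 1), mode='complete')
wq_scaled = Q[:, 0]                             # scaled quiescent vector w'_q^{(k)}
B = Q[:, 1:]                                    # blocking matrix B^{(k)}, p x (p-1)
assert np.isclose(abs(c @ wq_scaled), np.linalg.norm(c))
assert np.allclose(c @ B, 0.0)                  # eq. (39): c^T B = 0
```

The sign of the first column depends on the QR convention, hence the absolute value in the check; a CORDIC implementation fixes the sign through its vectoring-mode angles.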
Comparing (42) with (39) and (40), we see that w_q'^{(k)} is a scaled version of w_q^{(k)}, and that the orthogonal matrix Q contains the vector w_q^{(k)}, up to a scaling factor, together with the matrix B^{(k)}. A typical CORDIC based pre-processing SFG realization is shown in Fig. 22. Each small circle represents a CORDIC operation. The CORDIC units with the letter V inside operate in the vectoring mode, and the ones with the letter R inside operate in the rotating mode. The dashed lines indicate the places where pipelining latches can be placed. Since all paths are feed-forward, fine-grain pipelining can be achieved by placing latches between CORDIC micro-rotation stages using cut-set pipelining. The output of Fig. 22 is then fed into the input of Fig. 16, as shown in Fig. 21.
Fig. 22. A typical pipelined pre-processing architecture of the GSC algorithm.
From Fig. 21, the pipelined GSC topology consists of a feed-forward constraint preprocessor and a recursive least-squares processor, as shown in Fig. 23(a). The preprocessor consumes approximately Mp CORDIC operations, and the least-squares processor uses approximately Mp^2/2 CORDIC operations, where p is the number of adjustable weights and M is the pipelining level. Fig. 21 shows the GSC block diagram for only one constraint. If we impose several (say K) constraints independently, we then have to run the same data through the beamformer a total of K times. Such an imposition requires K different preprocessors and K least-squares processing runs, taking a total of approximately MK(p^2/2 + p) operations for the pipelined GSC topology.
Fig. 23. (a) CORDIC based generalized sidelobe canceller. (b) QRD-MVDR beamformer.
On the other hand, from Fig. 20, the MVDR topology is basically a constraint postprocessor, in that it permits the application of the constraint after the least-squares processing, as shown in Fig. 23(b). Consequently, the output
from the least-squares processor is broadcast to K separate constraint processors, each of which uses approximately Mp operations. The total number of operations in the pipelined MVDR topology is thus approximately M(p^2/2 + Kp), which is considerably less than the corresponding number of operations for the pipelined GSC topology. Therefore, the pipelined MVDR beamformer has the advantage of being computationally more efficient than the pipelined GSC realization. However, the GSC provides insight, and is useful for analysis of the LCMV beamforming problem. V. Conclusions
In this paper, a novel annihilation-reordering look-ahead technique is proposed to achieve fine-grain pipelining for CORDIC based RLS adaptive filtering and beamforming algorithms. It is an exact look-ahead and is based on CORDIC arithmetic. The look-ahead transformation can be derived from either the block processing or the iteration point of view; the former is simpler and more practical, while the latter shows the connection with the traditional multiply-add look-ahead technique. The exploration of the parallelism in the annihilation-reordering look-ahead transformation leads to three implementation styles, namely pipelining, block processing, and incremental block processing. The implementation complexity in terms of CORDIC units is the least for the pipelined realization and the most for block processing. CORDIC based RLS filters with implicit weight extraction are found useful in applications such as adaptive beamforming, and those with explicit weight extraction are found useful in applications such as channel equalization. The application of the proposed look-ahead technique to these RLS filters leads to fine-grain pipelined topologies which can be operated at arbitrarily high sample rates. The pipelined algorithms maintain orthogonality and stability under finite-precision arithmetic. High-speed adaptive beamforming applications are presented in the paper. In particular, the linearly constrained minimum variance (LCMV) adaptive beamformer is considered. LCMV adaptive beamforming is a constrained least squares minimization problem. Solving the constrained minimization problem directly leads to the QRD-MVDR beamforming realization. An alternative is the unconstrained reformulation, which leads to the QR decomposition based GSC realization. Both the MVDR and GSC beamformers can be realized using CORDIC arithmetic. The application of the annihilation-reordering look-ahead technique to these adaptive beamforming algorithms leads to fine-grain pipelined topologies which can achieve arbitrarily high throughput. Furthermore, they consist of only Givens rotations, which can be mapped onto CORDIC arithmetic based processors [24]. TABLE I
The implementation complexity in terms of CORDIC units for various RLS based algorithms and implementation styles.

Algorithm    | Pipelining        | Incremental Block     | Block Processing
QRD-RLS      | (1/2)Mp^2         | (1/2)(2M-1)p^2        | (1/2)M^2 p^2
Inverse QR   | Mp^2              | (2M-1)p^2             | M^2 p^2
QRD-MVDR     | M(p^2/2 + Kp)     | (2M-1)(p^2/2 + Kp)    | M^2 (p^2/2 + Kp)
QRD-GSC      | MK(p^2/2 + p)     | (2M-1)K(p^2/2 + p)    | M^2 K(p^2/2 + p)
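The entries of Table I can be summarized programmatically. The function below is only a restatement of the table as reconstructed here (base costs per sample: p^2/2 for QRD-RLS, p^2 for the inverse QR, p^2/2 + Kp for QRD-MVDR, and K(p^2/2 + p) for QRD-GSC, scaled by M, 2M-1, or M^2 according to the implementation style):

```python
def cordic_units(style, algo, M, p, K=1):
    """CORDIC units per sample time, restating Table I.
    The base cost per sample is scaled by M (pipelining),
    2M - 1 (incremental block), or M^2 (block processing)."""
    base = {
        'qrd-rls':    p * p / 2,
        'inverse-qr': p * p,
        'qrd-mvdr':   p * p / 2 + K * p,
        'qrd-gsc':    K * (p * p / 2 + p),
    }[algo]
    factor = {'pipelining': M, 'incremental': 2 * M - 1, 'block': M * M}[style]
    return factor * base

# e.g. a 3-level pipelined QRD-RLS on p = 4 inputs costs 3 * 16/2 = 24 units
assert cordic_units('pipelining', 'qrd-rls', M=3, p=4) == 24
```

The quadratic M^2 growth of block processing, versus the linear M and 2M-1 growth of the other two styles, is immediately visible from the `factor` table.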
The implementation complexities in terms of CORDIC units for the various RLS based algorithms and implementation styles are shown in Table I. From the table, we see that the pipelining level M is a dimension variable in the complexity expressions for all algorithms and implementation styles. The pipelining and incremental block processing realizations require a linearly increasing number of CORDIC units, with a factor of M for pipelining and a factor of 2M-1 for incremental block processing. The complexity factor for block processing is M^2, which is quadratic with respect to the pipelining level. The adaptive inverse QR algorithm requires approximately twice as many CORDIC units as the QRD-RLS algorithm, since an extra lower triangular matrix is needed to extract the weight vector. The MVDR topology outperforms the GSC topology in terms of complexity by employing a constraint post-processor rather than the constraint pre-processor of the GSC realization. Future research will be directed towards the scheduling and mapping of these pipelined adaptive filtering and beamforming architectures onto hardware configurations. It will be interesting to see how different scheduling strategies lead to mappings with different power consumptions. Designing architecture configurations and schedules that both satisfy the bandwidth requirements and achieve low power consumption will be challenging. Fast orthonormal \mu-rotations can also be employed to implement these Givens rotation based architectures with lower complexity [54]-[56]. Appendix A
In this appendix, we derive equations (14) and (15). At time instance (n ? M ), apply the QR decomposition to the weighted data matrix 1=2 (n ? M )A(n ? M ) and the reference vector y(n) as follows R ( n ? M ) p(n ? M ) 1=2 Q(n ? M ) (n ? M ) A(n ? M ) y(n ? M ) = (43) O v(n ? M ) where R(n ? M ) is p-by-p upper triangular matrix, p(n ? M and v(n ? M ) are p-by-1 and (n ? M ? p)-by-1 vectors, respectively. At time n, the new inputs UM (n) and yM (n) become available processing, we have A(n) y(n) = AU(nT?(nM) ) y(ynM?(nM) ) (44) M De ne 1=2 (n) = M=2 1=2 (n ? M ) ; (45)
IM
Q (n ? M ) = Q(n ? M )
(46)
IM
Then
\[
\bar{Q}(n-M)\,\bar{\Lambda}^{1/2}(n)
\begin{bmatrix} A(n) & y(n) \end{bmatrix}
=
\begin{bmatrix}
\lambda^{M/2}R(n-M) & \lambda^{M/2}p(n-M) \\
O & \lambda^{M/2}v(n-M) \\
U_M^T(n) & y_M(n)
\end{bmatrix}
\tag{47}
\]
Notice that here we choose \bar{\Lambda}^{1/2}(n) instead of \Lambda^{1/2}(n); \Lambda^{1/2}(n) differs from \bar{\Lambda}^{1/2}(n) in replacing I_M by \Lambda^{1/2}(M). Using \Lambda^{1/2}(n) would lead to extra operations on the input data, while using \bar{\Lambda}^{1/2}(n) does not, and it does not affect the algorithm's convergence behavior due to the presence of \lambda^{M/2} in \bar{\Lambda}^{1/2}(n). Applying the orthogonal matrix Q(n), which consists of a sequence of Givens rotations, to annihilate the input data U_M^T(n) against \lambda^{M/2}R(n-M) in (47), we have
\[
\begin{bmatrix}
R(n) & p(n) \\
O & v(n) \\
O_{M\times p} & \nu(n)
\end{bmatrix}
=
Q(n)
\begin{bmatrix}
\lambda^{M/2}R(n-M) & \lambda^{M/2}p(n-M) \\
O & \lambda^{M/2}v(n-M) \\
U_M^T(n) & y_M(n)
\end{bmatrix}
\tag{48}
\]
Equation (48) derives the first two columns in (14). Next we prove equation (15) and justify the third column in (14). From (2), we have
\[
e(n-M) = y(n-M) - A(n-M)\,w(n-M)
\tag{49}
\]
By (43), we then have
\[
Q(n-M)\,\Lambda^{1/2}(n-M)\,e(n-M)
=
\begin{bmatrix} p(n-M) \\ v(n-M) \end{bmatrix}
-
\begin{bmatrix} R(n-M) \\ O \end{bmatrix}
w(n-M).
\tag{50}
\]
Let
\[
\alpha(n-M) = Q(n-M)\,\Lambda^{1/2}(n-M)\,e(n-M),
\qquad
\bar{e}_M(n) = y_M(n) - U_M^T(n)\,w(n-M),
\qquad
e_M(n) = y_M(n) - U_M^T(n)\,w(n),
\tag{51}
\]
where \alpha(n-M) is an (n − M)-by-1 vector, and \bar{e}_M(n) and e_M(n) are M-by-1 vectors. By (1), it is seen that the last element in e_M(n) is the desired residual error e(n) at time instance n. From (50) and (51), we obtain
\[
\begin{bmatrix} \lambda^{M/2}\alpha(n-M) \\ \bar{e}_M(n) \end{bmatrix}
=
\begin{bmatrix} \lambda^{M/2}p(n-M) \\ \lambda^{M/2}v(n-M) \\ y_M(n) \end{bmatrix}
-
\begin{bmatrix} \lambda^{M/2}R(n-M) \\ O \\ U_M^T(n) \end{bmatrix}
w(n-M).
\tag{52}
\]
Applying the orthogonal matrix Q(n) to both sides of (52) and using (48), we obtain
\[
Q(n)
\begin{bmatrix} \lambda^{M/2}\alpha(n-M) \\ e_M(n) \end{bmatrix}
=
\begin{bmatrix} p(n) \\ v(n) \\ \nu(n) \end{bmatrix}
-
\begin{bmatrix} R(n) \\ O \\ O \end{bmatrix}
w(n).
\tag{53}
\]
Notice that after annihilating the input block data U_M^T(n), the weight vector w(n − M) has been updated to w(n); correspondingly, the residual error \bar{e}_M(n) becomes e_M(n) as defined in (51). Now, moving Q(n) to the right-hand side of (53), and noticing that Q(n) is orthogonal, we obtain
\[
\begin{bmatrix} \lambda^{M/2}\alpha(n-M) \\ e_M(n) \end{bmatrix}
=
Q^T(n)
\begin{bmatrix} p(n) - R(n)w(n) \\ v(n) \\ \nu(n) \end{bmatrix}
\tag{54}
\]
Since the optimum weight vector w(n) satisfies p(n) − R(n)w(n) = 0_p, (54) reduces to
\[
\begin{bmatrix} \lambda^{M/2}\alpha(n-M) \\ e_M(n) \end{bmatrix}
=
Q^T(n)
\begin{bmatrix} 0_p \\ v(n) \\ \nu(n) \end{bmatrix}
\tag{55}
\]
Noticing that the last element of e_M(n) is e(n), we have
\[
e(n)
=
\begin{bmatrix} 0_{n-M}^T & \pi_M^T \end{bmatrix}
Q^T(n)
\begin{bmatrix} 0_p \\ v(n) \\ \nu(n) \end{bmatrix}
=
\begin{bmatrix} 0_{n-M}^T & \pi_M^T \end{bmatrix}
Q^T(n)
\begin{bmatrix} O_{(n-M)\times M} \\ I_M \end{bmatrix}
\nu(n),
\tag{56}
\]
where \pi_M = [\,0,\ldots,0,1\,]^T is M-by-1. The second equality is due to the fact that Q^T(n) only makes use of the elements 0_p and \nu(n) in the vector [\,0_p^T,\; v^T(n),\; \nu^T(n)\,]^T. Taking the transpose on both sides of (56), and noticing that e(n) is a scalar, we obtain
\[
e(n)
=
\nu^T(n)
\begin{bmatrix} O_{M\times(n-M)} & I_M \end{bmatrix}
Q(n)
\begin{bmatrix} 0_{n-M} \\ \pi_M \end{bmatrix}
=
\nu^T(n)\,\gamma(n).
\tag{57}
\]
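The identity e(n) = ν^T(n)γ(n) holds for any orthogonal matrix that performs the annihilation in (48), which makes it easy to verify numerically. In the sketch below (a reduced-size illustration with hypothetical state R_old, p_old; the v(n − M) block, which Q(n) leaves untouched, is omitted), the a posteriori residual computed directly from the updated weights matches ν^T(n)γ(n), where ν(n) is the bottom M entries of the rotated reference column and γ(n) the bottom M entries of the last column of Q(n):

```python
import numpy as np

rng = np.random.default_rng(1)
p, M, lam = 4, 3, 0.95
beta = lam ** (M / 2.0)

# Hypothetical state at time n-M: upper triangular R(n-M) and p(n-M)
R_old = np.triu(rng.standard_normal((p, p))) + 4.0 * np.eye(p)
p_old = rng.standard_normal(p)
U = rng.standard_normal((M, p))      # new block U_M^T(n); last row is u^T(n)
y_M = rng.standard_normal(M)

# Annihilate U against lam^{M/2} R(n-M), as in (48) (v-block omitted)
S = np.vstack([beta * R_old, U])
t = np.concatenate([beta * p_old, y_M])
Qf, Rf = np.linalg.qr(S, mode="complete")   # S = Qf @ Rf
Qn = Qf.T                                   # the annihilating rotation Q(n)
R_new = Rf[:p]
rt = Qn @ t
p_new, nu = rt[:p], rt[p:]                  # nu(n): bottom M entries
w_new = np.linalg.solve(R_new, p_new)       # updated weight vector w(n)

gamma = Qn[p:, -1]                          # bottom M of last column of Q(n)
e_direct = y_M[-1] - U[-1] @ w_new          # a posteriori residual e(n)
print(np.isclose(nu @ gamma, e_direct))     # True: e(n) = nu^T(n) gamma(n)
```

The check works for any orthogonal Q(n) that triangularizes S, since the derivation never uses a particular ordering of the Givens rotations.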
The second equality in (57) justifies the third column in (14). This completes the derivation of (14) and (15).

Appendix B
In this appendix, we derive equations (37) and (38). From (48), we have the following block QR update for the QRD-MVDR algorithm:
\[
\begin{bmatrix} R(n) \\ O_{M\times p} \end{bmatrix}
=
Q(n)
\begin{bmatrix} \lambda^{M/2}R(n-M) \\ U_M^T(n) \end{bmatrix}.
\tag{58}
\]
Extending the (p + M)-by-p matrix on the right-hand side of (58) to a (p + M)-by-(p + M) square matrix by appending the extra columns [\,O_{p\times M}^T \;\; I_M\,]^T on its right leads to
\[
\begin{bmatrix} R(n) & V(n) \\ O_{M\times p} & D(n) \end{bmatrix}
=
Q(n)
\begin{bmatrix} \lambda^{M/2}R(n-M) & O_{p\times M} \\ U_M^T(n) & I_M \end{bmatrix}
\tag{59}
\]
where the matrices V(n) and D(n) are p-by-M and M-by-M, respectively, and correspond to the QR update of the matrices O_{p\times M} and I_M. Inverting both sides of equation (59) (the matrix is non-singular since R is non-singular) and noticing that Q^{-1} = Q^T leads to
\[
\begin{bmatrix} R^{-1}(n) & V'(n) \\ O_{M\times p} & D'(n) \end{bmatrix}
=
\begin{bmatrix} \lambda^{-M/2}R^{-1}(n-M) & O_{p\times M} \\ -\lambda^{-M/2}U_M^T(n)R^{-1}(n-M) & I_M \end{bmatrix}
Q^T(n).
\tag{60}
\]
Taking the transpose on both sides of (60), we obtain
\[
\begin{bmatrix} R^{-T}(n) & O_{p\times M} \\ V'^T(n) & D'^T(n) \end{bmatrix}
=
Q(n)
\begin{bmatrix} \lambda^{-M/2}R^{-T}(n-M) & -\lambda^{-M/2}R^{-T}(n-M)U_M(n) \\ O_{M\times p} & I_M \end{bmatrix}.
\tag{61}
\]
Thus, we have the following inverse updating formula:
\[
\begin{bmatrix} R^{-T}(n) \\ V'^T(n) \end{bmatrix}
=
Q(n)
\begin{bmatrix} \lambda^{-M/2}R^{-T}(n-M) \\ O_{M\times p} \end{bmatrix}.
\tag{62}
\]
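Since the inverse update (62) reuses the same rotation Q(n) that triangularizes the data in (58), it can be checked against a direct inversion. A minimal NumPy sketch (the names R_old, R_new, X are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
p, M, lam = 4, 3, 0.95
beta = lam ** (M / 2.0)

# Hypothetical triangular factor at time n-M and a new input block U_M^T(n)
R_old = np.triu(rng.standard_normal((p, p))) + 4.0 * np.eye(p)
U = rng.standard_normal((M, p))

# Q(n) from the forward update (58): annihilate U against lam^{M/2} R(n-M)
Qf, Rf = np.linalg.qr(np.vstack([beta * R_old, U]), mode="complete")
Qn, R_new = Qf.T, Rf[:p]

# Inverse update (62): rotate [lam^{-M/2} R^{-T}(n-M); O] with the same Q(n)
X = Qn @ np.vstack([np.linalg.inv(R_old).T / beta, np.zeros((M, p))])
print(np.allclose(X[:p], np.linalg.inv(R_new).T))   # True: top block is R^{-T}(n)
```

The bottom M rows of X are V'^T(n), so the downdated inverse factor comes out of the same rotation sequence at no extra rotation cost.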
Using (62), we can derive the QR block update equation for the MVDR auxiliary vector a^{(k)}(n) defined in (31):
\[
Q(n)
\begin{bmatrix} \lambda^{-M/2}a^{(k)}(n-M) \\ 0_M \end{bmatrix}
=
Q(n)
\begin{bmatrix} \lambda^{-M/2}R^{-T}(n-M) \\ O_{M\times p} \end{bmatrix}
c^{(k)}
=
\begin{bmatrix} R^{-T}(n) \\ V'^T(n) \end{bmatrix}
c^{(k)}
=
\begin{bmatrix} a^{(k)}(n) \\ \beta^{(k)}(n) \end{bmatrix},
\tag{63}
\]
where
\[
\beta^{(k)}(n) = V'^T(n)\,c^{(k)}
\tag{64}
\]
is an M-by-1 vector. Therefore, the orthogonal matrix Q(n) in (58) also updates the auxiliary vector a^{(k)} in (63). This completes the derivation of the first two columns in (37). Now, comparing (12) with (33), it is seen that if the vector p(n) is replaced with the vector −a^{(k)}(n) and y(n) is set to 0, then (12) becomes (33); correspondingly, (14) becomes (37), except that ν(n) needs to be changed to β^{(k)}(n). In other words, if the p cells in Fig. 16 are replaced by the auxiliary cells a^{(k)} and the reference signal y(n) is set to zero, then the residual error e(n) in Fig. 16 becomes −e'^{(k)}(n) in Fig. 20. Therefore, according to (15) and the proof in Appendix A, we conclude the correctness of (38), and at the same time justify the insertion of the third column in equation (37). This completes the derivation of equations (37) and (38).
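Because (63) propagates the auxiliary vector with the same rotation as the triangular factor, it can be exercised with the same kind of toy setup (the names a_old, a_new, z are ours; c stands in for an arbitrary constraint vector c^{(k)}):

```python
import numpy as np

rng = np.random.default_rng(3)
p, M, lam = 4, 3, 0.95
beta = lam ** (M / 2.0)

R_old = np.triu(rng.standard_normal((p, p))) + 4.0 * np.eye(p)
U = rng.standard_normal((M, p))              # new input block U_M^T(n)
c = rng.standard_normal(p)                   # constraint vector c^{(k)}

a_old = np.linalg.solve(R_old.T, c)          # a^{(k)}(n-M) = R^{-T}(n-M) c^{(k)}

# Q(n) from the forward update (58)
Qf, Rf = np.linalg.qr(np.vstack([beta * R_old, U]), mode="complete")
Qn, R_new = Qf.T, Rf[:p]

# Eq. (63): rotating [lam^{-M/2} a^{(k)}(n-M); 0_M] yields [a^{(k)}(n); beta^{(k)}(n)]
z = Qn @ np.concatenate([a_old / beta, np.zeros(M)])
a_new = np.linalg.solve(R_new.T, c)          # reference: R^{-T}(n) c^{(k)}
print(np.allclose(z[:p], a_new))             # True
```

The bottom M entries of z are β^{(k)}(n) of (64), so the auxiliary cells need no explicit access to V'(n).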
References

[1] K.K. Parhi, "Algorithm transformation techniques for concurrent processors," Proceedings of the IEEE, vol. 77, pp. 1879–1895, Dec. 1989.
[2] K.K. Parhi and D.G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters–Part I: Pipelining using scattered look-ahead and decomposition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1118–1134, July 1989.
[3] K.K. Parhi and D.G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters–Part II: Pipelined incremental block filtering," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1099–1117, July 1989.
[4] P.M. Kogge, "Parallel solution of recurrence problems," IBM J. Res. Develop., vol. 18, pp. 138–148, March 1974.
[5] H.H. Loomis and B. Sinha, "High speed recursive digital filter realization," Circuits, Syst., Signal Processing, vol. 3, no. 3, pp. 267–297, 1984.
[6] K.K. Parhi and D.G. Messerschmitt, "Concurrent architectures for two-dimensional recursive digital filtering," IEEE Transactions on Circuits and Systems, vol. 36, pp. 813–829, June 1989.
[7] G. Fettweis and H. Meyr, "Parallel Viterbi decoding by breaking the compare-select feedback bottleneck," IEEE Transactions on Communications, vol. 37, pp. 785–790, Aug. 1989.
[8] H.D. Lin and D.G. Messerschmitt, "Finite state machine has unlimited concurrency," IEEE Transactions on Circuits and Systems, vol. 38, pp. 465–475, May 1991.
[9] K.K. Parhi, "Pipelining in dynamic programming architectures," IEEE Transactions on Signal Processing, vol. 39, pp. 1442–1450, June 1991.
[10] K.K. Parhi, "Pipelining in algorithms with quantizer loops," IEEE Transactions on Circuits and Systems, vol. 38, pp. 745–754, July 1991.
[11] K.K. Parhi, "High-speed VLSI architectures for Huffman and Viterbi decoders," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 39, pp. 385–391, June 1992.
[12] N.R. Shanbhag and K.K. Parhi, Pipelined Adaptive Digital Filters, Kluwer Academic Publishers, 1994.
[13] N.R. Shanbhag and K.K. Parhi, "Relaxed look-ahead pipelined LMS adaptive filters and their application to ADPCM coder," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, pp. 753–766, Dec. 1993.
[14] N.R. Shanbhag and K.K. Parhi, "A pipelined adaptive lattice filter architecture," IEEE Transactions on Signal Processing, vol. 41, pp. 1925–1939, May 1993.
[15] N.R. Shanbhag and K.K. Parhi, "A pipelined adaptive differential vector quantizer for low-power speech coding applications," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, pp. 347–349, May 1993.
[16] N.R. Shanbhag and K.K. Parhi, "A high-speed architecture for ADPCM coder and decoder," in International Symposium on Circuits and Systems, May 1992, pp. 1499–1502.
[17] J. Ma, E.F. Deprettere, and K.K. Parhi, "Pipelined CORDIC based QRD-RLS adaptive filtering using matrix look-ahead," in Proc. of the IEEE Workshop on Signal Processing Systems (SiPS), Nov. 1997, pp. 131–140.
[18] J. Ma, K.K. Parhi, and E.F. Deprettere, "High-speed CORDIC based parallel weight extraction for QRD-RLS adaptive filtering," in International Symposium on Circuits and Systems, May 1998, pp. 245–248.
[19] J. Ma, K.K. Parhi, and E.F. Deprettere, "Pipelined CORDIC based QRD-MVDR adaptive beamforming," in International Conference on Acoustics, Speech, and Signal Processing, May 1998, pp. 3025–3028.
[20] S. Haykin, Adaptive Filter Theory, Englewood Cliffs, NJ: Prentice-Hall, 1992.
[21] J.E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. on Electronic Computers, pp. 330–334, Sept. 1959.
[22] Y.H. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, no. 7, pp. 16–35, July 1992.
[23] G.J. Hekstra and E.F. Deprettere, "Floating point CORDIC," in Proceedings 11th Symp. Computer Arithmetic, June 1993, pp. 130–137.
[24] E. Rijpkema, G. Hekstra, E. Deprettere, and J. Ma, "A strategy for determining a Jacobi specific dataflow processor," in Proc. of the IEEE International Conf. on Application-Specific Systems, Architectures and Processors, July 1997, pp. 53–64.
[25] J.G. McWhirter, "Recursive least-squares minimization using a systolic array," in Proc. SPIE: Real Time Signal Processing VI, 1983, vol. 431, pp. 105–112.
[26] W.M. Gentleman and H.T. Kung, "Matrix triangularization by systolic arrays," in Proc. SPIE: Real-Time Signal Processing IV, 1981, pp. 298–303.
[27] T.J. Shepherd, J.G. McWhirter, and J.E. Hudson, "Parallel weight extraction from a systolic adaptive beamformer," in Mathematics in Signal Processing II (J.G. McWhirter, ed.), pp. 775–790, Clarendon Press, Oxford, 1990.
[28] J.G. McWhirter and T.J. Shepherd, "Systolic array processor for MVDR beamforming," in IEE Proceedings, April 1989, vol. 136, pp. 75–80.
[29] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, vol. 27, pp. 473–484, April 1992.
[30] K.J. Raghunath and K.K. Parhi, "Pipelined RLS adaptive filtering using Scaled Tangent Rotations (STAR)," IEEE Transactions on Signal Processing, vol. 44, pp. 2591–2604, October 1996.
[31] T.H.Y. Meng, E.A. Lee, and D.G. Messerschmitt, "Least-squares computation at arbitrarily high speeds," in International Conference on Acoustics, Speech, and Signal Processing, 1987, pp. 1398–1401.
[32] G.H. Golub and C.F. Van Loan, Matrix Computations, Baltimore, MD: Johns Hopkins University Press, 1989.
[33] S.F. Hsieh, K.J.R. Liu, and K. Yao, "A unified square-root-free Givens rotation approach for QRD-based recursive least squares estimation," IEEE Transactions on Signal Processing, vol. 41, pp. 1405–1409, March 1993.
[34] S. Hammarling, "A note on modifications to the Givens plane rotation," J. Inst. Math. Applicat., vol. 13, pp. 215–218, 1974.
[35] J.L. Barlow and I.C.F. Ipsen, "Scaled Givens rotations for the solution of linear least-squares problems on systolic arrays," SIAM J. Sci. Stat. Comput., vol. 13, pp. 716–733, Sept. 1987.
[36] J. Götze and U. Schwiegelshohn, "A square-root and division free Givens rotation for solving least-squares problems on systolic arrays," SIAM J. Sci. Stat. Comput., pp. 800–807, Dec. 1991.
[37] E. Frantzeskakis and K.J.R. Liu, "A class of square-root and division free algorithms and architectures for QRD-based adaptive signal processing," IEEE Transactions on Signal Processing, vol. 42, pp. 2455–2469, Sept. 1994.
[38] J.M. Cioffi, "The fast adaptive ROTOR's RLS algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 631–653, April 1990.
[39] C.E. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," in Proc. Third Caltech Conf. VLSI, Pasadena, CA, March 1983, pp. 87–116.
[40] K.K. Parhi, "High-level algorithm and architecture transformations for DSP synthesis," Journal of VLSI Signal Processing, vol. 9, pp. 121–143, 1995.
[41] E.F. Deprettere, P. Held, and P. Wielage, "Model and methods for regular array design," International Journal of High Speed Electronics; Special Issue on Massively Parallel Computing–Part II, vol. 4(2), pp. 133–201, 1993.
[42] H. Leung and S. Haykin, "Stability of recursive QRD-LS algorithms using finite-precision systolic array implementation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 5, pp. 760–763, 1989.
[43] M. Moonen and E.F. Deprettere, "A fully pipelined RLS-based array for channel equalization," Journal of VLSI Signal Processing, vol. 14, pp. 67–74, 1996.
[44] R. Gooch and J. Lundell, "The CM array: An adaptive beamformer for constant modulus signals," in International Conference on Acoustics, Speech, and Signal Processing, 1986, pp. 2523–2526.
[45] C.T. Pan and R.J. Plemmons, "Least squares modifications with inverse factorizations: parallel implications," J. Comput. Appl. Math., vol. 27, pp. 109–127, 1989.
[46] S.T. Alexander and A.L. Ghirnikar, "A method for recursive least squares adaptive filtering based upon an inverse QR decomposition," IEEE Transactions on Signal Processing, vol. 41, pp. 20–30, 1993.
[47] S.P. Applebaum and D.J. Chapman, "Adaptive arrays with main beam constraints," IEEE Trans. on Antennas and Propagation, vol. AP-24, pp. 650–662, Sept. 1976.
[48] B. Widrow, P.E. Mantey, L.J. Griffiths, and B.B. Goode, "A novel algorithm and architecture for adaptive digital beamforming," in Proceedings of the IEEE, Dec. 1967, vol. 55, pp. 2143–2159.
[49] R. Monzingo and T. Miller, Introduction to Adaptive Arrays, Wiley and Sons, NY, 1980.
[50] O.L. Frost III, "An algorithm for linearly constrained adaptive array processing," in Proceedings of the IEEE, Aug. 1972, vol. 60, pp. 926–935.
[51] R.L. Hanson and C.L. Lawson, "Extensions and applications of the Householder algorithm for solving linear least squares problems," Math. Comp., vol. 23, pp. 917–926, 1969.
[52] L.J. Griffiths and C.W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. on Antennas and Propagation, vol. AP-30, pp. 27–34, Jan. 1982.
[53] B.D. Van Veen and K.M. Buckley, "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, pp. 4–24, April 1988.
[54] G.J. Hekstra and E.F. Deprettere, "Fast rotations: Low cost arithmetic methods for orthonormal rotation," in Proc. 12th Symp. Comput. Arithmetic, July 1997, pp. 116–125.
[55] J. Götze and G. Hekstra, "An algorithm and architecture based on orthonormal μ-rotations for computing the symmetric EVD," INTEGRATION, the VLSI Journal, vol. 20, pp. 21–39, 1995.
[56] J. Ma, K.K. Parhi, G.J. Hekstra, and E.F. Deprettere, "Efficient implementations of CORDIC based IIR digital filters using fast orthonormal μ-rotations," in Proc. of the SPIE: Advanced Signal Processing Algorithms, Architectures, and Implementations VIII, July 1998.