IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 2, NO. 1, PART I, JANUARY 1994


Selection of Excitation Vectors for the CELP Coders

Nicolas Moreau and Przemyslaw Dymarski

Abstract-In this paper, we investigate several algorithms that construct the input for the synthesis filter in the CELP coder, we present them under the same formalism, and we compare their performances. We model the excitation vector by a linear combination of K signals, which are issued from K codebooks and multiplied by K associated gains. We demonstrate that this generalized form incorporates several particular coders such as code-excited linear predictive coders, multipulse coders, self-excited vocoders, etc. The least squares minimization problem is presented afterwards. In the case of orthogonal codebooks, we show that the optimal solution of this least squares problem is equivalent to orthogonal transform coding. We use the Karhunen-Loeve transform to design the corresponding orthogonal codebooks. In the case of nonorthogonal codebooks, we are restricted to suboptimal iterative algorithms for index selection and gain computation. We present some new algorithms based on orthogonalization procedures and QR factorizations that attempt to reduce this suboptimality. In a particular case, when the excitation is modeled using one gain coefficient (for example, ternary excitation or concatenation of short codebook vectors), an iterative angle minimization algorithm is proposed for index selection. The different extraction algorithms are compared with regard to the resulting coder complexity and synthetic speech quality. We find a particularly attractive method that consists of modeling the excitation with one unique gain.

I. INTRODUCTION

SPEECH coders for bit rates in the range 4.8-16 kbit/s are generally based on predictive analysis-by-synthesis methods [1]-[5]. These analysis-by-synthesis methods basically consist of two parts. First, an autoregressive filter, which is characterized by the transfer function 1/A(z), is determined by linear prediction of the original signal s_n. Next, an input signal r̂_n for this filter is defined by extracting vectors from excitation codebooks. A very general way of describing this excitation is shown in Fig. 1. The input signal r̂_n is obtained by a linear combination of K signals, which are issued from K codebooks and multiplied by K associated gains. This input signal is a model of the residual signal r_n. The goal of these analysis-by-synthesis procedures is to construct a synthetic speech signal ŝ_n that approximates the original speech signal as closely as possible in some sense. The distance measure is defined as the Euclidean norm of the difference between the perceptual signals p_n and p̂_n, i.e., the Euclidean norm of the difference between the original and synthetic speech signals passed through the filter A(z)/A(z/γ) with 0 < γ < 1 [6]. Starting from this representation, the goal of this paper is to investigate several algorithms that select the excitation

Manuscript received February 1, 1991; revised February 18, 1993. The associate editor coordinating the review of this paper and approving it for publication was Dr. Brian A. Hanson. N. Moreau is with Telecom Paris, Paris, France. P. Dymarski is with the Technical University of Warsaw, Warsaw, Poland. IEEE Log Number 9213394.

Fig. 1. Multistage CELP coder.

vectors, present them under the same formalism, and compare their performances. In Section II, it is demonstrated that the predictive coder shown in Fig. 1 has a generalized form, which incorporates several particular coders such as the multipulse (MP) coders [1], code-excited linear predictive (CELP) coders [2], self-excited vocoders (SEV) [7], etc. The selection of vectors from the excitation codebooks is then analyzed. In Section III, optimal methods for vector extraction from orthogonal codebooks are described. It is shown that using orthogonal filtered codebooks is equivalent to orthogonal transform coding. The Karhunen-Loeve (K-L) transform is used, with a proper adaptation to the short- and long-term correlations of the signal [8]. A partitioned shape-gain vector quantization (VQ) is applied in the transform domain. In the case of nonorthogonal codebooks, we are restricted to suboptimal iterative algorithms. We attempt to reduce this suboptimality. New algorithms for index selection, based on orthogonalization procedures, are proposed in Section IV; the recursive modified Gram-Schmidt (RMGS) algorithm [9] yields a locally optimal choice of index at very low extra computational cost. In some particular cases, for example, ternary excitation or concatenation of short codebook vectors, the excitation is modeled using one gain coefficient. In such cases, an iterative angle minimization algorithm is proposed for index selection. This procedure is described in Section V. The different extraction algorithms are compared with regard to the resulting coder complexity and synthetic speech quality in Section VI.

II. MATHEMATICAL DESCRIPTION OF THE PROBLEM

A. The Excitation Model

Let us consider the general scheme of a predictive coder as shown in Fig. 1. Let us assume that there are K codebooks represented by the matrices C^(1) ... C^(K) of dimension N x A_1, ..., N x A_K, respectively. The excitation vector r̂ may be


written as

r̂ = Σ_{k=1}^{K} g_k c^{j(k)}.    (1)

We adopt the following notation: r̂ is an N-dimensional vector with components denoted r̂_n, and c^{j(k)} is the j(k)th column vector of the matrix C^(k). For simplifying the notation, the codebook number is omitted when it is not necessary: in that case, the matrix C = [c^1 ... c^A] is comprised of the column vectors c^j. The inner product between the vectors c^1 and c^2 is denoted (c^1, c^2).

The model given by (1) is very general. By defining the codebook contents and imposing various constraints on indices and gains, we obtain several standard coders. For example, if we use only one codebook represented by the N x N identity matrix I from which we extract K vectors, the multipulse coder is obtained [1], [10], [11]. The indices j(k) and the gains g_k correspond, respectively, to the positions and amplitudes of the K selected pulses. Imposing additional constraints on the pulse positions results in a regular pulse excitation scheme [12]. The K = N/M positions are chosen according to j(k+1) = j(k) + M; therefore, only the first position need be coded.

A codebook may also be populated with the past excitation signal (long-term prediction, LTP). Such an adaptive codebook has the form

C^(1) = [r̂_{-Qmax} ... r̂_{-N-1} r̂_{-N} r̂_{-N+1} r̂_{-N+2} ... r̂_{-Qmin}]    (2)

where each column r̂_{-q} = (r̂_{-q}, r̂_{-q+1}, ..., r̂_{-q+N-1})^t is a block of N past excitation samples, and the delays Qmin and Qmax correspond to the minimum and maximum pitch values. This adaptive codebook is updated every analysis window. In most cases, only one vector is extracted from this codebook. The index of the selected vector corresponds to the pitch period (or its multiples), and the gain specifies the pitch coefficient. Extracting neighboring vectors from the adaptive codebook is equivalent to LTP with many coefficients [16], but this variant is less popular in practice since this method is more costly in terms of bit rate. Extending the adaptive codebook with vectors obtained by interpolation of the codebook yields the fractional pitch approach [17], [18].

The basic CELP coder [2] consists of two codebooks: The codebook C^(1) is an adaptive codebook, whereas the codebook C^(2) is either populated with Gaussian random variables or designed using clustering techniques [19], [20]. The general case corresponds to the multistage CELP coder [19]. In order to reduce the number of computations, sparse [21], binary [22], ternary [23], or algebraic [24] codebooks are implemented. Any combination of the above-mentioned codebooks can be applied, for example, a binary regular pulse excitation [25].

The binary excitation will be examined in more detail. In the previous model (1), binary excitation corresponds to one codebook C represented by the N x N identity matrix with K = N vectors and the following constraint on the gains:

g_k = g s_k  for k = 1 ... K    (3)

where g denotes the amplitude of the excitation signal and s_k the sign. In this case, the excitation model has a unique gain

r̂ = g Σ_{k=1}^{K} s_k c^{j(k)}.    (4)

This model is also suitable for a ternary excitation using the same codebook C but this time with K < N. With a predetermined number K of nonzero samples, N!/(K!(N-K)!) 2^K excitation vectors can be obtained (for example, for N = 32 and K = 8, we have 2.7*10^9 ≈ 2^31 vectors). If K is not fixed, 3^N excitation vectors exist. In order to simplify the treatment, the N-dimensional excitation vector can be constructed as a concatenation of K short vectors selected from a small codebook C'. To avoid high bit rate costs, we are obliged to use one gain for all these vectors. The small codebook C' consists of A' vectors with dimension N/K. The case K = 2 has been described in [26]. Using this procedure, A = (A')^K N-dimensional vectors are generated. In the excitation model, we use K codebooks whose kth member C^(k) contains the short codebook C' in its kth block of N/K rows and zeros elsewhere:

C^(k) = [0; ...; C'; ...; 0].    (5)

The excitation model given by (4) is studied in Section V. Finally, the codebooks C^(1) ... C^(K) can be juxtaposed to form a mixed codebook

C = [C^(1) ... C^(K)].    (6)
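The excitation model (1) and its multipulse special case can be sketched in a few lines of code. This is a toy illustration under assumed dimensions and values, not the paper's configuration; the helper `excitation` and all numbers below are hypothetical.

```python
# Sketch of the excitation model (1): the excitation is a linear
# combination of K codebook columns scaled by K associated gains.

def excitation(codebooks, indices, gains):
    """r_hat = sum_k g_k * c^{j(k)}, each codebook a list of length-N columns."""
    N = len(codebooks[0][0])
    r_hat = [0.0] * N
    for C, j, g in zip(codebooks, indices, gains):
        for n in range(N):
            r_hat[n] += g * C[j][n]
    return r_hat

# Multipulse special case: the codebook is the N x N identity, so the
# selected indices are pulse positions and the gains are pulse amplitudes.
N = 8
identity = [[1.0 if n == j else 0.0 for n in range(N)] for j in range(N)]
pulses = excitation([identity, identity], indices=[2, 5], gains=[0.7, -0.3])
```

With the identity codebook, `pulses` is zero everywhere except at positions 2 and 5, which carry the two pulse amplitudes, matching the multipulse interpretation of (1).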

The extraction of excitation vectors may be performed from successive parts of this mixed codebook. For example, in the case of a CELP coder, the mixed codebook consists of an adaptive and a stochastic part. First, an LTP analysis is performed, and then, the LTP residual is modeled by a vector selected from the stochastic part of the codebook. If, however, we impose no constraints on the indices and let the coder freely choose its vectors from anywhere within the mixed codebook, the vector selection algorithm no longer depends on the type of excitation [27]. This approach allows us to observe what kind of excitation is most frequently selected and to justify the final choice of the excitation codebooks. Henceforth, the codebook number will no longer be specified since all the mentioned coders can be described by this mixed codebook model with appropriate constraints for each typical part.

B. Influence of the Preceding Analysis Windows: The Filtered Codebook

Once the excitation model is fixed, we may define the algorithm for the selection of the indices and the determination of the gains. The distance between the original and synthetic


speech signal, which is expressed at the perceptual level, is given by

E = Σ_{n=0}^{N-1} (p_n - p̂_n)^2.    (7)

This criterion must be minimized relative to the unknown parameters j(k) and g_k for k = 1 ... K. The perceptual signal p̂_n in the analysis window n = 0 ... N-1 may be expressed as

p̂_n = Σ_{i=0}^{n} h_i r̂_{n-i} + p̂_n^0  for n = 0 ... N-1    (8)

where h_n denotes the impulse response of the perceptual filter 1/A(z/γ), and p̂_n^0 denotes the filter's response to the excitation signal of the preceding windows (ringing). The first term in (8) depends on the unknown parameters j(k) and g_k, whereas the second term is known. A similar expression can be written for the residual signal r_n and the perceptual signal p_n:

p_n = Σ_{i=0}^{n} h_i r_{n-i} + p_n^0  for n = 0 ... N-1.    (9)

The difference p_n - p̂_n yields

p_n - p̂_n = (Σ_{i=0}^{n} h_i r_{n-i} + p_n^0 - p̂_n^0) - Σ_{i=0}^{n} h_i r̂_{n-i}  for n = 0 ... N-1.    (10)

Henceforth, vector and matrix notation is used. The first three terms in (10) form the perceptual vector denoted by p. The last term forms the perceptual vector model denoted by p̂. This vector is expressed as

p̂ = H r̂    (11)

where H is a N x N lower triangular Toeplitz matrix. Substituting the proposed model (1) for the excitation into (11), we obtain

p̂ = Σ_{k=1}^{K} g_k H c^{j(k)}.    (12)

The vectors

f^j = H c^j  for j = 1 ... A    (13)

form the filtered codebook

F = HC.    (14)

C. The Least Squares Minimization Problem

The minimization problem can be formulated as follows: Given the target vector p and f^1 ... f^A, find the indices j(1) ... j(K) and the gains g_1 ... g_K in order to minimize

E = ||p - Σ_{k=1}^{K} g_k f^{j(k)}||^2.    (15)

Using matrix notation, this problem becomes the following: Given a N x A matrix F consisting of A columns f^j and a N-dimensional vector p, extract from F a N x K matrix A comprised of K columns among A, and find a K-dimensional vector g in order to minimize

E = ||p - Ag||^2.    (16)

This least squares minimization problem would be classic if the matrix A, which depends on the indices j(1) ... j(K), were known [28]. In that case, the overdetermined linear system p = Ag would be solved to obtain g. The best approximation p̂ of the vector p is the orthogonal projection of p on the subspace spanned by the column vectors of the matrix A. This projection p̂ = Ag may be obtained as a solution of the normal equations

A^tAg = A^tp.    (17)

Let us assume that A is a full rank matrix so that A^tA is positive definite. The solution can be found using the Cholesky decomposition; however, faster algorithms can be applied if A^tA has a Toeplitz form. Unfortunately, the indices are not known; thus, they have to be determined as well as the gains. The minimization of E leads to choosing a combination j(1) ... j(K), solving the normal equations, and selecting the combination that maximizes

||p̂||^2 = p̂^tp̂ = g^tA^tAg    (18)

since the minimization of ||p - p̂||^2 is equivalent to the maximization of ||p̂||^2. Substituting

g = (A^tA)^{-1}A^tp    (19)

into (18), we have to maximize

||p̂||^2 = p^tA(A^tA)^{-1}A^tp.    (20)

The gains are obtained by solving the normal equations corresponding to the selected combination of indices. The minimum value of the criterion E yields

E_min = ||p||^2 - ||p̂||^2.    (21)

The optimal algorithm for selection of the indices first requires the determination of the matrices A corresponding to all index combinations, then the evaluation of the criterion, and, finally, the selection of the combination that maximizes this criterion. If no constraints are imposed, there are A!/(K!(A-K)!) possible combinations. Taking the typical values K = 3, A = 256, and N = 40 would require us to test about 3 million combinations every 5 ms. It is evident that we have to resort to suboptimal algorithms. The major problem is posed
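The exhaustive search of Section II-C can be sketched directly from (16)-(21). This is a toy-scale illustration with assumed random data (the paper's dimensions K = 3, A = 256 would already require millions of combinations):

```python
# For every combination of K columns of F, solve the normal equations (17)
# and keep the combination maximizing ||p_hat||^2, as in (18)-(20).
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, A, K = 8, 6, 2
F = rng.standard_normal((N, A))   # filtered codebook F = HC (toy values)
p = rng.standard_normal(N)        # perceptual target vector (toy values)

best = None
for combo in itertools.combinations(range(A), K):
    Amat = F[:, combo]                                 # N x K candidate matrix
    g = np.linalg.solve(Amat.T @ Amat, Amat.T @ p)     # normal equations (17)
    p_hat = Amat @ g
    score = p_hat @ p_hat                              # ||p_hat||^2, eq. (18)
    if best is None or score > best[0]:
        best = (score, combo, g)

score, combo, g = best
E_min = p @ p - score                                  # eq. (21)
```

Since p̂ is an orthogonal projection of p, `score` never exceeds ||p||^2 and `E_min` is nonnegative, which is a convenient sanity check on any implementation.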


by the inversion of the matrix A^tA. Two methods (at least) are available to simplify the index-extraction procedure:

1) One vector is extracted at a time, i.e., the algorithm becomes iterative. In this case, A^tA is a scalar.
2) Special properties are imposed on the matrix A, for example, A^tA = I for any index combination, as proposed in [29].

The first method, although it is suboptimal, is a very popular standard approach. In Section IV, this procedure will be presented, and we will try to reduce its suboptimality. First, the latter method will be examined in more detail.

III. OPTIMAL ALGORITHM FOR ORTHOGONAL CODEBOOK

A. Introduction

The codebooks studied in this section provide a globally optimal solution of the problem described in the previous section. We limit our investigation to filtered codebooks F that consist of orthogonal vectors as in [29]. In this case, given the perceptual vector p, it is easy to extract from F the optimal excitation vector set. The K vectors that form the minimal angles with p are extracted; therefore, ||p̂||^2 is maximized, and the error E is minimized. Note the following:

1) An orthogonal codebook is comprised of a maximum of N vectors. With A = N, the complete space R^N is spanned. A typical value is N = 40; therefore, codebooks consisting of a rather small number of vectors are obtained.
2) The orthonormal codebook satisfies

F^tF = I    (22)

independent of the filter H. This implies that the construction of the nonfiltered codebook C = H^{-1}F depends on H. In fact, the excitation codebook C is no longer necessary. Thus, we obtain a degenerate form of a CELP coder.
3) Any orthogonal matrix F can be used, but using perceptual signal properties will lead to better coding results.

Fig. 2. Orthogonal transform coder with long-term prediction.

B. Equivalence with Transform Encoding

K vectors are extracted from the filtered codebook to form the matrix A. The gains

g_k = (f^{j(k)}, p) = (f^{j(k)})^t p  for k = 1 ... K    (23)

correspond to the lengths of the orthogonal projections of p on the basis vectors defined by F. The perceptual signal model p̂ is a linear combination of the basis vectors

p̂ = Σ_{k=1}^{K} ĝ_k f^{j(k)}    (24)

where ĝ_k is a quantized version of the gain g_k. When K = N, the model p̂ may be expressed as

p̂ = F[Q(F^tp)]    (25)

where Q denotes the quantization process. This is equivalent to an orthogonal transform coding: The gain vector (the transform vector) is computed as g = F^tp and coded Q(g), and then, the perceptual signal model is obtained by an inverse transform p̂ = F[Q(g)]. The block diagram of an orthogonal transform coder with long-term prediction is shown in Fig. 2. Let e denote the long-term prediction residual and x the transformed residual. The quantization error and the perceptual signal modeling error are equal since ||p - p̂|| = ||e - ê|| = ||x - x̂||.

C. Design of the Orthogonal Codebook F

We are looking for a basis facilitating the coding of the transform vector x. The Karhunen-Loeve (K-L) transform is attractive because of the decrease in the variance of some vector components. Owing to this property, many components can be forced to zero (scalar quantization), or the dimension of the vectors in the codebook can be reduced (vector quantization). The K-L transform consists of computing the eigenvectors of the covariance matrix

R = E[ee^t] ≈ (1/M) Σ_{j=1}^{M} e^j (e^j)^t.    (26)

The vectors f^1 ... f^N we are seeking for designing F are the normalized eigenvectors of R with the eigenvalues placed in decreasing order. They are also the principal axes of inertia for the training sequence e^j. The matrix F obtained in this way is fixed. To take into account the evolution of the spectral properties of the signal e, an adaptive transform can be used. Assuming that the residual r is white, the covariance matrix for a signal without long-term prediction is given by

R = E[pp^t] = H E[rr^t] H^t = HH^t.    (27)

With a long-term predictor, we have to take into account the additional information given by this prediction for designing F. Let q be the normalized vector corresponding to the long-term predictor. If we orthogonalize p relative to q (orthogonalization procedures are studied in more detail in Section IV), we get

e = (I - qq^t)p = Pp    (28)

where P is the orthogonalization operator verifying (Pp, q) = 0. According to the preceding hypothesis (white residual), the covariance matrix becomes

R = E[ee^t] = E[Ppp^tP^t] = PHH^tP^t.    (29)

The new transform F is given by the eigenvectors of the matrix PHH^tP^t. Only N-1 eigenvectors are used since the eigenvector q is associated with a null eigenvalue (Rq = 0). The corresponding transform takes into account the spectral properties of the perceptual signal, but the extra computational cost, which is studied in Section VI, is significant.
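The K-L codebook design of (27) can be sketched as follows. The impulse response `h` and the dimensions below are toy assumptions (the real H comes from the perceptual filter 1/A(z/γ)); only the structure of the computation follows the text.

```python
# Build a toy lower triangular Toeplitz H, form R = H H^t (eq. 27),
# and take its eigenvectors sorted by decreasing eigenvalue as the
# orthonormal K-L basis F (so F^t F = I, eq. 22).
import numpy as np

N = 6
h = 0.8 ** np.arange(N)                       # assumed impulse response
H = np.array([[h[i - j] if i >= j else 0.0 for j in range(N)]
              for i in range(N)])
R = H @ H.T                                   # white-residual covariance (27)

eigval, eigvec = np.linalg.eigh(R)            # eigh returns ascending order
order = np.argsort(eigval)[::-1]              # decreasing eigenvalues
F = eigvec[:, order]                          # K-L basis, columns orthonormal

p = np.ones(N)                                # some perceptual vector
x = F.T @ p                                   # transform (gain) vector
p_rec = F @ x                                 # inverse transform, no quantization
```

Without quantization the transform is lossless (`p_rec` equals `p`), which mirrors the statement that with K = N the coder reduces to orthogonal transform coding; all distortion then comes from Q in (25).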


D. Coding in the Transform Domain

To take advantage of the information concentrated in the first components, several approaches were studied:

1) scalar quantization of the K first components of x, forcing the remaining N-K components to zero as in [30]
2) scalar quantization of the K most significant components (i.e., the K components having maximum absolute values) [29], forcing the remaining components to zero
3) vector quantization of x using a partitioned VQ with appropriate bit allocation.

The approaches based on scalar quantization have not given encouraging results. At the same bit rate, the SNR obtained is inferior to the CELP coder SNR. Therefore, the following approach is considered. The N-dimensional vector x is divided in two subbands of dimension N_1 and N_2, and a vector quantization is performed in each subband as in [31]. The overall distortion may be minimized by allocating bits or by choosing the dimension of each subband. We have decided to allocate the same number of bits to each subband and to optimize the dimensions N_1 and N_2. For all possible pairs N_1 and N_2, two training sequences are built by extracting vectors of dimension N_1 and N_2 from the N-dimensional vectors x. Then, the corresponding codebooks are designed. We choose the combination maximizing the segmental SNR for the coder thus defined.

In contrast with [31], a shape-gain VQ is performed in each subband. In a usual shape-gain VQ [32], we are looking for a shape codebook Y = [y^1 ... y^L] composed of normalized vectors and a gain codebook G = [g_1 ... g_M] composed of scalars minimizing the overall distortion Σ_j ||x^j - x̂^j||^2, where x̂^j = g_m y^n is the model of the vector x^j minimizing the distance ||x^j - x̂^j||^2. Because there are special directions near the principal axes of the training sequence yielding large values for the gains and other directions yielding smaller values, it is better to quantize separately each gain corresponding with each direction: x̂^j = g_{m,n} y^n. Thus, for each direction y^n, there is a separate gain codebook G_n = [g_{1,n} ... g_{M,n}]. Because the evolution of the gains is slow but significant, we also use a forward adaptation, in contrast to [33], where backward adaptation is applied. The norm ||p^j|| of the perceptual signal is estimated, for example, every 20 ms, and the normalized gains are coded using uniform quantizers: x̂^j = g_{m,n} ||p^j|| y^n. Thus, we obtain a tree-structured shape-gain VQ with adaptive quantization for the gains. The appropriate partition is defined by minimizing the local distortion

E_j = ||x^j - g_{m,n} ||p^j|| y^n||^2.    (30)

For each direction y^n, the length of the orthogonal projection (x^j, y^n) is calculated, and the gain g_{m,n} is chosen from the uniform quantizer G_n minimizing |g_{m,n} ||p^j|| - (x^j, y^n)|. Then, the local distortion (30) is evaluated, and the direction y^n is chosen.

For designing the codebooks, the overall distortion may be written as

Σ_{n=1}^{L} [ Σ_{x^j ∈ y^n} (||x^j||^2 - (x^j, y^n)^2) + Σ_{x^j ∈ y^n} (g_{m,n} ||p^j|| - (x^j, y^n))^2 ].    (31)

As suggested in [33] and [34], we decouple the shape quantizer design from the gain sequence. First, the shape codebook is designed, assuming that the gain quantization is ideal (the second term in (31) equals zero). The centroid of the nth cell is the vector y^n maximizing

Q_n = Σ_{x^j ∈ y^n} (x^j, y^n)^2 = (y^n)^t [ Σ_{x^j ∈ y^n} x^j (x^j)^t ] y^n = (y^n)^t Γ_n y^n.    (32)

The matrix Γ_n is the covariance matrix of the vectors in the nth cell. Maximizing Q_n with the constraint ||y^n|| = 1 yields

Γ_n y^n = Q_n y^n.    (33)

The centroid is the eigenvector of the matrix Γ_n corresponding to the most significant eigenvalue. The shape codebook is obtained using the generalized Lloyd algorithm [32] with splitting. At the end, the scalar uniform quantizers G_1 ... G_L are calculated in such a way that the second term in (31) is minimized. Simulation results are presented in Section VI.

IV. ITERATIVE ALGORITHMS

A. Iterative Standard Algorithm

At each iteration, a single vector f^{j(k)} is selected, and a single gain g_k is calculated. At the kth iteration, the perceptual error to be modeled is

p^k = p - Σ_{n=1}^{k-1} g_n f^{j(n)}.    (34)

The index j(k) that maximizes the orthogonal projection of p^k on f^j is now searched for. According to (20), with p = p^k and A = f^j, we have to maximize

(f^j, p^k)^2 / (f^j, f^j).    (35)

The corresponding gain at this stage is given by (19), which takes the following form:

g_k = (f^{j(k)}, p^k) / (f^{j(k)}, f^{j(k)}).    (36)

To prepare the next iteration (if k < K), the contribution of the k first vectors f^{j(n)} is subtracted from p:

p^{k+1} = p - Σ_{n=1}^{k} g_n f^{j(n)}.    (37)
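The iterative standard algorithm (34)-(37) is essentially a matching-pursuit loop and fits in a few lines. The data below are toy assumptions; only the selection criterion (35), gain (36), and residual update (37) follow the text.

```python
# Iterative standard algorithm: at each stage, pick the column maximizing
# the normalized crosscorrelation (35), compute its gain (36), and subtract
# its contribution from the target (37).
import numpy as np

def standard_algorithm(F, p, K):
    alpha = np.sum(F * F, axis=0)       # energies (f^j, f^j)
    residual = p.copy()                 # p^1 = p
    indices, gains = [], []
    for _ in range(K):
        beta = F.T @ residual           # crosscorrelations (f^j, p^k)
        j = int(np.argmax(beta ** 2 / alpha))       # criterion (35)
        g = beta[j] / alpha[j]                      # gain (36)
        residual = residual - g * F[:, j]           # update (37)
        indices.append(j)
        gains.append(g)
    return indices, gains, residual

rng = np.random.default_rng(1)
F = rng.standard_normal((8, 16))        # toy filtered codebook
p = rng.standard_normal(8)              # toy perceptual target
idx, g, res = standard_algorithm(F, p, K=3)
```

Each stage subtracts an orthogonal projection of the current residual, so the residual norm can only decrease, which is the monotonicity property the suboptimality discussion relies on.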


The crosscorrelation (f^j, p^{k+1}) necessary at the stage k+1 can be calculated from the crosscorrelation at the previous stage since

(f^j, p^{k+1}) = (f^j, p^k) - g_k (f^j, f^{j(k)}).

Let α_j denote the energy of the vector f^j and β_j^k the crosscorrelation (f^j, p^k) at the stage k. Thus, we obtain the standard algorithm:

  For j = 1 ... A
      α_j = (f^j, f^j) and β_j^1 = (f^j, p)
  For k = 1 ... K
      j(k) = arg max_j (β_j^k)^2 / α_j
      g_k = β_{j(k)}^k / α_{j(k)}
      For j = 1 ... A (if k < K)
          β_j^{k+1} = β_j^k - g_k (f^j, f^{j(k)})

[10]. This time, a projection of the vector p on the subspace spanned by f^{j(1)} ... f^{j(k)} is obtained, but this projection is not considered during the search for j(k).

Fig. 3. Multistage CELP coder using the standard algorithm for index selection.

C. Locally Optimal Algorithms

1) Basic Version of the Locally Optimal Algorithm: We have proposed [27] a third way to reduce the suboptimality of the standard algorithm. At the kth step, the already-determined subspace f^{j(1)} ... f^{j(k-1)} of dimension k-1 is augmented with the vector f^{j(k)} maximizing the norm of the projection of p on the k-dimensional subspace spanned by f^{j(1)} ... f^{j(k-1)} f^j. This algorithm yields a locally optimal index selection, i.e., at each stage, the best single vector is added to the already-existing subspace. However, this procedure is not globally optimal due to its iterative character, i.e., at each stage, only one vector is selected.

2) Orthogonalization of the Filtered Codebook (The Modified Gram-Schmidt Algorithm): The basic version of the locally optimal algorithm is too complex as compared with the standard algorithm. The computational cost may be reduced if an orthogonal basis is progressively created in the selected subspace. To create such a basis, it is sufficient to orthogonalize at each step the codebook F relative to the vector chosen at the previous step. Thus, at a stage k of the locally optimal algorithm, we deal with the k-dimensional subspaces spanned by the vectors f^{j(1)} ... f^{j(k-1)} f^j having the orthogonal bases f_orth^{j(1)} ... f_orth^{j(k-1)} f_orth^j. Maximizing the norm of the projection in these subspaces consists of choosing the vector maximizing (β_j^k)^2 / α_j^k with α_j^k = (f_orth^j, f_orth^j) and β_j^k = (f_orth^j, p). Once this vector is chosen, we may orthogonalize the codebook vectors f_orth^j relative to this vector using the formula
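The locally optimal selection with progressive orthogonalization can be sketched as below. This is a generic modified Gram-Schmidt deflation under assumed toy data, not the paper's exact RMGS recursion (whose low-cost bookkeeping is developed in [9] and the surrounding text); the function name and dimensions are hypothetical.

```python
# Locally optimal index selection: after each choice, orthogonalize the
# remaining filtered codebook vectors (and the target) relative to the
# chosen direction, so the next stage again reduces to the scalar
# criterion (35) on the orthogonalized codebook.
import numpy as np

def locally_optimal(F, p, K, eps=1e-12):
    Forth = F.astype(float).copy()      # progressively orthogonalized codebook
    target = p.astype(float).copy()
    indices = []
    for _ in range(K):
        alpha = np.sum(Forth * Forth, axis=0)
        beta = Forth.T @ target
        crit = np.zeros_like(alpha)
        mask = alpha > eps              # skip already-deflated columns
        crit[mask] = beta[mask] ** 2 / alpha[mask]
        j = int(np.argmax(crit))
        q = Forth[:, j] / np.sqrt(alpha[j])       # normalized chosen direction
        target = target - (q @ target) * q        # deflate the target
        Forth = Forth - np.outer(q, q @ Forth)    # Gram-Schmidt step on codebook
        indices.append(j)
    return indices

rng = np.random.default_rng(2)
F = rng.standard_normal((8, 12))
p = rng.standard_normal(8)
idx = locally_optimal(F, p, K=3)
```

Because a chosen column is annihilated by its own orthogonalization step, the same index cannot be selected twice, and each stage maximizes the projection norm on the augmented subspace, which is the locally optimal property claimed in the text.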

50 dB). No formal listening tests were performed.

B. Definition of the Reference Coder

Our reference coder is a multistage CELP coder (Fig. 3) with one adaptive codebook and one stochastic codebook populated with Gaussian random signals. The short-term predictor is updated every 20 ms (160 samples for 8 kHz sampling frequency) by an eighth-order LPC analysis based on Schur's algorithm. The log area ratios are coded on 36 bits, which corresponds to a bit rate of 1.8 kbit/s. The excitation signal is modeled using K vectors every 5 ms (N = 40). The first vector is extracted from an adaptive codebook consisting of A_1 = 128 vectors (with Qmin = 40), and the K-1 remaining vectors are selected from a stochastic codebook with A_2 = 128 vectors. The standard iterative algorithm is used for index selection. Every 20 ms, the perceptual energy is coded on 5 bits. Every 5 ms, the coefficient λ and the gain ratios are coded on 4 bits and the indices on 7 bits. The sign of the first gain must be transmitted. Therefore, the bit rate is 2.25 + 2.2*K kbit/s. This coder with K = 3 (8.85 kbit/s) is chosen as the reference coder.

To give an order of magnitude of the computation cost, we evaluate the number of multiplications/accumulations in Mflops (10^6 floating point operations per second). Using the properties of the Toeplitz adaptive codebook (a sample shift between two adjacent vectors), the iterative standard algorithm needs 6.8 Mflops for K = 3. Some details about this evaluation are shown in Table I. For the speech database previously described, we have obtained a segmental SNR equal to 10.46 dB without gain coding and 10.44 dB with in-loop gain coding. Henceforth, these values for the complexity and the SNR with gain coding are considered reference values. The other algorithms are evaluated with respect to their extra computational cost (or gain) and SNR improvement (or degradation). The quality-complexity tradeoff for the different excitation evaluation algorithms is shown in Fig. 6. The reference coder is attributed the number 1.

Fig. 5. Computational complexity of the multistage CELP coder: Case 1-Direct orthogonalization with a stochastic codebook; Case 2-RMGS algorithm with a stochastic codebook; Case 3-Direct orthogonalization with a sparse codebook; Case 4-Backward transform with a sparse codebook; Case 5-RMGS algorithm with a sparse codebook.

C. Iterative Algorithms

First, Fig. 4 shows the SNRs versus the number of selected vectors for different index selection algorithms without gain quantization. The locally optimal algorithms outperform the standard algorithm. Similar results are reported in [37] for the multipulse coder. The quality improvement is not very high at low bit rates (K = 2 or 3), but some locally optimal algorithms are used even in these cases [4], [35]. There are several implementations of the locally optimal index selection: direct orthogonalization (Section IV-C-2), the RMGS algorithm (Section IV-C-3), and backward transformation (Section IV-C-4). In Fig. 5, the complexity of the corresponding coders is evaluated and compared with the complexity of the reference coder described in the previous section. The stochastic codebook is either a full Gaussian stochastic codebook or a sparse codebook (90% zeros). The backward transformation algorithm is inefficient for a full
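The reference coder's 2.25 + 2.2*K kbit/s budget can be checked by adding up the stated bit allocations. The split of the fixed 2.25 kbit/s part (LPC + perceptual energy + first-gain sign) is our reading of the text, so treat the breakdown as an assumption:

```python
# Bit-rate budget of the reference coder (Section VI-B).
def reference_bit_rate(K):
    lpc = 36 / 0.020 / 1000          # 36 bits per 20 ms  -> 1.8  kbit/s (stated)
    energy = 5 / 0.020 / 1000        # 5 bits per 20 ms   -> 0.25 kbit/s (stated)
    sign = 1 / 0.005 / 1000          # 1 sign bit per 5 ms -> 0.2 kbit/s (our reading)
    per_vector = (7 + 4) / 0.005 / 1000   # 7 index + 4 gain bits per 5 ms -> 2.2 kbit/s
    return lpc + energy + sign + K * per_vector

rate = reference_bit_rate(3)         # 2.25 + 2.2*3 = 8.85 kbit/s
```

The fixed terms sum to 2.25 kbit/s and each excitation vector adds 2.2 kbit/s, reproducing the 8.85 kbit/s quoted for K = 3.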


tdB

.

+I 0

0

2

J

I

mo

MO

8

An excitation vector of dimension 40 may be obtained by concatenation of K = 4 short vectors of dimension 10 issued from four codebooks. We use the angle minimization algorithm with long-term prediction described in Section VC. The bit rate is 9.2 kbit/s (1.8 kbit/s for LPC, (7+5)*200 = 2.4 kbit/s for LTP, 4*800 = 3.2 kbit/s for indices, 0.8 kbit/s for signs of concatenated vectors, and 5*200 = 1 kbit/s for a gain). The algorithm complexity is reduced since the filtering operation and the energy calculation, which is applied to the stochastic codebook, are now effectuated on short codebooks. The crosscorrelation pi and the update of ~ ( k , j for ) the stochastic codebook are replaced by angle minimization. The total reduction is 3 Mflops. The SNR is 10.62 dB (case 10). Although the bit rate is slightly higher, this result is encouraging. The SNR may be improved at a cost of a slight complexity growth if the modifications mentioned in Section V-C are applied.

Mflops

-

IO

D.Excitation Modeling with One Gain Coefficient

22

Fig. 6. Quality-complexity tradeoff for different speech coders: Case 1-Iterative standard algorithm (8.85 kbit/s); Case 2-Algorithm with gain optimization at each step (8.85 kbit/s); Case 3-RMGS algorithm (8.85 kbit/s); Case 4-Iterative standard algorithm with mixed codebook (9.45 kbit/s); Case 5-RMGS algorithm with mixed codebook (9.45 kbit/s); Case 6-The stochastic codebook is forced to be Toeplitz (8.85 kbit/s); Case 7-The matrix H H t is forced to be Toeplitz (8.85 kbit/s); Case 8 - S a m e as 6 but Rh4GS algorithm (8.85 kbit/s); Case O a m e as 7 but RMGS algorithm (8.85 kbit/s); Case 1Woncatenation of short vectors (9.2 kbit/s); Case 11-Adaptive K-L transform (8.85 kbit/s); Case 12-Adaptive IC-L transform with orthogonalization (8.85 kbit/s).

Gaussian stochastic codebook (> 40 Mflops). The RMGS algorithm is efficient in all cases. For K = 3 (8.85 kbit/s), the algorithm with gain optimization at each iteration and the RMGS algorithm have a extra computational cost of 1.02 Mflops. The SNR is 10.62 dB for the first algorithm (case 2 in Fig. 6) and 10.95 dB for the RMGS algorithm (case 3). If a mixed codebook is used, i.e., if the adaptive and stochastic codebook are grouped together and the coder can choose K = 3 vectors from this mixed codebook [27], the bit rate increases by 600 bits/s, and the computational cost is 3.2 Mflops without orthogonalizationand 3.4 Mflops with orthogonalization. Without orthogonalization, the SNR is 10.94 dB (case 4). With orthogonalization, the SNR is 11.56 dB (case 5). The computationalcost may be reduced. Two simplifications are usual. The first one applied to the standard algorithm consists of forcing the stochastic codebook to be Toeplitz [39]. This algorithm now requires 0.14 Mflops instead of 2.05 Mflops for codebook filtering and 0.15 Mflops instead of 0.25 Mflops for the evaluation of energies. Other parameters are kept fixed. Thus, the gain in complexity is about 2 Mflops. For the standard algorithm, the SNR is 10.00 dB (case 6). If we suppress the filtered codebooks and force the matrix HtH to be Toeplitz, which is a widely used modification [40], the complexity gain is 1.7 Mflops. The SNR is 10.32 dB (case 7) for the standard algorithm. The RMGS algorithm extra computational cost is 1 Mflop. The SNR is 10.52 dB for the first simplication (case 8) and 10.51 dB for the second (case 9). Comparing the corresponding coders with the standard and RMGS algorithms (cases 1-3,4-5, 6-8,7-9), we observe the improvement of the quality of speech signal due to the locally optimal selection of excitation vectors.

E. Optimal Algorithms with Orthogonal Codebooks

Three orthogonal transform coders are considered: one with a fixed K-L transform, one with an adaptive K-L transform, and one with an adaptive K-L transform and orthogonalization. First, scalar quantization of the selected components of the transformed vector is studied. Selecting the K = 5 first components and using 5-bit quantizers yields a bit rate of 9.2 kbit/s and a degradation in speech quality of 2.1 dB for the adaptive K-L transform. Choosing the K = 3 most significant components yields a bit rate of 10.2 kbit/s and a degradation of 1.2 dB. More encouraging results have been obtained using VQ of the transformed vectors. Two shape codebooks composed of 128 vectors each have been designed for the two subbands (N1 = 5 and N2 = 35) with a training ratio (size of the training set / size of the codebook) equal to 50. The norm of the perceptual signal is coded on 5 bits every 20 ms. The two gains are coded on 4 bits every 5 ms. With 1.8 kbit/s for LPC and 2.4 kbit/s for LTP as before, the bit rate is 8.85 kbit/s. The segmental signal-to-noise ratios are the following: 8.4 dB for the fixed transform (not shown in Fig. 6), 11.2 dB for the adaptive transform (case 11), and 11.6 dB for the adaptive transform with orthogonalization (case 12). Complexity increases drastically if the adaptive K-L transform is used. The calculation of the eigenvalues and eigenvectors of HHt requires a number of Mflops that depends on the required precision. The algorithm described in [28] requires approximately 15 Mflops. The calculation of the transformed vector, the inverse transformation, and the two filtering operations needed at the synthesis part have negligible complexity; therefore, the increase in processing complexity is on the order of 15 Mflops. The adaptive K-L transform coders outperform the multistage CELP coders (compare cases 11 and 12 in Fig. 6 with the other coders, except cases 4 and 5, which correspond to higher bit rates).
Our preliminary experiments reveal that this quality improvement increases at higher bit rates. However, the extra computational cost is considerable.
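The adaptive K-L transform principle can be illustrated with a minimal sketch: the target is projected on the eigenvectors of HHt (H being the lower-triangular filtering matrix built from the impulse response of the weighted synthesis filter), and only the K most significant components are kept. Quantization, the subband split, and gain coding are omitted here, and the function name `klt_truncate` is our own assumption.

```python
import numpy as np

def klt_truncate(target, h, K):
    """Adaptive K-L transform coding sketch (no quantization).

    target : perceptual-domain target vector, shape (N,)
    h      : impulse response of the weighted synthesis filter, shape (N,)
    Keeps the K components along the largest eigenvalues of H H^t.
    """
    N = len(target)
    # lower-triangular convolution (filtering) matrix H
    H = np.array([[h[i - j] if i >= j else 0.0 for j in range(N)]
                  for i in range(N)])
    # eigenvectors of H H^t form the adaptive orthonormal transform
    w, V = np.linalg.eigh(H @ H.T)
    order = np.argsort(w)[::-1]          # most significant first
    V = V[:, order[:K]]
    coeffs = V.T @ target                # forward transform, truncated
    approx = V @ coeffs                  # inverse transform
    return coeffs, approx
```

The cost of the eigendecomposition, recomputed every frame as H changes, is exactly the roughly 15-Mflop burden discussed above.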


VII. CONCLUSION

The locally optimal algorithms for index selection are recommended for multistage CELP coders as well as for multipulse coders [37]. They offer an improvement in speech quality with no increase in bit rate. There are different implementations of the locally optimal algorithms: direct orthogonalization of filtered vectors [4], backward transformation [35], and the RMGS algorithm [9]. The RMGS algorithm is the most efficient implementation of the locally optimal index choice for full and sparse codebooks. The computational cost may be reduced using different approaches [23], [39], [40]. One further approach is the concatenation of short vectors with index selection by the angle minimization algorithm. This coder offers a good quality-complexity tradeoff (case 10 in Fig. 6). In the CELP coder, the codebook is adapted to the spectral properties of the signal by filtering. An alternative approach is used in the adaptive K-L transform coder: The codebook remains fixed, but the coordinates are rotated according to the spectral properties of speech. This approach seems promising, especially for higher bit rates (for example, wideband coders). Its computational complexity is still too high, however, and its reduction may be a topic for further research.
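The angle minimization criterion can be sketched for a single selection step: with one unique gain, choosing the filtered codebook vector whose angle with the target is smallest is equivalent to maximizing the normalized cross-correlation, after which the single least-squares gain is computed. This is a hedged, one-step illustration (the function name `angle_min_select` is our own); the paper's iterative version applies the criterion repeatedly as the concatenated excitation is built up.

```python
import numpy as np

def angle_min_select(target, filtered_cb):
    """Single-gain index selection by angle minimization (one-step sketch).

    target      : perceptual target vector, shape (N,)
    filtered_cb : filtered codebook vectors as rows, shape (L, N)
    Returns the index minimizing the angle with the target and its gain.
    """
    norms = np.linalg.norm(filtered_cb, axis=1)
    # cosine of the angle between each candidate and the target
    cos = (filtered_cb @ target) / (norms * np.linalg.norm(target) + 1e-12)
    j = int(np.argmax(np.abs(cos)))      # sign is absorbed by the gain
    g = (filtered_cb[j] @ target) / (filtered_cb[j] @ filtered_cb[j])
    return j, g
```

Because only the angle, not the gain, enters the search, the criterion needs one normalization per candidate and no per-candidate gain computation, which is what keeps this coder on the favorable side of the quality-complexity tradeoff.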

ACKNOWLEDGMENT

We would like to thank W. Vos for many helpful suggestions and comments.

REFERENCES

[1] B. Atal and J. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1982, pp. 614-617.
[2] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1985, pp. 937-940.
[3] J. Campbell, T. Tremain, and V. Welch, "The DoD 4.8 kbps standard (Proposed Federal Standard 1016)," in Advances in Speech Coding (Atal, Cuperman, and Gersho, Eds.). Boston: Kluwer, 1991.
[4] I. Gerson and M. Jasiuk, "Vector sum excited linear prediction (VSELP) speech coding at 8 kbps," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1990, pp. 461-464.
[5] P. Vary et al., "Speech codec for the European mobile radio system," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1988, pp. 227-230.
[6] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoust. Speech Signal Processing, vol. ASSP-27, June 1979.
[7] R. Rose and T. Barnwell, "The self excited vocoder-An alternate approach to toll quality at 4800 bps," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1986, pp. 453-456.
[8] N. Moreau and P. Dymarski, "Successive orthogonalizations in the multistage CELP coder," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1992.
[9] P. Dymarski, N. Moreau, and A. Vigier, "Optimal and sub-optimal algorithms for selecting the excitation in linear predictive coders," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1990, pp. 485-488.
[10] M. Berouti, H. Garten, P. Kabal, and P. Mermelstein, "Efficient computation and encoding of the multi-pulse excitation for LPC," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1984, pp. 10.1.1-10.1.4.
[11] J. P. Lefevre and O. Passien, "Efficient algorithms for obtaining multipulse excitation for LPC coders," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1985, pp. 957-960.
[12] E. Deprettere and P. Kroon, "Regular excitation reduction for effective and efficient LP-coding of speech," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1985, pp. 965-968.
[13] W. Kleijn, D. Krasinski, and R. Ketchum, "An efficient stochastically excited linear predictive coding algorithm for high quality low bit rate transmission of speech," Speech Commun., pp. 305-316, 1988.
[14] R. Rose and T. Barnwell, "Design and performance of an analysis-by-synthesis class of predictive speech coders," IEEE Trans. Acoust. Speech Signal Processing, Sept. 1990.
[15] W. Kleijn, D. Krasinski, and R. Ketchum, "Fast methods for the CELP speech coding algorithm," IEEE Trans. Acoust. Speech Signal Processing, vol. 38, no. 8, Aug. 1990.
[16] R. Ramachandran and P. Kabal, "Pitch prediction filters in speech coding," IEEE Trans. Acoust. Speech Signal Processing, vol. 37, no. 4, Apr. 1989.
[17] J. Marques, J. Tribolet, I. Trancoso, and L. Almeida, "Pitch prediction with fractional delays in CELP coding," in Proc. Eurospeech, 1989, pp. 509-512.
[18] P. Kroon and B. Atal, "Pitch predictors with high temporal resolution," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1990, pp. 661-664.
[19] G. Davidson and A. Gersho, "Multiple-stage vector excitation coding of speech waveforms," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1988, pp. 163-166.
[20] G. Davidson, M. Yong, and A. Gersho, "Real-time vector excitation coding of speech at 4800 bps," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1987, pp. 2189-2192.
[21] G. Davidson and A. Gersho, "Complexity reduction methods for vector excitation coding," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1986, pp. 3055-3058.
[22] A. Le Guyader, P. Combescure, C. Lamblin, M. Mouly, and J. Zurcher, "A robust 16 kbit/s vector adaptive predictive coder for mobile communications," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1986, pp. 857-860.
[23] M. Ireton and C. Xydeas, "On improving vector excitation coders through the use of spherical lattice codebooks," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1989, pp. 57-60.
[24] J. P. Adoul, P. Mabilleau, M. Delprat, and S. Morisette, "Fast CELP coding based on algebraic codes," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1987, pp. 1957-1960.
[25] R. Salami, K. Wong, R. Steele, and D. Appleby, "Performance of error protected binary pulse excitation coders at 11.4 kb/s over mobile radio channels," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1990, pp. 473-476.
[26] T. Taniguchi, Y. Tanaka, A. Sasama, and Y. Ohta, "Principal axis extracting vector excitation coding: High quality speech at 8 kb/s," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1990, pp. 241-244.
[27] N. Moreau and P. Dymarski, "Mixed excitation CELP coder," in Proc. Eurospeech, 1989, pp. 322-325.
[28] G. Golub and C. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins University Press, 1983 (2nd ed. 1989).
[29] E. Ofer, D. Malah, and A. Dembo, "A unified framework for LPC excitation representation in residual speech coders," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1989, pp. 41-44.
[30] V. Sanchez-Calle, J. Lopez-Soler, J. Segura-Luna, A. Peinado-Herreros, and A. Rubio-Ayuso, "Increasing the difference between the significant and the non-significant singular values in a model of LPC excitation based on the SVD," Signal Processing V, 1990.
[31] P. Chang, R. Gray, and J. May, "Fourier transform vector quantization for speech coding," IEEE Trans. Commun., vol. COM-35, no. 10, pp. 1059-1068, Oct. 1987.
[32] A. Gersho and R. Gray, Vector Quantization and Signal Compression. Boston: Kluwer, 1991.
[33] M. Sabin, "Fixed-shape adaptive gain vector quantization for speech waveform coding," Speech Commun., vol. 8, pp. 177-183, 1989.
[34] M. Sabin and R. Gray, "Product code vector quantizers for waveform and voice coding," IEEE Trans. Acoust. Speech Signal Processing, vol. ASSP-32, no. 3, June 1984.
[35] T. Taniguchi, M. Johnson, and Y. Ohta, "Multi-vector pitch-orthogonal LPC: Quality speech with low complexity at rates between 4 and 8 kbps," in Proc. ICSLP, 1990, pp. 113-116.
[36] S. Singhal, "Reducing computation in optimal amplitude multipulse coders," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1986, pp. 2363-2366.
[37] S. Singhal and B. Atal, "Amplitude optimization and pitch prediction in multipulse coders," IEEE Trans. Acoust. Speech Signal Processing, vol. 37, no. 3, Mar. 1989.
[38] P. Kroon and B. Atal, "Quantization procedures for the excitation in CELP coders," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1987, pp. 1650-1654.
[39] D. Lin, "Speech coding using efficient pseudo-stochastic block codes," in Proc. Int. Conf. Acoust. Speech Signal Processing, 1987, pp. 1354-1357.
[40] I. Trancoso and B. Atal, "Efficient search procedures for selecting the optimum innovation in stochastic coders," IEEE Trans. Acoust. Speech Signal Processing, vol. 38, no. 3, pp. 385-396, Mar. 1990.

Nicolas Moreau received the engineering degree from the Institut National Polytechnique de Grenoble, France, in 1969 and the M.S. degree from Laval University, Quebec, Canada, in 1972. He joined the Ecole Nationale Supérieure des Télécommunications in 1972, where he is currently assistant professor. From 1981 to 1986, his research interests included signal processing and VLSI circuit design. Since 1986, he has been engaged in research in the area of signal compression, especially low bit rate speech coding.


Przemyslaw Dymarski was born in Wroclaw, Poland, in 1951. He received the B.S. and Ph.D. degrees from the Technical University of Wroclaw in 1974 and 1983, respectively, both in electrical engineering. In 1976, he joined the Institute of Telecommunications, Technical University of Warsaw, where he works on speech analysis and coding and digital signal processing. Since 1986, he has worked in cooperation with the France Telecom University Department of Signals.
