Dimension Reduction Techniques for Training Polynomial Networks

William M. Campbell and Kari Torkkola, Motorola Human Interface Lab, 2100 East Elliot Road, M/D EL508, Tempe, AZ 85284
Sreeram V. Balakrishnan, Motorola Human Interface Lab, 3145 Porter Drive, Palo Alto, CA 94304
Emails: p27439@email.mot.com, a540aa@email.mot.com, fsb027@email.mot.com

Abstract

We propose two novel methods for reducing dimension in training polynomial networks. We consider the class of polynomial networks whose output is the weighted sum of a basis of monomials. Our first method for dimension reduction eliminates redundancy in the training process. Using an implicit matrix structure, we derive iterative methods that converge quickly. A second method for dimension reduction involves a novel application of random dimension reduction to "feature space." The combination of these algorithms produces a method for training polynomial networks on large data sets with decreased computation over traditional methods and model complexity reduction and control.

1. Introduction

We consider polynomial networks of the following type. The inputs, x_1, ..., x_M, to the network are combined with multipliers to form a vector of basis functions p(x); for example, for two inputs x_1 and x_2 and a second degree network, we obtain

p(x) = [1  x_1  x_2  x_1^2  x_1 x_2  x_2^2]^t.   (1)

A second layer linearly combines all these inputs to produce scores w^t p(x). We call w the classification (or verification) model. In general, polynomial basis terms of the form x_{i_1} x_{i_2} ... x_{i_k} are used, where k is less than or equal to the polynomial degree, K. For each input vector, x_i, and each class, j, a score is produced by the inner product w_j^t p(x_i). If a sequence of input vectors is introduced to the classifier, the total score is the average score over all inputs, s_j = (1/M) \sum_{i=1}^{M} w_j^t p(x_i). The total score is used for classification or verification. Note that we do not use a sigmoid on the output as is common in higher-order neural networks (Giles & Maxwell, 1987).

Training techniques for polynomial networks fall into several categories. The first category of techniques estimates the parameters for the polynomial expansion based on in-class data (Fukunaga, 1990; Specht, 1967). These methods approximate the class-specific probabilities (Schurmann, 1996). Since out-of-class data is not used for training a specific model, accuracy is limited. A second category of methods involves discriminative training (Schurmann, 1996) with a mean-squared error criterion. The goal of these methods is to approximate the a posteriori distribution for each class. This method traditionally involves decomposition of large matrices, so that it is intractable for large training sets in terms of both computation and storage. A more recent method involves the use of support vector machines; this method uses the technique of structural risk minimization. We use an alternate training technique based upon the method in Campbell and Assaleh (1999) which approximates a posteriori probabilities.
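As a concrete illustration of the expansion in (1) and the averaged sequence score described above, the following minimal NumPy sketch may help; the code and its function names are ours and are not part of the original system.

```python
import numpy as np

def p_degree2(x):
    """Second-degree polynomial basis of Eq. (1) for a two-dimensional input."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

def sequence_score(w, X):
    """Average score (1/M) * sum_i w^t p(x_i) over a sequence of input vectors X."""
    return np.mean([w @ p_degree2(x) for x in X])

# Example: score a short sequence of two-dimensional input vectors.
w = np.array([0.1, 0.3, -0.2, 0.05, 0.0, 0.02])        # a classification model
X = np.random.default_rng(0).standard_normal((5, 2))    # M = 5 input vectors
print(sequence_score(w, X))
```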

The use of polynomial networks and our training/classification method is motivated from an application perspective. First, the discriminative training method in Campbell and Assaleh (1999) can be applied to very large data sets efficiently. For the application we consider in speech processing, this property is critical since we want to be able to train systems in a reasonable amount of time without custom hardware. Second, discriminative training of polynomial networks produces state-of-the-art performance in terms of number of parameters needed, accuracy, and computation effort for several applications including speaker and isolated word recognition. Polynomial networks outperform many other common techniques used for these applications because they are discriminatively trained and approximate a posteriori probabilities. In contrast, techniques such as Gaussian mixture models and hidden Markov models use maximum likelihood training and approximate in-class probabilities. For open set problems, this leads to difficulties since we do not model the out-of-class set well (although partial solutions such as "cohort normalization" have been proposed (Campbell, Jr., 1995)). Finally, the training method we use encapsulates the statistics of an entire class into a single vector which eliminates the need for storing training data. For instance, for the application of speaker verification (a two class problem), we collapse the entire population of impostors (over one million vectors of size 24) into a single vector of approximately 20,000 elements which is then stored for later enrollment (training) of legitimate users.

Two difficulties arise in the process of training a polynomial network. First, the dimension of the vector p(x) grows quickly with the degree of the network; we denote the output p(x) as being in "feature space" in analogy to the terminology used for support vector machines (SVMs). For example, with an input vector of dimension 24, the vector p(x) is of dimension 25 (degree 1), 325 (degree 2), 2925 (degree 3), etc. We would like to have finer granularity in this growth since it impacts model complexity. A second difficulty in training polynomial networks arises from the redundancy of polynomial terms in training (to be explained in Section 3). Training involves a feature space with a dimension corresponding to twice the degree of the classification network. For instance, if we classify with a first order (linear) network, we must compute second order statistics (correlations). Until the final step of the training process, we use only these statistics. At the final step, we introduce redundancy to construct a "higher order" correlation matrix. This final step increases resources (especially memory usage) dramatically in the algorithm. We propose an iterative method for solution which avoids this process. The resulting solution makes it possible to train larger problems or train on systems with limited resources.

The paper is organized as follows. In Section 2, we review the method for training a polynomial classifier given in Campbell and Assaleh (1999). In Section 3, we show how iterative techniques can be applied to eliminate redundancy in training. Section 4 introduces the process of random dimension reduction in feature space. A random direct method is proposed in Section 4.1. This method makes it possible to control model complexity easily and effectively. In Section 4.2, we show how the method can be implemented quickly using a fast Fourier transform (FFT). Section 5 shows how to combine iterative methods and random dimension reduction to eliminate redundancy and control model complexity. Section 6 illustrates the use of the algorithms on the task of speaker verification.

2. Direct Training Method

We train the polynomial network to approximate an ideal output using mean-squared error as the objective criterion. We deal with the multi-class problem in the following discussion of training.

We define M_i as the matrix whose rows are the polynomial expansion of class i's data; i.e., M_i = [p(x_{i,1})  p(x_{i,2})  ...  p(x_{i,N_i})]^t, where N_i is the number of training vectors for class i. We define M = [M_1^t  M_2^t  ...  M_{N_{classes}}^t]^t, where N_{classes} is the number of classes. The training problem is

w_i = argmin_w ||M w - o_i||^2,   (2)

where o_i is the vector consisting of N_i ones in the rows where the ith class's data is located and zeros otherwise.

Applying the method of normal equations (Golub & Van Loan, 1989) to (2) gives the following problem:

M^t M w_i = M^t o_i.   (3)

Define 1 to be the vector of all ones. We rearrange (3) to

\sum_{j=1}^{N_{classes}} M_j^t M_j w_i = M_i^t 1.   (4)

If we define R_j = M_j^t M_j, then (4) becomes

( \sum_{j=1}^{N_{classes}} R_j ) w_i = M_i^t 1.   (5)

Equation (5) is a significant step towards our training method. The problem is now separable. We can individually compute R_j for each class j and then combine the final result into a matrix R = \sum_{j=1}^{N_{classes}} R_j.

One advantage of using the matrices R_j is that their size does not change as more training data becomes available. Also, the unique terms in R_j are exactly the sums of basis terms of degree 2K or less, where K is the polynomial network degree. We denote the terms of degree 2K or less for a vector x as a vector, p_2(x). See Table 1 for the compression factor for a dimension 12 input vector. Note that the elimination of redundancy decreases both storage and computation; e.g., when training a second degree network we reduce computation by a factor of 4.55 by computing p_2(x) instead of p(x) p(x)^t.

Table 1. Term redundancies for the matrix R_j.

Degree   Terms in R_j   Unique Terms   Ratio
  2           8,281          1,820      4.55
  3         207,025         18,564     11.15
  4       3,312,400        125,970     26.30

The vector p(x) can be calculated recursively. Suppose we have the polynomial basis terms of degree k, and we wish to calculate the terms of degree k + 1. If we have the k-th degree terms with end term having i_k = l as a vector u_l, we obtain the (k+1)-th degree terms ending with i_{k+1} = l as [x_l u_1^t  x_l u_2^t  ...  x_l u_l^t]^t. Concatenating the different degrees gives the vector p(x).

Combining all of the above methods results in the training algorithm in Table 2. This training algorithm is not limited to polynomials. One key enabling property is that the basis elements form a semigroup; i.e., the product of two basis elements is again a basis element. This property allows one to compute only the unique elements in the matrix R_i. Another key property is the partitioning of the problem. If a linear approximation space is used, then we obtain the same problem as in (2). The resulting matrix can be partitioned, and the problem can be broken up to ease memory usage. Our method of using normal equations to fit data is a natural extension of older methods used for function fitting (Golub & Van Loan, 1989) and radial basis function training (Bishop, 1995). We have extended these methods in two ways. First, we partition the problem by class; i.e., we calculate and store the r_i for each class separately. This is useful since in many cases we can use the r_i for adaptation, new class addition, or on-line processing. We also can obtain the right-hand side of (5) as a subvector of r_i, which reduces computation significantly as we find each w_i. Second, we have used the semigroup property of the monomials to reduce computation and storage dramatically. We note that the normal equation method is absent from most expositions on training polynomial networks. This may be due to the fact that the condition number of the matrix R is squared; we have not found this to be a problem in practice. A disadvantage of the training algorithm in Table 2 is that in Step 9, the compressed representation is expanded into a matrix. This results in a dramatic increase in resources for the algorithm (by the factors shown in Table 1). We derive a method which avoids this expansion in the next section.

Table 2. Training algorithm.

1) For i = 1 to N_{classes}
2)   Let r_i = 0.
3)   For j = 1 to N_i
4)     Retrieve training vector j, x_{i,j}, from class i's training set.
5)     Let r_i = r_i + p_2(x_{i,j}).
6)   Next j
7) Next i
8) Compute r = \sum_{i=1}^{N_{classes}} r_i.
9) Expand r to R. Derive M_i^t 1 from r_i.
10) For all i, solve R w_i = M_i^t 1 for w_i.
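The following NumPy sketch mirrors the steps of Table 2 on a toy problem. The helper names (terms_up_to, expand, train_models) are ours, and the code is an illustration rather than the authors' implementation; it accumulates the compressed statistics r_i, expands them to R (Step 9), and solves the normal equations of (5).

```python
import numpy as np
from itertools import combinations_with_replacement

def terms_up_to(n_vars, degree):
    """Monomials of total degree <= degree, as sorted tuples of variable indices."""
    out = []
    for d in range(degree + 1):
        out.extend(combinations_with_replacement(range(n_vars), d))
    return out

def expand(x, terms):
    """Evaluate the monomial basis at x (p(x) or p_2(x), depending on `terms`)."""
    return np.array([np.prod(x[list(t)]) if t else 1.0 for t in terms])

def train_models(class_data, n_vars, degree):
    """Direct training in the spirit of Table 2 (a sketch under our own naming).

    class_data: list of (N_i, n_vars) arrays, one per class.
    Returns one weight vector w_i per class.
    """
    p_terms = terms_up_to(n_vars, degree)        # basis for p(x)
    p2_terms = terms_up_to(n_vars, 2 * degree)   # unique terms of R (p_2 statistics)
    where = {t: k for k, t in enumerate(p2_terms)}

    # Steps 1-7: accumulate the compressed statistics r_i for each class.
    r_per_class = []
    for X in class_data:
        r_i = np.zeros(len(p2_terms))
        for x in X:
            r_i += expand(x, p2_terms)
        r_per_class.append(r_i)
    r = sum(r_per_class)                         # Step 8

    # Step 9: expand r to the full matrix R (the redundant step that Section 3
    # later avoids). Entry (a, b) is the r entry of the product monomial.
    D = len(p_terms)
    R = np.empty((D, D))
    for a in range(D):
        for b in range(D):
            R[a, b] = r[where[tuple(sorted(p_terms[a] + p_terms[b]))]]

    # Step 10: M_i^t 1 is the leading subvector of r_i; solve R w_i = M_i^t 1.
    return [np.linalg.solve(R, r_i[:D]) for r_i in r_per_class]

# Toy usage: two classes of 2-dimensional data, a degree-2 network.
rng = np.random.default_rng(1)
models = train_models([rng.standard_normal((50, 2)) + 1.0,
                       rng.standard_normal((50, 2)) - 1.0], n_vars=2, degree=2)
print(models[0][:3])
```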

3. Eliminating Redundancy in Training

3.1 Iterative Training Methods

Iterative techniques to solve linear equations have typically been used in two areas. In the numerical analysis community, methods are targeted toward solving large sparse systems (Golub & Van Loan, 1989). In the engineering community, approaches have concentrated on using iterative methods for recursive learning (Schurmann, 1996). We experiment with iterative methods from both areas.

Iterative methods are a common technique for solving linear equations, e.g., Rw = b. In most cases, an iterative method is based upon computing a product Rp, where p is some auxiliary vector. Using this data, a descent direction d is obtained. A new solution estimate is then given by w_{i+1} = w_i + αd, where α is some suitably chosen scalar.

A common method for iterative training is implemented in Kaczmarz' algorithm for recursive learning (Schurmann, 1996; Kaczmarz, 1937). The method uses the update

w_{i+1} = w_i + η (b_j - s_j w_i) s_j^t,   (6)

where s_j is the jth row of R, b_j is the jth entry of b, and 0 < η ||s_j||_2^2 < 2. We use η = 1/||s_j||_2^2 in our experiments. The two main advantages of this method are (1) it is computationally simple, and (2) the update involves only one row of R.

More sophisticated algorithms for iterative training are the successive over-relaxation (SOR) algorithm and the conjugate gradient (CG) algorithm. The SOR algorithm is a generalization of the well-known Gauss-Seidel method with a parameter 0 < ω < 2 which can be varied to give different convergence rates. The conjugate gradient algorithm is another popular method. It has the advantage that there are no direct parameters to estimate, and its convergence rate is determined by the condition of the matrix R. We have selected these iterative methods because of their common use and applicability to our problem.
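As a small numerical illustration of the Kaczmarz update in (6), the sketch below (our own, not the authors' code) sweeps cyclically over the rows of a synthetic system with the step size η = 1/||s_j||² used above.

```python
import numpy as np

def kaczmarz_solve(R, b, n_iters=2000):
    """Solve R w = b with the Kaczmarz update of Eq. (6).

    Each step uses a single row s_j of R and step size eta = 1/||s_j||^2,
    which satisfies 0 < eta * ||s_j||^2 < 2.
    """
    n = R.shape[0]
    w = np.zeros(n)
    for it in range(n_iters):
        j = it % n                          # cycle through the rows
        s_j = R[j]
        eta = 1.0 / np.dot(s_j, s_j)
        w = w + eta * (b[j] - s_j @ w) * s_j
    return w

# Small synthetic example with a symmetric positive definite R.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
R = A.T @ A + 0.1 * np.eye(10)
b = rng.standard_normal(10)
w = kaczmarz_solve(R, b)
print(np.linalg.norm(R @ w - b) / np.linalg.norm(b))   # relative residual, cf. Eq. (13)
```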

We use iterative methods to solve the equation shown in Step 10 of Table 2. Several properties of R are critical. First, R is symmetric, nonnegative definite, and square by structure. Second, we assume (with no violations in practice for our application) that R is nonsingular. These properties allow all of the mentioned iterative methods to be applied.

3.2 Matrix-Vector Multiply Algorithm

The core of our iterative algorithm is a method for computing Rw for an arbitrary w without explicitly performing the mapping from r to R; this process saves considerable memory and eliminates the "redundancy" of computing with R. The basic idea is to utilize the structure of the matrix R. We specialize to the case where we map a class's vector r_k to a matrix structure as in Step 9 in Table 2. The matrix R_k is obtained from a sum of outer products

R_k = \sum_{i=1}^{N_k} p(x_{k,i}) p(x_{k,i})^t.   (7)

The mapping in Step 9 is based upon the fact that it is not necessary to compute the sum of outer products (7) directly. Instead one can compute the subset of unique entries (i.e., the vector p_2(x)), and then map this result to the final matrix.

A straightforward way to implement the mapping is to precompute an index function. We first label each entry in the matrix R_k in column-major form from 1 to the number of entries. The structure of R_k is determined by one outer product term from (7), p(x) p(x)^t. Using a computer algebra package, one can compute the outer product and p_2(x) with symbolic variables. Then an exhaustive search for each entry of R_k in p_2(x) yields the required index map. An example of such a mapping is shown in Figure 1.

Figure 1. Index function for an 8-dimension input and degree 3 (index in R versus index in r).

The difficulty in using an index function is that the index map must be stored. To avoid this problem, we propose an alternate method based upon a simple property: the semigroup structure of the monomials. Suppose we have an input vector with n variables, x_1, ..., x_n. The mapping

x_{i_1} x_{i_2} ... x_{i_k} -> q_{i_1} q_{i_2} ... q_{i_k},   (8)

where q_{i_j} is the i_j'th prime, defines a semigroup isomorphism between the monomials and the natural numbers (we define 1 to map to 1). For example, we map x_1^2 x_2 = x_1 x_1 x_2 to q_1 q_1 q_2 = 2 · 2 · 3 = 12 (2 is the first prime). We can implement the mapping in Table 2 efficiently using this semigroup isomorphism since it transforms symbol manipulation (monomials) into number manipulation.

Based upon the mapping in (8), an algorithm for computing an arbitrary product, Rw, was derived; see Table 3. The basic idea is as follows. We first compute the numerical equivalents to p(x) and p_2(x) using the mapping (8) in Steps 1-2. We sort the resulting vector v_2 so that it can be searched quickly. Steps 4-12 obtain the ith entry of y using a matrix multiply; i.e., the (i, j)th entry of R is multiplied by the jth entry of w and summed over j to obtain y_i.

Table 3. Calculation of y = Rw.

1) Let q be the vector of the first n primes.
2) Let v = p(q) and v_2 = p_2(q).
3) Sort v_2 into a numerically increasing vector, v_2'. Store the permutation, π, which maps v_2' to v_2.
4) For i = 1 to (number of rows of R)
5)   Let y_i = 0.
6)   For j = 1 to (number of columns of R)
7)     Compute n = v_i v_j.
8)     Perform a binary search for n in v_2'; call the index of the resulting location i_n'.
9)     Using the permutation π, find the index, i_n, in v_2 corresponding to the index i_n' in v_2'.
10)    y_i = y_i + r_{i_n} w_j
11)   Next j
12) Next i
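A NumPy sketch of the prime-encoding idea behind Table 3 is given below; it computes y = Rw directly from the compressed vector r by looking up the code of each monomial product in the sorted encoding of p_2. The function and variable names are ours, and the brute-force check against an explicit R is only feasible at toy scale.

```python
import numpy as np
from itertools import combinations_with_replacement

def terms_up_to(n_vars, degree):
    """Monomials of total degree <= degree, as sorted tuples of variable indices."""
    out = []
    for d in range(degree + 1):
        out.extend(combinations_with_replacement(range(n_vars), d))
    return out

def expand(x, terms):
    """Evaluate the monomial basis at x."""
    return np.array([np.prod(x[list(t)]) if t else 1.0 for t in terms])

def encode(terms, primes):
    """Map each monomial to a product of primes, Eq. (8); int64 suffices at toy scale."""
    out = np.ones(len(terms), dtype=np.int64)
    for i, t in enumerate(terms):
        for idx in t:
            out[i] *= primes[idx]
    return out

def matvec_from_r(r, w, n_vars, degree, primes):
    """Compute y = R w using only the compressed vector r (Table 3, as a sketch).

    r holds the sums of basis terms of degree <= 2*degree (the p_2 statistics);
    R's (i, j) entry equals the r entry whose monomial is the product of
    monomials i and j, located here by a binary search on the prime codes.
    """
    v = encode(terms_up_to(n_vars, degree), primes)        # codes for p(x) terms
    v2 = encode(terms_up_to(n_vars, 2 * degree), primes)   # codes for p_2(x) terms
    order = np.argsort(v2)                                 # permutation pi
    v2_sorted = v2[order]
    y = np.zeros(len(v))
    for i in range(len(v)):
        codes = v[i] * v                                   # products v_i * v_j for all j
        pos = np.searchsorted(v2_sorted, codes)            # binary search in v_2'
        y[i] = np.dot(r[order[pos]], w)                    # sum_j R[i, j] * w[j]
    return y

# Tiny check against explicitly forming R (feasible only at small scale).
primes = [2, 3, 5, 7, 11, 13, 17, 19]
n_vars, degree = 3, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((20, n_vars))
P = np.array([expand(x, terms_up_to(n_vars, degree)) for x in X])
r = np.array([expand(x, terms_up_to(n_vars, 2 * degree)) for x in X]).sum(axis=0)
R = P.T @ P
w = rng.standard_normal(P.shape[1])
print(np.allclose(R @ w, matvec_from_r(r, w, n_vars, degree, primes)))  # True
```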

4. Random Dimension Reduction

Another consideration in the design of polynomial networks is model complexity control. Because of the large increase of terms with degree, there are large jumps in model complexity. We would like to control this complexity to achieve better generalization and use less storage.

4.1 Direct Method

A natural way to control model complexity is to linearly transform "feature space." That is, we replace the expansion p(x) by a transformed version Tp(x). This results in solving the optimization problem

w_i = argmin_w ||M T^t w - o_i||^2,   (9)

or equivalently the normal equation

T R T^t w_i = T (M_i^t 1).   (10)

The quantity T^t w in (9) is the model in the original space, and w is the model in the reduced dimension.

Standard techniques for reducing dimension through linear transformation are principal component analysis (PCA) and linear discriminant analysis (LDA). Although these are excellent techniques, they pose some difficulty in application to the problem (9). First, LDA and PCA require eigenvalue analysis in large dimensional spaces for even moderate degree polynomial networks. For instance, with 12 inputs and 4th degree, the resulting polynomial has 1,820 model terms. Second, we cannot reduce storage significantly with LDA and PCA since they require us to store the matrix T as well as the models w_i.

As an alternative, we propose the use of random dimension reduction (Kaski, 1998) in feature space. In Kaski (1998), random dimension reduction is used on sparse document vectors whose dimension may be in the thousands. In contrast to the document vector reduction problem, our vectors are dense. The idea of random dimension reduction is simple. We generate a matrix whose entries are IID Gaussian. This results in a matrix whose rows are approximately orthogonal (with a better approximation as the dimension becomes larger). Using this matrix for dimension reduction preserves similarity; see Kaski (1998) for more details. A novel application of this process is that we only have to store the seed of the random number generator to regenerate the entire matrix T; i.e., we store the seed and w in (9) to reconstruct the original dimension model, T^t w. We note that this may have application to securing the model (in the case of speaker recognition) since the model T^t w cannot be recovered without the "key" seed value.

4.2 Fast Dimension Reduction for Dense Vectors

For very large dimension vectors, even a matrix multiply can be slow. Thus, it is advantageous to consider a fast method of dimension reduction. The basic idea for fast dimension reduction is to generate an orthogonal matrix that has considerable structure. A natural choice is a circulant matrix. One can then perform fast dimension reduction using an FFT.

The construction of the circulant matrix proceeds as follows. First, every circulant matrix, C, may be expressed in the form

C = F^H D F,   (11)

where F is the Discrete Fourier Transform (DFT) matrix and D is a diagonal matrix. Second, since we require C to be orthogonal, this implies that

D = diag(e^{jθ_1}, ..., e^{jθ_n}).   (12)

A typical approach is to select the phases, θ_i, with a random uniform distribution on [0, 2π]. An intuitive reason to use this transform is that it tends to make the output of the transform more like a Gaussian, since a complex Gaussian random variable has random uniform phase. A final consideration in constructing (11) is to produce real outputs given real inputs; to do this we force the sequence e^{jθ_i} to have conjugate symmetry. This technique is known in the quantization literature (Kuo & Huang, 1993; Popat & Zeger, 1992), since it can be used to filter an arbitrary independent sequence to produce a Gaussian source (as a consequence of the Central Limit Theorem). The technique also arises in the numerical analysis literature, where it is used to avoid pivoting in Gaussian elimination (Parker & Pierce, 1995).

We can now implement fast reduction of vectors in feature space as follows. We take an FFT (fast Fourier transform) of the vector in feature space, apply a random rotating phasor to the output, and then inverse FFT the result. We then keep the first m components of the output using a projection P. The overall dimension reduction transform is T = PC, where C is given in (11).

Two items should be observed. First, we only need to store the seed that generates the random sequence of angles; i.e., we still have the same advantage of reducing model storage. Second, fast dimension reduction requires O(n log_2 n) operations instead of O(mn) for one vector in feature space. The new algorithm is faster than a direct approach if m > log_2 n.
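The two variants above can be sketched as follows (our own code, not the authors' implementation): a dense Gaussian projection regenerated from a stored seed, and the fast transform T = PC of (11)-(12) implemented with real FFTs and conjugate-symmetric random phases.

```python
import numpy as np

def gaussian_reduce(v, m, seed):
    """Direct random reduction: T has IID Gaussian entries and is fully determined
    by `seed`, so only the seed and the reduced model need to be stored.
    (A 1/sqrt(n) scaling could be added if approximate norm preservation is desired.)"""
    T = np.random.default_rng(seed).standard_normal((m, v.size))
    return T @ v

def circulant_reduce(v, m, seed):
    """Fast reduction with T = P C, C = F^H D F (Eqs. 11-12): FFT, multiply by
    random unit-modulus phases chosen so the result stays real (conjugate
    symmetry), inverse FFT, and keep the first m components."""
    n = v.size
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n // 2 + 1)   # one phase per rfft bin
    phases = np.exp(1j * theta)
    phases[0] = 1.0                       # DC term must stay real
    if n % 2 == 0:
        phases[-1] = 1.0                  # Nyquist term must stay real
    return np.fft.irfft(np.fft.rfft(v) * phases, n)[:m]

v = np.random.default_rng(1).standard_normal(4096)   # a vector in "feature space"
print(gaussian_reduce(v, 64, seed=7)[:3])
print(circulant_reduce(v, 64, seed=7)[:3])
```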

5. Combined Algorithm

If it is desired to reduce model complexity and use minimal memory space for training a polynomial network, the methods of Section 3 and Section 4 can be combined. For the iterative method, we compute a product Rp with an auxiliary vector p. For dimension reduction, we use a dimension-reduced R, given by T R T^t. Thus, to combine the algorithms, we replace R by T R T^t in the iterative training of Section 3; i.e., we compute the product T R T^t p.

The computation of the product T R T^t p is straightforward. If we use fast dimension reduction, we can compute T^t p by applying FFTs to p. This operation can be performed with very little additional memory (or in place if desired). The product R(T^t p) can be computed using the algorithm in Table 3. Finally, T(R T^t p) can be computed using FFTs. The combined method uses no significant additional memory and has both model complexity control and a smaller memory footprint (than the original direct method).
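Combining the earlier sketches, the product T R T^t p can be composed without forming R or T explicitly. In the sketch below (our own naming), matvec_r stands for a compressed product routine in the style of Table 3, for example the matvec_from_r sketch above with its remaining arguments fixed, and the phase sequence is regenerated from the stored seed.

```python
import numpy as np

def _apply_circulant(v, seed, conjugate=False):
    """Apply C = F^H D F (or C^t, by conjugating the phases) via real FFTs."""
    n = v.size
    theta = np.random.default_rng(seed).uniform(0.0, 2.0 * np.pi, n // 2 + 1)
    phases = np.exp(-1j * theta) if conjugate else np.exp(1j * theta)
    phases[0] = 1.0
    if n % 2 == 0:
        phases[-1] = 1.0                   # keep the output real
    return np.fft.irfft(np.fft.rfft(v) * phases, n)

def combined_product(p, r, n, m, seed, matvec_r):
    """Compute T R T^t p for T = P C without forming R or T.

    p: length-m auxiliary vector from the iterative solver.
    r: compressed statistics vector (the unique entries of R).
    matvec_r(r, u): a Table 3 style product R u computed from r.
    """
    u = np.zeros(n)
    u[:m] = p                                      # P^t p (zero-pad to length n)
    u = _apply_circulant(u, seed, conjugate=True)  # T^t p = C^t P^t p
    y = matvec_r(r, u)                             # R (T^t p), via Table 3
    return _apply_circulant(y, seed)[:m]           # T (R T^t p) = P C (...)
```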

6. Experiments

We applied our methods to the problem of speaker verification, i.e., verifying identity through speech characteristics. One goal of our experiments was to show that the methods for dimension reduction given in the previous sections performed the required memory and model complexity reductions while at the same time maintaining accuracy comparable to the original direct method. Another goal was to show that the methods could be implemented in a manner that did not drastically increase the training time required.

We use the YOHO database for speaker verification (Campbell, Jr., 1995). Verification is a two class problem; that is, we must determine whether the speaker is who he claims to be (a true claimant) or an impostor. YOHO has combination lock utterances, e.g., "26-81-57," for enrollment and verification and is comprised of 138 different speakers. Enrollment, in this case, is the process of generating a model (training a polynomial network) to distinguish between an individual and impostors. Verification is the process of determining whether an identity claim is valid using a single utterance. The YOHO database has 96 utterances for enrollment per speaker. For verification, there is a total of 40 utterances per speaker. We performed verification using all speakers as impostors and true claimants. This results in 40 tests per speaker for false rejection, and 40 × 137 = 5,480 tests per speaker for false acceptance. The enrollment and verification utterances are separate sets.

Training and verification vectors were computed using 30 ms frames of speech every 10 ms. For each frame, a vector of 12 elements was derived corresponding to the vocal tract configuration. This process resulted in a sequence of vectors for each utterance. We also generated discrete-time derivatives of these elements to give delta-coefficients. For all frames, we had 12 "static" coefficients and 12 delta coefficients. For the training set, the derived vectors for each utterance were concatenated together to form a large training set. For verification, each utterance was used to compute a score using a verification model. This score was compared to a threshold to determine acceptance or rejection.

Three different polynomial classifier configurations were tested. The first configuration, which we denote (12, 3), used the 12 static coefficients and a polynomial network of degree 3. This resulted in a speaker model with 455 coefficients. Two other configurations, (24, 2) and (24, 3), were constructed by using the 12 static and 12 delta coefficients with polynomial networks of degree 2 (325 model coefficients) and degree 3 (2,925 model coefficients), respectively. These configurations were chosen since they give good accuracy and are reasonable to train (in terms of time and memory space).

We represent each speaker using a vector r_spk (using the notation in Table 2). The set of impostors is represented by another vector r_imp; r_imp was constructed by computing an r_k for each speaker and then summing across all speakers. This approach is a compromise since we are including actual impostors in our training; we have addressed this issue in Campbell and Assaleh (1999).

6.1 Results for the Iterative Method

To show that we achieve the goal of redundancy reduction, we contrast the memory usage of the direct approach to the new iterative method in Section 3 for the (12, 3) configuration. For the direct approach of Section 2, we allocate space for r (double precision, 8 × 18,564 bytes), the index map (16 bit int, 2 × 455 × 455 bytes), and R (double precision, 8 × 455 × 455 bytes), for a total of 2,218,762 bytes. For the new method (with the CG algorithm), we store r (double precision, 8 × 18,564 bytes), v (16 bit int, 2 × 18,564 bytes), v_2' (32 bit int, 4 × 18,564 bytes), π (16 bit int, 2 × 18,564 bytes), and scratch space for the iterative algorithm (double precision, 5 × 8 × 455 bytes), for a total of 315,224 bytes. The memory savings is thus 2,218,762 / 315,224 ≈ 7. This reduction in memory makes the algorithm more useful for systems with little memory, for example portable devices.

A secondary goal of these experiments was to ensure that the new iterative method converged quickly to reduce training time. We incorporated the algorithm in Table 3 into several iterative methods: SOR, CG, Kaczmarz' method, steepest descent, and a preconditioned CG method. The results of these methods applied to training the first speaker, 101, using the (12, 3) configuration are shown in Figure 2. The metric, ε_i, used to indicate progress is the norm of the residual divided by the norm of the right-hand side, b; i.e., if the equation to be solved is Rw = b, then

ε_i = ||R w_i - b||_2 / ||b||_2.   (13)

Note that since we have assumed R to be invertible, the error, ε_i, should ideally go to zero.

Figure 2. Comparison of iterative methods: ε_i versus iteration for Kaczmarz, steepest descent, CG, SOR (ω = 1.0 and ω = 1.2), and preconditioned CG.

From Figure 2, several conclusions can be made. First, methods such as steepest descent and Kaczmarz' algorithm converge slowly. In practice, we found the solution did not perform well even after 1000 iterations. Second, the CG method gives acceptable results after 1000 iterations, but this amount of computation is unacceptable. One would typically expect significantly less than 455 iterations (the dimension of R) for convergence. Finally, the SOR method performs the best with no preconditioning. Note that we found that the SOR method with ω = 1.2 converged faster than ω = 1.0 (after trying several values of ω).

We explored preconditioning the matrix R to achieve better convergence. The condition number of R was estimated to be about 10^7; this explained the difficulty in the CG iterations. After examining several standard approaches, we settled on a diagonal preconditioner. The matrix R arises from a matrix product M^t M. Applying a diagonal matrix, D, to normalize the column norms of M is the same as computing MD. This results in the matrix product DRD. A convenience of using the matrix D is that it can be obtained from the entries of R. That is,

D = diag([sqrt(R_{1,1}), ..., sqrt(R_{n,n})])^{-1}.   (14)

After applying this preconditioner, the condition number was reduced to about 10^3. The resulting preconditioned CG algorithm (Precond. CG) is the fastest converging method in Figure 2. We note that we applied preconditioning with the other iterative methods; in no case did we obtain substantial gains as in the CG case.
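A small numerical illustration (ours) of the diagonal preconditioner in (14): for a matrix of the form M^t M with badly scaled columns, symmetric diagonal scaling typically reduces the condition number by several orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 40)) * np.logspace(0, 3, 40)   # badly scaled columns
R = A.T @ A                                                  # R = M^t M
d = 1.0 / np.sqrt(np.diag(R))                                # D of Eq. (14)
DRD = R * np.outer(d, d)                                     # same as D R D
print(np.linalg.cond(R), np.linalg.cond(DRD))                # condition number drops sharply
```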

We implemented the preconditioned CG method in C and enrolled all speakers in the YOHO database using iterative methods. After experimenting with several convergence values, we found that if we iterated until ε_i < 10^-4, the average EER (equal error rate, the rate at which the false acceptance rate equals the false rejection rate) was the same as that obtained from a direct solution method. Approximately 100 iterations per speaker were needed for convergence.

6.2 Results for Random Dimension Reduction

We applied dimension reduction to the speaker verification problem using the algorithms in Section 4 with all three configurations, (12, 3), (24, 2), and (24, 3). Our goal was to show that random dimension reduction is a useful way of trading off model complexity with accuracy. That is, the increase in error rate by using dimension reduction is gradual. We admit this is a qualitative criterion, but we justify it by the results below.

The results are shown in Figure 3. In the figure, we display reduction for the (24, 3) configuration only up to dimension 1500 to show the detail for the other configurations.

Figure 3. Illustration of random dimension reduction in feature space: average EER (%) versus the dimension of the final transformed feature space, Tp(x), for the 24-input degree-2, 12-input degree-3, and 24-input degree-3 configurations.

Figure 3 shows several interesting results for random dimension reduction. First, adding delta components does not add significant information content. If we reduce the dimension of feature space for the (12, 3) system down to the feature space dimension of (24, 2), we obtain approximately the same EER. Second, all of the different configurations approximately track each other. This is an intriguing result and seems to indicate that performance is determined mainly by the dimension of feature space. This fact is a quantitative affirmation of the qualitative goal stated earlier; that is, random dimension reduction produces reasonable tradeoffs between error rate and model complexity for our application. Third, we replotted the result for the (24, 3) model from Figure 3 on a log-log plot; see Figure 4. This figure shows an interesting result: the log of the average EER is approximately a linear function of the log of the dimension. Fitting the data with a straight line gives log(Avg EER) ≈ 4.70 - 0.72 log(dim).

Figure 4. Random dimension reduction for the (24, 3) model on a log-log plot. The triangles are actual data. The straight line is the fitted curve.

We also tested fast dimension reduction; this approach produced essentially the same results as an unstructured random matrix. For example, with the (24, 3) configuration and an output dimension of 455, the EERs are 1.356% (fast) and 1.320% (direct).

6.3 Results for the Combined System

We tested a combined system (see Section 5) with the (12, 3) configuration. The goal was to show that the combined approach has the strengths of both the iterative and dimension reduction methods. Table 4 shows the results for various configurations. Note that dimension reduction to dimension 455 just transforms the matrix R without any real reduction; this case is included to show the maximum time required for transformation. The time quoted is on a single processor 360 MHz Ultra Sparc and includes generation of r and w for a single speaker. Most of the time for the direct approach is consumed in generating r, 18 seconds out of 18.3 seconds.

Table 4. Comparison of training approaches.

Approach         Time (sec)   Space (Mbytes)   EER (%)   Model Terms
Direct              18.3           2.22          1.52        455
Iterative           32.0           0.32          1.52        455
DimRed (Slow)       26.2           2.22          1.52        455
DimRed (Fast)       20.1           2.22          1.52        455
DimRed (Slow)       20.4           1.39          2.22        227
DimRed (Fast)       19.2           1.39          2.20        227
Combined            40.4           0.32          2.07        227

Several major observations from Table 4 should be made. First, training time increases by less than a factor of 2 for the iterative approach, but memory usage is decreased by a factor of 7. Second, fast dimension reduction achieves a reasonable speedup for a large output dimension. For the smaller dimension of 227, fast dimension reduction is ineffective (probably because of the overhead in the FFT). Finally, the combined approach is successful and takes approximately 2.2 times longer than the direct approach. The increase in time was due to the fact that we could not use preconditioning (the random mixing destroyed structure) and therefore the number of iterations required was higher (approximately 150 iterations per model).

7. Conclusions

Two novel methods for reducing dimension in feature space were illustrated. One method eliminated redundancy using iterative training. A second method used random dimension reduction in feature space. Experiments showed the effectiveness of the techniques in reducing memory usage and reducing model complexity.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Campbell, W. M., & Assaleh, K. T. (1999). Polynomial classifier techniques for speaker verification. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (pp. 321-324).

Campbell, Jr., J. P. (1995). Testing with the YOHO CD-ROM voice verification corpus. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (pp. 341-344).

Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego: Academic Press.

Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26, 4972-4978.

Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: The Johns Hopkins University Press.

Kaczmarz, S. (1937). Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Internat. Acad. Polon. Sciences et Lettres, 355-357.

Kaski, S. (1998). Dimensionality reduction by random mapping: Fast similarity computation for clustering. Proceedings of the International Joint Conference on Neural Networks (pp. 413-418).

Kuo, C. J., & Huang, C. S. (1993). A novel image coding technique for noisy communications. IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing (pp. 260-263).

Parker, D. S., & Pierce, B. (1995). The randomizing FFT: An alternative to pivoting in Gaussian elimination (Technical Report CSD-950037). Computer Science Department, University of California, Los Angeles.

Popat, K., & Zeger, K. (1992). Robust quantization of memoryless sources using dispersive FIR filters. IEEE Transactions on Communications, 40, 1670-1674.

Schurmann, J. (1996). Pattern classification. New York: John Wiley and Sons, Inc.

Specht, D. F. (1967). Generation of polynomial discriminant functions for pattern recognition. IEEE Transactions on Electronic Computers, EC-16, 308-319.