A Fast On-line Generalized Eigendecomposition Algorithm for Time Series Segmentation

Yadunandana N. Rao and Jose C. Principe
Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611
e-mail: [email protected], [email protected]

Abstract: This paper presents a novel, fast converging on-line rule for Generalized Eigendecomposition (GED) and its application in time series segmentation. We adopt the concepts of deflation and the power method to iteratively estimate the generalized eigencomponents. The algorithm is guaranteed to produce stable results. In the second half of the paper, we discuss the application of GED to segmenting time series. GED is tested on chaotic time series and on speech, and the simulation results are compared against the venerable Generalized Likelihood Ratio (GLR) test as a benchmark to gauge performance.
1.0 Introduction

Principal Component Analysis (PCA) and Generalized Eigendecompositions (GED) are widely used in various signal processing applications such as feature extraction, signal estimation and speech separation [1], [2], [3], [8]. Linear Discriminant Analysis (LDA) is also a generalized eigenvalue problem. There are analytic methods for estimating the generalized eigencomponents [6]. However, for many applications, an adaptive on-line solution is required. Sanger's rule [9] and APEX [2] present iterative solutions for eigendecompositions. [3] presents an iterative algorithm for LDA using two single-layered feedforward networks, each modeled according to Rubner and Tavan [4], [5]. That algorithm exhibits slow convergence when the dimensionalities are high and when multiple eigencomponents need to be estimated. [10] gives a rule only for the largest generalized eigencomponent. [7] proposes an on-line local algorithm for GED. Although the formulation is novel, the algorithm suffers from the usual problems of gradient methods. Our technique implements GED as a two-step PCA process similar to [3]. We utilize the well-known power method [6] for computing the principal components, which circumvents the slow convergence of gradient methods. The power method estimates the principal eigenvector, and a deflation procedure is adopted for the subsequent eigencomponents [10]. Starting from an initial estimate, the algorithm converges to the principal eigenvector $e_1$ of a matrix A at an exponential rate $O((\lambda_2/\lambda_1)^k)$, where k is the iteration index and $\lambda_1, \lambda_2$ are the two largest eigenvalues of the matrix A. The convergence is extremely fast when these eigenvalues are well separated. Even for $\lambda_1 \approx \lambda_2$, the speed of convergence and the stability are better than those of the traditional gradient descent algorithms found in the literature.
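To illustrate the convergence-rate claim, here is a minimal power-iteration sketch (ours, not from the paper; the symmetric test matrix and its eigengap are arbitrary assumptions):

```python
import numpy as np

# Build a symmetric test matrix with a known eigengap lambda2/lambda1 = 0.5
# (an arbitrary choice for this demo).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([3.0, 1.5, 1.0, 0.5, 0.1]) @ Q.T

w = rng.standard_normal(5)              # initial eigenvector estimate
w /= np.linalg.norm(w)
e1 = Q[:, 0]                            # true principal eigenvector

for k in range(1, 11):
    w = A @ w                           # power update: w <- A w
    w /= np.linalg.norm(w)              # renormalize
    err = np.linalg.norm(w - np.sign(e1 @ w) * e1)
    print(f"iter {k}: error {err:.2e}")  # shrinks roughly like (lambda2/lambda1)**k
```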
2.0 Generalized Eigendecomposition
The generalized eigenvalue problem can be mathematically expressed as $AV = \Lambda BV$, where A and B are non-singular matrices. For signal processing applications, these are generally full covariance matrices of zero mean, stationary random signals. For real, symmetric, positive definite covariance matrices, the eigenvectors are all real and the eigenvalues strictly positive. The generalized eigenvectors accomplish simultaneous diagonalization of the covariance matrices as $V^T A V = \Lambda$ and $V^T B V = I$, where I is an identity matrix. [7] proposes an alternative formulation of the GED problem. Assuming A and B are the estimates of the covariance matrices of two zero mean, stationary random signals, the problem is to find $v_i \in R^m$ that maximizes
$$J(v_i) = \frac{v_i^T A v_i}{v_i^T B v_i} \quad \text{subject to} \quad v_i^T S v_j = 0, \quad S = A,\ B,\ (A + B).$$
Taking the gradient of $J(v_i)$, we get
$$\nabla J_i = \frac{1}{v_i^T B v_i}\left[A v_i - \frac{v_i^T A v_i}{v_i^T B v_i} B v_i\right].$$
Equating the gradient to zero, $\nabla J_i = 0$, we
see that this is nothing but the generalized eigenvalue equation $A v_i = \lambda_i B v_i$, with $\lambda_i = (v_i^T A v_i)/(v_i^T B v_i)$. This $\lambda_i$ is the only globally asymptotically stable fixed point, and it is the generalized eigenvalue. Based on this interpretation, we can see that the principal component of A is steered by the distribution of B; for this reason, GED is also termed Oriented PCA [10]. The generalized eigenvectors work as filters in the joint space of the two signals, simultaneously minimizing the energy of one signal and maximizing the energy of the other. They are notch filters that place nulls at the frequencies where one signal has maximum energy. Thus GED can be used as a filter that performs signal separation by suppressing the undesired component in a composite signal. This property can be used to quantify changes in a time series, and we present a new way of segmenting time series based on this idea in the second half of the paper.
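As a quick numeric sanity check (our sketch, not part of the paper; SciPy's generalized symmetric eigensolver serves as the reference), one can verify the simultaneous diagonalization property $V^T A V = \Lambda$, $V^T B V = I$:

```python
import numpy as np
from scipy.linalg import eigh

# Random full-rank data so that A and B are symmetric positive definite.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((1000, 3))
X2 = rng.standard_normal((1000, 3)) @ np.diag([1.0, 2.0, 0.5])
A = X1.T @ X1 / len(X1)                 # covariance estimate, signal 1
B = X2.T @ X2 / len(X2)                 # covariance estimate, signal 2

lam, V = eigh(A, B)                     # solves A v = lambda B v
print(np.allclose(V.T @ A @ V, np.diag(lam)))  # V^T A V = Lambda -> True
print(np.allclose(V.T @ B @ V, np.eye(3)))     # V^T B V = I      -> True
```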
2.1 Description of the proposed method

Figure 1 shows the block diagram for the proposed method.

Figure 1: Block diagram.

As said earlier, we implement GED using a two-step PCA process. Each processing block in the figure is trained for eigendecomposition. Let $x_1(n) \in R^m$ and $x_2(n) \in R^m$ be two zero mean, stationary random signals, and let $S_1, S_2 \in R^{m \times m}$ be the corresponding covariance matrices. The covariance matrices can be recursively updated as
$$S_l(n) = k\, S_l(n-1) + Q, \quad l = 1, 2,$$
where k is a parameter in the range (0, 1] and Q is a symmetric matrix. In this case, as we are assuming stationarity, $k = (1 - 1/n)$ and $Q = (1/n)\, x_l(n) x_l^T(n)$, $l = 1, 2$, where $x_l(n)$ is the data vector at the nth time instant.

The first processing block, with a full weight matrix $W_1$, implements a whitening transform on the dataset $x_1(n)$: $W_1 = V_1 \Lambda_1^{-1/2}$, where $V_1$ is the eigenvector matrix and $\Lambda_1$ the corresponding diagonal eigenvalue matrix of the covariance matrix $S_1$. The eigendecomposition of $S_1$ is done by the power method followed by the standard deflation procedure [10]. As the next step, the second dataset $x_2(n)$ is passed through the first block and its outputs are used to train the second block. Thus the second processing block, with full weight matrix $W_2$, implements an eigendecomposition on the dataset $x_2(n)$ after it has been transformed by $W_1$. The overall mapping performed is $V = W_1 W_2$. This matrix is the generalized eigenvector matrix, and the variances of the outputs of the second processing block, when the dataset $x_2(n)$ is the input, are the generalized eigenvalues in descending order of magnitude. We give a simple proof to substantiate this statement.

Proof: We prove that the overall mapping V simultaneously diagonalizes the covariance matrices $S_1, S_2$. We have $W_1 = V_1 \Lambda_1^{-1/2}$ and $S_1 V_1 = V_1 \Lambda_1$; for simplicity, the time index n is not used in the derivation. The output of the first block is
$$y_1(n) = W_1^T x_1 = \Lambda_1^{-1/2} V_1^T x_1.$$
The covariance of $y_1(n)$ is given by
$$S_{y1} = \Lambda_1^{-1/2} V_1^T S_1 V_1 \Lambda_1^{-1/2} = \Lambda_1^{-1/2} \Lambda_1 \Lambda_1^{-1/2} = I,$$
since $V_1^T S_1 V_1 = \Lambda_1$. After passing $y_1(n)$ through the second block, we get $z_1(n) = W_2^T y_1$. The covariance of $z_1(n)$ is estimated to be
$$S_{z1} = E\{W_2^T y_1 y_1^T W_2\} = W_2^T S_{y1} W_2 = W_2^T I W_2 = I,$$
as $W_2^T W_2 = I$. Thus, we see that the covariance matrix $S_1$ is transformed into I by the overall projection through the processing blocks. Let us now consider the dataset $x_2(n)$. After the first projection, its covariance matrix is given by
$$S_{y2}(n) = \Lambda_1^{-1/2} V_1^T S_2 V_1 \Lambda_1^{-1/2}.$$
The second projection gives $z_2(n) = W_2^T \Lambda_1^{-1/2} V_1^T x_2$, and $S_{z2}$ is given by the expression
$$S_{z2} = E\{W_2^T \Lambda_1^{-1/2} V_1^T x_2 x_2^T V_1 \Lambda_1^{-1/2} W_2\},$$
which simplifies to $S_{z2} = W_2^T S_{y2} W_2 = \Lambda$. This is nothing but a diagonal matrix of the generalized eigenvalues. Thus, the overall mapping performs simultaneous diagonalization of the covariance matrices $S_1, S_2$ and solves the problem $S_2 V = \Lambda S_1 V$.

A summary of the algorithm is given below. $W_i = [w_i(1)\ w_i(2)\ \ldots\ w_i(m)]$, $i = 1, 2$, denotes the full weight matrices whose columns are the individual eigenvectors. Both time series are of length T samples.

Step 1: FOR n = 1, 2, ..., T
    $S_1(n) = (1 - 1/n)\, S_1(n-1) + (1/n)\, x_1(n) x_1^T(n)$
    $S_2(n) = (1 - 1/n)\, S_2(n-1) + (1/n)\, x_2(n) x_2^T(n)$
    $R = S_1(n)$
    FOR r = 1, 2, ..., m
        Power method update: $w_1^k(r) = R\, w_1^{k-1}(r)$, then normalize $w_1^k(r) \leftarrow w_1^k(r)/\|w_1^k(r)\|$, where $\|\cdot\|$ is the norm of the weight vector.
        Deflation: $R = [I - w_1(r)\, w_1^T(r)]\, R$
    END
END

Step 2: Estimate the whitening transform $W_1 = V_1 \Lambda_1^{-1/2}$, where $\Lambda_1 = V_1^T S_1 V_1$.

Step 3: Pass $x_2(n)$ through the first block and estimate the output covariance matrix $S_{y2} = W_1^T S_2 W_1$.

Step 4: Train the second processing block on the covariance matrix $S_{y2}$ using the power method and deflation as in Step 1.
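A compact batch transcription of Steps 1 through 4 might look as follows (our sketch, not the authors' code: sample covariances stand in for the recursive update of Step 1, and the power-iteration budget `n_iter` is an assumed fixed count):

```python
import numpy as np

def eig_power_deflation(S, n_components, n_iter=100):
    """Eigendecomposition of symmetric S by the power method with
    deflation (Step 1). Eigenvectors are returned as columns, in
    descending order of eigenvalue magnitude."""
    m = S.shape[0]
    R = S.copy()
    W = np.zeros((m, n_components))
    for r in range(n_components):
        w = np.random.default_rng(r).standard_normal(m)
        for _ in range(n_iter):
            w = R @ w
            w /= np.linalg.norm(w)               # normalize weight vector
        W[:, r] = w
        R = (np.eye(m) - np.outer(w, w)) @ R     # deflation
    lam = np.array([w @ S @ w for w in W.T])     # Rayleigh quotients
    return W, lam

def ged_two_step(X1, X2, n_components):
    """GED of the pair (S2, S1) via whitening followed by PCA (Steps 1-4)."""
    S1 = X1.T @ X1 / len(X1)                     # covariance, signal 1
    S2 = X2.T @ X2 / len(X2)                     # covariance, signal 2
    V1, lam1 = eig_power_deflation(S1, S1.shape[0])
    W1 = V1 @ np.diag(lam1 ** -0.5)              # Step 2: whitening transform
    Sy2 = W1.T @ S2 @ W1                         # Step 3: transformed covariance
    W2, lam = eig_power_deflation(Sy2, n_components)  # Step 4
    return W1 @ W2, lam                          # V = W1 W2 and eigenvalues
```

The result can be checked against scipy.linalg.eigh(S2, S1): up to sign, the columns of V and the eigenvalues should agree.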
2.2 Simulation

Two three-dimensional (3-D) signals were generated:
$$x_1(n) = \sin(2\pi \cdot 200\, t) + w(n), \qquad x_2(n) = \cos(2\pi \cdot 100\, t) + w(n),$$
where w(n) is Gaussian noise with zero mean and unit variance. Table 1 compares the results obtained with the numerical method (eig in MATLAB, used to implement the eigendecomposition for each block) against the proposed method, after one pass over the entire data (2000 samples). Pre-estimated covariance matrices were used for the numerical method. v1, v2, v3 correspond to the generalized eigenvectors. The eigenvectors estimated by our method are either in phase or 180° out of phase with the MATLAB results. It is interesting to note that the data sets need to be passed through the processing blocks only once to get good results.

Table 1: Comparison between results obtained by the numerical method and the proposed method

|        | Numerical Method | Proposed Method |
|--------|------------------|-----------------|
| λ₁     | 1.3430790411229  | 1.3445350310008 |
| λ₂     | 1.1429309887522  | 1.1433526250078 |
| λ₃     | 0.9838982153879  | 0.9843520345293 |
| v₁(3)  | 0.6883234928496  | -0.687551473801 |
| v₂(1)  | -0.212803334629  | 0.2113477721460 |
| v₂(2)  | 0.9051476141863  | -0.905443305040 |
| v₂(3)  | -0.212803334629  | 0.2104869205520 |
| v₃(1)  | -0.535238507722  | -0.534868427424 |
| v₃(2)  | -0.026654859984  | -0.023784689113 |
| v₃(3)  | -0.535238507722  | -0.537238582769 |

The heart of the algorithm is the adaptation of the individual eigendecomposition blocks. Figure 2 shows the convergence characteristics for the eigendecomposition on data set $x_1(n)$. The algorithm converges well within 500 samples for both the maximum and minimum eigenvectors. We also show the convergence plots for the traditional Sanger's rule applied to the same problem. The power method converges extremely fast compared to Sanger's rule, which is basically a gradient algorithm. Since both eigendecomposition blocks used in our algorithm are identical, the convergence properties of the second block are the same. In fact, the convergence of the second processing block is faster, because the covariance matrix for the second data set $x_2(n)$ is already estimated during the training of the first block, so the second block is trained directly on the full covariance matrix $S_2$ transformed by the matrix $W_1$.

Figure 2: Plots showing convergence characteristics (weight tracks for the max and min eigenvectors using the power method and using Sanger's rule).
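The Table 1 setup can be approximated with the short script below (our assumptions: t = n/2000 so that one pass is 2000 samples, and a 3-tap delay embedding to form the 3-D data vectors; exact numbers vary with the noise realization):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(42)
n = np.arange(2000)
t = n / 2000.0                                    # assumed sampling of t
x1 = np.sin(2 * np.pi * 200 * t) + rng.standard_normal(2000)
x2 = np.cos(2 * np.pi * 100 * t) + rng.standard_normal(2000)

def embed3(x):
    """3-tap delay embedding: rows are 3-D data vectors."""
    return np.array([x[i:i + 3] for i in range(len(x) - 2)])

X1, X2 = embed3(x1), embed3(x2)
S1 = X1.T @ X1 / len(X1)
S2 = X2.T @ X2 / len(X2)
lam, V = eigh(S2, S1)                             # generalized eigenpairs
print(lam[::-1])                                  # descending, cf. Table 1
```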
3.0 Time Series Segmentation
An intensively investigated topic in time series analysis is segmentation. In detection, segmentation of a signal refers to the automatic decomposition of the signal into stationary or weakly non-stationary segments, the lengths of which are adapted to the local properties of the signal. There are various approaches in time series analysis to deal with the issues of modeling and segmentation: mixture models [11], and gated competitive experts, where the experts are linear predictors or neural networks [12], [13]. For this paper, we use the Generalized Likelihood Ratio (GLR) test [14] for comparison with our method. The GLR is an unsupervised sequential segmentation algorithm based on the Neyman-Pearson test for deciding between two hypotheses. The details of the algorithm are omitted here and can be found in [14], [15].

As mentioned earlier, the generalized eigenvectors find projections in the joint space of the two signals where the energy of one signal is maximized and the energy of the other is minimized. Also, as the GED operates on the covariance matrices of both signals simultaneously, the generalized eigenvalues, which are nothing but the projections of the eigenvectors on the data, can serve as a measure of the spectral distance between the two signals. The key difference between this method and other segmentation methods lies in the fact that GED looks at pairs of signals and tries to find the difference in structure, whereas the other techniques either look at the whole data to find global information, or model each data frame and then look at the difference between the model parameters of successive frames.

Consider a time series x(n). We split the time series into frames, either overlapping or non-overlapping. We then perform a GED between successive frames of the time series and observe the ratio between the maximum and minimum generalized eigenvalues to make a decision about transitions in the time series. The estimation of $L(n) = \lambda_{max}/\lambda_{min}$ is repeated over the entire time series, and a simple peak detection on the sequence L(n) is done to detect the transitions. A suitable threshold has to be fixed to detect the true transitions.
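A sketch of this segmentation loop follows (our illustration, with SciPy's generalized symmetric eigensolver standing in for the on-line blocks of Section 2; the frame length, 50% overlap, and 5-tap eigenfilter follow the Simulation 1 settings below, and the threshold is left as a user parameter):

```python
import numpy as np
from scipy.linalg import eigh

def embed(frame, m):
    """Rows are m-tap delay vectors built from a 1-D frame."""
    return np.array([frame[i:i + m] for i in range(len(frame) - m + 1)])

def eigenspread(x, frame_len=100, hop=50, m=5):
    """L = lambda_max/lambda_min of the GED between successive frames."""
    L = []
    for s in range(0, len(x) - 2 * frame_len + 1, hop):
        X1 = embed(x[s:s + frame_len], m)                  # current frame
        X2 = embed(x[s + frame_len:s + 2 * frame_len], m)  # next frame
        S1 = X1.T @ X1 / len(X1)
        S2 = X2.T @ X2 / len(X2)
        lam = eigh(S2, S1, eigvals_only=True)              # generalized eigenvalues
        L.append(lam.max() / lam.min())
    return np.array(L)

# A transition is flagged at local peaks of L that exceed a chosen threshold:
# peaks = [i for i in range(1, len(L) - 1)
#          if L[i - 1] < L[i] > L[i + 1] and L[i] > threshold]
```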
Simulation 1: Consider a time series generated by a switching random FIR process. The switching is done at samples [200, 300, 500, 800]. Figure 3 shows a plot of the data and of the sequence L(n) over time. GLR, implemented as a 2-model case with linear predictors, gave transitions at samples [201, 290, 510, 820]. Using GED, we obtained the transitions at samples [200, 300, 500, 800]. The window size was 100 samples, with an overlap of 50 samples between two successive frames, and the eigenfilter length was 5 taps. We got exact segmentation using GED, and the number of computations needed by GED to detect a single transition is far smaller than for GLR.

Figure 3: Plots for simulation 1 (the data, and the generalized eigenspread L(n) over window frame index).
Simulation 2: The data set used in [16] is considered here. This data is a mixture of four ergodic chaotic processes, described by four different regimes (a generation sketch is given after the figure caption below):
1. Logistic map: x(n+1) = f(x(n)) = 4x(n)[1 - x(n)]
2. Tent map: x(n+1) = g(x(n)) = 2 abs(x(n) - 0.5)
3. Double logistic map: x(n+1) = f(f(x(n)))
4. Double tent map: x(n+1) = g(g(x(n)))
The switching between the regimes is random. Figure 4 shows a plot of the data set and of the sequence L(n) over time. The true transitions are at samples [100, 249, 336, 439, 507, 607, 699, 815, 932, 1078, 1147]. The transitions detected by our method are at samples [100, 180, 245, 340, 450, 510, 600, 700, 850, 940, 1085, 1150]. The eigenfilter length was 8 taps, the window length was 80 samples, and the overlap between two successive frames was 40 samples. We see that there is a false transition at sample 180; apart from that, the segmentation matches the true segmentation quite well. [16] uses GLR along with a multilayer perceptron to successfully segment this time series. GLR with a linear network fails for this data set, but GED with a linear framework is successful.

Figure 4: Plots for simulation 2 (the data, and the generalized eigenspread L(n) with the threshold setting, over window frame index).
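For concreteness, a data generator for this four-regime mixture could look like the following sketch (ours; the overall length, the initial condition, and the random regime draw are assumptions, with only the switching instants taken from the true transitions above):

```python
import numpy as np

f = lambda x: 4.0 * x * (1.0 - x)        # logistic map
g = lambda x: 2.0 * abs(x - 0.5)         # tent map
regimes = [f, g, lambda x: f(f(x)), lambda x: g(g(x))]  # the four regimes

# Segment boundaries from the paper's true transitions; the final length
# (1250) and the initial condition are arbitrary assumptions.
bounds = [0, 100, 249, 336, 439, 507, 607, 699, 815, 932, 1078, 1147, 1250]

rng = np.random.default_rng(0)
maps = [regimes[rng.integers(4)] for _ in range(len(bounds) - 1)]  # random switching
x = np.empty(bounds[-1])
x[0] = 0.37
for seg, (a, b) in enumerate(zip(bounds[:-1], bounds[1:])):
    for n in range(max(a, 1), b):
        # Tiny jitter keeps finite-precision tent-map orbits from collapsing
        # to 0 (a numerical safeguard, not part of the paper's setup).
        x[n] = np.clip(maps[seg](x[n - 1]) + 1e-12 * rng.standard_normal(),
                       0.0, 1.0)
```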
Simulation 3: We now consider the spoken word "carry" for segmentation. The word was taken from the TIMIT database. Speech signals pose additional complexities for segmentation due to the pitch; moreover, speech is quasi-stationary and its statistics vary slowly with time. Figure 5 shows a plot of the speech signal with the segmentation obtained, along with the sequence L(n) over time. Human experts pick transitions at samples [764, 1804, 2799, 3843]. The GED algorithm was applied to this problem with an eigenfilter length of 15 taps, a window length of 350 samples, and an overlap of 200 samples between successive frames. The hard part of the test was determining the threshold: since the GED looks at the structure of the signals, there can be many peaks in the plot of L(n), especially for speech. The transitions detected are at samples [710, 1750, 2900, 3880]. There is a false transition at sample 3000, but it can be rejected by considering a no-transition zone of about 100 samples on either side of a true transition, which seems reasonable for speech sampled at a high rate. The advantage of the GED method over many existing methods is that, in most cases, we need to make only one pass through the data.

Figure 5: Plots for simulation 3 (the speech waveform and the generalized eigenspread L(n) over window frame index).

4.0 Conclusion

A fast converging on-line technique for generalized eigendecomposition has been proposed using the power method and deflation. The algorithm estimates multiple generalized eigencomponents in descending order of eigenvalue magnitude. Simulations show the efficiency of this method over conventional gradient based methods. We also explored the problem of time series segmentation using GED. Simulations involving generated time series showed that GED performs better than the GLR method. GED was also applied to speech segmentation, with satisfactory results. Fixing the threshold is a key problem in many segmentation algorithms; we are working on the problem of estimating a generic threshold value from the data.
Acknowledgments: This work was partially supported by the National Science Foundation under grant NSF ECS-9900394.
5.0 References
[1] R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis", New York: Wiley, 1973.
[2] S. Y. Kung, K. I. Diamantaras and J. S. Taur, "Adaptive principal component extraction (APEX) and applications", IEEE Trans. Signal Processing, vol. 42, May 1994.
[3] Jianchang Mao and Anil K. Jain, "Artificial neural networks for feature extraction and multivariate data projection", IEEE Trans. Neural Networks, vol. 6, no. 2, March 1995.
[4] J. Rubner and K. Schulten, "Development of feature detectors by self organization", Biol. Cybern., vol. 62, pp. 193-199, 1990.
[5] J. Rubner and P. Tavan, "A self organizing network for principal component analysis", Europhysics Lett., vol. 10, pp. 693-698, 1989.
[6] G. Golub and C. Van Loan, "Matrix Computations", Baltimore, MD: Johns Hopkins Univ. Press, 1993.
[7] D. Xu, J. C. Principe and H. C. Wu, "Generalized eigendecomposition with an on-line local algorithm", IEEE Signal Processing Lett., vol. 5, no. 11, November 1998.
[8] Y. Cao, S. Sridharan and M. Moody, "Multichannel speech separation by eigendecomposition and its application to co-talker interference removal", IEEE Trans. Speech and Audio Processing, vol. 5, no. 3, May 1997.
[9] T. D. Sanger, "Optimal unsupervised learning in a single-layer linear feedforward neural network", Neural Networks, 2(6), pp. 459-473, 1989.
[10] K. I. Diamantaras and S. Y. Kung, "Principal Component Neural Networks: Theory and Applications", New York: Wiley, 1996.
[11] A. S. Weigend, M. Mangeas and A. N. Srivastava, "Nonlinear gated experts for time series: discovering regimes and avoiding overfitting", Int. Journal of Neural Systems, vol. 6, pp. 373-399, 1995.
[12] C. L. Fancourt and J. C. Principe, "Competitive principal component analysis for locally stationary time series", IEEE Trans. Signal Processing, vol. 46, no. 11, November 1998.
[13] K.-R. Müller, J. Kohlmorgen and K. Pawelzik, "Analysis of switching dynamics with competing neural networks", IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences, vol. E78-A, no. 10, pp. 171-187, 1997.
[14] M. Basseville and I. V. Nikiforov, "Detection of Abrupt Changes: Theory and Application", Prentice-Hall, 1993.
[15] R. André-Obrecht, "A new statistical approach for the automatic segmentation of continuous speech signals", IEEE Trans. Acoustics, Speech and Signal Processing, vol. 35, no. 1, January 1988.
[16] C. L. Fancourt and J. C. Principe, "On the use of neural networks in the generalized likelihood ratio test for detecting abrupt changes in signals", submitted to IJCNN, 2000.