ICA by PCA Approach: Relating Higher-Order Statistics to Second-Order Moments Kun Zhang and Lai-Wan Chan Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong {kzhang, lwchan}@cse.cuhk.edu.hk
Abstract. It is well known that principal component analysis (PCA) only considers the second-order statistics and that independent component analysis (ICA) exploits higher-order statistics of the data. In this paper, for whitened data, we give an elegant way to incorporate higher-order statistics implicitly in the form of second-order moments, and show that ICA can be performed by PCA following a simple transformation. This method is termed P-ICA. Kurtosis-based P-ICA is equivalent to the fourth-order blind identification (FOBI) algorithm [2]. Analysis of the form of the transformation enables us to give a robust version of P-ICA, which exploits a trade-off among all even-order statistics of the sources. Experimental comparisons of P-ICA with prevailing ICA methods are presented. The main advantage of P-ICA is that it enables any PCA system, especially dedicated hardware, to perform ICA after slight modification.
1 Introduction
In independent component analysis (ICA), we have an observable random vector x = [x_1, x_2, ..., x_n]^T. The variables x_i are assumed to be linear mixtures of some mutually statistically independent variables s_i, which form the random vector s = [s_1, s_2, ..., s_n]^T (here we assume that the number of observations is equal to that of the sources, and that the s_i are zero-mean). Mathematically, x is generated according to x = As. By using some linear transform, the task of ICA is to find the random vector y = [y_1, y_2, ..., y_n]^T:

y = Wx   (1)
whose components are as mutually independent as possible, so that y provides an estimate of s. Closed-form solutions to ICA, which involve no iterative optimization and are extremely cheap to compute, were developed using higher-order cumulants [4, 6, 7, 10] or the derivatives of the mixtures [9]. However, the applicability of these results is limited, since they only apply to the two-source case. Generally
This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.
speaking, ICA does not have closed-form solutions, and an ordinary ICA algorithm consists of two parts: an objective (contrast) function and an optimization method used to optimize it. In this paper, by investigating the relationship between ICA and PCA, we present a way to solve the ICA problem by explicitly applying principal component analysis (PCA). We consider the case where the sources have different distributions. It is shown that for whitened data, after a simple transformation, the linear transformation performing PCA of the transformed data is exactly the one performing ICA of the original data. Although this method, termed P-ICA, may not give optimal results, it is very simple and efficient, as PCA can easily be done by eigenvalue decomposition (EVD) of the data covariance matrix or singular value decomposition (SVD) of the data. In addition, it explicitly involves only second-order statistics and has no local optima. As a consequence, systems based on second-order statistics, especially hardware implementations, can also perform ICA via P-ICA.
2 PCA and ICA
Given the data x, PCA and ICA both aim to find the linear transformation given in Eq. 1. However, they are based on different criteria and exploit different aspects of the data. PCA aims at finding an orthogonal transformation W which gives uncorrelated outputs; equivalently, in PCA each principal component defines a projection that captures the maximum amount of variation in the data and is uncorrelated with the previous principal components. It only considers the second-order statistical characteristics of the data. In other words, PCA just uses a joint Gaussian distribution to fit the data and finds an orthogonal transformation which makes this joint Gaussian distribution factorable, regardless of the true distribution of the data. In ICA, by contrast, we find the linear transformation which makes the true joint distribution of the transformed data factorable, such that the outputs are mutually independent. Statistically speaking, mutual independence is much stronger than uncorrelatedness between the variables, i.e. independence guarantees uncorrelatedness. In ICA, the independent components y_i are uncorrelated, at least approximately, since they are made as independent as possible. Some ICA algorithms, such as FastICA [8] and JADE [3], require whitening the observed data x as a preprocessing step. The method proposed in this paper also requires this step. Whitening of x can be performed by PCA or by EVD of the covariance matrix of x. Denote the covariance matrix of x by Σx, i.e. Σx = E{xx^T}. Let the EVD of Σx be Σx = EDE^T, where E is the orthogonal matrix with the eigenvectors of Σx as its columns and D the diagonal matrix of the corresponding eigenvalues. Whitening of x can be done using the matrix V = ED^{-1/2}E^T. Denote the whitened data by v, i.e.

v = Vx   (2)
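A minimal Matlab sketch of this whitening step (the variable names and the n-by-T data layout are our own assumptions; the complete P-ICA code appears in Section 3.1):

% whitening sketch (assumption: x is an n-by-T matrix, one sample per column)
Sigma = cov(x');                      % sample covariance of the n observed variables
[E, D] = eig(Sigma);                  % EVD: Sigma = E*D*E'
V = E * diag(diag(D).^(-0.5)) * E';   % whitening matrix V = E*D^(-1/2)*E'
v = V * x;                            % whitened data; cov(v') is close to the identity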
After whitening, we just need to find an orthogonal transformation U that makes the components of y = Uv mutually independent. The transformation matrix W in Eq. 1 can then be constructed as

W = UV   (3)
3 ICA by PCA
Without loss of generality, we assume the sources s_i are zero-mean and of unit variance, i.e. E{s_i} = 0 and E{s_i^2} = 1. Let B = VA. One can easily see that B is orthogonal. Now let us consider the case where the s_i have different kurtosis.

3.1 Based on Kurtosis
Under the condition that s_1, ..., s_n have different kurtosis, we have the following theorem.

Theorem 1. Let s, v, and z be random vectors such that v = Bs, where B is an orthogonal matrix, and

z = ||v|| · v   (4)

Suppose additionally that s has zero-mean independent components and that these components have different kurtosis. Then the orthogonal matrix U which gives the principal components of z (without centering) performs ICA on v.

Proof: Let s = [s_1, ..., s_n]^T, U = [u_1, ..., u_n], and C = [c_1, ..., c_n] = [c_ij]_{n×n} = UB. Since B is orthogonal, we have ||v|| = ||Bs|| = ||s||. The second-order origin moment of the projection of z on u_i is

E(u_i^T z)^2 = E(u_i^T · ||v|| · v)^2 = E{||s||^2 · (u_i^T B s)^2} = E{ Σ_{k=1}^n (c_i^T s)^2 s_k^2 }
             = Σ_{k=1}^n c_{ik}^2 E(s_k^4) + Σ_{k=1}^n Σ_{p=1, p≠k}^n c_{ip}^2 E(s_p^2 s_k^2) + Σ_{k=1}^n Σ_{p=1, p≠k}^n Σ_{q=1, q≠p}^n c_{ip} c_{iq} E(s_p s_q s_k^2)   (5)
When q ≠ p, obviously at least one of q and p is different from k. Suppose q ≠ k. We then have E(s_p s_q s_k^2) = E(s_q) E(s_p s_k^2) = 0, since the s_i are independent and zero-mean. We also have Σ_{k=1}^n c_{ik}^2 = 1 since C is orthogonal. Equation 5 then becomes

E(u_i^T z)^2 = Σ_{k=1}^n c_{ik}^2 E(s_k^4) + Σ_{k=1}^n Σ_{p=1, p≠k}^n c_{ip}^2 E(s_p^2) E(s_k^2)
             = Σ_{k=1}^n c_{ik}^2 E(s_k^4) + Σ_{k=1}^n Σ_{p=1, p≠k}^n c_{ip}^2
             = Σ_{k=1}^n c_{ik}^2 kurt(s_k) + (n + 2)   (6)
Therefore E(u_i^T z)^2 is a weighted average of the kurt(s_k) plus a constant. As the s_i are assumed to have different kurtosis, without loss of generality, we assume
kurt(s_1) > kurt(s_2) > · · · > kurt(s_n). From Eq. 6 we can see that maximization of E(u_1^T z)^2 gives c_1 = [±1, 0, ..., 0]^T, which means that y_1 = u_1^T v = u_1^T B s = c_1^T s = ±s_1. After finding u_1, under the constraint that u_2 is orthogonal to u_1, the maximum of E(u_2^T z)^2 is obtained at c_2 = [0, ±1, 0, ..., 0]^T. Consequently y_2 = u_2^T v = c_2^T s = ±s_2. Repeating this procedure, finally all independent components can be estimated as y_i = u_i^T v = ±s_i, where u_i maximizes E(u_i^T z)^2. In other words, the orthogonal matrix U performing PCA on z (without centering of z) actually performs ICA on v.

Note that z may not be zero-mean since v may be nonsymmetrical, and we should not center z when performing PCA on it; in fact, here PCA is used to maximize the origin moment of u_i^T z, rather than its variance. We term this method P-ICA. Kurtosis-based P-ICA, which is actually equivalent to the fourth-order blind identification (FOBI) algorithm [2], consists of three steps: 1. whiten x (Eq. 2); 2. apply the transformation given in Eq. 4; 3. find U by PCA on z. After estimating the orthogonal matrix U, the de-mixing matrix for the original data, W, can be constructed according to Eq. 3. With PCA done by EVD, below we give the Matlab code implementing this algorithm (assuming x is an n-by-T matrix with one sample per column):

% whitening: v = E*D^(-1/2)*E'*x  (x: n-by-T, one sample per column)
[E,D] = eig(cov(x'));
v = E*diag(diag(D).^(-.5))*E'*x;
% data transformation z = ||v||*v (per sample), then ICA by PCA on z (no centering)
z = bsxfun(@times, sqrt(sum(v.^2,1)), v);
[EE,DD] = eig(z*z');
y = EE'*v;
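For concreteness, a minimal usage sketch of the code above on synthetic data; the signals, sample size, and the final correlation check are illustrative choices of ours, not taken from the paper:

% toy usage sketch (hypothetical sources: a sinusoid and Gaussian noise)
T = 1000;
s = [sin(2*pi*(1:T)/100); randn(1,T)];      % two sources with different kurtosis
A = randn(2);                               % random mixing matrix
x = A*s;                                    % observed mixtures, 2-by-T
[E,D] = eig(cov(x'));                       % whitening
v = E*diag(diag(D).^(-.5))*E'*x;
z = bsxfun(@times, sqrt(sum(v.^2,1)), v);   % z = ||v||*v
[EE,DD] = eig(z*z');                        % PCA of z (no centering)
y = EE'*v;                                  % estimates of the sources, up to sign and permutation
C = abs(corrcoef([y; s]'));                 % cross-correlations between y and s
disp(C(1:2,3:4))                            % should be close to a permutation matrix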
3.2 An Illustration
For clarity, let us use a simple example to illustrate how P-ICA works. We use a sinusoid waveform and a Gaussian white signal as the independent sources, shown in Fig. 1 (a). The mixing matrix A is randomly chosen. The observed data x = As are shown in Fig. 1 (b). The whitened version of the observations, v = Vx, is shown in Fig. 1 (c). Fig. 1 (d) shows the waveforms of the transformed data z = ||v|| · v together with the scatterplot. From this figure we can see that the axes corresponding to the independent sources s_i are almost the same as those giving the principal components of z. The orthogonal matrix for PCA of z can be obtained by applying EVD to the second-order origin moment matrix E{zz^T} (here, since both sources are symmetric, z is zero-mean and this coincides with the covariance matrix of z):

U = [ −0.2878  −0.9577
      −0.9577   0.2878 ]

According to Theorem 1, the independent components of x can then be obtained as y = Uv (or equivalently, y = UVx), as shown in Fig. 1 (e). Clearly the independent sources s_i have been successfully recovered.
3.3 For Robustness
From Eq. 6 we can see that the vector u_i found by maximizing E(u_i^T z)^2 actually depends on the kurtosis of the s_i. It is well known that kurtosis can be very sensitive to outliers, so it is useful to develop a robust version of P-ICA.
Fig. 1. An illustrative example. (a) The independent sources si (left) and their joint scatterplot (right). (b) The observations xi (left) and their joint scatterplot (right). (c) The whitened data vi (left) and their joint scatterplot (right). The straight lines in the figure on the right show the axes of the independent sources si. The ellipse, which is actually a circle, is a contour of the joint Gaussian distribution fitting v. (d) The transformed data z = ||v|| · v (left) and their joint scatterplot (right). The ellipse shows a contour of the joint Gaussian distribution fitting z. (e) The estimated independent components y = Uv = UVx (left) and their joint scatterplot (right). Clearly the yi provide a good estimate of the independent sources si.
Now let us replace ||v|| in Eq. 4 by f^{1/2}(||v||^2), i.e. z = f^{1/2}(||v||^2) · v, where f is a sufficiently smooth and monotonically increasing function with f(0) = 0. We have the Taylor expansion of f(||v||^2) about the origin:

f(||v||^2) = f(||s||^2) = f(s_1^2 + · · · + s_n^2) = Σ_{e_1,...,e_n ≥ 0, not all zero} α_{(e_1,e_2,...,e_n)} · s_1^{2e_1} s_2^{2e_2} · · · s_n^{2e_n}   (7)
where α_{(e_1,e_2,...,e_n)} denotes the coefficients of the Taylor expansion. Suppose the s_i have finite moments. We then have

E(u_i^T z)^2 = E{f(||v||^2) · (u_i^T B s)^2} = E{ f(||s||^2) · Σ_{p=1}^n Σ_{q=1}^n c_{ip} c_{iq} s_p s_q }
             = Σ_{p=1}^n Σ_{q=1, q≠p}^n Σ_{e_1,...,e_n ≥ 0, not all zero} c_{ip} c_{iq} α_{(e_1,e_2,...,e_n)} · E(s_1^{2e_1} · · · s_p^{2e_p+1} · · · s_q^{2e_q+1} · · · s_n^{2e_n})
               + Σ_{p=1}^n c_{ip}^2 E{s_p^2 · f(||s||^2)}   (8)
Provided that at most one of the sources s_i is nonsymmetrical, and also taking into account that the s_i are mutually independent, the first term in Eq. 8 vanishes. Define G_p ≡ E{s_p^2 · f(||s||^2)}. As the expectation of s_p^2 weighted by f(||s||^2), G_p is actually a function of the moments of the s_i of all even orders, according to Eq. 7:

G_p = Σ_{e_1,...,e_n ≥ 0, not all zero} α_{(e_1,e_2,...,e_n)} · E(s_1^{2e_1}) · · · E(s_p^{2e_p+2}) · · · E(s_n^{2e_n})
Suppose the G_p vary with p. Roughly speaking, this condition can be enforced by assuming that the s_i have different distributions. According to Eq. 8, E(u_i^T z)^2 is a weighted average of the G_p: E(u_i^T z)^2 = Σ_{p=1}^n c_{ip}^2 G_p. This equation is similar to Eq. 6, and we can see that the orthogonal matrix U which gives the principal components of z (without centering) performs ICA on v. In order to improve robustness, f(·) should be a function that increases more slowly than the linear one. Many functions can be chosen as f for this purpose. In our experiments, we choose

f(||v||^2) = log(1 + ||v||^2)   (9)
which behaves very well. Note that we have made the assumption that at most one of si is nonsymmetrical when using the robust version of P-ICA, while in kurtosis-based P-ICA, this assumption is unnecessary.
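A minimal sketch of this robust variant, assuming the same n-by-T data layout as the code in Section 3.1; only the transformation line changes, with f chosen as in Eq. 9:

% robust P-ICA sketch: replace ||v|| by f^(1/2)(||v||^2) with f(t) = log(1+t)
[E,D] = eig(cov(x'));                                 % whitening (x: n-by-T)
v = E*diag(diag(D).^(-.5))*E'*x;
z = bsxfun(@times, sqrt(log(1 + sum(v.^2,1))), v);    % z = f^(1/2)(||v||^2)*v
[EE,DD] = eig(z*z');                                  % PCA of z, no centering
y = EE'*v;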
4 Experiments
The illustrative example in Subsection 3.2 has demonstrated the validity of the P-ICA method. Now we give two additional experiments.
The Amari performance index Perr [1] is used to assess the quality of W for separating observations generated by the mixing matrix A. The smaller Perr, the better W. The P-ICA method assumes that the sources have different distributions. The first experiment demonstrates the necessity of this assumption and how the separation performance is affected by its violation. s1 is generated by the Cornish-Fisher (C-F) expansion [5] with skewness 0 and kurtosis 1. s2 is also generated by the C-F expansion with skewness 0, but its kurtosis varies from −1 to 6. Each source has 1000 samples. We compare five methods, namely kurtosis-based P-ICA, robust P-ICA (with f given in Eq. 9), FastICA [8] with the pow3 nonlinearity, FastICA with the tanh nonlinearity, and JADE [3], for separating s1 and s2 from their mixtures. For each value of kurt(s2), we randomly generate s2 and the mixing matrix A, and repeat the algorithms for 40 runs. The average performance index is shown in Fig. 2. As expected, when s1 and s2 have the same distribution (here their distribution depends only on the skewness and kurtosis, as both are generated by the C-F expansion), P-ICA fails to separate the sources. When kurt(s2) is far from kurt(s1), the performance of the robust P-ICA is very close to the best, and the other three methods produce similar results.
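For reference, one common form of the Amari performance index, sketched in Matlab; the exact normalization used in the paper may differ, so this should be read as an assumption rather than the authors' definition:

function Perr = amari_index(W, A)
% One common form of the Amari performance index for P = W*A (assumed form).
% Perr = 0 iff P is a scaled permutation matrix; larger values mean worse separation.
P = abs(W*A);
n = size(P,1);
rowTerm = sum(sum(P,2) ./ max(P,[],2) - 1);   % row-wise deviation from a single dominant entry
colTerm = sum(sum(P,1) ./ max(P,[],1) - 1);   % column-wise deviation
Perr = (rowTerm + colTerm) / (2*n*(n-1));     % normalized to lie roughly in [0,1]
end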
Fig. 2. Logarithm of the average performance index as a function of kurt(s2). Note that kurt(s1) = 1, and s1 and s2 are both generated by the Cornish-Fisher expansion.
The second experiment tests how the source number affects the separation performance. Four sources are involved: a non-central t distributed white signal (with ν = 5 and δ = 1; s1), a uniformly distributed white signal (s2), a Gaussian white signal (s3), and a square waveform (s4). Each source has 1000 samples. The source number n varies from 2 to 4. For each source number, we randomly generate the sources and the mixing matrix, and repeat the methods for 50 runs. The average performance index, together with its standard deviation, is given in Table 1. Compared to the other methods, the performance of P-ICA becomes worse as the source number increases. There may be two reasons for this phenomenon. First, as the source number increases, the differences between the source distributions become less significant. Second, in developing P-ICA, we treat the last term in Eq. 5 (and the first term in Eq. 8) as zero, which holds theoretically. However, in practice, due to the finite-sample effect and the fact that the sources may not be completely independent, this term may
Table 1. The average Perr of the five algorithms with different source numbers over 50 runs. The values in parentheses are the standard deviations.

Source number (n)   P-ICA (kurtosis)   P-ICA (robust)    FastICA (pow3)    FastICA (tanh)    JADE
2 (s1 & s2)         0.1137 (0.0663)    0.0786 (0.0502)   0.1137 (0.0663)   0.0619 (0.0410)   0.1288 (0.0773)
3 (s1 ∼ s3)         0.1381 (0.0591)    0.1104 (0.0454)   0.1234 (0.0632)   0.0901 (0.0516)   0.1069 (0.0535)
4 (s1 ∼ s4)         0.1109 (0.0366)    0.0851 (0.0284)   0.0852 (0.0284)   0.0568 (0.0171)   0.0735 (0.0236)
not vanish and the error accumulates as the source number increases. However, we should remark that the computational cost of P-ICA is very light since it is just a PCA procedure after a simple transformation.
5 Conclusion
We investigated the relationship between PCA and ICA, and consequently showed that ICA with sources having different distributions can be solved by PCA following a simple transformation. Such a method, termed P-ICA, does not explicitly involve higher-order statistics and has a low computational cost. Kurtosis-based P-ICA, which exploits only the kurtoses of the sources, is the same as FOBI [2]. We further developed a robust version of P-ICA, which is insensitive to outliers. P-ICA enables PCA (hardware) systems to perform ICA after slight modification. Experimental results showed the validity and the restrictions of P-ICA.
References

1. S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, 1996.
2. J. F. Cardoso. Source separation using higher order moments. In ICASSP'89, pages 2109–2112, 1989.
3. J. F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. IEE Proceedings F, 140(6):362–370, 1993.
4. P. Comon. Separation of stochastic processes. In Proceedings of the Workshop on Higher-Order Spectral Analysis, pages 174–179, Vail, Colorado, USA, June 1989.
5. E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Review of the International Statistical Institute, 5:307–320, 1937.
6. F. Harroy and J. L. Lacoume. Maximum likelihood estimators and Cramer-Rao bounds in source separation. Signal Processing, 55(2):167–177, 1996.
7. F. Herrmann and A. K. Nandi. Blind separation of linear instantaneous mixtures using closed-form estimators. Signal Processing, 81:1537–1556, 2001.
8. A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
9. S. Lagrange, L. Jaulin, V. Vigneron, and C. Jutten. Analytical solution of the blind source separation problem using derivatives.
10. V. Zarzoso and A. K. Nandi. Closed-form estimators for blind separation of sources—part I: Real mixtures. Wireless Personal Communications, 21:5–28, 2002.