ICA by PCA Approach: Relating Higher-Order Statistics to Second-Order Moments Kun Zhang and Lai-Wan Chan Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong {kzhang, lwchan}@cse.cuhk.edu.hk
Abstract. It is well known that principal component analysis (PCA) only considers the second-order statistics and that independent component analysis (ICA) exploits higher-order statistics of the data. In this paper, for whitened data, we give an elegant way to incorporate higher-order statistics implicitly in the form of second-order moments, and show that ICA can be performed by PCA following a simple transformation. This method is termed P-ICA. Kurtosis-based P-ICA is equivalent to the fourth-order blind identification (FOBI) algorithm [2]. Analysis of the form of the transformation enables us to give a robust version of P-ICA, which exploits a trade-off among all even-order statistics of the sources. Experimental comparisons of P-ICA with prevailing ICA methods are presented. The main advantage of P-ICA is that it enables any PCA system, especially dedicated hardware, to perform ICA after slight modification.
1 Introduction
In independent component analysis (ICA), we have an observable random vector x = [x_1, x_2, ..., x_n]^T. The variables x_i are assumed to be linear mixtures of some mutually statistically independent variables s_i, which form the random vector s = [s_1, s_2, ..., s_n]^T (here we assume that the number of observations is equal to that of the sources, and that the s_i are zero-mean). Mathematically, x is generated according to x = As. By using some linear transform, the task of ICA is to find the random vector y = [y_1, y_2, ..., y_n]^T:

y = Wx   (1)
whose components are as mutually independent as possible, so that y provides an estimate of s. Closed-form solutions to ICA, which involve no iterative optimization and are extremely cheap to compute, were developed using higher-order cumulants [4, 6, 7, 10] or the derivatives of the mixtures [9]. However, the applicability of these results is limited, since they only apply to the two-source case. Generally
This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.
speaking, ICA does not have closed-form solutions, and an ordinary ICA algorithm consists of two parts: an objective (contrast) function and an optimization method used to optimize it. In this paper, by investigating the relationship between ICA and PCA, we present a way to solve the ICA problem by explicitly applying principal component analysis (PCA). We consider the case where the sources have different distributions. It is shown that for whitened data, after a simple transformation, the linear transformation performing PCA of the transformed data is exactly the one performing ICA of the original data. Although this method, termed P-ICA, may not give optimal results, it is very simple and efficient, as PCA can easily be done by eigenvalue decomposition (EVD) of the data covariance matrix or singular value decomposition (SVD) of the data. In addition, it explicitly involves only second-order statistics and has no local optima. As a consequence, systems based on second-order statistics, especially hardware implementations, can also perform ICA via P-ICA.
2 PCA and ICA
Given the data x, PCA and ICA both aim to find the linear transformation given in Eq. 1. However, they are based on different criteria and exploit different aspects of the data. PCA aims at finding an orthogonal transformation W which gives uncorrelated outputs; equivalently, in PCA each principal component defines a projection that captures the maximum amount of variation in the data and is uncorrelated with the previous principal components. It only considers the second-order statistical characteristics of the data. In other words, PCA just uses a joint Gaussian distribution to fit the data and finds an orthogonal transformation which makes this joint Gaussian distribution factorable, regardless of the true distribution of the data. In ICA, by contrast, we find the linear transformation which makes the true joint distribution of the transformed data factorable, such that the outputs are mutually independent. Statistically speaking, mutual independence is much stronger than uncorrelatedness between the variables, i.e. independence guarantees uncorrelatedness. In ICA, the independent components y_i are uncorrelated, at least approximately, since they are made as independent as possible. Some ICA algorithms, such as FastICA [8] and JADE [3], require whitening the observed data x as a preprocessing step. The method proposed in this paper also requires this step. Whitening of x can be performed by PCA or by EVD of the covariance matrix of x. Denote the covariance matrix of x by Σx, i.e. Σx = E{xx^T}. Let the EVD of Σx be Σx = EDE^T, where E is the orthogonal matrix with the eigenvectors of Σx as its columns and D the diagonal matrix of the corresponding eigenvalues. Whitening of x can be done using the matrix V = ED^{-1/2}E^T. Denote the whitened data by v, i.e.

v = Vx   (2)
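A minimal Matlab sketch of this whitening step (the variable names and the n-by-T data layout are our own assumptions; the complete P-ICA code appears in Section 3.1):

% whitening sketch (assumption: x is an n-by-T matrix, one sample per column)
Sigma = cov(x');                      % sample covariance of the n observed variables
[E, D] = eig(Sigma);                  % EVD: Sigma = E*D*E'
V = E * diag(diag(D).^(-0.5)) * E';   % whitening matrix V = E*D^(-1/2)*E'
v = V * x;                            % whitened data; cov(v') is close to the identity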
After whitening, we just need to find an orthogonal transformation U that makes the components of y = Uv mutually independent. The transformation matrix W in Eq. 1 can then be constructed as

W = UV   (3)
3 ICA by PCA
Without loss of generality, we assume the sources s_i are zero-mean and of unit variance, i.e. E{s_i} = 0 and E{s_i^2} = 1. Let B = VA. One can easily see that B is orthogonal. Now let us consider the case where the s_i have different kurtosis.

3.1 Based on Kurtosis
Under the condition that s_1, ..., s_n have different kurtosis, we have the following theorem.

Theorem 1. Let s, v, and z be random vectors such that v = Bs, where B is an orthogonal matrix, and

z = ||v|| · v   (4)

Suppose additionally that s has zero-mean independent components and that these components have different kurtosis. Then the orthogonal matrix U which gives the principal components of z (without centering) performs ICA on v.

Proof: Let s = [s_1, ..., s_n]^T, U = [u_1, ..., u_n], and C = [c_1, ..., c_n] = [c_ij]_{n×n} = UB. Since B is orthogonal, we have ||v|| = ||Bs|| = ||s||. The second-order origin moment of the projection of z on u_i is

E(u_i^T z)^2 = E(u_i^T · ||v|| · v)^2 = E{||s||^2 · (u_i^T B s)^2} = E{ Σ_{k=1}^n (c_i^T s)^2 s_k^2 }
             = Σ_{k=1}^n c_{ik}^2 E(s_k^4) + Σ_{k=1}^n Σ_{p=1, p≠k}^n c_{ip}^2 E(s_p^2 s_k^2) + Σ_{k=1}^n Σ_{p=1, p≠k}^n Σ_{q=1, q≠p}^n c_{ip} c_{iq} E(s_p s_q s_k^2)   (5)
When q ≠ p, obviously at least one of q and p is different from k. Suppose q ≠ k. We then have E(s_p s_q s_k^2) = E(s_q) E(s_p s_k^2) = 0, since the s_i are independent and zero-mean. We also have Σ_{k=1}^n c_{ik}^2 = 1 since C is orthogonal. Equation 5 then becomes

E(u_i^T z)^2 = Σ_{k=1}^n c_{ik}^2 E(s_k^4) + Σ_{k=1}^n Σ_{p=1, p≠k}^n c_{ip}^2 E(s_p^2) E(s_k^2)
             = Σ_{k=1}^n c_{ik}^2 E(s_k^4) + Σ_{k=1}^n Σ_{p=1, p≠k}^n c_{ip}^2
             = Σ_{k=1}^n c_{ik}^2 kurt(s_k) + (n + 2)   (6)
Therefore E(u_i^T z)^2 is a weighted average of the kurt(s_k) plus a constant. As the s_i are assumed to have different kurtosis, without loss of generality, we assume
kurt(s_1) > kurt(s_2) > · · · > kurt(s_n). From Eq. 6 we can see that maximization of E(u_1^T z)^2 gives c_1 = [±1, 0, ..., 0]^T, which means that y_1 = u_1^T v = u_1^T B s = c_1^T s = ±s_1. After finding u_1, under the constraint that u_2 is orthogonal to u_1, the maximum of E(u_2^T z)^2 is obtained at c_2 = [0, ±1, 0, ..., 0]^T. Consequently y_2 = u_2^T v = c_2^T s = ±s_2. Repeating this procedure, finally all independent components can be estimated as y_i = u_i^T v = ±s_i, where u_i maximizes E(u_i^T z)^2. In other words, the orthogonal matrix U performing PCA on z (without centering of z) actually performs ICA on v.

Note that z may not be zero-mean since v may be nonsymmetrical, and we should not center z when performing PCA on it; in fact, here PCA is used to maximize the origin moment of u_i^T z, rather than its variance. We term this method P-ICA. Kurtosis-based P-ICA, which is actually equivalent to the fourth-order blind identification (FOBI) algorithm [2], consists of three steps: 1. whiten x (Eq. 2); 2. apply the transformation given in Eq. 4; 3. find U by PCA on z. After estimating the orthogonal matrix U, the de-mixing matrix for the original data, W, can be constructed according to Eq. 3. With PCA done by EVD, below we give the Matlab code implementing this algorithm (assuming x is an n-by-T matrix with one sample per column):

% whitening: v = E*D^(-1/2)*E'*x  (x: n-by-T, one sample per column)
[E,D] = eig(cov(x'));
v = E*diag(diag(D).^(-.5))*E'*x;
% data transformation z = ||v||*v (per sample), then ICA by PCA on z (no centering)
z = bsxfun(@times, sqrt(sum(v.^2,1)), v);
[EE,DD] = eig(z*z');
y = EE'*v;
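For concreteness, a minimal usage sketch of the code above on synthetic data; the signals, sample size, and the final correlation check are illustrative choices of ours, not taken from the paper:

% toy usage sketch (hypothetical sources: a sinusoid and Gaussian noise)
T = 1000;
s = [sin(2*pi*(1:T)/100); randn(1,T)];      % two sources with different kurtosis
A = randn(2);                               % random mixing matrix
x = A*s;                                    % observed mixtures, 2-by-T
[E,D] = eig(cov(x'));                       % whitening
v = E*diag(diag(D).^(-.5))*E'*x;
z = bsxfun(@times, sqrt(sum(v.^2,1)), v);   % z = ||v||*v
[EE,DD] = eig(z*z');                        % PCA of z (no centering)
y = EE'*v;                                  % estimates of the sources, up to sign and permutation
C = abs(corrcoef([y; s]'));                 % cross-correlations between y and s
disp(C(1:2,3:4))                            % should be close to a permutation matrix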
3.2 An Illustration
For clarity, let us use a simple example to illustrate how P-ICA works. We use a sinusoid waveform and a Gaussian white signal as the independent sources, shown in Fig. 1 (a). The mixing matrix A is randomly chosen. The observed data x = As are shown in Fig. 1 (b). The whitened version of the observations, v = Vx, is shown in Fig. 1 (c). Fig. 1 (d) shows the waveforms of the transformed data z = ||v|| · v together with the scatterplot. From this figure we can see that the axes corresponding to the independent sources s_i are almost the same as those giving the principal components of z. The orthogonal matrix for PCA of z can be obtained by applying EVD to the second-order origin moment matrix E{zz^T} (here, since both sources are symmetric, z is zero-mean and this coincides with the covariance matrix of z):

U = [ −0.2878  −0.9577
      −0.9577   0.2878 ]

According to Theorem 1, the independent components of x can then be obtained as y = Uv (or equivalently, y = UVx), as shown in Fig. 1 (e). Clearly the independent sources s_i have been successfully recovered.
3.3 For Robustness
From Eq. 6 we can see that the vector u_i found by maximizing E(u_i^T z)^2 actually depends on the kurtosis of the s_i. It is well known that kurtosis can be very sensitive to outliers, so it is useful to develop a robust version of P-ICA.
Fig. 1. An illustrative example. (a) The independent sources si (left) and their joint scatterplot (right). (b) The observations xi (left) and their joint scatterplot (right). (c) The whitened data vi (left) and their joint scatterplot (right). The straight lines in the figure on the right show the axes of the independent sources si. The ellipse, which is actually a circle, is a contour of the joint Gaussian distribution fitting v. (d) The transformed data z = ||v|| · v (left) and their joint scatterplot (right). The ellipse shows a contour of the joint Gaussian distribution fitting z. (e) The estimated independent components y = Uv = UVx (left) and their joint scatterplot (right). Clearly the yi provide a good estimate of the independent sources si.
Now let us replace ||v|| in Eq. 4 by f^{1/2}(||v||^2), i.e. z = f^{1/2}(||v||^2) · v, where f is a sufficiently smooth and monotonically increasing function with f(0) = 0. We have the Taylor expansion of f(||v||^2) about the origin:

f(||v||^2) = f(||s||^2) = f(s_1^2 + · · · + s_n^2) = Σ_{e_1,...,e_n ≥ 0, not all zero} α_{(e_1,e_2,...,e_n)} · s_1^{2e_1} s_2^{2e_2} · · · s_n^{2e_n}   (7)
where α_{(e_1,e_2,...,e_n)} denotes the coefficients of the Taylor expansion. Suppose the s_i have finite moments. We then have

E(u_i^T z)^2 = E{f(||v||^2) · (u_i^T B s)^2} = E{ f(||s||^2) · Σ_{p=1}^n Σ_{q=1}^n c_{ip} c_{iq} s_p s_q }
             = Σ_{p=1}^n Σ_{q=1, q≠p}^n Σ_{e_1,...,e_n ≥ 0, not all zero} c_{ip} c_{iq} α_{(e_1,e_2,...,e_n)} · E(s_1^{2e_1} · · · s_p^{2e_p+1} · · · s_q^{2e_q+1} · · · s_n^{2e_n})
               + Σ_{p=1}^n c_{ip}^2 E{s_p^2 · f(||s||^2)}   (8)
Provided that at most one of the sources s_i is nonsymmetrical, and also taking into account that the s_i are mutually independent, the first term in Eq. 8 vanishes. Define G_p ≡ E{s_p^2 · f(||s||^2)}. As the expectation of s_p^2 weighted by f(||s||^2), G_p is actually a function of the moments of the s_i of all even orders, according to Eq. 7:

G_p = Σ_{e_1,...,e_n ≥ 0, not all zero} α_{(e_1,e_2,...,e_n)} · E(s_1^{2e_1}) · · · E(s_p^{2e_p+2}) · · · E(s_n^{2e_n})
Suppose the G_p vary with p. Roughly speaking, this condition can be enforced by assuming that the s_i have different distributions. According to Eq. 8, E(u_i^T z)^2 is a weighted average of the G_p: E(u_i^T z)^2 = Σ_{p=1}^n c_{ip}^2 G_p. This equation is similar to Eq. 6, and we can see that the orthogonal matrix U which gives the principal components of z (without centering) performs ICA on v. In order to improve robustness, f(·) should be a function that increases more slowly than the linear one. Many functions can be chosen as f for this purpose. In our experiments, we choose

f(||v||^2) = log(1 + ||v||^2)   (9)
which behaves very well. Note that we have made the assumption that at most one of si is nonsymmetrical when using the robust version of P-ICA, while in kurtosis-based P-ICA, this assumption is unnecessary.
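A minimal sketch of this robust variant, assuming the same n-by-T data layout as the code in Section 3.1; only the transformation line changes, with f chosen as in Eq. 9:

% robust P-ICA sketch: replace ||v|| by f^(1/2)(||v||^2) with f(t) = log(1+t)
[E,D] = eig(cov(x'));                                 % whitening (x: n-by-T)
v = E*diag(diag(D).^(-.5))*E'*x;
z = bsxfun(@times, sqrt(log(1 + sum(v.^2,1))), v);    % z = f^(1/2)(||v||^2)*v
[EE,DD] = eig(z*z');                                  % PCA of z, no centering
y = EE'*v;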
4 Experiments
The illustrative example in Subsection 3.2 has demonstrated the validity of the P-ICA method. Now we give two additional experiments.
The Amari performance index Perr [1] is used to assess the quality of W for separating observations generated by the mixing matrix A. The smaller Perr, the better W. The P-ICA method assumes that the sources have different distributions. The first experiment demonstrates the necessity of this assumption and how the separation performance is affected by its violation. s1 is generated by the Cornish-Fisher (C-F) expansion [5] with skewness 0 and kurtosis 1. s2 is also generated by the C-F expansion with skewness 0, but its kurtosis varies from −1 to 6. Each source has 1000 samples. We compare five methods, namely kurtosis-based P-ICA, robust P-ICA (with f given in Eq. 9), FastICA [8] with the pow3 nonlinearity, FastICA with the tanh nonlinearity, and JADE [3], for separating s1 and s2 from their mixtures. For each value of kurt(s2), we randomly generate s2 and the mixing matrix A, and repeat the algorithms for 40 runs. The average performance index is shown in Fig. 2. As expected, when s1 and s2 have the same distribution (here their distribution depends only on the skewness and kurtosis, as both are generated by the C-F expansion), P-ICA fails to separate the sources. When kurt(s2) is far from kurt(s1), the performance of the robust P-ICA is very close to the best, and the other three methods produce similar results.
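For reference, one common form of the Amari performance index, sketched in Matlab; the exact normalization used in the paper may differ, so this should be read as an assumption rather than the authors' definition:

function Perr = amari_index(W, A)
% One common form of the Amari performance index for P = W*A (assumed form).
% Perr = 0 iff P is a scaled permutation matrix; larger values mean worse separation.
P = abs(W*A);
n = size(P,1);
rowTerm = sum(sum(P,2) ./ max(P,[],2) - 1);   % row-wise deviation from a single dominant entry
colTerm = sum(sum(P,1) ./ max(P,[],1) - 1);   % column-wise deviation
Perr = (rowTerm + colTerm) / (2*n*(n-1));     % normalized to lie roughly in [0,1]
end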
Fig. 2. Logarithm of the average performance index as a function of kurt(s2). Note that kurt(s1) = 1, and s1 and s2 are both generated by the Cornish-Fisher expansion.
The second experiment tests how the source number affects the separation performance. Four sources are involved: a non-central t distributed white signal (with ν = 5 and δ = 1; s1), a uniformly distributed white signal (s2), a Gaussian white signal (s3), and a square waveform (s4). Each source has 1000 samples. The source number n varies from 2 to 4. For each source number, we randomly generate the sources and the mixing matrix, and repeat the methods for 50 runs. The average performance index, together with its standard deviation, is given in Table 1. Compared to the other methods, the performance of P-ICA becomes worse as the source number increases. There may be two reasons for this phenomenon. First, as the source number increases, the differences between the source distributions become less significant. Second, in developing P-ICA, we treat the last term in Eq. 5 (and the first term in Eq. 8) as zero, which holds theoretically. However, in practice, due to the finite-sample effect and the fact that the sources may not be completely independent, this term may
Table 1. The average Perr of the five algorithms with different source numbers over 50 runs. The values in parentheses are the standard deviations.

Source number (n)   P-ICA (kurtosis)   P-ICA (robust)    FastICA (pow3)    FastICA (tanh)    JADE
2 (s1 & s2)         0.1137 (0.0663)    0.0786 (0.0502)   0.1137 (0.0663)   0.0619 (0.0410)   0.1288 (0.0773)
3 (s1 ∼ s3)         0.1381 (0.0591)    0.1104 (0.0454)   0.1234 (0.0632)   0.0901 (0.0516)   0.1069 (0.0535)
4 (s1 ∼ s4)         0.1109 (0.0366)    0.0851 (0.0284)   0.0852 (0.0284)   0.0568 (0.0171)   0.0735 (0.0236)
not vanish and the error accumulates as the source number increases. However, we should remark that the computational cost of P-ICA is very light since it is just a PCA procedure after a simple transformation.
5 Conclusion
We investigated the relationship between PCA and ICA, and consequently showed that ICA with sources having different distributions can be solved by PCA following a simple transformation. Such a method, termed P-ICA, does not explicitly involve higher-order statistics and has a low computational cost. Kurtosis-based P-ICA, which exploits only the kurtoses of the sources, is the same as FOBI [2]. We further developed a robust version of P-ICA, which is insensitive to outliers. P-ICA enables PCA (hardware) systems to perform ICA after slight modification. Experimental results showed the validity and the restrictions of P-ICA.
References

1. S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, 1996.
2. J. F. Cardoso. Source separation using higher order moments. In ICASSP'89, pages 2109–2112, 1989.
3. J. F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. IEE Proceedings F, 140(6):362–370, 1993.
4. P. Comon. Separation of stochastic processes. In Proceedings of the Workshop on Higher-Order Spectral Analysis, pages 174–179, Vail, Colorado, USA, June 1989.
5. E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Review of the International Statistical Institute, 5:307–320, 1937.
6. F. Harroy and J. L. Lacoume. Maximum likelihood estimators and Cramer-Rao bounds in source separation. Signal Processing, 55(2):167–177, 1996.
7. F. Herrmann and A. K. Nandi. Blind separation of linear instantaneous mixtures using closed-form estimators. Signal Processing, 81:1537–1556, 2001.
8. A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
9. S. Lagrange, L. Jaulin, V. Vigneron, and C. Jutten. Analytical solution of the blind source separation problem using derivatives.
10. V. Zarzoso and A. K. Nandi. Closed-form estimators for blind separation of sources—part I: Real mixtures. Wireless Personal Communications, 21:5–28, 2002.