Identifiability, subspace selection and noisy ICA

Mike Davies
DSP & Multimedia Group, Queen Mary University of London, Mile End Road, London E1 4NS, UK
[email protected]

Abstract. We consider identifiability and subspace selection for the noisy ICA model. We discuss a canonical decomposition that allows us to decompose the system into a signal and a noise subspace and show that an unbiased estimate of these can be obtained using a standard ICA algorithm. This can also be used to estimate the relevant subspace dimensions and may often be preferable to PCA dimension reduction. Finally we discuss the identifiability issues for the subsequent ‘square’ noisy ICA model after projection.
1 Introduction
Most work on ICA to date has ignored the presence of noise even when it is known to exist. While there has been some debate (see for example [12]) on the advantages and disadvantages of including the noise term, there are applications where background noise is known to be present (e.g. audio source separation [3] or telecommunications [15]) and where noise reduction is important. There have been a number of attempts to tackle the problem of noisy ICA. Recently, solutions based on Higher Order Statistics (HOS) only [5], bias removal (quasi-whitening) [10] and density estimation, e.g. using a Mixture of Gaussians (MoG) model, [14, 1], have been proposed. Each of these comes with its own drawbacks: bias removal requires prior knowledge of the noise covariance matrix; HOS-only methods typically require more data and are sensitive to outliers; and density estimation, while on the face of it the most general solution, is typically computationally expensive and difficult to train. We will also see that, in fact, the noisy ICA model is not fully identifiable.

It is well known that neglecting the presence of noise in the ICA formalism introduces bias into the subsequent estimators. Furthermore, consideration of the noise is actually fundamental in tackling the case of more sensors than sources (M > N), since, if there is no noise, we can arbitrarily throw away all but N of the observations (this is due to the famous equivariance property of ICA, [6]). Most papers, e.g. [7], currently advocate projecting down to an equal number of sources and sensors using Principal Component Analysis (PCA) to improve the Signal-to-Noise Ratio (SNR). This can be justified if the noise term is ‘small’ in comparison with the Independent Components (ICs). However PCA does not always provide the appropriate projection for all noise models. Instead we consider
the optimal subspace selection, which turns out to be well defined, and subsequently show that this can be achieved using the outputs of a standard FastICA algorithm without the need for bias correction. We also show that negentropy approximations used in algorithms such as FastICA provide a much more appropriate measure for order selection than the eigenvalues from PCA. In fact we are using ICA here much more in the spirit of projection pursuit, since component independence is not strictly required. We give an example that illustrates how dramatically this can differ from the PCA projection. Finally we conclude with a discussion of the identifiability issues of the subsequent ‘square’ noisy ICA model, indicating a weakness in the overall system when both the source densities and noise covariance are unknown a priori.
2 The noisy ICA model
Let x ∈ R^M be an observed random vector. We will assume that x has the following decomposition:

x = As + z    (1)

where A is a constant unknown M × N mixing matrix, s is a zero mean non-Gaussian random vector with independent components and z is a zero mean Gaussian random vector with unknown covariance Cz. We furthermore assume that A has full column rank, the covariance matrices Cx and Cz are positive definite and that M ≥ N (for M < N the issues become significantly more complicated even in the absence of noise; see [8]). Our statistical model consists of the following unknowns: {A, Cz, pi(si)}. As with standard ICA, [6], we can consider different degrees of blindness. For example it is quite common to assume knowledge of the noise covariance and/or the individual source densities, pi(si), e.g. [10], in which case the problem is well posed. Here we consider the fully blind case where we assume only that the source components are mutually independent and non-Gaussian.
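For concreteness, the following short Python sketch (my own illustrative setup, not code from the paper) draws samples from model (1) with binary, hence non-Gaussian, sources and strongly anisotropic Gaussian noise, similar in spirit to the example used later in Section 3; all dimensions and covariance values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, T = 4, 2, 10000            # sensors, sources, samples

# Non-Gaussian (binary +/-1) independent sources with unit variance
s = rng.choice([-1.0, 1.0], size=(N, T))

# Random full-column-rank M x N mixing matrix
A = rng.standard_normal((M, N))

# Strongly anisotropic Gaussian noise with covariance Cz (illustrative values)
B = rng.standard_normal((M, M))
Cz = B @ np.diag([4.0, 1.0, 0.1, 0.01]) @ B.T
z = np.linalg.cholesky(Cz) @ rng.standard_normal((M, T))

x = A @ s + z                    # observations, equation (1)
Cx = np.cov(x)                   # sample covariance of x
```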
2.1 Identifiability of A
The identifiability of non-Gaussian components was comprehensively tackled in the seminal work of Kagan et al., [13] chapter 10, in the early seventies and has recently been rediscovered to shed light on the identifiability conditions for both overcomplete ICA, [8], and noisy ICA, [4]. We thus have the following result, taken from [4], which is directly deducible from [13], theorem 10.3.1.

Theorem 1. Let x = As + z be the random vector as defined above. Then A is unique up to an arbitrary permutation and scaling.

The only difference between identifiability with and without noise is that here all the sources must be non-Gaussian so as to be able to distinguish them from the Gaussian noise.
2.2 A canonical decomposition for noisy ICA with M > N
The noisy ICA problem admits the following canonical form that splits the observation space into a signal subspace and a noise subspace (we first reported this in [4] but have subsequently found an equivalent decomposition was given in [2]):

\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} s + z_1 \\ z_2 \end{bmatrix}
\tag{2}
\]

where z1 is an N dimensional Gaussian random vector, z2 is an (M − N) dimensional Gaussian random vector and s, z1 and z2 are all mutually independent. Furthermore these subspaces are unique. That is: it is always possible to reduce the M × N problem to the N × N dimensional problem y1 = s + z1, where the other M − N directions can be completely removed (if M = N then clearly z2 is degenerate).

One way to construct such a transform is to use the linear Minimum Mean Squared Error (MMSE) estimator for s given A and x, which takes the form \(\hat{s} = C_s A^T C_x^{-1} x\) (see, for example, [9]). Although we do not know the covariance Cs, this simply rescales the source estimates and does not change the subspace, so we arbitrarily set Cs = I. Being linear, this estimator serves to decorrelate the source and noise subspaces (as required in our canonical form). We can therefore construct the canonical form as follows:

\[
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} (A^T C_x^{-1} A)^{-1} A^T C_x^{-1} \\ U_{\bar{A}}^T \end{bmatrix} x
\tag{3}
\]

where \(U_{\bar{A}}^T\) is the (M − N) × M matrix whose orthonormal rows span null(A^T) (obtainable from the singular value decomposition of A). It is easy to show that this also meets the requirement that z1 and z2 are mutually independent. Clearly we can also spatially whiten this decomposition while retaining the subspace decomposition:

\[
u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} C_{y_1}^{-1/2} & 0 \\ 0 & C_{y_2}^{-1/2} \end{bmatrix} y
\tag{4}
\]
Since the spatially whitened data is unique up to an arbitrary rotation, this shows that, once the data has been spatially whitened, the signal subspace, Us, and the noise subspace, Un, are orthogonal. In contrast, the IC directions are not necessarily orthogonal.
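As a sketch of how equations (3) and (4) might be implemented, the Python below constructs the canonical split and the block-wise whitening given A and Cx (here assumed known; in practice they would come from an estimate of the mixing matrix and the sample covariance). The function and variable names are my own.

```python
import numpy as np

def canonical_decomposition(A, Cx):
    """Return the two blocks of the transform in eq. (3): signal (y1) and noise (y2)."""
    M, N = A.shape
    Cx_inv = np.linalg.inv(Cx)
    # Top block: (A^T Cx^-1 A)^-1 A^T Cx^-1, with Cs arbitrarily set to I
    W1 = np.linalg.inv(A.T @ Cx_inv @ A) @ A.T @ Cx_inv        # N x M
    # Bottom block: orthonormal basis of null(A^T) from the SVD of A
    U, _, _ = np.linalg.svd(A, full_matrices=True)
    W2 = U[:, N:].T                                            # (M-N) x M
    return W1, W2

def whiten_blocks(W1, W2, x):
    """Spatially whiten each block separately, as in eq. (4)."""
    def white(y):
        vals, vecs = np.linalg.eigh(np.cov(y))
        return (vecs @ np.diag(vals ** -0.5) @ vecs.T) @ y
    return white(W1 @ x), white(W2 @ x)

# Usage with the data generated in the earlier sketch (hypothetical example):
# W1, W2 = canonical_decomposition(A, Cx)
# u1, u2 = whiten_blocks(W1, W2, x)
```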
3 Order and subspace selection
As mentioned in the introduction, the noiseless ICA framework offers no indication of how best to deal with more sensors than sources. Indeed, choosing any N sensors (generically) should be equivalent. The usual advice (though it has its critics, [11]) is to use PCA to project the observation data onto a vector of
the same dimension as the number of sources (assuming we know this number!), reducing to the square ICA problem. However, ironically, we will see that direct application of noiseless ICA deals with noise in a much more principled manner.

PCA projection can be justified when the noise covariance is isotropic (Cz = σz^2 I), in which case:

C_x = AA^T + σz^2 I = U_A (Σ_A^2 + σz^2 I) U_A^T    (5)
where A = U_A Σ_A V_A^T is the singular value decomposition of A. Similarly, if the noise is significantly smaller than the signal components, [16], PCA can distinguish between the noise floor (the smallest M − N principal values) and the subspace spanning A. However this argument no longer holds for a general noise covariance (if we know Cz we can always transform x such that this is the case). Indeed, when the noise is directional and not insignificant with respect to the component size, the PCA projection can result in an extremely poor transformation.

In comparison, the canonical decomposition in section 2.2 identifies an optimal projection, assuming that we know A. That is: choosing y1 in equation (2) is optimal, projecting out the noise subspace z2. This alone can have a significant de-noising effect. To see the difference between the PCA and the optimal projection we considered a pair of independent binary sources observed through a 4 × 2 dimensional mixing matrix, A, chosen randomly, with additive noise that was strongly anisotropic. Figure 1 shows scatter plots of 10000 samples projected first onto the optimal 2-D subspace (left) and the same data projected onto the first two principal components (middle). It is clear that the PCA projection has not only failed to reduce the noise but that the four clusters associated with the binary data are completely indistinguishable. In contrast to this, the optimal projection nicely separates out the clusters, indicating the potential for significant noise reduction.

Of course, the optimal subspace projection currently requires knowledge of the mixing matrix a priori. We now show that we can identify this using standard FastICA, without resorting to bias correction, HOS-only techniques or complicated density modelling. We concentrate on ICA methods that search for orthogonal directions within pre-whitened data, u (either gradient or fixed point based), and that have as their aim the maximization of some approximation of negentropy (e.g. see [11]):

J(u) ∝ (E{G(u)} − E{G(v)})^2    (6)
where G(·) is some nonlinear function and v is a zero mean, unit variance Gaussian random variable. For simplicity let us assume that G(u) = u^4, so that we are dealing with a kurtosis based ICA algorithm (though the conclusions should be more generally applicable). If u ∈ Un we have J(u) = 0, while if u ∈ Us we know that J(u) ≥ 0.
Fig. 1. Scatter plots of 2-dimensional projections using: the optimal canonical form (left), the first two Principal Components (middle) and the first two Independent Components using FastICA without bias correction (right).
Finally, suppose that u has the form

u = α u_s + √(1 − α^2) u_n,   u_s ∈ Us, u_n ∈ Un    (7)

with 0 ≤ α ≤ 1. Then, from the mutual independence of u_s and u_n and the linearity property of kurtosis, J(u) is maximum for α = 1. Thus the ICA algorithm will select N directions spanning Us, with the remaining M − N directions spanning Un due to the orthogonality constraint.

In summary, while standard ICA produces biased estimates of the individual components, which do not necessarily form an orthogonal set of directions in the pre-whitened data, it does produce unbiased estimates of the signal and noise subspaces, and therefore the first N IC directions provide an estimate for the optimal projection. To illustrate this, the right-hand plot in figure 1 shows the subspace associated with the first two ICs estimated for the noisy data using G(u) = ln cosh(u). We stress that these estimates were made using the FastICA algorithm, [11], without any bias correction. It is clear that we have achieved a similar level of de-noising to the optimal projection.

This still leaves us with the problem of order selection. Here, since the ICs are assumed to be non-Gaussian, J(u) itself provides us with a good indicator of model order. We thus propose using the negentropy estimates in a similar manner to the eigenvalue spectrum used in PCA. Figure 2 shows the negentropy spectrum for FastICA and the eigenvalue spectrum for PCA. We can clearly identify from the negentropy spectrum that there are 2 non-Gaussian components followed by a 2 dimensional noise subspace. In contrast, the eigenvalue spectrum tells us nothing about the order of the noisy ICA model.
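A possible implementation of this subspace and order selection step is sketched below, using scikit-learn's FastICA (a recent version is assumed for the whiten="unit-variance" option) as a stand-in for the algorithm in [11], which is not prescribed by the paper. The negentropy approximation (6) with G(u) = ln cosh(u) is evaluated empirically for each extracted direction, and E{G(v)} is estimated by Monte Carlo.

```python
import numpy as np
from sklearn.decomposition import FastICA

def negentropy_spectrum(x, n_dirs, seed=0):
    """Estimate J(u) = (E{G(u)} - E{G(v)})^2 for each FastICA direction."""
    ica = FastICA(n_components=n_dirs, whiten="unit-variance",
                  fun="logcosh", random_state=seed)
    u = ica.fit_transform(x.T).T                 # each row is one direction
    u = u / u.std(axis=1, keepdims=True)         # enforce unit variance
    G = lambda y: np.log(np.cosh(y))
    v = np.random.default_rng(seed).standard_normal(100_000)
    J = (G(u).mean(axis=1) - G(v).mean()) ** 2   # negentropy approximation (6)
    return np.sort(J)[::-1]

# Usage (hypothetical, with x from the earlier sketch):
# J = negentropy_spectrum(x, n_dirs=4)                     # large J => signal direction
# eigvals = np.sort(np.linalg.eigvalsh(np.cov(x)))[::-1]   # PCA spectrum for comparison
# The number of J values clearly above the noise floor estimates N, and the
# corresponding ICA directions span the signal subspace Us.
```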
4 Identifiability issues for ‘square’ noisy ICA
We conclude by looking at the identifiability of the ‘square’ noisy ICA problem, which we can now easily obtain using standard ICA.
Fig. 2. The negentropy values for each FastICA direction (left) and the eigenvalue spectrum from PCA (right).
Although, in theory, we can identify the mixing matrix A, due to the presence of noise, inverting A does not give us direct access to the independent sources. Furthermore, optimal estimation of the sources requires knowledge of the source distributions, pi(si), and the noise covariance, Cz. In [4] the following identifiability result was derived for the canonical form:

Theorem 2. Let y1 and z1 be defined as above. Then only the off-diagonal elements of Cz1 are uniquely identifiable. The diagonal elements of Cz1 and the source distributions, p(si), are not.

This is not really surprising as it is generally possible to incorporate some of the noise into the sources and vice versa. The ambiguity is related to the notion of non-separability in [8]. What is perhaps surprising is that the amount of noise allowed to be incorporated into one source is dependent on the noise being incorporated into the other sources. This is best illustrated with a simple example.
4.1 A simple example
Consider the 4 × 2 example with binary sources used above. We will assume that we have projected out the noise subspace and have removed the remaining bias in the estimates by one means or other. Let us denote the (observed) covariance matrix of y1 by:

\[
C_{y_1} = \begin{bmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{bmatrix} = I + C_{z_1}
\]

We can then write the noise covariance Cz1 as:

\[
C_{z_1} = \begin{bmatrix} c_1 & r_{12} \\ r_{12} & c_2 \end{bmatrix}
\tag{8}
\]
The off-diagonal terms are immediately observable from Cy1. In contrast, c1 and c2 are ambiguous. The extent of the ambiguity region is shown as the shaded area in figure 3, below. The upper bounds for c1 and c2 are defined, in this case, by the true source distributions (we have set the scale ambiguity to one). This is because pi(si) cannot be further de-convolved by a Gaussian of finite variance. In general the top right-hand corner of the ambiguity region gives the minimum variance source estimates. The lower bound on c1 and c2 depends on the degree of correlation between the components of z1 and corresponds to the noise covariance becoming rank deficient. Note that it is a joint function of c1 and c2. Thus there is no unique solution that ‘leaves’ as much independent noise as possible in each of the individual sources.
Fig. 3. The ambiguity region for the simple binary source example.
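A small numerical sketch of this ambiguity region is given below (my own illustration, using the convention Cs = I so that Cy1 = I + Cz1): for each candidate diagonal pair (c1, c2) it checks whether the implied noise covariance remains positive semidefinite, i.e. c1 c2 ≥ r12^2, which gives the lower boundary of the region; the true noise variances give the upper bounds.

```python
import numpy as np

def ambiguity_region(Cy1, c_upper, n_grid=200):
    """Feasible diagonals (c1, c2) of Cz1 given an observed Cy1 (with Cs = I).

    r12 is fixed by the off-diagonal of Cy1.  A pair (c1, c2) is admissible
    when Cz1 = [[c1, r12], [r12, c2]] is positive semidefinite, i.e. when
    c1 * c2 >= r12**2, and when neither c_i exceeds its upper bound c_upper[i]
    (for binary sources this is the true noise variance, since the source
    density cannot be deconvolved by any further Gaussian).
    """
    r12 = Cy1[0, 1]
    c1s = np.linspace(0.0, c_upper[0], n_grid)
    c2s = np.linspace(0.0, c_upper[1], n_grid)
    feasible = np.outer(c1s, c2s) >= r12 ** 2    # boolean grid of the region
    return c1s, c2s, feasible

# Hypothetical usage (values chosen for illustration only):
# Cy1 = np.array([[1.5, 0.2], [0.2, 1.2]])      # observed covariance of y1
# c1s, c2s, region = ambiguity_region(Cy1, c_upper=(0.5, 0.2))
```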
4.2 Noisy ICA and ML estimation
Given this ambiguity, ML estimates, such as those in [1] based upon MoG models, are not necessarily optimal, since the selection of the diagonal terms in the noise covariance, Cz1, will be essentially arbitrary. One solution is to include additional knowledge of the nature of the source densities. For example, the minimum variance MoG source density will be characterized by the presence of at least one degenerate Gaussian (variance = 0). Another alternative that makes the problem well-posed again is to assume that the noise covariance is isotropic, Cz ∝ I. In this case PCA can also be justified for the dimension reduction stage.
5 Conclusion
We have shown that a standard ICA algorithm using pre-whitening, in the spirit of projection pursuit, provides unbiased estimates of the signal and noise subspaces for the undercomplete noisy ICA model, and thus offers optimal de-noising when projecting onto the signal subspace. This is in contrast to PCA which, while more popular for dimension reduction, is only meaningful in a restricted set of circumstances. Once a ‘square’ noisy ICA model has been obtained, a number of techniques can be used to avoid the bias induced by standard ICA. However, further noise reduction is made more difficult since the model is not fully identifiable.
Acknowledgment

I would like to thank Nikolaos Mitianoudis for provision of the FastICA code.
References

1. H. Attias. Independent Factor Analysis. Neural Computation, 11, 803-851, 1998.
2. O. Bermond and J-F. Cardoso. Approximate likelihood for noisy mixtures. In Proc. ICA '99, 1999.
3. M.E. Davies. Audio Source Separation. In Mathematics in Signal Processing V, edited by J. McWhirter and I. Proudler, 2002.
4. M.E. Davies. Identifiability issues in noisy ICA. To appear in IEEE Signal Processing Letters, May 2004.
5. L. De Lathauwer, B. De Moor and J. Vandewalle. A technique for higher-order-only blind source separation. In Proc. ICONIP, Hong Kong, 1996.
6. J-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 9(10), 2009-2025, 1998.
7. P. Comon. Independent Component Analysis: a new concept? Signal Processing, 36(3), 287-314, 1994.
8. J. Eriksson and V. Koivunen. Identifiability and separability of linear ICA models revisited. In Proc. 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, Japan, 2003.
9. M.H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, 1996.
10. A. Hyvarinen. Gaussian moments for noisy Independent Component Analysis. IEEE Signal Processing Letters, 6(6), 145-147, 1999.
11. A. Hyvarinen, J. Karhunen and E. Oja. Independent Component Analysis. John Wiley & Sons, Inc., 2001.
12. ICA mailing list maintained by J-F. Cardoso, at: www.tsi.enst.fr/icacentral.
13. A.M. Kagan, Y.V. Linnik and C.R. Rao. Characterization Problems in Mathematical Statistics. Wiley, New York, 1973.
14. E. Moulines, J.-F. Cardoso and E. Gassiat. Maximum likelihood for blind signal separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP-97, 1997.
15. C.B. Papadias. Globally convergent blind source separation based on a multiuser kurtosis maximization criterion. IEEE Trans. Signal Processing, 48(12), 2000.
16. J-P. Nadal, E. Korutcheva and F. Aires. Blind source processing in the presence of weak sources. Neural Networks, 13(6), 589-596, 2000.