
Signal Processing 87 (2007) 1890–1903 www.elsevier.com/locate/sigpro

Joint low-rank approximation for extracting non-Gaussian subspaces

Motoaki Kawanabe^a, Fabian J. Theis^b

^a Fraunhofer FIRST.IDA, Germany
^b Max Planck Institute for Dynamics and Self-Organisation & Bernstein Center for Computational Neuroscience, Germany

Received 1 August 2006; received in revised form 4 December 2006; accepted 26 January 2007. Available online 16 February 2007.

Abstract

In this article, we consider high-dimensional data which contain a low-dimensional non-Gaussian structure contaminated with Gaussian noise. Motivated by joint diagonalization algorithms, we propose a linear dimension reduction procedure called joint low-rank approximation (JLA) to identify the non-Gaussian subspace. The method uses matrices whose non-zero eigenspaces coincide with the non-Gaussian subspace. We also prove its global consistency, that is, the true mapping to the non-Gaussian subspace is obtained by maximizing the contrast function defined by such matrices. As examples, we present two implementations of JLA, one with fourth-order cumulant tensors and the other with Hessians of the characteristic function. A numerical study demonstrates the validity of our method. In particular, the second algorithm works more robustly and efficiently in most cases. © 2007 Elsevier B.V. All rights reserved.

Keywords: Linear dimension reduction; Non-Gaussian subspace; Joint low-rank approximation; Fourth-order cumulant tensor; Characteristic function

1. Introduction

In the last decade, independent component analysis (ICA) and its various extensions have gained a lot of popularity and are widely accepted in the fields of signal processing and statistical data analysis [1–5]. Because data sets sometimes contain a large number of features, dimension reduction is an important preprocessing step before such blind source separation (BSS) techniques, in order to avoid overfitting [6] and to reduce measurement noise [7]. Furthermore, since ICA/BSS algorithms often allow at most one one-dimensional Gaussian component, it is quite important to remove all Gaussian components before applying them. A standard procedure is principal component analysis (PCA), which discards components with smaller power. However, the power criterion of PCA does not match the philosophy of ICA, which aims at extracting non-Gaussian independent sources even with low power [8]. This was one motivation for the linear dimension reduction method called non-Gaussian component analysis (NGCA) developed by one of the authors and his collaborators [9], even though NGCA can be used not only for preprocessing of ICA, but also for extracting dependent non-Gaussian sources as in independent subspace analysis (ISA, [10]).


The statistical model considered in NGCA and also in this paper is that the high-dimensional data $x \in \mathbb{R}^d$ is a sum of a low-dimensional non-Gaussian component ('signal') $s \in \mathbb{R}^m$ ($m < d$) and Gaussian noise $n \sim N(0, \Gamma)$,

$$x = A s + n, \qquad (1)$$

where $A$ is a $d \times m$ full-rank matrix indicating the non-Gaussian subspace, and $s$ and $n$ are assumed to be independent. This assumption follows the rationale that in most real-world applications the 'signal' or 'information' contained in the high-dimensional data is essentially non-Gaussian, while the 'rest' can be interpreted as high-dimensional Gaussian noise. For simplicity, in this paper we assume that the dimension $m$ of the non-Gaussian subspace is known and do not discuss how to estimate it. Although Eq. (1) is similar to undercomplete noisy ICA [5], it is considerably more general, since mutual independence between the components $s_i$ of the non-Gaussian source $s$ is not assumed here. Under this modeling assumption, the tasks are to estimate the relevant non-Gaussian subspace and to recover the low-dimensional non-Gaussian structures by linear dimension reduction. Although our goal is dimension reduction, we want to emphasize that we do not assume the Gaussian components to be of smaller order of magnitude than the signal components. This setting therefore excludes the use of common linear and non-linear dimensionality reduction methods such as PCA, Isomap [11] and local linear embedding (LLE, [12]).

As a historical remark, the dimension reduction problem of finding non-Gaussian subspaces was first treated by projection pursuit (PP) procedures [13,14]. However, these did not spread widely, because the computational power available at the time was insufficient to apply such techniques to real-world data. In the late 1990s, a similar algorithm called FastICA [15,16] was applied to extract independent sources in signal processing and has become a standard procedure for ICA/BSS. We remark that PP and FastICA in deflation mode can in principle discover any (dependent) non-Gaussian structures in data. For this purpose, a single index which measures the non-Gaussianity (or 'interestingness') of a projection direction is fixed and then optimized over all possible directions of projection; the procedure can be repeated iteratively (over directions orthogonal to the ones already found) to find a higher-dimensional projection of


the data as needed. The recently proposed NGCA [9] was built upon a general semi-parametric framework in mathematical statistics. Although its first implementation also contains FastICA updates as a building block, it can improve on standard FastICA/PP methods by taking many different non-Gaussianity indices into account at the same time when the data contain complicated non-Gaussian structure. Furthermore, as a theoretical guarantee, a stochastic bound on the estimation error is provided. Recently, an alternative implementation of NGCA using radial kernel functions has been constructed [17]. To establish a theoretical basis for NGCA procedures, we have recently shown [18] that, in terms of the NGCA model, the signal subspace is uniquely determined and separated from the noise space, a feature not a priori given or clear for PP.

Apart from FastICA/PP and its extensions, algebraic algorithms, especially joint diagonalization methods such as JADE [3] and TDSEP/SOBI [19,20], are also commonly used for ICA/BSS (cf. [21]). This work was motivated by suggestions from experts on those algebraic procedures for ICA/BSS. The goal of this paper is to construct an algebraic algorithm for extracting non-Gaussian subspaces, like FastICA/PP and NGCA. For this purpose, we propose here a linear dimension reduction procedure called joint low-rank approximation (JLA).^1 This method uses many matrices whose non-zero eigenspaces coincide with the non-Gaussian subspace under the model assumption (1). As examples, we will present two implementations of JLA, one with the fourth-order cumulant tensor and the other with Hessians of the characteristic function, which have been applied to (multidimensional) ICA [24–27]. The first version works for sub-Gaussian structures, while the second algorithm is more flexible with respect to various source distributions and more robust against outliers.

The remainder of this paper is organized as follows. After explaining the model assumptions in more detail and preparing notation in Section 2, we propose the JLA procedure in Section 3. The global consistency and the two implementations of JLA are also presented there. As a proof of concept, in Section 4 we show results of numerical experiments with the synthetic data sets used in Blanchard et al. [9]. In Section 5, we summarize our study and discuss open issues.

^1 This paper is an extended version of two conference proceedings, Kawanabe [22] and Kawanabe and Theis [23].


2. Model assumptions

2.1. Indeterminacy and a canonical form

We remark that the decomposition (1) is not unique at all. In this section, we take a closer look at our model, in order to make the following discussion clearer and to give some intuition. The essential indeterminacy of the decomposition (1) comes from the fact that we cannot separate the signal from the part of the noise lying inside the non-Gaussian subspace $\mathrm{Range}(A)$. Let us decompose the noise term $n$ into two independent parts as $n = n_1 + n_2$, where $n_1 \in \mathrm{Range}(A)$ and $n_2$ is restricted to the $(d-m)$-dimensional complementary subspace such that $\mathrm{Cov}[n_1, n_2] = 0$ holds (for the explicit construction of the noise decomposition, see Appendix A). We note that, from the definition of $\mathrm{Range}(A)$, there is a vector $\gamma \in \mathbb{R}^m$ which satisfies $n_1 = A\gamma$. In order to remove this trivial indeterminacy from our model, we redefine $\tilde{s} := s + \gamma$ as the non-Gaussian source and rewrite the decomposition as $x = A\tilde{s} + n_2$, where the noise term $n_2$ follows a $(d-m)$-dimensional degenerate Gaussian distribution. As we mentioned already, it is impossible to decompose $\tilde{s}$ into the true signal $s$ and the contaminating noise $\gamma$ in the non-Gaussian subspace without further assumptions. By changing the symbols as $A \to A_N$, $\tilde{s} \to s_N + \ell_N$ and $n_2 \to A_G s_G$, we express our model as

$$x = A_N (s_N + \ell_N) + A_G s_G, \qquad (2)$$

where $s_N$ is zero-mean and $\ell_N$ is the mean of the non-Gaussian component $\tilde{s}$, $A_G$ indicates the Gaussian subspace and $s_G$ denotes a $(d-m)$-dimensional zero-mean Gaussian random vector. Independence of $s_N$ and $s_G$ implies that the non-Gaussian subspace and the Gaussian noise components are orthogonal with respect to the metric $V^{-1}$, i.e. $A_N^\top V^{-1} A_G = 0$, where $V := \mathrm{Cov}[x]$ (the derivation is provided in Appendix B). We note that the mean $E[x]$ is equal to $A_N \ell_N$, and thus the term $A_N \ell_N$ in Eq. (2) can easily be removed by subtracting $E[x]$ from the observation $x$.

Another trivial indeterminacy still remains in model (2). Suppose that $C_1$ and $C_2$ are $m$- and $(d-m)$-dimensional square invertible matrices, respectively. It is easy to show that there exist infinitely many representations of the observation $x$ in the same formula as

$$x = (A_N C_1)(C_1^{-1} s_N + C_1^{-1} \ell_N) + (A_G C_2)(C_2^{-1} s_G). \qquad (3)$$

In other words, we cannot determine the matrices $A_N$ and $A_G$ (or equivalently, the bases of the non-Gaussian and the Gaussian subspaces) from the assumptions (2) alone. We recently proved that the decomposition (2) is unique up to this indeterminacy if the dimension $d-m$ of the Gaussian subspace is maximal, i.e. if as many Gaussian directions as possible have been removed from the signal subspace. Therefore, we can determine the non-Gaussian subspace $\mathrm{Range}(A_N)$ and the Gaussian subspace $\mathrm{Range}(A_G)$.

2.2. Dimension reduction for extracting the non-Gaussian component

Our hypothesis on high-dimensional data is that the 'signal' or 'information' is contained only in a low-dimensional subspace and that the rest is just Gaussian noise. Our goal in this paper is, therefore, to find a linear mapping from the observation $x$ to the non-Gaussian signal $s_N$, in order to project out the irrelevant Gaussian component $s_G$. Let $(B_N^\top, B_G^\top)^\top$ be the inverse matrix of $(A_N, A_G)$. Then the submatrices $B_N$ and $B_G$ extract the non-Gaussian and the Gaussian parts of the data $x$, i.e. $B_N x = s_N + \ell_N$ and $B_G x = s_G$. Because of the indeterminacy (3) we cannot determine the $m \times d$ matrix $B_N$ without extra constraints, but we can infer the subspace $\mathcal{I} := \mathrm{Range}(B_N^\top)$ spanned by the row vectors of $B_N$. We call $\mathcal{I}$ the (non-Gaussian) index space. In the following, the performance of dimension reduction procedures is measured by the estimation error of the index space $\mathcal{I}$,

$$E(\hat{\mathcal{I}}, \mathcal{I}) = \frac{1}{m}\,\|(I_d - P_{\mathcal{I}})\, P_{\hat{\mathcal{I}}}\|_{\mathrm{Fro}}^2,$$

where $\hat{\mathcal{I}}$ is the estimated index space, $P_{\hat{\mathcal{I}}}$ is the projection matrix onto $\hat{\mathcal{I}}$, the operator $(I_d - P_{\mathcal{I}})$ is the projection onto the orthogonal complement of the index space $\mathcal{I}$, and $\|\cdot\|_{\mathrm{Fro}}$ is the Frobenius norm. Further explanation of the error criterion $E$ can be found after Eq. (17). The left panel of Fig. 1 illustrates the relation between the three subspaces $\mathcal{I} = \mathrm{Range}(B_N^\top)$, $\mathrm{Range}(A_N)$ and $\mathrm{Range}(A_G)$. Once the index space $\mathcal{I}$ is derived, the other subspaces can also be determined because of the orthogonality relations

$$B_N A_G = 0, \qquad B_G A_N = 0, \qquad A_N^\top V^{-1} A_G = 0, \qquad B_N V B_G^\top = 0.$$
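To make the error measure concrete, the following short NumPy sketch (our own illustration, not code from the paper; the helper name `subspace_error` is ours) evaluates $E(\hat{\mathcal{I}}, \mathcal{I})$ from arbitrary bases of the true and estimated index spaces.

```python
import numpy as np

def subspace_error(B_true, B_est):
    """E(I_hat, I) = (1/m) * ||(I_d - P_I) P_Ihat||_Fro^2 for two index spaces.

    B_true, B_est: (d, m) matrices whose columns span the true and the
    estimated index space (they need not be orthonormal).
    """
    d, m = B_true.shape
    Q_true, _ = np.linalg.qr(B_true)   # orthonormal basis of the true space
    Q_est, _ = np.linalg.qr(B_est)     # orthonormal basis of the estimate
    P_true = Q_true @ Q_true.T         # orthogonal projector onto I
    P_est = Q_est @ Q_est.T            # orthogonal projector onto I_hat
    return np.linalg.norm((np.eye(d) - P_true) @ P_est, 'fro') ** 2 / m

# Identical subspaces give error 0; completely orthogonal ones give error 1.
B = np.random.randn(10, 2)
print(subspace_error(B, B))            # approximately 0
```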


Fig. 1. Illustration of the non-Gaussian subspace and the index space in the original space (left) and after whitening (right) (see Section 3.2). The matrices $B_N^\top$, $A_N$, $A_G$ and $W_N^\top$ in the plots should be read as $\mathcal{I} = \mathrm{Range}(B_N^\top)$, $\mathrm{Range}(A_N)$, $\mathrm{Range}(A_G)$ and $\mathrm{Range}(W_N^\top)$, respectively.

Finally, we remark that model (2) is also semi-parametric, as is ICA [28], because we do not fix the density of the non-Gaussian component $s_N$. From the independence of $s_N$ and $s_G$, the density function of $x$ can be expressed as a product of the non-Gaussian and the Gaussian parts,

$$p(x) = g(B_N x)\, f_{\Lambda}(B_G x),$$

where $g$ is an unknown function describing the density of $s_N$ and $f_{\Lambda}$ is the $(d-m)$-dimensional Gaussian density with mean $0$ and covariance $\Lambda$.

3. JLA method

3.1. Joint diagonalization

Joint diagonalization methods [3,19,20] provide the basis for some of the most popular ICA algorithms for extracting independent components $s = (s_1, \ldots, s_d)^\top$ from the observation $x = As$, where $A$ is a $d \times d$ matrix. Although there exist algorithms without a pre-whitening step [29], the whitening transformation $z = V^{-1/2}(x - E[x])$ is often employed as preprocessing, where $V = \mathrm{Cov}[x]$. Without loss of generality, we can assume $\mathrm{Cov}[s] = I_d$, and then the mixing model can be rewritten as $z = W^\top s$, where $W$ is an orthogonal matrix (because $I = \mathrm{Cov}[s] = W\,\mathrm{Cov}[z]\,W^\top = W W^\top$) which extracts the independent components $s$ from the whitened data $z$. Once the orthogonal matrix $W$ is estimated, the entire demixing matrix from the original data $x$ can be calculated as $B = W V^{-1/2}$.

In order to infer the demixing matrix $W$ after whitening, joint diagonalization algorithms use a set of matrices $M_1, \ldots, M_K$ which can be simultaneously diagonalized by an orthogonal matrix $W$, i.e.

$$\mathrm{off}(W M_k W^\top) := W M_k W^\top - \mathrm{diag}(W M_k W^\top) = 0 \quad \forall k = 1, \ldots, K.$$

For example, in the simplest version of JADE we use the $d^2$ matrices $Q^{(kl)}$ $(k, l = 1, \ldots, d)$ whose $(i,j)$th element is defined by the fourth-order cumulant tensor

$$Q^{(kl)}_{ij} = \mathrm{cum}(z_i, z_j, z_k, z_l) := E[z_i z_j z_k z_l] - E[z_i z_j]E[z_k z_l] - E[z_i z_k]E[z_j z_l] - E[z_i z_l]E[z_j z_k]. \qquad (4)$$

Because the sources $s_1, \ldots, s_d$ are mutually independent, the cumulant $\mathrm{cum}(s_i, s_j, s_k, s_l)$ is non-zero only if $i = j = k = l$. Therefore, the off-diagonal elements of $W Q^{(kl)} W^\top$ vanish, because

$$(W Q^{(kl)} W^\top)_{ij} = \mathrm{cum}(s_i, s_j, z_k, z_l) = 0, \quad i \ne j,\ \forall k, l.$$

This guarantees that the matrices $Q^{(kl)}$ are simultaneously diagonalizable. Usually, instead of the matrices $M_1, \ldots, M_K$ which are exactly jointly diagonalizable, we can only access their estimators $\hat{M}_1, \ldots, \hat{M}_K$, which can only be approximately jointly diagonalized. Then the estimator $\hat{W}$ of the demixing matrix can be obtained by minimizing the contrast

$$\sum_{k=1}^K \|\mathrm{off}(W \hat{M}_k W^\top)\|_{\mathrm{Fro}}^2$$


with respect to the orthogonal matrix $W$. As is the case for JADE, this is sometimes equivalent to maximizing

$$\sum_{k=1}^K \|\mathrm{diag}(W \hat{M}_k W^\top)\|_{\mathrm{Fro}}^2.$$
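For illustration, a minimal sketch of these two contrasts is given below (our own code with hypothetical helper names, not the JADE implementation); for exactly jointly diagonalizable matrices the off-diagonal contrast vanishes at the true orthogonal transform.

```python
import numpy as np

def off_contrast(W, M_list):
    """sum_k ||off(W M_k W^T)||_Fro^2 -- the quantity minimized over orthogonal W."""
    total = 0.0
    for M in M_list:
        T = W @ M @ W.T
        total += np.linalg.norm(T - np.diag(np.diag(T)), 'fro') ** 2
    return total

def diag_contrast(W, M_list):
    """sum_k ||diag(W M_k W^T)||_Fro^2 -- the equivalent maximization criterion."""
    return sum(np.linalg.norm(np.diag(W @ M @ W.T)) ** 2 for M in M_list)

# Exactly jointly diagonalizable matrices M_k = W0^T D_k W0: the true W0
# drives the off-diagonal contrast to (numerical) zero.
rng = np.random.default_rng(0)
d = 4
W0, _ = np.linalg.qr(rng.standard_normal((d, d)))
M_list = [W0.T @ np.diag(rng.standard_normal(d)) @ W0 for _ in range(5)]
print(off_contrast(W0, M_list))        # approximately 0
```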

3.2. JLA of matrices

Motivated by the joint diagonalization algorithms for ICA explained in the previous section, we propose the JLA method for extracting the non-Gaussian subspace under the model assumption (2). We again adopt the whitening transformation $z = V^{-1/2}(x - E[x])$ as preprocessing. Let us define the matrices

$$W_N := B_N V^{1/2}, \qquad W_G := B_G V^{1/2},$$

which are the linear transformations from the whitened data $z$ to the non-Gaussian and Gaussian components $s = (s_N^\top, s_G^\top)^\top$. We remark that the non-Gaussian index space can be expressed as

$$\mathcal{I} = \mathrm{Range}(B_N^\top) = V^{-1/2}\,\mathrm{Range}(W_N^\top),$$

and therefore it is enough to estimate the $m \times d$ matrix $W_N$ (the situation after whitening is illustrated in the right panel of Fig. 1). Without loss of generality, we can assume that $\mathrm{Cov}[s] = I$. Then $(W_N^\top, W_G^\top)^\top$ becomes an orthogonal matrix.

Let us start with the asymptotic situation ignoring estimation errors. Suppose that $K$ complex matrices $M_1, \ldots, M_K$ can be simultaneously transformed into

$$W^o M_k (W^o)^\top = \begin{pmatrix} * & 0 \\ 0 & 0 \end{pmatrix}, \qquad k = 1, \ldots, K, \qquad (5)$$

that is, all components of all the transformed matrices vanish except for those in the $m \times m$ submatrices indicated by $*$, where $W^o$ is a $d$-dimensional orthogonal matrix. Examples of such matrices will be given in the following sections. Let $W_N^o$ be the $m \times d$ matrix composed of the first $m$ rows of $W^o$. We remark that $W_N^o (W_N^o)^\top = I_m$. The goal here is to estimate the mapping $W_N^o$, which we call the reduction matrix.

In order to estimate the reduction matrix $W_N^o$, let us consider the contrast function

$$L(W_N) = \sum_{k=1}^K \|W_N M_k W_N^\top\|_{\mathrm{Fro}}^2, \qquad (6)$$

where the Frobenius norm is $\|C\|_{\mathrm{Fro}}^2 = \mathrm{tr}(C C^*)$ in the complex case (with $C^*$ the conjugate transpose). Indeed, we can show that the desired mapping $W_N^o$ can be obtained up to an orthogonal matrix by maximizing the contrast function $L(W_N)$.

Theorem 1 (Global consistency of JLA). The objective function $L(W_N)$ is maximal at $W_N = U W_N^o$, where $W_N^o$ is the first $m \times d$ submatrix of $W^o$ defined by Eq. (5) and $U$ is an $m$-dimensional orthogonal matrix.

Proof. We remark that the Frobenius norm $\|W M_k W^\top\|_{\mathrm{Fro}}^2$ is unchanged for any orthogonal matrix $W$, i.e.

$$\|W M_k W^\top\|_{\mathrm{Fro}}^2 = \|M_k\|_{\mathrm{Fro}}^2 = \|W^o M_k (W^o)^\top\|_{\mathrm{Fro}}^2.$$

From property (5) of the matrix $W^o$, we get

$$\|W^o M_k (W^o)^\top\|_{\mathrm{Fro}}^2 = \left\| \begin{pmatrix} W_N^o M_k (W_N^o)^\top & 0 \\ 0 & 0 \end{pmatrix} \right\|_{\mathrm{Fro}}^2 = \|W_N^o M_k (W_N^o)^\top\|_{\mathrm{Fro}}^2,$$

where we divided $W^o$ into the two submatrices $W_N^o$ and $W_G^o$. On the other hand, for a general orthogonal matrix $W$,

$$\|W M_k W^\top\|_{\mathrm{Fro}}^2 = \left\| \begin{pmatrix} W_N M_k W_N^\top & W_N M_k W_G^\top \\ W_G M_k W_N^\top & W_G M_k W_G^\top \end{pmatrix} \right\|_{\mathrm{Fro}}^2 = \|W_N M_k W_N^\top\|_{\mathrm{Fro}}^2 + \|W_N M_k W_G^\top\|_{\mathrm{Fro}}^2 + \|W_G M_k W_N^\top\|_{\mathrm{Fro}}^2 + \|W_G M_k W_G^\top\|_{\mathrm{Fro}}^2 \ge \|W_N M_k W_N^\top\|_{\mathrm{Fro}}^2,$$

where $W$ was also divided into the two submatrices $W_N$ and $W_G$. Since the inequality is strict if and only if one of the last three terms is non-zero, all global maxima have already been found. Therefore, for every orthogonal matrix $U$, we finally get

$$\|U W_N^o M_k (W_N^o)^\top U^\top\|_{\mathrm{Fro}}^2 = \|W_N^o M_k (W_N^o)^\top\|_{\mathrm{Fro}}^2 \ge \|W_N M_k W_N^\top\|_{\mathrm{Fro}}^2. \qquad \square$$
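A quick numerical sanity check of Theorem 1 (our own sketch, not part of the original paper): for matrices of the form (5), the contrast (6) evaluated at the reduction matrix $W_N^o$ is never smaller than its value at a random row-orthonormal $W_N$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, K = 6, 2, 8

# Build matrices of the form (5): only the leading m x m block of
# W^o M_k (W^o)^T is non-zero, with W^o a random orthogonal matrix.
W0, _ = np.linalg.qr(rng.standard_normal((d, d)))
M_list = []
for _ in range(K):
    block = np.zeros((d, d))
    block[:m, :m] = rng.standard_normal((m, m))
    M_list.append(W0.T @ block @ W0)

def contrast(WN, Ms):
    """L(W_N) = sum_k ||W_N M_k W_N^T||_Fro^2, Eq. (6)."""
    return sum(np.linalg.norm(WN @ M @ WN.T, 'fro') ** 2 for M in Ms)

WN_opt = W0[:m, :]                      # the reduction matrix W_N^o
Q, _ = np.linalg.qr(rng.standard_normal((d, m)))
WN_rand = Q.T                           # random m x d row-orthonormal matrix

print(contrast(WN_opt, M_list) >= contrast(WN_rand, M_list))   # True
```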

In order to construct optimization algorithms, let us calculate the derivative of the contrast $L$ with respect to $W_N$ under the constraint $W_N W_N^\top = I_m$. In order to make formula (8) of the proposed algorithm simpler, we additionally assume that $M_k^\top = M_k$. For the examples which we will use, this condition is satisfied. In the general case, symmetrization by setting $(M_k + M_k^\top)/2 \to M_k$ also yields this situation, so this does not restrict the generality of JLA. We remark that Theorem 1 holds without the symmetry of the matrices $M_k$. The optimal reduction matrix should satisfy the equation

$$W_N \sum_{k=1}^K \mathcal{M}_k(W_N) = \Lambda W_N, \qquad (7)$$

where

$$\mathcal{M}_k(W_N) := M_k^*\, W_N^\top W_N\, M_k + M_k\, W_N^\top W_N\, M_k^* \qquad (8)$$

is a $d \times d$ matrix depending on $W_N$ and the symmetric matrix $\Lambda$ denotes the Lagrange multipliers. Because Eq. (7) is reminiscent of an eigenvalue decomposition, we propose the following procedure of iterative eigenvalue decomposition. Suppose that $W_N^{(r)}$ is the current guess of the reduction matrix at the $r$th step. By substituting the matrices $\mathcal{M}_1(W_N^{(r)}), \ldots, \mathcal{M}_K(W_N^{(r)})$ at the current estimator $W_N^{(r)}$, Eq. (7) turns into

$$W_N^{(r+1)} \sum_{k=1}^K \mathcal{M}_k(W_N^{(r)}) = \Lambda W_N^{(r+1)}. \qquad (9)$$

Without loss of generality, we can assume that the matrix $\Lambda$ of Lagrange multipliers is diagonal. In fact, if $W_N^{(r+1)}$ is a solution of Eq. (9), so is $U W_N^{(r+1)}$ for any $m \times m$ orthogonal matrix $U$. Thus, we can always find $U$ which diagonalizes $\Lambda$. Then the next estimator $W_N^{(r+1)}$ can be obtained by computing the eigenvectors of $\sum_{k=1}^K \mathcal{M}_k(W_N^{(r)})$ corresponding to the $m$ largest eigenvalues. We repeat this update by eigenvalue decomposition until convergence.

Usually, the true matrices $M_1, \ldots, M_K$ which possess property (5) exactly are not available. Instead, we can only access their estimators $\hat{M}_1, \ldots, \hat{M}_K$, which satisfy Eq. (5) approximately. Then we have to replace the matrices $\mathcal{M}_k(W_N)$ in Eq. (9) with

$$\widehat{\mathcal{M}}_k(W_N) := \hat{M}_k^*\, W_N^\top W_N\, \hat{M}_k + \hat{M}_k\, W_N^\top W_N\, \hat{M}_k^*$$

and solve iteratively the eigenvalue problem

$$W_N^{(r+1)} \sum_{k=1}^K \widehat{\mathcal{M}}_k(W_N^{(r)}) = \Lambda W_N^{(r+1)}. \qquad (10)$$

The computational complexities of calculating $\widehat{\mathcal{M}}_k$ and of deriving the $m$ eigenvectors of Eq. (10) are both $O(d^2 m)$ per step.
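As a concrete reading of Eqs. (8)–(10), here is a minimal NumPy sketch of the iterative eigenvalue decomposition (our own code, written under the symmetry assumption $M_k^\top = M_k$ stated above; the function names are ours, not the authors'). Given estimated matrices $\hat{M}_k$, `jla(M_hats, m)` returns a row-orthonormal estimate of $W_N$.

```python
import numpy as np

def jla_update(WN, M_hats):
    """One iteration of Eq. (10): build sum_k Mhat_k(W_N) from Eq. (8) and
    return the eigenvectors of the m largest eigenvalues as the new W_N."""
    d, m = M_hats[0].shape[0], WN.shape[0]
    P = WN.T @ WN                     # W_N^T W_N, an orthogonal rank-m projector
    S = np.zeros((d, d), dtype=complex)
    for M in M_hats:
        S += np.conj(M) @ P @ M + M @ P @ np.conj(M)
    S = S.real                        # imaginary parts cancel for symmetric M_k
    eigval, eigvec = np.linalg.eigh(S)
    return eigvec[:, np.argsort(eigval)[-m:]].T

def jla(M_hats, m, n_iter=30):
    """Iterative eigenvalue decomposition for the JLA reduction matrix W_N."""
    # Initialisation: leading eigenvectors (by absolute eigenvalue) of sum_k Re(Mhat_k).
    eigval, eigvec = np.linalg.eigh(sum(np.real(M) for M in M_hats))
    WN = eigvec[:, np.argsort(np.abs(eigval))[-m:]].T
    for _ in range(n_iter):
        WN = jla_update(WN, M_hats)
    return WN
```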


3.3. Dimension reduction by using the fourth-order cumulant tensor

We propose two implementations of the JLA procedure for linear dimension reduction in the following sections. One uses the fourth-order cumulant tensors defined in Eq. (4), while the other employs Hessians of the characteristic function. We expect that further JLA algorithms, based on the matrices appearing in other joint diagonalization methods for ICA, can be constructed simply by modifying their contrast functions in the same manner.

The first implementation, which we call JLA4, is inspired by the JADE algorithm [3] and uses the same $d^2$ matrices $Q^{(kl)}$ $(k, l = 1, \ldots, d)$ as the $M_k$'s, where $Q^{(kl)}_{ij} = \mathrm{cum}(z_i, z_j, z_k, z_l)$. Under our model assumption, the whitened data $z$ is represented as

$$z = (W^o)^\top s = ((W_N^o)^\top, (W_G^o)^\top) \begin{pmatrix} s_N \\ s_G \end{pmatrix},$$

and the sources $s$ are divided into the $m$-dimensional non-Gaussian component $s_N = (s_1, \ldots, s_m)$ and the $(d-m)$-dimensional Gaussian part, which is independent of $s_N$. As is the case for JADE, the fourth-order cumulant tensors of the sources $s$ have a very simple structure.

Lemma 2. Under assumption (2), the fourth-order cumulant tensor $\mathrm{cum}(s_i, s_j, s_k, s_l)$ can take a non-zero value only if $1 \le i, j, k, l \le m$ (i.e. all components belong to the non-Gaussian part).

From Lemma 2, we can show that the matrices $Q^{(kl)}$ satisfy property (5). In fact, $\{W^o Q^{(kl)} (W^o)^\top\}_{ij} = \mathrm{cum}(s_i, s_j, z_k, z_l)$ becomes $0$ if $i > m$ or $j > m$. Thus,

$$W^o Q^{(kl)} (W^o)^\top = \begin{pmatrix} * & 0 \\ 0 & 0 \end{pmatrix}$$

holds for all $(k, l)$, that is, all components other than those contained in the $m \times m$ submatrix $*$ vanish after the similarity transformation by $W^o$. According to the general derivation of the JLA procedure, the transformation $W_N^o$ to the non-Gaussian component $s_N$ can be estimated by maximizing the Frobenius norms of the $m \times m$ submatrices corresponding to the non-Gaussian subspace,

$$L(W_N) = \sum_{k,l=1}^d \|W_N Q^{(kl)} W_N^\top\|_{\mathrm{Fro}}^2 = \sum_{k,l=1}^d \sum_{i',j'=1}^m \mathrm{cum}(y_{i'}, y_{j'}, z_k, z_l)^2 \qquad (11)$$

with respect to $W_N$ such that $W_N W_N^\top = I_m$, where $y_N = (y_1, \ldots, y_m)^\top = W_N z$ denotes the non-Gaussian component reconstructed by $W_N$. We remark that this contrast function (11) is a squared sum of cumulant tensors, like that of JADE,

$$L_{\mathrm{JADE}}(W) = \sum_{k,l=1}^d \sum_{i=1}^d \mathrm{cum}(y_i, y_i, y_k, y_l)^2,$$

where $y = W z \in \mathbb{R}^d$ is a linear transformation of the whitened data $z$ by an orthogonal matrix $W$. However, the first two indices of the cumulant tensors range over different sets: in (11) we have $1 \le i' \le m$, $1 \le j' \le m$, whereas in $L_{\mathrm{JADE}}$ we have $i = j$.

In actual data analysis, we know neither the covariance matrix $V = \mathrm{Cov}[x]$ nor the fourth-order cumulant tensor $\mathrm{cum}(z_i, z_j, z_k, z_l)$. We have to replace all unknown quantities with their empirical counterparts, for example,

$$\hat{E}[x] := \frac{1}{T}\sum_{t=1}^T x(t), \qquad \widehat{\mathrm{Cov}}[x] := \frac{1}{T-1}\sum_{t=1}^T \bigl(x(t) - \hat{E}[x]\bigr)\bigl(x(t) - \hat{E}[x]\bigr)^\top,$$

$$\widehat{\mathrm{cum}}(z_i, z_j, z_k, z_l) := \hat{E}[z_i z_j z_k z_l] - \hat{E}[z_i z_j]\hat{E}[z_k z_l] - \hat{E}[z_i z_k]\hat{E}[z_j z_l] - \hat{E}[z_i z_l]\hat{E}[z_j z_k].$$

We remark that such empirical moments and cumulants usually differ from the true values by terms of order $n^{-1/2}$. This deviation affects the accuracy of the non-Gaussian subspace estimated by JLA4. For example, it is known that the asymptotic variance of the empirical fourth-order cumulants depends on the eighth-order moments of the random variables, which can be quite large for super-Gaussian data. This is the reason why cumulant-based methods are not robust against outliers. Although it might be possible to calculate the asymptotic variance of the estimated subspace in the standard manner ([28,30,31], etc.), such asymptotic analysis is beyond the scope of this paper.

In the JLA4 algorithm, at first the empirical whitening transformation is applied:

$$\hat{z}(t) = \hat{V}^{-1/2}\{x(t) - \hat{E}[x]\}, \qquad \hat{V} = \widehat{\mathrm{Cov}}[x].$$

Then the empirical cumulant tensors $\widehat{\mathrm{cum}}(\hat{z}_i, \hat{z}_j, \hat{z}_k, \hat{z}_l)$ of the whitened data $\hat{z}$ are calculated and the matrices $\hat{Q}^{(kl)}$ whose $(i,j)$th components are

$$\hat{Q}^{(kl)}_{ij} = \widehat{\mathrm{cum}}(\hat{z}_i, \hat{z}_j, \hat{z}_k, \hat{z}_l), \qquad \forall k, l = 1, \ldots, d,$$

are prepared. We summarize the entire procedure of JLA4 in the following table.

Algorithm (JLA4).

(1) Apply the whitening transformation $\hat{z}(t) = \hat{V}^{-1/2}(x(t) - \hat{E}[x])$ to the data $\{x(t)\}_{t=1}^T$, where $\hat{V} = \widehat{\mathrm{Cov}}[x]$.
(2) Calculate the fourth-order cumulant tensor $\widehat{\mathrm{cum}}(\hat{z}_i, \hat{z}_j, \hat{z}_k, \hat{z}_l)$ from the whitened data $\{\hat{z}(t)\}_{t=1}^T$.
(3) Compute the $m$ eigenvectors with largest absolute eigenvalues to get an initial estimator $W_N^{(0)}$:
$$W_N^{(0)} \sum_{k,l=1}^d \hat{Q}^{(kl)} = \Lambda W_N^{(0)}.$$
(4) Solve the following eigenvalue problem until the matrix $W_N^{(r)}$ converges:
$$W_N^{(r+1)} \sum_{k,l=1}^d \hat{Q}^{(kl)} \{W_N^{(r)}\}^\top W_N^{(r)} \hat{Q}^{(kl)} = \Lambda W_N^{(r+1)}.$$
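For concreteness, steps (1)–(2) might be coded as follows (a sketch of ours; `whiten` and `cumulant_matrices` are hypothetical helper names, and the resulting list of $\hat{Q}^{(kl)}$ can be passed to the iterative update sketched after Eq. (10)). The estimated index space is then spanned by the columns of $\hat{V}^{-1/2} W_N^\top$, since $\mathcal{I} = V^{-1/2}\,\mathrm{Range}(W_N^\top)$.

```python
import numpy as np

def whiten(X):
    """Empirical whitening z(t) = V^{-1/2}(x(t) - mean), for X of shape (T, d)."""
    Xc = X - X.mean(axis=0)
    V = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(V)
    V_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return Xc @ V_inv_sqrt, V_inv_sqrt

def cumulant_matrices(Z):
    """All d^2 matrices Qhat^{(kl)} with entries cumhat(z_i, z_j, z_k, z_l), Eq. (4)."""
    T, d = Z.shape
    M2 = Z.T @ Z / T                                      # empirical E[z_i z_j]
    M4 = np.einsum('ti,tj,tk,tl->ijkl', Z, Z, Z, Z) / T   # empirical E[z_i z_j z_k z_l]
    C4 = (M4
          - np.einsum('ij,kl->ijkl', M2, M2)
          - np.einsum('ik,jl->ijkl', M2, M2)
          - np.einsum('il,jk->ijkl', M2, M2))
    return [C4[:, :, k, l] for k in range(d) for l in range(d)]
```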

The algorithm works well for sub-Gaussian structures. However, due to outliers it performs worse when the data contain heavy-tailed structures. In the next section, we introduce another version of the JLA implementation, based on Hessians of the characteristic function, which gives more robust and efficient results in most cases.

3.4. Dimension reduction by using characteristic functions

In Theis [25,26], Hessians of the characteristic function were used for (multidimensional) ICA (see also [24,27]). Since they satisfy property (5) under our model assumption, as we will show, they can also be used as the matrices $M_k$'s for JLA.


The characteristic function of the random variable $z$ can be defined as

$$\hat{Z}(f) := E[\exp(\mathrm{i}\, f^\top z)],$$

where the hat in $\hat{Z}(f)$ has nothing to do with empirical estimates but instead indicates that it is a Fourier transform of the density of $z$. As in the preceding sections, let $W^o = ((W_N^o)^\top, (W_G^o)^\top)^\top$ be an orthogonal matrix which maps the whitened data $z$ to the components $s = (s_N^\top, s_G^\top)^\top$, i.e.

$$s = W^o z, \qquad \begin{pmatrix} s_N \\ s_G \end{pmatrix} = \begin{pmatrix} W_N^o \\ W_G^o \end{pmatrix} z.$$

Then the characteristic function can be expressed as

$$\hat{Z}(f) = \hat{S}(W^o f) = \hat{S}_N(W_N^o f)\,\exp\!\left(-\tfrac{1}{2}\|W_G^o f\|^2\right), \qquad (12)$$

where $\hat{S}$ and $\hat{S}_N$ are the characteristic functions of $s$ and $s_N$, respectively. Eq. (12) means that the characteristic function of $z$ is the product of those of the non-Gaussian component $s_N$ and the Gaussian noise $s_G$. In other words, at $f$ where $\log \hat{Z}(f)$ exists, it becomes the sum of the logarithms of these two parts,

$$\log \hat{Z}(f) = \log \hat{S}_N(W_N^o f) - \tfrac{1}{2}\|W_G^o f\|^2.$$

Therefore, the Hessian of $\log \hat{Z}(f)$ can be calculated as

$$H_{\log \hat{Z}}(f) := \frac{\partial^2}{\partial f\, \partial f^\top}\log \hat{Z}(f) = (W^o)^\top \begin{pmatrix} \dfrac{\partial^2}{\partial \nu_N\, \partial \nu_N^\top}\log \hat{S}_N(W_N^o f) & 0 \\ 0 & -I_{d-m} \end{pmatrix} W^o, \qquad (13)$$

where $\nu_N = W_N^o f$. From Eq. (13), it is easy to see that the matrix $H_{\log \hat{Z}}(f_k) + I_d$ satisfies property (5) for any $f$ such that $\hat{S}_N(W_N^o f) \ne 0$. In order to weaken the contribution of unreliable matrices with small power, we multiply the matrix $H_{\log \hat{Z}}(f_k) + I_d$ by $|\hat{Z}(f_k)|^2$. Hence, in our implementation of the new JLA algorithm, we exploit

$$M_k := |\hat{Z}(f_k)|^2\,\{H_{\log \hat{Z}}(f_k) + I_d\}, \qquad k = 1, \ldots, K, \qquad (14)$$

for appropriately selected $K$ vectors $f_1, \ldots, f_K \in \mathbb{R}^d$. The matrices $M_k$ in Eq. (14) must be estimated from the whitened data $\hat{z}(t) := \hat{V}^{-1/2}(x(t) - \hat{E}[x])$, where $\hat{E}[x]$ and $\hat{V}$ are the empirical average and covariance, respectively. Since the empirical characteristic function of $\hat{z}$ is defined as

$$\hat{Z}_{\mathrm{emp}}(f) = \frac{1}{T}\sum_{t=1}^T \exp\{\mathrm{i}\, f^\top \hat{z}(t)\},$$

the Hessian $H_{\log \hat{Z}_{\mathrm{emp}}}(f)$ of $\log \hat{Z}_{\mathrm{emp}}(f)$ can be calculated by differentiating it:

$$H_{\log \hat{Z}_{\mathrm{emp}}}(f) = \frac{\bigl(\frac{1}{T}\sum_{t=1}^T \hat{z}(t)\exp\{\mathrm{i} f^\top \hat{z}(t)\}\bigr)\bigl(\frac{1}{T}\sum_{t=1}^T \hat{z}^\top(t)\exp\{\mathrm{i} f^\top \hat{z}(t)\}\bigr)}{\{\hat{Z}_{\mathrm{emp}}(f)\}^2} - \frac{\frac{1}{T}\sum_{t=1}^T \hat{z}(t)\hat{z}^\top(t)\exp\{\mathrm{i} f^\top \hat{z}(t)\}}{\hat{Z}_{\mathrm{emp}}(f)}.$$

Thus, we use the estimated matrices

$$\hat{M}_k := |\hat{Z}_{\mathrm{emp}}(f_k)|^2\,\{H_{\log \hat{Z}_{\mathrm{emp}}}(f_k) + I_d\}, \qquad k = 1, \ldots, K, \qquad (15)$$

for the JLA procedure. Even if we simply generate a set of $K$ random vectors for the parameters $\{f_k\}$, the JLA algorithm with Hessians of the characteristic function returns satisfactory results [23]. In order to improve the performance further, we combine it with an active selection of the vectors $f$, as explained in the following. In the initial state, $K$ different vectors $f_1, \ldots, f_K$ are generated randomly such that their norms $\|f_k\|$ are equispaced in $[0.3, 3]$ and their directions $f_k/\|f_k\|$ are distributed uniformly. In a single step of our selection scheme, after computing the matrices $\hat{M}_1, \ldots, \hat{M}_K$ for the current parameters $\{f_1, \ldots, f_K\}$, we solve the eigenvalue problem

$$W_N^{(0)} \sum_{k=1}^K \{\mathrm{Re}(\hat{M}_k) + \mathrm{Im}(\hat{M}_k)\} = \Lambda W_N^{(0)}$$

to get an initial estimator $W_N^{(0)}$, where $\mathrm{Re}(\hat{M}_k)$ and $\mathrm{Im}(\hat{M}_k)$ are the real and imaginary parts of $\hat{M}_k$. In order to evaluate the goodness of each matrix $\hat{M}_k$, we employ the product of the power $\|W_N^{(0)} \hat{M}_k (W_N^{(0)})^\top\|_{\mathrm{Fro}}^2$ in the estimated non-Gaussian subspace and its ratio $\|W_N^{(0)} \hat{M}_k (W_N^{(0)})^\top\|_{\mathrm{Fro}}^2 / \|\hat{M}_k\|_{\mathrm{Fro}}^2$ to the entire power, i.e.

$$C_k := \frac{\|W_N^{(0)} \hat{M}_k (W_N^{(0)})^\top\|_{\mathrm{Fro}}^4}{\|\hat{M}_k\|_{\mathrm{Fro}}^2}, \qquad (16)$$

because the initial estimator $W_N^{(0)}$ is usually reliable. Then we discard uninformative matrices whose evaluation criterion $C_k$ is smaller than 1% of its maximum. Let $r_{\min}$ and $r_{\max}$ be the minimum and maximum norms of the vectors $f_k$ corresponding to the remaining matrices, respectively. In order to make the new set of vectors, we first generate $K$ random vectors from the Gaussian $N(0, \Omega)$, where $\Omega$ is basically equal to $\sum_{\mathrm{remaining}} \{\mathrm{Re}(\hat{M}_k) + \mathrm{Im}(\hat{M}_k)\}$, except that its eigenvalues are flipped if they are negative. The obtained vectors are more concentrated near the non-Gaussian subspace. Then the norms of the vectors $\{f_k\}$ are rescaled such that they are equispaced in the interval $[r_{\min}, r_{\max}]$. We repeat this regeneration step of the parameter vectors $\{f_k\}$ twice.
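A minimal sketch (ours, with hypothetical function names) of how the matrices $\hat{M}_k$ of Eqs. (14)–(15) can be formed from the whitened data and randomly drawn frequency vectors:

```python
import numpy as np

def charfun_matrix(Z, f):
    """Mhat(f) = |Zhat_emp(f)|^2 * (Hessian of log Zhat_emp at f + I_d), Eqs. (14)-(15)."""
    T, d = Z.shape
    e = np.exp(1j * (Z @ f))              # exp{i f^T z(t)} for every sample
    Z_emp = e.mean()                      # empirical characteristic function
    g = (Z * e[:, None]).mean(axis=0)     # (1/T) sum_t z(t) exp{i f^T z(t)}
    S = (Z.T * e) @ Z / T                 # (1/T) sum_t z(t) z(t)^T exp{i f^T z(t)}
    H = np.outer(g, g) / Z_emp**2 - S / Z_emp   # Hessian of log Zhat_emp(f)
    return np.abs(Z_emp)**2 * (H + np.eye(d))

def initial_frequency_vectors(d, K=1000, seed=0):
    """K random vectors with norms equispaced in [0.3, 3] and uniform directions."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((K, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return np.linspace(0.3, 3.0, K)[:, None] * dirs

# Usage (Z is the whitened data matrix of shape (T, d)):
# M_hats = [charfun_matrix(Z, f) for f in initial_frequency_vectors(Z.shape[1])]
```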

We summarize in the following table the JLA algorithm with Hessians of the characteristic function, which we call JLAH.

Algorithm (JLAH).

(1) Apply the whitening transformation $\hat{z}(t) = \hat{V}^{-1/2}(x(t) - \hat{E}[x])$ to the data $\{x(t)\}_{t=1}^T$, where $\hat{V} = \widehat{\mathrm{Cov}}[x]$.
(2) For the current set of vectors $\{f_k\}$, calculate the matrices $\hat{M}_k := |\hat{Z}_{\mathrm{emp}}(f_k)|^2\{H_{\log \hat{Z}_{\mathrm{emp}}}(f_k) + I_d\}$ from the empirical characteristic function $\hat{Z}_{\mathrm{emp}}(f)$ of the whitened data $\{\hat{z}(t)\}_{t=1}^T$.
(3) Compute the $m$ eigenvectors with largest absolute eigenvalues,
$$W_N^{(0)} \sum_{k=1}^K \{\mathrm{Re}(\hat{M}_k) + \mathrm{Im}(\hat{M}_k)\} = \Lambda W_N^{(0)},$$
to get an initial estimator $W_N^{(0)}$, where $\mathrm{Re}(\hat{M}_k)$ and $\mathrm{Im}(\hat{M}_k)$ are the real and imaginary parts of $\hat{M}_k$.
(4) Calculate the evaluation criterion $C_k$ defined in Eq. (16) for each matrix $\hat{M}_k$.
(5) If in the final estimation step, take the best 50% of the matrices $\{\hat{M}_k\}$. If in the selection step (repeated twice), generate the next set of parameters $\{f_k\}$ as explained in the text and go back to (2).
(6) Solve the following eigenvalue problem until the matrix $W_N^{(r)}$ converges:
$$W_N^{(r+1)} \sum_{\mathrm{remaining}} \widehat{\mathcal{M}}_k(W_N^{(r)}) = \Lambda W_N^{(r+1)}.$$

There are two advantages of the new JLA method over the one based on the fourth-order cumulant.

- Because estimators of higher-order moments and cumulants are very sensitive to outliers, JLA4 is not reliable for heavy-tailed distributions. JLAH is more robust against outliers.
- Characteristic functions can in principle detect all kinds of non-Gaussianity if we check a sufficient number of frequency vectors $f$, while the fourth-order cumulant measures only one aspect of non-Gaussianity. Thus, JLAH is more flexible with respect to various kinds of non-Gaussian structures than JLA4. It should be mentioned, however, that estimation of the second characteristic function can also be sensitive to outliers.

4. Numerical results

4.1. Artificial data

To test our JLA algorithms, we performed numerical experiments using the synthetic data sets of Blanchard et al. [9]. Each data set includes 1000 samples in 10 dimensions. Each sample consists of an eight-dimensional independent standard Gaussian part and two non-Gaussian components as follows.

(A) Simple: Two-dimensional independent Gaussian mixtures, with the density of each component given by $\tfrac{1}{2}\varphi_{-3,1}(x) + \tfrac{1}{2}\varphi_{3,1}(x)$.
(B) Dependent super-Gaussian: Two-dimensional isotropic distribution with density proportional to $\exp(-\|x\|)$.
(C) Dependent sub-Gaussian: Two-dimensional isotropic uniform distribution with constant positive density for $\|x\| \le 1$ and $0$ otherwise.
(D) Dependent super- and sub-Gaussian: One-dimensional Laplacian with density proportional to $\exp(-|x_{\mathrm{Lap}}|)$ and one-dimensional dependent uniform $U(c, c+1)$, where $c = 0$ for $|x_{\mathrm{Lap}}| \le \log 2$ and $c = -1$ otherwise.

The profiles of the density functions of the non-Gaussian components in the above data sets are shown in Fig. 2. The mean and standard deviation of the samples are normalized to zero and one in a component-wise manner.

Fig. 2. Densities of the non-Gaussian components. The data sets are: (a) two-dimensional independent Gaussian mixtures, (b) two-dimensional isotropic super-Gaussian, (c) two-dimensional isotropic uniform and (d) dependent one-dimensional Laplacian + one-dimensional uniform.

Besides the proposed algorithms JLA4 and JLAH, we applied for reference the following three methods in the experiments: FastICA (deflation mode)/PP with the 'pow3' or 'tanh' index (denoted by FIC3 and FICt, respectively)^2 and JADE with post selection of non-Gaussian components. In JLAH, $K = 1000$ vectors $f$ were used. We remark that we did not include the better-performing NGCA methods [9], which take many different non-Gaussianity indices into account and also include an active search for informative indices. The current active selection in our JLAH is rather preliminary and further research is needed to improve the procedure. In FastICA and JLA4, an additional 9 runs from random initial matrices were carried out and the optimum among these 10 solutions was chosen to avoid local optima.^3

^2 Many ideas for ICA with flexible non-Gaussianity indices have been proposed (e.g. [32]). Since the implementation of FastICA we used has no option to adjust the non-Gaussianity index depending on features of each extracted component, we simply compared with FastICA with fixed non-Gaussianity indices.

^3 Although JADE also suffered from local optima for data set (C), we show the results starting from a single initial value (the default setting). On the other hand, the initial value of the JLAH procedure is almost always acceptable, so such multiple restarting is not necessary.

We generated 100 realizations for each setting and computed the estimated non-Gaussian subspaces $\hat{\mathcal{I}}$ with the methods listed above for each data set. The estimation error is evaluated by the criterion

$$E(\hat{\mathcal{I}}, \mathcal{I}) = \frac{1}{m}\|(I_d - P_{\mathcal{I}})\, P_{\hat{\mathcal{I}}}\|_{\mathrm{Fro}}^2, \qquad (17)$$

where $P_{\mathcal{I}}$ (resp. $P_{\hat{\mathcal{I}}}$) is the projection matrix onto the true non-Gaussian index space $\mathcal{I}$ (resp. the estimated one $\hat{\mathcal{I}}$). The quantity $E(\hat{\mathcal{I}}, \mathcal{I})$ is equal to zero if and only if the estimated and true non-Gaussian subspaces coincide, and it increases as the angle between them becomes larger.

Fig. 3 shows boxplots of the 100 error values $E$ for all five algorithms. The boxplot is a simple method for describing sample distributions: its center line and box indicate the median and the two quartiles (i.e. the 25%- and 75%-tiles) of the data, respectively; the range of normal points (points within 1.5 times the interquartile distance centered around the median) is indicated by vertical lines, while outliers are plotted as '+'.
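For orientation, a data set in the spirit of (A) could be generated as in the sketch below (our own construction; details such as any additional mixing used in Blanchard et al. [9] may differ):

```python
import numpy as np

def make_dataset_A(T=1000, d=10, m=2, seed=0):
    """Two Gaussian-mixture coordinates (0.5*N(-3,1) + 0.5*N(3,1)), standardised
    component-wise, embedded alongside (d - m) standard Gaussian noise coordinates."""
    rng = np.random.default_rng(seed)
    centers = rng.choice([-3.0, 3.0], size=(T, m))
    s = centers + rng.standard_normal((T, m))
    s = (s - s.mean(axis=0)) / s.std(axis=0)   # normalize to mean 0, std 1
    noise = rng.standard_normal((T, d - m))
    X = np.concatenate([s, noise], axis=1)     # non-Gaussian part in the first m axes
    index_space = np.eye(d)[:, :m]             # true index space (here axis-aligned)
    return X, index_space
```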

Surprisingly, JADE, which assumes mutual independence between one-dimensional components, could find the non-Gaussian subspace $\mathcal{I}$ in most of these examples. Although this kind of phenomenon has been reported for ISA, the reason is, as far as we know, not completely understood [33,34]. We can see that JLA4 outperforms the FastICA variants and JADE for the data sets (A) and (C), which only contain sub-Gaussian structures, while its errors are much larger than those of FastICA with the 'tanh' non-linearity for the other data sets, which contain super-Gaussian sources. The degraded performance on the latter data sets is shared with the kurtosis/cumulant-based methods FIC3 and JADE and stems from the large variances of the empirical cumulant tensors in the case of super-Gaussian random variables. We also remark that JLA4 failed to estimate the index space $\mathcal{I}$ many times ((B) 7%, (C) 21% and (D) 30%). The new algorithm JLAH outperformed the better FastICA method except for the super-Gaussian data (B), where JLAH is on par with FICt. Its performance is slightly better than that of JLA4 for data sets (A) and (C), while the difference is significantly large for data sets (B) and (D).

Unfortunately, the computational time of the JLA procedures is longer than that of the conventional algorithms, because we have to calculate many components of the fourth-order cumulant tensors, or many Hessians of the characteristic function three times (due to the active selection steps). We summarize the average computational time (s) per realization (i.e. the total simulation time for each data set (A)–(D) divided by 100), measured on a Macintosh PowerBook with a 1.25 GHz processor and 1 GB RAM, in Table 1.


Fig. 3. Boxplots of the error criterion $E(\hat{\mathcal{I}}, \mathcal{I})$. Algorithms are FIC3, FICt, JADE, JLA4 and JLAH (from left to right).

4.2. Speech signals

In order to demonstrate the potential of JLA preprocessing, we performed an experiment with speech signals. We used two speech signals from Japanese text, which were tailored as explained in Kawanabe and Müller [31] such that the two sources have high variance-dependency, and further downsampled to sequences of length 1000 by taking every 20th point from the first 20 000 samples of the entire data. The original sources are shown in panel (A) of Fig. 4.

Table 1
Comparison of computational time (s) per realization

          FIC3    FICt    JADE    JLA4     JLAH
(A)       2.86    3.36    0.34     9.40    46.16
(B)       4.06    5.90    0.38     8.83    46.03
(C)       4.22    6.33    0.37    16.60    46.08
(D)       3.39    4.02    0.32    10.51    46.45


Then we made 100-dimensional data by combining 98-dimensional white Gaussian noise with the two-dimensional sources. The standard deviations of the sources were normalized to 1, while those of the Gaussian features were randomly chosen from the uniform distribution $U(0.5, 1.5)$. For extracting the speech signals, we applied three different procedures to the observed data: 'PCAPreProc'—FastICA with the 'tanh' nonlinearity after reducing the dimension to two by PCA; 'NoReduction'—extracting two components by FastICA with 'tanh' without dimension reduction (FastICA was started from 10 different initial values); 'JLAHPreProc'—FastICA with the 'tanh' nonlinearity after reducing the dimension to two by JLAH.

The components extracted by each method are plotted in Fig. 4(B)–(D). Because some of the Gaussian features have more power than the speech signals, 'PCAPreProc' completely failed to pick up the sources. 'NoReduction' extracted the first source quite well, but missed the second.

Fig. 4. Extracted components in the speech experiment: (A) original sources, (B) result of 'PCAPreProc', (C) 'NoReduction', (D) 'JLAHPreProc'.


On the other hand, 'JLAHPreProc' found both signals well, even though they were contaminated by Gaussian noise. Although this is only an illustrative example, we expect that the new JLAH preprocessing will contribute to signal processing research.
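One way to reproduce the 'JLAHPreProc' pipeline with off-the-shelf tools is sketched below (our assumption: scikit-learn's FastICA with the logcosh contrast as a stand-in for the 'tanh' nonlinearity; `B_N` is the $m \times d$ reduction matrix estimated by JLAH):

```python
import numpy as np
from sklearn.decomposition import FastICA

def reduce_then_fastica(X, B_N):
    """Project the centred data onto the estimated index space (rows of B_N)
    and unmix the reduced components with FastICA ('tanh'-type nonlinearity)."""
    Y = (X - X.mean(axis=0)) @ B_N.T        # m-dimensional reduced data
    ica = FastICA(n_components=B_N.shape[0], fun='logcosh', max_iter=1000)
    return ica.fit_transform(Y)             # extracted source estimates
```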

5. Conclusions

In this paper, motivated by joint diagonalization for ICA, we proposed a novel procedure called joint low-rank approximation (JLA) for extracting the non-Gaussian subspace. As is the case for joint diagonalization algorithms, we proved the global consistency of the JLA method. We then presented two implementations of JLA: JLA4 with the fourth-order cumulant tensors and JLAH with Hessians of the characteristic function. In our numerical experiments, we tested the JLA algorithms on synthetic data sets. Although JLA4 worked better for sub-Gaussian data, JLAH was more robust and efficient than JLA4 when the data contain super-Gaussian structures.

There are a few topics which will be pursued later. First, a gradient ascent algorithm may be developed to speed up the procedure. In order to improve the performance of the JLA methods, better active selection schemes for the vectors $f_k$ in JLAH should be explored. Finally, the JLA algorithms should be applied to real-world applications to show their practical usefulness.

Acknowledgments

M.K. acknowledges K.-R. Müller and J.-F. Cardoso for valuable discussions and comments.

Appendix A. Decomposition of the noise n

We write down the decomposition $n = n_1 + n_2$ explicitly, where $n_1 \in \mathrm{Range}(A)$ and $n_2$ is restricted to the $(d-m)$-dimensional complementary subspace such that $\mathrm{Cov}[n_1, n_2] = 0$. Let us take a basis $\{b_1, \ldots, b_{d-m}\}$ of the orthogonal subspace $\mathrm{Range}(A)^\perp$ and form the $(d-m) \times d$ matrix $B_G = (b_1, \ldots, b_{d-m})^\top$. By construction this matrix satisfies $B_G A = 0_{(d-m)\times m}$. If we define the two matrices

$$B := (A^\top \Gamma^{-1} A)^{-1} A^\top \Gamma^{-1}, \qquad A_G := \Gamma B_G^\top (B_G \Gamma B_G^\top)^{-1}, \qquad (\mathrm{A.1})$$

it is easy to show that $(B^\top, B_G^\top)^\top$ is the inverse matrix of $(A, A_G)$ and vice versa, i.e. $(B^\top, B_G^\top)^\top (A, A_G) = I_d$. Therefore, the equality

$$A B + A_G B_G = (A, A_G) \begin{pmatrix} B \\ B_G \end{pmatrix} = I_d \qquad (\mathrm{A.2})$$

also holds. Multiplying Eq. (A.2) by the noise vector $n$, we get

$$n = A(Bn) + A_G(B_G n),$$

where the first term $n_1 := A(Bn)$ belongs to $\mathrm{Range}(A)$, while the second term $n_2 := A_G(B_G n)$ is restricted to the $(d-m)$-dimensional subspace $\mathrm{Range}(A_G)$. It is easy to show that the covariance $\mathrm{Cov}[n_1, n_2]$ vanishes:

$$\mathrm{Cov}[n_1, n_2] = A B\,\mathrm{Cov}[n]\,B_G^\top A_G^\top = A (B \Gamma B_G^\top) A_G^\top = 0.$$

Appendix B. Orthogonality of A_N and A_G

It can be shown that the matrix $A_G$ defined in Eq. (A.1) does not change when the noise covariance $\Gamma$ is replaced with $V := \mathrm{Cov}[x] = A_N \Lambda A_N^\top + \Gamma$, where the matrix $A$ of the previous section is written as $A_N$ here and $\Lambda$ is the covariance of the source $s$ in Eq. (1):

$$A_G = \Gamma B_G^\top (B_G \Gamma B_G^\top)^{-1} = V B_G^\top (B_G V B_G^\top)^{-1}.$$

Therefore, the orthogonality $A_N^\top V^{-1} A_G = 0_{m\times(d-m)}$ with respect to the metric $V^{-1}$ follows from the property $A_N^\top B_G^\top = 0_{m\times(d-m)}$.

References

[1] C. Jutten, J. Hérault, Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture, Signal Process. 25 (1991) 1–10.
[2] P. Comon, Independent component analysis—a new concept?, Signal Process. 36 (1994) 287–314.
[3] J.-F. Cardoso, A. Souloumiac, Blind beamforming for non Gaussian signals, IEE Proc. F 140 (6) (1993) 362–370.
[4] A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation 7 (1995) 1129–1159.
[5] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, Wiley, New York, 2001.
[6] A. Hyvärinen, J. Särelä, R. Vigário, Spikes and bumps: artefacts generated by independent component analysis with insufficient sample size, in: Proceedings of ICA'99, 1999, pp. 425–429.
[7] F. Asano, Y. Motomura, H. Asho, T. Matsui, Effect of PCA filter in blind source separation, in: Proceedings of ICA 2000, 2000, pp. 57–62.
[8] R. Vigário, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artefacts in magnetoencephalographic recordings, in: Advances in Neural Information Processing Systems, vol. 10, MIT Press, Cambridge, MA, 1998, pp. 229–235.
[9] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, K.-R. Müller, In search of non-Gaussian components of a high-dimensional distribution, J. Mach. Learning Res. 7 (2006) 247–282.
[10] A. Hyvärinen, P. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neural Computation 12 (7) (2000) 1705–1720.
[11] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[12] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[13] J.H. Friedman, J.W. Tukey, A projection pursuit algorithm for exploratory data analysis, IEEE Trans. Comput. 23 (9) (1975) 881–890.
[14] P.J. Huber, Projection pursuit, Ann. Statist. 13 (1985) 435–475.
[15] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (7) (1997) 1483–1492.
[16] A. Hyvärinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Trans. Neural Networks 10 (3) (1999) 626–634.
[17] M. Kawanabe, G. Blanchard, M. Sugiyama, K.-R. Müller, A new algorithm of non-Gaussian component analysis with radial kernel functions, Ann. Inst. Statist. Math. 59 (1) (2007) 57–75.
[18] F.J. Theis, M. Kawanabe, Uniqueness of non-Gaussian subspace analysis, in: Proceedings of ICA 2006, 2006, pp. 917–924.
[19] A. Ziehe, K.-R. Müller, TDSEP—an efficient algorithm for blind separation using time structure, in: L. Niklasson, M. Bodén, T. Ziemke (Eds.), Proceedings of the Eighth International Conference on Artificial Neural Networks, ICANN'98, Springer, Berlin, 1998, pp. 675–680 (Perspectives in Neural Computing).
[20] A. Belouchrani, K. Abed Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique based on second order statistics, IEEE Trans. Signal Process. 45 (2) (1997) 434–444.
[21] F.J. Theis, Y. Inouye, On the use of joint diagonalization in blind signal processing, in: Proceedings of ISCAS 2006, Kos, Greece, 2006.
[22] M. Kawanabe, Linear dimension reduction based on the fourth-order cumulant tensor, in: Proceedings of ICANN 2005, Lecture Notes in Computer Science, vol. 3697, 2005, pp. 151–156.
[23] M. Kawanabe, F.J. Theis, Estimating non-Gaussian subspaces by characteristic functions, in: J. Rosca, D. Erdogmus, J.C. Príncipe, S. Haykin (Eds.), Independent Component Analysis and Blind Signal Separation, Lecture Notes in Computer Science, vol. 3889, Springer, Berlin, 2006, pp. 157–164.
[24] A. Yeredor, Blind source separation via the second characteristic function, Signal Process. 80 (5) (2000) 897–902.
[25] F.J. Theis, A new concept for separability problems in blind source separation, Neural Computation 16 (2004) 1827–1850.
[26] F.J. Theis, Multidimensional independent component analysis using characteristic functions, in: Proceedings of EUSIPCO 2005, 2005.
[27] J.K. Lin, Factorizing multivariate function classes, in: Advances in Neural Information Processing Systems, vol. 10, MIT Press, Cambridge, MA, 1998, pp. 563–569.
[28] S. Amari, J.-F. Cardoso, Blind source separation—semiparametric statistical approach, IEEE Trans. Signal Process. 45 (11) (1997) 2692–2700.
[29] A. Yeredor, Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation, IEEE Trans. Signal Process. 50 (7) (2002) 1545–1553.
[30] D.T. Pham, P. Garrat, Blind separation of mixture of independent sources through a quasi-maximum likelihood approach, IEEE Trans. Signal Process. 45 (7) (1997) 1712–1725.
[31] M. Kawanabe, K.-R. Müller, Estimating functions for blind separation when sources have variance dependencies, J. Mach. Learning Res. 6 (2005) 453–482.
[32] T.-W. Lee, M. Girolami, T.J. Sejnowski, Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources, Neural Computation 11 (2) (1999) 417–441.
[33] Z. Szabó, B. Póczos, A. Lőrincz, Cross-entropy optimization for independent process analysis, in: J. Rosca, D. Erdogmus, J.C. Príncipe, S. Haykin (Eds.), Independent Component Analysis and Blind Signal Separation, Lecture Notes in Computer Science, vol. 3889, Springer, Berlin, 2006, pp. 909–916.
[34] F.J. Theis, Towards a general independent subspace analysis, in: Advances in Neural Information Processing Systems, 2006.