ICA with Sparse Connections*

Kun Zhang and Lai-Wan Chan

Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
Email: {kzhang, lwchan}@cse.cuhk.edu.hk

* This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.

Abstract. When applying independent component analysis (ICA), we sometimes expect the connections between the observed mixtures and the recovered independent components (or the original sources) to be sparse, to make the interpretation easier or to reduce the model complexity. In this paper we propose natural gradient algorithms for ICA with a sparse separation matrix, as well as for ICA with a sparse mixing matrix. The sparsity of the matrix is achieved by applying certain penalty functions to its entries. The properties of these penalty functions are investigated. Experimental results on both artificial data and causality discovery in financial stocks show the usefulness of the proposed methods.

1 Introduction

Independent component analysis (ICA) aims at recovering latent independent sources from their observable linear mixtures [4]. Denote by $\mathbf{x} = (x_1, \ldots, x_n)^T$ the vector of observed signals. $\mathbf{x}$ is assumed to be generated by $\mathbf{x} = \mathbf{A}\mathbf{s}$, where $\mathbf{s} = (s_1, \ldots, s_n)^T$ has mutually independent components and $\mathbf{A}$ is the mixing matrix. For simplicity we assume that the number of observed signals equals the number of independent sources. Under certain conditions on $\mathbf{A}$ and the distributions of $s_i$, ICA applies the following linear transformation to $\mathbf{x}$:

$$\mathbf{y} = \mathbf{W}\mathbf{x} \qquad (1)$$

The separation matrix $\mathbf{W}$ is tuned such that the components of $\mathbf{y} = (y_1, \ldots, y_n)^T$ are mutually as independent as possible, so that finally the $y_i$ provide estimates of the original sources $s_i$. When performing ICA, it is sometimes desirable not only to achieve independence between the outputs, but also to make the transformation matrix (the separation matrix $\mathbf{W}$ or the mixing matrix $\mathbf{A}$) as sparse as possible, i.e. to make as many of its entries zero as possible, for the following reasons. First, consider the case where the data dimension is high and the true values of some entries of the transformation matrix are 0. If the zero entries can be automatically detected and set to 0 during the learning process, the model complexity is reduced and the parameters can be estimated more reliably. Second, a sparse mixing matrix means that each observation $x_i$ is affected by only a small subset of the independent sources $s_i$, which makes the interpretation easier. The third reason is application-oriented. Recently, ICA was proposed for identifying the linear, non-Gaussian, acyclic causal model (LiNGAM) [8]. If the data are generated according to the LiNGAM model, theoretically the ICA separation matrix $\mathbf{W}$ can be permuted to lower triangularity. In practice, however, this may not hold, due to finite-sample effects, noise, or slight violations of the model. Hence, for estimating LiNGAM, a $\mathbf{W}$ with as many zero entries as possible is preferred.

In this paper we propose methods to perform ICA with a sparse transformation matrix. The sparsity of the transformation matrix is achieved by maximizing the likelihood function of the ICA model together with suitable penalties on the ICA transformation matrix. We first discuss the properties of various penalty functions that can produce sparse coefficients in the linear regression problem, and also propose a new one. Under certain regularity conditions, likelihood-based models can also incorporate these penalties to produce sparse parameters. In this way we develop a natural gradient-based ICA algorithm producing a sparse separation matrix, as well as one producing a sparse mixing matrix.

2 Penalties Producing Sparse Coefficients in Linear Regression

In the statistics literature, the behavior of different penalties in the linear regression problem has been intensively studied. By incorporating the $L_\gamma$ penalty on the regression coefficients $\beta_j$, the penalized least square error (LSE) estimate of the coefficients $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_n)^T$ is obtained by minimizing the mean square error plus the penalty term $\lambda \sum_{j=1}^{n} |\beta_j|^\gamma$, where $\lambda \ge 0$ is a parameter controlling the extent to which the penalty influences the solution. Setting $\gamma = 2$ results in ridge regression, which tends to make the coefficients smaller but cannot set any coefficient to 0. The $L_2$ penalty also gives rise to the weight decay regularizer in neural network learning. The least absolute shrinkage and selection operator (Lasso) emerges when $\gamma = 1$, i.e. when the $L_1$ penalty is adopted [9]. The $L_1$ penalty corresponds to a Laplacian prior on $\beta_j$. With a suitable $\lambda$ it automatically sets insignificant coefficients to 0; however, it has the disadvantage that it also biases the estimates of significant coefficients.

In [3] it was argued that a good penalty function producing sparse parameters should result in an estimator with three properties: 1. unbiasedness of the resulting estimator for significant parameters; 2. sparsity, meaning that insignificant coefficients are automatically set to 0, to reduce the model complexity; and 3. continuity of the resulting estimator with respect to changes in the data, to avoid unnecessary variation in model prediction. Furthermore, the smoothly clipped absolute deviation (SCAD) penalty, which possesses these properties, was proposed there. The derivative of the SCAD penalty (including the coefficient $\lambda$) is given by (for $\beta > 0$)

$$P'_\lambda(\beta) = \lambda \left\{ I(\beta \le \lambda) + \frac{(a\lambda - \beta)_+}{(a-1)\lambda}\, I(\beta > \lambda) \right\}, \quad \text{for some } a > 2 \qquad (2)$$

where $I(\cdot)$ is the indicator function. The typical value of $a$ is 3.7, which is obtained from a Bayesian viewpoint [3]. The SCAD penalty corresponds to an improper prior on $\beta$. It was also shown that, with a proper choice of the regularization parameter, the SCAD penalty possesses the so-called oracle property. This property is very appealing: the coefficients whose true values are zero are automatically estimated as zero, and the remaining coefficients are estimated as well as if the correct submodel were known in advance. For details of the SCAD penalty, see [3].

Now we propose a generalized version of the SCAD penalty and consider some computational issues. In the SCAD penalty (Eq. 2), the parameter controlling the strength of the penalty and the parameter controlling the range over which the penalty applies need not be equal; we therefore propose the following generalized SCAD (GSCAD) penalty (for $\beta \ge 0$):

$$P'_\lambda(\beta) = \lambda \left\{ I(\beta \le \lambda_1) + \frac{(a\lambda_1 - \beta)_+}{(a-1)\lambda_1}\, I(\beta > \lambda_1) \right\} \qquad (3)$$

GSCAD provides a trade-off between SCAD and the $L_1$ penalty: when $\lambda_1$ is very large, it tends to the $L_1$ penalty. When the data are very noisy, we can choose $\lambda_1 > \lambda$ so that the penalty operates on a wider range of the parameter space than SCAD. The penalty for weight elimination in neural network learning [10] can be regarded as a heuristic approximation to GSCAD, but it only shrinks small parameters and cannot set any parameter to 0.

Any penalty that can set small parameters to 0 and still yield a continuous estimator, including the $L_1$ penalty and (G)SCAD, must be singular (not differentiable) at the origin [3]. Consequently, we cannot use a gradient-based learning rule to optimize the objective function when $\beta$ is very close to 0. Here we use an approximation to tackle this computational problem: the function $\tanh(m\beta)$ (with $m$ a large number, say 200) approximates the derivative of $|\beta|$. The GSCAD derivative (Eq. 3) is then approximated by

$$P'_\lambda(\beta) = \lambda \left\{ \tanh(m\beta)\, I(\beta \le \lambda_1) + \tanh(m\lambda_1)\, \frac{(a\lambda_1 - \beta)_+}{(a-1)\lambda_1}\, I(\beta > \lambda_1) \right\} \qquad (4)$$

As the price of this approximation, $\beta$ will not shrink exactly to 0 even if its true value is 0; instead it converges to a very small number (about $10^{-2}$). In practice, after the algorithm converges, we set the parameters whose magnitudes are small enough, say smaller than 0.02, to 0. For illustration, Fig. 1 depicts the penalty functions discussed above, as well as their derivatives.
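For concreteness, the following Python/NumPy sketch implements the penalty derivatives used in this paper: the $L_1$ penalty, SCAD (Eq. 2), GSCAD (Eq. 3), and the tanh-smoothed GSCAD (Eq. 4). The function names and default arguments are our own choices for illustration, not part of the original formulation; the penalties are extended to negative arguments by symmetry.

import numpy as np

def d_l1(beta, lam):
    # Derivative of the L1 (Lasso) penalty lam*|beta| (zero at the origin).
    return lam * np.sign(beta)

def d_scad(beta, lam, a=3.7):
    # Derivative of the SCAD penalty (Eq. 2), extended by symmetry via sign(beta).
    b = np.abs(beta)
    return lam * ((b <= lam)
                  + np.maximum(a * lam - b, 0.0) / ((a - 1) * lam) * (b > lam)) * np.sign(beta)

def d_gscad(beta, lam, lam1, a=3.0):
    # Derivative of the generalized SCAD penalty (Eq. 3).
    b = np.abs(beta)
    return lam * ((b <= lam1)
                  + np.maximum(a * lam1 - b, 0.0) / ((a - 1) * lam1) * (b > lam1)) * np.sign(beta)

def d_gscad_smooth(beta, lam, lam1, a=3.0, m=200.0):
    # tanh-smoothed GSCAD derivative (Eq. 4); differentiable near the origin.
    b = np.abs(beta)
    inner = (np.tanh(m * b) * (b <= lam1)
             + np.tanh(m * lam1) * np.maximum(a * lam1 - b, 0.0) / ((a - 1) * lam1) * (b > lam1))
    return lam * inner * np.sign(beta)

After convergence of a gradient algorithm driven by these derivatives, entries whose magnitudes remain below a small threshold (e.g. 0.02, as suggested above) are set exactly to 0.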

Fig. 1. (a) Curves of the four penalty functions ($L_2$ penalty, $L_1$ penalty (Lasso), SCAD with $a = 3.7$, and GSCAD with $\lambda_1 = 1.5$, $a = 3$), all with $\lambda = 1$. (b) Their derivatives $P'_\lambda(\beta)$.

3 ICA with Sparse Transformation Matrix

Under weak regularity conditions, ordinary maximum likelihood estimates are asymptotically normal [5], and the above penalties can be applied to likelihood-based models [3]. Since ICA algorithms can be derived from a maximum likelihood point of view [7], ICA with a sparse transformation matrix can be derived by taking the data (log-)likelihood, together with the above penalties, as the objective function.

3.1 ICA with Sparse Separation Matrix

Denote by $p_i(s_i)$ the probability density function of the source $s_i$, and let $\mathbf{w}_i$ be the $i$-th column of $\mathbf{W}^T$. According to the ICA model (Eq. 1), the log-likelihood of the data $\mathbf{x}$ is

$$\ell(\mathbf{x}; \mathbf{W}) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_i(\mathbf{w}_i^T \mathbf{x}_t) + T \log |\det \mathbf{W}| \qquad (5)$$

where $T$ denotes the number of samples. The penalized log-likelihood (divided by $T$) is

$$\ell_P = \frac{1}{T}\,\ell(\mathbf{x}; \mathbf{W}) - \sum_{i,j=1}^{n} P_\lambda(w_{ij}) \qquad (6)$$

Its gradient w.r.t. $\mathbf{W}$ is

$$\frac{d\ell_P}{d\mathbf{W}} = -E\{\psi(\mathbf{y})\mathbf{x}^T\} + [\mathbf{W}^T]^{-1} - [P'_\lambda(w_{ij})] \qquad (7)$$

where $\psi(\mathbf{y}) = (\psi_1(y_1), \ldots, \psi_n(y_n))^T$ denotes the score function of $\mathbf{y}$, with $\psi_i(y_i) = -\frac{p_i'(y_i)}{p_i(y_i)}$, and $[P'_\lambda(w_{ij})]$ denotes the matrix whose $(i,j)$-th entry is $P'_\lambda(w_{ij})$. Multiplying the above gradient from the right-hand side by $\mathbf{W}^T\mathbf{W}$, we obtain the corresponding natural gradient learning rule [1] for $\mathbf{W}$:

$$\Delta\mathbf{W} \propto \left\{ \mathbf{I} - E\{\psi(\mathbf{y})\mathbf{y}^T\} - [P'_\lambda(w_{ij})]\,\mathbf{W}^T \right\} \mathbf{W} \qquad (8)$$

$\mathbf{W}$ and $y_i$ estimated by ICA have a scaling indeterminacy. However, in the penalized log-likelihood (Eq. 6), the value of the penalty $P_\lambda(w_{ij})$ depends on the scale of $w_{ij}$, so the scaling indeterminacy must be removed. A convenient way is to replace the diagonal entries of the right-hand side of Eq. 8 with $1 - E\{y_i^2\}$, which ensures that the $y_i$ have unit variance at convergence.
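As an illustration, the following sketch performs one batch update of Eq. 8 in Python/NumPy. It assumes a tanh score function (a common choice for super-Gaussian sources), takes the penalty derivative as a user-supplied function (for example, the smoothed GSCAD sketch from Section 2), and reads "the right-hand side of Eq. 8" as the bracketed factor; the names and the learning rate are our own, not from the original implementation.

import numpy as np

def update_W(W, X, dpenalty, lr=0.01):
    # One natural-gradient step of Eq. 8 for the separation matrix W.
    # W: (n x n) separation matrix; X: (n x T) observed mixtures;
    # dpenalty: callable giving the elementwise penalty derivative P'_lambda.
    n, T = X.shape
    Y = W @ X                                             # current source estimates y = W x
    psi = np.tanh(Y)                                      # assumed score function psi(y)
    G = np.eye(n) - (psi @ Y.T) / T - dpenalty(W) @ W.T   # bracketed factor in Eq. 8
    # Fix the scaling indeterminacy: diagonal replaced by 1 - E{y_i^2}
    np.fill_diagonal(G, 1.0 - np.mean(Y**2, axis=1))
    return W + lr * G @ W

For instance, one could pass dpenalty = lambda W: d_gscad_smooth(W, lam=0.05, lam1=0.075, a=3.3), matching the GSCAD setting used in Section 4.1 (the helper name comes from the earlier sketch).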

3.2 ICA with Sparse Mixing Matrix

Now we give the learning rule for ICA with a sparse mixing matrix. The gradient of the penalized log-likelihood (Eq. 6) w.r.t. $\mathbf{A}$ is

$$\frac{d\ell_P}{d\mathbf{A}} = [\mathbf{A}^T]^{-1}\left(\psi(\mathbf{y})\mathbf{y}^T - \mathbf{I}\right) - [P'_\lambda(a_{ij})] \qquad (9)$$

where $[P'_\lambda(a_{ij})]$ denotes the matrix whose $(i,j)$-th entry is $P'_\lambda(a_{ij})$. The corresponding natural gradient learning rule for $\mathbf{A}$ [1, 6] is

$$\Delta\mathbf{A} \propto \mathbf{A}\left\{ \psi(\mathbf{y})\mathbf{y}^T - \mathbf{I} - \mathbf{A}^T[P'_\lambda(a_{ij})] \right\} \qquad (10)$$

In our implementation the diagonal entries of the right-hand side of the above equation are replaced by $1 - E\{y_i^2\}$ to avoid the scaling indeterminacy.
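Analogously, a minimal sketch of one step of Eq. 10, under the same assumptions as the sketch in Section 3.1 (tanh score function, user-supplied penalty derivative, our own naming, sample averages in place of expectations, diagonal fix applied to the bracketed factor):

import numpy as np

def update_A(A, X, dpenalty, lr=0.01):
    # One natural-gradient step of Eq. 10 for the mixing matrix A.
    n, T = X.shape
    Y = np.linalg.solve(A, X)                             # source estimates y = A^{-1} x
    psi = np.tanh(Y)                                      # assumed score function
    G = (psi @ Y.T) / T - np.eye(n) - A.T @ dpenalty(A)   # bracketed factor in Eq. 10
    np.fill_diagonal(G, 1.0 - np.mean(Y**2, axis=1))      # scaling fix: 1 - E{y_i^2}
    return A + lr * A @ G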

3.3 Determining Parameters in the Penalty Functions

The performance of the penalties is affected by the choice of their parameters. For example, for the SCAD penalty to possess the oracle property (see Section 2), $\lambda$ should satisfy the conditions $\lambda_T \to 0$ and $\sqrt{T}\,\lambda_T \to \infty$ as $T \to \infty$. Note that here we use the subscript $T$ to indicate that the choice of $\lambda$ depends on the sample size $T$. Popular ways to select the parameters in the penalties are cross-validation and generalized cross-validation [3]. The choice of these parameters can also be application-oriented. For instance, in Section 4.2 we discuss how to select the parameter $\lambda$ when applying ICA with a sparse separation matrix (Eq. 8) to estimate the LiNGAM model, in the situation where the LiNGAM assumption is slightly violated.
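As a generic illustration of the cross-validation route (the paper cites [3] without giving details), one could select $\lambda$ by maximizing the held-out ICA log-likelihood over a grid; sparse_ica and loglik below are assumed helpers, not functions defined in this paper.

import numpy as np

def select_lambda_cv(X, sparse_ica, loglik, lambdas, k=5, seed=0):
    # k-fold cross-validation over a grid of lambda values (a generic sketch).
    # sparse_ica(X, lam) -> W fitted on X; loglik(W, X) -> held-out log-likelihood.
    T = X.shape[1]
    folds = np.array_split(np.random.default_rng(seed).permutation(T), k)
    scores = []
    for lam in lambdas:
        score = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(T), fold)
            W = sparse_ica(X[:, train], lam)
            score += loglik(W, X[:, fold])
        scores.append(score)
    return lambdas[int(np.argmax(scores))]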

4 Experiments

We now present two experiments to show the usefulness and the behavior of the proposed methods.¹ In the first experiment, we use artificial data to illustrate and compare the properties of the various penalty functions discussed in Section 2. In the second experiment, ICA with a sparse separation matrix is used to discover the LiNGAM causal relations among a set of stocks selected from the Hong Kong stock market.

¹ We have also exploited ICA with a sparse separation matrix to analyze high-dimensional gene expression data (with 77 conditions and 6180 genes). The regularization parameter $\lambda$ is selected by five-fold cross-validation, and ICA with a sparse separation matrix gives much more stable separation results than traditional ICA. This result is not reported in detail due to space limitations.

4.1 With Artificial Data

In this experiment the four independent sources are a sine wave ($s_1$), a uniformly distributed white signal ($s_2$), a sign signal ($s_3$) with a frequency different from that of $s_1$, and a normally distributed white signal ($s_4$). The entries of the mixing matrix $\mathbf{A}$ are randomly generated between $-0.5$ and $0.5$, and some of them are set to 0 to make $\mathbf{A}$ sparse:

$$\mathbf{A} = \begin{pmatrix} 0.1518 & -0.2793 & 0 & 0 \\ -0.4472 & -0.1934 & 0 & 0 \\ -0.2707 & 0.2207 & -0.3748 & 0 \\ 0.0874 & 0.4544 & -0.3338 & -0.2310 \end{pmatrix}$$

The observations are generated by $\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n}$, where $\mathbf{n} = (n_1, \ldots, n_4)^T$ denotes isotropic Gaussian noise. We use the ratio of the standard deviation of $n_i$ to the average standard deviation of $x_i - n_i$ to measure the noise level.

We compare the performance of four ICA methods for estimating $\mathbf{A}$ at different noise levels. These methods all exploit the natural gradient-based algorithm (Eq. 10), but different penalty functions are applied to the entries of the estimate of $\mathbf{A}$. The first method does not use any penalty, i.e. it is the traditional natural gradient ICA algorithm. The penalties in the other methods are the $L_1$ penalty, SCAD (Eq. 2), and GSCAD (Eq. 3), respectively. The parameters in the penalty functions are chosen empirically: $\lambda = 0.05$ for all the above penalties; in SCAD, $a = 3.7$; in GSCAD, $\lambda_1 = 1.5\lambda$ and $a = 3.3$. At each noise level, we repeat the above methods for 30 runs. In each run, the sources $s_2$ and $s_4$ and the noise $\mathbf{n}$ are randomly generated.

Fig. 2(a) shows the estimate of the entry $a_{14}$ at different noise levels (with the columns of the estimate of $\mathbf{A}$ correctly permuted). Note that the estimate of $a_{14}$ will not be exactly 0, since we use $\tanh(m a_{ij})$ to approximate the gradient of $|a_{ij}|$. Fig. 2(b) compares the Amari performance index [2] obtained by the above methods; the smaller the Amari performance index, the better the separation performance. From these figures we can see that when the noise level is not very high, the last three methods always perform better than traditional ICA, and the error bars in Fig. 2(a) show that they all provide more stable estimates. Among the three penalties, the $L_1$ penalty is not as good as the others, probably because it penalizes all entries of the estimate of $\mathbf{A}$, including the significant and reliable ones. SCAD and GSCAD give very similar performance; however, when the noise level is comparatively high (greater than 0.3), GSCAD behaves slightly better than SCAD.
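A sketch of how such data could be generated; the frequencies, amplitudes and sample size are not specified in the text and are arbitrary choices here.

import numpy as np

def make_mixtures(A, T=1000, noise_level=0.1, seed=0):
    # Generate x = A s + n for the artificial-data experiment described above.
    # Waveform parameters (frequencies, sample size) are illustrative assumptions.
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    S = np.vstack([
        np.sin(2 * np.pi * t / 50),            # s1: sine wave
        rng.uniform(-1, 1, T),                 # s2: uniformly distributed white signal
        np.sign(np.sin(2 * np.pi * t / 23)),   # s3: sign signal, different frequency from s1
        rng.standard_normal(T),                # s4: Gaussian white signal
    ])
    X_clean = A @ S
    # Isotropic Gaussian noise scaled so that std(n_i) / average std(x_i - n_i) = noise_level
    sigma = noise_level * np.mean(X_clean.std(axis=1))
    N = sigma * rng.standard_normal(X_clean.shape)
    return X_clean + N, S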

4.2 For Estimation of LiNGAM among Stock Returns

If the ICA separation matrix $\mathbf{W}$ of the observed variables can be permuted to lower triangularity, these variables follow the LiNGAM model and $\mathbf{W}$ implies the causal relations among them [8]. In practice, the $\mathbf{W}$ produced by traditional ICA is unlikely to have many zero entries, especially when the LiNGAM assumption is slightly violated.

Fig. 2. (a) The estimate of the entry $a_{14}$ (whose true value is 0) at different noise levels ($\mathrm{std}(n_i)/\mathrm{std}(x_i - n_i)$), for the four methods (no penalty, Lasso, SCAD with $a = 3.7$, and GSCAD with $\lambda_1 = 1.5\lambda$, $a = 3.3$). The error bars denote the standard deviation over the 30 runs; the curves are shifted slightly for clarity. (b) The Amari performance index.

In [8], statistical tests were proposed to prune entries of $\mathbf{W}$ so that it can be permuted to lower triangularity. However, this does not adjust $\mathbf{W}$ continuously, and consequently small changes in the data may result in sudden changes in the result. In addition, the interaction among the entries of $\mathbf{W}$ is neglected during pruning: after an entry is pruned, the values and significance of the others change. ICA with a sparse separation matrix provides a more reliable way to estimate the LiNGAM model.

When using ICA with a sparse separation matrix to estimate the LiNGAM model, we need to choose a suitable $\lambda$ such that the resulting $\mathbf{W}$ can be permuted to lower triangularity. A greedy scheme can be adopted for determining $\lambda$, as sketched in the code below. (When using the GSCAD penalty, we set $\lambda_1 = 1.5\lambda$ and $a = 3.3$ empirically.) Starting from a small value, $\lambda$ is increased each time by a fixed increment. After the algorithm converges for the new value of $\lambda$, we check whether the LiNGAM assumption holds by examining $\mathbf{W}$; the check can easily be done with Algorithm B in [8]. Once the LiNGAM assumption holds, we stop the procedure and estimate the LiNGAM model by analyzing $\mathbf{W}$. If $\lambda$ reaches an upper bound set in advance, or the correlations among the $y_i$ become significant, we terminate the procedure and conclude that the data do not follow the LiNGAM model. The computational cost of this method is modest, since convergence for each new value of $\lambda$ usually requires only a few iterations.

The data to be analyzed are the returns of six stocks selected from the Hong Kong stock market; the stocks (0001.HK~0006.HK) are listed in the legend of Fig. 3. Starting from $\lambda = 0.04$ and increasing it by 0.04 each time, we find that when $\lambda = 0.16$ the LiNGAM assumption holds for the $\mathbf{W}$ estimated by ICA with a sparse separation matrix (Eq. 8). Fig. 3 gives the resulting LiNGAM causal relations. $x_2$, $x_3$, and $x_6$ are components of the Hang Seng Utilities Index, and they are interconnected; $x_1$, $x_4$, and $x_5$ are closely related. The figure also shows how the returns of these stocks are influenced by one another.
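A minimal sketch of this greedy scheme, assuming a helper sparse_ica(X, lam) that runs the update rule of Eq. 8 to convergence and returns $\mathbf{W}$. The permutability test below is a brute-force stand-in for Algorithm B of [8] (feasible for six variables) and assumes the rows of $\mathbf{W}$ have already been matched to the observed variables; the test for significant correlations among the $y_i$ is omitted for brevity.

import itertools
import numpy as np

def permutable_to_lower_triangular(W, tol=0.02):
    # Brute-force check: is there a simultaneous row/column permutation of W
    # with no (numerically) nonzero entry above the diagonal?
    # A simplified stand-in for Algorithm B in [8]; only feasible for small n.
    n = W.shape[0]
    B = np.abs(W) > tol
    for perm in itertools.permutations(range(n)):
        p = np.array(perm)
        if not np.triu(B[np.ix_(p, p)], k=1).any():
            return True
    return False

def greedy_lambda_search(X, sparse_ica, lam0=0.04, step=0.04, lam_max=0.4):
    # Increase lambda until the estimated W can be permuted to lower triangularity.
    # sparse_ica(X, lam) is an assumed helper implementing Eq. 8, not defined here.
    lam = lam0
    while lam <= lam_max:
        W = sparse_ica(X, lam)
        if permutable_to_lower_triangular(W):
            return lam, W          # LiNGAM assumption holds; analyze W
        lam += step
    return None, None              # conclude the data do not follow LiNGAM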

Fig. 3. LiNGAM causal relations among the six stocks. $x_1$: Cheung Kong (0001.hk), $x_2$: CLP Hldgs (0002.hk), $x_3$: HK & China Gas (0003.hk), $x_4$: Wharf (Hldgs) (0004.hk), $x_5$: HSBC Hldg (0005.hk), $x_6$: HK Electric (0006.hk).

5 Conclusion

ICA with sparse connections is more suitable than traditional ICA for some real-world problems. In this paper we have proposed natural gradient algorithms for ICA with a sparse separation matrix and for ICA with a sparse mixing matrix. The algorithms are derived by maximizing the ICA likelihood penalized by certain functions. Various penalty functions that can produce sparse parameters have been discussed and compared. A direct application of the proposed methods is the estimation of the LiNGAM model. The proposed methods have been applied to both artificial and real-world data, and the experimental results support the theoretical claims.

References

1. S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
2. S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, 1996.
3. J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1360, 2001.
4. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, Inc., 2001.
5. E. L. Lehmann. Theory of Point Estimation. John Wiley & Sons, Inc., 1983.
6. M. Lewicki and T. J. Sejnowski. Learning nonlinear overcomplete representations for efficient coding. In Advances in Neural Information Processing Systems 10, pages 815-821, 1998.
7. D. T. Pham and P. Garat. Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7):1712-1725, 1997.
8. S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Submitted to Journal of Machine Learning Research, 2006.
9. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267-288, 1996.
10. A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, 1991.