Postnonlinear overcomplete blind source separation using sparse sources

Fabian J. Theis (1,2) and Shun-ichi Amari (1)

(1) Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan
(2) Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
[email protected], [email protected]

Abstract. We present an approach for blindly decomposing an observed random vector x into f(As), where f is a diagonal function, i.e. f = f_1 × ... × f_m with one-dimensional functions f_i, and A an m × n matrix. This postnonlinear model is allowed to be overcomplete, which means that fewer observations than sources (m < n) are given. In contrast to Independent Component Analysis (ICA), we do not assume the sources s to be independent but to be sparse in the sense that at each time instant they have at most m − 1 non-zero components (Sparse Component Analysis, SCA). Identifiability of the model is shown, and an algorithm for model and source recovery is proposed: it first detects the postnonlinearity in each component and then identifies the now linearized model using previous results.

Blind source separation (BSS) based on ICA is a rapidly growing field (see for instance [1,2] and references therein), but most algorithms deal only with the case of at least as many observations as sources. However, there is increasing interest in (linear) overcomplete ICA [3–5], where matrix identifiability is known [6] but source identifiability does not hold. In order to approximately recover the sources [7], additional requirements have to be imposed, usually sparsity of the sources. Recently, we have proposed a model based only upon the sparsity assumption (summarized in section 1) [8]; in this case, given sufficiently high sparsity, identifiability of both the matrix and the sources can be shown. Here, we extend these results to postnonlinear mixtures (section 2); they describe a model often occurring in real situations, where the mixture is in principle linear but the sensors introduce an additional nonlinearity during the recording [9]. Section 3 presents an algorithm for identifying such models, and section 4 finishes with an illustrative simulation.

1 Linear overcomplete SCA

Definition 1. A vector v ∈ R^n is said to be k-sparse if v has at most k non-zero entries.


If an n-dimensional vector is (n − 1)-sparse, that is, it contains at least one zero component, it is simply said to be sparse. The goal of Sparse Component Analysis of level k (k-SCA) is to decompose a given m-dimensional random vector x into

x = As                                                            (1)

with a real m × n matrix A and an n-dimensional k-sparse random vector s. Here s is called the source vector, x the mixtures and A the mixing matrix. We speak of complete, overcomplete or undercomplete k-SCA if m = n, m < n or m > n, respectively. In the following we will, without loss of generality, assume m ≤ n, because the undercomplete case can easily be reduced to the complete case by projection of x.

Theorem 1 (Matrix identifiability). Consider the k-SCA problem from equation (1) for k := m − 1 and assume that every m × m submatrix of A is invertible. Furthermore, let s be sufficiently richly represented in the sense that for any index set I ⊂ {1, ..., n} of n − m + 1 elements there exist at least m samples of s such that each of them has zero entries at the indices in I and any m − 1 of them are linearly independent. Then A is uniquely determined by x except for right-multiplication with permutation and scaling matrices, i.e. up to permutation and scaling of its columns.

Theorem 2 (Source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (m − 1)-sparse solution s. If A fulfills the condition from theorem 1, then there exists a subset H_0 ⊂ H of measure zero with respect to H such that for every x ∈ H \ H_0 this system has no other solution with this property.

The above two theorems show that in the case of overcomplete BSS using (m − 1)-SCA, both the mixing matrix and the sources can be recovered uniquely from x except for the omnipresent permutation and scaling indeterminacy. We refer to [8] for proofs of these theorems and for algorithms based upon them. We also want to note that the present source recovery algorithm is quite different from the usual sparse source recovery using l1-norm minimization [7] and linear programming; for sources with sparsity as above, the latter will in general not be able to recover the sources.
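The source identifiability result suggests a simple combinatorial recovery procedure once A is known: for each sample, test every support of size m − 1 and keep the coefficient vector whose residual vanishes (numerically). A minimal sketch in Python/NumPy follows; it is our own illustration, not the algorithm of [8], and the function name and tolerance are hypothetical.

import numpy as np
from itertools import combinations

def recover_sparse_sources(A, X, tol=1e-8):
    # A: (m, n) mixing matrix, every m x m submatrix invertible
    # X: (m, T) observed samples; returns (n, T) estimated (m-1)-sparse sources
    m, n = A.shape
    T = X.shape[1]
    S = np.zeros((n, T))
    for t in range(T):
        x = X[:, t]
        best_idx, best_c, best_res = None, None, np.inf
        # try all supports of size m-1; by theorem 2 the consistent one is
        # (almost surely) unique
        for idx in combinations(range(n), m - 1):
            cols = list(idx)
            c, *_ = np.linalg.lstsq(A[:, cols], x, rcond=None)
            res = np.linalg.norm(A[:, cols] @ c - x)
            if res < best_res:
                best_idx, best_c, best_res = cols, c, res
        if best_res < tol:
            S[best_idx, t] = best_c
    return S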

2 Postnonlinear overcomplete SCA

2.1 Model

Consider n-dimensional k-sparse sources s with k < m. The postnonlinear mixing model [9] is defined to be

x = f(As)                                                         (2)

with a diagonal invertible function f with f(0) = 0 and a real m × n matrix A. Here a function f is said to be diagonal if each component f_i depends only on the i-th coordinate of its argument. In abuse of notation, we will in this case interpret the components f_i of f as functions with domain R and write f = f_1 × ... × f_m. The goal of overcomplete postnonlinear k-SCA is to determine the mixing functions f and A and the sources s given only x. Without loss of generality, we consider only the complete and the overcomplete case (i.e. m ≤ n). In the following we assume that the sources are sparse of level k := m − 1 and that the components f_i of f are continuously differentiable with f_i′(t) ≠ 0. This is equivalent to saying that the f_i are continuously differentiable with continuously differentiable inverses (diffeomorphisms).
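For concreteness, the forward model of equation (2) can be simulated as follows; this is a sketch under our own choices (uniform non-zero coefficients, and the mixing matrix and nonlinearities of the experiment in section 4), not part of the model itself.

import numpy as np

def sample_sparse_sources(n, m, T, seed=0):
    # draw T samples of n-dimensional sources that are (m-1)-sparse
    rng = np.random.default_rng(seed)
    S = rng.uniform(-0.5, 0.5, size=(n, T))
    for t in range(T):
        zeros = rng.choice(n, size=n - (m - 1), replace=False)  # keep at most m-1 non-zeros
        S[zeros, t] = 0.0
    return S

def postnonlinear_mixture(A, S, fs):
    # x = f(As) with a diagonal nonlinearity f = f_1 x ... x f_m, f_i(0) = 0
    Z = A @ S
    return np.vstack([f(Z[i]) for i, f in enumerate(fs)])

# example with n = 3 sources and m = 2 mixtures (values from section 4)
A = np.array([[4.3, 7.8, 0.59], [9.0, 6.2, 10.0]])
S = sample_sparse_sources(3, 2, 1000)
X = postnonlinear_mixture(A, S, [lambda t: np.tanh(t) + 0.1 * t, lambda t: t])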

2.2 Identifiability

Definition 2. Let A be an m × n matrix. A is said to be mixing if A has at least two non-zero entries in each row. A = (a_ij) is said to be absolutely degenerate if there are two columns k ≠ l such that a_{ik}^2 = λ a_{il}^2 for all i and a fixed λ ≠ 0, i.e. the normalized columns differ only by the signs of their entries.

Postnonlinear overcomplete SCA is a generalization of linear overcomplete SCA, so the indeterminacies of postnonlinear SCA contain at least the indeterminacies of linear overcomplete SCA: A can only be reconstructed up to scaling and permutation. Also, if L is an invertible scaling matrix, then

f(As) = (f ∘ L)((L^{-1}A)s),

so f and A can interchange scaling factors in each component. Two further indeterminacies occur if A is either not mixing or absolutely degenerate. In the first case, f_i cannot be identified if the i-th row of A contains only one non-zero element. In the case of an absolutely degenerate mixing matrix, sparseness alone cannot detect the nonlinearity, as the counterexample A = (1 1; 1 −1) with arbitrary f_1 ≡ f_2 shows.

If s is an n-dimensional random vector, its image (or the support of its density) is denoted by im s := {s(t)}.

Theorem 3 (Identifiability). Let s be an n-dimensional k-sparse random vector (k < m), and let x be an m-dimensional random vector constructed from s as in equation (2). Furthermore assume that

(i) s is fully k-sparse in the sense that im s equals the union of all k-dimensional coordinate subspaces (in which it is contained by the sparsity assumption),
(ii) A is mixing and not absolutely degenerate,
(iii) every m × m submatrix of A is invertible.

If x = f̂(Â ŝ) is another representation of x as in equation (2) with ŝ satisfying the same conditions as s, then there exists an invertible scaling L with f = f̂ ∘ L, and invertible scaling and permutation matrices L′, P′ with A = L Â L′ P′.


Fig. 1. Illustration of the proof of theorem 3 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (leftmost figure) are first linearly mapped onto R^2 by A and then postnonlinearly distorted by f := f_1 × f_2 (middle figure). Separation is performed by first estimating the separating postnonlinearities g := g_1 × g_2 and then performing overcomplete source recovery (right figure) according to the algorithms from [8]. The idea of the proof is that two lines spanned by coordinate vectors (thick lines, leftmost figure) are mapped onto two lines spanned by two columns of A. If the composition g ∘ f maps these lines onto some different lines (as sets), then we show that (given 'general position' of the two lines) the components of g ∘ f satisfy the conditions of lemma 1 and hence are already linear.

The proof relies on the fact that when s is fully k-sparse as formulated in 3(i), its image includes all the k-dimensional coordinate subspaces and hence intersections of k such subspaces, which give the n coordinate axes. These are transformed into n curves in x-space, all passing through the origin. By identifying these curves, we show that each nonlinearity satisfies a homogeneity relation and is hence linear by lemma 1 below. The full proof is omitted due to lack of space; figure 1 gives an illustration of the proof for the case n = 3 and m = 2.

Theorem 3 shows that f and A are uniquely determined by x except for scaling and permutation ambiguities. Note that then s is obviously also identifiable, by applying theorem 2 to the linearized mixtures y = f^{-1}(x) = As, given the additional assumptions on s from that theorem. For brevity, the theorem assumes in (i) that im s is the whole union of the k-dimensional coordinate subspaces; this condition can be relaxed (the proof is local in nature), but then the nonlinearities can only be identified on intervals where the corresponding marginal densities of As are non-zero (in addition, the proof requires them to be non-zero locally at 0). Furthermore, in practice the assumption on the image of s will have to be replaced by assuming the same with non-zero probability. Also note that almost every A ∈ R^{m×n} (in the measure sense) fulfills conditions (ii) and (iii).

The proof of theorem 3 uses the following lemma, a generalization of the analytic case presented in [10].

Lemma 1. Let a, b ∈ R \ {−1, 0, 1} with a > 0, and let f : [0, ε) → R be differentiable such that f(ax) = b f(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.
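A sketch of why lemma 1 holds (our own reconstruction; the paper omits the proof): first, f(0) = b f(0) and b ≠ 1 force f(0) = 0. Differentiating f(ax) = b f(x) gives a f′(ax) = b f′(x), and letting x → 0+ with f′(0+) ≠ 0 yields b = a. Assuming without loss of generality a < 1 and iterating f(ax) = a f(x), we get for every x ∈ (0, ε)

f(x)/x = f(a^k x)/(a^k x)  →  f′(0+)   as k → ∞,

so f(x) = f′(0+) x, i.e. f is linear.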

3 Algorithm for postnonlinear (over)complete SCA

The separation is done in a two-stage procedure: in the first step, after geometrical preprocessing, the postnonlinearities are estimated using an idea similar to the one used in the identifiability proof of theorem 3 (see also figure 1). In the second stage, the mixing matrix A and then the sources s are reconstructed by applying the linear algorithms from [8] (section 1) to the linearized mixtures f^{-1}(x). So in the following it is enough to reconstruct f.

3.1 Geometrical preprocessing

Let x(1), ..., x(T) ∈ R^m be i.i.d. samples of the random vector x. The goal of geometrical preprocessing is to construct vectors y(1), ..., y(T) and z(1), ..., z(T) ∈ R^m, using clustering or interpolation on the samples x(t), such that f^{-1}(y(t)) and f^{-1}(z(t)) lie on two linearly independent lines of R^m. In figure 1 they are to span the two thick lines, which already determine the postnonlinearities. Algorithmically, y and z can be constructed in the case m = 2 by first choosing far-away samples (on different, non-opposite curves) as initial starting points and then advancing towards the data set center by always choosing the closest sample of x with smaller modulus. Such an algorithm can also be implemented for larger m, but only for sources with at most one non-zero coefficient at each time instant; it can be generalized to sources of sparseness m − 1 using more elaborate clustering.
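A possible implementation of this greedy curve tracing for m = 2 is sketched below in Python/NumPy; the neighborhood rule and stopping criterion are our own choices, since neither the text above nor [8] fixes these details.

import numpy as np

def trace_curve(X, start_idx):
    # X: (2, T) array of mixture samples; start_idx: index of a far-away sample.
    # Greedily walk towards the origin, always moving to the nearest remaining
    # sample with smaller norm; returns the visited samples, outside-in.
    remaining = set(range(X.shape[1]))
    remaining.discard(start_idx)
    path = [start_idx]
    current = X[:, start_idx]
    while remaining:
        cand = [t for t in remaining if np.linalg.norm(X[:, t]) < np.linalg.norm(current)]
        if not cand:
            break
        nxt = min(cand, key=lambda t: np.linalg.norm(X[:, t] - current))
        path.append(nxt)
        remaining.discard(nxt)
        current = X[:, nxt]
    return X[:, path]

The two starting samples should lie on different (non-opposite) branches of the mixture scatterplot; the two traced point sets then play the roles of y(1), ..., y(T) and z(1), ..., z(T).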

3.2 Postnonlinearity estimation

Given the subspace vectors y(t) and z(t) from the previous section, the goal is to find C^1-diffeomorphisms g_i : R → R such that g_1 × ... × g_m maps the vectors y(t) and z(t) onto two different linear subspaces. In abuse of notation, we now assume that two curves (injective, infinitely differentiable mappings) y, z : (−1, 1) → R^m are given with y(0) = z(0) = 0. These can for example be constructed from the discrete sample points y(t) and z(t) of the previous section by polynomial or spline interpolation. If the two curves are mapped onto lines by g_1 × ... × g_m (and if these lines are in sufficiently general position), then g_i = λ_i f_i^{-1} for some λ_i ≠ 0 according to theorem 3. By requiring this condition only for the discrete sample points from the previous section, we obtain an approximation of the unmixing nonlinearities g_i.

Let i ≠ j be fixed. It is easy to see that, by projecting x, y and z onto the i-th and j-th coordinates, the problem of finding the nonlinearities can be reduced to the case m = 2, in which g_2 is to be reconstructed; we will assume this in the following. A is assumed to be mixing, so we can assume that the indices i, j were chosen such that the two lines f^{-1} ∘ y, f^{-1} ∘ z : (−1, 1) → R^2 do not coincide with the coordinate axes. Reparametrization (ȳ := y ∘ y_1^{-1}) of the curves lets us further assume that y_1 = z_1 = id. Then, after some algebraic manipulation, the condition that the separating nonlinearities g = g_1 × g_2 must map y and z onto lines can be written as g_2 ∘ y_2 = a g_1 = (a/b) g_2 ∘ z_2 with constants a, b ∈ R \ {0}, a ≠ ±b. So the goal of geometrical postnonlinearity detection is to find a C^1-diffeomorphism g on subsets of R with

g ∘ y = c g ∘ z                                                   (3)


for an unknown constant c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0. By theorem 3, g (and also c) is uniquely determined by y and z except for scaling. Indeed, by taking derivatives in equation (3) we get c = y′(0)/z′(0), so c can be calculated directly from the known curves y and z. In the following section, we propose to solve this problem numerically, given samples y(t_1), z(t_1), ..., y(t_T), z(t_T) of the curves. Note that here it is assumed that the samples of the curves y and z are given at the same time instants t_i ∈ (−1, 1). In practice this is usually not the case, so values of z at the sample points of y (and vice versa) will first have to be estimated, for example by using spline interpolation.
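Numerically, both c = y′(0)/z′(0) and the resampling of z at the sample points of y can be obtained from spline fits, e.g. as in the following sketch (the use of SciPy cubic splines and the function names are our own choices):

import numpy as np
from scipy.interpolate import CubicSpline

def estimate_c(t_y, y, t_z, z):
    # t_y, t_z: strictly increasing sample parameters whose range contains 0;
    # y, z: scalar curve samples with y(0) = z(0) = 0
    sy = CubicSpline(t_y, y)
    sz = CubicSpline(t_z, z)
    return sy(0.0, 1) / sz(0.0, 1)   # ratio of first derivatives at 0

def resample(t_y, t_z, z):
    # evaluate the z-curve at the sample points of the y-curve
    return CubicSpline(t_z, z)(t_y)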

3.3 MLP-based postnonlinearity approximation

We want to find an approximation g̃ (in some parametrization) of g with g̃(y(t_i)) = c g̃(z(t_i)) for i = 1, ..., T, so in the most general sense we want to find

g̃ = argmin_g E(g),   E(g) := (1/(2T)) Σ_{i=1}^T ( g(y(t_i)) − c g(z(t_i)) )².        (4)

In order to minimize this energy function E(g), a single-input single-output multilayer perceptron (MLP) is used to parametrize the nonlinearity g. Here we choose one hidden layer of size d. This means that the approximation g̃ can be written as

g̃(t) = w^{(2)T} σ̄( w^{(1)} t + b^{(1)} ) + b^{(2)}

with weight vectors w^{(1)}, w^{(2)} ∈ R^d and biases b^{(1)} ∈ R^d, b^{(2)} ∈ R. Here σ denotes an activation function, usually the logistic sigmoid σ(t) := (1 + e^{−t})^{−1}, and we set σ̄ := σ × ... × σ (d times). The MLP weights are restricted in the sense that g̃(0) = 0 and g̃′(0) = 1. This implies b^{(2)} = −w^{(2)T} σ̄(b^{(1)}) and Σ_{i=1}^d w_i^{(1)} w_i^{(2)} σ′(b_i^{(1)}) = 1. Especially the second normalization is very important for the learning step; otherwise the weights could all converge to the (valid) zero solution. So the outer bias is not trained by the network; we could fix a second weight in order to guarantee the second condition, but this would result in an unstable quotient calculation. Instead, it is preferable to perform network training on a submanifold of the weight space given by the second weight restriction. This results in an additional Lagrange term in the energy function from equation (4):

Ē(g̃) := (1/(2T)) Σ_{j=1}^T ( g̃(y(t_j)) − c g̃(z(t_j)) )² + λ ( Σ_{i=1}^d w_i^{(1)} w_i^{(2)} σ′(b_i^{(1)}) − 1 )²        (5)

with suitably chosen λ > 0. Learning of the weights is performed via backpropagation on this energy function. The gradient of Ē(g̃) with respect to the weights can easily be calculated from the Euclidean gradient of g̃. For the learning process, we further note that all weights w_i^{(j)} should be kept nonnegative in order to ensure invertibility of g̃. In order to increase convergence speed, the Euclidean gradient should be replaced by the natural gradient [11], which in experiments enhances the performance of the algorithm in terms of speed by a factor of roughly 10.
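A compact sketch of this constrained fit in Python/NumPy is given below. It uses plain gradient descent with a numerical gradient for brevity; the analytic backpropagation and the natural-gradient update of [11] are not shown, and the initialization, λ and the learning schedule are our own choices.

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def dsigma(t):
    s = sigma(t)
    return s * (1.0 - s)

def g_tilde(t, w1, w2, b1):
    # MLP nonlinearity with the constraint g(0) = 0 built in (b2 eliminated)
    t = np.atleast_1d(t)
    return sigma(np.outer(t, w1) + b1) @ w2 - sigma(b1) @ w2

def energy(params, y, z, c, lam):
    # energy (5): data term plus penalty enforcing g'(0) = 1
    d = len(params) // 3
    w1, w2, b1 = params[:d], params[d:2*d], params[2*d:]
    data = 0.5 * np.mean((g_tilde(y, w1, w2, b1) - c * g_tilde(z, w1, w2, b1)) ** 2)
    penalty = (np.sum(w1 * w2 * dsigma(b1)) - 1.0) ** 2
    return data + lam * penalty

def fit(y, z, c, d=9, lam=10.0, eta=0.01, iters=5000, eps=1e-5, seed=0):
    # y, z: arrays of curve samples y_2(t_j), z_2(t_j); returns the MLP parameters
    rng = np.random.default_rng(seed)
    params = np.abs(rng.normal(0.3, 0.1, size=3 * d))
    for _ in range(iters):
        grad = np.zeros_like(params)             # central-difference gradient
        for k in range(len(params)):
            e = np.zeros_like(params)
            e[k] = eps
            grad[k] = (energy(params + e, y, z, c, lam)
                       - energy(params - e, y, z, c, lam)) / (2 * eps)
        params -= eta * grad
        params[:2*d] = np.maximum(params[:2*d], 0.0)  # keep w^(1), w^(2) nonnegative
    return params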

4 Experiment

The postnonlinear mixture of three sources to two mixtures is considered. 10^5 samples of artificially generated sources with one non-zero coefficient (drawn uniformly from [−0.5, 0.5]) are used; we refer to figure 2 for a plot of the sources, mixtures and recoveries. The sources were mixed using the postnonlinear mixing model x = f_1 × f_2 (As) with mixing matrix A = (4.3 7.8 0.59; 9 6.2 10) and postnonlinearities f_1(x) = tanh(x) + 0.1x and f_2(x) = x. For easier visualization and evaluation of the algorithm, we chose f_2 to be linear and did not add any noise. The MLP-based postnonlinearity detection algorithm from section 3.3 with natural gradient-descent learning, 9 hidden neurons, a learning rate of η = 0.01 and 10^5 iterations gives a good approximation of the unmixing nonlinearities g_i. Linear overcomplete SCA is then applied to g_1 × g_2 (x): for practical reasons (due to approximation errors the data is not fully linearized), instead of the matrix recovery algorithm from [8] we use a modification of the geometric ICA algorithm [4], which is known to work well in the very sparse one-dimensional case, and obtain the recovered mixing matrix Â = (−0.46 −0.81 −0.069; −0.89 −0.58 −1.0), which except for scaling and permutation coincides well with A. Source recovery then yields signals whose (normalized) signal-to-noise ratios (SNRs) with the original sources are high: 26, 71 and 46 dB, respectively.
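The quoted SNRs compare each recovered source with the corresponding original one after compensating the scaling and permutation indeterminacy. A small evaluation helper of our own (not from the paper) could look like this:

import numpy as np

def snr_db(s, s_hat):
    # SNR of s_hat with respect to s after optimal least-squares rescaling
    a = np.dot(s, s_hat) / np.dot(s_hat, s_hat)
    noise = s - a * s_hat
    return 10.0 * np.log10(np.dot(s, s) / np.dot(noise, noise))

def match_and_score(S, S_hat):
    # greedily match recovered to original sources, then report per-source SNR (dB)
    n = S.shape[0]
    used, scores = set(), []
    for i in range(n):
        best = max((j for j in range(n) if j not in used),
                   key=lambda j: snr_db(S[i], S_hat[j]))
        used.add(best)
        scores.append(snr_db(S[i], S_hat[best]))
    return scores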

Fig. 2. Example: (a) shows the 1-sparse source signals and (b) the postnonlinear overcomplete mixtures. The original source directions can be clearly seen in the structure of the mixture scatterplot (c). The crosses and stars indicate the interpolation points found by geometrical preprocessing and used for approximating the separating nonlinearities. According to theorem 3, the sources can then be recovered uniquely (d), except for permutation and scaling.

References

1. Cichocki, A., Amari, S.: Adaptive blind signal and image processing. John Wiley & Sons (2002)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley & Sons (2001)
3. Lee, T., Lewicki, M., Girolami, M., Sejnowski, T.: Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters 6 (1999) 87–90
4. Theis, F., Lang, E., Puntonet, C.: A geometric algorithm for overcomplete linear ICA. Neurocomputing 56 (2004) 381–398
5. Zibulevsky, M., Pearlmutter, B.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13 (2001) 863–882
6. Eriksson, J., Koivunen, V.: Identifiability and separability of linear ICA models revisited. In: Proc. of ICA 2003 (2003) 23–27

7. Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 (1998) 33–61
8. Georgiev, P., Theis, F., Cichocki, A.: Blind source separation and sparse component analysis of overcomplete mixtures. In: Proc. of ICASSP 2004, Montreal, Canada (2004)
9. Taleb, A., Jutten, C.: Source separation in post-nonlinear mixtures. IEEE Transactions on Signal Processing 47 (1999) 2807–2820
10. Babaie-Zadeh, M., Jutten, C., Nayebi, K.: A geometric approach for separating post non-linear mixtures. In: Proc. of EUSIPCO '02, Volume II, Toulouse, France (2002) 11–14
11. Amari, S., Park, H., Fukumizu, K.: Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation 12 (2000) 1399–1409