Mutual Information Approach to Blind Separation of Stationary Sources

Dinh Tuan Pham
Laboratoire LMC/IMAG, C.N.R.S., University of Grenoble, B.P. 53X, 38041 Grenoble cedex, France

Keywords: Contrast. Convolution. Entropy. Independent component analysis. Kullback-Leibler divergence. Mutual information. Separation of sources. Stationary process.

Abstract This paper presents a unified approach to the problem of separation of sources, based on the consideration of mutual information. The basic setup is that the sources are independent stationary random processes which are mixed either instantaneously or through a convolution to produce the observed records. We define the entropy of stationary processes and then the mutual information between them as a measure of their independence. This provides us with a contrast for the separation of sources problem. For practical implementation, we introduce several degraded forms of this contrast, which can be computed from a finite-dimensional distribution of the reconstructed source processes only. From them, we derive several sets of estimating equations generalising those considered earlier.

1 Introduction

Blind separation of sources is a topic which has received much attention recently, as it has many important applications (speech analysis, radar, sonar, ...). Basically, one observes several linear instantaneous or convolutive mixtures of independent signals, called sources, and the problem is to recover them from the observations. This problem is called blind since one doesn't have any specific knowledge of

the structure of the sources, the separation being based only on their independence and some general assumptions such as stationarity. A more general setup, which has also been considered, further assumes that the observed channels are corrupted by noise, but in this paper we shall restrict ourselves to the pure mixture case. Further, we shall concentrate on so-called "batch processing", in which a large block of data has been recorded and is processed off line, although some of our results and ideas can certainly be generalized to the on-line adaptive approach. Even in this restricted area, there have already been a large number of papers devoted to the subject ([1]-[3], [5], [7], [10]-[12], [13]-[15], [16], [17], ...). Many of them propose an ad hoc method based on the consideration of higher order moments, although some more systematic treatments based on the likelihood and contrasts ([5], [14]) and on independent component analysis ([4], [14]) have also been presented. Many of the above works also concern the instantaneous case only, and little has been done in the convolutive mixture case, as the problem is much more complex. In this paper, we aim at providing a general framework for solving the separation of sources problem, in both the instantaneous and the convolutive mixture cases. As the title says, our approach is based on the use of mutual information, which provides us with a natural contrast for the problem. By differentiation, we obtain a set of separating functions which can be related to our earlier works ([13], [14]). Our emphasis will be on the general ideas and concepts; implementation details will therefore not be discussed (but can be the subject of subsequent work). To proceed, let us describe the problem in mathematical terms and introduce some notation. We assume that $J$ sequences of observations $X_j(\cdot) = \{X_j(t),\ t = 1, 2, \ldots\}$, $j = 1, \ldots, J$, are available, each being a mixture of $K$ independent sources, either instantaneously or through a convolution. More precisely, let $S_k(\cdot) =$

$\{S_k(t),\ t \in \mathbb{Z}\}$, $k = 1, \ldots, K$, denote the sources. One has

$$X(t) = A\,S(t) \qquad \text{(instantaneous mixture)} \tag{1.1}$$

$$X(t) = \sum_{l=-\infty}^{\infty} A(l)\,S(t-l) \qquad \text{(convolutive mixture)} \tag{1.2}$$

where $X(t)$ and $S(t)$ denote the vectors $[X_1(t)\ \cdots\ X_J(t)]^T$ and $[S_1(t)\ \cdots\ S_K(t)]^T$ and $A$, $A(l)$ are $J \times K$ constant matrices. For simplicity, the symbol $\star$ will be used to denote convolution, so that the right hand side of (1.2) may be written as $(A \star S)(t)$. To extract the sources one naturally performs an "inverse" transformation on the sequence of observed vectors, namely

$$Y(t) = B\,X(t) \qquad \text{(instantaneous mixture case)} \tag{1.3}$$

$$Y(t) = (B \star X)(t) = \sum_{l=-\infty}^{\infty} B(l)\,X(t-l) \qquad \text{(convolutive mixture case)} \tag{1.4}$$

where $B$ and $\{B(l),\ l \in \mathbb{Z}\}$ are the reconstruction matrix and the sequence of reconstruction matrices, of order $K \times J$. In the blind separation context, one knows nothing of the structure of the sources and the only fundamental assumption one relies on is that the source sequences are mutually independent. Thus a sensible way is to look for the reconstruction matrix $B$, or sequence of matrices $\{B(l),\ l \in \mathbb{Z}\}$, such that the output sequence $\{Y(t),\ t \in \mathbb{Z}\}$ of (1.3) or (1.4) has components as independent as possible. This is similar to the goal of independent component analysis (ICA). However, in ICA the observed sequence $\{X(t),\ t \in \mathbb{Z}\}$ is not necessarily a mixture of independent sources, and further, ICA has so far been restricted to instantaneous transformations. To solve our problem, we need a good measure of dependence between stationary random processes (as we will assume that the sources are stationary). In this paper we will introduce such a measure based on mutual information. This yields a contrast for extracting the sources. Needless to say, it is only a theoretical contrast,
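To make the setup of (1.1) and (1.3) concrete, the following sketch simulates an instantaneous mixture of two independent stationary sources and applies a candidate reconstruction matrix. It is only an illustration under our own assumptions (the choice of sources, the mixing matrix and the variable names are not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000

# Two independent stationary sources: uniform white noise and a Laplacian AR(1).
s1 = rng.uniform(-1, 1, T)
eps = rng.laplace(size=T)
s2 = np.empty(T)
s2[0] = eps[0]
for t in range(1, T):
    s2[t] = 0.5 * s2[t - 1] + eps[t]     # simple stationary AR(1) source
S = np.vstack([s1, s2])                  # K x T matrix of sources

A = np.array([[1.0, 0.6],                # J x K mixing matrix, model (1.1)
              [0.4, 1.0]])
X = A @ S                                # observed instantaneous mixtures

B = np.linalg.inv(A)                     # ideal reconstruction matrix for (1.3)
Y = B @ X                                # reconstructed sources

# With the exact inverse, Y recovers S up to numerical error.
print(np.max(np.abs(Y - S)))
```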

in the sense that it requires knowledge of the distribution (or more exactly the density function) of the sources. In practice, such a distribution must be estimated from the data. This estimation problem is not considered here (for a simple case, see Pham [13]), although we are fully aware of the difficulties which may arise in such a problem and we will try to alleviate them as far as possible. Note that contrasts based on higher order moments, for example, also implicitly require knowledge of the distribution of the sources; the only difference is that such moments may be estimated directly from the data without having to estimate the density function.

2 Mutual information, entropy and Kullback-Leibler divergence

Let $Y_1, \ldots, Y_K$ be a set of $K$ random vectors with joint density $f_{Y_1,\ldots,Y_K}$. The mutual information between them is defined as

$$I(Y_1, \ldots, Y_K) = -\int \log\frac{\prod_{k=1}^{K} f_{Y_k}(y_k)}{f_{Y_1,\ldots,Y_K}(y_1, \ldots, y_K)}\, f_{Y_1,\ldots,Y_K}(y_1, \ldots, y_K)\, dy_1 \cdots dy_K = \sum_{k=1}^{K} H(Y_k) - H(Y_1, \ldots, Y_K)$$

where $f_{Y_k}$ is the marginal density of $Y_k$ and

$$H(Y_1, \ldots, Y_K) = -\int \log[f_{Y_1,\ldots,Y_K}(y_1, \ldots, y_K)]\, f_{Y_1,\ldots,Y_K}(y_1, \ldots, y_K)\, dy_1 \cdots dy_K$$

$$H(Y_k) = -\int \log[f_{Y_k}(y_k)]\, f_{Y_k}(y_k)\, dy_k$$

are the (Shannon) joint and marginal entropies of $Y_1, \ldots, Y_K$ and of $Y_k$, respectively. Note that $H(Y_1, \ldots, Y_K)$ is the same as $H(Y)$ where $Y$ is the vector obtained by stacking the components of the vectors $Y_1, \ldots, Y_K$. We shall use the two notations interchangeably. The above mutual information is in fact none other than the Kullback-Leibler divergence between the joint density $f_{Y_1,\ldots,Y_K}(y_1, \ldots, y_K)$ of $Y_1, \ldots, Y_K$ and the density $\prod_{k=1}^{K} f_{Y_k}(y_k)$ which would arise if these random variables were independent.
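As an illustration of the definition above, the following sketch estimates $I(Y_1, Y_2)$ for two scalar variables from samples, using crude histogram plug-in estimates of the marginal and joint entropies. The binning scheme and sample sizes are arbitrary choices of ours for illustration, not a recommendation of the paper.

```python
import numpy as np

def entropy_hist(samples, bins=30):
    """Crude plug-in entropy estimate (in nats) from a histogram."""
    hist, edges = np.histogramdd(samples, bins=bins)
    p = hist / hist.sum()
    widths = [np.diff(e) for e in edges]
    cell = np.prod(np.meshgrid(*widths, indexing="ij"), axis=0)
    nz = p > 0
    # H = -sum p * log(density), with density = p / (cell volume)
    return -np.sum(p[nz] * np.log(p[nz] / cell[nz]))

rng = np.random.default_rng(1)
n = 50_000
y1 = rng.normal(size=n)
y2 = 0.8 * y1 + 0.6 * rng.normal(size=n)   # correlated with y1

H1 = entropy_hist(y1[:, None])
H2 = entropy_hist(y2[:, None])
H12 = entropy_hist(np.column_stack([y1, y2]))
print("I(Y1, Y2) approx:", H1 + H2 - H12)   # roughly -0.5*log(1 - 0.8**2) ~ 0.51
```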

From the inequality $\log a \le a - 1$ for all $a \ge 0$, with equality attained only at $a = 1$, one gets that $I(Y_1, \ldots, Y_K)$ is non negative and can be zero only if $f_{Y_1,\ldots,Y_K}(y_1, \ldots, y_K) = \prod_{k=1}^{K} f_{Y_k}(y_k)$ almost everywhere, that is, if $Y_1, \ldots, Y_K$ are independent. Thus the mutual information may be viewed as a measure of dependency between a set of random vectors. As a consequence of the above result, the joint entropy is at most equal to the sum of the marginal entropies, with equality being achieved if and only if the random vectors are independent. In particular $H(Y_1, Y_2) - H(Y_2) \le H(Y_1)$. But

$$H(Y_1, Y_2) - H(Y_2) = -\int \log\Big[\frac{f_{Y_1,Y_2}(y_1, y_2)}{f_{Y_2}(y_2)}\Big] f_{Y_1,Y_2}(y_1, y_2)\, dy_1\,dy_2$$

and thus is none other than the expectation of the entropy of the conditional distribution (or expected conditional entropy for short) of $Y_1$ given $Y_2$, which is defined as

$$H(Y_1\,|\,Y_2) = -\int \Big\{\int [\log f_{Y_1|Y_2}(y_1, y_2)]\, f_{Y_1|Y_2}(y_1, y_2)\, dy_1\Big\} f_{Y_2}(y_2)\, dy_2,$$

where $f_{Y_1|Y_2}(y_1, y_2) = f_{Y_1,Y_2}(y_1, y_2)/f_{Y_2}(y_2)$ denotes the conditional density of $Y_1$ at $y_1$ given that $Y_2 = y_2$. Thus

$$H(Y_1\,|\,Y_2) = H(Y_1, Y_2) - H(Y_2) \le H(Y_1). \tag{2.1}$$

This result can be generalized. If Y0 is another random vector,

$$H(Y_1\,|\,Y_2, Y_0) \le H(Y_1\,|\,Y_0). \tag{2.2}$$

To show this, note that the expected conditional entropies satisfy an inequality similar to that of the entropies:

$$H(Y_1, \ldots, Y_K\,|\,Y_0) \le \sum_{k=1}^{K} H(Y_k\,|\,Y_0),$$

with equality if and only if the random vectors $Y_1, \ldots, Y_K$ are conditionally independent given $Y_0$, that is, their joint conditional density given $Y_0$ factors as a product. This can be proved by repeating the same calculations as at the beginning of this section, but with the joint and marginal densities of $Y_1, \ldots, Y_K$ replaced by the conditional joint and marginal densities given $Y_0$, and then integrating everything with respect to the density of $Y_0$. From the above result and (2.1), one gets

$$H(Y_1, Y_2, Y_0) - H(Y_0) \le H(Y_1, Y_0) + H(Y_2, Y_0) - 2H(Y_0).$$

Using (2.1) again one gets the announced result (2.2).

Consider now $K$ stationary random (scalar) processes $Y_k(\cdot) = \{Y_k(t),\ t \in \mathbb{Z}\}$, $k = 1, \ldots, K$. For convenience we shall in the sequel denote by $Z(1{:}T)$, $Z(\cdot)$ being a sequence of random variables or real numbers, the vector with components

$Z(1), \ldots, Z(T)$. Then, according to (2.1),

$$H[Y_k(1{:}T)] = H[Y_k(1)] + \sum_{t=2}^{T} H[Y_k(t)\,|\,Y_k(1{:}t-1)].$$

On the other hand, (2.2) yields that $H[Y_k(t)\,|\,Y_k(1{:}t-1)] \le H[Y_k(t)\,|\,Y_k(2{:}t-1)]$, and by stationarity the last right hand side is the same as $H[Y_k(t-1)\,|\,Y_k(1{:}t-2)]$. Thus $H[Y_k(t)\,|\,Y_k(1{:}t-1)]$ is a non increasing function of $t$ and hence must converge to a limit (which could be minus infinity) as $t \to \infty$. It follows that $H[Y_k(1{:}T)]/T$ converges to the same limit as $T \to \infty$. This common limit is, by our definition, the entropy of the random process $Y_k(\cdot)$ and will be denoted by $H[Y_k(\cdot)]$. More generally, the joint entropy of the random processes $Y_1(\cdot), \ldots, Y_K(\cdot)$ is defined as

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] = \lim_{T\to\infty} \frac{1}{T} H[Y_1(1{:}T), \ldots, Y_K(1{:}T)] = \lim_{T\to\infty} H[Y(T)\,|\,Y(1{:}T-1)] \tag{2.3}$$

where $Y(t)$ is the vector with components $Y_1(t), \ldots, Y_K(t)$ and $Y(1{:}T)$ is the vector obtained by stacking the components of $Y(1), \ldots, Y(T)$. This notation will be used consistently throughout. The existence of the above limits and their equality can

be deduced from (2.1) and (2.2) in a completely similar way as before. Note that one also has

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] \le H[Y(T)\,|\,Y(1{:}T-1)] \le \frac{1}{T} H[Y(1{:}T)] \tag{2.4}$$

and that $H[Y(T)\,|\,Y(1{:}T-1)]$ and $\frac{1}{T}H[Y(1{:}T)]$ are non increasing with respect to $T$. A further interesting result is that

$$H[Y_k(\cdot)] \le H[Y_k(t)], \qquad H[Y_1(\cdot), \ldots, Y_K(\cdot)] \le H[Y_1(t), \ldots, Y_K(t)], \tag{2.5}$$

with the first and second inequalities being equalities if and only if the variables $Y_k(t)$, $t \in \mathbb{Z}$, and the random vectors $Y(t)$, $t \in \mathbb{Z}$, respectively, are independent. (Note that the right hand sides of (2.5) do not depend on $t$ by stationarity.) The first inequality comes from the definition (2.3) and the inequality $H[Y_k(t)\,|\,Y_k(1{:}t-1)] \le H[Y_k(t)]$, which is a simple consequence of (2.1); it is clear that equality can be achieved if and only if $H[Y_k(t)\,|\,Y_k(1{:}t-1)] = H[Y_k(t)]$ for all $t > 1$, and this implies that the $Y_k(t)$, $t \in \mathbb{Z}$, are independent. A similar argument applies for the case of the joint entropy. The mutual information between the processes $Y_1(\cdot), \ldots, Y_K(\cdot)$ can now be defined as

$$I[Y_1(\cdot), \ldots, Y_K(\cdot)] = \sum_{k=1}^{K} H[Y_k(\cdot)] - H[Y_1(\cdot), \ldots, Y_K(\cdot)],$$

or equivalently

$$I[Y_1(\cdot), \ldots, Y_K(\cdot)] = \lim_{T\to\infty} \frac{1}{T} I[Y_1(1{:}T), \ldots, Y_K(1{:}T)].$$

Thus the mutual information between a set of stationary processes is again a measure of their dependency. The rest of this section is devoted to the calculation of the entropy of a stationary random process. The only case in which an analytic formula is available is the Gaussian case.

Proposition 2.1 Let $Y_1(\cdot), \ldots, Y_K(\cdot)$ be jointly Gaussian stationary random processes. Then their joint entropy equals

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] = \frac{1}{2}\log\det(2\pi G) + \frac{K}{2} = \frac{1}{4\pi}\int_{-\pi}^{\pi}\log\det[4\pi^2 f(\lambda)]\,d\lambda + \frac{K}{2}$$

where $G$ is the innovation covariance matrix of the process $Y(\cdot) = [Y_1(\cdot)\ \cdots\ Y_K(\cdot)]^T$ (that is, the covariance matrix of $Y(t)$ minus its best linear predictor based on $Y(s)$, $s < t$) and $f$ is its spectral density.

In order to go beyond Gaussian processes, we shall establish a useful result relating the entropy of a filtered process to that of the original process. To avoid technical difficulties, we shall limit ourselves to a class of "well behaved" filters.
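As a quick numerical illustration of Proposition 2.1, the sketch below compares the innovation form and the spectral form of the entropy rate for a scalar Gaussian AR(1) process. The example, its parameters and the quadrature scheme are our own assumptions, used only as a sanity check.

```python
import numpy as np

# Scalar Gaussian AR(1): Y(t) = a*Y(t-1) + e(t), with innovation variance s2.
a, s2 = 0.7, 1.0

# Innovation form: H = 0.5*log(2*pi*G) + 1/2, with G = s2.
h_innov = 0.5 * np.log(2 * np.pi * s2) + 0.5

# Spectral form: f(lam) = s2 / (2*pi*|1 - a*exp(i*lam)|^2),
# H = (1/(4*pi)) * int_{-pi}^{pi} log(4*pi^2*f(lam)) dlam + 1/2.
lam = np.linspace(-np.pi, np.pi, 200_001)
f = s2 / (2 * np.pi * np.abs(1 - a * np.exp(1j * lam)) ** 2)
h_spec = 0.5 * np.mean(np.log(4 * np.pi**2 * f)) + 0.5   # mean over [-pi, pi] = integral/(2*pi)

print(h_innov, h_spec)   # the two values agree up to quadrature error
```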

Definition 2.1 A sequence of square matrices $\{B(l),\ l \in \mathbb{Z}\}$ is said to belong to the class $\mathcal{A}$ if $\sum_{l=-\infty}^{\infty}\|B(l)\| < \infty$ and $\det[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}] \ne 0$ for all $\lambda$.

By restricting to sequences $\{B(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}$, the convolution $(B \star X)(\cdot)$ is well defined for any stationary process $X(\cdot)$ with finite first moment and is itself stationary with finite first moment. Further, this class has the nice properties that it is closed with respect to convolution and that any of its elements admits an inverse. The first property means that if the sequences $\{B(l),\ l \in \mathbb{Z}\}$ and $\{C(l),\ l \in \mathbb{Z}\}$ both belong to $\mathcal{A}$, then so does their convolution, in either order (note that matrix multiplication is not commutative). The second property means that for any sequence $\{B(l),\ l \in \mathbb{Z}\}$ in $\mathcal{A}$, there exists a sequence $\{C(l),\ l \in \mathbb{Z}\}$ in $\mathcal{A}$ such that $(B \star C)(l) = (C \star B)(l) = 0$ if $l \ne 0$ and $= I$ if $l = 0$. The sequence $\{C(l),\ l \in \mathbb{Z}\}$ is actually none other than the sequence of Fourier coefficients of the function $\lambda \mapsto [\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}]^{-1}$. The above results follow easily from a result of Wiener which says that if $f$ is a $2\pi$-periodic function, nonzero everywhere, with absolutely summable Fourier coefficients, then the same is true for $1/f$ (see for example [18], p. 245).

Proposition 2.2 (i) Let $X$ be a random vector and $B$ be an invertible matrix. Then the entropy of $Y = BX$ is given by

$$H(Y) = H(X) + \log|\det B|.$$

(ii) Let $X(\cdot)$ be a vector stationary random process and $Y(\cdot) = (B \star X)(\cdot)$, where $\{B(l),\ l \in \mathbb{Z}\}$ is a sequence of matrices of class $\mathcal{A}$ with only a finite number of nonzero terms. Then the joint entropy of the component processes $Y_1(\cdot), \ldots, Y_K(\cdot)$ of the vector process $Y(\cdot)$ satisfies

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] \ge H[X_1(\cdot), \ldots, X_K(\cdot)] + \int_{-\pi}^{\pi} \log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi},$$

$X_1(\cdot), \ldots, X_K(\cdot)$ denoting the component processes of the process $X(\cdot)$.

It would be nice if one could drop the restriction that the sequence $\{B(l),\ l \in \mathbb{Z}\}$ in part (ii) of the above Proposition contains only a finite number of nonzero terms. To this end, we shall assume that the entropy functional is lower semicontinuous at $Y(\cdot)$ with respect to convolution, in the sense that

(C): For any $T > 1$ and any $\epsilon > 0$, there exists $\delta > 0$ such that

$$H[Y(1) + (C \star Y)(1), \ldots, Y(T) + (C \star Y)(T)] \le H[Y(1), \ldots, Y(T)] + \epsilon$$

for all sequences $\{C(l),\ l \in \mathbb{Z}\}$ satisfying $\sum_{l=-\infty}^{\infty}\|C(l)\| < \delta$.

Admittedly, the above condition would be difficult to check, but we haven't been able to derive a simpler one. The problem is that the entropy is defined through the density, which changes in a subtle way with respect to addition of random vectors. Nevertheless, we are confident that this continuity condition holds generally in practice. (It clearly holds in the case of a Gaussian distribution.)

Proposition 2.2' The inequality in part (ii) of Proposition 2.2 holds with $\{B(l),\ l \in \mathbb{Z}\}$ being a sequence of matrices of class $\mathcal{A}$ if the condition (C) relative to the process $Y(\cdot)$ is satisfied. If this condition is also satisfied relative to the process $X(\cdot)$, then this inequality becomes an equality.

The above result is fundamental in that it describes how the entropy changes when the process is filtered. In particular it permits the calculation of the entropy of a linear process through the following Proposition, which follows from it directly.

Proposition 2.3 Let $Y_1(\cdot), \ldots, Y_K(\cdot)$ be jointly stationary processes, denote by $Y(t)$ the vector with components $Y_1(t), \ldots, Y_K(t)$, and assume that the condition (C) is satisfied relative to the process $(B \star Y)(\cdot)$ for all sequences $\{B(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}$. Then

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] \le H[(B \star Y)(0)] - \int_{-\pi}^{\pi} \log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}$$

with equality if and only if the random vectors $(B \star Y)(t)$, $t \in \mathbb{Z}$, are mutually independent.

Recall that a vector process $Y(\cdot)$ is called linear if it admits a representation of the form

$$Y(t) = \sum_{l=-\infty}^{\infty} A(l)\,e(t-l)$$

where $\{e(t),\ t \in \mathbb{Z}\}$ is a sequence of independent identically distributed (iid) random vectors and $\{A(l),\ l \in \mathbb{Z}\}$ is a sequence of matrices. We shall assume that the last sequence is of class $\mathcal{A}$; then in the above Proposition one may take $\{B(l),\ l \in \mathbb{Z}\}$ to be its inverse, that is, $\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda} = [\sum_{l=-\infty}^{\infty} A(l)e^{il\lambda}]^{-1}$. Then $(B \star Y)(t) = e(t)$ and the inequality in this Proposition becomes an equality. It follows that the entropy of this linear process may be computed as

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] = H[e(0)] + \int_{-\pi}^{\pi} \log\Big|\det\Big[\sum_{l=-\infty}^{\infty} A(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}.$$

Clearly Proposition 2.3 still holds if the class $\mathcal{A}$ is replaced by a smaller subclass. A subclass of interest is the subclass $\mathcal{A}^+$ of sequences $\{B(l),\ l \in \mathbb{Z}\}$ which are causal and have a causal inverse, in the sense that $B(l) = 0$ for $l < 0$ and the Fourier coefficients of the function $\lambda \mapsto [\sum_{l=0}^{\infty} B(l)e^{il\lambda}]^{-1}$ vanish at negative indexes. It is well known that the last condition is equivalent to $\det[\sum_{l=0}^{\infty} B(l)z^l] \ne 0$ for all complex numbers $z$ of modulus not exceeding 1. This condition implies

$$\oint_{\mathcal{C}} \log\det\Big[I + \sum_{l=1}^{\infty} B(0)^{-1}B(l)z^l\Big]\,\frac{dz}{2\pi i z} = 0,$$

the integral being over the unit circle $\mathcal{C}$ of the complex plane; hence

$$\int_{-\pi}^{\pi} \log\Big|\det\Big[\sum_{l=0}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi} = \log|\det B(0)|. \tag{2.6}$$

Thus one gets the same result as in Proposition 2.3 with $\mathcal{A}$ replaced by $\mathcal{A}^+$ and the integral there replaced by $\log|\det B(0)|$.
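The identity (2.6) is easy to verify numerically for a short causal filter. The following sketch is our own check, with an arbitrarily chosen pair B(0), B(1) satisfying the minimum-phase condition; it evaluates the left hand side by simple quadrature.

```python
import numpy as np

# A causal 2x2 filter B(0) + B(1) z with det[B(0) + B(1) z] != 0 for |z| <= 1.
B0 = np.array([[2.0, 0.3],
               [0.1, 1.5]])
B1 = np.array([[0.2, 0.0],
               [0.1, 0.3]])

lam = np.linspace(-np.pi, np.pi, 20_001)
lhs_vals = [np.log(abs(np.linalg.det(B0 + B1 * np.exp(1j * l)))) for l in lam]
lhs = np.mean(lhs_vals)                 # approximates the integral over dlam/(2*pi)
rhs = np.log(abs(np.linalg.det(B0)))
print(lhs, rhs)                         # the two values agree closely
```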

3 Criteria

We shall assume throughout that the sequence $\{A(l),\ l \in \mathbb{Z}\}$ in (1.2) is of class $\mathcal{A}$; hence for the reconstruction (1.4) we will restrict ourselves to sequences $\{B(l),\ l \in \mathbb{Z}\}$ of this class. From the results of the previous section, to extract the sources in the model (1.2) one may minimize the criterion

$$C_\infty = \sum_{k=1}^{K} H[Y_k(\cdot)] - \int_{-\pi}^{\pi} \log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}, \tag{3.1}$$

where $Y_1(\cdot), \ldots, Y_K(\cdot)$ are the component processes of the vector process $Y(\cdot)$ defined by (1.4). The obtained $Y_1(\cdot), \ldots, Y_K(\cdot)$ are the reconstructed sources. In

the instantaneous mixture case (model (1.1)), $Y(\cdot)$ will be given by (1.3) and one simply replaces the integral in (3.1) by $\log|\det B|$. It is worthwhile to note that the criterion $C_\infty$ is invariant with respect to permutation and pre-convolution with a sequence of diagonal matrices. More precisely, it is unchanged when one replaces $B(l)$ by $P(D \star B)(l)$, with $P$ a permutation matrix and $\{D(m),\ m \in \mathbb{Z}\}$ a sequence of diagonal matrices of class

$\mathcal{A}$. Indeed, denoting by $d_1(m), \ldots, d_K(m)$ the diagonal elements of $D(m)$, this replacement would change the process $Y_k(\cdot)$ in (3.1) to $(d_k \star Y_k)(\cdot)$ and subtract from (3.1) the term $\sum_{k=1}^{K}\int_{-\pi}^{\pi}\log|\sum_{m=-\infty}^{\infty} d_k(m)e^{im\lambda}|\,d\lambda/(2\pi)$; hence, by Propositions 2.2 and 2.2', the criterion is unchanged. In particular, in the case of an instantaneous mixture (model (1.1)), one may pre-multiply $B$ by a permuted diagonal matrix without changing $C_\infty$. The above invariance property makes clear what one may already know intuitively, namely that it is only possible to recover the sources up to a permutation and a convolution (or a scale factor in the instantaneous mixture case, see [15]).

The criterion $C_\infty$ is of theoretical interest only. The reason is that it requires the complete knowledge of the distribution of each component process $Y_k(\cdot)$. In general such a distribution must be estimated, but density estimation in high dimension poses a big problem: the number of data points needed to estimate a density grows exponentially with the dimension. For a data length of realistic size, one may be able to estimate adequately a density in two or three dimensions but hardly more. To overcome this problem we propose two approaches.

The first approach is to streamline the criterion $C_\infty$ to obtain one which doesn't involve high dimensional densities. This works well for the instantaneous mixture case, but we ran into difficulty in the convolutive case and therefore we exclude that case here. A first possibility is to consider the mutual information

between the random vectors $Y_1(1{:}T), \ldots, Y_K(1{:}T)$, for small $T$, instead of that between the random processes $Y_1(\cdot), \ldots, Y_K(\cdot)$. This mutual information is given by

$$\sum_{k=1}^{K} H[Y_k(1{:}T)] - H[Y_1(1{:}T), \ldots, Y_K(1{:}T)].$$

On the other hand, it is clear from the first part of Proposition 2.2 that, with $Y(\cdot) = [Y_1(\cdot)\ \cdots\ Y_K(\cdot)]^T$ being related to $X(\cdot) = [X_1(\cdot)\ \cdots\ X_K(\cdot)]^T$ through (1.3),

$$H[Y_1(1{:}T), \ldots, Y_K(1{:}T)] = T\log|\det B| + H[X_1(1{:}T), \ldots, X_K(1{:}T)]. \tag{3.2}$$

Note that this relation cannot be generalized to the convolutive case, hence the difficulty alluded to above. Thus, one is led to the criterion

$$C_T(B) = \frac{1}{T}\sum_{k=1}^{K} H[Y_k(1{:}T)] - \log|\det B|. \tag{3.3}$$
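To illustrate how a criterion like (3.3) can be evaluated in practice, the sketch below computes $C_1(B)$ (the case $T = 1$) for a two-source instantaneous mixture, using the same histogram entropy estimator idea as in the earlier sketch. The comparison between the true inverse and a deliberately mixing reconstruction matrix is purely illustrative and assumes non-Gaussian sources.

```python
import numpy as np

def entropy_1d(x, bins=100):
    """Plug-in entropy estimate (nats) of a scalar sample via a histogram."""
    p, edges = np.histogram(x, bins=bins, density=True)
    w = np.diff(edges)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]) * w[nz])

def C1(B, X):
    """Sample version of criterion (3.3) with T = 1."""
    Y = B @ X
    return sum(entropy_1d(Y[k]) for k in range(Y.shape[0])) - np.log(abs(np.linalg.det(B)))

rng = np.random.default_rng(2)
S = np.vstack([rng.uniform(-1, 1, 100_000),      # two independent non-Gaussian sources
               rng.laplace(size=100_000)])
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = A @ S

B_true = np.linalg.inv(A)
B_off = np.array([[1.0, 0.3], [0.2, 1.0]]) @ B_true   # re-mixes the sources
print(C1(B_true, X), C1(B_off, X))   # the true inverse should give the smaller value
```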

The criterion $C_T$ would be less efficient than $C_\infty$, since it doesn't exploit the serial dependence between observations more than $T$ time units apart. But under the model (1.1) it is easily seen that it is still a contrast, in the sense that it is minimized if and only if $B = A^{-1}$, up to pre-multiplication by a permuted diagonal matrix.

A second possibility consists in considering the expected Kullback-Leibler divergence between the conditional distribution of $Y(T)$ given $Y(T-1), \ldots, Y(1)$ and the one it would have if the vectors $Y_1(1{:}T), \ldots, Y_K(1{:}T)$ were mutually independent. Explicitly, this divergence is

$$-\int \log\frac{\prod_{k=1}^{K} f_{Y_k(T)|Y_k(1:T-1)}[y_k(1{:}T)]}{f_{Y(T)|Y(1:T-1)}[y(1{:}T)]}\, f_{Y(1:T)}[y(1{:}T)]\,dy(1)\cdots dy(T),$$

where $y(t) = [y_1(t)\ \cdots\ y_K(t)]^T$, $y(1{:}T)$ and $Y(1{:}T)$ are the vectors obtained by stacking the components of $y(1), \ldots, y(T)$ and of $Y(1), \ldots, Y(T)$, respectively, $f_{Y_k(T)|Y_k(1:T-1)}[y_k(1{:}T)]$ denotes the conditional density of $Y_k(T)$ at $y_k(T)$ given that $Y_k(1{:}T-1) = y_k(1{:}T-1)$, $f_{Y(T)|Y(1:T-1)}[y(1{:}T)]$ denotes the conditional density of $Y(T)$ at $y(T)$ given that $Y(1{:}T-1) = y(1{:}T-1)$, and $f_{Y(1:T)}$ denotes the density of $Y(1{:}T)$. In view of the relations

$$f_{Y_k(T)|Y_k(1:T-1)}[y_k(1{:}T)] = \frac{f_{Y_k(1:T)}[y_k(1{:}T)]}{f_{Y_k(1:T-1)}[y_k(1{:}T-1)]}, \qquad f_{Y(T)|Y(1:T-1)}[y(1{:}T)] = \frac{f_{Y(1:T)}[y(1{:}T)]}{f_{Y(1:T-1)}[y(1{:}T-1)]},$$

where $f_{Y_k(1:T)}$ denotes the density of $Y_k(1{:}T)$, the above divergence reduces to

$$\sum_{k=1}^{K}\big\{H[Y_k(1{:}T)] - H[Y_k(1{:}T-1)]\big\} - \big\{H[Y(1{:}T)] - H[Y(1{:}T-1)]\big\}.$$

The above expression is by construction always non negative and vanishes if the random vectors $Y_1(1{:}T), \ldots, Y_K(1{:}T)$ are independent. By (3.2) and taking into account (2.1), it differs from

$$\overline{C}_T(B) = \sum_{k=1}^{K} H[Y_k(T)\,|\,Y_k(1{:}T-1)] - \log|\det B| \tag{3.4}$$

by the constant term $-H[X_1(T), \ldots, X_K(T)\,|\,X_1(1{:}T-1), \ldots, X_K(1{:}T-1)]$. Therefore, one may use $\overline{C}_T$ as a criterion, which attains its minimum when $B = A^{-1}$. The advantage of $\overline{C}_T$ is that it is a better approximation to $C_\infty$ than $C_T$, since $H[Y_k(T)\,|\,Y_k(1{:}T-1)]$ is, by (2.4), a better approximation to $H[Y_k(\cdot)]$ than $H[Y_k(1{:}T)]/T$. In particular, if the random process $Y_k(\cdot)$ is Markovian of order $m$, then the approximation is exact as soon as $T$ exceeds $m$.

A third possibility is to replace the entropy by the Gaussian entropy, defined as the entropy of a Gaussian process (or vector, or variable) which has the same second order moment structure. This entropy is thus given by the formula of Proposition 2.1. Since the resulting criterion depends only on the first and second order moments of the processes, it is clear that it would be minimized as soon as the reconstructed sources are uncorrelated. It is then easy to see that this criterion would

not permit the separation of sources in the convolutive case. Indeed, one can always find a sequence $\{B(l),\ l \in \mathbb{Z}\}$ such that all the components of the random vectors $Y(t)$, defined by (1.4), are uncorrelated among themselves, and a further orthogonal transformation performed on these variables would not change this property. However, the criterion could still be used in the case of an instantaneous mixture, since the reconstruction would then be restricted to the form (1.3). Explicitly, one gets from (3.1) and Proposition 2.1 that this criterion is

$$\frac{1}{4\pi}\sum_{k=1}^{K}\int_{-\pi}^{\pi}\log[f_{kk}(\lambda)]\,d\lambda + \frac{K}{2}[2\log(2\pi) + 1] - \log|\det B|, \tag{3.5}$$

where $f_{kk}$ is the spectral density of the $k$-th component of the process $Y(\cdot) = BX(\cdot)$. One can further streamline the criterion by using the Gaussian entropy in $C_T$ and $\overline{C}_T$. This yields the criteria (after dropping a constant term)

$$\frac{1}{2T}\sum_{k=1}^{K}\log|\det\mathrm{cov}[Y_k(1{:}T)]| - \log|\det B| \tag{3.3'}$$

where $\mathrm{cov}(\cdot)$ refers to the covariance matrix, and

$$\frac{1}{2}\sum_{k=1}^{K}\log\mathrm{var}[Y_k(T) - Y_k(T\,|\,1{:}T-1)] - \log|\det B| \tag{3.4'}$$

where $\mathrm{var}(\cdot)$ refers to the variance and $Y_k(T\,|\,1{:}T-1)$ denotes the best linear predictor of $Y_k(T)$ based on $Y_k(1), \ldots, Y_k(T-1)$. Obviously $T$ must be greater than 1, otherwise the above criteria would not permit the separation.

The second approach consists in assuming a somewhat restrictive form for the distribution of the sources. This seems so far to be the only approach which works in the convolutive mixture case. A quite general assumption is that the source sequences $S_k(\cdot)$ are scalar linear processes, with the coefficient sequences being of class $\mathcal{A}$ (as scalar sequences). Therefore, by Proposition 2.3 and the discussion that follows it,

$$H[S_k(\cdot)] = \inf\Big\{H[(b_k \star S_k)(t)] - \int_{-\pi}^{\pi}\log\Big|\sum_{l=-\infty}^{\infty} b_k(l)e^{il\lambda}\Big|\,\frac{d\lambda}{2\pi}\Big\} \tag{3.6}$$

where the infimum is over all sequences $\{b_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}$. This suggests replacing the criterion (3.1) by

$$\inf\Big\{\sum_{k=1}^{K} H[(b_k \star Y_k)(t)] - \int_{-\pi}^{\pi}\log\Big|\prod_{k=1}^{K}\Big[\sum_{l=-\infty}^{\infty} b_k(l)e^{il\lambda}\Big]\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}\Big\} \tag{3.7}$$

where the infimum is over all sequences $\{b_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}$ and where the processes $Y_1(\cdot), \ldots, Y_K(\cdot)$ are the component processes of the vector process $Y(\cdot)$ defined by (1.4). From the result of Proposition 2.3, the above criterion cannot be less than (3.1), which can easily be seen to be bounded below by $H[X_1(\cdot), \ldots, X_K(\cdot)]$. On the other hand, when the $Y_k(\cdot)$ are made to coincide with the sources $S_k(\cdot)$, that is, when the sequence $\{B(l),\ l \in \mathbb{Z}\}$ is the inverse of the sequence $\{A(l),\ l \in \mathbb{Z}\}$, then one can see from (3.6) that (3.7) coincides with (3.1), which, since the processes $Y_1(\cdot), \ldots, Y_K(\cdot)$ are then independent, reduces to $H[X_1(\cdot), \ldots, X_K(\cdot)]$. It follows that the criterion (3.7) is indeed a contrast.

Note that the criterion (3.7) can be used in the context of instantaneous mixtures too, by simply constraining $B(l)$, $l \ne 0$, to be zero (we write $B$ in place of $B(0)$ then). This criterion is however aimed at the convolutive mixture case. In this case, one may absorb the sequences $\{b_k(l),\ l \in \mathbb{Z}\}$, $k = 1, \ldots, K$, in (3.7) into the sequence $\{B(l),\ l \in \mathbb{Z}\}$. Thus one is led to the criterion

$$\sum_{k=1}^{K} H[Y_k(t)] - \int_{-\pi}^{\pi}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi} \tag{3.7'}$$

where $Y_1(t), \ldots, Y_K(t)$ are the components of the vector $Y(t)$, which is related to the observation process $X(\cdot)$ through (1.4). The minimization of (3.7') actually not only separates the sources but also performs a deconvolution of them, so as to yield sequences of iid random variables. These sequences, of course, can be interpreted as the sources, up to a convolution.

A more restrictive assumption on the distribution of the sources could be

that they are linear causal processes. By causal we mean that the sequences of coefficients in their representation are of class $\mathcal{A}^+$. We further assume that the sequence $\{A(l),\ l \in \mathbb{Z}\}$ in (1.2) is also of this class. Then it makes sense to restrict the sequence $\{B(l),\ l \in \mathbb{Z}\}$ in (1.4) to this class too. Therefore, in (3.7), we will restrict the infimum to be over all sequences $\{b_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}^+$. Then, using (2.6), one gets the criterion

$$\inf\Big\{\sum_{k=1}^{K} H\Big[\sum_{l=0}^{\infty} b_k(l)Y_k(t-l)\Big] - \log|b_1(0)\cdots b_K(0)\det B(0)|\Big\}$$

where the minimum is over all sequences $\{b_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}^+$ and $Y_1(\cdot), \ldots, Y_K(\cdot)$ are the component processes of the vector process $Y(\cdot)$ defined by (1.4). Clearly, this criterion may be written equivalently as

$$\inf\sum_{k=1}^{K} H\Big[Y_k(t) + \sum_{l=1}^{\infty} \tilde{b}_k(l)Y_k(t-l)\Big] - \log|\det B(0)| \tag{3.8}$$

where the infimum is over all sequences $\{\ldots, 0, 1, \tilde{b}_k(1), \ldots\}$ of class $\mathcal{A}^+$. As before, the criterion (3.8) cannot be less than $H[X_1(\cdot), \ldots, X_K(\cdot)]$. But again, when the $Y_k(\cdot)$ are made to coincide with the sources, this criterion coincides with (3.1) and hence equals $H[X_1(\cdot), \ldots, X_K(\cdot)]$, as the processes $Y_1(\cdot), \ldots, Y_K(\cdot)$ are then independent. Thus the proposed criterion is again a contrast. As before, the criterion (3.8) can also be used in the context of an instantaneous mixture, by restricting the reconstruction to the form (1.3) and replacing $B(0)$ by $B$. But this criterion is aimed at the convolutive mixture case, in which one may absorb the sequences $\{\ldots, 0, 1, \tilde{b}_k(1), \ldots\}$ in (3.8) into the sequence $\{B(l),\ l \in \mathbb{Z}\}$. Thus one is led to the minimization of

$$\sum_{k=1}^{K} H[Y_k(t)] - \log|\det B(0)|, \tag{3.8'}$$

$Y_k(t)$ denoting the components of $Y(t)$ given by (1.4), under the constraint that the sequence $\{B(l),\ l \in \mathbb{Z}\}$ in (1.4) is of class $\mathcal{A}^+$. As before, this minimization not only separates the sources but performs a deconvolution as well.
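As an illustration of how a criterion of the form (3.8') might be evaluated for a candidate causal FIR reconstruction filter, the sketch below (our own construction, with an arbitrary filter length and the histogram entropy estimator from the earlier sketches) computes the sample version of $\sum_k \hat H[Y_k(t)] - \log|\det B(0)|$.

```python
import numpy as np

def entropy_1d(x, bins=100):
    """Plug-in entropy estimate (nats) of a scalar sample via a histogram."""
    p, edges = np.histogram(x, bins=bins, density=True)
    w = np.diff(edges)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]) * w[nz])

def criterion_38prime(B, X):
    """B: array of shape (L, K, K) holding B(0), ..., B(L-1); X: K x T observations."""
    L, K, _ = B.shape
    T = X.shape[1]
    Y = np.zeros((K, T - L + 1))
    for l in range(L):                      # causal convolution Y(t) = sum_l B(l) X(t-l)
        Y += B[l] @ X[:, L - 1 - l:T - l]
    ent = sum(entropy_1d(Y[k]) for k in range(K))
    return ent - np.log(abs(np.linalg.det(B[0])))

# Toy usage: two non-Gaussian white sources, convolutive mixture with two taps.
rng = np.random.default_rng(3)
S = np.vstack([rng.uniform(-1, 1, 50_000), rng.laplace(size=50_000)])
A = np.stack([np.array([[1.0, 0.4], [0.3, 1.0]]),
              np.array([[0.2, 0.0], [0.0, 0.1]])])     # A(0), A(1)
X = np.zeros((2, S.shape[1]))
for l in range(2):
    X[:, l:] += A[l] @ S[:, :S.shape[1] - l]

B_candidate = np.stack([np.eye(2), np.zeros((2, 2))])  # identity filter, for illustration
print(criterion_38prime(B_candidate, X))
```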

4 Estimating equations

By differentiating the above criteria, one is led to a set of equations to be satisfied, called estimating equations (see Godambe, 1963). Their use is much more flexible than that of criteria, since they need not arise from the differentiation of a criterion. In this problem, they may simply come from functions of $B$ or of $\{B(l),\ l \in \mathbb{Z}\}$ which take the value zero when the reconstructed sources (the components of the vectors $\{Y(t),\ t \in \mathbb{Z}\}$ defined by (1.3) or (1.4)) are independent. In order to differentiate the criteria in section 3, the following result plays a central role.

Proposition 4.1 Let $Y$ and $Z$ be two random vectors admitting absolute moments of order $1 + \alpha$, $\alpha > 0$. Assume that $Y$ and $Y + \epsilon Z$, $\epsilon$ being a matrix for which the product $\epsilon Z$ makes sense and has the same dimension as $Y$, admit densities $f_Y$ and $f_{Y+\epsilon Z}$ satisfying

(C1) As $\epsilon \to 0$, $\int \log[f_Y(u)/f_{Y+\epsilon Z}(u)]\, f_{Y+\epsilon Z}(u)\,du \to 0$ faster than $\|\epsilon\|$.

(C2) The function $\log f_Y$ admits almost everywhere a gradient $-\psi_Y$ such that, for some constant $C$, $\|\psi_Y(u)\| \le C(1 + \|u\|^{\alpha})$ for all $u$.

Then as $\epsilon \to 0$: $H(Y + \epsilon Z) - H(Y) = \mathrm{E}[\psi_Y^T(Y)\,\epsilon Z] + o(\epsilon)$, where $o(\epsilon)$ denotes a term tending to 0 faster than $\epsilon$.

Note: The condition (C1) above could be hard to verify, but it is quite reasonable. Indeed,

$$\int \log\frac{f_Y(u)}{f_{Y+\epsilon Z}(u)}\, f_{Y+\epsilon Z}(u)\,du = \int \Big[\log\frac{f_Y(u)}{f_{Y+\epsilon Z}(u)} - \frac{f_Y(u)}{f_{Y+\epsilon Z}(u)} + 1\Big] f_{Y+\epsilon Z}(u)\,du.$$

For small $\epsilon$, one would expect that the expression inside the above bracket $[\ ]$ is of the order $\|\epsilon\|^2$, and thus the whole integral would be of this order. The difficulty is that $f_Y$ and $f_{Y+\epsilon Z}$ converge to zero at infinity and hence the behaviour of the ratio $f_Y/f_{Y+\epsilon Z}$ near infinity is difficult to predict. The expression inside the above bracket is of the order $\|\epsilon\|^2$ only for fixed $u$, but not uniformly in $u$. This uniformity is however not at all necessary, since we integrate with respect to $f_{Y+\epsilon Z}$, which can be expected to converge to zero at a fast rate. But we have been unable to find simple conditions ensuring that (C1) is satisfied.

We now apply the above result to obtain a necessary condition for the criteria $C_T$ and $\overline{C}_T$ to be minimized. If $C_T$ is minimized at $B$, one would have $C_T(B + \epsilon B) \ge$

$C_T(B)$ for all matrices $\epsilon$. Take $\epsilon$ to have a single nonzero element, $\epsilon_{kj}$ say; then one gets from (3.3)

$$\{H[Y_k(1{:}T) + \epsilon_{kj}Y_j(1{:}T)] - H[Y_k(1{:}T)]\}/T - \log|\det(I + \epsilon)| \ge 0.$$

But $\log|\det(I + \epsilon)| = \delta_{jk}\epsilon_{kj} + o(\epsilon_{kj})$ as $\epsilon_{kj} \to 0$, $\delta_{jk}$ denoting the Kronecker symbol. On the other hand, by Proposition 4.1, as $\epsilon_{kj} \to 0$,

$$H[Y_k(1{:}T) + \epsilon_{kj}Y_j(1{:}T)] - H[Y_k(1{:}T)] = \epsilon_{kj}\,\mathrm{E}\{Y_j(1{:}T)^T \psi_{k,T}[Y_k(1{:}T)]\} + o(\epsilon_{kj})$$

where $\psi_{k,T}$ is minus the logarithmic gradient of the density of $Y_k(1{:}T)$. Therefore,

$$\epsilon_{kj}\,\mathrm{E}\{Y_j(1{:}T)^T \psi_{k,T}[Y_k(1{:}T)]\}/T - \epsilon_{kj}\delta_{kj} \ge 0$$

for all $\epsilon_{kj}$ small enough. But since the above left hand side is linear in $\epsilon_{kj}$, it must vanish identically. Thus a necessary condition for $B$ to minimize $C_T$ is that

$$\mathrm{E}\{Y_j(1{:}T)^T \psi_{k,T}[Y_k(1{:}T)]\} = 0, \qquad k, j = 1, \ldots, K,\ k \ne j. \tag{4.1}$$

Note that we haven't included the condition $\mathrm{E}\{Y_k(1{:}T)^T \psi_{k,T}[Y_k(1{:}T)]\} = T$, since it is actually always true because of the way $\psi_{k,T}$ is defined. This can be seen from the following result, which we give here for completeness.

Lemma 4.1 Let $Y$ be a random vector admitting a differentiable density $f_Y$ such that $f_Y(y)y \to 0$ as $y \to \infty$, and let $\psi_Y$ be the gradient of $-\log f_Y$. Then $\mathrm{E}[Y^T\psi_Y(Y)]$ equals the dimension of $Y$.

Since $\overline{C}_T = TC_T - (T-1)C_{T-1}$, a completely similar argument yields that a necessary condition for this criterion to be minimized at $B$ is

$$\mathrm{E}\{Y_j(1{:}T)^T \psi_{k,T}[Y_k(1{:}T)] - Y_j(1{:}T-1)^T \psi_{k,T-1}[Y_k(1{:}T-1)]\} = 0 \tag{4.2}$$

for all $k, j = 1, \ldots, K$, $k \ne j$. It is worthwhile to note that the expression inside the above curly brackets is none other than

$-Y_j(1{:}T)^T\nabla\log f_{Y_k(T)|Y_k(1:T-1)}[Y_k(1), \ldots, Y_k(T)]$, where $\nabla$ is the gradient operator and $f_{Y_k(T)|Y_k(1:T-1)}[y(1{:}T)]$ is the conditional density at $y(T)$ of $Y_k(T)$ given that $Y_k(1{:}T-1) = y(1{:}T-1)$.

Consider now the criterion (3.3'). Observe that $\log|\det(M + \epsilon)| = \mathrm{tr}(M^{-1}\epsilon) + o(\|\epsilon\|)$ as $\epsilon \to 0$, $\mathrm{tr}$ denoting the trace. Hence when $B$ is changed to $B + \epsilon B$, where $\epsilon$ is a matrix with only one nonzero term $\epsilon_{kj}$, the criterion (3.3') is increased by

$$\epsilon_{kj}\,\mathrm{tr}\big(\{\mathrm{cov}[Y_k(1{:}T)]\}^{-1}\mathrm{cov}[Y_j(1{:}T), Y_k(1{:}T)]\big)/T - \epsilon_{kj}\delta_{kj} + o(\epsilon_{kj}),$$

$\mathrm{cov}(\cdot,\cdot)$ denoting the cross covariance matrix between the indicated random vectors. But the matrix $\{\mathrm{cov}[Y_k(1{:}T)]\}^{-1}$ can be written as

$$\{\mathrm{cov}[Y_k(1{:}T)]\}^{-1} = L_k^T\,\mathrm{diag}(\sigma_{k,0}^{-2}, \ldots, \sigma_{k,T-1}^{-2})\,L_k,$$

where $L_k$ is the lower triangular matrix whose $t$-th row is $(a_{k,t-1}(t-1), \ldots, a_{k,t-1}(1), a_{k,t-1}(0), 0, \ldots, 0)$, $a_{k,t}(0) = 1$, $-\sum_{l=1}^{t} a_{k,t}(l)Y_k(t+1-l)$ is the best linear predictor of $Y_k(t+1)$ based on $Y_k(1), \ldots, Y_k(t)$, and $\sigma^2_{k,t}$ is the variance of $\sum_{l=0}^{t} a_{k,t}(l)Y_k(t+1-l)$. Therefore, a necessary condition for the criterion (3.3') to be minimized at $B$ is

$$\sum_{t=1}^{T} \mathrm{cov}\Big[\sum_{l=0}^{t-1} a_{k,t-1}(l)Y_j(t-l),\ \sum_{l=0}^{t-1} a_{k,t-1}(l)Y_k(t-l)\Big]\Big/\sigma^2_{k,t-1} = 0 \tag{4.3}$$

for $k, j = 1, \ldots, K$, $k \ne j$. By a similar calculation, a necessary condition for the criterion (3.4') to be minimized at $B$ is

$$\mathrm{cov}\Big[\sum_{l=0}^{T-1} a_{k,T-1}(l)Y_j(T-l),\ \sum_{l=0}^{T-1} a_{k,T-1}(l)Y_k(T-l)\Big] = 0, \qquad k, j = 1, \ldots, K,\ k \ne j. \tag{4.4}$$

Consider finally the Gaussian entropy criterion (3.5). Again, change $B$ to $B + \epsilon B$, where $\epsilon$ is a matrix with only one nonzero element $\epsilon_{kj}$. Then the process $Y_k(\cdot)$ is changed to $Y_k(\cdot) + \epsilon_{kj}Y_j(\cdot)$, which has spectral density $f_{kk} + 2\epsilon_{kj}f_{kj} + \epsilon_{kj}^2 f_{jj}$, where $f_{kj}$ is the cross spectral density between the processes $Y_k(\cdot)$ and $Y_j(\cdot)$. Thus the criterion (3.5) is increased by $\epsilon_{kj}\{\int_{-\pi}^{\pi}[f_{kj}(\lambda)/f_{kk}(\lambda)]\,d\lambda/(2\pi) - \delta_{kj}\} + o(\epsilon_{kj})$. Therefore a necessary condition for the criterion (3.5) to be minimized at $B$ is

$$\int_{-\pi}^{\pi}[f_{kj}(\lambda)/f_{kk}(\lambda)]\,d\lambda = 0, \qquad k, j = 1, \ldots, K,\ k \ne j. \tag{4.5}$$
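Condition (4.5) involves only second order (spectral) quantities and can therefore be approximated directly from data. The sketch below is an assumption-laden illustration using SciPy's Welch-type auto- and cross-spectral estimators (not a method prescribed by the paper); it approximates the frequency average of $f_{kj}/f_{kk}$ for a pair of signals.

```python
import numpy as np
from scipy.signal import csd, welch

def condition_45(yk, yj, nperseg=256, fs=2 * np.pi):
    """Rough estimate of the frequency average of f_kj(lam) / f_kk(lam), as in (4.5)."""
    _, f_kk = welch(yk, fs=fs, nperseg=nperseg)      # auto-spectrum of Y_k
    _, f_kj = csd(yk, yj, fs=fs, nperseg=nperseg)    # cross-spectrum of Y_k and Y_j
    return np.mean(np.real(f_kj / f_kk))             # average of the ratio over frequencies

rng = np.random.default_rng(4)
n = 20_000
yk = rng.normal(size=n)
yj = rng.normal(size=n)                 # independent of yk
print(condition_45(yk, yj))             # should be close to zero
print(condition_45(yk, yk + yj))        # dependence pushes it clearly away from zero
```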

One can see that (4.5) is a limiting form of (4.3) or (4.4) as $T \to \infty$. Indeed, $a_{k,T-1}(l)$ and $\sigma^2_{k,T-1}$ converge to $a_k(l)$ and $\sigma^2_k$ as $T \to \infty$, where $-\sum_{l=1}^{\infty} a_k(l)Y_k(t-l)$ is the best linear predictor of $Y_k(t)$ based on $Y_k(t-1), Y_k(t-2), \ldots$ and $\sigma^2_k$ is the variance of $\sum_{l=0}^{\infty} a_k(l)Y_k(t-l)$. The result follows from the fact that $f_{kk}(\lambda)$ can be factorized as $\sigma^2_k/[2\pi|\sum_{l=0}^{\infty} a_k(l)e^{il\lambda}|^2]$.

We now consider the convolutive case. We begin by deriving a necessary condition in order that the criterion (3.7') be minimized at the sequence $\{B(l),\ l \in \mathbb{Z}\}$. Suppose that this is so; then adding to this sequence the sequence $\{(\epsilon \star B)(l),\ l \in \mathbb{Z}\}$, where $\{\epsilon(l),\ l \in \mathbb{Z}\}$ is any sequence of (small) matrices, must not decrease the criterion. Choose this sequence to have only one nonzero term $\epsilon(m)$, which has only one nonzero element $\epsilon_{kj}(m)$; then the criterion is increased by

$$H[Y_k(t) + \epsilon_{kj}(m)Y_j(t-m)] - H[Y_k(t)] - \int_{-\pi}^{\pi}\log|\det[I + \epsilon(m)e^{im\lambda}]|\,\frac{d\lambda}{2\pi}.$$

Using Proposition 4.1, one can write the above expression as

$$\epsilon_{kj}(m)\,\mathrm{E}\{Y_j(t-m)\,\psi_k[Y_k(t)]\} - \delta_{m0}\delta_{kj}\epsilon_{kj}(m) + o[\epsilon_{kj}(m)]$$

as $\epsilon_{kj}(m) \to 0$, where $\psi_k$ is minus the logarithmic derivative of the density of the random variable $Y_k(t)$. Therefore a necessary condition that the sequence $\{B(l),\ l \in \mathbb{Z}\}$ realize the minimum of (3.7') is that

$$\mathrm{E}\{Y_j(t-m)\,\psi_k[Y_k(t)]\} = 0, \qquad j, k = 1, \ldots, K,\ m \in \mathbb{Z},\ k \ne j \text{ or } m \ne 0. \tag{4.6}$$

By a completely similar argument, a necessary condition that the sequence $\{B(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}^+$ realize the minimum of (3.8') is that

$$\mathrm{E}\{Y_j(t-m)\,\psi_k[Y_k(t)]\} = 0, \qquad j, k = 1, \ldots, K,\ m \ge 0,\ k \ne j \text{ or } m \ne 0. \tag{4.7}$$

Since the criteria (3.7) and (3.8) in the instantaneous mixture case do not reduce to (3.7') and (3.8'), it is of interest to derive a necessary condition in order that they be minimized in this case. Consider first (3.7) and suppose it is minimized at $B$. We assume that the infimum of

$$H[(b_k \star Y_k)(t)] - \int_{-\pi}^{\pi}\log\Big|\sum_{l=-\infty}^{\infty} b_k(l)e^{il\lambda}\Big|\,\frac{d\lambda}{2\pi}$$

can be attained at some sequence $\{\bar{b}_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}$. Let us change $B$ to $B + \epsilon B$ with $\epsilon$ having a single nonzero element $\epsilon_{kj}$. Then the criterion (3.7) is increased by

$$\inf\Big\{H[(b_k \star Y_k)(t) + \epsilon_{kj}(b_k \star Y_j)(t)] - H[(\bar{b}_k \star Y_k)(t)] + \int_{-\pi}^{\pi}\log\frac{|\sum_{l=-\infty}^{\infty}\bar{b}_k(l)e^{il\lambda}|}{|\sum_{l=-\infty}^{\infty} b_k(l)e^{il\lambda}|}\,\frac{d\lambda}{2\pi}\Big\} - \log|\det[I + \epsilon]|,$$

the infimum being over all sequences $\{b_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}$. Therefore, taking $b_k(l) = \bar{b}_k(l)$,

$$H[(\bar{b}_k \star Y_k)(t) + \epsilon_{kj}(\bar{b}_k \star Y_j)(t)] - H[(\bar{b}_k \star Y_k)(t)] - \log|\det(I + \epsilon)| \ge 0.$$

Thus, by a similar calculation as before, a necessary condition in order that $B$ minimize the criterion (3.7) is that

$$\mathrm{E}\{(\bar{b}_k \star Y_j)(t)\,\psi_k[(\bar{b}_k \star Y_k)(t)]\} = 0, \qquad k, j = 1, \ldots, K,\ k \ne j, \tag{4.8}$$

$\psi_k$ denoting minus the logarithmic gradient of the density of $(\bar{b}_k \star Y_k)(t)$.

By a completely similar argument, one obtains the same result for the criterion (3.8), except that the sequence $\{\bar{b}_k(l),\ l \in \mathbb{Z}\}$ now is of class $\mathcal{A}^+$ with $\bar{b}_k(0) = 1$ and realizes the minimum of $H[\sum_{l=0}^{\infty} b_k(l)Y_k(t-l)]$ among all sequences $\{b_k(l),\ l \in \mathbb{Z}\}$ of class $\mathcal{A}^+$ with $b_k(0) = 1$.

In summary, the estimating equations in the case of an instantaneous mixture are of the form

$$\mathrm{E}\{Y_j(1{:}T)^T \varphi_{k,T}[Y_k(1{:}T)]\} = 0, \qquad k, j = 1, \ldots, K,\ k \ne j, \tag{4.9}$$

where $\varphi_{k,T}$ is a $T$-vector function of $T$ variables. In (4.2) this function is minus the gradient of the logarithm of the conditional density of $Y_k(T)$ given $Y_k(1{:}T-1)$. In (4.8), it takes the form

$$\begin{bmatrix} y(1)\\ \vdots\\ y(T)\end{bmatrix} \mapsto \varphi_k\Big[\sum_{l=0}^{T-1} a_{k,T-1}(l)\,y(T-l)\Big]\begin{bmatrix} a_{k,T-1}(T-1)\\ \vdots\\ a_{k,T-1}(0)\end{bmatrix} \tag{4.10}$$

if one truncates the sequence $\{\bar{b}_k(l),\ l \in \mathbb{Z}\}$ to a finite number of terms. In (4.4), it takes the same form, except that $\varphi_k$ is the identity function (in the case of zero mean sources). This is expected, since $\varphi_k$ is minus the logarithmic derivative of the density of $\sum_{l=0}^{T-1} a_{k,T-1}(l)Y_k(T-l)$, which is linear in the Gaussian case. The estimating equations (4.1) and (4.3) may be viewed as the averages of (4.2) and (4.4) with respect to $T$ and can be cast into the form (4.9) too.

However, as we have said at the beginning of this section, the use of estimating equations is very flexible in that they need not come from a contrast. Any system of the form (4.9) will do, assuming that the sources are of zero mean (if they are not, one may center the observations first). A bad choice of the separating functions could of course severely degrade the performance. The above calculations are useful in that they suggest good candidates for the estimating functions. The form (4.9) however requires the specification of $KT$ real functions of $T$ variables and thus allows a lot of degrees of freedom. If one believes that the sources are linear processes (or may be approximated by such processes), one may settle for the form (4.10), which requires the specification of only $K(T-1)$ constants and $K$ real functions. Note that taking $T = 1$ in (4.9) yields the set of estimating functions introduced in Pham and Garat [14], which can be traced back to the method of Jutten and Herault [8].

Turning to the convolutive mixture case, we see that the estimating equations are of the form

$$\mathrm{E}\{Y_j(t-m)\,\varphi_k[Y_k(t)]\} = 0, \qquad j, k = 1, \ldots, K,\ m \in \mathbb{Z},\ k \ne j \text{ or } m \ne 0, \tag{4.11}$$

with $m$ being further constrained to be non negative in the case where the reconstruction sequence of matrices $\{B(l),\ l \in \mathbb{Z}\}$ is restricted to the class $\mathcal{A}^+$. In practice, one would take $B(l)$ to be nonzero for $l$ in some interval $[L_1, L_2]$ and restrict $m$ in (4.11) to be in the same interval, so as to have just $K$ equations less than the number of unknowns (which accounts for the indeterminacy of scale). Note however that taking $L_1 = 0$ is not enough to ensure that the sequence $\{B(l),\ l \in \mathbb{Z}\}$ is of class $\mathcal{A}^+$; this constraint is actually not easy to enforce. As before, the separating function $\varphi_k$ in (4.11) can be any function and need not be the logarithmic derivative of the density of the (deconvoluted) source, but the latter is the natural candidate resulting from the mutual information criterion.

We must caution that the use of estimating equations, although convenient, can lead to spurious reconstructed sources. Indeed, the equations (4.9) and (4.11) almost always have several solutions, of which only one corresponds to the true sources. If one chooses the estimating functions carelessly, chances are high that one ends up with a spurious solution. We believe that by taking them from a contrast one has a better chance of avoiding such a situation. But there is no guarantee that solving equations such as (4.1)-(4.7) yields a local minimum of the corresponding contrast (and not a maximum or a saddle point), let alone a global minimum. However, the fact that these equations come from a contrast makes it possible to monitor the calculation algorithm so as to ensure that it converges at least to a local minimum.
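As a simple empirical check of estimating equations of the form (4.9) with $T = 1$, the sketch below evaluates the sample moments $\mathrm{E}\{\varphi(Y_k)Y_j\}$, $k \ne j$, at the true inverse and at a deliberately wrong reconstruction matrix. The separating function $\varphi(y) = y^3$ and the toy sources are our own illustrative choices, not recommendations of the paper.

```python
import numpy as np

def estimating_moments(B, X, phi=lambda y: y**3):
    """Sample versions of E{phi(Y_k) Y_j}, k != j, for Y = B X (the T = 1 case of (4.9))."""
    Y = B @ X
    Y = Y - Y.mean(axis=1, keepdims=True)        # the sources are assumed zero mean
    C = (phi(Y) @ Y.T) / Y.shape[1]
    return C[~np.eye(C.shape[0], dtype=bool)]    # off-diagonal entries only

rng = np.random.default_rng(5)
S = np.vstack([rng.uniform(-1, 1, 100_000), rng.laplace(size=100_000)])
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ S

print(estimating_moments(np.linalg.inv(A), X))   # near zero: (4.9) is satisfied
print(estimating_moments(np.eye(2), X))          # clearly nonzero: sources still mixed
```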

Appendix: Proofs of results

Proof of Proposition 2.1 Since the process is Gaussian, the conditional distribution of $Y(t)$ given $Y(1), \ldots, Y(t-1)$ is the Gaussian distribution with mean equal to the conditional expectation $Y(t\,|\,1{:}t-1)$ of $Y(t)$ given $Y(1), \ldots, Y(t-1)$ and with covariance matrix

$$G(t) = \text{covariance matrix of } Y(t) - Y(t\,|\,1{:}t-1).$$

Note that $Y(t\,|\,1{:}t-1)$ is none other than the best linear predictor of $Y(t)$ based on $Y(t-1), \ldots, Y(1)$, and the difference $Y(t) - Y(t\,|\,1{:}t-1)$ is simply the $t$-th order innovation. Therefore, $G(t)$ tends to the innovation covariance matrix $G$, as defined in the Proposition. On the other hand, a direct computation shows that the expected conditional entropy of $Y(t)$ given $Y(t-1), \ldots, Y(1)$ equals $\{\log\det[2\pi G(t)] + K\}/2$. This yields the first result of the Proposition. The other result comes from an extension of Szegő's Theorem to the multivariate case (see for example [9], p. 162), which says that $\log\det G = \int_{-\pi}^{\pi}\log\det[2\pi f(\lambda)]\,d\lambda/(2\pi)$.

Proof of Proposition 2.2 The first part follows from the same calculation as in Pham [13], based on Lemma A1 there. Essentially, the density of $Y$ is given by $f_Y(y) = f_X(B^{-1}y)/|\det B|$. Thus

$$H(Y) = \log|\det B| - \int [\log f_X(B^{-1}y)]\,\frac{f_X(B^{-1}y)}{|\det B|}\,dy$$

and a change of integration variable yields the result.

To prove the second part, consider the random vectors

$$Y^{(T)}(t) = \sum_{l=-\infty}^{\infty} B(l)\,X[(t-l) \bmod T] = \sum_{s=0}^{T-1}\Big[\sum_{m=-\infty}^{\infty} B(t - s + mT)\Big]X(s).$$

In other words, $Y^{(T)}(0{:}T-1) = B^{(T)}X(0{:}T-1)$, where $Y^{(T)}(u{:}v)$ and $X(u{:}v)$ denote the vectors obtained by stacking the components of $Y^{(T)}(u), \ldots, Y^{(T)}(v)$ and of $X(u), \ldots, X(v)$, respectively, and $B^{(T)}$ is the circular block Toeplitz matrix with block $\sum_{m=-\infty}^{\infty} B(t - s + mT)$ at the $(s, t)$ place. Thus, from the result just proved,

$$H[Y^{(T)}(0{:}T-1)] = H[X(0{:}T-1)] + \log|\det B^{(T)}|.$$

To compute the determinant of $B^{(T)}$, note that if $U$ is the matrix with block $e^{-i2\pi st/T}I$ at the $(s, t)$ place, then $U^{-1}B^{(T)}U$ is block diagonal with diagonal blocks $\sum_{l=-\infty}^{\infty} e^{i2\pi lt/T}B(l)$. Therefore

$$\log|\det B^{(T)}| = \sum_{t=1}^{T}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} e^{i2\pi lt/T}B(l)\Big]\Big|.$$
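This determinant identity for circular block Toeplitz matrices is easy to verify numerically. The sketch below is our own check, for a filter with two arbitrary 2x2 taps and a small T; it builds the circulant block matrix explicitly and compares both sides.

```python
import numpy as np

K, T = 2, 8
rng = np.random.default_rng(6)
# A short filter: B(0) and B(1), zero elsewhere.
B = {0: rng.normal(size=(K, K)), 1: rng.normal(size=(K, K))}

# Circular block Toeplitz matrix with block sum_m B(t - s + m*T) at the (s, t) place.
BT = np.zeros((K * T, K * T))
for s in range(T):
    for t in range(T):
        blk = sum(B.get(t - s + m * T, np.zeros((K, K))) for m in (-1, 0, 1))
        BT[s * K:(s + 1) * K, t * K:(t + 1) * K] = blk

lhs = np.log(abs(np.linalg.det(BT)))
rhs = sum(np.log(abs(np.linalg.det(
          sum(B[l] * np.exp(1j * 2 * np.pi * l * t / T) for l in B))))
          for t in range(1, T + 1))
print(lhs, rhs)   # the two values agree up to rounding error
```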

On the other hand, by assumption, $B(l) = 0$ for all $|l|$ greater than some integer, $q$ say. Then it is easily seen that, for $T > 2q$, $Y^{(T)}(t) = Y(t)$ for all $t$ in $\{q, \ldots, T-1-q\}$. Thus, by (2.1),

$$H[Y^{(T)}(0{:}T-1)] \le H[Y(q{:}T-1-q)] + H[Y^{(T)}(0{:}q-1)] + H[Y^{(T)}(T-q{:}T-1)],$$

with $Y(u{:}v)$ being defined similarly. Combining the above results, one gets

$$H[X(0{:}T-1)] + \sum_{t=1}^{T}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} e^{i2\pi lt/T}B(l)\Big]\Big| \le H[Y(q{:}T-1-q)] + H[Y^{(T)}(0{:}q-1)] + H[Y^{(T)}(T-q{:}T-1)].$$

But for $T \ge 2q$, $Y^{(T)}(t)$ equals $\sum_{l=-q}^{t} B(l)X(t-l)$ for $t = 0, \ldots, q-1$ and $\sum_{l=t-T+1}^{q} B(l)X(t-l)$ for $t = T-q, \ldots, T-1$; hence $Y^{(T)}(0{:}q-1)$ does not

depend on $T$, and $Y^{(T)}(T-q{:}T-1)$ has a distribution independent of $T$, by stationarity. Thus, dividing both sides of the above inequality by $T$ and then letting $T \to \infty$, one gets the second result of the Proposition.

Proof of Proposition 2.2' Let $T$, $\epsilon$ and $\delta$ be as in the Proposition, and let $B^{\dagger}(l)$ denote the Fourier coefficients

of $[\sum_{m=-\infty}^{\infty} B(m)e^{im\lambda}]^{-1}$, so that $(B^{\dagger} \star B)(l) = 0$ if $l \ne 0$ and $= I$ if $l = 0$. Write $B(l) = B_n(l) + \tilde{B}_n(l)$, where $B_n(l) = B(l)$ if $|l| \le n$ and $= 0$ otherwise, and put $C(l) = -(\tilde{B}_n \star B^{\dagger})(l)$. Thus $\tilde{B}_n(l) = -(C \star B)(l)$ and therefore the process $Y_n(t) = (B_n \star X)(t)$ can be written as $Y(t) + (C \star Y)(t)$. On the other hand, $\sum_{l=-\infty}^{\infty}\|C(l)\|$ is bounded by $[\sum_{l=-\infty}^{\infty}\|B^{\dagger}(l)\|][\sum_{|l|>n}\|B(l)\|]$ and hence is less than $\delta$ for all $n$ sufficiently large. For such $n$ one has, according to the condition (C),

$$H[Y(1), \ldots, Y(T)] + \epsilon \ge H[Y_n(1), \ldots, Y_n(T)].$$

Therefore, by (2.4), $H[Y(1), \ldots, Y(T)]/T + \epsilon/T \ge H[Y_n(\cdot)]$. Applying now Proposition 2.2 to the process $Y_n(\cdot)$, one gets

$$H[Y(1), \ldots, Y(T)]/T + \epsilon/T \ge H[X_1(\cdot), \ldots, X_K(\cdot)] + \int_{-\pi}^{\pi}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B_n(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}.$$

This inequality holds for all $n$ exceeding a threshold depending on $T$ and $\epsilon$. Hence, letting $n \to \infty$ and then $T \to \infty$, one gets the result of the Proposition. Note that $X(t) = (B^{\dagger} \star Y)(t)$. Hence, if the condition (C) is satisfied relative to the process $X(\cdot)$, one may apply the result just proved and obtain the reverse inequality

$$H[X_1(\cdot), \ldots, X_K(\cdot)] \ge H[Y_1(\cdot), \ldots, Y_K(\cdot)] + \int_{-\pi}^{\pi}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B^{\dagger}(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi} = H[Y_1(\cdot), \ldots, Y_K(\cdot)] - \int_{-\pi}^{\pi}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}.$$

It follows that the inequality in the Proposition is in fact an equality.

Proof of Proposition 2.3 Let $Z_k(\cdot)$ be the component processes of the vector process $(B \star Y)(\cdot)$. Then, by Proposition 2.2',

$$H[Y_1(\cdot), \ldots, Y_K(\cdot)] = H[Z_1(\cdot), \ldots, Z_K(\cdot)] - \int_{-\pi}^{\pi}\log\Big|\det\Big[\sum_{l=-\infty}^{\infty} B(l)e^{il\lambda}\Big]\Big|\,\frac{d\lambda}{2\pi}.$$

The result then follows from (2.5).

Proof of Proposition 4.1 We have

$$H(Y + \epsilon Z) - H(Y) = \int \log\frac{f_Y(u)}{f_{Y+\epsilon Z}(u)}\, f_{Y+\epsilon Z}(u)\,du + \int [\log f_Y(u)]f_Y(u)\,du - \int [\log f_Y(u)]f_{Y+\epsilon Z}(u)\,du.$$

By assumption, the first term on the above right hand side is $o(\|\epsilon\|)$ as $\epsilon \to 0$. As for the second and last terms, they are by definition the same as $\mathrm{E}[\log f_Y(Y)]$ and $\mathrm{E}[\log f_Y(Y + \epsilon Z)]$. Hence

$$H(Y + \epsilon Z) - H(Y) = \int [\log f_Y(y) - \log f_Y(y + \epsilon z)]\,dP_{Y,Z}(y, z) + o(\|\epsilon\|)$$

where $P_{Y,Z}$ denotes the probability distribution of $(Y, Z)$. Therefore, one obtains the result if one proves that

$$\frac{1}{\|\epsilon\|}\int [\log f_Y(y) - \log f_Y(y + \epsilon z) - \psi_Y^T(y)\epsilon z]\,f_{Y,Z}(y, z)\,dy\,dz \to 0$$

as $\|\epsilon\| \to 0$. Observe that the function under the integral sign indeed converges to 0 almost everywhere. Therefore, using the Lebesgue dominated convergence Theorem, one only needs to show that it is bounded, for all $\epsilon$ small enough, by an integrable function. But by the mean value Theorem and the condition (C2), it can be bounded by

$$C[1 + 2^{\max(\alpha,1)-1}(\|y\|^{\alpha} + \|z\|^{\alpha}\|\epsilon\|^{\alpha})]\,\|z\|.$$

The last expression is integrable, by assumption, yielding the result.

Proof of Lemma 4.1 One has

$$\mathrm{E}[Y^T\psi_Y(Y)] = -\int \sum_{k=1}^{K} y_k\,\frac{\partial}{\partial y_k} f_Y(y_1, \ldots, y_K)\,dy_1 \cdots dy_K$$

where $f_Y(y_1, \ldots, y_K)$ is the density of $Y$ at the point $(y_1\ \cdots\ y_K)^T$. The result then follows from integration by parts.

References

[1] Cardoso, J.-F. "Source separation using higher order moments", Proc. ICASSP 89, Glasgow, Scotland, May 1989, 4, pp. 2109-2112.

[2] Cardoso, J.-F. "Iterative technique for blind source separation using only fourth order cumulants", Proc. EUSIPCO 92, Brussels, Aug. 1992, 2, pp. 739-742.

[3] Cardoso, J.-F., Souloumiac, A. "An efficient technique for blind separation of complex sources", Proc. IEEE SP Workshop on Higher-Order Statistics, Lake Tahoe, U.S.A., 1993, pp. 275-279.

[4] Comon, P. "Independent component analysis, a new concept". Signal Processing, 1994, 36, 3, 287-314.

[5] Gaeta, M., Lacoume, J.-L. "Source separation without a priori knowledge: the maximum likelihood solution", Proc. EUSIPCO 90, Barcelona, Spain, 1990, 621-624.

[6] Godambe, V. P. "Conditional likelihood and unconditional optimum estimating equations". Biometrika, 1963, 63, 277-284.

[7] Gorokov, A., Loubaton, Ph. "Second order blind identification of convolutive mixtures with temporally correlated sources: a subspace based approach", Proc. EUSIPCO 96, Trieste, Italy, 1996, 2093-2096.

[8] Jutten, C., Herault, J. "Blind separation of sources, Part I: an adaptive algorithm based on neuromimetic structure", Signal Processing, 1991, 24, 1-10.

[9] Hannan, E. J. Multiple Time Series, 1970. New York: Wiley.

[10] Lacoume, J.-L., Ruiz, P. "Source identification: a solution based on the cumulants", Proc. 4th ASSP Workshop on Spectral Estimation and Modelling, Minneapolis, USA, August 1988, pp. 199-203.

[11] Loubaton, Ph., Delfosse, N. "Separation adaptive de sources independantes par une approche de deflation", Proc. Colloque GRETSI 93, Juan-les-Pins, France, Sept. 1993, 1, 325-328.

[12] Mansour, A., Jutten, C. "A direct solution for blind separation of sources", IEEE Trans. SP, 1996, 44, 3, 418-935.

[13] Pham, D. T. "Blind separation of instantaneous mixture of sources via an independent component analysis", IEEE Trans. SP, 1996, 44, 11, 2768-2779.

[14] Pham, D. T., Garat, Ph. "Blind separation of mixtures of independent sources through a quasi maximum likelihood approach", IEEE Trans. SP, 1997, 45, 7, 1712-1725.

[15] Tong, L., Soon, V., Huang, Y. F., Liu, R. "Indeterminacy and identifiability of blind identification", IEEE Trans. Circuits and Systems, 1991, 38, pp. 499-509.

[16] Tong, L., Inouye, Y., Liu, R. "Waveform preserving blind estimation of multiple independent sources", IEEE Trans. SP, 1993, 41, 7, pp. 2461-2470.

[17] Yellin, D., Weinstein, E. "Criteria for multichannel signal separation", IEEE Trans. SP, 1994, 42, pp. 2158-2168.

[18] Zygmund, A. Trigonometric Series, Vol. I, 1998, Cambridge Univ. Press.