Signal Processing 73 (1999) 49–66
Adaptive unsupervised separation of discrete sources

Odile Macchi, Eric Moreau*

Laboratoire des Signaux et Systèmes (LSS), CNRS-Supélec-Univ. Paris 11, 91192 Gif-sur-Yvette, France
Groupe d'Etude des Signaux et Systèmes (MS-GESSY), ISITV, BP 56, 83162 La Valette du Var, France

Received 30 October 1997; received in revised form 3 September 1998
Abstract

This paper treats source separation with the help of contrast functions and proposes corresponding adaptive implementations. Its major contributions are twofold: (i) it proposes a new contrast which can be evaluated without pre-whitening the signals, provided they all have unit power; its adaptive maximization involves an output AGC for each recovered source and performs as well as separation with pre-whitened signals; (ii) in the case of sources with a discrete alphabet, an intermediate contrast is proposed which takes additional advantage of the alphabet knowledge. The improvement to source separation is significant for correlated signals, but for adaptively pre-whitened separation the improvement is conditioned by the quality of the whitening. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Unsupervised learning; Blind processing; Contrast functions; Normalized correlated signal separation; Automatic gain control; Adaptivity; Discrete signals
* Corresponding author. Tel.: +33 4 97 14 25 86; fax: +33 4 94 14 24 48; e-mail: [email protected]
1. Introduction
In recent years, the problem of source separation has received increasing interest because of its wide domain of potential applications, e.g. in telecommunications and in image reconstruction. Source separation enters the general category of 'signal separation', where several, say N, unknown random (time) input signals a_i(t) are jointly propagated inside a linear multiple-input channel F with multiple outputs x_j(t). The problem is that each x_j is a mixture of several a_i and that the mixing effect of the channel F is usually unknown. Hence, the different signals a_i are unobserved and not easily separated. In this paper, we consider the general linear model
y"Hx,
x"Fa#n,
(1)
where F is a fixed deterministic matrix called the 'mixture', a is the vector of input sources a_i and x is the vector of outputs x_j, observed after the mixture. The additive noise n, if any, is assumed to be zero-mean and independent of a. For instance, in the telecommunication context, this model is suitable for open-atmosphere (wireless) radio communications with several interfering transmitted sources a_i and several receivers arranged in an antenna array, bringing multiple items of information x_j. Then F represents the propagation in the atmosphere. Another example is image reconstruction, where the input sources a_i are the light intensities at the various pixels of the true image, and the outputs x_j characterize the observed image pixels after the distorting and blurring process F. The latter is caused by the electronics or by the optics of the camera. There are many other contexts where model (1) is relevant.

(Footnote: Hereafter the time t is assumed discrete.)

Usually the problem is attacked in a supervised mode. There is a learning phase, where some 'teacher' provides many examples of associated pairs (a, x). This permits the identification of the matrix F and of its inverse, denoted H (or its pseudo-inverse for a non-square matrix). After this learning phase the separation task is solved by computing the vector

y = Hx,  (2)
which will be essentially equal to a when the noise is low and the identification is good.

In the so-called unsupervised or self-learning approaches (often called blind, a denomination we think should be avoided because of its pejorative connotation), examples of pairs (a, x) are not available. Identification of F and H must be done with the sole knowledge of a sequence of observed x, or possibly of its statistics: both the input a and the matrix F are unknown. It is of course impossible to solve this problem without additional knowledge. In previous studies on source separation, the key assumption is always the statistical joint independence of the N sources a_i [3–8,10,11,15,17,19–21,24,26,27]. The purpose of the present paper is to show that the distribution of the sources is another kind of key a priori information that can be used to improve separation.

In fact, a very efficient unsupervised source separation method will be presented in this paper under the assumption that the a_i belong to a discrete alphabet A = {c_1, …, c_A}, which is the same for the N sources. In addition we shall suppose that they all have the same distribution, as is often the case, e.g. in transmission systems:

P{a_i = c_l} = p_l,  i = 1, …, N,  l = 1, …, A.  (3)

In the following, the alphabet A is taken symmetrical with respect to 0, and we define the decision function dec(·) matched to A: it is stepwise constant, with intermediate thresholds between the alphabet levels c_l, as illustrated in Fig. 1 for a 4-PAM data constellation. It is moreover an odd function because of the alphabet symmetry.

Fig. 1. The decision function for a 4-PAM constellation.

Without any loss of generality, it can be assumed that the a_i have unit power:

E(a_i²) = 1.  (4)
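To make the role of dec(·) concrete, here is a minimal Python sketch of a decision function matched to a symmetric 4-PAM alphabet. The particular levels, their unit-power scaling and all function names are our own illustrative assumptions, not taken from the paper; the later sketches in this article reuse these definitions.

    import numpy as np

    # Symmetric 4-PAM alphabet scaled to unit power: E[a^2] = (9+1+1+9)/4 = 5
    # for levels {-3,-1,+1,+3}, hence the division by sqrt(5) (assumed scaling).
    ALPHABET = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(5.0)

    def dec(y):
        """Decision function matched to ALPHABET: maps every entry of y to the
        nearest alphabet level. It is stepwise constant, with thresholds halfway
        between consecutive levels, and odd because the alphabet is symmetric."""
        y = np.asarray(y, dtype=float)
        idx = np.argmin(np.abs(y[..., None] - ALPHABET), axis=-1)
        return ALPHABET[idx]

For instance, dec(0.9) returns the alphabet level closest to 0.9, and dec(−y) = −dec(y) for every y, as required by the alphabet symmetry.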
2. Similarity with data equalization

In the context of adaptive equalization, the knowledge of Eq. (3) has already been exploited for three decades. The problem of equalization [1,2,9,12–14,16,18,22,23,25] is similar to source separation except that the source index i represents time. So the ith source is

a_i = g(iT),  (5)

where g(t) is the continuously emitted time signal and T is the so-called 'baud period' which controls the emission of data. Moreover, the matrix F has a particular (Toeplitz) form and a large dimension. It characterizes the time-invariant linear filter describing the transmission channel. The equalizer compensates for the channel and is similarly characterized by a large Toeplitz matrix H.

Usually, optimization of H is done in two distinct phases. First there is a learning (supervised) preamble phase where the a_i are known at the receiver, at the same time as the x_j. Then H is optimized so as to minimize the cost function

m(H) = E|y_i − a_i|².  (6)

The latter is independent of the time iT. This permits one to adaptively reach a sufficiently good value of H within a reasonable convergence time. Then y_i is close enough to a_i and the threshold decision device recovers the true data level:

dec(y_i) = a_i.  (7)

When Eq. (7) holds with a very high probability 1 − ε, ε ≪ 1, it is said that 'the eye is open'. Afterwards, optimization of the equalizer H is adaptively pursued in a second, self-learning phase without knowledge of the a_i. This is done by adaptively minimizing the so-called 'decision-directed cost function' [13,18]

d(H) = E|dec(y_i) − y_i|².  (8)
This self-learning mode only exploits the knowledge of the source alphabet A. It provides the optimal H with a very good final accuracy after another reasonable convergence period.

Several authors have considered self-learning approaches for which the open-eye assumption (7) is not required, see e.g. [22,25]. Then H is optimized so as to minimize another cost function j(H), which involves neither the a_i nor the (possibly falsely) recovered levels dec(y_i). The resulting adaptive equalizer does not require a learning phase, but the final value of H has poorer accuracy than with the cost functions (6) or (8). It has also been proposed in [16,23] to use a mixed strategy by adaptively minimizing the intermediate cost function

c(H) = αj(H) + (1 − α)d(H),  (9)
where the parameter α lies in the interval [0,1]. Notice that α = 0 is not permitted. This retains both advantages: no learning phase, because of j(H), and good final accuracy, because of d(H). The parameter α is taken close to 1 at the beginning of the optimization, when decisions would be false. At the end, the eye is open and decisions are correct, so α is set close to zero and the criterion is essentially d(H). The present paper applies a similar approach to the problem of source separation.
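As a quick illustration of Eqs. (8) and (9), the following sketch evaluates sample versions of the decision-directed cost and of the mixed cost on a batch of equalizer outputs. It reuses the dec(·) sketch above; the function names are ours.

    def dd_cost(y):
        """Sample estimate of the decision-directed cost d(H) = E|dec(y)-y|^2,
        Eq. (8), over a batch of equalizer outputs y."""
        return float(np.mean((dec(y) - y) ** 2))

    def mixed_cost(y, j_of_H, alpha):
        """Mixed criterion c(H) = alpha*j(H) + (1-alpha)*d(H), Eq. (9).
        j_of_H is a sample estimate of the self-learning cost j(H)."""
        return alpha * j_of_H + (1.0 - alpha) * dd_cost(y)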
3. Review of contrast functions

As mentioned in the introduction, the usual unsupervised approach to source separation is based on the assumption that the sources a_i, i = 1, …, N, are zero-mean and statistically independent. With a square N×N mixture matrix S, it has been shown [7] that the vector

y = Sa  (10)
has independent coordinates if, and only if (iff),

S = KP,  (11)
where P is an arbitrary permutation matrix (one and only one non-zero element per row and column, that element being 1) and K is an invertible diagonal matrix of gains λ_i. Then

y_i = λ_i a_{p(i)},  (12)

where p(i) is the label of the column where the 1 stands in the ith row of P. Therefore, the coordinates of y restore the unknown sources in an arbitrary order p(i) and up to arbitrary non-zero gains λ_i. These indeterminations are inherent to the problem of source separation. Indeed, it is clear that the (noiseless) observation vector is

x = Fa = F̃ã,  with F̃ = FPᵀK⁻¹,  (13)

ã = KPa.  (14)
Therefore the source vectors a and ã are indistinguishable on the sole basis of the observation x, even in the absence of noise. Thus source separation is characterized by independence of the coordinates. As a result, some quantitative measure of how close the coordinates of a vector y are to mutual independence is required. This is the purpose of the so-called 'contrast function' introduced by Comon in [7]. We call it J(y) for the sake of conciseness, although it is a function of the probability distribution of y. In general, the definition is restricted to a specific kind of distribution for y, e.g. the set of 'white' vectors, i.e. with uncorrelated components, or the set of normalized vectors, i.e. with unit-power components. Inside its set Y of definition for y, J(y) should satisfy the four following requirements:

R1. For all y ∈ Y and for all permutation matrices P,

J(Py) = J(y).  (15)
R2. For all y ∈ Y and for all diagonal invertible matrices K such that Ky ∈ Y,

J(Ky) = J(y).  (16)
R3. For all matrices S such that Sa ∈ Y, where a is a vector of unit-power independent sources a_i,

J(Sa) ≤ J(a).  (17)
R4. For all matrices S such that Sa ∈ Y, where a is a vector of unit-power independent sources a_i,

J(Sa) = J(a)  (18)

if and only if

S = KP  (19)
for all permutation and invertible diagonal matrices P and K such that KPy ∈ Y.

The first requirement means that the contrast is unaffected by a reordering of the coordinates, and the second one means that it is unaffected by arbitrary (admissible) rescalings of the coordinates. The third and fourth requirements mean that, except for reorderings of the coordinates and scale changes, all mixtures decrease the contrast.

Most contrast functions can be evaluated with the help of the cumulants of the coordinates y_i, for instance as in [7]

J(y) = Φ(K_{y_1}, …, K_{y_N}),  (20)

where Φ is a suitable function and K_{y_i} is the fourth-order self-cumulant of the zero-mean random variable y_i, that is,

K_{y_i} = E y_i⁴ − 3 (E y_i²)².  (21)

It follows from Eqs. (1) and (2) that

y = Sa + ñ,  (22)

where

S = HF  (23)

and where the output noise

ñ = Hn  (24)

is independent of a. Thus, in the Gaussian-noise case, the cumulants K_{y_i}, and hence the contrast J(y), are unaffected by the noise. Thus the contrast function

J(y) = J(Sa)  (25)

depends only on the separation matrix H. In the following, we make a change of variable and use the so-called 'cost function'

j(H) ≜ −J(y).  (26)
The minima of j(H) will characterize source separation (up to the output noise).

Originally, Comon [7] considered for Y the set, denoted Y_w, of the so-called 'white' vectors y, which are zero-mean and whose covariance matrix is the identity. Contrasts defined on Y_w will be called w-contrasts in the following. The coordinates y_i of y may remain dependent but, as a first step, they are uncorrelated:

E y_i y_j = 0 for i ≠ j,  i, j = 1, …, N.  (27)

Moreover, the vector y is 'normalized' in the sense that

E y_i² = 1,  i = 1, …, N.  (28)

(Footnote: This definition of normalization is a non-standard one, but it will prove helpful.)

Clearly a is a white vector. White vectors are normalized, but the converse is not true in general. The N constraints in Eq. (28) can be considered as a means to remove the indeterminations of the N gains λ_i in Eq. (12). The other N(N−1)/2 constraints in Eq. (27) can be considered as the first half of the N² − N constraints involved in the 'iff' independence condition (11). With this set Y_w for y, the matrices K and S mentioned in requirements R2–R4 are easily shown to be restricted to the class U of unitary matrices U such that

UUᵀ = I.  (29)
The most classical example of a w-contrast is

J_1(y) = Σ_{i=1}^{N} (K_{y_i})².  (30)

We have also introduced a few other w-contrast functions [19,20]. In particular,

J_2(y) = Σ_{i=1}^{N} |K_{y_i}|  (31)

is a w-contrast as well. In the present case of discrete sources satisfying assumption (3), all the a_i have identical fourth-order cumulants. If ε denotes their sign, it is easily seen that

|K_{y_i}| = ε K_{y_i}  (32)
(all the K_{y_i} have the same sign). It follows from Eqs. (28) and (21) that

K_{y_i} = E y_i⁴ − 3.  (33)

Therefore, dropping the unnecessary constant −3εN in Eq. (31), we obtain the simple w-contrast

J_3(y) = ε Σ_{i=1}^{N} E y_i⁴.  (34)

In [20], the set of definition of a contrast is enlarged to normalized but non-white vectors. It is denoted Y_n, and contrasts defined on Y_n will be called n-contrasts in the following. For instance, with the help of the fourth-order symmetrical cross-cumulants

K_{y_i y_j} = E y_i² y_j² − 2 (E y_i y_j)² − E y_i² E y_j²,  (35)

we have shown that the function

J_4(y) = Σ_{i=1}^{N} |K_{y_i}| − Σ_{i,j=1, i≠j}^{N} |K_{y_i y_j}|  (36)

is an n-contrast when the N sources a_i have an identical sign for their fourth-order cumulants, which is true here, as just mentioned (see Statement 1, Eq. (83) of [11]). Relationships (32) and (33) remain valid, and |K_{y_i y_j}| = ε K_{y_i y_j} holds in a way similar to Eq. (32). This permits one again to drop the unnecessary constant values and yields the new n-contrast

J_5(y) = ε Σ_{i=1}^{N} E y_i⁴ − ε Σ_{i,j=1, i≠j}^{N} {E y_i² y_j² − 2 (E y_i y_j)²}.  (37)

Nothing has been published yet about the maximization of this new n-contrast. In Sections 5.3 and 6 below, we prove that it can be adaptively maximized and that it does achieve good source separation.
4. Separation of discrete sources

The notions used in Section 2 to treat adaptive data equalization can be transposed to the adaptive separation of discrete sources a_i with levels in proportional alphabets A_i = b_i A in the following
way. Notion (7) of an 'open eye' obviously becomes

dec(y) = KPa,  (38)

where K (respectively P) is an admissible diagonal (respectively permutation) matrix, and where Eq. (38) holds with a very high probability 1 − ε, ε ≪ 1. Equivalently,

dec(y_i) = λ_i a_{p(i)}.  (39)

It is possible that the gain λ_i depends on the label i. The decision-directed cost function d(H) in Eq. (8) becomes

d(H) = Σ_{i=1}^{N} E|dec(y_i) − y_i|².  (40)

As in equalization, we can use a mixed strategy by minimizing the intermediate cost

c(H) = αj(H) + (1 − α)d(H),  (41)
where the parameter α lies in the interval ]0,1]. With the change of variables

C(y) ≜ −c(H),  D(y) ≜ d(H),  (42)

the problem is to maximize the functional

C(y) = αJ(y) − (1 − α)D(y).  (43)

This can be done in an adaptive way. Once the eye is open, this method yields an accurate optimization of H. Yet, in certain cases the open-eye condition (38) (or (39)) can be relaxed while the minima of c(H) remain characteristic of source separation. In other words, the function C(y) in Eq. (43) is a w-contrast or an n-contrast according to the definitions of Section 3. We call it an 'intermediate contrast'. This happens in particular in the noise-free case (n = ñ = 0) and for normalized vectors y. To prove it we check the four requirements (R1)–(R4). (R1) is fulfilled because, like J(y), the function D(y) in Eq. (42) is symmetrical w.r.t. the coordinates of y. As a result of the normalization of y, the admissible scale-change matrices K in requirement (R2) have the entries

λ_i = ±1  (44)

to meet constraint (28). Since the decision function is odd, it follows that

dec(λ_i y_i) − λ_i y_i = λ_i [dec(y_i) − y_i].  (45)

Therefore,

D(Ky) = D(y),  (46)

so, like J(y), D(y) is unaffected by sign changes (cf. Eq. (44)). Moreover, we have the relationships

C(y) ≤ αJ(y)  (47)

(because D is non-negative),

αJ(y) ≤ αJ(a)  (48)

(because J is a contrast), and

αJ(a) = C(a)  (49)

(because dec(a_i) = a_i). It follows that

C(y) ≤ C(a)  (50)

and requirement (R3) holds. Finally, in order that

C(y) = C(a),  (51)

it is necessary that both inequalities (47) and (48) become equalities. Because J(y) is an (n- or w-) contrast, equality in (48) means that

y = KPa,  (52)

where the λ_i obey Eq. (44) and P is a permutation matrix. Accordingly,

y_i = ±a_{p(i)} = dec(±a_{p(i)}) = dec(y_i).  (53)

Therefore D(y) = 0 and Eq. (47) is an equality too. Thus requirement (R4) is fulfilled and C(y) is a contrast. It is a w-contrast (respectively an n-contrast) if J is a w-contrast (respectively an n-contrast). In this contrast, the first part J(y) forces statistical independence of the coordinates, while the second part −D(y) forces the coordinates to belong to an alphabet that is proportional to the known alphabet A. Thus full advantage is taken of both pieces of a priori knowledge.

Instead of relaxing the open-eye condition, one can relax the condition that the parameter α be non-zero: it is now shown that the function −D(y) is a contrast over the set of normalized vectors
y for which the eye is open. Indeed, (R1) and (R2) have already been proved for this function. (R3) is also valid because

−D(y) ≤ 0 = −D(a).  (54)
To prove (R4), assume that the open-eye condition (38) holds with the matrices K and P, and denote

B = S − KP.  (55)

Then

y − dec(y) = Ba.  (56)
Hence the equation

−D(y) = −D(a) = 0  (57)

holds iff

Σ_{i=1}^{N} E(Σ_{j=1}^{N} b_{ij} a_j)² = 0,  (58)

where the b_{ij} are the entries of B. Now Eq. (58) reads

Σ_{i=1}^{N} Σ_{j=1}^{N} b_{ij}² E a_j² = 0  (59)

because the a_j are independent, hence uncorrelated. In turn, Eq. (59) implies that B = 0. Thus,

S = KP.  (60)
So −D(y) is a contrast.

Let us give two examples of contrast functions in Eq. (43).

Example 1 (Source separation with a pre-whitening stage). The set of definition for J(y) is the set Y_w of white vectors. As a result, the separation matrix is split into two successive stages. The (first) pre-whitening stage is made of a fixed matrix W that whitens the observation x by computing

w = Wx  (61)
in such a way that

R_w ≜ E[wwᵀ] = I.  (62)

Then the proper separation stage, with matrix H, generates the final vector

y = Hw.  (63)
This vector is white too, provided that H is unitary:

HHᵀ = I.  (64)

In this case the w-contrast J_3(y) given in Eq. (34) yields the intermediate w-contrast

C_w(y) = αε Σ_{i=1}^{N} E y_i⁴ − (1 − α) Σ_{i=1}^{N} E(dec(y_i) − y_i)².  (65)

This example is further developed in Section 5.2.
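A sample version of the intermediate w-contrast (65) can be written directly from the two previous sketches; as before, the names and the (N, T) data layout are illustrative assumptions.

    def intermediate_w_contrast(Y, eps, alpha):
        """Sample estimate of C_w(y), Eq. (65), for whitened outputs Y of shape
        (N, T): alpha*eps*sum_i E[y_i^4] - (1-alpha)*sum_i E[(dec(y_i)-y_i)^2]."""
        fourth = float(np.sum(np.mean(Y ** 4, axis=1)))
        dist = float(np.sum(np.mean((dec(Y) - Y) ** 2, axis=1)))
        return alpha * eps * fourth - (1.0 - alpha) * dist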
Example 2 (Source separation with output gain controls). The set of definition for J(y) is the set Y_n of normalized vectors. As a result, the separation task involves a post-processing of the kind

y = Kz,  (66)
where the diagonal matrix K ensures the normalizing condition (28) for the y_i. This can be done by placing an adaptive gain control (AGC) on each coordinate y_i (see Section 5.2.1). The proper separation stage, with matrix H, comes first according to

z = Hx,  (67)
where the diagonal entries of the matrix H can be constrained to unity without loss of generality. In this case the n-contrast J_5(y) in Eq. (37) yields the intermediate n-contrast

C_n(y) = εα Σ_{i=1}^{N} E y_i⁴ − εα Σ_{i,j=1, i≠j}^{N} {E y_i² y_j² − 2 (E y_i y_j)²} − (1 − α) Σ_{i=1}^{N} E(dec(y_i) − y_i)².  (68)

This example is further developed in Section 5.3.

5. Adaptive source separation

5.1. Stochastic gradient procedures

It remains to estimate the separation matrix such that the contrast C(y) in Eq. (43) is maximum or, equivalently, such that the cost c(H) in Eq. (41) is minimum. Our general strategy is adaptive, based on gradient algorithms. First, a deterministic procedure to reach the minimum of c(H) is an iterative algorithm which updates H with the opposite-gradient increment

Δh_i = −(μ/2) ∇_{h_i} c(H),  i = 1, …, N.  (69)
In this equation h_i = (h_{i1}, …, h_{iN})ᵀ denotes the ith row of the separation matrix H, and μ is a small positive step-size. Accordingly, a local optimum H_opt will be found as the limit of the sequence

H(n) = H(n−1) − (μ/2) ∇_H c(H)|_{H = H(n−1)}.  (70)
Very often, it is possible to express the cost c(H) as an expectation according to

c(H) = E_H[C(H, x)],  (71)

where the notation E_H indicates an expectation conditioned on the value of H: the expectation is performed with the probability distribution of the observed vector x. Then

∇_{h_i} c(H) = E_H[∇_{h_i} C(H, x)].  (72)

The so-called 'adaptive algorithm' is simpler. It consists in dropping the expectation involved in Eq. (72): the matrix H is updated each time a new observation x(n) is measured, according to the time-recursive algorithm

h_i(n) = h_i(n−1) − (μ/2) ∇_{h_i} C(H, x(n))|_{H = H(n−1)},  i = 1, …, N,  (73)
where the increment is evaluated with H in the state H(n−1), which is available before the observation x(n) is gained. The matrix H(n) generated by Eq. (73) is, like x(n), a stochastic matrix. This algorithm enters the category of stochastic approximation, for which stability and convergence investigations are, in general, difficult [2,12]. Yet, in many cases a few theoretical results can be obtained, while the algorithm turns out to be efficient in practice. We now consider two examples of Eq. (73).
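The shape of recursion (73) is conveyed by the following generic sketch, in which grad_C is a user-supplied routine returning the instantaneous gradient of C(H, x); this skeleton is an assumption-level illustration, not the paper's implementation.

    def stochastic_gradient(H, x_stream, grad_C, mu):
        """Generic adaptive recursion of Eq. (73): for each new observation
        x(n), H(n) = H(n-1) - (mu/2) * grad_C(H(n-1), x(n))."""
        for x in x_stream:
            H = H - 0.5 * mu * grad_C(H, x)
        return H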
5.2. Adaptive source separation with pre-whitening (Example 1)

5.2.1. The adaptive pre-whitening stage

In [20] we have proposed an adaptive method to generate a matrix W such that the vector w in Eq. (61) is white. This is done by first decorrelating the coordinates of x thanks to a lower triangular matrix

T =
[ 1          0     ⋯    0 ]
[ −t_{21}    1     ⋱    ⋮ ]
[ ⋮          ⋱     ⋱    0 ]
[ −t_{N1}    ⋯   −t_{N,N−1}   1 ],  (74)

then by normalizing the power of each coordinate thanks to a diagonal matrix K. So

v = Tx,  (75)

w = Kv.  (76)
Accordingly, the whole separation system involves three stages, as depicted in Fig. 2.

Fig. 2. Adaptive source separation with a pre-whitening stage.

The linear transform (75) is a Gram–Schmidt procedure performing step-by-step decorrelation: the i − 1 unknowns t_i ≜ (t_{i1}, …, t_{i,i−1})ᵀ of the ith row of T are found by assuming that v_1, v_2, …, v_{i−1} are already uncorrelated and by adding the new constraints that

E v_i x^{i−1} = E[(x_i − t_iᵀ x^{i−1}) x^{i−1}] = 0,  (77)
where the vector x^{i−1} = (x_1, …, x_{i−1})ᵀ includes only the first i − 1 coordinates of x. In the Wiener–Hopf theory, the root t_i of Eq. (77) is known to minimize the mean square error

f(t_i) ≜ E[(x_i − t_iᵀ x^{i−1})²]  (78)
made when the ith coordinate x_i is linearly estimated on the basis of the i − 1 previous coordinates x_1, …, x_{i−1}. Therefore an adaptive procedure to evaluate t_i can be deduced, e.g. from the classical LMS algorithm [12]. The latter reads

t_i(n) = t_i(n−1) + μ_t v_i(n) x^{i−1}(n),  (79)
where μ_t is a small positive step-size and v_i(n) is evaluated by taking t_i = t_i(n−1). This adaptive algorithm is controlled by v_i(n), which comes at the output of the matrix T, as indicated by the cross-arrow starting from the output in Fig. 2.

The diagonal transform (76) will provide the w_i with normalized power iff

λ_i² = 1 / E v_i²,  i = 1, …, N.  (80)

Each λ_i can be found by implementing an AGC on the channel v_i(n), e.g. along

w_i(n) = λ_i(n−1) v_i(n),  (81)

c_i(n) = c_i(n−1) + μ_c [1 − w_i²(n)],  μ_c > 0,  (82)

λ_i(n) = √(c_i(n)).  (83)
This algorithm is controlled by the output w_i of the AGC, hence the cross-arrow on each channel v_i of Fig. 2. Clearly, the resulting vector w will satisfy the whiteness constraint (62) once the matrices T(n) and K(n) have reached their steady states, where Eqs. (77) and (80) hold.
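For N = 2, one iteration of the adaptive pre-whitener (Eqs. (75), (76), (79) and (81)-(83)) may be sketched as follows; the state variables (the single Gram-Schmidt coefficient t21 and the two AGC states c) and all names are our assumptions.

    def prewhiten_step(x, t21, c, mu_t, mu_c):
        """One pre-whitening iteration for N = 2.
        x = (x1, x2): current observation; t21: Gram-Schmidt coefficient of T;
        c: length-2 array of AGC states with lambda_i = sqrt(c_i), Eq. (83)."""
        x1, x2 = x
        v = np.array([x1, x2 - t21 * x1])   # v = T x, Eq. (75)
        t21 = t21 + mu_t * v[1] * x1        # LMS decorrelation, Eq. (79)
        w = np.sqrt(c) * v                  # w = K v with lambda(n-1), Eqs. (76), (81)
        c = c + mu_c * (1.0 - w ** 2)       # AGC power tracking, Eq. (82)
        return w, t21, c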
5.2.2. The proper separation stage

The matrix H should fulfill the unitary constraint (64). According to [7], it can be parametrized via M = N(N−1)/2 Givens (planar) rotations along

H = E G_M(θ_M) ⋯ G_1(θ_1).  (84)
Each matrix

G_m(θ) =
[ I_{m_1}   0        0         0        0       ]
[ 0         cos θ    0         −sin θ   0       ]
[ 0         0        I_{m_2}   0        0       ]
[ 0         sin θ    0         cos θ    0       ]
[ 0         0        0         0        I_{m_3} ]  (85)

contains a single parameter θ, the matrices I_{m_i} being identity matrices of order m_i. G_m is specified by the locations of the two elements cos θ among its diagonal entries; hence there are N(N−1)/2 such matrices. The matrix E is diagonal with entries e_i = ±1. It corresponds to fixed restoration gains for the sources and can thus be omitted. We shall denote Θ = (θ_1, …, θ_M)ᵀ.

The w-contrast in Eq. (65) is an expectation, so the associated cost c_w(Θ) = −C_w(y) is of the form E_Θ C_w(Θ, w), with

C_w(Θ, w) = Σ_{i=1}^{N} [−αε y_i⁴ + (1 − α)(dec(y_i) − y_i)²].  (86)

Hence the adaptive gradient procedure to optimize the matrices G_m:

Θ(n) = Θ(n−1) − (μ/2) ∇_Θ C_w(Θ(n−1), w(n)),  μ > 0.  (87)

Assuming that the decisions dec(y_i) remain unchanged for a slight change of Θ, we obtain

−(1/2) ∇_Θ C_w(Θ, w) = Σ_{i=1}^{N} [2αε y_i³ + (1 − α)(dec(y_i) − y_i)] ∇_Θ y_i.  (88)

The N gradient vectors ∇_Θ y_i in Eq. (88) follow from Eqs. (63), (84) and (85):

∂y/∂θ_m = G_M(θ_M) ⋯ G_m(θ_m + π/2) ⋯ G_1(θ_1) w.  (89)
Hence the adaptive algorithm controlling Θ. For instance, consider the case of N = 2 sources; then
M"1. There is a single angle h. According to Eq. (89) *y *y "!y . "y , *h *h Hence Eq. (87) reads
(90)
h(n)"h(n!1)#k+2aey y (y!y) #(1!a)(dec(y )y !dec(y )y ),.
(91)
This adaptive system, including the pre-whitener, is called pre-whitened separation (PWS). A quick look at Eq. (91) shows that the advantage brought to the adaptation by the decision function comes practically free of any computational cost.

In [19] it is shown that, for N = 2, α = 1 and subject to perfect pre-whitening, the PWS algorithm yields an unbiased steady-state value. This means that E(θ(n)) converges to a value θ̃ such that G(θ̃) is a separating matrix. Moreover, the residual fluctuations σ² ≜ E((θ(n) − θ̃)²) are shown to be proportional to μ. This is what can be called 'quasi mean-square convergence', because θ(n) can be made arbitrarily close to θ̃ by decreasing the step-size μ. These results have been reached by linearization of the algorithm (91), assuming that θ(n) starts close to a separating value θ̃.
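For N = 2 the whole PWS update (91) fits in a few lines. The rotation convention below is chosen so that ∂y_1/∂θ = y_2 and ∂y_2/∂θ = −y_1, consistent with Eq. (90); the sketch reuses the dec(·) function above, and all names are our assumptions.

    def pws_step(theta, w, alpha, eps, mu):
        """One PWS iteration, Eqs. (87)-(91): rotate the whitened pair w by
        the Givens angle theta, then adapt theta along the stochastic
        gradient."""
        ct, st = np.cos(theta), np.sin(theta)
        y1 = ct * w[0] + st * w[1]    # y = G(theta) w, chosen so that
        y2 = -st * w[0] + ct * w[1]   # dy1/dtheta = y2, dy2/dtheta = -y1
        theta += mu * (2.0 * alpha * eps * y1 * y2 * (y1 ** 2 - y2 ** 2)
                       + (1.0 - alpha) * (dec(y1) * y2 - dec(y2) * y1))
        return theta, np.array([y1, y2])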
5.3. Adaptive source separation with output AGC (Example 2)

The system involves two stages, as depicted in Fig. 3. The diagonal matrix K adaptively implements the N output gain controls λ_i in order to ensure Eq. (28), as detailed in Section 5.2.1, Eqs. (81)-(83), but with the variables z_i (respectively y_i) replacing v_i (respectively w_i). The adaptive realization of K is the same as described in Fig. 2.

The second issue is the adaptation of the front separation matrix H. We adopt the intermediate n-contrast C_n(y) in Eq. (68), whose associated cost

c_n(H) = −C_n(y)  (92)

can be written

c_n(H) = E_H C_n(H, y),  (93)
Fig. 3. Adaptive source separation with N output AGC.
with

C_n(H, y) = Σ_{i=1}^{N} [−αε y_i⁴ + (1 − α)(dec(y_i) − y_i)²] + αε Σ_{i,j=1, i≠j}^{N} (y_i² y_j² − 2 y_i y_j E y_i y_j).  (94)

To simplify, consider only the case of N = 2, with the parametrization

H ≜
[ 1        h_{12} ]
[ h_{21}   1      ].  (95)

Clearly,
−(1/2) ∇_H C_n(H, y) = Σ_{i=1}^{2} [2αε y_i³ + (1 − α)(dec(y_i) − y_i)] ∇_H y_i − (εα/2) ∇_H [y_1² y_2² − 2 y_1 y_2 E y_1 y_2].  (96)
According to Eqs. (66), (67) and (95),

∂y_1/∂h_{21} = ∂y_2/∂h_{12} = 0  (97)

and

∂y_i/∂h_{ij} = λ_i x_j + z_i ∂λ_i/∂h_{ij},  (98)

where (i, j) = (1, 2) or (2, 1). According to the normalizing property, λ_i² E z_i² = 1. Thus,

∂λ_i/∂h_{ij} + λ_i³ (E x_i x_j + h_{ij} E x_j²) = 0.  (99)
It follows easily from Eqs. (98) and (99) that

∂y_i/∂h_{ij} = λ_i [x_j − λ_i y_i (E x_i x_j + h_{ij} E x_j²)].  (100)

Plugging Eqs. (97) and (100) into Eq. (96) yields the adaptive gradient algorithm

Δh_{ij} = μ λ_i {αε (2y_i³ + y_j E y_i y_j − y_i y_j²) + (1 − α)(dec(y_i) − y_i)} × [x_j − λ_i y_i (E x_i x_j + h_{ij} E x_j²)] + μαε y_i y_j λ_i [E x_i x_j (λ_j h_{ji} − λ_i E y_i y_j) + E x_j² (λ_j − λ_i h_{ij} E y_i y_j)],  (101)

where, again, (i, j) = (1, 2) or (2, 1). In these equations the quantities are evaluated with the nth observation x(n) for x and with the separating system KH in the state K(n−1)H(n−1):

z(n) = H(n−1) x(n),  y(n) = K(n−1) z(n).  (102)

Finally, the expectations E y_1 y_2 (and E x_1², E x_2², E x_1 x_2) which appear in Eq. (101) can be estimated via the LMS-type algorithm

M(n) = M(n−1) + μ_m [m(n) − M(n−1)],  (103)

where m(n) is the nth trial of y_1 y_2 (and of x_1², x_2², x_1 x_2) and μ_m is a small positive step-size.

The joint adaptive formulae (101)-(103), plus the AGC formulae for K(n), constitute an adaptive source separation system which we call the 'normalized separation' (NS) system. As for the PWS algorithm, it is clear that introducing the decision function into the adaptation (101) is free of computational cost. In [19] it is shown that c_n(H) exhibits only global minima, so the algorithm is unbiased. A full theoretical analysis of the PWS and NS algorithms is beyond the scope of this paper.
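One NS iteration for N = 2 can be sketched as follows. It combines the separation update (101) as reconstructed above, the evaluation order (102), the LMS moment tracking (103) and the output AGC of Eqs. (81)-(83); all variable names, the dictionary of tracked moments and the use of separate step-sizes are our assumptions.

    def ns_step(x, H, c, mom, alpha, eps, mu, mu_m, mu_c):
        """One NS iteration for N = 2. H = [[1, h12], [h21, 1]]; c holds the
        two AGC states (lambda_i = sqrt(c_i)); mom tracks the moments of
        Eq. (103) under the keys 'x1x2', 'x1x1', 'x2x2', 'y1y2'."""
        z = H @ x                            # z = H x, Eqs. (67), (102)
        lam = np.sqrt(c)
        y = lam * z                          # y = K z, Eq. (66)
        # LMS tracking of E[x1 x2], E[x_i^2] and E[y1 y2], Eq. (103)
        for key, m in (("x1x2", x[0] * x[1]), ("x1x1", x[0] ** 2),
                       ("x2x2", x[1] ** 2), ("y1y2", y[0] * y[1])):
            mom[key] += mu_m * (m - mom[key])
        Eyy = mom["y1y2"]
        dH = np.zeros((2, 2))
        for i, j in ((0, 1), (1, 0)):
            Exx_j = mom["x1x1"] if j == 0 else mom["x2x2"]
            # bracket of Eq. (100), i.e. (dy_i/dh_ij) / lambda_i
            dy = x[j] - lam[i] * y[i] * (mom["x1x2"] + H[i, j] * Exx_j)
            dH[i, j] = mu * lam[i] * (
                (alpha * eps * (2 * y[i] ** 3 + y[j] * Eyy - y[i] * y[j] ** 2)
                 + (1 - alpha) * (dec(y[i]) - y[i])) * dy
                + alpha * eps * y[i] * y[j]
                  * (mom["x1x2"] * (lam[j] * H[j, i] - lam[i] * Eyy)
                     + Exx_j * (lam[j] - lam[i] * H[i, j] * Eyy)))  # Eq. (101)
        H = H + dH
        c = c + mu_c * (1.0 - y ** 2)        # output AGC, Eq. (82)
        return y, H, c, mom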
6. Computer simulations

This section demonstrates the efficiency of the PWS and NS algorithms.

Sources and mixtures: There are two zero-mean sources which assume the two levels ±1 with equal probabilities. Hence a is normalized and ε = −1.
Since the order of the sources and the signs with which they are recovered are irrelevant, it can be assumed without loss of generality that f_{11} and f_{22} are positive and that f_{11} f_{22} ≥ |f_{12} f_{21}|. Two examples of mixtures have been considered:
1 F " 0.4
!0.6 1
.
(104)
Case 2:
1 F " 0.4
0.6 1
.
(105)
For the Hérault-Jutten separation algorithm it has been shown [15] that Case 2 is more severe than Case 1 because of the sign configuration in F_1 and F_2. The following shows that this is also true for the PWS and NS algorithms.

Noise: When present, the additive noise n (see Eq. (1)) involved in the observed vector x is zero-mean and Gaussian, with independent components n_1 and n_2. The power level of n_1 (respectively n_2) is 15 dB below the power level of the dominant source in x_1 (respectively x_2); in other words, SNR = 15 dB.

Choice of the parameter α in the algorithms: The parameter α involved in the PWS and NS algorithms is set to 1 during the first B iterations. The contrast is then J(y); it seeks independence between y_1 and y_2. By that time the eye is open, and α is switched to 0. The contrast becomes −D(y), and it forces y_1 and y_2 to belong to the discrete alphabet.

Performance index: The non-negative performance index of [20] is used to characterize the source separation achievement:
1 "s " GH ind(S)" !1 2 max "s " J GJ G H "s " GH # !1 , (106) max "s " J JH H G where the s are the entries of the overall mixture GH matrix (mixture F followed by separation H). This index reaches its minimum (zero) value iff S is of the form (11), that is, when source separation is achieved. Approximate separation is thus realized
Fig. 4. Performance index for PWS in Case 1.
Approximate separation is thus realized when the index is much smaller than 1. The factor 1/2 is introduced because both terms in the brackets take the same value when the matrix S is approximately equal to the product of an invertible diagonal matrix with a permutation matrix. This index is estimated by averaging 50 independent trials. It is plotted against time (iterations).

Reference method: To evaluate the potential benefit to adaptive source separation brought by the two algorithms PWS and NS, we compare them to the equivariant adaptive separation via independence (EASI) of [4]. Like NS, the EASI algorithm does not require pre-whitening. In this method, the separation matrix H(n) is, for instance, updated according to

H(n) = (I − λ M(n)) H(n−1),  M(n) = y(n) yᵀ(n) − I + g(y(n)) yᵀ(n) − y(n) g(y(n))ᵀ,  (107)

where g(·) = (·)³ is the componentwise cubic function and λ is a small positive constant.
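For reference, one EASI update (107) reads as follows in the same sketch style; y is the output computed with the previous separator, and g is the componentwise cube.

    def easi_step(H, x, lam):
        """One EASI update, Eq. (107), with g(y) = y**3 componentwise."""
        y = H @ x                            # output with H(n-1)
        g = y ** 3
        M = (np.outer(y, y) - np.eye(len(y))
             + np.outer(g, y) - np.outer(y, g))
        return (np.eye(len(y)) - lam * M) @ H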
Results: The various step-sizes involved in the simulations have been chosen to assign approximately the same convergence speed to the three algorithms PWS, EASI and NS in similar working conditions, so we can compare them by means of their respective steady-state performance indices. Figs. 4-6 (respectively Figs. 7-9) are concerned with Case 1 (respectively Case 2), while Figs. 10-12 illustrate the influence of noise. In these nine figures we observe that the three algorithms reach a satisfactory separation index below −26 dB within less than a thousand iterations. This is before the switch of α from 1 to 0 at step B = 3000. The PWS algorithm (optimizing the classical w-contrast of Eq. (34)), the NS algorithm (optimizing the novel n-contrast of Eq. (37)) and the EASI algorithm are all successfully implemented in an adaptive way. An improved efficiency is gained with the novel intermediate n-contrast C_n(y), as evidenced by Figs. 6, 9 and 12 at step B = 3000. The presence of noise at the level −15 dB does not alter these conclusions at all.

Superiority of direct adaptive separation: NS is always found superior to EASI, which in turn is always superior to PWS. This tends to show that, in an adaptive method, it is preferable to perform decorrelation and separation in a single step rather than to (adaptively) pre-whiten the signals before the proper source separation:
Fig. 5. Performance index for EASI in Case 1.
Fig. 6. Performance index for NS in Case 1.
Fig. 7. Performance index for PWS in Case 2.
Fig. 8. Performance index for EASI in Case 2.
Fig. 9. Performance index for NS in Case 2.
Fig. 10. Performance index for PWS in Case 1. Noisy case.
Fig. 11. Performance index for EASI in Case 1. Noisy case.
Fig. 12. Performance index for NS in Case 1. Noisy case.
the performance of PWS is limited by too coarse a pre-whitening. This drawback could be corrected by decreasing the corresponding step-size, but at the price of a slower convergence. Moreover, Figs. 4, 7 and 10 exhibit no gain at step B = 3000 when α is switched from 1 to 0: the too coarse pre-whitening also cancels the benefit brought by the knowledge of the source alphabet and by the intermediate contrast. To correct these drawbacks of PWS, one can think of adaptive least-squares procedures (rather than Gram-Schmidt) to improve the pre-whitening efficiency. However, this would come at a higher computational cost. To the best of our knowledge, this has not yet been developed in the literature.

Benefit of the intermediate contrast: If we compare NS and EASI, we get the same conclusion in the three examples. During the first B = 3000 steps, where NS does not take advantage of the discreteness of the sources, it is slightly inferior to EASI. This is likely due to the fact that EASI proceeds in one single phase, whereas NS involves two phases (proper separation is followed by normalization in the output AGCs). This special feature of NS would become useful in the case of fast power variations of the sources, but this is not considered in this contribution. Now, the supplementary index decrease of NS by more than 6 dB in Figs. 6, 9 and 12, when α is switched from 1 to 0, evidences the advantage that can be drawn by introducing the knowledge of the source alphabet into the contrast. This is how the novel NS algorithm outperforms EASI by 2 or 3 dB in Case 1 (noiseless and noisy cases) and by up to 6 dB in the severe case of mixture 2.
7. Conclusion

This paper considers adaptive source separation systems which are based on contrast functions. It provides two novel contributions.

First, we propose a new contrast function that is valid when the observed signals are correlated. The corresponding separation system, called 'normalized separation', is free from a pre-whitening stage and involves N output AGCs. It can be made adaptive. The control of the proper input separation stage then requires a feedback from the AGC outputs, but the input signal does not need to be decorrelated. By computer simulations we demonstrate the practical efficiency of this criterion.

Secondly, for sources with a known discrete alphabet, we design and investigate a new 'intermediate' contrast function which is based on two assumptions: (i) as usually assumed, the sources are independent of one another, and (ii) the sources are known to be discrete. At a practically zero additional computational cost, this second piece of knowledge brings a significant improvement to the steady-state achievements of the new normalized adaptive separation system. For adaptively pre-whitened systems, no similar improvement is observed, the reason being that pre-whitening would require an overly accurate, hence lengthy, convergence to let this improvement appear.
References

[1] A. Benveniste, M. Goursat, G. Ruget, Robust identification of a minimum phase system, IEEE Trans. Automat. Control 25 (1980) 385-398.
[2] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations, Applications of Mathematics, Vol. 22, Springer, Berlin, 1990.
[3] J.F. Cardoso, On the performance of orthogonal source separation algorithms, in: Proc. EUSIPCO'94, Edinburgh, Scotland, 1994, pp. 776-779.
[4] J.F. Cardoso, B. Laheld, Equivariant adaptive source separation, IEEE Trans. Signal Process. 44 (1996) 3017-3030.
[5] J.W. Carlin, Y. Bar-Ness, S. Gross, M.L. Steinberger, W.E. Studdiford, An IF cross-pol canceller for microwave radio systems, IEEE J. Selected Areas Comm. 5 (3) (April 1987) 502-514.
[6] P. Chevalier, On the performances of higher order blind sources separation methods, in: Proc. HOS'95, Girona, Spain, 1995, pp. 30-34.
[7] P. Comon, Independent component analysis, a new concept?, Signal Processing 36 (3) (April 1994) 287-314.
[8] N. Delfosse, P. Loubaton, Adaptive separation of independent sources: A deflation approach, Signal Processing 45 (3) (April 1995) 59-83.
[9] D.N. Godard, Self-recovering equalization and carrier tracking in two-dimensional data communication systems, IEEE Trans. Comm. 28 (1980) 1867-1875.
[10] C. Jutten, J. Hérault, Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture, Signal Processing 24 (1991) 1-10.
[11] J.L. Lacoume, P. Ruiz, Separation of independent sources from correlated inputs, IEEE Trans. Signal Process. 40 (12) (December 1992) 3074-3078.
[12] O. Macchi, Adaptive Processing: The Least Mean Square Approach with Applications in Transmission, Wiley, New York, 1995.
[13] O. Macchi, E. Eweda, Convergence analysis of self-adaptive equalizers, IEEE Trans. Inform. Theory 30 (1983) 162-176.
[14] O. Macchi, A. Hachicha, Self-adaptive equalization based on a prediction principle, in: Proc. Globecom, Houston, 1986, pp. 1641-1645.
[15] O. Macchi, E. Moreau, Self-adaptive source separation, Part I: Convergence analysis of a direct linear network controlled by the Hérault-Jutten algorithm, IEEE Trans. Signal Process. 45 (1997) 918-926.
[16] O. Macchi, C.A.F. da Rocha, J.M.T. Romano, Egalisation adaptative autodidacte par rétroprédiction et prédiction, in: Proc. 14th GRETSI Symp., Juan-les-Pins, 1993, pp. 491-494.
[17] A. Mansour, C. Jutten, Fourth-order criteria for blind source separation, IEEE Trans. Signal Process. 43 (1995) 2022-2025.
[18] J.E. Mazo, Analysis of decision-directed equalizer convergence, Bell Syst. Tech. J. 59 (1980) 1857-1876.
[19] E. Moreau, Apprentissage et adaptativité. Séparation auto-adaptative de sources indépendantes par un réseau linéaire, Ph.D. Thesis, Orsay, February 1995.
[20] E. Moreau, O. Macchi, High-order contrasts for self-adaptive source separation, Internat. J. Adaptive Control Signal Process. 10 (1) (January 1996) 19-46.
[21] E. Moreau, N. Thirion, Multichannel blind signal deconvolution using high order statistics, in: Proc. SSAP'96, Corfu, Greece, 1996, pp. 336-339.
[22] G. Picchi, G. Prati, Blind equalization and carrier recovery using a "stop-and-go" decision-directed algorithm, IEEE Trans. Comm. 35 (1987) 877-887.
[23] C.A.F. da Rocha, O. Macchi, J.M.T. Romano, An adaptive nonlinear IIR filter for self-learning equalization, in: Proc. Internat. Telecom. Conf., Rio de Janeiro, Brazil, 1994, pp. 6-10.
[24] H. Sahlin, U. Lindgren, The asymptotic Cramér-Rao lower bound for blind signal separation, in: Proc. SSAP'96, Corfu, Greece, 1996, pp. 328-331.
[25] Y. Sato, A method of self-recovering equalization for multilevel amplitude-modulation systems, IEEE Trans. Comm. 23 (1975) 679-682.
[26] E. Weinstein, M. Feder, A.V. Oppenheim, Multi-channel signal separation by decorrelation, IEEE Trans. Speech Audio Process. 1 (1993) 405-413.
[27] D. Yellin, B. Friedlander, Blind multi-channel system identification and deconvolution: performance bounds, in: Proc. SSAP'96, Corfu, Greece, 1996, pp. 582-585.