
Signal Processing 64 (1998) 291-300

Information-Theoretic Approach to Blind Separation of Sources in Non-linear Mixture

Howard Hua Yang(a), Shun-ichi Amari(b) and Andrzej Cichocki(b)

(a) Department of Computer Science and Engineering, Oregon Graduate Institute, P.O. Box 91000, Portland, OR 97291-1000, USA. FAX: 503 690 1548, email: [email protected]

(b) Brain-Style Information Systems Research, RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan. FAX: +81 48462 4633, emails: [email protected], [email protected]

Abstract

The linear mixture model is assumed in most papers devoted to blind separation. A more realistic mixture model is non-linear. In this paper, a two-layer perceptron is used as a de-mixing system to separate sources from a non-linear mixture. The learning algorithms for the de-mixing system are derived by two approaches: maximum entropy and minimum mutual information. The algorithms derived from the two approaches have a common structure. The new learning equations for the hidden layer are different from the learning equations for the output layer. The natural gradient descent method is applied in maximizing entropy and minimizing mutual information. The information (entropy or mutual information) back-propagation method is proposed to derive the learning equations for the hidden layer.

Key words: blind separation, non-linear mixture, maximum entropy, minimum mutual information, information back-propagation

Running head: Blind Separation of Non-linear Mixture

1 Introduction

One of the important issues in signal processing is to find a set of statistically independent components from the observed data by a linear or non-linear transformation. Most blind separation algorithms are based on the theory of independent component analysis (ICA) [8] when the mixture model is linear. The idea of ICA is to find the independent components in a mixture of statistically independent source signals by optimizing some criteria. Some blind separation algorithms, such as those in [2, 1, 5, 7, 6], have the equivariant property when there is no noise in the observation of the mixture. However, for the non-linear mixture model, the linear ICA theory is not applicable and the equivariant property does not hold. The blind separation algorithms for the linear mixture model generally fail to extract the independent sources from a non-linear mixture.

The self-organizing map (SOM) has been used to extract sources from non-linear mixtures [9, 11, 12]. It is a model-free method but suffers from a) the exponential growth of the network complexity and b) interpolation error in recovering continuous sources. An extension of ICA to the separation of sources in non-linear mixtures is to employ a non-linear function to transform the mixture such that the outputs become statistically independent. However, without limiting the function class for the de-mixing transforms, this extension may give statistically independent outputs that tell nothing about the sources, because any random vector with a continuous distribution function can be transformed into a vector of mutually independent, uniformly distributed random variables, and such a transformation is not unique. Limiting the function class for the de-mixing functions is equivalent to assuming some knowledge about the mixing function.

We employ a two-layer perceptron, a parametric model, as the de-mixing system. A two-layer perceptron was also used in [4] to separate sources in a non-linear mixture by a gradient descent method minimizing the mutual information (a measure of dependence) of the outputs. Compared with the approach in [4], this paper includes the following innovations:

1. on-line learning algorithms derived by maximizing entropy and minimizing mutual information;

2. learning algorithms consisting of a) learning equations for the hidden-output layer which are the same as those in [2, 3, 6, 15] for the linear de-mixing system and b) novel learning equations for the input-hidden layer and the thresholds of the hidden neurons;

3. the use of the natural gradient descent method to optimize the cost functions (entropy or mutual information) without constraints.

To estimate the mutual information, instead of using the Taylor expansion of a multi-variable characteristic function [4], we use the Gram-Charlier expansion to approximate the marginal probability density functions and obtain a more reliable estimate of the mutual information. A short version of this paper appeared in [17]. The non-linear mixture model considered there and in this paper contains crossing non-linearities in the mixture. This model is more general than that in [10, 14], where the non-linear mixture is obtained by applying non-linear functions componentwise to a linear mixture.

2 Mixture models and de-mixing systems

Let us consider unknown source signals s_i(t), i = 1, ..., n, which are mutually independent and stationary. It is assumed that each source has zero mean and moments of all orders, and that at most one source is Gaussian. Before we describe the non-linear mixture model, let us briefly review the linear mixture model:

x(t) = A s(t)

where A is an unknown non-singular mixing matrix, s(t) = [s_1(t), ..., s_n(t)]^T and x(t) = [x_1(t), ..., x_n(t)]^T. In this paper, the transpose operation is denoted by (.)^T. A linear de-mixing system y = Wx with a non-singular de-mixing matrix W is employed to recover the sources from the linear mixture. Since A is unknown, it is only possible to recover the sources up to a permutation and scaling. Many blind separation algorithms for tuning the matrix W are based on Comon's ICA theory. The idea is to find the de-mixing matrix W so that all the components of the output y are mutually independent. Such a matrix W can be found by minimizing some contrast functions, such as the mutual information of the output [8, 13, 16], or by maximizing some criteria such as entropy [3]. The non-linear memoryless model for the sensor output (mixture) is generally formulated as

x(t) = f(s(t))

where f(.) is an unknown non-linear function. Without knowing the source signals, the mixing matrix, or the non-linear function, the problem is to recover the original signals from the non-linear mixture x(t) by the non-linear de-mixing system shown in Figure 1, which is a two-layer perceptron with n hidden neurons. Generally, this problem is not tractable without some prior knowledge about the non-linear function x = f(s). In this paper, we assume that the function f(.) is invertible and that the inverse f^{-1}(.) can be approximated by the two-layer perceptron, i.e.,

y = f^{-1}(x) ≈ C g(Bx + θ_0)   (1)

where g(u) is a sigmoid function, both B and C are square matrices, and g(u) = (g(u_1), ..., g(u_n))^T. Here, the scalar function g operates on each component of the vector u. We shall use this convention for all scalar functions throughout this paper. Note that this two-layer perceptron is invertible if and only if the matrices B and C are non-singular. Suppose (1) is true; then when W_1 = C, W_2 = B, and θ = θ_0, the output of the de-mixing system is

y = W_1 g(W_2 x + θ) = C g(Bx + θ_0) = f^{-1}(x) = s.

In general, for any non-singular diagonal matrix Λ and permutation matrix P, when W_1 = PΛC, W_2 = B, and θ = θ_0, we have a vector formed by a permutation of the scaled sources:

y = W_1 g(W_2 x + θ) = PΛs

whose components have zero mutual information. This means that the mutual information of the output y should be minimized in order to find the independent components. In this paper, the non-linear function (1) is taken as the de-mixing function. This means that the mixing function x = f(s) ≈ W_2^{-1}(g^{-1}(W_1^{-1} s) - θ_0) is again a two-layer perceptron, but the activation function for its hidden layer is an unbounded function even on a bounded region. However, in practice, we shall not limit the mixing function to this form. Instead, later in a simulation example, we shall test our algorithm by recovering the inputs from the outputs of a two-layer perceptron with a sigmoid function for each neuron in the hidden layer.
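To make the de-mixing structure concrete, here is a minimal sketch (not from the paper) of the forward pass y = W_1 g(W_2 x + θ) in Python/NumPy; the function and variable names are our own, and tanh is used for the sigmoid g purely as an illustrative choice.

```python
import numpy as np

def demix(x, W2, theta, W1, g=np.tanh):
    """Two-layer perceptron de-mixing system y = W1 g(W2 x + theta).

    x     : (n,) observed non-linear mixture sample
    W2    : (n, n) input-to-hidden weight matrix (non-singular)
    theta : (n,) hidden-layer thresholds
    W1    : (n, n) hidden-to-output weight matrix (non-singular)
    g     : componentwise sigmoid (tanh here, an illustrative choice)
    """
    u = W2 @ x + theta   # hidden pre-activations
    v = g(u)             # hidden activations
    y = W1 @ v           # recovered (possibly permuted and scaled) sources
    return y, v, u
```

If the mixture really satisfies (1), then W_1 = C, W_2 = B and θ = θ_0 make this forward pass return the sources, up to the permutation and scaling discussed above.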

3 Information back-propagation approach

We have explained why the minimum mutual information (MMI) approach can be applied to extract independent sources from the non-linear mixture when the inverse of the non-linear mixing function is approximated by the two-layer perceptron. We can also apply the maximum entropy (ME) approach in [3] to train the de-mixing system. When the mixture model is linear, the ME approach is justified by its relation to the MMI approach [16]. The two approaches are generally equivalent around the solution points but may have different performances in practice [15]. We believe that this also holds when the mixture model is non-linear. We shall derive the learning algorithms using both the entropy and the mutual information criteria.

3.1 Maximizing entropy

Denote |X| = det(X) for an n × n matrix X. Let {φ_i(y)} be sigmoid functions differentiable up to second order and define z = (φ_1(y_1), ..., φ_n(y_n))^T. For the output of the de-mixing system y = W_1 v = W_1 g(W_2 x + θ), the differential entropy H(y) = H(y; W_1, W_2, θ) is generally not upper bounded. This means that the maximum of H(y) may not exist. But after using the sigmoid functions {φ_i(y)} to transform y, the entropy H(z) becomes upper bounded and the maximum of H(z) always exists. It is easy to calculate the entropy

H(z) = H(v) + log|W_1| + Σ_{i=1}^n E[log φ_i'(y_i)].   (2)

Similarly, from the relations v = (g_1(u_1 + θ_1), ..., g_n(u_n + θ_n))^T and u = W_2 x, we have

H(v) = H(x) + log|W_2| + Σ_{i=1}^n E[log g_i'(u_i + θ_i)].   (3)

Denote W_k = (w_{ij}^k) for k = 1, 2. Noticing that H(v) does not depend on W_1 and y = W_1 v, we have the stochastic gradient of H(z)

∂H(z)/∂w_{ij}^1 = [W_1^{-T}]_{ij} + (φ_i''(y_i)/φ_i'(y_i)) v_j   (4)

where W_1^{-T} = (W_1^{-1})^T and [W_1^{-T}]_{ij} is the (i, j) entry of W_1^{-T}. A matrix form of the expression (4) is the following:

∂H(z)/∂W_1 = W_1^{-T} - Φ_1(y) v^T   (5)

where

Φ_1(y) = -( φ_1''(y_1)/φ_1'(y_1), ..., φ_n''(y_n)/φ_n'(y_n) )^T.

It is simple but important that if F_i(y) = a φ_i(y) + b for some constants a ≠ 0 and b, then F_i''(y)/F_i'(y) = φ_i''(y)/φ_i'(y). So, without losing generality, we can assume that φ_i(y) = ∫_{-∞}^{y} p_i(u) du, where {p_i} are some pdfs. Then, Φ_1(y) can also be expressed as

Φ_1(y) = -( p_1'(y_1)/p_1(y_1), ..., p_n'(y_n)/p_n(y_n) )^T.
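As a small numerical illustration of this invariance (our own check, not part of the paper): with the hypothetical choice φ_i(y) = tanh(y), which is an affine transform of the logistic cdf, the ratio -φ''/φ' reduces to 2 tanh(y), and the same feedback term is obtained from the corresponding logistic-type pdf p_i.

```python
import numpy as np

# Numerical check of the invariance F = a*phi + b  =>  F''/F' = phi''/phi',
# using phi(y) = tanh(y) (an affine transform of the logistic cdf) as an
# illustrative choice; all names here are ours, not from the paper.
y = np.linspace(-3.0, 3.0, 7)

phi1 = 1.0 - np.tanh(y) ** 2            # phi'(y)
phi2 = -2.0 * np.tanh(y) * phi1         # phi''(y)
ratio_phi = -phi2 / phi1                # entry of Phi_1(y)

p = 0.5 / np.cosh(y) ** 2               # logistic-type pdf p(y)
dp = -np.tanh(y) / np.cosh(y) ** 2      # p'(y)
ratio_pdf = -dp / p

print(np.allclose(ratio_phi, 2.0 * np.tanh(y)))  # True
print(np.allclose(ratio_pdf, ratio_phi))         # True: same feedback term
```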

From the expressions (2) and (3) we have

H(z) = H(x) + log|W_1| + log|W_2| + Σ_{i=1}^n E[log g_i'(u_i + θ_i)] + Σ_{i=1}^n E[log φ_i'(y_i)].   (6)

Noticing that y_i depends on W_2, from the above expression we have the stochastic gradient

∂H(z)/∂W_2 = W_2^{-T} - Φ_2(u) x^T + Σ_{i=1}^n ∂[log φ_i'(y_i)]/∂W_2,   (7)

where

Φ_2(u) = -( g_1''(u_1 + θ_1)/g_1'(u_1 + θ_1), ..., g_n''(u_n + θ_n)/g_n'(u_n + θ_n) )^T.

Denote

D(u) = diag( g_1'(u_1 + θ_1), ..., g_n'(u_n + θ_n) ).   (8)

Applying the chain rule of partial derivatives, we have

∂y_i/∂w_{kl}^2 = w_{ik}^1 g_k'(u_k + θ_k) x_l.

From this relation, we can calculate the third term in (7) and obtain

∂H(z)/∂W_2 = W_2^{-T} - Φ_2(u) x^T - D(u) W_1^T Φ_1(y) x^T.   (9)

Again applying the chain rule of partial derivatives, we have

∂y_k/∂θ_i = w_{ki}^1 g_i'(u_i + θ_i).

From the above relation and (6), we have the stochastic gradient

∂H(z)/∂θ_i = g_i''(u_i + θ_i)/g_i'(u_i + θ_i) + Σ_{k=1}^n (φ_k''(y_k)/φ_k'(y_k)) ∂y_k/∂θ_i
           = g_i''(u_i + θ_i)/g_i'(u_i + θ_i) + Σ_{k=1}^n (φ_k''(y_k)/φ_k'(y_k)) w_{ki}^1 g_i'(u_i + θ_i).

The above expression has the following vector form:

∂H(z)/∂θ = -Φ_2(u) - D(u) W_1^T Φ_1(y).   (10)

Using the natural gradient descent method, we have the following ME algorithm for the de-mixing system for maximizing H(z):

dW_1/dt = [I - Φ_1(y) y^T] W_1   (11)
dθ/dt = -[Φ_2(u) + D(u) W_1^T Φ_1(y)]   (12)
dW_2/dt = [I - Φ_2(u) u^T - D(u) W_1^T Φ_1(y) u^T] W_2   (13)
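A minimal sketch of one discretized step of the ME algorithm (11)-(13) follows; the step size eta, the function name, and the choice of tanh for both the hidden sigmoid g and the output non-linearity φ (so that Φ_1(y) = 2 tanh(y) and Φ_2(u) = 2 tanh(u + θ)) are our own illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np

def me_step(x, W1, W2, theta, eta=0.01):
    """One discretized step of the entropy-BP (ME) updates (11)-(13).

    tanh is used for both the hidden sigmoid g and the output non-linearity phi,
    so Phi_1(y) = 2*tanh(y), Phi_2(u) = 2*tanh(u + theta) and
    D(u) = diag(1 - tanh(u + theta)**2). This is an illustrative choice only.
    """
    n = len(x)
    u = W2 @ x
    v = np.tanh(u + theta)
    y = W1 @ v

    phi1 = 2.0 * np.tanh(y)                       # Phi_1(y)
    phi2 = 2.0 * np.tanh(u + theta)               # Phi_2(u)
    D = np.diag(1.0 - np.tanh(u + theta) ** 2)    # D(u)
    I = np.eye(n)

    dW1 = eta * (I - np.outer(phi1, y)) @ W1                                  # (11)
    dtheta = -eta * (phi2 + D @ W1.T @ phi1)                                  # (12)
    dW2 = eta * (I - np.outer(phi2, u) - D @ W1.T @ np.outer(phi1, u)) @ W2   # (13)
    return W1 + dW1, W2 + dW2, theta + dtheta
```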

3.2 Minimum mutual information

The mutual information of the output y is decomposed as

I(y; W_1, W_2, θ) = -H(v) - log|W_1| + Σ_{i=1}^n H(y_i).

Applying the Gram-Charlier expansion to approximate each marginal pdf of y, we find the approximation of the marginal entropy in [2]:

H(y_a) ≈ (1/2) log(2πe) - (κ_3^a)^2/(2·3!) - (κ_4^a)^2/(2·4!) + (3/8)(κ_3^a)^2 κ_4^a + (1/16)(κ_4^a)^3   (14)

where κ_3^a = m_3^a, κ_4^a = m_4^a - 3, and m_k^a = E[(y_a)^k]. Let F(κ_3^a, κ_4^a) denote the right-hand side of the approximation (14); then

I(y; W_1, W_2, θ) ≈ -H(v) - log|W_1| + Σ_{a=1}^n F(κ_3^a, κ_4^a).   (15)

From (3) and (15), we can first calculate the gradients ∂I/∂W_1, ∂I/∂W_2 and ∂I/∂θ, and then derive the following MMI algorithm:

dW_1/dt = [I - ỹ y^T] W_1   (16)
dθ/dt = -[Φ_2(u) + D(u) W_1^T ỹ]   (17)
dW_2/dt = [I - Φ_2(u) u^T - D(u) W_1^T ỹ u^T] W_2   (18)

where

ỹ = f_1(κ_3, κ_4) ∘ y^2 + g_1(κ_3, κ_4) ∘ y^3,   (19)

f_1(y, z) = -(1/2) y + (9/4) yz,   g_1(y, z) = -(1/6) z + (3/2) y^2 + (3/4) z^2,

and ∘ denotes the Hadamard product of two vectors: c ∘ d = [c_1 d_1, ..., c_n d_n]^T. Here, the cumulants κ_3^a and κ_4^a are traced by the following equations:

dκ_3^a/dt = -(κ_3^a - (y_a)^3),   dκ_4^a/dt = -(κ_4^a - (y_a)^4 + 3).   (20)

Note that the natural gradient descent method is again used to derive the learning equations for W_2 and W_1. The learning equations derived have the same structure as those derived by maximizing entropy. The learning algorithms for W_1 and the algorithms [2, 3, 15] for the linear mixture model have a common form. The new learning algorithms for W_2 and θ are derived using a method similar to the error back-propagation (BP) method for training a multi-layer network. Because of this, the unsupervised algorithms (11)-(13) and (16)-(18) are called the entropy BP and the mutual information BP respectively, or information BP algorithms altogether. We also call the entropy BP and the mutual information BP the non-linear ME and MMI algorithms, to differentiate them from the linear ME and MMI algorithms in [2, 3, 15] for linear de-mixing systems.
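A sketch of the piece that distinguishes the MMI algorithm from the ME algorithm follows: the adaptive non-linearity ỹ of (19), driven by the tracked cumulants of (20). The discretization step mu and the function name are our own additions.

```python
import numpy as np

def adaptive_nonlinearity(y, k3, k4, mu=0.01):
    """Adaptive output non-linearity of the mutual-information-BP algorithm.

    Implements (19)-(20): the tracked third/fourth cumulants k3, k4 are updated,
    and y~ = f1(k3, k4) o y^2 + g1(k3, k4) o y^3 (o = Hadamard product) is
    returned. The discretization step mu is our own addition.
    """
    # cumulant tracking, discretized version of (20)
    k3 = k3 - mu * (k3 - y ** 3)
    k4 = k4 - mu * (k4 - y ** 4 + 3.0)

    f1 = -0.5 * k3 + 2.25 * k3 * k4                   # f1(k3, k4)
    g1 = -k4 / 6.0 + 1.5 * k3 ** 2 + 0.75 * k4 ** 2   # g1(k3, k4)
    y_tilde = f1 * y ** 2 + g1 * y ** 3               # (19)
    return y_tilde, k3, k4
```

Substituting y_tilde for Φ_1(y) in the updates (11)-(13) gives a discretized form of (16)-(18).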

3.3 Performance index

The performance of the ME and MMI algorithms can be measured by the mutual information I(y; W_1, W_2, θ) of the output y. From (3) and (15), we have

I(y; W_1, W_2, θ) ≈ -H(x) + I_p(y, u; W_1, W_2, θ)   (21)

where

I_p(y, u; W_1, W_2, θ) = -log|W_1 W_2| - Σ_{i=1}^n ( E[log g_i'(u_i + θ_i)] - F(κ_3^i, κ_4^i) ).   (22)

The second term in the above expression can be estimated online in practice. Since H(x) does not change during the learning process, we can use I_p(y, u; W_1, W_2, θ) as the performance index; it may take negative values. The difference between the mutual information I(y; W_1, W_2, θ) and the performance index I_p(y, u; W_1, W_2, θ) is -H(x).
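A possible batch estimate of the performance index I_p of (22) is sketched below, assuming tanh for the hidden sigmoid so that g'(u_i + θ_i) = 1 - tanh(u_i + θ_i)^2; replacing the expectations by sample means and plugging sample moments into F are our own estimation choices, not prescribed by the paper.

```python
import numpy as np

def performance_index(Y, U, W1, W2, theta):
    """Batch estimate of the performance index I_p of (22).

    Y : (T, n) matrix of output samples y(t)
    U : (T, n) matrix of hidden pre-activations u(t) = W2 x(t)
    tanh is assumed for the hidden sigmoid; |det| is used for log|W1 W2|;
    expectations are replaced by sample means (our own estimation choices).
    """
    _, logdet1 = np.linalg.slogdet(W1)   # log|det W1|
    _, logdet2 = np.linalg.slogdet(W2)   # log|det W2|
    log_g_prime = np.log(1.0 - np.tanh(U + theta) ** 2).mean(axis=0)

    k3 = (Y ** 3).mean(axis=0)           # third cumulants (zero mean assumed)
    k4 = (Y ** 4).mean(axis=0) - 3.0     # fourth cumulants (unit variance assumed)
    F = (0.5 * np.log(2 * np.pi * np.e)
         - k3 ** 2 / 12.0 - k4 ** 2 / 48.0
         + 0.375 * k3 ** 2 * k4 + k4 ** 3 / 16.0)   # right-hand side of (14)

    return -(logdet1 + logdet2) - np.sum(log_g_prime - F)
```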

4 The information BP vs. the error BP

Blind separation of sources is an unsupervised learning problem. We employ a two-layer perceptron as the de-mixing system to extract the sources from the non-linear mixture. However, we cannot use the well-known error back-propagation method to train this multilayer network, since the desired signals are not accessible. Instead, we use information criteria such as entropy and mutual information as error signals to derive the information BP algorithms. The block diagram of the entropy BP is plotted in Figure 2. Replacing the non-linearity Φ_1(y) by the adaptive non-linearity ỹ defined by (19) in Figure 2, we obtain the block diagram of the mutual information BP. The two information BP algorithms use the same feedback non-linearity Φ_2(u) in the hidden layer but different feedback non-linearities Φ_1(y) and ỹ in the output layer.

To understand the back-propagation process in the information BP, let us first consider the error BP algorithm for the two-layer perceptron, assuming the desired signals d(t) are known. The de-mixing system used in this paper is a two-layer n-n-n network with linear nodes in the output layer. The error BP algorithm for this network has the following vector form:

dW_1/dt = -η (y - d) v^T
dθ/dt = -η D(u) b   (23)
dW_2/dt = -η D(u) b x^T

where η > 0 is a learning rate, u = W_2 x, v = (v_1, ..., v_n)^T = g(u + θ), and b = (b_1, ..., b_n)^T = W_1^T (y - d) is a feedback signal via the back-propagation matrix W_1^T. The block diagram of the error BP is plotted in Figure 3. The two-layer perceptron is trained by the error BP based on the examples {(x(t), d(t)), t = 1, ..., T}. In the context of non-blind de-mixing, the unknown system is f^{-1}(x), which is the inverse of the mixing system. The desired signal d(t) is the source signal s(t). The error BP algorithm is usually written in the following component form:

dw_{kj}^1/dt = -η (y_k - d_k) v_j
dθ_k/dt = -η g'(u_k + θ_k) b_k   (24)
dw_{kj}^2/dt = -η g'(u_k + θ_k) b_k x_j.
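For comparison, a sketch of one discretized step of the supervised error BP (23)-(24), again with tanh as the hidden sigmoid; the step size eta and the names are our own. Note how each increment uses only signals local to the corresponding connection, which is the locality property discussed below.

```python
import numpy as np

def error_bp_step(x, d, W1, W2, theta, eta=0.01):
    """One discretized step of the supervised error BP (23) for the n-n-n network.

    d is the desired (teacher) signal; in the non-blind setting d(t) = s(t).
    tanh is used as the hidden sigmoid (illustrative choice).
    """
    u = W2 @ x
    v = np.tanh(u + theta)
    y = W1 @ v

    b = W1.T @ (y - d)                            # back-propagated error signal
    D = np.diag(1.0 - np.tanh(u + theta) ** 2)    # D(u) for g = tanh

    dW1 = -eta * np.outer(y - d, v)               # local: (y_k - d_k) v_j
    dtheta = -eta * D @ b                         # local: g'(u_k + theta_k) b_k
    dW2 = -eta * np.outer(D @ b, x)               # local: g'(u_k + theta_k) b_k x_j
    return W1 + dW1, W2 + dW2, theta + dtheta
```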

A well-known fact is easily seen from the above equations: the error BP algorithm (23) or (24) is local, which means that only local signals are required to update each weight and each threshold. On the other hand, in the information BP algorithms (11)-(13) and (16)-(18), the learning equations for the thresholds are local but the learning equations for W_1 and W_2 are not. This is the main difference between the error BP and the information BP. Regarding the back-propagation process, one thing the error BP and the information BP have in common is the back-propagation matrix W_1^T. However, the feedback non-linearities used in the information BP algorithms are not needed in the error BP algorithms. This is another difference between the information BP and the error BP. Since we use an n-n-n two-layer perceptron to approximate the inverse of an invertible mixing function, the two-layer perceptron should also be invertible, which means that the two matrices W_1 and W_2 should be non-singular. Although the learning equations for W_1 and W_2 are not local, they have a very interesting property: they keep the matrices W_1 and W_2 from becoming singular if the initial matrices are non-singular. This is proved in [16].

5 Simulation

Assume that the mixing function f(u) in Figure 1 is a two-layer perceptron with 5 input neurons, 5 hidden neurons and 5 output neurons:

x(t) = A_2 tanh(A_1 s(t))

where the source vector

s(t) = [sign(cos(2π 155 t)), sin(2π 800 t), sin(2π 300 t + 6 cos(2π 60 t)), sin(2π 90 t), b(t)]^T

consists of four modulated signals and one random source b(t) uniformly distributed in [-1, 1]. A_1 and A_2 are two 5 × 5 mixing matrices:

A_1 = [  0.1420   0.3016  -0.3863   0.2506   0.4690
        -0.4347  -0.2243  -0.3150   0.4064   0.2012
         0.3543   0.3589   0.0942   0.0916   0.4059
        -0.4545  -0.4382   0.3962   0.3090   0.4473
         0.2704  -0.1301  -0.2211  -0.1983  -0.1733 ]

A_2 = [  0.4389  -0.8860  -0.4669  -0.3164  -0.7320
         0.6564   0.9429  -0.9738   0.1652   0.4402
         0.9134  -0.3451  -0.0726   0.2833   0.8911
         0.2316  -0.4454  -0.5936   0.9550  -0.0720
        -0.2552   0.1434  -0.5017   0.4779   0.3734 ]

where the elements of A_1 are randomly chosen in [-0.5, 0.5] and those of A_2 are randomly chosen in [-1, 1]. The SOM algorithms in [9, 11, 12] fail to extract sources from the non-linear mixture in this example. The algorithms in [10, 14] are not applicable to the mixture x(t), which contains crossing non-linearities. We applied the non-linear MMI algorithm (16)-(18) to extract source signals from the observed non-linear mixture, using the sigmoid function g(x) = tanh(βx) with a gain β > 0 for each hidden unit in the de-mixing system. We also tried the linear MMI algorithm in [15] on the same non-linear mixture to compare the performance of the two algorithms. The outputs of the linear MMI and those of the non-linear MMI are shown in the second row of Figure 4. In the outputs of the linear MMI, only one source is visible, while in the outputs of the non-linear MMI three modulated signals are clearly visible. The performance index of the non-linear MMI algorithm is shown in Figure 5.
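For reference, a sketch of how test data of this kind can be generated; the sampling rate, signal length, and random draw of A_1, A_2 are our own placeholder choices (the experiment above uses the fixed matrices listed earlier).

```python
import numpy as np

def make_mixture(T=10000, fs=10000.0, rng=np.random.default_rng(0)):
    """Generate the five benchmark sources and the non-linear mixture x = A2 tanh(A1 s).

    The sampling rate fs and length T are illustrative; A1, A2 are drawn at random
    in [-0.5, 0.5] and [-1, 1] here, standing in for the fixed matrices in the text.
    """
    t = np.arange(T) / fs
    s = np.vstack([
        np.sign(np.cos(2 * np.pi * 155 * t)),
        np.sin(2 * np.pi * 800 * t),
        np.sin(2 * np.pi * 300 * t + 6 * np.cos(2 * np.pi * 60 * t)),
        np.sin(2 * np.pi * 90 * t),
        rng.uniform(-1.0, 1.0, size=T),        # noise source b(t)
    ])
    A1 = rng.uniform(-0.5, 0.5, size=(5, 5))
    A2 = rng.uniform(-1.0, 1.0, size=(5, 5))
    x = A2 @ np.tanh(A1 @ s)                    # non-linear mixture with crossing terms
    return t, s, x
```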

6 Conclusions

Assuming that the inverse of the non-linear mixing function can be approximated by a two-layer perceptron, we employ a two-layer perceptron as a de-mixing system to extract sources from the non-linear mixture. Two blind separation algorithms are derived using two approaches: maximum entropy and minimum mutual information. The two algorithms are called the entropy BP algorithm and the mutual information BP algorithm, or information BP algorithms altogether. The information (entropy or mutual information) BP process is similar to error back-propagation for supervised learning. In the information BP process, the feedback non-linearities are used to generate feedback signals without using teacher signals. In the information BP algorithms, the learning algorithms for the weight matrix in the output layer are the same as the linear ME/MMI algorithms for linear de-mixing systems, but the learning algorithms for the weight matrix and thresholds in the hidden layer are different from those linear ME/MMI algorithms. The simulation shows that the information BP algorithms can effectively extract most of the source signals from the non-linear mixture.

References

[1] S. Amari, A. Cichocki, and H. H. Yang. Recurrent neural networks for blind separation of sources. In Proceedings of NOLTA 1995, volume I, pages 37-42, December 1995.

[2] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8, eds. David S. Touretzky, Michael C. Mozer and Michael E. Hasselmo, MIT Press: Cambridge, MA, pages 757-763, 1996.

[3] A. J. Bell and T. J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

[4] G. Burel. Blind separation of sources: A non-linear neural algorithm. Neural Networks, 5:937-947, 1992.

[5] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12):3017-3030, December 1996.

[6] A. Cichocki and R. Unbehauen. Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Trans. on Circuits and Systems-I: Fundamental Theory and Applications, 43(11):894-906, November 1996.

[7] A. Cichocki, R. Unbehauen, L. Moszczynski, and E. Rummert. A new on-line adaptive learning algorithm for blind separation of source signals. In ISANN94, pages 406-411, Taiwan, December 1994.

[8] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.

[9] M. Herrmann and H. H. Yang. Perspectives and limitations of self-organizing maps in blind separation of source signals. In Progress in Neural Information Processing: Proceedings of ICONIP*96, pages 1211-1216. Springer, September 1996.

[10] T.-W. Lee and B.-U. Koehler. Blind source separation of nonlinear mixing models. In Neural Networks for Signal Processing VII (to appear), 1997.

[11] J. K. Lin, D. G. Grier, and J. D. Cowan. Source separation and density estimation by faithful equivariant SOM. In Advances in Neural Information Processing Systems 9, MIT Press: Cambridge, MA, pages 536-542, 1997.

[12] P. Pajunen, A. Hyvärinen, and J. Karhunen. Nonlinear blind source separation by self-organizing maps. In Progress in Neural Information Processing: Proceedings of ICONIP*96, volume 2, pages 1207-1210, September 1996.

[13] D. T. Pham. Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. on Signal Processing, 44(11):2768-2779, November 1996.

[14] A. Taleb and C. Jutten. Nonlinear source separation: the post-nonlinear mixtures. In ESANN'97, pages 279-284, 1997.

[15] H. H. Yang and S. Amari. Two gradient descent algorithms for blind signal separation. In Proceedings of ICANN96, Lecture Notes in Computer Science Vol. 1112, pages 287-292. Springer-Verlag, 1996.

[16] H. H. Yang and S. Amari. Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(7):1457-1482, 1997.

[17] H. H. Yang, S. Amari, and A. Cichocki. Information back-propagation for blind separation of sources from non-linear mixture. In Proc. of ICNN'97, Houston, pages 2141-2146, June 1997.

Figures

Figure 1: The mixture model is on the left of the dashed line and the de-mixing system is on the right.
Figure 2: Block diagram of the entropy BP.
Figure 3: Block diagram of the error BP.
Figure 4: Simulation results: (a) sources, (b) non-linear mixture, (c) outputs of the linear MMI, (d) outputs of the non-linear MMI.
Figure 5: The performance index of the non-linear MMI.

[Figure 1: signal-flow diagram of the mixture model and the de-mixing system: s(t) -> f(.) -> x(t) -> W_2 -> u(t) -> g(. + θ) -> v(t) -> W_1 -> y(t).]

[Figure 2: block diagram of the entropy BP: the forward path x(t) -> W_2 -> g(u + θ) -> W_1 -> y(t), with the feedback non-linearities Φ_2(u) and Φ_1(y), the diagonal matrix D(u), and the back-propagation matrix W_1^T.]

[Figure 3: block diagram of the error BP: the same forward path, with the error between y(t) and the desired signal d(t) = s(t) from the unknown system back-propagated through W_1^T and D(u).]

[Figure 4: simulation results over t in [0.5, 0.6]: (a) sources, (b) non-linear mixture, (c) outputs of the linear MMI, (d) outputs of the non-linear MMI.]

[Figure 5: performance index of the non-linear MMI versus iteration (0 to 7000).]