to be published in:

Th. Bäck, A. E. Eiben, M. Schoenauer, H.-P. Schwefel (Eds.): Parallel Problem Solving from Nature – PPSN V, Springer, 1998

Optimisation of Density Estimation Models with Evolutionary Algorithms*

Martin Kreutz¹**, Anja M. Reimetz², Bernhard Sendhoff¹, Claus Weihs², and Werner von Seelen¹

¹ Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
² Fachbereich Statistik, Universität Dortmund, Germany

Abstract. We propose a new optimisation method for estimating both the parameters and the structure, i.e. the number of components, of a finite mixture model for density estimation. We employ a hybrid method consisting of an evolutionary algorithm for structure optimisation in conjunction with a gradient-based method for evaluating each candidate model architecture. For structure modification we propose specific, problem-dependent evolutionary operators. The introduction of a regularisation term prevents the models from over-fitting the data. Experiments show good generalisation abilities of the optimised structures.

1 Introduction

The estimation of the probability density function (pdf) of a data generating process is one of the primary tools in the analysis and modeling of stochastic or partially stochastic systems. If a complete a priori model of the system is lacking and no parametric model can be assumed, the density estimation has to be based completely on the available data. Mixture densities have been considered as general semi-parametric models for density estimation:

p(x \,|\, \theta) = \sum_{i=1}^{m} \alpha_i \, \phi_i(x \,|\, \theta_i), \qquad x \in \mathbb{R}^n, \qquad \theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)   (1)

\sum_{i=1}^{m} \alpha_i = 1, \qquad \alpha_i \ge 0 \quad \forall\, i \in \{1, \ldots, m\}   (2)

Finite mixture models have several appealing properties: they are universal in the sense that they can approximate any continuous probability distribution, they can cope with multimodal distributions, and their complexity can be easily adjusted by the number of components. In this article we employ mixtures of normal densities:

\phi_i(x \,|\, \theta_i) = \frac{1}{\sqrt{2\pi}^{\,n} \prod_{k=1}^{n} \sigma_{ik}} \exp\!\left( -\frac{1}{2} \sum_{k=1}^{n} \left( \frac{x_k - \mu_{ik}}{\sigma_{ik}} \right)^{2} \right)   (3)

\theta_i = (\mu_{i1}, \ldots, \mu_{in}, \sigma_{i1}^2, \ldots, \sigma_{in}^2) \in \mathbb{R}^{2n}   (4)

* Supported by the BMBF under Grant No. 01IB701A0 (SONN II).
** email: [email protected]

When constructing statistical models from a limited quantity of data, the issue of model selection is crucial. We address this problem in an evolutionary framework. The determination of the size of the model and the arrangement of its components can be regarded as a structure optimisation problem, for which we employ an evolutionary algorithm with specific, problem-dependent operators. For the optimisation of the model parameters we use the maximum penalised likelihood approach. This optimisation task has been carried out in [8] with evolution strategies; in this paper we employ the EM algorithm [7]. The remainder of this article is organised as follows. In Section 2 we address the issue of regularisation. The evolutionary algorithm is described in Section 3. Section 4 presents the experimental results and conclusions are drawn in Section 5.
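As a concrete illustration of the model defined in eqs. (1)-(4), the following sketch evaluates a diagonal-covariance normal mixture at a single point. The function and variable names are illustrative and not part of the method described in this paper.

```python
import numpy as np

def gaussian_mixture_pdf(x, alphas, mus, sigmas):
    """Evaluate p(x | theta) of eqs. (1)-(3) for diagonal Gaussian components.

    x      : (n,) input point
    alphas : (m,) mixing coefficients, non-negative and summing to one
    mus    : (m, n) component centres mu_ik
    sigmas : (m, n) component widths sigma_ik (standard deviations)
    """
    diff = (x - mus) / sigmas                               # (m, n) standardised distances
    norm = np.sqrt(2.0 * np.pi) ** x.size * np.prod(sigmas, axis=1)
    phi = np.exp(-0.5 * np.sum(diff ** 2, axis=1)) / norm   # component densities, eq. (3)
    return float(np.dot(alphas, phi))                       # weighted sum, eq. (1)

# toy usage: two components in two dimensions
alphas = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
sigmas = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gaussian_mixture_pdf(np.array([1.0, 0.5]), alphas, mus, sigmas))
```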

2 Regularisation

The method of maximum likelihood is widely used in statistical inference. The likelihood is the probability for the occurrence of the sample X = {x_1, ..., x_N} under the chosen probability density model (characterised by its parameters θ). Assuming the x_i to be i.i.d. sampled with pdf p(x|θ), the log-likelihood function reads

L(\theta) = \log \prod_{k=1}^{N} p(x_k \,|\, \theta) = \sum_{k=1}^{N} \log p(x_k \,|\, \theta).   (5)

However, if infinite-dimensional objects such as functions of one or more continuous variables are involved, the principle of maximum likelihood is usually inadequate. Attempting to maximise the likelihood in such contexts usually results in an infinite value for the likelihood and degeneracy of the model. In the context of normal mixtures, eq. (3), we observe that the likelihood will be infinite if one of the density functions φ_i(x|θ_i) approaches a delta function placed on one data point. In structure optimisation of normal mixtures, methods solely based on the likelihood therefore tend to increase the number of kernel functions, place their centres on the data points and minimise their widths. A general approach to transform this ill-posed optimisation problem into a well-posed problem is the introduction of regularisation terms which reflect specific assumptions about the density model. The aim is the simultaneous minimisation of bias and variance of the model, which is possible in the case of infinite data sets but leads to the bias-variance dilemma [2] in practical applications. A sensible choice for regularisation is to demand a smooth density model. A common choice of a smoothness functional J(θ) is the integral of the squared second derivative

J(\theta) = \int_{-\infty}^{+\infty} p''(x \,|\, \theta)^2 \, dx   (6)

which has an appealing interpretation as the global measure of curvature of p. Hence, the complete objective function reads

F(\theta) = L(\theta) - \lambda \, J(\theta).   (7)
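For a univariate model the two criteria of eq. (7) can be computed directly; the sketch below evaluates L(θ) from a sample (eq. (5)) and approximates J(θ) (eq. (6)) by numerical quadrature of the squared second derivative. The integration grid and its limits are assumptions made for illustration; the paper itself uses the closed-form expressions of Appendix A.

```python
import numpy as np

def mixture_pdf_1d(x, alphas, mus, sigmas):
    # p(x|theta) for a univariate normal mixture, eq. (1) with eq. (3) in one dimension
    phi = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    return phi @ alphas

def penalised_objective(sample, alphas, mus, sigmas, lam, grid=None):
    """F(theta) = L(theta) - lam * J(theta), eqs. (5)-(7), with J approximated numerically."""
    L = np.sum(np.log(mixture_pdf_1d(sample, alphas, mus, sigmas)))    # eq. (5)
    if grid is None:                                                    # assumed integration grid
        grid = np.linspace(sample.min() - 5.0, sample.max() + 5.0, 2001)
    p = mixture_pdf_1d(grid, alphas, mus, sigmas)
    h = grid[1] - grid[0]
    p2 = np.gradient(np.gradient(p, h), h)                              # numerical p''(x|theta)
    J = np.trapz(p2 ** 2, grid)                                         # eq. (6)
    return L - lam * J                                                  # eq. (7)
```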

The smoothing parameter λ controls the relative influence of the two criteria. There are several possibilities to extend this function to multivariate functions. In the multivariate case the second derivative is described by the Hessian:

H = (h_{rs}), \qquad h_{rs} = \begin{cases} \partial^2 p(x\,|\,\theta) / \partial x_r^2, & r = s \\ \partial^2 p(x\,|\,\theta) / (\partial x_r \, \partial x_s), & r \ne s \end{cases}   (8)

The squared second derivative C and the matrix of integrals D are defined by

C = (c_{rs}) = H^{\top} H, \qquad c_{rs} = \sum_{t=1}^{n} h_{rt} \, h_{ts}   (9)

D = (d_{rs}), \qquad d_{rs} = \int_{-\infty}^{+\infty} c_{rs} \, dx.   (10)

In order to yield a scalar measure of smoothness of p, either the determinant or the trace of D can be taken. These measures correspond to the product and the sum of the eigenvalues of D, respectively, and both are invariant under a linear non-singular transformation. We use the trace of D for simplicity:

J(\theta) = \operatorname{tr} D = \sum_{r=1}^{n} d_{rr}.   (11)

The derivation of the diagonal elements d_{rr} is given in Appendix A.
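A simple numerical cross-check of J(θ) = tr D is possible by finite differences; the sketch below does this for a two-dimensional density on a regular grid. The grid size, box limits and function names are assumptions for illustration, not part of the paper's method.

```python
import numpy as np

def smoothness_trace_2d(pdf, lo, hi, num=200):
    """Approximate J(theta) = tr D = sum_r int sum_t h_rt^2 dx (eqs. (8)-(11))
    for a two-dimensional density pdf(x); the integration box [lo, hi]^2 is assumed."""
    xs = np.linspace(lo, hi, num)
    h = xs[1] - xs[0]
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    P = pdf(np.stack([X, Y], axis=-1))               # density values on the grid
    # second derivatives via central differences
    Pxx = (np.roll(P, -1, 0) - 2 * P + np.roll(P, 1, 0)) / h ** 2
    Pyy = (np.roll(P, -1, 1) - 2 * P + np.roll(P, 1, 1)) / h ** 2
    Pxy = (np.roll(np.roll(P, -1, 0), -1, 1) - np.roll(np.roll(P, -1, 0), 1, 1)
           - np.roll(np.roll(P, 1, 0), -1, 1) + np.roll(np.roll(P, 1, 0), 1, 1)) / (4 * h ** 2)
    integrand = Pxx ** 2 + Pyy ** 2 + 2 * Pxy ** 2   # tr(H^T H), summed over r and t
    return float(np.sum(integrand[1:-1, 1:-1]) * h ** 2)
```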

3 Evolution of density models

The structure or architecture of a model has a direct impact on training time, convergence, generalisation ability, robustness, etc. of the model. Therefore, the choice of at least a near-optimal architecture is a key issue in model building. The optimisation of the structure of a model represents a difficult problem for search algorithms since the search space is in general non-differentiable and multimodal. Furthermore, the mapping from architecture to performance is indirect, strongly epistatic, and dependent on initial conditions. Evolutionary algorithms (EA) have been considered as a promising method for dealing with these problems, especially in the area of neural networks [11]. Less attention has been paid to the artificial evolution of mixture models for density estimation. From approximation theory we know that they are universal approximators [5] and their local nature permits them to learn faster than global models (like NN) [4]. Furthermore, with regard to EAs, the local interaction between the components in mixture models leads to a lower epistasis. This justifies the usage of a simple direct encoding. The advantages of locality, however, are obtained at the expense of a potentially large number of components required to cover the data space. This leads to the severe problem of over-fitting the data and under-fitting the noise level, respectively.

In our method this problem is handled by an appropriate regularisation (see Section 2). For the evolution of structures we use a variant of the steady-state evolutionary algorithm. In each generation two parents are randomly selected from the population and the operators crossover and mutation are applied with the probabilities P_c and P_m, respectively. The two offspring are evaluated by training them with the EM algorithm and replace the two worst individuals in the population. This is repeated for a predefined number of generations.
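A rough outline of this steady-state loop is given below; `crossover`, `mutate`, `em_fit` and `fitness` stand for the operators of Sections 3.2-3.3 and the EM-based evaluation of Section 3.4 and are placeholders here, not implementations from the paper.

```python
import random

def steady_state_ea(init_population, generations, pc, pm,
                    crossover, mutate, em_fit, fitness):
    """Steady-state EA: two random parents, crossover/mutation with prob. pc/pm,
    offspring trained by EM and inserted in place of the two worst individuals."""
    population = [em_fit(ind) for ind in init_population]
    for _ in range(generations):
        parent_a, parent_b = random.sample(population, 2)
        child_a, child_b = (crossover(parent_a, parent_b)
                            if random.random() < pc else (parent_a, parent_b))
        if random.random() < pm:
            child_a = mutate(child_a)
        if random.random() < pm:
            child_b = mutate(child_b)
        child_a, child_b = em_fit(child_a), em_fit(child_b)   # Lamarckian: trained parameters kept
        population.sort(key=fitness)                          # ascending: worst individuals first
        population[0], population[1] = child_a, child_b       # replace the two worst
    return max(population, key=fitness)
```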

3.1 Genome representation

All parameters of the mixture density are encoded directly in the genome, which consists of a variable-length string of composite genes. Each gene encodes the parameters (centre and width) of a single density function as well as the corresponding mixing coefficient:

G_i = (\alpha_i, \mu_{i1}, \ldots, \mu_{in}, \sigma_{i1}^2, \ldots, \sigma_{in}^2),   (12)

where n denotes the dimension of the input space. The length of the genome varies with the number m of kernel density functions:

C = (G_1, \ldots, G_m).   (13)

All parameters are encoded as floating point values.
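A direct encoding of eqs. (12) and (13) might look like the following sketch; the class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Gene:
    """One composite gene G_i = (alpha_i, mu_i1..mu_in, sigma2_i1..sigma2_in), eq. (12)."""
    alpha: float        # mixing coefficient
    mu: np.ndarray      # centre, shape (n,)
    sigma2: np.ndarray  # variances, shape (n,)

@dataclass
class Chromosome:
    """Variable-length genome C = (G_1, ..., G_m), eq. (13)."""
    genes: List[Gene] = field(default_factory=list)

    @property
    def m(self) -> int:  # number of kernel density functions
        return len(self.genes)
```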

3.2 Recombination

One additional difficulty which arises in the context of recombination is referred to as the problem of competing conventions: due to the invariance to permutation of the kernel functions, distinct genomes map to phenotypes which are functionally equivalent. For that reason the recombination of two successful parents employing different conventions is unlikely to produce successful offspring. In [1] this problem was circumvented by a special 2-point crossover operator which is based on crosspoints in the input variable space (and therefore the space of possible centres for the Gaussians) rather than on the position in the genome. We employ a modified version of this operator. The two crosspoints μ^(1) and μ^(2) are sampled randomly with uniform distribution in the space which is spanned by all centres μ_j:

\mu^{(1,2)} = (\mu_1^{(1,2)}, \ldots, \mu_n^{(1,2)}) \quad \text{with} \quad \mu_i^{(1,2)} \in \Big[ \min_{j=1,\ldots,m} \mu_{ij} \, , \, \max_{j=1,\ldots,m} \mu_{ij} \Big]   (14)

These two points define a hypervolume in the space of centres. All centres of the two selected parents lying inside this hypercube are exchanged by the crossover operator. After crossover, the first offspring contains those genes G_i (kernel functions) from the second parent which satisfy

\big( \mu_j^{(1)} \le \mu_{ij} \le \mu_j^{(2)} \big) \;\vee\; \big( \mu_j^{(2)} \le \mu_{ij} \le \mu_j^{(1)} \big) \qquad \forall\, j \in \{1, \ldots, n\},   (15)

together with those genes from the first parent which do not satisfy this condition. The second offspring contains all remaining genes from both parents. This crossover operator treats functionally equivalent phenotypes using different conventions as identical.
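Assuming the Gene/Chromosome encoding sketched in Section 3.1, the hypervolume-based crossover of eqs. (14) and (15) could be implemented roughly as follows; sampling the crosspoints from the centres of the two selected parents is an interpretation made for this sketch.

```python
import numpy as np

def hypervolume_crossover(parent1, parent2, rng=None):
    """2-point crossover in the space of centres, eqs. (14)-(15): genes whose centres
    fall inside the hypervolume spanned by two random crosspoints are exchanged."""
    rng = rng or np.random.default_rng()
    centres = np.array([g.mu for g in parent1.genes + parent2.genes])
    lo, hi = centres.min(axis=0), centres.max(axis=0)
    c1, c2 = rng.uniform(lo, hi), rng.uniform(lo, hi)   # crosspoints, eq. (14)
    low, high = np.minimum(c1, c2), np.maximum(c1, c2)

    def inside(gene):                                   # condition of eq. (15)
        return bool(np.all((low <= gene.mu) & (gene.mu <= high)))

    child1 = Chromosome([g for g in parent1.genes if not inside(g)]
                        + [g for g in parent2.genes if inside(g)])
    child2 = Chromosome([g for g in parent1.genes if inside(g)]
                        + [g for g in parent2.genes if not inside(g)])
    return child1, child2
```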

3.3 Mutation

Several mutation operators for structure modification of the model were considered. A very straightforward choice is simple insertion and deletion operators. The component insertion operator introduces a new kernel function with random centre and width, whereas the component deletion operator simply deletes a randomly selected kernel function. However, the insertion and deletion of kernels in the structure optimisation of the model can be very disruptive and violate the principle of a strong causal mapping between the genotype and phenotype or fitness space [9]. In order to minimise these effects and to better control the impact of mutation we employ special merge and split operators. We optimise the parameters of the kernel(s) for the case of merging (two kernels are replaced by one) and splitting (one kernel is replaced by two). In both cases an appropriate distance measure between the three Gaussian functions (according to eq. (3)) is given by

d = \int_{-\infty}^{+\infty} \Big( \phi(x; \mu, \sigma^2) - \big[ \tilde{\phi}(x; \tilde{\mu}, \tilde{\sigma}^2) + \hat{\phi}(x; \hat{\mu}, \hat{\sigma}^2) \big] \Big)^2 \, dx.   (16)

All integrands in eq. (16) are of the general form \int \phi \, \tilde{\phi} \, dx, which can be solved analytically:

\int_{-\infty}^{+\infty} \phi \, \tilde{\phi} \, dx = \frac{1}{\sqrt{2\pi}^{\,n} \prod_{i=1}^{n} \sqrt{\sigma_i^2 + \tilde{\sigma}_i^2}} \exp\!\left( -\frac{1}{2} \sum_{i=1}^{n} \frac{(\mu_i - \tilde{\mu}_i)^2}{\sigma_i^2 + \tilde{\sigma}_i^2} \right).   (17)

We therefore get an analytical solution for the distance d of eq. (16). Since we want to minimise the disruptive effect of the operators merge and split, we minimise d with respect to the parameters which have to be chosen. In the case of the merge operator, we have to choose the centre μ and the variance σ² of the replacement function. For the split operator, we fix the centre values of the two new functions μ̃ and μ̂ as a function of the centre vector μ and optimise the variances σ̃² and σ̂². A possible choice of the centre values would be to select the coordinate with the largest σ² component (let us assume this is k) and use

\tilde{\mu}_i = \mu_i \;\wedge\; \hat{\mu}_i = \mu_i \quad \forall\, i \ne k, \qquad \tilde{\mu}_k = \mu_k + \tau \sigma_k \;\wedge\; \hat{\mu}_k = \mu_k - \tau \sigma_k.   (18)

In our experiments we set τ = 1. The remaining parameters are optimised using a modified Newton method. In Figure 1 we show two examples, one for the merge operator (a) and one for the split operator (b).
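The overlap integral of eq. (17) gives the distance of eq. (16) in closed form, since the squared difference expands into pairwise products of Gaussians; the following sketch shows this expansion. The subsequent minimisation over the free merge/split parameters (a modified Newton method in the paper) is not shown.

```python
import numpy as np

def gauss_overlap(mu1, s2_1, mu2, s2_2):
    """Closed-form integral of the product of two diagonal Gaussians, eq. (17).
    mu*, s2_* are centre and variance vectors of shape (n,)."""
    s = s2_1 + s2_2
    norm = np.sqrt(2.0 * np.pi) ** mu1.size * np.prod(np.sqrt(s))
    return np.exp(-0.5 * np.sum((mu1 - mu2) ** 2 / s)) / norm

def merge_split_distance(g, g_tilde, g_hat):
    """Distance of eq. (16) between one Gaussian g and the pair (g_tilde, g_hat),
    expanded into pairwise overlap integrals; each argument is a (mu, sigma2) pair."""
    terms = [g, g_tilde, g_hat]
    signs = [1.0, -1.0, -1.0]
    d = 0.0
    for si, (mi, vi) in zip(signs, terms):
        for sj, (mj, vj) in zip(signs, terms):
            d += si * sj * gauss_overlap(mi, vi, mj, vj)
    return d
```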

Fig. 1. Application of the merge operator (a) and the split operator (b).

3.4 Fitness evaluation

Each candidate model structure is evaluated by adapting its parameters to the given data. This can be seen as a missing data estimation problem, for which we employ an Expectation-Maximization (EM) algorithm [7]. In this context the EM algorithm is used for maximum penalised likelihood or a posteriori estimation, that is, the maximisation of F(θ) according to eq. (7). Unfortunately, the EM algorithm applied to this function involves a nonlinear optimisation problem in the M-step. Alternatives like the one-step-late algorithm [3] or the smoothed EM algorithm [10] avoid this problem and usually converge to the same solution when the amount of penalty required is small. However, in our context both methods turned out to be numerically unstable and did not converge. In the simulations reported here, the nonlinear optimisation in the M-step is performed through an iterative second-order method, the Broyden-Fletcher-Goldfarb-Shanno algorithm [6]. Since the iterations in the M-step are computationally expensive in contrast to the re-estimation of the expectations in the E-step, we perform only a few iterations in the inner loop. Due to the monotonic increase of the objective function in each EM step, the convergence to the next local optimum is guaranteed. The expectation step of the EM algorithm involves the construction of an auxiliary function (see [7] for details)

Q(\theta \,|\, \theta') = \sum_{i=1}^{m} \sum_{k=1}^{N} \log\big( \alpha_i \, \phi_i(x_k \,|\, \theta_i) \big) \, \beta_i(x_k; \theta') \quad \text{with} \quad \beta_i(x; \theta') = \frac{\alpha_i' \, \phi_i(x \,|\, \theta_i')}{p(x \,|\, \theta')}.   (19)

In the estimation with no penalty, Q(θ|θ') is maximised with respect to the parameters θ in each maximisation step. In the penalised case, however, we have to maximise

Q(\theta \,|\, \theta') - \lambda \, J(\theta).   (20)

The derivatives of Q(θ|θ') and J(θ) with respect to α_i, μ_ij and σ_ij, which are required for the optimisation of eq. (20), are given in Appendices A and B.
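One EM cycle with the penalised M-step might be outlined as follows. The `pack`/`unpack` helpers, the penalty function `J` and the handling of the constraint on the mixing coefficients are placeholders, and scipy's BFGS (with a small iteration limit, since only a few inner iterations are performed) stands in for the BFGS algorithm of [6]; this is a sketch, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def e_step(X, alphas, mus, sigmas):
    """Responsibilities beta_i(x_k; theta') of eq. (19) for data X of shape (N, n)."""
    diff = (X[:, None, :] - mus[None, :, :]) / sigmas[None, :, :]
    log_phi = (-0.5 * np.sum(diff ** 2, axis=2)
               - np.sum(np.log(sigmas), axis=1) - 0.5 * X.shape[1] * np.log(2 * np.pi))
    w = alphas[None, :] * np.exp(log_phi)               # (N, m)
    return w / np.sum(w, axis=1, keepdims=True)

def penalised_m_step(X, beta, theta0, lam, J, pack, unpack, n_iter=5):
    """A few BFGS iterations maximising Q(theta|theta') - lam * J(theta), eq. (20)."""
    def neg_objective(vec):
        alphas, mus, sigmas = unpack(vec)
        diff = (X[:, None, :] - mus[None, :, :]) / sigmas[None, :, :]
        log_phi = (-0.5 * np.sum(diff ** 2, axis=2)
                   - np.sum(np.log(sigmas), axis=1) - 0.5 * X.shape[1] * np.log(2 * np.pi))
        Q = np.sum(beta * (np.log(alphas)[None, :] + log_phi))   # auxiliary function, eq. (19)
        return -(Q - lam * J(alphas, mus, sigmas))               # negated eq. (20) for a minimiser
    res = minimize(neg_objective, pack(theta0), method="BFGS",
                   options={"maxiter": n_iter})
    return unpack(res.x)
```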

4 Experimental results

In order to evaluate the performance of the proposed method, several experiments were carried out to compare it against classical methods like EM.

Fig. 2. Plots of density estimates. (a) True density. (b) Maximum likelihood estimate using EM (20 kernels). (c) Unregularised maximum likelihood estimate using EA and structure optimisation (484 kernels). (d) Regularised estimate with λ = 0.001 (229 kernels). (e) Regularised estimate with λ = 1 (47 kernels). (f) Regularised estimate with λ = 10 (5 kernels).

Figure 2 shows the results of different density estimations. Figure 2 (a) shows the true density which has to be estimated. We sampled 500 points for the training set and left out 500 points for the test set. The conventional maximum likelihood estimation using the EM algorithm and a fixed number of kernels, see Figure 2 (b), leads to severe over-fitting of the training set. In Figure 3 (a) the out-of-sample performance on the test set decreases with increasing over-fitting on the training set. Figure 2 (c) shows the results of a maximum likelihood estimation with the EA and no regularisation. If no penalty is imposed, the structure optimisation process tends to place one Gaussian function with infinitely small width at each point of the training set. In Figure 2 (d)–(f) the results of the maximum penalised likelihood estimation combined with structure optimisation are shown for different values of the penalty factor λ. Even for rather small values of λ the effect of over-fitting is drastically reduced. Furthermore, the results obtained with different values of λ varying over several orders of magnitude suggest that the choice of λ is fairly robust.

Fig. 3. Training and test log-likelihood. (a) Maximum likelihood estimation using EM (20 kernels), plotted over EM iterations. (b) Maximum penalised likelihood estimation (λ = 1) using EA and structure optimisation (47 kernels), plotted over generations.

Figure 3 (b) shows that in a regularised density estimation (λ = 1) the out-of-sample performance behaves nearly like the performance on the training set. This allows for accurate predictions of the generalisation ability of the optimised models. In all experiments we set the population size to 20, the number of generations to 100, and the probabilities for crossover and mutation to P_c = 0.5 and P_m = 0.1.

5 Conclusion and further work

A new method has been proposed for the construction of mixture models using an evolutionary algorithm. Experiments in estimating a simple density have been presented which show that evolutionary algorithms combined with the EM algorithm can produce models with good generalisation abilities. The objective function favours models which are smooth and have a low tendency to over-fit the data. However, the number of components is only indirectly controlled by the smoothness criterion. A parsimony requirement – the famous Occam's razor – calls for an objective function which focuses on the minimum number m of components in a mixture. Depending on the context, it may be appropriate to replace the second-order derivative in defining J(θ) by some higher-order derivatives. This corresponds to a different assumption about what exactly we mean by smooth. We have used Lamarckian evolution in our experiments; for the case of time-varying distributions, however, it might be sensible not to pass the optimised parameters back. Finally, it would be interesting to evaluate the proposed hybrid method on larger problems with higher-dimensional input spaces.

References

1. B. Carse and T. C. Fogarty. Fast evolutionary learning of minimal radial basis function neural networks using a genetic algorithm. In T. C. Fogarty, editor, Evolutionary Computing – selected papers from AISB, pages 1-22. Springer, 1996.
2. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.
3. P. J. Green. On use of the EM algorithm for penalized likelihood estimation. J. R. Statist. Soc. B, 52:443-452, 1990.
4. J. Moody and C. L. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989.
5. T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497, 1990.
6. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.
7. R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26:195-239, 1984.
8. A. M. Reimetz. Strukturbestimmung von probabilistischen neuronalen Netzen mit Hilfe von Evolutionären Algorithmen. Master's thesis (Diplomarbeit), Fachbereich Statistik, Universität Dortmund, 1998.
9. B. Sendhoff, M. Kreutz, and W. von Seelen. A condition for the genotype-phenotype mapping: Causality. In T. Bäck, editor, Proc. International Conference on Genetic Algorithms, pages 73-80. Morgan Kaufmann, 1997.
10. B. W. Silverman, M. C. Jones, J. D. Wilson, and D. W. Nychka. A smoothed EM approach to indirect estimation problems, with particular reference to stereology and emission tomography. J. R. Statist. Soc. B, 52:271-324, 1990.
11. L. D. Whitley. Genetic algorithms and neural networks. In J. Periaux and G. Winter, editors, Genetic Algorithms in Engineering and Computer Science. Wiley, 1995.

A Derivation of the diagonal elements of D

For the calculation of the trace only the diagonal elements of D are required:

d_{rr} = \sum_{t=1}^{n} \int_{-\infty}^{+\infty} h_{rt}^2 \, dx = A_r + \sum_{t=1, \, t \ne r}^{n} B_{rt}   (21)

The symbols A_r and B_{rt}, which stand for the partial second derivatives, are introduced for convenience. After some straightforward but tedious calculations we get:

A_r = \int_{-\infty}^{+\infty} \left( \frac{\partial^2 p(x\,|\,\theta)}{\partial x_r^2} \right)^{2} dx = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, E_{ij} \, U_{ijr}   (22)

B_{rt} = \int_{-\infty}^{+\infty} \left( \frac{\partial^2 p(x\,|\,\theta)}{\partial x_r \, \partial x_t} \right)^{2} dx = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, E_{ij} \, V_{ijrt}   (23)

E_{ij} = \frac{1}{\sqrt{2\pi}^{\,n} \prod_{k=1}^{n} (\sigma_{ik}^2 + \sigma_{jk}^2)^{1/2}} \exp\!\left( -\frac{1}{2} \sum_{k=1}^{n} \frac{(\mu_{ik} - \mu_{jk})^2}{\sigma_{ik}^2 + \sigma_{jk}^2} \right)   (24)

U_{ijr} = \frac{3\,(\sigma_{ir}^2 + \sigma_{jr}^2)^2 - 6\,(\sigma_{ir}^2 + \sigma_{jr}^2)(\mu_{ir} - \mu_{jr})^2 + (\mu_{ir} - \mu_{jr})^4}{(\sigma_{ir}^2 + \sigma_{jr}^2)^4}   (25)

V_{ijrt} = \frac{\big( \sigma_{ir}^2 + \sigma_{jr}^2 - (\mu_{ir} - \mu_{jr})^2 \big)\big( \sigma_{it}^2 + \sigma_{jt}^2 - (\mu_{it} - \mu_{jt})^2 \big)}{(\sigma_{ir}^2 + \sigma_{jr}^2)^2 \, (\sigma_{it}^2 + \sigma_{jt}^2)^2}   (26)
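The closed-form quantities of eqs. (24)-(26) translate directly into code; the sketch below evaluates E_ij, U_ijr and V_ijrt for all component pairs at once. The vectorised layout and function name are implementation choices made for this illustration, not taken from the paper.

```python
import numpy as np

def smoothness_terms(mus, sigmas2):
    """E_ij, U_ijr, V_ijrt of eqs. (24)-(26) for centres mus (m, n) and variances sigmas2 (m, n)."""
    dmu = mus[:, None, :] - mus[None, :, :]          # mu_ik - mu_jk, shape (m, m, n)
    S = sigmas2[:, None, :] + sigmas2[None, :, :]    # sigma_ik^2 + sigma_jk^2
    n = mus.shape[1]
    E = np.exp(-0.5 * np.sum(dmu ** 2 / S, axis=2)) / (
        np.sqrt(2 * np.pi) ** n * np.prod(np.sqrt(S), axis=2))          # eq. (24), (m, m)
    U = (3 * S ** 2 - 6 * S * dmu ** 2 + dmu ** 4) / S ** 4             # eq. (25), (m, m, n) over r
    W = (S - dmu ** 2) / S ** 2
    V = W[:, :, :, None] * W[:, :, None, :]                             # eq. (26), (m, m, n, n) over (r, t)
    return E, U, V
```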

For the maximisation of the objective function we need the gradient of J(θ) and therefore the gradients of A_r and B_{rt}. Since we have to guarantee condition (2) on the α_i we use the following transformation:

\alpha_i = \exp(\gamma_i) \Big/ \sum_{j=1}^{m} \exp(\gamma_j).   (27)

Symbolic derivation of A_r and B_{rt} with respect to γ_s, μ_sq, and σ_sq yields:

\frac{\partial A_r}{\partial \gamma_s} = 2 \alpha_s \left[ \sum_{j=1}^{m} \alpha_j E_{sj} U_{sjr} - A_r \right]   (28)

\frac{\partial A_r}{\partial \mu_{sq}} = 2 \alpha_s \sum_{j=1}^{m} \alpha_j E_{sj} \, \frac{\mu_{jq} - \mu_{sq}}{\sigma_{jq}^2 + \sigma_{sq}^2} \left[ U_{sjr} + \delta_{qr} \, \Psi^{(1)}_{sjq} \right]   (29)

\frac{\partial A_r}{\partial \sigma_{sq}} = 2 \alpha_s \sum_{j=1}^{m} \alpha_j E_{sj} \, \frac{\sigma_{sq}}{\sigma_{jq}^2 + \sigma_{sq}^2} \left[ U_{sjr} \left( \frac{(\mu_{sq} - \mu_{jq})^2}{\sigma_{jq}^2 + \sigma_{sq}^2} - 1 \right) + \delta_{qr} \, \Psi^{(2)}_{sjq} \right]   (30)

\frac{\partial B_{rt}}{\partial \gamma_s} = 2 \alpha_s \left[ \sum_{j=1}^{m} \alpha_j E_{sj} V_{sjrt} - B_{rt} \right]   (31)

\frac{\partial B_{rt}}{\partial \mu_{sq}} = 2 \alpha_s \sum_{j=1}^{m} \alpha_j E_{sj} \, \frac{\mu_{jq} - \mu_{sq}}{\sigma_{jq}^2 + \sigma_{sq}^2} \left[ V_{sjrt} + \delta_{qr} \, \Psi^{(3)}_{sjq}(t) + \delta_{qt} \, \Psi^{(3)}_{sjq}(r) \right]   (32)

\frac{\partial B_{rt}}{\partial \sigma_{sq}} = 2 \alpha_s \sum_{j=1}^{m} \alpha_j E_{sj} \, \frac{\sigma_{sq}}{\sigma_{jq}^2 + \sigma_{sq}^2} \left[ V_{sjrt} \left( \frac{(\mu_{jq} - \mu_{sq})^2}{\sigma_{jq}^2 + \sigma_{sq}^2} - 1 \right) + \delta_{qr} \, \Psi^{(4)}_{sjq}(t) + \delta_{qt} \, \Psi^{(4)}_{sjq}(r) \right]   (33)

with

\Psi^{(1)}_{sjq} = \frac{12\,(\sigma_{jq}^2 + \sigma_{sq}^2) - 4\,(\mu_{jq} - \mu_{sq})^2}{(\sigma_{jq}^2 + \sigma_{sq}^2)^3}   (34)

\Psi^{(2)}_{sjq} = \frac{36\,(\sigma_{jq}^2 + \sigma_{sq}^2)(\mu_{jq} - \mu_{sq})^2 - 12\,(\sigma_{jq}^2 + \sigma_{sq}^2)^2 - 8\,(\mu_{jq} - \mu_{sq})^4}{(\sigma_{jq}^2 + \sigma_{sq}^2)^4}   (35)

\Psi^{(3)}_{sjq}(k) = \frac{2\,\big( \sigma_{jk}^2 + \sigma_{sk}^2 - (\mu_{jk} - \mu_{sk})^2 \big)}{(\sigma_{jq}^2 + \sigma_{sq}^2)(\sigma_{jk}^2 + \sigma_{sk}^2)^2}   (36)

\Psi^{(4)}_{sjq}(k) = \Psi^{(3)}_{sjq}(k) \left( \frac{2\,(\mu_{jq} - \mu_{sq})^2}{\sigma_{jq}^2 + \sigma_{sq}^2} - 1 \right)   (37)

B Derivatives of Q(θ|θ') with respect to α, μ and σ

\frac{\partial Q(\theta\,|\,\theta')}{\partial \gamma_i} = \sum_{k=1}^{N} \big( \beta_i(x_k; \theta') - \alpha_i \big) \qquad (\alpha_i \text{ according to eq. (27)})   (38)

\frac{\partial Q(\theta\,|\,\theta')}{\partial \mu_{ij}} = \sum_{k=1}^{N} \frac{x_{kj} - \mu_{ij}}{\sigma_{ij}^2} \, \beta_i(x_k; \theta')   (39)

\frac{\partial Q(\theta\,|\,\theta')}{\partial \sigma_{ij}} = \sum_{k=1}^{N} \left( \frac{(x_{kj} - \mu_{ij})^2}{\sigma_{ij}^3} - \frac{1}{\sigma_{ij}} \right) \beta_i(x_k; \theta')   (40)
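Since gradient expressions such as (28)-(40) are easy to get wrong in an implementation, a finite-difference check is a useful companion; the sketch below compares an analytic gradient with central differences on a flat parameter vector. It is a generic utility assumed for illustration, not part of the paper.

```python
import numpy as np

def check_gradient(f, grad, theta, eps=1e-5, tol=1e-4):
    """Compare an analytic gradient grad(theta) with central finite differences of f at theta."""
    theta = np.asarray(theta, dtype=float)
    num = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        num[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    ana = np.asarray(grad(theta), dtype=float)
    err = np.max(np.abs(num - ana) / np.maximum(1.0, np.abs(num)))
    return err < tol, err
```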