Neural Processing Letters, 1998

Mixture Density Estimation Based on Maximum Likelihood and Sequential Test Statistics

N. A. Vlassis, G. Papakonstantinou, P. Tsanakas
Dept. of Electrical and Computer Engineering, National Technical University of Athens
Zografou Campus, GR-15773, Athens, Greece
http://www.dsclab.ece.ntua.gr

Abstract

We address the problem of estimating an unknown probability density function from a sequence of input samples. We approximate the input density with a weighted mixture of a finite number of Gaussian kernels whose parameters and weights we estimate iteratively from the input samples using the Maximum Likelihood (ML) procedure. In order to decide on the correct total number of kernels we employ simple statistical tests involving the mean, variance, and the kurtosis, or fourth moment, of a particular kernel. We demonstrate the validity of our method in handling both pattern classification (stationary) and time series (non-stationary) problems.

Keywords: Gaussian mixtures, PNN, semi-parametric estimation, number of mixing components, test statistics, stationary distributions, non-stationary distributions.

1 Introduction

In this paper we study the problem of estimating the probability density function, or density, of a random variable $X$ based on a sequence of observations, or samples, $x_i$ of $X$. Knowing the exact form of a random variable's density could be helpful in designing optimal classification rules as in statistical discriminant analysis [10], performing probabilistic clustering of the variable's domain [5], or predicting a time series evolution [18]. For the input density we choose a particular parametric model, namely, the finite Gaussian mixture model [14]. This approximates the unknown density $p(x)$ by a finite weighted sum of normal (Gaussian) densities, or kernels, as

$$ p(x) = \sum_{j=1}^{K} \pi_j f_j(x), \qquad (1) $$

where $f_j(x)$ is in the univariate case the normal density $N(\mu_j, \sigma_j)$ with mean $\mu_j$ and variance $\sigma_j^2$. In the above formula the $\pi_j$ denote the mixing weights and $K$ is the number of kernels. Both the weights of the mixture and the parameters, e.g., mean and variance, of each kernel we are to estimate from the training set iteratively using the Maximum Likelihood (ML) procedure. We assume that the input samples $x_i$ are obtained sequentially, one at a time, rather than as a batch. This is typical for neural network algorithms and may prove helpful in several pattern recognition and signal detection applications [3]. The difficulty in using such mixture models for estimating an unknown density lies mainly in the ignorance of the total number $K$ of kernels. Thus, although ML techniques have been successfully applied in the estimation of the weights and the parameters of the kernels [9], they cannot be used directly for finding the total number of kernels, since then they would impose a separate kernel for each input sample, resulting in over-fitting. On the other hand, most of the precise mathematical techniques for estimating the number of kernels in mixtures (see [14], ch. 5) usually assume a batch training set and thus cannot be used in our framework. In the following we propose a solution to the problem of estimating the unknown number of kernel densities when the ML procedure for model fitting is used. In parallel to applying the ML technique for the estimation of the mixing weights and kernel parameters, we apply at each iteration step simple statistical tests involving the mean, the variance, and the kurtosis, or fourth moment, of the kernels to decide when a kernel should split in two or when two kernels should join in one, and a simple test involving the mixing weight of a kernel to decide upon its removal. This work is a continuation of [16]. We demonstrate herein the validity of our method on examples of unknown density estimation, pattern classification, and time series problems, and discuss its usefulness in other problems.

2 Related work

Recent research in neural networks has proposed several models for mixture density estimation that follow the above definitions. The Probabilistic Neural Network (PNN) [12] is a non-parametric kernel model that stores for each input sample a separate Gaussian kernel function with constant mean and variance, and can be regarded as the neural network realization of the method of Parzen [8]. The mixing weights are assumed equal among all kernels and equal to the reciprocal of the total number of input samples. The constraints the original PNN model imposed on the parameters of the kernels and the mixing weights were relaxed in subsequent works [15, 1, 13, 11]. In [1] different mixing weights are used, while in [15, 13] the kernels are represented by multivariate Gaussian functions whose parameters are estimated by the Maximum Likelihood technique, similarly to our approach here. However, most of the above approaches assume that the total number $K$ of kernels is known a priori, and it turns out that the automatic estimation of $K$ is a difficult problem [14, 10]. A neural network method for deciding on the normality of a particular kernel, based on Pearson's $\chi^2$ test statistic, is proposed in [11]. There, each kernel is divided into equiprobable regions and the number of input samples belonging to each region is counted. Based on these counts an appropriate test statistic is formed that follows a $\chi^2$ distribution and whose failure implies non-normality of the underlying kernel, which is then split. However, it is difficult to combine that test with sequential estimation procedures, i.e., when the parameters of the underlying distribution are continuously modified. For the same problem of estimating the total number of kernels, mathematical methods that assume the training set is in batch form have been developed, based on the likelihood ratio test statistic [6], the method of moments [4], and graphical methods or tests for unimodality [14]. In most of the above cases, however, it is difficult to find the asymptotic distribution of the respective test statistic due to unsatisfied regularity conditions of the underlying tests, in which case empirical Monte Carlo bootstrapping techniques are the only solution. Pruning NN techniques for kernel density estimation like the ones we propose herein can be found in [2].

3 The mixture model

We say that a random variable $X$ has a finite mixture distribution when it can be represented by a probability density function in the form of (1). In the following we will assume that the kernels are univariate Gaussian functions and generalize to the multivariate case in the next section. Under this framework the unknown density $p(x)$ of the one-dimensional random variable $X$ can be written as

$$ p(x) = \pi_1 f_1(x; \mu_1, \sigma_1) + \cdots + \pi_K f_K(x; \mu_K, \sigma_K), \qquad (2) $$

where $f_j(x; \mu_j, \sigma_j)$ denotes the normal density $N(\mu_j, \sigma_j)$

$$ f_j(x; \mu_j, \sigma_j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left[ -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right], \qquad (3) $$

parametrized on the mean $\mu_j$ and the variance $\sigma_j^2$, and $K$ is the total number of kernels. In order for $p(x)$ to be a correct probability density function with integral 1 over the input space, the additional constraints

$$ \sum_{j=1}^{K} \pi_j = 1, \qquad \pi_j \geq 0 \qquad (4) $$

must hold. We further assume that a sequence of independent observations, or samples, $x_1, \dots, x_n, \dots$ of the random variable $X$ are received, one at a time, from which we must estimate the unknown density of $X$. In other words, we must fit the above mixture parametric model to the input data. Such an estimation procedure would require the evaluation from the training sequence of the total number $K$ of kernels, together with the $3K$ unknown parameters $\pi_j$, $\mu_j$, and $\sigma_j$.
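To make the model concrete, the following short C sketch evaluates the mixture density (2) under the constraints (4). It is only an illustration on our part; the Kernel struct, the helper names, and the example parameter values are assumptions, not the authors' implementation.

```c
/* Minimal sketch of the univariate mixture model (2)-(4).
   Struct layout and example parameters are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

#define TWO_PI 6.283185307179586

typedef struct { double pi, mu, var; } Kernel;  /* mixing weight, mean, variance */

/* Normal density N(mu, sigma) of Eq. (3), parametrized by mean and variance. */
static double normal_pdf(double x, double mu, double var) {
    double d = x - mu;
    return exp(-0.5 * d * d / var) / sqrt(TWO_PI * var);
}

/* Mixture density p(x) of Eq. (2); the weights are assumed to satisfy Eq. (4). */
static double mixture_pdf(double x, const Kernel *k, int K) {
    double p = 0.0;
    for (int j = 0; j < K; ++j)
        p += k[j].pi * normal_pdf(x, k[j].mu, k[j].var);
    return p;
}

int main(void) {
    Kernel k[2] = { {0.6, 0.0, 1.0}, {0.4, 3.0, 0.25} };  /* illustrative two-kernel mixture */
    printf("p(1.0) = %g\n", mixture_pdf(1.0, k, 2));
    return 0;
}
```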

3.1 The Maximum Likelihood procedure

The joint density $p(x_1, \dots, x_n; \theta) = p(x_1; \theta) \cdots p(x_n; \theta)$ of a sequence of $n$ independent samples $x_i$ of the random variable $X$, regarded as a function of the unknown parameter vector $\theta$, with $\theta = [\pi_1, \mu_1, \sigma_1, \dots, \pi_K, \mu_K, \sigma_K]$, is called the likelihood function of the samples. The vector $\theta$ that maximizes the likelihood function, or equivalently its logarithm

$$ L(x_1, \dots, x_n; \theta) = \ln p(x_1, \dots, x_n; \theta) = \sum_{i=1}^{n} \ln p(x_i; \theta), \qquad (5) $$

is called the Maximum Likelihood (ML) estimate of $\theta$. Under reasonable assumptions it can be shown [9] that a unique ML estimate of the parameter vector exists, which at least locally maximizes the likelihood function and it is the root of the equation

$$ \nabla_\theta L(x_1, \dots, x_n; \theta) = 0, \qquad (6) $$

which for a parameter $\theta_j$, with $\theta_j = \mu_j$ or $\theta_j = \sigma_j^2$, and using (1) and (5) reads

$$ \frac{\partial L(x_1, \dots, x_n; \theta)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial \ln p(x_i; \theta)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{1}{p(x_i; \theta)} \frac{\partial p(x_i; \theta)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\pi_j}{p(x_i; \theta)} \frac{\partial f_j(x_i; \theta_j)}{\partial \theta_j} = 0. \qquad (7) $$
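As an aside, the log-likelihood (5) of a batch of samples is cheap to compute and can be used to monitor the ML iterations; the C fragment below is a hypothetical sketch of ours, reusing an illustrative Kernel struct that is not part of the paper.

```c
/* Sketch of the log-likelihood (5) for a batch of samples under the current mixture. */
#include <math.h>
#include <stdio.h>

#define TWO_PI 6.283185307179586

typedef struct { double pi, mu, var; } Kernel;

static double normal_pdf(double x, double mu, double var) {
    double d = x - mu;
    return exp(-0.5 * d * d / var) / sqrt(TWO_PI * var);
}

/* L(x_1,...,x_n; theta) = sum_i ln p(x_i; theta), Eq. (5). */
static double log_likelihood(const double *x, int n, const Kernel *k, int K) {
    double L = 0.0;
    for (int i = 0; i < n; ++i) {
        double p = 0.0;
        for (int j = 0; j < K; ++j)
            p += k[j].pi * normal_pdf(x[i], k[j].mu, k[j].var);
        L += log(p);
    }
    return L;
}

int main(void) {
    Kernel k[2] = { {0.5, 0.0, 1.0}, {0.5, 4.0, 1.0} };
    double x[4] = { -0.1, 0.2, 3.9, 4.1 };
    printf("L = %g\n", log_likelihood(x, 4, k, 2));
    return 0;
}
```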

In this context we ignore singularity problems, i.e., we assume that the log-likelihood function is bounded above. If singularities are to be a problem, then appropriate constraints on the parameters $\sigma_j$ must be placed as in [14, p. 93], of the form $\min(\sigma_i / \sigma_j) \geq c > 0$, $i \neq j$, with $c$ constant. Using the continuous version of Bayes' theorem [7, p. 84] we estimate the probability that a particular input sample $x_i$ is drawn from a kernel $j$ as

$$ P\{j \mid x_i\} = \frac{\pi_j f_j(x_i; \mu_j, \sigma_j)}{p(x_i; \theta)}, \qquad (8) $$

with the mixing weights $\pi_j$ regarded as prior probabilities on the kernels, or simply priors, and thus a further simplification of (7) is possible as

$$ \sum_{i=1}^{n} P\{j \mid x_i\} \frac{1}{f_j(x_i; \theta_j)} \frac{\partial f_j(x_i; \theta_j)}{\partial \theta_j} = \sum_{i=1}^{n} P\{j \mid x_i\} \frac{\partial \ln f_j(x_i; \theta_j)}{\partial \theta_j} = 0, \qquad (9) $$

which differs from (7) on the $P\{j \mid x_i\}$ term and can be regarded as a separate, weighted maximum likelihood estimation on each kernel. Inserting the formula for the Gaussian (3) yields the ML estimates of the mean $\mu_j$ and variance $\sigma_j^2$ of a kernel $j$ as

$$ \mu_j = \frac{\sum_{i=1}^{n} P\{j \mid x_i\} \, x_i}{\sum_{i=1}^{n} P\{j \mid x_i\}}, \qquad (10) $$

$$ \sigma_j^2 = \frac{\sum_{i=1}^{n} P\{j \mid x_i\} \, (x_i - \mu_j)^2}{\sum_{i=1}^{n} P\{j \mid x_i\}}. \qquad (11) $$
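The posterior (8) and the weighted estimates (10) and (11) translate almost directly into code. The sketch below is our own illustration with assumed names and a two-pass batch organization; the paper itself works with the sequential forms (14)-(16) given next.

```c
/* Sketch of the posterior responsibilities (8) and the batch weighted ML
   estimates (10)-(11) for one kernel. Names and structure are illustrative. */
#include <math.h>
#include <stdio.h>

#define TWO_PI 6.283185307179586

typedef struct { double pi, mu, var; } Kernel;

static double normal_pdf(double x, double mu, double var) {
    double d = x - mu;
    return exp(-0.5 * d * d / var) / sqrt(TWO_PI * var);
}

/* P{j | x} of Eq. (8): posterior probability that sample x was drawn from kernel j. */
static double posterior(double x, const Kernel *k, int K, int j) {
    double den = 0.0;
    for (int l = 0; l < K; ++l)
        den += k[l].pi * normal_pdf(x, k[l].mu, k[l].var);
    return k[j].pi * normal_pdf(x, k[j].mu, k[j].var) / den;
}

/* Weighted ML estimates (10)-(11) for kernel j from a batch of n samples,
   with the posteriors computed from the current parameter values. */
static void weighted_ml(const double *x, int n, const Kernel *k, int K, int j,
                        double *mu_out, double *var_out) {
    double sw = 0.0, sx = 0.0, sv = 0.0;
    for (int i = 0; i < n; ++i) {               /* Eq. (10): weighted mean */
        double w = posterior(x[i], k, K, j);
        sw += w;
        sx += w * x[i];
    }
    double mu = sx / sw;
    for (int i = 0; i < n; ++i)                 /* Eq. (11): weighted variance */
        sv += posterior(x[i], k, K, j) * (x[i] - mu) * (x[i] - mu);
    *mu_out = mu;
    *var_out = sv / sw;
}

int main(void) {
    Kernel k[2] = { {0.5, 0.0, 1.0}, {0.5, 4.0, 1.0} };
    double x[6] = { -0.2, 0.1, 0.3, 3.8, 4.1, 4.2 }, mu, var;
    weighted_ml(x, 6, k, 2, 1, &mu, &var);
    printf("kernel 1: mu = %g, var = %g\n", mu, var);
    return 0;
}
```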

In order to find the ML estimates for the priors $\pi_j$ we introduce a Lagrange multiplier $\lambda$ for the condition $\sum_{j=1}^{K} \pi_j = 1$ and find the roots of the first partial derivative with respect to $\pi_j$ of the quantity

$$ \sum_{i=1}^{n} \ln p(x_i; \theta) - \lambda \left( \sum_{j=1}^{K} \pi_j - 1 \right), \qquad (12) $$

which after some derivations gives the ML estimate for the priors

$$ \pi_j = n^{-1} \sum_{i=1}^{n} P\{j \mid x_i\}. \qquad (13) $$

Substituting the sum in the denominators of (10) and (11) from (13), after some derivations one can easily arrive at iterative formulas for the quantities $\mu_j$, $\sigma_j^2$, and $\pi_j$ of a kernel $j$ as

$$ \mu_j := \mu_j + (n\pi_j)^{-1} P\{j \mid x\} (x - \mu_j), \qquad (14) $$

$$ \sigma_j^2 := \sigma_j^2 + (n\pi_j)^{-1} P\{j \mid x\} \left[ (x - \mu_j)^2 - \sigma_j^2 \right], \qquad (15) $$

$$ \pi_j := \pi_j + n^{-1} \left( P\{j \mid x\} - \pi_j \right). \qquad (16) $$

It is not difficult to verify that the new $\pi_j$ satisfy $\sum_{j=1}^{K} \pi_j = 1$. Interestingly, exactly the same formulas are obtained in [14, p. 210] using the method of scores. We should note nevertheless that there is an inherent bias in the estimation of the above quantities since at each step the variance depends on the current value of the mean. However, simulations showed that in practical cases and for relatively large $n$, e.g., $n \geq 100$, this bias can be ignored. In the following we assume that the input sequence of random samples $x_i$ is infinite and that the number $n$ in (14)-(16) is constant. This can be viewed as a moving average procedure on previous inputs. In this case we do not ask for stochastic convergence in the regular sense but seek a way to model potentially non-stationary input distributions.
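For illustration, the sequential updates (14)-(16) might be coded as follows; this is a minimal sketch under our own naming assumptions, with the posterior $P\{j \mid x\}$ assumed to have been computed from (8) beforehand.

```c
/* Sketch of the sequential updates (14)-(16) for one kernel, given a new sample x
   and its posterior w = P{j|x} from Eq. (8); n is the constant "window" length. */
#include <stdio.h>

typedef struct { double pi, mu, var; } Kernel;    /* prior, mean, variance */

static void sequential_update(Kernel *k, double x, double w, int n) {
    double d = x - k->mu;                         /* deviation from the current mean */
    k->mu  += w * d / (n * k->pi);                /* Eq. (14) */
    k->var += w * (d * d - k->var) / (n * k->pi); /* Eq. (15) */
    k->pi  += (w - k->pi) / n;                    /* Eq. (16) */
}

int main(void) {
    Kernel k = { 1.0, 0.0, 1.0 };                 /* single-kernel mixture: w = P{1|x} = 1 */
    double xs[5] = { 0.5, 0.7, 0.4, 0.6, 0.55 };
    for (int i = 0; i < 5; ++i) sequential_update(&k, xs[i], 1.0, 100);
    printf("mu = %g, var = %g, pi = %g\n", k.mu, k.var, k.pi);
    return 0;
}
```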

3.2 Testing for the number of kernels

Perhaps the most difficult problem in the theory of mixture density models is the proper estimation of the number of kernels when a method like ML is used [14]. We propose here a method that searches on-line for the correct number of kernels, based on simple test statistics for testing the hypothesis of single normality against a two-kernel alternative.

Splitting a kernel

We first look for a test statistic to decide when a kernel should split in two. Consider the simple case of Fig. 1 where the input samples (vertical bars) are assumed to follow a two-kernel mixture distribution, but the algorithm tries to fit the data with a single kernel (Gaussian curve). A statistical test is needed to check the hypothesis that the input samples follow a single Gaussian with $\mu$ and $\sigma$ against the alternative that they follow a mixture of two kernels, in which case the single kernel should split in two. Since from the ML estimates (14) and (15) of the first two moments of the underlying kernel one cannot decide on the normality of the kernel, it is reasonable to base the hypothesis on tests involving higher moments.

Figure 1: Kernel splitting: the input samples (vertical bars) are assumed to follow a two-kernel distribution but the ML procedure tries to fit the data with a single kernel.

Figure 2: Kernels joining: when two neighboring kernels have similar variances and means they can be joined in one.

We form a simple sequential test statistic based on a weighted formula of the kurtosis, or fourth moment, of a kernel $j$ as

$$ k_j := k_j + (n\pi_j)^{-1} P\{j \mid x\} \left[ \left( \frac{x - \mu_j}{\sigma_j} \right)^4 - k_j - 3 \right], \qquad (17) $$

with $\mu_j$ and $\sigma_j$ the current ML estimates for the parameters of the kernel. On the hypothesis that the $x_i$ follow a normal distribution $N(\mu_j, \sigma_j)$ it follows [17] that the random variable

$$ q = k_j \sqrt{n\pi_j / 96} \qquad (18) $$

approximately follows the normal distribution $N(0, 1)$. Since $N(0, 1)$ is symmetrical about zero we accept the hypothesis that the kernel $j$ is $N(\mu_j, \sigma_j)$ if (cf. [7], ch. 9-3)

$$ |q| < z_{1-a/2}, \qquad (19) $$

and reject it otherwise. In the above formula $z_u$ is the standard normal $u$ percentile and $a$ the significance level of the test. For $a = 0.001$, $z_{1-a/2}$ equals 3.291. The advantage of this procedure for testing the normality of a kernel is that it can be implemented sequentially. When a new kernel is created it gets a zero value of kurtosis which is updated at each iteration step from (17). The first time (19) is violated we split the kernel and create two kernels with means $\mu + \sigma$ and $\mu - \sigma$, and with variances and priors equal to those of the original kernel. The priors of all kernels are then re-normalized to ensure $\sum_{j=1}^{K} \pi_j = 1$.
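A possible C rendering of the split test (17)-(19) and of the splitting step is sketched below; the struct fields, function names, and the hard-coded critical value are our own assumptions (3.291 is the $z_{1-a/2}$ value for $a = 0.001$ quoted above).

```c
/* Sketch of the sequential kurtosis test (17)-(19) and the splitting step.
   Field and function names are illustrative; Z_CRIT = z_{1-a/2} for a = 0.001. */
#include <math.h>
#include <stdio.h>

#define Z_CRIT 3.291

typedef struct { double pi, mu, var, kurt; } Kernel;

/* Eq. (17): update the weighted kurtosis statistic of the kernel with sample x,
   posterior w = P{j|x}, and window length n. */
static void kurtosis_update(Kernel *k, double x, double w, int n) {
    double z = (x - k->mu) / sqrt(k->var);
    k->kurt += w * (z * z * z * z - k->kurt - 3.0) / (n * k->pi);
}

/* Eqs. (18)-(19): nonzero if the single-normality hypothesis is rejected. */
static int should_split(const Kernel *k, int n) {
    double q = k->kurt * sqrt(n * k->pi / 96.0);
    return fabs(q) >= Z_CRIT;
}

/* Split as described in the text: means mu +/- sigma, the original variance and
   prior, zero kurtosis; the caller re-normalizes all priors afterwards. */
static void split_kernel(const Kernel *k, Kernel *a, Kernel *b) {
    double s = sqrt(k->var);
    *a = *k;  *b = *k;
    a->mu = k->mu + s;
    b->mu = k->mu - s;
    a->kurt = b->kurt = 0.0;
}

int main(void) {
    Kernel k = { 1.0, 0.0, 1.0, 5.0 };      /* artificially large kurtosis excess */
    if (should_split(&k, 100)) {
        Kernel a, b;
        split_kernel(&k, &a, &b);
        printf("split into means %g and %g\n", a.mu, b.mu);
    }
    return 0;
}
```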

Joining two kernels

A reasonable criterion for joining two neighboring kernels in one is when they have almost the same variance and very near means, as shown in Fig. 2. To check for this situation we first apply the Snedecor-F test statistic for equal variances and, if it succeeds, we then apply the Student-t test statistic for equal means assuming equal variances [17]. More specifically, for checking for equality of the variances $\sigma_j^2$ and $\sigma_k^2$ of two neighboring kernels $j$ and $k$ we form the ratio of the larger to the smaller variance, e.g.,

$$ F = \frac{\sigma_j^2}{\sigma_k^2}, \qquad (20) $$

for each input sample $x = [x_1, \dots, x_d]$ belonging to class $c$
    if class $c$ has not received any input yet
        create a new kernel $j$ for class $c$
        $\pi_j = 1$
        for each dimension $i$, $i = 1, \dots, d$
            $\mu_{ji} = x_i$, $\sigma_{ji} = 1$, and $k_{ji} = 0$
    compute from (8) the posterior probability for all kernels $j$, $j = 1, \dots, K_c$, of class $c$
    apply tests (21) and (22) to kernels $j$ and $k$ with largest and second largest posterior
    if tests succeed in all dimensions
        join these two kernels in one with $m = (m_j + m_k)/2$, $S = \sqrt{S_j S_k}$, and $\pi = \pi_j$
        normalize all priors so that $\sum_{j=1}^{K_c} \pi_j = 1$
    for each kernel $j$, $j = 1, \dots, K_c$, of class $c$
        update the prior of $j$ using (16)
        for each dimension $i$, $i = 1, \dots, d$
            update the parameters of kernel $j$ in dimension $i$ using (14) and (15)
            apply test (17) on dimension $i$ of kernel $j$
            if test succeeds
                split kernel $j$ on dimension $i$ into kernels $v$ and $w$
                $\mu_{vi} = \mu_{ji} + \sigma_{ji}$, $\mu_{wi} = \mu_{ji} - \sigma_{ji}$, $\sigma_{vi} = \sigma_{wi} = \sigma_{ji}$, $k_{vi} = k_{wi} = 0$, $\pi_v = \pi_w = \pi_j$
                normalize all priors so that $\sum_{j=1}^{K_c} \pi_j = 1$
    if for a kernel $k$ it holds that $\pi_k < 1/n$
        remove kernel $k$
        normalize all priors so that $\sum_{j=1}^{K_c} \pi_j = 1$

Figure 3: The training algorithm for density estimation.

which under the hypothesis of equal variances approximately follows the Snedecor-F distribution with $(n, n)$ degrees of freedom. Since $F \geq 1$ we accept the hypothesis of equal variances if $F$ satisfies

$$ F < F_{n,n,1-a/2}, \qquad (21) $$

(21)

with $F_{n,n,u}$ being the $u$ percentile of the $F(n, n)$ distribution and $a$ the test significance level. For $n = 100$ and $a = 0.001$ the above inequality becomes $F < 1.945$, while for $n = 1000$ and $a = 0.001$ it becomes $F < 1.23$. If the test for equal variances succeeds, we subsequently check for equal means assuming common variance $\sigma^2 = \sigma_j \sigma_k$, using the test statistic

$$ t = \frac{\sqrt{n\pi_j} \, (\mu_j - \mu_k)}{\sigma \sqrt{2}}, $$

which under the hypothesis of equal means approximately follows the normal distribution $N(0, 1)$. Similarly to (19) we accept the hypothesis of equal means if it holds

$$ |t| < z_{1-a/2}. \qquad (22) $$

If the above test succeeds, the two kernels are joined in one with mean $(\mu_j + \mu_k)/2$, variance $\sigma^2$, and prior equal to $\pi_j$. The priors of all kernels are re-normalized to unity.
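The two joining tests could be coded along the following lines. This is a sketch of ours, with the critical values fixed to those quoted in the text for $n = 100$ and $a = 0.001$; for other settings the $F$ percentile would have to be recomputed.

```c
/* Sketch of the kernel-joining tests (20)-(22) and of the join itself.
   F_CRIT and Z_CRIT are the critical values quoted in the text for n = 100, a = 0.001. */
#include <math.h>
#include <stdio.h>

#define F_CRIT 1.945    /* F_{n,n,1-a/2} for n = 100, a = 0.001 */
#define Z_CRIT 3.291    /* z_{1-a/2} for a = 0.001 */

typedef struct { double pi, mu, var; } Kernel;

/* Nonzero if kernels j and k pass both the variance test (20)-(21) and the mean test (22). */
static int may_join(const Kernel *j, const Kernel *k, int n) {
    double F = j->var > k->var ? j->var / k->var : k->var / j->var;   /* Eq. (20), F >= 1 */
    if (F >= F_CRIT) return 0;                                        /* Eq. (21) fails */
    double common_var = sqrt(j->var * k->var);      /* sigma^2 = sigma_j * sigma_k */
    double t = sqrt(n * j->pi) * (j->mu - k->mu) / sqrt(2.0 * common_var);
    return fabs(t) < Z_CRIT;                                          /* Eq. (22) */
}

/* Join as in the text: mean (mu_j + mu_k)/2, variance sigma_j*sigma_k, prior pi_j;
   the caller re-normalizes all priors afterwards. */
static Kernel join_kernels(const Kernel *j, const Kernel *k) {
    Kernel out = { j->pi, 0.5 * (j->mu + k->mu), sqrt(j->var * k->var) };
    return out;
}

int main(void) {
    Kernel a = { 0.3, 1.00, 1.00 }, b = { 0.3, 1.05, 1.10 };
    if (may_join(&a, &b, 100)) {
        Kernel c = join_kernels(&a, &b);
        printf("joined kernel: mu = %g, var = %g\n", c.mu, c.var);
    }
    return 0;
}
```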

Removing a kernel

We maintain that a kernel $j$ should be removed from the mixture when its prior probability $\pi_j$ falls below some threshold. A small value for the prior probability of a kernel implies that only few samples have originated from that kernel; these might have been outlier points, or noise, and thus the kernel may be eliminated. A reasonable threshold for the priors is $1/n$, which also ensures that the terms in (14) and (15) remain bounded. After a kernel is removed, all kernels should update their priors to sum to one.
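A compact sketch of this pruning step (our own illustration; the array-based bookkeeping is an assumption) could look as follows.

```c
/* Sketch of kernel removal: drop kernels whose prior falls below 1/n and
   re-normalize the remaining priors to sum to one. */
#include <stdio.h>

typedef struct { double pi, mu, var; } Kernel;

static int prune(Kernel *k, int K, int n) {
    int m = 0;
    double sum = 0.0;
    for (int j = 0; j < K; ++j)
        if (k[j].pi >= 1.0 / n) k[m++] = k[j];   /* keep kernel j only if pi_j >= 1/n */
    for (int j = 0; j < m; ++j) sum += k[j].pi;
    for (int j = 0; j < m; ++j) k[j].pi /= sum;  /* re-normalize priors */
    return m;                                    /* new number of kernels */
}

int main(void) {
    Kernel k[3] = { {0.60, 0.0, 1.0}, {0.39, 3.0, 1.0}, {0.005, 9.0, 1.0} };
    int m = prune(k, 3, 100);                    /* threshold 1/n = 0.01 */
    printf("%d kernels remain, pi_0 = %g\n", m, k[0].pi);
    return 0;
}
```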

Figure 4: Neural network implementation.

4 Multivariate densities

For problems of higher dimension $d$, Eq. (3) generalizes for a kernel $j$ and input vector $x = [x_1, \dots, x_d]$ to

$$ f_j(x; m_j, S_j) = \frac{1}{\sqrt{(2\pi)^d \det S_j}} \exp\left[ -\tfrac{1}{2} (x - m_j) S_j^{-1} (x - m_j)^T \right], \qquad (23) $$

where $m_j = [\mu_{j1}, \dots, \mu_{jd}]$ is the mean of the kernel, $S_j$ is the covariance matrix, and $\det S_j$ denotes the determinant of $S_j$. The approach we described in the previous section can be directly applied to the multivariate case if we make the assumption that in each multivariate kernel $j$ the $d$ components of the input vector $x$ are jointly normal and uncorrelated, an assumption that results in hyper-ellipsoidal kernels. In this case [7, p. 197] the respective covariance matrix $S_j$ is diagonal and (23) can be written as the product of the $d$ marginal univariate Gaussians, i.e.,

$$ f_j(x; m_j, S_j) = \frac{1}{\sigma_{j1} \cdots \sigma_{jd} \sqrt{(2\pi)^d}} \exp\left[ -\frac{(x_1 - \mu_{j1})^2}{2\sigma_{j1}^2} - \cdots - \frac{(x_d - \mu_{jd})^2}{2\sigma_{jd}^2} \right], \qquad (24) $$

with $\sigma_{j1}^2, \dots, \sigma_{jd}^2$ the diagonal elements of $S_j$. For the prior updates we use (1), (8), and (16) with the kernel densities substituted from (24), while adaptation of the kernel parameters is done in each dimension separately, so that the $d$ components of $m_j$ and the $d$ diagonal elements of $S_j$ of each kernel are estimated as in the univariate case from (14) and (15), respectively. The kurtosis test (19) is applied on each dimension separately and if it succeeds for dimension $i$ then the split kernels keep all but the $i$-th components unaltered, the latter being changed as in the univariate case. Finally, to join two kernels, (21) and (22) must succeed in all dimensions.
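For reference, a small C sketch (ours; the names are illustrative) of the diagonal-covariance kernel density (24), accumulated in log space for numerical safety:

```c
/* Sketch of the diagonal-covariance kernel density of Eq. (24): the product of d
   marginal univariate Gaussians, evaluated via its logarithm. */
#include <math.h>
#include <stdio.h>

#define TWO_PI 6.283185307179586

/* f_j(x; m_j, S_j) for a diagonal S_j with diagonal elements var[0..d-1]. */
static double diag_gaussian_pdf(const double *x, const double *mu,
                                const double *var, int d) {
    double logp = 0.0;
    for (int i = 0; i < d; ++i) {
        double diff = x[i] - mu[i];
        logp += -0.5 * diff * diff / var[i] - 0.5 * log(TWO_PI * var[i]);
    }
    return exp(logp);
}

int main(void) {
    double x[2] = { 0.5, -0.3 }, mu[2] = { 0.0, 0.0 }, var[2] = { 1.0, 2.0 };
    printf("f(x) = %g\n", diag_gaussian_pdf(x, mu, var, 2));
    return 0;
}
```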

5 Supervised learning

It is fairly straightforward to extend our algorithm to supervised cases, i.e., problems where the incoming samples are labeled as belonging to one of a number $C$ of classes.

Figure 5: Estimating a uniform density in a one-dimensional problem. Total number of used kernels: $K = 31$. Parameters: $n = 1000$, $a = 0.1$.

In this case, the $K$ kernels are divided into $C$ subsets, each subset $i$ has $K_i$ kernels, and $\sum_{i=1}^{C} K_i = K$ holds. Each subset $i$ estimates the density $p_i(x)$ of the class $i$, and it is trained separately from the inputs that belong to class $i$. The training algorithm for estimating a class density in a multi-class problem is shown in Fig. 3. After training, to decide which class a newly arriving sample $x$ belongs to we use the Bayes rule of minimal risk [10, p. 19], which assigns $x$ to the class that maximizes the class posterior probability

$$ \mathrm{Prob}\{\mathrm{Class} = c \mid X = x\} = \frac{\mathrm{prior}(c) \, p_c(x)}{\sum_{i=1}^{C} \mathrm{prior}(i) \, p_i(x)}, $$

where the prior of a particular class is estimated by the proportion of the samples of that class used for training. The proposed method can be directly mapped onto a neural network (NN) architecture. In Fig. 4 we show the NN that computes the mixture density $p_c(x)$ for a class $c$. The $d$ components $x_1, \dots, x_d$ of the input vector $x$ are forwarded to the hidden layer whose activation functions (Gaussians) are used to compute the kernel densities $f_j(x)$ for all kernels $j$, $j = 1, \dots, K_c$, of class $c$. The outputs of the hidden layer are multiplied by the respective prior probabilities $\pi_j$ and the output unit sums its inputs to compute $p_c(x) = \sum_{j=1}^{K_c} \pi_j f_j(x)$.
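As an illustration of the decision rule, a minimal C sketch follows; the priors and densities shown are placeholder values that would come from the trained per-class mixtures, and the function name is our own.

```c
/* Sketch of the classification rule: assign a sample to the class c maximizing
   prior(c) * p_c(x); the common denominator can be ignored for the decision. */
#include <stdio.h>

static int classify(const double *class_prior, const double *class_density, int C) {
    int best = 0;
    for (int c = 1; c < C; ++c)
        if (class_prior[c] * class_density[c] > class_prior[best] * class_density[best])
            best = c;
    return best;
}

int main(void) {
    double prior[2]   = { 0.5, 0.5 };       /* estimated from class proportions */
    double density[2] = { 0.02, 0.35 };     /* p_c(x) for the sample x, illustrative */
    printf("assigned class: %d\n", classify(prior, density, 2));
    return 0;
}
```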

6 Examples

We ran our algorithm on a density estimation, a pattern recognition, and a time series problem. The first assumed one-dimensional random samples following a uniform distribution in the interval $[0, 5]$. The estimated density is shown in Fig. 5, approximated with 31 kernels. The test significance level was chosen $a = 0.1$ and the constant $n = 1000$. The distribution is well approximated near the center, with the typical `jumps' at the discontinuity points. In Fig. 6 we demonstrate the results from solving the `synthetic' two-class problem from [10, p. 11]. There are two features (two-dimensional inputs) and two classes. Each class has a bimodal distribution, and both are chosen so as to allow a best possible error rate of about 8% [10]. Our algorithm approximated the class-1 density (points below the decision boundary) with four kernels and that of class-2 with one kernel, resulting in an error rate of 8.5%. The above results were obtained by running our algorithm for 150 epochs with parameters $a = 0.001$ and $n = 100$. Tests showed that the best performance of the algorithm was achieved for relatively small values of $n$, e.g., $n = 100$, while a number of epochs close to 100 was almost always sufficient. To demonstrate the potential of our algorithm to handle even non-stationary input distributions we applied it to a time series problem. We assume as input a small sinusoidal variation around zero, plus white Gaussian noise, as shown in Fig. 7.

Figure 6: Solving the synthetic two-class problem: four kernels for class-1 (points below the boundary), one kernel for class-2, and an error rate of 8.5%. Parameters: $n = 100$ and $a = 0.001$.


Figure 7: Handling non-stationary input distributions: the input sinusoidal signal with white Gaussian noise drops from 0 to -10. The algorithm quickly converges to the correct number of kernels. Parameters: $n = 65$, $a = 0.001$.

One kernel is initially allocated whose mean follows the input signal with a small delay. At time $t = 10$ the level of the input signal drops to -10 and the kurtosis test (17) causes the initial kernel to split. A newly generated kernel with mean varying around -10 is finally allocated in the region of the new density. As shown in Fig. 7, the three unused split kernels are eliminated soon after their prior probabilities drop below the removal threshold $1/n$. The total number of kernels at any moment is the number of thick lines along the x axis, at most 4. The error bars denote the deviation $\sigma$ of each kernel at any instant. The value of $n$ for this application was 65 and the value of $a$ was 0.001. We should mention here the effect of the two parameters $a$ and $n$ on our algorithm. A small $a$ means that we tolerate larger deviations from the expected values in the test statistics (19), (21), and (22), and thus splits and joins become less frequent. The parameter $n$ affects the statistical tests as well, but mainly plays the role of a `moving window' on the input values. Thus, for non-stationary problems $n$ must be set according to the speed of change of the input distributions. The algorithm has been implemented in C and the images are produced by the UNIX gnuplot program. The software is available at http://www.dsclab.ece.ntua.gr/~nvlassis.

7 Conclusions

We presented an algorithm for supervised and unsupervised learning that is based on maximum likelihood estimation of mixture densities. The novel feature of our approach is that it employs simple sequential statistical tests to decide when a kernel should split or when two kernels should join in one. This makes it possible to implement the algorithm on-line, without the need to keep the whole training set in memory. Also, using these tests it is possible to decide on-line on the total number of kernels, a problem which is difficult to solve analytically [14]. We implemented and tested our method on both unsupervised and supervised problems, and also on time series problems, with promising results. We are currently working on applying the algorithm to the intricate problem of mobile robot navigation. We believe that one has much to gain by using our approach in handling dynamic robot environments where the sensor distributions are noisy, usually non-Gaussian, and time-variant.

Acknowledgments

We would like to thank A. Likas, Univ. of Ioannina, Dept. of Computer Science, G. Efthivoulidis, NTUA, and D. Karlis, Athens Univ. of Economics & Business, Dept. of Statistics, for their suggestions and helpful comments during the preparation of this manuscript.

References

[1] Chou W.S., Chen Y.C.: A new fast algorithm for effective training of neural classifiers. Pattern Recognition 25(4) (1992) 423-429.

[2] Fambon O., Jutten C.: Pruning kernel density estimators. In: Verleysen M., ed., ESANN'95, European Symposium on Artificial Neural Networks. D facto publications, Brussels, Belgium, (Apr 1995).

[3] Fu K.: Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, New York, (1968).

[4] Furman W.D., Lindsay B.G.: Testing for the number of components in a mixture of normal distributions using moment estimators. Computational Statistics & Data Analysis 17 (1994) 473-492.

[5] Kaufman L., Rousseeuw P.J.: Finding Groups in Data. An Introduction to Cluster Analysis. Wiley, New York, (1990).

[6] McLachlan G.J.: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics 36 (1987) 318-324.

[7] Papoulis A.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 3rd edn., (1991).

[8] Parzen E.: On the estimation of a probability density function and mode. Annals of Mathematical Statistics 33 (1962) 1065-1076.

[9] Redner R.A., Walker H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2) (Apr 1984) 195-239.

[10] Ripley B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, U.K., (1996).

[11] Shimoji S.: Self-Organizing Neural Networks Based on Gaussian Mixture Model for PDF Estimation and Pattern Classification. Ph.D. thesis, University of Southern California, (1994).

[12] Specht D.F.: Probabilistic neural networks. Neural Networks 3 (1990) 109-118.

[13] Streit R.L., Luginbuhl T.E.: Maximum likelihood training of probabilistic neural networks. IEEE Transactions on Neural Networks 5(5) (1994) 764-783.

[14] Titterington D.M., Smith A.F., Makov U.E.: Statistical Analysis of Finite Mixture Distributions. Wiley, (1985).

[15] Traven H.G.C.: A neural network approach to statistical pattern classification by "semiparametric" estimation of probability density functions. IEEE Transactions on Neural Networks 2(3) (May 1991) 366-377.

[16] Vlassis N.A., Dimopoulos A., Papakonstantinou G.: The probabilistic growing cell structures algorithm. In: Proc. ICANN'97, 7th Int. Conf. on Artificial Neural Networks, Lausanne, Switzerland, (Oct 1997) 649-654.

[17] von Mises R.: Mathematical Theory of Probability and Statistics. Academic Press, New York, (1964).

[18] West M., Harrison J.: Bayesian Forecasting and Dynamic Models. Springer-Verlag, New York, 2nd edn., (1997).