Proc. 7th Int. Conf. on Artificial Neural Networks, Lausanne, Switzerland, Oct 1997

The Probabilistic Growing Cell Structures Algorithm

Nikos A. Vlassis, Apostolos Dimopoulos, George Papakonstantinou
Dept. of Electrical and Computer Engineering, National Technical University of Athens, Greece
email: [email protected]

Abstract. The growing cell structures (GCS) algorithm is an adaptive k-means clustering algorithm in which new clusters are added dynamically to produce a Dirichlet tessellation of the input space. In this paper we extend the non-parametric model of the GCS into a probabilistic one, assuming that the samples in each cluster are distributed according to a multivariate normal probability density function. We show that by recursively estimating the means and the variances of the clusters, and by introducing a new criterion for the insertion and deletion of a cluster, our approach can be more powerful than the original GCS algorithm. We demonstrate our results within the mobile robots paradigm.

1 Introduction

The growing cell structures (GCS) algorithm [1] is an adaptive k-means algorithm [3] that performs data clustering. Based on the earlier work of Kohonen [4], GCS is a self-organizing neural network model [2] that incrementally builds a Dirichlet or Voronoi tessellation of the input space, while it is able to automatically find its structure and size (see [7] and [5] for a general description of the problem of pattern recognition with neural networks). In this paper, we extend the non-parametric model of the GCS into a probabilistic one [6]. We assume that samples come from each cluster according to a Gaussian probability density function, whose parameters, the means and the variance, we estimate statistically from the input set. In contrast to many traditional probabilistic techniques, such as the ISODATA algorithm [3], which need to maintain a large portion of the training set in memory, our approach uses recursive formulas for the estimation of the means and variances of the clusters. Moreover, due to its cluster insertion-deletion criterion, the original GCS algorithm cannot easily handle correlated inputs, e.g., successive inputs that are close in the input space. Our algorithm ameliorates this deficiency of GCS by introducing a combined criterion for cluster insertion-deletion that takes into consideration the a priori probability of a cluster together with its variance. We show the virtues of our algorithm when applied to a real-world problem: estimating the Voronoi centers of a mobile robot's configuration space [9].

2 The Probabilistic Model

Consider an input space $V \subset \mathbb{R}^p$, and a sequence of $n$ $p$-dimensional samples $\{x_1, x_2, \ldots, x_n\}$ from $V$. Our task is to group all $x_i$ into $K$ clusters, so that each cluster contains similar points, i.e., points that are near in $V$ by some metric distance. We also want $K$ to change dynamically. Our probabilistic model assumes that in each cluster $k$ the samples are distributed according to a known probability density function (or density) $p_k(x; \theta_k)$, for some unknown vector of parameters $\theta_k$. Also, we assume that each cluster $k$ has prior probability $\pi_k$, implying an a priori preference for $k$ over the other clusters. Then the membership probability that a new sample $x_i$ is assigned to cluster $k$ is $p_k(x_i, k; \theta_k) = \pi_k p_k(x_i; \theta_k)$, and the posterior (normalized) probability [7] (p. 19) is

$$p(k|x_i) = \frac{\pi_k p_k(x_i; \theta_k)}{\sum_{j=1}^{K} \pi_j p_j(x_i; \theta_j)}. \qquad (1)$$

The denominator in the above formula is the total input mixture density $p(x) = \sum_{j=1}^{K} \pi_j p_j(x; \theta_j)$. The Bayes decision rule [7] (p. 19) assigns a future sample $x_i$ to cluster $k$ if

$$p(k|x_i) = \max_j \{p(j|x_i)\}, \qquad j = 1, \ldots, K. \qquad (2)$$
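To make eqs. 1 and 2 concrete, the short Python sketch below computes the posterior membership probabilities and the resulting Bayes assignment for a toy two-cluster example. The isotropic Gaussian component density it uses anticipates eq. 3 below; all function names and numerical values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def component_density(x, mu, sigma2):
    """Isotropic Gaussian density N(mu, sigma2 * I); anticipates eq. 3."""
    p = x.shape[0]
    diff = x - mu
    return (2 * np.pi * sigma2) ** (-p / 2) * np.exp(-0.5 * diff @ diff / sigma2)

def posteriors(x, mus, sigma2s, priors):
    """Eq. 1: posterior probabilities p(k|x) for all K clusters."""
    joint = np.array([pi * component_density(x, mu, s2)
                      for mu, s2, pi in zip(mus, sigma2s, priors)])
    return joint / joint.sum()      # the denominator is the mixture density p(x)

# Toy example with K = 2 clusters in the plane (illustrative values only).
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigma2s = [1.0, 0.5]
priors = [0.6, 0.4]

x = np.array([2.5, 2.0])
post = posteriors(x, mus, sigma2s, priors)
winner = int(np.argmax(post))       # Bayes decision rule, eq. 2
print(post, winner)
```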

From the above equations it follows that a complete clustering schema requires the estimation of both the prior cluster probabilities $\pi_j$ and the parameter vectors $\theta_j$ of all clusters $j$, with $j = 1, \ldots, K$. More specifically now, we assume that samples from cluster $k$ follow a $p$-variate normal density $N_p\{\mu_k, \Sigma_k\}$, i.e.,

$$p_k(x; \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} e^{-\frac{1}{2}(x-\mu_k)\Sigma_k^{-1}(x-\mu_k)^T}, \qquad (3)$$

parametrized over the means $\mu_k$ and covariance matrix $\Sigma_k$ of the cluster (see [6] p. 188, 197), while $|\Sigma_k|$ denotes the determinant of $\Sigma_k$. Furthermore, we assume that $\Sigma_k$ is of the type $\sigma_k^2 I$, with $I$ the identity matrix. This simplification can be quite reasonable under certain conditions (see, e.g., [7] p. 289, 296, and [8]), while it helps keep the computational cost low. Under this model, a cluster can be regarded as a hyper-sphere centered around $\mu_k$, while $\sigma_k^2$ gives a measure of its size. Assuming normal density for the clusters, the problem is now to estimate from the sequence of input samples $\{x_i\}$, $i = 1, \ldots, n$, the parameters $\mu_k$ and $\sigma_k^2$, and the prior probability $\pi_k$ of each cluster. By using Maximum Likelihood (ML) estimation (see [7] p. 334), Tråvén [8] proves that for $p$-variate normal distributions the prior probability $\pi_k$ can be estimated as

$$\pi_k^{(n)} = \frac{1}{n} \sum_{i=1}^{n} p(k|x_i), \qquad (4)$$

where $p(k|x_i)$ is the posterior probability that a sample $x_i$ belongs to cluster $k$, and $(\cdot)^{(n)}$ denotes the value after the $n$-th input $x_n$ has arrived. In addition, recursive expressions for the estimation of $\mu_k$ and $\sigma_k^2$ can be formulated as

$$\mu_k^{(n+1)} = \mu_k^{(n)} + \eta_k^{(n+1)} (x_{n+1} - \mu_k^{(n)}), \qquad (5)$$

$$\sigma_k^{2(n+1)} = \sigma_k^{2(n)} + \eta_k^{(n+1)} \left[ (x_{n+1} - \mu_k^{(n)})(x_{n+1} - \mu_k^{(n)})^T - \sigma_k^{2(n)} \right]. \qquad (6)$$

In the above equations $\eta_k^{(n+1)}$ can be approximated by $p(k|x_{n+1})/[(n+1)\pi_k^{(n)}]$. Then the recursive formula becomes similar to the learning rule used by the family of Kohonen's algorithms.¹ Now we restrict ourselves to the $l$ most recent samples $\{x_{n-l}, \ldots, x_n\}$. Then $\eta_k^{(n+1)}$ reads

$$\eta_k^{(n+1)} = \frac{p(k|x_{n+1})}{l \, \pi_k^{(n)}}. \qquad (7)$$

This restriction implies a 'forgetting' schema, in which old samples affect the learning process less than recent ones. Extending the calculations of [8] a little further, a recursive formula similar to the above formulas for $\mu_k$ and $\sigma_k^2$ can be formulated for the prior probability $\pi_k$ as

$$\pi_k^{(n+1)} = (1 - 1/l)\,\pi_k^{(n)} + \frac{p(k|x_i)}{l}, \quad \text{or} \quad \pi_k^{(n+1)} = \pi_k^{(n)} + \frac{1}{l}\left[p(k|x_i) - \pi_k^{(n)}\right]. \qquad (8)$$

¹ If we assume that clusters are equiprobable and fully separated in the input space, and a new sample is assigned to its nearest cluster with probability 1, then the recursive formula for the means is exactly the SOM learning rule.


This formula shows that the prior probability $\pi_k$ 'moves' towards $p(k|x)$ in each step, and can be viewed as an estimation over the $l$ most recent samples. Finally, if clusters are assumed a priori equiprobable ($\pi_i = \pi_j, \forall i \neq j$) and fully separated in the input space, and a new sample $x_i$ is assigned to its nearest cluster $k$ with probability 1 (i.e., $p(k|x_i) = 1$ if $\|x_i - \mu_k\| = \min$, which implies a uniform density in each cluster), then, setting $1 - 1/l = \alpha$, $n = l$, and using eq. 4, eq. 8 becomes identical to the signal counter forgetting schema² of [1].

² $\Delta\tau = -\alpha\tau$.
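Before moving to the algorithm itself, the sketch below applies the recursive estimators of eqs. 5-8 to a single cluster for one new sample. The function name, the window length l, and the initial values are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def update_cluster(mu, sigma2, prior, x_new, post, l):
    """One recursive update of a single cluster (eqs. 5-8).

    mu, sigma2, prior : current estimates of mu_k, sigma_k^2, pi_k
    x_new             : the new sample x_{n+1}
    post              : posterior p(k | x_{n+1}) of this cluster
    l                 : length of the 'forgetting' window
    """
    eta = post / (l * prior)                            # eq. 7
    diff = x_new - mu
    mu_new = mu + eta * diff                            # eq. 5
    sigma2_new = sigma2 + eta * (diff @ diff - sigma2)  # eq. 6
    prior_new = prior + (post - prior) / l              # eq. 8
    return mu_new, sigma2_new, prior_new

# Illustrative usage with a window of l = 50 samples.
mu, sigma2, prior = np.array([0.0, 0.0]), 1.0, 0.5
mu, sigma2, prior = update_cluster(mu, sigma2, prior,
                                   x_new=np.array([0.4, -0.2]),
                                   post=0.9, l=50)
print(mu, sigma2, prior)
```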

3 The PGCS Algorithm

Based on the previous framework, we describe PGCS, a probabilistic version of the original GCS algorithm [1]. We show how the results for GCS can be generalized within our probabilistic framework. Additionally, we propose some extensions to the cluster insertion-deletion criterion in order to handle cases of correlated inputs.

3.1 Initialization. The first sample $x_1$ becomes the center (means) $\mu_1$ of the first cluster, whereas $\sigma_1^2$ is set to 0. The prior probability $\pi_1$ of this cluster is set to 1.

3.2 Adaptation. According to eq. 2, a new input $x_i$ is assigned to the cluster $k$ with the maximum $p(k|x_i)$. This cluster is called the winning cluster. Applying the logarithm to eq. 3, this corresponds to minimizing

$$\delta(x_i, \mu_k)^2 + 2 \log\left[(2\pi)^{p/2}\sigma_k^p\right] - 2\log\pi_k,$$

where $\delta(x_i, \mu_k) = \left[\frac{1}{\sigma_k^2}(x_i - \mu_k)(x_i - \mu_k)^T\right]^{1/2}$ is the Mahalanobis distance from $x_i$ to the center of cluster $k$.
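A minimal sketch of this winner-selection step under the isotropic Gaussian model is given below; minimizing the score computed here is equivalent to maximizing $\pi_k p_k(x_i)$. The function name and the parallel-list data layout are our own illustrative choices.

```python
import numpy as np

def winning_cluster(x, mus, sigma2s, priors):
    """Select the winning cluster by minimizing
    delta^2 + p*log(2*pi*sigma_k^2) - 2*log(pi_k),
    which equals the criterion of section 3.2, since
    2*log((2*pi)^(p/2) * sigma_k^p) = p*log(2*pi*sigma_k^2)."""
    p = x.shape[0]
    scores = []
    for mu, s2, pi in zip(mus, sigma2s, priors):
        diff = x - mu
        mahalanobis2 = diff @ diff / s2          # delta(x, mu_k)^2
        scores.append(mahalanobis2 + p * np.log(2 * np.pi * s2) - 2 * np.log(pi))
    return int(np.argmin(scores))

# Illustrative call, reusing the toy clusters from the earlier sketch.
k = winning_cluster(np.array([2.5, 2.0]),
                    [np.array([0.0, 0.0]), np.array([3.0, 3.0])],
                    [1.0, 0.5], [0.6, 0.4])
print(k)
```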

The parameters $\mu_k$ and $\sigma_k^2$ of cluster $k$ change according to eq. 5, 6, and 7.³

³ Since no topological ordering in a space of lower dimension than $p$ is actually entailed, there is no real need to change the parameters $\mu$ and $\sigma^2$ of the neighboring clusters.

3.3 Updating. The prior probabilities of all clusters are updated according to eq. 8. Note that since $p(k|x_i)$ is larger for clusters lying in the neighborhood of the winning cluster, the prior probability of these clusters is affected more than that of their farther counterparts.

3.4 Cluster insertion. After a fixed number of steps, and if a pre-defined maximum number of clusters is not exceeded, we find the cluster $q$ that maximizes the quantity

$$\phi(q) = \alpha_\pi \pi_q + \alpha_\sigma \frac{\sigma_q^2}{\sum_{j=1}^{K} \sigma_j^2}, \qquad (9)$$

with $\alpha_\pi, \alpha_\sigma$ appropriate coefficients that are problem-dependent. We create a new cluster $r$ between $q$ and its direct neighbor $f$ with the maximum $\phi(f)$. We estimate the means and variance of the new cluster as

$$\mu_r = \frac{\phi(q)\mu_q + \phi(f)\mu_f}{\phi(q) + \phi(f)} \quad \text{and} \quad \sigma_r^2 = \frac{\phi(q)\sigma_q^2 + \phi(f)\sigma_f^2}{\phi(q) + \phi(f)}.$$

Clusters $q$ and $f$ change their variances appropriately as

$$\sigma_q^{2(new)} = \left(1 - \frac{\phi(q)}{\phi(q) + \phi(f)}\right)\sigma_q^2 \quad \text{and} \quad \sigma_f^{2(new)} = \left(1 - \frac{\phi(f)}{\phi(q) + \phi(f)}\right)\sigma_f^2,$$

where $(\cdot)^{(new)}$ denotes the new value.

For estimating the prior probability of clusters $r$, $q$, and $f$, we stick to the rule of GCS: every cluster should potentially have the same probability of being the winning cluster for a new random sample. From eq. 1 and 3 it follows that the posterior probability of a cluster $j$ lying in the vicinity of $r$, estimated on the means, is proportional to $\pi_j/\sigma_j^2$. Making the posterior probabilities before and after the insertion equal for the clusters $q$ and $f$ yields

$$\frac{\pi_{q,f}^{(new)}}{\sigma_{q,f}^{2(new)}} = \frac{\pi_{q,f}}{\sigma_{q,f}^2} \quad \text{or} \quad \Delta\pi_{q,f} = \frac{\sigma_{q,f}^{2(new)} - \sigma_{q,f}^2}{\sigma_{q,f}^2}\,\pi_{q,f},$$

where $\Delta\pi = \pi^{(new)} - \pi$. From this formula we compute the $\pi^{(new)}$ of clusters $q$ and $f$. Preserving the total prior probability locally before and after the insertion leads to the new prior probability of $r$,

$$\pi_r^{(new)} = -(\Delta\pi_q + \Delta\pi_f).$$

The above two equations are equivalent to eq. 10 and 11 of [1].

3.5 Cluster deletion. After a fixed number of steps we remove the cluster $c$ with the lowest $\phi(c)$.
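A rough sketch of the insertion step of section 3.4 follows. The parallel-list data layout and, in particular, the choice of the direct neighbor f (approximated here by the remaining cluster with the largest φ, since the topological bookkeeping of GCS is not spelled out in this paper) are simplifying assumptions; the default coefficient values 0.2 and 0.8 are those used in the experiments of section 4.

```python
import numpy as np

def phi(priors, sigma2s, a_pi, a_sigma):
    """Eq. 9: phi(q) = a_pi * pi_q + a_sigma * sigma_q^2 / sum_j sigma_j^2."""
    s2 = np.asarray(sigma2s)
    return a_pi * np.asarray(priors) + a_sigma * s2 / s2.sum()

def insert_cluster(mus, sigma2s, priors, a_pi=0.2, a_sigma=0.8):
    """Split around the cluster q with maximum phi (assumes K >= 2).
    The 'direct neighbor' f is approximated by the remaining cluster
    with the largest phi."""
    scores = phi(priors, sigma2s, a_pi, a_sigma)
    q = int(np.argmax(scores))
    f = int(np.argmax(np.where(np.arange(len(scores)) == q, -np.inf, scores)))
    w = scores[q] + scores[f]

    # New cluster r: phi-weighted average of q and f (section 3.4).
    mu_r = (scores[q] * mus[q] + scores[f] * mus[f]) / w
    s2_r = (scores[q] * sigma2s[q] + scores[f] * sigma2s[f]) / w

    # Shrink the variances of q and f and change their priors so that
    # pi / sigma^2 is preserved; r takes up the released probability mass.
    new_priors = list(priors)
    released = 0.0
    for j in (q, f):
        s2_new = (1.0 - scores[j] / w) * sigma2s[j]
        d_pi = (s2_new - sigma2s[j]) / sigma2s[j] * priors[j]
        new_priors[j] += d_pi
        sigma2s[j] = s2_new
        released += d_pi
    new_priors.append(-released)        # pi_r^(new) = -(d_pi_q + d_pi_f)

    mus.append(mu_r)
    sigma2s.append(s2_r)
    return mus, sigma2s, new_priors
```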

Figure 1: Applying the GCS (a) and our extended model, the probabilistic GCS (b), to a robot configuration.

4 Discussion-Results

The insertion or deletion of a new cluster in the original GCS algorithm happens at the region of $V$ with the highest or lowest prior probability, respectively. This schema is adequate if samples come randomly from the input space $V$, but seems rather ineffective when successive inputs happen to be correlated somehow, e.g., by being not very far apart in $V$. In our PGCS algorithm, the criterion for creating a new cluster takes into consideration both the prior probability of a cluster and its variance. Thus, clusters with small variance, i.e., those that are concentrated around their means, are not likely to split, even if many inputs are assigned to them, and vice versa. The motivation for this work was [9], in which the estimation of the Voronoi centers of a robot's configuration space necessitated an algorithm that could handle successive points in $\mathbb{R}^2$ denoting the robot's $(x, y)$ position. There, the GCS model seemed incapable of performing a good clustering of the input space. In Fig. 1 we show the performance of GCS and PGCS for a typical robot configuration, where the robot, starting from the upper-left corner, was to explore an unknown environment (cluster centers are located in the free space, whereas the rest is obstacles). We ran both algorithms with the same set of parameters, while the coefficients $\alpha_\pi$ and $\alpha_\sigma$ of PGCS had the values 0.2 and 0.8, respectively.


5 Conclusions

We presented PGCS, a probabilistic version of the growing cell structures [1] algorithm for data clustering. We outline below some of the main contributions of this work.

• We assume that input samples follow Gaussian distributions, whereas GCS assumes uniform densities within each cluster. Our approach is more realistic, since in many real-world problems the inputs are inherently Gaussian or can approximately be modeled by Gaussian distributions. Moreover, by simplifying the type of the covariance matrices we manage to keep the computational cost low.

• Our cluster insertion-deletion criterion takes into consideration both the prior probabilities of the clusters and their variances, whereas GCS considers only the former. This is very helpful in cases where inputs are not selected at random from the input space, but are rather correlated.

• We estimate the means and the variances of the clusters recursively, in contrast to other clustering algorithms that need to maintain a large number of input samples in memory for this purpose.

References

[1] Fritzke B.: Growing cell structures—a self-organizing network for unsupervised and supervised learning. Neural Networks 7(9) (1994) 1441–1460.
[2] Haykin S.: Neural Networks. Macmillan College Publishing Company, New York (1994).
[3] Kaufman L., Rousseeuw P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990).
[4] Kohonen T.: The self-organizing map. Proceedings of the IEEE 78(9) (Sep 1990) 1464–1480.
[5] Lippmann R.P.: Pattern classification using neural networks. IEEE Communications Magazine (Nov 1989) 47–64.
[6] Papoulis A.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 3rd edn. (1991).
[7] Ripley B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, U.K. (1996).
[8] Tråvén H.G.C.: A neural network approach to statistical pattern classification by "semiparametric" estimation of probability density functions. IEEE Trans. on Neural Networks 2(3) (May 1991) 366–377.
[9] Vlassis N., Papakonstantinou G., Tsanakas P.: Learning the Voronoi centers of a mobile robot's configuration space. In: Proc. 3rd ECPD Int. Conf. on Adv. Robotics, Intel. Automation and Act. Systems, Bremen, Germany (Sep 1997) 339–343.
