
Validity of Fuzzy Clustering Using Entropy Regularization

Hichem Sahbi and Nozha Boujemaa
IMEDIA Research Group, INRIA-Rocquencourt, France
{Hichem.Sahbi,Nozha.Boujemaa}@inria.fr

Abstract— We introduce in this paper a new formulation of the regularized fuzzy C-means (FCM) algorithm which automatically finds the actual number of clusters. The approach is based on the minimization of an objective function which mixes, via a particular parameter, a classical FCM term and a new entropy regularizer. The main contribution of the method is the introduction of a new exponential form of the fuzzy memberships which ensures the consistency of their bounds and makes it possible to interpret the mixing parameter as the variance (or scale) of the clusters. This variance, closely related to the number of clusters, provides us with an intuitive and easy-to-set parameter. We discuss the proposed approach from the regularization point of view and demonstrate its validity both analytically and experimentally. We also show an extension of the method to non-linearly separable data. Finally, we illustrate preliminary results both on simple toy examples and on database categorization problems.

Index Terms— Fuzzy Clustering, Fuzzy C-Means, Regularization, Image Retrieval.

I. INTRODUCTION

Define a training set {x1, ..., xN} consisting, for instance, of images in database categorization or of colors in image segmentation. A clustering algorithm finds a function which assigns each training example xi to one class, resulting in a final partition of C subsets. Existing clustering methods can be categorized into hierarchical approaches [1] and those finding dynamic partitions such as EM (Expectation Maximization) [2], C-means [3] and self-organizing maps (SOMs) [4]. Sometimes the decision as to whether a training example belongs to one or another cluster can be fuzzy, and a family of algorithms dealing with fuzziness exists in the literature, for instance [3], [5], [6]. These methods have been used in different domains including gene expression [7], image segmentation [8] and database categorization [9]. For a survey on existing clustering methods and their applications see for instance [3], [10], [11].

One of the main issues in existing clustering methods remains setting the appropriate number of classes for a given problem. The well-known fuzzy C-means (FCM) algorithm [3] has proven to perform well when the application allows us to know a priori the number of clusters or when the user sets it manually. Of course, the estimation of this number is application-dependent; for instance, in image segmentation it can be set a priori to the number of targeted regions. Unfortunately, for some applications such as database categorization, it is not always possible to predict automatically, or even manually, the appropriate number of classes.

Several methods exist in the literature to automatically find the number of classes for clustering; among them competitive agglomeration (CA) [5] and, recently, new original approaches based on kernel methods [12]. The former attempts to set automatically the relevant number of classes by adding a validity criterion to the minimization problem. The underlying objective function usually involves an FCM term which measures the fidelity of the data to the clusters and a validity term which reduces the number of clusters, i.e., the complexity. Solving such a problem implies finding the membership degrees of each training example to the different clusters and assigning each example to the cluster which maximizes its membership. Nevertheless, the constraints on the bounds and the consistency of the membership degrees are difficult to enforce.

In this work, we introduce a new, simple formulation which guarantees the consistency of the membership degrees and provides a solid connection and interpretation in terms of regularization. The method uses a new exponential form of the fuzzy memberships which ensures the validity of their bounds and makes it possible to interpret the parameter which mixes the fidelity and the regularization terms as the variance (or scale) of the clusters. For some applications, it turns out that setting the cluster variance is more intuitive and easier than finding the number of classes, mainly for large datasets living in high dimensional spaces.

In the remainder of this paper, i stands for data indices while k and c stand for cluster indices. We refer to a cluster as a set of data gathered using a clustering algorithm, while a class (or category) is the actual membership of this data according to a well defined ground truth. Other notations and terminology will be introduced as we go along. The paper is organized as follows: in §II we review the basic formulation of the regularized FCM while in §III we introduce our entropy regularizer. In §IV, we discuss the consistency of our solution, the technical issues and limitations, and in §V we show how this method can be used for clustering non-homogeneous and non-linearly separable classes. In §VI we show the experimental validity of the method for database categorization problems. We conclude in §VII and provide some directions for future work.


II. A SHORT REMINDER ON REGULARIZED FCM

A variant of the regularized version of FCM [13] consists in the minimization problem:

J(\mu) = \sum_{k=1}^{C} \sum_{i=1}^{N} \mu_{ik} \, d_{ik}^2 + \alpha(t) \, R(\mu)    (1)

under the constraints that {µ_i1, ..., µ_iC} is the probability distribution of the memberships of x_i to the C clusters, and d_ik is the distance of x_i to the k-th cluster. The first term of the objective function (1), referred to as the FCM term, measures the fidelity of each training example to its cluster and vanishes when each example is a cluster by itself. The regularization term R(µ) measures the complexity or the spread of the clusters and reaches its minimum when all the training examples are assigned to only one big cluster. Among possible regularizers R(µ) we can find the quadratic and the Kullback-Leibler functions [5], [14]. The tradeoff between the fidelity and the regularization terms makes it possible to define the optimal number of clusters automatically for a given application.

The FCM and the regularization terms are mixed via a coefficient α which controls this tradeoff. In the existing formulations, this coefficient is proportional to the ratio between the FCM term and the regularizer R(µ) and decreases with respect to the iteration number t:

\alpha(t) \sim f_\tau(t) \times O\!\left( \frac{\sum_{k=1}^{C} \sum_{i=1}^{N} \mu_{ik} d_{ik}^2}{R(\mu)} \right)    (2)

where f_τ is a decreasing function which can be f_τ(t) = e^{-t/τ}. Initially, the algorithm selects a large value of α, so the objective function pays more attention to the regularization term, which makes it possible to decrease the number of clusters. As the iterations of the clustering process proceed, the decrease of α according to (2) ensures that the clustering process pays more attention to the fidelity term, so the centers are updated to minimize the distances to their training data.
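For concreteness, here is a minimal sketch (Python/NumPy) of how the schedule (2) could be evaluated at each iteration; the proportionality constant hidden in the O(·) factor, the choice f_τ(t) = e^{-t/τ} and all the names are our own assumptions, not the exact formulation of [13].

```python
import numpy as np

def annealed_alpha(mu, sq_dists, reg_value, t, tau=10.0, c0=1.0):
    """Sketch of Eq. (2): alpha(t) ~ f_tau(t) * O(FCM term / R(mu)).

    mu:        (N, C) membership degrees mu_ik
    sq_dists:  (N, C) squared distances d_ik^2
    reg_value: current value of the regularizer R(mu), assumed non-zero
    t:         iteration index
    """
    fcm_term = float(np.sum(mu * sq_dists))        # fidelity term of Eq. (1)
    f_tau = np.exp(-t / tau)                       # decreasing schedule f_tau(t) = e^{-t/tau}
    return c0 * f_tau * fcm_term / abs(reg_value)  # c0 stands in for the O(.) constant
```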

III. OUR ENTROPY REGULARIZATION

Let us consider the regularization term:

R(\mu) = -\frac{1}{N} \sum_{i=1}^{N} \Big( \underbrace{-\sum_{k=1}^{C} \mu_{ik} \log_2(\mu_{ik})}_{\text{the entropy term}} \Big)

If the memberships of all the training examples to m clusters (m < C) are similar, the latter overlap; R decreases as m increases and reaches its global minimum when m = C. On the other hand, when each training example is a cluster by itself, the distribution of the memberships will be peaked, so the entropies will vanish and R will reach its global maximum 0.

Let us consider a new definition of the membership degrees as {µ_ik = e^{-U_ik²}, U_ik ∈ R}, which by construction ensures the consistency of their bounds, i.e., µ_ik ∈ [0, 1]. If we plug these memberships into the FCM term and into the regularizer R(µ), we can show that the underlying constrained minimization problem becomes:

\min_{U} \; J(U) = \sum_{k=1}^{C} \sum_{i=1}^{N} e^{-U_{ik}^2} d_{ik}^2 \;-\; \alpha \frac{K}{N} \underbrace{\sum_{i=1}^{N} \sum_{k=1}^{C} e^{-U_{ik}^2} U_{ik}^2}_{\text{the new entropy term}}
\quad \text{s.t.} \quad \sum_{k=1}^{C} e^{-U_{ik}^2} = 1, \qquad i = 1, \dots, N    (3)

where K = log_10(e)/log_10(2) and U = {U_ik}. Using Lagrange theory [15], we introduce the Lagrange coefficients {λ_i}, so the minimization problem can be written as:

\min_{U, \lambda} \; L(U, \lambda) = \sum_{i,k=1}^{N,C} e^{-U_{ik}^2} d_{ik}^2 \;-\; \alpha \frac{K}{N} \sum_{i,k=1}^{N,C} U_{ik}^2 e^{-U_{ik}^2} \;+\; \sum_{i} \lambda_i \Big( \sum_{k} e^{-U_{ik}^2} - 1 \Big)    (4)

When the gradient of L(U, λ) with respect to {U_ik} and {λ_i} vanishes, we obtain respectively:

e^{-U_{ik}^2} = e^{-\frac{N}{K\alpha}(d_{ik}^2 + \lambda_i)} \, e^{-1}    (5)

and

\sum_{c} e^{-U_{ic}^2} = 1 = e^{-\frac{N}{K\alpha}\lambda_i} \, e^{-1} \sum_{c} e^{-\frac{N}{K\alpha} d_{ic}^2}    (6)

By replacing (6) in (5), the conditions for optimality lead to the very simple solution:

\mu_{ik} = e^{-U_{ik}^2} = \frac{e^{-\frac{N}{K\alpha} d_{ik}^2}}{\sum_{c} e^{-\frac{N}{K\alpha} d_{ic}^2}}    (7)

It is clear from the above expression that \sum_c µ_ic = 1 and µ_ik ∈ [0, 1], so for each training example x_i, (µ_i1, ..., µ_iC) is a probability distribution.
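In practice, Eq. (7) is a softmax over negatively scaled squared distances and can be evaluated in a numerically stable way. A minimal NumPy sketch (the max-shift and the function name are our own choices):

```python
import numpy as np

def memberships(sq_dists, alpha, K=np.log2(np.e)):
    """Compute mu_ik as in Eq. (7) from an (N, C) array of squared distances d_ik^2."""
    N = sq_dists.shape[0]
    logits = -(N / (K * alpha)) * sq_dists        # exponents of Eq. (7)
    logits -= logits.max(axis=1, keepdims=True)   # shift: does not change the ratio
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)     # rows sum to one, mu_ik in [0, 1]
```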


IV. CONSISTENCY AND INTERPRETATION

A. Regularization

It is easy to see that our solution (7) is consistent with the role of the coefficient α. When α → ∞, the limit of µ_ik goes to 1/C, the distribution of the memberships becomes uniform and the entropies take high values, so the regularization term R reaches its global minimum. It follows that the centers of the C clusters, defined as {c_k = (\sum_{i=1}^{N} µ_ik x_i) / (\sum_{i=1}^{N} µ_ik), k = 1, ..., C}, converge to only one center. Notice that overlapping clusters can be detected at the end of the algorithm using a simple criterion, such as the distance between their centers being below a given threshold ε (in practice, ε is set to 1% of the radius of the ball enclosing the data; it can also be set close to the precision of floating-point numbers), or by using more sophisticated tests such as the Kullback-Leibler divergence [16] in order to detect similar cluster distributions. Overlapping clusters are removed and replaced by only one cluster. The memberships are then updated using (7).

On the other hand, when α → 0, the effect of the regularization term vanishes, so each training example x_i will prefer the closest cluster, as shown by (8):

\lim_{\alpha \to 0} e^{-U_{ik}^2} = \lim_{\alpha \to 0} \frac{1}{1 + \sum_{c \neq k} e^{-\frac{N}{K\alpha}\{d_{ic}^2 - d_{ik}^2\}}} = \begin{cases} 1 & \text{if } d_{ik} = \min_c \{d_{ic}\} \\ 0 & \text{otherwise} \end{cases}    (8)

or each example will be a cluster by itself if the number of clusters C is equal to N.

B. Scaling

If the training examples are assumed to be Gaussian distributed, then it is possible to use the Mahalanobis distance, which takes into account the spread of the data. For σ² = αK/N, we can rewrite the membership coefficients (7) as:

\mu_{ik} = \frac{e^{-\frac{1}{\sigma^2}(x_i - c_k)^t \Sigma_k^{-1} (x_i - c_k)}}{\sum_{c} e^{-\frac{1}{\sigma^2}(x_i - c_c)^t \Sigma_c^{-1} (x_i - c_c)}}    (9)

Here c_k and Σ_k^{-1} denote respectively the center and the inverse of the covariance matrix of the k-th cluster. Now, σ (and also α) acts as a scaling factor (or variance); when it is underestimated, each c_k will be the center of a Gaussian which vanishes everywhere except at c_k, and each example will form a cluster by itself if C = N (cf. figure 1, top-left). On the contrary, when σ is overestimated, the Gaussian will be quasi-constant, so no difference will be noticed in the memberships of the training examples to the clusters, and this results in one big cluster (cf. figure 1, bottom-right).

Fig. 1. Location of the centers with respect to the value of the regularization parameter σ (panels for σ = 30, 40, 50, 100, 180 and 200). As the regularization parameter increases, the membership degrees become uniform and the regularization term decreases, so the centers overlap and the number of clusters decreases. N = 33 and the parameter C is set initially to 20.
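Putting §III and §IV-A together, the overall procedure amounts to alternating the closed-form memberships (7) with weighted-mean center updates, followed by a merging of overlapping centers. The sketch below (Python/NumPy) is our own minimal reading of that loop; the initialization on randomly sampled examples, the fixed iteration count and the merging details are assumptions, not the authors' exact implementation.

```python
import numpy as np

def entropy_fcm(X, C=20, alpha=1.0, eps_frac=0.01, n_iter=100, seed=0):
    """Sketch of the clustering loop implied by Eq. (7) and Sec. IV-A.

    X: (N, d) data matrix; C is deliberately set larger than the expected number of classes.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    K = np.log2(np.e)                                  # K = log10(e) / log10(2)
    centers = X[rng.choice(N, size=C, replace=False)]  # initialize on random examples

    def update_memberships(centers):
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # d_ik^2
        logits = -(N / (K * alpha)) * sq_dists                           # exponents of Eq. (7)
        logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
        mu = np.exp(logits)
        return mu / mu.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        mu = update_memberships(centers)
        centers = (mu.T @ X) / mu.sum(axis=0)[:, None]  # c_k = sum_i mu_ik x_i / sum_i mu_ik

    # merge overlapping clusters: keep centers farther apart than eps = 1% of the enclosing radius
    eps = eps_frac * np.linalg.norm(X - X.mean(axis=0), axis=1).max()
    keep = []
    for k in range(len(centers)):
        if all(np.linalg.norm(centers[k] - centers[j]) > eps for j in keep):
            keep.append(k)
    centers = centers[keep]
    return centers, update_memberships(centers)
```

For the toy problem of figure 1, one would call entropy_fcm(X, C=20) and vary α (equivalently σ² = αK/N) to reproduce the behaviour described above.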


Figure (2, top) shows the entropy and the fidelity terms with respect to the variance σ. As expected, a low variance implies a high entropy, so the resulting number of clusters is high (cf. figure 2, bottom-left). On the other hand, a high variance implies a low entropy, so the underlying number of clusters decreases. The best variance σ is of course application-dependent, but a trivial σ may exist for a clustering problem, mainly when the underlying classes are linearly separable. Nevertheless, when the classes are not linearly separable, embedding methods such as isomap [17] will be used (cf. §V).

Fig. 2. (Top) Variation of the entropy and the fidelity terms with respect to the variance σ; both are shown in log-base-10 scale. (Bottom-left) The decrease of the number of clusters with respect to the variance. (Bottom-right) The convergence process for σ = 10, 60, 100 and 240, shown through the localization error \sum_k \| c_k^{(n)} - c_k^{(n-1)} \|^2, where c_k^{(n)} is the k-th center estimated at the n-th iteration. These results are related to the clustering problem shown in figure (1).

C. Discussion

It is known that the initialization process in FCM requires C to be exactly the actual number of classes. In contrast, in our approach we take C far larger than the actual number of classes (but fixed) in order to guarantee that each class is captured by at least one center. This may result in several overlapping clusters, which can be eliminated at the convergence stage using a simple test (thresholding the distances between the centers) in order to detect and remove overlapping clusters and replace them with fewer ones (cf. §IV-A). Besides the previous remark, notice that our approach keeps both parameters C and α constant through the different steps of the clustering process, so the objective function is kept fixed and the only changing parameters are the membership degrees.

V. EXTENSION TO NON-LINEARLY SEPARABLE DATA

In this section, we show how to extend our clustering method to the topology of different manifolds. We use isomap [17], which allows us to embed a training set from an input space into an embedding space where the underlying classes become linearly separable. The whole objective is to make the classes linearly separable and to maximize the ratio between their inter- and intra-class variances (or scales). Thus, setting the scale in the embedding space becomes easy. We briefly recall isomap, and we show in the experiments that this embedding improves the performance of our clustering method.

A. Isomap

Consider a training set in a Euclidean space. Instead of using directly the Euclidean distance between these training examples, we consider the geodesic distance. In practice, an adjacency graph is defined where an arc exists between two training examples x_i and x_j if x_j belongs to the M nearest neighbors of x_i or if x_j lies inside a small ball around x_i (cf. figure 3, top right). The geodesic distance between two training examples is found by searching for the minimal path using the Dijkstra algorithm [18]. Following the formulation introduced in [17], each training example is embedded in the space spanned by the eigenvectors of the following matrix:

K_{\text{isomap}} = -\frac{1}{2} (I - e e^t) \, K \, (I - e e^t)    (10)

Here K is the Gram matrix of the training data, where an entry K_ij is equal to the value of the kernel function k(x_i, x_j) [19], which depends on the geodesic distance, and I, e denote respectively the identity and the unit matrices. Isomap finds a non-linear mapping which embeds non-linearly separable classes into a feature space where these classes become linearly separable. Afterward, we can use our clustering method after appropriately setting the scale parameter σ (cf. figure 3). Figure (4) shows examples of clustering spirals, rings and numbers using our clustering method after isomap embedding.

Besides the ability of isomap to handle the non-separability and the topology of the data, the underlying graph structure can be used in order to add specific knowledge. A user might add constraints on data connections; for instance, he can state that two training examples must (or must not) belong to the same cluster. The graph is then built using both the topology of the data and the propagation of the user's constraints on connections. When building the graph, it is important to ensure that no path exists between two examples declared as disconnected by the user.
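As a point of comparison, here is a compact sketch of a classical isomap embedding (Python with NumPy and SciPy). Note that Eq. (10) above centers a Gram matrix K built from a kernel of the geodesic distances; the sketch uses the standard choice of squared geodesic distances, which is an assumption on our part, as are the function name and the connectedness requirement.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embedding(X, M=3, n_components=2):
    """Minimal isomap sketch: M-nearest-neighbor graph, geodesic distances via
    Dijkstra, double centering as in Eq. (10), and a spectral embedding.
    Assumes the adjacency graph is connected."""
    N = X.shape[0]
    euclid = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # adjacency graph: an arc to each of the M nearest neighbors (then symmetrized)
    graph = np.full((N, N), np.inf)
    for i in range(N):
        nn = np.argsort(euclid[i])[1:M + 1]   # skip the point itself
        graph[i, nn] = euclid[i, nn]
    graph = np.minimum(graph, graph.T)        # inf entries denote missing edges

    # geodesic distances via the Dijkstra algorithm [18]
    geo = shortest_path(graph, method='D', directed=False)

    # double centering of the squared geodesic distances, cf. Eq. (10)
    J = np.eye(N) - np.ones((N, N)) / N
    K_iso = -0.5 * J @ (geo ** 2) @ J

    # embedding: leading eigenvectors scaled by the square roots of the eigenvalues
    vals, vecs = np.linalg.eigh(K_iso)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

A must-not-link constraint of the kind mentioned above would translate into editing the graph before the shortest-path computation, making sure that no path remains between the two examples.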

Fig. 3. From top-left to bottom-right: (1) original non-linearly separable data; (2) the underlying adjacency graph (the arity M is set to 3); (3) the isomap embedding, which makes the two classes linearly separable; (4) the result of the clustering process.

VI. EXPERIMENTS

A. Validity criteria

A clustering method can be objectively evaluated when a ground truth is available; otherwise the meaning of clustering can differ from one intelligent observer to another. Validity criteria are introduced in order to measure the quality of a clustering algorithm, i.e., its capacity to assign data to their actual classes. For a survey on these methods, see for example [20]. In the presence of a ground truth, we consider in our work a simple validity criterion based on the probability of misclassification. The latter occurs when either two examples belonging to two different classes are assigned to the same cluster, or two examples belonging to the same class are assigned to different clusters. We denote by X and Y two random variables standing respectively for the training examples and their possible classes {y_1, ..., y_C}, and X', Y' are respectively similar to X, Y. We denote by f(X) the index of the cluster of X in {y_1, ..., y_C}. Formally, we define the misclassification error as P(1_{\{f(X) = f(X')\}} \neq 1_{\{Y = Y'\}}) = (*), which we expand as:

(*) = P(f(X) \neq f(X') \mid Y = Y') \, P(Y = Y') + P(f(X) = f(X') \mid Y \neq Y') \, P(Y \neq Y')    (11)

Fig. 5. Variation of the scale with respect to the number of samples per class on the Columbia set. The scale is given as the expectation of the distance between pairs of images taken from 3, 5, 7 and 9 classes. The left-hand diagram is related to the laplacian color space, while the right-hand one is related to the isomap embedding space.
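For illustration, the probability in (11) can be estimated empirically by averaging, over all distinct pairs of examples, the indicator that "same cluster" and "same class" disagree. The helper below (our own construction, not part of the paper) does exactly that.

```python
import numpy as np

def misclassification_error(cluster_ids, class_ids):
    """Empirical estimate of Eq. (11): the probability that the events
    'assigned to the same cluster' and 'belonging to the same class'
    disagree for a random pair of training examples."""
    cluster_ids = np.asarray(cluster_ids)
    class_ids = np.asarray(class_ids)
    same_cluster = cluster_ids[:, None] == cluster_ids[None, :]
    same_class = class_ids[:, None] == class_ids[None, :]
    disagree = same_cluster != same_class
    iu = np.triu_indices(len(class_ids), k=1)   # distinct unordered pairs only
    return float(disagree[iu].mean())
```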

Fig. 6. Left: precision of clustering using our method on the Olivetti set. The solid (resp. dashed) line shows the probability of error when clustering is performed on the Fisher space (resp. after isomap embedding). Right: the same experiments on the Columbia set.

Fig. 4. Some clustering results on non-linearly separable data. After the application of our clustering algorithm, data belonging to different clusters are shown using different colors. The scale σ is set to the second eigenvalue of the eigenproblem (10).

B. Database categorization

Experiments have been conducted on both the Olivetti and Columbia databases in order to show both the good performance of our clustering method and the improvement brought by isomap. The Olivetti database contains 40 persons, each one represented by 10 faces. Each face is processed using histogram equalization and encoded using kernel Fisher discriminant analysis [21], resulting in 100 coefficients. The Columbia set contains 100 categories of objects, each one represented by 72 images. Each image is encoded using the laplacian color descriptor [22], resulting in a feature vector of 216 coefficients.

For database categorization it may be easier to predict the variance σ than the number of clusters. Predicting the variance σ can be achieved by manually sampling a few images from some categories, estimating the variance (or scale) of the underlying classes, then setting σ to the expectation of the variance over these classes. While this setting is not automatic, it has at least the advantage of introducing a priori knowledge of the variance of the data at the expense of little interaction and reasonable effort from the user. Furthermore, this is easier than trying to predict the actual number of classes, mainly for huge databases living in high dimensional spaces. Figure (5) shows that this heuristic provides a "good guess" of the scale on the Columbia set, as it is close to the optimal scale shown in figure (6, right). Again, figure (6) shows the misclassification error (11) with respect to the scale parameter σ. When comparing the results on the Olivetti and Columbia sets, we can see (cf. figure 6) that the optimal scale using isomap is smaller than the optimal scale using the original space, since isomap reduces the ratio between the intra- and inter-class variances. Furthermore, the error rate becomes smaller. Tables in (I) show the distribution of 20 categories from the Columbia set through the clusters after the application of our method (with and without isomap); for ease of visualization, we run our algorithm using only a subset of 20 randomly chosen categories. We can see that this distribution is concentrated on the diagonal, which clearly shows that most of the training data are assigned to their actual categories (cf. example in figure 7).

Fig. 7. Top: cluster prototypes resulting from our clustering method on the Columbia set, using (on the left) the laplacian color space and (on the right) isomap. Middle and bottom: images from two different clusters after the application of our algorithm, where the scale σ is set to 0.075 (resp. σ = 0.0055 using isomap). Images on the left-hand side are related to the laplacian color space while the others are related to isomap.
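As a rough illustration of this heuristic, σ can be estimated from a handful of manually sampled images per class as the expected pairwise distance within a class. The helper below is our own sketch; the array `features`, the vector `labels` and the sample size are hypothetical names, not the paper's interface.

```python
import numpy as np

def estimate_sigma(features, labels, n_per_class=5, seed=0):
    """Estimate the scale sigma as the expectation of the within-class pairwise
    distance, computed from a few sampled examples of each labelled class."""
    rng = np.random.default_rng(seed)
    dists = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        idx = rng.choice(idx, size=min(n_per_class, len(idx)), replace=False)
        pts = features[idx]
        for a in range(len(pts)):
            for b in range(a + 1, len(pts)):
                dists.append(np.linalg.norm(pts[a] - pts[b]))
    return float(np.mean(dists))
```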

TABLE I
Distribution of the categories through the clusters on the Columbia set using our clustering algorithm, (a) without and (b) with isomap embedding. The distribution is concentrated on the diagonal. These experiments are performed using the laplacian color space, where the scale parameter σ is set to 0.075 (resp. 0.0055 with isomap embedding). Errors in the cardinalities of the clusters are mentioned in red in the original tables.

(a) Without isomap embedding

category      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 | total
cluster 11:  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  4:   .  72   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   144
cluster 10:   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  5:   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  8:   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  2:   .   .   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster 17:   .   .   .   .   .   .   .  51   .   .   .   .   .   .   .   .   .   .   .   . |    51
cluster 19:   .   .   .   .   .   .   .  21   .   .   .   .   .   .   .   .   .   .   .   . |    21
cluster  9:   .   .   .   .   .   .   .   .  66   .   .   .  11   .   .   .   .   .   .   . |    77
cluster  1:   .   .   .   .   .   .   .   .   6  72   .   .   6  55   .   .   .  72   .   . |   211
cluster  6:   .   .   .   .   .   .   .   .   .   .  57   .   .   .   .   .   .   .   .   . |    57
cluster 14:   .   .   .   .   .   .   .   .   .   .   7  30   .   .   .   .   .   .   .   . |    37
cluster 20:   .   .   .   .   .   .   .   .   .   .   8  18   .   .   .   .   .   .   .   . |    26
cluster 18:   .   .   .   .   .   .   .   .   .   .   .  24   .   .   .   .   .   .   .   . |    24
cluster 16:   .   .   .   .   .   .   .   .   .   .   .   .  23   .   .   .   .   .   .   . |    23
cluster 13:   .   .   .   .   .   .   .   .   .   .   .   .  14   .   .   .   .   .   .   . |    14
cluster 15:   .   .   .   .   .   .   .   .   .   .   .   .  18   .   .   .   .   .   .   . |    18
cluster  3:   .   .   .   .   .   .   .   .   .   .   .   .   .  17   .  72   .   .   .   . |    89
cluster  0:   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .  72   .   .   . |   144
cluster  7:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   . |    72
cluster 12:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72 |    72
total        72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72 |

(b) With isomap embedding

category      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 | total
cluster 12:  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  4:   .  72   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   144
cluster 11:   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  6:   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  9:   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  2:   .   .   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster 15:   .   .   .   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster 10:   .   .   .   .   .   .   .   .  72   .   .   .  18   .   .   .   .   .   .   . |    90
cluster  1:   .   .   .   .   .   .   .   .   .  72   .   .   .   .   .   .   .  72   .   . |   144
cluster  7:   .   .   .   .   .   .   .   .   .   .  58   3   .   .   .   .   .   .   .   . |    61
cluster 17:   .   .   .   .   .   .   .   .   .   .   4  47   .   .   .   .   .   .   .   . |    51
cluster 16:   .   .   .   .   .   .   .   .   .   .   .   3   .   .   .   .   .   .   .   . |     3
cluster 18:   .   .   .   .   .   .   .   .   .   .   .   .   7   .   .   .   .   .   .   . |     7
cluster 20:   .   .   .   .   .   .   .   .   .   .  10  19   .   .   .   .   .   .   .   . |    29
cluster 14:   .   .   .   .   .   .   .   .   .   .   .   .  34   .   .   .   .   .   .   . |    34
cluster 19:   .   .   .   .   .   .   .   .   .   .   .   .  13   .   .   .   .   .   .   . |    13
cluster  5:   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   .   .   .   . |    72
cluster  0:   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   .   .   . |    72
cluster  3:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   .   . |    72
cluster 21:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   . |    72
cluster  8:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   . |    72
cluster 13:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72 |    72
total        72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72 |

VII. CONCLUSION AND FUTURE WORK

We introduced in this paper a new formulation of regularized FCM which is simple, ensures the consistency of the fuzzy membership bounds, and is easy to interpret in terms of regularization. We have also shown an extension of the method to non-linearly separable data. Several issues remain open: (1) the study of the effect of noise on the method and of the convergence of the algorithm; (2) the use of transductive learning, which requires the interaction of the user in order to obtain a priori knowledge of the cluster manifolds; (3) a new formulation of the objective function (3) which takes into account data with large variations in scale. Intuitively, if we consider, in the objective function (3), different mixing parameters α for different classes, then it might be possible to capture clusters with large variations in scale.

REFERENCES

[1] P. Sneath, R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, W. H. Freeman, San Francisco, 1973.
[2] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Ser. B 39 (1) (1977) 1–38.
[3] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic, Norwell, 1981.
[4] E. Tsao, J. Bezdek, N. Pal, Fuzzy Kohonen clustering networks, Pattern Recognition 27 (5) (1994) 757–764.
[5] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (5) (1999) 450–465.
[6] R. Dave, Characterization and detection of noise in clustering, Pattern Recognition Letters 12 (11) (1991) 657–664.
[7] C. Orengo, D. Jones, J. Thornton, Bioinformatics: Genes, Proteins and Computers, BIOS Scientific Publishers, 2003.
[8] C. Carson, S. Belongie, H. Greenspan, J. Malik, Blobworld: Image segmentation using expectation-maximization and its application to image querying, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (8) (2002) 1026–1038.
[9] B. Le Saux, N. Boujemaa, Unsupervised categorization for image database overview, in: International Conference on Visual Information Systems, 2002, pp. 163–174.
[10] A. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[11] M. Yang, A survey of fuzzy clustering, Mathematical and Computer Modelling 18 (11) (1993) 1–16.
[12] A. Ben-Hur, D. Horn, H. T. Siegelmann, V. Vapnik, Support vector clustering, in: Neural Information Processing Systems, 2000, pp. 367–373.
[13] N. Boujemaa, Generalized competitive clustering for image segmentation, in: 19th International Meeting of the North American Fuzzy Information Processing Society (NAFIPS), 2000.
[14] H. Ichihashi, K. Honda, N. Tani, Gaussian mixture PDF approximation and fuzzy c-means clustering with entropy regularization, in: 4th Asian Fuzzy System Symposium, 2000, pp. 217–221.
[15] R. Fletcher, Practical Methods of Optimization, Vol. 1, John Wiley & Sons, New York, 1980.
[16] T. Cover, J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[17] J. Tenenbaum, V. de Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[18] E. W. Dijkstra, A note on two problems in connexion with graphs, Numerische Mathematik 1 (1959) 269–271.
[19] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[20] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Cluster validity methods: Part I and II, SIGMOD Record, 2002.
[21] Q. Liu, R. Huang, H. Lu, S. Ma, Face recognition using kernel based Fisher discriminant analysis, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2002, pp. 197–201.
[22] C. Vertan, N. Boujemaa, Upgrading color distributions for image retrieval: can we do better?, in: International Conference on Visual Information Systems, 2000, pp. 178–188.