Fuzzy Clustering: Consistency of Entropy Regularization∗

Hichem Sahbi and Nozha Boujemaa
Imedia Research Group, Inria Rocquencourt
BP 105, 78153 Le Chesnay, France
{Hichem.Sahbi,Nozha.Boujemaa}@inria.fr

Abstract

We introduce in this paper a new formulation of the regularized fuzzy C-means (FCM) algorithm which allows us to find automatically the actual number of clusters. The approach is based on the minimization of an objective function which mixes, via a particular parameter, a classical FCM term and a new entropy regularizer. The main contribution of the method is the introduction of a new exponential form of the fuzzy memberships which ensures the consistency of their bounds and makes it possible to interpret the mixing parameter as the variance (or scale) of the clusters. This variance, closely related to the number of clusters, provides an intuitive and easy-to-set parameter. We discuss the proposed approach from the regularization point of view and demonstrate its validity both analytically and experimentally. We also show an extension of the method to non-linearly separable data. Finally, we illustrate preliminary results on simple toy examples as well as on image segmentation and database categorization problems.

∗ Accepted at the International Conference on Computational Intelligence (Special Session on Fuzzy Clustering), Dortmund, Germany, September 2004.

Keywords: Fuzzy Clustering, Regularization, Fuzzy C-Means, Image Segmentation, Image Retrieval.

1 Introduction

Define a training set {x_1, ..., x_N} consisting, for instance, of images in database categorization or colors in image segmentation. A clustering algorithm finds a function which assigns each training example x_i to one class, resulting in a final partition of C subsets. Basically, a simple hierarchical agglomeration algorithm can do this task [SS73, Fra98, Pos01], but sometimes the decision as to whether a training example belongs to one cluster or another can be fuzzy, and a family of algorithms dealing with fuzziness exists in the literature, for instance [Bez81, FK99, Dav91]. Existing clustering methods can be categorized into hierarchical approaches [Fra98, Pos01] and those finding dynamic partitions, such as EM (Expectation Maximization) [DLR77], C-means [Bez81] and self-organizing maps (SOMs) [TBP94]. These methods have been used in different domains including gene expression [OJT03], image segmentation [CBGM02] and database categorization [SB02]. For a survey on existing clustering methods see for instance [Bez81, JD88, Yan93].

One of the main issues in existing clustering methods remains setting the appropriate number of classes for a given problem. The well-known fuzzy C-means (FCM) algorithm [Bez81] has proven to perform well when the application allows us to know a priori the number of clusters or when the user sets it manually. Of course, the estimation of this number is application-dependent; for instance, in image segmentation it can be set a priori to the number of targeted regions. Unfortunately, for some applications such as database categorization, it is not always possible to predict automatically, or even manually, the appropriate number of classes. Several methods exist in the literature in order to automatically find the number of classes for clustering; among them competitive agglomeration (CA) [FK99] and, recently, new original approaches based on kernel methods [BHHSV00].

The former attempts to set automatically the relevant number of classes by adding a validity criterion to the minimization problem. The underlying objective function usually involves an FCM term which measures the fidelity of the data to the clusters and a validity term which reduces the number of clusters, i.e., the complexity. Solving such a problem implies finding the membership degrees of each training example to the different clusters and assigning the former to the cluster which maximizes its membership. Nevertheless, the constraints on the bounds and the consistency of the membership degrees are difficult to enforce. In this work, we introduce a new simple formulation which guarantees the consistency of the membership degrees and provides a solid connection and interpretation in terms of regularization. The method uses a new exponential form of the fuzzy memberships which ensures the validity of their bounds and makes it possible to interpret the parameter which mixes the fidelity and the regularization terms as the variance (or scale) of the clusters. For some applications, it turns out that setting the cluster variance is more intuitive and easier than finding the number of classes, mainly for large datasets living in high-dimensional spaces.

In the remainder of this paper, i stands for data indices while k and c stand for cluster indices. We refer to a cluster as a set of data gathered using a clustering algorithm, while a class (or category) is the actual membership of this data according to a well-defined ground truth. Other notations and terminology will be introduced as we go along. The paper is organized as follows: in §2 we review the basic formulation of the regularized FCM, while in §3 we introduce our entropy regularizer. In §4, we discuss the consistency of our solution, the technical issues and limitations, and in §5 we show how this method can be used for clustering non-homogeneous and non-linearly separable classes. In §6 we show the experimental validity of the method for image segmentation and database categorization problems. We conclude in §7 and provide some directions for future work.

2 A short reminder on regularized FCM

A variant of the regularized version of FCM [Bou00] consists in the minimization problem:

J(\mu) = \sum_{k=1}^{C} \sum_{i=1}^{N} \mu_{ik}\, d_{ik}^{2} \;+\; \alpha(t)\, R(\mu)    (1)

under the constraints that {μ_{i1}, ..., μ_{iC}} is the probability distribution of the memberships of x_i to the C clusters, and d_{ik} is the distance of x_i to the k-th cluster. The first term of the objective function (1), referred to as the FCM term, measures the fidelity of each training example to its cluster and vanishes when each example is a cluster by itself. The regularization term R(μ) measures the complexity or the spread of the clusters and reaches its minimum when all the training examples are assigned to only one big cluster. Among possible regularizers R(μ) we can find the quadratic and the Kullback-Leibler functions [FK99, IHT00]. The tradeoff between the fidelity and the regularization terms makes it possible to define the optimal number of clusters automatically for a given application. The FCM and the regularization terms are mixed via a coefficient α which controls this tradeoff. In the different existing formulations, this coefficient is proportional to the ratio between the FCM term and the regularizer R(μ), and decreases with respect to the iteration number t:

\alpha(t) \;\sim\; f_{\tau}(t) \times O\!\left( \frac{\sum_{k=1}^{C} \sum_{i=1}^{N} \mu_{ik}\, d_{ik}^{2}}{R(\mu)} \right)    (2)

where f_τ is a decreasing function which can be, for example, f_τ(t) = e^{-t/τ}. Initially, the algorithm selects a large value of α, so the objective function pays more attention to the regularization term, which makes it possible to decrease the number of clusters. As the iterations of the clustering process proceed, the decrease of α according to (2) ensures that the clustering process pays more attention to the fidelity term, so the centers are updated to minimize the distances to their training data.
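For illustration, the annealing schedule of (2) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the proportionality constant eta and the time constant tau are illustrative assumptions, and the absolute value only guards against the sign convention of the chosen regularizer.

```python
import numpy as np

def annealing_alpha(memberships, sq_dists, reg_value, t, tau=10.0, eta=1.0):
    """Sketch of the schedule in Eq. (2): alpha(t) is proportional to the ratio between
    the current FCM term and the regularizer, damped by f_tau(t) = exp(-t / tau).
    `eta` and `tau` are hypothetical constants, not values taken from the paper."""
    fcm_term = np.sum(memberships * sq_dists)   # sum_k sum_i mu_ik * d_ik^2
    return eta * np.exp(-t / tau) * fcm_term / np.abs(reg_value)
```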

3 Our entropy regularization

Let us consider the regularization term:

R(\mu) \;=\; -\frac{1}{N} \sum_{i=1}^{N} \underbrace{\left( - \sum_{k=1}^{C} \mu_{ik} \log_{2}(\mu_{ik}) \right)}_{\text{the entropy term}}

If the memberships of all the training examples to m clusters (m < C) are similar, the latter overlap, R decreases as m increases and reaches its global minimum when m = C. On the other hand, when each training example is a cluster by itself, the distribution of the memberships will be peaked, so the entropies will vanish and R will reach its global maximum 0. Let us consider a new definition of the membership degrees as {μ_{ik} = e^{-U_{ik}^2}, U_{ik} ∈ R}, which by construction ensures the consistency of their bounds, i.e., μ_{ik} ∈ [0, 1]. If we plug these memberships into the FCM term and into the regularizer R(μ), we can show that the underlying constrained minimization problem becomes:

Minimize  J(U) \;=\; \sum_{k=1}^{C} \sum_{i=1}^{N} e^{-U_{ik}^{2}}\, d_{ik}^{2} \;-\; \alpha\, \frac{K}{N} \sum_{i=1}^{N} \underbrace{\sum_{k=1}^{C} e^{-U_{ik}^{2}}\, U_{ik}^{2}}_{\text{the new entropy term}}    (3)

s.t.  \sum_{k=1}^{C} e^{-U_{ik}^{2}} = 1, \quad i = 1, ..., N

where K = \log_{10}(e)/\log_{10}(2) and U = {U_{ik}}. Using Lagrange theory [Fle80], we introduce the Lagrange coefficients {λ_i}, so the minimization problem can be written as:

Minimize  L(U, \lambda) \;=\; \sum_{i,k=1}^{N,C} e^{-U_{ik}^{2}}\, d_{ik}^{2} \;-\; \alpha\, \frac{K}{N} \sum_{i,k=1}^{N,C} e^{-U_{ik}^{2}}\, U_{ik}^{2} \;+\; \sum_{i} \lambda_{i} \left( \sum_{k} e^{-U_{ik}^{2}} - 1 \right)    (4)

When the gradient of L(U, λ) with respect to {U_{ik}} and {λ_i} vanishes, we obtain respectively:

e^{-U_{ik}^{2}} \;=\; e^{-\frac{N}{K\alpha}\left(d_{ik}^{2} + \lambda_{i}\right) - 1} \;=\; e^{-1}\, e^{-\frac{N}{K\alpha}\left(d_{ik}^{2} + \lambda_{i}\right)}    (5)

and

\sum_{c} e^{-U_{ic}^{2}} \;=\; 1 \;=\; e^{-1}\, e^{-\frac{N}{K\alpha}\lambda_{i}} \sum_{c} e^{-\frac{N}{K\alpha} d_{ic}^{2}}
\quad\Longrightarrow\quad
e^{-\frac{N}{K\alpha}\lambda_{i}} \;=\; \frac{e}{\sum_{c} e^{-\frac{N}{K\alpha} d_{ic}^{2}}}    (6)

By replacing (6) in (5), the conditions for optimality lead to the very simple solution:

\mu_{ik} \;=\; e^{-U_{ik}^{2}} \;=\; \frac{e^{-(N/K\alpha)\, d_{ik}^{2}}}{\sum_{c} e^{-(N/K\alpha)\, d_{ic}^{2}}}    (7)

It is clear from the above expression that \sum_{c} \mu_{ic} = 1 and μ_{ik} ∈ [0, 1], so for each training example x_i, (μ_{i1}, ..., μ_{iC}) will be a probability distribution.
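For illustration, the closed-form update (7), alternated with the center definition given in §4.1, can be sketched in a few lines of numpy. This is a minimal sketch under our own choices (random initialization, α = 5.0, 50 iterations), not the authors' code; it simply shows that (7) is a softmax over −(N/Kα) d_{ik}^2.

```python
import numpy as np

def memberships(X, centers, alpha, K=np.log2(np.e)):
    """Membership update of Eq. (7): a softmax over -(N / (K * alpha)) * d_ik^2."""
    N = X.shape[0]
    # squared Euclidean distances d_ik^2 between every example and every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -(N / (K * alpha)) * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)     # each row sums to one

def update_centers(X, mu):
    """Weighted means c_k = sum_i mu_ik x_i / sum_i mu_ik (cf. Section 4.1)."""
    return (mu.T @ X) / mu.sum(axis=0)[:, None]

# Alternate the two updates, starting from a deliberately large number of centers C.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centers = X[rng.choice(len(X), size=20, replace=False)]
for _ in range(50):
    mu = memberships(X, centers, alpha=5.0)
    centers = update_centers(X, mu)
```

Note that subtracting the row-wise maximum of the logits before exponentiating leaves (7) unchanged and only improves numerical stability.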

4 Consistency and interpretation

4.1 Regularization

It is easy to see that our solution (7) is consistent with the role of the coefficient α. When α → ∞, μ_{ik} tends to 1/C, the distribution of the memberships becomes uniform and the entropies take high values, so the regularization term R reaches its global minimum. It follows that the centers of the C clusters, defined as c_k = \left(\sum_{i=1}^{N} \mu_{ik} x_i\right) / \left(\sum_{i=1}^{N} \mu_{ik}\right), k = 1, ..., C, will converge to only one center. Notice that overlapping clusters can be detected at the end of the algorithm using a simple criterion, such as checking whether the distance between their centers is below a given threshold ε¹, or by using more sophisticated tests such as the Kullback-Leibler divergence [CT91] in order to detect similar cluster distributions. Overlapping clusters are removed and replaced by only one cluster, and the memberships are then updated using (7). On the other hand, when α → 0, the effect of the regularization term vanishes, so each training example x_i will prefer the closest cluster, as shown by (8):

\lim_{\alpha \to 0} e^{-U_{ik}^{2}} \;=\; \lim_{\alpha \to 0} \frac{1}{1 + \sum_{c \neq k} e^{-(N/K\alpha)\{d_{ic}^{2} - d_{ik}^{2}\}}} \;=\; \begin{cases} 1 & \text{if } d_{ik} = \min_{c}\{d_{ic}\} \\ 0 & \text{otherwise} \end{cases}    (8)

or will be a cluster by itself if the number of clusters C is equal to N.
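One possible implementation of the overlap post-processing described in this section is sketched below. The greedy keep-the-first rule and the way the enclosing radius is approximated are our own assumptions; the paper only specifies that centers closer than a threshold ε are merged into a single cluster.

```python
import numpy as np

def merge_overlapping(centers, eps):
    """Greedy post-processing sketch: clusters whose centers lie closer than eps are
    considered overlapping and replaced by a single representative center.  The exact
    merging rule is not detailed in the paper; this is one simple choice."""
    kept = []
    for c in centers:
        if all(np.linalg.norm(c - k) >= eps for k in kept):
            kept.append(c)
    return np.asarray(kept)

def default_eps(X):
    """eps set to 1% of the radius of the ball enclosing the data (footnote 1),
    approximated here as half the largest pairwise distance."""
    radius = 0.5 * np.max(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
    return 0.01 * radius
```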

4.2 Scaling

If the training examples are assumed to be Gaussian distributed, then it is possible to use the Mahalanobis distance, which takes into account the spread of the data. For σ^2 = αK/N, we can rewrite the membership coefficients (7) as:

¹ In practice, ε is set to 1% of the radius of the ball enclosing the data. It can also be set close to the precision of floating-point numbers.

[Figure 1: six scatter plots of the toy data and the estimated centers, for σ = 30, 40, 50, 100, 180 and 200.]

Figure 1: Location of the centers with respect to the value of the regularization parameter σ. As the regularization parameter increases, the membership degrees become uniform and the regularization term decreases, so the centers overlap and the number of clusters decreases. N = 33 and the parameter C is initially set to 20.

\mu_{ik} \;=\; \frac{ e^{ -\frac{1}{\sigma^{2}} (x_{i} - c_{k})^{t}\, \Sigma_{k}^{-1}\, (x_{i} - c_{k}) } }{ \sum_{c} e^{ -\frac{1}{\sigma^{2}} (x_{i} - c_{c})^{t}\, \Sigma_{c}^{-1}\, (x_{i} - c_{c}) } }    (9)

Here c_k and Σ_k^{-1} denote respectively the center and the inverse of the covariance matrix of the k-th cluster. Now, σ (and also α) acts as a scaling factor (or variance): when it is underestimated, each c_k will be the center of a Gaussian which vanishes everywhere except at c_k, and each example will form a cluster by itself if C = N (cf. figure 1, top-left). On the contrary, when σ is overestimated the Gaussian will be quasi-constant, so no difference will be noticed in the memberships of the training examples to the clusters, and this results in one big cluster (cf. figure 1, bottom-right). Figure (2, top) shows the entropy and the fidelity terms with respect to the variance σ. As expected, a low variance implies high entropy, so the resulting number of clusters is high (cf. figure 2, bottom-left). On the other hand, a high variance implies low entropy, so the underlying number of clusters decreases. The best variance σ is of course application-dependent, but a trivial σ may exist for a clustering problem, mainly when the underlying classes are linearly separable. Nevertheless, when classes are not linearly separable, embedding methods such as isomap [TdSL00] can be used (cf. §5).
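As an illustration of (9), the membership update with per-cluster covariance matrices can be sketched as follows. This is our own minimal sketch; the estimation of the Σ_k themselves (e.g. from the current fuzzy memberships) is left out.

```python
import numpy as np

def mahalanobis_memberships(X, centers, covariances, sigma):
    """Membership update in the form of Eq. (9), with sigma^2 = alpha*K/N acting as the
    scale and one covariance matrix Sigma_k per cluster."""
    C = len(centers)
    d2 = np.empty((X.shape[0], C))
    for k in range(C):
        diff = X - centers[k]                      # (x_i - c_k) for every example
        inv_cov = np.linalg.inv(covariances[k])    # Sigma_k^{-1}
        d2[:, k] = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    logits = -d2 / sigma ** 2
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)
```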

4.3 Discussion

It is known that the initialization process in FCM requires C to be exactly the actual number of classes. In contrast, we consider in our approach a value of C far larger than the actual number of classes (but kept fixed), in order to guarantee that each class will be captured by at least one center. This may result in several overlapping clusters, which can be eliminated at the convergence stage using a simple test, namely thresholding the distances between the centers, in order to detect and remove overlapping clusters and replace them with fewer ones (cf. §4.1). Besides the previous remark, notice that our approach keeps both the parameters C and α constant through the different steps of the clustering process, so the objective function is kept fixed and the only changing parameters are the membership degrees.

[Figure 2: four diagrams showing the entropy term, the fidelity term, the number of clusters and the convergence curves.]

Figure 2: (Top) variation of the entropy and the fidelity terms with respect to the variance σ. Both the entropy and the fidelity values are shown in log-base-10 scale. (Bottom-left) the decrease of the number of clusters with respect to the variance. (Bottom-right) the convergence process, shown through the localization error \sum_k \| c_k^{(n)} - c_k^{(n-1)} \|^2, where c_k^{(n)} is the k-th center estimated at the n-th iteration (for σ = 10, 60, 100 and 240). These results are related to the clustering problem shown in figure (1).

5 Extension to non-linearly separable data

In this section, we show how to extend our clustering method to the topology of different manifolds. We use isomap [TdSL00], which allows us to embed a training set from an input space into an embedding space where the underlying classes become linearly separable. The whole objective is to make the classes linearly separable and to maximize the ratio between their inter- and intra-class variances (or scales); thus, setting the scale in the embedding space will be easy. We briefly review isomap and show in the experiments that this embedding improves the performance of our clustering method.

5.1 Isomap

Consider a training set in a Euclidean space. Instead of using directly the Euclidean distance between the training examples, we consider the geodesic distance. In practice, an adjacency graph is defined where an arc exists between two training examples x_i and x_j if x_j belongs to the M nearest neighbors of x_i or if x_j is inside a small ball around x_i (cf. figure 3, top right). The geodesic distance between two training examples is found by searching the minimal path using the Dijkstra algorithm [Dij59]. Following the formulation introduced in [TdSL00], each training example is embedded in the space spanned by the eigenvectors of the following matrix:

K_{isomap} \;=\; -\frac{1}{2}\, (I - ee^{t})\, K\, (I - ee^{t})    (10)

Here K is the Gram matrix of the training data, where an entry K_{ij} is equal to the value of the kernel function k(x_i, x_j) [CST00], which depends on the geodesic distance, and I, e denote respectively the identity matrix and the unit vector. Isomap finds a non-linear mapping which embeds non-linearly separable classes into a feature space where these classes become linearly separable.
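A standard isomap-style embedding in the spirit of (10) can be sketched as follows. We assume here that the Gram matrix K of (10) is built from squared geodesic distances and that e is the normalized all-ones vector, so that (I − ee^t) is the usual centering matrix; the exact kernel used by the authors is not spelled out, so this is an illustrative reading, not their implementation. It also assumes the neighborhood graph is connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embedding(X, n_neighbors=3, n_components=2):
    """Isomap-style sketch: M-nearest-neighbor graph, geodesic (shortest-path) distances
    via Dijkstra, double centering of the squared distances, then the leading
    eigenvectors give the embedding coordinates."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # keep only the M nearest neighbors of each point (symmetrized graph)
    graph = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        graph[i, idx[i]] = d[i, idx[i]]
    graph = np.maximum(graph, graph.T)
    geo = shortest_path(graph, method='D', directed=False)   # Dijkstra geodesics
    # double centering of the squared geodesic distances (our reading of Eq. (10))
    H = np.eye(n) - np.ones((n, n)) / n
    K_iso = -0.5 * H @ (geo ** 2) @ H
    vals, vecs = np.linalg.eigh(K_iso)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

The embedding coordinates are the leading eigenvectors scaled by the square roots of their eigenvalues, which also makes the "second eigenvalue" heuristic mentioned in the caption of figure 4 easy to read off the spectrum.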

Afterward, we can use our clustering method after setting appropriately the scale parameter σ (cf. figure 3). Figure (4) shows examples of clustering spirals, rings and numbers using our clustering method after isomap embedding.

[Figure 3: four panels showing the original data, the adjacency graph, the isomap embedding and the clustering result.]

Figure 3: From top-left to bottom-right: (1) the original non-linearly separable data; (2) the underlying adjacency graph, with the arity M set to 3; (3) the isomap embedding, which makes the two classes linearly separable; (4) the result of the clustering process.

Besides the applicability of isomap to handle the non-separability and the topology of the data, the underlying graph structure can be used in order to add specific knowledge. A user might add some constraints on data connections; for instance, they can specify that two training examples must (or must not) belong to the same cluster. The graph is then built using both the topology of the data and by propagating the user's constraints on connections. When building the graph, it is important to ensure that no path exists between two examples declared as disconnected by the user.
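A possible way of injecting such constraints into the adjacency graph is sketched below. The particular edge weights are our own assumptions, and, as noted above, fully guaranteeing that no path survives between cannot-link examples would require an additional cut of the graph, which the paper does not detail.

```python
import numpy as np

def apply_user_constraints(graph, must_link=(), cannot_link=()):
    """Sketch of propagating user constraints on the adjacency graph.  Must-link pairs
    get a near-zero-cost edge; cannot-link pairs have their direct edge removed (a zero
    entry means "no edge" for the dense shortest_path convention used above)."""
    g = graph.copy()
    for i, j in must_link:
        g[i, j] = g[j, i] = 1e-6   # almost-zero geodesic cost: forced into one cluster
    for i, j in cannot_link:
        g[i, j] = g[j, i] = 0.0    # remove the direct edge between the two examples
    return g
```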

[Figure 4: clustering results on the spiral, ring and number data after isomap embedding.]

Figure 4: Some clustering results on non-linearly separable data. After the application of our clustering algorithm, data belonging to different clusters are shown using different colors. The scale σ is set to the second eigenvalue of the eigenproblem (10).

6 Experiments

6.1 Validity criteria

A clustering method can be objectively evaluated when the ground truth is available; otherwise the meaning of clustering can differ from one intelligent observer to another. Validity criteria are introduced in order to measure the quality of a clustering algorithm, i.e., its capacity to assign data to their actual classes. For a survey on these methods, see for example [HBV02]. In the presence of a ground truth, we consider in our work a simple validity criterion based on the probability of misclassification. The latter occurs either when two examples belonging to two different classes are assigned to the same cluster, or when two elements belonging to the same class are assigned to different clusters. We denote by X and Y two random variables standing respectively for the training examples and their possible classes {y_1, ..., y_C}, and by X', Y' two variables distributed as X, Y. We denote by f(X) the index of the cluster of X in {y_1, ..., y_C}. Formally, we define the misclassification error as P(1_{\{f(X) = f(X')\}} \neq 1_{\{Y = Y'\}}) = (*), which we expand as:

(*) \;=\; P(f(X) \neq f(X') \mid Y = Y')\, P(Y = Y') \;+\; P(f(X) = f(X') \mid Y \neq Y')\, P(Y \neq Y')    (11)
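An empirical estimate of (11) over the pairs of a labeled sample can be sketched as follows; averaging over all ordered pairs while excluding self-pairs is our own choice of estimator.

```python
import numpy as np

def misclassification_error(cluster_ids, class_ids):
    """Empirical estimate of the criterion in Eq. (11): the probability, over pairs of
    examples, that 'same cluster' disagrees with 'same ground-truth class'."""
    f = np.asarray(cluster_ids)
    y = np.asarray(class_ids)
    same_cluster = f[:, None] == f[None, :]
    same_class = y[:, None] == y[None, :]
    mask = ~np.eye(len(f), dtype=bool)   # exclude an example paired with itself
    return np.mean(same_cluster[mask] != same_class[mask])
```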

6.2 Database categorization

Experiments have been conducted on both the Olivetti and Columbia databases in order to show both the good performance of our clustering method and the improvement brought by isomap. The Olivetti database contains 40 persons, each represented by 10 faces. Each face is processed using histogram equalization and encoded using kernel Fisher discriminant analysis [LHLM02], resulting in 100 coefficients. The Columbia set contains 100 categories of objects, each represented by 72 images. Each image is encoded using the laplacian color descriptor [VB00], resulting in a feature vector of 216 coefficients. It may be easier for database categorization to predict the variance σ rather than the number of clusters.

[Figure 5: two diagrams of the estimated scale versus the number of samples per class, for 3, 5, 7 and 9 classes.]

Figure 5: These diagrams show the variation of the scale with respect to the number of samples per class on the Columbia set. The scale is given as the expectation of the distance between pairs of images taken from 3, 5, 7 and 9 classes. The diagram on the left-hand side is related to the laplacian color space, while the other one is related to the isomap embedding space.

[Figure 6: probability of error versus the variance, with and without isomap embedding, on the Olivetti set (left) and the Columbia set (right).]

Figure 6: Left: precision of clustering using our method on the Olivetti set. The solid (resp. dashed) line shows the probability of error when clustering is performed in the Fisher space (resp. after isomap embedding). Right: the same experiments on the Columbia set.

Figure 7: Top: cluster prototypes resulting from our clustering method on the Columbia set, using (on the left) the laplacian color space and (on the right) isomap. Middle and bottom: images from two different clusters after the application of our algorithm, where the scale σ is set to 0.075 (resp. σ = 0.0055 using isomap). Images on the left-hand side are related to the laplacian color space while the others are related to isomap.

Predicting the variance σ can be achieved by manually sampling images from some categories, estimating the variance (or scale) of the underlying classes, then setting σ to the expectation of the variance over these classes. While this setting is not automatic, it has at least the advantage of introducing a priori knowledge on the variance of the data at the expense of little interaction and reasonable effort from the user. Furthermore, this is easier than trying to predict the actual number of classes, mainly for huge databases living in high-dimensional spaces. Figure (5) shows that this heuristic provides a good guess of the scale on the Columbia set, as it is close to the optimal scale shown in figure (6, right). Again, figure (6) shows the misclassification error (11) with respect to the scale parameter σ. When comparing the results on the Olivetti and Columbia sets, we can see (cf. figure 6) that the optimal scale using isomap is smaller than the optimal scale using the original space, since isomap reduces the ratio between the intra- and inter-class variances. Furthermore, the error rate becomes smaller. Tables (1) and (2) show the distribution of 20 categories from the Columbia set² across the clusters after the application of our method (with and without isomap). We can see that this distribution is concentrated on the diagonal, which clearly shows that most of the training data are assigned to their actual categories (cf. example in figure 7).

² For ease of visualization of these tables, we ran our algorithm using only a subset of 20 randomly chosen categories.
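One reading of this heuristic is sketched below: sample a few descriptors from a handful of categories, measure the mean pairwise distance within each sampled category, and use the average of these spreads as σ. The exact estimator used to produce figure (5) is not detailed, so this is only an illustrative sketch.

```python
import numpy as np

def estimate_sigma(samples_per_class):
    """Sketch of the scale heuristic: `samples_per_class` is a list of arrays, each
    holding the descriptors manually sampled from one category.  The per-class spread
    is the mean pairwise distance between its members; sigma is the average spread."""
    spreads = []
    for S in samples_per_class:
        d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
        iu = np.triu_indices(len(S), k=1)      # each unordered pair counted once
        spreads.append(d[iu].mean())
    return float(np.mean(spreads))
```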

6.3 Image segmentation

Clustering methods have also proven to be very useful for image segmentation [BHC93].

Table 1: This table shows the distribution of the categories through the clusters on the Columbia set using our clustering algorithm without isomap embedding. We can see that this distribution is concentrated on the diagonal. These experiments are performed using the laplacian color space where the scale parameter σ is set to 0.075. Errors in the cardinalities of the clusters are mentioned in red.

categories:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 | total
cluster 11:  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  4:   .  72   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   144
cluster 10:   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  5:   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  8:   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  2:   .   .   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster 17:   .   .   .   .   .   .   .  51   .   .   .   .   .   .   .   .   .   .   .   . |    51
cluster 19:   .   .   .   .   .   .   .  21   .   .   .   .   .   .   .   .   .   .   .   . |    21
cluster  9:   .   .   .   .   .   .   .   .  66   .   .   .  11   .   .   .   .   .   .   . |    77
cluster  1:   .   .   .   .   .   .   .   .   6  72   .   .   6  55   .   .   .  72   .   . |   211
cluster  6:   .   .   .   .   .   .   .   .   .   .  57   .   .   .   .   .   .   .   .   . |    57
cluster 14:   .   .   .   .   .   .   .   .   .   .   7  30   .   .   .   .   .   .   .   . |    37
cluster 20:   .   .   .   .   .   .   .   .   .   .   8  18   .   .   .   .   .   .   .   . |    26
cluster 18:   .   .   .   .   .   .   .   .   .   .   .  24   .   .   .   .   .   .   .   . |    24
cluster 16:   .   .   .   .   .   .   .   .   .   .   .   .  23   .   .   .   .   .   .   . |    23
cluster 13:   .   .   .   .   .   .   .   .   .   .   .   .  14   .   .   .   .   .   .   . |    14
cluster 15:   .   .   .   .   .   .   .   .   .   .   .   .  18   .   .   .   .   .   .   . |    18
cluster  3:   .   .   .   .   .   .   .   .   .   .   .   .   .  17   .  72   .   .   .   . |    89
cluster  0:   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .  72   .   .   . |   144
cluster  7:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   . |    72
cluster 12:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72 |    72
total:       72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72 |

Table 2: This table shows the distribution of the categories through the clusters on the Columbia set using our clustering algorithm with isomap embedding. We can see that this distribution is concentrated on the diagonal. These experiments are performed using the laplacian color space where the scale parameter σ is set to 0.0055. Errors in the cardinalities of the clusters are mentioned in red.

categories:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 | total
cluster 12:  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  4:   .  72   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   144
cluster 11:   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  6:   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  9:   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster  2:   .   .   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster 15:   .   .   .   .   .   .   .  72   .   .   .   .   .   .   .   .   .   .   .   . |    72
cluster 10:   .   .   .   .   .   .   .   .  72   .   .   .  18   .   .   .   .   .   .   . |    90
cluster  1:   .   .   .   .   .   .   .   .   .  72   .   .   .   .   .   .   .  72   .   . |   144
cluster  7:   .   .   .   .   .   .   .   .   .   .  58   3   .   .   .   .   .   .   .   . |    61
cluster 17:   .   .   .   .   .   .   .   .   .   .   4  47   .   .   .   .   .   .   .   . |    51
cluster 16:   .   .   .   .   .   .   .   .   .   .   .   3   .   .   .   .   .   .   .   . |     3
cluster 18:   .   .   .   .   .   .   .   .   .   .   .   .   7   .   .   .   .   .   .   . |     7
cluster 20:   .   .   .   .   .   .   .   .   .   .  10  19   .   .   .   .   .   .   .   . |    29
cluster 14:   .   .   .   .   .   .   .   .   .   .   .   .  34   .   .   .   .   .   .   . |    34
cluster 19:   .   .   .   .   .   .   .   .   .   .   .   .  13   .   .   .   .   .   .   . |    13
cluster  5:   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   .   .   .   . |    72
cluster  0:   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   .   .   . |    72
cluster  3:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   .   . |    72
cluster 21:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   .   .   . |    72
cluster  8:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72   . |    72
cluster 13:   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  72 |    72
total:       72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72 |

One possible application of image segmentation is partial queries [CBGM02], where the user selects a region of homogeneous color or texture distribution and the system displays similar images according only to the selected region. These experiments are targeted to show the good performance of our clustering method in image segmentation. The idea here is based on grouping the distribution of colors in an image using a suitable color space such as HSV, LUV, etc. In our experiments, the color of each pixel in an image is defined as the median RGB color in a local neighborhood of 10 × 10 pixels; this makes the color distribution of neighboring pixels smooth and the resulting blobs homogeneous, and avoids creating new colors, mainly at the region boundaries. We apply our clustering algorithm to this distribution, and each color is assigned to its closest cluster center using a simple Euclidean distance (see figure 8).
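The pipeline described above can be sketched as follows. The clustering routine itself is passed in as cluster_colors_fn (standing for the loop of §3 followed by the merging step of §4.1); the brute-force median filtering and the nearest-center assignment are straightforward illustrative choices, not the authors' implementation.

```python
import numpy as np

def segment_image(image, alpha, cluster_colors_fn):
    """Sketch of the segmentation pipeline: replace each pixel by the median RGB color
    of its 10x10 neighborhood, cluster the resulting color distribution, then assign
    every pixel to the closest surviving color center."""
    h, w, _ = image.shape
    smoothed = np.empty_like(image, dtype=float)
    for i in range(h):
        for j in range(w):
            patch = image[max(0, i - 5):i + 5, max(0, j - 5):j + 5].reshape(-1, 3)
            smoothed[i, j] = np.median(patch, axis=0)
    colors = smoothed.reshape(-1, 3)
    centers = cluster_colors_fn(colors, alpha)   # hypothetical clustering routine
    # nearest-center assignment (memory grows with image size; fine for a sketch)
    d2 = ((colors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(h, w)
```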

7 Conclusion and future work

We introduced in this paper a new formulation of regularized FCM which is simple, ensures the consistency of the fuzzy membership bounds and is easy to interpret in terms of regularization. We have also shown an extension of the method to non-linearly separable data. Several issues remain open: (1) the study of the effect of noise on the method and of the convergence of the algorithm; (2) the use of transductive learning, which requires the interaction of the user in order to have a priori knowledge on the cluster manifolds; (3) a new formulation of the objective function (3) which takes into account data with large variations in scale. Intuitively, if we consider, in the objective function (3), different mixing parameters α for different classes, then it might be possible to capture clusters with large variations in scale.
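As a purely hypothetical illustration of point (3), a per-cluster mixing parameter would turn (7) into a softmax with one scale per cluster; the sketch below is our own speculation about what such a variant could look like, not something evaluated in the paper.

```python
import numpy as np

def memberships_per_cluster_scale(d2, alphas, N, K=np.log2(np.e)):
    """Hypothetical variant of Eq. (7) with one mixing parameter alpha_k per cluster.
    d2 has shape (number of examples, C); alphas has one positive value per cluster."""
    logits = -(N / (K * np.asarray(alphas)))[None, :] * d2
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)
```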

Figure 8: Some segmentation results (without isomap). The variances are set, resp. from the top-left to the bottom-right, to 0.2, 0.26, 0.2, 0.11 and 0.3.

References

[Bez81] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic, Norwell, New York, 1981.

[BHC93] J.C. Bezdek, L.O. Hall, and L.P. Clarke. Review of MR image segmentation techniques using pattern recognition. Medical Physics, 20(4):1033–1048, 1993.

[BHHSV00] A. Ben-Hur, D. Horn, H.T. Siegelmann, and V. Vapnik. Support vector clustering. In Neural Information Processing Systems, pages 367–373, 2000.

[Bou00] N. Boujemaa. Generalized competitive clustering for image segmentation. In The 19th International Meeting of the North American Fuzzy Information Processing Society, 2000.

[CBGM02] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, 2002.

[CST00] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[CT91] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, New York, 1991.

[Dav91] R.N. Dave. Characterization and detection of noise in clustering. Pattern Recognition Letters, 12(11):657–664, 1991.

[Dij59] E.W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik, 1:269–271, 1959.

[DLR77] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[FK99] H. Frigui and R. Krishnapuram. A robust competitive clustering algorithm with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):450–465, 1999.

[Fle80] R. Fletcher. Practical Methods of Optimization, volume 1. John Wiley & Sons, New York, 1980.

[Fra98] C. Fraley. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20(1):270–281, 1998.

[HBV02] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: Part I and II. SIGMOD Record, 2002.

[IHT00] H. Ichihashi, K. Honda, and N. Tani. Gaussian mixture PDF approximation and fuzzy c-means clustering with entropy regularization. In The 4th Asian Fuzzy System Symposium, pages 217–221, 2000.

[JD88] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[LHLM02] Q. Liu, R. Huang, H. Lu, and S. Ma. Face recognition using kernel based Fisher discriminant analysis. In The IEEE Conference on Face and Gesture, pages 197–201, 2002.

[OJT03] C.A. Orengo, D.T. Jones, and J.M. Thornton. Bioinformatics: Genes, Proteins and Computers. BIOS, ISBN 1-85996-054-5, 2003.

[Pos01] C. Posse. Hierarchical model-based clustering for large datasets. Journal of Computational and Graphical Statistics, 10(3):464–486, 2001.

[SB02] B. Le Saux and N. Boujemaa. Unsupervised categorization for image database overview. In International Conference on Visual Information Systems, pages 163–174, 2002.

[SS73] P.H. Sneath and R.R. Sokal. Numerical Taxonomy: the Principles and Practice of Numerical Classification. W.H. Freeman, San Francisco, 1973.

[TBP94] E.K. Tsao, J.C. Bezdek, and N.R. Pal. Fuzzy Kohonen clustering networks. Pattern Recognition, 27(5):757–764, 1994.

[TdSL00] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[VB00] C. Vertan and N. Boujemaa. Upgrading color distributions for image retrieval: can we do better? In International Conference on Visual Information Systems, pages 178–188, 2000.

[Yan93] M.S. Yang. A survey of fuzzy clustering. Mathematical and Computer Modelling, 18(11):1–16, 1993.