A Clustering Algorithm Combining the FCM Algorithm with the Supervised Learning Normal Mixture Model

Wei Wang, Chunheng Wang, Xia Cui, Ai Wang
Key Laboratory of Complex System and Intelligence Science, Institute of Automation, Chinese Academy of Sciences
{wang.wei, chunheng.wang, xia.cui, ai.wang}@ia.ac.cn

Abstract

In this paper we propose a new clustering algorithm that combines the FCM clustering algorithm with the supervised learning normal mixture model; we call it the FCM-SLNMM clustering algorithm. The FCM-SLNMM clustering algorithm consists of two steps. The FCM algorithm is applied in the first step. In the second step the supervised learning normal mixture model is applied, using the clustering result of the first step as training data. Experiments on real-world data from the UCI repository show that the supervised learning normal mixture model can sharply improve the performance of the FCM algorithm, and that FCM-SLNMM performs much better than the unsupervised learning normal mixture model and the other comparison clustering algorithms. This indicates that the FCM-SLNMM algorithm is an effective clustering algorithm.

1. Introduction

Data clustering is an important problem in a wide variety of fields, including data mining, pattern recognition, computer vision, and bioinformatics [1]. A clustering algorithm partitions an unlabelled set of data into groups according to similarity. Compared with data classification, data clustering is an unsupervised learning process: it does not need a labeled data set as training data, but its performance is often much poorer. Although data classification has better performance, it needs a labeled training set, and labeled data are often very difficult and expensive to obtain. Therefore, many algorithms have been proposed to improve clustering performance.


In this paper, we propose a new clustering algorithm which divides the clustering process into two stages. In the first stage, we use an unsupervised learning algorithm to obtain training data. In the second stage, we use the training data to build a classifier and apply the classifier to partition the data. The FCM algorithm is used to obtain the training data. Basically, FCM clustering depends on the measure of distance between samples [10]. The probability that a sample x_i belongs to the jth cluster (i.e. its membership) is determined by

P(\omega_j \mid x_i) = \frac{(1/d_{ji})^{1/(b-1)}}{\sum_{r=1}^{C} (1/d_{ri})^{1/(b-1)}}, \quad \text{with} \quad \sum_{j=1}^{C} P(\omega_j \mid x_i) = 1, \; i = 1, \dots, N,

where d_ri is the Euclidean distance of x_i to the rth cluster center and b > 1.

If the sample x_i lies in the margin (i.e. its distances to all cluster centers are approximately equal and all memberships P are small), it is hard to judge which cluster x_i belongs to. If the sample x_i is closer to the jth cluster center and farther from the other cluster centers (i.e. d_ji becomes smaller and d_ri, r ≠ j, become larger), then P(ω_j | x_i) becomes larger and the other P(ω_r | x_i), r ≠ j, become smaller. The FCM algorithm uses a threshold to determine whether x_i belongs to the jth cluster (i.e. if P(ω_j | x_i) > threshold, then x_i belongs to the jth cluster). If we select a large threshold, then whenever P(ω_j | x_i) > threshold the sample x_i is very close to a cluster center, so the accuracy of assigning x_i to the jth cluster is very high. With a large threshold, however, some samples may not belong to any cluster, because their memberships to all clusters fall below the threshold. This indicates that the FCM algorithm with a large threshold can produce an accurately partitioned subset. We give each sample in this subset a label and use these samples as training data.

Finite mixture models are known as powerful tools and have been applied to a wide range of theoretical and practical problems, as they are capable of modeling a wide range of densities [11]. Ghahramani and Jordan present a framework based on maximum likelihood density estimation for learning from high-dimensional data sets with arbitrary patterns of missing data; they use mixture models for the density estimates, and the EM algorithm is used both for estimating the mixture components and for coping with the missing data [8]. We apply the normal mixture model in the supervised learning step because the training data obtained in the first step inevitably contain some wrongly partitioned samples. The iterative EM process can reduce the deviation caused by these wrongly partitioned samples, which improves the partition results.

Apart from Section 1, this paper is organized as follows: Section 2 introduces the FCM-SLNMM clustering algorithm in detail. Section 3 presents experimental results that test the effectiveness of the FCM-SLNMM clustering algorithm. We conclude the paper in Section 4.

2. The FCM-SLNMM Algorithm

The FCM-SLNMM algorithm consists of two stages. In the first stage, we apply the FCM clustering algorithm to obtain training data. In the second stage, we use the training data and the supervised learning normal mixture model to build a classifier, and then we use the classifier to partition the data set.

2.1. The FCM algorithm

The FCM algorithm, which was proposed by Dunn [2] and extended by Bezdek [3], is one of the most widely used clustering algorithms.

Let X = {x_i; i = 1, ..., N} with x_i ∈ R^d be a data set consisting of N d-dimensional samples. We want to partition X into C subsets X_1, ..., X_C, where each subset represents a cluster [4]. First we use the fuzzy c-means clustering algorithm to obtain an initial partition. The fuzzy c-means clustering algorithm minimizes the cost function

J = \sum_{i=1}^{N} \sum_{j=1}^{C} \left( \hat{P}(\omega_j \mid x_i, \theta) \right)^b \| x_i - v_j \|^2    (1)

where b > 1 is the fuzziness index. For simplicity, we do not explicitly show the dependence on θ below. The probabilities of cluster membership for each point are normalized as \sum_{j=1}^{C} \hat{P}(\omega_j \mid x_i) = 1, i = 1, ..., N.

To minimize J, we set \partial J / \partial v_j = 0 and \partial J / \partial \hat{P}(\omega_j \mid x_i) = 0 [2][3][4]. These lead to the solutions

v_j = \frac{\sum_{i=1}^{N} (\hat{P}(\omega_j \mid x_i))^b \, x_i}{\sum_{i=1}^{N} (\hat{P}(\omega_j \mid x_i))^b}    (2)

\hat{P}(\omega_j \mid x_i) = \frac{(1/\|x_i - v_j\|)^{1/(b-1)}}{\sum_{r=1}^{C} (1/\|x_i - v_r\|)^{1/(b-1)}}    (3)
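Iterating Eqs. (2) and (3) until the centers stabilize, and then thresholding the resulting memberships as described in Section 1, yields the training subset used in the second stage. The following Python fragment is a minimal sketch of this first stage (our own illustration, not the authors' code); the function names, the default fuzziness index b = 2, and the threshold value are assumptions.

import numpy as np

def fcm(X, C, b=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means following Eqs. (1)-(3): returns cluster centers v (C x d)
    and the membership matrix P (N x C)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    v = X[rng.choice(N, size=C, replace=False)].copy()    # random initial centers
    for _ in range(max_iter):
        # Euclidean distances d_ji of every sample to every center, shape (N, C)
        dist = np.linalg.norm(X[:, None, :] - v[None, :, :], axis=2) + 1e-12
        # Membership, Eq. (3): (1/d_ji)^(1/(b-1)), normalized over the C clusters
        w = (1.0 / dist) ** (1.0 / (b - 1.0))
        P = w / w.sum(axis=1, keepdims=True)
        # Centers, Eq. (2): weighted means with weights P^b
        Pb = P ** b
        v_new = (Pb.T @ X) / Pb.sum(axis=0)[:, None]
        if np.max(np.abs(v_new - v)) <= eps:               # stopping rule of the pseudocode in 2.3
            return v_new, P
        v = v_new
    return v, P

def select_training_subset(P, threshold=0.9):
    """Threshold the memberships (Section 1): a sample is kept as training data
    only if its largest membership exceeds the threshold."""
    labels = P.argmax(axis=1)
    confident = P.max(axis=1) > threshold
    return confident, labels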

2.2. The Normal Mixture Model

We assume that X follows a C-component finite mixture distribution, so the probability density function of the data can be written as [5]

P(x \mid \theta) = \sum_{i=1}^{C} \alpha_i \, p_i(x \mid \theta_i), \quad \text{with} \quad \sum_{i=1}^{C} \alpha_i = 1    (4)

where the parameters \theta = (\alpha_1, \dots, \alpha_C, \theta_1, \dots, \theta_C) are such that \alpha_1, \dots, \alpha_C are the mixing probabilities and each p_i is a density function parameterized by \theta_i.

The log-likelihood function is

\log p(x \mid \theta) = \log \prod_{i=1}^{N} p(x_i \mid \theta) = \sum_{i=1}^{N} \log \Big( \sum_{j=1}^{C} \alpha_j \, p_j(x_i \mid \theta_j) \Big)    (5)
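For concreteness, Eq. (5) can be evaluated directly once the mixture parameters are given. The sketch below is our own illustration; it assumes the component densities p_j are multivariate normals (as in the development later in this section) and that SciPy is available.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, alphas, means, covs):
    """Eq. (5): sum_i log( sum_j alpha_j * p_j(x_i | theta_j) ) for normal components."""
    # Component densities evaluated at every sample, shape (N, C)
    dens = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=S)
                            for m, S in zip(means, covs)])
    return float(np.sum(np.log(dens @ np.asarray(alphas))))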

This is difficult to optimize because it contains the log of a sum [6]. We therefore consider x as incomplete data and assume that the missing part is a set of labels Y = {y_i; i = 1, ..., N}, where each y_i = (y_{i1}, ..., y_{iC}) is a binary C-dimensional vector such that y_{ik} = 1 if and only if x_i arises from component k [7]. We assume a complete data set Z = {X, Y} with probability density function

P(z \mid \theta) = p(x, y \mid \theta) = p(y \mid x, \theta) \, p(x \mid \theta)    (6)

The complete-data log-likelihood becomes [6]

\log p(x, y \mid \theta) = \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \big( \alpha_j \, p(x_i \mid y_i, \theta_j) \big)    (7)

log p(θ | x, y) can be maximized by iterating the following two steps [8][9]:

E step: Q(\theta, \theta^k) = E[\log p(x, y \mid \theta) \mid x, \theta^k]

M step: \theta^{k+1} = \arg\max_{\theta} Q(\theta, \theta^k)

The E step computes the expected complete-data log-likelihood and the M step finds the parameters that maximize this likelihood. These two steps form the basis of the EM algorithm [8].

Since \log p(x, y \mid \theta) is linear with respect to the missing y, we simply have to compute the conditional expectation W \equiv E[y \mid x, \theta^k] and plug it into \log p(x, y \mid \theta). Since the elements of y are binary, their conditional expectations are given by w_{ij} \equiv E[y_{ij} \mid x_i, \theta^k]. We assume that the component densities are multivariate normal, so θ becomes \theta = \{\alpha_j, \mu_j, \Sigma_j\}_{j=1}^{C}. At the kth iteration, w_{ij} can be calculated by [8]

\hat{w}_{ij}^{k} = \frac{\hat{\alpha}_j^{k} \, |\hat{\Sigma}_j^{k}|^{-1/2} \exp\{ -\tfrac{1}{2} (x_i - \hat{\mu}_j^{k})^T (\hat{\Sigma}_j^{k})^{-1} (x_i - \hat{\mu}_j^{k}) \}}{\sum_{l=1}^{C} \hat{\alpha}_l^{k} \, |\hat{\Sigma}_l^{k}|^{-1/2} \exp\{ -\tfrac{1}{2} (x_i - \hat{\mu}_l^{k})^T (\hat{\Sigma}_l^{k})^{-1} (x_i - \hat{\mu}_l^{k}) \}}    (8)

The M step re-estimates the mixing probabilities, means, and covariances:

\hat{\mu}_j^{k+1} = \frac{\sum_{i=1}^{N} w_{ij}^{k} \, x_i}{\sum_{i=1}^{N} w_{ij}^{k}}    (9)

\hat{\Sigma}_j^{k+1} = \frac{\sum_{i=1}^{N} w_{ij}^{k} (x_i - \hat{\mu}_j^{k+1})(x_i - \hat{\mu}_j^{k+1})^T}{\sum_{i=1}^{N} w_{ij}^{k}}    (10)

\hat{\alpha}_j^{k+1} = \frac{1}{N} \sum_{i=1}^{N} w_{ij}^{k}    (11)

2.3. The FCM-SLNMM algorithm

Algorithm FCM-SLNMM
Input: data X, cluster number C, fuzziness index b
Initialize v_j^0 = random value, j = 1, 2, ..., C
repeat
    Calculate P(ω_j | x_i) (i = 1, ..., N; j = 1, ..., C) using Eq. (3)
    Calculate v_j (j = 1, ..., C) using Eq. (2)
until max_j |v_j^0 − v_j| ≤ ε
for each i, j: if P̂(ω_j | x_i) > threshold then y_ij = 1
               else y_ij = 0
               end if
repeat
    E step: calculate w_ij (i = 1, ..., N; j = 1, ..., C) by Eq. (8)
    M step: calculate μ̂_j^{k+1} by Eq. (9), Σ̂_j^{k+1} by Eq. (10), α̂_j^{k+1} by Eq. (11)
until max_{i,j} |w_ij^{k+1} − w_ij^{k}| ≤ ε
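The pseudocode above can be mapped onto the FCM sketch given in Section 2.1. The fragment below is a minimal illustration of the second stage under our own naming conventions: the binary matrix Y produced by thresholding the FCM memberships initializes the responsibilities, and Eqs. (8)-(11) are then iterated. We start with an M step so that the initial parameters come from the labeled subset; this ordering is an assumed reading of the pseudocode.

import numpy as np
from scipy.stats import multivariate_normal

def fit_slnmm(X, Y, C, eps=1e-5, max_iter=200):
    """Second stage of FCM-SLNMM: Y is the N x C binary label matrix obtained by
    thresholding the FCM memberships (rows are all zero for samples that were not
    confidently assigned). Eqs. (8)-(11) are then iterated over all of X."""
    N, d = X.shape
    w = Y.astype(float)
    for _ in range(max_iter):
        # M step, Eqs. (9)-(11): means, covariances and mixing probabilities
        nk = w.sum(axis=0) + 1e-12
        mu = (w.T @ X) / nk[:, None]
        sigma = []
        for j in range(C):
            diff = X - mu[j]
            S = ((w[:, j, None] * diff).T @ diff) / nk[j]
            sigma.append(S + 1e-6 * np.eye(d))             # small ridge for numerical stability
        alpha = nk / nk.sum()                              # Eq. (11), normalized over current weight
        # E step, Eq. (8): posterior responsibility of each normal component
        dens = np.column_stack([
            alpha[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
            for j in range(C)
        ])
        w_new = dens / dens.sum(axis=1, keepdims=True)
        if np.max(np.abs(w_new - w)) <= eps:               # until max |w^{k+1} - w^k| <= eps
            w = w_new
            break
        w = w_new
    return w.argmax(axis=1)

# Putting the two stages together with the helpers sketched in Section 2.1
# (all names are ours, not the paper's):
#   v, P = fcm(X, C, b=2.0)
#   confident, labels = select_training_subset(P, threshold=0.9)
#   Y = np.zeros((len(X), C)); Y[confident, labels[confident]] = 1
#   final_labels = fit_slnmm(X, Y, C)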

3. Experiments

In this section, we use experimental results to demonstrate the performance of the FCM-SLNMM clustering algorithm. We do experiments on five real data sets. The k-means, FCM, FCM-Bayesian and USLNMM algorithms are used for comparison.

3.1. Data Set and Comparison Algorithms

We do experiments on five real data sets from the UCI repository: the Iris, Pima Indians Diabetes, Heart Disease, Ionosphere and Wine databases [13].

TABLE 1. OVERVIEW OF THE DATA SETS

DATA SET     ATTRIBUTES   CLASSES   DATA NUMBER
IRIS         4            3         150
PIMA         8            2         768
HEART        13           2         270
IONOSPHERE   34           2         351
WINE         13           3         178

The comparison algorithms include the k-means algorithm, the FCM algorithm, the FCM-Bayesian algorithm and the USLNMM algorithm. The k-means algorithm and the FCM algorithm are two widely used clustering algorithms. The FCM-Bayesian algorithm combines the FCM algorithm with the naïve Bayes classifier. The USLNMM is the unsupervised learning normal mixture model.

3.2. Evaluation Method

We choose the F-Measure for evaluating the clustering algorithms. The F-Measure is commonly used to evaluate the effectiveness of clustering [12]. It combines recall and precision to evaluate the performance of the clustering algorithms. The F-Measure is computed as

F\text{-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
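As a concrete reading of this measure for clustering evaluation, the sketch below pairs each true class with its best-matching cluster and averages the resulting F scores weighted by class size. The paper does not spell out the exact aggregation, so this particular formulation (and the function name) is an assumption on our part.

import numpy as np

def f_measure(true_labels, cluster_labels):
    """Clustering F-Measure: each class is matched with its best-scoring cluster and
    the per-class F scores are averaged, weighted by class size."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    total = 0.0
    for c in np.unique(true_labels):
        in_class = (true_labels == c)
        best = 0.0
        for k in np.unique(cluster_labels):
            in_cluster = (cluster_labels == k)
            overlap = np.sum(in_class & in_cluster)
            if overlap == 0:
                continue
            precision = overlap / np.sum(in_cluster)       # fraction of the cluster in this class
            recall = overlap / np.sum(in_class)            # fraction of the class in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (np.sum(in_class) / n) * best
    return total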

3.3. Results

Figure 1 shows the clustering results of the five clustering algorithms on the five data sets. The F-Measures of the FCM-SLNMM algorithm are 7%, 1%, 3%, 5% and 10% higher than those of the FCM algorithm. This shows that the supervised learning normal mixture model can sharply improve the performance of the FCM algorithm. The clustering results of the FCM-SLNMM algorithm are also much better than those of the other algorithms, which indicates that combining the normal mixture model with the FCM algorithm yields an effective clustering algorithm. The experiments also show that combining the FCM algorithm with the SLNMM is much better than combining the FCM algorithm with the naïve Bayes classifier. The F-Measure of the FCM-SLNMM on the Pima database is only 1% higher than that of the FCM algorithm. This is because the FCM algorithm cannot obtain an accurately partitioned subset on this data set (its F-Measure is only 67%). Obtaining an accurately partitioned subset is therefore the key to our clustering method.

Fig. 1. F-Measures of the clustering results on all data sets

4. Conclusions

In this paper, we present the FCM-SLNMM, a new clustering algorithm that combines the FCM clustering algorithm with the supervised learning normal mixture model. The experimental results on real-world data sets from the UCI repository show that the clustering performance of the FCM algorithm can be sharply improved by combining it with the supervised learning normal mixture model. The experiments also confirm that the FCM-SLNMM clustering algorithm performs better than the other comparison algorithms.

References

[1] N. Grira and M. E. Houle. Best of Both: A Hybridized Centroid-Medoid Clustering Heuristic. Proc. of the 24th International Conference on Machine Learning, 2007.
[2] J. C. Dunn. Some recent investigations of a new fuzzy partition algorithm and its application to pattern classification problems. Cybernetics, pages 1-15, 1974.
[3] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York, 1981.
[4] R. O. Duda, P. E. Hart and D. G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2004.
[5] M. A. F. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, 2002.
[6] J. Bilmes. A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of Berkeley, 1997.
[7] C. Biernacki, G. Celeux and G. Govaert. Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 7, pp. 719-725, July 2000.
[8] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6, pp. 120-127, 1994.
[9] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[10] X. Wang, Y. Wang and L. Wang. Improving fuzzy c-means clustering based on feature-weight learning. Pattern Recognition Letters, vol. 25, pp. 1123-1132, 2004.
[11] T. I. Lin, J. C. Lee and H. J. Ho. On fast supervised learning for normal mixture models with missing information. Pattern Recognition, vol. 39, pp. 1177-1187, 2006.
[12] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison Wesley Longman Limited, 1999.
[13] http://archive.ics.uci.edu/ml/datasets.html