Pattern Anal Applic (2005) 8: 125–138 DOI 10.1007/s10044-005-0250-9
THEORETICAL ADVANCES
Miin-Shen Yang · Kuo-Lung Wu
A modified mountain clustering algorithm
Received: 5 March 2004 / Published online: 24 June 2005 © Springer-Verlag London Limited 2005
Abstract In this paper, we modify the mountain method and create a modified mountain clustering algorithm. The proposed algorithm can automatically estimate the parameters in the modified mountain function in accordance with the structure of the data set, based on the correlation self-comparison method. It can also estimate the number of clusters based on the proposed validity index. As a clustering tool for a grouped data set, the modified mountain algorithm becomes a new unsupervised approximate clustering method. Some examples are presented to demonstrate the simplicity and effectiveness of this algorithm, and its computational complexity is also analyzed.

Keywords Mountain method · Modified mountain algorithm · Parameter estimation · Validity index · Unsupervised clustering

M.-S. Yang (&)
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li, Taiwan, 32023, ROC
E-mail: [email protected]

K.-L. Wu
Department of Information Management, Kun Shan University of Technology, Yung-Kang, Tainan, Taiwan, 71023, ROC
1 Introduction

Cluster analysis is a tool for clustering data with similar characteristics into groups and for discovering the structure of a data set. Clustering methods have been widely applied in various areas such as taxonomy, geology, business, engineering systems, medicine, pattern analysis and image processing (see Refs. [1–3]). Since Zadeh [4] proposed fuzzy set theory, which provided the idea of partial memberships, it has been successfully used in cluster analysis. In the fuzzy clustering literature, objective-function-based algorithms are the most used methods (see Refs. [2, 3]). In particular, the fuzzy c-means (FCM) algorithm and its varieties are the best-known and most widely applied clustering methods (see Refs. [5–7]). However, these clustering algorithms always need some a priori information. The first requirement is the initial guesses of centers or partitions, so the clustering performance is always affected by the initialization. The second is the number of clusters, which needs to be assigned a priori. We shall call these clustering methods semi-unsupervised clustering methods. Many cluster validity indexes [8–11] have been proposed to provide a good estimate of the cluster number for validating clustering algorithms. However, these validity methods are generally based upon clustering algorithms whose performances are known to be sensitive to the initializations. The mountain method, proposed by Yager and Filev [12], is a simple and effective algorithm for approximate clustering. The mountain function is similar to a Parzen window estimate of the probability density function of the feature vectors, and the parameter of the mountain function corresponds to the kernel width of the Parzen density estimate [1]. The method is often used to obtain initial guesses of the cluster centers for clustering algorithms. It can also be used alone as a stand-alone method for approximating the cluster centers. Because the amount of computation of the mountain method grows exponentially with the dimensionality of the data, Chiu [13] modified it by considering the mountain function on the data points instead of the grid nodes. Pal and Chakraborty [14] proposed a scheme to improve the accuracy of the prototypes obtained by Yager's mountain method and Chiu's modified version, and applied it to detect circular shells. Moreover, Yager and Filev [15] applied the mountain method to the generation of fuzzy rules, and Velthuizen et al. [16] applied it to clustering large data sets such as the segmentation of magnetic resonance images of the human brain. However, the performance of these mountain methods depends heavily on the parameters of the mountain
function. In this paper, we first propose a correlation self-comparison method to estimate the parameters. The proposed estimation method is well grounded, being in accordance with the structure of the data set. On the other hand, the stopping condition of the mountain method is another problem. The mountain algorithm for acquiring new cluster centers repeats until the chosen largest mountain value is less than a given δ. However, if the value of δ is specified too large, we may lose some important clusters, and if it is specified too small, the result may have too many clusters. It is difficult to specify a suitable δ value in real applications. In this paper, we propose a new cluster validity measure function as a new stopping condition for the modified mountain method. This validity index is based on the proposed modified mountain function and has the ability to reflect the relations between clusters, including cluster size, cluster scale and the distance between clusters. The remainder of this paper is organized as follows. In Sect. 2, we describe the mountain method and examine the problems arising when using it. In order to solve these problems, we propose a modified mountain clustering algorithm in Sect. 3. We first propose the correlation self-comparison method to approximate the density shape and set a threshold to obtain a suitable density shape estimate. This method works well on various simulated high dimensional data sets. We then modify the revised mountain function. This modification allows somewhat different density shapes for each extracted cluster, so that the modified revised mountain function is always positive. According to the properties of the modified mountain method, we also propose a new cluster validity measure function. Section 4 provides some examples to show the performance of the modified mountain clustering algorithm. In Sect. 5, the performance of the proposed modified mountain method is compared with the original mountain method on data sets from normal mixtures. We also implement FCM with two popular validity indexes, the partition coefficient (PC) [8] and the Xie and Beni index (XB) [11], to find good clustering results for the given data sets, and we compare these results with our modified mountain method. The computational complexity is also analyzed. Finally, conclusions are made in Sect. 6.
2 The mountain method

The mountain method proposed by Yager and Filev [12] is usually used to obtain the initial values of cluster centers. It can also be used as a 'stand alone' technique for approximating the cluster centers. Suppose we have n data points denoted by {x_1, ..., x_n} in an m-dimensional Euclidean space R^m. We then lay a grid over the data space. Let the m-dimensional hypercube be I_1 × ... × I_m, where each interval I_p (p = 1, 2, ..., m) is defined by the range of the n data points in the pth coordinate. Thus, the hypercube I_1 × ... × I_m contains all the points of
the data set {x_1, ..., x_n}. Each interval I_p is discretized into r_p equidistant points. Such a discretization forms an m-dimensional grid in the hypercube with nodes N(i_1, ..., i_m), where the indices i_1, ..., i_m take values from the sets {1, ..., r_1}, ..., {1, ..., r_m}. Let {N_i} denote the set of all grid nodes. In the mountain method, cluster centers are restricted to the grid nodes {N_i} in R^m, and the mountain function for each N_i is calculated as

M_1(N_i) = \sum_{j=1}^{n} e^{-\alpha d(x_j, N_i)}, \quad i = 1, 2, \ldots \qquad (1)
where d(x_j, N_i) is the distance measure between the data point x_j and the grid node N_i. The mountain function value of a node is closely related to the density of the data points in its neighborhood and represents the potential of the grid node to be a cluster center estimate: a node with many neighboring data points has a large mountain function value. The parameter α in Eq. (1) is very important. It determines the neighborhood radius, in the sense that data points outside this radius have only a small influence on the mountain function, and it also determines the approximate density shape of the data set. It is reasonable to take the first cluster center estimate as the node with the maximal mountain function score. Thus, we find N^*_1 (the first cluster center estimate among all grid nodes) with

M_1(N^*_1) = \max_i \{ M_1(N_i) \}. \qquad (2)
Since the nodes close to N^*_1 also have high mountain function values, it is necessary to remove the effect of the identified cluster center before obtaining the next cluster center. The mountain function, after eliminating the effect of the (k−1)th cluster center N^*_{k−1}, is defined by

M_k(N_i) = M_{k-1}(N_i) - M_{k-1}(N^*_{k-1}) \, e^{-\gamma d(N^*_{k-1}, N_i)}, \quad k = 2, 3, \ldots \qquad (3)

where

M_{k-1}(N^*_{k-1}) = \max_i \{ M_{k-1}(N_i) \} \qquad (4)
and the function in (3) is called the revised mountain function. In the revised mountain function (3), the mountain function values of the nodes close to the newly identified cluster center are reduced, and the parameter γ determines the neighborhood radius within which the reduction is substantial. In (1), the optimization problem is reduced to the finite set {N_i}, which also determines the precision of the clustering results. A finer grid gives a more accurate solution over R^m, but it also increases the computational time, especially in high dimensional cases. Chiu [13] therefore modified the mountain method by defining the mountain function on the data vectors. Although the number of data points may be larger than the number of grid nodes, in the original mountain method we must calculate the distance measures
between the nodes and the identified cluster center nodes to obtain the next revised mountain function. The mountain method defined on the data vectors omits this calculation, because the revised mountain functions are also defined on the data vectors. We adopt this modification in one part of our new modified mountain clustering algorithm. A clustering procedure involves two problems that require a priori knowledge. The first is the initial guesses of centers or partitions, which always affect the clustering performance. The second is the number of clusters, which needs to be given a priori. The mountain method can provide good initial guesses for a clustering procedure, relying on a stopping condition to decide whether the identified cluster center has the potential to be a new cluster center in the data set. This is a cluster validity problem. In Yager and Filev's procedure, the algorithm for acquiring new cluster centers repeats until M_k(N^*_k) is less than a given δ, so δ becomes an important factor. If δ is specified too large, we may lose some important clusters; if δ is specified too small, the result may have too many clusters. It is difficult to specify δ as a constant that works well for all examples. Therefore, Chiu [13] developed additional criteria for accepting or rejecting cluster centers. Although these methods can decide the number of identified cluster centers in a simple way, they do not give a performance measure for it. A new concept for measuring the cluster validity will be proposed in the next section.
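To make the grid-based procedure above concrete, the following minimal sketch implements Eqs. (1)–(4); it is an illustration only, not the authors' code, and the squared Euclidean distance, the grid resolution r and the values of α, γ and δ are assumptions chosen for the example.

```python
import numpy as np

def mountain_clustering(X, r=20, alpha=1.0, gamma=1.5, delta=0.1, max_centers=10):
    """Approximate cluster centers with the original mountain method (Eqs. 1-4).

    X : (n, m) data matrix; r grid points per dimension give r**m nodes;
    alpha and gamma are the mountain widths; delta is the stopping threshold
    on M_k(N_k*) / M_1(N_1*)."""
    # Grid nodes N_i inside the hypercube I_1 x ... x I_m spanned by the data.
    axes = [np.linspace(X[:, p].min(), X[:, p].max(), r) for p in range(X.shape[1])]
    nodes = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, X.shape[1])

    # Eq. (1): mountain function on every node (squared Euclidean distance assumed).
    d_node_data = ((nodes[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    M = np.exp(-alpha * d_node_data).sum(axis=1)

    centers, peak = [], M.max()
    for _ in range(max_centers):
        k = int(np.argmax(M))              # Eqs. (2) and (4): node with the largest value
        if M[k] / peak < delta:            # stopping rule on the relative peak height
            break
        centers.append(nodes[k])
        # Eq. (3): remove the influence of the newly identified center from every node.
        d_node_center = ((nodes - nodes[k]) ** 2).sum(axis=1)
        M = np.maximum(M - M[k] * np.exp(-gamma * d_node_center), 0.0)
    return np.array(centers)
```

The number of grid nodes, and hence the cost of evaluating Eq. (1), grows as r^m, which is the exponential growth in the dimensionality mentioned above.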
3 A modified mountain clustering algorithm

The mountain method can provide an approximate clustering result via the density estimation concept. Its performance is sensitive to the density shape parameter α and the revised mountain function parameter γ. In this section, we propose a modified mountain method which produces good initial cluster centers together with an estimate of the number of clusters via an analogous kernel-type density estimation, by defining the modified mountain function on the data points rather than on grid nodes. We use a correlation self-comparison procedure to approximate the exact density shape and set a threshold to obtain a good estimate of the density shape parameter. Thus, the proposed algorithm becomes an unsupervised clustering algorithm. We now define the modified mountain function for each data vector x_i on all data points as

P_1(x_i) = \sum_{j=1}^{n} e^{-m \beta d(x_i, x_j)}, \quad i = 1, \ldots, n \qquad (5)
where

\beta = \left( \frac{\sum_{j=1}^{n} \| x_j - \bar{x} \|^2}{n} \right)^{-1} \quad \text{with} \quad \bar{x} = \frac{\sum_{j=1}^{n} x_j}{n}. \qquad (6)
The parameter β acts as a normalization term that normalizes the dissimilarity measure d(x_i, x_j). This normalization reduces the influence of the scale of the data set on the modified mountain function. Introducing the parameter m in the modified mountain function P_1(x_i) is an important step, because by taking β as the inverse of the dispersion of the data set, m in (5) becomes the only parameter that depends on the data set. We mention that the idea of decomposing the parameter α in the original mountain function (1) into mβ in the modified mountain function (5), with β the inverse of the sample variance, can also be used in Yager and Filev's mountain method. In fact, the parameter mβ in (5) and the parameter α in (1) both determine the approximate density shape of the data set. From this point of view, the difference between the mountain function (1) and the modified mountain function (5) is that (1) is defined on the grid nodes whereas (5) is defined on the data points. In this paper, we focus on the parameter m in the modified mountain function (5) and propose a tool to acquire a good estimate of m. According to our experiments, most estimates of m fall within an expected range when the normalization term β is used, which helps us to analyze m more efficiently. The tool for analyzing m (equivalently, for approximating the density shape) using a correlation self-comparison technique is illustrated as follows.

3.1 The correlation self-comparison procedure

In order to analyze the parameter m using the correlation self-comparison procedure, we provide a data set with four clusters as shown in Fig. 1a. In Fig. 1b–d, we show the 3D plots of P_1(x_i) with m=1, 5 and 10, respectively. The "+" sign represents the value of P_1(x_i) for the data point x_i. A poor density shape estimate occurs when m=1. However, when we increase the value of m to 5 or 10, both give a good density shape estimate via the modified mountain function. Although this graphical method can help us to estimate the density shape in both the mountain method and the modified mountain method, such a subjective method is restricted to low-dimensional data sets. A more efficient and precise way to find a good estimate for m is constructed with the correlation self-comparison. The 3D plots of P_1(x_i) in Fig. 1b, c for m=1 and m=5 show clear differences, whereas Fig. 1c, d are similar. The reason is that the values of P_1(x_i) for m=5 and m=10 are highly related: in the statistical sense, the correlation of the values of P_1(x_i) for m=5 and m=10 is very close to one. If the correlation between the values of P_1(x_i) for "m=1 and m=5", "m=5 and m=10" or "m=10 and m=15" is larger than a given threshold, then this m provides a good approximate density shape for the data. For the data set of Fig. 1a, panels c and d show high similarity. In general, increasing m decreases the neighborhood radius of the data
points. If the correlation of the values of P_1(x_i) is larger than a given threshold, then decreasing the neighborhood radius of the data points beyond this point no longer alters the approximate density shape much. Thus, when the approximate density shape is stable for this m value, it should be a good estimate of the exact density shape. In general, we suggest choosing the threshold as 0.99. A large increment of m in our self-comparison procedure may miss a good estimate of the parameter m. However, a small increment of m decreases the neighborhood radius of the data points so slowly that the modified mountain function (5) shows no significant difference between these small changes of m. Of course, an ideal choice of the increment of m depends on the data set, so it is difficult to find an ideal increment in general. In practice, an increment of 5 can be adopted and a threshold of 0.99 is large enough for the correlation self-comparison procedure. For example, the correlation self-comparisons for the data in Fig. 1 with "m=1 and m=5", "m=5 and m=10" and "m=10 and m=15" are 0.683, 0.977 and 0.996, respectively. Thus, we can choose m=10 for this data set.
Fig. 1 a The data set. b–d 3D plots of the modified mountain function with m=1, 5 and 10, respectively
In order to execute the correlation self-comparison procedure as a computer program, the modified mountain function is rewritten as

P_1^{m_0}(x_i) = \sum_{j=1}^{n} e^{-m_0 \beta d(x_i, x_j)} \quad \text{and} \quad P_1^{m_l}(x_i) = \sum_{j=1}^{n} e^{-m_l \beta d(x_i, x_j)}
where m_0 is set to the fixed constant 1 and m_l = 5l, l = 1, 2, 3, .... The correlation self-comparison procedure can then be summarized as follows (a code sketch is given below).

Correlation self-comparison algorithm
1. Set l = 1 and w = 0.99.
2. Calculate the correlation of the values of P_1^{m_(l-1)}(x_i) and P_1^{m_l}(x_i).
3. IF the correlation is greater than or equal to the specified w, THEN choose P_1^{m_(l-1)} to be the modified mountain function; ELSE set l = l + 1 and GOTO step 2.
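A minimal sketch of this procedure follows, assuming squared Euclidean distance for d(x_i, x_j); the increment 5 and the threshold w = 0.99 follow the recommendations given above, and the function names are illustrative.

```python
import numpy as np

def modified_mountain(X, m):
    """P_1(x_i) of Eq. (5) with the normalization beta of Eq. (6)."""
    beta = 1.0 / ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()       # Eq. (6)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)            # pairwise d(x_i, x_j)
    return np.exp(-m * beta * d).sum(axis=1), beta

def select_m(X, w=0.99, step=5, m_max=100):
    """Correlation self-comparison: return the selected m and its P_1 values."""
    m_prev, (P_prev, _) = 1, modified_mountain(X, 1)                  # step 1: l = 1, m_0 = 1
    m_next = step
    while m_next <= m_max:
        P_next, _ = modified_mountain(X, m_next)                      # step 2: P_1 for m_l = 5l
        if np.corrcoef(P_prev, P_next)[0, 1] >= w:                    # step 3: density shape is stable
            return m_prev, P_prev
        m_prev, P_prev, m_next = m_next, P_next, m_next + step
    return m_prev, P_prev                                             # fallback if never stable
```

With the correlations reported above for the data set of Fig. 1a (0.683, 0.977, 0.996), this loop would stop at the comparison "m=10 and m=15" and return m = 10.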
We now use some data sets to test the efficiency of the correlation self-comparison in searching for a good modified mountain function (i.e. a good density shape estimate for the data set). Figure 2a–f shows six artificial data sets. The values of the correlation self-comparison for each data set are shown in Table 1. After the parameters are selected according to the self-comparison procedure, the modified mountain functions for these data sets are shown in Fig. 2g–l. The modified mountain function gives a good density shape estimate for each of these various data sets. The parameter β, the inverse of the sample variance, normalizes the dissimilarity measure d(x_i, x_j) and helps us to find m more efficiently. We mention that Beni and Liu [17] proposed a least biased fuzzy clustering algorithm by maximizing the clustering entropy with no assumptions on the number of clusters or their initial positions. They set the parameter β′ to β/β_max such that the range of β′ is between 0 and 1. They then set up a quantity P(c) as the probability of obtaining c clusters under different values of β′. Based on this multiscale analysis, they can determine the optimum number c* of clusters and then find the final cluster centers and partitions. Our procedure of decomposing α into mβ, with β the inverse of the sample variance, has a similar merit to Beni and Liu's multiscale method, although the two have different considerations. Our procedure considers the modified mountain function and uses the correlation self-comparison method together with a newly proposed validity index to obtain an optimum number of clusters.
Fig. 2 The data sets and their modified mountain functions
Table 1 Correlation self-comparison

Data set                    m=1 and m=5   m=5 and m=10   m=10 and m=15   m=15 and m=20   Selected m value
Figure 2a                   0.8012        0.9996         –               –               5
Figure 2b                   0.9455        0.9790         0.9926          –               10
Figure 2c                   0.7313        0.9551         0.9899          0.9951          15
Figure 2d                   0.5401        0.9663         0.9972          –               10
Figure 2e                   0.7367        0.9331         0.98998         0.9970          15
Figure 2f                   0.8483        0.8925         0.9796          0.9950          15
Example 1                   0.8640        0.7390         0.8110          0.9920          15
Iris                        0.5200        0.9220         0.9850          0.9940          15
Iris (last two features)    0.2920        0.9110         0.9870          0.9960          15
Figure 8a                   0.7710        0.9880         0.9930          –               10
Figure 8b                   0.7620        0.9890         0.9960          –               10
Figure 8c                   0.7810        0.9870         0.9960          –               10
Figure 8d                   0.8030        0.9900         –               –               5
3.2 The modified revised mountain function

After the correlation self-comparison algorithm is implemented, the modified mountain function (5) is acquired. We then search for the kth cluster center using the modified revised mountain function

P_k(x_i) = P_{k-1}(x_i) - P_{k-1}(x_i) \, e^{-\beta' d(x_i, x^*_{k-1})}, \quad k = 2, 3, \ldots \qquad (7)

where x_i is the feature vector and x^*_{k-1} is the (k−1)th cluster center, which satisfies

P_{k-1}(x^*_{k-1}) = \max_i \{ P_{k-1}(x_i) \}, \quad k = 2, 3, \ldots \qquad (8)
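In code, one extraction step of Eqs. (7) and (8) is a single multiplicative update of the current mountain values; a minimal sketch, assuming squared Euclidean distance and the β of Eq. (6):

```python
import numpy as np

def extract_next_center(P, X, prev_center, beta):
    """One step of Eq. (7) followed by Eq. (8).

    P holds the current P_{k-1}(x_i) for all data points; prev_center is x*_{k-1}."""
    d_prev = ((X - prev_center) ** 2).sum(axis=1)          # d(x_i, x*_{k-1})
    P_new = P * (1.0 - np.exp(-beta * d_prev))             # Eq. (7) written multiplicatively
    return P_new, X[int(np.argmax(P_new))]                 # Eq. (8): next cluster center candidate
```

Because the damping factor 1 − e^{−βd} lies in [0, 1), the updated values can never become negative, which is the property of the modified revised mountain function discussed later in this subsection.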
Fig. 3 a–d The modified revised mountain function after extracting the first, second, third and fourth clusters, respectively

The parameter β′ of this modified revised mountain function determines the extracted cluster shape and also determines the potential of each data point to be the next cluster center after each cluster is extracted. The data
points near the first extracted cluster center will have greatly reduced potential and are therefore unlikely to be selected as the next cluster center. To avoid obtaining cluster centers that are too close to each other, we choose β′ in (7) to be somewhat smaller than mβ in (5). Thus, we may take the parameter β′ in (7) as mβ divided by m (i.e. β′ = mβ/m = β), so that β′ is smaller than mβ. One problem with the revised mountain function (3) in the original mountain method is that the extracted cluster shapes are identical for all clusters, although the shape of the data distribution is not necessarily identical for different clusters. The original mountain method uses a fixed density shape to extract each cluster around each newly identified cluster center, as shown in (3). The proportion of the extracted quantity for each node N_i is

M_{k-1}(N^*_{k-1}) \, e^{-\gamma d(N^*_{k-1}, N_i)}, \qquad (9)
which is independent of its revised mountain function M_{k-1}(N_i), so the shapes of the extracted clusters are all equal. As a consequence, the revised mountain function M_{k-1}(N_i) in (3) becomes negative for some grid nodes, and it is not reasonable for a grid node to have a negative potential M_{k-1}(N_i). Therefore, Yager and Filev [12] needed to set the negative revised mountain function values to zero after each extraction. In our modified revised mountain function (7), the reduction of each data point x_i after finding the (k−1)th cluster center x^*_{k-1} is based on its (k−1)th revised mountain function P_{k-1}(x_i). The proportion of the extracted quantity for each data point is

P_{k-1}(x_i) \, e^{-\beta d(x_i, x^*_{k-1})}, \qquad (10)
which depends on its revised mountain function P_{k-1}(x_i); the shape of the extracted cluster therefore also depends on P_{k-1}(x_i). This modification allows somewhat different density shapes for each extracted cluster, and the revised mountain function is always positive. To demonstrate the behavior of the modified revised mountain function (7), we use the data set in Fig. 1a. The values of the modified revised mountain function after extracting the first, second, third and fourth clusters are shown in Fig. 3a, b, c and d, respectively.

3.3 The performance measure for the cluster validity

Many cluster validity indexes (see Refs. [8–11]) are measured on the basis of clustering algorithms such as the fuzzy c-means (FCM), and hence the results depend heavily on the performance of the clustering algorithms. In this subsection, we propose a new validity measure for our modified mountain method, which is simple and does not depend on any clustering method. The idea is to measure the potential of the newly identified cluster center to be a new cluster center based on the modified mountain method and to select the number of clusters for which the total potential is maximum. The validity function is defined as

MV(c) = \sum_{k=2}^{c} pot(k), \quad c = 2, 3, \ldots, n-1 \qquad (11)
where c denotes the number of clusters. The function pot(k) is the potential of the kth cluster center x^*_k and is defined as

pot(k) = P_1(x^*_k) \, \frac{P_1(x^*_k)}{P_1(x^*_1)} - n \exp\!\left( -m (d_k/\beta)^2 \right), \quad k = 2, 3, \ldots \qquad (12)

where d_k is the minimum distance between x^*_k and all (k−1) previously identified cluster centers, i.e.

d_k = \min \{ d(x^*_k, x^*_{k-1}), d(x^*_k, x^*_{k-2}), \ldots, d(x^*_k, x^*_1) \}. \qquad (13)
If the modified mountain function (5) for the newly identified cluster center x^*_k is large (i.e. P_1(x^*_k) is large), then it has a large potential to be a cluster center. We use the term P_1(x^*_k)/P_1(x^*_1) to measure the degree of compactness of the kth extracted cluster. If the kth cluster is dense, it has a large value of P_1(x^*_k)/P_1(x^*_1) and hence a large potential to be a new cluster center. If the kth cluster is disperse, pot(k) is reduced greatly through the term P_1(x^*_k)/P_1(x^*_1). In fact, the term P_1(x^*_k)/P_1(x^*_1) was used by Yager and Filev [12] as the stopping condition for the original mountain method; however, that stopping condition controls only the compactness. In our validity function pot(k) in (12), we also consider the separation measure (d_k/β)². A data point that is close to any previously identified cluster center will also have a large modified mountain function value, so we should reduce the potential of x^*_k if it is close to any previous cluster center. The term d_k/β measures the degree of reduction. If d_k is small, the potential of x^*_k is reduced greatly; if d_k is large, x^*_k has a large potential to be a new cluster center. If d_k is very small compared with β for the data set, the kth cluster center x^*_k may be close to one of the previously identified cluster centers, and pot(k) is reduced greatly. We thus use (d_k/β)² to measure the separation of the kth extracted cluster.

Fig. 4 Clustering result and the validity index MV(c) for the data set in Fig. 1a

The result of applying the modified mountain method to the data set in Fig. 1a is shown in Fig. 4. Figure 4a
presents the validity function MV(c), and c=4 is the best number of clusters for this data set. The solid circle points in Fig. 4b are the four identified cluster centers. The modified mountain method provides a good cluster number estimate and good initial guesses (cluster centers) in this example. The proposed algorithm is summarized as follows (a code sketch is given at the end of this section).

The modified mountain clustering algorithm
1. Acquire the modified mountain function using the correlation self-comparison algorithm.
2. Find the kth cluster center x^*_k using the modified revised mountain function (7) and condition (8).
3. Calculate MV(c), c = 2, 3, ..., (n−1).
4. Choose the cluster number estimate with the maximum value of MV(c) and select these c extracted cluster centers.

If the cluster number is known, this method can provide good initial guesses via the density shape estimate. Even for data with an unknown number of clusters, it provides both a cluster number estimate and initial guesses. The validity problem is solved by the proposed validity measure function, which includes compactness and separation measures for each extracted cluster. The results do not depend on a clustering algorithm whose performance is always affected by the initial cluster centers. Some numerical examples are given in the next section.
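The whole procedure can be sketched as follows. The sketch reuses the select_m helper from the sketch in Sect. 3.1, assumes squared Euclidean distance, and computes pot(k) according to Eq. (12) as given above; it is an illustrative reading of the algorithm, not the authors' implementation.

```python
import numpy as np

def modified_mountain_clustering(X, c_max=10, w=0.99):
    """Unsupervised approximate clustering with the modified mountain method."""
    n = X.shape[0]
    m, P1 = select_m(X, w=w)                        # step 1: density shape via self-comparison
    beta = 1.0 / ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

    P, centers, pots = P1.copy(), [int(np.argmax(P1))], []
    for k in range(2, c_max + 1):                   # step 2: extract candidate centers, Eqs. (7)-(8)
        P = P - P * np.exp(-beta * d[:, centers[-1]])           # beta' = beta
        cand = int(np.argmax(P))
        dk = min(d[cand, c] for c in centers)       # Eq. (13): distance to the closest earlier center
        pots.append(P1[cand] * P1[cand] / P1[centers[0]]        # Eq. (12): compactness term
                    - n * np.exp(-m * (dk / beta) ** 2))         # minus the separation penalty
        centers.append(cand)

    MV = np.cumsum(pots)                            # step 3: MV(c) = sum of pot(k), Eq. (11)
    c_best = int(np.argmax(MV)) + 2                 # step 4: cluster number with maximal MV(c)
    return X[centers[:c_best]], c_best, MV
```

The returned centers can either be reported directly as an approximate clustering or be passed to FCM as initial values, as is done in Sect. 5.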
4 Examples

In this section, we present some examples with numerical and real data to demonstrate the effectiveness of the proposed modified mountain algorithm.

Example 1. This is a three-dimensional data set with six clusters, as shown in Fig. 5a. In this example, we cannot plot the modified mountain function of Eq. (5), so we cannot see how the approximated density shape changes as the parameter m changes. We
use the proposed correlation self-comparison procedure to automatically search for a suitable m value. The correlation self-comparison is described in Table 1. The parameter m was chosen to be 15 to meet the threshold. The approximated cluster centers (solid circle points) are shown in Fig. 5a. According to the validity function MV(c) in Fig. 5b, the optimal cluster number is c=6. Although this example is a higher-dimensional case in which the density shape estimate cannot be inspected graphically, the correlation self-comparison procedure and the validity measure MV(c) give a reasonable result. The modified mountain method provides good initial guesses for a clustering algorithm and a good cluster number estimate for the data set. It also gives a good unsupervised approximate clustering result.

Example 2. This is the well-known Iris data set, whose real cluster number is 3. The self-comparison of correlations is described in Table 1. The density shape estimate m is chosen to be 15 to meet the threshold 0.99. The result of the modified mountain method is shown in Fig. 6. In Fig. 6a, we use a three-dimensional plot to present the four-dimensional Iris data set. The three identified cluster centers are also plotted with solid circle points. The plot of MV(c) (Fig. 6b) shows that c=3 is a good cluster number estimate. It also offers the information that c=2 may be another good choice. Although one may argue that c=2 is also a good choice, and many validity indexes also indicate that c=2 is optimal for the Iris data, the data really consist of three groups: Iris Setosa, Iris Versicolor and Iris Virginica. For unsupervised clustering, we think that both c=2 and c=3 are good estimates for the Iris data set. We also cluster the last two features of the Iris data. The self-comparison of correlations is shown in Table 1. The parameter m was chosen to be 15. The results, shown in Fig. 7, are similar to those for the four-dimensional Iris data.

Example 3. We use this example to show the performance of our validity function MV(c). Figure 8 shows the data sets and the identified cluster centers (solid circle points). The parameter m for each data set is chosen to meet the threshold, as shown in Table 1. Table 2 shows the MV(c) values for each case.
Fig. 5 Clustering result and the validity index MV(c) for Example 1
In Fig. 8a, there are two randomly generated clusters and five artificial points. The reasonable cluster number for this data set is 3, and the maximum value of MV(c) gives a consistent result. In Fig. 8b, we reduce the number of artificial points so that they no longer have enough potential to form a cluster. In Fig. 8c, we increase the scale of the artificial points; these points become too scattered and do not have enough potential to form a cluster. In Fig. 8d, we decrease the distance between the five artificial points and the two randomly generated clusters; the clusters become overlapped and the artificial cluster is difficult to distinguish from the other clusters. This example shows that our validity measure function MV(c) can reflect the relationships between clusters, including the cluster size, the cluster scale and the distance between clusters.
5 Comparisons and computational complexity

In this section, we compare the performance of the proposed modified mountain method with the original mountain method on data sets drawn from normal mixtures. To demonstrate the effectiveness of the proposed validity index MV(c), we choose two popular validity indexes, the partition coefficient (PC) [8] and the Xie and Beni index (XB) [11], and compare their results. We also analyze the computational complexity of the algorithms.

Example 4. We randomly generate a normal mixture data set. The mixed distributions are N(0,1), N(3,0.5), N(6,1), N(9,0.5) and N(12,1), with equal mixture proportions. The histogram of this mixture data is shown in Fig. 9a. The data set contains 250 data points. Under the threshold 0.99, m=15 is chosen in this example. The approximate density shape is shown in Fig. 9b and is consistent with the histogram. The validity function MV(c) is shown in Fig. 9c. The optimal cluster number of this data set is five.
Fig. 6 Clustering result and the validity index MV(c) for the Iris data set
Figure 9b also shows these five extracted clusters with the solid circle points and their extraction order. We next implement the mountain method proposed by Yager and Filev [12]. The first step is to define the grid nodes over the data space; we choose 201 grid nodes {−5.0, −4.9, ..., 14.9, 15.0}. The second step is to determine the parameter α in Eq. (1), which is equivalent to mβ in Eq. (5) of the modified mountain method. The third step is to determine the revised mountain function parameter γ in (3). In this example, four values of γ are implemented: γ=β, γ=5β, γ=15β and γ=25β. The results are shown in Table 3. Since, after extracting four cluster centers, all revised mountain function values of Eq. (3) for each grid node are zero in the case of γ=β, only four clusters are found. The second column of Table 3 presents these four extracted grid nodes. The third column presents the proportion M_k(N^*_k)/M_1(N^*_1), which is used as the stopping condition: the mountain method stops when the proportion is less than a given threshold δ. This is the fourth step of the original mountain method, choosing a suitable δ to determine the cluster number of the data set. When γ=5β and γ=15β, the original mountain method can extract five suitable cluster centers with δ=0.5. Under the same threshold, however, seven clusters are extracted when γ=25β. We find that the original mountain method is a good tool for searching for suitable cluster centers, but it depends heavily on the parameters α, γ and δ. Our modified mountain method improves the original scheme with the following three modifications. First, to reduce the effect of the parameter α, we use the correlation self-comparison to approximate the density shape; this method is also workable in high dimensional data spaces. Second, to reduce the effect of the parameter γ, we modify the revised mountain function to allow different density shapes for each extracted cluster. In fact, our modified revised mountain function is always positive, so it avoids the situation in which the revised function values for the grid nodes are all zero, as shown in Table 3 for γ=β (where only four clusters can be extracted). Finally, to reduce the effect of the parameter δ, we propose a new cluster validity function to measure the potential of an extracted cluster to be a newly identified cluster and then select the cluster number with the maximum total potential.
Fig. 7 Clustering result and the validity index MV(c) for the last two features of the Iris data set
In this example, we also simulate another normal mixture data set. The mixed distributions are N(0,1), N(5,2), N(10,3), N(15,4) and N(20,5), with equal mixture proportions. The histogram of this mixture data is shown in Fig. 10a. The data set contains 500 data points.
Fig. 8 Data sets and results for Example 3. Solid circle points present the identified cluster centers resulting from the modified mountain method
Under the threshold 0.99, m=15 is obtained on the basis of the proposed correlation self-comparison algorithm. The approximate density shape is shown in Fig. 10b and is consistent with the histogram. The validity function MV(c) shown in Fig. 10c indicates that c=4. Figure 10b shows the four extracted clusters with the solid circle points and their extraction order. It is difficult to distinguish between four and five clusters in this mixture data; the validity function MV(c) has large values at both c=4 and c=5. The results of the original mountain method are shown in Table 4, where four values of γ are implemented. Only three clusters can be extracted when γ=β. If we set δ=0.5, the cluster number estimates are 2, 4, 4 and 6 when γ equals β, 5β, 10β and 15β, respectively. We find that the proposed modified mountain method actually presents better results than the original mountain method.
Table 2 The validity function MV(c) for Example 3

Cluster    Figure 8a   Figure 8b   Figure 8c   Figure 8d
2          -84.6       -83.4       -84.4       -78.5
3          -84.3       -134.0      -135.6      -128.1
4          -134.6      -168.4      -177.1      -184.9
5          -175.0      -168.2      -177.0      -184.5
6          -193.5      -181.9      -196.5      -213.1
7          -212.2      -195.8      -216.2      -241.7
8          -238.9      -217.0      -244.0      -279.2
9          -259.9      -232.9      -250.8      -310.5
Estimate   3           2           2           2
There are many prototype-based clustering algorithms [2, 3, 6, 9] that can be used to obtain clusters. Because the cluster number in these algorithms needs to be specified a priori, these methods must be combined with validity indexes to become unsupervised clustering algorithms. For a given cluster-number range, the validity function is evaluated for each cluster number and an optimal number is then chosen based on these validity measures. We choose the two well-known validity indexes, PC [8] and XB [11], for the comparisons; they are briefly reviewed as follows.

(a) The partition coefficient (PC), the first validity index proposed in association with FCM [8], is defined by

PC(c) = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^2 \qquad (14)

where (1/c) ≤ PC(c) ≤ 1. In general, an optimal cluster number c* is obtained by solving \max_{2 \le c \le n-1} PC(c) to produce the best clustering performance for the data set.

(b) The validity function proposed by Xie and Beni [11] is defined by

XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^2 \, \| x_j - a_i \|^2}{n \, \min_{i \ne j} \| a_i - a_j \|^2}. \qquad (15)

In general, an optimal c* is obtained by solving \min_{2 \le c \le n-1} XB(c) to produce the best clustering performance for the data set. In the next example, we compare our modified mountain method with the FCM clustering results under these two validity indexes, PC and XB.

Example 5. This is a three-dimensional data set containing 11 clusters with different shapes and sizes, as shown in Fig. 11a. The data set contains 800 data points. Under the threshold 0.99, m=20 is obtained according to the proposed correlation self-comparison algorithm. The validity function MV(c) shown in Fig. 11b indicates that c=11 is a good cluster number estimate for this data set. The extracted cluster centers (solid circle points) are shown in Fig. 11c and represent the data structure well. We then implement the FCM algorithm with the PC and XB indexes. Figure 12a, b shows the results of PC and XB, where the fuzzy memberships are generated by the FCM algorithm with random initial values. Neither index can indicate the correct cluster number for the data set; this may be because the data lack structure or because of bad initial values. The results of FCM with 11 random initial values are shown in Fig. 12c, where we can see why the validity indexes lose the ability to detect the correct cluster number. This also reveals the problem that even when the clustering algorithm and the validity index are both well designed, the results still depend on the initial specifications. If we take the extracted cluster centers obtained by the modified mountain method as the initial values for the FCM algorithm, the results become better. We find that, in this case, the PC and XB indexes shown in Fig. 12d, e actually indicate the cluster number estimate c=11. Figure 12f shows the FCM clustering results with the modified mountain outputs as initial values.
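The PC and XB indexes used in this comparison can be computed directly from the FCM output; a minimal sketch, where u is the c × n fuzzy membership matrix and a the c × m prototype matrix produced by any FCM implementation:

```python
import numpy as np

def partition_coefficient(u):
    """PC(c) of Eq. (14); u is the (c, n) fuzzy membership matrix."""
    return (u ** 2).sum() / u.shape[1]

def xie_beni(u, a, X):
    """XB(c) of Eq. (15); a is the (c, m) prototype matrix, X the (n, m) data."""
    d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)           # ||x_j - a_i||^2
    compactness = (u ** 2 * d2).sum()
    min_sep = min(((a[i] - a[j]) ** 2).sum()
                  for i in range(len(a)) for j in range(len(a)) if i != j)
    return compactness / (X.shape[0] * min_sep)
```

PC is maximized and XB is minimized over the candidate cluster numbers c = 2, ..., n − 1.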
Fig. 9 a The histogram of the normal mixture data. b The modified mountain function. c The validity function MV(c). Five extracted cluster centers (solid circle points) and their extraction order are also shown in b. The algorithm takes 2.140625 s
Table 3 The results of the original mountain method for the data set of Fig. 9a

Cluster    γ=β                        γ=5β                       γ=15β                      γ=25β
number     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)
1          3      1                   3      1                   3      1                   3      1
2          11.9   0.86266             8.9    0.98857             8.9    0.98863             8.9    0.98863
3          8.8    0.32973             12     0.80393             11.9   0.87398             11.9   0.87449
4          0.3    0.05408             0      0.57173             5.6    0.73787             5.6    0.74131
5          –      –                   5.8    0.55582             0.4    0.67731             0.4    0.68065
6          –      –                   12.1   0.00142             7.2    0.36187             4.3    0.54364
7          –      –                   2      0.00125             1.1    0.30144             1.8    0.50566
8          –      –                   –      –                   4.3    0.28657             7.3    0.49738
9          –      –                   –      –                   10.4   0.26508             10.4   0.46664
10         –      –                   –      –                   1.7    0.25402             0.9    0.39182
Fig. 10 a The histogram of the normal mixture data. b The modified mountain function. c The validity function MV(c). Four extracted cluster centers (solid circle points) and their extraction order are also shown in b. The algorithm takes 5.546875 s
Finally, the computational complexity is analyzed. In the modified mountain method, each correlation self-comparison step requires computing Eq. (5) for all n data points in an m-dimensional space, so the computational complexity is O(n²mt), where t is the number of iterations of the correlation self-comparison algorithm. In most of our experiments, the selected value of m falls in the interval [5, 20], and hence t is a small positive integer. To obtain MV(c) in Eq. (11), we need to search for x^*_{k-1} in Eq. (8) and compute Eqs. (7) and (12); the computational complexity is O(nmc_max), where c_max is the maximum possible number of clusters. In the original mountain method, the first step is to determine the grid nodes. If the number of grid points in each dimension is l, we have l^m m-dimensional grid nodes over the data space, and the computational complexity of computing Eq. (1) is O(l^m nm). The second step is to search for N^*_{k-1} in Eq. (4) and to compute d(N^*_{k-1}, N_i) and M_k(N_i) in Eq. (3), which takes O(l^m m). The computational complexity of the original mountain method is therefore a monotonically increasing function of the data dimension. The modified mountain function (5) and revised mountain function (7), being defined on the data vectors rather than on the grid nodes, can efficiently reduce the computational complexity, especially in high dimensional cases. In FCM, the computational complexity is O(ncmt_c), where t_c is the number of iterations when the cluster number is c. For finding a cluster number estimate of a given data set, we need to process FCM with c=2 to c=c_max. Thus, the computational complexity is O(n(c_max − 1)st*), where t* = \sum_{c=2}^{c_max} t_c.
Table 4 The results of the original mountain method for the data set of Fig. 10a

Cluster    γ=β                        γ=5β                       γ=10β                      γ=15β
number     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)
1          0.4    1                   0.4    1                   0.4    1                   0.4    1
2          12.5   0.77876             7.1    0.89724             6.7    0.92762             6.6    0.92918
3          21.9   0.21093             12.8   0.79281             12.3   0.8695              12.2   0.87534
4          –      –                   18.3   0.51314             16.9   0.62902             16.4   0.6595
5          –      –                   23.2   0.24583             21.1   0.40175             3.7    0.52711
6          –      –                   26.7   0.06372             3.9    0.33523             9.3    0.51696
7          –      –                   30.8   0.03314             9.4    0.27469             20     0.4599
8          –      –                   5.2    0.03231             25.1   0.20101             23.2   0.28078
9          –      –                   16.7   0.00949             14.9   0.07958             2      0.21873
10         –      –                   –      –                   2.3    0.0735              14.3   0.20782
Fig. 11 a The data set with the clusters of different shapes and sizes. b The validity function MV(c). c The extracted cluster centers (solid circle points)
6 Conclusions

We modified the mountain method with three important and reasonable modifications. First, we used the correlation self-comparison to approximate the true density shape and set a threshold to acquire a good density shape parameter estimate. This parameter corresponds to the kernel width of the Parzen density estimation and determines the performance of the mountain method. It is difficult to give a general estimation form for this parameter that works well for various data sets; however, a good parameter estimate can easily be obtained using our correlation self-comparison, and it retains a good performance even in high dimensional cases. Second, our modification of the revised mountain function allows somewhat different density shapes for each extracted cluster and successfully reduces the sensitivity to the choice of the revised mountain function parameter in the original mountain method. Moreover, our modified revised mountain function is always positive. Third, we proposed a new cluster validity measure function MV(c) to measure the potential of an extracted cluster to be a newly identified cluster and to select the cluster number with the maximum total potential. If the cluster number is known, our modified mountain method can provide good initial guesses via a good density shape estimate obtained with the correlation self-comparison. Even if the cluster number is unknown, our modified mountain method can provide a good cluster number estimate via the newly proposed validity measure function MV(c). Thus, our modified mountain method can serve as an unsupervised approximate clustering method for the analysis of a grouped data set. Some numerical comparisons and real data are given to show the effectiveness and simplicity of the proposed modified mountain clustering algorithm. Finally, the computational complexity of the algorithms is analyzed.
Fig. 12 a, b PC and XB validity indexes based on FCM using 11 random initial values. c The results of FCM with 11 random initial values. d, e PC and XB validity indexes based on FCM using the initial values obtained by the modified mountain method. f The results of FCM with the initial values obtained by the modified mountain method
Acknowledgements The authors would like to thank the anonymous reviewers for their helpful comments and suggestions to improve the presentation of the paper. This work was supported in part by the National Science Council of Taiwan, under Grant NSC-89-2213-E-033-057.
References

1. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
2. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
3. Yang MS (1993) A survey of fuzzy clustering. Math Comput Modelling 18:1–16
4. Zadeh LA (1965) Fuzzy sets. Inform Control 8:338–353
5. Höppner F, Klawonn F, Kruse R, Runkler T (1999) Fuzzy cluster analysis. Wiley, New York
6. Wu KL, Yang MS (2002) Alternative c-means clustering algorithm. Pattern Recognit 35:2267–2278
7. Yang MS, Hu YJ, Lin KCR, Lin CCL (2002) Segmentation techniques for tissue differentiation in MRI of ophthalmology using fuzzy clustering algorithms. Magn Reson Imaging 20:173–179
8. Bezdek JC (1974) Cluster validity with fuzzy sets. J Cybernetics 3:58–73
9. Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11:773–781
10. Windham MP (1982) Cluster validity for the fuzzy c-means clustering algorithm. IEEE Trans Pattern Anal Mach Intell 11:357–363
11. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13:841–847
12. Yager RR, Filev DP (1994) Approximate clustering via the mountain method. IEEE Trans Syst Man Cybern 24:1279–1284
13. Chiu SL (1994) Fuzzy model identification based on cluster estimation. J Intell Fuzzy Syst 2:267–278
14. Pal NR, Chakraborty D (2000) Mountain and subtractive clustering method: improvements and generalizations. Int J Intell Syst 15:329–341
15. Yager RR, Filev DP (1994) Generation of fuzzy rules by mountain clustering. J Intell Fuzzy Syst 2:209–219
16. Velthuizen BP, Hall LO, Clarke LP, Silbiger ML (1997) An investigation of mountain method clustering for large data sets. Pattern Recognit 30:1121–1135
17. Beni G, Liu X (1994) A least biased fuzzy clustering method. IEEE Trans Pattern Anal Mach Intell 16:954–960