Pattern Anal Applic (2005) 8: 125–138 DOI 10.1007/s10044-005-0250-9
THEORETICAL ADVANCES
Miin-Shen Yang · Kuo-Lung Wu
A modified mountain clustering algorithm
Received: 5 March 2004 / Published online: 24 June 2005 © Springer-Verlag London Limited 2005
Abstract In this paper, we modify the mountain method and create a modified mountain clustering algorithm. The proposed algorithm can automatically estimate the parameters in the modified mountain function in accordance with the structure of the data set, based on the correlation self-comparison method. It can also estimate the number of clusters based on the proposed validity index. As a clustering tool for a grouped data set, the modified mountain algorithm becomes a new unsupervised approximate clustering method. Some examples are presented to demonstrate the simplicity and effectiveness of this algorithm, and its computational complexity is also analyzed.

Keywords Mountain method · Modified mountain algorithm · Parameter estimation · Validity index · Unsupervised clustering

M.-S. Yang (&)
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li, Taiwan, 32023, ROC
E-mail: [email protected]

K.-L. Wu
Department of Information Management, Kun Shan University of Technology, Yung-Kang, Tainan, Taiwan, 71023, ROC
1 Introduction

Cluster analysis is a tool for clustering data with similar characteristics into groups and for discovering the structure of a data set. Clustering methods have been widely applied in various areas such as taxonomy, geology, business, engineering systems, medicine, pattern analysis and image processing (see Refs. [1–3]). Since Zadeh [4] proposed fuzzy set theory, which provided the idea of partial memberships, it has been successfully used in cluster analysis. In the fuzzy clustering literature, objective-function-based algorithms are the most used methods (see Refs. [2, 3]). In particular, the fuzzy c-means (FCM) algorithm and its varieties are the best-known and most widely applied clustering methods (see Refs. [5–7]). However, these clustering algorithms always need some a priori information. The first requirement is the initial guesses of centers or partitions, so the clustering performance is always affected by the initialization. The second is the number of clusters, which needs to be assigned a priori. We shall call these clustering methods semi-unsupervised clustering methods. Many cluster validity indexes [8–11] have been proposed to provide a good estimate of the cluster number for validating clustering algorithms. However, these validity methods are generally based upon clustering algorithms whose performances are known to be sensitive to the initializations. The mountain method, proposed by Yager and Filev [12], is a simple and effective algorithm for approximate clustering. The mountain function is similar to a Parzen window estimate of the probability density function of the feature vectors, and the parameter of the mountain function corresponds to the kernel width of the Parzen density estimate [1]. The method is often used to obtain initial guesses of the cluster centers for clustering algorithms. It can also be used alone as a stand-alone method for approximating the cluster centers. Because the amount of computation of the mountain method grows exponentially with the dimensionality of the data, Chiu [13] modified it by considering the mountain function on the data points instead of the grid nodes. Pal and Chakraborty [14] proposed a scheme to improve the accuracy of the prototypes obtained by Yager's mountain method and Chiu's modified version, and applied it to detect circular shells. Moreover, Yager and Filev [15] applied the mountain method to the generation of fuzzy rules, and Velthuizen et al. [16] applied it to clustering large data sets such as the segmentation of magnetic resonance images of the human brain. However, the performance of these mountain methods depends heavily on the parameters of the mountain
function. In this paper, we first propose a correlation self-comparison method to estimate the parameters. The proposed estimation method is well grounded, being in accordance with the structure of the data set. On the other hand, the stopping condition of the mountain method is another problem. The mountain algorithm for acquiring new cluster centers repeats until the chosen largest mountain value is less than a given δ. However, if the value of δ is specified too large, we may lose some important clusters, and if it is specified too small, the result may have too many clusters. It is difficult to specify a suitable δ value in real applications. In this paper, we propose a new cluster validity measure function as a new stopping condition for the modified mountain method. This validity index is based on the proposed modified mountain function and has the ability to reflect the relations between clusters, including cluster size, cluster scale and the distance between clusters. The remainder of this paper is organized as follows. In Sect. 2, we describe the mountain method and examine the problems arising when using it. In order to solve these problems, we propose a modified mountain clustering algorithm in Sect. 3. We first propose the correlation self-comparison method to approximate the density shape and set a threshold to obtain a suitable density shape estimate. This method works well on various simulated high dimensional data sets. We then modify the revised mountain function. This modification allows somewhat different density shapes for each extracted cluster, so that the modified revised mountain function is always positive. According to the properties of the modified mountain method, we also propose a new cluster validity measure function. Section 4 provides some examples to show the performance of the modified mountain clustering algorithm. In Sect. 5, the performance of the proposed modified mountain method is compared with the original mountain method on data sets from normal mixtures. We also implement FCM with two popular validity indexes, the partition coefficient (PC) [8] and the Xie and Beni index (XB) [11], to find good clustering results for the given data sets, and we compare these results with our modified mountain method. The computational complexity is also analyzed. Finally, conclusions are made in Sect. 6.
2 The mountain method

The mountain method proposed by Yager and Filev [12] is usually used to obtain the initial values of cluster centers. It can also be used as a 'stand alone' technique for approximating the cluster centers. Suppose we have n data points denoted by {x_1, ..., x_n} in an m-dimensional Euclidean space R^m. We then lay a grid over the data space. Let the m-dimensional hypercube be I_1 × ... × I_m, where each interval I_p (p = 1, 2, ..., m) is defined by the range of the n data points in the pth coordinate. Thus, the hypercube I_1 × ... × I_m contains all the points of
the data set {x_1, ..., x_n}. Each interval I_p is discretized into r_p equidistant points. Such a discretization forms an m-dimensional grid in the hypercube with nodes N(i_1, ..., i_m), where the indices i_1, ..., i_m take values from the sets {1, ..., r_1}, ..., {1, ..., r_m}. Let {N_i} denote the set of all grid nodes. In the mountain method, cluster centers are restricted to the grid nodes {N_i} in R^m, and the mountain function for each N_i is calculated as

M_1(N_i) = \sum_{j=1}^{n} e^{-\alpha d(x_j, N_i)}, \quad i = 1, 2, \ldots \qquad (1)
where d(x_j, N_i) is the distance measure between the data point x_j and the grid node N_i. The mountain function value of a node is closely related to the density of the data points in its neighborhood and represents the potential of the grid node to be a cluster center estimate: a node with many neighboring data points has a large mountain function value. The parameter α in Eq. (1) is very important. It determines the neighborhood radius, in the sense that data points outside this radius have only a small influence on the mountain function, and it also determines the approximate density shape of the data set. It is reasonable to take the first cluster center estimate as the node with the maximal mountain function score. Thus, we find N^*_1 (the first cluster center estimate among all grid nodes) with

M_1(N^*_1) = \max_i \{ M_1(N_i) \}. \qquad (2)
Since the nodes close to N^*_1 also have high mountain function values, it is necessary to remove the effect of the identified cluster center before obtaining the next cluster center. The mountain function, after eliminating the effect of the (k−1)th cluster center N^*_{k−1}, is defined by

M_k(N_i) = M_{k-1}(N_i) - M_{k-1}(N^*_{k-1}) \, e^{-\gamma d(N^*_{k-1}, N_i)}, \quad k = 2, 3, \ldots \qquad (3)

where

M_{k-1}(N^*_{k-1}) = \max_i \{ M_{k-1}(N_i) \} \qquad (4)
and the function in (3) is called the revised mountain function. In the revised mountain function (3), the mountain function values of the nodes close to the newly identified cluster center are reduced, and the parameter γ determines the neighborhood radius within which the reduction is substantial. In (1), the optimization problem is reduced to the finite set {N_i}, which also determines the precision of the clustering results. A finer grid gives a more accurate solution over R^m, but it also increases the computational time, especially in high dimensional cases. Chiu [13] therefore modified the mountain method by defining the mountain function on the data vectors. Although the number of data points may be larger than the number of grid nodes, in the original mountain method we must calculate the distance measures
between the nodes and the identified cluster center nodes to obtain the next revised mountain function. The mountain method defined on the data vectors omits this calculation, because the revised mountain functions are also defined on the data vectors. We adopt this modification in one part of our new modified mountain clustering algorithm. A clustering procedure involves two problems that require a priori knowledge. The first is the initial guesses of centers or partitions, which always affect the clustering performance. The second is the number of clusters, which needs to be given a priori. The mountain method can provide good initial guesses for a clustering procedure, relying on a stopping condition to decide whether the identified cluster center has the potential to be a new cluster center in the data set. This is a cluster validity problem. In Yager and Filev's procedure, the algorithm for acquiring new cluster centers repeats until M_k(N^*_k) is less than a given δ, so δ becomes an important factor. If δ is specified too large, we may lose some important clusters; if δ is specified too small, the result may have too many clusters. It is difficult to specify δ as a constant that works well for all examples. Therefore, Chiu [13] developed additional criteria for accepting or rejecting cluster centers. Although these methods can decide the number of identified cluster centers in a simple way, they do not give a performance measure for it. A new concept for measuring the cluster validity will be proposed in the next section.
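To make the grid-based procedure above concrete, the following minimal sketch implements Eqs. (1)–(4); it is an illustration only, not the authors' code, and the squared Euclidean distance, the grid resolution r and the values of α, γ and δ are assumptions chosen for the example.

```python
import numpy as np

def mountain_clustering(X, r=20, alpha=1.0, gamma=1.5, delta=0.1, max_centers=10):
    """Approximate cluster centers with the original mountain method (Eqs. 1-4).

    X : (n, m) data matrix; r grid points per dimension give r**m nodes;
    alpha and gamma are the mountain widths; delta is the stopping threshold
    on M_k(N_k*) / M_1(N_1*)."""
    # Grid nodes N_i inside the hypercube I_1 x ... x I_m spanned by the data.
    axes = [np.linspace(X[:, p].min(), X[:, p].max(), r) for p in range(X.shape[1])]
    nodes = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, X.shape[1])

    # Eq. (1): mountain function on every node (squared Euclidean distance assumed).
    d_node_data = ((nodes[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    M = np.exp(-alpha * d_node_data).sum(axis=1)

    centers, peak = [], M.max()
    for _ in range(max_centers):
        k = int(np.argmax(M))              # Eqs. (2) and (4): node with the largest value
        if M[k] / peak < delta:            # stopping rule on the relative peak height
            break
        centers.append(nodes[k])
        # Eq. (3): remove the influence of the newly identified center from every node.
        d_node_center = ((nodes - nodes[k]) ** 2).sum(axis=1)
        M = np.maximum(M - M[k] * np.exp(-gamma * d_node_center), 0.0)
    return np.array(centers)
```

The number of grid nodes, and hence the cost of evaluating Eq. (1), grows as r^m, which is the exponential growth in the dimensionality mentioned above.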
3 A modified mountain clustering algorithm

The mountain method can provide an approximate clustering result via the density estimation concept. Its performance is sensitive to the density shape parameter α and the revised mountain function parameter γ. In this section, we propose a modified mountain method which produces good initial cluster centers together with an estimate of the number of clusters via an analogous kernel-type density estimation, by defining the modified mountain function on the data points rather than on grid nodes. We use a correlation self-comparison procedure to approximate the exact density shape and set a threshold to obtain a good estimate of the density shape parameter. Thus, the proposed algorithm becomes an unsupervised clustering algorithm. We now define the modified mountain function for each data vector x_i on all data points as

P_1(x_i) = \sum_{j=1}^{n} e^{-m \beta d(x_i, x_j)}, \quad i = 1, \ldots, n \qquad (5)
where

\beta = \left( \frac{\sum_{j=1}^{n} \| x_j - \bar{x} \|^2}{n} \right)^{-1} \quad \text{with} \quad \bar{x} = \frac{\sum_{j=1}^{n} x_j}{n}. \qquad (6)
The parameter β acts as a normalization term that normalizes the dissimilarity measure d(x_i, x_j). This normalization reduces the influence of the scale of the data set on the modified mountain function. Introducing the parameter m in the modified mountain function P_1(x_i) is an important step, because by taking β as the inverse of the dispersion of the data set, m in (5) becomes the only parameter that depends on the data set. We mention that the idea of decomposing the parameter α in the original mountain function (1) into mβ in the modified mountain function (5), with β the inverse of the sample variance, can also be used in Yager and Filev's mountain method. In fact, the parameter mβ in (5) and the parameter α in (1) both determine the approximate density shape of the data set. From this point of view, the difference between the mountain function (1) and the modified mountain function (5) is that (1) is defined on the grid nodes whereas (5) is defined on the data points. In this paper, we focus on the parameter m in the modified mountain function (5) and propose a tool to acquire a good estimate of m. According to our experiments, most estimates of m fall within an expected range when the normalization term β is used, which helps us to analyze m more efficiently. The tool for analyzing m (equivalently, for approximating the density shape) using a correlation self-comparison technique is illustrated as follows.

3.1 The correlation self-comparison procedure

In order to analyze the parameter m using the correlation self-comparison procedure, we provide a data set with four clusters as shown in Fig. 1a. In Fig. 1b–d, we show the 3D plots of P_1(x_i) with m=1, 5 and 10, respectively. The "+" sign represents the value of P_1(x_i) for the data point x_i. A poor density shape estimate occurs when m=1. However, when we increase the value of m to 5 or 10, both give a good density shape estimate via the modified mountain function. Although this graphical method can help us to estimate the density shape in both the mountain method and the modified mountain method, such a subjective method is restricted to low-dimensional data sets. A more efficient and precise way to find a good estimate for m is constructed with the correlation self-comparison. The 3D plots of P_1(x_i) in Fig. 1b, c for m=1 and m=5 show clear differences, whereas Fig. 1c, d are similar. The reason is that the values of P_1(x_i) for m=5 and m=10 are highly related: in the statistical sense, the correlation of the values of P_1(x_i) for m=5 and m=10 is very close to one. If the correlation between the values of P_1(x_i) for "m=1 and m=5", "m=5 and m=10" or "m=10 and m=15" is larger than a given threshold, then this m provides a good approximate density shape for the data. For the data set of Fig. 1a, panels c and d show high similarity. In general, increasing m decreases the neighborhood radius of the data
points. If the correlation of the values of P_1(x_i) is larger than a given threshold, then decreasing the neighborhood radius of the data points beyond this point no longer alters the approximate density shape much. Thus, when the approximate density shape is stable for this m value, it should be a good estimate of the exact density shape. In general, we suggest choosing the threshold as 0.99. A large increment of m in our self-comparison procedure may miss a good estimate of the parameter m. However, a small increment of m decreases the neighborhood radius of the data points so slowly that the modified mountain function (5) shows no significant difference between these small changes of m. Of course, an ideal choice of the increment of m depends on the data set, so it is difficult to find an ideal increment in general. In practice, an increment of 5 can be adopted and a threshold of 0.99 is large enough for the correlation self-comparison procedure. For example, the correlation self-comparisons for the data in Fig. 1 with "m=1 and m=5", "m=5 and m=10" and "m=10 and m=15" are 0.683, 0.977 and 0.996, respectively. Thus, we can choose m=10 for this data set.
Fig. 1 a The data set. b–d 3D plots of the modified mountain function with m=1, 5 and 10, respectively
In order to execute the correlation self-comparison procedure as a computer program, the modified mountain function is rewritten as

P_1^{m_0}(x_i) = \sum_{j=1}^{n} e^{-m_0 \beta d(x_i, x_j)} \quad \text{and} \quad P_1^{m_l}(x_i) = \sum_{j=1}^{n} e^{-m_l \beta d(x_i, x_j)}
where m_0 is set to the fixed constant 1 and m_l = 5l, l = 1, 2, 3, .... The correlation self-comparison procedure can then be summarized as follows (a code sketch is given below).

Correlation self-comparison algorithm
1. Set l = 1 and w = 0.99.
2. Calculate the correlation of the values of P_1^{m_(l-1)}(x_i) and P_1^{m_l}(x_i).
3. IF the correlation is greater than or equal to the specified w, THEN choose P_1^{m_(l-1)} to be the modified mountain function; ELSE set l = l + 1 and GOTO step 2.
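A minimal sketch of this procedure follows, assuming squared Euclidean distance for d(x_i, x_j); the increment 5 and the threshold w = 0.99 follow the recommendations given above, and the function names are illustrative.

```python
import numpy as np

def modified_mountain(X, m):
    """P_1(x_i) of Eq. (5) with the normalization beta of Eq. (6)."""
    beta = 1.0 / ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()       # Eq. (6)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)            # pairwise d(x_i, x_j)
    return np.exp(-m * beta * d).sum(axis=1), beta

def select_m(X, w=0.99, step=5, m_max=100):
    """Correlation self-comparison: return the selected m and its P_1 values."""
    m_prev, (P_prev, _) = 1, modified_mountain(X, 1)                  # step 1: l = 1, m_0 = 1
    m_next = step
    while m_next <= m_max:
        P_next, _ = modified_mountain(X, m_next)                      # step 2: P_1 for m_l = 5l
        if np.corrcoef(P_prev, P_next)[0, 1] >= w:                    # step 3: density shape is stable
            return m_prev, P_prev
        m_prev, P_prev, m_next = m_next, P_next, m_next + step
    return m_prev, P_prev                                             # fallback if never stable
```

With the correlations reported above for the data set of Fig. 1a (0.683, 0.977, 0.996), this loop would stop at the comparison "m=10 and m=15" and return m = 10.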
We now use some data sets to test the efficiency of the correlation self-comparison in searching for a good modified mountain function (i.e. a good density shape estimate for the data set). Figure 2a–f shows six artificial data sets. The values of the correlation self-comparison for each data set are shown in Table 1. After the parameters are selected according to the self-comparison procedure, the modified mountain functions for these data sets are shown in Fig. 2g–l. The modified mountain function gives a good density shape estimate for each of these various data sets. The parameter β, the inverse of the sample variance, normalizes the dissimilarity measure d(x_i, x_j) and helps us to find m more efficiently. We mention that Beni and Liu [17] proposed a least biased fuzzy clustering algorithm by maximizing the clustering entropy with no assumptions on the number of clusters or their initial positions. They set the parameter β′ to β/β_max such that the range of β′ is between 0 and 1. They then set up a quantity P(c) as the probability of obtaining c clusters under different values of β′. Based on this multiscale analysis, they can determine the optimum number c* of clusters and then find the final cluster centers and partitions. Our procedure of decomposing α into mβ, with β the inverse of the sample variance, has a similar merit to Beni and Liu's multiscale method, although the two have different considerations. Our procedure considers the modified mountain function and uses the correlation self-comparison method together with a newly proposed validity index to obtain an optimum number of clusters.
Fig. 2 The data sets and their modified mountain functions
Table 1 Correlation self-comparison

Data set                    m=1 and m=5   m=5 and m=10   m=10 and m=15   m=15 and m=20   Selected m value
Figure 2a                   0.8012        0.9996         –               –               5
Figure 2b                   0.9455        0.9790         0.9926          –               10
Figure 2c                   0.7313        0.9551         0.9899          0.9951          15
Figure 2d                   0.5401        0.9663         0.9972          –               10
Figure 2e                   0.7367        0.9331         0.98998         0.9970          15
Figure 2f                   0.8483        0.8925         0.9796          0.9950          15
Example 1                   0.8640        0.7390         0.8110          0.9920          15
Iris                        0.5200        0.9220         0.9850          0.9940          15
Iris (last two features)    0.2920        0.9110         0.9870          0.9960          15
Figure 8a                   0.7710        0.9880         0.9930          –               10
Figure 8b                   0.7620        0.9890         0.9960          –               10
Figure 8c                   0.7810        0.9870         0.9960          –               10
Figure 8d                   0.8030        0.9900         –               –               5
3.2 The modified revised mountain function

After the correlation self-comparison algorithm is implemented, the modified mountain function (5) is acquired. We then search for the kth cluster center using the modified revised mountain function

P_k(x_i) = P_{k-1}(x_i) - P_{k-1}(x_i) \, e^{-\beta' d(x_i, x^*_{k-1})}, \quad k = 2, 3, \ldots \qquad (7)

where x_i is the feature vector and x^*_{k-1} is the (k−1)th cluster center, which satisfies

P_{k-1}(x^*_{k-1}) = \max_i \{ P_{k-1}(x_i) \}, \quad k = 2, 3, \ldots \qquad (8)
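In code, one extraction step of Eqs. (7) and (8) is a single multiplicative update of the current mountain values; a minimal sketch, assuming squared Euclidean distance and the β of Eq. (6):

```python
import numpy as np

def extract_next_center(P, X, prev_center, beta):
    """One step of Eq. (7) followed by Eq. (8).

    P holds the current P_{k-1}(x_i) for all data points; prev_center is x*_{k-1}."""
    d_prev = ((X - prev_center) ** 2).sum(axis=1)          # d(x_i, x*_{k-1})
    P_new = P * (1.0 - np.exp(-beta * d_prev))             # Eq. (7) written multiplicatively
    return P_new, X[int(np.argmax(P_new))]                 # Eq. (8): next cluster center candidate
```

Because the damping factor 1 − e^{−βd} lies in [0, 1), the updated values can never become negative, which is the property of the modified revised mountain function discussed later in this subsection.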
Fig. 3 a–d The modified revised mountain function after extracting the first, second, third and fourth clusters, respectively

The parameter β′ of this modified revised mountain function determines the extracted cluster shape and also determines the potential of each data point to be the next cluster center after each cluster is extracted. The data
points near the first extracted cluster center will have greatly reduced potential and are therefore unlikely to be selected as the next cluster center. To avoid obtaining cluster centers that are too close to each other, we choose β′ in (7) to be somewhat smaller than mβ in (5). Thus, we may take the parameter β′ in (7) as mβ divided by m (i.e. β′ = mβ/m = β), so that β′ is smaller than mβ. One problem with the revised mountain function (3) in the original mountain method is that the extracted cluster shapes are identical for all clusters, although the shape of the data distribution is not necessarily identical for different clusters. The original mountain method uses a fixed density shape to extract each cluster around each newly identified cluster center, as shown in (3). The proportion of the extracted quantity for each node N_i is

M_{k-1}(N^*_{k-1}) \, e^{-\gamma d(N^*_{k-1}, N_i)}, \qquad (9)
which is independent of its revised mountain function M_{k-1}(N_i), so the shapes of the extracted clusters are all equal. As a consequence, the revised mountain function M_{k-1}(N_i) in (3) becomes negative for some grid nodes, and it is not reasonable for a grid node to have a negative potential M_{k-1}(N_i). Therefore, Yager and Filev [12] needed to set the negative revised mountain function values to zero after each extraction. In our modified revised mountain function (7), the reduction of each data point x_i after finding the (k−1)th cluster center x^*_{k-1} is based on its (k−1)th revised mountain function P_{k-1}(x_i). The proportion of the extracted quantity for each data point is

P_{k-1}(x_i) \, e^{-\beta d(x_i, x^*_{k-1})}, \qquad (10)
which depends on its revised mountain function P_{k-1}(x_i); the shape of the extracted cluster therefore also depends on P_{k-1}(x_i). This modification allows somewhat different density shapes for each extracted cluster, and the revised mountain function is always positive. To demonstrate the behavior of the modified revised mountain function (7), we use the data set in Fig. 1a. The values of the modified revised mountain function after extracting the first, second, third and fourth clusters are shown in Fig. 3a, b, c and d, respectively.

3.3 The performance measure for the cluster validity

Many cluster validity indexes (see Refs. [8–11]) are measured on the basis of clustering algorithms such as the fuzzy c-means (FCM), and hence the results depend heavily on the performance of the clustering algorithms. In this subsection, we propose a new validity measure for our modified mountain method, which is simple and does not depend on any clustering method. The idea is to measure the potential of the newly identified cluster center to be a new cluster center based on the modified mountain method and to select the number of clusters for which the total potential is maximum. The validity function is defined as

MV(c) = \sum_{k=2}^{c} pot(k), \quad c = 2, 3, \ldots, n-1 \qquad (11)
where c denotes the number of clusters. The function pot(k) is the potential of the kth cluster center x^*_k and is defined as

pot(k) = P_1(x^*_k) \, \frac{P_1(x^*_k)}{P_1(x^*_1)} - n \exp\!\left( -m (d_k/\beta)^2 \right), \quad k = 2, 3, \ldots \qquad (12)

where d_k is the minimum distance between x^*_k and all (k−1) previously identified cluster centers, i.e.

d_k = \min \{ d(x^*_k, x^*_{k-1}), d(x^*_k, x^*_{k-2}), \ldots, d(x^*_k, x^*_1) \}. \qquad (13)
If the modified mountain function (5) for the newly identified cluster center x^*_k is large (i.e. P_1(x^*_k) is large), then it has a large potential to be a cluster center. We use the term P_1(x^*_k)/P_1(x^*_1) to measure the degree of compactness of the kth extracted cluster. If the kth cluster is dense, it has a large value of P_1(x^*_k)/P_1(x^*_1) and hence a large potential to be a new cluster center. If the kth cluster is disperse, pot(k) is reduced greatly through the term P_1(x^*_k)/P_1(x^*_1). In fact, the term P_1(x^*_k)/P_1(x^*_1) was used by Yager and Filev [12] as the stopping condition for the original mountain method; however, that stopping condition controls only the compactness. In our validity function pot(k) in (12), we also consider the separation measure (d_k/β)². A data point that is close to any previously identified cluster center will also have a large modified mountain function value, so we should reduce the potential of x^*_k if it is close to any previous cluster center. The term d_k/β measures the degree of reduction. If d_k is small, the potential of x^*_k is reduced greatly; if d_k is large, x^*_k has a large potential to be a new cluster center. If d_k is very small compared with β for the data set, the kth cluster center x^*_k may be close to one of the previously identified cluster centers, and pot(k) is reduced greatly. We thus use (d_k/β)² to measure the separation of the kth extracted cluster.

Fig. 4 Clustering result and the validity index MV(c) for the data set in Fig. 1a

The result of applying the modified mountain method to the data set in Fig. 1a is shown in Fig. 4. Figure 4a
presents the validity function MV(c), and c=4 is the best number of clusters for this data set. The solid circle points in Fig. 4b are the four identified cluster centers. The modified mountain method provides a good cluster number estimate and good initial guesses (cluster centers) in this example. The proposed algorithm is summarized as follows (a code sketch is given at the end of this section).

The modified mountain clustering algorithm
1. Acquire the modified mountain function using the correlation self-comparison algorithm.
2. Find the kth cluster center x^*_k using the modified revised mountain function (7) and condition (8).
3. Calculate MV(c), c = 2, 3, ..., (n−1).
4. Choose the cluster number estimate with the maximum value of MV(c) and select these c extracted cluster centers.

If the cluster number is known, this method can provide good initial guesses via the density shape estimate. Even for data with an unknown number of clusters, it provides both a cluster number estimate and initial guesses. The validity problem is solved by the proposed validity measure function, which includes compactness and separation measures for each extracted cluster. The results do not depend on a clustering algorithm whose performance is always affected by the initial cluster centers. Some numerical examples are given in the next section.
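The whole procedure can be sketched as follows. The sketch reuses the select_m helper from the sketch in Sect. 3.1, assumes squared Euclidean distance, and computes pot(k) according to Eq. (12) as given above; it is an illustrative reading of the algorithm, not the authors' implementation.

```python
import numpy as np

def modified_mountain_clustering(X, c_max=10, w=0.99):
    """Unsupervised approximate clustering with the modified mountain method."""
    n = X.shape[0]
    m, P1 = select_m(X, w=w)                        # step 1: density shape via self-comparison
    beta = 1.0 / ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

    P, centers, pots = P1.copy(), [int(np.argmax(P1))], []
    for k in range(2, c_max + 1):                   # step 2: extract candidate centers, Eqs. (7)-(8)
        P = P - P * np.exp(-beta * d[:, centers[-1]])           # beta' = beta
        cand = int(np.argmax(P))
        dk = min(d[cand, c] for c in centers)       # Eq. (13): distance to the closest earlier center
        pots.append(P1[cand] * P1[cand] / P1[centers[0]]        # Eq. (12): compactness term
                    - n * np.exp(-m * (dk / beta) ** 2))         # minus the separation penalty
        centers.append(cand)

    MV = np.cumsum(pots)                            # step 3: MV(c) = sum of pot(k), Eq. (11)
    c_best = int(np.argmax(MV)) + 2                 # step 4: cluster number with maximal MV(c)
    return X[centers[:c_best]], c_best, MV
```

The returned centers can either be reported directly as an approximate clustering or be passed to FCM as initial values, as is done in Sect. 5.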
4 Examples

In this section, we present some examples with numerical and real data to demonstrate the effectiveness of the proposed modified mountain algorithm.

Example 1. This is a three-dimensional data set with six clusters, as shown in Fig. 5a. In this example, we cannot plot the modified mountain function of Eq. (5), so we cannot see how the approximated density shape changes as the parameter m changes. We
use the proposed correlation self-comparison procedure to automatically search for a suitable m value. The correlation self-comparison is described in Table 1. The parameter m was chosen to be 15 to meet the threshold. The approximated cluster centers (solid circle points) are shown in Fig. 5a. According to the validity function MV(c) in Fig. 5b, the optimal cluster number is c=6. Although this example is a higher-dimensional case in which the density shape estimate cannot be inspected graphically, the correlation self-comparison procedure and the validity measure MV(c) give a reasonable result. The modified mountain method provides good initial guesses for a clustering algorithm and a good cluster number estimate for the data set. It also gives a good unsupervised approximate clustering result.

Example 2. This is the well-known Iris data set, whose real cluster number is 3. The self-comparison of correlations is described in Table 1. The density shape estimate m is chosen to be 15 to meet the threshold 0.99. The result of the modified mountain method is shown in Fig. 6. In Fig. 6a, we use a three-dimensional plot to present the four-dimensional Iris data set. The three identified cluster centers are also plotted with solid circle points. The plot of MV(c) (Fig. 6b) shows that c=3 is a good cluster number estimate. It also offers the information that c=2 may be another good choice. Although one may argue that c=2 is also a good choice, and many validity indexes also indicate that c=2 is optimal for the Iris data, the data really consist of three groups: Iris Setosa, Iris Versicolor and Iris Virginica. For unsupervised clustering, we think that both c=2 and c=3 are good estimates for the Iris data set. We also cluster the last two features of the Iris data. The self-comparison of correlations is shown in Table 1. The parameter m was chosen to be 15. The results, shown in Fig. 7, are similar to those for the four-dimensional Iris data.

Example 3. We use this example to show the performance of our validity function MV(c). Figure 8 shows the data sets and the identified cluster centers (solid circle points). The parameter m for each data set is chosen to meet the threshold, as shown in Table 1. Table 2 shows the MV(c) values for each case.
Fig. 5 Clustering result and the validity index MV(c) for Example 1
In Fig. 8a, there are two randomly generated clusters and five artificial points. The reasonable cluster number for this data set is 3, and the maximum value of MV(c) gives a consistent result. In Fig. 8b, we reduce the number of artificial points so that they no longer have enough potential to form a cluster. In Fig. 8c, we increase the scale of the artificial points; these points become too scattered and do not have enough potential to form a cluster. In Fig. 8d, we decrease the distance between the five artificial points and the two randomly generated clusters; the clusters become overlapped and the artificial cluster is difficult to distinguish from the other clusters. This example shows that our validity measure function MV(c) can reflect the relationships between clusters, including the cluster size, the cluster scale and the distance between clusters.
5 Comparisons and computational complexity

In this section, we compare the performance of the proposed modified mountain method with the original mountain method on data sets drawn from normal mixtures. To demonstrate the effectiveness of the proposed validity index MV(c), we choose two popular validity indexes, the partition coefficient (PC) [8] and the Xie and Beni index (XB) [11], and compare their results. We also analyze the computational complexity of the algorithms.

Example 4. We randomly generate a normal mixture data set. The mixed distributions are N(0,1), N(3,0.5), N(6,1), N(9,0.5) and N(12,1), with equal mixture proportions. The histogram of this mixture data is shown in Fig. 9a. The data set contains 250 data points. Under the threshold 0.99, m=15 is chosen in this example. The approximate density shape is shown in Fig. 9b and is consistent with the histogram. The validity function MV(c) is shown in Fig. 9c. The optimal cluster number of this data set is five.
Fig. 6 Clustering result and the validity index MV(c) for the Iris data set
Figure 9b also shows these five extracted clusters with the solid circle points and their extraction order. We next implement the mountain method proposed by Yager and Filev [12]. The first step is to define the grid nodes over the data space; we choose 201 grid nodes {−5.0, −4.9, ..., 14.9, 15.0}. The second step is to determine the parameter α in Eq. (1), which is equivalent to mβ in Eq. (5) of the modified mountain method. The third step is to determine the revised mountain function parameter γ in (3). In this example, four values of γ are implemented: γ=β, γ=5β, γ=15β and γ=25β. The results are shown in Table 3. Since, after extracting four cluster centers, all revised mountain function values of Eq. (3) for each grid node are zero in the case of γ=β, only four clusters are found. The second column of Table 3 presents these four extracted grid nodes. The third column presents the proportion M_k(N^*_k)/M_1(N^*_1), which is used as the stopping condition: the mountain method stops when the proportion is less than a given threshold δ. This is the fourth step of the original mountain method, choosing a suitable δ to determine the cluster number of the data set. When γ=5β and γ=15β, the original mountain method can extract five suitable cluster centers with δ=0.5. Under the same threshold, however, seven clusters are extracted when γ=25β. We find that the original mountain method is a good tool for searching for suitable cluster centers, but it depends heavily on the parameters α, γ and δ. Our modified mountain method improves the original scheme with the following three modifications. First, to reduce the effect of the parameter α, we use the correlation self-comparison to approximate the density shape; this method is also workable in high dimensional data spaces. Second, to reduce the effect of the parameter γ, we modify the revised mountain function to allow different density shapes for each extracted cluster. In fact, our modified revised mountain function is always positive, so it avoids the situation in which the revised function values for the grid nodes are all zero, as shown in Table 3 for γ=β (where only four clusters can be extracted). Finally, to reduce the effect of the parameter δ, we propose a new cluster validity function to measure the potential of an extracted cluster to be a newly identified cluster and then select the cluster number with the maximum total potential.
Fig. 7 Clustering result and the validity index MV(c) for the last two features of the Iris data set
In this example, we also simulate another normal mixture data set. The mixed distributions are N(0,1), N(5,2), N(10,3), N(15,4) and N(20,5), with equal mixture proportions. The histogram of this mixture data is shown in Fig. 10a. The data set contains 500 data points.
Fig. 8 Data sets and results for Example 3. Solid circle points present the identified cluster centers resulting from the modified mountain method
Under the threshold 0.99, m=15 is obtained on the basis of the proposed correlation self-comparison algorithm. The approximate density shape is shown in Fig. 10b and is consistent with the histogram. The validity function MV(c) shown in Fig. 10c indicates that c=4. Figure 10b shows the four extracted clusters with the solid circle points and their extraction order. It is difficult to distinguish between four and five clusters in this mixture data; the validity function MV(c) has large values at both c=4 and c=5. The results of the original mountain method are shown in Table 4, where four values of γ are implemented. Only three clusters can be extracted when γ=β. If we set δ=0.5, the cluster number estimates are 2, 4, 4 and 6 when γ equals β, 5β, 10β and 15β, respectively. We find that the proposed modified mountain method actually presents better results than the original mountain method.
Table 2 The validity function MV(c) for Example 3

Cluster    Figure 8a   Figure 8b   Figure 8c   Figure 8d
2          -84.6       -83.4       -84.4       -78.5
3          -84.3       -134.0      -135.6      -128.1
4          -134.6      -168.4      -177.1      -184.9
5          -175.0      -168.2      -177.0      -184.5
6          -193.5      -181.9      -196.5      -213.1
7          -212.2      -195.8      -216.2      -241.7
8          -238.9      -217.0      -244.0      -279.2
9          -259.9      -232.9      -250.8      -310.5
Estimate   3           2           2           2
There are many prototype-based clustering algorithms [2, 3, 6, 9] that can be used to obtain clusters. Because the cluster number in these algorithms needs to be specified a priori, these methods must be combined with validity indexes to become unsupervised clustering algorithms. For a given cluster-number range, the validity function is evaluated for each cluster number and an optimal number is then chosen based on these validity measures. We choose the two well-known validity indexes, PC [8] and XB [11], for the comparisons; they are briefly reviewed as follows.

(a) The partition coefficient (PC), the first validity index proposed in association with FCM [8], is defined by

PC(c) = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^2 \qquad (14)

where (1/c) ≤ PC(c) ≤ 1. In general, an optimal cluster number c* is obtained by solving \max_{2 \le c \le n-1} PC(c) to produce the best clustering performance for the data set.

(b) The validity function proposed by Xie and Beni [11] is defined by

XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^2 \, \| x_j - a_i \|^2}{n \, \min_{i \ne j} \| a_i - a_j \|^2}. \qquad (15)

In general, an optimal c* is obtained by solving \min_{2 \le c \le n-1} XB(c) to produce the best clustering performance for the data set. In the next example, we compare our modified mountain method with the FCM clustering results under these two validity indexes, PC and XB.

Example 5. This is a three-dimensional data set containing 11 clusters with different shapes and sizes, as shown in Fig. 11a. The data set contains 800 data points. Under the threshold 0.99, m=20 is obtained according to the proposed correlation self-comparison algorithm. The validity function MV(c) shown in Fig. 11b indicates that c=11 is a good cluster number estimate for this data set. The extracted cluster centers (solid circle points) are shown in Fig. 11c and represent the data structure well. We then implement the FCM algorithm with the PC and XB indexes. Figure 12a, b shows the results of PC and XB, where the fuzzy memberships are generated by the FCM algorithm with random initial values. Neither index can indicate the correct cluster number for the data set; this may be because the data lack structure or because of bad initial values. The results of FCM with 11 random initial values are shown in Fig. 12c, where we can see why the validity indexes lose the ability to detect the correct cluster number. This also reveals the problem that even when the clustering algorithm and the validity index are both well designed, the results still depend on the initial specifications. If we take the extracted cluster centers obtained by the modified mountain method as the initial values for the FCM algorithm, the results become better. We find that, in this case, the PC and XB indexes shown in Fig. 12d, e actually indicate the cluster number estimate c=11. Figure 12f shows the FCM clustering results with the modified mountain outputs as initial values.
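The PC and XB indexes used in this comparison can be computed directly from the FCM output; a minimal sketch, where u is the c × n fuzzy membership matrix and a the c × m prototype matrix produced by any FCM implementation:

```python
import numpy as np

def partition_coefficient(u):
    """PC(c) of Eq. (14); u is the (c, n) fuzzy membership matrix."""
    return (u ** 2).sum() / u.shape[1]

def xie_beni(u, a, X):
    """XB(c) of Eq. (15); a is the (c, m) prototype matrix, X the (n, m) data."""
    d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)           # ||x_j - a_i||^2
    compactness = (u ** 2 * d2).sum()
    min_sep = min(((a[i] - a[j]) ** 2).sum()
                  for i in range(len(a)) for j in range(len(a)) if i != j)
    return compactness / (X.shape[0] * min_sep)
```

PC is maximized and XB is minimized over the candidate cluster numbers c = 2, ..., n − 1.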
Fig. 9 a The histogram of the normal mixture data. b The modified mountain function. c The validity function MV(c). Five extracted cluster centers (solid circle points) and their extraction order are also shown in b. The algorithm takes 2.140625 s
Table 3 The results of the original mountain method for the data set of Fig. 9a

Cluster    γ=β                        γ=5β                       γ=15β                      γ=25β
number     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)
1          3      1                   3      1                   3      1                   3      1
2          11.9   0.86266             8.9    0.98857             8.9    0.98863             8.9    0.98863
3          8.8    0.32973             12     0.80393             11.9   0.87398             11.9   0.87449
4          0.3    0.05408             0      0.57173             5.6    0.73787             5.6    0.74131
5          –      –                   5.8    0.55582             0.4    0.67731             0.4    0.68065
6          –      –                   12.1   0.00142             7.2    0.36187             4.3    0.54364
7          –      –                   2      0.00125             1.1    0.30144             1.8    0.50566
8          –      –                   –      –                   4.3    0.28657             7.3    0.49738
9          –      –                   –      –                   10.4   0.26508             10.4   0.46664
10         –      –                   –      –                   1.7    0.25402             0.9    0.39182
Fig. 10 a The histogram of the normal mixture data. b The modified mountain function. c The validity function MV(c). Four extracted cluster centers (solid circle points) and their extraction order are also shown in b. The algorithm takes 5.546875 s
Finally, the computational complexity is analyzed. In the modified mountain method, each correlation self-comparison step requires computing Eq. (5) for all n data points in an m-dimensional space, so the computational complexity is O(n²mt), where t is the number of iterations of the correlation self-comparison algorithm. In most of our experiments, the selected value of m falls in the interval [5, 20], and hence t is a small positive integer. To obtain MV(c) in Eq. (11), we need to search for x^*_{k-1} in Eq. (8) and compute Eqs. (7) and (12); the computational complexity is O(nmc_max), where c_max is the maximum possible number of clusters. In the original mountain method, the first step is to determine the grid nodes. If the number of grid points in each dimension is l, we have l^m m-dimensional grid nodes over the data space, and the computational complexity of computing Eq. (1) is O(l^m nm). The second step is to search for N^*_{k-1} in Eq. (4) and to compute d(N^*_{k-1}, N_i) and M_k(N_i) in Eq. (3), which takes O(l^m m). The computational complexity of the original mountain method is therefore a monotonically increasing function of the data dimension. The modified mountain function (5) and revised mountain function (7), being defined on the data vectors rather than on the grid nodes, can efficiently reduce the computational complexity, especially in high dimensional cases. In FCM, the computational complexity is O(ncmt_c), where t_c is the number of iterations when the cluster number is c. For finding a cluster number estimate of a given data set, we need to process FCM with c=2 to c=c_max. Thus, the computational complexity is O(n(c_max − 1)st*), where t* = \sum_{c=2}^{c_max} t_c.
Table 4 The results of the original mountain method for the data set of Fig. 10a

Cluster    γ=β                        γ=5β                       γ=10β                      γ=15β
number     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)     Node   Mk(N*k)/M1(N*1)
1          0.4    1                   0.4    1                   0.4    1                   0.4    1
2          12.5   0.77876             7.1    0.89724             6.7    0.92762             6.6    0.92918
3          21.9   0.21093             12.8   0.79281             12.3   0.8695              12.2   0.87534
4          –      –                   18.3   0.51314             16.9   0.62902             16.4   0.6595
5          –      –                   23.2   0.24583             21.1   0.40175             3.7    0.52711
6          –      –                   26.7   0.06372             3.9    0.33523             9.3    0.51696
7          –      –                   30.8   0.03314             9.4    0.27469             20     0.4599
8          –      –                   5.2    0.03231             25.1   0.20101             23.2   0.28078
9          –      –                   16.7   0.00949             14.9   0.07958             2      0.21873
10         –      –                   –      –                   2.3    0.0735              14.3   0.20782
Fig. 11 a The data set with the clusters of different shapes and sizes. b The validity function MV(c). c The extracted cluster centers (solid circle points)
6 Conclusions

We modified the mountain method with three important and reasonable modifications. First, we used the correlation self-comparison to approximate the true density shape and set a threshold to acquire a good density shape parameter estimate. This parameter corresponds to the kernel width of the Parzen density estimation and determines the performance of the mountain method. It is difficult to give a general estimation form for this parameter that works well for various data sets; however, a good parameter estimate can easily be obtained using our correlation self-comparison, and it retains a good performance even in high dimensional cases. Second, our modification of the revised mountain function allows somewhat different density shapes for each extracted cluster and successfully reduces the sensitivity to the choice of the revised mountain function parameter in the original mountain method. Moreover, our modified revised mountain function is always positive. Third, we proposed a new cluster validity measure function MV(c) to measure the potential of an extracted cluster to be a newly identified cluster and to select the cluster number with the maximum total potential. If the cluster number is known, our modified mountain method can provide good initial guesses via a good density shape estimate obtained with the correlation self-comparison. Even if the cluster number is unknown, our modified mountain method can provide a good cluster number estimate via the newly proposed validity measure function MV(c). Thus, our modified mountain method can serve as an unsupervised approximate clustering method for the analysis of a grouped data set. Some numerical comparisons and real data are given to show the effectiveness and simplicity of the proposed modified mountain clustering algorithm. Finally, the computational complexity of the algorithms is analyzed.
Fig. 12 a, b PC and XB validity indexes based on FCM using 11 random initial values. c The results of FCM with 11 random initial values. d, e PC and XB validity indexes based on FCM using the initial values obtained by the modified mountain method. f The results of FCM with the initial values obtained by the modified mountain method
Acknowledgements The authors would like to thank the anonymous reviewers for their helpful comments and suggestions to improve the presentation of the paper. This work was supported in part by the National Science Council of Taiwan, under Grant NSC-89-2213-E-033-057.
References

1. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
2. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
3. Yang MS (1993) A survey of fuzzy clustering. Math Comput Modelling 18:1–16
4. Zadeh LA (1965) Fuzzy sets. Inform Control 8:338–353
5. Höppner F, Klawonn F, Kruse R, Runkler T (1999) Fuzzy cluster analysis. Wiley, New York
6. Wu KL, Yang MS (2002) Alternative c-means clustering algorithm. Pattern Recognit 35:2267–2278
7. Yang MS, Hu YJ, Lin KCR, Lin CCL (2002) Segmentation techniques for tissue differentiation in MRI of ophthalmology using fuzzy clustering algorithms. Magn Reson Imaging 20:173–179
8. Bezdek JC (1974) Cluster validity with fuzzy sets. J Cybernetics 3:58–73
9. Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11:773–781
10. Windham MP (1982) Cluster validity for the fuzzy c-means clustering algorithm. IEEE Trans Pattern Anal Mach Intell 11:357–363
11. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13:841–847
12. Yager RR, Filev DP (1994) Approximate clustering via the mountain method. IEEE Trans Syst Man Cybern 24:1279–1284
13. Chiu SL (1994) Fuzzy model identification based on cluster estimation. J Intell Fuzzy Syst 2:267–278
14. Pal NR, Chakraborty D (2000) Mountain and subtractive clustering method: improvements and generalizations. Int J Intell Syst 15:329–341
15. Yager RR, Filev DP (1994) Generation of fuzzy rules by mountain clustering. J Intell Fuzzy Syst 2:209–219
16. Velthuizen BP, Hall LO, Clarke LP, Silbiger ML (1997) An investigation of mountain method clustering for large data sets. Pattern Recognit 30:1121–1135
17. Beni G, Liu X (1994) A least biased fuzzy clustering method. IEEE Trans Pattern Anal Mach Intell 16:954–960