Cluster Validity Analysis Using Subsampling

Osman Abul†‡    Anthony Lo†    Reda Alhajj†    Faruk Polat‡    Ken Barker†

†Dept. of Computer Science, University of Calgary, Calgary, Alberta, Canada
{abul, chiul, alhajj, barker}@cpsc.ucalgary.ca

‡Dept. of Computer Engineering, Middle East Technical University, Ankara, Turkey
[email protected]

Abstract

Cluster validity investigates whether generated clusters are true clusters or due to chance. This is usually done based on subsampling stability analysis. Related to this problem is estimating the true number of clusters in a given dataset. A number of methods are described in the literature for both purposes. In this paper, we propose three methods for estimating confidence in the validity of a clustering result. The first method validates the clustering result by employing supervised classifiers: the dataset is divided into training and test sets, and the accuracy of the classifier is evaluated on the test set. This method computes confidence in the generalization capability of the clustering. The second method is based on the fact that if a clustering is valid, then each of its subsets should be valid as well. The third method is similar to the second but takes the dual approach, i.e., each cluster is expected to be stable and compact. Confidence is estimated by repeating the process a number of times on subsamples. Experimental results illustrate the effectiveness of the proposed methods.

1 Introduction

The word "clustering" (unsupervised classification) refers to methods of grouping objects based on some similarity measure between them. Clustering algorithms can be classified into four classes, namely Partitional, Hierarchical, Density-based and Grid-based [8]. Each of these classes has subclasses and different corresponding approaches, e.g., conceptual, fuzzy, self-organizing maps, etc. The clustering task can be divided into the following five steps (the last two are optional) [9]: 1) Pattern representation; 2) Pattern proximity measure definition; 3) Clustering; 4) Data abstraction; and 5) Cluster validity analysis. In this paper, we only consider the last step, which in a sense measures the effectiveness of the other steps.

For a given dataset, the produced clustering depends on the parameters of the applied clustering algorithm. Usually, different algorithms, or even the same algorithm with distinct parameters, generate different clustering results. Cluster validity analysis refers to how to assess the confidence in the resulting clusters. For datasets with few dimensions, the clustering result can be visualized, and hence clusters can be validated by human experts. But this becomes nearly impossible for large dimensions, and hence other automatic methods are needed. The main criteria for the evaluation of clustering results are [8]: compactness (i.e., members of each cluster should be close to each other) and separation (i.e., the clusters should be widely spaced). Based on these criteria, a number of indices have been proposed for evaluating clusters and selecting the best possible number of clusters. In most cases, assessing validity turns into determining the best parameters for a clustering algorithm. Confidence estimation is addressed in relatively few research papers, where confidence is given in terms of the proportion of cases clustering together. Our motivation for the work described in this paper is estimating confidence in each cluster, i.e., not addressing specific cases. For this purpose, we propose three meta-methods for the cluster validity problem. To the best of our knowledge, these methods are novel, and test results demonstrate their effectiveness. The methods are all based on subsampling of the dataset. They are general and can be used for evaluating clustering results generated by a wide range of existing clustering algorithms.

The first method starts by producing a clustering using a given clustering algorithm, with the number of clusters specified. Second, it randomly samples from the labeled clusters. Third, it builds a supervised classifier on the selected subset, and the induced classifier is evaluated on the non-selected portion. The random subsampling and evaluation steps are repeated many times. Finally, the overall accuracy gives the stability of the clustering. These steps are repeated for all possible numbers of clusters, for comparison to clustering results produced by different clustering algorithms. Instead of random subsampling, 10-fold cross-validation can also be used.


The second method is based on subset selection of the original clusters. First, clusters are found by employing a given clustering algorithm. For each subset of these clusters, an algorithm that estimates the true number of clusters is used. The argument here is that, if the given clustering is stable, then we expect the number of clusters estimated for each subset to be the same as the cardinality of the label set of the selected subset. The confidence is computed as the proportion of correct estimations. The clustering result may contain a large number of clusters (say 20 clusters). In this case, trying all subsets becomes computationally intractable, so we resort to subset sampling instead. If the validity of clustering results generated by randomized algorithms like k-means is the concern, all the steps should be repeated and averaged for both the first and the second methods. The third method uses the idea that if a clustering is stable, further clustering of the cases in every cluster will reveal one cluster. For each of the clusters, an estimator algorithm is run and is expected to report that there is only one cluster. The whole step is repeated several times with dataset subsampling, i.e., a bootstrapping approach is employed for confidence estimation. Confidence is computed similarly to the second method. The rest of the paper is organized as follows. Section 2 gives some background and recent work on cluster validity. Section 3 presents our three methods for cluster validity analysis. Experimental results are presented in Section 4. Section 5 is the conclusion.
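As a rough, hedged illustration of the third method's per-cluster check (this is not the authors' exact procedure; estimate_cluster_count stands in for any subsampling-based cluster count estimator, e.g., a Clest-style procedure):

import numpy as np

def per_cluster_confidence(T, labels, estimate_cluster_count, B=50, f=0.7,
                           kmax=10, seed=0):
    # Under subsampling, each original cluster should be estimated to contain
    # exactly one cluster; confidence is the proportion of times this holds.
    rng = np.random.RandomState(seed)
    n, hits = len(T), []
    for _ in range(B):
        idx = rng.choice(n, size=int(f * n), replace=False)  # subsample of T
        for c in np.unique(labels):
            members = idx[labels[idx] == c]
            if len(members) < 2:
                continue
            k_hat = estimate_cluster_count(T[members], kmax)
            hits.append(1.0 if k_hat == 1 else 0.0)
    return float(np.mean(hits))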

2 Cluster Validity and Stability

There are basically three approaches to the assessment of validity: internal, external and relative [9, 8, 7]. Internal indices measure how well the clustering result reflects the structure inherent in the dataset. Here, only inherent features of the dataset are used for the measurement, i.e., no external information is consulted. Usually, the between- and within-cluster sum-of-squares matrices are used as inherent features. A number of such indices are available, including silhouette, gap and gapPC [7]. These indices also define how to select the best number of clusters.

In external assessment of validity, there is a known a priori structure; an external index is computed from this structure and the generated structure. These indices define a measure of the degree of match between the two structures. The indices are usually defined on the contingency table of the two partitions. Entry n_{ij} (row i, column j) of this table is the number of patterns that belong to cluster i in the a priori partition and cluster j in the generated partition. Indices on contingency tables include the Jaccard, Rand and FM indices. The FM measure is used in the Clest algorithm; it is given below [7]:

FM = \frac{\frac{1}{2}(Z - n)}{\sqrt{\frac{1}{2}\left(\sum_{i=1}^{R} n_{i\cdot}^{2} - n\right)\,\frac{1}{2}\left(\sum_{j=1}^{C} n_{\cdot j}^{2} - n\right)}}    (1)

where n = \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}, Z = \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}^{2}, n_{i\cdot} = \sum_{j=1}^{C} n_{ij} and n_{\cdot j} = \sum_{i=1}^{R} n_{ij}; R and C are the numbers of clusters in the a priori and generated partitions, respectively.
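As an illustration, the FM index as written in equation (1) can be computed from two label vectors via their contingency table; the following numpy sketch assumes the two partitions are given as parallel label arrays:

import numpy as np

def fm_index(labels_a, labels_b):
    # Contingency table n_ij between the a priori and generated partitions.
    a_ids, a_inv = np.unique(labels_a, return_inverse=True)
    b_ids, b_inv = np.unique(labels_b, return_inverse=True)
    cont = np.zeros((len(a_ids), len(b_ids)))
    np.add.at(cont, (a_inv, b_inv), 1)
    n = cont.sum()
    Z = (cont ** 2).sum()
    row = (cont.sum(axis=1) ** 2).sum()  # sum_i n_i.^2
    col = (cont.sum(axis=0) ** 2).sum()  # sum_j n_.j^2
    return 0.5 * (Z - n) / np.sqrt(0.25 * (row - n) * (col - n))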

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering algorithm over the possible parameter settings (e.g., for each possible number of clusters) and identify the clustering scheme that best fits the dataset. Recent work on cluster validity concentrates on a kind of relative index called cluster stability [2, 3, 11, 12, 7, 13, 15]. Cluster stability exploits the fact that when multiple data samples are drawn from the same distribution, clustering algorithms are expected to behave in the same way and produce similar structures.

In the work described in [13], supervised predictors are built on each clustered resampling of the original dataset, and their match with the original clustering labeling is used as a measure of stability or degree of match. They show that the choice of supervised classification algorithm does make a difference, but the measured validity remains valid for other choices. They define an instability measure by taking a game-theoretic approach. The number of clusters minimizing this instability measure is used as the best cluster count for the dataset.

The work described in [13] presents an algorithm for estimating the true number of clusters. For each cluster count, the dataset is resampled twice and clustered using the same generic clustering algorithm. The similarity between the two clusterings is measured using either the Jaccard coefficient or the matching coefficient. The resampling and similarity computations are repeated many times for each number of clusters for confidence estimation. The averaged values are used as measures of the stability of the clustering generated by the given clustering algorithm. Histograms and cumulative distributions are generated and plotted for selecting the best cluster count. The smallest stable cluster count is taken as the correct number of clusters; the decision is obvious in the cumulative distribution diagram. They also give a measure for automating this process. The algorithm has the nice property that if there is no large gap between similarities across all cluster counts, the dataset is said not to tend toward clustering, i.e., the cluster count is 1.

Another resampling-based method is given in [12]. In their setting, the original dataset is clustered first, then a number of subsamples are gathered and each of them is clustered independently using the same clustering algorithm. A figure-of-merit measure (i.e., the degree of match in the connectivity matrix) is defined between the original clustering and each of the subsampled clusterings. The figure of merit is computed for each possible parameter set. The plot of the figure of merit against the parameter values is used for selecting


the best parameters.

A Gaussian finite mixture based method for estimating the true number of clusters is described in [14]. The algorithm first divides the dataset into training and test subsets. Next, for each cluster count k, a model is fitted to the training set using the Expectation Maximization (EM) algorithm. Then, the resulting parameter set is evaluated on the test set. These steps are repeated many times and averaged. The averages are used for estimating the true number of clusters.
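A minimal sketch of this held-out mixture-model idea (not the exact procedure of [14]); it assumes scikit-learn's GaussianMixture is available and scores each candidate cluster count by its average test-set log-likelihood:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def gmm_heldout_scores(X, k_max=10, repeats=20, seed=0):
    # Average held-out log-likelihood for every candidate cluster count k.
    rng = np.random.RandomState(seed)
    scores = {}
    for k in range(1, k_max + 1):
        vals = []
        for _ in range(repeats):
            tr, te = train_test_split(X, test_size=0.3,
                                      random_state=rng.randint(1 << 30))
            gm = GaussianMixture(n_components=k).fit(tr)
            vals.append(gm.score(te))  # mean log-likelihood on the test set
        scores[k] = float(np.mean(vals))
    return scores  # the k with the highest score is the estimate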

3 The Proposed Three Methods

For the methods discussed in this section, we denote the input dataset by T, having n patterns each with p dimensions; so T is effectively an n x p matrix. The proposed algorithms can be used for different cluster counts and different clusterings, even those generated by different clustering algorithms. We collect their confidence measures for all possible numbers of clusters. These data can be used for relative confidence estimation of clustering algorithms on the given dataset. Any clustering algorithm operating on numeric values (e.g., k-means, ORCLUS, PAM, CLARA) that takes the cluster count as a parameter can be used in confidence estimation. For randomized algorithms like k-means, confidences should be averaged over several runs.

The ORCLUS algorithm is proposed for high-dimensional datasets. The idea behind the algorithm is to find (potentially different) arbitrarily projected subspaces for each of the clusters. It is an iterative algorithm and starts with an initial partition and the original axis system. In each iteration, patterns are first assigned to a cluster based on their projected distance to the seeds of the current clustering. Then, the centroids of the clusters (seeds) are recomputed and new projected subspaces are computed for each of the clusters. Following this, close seeds are merged to obtain a smaller number of clusters. Iteration continues until the user-specified number of clusters is found and the projected subspace dimensionality of each cluster reaches the user-specified minimum. Contrary to feature selection methods, which select the dimensions with the larger eigenvalues, this algorithm selects subspaces with smaller eigenvalues. The reason is to reduce the variability in the projected subspace, i.e., to reduce the within-cluster distances. The algorithm can detect outliers and scales to very large databases; for details see [1].
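To make the intended usage concrete, the following hedged sketch collects a confidence profile over cluster counts with k-means; confidence_fn is a placeholder for any of the three methods proposed below, not part of the original formulation:

import numpy as np
from sklearn.cluster import KMeans

def confidence_profile(T, k_range, confidence_fn, runs=5, seed=0):
    # Averaged confidence value for every candidate cluster count.
    # confidence_fn(T, labels, k) stands for any of the proposed methods.
    rng = np.random.RandomState(seed)
    profile = {}
    for k in k_range:
        vals = []
        for _ in range(runs):  # average over runs for randomized algorithms
            labels = KMeans(n_clusters=k,
                            random_state=rng.randint(1 << 30)).fit_predict(T)
            vals.append(confidence_fn(T, labels, k))
        profile[k] = float(np.mean(vals))
    return profile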

3.1 The First Method: Supervised learning based approach

This method validates the result of clustering with supervised classifiers. The rationale is that if the labels generated by the clustering algorithm are valid (i.e., the clusters are well-separated), then they can be used by a classifier to classify clusters with high accuracy. This accuracy information can be used for comparing different clustering algorithms with the same input parameters. Additionally, repeated measurements of accuracy on perturbed datasets can be used for estimating the validity of clustering algorithms. Doing so facilitates the measurement of confidence in cluster validity for multiple (not just two) clustering algorithms on the same basis. The classifier is trained on a perturbed version of the labeled patterns, and its accuracy is tested on the patterns not selected for training. For confidence estimation, the subsampling is repeated many times. The average accuracy is used as a measure of confidence in the validity of the clustering. The whole process is sketched in Algorithm 3.1.

Algorithm 3.1 (Supervised learning based method)
Input: T=dataset, K=number of clusters, B=number of subsamplings

f = 0.7
L = Cluster(T, K)
For b = 1 to B do
    L_b = Subsample(L, f)
    C_b = Build_Classifier(L_b)
    A_b = Compute_Accuracy(C_b, L - L_b)
end For
A = (1/B) * sum_{b=1}^{B} A_b
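A minimal Python sketch of Algorithm 3.1, with the subsample fraction f = 0.7 as above; scikit-learn's GaussianNB is used here only as a stand-in for the DLDA classifier described next:

import numpy as np
from sklearn.naive_bayes import GaussianNB

def supervised_confidence(T, labels, B=50, f=0.7, seed=0):
    # Average held-out accuracy of a classifier trained on subsampled labels.
    rng = np.random.RandomState(seed)
    n = len(T)
    accs = []
    for _ in range(B):
        idx = rng.permutation(n)
        train, test = idx[: int(f * n)], idx[int(f * n):]
        clf = GaussianNB().fit(T[train], labels[train])
        accs.append(clf.score(T[test], labels[test]))
    return float(np.mean(accs))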

Any clustering algorithm that partitions the patterns can be used to decide on L in Algorithm 3.1, and the Diagonal Linear Discriminant Analysis (DLDA) algorithm [6] is used to decide on C_b; the authors tested several algorithms and found DLDA to be among the best for their settings and datasets. It is also employed in the Clest algorithm, a cluster estimation/validation method using a discriminant analysis approach [7]. As noted in [13], a wide range of supervised classifiers can be used for verification. For these reasons, DLDA is employed in this work. DLDA is based on the Maximum Likelihood (ML) approach. Classifier C classifies an instance x by using the class conditional probabilities:

C(x) = \arg\max_k P(x \mid y = k)    (2)

For multivariate normal class density probabilities, i.e., P(x \mid y = k) \sim N(\mu_k, \Sigma_k), the classifier becomes:

C(x) = \arg\min_k \left\{ (x - \mu_k)' \Sigma_k^{-1} (x - \mu_k) + \log |\Sigma_k| \right\}    (3)

The special case is obtained when the class densities share the same diagonal covariance matrix. In this case, the classification formula known as the DLDA discrimination rule is obtained:


C(x) = \arg\min_k \sum_{j=1}^{p} \frac{(x_j - \mu_{kj})^2}{\sigma_j^2}    (4)
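A small numpy sketch of the DLDA rule in equation (4); it estimates per-class means and a pooled per-feature variance from labeled patterns, and is a simplified illustration rather than the implementation of [6]:

import numpy as np

def dlda_fit(X, y):
    # Per-class means and a pooled per-feature variance (shared diagonal covariance).
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.mean([X[y == k].var(axis=0) for k in classes], axis=0) + 1e-9
    return classes, means, var

def dlda_predict(X, classes, means, var):
    # Assign each pattern to the class minimizing sum_j (x_j - mu_kj)^2 / sigma_j^2.
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]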

3.2 The Second Method: Checking subsets of clusters

This method is designed for measuring the validity of a clustering using subsets of its clusters. The idea is that if the clustering is valid, then each of its subsets is expected to be valid as well. In this method, we aim to measure relative validity without referring to individual patterns. To test the validity of subsets, subsampling-based cluster count estimation algorithms can be used. This way, the confidence in validity is computed based on the stability of subsets of the original clustering. As in the previous method, multiple algorithms with the same parameters can be compared. The process is outlined in Algorithm 3.2.

Algorithm 3.2 (Subsets of clusters based method)
Input: T=dataset, K=number of clusters, B=number of subset subsamplings, kmax=maximum number of clusters

L = Cluster(T, K)
For b = 1 to min(B, 2^K - 1) do
    L_b = patterns belonging to the b'th subset of the K clusters
    K_b = Estimate_ClusterCount(L_b, kmax)
    A_b = 1 if K_b equals the number of cluster labels in L_b, 0 otherwise
end For
A = (1 / min(B, 2^K - 1)) * sum_{b=1}^{min(B, 2^K - 1)} A_b
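A rough Python sketch of Algorithm 3.2; estimate_cluster_count is a placeholder for a Clest-style estimator, and cluster subsets are drawn with the uniform selection probability a = 0.5 discussed below:

import numpy as np

def subset_confidence(T, labels, estimate_cluster_count, B=50, kmax=10, seed=0):
    # Proportion of randomly chosen cluster subsets whose estimated cluster
    # count matches the number of clusters actually present in the subset.
    rng = np.random.RandomState(seed)
    cluster_ids = np.unique(labels)
    hits = []
    for _ in range(min(B, 2 ** len(cluster_ids) - 1)):
        chosen = cluster_ids[rng.rand(len(cluster_ids)) < 0.5]  # a = 0.5
        if len(chosen) == 0:
            continue  # skip the empty subset
        mask = np.isin(labels, chosen)
        k_hat = estimate_cluster_count(T[mask], kmax)
        hits.append(1.0 if k_hat == len(chosen) else 0.0)
    return float(np.mean(hits))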

In each iteration of Algorithm 3.2, a subset of labels is selected randomly to form L_b. For K clusters, cluster i, 1 <= i <= K, is selected with probability a, i.e., uniform selection. We set a = 0.5 to make the expected size of the selected cluster label set K/2. A prediction-based resampling algorithm, namely Clest [7], is used as the cluster count estimator in Algorithm 3.2. In fact, Clest is a method with several parameters, and different instantiations of the parameters result in different algorithms; for example, the actual clustering and classifier algorithms are generic. The algorithm is given next.

Algorithm 3.3 (Clest algorithm)
Input: T=dataset, K=number of clusters, B=number of runs, B0=number of resamplings, kmax=maximum number of clusters

T_0 = T
For k = 2 to kmax do
    For i = 0 to B0 do
        For b = 1 to B do
            Randomly split T_i into non-overlapping learning and test sets
            Apply clustering algorithm P to the learning set
            Build a classifier using the labeled learning set
            Apply the resulting classifier to the test set
            Apply the clustering algorithm to the test set
            s_{k,i,b} = FM external index comparing the two sets of labels
        end For
        t_{k,i} = median(s_{k,i,1}, ..., s_{k,i,B})
    end For
    p_k = #{i : 1 <= i <= B0, t_{k,i} >= t_{k,0}} / B0
    d_k = t_{k,0} - t_k^0, where t_k^0 is the average of t_{k,1}, ..., t_{k,B0}
end For
K = {k | 2 <= k <= kmax, p_k <= p_max, d_k >= d_min}