
Cluster Validity Through Graph-based Boundary Analysis

Jianhua Yang† and Ickjai Lee‡

† School of Computing and Information Technology, University of Western Sydney, Campbelltown, NSW 2560, Australia. [email protected], Tel: +61 2 4620 3831, Fax: +61 2 4620 3826

‡ School of Information Technology, James Cook University, Townsville, QLD 4811, Australia. [email protected], Tel: +61 7 4781 6905, Fax: +61 7 4781 4029

Abstract - Gaining confidence that a clustering algorithm has produced meaningful results, rather than an accident of its usually heuristic optimization, is central to data mining. This is the issue of cluster validity. We propose a method in which proximity graphs are used to detect border points effectively and to measure the margin between clusters. Based on this analysis of the boundary situation, we design a framework and working principles for evaluating the separation and compactness of clustering results. The method yields insight into the internal structure of a clustering result.

Keywords: Clustering, Proximity Graph, Cluster Validity, Data Mining.

I. Introduction

Data mining constitutes the core step of the knowledge discovery process: identifying valid, novel, potentially useful, and ultimately understandable patterns in data. While researchers have proposed various methods for discovering knowledge from large data sets, cluster analysis is a central task in data mining processes. Clustering aims at discovering smaller, more homogeneous groups and identifying interesting patterns in a large heterogeneous collection of data objects [2]. Clustering is a challenging task, as there is normally little or no a priori information about the data being mined. Cluster analysis reflects the exploratory nature of data mining. Its strategy is structure seeking: finding previously unknown and unsuspected patterns for hypothesis generation. Its operation, however, is structure imposing [1]. To solve clustering problems, assumptions are naturally made in order to select a model to fit to the data; clustering is then translated into an optimization problem via an appropriate induction principle. The pair of a model and an induction principle is referred to as a clustering context [4]. Clustering contexts are constructed in diverse ways, depending on the application and the researchers' backgrounds; as a result, there are many clustering methods and algorithms [4].

Since clustering algorithms depend significantly on the data and on the way an algorithm represents (models) the potential structure in the data [9], the hypothetical structure postulated as the result of a clustering algorithm must be evaluated to gain confidence that it actually exists in the data. The validity of results is of utmost importance, since patterns in data are far from useful if they are invalid. Cluster validity expresses a certain amount of confidence that reflects how well the revealed cluster structure fits the data. The purpose of cluster validity is to increase the confidence in the groups proposed by a clustering algorithm. A fundamental way is to measure how significant the resulting clusters are. Formalizing how significant a partition is implies fitting metrics between the clusters and the data structure (structure understood as discrepancy from the null hypothesis) [9]. A variety of tasks are known under the term cluster validity, such as testing clustering tendency, tuning parameters, selecting best-fit algorithms, and measuring the significance of the revealed structure. These problems have motivated several research efforts [3], [17].

Because there is no universal definition of "cluster", cluster validity is a challenging issue, and various methods have been proposed for it. Clear and comprehensive descriptions of the statistical tools (of hypothesis-testing type) available for cluster validity appear in [10], [13]. The information contained in data models can also be captured using concepts from information theory [9]. In specialized cases, like conceptual schema clustering, formal validation has been used for suggesting and verifying certain properties [16]. In addition to theoretical indexes, empirical evaluation methods [14] are used in cases where sample data sets with similar known patterns are available. In contrast, for settings where visualization is possible, intuitive verification of the clustering results is feasible.

Although there is no shortage of statistical indices in the clustering literature that measure some sort of grouping quality, validating the output of clustering algorithms is a difficult problem in general. Perhaps the major problem lies in embedding cluster validity into the induction principles of clustering [15]. Note that validity is a function of the data set, whereas a cluster partition is the result of optimizing an induction principle [4]; such optimization should not be interpreted as obtaining valid clusters. While formal validity guarantees the consistency of clustering operations in some special cases, like information system modeling, it is not a general-purpose method. The major drawback of empirical evaluation is the lack of benchmarks and a unified methodology; in addition, in practice it is sometimes not simple to obtain reliable and accurate ground truth. In the case of large multidimensional data sets, effective visualization of the data is difficult; moreover, perceiving clusters with the available visualization tools is a difficult task for humans [13].

In the following sections, we propose a new approach to cluster validity using proximity graphs. Proximity graphs exhibit good performance and properties in modeling relations and constructing boundaries. Applying proximity graphs to clustering results and learning the boundary situation between clusters provides an effective approach to verifying valleys and detecting outliers. A key characteristic of proximity graphs is their boundary tractability under complex distributions with arbitrary shapes and large density changes. In addition, the present approach provides a geometric interpretation of cluster validity.

The rest of this paper is organized as follows. Section II presents related work on cluster validity via boundary analysis. Section III addresses the favorable features of proximity graphs for modeling relations and learning the boundary situation in data. Section IV discusses the idea behind our approach. We conclude the paper with Section V.

II. Related Work

Despite the difficulties of cluster validity processes, the observation of "natural" clusters is straightforward. Intuitively speaking, clustering results are significant if there are well-defined separations between clusters and there is at least one dense core within each cluster. These criteria are the so-called compactness and separation [13].

Fig. 1. Analysis of boundary situation: (a) a margin separating two clusters; (b) detected border points; (c) coherent regions after filtering border points.

The two criteria reflect the objective of clustering methods: to discover the significant groups present in a data set such that members of the same cluster are close to each other (in other words, have a high degree of similarity) while members of different clusters are well separated. Compactness and separation are the two main non-parametric measures for comparing clustering schemes [2]. Rather than agonizing endlessly over a mathematical interpretation of the rather straightforward consensus that clusters should be relatively separated and compact, effective measurements of separation and compactness should be developed to quantify the qualitative significance of a clustering result. Measuring separation and estimating compactness is, more theoretically, a question of the complexity of the boundaries among the detected clusters. By analyzing the boundary situation, separation can be measured by the margin between clusters, and compactness can be estimated while extreme border points are filtered out gradually. Fig. 1 demonstrates this idea: the white region in Fig. 1(a) represents a margin that separates two clusters; the circled points in Fig. 1(b) are detected border points; Fig. 1(c) illustrates the coherent regions of the two clusters after filtering some border points. The key issues here are as follows:
• How can an appropriate boundary be constructed effectively between two non-linearly separable clusters?
• How can the inter-cluster distance between two arbitrarily shaped clusters be computed to measure their separation?
• How can the extreme border points that reflect the boundary situation be detected?
• How can potential outliers that complicate the compactness of clusters be filtered out, so as to obtain a clear cluster structure?
For data distributions of higher dimension and with large density changes, boundary analysis is even more challenging.

A potential problem is that many validity techniques that compare inter-cluster versus intra-cluster variability tend to favor configurations of ball-shaped, well-separated clusters; irregularly shaped clusters are problematic. Moreover, existing validity methods are too weak to measure the margin between two clusters that are arbitrarily shaped and non-linearly separable. In this situation, traditional measures, like single-linkage, are unable to detect the actual separation between two clusters. In addition, these methods lack a mechanism for dealing with outliers.

To develop validity methods based on boundary analysis in complex situations, Estivill-Castro and Yang proposed a cluster validity approach using Support Vector Machines (SVMs) [7]. SVMs build discriminants for pattern recognition in terms of support vectors (border points). Cluster validity using SVMs is readily computable from the labeled clustering result in a supervised manner. SVMs are capable of providing good generalization for high-dimensional training data; they can deal with arbitrary boundaries in the data space and are not limited to linear discriminants. While these properties suit the realistic data structures that clustering applications are concerned with, selecting appropriate models and incorporating domain knowledge is not straightforward, and the implicit mapping into a higher-dimensional feature space makes utilizing domain knowledge and interpreting results difficult. Readers may refer to [7] for details. In this sense, we propose an alternative cluster validity approach using proximity graphs. Similarly, we attempt to confirm that the structure of a clustering result is significant by analyzing boundary complexity.

III. Proximity Graphs

The operation of clustering is structure imposing, and introducing a model to encode the information in the data is the primary step in cluster analysis. While cluster information can be encoded with continuous mathematical models, it can also be described using discrete structural models. Measuring proximity for all pairs of data objects results in a complete weighted graph that associates the weight d(xi, xj) with the edge (xi, xj). Proximity graphs aim at representing the entire information available in the complete proximity matrix by sparse graphs that are much smaller than the quadratic proximity matrix. In a proximity graph, vertices represent data points, and edges connect pairs of points to model proximity and adjacency. Proximity graph modeling reduces redundancy while preserving the integrity of the data, so it is not excessive yet remains informative. Since proximity graphs effectively encode both proximity and topological relationships between data patterns, they have been widely used for modeling data, in particular spatial data patterns.

Fig. 2. Modeling with the Voronoi and Delaunay diagrams: (a) the Voronoi diagram; (b) the Delaunay diagram.

A well-known proximity graph is the Voronoi diagram, in which two spatial patterns are neighbors if and only if their territories share a common Voronoi boundary. The explicit representation of this nearby-neighborhood relationship is the Delaunay diagram, which is the dual graph of the Voronoi diagram [5]. In 2D, the Delaunay diagram approximates the complete graph with a linear number of edges; moreover, it includes many other proximity graphs (such as the Gabriel graph, the relative neighborhood graph, and the minimum spanning tree). Fig. 2 demonstrates data modeling with the Voronoi diagram and the Delaunay diagram. In addition, argument-dependent proximity graphs [6], like k-nearest neighbors, are also widely adopted to model data patterns. For arbitrary patterns, a hypergraph model can be employed to map the relationships present in the data into a hypergraph.

Proximity graphs thus deliver a powerful data modeling method, capable of handling non-convex data and large density changes and of dealing with noisy data. The expected time complexity is sub-quadratic for data of any dimension when an appropriate underlying proximity graph is chosen. In the following sections, we apply proximity graphs to the output of clustering algorithms to evaluate clustering quality: the graphs are used to analyze the boundary situation and thereby verify the separation and compactness of clusters. In the following discussion, we mainly use the Delaunay diagram for illustration.

IV. Cluster Validity Through Proximity Graphs

In this section, we discuss boundary analysis through proximity graphs to detect border points, verify separation, and check the compactness of a clustering structure.

A. Separation test

To verify the separation between two clusters, we propose the following working principles (a sketch implementing them appears at the end of this subsection):
• Construct a proximity graph on the clustering output.
• Learn the margin γ between the two clusters from the proximity-graph computation.
• Calculate the local neighborhood distances of some border points from each cluster.
• Compare the margin to the local neighborhood distances to judge whether the separation is significant.

Fig. 3. Margin and local neighborhood distances: (a) margin computation; (b) local neighborhood.

Fig. 3(a) shows a Delaunay diagram constructed on a clustering output, from which border points and inter-linkages are extracted. To measure the margin γ between two clusters effectively, we take into account all inter-linkages (the adjacent edges between the two clusters); the average length of these inter-linkages is used as the margin. For arbitrarily shaped clustering structures, this metric for inter-cluster distance is much more accurate than the traditional single-linkage distance. To describe the local geometric features of border points, we randomly choose a number u of border points (say up to 5) from each class. For each chosen border point, we pick its k (say up to 5) nearest neighbors based on the constructed proximity graph; as a result, we can calculate the average neighborhood distance of each selected border point, and eventually the total average neighborhood distance over the border points selected from each class. Fig. 3(b) shows a border point p and its neighbors. Let γ1 be the total average neighborhood distance for one class, and γ2 that for the other class. To verify the significance of the separation between two clusters, we compare γ with γi. Given scalars t1 < 1 and t2 > 1, the relation between the local measures and the margin is evaluated by checking whether either of the following conditions holds:

γ1 < t1 · γ  or  γ2 < t1 · γ      (1)

γ1 > t2 · γ  or  γ2 > t2 · γ      (2)

If condition (1) holds for carefully selected control parameters t1 and t2, the clusters are well separated; if condition (2) holds, they are not separable. In our experiments, we set t1 = 0.5 and t2 = 2 to evaluate separation.
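The following is a minimal sketch of this separation test under our reading of the procedure. It reuses the delaunay_edges helper sketched in Section III; restricting local neighborhoods to same-cluster edges is our assumption, not stated above, and the function assumes the two clusters are adjacent in the graph.

```python
import numpy as np

def separation_test(points, labels, edges, u=5, k=5, t1=0.5, t2=2.0, seed=0):
    """Judge the separation of two clusters (labels 0 and 1) from a
    proximity graph given as a set of undirected edges (i, j)."""
    dist = lambda a, b: float(np.linalg.norm(points[a] - points[b]))
    # Inter-linkages: graph edges whose endpoints lie in different clusters.
    inter = [(a, b) for a, b in edges if labels[a] != labels[b]]
    gamma = np.mean([dist(a, b) for a, b in inter])  # margin = mean inter-linkage length
    # Border points are the endpoints of inter-linkages, grouped by cluster.
    border = {c: sorted({p for e in inter for p in e if labels[p] == c}) for c in (0, 1)}
    # Adjacency over same-cluster edges, used for local neighborhoods (our assumption).
    adj = {}
    for a, b in edges:
        if labels[a] == labels[b]:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    rng = np.random.default_rng(seed)
    g = {}
    for c in (0, 1):
        chosen = rng.choice(border[c], size=min(u, len(border[c])), replace=False)
        per_point = []
        for p in chosen:
            nbrs = sorted(adj.get(p, []), key=lambda q: dist(p, q))[:k]
            if nbrs:
                per_point.append(np.mean([dist(p, q) for q in nbrs]))
        g[c] = float(np.mean(per_point))  # total average neighborhood distance γ_c
    if g[0] < t1 * gamma or g[1] < t1 * gamma:   # condition (1)
        return "well separated"
    if g[0] > t2 * gamma or g[1] > t2 * gamma:   # condition (2)
        return "not separable"
    return "inconclusive"
```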

B. Compactness test

Verification of the compactness of a clustering result is largely concerned with the distribution of outliers. Outliers are data items in the boundary regions surrounding the cores of clusters. Occasionally the results of a clustering algorithm do not accurately describe the groups in the data, or are hard to interpret, because outliers mask the data model. When these potential outliers are detected and removed, the cores of the clusters appear; in this sense, filtering outliers yields a clearer structure of the data. However, estimating outliers is not straightforward: the difficulty is that we lack prior knowledge about the distribution of outliers in the data. Making use of nothing more than the available data, we verify compactness by checking the stability (robustness, repeatability) of the cluster assignment while gradually removing a small number of outliers [7], [12]. In this process, we treat border points as potential outliers, and we check the robustness of the reclustering assignment after their removal as an indicator of cluster compactness. Removing a few points from the cores does not affect their shape; in contrast, points on cluster boundaries lie in sparse regions, and perturbing them does change the shape of the boundaries. If reclustering after removing some outliers produces repeatable assignments, we are more confident that the cores of the classes exist. Let C1 and C2 be consecutively produced clustering structures from a clustering method working on a data set X, or on a subset of X with outliers removed. We measure the degree of match between these two clustering structures using the Rand statistic R, the Jaccard coefficient J, and the Fowlkes-Mallows index FM [11]. The values of all three statistics lie between 0 and 1; the larger the value, the higher the degree to which C1 matches C2. Fig. 4 and Fig. 5 illustrate the two different cases: in the case of Fig. 4, the reclustering results are completely repeatable, whereas in the case of Fig. 5, reclustering gives a model significantly different from the first run.
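For concreteness, here is a small sketch of these three match indices in their standard pair-counting form, computed over the points that survive the removal of border points; the implementation is ours, not code from [11].

```python
from itertools import combinations

def match_indices(c1, c2):
    """Pair-counting agreement between two labelings of the same points.

    a: point pairs grouped together in both C1 and C2;
    b: together in C1 only; c: together in C2 only; d: apart in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(c1)), 2):
        same1, same2 = c1[i] == c1[j], c2[i] == c2[j]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    R = (a + d) / (a + b + c + d)            # Rand statistic
    J = a / (a + b + c)                      # Jaccard coefficient
    FM = a / ((a + b) * (a + c)) ** 0.5      # Fowlkes-Mallows index
    return R, J, FM

# Identical labelings give (1.0, 1.0, 1.0), as in Fig. 4.
print(match_indices([0, 0, 1, 1], [0, 0, 1, 1]))
```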

Fig. 4. For compact data, the reclustering result C2 repeats the previous result C1 when border points are removed: (a) structure C1; (b) border points; (c) structure C2 (R = 1.0, J = 1.0, FM = 1.0).

Fig. 5. For non-compact data, the reclustering result C2 is only partially repeated relative to the previous result C1 when border points are removed: (a) structure C1; (b) border points; (c) structure C2 (R = 0.5124, J = 0.4083, FM = 0.6389).

C. Working paradigm

Consider an arbitrary clustering algorithm A. The idea behind the present approach is to increase our confidence in the result of applying A to a data set. If the clustering result is separable and repeatable, we gain confidence that the data actually reflects this clustering and that it is not an artifact of the clustering algorithm; we say the clustering result has a good sense of validity. On the other hand, if reclustering results are not quite repeatable but well separable, or repeatable but not quite separable, we call the current run a valid run. However, if reclustering produces output that is neither separable nor repeatable, we call the current run an invalid run; in this case, the border points removed in the previous run may not be outliers, and they should be recovered for reclustering. Valid and invalid runs can still be discriminated in an iterative analysis process. After several rounds of the above validity process, if consecutive clustering results converge to a stable assignment (that is, the result of each run is repeatable and separable), we believe the potential outliers have been removed and the cores of the clusters have emerged. If most of the repetitions produce invalid runs, that is, if clustering solutions differ across runs without good separation, the clustering results are not interesting. For more than two clusters, this approach checks each pair of clusters; that is, it works in a pairwise way. This approach provides a novel mechanism for addressing cluster validity problems in more elaborate analyses, as required by a number of clustering applications. The intuitive interpretability of boundary complexity makes practical cluster validity easy to carry out.
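As an illustration only, the following sketches the iterative paradigm just described. The parameters cluster, find_border_points, and separation_ok are hypothetical helpers standing in for the clustering algorithm A, the border-point detection of Section IV-A, and the separation test; match_indices is the pair-counting sketch from Section IV-B, and the stability threshold of 0.9 is our assumption.

```python
import numpy as np

def validity_loop(X, cluster, find_border_points, separation_ok,
                  max_rounds=5, stable=0.9):
    """Iterative validity check: remove border points, recluster, and
    compare consecutive assignments on the surviving points."""
    labels = np.asarray(cluster(X))
    valid_runs = 0
    for _ in range(max_rounds):
        border = set(find_border_points(X, labels))
        keep = np.array([i for i in range(len(X)) if i not in border])
        new_labels = np.asarray(cluster(X[keep]))
        R, J, FM = match_indices(list(labels[keep]), list(new_labels))
        repeatable = min(R, J, FM) >= stable   # assumed stability threshold
        separated = separation_ok(X[keep], new_labels)
        if not repeatable and not separated:
            continue        # invalid run: recover the removed points, recluster
        valid_runs += 1     # valid run: accept this filtering step
        X, labels = X[keep], new_labels
        if repeatable and separated:
            break           # converged: cores have emerged
    return valid_runs
```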

V. Final Remarks

Cluster validity is a certain amount of confidence that the cluster structure found is significant. Intuitively, if clusters are isolated from each other and each cluster is compact, the clustering results are somehow natural. In this paper, we have applied proximity graphs to cluster validity. By analyzing the boundary situation in terms of border information, we can estimate separation performance and potential outliers. The validity approach provides insight into the structure of the data. After several rounds of reclustering and outlier filtering, we obtain clearer clustering structures. Counting the number of valid runs and checking the robustness of results across rounds contributes to verifying the goodness of a clustering result. More specifically, we have provided an effective approach to detecting border points and measuring a robust inter-cluster distance.

References

[1] M. S. Aldenderfer and R. K. Blashfield. Cluster Analysis. Sage Publications, Beverly Hills, USA, 1984.
[2] M. J. A. Berry and G. S. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, Inc., New York, 1997.
[3] R. C. Dubes. How many clusters are best? An experiment. Pattern Recognition, 20(6):645–663, 1987.
[4] V. Estivill-Castro. Why so many clustering algorithms: a position paper. SIGKDD Explorations, 4(1):65–75, 2002.
[5] V. Estivill-Castro and I. Lee. AMOEBA: Hierarchical clustering based on spatial proximity using Delaunay diagram. In Proc. of the 9th Int. Symposium on Spatial Data Handling, pages 7a.26–7a.41, 2000.
[6] V. Estivill-Castro, I. Lee, and A. T. Murray. Criteria on proximity graphs for boundary extraction and spatial clustering. LNCS, 2035:348–347, 2001.
[7] V. Estivill-Castro and J. Yang. Cluster validity using support vector machines. In Proc. of the 5th Int. Conference on Data Warehousing and Knowledge Discovery (DaWaK 2003), pages 244–256, 2003.
[8] E. Gokcay and J. Principe. A new clustering evaluation function using Renyi's information potential. In Proc. of the IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2000), pages 3490–3493, 2000.
[9] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988. Advanced Reference Series: Computer Science.
[10] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. In Proc. of the Int. Workshop on Program Comprehension, 2000.
[11] E. Levine and E. Domany. Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13:2573–2593, 2001.
[12] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems (Special Issue on Scientific and Statistical Database Management), 2001.
[13] A. Rauber, J. Paralic, and E. Pampalk. Empirical evaluation of clustering algorithms. In Proc. of the 11th Int. Conference on Information and Intelligent Systems (IIS'2000), 2000.
[14] S. J. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30(5):261–272, 1997.
[15] R. Winter. Formal validation of schema clustering for large information systems. In Proc. of the 1st Americas Conference on Information Systems, 1995.
[16] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8):841–847, 1991.