Science in China Series F: Information Sciences © 2007
Science in China Press Springer-Verlag
Spectral clustering based on matrix perturbation theory

TIAN Zheng1,2†, LI XiaoBin1 & JU YanWei1
1 Department of Applied Mathematics, Northwestern Polytechnical University, Xi'an 710072, China;
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
This paper exposes some intrinsic characteristics of the spectral clustering method by using tools from matrix perturbation theory. We construct a weight matrix of a graph and study its eigenvalues and eigenvectors. It is shown that the number of clusters is equal to the number of eigenvalues that are larger than 1, and that the number of points in each cluster can be approximated by the associated eigenvalue. It is also shown that the eigenvectors of the weight matrix can be used directly to perform clustering; that is, the directional angle between two row vectors of the matrix derived from the eigenvectors is a suitable distance measure for clustering. As a result, an unsupervised spectral clustering algorithm based on the weight matrix (USCAWM) is developed. Experimental results on a number of artificial and real-world data sets confirm the correctness of the theoretical analysis.

Keywords: spectral clustering, weight matrix, spectrum of weight matrix, number of clusters, unsupervised spectral clustering algorithm based on weight matrix
Received July 20, 2006; accepted August 29, 2006
doi: 10.1007/s11432-007-0007-8
† Corresponding author (email: [email protected])
Supported by the National Natural Science Foundation of China (Grant No. 60375003) and the Aeronautical Science Foundation of China (Grant No. 03I53059)

Clustering has been the focus of considerable research in machine learning and pattern recognition. Over the years, numerous clever heuristics have been invented to tackle this problem, for example K-means, fuzzy C-means, etc. However, most of them suffer from several drawbacks. First, they usually perform clustering under the assumption that the data have some nice characteristics. Second, in many cases it is difficult to find a globally optimal solution for most clustering cost functions. As a promising alternative, the method known as spectral clustering has recently been applied successfully in a variety of different situations[1,2]. Originating from spectral graph partitioning[3,4], this method first constructs a weighted graph with the set of data points as the vertex set and then obtains a clustering by performing spectral analysis of a matrix associated with the graph. Since the spectral clustering method offers flexibility in the definition of the affinities between data points, which can be
the combination of a number of different data characteristics, it is simple and can recover clusters that take on complicated manifold structures. As a result, a whole family of algorithms and variants has been published recently, e.g., Ratio cut[5,6], Normalized cut[7,8], Min-max cut[9,10], etc. However, despite their empirical successes, spectral clustering methods are still incompletely understood, and no closed theory of their functioning and limitations is yet available. In fact, different authors still disagree on exactly what matrix to use and how to derive clusters from it. To penetrate the essence of the spectral clustering method, much work has been done in the past few years. Weiss[11] compared several spectral clustering methods experimentally. Dhillon et al.[12] studied the relationship between spectral clustering and K-means. Kannan et al.[13] proposed an (α, ε) bi-criteria measure to assess the quality of a clustering. Ng et al.[14] and Brand and Huang[15] used tools from perturbation theory to analyze the spectral clustering method. However, although these endeavors have helped improve our understanding of the method, several issues remain unresolved. In particular, what matrix derived from the data can be used for spectral clustering? Which and how many eigenvectors should be used? Do the eigenvalues contain clustering information? Is the spectral analysis just a prelude to further information-lossy data analysis? Can the eigenvectors be used directly to derive the clusters? The goal of this paper is to present reasonable answers to many of these questions. We first define the weight matrix of a graph. Then, by using tools from matrix perturbation theory to perform a spectral analysis of the weight matrix, we show that the number of clusters is equal to the number of eigenvalues that are larger than 1, and that the number of points in each cluster can be approximated by the associated eigenvalue. We also show that the eigenvectors of the weight matrix can be used directly to perform clustering; that is, the directional angle between two row vectors of the matrix derived from the eigenvectors is a suitable distance measure for clustering. As a result, an unsupervised spectral clustering algorithm based on the weight matrix is developed. We test the algorithm on a number of artificial and real-world data sets and compare it with two popular methods; the results show that our method outperforms the others.

This paper is organized as follows. Section 1 presents some terminology and notation. Section 2 gives our main results. A new unsupervised spectral clustering algorithm is developed in section 3. Section 4 reports the experimental results and comparisons, and we conclude in section 5.
1 Terminology and notation

Given a set of data points $V = \{v_1, v_2, v_3, \ldots, v_n\}$ that we want to cluster, construct a weighted graph $G(V, E)$, where $V$ is the vertex set and $E = V \times V$ the edge set. To simplify our exposition, let $e_{ij}$ represent the edge $(v_i, v_j)$ between the two nodes $v_i$ and $v_j$. The weight on the edge $e_{ij}$ is denoted by $w_{ij}$ and measures the similarity between the pair of nodes $v_i$ and $v_j$. In this paper, the edge weight always falls in the interval $[0, 1]$, and larger weights imply higher similarity between the points. In particular, $w_{ii} = 1$ for $i = 1, 2, \ldots, n$.

Definition 1 (Similarity function)[1,7]. A function of two variables $R : V \times V \to [0, 1]$ is called the similarity function of $V$ if it satisfies the following condition:
$$R(v_i, v_j) = w_{ij}, \quad i, j = 1, 2, \ldots, n.$$

Definition 2 (Weight matrix). The matrix $W = (w_{ij})_{n \times n}$ is called the weight matrix of the graph $G$. Obviously, $W = (w_{ij})_{n \times n} = (R(v_i, v_j))_{n \times n}$. For brevity, we write simply $W = R(V \times V)$.

Remark 1. (i) Since the data point $v_i$ corresponds strictly with its subscript $i$, in the following, the element of $V$ associated with the index $i$ is none other than the data point $v_i$. (ii) Clustering the data points in $V$ is equivalent to partitioning the set of vertices $V$ into mutually disjoint subsets $V_1, V_2, \ldots, V_k$ according to some similarity measure, namely
$$V = \bigcup_{i=1}^{k} V_i, \quad V_i \cap V_j = \varnothing, \ i \neq j, \ i, j = 1, 2, \ldots, k,$$
where the similarity among the vertices in a set $V_i$ is high and across different sets $V_i, V_j$ is low. Here, $k$ is the number of clusters and $V_i$ denotes the $i$th cluster.

Suppose a matrix $A \in \mathbb{R}^{n \times n}$. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge \cdots \ge \lambda_n$ be its eigenvalues, and $x_1, x_2, \ldots, x_k, \ldots, x_n$ the associated eigenvectors. In this paper, to simplify our exposition, we call $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ the first $k$ largest eigenvalues of $A$ and $x_1, x_2, \ldots, x_k$ the first $k$ largest eigenvectors of $A$. For notational simplicity, in the following, let $\lambda(A)$ denote the set of eigenvalues of $A$, and $E_n$ denote the $n \times n$ square matrix of all ones.
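To make the definitions concrete, the following minimal Python/NumPy sketch (ours, not part of the paper) builds a weight matrix of this kind from a Gaussian similarity of the form used later in section 4; the function name `weight_matrix` and the parameter `sigma` are illustrative choices.

```python
import numpy as np

def weight_matrix(points, sigma=0.5):
    """Build W = (R(v_i, v_j)) with a Gaussian similarity R taking values in [0, 1].

    points : (n, d) array of data points
    sigma  : scaling parameter of the Gaussian similarity
    """
    # Pairwise squared Euclidean distances between all points.
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    # Gaussian similarity; the diagonal is automatically 1, i.e. w_ii = 1.
    return np.exp(-sq_dists / sigma)

# Tiny usage example: two well-separated groups of points.
pts = np.vstack([np.random.randn(5, 2) * 0.05,
                 np.random.randn(4, 2) * 0.05 + 10.0])
W = weight_matrix(pts, sigma=0.5)
print(W.shape)                           # (9, 9)
print(np.allclose(np.diag(W), 1.0))      # True: w_ii = 1
```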
2 Main results

In this section, we use tools from matrix perturbation theory to analyze the spectrum of the weight matrix and show why it can be used to perform clustering. For illustrative purposes, we start with an idealized situation where the similarity between two points is 1 if they belong to the same cluster and 0 otherwise. Then, we relax the constraints on the data points step by step and perform the spectral analysis in a general situation.

For the sake of discussion, assume the data point set $V = \{v_1, v_2, v_3, \ldots, v_n\}$ consists of $k$ clusters, i.e.,
$$V = \bigcup_{i=1}^{k} V_i, \quad V_i \cap V_j = \varnothing, \ i \neq j, \ i, j = 1, 2, \ldots, k.$$
To facilitate the analysis of the following sections, assume further that the points in $V$ are ordered according to which cluster they are in, i.e.,
$$\underbrace{\{v_1, v_2, \ldots, v_{i_1}\}}_{n_1} \subset V_1,\ \underbrace{\{v_{i_1+1}, v_{i_1+2}, \ldots, v_{i_2}\}}_{n_2} \subset V_2,\ \ldots,\ \underbrace{\{v_{i_{k-1}+1}, v_{i_{k-1}+2}, \ldots, v_n\}}_{n_k} \subset V_k.$$
Then,
$$W = \begin{pmatrix} R(V_1 \times V_1) & R(V_1 \times V_2) & \cdots & R(V_1 \times V_k) \\ R(V_2 \times V_1) & R(V_2 \times V_2) & \cdots & R(V_2 \times V_k) \\ \vdots & \vdots & \ddots & \vdots \\ R(V_k \times V_1) & R(V_k \times V_2) & \cdots & R(V_k \times V_k) \end{pmatrix}. \tag{1}$$

2.1 The ideal case
Assume the similarity function $R$ satisfies the following condition:
$$R(v_i, v_j) = \begin{cases} 1, & v_i \text{ and } v_j \text{ belong to the same cluster;} \\ 0, & \text{otherwise,} \end{cases} \qquad i, j = 1, 2, \ldots, n.$$
Then, $W = \mathrm{diag}(E_{n_1}, E_{n_2}, \ldots, E_{n_k})$.

Lemma 1. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of the matrix $E_n$, and let $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^n$. Then, (i) $\lambda_1 = n$ and $\lambda_2 = \lambda_3 = \cdots = \lambda_n = 0$; (ii) the eigensubspace $V_{\lambda_1} = L(e)$.
Proof. The conclusions follow directly from an easy calculation. QED

Lemma 2. Let $\Lambda = \mathrm{diag}(A_{n_1}, A_{n_2}, \ldots, A_{n_k})$, where $A_{n_i} \in \mathbb{R}^{n_i \times n_i}$ is a real symmetric matrix. Then, $\lambda(\Lambda) = \bigcup_{i=1}^{k} \lambda(A_{n_i})$. Furthermore, if $x \in \mathbb{R}^{n_i}$ is the eigenvector associated with the eigenvalue $\lambda$ of the matrix $A_{n_i}$, then $(\underbrace{0, \ldots, 0}_{n_1}, \ldots, x^T, \ldots, \underbrace{0, \ldots, 0}_{n_k})^T$, with $x^T$ occupying the $i$th block, is the eigenvector corresponding to the eigenvalue $\lambda$ of the matrix $\Lambda$.
Proof. An easy calculation results in the conclusions directly. QED

Theorem 1. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of the matrix $W$. Let $x_1, x_2, \ldots, x_k$ be the first $k$ largest orthonormal eigenvectors of the matrix $W$, and let $X = (x_1, x_2, \ldots, x_k) = (\alpha_1^T, \alpha_2^T, \ldots, \alpha_n^T)^T$, where $\alpha_i$ is the $i$th row vector of the matrix $X$. Then,
(1) the number of clusters $k$ satisfies $k = \max\{i \mid \lambda_i > 1,\ i = 1, 2, \ldots, n\}$;
(2) $\cos\theta(\alpha_i, \alpha_j) = \dfrac{\alpha_i \alpha_j^T}{\|\alpha_i\|_2 \|\alpha_j\|_2} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ belong to the same cluster;} \\ 0, & \text{otherwise,} \end{cases} \quad i, j = 1, 2, \ldots, n.$
Proof. Conclusion (1) follows immediately from Lemma 1 and Lemma 2. To prove conclusion (2), without loss of generality, suppose $k = 3$. Let
$$e_1 = \frac{1}{\sqrt{n_1}}(\underbrace{1, \ldots, 1}_{n_1}, \underbrace{0, \ldots, 0}_{n_2}, \underbrace{0, \ldots, 0}_{n_3})^T,\quad
e_2 = \frac{1}{\sqrt{n_2}}(\underbrace{0, \ldots, 0}_{n_1}, \underbrace{1, \ldots, 1}_{n_2}, \underbrace{0, \ldots, 0}_{n_3})^T,\quad
e_3 = \frac{1}{\sqrt{n_3}}(\underbrace{0, \ldots, 0}_{n_1}, \underbrace{0, \ldots, 0}_{n_2}, \underbrace{1, \ldots, 1}_{n_3})^T.$$
From Lemma 1 and Lemma 2, it is easy to see that $L(x_1, x_2, x_3) = L(e_1, e_2, e_3)$. Therefore, there exists an orthogonal matrix $Q = (q_{ij})_{3 \times 3}$ that satisfies
$$(x_1, x_2, x_3) = (e_1, e_2, e_3)Q =
\begin{pmatrix}
\overbrace{\tfrac{q_{11}}{\sqrt{n_1}}, \ldots, \tfrac{q_{11}}{\sqrt{n_1}}}^{n_1} & \overbrace{\tfrac{q_{21}}{\sqrt{n_2}}, \ldots, \tfrac{q_{21}}{\sqrt{n_2}}}^{n_2} & \overbrace{\tfrac{q_{31}}{\sqrt{n_3}}, \ldots, \tfrac{q_{31}}{\sqrt{n_3}}}^{n_3} \\
\tfrac{q_{12}}{\sqrt{n_1}}, \ldots, \tfrac{q_{12}}{\sqrt{n_1}} & \tfrac{q_{22}}{\sqrt{n_2}}, \ldots, \tfrac{q_{22}}{\sqrt{n_2}} & \tfrac{q_{32}}{\sqrt{n_3}}, \ldots, \tfrac{q_{32}}{\sqrt{n_3}} \\
\tfrac{q_{13}}{\sqrt{n_1}}, \ldots, \tfrac{q_{13}}{\sqrt{n_1}} & \tfrac{q_{23}}{\sqrt{n_2}}, \ldots, \tfrac{q_{23}}{\sqrt{n_2}} & \tfrac{q_{33}}{\sqrt{n_3}}, \ldots, \tfrac{q_{33}}{\sqrt{n_3}}
\end{pmatrix}^{T},$$
so the $i$th row of $X = (x_1, x_2, x_3)$ is proportional to $(q_{l1}, q_{l2}, q_{l3})$ whenever $v_i \in V_l$. Since $Q$ is orthogonal, the row vectors are parallel within a cluster and orthogonal across clusters.
Thus, conclusion (2) holds. QED

Note that in the ideal case the following conditions are assumed to hold:
(i) $R(V_i \times V_i) = E_{n_i},\ i = 1, 2, \ldots, k$;
(ii) $R(V_i \times V_j) = O,\ i \neq j,\ i, j = 1, 2, \ldots, k$,
where $O$ denotes the null matrix. Obviously, in practice these conditions do not always hold. In the next subsection, we generalize the conclusions of Theorem 1 to the situation where only condition (ii) holds.
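As a quick numerical illustration of the ideal case (our own sketch, not from the paper; the block sizes are arbitrary), one can verify Theorem 1 directly: the eigenvalues of $\mathrm{diag}(E_{n_1}, E_{n_2}, E_{n_3})$ are $n_1, n_2, n_3$ and 0, and the rows of $X$ are parallel within a cluster and orthogonal across clusters.

```python
import numpy as np

# Ideal weight matrix W = diag(E_{n1}, E_{n2}, E_{n3}) with n1, n2, n3 = 5, 3, 2.
sizes = [5, 3, 2]
n = sum(sizes)
W = np.zeros((n, n))
start = 0
for m in sizes:
    W[start:start + m, start:start + m] = 1.0   # all-ones block E_m
    start += m

# Eigenvalues: one eigenvalue n_i per block, the rest are 0 (Lemma 1, Lemma 2).
vals, vecs = np.linalg.eigh(W)
vals, vecs = vals[::-1], vecs[:, ::-1]          # descending order
k = int(np.sum(vals > 1))
print(k, np.round(vals[:4], 6))                 # 3 and eigenvalues approx. 5, 3, 2, 0

# Rows of X = (x_1, ..., x_k): parallel within a cluster, orthogonal across clusters.
X = vecs[:, :k]
norms = np.linalg.norm(X, axis=1)
cos = (X @ X.T) / np.outer(norms, norms)
print(round(cos[0, 1], 6), round(cos[0, 7], 6)) # approx. 1 (same cluster), 0 (different)
```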
2.2 The block case
Assume the similarity function $R$ satisfies the following condition:
$$R(v_i, v_j) = \begin{cases} r_{ij}, & v_i \text{ and } v_j \text{ belong to the same cluster;} \\ 0, & \text{otherwise,} \end{cases} \qquad 0 \le r_{ij} \le 1,\ i, j = 1, 2, \ldots, n.$$
Then, $W = \mathrm{diag}(W_{n_1}, W_{n_2}, \ldots, W_{n_k})$, where $W_{n_i} = R(V_i \times V_i)$ for $i = 1, 2, \ldots, k$.

Theorem 2. Assume $\max_{i,j}(1 - r_{ij}) \le \sigma_1$. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $W$. If $\sigma_1$ is small enough and $n_i > 1$ for $i = 1, 2, \ldots, k$, then the number of clusters $k$ satisfies $k = \max\{i \mid \lambda_i > 1,\ i = 1, 2, \ldots, n\}$.

Proof. From Lemma 2, we obtain $\lambda(W) = \bigcup_{i=1}^{k} \lambda(W_{n_i})$. Now consider the eigenvalues of the matrices $W_{n_i}$, $i = 1, 2, \ldots, k$. Without loss of generality, choose one of them, denoted by $W_m \in \mathbb{R}^{m \times m}$, and consider its eigenvalues. Write
$$W_m = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} + \begin{pmatrix} 0 & -\varepsilon_{12} & \cdots & -\varepsilon_{1m} \\ -\varepsilon_{21} & 0 & \cdots & -\varepsilon_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ -\varepsilon_{m1} & -\varepsilon_{m2} & \cdots & 0 \end{pmatrix} = E_m + \varepsilon.$$
If $\eta_1 \ge \eta_2 \ge \cdots \ge \eta_m$ are the eigenvalues of the matrix $W_m$, then
$$|\eta_1 - m| \le \|\varepsilon\|_2, \quad |\eta_i| \le \|\varepsilon\|_2,\ i = 2, 3, \ldots, m.$$
From the assumption $\max_{i,j}(1 - r_{ij}) \le \sigma_1$, it is easy to see that $\|\varepsilon\|_\infty \le m\sigma_1$. Therefore, according to the Gerschgorin theorem, the eigenvalues of $\varepsilon$ all lie in the disc $o(0, \|\varepsilon\|_\infty)$. Hence,
$$|\eta_1 - m| \le m\sigma_1, \quad |\eta_i| \le m\sigma_1,\ i = 2, 3, \ldots, m.$$
This means
$$|\lambda_i - n_i| \le n_i\sigma_1,\ i = 1, 2, \ldots, k, \qquad |\lambda_i| \le \max_{1 \le i \le k} n_i\sigma_1,\ i = k+1, k+2, \ldots, n.$$
From this, it is easy to see that if $\sigma_1$ is small enough, e.g., $\sigma_1 \le \frac{1}{2n}$, then
$$n_i - \tfrac{1}{2} \le \lambda_i \le n_i + \tfrac{1}{2},\ i = 1, 2, \ldots, k, \qquad -\tfrac{1}{2} \le \lambda_i \le \tfrac{1}{2},\ i = k+1, k+2, \ldots, n.$$
Thus, the theorem holds. QED
In the above proof, if we denote by $\sigma_1^{(i)}$ the maximal element of the matrix $E_{n_i} - R(V_i \times V_i)$, then $\sigma_1 \ge \max_{1 \le i \le k} \sigma_1^{(i)}$. Obviously, the smaller the value of $\sigma_1^{(i)}$ is, the more similar the data points in the cluster $V_i$ are. From here on, $\sigma_1^{(i)}$ is called the dispersion degree of the cluster $V_i$, and $\sigma_1$ the intra-cluster dispersion degree of the data set $V$.

Theorem 3. Assume $\max_{i,j}(1 - r_{ij}) \le \sigma_1$ and $n_i > 1$ for $i = 1, 2, \ldots, k$. Let $\lambda_1, \lambda_2, \ldots, \lambda_k$ be the first $k$ largest eigenvalues of the matrix $W$ and $x_1, x_2, \ldots, x_k$ the associated orthonormal eigenvectors. Let $X = (x_1, x_2, \ldots, x_k) = (\alpha_1^T, \alpha_2^T, \ldots, \alpha_n^T)^T$. If $\sigma_1$ is small enough, then
$$\cos\theta(\alpha_i, \alpha_j) = \frac{\alpha_i \alpha_j^T}{\|\alpha_i\|_2 \|\alpha_j\|_2} = \begin{cases} 1, & v_i \text{ and } v_j \text{ belong to the same cluster;} \\ 0, & \text{otherwise.} \end{cases}$$

Proof. Without loss of generality, suppose $k = 3$. From Lemma 1 and Lemma 2, for $i = 1, 2, 3$, if $y_i = (y_{1i}, y_{2i}, \ldots, y_{n_i i})^T$ satisfies $W_{n_i} y_i = \lambda_i y_i$, $\|y_i\|_2 = 1$, then $\beta_1 = (y_1^T, 0, 0)^T$, $\beta_2 = (0, y_2^T, 0)^T$ and $\beta_3 = (0, 0, y_3^T)^T$ are the eigenvectors corresponding to the eigenvalues $\lambda_1$, $\lambda_2$ and $\lambda_3$, respectively. Since $L(x_1, x_2, x_3) = L(\beta_1, \beta_2, \beta_3)$, there exists an orthogonal matrix $Q = (q_{ij})_{3 \times 3}$ that satisfies
$$(x_1, x_2, x_3) = (\beta_1, \beta_2, \beta_3)Q =
\begin{pmatrix}
\overbrace{y_{11}q_{11}, \ldots, y_{n_1 1}q_{11}}^{n_1} & \overbrace{y_{12}q_{21}, \ldots, y_{n_2 2}q_{21}}^{n_2} & \overbrace{y_{13}q_{31}, \ldots, y_{n_3 3}q_{31}}^{n_3} \\
y_{11}q_{12}, \ldots, y_{n_1 1}q_{12} & y_{12}q_{22}, \ldots, y_{n_2 2}q_{22} & y_{13}q_{32}, \ldots, y_{n_3 3}q_{32} \\
y_{11}q_{13}, \ldots, y_{n_1 1}q_{13} & y_{12}q_{23}, \ldots, y_{n_2 2}q_{23} & y_{13}q_{33}, \ldots, y_{n_3 3}q_{33}
\end{pmatrix}^{T}.$$
An easy calculation immediately shows the theorem is true. QED

Remark 2. From the above proof, it is easy to see that if $\min_{1 \le i \le k} n_i \gg 1$, then Theorem 3 still holds even if the data set has a significantly large intra-cluster dispersion, i.e., even if $\sigma_1$ is large.

In contrast to the ideal case, in the block case we relax the constraint that the similarity between two data points from the same cluster equals 1, but we still require that the similarity between two data points from different clusters be null. In the next subsection, we relax this constraint as well and further generalize the conclusions of Theorem 2 and Theorem 3.
2.3 The general case
Assume the similarity function $R$ satisfies the following condition:
$$R(v_i, v_j) = r_{ij}, \quad 0 \le r_{ij} \le 1,\ i, j = 1, 2, \ldots, n.$$
Then,
$$W = \begin{pmatrix} R(V_1 \times V_1) & 0 & \cdots & 0 \\ 0 & R(V_2 \times V_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R(V_k \times V_k) \end{pmatrix} + \begin{pmatrix} 0 & R(V_1 \times V_2) & \cdots & R(V_1 \times V_k) \\ R(V_2 \times V_1) & 0 & \cdots & R(V_2 \times V_k) \\ \vdots & \vdots & \ddots & \vdots \\ R(V_k \times V_1) & R(V_k \times V_2) & \cdots & 0 \end{pmatrix} = W_1 + W_2.$$
Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ and $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n$ be the eigenvalues of the matrices $W$ and $W_1$, respectively. Then
$$|\lambda_i - \mu_i| \le \|W_2\|_2,\ i = 1, 2, \ldots, n.$$
From the Gerschgorin theorem, it is easy to see that $\|W_2\|_2 \le \|W_2\|_\infty$. Thus,
$$|\lambda_i - \mu_i| \le \|W_2\|_2 \le \|W_2\|_\infty,\ i = 1, 2, \ldots, n.$$
Suppose $r_{ij}$ satisfies the following condition: if $v_i$ and $v_j$ belong to the same cluster, then $1 - r_{ij} \le \sigma_1$; otherwise, $r_{ij} \le \sigma_2$. Then,
$$\|W_2\|_\infty \le \max_{1 \le i \le k}(n - n_i)\sigma_2.$$
From the discussion in section 2.2, it follows that if the intra-cluster dispersion of the data set $V$ is significantly small, then
$$|\mu_i - n_i| \le n_i\sigma_1,\ i = 1, 2, \ldots, k, \qquad |\mu_i| \le \max_{1 \le i \le k} n_i\sigma_1,\ i = k+1, k+2, \ldots, n.$$
Thus,
$$|\lambda_i - n_i| \le |\lambda_i - \mu_i| + |\mu_i - n_i| \le \max_{1 \le i \le k}(n - n_i)\sigma_2 + \max_{1 \le i \le k} n_i\sigma_1 \le n(\sigma_1 + \sigma_2),\ i = 1, 2, \ldots, k,$$
$$|\lambda_i| \le |\lambda_i - \mu_i| + |\mu_i| \le \max_{1 \le i \le k}(n - n_i)\sigma_2 + \max_{1 \le i \le k} n_i\sigma_1 \le n(\sigma_1 + \sigma_2),\ i = k+1, k+2, \ldots, n.$$
From the above discussion, we obtain the following theorem.

Theorem 4. Assume $R(v_i, v_j) = r_{ij}$ satisfies the following condition: if $v_i$ and $v_j$ belong to the same cluster, then $1 - r_{ij} \le \sigma_1$; otherwise, $r_{ij} \le \sigma_2$. Assume further $n_i > 1$ for $i = 1, 2, \ldots, k$. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of the matrix $W$. If $\sigma_1$ and $\sigma_2$ are small enough, then the number of clusters $k$ satisfies $k = \max\{i \mid \lambda_i > 1,\ i = 1, 2, \ldots, n\}$.
In the above discussion, if we denote by $\sigma_2^{(i,j)}$ the maximal element of the matrix $R(V_i \times V_j)$ $(i \neq j)$, then $\sigma_2 \ge \max_{i \neq j} \sigma_2^{(i,j)}$. Obviously, the smaller $\sigma_2^{(i,j)}$ is, the more dissimilar the points from $V_i$ and $V_j$, respectively, are. Therefore, the quantity $\sigma_2^{(i,j)}$ measures the overall similarity between the points in $V_i$ and the points in $V_j$. In this paper, for simplicity of analysis, we call $1 - \sigma_2^{(i,j)}$ the inter-dispersion degree between the clusters $V_i$ and $V_j$, and $1 - \sigma_2$ the inter-cluster dispersion degree of the data set $V$.

Now, let us consider the question of how to use the eigenvectors of $W$ to perform clustering. Rewrite $W$ as
$$W = \mathrm{diag}(E_{n_1}, E_{n_2}, \ldots, E_{n_k}) + \mathrm{diag}(W_{n_1} - E_{n_1}, W_{n_2} - E_{n_2}, \ldots, W_{n_k} - E_{n_k}) + W_2.$$
Let $\lambda_1, \lambda_2, \ldots, \lambda_k$ be the first $k$ largest eigenvalues of the matrix $W$, and $x_1, x_2, \ldots, x_k$ the associated orthonormal eigenvectors. In addition, let
$$y_i = \Big(\underbrace{\delta_{1i}\tfrac{1}{\sqrt{n_1}}, \ldots, \delta_{1i}\tfrac{1}{\sqrt{n_1}}}_{n_1}, \ldots, \underbrace{\delta_{ki}\tfrac{1}{\sqrt{n_k}}, \ldots, \delta_{ki}\tfrac{1}{\sqrt{n_k}}}_{n_k}\Big)^T, \quad i = 1, 2, \ldots, k,$$
be the first $k$ largest eigenvectors of the diagonal matrix $\mathrm{diag}(E_{n_1}, E_{n_2}, \ldots, E_{n_k})$. Let $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$. If $\sigma_1$ and $\sigma_2$ are small enough, then
$$|\lambda_i| \le |\lambda_i - \mu_i| + |\mu_i| \le \max_{1 \le i \le k}(n - n_i)\sigma_2 + \max_{1 \le i \le k} n_i\sigma_1 \le n(\sigma_1 + \sigma_2),\ i = k+1, k+2, \ldots, n.$$
Note that $n_1, n_2, \ldots, n_k$ are the first $k$ largest eigenvalues of the matrix $\mathrm{diag}(E_{n_1}, E_{n_2}, \ldots, E_{n_k})$. Therefore, if $n_i > 1$ for $i = 1, 2, \ldots, k$, then from ref. [16] we obtain
$$\|YY^T - XX^T\|_F \le \frac{\|(\mathrm{diag}(W_{n_1} - E_{n_1}, W_{n_2} - E_{n_2}, \ldots, W_{n_k} - E_{n_k}) + W_2)Y\|_F}{\min_{1 \le i \le k} n_i - 2n(\sigma_1 + \sigma_2)}.$$
Since the Frobenius norm is consistent with the spectral norm,
$$\begin{aligned}
\|(\mathrm{diag}(W_{n_1} - E_{n_1}, \ldots, W_{n_k} - E_{n_k}) + W_2)Y\|_F
&\le \|\mathrm{diag}(W_{n_1} - E_{n_1}, \ldots, W_{n_k} - E_{n_k}) + W_2\|_2 \, \|Y\|_F \\
&\le \sqrt{k}\,\big(\|\mathrm{diag}(W_{n_1} - E_{n_1}, \ldots, W_{n_k} - E_{n_k})\|_\infty + \|W_2\|_\infty\big) \\
&\le \sqrt{k}\,\big(\max_{1 \le i \le k} n_i\sigma_1 + \max_{1 \le i \le k}(n - n_i)\sigma_2\big) \\
&\le \sqrt{k}\,n(\sigma_1 + \sigma_2).
\end{aligned}$$
Thus,
$$\|YY^T - XX^T\|_F \le \frac{\sqrt{k}\,n(\sigma_1 + \sigma_2)}{\min_{1 \le i \le k} n_i - 2n(\sigma_1 + \sigma_2)}.$$
A summary of the above discussion results in the following theorem.

Theorem 5. Let $x_1, x_2, \ldots, x_k$ be the first $k$ largest orthonormal eigenvectors of $W$. Let
$$y_i = \Big(\underbrace{\delta_{1i}\tfrac{1}{\sqrt{n_1}}, \ldots, \delta_{1i}\tfrac{1}{\sqrt{n_1}}}_{n_1}, \ldots, \underbrace{\delta_{ki}\tfrac{1}{\sqrt{n_k}}, \ldots, \delta_{ki}\tfrac{1}{\sqrt{n_k}}}_{n_k}\Big)^T, \quad i = 1, 2, \ldots, k,$$
be the first $k$ largest eigenvectors of the diagonal matrix $\mathrm{diag}(E_{n_1}, E_{n_2}, \ldots, E_{n_k})$. Let $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$. Assume $R(v_i, v_j) = r_{ij}$ satisfies the following condition: if $v_i$ and $v_j$ belong to the same cluster, then $1 - r_{ij} \le \sigma_1$; otherwise, $r_{ij} \le \sigma_2$. Assume further $n_i > 1$ for $i = 1, 2, \ldots, k$. If $\sigma_1$ and $\sigma_2$ are small enough, then
$$\|YY^T - XX^T\|_F \le \frac{\sqrt{k}\,n(\sigma_1 + \sigma_2)}{\min_{1 \le i \le k} n_i - 2n(\sigma_1 + \sigma_2)}.$$
Theorem 5 shows that, as $\sigma_1, \sigma_2 \to 0$, the row vectors of the matrix $X$ tend to be parallel if the associated data points belong to the same cluster, and orthogonal otherwise.

In the above discussion, by using tools from matrix perturbation theory, we have performed the spectral analysis of the weight matrix in three situations, step by step, and shown that the eigenvectors of the weight matrix can be used directly to cluster the data points. From Theorem 4 and Theorem 5, if $\sigma_1$ is small enough and $1 - \sigma_2$ large enough, then the number of clusters is equal to the number of eigenvalues that are larger than 1. Furthermore, the number of points in each cluster can be approximated by the associated eigenvalue. In addition, whether two data points are in the same cluster can be determined by the angle between the associated row vectors of the matrix $X$ formed by stacking the first $k$ largest eigenvectors of the weight matrix $W$ in columns; that is, if the two data points belong to the same cluster, then the associated row vectors of $X$ tend to be parallel, and if the two points are from different clusters, then the associated row vectors tend to be orthogonal. In particular, if $\sigma_2 = 0$ and $\sigma_1$ is small enough, then the row vectors of $X$ will be parallel if the associated data points belong to the same cluster and orthogonal otherwise.

Recall that up to now we have always assumed that the data points are ordered according to which cluster they are in. In practice, this assumption is not always true. However, a different enumeration of the data points does not change the validity of the above statements. In fact, there always exists a permutation $p$ of $V$ that makes the points in $pV$ nicely enumerated; that is, there exist permutation matrices $P_1, P_2, \ldots, P_l$ such that $P_l \cdots P_2 P_1 W P_1 P_2 \cdots P_l$ resembles the matrix in (1). Let $P = P_l \cdots P_2 P_1$. Since $W$ is a real symmetric matrix, there exists an orthogonal matrix $U$ that satisfies $W = U\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)U^T$. Obviously, the column vectors of $U$ are the eigenvectors of $W$. Hence,
$$PWP^T = PU\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)U^TP^T = PU\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)(PU)^T.$$
Since $PUU^TP^T = PP^T = I$, the column vectors of $PU$ are the orthonormal eigenvectors of $PWP^T$. Note that $PU$ is just the row-permuted version of $U$. Therefore, when using the row vectors of $X$ to perform clustering, it does not matter whether the data points in $V$ are ordered according to which cluster they are in.
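To see Theorem 4 at work numerically, here is a small sketch (ours, with arbitrary cluster sizes and dispersion levels, not from the paper) that builds a weight matrix whose intra-cluster similarities are close to 1 and whose inter-cluster similarities are small, and checks that the number of eigenvalues larger than 1 recovers k, with each such eigenvalue close to the corresponding cluster size.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [20, 15, 10]                       # n_1, n_2, n_3
n, k = sum(sizes), len(sizes)
labels = np.repeat(np.arange(k), sizes)

sigma1, sigma2 = 0.05, 0.02                # intra- and inter-cluster dispersion levels
W = np.where(labels[:, None] == labels[None, :],
             1.0 - sigma1 * rng.random((n, n)),   # r_ij in [1 - sigma1, 1] within a cluster
             sigma2 * rng.random((n, n)))         # r_ij in [0, sigma2] across clusters
W = np.triu(W, 1) + np.triu(W, 1).T + np.eye(n)   # symmetrize and set w_ii = 1

vals = np.sort(np.linalg.eigvalsh(W))[::-1]
k_est = int(np.sum(vals > 1))
print(k_est)                     # 3
print(np.round(vals[:4], 3))     # approximately [20, 15, 10, small], as Theorem 4 predicts
```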
3 The algorithm

In this section, we first describe an unsupervised spectral clustering algorithm and then discuss the selection of a key parameter.

3.1 The clustering algorithm
Our clustering algorithm consists of the following steps; a short illustrative sketch of these steps is given after the list.
1. Given a set of data points $V$, construct a weighted graph $G(V, E, W)$.
2. Calculate the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ of the weight matrix $W$ and the associated orthonormal eigenvectors $x_1, x_2, \ldots, x_n$.
3. Let $k = \max\{i \mid \lambda_i > 1,\ i = 1, 2, \ldots, n\}$.
4. Form the matrix $X = (x_1, x_2, \ldots, x_k) = (\alpha_1^T, \alpha_2^T, \ldots, \alpha_n^T)^T$ by stacking the first $k$ largest eigenvectors in columns.
5. Compute $\dfrac{\alpha_i \alpha_j^T}{\|\alpha_i\|_2 \|\alpha_j\|_2}$ and let
$$R(v_i, v_j) = \begin{cases} 1, & \text{if } \dfrac{\alpha_i \alpha_j^T}{\|\alpha_i\|_2 \|\alpha_j\|_2} \ge flag, \\ 0, & \text{otherwise,} \end{cases}$$
where $flag$ is a predefined threshold. If $R(v_i, v_j) = 1$, then assign $v_i$ and $v_j$ to the same cluster.
6. If the number of clusters obtained equals $k$, the clustering process is over. Otherwise, return to step 5 and choose a new value of the threshold $flag$ to restart.
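The following Python/NumPy sketch of steps 1-6 is ours and is only illustrative; it is not the authors' implementation. It assumes the Gaussian similarity $R_1$ of section 4 as the weight function, reads clusters off as connected components of the thresholded-cosine graph, and simplifies the retry rule in step 6 to a monotone decrease of the threshold `flag`; other update rules are possible.

```python
import numpy as np

def uscawm(points, sigma_x=0.5, flag=0.3877, flag_step=0.05, max_tries=20):
    """Illustrative sketch of the USCAWM steps (not the authors' code)."""
    # Step 1: weight matrix from a Gaussian similarity (w_ii = 1).
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / sigma_x)

    # Step 2: eigenvalues (descending) and orthonormal eigenvectors.
    vals, vecs = np.linalg.eigh(W)
    vals, vecs = vals[::-1], vecs[:, ::-1]

    # Step 3: estimated number of clusters.
    k = int(np.sum(vals > 1))

    # Step 4: X holds the first k largest eigenvectors; its rows are the alpha_i.
    X = vecs[:, :k]

    # Steps 5-6: threshold the cosine between rows and read off groups;
    # retry with a smaller flag if the number of groups differs from k.
    norms = np.linalg.norm(X, axis=1)
    cos = (X @ X.T) / np.outer(norms, norms)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_tries):
        labels = _components(cos >= flag)
        if labels.max() + 1 == k:
            break
        flag = max(flag - flag_step, 0.0)
    return k, labels

def _components(adj):
    """Connected components of a boolean adjacency matrix (labels 0..c-1)."""
    n = adj.shape[0]
    labels = -np.ones(n, dtype=int)
    c = 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack, labels[s] = [s], c
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if labels[v] < 0:
                    labels[v] = c
                    stack.append(v)
        c += 1
    return labels

# Usage: three well-separated 2-D clusters of sizes 20, 15 and 10.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(c, 0.05, size=(m, 2))
                 for c, m in [((0, 0), 20), ((3, 3), 15), ((6, 0), 10)]])
k, labels = uscawm(pts, sigma_x=0.5)
print(k, np.bincount(labels))   # expected: 3 clusters of sizes 20, 15, 10
```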
3.2 Choice of flag
Using the same notation as in section 2.3, let $X = (\alpha_1^T, \alpha_2^T, \ldots, \alpha_n^T)^T$ and $Y = (\beta_1^T, \beta_2^T, \ldots, \beta_n^T)^T$. It is easy to see that $\|\beta_i\|_2 = \frac{1}{\sqrt{n_j}}$ if $v_i \in V_j$, for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, k$. Moreover,
$$\alpha_i\alpha_j^T = (\beta_i + \alpha_i - \beta_i)(\beta_j + \alpha_j - \beta_j)^T = \beta_i\beta_j^T + (\alpha_i - \beta_i)\beta_j^T + (\alpha_j - \beta_j)\beta_i^T + (\alpha_i - \beta_i)(\alpha_j - \beta_j)^T.$$
To estimate the value of $flag$, assume $r_1\|\beta_i\|_2 \le \|\alpha_i - \beta_i\|_2 \le r_2\|\beta_i\|_2$ for $i = 1, 2, \ldots, n$.

First, consider the situation where $\|\alpha_i - \beta_i\|_2$ is large, and assume $r_1 > 1$. Then
$$|\alpha_i\alpha_j^T - \beta_i\beta_j^T| \le |(\alpha_i - \beta_i)\beta_j^T| + |(\alpha_j - \beta_j)\beta_i^T| + |(\alpha_i - \beta_i)(\alpha_j - \beta_j)^T| \le (2r_2 + r_2^2)\,\|\beta_i\|_2\|\beta_j\|_2.$$
Since $\|\alpha_i\|_2 \ge \|\alpha_i - \beta_i\|_2 - \|\beta_i\|_2 \ge (r_1 - 1)\|\beta_i\|_2$, i.e., $\frac{1}{r_1 - 1}\|\alpha_i\|_2 \ge \|\beta_i\|_2$, it follows that
$$|\alpha_i\alpha_j^T - \beta_i\beta_j^T| \le \frac{2r_2 + r_2^2}{(r_1 - 1)^2}\,\|\alpha_i\|_2\|\alpha_j\|_2.$$
If $v_i$ and $v_j$ belong to the same cluster, then
$$\alpha_i\alpha_j^T \ge \beta_i\beta_j^T - \frac{2r_2 + r_2^2}{(r_1 - 1)^2}\,\|\alpha_i\|_2\|\alpha_j\|_2.$$
Note that $\|\alpha_i\|_2 \le \|\beta_i\|_2 + \|\alpha_i - \beta_i\|_2 \le \|\beta_i\|_2 + r_2\|\beta_i\|_2 = (r_2 + 1)\|\beta_i\|_2$, so
$$\frac{\alpha_i\alpha_j^T}{\|\alpha_i\|_2\|\alpha_j\|_2} \ge \frac{1}{(r_2 + 1)^2} - \frac{2r_2 + r_2^2}{(r_1 - 1)^2}.$$
Conversely, if $v_i$ and $v_j$ belong to different clusters, then
$$\frac{\alpha_i\alpha_j^T}{\|\alpha_i\|_2\|\alpha_j\|_2} \le \frac{2r_2 + r_2^2}{(r_1 - 1)^2}.$$
Obviously, it is necessary that
$$\frac{1}{(r_2 + 1)^2} - \frac{2r_2 + r_2^2}{(r_1 - 1)^2} \ge \frac{2r_2 + r_2^2}{(r_1 - 1)^2},$$
i.e., $(r_1 - 1)^2 - 2(r_2 + 1)^2(2r_2 + r_2^2) \ge 0$. Since $r_1 \le r_2$, this inequality cannot hold. Therefore, when $\|\alpha_i - \beta_i\|_2$ is large, there does not exist a suitable value of $flag$.

Next, consider the situation where $\|\alpha_i - \beta_i\|_2$ is small, and assume $\|\alpha_i - \beta_i\|_2 \le \frac{1}{l}\|\beta_i\|_2$ $(l > 1)$. Then, following arguments similar to the above, we obtain
$$|\alpha_i\alpha_j^T - \beta_i\beta_j^T| \le |(\alpha_i - \beta_i)\beta_j^T| + |(\alpha_j - \beta_j)\beta_i^T| + |(\alpha_i - \beta_i)(\alpha_j - \beta_j)^T| \le \frac{2l + 1}{l^2}\,\|\beta_i\|_2\|\beta_j\|_2.$$
Since $\|\beta_i\|_2 \le \|\alpha_i\|_2 + \|\alpha_i - \beta_i\|_2 \le \|\alpha_i\|_2 + \frac{1}{l}\|\beta_i\|_2$, i.e., $\|\alpha_i\|_2 \ge \frac{l - 1}{l}\|\beta_i\|_2$, it follows that
$$|\alpha_i\alpha_j^T - \beta_i\beta_j^T| \le \frac{2l + 1}{(l - 1)^2}\,\|\alpha_i\|_2\|\alpha_j\|_2.$$
If $v_i$ and $v_j$ belong to the same cluster, then
$$\alpha_i\alpha_j^T \ge \beta_i\beta_j^T - \frac{2l + 1}{(l - 1)^2}\,\|\alpha_i\|_2\|\alpha_j\|_2.$$
Note that $\|\alpha_i\|_2 \le \|\beta_i\|_2 + \|\alpha_i - \beta_i\|_2 \le \|\beta_i\|_2 + \frac{1}{l}\|\beta_i\|_2 = \frac{l + 1}{l}\|\beta_i\|_2$, so
$$\frac{\alpha_i\alpha_j^T}{\|\alpha_i\|_2\|\alpha_j\|_2} \ge \frac{l^2}{(l + 1)^2} - \frac{2l + 1}{(l - 1)^2}.$$
Conversely, if $v_i$ and $v_j$ belong to different clusters, then
$$\frac{\alpha_i\alpha_j^T}{\|\alpha_i\|_2\|\alpha_j\|_2} \le \frac{2l + 1}{(l - 1)^2}.$$
Obviously, it is necessary that
$$\frac{l^2}{(l + 1)^2} - \frac{2l + 1}{(l - 1)^2} \ge \frac{2l + 1}{(l - 1)^2},$$
i.e., $l^4 - 6l^3 - 9l^2 - 8l - 2 \ge 0$. A calculation shows that this inequality holds when $l \ge 7.3729$, with
$$\frac{l^2}{(l + 1)^2} - \frac{2l + 1}{(l - 1)^2} \ge 0.3877 \ge \frac{2l + 1}{(l - 1)^2}.$$
The above arguments show that, when $\|\alpha_i - \beta_i\|_2 \le \frac{1}{7.3729}\|\beta_i\|_2$, it is a good choice to let $flag = 0.3877$. Obviously, if $\|\alpha_i - \beta_i\|_2$ is smaller, a larger value of $flag$ can be chosen. Conversely, if there exists an index $i_0$ that satisfies $\|\alpha_{i_0} - \beta_{i_0}\|_2 > \frac{1}{7.3729}\|\beta_{i_0}\|_2$, then some data points will be misclassified when letting $flag = 0.3877$.
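As a quick check of the arithmetic above (our own verification, not part of the paper), the relevant real root of $l^4 - 6l^3 - 9l^2 - 8l - 2$ and the corresponding threshold value can be computed directly:

```python
import numpy as np

# Roots of l^4 - 6l^3 - 9l^2 - 8l - 2 = 0; the relevant real root is near 7.37.
roots = np.roots([1, -6, -9, -8, -2])
l = max(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 1)
print(round(l, 4))                                     # ~7.373 (the paper quotes 7.3729)

# At this root the inter-cluster upper bound equals the intra-cluster lower bound,
# which motivates the default threshold flag = 0.3877.
upper_different = (2 * l + 1) / (l - 1) ** 2
lower_same = l ** 2 / (l + 1) ** 2 - upper_different
print(round(upper_different, 4), round(lower_same, 4)) # both ~0.3877
```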
4 Experimental results and analysis
To justify the above theoretical analysis and test the effectiveness of our clustering algorithm, we applied it to a number of artificial data sets and two benchmark data sets containing real-world data. We also compared the performance of the proposed algorithm with that of two popular clustering algorithms.

4.1 Algorithm certification
We applied the clustering algorithm to spatial point sets to justify our theoretical analysis. In principle, there are mainly two cases to be considered: (i) the inter-cluster dispersion is large, and (ii) the inter-cluster dispersion is small. Before we start, we would like to point out that in the experiments we always choose a similarity function based on the Gaussian-weighted Euclidean distance between a pair of data points,
$$R_1(v_i, v_j) = \exp\left(-\frac{d_1^2(v_i, v_j)}{\sigma_x}\right),$$
where $d_1(v_i, v_j)$ denotes the Euclidean distance between $v_i$ and $v_j$, and $\sigma_x$ is the scaling
parameter that controls how rapidly the affinity $R_1(v_i, v_j)$ falls off with the distance $d_1(v_i, v_j)$.

4.1.1 The case of large inter-cluster dispersion. We first consider the situation where the inter-cluster dispersion is large. Figure 1 shows a spatial point set and the clustering result. The original point set shown in Figure 1(a) consists of three clusters that contain 20, 15 and 10 points, respectively. In this case, the inter-cluster dispersion of the point set is so large that the first 3 largest eigenvalues of the weight matrix are expected to approximate well the numbers of points in the three associated clusters. Figure 1(b) shows the eigenvalues of the weight matrix; the first 4 largest eigenvalues are $\lambda_1 = 20.0002$, $\lambda_2 = 13.4294$, $\lambda_3 = 9.5009$, $\lambda_4 = 0.64729$. Obviously, this result strongly confirms our analysis. Note that the difference between $\lambda_2$ and 15 is somewhat large. Observing the three clusters in Figure 1(d), it is easy to see that the points in cluster 1 are relatively dense except for an isolated point, while the points in clusters 2 and 3 are relatively sparse. From the analysis in section 2, since the scaling parameter $\sigma_x = 0.0573$ and the Euclidean distance between two arbitrary data points from different clusters is large, the weight matrix is approximately block-diagonal. Thus, the large difference could be caused by the large value of $\sigma_1^{(2)}$. Calculation gives $\sigma_1^{(2)} = 0.1702$ and $\sigma_1^{(3)} = 0.1465$, which agrees with our analysis. In the experiments, it is also found that when the value of the scaling parameter $\sigma_x$ is large, e.g., $\sigma_x = 1$, the correct clustering result can still be obtained by using the first 3 eigenvectors, but the fluctuation of the eigenvalues can be very large. For example, when $\sigma_x = 1$, the first 4 largest eigenvalues are $\lambda_1 = 39.0422$, $\lambda_2 = 4.8383$, $\lambda_3 = 1.0765$, $\lambda_4 = 0.032393$. This shows that when the number of points in each cluster is large, a perfect clustering can still be obtained by our algorithm even if the fluctuation of the eigenvalues of the weight matrix is large. It is easy to see that this conclusion agrees with the analysis in section 2.2.
Figure 1 Clustering of an artificial data set with large inter-cluster dispersion. (a) The original artificial data set; (b) the eigenvalues of the weight matrix, where λ1=20.0002, λ2=13.4294, λ3=9.5009, λ4=0.64729; (c) the clustering result; (d) the original three clusters.
Figure 2 Clustering of an artificial data set with small inter-cluster dispersion. (a) The original artificial data set; (b) the eigenvalues of the weight matrix, where λ1=27.0007, λ2=9.3832, λ3=6.4091, λ4=0.98948; (c) the clustering result; (d) the original three clusters.
4.1.2 The case of small inter-cluster dispersion. We then consider the situation where the inter-cluster dispersion is small. We tested our clustering algorithm on the spatial point set shown in Figure 2. This point set also consists of three clusters that contain 20, 15 and 10 points, respectively. However, unlike the point set in Figure 1, the left two clusters in Figure 2(a) are so close that it may be difficult to separate them. Figure 2(b) shows the eigenvalues of the weight matrix; the first 4 largest eigenvalues are $\lambda_1 = 27.0007$, $\lambda_2 = 9.3832$, $\lambda_3 = 6.4091$, $\lambda_4 = 0.98948$. Obviously, in this case the eigenvalues are not a good approximation to the numbers of points in the associated clusters. As a result, one data point is misclassified (see Figure 2(c) for the detail). Observing the point set in Figure 2(d), it is easy to see that there exist two points in cluster 1 whose Euclidean distance is much larger than that of some pairs of points taken from clusters 1 and 2, respectively. A refinement of our analysis shows that it is indeed this fact that makes the inter-cluster dispersion of the point set so small that the dominant eigenvectors of the weight matrix cannot produce a perfect clustering. This conclusion confirms our analysis in section 2.3.

4.2 Comparison to Normalized cut
In ref. [7], Shi et al. present a spectral clustering method, the Normalized cut, which performs clustering by using the eigenvector associated with the second smallest eigenvalue of the normalized Laplacian
$$D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}},$$
where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ and $d_i = \sum_{j=1}^{n} w_{ij}$. To compare the performance of our grouping algorithm and the Normalized cut, we applied the two methods to the spatial point set shown in Figure 3(a). This point set consists of three clusters: two rings and a line segment. We first chose the similarity function $R_1$ to calculate the edge weights, but in this case both methods failed. Then, we assigned gray level 30 to the points in one cluster, 50 to those in the second and 70 to those in the last, and chose the similarity functions
$$R_2(v_i, v_j) = \exp\left(-\frac{d_2^2(v_i, v_j)}{\sigma_g}\right),$$
where $d_2(v_i, v_j)$ denotes the difference between the gray levels of $v_i$ and $v_j$, for our algorithm, and $R = R_1 \cdot R_2$ for the Normalized cut, to calculate the edge weights. This time, both methods worked well. Figure 3(b) shows the clustering result.

From the clustering process, it is found that there is a great contrast between the Normalized cut and our method. First, for our method the threshold $flag$ is determined by calculation, whereas for the Normalized cut the splitting threshold, which is used to partition the second smallest eigenvector of the normalized Laplacian into two parts, is always determined by many experiments. Second, choosing a suitable similarity function is a key requirement for both methods. In the experiments, when using the Normalized cut and choosing $R = R_1 \cdot R_2$ as the similarity function according to ref. [7], since the Euclidean distance between two points is not a suitable clustering feature, many experiments have to be done to choose a suitable scaling parameter. Figure 3(c) and (d) show the second smallest eigenvectors of the normalized Laplacians with $\sigma_g = 0.1$, $\sigma_x = 1$ and $\sigma_g = 0.1$, $\sigma_x = 10$, respectively. From Figure 3(c) and (d), it is easy to see that if one lets $\sigma_x = 1$, it is very difficult to find a suitable splitting point to partition the eigenvector into two parts corresponding to different clusters. It is also easy to see that if one lets $\sigma_x = 10$, then the value of the edge weight on $e_{ij}$ is mainly determined by the difference between the gray levels of $v_i$ and $v_j$. This shows that in practice the Normalized cut also uses the difference between the gray levels of two nodes as the clustering feature. Figure 3(e) shows the second smallest eigenvector of the normalized Laplacian with $\sigma_g = 0.1$ when choosing $R_2$ as the similarity function. From Figure 3(e), it is easy to see that this eigenvector is piecewise constant and a perfect clustering can be obtained by using only the second smallest eigenvector of the normalized Laplacian.
Figure 3 (a) An artificial data set. (b) Clustering result obtained by using USCAWM and the Normalized cut. (c) Eigenvector associated with the second smallest eigenvalue of the normalized Laplacian $D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}}$ with $\sigma_g = 0.1$, $\sigma_x = 1$. (d) Eigenvector associated with the second smallest eigenvalue of the normalized Laplacian $D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}}$ with $\sigma_g = 0.1$, $\sigma_x = 10$. (e) Eigenvector associated with the second smallest eigenvalue of the normalized Laplacian $D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}}$ with $\sigma_g = 0.1$ when choosing $R_2$ as the similarity function. (f) Eigenvector associated with the third smallest eigenvalue of the normalized Laplacian $D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}}$ with $\sigma_g = 0.1$ when choosing $R_2$ as the similarity function.
Third, in contrast to our method, the Normalized cut uses only the second smallest eigenvector of the normalized Laplacian to cluster. This can be computationally wasteful, since other eigenvectors may also contain useful partitioning information. For example, Figure 3(f) shows the third smallest eigenvector of the normalized Laplacian with $\sigma_g = 0.1$ when choosing $R_2$ as the similarity function; obviously, this eigenvector can also be used to perform clustering. Fourth, our algorithm is unsupervised and can obtain a $k$-way partition in a single run, whereas the Normalized cut finds the partition only by using a recursive 2-way cut.
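For reference, the following short sketch (ours; the function name is illustrative, and this is not the authors' Normalized-cut implementation) computes the eigenvector associated with the second smallest eigenvalue of the normalized Laplacian $D^{-1/2}(D - W)D^{-1/2}$ from a given weight matrix.

```python
import numpy as np

def second_smallest_eigenvector(W):
    """Eigenvector of D^{-1/2}(D - W)D^{-1/2} for its second smallest eigenvalue."""
    d = W.sum(axis=1)                         # degrees d_i = sum_j w_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    return vecs[:, 1]                         # second smallest

# Usage with a toy near-block weight matrix: the eigenvector is roughly
# piecewise constant over the two blocks, so a splitting threshold separates them.
W = np.block([[np.full((4, 4), 0.9), np.full((4, 3), 0.05)],
              [np.full((3, 4), 0.05), np.full((3, 3), 0.9)]])
np.fill_diagonal(W, 1.0)
print(np.round(second_smallest_eigenvector(W), 3))
```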
4.3 Performance evaluation
To evaluate the performance of our clustering method, we applied it to two benchmark data sets from the UCI machine-learning repository[17]: the Iris and Wine data sets. The Iris data set consists of 3 clusters and contains 150 data points, each of which has 4 features denoted by $F = (F_1, F_2, F_3, F_4)$. The Wine data set contains 178 data points from 3 clusters, and each data point has 13 attributes denoted by $F = (F_1, F_2, F_3, \ldots, F_{13})$. The data sets are outlined in Table 1, where CN denotes the number of clusters. In the experiments, we chose $R_1$ as the similarity function and set the scaling parameter to 0.5. For the Iris data set, the feature vector $(F_1, F_2)$ was extracted as the clustering feature, and the misclassification rate of our algorithm is listed in Table 1. Figure 4 shows the visualized clustering result of the Iris data set, where the horizontal axis represents the feature $F_1$ and the vertical axis represents the feature $F_2$. From Figure 4, it can be seen that, with the current clustering feature, our method reaches a good clustering result with only 6 false assignments, even though the right two clusters partially overlap. To test whether our method is sensitive to the choice of clustering feature, for the Wine data set we simply chose $F_{10} \times F_{13}$ as the clustering feature. However, the corresponding misclassification rate listed in Table 1 shows that this is not a good choice; it also shows that our algorithm depends heavily on the choice of clustering feature.

Based on the same clustering features, we also applied K-means and the Normalized cut to the two data sets to compare the performance of the different techniques. Table 1 shows the misclassification rates of the two methods. Here, when using K-means, the initial cluster centers are randomly selected and the Euclidean distance is used to measure the distance between two data points. To yield comparable experimental results, the number of clusters is predefined when using these two methods. From the experimental results, it is easy to see that for the Iris data set the misclassification rates of K-means and our method are both low, but that of the Normalized cut is high. A detailed analysis shows that the Normalized cut favors dividing a data set into two subsets with similar intra-cluster similarity[7], whereas our algorithm and K-means do not. For the Wine data set, none of the three methods reaches a good result, which may be attributed to the choice of clustering feature. Nevertheless, the misclassification rate of K-means is more than twice that of our method, as is that of the Normalized cut. This shows that, despite the lack of an effective clustering feature, the performance of our method can still be relatively good. Analyzing the performance of K-means on the two data sets suggests that this method favors convex clusters; in contrast, our method is more suitable for clustering.
Figure 4 The visualized clustering result of the Iris data set, with the features F1 and F2 as the horizontal and vertical axes, respectively.

Table 1 Empirical comparison of the three algorithms
Data set   Size   CN   Dim.   USCAWM    Normalized cut   K-means
Iris       150    3    4      0.04      0.33333          0.053333
Wine       178    3    13     0.28652   0.66854          0.66854
5 Conclusions

In this paper, we analyzed the spectral clustering method by using tools from matrix perturbation theory, and developed an unsupervised spectral clustering algorithm. The main contributions include the following. (1) By analyzing the spectrum of the weight matrix, we obtained the relation between the eigenvalues of the weight matrix and the number of clusters: based on a suitable similarity function, the number of clusters is equal to the number of eigenvalues whose values are larger than 1. In addition, the number of points in each cluster can be approximated by the associated eigenvalue. (2) By analyzing the eigenvectors of the weight matrix, we found that the eigenvectors can be used directly to perform clustering. In fact, the directional angle between two row vectors of the matrix formed by stacking the first k largest eigenvectors of the weight matrix in columns can be used as a distance measure for clustering. (3) Based on the spectral analysis, we developed a weight-matrix-based unsupervised spectral clustering algorithm. Comparing it with the Normalized cut, the main differences can be summarized as follows. First, our method is unsupervised. Second, our method performs clustering in a single run, whereas the Normalized cut performs clustering by a recursive 2-way cut. Third, our method uses k eigenvectors to obtain a k-way partition simultaneously, whereas the Normalized cut uses only one eigenvector to perform clustering. Finally, the threshold of our method is determined by calculation, whereas the splitting threshold of the Normalized cut is always obtained by many experiments.
In short, our method is well suited to performing clustering. It is worth noting that the key point in applying our algorithm is the selection of the similarity function. Although many similarity functions exist, no closed theory of their functioning and limitations is yet available. Constructing a suitable similarity function for a concrete problem is left as a subject of our future research.

We thank the anonymous reviewers for their significant work. We also thank the editors for improving the article.
References

1 Bach F R, Jordan M I. Learning spectral clustering. University of California at Berkeley Technical Report UCB/CSD-03-1249, 2003
2 Xing E P, Jordan M I. On semidefinite relaxation for normalized k-cut and connections to spectral clustering. University of California at Berkeley Technical Report UCB/CSD-3-1265, 2003
3 Donath W E, Hoffman A J. Lower bounds for the partitioning of graphs. IBM J Res Devel, 1973, 17(5): 420–425
4 Fiedler M. A property of eigenvectors of non-negative symmetric matrices and its application to graph theory. Czechoslovak Mathemat J, 1975, 25(100): 619–633
5 Hagen L, Kahng A B. New spectral methods for ratio cut partitioning and clustering. IEEE Trans Comput-Aid Design, 1992, 11(9): 1074–1085
6 Chan P K, Schlag M D F, Zien J Y. Spectral k-way ratio-cut partitioning and clustering. IEEE Trans Comput-Aid Design Integ Circ Syst, 1994, 13(9): 1088–1096
7 Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Patt Anal Mach Intel, 2000, 22(8): 888–905
8 Fowlkes C, Belongie S, Chung F, et al. Spectral grouping using the Nyström method. IEEE Trans Patt Anal Mach Intel, 2004, 26(2): 214–225
9 Ding C H Q, He X, Zha H, et al. A min-max cut algorithm for graph partitioning and data clustering. In: Cercone N, Lin T Y, Wu X, eds. ICDM 2001. Los Alamitos, California: IEEE Computer Society, 2001. 107–114
10 Ding C H Q, He X, Zha H. A spectral method to separate disconnected and nearly-disconnected web graph components. In: Provost F, Srikant R, eds. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2001. 275–280
11 Weiss Y. Segmentation using eigenvectors: a unifying view. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. Los Alamitos, California: IEEE Computer Society, 1999. 975–982
12 Dhillon I S, Guan Y, Kulis B. A unified view of kernel k-means, spectral clustering and graph cuts. University of Texas at Austin UTCS Technical Report TR-04-25, 2004
13 Kannan R, Vempala S, Vetta A. On clusterings: good, bad and spectral. J ACM, 2004, 51(3): 497–515
14 Ng A Y, Jordan M I, Weiss Y. On spectral clustering: analysis and an algorithm. In: Dietterich T G, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002. 849–856
15 Brand M, Huang K. A unifying theorem for spectral embedding and clustering. Mitsubishi Electric Research Laboratory Technical Report TR2002-42, 2002
16 Sun J. Matrix Perturbation Analysis (in Chinese). 2nd ed. Beijing: Science Press, 2001. 252–272
17 Hettich S, Bay S D. The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science, 1999