IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009

Clustering with Local and Global Regularization
Fei Wang, Changshui Zhang, Member, IEEE, and Tao Li

Abstract—Clustering is an old research topic in data mining and machine learning. Most of the traditional clustering methods can be categorized as local or global ones. In this paper, a novel clustering method that can explore both the local and global information in the data set is proposed. The method, Clustering with Local and Global Regularization (CLGR), aims to minimize a cost function that properly trades off the local and global costs. We show that such an optimization problem can be solved by the eigenvalue decomposition of a sparse symmetric matrix, which can be done efficiently using iterative methods. Finally, the experimental results on several data sets are presented to show the effectiveness of our method.

Index Terms—Clustering, local learning, smoothness, regularization.

1 INTRODUCTION

Clustering [18] is one of the most fundamental research topics in both the data mining and machine learning communities. It aims to divide data into groups of similar objects, i.e., clusters. From a machine learning perspective, what clustering does is to learn the hidden patterns of the data set in an unsupervised way, and these patterns are usually referred to as data concepts. From a practical perspective, clustering plays a vital role in data mining applications such as information retrieval, text mining, Web analysis, marketing, computational biology, and many others [15].
Many clustering methods have been proposed so far, among which K-means [12] is one of the most popular algorithms. K-means aims to minimize the sum of the squared distances between the data points and their corresponding cluster centers. However, it is well known that the K-means algorithm suffers from some problems: 1) the predefined criterion is usually nonconvex, which causes many locally optimal solutions; 2) the iterative procedure for optimizing the criterion usually makes the final solution depend heavily on the initialization. In the last decades, many methods [17], [38] have been proposed to overcome these problems.
Recently, another type of method, based on clustering over data graphs, has aroused considerable interest in the machine learning and data mining communities. The basic idea behind these methods is to first model the whole data set as a weighted graph, in which the graph nodes represent the data points and the weights on the edges correspond to the similarities between pairwise points.

. F. Wang and C. Zhang are with Tsinghua University, FIT 3-120, Beijing 100084, PR China. E-mail: [email protected], [email protected].
. T. Li is with the School of Computing and Information Sciences, Florida International University, ECS 251, Miami, FL 33199. E-mail: [email protected].
Manuscript received 31 Jan. 2008; revised 28 Aug. 2008; accepted 14 Jan. 2009; published online 22 Jan. 2009. Recommended for acceptance by V. Ganti. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2008-01-0073. Digital Object Identifier no. 10.1109/TKDE.2009.40. 1041-4347/09/$25.00 © 2009 IEEE

Then, the cluster assignments of the data set can be achieved by optimizing some criteria defined on the graph. For example, spectral clustering is one of the most representative graph-based clustering approaches; it aims to optimize some cut value (e.g., Normalized Cut [27], Ratio Cut [7], Min-Max Cut [11]) defined on an undirected graph. After some relaxations, these criteria can usually be optimized via eigendecompositions, and the solutions are guaranteed to be globally optimal. In this way, spectral clustering efficiently avoids the problems of the traditional K-means method.
In this paper, we propose a novel clustering algorithm that inherits the superiority of spectral clustering, i.e., the final clustering results can also be obtained by exploiting the eigenstructure of a symmetric matrix. However, unlike spectral clustering, which just enforces a smoothness constraint on the data labels over the whole data manifold [2], our method first constructs a regularized linear label predictor for each data point from its neighborhood, and then combines the results of all these local label predictors with a global label smoothness regularizer. So, we call our method Clustering with Local and Global Regularization (CLGR). The idea of incorporating both local and global information into label prediction is inspired by recent work on semisupervised learning [41], and our experimental evaluations on several real document data sets show that CLGR performs better than many state-of-the-art clustering methods.
The rest of this paper is organized as follows: Section 2 briefly reviews some related works, Section 3 introduces our CLGR algorithm in detail, the experimental results are presented in Section 4, and conclusions and discussions are given in Section 5.

2 RELATED WORKS

Before going into the details of our CLGR algorithm, we first briefly review some works that are closely related to this paper.

2.1 Local Learning Algorithms
The goal of learning is to choose, from a given set of functions, the one which best approximates the supervisor's responses.


Fig. 1. An intuitive illustration of why local learning is needed. There are two classes of data points in the figure, which are severely overlapped. Hence, in this case, it is difficult to find a global classifier which can discriminate the two classes well. However, if we split the input space into some local regions, then it is much easier to construct good local classifiers.

Fig. 2. Two examples of the local kernel $k$. (a) A square kernel, which only selects the data points in a specific neighborhood with "diameter" $\varepsilon_0$. (b) A smooth kernel, which gives different weights to all data points according to their position with respect to $\mathbf{c}_0$, and its size is controlled by $\varepsilon_0$.

In traditional supervised learning, we are given a set of training sample-label pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$; then a common principle for selecting a good $f$ is to minimize the following structural risk:
$$\mathcal{J}(f(\mathbf{x}_i, \mathbf{w})) = \sum_{i=1}^n L\big(y_i, f(\mathbf{x}_i, \mathbf{w})\big) + \lambda\|f\|_{\mathcal{F}}^2, \qquad (1)$$
where $L(\cdot,\cdot)$ is the loss function (e.g., the square loss in regularized least-squares regression, or the hinge loss in SVMs), and $\|f\|_{\mathcal{F}}$ is the induced norm of $f$ in the functional space $\mathcal{F}$ (e.g., $\mathcal{F}$ can be a Reproducing Kernel Hilbert Space induced by some kernel $k$ if we construct a kernel machine). Clearly, the above statement of the learning problem implies that a unique function $f^*(\mathbf{x}, \mathbf{w}^*)$ (which minimizes (1)) will be used for prediction over the whole input space. However, this may not be a good strategy: the functional space may not contain a good predictor for the full input space, but it may be much easier to find a function that is capable of making good predictions on some specified regions of the input space. Fig. 1 shows an intuitive example of the necessity of local learning algorithms, in which it is hard to find a good global function to discriminate the two classes, but in each local region, the data from different classes can be easily classified.
Typically, a local learning algorithm aims to find an optimal classifier over the local region centered at $\mathbf{c}_0$ by minimizing [6]
$$\mathcal{J}_{f_0} = \frac{1}{n}\sum_{i=1}^n k(\mathbf{x}_i - \mathbf{c}_0, \varepsilon_0)\,\mathcal{J}\big(f_0(\mathbf{x}_i, \mathbf{w}_0)\big), \qquad (2)$$
where $\mathcal{J}(\cdot)$ is some predefined loss, and $k(\mathbf{x}_i - \mathbf{c}_0, \varepsilon_0)$ is a kernel centered at $\mathbf{c}_0$ with width $\varepsilon_0$, which specifies the location and size of the local region. Fig. 2 shows two examples of local kernels.
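As a concrete illustration of the weighting in (2) and of the two kernels sketched in Fig. 2, here is a minimal Python sketch of our own (not code from the paper); the toy data, the constant predictor, and all names are our assumptions.

```python
import numpy as np

def square_kernel(x, c0, eps0):
    """Square kernel: weight 1 inside the ball of radius eps0 around c0, 0 outside."""
    return float(np.linalg.norm(x - c0) <= eps0)

def smooth_kernel(x, c0, eps0):
    """Smooth (Gaussian-shaped) kernel: weight decays with distance from c0."""
    return np.exp(-np.linalg.norm(x - c0) ** 2 / (2.0 * eps0 ** 2))

def local_loss(X, losses, c0, eps0, kernel=smooth_kernel):
    """Kernel-weighted average loss over the sample, in the spirit of (2).
    losses[i] plays the role of J(f_0(x_i, w_0)) for some fixed local model f_0."""
    weights = np.array([kernel(x, c0, eps0) for x in X])
    return float(weights.dot(losses) / len(X))

# Toy usage: weight the squared errors of a constant predictor around a center c0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)
losses = (y - 0.5) ** 2              # squared loss of the constant prediction 0.5
print(local_loss(X, losses, c0=np.array([1.0, 0.0]), eps0=0.5))
```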

2.2 Spectral Clustering
In this section, we give a brief review of spectral clustering [19], [7], [27], [23], [33]. Given a data set $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, spectral clustering algorithms typically aim to recover the data partitions by solving the following eigenvalue decomposition problem:
$$\mathbf{S}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}, \qquad (3)$$
where $\mathbf{S}$ is the $n \times n$ smoothness matrix (e.g., $\mathbf{S}$ can be the combinatorial graph Laplacian as in [7] or the normalized graph Laplacian as in [27]), $\mathbf{V} \in \mathbb{R}^{n \times C}$ is the relaxed cluster indication matrix with $C$ being the desired number of clusters,¹ whose columns are eigenvectors of $\mathbf{S}$, and $\boldsymbol{\Lambda}$ is the diagonal eigenvalue matrix of $\mathbf{S}$. Traditionally, there are two ways to get the final clustering results from $\mathbf{V}$:
1. As in [23], we can treat the $i$th row of $\mathbf{V}$ as the embedding of $\mathbf{x}_i$ in a $C$-dimensional space, and apply some traditional clustering method like $k$-means to cluster these embeddings into $C$ clusters.
2. Usually, the optimal $\mathbf{V}$ is not unique (it is determined only up to an arbitrary rotation matrix $\mathbf{R}$²). Thus, we can pursue an optimal $\mathbf{R}$ that rotates $\mathbf{V}$ to a standard cluster indication matrix. The detailed algorithm can be found in [36].
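For concreteness, the following sketch (our own illustration, not the authors' code) follows the first discretization route: eigendecompose a combinatorial Laplacian and run k-means on the rows of the C-dimensional spectral embedding. The affinity matrix W and the choice of Laplacian are assumptions on our part.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, C):
    """W: symmetric (n, n) affinity matrix; C: number of clusters.
    Uses the combinatorial Laplacian L = D - W as the smoothness matrix S in (3)."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    # eigenvectors sorted by ascending eigenvalue form the relaxed indicator matrix V
    vals, vecs = np.linalg.eigh(L)
    V = vecs[:, :C]                          # n x C relaxed cluster indication matrix
    _, labels = kmeans2(V, C, minit='++')    # route 1: k-means on the embeddings
    return labels
```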

3 THE PROPOSED ALGORITHM

In this section, we will introduce our Clustering with Local and Global Regularization (CLGR) algorithm in detail. First, we discuss the basic principles behind our algorithm.

3.1 The Basic Principle
As its name suggests, the main idea behind our algorithm is to construct a predictor via two steps: local regularization and global regularization. In the local regularization step, we borrow the idea from the local learning introduced in Section 2.1, which assumes that the label of a data point can be well estimated from the local region it belongs to. That is, if we partition the input data space into $M$ regions $\{R_m\}_{m=1}^M$, with each region being characterized by a local kernel $k(\mathbf{c}_m, \varepsilon_m)$, then we can minimize the cost function
$$\mathcal{J}_{f_m} = \frac{1}{n}\sum_{i=1}^n k(\mathbf{x}_i - \mathbf{c}_m, \varepsilon_m)\,\mathcal{J}\big(f_m(\mathbf{x}_i, \mathbf{w}_m)\big) \qquad (4)$$
to get the optimal $f_m$ for the local region $R_m$. One problem of pure local learning is that there might not be enough data points in each local region for training the local classifiers.

1. The definition of an $n \times K$ cluster indication matrix $\mathbf{G}$ is that $G_{ij} \in \{0, 1\}$, and $G_{ij} = 1$ if $\mathbf{x}_i$ belongs to cluster $j$; otherwise $G_{ij} = 0$. In the eigenvalue decomposition problem (3), we cannot guarantee the entries in $\mathbf{V}$ to satisfy those constraints; therefore, we usually call $\mathbf{V}$ the relaxed cluster indication matrix [27].
2. Generally, the final optimization problem of spectral clustering can be formulated as the minimization of the trace $\mathrm{tr}(\mathbf{V}^T\mathbf{S}\mathbf{V})$ subject to $\mathbf{V}^T\mathbf{V} = \mathbf{I}$ [27], [7], [38], the solution of which is given by the Ky Fan theorem [14].


Therefore, in the global regularization step, we apply a smoother to smooth the predicted data labels with respect to the intrinsic data manifold. Mathematically, we want to obtain a classifier $f$ by minimizing
$$\mathcal{J}_f = \sum_{m=1}^M \mathcal{J}_{f_m} + \lambda\|f\|_I^2, \qquad (5)$$
where $f$ is a "region-wise" function with $f(\mathbf{x}) = f_m(\mathbf{x})$ if $\mathbf{x} \in R_m$, $\|f\|_I$ measures the smoothness of $f$ with respect to the intrinsic data manifold, and $\lambda > 0$ is a trade-off parameter. In the following sections, we will introduce our clustering with local and global regularization (CLGR) algorithm in detail.

3.2 Local Regularization
In this section, we introduce a detailed algorithm to realize the local regularization step. Let $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T$ be the class labels (or the cluster assignments) we want to solve for. Then, in local regularization, we assume that $q_i$ can be well estimated by the classifier constructed on the local region that $\mathbf{x}_i$ belongs to. In our CLGR algorithm, we adopt the following $k$-nearest-neighbor local kernel:
$$k(\mathbf{x}_i - \mathbf{c}_m, \varepsilon_m) = \begin{cases} 1, & \text{if } \mathbf{x}_i \text{ is in the } k\text{-nearest neighborhood of } \mathbf{c}_m\\ 0, & \text{otherwise}, \end{cases} \qquad (6)$$

and we set the kernel centers fcm gM m¼1 to be the data points fxi gni¼1 . In this way, we, in fact, partition the whole input data space into n overlapping regions, and the ith region Ri is just the k-nearest neighborhood of xi . The label of xi will be estimated by fi ðxi Þ. Then, according to (4), we need to construct a classifier fi on each Ri ð1  i  nÞ. In the following, we will first see how to construct these local classifiers.
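To make the neighborhood construction concrete, here is a small sketch of our own (brute-force distances, and the convention of excluding a point from its own neighborhood are our assumptions):

```python
import numpy as np

def knn_neighborhoods(X, k):
    """Return N[i]: indices of the k nearest neighbors of x_i (excluding x_i itself),
    i.e., the region R_i selected by the 0/1 kernel in (6)."""
    n = X.shape[0]
    # pairwise squared Euclidean distances (brute force; fine for moderate n)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)          # do not count x_i as its own neighbor
    return [np.argsort(sq[i])[:k] for i in range(n)]
```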

3.2.1 The Construction of Local Classifiers
In this paper, we consider two types of $f_i$: the regularized linear classifier and kernel ridge regression.

Regularized linear classifier. In this case, we assume that $f_i$ takes the following linear form:
$$f_i(\mathbf{x}) = \mathbf{w}_i^T(\mathbf{x} - \mathbf{x}_i) + b_i, \qquad (7)$$
where $(\mathbf{w}_i, b_i)$ are the weight vector and bias of $f_i$.³ The parameters are obtained by minimizing the local regularized loss
$$\mathcal{J}_i = \frac{1}{n_i}\sum_{\mathbf{x}_j\in\mathcal{N}_i}\big\|\mathbf{w}_i^T(\mathbf{x}_j - \mathbf{x}_i) + b_i - q_j\big\|^2 + \lambda_i\|\mathbf{w}_i\|^2, \qquad (8)$$
where $n_i = |\mathcal{N}_i|$ is the cardinality of $\mathcal{N}_i$, and $q_j$ is the cluster membership of $\mathbf{x}_j$. Define the locally centered data matrix $\mathbf{X}_i = [\mathbf{x}_{i_1} - \mathbf{x}_i, \mathbf{x}_{i_2} - \mathbf{x}_i, \ldots, \mathbf{x}_{i_{n_i}} - \mathbf{x}_i]$. Then, by taking the partial derivatives of $\mathcal{J}_i$ with respect to $\mathbf{w}_i$ and $b_i$, we can get
$$\frac{\partial\mathcal{J}_i}{\partial\mathbf{w}_i} = \frac{2}{n_i}\Big[\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{w}_i + b_i\mathbf{1} - \mathbf{q}_i\big) + n_i\lambda_i\mathbf{w}_i\Big], \qquad (9)$$
$$\frac{\partial\mathcal{J}_i}{\partial b_i} = \frac{2}{n_i}\big(\mathbf{1}^T\mathbf{X}_i^T\mathbf{w}_i + n_i b_i - \mathbf{1}^T\mathbf{q}_i\big), \qquad (10)$$
where $\mathbf{q}_i = [q_{i_1}, q_{i_2}, \ldots, q_{i_{n_i}}]^T$, with $q_{i_k}$ being the label of the $k$th neighbor of $\mathbf{x}_i$, and $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n_i\times 1}$ is a column vector of all 1s. Setting $\partial\mathcal{J}_i/\partial\mathbf{w}_i = \mathbf{0}$ gives
$$\mathbf{w}_i = \big(\mathbf{X}_i\mathbf{X}_i^T + n_i\lambda_i\mathbf{I}_d\big)^{-1}\mathbf{X}_i(\mathbf{q}_i - b_i\mathbf{1}), \qquad (11)$$
where $\mathbf{I}_d$ is the $d\times d$ identity matrix. Using the Woodbury formula [14], we can rewrite (11) as
$$\mathbf{w}_i = \mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}(\mathbf{q}_i - b_i\mathbf{1}), \qquad (12)$$
where $\mathbf{I}_{n_i}$ is the $n_i\times n_i$ identity matrix. Setting $\partial\mathcal{J}_i/\partial b_i = 0$ gives
$$n_i b_i = \mathbf{1}^T\big(\mathbf{q}_i - \mathbf{X}_i^T\mathbf{w}_i\big). \qquad (13)$$
Combining (12) and (13), we can get
$$b_i = \frac{\mathbf{1}^T - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}}{n_i - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}}\,\mathbf{q}_i. \qquad (14)$$

Kernel ridge regression. The regularized linear classifier can only tackle linear problems. In the following, we apply the kernel trick [25] to extend it to the nonlinear case. Specifically, let $\phi: \mathbb{R}^d\rightarrow\mathcal{F}$ be a nonlinear mapping that maps the data in $\mathbb{R}^d$ to a high-dimensional (possibly infinite-dimensional) feature space $\mathcal{F}$, such that the nonlinear problem in $\mathbb{R}^d$ becomes a linear one in $\mathcal{F}$. Then, for a testing point $\mathbf{x}\in\mathcal{N}_i$, we can construct a local classifier $f_i$ which predicts the label of $\mathbf{x}$ by
$$f_i(\mathbf{x}) = \tilde{\mathbf{w}}_i^T\big(\phi(\mathbf{x}) - \phi(\mathbf{x}_i)\big) + b_i, \qquad (15)$$
where $\tilde{\mathbf{w}}_i\in\mathcal{F}$ and $b_i\in\mathbb{R}$ are the parameters of $f_i$. Since $\tilde{\mathbf{w}}_i$ lies in the span of $\{\phi(\mathbf{x}_{i_k})\}_{k=1}^{n_i}$ [25], where $\mathbf{x}_{i_k}$ represents the $k$th neighbor of $\mathbf{x}_i$, there exists a set of coefficients $\{\beta_k^i\}_{k=1}^{n_i}$ such that
$$\tilde{\mathbf{w}}_i = \sum_{k=1}^{n_i}\beta_k^i\,\phi(\mathbf{x}_{i_k}). \qquad (16)$$
Combining (16) and (15), we can get
$$f_i(\mathbf{x}) = \sum_{\mathbf{x}_j\in\mathcal{N}_i}\beta_j^i\,\big\langle\phi(\mathbf{x}) - \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\big\rangle + b_i, \qquad (17)$$
where $\langle\cdot,\cdot\rangle$ is the inner product operator. Define a positive-definite kernel function $K: \mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ [25]; then we can solve for $\boldsymbol{\beta}_i = [\beta_1^i, \ldots, \beta_{n_i}^i]^T$ and $b_i$ by minimizing
$$\mathcal{J}_i = \frac{1}{n_i}\sum_{\mathbf{x}_j\in\mathcal{N}_i}\Big\|\tilde{\mathbf{w}}_i^T\big(\phi(\mathbf{x}_j)-\phi(\mathbf{x}_i)\big) + b_i - q_j\Big\|^2 + \lambda_i\|\tilde{\mathbf{w}}_i\|^2 = \frac{1}{n_i}\sum_{\mathbf{x}_j\in\mathcal{N}_i}\Big\|\sum_{\mathbf{x}_k\in\mathcal{N}_i}\beta_k^i\big\langle\phi(\mathbf{x}_j)-\phi(\mathbf{x}_i), \phi(\mathbf{x}_k)\big\rangle + b_i - q_j\Big\|^2 + \lambda_i\boldsymbol{\beta}_i^T\mathbf{K}_i\boldsymbol{\beta}_i, \qquad (18)$$
where $\mathbf{K}_i\in\mathbb{R}^{n_i\times n_i}$ is the local kernel matrix for the data points in $\mathcal{N}_i$, with its $(k,j)$th entry $\mathbf{K}_i(k,j) = K(\mathbf{x}_{i_k}, \mathbf{x}_{i_j})$, $\mathbf{x}_{i_j}$ is the $j$th neighbor of $\mathbf{x}_i$, and $\lambda_i > 0$ is the trade-off parameter. Defining the partial locally centered kernel matrix $\bar{\mathbf{K}}_i\in\mathbb{R}^{n_i\times n_i}$ with $\bar{\mathbf{K}}_i(j,k) = \langle\phi(\mathbf{x}_j)-\phi(\mathbf{x}_i), \phi(\mathbf{x}_k)\rangle = K(\mathbf{x}_j,\mathbf{x}_k) - K(\mathbf{x}_i,\mathbf{x}_k)$, we rewrite $\mathcal{J}_i$ as
$$\mathcal{J}_i = \frac{1}{n_i}\big\|\bar{\mathbf{K}}_i\boldsymbol{\beta}_i + b_i\mathbf{1} - \mathbf{q}_i\big\|^2 + \lambda_i\boldsymbol{\beta}_i^T\mathbf{K}_i\boldsymbol{\beta}_i, \qquad (19)$$
where $\mathbf{q}_i = [q_{i_1}, q_{i_2}, \ldots, q_{i_{n_i}}]^T$, with $q_{i_k}$ being the label of the $k$th neighbor of $\mathbf{x}_i$, and $\mathbf{1} = [1, 1, \ldots, 1]^T\in\mathbb{R}^{n_i\times 1}$ is a column vector of all 1s. Then,
$$\frac{\partial\mathcal{J}_i}{\partial\boldsymbol{\beta}_i} = \frac{2}{n_i}\Big[\bar{\mathbf{K}}_i^T\big(\bar{\mathbf{K}}_i\boldsymbol{\beta}_i + b_i\mathbf{1} - \mathbf{q}_i\big) + n_i\lambda_i\mathbf{K}_i\boldsymbol{\beta}_i\Big], \qquad (20)$$
$$\frac{\partial\mathcal{J}_i}{\partial b_i} = \frac{2}{n_i}\big(\mathbf{1}^T\bar{\mathbf{K}}_i\boldsymbol{\beta}_i + n_i b_i - \mathbf{1}^T\mathbf{q}_i\big). \qquad (21)$$
By setting $\partial\mathcal{J}_i/\partial\boldsymbol{\beta}_i = \mathbf{0}$ and $\partial\mathcal{J}_i/\partial b_i = 0$, we can get
$$\boldsymbol{\beta}_i = \big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T(\mathbf{q}_i - b_i\mathbf{1}), \qquad (22)$$
$$b_i = \frac{\mathbf{1}^T - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T}{n_i - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}}\,\mathbf{q}_i. \qquad (23)$$

3. Note that we subtract $\mathbf{x}_i$ from $\mathbf{x}$ because usually there are only a few data points in the neighborhood of $\mathbf{x}_i$; hence, the structural penalty term $\|\mathbf{w}_i\|$ will pull the weight vector $\mathbf{w}_i$ toward some arbitrary origin. For isotropy reasons, we translate the origin of the input space to the neighborhood medoid $\mathbf{x}_i$.

3.2.2 Combining the Local Regularized Predictors
After all the local predictors have been constructed, we combine them by minimizing the local prediction loss
$$\mathcal{J}_l = \sum_{i=1}^n\big(f_i(\mathbf{x}_i) - q_i\big)^2, \qquad (24)$$
in which we use $f_i$ to predict the label of $\mathbf{x}_i$ in the following ways:
- Regularized Linear Classifier. In this case, combining (7) and (14), we can get
$$f_i(\mathbf{x}_i) = b_i = \frac{\mathbf{1}^T - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}}{n_i - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}}\,\mathbf{q}_i = \mathbf{u}_i^T\mathbf{q}_i, \qquad (25)$$
where $\mathbf{u}_i^T$ denotes the row vector multiplying $\mathbf{q}_i$ in (25).
- Kernel Ridge Regression. In this case, combining (17) and (23), we can get
$$f_i(\mathbf{x}_i) = \frac{\mathbf{1}^T - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T}{n_i - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}}\,\mathbf{q}_i = \mathbf{v}_i^T\mathbf{q}_i, \qquad (26)$$
where $\mathbf{v}_i^T$ denotes the row vector multiplying $\mathbf{q}_i$ in (26).
Therefore, we can predict the label of $\mathbf{x}_i$ using the local classifiers in the form
$$f_i(\mathbf{x}_i) = \boldsymbol{\alpha}_i^T\mathbf{q}_i,$$
where $\boldsymbol{\alpha}_i = \mathbf{u}_i$ for the regularized linear classifier and $\boldsymbol{\alpha}_i = \mathbf{v}_i$ for kernel ridge regression. Then, defining a square matrix $\mathbf{P}\in\mathbb{R}^{n\times n}$ with its $(i,j)$th entry
$$P_{ij} = \begin{cases}\alpha_i(j), & \text{if } \mathbf{x}_j\in\mathcal{N}_i\\ 0, & \text{otherwise},\end{cases} \qquad (27)$$
where $\alpha_i(j)$ is the element of $\boldsymbol{\alpha}_i$ corresponding to $\mathbf{x}_j$, we can expand the local loss in (24) as
$$\mathcal{J}_l = \|\mathbf{P}\mathbf{q} - \mathbf{q}\|^2, \qquad (28)$$

where $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T \in \mathbb{R}^{n\times 1}$ is the label vector of $\mathcal{X}$. Up to now, we have written the clustering criterion obtained by combining the locally regularized label predictors, $\mathcal{J}_l$, in an explicit mathematical form, and we could minimize it directly using some standard optimization techniques. However, the results may not be good enough, since we only exploit the local information of the data set. In the next section, we will introduce a global regularization criterion and combine it with $\mathcal{J}_l$, aiming to find a good clustering result in a local-global way.
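To make the construction tangible, here is a sketch of our own (regularized-linear case only; the kernel case is analogous with the centered kernel matrices) that computes each $\mathbf{u}_i$ as in (25) and stacks them into the matrix $\mathbf{P}$ of (27). The helper `knn_neighborhoods` is the assumed sketch given earlier in this article.

```python
import numpy as np

def build_P_linear(X, neighborhoods, lam):
    """Assemble P (eq. (27)) from the local regularized linear classifiers.
    X: (n, d) data; neighborhoods[i]: indices of N_i; lam: shared lambda_i."""
    n = X.shape[0]
    P = np.zeros((n, n))
    for i, idx in enumerate(neighborhoods):
        ni = len(idx)
        Xi = (X[idx] - X[i]).T                      # d x n_i locally centered data
        G = Xi.T @ Xi                               # n_i x n_i Gram matrix X_i^T X_i
        A = G @ np.linalg.inv(G + ni * lam * np.eye(ni))
        one = np.ones(ni)
        u = (one - one @ A) / (ni - one @ A @ one)  # u_i^T as in (25)
        P[i, idx] = u
    return P

def local_loss(P, q):
    """J_l = ||P q - q||^2, eq. (28)."""
    r = P @ q - q
    return float(r @ r)
```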

3.3 Global Regularization
In data clustering, we usually require that the cluster assignments of the data points be sufficiently smooth with respect to the underlying data manifold, which implies that 1) nearby points tend to have the same cluster assignments and 2) points on the same structure (e.g., submanifold or cluster) tend to have the same cluster assignments [41].
Without loss of generality, we assume that the data points reside (roughly) on a low-dimensional manifold $\mathcal{M}$,⁴ and $q$ is the cluster assignment function defined on $\mathcal{M}$. Generally, a graph can be viewed as the discretized form of a manifold [3]. We can model the data set as a weighted undirected graph, as in spectral clustering [27], where the graph nodes are just the data points, and the weights on the edges represent the similarities between pairwise points. Then, the following objective can be used to measure the smoothness of $q$ over the data graph [2], [40], [42]:
$$\mathcal{J}_g = \mathbf{q}^T\mathbf{L}\mathbf{q} = \frac{1}{2}\sum_{i,j}(q_i - q_j)^2 w_{ij}, \qquad (29)$$
where $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T$ with $q_i = q(\mathbf{x}_i)$, and $\mathbf{L}$ is the graph Laplacian with its $(i,j)$th entry
$$L_{ij} = \begin{cases} d_i - w_{ii}, & \text{if } i = j\\ -w_{ij}, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are adjacent}\\ 0, & \text{otherwise}, \end{cases} \qquad (30)$$


where $d_i = \sum_j w_{ij}$ is the degree of $\mathbf{x}_i$, and $w_{ij}$ is the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$. If $\mathbf{x}_i$ and $\mathbf{x}_j$ are adjacent,⁵ $w_{ij}$ is usually computed in the following way:⁶
$$w_{ij} = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}}, \qquad (31)$$
where $\sigma$ is a data-set-dependent parameter. It has been proved that, under certain conditions, as the size of the data set grows, the graph Laplacian converges to the Laplace-Beltrami operator on the data manifold [4], [16]. In summary, (29) with exponential weights effectively measures the smoothness of the data assignments with respect to the intrinsic data manifold. Thus, we adopt it as a global regularizer to penalize predicted data assignments that are not smooth.

4. We believe that text data are also sampled from some low-dimensional manifold, since it is impossible for them to fill the whole high-dimensional sample space. It has been shown that manifold-based methods can achieve good results on text classification tasks [41].
5. In this paper, we define $\mathbf{x}_i$ and $\mathbf{x}_j$ to be adjacent if $\mathbf{x}_i \in \mathcal{N}(\mathbf{x}_j)$ or $\mathbf{x}_j \in \mathcal{N}(\mathbf{x}_i)$.
6. There are also some other ways of constructing $w_{ij}$, such as in [32].
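A minimal sketch of the global regularizer, written by us under the assumptions above: Gaussian affinities (31) restricted to adjacent pairs, the Laplacian (30), and the smoothness value (29).

```python
import numpy as np

def graph_laplacian(X, neighborhoods, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) if x_i and x_j are adjacent
    (one is in the other's k-NN set), 0 otherwise; returns L = D - W."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i, idx in enumerate(neighborhoods):
        for j in idx:
            w = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2.0 * sigma ** 2))
            W[i, j] = W[j, i] = w     # symmetrize: adjacency is "i in N_j or j in N_i"
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def smoothness(L, q):
    """J_g = q^T L q, eq. (29)."""
    return float(q @ L @ q)
```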

3.4 Clustering with Local and Global Regularization
Combining the contents we have introduced in Sections 3.2 and 3.3, we can derive the following clustering criterion:
$$\min_{\mathbf{q}}\ \mathcal{J} = \mathcal{J}_l + \lambda\mathcal{J}_g = \|\mathbf{P}\mathbf{q} - \mathbf{q}\|^2 + \lambda\,\mathbf{q}^T\mathbf{L}\mathbf{q} \quad \text{s.t.}\ q_i \in \{-1, +1\}, \qquad (32)$$
where $\mathbf{P}$ is defined as in (27), and $\lambda$ is a regularization parameter that trades off $\mathcal{J}_l$ and $\mathcal{J}_g$. However, the discrete constraint on $q_i$ makes the problem an NP-hard integer programming problem. A natural way of making the problem solvable is to remove the constraint and relax $q_i$ to be continuous. Then, the objective that we aim to minimize becomes
$$\mathcal{J} = \|\mathbf{P}\mathbf{q} - \mathbf{q}\|^2 + \lambda\,\mathbf{q}^T\mathbf{L}\mathbf{q} = \mathbf{q}^T\big((\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{q}, \qquad (33)$$
and we further add the constraint $\mathbf{q}^T\mathbf{q} = 1$ to restrict the scale of $\mathbf{q}$. Then, our objective becomes to solve the following optimization problem:
$$\min_{\mathbf{q}}\ \mathcal{J} = \mathbf{q}^T\big((\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{q} \quad \text{s.t.}\ \mathbf{q}^T\mathbf{q} = 1. \qquad (34)$$
We have the following theorem:

Theorem 1. The optimal relaxed cluster indicator can be achieved by the eigenvector corresponding to the second smallest eigenvalue of the matrix $\mathbf{M} = (\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}$.

Proof. Using the Ky Fan theorem [38], we can derive that the optimal solution $\mathbf{q}^*$ corresponds to the smallest eigenvector of the matrix $\mathbf{M}$. However, we will prove in the following that such an eigenvector does not contain any discriminative information. First, for an arbitrary column vector $\mathbf{a} \in \mathbb{R}^{n\times 1}$, we have
$$\mathbf{a}^T\mathbf{M}\mathbf{a} = \mathbf{a}^T(\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I})\mathbf{a} + \lambda\,\mathbf{a}^T\mathbf{L}\mathbf{a} = \|\mathbf{P}\mathbf{a} - \mathbf{a}\|^2 + \frac{\lambda}{2}\sum_{i,j}(a_i - a_j)^2 w_{ij} \geq 0.$$
Therefore, $\mathbf{M}$ is positive semidefinite, i.e., its eigenvalues are nonnegative. In the following, we show that the smallest eigenvalue of $\mathbf{M}$ is 0, and its corresponding eigenvector is the all-ones vector $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n\times 1}$:
1. $\mathbf{L}\mathbf{1} = (\mathbf{D} - \mathbf{W})\mathbf{1} = 0\cdot\mathbf{1}$, where $\mathbf{W} \in \mathbb{R}^{n\times n}$ is the weight matrix and $\mathbf{D} = \mathrm{diag}(\sum_j W_{ij})$ is the degree matrix.
2. $(\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I})\mathbf{1} = 0\cdot\mathbf{1}$, since, according to the definition of $\mathbf{P}$ in (27), we have the following:
a. For the regularized linear classifier, assuming $\mathbf{p}_i \in \mathbb{R}^{1\times n}$ is the $i$th row of $\mathbf{P}$, then
$$\mathbf{p}_i\mathbf{1} = \mathbf{u}_i^T\mathbf{1}_{n_i} = \frac{\mathbf{1}_{n_i}^T - \mathbf{1}_{n_i}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}}{n_i - \mathbf{1}_{n_i}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}_{n_i}}\,\mathbf{1}_{n_i} = 1.$$
b. For kernel ridge regression, assuming $\mathbf{p}_i \in \mathbb{R}^{1\times n}$ is the $i$th row of $\mathbf{P}$, then
$$\mathbf{p}_i\mathbf{1} = \mathbf{v}_i^T\mathbf{1}_{n_i} = \frac{\mathbf{1}_{n_i}^T - \mathbf{1}_{n_i}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T}{n_i - \mathbf{1}_{n_i}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}_{n_i}}\,\mathbf{1}_{n_i} = 1.$$
Then, $\mathbf{P}\mathbf{1} = \mathbf{1}$ and $(\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I})\mathbf{1} = \mathbf{0}$. Therefore,
$$\mathbf{M}\mathbf{1} = \big((\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{1} = 0\cdot\mathbf{1},$$
where 0 is the smallest eigenvalue of $\mathbf{M}$ with $\mathbf{1}$ as its corresponding eigenvector, which does not contain any discriminative information. Therefore, we should use the eigenvector corresponding to the second smallest eigenvalue for clustering, as in [27].
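A sketch of the two-class solution implied by Theorem 1, written by us: form M, take the eigenvector of its second smallest eigenvalue, and discretize it. The sign-based thresholding at zero is our assumption; the paper discretizes as in spectral clustering.

```python
import numpy as np

def clgr_two_class(P, L, lam):
    """Relaxed two-class CLGR: q* = eigenvector of the second smallest eigenvalue
    of M = (P - I)^T (P - I) + lam * L (Theorem 1)."""
    n = P.shape[0]
    D = P - np.eye(n)
    M = D.T @ D + lam * L
    vals, vecs = np.linalg.eigh(M)       # ascending eigenvalues; vecs[:, 0] ~ all-ones
    q = vecs[:, 1]                       # second smallest: the informative indicator
    return (q > 0).astype(int)           # simple sign-based discretization (assumed)
```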

3.5 Extending CLGR for Multiclass Clustering
In the above, we have introduced the basic framework of Clustering with Local and Global Regularization (CLGR) for the two-class clustering problem; in this section, we extend it to multiclass clustering.
First, we assume that all the data objects belong to $C$ classes indexed by $\mathcal{L} = \{1, 2, \ldots, C\}$. Let $q^c$ be the classification function for class $c$ $(1 \leq c \leq C)$, such that $q^c(\mathbf{x}_i)$ returns the confidence that $\mathbf{x}_i$ belongs to class $c$. Our goal is to obtain the values $q^c(\mathbf{x}_i)$ $(1 \leq c \leq C,\ 1 \leq i \leq n)$, and the cluster assignment of $\mathbf{x}_i$ can then be determined from $\{q^c(\mathbf{x}_i)\}_{c=1}^C$ using one of the discretization methods introduced in Section 2.2.
Therefore, in this multiclass case, for each data point $\mathbf{x}_i$ $(1 \leq i \leq n)$, we will construct $C$ locally regularized label predictors. The total local prediction error can be constructed as
$$\mathcal{J}_l = \sum_{c=1}^C\mathcal{J}_l^c = \sum_{c=1}^C\big\|\mathbf{P}\mathbf{q}^c - \mathbf{q}^c\big\|^2. \qquad (35)$$


TABLE 1 Clustering with Local and Global Regularization (CLGR)

Similarly, we can construct the global smoothness regularizer in the multiclass case as
$$\mathcal{J}_g = \sum_{c=1}^C\frac{1}{2}\sum_{i,j}\big(q_i^c - q_j^c\big)^2 w_{ij} = \sum_{c=1}^C(\mathbf{q}^c)^T\mathbf{L}\mathbf{q}^c. \qquad (36)$$

Then, the criterion to be minimized for CLGR in the multiclass case becomes
$$\mathcal{J} = \mathcal{J}_l + \lambda\mathcal{J}_g = \sum_{c=1}^C\Big[\big\|\mathbf{P}\mathbf{q}^c - \mathbf{q}^c\big\|^2 + \lambda(\mathbf{q}^c)^T\mathbf{L}\mathbf{q}^c\Big] = \sum_{c=1}^C(\mathbf{q}^c)^T\big((\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{q}^c = \mathrm{trace}\Big[\mathbf{Q}^T\big((\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{Q}\Big], \qquad (37)$$
where $\mathbf{Q} = [\mathbf{q}^1, \mathbf{q}^2, \ldots, \mathbf{q}^C]$ is an $n \times C$ matrix, and $\mathrm{trace}(\cdot)$ returns the trace of a matrix. The same as in (34), we also add the constraint $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$ to restrict the scale of $\mathbf{Q}$. Then, our optimization problem becomes
$$\min_{\mathbf{Q}}\ \mathcal{J} = \mathrm{trace}\Big[\mathbf{Q}^T\big((\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{Q}\Big] \quad \text{s.t.}\ \mathbf{Q}^T\mathbf{Q} = \mathbf{I}. \qquad (38)$$
From the Ky Fan theorem [38], we know that the optimal solution of the above problem is
$$\mathbf{Q}^* = \big[\mathbf{q}_1^*, \mathbf{q}_2^*, \ldots, \mathbf{q}_C^*\big]\mathbf{R}, \qquad (39)$$
where $\mathbf{q}_k^*$ $(1 \leq k \leq C)$ is the eigenvector corresponding to the $k$th smallest eigenvalue of the matrix $(\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I}) + \lambda\mathbf{L}$, and $\mathbf{R}$ is an arbitrary $C \times C$ rotation matrix. Since the entries of $\mathbf{Q}^*$ are continuous, we need to further discretize $\mathbf{Q}^*$ to get the cluster assignments of all the data points by making use of the methods introduced in Section 2.2. The detailed algorithm procedure for CLGR is summarized in Table 1.

3.6 A Mixed Regularization Viewpoint
As we have mentioned in Section 3.1, the motivation behind our approach is that there may not be enough points in each local region to train a good predictor; therefore, we apply a global smoothness regularizer to smooth the predicted labels and make them compliant with the intrinsic data distribution. However, by revisiting the expression of the loss function (37), we can find that it is, in fact, a mixed loss function composed of two parts. The first part,
$$\mathcal{J}_1 = \mathrm{tr}\big(\mathbf{Q}^T(\mathbf{P}-\mathbf{I})^T(\mathbf{P}-\mathbf{I})\mathbf{Q}\big),$$
contains some local information of the data set from the classifier construction perspective, which is derived using local learning algorithms. Purely minimizing $\mathcal{J}_1$ with proper orthogonality constraints results in a Local-Learning-based Algorithm for Clustering (LLAC), as in [34]. The second part,
$$\mathcal{J}_2 = \mathrm{tr}\big(\mathbf{Q}^T\mathbf{L}\mathbf{Q}\big),$$
includes some distribution information of the data set from the geometry perspective. Purely minimizing $\mathcal{J}_2$ with proper orthogonality constraints results in a spectral clustering algorithm such as ratio cut. Therefore, what our clustering with local and global regularization algorithm does is to seek a consistent partitioning of the data set that achieves a good trade-off between $\mathcal{J}_1$ and $\mathcal{J}_2$, controlled by $\lambda$. Such an idea of mixed regularization with different types of information has previously been explored in the semisupervised learning community [41], [44]. In the following section, we will discuss the relationship between our method and some traditional approaches.
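Pulling Sections 3.2 through 3.5 together, here is an end-to-end sketch written by us; it follows the spirit of the procedure summarized in Table 1, but uses the k-means discretization route from Section 2.2 rather than the rotation-based discretization of [36]. The helper names in the commented usage refer to the earlier sketches in this article, which are our own assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def clgr_multiclass(P, L, lam, C):
    """Relaxed multiclass CLGR, eqs. (38)-(39): the columns of Q* are the C
    eigenvectors of M = (P - I)^T (P - I) + lam * L with smallest eigenvalues."""
    n = P.shape[0]
    D = P - np.eye(n)
    M = D.T @ D + lam * L
    _, vecs = np.linalg.eigh(M)
    Q = vecs[:, :C]                          # n x C relaxed cluster indicator matrix
    _, labels = kmeans2(Q, C, minit='++')    # discretize Q by k-means on its rows
    return labels

# Putative end-to-end usage with the helper sketches defined earlier (all assumed):
# N = knn_neighborhoods(X, k); P = build_P_linear(X, N, lam_i)
# L = graph_laplacian(X, N, sigma); labels = clgr_multiclass(P, L, lam, C)
```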


3.7 Relationship with Some Traditional Approaches
As stated in the last section, the CLGR method proposed in this paper can be viewed from a mixed regularization viewpoint, where the objective function is a regularization term composed of a local learning regularizer and a Laplacian regularizer. Therefore, traditional spectral clustering [27] (which contains only the Laplacian regularizer) and local learning clustering [34], [35] (which contains only the local learning regularizer⁷) can be viewed as special cases of our method. In fact, the idea of combining several "weak" regularizers to construct a "strong" regularizer has appeared in some previous papers (e.g., [9], [1], [10]). However, what these papers mainly discuss is how to combine the "weak" regularizers (usually they assume that the final regularizer is a convex combination of those weak regularizers, and the goal is to efficiently obtain the combination coefficients); they do not address how to obtain more informative weak regularizers. The mixed regularizers proposed in this paper are complementary to each other in some sense (the local regularizer makes use of the local properties, while the global regularizer consolidates those local properties). Thus, the combination can be more effective in real-world applications.

4 EXPERIMENTS
In this section, experiments are conducted to empirically compare the clustering results of CLGR with some other clustering algorithms on several data sets. First, we briefly introduce the basic information of those data sets.

4.1 Data Sets
We use three categories of data sets in our experiments, which are selected to cover a wide range of properties. Specifically, those data sets include:
- UCI Data. We perform experiments on 15 UCI data sets.⁸ The sizes of those data sets vary from 24 to 4,435, the dimensionality of the data points varies from 3 to 56, and the number of classes varies from 2 to 22.
- Image Data. We perform experiments on six image data sets: ORL,⁹ Yale,¹⁰ YaleB [13], PIE [28], COIL [22], and USPS.¹¹ All the images in the data sets except USPS are resized to 32 x 32. For the PIE data set, we only select five near-frontal poses (C05, C07, C09, C27, C29) and all the images under different illuminations and expressions, so there are 170 images for each individual. For the YaleB database, we simply use the cropped images, which contain 38 individuals and around 64 near-frontal images under different illuminations per individual. For the USPS data set, we only use the images of digits 1-4.
- Text Data. We also perform experiments on three text data sets: 20-newsgroup,¹² WebKB,¹³ and Cora [21]. For the 20-newsgroup data set, we choose the topic rec, which contains autos, motorcycles, baseball, and hockey, from the version 20-news-18828. For the WebKB data set, we select a subset consisting of about 6,000 Web pages from the computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin), which can be classified into seven categories. For the Cora data set, we select a subset containing the research papers from the subfields data structures (DS), hardware and architecture (HA), machine learning (ML), and programming languages (PL).
The basic information of those data sets is summarized in Table 2.

7. An issue that is worth mentioning here is that the local learning regularizer in this paper is not the same as in [34], [35].
8. http://mlearn.ics.uci.edu/MLRepository.html.
9. http://www.uk.research.att.com/facedatabase.html.
10. http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
11. Available at: http://www.kernel-machines.org/data.html.
12. http://people.csail.mit.edu/jrennie/20Newsgroups/.
13. http://www.cs.cmu.edu/~WebKB/.

TABLE 2 Descriptions of the Data Sets

4.2 Evaluation Metrics
In the experiments, we set the number of clusters equal to the true number of classes $C$ for all the clustering algorithms. To evaluate their performance, we compare the clusters generated by these algorithms with the true classes by computing the following two performance measures.
Clustering accuracy (Acc). The first performance measure is the clustering accuracy, which discovers the one-to-one relationship between clusters and classes and measures the extent to which each cluster contains data points from the corresponding class. It sums up the matching degree over all class-cluster pairs. Clustering accuracy can be computed as
$$\mathrm{Acc} = \max\left(\frac{1}{N}\sum_{C_k, L_m} T(C_k, L_m)\right), \qquad (40)$$
where $C_k$ denotes the $k$th cluster in the final results, and $L_m$ is the true $m$th class. $T(C_k, L_m)$ is the number of entities belonging to class $m$ that are assigned to cluster $k$. Accuracy computes the maximum sum of $T(C_k, L_m)$ over all pairings of clusters and classes, where these pairs have no overlaps. A greater clustering accuracy means a better clustering performance.
Normalized mutual information (NMI). The other evaluation metric that we adopt here is the normalized mutual information [29], which is widely used for determining the quality of clusters. For two random variables $X$ and $Y$, the NMI is defined as
$$\mathrm{NMI}(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)H(Y)}}, \qquad (41)$$
where $I(X, Y)$ is the mutual information between $X$ and $Y$, while $H(X)$ and $H(Y)$ are the entropies of $X$ and $Y$, respectively. One can see that $\mathrm{NMI}(X, X) = 1$, which is the maximal possible value of NMI. Given a clustering result, the NMI in (41) is estimated as
$$\mathrm{NMI} = \frac{\sum_{k=1}^C\sum_{m=1}^C n_{k,m}\log\left(\frac{n\cdot n_{k,m}}{n_k\hat{n}_m}\right)}{\sqrt{\left(\sum_{k=1}^C n_k\log\frac{n_k}{n}\right)\left(\sum_{m=1}^C\hat{n}_m\log\frac{\hat{n}_m}{n}\right)}}, \qquad (42)$$
where $n_k$ denotes the number of data points contained in the cluster $C_k$ $(1\leq k\leq C)$, $\hat{n}_m$ is the number of data points belonging to the $m$th class $(1\leq m\leq C)$, and $n_{k,m}$ denotes the number of data points in the intersection between the cluster $C_k$ and the $m$th class. The value calculated in (42) is used as a performance measure for the given clustering result. The larger this value, the better the clustering performance.

4.3 Comparisons and Parameter Settings
We have compared the performances of our methods with five other popular clustering approaches:
- K-Means. In our implementation, the cluster centers are randomly initialized and the performances reported are averaged over 50 independent runs.
- Normalized Cut (Ncut) [27]. The implementation is the same as in [36]. The similarities between pairwise points are computed using the standard Gaussian kernel,¹⁴ and the width of the Gaussian kernel is set in an automatic way using the method introduced in [37].
- Local Linearly Regularized Clustering (LLRC). Clustering is purely based on minimizing the local loss in (24), and the local predictions are made by (25). The neighborhood size is searched from the grid $\{5, 10, 20, 50, 100\}$, all the regularization parameters $\{\lambda_i\}$ of the local regularized classifiers are set to be the same and are searched from the grid $\{0.01, 0.1, 1, 10, 100\}$, and the final cluster labels are obtained by the discretization method in [36].
- Local-Kernel-Regularized Clustering (LKRC). Clustering is purely based on minimizing the local loss in (24), and the local predictions are made by (26). The neighborhood size is searched from the grid $\{5, 10, 20, 50, 100\}$, all the regularization parameters $\{\lambda_i\}$ of the local regularized classifiers are set to be the same and are searched from the grid $\{0.01, 0.1, 1, 10, 100\}$, the width of the Gaussian kernel is searched from the grid $\{4^{-3}\sigma_0, 4^{-2}\sigma_0, 4^{-1}\sigma_0, \sigma_0, 4\sigma_0, 4^2\sigma_0, 4^3\sigma_0\}$, where $\sigma_0$ is the mean distance between any two examples in the data set, and the final cluster labels are obtained by the discretization method in [36]. Note that this method is very similar to the one in [34] except for the centralization in the construction of the local classifiers, which, as introduced earlier in a footnote, is needed for isotropy reasons.
- Regularized Clustering (RC) [5]. The implementation is based on the description in [5, Section 6.1]. The width of the Gaussian similarities used for constructing the graph Laplacian is set using the automatic method described in [37], the Gaussian kernel is adopted in the RKHS with its width tuned from the grid $\{4^{-3}\sigma_0, 4^{-2}\sigma_0, 4^{-1}\sigma_0, \sigma_0, 4\sigma_0, 4^2\sigma_0, 4^3\sigma_0\}$, where $\sigma_0$ is the mean distance between any two examples in the data set, the ratio of the regularization parameters $\gamma_I/\gamma_A$ is tuned from $\{0.01, 0.1, 1, 10, 100\}$, and the final cluster labels are also obtained by the discretization method in [36].
For our proposed clustering with local and global regularization methods (LCLGR denotes the approach using regularized linear classifiers as its local predictors, and KCLGR denotes the approach using kernel ridge regression as its local predictors), the local regularization parameters $\{\lambda_i\}$ are set to be the same and are tuned from $\{0.01, 0.1, 1, 10, 100\}$, the size of the neighborhood is tuned from $\{5, 10, 20, 50, 100\}$, and the pairwise data similarities used for constructing the global smoothness regularizer are computed using standard Gaussian kernels with the width determined by the method in [37]. For the parameter used to trade off the local and global losses, we change $\mathcal{J}$ in (32) to $\mathcal{J} = (1-\lambda)\mathcal{J}_l + \lambda\mathcal{J}_g$ to better capture the relative importance of $\mathcal{J}_l$ and $\mathcal{J}_g$, and $\lambda$ is tuned from $\{0, 0.2, 0.4, 0.6, 0.8, 1\}$.

14. For all the methods given below that involve the computation of Gaussian kernels, we adopt the Gaussian kernel with the Euclidean distance on the UCI and image data sets, and the Gaussian kernel with the inner-product distance on the text data sets [42].
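For reference, here is a sketch of our own of the two measures defined in (40)-(42); the one-to-one cluster-class matching in Acc is computed with the Hungarian algorithm from scipy, and both label arrays are assumed to be integer-coded in the range 0 to C-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, truth):
    """Acc, eq. (40): best one-to-one matching between clusters and classes."""
    C = int(max(labels.max(), truth.max())) + 1
    T = np.zeros((C, C))
    for k, m in zip(labels, truth):
        T[k, m] += 1                        # T(C_k, L_m): class-m points in cluster k
    rows, cols = linear_sum_assignment(-T)  # maximize the total matching weight
    return T[rows, cols].sum() / len(labels)

def nmi(labels, truth):
    """NMI, eq. (42)."""
    n = len(labels)
    C = int(max(labels.max(), truth.max())) + 1
    nkm = np.zeros((C, C))
    for k, m in zip(labels, truth):
        nkm[k, m] += 1
    nk, nm = nkm.sum(axis=1), nkm.sum(axis=0)
    num = 0.0
    for k in range(C):
        for m in range(C):
            if nkm[k, m] > 0:
                num += nkm[k, m] * np.log(n * nkm[k, m] / (nk[k] * nm[m]))
    den = np.sqrt((nk[nk > 0] * np.log(nk[nk > 0] / n)).sum() *
                  (nm[nm > 0] * np.log(nm[nm > 0] / n)).sum())
    return num / den
```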


TABLE 3 Clustering Accuracy Results on UCI Data Sets

TABLE 4 Normalized Mutual Information Results on UCI Data Sets

For KCLGR, the Gaussian kernel is adopted for constructing the local kernel predictors, with its width tuned from the grid $\{4^{-3}\sigma_0, 4^{-2}\sigma_0, 4^{-1}\sigma_0, \sigma_0, 4\sigma_0, 4^2\sigma_0, 4^3\sigma_0\}$, where $\sigma_0$ is the mean distance between any two examples in the data set.

4.4 Algorithm Performances
The performances of these clustering algorithms on the three types of data sets are reported in Tables 3, 4, 5, 6, 7, and 8, in which the best two performances for each data set are highlighted.

TABLE 5 Clustering Accuracy Results on Image Data Sets

TABLE 6 Normalized Mutual Information Results on Image Data Sets

TABLE 7 Clustering Accuracy Results on Text Data Sets

From those tables, we can observe the following:
- In most cases, k-means clustering performs worse than the other, more sophisticated methods, since the distributions of those high-dimensional data sets are commonly much more complicated than mixtures of spherical Gaussians.
- The Ncut algorithm performs similarly to the LLRC algorithm on the UCI data sets, as the distributions of those data sets are complex and there is no apparent mode underlying them. On the image data sets, the Ncut algorithm usually outperforms the LLRC algorithm, since there are clear nonlinear underlying manifolds behind those data sets, which cannot be tackled well by linear methods like LLRC. On the other hand, on most of the text data sets, LLRC performs better than Ncut because the regularized linear classifier has been shown to be very effective for text data sets [39].
- The LKRC algorithm usually outperforms LLRC since the local distributions of the data sets are nonlinear.
- The mixed-regularization algorithms, including RC, LCLGR, and KCLGR, usually perform better than the algorithms based on a single regularizer (Ncut, LLRC, and LKRC), as they make use of more of the information contained in the data sets.
- As expected, LCLGR and KCLGR usually perform better than LLRC and LKRC because the data labels are smoothed by the global smoother, which makes them more compliant with the intrinsic data distribution.

4.5 Sensitivity to the Selection of Parameters
There are mainly three parameters in our algorithm: the local regularization parameters $\{\lambda_i\}$, which are assumed to be the same in all local classifiers; the size of the neighborhood, $k$; and the global trade-off parameter $\lambda$. We have also conducted a set of experiments to test the sensitivity of our method to the selection of these parameters, and the results are shown in Figs. 3, 4, 5, 6, 7, and 8.


TABLE 8 Normalized Mutual Information Results on Text Data Sets

Fig. 3. Parameter sensitivity testing results on the UCI Satimage data set. (a) Acc versus $\lambda_i$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $k = 20$; for LCLGR: $\lambda = 0.4$, $k = 20$. (b) Acc versus $k$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 1$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 1$. (c) Acc versus $\lambda$ plot. The parameter settings for KCLGR: $\lambda_i = 1$, $k = 20$; for LCLGR: $\lambda_i = 1$, $k = 20$.

Fig. 4. Parameter sensitivity testing results on the UCI Iris data set. (a) Acc versus $\lambda_i$ plot. The parameter settings for KCLGR: $\lambda = 0.6$, $k = 20$; for LCLGR: $\lambda = 0.8$, $k = 20$. (b) Acc versus $k$ plot. The parameter settings for KCLGR: $\lambda_i = 1$, $\lambda = 0.6$; for LCLGR: $\lambda_i = 0.1$, $\lambda = 0.8$. (c) Acc versus $\lambda$ plot. The parameter settings for KCLGR: $\lambda_i = 1$, $k = 20$; for LCLGR: $\lambda_i = 0.1$, $k = 20$.

From these figures, we can clearly see the following:
- Generally, a too small or too large $k$ leads to poor results. When $k$ is too small, the data in each neighborhood are too sparse, so the resulting local classifiers may not be accurate; when $k$ is too large, the local properties of the data set are hidden, and the results also deteriorate. Therefore, a medium-sized $k$ gives better results.
- A proper trade-off between the local and global regularizers leads to better results, and this is just what our method does. This also experimentally validates the reasonableness and effectiveness of combining local and global regularizers.

5 CONCLUSIONS

In this paper, we derived a new clustering algorithm called clustering with local and global regularization. Our method preserves the merit of local learning algorithms and spectral clustering. Our experiments show that the proposed algorithm outperforms some of the state-of-the-art algorithms on many benchmark data sets. In the future, we will focus on the parameter selection and acceleration issues of the CLGR algorithm.


Fig. 5. Parameter sensitivity testing results on the Yale face image data set. (a) Acc versus $\lambda_i$ plot. The parameter settings for KCLGR: $\lambda = 0.8$, $k = 20$; for LCLGR: $\lambda = 0.6$, $k = 10$. (b) Acc versus $k$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 0.1$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 100$. (c) Acc versus $\lambda$ plot. The parameter settings for KCLGR: $\lambda_i = 0.1$, $k = 20$; for LCLGR: $\lambda_i = 100$, $k = 10$.

Fig. 6. Parameter sensitivity testing results on the ORL face image data set. (a) Acc versus $\lambda_i$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $k = 20$; for LCLGR: $\lambda = 0.8$, $k = 10$. (b) Acc versus $k$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 100$; for LCLGR: $\lambda = 0.8$, $\lambda_i = 10$. (c) Acc versus $\lambda$ plot. The parameter settings for KCLGR: $\lambda_i = 100$, $k = 20$; for LCLGR: $\lambda_i = 10$, $k = 10$.

Fig. 7. Parameter sensitivity testing results on the 20-Newsgroup text data set. (a) Acc versus $\lambda_i$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $k = 20$; for LCLGR: $\lambda = 0.4$, $k = 20$. (b) Acc versus $k$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 10$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 10$. (c) Acc versus $\lambda$ plot. The parameter settings for KCLGR: $\lambda_i = 10$, $k = 20$; for LCLGR: $\lambda_i = 10$, $k = 20$.

Fig. 8. Parameter sensitivity testing results on the WebKB-Cornell text data set. (a) Acc versus $\lambda_i$ plot. The parameter settings for KCLGR: $\lambda = 0.4$, $k = 10$; for LCLGR: $\lambda = 0.4$, $k = 10$. (b) Acc versus $k$ plot. The parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 1$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 0.1$. (c) Acc versus $\lambda$ plot. The parameter settings for KCLGR: $\lambda_i = 1$, $k = 10$; for LCLGR: $\lambda_i = 0.1$, $k = 10$.


ACKNOWLEDGMENTS
The work of Fei Wang and Changshui Zhang is supported by the China Natural Science Foundation under Grant Nos. 60835002 and 60675009. The work of Tao Li is partially supported by the US National Science Foundation CAREER Award IIS-0546280, DMS-0844513, and CCF-0830659.

REFERENCES
[1] A. Argyriou, M. Herbster, and M. Pontil, "Combining Graph Laplacians for Semi-Supervised Learning," Proc. Conf. Neural Information Processing Systems, 2005.
[2] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[3] M. Belkin and P. Niyogi, "Semi-Supervised Learning on Riemannian Manifolds," Machine Learning, vol. 56, pp. 209-239, 2004.
[4] M. Belkin and P. Niyogi, "Towards a Theoretical Foundation for Laplacian-Based Manifold Methods," Proc. 18th Conf. Learning Theory (COLT), 2005.
[5] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples," J. Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[6] L. Bottou and V. Vapnik, "Local Learning Algorithms," Neural Computation, vol. 4, pp. 888-900, 1992.
[7] P.K. Chan, D.F. Schlag, and J.Y. Zien, "Spectral K-Way Ratio-Cut Partitioning and Clustering," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 9, pp. 1088-1096, Sept. 1994.
[8] O. Chapelle, M. Chi, and A. Zien, "A Continuation Method for Semi-Supervised SVMs," Proc. 23rd Int'l Conf. Machine Learning, pp. 185-192, 2006.
[9] J. Chen, Z. Zhao, J. Ye, and L. Huan, "Nonlinear Adaptive Distance Metric Learning for Clustering," Proc. 13th ACM Special Interest Group Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 123-132, 2007.
[10] G. Dai and D.-Y. Yeung, "Kernel Selection for Semi-Supervised Kernel Machines," Proc. 24th Int'l Conf. Machine Learning (ICML '07), pp. 185-192, 2007.
[11] C. Ding, X. He, H. Zha, M. Gu, and H.D. Simon, "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering," Proc. First Int'l Conf. Data Mining (ICDM '01), pp. 107-114, 2001.
[12] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification. John Wiley & Sons, 2001.
[13] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman, "From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660, June 2001.
[14] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. The Johns Hopkins Univ. Press, 1996.
[15] J. Han and M. Kamber, Data Mining. Morgan Kaufmann, 2001.
[16] M. Hein, J.Y. Audibert, and U. von Luxburg, "From Graphs to Manifolds—Weak and Strong Pointwise Consistency of Graph Laplacians," Proc. 18th Conf. Learning Theory (COLT '05), pp. 470-485, 2005.
[17] J. He, M. Lan, C.-L. Tan, S.-Y. Sung, and H.-B. Low, "Initialization of Cluster Refinement Algorithms: A Review and Comparative Study," Proc. Int'l Joint Conf. Neural Networks, 2004.
[18] A. Jain and R. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
[19] B. Kernighan and S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs," The Bell System Technical J., vol. 49, no. 2, pp. 291-307, 1970.
[20] L. Zelnik-Manor and P. Perona, "Self-Tuning Spectral Clustering," Proc. Conf. Neural Information Processing Systems, 2005.
[21] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the Construction of Internet Portals with Machine Learning," Information Retrieval J., vol. 3, pp. 127-163, 2000.
[22] S.A. Nene, S.K. Nayar, and J. Murase, "Columbia Object Image Library (COIL-20)," Technical Report CUCS-005-96, Columbia Univ., Feb. 1996.
[23] A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Proc. Conf. Neural Information Processing Systems, 2002.
[24] S.T. Roweis and L.K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[25] B. Schölkopf and A. Smola, Learning with Kernels. The MIT Press, 2002.
[26] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[27] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[28] T. Sim, S. Baker, and M. Bsat, "The CMU Pose, Illumination, and Expression Database," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1615-1618, Dec. 2003.
[29] A. Strehl and J. Ghosh, "Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
[30] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[31] F. Wang, C. Zhang, and T. Li, "Regularized Clustering for Documents," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), 2007.
[32] F. Wang and C. Zhang, "Label Propagation through Linear Neighborhoods," Proc. 23rd Int'l Conf. Machine Learning, 2006.
[33] Y. Weiss, "Segmentation Using Eigenvectors: A Unifying View," Proc. IEEE Int'l Conf. Computer Vision, pp. 975-982, 1999.
[34] M. Wu and B. Schölkopf, "A Local Learning Approach for Clustering," Proc. Conf. Neural Information Processing Systems, 2006.
[35] M. Wu and B. Schölkopf, "Transductive Classification via Local Learning Regularization," Proc. 11th Int'l Conf. Artificial Intelligence and Statistics (AISTATS '07), pp. 628-635, 2007.
[36] S.X. Yu and J. Shi, "Multiclass Spectral Clustering," Proc. Int'l Conf. Computer Vision, 2003.
[37] L. Zelnik-Manor and P. Perona, "Self-Tuning Spectral Clustering," Proc. Conf. Neural Information Processing Systems, 2005.
[38] H. Zha, X. He, C. Ding, M. Gu, and H. Simon, "Spectral Relaxation for K-Means Clustering," Proc. Conf. Neural Information Processing Systems, 2001.
[39] T. Zhang and J.F. Oles, "Text Categorization Based on Regularized Linear Classification Methods," Information Retrieval, vol. 4, pp. 5-31, 2001.
[40] D. Zhou and B. Schölkopf, "Learning from Labeled and Unlabeled Data Using Random Walks," Proc. 26th DAGM Symp. Pattern Recognition, pp. 237-244, 2004.
[41] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf, "Learning with Local and Global Consistency," Proc. Conf. Neural Information Processing Systems, 2004.
[42] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions," Proc. 20th Int'l Conf. Machine Learning, 2003.
[43] X. Zhu, J. Lafferty, and Z. Ghahramani, "Semi-Supervised Learning: From Gaussian Fields to Gaussian Process," Computer Science Technical Report CMU-CS-03-175, Carnegie Mellon Univ., 2003.
[44] X. Zhu and A. Goldberg, "Kernel Regression with Order Preferences," Proc. 22nd AAAI Conf. Artificial Intelligence (AAAI), 2007.

Fei Wang received the BE degree from Xidian University in 2003, and the ME and PhD degrees from Tsinghua University in 2006 and 2008, respectively. His main research interests include machine learning, data mining, and information retrieval.


Changshui Zhang is a professor in the Department of Automation, Tsinghua University. His main research interests include machine learning. He is a member of the IEEE.


Tao Li is currently an assistant professor in the School of Computing and Information Sciences, Florida International University. His main research interests include machine learning and data mining.
