
Semi-supervised clustering via multi-level random walk

Ping He, Xiaohua Xu, Kongfa Hu, Ling Chen
Department of Computer Science, Yangzhou University, Yangzhou 225009, China

Pattern Recognition 47 (2014) 820–832

Article history: Received 12 February 2013; Received in revised form 17 June 2013; Accepted 31 July 2013; Available online 12 August 2013

Keywords: Semi-supervised clustering; Pairwise constraint; Influence expansion; Multi-level random walk; Spectral clustering

Abstract

A key issue of semi-supervised clustering is how to utilize the limited but informative pairwise constraints. In this paper, we propose a new graph-based constrained clustering algorithm, named SCRAWL. It is composed of two random walks with different granularities. In the lower-level random walk, SCRAWL partitions the vertices (i.e., data points) into constrained and unconstrained ones, according to whether they appear in the pairwise constraints. For every constrained vertex, its influence range, i.e., the degrees of influence it exerts on the unconstrained vertices, is encapsulated in an intermediate structure called a component. The edge set between each pair of components determines the affecting scope of the pairwise constraints. In the higher-level random walk, SCRAWL enforces the pairwise constraints on the components, so that the constraint influence can be propagated to the unconstrained edges. Finally, we combine the cluster memberships of all the components to obtain the cluster assignment for each vertex. The promising experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our method.

1. Introduction

Semi-supervised clustering, also called constrained clustering, has become a hotspot in current machine learning and data mining research. Compared with traditional clustering [1], semi-supervised clustering takes advantage of additional prior knowledge, such as cluster seeds or pairwise constraints, to improve the clustering result and avoid clustering ambiguity. There are two types of supervision mostly used in semi-supervised clustering. The first is the cluster seed set [2], very similar to the labeled data set in semi-supervised classification. The second is the pairwise constraint set, which specifies the pairs of data belonging to the same cluster (must-link constraints) or to different clusters (cannot-link constraints) [3]. If we view every data point as a vertex on a graph, then the first category of semi-supervised clustering is a vertex-constrained learning problem, while the second category is an edge-constrained learning problem. Since the edge constraints can be inferred from the vertex constraints, but not vice versa, it is more challenging to deal with edge-constrained clustering problems than with vertex-constrained ones. So far, various methods have been proposed to handle semi-supervised clustering with pairwise constraints. Generally, they can be classified into two lines. The first line, namely metric learning, learns optimized metric(s) to keep the must-linked data


close and the cannot-linked data far away [4–6]. Most of the existing metric learning algorithms learn linear Mahalanobis distance metrics [4,5], but Wu et al. [6] develop a novel scheme to learn nonlinear Bregman distance functions. However, metric learning approaches have a well-known disadvantage: they require a large number of pairwise constraints to learn the correct metrics [7]. Moreover, they rely heavily on a prior assumption about the metric scope, which is hard to predict in advance. For instance, Xing et al. [4] assume that all the data points share a single global metric, while Bilenko et al. [8] assume that every cluster has an independent local metric. The second line of semi-supervised clustering algorithms focuses on adapting existing clustering (generative) or classification (discriminative) models to deal with this problem. The early algorithms adapt traditional methods like k-means [9], all-pairs shortest path [10], and Gaussian mixture models [11] to greedily find a clustering result that satisfies all the pairwise constraints. However, without a backtracking mechanism, they may fail to find a satisfying partition even when one exists. To tackle this sub-optimality problem, bio-inspired metaheuristic methods have been adopted, such as genetic algorithms [12] and Ant Colony Optimization [13], which can explore the solution space more exhaustively and hence have a larger chance of finding the globally optimal solution. In recent years, more and more graph-based methods have been incorporated into semi-supervised clustering algorithms. Lu [14] generalizes MAP Gaussian process classifiers to express the uncertainty information associated with the pairwise constraints in a probabilistic framework. In addition, semi-supervised clustering based on kernel methods [15], maximum margin clustering [7], ensembles [16] and fuzzy c-means [17] has been developed along this line.

Along the second line, there is an emerging trend of developing semi-supervised clustering algorithms based on the spectral method [18]. Kamvar et al. [19] first modify the pairwise similarity matrix by setting the must-link similarities to 1 and the cannot-link similarities to 0, and then apply spectral clustering on the modified similarity matrix. However, the 1/0 modification strategy seems extreme, because the data in the same cluster may not coincide and the data in different clusters may share similar attributes. To overcome this shortcoming, Kulis et al. [20] propose a reward/penalty strategy, which adds a reward to the must-link similarities and subtracts a penalty from the cannot-link similarities. The drawback, as Li et al. [21] point out, is that it may break positive semidefiniteness, and hence convergence, if the penalty is larger than the original similarities. It was soon realized that by only revising the similarities of the constrained edges, it is hard to utilize the limited but informative pairwise constraints. A straightforward solution is to expand the constraint influence to the unconstrained edges, but the key issue lies in how. Although diverse efforts have been made, including the formulation of the constrained normalized cut [22], the alteration of the Laplacian matrix eigenspace [23], and the incorporation of the Gaussian process [24], they either cannot deal with multi-class semi-supervised clustering problems or fail to handle the cannot-link constraints. Wang and Davidson [25] develop an objective function that allows real-valued degree-of-belief constraints, but it can hardly produce satisfactory results when the number of pairwise constraints is small. Li et al. [21] combine the spectral method with global metric learning to adapt the spectral embedding of the data to be as consistent with the pairwise constraints as possible. Nevertheless, a metric is rarely uniform over the whole domain. In other words, the structure of the patterns may vary between different local neighborhoods. Thus a more appropriate way is to spread the pairwise constraints locally and exert greater influence on the nearby edges than on the faraway edges.

To confine the influence of the pairwise constraints to local areas, it is a natural choice to replace a global metric with several local metrics. Bilenko et al. [8] integrate local metric learning with constrained k-means to learn an individual local metric for every cluster. The disadvantage is that they cannot deal with data sets containing two or more local metrics in one cluster. Moreover, users need to provide many more pairwise constraints to ensure the correctness of all the local metrics. Besides, Lu and Peng [26] transform pairwise constraint propagation into solving a continuous-time Lyapunov equation, which requires a high computational cost. Although the authors provide an approximation strategy to obtain a suboptimal solution, it still costs quadratic time complexity.

In this paper, we propose a novel approach to spreading the constraint influence to the surrounding unconstrained edges with sufficient smoothness. To this end, we decompose the constraint propagation process into three steps. First, we extract the vertices appearing in the pairwise constraints, called constrained vertices (edge → vertex).
Second, we determine the influence range of every constrained vertex by computing the degrees of influence it exerts on the unconstrained vertices (vertex → vertex). Third, we derive the affecting scope of each pairwise constraint, and enforce the pairwise constraints on the affected edges. During these three steps, each pairwise constraint is at first treated as a single constrained edge, then transformed into the influence range of two constrained vertices, and at last expanded to a group of affected edges. Therefore, we call this procedure an "edge-vertex-edge" constraint utilization strategy.

More specifically, our algorithm, named SCRAWL (short for Semi-supervised Clustering via RAndom WaLk), is composed of two random walks with different granularities. In the lower-level random walk, SCRAWL partitions the vertex set into two subsets, the constrained and the unconstrained vertices. It then determines the influence range of every constrained vertex by translating the problem into a well-studied issue in semi-supervised classification, namely estimating the probabilities of the unlabeled data belonging to the same class as a labeled data point [27]. For this purpose, a semi-supervised classification algorithm, label propagation [28], is incorporated in SCRAWL. We further encapsulate the vertices within the influence range of every constrained vertex in an intermediate structure called a component. The component membership degree of each vertex equals the degree of influence it receives from the corresponding constrained vertex. Since the overall component membership degree of each vertex is 1, we can divide a whole vertex into multiple fractional vertices [29], e.g., v = [(1/2)v, (1/3)v, (1/6)v], according to its degrees of membership in different components. From this point of view, a component is the union of the fractional vertices affected by a distinct constrained vertex. In the higher-level random walk, SCRAWL derives the affecting scope of each pairwise constraint, which is the edge set connecting the components around the two constrained vertices. We call such an edge between two fractional vertices in different components a fractional edge, e.g., ⟨(1/2)v_i, (1/3)v_j⟩ with (1/2)v_i ∈ component_1 and (1/3)v_j ∈ component_2 equals (1/6)⟨v_i, v_j⟩. Its fraction, determined by the product of the fractions of the connected vertices (e.g., 1/6 = 1/2 × 1/3), indicates the degree of influence that the whole edge ⟨v_i, v_j⟩ receives from the constrained edge. To expand the constraint influence, we enforce the pairwise constraints on the fractional edges among components, and group the components into different clusters. Finally, we obtain the cluster assignment for each vertex by combining the cluster memberships of its fractional vertices distributed in the different components. The promising experimental results on synthetic data sets, the UCI database and image segmentations demonstrate the effectiveness of SCRAWL. There are several aspects of our proposed approach worth highlighting here:

- SCRAWL can propagate the pairwise constraints to the surrounding unconstrained edges in proportion to the degrees of influence they receive from the constrained edges. The greater the influence an unconstrained edge receives from a constrained edge, the more likely it is to satisfy the same pairwise constraint.
- The existing graph-based semi-supervised clustering algorithms confine the utilization of the pairwise constraints to edges. In contrast, SCRAWL develops an "edge-vertex-edge" constraint utilization strategy, which can expand a single constrained edge, through its two connected vertices, to a group of affected edges.
- SCRAWL introduces an intermediate structure between the fine-grained vertex and the coarse-grained cluster, called the "component". It can effectively uncover the underlying substructures of the clusters.
- SCRAWL establishes a connection between semi-supervised clustering and semi-supervised classification algorithms. It provides a new way to develop semi-supervised clustering algorithms based on semi-supervised classification algorithms that can predict the degrees of different class memberships for each unlabeled data point.
- SCRAWL can effectively handle clustering problems with extremely small or large numbers of pairwise constraints. For large real-world data sets, the time complexity of SCRAWL is approximately linear, given a kNN sparse similarity matrix.

The remainder of this paper is organized as follows. Section 2 introduces the label propagation algorithm incorporated in SCRAWL. Section 3 describes the algorithm of SCRAWL in detail. Section 4 discusses the parameters of SCRAWL. Section 5 evaluates the clustering performance of SCRAWL, and Section 6 concludes the whole paper.

2. Preliminaries

Label propagation [28,30,31] is a class of semi-supervised classification algorithms based on the smoothness assumption, i.e., nearby data points bear similar labels. Given a data set S = (X_l, Y_l) ∪ X_u, where X_l = {x_1, ..., x_l} is the labeled data subset, Y_l = {y_1, ..., y_l} with y_i ∈ {1, 2, ..., c} contains the labels of X_l, and X_u = {x_{l+1}, ..., x_n} is the unlabeled data subset, the aim of semi-supervised classification is to predict the labels of X_u, i.e., Y_u. Usually, an n × c label indicating matrix Ŷ is constructed in a label propagation algorithm:

\hat{Y} = \begin{bmatrix} \hat{Y}_l \\ \hat{Y}_u \end{bmatrix} = (\hat{y}_{ij})_{n \times c}   (1)

Ŷ_l and Ŷ_u respectively denote the states of the known label subset Y_l and the unknown label subset Y_u. The initial state of Ŷ, i.e., Ŷ^0, is set to ŷ_{ij} = 1 if y_i = j, and ŷ_{ij} = 0 if y_i ≠ j or y_i is unknown. The evolution of Ŷ depends on the row-normalized transition probability matrix P:

\hat{Y}^{t+1} = P \hat{Y}^{t}   (2)

where Ŷ^t represents the state of Ŷ at time step t, and the elements of P = (p_{ij})_{n×n} indicate the transition probabilities from x_i to x_j, satisfying ∀i, Σ_j p_{ij} = 1. The state of Ŷ_l is clamped to Ŷ_l = Ŷ_l^0, so that the initial labels in Y_l do not fade away [28]. To obtain the converged solution of Eq. (2), P is reorganized in correspondence with the partition of X_l and X_u:

P = \begin{bmatrix} P_{ll} & P_{lu} \\ P_{ul} & P_{uu} \end{bmatrix}   (3)

where P_ll and P_uu are the transition probability sub-matrices from X_l and X_u to themselves, and P_lu and P_ul are the mutual transition probability sub-matrices between X_l and X_u. The converged solution of Ŷ_u [28] is proved to be

\hat{Y}_u = (I - P_{uu})^{-1} P_{ul} \hat{Y}_l^0   (4)

The elements of Ŷ_u indicate the probabilities of the unlabeled data belonging to the different classes: ∀ ŷ_{ij} ∈ Ŷ_u, ŷ_{ij} ≥ 0 and Σ_j ŷ_{ij} = 1. Finally, each unlabeled data point in X_u is assigned to the class that it most probably belongs to, i.e., y_i = arg max_j ŷ_{ij}.
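To make Eqs. (1)–(4) concrete, here is a minimal NumPy sketch of the closed-form label propagation solution; the function and variable names are ours (the paper's experiments use Matlab), and the labeled points are assumed to come first in W.

```python
import numpy as np

def propagate_labels(W, y_labeled, n_labeled, n_classes):
    """Closed-form label propagation (Eq. (4)); labeled points occupy the first rows of W."""
    P = W / W.sum(axis=1, keepdims=True)            # row-normalized transition matrix
    P_uu = P[n_labeled:, n_labeled:]                # transitions among unlabeled points
    P_ul = P[n_labeled:, :n_labeled]                # transitions from unlabeled to labeled points
    Y_l0 = np.eye(n_classes)[y_labeled]             # 1/0 initial state of the labeled block
    I = np.eye(P_uu.shape[0])
    Y_u = np.linalg.solve(I - P_uu, P_ul @ Y_l0)    # (I - P_uu)^{-1} P_ul Y_l^0
    return Y_u                                      # class probabilities; predict with Y_u.argmax(axis=1)
```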

3. Algorithm

Consider a generic semi-supervised clustering problem: given a data set X = {x_1, ..., x_n} and a pairwise constraint set C = C_= ∪ C_≠, where C_= is the must-link constraint subset and C_≠ is the cannot-link constraint subset, it is possible to formulate a graph-theoretic framework as follows. Suppose G = (V, E, W) is an undirected weighted graph, V = {v_1, ..., v_n} is the vertex set with v_i corresponding to x_i, E is the edge set with e(v_i, v_j) ∈ E an edge between v_i and v_j, and W is the similarity matrix. A constraint c_=(v_i, v_j) ∈ C_= indicates that v_i and v_j belong to the same cluster, and c_≠(v_i, v_j) ∈ C_≠ indicates that v_i and v_j belong to different clusters. The goal of semi-supervised clustering is to partition V into p disjoint clusters while satisfying C as much as possible. To deal with this problem, we propose a Semi-supervised Clustering algorithm via RAndom WaLk (SCRAWL), composed of three stages: component construction, component clustering and component combination.

3.1. Component construction

Definition 1 (Constrained vertex set). The constrained vertex set V_c is composed of all the vertices constrained by C:

V_c := \{ v_i, v_j \mid \exists\, c_=(v_i, v_j) \in C_= \vee c_{\neq}(v_i, v_j) \in C_{\neq} \}   (5)

|V_c| denotes the number of constrained vertices.

Definition 2 (Unconstrained vertex set). The unconstrained vertex set V_u is the complement of V_c:

V_u := V \setminus V_c   (6)

|V_u| denotes the number of unconstrained vertices.

Imagine the following random walk on the graph G. All the vertices are regarded as different states of a Markov chain. Every constrained vertex represents an absorbing state, while each unconstrained vertex represents a transitive state. n particles start from the n different vertices and walk randomly. At each step, a particle moves from v_i to v_j with probability p_ij. If it reaches one of the absorbing states, it is trapped and never moves on; otherwise, it continues moving. The random walk stops when all the particles are absorbed. We use an n × |V_c| matrix F to collect the probabilities of each vertex being absorbed by the different constrained vertices. Its elements also indicate the degrees of influence that each vertex receives from the constrained vertices. According to the partition of V_c and V_u, F is composed of two parts:

F = \begin{bmatrix} F_c \\ F_u \end{bmatrix}   (7)

where F_c indicates the degrees of influence within V_c, while F_u indicates the degrees of influence from V_c to V_u.

First of all, we propose a q/q^{-1} modification strategy to adapt the similarity matrix to be consistent with the pairwise constraints. It assumes the original similarities w(i, j) ∈ [0, 1], and defines the modified similarity matrix W̃ = (w̃(i, j))_{n×n} as follows:

\tilde{w}(i, j) = \begin{cases} w(i, j)^{q} & \text{if } \exists\, c_=(v_i, v_j) \in C_= \\ w(i, j)^{1/q} & \text{if } \exists\, c_{\neq}(v_i, v_j) \in C_{\neq} \\ w(i, j) & \text{otherwise} \end{cases}   (8)

The parameter q ∈ (0, 1] in Eq. (8) controls how strongly the constrained edge similarities are increased or decreased. When q → 0, the q/q^{-1} strategy approaches the 1/0 strategy [19]. When q = 1, W̃ = W. The value of q indicates how much of the original similarity is kept on the constrained edges: a larger q preserves more of the original similarity, while a smaller q integrates more supervision. Fig. 1 illustrates the similarity modification of the must-linked and cannot-linked edges using our proposed q/q^{-1} strategy with various values of q. Compared with the existing 1/0 and reward/penalty strategies [20], our q/q^{-1} strategy not only avoids extreme modifications, but also keeps the modified similarities within [0, 1].

Based on the modified similarity matrix, we can compute the transition probability matrix P = D̃^{-1} W̃, where D̃ = diag(W̃ 1_n). Then we apply the label propagation method (Section 2) to determine the degrees of influence that the unconstrained vertices receive from each constrained vertex:

F_u = (I - P_{uu})^{-1} P_{uc} F_c^0   (9)

Here P_uu indicates the transition probabilities within V_u, and P_uc indicates the transition probabilities from V_u to V_c.
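As an illustration only (not the authors' released code), the q/q^{-1} modification of Eq. (8) and the closed-form influence matrix of Eq. (9) might be sketched as follows, where `must_link` and `cannot_link` are assumed to be lists of vertex-index pairs and `constrained` lists the indices of V_c.

```python
import numpy as np

def modify_similarity(W, must_link, cannot_link, q=0.01):
    """q/q^{-1} modification strategy (Eq. (8)); the entries of W are assumed to lie in [0, 1]."""
    W_mod = W.copy()
    for i, j in must_link:                          # raise must-linked similarities: w^q
        W_mod[i, j] = W_mod[j, i] = W[i, j] ** q
    for i, j in cannot_link:                        # lower cannot-linked similarities: w^(1/q)
        W_mod[i, j] = W_mod[j, i] = W[i, j] ** (1.0 / q)
    return W_mod

def influence_matrix(W_mod, constrained):
    """Degrees of influence F (Eq. (9)), with F_c^0 = I (independent constrained vertices)."""
    n = W_mod.shape[0]
    cset = set(constrained)
    order = list(constrained) + [i for i in range(n) if i not in cset]   # V_c first, then V_u
    P = W_mod[np.ix_(order, order)]
    P = P / P.sum(axis=1, keepdims=True)            # row-normalized transition probabilities
    c = len(constrained)
    P_uu, P_uc = P[c:, c:], P[c:, :c]
    F_u = np.linalg.solve(np.eye(n - c) - P_uu, P_uc)   # (I - P_uu)^{-1} P_uc F_c^0
    return np.vstack([np.eye(c), F_u]), order       # rows follow `order`, columns follow `constrained`
```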

Although Ŷ_u (Eq. (4)) and F_u (Eq. (9)) share the same form of converged solution, they have different meanings and serve different purposes. First, Ŷ_u indicates the probabilities of the unlabeled data belonging to the different classes; it can directly predict the classes of the unlabeled data for semi-supervised classification. In contrast, F_u indicates the degrees of influence that each constrained vertex exerts on the unconstrained vertices; it cannot directly provide the cluster partition for semi-supervised clustering, but only plays the role of an intermediate product of SCRAWL. Second, the initial state of the labeled data Ŷ_l^0, which decides the converged solution of Ŷ_u, is assigned 1/0 values according to the classes of the labeled data, so the number of different states in Ŷ^0 equals the number of classes. In contrast, the initial state of the constrained vertices F_c^0 depends on the assumption about how the constrained vertices affect each other. In this paper, we simply assume that the constrained vertices are independent of each other, i.e., F_c^0 = I. Hence the number of different states in F_c^0 equals the cardinality of V_c.

Fig. 1. Illustration of the q/q^{-1} modification strategy with q ∈ {0.01, 0.1, 0.5, 1}. The solid lines indicate the modified similarities of the must-linked edges, while the dashed lines indicate the modified similarities of the cannot-linked edges.

Definition 3 (Component). Let T denote the component set and |T| denote the number of components. The jth component, T_j, is composed of the vertices within the influence range of the jth constrained vertex. The membership degree of each vertex to T_j is indicated by the jth column of F. Therefore, we call F the component indicating matrix of the vertices. Since ∀i, Σ_j f_ij = 1, we can divide every vertex into multiple fractional vertices [29] according to its degrees of membership in different components, e.g., v_i = [f_{i1} v_i, f_{i2} v_i, ..., f_{i|T|} v_i]. From this perspective, T_j is the union of the fractional vertices affected by the jth constrained vertex:

T_j := \{ f_{ij} v_i \mid f_{ij} > 0 \}   (10)

where f_{ij} v_i represents the fractional vertex of v_i that belongs to T_j.

3.2. Component clustering

Consider a higher-level random walk on the components. Recall that T_1, T_2, ..., T_{|T|} denote the |T| components, T_α represents the component around the constrained vertex v_α, and T_β represents the component around the constrained vertex v_β. The edge set between T_α and T_β is composed of the edges that connect two fractional vertices in different components, e.g., e(f_{iα} v_i, f_{jβ} v_j) with f_{iα} v_i ∈ T_α and f_{jβ} v_j ∈ T_β. Since Σ_{ij} f_{iα} f_{jβ} = Σ_i f_{iα} Σ_j f_{jβ} = 1, we let

e(f_{i\alpha} v_i, f_{j\beta} v_j) = f_{i\alpha} f_{j\beta} \cdot e(v_i, v_j)   (11)

where f_{iα} f_{jβ}, the product of the degrees of influence that v_i and v_j receive from v_α and v_β, indicates the degree of influence that e(v_i, v_j) receives from e(v_α, v_β). Similar to the fractional vertices, we can also view e(f_{iα} v_i, f_{jβ} v_j) as a fractional edge of e(v_i, v_j) under the influence of e(v_α, v_β). The union of the fractional edges between T_α and T_β, {e(f_{iα} v_i, f_{jβ} v_j) | f_{iα}, f_{jβ} > 0}, forms the affecting scope of the (constrained) edge e(v_α, v_β).

Definition 4 (Component similarity matrix). The component similarity matrix is

W_c = F^T \tilde{W} F   (12)

where the elements of W_c = (w_c(α, β))_{|T|×|T|} are

w_c(\alpha, \beta) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{i\alpha} f_{j\beta} \tilde{w}_{ij} = f_{\alpha}^T \tilde{W} f_{\beta}   (13)

In Definition 4, f_{iα} f_{jβ} is the fraction of the edge e(v_i, v_j) within the influence range of e(v_α, v_β) (Eq. (11)). Therefore, given a pairwise constraint on e(v_α, v_β), the pairwise component similarity w_c(α, β) computes the accumulated similarity of the fractional edges under the influence of c_=(v_α, v_β) or c_≠(v_α, v_β). This allows us to expand the constraint influence by directly enforcing the pairwise constraints on the component similarity matrix. To ensure the component similarities lie within the range [0, 1], we first compute the normalized component similarity matrix W̄_c:

\bar{W}_c = D_c^{-1/2} W_c D_c^{-1/2}   (14)

where D_c = diag(W_c 1_{|T|}). Then we can apply our proposed q/q^{-1} modification strategy to W̄_c, leading to the modified component similarity matrix W̃_c = (w̃_c(α, β))_{|T|×|T|}:

\tilde{w}_c(\alpha, \beta) = \begin{cases} \bar{w}_c(\alpha, \beta)^{q} & \text{if } \exists\, c_=(v_\alpha, v_\beta) \in C_= \\ \bar{w}_c(\alpha, \beta)^{1/q} & \text{if } \exists\, c_{\neq}(v_\alpha, v_\beta) \in C_{\neq} \\ \bar{w}_c(\alpha, \beta) & \text{otherwise} \end{cases}   (15)

In this way, each pairwise constraint is propagated to all the affected edges according to the degrees of influence they receive from the constrained edge. Next, we compute the transition probability matrix of the component-level random walk, P_c = D̃_c^{-1} W̃_c, where D̃_c = diag(W̃_c 1_{|T|}). Meila and Shi [32] have proved that the sum of the transition probabilities among different clusters equals the normalized cut among them, so we can group the components into different clusters based on P_c. An approximately optimal solution to the normalized cut is U = [u_1 u_2 ... u_p], where u_1, u_2, ..., u_p satisfy P_c u_i = λ_i u_i with λ_1 ≥ ... ≥ λ_p. Since the ith row of U indicates the cluster membership of T_i, we call U the cluster indicating matrix of the components.

3.3. Component combination

To combine the cluster memberships of the fractional vertices distributed in the different components, we multiply the component indicating matrix of the vertices (F, of size n × |T|) with the cluster indicating matrix of the components (U, of size |T| × p), to obtain the cluster indicating matrix of the vertices:

G = F U   (16)

The ith row vector of G indicates the cluster membership of v_i. Fig. 2 shows an intuitive illustration of the computation of G. It is similar to a three-layer feedforward artificial neural network. The vertices constitute the input layer, the components form the hidden layer, and the clusters are in the output layer. F is the input to the hidden layer, U is the weight of the connections from the hidden layer to the output layer, and G records the output of the neural network. The jth column of G, provided by the jth output neuron, is an approximation to the semi-supervised clustering function for the jth cluster. The components in the hidden layer play a crucial role in the representation of the semi-supervised clustering functions by this neural network.

Fig. 2. Illustration of the computation of G, where g_11 = f_11 · u_11 + f_12 · u_21 + f_13 · u_31 + f_14 · u_41.

To explain the relationship of G with other unsupervised and semi-supervised spectral clustering algorithms, we rewrite the optimization of U as

U = \arg\max_{U^T U = I,\; U \in \mathbb{R}^{|T| \times p}} \operatorname{tr}\big(U^T \phi_C(W, F)\, U\big)   (17)

where φ_C(W, F) is a function that computes and modifies the component similarity matrix based on C.

1. If φ_C(W, F) = F^T W F, which means all the pairwise constraints are discarded or C = ∅, then G = FU is the optimal solution of the unsupervised spectral clustering algorithms, like Normalized Cut [33].
2. If φ_C(W, F) = F^T ψ(W) F, where ψ(W) is a function that adapts the vertex similarity matrix to be consistent with the pairwise constraints (e.g., similarity modification strategies or metric learning approaches), then G = FU is the optimal solution of the semi-supervised spectral clustering algorithms based on the adaptation of W, including Spectral Learning [19], SS-Kernel-Kmeans [20] and Constrained Clustering with Spectral Regularization [21].
3. If φ_C(W, F) = φ''_C ∘ φ'_C(W, F), where φ'_C(W, F) = F^T ψ(W) F = W_c with ψ(·) set to the q/q^{-1} modification strategy, and φ''_C(W_c) = D̃_c^{-1} ψ(D_c^{-1/2} W_c D_c^{-1/2}), which modifies the component similarity matrix, then G = FU is the optimal solution of our algorithm SCRAWL. Compared with the semi-supervised spectral clustering algorithms in case 2, SCRAWL takes one more step to expand the constraint influence through components.

Finally, we project the row vectors of G onto a unit hypersphere, G̃ = D_G^{-1/2} G, where D_G = diag(diag(G G^T)), and then use k-means [34] to group the n row vectors of G̃ into p clusters.

3.4. Time complexity

The time complexity of the component construction stage is dominated by the computation of the component indicating sub-matrix F_u. If we directly compute the converged solution of F_u with Eq. (9), it costs O((n−|T|)^3), which is too expensive for large real-world data sets. However, the cost can be reduced if we use a kNN sparse similarity matrix and compute F_u with the state transition equation

F_u^{t+1} = P_{uc} + P_{uu} F_u^{t}   (18)

which is derived from F^{t+1} = P F^{t} and F_c^0 = I. Since each vertex in the kNN sparse similarity matrix is only connected with its k nearest neighbors, each row of P_uu contains at most k nonzero elements. As a result, the time complexity of the iterative computation of F_u is O((n−|T|) k |T| t_max), where t_max denotes the maximal number of iterations.
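For a kNN-sparse graph, Eq. (18) can be iterated instead of solving the dense linear system; below is a sketch with SciPy sparse matrices (the tolerance and iteration cap are our assumptions).

```python
import numpy as np
import scipy.sparse as sp

def influence_matrix_iterative(P_uu, P_uc, t_max=100, tol=1e-6):
    """Iterate F_u^{t+1} = P_uc + P_uu F_u^t (Eq. (18)) from F_u^0 = 0 until convergence."""
    P_uc = P_uc.toarray() if sp.issparse(P_uc) else np.asarray(P_uc)
    F_u = np.zeros_like(P_uc)                       # F_u^0 = 0
    for _ in range(t_max):
        F_new = P_uc + P_uu @ F_u                   # one step of the absorbing random walk
        if np.abs(F_new - F_u).max() < tol:
            return F_new
        F_u = F_new
    return F_u
```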

In the component clustering stage, the time complexity is dominated by the computation of W_c and the eigenvalue decomposition of P_c. The computation of W_c costs O(n^2 |T|) if the vertex similarity matrix W is dense, but only O(n k |T|) if W is kNN sparse. Because W_c is always dense, the component transition probability matrix P_c is also dense, and the time complexity of its eigenvalue decomposition is O(|T|^3). Finally, the component combination stage costs O(n |T| p) for the computation of G. To summarize, if we use a dense vertex similarity matrix, the time complexity of SCRAWL is O((n−|T|)^3) + O(n^2 |T|) + O(|T|^3) + O(n |T| p) = O(n^3). If we use a kNN sparse vertex similarity matrix, the time complexity is O((n−|T|) k |T| t_max) + O(n k |T|) + O(|T|^3) + O(n |T| p) = O(|T|^3) + O(n |T| p), where both k and t_max are dropped as user-specified constants. Since |T| and p are so small compared with n on large data sets that they can be ignored, we can further reduce the time complexity of SCRAWL to approximately linear, O(n).
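To make the component-level stage of Section 3.2 concrete, here is a minimal sketch (ours, not the authors' Matlab implementation) of Eqs. (12)–(15) and of the top-p eigenvectors of P_c; `comp_must` and `comp_cannot` are assumed to hold the constrained pairs expressed as component indices.

```python
import numpy as np

def cluster_components(F, W_mod, comp_must, comp_cannot, p, q=0.01):
    """Component clustering: W_c (Eq. (12)), normalization (Eq. (14)),
    q/q^{-1} modification (Eq. (15)) and the eigenvectors of P_c."""
    W_c = F.T @ W_mod @ F                                   # component similarity matrix
    d = W_c.sum(axis=1)
    W_bar = W_c / np.sqrt(np.outer(d, d))                   # D_c^{-1/2} W_c D_c^{-1/2}
    W_tilde = W_bar.copy()
    for a, b in comp_must:                                  # enforce constraints at component level
        W_tilde[a, b] = W_tilde[b, a] = W_bar[a, b] ** q
    for a, b in comp_cannot:
        W_tilde[a, b] = W_tilde[b, a] = W_bar[a, b] ** (1.0 / q)
    P_c = W_tilde / W_tilde.sum(axis=1, keepdims=True)      # component-level transition matrix
    eigvals, eigvecs = np.linalg.eig(P_c)
    top = np.argsort(-eigvals.real)[:p]                     # p largest eigenvalues
    return eigvecs[:, top].real                             # cluster indicating matrix U of the components
```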

4. Parameters

In this section, we discuss two important parameters, the similarity modification strength q and the component number |T|, by studying their impact on the clustering performance of SCRAWL. We choose a modified version of the F-measure, named the Constrained F-measure, as the performance evaluation criterion:

F\text{-measure} = \frac{2PR}{P + R}   (19)

where

P = \frac{TP - |C_=|}{TP + FP - |C_=|}, \qquad R = \frac{TP - |C_=|}{TP + FN - |C_=|}   (20)

TP (True Positive) is the number of pairs of data that belong to the same class and are assigned to the same cluster, FP (False Positive) is the number of pairs of data that belong to different classes but are assigned to the same cluster, FN (False Negative) is the number of pairs of data that belong to the same class but are assigned to different clusters, and |C_=| is the number of must-link constraints. Notice that the Constrained F-measure excludes the influence of the pairwise constraints (Eq. (20)) to suit the evaluation of semi-supervised clustering algorithms. In the following, we take the iris data set as an example, and illustrate the way to determine appropriate parameters for SCRAWL. The number of clusters is set to the number of classes, which is 3 for the iris data set.

4.1. Similarity modification strength

In the component construction stage, we proposed a q/q^{-1} similarity modification strategy. It is first applied to the vertex similarity matrix W (Eq. (8)) and later to the normalized component similarity matrix W̄_c (Eq. (15)). In order to differentiate the two similarity modifications at different granularities, we use q_v to denote the modification strength on the vertex similarity matrix, and q_c to denote the modification strength on the component similarity matrix.

4.1.1. q_v

The q_v/q_v^{-1} modification of W refines the influence range of each pairwise constraint. We fix the other parameters to q_c = 1 (which disables the modification of W̄_c) and |T| = |V_c|, and analyze the influence of q_v in Fig. 3.
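A small sketch of the Constrained F-measure of Eqs. (19)–(20) follows; the pair counting runs over all data pairs, with the |C_=| must-link pairs removed from TP and from the precision/recall denominators (helper names are ours).

```python
from itertools import combinations

def constrained_f_measure(y_true, y_pred, must_link):
    """Constrained F-measure (Eqs. (19)-(20)); y_true / y_pred are class and cluster labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(y_true)), 2):        # O(n^2) pair counting
        same_class = y_true[i] == y_true[j]
        same_cluster = y_pred[i] == y_pred[j]
        tp += same_class and same_cluster
        fp += (not same_class) and same_cluster
        fn += same_class and (not same_cluster)
    m = len(must_link)                                       # |C_=|
    precision = (tp - m) / (tp + fp - m)
    recall = (tp - m) / (tp + fn - m)
    return 2 * precision * recall / (precision + recall)
```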

Fig. 3. The learning curves of q_v on the iris data set averaged over 50 runs: (a) q_v ∈ (0.001, 0.01) and (b) q_v ∈ (0.01, 0.1) ∪ (0.1, 1).

Fig. 4. The learning curves of q_c on the iris data set averaged over 50 runs: (a) q_c ∈ (0.001, 0.01) and (b) q_c ∈ (0.01, 0.1) ∪ (0.1, 1).

In Fig. 3(b), as q_v decreases from 1 to 0.01, more and more supervision is integrated into the constrained edge similarities, and the clustering performance of SCRAWL improves significantly. However, as q_v further declines from 0.01 to 0.001, the learning curves of q_v show only small fluctuations (Fig. 3(a)). This indicates that as long as q_v is smaller than a threshold (e.g., q_v ≤ 0.01), the influence of q_v on the clustering performance of SCRAWL remains essentially unchanged.

4.1.2. q_c

The q_c/q_c^{-1} modification of W̄_c expands the influence of the pairwise constraints to the affected fractional edges. To reveal the relationship between q_c and the clustering performance of SCRAWL, we fix q_v = 0.01 and |T| = |V_c|, and depict the learning curves of q_c in Fig. 4. In Fig. 4, the learning curves of q_c at first climb up rapidly as q_c decreases from 1 to 0.1, then slow down as q_c declines from 0.1 to 0.01 (Fig. 4(b)), and finally stay roughly flat when q_c ∈ (0.001, 0.01) (Fig. 4(a)). This indicates that as long as q_c is smaller than a threshold (e.g., q_c ≤ 0.01), the influence of q_c, similar to that of q_v, on the clustering performance of SCRAWL is almost the same.

Although in Figs. 3 and 4 the performance of the q/q^{-1} strategy is similar to that of the 1/0 strategy on the iris data set, this does not mean that the q/q^{-1} strategy can be replaced by the 1/0 strategy for all data sets. Assume a case in which the similarities within clusters are very low (10^{-5}) and the similarities among clusters are even lower (10^{-7}). Given a must-link constraint, if we use the 1/0 strategy that sets the must-linked similarities to 1, then the difference between the intra-cluster and inter-cluster similarities is replaced by the much larger difference between the constrained and unconstrained similarities (1 versus near 0), leading to a wrong clustering result. In contrast, if we use the q/q^{-1} strategy for similarity modification, the optimal q will be close to 1 instead of 0. Therefore, a significant difference between the 1/0 strategy and the q/q^{-1} strategy is that the 1/0 strategy sets the constrained edge similarities regardless of their original values, while the q/q^{-1} strategy integrates the original similarities and the supervisory information with q controlling their tradeoff, and is thus more general and flexible.

4.2. Component number

In the component construction stage, we build the same number of components as constrained vertices (|T| = |V_c|). However, this strategy has problems in dealing with extremely small or large numbers of pairwise constraints. When |V_c| < p, as in traditional unsupervised clustering, SCRAWL cannot partition |V_c| components into p clusters. When |V_c| → n, the affecting scope of each constrained vertex shrinks to itself (F = I), and SCRAWL reduces to a semi-supervised clustering algorithm based on vertex similarity modification. In order to deal with these extreme cases appropriately, we put a cap on the number of components, denoted |T|_u. When |V_c| ≤ |T|_u, |T| = |V_c|. When |V_c| > |T|_u, |T| = |T|_u, and we randomly select |T|_u constrained vertices for component construction.

Fig. 5. The learning curves of n/|T|_u and |T|_l/p on the iris data set averaged over 50 runs: (a) n/|T|_u ∈ (1, 10) ∪ (10, 50) and (b) |T|_l/p ∈ (1, 10).

This ensures that each component contains at least n/|T|_u vertices on average, and that at most (|T|_u^2/|V_c|^2)|C| pairwise constraints can be propagated through the components on average. Fig. 5(a) illustrates the learning curves of n/|T|_u on the iris data set, where the influence of the other parameters is excluded by setting q_v = q_c = 0.01. In Fig. 5(a), the minimal component size n/|T|_u = 1 corresponds to the maximal component number |T| = |V_c|, because |V_c| ≤ |T|_u = n. The maximal component size n/|T|_u = 50 corresponds to the minimal component number |T| = p, where p = 3 is the number of classes of the iris data set. As n/|T|_u increases, the learning curves of SCRAWL first rise rapidly because of the expanded influence range of each pairwise constraint, and then decline at a slower pace due to the reduced number of propagated pairwise constraints. The performance gap between the optimal n/|T|_u and the benchmark n/|T|_u = 1 demonstrates that an appropriate component size improves the clustering performance of SCRAWL significantly. On the other hand, we also set a floor under the number of components, denoted |T|_l. When |V_c| ≥ |T|_l, |T| = min(|V_c|, |T|_u). When |V_c| < |T|_l, |T| = |T|_l, and we randomly select (|T|_l − |V_c|) unconstrained vertices together with V_c to construct components. This ensures that each cluster contains at least |T|_l/p components on average. Fig. 5(b) illustrates the learning curves of |T|_l/p when |V_c| < p on the iris data set, where q_v = q_c = 0.01 and |T|_u = n. In Fig. 5(b), when |C| = 0 (or |C| = 1), |V_c| = 0 (or |V_c| = 2), we use different numbers of unconstrained vertices for component construction. Although no pairwise constraint is propagated through the components around the unconstrained vertices, we find that SCRAWL still produces much better clustering results than the unsupervised spectral clustering algorithm NCut [34]. This may be because the components reveal the underlying structure of the data set, which makes the clustering easier. Besides, the rise and decline of the learning curves over |T|_l/p suggest the importance of component diversity within clusters. Algorithm 1 summarizes the complete SCRAWL algorithm. Step 1 infers the transitive closure of C to obtain more pairwise constraints [9]. Steps 2–11 form the component construction stage, steps 12–16 form the component clustering stage, and steps 17–19 constitute the component combination stage. Finally, step 20 returns the cluster partition predicted by SCRAWL.
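Step 1 of Algorithm 1 (inferring the transitive closure of C, in the spirit of [9]) could be sketched as follows; the exact inference rules used in the released code are not given in the paper, so this union-find version is only one plausible reading.

```python
def constraint_transitive_closure(n, must_link, cannot_link):
    """Must-link is treated as transitive; a cannot-link between two points is
    extended to every pair drawn from their two must-link groups."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]            # path compression
            i = parent[i]
        return i

    for i, j in must_link:                           # merge must-linked points
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    ml = {(i, j) for g in groups.values() for i in g for j in g if i < j}
    cl = set()
    for i, j in cannot_link:                         # extend cannot-links across groups
        for a in groups[find(i)]:
            for b in groups[find(j)]:
                cl.add((min(a, b), max(a, b)))
    return sorted(ml), sorted(cl)
```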

5. Experiments

In this section, we evaluate the clustering performance of SCRAWL in comparison with four other semi-supervised clustering algorithms: Spectral Learning (SL) [19], SS-Kernel-Kmeans (SSKK) [20], Constrained Clustering with Spectral Regularization (CCSR) [21] and Metric Pairwise Constrained K-means (MPCK) [8]. Among them, SL, SSKK, CCSR and SCRAWL are all semi-supervised spectral clustering algorithms implemented in Matlab, while MPCK is an integrated approach of constrained k-means and local metric learning implemented in Java (code available at http://www.cs.utexas.edu/users/ml/risc/code/).

Algorithm 1. SCRAWL(W, C, p, q, |T|).
INPUT: the pairwise similarity matrix W; the constraint set C; the number of clusters p; the similarity modification strength q; the number of components |T|.
OUTPUT: the predicted cluster assignment f.
1:  C* ← ConstraintTransitiveClosure(C)
2:  W̃ ← q/q^{-1} modification of W using Eq. (8)
3:  P ← D̃^{-1} W̃
4:  F_c ← I_{|T|×|T|}
5:  if W is dense
6:    F_u ← (I − P_uu)^{-1} P_uc
7:  else
8:    F_u^0 ← 0_{(n−|T|)×|T|}
9:    F_u ← iterative computation based on Eq. (18)
10: end
11: F ← [F_c^T  F_u^T]^T
12: W_c ← F^T W̃ F
13: W̄_c ← D_c^{-1/2} W_c D_c^{-1/2}, where D_c ← diag(W_c 1_{|T|})
14: W̃_c ← q/q^{-1} modification of W̄_c using Eq. (15)
15: P_c ← D̃_c^{-1} W̃_c, where D̃_c = diag(W̃_c 1_{|T|})
16: U ← [u_1, u_2, ..., u_p], where P_c u_i = λ_i u_i, λ_1 ≥ ... ≥ λ_p
17: G ← F U
18: G̃ ← D_G^{-1/2} G, where D_G = diag(diag(G G^T))
19: f ← Kmeans(G̃, p)
20: return f
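Putting the earlier sketches together, a compact and simplified Python rendering of Algorithm 1 might look as follows. It assumes the helper functions sketched above (constraint_transitive_closure, modify_similarity, influence_matrix, cluster_components), a dense W, and |T| = |V_c|; the authors' actual implementation is in Matlab and additionally handles the sparse case and the |T|_l/|T|_u caps.

```python
import numpy as np
from sklearn.cluster import KMeans

def scrawl(W, must_link, cannot_link, p, q=0.01):
    """Simplified SCRAWL sketch following Algorithm 1 (dense W, |T| = |V_c|)."""
    n = W.shape[0]
    must_link, cannot_link = constraint_transitive_closure(n, must_link, cannot_link)  # step 1
    W_mod = modify_similarity(W, must_link, cannot_link, q)                            # step 2
    constrained = sorted({i for pair in must_link + cannot_link for i in pair})
    F, order = influence_matrix(W_mod, constrained)                                    # steps 3-11
    F = F[np.argsort(order)]                               # rows back to the original vertex order
    col = {v: k for k, v in enumerate(constrained)}        # one component per constrained vertex
    comp_must = [(col[i], col[j]) for i, j in must_link]
    comp_cannot = [(col[i], col[j]) for i, j in cannot_link]
    U = cluster_components(F, W_mod, comp_must, comp_cannot, p, q)                     # steps 12-16
    G = F @ U                                              # step 17: cluster indicating matrix
    G = G / np.linalg.norm(G, axis=1, keepdims=True)       # step 18: project onto the unit hypersphere
    return KMeans(n_clusters=p, n_init=10).fit_predict(G)  # steps 19-20
```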

Fig. 6. The clustering results of SCRAWL on two artificial data sets. The solid lines indicate must-link constraints, and the dashed lines indicate cannot-link constraints: (a) Yin-yang, (b) clustering result, (c) tri-circle and (d) clustering result.

We first illustrate the clustering results of SCRAWL on two artificial data sets, and then compare SCRAWL with the four other semi-supervised clustering algorithms on nine UCI data sets and seven image segmentations given both must-link and cannot-link constraints. For each UCI data set, at least 10 different numbers of pairwise constraints are provided based on the ground-truth class labels. For each number of pairwise constraints, 50 different realizations are randomly generated to compute the average clustering performance. We use two criteria, the Constrained F-measure and NMI, to evaluate the clustering results on the UCI data sets, and compare the image segmentation results visually. The Constrained F-measure has been defined in Eqs. (19) and (20). The definition of NMI, short for Normalized Mutual Information [35], is as follows:

\mathrm{NMI}(\Omega, \Psi) = \frac{I(\Omega, \Psi)}{[H(\Omega) + H(\Psi)]/2}   (21)

where Ω is the set of predicted clusters, Ψ is the set of true classes, and I is the mutual information:

I(\Omega, \Psi) = \sum_{k} \sum_{j} p(\omega_k \cap \psi_j) \log_2 \frac{p(\omega_k \cap \psi_j)}{p(\omega_k)\, p(\psi_j)}   (22)

and H is the entropy function:

H(\Omega) = -\sum_{k} p(\omega_k) \log_2 p(\omega_k)   (23)
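For reference, here is a small NumPy sketch of the NMI of Eqs. (21)–(23); equivalent routines exist in standard libraries, and this version is only meant to make the definition explicit.

```python
import numpy as np

def nmi(labels_pred, labels_true):
    """Normalized Mutual Information (Eqs. (21)-(23))."""
    labels_pred, labels_true = np.asarray(labels_pred), np.asarray(labels_true)
    clusters, classes = np.unique(labels_pred), np.unique(labels_true)
    p_joint = np.array([[np.mean((labels_pred == w) & (labels_true == c)) for c in classes]
                        for w in clusters])                  # empirical joint distribution
    p_w, p_c = p_joint.sum(axis=1), p_joint.sum(axis=0)      # marginals
    mask = p_joint > 0
    mutual_info = np.sum(p_joint[mask] * np.log2(p_joint[mask] / np.outer(p_w, p_c)[mask]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return mutual_info / ((entropy(p_w) + entropy(p_c)) / 2)
```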

Finally, we report the running time spent on the image segmentations to discuss runtime performance.

With respect to graph construction, we construct 20NN (20-nearest-neighbor) sparse graphs for the artificial data sets, and fully connected graphs for the UCI data sets. Their pairwise similarity matrices W are computed by the Gaussian radial basis function

w_{ij} = e^{-\frac{\| x_i - x_j \|^2}{2 s^2}}   (24)

where the scale parameter s is optimized over the interval {2^{-25/5}, 2^{-24/5}, ..., 2^{24/5}, 2^{25/5}} in each run for each number of constraints. We select the optimal s that minimizes the sum of the within-cluster point-to-centroid distances, leading to the tightest clusters of the row vectors of G̃ [34]. The remaining parameters are set according to the parameter analysis on the iris data set in Section 4: the similarity modification strengths q_v = q_c = 0.01, the component number |T| = max(|T|_l, min(|V_c|, |T|_u)), |T|_l = 2p and |T|_u = 0.05n. Note that the optimal parameters for the iris data set may not be optimal for other applications, but we use the same parameters throughout the experiments for simplicity and consistency.

The images used in the semi-supervised image segmentation experiments are selected from the Berkeley Segmentation Database [36]. Each image is scaled to 170 × 113 pixels. The similarities between each pair of pixels are computed by measuring the magnitude of their intervening contours [33]:

w_{ij} = e^{-\frac{\max_{x \in \mathrm{line}(i,j)} \| \mathrm{edge}(x) \|^2}{s_e}}   (25)

where line(i, j) is a straight line that connects the ith and jth pixels, edge(x) is the magnitude of the intervening contour at location x, and s_e is set to 1/10 of the maximal edge magnitude on the graph. To keep the similarity matrices for image segmentation sparse, we only allow each pixel to connect with its neighboring pixels within a circle of radius r = 10.
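Below is a sketch of the Gaussian similarity graph of Eq. (24), with an optional kNN sparsification as used for the artificial data sets (k = 20 there); the scale-selection loop over {2^{-25/5}, ..., 2^{25/5}} is omitted and sigma is passed directly. The intervening-contour similarity of Eq. (25) requires image edge maps and is not sketched here.

```python
import numpy as np

def rbf_similarity(X, sigma, k=None):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) (Eq. (24)); optionally keep only kNN edges."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                               # no self-loops
    if k is not None:
        keep = np.zeros_like(W, dtype=bool)
        nn = np.argsort(-W, axis=1)[:, :k]                 # indices of the k most similar points
        keep[np.repeat(np.arange(len(X)), k), nn.ravel()] = True
        W = np.where(keep | keep.T, W, 0.0)                # keep an edge if either endpoint selects it
    return W
```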

Fig. 7. The components constructed by SCRAWL on the Yin-yang data set. The solid symbols indicate the constrained data, the hollow symbols indicate the unconstrained data. The darker a symbol, the higher its probability of belonging to the component.

Fig. 8. The Constrained F-measure curves of the five semi-supervised clustering algorithms on the UCI data sets (tissue, iris, parkinsons, sonar, statlog, liver, ionosphere, pendigits389 and breast).

All the following experiments are executed in MATLAB R2010b on a Mac with a 1.7 GHz Intel Core i5 and 4 GB of RAM.

5.1. Toy examples

Fig. 6 shows the clustering results of SCRAWL on two synthetic data sets. One is the Yin-yang data set, where the two classes of data are mutually embedded. The other is the Tri-circle data set, where the three classes of data are not only intersecting but also overlapping. On both synthetic data sets, SCRAWL produces satisfactory clustering results. We take the Yin-yang data set as an example, and plot the components constructed around the constrained vertices in Fig. 7. From Fig. 7, we can see that SCRAWL successfully recognizes the underlying structure of the data set, which allows the pairwise constraints to be propagated to the unconstrained data correctly.

5.2. UCI database

Figs. 8 and 9 respectively compare the Constrained F-measure and NMI curves of the five semi-supervised clustering algorithms on nine UCI data sets. The attribute values of the tissue, parkinsons, statlog and liver data sets are scaled into the range [−1, 1], and the missing data and the first attribute (ID number) of the breast data set are removed. In Figs. 8 and 9, the comparison results of the Constrained F-measure curves and the NMI curves demonstrate the same fact: among the five clustering algorithms, SCRAWL exhibits the best clustering performance as the number of pairwise constraints increases. Although the learning curves of SCRAWL at first fall behind those of CCSR for small numbers of pairwise constraints, they climb up quickly and soon exceed the learning curves of the other algorithms with a growing advantage. On the contrary, despite its superior performance on the initial pairwise constraints, CCSR hardly shows great improvement even when the number of pairwise constraints grows very large. This is probably due to the increasing difficulty of adapting the limited dimension (D = 15) of the spectral space to satisfy the growing number of pairwise constraints. Different from SCRAWL and CCSR, SSKK produces good results on some of the data sets, but shows poor performance on the others. This may be because the reward and penalty setting is not universally appropriate for all data sets. The learning curves of SL have the ascending trend most similar to those of SCRAWL. However, because of its lack of constraint propagation, SL falls behind SCRAWL with large gaps. Finally, MPCK first shows a "dip" before rising on its learning curves. The reason might be that the metrics learned from few pairwise constraints are unreliable, and more pairwise constraints can improve this situation [8].

iris

1

0.9 0.9

0.8 CCSR SL SSKK MPCK SCRAWL

NMI

0.8

0.7

0.7

0.7

NMI

0.8

NMI

parkinsons 1

1

CCSR SL SSKK MPCK SCRAWL

0.9

829

0.6 0.5 0.4

0.6 0.6

0.3

0.5

CCSR SL SSKK MPCK SCRAWL

0.2

0.5

0.1 0.4 100

200

300

400

500

600

100

700

Constraint Number sonar

400

500

600

700

100

0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

200

300

400

500

600

700

600

700

Constraint Number

statlog

1

liver

CCSR SL SSKK MPCK SCRAWL

0.9

CCSR SL SSKK MPCK SCRAWL

0.8 0.7

0.5

NMI

NMI

0.6 0.5

0.5 0.4

0.4

0.4 0.3

0.3

CCSR SL SSKK MPCK SCRAWL

0.2 0.1 0 100

200

300

400

500

600

0.3

0.2

0.2

0.1

0.1

700

0 100

200

ionosphere 1

0.8

400

500

600

700

100

0.7

0.4

1

0.9

0.9

0.7

0.7

CCSR SL SSKK MPCK SCRAWL

0.5 0.4 300

400

500

Constraint Number

600

700

0.6

CCSR SL SSKK MPCK SCRAWL

0.3

0.1

500

0.8

0.6

0.2

400

breast

1

NMI

NMI

0.5

200

300

Constraint Number

0.8

0.6

100

200

pendigits389

CCSR SL SSKK MPCK SCRAWL

0.9

300

Constraint Number

Constraint Number

NMI

300

Constraint Number

1

NMI

200

100

200

300

400

500

Constraint Number

0.5 600

700

200

400

600

800 1000 1200 1400

Constraint Number

Fig. 9. The NMI curves of the five semi-supervised clustering algorithms on UCI data sets.

830

P. He et al. / Pattern Recognition 47 (2014) 820–832

5.3. Image segmentation

The existing semi-supervised image segmentation algorithms either use cluster seeds or pure must-link constraints to define the desired content for extraction [37,38]. In this experiment, we evaluate the four graph-based semi-supervised clustering algorithms, SL, SSKK, CCSR and SCRAWL, on seven natural color image segmentations given both must-link and cannot-link constraints. MPCK is replaced in the comparison by the state-of-the-art unsupervised spectral clustering algorithm NCut [33], because NCut is based on the same graphs (Eq. (25)) as the other four semi-supervised clustering algorithms, while MPCK is not. Fig. 10 illustrates the segmentation results of the four semi-supervised spectral clustering algorithms compared with the benchmark NCut [33]. In the first two rows, different pairwise constraints are provided to localize the worm and the plant, respectively, on the same input image. The third to fifth rows bias the extraction toward the left bird, the right bird and both birds on another input image. The last two rows aim to differentiate the trees, the house and the swan from their reflections in the lake and from the background. The comparison results demonstrate the superiority of SCRAWL over the other semi-supervised clustering algorithms.

Fig. 10. Comparison of the five clustering algorithms on seven image segmentations given both must-link (solid lines) and cannot-link (dashed lines) constraints: (a) Original, (b) NCut, (c) SL, (d) SSKK, (e) CCSR and (f) SCRAWL.

Fig. 11. The components constructed around the constrained vertices by SCRAWL. The brighter a pixel, the higher its probability of belonging to the component. The same components constructed for different image segmentations are deleted to save space.

Table 1
The running time (s) of the five clustering algorithms on image segmentations.

Image | Normalized cut | SL     | SSKK | CCSR  | SCRAWL
1     | 14.40          | 84.28  | 0.97 | 22.97 | 3.04
2     | 13.75          | 82.42  | 1.21 | 23.65 | 6.63
3     | 7.49           | 393.54 | 1.13 | 23.76 | 4.45
4     | 7.60           | 399.97 | 1.17 | 22.90 | 6.41
5     | 6.74           | 133.38 | 1.14 | 26.95 | 9.90
6     | 7.58           | 13.18  | 0.93 | 24.01 | 4.73
7     | 11.48          | 12.62  | 2.02 | 24.93 | 5.98

The reason for the success of SCRAWL on image segmentation is that each image is composed of several local image elements, which correspond to different contents or subjects, while the components constructed by SCRAWL are ideal approximations of those image elements, if not of their fragments. The automatic recognition of the local image elements, combined with the enforcement of the pairwise constraints, reduces semi-supervised image segmentation to clustering the local image elements based on the adapted similarities. Fig. 11 illustrates the components constructed for different image segmentations by SCRAWL. In each subfigure, a light is placed at a constrained vertex, so that a brighter pixel indicates a higher probability of belonging to the component.

Table 1 summarizes the running time of the five clustering algorithms on the semi-supervised image segmentations in Fig. 10. Although theoretically the time complexities of NCut, SL, SSKK and CCSR are all O(n^3), dominated by the eigenvalue decomposition of W, SSKK spends the least computational time, while SL spends the most. The reason lies in the adoption of sparse similarity matrices, whose eigenvalue decomposition is executed by the eigs function (in Matlab), which computes the largest p eigenvectors with the Lanczos method. Its time complexity, upper-bounded by O(n^2), depends not only on the matrix sparsity, but also on the relative gap between the

largest eigenvalue and the next one. Since the four algorithms NCut, SL, SSKK and CCSR use different sparse similarity matrices for eigenvalue decomposition, they spend different amounts of running time on it. In contrast, no matter whether W is sparse or dense, SCRAWL always obtains a dense component similarity matrix. As a result, it never suffers from the instability caused by the eigenvalue decomposition of sparse matrices. To summarize, given the approximately linear time complexity of SCRAWL and its runtime performance in Table 1, SCRAWL is not only an effective but also an efficient semi-supervised image segmentation algorithm.

6. Conclusions

Random walk is a popular bottom-up technique used in supervised and unsupervised learning tasks. It can locally explore the neighborhood of each data point by computing the transition probabilities from and to it. In this paper, we propose a semi-supervised clustering algorithm named SCRAWL under a multi-level random walk framework. It first determines the propagation range of each pairwise constraint in a vertex-level random walk, and then expands the influence of the pairwise constraints in a component-level random walk. Compared with other semi-supervised clustering algorithms, SCRAWL exhibits superior performance on artificial data sets, the UCI database and semi-supervised image segmentations. However, our proposed approach also has limitations, which suggest the following directions for our future study.

- In real-world applications, the pairwise constraints provided by different domain experts may conflict with each other. For instance, (A,B) and (B,C) may both satisfy must-link constraints while (A,C) satisfies a cannot-link constraint. When SCRAWL is applied to such problems, it will infer an unexpected constraint transitive closure, because the previously inferred constraints can easily be violated by the subsequently inferred ones. Hence we intend to investigate how to preprocess inconsistent pairwise constraints for SCRAWL.
- Even when there are no conflicting constraints, SCRAWL is sensitive to noisy constraints for two reasons. First, it will infer more mislabeled pairwise constraints from the noisy ones. Second, it will expand the influence of the mislabeled pairwise constraints, leading to a wrong clustering result. To alleviate the problem, besides disabling the inference of the transitive closure for noisy constraints, we also intend to adjust the modification strength q, whose complement (1 − q) indicates the constraint confidence, for every distinct pairwise constraint.

Conflict of interest

None declared.

Acknowledgments

The authors would like to thank Dr. Li for the code of CCSR and the anonymous reviewers for their valuable comments and suggestions, which significantly improved the quality of this paper. This research was supported in part by the Chinese National Natural Science Foundation under Grant nos. 61003180, 61070047 and 61103018, the Natural Science Foundation of the Education Department of Jiangsu Province under contracts 13KJB520026 and 09KJB200013, the Natural Science Foundation of Jiangsu Province under contracts BK2010318 and BK2011442, and the New Century Talent Project of Yangzhou University.

References

[1] A. Ben-Hur, D. Horn, H.T. Siegelmann, V. Vapnik, Support vector clustering, Journal of Machine Learning Research 2 (2001) 125–137.
[2] S. Basu, A. Banerjee, R.J. Mooney, Semi-supervised clustering by seeding, in: ICML, 2002, pp. 19–26.
[3] K. Wagstaff, C. Cardie, Clustering with instance-level constraints, in: ICML, 2000, pp. 1103–1110.
[4] E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, Distance metric learning with application to clustering with side-information, in: NIPS, 2002, pp. 505–512.
[5] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: ICML, 2007, pp. 209–216.
[6] L. Wu, S. Hoi, R. Jin, J. Zhu, N. Yu, Learning Bregman distance functions for semi-supervised clustering, IEEE Transactions on Knowledge and Data Engineering 24 (3) (2012) 478–491.
[7] H. Zeng, Y.-M. Cheung, Semi-supervised maximum margin clustering with pairwise constraints, IEEE Transactions on Knowledge and Data Engineering 24 (5) (2012) 926–939.
[8] M. Bilenko, S. Basu, R. Mooney, Integrating constraints and metric learning in semi-supervised clustering, in: ICML, 2004, pp. 81–88.
[9] K. Wagstaff, C. Cardie, S. Rogers, S. Schroedl, Constrained k-means clustering with background knowledge, in: ICML, 2001, pp. 577–584.
[10] D. Klein, S.D. Kamvar, C.D. Manning, From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering, in: ICML, 2002, pp. 307–314.
[11] N. Shental, A. Bar-Hillel, T. Hertz, D. Weinshall, Computing Gaussian mixture models with EM using equivalence constraints, in: NIPS, 2003, pp. 465–472.
[12] Y. Hong, S. Kwong, H. Xiong, Q. Ren, Genetic-guided semi-supervised clustering algorithm with instance-level constraints, in: GECCO, 2008, pp. 1381–1388.
[13] X. Xu, L. Lu, P. He, Z. Pan, L. Chen, Improving constrained clustering via swarm intelligence, Neurocomputing 116 (2013) 317–325.
[14] Z. Lu, Semi-supervised clustering with pairwise constraints: a discriminative approach, Journal of Machine Learning Research 2 (2007) 299–306.
[15] X. Yin, S. Chen, E. Hu, D. Zhang, Semi-supervised clustering with metric learning: an adaptive kernel method, Pattern Recognition 43 (4) (2010) 1320–1333.
[16] Y. Liu, R. Jin, A.K. Jain, BoostCluster: boosting clustering by pairwise constraints, in: ACM SIGKDD, 2007, pp. 450–459.
[17] I.A. Maraziotis, A semi-supervised fuzzy clustering algorithm applied to gene expression data, Pattern Recognition 45 (1) (2012) 637–648.
[18] M. Filippone, F. Camastra, F. Masulli, S. Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognition 41 (1) (2008) 176–190.
[19] S. Kamvar, D. Klein, C. Manning, Spectral learning, in: IJCAI, 2003, pp. 561–566.
[20] B. Kulis, S. Basu, I. Dhillon, R. Mooney, Semi-supervised graph clustering: a kernel approach, Machine Learning 74 (2009) 1–22.
[21] Z. Li, J. Liu, X. Tang, Constrained clustering via spectral regularization, in: CVPR, 2009, pp. 421–428.
[22] S.X. Yu, J. Shi, Segmentation given partial grouping constraints, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 173–180.
[23] T.D. Bie, J.A.K. Suykens, B.D. Moor, Learning from general label constraints, in: SSPR & SPR, 2004, pp. 671–679.
[24] Z. Lu, M. Carreira-Perpinan, Constrained spectral clustering through affinity propagation, in: CVPR, 2008, pp. 1–8.
[25] X. Wang, I. Davidson, Flexible constrained spectral clustering, in: ACM SIGKDD, 2010, pp. 563–572.
[26] Z. Lu, Y. Peng, Exhaustive and efficient constraint propagation: a graph-based learning approach and its applications, International Journal of Computer Vision 103 (3) (2013) 306–325.
[27] X. Zhu, Semi-Supervised Learning Literature Survey, Technical Report 1530, Computer Science, University of Wisconsin-Madison, 2008.
[28] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: ICML, 2003, pp. 912–919.
[29] E.R. Scheinerman, D.H. Ullman, Fractional Graph Theory, Wiley-Interscience, New York, 1997.
[30] A. Azran, The rendezvous algorithm: multiclass semi-supervised learning with Markov random walks, in: ICML, 2007, pp. 49–56.
[31] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: NIPS, 2004, pp. 321–328.
[32] M. Meila, J. Shi, A random walks view of spectral segmentation, in: AISTATS, 2001, pp. 873–879.
[33] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 888–905.
[34] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: NIPS, 2001, pp. 849–856.
[35] A. Strehl, J. Ghosh, R. Mooney, Impact of similarity measures on web-page clustering, in: AAAI, 2000, pp. 58–64.
[36] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: ICCV, 2001, pp. 416–423.
[37] H. Zhou, J. Zheng, L. Wei, Texture aware image segmentation using graph cuts and active contours, Pattern Recognition 46 (6) (2013) 1719–1733.
[38] B. Peng, L. Zhang, D. Zhang, A survey of graph theoretical approaches to image segmentation, Pattern Recognition 46 (3) (2013) 1020–1038.

Ping He received her Ph.D. degree in computer science from Nanjing University of Aeronautics and Astronautics of China in 2012, and M.S. degree from Yangzhou University of China in 2008. Her research interests include machine learning, data mining and bioinformatics.

Xiao-hua Xu received his Ph.D. degree in computer science from Nanjing University of Aeronautics and Astronautics of China in 2008, and M.S. degree from Yangzhou University of China in 2005. His research interests include machine learning, evolutionary computation, and parallel algorithms.

Kong-fa Hu is a professor in the Computer Science Department at Yangzhou University, Yangzhou, P.R. China. He received his Ph.D. degree in computer science from Southeast University of China in 2004. His research interests include data mining, database, and data warehouse.

Ling Chen is a professor in the Computer Science Department at Yangzhou University, Yangzhou, P.R. China. He did research for two years on parallel algorithms and architectures at the University of Pittsburgh, PA, first as a visiting scholar in 1986 and, later, as a visiting associate professor in 1992. His research interests include parallel algorithm design, artificial intelligence, and bioinformatics. Professor Chen is a member of the IEEE Computer Society and the Chinese Computer Society.