Polynomial Time Approximation Schemes for Metric Min-Sum Clustering

Electronic Colloquium on Computational Complexity, Report No. 25 (2002)

Polynomial Time Approximation Schemes for Metric Min-Sum Clustering

W. Fernandez de la Vega*   Marek Karpinski†   Claire Kenyon‡   Yuval Rabani§

Abstract

We give polynomial time approximation schemes for the problem of partitioning an input set of $n$ points into a fixed number $k$ of clusters so as to minimize the sum over all clusters of the total pairwise distances in a cluster. Our algorithms work for arbitrary metric spaces as well as for points in $\mathbb{R}^d$ where the distance between two points $x, y$ is measured by $\|x - y\|_2^2$ (notice that $(\mathbb{R}^d, \|\cdot\|_2^2)$ is not a metric space). Our algorithms can be modified to handle other objective functions, such as minimizing the sum over all clusters of the total distance to the best choice for a cluster center.

 Email:[email protected] LRI, CNRS UMR 8623, Universite Paris-Sud, France. y Email: [email protected], Dept. of Computer Science, University of Bonn z Email:[email protected]. LRI, CNRS UMR 8623, Universite Paris-Sud, France. x Computer Science Department, Technion | IIT, Haifa 32000, Israel. Work at the Technion supported by

Israel Science Foundation grant number 386/99, by US-Israel Binational Science Foundation grant number 99-00217, by the European Commission Fifth Framework Programme Thematic Networks contract number IST-2001-32007 (APPOL II), and by the Fund for the Promotion of Research at the Technion. Email: [email protected]

ISSN 1433-8092

1 Introduction

Problem statement and motivation. The partition of a data set into a small number of clusters, each containing a set of seemingly related items, plays an increasingly crucial role in emerging applications such as web search and classification [12, 50], interpretation of experimental data in molecular biology and astrophysics [41, 57, 48], or market segmentation [45]. This task raises several fundamental questions about representing data, measuring affinity, estimating clustering quality, and designing efficient algorithms. For example, when searching or mining massive unstructured data sets, data items are often processed and represented as points in a high dimensional¹ space $\mathbb{R}^d$, where some standard distance function measures affinity (see, for example, [20, 58, 26, 12]). This paper deals with the question of designing good algorithms for an attractive criterion for clustering quality in such a setting. More specifically, we consider a set $V$ of $n$ points endowed with a distance function $\delta : V \times V \to \mathbb{R}$. These points have to be partitioned into a fixed number $k$ of subsets $C_1, C_2, \ldots, C_k$ so as to minimize the cost of the partition, which is defined to be the sum over all clusters of the total pairwise distances in a cluster. We refer to this problem as the Min-Sum All-Pairs k-Clustering problem. Our algorithms deal with the case that $\delta$ is an arbitrary metric (including, in particular, points in $\mathbb{R}^d$ with distances induced by some norm). We also handle the non-metric case of points in $\mathbb{R}^d$ where the distance between two points $x, y$ is measured by $\delta(x,y) = \|x - y\|_2^2$. In the latter case, our algorithms can be modified to deal with other objective functions, including the problem of Min-Sum Median k-Clustering, where the cost of a clustering is the sum over all clusters of the total distances between the cluster points and the best choice for a cluster center. All optimization problems that we consider are NP-hard to solve exactly, even for $k = 2$.

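To make the objective function concrete, here is a small brute-force reference implementation (ours, not the paper's algorithm; it enumerates all $k^n$ labelings and is only usable on tiny inputs):

```python
from itertools import product

def minsum_cost(points, labels, k, dist):
    # cost = sum over clusters of all pairwise distances within the cluster
    total = 0.0
    for c in range(k):
        cluster = [p for p, l in zip(points, labels) if l == c]
        for i in range(len(cluster)):
            for j in range(i + 1, len(cluster)):
                total += dist(cluster[i], cluster[j])
    return total

def brute_force_clustering(points, k, dist):
    # exhaustive search over all k-labelings; exponential, for illustration only
    best = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        c = minsum_cost(points, labels, k, dist)
        if c < best[1]:
            best = labels, c
    return best

if __name__ == "__main__":
    pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    d = lambda a, b: abs(a - b)  # a metric on the line
    labels, cost = brute_force_clustering(pts, 2, d)
    print(labels, cost)
```

On this toy input the optimum separates the two tight groups on the line.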

Our results. For the Min-Sum All-Pairs objective function, we present algorithms for

every $k$ and for every $\epsilon > 0$ that compute a partition into $k$ clusters $C_1, C_2, \ldots, C_k$ of cost at most $1+\epsilon$ times the cost of an optimum partition. In the metric case the algorithm is randomized, and its running time is $O(n^{3k+1} + n^{k+1}\,2^{\tilde O(1/\epsilon^2)})$. In the case of the square of the Euclidean distance, the algorithms are deterministic, and their running time is $n^{O(k/\epsilon^4)}$. Our algorithms can be modified to output, for all $\zeta > 0$, a clustering that excludes at most $\zeta n$ outliers and has cost at most $1+\epsilon$ times the optimum cost. In the case of the square of the Euclidean distance, we can do this in probabilistic time $O(f(k, \epsilon, \zeta)\cdot n\log n)$, where $f$ grows (rapidly) with $k$, $1/\epsilon$, and $1/\zeta$. The Min-Sum Median objective function can be optimized in polynomial time for fixed $k$ in finite metrics, because the number of choices for the centers is polynomial. However, if the points are located in a larger space, such as $\mathbb{R}^d$, and the centers can be picked from this larger space, the problem may become hard. For points in $\mathbb{R}^d$ with distances measured by the square of the Euclidean distance, we give Min-Sum Median algorithms that partition all points into $k$ clusters of cost at most $1+\epsilon$ times the optimum cost in probabilistic time $O(g(k,\epsilon)\cdot n\cdot(\log n)^k)$, where $g$ grows (rapidly) with $k$ and $1/\epsilon$. Some of our ideas can be modified trivially to derive polynomial time approximation schemes for other objective functions, such as minimizing


¹By "high dimensional" we mean that the dimension $d$ should be treated as part of the input and not as a constant.


the maximum radius of a cluster. We do not elaborate on these modifications.

Related work. Schulman [56] initiated the study of approximation algorithms for MinSum All-Pairs k-Clustering. He gave probabilistic algorithms for clustering points in Rd with

distance measured by the square of Euclidean distance. (Thus he also handled other interesting cases of metrics that embed isometrically into this distance space, such as Euclidean metrics or L metrics.) His algorithms nd a clustering such that either its cost is within a factor of 1 +  of the optimum cost, or it can be converted into an optimum clustering by changing the assignment of at most an  fraction of the points. The running time is linear n . Thus our results if d = o(log n= log log n) and otherwise the running time is nO improve and extend Schulman's result, giving a true polynomial time approximation scheme for arbitrary dimension. Earlier, Fernandez de la Vega and Kenyon [24] presented a polynomial time approximation scheme for Metric Max Cut, an objective function that is the complement of Metric Min-Sum All-Pairs 2-clustering. Indyk [35] later used this algorithm to derive a polynomial time approximation scheme for the latter problem. Thus our results extend Indyk's result to the case of arbitrary xed k. Bartal, Charikar, and Raz [11] gave a polynomial time approximation algorithm with polylogarithmic performance guarantees for Metric Min-Sum All-Pairs k-Clustering where k is arbitrary (i.e., part of the input). As mentioned above, instances of Min-Sum Median k-Clustering in nite metrics with xed k are trivially solvable in polynomial time. (For arbitrary k, the problem is APXhard [33] and has elicited much work and progress [8, 16, 37, 15].) This is not the case in geometric settings, including the square of Euclidean distance discussed in this paper. This case was considered by Drineas, Frieze, Kannan, Vempala, and Vinay [25], who gave a 2approximation algorithm. Ostrovsky and Rabani [52] gave a polynomial time approximation scheme for this case and other geometric settings. Our results improve signi cantly the running time for the square of Euclidean distance case. 
Recently and independently of our work, Badoiu, Har-Peled, and Indyk [10] gave a polynomial time approximation scheme for points in Euclidean space with much improved running time (as well as results on other clustering objectives). Their algorithm and analysis are in some respects similar to our algorithm (though it handles a di erent distance function). It is interesting to note that both Schulman's algorithm for Min-Sum All-Pairs Clustering and the algorithm of Fernandez de la Vega and Kenyon for Mertic Max Cut use a similar idea of sampling data points at random from a biased distribution that depends on the pairwise distances. In recent research on clustering problems, sampling has been the core idea in the design of provably good algorithms for various objective functions. Examples include [5, 3, 51]. 1

(log log )

2 Preliminaries In this section we introduce some notation and some tools that will be used to derive and analyze our algorithms. Throughout the paper we use V to denote the input set of points and  to denote the 2

distance function over pairs of points in V . The function  can be given explicitly or implicitly (for example, if V  Rd and  is derived from a norm on Rd). Our time bounds count arithmetic operations and assume that computing (x; y) is a single operation. The reader may assume that the input is rational to avoid having to deal with unrealistic computational models. We use k, a xed constant, to denote the desired number of clusters. We omit   the ceiling notation from expressions such as  . Our claims and proofs can be modi ed trivially to account for taking the ceiling of non-integers wherever needed. and x 2 V . With a slight abuse of notation, we use (x; Y ) to denote P Let(X;x; yY),andV we P use (X; Y ) to denote x2X (x; Y ) (notice that (; ) is a symmetric y2Y bilinear form but is not a distance in the power set of V ). We use (X ) to denote (X; X ). We put W = (V ) and wx = (x; V ). Finally, we denote the diameter of X by diam(X ) = maxx;y2X (x; y). Let C ; C ; : : : ; Ck be a partition of V into k disjoint clusters. Then, for all i = 1; 2; : : : ; k, we use cost(Ci) to denote the cost of Ci. For most of the paper, we are concerned with the all-pairs cost of a cluster, putting cost(Ci) = (Ci ). In some cases, our algorithms can be modi ed to apply to hard cases of the median cost of a cluster, putting cost(Ci) = min cases, the cost of the clustering is c = cost(C ; C ; : : :; Ck ) = Rd f (x; Ci)g. In both Pk x2cost(  ; : : :; C  to denote a clustering of V of minimum cost c = C ). We use C ; C i k i cost(C ; C ; : : : ; Ck). Our polynomial time approximation schemes handle the case where  induces an arbitrary metric on V , as well as the non-metric case of V  Rd and (x; y) = kx ? yk . The former case obviously includes instances where V  Rd and (x; y) = kx ? ykp for p 2 [1; 1) or p = 1. Instances of points in Rd are computationally hard if d is part of the input. 1

1

2

1 2

1

1

=1

1

2

2

2

2 2

2.1

Properties of Metric Spaces

The main property of metrics that we use is the following proposition, which follows easily from the triangle inequality. Proposition 1. Let X; Y; Z  V . Then,

jZ j(X; Y )  jX j(Y; Z ) + jY j(Z; X ): Proof: For every x; y; z, we have (x; y)  (y; z) + (z; y). Summing over X  Y  Z gives

the desired result. Here are some corollaries which are used in our proofs in metric space. Corollary 2. diam(V )  2W=n. Proof: Let x; y be such that diam(V ) = (x; y), and apply Proposition 1 to X = fxg, Y = fyg, and Z = V .

Corollary 3. Let C  V . For every vertex v 2 C we have (v; C )  2(jCC )j : 3

Proof: Apply Proposition 1 to X = C , Y = C and Z = fvg.

Our approximation scheme for min-sum all-pairs clustering in metric spaces uses as a tool an approximation scheme for Metric Max-k-Cut. De nition: The Metric Max-k-Cut problem takes as input a set V of n points from an arbitrary metric space, and outputs a partition of V into k clusters C ; C ; : : :; Ck so as to maximize total distance between pairs of points in di erent clusters, i.e. 1

max .

k? X k X 1

i=1 j =i+1

2

(Ci; Cj ):

For any partition, the sum of the Max-k-Cut value and of the min-sum all-pairs clustering value equals W . Thus the same partition is optimal for both objectives.

Theorem 4 ([24, 23]). There is a polynomial time approximation scheme for Metric Maxk-Cut. Theorem 4 is actually an easy extension of the MaxCut approximation scheme of [24]. The same reduction which is used for MaxCut also applies to Max-k-Cut, and the resulting weighted dense graph is only a variant of dense graphs in the usual sense, so that the Maxk-Cut approximation schemes for dense graphs (see [32, 7]) apply. An alternate algorithm can be found in [23]. 2

2.2

Properties of k  k2

Unless otherwise speci ed, all subsets and multi-subsets of Rd that we discuss are, for simplicity, nite. For a nite set X  Rd we denote by conv(X ) the convexPhull of X , i.e., conv(X ) = fy 2 Rd j 9 2 RjXj such that  0 P and k k = 1 and y = x2X xxg. We associate with every y in conv(X ) such that y = x2X xx with rational coecients , a multi-subset Y of X as follows. For every x 2 X , the number nx of copies of x in Y is de ned by x = nx=jY j, where nx is the number of times x appears in Y . We often use Y to denote the center of gravity of Y . The following proposition characterizes the all-pairs cost of a cluster for the case that (x; y) = kx ? yk . Proposition 5. For every cluster C  V , cost(C ) = jC j(C; C ). 1

2 2

Proof:

jC j  (C; C ) = jC j 

X x2C

X

!

X X x ? jC1 j y  x ? jC1 j y y2C y2C

! !

XX X = jC j  kxk + jC1j y  z ? jC2 j x  y by bilinearity x2C y2C z2C y2C XX X 1 x  y by renaming and grouping = jC j  kxk ? jC j x2C y2C x2C 2 2

2

2 2

4

XX?  = 21 kxk + kyk ? 2x  y by renaming x2C y2C X X = 21 kx ? yk x2C y2C = cost(C ): The following simple propositions will come in handy. Proposition 6. For every multi-subset Y of Rd, the center of gravity of Y is such that Y = arg minz2Rd f(Y; z)g. Proof: point that minimizes the above expression. As (Y; z) = Pd P Let(y z?2z )R, dwebecanthedetermine z by minimizing each coordinate separately. We have i y2Y i i P 2 P yi . P @ y2Y yi ?zi ( y ? z ). The right hand side has a single zero at z = = ? 2 i i i y2Y y 2 Y @z j Y j i @ 2 Py2Y yi ?zi 2 = 2jY j > 0, this point is the unique global minimum. As @zi2 2 2

2 2

2 2

2

=1

(

)

(

1

)

p

Proposition 7. For every x; y; z 2 Rd, (x; z)  (x; y) + (y; z) + 2 (x; y)  (y; z). p p p Proof: By the triangle inequality for Euclidean distance, (x; z)  (x; y) + (y; z). Squaring this inequality gives the desired result. Proposition 8. For every x 2 Rd, for every multi-subset Y of Rd, (x; Y )  jY j (x; Y ). 1

Proof:

(x; Y ) =



X

x ? 1 y

jY j y2Y

X

1 (x ? y)



jY j y2Y d X 1 X 2

2

=

2

!

2

=



i=1 d

X i=1

jY j y2Y (xi ? yi)

1 X(x ? y ) jY j y2Y i i

2

2

(1)

d XX 1 (xi ? yi) = jY j i y2Y X 1 = jY j (x; y); y2Y

2

=1

where (1) follows from the Cauchy-Schwarz inequality. The following lemma is attributed to Maurey [53, 14, 6]. We provide a proof for completeness. 5

Lemma 9 (Maurey). For every positive integer d, for every Y  Rd, for every  > 0, and for every x 2 conv(Y ), there exists a multi-subset Z of Y containing jZ j =  points such that (x; Z )    (diam(Y )). Proof: Put t =  . As x 2 conv(Y ), it can be expressed as a convex combination x = P 1

1

y2Y y y , where the coecients y are non-negative reals that sum up to 1. Pick a multiset = fz1; z2; : : :; ztg at random, where the zi-s are independent, identically distributed,

Z random points with Pr [zi = y] = y . Now,

  E (x; Z)

=

2

3 t

X E 4

x ? 1t zi

5 3 2

t i

X ? i 5 x ? z

E 4

1t i " X # t X t ?  ?  1 E x ? zi  x ? zj 2

=1

=

2

2

=1

=

t

2

2

i=1 j =1

X   X ?  ?  = t1 E kx ? zik + E x ? zi  x ? zj i j6 i t

!

(2)

2 2

2

=1

=

Xt   1 = t E kx ? zik (3) i 1  t diam(Y ); where (2) follows from the linearity of expectation, and (3) follows from the fact that for every  ? P d i j i j i i 6= j , z and z are independent, so E [(x ? z )  (x ? z )] = l E [(xl ? zl )] E xl ? zlj =  0. As E (x; Z)  t diam(Y ), there exists a choice of Z such that (x; Z)  t diam(Y ). 2 2

2

=1

=1

1

1

Lemma 9 can be used to derive a high-probability argument as follows. Lemma 10. There exists a universal constant  such that for every integer d, for every Y  Rd, for every  > 0, and for every  > 0, a multi-subset Z of Y that is generated by taking a sample of   2 log  independent, uniformly distributed, points from Y satis es  Pr (Y ; Z ) >   diam(Y ) < . Proof: Put s =     log(1=) and P t =  . Consider Z as s samples Z ; Z ; : : :; Zs of size t s  (Y ; Z ). Therefore, Pr  (Y ; Z ) >   diam(Y )    each. By Proposition 8,  ( Y ; Z )   i P s i Pr si (Y ; Zi) > is  diam(Y ) . Put i = (Y ; Zi)=diam(Y ) for all i = 1; 2; : : : s. The i are independent, identically distributed, random variables taking values in the range [0; 1]. By Lemma 9,? E [i]   for all i. Using standard Cherno bounds we get that P Pr [ si i > s] < e s= . Putting  = 4= log(4=e), the right hand side is equal to . 1

1

1

2

2

1

1

=1

=1

2

=1

1 2

4

6

2

3 A PTAS for Metric Instances In this section we present our algorithm for clustering metric spaces. We rst describe a streamlined version of Indyk's algorithm [35] that solves the case of k = 2. It will help to motivate our approximation scheme for arbitrary xed k. Let (L; R) denote an optimal partition into 2 clusters. Run the following three algorithms, constructing three partitions into 2 clusters. Output the best of the three partitions. 1. First algorithm: Use the metric MaxCut approximation scheme of de la Vega and Kenyon with relative error  . 2. Balanced clusters algorithm: By exhaustive search, guess jLj 2 (n; n] and jRj = n?jLj. Repeat O(1) times the following. Pick a random element ` 2 V uniformly at random, and a random element r 2 V uniformly at random. For each vertex v 2 V , let ^(v; L) = jLj  (v; `) and ^(v; R) = jRj  (v; r). Construct a partition (L0; R0) of V by placing v in L0 if ^(v; L)  ^(v; R), and placing v in R0 otherwise. 3. Unbalanced clusters algorithm: By exhaustive search, guess jLj 2 (0; n] and jRj = n ? jLj. Repeat O(1) times the following. Pick a random sample r 2 R uniformly at random. For each vertex v 2 V , let ^(v; R) = jRj  (v; r). Construct a partition (L0; R0) of V by placing in L0 the jLj vertices of V with largest value of ^(v; R). We now present our approximation scheme for arbitrary xed k. De nition: Given  > 0, two disjoint sets of points A and B are said to be well-separated if (A) + (B ) < k (A [ B ). Our algorithm consists of taking the best of all partitions that are generated as follows. 1. By exhaustive search, guess the optimal cluster sizes jC j  jC j    jCk j. Let i be the largest i such that jCij > jCi? j for i = 2; 3; : : : ; i . Clusters C through Ci0 are called large clusters, and the others are called small clusters. By exhaustive search, for each pair of large clusters Ci and Cj , guess whether clusters Ci and Cj are well-separated. 
De ne groups of large clusters by taking the transitive closure of the relation \Ci and Cj are not well-separated". 2. Choose, uniformly at random, an element ci in each large cluster Ci. (i.e. take i points uniformly at random, and with constant probability the ith element will be in Ci). For each point x and for each large cluster Ci, de ne ^(x; Ci) = jCij(x; ci). 3. For each x, consider the large cluster Ci which minimizes ^(x; Ci). Place x in Ci's group and de ne its contribution to the group as f (x) = ^(x; Ci). This de nes a partition of V into groups. 4. By exhaustive search, for each group G thus constructed and for each small cluster Cj , guess jG \ Cij, and remove from G the jG \ Cij elements with largest contribution f (x). Recursively partition the removed elements into (k ? i ) clusters. 5. Partition each group of h large clusters with h > 1 using Max-h-Cut with error parameter 0 =  k =h . 3

+1

1

0

2

1

0

1

0

0

3 +2

2

7

4 Analysis of the Metric Algorithm

Lemma 11. Let C  V and r 2 C be such that (r; C )  2(C )=jC j. Let ^(x; C ) = jC j  (x; r), for x 2 V . Then j(x; C ) ? ^(x; C )j  2(C )=jC j. Proof: Apply Proposition 1 to X = fxg, Y = C and Z = frg, and to X = fxg, Y = frg and Z = jC j. The following lemma is useful for analyzing balanced well-separated clusters. Lemma 12. Consider two sets of points R and L which are both of size at least jL [ Rj, and such that (R) + (L) <  (R [ L). Let r be such that (r; R)  2(R)=jRj and similarly ` be such that (`; L)  2(L)=jLj. For any x, de ne ^(x; R) = jRj  (x; r) and ^(x; L) = jLj  (x; `). Let F = fx 2 Rj^(x; L)  ^(x; R)g. Then,  jF j = O( )jR [ Lj; moreover, if (R) + (L) < c(R [ L), then jF j = O(c)jR [ Lj.  (F )  O()(R), and  (L; F ) ? (R; F )  O()((R) + (L)). Proof: If x 2 F then ^(x; L) ? ^(x; R)  0. Thus any point x in F must verify: 2

2

(x; L) ? (x; R) = (x; L) ? ^(x; L) + ^(x; L) ? ^(x; R) + ^(x; R) ? (x; R)  2(R)=jRj + 2(L)=jLj  2((RjR) [+ L(jL)) ;

where the rst inequality comes from Lemma 11 and the second one follows from jRj; jLj  jR [ Lj. We bound jF j as follows. X jF j 2(jRR [[ LL)j  (x; R [ L) from Corollary 3 applied to x in R [ L x2F X (2(x; R) + ((x; L) ? (x; R))) = F

 2(F; R) + jF j 2((RjR) +[ L(jL)) from Equation 4

 2(R) + jF j 2((RjR) +[ L(jL))  2c (R [ L) + 2jF j  jR(R[[LLj ) : Thus jF j = O( )jR [ Lj, which proves the rst statement of the Lemma. 2

2

8

Applying Proposition 1 to X = Y = F and Z = R, we get (F )  2 jjFRjj (F; R)  O()(R) since jF j = O( )jR [ Lj and jRj  jR [ Lj. This proves the second statement of the Lemma. Finally, summing Equation (4) over every x 2 F gives (L; F ) ? (R; F )  2 jRjE[ij jLj ((R) + (L))  O()((R) + (L)) since jF j  O( )jR [ Lj. This proves the last statement of the Lemma. The following lemma is useful to the analysis of unbalanced clusters. Lemma 13. Consider two sets of points R and L such that jLj < jRj and such that (R) + (L) <  (R [ L). Let r 2 R be such that (r; R)  2(R)=jRj. For x 2 R [ L, let ^(x; R) = jRj  (x; r). Let Ci0 denote the jRj points of R [ L with largest value of ^(:; R), and Cj0 = R [ L n Ci0. Let F = R \ Cj0 = fv ; : : :; vmg and E = L \ Ci0 = fv0 ; : : : ; vm0 g. Then,  (R; E ) ? (R; F ) = O()(R).  Pmp (vp; vp0 )  O(1)(R)=jRj,  j(L; F ) ? (L; E )j  O()(R),  (F )  O()(R), and  (E )  O()(R). Proof: We pair up vertex vp with vertex vp0 . (vp; R) ? (vp0 ; R) = ((vp; R) ? ^(vp; R)) + (^(vp; R) ? ^(vp0 ; R)) + (^(vp0 ; R) ? (vp0 ; R)): 2

2

2

1

1

=1

>From Lemma 11 we have (vp; R)?^(vp; R)  2(R)=jRj and ^(vp0 ; R)?(vp0 ; R)  2(R)=jRj. By de nition, the elements of Ci0 (and hence of E ) all have larger value of ^(:; R) than the elements of Cj0 (and hence of F ). In particular, ^(vp; R) ? ^(vp0 ; R)  0. Together, this implies that (vp; R) ? (vp0 ; R)  4(R)=jRj. Summing over p, we get

F j (R) (E; R) ? (F; R)  4 jjR j j (R) = 4 jjE Rj  4 jjRLjj (R) = O()(R); hence the rst statement of the Lemma. 9

Applying Proposition 1 to vp, vp0 and R and summing over p, we get:

jRj

X p

(vp; vp0 )  (F; R) + (E; R) = (E; R) ? (F; R)) + 2(F; R)  O()(R) + 2(R) = O(1)(R);

hence the second statement of the Lemma. Applying Proposition 1 to vp, vp0 and L and to vp0 , vp and L, we get

j(vp; L) ? (vp0 ; L)j  jLj  (vp; vp0 ): Summing over p, we get

j(L; F ) ? (L; E )j  jjRLjj O(1)(R) = O()(R);

hence the third statement of the Lemma. Applying Proposition 1 to F , F and R, we get (F )  2(F;jRRj )jF j  2(jRR)jjLj  2(R); hence the fourth statement of the Lemma. Now, write (vp0 ; vq0 )  (vp0 ; vp) + (vp; vq) + (vq; vq0 ). When we sum over p and q, we obtain

(E )  2

X p

(vp; vp0 )jE j + (F )

 2jLjO(1) j(RRj) + O()(R) = O()(R);

hence the last statement of the Lemma. Now, let us analyze the 2-clustering algorithm. Case 1: Assume that c   W . Then the MaxCut algorithm with error  produces a partition whose Cut value is at least OPT-Max-Cut(1 ?  )  OPT-Max-Cut ?  W . The 2-cluster value of this partition is thus at most W ? OPT-Max-Cut+  W , which is c +  W , hence at most (1 + )  c. Case 2: Assume that c <  W and that the optimal partition (L; R) is such that jLj; jRj  n. We analyze the Balanced Clusters algorithm. With probability at least =2, the algorithm has picked ` 2 L and r 2 R. For ` picked uniformly at random in L, we have on average E ((`; L)) = (L)=jLj. By Markov's inequality, with probability at least 1=2, it holds that (`; L)  2(L)=jLj. Similarly, with probability 2

3

3

3

3

2

10

3

at least 1=2, it holds that (r; R)  2(R)=jRj. Moreover, the two events are independent. Thus, with probability at least (1 ? )=4, we have:

` 2 L; (`; L)  2(L)=jLj; r 2 R; and (r; R)  2(R)=jRj: We assume that ` and r satisfy these properties and that jLj and jRj have been guessed correctly. Let L0 = L + F ? E and R0 = R + E ? F . Then, (L0) + (R0) ? (L) ? (R) = (L + F ? E; L + F ? E ) + (R + E ? F; R + E ? F ) ? (L; L) ? (R; R) = 2((L; F ) ? (R; F )) + 2((R; E ) ? (L; E )) + 2(E ) + 2(F ) ? 4(E; F ) = O()c; by Lemma 12. Case 3: assume that c <  W and that the optimal partition (L; R) is such that jLj < n. Then jLj < =(1 ? )jRj. We analyze the Unbalanced Clusters algorithm. With probability at least (1 ? )=2, we have r 2 R and (r; R)  2(R)=jRj. We assume that this holds and that jLj has been guessed correctly. Let E = L \ R0 and F = R \ L0 . The di erence between the value of the cut constructed by the algorithm and the value of the optimal cut is 2

(L + F ? E ) + (R + E ? F ) ? (L) ? (R) = 2((L; F ) ? (L; E )) + 2((R; E ) ? (R; F )) + 2(E ) + 2(F ) ? 4(E; F ) = O()(R); by Lemma 13. Thus in all cases, one of the algorithms will output a near-optimal solution. This concludes the analysis of 2-clustering. We now proceed with the analysis of the k-clustering algorithm. We rst analyze the mistakes made in step 3. For that, we focus on the large clusters. Consider two large clusters Ci and Cj which belong to di erent groups. let Eij be the set of element of Ci which are mistakenly classi ed as belonging to Cj . Consider the intermediate k-cluster such that C i 0 Ci = Ci ? [ E + [ E ifif ii > i : 0

We have:

i

X i

(Ci0) ?

 2

X

i

j ji

0

(Ci)

((Ci; Eji ) ? (Cj ; Eji)) +

X

i;j

+2

X

j ij

i;j;j 0

(Eji; Ej0 i) + 2 11

X

i;j;j 0

X i;j

(Eij )

(Eij ; Eij0 ):

The rst sum has only O(k ) terms, which are all small (i.e. O()c) by Lemma 12. The second sum also has only O(k ) terms, which are also all small by Lemma 12. The third sum has only O(k ) terms. Consider one of them. Applying Proposition 1 to X = Eji , Y = Ej0i and Z = Ci, we get jCij  (Eji; Ej0i)  jEjij(Eji; Ci) + jEj0 ij(Ej0i; Ci). We analyze the rst of the two terms of this sum (by symmetry, our analysis will also hold for the second term of the sum). We have: jEjij (E ; C ) jCij ji i  jjECjijj ((Eji; Cj ) + (Eji; Ci) ? (Eji; Cj )) i j E  jCjijj ((Cj ) + O()((Ci) + (Cj ))) by Lemma 12 i k )jCi [ Cj j O (  ) = O ( c jCij  = O()c ; 2

2

3

+1

where the previous-to-last equality follows from the de nition of well-separated clusters, and the last equality follows from the de nition of large clusters, which implies jCij  k n. The last sum is analyzed similarly:

(Eij ; Eij0 )  jjECijjj (Ci; Eij0 ) + jEjCijj0 j (Ci; Eij ) i i  jEij jjC+ jjEij0 j (Ci) i  O()(Ci): Thus the partition (Ci0) is a near-optimal k-clustering:

X i

(Ci0)  (1 + O(k ))c: 3

Unfortunately some mistakes are made in step 4 as well. We now need to bound the e ect of those mistakes. For each large cluster Ci and each small cluster Cj , let Fij denote the points of Ci0 which mistakenly go into Cj , and Fji denote the points of Cj which mistakenly go into Ci's group. By the guess made in step 4, we have jFij j = jFjij, and so we can pair up the vertices as in the analysis of the Unbalanced clustering algorithm. Let  C 0 + P Fji ? P Fij if i  i 00 Ci = Ci0 + Pj>i0 F ? Pj>i0 F if i > i : i j i0 ji j i0 ij 0

0

X i

(Ci00) ?

X i

(Ci0) 12

= =

X

X

(Ci0 +

Fji ?

Xi X

X

Fij ; Ci0 +

X

j j j 0 0 ((Ci ; Fji) ? (Ci; Fij )) +

j Xi X

XX i

j

i j;j 0

(Fji) +

XX i

j

Fji ?

X j

Fij ) ? (Ci0; Ci0)

(Fij ) +

((Fji; Fj0i) ? (Fji; Fij0 )) +

XX i

j;j 0

((Fij ; Fij0 ) ? (Fji; Fij0 )):

Remember that Fab is non-empty only if a refers to a small cluster and b to a large cluster, or if a refers to a large cluster and b to a small cluster. The rst term has O(k ) terms which are all small by Lemma 13. The next two terms also have O(k ) terms which are also all small by Lemma 13. For the next term, remembering that Fj0i is paired up with Fij0 and using (x; y) ? (x; y0)  (y; y0), we get 2

2

(Fji; Fj0i) ? (Fji; Fij0 )  jFjij

X

y;y0 )

(

pair of Fj0 iFij0

(y; y0):

If Ci is large and Cj ; Cj0 are small, then by Lemma 13 this is bounded by jCj jO(1)(Ci)=jCij, which is O()(Ci) because of the gap between sizes of large and small clusters. If Ci is small and Cj ; Cj0 are large, then by Lemma 13 this is bounded by jCijO(1)(Cj0 )=jCj0 j, which is O()(Cj0 ). Thus in all cases, this term, like the previous terms, is O()c. The last term can be dealt with similarly. Thus the partition (Ci00) is a near-optimal k-clustering:

X

(Ci00) 

i

X i

(Ci0) + O(k )c  (1 + O(k ))c: 3

3

Finally, we need to analyze the use of Max-h-Cut in the last step of the algorithm; we will present the analysis as if the group was perfect, i.e. consisted of the clusters Ci. (It is easy to see that the proof also goes through when replacing the Ci by Ci00, at the cost of some bookkeeping of the small errors introduced at every step of the calculation.) In the groups of large clusters, the clusters are not well-separated. From this, we can deduce that c is (W ) as follows. Consider a group C [ C [  [ Ch. We have: 1

2

(C [  [ Ch) = 1

X i

(Ci) +

X i6=j

(Ci; Cj ):

For i 6= j , by de nition of group, there exists a sequence of length m  h,

Ci = Ci0 ; Ci1 ; : : :; Cim = Cj ; 13

(4)

such that two consecutive clusters in that sequence are not well separated. Writing

(xi0 ; xi1 )  (xi0 ; xi1 ) + (xi1 ; xi2 ) +  + (xim?1 ; xim ) and summing over Ci0    Cim , we get (Ci0 ; Cim )  (Ci0 ; Ci1 ) + (Ci1 ; Ci2 ) +  + (Cim?1 ; Cim ) : jCi0 j  jCim j jCi0 j  jCi1 j jCi1 j  jCi2 j jCim?1 j  jCim j Since the size of any two large clusters di er by a factor of k at most, we deduce (Ci; Cj )  1k ((Ci0 ; Ci1 ) +  + (Cim?1 ; Cim )): By de nition of well-separated clusters, we then obtain (Ci; Cj )   k1 (((Ci0 ) + (Ci1 )) +  + ((Cim?1 ) + (Cim ))   k2 c: Plugging this into Equation (4) yields (5) (C [  [ Ch )  (1 + 2h(hk ? 1) )c: Now, doing Max-h-Cut on C [[Ch with error parameter ( k =h ) will yield a partition whose cut value is within an additive ( k =h )(C [[ Ch) of optimal. Hence the value of the clustering will be o by ( k =h )(C [  [ Ch) = ()c by Equation (5). The algorithm then recursively nds a clustering of the removed elements. There are at most k levels of recursion, each inducing a mistake of order 1 + O(k ), for a total relative error of O(k ). Now, let us turn to the running time of the algorithm. The exhaustive search of the rst step takes time O(nk 2k ). Sampling and computing ^ in the second step takes time O(n + k) = O(n). The minimization in the third step takes time O(nk). The fourth step takes time O(nk ), excluding the recursive call. The nal step uses Max-k-cut, which is a randomized algorithm and takes time O(n + nk2O =03 ) (in the version inspired from [32]). Overall, running the algorithm for 0 = (=k ) k =k , the algorithm thus becomes a (1 + O())approximation and has running time 2

3 +1

3 +1

1

3 +1

3 +1

1

3 +1

3 +1

2

2

2

1

1

3

4

~ (1

2

4 3 +1

)

2

O(nk 2k (n + nk + nk + n + nk2O =03 )  k = O(k2k n k + k 2k nk 2O =03 ): The above discussion proves the following theorem. Theorem 14. For every xed positive integer k and for every  > 0 there exists an algorithm for Metric Min-Sum All-Pairs k-clustering that computes a solution of cost within a 3k+1 k k O = factor of 1 +  of the optimum cost in time O(n + n 2 ). 2

~ (1

)

2

2

14

+1

~ (1

2

)

+1

~ (1

)

5 The Basic Algorithm for Squared Euclidean Distance

In this section we consider a finite input set $V \subset \mathbb{R}^d$ and distance function $\rho(x,y) = \|x-y\|_2^2$. We give, for every $\epsilon > 0$, an $n^{O(k/\epsilon^4)}$ time algorithm that produces a partition of the input set into $k$ clusters with cost within a factor of $1+\epsilon$ of the cost of an optimum partition. Our algorithm can be modified to solve the min-sum median case. We indicate the changes needed at the end of the section. Throughout, for a finite multiset $S$ of points we denote by $\bar S = \frac{1}{|S|}\sum_{y\in S} y$ its centroid. We first present the algorithm, and then proceed to motivate and analyze it.

1. By exhaustive search, guess the optimal cluster sizes $|C^*_i| = n_i$, $n_1 + n_2 + \cdots + n_k = n$. By exhaustive search, for each $i = 1,\dots,k$, consider all possible multisets $A_i$ containing $(16/\epsilon)^4$ points.

2. Consider the following weighted complete $n \times n$ bipartite graph $G$. The left side has $n$ vertices, of which $n_i$ are labelled $A_i$, and the right side has $n$ vertices which correspond to the points of $V$. The edge between a vertex labelled $A_i$ and a vertex $x$ of $V$ has weight $\hat\rho(x, C_i) = n_i\cdot\rho(x,\bar A_i)$.

3. Compute a minimum cost perfect matching in the graph $G$. This defines the following clustering $C_1, C_2, \dots, C_k$: $C_i$ is the set of points matched to the copies of $A_i$.

4. Output the best such clustering over all choices of $A = (A_1,\dots,A_k)$ and $N = (n_1,\dots,n_k)$.

Our algorithm is motivated by the following bound.

Lemma 15. Let $Y$ be any multi-subset of $V$. Then, for every $\epsilon$ such that $0 < \epsilon \le 1$, there exists a multi-subset $Z$ of $Y$ of size $|Z| = (16/\epsilon)^4$ such that
$$\sum_{x\in Y}\rho(x,\bar Z) - \sum_{x\in Y}\rho(x,\bar Y) \le \epsilon\cdot\sum_{x\in Y}\rho(x,\bar Y).$$
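As an aside, the matching step (steps 2 and 3) for a single guess $(A, N)$ can be sketched with an off-the-shelf assignment solver. This is an illustrative sketch under assumptions, not the paper's implementation; in particular it takes $\rho(x,\bar A_i)$ to be the squared Euclidean distance from $x$ to the centroid of the multiset $A_i$.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_for_guess(V, reps, sizes):
    """Cluster V for one guessed pair (A, N).

    V     : (n, d) array of input points.
    reps  : list of k arrays; reps[i] is the candidate multiset A_i.
    sizes : guessed cluster sizes n_1, ..., n_k with sum(sizes) == len(V).

    Builds the n x n bipartite cost matrix with n_i copies of label i
    on the left, edge weight n_i * rho(x, centroid(A_i)), and solves a
    minimum cost perfect matching.
    """
    V = np.asarray(V, dtype=float)
    n = len(V)
    assert sum(sizes) == n
    centroids = np.array([np.asarray(A, dtype=float).mean(axis=0) for A in reps])
    # rho[i, j] = squared distance from point j to the centroid of A_i
    rho = ((V[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=-1)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # n_i copies of label i
    cost = np.asarray(sizes, dtype=float)[labels][:, None] * rho[labels, :]
    row, col = linear_sum_assignment(cost)             # min-cost perfect matching
    assign = np.empty(n, dtype=int)
    assign[col] = labels[row]
    return assign
```

The outer exhaustive search of step 1 would call this routine once per pair $(A, N)$; the point of Lemma 15 is that some small candidate multiset $A_i$ already certifies a $(1+\epsilon)$-approximate clustering.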

Proof: Let $\mu = \frac{1}{|Y|}\sum_{x\in Y}\rho(x,\bar Y)$ denote the average distance between a point $x \in Y$ and the centroid $\bar Y$. Let $Y_c = \{x \in Y \mid \rho(x,\bar Y) \le 64\mu/\epsilon^2\}$. By Proposition 7, $\mathrm{diam}(Y_c) \le 2^2\cdot 64\mu/\epsilon^2 = 256\mu/\epsilon^2$. By Lemma 9, there exists a multi-subset $Z$ of $Y_c$ such that $|Z| = (16/\epsilon)^4$ and $\rho(\bar Z, \bar Y_c) \le (\epsilon/16)^4\cdot\mathrm{diam}(Y_c) \le \epsilon^2\mu/256$. We complete the proof by proving the following claim.

Claim 16. If $Z$ is a multiset such that $\rho(\bar Z, \bar Y_c) \le \epsilon^2\mu/256$, then
$$\sum_{x\in Y}\rho(x,\bar Z) - \sum_{x\in Y}\rho(x,\bar Y) \le \epsilon\cdot\sum_{x\in Y}\rho(x,\bar Y).$$

Proof: We want to bound
$$\sum_{x\in Y}\rho(x,\bar Z) - \sum_{x\in Y}\rho(x,\bar Y_c) = \sum_{x\in Y}\bigl(\rho(x,\bar Z) - \rho(x,\bar Y_c)\bigr).$$

²The constant $16^4 = 65536$ was chosen to simplify our calculations. It can be improved significantly.

We bound each term of the right-hand side separately using Proposition 7, which gives
$$\rho(x,\bar Z) - \rho(x,\bar Y_c) \le \rho(\bar Z,\bar Y_c) + 2\sqrt{\rho(\bar Z,\bar Y_c)\cdot\rho(x,\bar Y_c)}.$$
Let $Y_1 = \{x\in Y \mid \rho(x,\bar Y_c)\le\mu\}$. If $x\in Y_1$, then
$$\rho(x,\bar Z) - \rho(x,\bar Y_c) \le \Bigl(\frac{1}{8}\epsilon + \frac{1}{256}\epsilon^2\Bigr)\mu. \qquad (6)$$
If $x\in Y\setminus Y_1$, then $\rho(\bar Z,\bar Y_c)\le\epsilon^2\mu/256 < \epsilon^2\rho(x,\bar Y_c)/256$. Therefore,
$$\rho(x,\bar Z) - \rho(x,\bar Y_c) < \Bigl(\frac{1}{8}\epsilon + \frac{1}{256}\epsilon^2\Bigr)\rho(x,\bar Y_c). \qquad (7)$$
By Proposition 6, $\sum_{x\in Y_c}\rho(x,\bar Y_c)\le\sum_{x\in Y_c}\rho(x,\bar Y)$. By Proposition 8,
$$\rho(\bar Y,\bar Y_c) \le \frac{1}{|Y_c|}\sum_{y\in Y_c}\bigl\|\bar Y - y\bigr\|_2^2 \le \mu, \qquad (8)$$
where the last inequality in (8) follows from the definition of $Y_c$. If $x\in Y\setminus Y_c$, then $\rho(x,\bar Y) > 64\mu/\epsilon^2$. Therefore, using Proposition 7 and (8) we get:
$$\rho(x,\bar Y_c) \le \rho(x,\bar Y) + \rho(\bar Y,\bar Y_c) + 2\sqrt{\rho(x,\bar Y)\cdot\rho(\bar Y,\bar Y_c)} < \Bigl(1 + \frac{1}{4}\epsilon + \frac{1}{64}\epsilon^2\Bigr)\rho(x,\bar Y). \qquad (9)$$
Combining the bounds in (6), (7), and (9), we get
$$\begin{aligned}
\sum_{x\in Y}\rho(x,\bar Z) &= \sum_{x\in Y_1}\rho(x,\bar Z) + \sum_{x\in Y\setminus Y_1}\rho(x,\bar Z)\\
&\le \sum_{x\in Y_1}\Bigl(\rho(x,\bar Y_c) + \Bigl(\frac{1}{8}\epsilon + \frac{1}{256}\epsilon^2\Bigr)\mu\Bigr) + \Bigl(1+\frac{1}{8}\epsilon+\frac{1}{256}\epsilon^2\Bigr)\sum_{x\in Y\setminus Y_1}\rho(x,\bar Y_c)\\
&\le \Bigl(1+\frac{1}{4}\epsilon+\frac{1}{64}\epsilon^2\Bigr)\Bigl(1+\frac{1}{8}\epsilon+\frac{1}{256}\epsilon^2\Bigr)\sum_{x\in Y}\rho(x,\bar Y) + \Bigl(\frac{1}{8}\epsilon + \frac{1}{256}\epsilon^2\Bigr)\mu|Y|\\
&\le \Bigl(1+\frac{1}{2}\epsilon+\frac{7}{128}\epsilon^2+\frac{3}{1024}\epsilon^3+\frac{1}{16384}\epsilon^4\Bigr)\sum_{x\in Y}\rho(x,\bar Y)\\
&\le (1+\epsilon)\sum_{x\in Y}\rho(x,\bar Y).
\end{aligned}$$
On the other hand, by Proposition 6, $\sum_{x\in Y}\rho(x,\bar Z)\ge\sum_{x\in Y}\rho(x,\bar Y)$. This completes the proof of Claim 16 and of Lemma 15.

We are now ready for the analysis of our algorithm.

Theorem 17. The above algorithm computes a solution whose cost is within a factor of $1+\epsilon$ of the optimum cost in time $n^{O(k/\epsilon^4)}$.

Proof: By Lemma 15, for every $i = 1,2,\dots,k$, there exists a multi-subset $Z_i$ of $C^*_i$ of size $|Z_i| = (16/\epsilon)^4$ such that
$$\sum_{x\in C^*_i}\rho(x,\bar Z_i) - \sum_{x\in C^*_i}\rho(x,\bar C^*_i) \le \epsilon\cdot\sum_{x\in C^*_i}\rho(x,\bar C^*_i).$$
Consider the iteration of the algorithm where $A_i = Z_i$ and $n_i = |C^*_i|$ for every $i = 1,2,\dots,k$. Let $C_i$ be the set of points matched to the nodes marked $A_i$ in this iteration, for all $i = 1,2,\dots,k$. (Recall that for squared Euclidean distances $\sum_{x,y\in C}\rho(x,y) = 2|C|\sum_{x\in C}\rho(x,\bar C)$, so the cost of a clustering can be expressed through distances to cluster centroids.) Then,
$$\begin{aligned}
\mathrm{cost}(C_1,C_2,\dots,C_k) &= 2\sum_{i=1}^k |C_i|\sum_{x\in C_i}\rho(x,\bar C_i)\\
&\le 2\sum_{i=1}^k n_i\sum_{x\in C_i}\rho(x,\bar A_i)\\
&\le 2\sum_{i=1}^k n_i\sum_{x\in C^*_i}\rho(x,\bar A_i)\\
&\le (1+\epsilon)\cdot 2\sum_{i=1}^k |C^*_i|\sum_{x\in C^*_i}\rho(x,\bar C^*_i)\\
&= (1+\epsilon)\cdot\mathrm{cost}(C^*_1,C^*_2,\dots,C^*_k).
\end{aligned}$$
(The second inequality holds because the matching computed by the algorithm is optimal, and matching the points of $C^*_i$ to the copies of $A_i$ is a feasible perfect matching.) The performance guarantee follows because the algorithm finds a partition whose cost is at least as good as $\mathrm{cost}(C_1,C_2,\dots,C_k)$.

As for the running time of the algorithm, there are less than $n^k$ possible representations of $n$ as a sum $n_1 + n_2 + \cdots + n_k$. There are less than $n^{k(16/\epsilon)^4}$ possible choices for $A$. Computing a minimum cost perfect matching in $G$ takes $O(n^3\log n)$ time.

To solve the min-sum median case, we modify the algorithm as follows. We remove the enumeration over the cluster sizes, and the multiplication of edge weights in $G$ by those sizes. Instead of computing a minimum cost perfect matching in $G$, we assign each point to the closest set to it.
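The min-sum median modification just described reduces to a nearest-candidate assignment; a minimal sketch follows (again treating $\rho(x,\bar A_i)$ as squared distance to the centroid of $A_i$, which is an assumption about the notation):

```python
import numpy as np

def median_cluster_for_guess(V, reps):
    """Min-sum median variant: no size guesses, no matching.

    Each point simply goes to the candidate multiset A_i whose
    centroid is closest in squared Euclidean distance.
    """
    V = np.asarray(V, dtype=float)
    centroids = np.array([np.asarray(A, dtype=float).mean(axis=0) for A in reps])
    rho = ((V[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=-1)
    assign = rho.argmin(axis=0)                    # closest candidate set per point
    cost = rho[assign, np.arange(len(V))].sum()    # min-sum median surrogate cost
    return assign, cost
```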

6 Outliers

In this section we present a much faster randomized algorithm that clusters at least $(1-\epsilon^2)n$ points from $V$ into $k$ clusters $C_1, C_2, \dots, C_k$, such that $\mathrm{cost}(C_1,\dots,C_k)$ is within a factor of $1+\epsilon$ of the optimum cost to cluster all the points into $k$ clusters (in fact, of the cost to cluster the points the algorithm chooses into $k$ clusters), with probability at least $1-\delta$. The algorithm differs from the previous algorithm in the way it enumerates over the choice of $A$ and $N$. This is done as follows. Pick a sample $Z$ of $(\alpha/\epsilon^8)\,k\log(k/\delta)$ points, each chosen independently and uniformly at random from $X$ (where $\alpha$ is a sufficiently large constant). Enumerate over all choices for a list $A$ of $t \le k$ disjoint subsets $A_1, A_2, \dots, A_t$ of $Z$, each containing $(\beta/\epsilon^8)\log(k/\delta)$ points. For each choice of $A$ enumerate over all choices for a list $N$ of integers $n_1, n_2, \dots, n_t$ such that for all $i = 1,2,\dots,t$, $n_i = \bigl\lfloor(1+\epsilon^2/2)^{j_i}\cdot\epsilon^2 n/(2k)\bigr\rfloor$ for some non-negative integer $j_i$, and furthermore $(1-\epsilon^2)n \le \sum_{i=1}^t n_i \le n$. Proceed to compute a clustering using the graph $G(A,N)$ as in the previous algorithm.
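The enumeration of rounded size lists $N$ can be sketched as in the following toy routine. The grid base and the coverage constraint here are illustrative assumptions rather than the paper's exact constants; the point is that a geometric grid keeps the number of lists polylogarithmic in $n$ for fixed $t$ and $\epsilon$.

```python
import itertools
import math

def candidate_size_lists(n, t, eps):
    """Enumerate rounded size lists N = (n_1, ..., n_t).

    Each n_i lies on a geometric grid of rounded powers of (1 + eps),
    and the list must cover all but an eps**2 fraction of the n points.
    """
    top = int(math.log(n, 1.0 + eps)) + 1
    grid = sorted({int((1.0 + eps) ** j) for j in range(top + 1)})
    grid = [g for g in grid if 1 <= g <= n]
    return [combo for combo in itertools.product(grid, repeat=t)
            if (1.0 - eps ** 2) * n <= sum(combo) <= n]
```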

(Notice that the two sides of the graph need not be equal, so a minimum cost maximum matching may fail to assign some of the points to clusters.) Output the best clustering computed over all choices of $A$ and $N$.

Theorem 18. With probability at least $1-\delta$, the above algorithm computes a solution containing at least $(1-\epsilon^2)n$ points, whose cost is within a factor of $1+\epsilon$ of the optimum cost. The algorithm runs in time $O(g(k,\epsilon,\delta)\cdot n\log n)$, where
$$g(k,\epsilon,\delta) = \exp\Bigl(\frac{\beta}{\epsilon^8}\,k\log(k/\delta)\,\bigl(\log k + \log(1/\epsilon) + \log(1/\delta) + \log\log(1/\delta)\bigr)\Bigr).$$

Proof: If there are any clusters among $C^*_1,\dots,C^*_k$ that contain less than $\epsilon^2 n/(2k)$ points, then by removing them we remove at most $\epsilon^2 n/2$ points and we do not increase the cost of clustering the remaining points into $k$ clusters. So, consider a cluster $C^*_i$ that contains at least $\epsilon^2 n/(2k)$ points. Let $\mu_i = \frac{1}{|C^*_i|}\sum_{x\in C^*_i}\rho(x,\bar C^*_i)$, and let $Y_i = \{x\in C^*_i \mid \rho(x,\bar C^*_i) \le 64\mu_i/\epsilon^2\}$. By Markov's inequality, $|Y_i| \ge (1-\epsilon^2/64)\cdot\epsilon^2 n/(2k)$. Therefore, for every sufficiently large $\beta$ there exists $\alpha > 0$ such that
$$\Pr\Bigl[|Z\cap Y_i| < \frac{\beta}{\epsilon^8}\log(k/\delta)\Bigr] < \frac{\delta}{2k}. \qquad (10)$$
(In the above expression we consider the intersection $Z\cap Y_i$ as a multiset.) Conditioned on the event $|Z\cap Y_i| \ge (\beta/\epsilon^8)\log(k/\delta)$, the multiset $Z_i$ containing the first $(\beta/\epsilon^8)\log(k/\delta)$ points in $Z\cap Y_i$ is a sample of $|Z_i|$ points picked independently and uniformly at random from $Y_i$. By Lemma 10, assuming $\beta$ is sufficiently large,
$$\Pr\bigl[\rho(\bar Z_i,\bar Y_i) > (\epsilon/16)^4\cdot\mathrm{diam}(Y_i)\bigr] < \frac{\delta}{2k}. \qquad (11)$$
If $\rho(\bar Z_i,\bar Y_i) \le (\epsilon/16)^4\cdot\mathrm{diam}(Y_i)$, then by Claim 16
$$\sum_{x\in C^*_i}\rho(x,\bar Z_i) - \sum_{x\in C^*_i}\rho(x,\bar C^*_i) \le \epsilon\cdot\sum_{x\in C^*_i}\rho(x,\bar C^*_i). \qquad (12)$$
Let $I \subseteq \{1,2,\dots,k\}$ be the set of indices $i$ such that $|C^*_i| \ge \epsilon^2 n/(2k)$. Without loss of generality, let $I = \{1,2,\dots,|I|\}$. Consider the event $E$ that for every $i\in I$ we have $|Z\cap Y_i| \ge (\beta/\epsilon^8)\log(k/\delta)$ and furthermore $\rho(\bar Z_i,\bar Y_i) \le (\epsilon/16)^4\cdot\mathrm{diam}(Y_i)$. Summing (10) and (11) over all $i\in I$, $\Pr[E] \ge 1-\delta$. Assuming $E$ holds, consider the iteration of the algorithm where $t = |I|$ and, for all $i\in I$, $A_i = Z_i$ and $(1-\epsilon^2/2)|C^*_i| \le n_i \le |C^*_i|$. Let $C_1, C_2, \dots, C_t$ be the clustering produced by the algorithm in this iteration. Then,
$$\mathrm{cost}(C_1,C_2,\dots,C_t) \le 2\sum_{i=1}^t n_i\sum_{x\in C^*_i}\rho(x,\bar A_i) \le (1+\epsilon)\cdot 2\sum_{i=1}^t |C^*_i|\sum_{x\in C^*_i}\rho(x,\bar C^*_i) \le (1+\epsilon)\cdot\mathrm{cost}(C^*_1,C^*_2,\dots,C^*_k).$$
Furthermore, the number of points clustered is
$$\sum_{i=1}^t n_i \ge \sum_{i=1}^t \Bigl(1-\frac{\epsilon^2}{2}\Bigr)|C^*_i| \ge \Bigl(1-\frac{\epsilon^2}{2}\Bigr)\sum_{i\in I}|C^*_i| > (1-\epsilon^2)\,n.$$
It remains to analyze the time complexity of the algorithm. The number of possible choices for $A$ is
$$2^{O\bigl((\beta/\epsilon^8)\,k\log(k/\delta)\,(\log k + \log(1/\epsilon) + \log(1/\delta))\bigr)}.$$
The number of possible choices for $N$ is
$$2^{O\bigl((k/\epsilon^2)(\log k + \log(1/\epsilon))\bigr)}.$$
Each iteration requires the computation of a minimum cost maximum bipartite matching.
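For the unequal-sided graph $G(A,N)$, a rectangular assignment solver naturally leaves points unmatched. A small sketch (scipy's solver accepts rectangular cost matrices and matches only $\min(\#\text{copies}, \#\text{points})$ pairs; distances to centroids are again an assumption about the notation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def partial_cluster(V, reps, sizes):
    """Assign sum(sizes) <= len(V) points; the rest are left out.

    Rows are the label copies (n_1 + ... + n_t of them), columns the
    points; with fewer rows than columns, a minimum cost maximum
    matching leaves len(V) - sum(sizes) points unassigned (label -1).
    """
    V = np.asarray(V, dtype=float)
    centroids = np.array([np.asarray(A, dtype=float).mean(axis=0) for A in reps])
    rho = ((V[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=-1)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    cost = np.asarray(sizes, dtype=float)[labels][:, None] * rho[labels, :]
    row, col = linear_sum_assignment(cost)    # matches len(labels) pairs
    assign = np.full(len(V), -1, dtype=int)   # -1 marks an unassigned point
    assign[col] = labels[row]
    return assign
```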

7 A Faster Min-Sum Median Algorithm

In this section we present an improved polynomial time approximation scheme for min-sum median $k$-clustering, building on the ideas of the previous section. We give a randomized polynomial time approximation scheme for min-sum median clustering of a finite input set $V \subset \mathbb{R}^d$ with distance function $\rho(x,y) = \|x-y\|_2^2$. The running time of our algorithms, for fixed $k$, $\epsilon$, and $\delta$, is just $O(n\,\mathrm{poly}\log n)$ ($\delta$ is the failure probability).

The approximation scheme works as follows. Enumerate over all possible monotonically non-increasing integer sequences $n_1, n_2, \dots, n_k$ such that for all $i = 1,2,\dots,k$, $n_i = (1+\epsilon)^{j_i}$ for a non-negative integer $j_i$, and $n \le \sum_{i=1}^k n_i \le (1+\epsilon)n$.³ Partition $\{1,2,\dots,k\}$ into segments $B_1, B_2, \dots, B_t$ as follows. The first segment begins with 1, and every consecutive segment begins with the index following the last index of the previous segment. A segment $B_i$ that starts with $a_i$ ends with the first $s = b_i$, $s \ge a_i$, such that $s = k$ or $n_{s+1} < (\epsilon^3/(16k))^2\cdot n_s$. Compute a set of candidate clusterings using a depth-$t$ recursion. It is convenient to think of the recursion as a depth-$t$ rooted tree $T$, where every node of $T$ is labelled by a clustering of a subset of $X$ into at most $k$ clusters. The candidate clusterings are the labels of the leaves of $T$. Output the best candidate clustering.

To proceed with our description, we need some notation. Put $m_i = n_{a_i}$, for all $i = 1,2,\dots,t$. Put $m_{t+1} = 0$. For every $i = 1,2,\dots,t$, every depth-$i$ node of $T$ corresponds to a clustering into $b_i$ clusters, excluding $(16k^2/\epsilon^3)\cdot m_{i+1}$ points. (The root of $T$ corresponds to an empty clustering.) The label on a node of $T$ is an extension of the label on its parent. I.e., it is a clustering that adds points and clusters to the label of its parent, but does not change the assignment of points already clustered.

³In fact, a coarser approximation by a factor of 2 would suffice.
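The segment partition of $\{1,\dots,k\}$ can be sketched as follows. The break threshold $(\epsilon^3/(16k))^2$ used here is our reading of the garbled text and should be treated as a tunable parameter, not the paper's verified constant.

```python
def segments(sizes, eps):
    """Partition cluster indices 0..k-1 into segments B_1, ..., B_t.

    `sizes` is the non-increasing candidate size sequence n_1 >= ... >= n_k.
    A segment ends just before the first index whose size drops below
    tau * (previous size), with tau = (eps**3 / (16 * k))**2 (assumed).
    """
    k = len(sizes)
    tau = (eps ** 3 / (16.0 * k)) ** 2
    segs, start = [], 0
    for s in range(k - 1):
        if sizes[s + 1] < tau * sizes[s]:      # big drop: close the segment
            segs.append(list(range(start, s + 1)))
            start = s + 1
    segs.append(list(range(start, k)))
    return segs
```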

Let $\mathcal{C}_{i-1}$ be the label on a depth-$(i-1)$ node of $T$, where $1 \le i \le t$. We describe how to compute the labels of the children of this node. Denote by $R_{i-1}$ the set of points that are not clustered in $\mathcal{C}_{i-1}$. Pick a sample $Z$ of $R_{i-1}$ of $\alpha\,(16k/\epsilon^3)^{2k}\ln k$ points drawn independently and uniformly at random, where $\alpha > 0$ is a constant. Enumerate over all choices for an ordered list of $|B_i|$ disjoint subsets $A_{a_i}, \dots, A_{b_i}$ of $Z$, each containing $(\beta/\epsilon^8)\ln k$ points, where $\beta > 0$ is a constant. (Both $\alpha$ and $\beta$ are determined in the analysis below.) Every such choice generates a child of $\mathcal{C}_{i-1}$. (In the analysis it will be convenient to assume that every depth-$i$ node of $T$ includes, in addition to its label, the list $A_1, A_2, \dots, A_{b_i}$, where its prefix $A_1, A_2, \dots, A_{b_{i-1}}$ is inherited from its parent.) Augment $\mathcal{C}_{i-1}$ by finding a minimum cost assignment of $|R_{i-1}| - (16k^2/\epsilon^3)\,m_{i+1}$ points from $R_{i-1}$ to $C_1, C_2, \dots, C_{b_i}$, where the cost of assigning $x \in R_{i-1}$ to $C_j$ is $\rho(x, \bar A_j)$.⁴ This completes the specification of the algorithm. We now proceed with its analysis.

Claim 19. For all $i = 1,2,\dots,t$, $n_{b_i} \ge (\epsilon^3/(16k))^{2(k-1)}\cdot m_i$.

Proof: By construction, for every $j \in \{a_i+1, \dots, b_i\}$, $n_j \ge (\epsilon^3/(16k))^2\cdot n_{j-1}$. Therefore, putting $s = b_i - a_i$, $n_{b_i} \ge (\epsilon^3/(16k))^{2s}\cdot n_{a_i}$. As $s < k$, the claim follows.

Claim 20. Among the sequences $n_1, n_2, \dots, n_k$ that the algorithm enumerates over, there exists one such that for every $j = 1,2,\dots,k$, $|C^*_j| \le n_j \le (1+\epsilon)\cdot|C^*_j|$.

Proof: Clearly for every $j$ there is a valid choice of $n_j$ that satisfies the bounds in the claim. Because for these values $n \le \sum_{j=1}^k n_j \le (1+\epsilon)n$, there is an iteration where the whole sequence is considered.

Thus, from now on we analyze the iteration of the algorithm for which the bounds in Claim 20 hold. Consider a depth-$(i-1)$ node $u$ of $T$ with label $C_1, C_2, \dots, C_{b_{i-1}}$, list $A_1, A_2, \dots, A_{b_{i-1}}$, and set of unclustered points $R_{i-1}$. To generate a child $v$ of $u$, we add to the list sets $A_j$, for $j = a_i, \dots, b_i$. We are interested in a particular choice of those sets. Let $K_{a_i}, \dots, K_{b_i} \subseteq R_{i-1}$ be mutually disjoint sets such that $K_j = R_{i-1}\cap C^*_j$ if $|R_{i-1}\cap C^*_j| \ge (1-(\epsilon/16)^3)\,n_j$, and otherwise $K_j$ is an arbitrary set of size $n_j$. (Notice that as $|R_{i-1}| = (16k^2/\epsilon^3)\,m_i > k\,m_i \ge \sum_{j\in B_i} n_j$, such a choice of sets exists.)

Claim 21. For every $\epsilon > 0$ and for every sufficiently large $\beta > 0$, there exists $\alpha > 0$ such that with probability at least $1-1/k^2$, the sample $Z$ from $R_{i-1}$ has the following property. For every $j \in B_i$, $|Z\cap K_j| \ge (\beta/\epsilon^8)\ln k$.

Proof: The sets $K_j$, $j \in B_i$, are disjoint. There are at most $k$ such sets, and each set has size at least $(1-(\epsilon/16)^3)\,n_{b_i} \ge \frac12(\epsilon^3/(16k))^{2(k-1)}\,m_i \ge (\epsilon^3/(16k))^{2k}\,|R_{i-1}|$. Then $|Z\cap K_j|$ is the sum of $\alpha\,(16k/\epsilon^3)^{2k}\ln k$ Bernoulli trials with success probability at least $(\epsilon^3/(16k))^{2k}$. Thus, by standard Chernoff bounds, for $\alpha$ sufficiently large, the probability that $|Z\cap K_j| < (\beta/\epsilon^8)\ln k$ is at most $1/k^3$. Summing this probability for $j \in B_i$ completes the proof.

⁴Notice that this is a positive number of points, and in fact, almost all the points in $R_{i-1}$ get assigned at depth $i$.
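The heart of Claim 21, that a large uniform sample hits every sufficiently large disjoint set, is easy to check empirically. A toy simulation with illustrative parameters (not the paper's constants):

```python
import random

def min_hits(n_sets, set_fraction, sample_size, rng):
    """Least-hit count among n_sets disjoint sets, each covering
    `set_fraction` of the ground set, under uniform sampling."""
    counts = [0] * n_sets
    for _ in range(sample_size):
        u = rng.random()
        j = int(u / set_fraction)
        if j < n_sets:               # u may fall outside all the sets
            counts[j] += 1
    return min(counts)
```

With, say, 5 disjoint sets each holding 10% of the points and 200 uniform samples, every set is hit with overwhelming probability, mirroring the Chernoff argument.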

Claim 22. For every $\epsilon > 0$ there exist $\beta > 0$ and $\alpha > 0$ such that with probability at least $1-2/k^2$, $u$ has a child $v$ with list $A_1, A_2, \dots, A_{b_i}$ such that for every $j \in B_i$,
$$\sum_{x\in K_j}\rho(x,\bar A_j) - \sum_{x\in K_j}\rho(x,\bar K_j) \le \frac{\epsilon}{8}\cdot\sum_{x\in K_j}\rho(x,\bar K_j).$$

Proof: Following the proof of Theorem 18 put, for every $j \in B_i$, $\mu_j = \frac{1}{|K_j|}\sum_{x\in K_j}\rho(x,\bar K_j)$, and $Y_j = \{x\in K_j \mid \rho(x,\bar K_j) \le 64\mu_j/(\epsilon/8)^2\}$. Set $\beta$ so that the following property holds. For every $j \in B_i$, a multi-subset $Z_j$ of $(\beta/\epsilon^8)\ln k$ independent, uniformly distributed points of $Y_j$ satisfies $\Pr[\rho(\bar Z_j,\bar Y_j) > (\epsilon/128)^4\cdot\mathrm{diam}(Y_j)] < 1/k^3$. (This is possible by Lemma 10.) Set $\alpha$ so that with probability at least $1-1/k^2$ the bound in Claim 21 holds. Conditioned on this event, for every $j \in B_i$, $Z$ contains a sample of $(\beta/\epsilon^8)\ln k$ independent, uniformly distributed points from $K_j$. Notice that $Y_j$ contains more than two-thirds of the points in $K_j$. If $\beta$ is sufficiently large, then the probability that $Z_j = Z\cap Y_j$ has at least $(\beta/\epsilon^8)\ln k$ points is at least $1-1/k^3$. Conditioned on this assumption, $Z_j$ is a sample of independent, uniformly distributed points of $Y_j$ as discussed above. If $\rho(\bar Z_j,\bar Y_j) \le (\epsilon/128)^4\cdot\mathrm{diam}(Y_j)$, then, by Claim 16, $\sum_{x\in K_j}\rho(x,\bar Z_j) - \sum_{x\in K_j}\rho(x,\bar K_j) \le \frac{\epsilon}{8}\sum_{x\in K_j}\rho(x,\bar K_j)$. The probability that all our assumptions are true is at least $1-2/k^2$. In this case, $v$ is the child of $u$ corresponding to the choice $A_j = Z_j$, for all $j \in B_i$.

Claim 23. With constant probability, $T$ contains a depth-$t$ node $l$ with label $C_1, C_2, \dots, C_k$ and list $A_1, A_2, \dots, A_k$ such that the directed path $p$ in $T$ from its root to $l$ has the property that every parent-child pair along $p$ satisfies the bound in Claim 22.

Proof: By a trivial induction on the level $i$.

Assume from now on that the event in Claim 23 occurs. Denote, for every $x \in X$, by $j_x$ the index of the cluster that $x$ gets assigned to by the algorithm, and denote by $j^*_x$ the index for which $x \in C^*_{j^*_x}$. Let $J$ be the set of indices $j$ such that $K_j \subseteq C^*_j$. For $i = 1,2,\dots,t$, let $J_i = \{j\in J \mid j \le b_i\}$. For $i = 1,2,\dots,t$, let $D_i$ be the set of points assigned to clusters at the depth-$i$ node of $p$. A point $x \in D_i$ is premature iff $j^*_x > b_i$. Let $P_i$ denote the set of premature points in $D_i$. A point $x \in D_i$ is leftover iff $K_{j^*_x} \not\subseteq C^*_{j^*_x}$ and $j^*_x \le b_i$. Notice that in this case, almost all points from $C^*_{j^*_x}$ must be premature at some depth less than $i$. Let $L_i$ denote the set of leftover points in $D_i$.

Let $j \notin J$. Let $L^j$ denote the set of leftover points from $C^*_j$. By definition, $|L^j| < (\epsilon/16)^3\,n_j$. Sort the points in $C^*_j\setminus L^j$ by non-decreasing order of $w(x) = \max\{\rho(x,\bar C^*_j),\,\rho(x,\bar A_{j_x})\}$. (These are all premature points.) Assign these points to the points in $L^j$ in round-robin fashion. Let $Q(x)$ be the set of points assigned to $x \in L^j$. Let $q(x)$ be a point in $Q(x)$ with smallest $w$-value. (Notice that $\{q(x) \mid x\in L^j\}$ is a set of $|L^j|$ points with smallest $w(\cdot)$ value in $C^*_j\setminus L^j$.) For every $x \in X$ let
$$\phi(x) = \begin{cases} \text{undefined}, & \text{if } x \text{ is premature;}\\ j_{q(x)}, & \text{if } x \text{ is leftover;}\\ j_x, & \text{otherwise.}\end{cases}$$

Claim 24. For every $j \notin J$,
$$\sum_{x\in L^j}\rho(x,\bar A_{\phi(x)}) \le \Bigl(1+\frac{\epsilon}{3}\Bigr)\cdot\sum_{x\in L^j}\rho(x,\bar C^*_j) + \frac{\epsilon}{6}\cdot\sum_{x\in L^j}\sum_{y\in Q(x)}w(y).$$

Proof: Let $x \in L^j$. By the triangle inequality,
$$\sqrt{\rho(x,\bar A_{\phi(x)})} \le \sqrt{\rho(x,\bar C^*_j)} + \sqrt{\rho(q(x),\bar C^*_j)} + \sqrt{\rho(q(x),\bar A_{\phi(x)})}.$$
By definition, $j = j^*_x = j^*_{q(x)}$. If $w(q(x)) \le (\epsilon/16)^2\cdot\rho(x,\bar C^*_j)$, we get that
$$\rho(x,\bar A_{\phi(x)}) \le \Bigl(\sqrt{\rho(x,\bar C^*_j)} + 2\sqrt{(\epsilon/16)^2\,\rho(x,\bar C^*_j)}\Bigr)^2 = \Bigl(1+\frac{\epsilon}{8}\Bigr)^2\rho(x,\bar C^*_j) \le \Bigl(1+\frac{\epsilon}{3}\Bigr)\rho(x,\bar C^*_j).$$
Otherwise, for every $y \in Q(x)$, $w(y) > (\epsilon/16)^2\cdot\rho(x,\bar C^*_j)$. Moreover,
$$|Q(x)| = \frac{|C^*_j\setminus L^j|}{|L^j|} \ge \frac{(1-(\epsilon/16)^3)\,n_j}{(\epsilon/16)^3\,n_j} > \frac12\Bigl(\frac{16}{\epsilon}\Bigr)^3.$$
Therefore, in this case,
$$\sum_{y\in Q(x)}w(y) > \frac12\Bigl(\frac{16}{\epsilon}\Bigr)^3\Bigl(\frac{\epsilon}{16}\Bigr)^2\rho(x,\bar C^*_j) = \frac{8}{\epsilon}\,\rho(x,\bar C^*_j),$$
and
$$w(q(x)) < 2\Bigl(\frac{\epsilon}{16}\Bigr)^3\sum_{y\in Q(x)}w(y).$$
Thus, we get
$$\rho(x,\bar A_{\phi(x)}) \le \Bigl(\sqrt{\rho(x,\bar C^*_j)} + \sqrt{\rho(q(x),\bar C^*_j)} + \sqrt{\rho(q(x),\bar A_{\phi(x)})}\Bigr)^2 < \Biggl(\sqrt{\frac{\epsilon}{8}\sum_{y\in Q(x)}w(y)} + 2\sqrt{2\Bigl(\frac{\epsilon}{16}\Bigr)^3\sum_{y\in Q(x)}w(y)}\Biggr)^2 < \frac{\epsilon}{6}\sum_{y\in Q(x)}w(y).$$
Summing these bounds over all $x \in L^j$ completes the proof.

Put $Q_i = \{x\in X \mid j^*_x\in J_i \text{ and } x\in R_i\}$. Let $S_i$ be a set of $|P_i|$ points in $Q_i$ with smallest $\rho(x,\bar A_{\phi(x)})$ value (notice that by definition $Q_i$ cannot contain premature points, so $\phi(x)$ is defined for every $x \in Q_i$).

Claim 25.
$$\sum_{i=1}^t\sum_{x\in S_i}\rho(x,\bar A_{\phi(x)}) \le \frac{\epsilon^3}{8}\cdot\sum_{i=1}^t\sum_{x\in D_i\setminus P_i}\rho(x,\bar A_{\phi(x)}).$$

Proof: Fix $i \in \{1,2,\dots,t\}$. The set $P_i$ is a subset of $\bigcup_{j>b_i}C^*_j$ and therefore $|P_i| \le k\,m_{i+1}$. On the other hand, $|Q_i| \ge |R_i| - k\,m_{i+1} = (16k^2/\epsilon^3)\,m_{i+1} - k\,m_{i+1} > (8k^2/\epsilon^3)\,m_{i+1}$. Thus,
$$\sum_{x\in S_i}\rho(x,\bar A_{\phi(x)}) \le \frac{|S_i|}{|Q_i|}\sum_{x\in Q_i}\rho(x,\bar A_{\phi(x)}) = \frac{|P_i|}{|Q_i|}\sum_{x\in Q_i}\rho(x,\bar A_{\phi(x)}) \le \frac{\epsilon^3}{8k}\sum_{x\in Q_i}\rho(x,\bar A_{\phi(x)}).$$
Moreover, $Q_i \subseteq X\setminus(\bigcup_{i'}P_{i'})$, so $\sum_{x\in Q_i}\rho(x,\bar A_{\phi(x)}) \le \sum_{i'}\sum_{x\in D_{i'}\setminus P_{i'}}\rho(x,\bar A_{\phi(x)})$. Summing over $i$, which takes $t \le k$ values, completes the proof.

Claim 26.
$$\sum_{i=1}^t\sum_{x\in D_i\setminus P_i\setminus L_i}\rho(x,\bar A_{\phi(x)}) \le \Bigl(1+\frac{\epsilon}{8}\Bigr)\cdot\sum_{j\in J}\sum_{x\in K_j}\rho(x,\bar C^*_j).$$

Proof: Notice that the lhs sums precisely over the points in $\bigcup_{j\in J}K_j$. Moreover, for $j \in J$, $x \in K_j$, $\phi(x) = j_x = j$. As we are assuming that the bound in Claim 22 holds, the proof is complete.

Theorem 27. With constant probability the above algorithm computes a solution whose cost is within a factor of $1+\epsilon$ of the optimum cost. The running time of the algorithm is $O(g(k,\epsilon)\cdot n\cdot(\log n)^k)$, where $g(k,\epsilon) = \exp\bigl((\beta/\epsilon^8)\,k^3\ln k\,(\ln(1/\epsilon)+\ln k)\bigr)$.

Proof: With constant probability the recursion tree $T$ will contain a computation path $p$ as per Claim 23. Assuming this occurs, consider the clustering $C_1, C_2, \dots, C_k$ computed at the leaf $l$ reached by the path $p$.

For every $i = 1,2,\dots,t$, the set $(D_i\cup S_i)\setminus P_i$ is a subset of $R_{i-1}$ of size $|R_{i-1}| - |R_i|$. Therefore, assigning every $x \in (D_i\cup S_i)\setminus P_i$ to $C_{\phi(x)}$ is a feasible augmentation of $\mathcal{C}_{i-1}$, so its cost $\sum_{x\in(D_i\cup S_i)\setminus P_i}\rho(x,\bar A_{\phi(x)})$ cannot be smaller than the cost of the augmentation that the algorithm chooses, which is $\sum_{x\in D_i}\rho(x,\bar A_{j_x})$. Therefore,
$$\begin{aligned}
\sum_{x\in X}\rho(x,\bar A_{j_x}) &= \sum_{i=1}^t\sum_{x\in D_i}\rho(x,\bar A_{j_x})\\
&\le \sum_{i=1}^t\sum_{x\in(D_i\cup S_i)\setminus P_i}\rho(x,\bar A_{\phi(x)})\\
&\le \sum_{j\notin J}\sum_{x\in L^j}\rho(x,\bar A_{\phi(x)}) + \sum_{i=1}^t\sum_{x\in S_i}\rho(x,\bar A_{\phi(x)}) + \sum_{i=1}^t\sum_{x\in D_i\setminus P_i\setminus L_i}\rho(x,\bar A_{\phi(x)})\\
&\le \Bigl(1+\frac{\epsilon^3}{8}\Bigr)\Biggl(\sum_{j\notin J}\sum_{x\in L^j}\rho(x,\bar A_{\phi(x)}) + \sum_{i=1}^t\sum_{x\in D_i\setminus P_i\setminus L_i}\rho(x,\bar A_{\phi(x)})\Biggr)\\
&\le \Bigl(1+\frac{\epsilon^3}{8}\Bigr)\Biggl(\Bigl(1+\frac{\epsilon}{3}\Bigr)\sum_{j\notin J}\sum_{x\in L^j}\rho(x,\bar C^*_j) + \frac{\epsilon}{6}\sum_{j\notin J}\sum_{x\in L^j}\sum_{y\in Q(x)}w(y) + \Bigl(1+\frac{\epsilon}{8}\Bigr)\sum_{j\in J}\sum_{x\in K_j}\rho(x,\bar C^*_j)\Biggr)\\
&\le \Bigl(1+\frac{\epsilon}{2}\Bigr)\sum_{x\in X}\rho(x,\bar C^*_{j^*_x}) + \frac{\epsilon}{5}\sum_{x\in X}\rho(x,\bar A_{j_x}),
\end{aligned}$$
where the third inequality uses Claim 25, the fourth uses Claims 24 and 26, and the last uses $w(y) \le \rho(y,\bar C^*_{j^*_y}) + \rho(y,\bar A_{j_y})$ together with the fact that the sets $Q(x)$ are disjoint. Moving terms around, we get
$$\sum_{x\in X}\rho(x,\bar A_{j_x}) \le \frac{1+\epsilon/2}{1-\epsilon/5}\sum_{x\in X}\rho(x,\bar C^*_{j^*_x}) < (1+\epsilon)\sum_{x\in X}\rho(x,\bar C^*_{j^*_x}).$$
On the other hand,
$$\mathrm{cost}(C_1,C_2,\dots,C_k) = \sum_{j=1}^k\sum_{x\in C_j}\rho(x,\bar C_j) \le \sum_{j=1}^k\sum_{x\in C_j}\rho(x,\bar A_j) = \sum_{x\in X}\rho(x,\bar A_{j_x}).$$
As the algorithm outputs a clustering which is at least as good as $C_1, C_2, \dots, C_k$, this establishes the performance guarantee of the algorithm.

As for the running time of the algorithm, the number of sequences $n_1, n_2, \dots, n_k$ that the algorithm has to enumerate over is $O\bigl((\log_{1+\epsilon}n)^k\bigr)$. The size of $T$ is at most
$$2^{(\beta/\epsilon^8)\,k^3\ln k\,(\ln(1/\epsilon)+\ln k)}.$$
Computing the augmentation at each node of $T$ requires $O(n)$ distance computations, where the hidden constant depends mildly on $k$ and $\epsilon$.

