
A Spectral Algorithm for Learning Mixtures of Distributions

Santosh Vempala*        Grant Wang†

*Department of Mathematics and Laboratory for Computer Science, MIT, [email protected]. †Laboratory for Computer Science, MIT, [email protected]. Both authors are supported in part by NSF Career award CCR-987024.

Abstract

We show that a simple spectral algorithm for learning a mixture of k spherical Gaussians in $\mathbb{R}^n$ works remarkably well: it succeeds in identifying the Gaussians assuming essentially the minimum possible separation between their centers that keeps them unique (solving an open problem of [1]). The sample complexity and running time are polynomial in both n and k. The algorithm also works for the more general problem of learning a mixture of "weakly isotropic" distributions (e.g. a mixture of uniform distributions on cubes).


1. Introduction

Learning a mixture of distributions is a classical problem in statistics and learning theory (see [10, 13]); more recently, it has also been proposed as a model for clustering. In the basic version of the problem we are given random samples from a mixture of k distributions, $F_1, \dots, F_k$. Each sample is drawn independently with probability $w_i$ from the i'th distribution. The numbers $w_1, \dots, w_k$ are called the mixing weights. The problem is to classify the random samples according to which distribution they come from (and thereby infer the mixing weights, means and other properties of the underlying distributions).

An important case of this problem is when each underlying distribution is a Gaussian. In this case, the goal is to find the mean and covariances of each Gaussian (along with the mixing weights). This problem seems to be of great practical interest and many heuristics have been used to solve it. The most famous among them is the EM algorithm [5]. Unfortunately, EM is a local search heuristic that can fail.

A special case of the problem is when the Gaussians are assumed to be spherical, i.e. the variance is the same in any direction. In recent years, there has been substantial progress in developing polynomial-time algorithms for this special case, by making assumptions on the separation between the means of the Gaussians. This separation condition is crucial, so we proceed to make it explicit. Let $F_1, \dots, F_k$ be spherical Gaussians in $\mathbb{R}^n$ with mean vectors $\mu_1, \dots, \mu_k$ and variances $\sigma_1^2, \dots, \sigma_k^2$. We will refer to $\sigma_i\sqrt{n}$ as the radius of $F_i$ and $\|\mu_i - \mu_j\|$ as the separation between $F_i$ and $F_j$. If the pairwise separation is larger than the radii, then points from different Gaussians are isolated in space and easy to classify. On the other hand, if the separation is very small, then the classification problem need not have a unique solution.

1.1. Previous work

Dasgupta [3] used random projection to learn a mixture of spherical Gaussians provided they are essentially nonoverlapping, i.e. the overlap in probability mass is exponentially small in n. His algorithm is polynomial-time provided the smallest mixing weight $w_{\min}$ is $\Omega(1/k)$ and the separation is

$$\|\mu_i - \mu_j\| \geq C\max\{\sigma_i, \sigma_j\}\, n^{1/2}$$

for a constant C. In other words, the separation is proportional to the larger radius (the algorithm also required all the variances to be within a bounded range). Shortly thereafter, it was shown by Dasgupta and Schulman [6] that a variant of EM works with a smaller separation (along with some technical conditions on the variances):

$$\|\mu_i - \mu_j\| > C\max\{\sigma_i, \sigma_j\}\, n^{1/4}\log^{1/4}(n/w_{\min}). \qquad (1)$$

This separation is the minimum at which random points from the same Gaussian can be distinguished from random points from two different Gaussians based on pairwise distances. So points from the Gaussian with (approximately) the smallest variance have the smallest pairwise distances. They can be identified and removed, and this can be repeated on the remaining points. Arora and Kannan [1] generalized this to non-spherical Gaussians. They used isoperimetric theorems to obtain distance concentration results for the non-spherical case. At this separation, their algorithm simply identifies all points at roughly the minimum distance from each other as coming from a single Gaussian, removes them and repeats on the remaining data. They also give a version that uses random projection. Learning a mixture of spherical Gaussians at a smaller separation (when distance concentration results are no longer valid) has been an open problem.

1.2. New results

In order for the solution to the classification problem to be well-defined (i.e. unique with reasonable probability) we need a separation of at least

$$\|\mu_i - \mu_j\| > C\max\{\sigma_i, \sigma_j\}.$$

At this separation the overlap in the probability mass is a constant fraction. So in particular, distance concentration results are no longer applicable. In this paper, we show that a simple spectral algorithm can learn a mixture of k Gaussians at this minimum separation in time $k^{O(k)}\,\mathrm{poly}(n)$. Our main result is that with a slightly larger separation of

$$\|\mu_i - \mu_j\| > C\max\{\sigma_i, \sigma_j\}\, k^{1/4}\log^{1/4}(n/w_{\min}) \qquad (2)$$

the algorithm is polynomial in both k and n. Note that this condition is almost independent of n and is much weaker than (1) as the dimension (n) gets larger than the number of Gaussians (k).

The main step of the algorithm is to project to the subspace spanned by the top k right singular vectors of the sample matrix (i.e. its k principal components). This is the rank k subspace which maximizes the squared projections of the samples. The key observation is that with high probability this subspace lies very close to the span of the mean vectors of the underlying Gaussians (Corollary 3 in Section 4.1). In Section 3 we first prove this for the expected best subspace and then show that in fact it is true with high probability for the best subspace. Thus, on projection, the separation between the mean vectors is preserved. On the other hand, the radius of a Gaussian projected to any k-dimensional subspace drops by a factor of $\sqrt{n/k}$. Together this has the surprising effect of amplifying the ratio of the separation to the radii while reducing the dimension! Once we are in the k-dimensional subspace, we can apply an exhaustive algorithm that is exponential in k. If we start with the slightly larger separation of (2), then the separation in the k-dimensional subspace is so large that we can apply distance concentration results to identify the components.

More generally, the algorithm works for a mixture of weakly isotropic distributions, i.e. distributions with the property that the variance is the same along any direction. Examples of this include the uniform distribution over cubes, balls, and generally weakly isotropic convex sets (defined in Section 3), a distribution with density proportional to $e^{-\alpha|x|}$ for a constant $\alpha$, etc.

Finally, it is worth noting that this is a problem for which direct random projection does not work (see Figures 1 and 3). Indeed, on random projection to d dimensions, the inter-center distances and the radii scale at the same rate, namely $\sqrt{d/n}$. So in terms of the dimension, the separation condition gets worse.

2. The Spectral Algorithm

The first step of the algorithm is based on the singular value decomposition of a matrix. Any $m \times n$ matrix A can be written as

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T$$

where $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n$ are the singular values of A and $u_i, v_i$ are the left and right singular vectors corresponding to the i'th singular value $\sigma_i$. The projection to the top k right singular vectors is

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T.$$

The key property of the decomposition that we use is that the subspace spanned by the top k right singular vectors is the one that maximizes the norm of the projection of A among all k-dimensional subspaces. The algorithm below for learning mixtures of spherical Gaussians uses this projection.

Algorithm.

1. Compute the singular value decomposition of the sample matrix.
2. Project the samples to the rank k subspace spanned by the top k right singular vectors.
3. Perform a distance-based classification in the k-dimensional space.
4. Report the mean vectors and variances of each subsample.

Step (3) is spelled out in full detail in Section 4.2. To learn more general distributions, steps 1-3 remain the same, while step (4) is replaced by something specific to the distributions being learned.
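For concreteness, here is a minimal numpy sketch of steps 1-4. It is our own illustration, not the paper's implementation: the function names are ours, and the distance-based classification of step 3 is replaced by a generic Lloyd-style clustering stand-in rather than the procedure analyzed in Section 4.2.

```python
import numpy as np

def spectral_project(samples, k):
    """Steps 1-2: project the rows of `samples` onto the top k right singular vectors."""
    _, _, vt = np.linalg.svd(samples, full_matrices=False)
    return samples @ vt[:k].T          # m x k coordinates in the rank-k subspace

def classify_by_distance(projected, k, iters=50, seed=0):
    """Step 3 stand-in: plain Lloyd-style clustering in the k-dimensional space."""
    rng = np.random.default_rng(seed)
    centers = projected[rng.choice(len(projected), size=k, replace=False)]
    for _ in range(iters):
        dists = ((projected[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([projected[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def learn_mixture(samples, k):
    """Step 4: report the mean vector and variance of each recovered subsample."""
    labels = classify_by_distance(spectral_project(samples, k), k)
    means = [samples[labels == j].mean(axis=0) for j in range(k)]
    variances = [samples[labels == j].var() for j in range(k)]
    return labels, means, variances
```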

3. The Expected Best Subspace

In this section, we show that in expectation, the subspace spanned by the top k singular vectors of the sample matrix is the same subspace spanned by the mean vectors of the distributions. The results of this section hold for any mixture of weakly isotropic distributions. A distribution with mean $\mu$ is said to be weakly isotropic¹ if for a random sample $X \in \mathbb{R}^n$ we have

$$E[(w \cdot (X - \mu))^2] = \sigma^2 \qquad \forall w \in \mathbb{R}^n, \|w\| = 1. \qquad (3)$$

In other words, the variance of any 1-dimensional projection is $\sigma^2$. It is easy to verify that (3) is equivalent to the following:

1. For each coordinate i, $E[(X_i - \mu_i)^2] = \sigma^2$.
2. Each pair of coordinates $i, j$ are uncorrelated, i.e. $E[X_i X_j] = E[X_i]E[X_j]$.

¹The term isotropic [9] also requires that the mean is zero and the variance is 1 along any direction. Here we are allowing a radial scaling and translation for each distribution in the mixture.

Suppose we sample a random point X from such a distribution with mean vector $\mu$ and variance $\sigma^2$ in every direction. Then we have the following:

Lemma 1. For any $v \in \mathbb{R}^n$,
$$E[(X \cdot v)^2] = (\mu \cdot v)^2 + \sigma^2\|v\|^2.$$

Proof.
$$E[(X \cdot v)^2] = E\Big[\Big(\sum_{i=1}^n X_i v_i\Big)^2\Big] = E\Big[\sum_{i,j=1}^n X_i X_j v_i v_j\Big] = \sum_{i,j=1}^n E[X_i X_j]\, v_i v_j.$$
Using the assumption that $E[X_i X_j] = E[X_i]E[X_j]$ for $i \neq j$,
$$\sum_{i,j=1}^n E[X_i X_j]\, v_i v_j = \sum_{i \neq j} E[X_i]E[X_j] v_i v_j + \sum_{i=1}^n E[X_i^2] v_i^2 = (E[X] \cdot v)^2 - \sum_{i=1}^n E[X_i]^2 v_i^2 + \sum_{i=1}^n E[X_i^2] v_i^2 = (\mu \cdot v)^2 + \sum_{i=1}^n v_i^2\sigma^2 = (\mu \cdot v)^2 + \sigma^2\|v\|^2.$$

Corollary 1. For all $v \in \mathbb{R}^n$ such that $\|v\| = \|\mu\|$,
$$E[(X \cdot \mu)^2] \geq E[(X \cdot v)^2].$$

Proof. By the above lemma,
$$E[(X \cdot \mu)^2] - E[(X \cdot v)^2] = (\mu \cdot \mu)^2 + \sigma^2\|\mu\|^2 - (\mu \cdot v)^2 - \sigma^2\|v\|^2.$$
So $E[(X \cdot \mu)^2] - E[(X \cdot v)^2] = (\mu \cdot \mu)^2 - (\mu \cdot v)^2 \geq 0$ (by Cauchy-Schwarz, since $\|v\| = \|\mu\|$), and we have the desired result.

The corollary says that the best rank 1 subspace for a distribution is the one that passes through its mean (in expectation).

Next, we consider the projection of X to a higher dimensional subspace. We write $\|\mathrm{proj}_V X\|^2$ to denote the squared length of the projection of X onto a subspace V. For an orthonormal basis $\{v_1, \dots, v_r\}$ for V, this is just $\sum_{i=1}^r (X \cdot v_i)^2$.

Lemma 2. Let $V \subseteq \mathbb{R}^n$ be a subspace of dimension r with orthonormal basis $\{v_1, \dots, v_r\}$. Then
$$E[\|\mathrm{proj}_V X\|^2] = \|\mathrm{proj}_V E[X]\|^2 + r\sigma^2.$$

Proof.
$$E[\|\mathrm{proj}_V X\|^2] = E\Big[\Big\|\sum_{i=1}^r (X \cdot v_i)v_i\Big\|^2\Big] = E\Big[\sum_{i=1}^r \|(X \cdot v_i)v_i\|^2\Big] = E\Big[\sum_{i=1}^r (X \cdot v_i)^2\Big].$$
The second and third equalities follow from the fact that $\{v_1, \dots, v_r\}$ is an orthonormal basis. By linearity of expectation and Lemma 1, the above is
$$\sum_{i=1}^r E[(X \cdot v_i)^2] = \sum_{i=1}^r (\mu \cdot v_i)^2 + r\sigma^2 = \|\mathrm{proj}_V E[X]\|^2 + r\sigma^2.$$
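As a quick sanity check of Lemma 2 (our own illustration, not part of the paper), the snippet below estimates $E[\|\mathrm{proj}_V X\|^2]$ by Monte Carlo for a weakly isotropic distribution, the uniform distribution on a shifted cube, and compares it to $\|\mathrm{proj}_V E[X]\|^2 + r\sigma^2$. All parameters are arbitrary.

```python
# Monte Carlo check of Lemma 2 for a weakly isotropic distribution (shifted cube).
import numpy as np

rng = np.random.default_rng(1)
n, r, a = 20, 5, 2.0                    # dimension, subspace dimension, cube half-width
sigma2 = a**2 / 3.0                     # per-direction variance of Uniform[-a, a]
mu = rng.normal(size=n)                 # mean of the shifted cube

# Random r-dimensional subspace V with orthonormal basis (columns of Q).
Q, _ = np.linalg.qr(rng.normal(size=(n, r)))

X = mu + rng.uniform(-a, a, size=(200_000, n))      # samples from the shifted cube
empirical = np.mean(np.sum((X @ Q) ** 2, axis=1))   # estimate of E[||proj_V X||^2]
predicted = np.sum((mu @ Q) ** 2) + r * sigma2      # ||proj_V E[X]||^2 + r*sigma^2
print(empirical, predicted)             # the two numbers should agree closely
```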

Now consider a mixture of k components $F_1, \dots, F_k$, with mean vectors $\mu_i$ and variances $\sigma_i^2$. Let $A \in \mathbb{R}^{m \times n}$ be generated randomly from a mixture of distributions $F_1, \dots, F_k$ with mixing weights $w_1, \dots, w_k$. For a matrix A, $\|\mathrm{proj}_V A\|^2$ denotes the squared 2-norm of the projections of its rows onto the subspace V, i.e. $\|\mathrm{proj}_V A\|^2 = \sum_{i=1}^m \|\mathrm{proj}_V A_i\|^2$.

Theorem 1. Let $V \subseteq \mathbb{R}^n$ be a subspace of dimension r with an orthonormal basis $\{v_1, \dots, v_r\}$. Then
$$E[\|\mathrm{proj}_V A\|^2] = \|\mathrm{proj}_V E[A]\|^2 + m\sum_{i=1}^k w_i\, r\sigma_i^2.$$

Proof.
$$E[\|\mathrm{proj}_V A\|^2] = \sum_{i=1}^m E[\|\mathrm{proj}_V A_i\|^2] = \sum_{l=1}^k \sum_{i \in F_l} E[\|\mathrm{proj}_V A_i\|^2].$$
The second equality follows from linearity of expectation. In expectation, there are $w_i m$ samples from each distribution. By applying Lemma 2, we have
$$\sum_{l=1}^k \sum_{i \in F_l} E[\|\mathrm{proj}_V A_i\|^2] = m\sum_{l=1}^k w_l\big(\|\mathrm{proj}_V \mu_l\|^2 + r\sigma_l^2\big) = \|\mathrm{proj}_V E[A]\|^2 + m\sum_{i=1}^k w_i\, r\sigma_i^2.$$

From this it follows that the expected best subspace for a mixture of k distributions is simply the subspace spanned by the mean vectors of the distributions.

Corollary 2 (Expected Best Subspace). Let $V \subseteq \mathbb{R}^n$ be a subspace of dimension k, and let $U = \mathrm{span}\{\mu_1, \dots, \mu_k\}$. Then
$$E[\|\mathrm{proj}_U A\|^2] \geq E[\|\mathrm{proj}_V A\|^2].$$

Proof. By Theorem 1 we have,
$$E[\|\mathrm{proj}_U A\|^2] - E[\|\mathrm{proj}_V A\|^2] = \|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 = \|E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \geq 0.$$

4. Analysis for mixtures of Gaussians

4.1. The likely best subspace

In this section, we show that for a sufficiently large sample, with high probability the subspace found by SVD is very close to the one spanned by the mean vectors. We begin with a concentration lemma.

Lemma 3 (Concentration). Let V be a subspace of $\mathbb{R}^n$ of dimension r, with an orthonormal basis $\{v_1, \dots, v_r\}$. Let $A \in \mathbb{R}^{M \times n}$ be generated randomly from Gaussians $F_1, \dots, F_k$ with mixing weights $w_1, \dots, w_k$. Assume that A contains at least m rows from each Gaussian. Then for any $\epsilon > 0$:

1. $\Pr\big(\|\mathrm{proj}_V A\|^2 > (1+\epsilon)E[\|\mathrm{proj}_V A\|^2]\big) < e^{-\epsilon^2 m r/8}$.
2. $\Pr\big(\|\mathrm{proj}_V A\|^2 < (1-\epsilon)E[\|\mathrm{proj}_V A\|^2]\big) < e^{-\epsilon^2 m r/8}$.

Proof. Let E be the event in consideration. We further condition E on the event that exactly $m_i$ rows are generated by the i-th Gaussian. Note that $\|\mathrm{proj}_V A\|^2 = \sum_{l=1}^k \sum_{i \in F_l} \|\mathrm{proj}_V A_i\|^2$. Therefore, we have the following bound on the probability $\Pr(E/(m_1, \dots, m_k))$:

$$\Pr(E/(m_1, \dots, m_k)) \leq \max_l \Pr\Big(\sum_{i \in F_l} \|\mathrm{proj}_V A_i\|^2 > (1+\epsilon)E\Big[\sum_{i \in F_l} \|\mathrm{proj}_V A_i\|^2\Big]\Big).$$

Let B be the event that

$$\|\mathrm{proj}_V B\|^2 > (1+\epsilon)E[\|\mathrm{proj}_V B\|^2],$$

where $B \in \mathbb{R}^{m_l \times n}$ is a matrix containing the rows of A that are generated by $F_l$, an arbitrary Gaussian. We bound the probability of $E/(m_1, \dots, m_k)$ by bounding the probability of B. Let $Y_{ij} = (B_i \cdot v_j)$. Note that $\|\mathrm{proj}_V B\|^2 = \sum_{i=1}^{m_l}\sum_{j=1}^r Y_{ij}^2$. We are interested in the event $B \equiv \sum_{i=1}^{m_l}\sum_{j=1}^r Y_{ij}^2 > (1+\epsilon)E[\|\mathrm{proj}_V B\|^2]$. Note that $Y_{ij}$ is a Gaussian random variable with mean $(\mu_l \cdot v_j)$ and variance $\sigma_l^2$. We can write $Y_{ij} = \sigma_l X_{ij}$, where $X_{ij}$ is a Gaussian random variable with mean $(\mu_l \cdot v_j)/\sigma_l$ and variance 1. Rewriting in terms of $X_{ij}$, we are interested in the event

$$\sum_{i=1}^{m_l}\sum_{j=1}^r X_{ij}^2 > (1+\epsilon)\frac{E[\|\mathrm{proj}_V B\|^2]}{\sigma_l^2}.$$

Since $E[\|\mathrm{proj}_V B\|^2] = \sum_{j=1}^r m_l\big((\mu_l \cdot v_j)^2 + \sigma_l^2\big)$, we have:

$$\Pr(B) = \Pr\Big(\sum_{i=1}^{m_l}\sum_{j=1}^r X_{ij}^2 > (1+\epsilon)\frac{\sum_{j=1}^r m_l\big((\mu_l \cdot v_j)^2 + \sigma_l^2\big)}{\sigma_l^2}\Big).$$

Note that $Z = \sum_{i=1}^{m_l}\sum_{j=1}^r X_{ij}^2$ is a chi-squared random variable with noncentrality parameter $\sum_{j=1}^r \frac{m_l(\mu_l \cdot v_j)^2}{\sigma_l^2}$ and $m_l r$ degrees of freedom. The moment generating function for Z is (see e.g. [7]):

$$E[e^{tZ}] = (1-2t)^{-m_l r/2}\, \exp\Big(\frac{t}{1-2t}\sum_{j=1}^r \frac{m_l(\mu_l \cdot v_j)^2}{\sigma_l^2}\Big).$$

By Markov's inequality,

$$\Pr(B) \leq \frac{E[e^{tZ}]}{\exp\Big(t(1+\epsilon)\sum_{j=1}^r \frac{m_l\big((\mu_l \cdot v_j)^2 + \sigma_l^2\big)}{\sigma_l^2}\Big)}.$$

So we obtain the bound on Pr(B):

$$\Pr(B) \leq (1-2t)^{-m_l r/2}\, \exp\Big(t\sum_{j=1}^r \frac{m_l(\mu_l \cdot v_j)^2}{\sigma_l^2}\Big[\frac{1}{1-2t} - (1+\epsilon)\Big]\Big)\, e^{-t(1+\epsilon)m_l r}.$$

Using the fact that $\frac{1}{1-2t} \leq e^{2t+4t^2}$, and setting $t = \epsilon/4$, we have $\epsilon - 2t(\epsilon+1) > 0$ (so the middle factor is at most 1), and so

$$\Pr(B) \leq e^{(2t+4t^2)\frac{m_l r}{2}}\, e^{-t(1+\epsilon)m_l r} = e^{-\epsilon^2 m_l r/8}.$$

Therefore, we obtain the following bound on $\Pr(E/(m_1, \dots, m_k))$:

$$\Pr(E/(m_1, \dots, m_k)) \leq \max_l e^{-\epsilon^2 m_l r/8} \leq e^{-\epsilon^2 m r/8}.$$

Since this is true regardless of the choice of $m_1, \dots, m_k$, the probability of E itself is at most the above.
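The following small experiment (ours; parameters arbitrary) illustrates the phenomenon this section is after: for a reasonably separated spherical mixture, the top-k right singular subspace of the sample matrix retains almost all of the squared norm of each mean vector.

```python
# Illustration of Section 4.1: the top-k right singular subspace of the sample
# matrix nearly contains the component means. Parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n, k, m_per = 100, 3, 2000
sigma = 1.0
# Means with norm about 10*sigma, so pairwise separation is a large constant times sigma.
means = 10 * sigma * rng.normal(size=(k, n)) / np.sqrt(n)

A = np.vstack([mu + sigma * rng.normal(size=(m_per, n)) for mu in means])
_, _, vt = np.linalg.svd(A, full_matrices=False)
V = vt[:k].T                                  # n x k, top right singular vectors

for i, mu in enumerate(means):
    kept = np.linalg.norm(V.T @ mu) ** 2 / np.linalg.norm(mu) ** 2
    print(f"component {i}: fraction of ||mu||^2 retained = {kept:.4f}")
```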

Theorem 2. Let the rows of $A \in \mathbb{R}^{m \times n}$ be picked according to a mixture of Gaussians $\{F_1, \dots, F_k\}$ with mixing weights $(w_1, \dots, w_k)$, means $(\mu_1, \dots, \mu_k)$ and variances $(\sigma_1^2, \dots, \sigma_k^2)$. Let $V \subseteq \mathbb{R}^n$ be the k-dimensional subspace spanned by the top k right singular vectors of A, and let U be the subspace spanned by the mean vectors $\mu_1, \dots, \mu_k$. Then for any $\frac{1}{2} > \epsilon > 0$, with

$$m > \frac{1000\, n}{\epsilon^2 w_{\min}}\left(n\ln\frac{2}{\epsilon} + \max_i\frac{\|\mu_i\|^2}{\sigma_i^2} + \frac{1}{n-k}\ln\frac{1}{\delta}\right),$$

with probability at least $1 - \delta$,

$$\|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \leq \epsilon\, m(n-k)\sum_{i=1}^k w_i\sigma_i^2.$$

Proof. If V maximizes $\|\mathrm{proj}_V A\|^2$ among k-dimensional subspaces, then its orthogonal complement $\bar{V}$ minimizes $\|\mathrm{proj}_{\bar{V}} A\|^2$ among (n-k)-dimensional subspaces. This is because

$$\|A\|^2 = \|\mathrm{proj}_V A\|^2 + \|\mathrm{proj}_{\bar{V}} A\|^2.$$

Let $\bar{E}$ be the opposite of the desired event, i.e. $\bar{E} \equiv \|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 > \epsilon\, m(n-k)\sum_{i=1}^k w_i\sigma_i^2$. Then $\Pr(\bar{E})$ is

$$\Pr\Big(\|\mathrm{proj}_{\bar{V}} E[A]\|^2 - \|\mathrm{proj}_{\bar{U}} E[A]\|^2 > \epsilon\, m(n-k)\sum_{i=1}^k w_i\sigma_i^2\Big),$$

which is at most the probability that there exists an (n-k)-dimensional subspace W such that $\|\mathrm{proj}_W A\|^2 \leq \|\mathrm{proj}_{\bar{U}} A\|^2$ and

$$\|\mathrm{proj}_W E[A]\|^2 - \|\mathrm{proj}_{\bar{U}} E[A]\|^2 > \epsilon\, m(n-k)\sum_{i=1}^k w_i\sigma_i^2.$$

Note that by Theorem 1,

$$\|\mathrm{proj}_W E[A]\|^2 - \|\mathrm{proj}_{\bar{U}} E[A]\|^2 = E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar{U}} A\|^2].$$

Now, $\|\mathrm{proj}_{\bar{U}} E[A]\|^2 = 0$, since U and $\bar{U}$ are orthogonal subspaces. Also,

$$E[\|\mathrm{proj}_{\bar{U}} A\|^2] = \|\mathrm{proj}_{\bar{U}} E[A]\|^2 + m(n-k)\sum_{i=1}^k w_i\sigma_i^2 = m(n-k)\sum_{i=1}^k w_i\sigma_i^2.$$

So, $\Pr(\bar{E})$ is at most the probability that there exists W such that $\|\mathrm{proj}_W A\|^2 \leq \|\mathrm{proj}_{\bar{U}} A\|^2$ and

$$E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar{U}} A\|^2].$$

Note that we can rewrite $\|\mathrm{proj}_W A\|^2 \leq \|\mathrm{proj}_{\bar{U}} A\|^2$ as:

$$E[\|\mathrm{proj}_W A\|^2] - \|\mathrm{proj}_W A\|^2 + \|\mathrm{proj}_{\bar{U}} A\|^2 - E[\|\mathrm{proj}_{\bar{U}} A\|^2] \geq E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar{U}} A\|^2].$$

The required probability is thus at most the maximum of the probability of the following two events (one of the two must occur for $\bar{E}$ to occur):

$$\mathcal{A}:\ \exists W:\ E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar{U}} A\|^2]\ \text{and}\ E[\|\mathrm{proj}_W A\|^2] - \|\mathrm{proj}_W A\|^2 \geq \tfrac{1}{2}\big(E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar{U}} A\|^2]\big),$$

$$\mathcal{B}:\ \exists W:\ E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar{U}} A\|^2]\ \text{and}\ \|\mathrm{proj}_{\bar{U}} A\|^2 - E[\|\mathrm{proj}_{\bar{U}} A\|^2] \geq \tfrac{1}{2}\big(E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar{U}} A\|^2]\big).$$

The probability of $\mathcal{B}$ is at most the probability that there exists W such that $\|\mathrm{proj}_{\bar{U}} A\|^2 - E[\|\mathrm{proj}_{\bar{U}} A\|^2] \geq \tfrac{1}{2}\big((1+\epsilon)E[\|\mathrm{proj}_{\bar{U}} A\|^2] - E[\|\mathrm{proj}_{\bar{U}} A\|^2]\big)$, which is at most

$$\Pr\Big(\|\mathrm{proj}_{\bar{U}} A\|^2 \geq \big(1 + \tfrac{\epsilon}{2}\big)E[\|\mathrm{proj}_{\bar{U}} A\|^2]\Big) \leq e^{-\epsilon^2 m\, w_{\min} r/32}.$$

The last inequality follows from Lemma 3 and the fact that each Gaussian generates at least $\frac{w_{\min} m}{2}$ rows (the probability that this does not happen is much smaller than $\delta$). Now the probability of $\mathcal{A}$ is at most

$$\Pr\Big(\exists W:\ \|\mathrm{proj}_W A\|^2 \leq E[\|\mathrm{proj}_W A\|^2] - \tfrac{1}{2}\big(E[\|\mathrm{proj}_W A\|^2] - E[\|\mathrm{proj}_{\bar{U}} A\|^2]\big)\ \text{and}\ E[\|\mathrm{proj}_W A\|^2] > (1+\epsilon)E[\|\mathrm{proj}_{\bar{U}} A\|^2]\Big)$$
$$\leq \Pr\Big(\exists W:\ \|\mathrm{proj}_W A\|^2 \leq E[\|\mathrm{proj}_W A\|^2] - \tfrac{1}{2}\Big(1 - \frac{1}{1+\epsilon}\Big)E[\|\mathrm{proj}_W A\|^2]\Big)$$
$$\leq \Pr\Big(\exists W:\ \|\mathrm{proj}_W A\|^2 \leq \big(1 - \tfrac{\epsilon}{4}\big)E[\|\mathrm{proj}_W A\|^2]\Big)$$
$$= \Pr\Big(\exists W:\ \|\mathrm{proj}_W A\|^2 \leq \big(1 - \tfrac{\epsilon}{8}\big)E[\|\mathrm{proj}_W A\|^2] - \tfrac{\epsilon}{8}E[\|\mathrm{proj}_W A\|^2]\Big)$$
$$\leq 1 - \Pr\Big(\forall W:\ \|\mathrm{proj}_W A\|^2 > \big(1 - \tfrac{\epsilon}{8}\big)E[\|\mathrm{proj}_W A\|^2] - \tfrac{\epsilon}{8}E[\|\mathrm{proj}_{\bar{U}} A\|^2]\Big).$$

Here we used the fact that $\epsilon < 1$. With

$$\tau^2 = \frac{\epsilon\, E[\|\mathrm{proj}_{\bar{U}} A\|^2]}{16(n-k)E[\|A\|^2]}$$

and

$$\frac{512}{\epsilon^2}\left(n\ln\frac{2}{\tau} + \frac{1}{n-k}\ln\frac{1}{\delta}\right)$$

samples from each Gaussian, by Lemma 4 we get that the probability of $\bar{E}$ is at most $\delta$, which is the desired result.

Corollary 3 (Likely Best Subspace). Let $\mu_1, \dots, \mu_k$ be the means of the k Gaussians in the mixture and $w_{\min}$ the smallest mixing weight. Let $\mu'_1, \dots, \mu'_k$ be their projections onto the subspace spanned by the top k right singular vectors of the sample matrix A. With a sample of size $m = O\!\left(\frac{n}{\epsilon^2 w_{\min}}\right)$ we have with high probability

$$\sum_{i=1}^k w_i\big(\|\mu_i\|^2 - \|\mu'_i\|^2\big) \leq \epsilon(n-k)\sum_{i=1}^k w_i\sigma_i^2.$$

Proof. By Theorem 2, with high probability

$$m\sum_{i=1}^k w_i\big(\|\mu_i\|^2 - \|\mu'_i\|^2\big) = \|\mathrm{proj}_U E[A]\|^2 - \|\mathrm{proj}_V E[A]\|^2 \leq \epsilon\, m(n-k)\sum_{i=1}^k w_i\sigma_i^2,$$

and dividing by m gives the claim.

The following lemma about a sufficiently dense net of subspaces was used in the proof of Theorem 2. Its proof appears in the Appendix.

Lemma 4. Suppose A has at least m rows from each Gaussian. Then, for any $\epsilon, \tau > 0$, and any r such that $1 \leq r \leq n$, the probability that every subspace W of dimension r satisfies

$$\|\mathrm{proj}_W A\|^2 > (1-\epsilon)E[\|\mathrm{proj}_W A\|^2] - 2\tau^2 r E[\|A\|^2]$$

is at least $1 - \left(\frac{2}{\tau}\right)^{rn} e^{-\epsilon^2 m r/8}$.

As mentioned in the introduction, random projection does not preserve the distance between the means. The following figures illustrate the difference between random projection and the SVD-based projection. Figure 1 and Figure 3 are 2-dimensional random projections of samples from two different 49-dimensional mixtures (one with k = 2, the other with k = 4). Figure 2 and Figure 4 are the projections to the best rank 2 subspaces of the same data sets.

Figure 1. RP1

Figure 2. SVD1

Figure 3. RP2

Figure 4. SVD2

4.2. After projecting to the best subspace

To see the main idea, let $D = \max_{i,j} \sigma_i/\sigma_j$. If D is small, we can apply Corollary 3 with $\epsilon$ chosen so that the projected means satisfy

$$\|\mu'_i - \mu'_j\| \geq \|\mu_i - \mu_j\| - \hat{\epsilon}(\sigma_i + \sigma_j).$$

Now the projection of a spherical Gaussian onto any subspace remains a spherical Gaussian, and so the radius is now $\sigma_i\sqrt{k}$. With the radius reduced to this extent, and a separation as in (2), we can use distance concentration to correctly classify a random subsample of size $\mathrm{poly}(k)/w_{\min}$. The details will be clear in what follows, but the idea is that points from any Gaussian with approximately the smallest radius will be separated from the rest of the sample. We can identify such a Gaussian and repeat the procedure. If we assume only the minimum possible separation,

$$\|\mu_i - \mu_j\| > C\max\{\sigma_i, \sigma_j\},$$

then after projection the separation is

$$\|\mu'_i - \mu'_j\| > (C - 2\hat{\epsilon})\max\{\sigma_i, \sigma_j\},$$

which is the radius divided by $\sqrt{k}$ (note that we started with a separation of radius divided by $\sqrt{n}$). Here we could use an exponential in k algorithm to essentially examine all possible partitions of the data set and select the best explanation. This would have a running time of $\left(\frac{k}{w_{\min}}\right)^{O(k)}\mathrm{poly}(n)$.

Now let us consider the general case (when D is large or unknown). We can learn the Gaussians by the following implementation of step (3) of the algorithm. In the description of the algorithm below, S is the set of m projected points. For convenience, we will assume that $k > 96\ln\frac{m}{\delta}$ (if not, we can run an exponential in k algorithm and still be polynomial in $n/w_{\min}$).

Algorithm.

(i) Let $R = \max_{x \in S}\min_{y \in S\setminus\{x\}}\|x - y\|$.

(ii) Discard all points from S whose closest point lies at squared distance at most $3\hat{\epsilon}R^2$, to form a new set S'.

(iii) Let x, w be the two closest points in the remaining set, with $r = \|x - w\|^2$; let H be the set of all points at squared distance at most $l = r\Big(1 + 8\sqrt{\frac{6\ln(m/\delta)}{k}}\Big)$ from x.

(iv) Report H as a Gaussian, and remove the points in H. Repeat step (iii) till there are no points left.

(v) Output all Gaussians learned in step (iv) with variance greater than $3\hat{\epsilon}R^2/k$.

(vi) Remove the points from step (v) from the original sample S, and repeat the entire algorithm (including the SVD calculation and projection) on the rest.
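Here is a sketch of steps (i)-(v) on the set S of projected k-dimensional points, assuming values for $\hat{\epsilon}$ and $\delta$ are supplied. It is illustrative only: the thresholds mirror the form of the steps above, the function name is ours, and degenerate cases (e.g. a single leftover point) are not handled.

```python
# Sketch of the distance-based classification (steps (i)-(v)) on projected points.
import numpy as np

def classify_projected(S, eps_hat=0.05, delta=0.01):
    m, k = S.shape
    log_term = np.log(m / delta)

    # (i) R^2 = max over points of the squared distance to its nearest neighbor.
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nearest = d2.min(axis=1)
    R2 = nearest.max()

    # (ii) keep only points whose nearest neighbor is farther than 3*eps_hat*R^2.
    remaining = S[nearest > 3 * eps_hat * R2]
    groups = []

    # (iii)-(iv) repeatedly grow a group around the current closest pair.
    while len(remaining) > 1:            # a single leftover point, if any, is ignored
        dd = ((remaining[:, None, :] - remaining[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(dd, np.inf)
        i, j = np.unravel_index(np.argmin(dd), dd.shape)
        r = dd[i, j]
        cutoff = r * (1 + 8 * np.sqrt(6 * log_term / k))
        in_group = dd[i] <= cutoff
        in_group[i] = True
        groups.append(remaining[in_group])
        remaining = remaining[~in_group]

    # (v) report only groups whose variance exceeds 3*eps_hat*R^2/k.
    return [g for g in groups if g.var() > 3 * eps_hat * R2 / k]
```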

We show that each Gaussian output in step (v) of the algorithm above is correct with high probability, i.e. it is exactly the subsample of points from a single Gaussian. Further, at least one Gaussian is output during every iteration. The key idea is that on projection the center of a Gaussian with a large variance does not move much relative to its variance. Our main tool is the following distance concentration lemma (whose proof appears in the Appendix).

Lemma 5. Let $X \in F_i$ and $Y \in F_j$, with $i \neq j$, where $F_i, F_j$ are k-dimensional Gaussians with means $\mu'_i, \mu'_j$ and variances $\sigma_i^2, \sigma_j^2$. Then for $\alpha > 0$, the probability that

$$\Big|\|X - Y\|^2 - E[\|X - Y\|^2]\Big| \geq \alpha\Big((\sigma_i^2 + \sigma_j^2)\sqrt{k} + 2\|\mu'_i - \mu'_j\|\sqrt{\sigma_i^2 + \sigma_j^2}\Big)$$

is at most $4e^{-\alpha^2/8}$.

Theorem 3. With a sample of size

$$m = O\!\left(\frac{n^3 k}{w_{\min}}\left(\ln\frac{1}{w_{\min}} + \max_i\frac{\|\mu_i\|^2}{\sigma_i^2}\right) + n\ln\frac{4k}{\delta}\right)$$

and initial separation

$$\|\mu_i - \mu_j\| \geq 85\max\{\sigma_i, \sigma_j\}\Big(k\ln\frac{m}{\delta}\Big)^{1/4},$$

the algorithm correctly classifies all Gaussians with probability at least $1 - \delta$.

Proof. First, let us apply Corollary 3 with $\epsilon = \frac{w_{\min}\hat{\epsilon}}{n-k}$. It follows that

$$\sum_{i=1}^k w_i\big(\|\mu_i\|^2 - \|\mu'_i\|^2\big) \leq w_{\min}\hat{\epsilon}\sum_{i=1}^k w_i\sigma_i^2.$$

Let $\bar{\sigma}^2$ be the largest variance. Then,

$$\sum_{i=1}^k \big(\|\mu_i\|^2 - \|\mu'_i\|^2\big) \leq \hat{\epsilon}\sum_{i=1}^k w_i\sigma_i^2 \leq \hat{\epsilon}\bar{\sigma}^2.$$

So in particular, for all i,

$$\|\mu_i - \mu'_i\|^2 = \|\mu_i\|^2 - \|\mu'_i\|^2 \leq \hat{\epsilon}\bar{\sigma}^2.$$

Next we use Lemma 5 to bound the variation in pairwise distances. We obtain the following with probability at least $1 - \frac{\delta}{m}$. For any Gaussian $F_i$ and any two points x, y drawn from it,

$$\Big|\|x - y\|^2 - 2\sigma_i^2 k\Big| \leq 4\sigma_i^2\sqrt{6k\ln\frac{m}{\delta}}. \qquad (5)$$

This follows from Lemma 5 with $\alpha = \sqrt{24\ln\frac{m}{\delta}}$.

Let $F_i, F_j$ be two Gaussians with $\sigma_i^2 \leq \sigma_j^2$ and $\sigma_j^2 > \hat{\epsilon}\bar{\sigma}^2$. For any two points x from $F_i$ and y from $F_j$,

$$\|x - y\|^2 \geq (\sigma_i^2 + \sigma_j^2)k + 33\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}}. \qquad (6)$$

By Lemma 5 we have:

$$\|x - y\|^2 \geq (\sigma_i^2 + \sigma_j^2)k - 2(\sigma_i^2 + \sigma_j^2)\sqrt{6k\ln\frac{m}{\delta}} + \|\mu'_i - \mu'_j\|\Big(\|\mu'_i - \mu'_j\| - 4\sqrt{6(\sigma_i^2 + \sigma_j^2)\ln\frac{m}{\delta}}\Big).$$

Also,

$$\|\mu'_i - \mu'_j\|^2 \geq 85\sigma_j^2\sqrt{k\ln\frac{m}{\delta}} - 2\hat{\epsilon}\bar{\sigma}^2.$$

The condition (6) follows using the facts that $\sigma_j^2 > \hat{\epsilon}\bar{\sigma}^2$ and $k > 96\ln\frac{m}{\delta}$.

For the following observations, we condition on the event that (5) and (6) hold.

1. Let $y \in S'$ be from some Gaussian $F_i$. Then $\sigma_i^2 \geq \hat{\epsilon}\bar{\sigma}^2$, i.e. all the points in S' are from Gaussians with large variance.

2. Let x, H be the point and set used in any iteration of step (iii). Let $x \in F_j$. Then $H = S' \cap F_j$, i.e. it is the points in S' from $F_j$.

3. For any Gaussian $F_i$ with $\sigma_i^2 > 3\hat{\epsilon}R^2/k$, we have $F_i \subseteq S'$, i.e. any Gaussian with sufficiently large variance will be contained in S'.

4. At least one Gaussian is output in each run of the algorithm.

The correctness of the algorithm follows from these observations. We proceed to prove them.

1. Suppose not, i.e. $\sigma_i^2 < \hat{\epsilon}\bar{\sigma}^2$. Let z be another point in S from $F_i$. Then by (5) we have that:

$$\|y - z\|^2 \leq 2\sigma_i^2 k + 4\sigma_i^2\sqrt{6k\ln\frac{m}{\delta}} < 2\hat{\epsilon}\bar{\sigma}^2 k + 4\hat{\epsilon}\bar{\sigma}^2\sqrt{6k\ln\frac{m}{\delta}} \leq 2\hat{\epsilon}R^2 + 4\hat{\epsilon}R^2\sqrt{\frac{6\ln\frac{m}{\delta}}{k}} \leq 3\hat{\epsilon}R^2.$$

Here we have used the fact that $R^2 \geq \bar{\sigma}^2 k$ (by (6)), and also the assumption that $k > 96\ln\frac{m}{\delta}$. But then the above implies that $y \notin S'$, which is a contradiction.

2. By (5), we have the following bounds on r:

$$2\sigma_j^2 k - 4\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}} \leq r \leq 2\sigma_j^2 k + 4\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}}.$$

Therefore, we have the following bounds on l, our cutoff for belonging to H:

$$2\sigma_j^2 k + 4\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}} \leq l \leq 2\sigma_j^2 k + 28\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}}. \qquad (7)$$

By (5), every point $z \in S'$ with $z \in F_j$ has squared distance to x at most $2\sigma_j^2 k + 4\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}}$, and is therefore included in H. On the other hand, every point $z \in S'$ with $z \in F_i$, $i \neq j$, has squared distance to x at least

$$(\sigma_i^2 + \sigma_j^2)k + 33\max\{\sigma_i^2, \sigma_j^2\}\sqrt{6k\ln\frac{m}{\delta}}.$$

Since we selected x, w as the closest points in S', $\sigma_i^2$ is not much smaller than $\sigma_j^2$: by (5) and (6), $\sigma_i^2 \geq \sigma_j^2 - 4\sigma_j^2\sqrt{\frac{6\ln(m/\delta)}{k}}$. Substituting this into the bound above, the minimum squared distance between $z \notin F_j$ and x is at least

$$2\sigma_j^2 k + 29\sigma_j^2\sqrt{6k\ln\frac{m}{\delta}}.$$

Since this is larger than the upper bound on l in (7), z cannot be included in H.

3. Let $x \in F_i$. We want to show that $x \in S'$, so we need to show the following:

$$\forall z,\ \|x - z\|^2 > 3\hat{\epsilon}R^2.$$

There are two cases. If $z \in F_i$, then by (5) we have:

$$\|x - z\|^2 \geq 2\sigma_i^2 k - 4\sigma_i^2\sqrt{6k\ln\frac{m}{\delta}} \geq \sigma_i^2 k > 3\hat{\epsilon}R^2.$$

The second inequality follows since $k > 96\ln\frac{m}{\delta}$ (so that $4\sqrt{6k\ln\frac{m}{\delta}} \leq k$), and the last from $\sigma_i^2 > 3\hat{\epsilon}R^2/k$. If $z \in F_s$, $s \neq i$, then by (6) we have:

$$\|x - z\|^2 \geq (\sigma_s^2 + \sigma_i^2)k + 33\sigma_i^2\sqrt{6k\ln\frac{m}{\delta}} > 3\hat{\epsilon}R^2.$$

4. We show that by setting $\hat{\epsilon} < \frac{1}{9}$, we output at least one Gaussian in step (v). By (5), we have the following bound on $R^2$:

$$R^2 \leq 2\bar{\sigma}^2 k + 4\bar{\sigma}^2\sqrt{6k\ln\frac{m}{\delta}} \leq 3\bar{\sigma}^2 k.$$

Furthermore, by the previous observation, any Gaussian with $\sigma_s^2 > 3\hat{\epsilon}R^2/k$ will be correctly and completely classified by the algorithm in step (v). Therefore, if $\hat{\epsilon} < \frac{1}{9}$, we have that:

$$3\hat{\epsilon}R^2 \leq 9\hat{\epsilon}\bar{\sigma}^2 k < \bar{\sigma}^2 k.$$

This means that in each iteration, we are guaranteed to output the Gaussian with the largest variance.

It remains to bound the success probability of the algorithm. For this we need

- Corollary 3 to hold up to k times, and
- steps (i)-(v) to be successful up to k times.

The probability of the first event is at least $(1 - \frac{\delta}{4k})^k \geq 1 - \delta/2$. The probability of the latter event is just the probability that distance concentration holds, which is $(1 - \frac{\delta}{m}) \geq (1 - \frac{\delta}{4k})$. Therefore, the probability that the algorithm succeeds is at least $(1 - \frac{\delta}{4k})^k(1 - \delta/2) \geq 1 - \delta$.

5. Mixtures of isotropic distributions

How general is the algorithm? The results about the expected best subspace apply to any mixture of weakly isotropic distributions. For the algorithm to work, we also need two kinds of concentration bounds: one to show that with a small number of samples the likely best subspace is close to the expected one, and the other to show distance concentration. A nice class of distributions that have such bounds are weakly isotropic log-concave distributions. Further, any projection of a log-concave distribution is also log-concave. A distribution with density f(x) is log-concave if for any two points x, y and any $0 \leq \lambda \leq 1$, $f(\lambda x + (1-\lambda)y) \geq f(x)^\lambda f(y)^{1-\lambda}$. This includes the special case of the uniform distribution on a weakly isotropic convex body, e.g. cubes, balls, etc. Although the weakly isotropic property might sound restrictive, it is worth noting that any single log-concave distribution can be made weakly isotropic by a linear transformation.

In what follows, we summarize our results about mixtures of such distributions. As in the Gaussian case, the expected squared distance between two random points is $2\sigma^2 k$ if they are from the same k-dimensional distribution and $\sigma_1^2 k + \sigma_2^2 k + \|\mu_1 - \mu_2\|^2$ if they are from different distributions. The concentration is weaker: instead of a deviation of only $O(\sqrt{k\log m})$ for m points, it could be as high as $O(k\log m)$. This leads to the following guarantee.

Theorem 4. Any mixture of k weakly isotropic log-concave distributions in $\mathbb{R}^n$ can be learned in $\mathrm{poly}(n, \frac{1}{w_{\min}})$ time using $\mathrm{poly}(n, \frac{1}{w_{\min}})$ samples given that any two components of the mixture are separated as

$$\|\mu_i - \mu_j\| > C\max\{\sigma_i, \sigma_j\}\, k^{1/2}\log^{1/2}(n/w_{\min}).$$

In the statement above, by "learn" we mean correctly classify the sample. Given samples from a component, learning that component depends on the nature of the distribution. We can certainly report the mean and variance of each component, but other information depends on the specific distribution. So, for example, we could learn a mixture of axis-aligned cubes whose centers (centroids) are separated by about $\sqrt{k}$ times the larger of the standard deviations. The first step of the algorithm once again projects to the best rank k subspace. This results in maintaining the inter-center distances while shrinking the radii of the distributions to the extent that the distance concentration bounds are enough to classify the sample correctly.
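As a concrete instance of this setting (our own illustration with arbitrary parameters), the sketch below draws samples from a mixture of two axis-aligned cubes in $\mathbb{R}^n$, projects to the best rank-2 subspace, and separates the two components with a simple threshold on the projected coordinates.

```python
# Mixture of two axis-aligned cubes in R^n, learned by the same SVD projection.
import numpy as np

rng = np.random.default_rng(3)
n, k, m_per, a = 50, 2, 1000, 1.0           # dimension, components, samples/comp., half-width
sigma = a / np.sqrt(3)                      # per-direction std of Uniform[-a, a]
centers = np.zeros((k, n))
centers[1, 0] = 8 * sigma * np.sqrt(k)      # separation of order sigma * k^{1/2}

A = np.vstack([c + rng.uniform(-a, a, size=(m_per, n)) for c in centers])
_, _, vt = np.linalg.svd(A, full_matrices=False)
P = A @ vt[:k].T                            # project to the top-k subspace

# In the projection the two cubes form well-separated clusters, so a threshold
# on the first projected coordinate recovers the two groups.
labels = (P[:, 0] > P[:, 0].mean()).astype(int)
print("cluster sizes:", np.bincount(labels))   # expect two groups of m_per points
```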

6. Remarks

We remark that the SVD is tolerant to noise whose 2-norm is bounded [12, 11, 2]. Thus even after corruption, the SVD of the sample matrix will recover a subspace that is close to one spanning the mean vectors of the underlying distributions. In this low-dimensional space, one could exhaustively examine subsets of the data to learn the mixture (the ignored portion of the data corresponds to the noise). We also note that although the classical algorithm for SVD takes $O(mn^2)$ time for an $m \times n$ matrix, it is worth investigating whether the faster randomized methods of [8, 4] can be used in this setting. The spectral approach does not seem to directly apply to distributions that are not weakly isotropic, e.g. non-spherical Gaussians. It is an open question as to whether it can be generalized/adapted.

References

[1] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proc. 33rd ACM STOC, 2001.
[2] Y. Azar, A. Fiat, A. Karlin, F. McSherry and J. Saia. Spectral analysis of data. In Proc. 33rd ACM STOC, 2001.
[3] S. DasGupta. Learning mixtures of Gaussians. In Proc. IEEE FOCS, 1999.
[4] P. Drineas, R. Kannan, A. Frieze, S. Vempala and V. Vinay. Clustering in large graphs and matrices. In Proc. 10th SODA, 1999.
[5] A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39:1-38, 1977.
[6] S. DasGupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In Uncertainty in Artificial Intelligence, 2000.
[7] M.L. Eaton. Multivariate Statistics. Wiley, New York, 1983.
[8] A. Frieze, R. Kannan and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proc. IEEE FOCS, 1998.
[9] R. Kannan, L. Lovász and M. Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete Computational Geometry, 13:541-559, 1995.
[10] B. Lindsay. Mixture Models: Theory, Geometry and Applications. American Statistical Association, Virginia, 1995.
[11] C. Papadimitriou, P. Raghavan, H. Tamaki and S. Vempala. Latent semantic indexing: a probabilistic analysis. Journal of Computer and System Sciences, 61:217-235, 2000.
[12] G. Stewart. Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727-764, 1973.
[13] D.M. Titterington, A.F.M. Smith and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.

7. Appendix

Proof (of Lemma 4). We consider a finite set S of r-dimensional subspaces, with |S| = N, such that for any subspace W with orthonormal basis $\{w_1, \dots, w_r\}$, there exists a $W^* \in S$ with orthonormal basis $\{w_1^*, \dots, w_r^*\}$ such that for all i and j, $|w_{ij} - w_{ij}^*| < \tau$, component-wise. Then for any such W, and any $a \in \mathbb{R}^n$,

$$\|\mathrm{proj}_{W^*} a\|^2 = \sum_{i=1}^r (a \cdot w_i^*)^2 = \sum_{i=1}^r \big(a \cdot (w_i^* - w_i + w_i)\big)^2 = \sum_{i=1}^r \big(a \cdot (w_i^* - w_i) + a \cdot w_i\big)^2 \qquad (8)$$
$$\leq \sum_{i=1}^r (a \cdot (w_i^* - w_i))^2 + \|\mathrm{proj}_W a\|^2 \leq r\tau^2\|a\|^2 + \|\mathrm{proj}_W a\|^2.$$

Line (8) follows from the triangle inequality. This gives us a bound on the projection of matrices:

$$\|\mathrm{proj}_W A\|^2 \geq \|\mathrm{proj}_{W^*} A\|^2 - r\tau^2\|A\|^2.$$

Similarly,

$$E[\|\mathrm{proj}_W A\|^2] \leq E[\|\mathrm{proj}_{W^*} A\|^2] + r\tau^2 E[\|A\|^2].$$

Combining these two we get

$$\|\mathrm{proj}_W A\|^2 - (1-\epsilon)E[\|\mathrm{proj}_W A\|^2] \geq \|\mathrm{proj}_{W^*} A\|^2 - (1-\epsilon)E[\|\mathrm{proj}_{W^*} A\|^2] - r\tau^2\big(\|A\|^2 + (1-\epsilon)E[\|A\|^2]\big).$$

Using Lemma 3 and the fact that $\Pr\big(\|A\|^2 > (1+\epsilon)E[\|A\|^2]\big) < e^{-\epsilon^2 m n/8}$, we get that

$$\Pr\Big(\forall W:\ \|\mathrm{proj}_W A\|^2 > (1-\epsilon)E[\|\mathrm{proj}_W A\|^2] - 2\tau^2 r E[\|A\|^2]\Big)$$

is at least $1 - (N+1)e^{-\epsilon^2 m r/8}$. A simple upper bound on N + 1 is $\left(\frac{2}{\tau}\right)^{rn}$, the number of grid points in the cube $[-1, 1]^{nr}$ with grid size $\tau$. The lemma follows.

Proof (of Lemma 5). We consider the probability that

$$\|X - Y\|^2 - E[\|X - Y\|^2] \geq \alpha\Big((\sigma_s^2 + \sigma_t^2)\sqrt{k} + 2\|\mu'_s - \mu'_t\|\sqrt{\sigma_s^2 + \sigma_t^2}\Big).$$

The other part is analogous. We write $\|X - Y\|^2 = \sum_{i=1}^k \big(\sqrt{\sigma_s^2 + \sigma_t^2}\, Z_i + (\mu'_{si} - \mu'_{ti})\big)^2$, where the $Z_i$ are N(0, 1) random variables. Therefore, the event above is

$$\sum_{i=1}^k \big(\sqrt{\sigma_s^2 + \sigma_t^2}\, Z_i + (\mu'_{si} - \mu'_{ti})\big)^2 \geq (\sigma_s^2 + \sigma_t^2)k + \|\mu'_s - \mu'_t\|^2 + \alpha(\sigma_s^2 + \sigma_t^2)\sqrt{k} + 2\alpha\|\mu'_s - \mu'_t\|\sqrt{\sigma_s^2 + \sigma_t^2}.$$

The probability of this occurring is at most the sum of the probabilities of the following two events:

$$\mathcal{A} \equiv \sum_{i=1}^k (\sigma_s^2 + \sigma_t^2)Z_i^2 \geq (\sigma_s^2 + \sigma_t^2)(k + \alpha\sqrt{k}),$$
$$\mathcal{B} \equiv \sum_{i=1}^k 2(\mu'_{si} - \mu'_{ti})\sqrt{\sigma_s^2 + \sigma_t^2}\, Z_i \geq 2\alpha\|\mu'_s - \mu'_t\|\sqrt{\sigma_s^2 + \sigma_t^2}.$$

Simplifying, we get:

$$\Pr(\mathcal{A}) \leq \Pr\Big(\sum_{i=1}^k Z_i^2 \geq k + \alpha\sqrt{k}\Big) \leq e^{-\alpha^2/8},$$
$$\Pr(\mathcal{B}) \leq \Pr\Big(\sum_{i=1}^k (\mu'_{si} - \mu'_{ti})Z_i \geq \alpha\|\mu'_s - \mu'_t\|\Big) \leq e^{-\alpha^2/8},$$

which gives us the desired bound.
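A quick Monte Carlo check of the bound just proved (our own illustration; parameters arbitrary): we estimate the tail probability in Lemma 5 for two k-dimensional spherical Gaussians and compare it to $4e^{-\alpha^2/8}$.

```python
# Monte Carlo sanity check of the distance concentration in Lemma 5.
import numpy as np

rng = np.random.default_rng(4)
k, s, t, alpha, trials = 20, 1.0, 1.5, 6.0, 100_000
mu_s, mu_t = np.zeros(k), np.zeros(k)
mu_t[0] = 5.0
sep = np.linalg.norm(mu_s - mu_t)
s2t2 = s**2 + t**2

X = mu_s + s * rng.normal(size=(trials, k))
Y = mu_t + t * rng.normal(size=(trials, k))
d2 = ((X - Y) ** 2).sum(axis=1)
mean_d2 = s2t2 * k + sep**2                        # E||X - Y||^2
slack = alpha * (s2t2 * np.sqrt(k) + 2 * sep * np.sqrt(s2t2))
print("empirical tail prob:", np.mean(np.abs(d2 - mean_d2) >= slack))
print("lemma's bound      :", 4 * np.exp(-alpha**2 / 8))
```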