Achieving Exact Cluster Recovery Threshold via Semidefinite Programming: Extensions
arXiv:1502.07738v1 [stat.ML] 26 Feb 2015
Bruce Hajek, Jiaming Xu∗, Yihong Wu

February 27, 2015
Abstract. Recently it has been shown in [11] that the semidefinite programming (SDP) relaxation of the maximum likelihood estimator achieves the sharp threshold for exactly recovering the community structure under the binary stochastic block model of two equal-sized clusters. Extending the techniques in [11], in this paper we show that SDP relaxations also achieve the sharp recovery threshold in the following cases: (1) the binary stochastic block model with two clusters of sizes proportional to n but not necessarily equal; (2) the stochastic block model with a fixed number of equal-sized clusters; (3) the binary censored block model with an Erdős–Rényi background graph.
1 Introduction
The stochastic block model [13], also known as the planted partition model [6], is a popular model for studying the community detection and graph partitioning problem (see, e.g., [16, 7, 18, 19, 15, 4, 5] and the references therein). In its simple form, it assumes that n vertices are equally partitioned into r clusters of size K = n/r; a random graph G is generated based on the cluster structure, where each pair of vertices is connected independently with probability p if they are in the same cluster and q otherwise. In this paper, we focus on the problem of exactly recovering the clusters (up to a permutation of cluster indices) based on the graph G. Under the setting with two equal-sized clusters, or a single cluster of size proportional to n plus outlier vertices, it has recently been shown in [11] that the semidefinite programming (SDP) relaxation of the maximum likelihood (ML) estimator achieves the optimal recovery threshold with high probability, in the asymptotic regime p = a log n/n and q = b log n/n for fixed constants a, b as n → ∞. In this paper, we extend the optimality of SDP to the following three cases, still assuming p = a log n/n and q = b log n/n with a > b > 0:

• Stochastic block model with two asymmetric clusters: the first cluster consists of K vertices and the second cluster consists of n − K vertices, with K = ⌊ρn⌋ for some ρ ∈ (0, 1/2].

• Stochastic block model with r clusters of equal size K: r ≥ 2 is a fixed integer and n = rK.

• Censored block model with two clusters: given an Erdős–Rényi random graph G ~ G(n, p), each edge (i, j) has a label L_ij ∈ {±1} independently drawn according to the distribution

  P(L_ij | σ*_i, σ*_j) = (1 − ε) 1{L_ij = σ*_i σ*_j} + ε 1{L_ij = −σ*_i σ*_j},
The authors are with the Department of ECE, University of Illinois at Urbana-Champaign, Urbana, IL, {b-hajek,yihongwu,jxu18}@illinois.edu. This work was in part presented at the Workshop on Community Detection, February 26-27, Institut Henri Poincar´e, Paris.
where σ*_i = 1 if vertex i is in the first cluster and σ*_i = −1 otherwise; ε ∈ [0, 1/2] is a fixed constant. Under the censored block model, the graph itself does not contain any information about the underlying clusters, and we are interested in recovering the clusters from the observation of the graph and the edge labels.

In all three cases, we show that a necessary condition for the maximum likelihood (ML) estimator to succeed is also a sufficient condition for the correctness of the SDP procedure, thereby establishing both the optimal recovery threshold and the optimality of the SDP relaxation. The proof techniques in this paper are similar to those in [11]; however, the construction and validation of dual certificates for the success of the SDP are more challenging due to the more sophisticated nature of the problems. Notably, we resolve the open problem raised in [1, Section 6] about the optimal recovery threshold in the censored block model and show that the optimal recovery threshold can be achieved in polynomial time via SDP.

Notation. Denote the identity and all-one matrix by I and J respectively. We write X ⪰ 0 if X is positive semidefinite and X ≥ 0 if all the entries of X are non-negative. Let S^n denote the set of all n × n symmetric matrices. For X ∈ S^n, let λ_2(X) denote its second smallest eigenvalue. For any matrix Y, let ‖Y‖ denote its spectral norm. For any positive integer n, let [n] = {1, . . . , n}. For any set T ⊂ [n], let |T| denote its cardinality and T^c denote its complement. We use standard big-O notation, e.g., for any sequences {a_n} and {b_n}, a_n = Θ(b_n) or a_n ≍ b_n if there is an absolute constant c > 0 such that 1/c ≤ a_n/b_n ≤ c. Let Bern(p) denote the Bernoulli distribution with mean p and Binom(n, p) the binomial distribution with n trials and success probability p. All logarithms are natural and we use the convention 0 log 0 = 0.
2 Binary asymmetric SBM
Let A denote the adjacency matrix of the graph, and (C*_1, C*_2) the underlying true partition. The cluster structure under the binary stochastic block model can be represented by a vector σ ∈ {±1}^n such that σ_i = 1 if vertex i is in the first cluster and σ_i = −1 otherwise. Let σ* correspond to the true clusters. Then the ML estimator of σ* for the case a > b can be simply stated as

  max_σ Σ_{i,j} A_ij σ_i σ_j
  s.t. σ_i ∈ {±1}, i ∈ [n]
       σ^⊤ 1 = 2K − n,    (1)
which maximizes the number of in-cluster edges minus the number of out-cluster edges subject to the cluster size constraint. If K = n/2, (1) includes the NP-hard minimum graph bisection problem as a special case and is computationally intractable in the worst case, so we consider its SDP relaxation. Let Y = σσ^⊤. Then Y_ii = 1 is equivalent to σ_i = ±1, and σ^⊤ 1 = ±(2K − n) if and only if ⟨Y, J⟩ = (2K − n)². Therefore, (1) can be recast as

  max_{Y,σ} ⟨A, Y⟩
  s.t. Y = σσ^⊤
       Y_ii = 1, i ∈ [n]
       ⟨J, Y⟩ = (2K − n)².    (2)
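Before turning to the relaxation, the combinatorial program (1) can be sanity-checked by brute force on a toy instance. The following sketch (the graph and sizes are hypothetical examples, not from the paper) enumerates all first clusters of size K, i.e., all σ with σ^⊤1 = 2K − n:

```python
from itertools import combinations

# Toy instance: two triangles {0,1,2} and {3,4,5} with no cross edges,
# so the planted partition is the unambiguous maximizer of (1).
n, K = 6, 3
edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)}
A = [[1 if (min(i, j), max(i, j)) in edges else 0 for j in range(n)]
     for i in range(n)]

def objective(C1):
    # sum_{i,j} A_ij sigma_i sigma_j with sigma = +1 on C1 and -1 elsewhere
    s = [1 if i in C1 else -1 for i in range(n)]
    return sum(A[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

# Enumerate all size-K first clusters (the constraint sigma^T 1 = 2K - n)
best = max((set(C1) for C1 in combinations(range(n), K)), key=objective)
```

Each of the 6 in-cluster edges contributes +2 to the objective (once per ordered pair), so the planted partition attains value 12 and any balanced swap strictly decreases it.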
Notice that the matrix Y = σσ^⊤ is a rank-one positive semidefinite matrix. If we relax this condition by dropping the rank-one restriction, we obtain the following convex relaxation of (2), which is a semidefinite program:

  Ŷ_SDP = arg max_Y ⟨A, Y⟩
  s.t. Y ⪰ 0
       Y_ii = 1, i ∈ [n]
       ⟨J, Y⟩ = (2K − n)².    (3)
We note that the only model parameter needed by the estimator (3) is the cluster size K. Let Y* = σ*(σ*)^⊤ and Y_n ≜ {σσ^⊤ : σ ∈ {±1}^n, σ^⊤ 1 = 2K − n}. The following result establishes the optimality of the SDP procedure. Let

  η(ρ, a, b) = aρ + b(1 − ρ) − γ + ((1 − 2ρ)τ/2) · log[ (γ + (1 − 2ρ)τ) aρ / ((γ − (1 − 2ρ)τ) b(1 − ρ)) ],    (4)

where τ = (a − b)/(log a − log b) and γ = √((1 − 2ρ)²τ² + 4ρ(1 − ρ)ab).
Theorem 1. If η(ρ, a, b) > 1, then min_{Y*∈Y_n} P{Ŷ_SDP = Y*} ≥ 1 − n^{−Ω(1)} as n → ∞.
Next we prove a converse for Theorem 1, which shows that the recovery threshold achieved by the SDP relaxation is in fact optimal.

Theorem 2. If η(ρ, a, b) < 1 and σ* is uniformly chosen over {σ ∈ {±1}^n : σ^⊤ 1 = 2K − n}, then for any sequence of estimators Ŷ_n, P{Ŷ_n = Y*} → 0.

In the special case with two equal-sized clusters, we have K = n/2 and η(1/2, a, b) = (√a − √b)²/2. The corresponding threshold (√a − √b)² > 2 has been established in [2, 20] and the achievability by SDP has been shown in [11]. Interestingly, for fixed a and b, the function η(ρ, a, b) in (4) is minimized at ρ = 1/2, indicating that in some sense clustering with equal-sized clusters is the most difficult case. Moreover, the proof of Theorem 2 requires log K/log n → 1. In particular, if log K/log n is bounded away from 1, the recovery threshold is different from lim_{ρ→0} η(ρ, a, b) = a − τ log(ea/τ) > 1. We recently became aware of a work [24] that also studies the exact recovery problem in the unbalanced case and provides a sufficient recovery condition for a polynomial-time two-step procedure based on spectral methods.
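The threshold function (4) is easy to evaluate numerically. The following sketch (the parameter values a = 5, b = 1 are illustrative, not from the paper) checks the two facts stated above: at ρ = 1/2 the formula collapses to (√a − √b)²/2, and unbalancing the clusters raises the threshold function:

```python
from math import log, sqrt

def eta(rho, a, b):
    """Threshold function (4); assumes a > b > 0 and rho in (0, 1/2]."""
    tau = (a - b) / (log(a) - log(b))
    g = sqrt((1 - 2 * rho) ** 2 * tau ** 2 + 4 * rho * (1 - rho) * a * b)
    s = (1 - 2 * rho) * tau  # at rho = 1/2 this is 0 and the log term vanishes
    return a * rho + b * (1 - rho) - g + (s / 2) * log(
        (g + s) * a * rho / ((g - s) * b * (1 - rho)))
```

For example, eta(0.5, 5, 1) equals (√5 − 1)²/2 ≈ 0.764, while eta(0.3, 5, 1) is slightly larger, consistent with ρ = 1/2 being the minimizer.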
3 SBM with multiple equal-sized clusters
The cluster structure under the stochastic block model with r clusters can be represented by r binary vectors ξ_1, . . . , ξ_r ∈ {0, 1}^n, where ξ_k is the indicator function of cluster k, such that ξ_k(i) = 1 if vertex i is in cluster k and ξ_k(i) = 0 otherwise. Let ξ*_1, . . . , ξ*_r correspond to the true clusters and let A denote the adjacency matrix. Then the maximum likelihood (ML) estimator of ξ* for the case a > b can be simply stated as

  max_ξ Σ_{i,j} A_ij Σ_{k=1}^r ξ_k(i) ξ_k(j)
  s.t. ξ_k ∈ {0, 1}^n, k ∈ [r]
       ξ_k^⊤ 1 = K, k ∈ [r]
       ξ_k^⊤ ξ_{k′} = 0, k ≠ k′,    (5)
which maximizes the number of in-cluster edges. When r = 2, this includes the NP-hard minimum graph bisection problem as a special case. Instead, let us consider a convex relaxation similar to the SDP relaxation studied by Goemans and Williamson [10] for MAX CUT and by Frieze and Jerrum [9] for MAX k-CUT and MAX BISECTION. Each vertex i is associated with a vector x_i which is allowed to be one of the r vectors v_1, v_2, . . . , v_r defined as follows: take an equilateral simplex Σ_r in R^{r−1} with vertices v_1, v_2, . . . , v_r such that Σ_{k=1}^r v_k = 0 and ‖v_k‖ = 1 for 1 ≤ k ≤ r. Notice that ⟨v_k, v_{k′}⟩ = −1/(r − 1) for k ≠ k′. Therefore, (5) can be recast as

  max_x Σ_{i,j} A_ij ⟨x_i, x_j⟩
  s.t. x_i ∈ {v_1, v_2, . . . , v_r}, i ∈ [n]
       Σ_i x_i = 0.    (6)
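The simplex vertices can be constructed explicitly; a minimal numpy sketch (one standard construction, embedded in R^r for convenience rather than R^{r−1}) centers and normalizes the standard basis, which yields unit vectors summing to zero with pairwise inner products −1/(r − 1):

```python
import numpy as np

def simplex_vectors(r):
    """Vertices of an equilateral simplex: unit vectors that sum to zero
    with pairwise inner products -1/(r-1). The rows live in R^r but span
    an (r-1)-dimensional subspace, matching the construction in the text."""
    V = np.eye(r) - np.ones((r, r)) / r      # center the standard basis
    return V / np.linalg.norm(V, axis=1, keepdims=True)

V = simplex_vectors(4)
G = V @ V.T   # Gram matrix: ones on the diagonal, -1/3 off the diagonal
```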
To obtain an SDP relaxation, we replace x_i by y_i, which is allowed to be any unit vector in R^n under the constraints ⟨y_i, y_j⟩ ≥ −1/(r − 1) and Σ_i y_i = 0. Defining Y ∈ R^{n×n} such that Y_ij = ⟨y_i, y_j⟩, we obtain an SDP:

  max_Y ⟨A, Y⟩
  s.t. Y ⪰ 0
       Y_ii = 1, i ∈ [n]
       Y_ij ≥ −1/(r − 1), i, j ∈ [n]
       ⟨Y, e_i 1^⊤ + 1 e_i^⊤⟩ = 0, i ∈ [n].    (7)

Letting Z = ((r − 1)/r) Y + (1/r) J, we can equivalently rewrite (7) as

  Ẑ_SDP = arg max_Z ⟨A, Z⟩
  s.t. Z ⪰ 0
       Z_ii = 1, i ∈ [n]
       Z_ij ≥ 0, i, j ∈ [n]
       ⟨Z, e_i 1^⊤ + 1 e_i^⊤⟩ = 2K, i ∈ [n].    (8)

We note that the only model parameter needed by the estimator (8) is the cluster size K. Let Z* = Σ_{k=1}^r ξ*_k (ξ*_k)^⊤ correspond to the true clusters and define

  Z_{n,r} = { Σ_{k=1}^r ξ_k ξ_k^⊤ : ξ_k ∈ {0, 1}^n, ξ_k^⊤ 1 = K, ξ_k^⊤ ξ_{k′} = 0, k ≠ k′ }.
The sufficient condition for the success of the SDP in (8) is given as follows.

Theorem 3. If √a − √b > √r, then min_{Z*∈Z_{n,r}} P{Ẑ_SDP = Z*} ≥ 1 − n^{−Ω(1)} as n → ∞.
The following result establishes the optimality of the SDP procedure.

Theorem 4. If √a − √b < √r and the clusters are uniformly chosen at random among all r-equal-sized partitions of [n], then for any sequence of estimators Ẑ_n, P{Ẑ_n = Z*} → 0 as n → ∞.
In the special case with two equal-sized clusters (r = 2), the recovery threshold √a − √b > √2 is proved in [2, 20] and SDP is shown to achieve the recovery threshold in [11]. We recently became aware of an independent work [24] that proves the optimal recovery threshold can also be obtained in polynomial time via a two-step procedure: first, apply a spectral algorithm to correctly cluster all but o(n) vertices; second, move every vertex to the cluster with which it has the largest number of edges.
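As a sanity check on the change of variables Z = ((r − 1)/r) Y + (1/r) J used to pass from (7) to (8), the following sketch (toy sizes, hypothetical) verifies that the ideal Gram matrix Y* (entries 1 within clusters, −1/(r − 1) across) maps exactly to the cluster matrix Z* (entries 1 within, 0 across), and that the row-sum constraints transform from 0 to K as claimed:

```python
import numpy as np

r, K = 3, 4
n = r * K
labels = np.repeat(np.arange(r), K)           # planted clusters, K vertices each
same = labels[:, None] == labels[None, :]
Y = np.where(same, 1.0, -1.0 / (r - 1))       # ideal solution of (7)
Z = (r - 1) / r * Y + np.ones((n, n)) / r     # change of variables
```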
4 Binary censored block model
Let σ* ∈ {±1}^n correspond to the true clusters. Let A denote the weighted adjacency matrix such that A_ij = 0 if i, j are not connected by an edge; A_ij = 1 if i, j are connected by an edge with label +1; and A_ij = −1 if i, j are connected by an edge with label −1. Then the ML estimator of σ* can be simply stated as

  max_σ Σ_{i,j} A_ij σ_i σ_j
  s.t. σ_i ∈ {±1}, i ∈ [n],    (9)
which maximizes the number of in-cluster +1 edges minus the number of in-cluster −1 edges, or equivalently, maximizes the number of cross-cluster −1 edges minus the number of cross-cluster +1 edges. The NP-hard MAX CUT problem can be reduced to (9) by simply labeling all the edges in the input graph as −1 edges, and thus (9) is computationally intractable in the worst case. Instead, let us consider the same convex relaxation as the SDP relaxation studied in [1]. Let Y = σσ^⊤. Then Y_ii = 1 is equivalent to σ_i = ±1. Therefore, (9) can be recast as

  max_{Y,σ} ⟨A, Y⟩
  s.t. Y = σσ^⊤
       Y_ii = 1, i ∈ [n].    (10)
Notice that the matrix Y = σσ^⊤ is a rank-one positive semidefinite matrix. If we relax this condition by dropping the rank-one restriction, we obtain the following convex relaxation of (10), which is a semidefinite program:

  Ŷ_SDP = arg max_Y ⟨A, Y⟩
  s.t. Y ⪰ 0
       Y_ii = 1, i ∈ [n].    (11)
We remark that (11) does not rely on any knowledge of the model parameters. Let Y* = σ*(σ*)^⊤ and Y_n ≜ {σσ^⊤ : σ ∈ {±1}^n}. The following result establishes the success condition of the SDP procedure.

Theorem 5. If a(√(1 − ε) − √ε)² > 1, then min_{Y*∈Y_n} P{Ŷ_SDP = Y*} ≥ 1 − n^{−Ω(1)} as n → ∞.
Next we prove a converse for Theorem 5 which shows that the recovery threshold achieved by the SDP relaxation is in fact optimal.

Theorem 6. If a(√(1 − ε) − √ε)² < 1 and σ* is uniformly chosen from {±1}^n, then for any sequence of estimators Ŷ_n, P{Ŷ_n = Y*} → 0 as n → ∞.
Theorem 6 still holds if the cluster sizes are proportional to n and known to the estimators, i.e., the prior distribution of σ* is uniform over {σ ∈ {±1}^n : σ^⊤ 1 = 2K − n} for K = ⌊ρn⌋ with ρ ∈ (0, 1/2]. Together with Theorem 5, this implies that the recovery threshold a(√(1 − ε) − √ε)² > 1 is insensitive to ρ, in contrast to what we have seen for the binary stochastic block model.

Exact cluster recovery in the censored block model was previously studied in [1], where it is shown that if ε → 1/2, the maximum likelihood estimator achieves the optimal recovery threshold a(1 − 2ε)² > 2 + o(1), while an SDP relaxation of the ML estimator succeeds if a(1 − 2ε)² > 4 + o(1). The optimal recovery threshold for any fixed ε ∈ (0, 1/2), and whether it can be achieved in polynomial time, were previously unknown. Theorem 5 and Theorem 6 together show that the SDP relaxation achieves the optimal recovery threshold a(√(1 − ε) − √ε)² > 1 for any fixed constant ε ∈ [0, 1/2]. Notice that (√(1 − ε) − √ε)² = (1/2)(1 − 2ε)² + o((1 − 2ε)²) when ε → 1/2.

The above exact recovery threshold in the regime p = a log n/n should be contrasted with the positively correlated recovery threshold in the sparse regime p = a/n for a constant a. In this sparse regime, at least a constant fraction of vertices have no neighbors and exactly recovering the clusters is hopeless; instead, the goal is to find an estimator σ̂ positively correlated with σ* up to a global flip of signs. It was conjectured in [12] that positively correlated recovery is possible if and only if a(1 − 2ε)² > 1; the converse part is shown in [14] and recently it is proved in [21] that spectral algorithms achieve the sharp threshold in polynomial time.
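The threshold factor f(ε) = (√(1 − ε) − √ε)² in Theorems 5 and 6 can be checked numerically; a short sketch (the sample points are illustrative). At ε = 0 the labels are noiseless and f(0) = 1, while near ε = 1/2 the ratio f(ε)/(1 − 2ε)² approaches 1/2, matching the expansion stated above:

```python
from math import sqrt

def f(eps):
    """Threshold factor in Theorems 5-6: exact recovery iff a * f(eps) > 1."""
    return (sqrt(1 - eps) - sqrt(eps)) ** 2

ratio = f(0.49) / (1 - 2 * 0.49) ** 2   # should be close to 1/2
```

The function f is decreasing on [0, 1/2], so noisier labels require a denser background graph for exact recovery.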
5 Proofs

5.1 Proofs for Section 2
Lemma 1 ([11, Lemma 2]). Let X ~ Binom(m, a log n/n) and R ~ Binom(m, b log n/n) for m ∈ N and a, b > 0, where m = ρn + o(n) for some ρ > 0 as n → ∞. Let k_n, k′_n ∈ [m] be such that k_n = τρ log n + o(log n) and k′_n = τ′ρ log n + o(log n) for some 0 ≤ τ ≤ a and τ′ ≥ b. Then

  P{X ≤ k_n} = n^{−ρ(a − τ log(ea/τ)) + o(1)},    (12)
  P{R ≥ k′_n} = n^{−ρ(b − τ′ log(eb/τ′)) + o(1)}.    (13)
Lemma 2. Suppose ρ_1, ρ_2, a, b > 0 and τ ∈ R. Let X and R be independent with X ~ Binom(m_1, a log n/n) and R ~ Binom(m_2, b log n/n), where m_1 = ρ_1 n + o(n) and m_2 = ρ_2 n + o(n) as n → ∞. Let k_n ∈ N be such that k_n = τ log n + o(log n). If τ ≤ aρ_1 − bρ_2, then

  P{X − R ≤ k_n} = n^{−[aρ_1 + bρ_2 − γ − (τ/2) log((γ − τ)aρ_1 / ((γ + τ)bρ_2))] + o(1)},    (14)

where γ = √(τ² + 4ρ_1 ρ_2 ab).
Proof. We first prove the upper tail bound in (14) using Chernoff's bound. In particular,

  P{X − R ≤ k_n} ≤ exp(−n ℓ(k_n/n)),

where ℓ(x) = sup_{t≥0} { −tx − (1/n) log E[e^{−t(X−R)}] }. By definition,

  (1/n) log E[e^{−t(X−R)}] = (m_1/n) log(1 − (a log n/n)(1 − e^{−t})) + (m_2/n) log(1 − (b log n/n)(1 − e^{t})).

Since −tx − (1/n) log E[e^{−t(X−R)}] is concave in t, it achieves the supremum at t* such that

  −x + (m_1/n) · (a e^{−t*} log n/n) / (1 − (a log n/n)(1 − e^{−t*})) − (m_2/n) · (b e^{t*} log n/n) / (1 − (b log n/n)(1 − e^{t*})) = 0.    (15)

Hence, by setting x = k_n/n, we get t* = log((γ − τ)/(2ρ_2 b)), where γ = √(τ² + 4ρ_1 ρ_2 ab). Thus, by applying the Taylor expansion of log(1 − x) at x = 0, we have

  ℓ(k_n/n) = [ −τ log((γ − τ)/(2ρ_2 b)) + ρ_1 a + ρ_2 b − γ ] (log n/n) + o(log n/n)
           = [ −(τ/2) log((γ − τ)aρ_1 / ((γ + τ)bρ_2)) + ρ_1 a + ρ_2 b − γ ] (log n/n) + o(log n/n).

Then in view of (15),

  P{X − R ≤ τ log n} ≤ n^{−[aρ_1 + bρ_2 − γ − (τ/2) log((γ − τ)aρ_1 / ((γ + τ)bρ_2))] + o(1)}.
Next, we prove the lower tail bound in (14). For any choice of the constant τ′ with τ′ ≥ |τ|,

  {X − R ≤ τ log n} ⊇ { X ≤ ((τ′ + τ)/2) log n } ∩ { R ≥ ((τ′ − τ)/2) log n },

and therefore

  P{X − R ≤ τ log n} ≥ max_{τ′} P{ X ≤ ((τ′ + τ)/2) log n } P{ R ≥ ((τ′ − τ)/2) log n }.    (16)

So, applying Lemma 1, we get that

  P{X − R ≤ τ log n} ≥ max_{τ′} n^{−[aρ_1 − ((τ′+τ)/2) log(2eρ_1 a/(τ′+τ)) + bρ_2 − ((τ′−τ)/2) log(2eρ_2 b/(τ′−τ))] + o(1)}.    (17)

Since the exponent in (17) is strictly convex in τ′ over [|τ|, ∞), the maximum of the bound is achieved at τ′ = γ = √(τ² + 4ρ_1 ρ_2 ab), and thus

  P{X − R ≤ τ log n} ≥ n^{−[aρ_1 + bρ_2 − γ − (τ/2) log((γ − τ)aρ_1 / ((γ + τ)bρ_2))] + o(1)}.
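The Lemma 2 exponent specializes exactly to η(ρ, a, b) of (4): in the proof of Theorem 1 below it is invoked once with (ρ_1, ρ_2, τ) = (ρ, 1 − ρ, −(1 − 2ρ)τ*) and once with (1 − ρ, ρ, (1 − 2ρ)τ*), where τ* = (a − b)/(log a − log b), and both instances give the same value. The following sketch (sample parameters are illustrative) verifies this coincidence numerically:

```python
from math import log, sqrt

def lemma2_exponent(t, r1, r2, a, b):
    """Exponent of n in (14): a*r1 + b*r2 - g - (t/2) log((g-t) a r1 / ((g+t) b r2))."""
    g = sqrt(t * t + 4 * r1 * r2 * a * b)
    return a * r1 + b * r2 - g - (t / 2) * log((g - t) * a * r1 / ((g + t) * b * r2))

rho, a, b = 0.3, 5.0, 1.0
tau = (a - b) / (log(a) - log(b))
e1 = lemma2_exponent(-(1 - 2 * rho) * tau, rho, 1 - rho, a, b)   # i in cluster 1
e2 = lemma2_exponent((1 - 2 * rho) * tau, 1 - rho, rho, a, b)    # i in cluster 2
```

Algebraically the two expressions differ by (a − b)(2ρ − 1) + (1 − 2ρ)τ* log(a/b), which vanishes since τ* log(a/b) = a − b.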
The following lemma provides a deterministic sufficient condition for the success of the SDP (3) in the case a > b.

Lemma 3. Suppose there exist D* = diag{d*_i} and λ* ∈ R such that S* ≜ D* − A + λ*J satisfies S* ⪰ 0, λ_2(S*) > 0 and

  S* σ* = 0.    (18)

Then Ŷ_SDP = Y* is the unique solution to (3).

Proof. The Lagrangian function is given by

  L(Y, S, D, λ) = ⟨A, Y⟩ + ⟨S, Y⟩ − ⟨D, Y − I⟩ − λ(⟨J, Y⟩ − (2K − n)²),

where the Lagrangian multipliers are denoted by S ⪰ 0, D = diag{d_i}, and λ ∈ R. Then for any Y satisfying the constraints in (3),

  ⟨A, Y⟩ ≤ L(Y, S*, D*, λ*) = ⟨D*, I⟩ + λ*(2K − n)² = ⟨D*, Y*⟩ + λ*(2K − n)² = ⟨A + S* − λ*J, Y*⟩ + λ*(2K − n)² = ⟨A, Y*⟩,

where the inequality holds because ⟨S*, Y⟩ ≥ 0, and the last equality holds because ⟨Y*, S*⟩ = (σ*)^⊤ S* σ* = 0 by (18). Hence, Y* is an optimal solution. It remains to establish its uniqueness. To this end, suppose Ỹ is an optimal solution. Then

  ⟨S*, Ỹ⟩ = ⟨D* − A + λ*J, Ỹ⟩ = ⟨D* − A + λ*J, Y*⟩ = ⟨S*, Y*⟩ = 0,

where the middle equality holds because ⟨J, Ỹ⟩ = ⟨J, Y*⟩, ⟨A, Ỹ⟩ = ⟨A, Y*⟩, and Ỹ_ii = Y*_ii = 1 for all i ∈ [n]. In view of (18), since Ỹ ⪰ 0 and S* ⪰ 0 with λ_2(S*) > 0, Ỹ must be a multiple of Y* = σ*(σ*)^⊤. Because Ỹ_ii = 1 for all i ∈ [n], Ỹ = Y*.

Proof of Theorem 1. Let D* = diag{d*_i} with

  d*_i = Σ_{j=1}^n A_ij σ*_i σ*_j − λ*(2K − n) σ*_i    (19)
and choose λ* = τ log n/n, where τ = (a − b)/(log a − log b). It suffices to show that S* = D* − A + λ*J satisfies the conditions in Lemma 3 with high probability.

By definition, d*_i σ*_i = Σ_j A_ij σ*_j − λ*(2K − n) for all i, i.e., D*σ* = Aσ* − λ*(2K − n)1. Since Jσ* = (2K − n)1, (18) holds, that is, S*σ* = 0. It remains to verify that S* ⪰ 0 and λ_2(S*) > 0 with probability converging to one, which amounts to showing that

  P{ inf_{x ⊥ σ*, ‖x‖_2 = 1} x^⊤ S* x > 0 } → 1.    (20)

Note that E[A] = ((p − q)/2) Y* + ((p + q)/2) J − pI and Y* = σ*(σ*)^⊤. Thus for any x such that x ⊥ σ* and ‖x‖_2 = 1,

  x^⊤ S* x = x^⊤ D* x − x^⊤ E[A] x + λ* x^⊤ J x − x^⊤ (A − E[A]) x
           = x^⊤ D* x − ((p − q)/2) x^⊤ Y* x + (λ* − (p + q)/2) x^⊤ J x + p − x^⊤ (A − E[A]) x
           = x^⊤ D* x + (λ* − (p + q)/2) x^⊤ J x + p − x^⊤ (A − E[A]) x,    (21)

where the last equality holds because ⟨x, σ*⟩ = 0. It follows from (21) that for any x ⊥ σ* with ‖x‖_2 = 1, x^⊤ S* x = t_1(x) + t_2(x), where

  t_1(x) = x^⊤ D* x + (λ* − (p + q)/2) x^⊤ J x,    t_2(x) = p − x^⊤ (A − E[A]) x.

Observe that

  inf_{x ⊥ σ*, ‖x‖_2 = 1} x^⊤ S* x ≥ inf_{x ⊥ σ*, ‖x‖_2 = 1} t_1(x) + inf_{x ⊥ σ*, ‖x‖_2 = 1} t_2(x).    (22)

Now inf_{x ⊥ σ*, ‖x‖_2 = 1} t_2(x) ≥ p − ‖A − E[A]‖. In view of [11, Theorem 5], with high probability ‖A − E[A]‖ ≤ c′√(log n) for a positive constant c′ depending only on a, and thus inf_{x ⊥ σ*, ‖x‖_2 = 1} t_2(x) ≥ p − c′√(log n).

We next bound inf_{x ⊥ σ*, ‖x‖_2 = 1} t_1(x) from below. Notice that b ≤ τ ≤ (a + b)/2. If σ_i = +1 then E[d_i] = d_+ ≜ {K(a − τ) + (n − K)(τ − b) − a} log n/n; if σ_i = −1 then E[d_i] = d_− ≜ {(n − K)(a − τ) + K(τ − b) − a} log n/n. Consider the specific vector x̌ that maximizes x^⊤ J x subject to the unit-norm constraint and ⟨x, σ*⟩ = 0. It has coordinates √((n − K)/(nK)) on the K vertices of the first cluster and √(K/(n(n − K))) on the n − K vertices of the other cluster. It yields x̌^⊤ J x̌ = 4K(n − K)/n, and thus

  (λ* − (p + q)/2) x̌^⊤ J x̌ = (τ − (a + b)/2) 4K(n − K) log n/n² = (τ − a + τ − b) 2K(n − K) log n/n².

Note that

  E[x̌^⊤ D* x̌] = (n − K) d_+/n + K d_−/n = {2K(n − K)(a − τ) + (K² + (n − K)²)(τ − b) − na} log n/n².

Since x̌^⊤ D* x̌ = ⟨A, B⟩ − λ*(2K − n) Σ_{i=1}^n x̌_i² σ*_i, where B_ij = σ_i σ_j x̌_i², it follows that x̌^⊤ D* x̌ is Lipschitz continuous in A with Lipschitz constant ‖B‖_F = √((1 − ρ)²/ρ + ρ²/(1 − ρ)) + o(1). Moreover, A_ij is [0, 1]-valued. It follows from Talagrand's concentration inequality for Lipschitz convex functions (see, e.g., [23, Theorem 2.1.13]) that for any c > 0, there exists c′ > 0 depending only on ρ such that

  P{ x̌^⊤ D* x̌ − E[x̌^⊤ D* x̌] ≥ −c′√(log n) } ≥ 1 − n^{−c}.

Hence, with probability at least 1 − n^{−c},

  t_1(x̌) ≥ E[x̌^⊤ D* x̌] + (λ* − (p + q)/2) x̌^⊤ J x̌ − c′√(log n) = (τ − b) log n − a log n/n − c′√(log n).    (23)

Let E_2 = span(σ*, x̌); E_2 is the set of vectors that are constant over each cluster. Note that E[D*] x̌ ∈ E_2, so for any vector x with x ⊥ E_2, x^⊤ E[D*] x̌ = 0 and Jx = 0. It follows that

  inf_{x ⊥ σ*, ‖x‖_2 = 1} t_1(x) = inf_{β ∈ [0,1], ‖x‖_2 = 1, x ⊥ E_2} t_1(β x̌ + √(1 − β²) x)
                                 = inf_{β ∈ [0,1], ‖x‖_2 = 1, x ⊥ E_2} { β² t_1(x̌) + 2β√(1 − β²) x^⊤ D* x̌ + (1 − β²) x^⊤ D* x }.

Furthermore,

  inf_{‖x‖_2 = 1, x ⊥ E_2} x^⊤ D* x̌ = inf_{‖x‖_2 = 1, x ⊥ E_2} x^⊤ (D* − E[D*]) x̌ ≥ −‖(D* − E[D*]) x̌‖.

Notice that

  ‖(D − E[D]) x̌‖_2² = Σ_i ( Σ_{j=1}^n (A_ij − E[A_ij]) σ*_i σ*_j )² x̌_i² = Σ_i x̌_i² ( Σ_{j=1}^n (A_ij − E[A_ij]) σ*_j )².

Therefore

  E[‖(D − E[D]) x̌‖] ≤ √(E[‖(D − E[D]) x̌‖_2²]) = √( Σ_i x̌_i² Σ_{j=1}^n var[A_ij] ) ≤ √(a log n).

One can check that ‖(D − E[D]) x̌‖ is convex and O(1)-Lipschitz continuous in A. It follows from Talagrand's concentration inequality for Lipschitz convex functions that for any c > 0, there exists c′ > 0 such that

  P{ ‖(D − E[D]) x̌‖ − E[‖(D − E[D]) x̌‖] ≤ c′√(log n) } ≥ 1 − n^{−c}.

Hence, with probability at least 1 − n^{−c}, ‖(D* − E[D*]) x̌‖ ≤ c′√(log n) for some universal constant c′.

Notice that for ‖x‖_2 = 1, x^⊤ D* x ≥ min_i d*_i. For i ∈ C_1, Σ_j A_ij σ_i σ_j is equal in distribution to X − R, where X ~ Binom(K − 1, a log n/n) and R ~ Binom(n − K, b log n/n). It follows from Lemma 2 that

  P{ Σ_j A_ij σ_i σ_j ≤ −τ(1 − 2ρ) log n + log n/log log n } ≤ n^{−η(ρ,a,b)+o(1)}.

For i ∈ C_2, Σ_j A_ij σ_i σ_j is equal in distribution to X − R, where X ~ Binom(n − K − 1, a log n/n) and R ~ Binom(K, b log n/n). It follows from Lemma 2 that

  P{ Σ_j A_ij σ_i σ_j ≤ τ(1 − 2ρ) log n + log n/log log n } ≤ n^{−η(ρ,a,b)+o(1)}.

It follows from the definition of d_i that

  P{ d_i ≥ log n/log log n } ≥ 1 − n^{−η(ρ,a,b)+o(1)}, for all i.

Applying the union bound, we get that min_{i∈[n]} d*_i ≥ log n/log log n holds with probability at least 1 − n^{1−η(ρ,a,b)+o(1)}. In view of the assumption that η(ρ, a, b) > 1, with high probability, for some universal constant c′,

  inf_{x ⊥ σ*, ‖x‖_2 = 1} t_1(x) ≥ inf_{β ∈ [0,1]} { β²(τ − b) log n − c′β√(1 − β²)√(log n) + (1 − β²) log n/log log n } − c′√(log n)
                                 ≥ log n/log log n − 2c′√(log n).

Since we have shown that with high probability inf_{x ⊥ σ*, ‖x‖_2 = 1} t_2(x) ≥ p − c′√(log n), the desired (20) follows from (22).

Proof of Theorem 2. Since the prior distribution of σ* is uniform over {σ ∈ {±1}^n : σ^⊤ 1 = 2K − n}, the ML estimator minimizes the error probability among all estimators, and thus we only need to find when the ML estimator fails. Let C*_1, C*_2 denote the true clusters 1 and 2, respectively. Let e(i, T) ≜ Σ_{j∈T} A_ij for a set T. Recall that τ = (a − b)/(log a − log b). Let F_1 denote the event that min_{i∈C*_1}(e(i, C*_1) − e(i, C*_2)) ≤ −τ(1 − 2ρ) log n − 2, and F_2 the event that min_{i∈C*_2}(e(i, C*_2) − e(i, C*_1)) ≤ τ(1 − 2ρ) log n − 2. Notice that F_1 ∩ F_2 implies the existence of i ∈ C*_1 and j ∈ C*_2 such that the partition (C*_1\{i} ∪ {j}, C*_2\{j} ∪ {i}) achieves a strictly higher likelihood than (C*_1, C*_2). Hence P{ML fails} ≥ P{F_1 ∩ F_2}.

Next we bound P{F_1} and P{F_2} from below. By symmetry, we can condition on C*_1 being the first ρn vertices. Let T denote the set of the first ⌊n/log² n⌋ vertices. Then

  min_{i∈C*_1}(e(i, C*_1) − e(i, C*_2)) ≤ min_{i∈T}(e(i, C*_1) − e(i, C*_2)) ≤ min_{i∈T}(e(i, C*_1\T) − e(i, C*_2)) + max_{i∈T} e(i, T).    (24)

Let E_1 denote the event that max_{i∈T} e(i, T) ≤ log n/log log n − 2, and E_2 the event that min_{i∈T}(e(i, C*_1\T) − e(i, C*_2)) ≤ −τ(1 − 2ρ) log n − log n/log log n. In view of (24), we have F_1 ⊇ E_1 ∩ E_2, and hence it boils down to proving that P{E_i} → 1 for i = 1, 2.

For i ∈ T, e(i, T) ~ Binom(|T|, a log n/n). In view of the following Chernoff bound for binomial distributions [17, Theorem 4.4]: for r ≥ 1 and X ~ Binom(n, p), P{X ≥ rnp} ≤ (e/r)^{rnp}, we have

  P{ e(i, T) ≥ log n/log log n − 2 } ≤ ( ea log log n / log² n )^{log n/log log n − 2} = n^{−2+o(1)}.

Applying the union bound yields

  P{E_1} ≥ 1 − Σ_{i∈T} P{ e(i, T) ≥ log n/log log n − 2 } ≥ 1 − n^{−1+o(1)}.

Moreover,

  P{E_2} = 1 − Π_{i∈T} P{ e(i, C*_1\T) − e(i, C*_2) > −τ(1 − 2ρ) log n − log n/log log n }
         ≥ 1 − (1 − n^{−η(ρ,a,b)+o(1)})^{|T|}
         ≥ 1 − exp(−n^{1−η(ρ,a,b)+o(1)}) → 1,

where the equality holds because {e(i, C*_1\T) − e(i, C*_2)}_{i∈T} are mutually independent; the first inequality follows from Lemma 2; the second is due to 1 + x ≤ e^x for all x ∈ R; and the convergence follows from the assumption that η(ρ, a, b) < 1. Thus P{F_1} → 1. Using the same argument, we can show P{F_2} → 1. Thus the theorem follows.
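The identity S*σ* = 0 for the certificate of Theorem 1 holds by construction for every realization of A, not just with high probability. The following numpy sketch (the sizes n, K and constants a, b are illustrative choices, not from the paper) samples a two-cluster SBM, builds D* from (19) with λ* = τ log n/n, and checks the certificate algebra along with λ_2(S*) > 0, which the proof establishes with high probability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, a, b = 200, 100, 20.0, 2.0      # illustrative, deep in the recovery regime
p, q = a * np.log(n) / n, b * np.log(n) / n
sigma = np.concatenate([np.ones(K), -np.ones(n - K)])

# Sample the SBM adjacency matrix (symmetric, zero diagonal)
P = np.where(np.outer(sigma, sigma) > 0, p, q)
U = np.triu(rng.random((n, n)), 1)
U = U + U.T
A = (U < P).astype(float)
np.fill_diagonal(A, 0.0)

# Certificate (19): d_i = sum_j A_ij sigma_i sigma_j - lambda (2K - n) sigma_i
tau = (a - b) / (np.log(a) - np.log(b))
lam = tau * np.log(n) / n
d = sigma * (A @ sigma) - lam * (2 * K - n) * sigma
S = np.diag(d) - A + lam * np.ones((n, n))
```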
5.2 Proofs for Section 3
Theorem 3 is proved after three lemmas are given. For k ∈ [r], denote by C_k ⊂ [n] the support of the kth cluster. For a set T of vertices, let e(i, T) ≜ Σ_{j∈T} A_ij and e(T′, T) = Σ_{i∈T′} e(i, T). Let k(i) denote the index of the cluster containing vertex i. Denote the number of neighbors of i in its own cluster by s_i = e(i, C_{k(i)}) and the maximum number of neighbors of i in other clusters by r_i = max_{k′≠k(i)} e(i, C_{k′}).

Lemma 4.

  P{ min_{i∈[n]}(s_i − r_i) ≤ log n/log log n } ≤ r n^{1−(√a−√b)²/r+o(1)}.

Proof. Notice that s_i ~ Binom(K, p) and, for k′ ≠ k(i), e(i, C_{k′}) ~ Binom(K, q). It is shown in [2] that

  P{ s_i − e(i, C_{k′}) ≤ log n/log log n } ≤ n^{−(√a−√b)²/r+o(1)}.

It follows from the union bound that

  P{ s_i − r_i ≤ log n/log log n } ≤ r n^{−(√a−√b)²/r+o(1)}.

Applying the union bound over all vertices completes the proof.

Lemma 5. There exists a constant c > 0 depending only on b and r such that

  P{ max_{k∈[r]} (1/K) Σ_{i∈C_k} r_i ≤ Kq + c√(log n) } ≥ 1 − r n^{−2}.

Proof. We first show E[r_i] ≤ Kq + O(√(log n)). Let t_0 = √(log n). Then

  E[r_i] = E[ max_{k′≠k(i)} e(i, C_{k′}) ] = ∫_0^∞ P{ max_{k′≠k(i)} e(i, C_{k′}) ≥ t } dt
         ≤ ∫_0^∞ ( r P{ e(i, C_{k′}) ≥ t } ∧ 1 ) dt
         ≤ Kq + t_0 + r ∫_{t_0}^∞ P{ e(i, C_{k′}) − Kq ≥ t } dt
         ≤ Kq + t_0 + r ∫_{t_0}^∞ exp( −t²/(2Kq + 2t/3) ) dt,

where the last inequality follows from Bernstein's inequality. Furthermore,

  ∫_{t_0}^∞ exp( −t²/(2Kq + 2t/3) ) dt ≤ ∫_{t_0}^{Kq} e^{−3t²/(8Kq)} dt + ∫_{Kq}^∞ e^{−3t/8} dt
         ≤ ∫_{t_0}^{Kq} e^{−3 t_0 t/(8Kq)} dt + (8/3) e^{−3Kq/8}
         ≤ (8Kq/(3t_0)) e^{−3t_0²/(8Kq)} + (8/3) e^{−3Kq/8} = O(√(log n)).

Thus E[r_i] ≤ Kq + O(√(log n)). Write Σ_{i∈C_k} r_i ≜ g(A_ij : i ∈ C_k, j ∉ C_k). Then g satisfies the bounded difference property, i.e., for all i = 1, 2, . . . , m with m = (r − 1)K²,

  sup_{x_1,...,x_m, x′_i} | g(x_1, . . . , x_{i−1}, x_i, x_{i+1}, . . . , x_m) − g(x_1, . . . , x_{i−1}, x′_i, x_{i+1}, . . . , x_m) | ≤ 1.

It follows from McDiarmid's inequality that

  P{ Σ_{i∈C_k} r_i − Σ_{i∈C_k} E[r_i] ≥ K√(r log n) } ≤ exp( −2K² r log n/(K² r) ) = n^{−2}.

Thus, with probability at most n^{−2}, (1/K) Σ_{i∈C_k} r_i ≥ Kq + O(√(log n)). The lemma follows in view of the union bound.

The following lemma provides a deterministic sufficient condition for the success of the SDP (8) in the case a > b.

Lemma 6. Suppose there exist D* = diag{d*_i} with d*_i > 0 for all i, B* ∈ S^n with B* ≥ 0 and B*_ij > 0 whenever i and j are in distinct clusters, and λ* ∈ R^n, such that S* ≜ D* − B* − A + λ*1^⊤ + 1(λ*)^⊤ satisfies S* ⪰ 0 and

  S* ξ*_k = 0, k ∈ [r],    (25)
  B*_ij Z*_ij = 0, i, j ∈ [n].    (26)

Then Ẑ_SDP = Z* is the unique solution to (8).

Proof. Let H = Z − Z*, where Z is an arbitrary feasible matrix for the SDP (8). Since Z and Z* are both feasible, ⟨D*, H⟩ = ⟨λ*1^⊤, H⟩ = ⟨1(λ*)^⊤, H⟩ = 0. Since A = D* − B* − S* + λ*1^⊤ + 1(λ*)^⊤, we have ⟨A, H⟩ = −⟨B*, H⟩ − ⟨S*, H⟩, and the following hold:

• ⟨B*, H⟩ ≥ 0, with equality if and only if ⟨B*, Z⟩ = 0. That is because B* ≥ 0, Z ≥ 0, and ⟨B*, Z*⟩ = 0.

• ⟨S*, H⟩ ≥ 0, with equality if and only if ⟨S*, Z⟩ = 0. That is because ⟨S*, Z⟩ ≥ 0 (both S* ⪰ 0 and Z ⪰ 0) and ⟨S*, Z*⟩ = 0 (because Z* = Σ_{k=1}^r ξ*_k(ξ*_k)^⊤ and S*ξ*_k = 0 for all k ∈ [r]).

Thus ⟨A, H⟩ ≤ 0, so that Z* is a solution to the SDP. To prove that Z* is the unique solution, restrict attention to the case that Z is another solution to the SDP. We need to show Z = Z*. Since both Z and Z* are solutions, ⟨A, H⟩ = 0, so that ⟨B*, H⟩ = ⟨S*, H⟩ = 0 and therefore, by the above two points, ⟨B*, Z⟩ = ⟨S*, Z⟩ = 0. Since Z ≥ 0 and B*_ij > 0 whenever i and j are in distinct clusters, Z_ij = 0 whenever i and j are in distinct clusters. Also, the fact that Z ⪰ 0 and Z_ii = 1 for all i implies Z_ij ≤ 1 for all i, j. Thus, the only way Z can meet the constraint Z1 = K1 is that Z_ij = 1 whenever i and j are in the same cluster. Therefore Z = Z*, so Z* is the unique solution.

Proof of Theorem 3. Choose λ*_i = (1/K)( r_i − Kq/2 + √(log n)/2 ) and D* = diag{d*_i} with

  d*_i = s_i − r_i − (1/K) Σ_{j∈C_{k(i)}} r_j + Kq − √(log n),

where k(i) is the index of the cluster containing vertex i. Let B* ∈ S^n be given by

  B*_{C_k×C_k} = 0, for all k ∈ [r],
  B*_{C_k×C_{k′}}(i, j) = B*_{C_{k′}×C_k}(j, i) = y*_{kk′}(i) + z*_{kk′}(j), for all 1 ≤ k < k′ ≤ r,    (27)

where

  y*_{kk′}(i) = (1/K)( r_i − e(i, C_{k′}) ) + (1/(2K))( e(C_k, C_{k′})/K − Kq + √(log n) ),    (28)
  z*_{kk′}(j) = (1/K)( r_j − e(j, C_k) ) + (1/(2K))( e(C_k, C_{k′})/K − Kq + √(log n) ).    (29)

It suffices to show that (S*, D*, B*, λ*) satisfies the conditions in Lemma 6 with high probability. By definition, we have B*_ij Z*_ij = 0 for all i, j ∈ [n]. Moreover, for any i,

  d*_i = s_i − λ*_i K − Σ_{j∈C_{k(i)}} λ*_j,    (30)

and for all k′ ≠ k(i),

  e(i, C_{k′}) + Σ_{j∈C_{k′}} ( B*_ij − λ*_j ) = K λ*_i.    (31)

It follows that S* ξ*_k = 0 for k ∈ [r].

We next show that, with probability converging to 1, d*_i > 0 for all i ∈ [n] and B*_ij > 0 for all i, j in distinct clusters. In view of Lemma 4 and the assumption √a − √b > √r, min_i (s_i − r_i) ≥ log n/log log n with high probability. By Lemma 5, max_{k∈[r]} (1/K) Σ_{i∈C_k} r_i ≤ Kq + O(√(log n)) with high probability. It follows from the definition that

  P{ min_i d*_i ≥ log n/log log n − O(√(log n)) } → 1.    (32)

Also, note that e(C_k, C_{k′}) ~ Binom(K², q). For X ~ Binom(n, p), Chernoff's bound yields

  P{ X ≤ (1 − δ)np } ≤ e^{−δ²np/2}.

It follows that

  P{ e(C_k, C_{k′}) ≤ K²q − K√(log n)/2 } ≤ n^{−1/(8q)}.

Applying the union bound, with high probability e(C_k, C_{k′}) ≥ K²q − K√(log n)/2 for all 1 ≤ k < k′ ≤ r. Hence y*_{kk′}(i) > 0 and z*_{kk′}(j) > 0 for all 1 ≤ k < k′ ≤ r, i ∈ C_k and j ∈ C_{k′}, so that B*_ij > 0 for all i, j in distinct clusters.

It remains to verify S* ⪰ 0 with probability converging to 1, i.e.,

  P{ inf_{x ⊥ E, ‖x‖_2 = 1} x^⊤ S* x > 0 } → 1,    (33)

where E = span(ξ*_k : k ∈ [r]). Note that E[A] = (p − q)Z* + qJ − pI and Z* = Σ_{k∈[r]} ξ*_k(ξ*_k)^⊤. Thus for any x such that x ⊥ E and ‖x‖_2 = 1,

  x^⊤ S* x = x^⊤ D* x − x^⊤ E[A] x − x^⊤ B* x + 2 x^⊤ λ* 1^⊤ x − x^⊤ (A − E[A]) x
           = x^⊤ D* x − (p − q) x^⊤ Z* x − q x^⊤ J x + p − x^⊤ B* x − x^⊤ (A − E[A]) x
           = x^⊤ D* x + p − x^⊤ B* x − x^⊤ (A − E[A]) x
           = x^⊤ D* x + p − x^⊤ (A − E[A]) x
           ≥ min_i d*_i + p − ‖A − E[A]‖,    (34)

where the second equality uses x ⊥ 1, the third uses ⟨x, ξ*_k⟩ = 0 for all k ∈ [r] together with x ⊥ 1, and the fourth holds because

  x^⊤ B* x = 2 Σ_{1≤k<k′≤r} [ (Σ_{j∈C_{k′}} x_j)(Σ_{i∈C_k} x_i y*_{kk′}(i)) + (Σ_{i∈C_k} x_i)(Σ_{j∈C_{k′}} x_j z*_{kk′}(j)) ] = 0,

since Σ_{i∈C_k} x_i = ⟨x, ξ*_k⟩ = 0 for every k. In view of (32) and the spectral norm bound of Theorem 7 below (which gives ‖A − E[A]‖ = O(√(log n)) with high probability, since np ≍ log n here), the right-hand side of (34) is positive with high probability, which establishes (33) and completes the proof of Theorem 3.

We now state the spectral norm bound used above. Suppose A is a symmetric random matrix with zero diagonal whose entries above the diagonal are independent and [0, 1]-valued with E[A_ij] ≤ p, where np ≥ c_0 log n for a constant c_0 > 0 and p ≤ c_1 for a constant c_1 ∈ (0, 1). We aim to show that ‖A − E[A]‖_2 ≤ c′√(np) with high probability for some constant c′ > 0.

Theorem 7. For any c > 0, there exists c′ > 0 such that for any n ≥ 1, P{ ‖A − E[A]‖_2 ≤ c′√(np) } ≥ 1 − n^{−c}.
Proof. Let $E = (E_{ij})$ denote an $n \times n$ matrix with independent entries drawn from $\hat{\mu} \triangleq \frac{p}{2}\delta_1 + \frac{p}{2}\delta_{-1} + (1-p)\delta_0$, which is the distribution of a Rademacher random variable multiplied with an independent Bernoulli with bias $p$. Define $E'$ as $E'_{ii} = E_{ii}$ and $E'_{ij} = -E_{ji}$ for all $i \ne j$. Let $A'$ be an independent copy of $A$. Let $D$ be a zero-diagonal symmetric matrix whose entries are drawn from $\hat{\mu}$ and let $D'$ be an independent copy of $D$. Let $M = (M_{ij})$ denote an $n \times n$ zero-diagonal symmetric matrix whose entries are Rademacher and independent of $A$ and $A'$. We apply the usual symmetrization argument:
$$\begin{aligned}
\mathbb{E}[\|A - \mathbb{E}[A]\|] &= \mathbb{E}[\|A - \mathbb{E}[A']\|] \overset{(a)}{\le} \mathbb{E}[\|A - A'\|] \overset{(b)}{=} \mathbb{E}[\|(A - A') \circ M\|] \overset{(c)}{\le} 2\,\mathbb{E}[\|A \circ M\|] \\
&= 2\,\mathbb{E}[\|D\|] = 2\,\mathbb{E}[\|D - \mathbb{E}[D']\|] \overset{(d)}{\le} 2\,\mathbb{E}[\|D - D'\|] \overset{(e)}{=} 2\,\mathbb{E}[\|E - E'\|] \overset{(f)}{\le} 4\,\mathbb{E}[\|E\|],
\end{aligned} \tag{35}$$
where (a), (d) follow from Jensen's inequality; (b) follows because $A - A'$ has the same distribution as $(A - A') \circ M$, where $\circ$ denotes the element-wise product; (c), (f) follow from the triangle inequality; (e) follows from the fact that $D - D'$ has the same distribution as $E - E'$. In particular, first, the diagonal entries of $D - D'$ and $E - E'$ are all equal to zero. Second, both $D - D'$ and $E - E'$ are symmetric matrices with independent upper-triangular entries. Third, $D_{ij} - D'_{ij}$ is equal in distribution to $E_{ij} - E'_{ij}$ for all $i < j$ by definition.

Then we apply the result of Seginer [22], which characterizes the expected spectral norm of i.i.d. random matrices within universal constant factors. Let $X_j \triangleq \sum_{i=1}^n E_{ij}^2$, which are i.i.d. $\mathrm{Binom}(n, p)$. Since $\hat{\mu}$ is symmetric, [22, Theorem 1.1] and Jensen's inequality yield
$$\mathbb{E}[\|E\|] \le \kappa\, \mathbb{E}\Big[\max_{j \in [n]} X_j^{1/2}\Big] \le \kappa \Big( \mathbb{E}\Big[\max_{j \in [n]} X_j\Big] \Big)^{1/2} \tag{36}$$
for some universal constant $\kappa$. In view of the following Chernoff bound for the binomial distribution [17, Theorem 4.4]:
$$\mathbb{P}\{X_1 \ge t\} \le 2^{-t} \quad \text{for all } t \ge 6np,$$
setting $t_0 = 6\max\{np/\log n, 1\}$ and applying the union bound, we have
$$\begin{aligned}
\mathbb{E}\Big[\max_{j \in [n]} X_j\Big] &= \int_0^\infty \mathbb{P}\Big\{\max_{j \in [n]} X_j \ge t\Big\}\, \mathrm{d}t \le \int_0^\infty \big(n\, \mathbb{P}\{X_1 \ge t\} \wedge 1\big)\, \mathrm{d}t \\
&\le t_0 \log n + n \int_{t_0 \log n}^\infty 2^{-t}\, \mathrm{d}t \le (t_0 + 1) \log n \le 6(1 + 2/c_0)\, np,
\end{aligned} \tag{37}$$
where the last inequality follows from $np \ge c_0 \log n$. Assembling (35)–(37), we obtain
$$\mathbb{E}[\|A - \mathbb{E}[A]\|] \le c_2 \sqrt{np} \tag{38}$$
for some positive constant $c_2$ depending only on $c_0$. Since the entries of $A - \mathbb{E}[A]$ take values in $[-1, 1]$, Talagrand's concentration inequality for 1-Lipschitz convex functions yields
$$\mathbb{P}\{\|A - \mathbb{E}[A]\| \ge \mathbb{E}[\|A - \mathbb{E}[A]\|] + t\} \le c_3 \exp(-c_4 t^2)$$
for some absolute constants $c_3, c_4$, which implies that for any $c > 0$, there exists $c' > 0$ depending on $c_0$ and $c$, such that $\mathbb{P}\{\|A - \mathbb{E}[A]\| \ge c'\sqrt{np}\} \le n^{-c}$.
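The key step (36), which controls the expected spectral norm by the largest column norm, can also be illustrated empirically. The sketch below (sizes, sparsity, trial count, and seed are arbitrary illustrative choices) samples matrices with i.i.d. $\hat{\mu}$ entries and compares the average of $\|E\|$ with $(\text{average of } \max_j X_j)^{1/2}$; by Seginer's theorem their ratio should stay below a universal constant:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 500, 0.02, 30        # illustrative size, sparsity, and trial count

norms, col_maxes = [], []
for _ in range(trials):
    # entries i.i.d. from mu-hat: Rademacher times an independent Bernoulli(p)
    E = rng.choice([-1.0, 1.0], size=(n, n)) * (rng.random((n, n)) < p)
    norms.append(np.linalg.norm(E, 2))
    # X_j = sum_i E_ij^2 counts the nonzeros in column j, distributed Binom(n, p)
    col_maxes.append((E ** 2).sum(axis=0).max())

ratio = np.mean(norms) / np.sqrt(np.mean(col_maxes))
print(ratio)   # bounded by a universal constant kappa, per [22, Theorem 1.1]
```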
Let $X_1, X_2, \ldots, X_m \overset{\text{i.i.d.}}{\sim} p(1-\epsilon)\delta_{+1} + p\epsilon\,\delta_{-1} + (1-p)\delta_0$ for $m \in \mathbb{N}$, $p \in [0,1]$ and a fixed constant $\epsilon \in [0, 1/2]$, where $m = n + o(n)$ and $p = a \log n/n$ for some $a > 0$ as $n \to \infty$. The following upper bound on the lower tail of $\sum_{i=1}^m X_i$ follows from the Chernoff bound.

Lemma 7. Assume that $k_n \in [m]$ is such that $k_n = (1 + o(1)) \frac{\log n}{\log\log n}$. Then
$$\mathbb{P}\left\{ \sum_{i=1}^m X_i \le k_n \right\} \le n^{-a(\sqrt{1-\epsilon} - \sqrt{\epsilon})^2 + o(1)}.$$
Proof. It follows from the Chernoff bound that
$$\mathbb{P}\left\{ \sum_{i=1}^m X_i \le k_n \right\} \le \exp\left(-m\, \ell(k_n/m)\right),$$
where $\ell(x) = \sup_{\lambda \ge 0} \big\{-\lambda x - \log \mathbb{E}\big[e^{-\lambda X_1}\big]\big\}$. Since $X_1 \sim p(1-\epsilon)\delta_{+1} + p\epsilon\,\delta_{-1} + (1-p)\delta_0$,
$$\mathbb{E}\left[e^{-\lambda X_1}\right] = 1 + p\left[e^{-\lambda}(1-\epsilon) + e^{\lambda}\epsilon - 1\right].$$
Notice that $-\lambda x - \log \mathbb{E}[e^{-\lambda X_1}]$ is concave in $\lambda$, so it achieves its supremum at $\lambda^*$ such that
$$-x + \frac{p\left(e^{-\lambda^*}(1-\epsilon) - e^{\lambda^*}\epsilon\right)}{1 + p\left[e^{-\lambda^*}(1-\epsilon) + e^{\lambda^*}\epsilon - 1\right]} = 0.$$
Hence, by setting $x = k_n/m$, we get $\lambda^* = \frac{1}{2}\log\frac{1-\epsilon}{\epsilon} + o(1)$ and thus
$$\begin{aligned}
\ell(k_n/m) &= -\lambda^* k_n/m - \log\left[1 + p\left(e^{-\lambda^*}(1-\epsilon) + e^{\lambda^*}\epsilon - 1\right)\right] \\
&= -\frac{1}{2}\log\frac{1-\epsilon}{\epsilon}\,\frac{k_n}{m} - \log\left[1 - p\left(\sqrt{1-\epsilon} - \sqrt{\epsilon}\right)^2\right] + o(k_n/m) \\
&= a\left(\sqrt{1-\epsilon} - \sqrt{\epsilon}\right)^2 \log n/n + o(\log n/n),
\end{aligned} \tag{39}$$
where the last equality holds due to the Taylor expansion of $\log(1-x)$ at $x = 0$ and $p = a\log n/n$. In view of (39),
$$\mathbb{P}\left\{ \sum_{i=1}^m X_i \le k_n \right\} \le n^{-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}.$$
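The closed-form value of the exponent used in this proof can be cross-checked by maximizing $-\lambda x - \log \mathbb{E}[e^{-\lambda X_1}]$ over a grid of $\lambda$. In the sketch below ($\epsilon$, $p$, and the grid are illustrative choices), $\ell(0)$ computed numerically agrees with $-\log\big(1 - p(\sqrt{1-\epsilon} - \sqrt{\epsilon})^2\big)$, the value attained at $\lambda^* = \frac{1}{2}\log\frac{1-\epsilon}{\epsilon}$:

```python
import numpy as np

eps, p = 0.1, 1e-3                  # illustrative flip and edge probabilities

def mgf_neg(lam):
    # E[e^{-lam X_1}] for X_1 ~ p(1-eps) d_{+1} + p*eps d_{-1} + (1-p) d_0
    return 1 + p * (np.exp(-lam) * (1 - eps) + np.exp(lam) * eps - 1)

lams = np.linspace(0.0, 10.0, 100001)
ell0 = np.max(-np.log(mgf_neg(lams)))     # ell(0) = sup_{lam >= 0} -log E[e^{-lam X_1}]

closed_form = -np.log(1 - p * (np.sqrt(1 - eps) - np.sqrt(eps)) ** 2)
print(ell0, closed_form)                  # the grid maximum matches the closed form
```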
The following lemma establishes a matching lower bound on the lower tail of $\sum_{i=1}^m X_i$.

Lemma 8. Assume that $k_n \in [m]$ is such that $k_n = (1+o(1))\frac{\log n}{\log\log n}$. Then
$$\mathbb{P}\left\{ \sum_{i=1}^m X_i \le -k_n \right\} \ge n^{-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}.$$
Proof. Let $k^* \triangleq \lfloor 2a\sqrt{\epsilon(1-\epsilon)}\, \log n \rfloor$. Notice that $\sum_{i=1}^m X_i^2 \sim \mathrm{Binom}(m, p)$. Let $Z_1, Z_2, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\delta_{+1} + \epsilon\,\delta_{-1}$. Then
$$\mathbb{P}\left\{\sum_{i=1}^m X_i \le -k_n\right\} \ge \mathbb{P}\left\{\sum_{i=1}^m X_i \le -k_n \,\Big|\, \sum_{i=1}^m X_i^2 = k^*\right\} \mathbb{P}\left\{\sum_{i=1}^m X_i^2 = k^*\right\} \overset{(a)}{=} \mathbb{P}\left\{\sum_{i=1}^{k^*} Z_i \le -k_n\right\} \mathbb{P}\left\{\sum_{i=1}^m X_i^2 = k^*\right\}, \tag{40}$$
where (a) holds because, conditioned on $\sum_{i=1}^m X_i^2 = k^*$, $\sum_{i=1}^m X_i$ and $\sum_{i=1}^{k^*} Z_i$ have the same distribution. Next we lower bound $\mathbb{P}\{\sum_{i=1}^{k^*} Z_i \le -k_n\}$ and $\mathbb{P}\{\sum_{i=1}^m X_i^2 = k^*\}$ separately.

We use the following non-asymptotic bound on the binomial tail probability [3, Lemma 4.7.2]: for $U \sim \mathrm{Binom}(n, p)$ and $\lambda = \frac{k}{n} \in (p, 1)$,
$$(8k(1-\lambda))^{-1/2} \exp\left(-nD(\lambda\|p)\right) \le \mathbb{P}\{U \ge k\} \le \exp\left(-nD(\lambda\|p)\right), \tag{41}$$
where $D(\lambda\|p) = \lambda\log\frac{\lambda}{p} + (1-\lambda)\log\frac{1-\lambda}{1-p}$ is the binary divergence function. Let $W \sim \mathrm{Binom}(k^*, \epsilon)$. Then
$$\mathbb{P}\left\{\sum_{i=1}^{k^*} Z_i \le -k_n\right\} = \mathbb{P}\left\{W \ge \frac{k^* + k_n}{2}\right\} \ge \sqrt{\frac{k^*}{2(k^*+k_n)(k^*-k_n)}}\, \exp\left(-k^* D\Big(\frac{1}{2} + \frac{k_n}{2k^*} \,\Big\|\, \epsilon\Big)\right) = \exp\left[-k^* D(1/2\|\epsilon) + o(\log n)\right]. \tag{42}$$
Moreover, using the following bound on binomial coefficients [3, Lemma 4.7.1]:
$$\frac{\sqrt{\pi}}{2} \le \frac{\binom{n}{k}}{(2\pi n \lambda(1-\lambda))^{-1/2} \exp(n h(\lambda))} \le 1,$$
where $\lambda = \frac{k}{n} \in (0,1)$ and $h(\lambda) = -\lambda\log\lambda - (1-\lambda)\log(1-\lambda)$ is the binary entropy function, we have that
$$\begin{aligned}
\mathbb{P}\left\{\sum_{i=1}^m X_i^2 = k^*\right\} &= \binom{m}{k^*} p^{k^*} (1-p)^{m-k^*} \ge \frac{1}{2\sqrt{2k^*(1-k^*/m)}} \exp\left(-m D(k^*/m \,\|\, p)\right) \\
&= \exp\left(-a\log n + k^* \log\frac{ea\log n}{k^*} + o(\log n)\right).
\end{aligned}$$
Observe that by the definition of $k^*$, $\log\frac{a\log n}{k^*} = D(1/2\|\epsilon) + o(1)$, and it follows from (40) that
$$\mathbb{P}\left\{\sum_{i=1}^m X_i \le -k_n\right\} \ge \exp\left[-a\log n + 2a\sqrt{\epsilon(1-\epsilon)}\,\log n + o(\log n)\right] = n^{-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}.$$
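The two-sided binomial tail bound (41) is easy to check exactly for moderate parameters. The sketch below ($n$, $p$, $k$ are illustrative values with $\lambda = k/n > p$) compares the exact tail of a binomial with the bracket of [3, Lemma 4.7.2]:

```python
import math

def upper_tail(n, p, k):
    # exact P{Binom(n, p) >= k}
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def D(lam, p):
    # binary divergence D(lam || p)
    return lam * math.log(lam / p) + (1 - lam) * math.log((1 - lam) / (1 - p))

n, p, k = 200, 0.1, 40              # illustrative values with lam = k/n > p
lam = k / n
tail = upper_tail(n, p, k)
upper = math.exp(-n * D(lam, p))
lower = (8 * k * (1 - lam)) ** -0.5 * math.exp(-n * D(lam, p))
print(lower <= tail <= upper)       # prints True
```

The prefactor $(8k(1-\lambda))^{-1/2}$ costs only a polynomial factor, which is why the lower bound loses nothing at the exponential scale used in the proof.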
The following lemma provides a deterministic sufficient condition for the success of the SDP (11) in the case of $a > b$.

Lemma 9. Suppose there exists $D^* = \mathrm{diag}\{d_i^*\}$ such that $S^* \triangleq D^* - A$ satisfies
$$S^* \succeq 0, \quad \lambda_2(S^*) > 0 \quad \text{and} \quad S^* \sigma^* = 0. \tag{43}$$
Then $\hat{Y}_{\mathrm{SDP}} = Y^*$ is the unique solution to (11).

Proof. The Lagrangian function is given by
$$L(Y, S, D) = \langle A, Y \rangle + \langle S, Y \rangle - \langle D, Y - I \rangle,$$
where the Lagrangian multipliers are $S \succeq 0$ and $D = \mathrm{diag}\{d_i\}$. Then, for any $Y$ satisfying the constraints in (11),
$$\langle A, Y \rangle \overset{(a)}{\le} L(Y, S^*, D^*) = \langle D^*, I \rangle = \langle D^*, Y^* \rangle = \langle A + S^*, Y^* \rangle \overset{(b)}{=} \langle A, Y^* \rangle,$$
where (a) holds because $\langle S^*, Y \rangle \ge 0$ and $Y_{ii} = 1$ for all $i \in [n]$; (b) holds because $\langle Y^*, S^* \rangle = (\sigma^*)^\top S^* \sigma^* = 0$ by (43). Hence, $Y^*$ is an optimal solution. It remains to establish its uniqueness. To this end, suppose $\tilde{Y}$ is an optimal solution. Then
$$\langle S^*, \tilde{Y} \rangle = \langle D^* - A, \tilde{Y} \rangle \overset{(a)}{=} \langle D^* - A, Y^* \rangle = \langle S^*, Y^* \rangle = 0,$$
where (a) holds because $\langle A, \tilde{Y} \rangle = \langle A, Y^* \rangle$ and $\tilde{Y}_{ii} = Y^*_{ii} = 1$ for all $i \in [n]$. In view of (43), since $\tilde{Y} \succeq 0$ and $S^* \succeq 0$ with $\lambda_2(S^*) > 0$, $\tilde{Y}$ must be a multiple of $Y^* = \sigma^*(\sigma^*)^\top$. Because $\tilde{Y}_{ii} = 1$ for all $i \in [n]$, $\tilde{Y} = Y^*$.

Proof of Theorem 5. Let $D^* = \mathrm{diag}\{d_i^*\}$ with
$$d_i^* = \sum_{j=1}^n A_{ij}\, \sigma_i^* \sigma_j^*. \tag{44}$$
It suffices to show that $S^* = D^* - A$ satisfies the conditions in Lemma 9 with high probability. By definition, $d_i^* \sigma_i^* = \sum_j A_{ij} \sigma_j^*$ for all $i$, i.e., $D^* \sigma^* = A \sigma^*$. Thus the condition $S^* \sigma^* = 0$ in (43) holds. It remains to verify that $S^* \succeq 0$ and $\lambda_2(S^*) > 0$ with probability converging to one, which amounts to showing that
$$\mathbb{P}\left\{ \inf_{x \perp \sigma^*,\, \|x\|_2 = 1} x^\top S^* x > 0 \right\} \to 1. \tag{45}$$
Note that $\mathbb{E}[A] = (1-2\epsilon)p(Y^* - I)$ and $Y^* = \sigma^*(\sigma^*)^\top$. Thus, for any $x$ such that $x \perp \sigma^*$ and $\|x\|_2 = 1$,
$$\begin{aligned}
x^\top S^* x &= x^\top D^* x - x^\top \mathbb{E}[A]\, x - x^\top (A - \mathbb{E}[A])\, x \\
&= x^\top D^* x - (1-2\epsilon)p\, x^\top Y^* x + (1-2\epsilon)p - x^\top (A - \mathbb{E}[A])\, x \\
&\overset{(a)}{=} x^\top D^* x + (1-2\epsilon)p - x^\top (A - \mathbb{E}[A])\, x \ge \min_{i \in [n]} d_i^* + (1-2\epsilon)p - \|A - \mathbb{E}[A]\|,
\end{aligned} \tag{46}$$
where (a) holds since $\langle x, \sigma^* \rangle = 0$. It follows from Theorem 7 that $\|A - \mathbb{E}[A]\| \le c'\sqrt{\log n}$ with high probability for a positive constant $c'$ depending only on $a$. Moreover, each $d_i^*$ is equal in distribution to $\sum_{i=1}^{n-1} X_i$, where the $X_i$ are i.i.d. with distribution $p(1-\epsilon)\delta_{+1} + p\epsilon\,\delta_{-1} + (1-p)\delta_0$. Hence, Lemma 7 implies that
$$\mathbb{P}\left\{ \sum_{i=1}^{n-1} X_i \ge \frac{\log n}{\log\log n} \right\} \ge 1 - n^{-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}.$$
Applying the union bound implies that $\min_{i \in [n]} d_i^* \ge \frac{\log n}{\log\log n}$ holds with probability at least $1 - n^{1-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}$. It then follows from the assumption $a(\sqrt{1-\epsilon} - \sqrt{\epsilon})^2 > 1$ and (46) that the desired (45) holds, completing the proof.
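The certificate construction (44) can be checked on a simulated censored block model instance. In the sketch below, $n$, $a$, $\epsilon$, and the seed are illustrative choices with $a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 > 1$; note that the positivity of $\lambda_2(S^*)$ is an asymptotic statement, so at finite $n$ it is expected in this regime rather than guaranteed. The identity $S^*\sigma^* = 0$, by contrast, holds exactly by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
n, a, eps = 300, 20.0, 0.1          # illustrative; a*(sqrt(1-eps)-sqrt(eps))**2 = 8 > 1
p = a * np.log(n) / n

sigma = rng.choice([-1.0, 1.0], size=n)
edge = np.triu(rng.random((n, n)) < p, 1)      # Erdos-Renyi edges (upper triangle)
flip = np.triu(rng.random((n, n)) < eps, 1)    # label errors with probability eps
# labeled adjacency: A_ij = sigma_i* sigma_j* on edges, sign flipped w.p. eps, 0 otherwise
A = edge * np.outer(sigma, sigma) * np.where(flip, -1.0, 1.0)
A = A + A.T

d = (A * np.outer(sigma, sigma)).sum(axis=1)   # d_i* as in (44)
S = np.diag(d) - A                             # candidate certificate S* = D* - A

eigvals = np.linalg.eigvalsh(S)                # eigenvalues in ascending order
print(np.abs(S @ sigma).max())                 # S* sigma* = 0, up to round-off
print(eigvals[0], eigvals[1])                  # one zero eigenvalue; lambda_2 expected > 0 here
```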
Proof of Theorem 6. The prior distribution of $\sigma^*$ is uniform over $\{\pm 1\}^n$. First consider the case $\epsilon = 0$. If $a < 1$, then the number of isolated vertices tends to infinity in probability [8]. For an isolated vertex $i$, $\sigma_i^*$ is equally likely to be $+1$ or $-1$ conditional on the graph, so the probability of exact recovery converges to 0.

Next we consider $\epsilon > 0$. Since the prior distribution of $\sigma^*$ is uniform, the ML estimator minimizes the error probability among all estimators, and thus we only need to determine when the ML estimator fails. Let $e(i, T) \triangleq \sum_{j \in T} |A_{ij}|$ denote the number of edges between vertex $i$ and the vertices in a set $T \subset [n]$. Let $s_i = \sum_{j: \sigma_j^* = \sigma_i^*} A_{ij}$ and $r_i = \sum_{j: \sigma_j^* \ne \sigma_i^*} A_{ij}$. Let $F$ denote the event that $\min_{i \in [n]} (s_i - r_i) \le -1$. Notice that $F$ implies the existence of $i \in [n]$ such that $\sigma'$ with $\sigma'_i = -\sigma_i^*$ and $\sigma'_j = \sigma_j^*$ for $j \ne i$ achieves a strictly higher likelihood than $\sigma^*$. Hence $\mathbb{P}\{\text{ML fails}\} \ge \mathbb{P}\{F\}$. Next we bound $\mathbb{P}\{F\}$ from below.

Let $T$ denote the set of the first $\lfloor \frac{n}{\log^2 n} \rfloor$ vertices and $T^c = [n]\setminus T$. Let $s'_i = \sum_{j \in T^c: \sigma_j^* = \sigma_i^*} A_{ij}$ and $r'_i = \sum_{j \in T^c: \sigma_j^* \ne \sigma_i^*} A_{ij}$. Then
$$\min_{i \in [n]}(s_i - r_i) \le \min_{i \in T}(s_i - r_i) \le \min_{i \in T}(s'_i - r'_i) + \max_{i \in T} e(i, T). \tag{47}$$
Let $E_1$ denote the event that $\max_{i \in T} e(i, T) \le \frac{\log n}{\log\log n} - 1$, and $E_2$ the event that $\min_{i \in T}(s'_i - r'_i) \le -\frac{\log n}{\log\log n}$. In view of (47), we have $F \supset E_1 \cap E_2$, and hence it boils down to proving that $\mathbb{P}\{E_i\} \to 1$ for $i = 1, 2$.

Notice that $e(i, T) \sim \mathrm{Binom}(|T|, a\log n/n)$. In view of the following Chernoff bound for binomial distributions [17, Theorem 4.4]: for $r \ge 1$ and $X \sim \mathrm{Binom}(n, p)$, $\mathbb{P}\{X \ge rnp\} \le (e/r)^{rnp}$, we have
$$\mathbb{P}\left\{ e(i, T) \ge \frac{\log n}{\log\log n} - 1 \right\} \le \left( \frac{\log^2 n}{ae\log\log n} \right)^{-\log n/\log\log n + 1} = n^{-2 + o(1)}.$$
Applying the union bound yields
$$\mathbb{P}\{E_1\} \ge 1 - \sum_{i \in T} \mathbb{P}\left\{ e(i, T) \ge \frac{\log n}{\log\log n} - 1 \right\} \ge 1 - n^{-1 + o(1)}.$$
Moreover,
$$\mathbb{P}\{E_2\} \overset{(a)}{=} 1 - \prod_{i \in T} \mathbb{P}\left\{ s'_i - r'_i > -\frac{\log n}{\log\log n} \right\} \overset{(b)}{\ge} 1 - \left(1 - n^{-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}\right)^{|T|} \overset{(c)}{\ge} 1 - \exp\left(-n^{1-a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 + o(1)}\right) \overset{(d)}{\to} 1,$$
where (a) holds because $\{s'_i - r'_i\}_{i \in T}$ are mutually independent; (b) follows from Lemma 8; (c) is due to $1 + x \le e^x$ for all $x \in \mathbb{R}$; (d) follows from the assumption that $a(\sqrt{1-\epsilon}-\sqrt{\epsilon})^2 < 1$. Thus $\mathbb{P}\{F\} \to 1$ and the theorem follows.
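The first step of this converse argument rests on the identity that flipping a single label $\sigma_i$ changes the log-likelihood of the censored model by $-(s_i - r_i)\log\frac{1-\epsilon}{\epsilon}$, so $s_i - r_i \le -1$ forces a strictly better assignment than $\sigma^*$. The sketch below (all parameters and the seed are illustrative choices) verifies this identity on a sampled instance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, eps = 80, 0.3, 0.2            # illustrative parameters, not from the paper

sigma = rng.choice([-1.0, 1.0], size=n)
edge = np.triu(rng.random((n, n)) < p, 1)
flip = np.triu(rng.random((n, n)) < eps, 1)
A = edge * np.outer(sigma, sigma) * np.where(flip, -1.0, 1.0)
A = A + A.T

def loglik(s):
    # log-likelihood of the observed labels under assignment s, up to an s-independent
    # constant; each unordered edge is counted twice, hence the factor 1/2
    agree = (A != 0) & (A == np.outer(s, s))
    disagree = (A != 0) & (A == -np.outer(s, s))
    return 0.5 * (np.log(1 - eps) * agree.sum() + np.log(eps) * disagree.sum())

i = 0
s_i = A[i][sigma == sigma[i]].sum()   # signed edge count to i's own community
r_i = A[i][sigma != sigma[i]].sum()   # signed edge count across communities
flipped = sigma.copy()
flipped[i] = -sigma[i]

delta = loglik(flipped) - loglik(sigma)
predicted = -(s_i - r_i) * np.log((1 - eps) / eps)
print(delta, predicted)               # the two quantities coincide
```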
References

[1] E. Abbe, A. Bandeira, A. Bracher, and A. Singer. Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery. IEEE Transactions on Network Science and Engineering, 1(1):10–22, Nov. 2014.

[2] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. arXiv:1405.3267, 2014.

[3] R. B. Ash. Information Theory. Dover Publications Inc., New York, NY, 1965.

[4] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. arXiv:1210.3335, 2012.

[5] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv:1402.1267, 2014.

[6] A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms, 18(2):116–140, Mar. 2001.

[7] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, 2011.

[8] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.

[9] A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. Algorithmica, 18(1):67–81, 1997.

[10] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, Nov. 1995.

[11] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv:1412.6156, Nov. 2014.

[12] S. Heimlicher, M. Lelarge, and L. Massoulié. Community detection in the labelled stochastic block model. arXiv:1209.2910, 2012.
[13] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[14] M. Lelarge, L. Massoulié, and J. Xu. Reconstruction in the labeled stochastic block model. In IEEE Information Theory Workshop (ITW), pages 1–5, 2013.

[15] L. Massoulié. Community detection thresholds and the weak Ramanujan property. arXiv:1109.3318, 2013.

[16] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.

[17] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005.

[18] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv:1202.1499, 2012.

[19] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. arXiv:1311.4115, 2013.

[20] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for binary symmetric block models. arXiv:1407.1591, 2014.

[21] A. Saade, F. Krzakala, M. Lelarge, and L. Zdeborová. Spectral detection in the censored block model. arXiv:1502.00163, 2015.

[22] Y. Seginer. The expected norm of random matrices. Combinatorics, Probability and Computing, 9(2):149–166, 2000.

[23] T. Tao. Topics in Random Matrix Theory. American Mathematical Society, Providence, RI, USA, 2012.

[24] S.-Y. Yun and A. Proutiere. Accurate community detection in the stochastic block model via spectral algorithms. arXiv:1412.7335, 2014.