Recovering a Hidden Community Beyond the Spectral Limit in $ O (| E ...

Report 2 Downloads 18 Views
Recovering a Hidden Community Beyond the Spectral Limit in O(|E| log∗ |V |) Time

arXiv:1510.02786v1 [stat.ML] 9 Oct 2015

Bruce Hajek

Yihong Wu

Jiaming Xu∗

October 12, 2015

Abstract The stochastic block model for one community with parameters n, K, p, and q is considered: K out of n vertices are in the community; two vertices are connected by an edge with probability p if they are both in the community and with probability q otherwise, where p > q > 0 and p/q is assumed to be bounded. An estimator based on observation of the graph G = (V, E) is said to achieve weak recovery if the mean number of misclassified vertices is o(K) as n → ∞. A critical role is played by the effective signal-to-noise ratio λ = K 2 (p − q)2 /((n − K)q). In the regime K = Θ(n), a na¨ıve degree-thresholding algorithm achieves weak recovery in O(|E|) time if λ → ∞, which coincides with the information theoretic possibility of weak recovery. The main focus of the paper is on weak recovery in the sublinear regime K = o(n) and np = no(1) . It is shown that weak recovery is provided by a belief propagation algorithm running for log∗ (n)+O(1) iterations, if λ > 1/e, with the total time complexity O(|E| log ∗ n). Conversely, log n no local algorithm with radius t of interaction satisfying t = o( log(2+np) ) can asymptotically outperform trivial random guessing if λ ≤ 1/e. By analyzing a linear message-passing algorithm that corresponds to applying power iteration to the non-backtracking matrix of the graph, we provide evidence to suggest that spectral methods fail to provide weak recovery if λ ≤ 1.

1

Introduction

The problem of finding a densely connected subgraph in a large graph arises in many research disciplines such as theoretical computer science, statistics, and theoretical physics. To study this problem, the stochastic block model [22] for one dense community with parameters n, K, p, and q is considered: K vertices chosen uniformly at random out of n vertices are in the community, denoted by C ∗ ; A random graph G = (V, E) is generated, where two vertices are connected by an edge with probability p if they are both in the community and with probability q otherwise, where p > q > 0. The subgraph induced by the community C ∗ has higher edge connectivity than the rest of the graph, and hence the model is also known as the planted dense subgraph model [28, 3, 6, 18, 31]. The problem of interest is to recover C ∗ based on a single realization of G. For simplicity, we assume the model parameters (K, p, q) are known to the estimators, and impose the mild assumptions that K/n is bounded away from one and p/q is bounded. We primarily focus on two types of recovery guarantees. ∗ B. Hajek and Y. Wu are with the Department of ECE and Coordinated Science Lab, University of Illinois at Urbana-Champaign, Urbana, IL, {b-hajek,yihongwu}@illinois.edu. J. Xu is with Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, [email protected].

1

b = C(G) b b exactly recovers C ∗ if Definition 1. (Exact Recovery) Given an estimator C ⊂ [n], C b 6= C ∗ } = 0, where the probability is taken with respect to the randomness of C ∗ and limn→∞ P{C G. b which almost comDepending on the application, it may be enough to ask for an estimator C ∗ pletely agrees with C .

b = C(G) b b weakly recovers C ∗ if, as Definition 2. (Weak Recovery) Given an estimator C ⊂ [n], C 1 b ∗ n → ∞, K |C△C | → 0, where the convergence is in probability, and △ denotes the set difference.

Exact recovery and weak recovery are the same as weak and strong consistency as defined in [34], respectively. Clearly an estimator that exactly recovers C ∗ also weakly recovers C ∗ . Also, it is not hard to show that the existence of an estimator satisfying Definition 2 is equivalent to the ∗ |] = o(K) (see [19, Appendix A] for a proof). b existence of an estimator such that E[|C△C Intuitively, if the community size K decreases, or p and q get closer, recovery of the community becomes harder. A critical role is played by the parameter λ=

K 2 (p − q)2 , (n − K)q

(1)

which can be interpreted as the effective signal-to-noise ratio for the problem of classifying a vertex according to its degree. For example, if K ≍ n − K ≍ n, p/q is bounded, and p is bounded away from one, a na¨ıve degree-thresholding algorithm can attain weak recovery in time linear in the number of edges, provided that λ → ∞, which coincides with the information-theoretic possibility of weak recovery. Moreover, one can show that degree-thresholding followed by a linear-time voting procedure achieves exact recovery whenever it is information theoretically possible in this asymptotic regime (see Appendix A for a proof). The main focus of this paper is on the analysis of the belief propagation (BP) algorithm for community recovery in the regime K = o(n) and np = no(1) . Intuitively, the belief propagation algorithm is an iterative algorithm which aggregates the likelihoods computed in the previous iterations with the observations in the current iteration. In fact, running belief propagation for one iteration reduces to degree thresholding. Montanari [31] analyzed the performance of the belief propagation algorithm for community recovery in a closely related regime with p = a/n, q = b/n, and K = κn, where a, b, κ are assumed to be fixed as n → ∞. In the double limit where first n → ∞, and then κ → 0 and a, b → ∞, it is shown that using a local algorithm, namely belief propagation running for a constant number of iterations, the sum of the probability for misclassifying a vertex in the community and the probability for misclassifying a vertex outside the community converges to zero if λ > 1/e; conversely, if λ < 1/e, for all local algorithms, the sum of the two misclassification probabilities is bounded away from 0 (see Section 1.4 for a more detailed explanation). However, the following question remains elusive: Is λ > 1/e the performance limit of belief propagation algorithms for weak recovery in the regime K = o(n) ? In this paper, we answer positively this question by analyzing belief propagation running for log∗ n + O(1) iterations1 . We show (Theorem 1) that if λ > 1/e, weak recovery can be achieved by a belief propagation algorithm running for log∗ n + O(1) iterations, whereas if λ < 1/e, not even very weak forms of recovery, defined in Appendix B, can be achieved (Theorems 3 - 4). The proof is based on analyzing the analogous belief propagation algorithm to classify the root node of a random tree graph, such that the random tree graph is the limit in distribution of the neighborhood of a vertex 1 The iterated logarithm log∗ (n) denotes the number of times the logarithm function must be iteratively applied to n to get a result less than or equal to one.

2

in the original graph G. In contrast to the analysis of belief propagation in [31] where the number of iterations is fixed as n → ∞, a novel feature of our proof is that the analysis on the tree and the associated coupling lemmas (Lemmas 15 and 18) involve the number of iterations converging slowly to infinity as the size of the graph increases. However, we observe that λ > 1/e still far n exceeds the information-theoretic minimal signal-to-noise ratio for exact recovery λ = Θ( K n log K ). It remains open whether any polynomial-time algorithm can provide weak recovery for λ < 1/e. Evidence is given in Section 6 to suggest that the spectral limit for weak recovery of the community is given by λ > 1, or, that is, spectral algorithms are inferior to the belief propagation algorithm by a factor of e in terms of the signal-to-noise ratio. The particular algorithm analyzed in Section 6 is a linear message passing algorithm corresponding to applying the power method to the non-backtracking matrix of the graph [26, 5], whose spectrum has been shown to be more informative than that of the adjacency matrix for the purpose of clustering. It is established that this linear message passing algorithm followed by thresholding provides weak recovery if λ > 1 (Theorem 5) and it does not even asymptotically outperform trivial random guessing if λ < 1 (Theorems 6 and 7). Finally, we address exact recovery. As shown in [19, Theorem 7], if there is an algorithm that can provide weak recovery even if the community size is random and only approximately equals to K, then it can be combined with a linear-time voting procedure to achieve exact recovery whenever it is information-theoretically possible. For K = o(n), we show that both the belief propagation and the linear message-passing method indeed work for such a case and hence can be upgraded to achieve exact recovery via local voting. Notation Let A denote the adjacency matrix of the graph G, I denote the identity matrix, and J denote the all-one matrix. For any matrix Y , let kY k denote its spectral norm. For any vector x ∈ Rn , let diag {x} denote the diagonal matrix with the diagonal entries given by x. For any positive integer n, let [n] = {1, . . . , n}. For any set T ⊂ [n], let |T | denote its cardinality and T c denote its complement. We use standard big O notations, e.g., for any sequences {an } and {bn }, an = Θ(bn ) or an ≍ bn if there is an absolute constant c > 0 such that 1/c ≤ an /bn ≤ c. Let Bern(p) denote the Bernoulli distribution with mean p and Binom(n, p) denote the binomial distribution with n trials and success probability p. Let d(pkq) = p log pq + (1 − p) log 1−p 1−q denote the KullbackLeibler (KL) divergence between Bern(p) and Bern(q). All logarithms are natural unless by default and we use the convention 0 log 0 = 0. Let Φ(x) and Q(x) denote the cumulative distribution function (CDF) and complementary CDF of a standard normal distribution, respectively. We say a sequence of events En holds with high probability, if P {En } → 1 as n → ∞.

1.1

Overview of main results

Our main theorems invoke the following assumption. Assumption 1. As n → ∞, p ≥ q, p/q = O(1), λ ≡

K 2 (p−q)2 (n−K)q

is a positive constant, and K = o(n).

Next we summarize the major aspects of our findings assuming Assumption 1: bBP be produced by running the belief propagation Recovery by belief propagation Let C ∗ ∗ algorithm (Algorithm 1) for log (n/K) + O(1) iterations. Suppose that (np)log (n/K) = no(1) , which holds, e.g., if the average degree is polylogarithmic in n, namely, np = O(logs n) for some bBP provides weak recovery of C ∗ , namely, constant s > 0. If λ > 1/e, then Theorem 1 shows that C ∗ ∗ bBP |] = o(K), in O(m log n) time, where m is the number of edges in the graph G. E[|C △C 3

In addition, suppose the following information-theoretic sufficient condition for exact recovery holds: lim inf n→∞

Kd(τ ∗ kq) > 1, log n

(2)

where ∗

τ =

log

1−q 1−p

log

1 n K log K p(1−q) q(1−p)

+

.

(3)

e be produced by combining C bBP with a local voting procedure as specified in Algorithm 2. Let C e = C ∗ } → 1, is achieved also in O(m log∗ n) Theorem 2 shows that exact recovery, namely, P{C time. Conversely, if 0 < λ ≤ 1/e, we show in Theorems 3 and 4 that any local algorithm, such that for each vertex u in G, the membership σu is estimated based on G in a neighborhood of log n radius o( log(np+2) ) from u, fails to achieve weak recovery; in particular, belief propagation cannot asymptotically improve upon random guessing without observing the graph. bLBP be the estimator produced by running the linear Recovery by spectral algorithms Let C  iterations. Theorem 5 shows that if message passing algorithm (Algorithm 3) for 21 logλ n−K K s log(n/K) o(1) bLBP provides λ > 1 and (np) = n , which holds if np = log n for some constant s > 0, C weak recovery.  ), Conversely, Theorems 6 and 7 show that if λ ≤ 1, (np)tf = no(1) , and tf = O(log n−K K then the linear message passing algorithm running up to tf iterations plus thresholding cannot asymptotically do better than trivial random guessing.

1.2

Comparison with information theoretic limits

As noted in the introduction, in the regime K = Θ(n), degree-thresholding achieves weak recovery and, if a voting procedure is also used, exact recovery whenever it is information theoretically possible. This section compares the recovery thresholds by belief propagation to the informationtheoretic thresholds established in [19], in the regime of K = o(n),

np = no(1) ,

p/q = O(1),

(4)

which is the main focus of this paper. The information-theoretic threshold for weak recovery is established in [19, Theorem 1], which, in the regime (4), reduces to the following: If lim inf n→∞

Kd(pkq) n > 1, 2 log K

(5)

then weak recovery is possible. On the other hand, if weak recovery is possible, then lim inf n→∞

Kd(pkq) n ≥ 1. 2 log K

(6)

To compare with belief propagation, we rephrase the above sharp threshold in terms of the signal-tonoise ratio λ defined in (1). Note that d(pkq) = (p log pq + q − p)(1 + o(1)) provided that p/q = O(1) and p → 0. Therefore the information-theoretic weak recovery threshold is given by λ > (C(p/q) + ǫ) 4

n K log , n K

(7)

2

2(α−1) for any ǫ > 0, where C(α) , 1−α+α log α . In other words, in principle weak recovery only demands n a vanishing signal-to-noise ratio λ = Θ( K n log K ), while, in contrast, belief propagation requires λ > 1/e to achieve weak recovery. No polynomial-time algorithm is known to succeed for λ ≤ 1/e, suggesting that computational complexity constraints might incur a severe penalty on the statistical optimality in the sublinear regime of K = o(n).

Next we turn to exact recovery. The information-theoretic optimal threshold has been established in [19, Theorem 3]. In the regime of interest (4), exact recovery is possible via the maximum likelihood estimator (MLE) provided that (5) and (2) hold. Conversely, if exact recovery is possible, then (6) and lim inf n→∞

Kd(τ ∗ kq) ≥1 log n

(8)

must hold. Notice that the information-theoretic sufficient condition for exact recovery has two parts: one is the information-theoretic sufficient condition (5) for weak recovery; the other is the sufficient condition (2) for the success of the linear time voting procedure. Similarly, recall that the sufficient condition for exact recovery by belief propagation also has two parts: one is the sufficient condition λ > 1/e for weak recovery, and the other is again (2). Clearly, the information-theoretic sufficient conditions for exact recovery and λ > 1/e, which is needed for weak recovery by local algorithms, are both at least as strong as the information theoretic necessary conditions (6) for weak recovery. It is thus of interest to compare them by assuming that 2 (6) holds. If p/q is bounded, p is bounded away from 1, and (6) holds, then d(τ ∗ kq) ≍ (p−q) as q shown in [19]. So under those conditions on p, q and (6), and if K/n is bounded away from 1,   n Kd(τ ∗ kq) K(p − q)2 ≍ ≍ λ. (9) log n q log n K log n Hence, the information-theoretic sufficient condition for exact recovery (2) demands a signal-ton noise ratio λ = Θ( K log ). Therefore, on one hand, if K = ω(n/ log n), then condition (2) is n stronger than λ > 1/e, and thus condition (2) alone is sufficient for local algorithms to attain exact recovery. On the other hand, if K = o(n/ log n), then λ > 1/e is stronger than condition (8), and thus for local algorithms to achieve exact recovery, it requires λ > 1/e, which far exceeds the K log n information-theoretic optimal signal-to-noise  ratio for exact recovery λ = Θ( n ). The critical value of K for this crossover is K = Θ K=

n log n

ρn , logs−1 n

. To illustrate it further, consider the regime

p=

a logs n , n

q=

b logs n , n

2

≍ log n where s ≥ 1, ρ ∈ (0, 1), a > b > 0 are fixed constants. In this regime, Kd(pkq) ≍ K(p−q) q and thus (5) is satisfied and weak recovery is information theoretically possible. Furthermore, τ ∗ = [1 + o(1)]τ0 logs n/n, d(τ ∗ kq) = [1 + o(1)]I(b, τ0 ) logs n/n, where τ0 = (a − b)/ log(a/b) and I(x, y) , x − y log(ex/y) for x, y > 0. Hence, information-theoretically, exact recovery is possible if ρI(b, τ0 ) > 1 and impossible if ρI(b, τ0 ) < 1. • When 1 ≤ s < 2, then λ = Ω(log2−s n) → ∞, and thus weak recovery is achievable in polynomial-time by the message passing algorithm, spectral methods, or even the degreethresholding algorithm. Exact recovery is achievable in polynomial-time by combining a polynomial-time weak recovery algorithm with a linear-time voting procedure, provide that ρI(b, τ0 ) > 1. 5

ρ2 (a−b)2 , and weak recovery b ρ2 (a−b)2 possible if ≤ 1/e. Fig. 1 b

• When s = 2, then λ =

ρ2 (a−b)2 > b ρ2 (a−b)2 = 1/e} b

by local algorithms is possible if

shows the curve {(b, ρ) : 1/e and is not corresponding to the weak recovery condition by local algorithms, and the curve {(b, ρ) : ρI(b, τ0 ) = 1} corresponding to the information-theoretic exact recovery condition as ρ and b vary, with a = 2b.

• When s > 2, then λ = Θ(log2−s (n)) = o(1), and thus the local recovery algorithms fail to provide recovery asymptotically better than trivial random guessing; we do not know any polynomial-time procedure for providing weak recovery, or exact recovery.

0.12

ρ

0.1 I 0.08 exact recovery threshold 0.06 II 0.04 BP threshold IV

III

0.02 200

400

600

800

b

Figure 1: Phase diagram with K = ρn/ log n, p = 2q, and q = b log2 n/n for b, ρ fixed as n → ∞. In region I, exact recovery is provided by the BP algorithm plus voting procedure. In region II, weak recovery is provided by the BP algorithm, but exact recovery is not information theoretically possible. In region III exact recovery is information theoretically possible, but no polynomial-time algorithm is known for even weak recovery. In region IV, with b > 0 and ρ > 0, weak recovery, but not exact recovery, is information theoretically possible and no polynomial time algorithm is known for weak recovery. As shown in [20], the relationship between weak recovery by an iterative algorithm and the conditions for exact recovery are similar in the case the observation given the community C ∗ is a Gaussian matrix, known as the submatrix localization problem.

1.3

Surpassing the spectral limit

Spectral algorithms estimate the communities based on the leading eigenvectors of the adjacency matrix, see, e.g., [1, 28, 32, 40] and the reference therein. Under the single community model, ∗ E [A] = (p − q)(1C ∗ 1⊤ C ∗ − diag {1C ∗ }) + q(J − I), where 1C ∗ is the indicator vector of C . Using the Davis-Kahan sin θ theorem [9], one can argue that the principal eigenvector of A− q(J− I) is almost parallel to 1C ∗ provided that the spectral norm kA − E [A] k is much smaller than K(p − q); thus one can get a good estimate of C ∗ by thresholding the principal eigenvector entry-wise. Assume that p/q = Θ(1). In the dense regime with np = Ω(log n), standard matrix concentration results √ [15, 17, 4] imply that with high probability, kA − E [A] k = O( nq), and thus the above procedure succeeds if the effective signal-to-noise ratio λ is sufficiently large. However, in the sparse regime √ with np = o(log n), it is known that with high probability kA−E [A] k = ω( nq) due to the existence 6

of high-degree vertices (see, e.g., [17, Appendix A] for a proof). This suggests that the conventional spectral algorithms are likely to fail in the sparse regime; this observation has already been pointed out in several related works [15, 7, 27, 33]. To estimate the communities in the sparse regime, a new spectral method based on the spectrum of the non-backtracking matrix is proposed and studied in [26, 5]. In Section 6, we analyze a linear message passing algorithm which corresponds to a power method for computing the leading eigenvectors of the non-backtracking matrix for the graph. It is established that the linear message passing algorithm followed by thresholding provides weak recovery if λ > 1 (Theorem 5) and it does not even achieve recovery asymptotically better than trivial random guessing if λ ≤ 1 (Theorems 6 and 7), suggesting that the spectral limit for weak recovery is given by λ > 1.

1.4 1.4.1

Further connections to the literature Belief propagation and local algorithms

Montanari [31] investigated the belief propagation algorithm for the community recovery problem with parameters (n, K, p, q) in a closely related regime, where K/n and λ are fixed and np, nq → ∞ as n → ∞, which, in particular, entails that p/q → 1. In this regime the distributions of the messages for belief propagation are shown to be asymptotically Gaussian, with the log likelihood ratio at a vertex i based on the neighborhood of fixed radius t having the N (vt /2, vt ) distribution if i ∈ C ∗ and the N (−vt /2, vt ) distribution if i ∈ / C ∗ , where v0 = 0 and # " 1 . (10) vt+1 = λ E −N (v /2,v ) K t t + e n−K Furthermore, Montanari [31] examined the recursion (10) to suggest that λ > 1/e is the performance threshold in case K = o(n). We summarize that argument of [31] here, and explain why it does not prove that λ > 1/e is necessary or sufficient for weak recovery if K = o(n). On one hand, the K right hand side of (10) is less than λevt for all positive values of the ratio n−K , so if λ < 1/e, then for fixed t, vt is bounded uniformly from above by x∗ (λ), the smallest positive solution of x = eλx , which is less than e. This suggests, but does not prove, that if λe < 1, then weak recovery is impossible in the case K = o(n), because (10) is predicated on K/n being constant. On the other hand, if λ > 1/e, then vt can be made arbitrarily large by taking t a large constant and the K ratio n−K small enough. However, this observation does not prove that weak recovery is possible if K = o(n) and λ > 1/e. The reason is that the performance of an estimator is characterized in [31] by a rescaled success probability, which can be expressed as Psucc = 1 − pe,0 − pe,1 , where pe,1 (resp. pe,0 ) denotes the probability of misclassifying a vertex inside (resp. outside) the community C ∗ . If K grows linearly with n, then Psucc → 1 is equivalent to weak recovery: Kpe,1 +(n−K)pe,0 = o(K). However, if K = o(n), then even if Psucc → 1, it could be that (n − K)pe,0 ≫ K and weak recovery fails. (See related discussions in Appendix B). For the general stochastic block model with possibly multiple hidden communities, belief propagation algorithm has been conjectured in [10] to be optimal in minimizing the fraction of misclassified vertices in certain special cases. When there are two hidden communities where every vertex is equally likely to be in one of the two communities, this optimality of belief propagation has been partially proved in the subsequent work [36, 38, 37]. 1.4.2

Computational barriers

The problem of recovering a single community demonstrates a fascinating interplay between statistics and computation and a potential departure between computational and statistical limits. 7

In the special case of p = 1 and q = 1/2, the problem of finding one community reduces to the classical planted clique problem [23]. If the clique has size K ≤ 2(1 − ǫ) log2 n for any ǫ > 0, then it cannot be uniquely determined; if K ≥ 2(1 + ǫ) log2 n, an exhaustive search finds the clique with probability converging to 1. In contrast, polynomial-time algorithms are only known to find √ a clique of size p K ≥ c n for any constant c > 0 2[1, 16, 11, 2], and it is shown in [12] that p if K ≥ (1 + ǫ) n/e, the clique can be found in O(n log n) time with high probability and n/e may be a fundamental limit for solving the planted clique problem in nearly linear time. Recent work [29] shows that the degree-r sum-of-squares (SOS) relaxation cannot find the clique unless √ K & ( n/ log n)1/r ; an improved lower bound K & n1/3 / log n for the degree-4 SOS is proved in [13]. Another recent line of work focuses on the case p = n−α , q = cn−α for fixed constants c < 1 and 0 < α < 1, and K = Θ(nβ ) for 0 < β < 1. It is shown in [6] that if β < α, no algorithm can exactly recover the community with a vanishing probability of error regardless of the computational costs; if β > α, the MLE exactly recovers the community with probability converging to 1; if β > 12 + α2 , a semidefinite relaxation of the MLE exactly recovers the community in polynomial-time with high probability. It is further conjectured to be computationally intractable to exactly recover the planted cluster if α < β < 12 + α2 . Recent work [18] provides some evidence to this conjecture by showing that if α < β < 21 + α4 , no randomized polynomial-time solver exists, conditioned on the planted clique hardness hypothesis (see [18, Hypothesis 1] for the precise statement). In sharp contrast to the computational barriers discussed in the previous two paragraphs, in the regime p = a log n/n and q = b log n/n for fixed a, b and K = ρn for a fixed constant 0 < ρ < 1, recent work [17] derived a function f (a, b) such that if ρf (a, b) > 1, exact recovery is achievable in polynomial-time via semi-definite relaxations of ML estimation with probability tending to one; if ρf (a, b) < 1, any estimator fails to exactly recover the cluster with probability tending to one regardless of the computational costs. In summary, the previous work indicates that to exactly recover a single community, a significant gap between the information limit and the limit of polynomial-time algorithms emerges as the community size K decreases from K = Θ(n) to K = nβ for 0 < β < 1. The results in this paper further suggest that the computational gap emerges as soon as the community size is below the critical point of K = Θ(n/ log n) for exact recovery and at K = o(n) for weak recovery.

1.5

Paper outline

The rest of this paper is organized as follows. Section 2 focuses on the belief propagation algorithm run on a random tree network, and coupling lemmas are given in Section 3 that justify the approximation of the neighborhood of a vertex in the graph G by a tree. Weak and exact recovery results for the belief propagation algorithm running on a graph G, are given in Section 4; converse results for local algorithms are given in Section 5. The linear message passing algorithm is formulated and analyzed in Section 6.

2

Inference problem on a random tree by belief propagation

In the regime we consider, the graph is locally tree like, with mean degree converging to infinity. We could derive a belief propagation algorithm for this problem by starting with the initial graph and making some approximations. Then the same algorithm could be run on the limiting infinite tree network with Poisson distributed degrees. Instead of taking that approach, we begin by deriving the exact belief propagation algorithm for the infinite tree network, and then deduce performance results for using that same algorithm on the original graph. The advantage of this order is that 8

the belief propagation algorithm on the tree is computing precise likelihood ratios. Since K can be much smaller than n, the error probability for classification of a vertex should be o(K/n) so that the expected number of misclassified vertices is o(K). A novel aspect of the analysis in this section is that we show that letting the number of iterations of the belief propagation algorithm grow slowly with n is adequate to yield a sufficiently small error probability. (The spectral gap of the adjacency matrix for the community recovery problem is not large enough with sufficiently high probability to warrant the spectral cleanup approach of [12, Lemma 2.4] that works for clique recovery with a Gaussian matrix.) The related inference problem on a Galton-Watson tree with Poisson numbers of offspring is defined as follows. Fix a vertex u and let Tu denote the infinite Galton-Watson tree rooted at vertex u. For vertex i in Tu , let Tit denote the subtree of Tu rooted at vertex i of height t. Let τi ∈ {0, 1} denote the label of vertex i in Tu . Assume τu ∼ Bern(K/n). For any vertex i ∈ Tu , let Li denote the number of its children j with τj = 1, and Mi denote the number of its children j with τj = 0. Suppose that Li ∼ Pois(Kp) if τi = 1, Li ∼ Pois(Kq) if τi = 0, and Mi ∼ Pois((n − K)q) for either value of τi . We are interested in estimating the label of root u given observation of the tree Tut . Notice that the labels of vertices in Tut are not observed. The probability of error for an estimator τbu (Tut ) is defined by n−K K P (b τu = 0|τu = 1) + P (b τu = 1|τu = 0). n n

pte ,

(11)

The estimator that minimizes pte is the maximum a posteriori probability (MAP) estimator, which can be expressed either in terms of the log belief ratio or log likelihood ratio: τbMAP = 1{ξut ≥0} = 1{Λtu ≥ν} ,

where ξut

 P τu = 1|Tut , log , P {τu = 0|Tut }

Λtu

(12)

 P Tut |τu = 1 , log , P {Tut |τu = 0}

t t 0 and ν = log n−K K . By Bayes’ formula, ξu = Λu − ν, and by definition, Λu = 0. By a standard result in the theory of binary hypothesis testing [24], the probability of error for the MAP decision rule is bounded by π1 π0 ρ2B ≤ pte ≤ (π1 π0 )1/2 ρB , (13) h t i where the Bhattacharyya coefficient (or Hellinger integral) ρB is defined by ρB = E eΛu /2 |τu = 0 , and π1 and π0 are the prior probabilities on the hypotheses. We comment briefly on the parameters of the model. The distribution of the tree Tu is de2 (p−q)2 , ν, and the ratio, p/q. Indeed, vertex u has label termined by the three parameters λ , K(n−K)q

τu = 1 with probability

K n

=

1 1+eν ,

and the mean numbers of children of a vertex i are given by:

λ(p/q)eν (p/q − 1)2 λeν E [Li |τi = 0] = Kq = (p/q − 1)2 λe2ν . E [Mi ] = (n − K)q = (p/q − 1)2

E [Li |τi = 1] = Kp =

9

(14) (15) (16)

The parameter λ can be interpreted as a signal to noise ratio in case K ≪ n and p/q = O(1), because varMi ≫ varLi and λ=

(E [Mi + Li |τi = 1] − E [Mi + Li |τi = 0])2 . varMi

In this section, the parameters are allowed to vary as long as λ > 0 and p/q > 1, although the focus is on the asymptotic regime: λ fixed, p/q = O(1), and ν → ∞. This entails that the mean numbers of children given in (14)-(16) converge to infinity. Montanari [31] considers the case of ν fixed with p/q → 1, which also leads to the mean vertex degrees converging to infinity. It is well-known that the likelihoods can be computed via a belief propagation algorithm. Let ∂i denote the set of children of vertex i in Tu and π(i) denote the parent of i. For every vertex i ∈ Tu other than u, define  P Tit |τi = 1 t . Λi→π(i) , log P {Tit |τi = 0} The following lemma gives a recursive formula to compute Λtu ; no approximations are needed. Lemma 1. For t ≥ 0, Λt+1 = −K(p − q) + u Λt+1 i→π(i)

= −K(p − q) +

Λ0i→π(i) = 0,

∀i 6= u.

X

t

log

eΛℓ→u −ν (p/q) + 1 t

eΛℓ→u −ν + 1

ℓ∈∂u

X

t

log

ℓ∈∂i

eΛℓ→i −ν (p/q) + 1 t

eΛℓ→i −ν + 1

!

!

,

, ∀i 6= u

Proof. The last equation follows by definition. We prove the first equation; the second one follows similarly. A key point is to use the independent splitting property of the Poisson distribution to give an equivalent description of the numbers of children with each label for any vertex in the tree. Instead of separately generating the number of children of with each label, we can first generate the total number of children and then independently and randomly label each child. Specifically, for every vertex i in Tu , let Ni denote the total number of its children. Let d1 = Kp + (n − K)q and d2 = Kq + (n − K)q = nq. If τi = 1 then Ni ∼ Pois(d1 ), and for each child j ∈ ∂i, independently of everything else, τj = 1 with probability Kp/d1 and τj = 0 with probability (n − K)q/d1 . If τi = 0 then Ni ∼ Pois(d0 ), and for each child j ∈ ∂i, independently of everything else, τj = 1 with probability K/n and τj = 0 with probability (n − K)/n. With this view, the observation of the total number of children Nu of vertex u gives some information on the label of u, and then the conditionally independent messages from those children give additional information. To be precise,

10

we have that Λt+1 u

 P Tut+1 |τu = 1 = log  t+1 P Tu |τu = 0

 P Tit |τu = 1 P {Nu |τu = 1} X + log = log P {Nu |τu = 0} P {Tit |τu = 0} i∈∂u  t P X d1 (b) x∈{0,1} P {τi = x|τu = 1} P Ti |τi = x = −K(p − q) + Nu log + log P t d0 τi ∈{0,1} P {τi = x|τu = 0} P {Ti |τi = x} i∈∂u   X KpP Tit |τi = 1 + (n − K)qP Tit |τi = 0 (c) = −K(p − q) + log KqP {Tit |τi = 1} + (n − K)qP {Tit |τi = 0} (a)

i∈∂u

(d)

= −K(p − q) +

X

i∈∂u

t

log

eΛi→u −ν (p/q) + 1 t

eΛi→u −ν + 1

,

where (a) holds because Nu and Tit for i ∈ ∂u are independent conditional on τu ; (b) follows because Nu ∼ Pois(d1 ) if τu = 1 and Nu ∼ Pois(d0 ) if τu = 0, and Tit is independent of τu conditional on τi ; (c) follows from the fact τi ∼ Bern(Kp/d1 ) given τu = 1, and τi ∼ Bern(Kq/d0 ) given τu = 0; (d) follows from the definition of Λti→u . Notice that Λtu is a function of Tut alone; and it is statistically correlated with the vertex labels. Also, since the construction of a subtree Tit and its vertex labels is the same as the construction of Tut and its vertex labels, the conditional distribution of Tit given τi is the same as the conditional distribution of Tut given τu . Therefore, for any i ∈ ∂u, the conditional distribution of Λti→u given τi is the same as the conditional distribution of Λtu given τu . Let Z0t denote a random variable that has the same distribution as Λtu given τu = 0, and let Z1t denote a random variable that has the same distribution as Λtu given τu = 1. Lemma 1 determines a recursion that determines the probability distribution of Z0t+1 in terms of the probability distribution of Z0t . This recursion is infinite dimensional. As discussed in Section 2.5, in a certain limiting regime, the recursion can be reduced to a finite dimensional recursion. The remainder of this section is devoted to the analysis of belief propagation on the Poisson tree model, and is organized into two main parts. In the first part, Section 2.1 gives expressions for exponential moments of the log likelihood messages, which are applied in Section 2.2 to yield an upper bound on the error probability for the problem of classifying the root vertex of the tree. That bound, together with the coupling lemmas of Section 3, is enough to establish weak recovery for the belief propagation algorithm run on graph G, given in Section 4. The second part of this section focuses on first and second moments and Gaussian limits of the messages themselves, leading to lower bounds on the probability of correct classification in Section 2.6. Those bounds, together with the coupling lemmas of Section 3, are used in Section 5 to establish the converse results for local algorithms.

2.1

Exponential moments of log likelihood messages for Poisson tree

The following lemma gives formulas for some exponential moments of Z0t and Z1t , based on Lemma 1. Although the formulas are not recursions, they are close enough to recursions to permit useful analysis.

11

Lemma 2. For t ≥ 0,

#) t eZ1 , = exp λE =E e E e t 1 + eZ1 −ν   " # !2    h t+1 i Z1t Z1t 2 e e λ   = exp 3λE E e2Z1 E + . t t   K(p − q) 1 + eZ1 −ν 1 + eZ1 −ν h

Z1t+1

i

h

2Z0t+1

(

i

"

(17)

(18)

More generally, for any integer h ≥ 2,   !j−1  j−1 t h     i h h t+1 i Z X t+1 e 1 h λ  E = exp K(p − q) = E e(h−1)Z1 E ehZ0 t   K(p − q) j 1 + eZ1 −ν j=2

(19)

Proof. By the definition of Λtu and change of measure, we have h i   t E g(Λtu )|τu = 0 = E g(Λtu )e−Λu |τu = 1 ,

where g is any measurable function such that the expectations above are well-defined. It follows that i h   t (20) E g(Z0t ) = E g(Z1t )e−Z1 . h ti h ti h ti Plugging g(z) = ez and g(z) = e2z , we have that E eZ0 = 1 and E e2Z0 = E eZ1 . Moreover,

Plugging g(z) =

1 1+e−z+ν

h i     t eν E g(Z0t ) + E g(Z1t ) = E g(Z1t )(e−Z1 +ν + 1) . and g(z) =

1 (1+e−z+ν )2

into the last displayed equation, we have



   1 1 + E = 1, t t 1 + e−Z0 +ν 1 + e−Z1 +ν       1 1 1 ν e E +E =E . t t t (1 + e−Z0 +ν )2 (1 + e−Z1 +ν )2 1 + e−Z1 +ν eν E

In view of Lemma 1, by defining f (x) = t+1

e2Λu

(21)

x(p/q)+1 x+1 ,

= e−2K(p−q)

(22) (23)

we get that   t Y f 2 eΛℓ→u −ν .

ℓ∈∂u

Since the distribution of Λtℓ→u conditional on τu = 0 and τu = 1 is the same as the distribution of Z0t and Z1t , respectively, it follows that  h  iLu   h  t+1 iMu  h t+1 i Z1t+1 −ν 2 −2K(p−q) 2Z0 . E E f 2 eZ0 −ν =e E E f e E e

  Using the fact that E cX = eλ(c−1) for X ∼ Pois(λ) and c > 0, we have i   h  t+1 i h  h  t+1 i h t+1 i = exp −2K(p − q) + Kq E f 2 eZ1 −ν − 1 + (n − K)q E f 2 eZ0 −ν − 1 . E e2Z0 12

Notice that 2

f (x) = It follows that



p/q − 1 1+ 1 + x−1

2

=1+

2(p/q − 1) (p/q − 1)2 + . 1 + x−1 (1 + x−1 )2

   h  t+1 i  h  t+1 i Kq E f 2 eZ1 −ν − 1 + (n − K)q E f 2 eZ0 −ν − 1      1 1 ν = 2Kq(p/q − 1) E +e E t t 1 + e−Z1 +ν 1 + e−Z0 +ν      1 1 ν + Kq(p/q − 1)2 E + e E t t (1 + e−Z1 +ν )2 (1 + e−Z0 +ν )2   1 (a) = 2K(p − q) + Kq(p/q − 1)2 E t 1 + e−Z1 +ν " # t eZ1 = 2K(p − q) + λE , t 1 + eZ1 −ν

where (a) follows by applying (22) and (23). Combining the above proves (17). The proof of (19) h  p/q−1 is expanded using is essentially the same as the proof just given for (17); f h (x) = 1 + 1+x −1 binomial coefficients as already illustrated for h = 2. Then (18) follows from (19) for h = 3. Using the notation h ti at = E eZ1 " # t eZ1 bt = E , t 1 + eZ1 −ν

(24) (25)

(17) becomes at+1 = exp(λbt ).

(26)

The following lemma provides upper bounds on some exponential moments in terms of bt .   2  h t+1 i p p ′ Lemma 3. Let C , λ(2 + q ) and C , λ 3 + 2 q + pq ≤ exp(Cbt ) and . Then E e2Z1 h t+1 i ≤ exp(C ′ bt ). More generally, for any integer h ≥ 2, E e3Z1  j−2 P i h h t+1 i h p λbt h (h−1)Z1t+1 hZ0 j=2 ( j ) q −1 ≤e =E e . E e

ez 1+ez−ν



≤ for all z. Therefore, for any j ≥ 2, Proof. Note that Applying this inequality to (19) yields (27).

2.2



ez 1+ez−ν

j−1

(27) ≤

e(j−2)ν



ez 1+ez−ν



.

Upper bound on classification error via exponential moments

Note that bt ≈ at if ν ≫ 0, in which case (26) is approximately a recursion for the a’s. The following two lemmas use this intuition to show that if λ > 1/e and ν is large enough, the bt ’s eventually grow large. In turn, that fact will be used that Bhattacharyya coefficient mentioned above, h tot show i h the i t /2 Z /2 −Z 0 1 which can be expressed as ρB = E e =E e , becomes small, culminating in Lemma 8, giving an upper bound on the classification error for the root vertex. 13

Lemma 4. Let C , λ(2 + pq ). Then   bt+1 ≥ exp(λbt ) 1 − e−ν/2 Proof. Note that C − λ > 0. If bt ≤

ν 2(C−λ) ,

if bt ≤

ν . 2(C − λ)

(28)

we have

i (b) h (a) t+1 bt+1 ≥ at+1 − E e−ν+2Z1 ≥ eλbt − e−ν+Cbt    (c)  = eλbt 1 − e−ν+(C−λ)bt ≥ eλbt 1 − e−ν/2 .

where (a) follows by the definitions (24) and (25) and the fact ν . from Lemma 3; (c) follows from the condition bt ≤ 2(C−λ)

1 1+x

≥ 1 − x for x ≥ 0; (b) follows

h t i Lemma 5. The variables at and bt are nondecreasing in t and E eZ0 /2 is non-increasing in t h  t i is nondecreasing (non-increasing) in t for any convex over all t ≥ 0. More generally, E Υ eZ0 (concave, respectively) function Υ with domain (0, ∞). h  t i Proof. Note that, in view of (20), E Υ eZ0 becomes at for the convex function Υ(x) = x2 , bt h t i √ for the convex function Υ(x) = x2 /(1 + xe−ν ), and E eZ0 /2 for the concave function Υ(x) = x. It thus suffices to prove the last statement of the lemma. It is well known that for a nonsingular binary hypothesis testing problem with a growing amount dP is a martingale under measure of information indexed by some parameter s,2 the likelihood ratio dQ   t Λ Q. Therefore, the likelihood ratios e u : t ≥ 0 (where Λs is the log likelihood ratio) at the root vertex u forthe infinite tree, conditioned on τu = 0, form a martingale. Thus, the random variables  t eZ0 : t ≥ 0 can be constructed on a single probability space to be a martingale. The lemma therefore follows from Jensen’s inequality. Let log∗ (ν) denote the number of times the logarithm function must be iteratively applied to ν to get a result less than or equal to one. Lemma 6. Suppose λ > 1/e. There are constants t¯0 and νo > 0 depending only on λ such that   bt¯0 +log∗ (ν)+2 ≥ exp(λν/(2(C − λ)) 1 − e−ν/2 , where C = λ



p q

 + 2 , whenever ν ≥ νo and ν ≥ 2(C − λ).

Proof. Given λ with λ > 1/e, select the following constants, depending only on λ:   √ • D and ν0 so large that λeλD 1 − e−νo /2 > 1 and λe 1 − e−νo /2 ≥ λe.  • w0 > 0 so large that w0 λeλD 1 − e−νo /2 − λD ≥ w0 . • A positive integer t¯0 so large that λ((λe)t¯0 /2−1 − D) ≥ w0 .

2

Here we mean information in the sense of martingale theory (see, e.g. [14]), where the growth of information is modeled by an increasing family of sigma algebras indexed by some parameter s, and often the sigma algebra for a given s is generated by a set of random variables, where the sets of random variables increase with s.

14

Throughout the remainder of the proof we assume without further comment that ν ≥ νo and ν ν ≥ 2(C − λ). The latter condition and the fact b0 = 1+e1−ν ensures that b0 < 2(C−λ) . Let t∗ = o n ν and let t¯1 = log∗ (ν). The first step of the proof is to show t∗ ≤ t¯0 + t¯1 . max t ≥ 0 : bt < 2(C−λ) For that purpose we will show that the b’s increase at least geometrically to reach a certain large constant (specifically, so (29) below holds), and then they increase as fast as a sequence produced by iterated exponentiation.  Since b0 ≥ 0 it follows from (28) and the choice of ν0 that b1 ≥ 1 − e−νo /2 ≥ (λe)−1/2 . Note u that eu ≥ eu for all u > 0, because eu is minimized at u = 1. Thus eλbt ≥ λebt , which combined with √ ν the choice of ν0 and (28) shows that if bt ≤ 2(C−λ) then bt+1 ≥ λebt . It follows that bt ≥ (λe)t/2−1 for 1 ≤ t ≤ t∗ + 1. ν If bt¯0 −1 ≥ 2(C−λ) then t∗ ≤ t¯0 − 2 and the claim t∗ ≤ t¯0 + t¯1 is proved (that is, the geometric ν growth phase alone was enough), so to cover the other possibility, suppose bt¯0 −1 < 2(C−λ) . Then ¯ ¯ t¯0 ≤ t∗ + 1 and therefore bt¯0 ≥ (λe)t0 /2−1 . Let t0 = min{t : bt ≥ (λe)t0 /2−1 }. It follows that t0 ≤ t¯0 , and, by the choice of t¯0 and the definition of t0 , λ(bt0 − D) ≥ w0 .

(29)

Define the sequence (wt : t ≥ 0) beginning with w0 already chosen, and satisfying the recursion wt+1 = ewt . It follows by induction that λ(bt0 +t − D) ≥ wt for t ≥ 0, t0 + t ≤ t∗ + 1. Indeed, the base case is (29), and if (30) holds for some t with t0 + t ≤ t∗ , then bt0 +t ≥ that     λ(bt0 +t+1 − D) ≥ λ eλbt0 +t 1 − e−ν/2 − D   ≥ wt+1 λeλD 1 − e−ν/2 − λD

(30) wt λ

+ D, so

≥ wt+1 ,

where the last inequality follows from the choice of w0 and the fact wt+1 ≥ w0 . The proof of (30) by induction is complete. Let t¯1 = log∗ (ν). Since w1 ≥ 1 it follows that wt¯1 +1 ≥ ν (verify by applying the log function λν t¯1 times to each side). Therefore, wt¯1 +1 ≥ 2(C−λ) − λD, where we use the fact C − λ ≥ 2λ. If ∗ t0 + t¯1 < t it would follow from (30) with t = t0 + t¯1 + 1 that bt0 +t¯1 +1 ≥

ν wt¯+1 +D ≥ , λ 2(C − λ)

which would imply t∗ ≤ t0 + t¯1 , which would be a contradiction. Therefore, t∗ ≤ t0 + t¯1 ≤ t¯0 + t¯1 , as was to be shown. ν ν Since t∗ is the last iteration index t such that bt < 2(C−λ) , either bt∗ +1 = 2(C−λ) , and we say ν ν ∗ the threshold 2(C−λ) is exactly reached at iteration t + 1, or bt∗ +1 > 2(C−λ) , in which case we say there was overshoot at iteration t∗ + 1. First, consider the case the threshold is exactly reached at ν , and (28) can be applied with t = t∗ + 1, yielding iteration t∗ + 1. Then, bt∗ +1 = 2(C−λ)     bt∗ +2 ≥ exp(λbt∗ +1 ) 1 − e−ν/2 = exp(λν/(2(C − λ)) 1 − e−ν/2 .

Since t∗ + 2 ≤ t¯0 + t¯1 + 2 = t¯0 + log∗ (ν) + 2, it follows from Lemma 5 that bt¯0 +log∗ (ν)+2 ≥ bt∗ +2 , which completes the proof of the lemma in case the threshold is exactly reached at iteration t∗ + 1. 15

To complete the proof, we explain how the information available for estimation can be reduced through a thinning method, leading to a reduction in the value of bt∗ +1 , so that we can assume without loss of generality that the threshold is always exactly reached at iteration t∗ + 1. Let φ be a parameter with 0 ≤ φ ≤ 1. As before, we will be considering a total of t∗ + 2 iterations, so ∗ consider a random tree with labels, (Tut +2 , τT t∗ +2 ), with root vertex u and maximum depth t∗ + 2. u For the original model, each vertex of depth t∗ + 1 or less with label 0 or 1 has Poisson numbers of children with labels 0 and 1 respectively, with means specified in the construction. For the thinning method, for each ℓ ∈ ∂u and each child i of ∂ℓ, (i.e. for each grandchild of u) we generate a random variable Uℓ,i that is uniformly distributed on the interval [0, 1]. Then we retain i if Uℓ,i ≤ φ, and we delete i, and all its decedents, if Uℓ,i > φ. That is, the grandchildren of the root vertex u are each deleted with probability 1 − φ. It is equivalent to reducing p and q to φp and φq, respectively, for that one generation. Consider the calculation of the likelihood ratio at the root vertex for the thinned tree. The log likelihood ratio messages begin at the leaf vertices at depth t∗ + 2. For any vertex ℓ 6= u, let Λℓ→π(ℓ),φ denote the log likelihood message passed from vertex ℓ to its parent, π(ℓ). Also, let Λu,φ denote the log likelihood computed at the root vertex. For brevity we leave off the superscript t on the log likelihood ratios, though t on the message Λℓ→π(ℓ),φ would be t∗ + 2 minus the depth of ℓ. The messages of the form Λℓ→π(ℓ),φ don’t actually depend on φ unless ℓ ∈ ∂u. For a vertex ℓ ∈ ∂u, the message Λℓ→u,φ has the nearly the same representation as in Lemma 1, namely: ! t X eΛi→ℓ −ν (p/q) + 1 Λℓ→u,φ = −φK(p − q) + log . (31) t eΛi→ℓ −ν + 1 i∈∂ℓ:U ≤φ ℓ,i

in Lemma 1, except with Λtℓ→u The representation of Λu,φ is the same as the representation of Λt+1 u replaced both places on the right hand side by Λℓ→u,φ . t and Z t denote random variables for analyzing the message passing algorithm for this Let Z0,φ 1,φ ∗ t ) is the law of Λ depth t + 2 tree. Their laws are the following. For 0 ≤ t ≤ t∗ + 1, L(Z0,φ ℓ→π(ℓ),φ t∗ +2 ∗ given τℓ = 0, for a vertex ℓ of depth t + 2 − t. And L(Z0,φ ) is the law of Λu,φ given τu = 0. 0 ≡ 0. The laws L(Z t ) are determined similarly, conditioning on the labels of the Note that Z0,φ 1,φ t ) and L(Z t ) each determine the other because they represent vertices to be one. For t fixed, L(Z0,φ 1,φ distributions of the log likelihood for a binary hypothesis testing problem. The message passing equations for the log likelihood ratios translate into recursions for the t ) and L(Z t ). We have not focused directly on the full recursions of the laws, but laws L(Z0,φ 1,φ rather looked at equations for exponential moments. The basic recursions we’ve been considering t ) are exactly as before for 0 ≤ t ≤ t∗ − 1 and for t = t∗ + 1. For t = t∗ the thinning needs for L(Z0,φ to be taken into account, resulting, for example, in the following updates for t = t∗ : #) ( " ∗ h t∗ +1 i h t∗ +1 i Z1t e = exp λφE = E e2Z0 E eZ1 t∗ 1 + eZ1 −ν h

∗ 2Z1t +1

E e

i

= exp

  

3λφE

"



t eZ1

t∗ −ν

1 + eZ1

#

16

!2     + E ∗ t  K(p − q) 1 + eZ1 −ν λ2 φ





t eZ1

Let

h t i at,φ = E eZ1,φ # " t eZ1,φ bt,φ = E t 1 + eZ1,φ −ν

(32) (33)

for 0 ≤ t ≤ t∗ + 2. Note that at,φ and bt,φ don’t depend on φ for 0 ≤ t ≤ t∗ . We have  exp(λbt,φ ) t 6= t∗ , at+1,φ = exp(λφbt,φ ) t = t∗

(34)

We won’t be needing (34) for t = t∗ but we will use it for t = t∗ + 1. t∗ +1 t∗ +1 On one hand, if φ = 0 then Λℓ→u,φ ≡ 0 for all ℓ ∈ ∂u, so that Z0,φ=0 = Z1,φ=0 ≡ 0 so that 1 ν n−K ∗ bt∗ +1,φ=0 = 1+e−ν = n ≤ 1 < 2(C−λ) . On the other hand, by the definition of t we know that ν ν bt∗ +1,φ=1 ≥ 2(C−λ) . We shall show that there exists a value of φ ∈ [0, 1] so that bt∗ +1,φ = 2(C−λ) . To do so we next prove that bt∗ +1,φ is a continuous, and, in fact, nondecreasing, function of φ, using a variation of the proof of Lemma 5. Let ℓ denote a fixed neighbor of the root node u. Note that eΛℓ→u,φ is the likelihood ratio for detection of τℓ based on the thinned subtree of depth t∗ + 1 with root ℓ. As φ increases from 0 to 1 the amount of thinning decreases, so larger values of φ correspond to larger  Λ ℓ,φ : 0 ≤ φ ≤ 1 is a martingale. amounts of information. Therefore, conditioned on τu = 0, e Moreover, the independent splitting property of Poisson random variables imply that, given τℓ = 0, the random process φ 7→ |{i ∈ ∂ℓ : Uℓ,i ≤ φ}| is a Poisson process with intensity nq, and therefore the sum in (31), as a function of φ over the interval [0, 1], is a compound Poisson process. Compound Poisson processes, just like Poisson processes, are almost surely continuous at any fixed value of φ, and therefore the random process φ 7→ Λℓ→u,φ is continuous in distribution. Therefore, the t∗ +1

random variables eZ0,φ can be constructed on a single probability space for 0 ≤ φ ≤ 1 to form a martingale which is continuous in distribution. Since bt∗ +1,φ is the expectation of a bounded, t∗ +1

continuous, convex function of eZ0,φ , it follows that bt∗ +1,φ is continuous and nondecreasing in φ. ν , as claimed. Therefore, we can conclude that there exists a value of φ so that bt∗ +1,φ = 2(C−λ) Since there is no overshoot, we obtain as before (by using (34) for t = t∗ + 1 to modify Lemma 4 to handle (bt+1 , bt ) replaced by (bt∗ +2,φ , bt∗ +1,φ )):     bt∗ +2,φ ≥ exp(λbt∗ +1,φ ) 1 − e−ν/2 = exp(λν/(2(C − λ)) 1 − e−ν/2 .

The same martingale argument used in the previous paragraph can be used to show that bt∗ +2,φ is nondecreasing in φ, and in particular, bt∗ +2 = bt∗ +2,1 ≥ bt∗ +2,φ for 0 ≤ φ ≤ 1. Hence, by Lemma 5 and the fact t∗ + 2 ≤ t¯0 + log∗ (ν) + 2, we have bt¯0 +log∗ (ν)+2 ≥ bt∗ +2 ≥ bt∗ +2,φ , completing the proof of the lemma. Lemma 7. Let B = (p/q)3/2 . Then     h t+1 i λ λ Z0 /2 exp − bt ≤ E e bt . ≤ exp − 8 8B Proof. We prove the upper bound first. In view of Lemma 1, by defining f (x) = that   t Y t+1 eΛu /2 = e−K(p−q)/2 f 1/2 eΛℓ→u −ν . ℓ∈∂u

17

x(p/q)+1 x+1 ,

we get

Thus, h

Z0t+1 /2

E e

i

−K(p−q)/2

=e

 h  t iMu   t iLu   h 1/2 1/2 Z1 −ν eZ0 −ν e . E E f E E f

  Using the fact that E cX = eλ(c−1) for X ∼ Pois(λ) and c > 0, we have i  t i   h  t i h  h h t+1 i E eZ0 /2 = exp −K(p − q)/2 + Kq E f 1/2 eZ1 −ν − 1 + (n − K)q E f 1/2 eZ0 −ν − 1 (35)

By the intermediate value form of Taylor’s theorem, for any x ≥ 0 there exists y with 1 ≤ y ≤ x √ x2 such that 1 + x = 1 + x2 − 8(1+y) 3/2 . Therefore, √ x x2 1+x≤1+ − , 2 8(1 + A)3/2 Letting A ,

p q

(36)

− 1 and noting that B = (1 + A)3/2 , we have 

It follows that

∀0 ≤ x ≤ A.

ez−ν (p/q) + 1 1 + ez−ν

1/2



 p/q − 1 1/2 = 1+ 1 + e−z+ν 1 (p/q − 1) 1 (p/q − 1)2 . ≤1+ − 2 (1 + e−z+ν ) 8B (1 + e−z+ν )2

  t i   h  t i  h Kq E f 1/2 eZ1 −ν − 1 + (n − K)q E f 1/2 eZ0 −ν − 1      1 1 1 ν ≤ Kq(p/q − 1) E +e E t t 2 1 + e−Z1 +ν 1 + e−Z0 +ν      1 1 1 2 ν Kq(p/q − 1) E − +e E t t 8B (1 + e−Z1 +ν )2 (1 + e−Z0 +ν )2   1 1 2 Kq(p/q − 1) E = K(p − q)/2 − t 8B 1 + e−Z1 +ν " # t eZ1 λ E = K(p − q)/2 − , t 8B 1 + eZ1 −ν | {z } bt

where the first equality follows from (22) and (23); the last equality holds due to Kq(p/q−1)2 eν = λ. Combining the last displayed equation with (35) yields the desired upper bound. The proof for the lower bound is similar. Instead of (36), we use the the inequality that √ 2 1 + x ≥ 1 + x2 − x8 for all x ≥ 0, and the lower bound readily follows by the same argument as above. Lemma 8. (Upper bound on classification error for the random tree model) Consider the random tree model with parameters λ, ν, and p/q. Let λ be fixed with λ > 1/e. There are constants t¯0 and νo depending only on λ such that if ν ≥ νo and ν ≥ 2(C − λ), then after t¯0 + log∗ (ν) + 2 iterations of the belief propagation algorithm, the average error probability for the MAP estimator τbu of τu satisfies      λ K(n − K) 1/2 −ν/2 t exp − exp(νλ/(2(C − λ)) 1 − e , (37) pe ≤ n2 8B 18

 3/2   where B = pq and C = λ pq + 2 . In particular, if p/q = O(1), and r is any positive constant, then if ν is sufficiently large,  r Ke−rν K K t pe ≤ = . (38) n n n−K

n−K Proof. hWe use the Bhattacharyya upper bound in (13) with π1 = K n and π0 = n , and the fact i t ρ = E eZ0 /2 . Plugging in the lower bound on bt¯0 +log∗ (ν)+2 from Lemma 6 into the upper bound h t i on E eZ0 /2 from Lemma 7 yields (37). If p/q = O(1) and r > 0, then for ν large enough,

  λ exp(νλ/(2(C − λ)) 1 − e−ν/2 ≥ ν(r + 1/2), 8B

which, together with (37), implies (38).

2.3

First and second moments of log likelihood messages for Poisson tree

The following lemma provides estimates for the first and second moments of the log likelihood messages for the Poisson tree model. Lemma 9. With C = λ(p/q + 2), for all t ≥ 0,

 2 Cbt   t+1  λ e λbt +O E Z0 =− 2 K(p − q)  2 Cbt    λbt λ e +O E Z1t+1 = 2 K(p − q)  2 Cbt   λ e var Z0t+1 = λbt + O K(p − q)  2 Cbt   λ e var Z1t+1 = λbt + O K(p − q)

(39) (40) (41) (42)

Lemma 10. Let ψ2 (x) and ψ3 (x) be defined for x ≥ 0 by the relations: log(1 + x) = x + ψ2 (x) 2 2 3 and log(1 + x) = x − x2 + ψ3 (x). Then 0 ≥ ψ2 (x) ≥ − x2 , and 0 ≤ ψ3 (x) ≤ x3. . In particular, |ψ2 (x)| ≤ x2 and |ψ3 (x)| ≤ x3 . Moreover, | log2 (1 + x) − x2 | ≤ x3 . Proof of Lemma 10. By the intermediate value form of Taylor’s theorem, for any x ≥ 0,   x2 1 log(1 + x) = x + − 2 (1 + y)2 1 for some y ∈ [0, x]. The fact −1 ≤ − (1+y) 2 ≤ 0 then establishes the claim for ψ2 . Similarly, the claim for ψ3 follows from the fact that for some z ∈ [0, x]   2 x2 x3 + . log(1 + x) = x − 2 3! (1 + z)3

Finally, the first and second derivatives of log2 (1 + x) at x = 0 are 0 and 2, and  3 4 log(1 + x) − 6 1 d 2 3! dx log (1 + x) = 3!(1 + x)3 ≤ 1 for x ≥ 0,

so the final claim of the lemma also follows from Taylor’s theorem. 19

Proof of Lemma 9. Plugging g(z) = ν

e E



1 t

(1 + e−Z0 +ν )3

1 (1+e−z+ν )3



+E



into (21) we have 1 t

(1 + e−Z1 +ν )3



=E



1 t

(1 + e−Z1 +ν )2



.

(43)

Applying Lemma 10, we have  z−ν    e (p/q) + 1 p/q − 1 log = log 1 + ez−ν + 1 1 + e−z+ν   (p/q − 1)2 p/q − 1 p/q − 1 − + ψ3 . = 1 + e−z+ν 2(1 + e−z+ν )2 1 + e−z+ν

(44) (45)

Hence, Λt+1 u

= −K(p − q) +

X

ℓ∈∂u

"

p/q − 1 t

1 + e−Λℓ→u +ν



(p/q − 1)2 t

2(1 + e−Λℓ→u +ν )2

+ ψ3



p/q − 1 t

1 + e−Λℓ→u +ν

#

.

It follows by considering the case the label of vertex u is conditioned to be one:      t+1  p/q − 1 p/q − 1 + E [Mu ] E E Z0 = −K(p − q) + E [Lu ] E t t 1 + e−Z1 +ν 1 + e−Z0 +ν     (p/q − 1)2 (p/q − 1)2 − E [M ] E − E [Lu ] E u t t 2(1 + e−Z1 +ν )2 2(1 + e−Z0 +ν )2       p/q − 1 p/q − 1 + E [Lu ] E ψ3 + E [Mu ] E ψ3 . t t 1 + e−Z1 +ν 1 + e−Z0 +ν Notice that E [Lu ] = Kq and E [Mu ] = (n − K)q. Thus     p/q − 1 p/q − 1 + E [M ] E E [Lu ] E u t t 1 + e−Z1 +ν 1 + e−Z0 +ν      1 1 +ν = Kq(p/q − 1) E +e E t t 1 + e−Z1 +ν 1 + e−Z0 +ν = K(p − q), where the last equality holds due to (22). Moreover,     (p/q − 1)2 (p/q − 1)2 + E [Mu ] E E [Lu ] E t t (1 + e−Z1 +ν )2 (1 + e−Z0 +ν )2      1 1 2 ν = Kq(p/q − 1) E +e E t t (1 + e−Z1 +ν )2 (1 + e−Z0 +ν )2   1 (a) 2 = Kq(p/q − 1) E , t 1 + e−Z1 +ν # " t eZ1 (b) = λbt = λE t 1 + eZ1 −ν

20

(46)

(47)

where (a) holds due to (23), and (b) holds due to the fact ν = log n−K n . Also,       p/q − 1 p/q − 1 E [Lu ] E ψ3 + E [M ] E ψ u 3 t t 1 + e−Z1 +ν 1 + e−Z0 +ν     (p/q − 1)3 (p/q − 1)3 ≤ E [Lu ] E + E [M ] E u t t (1 + e−Z1 +ν )3 (1 + e−Z0 +ν )3      1 1 ν 3 +e E = Kq(p/q − 1) E t t (1 + e−Z1 +ν )3 (1 + e−Z1 +ν )3   1 (a) = Kq(p/q − 1)3 E t −Z (1 + e 1 +ν )2 h ti λ2 eCbt , ≤ Kq(p/q − 1)3 e−2ν E e2Z1 ≤ K(p − q)

(48)

h ti where (a) holds due to (43); the last inequality holds because, as shown by Lemma 3, E e2Z1 ≤ eCbt . Assembling the last four displayed equations yields (39). Similarly, " !# t  t+1   t+1  eZ1 +ν (p/q) + 1 + K(p − q)E log = E Z0 E Z1 t eZ1 −ν + 1      (p/q) − 1 . = E Z0t+1 + λbt + K(p − q)E ψ2 t e−Z1 +ν + 1

and, using |ψ2 (x)| ≤ x2 and the definition of ν,

i h    λ2 E e2Z1t −1 λ2 eCbt K(p − q)E ψ2 (p/q) ≤ ≤ t K(p − q) K(p − q) e−Z1 +ν + 1 It follows that (40) holds. P Next, we calculate the variance. For Y = L and {Xi } i=1 Xi , where L is Poissondistributed  are i.i.d. with finite second moments, it is well-known that var(Y ) = E [L] E X12 . It follows that " !# " !# t t  eZ1 −ν (p/q) + 1 eZ0 −ν (p/q) + 1 t+1 2 2 var Z0 = E [Lu ] E log + E [Mu ] E log . t t eZ1 −ν + 1 eZ0 −ν + 1 Using (44) and the fact | log2 (1 + x) − x2 | ≤ x3 (see Lemma 10) yields      (p/q − 1)2 (p/q − 1)2 t+1 var Z0 = E [Lu ] E + E [Mu ] E t t (1 + e−Z1 +ν )2 (1 + e−Z0 +ν )2      (p/q − 1)3 (p/q − 1)3 + E [Mu ] E . + O E [Lu ] E t t (1 + e−Z1 +ν )3 (1 + e−Z0 +ν )3 Applying (47) and (48) yields (41). Similarly, applying (47) and the fact log2 (1 + x) ≤ x2 , yields      (p/q − 1)2 t+1 t+1 var Z1 = var Z1 + K(p − q)O E t (1 + e−Z1 +ν )2  2 Cbt   λ e = var Z0t+1 + O K(p − q)   2 λ = λbt + O eCbt , K(p − q) 21

which together with (41) implies (42).

2.4

Asymptotic Gaussian marginals of log likelihood messages

The following lemma is well suited to proving that the distributions of Z0t and Z1t are asymptotically Gaussian. Lemma 11. (Analog of Berry-Esseen inequality for Poisson sums [25, Theorem 3].) Let Sλ = X1 + · · · + XNλ , where Xi : i ≥ 1 are independent, identically distributed random variables with mean µ, variance σ 2 and E |Xi |3 ≤ ρ3 , and for some λ > 0, Nλ is a Pois(λ) random variable independent of (Xi : i ≥ 1). Then ) ( CBE ρ3 S − λµ λ ≤ x − Φ(x) ≤ p sup P p x λ(µ2 + σ 2 ) λ(µ2 + σ 2 )3 where CBE = 0.3041.

Lemma 12. Suppose λ > 0 is fixed, and the parameters p/q and ν vary such that p/q = O(1), ν is bounded from below (i.e. K/n is bounded away from one) and K(p − q) → ∞. (The latter condition holds if either ν → ∞ or p/q → 1; see Remark 2.)  Suppose t ∈ N  is fixed, or more generally, is ′b 2 C e t = o(bt ) as n → ∞, where C ′ = λ 3 + 2 pq + pq . Then such that K(p−q) ) ( t+1 λbt Z + 0 2 √ sup P ≤ x − Φ(x) → 0 λbt x ) ( t+1 λbt Z − 1 2 √ ≤ x − Φ(x) → 0 sup P λbt x

(49) (50)

Remark 1. As we will show later in Lemma 13, in the case that λe < 1, bt is bounded, independently of n, and (49) and (50) hold for all t, and, as can be checked from the proof, the limits hold uniformly in t. Also, in the case bt is bounded independently of n, (50) is a consequence of (49) and the fact that Z0t+1 is the log likelihood ratio. In the proof below, (50) is proved directly. Remark 2. The condition K(p − q) → ∞ in Lemma 12 is essential for the proof; we state some equivalent conditions here. Equations (14)-(16) express Kp, Kq, and (n − K)q in terms of the parameters λ, ν, and p/q. Similarly, λeν p/q − 1 λ(p/q)eν (eν + 1) np = (p/q − 1)2 eν (n − K)q = K(p − q) p/q − 1 K(p − q) =

2

2

(p−q) It follows that if K(n−K)q ≡ λ for a fixed λ > 0, p/q = O(1), and ν is bounded below (i.e. K/n is bounded away from one) then the following seven conditions are equivalent: (K(p − q) → ∞), (ν → ∞ or pq → 1), (Kp → ∞), (Kq → ∞), ((n − K)q → ∞), (np → ∞), (K(p − q) = o((n − K)q)).

22

Proof of Lemma 12. Throughout the proof it is good to keep in mind that b0 = 1+e1−ν , so that b0 is bounded from below by a fixed positive constant, and, as shown in Lemma 5, bt is nondecreasing in t. For t ≥ 0, Z0t+1 can be represented as follows: Z0t+1

= −K(p − q) +

Nnq X

Xi

i=1

where Nnq has the Pois(nq) distribution, the random variables Xi , i ≥ 0 are mutually independent and independent of Nnq , and the distribution a mixture of distributions: L(Xi ) =  of Xi is  z−ν

(n−K)q t nq L(f (Z0 ))

e (p/q)+1 t . + Kq nq L(f (Z1 )), where f (z) = log ez−ν +1 By (41) of Lemma 9 and the formula for the variance of the sum of a Poisson distributed number of iid random variables,  2 Cbt    λ e nqE Xi2 = var(Z01+t ) = λbt + O K(p − q)

The function f , and therefore the Xi ’s, are nonnegative. Using the fact log3 (1 + x) ≤ x3 for x ≥ 0, 3  p/q−1 . Applying (48) yields and applying (44) we find f (z)3 ≤ 1+e −z+ν   nqE |Xi |3 = E [Lu ] E ≤



   (p/q − 1)3 (p/q − 1)3 + E [Mu ] E t t (1 + e−Z1 +ν )3 (1 + e−Z0 +ν )3

2λ2 eCbt . K(p − q)

(51)

Therefore, the ratio relevant for application of the Berry-Esseen lemma satisfies:     E |Xi |3 nqE |Xi |3 λ2 eCbt r q q = ≤  2 3  2 3   Cb 3 → 0. 2 nqE Xi nqE Xi K(p − q) λbt + O λ e t K(p−q)

The Berry-Esseen lemma, Lemma 11, implies      t+1   t+1  CBE E |Xi |3 Z − E Z 0 0 q ≤ x − Φ(x) ≤ q sup P  2 3 .   x var(Z0t+1 ) nqE Xi

Applying Lemma 9 completes the proof of (49). The proof of (50) given next is similar. For t ≥ 0, Z1t+1 can be represented as follows: Z1t+1

= K(p − q) + p

1 (n − K)q

N(n−K)q+Kp

X

Yi

i=1

where N(n−K)q+Kp has the Pois((n − K)q + Kp) distribution, the random variables Yi , i ≥ 0 are mutually independent and independent of N(n−K)q+Kp , and the distribution of Yi is a mixture of  z−ν  (n−K)q (p/q)+1 Kp distributions: L(Yi ) = (n−K)q+Kp L(f (Z0t )) + (n−K)q+Kp L(f (Z1t )), where f (z) = log e ez−ν . +1 23

By (42) of Lemma 9 and the formula for the variance of the sum of a Poisson distributed number of iid random variables,    2 λ2 1+t eCbt ((n − K)q + Kp)E Yi = var(Z1 ) = λbt + O K(p − q) We again use f (z)3 ≤



p/q−1 1+e−z+ν

3

. Applying (48) and Lemma 3 yields



3

((n − K)q + Kp)E |Yi |



    (p/q − 1)3 3 = nqE |Xi | + K(p − q)E t (1 + e−Z1 +ν )3 h ti λ3 E e3Z1 2λ2 eCbt ≤ + K(p − q) (K(p − q))2 ′



λ3 eC bt 2λ2 eCbt + K(p − q) (K(p − q))2

where C ′ = λ(3 + 2p/q + (p/q)2 ). Therefore, the ratio relevant for application of the Berry-Esseen lemma satisfies:   E |Yi |3

3 C ′ bt

λ e λ2 eCbt + K(p−q)) q r ≤  2 3   3 → 0  2 λ ((n − K)q + Kp)E Yi λbt + O K(p−q) K(p − q) eCbt

Therefore, the Berry-Esseen lemma, Lemma 11, along with Lemma 9, completes the proof of (50).

2.5

Limit of bt for t fixed, or uniformly in t if λ ≤ 1/e, as ν → ∞

Recall that the distribution of the random  tree is determined by the three parameters λ, ν, and h ti t Z1 e p/q. Also, at = E eZ1 , bt = E , and at+1 = exp(λbt ). As noted above, when ν → ∞, we Z t −ν 1+e

1

have at ≈ bt , and these equations approximately become recursions. The following lemma is based on this idea.

Lemma 13. Fix λ > 0. Define (vt : t ≥ 0) recursively by v0 = 0 and vt+1 = λevt . (i) For all t ≥ 0, λbt ≤ λat ≤ vt+1 , and if λ ≤ 1/e, then bt < x∗ (λ) for all t and n, where x∗ (λ) is the root of the equation x = exp(λx) in the interval (0, e]. (ii) Suppose ν and p/q depend on n such that ν → ∞ (i.e. K = o(n)) and p/q = O(1). For any t ≥ 0 fixed, limν→∞ λbt = vt+1 . If, in addition, λ ≤ 1/e, then supt≥0 |λbt − vt+1 | → 0. Proof. Note that bt+1 ≤ at+1 = eλbt for t ≥ 0 and λb0 = 1+eλ−ν ≤ λ = λa0 = v1 . It follows by induction on t that λbt ≤ λat ≤ vt+1 for all t ≥ 0. The sequence vt /λ satisfies the recursion vt+1 /λ = eλ(vt /λ) . By induction, the sequence vt /λ is increasing in t, and if λe ≤ 1, simple analysis shows that limt→∞ vt /λ = x∗ (λ). Therefore, if λe ≤ 1, bt ≤ vt+1 /λ < x∗ (λ) for all t ≥ 0. Part (i) is proved. Suppose ν → ∞ for the remainder of the proof. We next prove that limn→∞ λbt = vt+1 for each t fixed. The claim is true in the base case t = 0. Suppose for the sake of proof by induction that for some fixed t ≥ 0, limn→∞ λbt = vt+1 . Thus bt is bounded from above and away from z zero. We have bt+1 = E [Yn ] , where Yn = f (Z1t+1 , ν) and f (z, ν) = 1+ee z−ν . We have written Yn to 24

emphasize the dependence on n and de-emphasize the dependence on t, which is fixed. Note the following properties of f : f is nonnegative (so the same is true of Yn ), f is monotone increasing in both z and ν, f (z, ν) ≤ ez , limν→∞ f (z, ν) = ez , and limz→∞ f (z, ν) = eν . Lemma 12 implies that the Kolmogorov distance (i.e. supremum of absolute difference of CDFs) between the laws of Z1t+1 and G(λbt ) converges to zero as n → ∞, where G(s) represents a Gaussian random variable with mean s/2 and variance s. Since the Kolmogorov distance is preserved under strictly monotone transformations of the random variables, the Kolmogorov distance between the laws of Yn and f (G(λbt ), ν) is the same as that between the laws of Z1t+1 and G(bt ) , and thus also converges to zero. That is, by expressing the CDF of f (G(λbt ), ν) in terms of the CDF of G(λbt ) and the inverse of f (z, ν) as a function of z, we have:    c →0 sup FYn (c) − FG(λbt ) log −ν 1 − ce 0 0, there is a sufficiently large constant M so that, for all n, P {Yn ≤ M } ≥ 1 − ǫ.

  c Since log 1−ce → log c as ν → ∞, uniformly over 0 ≤ c ≤ M, since λbt is bounded away from −ν zero (in n), and since ν → ∞ as n → ∞,    c − FG(λbt ) (log c) → 0. sup FG(λbt ) log −ν 1 − ce 0 0. In particular, the condition is satisfied if tf = O(log∗ n) and d = O((log n)s ) for some finite constant s > 0. Remark 5. The condition (2 + d)tf = no(1) is equivalent to (a + bd)tf = no(1) for any constants a and b with a > 1 and b > 0. Also, if d ≥ 1 + ǫ for all n and some fixed constant ǫ > 0, the condition is equivalent to dtf = no(1) . Remark 6. The last statement of Lemma 15 is included to handle the case that |C ∗ | has a certain hypergeometric distribution. In particular, if we begin with a graphical model with n vertices and a planted dense community with |C ∗ | ≡ K, for a cleanup procedure we will use for exact recovery (See Algorithm 2), we need to withhold a small fraction δ of vertices and run the belief propagation algorithm on the subgraph induced by the set of n(1 − δ) retained vertices. Let C ∗∗ denote the intersection of C ∗ with the set of n(1 − δ) retained vertices. Then |C ∗∗ | is obtained by sampling the vertices of the original graph without replacement. Thus, the distribution of |C ∗∗ | is hypergeometric, and E [|C ∗∗ |] = K(1 − δ). Therefore, by a result of Hoeffding [21], the distribution of |C ∗∗ | is convex order dominated by the distribution that would result by sampling with replacement, namely, by Binom n(1 − δ), K n . That is, for any convex function Ψ,   )) . Therefore, Chernoff bounds for Binom(n(1 − δ), K E [Ψ(|C ∗∗ |)] ≤ E Ψ(Binom(n(1 − δ), K n n )) ∗∗ also hold for |C |. We use the following Chernoff bounds for binomial distributions [30, Theorem 27

4.4, 4.5]: For X ∼ Binom(n, p): P {X ≥ (1 + ǫ)np} ≤ e−ǫ

2 np/3

−ǫ2 np/2

,

∀0 ≤ ǫ ≤ 1

P {X ≤ (1 − ǫ)np} ≤ e

, ∀0 ≤ ǫ ≤ 1. p Thus, if K(1 − δ) ≥ 3 log n, then (57) and (58) with ǫ = 3 log n/[K(1 − δ)] imply n o p P |C ∗∗ | − K(1 − δ) ≥ 3K(1 − δ) log n ≤ n−1 .

(57) (58)

Thus, the last statement of Lemma 15 can be applied with K replaced by K(1 − δ). Proof. We write V = V (G) and V t = V (G) \ V (Gtu ). Let V0t and V1t denote the set of vertices ei denote the number i in V t with σi = 0 and σi = 1, respectively. For a vertex i ∈ ∂Gtu , let L t t fi denote the number of i’s neighbors in V . Given V t , V t , and σi , of i’s neighbors in V1 , and M 1 0 0 e i ∼ Binom(|V t |, p) if σi = 1 and L e i ∼ Binom(|V t |, q) if σi = 0, and M fi ∼ Binom(|V t |, q) for either L 1 1 0 fi and L e i are independent. value of σi . Also, M Let C t denote the event C t = {|∂Gsu | ≤ 4(2 + 2d)s log n, ∀0 ≤ s ≤ t}.

fi The event C t is useful to ensure that V t is large enough so that the binomial random variables M e i can be well approximated by Poisson random variables with the appropriate means. The and L following lemma shows that C t happens with high probability conditional on C t−1 .

Lemma 16. For t ≥ 1,

 P C t |C t−1 ≥ 1 − n−4/3 .

t Moreover, P (C t ) ≥ 1 − tn−4/3 , and conditional on the event C t−1 , |Gt−1 u | ≤ 4(2 + 2d) log n. t−1 log n. For any i ∈ ∂Gt−1 , L ei + M fi is stochasProof. Conditional on C t−1 , |∂Gt−1 u u | ≤ 4(2 + 2d) e f are independent. It follows that |∂Gtu | is tically dominated by Binom(n, d/n), and {Li , Mi }i∈∂Gt−1 u stochastically dominated by (using d + 1 ≥ d):  X ∼ Binom 4(2 + 2d)t−1 n log n, (d + 1)/n .

Notice that E [X] = 2(2+ 2d)t log n ≥ 4 log n. Hence, in view of the Chernoff bound (57) with ǫ = 1,   P C t |C t−1 ≥ P X ≤ 4(2 + 2d)t log n = 1 − P {X > 2E [X]} ≥ 1 − e−E[X]/3 ≥ 1 − n−4/3 .

Since C 0 is always true, P (C t ) ≥ (1 − n−4/3 )t ≥ 1 − tn−4/3 . Finally, conditional on C t−1 , |Gt−1 u |

=

t−1 X s=0

∂Gsu



t−1 X

4(2 + 2d)s log n = 4

s=0

(2 + 2d)t − 1 log n ≤ 4(2 + 2d)t log n. 1 + 2d

Note that it is possible to have i, i′ ∈ ∂Gtu which share a neighbor in V t , or which themselves are connected by an edge, so Gtu may not be a tree. The next lemma shows that with high probability such events don’t occur. For any t ≥ 1, let At denote the event that no vertex in V t−1 has more t t than one neighbor in Gt−1 u ; B denote the event that there are no edges within ∂Gu . Note that if As and B s hold for all s = 1, . . . , t, then Gtu is a tree. 28

Lemma 17. For any t with 1 ≤ t ≤ tf ,

 P At |C t−1 ≥ 1 − n−1+o(1)  P B t |C t ≥ 1 − n−1+o(1) .

 t−1 , P A = A ′ = 1 ≤ d2 /n2 . Proof. For the first claim, fix any i, i′ ∈ ∂Gt−1 ij i ,j u . For any j ∈ V t−1 log n = no(1) . It follows from the Since |V t−1 | ≤ n and conditional on C t−1 , |∂Gt−1 u | ≤ 4(2 + 2d) union bound that, given C t−1 ,  d2 t−1 2t−2 2 ′ P ∃i, i′ ∈ ∂Gt−1 , j ∈ V : A = A = 1 ≤ n16(2 + 2d) log n × = n−1+o(1) . ij i ,j u n2   Therefore, P At |C t−1 ≥ 1−n−1+o(1) . For the second claim, fix any i, i′ ∈ ∂Gtu . Then P Ai,i′ = 1 ≤ d/n. It follows from the union bound that, given C t ,  d P ∃i, i′ ∈ ∂Gtu : Aii′ = 1 ≤ 16(2 + 2d)2t log2 n × ≤ n−1+o(1) . n  t t Therefore, P B |C ≥ 1 − n−1+o(1) .

In view of Lemmas 16 and 17, in the remainder of the proof of Lemma 15 we can and do assume without loss of generality that At , Bt , Ct hold for all t ≥ 0. We consider three cases about the cardinality of the community, |C ∗ |: • |C ∗ | ≡ K.

 √ • K ≥ 3 log n and P ||C ∗ | − K| ≤ 3K log n ≥ 1 − n−1/2+o(1) . This includes the case that |C ∗ | ∼ Binom(n, K/n) and K ≥ 3 log n, as noted in Remark 6. • K ≤ 3 log n and P {|C ∗ | ≤ 6 log n} ≥ 1 − n−1/2+o(1) . This includes the case that |C ∗ | ∼ Binom(n, K/n) and K ≤ 3 log n, because, in this case, |C ∗ | is stochastically dominated by a Binom(n, 3 log n/n) random variable, so Chernoff bound (57) with ǫ = 1 implies: P {|C ∗ | ≤ 6 log n} ≥ 1 − n−1 if K ≤ 3 log n. √ In the second and third cases we assume these bounds (i.e., either ||C ∗ | − K| ≤ 3K log n if K ≥ 3 log n or |C ∗ | ≤ 6 log n if K ≤ 3 log n) hold, without loss of generality. We need a version of the well-known bound on the total variation distance between the binomial distribution and a Poisson distribution with approximately the same mean: dTV (Binom(m, p), Pois(µ)) ≤ mp2 + ψ(µ − mp),

(59)

where ψ(u) = e|u| (1 + |u|) − 1. The term mp2 on the right side of (59) is Le Cam’s bound on the variational distance between the Binom(m, p) and the Poisson distribution with the same mean, mp; the term ψ(µ − mp) bounds the variational distance between the two Poisson distributions with means µ and mp, respectively (see [35, Lemma 4.6] for a proof). Note that ψ(u) = O(|u|) as u → 0. We recursively construct the coupling. For the base case, we can arrange that E [C ∗ ] K  0 0 − . P (Gu , σG0u ) = (Tu , τTu0 ) = 1 − |P {σu = 1} − P {τu = 1} | = 1 − n n 29

 If |C ∗ | ≡ K this gives P (G0u , σG0u ) = (Tu0 , τTu0 ) = 1 and in the other cases P



(G0u , σG0u )

=



(Tu0 , τTu0 )

≥1−



3K log n − n−1/2+o(1) ≥ 1 − n−1/2+o(1) . n

So fix t ≥ 1 and assume that (Tut−1 , τTut−1 ) = (Gt−1 ). We aim to construct a coupling u , σGt−1 u t t so that (Tu , τTut ) = (Gu , σGtu ) holds with probability at least 1 − n−1+o(1) if |C ∗ | ≡ K and with probability at least 1 − n−1/2+o(1) in the other cases. Each of the vertices i in ∂Gt−1 has a random u t−1 t−1 e f number of neighbors Li in V1 and a random number of neighbors Mi in V0 . These variables are conditionally independent given (Gt−1 , |V1t−1 |, |V0t−1 |). Thus we bound the total variational u , σGt−1 u distance of these random variables from the corresponding Poisson distributions by using a union t−1 holds, |∂Gt−1 | ≤ 4(2 + 2d)t−1 log n = no(1) , so it bound, summing over all i ∈ ∂Gt−1 u . Since C u suffices to show that the variational distance for the numbers of children with each label for any is at most n−1/2+o(1) (because no(1) n−1/2+o(1) = n−1/2+o(1) ). Specifically, we given vertex in ∂Gt−1 u need to obtain such a bound on the variational distances for three types of random variables: e i for vertices i ∈ ∂Gt−1 with σi = 1 • L u

e i for vertices i ∈ ∂Gt−1 • L with σi = 0 u

fi for vertices in i ∈ ∂Gt−1 • M (for either σi ) . u

The corresponding variational distances, conditioned on |V1t−1 | and |V0t−1 |, and the bounds on the distances implied by (59), are as follows:   dT V Binom(|V1t−1 |, p), Pois(Kp) ≤ |V1t−1 |p2 + ψ (K − |V1t−1 |)p   dT V Binom(|V1t−1 |, q), Pois(Kq) ≤ |V1t−1 |q 2 + ψ (K − |V1t−1 |)q   dT V Binom(|V0t−1 |, q), Pois((n − K)q) ≤ |V0t−1 |q 2 + ψ (n − K − |V0t−1 |)q

The assumption on d implies p ≤ o(n−1+o(1) ) and np2 = dp ≤ n−1+o(1) , and thus also |V1t−1 |q 2 ≤ ≤ n−1+o(1) and |V0t−1 |q 2 ≤ n−1+o(1) . Also, for use below, Kq 2 ≤ Kp2 ≤ n−1+o(1) . We now complete the proof for the three possible cases concerning |C ∗ |. Consider the first case, that |C ∗ | ≡ K. Since we are working under the assumption C t−1 holds, in the case |C ∗ | ≡ K,

|V1t−1 |p2

t −1+o(1) |(K − |V1t−1 |)p| ≤ p|Gt−1 u | ≤ p4(2 + 2d) log n ≤ n

and similarly t −1+o(1) |(n − K − |V0t−1 |)q| ≤ q|Gt−1 . u | ≤ q4(2 + 2d) log n ≤ n

The conclusion (55) follows, proving the lemma in case |C ∗ | ≡ K. √ Next consider the second case: ||C ∗ | − K| ≤ 3K log n and K ≥ 3 log n. Using C t−1 as before, we obtain p |(K − |V1t−1 |)p| ≤ 3Kp2 log n + p4(2 + 2d)t log n ≤ n−1/2+o(1)

and

|(n − K − |V0t−1 |)q| ≤

p

3Kq 2 log n + q4(2 + 2d)t log n ≤ n−1/2+o(1) ,

which establishes (56) in the second case. Finally, consider the third case: |C ∗ | ≤ 6 log n and K ≤ 3 log n. Then |(K − |V1t−1 |)p| ≤ 6p log n + p4(2 + 2d)t log n ≤ n−1/2+o(1) 30

and |(n − K − |V0t−1 |)q| ≤ 6q log n + q4(2 + 2d)t log n ≤ n−1/2+o(1) , which establishes (56) in the third case. Thus, we can construct a coupling so that (Tut , τTut ) = (Gtu , σGtu ) holds with probability at least 1 − n−1+o(1) in case |C ∗ | ≡ K, and with probability 1 − n−1/2+o(1) in the other cases, at each of the tf steps, and, furthermore, the o(1) term in the exponents of n are uniform in t over 1 ≤ t ≤ tf . Since 2tf = no(1) , it follows that tf = o(log n). So the total probability of failure of the coupling is upper bounded by tf n−1+o(1) = n−1+o(1) in case |C ∗ | ≡ K and by n−1/2+o(1) in the other cases. Lemma 15 can be easily extended in two different ways with essentially the same proof. First, as originally stated, the label σu of the vertex u in the planted community graph, and the label τu of the root vertex in the tree graph, are both random. At the base level of a recursive construction, the proof uses the fact that the labels can be coupled with high probability because P{σu = 1} ≈ K ∗ n = P{τu = 1}. If instead we let u be a vertex selected uniformly at random from C , so that σu ≡ 1, and we consider the random tree conditioned on τu = 1, the labels of u in the two graphs are equal with probability one (i.e. exactly coupled), and then the recursive construction of the coupled neighborhoods can proceed from there. Similarly, if u is a vertex selected uniformly at random from [n]\C ∗ , then the lemma goes through for coupling with the labeled tree graph conditioned on τu = 0. The second way to extend the lemma is to begin with multiple vertices in the planted community graph, and show that the joint distribution of their depth t neighborhoods can be coupled with high probability to a set of independent copies of the labeled tree graph. Indeed, such joint coupling is nearly already implied by Lemma 15, because the subtrees from each of the neighbors of the fixed vertex u couple to the independent subtrees rooted at the children of the root vertex in the tree graph. The proof of the extension follows by the same method in a straightforward way– the trees are recursively built up. We therefore state the following enhanced form of the coupling lemma without proof. Lemma 18. (Conditional joint coupling lemma) Let d = np. Suppose p, q, K and tf depend on n such that tf is positive integer valued, and (2 + d)tf = no(1) . Let m0 , m1 be fixed nonnegative integers with m0 +m1 ≥ 1. Given the set of vertices C ∗ of the planted dense subgraph, select vertices u1 , . . . , um0 by uniformly sampling without replacement m0 times from [n]\C ∗ , and select vertices um0 +1 , . . . , um0 +m1 by uniformly sampling without replacement m1 times from C ∗ . Consider a coupling between (G, σ) and m0 + m1 independent copies of the tree network, (Tu , τTu ), such that the first m0 copies are conditioned to have root label 0 and the other m1 copies are conditioned to have root label 1, and the two objects are equal up to depth tf from any of the m0 + m1 selected vertices. If the graphical model is such that |C ∗ | ≡ K, then such a coupling exists with probability at least 1 − n−1+o(1) . If the graphical model is such that |C ∗ | ∼ Binom(n, K/n), then such a coupling exists with probability at least 1 − n−1/2+o(1) . Finally, if the model is such that K ≥ 3 log n graphical  √ −1/2+o(1) ∗ ∗ , then such a coupling exists and |C | is random such that P ||C | − K| ≥ 3K log n ≤ n with probability at least 1 − n−1/2+o(1) . 
We close this section with a result showing that Lemma 18 can be used to show convergence of empirical distributions of random variables computed at each vertex of G by a local function. It is based on the following elementary lemma about covariance and coupling. Lemma 19. Suppose (X, Y ) is a random 2-vector with P{|X| ≤ c} = P{|Y | ≤ c} = 1 for some e Ye ) such that X e is constant c. Suppose for some δ ∈ [0, 1] there exists a random 2-vector (X, 2 e e e independent of Y and P{(X, Y ) 6= (X, Y )} ≤ δ. Then Cov(X, Y ) ≤ 4δc . 31

e and Ye if necessary, it can be assumed without loss of generality that Proof. By truncating X e e P{|X| ≤ c} = P{|Y | ≤ c} = 1. The assumptions imply e + δc1 E [X] = (1 − δ)E[X] E [Y ] = (1 − δ)E[Ye ] + δc2

h i e Ye ] + δc3 = (1 − δ)2 E X e Ye + (2δ − δ2 )c4 E [XY ] = (1 − δ)E[X

e Ye ) = 0 it follows that where |c1 |, |c2 | ≤ c, and |c3 |, |c4 | ≤ c2 . Since Cov(X,

|Cov(X, Y )| = |E [XY ] − E [X] E [Y ] |  e − δ2 c1 c2 = (2δ − δ2 )c4 − δ(1 − δ) c1 E[Ye ] + c2 E[X] ≤ (2δ − δ2 + 2δ(1 − δ) + δ2 )c2 ≤ 4δc2 .

Lemma 20. Suppose Φt is a map from  the set of rooted labeled graphs of depth less than or equal t t t 3 Gi , σGti for every vertex i ∈ G. For example, Rit could be the result to t to R, and let Ri = Φ of a message passing algorithm run for t iterations. Let d = np. Suppose p, q, K and t depend on n such that t ∈ N, (2 + d)t = no(1) , K → ∞, and n − K → ∞. For any bounded Borel measurable function f : R 7→ R, ! 1 X t lim var f (Ri ) = 0 (60) n→∞ K ∗ i∈C   X 1 f (Rit ) = 0. (61) lim var  n→∞ n−K ∗ i∈[n]\C

For any constant c, the convergence is uniform over all such functions f with |f | ≤ c.

Proof. We prove (60) ; the proof of (61) is similar. Without loss of generality, suppose |f | is bounded by one. For i ∈ C ∗ , var(f (Rit )) ≤ 1, and for distinct i, i′ ∈ C ∗ , Lemma 18 with m0 = 0 and m1 = 2 implies that the depth t neighborhoods of i and i′ can be jointly coupled to a pair of independent depth t trees with root labels conditioned to be one. It follows that (Rit , Rit′ ) can be coupled to a pair of independent random variables obtained by applying Φt to the two trees, with probability at least 1 − n−1+o(1) . Therefore, by Lemma 19, Cov(f (Rit ), f (Rit′ )) ≤ 4n−1+o(1) . Since the variance of an average is the average of the covariances, these bounds yield: ! K + K(K − 1)4n−1+o(1) 1 1 X t f (Ri ) ≤ ≤ + 4n−1+o(1) → 0. var 2 K K K ∗ i∈C

4

Belief propagation algorithm for community recovery

We now turn to the community recovery problem with parameters n, K, p, q. Lemma 15 implies that, under suitable conditions, the neighborhood of a fixed vertex i is locally tree-like with high 3

It is assumed that Φt is invariant with respect to automorphisms of the rooted labeled graph that leave the root invariant, so there is no ambiguity about issues such as the ordering of the neighbors of a vertex when applying the function to the graph. Such invariance holds if Φ is computed using a message passing algorithm such that at each vertex, the neighbor vertices are treated the same way.

32

probability; Lemma 1 gives the recursive equations for computing the log likelihoods, Λti , on the tree model. The two lemmas together suggest the following belief propagation algorithm for apP{G|σi =1} . (The proximately computing the log likelihoods for the community recovery problem, P{G|σ i =0} community recovery problem can be viewed as a set of detection problems: one for each vertex.) Let ∂i denote the set of neighbors of i in G. Define the message transmitted from vertex i to its neighbor j at (t + 1)-th iteration as    t  Rℓ→i −ν p + 1 e X q t+1 . Ri→j = −K(p − q) + Aℓi log  (62) Rtℓ→i −ν + 1 e ℓ∈∂i\{j} 0 for initial conditions Ri→j = 0 for all i ∈ [n] and j ∈ ∂i. Then we approximate

the belief of vertex i , at (t + 1)-th iteration, messages from its neighbors as follows: Rit+1 = −K(p − q) +

X

ℓ∈∂i

Rit+1 ,

P{G|σi =1} P{G|σi =0}

by

which is determined by combining incoming 

Aℓi log 

t

eRℓ→i −ν t

  p q

+1

eRℓ→i −ν + 1



.

(63)

Lemmas 8 and 15 suggest the following algorithm and performance guarantee. Algorithm 1 Belief propagation for weak recovery, Bernoulli model Input: n, K ∈ N. p > q > 0, adjacency matrix A ∈ {0, 1}n×n , tf ∈ N 0 2: Initialize: Set Ri→j = 0 for all i ∈ [n] and j ∈ ∂i. 1:

3:

t −1

f for all i ∈ [n] and j ∈ ∂i. Run tf − 1 iterations of belief propagation as in (62) to compute Ri→j

t

Compute Rif for all i ∈ [n] as per (63). b the set of K indices in [n] with largest values of Rtf . 5: Return C, i

4:

Before stating the theorem giving a performance guarantee for Algorithm 1, we briefly discuss 2 (p−q)2 → λ as n → ∞ for some fixed the choice of the parameters. The focus is on the case that K(n−K)q positive constant λ. One possible sequence of values is, for fixed positive constants r, ρ, a, and b 2 2 2r 2r . with a > b, K = logρnr n , p = a logn n , and q = b logn n , for which the limit λ is given by λ = ρ (a−b) b The following lemma shows that the existence of a finite positive limit λ constrains the parameters significantly. Lemma 21. Suppose λ is a fixed positive constant, and suppose (K, p, q) depend on n as n → ∞ 2 2 (p−q)2 = O(np). In particular, (i) If such that p ≥ q, p/q = O(1), and K(n−K)q → λ. Then n−K K K = o(n) then np → ∞, and (ii) If np = no(1) then K = n1−o(1) .

Proof. The assumptions imply that for large enough n, λ K 2 (p − q)2 ≤ ≤ 2 (n − K)q

  p .  n−K 2 q np

K

2 = O(np) = no(1) , proving the first claim of the lemma. Since p/q = O(1), it must be that n−K K Statements (i) and (ii) follow from the first claim. 33



Theorem 1. Suppose Assumption 1 holds with λ > 1/e. Suppose (np)log ν = no(1) , where ν , ∗ ¯ ¯ log n−K K . Let tf = t0 + log (ν) + 2, where t0 is a constant depending only on λ as described in b be produced by Algorithm 1. Then for any constant r > 0, for all sufficiently large Lemma 6. Let C n, if the graphical model is such that |C ∗ | ≡ K, b E[|C ∗ △C|] no(1) ≤ + 2e−νr . K K √  If instead |C ∗ | is random with P |C ∗ | − K ≥ 3K log n ≤ n−1/2+o(1) ,

(64)

1

b E[|C ∗ △C|] n 2 +o(1) ≤ + 2e−νr . K K

(65)

b] E[|C ∗ △C| For either model, weak recovery is achieved: → 0 as n → ∞. The running time is K ∗ O(m log n), where m is the number of edges in the graph G.

Remark 7. Lemma 21 shows that the assumptions of Theorem 1 imply e2ν = O(np), so that ∗ ∗ √ ∗ ∗ √ log ν o(1) log ( np) =n is satisfied if (np) = no(1) , log (ν) ≤ log ( np) − 1. Hence, the condition (np) s in particular, if np = log n for some fixed s > 0. Proof of Theorem 1. We begin by explaining why either (64) or (65) implies weak recovery. It is because Lemma 21(ii) shows that the assumptions of Theorem 1 imply that K = n1+o(1) , and it is also assumed that ν → ∞. Hence, the right hand sides of (64) and (65) converge to zero. The remainder of the proof basically consists of combining Lemmas 8 and 15. Lemma 8 holds 2 (p−q)2 ≡ λ for a constant λ with λ > 1/e, ν → ∞, and p/q = O(1). Lemma 8 by the assumptions K(n−K)q ∗

also determines the given expression for tf . In turn, the assumption (np)log ν = no(1) ensures that (np)tf = no(1) , and by Lemma 21(i), np → ∞, so also (2 + np)tf = no(1) , so that Lemma 15 holds. A subtle point is that the performance bound of Lemma 8 is for the MAP rule (12) for detecting the label of the root vertex. The same rule could be implemented at each vertex of the graph G which has a locally tree like neighborhood of radius t0 + log∗ (ν) + 2 by using the estimator bo = {i : Rtf ≥ ν}. We first bound the performance for C bo and then do the same for C b produced C i bo to be the output of Algorithm 1, but returning a constant by Algorithm 1. (We could have taken C size estimator leads to simpler analysis of the algorithm for exact recovery.) bo (for prior distriThe average probability of misclassification of any given vertex u in G by C K n−K bution ( n , n )) is less than or equal to the sum of two terms. The first term is n−1+o(1) in case |C ∗ | ≡ K or n−1/2+o(1) in the other case (due to failure of tree coupling of radius tf neighborhood– −νr (bound on average error probability for the detection see Lemma 15). The second term is K ne problem associated with a single vertex u in the tree model–see h Lemmai8.) Multiplying by n bo | ; dividing by K gives bounds the expected total number of misclassification errors, E |C ∗ △C b replaced by C bo and the factor 2 dropped in the bounds. the bounds stated in the lemma with C bo is defined by a threshold condition whereas C b similarly corresponds to using a data The set C b dependent threshold and tie breaking rule to arrive at |C| ≡ K. Therefore, with probability one, bo ⊂ C b or C b⊂C bo . Together with the fact |C| b ≡ K we have either C and furthermore,

b ≤ |C ∗ △C bo | + |C bo △C| b = |C ∗ △C bo | + ||C bo | − K|, |C ∗ △C|

bo | − K| ≤ ||C bo | − |C ∗ || + ||C ∗ | − K| ≤ |C ∗ △C bo | + ||C ∗ | − K|. ||C 34

So

b ≤ 2|C ∗ △C bo | + kC ∗ | − K|. |C ∗ △C|

b ≤ 2|C ∗ △C bo | and (64) follows from what was proved for C bo . In the other If |C ∗ | ≡ K then |C ∗ △C| 1 +o(1) ∗ bo . case, E [kC | − K|] ≤ n 2 , and (65) follows from what was proved for C As for the computational complexity guarantee, notice that in each BP iteration, each vertex t+1 i needs to transmit the outgoing message Ri→j to its neighbor j according to (62). To do so, t+1 and then subtract neighbor j’s contribution from it to get the vertex i can first compute Ri t+1 desired message Ri→j . In this way, each vertex i needs O(|∂i|) basic operations and the total time complexity of one BP iteration is O(m), where m is the total number of edges. Since ν ≤ n, at most O(log∗ n) iterations are needed and hence the algorithm terminates in O(m log∗ n) time. Next we discuss how to use the belief propagation algorithm to achieve exact recovery. The key idea is to attain exact recovery in two steps. In the first step, we apply the belief propagation algorithm for weak recovery. In the second step, we use a linear-time local voting procedure to clean-up the residual errors made by the belief propagation algorithm. In particular, for each vertex i, we count ri , the number of neighbors in the cluster estimated by BP, and pick the set of K vertices with the largest values of ri . To facilitate analysis, we adopt the successive withholding method described in [19] to ensure the first and second step are independent of each other; variants of this method were used previously in [8, 36, 34]. In particular, we first randomly partition the set of vertices into a finite number of subsets. One at a time, one subset is withheld to produce a reduced set of vertices, and BP algorithm is run on the reduced set of vertices. The estimate obtained from the reduced set of vertices is used to classify the vertices in the withheld subset. The idea is to gain independence: the outcome of BP based on the reduced set of vertices is independent of the data corresponding to edges between the withheld vertices and the reduced set of vertices. The full description of the algorithm is given in Algorithm 2. Algorithm 2 Belief propagation plus cleanup for exact recovery, Bernoulli model Input: n ∈ N, K > 0, p > q > 0, adjacency matrix A ∈ {0, 1}n×n , tf ∈ N, and δ ∈ (0, 1) with 1/δ, nδ ∈ N. 2: (Partition): Partition [n] into 1/δ subsets Sk of size nδ. 3: (Approximate Recovery) For each k = 1, . . . , 1/δ, let Ak denote the restriction of A to the rows and columns with index in [n]\Sk , run Algorithm 1 (belief propagation for weak recovery) with bk denote the output. input (n(1 − δ), ⌈K(1 − δ)⌉, p, q, Ak , tf ) and let C P e 4: (Cleanup) For each k = 1, . . . , 1/δ compute ri = bk Aij for all i ∈ Sk and return C, the set j∈C of K indices in [n] with the largest values of ri . 1:

The following theorem gives sufficient conditions for Algorithm 2 to achieve exact recovery. Theorem 2. Suppose λ is a constant with λ > 1/e. Let K, p, q, with K ∈ N and p > q > 0 be 2 (p−q)2 ∗ indexed by n such that K(n−K)q → ∞, p/q = O(1), and (np)log ν = no(1) . ≡ λ, ν , log n−K K Consider the graphical model with |C ∗ | ≡ K. Let tf = t¯0 + log∗ (ν) + 2, where t¯0 is a constant depending only on λ(1 − δ) as described in Lemma 6 with λ replaced by λ(1 − δ). Also, suppose p is bounded away from 1 and the information theoretic sufficient condition (2) is satisfied. Select δ > 0 so small that (1 − δ)λe > 1 and (66) and (67) hold for some sequence of thresholds jn indexed e = C ∗ } → 1 as n → ∞. The running time is e be produced by Algorithm 2. Then P{C by n. Let C O(m log∗ n). 35

Proof. The theorem follows from the fact that the belief propagation algorithm achieves√weak recov  ∗ ∗ ery, even if the cardinality |C | is random and is only known to satisfy P | |C | − K| ≥ 3K log n ≤ n−1/2+o(1) and the results in [19]. We include the proof for completeness. Let Ck∗ = C ∗ ∩ ([n]\Sk ) for 1 ≤ k ≤ 1/δ. As explained in Remark 6, Ck∗ is obtained by sampling the vertices in [n] without replacement, and thus the distribution of Ck∗ is hypergeometric with E [|Ck∗ |] = K(1  − δ). A result of Hoeffding [21] implies that the Chernoff bounds for the Binom n(1 − δ), K n distribution also p ∗ hold for |Ck |, so (57) and (58) with np = K(1 − δ) and ǫ = 3 log n/[K(1 − δ)] imply n o p P |Ck∗ | − K(1 − δ) ≥ 3K(1 − δ) log n ≤ 2n−1 ≤ n−1/2+o(1) .

Hence, it follows from Theorem 1 and the condition λ > 1/e that o n bk ∆C ∗ | ≤ δK for 1 ≤ k ≤ 1/δ → 1, P |C k

bk is the output of the BP algorithm in Step 3 of Algorithm 2. as n → ∞, where C Since λ is a constant and K = o(n), [19, Lemma 6] implies that τ ∗ as defined in (3) satisfies ∗ τ ∈ [q, p]. In view of [19, Lemmas 7 and 8] and the assumption (2), it follows that there exits a sequence of thresholds jn indexed by n such that P {Binom(K(1 − 2δ), p) + Binom(Kδ, q) ≤ jn } = o(1/K),

P {Binom(K(1 − δ), q) > jn } = o(1/(n − K)).

(66) (67)

Applying [19, Theorem 7], which is essentially an application of the union bound, we get that e = C ∗ } → 1 as n → ∞. P{C

5

Converse results for recovery by local algorithms

Lemma 14 provides lower bounds on error probability for the tree model, and by the coupling lemmas the bounds translate to lower bounds on error probability achievable by any local algorithm for estimating the label of a given vertex u in the community recovery problem. For convenience, throughout this section we restrict attention to the case |C ∗ | ≡ K. Theorem 3. (Converse for local algorithms) Fix λ with 0 < λe ≤ 1. Consider the community recovery model with parameters K, p, q depending on n and let tf ∈ N depend on n. Suppose K 2 (p−q)2 o(1) . Then for any estimator C tf b such that for each vertex u in G, (n−K)q ≡ λ and (2 + np) = n σu is estimated based on G in a neighborhood of radius tf from u, ∗ b E[|C△C |] ≥

K(n − K) exp(−λe/4) − no(1) . n

(68)

and the sum of Type-I and Type-II error probabilities for classification of a vertex satisfies pe,0 + pe,1 ≥

1 −1/4 e − n−1+o(1) . 2

Furthermore, if ν → ∞ and p/q = O(1), then lim inf n→∞ lim inf n→∞

n K pe

∗ |] b E[|C△C ≥ 1. K

36

(69)

≥ 1, or, equivalently, (70)

Proof. The average error probability, pe , for classifying the label of a vertex in the graph G is greater than or equal to the lower bound (52) on average error probability for the tree model, minus the upper bound, n−1+o(1) , on the coupling error provided by Lemma 15. Multiplying the lower bound on average error probability per vertex by n yields (68). Similarly, pe,0 and pe,1 , for the community recovery problem can be approximated by the respective conditional error probabilities for the random tree model by the the conditional form of the coupling lemma, Lemma 18, so (69) follows from (53). n t pee ≥ 1, where pete is the average By Lemma 14, assuming p/q = O(1) and ν → ∞, lim inf n→∞ K error probability for any estimator for the corresponding random tree network. By the coupling n n t n t pte −pte | ≤ n−1+o(1) . By Lemma 21(ii), K = no(1) so that | K pee − K pe | ≤ n−1+o(1) . lemma, Lemma 15, |e n The conclusion lim inf n→∞ K pe ≥ 1 follows from the triangle inequality. Remark 8. Condition (69) shows that weak recovery in the sense of [31] is not possible (see Definition 3). Condition (70) shows that in the asymptotic regime ν → ∞ with λe < 1 recovery in the sense of Definition 4 is not possible.

We next consider a converse result for fractional recovery (see Definition 5, which considers b ≡ K). It is impossible to have a nontrivial estimator with |C| b ≡ K which makes estimators with |C| decisions for each node locally. So instead, we consider Algorithm 1 that uses local computation t until the last step, where the last step is to output the set of K indices with largest values of Rif . Establishing a negative result for an estimator based on the largest K values requires consideration of the joint distribution of the messages. To see this, suppose the messages were exactly Gaussian with the same variances, with the means of messages for vertices in C ∗ being greater than the means of the messages in [n]\C ∗ by some ǫ > 0. Then each of the messages for i ∈ C ∗ is stochastically greater than each of the messages for i ∈ [n]\C ∗ . Therefore, if there were no restriction on the joint distribution, it could be that the message for any i ∈ C ∗ is greater exactly by ǫ than the message for any j ∈ [n]\C ∗ . Then using the largest K messages gives error probability equal to zero. The path we take is to use the enhanced version of the coupling lemma to derive bounds for the empirical distributions of the messages for vertices in either C ∗ or [n]\C ∗ . Theorem 4 (Converse for belief propagation: impossibility of fractional recovery). Fix λ with 0 < λe ≤ 1. Consider the community recovery model with parameters K, p, q depending on n with |C ∗ | ≡ K, such that Assumption 1 holds. Suppose t ∈ N, with t possibly depending on n such that (np)t = no(1) . Let (Rut : u ∈ [n]) be computed using the belief propagation updates (62) and (63) b denote the set of K vertices u with the largest values of Rut as defined in Algorithm 2. and let C Then

∗ |] b E[|C∩C K

→ 0.

The proof of the theorem is based on the following lemma.

Lemma 22. (Gaussian limit of empirical distribution) Fix λ > 0. Suppose p, q, K and t depend on 2 (p−q)2 n such that as n → ∞, K(n−K)q ≡ λ, K = o(n), p/q = O(1), and t ∈ N can vary with n subject to   2  ′ p eC bt ′ t o(1) . Then the following (np) = n and K(p−q) = o(bt ) as n → ∞, where C = λ 3 + 2 q + pq limits hold in the sense of convergence in probability. 1 X 1 Rti −bt /2  − Φ(x) → 0 sup √ K ≤x x bt i∈C ∗ 1 X   − Φ(x) → 0 1 Rti +bt /2 sup √ n ≤x x ∗ b i∈[n]\C

t

37

(71) (72)

Furthermore, the same limits hold if “≤ x” is changed to strict inequality, “< x,” in the indicator functions. P Proof. We prove (71); the proof of (72) is similar. Let St (x) = K1 i∈C ∗ 1 Rti −bt /2  . Lemma 12 √

bt

≤x

and Lemma 18 with m0 = 0 and m1 = 1 imply that E [St (x)] − Φ(x) → 0 as n → ∞, uniformly in x. Lemma 20 implies that var(St (x)) converges to zero uniformly in x. Thus, by the Chebyshev inequality, St (x) → Φ(x) in probability as n → ∞ for each x fixed. Since St (x) and Φ(x) are both nondecreasing in x with values in [0, 1], and Φ is continuous, for any ǫ > 0 there is a finite set of x values depending only on ǫ such that if |St (x) − Φ(x)| ≤ ǫ at each of those x values, then |St (x) − Φ(x)| ≤ 2ǫ for all x and |St (x−) − Φ(x)| ≤ 2ǫ, where St (x−) is the left limit of St at x.4 The conclusion, (71), with either “≤ x” as written or with “< x,” follows. b denote the number of misclassified vertices in C ∗ . Let Proof of Theorem 4. Let N 1→0 = |C ∗ \C| t ∗ b γ = min{Ri : i ∈ C }. The estimator C can be viewed as a threshold based estimator for the data dependent threshold γ and a tie-breaking rule. No matter how the ties are broken, |{i ∈ C ∗ : Rit < γ}| ≤ N 1→0 ≤ |{i ∈ C ∗ : Rit ≤ γ}|.

Therefore, Lemma 22 with x =

γ−b √ t /2 bt

N 1→0

in (71) implies that the random threshold γ is such that   bt /2 − γ √ + Kop (1), (73) = KQ bt

where op (1) represents a random variable such that for any ǫ > 0, P {|op (1)| ≥ ǫ} → 0 as n → ∞, b ∗ |. Then and Q(u) = Φ(−u). Similarly, let N 0→1 = |C\C |{i ∈ [n]\C ∗ : Rit > γ}| ≤ N 0→1 ≤ |{i ∈ [n]\C ∗ : Rit ≥ γ}|

and Lemma 22 with x =

γ+bt /2 √ bt

N 0→1

in (72) yields (using Q(u) = 1 − Φ(u)):   γ + bt /2 √ = (n − K)Q + (n − K)op (1). bt

(74)

b it follows that N 1→0 = N 0→1 . Thus, (73), (74) and the assumption K = O(n) Since |C ∗ | = |C| imply that     γ + bt /2 bt /2 − γ √ √ = nQ + nop (1). KQ bt bt

Let ǫ > 0 be arbitrary. Select δ > 0 so small that δ ≤ ǫ and such that any x satisfying Q(x) ≤ 2δ √ must also satisfy Q( e − x) ≥ 1 − ǫ. Select n so large that P {|op (1)| > δ} ≤ ǫ. The random threshold γ is such that, with probability at least 1 − ǫ,     γ + b /2 b /2 − γ t t ≤ nδ. KQ √ √ (75) − nQ bt bt

Suppose for  the moment that γ satisfies (75). Since K = o(n) it follows that for n sufficiently large,  γ+b t /2 √ ≤ 2δ. Then, since bt ≤ e, Q b t

Q 4



bt /2 − γ √ bt



=Q

p

γ + bt /2 bt − √ bt





√ γ + bt /2 ≥Q e− √ bt



≥ 1 − ǫ,

For example, we could take the x values to be such that Φ(xj ) = jǫ for integers j with 1 ≤ j < 1/ǫ.

38

where the last inequality follows  choice of δ. In summary, if n is sufficiently large, then with  by the bt /2−γ ≥ 1 − ǫ. So, in view of (73), P N 1→0 ≥ K(1 − 2ǫ ≥ 1 − 2ǫ probability at least 1 − ǫ, Q √b t ∗ |] b K−E[N 1→0 ] = ≤ 1 − (1 − 2ǫ)2 ≤ 4ǫ for n sufficiently large. for n sufficiently large. Hence E[|C∩C K K Since ǫ is arbitrary, the conclusion follows.

6

On the spectral limit for recovery of one community

Results are given in this section to suggest that the spectral limit for weak recovery of one community in the stochastic block model with parameters n, K, p and q is given by λ > 1, where 2 (p−q)2 λ = K(n−K)q . We study a linear message passing algorithm that is suggested by the power method for computing the principal eigenvector of the non-backtracking matrix. On the positive side, we establish that if λ > 1 then the algorithm succeeds to provide weak recovery. On the converse side, we show that if λ ≤ 1 then for any fixed number or slowly enough growing number of iterations, it is not possible to achieve recovery asymptotically better than what can be done by trivial random guessing. This section is organized as follows. The spectral algorithm is formulated in Section 6.1 for the graph G. Section 6.2 considers the analogous algorithm for the random tree graph; it gives upper bounds on exponential moments of the messages, and, as shown in Corollary 1, the bounds combined with the Chernoff bound provide an upper bound on the probability of error for estimating the label of the root vertex from the tree. The corollary together with the coupling lemma is used in Section 6.3 to provide a sufficient condition for weak recovery by the linear message passing algorithm in the graph. The bounds on exponential moments are also applied in Section 6.4 to derive a Gaussian limit result for the linear message passing algorithm, which in turn is used to provide a converse result in Section 6.5. The analysis in this section is somewhat similar to that for the belief propagation algorithm in Sections 2-5. One key difference is that the update algorithms are simply different. Another key difference is that the messages for the spectral algorithm, even on the tree, are not log likelihood ratios. Thus, the Bhattacharyya coefficient is not relevant, and the power of the estimators is not monotone nondecreasing with the number of iterations. The methodology for the converse results based on establishing Gaussian state evolution for a slowly increasing number of iterations is similar. For λ > 1, the signal-to-noise ratio for the spectral algorithm, discussed in Section 6.4, increases exponentially with the number of iterations, rather than doubly exponentially as for the belief propagation algorithm. Thus, the number of iterations required for recovery by the spectral n n instead of as the number of iterations log∗ K required for the belief algorithm grows as log K propagation algorithm.

6.1

Formulation of a spectral algorithm

Suppose K = o(n). If we apply the spectral method, a natural matrix to start with is A − q(J − I), or A − qJ. Finding the principal eigenvector of A − qJ according to the power method is done by starting with some vector and repeatedly multiplying by A − qJ sufficiently many times. We shall √ where m = (n − K)q. Of course the scaling doesn’t change the consider the scaled matrix A−qJ m eigenvectors. This suggests the following message passing update equation: q X t 1 X t θit+1 = − √ θℓ + √ θℓ . m m ℓ∈[n]

39

ℓ∈∂i

(76)

The first sum is over all vertices in the graph and doesn’t depend on i. An idea is to appeal to the law of large numbers and replace the first sum by its expectation. Also, as we noted in Section 1.3, in the sparse graph regime np = o(log n), there exist vertices of high degrees ω(np), and the spectrum of A is very sensitive to high-degree vertices. To deal with this issue, as proposed in [26], we associate the messages in (76) with directed edges and prevent the message transmitted from j to i from being immediately reflected back as a term in the next message from i to j, resulting in the following message passing algorithm: t+1 θi→j =−

q((n − K)At + KBt ) 1 √ +√ m m

X

t . θℓ→i

(77)

ℓ∈∂i\{j}

t |σ = 0] and B ≈ E[θ t |σ = 1]. Notice that 0 = 1, where At ≈ E[θℓ→i with initial values θℓ→i t ℓ ℓ→i ℓ t+1 t when computing θi→j , the contribution of θj→i is subtracted out. Since we focus on the regime np = no(1) , the graph is locally tree-like with high probability. In the Poisson random tree limit of t |σ = 0] and E[θ t |σ = 1] can be calculated the neighborhood of a vertex, the expectations E[θℓ→i ℓ ℓ→i ℓ exactly, and as a result (see the next section) we take A0 = 1, At = 0 for t ≥ 1, and Bt = λt/2 for t ≥ 0. The update equation (77) can be expressed in terms of the non-backtracking matrix associated with the adjacency matrix A. It is the matrix B ∈ {0, 1}2m×2m with Bef = 1{e2 =f1 } 1{e1 6=f2 } , where e = (e1 , e2 ) and f = (f1 , f2 ) are directed edges. Let Θt ∈ R2m denote the messages on directed edges with Θte = θet 1 →e2 . Then, (77) in matrix form reads

Θt+1 = −

1 q((n − K)At + KBt ) √ 1 + √ BΘt . m m

As shown in [5], the spectral properties of the non-backtracking matrix closely match those of the original adjacency matrix. It is therefore reasonable to take the linear update equation (77) as a form of spectral method for the community recovery problem. Finally, to estimate C ∗ , we define the belief at vertex u as: θut+1 = −

q((n − K)At + KBt ) 1 X t √ +√ θi→u , m m

(78)

i∈∂u

and select the vertices u such that θut is above a certain threshold. The full description of the algorithm and proof of correctness is deferred to Section 6.3.

6.2

Linear message passing on a random tree–exponential moments

To analyze the message passing algorithms given in (77) and (78), we first study an analogous message passing algorithm on the tree model introduce in Section 2. Let us recall the inference problem on a Galton-Watson tree with Poisson distributed numbers of offspring. In the following, we fix a vertex u and let Tu denote the infinite Galton-Watson tree rooted at vertex u. For vertex i in Tu , let Tit denote the subtree of Tu of height t rooted at vertex i. Let τi ∈ {0, 1} denote the label of vertex i in Tu . Assume τu ∼ Bern(K/n). For any vertex i ∈ Tu , let Li denote the number of its children j with τj = 1, and Mi denote the number of its children j with τj = 0. Suppose that Li ∼ Pois(Kp) if τi = 1, Li ∼ Pois(Kq) if τi = 0, and Mi ∼ Pois((n − K)q) for either value of τi .

40

We consider the linear message passing algorithm analogous to (77) and (78): t+1 =− ξi→π(i)

ξut+1 = −

1 X t q((n − K)At + KBt ) √ ξℓ→i , +√ m m

(79)

q((n − K)At + KBt ) 1 X t √ +√ ξi→u , m m

(80)

ℓ∈∂i

i∈∂u

0 with initial values θℓ→π(ℓ) = 1 for all ℓ 6= u, where π(ℓ) denotes the parent of ℓ. Let Z0t denote a random variable that has the same distribution as ξut given τu = 0, and let Z1t denote a random t variable that has the same distribution as ξut given τu = 1. Equivalently,  1} has the  t  Zb for b ∈ {0, t distribution of ξℓ→π(ℓ) for any vertex ℓ 6= u, given τℓ = b. Let At = E Z0 and Bt = E Z1t . Given   τu = 0, the mean of the sum in (79) is subtracted out, so At = E Z0t = 0 for all t ≥ 1. Compared to the case τu = 0, if τu = 1, then √ on average there are K(p − q) additional children of node u with labels equal to 1, so that Bt+1 = λBt , which gives Bt = λt/2 for t ≥ 0, as already stated above. 2 (p−q)2 Let λ = K(n−K)q and m = (n − K)q. We consider sequences of parameter triplets (λ, p/q, K/n) h ti indexed by n. Let ψit (η) = E eηZi for i = 0, 1 and t ≥ 1. Expressions are given for these functions when t = 1 in (86) and (88) below. Following the same method used in Section 2 for the belief propagation algorithm, we find the following recursions for t ≥ 1 :          η η η ψ0t+1 (η) = exp m ψ0t √ , (81) − 1 + Kq ψ1t √ − 1 − √ λt/2 m m m      √ η t+1 t+1 t −1 . (82) λm ψ1 √ ψ1 (η) = ψ0 (η) exp m

Lemma 23. Assume that as n → ∞, λ is fixed, K = o(n), and p/q = O(1). (Consequently, log

n−K

m → ∞; see Remark 2.) Let γ be a constant such that γ > 1 and γ ≥ λ. Let T = 2α log Kγ , √ where α = 1/4 (in fact any α < 1 works). Let c = 41 log γ (in fact any c ∈ (0, log γ) works). For 2 sufficiently large n, t ∈ [T ], and η such that γ (t−1)/2 ( ηm + √ηm ) ≤ c, ψ0t (η) ≤ exp(γ t/2 η 2 ),

ψ1t (η) ≤ exp(λt/2 η + γ t/2 η 2 ).

(83) (84)

√ Proof of Lemma 23. Recall that m = (n − K)q and K(p − q) = λm. Since K = o(n), it follows that (nq)/m → 1. Also, because λ is fixed, we have that λ/m → 0. Hence, the choice of c ensures that for n sufficiently large, r ! √ λ ec nq + ≤ γ. (85) m m 2 √ For t = 1 and η ∈ (−∞, mc],

√ √ ψ01 (η) = exp(nq(eη/ m − 1 − η/ m))  (85)  nq √ ec η 2 ≤ exp( γη 2 ), ≤ exp 2m

41

(86) (87)

where we used the fact that ex ≤ 1 + x +

ec 2 2x

for all x ∈ (−∞, c]. Similarly, √

ψ11 (η) = ψ01 (η) exp(K(p − q)(eη/ m − 1))      nq √ ec η 2 η c 2 λm √ + e η exp ≤ exp 2m 2 m m ! r ! √ √ λ ec 2 (85) nq √ + η ≤ exp λη + ≤ exp( λη + γη 2 ). m m 2 Thus, (83) and (84) hold for t = 1 and η as described in the lemma. Observe that 1 γ T /2 √ = o(1), m

(88) (89) (90)

(91)

1−α

α √1 = λ−α/2 ( pq − 1)α m− 2 = o(1). In addition, the choice of c because γ T /2 √1m = ( n−K K ) m guarantees that, for n sufficiently large, r !  √ ec  λ Kq c 3c + γ T /2 1+ + ≤ γ, e + (92) m m 2 Kq T /2 K 1−α because Kq = ( n−K ) = o(1), and (91) holds. Assume for the sake m = o(1), m → ∞, m γ of proof by induction that, for some t with 1 ≤ t < T, (83) and (84) hold for all η ∈ Γt , {η : 2 γ (t−1)/2 ( ηm + √ηm ) ≤ c} . Now fix η ∈ Γt+1 . Since Γt is an interval containing zero for each t and Γt+1 ⊂ Γt , it is clear that √ηm ∈ Γt for m ≥ 1. By the induction hypothesis, we have

         η η η ψ0t+1 (η) = exp m ψ0t √ − 1 + Kq ψ1t √ − 1 − √ λt/2 m m m ! ! !) ! ( γ t/2 η 2 η η γ t/2 η 2 − 1 + Kq exp + λt/2 √ − 1 − √ λt/2 ≤ exp m exp m m m m   !2     γ t/2 η 2 ec γ t/2 η 2 η  ≤ exp ec γ t/2 η 2 + Kq  + + λt/2 √   m 2 m m      Kq c  t/2 Kq e 3cγ + γ t η 2 ≤ exp γ t/2 η 2 ec + + m 2m o n (92) ≤ exp γ (t+1)/2 η 2 ,

where the first inequality holds due to the induction hypothesis; the second inequality holds due to c ex ≤ 1 + ec x for all x ∈ [0, c] and ex ≤ 1 + x + e2 x2 for all x ∈ (−∞, c]; the third inequality holds due to the fact that η ∈ Γt+1 and λ ≤ γ. Similarly,   !2     c t/2 2 t/2 2 √ √ η e γ η η γ η η + + λt/2 √ + √ λt/2  − 1 ≤ λm  λm ψ1t √ m 2 m m m m r   ec  λ t/2 t t/2 3cγ + γ η 2 + λ(t+1)/2 η = γ + m 2 42

and hence ψ1t+1 (η)

= ≤

(92)



 √

  η √ λm −1 m ) ( r ! r !   Kq λ λ ec Kq t/2 t 2 (t+1)/2 c t/2 2 + 3cγ + γ η + λ η + + exp γ η e + m m m m 2 n o exp λ(t+1)/2 η + γ (t+1)/2 η 2 .

ψ0t+1 (η) exp



ψ1t



Corollary 1. Assume that as n → ∞, λ is fixed with λ > 1, K = o(n), and p/q = O(1). Let T =   log n−K K K 2α log Kλ , where α = 1/4. If τ = 21 λT /2 , then P Z0T ≥ τ = o( n−K ) and P Z1T ≤ τ = o( n−K ).

Proof. Since λ > 1 we can let γ = λ in Lemma 23 so that T here is the same as T in Lemma 23. Equation (91) implies that the interval of η values satisfying the condition of Lemma 23 for t = T converges to all of R. By Lemma 23 and the Chernoff bound for threshold at τ = 12 λT /2 , for any η > 0, if n is sufficiently large  η=1/4 P Z0T ≥ τ ≤ ψ0T (η) exp(−ητ ) ≤ exp(λT /2 (η 2 − η/2)) = exp(−λT /2 /16).

(93)

Similarly, for any η < 0 and n sufficiently large,

 η=−1/4 P Z1T ≤ τ ≤ ψ1T (η) exp(−ητ ) ≤ exp(λT /2 (η 2 + η/2)) = exp(−λT /2 /16).

(94)

α T /2 /16) = o( K ). By the choice of T , we have λT /2 = ( n−K K ) and hence exp(−λ n−K

6.3

Spectral algorithm based on the non-backtracking matrix

Together with the coupling lemma, Corollary 1 shows that the linear message passing algorithm using (77) and (78) can provide weak recovery. The algorithm and performance result are stated formally in this section. Algorithm 3 Spectral algorithm for weak recovery 1: 2: 3: 4: 5: 6:

Input: n, K ∈ N. p > q > 0, adjacency matrix A ∈ {0, 1}n×n 2

2

log

n−K

(p−q) Set λ = K(n−K)q and T = 2α log Kλ , where α = 1/4 (in fact any α < 1 works). 0 Initialize: Set θi→j = 1 for all i ∈ [n] and j ∈ ∂i. T −1 Run T − 1 iterations of message passing as in (77) to compute θi→j for all i ∈ [n] and j ∈ ∂i. T Run one more iteration of message passing to compute θi for all i ∈ [n] as per (78). b the set of K indices in [n] with largest values of θ T . Return C, i

Theorem 5. Suppose λ is a constant with λ > 1. Let K, p, q be indexed by n such that

K 2 (p−q)2 (n−K)q

≡ λ,

K = o(n), p/q =  O(1), and √ = Suppose the graphical model is such that |C ∗ | ∗ −1/2+o(1) b be the estimator produced by is random with P |C | − K ≥ 3K log n ≤ n . Let C b] E[|C ∗ △C| Algorithm 3. Then → 0 as n → ∞. K (np)log(n/K)

no(1) .

43

Remark 9. By Lemma 21, the conditions of Theorem 5 imply that (n/K)2 = O(np). Therefore a √ log n) log(n/K) o(1) log(np) o(1) o(1/ . sufficient condition for (np) =n is (np) = n , or, equivalently, np = n A more specific sufficient condition for (np)log(n/K) = no(1) is np = (log n)s for some fixed s > 0. Remark 10. As shown in [19, Theorem 7], if there is an algorithm that can provide weak recovery even if the community size is random and only approximately equals to K, then it can be combined with a linear-time voting procedure to achieve exact recovery under the information theoretic sufficient condition (2). Theorem 5 implies that if λ > 1, the linear message passing algorithm indeed works for such a case. Therefore, we can upgrade the weak recovery result of linear message passing to exact recovery under condition λ > 1 and condition (2), in a similar manner as described in Algorithm 2 and the proof of Theorem 2. Proof. The proof consists of combining Corollary 1 and the coupling lemma. Lemma 21(i) implies np → ∞ and the assumptions imply (np)T = no(1) . Therefore, (2+np)T = no(1) ; the coupling lemma can be applied. The performance bound of Corollary 1 is for a hard threshold rule for detecting the label of the root node. The same rule could be implemented at each vertex of the graph G which bo = {i : θ T ≥ λT /2 /2}. We has a locally tree like neighborhood of radius T by using the estimator C i bo and then do the same for C b produced by Algorithm 3. first bound the performance for C bo (for prior distriThe average probability of misclassification of any given vertex u in G by C K n−K bution ( n , n )) is less than or equal to the sum of two terms. The first term is less than or equal K ) (due to error of to n−1/2+o(1) (due to coupling error) by Lemma 15. The second term is o( n−K classification of the root vertex of the Poisson tree graph of depth T ) by Corollary 1. Multiplying the h average ierror probability by n bounds the expected total number of misclassification errors, bo | . Lemma 21(ii) implies K = n1+o(1) , so n−1/2+o(1) n = n−1/2+o(1) = o(1), and of course E |C ∗ △C K bo |] E[|C ∗ △C n K bo is defined by a threshold condition )K = o(1). It follows that → 0. The set C o( n−K K b similarly corresponds to using a data dependent threshold and tie breaking rule to arrive whereas C b b follows from at |C| ≡ K. By the same method used in the proof of Theorem 1, the conclusion for C bo . what was proved for C

6.4

Linear message passing on a random tree–Gaussian limits

In this section we apply the bounds derived in Section 6.2 and a version of the Berry-Esseen central limit theorem for compound Poisson sums to show the messages are asymptotically Gaussian. As in Section 6.2, the result allows the number of iterations to grow slowly Pwith n. Let αt = var(Z0t ) and βt = var(Z1t ). Using the usual fact var( X i=1 Yi ) = E [X] var(Y ) + var(X)E [Y ]2 for iid Y ’s, we find Kq βt + m Kp = αt + A2t + βt + m

αt+1 = αt + A2t + βt+1

Kq 2 B m t Kp 2 B m t

(95) (96)

with the initial conditions α0 = β0 = 0. Comparing the recursions (without using induction) shows n ≥ 1, and αt is nondecreasing in t. Thus that αt ≤ βt ≤ pq αt for t ≥ 0. Note that α1 = n−K 2

t) 1 ≤ αt ≤ βt for all t. Therefore, if λ < 1, the signal to noise ratio (Bt −A ≤ λt → 0 as t → ∞. αt Also, under the assumption K = o(n) and p/q = O(1), the coefficients in the recursions (95) and Kp (96) satisfy Kq m → 0 and m → 0 as n → ∞. Thus, αt → 1 and βt → 1 for t fixed as n → ∞. The following lemma proves that the distributions of Z0t and Z1t are asymptotically Gaussian.


Lemma 24. Fix $\lambda > 0$. As $n \to \infty$, suppose $\frac{K^2(p-q)^2}{(n-K)q} \equiv \lambda$, $K = o(n)$, $p/q = O(1)$, and $t$ varies with $n$ such that $t \in \mathbb{N}$ and the following holds: if $\lambda > 1$ then $\lambda^{t/2} \leq \left(\frac{n-K}{K}\right)^{\alpha}$, where $\alpha = 1/4$ (any $\alpha \in (0, 2/3)$ works), and if $\lambda \leq 1$: $t = O\left(\log \frac{n-K}{K}\right)$. Then as $n \to \infty$,

$$\sup_x \left| P\left\{ \frac{Z_0^t}{\sqrt{\alpha_t}} \leq x \right\} - \Phi(x) \right| \to 0 \tag{97}$$

$$\sup_x \left| P\left\{ \frac{Z_1^t - \lambda^{t/2}}{\sqrt{\beta_t}} \leq x \right\} - \Phi(x) \right| \to 0. \tag{98}$$

Proof. Select a constant $\gamma > 1$ as follows. If $\lambda > 1$, let $\gamma = \lambda$. If $\lambda \leq 1$, select $\gamma > 1$ so that $\gamma^{t/2} \leq \left(\frac{n-K}{K}\right)^{\alpha}$ for all $n$ sufficiently large, which is possible by the assumptions. Then no matter what the value of $\lambda$ is, $\gamma^{t/2} \leq \left(\frac{n-K}{K}\right)^{\alpha}$. Let $T$ be defined as in Lemma 23. Since $\gamma^{t/2} \leq \left(\frac{n-K}{K}\right)^{\alpha}$ it follows that $t \leq T$. For $t \geq 0$, $Z_0^{t+1}$ can be represented as follows:

$$Z_0^{t+1} = -\frac{Kq\lambda^{t/2}}{\sqrt{(n-K)q}} + \frac{1}{\sqrt{(n-K)q}} \sum_{i=1}^{N_{nq}} X_i$$

where $N_{nq}$ has the $\mathrm{Pois}(nq)$ distribution, the random variables $X_i$, $i \geq 1$, are mutually independent and independent of $N_{nq}$, and the distribution of $X_i$ is a mixture of distributions: $\mathcal{L}(X_i) = \frac{(n-K)q}{nq}\mathcal{L}(Z_0^t) + \frac{Kq}{nq}\mathcal{L}(Z_1^t)$. Note that $E[X_1] = \frac{K}{n}\lambda^{t/2}$, $E[X_1^2] = \frac{(n-K)(\alpha_t + A_t^2) + K(\beta_t + \lambda^t)}{n}$, and $E[|X_1|^3] \leq \max\{E[|Z_0^t|^3], E[|Z_1^t|^3]\} \triangleq \rho_3$. Lemma 11 therefore implies

$$\sup_x \left| P\left\{ \frac{\sqrt{(n-K)q}\, Z_0^{t+1} + Kq\lambda^{t/2} - nqE[X_1]}{\sqrt{nqE[X_1^2]}} \leq x \right\} - \Phi(x) \right| \leq \frac{C\rho_3}{\sqrt{\left(nqE[X_1^2]\right)^3}}.$$

Using the facts $E[X_1^2] \geq 1$, $E[X_1] = \frac{K}{n}\lambda^{t/2}$, and $\frac{n}{n-K} E[X_1^2] = \alpha_{t+1}$, we obtain

$$\sup_x \left| P\left\{ \frac{Z_0^{t+1}}{\sqrt{\alpha_{t+1}}} \leq x \right\} - \Phi(x) \right| \leq \frac{C\rho_3}{\sqrt{nq}}.$$

Equation (91) implies that the interval of $\eta$ values satisfying the condition of Lemma 23 for $t \leq T$ converges to all of $\mathbb{R}$. In view of Lemma 23 and the fact $\gamma \geq \max\{\lambda, 1\}$, we have that for $n$ sufficiently large, $\psi_0^t(\pm\gamma^{-t/2}) \leq 1$ and $\psi_1^t(\pm\gamma^{-t/2}) \leq e^2$. Applying $e^x + e^{-x} \geq |x|^3/6$ with $x = Z_0^t/\gamma^{t/2}$ or $x = Z_1^t/\gamma^{t/2}$ yields:

$$E[|Z_0^t|^3] \leq 6\gamma^{3t/2}\left[\psi_0^t(\gamma^{-t/2}) + \psi_0^t(-\gamma^{-t/2})\right] \leq 12\gamma^{3t/2},$$

$$E[|Z_1^t|^3] \leq 6\gamma^{3t/2}\left[\psi_1^t(\gamma^{-t/2}) + \psi_1^t(-\gamma^{-t/2})\right] \leq 12e^2\gamma^{3t/2}.$$

Since $\lambda \leq \left(\frac{K}{n-K}\right)^2 nq \left(\frac{p}{q}\right)^2$, it follows that $\sqrt{nq} = \Omega(n/K)$. Hence,

$$\frac{\rho_3}{\sqrt{nq}} = O\left(\left(\frac{n-K}{K}\right)^{3\alpha} \frac{K}{n}\right) = O\left(\left(\frac{n-K}{K}\right)^{3\alpha - 1}\right) \to 0,$$

and (97) follows.

The proof of (98), given next, is similar. For $t \geq 0$, $Z_1^{t+1}$ can be represented as follows:

$$Z_1^{t+1} = -\frac{Kq\lambda^{t/2}}{\sqrt{(n-K)q}} + \frac{1}{\sqrt{(n-K)q}} \sum_{i=1}^{N_{(n-K)q+Kp}} Y_i$$

where $N_{(n-K)q+Kp}$ has the $\mathrm{Pois}((n-K)q + Kp)$ distribution, the random variables $Y_i$, $i \geq 1$, are mutually independent and independent of $N_{(n-K)q+Kp}$, and the distribution of $Y_i$ is a mixture of distributions: $\mathcal{L}(Y_i) = \frac{(n-K)q}{(n-K)q+Kp}\mathcal{L}(Z_0^t) + \frac{Kp}{(n-K)q+Kp}\mathcal{L}(Z_1^t)$. Note that $E[Y_1] = \frac{Kp}{(n-K)q+Kp}\lambda^{t/2}$, $E[Y_1^2] = \frac{(n-K)q\,\alpha_t + Kp\,\beta_t + Kp\,\lambda^t}{(n-K)q+Kp}$, and $E[|Y_1|^3] \leq \max\{E[|Z_0^t|^3], E[|Z_1^t|^3]\} = \rho_3$. Lemma 11 therefore implies

$$\sup_x \left| P\left\{ \frac{\sqrt{(n-K)q}\, Z_1^{t+1} + Kq\lambda^{t/2} - ((n-K)q+Kp)E[Y_1]}{\sqrt{((n-K)q+Kp)E[Y_1^2]}} \leq x \right\} - \Phi(x) \right| \leq \frac{C\rho_3}{\sqrt{\left(((n-K)q+Kp)E[Y_1^2]\right)^3}}.$$

Using the facts $E[Y_1^2] \geq 1$, $p > q$, and the expression above for $E[Y_1]$, we obtain

$$\sup_x \left| P\left\{ \frac{\sqrt{(n-K)q}\, Z_1^{t+1} - K(p-q)\lambda^{t/2}}{\sqrt{((n-K)q+Kp)E[Y_1^2]}} \leq x \right\} - \Phi(x) \right| \leq \frac{C\rho_3}{\sqrt{nq}}.$$

Dividing through by $\sqrt{(n-K)q}$ yields

$$\sup_x \left| P\left\{ \frac{Z_1^{t+1} - \lambda^{(t+1)/2}}{\sqrt{\frac{(n-K)q+Kp}{(n-K)q} E[Y_1^2]}} \leq x \right\} - \Phi(x) \right| \leq \frac{C\rho_3}{\sqrt{nq}}.$$

Since $\frac{(n-K)q+Kp}{(n-K)q} E[Y_1^2] = \beta_{t+1}$, (98) follows.
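The compound Poisson representations used in the proof are straightforward to simulate, which gives a quick numerical sanity check of the Gaussian limits (97) and (98). Below is a minimal sketch for drawing $Z_0^{t+1}$, assuming samplers for independent copies of $Z_0^t$ and $Z_1^t$ are available (the function and argument names are illustrative only):

```python
import numpy as np

def sample_Z0_next(n, K, q, lam, t, sample_Z0_t, sample_Z1_t, rng):
    """One draw of Z_0^{t+1} via the representation in the proof:
    a Pois(nq) number of summands, each an independent copy of Z_0^t
    with probability (n-K)/n and of Z_1^t with probability K/n."""
    N = rng.poisson(n * q)
    community = rng.random(N) < K / n
    X = np.where(community, sample_Z1_t(N, rng), sample_Z0_t(N, rng))
    scale = np.sqrt((n - K) * q)
    return -K * q * lam ** (t / 2) / scale + X.sum() / scale
```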

6.5 Converse for linear message passing algorithm

Two converse results for the linear message passing algorithm with a threshold decision rule are given in this section, showing that if $\lambda \leq 1$, then weak recovery is not achievable by the linear message passing algorithm followed by thresholding. The results and proofs here are quite similar to the converse results for the belief propagation algorithm in Section 5. The main differences are that the means here are $0$ and $\lambda^{t/2}$ instead of $\pm b_t/2$, and the variances here are unequal: $\alpha_t$ and $\beta_t$. However, since $\alpha_t \leq \beta_t \leq \frac{p}{q}\alpha_t$ and we assume $p/q = O(1)$, the same arguments go through. Finally, since the messages in the linear message passing algorithm do not correspond to log likelihood messages, the restriction on how quickly the number of iterations can grow with $n$ is stronger here. The first converse, stated next, shows that if $\lambda \leq 1$ then recovery in the sense of Definition 4 is not possible.

Theorem 6. (Converse for linear message passing algorithm, version A) Fix $\lambda$ with $0 < \lambda \leq 1$. Consider the community recovery model with parameters $K, p, q$ depending on $n$ with $|C^*| \equiv K$. Suppose $\frac{K^2(p-q)^2}{(n-K)q} \equiv \lambda$, $K = o(n)$, and $p/q = O(1)$. Suppose $t \in \mathbb{N}$, with $t$ possibly depending on $n$, such that $(np)^t = n^{o(1)}$ and $t = O\left(\log \frac{n-K}{K}\right)$. Let $(\theta_u^t : u \in [n])$ be computed using the message passing updates (77) and (78) and let $\widehat{C} = \{u : \theta_u^t \geq \gamma\}$ for some threshold $\gamma$, which may also depend on $n$. Equivalently, $\sigma_u$ is estimated for each $u$ by $\widehat{\sigma}_u = 1_{\{\theta_u^t \geq \gamma\}}$. Let $p_e = \pi_0 p_{e,0} + \pi_1 p_{e,1}$ for prior probabilities $\pi_0 = (n-K)/n$ and $\pi_1 = K/n$, where $p_{e,0} = P\{\widehat{\sigma}_u = 1 \mid \sigma_u = 0\}$ and $p_{e,1} = P\{\widehat{\sigma}_u = 0 \mid \sigma_u = 1\}$. Then $\liminf_{n\to\infty} \frac{n}{K} p_e \geq 1$.

Proof. Let $\tilde{p}_{e,0}$ and $\tilde{p}_{e,1}$ denote the corresponding conditional error probabilities for estimating the label of the root node for the tree graph of depth $t$ using $\widehat{\tau}_u = 1_{\{\xi_u^t \geq \gamma\}}$ for some threshold $\gamma$. Lemma 24 implies that, as $n \to \infty$, the conditional error probabilities satisfy, uniformly in the choice of $\gamma$,

$$\tilde{p}_{e,1} - Q\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) \to 0 \quad \text{and} \quad \tilde{p}_{e,0} - Q\left(\frac{\gamma}{\sqrt{\alpha_t}}\right) \to 0,$$

where $Q$ is the complementary CDF of the standard normal distribution. As indicated in Lemma 18, for a uniformly, randomly selected vertex $u \in C^*$ (respectively, $u \in [n]\setminus C^*$) the depth-$t$ neighborhood of $u$ in the labeled random graph couples to the labeled random tree of depth $t$ with root label $\tau_u = 1$ (respectively, $\tau_u = 0$) with coupling error probability less than or equal to $n^{-1+o(1)}$ as $n \to \infty$. Therefore, as $n \to \infty$, uniformly in $\gamma$,

$$p_{e,1} - Q\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) \to 0 \quad \text{and} \quad p_{e,0} - Q\left(\frac{\gamma}{\sqrt{\alpha_t}}\right) \to 0.$$

Since $\pi_0 \to 1$ as $n \to \infty$, it is necessary that $\frac{\gamma}{\sqrt{\alpha_t}} \to \infty$ to make $p_e \to 0$. However, since $1 \leq \beta_t \leq \alpha_t \frac{p}{q}$ for $t \geq 1$, it then follows that $\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}} \to -\infty$, resulting in $p_{e,1} \to 1$, and hence $\liminf_{n\to\infty} \frac{n}{K} p_e \geq 1$.

In the remainder of this section we establish a converse result for an estimator $\widehat{C}$ such that $|\widehat{C}| \equiv K$. It is most natural to let this estimator be obtained by selecting the $K$ vertices $u$ with the largest values of $\theta_u^t$. For convenience we restrict attention to the case $|C^*| = K$ in what follows.

Theorem 7. (Converse for linear message passing algorithm: impossibility of fractional recovery) Fix $\lambda$ with $0 < \lambda < 1$. Consider the community recovery model with parameters $K, p, q$ depending on $n$ with $|C^*| \equiv K$. Suppose $\frac{K^2(p-q)^2}{(n-K)q} \equiv \lambda$, $K = o(n)$, and $p/q = O(1)$. Suppose $t \in \mathbb{N}$, with $t$ possibly depending on $n$, such that $(np)^t = n^{o(1)}$ and $t = O\left(\log \frac{n-K}{K}\right)$. Let $(\theta_u^t : u \in [n])$ be computed using the message passing updates (77) and (78) and let $\widehat{C}$ denote the set of $K$ vertices $u$ with the largest values of $\theta_u^t$. Then $\frac{E[|\widehat{C} \cap C^*|]}{K} \to 0$.

The proof of the theorem is based on the following lemma.

Lemma 25. (Gaussian limit of empirical distribution) Fix $\lambda > 0$. Suppose $p, q, K$ and $t$ depend on $n$ such that as $n \to \infty$, $\frac{K^2(p-q)^2}{(n-K)q} \equiv \lambda$, $K = o(n)$, $p/q = O(1)$, and $t \in \mathbb{N}$ can vary with $n$ subject to $(np)^t = n^{o(1)}$ and the conditions of Lemma 24. Then the following limits hold in the sense of convergence in probability:

$$\sup_x \left| \frac{1}{K} \sum_{i \in C^*} 1_{\left\{\frac{\theta_i^t - \lambda^{t/2}}{\sqrt{\beta_t}} \leq x\right\}} - \Phi(x) \right| \to 0 \tag{99}$$

$$\sup_x \left| \frac{1}{n} \sum_{i \in [n]\setminus C^*} 1_{\left\{\frac{\theta_i^t}{\sqrt{\alpha_t}} \leq x\right\}} - \Phi(x) \right| \to 0 \tag{100}$$

Furthermore, the same limits hold if "$\leq x$" is changed to strict inequality, "$< x$," in the indicator functions.

Proof. We prove (99); the proof of (100) is similar. By Lemma 21 the conditions of the lemma imply $K = n^{1+o(1)}$. In particular, $K \to \infty$. Let $S_t(x) = \frac{1}{K} \sum_{i \in C^*} 1_{\left\{\frac{\theta_i^t - \lambda^{t/2}}{\sqrt{\beta_t}} \leq x\right\}}$. Lemma 24 and Lemma 18 with $m_0 = 0$ and $m_1 = 1$ imply that $E[S_t(x)] - \Phi(x) \to 0$ as $n \to \infty$, uniformly in $x$. Lemma 20 implies that $\mathrm{var}(S_t(x))$ converges to zero uniformly in $x$. Thus, by the Chebyshev inequality, $S_t(x) \to \Phi(x)$ in probability as $n \to \infty$ for each $x$ fixed. Since $S_t(x)$ and $\Phi(x)$ are both nondecreasing in $x$ with values in $[0,1]$, and $\Phi$ is continuous, for any $\epsilon > 0$ there is a finite set of $x$ values depending only on $\epsilon$ (for example, we could take the $x$ values to be such that $\Phi(x_j) = j\epsilon$ for integers $j$ with $1 \leq j < 1/\epsilon$) such that if $|S_t(x) - \Phi(x)| \leq \epsilon$ at each of those $x$ values, then $|S_t(x) - \Phi(x)| \leq 2\epsilon$ for all $x$ and $|S_t(x-) - \Phi(x)| \leq 2\epsilon$, where $S_t(x-)$ is the left limit of $S_t$ at $x$. The conclusion, (99), with either "$\leq x$" as written or with "$< x$," follows.

Proof of Theorem 7. Let $N^{1\to 0} = |C^* \setminus \widehat{C}|$. Let $\gamma = \min\{\theta_i^t : i \in \widehat{C}\}$. The estimator $\widehat{C}$ can be viewed as a threshold based estimator for the data-dependent threshold $\gamma$ and a tie-breaking rule. No matter how the ties are broken,

$$|\{i \in C^* : \theta_i^t < \gamma\}| \leq N^{1\to 0} \leq |\{i \in C^* : \theta_i^t \leq \gamma\}|.$$

Therefore, Lemma 25 with $x = \frac{\gamma - \lambda^{t/2}}{\sqrt{\beta_t}}$ in (99) implies that the random threshold $\gamma$ is such that

$$N^{1\to 0} = KQ\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) + K o_p(1), \tag{101}$$

where $o_p(1)$ represents a random variable such that for any $\epsilon > 0$, $P\{|o_p(1)| \geq \epsilon\} \to 0$ as $n \to \infty$, and $Q(u) = \Phi(-u)$. Similarly, let $N^{0\to 1} = |\widehat{C} \setminus C^*|$. Then

$$|\{i \in [n]\setminus C^* : \theta_i^t > \gamma\}| \leq N^{0\to 1} \leq |\{i \in [n]\setminus C^* : \theta_i^t \geq \gamma\}|,$$

and Lemma 25 with $x = \frac{\gamma}{\sqrt{\alpha_t}}$ in (100) yields (using $Q(u) = 1 - \Phi(u)$):

$$N^{0\to 1} = (n-K) Q\left(\frac{\gamma}{\sqrt{\alpha_t}}\right) + (n-K) o_p(1). \tag{102}$$

Since $|C^*| = |\widehat{C}|$, it follows that $N^{1\to 0} = N^{0\to 1}$. Thus, (101), (102) and the assumption $K = o(n)$ imply that

$$KQ\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) = nQ\left(\frac{\gamma}{\sqrt{\alpha_t}}\right) + n o_p(1).$$

Assume $p/q \leq c$ for a fixed constant $c$. Let $\epsilon > 0$ be arbitrary. Select $\delta > 0$ so small that $\delta \leq \epsilon$ and such that any $x$ satisfying $Q(x) \leq 2\delta$ must also satisfy $Q(1 - x/c) \geq 1 - \epsilon$. Select $n$ so large that $P\{|o_p(1)| > \delta\} \leq \epsilon$. The random threshold $\gamma$ is such that, with probability at least $1 - \epsilon$,

$$\left| KQ\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) - nQ\left(\frac{\gamma}{\sqrt{\alpha_t}}\right) \right| \leq n\delta. \tag{103}$$

Suppose for the moment that $\gamma$ satisfies (103). Since $K = o(n)$, it follows that for $n$ sufficiently large, $Q\left(\frac{\gamma}{\sqrt{\alpha_t}}\right) \leq 2\delta$. Then, since $1 \leq \beta_t \leq c\alpha_t$ and $\lambda^{t/2} \leq 1$,

$$Q\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) \geq Q\left(1 - \frac{\gamma}{\sqrt{\beta_t}}\right) \geq Q\left(1 - \frac{\gamma}{c\sqrt{\alpha_t}}\right) \geq 1 - \epsilon,$$

where the last inequality follows by the choice of $\delta$. In summary, if $n$ is sufficiently large, then with probability at least $1 - \epsilon$, $Q\left(\frac{\lambda^{t/2} - \gamma}{\sqrt{\beta_t}}\right) \geq 1 - \epsilon$. So, in view of (101), $P\{N^{1\to 0} \geq K(1 - 2\epsilon)\} \geq 1 - 2\epsilon$ for $n$ sufficiently large. Hence $\frac{E[|\widehat{C} \cap C^*|]}{K} = \frac{K - E[N^{1\to 0}]}{K} \leq 1 - (1 - 2\epsilon)^2 \leq 4\epsilon$. Since $\epsilon$ is arbitrary, the conclusion follows.

Appendices

A Degree-thresholding when K ≍ n

A simple algorithm for recovering $C^*$ is degree-thresholding. Specifically, let $d_i$ denote the degree of vertex $i$. Then $d_i$ is distributed as the sum of two independent random variables, with distributions $\mathrm{Binom}(K-1, p)$ and $\mathrm{Binom}(n-K, q)$, respectively, if $i \in C^*$, while $d_i \sim \mathrm{Binom}(n, q)$ if $i \notin C^*$. The mean degree difference between these two distributions is $K(p-q)$, and the degree variance is $O(nq)$. Assume $p/q$ is bounded and $p$ is bounded away from one. It follows from the Chernoff bound that $|d_i - E[d_i]| \geq K(p-q)/2$ with probability at most $e^{-\Omega(K^2(p-q)^2/(nq))}$. Let $\widehat{C}$ be the set of vertices with degrees larger than $nq + K(p-q)/2$, and thus $E[|\widehat{C} \triangle C^*|] = n e^{-\Omega(K^2(p-q)^2/(nq))}$. Hence, if $K^2(p-q)^2/(nq) = \omega(\log \frac{n}{K})$, then $E[|\widehat{C} \triangle C^*|] = o(K)$, i.e., weak recovery is achieved. Note that $d(p\|q) \asymp (p-q)^2/q$ under the assumption that $p/q$ is bounded and $p$ is bounded away from one. In the regime $K \asymp n - K \asymp n$, the necessary and sufficient condition for the existence of estimators providing weak recovery, $Kd(p\|q) \to \infty$, is therefore equivalent to $K^2(p-q)^2/(nq) \to \infty$. Thus, degree-thresholding provides weak recovery in this regime whenever it is information theoretically possible. Under the additional condition (2), an algorithm attaining exact recovery can be built using degree-thresholding for weak recovery followed by a linear time voting procedure, as in Algorithm 2 (see [19, Theorem 7] and its proof). In the regime $\frac{n}{K}\log\frac{n}{K} = o(\log n)$, or equivalently $K = \omega(n \log\log n/\log n)$, the information-theoretic sufficient conditions for exact recovery given by (5) and (2) imply that $K^2(p-q)^2/(nq) = \omega(\log \frac{n}{K})$, and hence in this regime degree-thresholding attains exact recovery whenever it is information theoretically possible.
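A minimal sketch of the degree-thresholding rule just described follows; a dense adjacency matrix is used only for clarity of illustration (for an $O(|E|)$ implementation one would accumulate degrees from an adjacency list), and the function name is illustrative:

```python
import numpy as np

def degree_threshold(adj, K, p, q):
    """Return the set of vertices whose degree exceeds
    n*q + K*(p - q)/2, i.e., the estimator analyzed above.

    adj: n x n symmetric 0/1 adjacency matrix (numpy array).
    """
    n = adj.shape[0]
    degrees = adj.sum(axis=1)
    return set(np.flatnonzero(degrees > n * q + K * (p - q) / 2.0).tolist())
```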

B Three weaker definitions of recovery

We list three notions of recovery that are weaker than weak recovery as defined in Definition 2. Given an estimator $\widehat{C}$, let $p_e \triangleq \pi_0 p_{e,0} + \pi_1 p_{e,1}$, where $p_{e,0}$ and $p_{e,1}$ are the probabilities of misclassifying a vertex outside (Type-I) or inside (Type-II) the community $C^*$, respectively, and $(\pi_0, \pi_1) = \left(\frac{n-K}{n}, \frac{K}{n}\right)$. These definitions are used in this paper for converse results only; we show that under certain conditions the estimators are unable to achieve recovery even in these weak senses.

Definition 3. An estimator is said to achieve weak recovery in the sense of Montanari [31] if $\lim_{n\to\infty} p_{e,0} + p_{e,1} = 0$.

Remark 11. Definition 3 is equivalent to Definition 2 in the case $K/n$ is bounded away from 0 and away from 1. If $K = o(n)$, or equivalently, $\nu \to \infty$, then even if $p_{e,0} + p_{e,1} \to 0$, the mean number of misclassified vertices, $E[|\widehat{C} \triangle C^*|] = (n-K)p_{e,0} + K p_{e,1}$, could be much larger than $K$.

The two notions of recovery given below are intended for the case $K = o(n)$; they each give a sense in which the estimator performs asymptotically better than an estimator that disregards the information in the graph. If $C^*$ is uniformly distributed over $\{C \subset [n] : |C| = K\}$, among all estimators that disregard the information in the graph, the one that minimizes the mean number of classification errors is $\widehat{C} \equiv \emptyset$. It achieves $\frac{E[|\widehat{C} \triangle C^*|]}{K} = 1$, or equivalently, $p_e = K/n$. The following definition requires an estimator to have better performance.

Definition 4. An estimator $\widehat{C}$ is said to achieve better than guessing recovery for arbitrary $|\widehat{C}|$ if $\limsup_{n\to\infty} \frac{E[|\widehat{C} \triangle C^*|]}{K} < 1$, or equivalently, $\limsup_{n\to\infty} \frac{n}{K} p_e < 1$.

Suppose now we require $|\widehat{C}| \equiv K$. This equalizes the number of Type-I errors and the number of Type-II errors, so that $(n-K)p_{e,0} = Kp_{e,1}$. Equivalently, the two terms $\pi_0 p_{e,0}$ and $\pi_1 p_{e,1}$ in the definition of $p_e$ are equal. If $C^*$ is uniformly distributed over $\{C \subset [n] : |C| = K\}$, any estimator $\widehat{C}$ with $|\widehat{C}| \equiv K$ that disregards the information in the graph yields $\frac{E[|\widehat{C} \cap C^*|]}{K} = \frac{K}{n} \to 0$, or $\frac{n}{K} p_e = 2(1 - K/n) \to 2$.

Definition 5. An estimator $\widehat{C}$ is said to achieve fractional recovery with $|\widehat{C}| \equiv K$ if $|\widehat{C}| \equiv K$ and $\liminf_{n\to\infty} \frac{E[|\widehat{C} \cap C^*|]}{K} > 0$, or equivalently, $\limsup_{n\to\infty} \frac{n}{K} p_e < 2$.

Proposition 1. If there is an estimator $\widehat{C}$ achieving recovery according to Definition 4, then there exists a modified version $\widehat{C}'$ achieving recovery according to Definition 5.

Proof. On the event $|\widehat{C}| < K$, choose a set $I$ of $K - |\widehat{C}|$ indices from $[n]\setminus\widehat{C}$ uniformly at random (or arbitrarily) and let $\widehat{C}' = \widehat{C} \cup I$. On the event $|\widehat{C}| > K$, choose a set $J$ of $|\widehat{C}| - K$ indices from $\widehat{C}$ uniformly at random (or arbitrarily) and let $\widehat{C}' = \widehat{C}\setminus J$. By construction, $|\widehat{C}'| = K$ with probability one, and

$$|\widehat{C}' \triangle C^*| \leq |\widehat{C} \triangle C^*| + \big||\widehat{C}| - K\big|.$$

Moreover,

$$\big||\widehat{C}| - K\big| = \big||\widehat{C}| - |C^*|\big| \leq |\widehat{C} \triangle C^*|.$$

Combining the last two displayed equations gives $|\widehat{C}' \triangle C^*| \leq 2|\widehat{C} \triangle C^*|$. Thus, $|\widehat{C}' \cap C^*| = K - \frac{1}{2}|\widehat{C}' \triangle C^*| \geq K - |\widehat{C} \triangle C^*|$, from which the conclusion of the proposition follows directly from the definitions.
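The modification in the proof is constructive; a minimal sketch (vertex labels are taken to be range(n), and the function name is illustrative):

```python
import random

def resize_to_K(C_hat, n, K, rng=random):
    """Pad or trim C_hat uniformly at random so the result has exactly
    K elements, as in the proof of Proposition 1."""
    C = set(C_hat)
    if len(C) < K:
        C |= set(rng.sample(sorted(set(range(n)) - C), K - len(C)))
    elif len(C) > K:
        C -= set(rng.sample(sorted(C), len(C) - K))
    return C
```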

Remark 12. To achieve recovery under Definition 4 or Definition 5, the estimator must satisfy $p_{e,0} = O(K/n)$, while $p_{e,1}$ need not even converge to zero. Thus Definitions 4 and 5 have stronger requirements on $p_{e,0}$, but weaker requirements on $p_{e,1}$, than Definition 3.

C Why a constant number of iterations is not enough

The reader might be wondering why we analyzed the belief propagation algorithm for a slowly growing number of iterations in Section 4, whereas in the case studied by Deshpande and Montanari [12] a finite number of iterations is used. We focus on the regime $K = \frac{\rho n}{\log n}$, $p = \frac{a \log^2 n}{n}$, and $q = \frac{b \log^2 n}{n}$. With $\lambda = \frac{K^2(p-q)^2}{(n-K)q} = \frac{\rho^2(a-b)^2}{b}$, we know that if $\lambda e > 1$ then, as shown in Lemma 13, $v_t$ converges to infinity and thus $b_t$ converges to infinity as $t \to \infty$ after $n \to \infty$. Thus, according to Lemma 12, we can guarantee that both the Type-I and Type-II error probabilities for classifying a given vertex are smaller than any given $\epsilon > 0$. That is the best we can do with a constant number


of iterations based on the Gaussian limit analysis. However, since we have $K = o(n)$, there will be many vertices not in $C^*$ classified as being in $C^*$, so we need to run a cleanup procedure. The first part of the cleanup procedure in Deshpande and Montanari [12] is similar to what we need, but it is set up for sub-Gaussian matrix entries. The main idea of the procedure of [12] is to show that the spectral method does weak recovery with very high probability (error probability $e^{-cn}$ for some $c > 0$), so a union bound over all possible dirty sets can be applied. Roughly speaking, by running the iteration for a finite number of iterations and using a threshold rule, we can reject a large (but constant) fraction of the vertices in $[n]\setminus C^*$ while retaining most of the vertices in $C^*$. This effectively boosts $\lambda$ to a large but finite constant. While the spectral method can offer recovery for a large but finite value of $\lambda$, in this application the error probability needs to be exponentially small in $n$ in order to apply the union bound.

So let us consider the spectral method for a large but finite $\lambda$. The mean of the adjacency matrix is given by (assuming, without loss of generality, the community corresponds to the first $K$ vertices)

$$E[A] = (p-q)Z^* + qJ - \begin{pmatrix} pI_{K\times K} & 0 \\ 0 & qI_{(n-K)\times(n-K)} \end{pmatrix},$$

where $Z^*_{ij} = 1$ for $i, j \in [K]$ and $0$ elsewhere. This suggests that we apply the spectral method to

$$A - qJ = (p-q)Z^* - \begin{pmatrix} pI_{K\times K} & 0 \\ 0 & qI_{(n-K)\times(n-K)} \end{pmatrix} + \tilde{A}, \tag{104}$$

where $\tilde{A} = A - E[A]$ has mean zero. The last two terms on the right of (104) can be viewed as noise, with only the term $\tilde{A}$ being significant noise. The top eigenvalue of $E[A]$ is about $(p-q)K$, which, in the regime we are interested in, has size $\rho(a-b)\log n = \sqrt{\lambda b}\log n$. Thus, we would like the spectral norm of $\tilde{A}$ to be less than $\delta \log n$ with high probability for some constant $\delta \ll \sqrt{\lambda b}$. Notice that $E[\|\tilde{A}\|] \leq C\sqrt{nq} = C\sqrt{b}\log n$ for some absolute constant $C$. If we apply Talagrand's inequality (see, e.g., [39, Theorem 2.1.13]) to $\tilde{A}$, we obtain for any constant $\epsilon$,

$$P\{\|\tilde{A}\| \geq E[\|\tilde{A}\|](1+\epsilon)\} \leq \exp(-\Omega(\log^2 n)). \tag{105}$$

However, as discussed above, to use the cleanup trick of [12] and the union bound, we need the right-hand side of (105) to be $\exp(-\Omega(n))$. This turns out to be impossible, as the following lemma shows that (105) is tight.

Lemma 26. Suppose $p > q$ such that as $n \to \infty$, $q \to 0$ and $nq \to \infty$. Then for any $c > 1$ there exists $c' > 0$ such that for all $n$ sufficiently large,

$$P\{\|\tilde{A}\| \geq c\sqrt{nq}\} \geq e^{-c'nq}.$$

Proof. Let $\tilde{A}_n$ denote the last column of $\tilde{A}$. Let $S_n = \sum_{i=1}^{n-1} A_{in} + U$, where $U$ has the $\mathrm{Bern}(q)$ distribution and is independent of $A$, so $S_n$ is distributed according to $\mathrm{Binom}(n, q)$. Then

$$\|\tilde{A}\|^2 \geq \|\tilde{A}_n\|_2^2 = \sum_{i=1}^{n-1} (A_{in} - q)^2 = (S_n - U)(1 - 2q) + (n-1)q^2 \geq S_n(1 - 2q) - 1.$$

Let $m = \lceil \frac{c^2 nq}{1-2q} \rceil + 1$, which lies in $[n]$ for all sufficiently large $n$. Then

$$P\{\|\tilde{A}\| \geq c\sqrt{nq}\} \geq P\{(1-2q)S_n \geq c^2 nq + 1\} \geq P\{S_n = m\} \geq \binom{n}{m} q^m (1-q)^n \geq \left(\frac{nq}{m}\right)^m e^{-2qn} \geq e^{-c'nq}$$

for all sufficiently large $n$ with $c' = 3c^2 \log c + 2$, where we used $\binom{n}{k} \geq (n/k)^k$ for $1 \leq k \leq n$, $1 - q \geq e^{-2q}$ for $0 \leq q \leq 1/2$, and $m \leq nq(c^2 + \epsilon)$ for large enough $n$, where $\epsilon > 0$ is so small that $(c^2 + \epsilon)\log(c^2 + \epsilon) \leq 3c^2 \log c$.

Note that the factor $q$ in the exponent means that $P\{\|\tilde{A}\| \geq c\sqrt{np}\}$ does not go to zero at a rate exponential in $n$. To circumvent this problem we analyzed running the belief propagation algorithm for a number of iterations growing slowly with $n$, showing that weak recovery can be achieved directly.
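The $\sqrt{nq}$ scaling of $\|\tilde{A}\|$ is easy to observe numerically; the following is a small sketch using a plain $G(n,q)$ graph with no planted community, since only the noise term $\tilde{A}$ is at issue (the function name and suggested parameters are illustrative):

```python
import numpy as np

def centered_adjacency_norm(n, q, rng):
    """Sample G(n, q), center the adjacency matrix by its mean
    q*(J - I), and return the spectral norm ||A - E[A]||."""
    upper = np.triu(rng.random((n, n)) < q, k=1)
    A = (upper + upper.T).astype(float)
    return np.linalg.norm(A - q * (np.ones((n, n)) - np.eye(n)), 2)

# e.g., compare the average of centered_adjacency_norm(2000, 0.01, rng)
# over several draws with np.sqrt(2000 * 0.01).
```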

References

[1] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Structures and Algorithms, 13(3-4):457–466, 1998.
[2] B. P. Ames and S. A. Vavasis. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.
[3] E. Arias-Castro and N. Verzelen. Community detection in dense random networks. Ann. Statist., 42(3):940–969, 2014.
[4] A. S. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. arXiv:1408.6185, 2014.
[5] C. Bordenave, M. Lelarge, and L. Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. arXiv:1501.06087, January 2015.
[6] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. In Proceedings of ICML 2014 (also arXiv:1402.1267), Feb. 2014.
[7] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Comb. Probab. Comput., 19(2):227–284, 2010.
[8] A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms, 18(2):116–140, Mar. 2001.
[9] C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[10] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, 2011.
[11] Y. Dekel, O. Gurel-Gurevich, and Y. Peres. Finding hidden cliques in linear time with high probability. Combinatorics, Probability and Computing, 23(01):29–49, 2014.
[12] Y. Deshpande and A. Montanari. Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time. Foundations of Computational Mathematics, 15(4):1069–1128, August 2015.
[13] Y. Deshpande and A. Montanari. Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems. In Proceedings of COLT 2015, pages 523–562, June 2015.
[14] J. L. Doob. Stochastic Processes, volume 101. New York Wiley, 1953.
[15] U. Feige and E. Ofek. Spectral techniques applied to sparse random graphs. Random Struct. Algorithms, 27(2):251–275, Sept. 2005.
[16] U. Feige and D. Ron. Finding hidden cliques in linear time. In Proceedings of DMTCS, pages 189–204, 2010.
[17] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv:1412.6156, Nov. 2014.
[18] B. Hajek, Y. Wu, and J. Xu. Computational lower bounds for community detection on random graphs. In Proceedings of COLT 2015, June 2015.
[19] B. Hajek, Y. Wu, and J. Xu. Information limits for recovering a hidden community. arXiv:1509.07859, September 2015.
[20] B. Hajek, Y. Wu, and J. Xu. Submatrix localization via message passing. arXiv, October 2015.
[21] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[22] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
[23] M. Jerrum. Large cliques elude the Metropolis process. Random Structures & Algorithms, 3(4):347–359, 1992.
[24] H. Kobayashi and J. Thomas. Distance measures and related criteria. In Proc. 5th Allerton Conf. Circuit and System Theory, pages 491–500, Monticello, Illinois, 1967.
[25] V. Korolev and I. Shevtsova. An improvement of the Berry–Esseen inequality with applications to Poisson and mixed Poisson random sums. Scandinavian Actuarial Journal, 2012(2):81–105, 2012.
[26] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.
[27] M. Lelarge, L. Massoulié, and J. Xu. Reconstruction in the labeled stochastic block model. In IEEE Information Theory Workshop (ITW), pages 1–5, 2013.
[28] F. McSherry. Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science, pages 529–537, Oct. 2001.
[29] R. Meka, A. Potechin, and A. Wigderson. Sum-of-squares lower bounds for planted clique. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 87–96, New York, NY, USA, 2015. ACM.
[30] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005.
[31] A. Montanari. Finding one community in a sparse random graph. arXiv:1502.05680, Feb. 2015.
[32] A. Montanari, D. Reichman, and O. Zeitouni. On the limitation of spectral methods: From the Gaussian hidden clique problem to rank one perturbations of Gaussian tensors. arXiv:1411.6149, Nov. 2014.
[33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. arXiv:1311.4115, 2013.
[34] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 69–75, New York, NY, USA, 2015. ACM.
[35] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2015.
[36] E. Mossel, J. Neeman, and A. Sly. Belief propagation, robust reconstruction, and optimal recovery of block models (extended abstract). In JMLR Workshop and Conference Proceedings (COLT proceedings), volume 35, pages 1–35, 2014.
[37] E. Mossel and J. Xu. Density evolution in the degree-correlated stochastic block model. arXiv:1509.03281, 2015.
[38] E. Mossel and J. Xu. Local algorithms for block models with side information. arXiv:1508.02344, 2015.
[39] T. Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Soc., 2012.
[40] S.-Y. Yun and A. Proutiere. Accurate community detection in the stochastic block model via spectral algorithms. arXiv:1412.7335, 2014.
