Information Limits for Recovering a Hidden Community
Bruce Hajek, Yihong Wu, Jiaming Xu∗
September 28, 2015
Abstract
We study the problem of recovering a hidden community of cardinality K from an n × n symmetric data matrix A, where for distinct indices i, j, Aij ∼ P if i, j are both in the community and Aij ∼ Q otherwise, for two known probability distributions P and Q. If P = Bern(p) and Q = Bern(q) with p > q, it reduces to the problem of finding a densely-connected K-subgraph planted in a large Erdős-Rényi graph; if P = N(µ, 1) and Q = N(0, 1) with µ > 0, it corresponds to the problem of locating a K × K principal submatrix of elevated means in a large Gaussian random matrix. We focus on two types of asymptotic recovery guarantees as n → ∞: (1) weak recovery: expected number of classification errors is o(K); (2) exact recovery: probability of classifying all indices correctly converges to one. We derive a set of sufficient conditions and a nearly matching set of necessary conditions for recovery, for the general model under mild assumptions on P and Q, where the community size can scale sublinearly with n. For the Bernoulli and Gaussian cases, the general results lead to necessary and sufficient recovery conditions which are asymptotically tight with sharp constants. An important algorithmic implication is that, whenever exact recovery is information theoretically possible, any algorithm that provides weak recovery when the community size is concentrated near K can be upgraded to achieve exact recovery in linear additional time by a simple voting procedure.
1 Introduction
Many modern datasets can be represented as networks with vertices denoting the objects and edges (sometimes weighted) encoding their pairwise interactions. An interesting problem is to identify a group of vertices with atypical interactions. In social network analysis, this group can be interpreted as a community with higher edge connectivities than the rest of the network; in microarray experiments, this group may correspond to a set of differentially expressed genes. To study this problem, we investigate the following probabilistic model considered in [15].

Definition 1 (Hidden Community Model). Let C* be drawn uniformly at random from all subsets of [n] of cardinality K. Given probability measures P and Q on a common measurable space, let A be an n × n symmetric matrix with zero diagonal where for all 1 ≤ i < j ≤ n, Aij are mutually independent, and Aij ∼ P if i, j ∈ C* and Aij ∼ Q otherwise.

In this paper we assume that we only have access to pairwise information Aij for distinct vertices i and j whose distribution is either P or Q depending on the community membership; no direct observation about the individual vertices is available (hence the zero diagonal of A). Arising in many applications, two particularly important choices of P and Q are as follows:
∗ B. Hajek and Y. Wu are with the Department of ECE and Coordinated Science Lab, University of Illinois at Urbana-Champaign, Urbana, IL, {b-hajek,yihongwu}@illinois.edu. J. Xu is with Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, [email protected].
• Bernoulli case: P = Bern(p) and Q = Bern(q) with 0 ≤ q < p ≤ 1. This coincides with the planted dense subgraph model studied in [28, 7, 12, 18, 29], which is also a special case of the general stochastic block model [23] with a single community. In this case, the data matrix A corresponds to the adjacency matrix of a graph, where two vertices are connected with probability p if both belong to the community C*, and with probability q otherwise. Since p > q, the subgraph induced by C* is likely to be denser than the rest of the graph.
• Gaussian case: P = N(µ, 1) and Q = N(0, 1) with µ > 0. This corresponds to a symmetric version of the submatrix localization problem studied in [32, 26, 10, 9, 27, 12, 11]. In this case, the submatrix of A with row and column indices in C* has a positive mean µ except on the diagonal, while the rest of A has zero mean.

Given the data matrix A, the problem of interest is to accurately recover the underlying community C*. The distributions P and Q as well as the community size K depend on the matrix size n in general. For simplicity we assume that these model parameters are known to the estimator. The only assumption on the community size K we impose is that K/n is bounded away from one. Of particular interest is the case of K = o(n) where the community size grows sublinearly. We focus on the following two types of recovery guarantees. Let ξ ∈ {0, 1}^n denote the indicator of the community such that supp(ξ) = C*. Let ξ̂ = ξ̂(A) ∈ {0, 1}^n be an estimator.
Definition 2 (Exact Recovery). Estimator ξ̂ exactly recovers ξ if, as n → ∞, P[ξ ≠ ξ̂] → 0, where the probability is with respect to the randomness of ξ and A.

Definition 3 (Weak Recovery). Estimator ξ̂ weakly recovers ξ if, as n → ∞, d(ξ, ξ̂)/K → 0 in probability, where d denotes the Hamming distance.
The existence of an estimator satisfying Definition 3 is equivalent to the existence of an estimator ξ̂ such that E[d(ξ, ξ̂)] = o(K) (see Appendix A for a proof). Clearly, any estimator achieving exact recovery also achieves weak recovery; for bounded K, exact and weak recovery are equivalent. Intuitively, for a fixed network size n, as the community size K decreases, or the distributions P and Q get closer together, the recovery problem becomes harder. In this paper, we aim to address the following question: From an information-theoretic perspective, computational considerations aside, what are the fundamental limits of recovering the community? Specifically, we derive sharp necessary and sufficient conditions in terms of the model parameters under which the community can be exactly or weakly recovered. These results serve as benchmarks for evaluating practical algorithms and aid us in understanding the performance limits of polynomial-time algorithms. Furthermore, we aim to understand the relationship between exact and weak recovery guarantees from both statistical and algorithmic perspectives.
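For concreteness, the following Python sketch samples from the model of Definition 1 in both the Bernoulli and Gaussian cases. This is our own illustration using numpy; the function name and the numerical parameter values are not from the paper.

import numpy as np

def sample_hidden_community(n, K, draw_P, draw_Q, rng):
    # Draw C* uniformly among K-subsets of [n], then fill a symmetric matrix with
    # zero diagonal: A_ij ~ P if both i, j are in C*, and A_ij ~ Q otherwise.
    C_star = rng.choice(n, size=K, replace=False)
    in_C = np.zeros(n, dtype=bool)
    in_C[C_star] = True
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = draw_P(rng) if in_C[i] and in_C[j] else draw_Q(rng)
            A[j, i] = A[i, j]
    return A, set(C_star.tolist())

rng = np.random.default_rng(0)
# Bernoulli case (planted dense subgraph): P = Bern(0.5), Q = Bern(0.1).
A_bern, C_bern = sample_hidden_community(
    200, 30, lambda r: r.binomial(1, 0.5), lambda r: r.binomial(1, 0.1), rng)
# Gaussian case (submatrix of elevated means): P = N(2, 1), Q = N(0, 1).
A_gauss, C_gauss = sample_hidden_community(
    200, 30, lambda r: r.normal(2.0, 1.0), lambda r: r.normal(0.0, 1.0), rng)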
1.1 Related work
Under the mild assumption that K/n is bounded away from 1, previous work has determined the information limits for exact recovery up to universal constant factors. For the Bernoulli case, it is shown in [12] that if Kd(q‖p) − c log K → ∞ and Kd(p‖q) ≥ c log n for some large constant c > 0, then exact recovery is achievable via the maximum likelihood estimator (MLE); conversely, if Kd(q‖p) ≤ c′ log K and Kd(p‖q) ≤ c′ log n for some small constant c′ > 0, then exact recovery is impossible for any algorithm. Similarly, for the Gaussian case, it is proved in [26] that if Kµ² ≥ c log n, then exact recovery is achievable via the MLE; conversely, if Kµ² ≤ c′ log n, exact recovery is impossible for any algorithm. To the best of our knowledge, there are only a few special cases where the information limits with sharp constants are known:

• Bernoulli case with p = 1 and q = 1/2: This is widely known as the planted clique problem [24]. If K ≥ 2(1 + ǫ) log₂ n for any ǫ > 0, exact recovery is achievable via the MLE; if K ≤ 2(1 − ǫ) log₂ n, then exact recovery is impossible. Despite an extensive research effort, polynomial-time algorithms are only known to achieve exact recovery for K ≥ c√n for any constant c > 0 [3, 16, 14, 6, 15].
• Bernoulli case with p = a log n/n and q = b log n/n for fixed a, b and K = ρn for a fixed constant 0 < ρ < 1: The recent work [17] finds an explicit threshold ρ∗(a, b), such that if ρ > ρ∗(a, b), exact recovery is achievable in polynomial time via semi-definite relaxations of the MLE with probability tending to one; if ρ < ρ∗(a, b), any estimator fails to exactly recover the cluster with probability tending to one regardless of the computational costs. This conclusion is in sharp contrast to the computational barriers observed in the planted clique problem.
• The paper of Butucea et al. [9] gives sharp results for a Gaussian submatrix recovery problem similar to the one considered here – see Remark 6 for details.

Under the mild assumption that K/n is bounded away from 1, and p/q is bounded and p is bounded away from 1 in the Bernoulli case, this paper identifies the information limits with sharp constants for both weak recovery and exact recovery as well as the following algorithmic connection: If exact recovery is information-theoretically possible and there is an algorithm for weak recovery, then in linear additional time we can obtain exact recovery based on the weak recovery algorithm. This suggests that if the information limit of weak recovery can be achieved in polynomial time, then so can that of exact recovery; conversely, if there exists a computational barrier that separates the information limit and the performance of polynomial-time algorithms for exact recovery, then weak recovery also suffers from such a barrier. To establish the connection, we apply a two-step procedure: the first step uses an estimator capable of weak recovery, even in the presence of a slight mismatch between |C*| and K, such as the ML estimator (see Lemma 2); the second step cleans up the residual errors through a local voting procedure for each vertex. In order to ensure the first and second steps are independent, we use a method which we call successive withholding: the set of vertices is randomly partitioned into a finite number of subsets; one at a time, one subset is withheld to produce a reduced set of vertices, an estimation algorithm is run on the reduced set of vertices, and the resulting estimate is used to classify the vertices in the withheld subset.

1 The previously studied submatrix localization model (also known as noisy biclustering) deals with submatrices whose row and column supports need not coincide and the noise matrix is asymmetric, consisting of iid entries throughout. Here we focus on locating principal submatrices contaminated by a symmetric noise matrix. Additionally, we assume that the diagonal does not carry any information. If instead we assume a nonzero diagonal with Aii ∼ N(µ, 1) if i ∈ C* and Aii ∼ N(0, 1) if i ∉ C*, the results in this paper carry over with minor modifications explained in Remark 11.
2 Exact and weak recovery are called strong consistency and weak consistency in [30], respectively.
The idea is to gain independence: the outcome of estimation based on the reduced set of vertices is independent of the data corresponding to edges between the withheld vertices and the reduced set of vertices, and the withheld subset is sufficiently small so that we can still obtain sharp constants. This method is mentioned in [13], and variations of it have been used in [13], [31], and [30]. While this paper focuses on information-theoretic limits, it complements other work investigating computationally efficient recovery procedures, such as convex relaxations [4, 5, 12, 17, 21], spectral methods [28], and message-passing algorithms [15, 29, 19, 20]. In particular, we mention that if K = ω(n/ log n), whenever information-theoretically possible, exact recovery can be
achieved in polynomial time using either semi-definite programming [21] or belief propagation plus cleanup [20, 19]; if K = Θ(n), a linear-time degree-thresholding algorithm achieves the information limit of weak recovery (see [20, Appendix A] and [19, Appendix A]). It is an open problem whether any polynomial-time algorithm can achieve the respective information limit of weak recovery for K = o(n), or exact recovery for K = O(n/ log n).

The related work [29] studies weak recovery in the sparse regime of p = a/n, q = b/n, and K = κn. In the iterated limit where first n → ∞, and then κ → 0 and a, b → ∞, with λ = κ(a − b)²/((1 − κ)b) fixed, it is shown that a local algorithm, namely local belief propagation, achieves weak recovery in linear time if λe > 1 and conversely, if λe < 1, no local algorithm can achieve weak recovery. Moreover, it is shown that for any λ > 0, MLE achieves a recovery guarantee similar to weak recovery in Definition 3. In comparison, the sharp information limit for weak recovery identified in Theorem 1 below allows p, q and K to vary jointly with n as n → ∞.

Finally, we briefly compare the results of this paper to those of [1] and [30] on the planted bisection model (also known as the binary symmetric stochastic block model), where the vertices are partitioned into two equal-sized communities. First, a necessary and sufficient condition for weak recovery and a necessary and sufficient condition for exact recovery are obtained in [30]. In this paper, sufficient and necessary conditions, such as in (1) and (2) in Theorem 1, are presented separately. These conditions match up except right at the boundary; we do not determine whether recovery is possible when the limit is exactly equal to one. The result for exact recovery in [1] is similar in that regard. Perhaps future work, based on techniques from [30], can provide a more refined analysis for the recovery problem at the boundary. Secondly, when recovery is information theoretically possible for the planted bisection problem, efficient algorithms are shown to exist in [1] and [30]. In contrast, for detecting or recovering a single community whose size is sublinear in the network size, there can be a significant gap between what is information theoretically possible and what can be achieved by the known efficient algorithms for recovery (see [3, 8, 27, 18, 29]). We turn instead to the MLE for proof of optimal achievability. Finally, this paper covers both the Gaussian and Bernoulli case (and other distributions) in a unified framework without assuming that the community size scales linearly with the network size.

Notations. For any positive integer n, let [n] = {1, . . . , n}. For any set T ⊂ [n], let |T| denote its cardinality and T^c denote its complement. We use standard big O notations, e.g., for any sequences {an} and {bn}, an = Θ(bn) or an ≍ bn if there is an absolute constant c > 0 such that 1/c ≤ an/bn ≤ c. Let D(P‖Q) = E_P[log dP/dQ] denote the Kullback-Leibler (KL) divergence between two distributions P and Q. Let Binom(n, p) denote the binomial distribution with n trials and success probability p. Let Bern(p) denote the Bernoulli distribution with mean p and d(p‖q) = D(Bern(p)‖Bern(q)) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)). All logarithms are natural and we use the convention 0 log 0 = 0. Let Φ(x) and Q(x) denote the cumulative distribution function (CDF) and complementary CDF of a standard normal distribution, respectively. We say a sequence of events En holds with high probability, if P{En} → 1 as n → ∞.
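As a quick companion to the notation just introduced, here is a small Python helper (ours, not from the paper) for the two divergences that recur throughout: the binary divergence d(p‖q) and D(P‖Q) = µ²/2 in the Gaussian case.

import math

def binary_kl(p, q):
    # d(p||q) = D(Bern(p)||Bern(q)) = p log(p/q) + (1-p) log((1-p)/(1-q)), with 0 log 0 = 0
    t = lambda a, b: 0.0 if a == 0 else a * math.log(a / b)
    return t(p, q) + t(1 - p, 1 - q)

def gaussian_kl(mu):
    # D(N(mu,1) || N(0,1)) = mu^2 / 2
    return mu ** 2 / 2

# Example: the quantity K d(p||q) / log(n/K) appearing in the weak-recovery conditions below.
n, K, p, q = 10_000, 100, 0.5, 0.1
print(K * binary_kl(p, q) / math.log(n / K))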
2 Overview of Main Results
We overview our results on information limits for the Bernoulli case (where limits are expressed in terms of p, q, and K, as n → ∞) and Gaussian case (where limits are expressed in terms of µ and K as n → ∞). The following assumption on the network and community sizes will be imposed:
Assumption 1. For all n, 2 ≤ K < n, and lim sup_{n→∞} K/n < 1.

Assumption 1 implies log n / log(n − K) → 1, so in several asymptotic results log n and log(n − K) are interchangeable. We give preference to log n. For the Bernoulli model we shall further impose the following.
Assumption 2. p > q, p/q = O(1) and lim supn→∞ p < 1.
2.1 Weak Recovery
Theorem 1. (Weak recovery in Bernoulli case) Suppose Assumptions 1 and 2 hold. If
Kd(p‖q) → ∞ and lim inf_{n→∞} Kd(p‖q) / log(n/K) > 2, (1)
then weak recovery is possible. If weak recovery is possible, then
Kd(p‖q) → ∞ and lim inf_{n→∞} Kd(p‖q) / log(n/K) ≥ 2. (2)
Remark 1. Condition (2) is necessary even if p/q → ∞, but (1) alone is not sufficient without the assumption that p/q is bounded. This can be seen by considering the extreme case where K = n/2, p = 1/n, and q = e^{−n}. In this case, condition (1) is clearly satisfied; however, the subgraph induced by vertices in the cluster is an Erdős-Rényi random graph with edge probability 1/n which contains at least a constant fraction of isolated vertices with high probability. It is not possible to correctly determine whether the isolated vertices are in the cluster, hence the impossibility of weak recovery.

Theorem 2. (Weak recovery in Gaussian case) Suppose Assumption 1 holds. If
Kµ² → ∞ and lim inf_{n→∞} (K − 1)µ² / log(n/K) > 4, (3)
then weak recovery is possible. If weak recovery is possible, then
Kµ² → ∞ and lim inf_{n→∞} (K − 1)µ² / log(n/K) ≥ 4. (4)
Remark 2. The assumption K ≥ 2 implies K/2 ≤ K − 1 ≤ K, so the first parts of (3) and (4) would have the same meaning if K were replaced by K − 1. However, without the assumption K → ∞, the factor K − 1 cannot be replaced by K in the second parts of (3) and (4).

The results for both the Bernoulli and Gaussian cases follow from the general results on the P/Q model developed in Section 3. In particular, we show the MLE weakly recovers ξ under the sufficient conditions (1) and (3), respectively. For the necessary conditions (2) and (4), the first part is proved by reducing the recovery problem to testing the product distribution P^{⊗K} ⊗ Q^{⊗K} versus Q^{⊗K} ⊗ P^{⊗K}, while the second part is proved using information-theoretic machinery such as the data processing inequality and the rate-distortion function.
Since the data matrix is assumed to have zero diagonal, the problem is degenerate if K = 1.
2.2 Exact Recovery
Theorem 3. (Exact recovery in Bernoulli case) Suppose Assumptions 1 and 2 hold. If (1) holds, and
lim inf_{n→∞} K(d(τ*‖p) + d(τ*‖q)) / log(Kn) > 1, (5)
where
τ* = (log((1−q)/(1−p)) + (1/K) log(n/K)) / log(p(1−q)/(q(1−p))), (6)
then exact recovery is possible. If exact recovery is possible, then (2) holds, and
lim inf_{n→∞} K(d(τ*‖p) + d(τ*‖q)) / log(Kn) ≥ 1. (7)
Remark 3. Note that τ* is defined so that Kd(τ*‖p) − log K = Kd(τ*‖q) − log n. As shown in Lemma 7, (5) is equivalent to
lim inf_{n→∞} Kd(τ*‖q) / log n > 1. (8)
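To make the definition of τ* concrete, the following small Python sketch (our own illustration, not part of the paper) evaluates (6) for a given (n, K, p, q) and checks the balancing identity Kd(τ*‖p) − log K = Kd(τ*‖q) − log n stated in Remark 3; the helper names and the numerical values are ours.

import math

def binary_kl(x, y):
    # d(x||y) = x log(x/y) + (1-x) log((1-x)/(1-y)), with 0 log 0 = 0
    t = lambda a, b: 0.0 if a == 0 else a * math.log(a / b)
    return t(x, y) + t(1 - x, 1 - y)

def tau_star(n, K, p, q):
    # Equation (6)
    num = math.log((1 - q) / (1 - p)) + math.log(n / K) / K
    den = math.log(p * (1 - q) / (q * (1 - p)))
    return num / den

n, K, p, q = 10_000, 200, 0.3, 0.1          # illustrative values only
t = tau_star(n, K, p, q)
lhs = K * binary_kl(t, p) - math.log(K)
rhs = K * binary_kl(t, q) - math.log(n)
print(t, lhs - rhs)                          # the difference should vanish up to rounding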
Remark 4. Consider the regime
K = ρn / log^{s−1} n,  p = a log^s n / n,  q = b log^s n / n,
where s ≥ 1 is fixed, ρ ∈ (0, 1) and a > b > 0. Let I(x, y) ≜ x − y log(ex/y) for x, y > 0. Then the sharp recovery thresholds are determined by Theorems 1 and 3 as follows: For any ǫ > 0,
• For s > 1, if ρI(b, a) ≥ (2+ǫ)(s−1) log log n / log n, then weak recovery is possible; if ρI(b, a) ≤ (2−ǫ)(s−1) log log n / log n, then weak recovery is impossible. For s = 1, weak recovery is possible if and only if ρI(b, a) = ω(1/log n).
• Assume ρ, a, b are fixed constants. Let τ0 = (a − b)/log(a/b). Then exact recovery is possible if ρI(b, τ0) > 1; conversely, if ρI(b, τ0) < 1, then exact recovery is impossible, generalizing the previous results of [17, 2] for linear community size (s = 1). To see this, note that by definition, τ* = (1 + o(1)) τ0 log^s n / n, and thus d(τ*‖q) = (1 + o(1)) I(b, τ0) log^s n / n.

Theorem 4. (Exact recovery in Gaussian case) Assume that K → ∞ and lim sup K/n < 1. If (3) holds, and
lim inf_{n→∞} √K µ / (√(2 log K) + √(2 log n)) > 1, (9)
then exact recovery is possible. If exact recovery is possible, then (4) holds, and
lim inf_{n→∞} √K µ / (√(2 log K) + √(2 log n)) ≥ 1. (10)
Remark 5. Consider the regime
K = ρn / log^{s−1} n,  µ² = µ0² log^s n / n,
where s ≥ 1 and ρ ∈ (0, 1) are fixed constants. The critical signal strength that allows weak or exact recovery is determined by Theorems 2 and 4 as follows: For any ǫ > 0,
• If µ0 > √((8+ǫ)/ρ), then exact recovery is possible; conversely, if µ0 < √((8−ǫ)/ρ), then exact recovery is impossible.
• For s > 1, if µ0 > (2+ǫ)√((s−1) log log n / (ρ log n)), then weak recovery is possible; conversely, if µ0 < (2−ǫ)√((s−1) log log n / (ρ log n)), then weak recovery is impossible. For s = 1, weak recovery is possible if and only if µ0 = ω(1/√(log n)).
Remark 6. Butucea et al. [9] considers the submatrix localization model with an n × m submatrix with an elevated mean in an N × M large Gaussian random matrix with independent entries, and gives sufficient conditions and necessary conditions, matching up to constant factors, for exact recovery, which are analogous to those of Theorem 4. Setting (n, m, N, M ) in [9, (2.3)] (sufficient condition for exact recovery of rectangular submatrix) equal to (K, K, n, n) gives precisely the sufficient condition of Theorem 4 for exact recovery of a principal submatrix of size K from symmetric noise. That coincidence can be understood as follows. The nonsymmetric observations of [9, (2.3)] in the case of parameters (K, K, n, n) yield twice the available information as the symmetric observation matrix we consider (diagonal observations excluded) while the amount of information required to specify a K × K (not necessarily principal) submatrix of an n × n matrix is twice the information needed to specify a principal one. The proof techniques of [9] are similar to ours, with the main difference being that we simultaneously investigate conditions for weak and exact recovery. Remark 7. If K ≤ n1/9 , (3) implies (9), and thus (3) alone is sufficient for exact recovery; if K ≥ n1/9 , then (9) implies (3), and (9) alone is sufficient for exact recovery. For both the Bernoulli and Gaussian cases, we show that if there is an algorithm that can provide weak recovery even if the cluster size |C ∗ | is random and only approximately equals K, then that algorithm can be combined with a linear time voting procedure and the method of successive withholding to achieve exact recovery. The linear time voting procedure corresponds to testing the product distribution P ⊗K against Q⊗K . Under conditions (5) for the Bernoulli case and (9) for the Gaussian case, the average testing error probability is shown to be o(1/n) which turns out to be sufficient for exact recovery. For the necessity, the necessary conditions for weak recovery are clearly also necessary for exact recovery. Moreover, we show that the testing error in the voting procedure needs to be o(1/n), which gives rise to the necessary condition (7) for the Bernoulli case and (10) for the Gaussian case. Sections 3 and 4 present the proofs for weak recovery and exact recovery, respectively. Instead of proving the theorems above separately for Bernoulli and Gaussian cases, we first derive necessary conditions and sufficient conditions for the general P/Q model, and then particularize them to the Bernoulli and Gaussian cases.
3 Weak Recovery for General P/Q Model
Section 3.1 introduces the maximum likelihood estimator and large deviation bounds, which are used in Section 3.2 to prove the necessary and sufficient conditions for weak recovery under the general P/Q model. In Section 3.3, we present a sufficient condition for weak recovery with random cluster size, which is used later in deriving sufficient conditions for exact recovery.
3.1 Maximum Likelihood Estimator and Large Deviations
Given the data matrix A, a sufficient statistic for estimating the community C* is the log likelihood ratio (LLR) matrix L ∈ R^{n×n}, where Lij = log (dP/dQ)(Aij) for i ≠ j and Lii = 0. For S, T ⊂ [n], define
e(S, T) = Σ_{(i<j): (i,j) ∈ (S×T) ∪ (T×S)} Lij. (11)
Let Ĉ denote the MLE of C*, given by:
ĈML = arg max_{C ⊂ [n]} {e(C, C) : |C| = K}. (12)
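A brute-force rendering of (12) is sketched below. It is our own illustration, exponential in K and therefore feasible only for very small instances, consistent with the hardness discussion that follows; for the Gaussian case the LLR matrix can be taken as Lij = µAij − µ²/2 off the diagonal.

from itertools import combinations
import numpy as np

def mle_community(L, K):
    # Exhaustive search for the K-subset C maximizing e(C, C) = sum_{i<j, i,j in C} L_ij.
    n = L.shape[0]
    best, best_val = None, -np.inf
    for C in combinations(range(n), K):
        idx = np.array(C)
        val = np.triu(L[np.ix_(idx, idx)], k=1).sum()
        if val > best_val:
            best, best_val = set(C), val
    return best

# Example (tiny instance, Gaussian case with mu = 2, A drawn as in Definition 1):
# L = 2.0 * A - 2.0; np.fill_diagonal(L, 0.0); C_hat = mle_community(L, K)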
Evaluating the MLE requires knowledge of K. Computation of the MLE is NP-hard for general values of n and K, because the question of the existence of a clique of a specified size in an undirected graph, which is known to be an NP-complete problem [25], can be reduced to computation of the MLE. Thus, computation of the MLE in the worst case is considered to be computationally intractable. We caution the reader that, while the MLE is optimal for exact recovery, it may not be optimal for attaining weak recovery.

We present useful large deviation inequalities that will be used for analyzing the MLE later. Let Li be i.i.d. copies of the LLR. Then for all θ ∈ [−D(Q‖P), D(P‖Q)],
Q[Σ_{k=1}^n Lk ≥ nθ] ≤ exp(−nE_Q(θ)), (13)
P[Σ_{k=1}^n Lk ≤ nθ] ≤ exp(−nE_P(θ)), (14)
where
E_Q(θ) = ψ*_Q(θ) = sup_{λ∈R} (λθ − ψ_Q(λ)),  E_P(θ) = E_Q(θ) − θ,
and ψ_Q(λ) = log E_Q[exp(λL1)]. In particular, E_P and E_Q are convex functions with E_Q(−D(Q‖P)) = E_P(D(P‖Q)) = 0 and hence E_Q(D(P‖Q)) = D(P‖Q) and E_P(−D(Q‖P)) = D(Q‖P). For example, in the Gaussian case P = N(µ, 1), Q = N(0, 1), we have D(P‖Q) = D(Q‖P) = µ²/2 and E_Q(θ) = (1/8)(µ + 2θ/µ)² and E_P(θ) = E_Q(−θ); in the Bernoulli case, we have E_P(θ) = d(α‖p), E_Q(θ) = d(α‖q), where α = (θ + log((1−q)/(1−p))) / log(p(1−q)/(q(1−p))). For the Gaussian case, the expressions given above imply, for 0 < η < 1,
E_P((1 − η)D(P‖Q)) = η²D(P‖Q)/4,  E_Q(−(1 − η)D(Q‖P)) = η²D(Q‖P)/4. (15)
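The exponents above can also be evaluated numerically. The sketch below is ours (a crude grid search over λ rather than a closed form); it computes E_Q(θ) = sup_λ {λθ − ψ_Q(λ)} and E_P(θ) = E_Q(θ) − θ for the Bernoulli case.

import numpy as np

def exponent_EQ_bernoulli(theta, p, q, lambdas=np.linspace(0.0, 1.0, 2001)):
    # Under Q, L1 takes value log(p/q) with prob. q and log((1-p)/(1-q)) with prob. 1-q.
    a, b = np.log(p / q), np.log((1 - p) / (1 - q))
    psi = np.log(q * np.exp(lambdas * a) + (1 - q) * np.exp(lambdas * b))
    return float(np.max(lambdas * theta - psi))

p, q = 0.5, 0.1
D_PQ = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))   # D(P||Q) = d(p||q)
theta = 0.5 * D_PQ                                               # a point inside [-D(Q||P), D(P||Q)]
E_Q = exponent_EQ_bernoulli(theta, p, q)
E_P = E_Q - theta
print(E_Q, E_P)   # E_P(theta) should agree with d(alpha||p), with alpha as defined above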
The following lemma, proved in Appendix B, shows that the exponents satisfy a similar local quadratic behavior for any P and Q with bounded likelihood ratio. This property permits a unified treatment of the Gaussian case and the case of bounded LLR (such as Bernoulli) in the next section.
Lemma 1. Assume that |log dP/dQ| ≤ C for some positive constant C. Then for any η ∈ (0, 1),
E_P((1 − η)D(P‖Q)) ≥ (η² e^{−4C} / (4(1 + C))) D(P‖Q), (16)
E_Q(−(1 − η)D(Q‖P)) ≥ (η² e^{−4C} / (4(1 + C))) D(Q‖P). (17)

3.2 Necessary and Sufficient Conditions for Weak Recovery
Theorem 5. Suppose Assumption 1 holds and suppose there exists a universal constant c > 0 such that for all 0 < η < 1,
E_P((1 − η)D(P‖Q)) ≥ cη²D(P‖Q), (18)
E_Q(−(1 − η)D(Q‖P)) ≥ cη²D(Q‖P). (19)
If
KD(P‖Q) → ∞ and lim inf_{n→∞} (K − 1)D(P‖Q) / log(n/K) > 2, (20)
then for any C* ⊂ [n] with |C*| = K,
P{|ĈML △ C*| ≤ 2Kǫ} ≥ 1 − e^{−Ω(K/ǫ)},
where ǫ = 1/√(KD(P‖Q)). If there exists ξ̂ such that E[d(ξ, ξ̂)] = o(K), then
KD(P‖Q) → ∞ and lim inf_{n→∞} (K − 1)D(P‖Q) / log(n/K) ≥ 2. (21)
Remark 9. The sufficiency proof only uses (18) while the necessary proof only uses (19). However, in most applications we expect that (18) and (19) can be established together, so we do not single them out in the theorem statement. Remark 10. For the necessary condition, the proof for KD(P kQ) → ∞ uses a genie argument kQ) ≥ 2 is based and the theory of binary hypothesis testing, while the proof of lim inf n→∞ (K−1)D(P log(n/K) on mutual information and rate-distortion function. Proof of Theorem 5. (Necessity) Given i, j ∈ [n], let ξ\i,j denote {ξk : k 6= i, j}. Consider the following binary hypothesis testing problem for determining ξi . If ξi = 0, a node J is randomly and uniformly chosen from {j : ξj = 1}, and we observe (A, J, ξ\i,J ); if ξi = 1, a node J is randomly and uniformly chosen from {j : ξj = 0}, and we observe (A, J, ξ\i,J ). Note that P ξ\i,J , A|ξi = 0, J P A|ξi = 0, J, ξ\i,J P J, ξ\i,J , A|ξi = 0 = = = P J, ξ\i,J , A|ξi = 1 P ξ\i,J , A|ξi = 1, J P A|ξi = 1, J, ξ\i,J 9
Y
k∈[n]\{i,J}:ξk =1
Q(Aik )P (AJk ) , P (Aik )Q(AJk )
where P {J|ξi = 0} = P {J|ξi = 1}; the second equality holds because the first equality holds because P ξ\i,j |ξi = 0, J = P ξ\i,j |ξi = 1, J . Let T denote the vector consisting of Aik and AJk for all k ∈ [n]\{i, J} such that ξk = 1. Then T is a sufficient statistic of (A, J, ξ\i,J ) for testing ξi = 1 and ξi = 0. Note that if ξi = 0, T is distributed as Q⊗(K−1) P ⊗(K−1) ; if ξi = 1, T is distributed as P ⊗(K−1) Q⊗(K−1) . Thus, equivalently, we are testing Q⊗(K−1) P ⊗(K−1) versus P ⊗(K−1) Q⊗(K−1) ; let E denote the optimal average probability of testing error. Then we have the following chain of inequalities: b ≥ E[d(ξ, ξ)]
n X
min P[ξi 6= ξbi ] ≥
b i=1 ξi (A)
=n
min
ξb1 (A,J, ξ\1,J )
n X
min
b i=1 ξi (A,J, ξ\i,J )
P[ξ1 6= ξb1 ] = nE.
P[ξi 6= ξbi ]
(22)
b = o(K), it follows that E = o(K/n). Since K/n is bounded away By the assumption E[d(ξ, ξ)] from one, this implies that the sum of Type-I and II probabilities of error pRe,0 + pe,1 = o(1), which is equivalent to TV((P ⊗ Q)⊗K−1 , (Q ⊗ P )⊗K−1 ) → 1, where TV(P, Q) , |dP − dQ|/2 denotes 1 [33, (2.25)] and the tensorization the total variation distance. Using D(P kQ) ≥ log 2(1−TV(P,Q)) property of KL divergence for product distributions, we have (K −1)(D(P kQ)+D(QkP )) → ∞. By the assumption (19) and the fact that EQ (θ) is non-decreasing in [−D(QkP ), D(P kQ)], it follows that c D(P kQ) = EQ (D(P kQ)) ≥ EQ (−D(QkP )/2) ≥ D(QkP ). 4 Hence, we have (K − 1)D(P kQ) → ∞, which implies KD(P kQ) → ∞. Next we show the second condition in (21) is necessary. Let H(X) denote the entropy function of a discrete random variable X and I(X; Y ) denote the mutual information between random variables X andPY . Let ξ = (ξ1 , . . . , ξn ) be uniformly drawn from the set {x ∈ {0, 1}n : w(x) = K} where w(x) = xi denotes the Hamming weight; therefore ξi ’s are individually Bern(K/n). Let b E[d(ξ, ξ)] = ǫn K, where ǫn → 0 by assumption. Consider the following chain of inequalities, which lower bounds the amount of information required for a distortion level ǫn : (a)
b ξ) ≥ I(A; ξ) ≥ I(ξ;
min
e E[d(ξ,ξ)]≤ǫ nK
e ξ) ≥ H(ξ) − I(ξ;
max
e E[d(ξ,ξ)]≤ǫ nK
n ǫn K (c) n = log − nh ≥ K log (1 + o(1)), K n K
(b)
H(ξe ⊕ ξ)
where (a) follows from the data processing inequality, (b) is due to the fact that4 maxE[w(X)]≤pn H(X) = 1 is the binary entropy function, and nh(p) for any p ≤ 1/2 where h(p) , p log 1p + (1 − p) log 1−p n n K (c) follows from the bound K ≥ k , the assumption K/n is bounded away from one, and the bound h(p) ≤ −p log p + p for p ∈ [0, 1]. Moreover, I(A; ξ) = min D(PA|ξ kQ|Pξ ) Q
n ≤ D(PA|ξ kQ⊗( 2 ) |Pξ ) K = D(P kQ). 2
(23)
P P To see this, simply note that H(X) ≤ n P {Xi = 1} /n) ≤ nh(p) by Jensen’s inequality, which i=1 H(Xi ) ≤ nh( is attained with equality when Xi ’s are iid Bern(p). 4
10
kQ) Combining the last two displays, we get that lim inf n→∞ (K−1)D(P ≥ 2. log(n/K) b denote the MLE, C bML , for brevity in the proof. Let L = |C b ∩ C ∗ | and (Sufficiency) We let C p ∗ ∗ b b ǫ = 1/ KD(P kQ). Since |C| = |C | = K and hence |C△C | = 2(K − L), it suffices to show that P{L ≤ (1 − ǫ)K} ≤ exp(−Ω(K/ǫ)). Note that
b C) b − e(C ∗ , C ∗ ) = e(C\C b ∗ , C\C b ∗ ) + e(C\C b ∗, C b ∩ C ∗ ) − e(C ∗ \C, b C ∗ ). e(C,
(24)
b = |C\C b ∗ | = K − L. Fix θ ∈ [−D(QkP ), D(P kQ)] whose value will be chosen later. and |C ∗ \C| Then for any 0 ≤ ℓ ≤ K − 1, {L = ℓ} ⊂ {∃C ⊂ [n] : |C| = K, |C ∩ C ∗ | = ℓ, e(C, C) ≥ e(C ∗ , C ∗ )}
= {∃S ⊂ C ∗ , T ⊂ (C ∗ )c : |S| = |T | = K − ℓ, e(S, C ∗ ) ≤ e(T, T ) + e(T, C ∗ \S)}
⊂ {∃S ⊂ C ∗ : |S| = K − ℓ, e(S, C ∗ ) ≤ mθ}
∪ {∃S ⊂ C ∗ , T ⊂ (C ∗ )c : |S| = |T | = K − ℓ, e(T, T ) + e(T, C ∗ \S) ≥ mθ}, Pm ℓ ∗ where m = K distribution as i=1 Li under measure 2 − 2 . Notice that e(S, C ) has the same P L under measure Q where Li are i.i.d. P ; e(T, T ) + e(T, C ∗ \S) has the same distribution as m i i=1 dP copies of log dQ . Hence, by the union bound and the large deviation bounds (13) and (14), # # "X "X m m K n−K K P {L = ℓ} ≤ Li ≤ mθ + P Li ≥ mθ Q K −ℓ K −ℓ K −ℓ i=1 i=1 K n−K K ≤ exp(−mEP (θ)) + exp(−mEQ (θ)) K −ℓ K −ℓ K −ℓ K−ℓ (n − K)Ke2 Ke K−ℓ exp(−mEQ (θ)) exp(−mEP (θ)) + ≤ K −ℓ (K − ℓ)2 where the last inequality holds due to the fact that ab ≤ (ea/b)b . Notice that m = (K − ℓ)(K + ℓ − 1)/2 ≥ (K − ℓ)(K − 1)/2. Thus, for any ℓ ≤ (1 − ǫ)K, P {L = ℓ} ≤ e−(K−ℓ)E1 + e−(K−ℓ)E2 ,
(25)
where e E1 , (K − 1)EP (θ)/2 − log , ǫ (n − K)e2 E2 , (K − 1)EQ (θ)/2 − log . Kǫ2 By the assumption (20), we have (K − 1)D(P kQ)(1 − η) ≥ 2 log θ = (1 − η)D(P kQ). By the assumption (18), we have
n K
for some η ∈ (0, 1). Choose
e E1 ≥ cη 2 (K − 1)D(P kQ)/2 − log . ǫ
Using the fact that EP (θ) = EQ (θ) − θ, we have e (K − 1) n−K + D(P kQ)(1 − η) − log ǫ 2 K e 2 ≥ cη (K − 1)D(P kQ)/2 − 2 log . ǫ
E2 ≥ cη 2 (K − 1)D(P kQ)/2 − 2 log
11
Since K ≥ 2 and (K − 1)D(P kQ) → ∞ by assumption, we have ǫ = o(1) and hence E = min{E1 , E2 } = Ω(KD(P kQ)) = Ω(ǫ−2 ). Hence, in view of (25), (1−ǫ)K
P {L ≤ (1 − ǫ)K} = ≤
X ℓ=0
P {L = ℓ} ≤
∞ X e−ℓE1 + e−ℓE2
ℓ=ǫK
2 exp(−ǫKE) = exp(−Ω(K/ǫ)). 1 − exp(−E)
Next, we give the proofs of Theorem 1 and Theorem 2 by applying Theorem 5. Proof of Theorem 1 . In the Bernoulli case, D(P kQ) = d(pkq) and D(QkP ) = d(qkp). By AssumpdP tion 2, log p(1−q) q(1−p) is bounded, and it follows that | log dQ | is bounded. The claim readily follows by combining Lemma 1, Theorem 5, and Remark 8. Proof of Theorem 2. In the Gaussian case, by (15), the assumptions (18) and (19) in Theorem 5 are satisfied with c = 1/4. Hence the claim directly follows from Theorem 5. Remark 11. The hidden community model (Definition 1) adopted in this paper assumes the data matrix A has zero diagonal, meaning we do not observe any self information about individual vertices – only pairwise information. A different assumption used in the literature for the Gaussian submatrix localization problem is that Aii has distribution P if i ∈ C ∗ and distribution Q otherwise. Theorem 5 holds for that case with the modification that the factors K − 1 in (20) and (21) are replaced by K + 1. We explain briefly why the modified theorem is true. The proof for the sufficient part goes through with Pthe definition of e(S, T ) in (11) modified to include diagonal terms indexed by S ∩ T : e(S, T ) = (i≤j):(i,j)∈(S×T )∪(T ×S) Lij . Then m increases by K − ℓ, resulting in K − 1 replaced by K + 1 in E1 and E2 . As for the necessary conditions, the proof of the first part of (21) goes through with the sufficient statistic T extended to include two more variables, Aii and AJJ , which has the effect of increasing K by one, so the first part of (21) holds with K replaced by K + 1, but the first part of (21) has the same meaning whether or not K is replaced by K + 1. The proof replaced by 1 + · · · + K = K+1 in (23), which of the second part of (21) goes through with K 2 2 has the effect of changing K − 1 to K + 1 in the second part of (21). The necessary conditions and the sufficient conditions for exact recovery stated in the next section hold without modification for the model with diagonal elements. In the proof of Theorem 6, the term e(i, C ∗ ) in the definition of F , (28), should include the term Lii and the random variable Xi in the proof that P {E1 } → 0 should be changed to Xi = e(i, {1, · · · , i}), and also include the term Lii .
3.3 A sufficient condition for weak recovery with random cluster size
Theorem 5 invokes the assumption that |C*| ≡ K and K is known. In the proof of exact recovery, as we will see, we need to deal with the case where |C*| is random and unknown. For that reason, the following lemma gives a sufficient condition for weak recovery with a random cluster size. We shall continue to use ĈML to denote the estimator defined by (12), although in this context it is not actually the MLE because |C*| need not be K. That is, there is a (slight) mismatch between the problem the estimator was designed for and the problem it is applied to.

Lemma 2. (Sufficient condition for weak recovery with random cluster size) Assume that K → ∞, lim sup K/n < 1, and there exists a universal constant c > 0 such that (18) holds for all 0 < η < 1. Furthermore, suppose that P{| |C*| − K | ≤ K/log K} ≥ 1 − o(1). If (20) holds, then
P{|ĈML △ C*| ≤ 2Kǫ + K/log K} ≥ 1 − o(1),
where ǫ = 1/√(min{log K, KD(P‖Q)}).
Proof. By assumption, with probability converging to 1, |C ∗ |−K ≤ K/ log K. In the following, we bML ∩C ∗ |. Then |C bML △C ∗ | = K+K ′ −2L. assume that |C ∗ | = K ′ for |K ′ −K| ≤ K/ log K. Let L = |C To prove the theorem, it suffices to show that P{L ≤ (1−ǫ)K −K/ log K} = o(1), where ǫ is defined in the statement of the theorem. Following the proof of Theorem 5 in the fixed cluster size case, we get that for all 0 ≤ ℓ ≤ K − 1, {L = ℓ} ⊂ {∃C ⊂ [n] : |C| = K, |C ∩ C ∗ | = ℓ, e(C, C) ≥ e(C ∗ , C ∗ )}
= {∃S ⊂ C ∗ , T ⊂ (C ∗ )c : |S| = K ′ − ℓ, |T | = K − ℓ, e(S, C ∗ ) ≤ e(T, T ) + e(T, C ∗ \S)}
⊂ {∃S ⊂ C ∗ : |S| = K ′ − ℓ, e(S, C ∗ ) ≤ mθ}
∪ {∃S ⊂ C ∗ , T ⊂ (C ∗ )c : |S| = K ′ − ℓ, |T | = K − ℓ, e(T, T ) + e(T, C ∗ \S) ≥ mθ},
where θ ∈ [−D(QkP ), D(P kQ)] is chosen later. Notice that e(S, C ∗ ) has the same distribution P P ′ T ) + e(T, C ∗ \S) has the same distribution as m as m i=1 Li under measureP ; e(T, i=1 Li under K′ ℓ K ℓ dP ′ measure Q where m = 2 − 2 , m = 2 − 2 , and Li are i.i.d. copies of log dQ . Hence, by the union bound and large deviation bounds in (13) and (14), # # "X "X m m′ K′ n − K′ K′ Li ≥ mθ Q Li ≤ mθ + P P {L = ℓ} ≤ K′ − ℓ K−ℓ K′ − ℓ i=1 i=1 ′ ′ K ′ e K −ℓ −m′ EP (mθ/m′ )) (n − K ′ )e K−ℓ K ′ e K −ℓ −mEQ (θ) ≤ e + e . K′ − ℓ K −ℓ K′ − ℓ Notice that for any ℓ ≤ (1 − ǫ)K − |K − K ′ |, K ′ − ℓ ≥ ǫ max{K ′ , K}, K − ℓ ≥ ǫK, and
K −ℓ K −ℓ K − (1 − ǫ)K K ≤ ′ ≤ ≤ . K + K/ log K K −ℓ K − K/ log K − ℓ K − K/ log K − (1 − ǫ)K √ Since ǫ ≥ 1/ log K and K → ∞, it follows that (K − ℓ)/(K ′ − ℓ) = 1 + o(1). Also, m′ = (K ′ − ℓ)(K ′ + ℓ − 1)/2 ≥ (K ′ − ℓ)(K ′ − 1)/2 m = (K − ℓ)(K + ℓ − 1)/2 ≥ (K − ℓ)(K − 1)/2,
Therefore, m/m′ → 1, and, moreover, P {L = ℓ} ≤ e−(K−ℓ)(1+o(1))E1 + e−(K−ℓ)(1+o(1))E2 , with e E1 = KEP (mθ/m′ )/2 − log , ǫ (n − K ′ )e2 . E2 = KEQ (θ)/2 − log Kǫ2 n By the assumption (20), we have KD(P kQ)(1 − η) ≥ 2 log K for some η ∈ (0, 1). Choose θ = 2 (1 − η)D(P kQ). By Lemma 1, we have that EP (θ) ≥ cη KD(P kQ) and EP (mθ/m′ ) ≥ (1 + o(1))cη 2 KD(P kQ). Thus, e E1 ≥ (1 + o(1))cη 2 KD(P kQ)/2 − log . ǫ
13
Using the fact that EP (θ) = EQ (θ) − θ, we get that e K n − K′ e + D(P kQ)(1 − η) − log ≥ cKη 2 D(P kQ))/2 − 2 log . ǫ 2 K ǫ p Since KD(P kQ) → ∞ by assumption ǫ ≥ 1/ KD(P kQ), it follows that E = min{E1 , E2 } = Ω(KD(P kQ)). Therefore,5 E2 ≥ cη 2 KD(P kQ)/2 − 2 log
P {L ≤ (1 − ǫ)K − K/ log K} ≤
(1−ǫ)K−⌊K/ log K⌋
≤2
X
e−(K−ℓ)(1+o(1))E1 + e−(K−ℓ)(1+o(1))E2
ℓ=0
∞ X
ℓ=ǫK
p e−(1+o(1))ℓE = exp(−Ω( K 3 D(P kQ))) = o(1),
as was to be proved.
4 Exact Recovery for General P/Q Model
We begin by deriving necessary conditions and sufficient conditions for the general P/Q model in Section 4.1 and Section 4.2, respectively. Then, these general conditions are used to prove the information limits of exact recovery for Bernoulli case in Section 4.3 and Gaussian case in Section 4.4.
4.1 A general necessary condition
In the following, we give a necessary condition for exact recovery under the general P/Q model.

Theorem 6. Assume that K → ∞ and lim sup K/n < 1. Let Li denote i.i.d. copies of log dP/dQ. Let Bn satisfy K(n − K)Q[L1 ≥ Bn] = o(1). If there exists an estimator Ĉ such that P{Ĉ = C*} → 1, then for any Ko → ∞ such that Ko = o(K), there exists a threshold θn depending on n such that for all sufficiently large n,
P[Σ_{i=1}^{K−Ko} Li ≤ Kθn − Ko D(P‖Q) − 6σ − Bn] ≤ 2/Ko, (26)
Q[Σ_{i=1}^{K} Li ≥ Kθn] ≤ 1/(n − K), (27)
where σ 2 = Ko varP (L1 ) and varP (L1 ) denotes the variance of L1 under measure P . Proof. Since the planted cluster C ∗ is uniformly distributed, the MLE minimizes the error probab bility among all estimators. Thus, without loss of generality, we can assume the estimator used n o C is bML and the vertices are numbered so that C ∗ = [K]. Hence, by assumption, P C bML = C ∗ → 1. C For each i ∈ C ∗ and j ∈ / C ∗ , we have e (C ∗ \{i} ∪ {j}, C ∗ \{i} ∪ {j}) − e(C ∗ , C ∗ ) = e(j, C ∗ ) − e(i, C ∗ ) − log 5
The o(1) terms converge to zero as
K K′
→ 1 and
m m′
dP (Aij ). dQ
→ 1, uniformly in ℓ for 0 ≤ ℓ ≤ (1 − ǫ)K − |K − K ′ |.
Let F denote the event that min∗ e(i, C ∗ ) +
i∈C
max
log
i∈C ∗ ,j ∈C / ∗
dP (Aij ) < max e(j, C ∗ ), dQ j ∈C / ∗
(28)
which implies the existence of i ∈ C ∗ and j ∈ / C ∗ , such that the set C ∗ \{i} ∪ {j} achieves a likelihood strictly larger than that achieved by C ∗ . Hence, P {F } ≤ P {ML fails} = o(1). Set θn′ to be # ( "K−K ) Xo 2 Li ≤ Kx − Ko D(P kQ) − 6σ − Bn ≥ θn′ = inf x ∈ R : P , Ko i=1
′′
and θn to be ′′
(
θn = sup x ∈ R : Q
"
K X i=1
#
1 Li ≥ Kx ≥ n−K
)
.
∗ Let E1 denote the event that mini∈C ∗ e(i, C ∗ )+Bn ≤ Kθn′ ; E2 denote the event that maxj ∈C / ∗ e(j, C ) ≥ ′′ dP Kθn ; E3 denote the even that maxi∈C ∗ ,j ∈C / ∗ log dQ (Aij ) < Bn . We claim that P {E1 } = Ω(1), P {E2 } = Ω(1), and P {E3c } = o(1); the proof is deferred to the end. Note that E1 and E2 are independent. Hence,
P {E1 ∩ E2 ∩ E3 ∩ F c } ≥ P {E1 ∩ E2 } − P {E3c } − P {F } = P {E1 } P {E2 } − o(1) = Ω(1). Since ′′
E1 ∩ E2 ∩ E3 ∩ F c ⊂ {θn′ > θn }, ′′
′′
and θn′ , θn are deterministic, it follows that θn′ > θn′′ for sufficiently large n. Set θn = (θn′ + θn )/2. ′′ Thus θn < θn′ and by the definition of θn′ , (26) holds. Similarly, we have that θn > θn and by the ′′ definition of θn , (27) holds. / We are left to show P {E1 } = Ω(1), P {E2 } = Ω(1), and P {E3c } = o(1). Note that for i ∈ C ∗ , j ∈ dP ∗ (Aij ) has the same distribution as L1 under measure Q. Hence, by the union bound, C , log dQ
dP P max log (Aij ) ≥ Bn ∗ ∗ dQ i∈C ,j ∈C /
≤ K(n − K)Q [L1 ≥ Bn ] = o(1).
Thus P {E3c } = o(1). Next, we prove that P {E2 } = Ω(1). Since Q i hP ′′ K −1 in x, it follows that Q i=1 Li ≥ Kθn ≥ (n − K) . Note that P {E2 } = 1 −
=1−
Y
j ∈C / ∗
n
′′
∗
P e(j, C ) < Kθn
1−Q
"
K X
≥ 1 − exp −Q
′′
Li ≥ Kθn
i=1 "K X i=1
hP
i ≥ x is left-continuous
o
#!n−K ′′
#
!
Li ≥ Kθn (n − K) 15
K i=1 Li
≥ 1 − e−1 ,
where the first equality holds because e(j, C ∗ ) are independent for different j ∈ / C ∗ ; the second P K equality holds because e(j, C ∗ ) has the same distribution as i=1 Li under measure Q; the third i hP ′′ K −x ≥ inequality is due to 1 − x ≤ e for x ∈ R; the last inequality holds because Q L ≥ Kθ i n i=1
(n − K)−1 . So P {E2 } = Ω(1) is proved. Finally, we show that P {E1 } = Ω(1). The proof is similar to the proof of P {E2 } → 1 just given, but it ish complicated iby the fact the random variables e(i, C ∗ ) for i ∈ C ∗ are not independent. PK Since P i=1 Li ≤ x is right-continuous in x, it follows from the definition that P
"K−K Xo i=1
#
Li ≤ Kθn′ − Ko D(P kQ) − 6σ − Bn ≥
2 . Ko
(29)
P For all i ∈ / C ∗ , e(i, C ∗ ) has the same distribution as K i=1 Li under measure P , but they are not independent. Let T be the set of the first Ko vertices in C ∗ , i.e., T = [Ko ], where Ko = o(K) and Ko → ∞. Let σ 2 = Ko varP (L1 ), where varP (L1 ) denotes the variance of L1 under measure P , and let T ′ = {i ∈ T : e(i, T ) ≤ Ko D(P kQ) + 6σ}. Since6 min e(i, C ∗ ) ≤ min′ e(i, C ∗ ) ≤ min′ e(i, C ∗ \T ) + Ko D(P kQ) + 6σ,
i∈C ∗
i∈T
i∈T
it follows that
∗
P {E1 } ≥ P min′ e(j, C \T ) ≤ j∈T
Kθn′
− Ko D(P kQ) − 6σ − Bn .
We show next that P |T ′ | ≥ K2o → 1 as n → ∞. For i ∈ T, e(i, T ) = Xi + Yi where Xi = e(i, {1, . . . , i − 1}) and Yi = e(i, {i + 1, . . . , Ko }). The X’s are mutually P independent, and the Y ’s are also mutually independent, and Xi has the same distribution as i−1 j=1 Lj and Yi has the same PKo −i distribution as j=1 Lj , where Lj is distributed under measure P . Then E [Xi ] = (i − 1)D(P kQ) and var(Xi ) ≤ σ 2 . Thus, by the Chebyshev inequality, P {Xi ≥ (i − 1)D(P kQ) + 3σ} ≤ 19 for all i ∈ T Therefore, |{i : Xi ≤ (i−1)D(P kQ)+3σ}| is stochastically at least as large as a Binom Ko , 89 o random variable, so that, P |{i : Xi ≤ (i − 1)D(P kQ) + 3σ}| ≥ 3K → 1 as Ko → ∞. Similarly, 4 3Ko P |{i : Yi ≤ (Ko − i)D(P kQ) + 3σ}| ≥ 4 → 1 as Ko → ∞. If at least 3/4 of the X’s are small and at least 3/4 of the Y ’s are small, it follows that at least 1/2 of the e(i, T )’s for i ∈ T are small. Therefore, as claimed, P |T ′ | ≥ K2o → 1 as Ko → ∞. The set T ′ P is independent of (e(i, C ∗ \T ) : i ∈ T ) and each of those variables has the same o Lj under measure P . Thus, distribution as K−K j=1
Ko K o ′ ′ ∗ ′ − P |T | < P e(j, C \T ) ≥ Kθn − Ko D(P kQ) − 6σ − Bn |T | ≥ P {E1 } ≥ 1 − E 2 2 j∈T ′ K−K Xo ≥ 1 − exp −P Lj ≤ Kθn′ − Ko D(P kQ) − 6σ − Bn Ko /2 − o(1)
Y
j=1
−1
≥1−e
− o(1),
where the last inequality follows from (29). Therefore, P {E1 } = Ω(1). 6
In case T ′ = ∅ we adopt the convention that the minimum of an empty set of numbers is +∞.
4.2 A general sufficient condition
In this subsection, we describe a two-step procedure for exact recovery in Algorithm 1. The first step uses an estimator capable of weak recovery, even with a slight mismatch between |C*| and K, such as provided by the ML estimator (see Lemma 2); the second step cleans up the residual errors through a local voting procedure for each vertex. In order to make sure the first and second steps are independent of each other, we use the method of successive withholding.

Algorithm 1 Weak recovery plus cleanup for exact recovery
1: Input: n ∈ N, K > 0, distributions P, Q; observed matrix A; δ ∈ (0, 1) with 1/δ, nδ ∈ N.
2: (Partition) Partition [n] into 1/δ subsets Sk of size nδ.
3: (Approximate Recovery) For each k = 1, . . . , 1/δ, let Ak denote the restriction of A to the rows and columns with index in [n]\Sk, run an estimator capable of weak recovery with input (n(1 − δ), ⌈K(1 − δ)⌉, P, Q, Ak), and let Ĉk denote the output.
4: (Cleanup) For each k = 1, . . . , 1/δ compute ri = Σ_{j∈Ĉk} Lij for all i ∈ Sk and return Č, the set of K indices in [n] with the largest values of ri.
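The following Python sketch mirrors the structure of Algorithm 1 for the Gaussian case; applied to a matrix A drawn as in Definition 1, the returned set plays the role of Č. It is our own illustration, not the paper's implementation: `weak_recovery` is a stand-in for any estimator with guarantee (30) (here, simple row-sum thresholding), and the parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(1)

def weak_recovery(A_sub, K_sub):
    # Placeholder approximate-recovery step: keep the K_sub rows of the reduced
    # matrix with the largest row sums.
    scores = A_sub.sum(axis=1)
    return np.argsort(scores)[-K_sub:]

def two_step_recovery(A, K, mu, delta=0.25):
    n = A.shape[0]
    L = mu * A - mu ** 2 / 2                     # LLR matrix for P = N(mu,1), Q = N(0,1)
    np.fill_diagonal(L, 0.0)
    r = np.zeros(n)
    parts = np.array_split(rng.permutation(n), int(round(1 / delta)))
    for S_k in parts:                            # withhold S_k, estimate on the rest
        rest = np.setdiff1d(np.arange(n), S_k)
        C_hat_local = weak_recovery(A[np.ix_(rest, rest)], int(np.ceil(K * (1 - delta))))
        C_hat = rest[C_hat_local]                # map back to the original vertex labels
        for i in S_k:                            # cleanup: vote with LLRs toward C_hat
            r[i] = L[i, C_hat].sum()
    return set(np.argsort(r)[-K:].tolist())      # the K vertices with the largest votes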
The following theorem gives sufficient conditions under which the two-step procedure achieves exact recovery.

Theorem 7. Suppose Algorithm 1 is run using estimators Ĉk for weak recovery such that, as n → ∞,
P{|Ĉk ∆ C*_k| ≤ δK for 1 ≤ k ≤ 1/δ} → 1, (30)
where C*_k = C* ∩ ([n]\Sk). Let {Xi} denote a sequence of i.i.d. copies of log dP/dQ under measure P. Let {Yi} denote another sequence of i.i.d. copies of log dP/dQ under measure Q, which is independent of {Xi}. Suppose there exists a sequence θn such that
P[Σ_{i=1}^{K(1−2δ)} Xi + Σ_{i=1}^{Kδ} Yi ≤ K(1 − δ)θn] = o(1/K), (31)
P[Σ_{i=1}^{K(1−δ)} Yi ≥ K(1 − δ)θn] = o(1/(n − K)). (32)
Then P{Č = C*} → 1 as n → ∞.
bk ), each of the random variables ri ∈ Sk for i ∈ [n] is conditionally the sum Proof. Given (Ck∗ , C of independent random variables, each with either the distribution of X1 or the distribution of Y1 . bk △C ∗ | ≤ δK}, Furthermore, on the event, Ek = {|C k bk ∩ C ∗ | ≥ |C bk | − |C bk △C ∗ | = ⌈K(1 − δ)⌉ − |C bk △C ∗ | ≥ K(1 − 2δ), |C k k k
One can check by definition and the change of measure that X1 is first-order stochastically greater than or equal to Y1 . Therefore, on the event Ek , for i ∈ C ∗ , ri is stochastically greater than or equal PK(1−2δ) P PK(1−δ) ∗ to j=1 Xj + Kδ Yj . Hence, by j=1 Yj . For i ∈ [n]\C , ri has the same distribution as j=1 assumption (31) and (32) and the union bound, with probability converging to 1, ri > K(1 − δ)θn for all i ∈ C ∗ and ri < K(1 − δ)θn for all i ∈ [n]\C ∗ . Therefore, P Cˇ = C ∗ → 1 as n → ∞. 17
The following corollary gives a general sufficient condition for exact recovery by using Algorithm 1 with the MLE for the weak recovery step.

Corollary 1. (General sufficient conditions for exact recovery) Suppose the following conditions hold:
1. K/n is bounded away from one; there is a universal constant c such that (18) holds for 0 < η < 1; (20) holds (sufficient conditions for weak recovery by MLE).
2. K → ∞; there exists a threshold θn such that (31) and (32) hold.
Then for any sufficiently small constant δ ∈ (0, 1) with 1/δ, nδ ∈ N, if Algorithm 1 is implemented with Ĉk being the MLE, the output Č produced satisfies P{Č = C*} → 1 as n → ∞.
bk for each k is the MLE for C ∗ based Proof. In view of Theorem 7 it suffices to verify (30) when C k on observation of Ak , for δ sufficiently small. The distribution of |Ck∗ | is obtained by sampling the vertices of the original graph without replacement. Therefore, by a result of Hoeffding [22], the distribution of |Ck∗ | is convex order dominatedby the distribution that would result by sampling with ∗ replacement, namely, by Binom n(1 − δ), K n . That is, for any convex function Ψ, E [Ψ(|Ck |)] ≤ K ∗ E Ψ(Binom(n(1 − δ), K n )) . Therefore, Chernoff bounds for Binom(n(1 − δ), n )) also hold for |Ck |. The Chernoff bounds for X ∼ Binom(n, p) give: P {X ≥ (1 + η)np} ≤ e−η
2 np/3
−η2 np/2
P {X ≤ (1 − η)np} ≤ e
,
∀0≤η≤1
(33)
,
∀ 0 ≤ η ≤ 1.
(34)
Then, P |C ∗ | − (1 − δ)K ≥ k
K log K
K K ≤ P Binom n(1 − δ), − (1 − δ)K ≥ n log K
≤ e−Ω(K/ log
2
K)
= o(1).
Since (20) holds and K → ∞, it follows that lim inf n→∞
⌈(1 − δ)K⌉D(P kQ) >2 n log K
for any sufficiently small δ ∈ (0, 1) with 1/δ, nδ ∈ N. Hence, we can apply Lemma 2 with K replaced by ⌈(1 − δ)K⌉ to get that for any 1 ≤ k ≤ 1/δ, n o bk ∆C ∗ | ≤ 2ǫK + K/ log K ≥ 1 − o(1), P |C (35) k
p where ǫ = 1/ min{log K, KD(P kQ)}. Since δ is a fixed constant, by the union bound over all 1 ≤ k ≤ 1/δ, we have that o n bk ∆C ∗ | ≤ 2ǫK + K/ log K for 1 ≤ k ≤ 1/δ ≥ 1 − o(1). P |C k Since ǫ → 0, the desired (30) holds.
4.3 Proof of Exact Recovery in Bernoulli Case
In this subsection, we apply the general necessary conditions and sufficient conditions to prove Theorem 3. Before that, we need a few key lemmas.
Lemma 3.
p
2md(τ kp) ≤ P {Binom(m, p) ≤ mτ } ≤ e−md(τ ||p) , p 2md(τ kq) ≤ P {Binom(m, q) ≥ mτ } ≤ e−md(τ ||q) , Q
Q
∀1/m ≤ τ ≤ p,
(36)
∀q ≤ τ ≤ 1.
(37)
Proof. The upper bound follows from the Chernoff bound and the lower bounds are proved in [34, Theorem 1]. Lemma 4. For any 0 < q ≤ p < 1,
(p − q)2 ≤ d(pkq) ≤ 2p(1 − q) (p − q)2 ≤ d(qkp) ≤ 2p(1 − q)
(p − q)2 . q(1 − q) (p − q)2 . p(1 − p)
(38) (39)
Proof. The upper bound follows by applying the inequality log x ≤ x − 1 for x > 0 and the lower 2 d(pkq) 1 bound is proved using ∂ (∂p) and Taylor’s expansion. = p(1−p) 2 Lemma 5. Assume that 0 < q ≤ p < 1 and u, v ∈ [q, p]. Then for any 0 < η < 1, 2ηp(1 − q) d(ukv), d((1 − η)u + ηvkv) ≥ 1 − q(1 − p) η 2 q(1 − p) max{d(ukv), d(vku)}. d((1 − η)u + ηvku) ≥ 2p(1 − q)
(40) (41)
Proof. By the mean value theorem, d((1 − η)u + ηvkv) = d(ukv) − η(u − v)d′ (xkv), for some x ∈ (min{u, v}, max{u, v}). Notice that D ′ (xkv) = log x(1−v) (1−x)v and thus |d′ (xkq)| ≤ log
max{u, v}(1 − min{u, v}) |u − v| ≤ , min{u, v}(1 − max{u, v}) q(1 − p)
where the last equality holds due to log(1 + x) ≤ x and x ∈ (q, p). It follows that 2ηp(1 − q) η(u − v)2 ≥ 1− d((1 − η)u + ηvkv) ≥ D(ukv) − d(ukv), q(1 − p) q(1 − p) where the last inequality holds due to the lower bounds in (38) and (39). Thus the first claim follows. For the second claim, d((1 − η)u + ηvku) ≥
η 2 q(1 − p) η 2 (u − v)2 ≥ max{d(ukv), d(vku)}, 2p(1 − q) 2p(1 − q)
where the first inequality holds due to the lower bounds in (38) and (39); the last inequality holds due to the upper bounds in (38) and (39). 7 The letter Q is used to denote both the complementary standard normal CDF, Q(u) = P {N (0, 1) ≥ u} , and the distribution of Aij for distinct i, j not both in C ∗ . The meaning should be clear from the context.
19
Lemma 6. Assume that log p(1−q) q(1−p) is bounded from above. Suppose for some ǫ > 0 that Kd(pkq) > n (1 + ǫ) log K for all sufficiently large n. Then p − τ ∗ = Θ(p − q) and τ ∗ − q = Θ(p − q). Proof. By the definition of τ ∗ , p − τ∗ = τ∗ − q = Notice that d(pkq) + d(qkp) = (p − q) log
1 K log log p(1−q) q(1−p) d(qkp) + K1 log log p(1−q) q(1−p)
d(pkq) −
p(1−q) q(1−p) .
n K
n K
, .
Hence,
n d(pkq) − K1 log K p − τ∗ = , p−q d(pkq) + d(qkp) n d(qkp) + K1 log K τ∗ − q = . p−q d(pkq) + d(qkp) p(1−q) and Lemma 4, d(pkq) ≍ d(qkp). Since Kd(pkq) > By the boundedness assumption of log q(1−p) n (1 + ǫ) log K for all sufficiently large n, it follows that p − τ ∗ and τ ∗ − q are both Θ(p − q).
Lemma 7. Assume 0 < q < p < 1. Condition (5) is equivalent to lim inf
Kd(τ ∗ kq) > 1, log(n)
(42)
lim inf
Kd(τ ∗ kp) > 1. log(K)
(43)
n→∞
and either (5) or (42) imply: n→∞
The above sentence is true if in each of (5), (42), and (43), lim inf is changed to lim sup and the direction of the inequality is reversed. Furthermore, if the strict inequalities in both (5) and (42) are replaced by weak inequalities, “ ≥”, they are still equivalent to each other. Proof. Equation (5) is equivalent to the existence of ǫ > 0 so that for all n sufficiently large, n n ] + [Kd(τ ∗ kq) − 12 log K ] [Kd(τ ∗ kp) + 12 log K ≥ 1 + ǫ. (44) log(Kn) The definition of τ ∗ implies that the two sums inside square brackets in (44) are equal. Therefore, (44) is equivalent to n Kd(τ ∗ kq) − 12 log K 1+ǫ ≥ log(Kn) 2 or ǫ ǫ Kd(τ ∗ kq) ≥ 1 + log n + log K (45) 2 2
which implies (42). Conversely, if (42) holds, then (45) holds for some ǫ > 0 and all sufficiently large n, which implies (is actually equivalent to) (44), proving (5). The equivalence of (5) and (42) is proved. 20
Similarly, (44) is also equivalent to Kd(τ ∗ kp) + 21 log log(Kn)
n K
≥
1+ǫ 2
or ǫ ǫ log K + log n, Kd(τ ∗ kp) ≥ 1 + 2 2
which implies (43). The statement for the case the direction of the inequalities is reversed can be proved by the same method. We prove the statement about weak inequalities next. If (42) is changed by replacing the inequality by the weak inequality ≥, it means that for any ǫ < 0, (44) holds for all n sufficiently large. Going through the same argument as before then shows the continued equivalence of (5) and (42) for weak inequalities. Lemma 8. Assume that p/q is bounded, p is bounded away from 1, and τ ∗ ∈ [q, p]. If lim inf n→∞
K (d(τ ∗ kp) + d(τ ∗ kq)) > 1, log(Kn)
(46)
then for all sufficiently small δ > 0, there exists a sequence θn such that (31) and (32) in Theorem 7 hold. Proof. Let δ > 0; how small δ must be is specified later. Set ǫ =
δ+δ1/4 1−δ
and ǫ′ =
2ǫp(1−q) q(1−p) .
Since
′ log p(1−q) q(1−p) is bounded from above, ǫ = O(ǫ). By the assumption (46) and Lemma 7, if δ is sufficiently small, then for all large n :
1+δ log n (1 − ǫ′ )(1 − δ) 1+δ Kd(τ ∗ kp) ≥ log K (1 − ǫ′ )(1 − 2δ) Kd(τ ∗ kq) ≥
(47) (48)
p(1−q) Let τn = (τ ∗ (1 − ǫ) + ǫq) ≥ q and θn = log q(1−p) τn + log 1−p 1−q . Recall that in Theorem 7, {Xi } denotes a sequence of i.i.d. random variables, where the distribution of each variable is the same as dP log dQ under measure P ; {Yi } denotes another sequence of i.i.d random variables independent of dP under measure Q. By the Chernoff {Xi }, and the distribution of each variable is the same as log dQ bound for Binomial distributions given in (37), K(1−δ) X P Yi ≥ K(1 − δ)θn = P {Binom(K(1 − δ), q) ≥ K(1 − δ)τn } ≤ e−K(1−δ)d(τn kq) . i=1
In view of (40), it yields that d(τn kq) ≥ (1 − ǫ′ )d(τ ∗ kq). Combining it with the last displayed equation and (47) gives K(1−δ) X 1 1 P . Yi ≥ K(1 − δ)θn ≤ 1+δ = o n n−K i=1
Thus, the desired (32) holds.
21
We prove the desired (31) next. Let τn′ = (1 − ǫ)τ ∗ + ǫp ≤ p. Let W1 ∼ Binom(K(1 − 2δ), p) and W2 ∼ Binom(Kδ, q). Then {W1 + W2 ≤ K(1 − δ)τn } ⊂ {W1 ≤ K(1 − 2δ)τn′ } ∪ {W2 ≤ K(1 − δ)τn − K(1 − 2δ)τn′ } Thus Kδ K(1−2δ) X X Yi ≤ K(1 − δ)θn = P {W1 + W2 ≤ K(1 − δ)τn } P Xi + i=1 i=1 ≤ P W1 ≤ K(1 − 2δ)τn′ + P W2 ≤ K(1 − δ)τn − K(1 − 2δ)τn′ .
Hence, to prove the desired (31), it suffices to show that P W1 ≤ K(1 − 2δ)τn′ = o(1/K) P W2 ≤ K(1 − δ)τn − K(1 − 2δ)τn′ = o(1/K). In view of the Chernoff bound for Binomial distributions given in (37), ′ P W1 ≤ K(1 − 2δ)τn′ ≤ e−K(1−2δ)d((τn kp) .
It follows from (40) that d(τn′ kp) ≥ (1 − ǫ′ )d(τ ∗ kq). Combining it with the last three displayed 1 = o(1/K). equations and (48) gives P {W1 ≤ τn′ } ≤ K 1+δ Note that K(1 − δ)τn − K(1 − 2δ)τn′ = K(1 − δ)(τ ∗ (1 − ǫ) + ǫq) − K(1 − 2δ)(τ ∗ (1 − ǫ) + ǫp) = K(1 − 2δ)ǫ(q − p) + Kδ(τ ∗ (1 − ǫ) + ǫq).
Since W2 ∼ Binom(Kδ, q), it follows that
E [W2 ] − K(1 − δ)τn − K(1 − 2δ)τn′ = K(1 − 2δ)ǫ(p − q) + Kδ(1 − ǫ)(q − τ ∗ ) ≥ K(1 − 2δ)ǫ(p − q) − Kδ(1 − ǫ)(p − q)
= K(p − q)(ǫ(1 − δ) − δ) = K(p − q)δ1/4 . Combining the last displayed equation with Chernoff bound for Binomial distributions given in (34) yields that 2 ! K(p − q)δ1/4 K(p − q)2 ′ √ = exp − P W2 ≤ K(1 − δ)τn − K(1 − 2δ)τn ≤ exp − 2E [W2 ] 2 δq Kd(pkq)(1 − q) √ ≤ exp − , 2 δ where the last inequality holds because d(pkq) ≤ (p − q)2 /(q(1 − q)) in view of the upper bound in (38). By (47) and the assumption that p is bounded away from 1, it follows that √ P W2 ≤ K(1 − δ)τn − K(1 − 2δ)τn′ ≤ e−Ω(log n/ δ) . √ δ)
If the constant δ is small enough (not depending on n), e−Ω(log n/ the proof. 22
= o(1/K), which completes
Proof of Theorem 3. (Necessity) Condition (2) is necessary for weak recovery, and hence also for exact recovery. Thus it suffices to prove (7) under the assumption that (2) holds. Note that d(pkq) ≍ (p − q)2 /p ≤ p. Therefore it follows from (2) that Kp → ∞ and certainly K → ∞. For the sake of proof by contradiction, suppose (7) fails. By focusing on a subsequence if necessary, we can assume that lim sup n→∞
K (d(τ ∗ kp) + d(τ ∗ kq)) < 1. log(Kn)
By Lemma 7, it follows that there exists some 0 < ǫ < 1 such that for all sufficiently large n, Kd(τ ∗ kq) ≤ (1 − ǫ) log n,
Kd(τ ∗ kp) ≤ (1 − ǫ) log K. 1−p The log likelihood ratios in Theorem 6 can be expressed as Li = Xi log p(1−q) q(1−p) + log 1−q , such that Xi has the Bern(p) distribution under P and the Bern(q) distribution under Q. Applying Theorem 6 with Ko = K/ log K, implies there exists a sequence θn such that (26) and (27) hold, with σ 2 = Ko varP (Li ) and Bn = log p(1−q) q(1−p) . Equivalently, expressing Li in terms of Xi for each i, 1−p and defining τn by θn = τn log p(1−q) q(1−p) + log 1−q , (26) and (27), respectively, become: P
"K−K Xo i=1
#
Xi ≤ Kτn − Ko p − 6e σ−1 ≤ Q
"K X i=1
#
Xi ≥ Kτn ≤
2 , Ko
(49)
1 , n−K
(50)
where σ̃² = K_o var_P(X_i) = K_o p(1−p). We finish the proof by arguing that such τ_n does not exist for sufficiently large n. To do so we examine the probabilities for the threshold τ*:

Q[ ∑_{i=1}^{K} X_i ≥ Kτ* ] = P{ Binom(K, q) ≥ Kτ* } ≥ Q( √(2Kd(τ*‖q)) ) = ω(1/(n−K)),

where the last inequality holds due to Kd(τ*‖q) ≤ (1−ε) log n. Thus τ_n ≥ τ* for all sufficiently large n and consequently,

P[ ∑_{i=1}^{K−K_o} X_i ≤ Kτ_n − K_o p − 6σ̃ − 1 ] ≥ P[ ∑_{i=1}^{K−K_o} X_i ≤ Kτ* − K_o p − 6σ̃ − 1 ]
  = P{ Binom(K−K_o, p) ≤ (K−K_o)v }
  ≥ Q( √(2(K−K_o)d(v‖p)) ),    (51)

where

v = (Kτ* − K_o p − 6√(K_o p(1−p)) − 1)/(K−K_o) = τ* − (K_o(p−τ*) + 6√(K_o p(1−p)) + 1)/(K−K_o).

Lemma 6 implies that τ* − q = Θ(p−q) and p − τ* = Θ(p−q). By our choice, K_o = o(K). Since K(p−q)²/p = Θ(Kd(p‖q)) → ∞, it follows that K(p−q) → ∞ and √(K_o p) = o(√(Kp)) = o(K(p−q)).
Thus v − q = Θ(p−q) − o(p−q) = Ω(p−q) ≥ 0. By the first inequality of Lemma 5,

d(v‖p) ≤ d(τ*‖p)(1 − ε′C)^{−1},

where C = 2p(1−q)/(q(1−p)) (C is bounded by assumption), and

ε′ ≜ (τ* − v)/(p − v) ≤ ( K_o(p−τ*) + 6√(K_o p) + 1 ) / ( (K−K_o)(p−τ*) ).

We have shown that p − τ* = Θ(p−q), √(K_o p) = o(K(p−q)) and K(p−q) → ∞. It follows that ε′ = o(1). Since Kd(τ*‖p) ≤ (1−ε) log K holds for all sufficiently large n, it follows that

Q( √(2(K−K_o)d(v‖p)) ) K_o/2 ≥ Q( √(2(K−K_o)(1−ε′C)^{−1} d(τ*‖p)) ) K_o/2 = K^{Ω(1)},
which together with (51) contradicts (49).

(Sufficiency) We shall apply Corollary 1. The assumptions of Theorem 3 imply that log(dP/dQ) is bounded, so Lemma 1 implies there is a universal constant c such that (18) holds for 0 < η < 1. The assumptions on p, q imply d(p‖q) = O(p), which together with Kd(p‖q) → ∞ from assumption (1) implies K → ∞. Assumption (1) then also implies (20). By Lemma 6, τ* ∈ [q, p], and the assumption (5) allows us to apply Lemma 8 to conclude that (31) and (32) hold. The conditions of Corollary 1 are satisfied, proving exact recovery.
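The necessity argument relies on the matching lower bound on the binomial upper tail, P{Binom(m, q) ≥ mτ} ≥ Q(√(2m d(τ‖q))) for τ > q, where Q denotes the standard normal tail probability. A minimal numerical sanity check of this bound is sketched below; the parameters are purely illustrative, and integer thresholds k = mτ are used to avoid rounding issues.

# Sketch: check the binomial tail lower bound
#   P{Binom(m,q) >= k} >= Q(sqrt(2 m d(k/m || q)))   for k/m > q,
# which is used in the necessity proof; parameter values are illustrative only.
import numpy as np
from scipy.stats import binom, norm

def d(a, b):
    """Binary KL divergence d(a||b) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

m, q = 500, 0.10
for k in (60, 75, 100):                      # integer thresholds, tau = k/m > q
    tau = k / m
    exact = binom.sf(k - 1, m, q)            # P{Binom(m,q) >= k}
    lower = norm.sf(np.sqrt(2 * m * d(tau, q)))
    assert lower <= exact
print("binomial tail lower bound verified on this example")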
4.4 Proof of Exact Recovery in Gaussian Case
In this subsection, we apply the general necessary conditions and sufficient conditions to prove Theorem 4. Before that, we need a key lemma.

Lemma 9. Assume that K → ∞. If (9) holds, then for all sufficiently small constants δ > 0, there exists a sequence θ_n such that (31) and (32) in Theorem 7 hold.

Proof. By the assumption (9), for all sufficiently small δ > 0,

√K µ ≥ ( √(1−δ²)/(1−2δ) ) ( √(2 log K) + √(2 log n) )    (52)

for all sufficiently large n.
for all sufficiently large n. Recall that in Theorem 7, {Xi } denotes a sequence of i.i.d. random dP under measure P ; {Yi } variables, where the distribution of each variable is the same as log dQ denotes another sequence of i.i.d random variables independent of {Xi }, and the distribution of the Gaussian case, P = N (µ, 1) and each variable is the same as log dP dQ under measure Q. In q 2(1+δ) log n Q = N (0, 1). Thus log dP and θn = µ(τn − µ/2). It dQ (Z) = µ(Z − µ/2). Let τn = K(1−δ) follows that K(1−δ) p X 1 P Yi ≥ K(1 − δ)θn = P {N (0, K(1 − δ)) ≥ K(1 − δ)τn } = Q 2(1 + δ) log n ≤ 1+δ , n i=1
2
where the last inequality holds due to Q(x) ≤ e−x /2 for x ∈ R. Thus, the desired (32) holds. Moreover, Kδ K(1−2δ) X X Yi ≤ K(1 − δ)θn = P {N (K(1 − 2δ)µ, K(1 − δ)) ≤ K(1 − δ)τn } P Xi + i=1 i=1 p 1 ≤Q 2(1 + δ) log K ≤ 1+δ , K 24
where the first inequality holds due to (52). Since K → ∞ by assumption, the desired (31) holds.

Proof of Theorem 4. (Necessity) Since K/n is bounded away from one by Assumption 1, (4) is necessary for weak recovery, hence also for exact recovery. Next we prove that (10) holds. In the Gaussian case, P = N(µ, 1) and Q = N(0, 1). The variables L_i appearing in Theorem 6 have the N(µ²/2, µ²) distribution under P and the N(−µ²/2, µ²) distribution under Q. Applying Theorem 6 with K_o = K/log K, we have that there exists a sequence θ_n such that

P[ ∑_{i=1}^{K−K_o} L_i ≤ Kθ_n − K_o µ²/2 − 6σ − B_n ] ≤ 2/K_o,

Q[ ∑_{i=1}^{K} L_i ≥ Kθ_n ] ≤ 1/(n−K),

where σ² = K_o µ² and B_n = O(µ √(log n)). Note that

P[ ∑_{i=1}^{K−K_o} L_i ≤ Kθ_n − K_o µ²/2 − 6σ − B_n ] = Q( (−Kθ_n + Kµ²/2 + 6σ + B_n) / (√(K−K_o) µ) ),

Q[ ∑_{i=1}^{K} L_i ≥ Kθ_n ] = Q( (Kθ_n + Kµ²/2) / (√K µ) ).

It follows that

Kθ_n ≤ Kµ²/2 + 6√(K_o µ²) + B_n − √(2(K−K_o)µ² log K) + O( √((K−K_o)µ² log log K) ),

Kθ_n ≥ −Kµ²/2 + √(2Kµ² log(n−K)) + O( √(Kµ² log log(n−K)) ).
Combining the last two displayed inequalities and dividing by √K µ ( √(2 log K) + √(2 log n) ), we obtain

√K µ / ( √(2 log K) + √(2 log n) ) ≥ ( √((K−K_o)/K) √(2 log K) + √(2 log(n−K)) ) / ( √(2 log K) + √(2 log n) ) + o(1) = 1 + o(1),

where the last equality holds because K → ∞, K_o = o(K), and lim sup_{n→∞} K/n < 1. Thus the desired (10) follows.

(Sufficiency) We shall apply Corollary 1. Since K → ∞, the assumption (3) implies (20). The assumption (9), together with K → ∞, allows us to apply Lemma 9, yielding that conditions (31) and (32) hold. The conditions of Corollary 1 are satisfied, proving exact recovery.
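As a sanity check on Lemma 9, the following sketch picks one illustrative triple (n, K, δ), sets µ so that (52) holds with equality, and verifies numerically that the choice τ_n = √(2(1+δ) log n/(K(1−δ))) satisfies the two Gaussian tail bounds behind (31) and (32). All numerical values are assumptions made for this check only, not values from the paper.

# Sketch: numerically verify the two Gaussian tail bounds in the proof of Lemma 9
# for one illustrative choice of (n, K, delta); mu is chosen so that (52) holds
# with equality. All values are assumptions for this check only.
import numpy as np
from scipy.stats import norm

n, K, delta = 100000, 400, 0.05
mu = (np.sqrt(1 - delta**2) / (1 - 2 * delta)) * \
     (np.sqrt(2 * np.log(K)) + np.sqrt(2 * np.log(n))) / np.sqrt(K)   # equality in (52)

tau = np.sqrt(2 * (1 + delta) * np.log(n) / (K * (1 - delta)))

# (32): P{ N(0, K(1-delta)) >= K(1-delta) tau } <= n^{-(1+delta)}
p32 = norm.sf(K * (1 - delta) * tau / np.sqrt(K * (1 - delta)))
assert p32 <= n ** (-(1 + delta))

# (31): P{ N(K(1-2 delta) mu, K(1-delta)) <= K(1-delta) tau } <= K^{-(1+delta)}
p31 = norm.cdf((K * (1 - delta) * tau - K * (1 - 2 * delta) * mu) / np.sqrt(K * (1 - delta)))
assert p31 <= K ** (-(1 + delta))
print("Gaussian tail bounds behind (31)-(32) hold for this (n, K, delta, mu)")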
Appendices

A Equivalence of weak recovery in expectation and in probability
Lemma 10. There exists an estimator ξ̂ such that E[d(ξ, ξ̂)]/K → 0 if and only if there exists an estimator ξ̂ such that d(ξ, ξ̂)/K → 0 in probability.
Proof. One direction is automatic since convergence in L¹ implies convergence in probability. Conversely, suppose d(ξ, ξ̂)/K → 0 in probability for some (sequence of) ξ̂. Then there exists a deterministic sequence ε_n → 0 such that P{ d(ξ, ξ̂) ≥ ε_n K } ≤ ε_n. Define a new estimator by

ξ̃ = ξ̂ 1{|ξ̂| ≤ K + ε_n K} + 0 · 1{|ξ̂| > K + ε_n K},

where 0 denotes the all-zero vector. Since |ξ| = K, by the triangle inequality, we have

E[d(ξ, ξ̃)] = E[ d(ξ, ξ̂) 1{|ξ̂| ≤ K + ε_n K} ] + K P{ |ξ̂| > K + ε_n K }
  ≤ ε_n K + E[ d(ξ, ξ̂) 1{d(ξ, ξ̂) > ε_n K, |ξ̂| ≤ K + ε_n K} ] + K P{ |ξ̂| > K + ε_n K }
  ≤ ε_n K + (3K + ε_n K) P{ d(ξ, ξ̂) > ε_n K } ≤ 4ε_n K + ε_n² K.

Therefore, E[d(ξ, ξ̃)]/K → 0.
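The truncation step above is simple to implement in practice; the sketch below (with illustrative names and parameters, not code from the paper) zeroes out the estimate whenever its support exceeds K + ε_n K, which is exactly the construction of ξ̃.

# Sketch of the truncation step in Lemma 10: given an estimate xi_hat of the
# K-sparse indicator xi, return the all-zero vector whenever the estimated
# support is larger than K + eps_n*K, so that rare large errors cannot inflate
# the expected error. Names and parameters are illustrative only.
import numpy as np

def truncate_estimator(xi_hat, K, eps_n):
    """Return xi_tilde = xi_hat if |xi_hat| <= K + eps_n*K, else the zero vector."""
    if xi_hat.sum() <= K + eps_n * K:
        return xi_hat
    return np.zeros_like(xi_hat)

# toy usage: a noisy estimate of a community of size K in [n]
rng = np.random.default_rng(0)
n, K, eps_n = 1000, 50, 0.1
xi = np.zeros(n, dtype=int)
xi[rng.choice(n, K, replace=False)] = 1
xi_hat = xi.copy()
xi_hat[rng.choice(n, 5, replace=False)] = 1   # a few spurious ones
xi_tilde = truncate_estimator(xi_hat, K, eps_n)
print("Hamming error of xi_tilde:", int(np.abs(xi - xi_tilde).sum()))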
B Proof of Lemma 1
We first prove (16). By definition, E_P((1−η)D(P‖Q)) = sup_{λ∈ℝ} {λ(1−η)D(P‖Q) − ψ_P(λ)}, where the supremum is achieved at λ such that ψ′_P(λ) = (1−η)D(P‖Q), because ψ_P(λ) is convex in λ. Note that ψ′_P(−1) = −D(Q‖P) and ψ′_P(0) = D(P‖Q). Thus E_P((1−η)D(P‖Q)) = sup_{−1≤λ≤0} {λ(1−η)D(P‖Q) − ψ_P(λ)}. We prove a quadratic upper bound on the rate function:
ψ_P(λ) ≤ D(P‖Q)( λ + λ²α/2 ),   −1 ≤ λ ≤ 0,    (53)
with α ≜ e^{4C}(2 + 2C), which implies E_P((1−η)D(P‖Q)) ≥ D(P‖Q) sup_{−1≤λ≤0}( −ηλ − λ²α/2 ) = D(P‖Q) η²/(2α), the desired result. To prove (53), let T = log(dP/dQ). Then ψ_P(λ) = log E_P[exp(λT)]. Since |T| ≤ C, for any λ ∈ [−1, 0],

ψ″_P(λ) = ( E_P[T² exp(λT)] E_P[exp(λT)] − E_P[T exp(λT)]² ) / E_P[exp(λT)]² ≤ E_P[T² exp(λT)] / E_P[exp(λT)] ≤ e^{2C} E_P[T²].
Next we show that

E_P[T²] ≤ (2 + 2C) e^{2C} D(P‖Q),    (54)

which implies (53) via Taylor's theorem. Note that E_P[T²] = E_Q[g(dP/dQ)], where g(x) = x log² x satisfies g(1) = g′(1) = 0 and g″(x) = (2 + 2 log x)/x. By Taylor's theorem, we have E_P[T²] ≤ ( (2+2C)e^{C}/2 ) χ²(P‖Q), where χ²(P‖Q) ≜ E_Q[(dP/dQ − 1)²]. It remains to show that D(P‖Q) can be lower bounded by χ²(P‖Q) under the assumption of bounded likelihood ratio, which can be established by the same procedure. Indeed, D(P‖Q) = E_Q[f(dP/dQ)], where f(x) = x log x with f′(1) = 1 and f″(x) = 1/x. Consequently,

D(P‖Q) ≥ (1/2) e^{−C} χ²(P‖Q),    (55)

which completes the proof of (54).

Similarly, we can prove (17). By definition, E_Q(−(1−η)D(Q‖P)) = sup_{λ∈ℝ} {−λ(1−η)D(Q‖P) − ψ_Q(λ)}, where the supremum is achieved at λ such that ψ′_Q(λ) = −(1−η)D(Q‖P), because ψ_Q(λ) is convex in λ. Note that ψ′_Q(1) = D(P‖Q) and ψ′_Q(0) = −D(Q‖P). Thus E_Q(−(1−η)D(Q‖P)) = sup_{0≤λ≤1} {−λ(1−η)D(Q‖P) − ψ_Q(λ)}. We prove a quadratic upper bound on the rate function:
ψ_Q(λ) ≤ D(Q‖P)( −λ + λ²α/2 ),   0 ≤ λ ≤ 1,    (56)
with α = e^{4C}(2 + 2C), which implies E_Q(−(1−η)D(Q‖P)) ≥ D(Q‖P) sup_{0≤λ≤1}( ηλ − λ²α/2 ) = D(Q‖P) η²/(2α), the desired result. To prove (56), recall that T = log(dP/dQ). Then ψ_Q(λ) = log E_Q[exp(λT)]. Since |T| ≤ C, for any λ ∈ [0, 1],

ψ″_Q(λ) = ( E_Q[T² exp(λT)] E_Q[exp(λT)] − E_Q[T exp(λT)]² ) / E_Q[exp(λT)]² ≤ E_Q[T² exp(λT)] / E_Q[exp(λT)] ≤ e^{2C} E_Q[T²].

Letting P′ = Q and Q′ = P and applying (54), we get that

E_Q[T²] = E_Q[ log²(dQ/dP) ] = E_{P′}[ log²(dP′/dQ′) ] ≤ (2 + 2C) e^{2C} D(P′‖Q′) = (2 + 2C) e^{2C} D(Q‖P).    (57)

Therefore, we have the desired (56) via Taylor's theorem.
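The two quadratic bounds are easy to check numerically for any pair (P, Q) with bounded log likelihood ratio. The sketch below does so for a Bernoulli pair, which is an illustrative assumption made for this check; it verifies (53) on a grid of λ ∈ [−1, 0] and the χ²-type lower bound (55).

# Sketch: numerically check the quadratic bound (53) on psi_P and the
# chi-square lower bound (55) on D(P||Q) for a Bernoulli pair with bounded
# log-likelihood ratio; the pair (p, q) is an illustrative assumption.
import numpy as np

p, q = 0.3, 0.2                                   # P = Bern(p), Q = Bern(q)
T1, T0 = np.log(p / q), np.log((1 - p) / (1 - q)) # values of T = log dP/dQ
C = max(abs(T1), abs(T0))                         # |T| <= C
D_pq = p * T1 + (1 - p) * T0                      # D(P||Q)
chi2 = (p - q) ** 2 / (q * (1 - q))               # chi^2(P||Q)
alpha = np.exp(4 * C) * (2 + 2 * C)

def psi_P(lam):
    """psi_P(lambda) = log E_P[exp(lambda T)]."""
    return np.log(p * np.exp(lam * T1) + (1 - p) * np.exp(lam * T0))

# (53): psi_P(lambda) <= D(P||Q) (lambda + lambda^2 alpha / 2) on [-1, 0]
for lam in np.linspace(-1, 0, 101):
    assert psi_P(lam) <= D_pq * (lam + lam ** 2 * alpha / 2) + 1e-12

# (55): D(P||Q) >= (1/2) e^{-C} chi^2(P||Q)
assert D_pq >= 0.5 * np.exp(-C) * chi2
print("bounds (53) and (55) hold for this Bernoulli pair")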