Statistical Algorithms and a Lower Bound for Detecting Planted Cliques
Vitaly Feldman
Elena Grigorescu∗†
Santosh S. Vempala†
Lev Reyzin‡†
Ying Xiao†
[email protected] Almaden Research Center IBM San Jose, CA 95120
[email protected] Department of Computer Science Purdue University West Lafayette, IN 47907
[email protected] Department of Mathematics, Statistics, and Computer Science University of Illinois at Chicago Chicago, IL 60607 {vempala,yxiao32}@cc.gatech.edu School of Computer Science Georgia Institute of Technology Atlanta, GA 30332
∗ This material is based upon work supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CIFellows Project. † Research supported in part by NSF awards AF-0915903 and AF-0910584. ‡ Research supported by a Simons Postdoctoral Fellowship.
Abstract
We introduce a framework for proving lower bounds on computational problems over distributions, based on defining a restricted class of algorithms called statistical algorithms. For such algorithms, access to the input distribution is limited to obtaining an estimate of the expectation of any given function on a sample drawn randomly from the input distribution, rather than directly accessing samples. Our definition captures most natural algorithms of interest in theory and in practice, e.g., moments-based methods, local search, standard iterative methods for convex optimization, MCMC and simulated annealing. Our definition and techniques are inspired by and generalize the statistical query model in learning theory [35]. For well-known problems over distributions, we give lower bounds on the complexity of any statistical algorithm. These include exponential lower bounds for moment maximization in $\mathbb{R}^n$, and a nearly optimal lower bound for detecting planted bipartite clique distributions (or planted dense subgraph distributions) when the planted clique has size $O(n^{1/2-\delta})$ for any constant δ > 0. Variants of the latter have been assumed to be hard in order to prove hardness for other problems and for cryptographic applications. Our lower bounds provide concrete evidence supporting these assumptions.
1 Introduction
We study the complexity of problems where the input is modeled as random and independent samples from an unknown distribution, usually belonging to a known family of distributions over a fixed domain. Such problems are at the heart of machine learning and statistics (and their numerous applications) and also occur in many other contexts such as compressed sensing and cryptography. Many methods exist to estimate the sample complexity of such problems, that is, the number of samples needed to achieve relevant approximation guarantees (e.g., VC dimension [50] and Rademacher complexity [5]). At the same time, proving lower bounds on the computational complexity of these problems is usually much more challenging. The traditional approach is based on finding distributions that can generate instances of a problem conjectured to be intractable, typically assuming NP ≠ RP.
Here we present a new approach: we show that a broad class of algorithms, which we refer to as statistical algorithms, must have high complexity, unconditionally. Our definition encompasses a wide variety of algorithms used in practice and in theory, such as Expectation Maximization (EM) [15], local search, MCMC optimization [47, 24], simulated annealing [37, 51], first and second order methods for linear/convex optimization, e.g., [17], k-means, Principal Component Analysis (PCA), Independent Component Analysis (ICA), Naïve Bayes, Neural Networks and many others (see [13] and [8] for proofs and many other examples). In fact, we are not aware of any practical problem that can be (provably) solved efficiently by a non-statistical algorithm but cannot be solved efficiently by a statistical algorithm.
Informally, statistical algorithms are algorithms that only use samples to obtain values $\sum_i g(x_i)$, where the $x_i$’s are the samples available to the algorithm and g is any real-valued function. Formally, such a definition would not actually restrict the power of the algorithm, and we will present the formal oracle-based definition later.
The inspiration for our model is the statistical query (SQ) model in learning theory [35], defined as a restriction of Valiant’s PAC learning model [48]. The primary goal of the restriction was to simplify the design of noise-tolerant learning algorithms. As was shown by Kearns and others in subsequent works, almost all classes of functions that can be learned efficiently can also be efficiently learned in the restricted SQ model. A notable and so far the only exception is the algorithm for learning parities, based on Gaussian elimination. As was already shown by Kearns [35], parities require exponential time to learn in the SQ model. Further, Blum et al. [10] proved that the number of SQs required for weak learning (that is, for obtaining a non-negligible advantage over random guessing) of a class of functions C over a fixed distribution D is characterized by a combinatorial parameter of C and D, referred to as SQ-DIM(C, D), the statistical query dimension.
Our notion of statistical algorithms generalizes SQ learning algorithms to any computational problem over distributions. For any problem over distributions we define a parameter of the problem that lower bounds the complexity of solving the problem by any statistical algorithm, in the same way as SQ-DIM lower bounds the complexity of learning in the SQ model. Our techniques for proving lower bounds on statistical algorithms are also based on methods developed for lower-bounding the complexity of SQ algorithms. However, as we will describe later, they depart from the known techniques in a number of significant ways that are necessary for our more general definition and our applications.
We then demonstrate several applications of our technique. Our primary application is to the problem of detecting planted bipartite cliques. We now give a brief overview of this problem and the problem of maximizing the expectation of a polynomial over samples from a unit sphere.
Detecting Planted Cliques. In the planted clique problem, we are given a graph G whose edges are generated by starting with a random graph $G_{n,1/2}$, then “planting,” i.e., adding edges to form a clique on k randomly chosen vertices. Jerrum [30] introduced the planted clique problem as a potentially easier variant of the classical problem of finding the largest clique in a random graph [34]. A random graph $G_{n,1/2}$ contains a clique of size 2 log n with high probability, and a simple greedy algorithm can find one of size log n. Finding cliques of size $(2 - \epsilon)\log n$ is a hard problem for any ε > 0. Planting a larger clique should make it easier to find one. The problem of finding the smallest k for which the planted clique can be detected in polynomial time has attracted significant attention. For $k \ge c\sqrt{n \log n}$, simply picking vertices of large degrees suffices [38]. Cliques of size $k = \Omega(\sqrt{n})$ can be found using spectral methods (Alon et al. [2], McSherry [39]), via SDPs (Feige and Krauthgamer [19]) or even simple combinatorial methods (Feige and Ron [20], Dekel et al. [14]).
There is presently no polynomial-time algorithm that can detect cliques of size below this threshold of $\Omega(\sqrt{n})$. However, for any k, there is a quasipolynomial algorithm: guess 2 log n vertices from the clique and take all their common neighbors. This is also the fastest known algorithm for any $k = O(n^{1/2-\delta})$, where δ > 0. Some evidence toward the hardness of the problem was shown by Jerrum [30], who proved that a specific approach using a Markov chain cannot be efficient for $k = o(\sqrt{n})$. The problem has been used to generate cryptographic primitives [31], and as a hardness assumption in [1, 27, 41].
In this work, we examine the bipartite version of the planted clique problem. Here a random bipartite clique of size k on both sides is planted in a random bipartite graph. A densest-subgraph version of the bipartite planted clique problem has been used as a hard problem for cryptographic applications [3].
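For concreteness, the planted clique model and the large-degree heuristic mentioned above for $k \ge c\sqrt{n \log n}$ can be sketched as follows. This is only an illustrative sketch: the function names, the seed and the toy parameters (n = 400, k = 150, chosen so that k is far above the degree threshold) are assumptions made here, not part of the cited algorithms.

```python
import random

def planted_clique_graph(n, k, rng):
    """Sample G(n, 1/2) and plant a clique on k uniformly random vertices."""
    clique = set(rng.sample(range(n), k))
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            # Pairs inside the planted set are always joined; every other
            # pair is joined by an independent fair coin flip.
            if (u in clique and v in clique) or rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    return adj, clique

def top_degree_vertices(adj, k):
    """Return the k vertices of largest degree (the large-k heuristic)."""
    by_degree = sorted(range(len(adj)), key=lambda v: len(adj[v]), reverse=True)
    return set(by_degree[:k])

# Toy parameters (assumed for illustration): k is far above sqrt(n log n).
rng = random.Random(0)
adj, clique = planted_clique_graph(400, 150, rng)
guess = top_degree_vertices(adj, 150)
overlap = len(guess & clique)
```

In this regime a planted vertex has expected degree about (k − 1) + (n − k)/2, well above the (n − 1)/2 of an unplanted vertex, which is why sorting by degree is enough; for k near $\sqrt{n}$ the two degree distributions overlap and the heuristic is no longer expected to work.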
All known bounds and algorithms for the k-clique problem can be easily adapted to the bipartite case. Therefore it is natural to suspect that new upper bounds on the planted k-clique problem would also yield new upper bounds for the bipartite case. The starting point of our investigation is that the bipartite planted k-clique problem has an equivalent formulation as a problem over distributions, defined as follows.
Problem 1. For 1 ≤ k ≤ n, let S ⊆ {1, 2, . . . , n} be a set of k vertex indices and $D_S$ be a distribution over $\{0, 1\}^n$ such that when $x \sim D_S$, with probability 1 − (k/n) the entries of x are chosen uniformly and independently from {0, 1}, and with probability k/n the k coordinates in S are set to 1 and the rest are chosen uniformly and independently from {0, 1}. For an integer t, we define the distributional planted k-biclique problem with t samples as the problem of finding the unknown subset S using t samples drawn randomly from $D_S$.
One can view samples $x_1, \ldots, x_t$ as adjacency vectors of a bipartite graph as follows: the bipartite graph has n vertices on the left (with k marked as members of the clique) and t vertices on the right. Each of the t samples gives the adjacency vector of the corresponding vertex on the right. It is not hard to see that for t = n, conditioned on the event of getting exactly k samples with planted indices, we will get a random bipartite graph with a planted k-biclique (we prove the equivalence formally in Section 4.3).
Moment Maximization. Our second example is an optimization problem defined as follows.
Problem 2. Let D be a distribution over $[-1, 1]^n$ and let $r \in \mathbb{Z}^+$. The moment maximization problem is to find a unit vector $u^*$ that maximizes the expected r’th moment of the projection of D to $u^*$, i.e.,
$$u^* = \arg\max_{u \in \mathbb{R}^n : \|u\| = 1} \mathbb{E}_{x \sim D}\left[(u \cdot x)^r\right].$$
The complexity of finding approximate optima is interesting as well. For r = 2, an optimal vector simply corresponds to the principal component of the distribution D and can be found by the singular value decomposition. For higher r, there are no efficient algorithms known, and the problem is NP-hard for some distributions [11, 28]. It can be viewed as finding the 2-norm of an r’th order tensor (the moment tensor of D). For r = 3, Frieze and Kannan [23] give a reduction from finding a planted clique in a random graph to this tensor norm maximization problem; this was extended to general r in [12]. Specifically, they show that maximizing the r’th moment (or the 2-norm of an r’th order tensor) allows one to recover planted cliques of size $\tilde{\Omega}(n^{1/r})$.
For moment maximization over a distribution that can be sampled, it is natural to consider the following type of optimization algorithm: start with some unit vector u, then estimate the gradient at u (via samples), and move along that direction staying on the sphere; repeat to reach a local maximum. Unfortunately, over the unit sphere, the expected r’th moment function can have (exponentially) many local maxima even for simple distributions.
A more sophisticated approach [32] for both problems is through Markov chains or simulated annealing; it attempts to sample unit vectors from a distribution on the sphere which is heavier on vectors that induce a higher moment, e.g., u is sampled with density proportional to $e^{f(u)}$, where f(u) is the expected r’th moment along u. This could be implemented by a Markov chain with a Metropolis filter [40, 26] ensuring a proportional steady state distribution. If the Markov chain were to mix rapidly, that would give an efficient approximation algorithm because sampling from the steady state likely gives a vector of high moment. At each step, all one needs is to be able to estimate f(u), which can be done by sampling from the input distribution.
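The local-search approach described above uses the samples only through averages. A minimal sketch of the two ingredients, the empirical estimate of f(u) and one gradient step followed by renormalization onto the sphere, is given below. The function names, the step size and the two-sample toy input are assumptions chosen for illustration; nothing here is claimed to escape the local maxima discussed in the text.

```python
import math

def empirical_moment(samples, u, r):
    """Average of (u . x)^r over the samples: an estimate of f(u)."""
    total = 0.0
    for x in samples:
        dot = sum(ui * xi for ui, xi in zip(u, x))
        total += dot ** r
    return total / len(samples)

def ascent_step(samples, u, r, step):
    """One gradient step for f(u) = E[(u . x)^r], projected back to the sphere."""
    n = len(u)
    grad = [0.0] * n
    for x in samples:
        dot = sum(ui * xi for ui, xi in zip(u, x))
        coef = r * dot ** (r - 1)  # gradient of (u . x)^r is r (u . x)^(r-1) x
        for j in range(n):
            grad[j] += coef * x[j] / len(samples)
    moved = [ui + step * gi for ui, gi in zip(u, grad)]
    norm = math.sqrt(sum(v * v for v in moved))
    return [v / norm for v in moved]  # renormalize to stay on the unit sphere

# Toy input (assumed): all mass on the first coordinate direction.
samples = [[1.0, 0.0], [1.0, 0.0]]
u0 = [0.6, 0.8]
u1 = ascent_step(samples, u0, r=3, step=0.5)
before = empirical_moment(samples, u0, 3)
after = empirical_moment(samples, u1, 3)
```

Both functions touch the data only through sums of a fixed function over the samples, which is the informal property that the oracle-based definitions in Section 2.2 make precise.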
As we will see presently, these approaches fall under a class of algorithms we call statistical algorithms and they will all have provably high complexity. We now outline our main results informally for these problems. For the planted clique problem, statistical algorithms that can query expectations of arbitrary functions to within a small tolerance need $n^{\Omega(\log n)}$ queries to detect planted cliques of size $k < n^{1/2-\delta}$ for any δ > 0. Even stronger exponential bounds apply for the more general problem of detecting planted dense subgraphs of the same size. These bounds match the current upper bounds. For moment maximization, we derive a general lower bound for the r’th moment. Even for r = 3, we get a superpolynomial lower bound on finding a vector that approximately maximizes the third moment via a reduction from the planted clique problem. These bounds also translate to sample complexity lower bounds.
To describe these results precisely and discuss exactly what they mean for the complexity of these and other important problems, we will need to define the notion of statistical algorithms, the complexity measures we use, and our main tool for proving lower bounds, a notion of statistical dimension of a set of distributions. We do this in the next section. In Section 3 we prove our general lower bound theorems, and the following two sections are applications, where we estimate the statistical dimension for specific problems.
2 Definitions and Overview
We now describe our model, approach for proving lower bounds and some applications in detail.
2.1 Problems over Distributions
We begin by formally defining the class of problems addressed by our framework.
Definition 1 (search problems over distributions). For a domain X, let $\mathcal{D}$ be a set of distributions over X, let $\mathcal{F}$ be a set of solutions and $Z : \mathcal{D} \to 2^{\mathcal{F}}$ be a map from a distribution $D \in \mathcal{D}$ to a subset of solutions $Z(D) \subseteq \mathcal{F}$ that are defined to be valid solutions for D. For t > 0, the distributional search problem Z over $\mathcal{D}$ and $\mathcal{F}$ using t samples is to find a valid solution $f \in Z(D)$ given access to t random samples from an unknown $D \in \mathcal{D}$.
We note that this definition captures decision problems by having $\mathcal{F} = \{0, 1\}$. With slight abuse of notation, for a solution $f \in \mathcal{F}$, we denote by $Z^{-1}(f)$ the set of distributions in $\mathcal{D}$ for which f is a valid solution. For some of the optimization problems we consider, it is natural to let the solution space $\mathcal{F}$ contain real-valued functions over X. An optimal solution to an optimization problem is $f^* \doteq \arg\max_{f \in \mathcal{F}} \mathbb{E}_{x \sim D}[f(x)]$. Define the valid solutions for the problem to be the set of functions that are within additive error ε of being optimal, namely $Z(D) = \{f \in \mathcal{F} \mid \mathbb{E}_{x \sim D}[f(x)] \ge \mathbb{E}_{x \sim D}[f^*(x)] - \epsilon\}$. We refer to finding such a valid function as ε-optimization.
It is important to note that the number of available samples t can have a major influence on the complexity of the problem. First, for most problems there is a minimum t for which the problem is information-theoretically solvable. This value is often referred to as the sample complexity of the problem. But even for t larger than the sample complexity of the problem, having more samples can make the problem much easier computationally. For example, in the context of attribute-efficient learning, Servedio [45] shows a problem that is intractable with few samples (under cryptographic assumptions) but is easy to solve with a larger (but still polynomial) number of samples. Our distributional planted biclique problem exhibits the same phenomenon.
2.2 Statistical Algorithms
The statistical query learning model of Kearns [35] is a restriction of the PAC model [48]. It captures algorithms that rely on empirical estimates of statistical properties of random examples of an unknown function instead of individual random examples (as in the PAC model of learning). In the same spirit, for general search, decision and optimization problems over a distribution, we define statistical algorithms as algorithms that do not see samples from the distribution but instead have access to estimates of the expectation of any bounded function of a sample from the distribution.
Definition 2 (STAT oracle). Let D be the input distribution over the domain X. For a tolerance parameter τ > 0, the STAT(τ) oracle is the oracle that, for any function h : X → [−1, 1], returns a value $v \in \left[\mathbb{E}_{x \sim D}[h(x)] - \tau,\ \mathbb{E}_{x \sim D}[h(x)] + \tau\right]$.
The general algorithmic techniques mentioned earlier can all be expressed as algorithms using the STAT oracle instead of the samples themselves, in most cases in a straightforward way. We would also like to note that in the PAC learning model some of the algorithms, such as the Perceptron algorithm, did not initially appear to fall into the SQ framework, but SQ analogues were later found for all known learning techniques except Gaussian elimination (for examples see [35] and [9]). We expect the situation to be similar even in the broader context of search problems over distributions.
The most natural realization of the STAT(τ) oracle is one that computes h on $O(1/\tau^2)$ random samples from D and returns their average. Chernoff’s bound would then imply that the estimate is within the desired tolerance (with constant probability). However, if h(x) is very biased (e.g., equal to 0 with high probability), it can be estimated within τ with fewer samples. Our primary application requires a tight bound on the number of samples necessary to solve a problem over distributions. Therefore we define a stronger version of the STAT oracle in which the tolerance is adjusted for the variance of the query function; that is, the oracle returns the expectation with the same tolerance (up to constant factors) as one can expect to get from t samples (see Sec. 3.1 for more details on this correspondence).
Definition 3 (VSTAT oracle). Let D be the input distribution over the domain X. For a sample size parameter t > 0, the VSTAT(t) oracle is the oracle that, for any function h : X → {0, 1}, returns a value $v \in [p - \tau, p + \tau]$, where $p = \mathbb{E}_{x \sim D}[h(x)]$ and $\tau = \max\left\{\frac{1}{t}, \sqrt{\frac{p(1-p)}{t}}\right\}$.
The STAT and VSTAT oracles we defined can return any value within the given tolerance and therefore can make adversarial choices. We also aim to prove lower bounds against algorithms that use a potentially more benign, “unbiased” statistical oracle. The unbiased statistical oracle gives the algorithm the true value of a boolean query function on a randomly chosen sample. This model is based on the Honest SQ model in learning by Yang [52] (which itself is based on an earlier model of Jackson [29]).
Definition 4 (1-STAT oracle). Let D be the input distribution over the domain X. The 1-STAT oracle is the oracle that, given any function h : X → {0, 1}, takes an independent random sample x from D and returns h(x).
Note that the 1-STAT oracle draws a fresh sample each time it is called.
Without resampling each time, the answers of the 1-STAT oracle could be easily used to recover any sample bit-by-bit, making it equivalent to the usual access to random samples. The query complexity of an algorithm using 1-STAT is defined to be the number of calls it makes to the 1-STAT oracle. Note that the 1-STAT oracle can be used to simulate VSTAT (with high probability) by taking the average of O(t) replies of 1-STAT for the same function h. While it might seem that access to 1-STAT gives an algorithm more power than access to STAT, we will show that 1-STAT can be simulated using STAT and also prove stronger query complexity lower bounds for unbiased statistical algorithms directly.
In the following discussion, we refer to algorithms using STAT or VSTAT oracles (instead of samples) as statistical algorithms. We refer to algorithms using the 1-STAT oracle as unbiased statistical algorithms.
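The relationship between the three oracles can be made concrete with a small simulation. The sketch below computes the VSTAT(t) tolerance from Definition 3, implements a 1-STAT call as one query evaluation on a fresh sample, and averages repeated 1-STAT replies as in the simulation remark above. The names `draw`, `average_of_one_stat` and the constant factor 4 are assumptions introduced here for illustration; a true STAT or VSTAT oracle may answer adversarially within its tolerance, which this benign simulation does not model.

```python
import math
import random

def vstat_tolerance(p, t):
    """Tolerance of VSTAT(t) for a {0,1}-valued query with expectation p."""
    return max(1.0 / t, math.sqrt(p * (1.0 - p) / t))

def one_stat(h, draw, rng):
    """1-STAT: evaluate the boolean query h on one fresh sample."""
    return h(draw(rng))

def average_of_one_stat(h, draw, num_calls, rng):
    """Average num_calls replies of 1-STAT for the same query h."""
    return sum(one_stat(h, draw, rng) for _ in range(num_calls)) / num_calls

# Toy setup (assumed): uniform distribution over {0,1}^8, query = first bit.
rng = random.Random(1)
draw = lambda r: [r.randint(0, 1) for _ in range(8)]
h = lambda x: x[0]

t = 400
# O(t) replies; the constant 4 is an arbitrary illustrative choice.
estimate = average_of_one_stat(h, draw, 4 * t, rng)
tolerance = vstat_tolerance(0.5, t)
```

For an unbiased query (p = 1/2) the tolerance is $\sqrt{1/(4t)}$, while for a query that is almost always 0 the tolerance bottoms out at 1/t, matching the observation that very biased queries can be estimated with fewer samples.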
2.3 Statistical Dimension
The main tool in our analysis is an information-theoretic bound on the complexity of statistical algorithms based on the structure of a search problem over distributions. Our definitions originate from the statistical query (SQ) dimension in learning theory [10] used to characterize SQ learning algorithms. Roughly speaking, it corresponds to the number of nearly uncorrelated labeling functions in a class (see Section 6 for the details of the definition and the relationship to our bounds).
We introduce a natural generalization and strengthening of this approach to search problems over arbitrary sets of distributions and prove lower bounds on the complexity of statistical algorithms based on the generalized notion. Our definition departs from SQ dimension in three aspects. (1) Our notion applies to any set of distributions, whereas in the learning setting all known dimensions require fixing the distribution over the domain and only allow varying the labeling function. (2) Instead of relying on a bound on pairwise correlations, our dimension relies on a bound on average correlations in a large set of distributions. This weaker condition allows us to derive the tight bounds on the complexity of statistical algorithms for the planted k-biclique problem. (3) We show that our dimension also gives lower bounds for the stronger VSTAT oracle (without incurring a quadratic loss in the sample size parameter).
We now define our dimension formally. For two functions $f, g : X \to \mathbb{R}$ and a distribution D with probability density function D(x), the inner product of f and g over D is defined as
$$\langle f, g \rangle_D \doteq \mathbb{E}_{x \sim D}[f(x)g(x)].$$
The norm of f over D is $\|f\|_D = \sqrt{\langle f, f \rangle_D}$. We remark that, by convention, the integral from the inner product is taken only over the support of D, i.e., for x ∈ X such that D(x) ≠ 0. Given a distribution D over X, let D(x) denote the probability density function of D relative to some fixed underlying measure over X (for example, the uniform distribution for discrete X or the Lebesgue measure over $\mathbb{R}^n$). Our bound is based on the inner products between functions of the following form: $(D'(x) - D(x))/D(x)$, where D′ and D are distributions over X. For this to be well-defined, we will only consider cases where D(x) = 0 implies D′(x) = 0, in which case D′(x)/D(x) is treated as 1. To see why such functions are relevant to our discussion, note that for every real-valued function f over X,
$$\mathbb{E}_{x \sim D'}[f(x)] - \mathbb{E}_{x \sim D}[f(x)] = \mathbb{E}_{x \sim D}\left[\frac{D'(x) f(x)}{D(x)}\right] - \mathbb{E}_{x \sim D}[f(x)] = \left\langle \frac{D' - D}{D}, f \right\rangle_D.$$
This means that the inner product of any function f with (D′ − D)/D is equal to the difference of expectations of f under the two distributions. We also remark that the quantity $\left\langle \frac{D'}{D} - 1, \frac{D'}{D} - 1 \right\rangle_D$ is known as the $\chi^2(D', D)$ distance and is widely used for hypothesis testing in statistics [43].
For a set $\mathcal{D}'$ of m distributions over X and a reference distribution D over X, we define the average correlation of $\mathcal{D}'$ relative to D as
$$\rho(\mathcal{D}', D) \doteq \frac{1}{m^2} \sum_{D_1, D_2 \in \mathcal{D}'} \left| \left\langle \frac{D_1}{D} - 1, \frac{D_2}{D} - 1 \right\rangle_D \right|.$$
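For distributions over a finite domain these quantities are short sums, which makes the identity above easy to check numerically. The sketch below represents a distribution as a dictionary from points to probabilities; the function names and the two-point example are assumptions made for illustration, and the average correlation is computed with absolute values of the pairwise inner products.

```python
def expectation(f, D):
    """E_{x ~ D}[f(x)] for a finite distribution D given as {point: probability}."""
    return sum(p * f(x) for x, p in D.items())

def inner(f, g, D):
    """<f, g>_D = E_{x ~ D}[f(x) g(x)], summed over the support of D only."""
    return sum(p * f(x) * g(x) for x, p in D.items() if p > 0)

def relative_density(D_prime, D):
    """The function x -> D'(x)/D(x) - 1, evaluated on the support of D."""
    return lambda x: D_prime.get(x, 0.0) / D[x] - 1.0

def average_correlation(dists, D):
    """Average of |<D1/D - 1, D2/D - 1>_D| over all ordered pairs in dists."""
    m = len(dists)
    total = 0.0
    for D1 in dists:
        for D2 in dists:
            total += abs(inner(relative_density(D1, D), relative_density(D2, D), D))
    return total / m ** 2

# Toy example (assumed): reference D is uniform on {0, 1}.
D = {0: 0.5, 1: 0.5}
D1 = {0: 0.75, 1: 0.25}
D2 = {0: 0.25, 1: 0.75}
f = lambda x: float(x)

lhs = expectation(f, D1) - expectation(f, D)   # difference of expectations
rhs = inner(relative_density(D1, D), f, D)     # <(D1 - D)/D, f>_D
rho = average_correlation([D1, D2], D)
```

Here `lhs` and `rhs` agree, illustrating that a query f can separate D1 from D only to the extent that f correlates with the relative density D1/D − 1.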
We are now ready to define the concept of statistical dimension.
Definition 5. For $\bar{\gamma} > 0$, domain X and a search problem Z over a set of solutions $\mathcal{F}$ and a class of distributions $\mathcal{D}$ over X, let d be the largest integer such that there exists a reference distribution D over X such that for every $f \in \mathcal{F}$, there exists a finite set of distributions $\mathcal{D}_f \subseteq \mathcal{D} \setminus Z^{-1}(f)$ satisfying the following property: for any subset $\mathcal{D}' \subseteq \mathcal{D}_f$ where $|\mathcal{D}'| \ge |\mathcal{D}_f|/d$, $\rho(\mathcal{D}', D) < \bar{\gamma}$. We define the statistical dimension with average correlation $\bar{\gamma}$ of Z to be d and denote it by SDA(Z, $\bar{\gamma}$).
The statistical dimension with average correlation $\bar{\gamma}$ of a search problem over distributions gives a lower bound on the complexity of any statistical algorithm for the problem that uses queries to VSTAT($1/(3\bar{\gamma})$).
Theorem 1. Let X be a domain and Z be a search problem over a set of solutions $\mathcal{F}$ and a class of distributions $\mathcal{D}$ over X. For $\bar{\gamma} > 0$ let d = SDA(Z, $\bar{\gamma}$). Any statistical algorithm requires at least d calls to the VSTAT($1/(3\bar{\gamma})$) oracle to solve Z.
It also gives a lower bound on the query complexity of any unbiased statistical algorithm.
Theorem 2. Let X be a domain and Z be a search problem over a class of solutions $\mathcal{F}$ and a class of distributions $\mathcal{D}$ over X. For $\bar{\gamma} > 0$ let d = SDA(Z, $\bar{\gamma}$). Any unbiased statistical algorithm that solves Z with probability greater than 13/14 requires at least $\min\left\{\frac{1}{8\bar{\gamma}}, \frac{d}{100}\right\}$ queries to the 1-STAT oracle.
The bound on the average correlation of large subsets upon which our notion is based can be easily obtained from a bound on pairwise correlations. Pairwise correlations are easier to analyze and therefore we now define a special case of our statistical dimension based on pairwise correlations. This version directly generalizes the statistical query dimension from learning theory (see Section 6).
Definition 6. For γ, β > 0, domain X and a search problem Z over a set of solutions $\mathcal{F}$ and a class of distributions $\mathcal{D}$ over X, let m be the maximum integer such that there exists a reference distribution D over X such that for every $f \in \mathcal{F}$, there exists a set of m distributions $\mathcal{D}_f = \{D_1, \ldots, D_m\} \subseteq \mathcal{D} \setminus Z^{-1}(f)$ satisfying the following property:
$$\left| \left\langle \frac{D_i}{D} - 1, \frac{D_j}{D} - 1 \right\rangle_D \right| \le \begin{cases} \beta & \text{for } i = j \in [m] \\ \gamma & \text{for } i \ne j \in [m]. \end{cases}$$
We define the statistical dimension with pairwise correlations (γ, β) of Z to be m and denote it by SD(Z, γ, β).
A corresponding lower bound can be obtained as a corollary of Theorem 1. We describe it only for the simpler STAT oracle since we will not need the stronger version of this bound.
Corollary 1. Let X be a domain and Z be a search problem over a set of solutions $\mathcal{F}$ and a class of distributions $\mathcal{D}$ over X. For γ, β > 0, let m = SD(Z, γ, β). For any τ > 0, any statistical algorithm requires at least $m(\tau^2 - \gamma)/(\beta - \gamma)$ calls to the STAT(τ) oracle to solve Z.
As we show in Section 3, this corollary follows by an appropriate choice of parameters. Furthermore, we can obtain a similar corollary for unbiased statistical algorithms (see Section 3.2, Corollary 3).
2.4 Lower Bounds
Before describing the lower bounds derived using our approach we address the following natural question: what do superpolynomial statistical lower bounds for a problem say about the performance of any given algorithm that only uses summations of functions over samples? If the algorithm uses each sample only for a single function evaluation, then our lower bounds against unbiased statistical algorithms give a lower bound on the sample complexity (and hence running time) of any such algorithm for the problem. In most cases, however, such an algorithm is not statistical in the formal sense since it is likely using its samples more than once and is not based on oracles. Therefore our bounds do not constitute a proof that such an algorithm fails to solve the problem. At the same time they indicate that a proof of correctness or other formal performance analysis which relies on concentration of the sums used by the algorithm (as is almost always the case) will not be possible. They also imply that the algorithm is not robust to even tiny perturbations of the values of sums (which can arise from noise, for example). Both of these points give strong evidence that the algorithm is unlikely to be successful on all input distributions.
Our main application is the following lower bound for the distributional bipartite planted clique problem.
Theorem 3. For any constant δ > 0, any $k \le n^{1/2-\delta}$ and r > 0, at least $n^{\Omega(\log r)}$ queries to VSTAT($n^2/(rk^2)$) are required to solve the distributional planted bipartite k-clique problem. In particular, no polynomial-time statistical algorithm can solve the problem using queries to VSTAT($o(n^2/k^2)$), and any statistical algorithm will require $n^{\Omega(\log n)}$ queries to VSTAT($n^{2-\delta}/k^2$).
We note that this bound is essentially tight. For every vertex in the clique, the probability that the corresponding bit of a randomly chosen point is set to 1 is 1/2 + k/(2n), whereas for every vertex not in the clique, this probability is 1/2.
Therefore using n queries of tolerance k/(4n) (or, alternatively, VSTAT($16n^2/k^2$)) it is easy to detect the planted biclique. There also exists a statistical algorithm that uses $n^{O(\log n)}$ queries to VSTAT(25n/k) to find the planted set of indices for any k ≥ log n. It is the same algorithm as for the standard planted clique problem. Guess a set T of log n indices and ask a query to VSTAT(25n/k) for the function $g_T$ which equals 1 if and only if the point has ones in all positions in T. If the set is included in the true clique then $k/n + 1/n \ge \mathbb{E}_D[g_T] \ge k/n$ and VSTAT(25n/k) will return a value ≥ k/n − (k + 1)/(5n) > 3k/(4n). If at least one guessed index is not from the true clique, then $\mathbb{E}_D[g_T] \le k/(2n) + 1/n$ and VSTAT(25n/k) will return a value ≤ (k + 2)/(2n) + (k + 2)/(5n) < 3k/(4n). From these queries it is easy to reconstruct the entire planted clique.
As noted above, the average-case planted k-biclique problem is equivalent to our distributional planted k-biclique problem with n samples (see Section 4.3). For a statistical algorithm, n samples directly correspond to having access to VSTAT(O(n)). Our bounds show that this problem can be solved in polynomial time when $k = \Omega(\sqrt{n})$. At the same time, for $k \le n^{1/2-\delta}$, any statistical algorithm will require $n^{\Omega(\log n)}$ queries to VSTAT($n^{1+\delta}$). This is exactly the state of the art for the average-case planted k-biclique and planted k-clique problems. One reason why we consider this unsurprising is that some of the algorithms that solve the problem for $k = \Omega(\sqrt{n})$ are obviously statistical. For example, the key procedure of the algorithm of Feige and Ron [20] removes a vertex from the graph which has the lowest degree (in the current graph) and then repeats until the remaining graph is a clique. Naturally, the degree is the number of ones in a column (or row) of the adjacency matrix and can be thought of as an estimate of the expectation of 1 appearing in the corresponding coordinate of a random sample. But even the more involved algorithms used for the problem, like finding the eigenvector with the largest eigenvalue or SDP solving, have statistical analogues.
We also give a bound for unbiased statistical algorithms.
Theorem 4. For any constant δ > 0 and any $k \le n^{1/2-\delta}$, $\tilde{\Omega}(n^2/k^2)$ queries are required by any unbiased statistical algorithm to solve the distributional planted k-biclique problem.
Each query of an unbiased statistical algorithm requires a new sample from D. Therefore this bound implies that any algorithm that does not reuse samples will require $\tilde{\Omega}(n^2/k^2)$ samples. To place this bound in context, we note that it is easy to detect whether a clique of size k has been planted using $\tilde{O}(n^2/k^2)$ samples (as before, to detect if a coordinate i is in the clique we can compute the average of $x_i$ on $\tilde{O}(n^2/k^2)$ samples). Of course, finding all vertices in the clique would require reusing samples (which unbiased algorithms cannot do). But even the problem of identifying whether any specific coordinate is in the clique with high (say > 1 − 1/(2n)) probability is at least as hard as detecting the whole clique, and for this setting our lower bound is tight. As before, note that $n^2/k^2 \le n$ if and only if $k \ge \sqrt{n}$.
A closely related problem is the planted densest subgraph problem, where edges in the planted subset appear with higher probability than in the remaining graph. This is a variant of the densest k-subgraph problem, which itself is a natural generalization of k-clique that asks to recover the densest k-vertex subgraph of a given n-vertex graph [18, 36, 6, 7]. The conjectured hardness of its average case variant, the planted densest subgraph problem, has been used in public key encryption schemes [3] and in analyzing parameters specific to financial markets [4].
Our lower bounds extend in a straightforward manner to this problem and in addition become exponential as the density becomes (inverse-polynomially) close to 1/2.
We next turn to applications of statistical dimension to several other natural optimization problems over distributions. In particular, we show that any statistical algorithm for the moment maximization problem defined above, as well as distributional variants of MAX-XOR-SAT and k-CLIQUE, must have high complexity. These problems are known to be NP-hard and therefore the bounds are less surprising. It might also seem that the fact that learning of parities with noise is a problem in NP and its distributional version is known to have exponential query complexity for statistical algorithms would automatically imply that all distributional versions of NP-hard problems will have exponential query complexity for statistical algorithms. This is not true in general, and in particular, Feldman and Kanade [22] show that there exist NP-hard learning problems which can be solved using a polynomial number of queries. We also note that our bounds are incomparable to NP-hardness since they are unconditional.
Theorem 5. For the r’th moment maximization problem let $\mathcal{F}$ be the set of functions indexed by all possible unit vectors $u \in \mathbb{R}^n$, defined over the domain $\{-1, 1\}^n$ with $f_u(x) = (u \cdot x)^r$. Let $\mathcal{D}$ be the set of all distributions over $\{-1, 1\}^n$. Then for r odd and δ > 0, any statistical algorithm will require at least $\tau^2\left(\binom{n}{r} - 1\right)$ queries to STAT(τ) to $\left(\frac{r!}{2(r+1)^{r/2}} - \delta\right)$-optimize over $\mathcal{F}$ and $\mathcal{D}$.
In words, any statistical algorithm that maximizes the r’th moment (for odd r) to within roughly $(r/e)^{r/2}$ must have complexity that grows as $\binom{n}{r}$. We can also obtain lower bounds for r = 3 by reduction from the planted bipartite clique problem.
Theorem 6. Suppose we can compute an approximate local maximum to $\mathbb{E}_{x \sim D_S}[(x \cdot u)^3]$ to within an additive error of $2k^{3/2}/(3n)$; then we can solve the planted bipartite clique problem.
Thus, for any constant $\delta > 0$, and $k \le n^{1/2-\delta}$, any statistical algorithm needs $n^{\Omega(\log n)}$ queries to VSTAT($n^{2-\delta}/k^2$), and any unbiased statistical algorithm needs $\Omega(n^{2-\delta}/k^2)$ queries, to approximate the maximum of $E[(x \cdot u)^3]$ to less than an additive term of $\frac{2k^{3/2}}{3n}$. In Section 5 we study other problems in our framework.
3 Lower Bounds from Statistical Dimension
Here we prove the general lower bounds. In later sections, we will compute the parameters in these bounds for specific problems of interest.
3.1 Lower Bounds for Statistical Algorithms
Before we begin with the proof of Theorem 1, we make several observations about VSTAT. First, VSTAT($t$) always returns the value of the expectation within $1/\sqrt{t}$. Therefore it is stronger than STAT($1/\sqrt{t}$) but weaker than STAT($1/t$). (STAT, unlike VSTAT, allows non-boolean functions but this is not a significant difference as any $[-1, 1]$-valued query can be converted to a logarithmic number of $\{0, 1\}$-valued queries.) Next observe that $t \cdot p(1-p)$ is the variance of the sum of $t$ random and independent samples of $h(x)$. For a sum of Bernoulli random variables, this implies that an estimate of $E_D[h]$ using $t$ samples will, with at least a constant probability, be within $\sqrt{p(1-p)/t}$ of $E_D[h]$. But also for $t > 1/(p(1-p))$, with at least a constant probability, the error of the estimate will be at least $\sqrt{p(1-p)/t}$; in other words, one cannot get a better estimate with high probability using $t$ samples. One extreme case we also need to consider is when $t < 1/(p(1-p))$ (say $t < 1/p$ for simplicity). In this case an estimate obtained from $t$ samples is likely to be 0 and we cannot distinguish (with more than a constant probability) between two values $p$ and $p'$ which are both at most $1/t$. Therefore it is necessary to set $\tau$ to be at least $1/t$.

It is also not hard to see that, for any two distributions $D$ and $D'$ and a function $h$, VSTAT($t$) will give different answers for query $h$ on distributions $D$ and $D'$ whenever it is possible to distinguish (with at least a constant probability) $D$ from $D'$ using the actual mean of $h$ on $t$ samples from $D$ (see our proof of Theorem 2 for a stronger version of this argument). In some situations VSTAT($t$) is actually slightly stronger than access to $t$ samples since VSTAT($t$) allows estimation of any number of queries with the same tolerance, whereas usual sampling will often require a logarithmic factor more samples to ensure that all estimates are within the desired tolerance. We now restate Theorem 1 for convenience.

Theorem 7 (Th. 1 restated).
Let $X$ be a domain and $Z$ be a search problem over a set of solutions $F$ and a class of distributions $\mathcal{D}$ over $X$. For $\bar{\gamma} > 0$ let $d = \mathrm{SDA}(Z, \bar{\gamma})$. Any statistical algorithm requires at least $d$ calls to the VSTAT($1/(3\bar{\gamma})$) oracle to solve $Z$.

Proof of Theorem 1. Let $A$ be a statistical algorithm that uses $q$ queries to VSTAT($1/(3\bar{\gamma})$) to solve $Z$ over a class of solutions $F$ and a class of distributions $\mathcal{D}$, such that $\mathrm{SDA}(Z, \bar{\gamma}) = d$. Let $D$ be the reference distribution for which the value $d$ is achieved. Following an approach from [21], we simulate $A$ by answering any query $h : X \to \{0, 1\}$ of $A$ with value $E_D[h(x)]$. Let $h_1, h_2, \ldots, h_q$ be the queries asked by $A$ in this simulation and let $f$ be the output of $A$. By the definition of SDA, there exists a set of distributions $\mathcal{D}_f = \{D_1, \ldots, D_m\}$ for which $f$ is not a valid solution and such that for every $\mathcal{D}' \subseteq \mathcal{D}_f$, either $\rho(\mathcal{D}', D) < \bar{\gamma}$ or $|\mathcal{D}'| \le m/d$. In the rest of the proof, for conciseness, we drop the subscript $D$ from inner products and norms.
To lower bound $q$, we use a generalization of an elegant argument of Szörényi [46]. For every $k \le q$, let $A_k$ be the set of all distributions $D_i$ such that
$$\left| E_D[h_k(x)] - E_{D_i}[h_k(x)] \right| > \tau_{i,k} = \max\left\{ \frac{1}{t}, \sqrt{\frac{p_{i,k}(1 - p_{i,k})}{t}} \right\},$$
where we use $t$ to denote $1/(3\bar{\gamma})$ and $p_{i,k}$ to denote $E_{D_i}[h_k(x)]$. To prove the desired bound we first prove the following two claims:

1. $\sum_{k \le q} |A_k| \ge m$;
2. for every $k$, $|A_k| \le m/d$.

Combining these two immediately implies the desired bound $q \ge d$.

To prove the first claim we assume, for the sake of contradiction, that there exists $D_i \notin \cup_{k \le q} A_k$. Then for every $k \le q$, $|E_D[h_k(x)] - E_{D_i}[h_k(x)]| \le \tau_{i,k}$. This implies that the replies of our simulation $E_D[h_k(x)]$ are within $\tau_{i,k}$ of $E_{D_i}[h_k(x)]$. By the definition of $A$ and VSTAT($t$), this implies that $f$ is a valid solution for $Z$ on $D_i$, contradicting the condition that $D_i \in \mathcal{D} \setminus Z^{-1}(f)$.

To prove the second claim, suppose that for some $k \in [d]$, $|A_k| > m/d$. Let $p_k = E_D[h_k(x)]$ and assume that $p_k \le 1/2$ (when $p_k > 1/2$ we just replace $h_k$ by $1 - h_k$ in the analysis below). First we note that:
$$E_{D_i}[h_k(x)] - E_D[h_k(x)] = E_D\left[\frac{D_i(x)}{D(x)} h_k(x)\right] - E_D[h_k(x)] = \left\langle h_k, \frac{D_i}{D} - 1 \right\rangle_D = p_{i,k} - p_k.$$
Let $\hat{D}_i(x) = \frac{D_i(x)}{D(x)} - 1$ (where the convention is that $\hat{D}_i(x) = 0$ if $D(x) = 0$). We will next show upper and lower bounds on the following quantity
$$\Phi = \left\langle h_k, \sum_{D_i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\rangle.$$
By Cauchy-Schwarz we have that
$$\Phi^2 = \left\langle h_k, \sum_{D_i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\rangle^2 \le \|h_k\|^2 \cdot \left\| \sum_{D_i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\|^2 \le \|h_k\|^2 \cdot \sum_{D_i, D_j \in A_k} \left| \langle \hat{D}_i, \hat{D}_j \rangle \right| \le \|h_k\|^2 \cdot \rho(A_k, D) \cdot |A_k|^2. \quad (1)$$
As before, we also have that
$$\Phi^2 = \left\langle h_k, \sum_{D_i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\rangle^2 = \left( \sum_{D_i \in A_k} \langle h_k, \hat{D}_i \rangle \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right)^2 \ge \left( \sum_{D_i \in A_k} |p_{i,k} - p_k| \right)^2. \quad (2)$$
To evaluate the last term of this inequality we claim that for every $D_i \in A_k$,
$$|p_{i,k} - p_k| \ge \sqrt{\frac{p_k}{3t}}. \quad (3)$$
We know that $|p_{i,k} - p_k| \ge \tau_{i,k} = \max\{1/t, \sqrt{p_{i,k}(1 - p_{i,k})/t}\}$. If $p_k < p_{i,k}$ then certainly $|p_{i,k} - p_k| \ge \sqrt{p_k/(3t)}$. Otherwise (when $p_{i,k} < p_k \le 1/2$), we have that $1 - p_{i,k} \ge 1/2$ and
$$p_k - p_{i,k} \ge \max\left\{ \frac{1}{t}, \sqrt{\frac{p_{i,k}(1 - p_{i,k})}{t}} \right\} \ge \max\left\{ \frac{1}{t}, \sqrt{\frac{p_{i,k}}{2t}} \right\} \ge \frac{p_{i,k}}{2}.$$
Hence $p_{i,k} \le 2p_k/3$ and $p_k - p_{i,k} \ge p_k/3$. We also know that $|p_{i,k} - p_k| \ge \tau_{i,k} \ge 1/t$ and therefore $|p_{i,k} - p_k| \ge \sqrt{\frac{p_k}{3t}}$.

By substituting equation (3) into (2) we get that $\Phi^2 \ge \frac{p_k}{3t} \cdot |A_k|^2$. We note that $h_k$ is a $\{0, 1\}$-valued function and therefore $\|h_k\|^2 = p_k$. Substituting this into equation (1) we get that $\Phi^2 \le p_k \cdot \rho(A_k, D) \cdot |A_k|^2$. By combining these two bounds on $\Phi^2$ we obtain that $\rho(A_k, D) \ge 1/(3t) = \bar{\gamma}$, which contradicts the definition of SDA.

We remark that if the statistical algorithm was using the fixed tolerance STAT($\tau$) for $\tau = \sqrt{\bar{\gamma}}$ then $\Phi^2 \ge \left( \sum_{D_i \in A_k} |p_{i,k} - p_k| \right)^2 \ge \tau^2 |A_k|^2$ and the proof could be obtained by directly combining equations (1) and (2) to get a contradiction. This also allows us to eliminate the factor of 3 in the bound and the assumption that queries are $\{0, 1\}$-valued (it suffices that $\|h_k\|^2 \le 1$). We obtain the following variant of our main bound.

Theorem 8. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $F$ and a class of distributions $\mathcal{D}$ over $X$. For $\bar{\gamma} > 0$, let $d = \mathrm{SDA}(Z, \bar{\gamma})$. Any statistical algorithm requires at least $d$ calls to the STAT($\sqrt{\bar{\gamma}}$) oracle to solve $Z$.

We now give the simple proof of the pairwise correlation version of the statistical dimension-based lower bound (Corollary 1).

Proof of Corollary 1. Take $d = m(\tau^2 - \gamma)/(\beta - \gamma)$; we will prove that $\mathrm{SDA}(Z, \tau^2) \ge d$ and apply Theorem 8. Consider a set of distributions $\mathcal{D}' \subset \mathcal{D}$ where $|\mathcal{D}'| \ge m/d = (\beta - \gamma)/(\tau^2 - \gamma)$:
$$\rho(\mathcal{D}', D) = \frac{1}{|\mathcal{D}'|^2} \sum_{D_1, D_2 \in \mathcal{D}'} \left| \left\langle \frac{D_1}{D} - 1, \frac{D_2}{D} - 1 \right\rangle_D \right| \le \frac{1}{|\mathcal{D}'|^2} \left( |\mathcal{D}'| \beta + (|\mathcal{D}'|^2 - |\mathcal{D}'|) \gamma \right) \le \gamma + \frac{\beta - \gamma}{|\mathcal{D}'|} \le \tau^2.$$
We can also use the same methods to bound the average correlation to obtain a direct bound on SDA using a bound on SD.

Corollary 2. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $F$ and a class of distributions $\mathcal{D}$ over $X$. For $\gamma, \beta > 0$, let $m = \mathrm{SD}(Z, \gamma, \beta)$. Then $\mathrm{SDA}(Z, 2\gamma) \ge \frac{m\gamma}{\beta - \gamma}$.
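The VSTAT tolerance that drives the proof of Theorem 1 can be made concrete with a small sketch. It assumes the guarantee used in this subsection, namely that VSTAT($t$) returns a value within $\max\{1/t, \sqrt{p(1-p)/t}\}$ of $p = E_D[h]$. The function names are hypothetical, and drawing the answer uniformly from the allowed interval is only one admissible oracle behavior (the lower bound argument lets the oracle answer adversarially).

```python
import math
import random


def vstat_tolerance(p: float, t: float) -> float:
    """Largest error VSTAT(t) may make on a query whose true mean is p."""
    return max(1.0 / t, math.sqrt(p * (1.0 - p) / t))


def simulate_vstat(p: float, t: float, rng: random.Random) -> float:
    """Return one admissible VSTAT(t) answer for a query with true mean p.

    Assumption: any value within the tolerance is a legal answer; a uniform
    draw from that interval is used here purely for illustration.
    """
    tau = vstat_tolerance(p, t)
    return p + rng.uniform(-tau, tau)


if __name__ == "__main__":
    rng = random.Random(0)
    t = 10_000
    for p in (0.5, 0.01, 0.0001):
        # For tiny p the 1/t term dominates; otherwise the variance term does.
        print(p, vstat_tolerance(p, t), simulate_vstat(p, t, rng))
```

For every $p \in [0, 1]$ and $t \ge 1$ the tolerance is at most $1/\sqrt{t}$ and at least $1/t$, which matches the comparison with STAT($1/\sqrt{t}$) and STAT($1/t$) made at the start of this subsection.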
3.2 Lower Bounds for Unbiased Statistical Algorithms
Next we address lower bounds on algorithms that use the 1-STAT oracle. We recall that the 1-STAT oracle returns the value of a function on a single randomly chosen point. To estimate the expectation of a function, an algorithm can simply query this oracle multiple times with the same function and average the results. As we have argued already, almost all known algorithms are already in this form or have versions that fit this model. A lower bound for this oracle directly translates to a lower bound on the number of samples that any statistical algorithm must use.

To prove lower bounds for 1-STAT, we do not have the room for deviation afforded by the tolerance of the STAT and VSTAT oracles. An adversary based on the latter oracles can vary its response in a coordinated way so as to ensure the algorithm makes slow progress. The 1-STAT oracle on the other hand provides no such power and an adversary is blind to the algorithm's choices or to its overall progress in solving the problem. Thus, a lower bound must directly address the likelihood that an algorithm has converged to a correct answer, regardless of the sequence of queries it makes. This is exactly what we will do, via a conditional probability argument over the set of possible distributions that could have generated the responses seen so far by the algorithm. For each query $h$, we think of $h(x)$ as giving a $p$-biased coin (where $p$ is determined by the distribution of the specific instance); we show that there are at least $m(1 - 1/d)$ distributions which have bias close to $p$ under this query. For these distributions, $h$ provides only a little information about the specific instance, thus many queries are necessary to solve the problem. We will need the following two lemmas before proving Theorem 2.

Lemma 1. Let $D$, $\bar{\gamma}$, $\mathcal{D}_f = \{D_1, \ldots, D_m\}$, and $d$ be as defined in Theorem 1 and its proof. For a query $h : X \to \{0, 1\}$ and $\tau = \sqrt{\bar{\gamma} \cdot E_D[h(x)]}$, let $A(h, \tau)$ be the set of all distributions $D_i$ such that $|E_D[h(x)] - E_{D_i}[h(x)]| > \tau$. Then $|A(h, \tau)| \le m/d$.

Proof. As in the proof of Theorem 1, we can obtain that $|A_k| \le m/d$ whenever $\rho(A_k, D) \ge \bar{\gamma}$. From the bounds on $\Phi^2$ we know that
$$\|h_k\|^2 \cdot \rho(A_k, D) \cdot |A_k|^2 \ge \left( \sum_{D_i \in A_k} |p_k - p_{i,k}| \right)^2.$$
In our case $h_k = h$, and hence $p_k = E_D[h(x)]$ and $p_{i,k} = E_{D_i}[h(x)]$. Therefore the conditions of the lemma imply that
$$\left( \sum_{D_i \in A_k} |p_k - p_{i,k}| \right)^2 \ge p_k \cdot \bar{\gamma} \cdot |A_k|^2.$$
Combining this with the fact that $\|h_k\|^2 = p_k$, we reach the desired conclusion. In the notation of our lemma this means $|A(h, \tau)| \le m/d$.

Lemma 2. Let $X \sim B(1, p)$. Then, for any $p' \in (0, 1)$,
$$E_X\left[ \frac{\Pr[B(1, p) \text{ generated } X]}{\Pr[B(1, p') \text{ generated } X]} \right] = 1 + \frac{(p - p')^2}{p'(1 - p')}.$$

Proof. If $X = 1$, the ratio is $p/p'$ and when $X = 0$, it is $(1 - p)/(1 - p')$. Thus, the expected ratio is
$$r = \frac{p^2}{p'} + \frac{(1 - p)^2}{1 - p'} = 1 + \frac{(p - p')^2}{p'(1 - p')}.$$
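The identity in Lemma 2 is easy to check mechanically. The sketch below evaluates both sides in exact rational arithmetic over a small grid of biases; the function names are hypothetical and the grid is an arbitrary choice made for illustration.

```python
from fractions import Fraction


def expected_likelihood_ratio(p: Fraction, p_prime: Fraction) -> Fraction:
    """E_X[Pr_p[X] / Pr_{p'}[X]] for X ~ B(1, p), computed from the definition."""
    ratio_if_one = p / p_prime                 # value of the ratio when X = 1
    ratio_if_zero = (1 - p) / (1 - p_prime)    # value of the ratio when X = 0
    return p * ratio_if_one + (1 - p) * ratio_if_zero


def lemma2_closed_form(p: Fraction, p_prime: Fraction) -> Fraction:
    """Right-hand side of Lemma 2."""
    return 1 + (p - p_prime) ** 2 / (p_prime * (1 - p_prime))


if __name__ == "__main__":
    grid = [Fraction(i, 10) for i in range(0, 11)]
    interior = [x for x in grid if 0 < x < 1]   # p' must lie strictly in (0, 1)
    for p in grid:
        for p_prime in interior:
            assert expected_likelihood_ratio(p, p_prime) == lemma2_closed_form(p, p_prime)
    print("Lemma 2 identity holds on the grid")
```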
This lemma is critical in the proof of the main lower bound.

Proof of Theorem 2. Our generative model for 1-STAT's interaction with an algorithm is as follows: 1-STAT picks as the target $D$ with probability 1/2 and with probability 1/2 picks a $D_i$ uniformly at random. Denote this random variable $\tilde{D}$. Upon a query of $h_j$, 1-STAT draws a sample $x_j$ from $\tilde{D}$, and responds with $h_j(x_j)$. After $q$ rounds, the algorithm outputs its best guess of $\tilde{D}$. Because $\tilde{D}$ is drawn randomly, it makes sense to talk about the algorithm's success probability with respect to the randomness of $\tilde{D}$ and $x_j$.

An equivalent model is as follows: there is some joint distribution over $\tilde{D}$ and the possible responses of the 1-STAT oracle. 1-STAT will not choose $\tilde{D}$ first, but will answer queries according to their marginal distributions: when the algorithm presents query $h_1$, 1-STAT returns an answer chosen according to the marginal distribution of $h_1(x_1)$ (obtained by integrating out the $\tilde{D}$ variable). Subsequently, when the algorithm asks query $h_j$, 1-STAT responds according to the marginal distribution of $h_j(x_j)$ conditioned on the previous responses $h_1(x_1), \ldots, h_{j-1}(x_{j-1})$. After the $q$'th query 1-STAT will pick $\tilde{D}$ from the marginal conditioned on $h_1(x_1), \ldots, h_q(x_q)$ and the algorithm will output a guess conditioned on $h_1(x_1), \ldots, h_q(x_q)$. It is clear that this is equivalent to the first model, but it captures the sources of randomness and available information much better. We call this the joint model, and will use it to prove our unbiased statistical algorithm lower bound.

Denote the result of the first $j$ queries as $\omega_j = (h_1(x_1), \ldots, h_j(x_j))$, and let $B$ denote an algorithm which outputs a guess based on $\omega_q$. To maximize the probability that $B$'s output and 1-STAT's are the same, we consider:
$$\max_B \ \Pr[B(\omega_q) = \tilde{D} \mid \omega_q] \quad \text{s.t.} \quad \sum_{D_i} \Pr[B(\omega_q) = D_i \mid \omega_q] = 1.$$
We can rewrite the objective function as follows, since $B$ is adapted to $\omega_q$ and is independent of $\tilde{D}$:
$$\Pr[B(\omega_q) = \tilde{D} \mid \omega_q] = \sum_{D_i} \Pr[B(\omega_q) = D_i \mid \omega_q] \Pr[\tilde{D} = D_i \mid \omega_q].$$
The optimal $B$ is deterministic and picks the $D_i$ with greatest conditional probability. By construction, $B$ has this quantity as its success probability. Since the algorithm can do no better than picking maximum conditional probabilities as its output, we will assume that it in fact does so. Clearly, making the algorithm more powerful still preserves any lower bounds. We will analyze the conditional probability of $D$ and show that this quantity never exceeds 7/8. The conditional probabilities can be rewritten by Bayes' rule:
$$\Pr[D_i \mid h_1(x_1), \ldots, h_q(x_q)] = \frac{\Pr[h_1(x_1), \ldots, h_q(x_q) \mid D_i] \Pr[D_i]}{\Pr[h_1(x_1), \ldots, h_q(x_q)]}.$$
Since the queries are adaptive, we define a random variable $H_j$ for the choice of the $j$'th query. We can then expand the conditional probability term:
$$\Pr[h_1(x_1), \ldots, h_q(x_q) \mid D_i] = \prod_{j=1}^{q} \Pr[H_j = h_j \mid D_i, \omega_{j-1}, H_1, \ldots, H_{j-1}] \Pr[h_j(x_j) \mid D_i, \omega_{j-1}, H_1, \ldots, H_j].$$
The $H_j$ random variables and $\Pr[h_1(x_1), \ldots, h_q(x_q)]$ are the same for each $D_i$, so we suppress these as a constant $c$. The $h_j(x_j)$ are conditionally independent when $D_i$ is fixed. In this case, each $h_j$ is a Bernoulli random variable with bias $p_j^i$:
$$\Pr[h_j(x_j) \mid D_i] = (p_j^i)^{h_j(x_j)} (1 - p_j^i)^{1 - h_j(x_j)}.$$
Therefore, the conditional probability is given by:
$$\Pr[D_i \mid h_1(x_1), \ldots, h_q(x_q)] = c \Pr[D_i] \prod_{j=1}^{q} (p_j^i)^{h_j(x_j)} (1 - p_j^i)^{1 - h_j(x_j)}.$$
Let $\tau = \sqrt{\bar{\gamma}}$. Using Lemma 1, we can bound the size of $A(h_j, \tau)$, which consists of $D_i$'s whose $p_j^i$ are substantially different from that of $D$ (which we shall denote by $p_j$). The number of $D_i$'s in the union of $A(h_j, \tau)$ is at most $qm/d$. Thus, with $q \le d/100$, there are at least $99m/100$ such $D_i$'s remaining.

For the remaining $D_i$'s, we know that $|p_j^i - p_j| \le \tau \sqrt{E_D[h_j]} = \tau \sqrt{p_j}$. We can always assume that $p_j \le 1/2$, since otherwise we can just use $1 - h$ in place of $h$ in the analysis. This implies that $|p_j^i - p_j| \le \tau \sqrt{p_j} \le \tau \sqrt{2 p_j (1 - p_j)}$. For every query $j$, we can now bound in expectation the increase in conditional probability using Lemma 2. The ratios change by at most
$$1 + \frac{(p_j^i - p_j)^2}{p_j(1 - p_j)} \le 1 + \frac{2\tau^2 p_j(1 - p_j)}{p_j(1 - p_j)} = 1 + 2\bar{\gamma}$$
in any round (in expectation). After $q$ queries, the expected ratio is at most $(1 + 2\bar{\gamma})^q \le 1.5$ for $q < 1/(8\bar{\gamma})$. We can obtain concentration by using Markov's inequality. Hence, $q \ge 1/(8\bar{\gamma})$. In particular, in relative terms, the conditional probability of $D$ increases by a factor of at most 1.5. In particular, if we compare the conditional probability of $D$ with the total conditional probability across all the other $D_i$, we obtain a comparison between $\Pr[D \mid h_1(x_1), \ldots, h_q(x_q)] \le (3/4)c$ and $\sum_{D_i \notin A} \Pr[D_i \mid h_1(x_1), \ldots, h_q(x_q)] \ge (99/200)c$, which yields that the conditional probability of $D$ is strictly less than 7/8.

Let $A$ denote the algorithm's output. We have the following bounds:
$$\Pr[A = D \wedge \tilde{D} \ne D] + \Pr[A \ne D \wedge \tilde{D} \ne D] = 1/2,$$
$$\Pr[A = D \wedge \tilde{D} = D] \le 1/2,$$
$$\Pr[A = D \wedge \tilde{D} = D] - 7 \Pr[A = D \wedge \tilde{D} \ne D] \le 0.$$
By taking a linear combination of these constraints in the ratio $(1, 6/7, 1/7)$, we obtain the bound
$$\Pr[A = D \wedge \tilde{D} = D] + \Pr[A \ne D \wedge \tilde{D} \ne D] \le \frac{13}{14},$$
and hence the success probability of the algorithm is bounded by 13/14. Thus,
$$q \ge \min\left\{ \frac{1}{8\bar{\gamma}}, \frac{d}{100} \right\}.$$
We conclude this section with an application of Corollary 2 to obtain a version of Theorem 2 for the simpler (pairwise) version of statistical dimension.

Corollary 3. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $F$ and a class of distributions $\mathcal{D}$ over $X$. For $\gamma, \beta > 0$ let $m = \mathrm{SD}(Z, \gamma, \beta)$. Any deterministic unbiased statistical algorithm requires at least
$$\min\left\{ \frac{1}{16\gamma}, \frac{1}{60} \sqrt{\frac{m}{\beta}} \right\}$$
samples from the 1-STAT oracle to solve $Z$.

Proof. Combining Corollary 2 with Theorem 2, we have a lower bound of
$$\max_{\gamma' \ge \gamma} \min\left\{ \frac{1}{16\gamma'}, \frac{1}{100} \cdot \frac{m\gamma'}{\beta - \gamma'} \right\}.$$
Solving for $\gamma'$ and substituting, we get our bound.
3.3 Reductions Between STAT and 1-STAT
We now show that access to the unbiased statistical oracle is essentially equivalent to access to STAT. It has been observed in the context of learning [52] that, given a boolean query function $h$, one can obtain an estimate of $E_D[h]$ using $t = O(\log(1/\delta)/\tau^2)$ samples which with probability at least $1 - \delta$ will be within $\tau$ of $E_D[h]$. We also allow real-valued query functions in our model but any such query function can be replaced by $\lceil \log(1/\tau) \rceil + 2$ boolean queries, each of tolerance $\tau/2$. A query $i$ computes bit $i$ of $1 + h(x) \in [0, 2]$, so only $\lceil \log(1/\tau) \rceil + 2$ bits are necessary to get the value of $h(x)$ within $\tau/2$. Combining these two observations gives us the following theorem.

Theorem 9. Let $Z$ be a search problem and let $A$ be a statistical algorithm that solves $Z$ using $q$ queries of tolerance $\tau$. For any $\delta > 0$, there exists an unbiased statistical algorithm $A'$ that uses at most $O(q \log(q/(\delta\tau))/\tau^2)$ samples and solves $Z$ with probability at least $1 - \delta$.

We also show a reduction in the other direction, namely that the STAT oracle can be used to simulate the 1-STAT oracle.

Theorem 10. Let $Z$ be a search problem and let $A$ be an unbiased statistical algorithm that solves $Z$ with probability at least $\delta$ using $q$ samples from 1-STAT. For any $\delta' > 0$ there exists a statistical algorithm $A'$ that uses at most $q$ queries of tolerance $\delta'/q$ and solves $Z$ with probability at least $\delta(1 - \delta')$.

Proof. $A'$ simulates $A$ as follows. Let $h_1 : X \to \{0, 1\}$ be the first query of $A$ and let $p = E_{x \sim D}[h_1(x)]$. By making the query $h_1$ to STAT($\tau$), for $\tau = \delta'/q$, we can get a value $p' \in [p - \tau, p + \tau]$. We flip a coin with bias $p'$ (that is, one that outputs 1 with probability $p'$ and 0 with probability $1 - p'$). We return the outcome to $A$. One can think of the coin flip with bias $p'$ as the coin flip with bias $p$ and then a correction with probability $|p' - p|$. Namely, if $p' > p$ then 0 is output with probability $p' - p$ and otherwise 1 is output with probability $p - p'$. This implies that our simulation can be seen as a true simulation with a random correction step that happens with probability at most $|p - p'| \le \tau = \delta'/q$. We continue the simulation of the rest of $A$'s queries analogously. By the union bound, the probability of a correction step happening during the simulation (and hence of our simulation differing from the true one) is at most $\delta'$, independently of other random events. Therefore $A'$ is successful with probability at least $\delta(1 - \delta')$.
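The simulation in the proof of Theorem 10 can be sketched in a few lines. This is a toy illustration rather than the construction's definitive form: the oracle interface `stat(h, tau)`, the helper `worst_case_stat`, and the finite domain are hypothetical, and clipping the returned estimate to $[0, 1]$ is an added assumption so that it can always be used as a coin bias.

```python
import random
from typing import Callable, Dict

Query = Callable[[int], int]                   # a {0,1}-valued query h
StatOracle = Callable[[Query, float], float]   # (h, tau) -> value within tau of E_D[h]


def make_one_stat_simulator(stat: StatOracle, q: int, delta_prime: float,
                            rng: random.Random) -> Callable[[Query], int]:
    """Answer up to q 1-STAT queries using STAT queries of tolerance delta'/q."""
    tau = delta_prime / q

    def one_stat(h: Query) -> int:
        p_est = stat(h, tau)                   # p' in [p - tau, p + tau]
        p_est = min(1.0, max(0.0, p_est))      # assumption: clip to a valid bias
        return 1 if rng.random() < p_est else 0  # coin flip with bias p'

    return one_stat


def worst_case_stat(domain_probs: Dict[int, float]) -> StatOracle:
    """Toy STAT oracle over a finite domain that always errs by +tau."""
    def stat(h: Query, tau: float) -> float:
        p = sum(prob * h(x) for x, prob in domain_probs.items())
        return p + tau
    return stat


if __name__ == "__main__":
    rng = random.Random(1)
    uniform4 = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
    sim = make_one_stat_simulator(worst_case_stat(uniform4), q=10,
                                  delta_prime=0.1, rng=rng)
    h = lambda x: 1 if x == 0 else 0           # true mean p = 0.25
    draws = [sim(h) for _ in range(20_000)]
    print(sum(draws) / len(draws))             # close to p, off by at most tau in bias
```

Each simulated answer differs in distribution from a true 1-STAT answer by at most $\tau = \delta'/q$, which is the per-query correction probability that the union bound in the proof adds up over $q$ queries.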
4 Planted Clique
4.1 Statistical dimension of planted clique
We now prove the lower bound claimed in Theorem 3 on the problem of detecting a planted $k$-clique in the given distribution on vectors from $\{0, 1\}^n$ as defined above. For a subset $S \subseteq [n]$, let $D_S$ be the distribution with a planted clique on the subset $S$. Let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$. For $i \in [m]$ we use $D_i$ to denote $D_{S_i}$. The reference distribution $D$ in our lower bounds will be the uniform distribution over $\{0, 1\}^n$, and we let $\hat{D}_S$ denote $D_S/D - 1$. In order to apply our lower bounds based on statistical dimension with average correlation, we now prove that for the planted clique problem average correlations of large sets must be small. We start with a lemma that bounds the correlation of two planted clique distributions relative to the reference distribution $D$ as a function of the overlap between the cliques.

Lemma 3. For $i, j \in [m]$,
$$\left\langle \hat{D}_i, \hat{D}_j \right\rangle_D \le \frac{2^\lambda k^2}{n^2},$$
where $\lambda = |S_i \cap S_j|$.

Proof. For the distribution $D_i$, we consider the probability $D_i(x)$ of generating the vector $x$. Then,
$$D_i(x) = \begin{cases} \left(\frac{n-k}{n}\right) \frac{1}{2^n} + \left(\frac{k}{n}\right) \frac{1}{2^{n-k}} & \text{if } \forall s \in S_i,\ x_s = 1 \\ \left(\frac{n-k}{n}\right) \frac{1}{2^n} & \text{otherwise.} \end{cases}$$
Now we compute the vector $\hat{D}_i = \frac{D_i}{D} - 1$:
$$\frac{D_i}{D} - 1 = \begin{cases} \frac{k 2^k}{n} - \frac{k}{n} & \text{if } \forall s \in S_i,\ x_s = 1 \\ -\frac{k}{n} & \text{otherwise.} \end{cases}$$
We then bound $\langle \hat{D}_i, \hat{D}_j \rangle_D$:
$$\left\langle \hat{D}_i, \hat{D}_j \right\rangle_D \le \frac{2^{n-2k+\lambda}}{2^n} \left( \frac{k 2^k}{n} - \frac{k}{n} \right)^2 + 2 \cdot \frac{2^{n-k}}{2^n} \left( \frac{k 2^k}{n} - \frac{k}{n} \right) \left( -\frac{k}{n} \right) + \left( -\frac{k}{n} \right)^2 \le \frac{2^\lambda k^2}{n^2}.$$
We now give a bound on the average correlation of any $\hat{D}_S$ with a large number of distinct clique distributions.

Lemma 4. For $\delta > 0$ and $k \le n^{1/2-\delta}$, let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$ and $\{D_1, \ldots, D_m\}$ be the corresponding distributions on $\{0, 1\}^n$. Then for any integer $\ell \le k$, set $S$ of size $k$ and subset $A \subseteq \{S_1, \ldots, S_m\}$ where $|A| \ge 4(m-1)/n^{2\ell\delta}$,
$$\frac{1}{|A|} \sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle < 2^{\ell+2} \frac{k^2}{n^2}.$$

Proof. In this proof we first show that if the total number of sets in $A$ is large then most of the sets in $A$ have a small overlap with $S$. We then use the bound on the overlap of most sets to obtain a bound on the average correlation of $D_S$ with distributions for sets in $A$.

Formally, we let $\alpha = \frac{k^2}{n^2}$ and using Lemma 3 get the bound $\langle \hat{D}_i, \hat{D}_j \rangle \le 2^{|S_i \cap S_j|} \alpha$. Summing over $S_i \in A$,
$$\sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle \le \sum_{S_i \in A} 2^{|S \cap S_i|} \alpha.$$
For any set $A \subseteq \{S_1, \ldots, S_m\}$ of size $t$ this bound is maximized when the sets of $A$ include $S$, then all sets that intersect $S$ in $k-1$ indices, then all sets that intersect $S$ in $k-2$ indices and so on until the size bound $t$ is exhausted. We can therefore assume without loss of generality that $A$ is defined in precisely this way. Let $T_\lambda = \{S_i \mid |S \cap S_i| = \lambda\}$ denote the subset of all $k$-subsets that intersect with $S$ in exactly $\lambda$ indices. Let $\lambda_0$ be the smallest $\lambda$ for which $A \cap T_\lambda$ is non-empty. We first observe that for any $1 \le j \le k-1$,
$$\frac{|T_j|}{|T_{j+1}|} = \frac{\binom{k}{j} \binom{n-k}{k-j}}{\binom{k}{j+1} \binom{n-k}{k-j-1}} = \frac{(j+1)(n-2k+j+1)}{(k-j)^2} \ge \frac{(j+1)(n-2k)}{k(k+1)} \ge \frac{(j+1) n^{2\delta}}{2}.$$
By applying this equation inductively we obtain
$$|T_j| \le \frac{2^j \cdot |T_0|}{j! \cdot n^{2\delta j}} < \frac{2^j \cdot (m-1)}{j! \cdot n^{2\delta j}}$$
and $\sum_{k \ge \lambda \ge j} |T_\lambda|$ [...] By definition of $\lambda_0$, $|A| \le \sum_{j \ge \lambda_0} |T_j|$ [...]
[...] $2(2^{j+1} |T_{j+1}|)$ we can therefore telescope the sum.

Lemma 4 gives a simple way to bound the statistical dimension with average correlation of the planted bipartite $k$-clique problem.

Theorem 11. For $\delta > 0$ and $k \le n^{1/2-\delta}$ let $Z$ be the planted bipartite $k$-clique problem. Then for any $\ell \le k$, $\mathrm{SDA}(Z, 2^{\ell+2} k^2/n^2) \ge n^{2\ell\delta}/4$.

Proof. Let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$ and $\mathcal{D} = \{D_1, \ldots, D_m\}$ be the corresponding distributions on $\{0, 1\}^n$. For every solution $S \in F$, $Z^{-1}(S) = D_S$ and let $\mathcal{D}_S = \mathcal{D} \setminus \{D_S\}$. Note that $|\mathcal{D}_S| = m - 1$. Let $\mathcal{D}'$ be a set of distributions $\mathcal{D}' \subseteq \mathcal{D}_S$ such that $|\mathcal{D}'| \ge 4(m-1)/n^{2\ell\delta}$. Then by Lemma 4, for every $S_i \in \mathcal{D}'$,
$$\frac{1}{|\mathcal{D}'|} \sum_{S_j \in \mathcal{D}'} \langle \hat{D}_i, \hat{D}_j \rangle < 2^{\ell+2} \frac{k^2}{n^2}.$$
In particular, $\rho(\mathcal{D}', D) < 2^{\ell+2} \frac{k^2}{n^2}$. By the definition of SDA, this means that $\mathrm{SDA}(Z, 2^{\ell+2} k^2/n^2) \ge n^{2\ell\delta}/4$.

Combining Theorems 8 and 11 gives Theorem 3. Theorems 2 and 11 also imply the sample complexity lower bound stated in Theorem 4.
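To get a feel for the trade-off in Theorem 11, the following sketch simply evaluates the two quantities in its statement for given parameters, together with the VSTAT parameter $1/(3\bar{\gamma})$ from Theorem 7. The function name and the example values are hypothetical; the code plugs numbers into the stated formulas and does not verify the theorem.

```python
from typing import Tuple


def planted_clique_sda_parameters(n: int, k: int, ell: int,
                                  delta: float) -> Tuple[float, float, float]:
    """Evaluate the quantities in Theorem 11 for given parameters.

    Returns (gamma_bar, d, t) where
      gamma_bar = 2^(ell+2) * k^2 / n^2   (average-correlation threshold),
      d         = n^(2*ell*delta) / 4     (lower bound on the number of queries),
      t         = 1 / (3 * gamma_bar)     (the VSTAT parameter from Theorem 7).
    """
    if not (1 <= ell <= k):
        raise ValueError("Theorem 11 requires 1 <= ell <= k")
    if k > n ** (0.5 - delta):
        raise ValueError("Theorem 11 requires k <= n^(1/2 - delta)")
    gamma_bar = 2 ** (ell + 2) * k ** 2 / n ** 2
    d = n ** (2 * ell * delta) / 4
    t = 1.0 / (3.0 * gamma_bar)
    return gamma_bar, d, t


if __name__ == "__main__":
    # Hypothetical example: n = 10^8, k = 50 <= n^(1/4), so delta = 1/4 is allowed.
    for ell in (1, 2, 4):
        print(ell, planted_clique_sda_parameters(10**8, 50, ell, 0.25))
```

Increasing $\ell$ raises the query lower bound $d$ polynomially in $n$ while the threshold $\bar{\gamma}$ only doubles per step, which is the source of the $n^{\Omega(\log n)}$ bound when $\ell$ is taken logarithmic in $n$.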
4.2 Generalized Planted Densest Subgraph
We will now show the lower bound on detecting a planted densest subset, a generalization of the planted clique problem. Problem 3. Fix 0 < q ≤ p ≤ 1. For 1 ≤ k ≤ n, let S ⊆ {1, 2, . . . , n} be a set of k vertex indices and DS be a distribution over {0, 1}n such that when x ∼ DS , with probability 1 − (k/n) the entries of x are independently chosen according to a q biased Bernoulli variable, and with probability k/n the k coordinates in S are independently chosen according to a p biased Bernoulli variable, and the rest are independently chosen according to a q biased Bernoulli variable. The generalized planted bipartite densest k-subgraph problem is to find the unknown subset S given access to samples from DS .
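The distribution $D_S$ in Problem 3 is straightforward to sample from. The sketch below follows the definition literally; the function name `sample_planted` is hypothetical and coordinates are indexed from 0 rather than 1.

```python
import random
from typing import List, Set


def sample_planted(n: int, S: Set[int], p: float, q: float,
                   rng: random.Random) -> List[int]:
    """Draw one x in {0,1}^n from D_S as described in Problem 3.

    With probability k/n (k = |S|) the coordinates in S are p-biased and the
    rest are q-biased; otherwise every coordinate is q-biased.
    """
    k = len(S)
    planted = rng.random() < k / n
    x = []
    for i in range(n):
        bias = p if (planted and i in S) else q
        x.append(1 if rng.random() < bias else 0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    S = {0, 1, 2}
    # p = 1, q = 1/2 is the planted bipartite clique distribution.
    for _ in range(5):
        print(sample_planted(8, S, 1.0, 0.5, rng))
```

Drawing $n$ such samples and stacking them as rows gives the adjacency matrix view of a bipartite graph used in Section 4.3.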
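Note that $p = 1$, $q = 1/2$ is precisely the planted bipartite clique problem. For this generalized problem, we will take $D$, the reference distribution, to be that of $n$ independent Bernoulli variables with bias $q$. Before we give our results for this problem, we have to fix some further notation: for $x \in \{0, 1\}^n$, we define $\|x\|_1 = \sum_i x_i$ (i.e., the number of 1's in $x$); similarly $\|\bar{x}\|_1 = \sum_i (1 - x_i)$ (the number of 0's in $x$). We will denote the restriction to a set by subscripting, so that $x_S$ is $x$ restricted to the subset $S \subseteq [n]$. First, we give a computation of the correlation. This is a generalized version of Lemma 3.

Lemma 5. Fix $0 < q \le p \le 1$. For $i, j \in [m]$,
$$\left\langle \hat{D}_S, \hat{D}_{S'} \right\rangle_D = \frac{k^2}{n^2} \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^\lambda - 1 \right),$$
where $\lambda = |S \cap S'|$.

Proof. For any $x$, we have $D(x) = q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1}$. For $D_S(x)$:
$$D_S(x) = \Pr[x \mid \text{planted}] \Pr[\text{planted}] + \Pr[x \mid \text{not planted}] \Pr[\text{not planted}]$$
$$= \frac{k}{n}\, p^{\|x_S\|_1} (1-p)^{\|\bar{x}_S\|_1} q^{\|x_{\bar{S}}\|_1} (1-q)^{\|\bar{x}_{\bar{S}}\|_1} + \left( 1 - \frac{k}{n} \right) q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1}.$$
For $D_S(x)/D(x) - 1$, we have:
$$\frac{D_S(x)}{D(x)} - 1 = \frac{k}{n} \cdot \frac{p^{\|x_S\|_1} (1-p)^{\|\bar{x}_S\|_1} q^{\|x_{\bar{S}}\|_1} (1-q)^{\|\bar{x}_{\bar{S}}\|_1}}{q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1}} - \frac{k}{n} = \frac{k}{n} \left( \frac{p}{q} \right)^{\|x_S\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_S\|_1} - \frac{k}{n}.$$
Now, for $S'$ where $|S \cap S'| = \lambda$, we want to compute:
$$\left\langle \hat{D}_S, \hat{D}_{S'} \right\rangle_D = \left( \frac{k}{n} \right)^2 \sum_{x \in \{0,1\}^n} q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1} \left[ \left( \frac{p}{q} \right)^{\|x_S\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_S\|_1} - 1 \right] \left[ \left( \frac{p}{q} \right)^{\|x_{S'}\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_{S'}\|_1} - 1 \right].$$
There are three types of terms in the product in the summand. We deal with all these terms by repeated applications of the Binomial theorem. The first term illustrates this approach:
$$\sum_{x \in \{0,1\}^n} q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1} = (q + (1-q))^n = 1.$$
The second type of term is given by:
$$\sum_{x \in \{0,1\}^n} q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1} \left( \frac{p}{q} \right)^{\|x_S\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_S\|_1} = \sum_{x \in \{0,1\}^n} q^{\|x_{\bar{S}}\|_1} (1-q)^{\|\bar{x}_{\bar{S}}\|_1} p^{\|x_S\|_1} (1-p)^{\|\bar{x}_S\|_1}$$
$$= \left( \sum_{y \in \{0,1\}^{|S|}} p^{\|y\|_1} (1-p)^{\|\bar{y}\|_1} \right) \left( \sum_{z \in \{0,1\}^{|\bar{S}|}} q^{\|z\|_1} (1-q)^{\|\bar{z}\|_1} \right) = 1.$$
The third type of term is more complicated. Using the above trick, we can restrict $x$ to $T = S \cup S'$ because the sum taken over the remaining $x_i$ yields 1:
$$\sum_{x \in \{0,1\}^n} q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1} \left[ \left( \frac{p}{q} \right)^{\|x_S\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_S\|_1} \right] \left[ \left( \frac{p}{q} \right)^{\|x_{S'}\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_{S'}\|_1} \right]$$
$$= \sum_{x \in \{0,1\}^{|T|}} q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1} \left[ \left( \frac{p}{q} \right)^{\|x_S\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_S\|_1} \right] \left[ \left( \frac{p}{q} \right)^{\|x_{S'}\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}_{S'}\|_1} \right].$$
Similarly, we can sum $x$ over components associated with $S \setminus S'$ and $S' \setminus S$. Hence, the sum simplifies:
$$\sum_{x \in \{0,1\}^{|S \cap S'|}} q^{\|x\|_1} (1-q)^{\|\bar{x}\|_1} \left[ \left( \frac{p}{q} \right)^{\|x\|_1} \left( \frac{1-p}{1-q} \right)^{\|\bar{x}\|_1} \right]^2 = \sum_{x \in \{0,1\}^{|S \cap S'|}} \left( \frac{p^2}{q} \right)^{\|x\|_1} \left( \frac{(1-p)^2}{1-q} \right)^{\|\bar{x}\|_1}$$
$$= \left( \frac{p^2}{q} + \frac{(1-p)^2}{1-q} \right)^\lambda = \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^\lambda.$$
Combining these three calculations yields:
$$\left\langle \hat{D}_S, \hat{D}_{S'} \right\rangle_D = \left( \frac{k}{n} \right)^2 \left[ \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^\lambda - 1 \right].$$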
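The closed form in Lemma 5 can be checked by brute force for small $n$. The sketch below sums over all $x \in \{0,1\}^n$ in exact rational arithmetic and compares the result with the formula; the function names are hypothetical and the parameter choices are arbitrary small examples. With $p = 1$, $q = 1/2$ the formula gives $(2^\lambda - 1) k^2/n^2$, which is consistent with the upper bound of Lemma 3.

```python
from fractions import Fraction
from itertools import product
from typing import Set


def d_hat(x, S: Set[int], n: int, p: Fraction, q: Fraction) -> Fraction:
    """D_S(x)/D(x) - 1 = (k/n) * (prod over S of per-coordinate ratios - 1)."""
    ratio = Fraction(1)
    for i in S:
        ratio *= (p / q) if x[i] == 1 else ((1 - p) / (1 - q))
    return Fraction(len(S), n) * (ratio - 1)


def correlation(S: Set[int], S2: Set[int], n: int,
                p: Fraction, q: Fraction) -> Fraction:
    """<D_hat_S, D_hat_S'>_D computed by summing over all of {0,1}^n."""
    total = Fraction(0)
    for x in product((0, 1), repeat=n):
        ones = sum(x)
        d_x = q ** ones * (1 - q) ** (n - ones)   # reference probability D(x)
        total += d_x * d_hat(x, S, n, p, q) * d_hat(x, S2, n, p, q)
    return total


def lemma5_formula(n: int, k: int, lam: int, p: Fraction, q: Fraction) -> Fraction:
    """Closed form from Lemma 5 with lam = |S intersect S'|."""
    return Fraction(k, n) ** 2 * ((1 + (p - q) ** 2 / (q * (1 - q))) ** lam - 1)


if __name__ == "__main__":
    n, S, S2 = 6, {0, 1, 2}, {1, 2, 3}             # overlap lambda = 2
    for p, q in [(Fraction(3, 4), Fraction(1, 4)), (Fraction(1), Fraction(1, 2))]:
        lhs = correlation(S, S2, n, p, q)
        assert lhs == lemma5_formula(n, 3, 2, p, q)
        print(p, q, lhs)
```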
Next, in analogy with Lemma 4, we give a bound on average correlation for sufficiently many distributions.

Lemma 6. Fix $0 < q \le p \le 1$. For $\delta > 0$ and $k \le n^{1/2-\delta}$, let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$ and $\{D_1, \ldots, D_m\}$ be the corresponding distributions on $\{0, 1\}^n$. Let $n^{2\delta} \ge 1 + \frac{(p-q)^2}{q(1-q)}$. Then for any integer $\ell \le k$, set $S$ of size $k$ and subset $A \subseteq \{S_1, \ldots, S_m\}$ where $|A| \ge 4(m-1)/n^{2\ell\delta}$,
$$\frac{1}{|A|} \sum_{S_i \in A} \left\langle \hat{D}_S, \hat{D}_i \right\rangle < \frac{2k^2}{n^2} \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{\ell+1} - 1 \right).$$
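Proof. We proceed as in the proof of Lemma 4. The only difference is that we must verify that the tail terms in the following sum are geometrically decreasing:
$$\sum_{j=\lambda_0+1}^{k} \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^j |T_j|.$$
In particular, it suffices to show that the succeeding term is at most half of the current term:
$$\left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^j |T_j| \ge 2 \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{j+1} |T_{j+1}|,$$
which is equivalent to:
$$\frac{|T_j|}{|T_{j+1}|} \ge 1 + \frac{(p-q)^2}{q(1-q)}.$$
From the proof of Lemma 4, we know that the left hand side is lower bounded, and then we can apply our hypothesis:
$$\frac{|T_j|}{|T_{j+1}|} \ge \frac{(j+1) n^{2\delta}}{2} \ge 1 + \frac{(p-q)^2}{q(1-q)}.$$
To conclude, with the appropriate choice of $\lambda_0$, we obtain
$$\sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle \le \alpha \sum_{j=\lambda_0}^{k} \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^j - 1 \right) |T_j \cap A|$$
$$\le \left( |T_{\lambda_0} \cap A| \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{\lambda_0} - 1 \right) + \sum_{j=\lambda_0+1}^{k} |T_j| \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^j - \sum_{j=\lambda_0+1}^{k} |T_j| \right) \alpha$$
$$\le \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{\lambda_0} |T_{\lambda_0} \cap A| + 2 \cdot |T_{\lambda_0+1}| \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{\lambda_0+1} - \sum_{j=\lambda_0+1}^{k} |T_j| \right) \alpha$$
$$\le 2\alpha |A| \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{\lambda_0+1} - 1 \right).$$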
We can now give a bound on the statistical dimension SDA:

Theorem 12. Fix $0 < q \le p \le 1$. For $\delta > 0$ and $k \le n^{1/2-\delta}$ let $Z$ be the generalized planted bipartite densest subgraph problem. Then for any $\ell \le k$,
$$\mathrm{SDA}\left( Z, \frac{2k^2}{n^2} \left( \left( 1 + \frac{(p-q)^2}{q(1-q)} \right)^{\ell+1} - 1 \right) \right) \ge n^{2\ell\delta}/4,$$
provided $n^{2\delta} \ge 1 + (p-q)^2/(q(1-q))$.
Proof. The proof is analogous to that of Theorem 11.

This SDA bound also yields lower bounds for the VSTAT oracle:

Corollary 4. Fix $0 < q \le p \le 1$. For any constant $\delta > 0$, any $k \le n^{1/2-\delta}$, $\ell \le k$, at least $n^{\Omega(\ell)}$ queries to VSTAT($n^2 (1 + (p-q)^2/(q(1-q)))^{-\ell-1}/k^2$) are required to solve the distributional planted bipartite densest $k$-subgraph with density $q$, provided that $n^{2\delta} \ge 1 + (p-q)^2/(q(1-q))$.

This leads to interesting bounds for specific choices of $p$ and $q$. For example, the following gives a lower bound for various sparse cases by taking $\ell = O(\log n)$:

Corollary 5. Fix $0 < q < p \le 1$ where $q = c/n^t$ and $p = d/n^t$. For $\delta > 0$, and $k \le n^{1/2-\delta}$, any unbiased statistical algorithm requires $\tilde{\Omega}(c n^{2+t}/(k^2 (d-c)^2))$ samples to find a generalized planted densest subgraph of size $k$.

In particular, we have a lower bound of $\tilde{\Omega}(n^3/k^2)$ for the sparse case when $p, q = O(1/n)$. This lower bound generalizes to the case when $p$ and $q$ are slightly larger, or when they are not of the same order. Thus, we can state a corollary for general $p$ and $q$:

Corollary 6. Fix $0 < q \le p \le 1$. For $\delta > 0$, and $k \le n^{1/2-\delta}$, any unbiased statistical algorithm requires $\tilde{\Omega}(q(1-q) n^2/(\ell k^2 (p-q)^2))$ samples to find a generalized planted densest subgraph of size $k$, provided that $n^{\delta} \ge 1 + (p-q)^2/(q(1-q))$.

One is often interested in the case when $q = 1/2$ and $p > q$ (the classical planted densest $k$-subgraph problem); we give the following results for this setting:

Corollary 7. For $\delta > 0$ and $k \le n^{1/2-\delta}$ let $Z$ be the planted bipartite densest subgraph problem with density $p > 1/2$. Then for any $\ell \le k$,
$$\mathrm{SDA}\left( Z, 8 \left( (2(p^2 + (1-p)^2))^\ell - 1 \right) k^2/n^2 \right) \ge n^{2\ell\delta}/4.$$

We now observe that our bound becomes exponential when $p$ approaches 1/2.

Corollary 8. For any constant $\delta > 0$, any $k \le n^{1/2-\delta}$, $\ell \le k$ and density parameter $p = 1/2 + \alpha$, at least $n^{\Omega(\ell)}$ queries to VSTAT($n^2/(\ell \alpha^2 k^2)$) are required to solve the distributional planted bipartite densest $k$-subgraph with density $p$.

For example, consider the setting $k = n^{1/3}$ and $\alpha = n^{-1/4}$. It is not hard to see that for this setting the problem can be solved on a random bipartite graph with $n$ vertices on both sides (in exponential time). Our lower bound for this setting implies that at least $n^{\Omega(n^{1/3})}$ queries to VSTAT($n^{3/2}$) will be required.

With appropriate choices of parameter settings in Corollary 7, we get the following corollary for 1-STAT.

Corollary 9. For constants $c, \delta > 0$, density $1/2 < p \le 1/2 + 1/n^c$, and $k \le n^{1/2-\delta}$, any unbiased statistical algorithm requires $\tilde{\Omega}(n^{2+2c}/k^2)$ samples to find a planted densest subgraph of size $k$.
Proof. By applying Theorem 2 and choosing $\ell = O(\log\log(n))$, it suffices to estimate $1/(8\bar{\gamma})$ to obtain the desired lower bound. From Corollary 7, we have:
$$\frac{1}{8\bar{\gamma}} = \frac{n^2}{8k^2} \cdot \frac{1}{(2(p^2 + (1-p)^2))^\ell - 1} \le \frac{n^2}{8k^2} \cdot \frac{1}{(1 + 4/n^{2c})^\ell - 1} \le \frac{n^{2+2c}}{4\ell k^2},$$
where the last inequality holds for sufficiently large $n$.
4.3 Average case vs distributional planted bipartite clique
In this section we show the equivalence between the average-case planted biclique problem (where cliques of size k × k are planted in bipartite graphs of size n × n) and the distributional biclique problem (with n examples from {0, 1}n where there is a set of k coordinates that contain a plant). The idea of both the reductions is to obtain examples from one distribution given examples from the other distribution. This is not immediate due to the fact that in the distributional biclique problem the plant does necessarily have the same size on the left set of vertices as on the right set. To remedy this, we will need to essentially search over plants of smaller and smaller sizes on one side, by replacing possibly planted vertices with common ones. Definition 7. (Average-case planted bipartite clique (biclique) problem (APBC(n, k))) Given integers 1 ≤ k ≤ n, consider the the following distribution Davg (n, k) on bipartite graphs on [n] × [n] vertices. Pick two random sets of k vertices each from left/right respectively, say S1 and S2 . Plant a bipartite clique on S1 × S2 and add an edges between each remaining pairs (i, j) with probability 1/2. The problem is to find the S1 × S2 biclique given as input a random bipartite graph G on [n] × [n] vertices chosen according to Davg (n, k). We recall the definition of the distributional biclique problem from the introduction. Definition 8. (Distributional planted biclique problem (DPBC(n, k))) Given integers 1 ≤ k ≤ n and S ⊂ [n] a set of k vertices, consider the distribution DS over {0, 1}n , such that when x ∼ DS , w.p. 1 − (k/n) the entries of x are chosen uniformly and independently from {0, 1} and with probability k/n the k coordinates in S are set to 1 and the rests are chosen uniformly and independently from {0, 1}. The problem is to find the unknown subset S given inputs from {0, 1}n chosen according to DS . Proposition 1. 
Suppose that there is an algorithm that solves APBC($n', k'$) in time $T'(n', k')$ and outputs the correct answer w.p. $p'(n', k')$. Then there exists an algorithm that solves DPBC($n, k$) in time $T(n, k) = O(nT'(n, k/2))$ and outputs the correct answer with probability $p(n, k) = (1 - 2e^{-k/8})\,p'(n, k/2)$.

Proof. We will think of the distribution $D_{avg}(n, k')$ on graphs as a distribution on their respective adjacency matrices from $\{0,1\}^{n \times n}$. Given $k$ and $n$, and access to $n$ samples from $D_S$ over $\{0,1\}^n$ for some set $S$ of size $k$, we will design an algorithm that finds $S$ and which makes $O(n)$ calls to the algorithm $A_{n,k'}$ that solves instances of APBC($n, k'$) for some values of $k'$.
Let $M$ be the $n \times n$ binary matrix whose rows are the $n$ samples from $D_S$. First apply a random permutation $\pi : [n] \to [n]$ to the columns of $M$ to obtain $M'$ (this ensures that the planted set is uniformly distributed among the $n$ coordinates, which is necessary in order to obtain instances distributed according to $D_{avg}(n, k)$). In what follows we denote by $k' \times k$ a clique with $k'$ vertices on the left and $k$ vertices on the right. Note that $M'$ has a $k' \times k$ planted clique for some $k'$ that is distributed according to a binomial distribution of $n$ Bernoulli trials with probability of success $k/n$ (i.e. $B(n, k/n)$). By a multiplicative Chernoff bound, $\Pr[k/2 < k' < 2k] \ge 1 - 2e^{-k/8}$. From now on we condition on this event occurring.

Suppose that $k/2 < k' \le k$. We aim at obtaining instances of APBC($n, k'$), but recall that there is a plant of size $k$ on the columns. To obtain only $k'$ planted columns (i.e. vertices on the right of the biclique), we sequentially replace the columns by random columns whose entries are chosen uniformly from $\{0,1\}$. That is, pick one after the other $n$ uniformly random columns of $M'$ (without replacement) and replace them by random $\{0,1\}^n$ vectors. Let $M'_1, M'_2, \ldots, M'_n$ be the sequence of matrices obtained this way. Since we started with a plant of size $k' \times k$ and ended with no plant, one of the matrices in the sequence must have had a plant of size $k' \times k'$. Let $N$ be that matrix. To obtain the $k' \times k'$ plant we run $A_{n,k'}$ on all the matrices $M'_i$, $i \in [n]$, and in particular, since $N$ is a matrix distributed according to $D_{avg}(n, k')$, $A_{n,k'}$ will output the $k' \times k'$ plant $S_1 \times S_2$ from $N$ w.p. $p'(n, k')$. Note that we only recover $k' \le k$ coordinates in this way (after applying the inverse permutation $\pi^{-1}$ to the set $S_2$), and we need to recover the entire set $S$ of $k$ indices.
To recover the remaining planted indices, one can look at the actual examples from the matrix $M$ corresponding to rows indexed by $S_1$. Then we can output all the indices $i \in [n]$ such that the value of each of these rows at $i$ is 1. Since there was a plant of size $k$, the set output must contain $S$. For simplicity we can assume that the given samples contain a unique set of $k$ planted positions.

If $k < k' < 2k$ then we apply a similar argument. Now we can aim at finding plants of size $k \times k$. This time we need to reduce the size of the plant indexed by rows of $M'$ (corresponding to the left vertices of the clique). As before we sequentially replace random rows with vectors uniformly chosen in $\{0,1\}^n$, one at a time, until we have replaced all of them. Running $A_{n,k}$ on each of the $n$ instances obtained this way lets one recover the plant $S_1 \times S_2$ w.p. $p'(n, k)$. If this is successful the correct set $S$ is $\pi^{-1}(S_2)$.

We can assume that it is harder to find smaller plants than larger ones, and so $T'(n, k') \le T'(n, k/2)$ and $p'(n, k') \ge p'(n, k/2)$. Therefore the running time of our algorithm is $T(n, k) = O(nT'(n, k/2))$, and its success probability is $p(n, k) = (1 - 2e^{-k/8})\,p'(n, k/2)$.

Proposition 2. Suppose that there is an algorithm that solves DPBC($n, k$) that runs in time $T(n, k)$ and outputs the correct answer with probability $p(n, k)$. Then there exists an algorithm that solves APBC($n', k'$) in time $T'(n', k') = O(n' T(n', k'))$ and outputs the correct answer w.p. $p'(n', k') \ge (1 - 2e^{-k/8})(p(n', k') - 2e^{-k'/8})$.

Proof. Let $A_{n,k}$ denote the algorithm for DPBC($n, k$) which takes $n$ samples chosen according to some $D_S$ and outputs the planted set $S$ w.p. $p(n, k)$ (for any $S$ of size $k$). Since the number of samples that witness the plant is distributed according to $B(n, k/n)$, as before, by Chernoff's bound this number is some $k_1$ such that $k/2 \le k_1 \le 2k$ w.p. $1 - 2e^{-k/8}$.
Therefore the distribution of the $n$ samples (viewed as $n \times n$ binary matrices) has at most $2e^{-k/8}$ mass on matrices $M$ that witness the plants on $k_1$ rows for $k_1 \notin [k/2, 2k]$. Therefore, conditioned on the algorithm seeing samples $M$ s.t. $k/2 \le k_1 \le 2k$, $A_{n,k}$ has success probability $\ge p(n, k) - 2e^{-k/8}$.

Consider the following algorithm for APBC($n, k'$) that takes as input an adjacency matrix $G$ chosen randomly according to $D_{avg}(n, k')$, which therefore contains a $k' \times k'$ plant $S_1 \times S_2$ that needs to be found. Pick $k$ according to $B(n, k'/n)$, conditioned on $k'/2 < k < 2k'$. From $G$, which has a plant of size $k' \times k'$, we would like to obtain samples $M$ with $k \times k'$ plants, or $k' \times k$ plants.

If $k'/2 < k \le k'$, as in the proof of Proposition 1 we can obtain a sample with a plant of size $k \times k'$ by sequentially replacing the $n$ rows by random $\{0,1\}^n$ vectors. As before, after running $A_{n,k'}$ on each of these $n$ matrices the $k \times k'$ plant $S'_1 \times S_2$ will be output. To complete the plant on the left side, it is enough to look at the $S_2$ indices and output all the rows $i$ s.t. the entries at $(i, j)$ take value 1 for all $j \in S_2$. This takes time $nT(n, k')$.

If $k' < k < 2k'$ then $k/2 < k' < k$, and from $G$ we can generate a sample matrix $M$ with a plant on $k' \times k$ just as before, but where now we replace the columns by random columns. This takes $nT(n, k) = O(nT(n, k'))$ time.

Therefore, $T'(n, k') = O(nT(n, k'))$ and $p'(n, k') = (1 - 2e^{-k/8})(p(n, k) - 2e^{-k/8})$, since we conditioned on the event that $k'/2 < k \le 2k'$.
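The distribution $D_S$ of Definition 8 and the final "complete the plant" step used in both proofs can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sample_dpbc` and `complete_plant` are hypothetical, and the witnessing rows are found here by peeking at the true $S$, whereas in the reduction they would come from the APBC solver.

```python
import random

def sample_dpbc(n, planted, rng):
    """Draw one x in {0,1}^n from D_S (Definition 8); `planted` plays the role of S."""
    k = len(planted)
    x = [rng.randint(0, 1) for _ in range(n)]
    if rng.random() < k / n:      # with probability k/n the sample carries the plant
        for i in planted:
            x[i] = 1
    return x

def complete_plant(rows, witness_rows):
    """Indices that equal 1 in every witnessing row (the recovery step of Proposition 1)."""
    n = len(rows[0])
    return {i for i in range(n) if all(rows[r][i] == 1 for r in witness_rows)}

rng = random.Random(0)
n, S = 64, {3, 10, 17, 40, 41, 55, 60, 63}
M = [sample_dpbc(n, S, rng) for _ in range(n)]          # n samples as rows of an n x n matrix
# Assumption for illustration only: witnesses are identified using knowledge of S.
witnesses = [r for r, row in enumerate(M) if all(row[i] == 1 for i in S)]
recovered = complete_plant(M, witnesses)
```

By construction the recovered set always contains $S$; with enough witnessing rows it typically equals $S$, which matches the remark in the proof that "the set output must contain $S$".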
5 Other Applications of Statistical Dimension
In this section, we use Definition 6 together with the bound in Corollary 1 to get unconditional lower bounds for a variety of optimization problems. The next corollary of Theorem 8 shows a setting of the parameters that is useful for the applications in this section.

Corollary 10. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $F$ and a class of distributions $D$ over $X$. If for $m > 0$, $SD(Z, \gamma = m^{-2/3}/2, \beta = 1) \ge m$, then at least $m^{1/3}/2$ calls of tolerance $m^{-1/3}$ to the STAT oracle are required to solve $Z$.

The MAX-XOR-SAT problem over a distribution is defined as follows.

Problem 4. Let $D$ be a distribution over XOR clauses of arbitrary length, in $n$ variables. The MAX-XOR-SAT problem is to find an assignment $x$ that maximizes the number of satisfied clauses under the given distribution.

In the worst case, it is known that MAX-XOR-SAT is NP-hard to approximate to within $1/2 - \epsilon$ for any constant $\epsilon$ [25]. In practice, local search algorithms such as WalkSat [44] are commonly applied as heuristics for maximum satisfiability problems. We give strong evidence that the distributional version of MAX-XOR-SAT is hard for algorithms that locally seek to improve an assignment by flipping variables so as to satisfy more clauses, giving some theoretical justification for the observations of [44]. Moreover, our proof even applies to the case when there exists an assignment that satisfies all the clauses generated by the target distribution.

Theorem 13. For the MAX-XOR-SAT problem, let $F$ be the set of functions indexed by all possible assignments in $n$ variables and whose domain is the set of all clauses. (The value that such a function takes when evaluated on a clause is the truth value of the clause under the given assignment.) Let $D$ be the set of all distributions over clauses. Then for $\delta > 0$, at least $\tau^2(2^n - 1)$ queries of tolerance $\tau$ are required to $(\frac{1}{2} - \delta)$-optimize over $F$ and $D$ for any statistical algorithm.
Next, we consider the distributional version of the k-clique problem. Note that this is different from the distributional bipartite planted clique problem primarily discussed in this paper.

Problem 5. Let $D$ be a distribution over graphs $G$. The k-clique problem is to find a subset $S$ of size $k$ that maximizes the probability that $S$ is a clique in $G$.

Detecting whether a graph has a clique of size $k$ is NP-hard [33], fixed-parameter intractable (hard for W[1] [16]), and no algorithm faster than $O(n^{0.792k})$ is known [42], even for a large constant $k$. While our lower bound does not give insight into the computational hardness of k-clique on worst-case inputs, it says that the k-clique problem over a distribution on graphs has high complexity for any statistical algorithm.

Theorem 14. For the distributional k-clique problem, let $F$ be the indicator functions indexed by subsets $S$ of $k$ vertices and whose domain is the set of all graphs on $n$ vertices, that indicate whether $S$ is a k-clique in the input graph. Let $D$ be the set of distributions over graphs on $n$ vertices. Then for $\delta > 0$, at least $\tau^2\left(\binom{n}{k} - 1\right)$ queries of tolerance $\tau$ are required to $\left(2^{-\binom{k}{2}} - \delta\right)$-optimize over $F$ and $D$ for any statistical algorithm.

A recurring concept in our constructions will be a parity function, $\chi$. We first explore some properties of parity functions.

Definition 9 (parity). For $x \in \{0,1\}^n$ and $c \in \{0,1\}^n$, let $\chi_c : \{0,1\}^n \to \{-1, 1\}$ be defined by $\chi_c(x) \doteq -(-1)^{c \cdot x}$. Namely, $\chi_c(x) = 1$ if $c \cdot x$ is odd, and $-1$ otherwise.

Note: for convenience,¹ we will sometimes use $x \in \{\pm 1\}^n$, in which case we abuse notation and define $\chi_c(x) = -\prod_{i : c_i = 1} x_i$. This corresponds to the embedding of $x$ from $\{0,1\}$ to $\{-1,1\}$ given by $0 \mapsto 1$, $1 \mapsto -1$.
Further, we define distributions uniform over the examples classified positive by a parity.

Definition 10 (distributions $D_c$). Let $x \in \{\pm 1\}^n$ and $c \in \{0,1\}^n$, and let $S_c = \{x \mid \chi_c(x) = 1\}$. We define $D_c$ to be the uniform distribution over $S_c$.

Lemma 7. For $c \in \{0,1\}^n$, $c \ne \bar{0}$, and the uniform distribution $U$ over $\{-1,1\}^n$, the following hold:
\[
1)\ \mathbb{E}_{x \sim D_c}[\chi_{c'}(x)] = \begin{cases} 1 & \text{if } c = c' \\ 0 & \text{otherwise,} \end{cases}
\qquad
2)\ \mathbb{E}_{x \sim U}[\chi_c(x)\chi_{c'}(x)] = \begin{cases} 1 & \text{if } c = c' \\ 0 & \text{otherwise.} \end{cases}
\]

Proof. To show Part 1), note that if $c = c'$ then $\mathbb{E}_{x \in S_c}[\chi_c(x)] = 1$. If $c \ne c' \ne \bar{0}$ then it is easy to see that $|S_c \cap S_{c'}| = |S_c|/2 = |S_{c'}|/2$, and so
\[
\mathbb{E}_{x \in S_c}[\chi_{c'}(x)] = \frac{1}{|S_c|}\Big(\sum_{x \in S_c \cap S_{c'}} 1 + \sum_{x \in S_c \setminus S_{c'}} (-1)\Big) = 0.
\]
Part 2) states the well-known fact that the parity functions are uncorrelated relative to the uniform distribution.
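Lemma 7 is easy to confirm by exhaustive enumeration for small $n$. The following sketch (the helper names are ours, and the check is restricted to nonzero $c, c'$ as in the proof) uses the $\pm 1$ form $\chi_c(x) = -\prod_{i : c_i = 1} x_i$ from the note after Definition 9.

```python
from itertools import product
from math import prod

n = 4
cube = list(product([-1, 1], repeat=n))                       # the domain {-1,1}^n
parities = [c for c in product([0, 1], repeat=n) if any(c)]   # all nonzero c

def chi(c, x):
    """chi_c(x) = -prod_{i: c_i = 1} x_i."""
    return -prod(xi for ci, xi in zip(c, x) if ci)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def part1(c, c2):
    """E_{x ~ D_c}[chi_{c2}(x)], with D_c uniform over S_c = {x : chi_c(x) = 1}."""
    support = [x for x in cube if chi(c, x) == 1]
    return mean(chi(c2, x) for x in support)

def part2(c, c2):
    """E_{x ~ U}[chi_c(x) chi_{c2}(x)] under the uniform distribution."""
    return mean(chi(c, x) * chi(c2, x) for x in cube)

lemma7_holds = all(
    part1(c, c2) == (1 if c == c2 else 0) and part2(c, c2) == (1 if c == c2 else 0)
    for c in parities for c2 in parities
)
```

The enumeration covers all 15 x 15 pairs of nonzero parities on 4 variables; larger $n$ works the same way but grows as $4^n \cdot 2^n$.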
These two facts will imply that when $D = U$ (the uniform distribution) and the $D_i$'s consist of the $D_c$'s, we can set $\gamma = 0$ and $\beta = 1$ when considering the statistical dimension of the problems presented in the following sections.

¹ For the moment maximization problem, it is necessary for our argument that examples $x$ be in $\{-1, 1\}^n$, whereas for MAX-XOR-SAT, the argument is much cleaner when $x$ is in $\{0, 1\}^n$. It is, therefore, natural to use the same notation for the corresponding parity problems.
5.1 MAX-XOR-SAT
We first formalize the MAX-XOR-SAT problem introduced in Problem 4. Let $D$ be a distribution over XOR clauses $c \in \{0,1\}^n$. We interpret $c_i = 1$ as variable $i$ appearing in $c$ and otherwise not; for simplicity, no variables are negated in the clauses. The problem is to find an assignment $x \in \{0,1\}^n$ that maximizes the expected number of satisfied XOR clauses. We now give the statistical dimension of this problem, from which Theorem 13 follows.

Theorem 15. For the MAX-XOR-SAT problem, let $F = \{\chi_x\}_{x \in \{0,1\}^n}$, let $D$ be the set of all distributions over clauses $c \in \{0,1\}^n$, and for any $\delta > 0$, let $Z$ be the problem of $(\frac{1}{2} - \delta)$-optimizing over $F$ and $D$. Then $SD(Z, 0, 1) \ge 2^n - 1$.

Proof. Maximizing the expected number of satisfied clauses is equivalent to maximizing the quantity
\[
\max_{x \in \{0,1\}^n} \mathbb{E}_{c \sim D}[\chi_x(c)].
\]
This proof is a fairly direct application of Lemma 7 to the definition of statistical dimension. For the conditions in Definition 6, for each of the $2^n$ possible assignments to $x$ let $D_x$ be the uniform distribution over the clauses $c \in \{0,1\}^n$ such that $\chi_c(x) = 1$. Because $\chi_c(x)$ is symmetric in $x$ and $c$, the conditions in Definition 6, with $\beta = 1$ and $\gamma = 0$, which follow from Lemma 7, are satisfied for the $2^n$ distributions $D_x$, with $D = U$. Because $\chi_c(x) = 1$ when assignment $x$ satisfies clause $c$ and $-1$ otherwise, we need to scale the approximation term by 1/2 when measuring the fraction of satisfied clauses.

Corollary 11. Any statistical algorithm for a MAX-XOR-SAT instance asymptotically requires $2^{n/3}$ queries of tolerance $2^{-n/3}$ to find an assignment that approximates the maximum probability of satisfying a clause drawn from an unknown distribution to less than an additive term of 1/2.
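The 1/2 scaling in the proof can be seen concretely on a small instance. In the sketch below (the names `satisfies` and `satisfied_fraction` are ours), the clause distribution is uniform over the clauses satisfied by a hidden assignment `target`; that assignment satisfies every clause in the support, while every other nonzero assignment satisfies exactly half of them.

```python
from itertools import product

n = 4
vectors = list(product([0, 1], repeat=n))    # assignments and clauses both live in {0,1}^n

def satisfies(x, c):
    """XOR clause c is satisfied by x when c . x is odd, i.e. chi_c(x) = 1."""
    return sum(xi * ci for xi, ci in zip(x, c)) % 2 == 1

def satisfied_fraction(x, clauses):
    """Fraction of the given clauses satisfied by x; equals (1 + E[chi_x(c)]) / 2."""
    return sum(satisfies(x, c) for c in clauses) / len(clauses)

target = (1, 0, 1, 1)                                    # hidden assignment (illustrative choice)
support = [c for c in vectors if satisfies(target, c)]   # support of D_target
fractions = {x: satisfied_fraction(x, support) for x in vectors}
```

The gap between 1 and 1/2 in `fractions` is the additive 1/2 that appears in Corollary 11.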
5.2 k-Clique

We first formalize the distributional k-clique problem. Let $D$ be a distribution over $X = \{0,1\}^{\binom{n}{2}}$, corresponding to graphs $G$ on $n$ vertices. For $G \in X$, let
\[
I_S(G) \doteq \begin{cases} 1 & \text{if } S \text{ induces a clique in } G \\ 0 & \text{otherwise.} \end{cases}
\]
The k-clique problem is to find a subset $S \subseteq V$ of size $k$ that maximizes $\mathbb{E}_{G \sim D}[I_S(G)]$. We now give the statistical dimension of distributional k-clique, from which Theorem 14 follows.

Theorem 16. For the distributional k-clique problem, let $F = \{I_S\}_{|S| = k}$, let $D$ be the set of distributions over graphs on $n$ vertices, and for any $\delta > 0$, let $Z$ be the problem of $\left(2^{-\binom{k}{2}} - \delta\right)$-optimizing over $F$ and $D$. Then $SD(Z, 0, 1) \ge \binom{n}{k} - 1$.
Proof. We shall compute the statistical dimension of distributional k-clique with $\epsilon = 2^{-\binom{k}{2}} - \delta$ (for $\delta > 0$), $\gamma = 0$, and $\beta = 1$, and show it is at least $\binom{n}{k} - 1$. For any subset of edges $T \subseteq V \times V$ and graph $G \in X$, we can define the function
\[
\mathrm{parity}_T(G, k) \doteq \begin{cases} 1 & \text{if } |E(G) \cap T| \text{ has the same parity as } \binom{k}{2} \\ -1 & \text{otherwise.} \end{cases}
\]
Note that $\mathrm{parity}_T(G, k) = (-1)^{|E(G) \cap T| + \binom{k}{2}}$. As both $T$ and $G$ lie in $\{0,1\}^{\binom{n}{2}}$, note that $\mathrm{parity}_T(G, k)$ is simply $\chi_T(G)$ (or its negation, depending on $k$).

Let $T_1, \ldots, T_d$ be all the $\binom{n}{k}$ cliques on $k$ vertices. We generate the distributions $D_1, \ldots, D_d$ so that $D_i$ is uniform on the graphs $G$ such that $|E(G) \cap T_i| \equiv \binom{k}{2} \pmod 2$. The distribution $D$ is the uniform distribution over all graphs $G$. By Lemma 7, these choices justify $\beta = 1$, $\gamma = 0$.

We notice that the set of vertices of the clique $T_i$ maximizes $\mathbb{E}_{D_i}[I_S(G)]$ while the set of edges of the clique maximizes $\mathbb{E}_{D_i}[\mathrm{parity}_T(G, k)]$; namely, we have that
\[
V(T_i) = \arg\max_{S \subseteq V : |S| = k} \Big( \mathbb{E}_{G \sim D_i}[I_S(G)] \Big) = V\Big( \arg\max_{T \subseteq V \times V} \Big( \mathbb{E}_{G \sim D_i}[\mathrm{parity}_T(G, k)] \Big) \Big).
\]
By definition $\mathbb{E}_{G \sim D_i}[\mathrm{parity}_T(G, k)] \le 1$, with equality iff $T = T_i$.

For $S_i = V(T_i)$ we have that $I_{S_i}(G) = 1$ iff $T_i$ is a clique in $G$. Since any setting of the edges not in $T_i$ appears equiprobably under $D_i$, and since there are $2^{\binom{k}{2} - 1}$ possible settings for edges between vertices in $V(T_i)$ occurring equiprobably in graphs from $D_i$, it follows that $\mathbb{E}_{G \sim D_i}[I_{S_i}(G)] = 2^{-\binom{k}{2} + 1}$.

On the other hand, if $S_j \ne V(T_i)$ then all subsets of edges among the vertices of $S_j$ appear equiprobably under $D_i$. Hence, for $j \ne i$, $\mathbb{E}_{G \sim D_i}[I_{S_j}] = 2^{-\binom{k}{2}}$, as only 1 of every $2^{\binom{k}{2}}$ subgraphs on $k$ vertices forms a clique. This allows us to set $\epsilon = 2^{-\binom{k}{2}} - \delta$, for any $\delta > 0$. Because our distributions were generated by the $k$-vertex subsets, we have shown the statistical dimension to be $\binom{n}{k} - 1$.
Corollary 12. Any statistical algorithm for a k-clique instance asymptotically requires $\binom{n}{k}^{1/3}$ queries of tolerance $\binom{n}{k}^{-1/3}$ to find a subset of $k$ vertices whose probability of inducing a clique in a graph drawn from an unknown distribution approximates the maximum such probability to less than an additive term of $2^{-\binom{k}{2}}$.
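The two expectations computed in the proof of Theorem 16 can be checked by brute force on a tiny instance. The sketch below (helper names are ours; $n = 5$, $k = 3$ is an arbitrary small choice) enumerates all graphs on $n$ vertices, keeps those in the support of $D_i$, and computes $\mathbb{E}_{G \sim D_i}[I_S(G)]$ exactly.

```python
from itertools import combinations, product
from math import comb
from fractions import Fraction

n, k = 5, 3
edges = list(combinations(range(n), 2))       # coordinates of a graph in {0,1}^C(n,2)
subsets = list(combinations(range(n), k))     # all candidate k-subsets S

def clique_edges(S):
    """Edge set of the clique on vertex set S."""
    return set(combinations(S, 2))

def expected_indicator(S, T_vertices):
    """E_{G ~ D_i}[I_S(G)], D_i uniform over graphs with |E(G) & T_i| = C(k,2) mod 2."""
    target, wanted = clique_edges(T_vertices), clique_edges(S)
    hits = total = 0
    for bits in product([0, 1], repeat=len(edges)):
        present = {e for e, b in zip(edges, bits) if b}
        if len(present & target) % 2 != comb(k, 2) % 2:
            continue                           # graph is outside the support of D_i
        total += 1
        hits += wanted <= present              # S induces a clique in this graph
    return Fraction(hits, total)

planted = subsets[0]
results = {S: expected_indicator(S, planted) for S in subsets}
```

For $k = 3$ the planted subset attains $2^{-\binom{3}{2}+1} = 1/4$ and every other subset attains $2^{-\binom{3}{2}} = 1/8$, which is the gap used to set $\epsilon$.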
5.3 Moment Maximization
We recall the moment maximization problem. Let $D$ be a distribution over $\{-1, 1\}^n$ and let $r \in \mathbb{Z}^+$. The moment maximization problem is to find a unit vector $u$ that maximizes $\mathbb{E}_{x \sim D}[(u \cdot x)^r]$. Before going to the main theorem, we need to prove a property of odd moments.

Lemma 8. Let $r \in \mathbb{Z}^+$ be odd and let $c \in \{0,1\}^n$. Let $D_c$ be the distribution uniform over $x \in \{-1, 1\}^n$ for which $\chi_c(x) = -1$. Then, $\forall u \in \mathbb{R}^n$,
\[
\mathbb{E}_{x \sim D_c}[(x \cdot u)^r] = r! \prod_{i : c_i = 1} u_i.
\]

Proof. From Lemma 9 we have that $\forall u \in \mathbb{R}^n$,
\[
\mathbb{E}_{x \sim D_c}[(x \cdot u)^r] = r! \prod_{i : c_i = 1} u_i + \mathbb{E}_{x \in \{\pm 1\}^n}[(x \cdot u)^r].
\]
The lemma follows now since, when $r$ is odd,
\[
\mathbb{E}_{x \in \{\pm 1\}^n}[(x \cdot u)^r] = \mathbb{E}_{x \in \{\pm 1\}^n}[((-x) \cdot u)^r] = 0. \tag{4}
\]
Lemma 9. Under the conditions of Lemma 8, $\forall u \in \mathbb{R}^n$,
\[
\mathbb{E}_{x \sim D_c}[(x \cdot u)^r] = r! \prod_{i : c_i = 1} u_i + \mathbb{E}_{x \in \{\pm 1\}^n}[(x \cdot u)^r]. \tag{5}
\]
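Proof. Notice that
\[
\mathbb{E}_{x \in \{\pm 1\}^n}[(x \cdot u)^r] = \frac{1}{2}\,\mathbb{E}_{\chi_c(x) = -1}\big[(x \cdot u)^r\big] + \frac{1}{2}\,\mathbb{E}_{\chi_c(x) = 1}\big[(x \cdot u)^r\big]
\]
and that
\[
-\,\mathbb{E}_{x \in \{\pm 1\}^n}\big[\chi_c(x)(x \cdot u)^r\big] = \frac{1}{2}\,\mathbb{E}_{\chi_c(x) = -1}\big[(x \cdot u)^r\big] - \frac{1}{2}\,\mathbb{E}_{\chi_c(x) = 1}\big[(x \cdot u)^r\big],
\]
therefore
\[
\mathbb{E}_{x \sim D_c}[(x \cdot u)^r] = \mathbb{E}_{x \in \{\pm 1\}^n}[(x \cdot u)^r] - \mathbb{E}_{x \in \{\pm 1\}^n}\big[\chi_c(x)(x \cdot u)^r\big].
\]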
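Equation (5) follows now by Lemma 10 below.

Lemma 10. Let $c \in \{0,1\}^n$ be an $r$-parity on the variables indexed by the set $I = \{i_1, \ldots, i_r\}$. Let $u$ be an arbitrary vector in $\mathbb{R}^n$. Then
1. $\mathbb{E}_{x \in \{\pm 1\}^n}[\chi_c(x)(x \cdot u)^i] = 0$ for $i < r$;
2. $\mathbb{E}_{x \in \{\pm 1\}^n}[\chi_c(x)(x \cdot u)^r] = -r! \prod_{i \in I} u_i$.

Proof. To prove Part 1, we have
\[
\mathbb{E}\big[\chi_c(x)(x \cdot u)^i\big]
= \mathbb{E}\Big[\chi_c(x) \sum_{t_1 + \cdots + t_n = i} \binom{i}{t_1, \ldots, t_n} \prod_{\ell \in [n]} (u_\ell x_\ell)^{t_\ell}\Big]
= \sum_{t_1 + \cdots + t_n = i} \binom{i}{t_1, \ldots, t_n}\, \mathbb{E}\Big[\chi_c(x) \prod_{\ell \in [n]} (u_\ell x_\ell)^{t_\ell}\Big].
\]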
Notice that if there is some variable $j \in I$ such that $t_j = 0$ then $\mathbb{E}_x[\chi_c(x) \prod_{\ell \in [n]} (u_\ell x_\ell)^{t_\ell}] = 0$, as the term corresponding to $x$ always cancels out with the term corresponding to the element obtained by flipping the $j$'th bit of $x$. Since $i < r$, every term $\prod_{\ell \in [n]} (u_\ell x_\ell)^{t_\ell}$ must contain some $t_j = 0$ with $j \in I$, which implies that $\mathbb{E}[\chi_c(x)(x \cdot u)^i] = 0$.

To prove Part 2 of the lemma, we will induct on $n$. For $n = r$,
\[
\mathbb{E}\big[\chi_c(x)(x \cdot u)^r\big]
= \mathbb{E}\Big[\chi_c(x) \sum_{t_1 + \cdots + t_r = r} \binom{r}{t_1, \ldots, t_r} \prod_{i \in [r]} (u_i x_i)^{t_i}\Big]
= \sum_{t_1 + \cdots + t_r = r} \binom{r}{t_1, \ldots, t_r}\, \mathbb{E}\Big[\chi_c(x) \prod_{i \in [r]} (u_i x_i)^{t_i}\Big].
\]
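As before, if some $t_j = 0$ and $j \in I = [r]$ then $\mathbb{E}[\chi_c(x) \prod_{i \in [r]} (u_i x_i)^{t_i}] = 0$, since for each $x$ and $\tilde{x}$ obtained by flipping the $j$'th bit of $x$ it is the case that $\chi_c(x) = -\chi_c(\tilde{x})$. Therefore
\[
\mathbb{E}\big[\chi_c(x)(x \cdot u)^r\big]
= \binom{r}{1, 1, \ldots, 1}\, \mathbb{E}\Big[\chi_c(x) \prod_{i \in [r]} u_i x_i\Big]
= -r!\Big(\prod_{i \in [r]} u_i\Big)\, \mathbb{E}\Big[\prod_{i \in [r]} x_i^2\Big]
= -r!\Big(\prod_{i \in [r]} u_i\Big).
\]
Assume now the identity holds for $n$. Let $c \in \{0,1\}^{n+1}$ and let $j \notin I$, and for $x \in \{\pm 1\}^{n+1}$ define $x_{-j} \in \{\pm 1\}^n$ to be $x$ with the $j$'th coordinate removed (and $u_{-j}$ likewise). Then
\begin{align*}
\mathbb{E}\big[\chi_c(x)(x \cdot u)^r\big]
&= \mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j} + x_j u_j)^r\big] \\
&= \mathbb{E}\Big[\chi_c(x) \sum_{0 \le i \le r} \binom{r}{i} (x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i\Big] \\
&= \mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^r\big] + \mathbb{E}\Big[\chi_c(x) \sum_{1 \le i \le r} \binom{r}{i} (x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i\Big] \\
&= -r! \prod_{i \in I} u_i + \sum_{1 \le i \le r} \binom{r}{i}\, \mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i\big]. \tag{6}
\end{align*}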
If $i$ is even then
\[
\mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i\big] = (u_j)^i\, \mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i}\big] = 0
\]
by Part 1 of the lemma. If $i$ is odd then
\begin{align*}
\mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i\big]
&= u_j^i\, \mathbb{E}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} x_j\big] \\
&= u_j^i \cdot \frac{1}{2}\Big(\mathbb{E}_{x_j = 1}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i}\big] - \mathbb{E}_{x_j = -1}\big[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i}\big]\Big) \\
&= 0,
\end{align*}
since $j \notin I$ and so $\chi_c(x) = \chi_c(\tilde{x})$, where $\tilde{x}$ is obtained from $x$ by flipping the $j$'th bit. We can now conclude that Equation (6) equals $-r! \prod_{i \in I} u_i$.

Corollary 13. Let $r \in \mathbb{Z}^+$ be odd² and let $c \in \{0,1\}^n$. Let $D_c$ be the distribution uniform over $x \in \{-1, 1\}^n$ for which $\chi_c(x) = -1$. Then, $\mathbb{E}_{x \sim D_c}[(x \cdot u)^r]$ is maximized when $u = r^{-1/2} c$.

Proof. From Lemma 8, clearly whenever $c_i = 0$, we have $u_i = 0$. It follows from the AM-GM inequality that the product is maximized when the remaining coordinates are equal.

Now we are ready to show the statistical dimension of moment maximization, from which Theorem 5 follows.

² This statement does not hold for $r$ even.
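Lemmas 8 and 10 can be sanity-checked by exact enumeration for a small case. The sketch below uses exact rational arithmetic; the choices $n = 5$, $r = 3$, the index set `I`, and the test vector `u` are arbitrary illustrative values, not taken from the text.

```python
from itertools import product
from fractions import Fraction
from math import factorial, prod

n, r = 5, 3
I = (0, 2, 3)                                   # variables of the r-parity c (illustrative)
u = [Fraction(v) for v in (2, -1, 3, 5, 7)]     # arbitrary test vector in R^n
cube = list(product([-1, 1], repeat=n))

def chi(x):
    """chi_c(x) = -prod_{i in I} x_i for x in {-1,1}^n."""
    return -prod(x[i] for i in I)

def dot(x):
    return sum(xi * ui for xi, ui in zip(x, u))

def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)

# Lemma 10: E[chi_c(x) (x.u)^i] for i = 0..r under the uniform distribution.
correlation = {i: mean(chi(x) * dot(x) ** i for x in cube) for i in range(r + 1)}
# Lemma 8: r-th moment under D_c, uniform over {x : chi_c(x) = -1}.
moment_on_Dc = mean(dot(x) ** r for x in cube if chi(x) == -1)
closed_form = factorial(r) * prod(u[i] for i in I)
```

Here `correlation[i]` vanishes for $i < r$, `correlation[r]` equals $-r!\prod_{i \in I} u_i$, and `moment_on_Dc` equals $r!\prod_{i \in I} u_i$, in line with Lemma 10 and Lemma 8 respectively.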
Theorem 17. For the $r$'th moment maximization problem let $F = \{(u \cdot x)^r\}_{u \in \mathbb{R}^n}$ and let $D$ be a set of distributions over $\{-1, 1\}^n$. Then for an odd $r$ and $\delta > 0$, let $Z$ denote the problem of $\left(\frac{r!}{2(r+1)^{r/2}} - \delta\right)$-optimizing over $F$ and $D$. Then $SD(Z, 0, 1) \ge \binom{n}{r} - 1$.

Proof. Let $D_1, \ldots, D_d$ be distributions where $D_i$ is uniform over all examples $x$ in $\{-1, 1\}^n$ such that $\chi_{c_i}(x) = -1$ (as in Lemma 8); this again allows us to consider $\beta = 1$ and $\gamma = 0$. Corollary 13 shows that under the distribution $D_i$, the moment function
\[
\max_{u \in \mathbb{R}^n : \|u\| = 1} \mathbb{E}_{x \sim D_i}[(u \cdot x)^r]
\]
is maximized at $u = r^{-1/2} c_i$. So, to maximize the moment, one equivalently needs to find the correct target parity.

To compute the needed $\epsilon$, for $r$ odd, Lemma 8 tells us that the expected moment is simply $r! \prod_{i : c_i = 1} u_i$, and for unit vectors, it is maximized when $u_i = r^{-1/2}$ for all $i$ with $c_i = 1$ (and $u_i = 0$ for the other coordinates). This yields a maximum moment of $(r!)\, r^{-r/2}$ for any $D_i$.

In comparison, if the measured moment is equal to $(r!)(r+1)^{-r/2}$, a simple consequence of Lemma 8 is that to minimize $\sum_{i : c_i = 1} u_i^2$, we have $u_i = (r+1)^{-1/2}$ for all $i$ s.t. $c_i = 1$. Hence, for all $i$ s.t. $c_i = 0$, $u_i^2$ cannot exceed $1 - r((r+1)^{-1/2})^2 = (r+1)^{-1}$, that is, $|u_i| \le (r+1)^{-1/2}$, implying a moment of at most $(r!)(r+1)^{-r/2}$ on $D_{c'}$. This gives a bound of
\[
\epsilon \ge (r!)\, r^{-r/2} - (r!)(r+1)^{-r/2} \ge \frac{r!}{2(r+1)^{r/2}}.
\]
The $\binom{n}{r}$ parities generating the different distributions give the statistical dimension.

Corollary 14. For $r$ odd, any statistical algorithm for the moment maximization problem asymptotically requires $\binom{n}{r}^{1/3}$ queries of tolerance $\binom{n}{r}^{-1/3}$ to approximate the $r$-th moment to less than an additive term of $\frac{r!}{2(r+1)^{r/2}}$.
5.4 Moment maximization lower bounds via reductions
For $r = 3$, we can obtain the stronger lower bound of Theorem 6 by reduction from the planted bipartite clique problem. In this section, let $D_S$ be a planted clique distribution as in Problem 1; without loss of generality $S = [k]$; our reduction uses the maximum of the third moment of $D_S$.

Lemma 11. Let $u^*$ be the optimum of
\[
\max_{u \in \mathbb{R}^n : \|u\| = 1} \mathbb{E}_{x \sim D_S}[(x \cdot u)^3], \tag{7}
\]
then $u^*_i = u^*_j$ for all $i, j \in S$. In particular, the optimal solution is $u^*_i = 1/\sqrt{k}$ for $i \in S$ and 0 otherwise. The optimal value is $k^{5/2}/n$.

Proof. Let us give an alternative representation for the third moment:
\begin{align*}
\mathbb{E}_{x \sim D_S}[(x \cdot u)^3]
&= \mathbb{E}[(x_S \cdot u_S + x_{\bar{S}} \cdot u_{\bar{S}})^3] \\
&= \mathbb{E}[(x_S \cdot u_S)^3] + 3\, \mathbb{E}[(x_S \cdot u_S)(x_{\bar{S}} \cdot u_{\bar{S}})^2] \\
&= \frac{k}{n}\Big(\sum_{i \in S} u_i\Big)^3 + \frac{3k}{n} \|u_{\bar{S}}\|^2 \sum_{i \in S} u_i.
\end{align*}
If we restrict to the unit sphere and compute the first order conditions, for each $i \in S$:
\[
\frac{3k}{n}\Big(\sum_{i \in S} u_i\Big)^2 + \frac{3k}{n}\|u_{\bar{S}}\|^2 - 2\lambda u_i = 0,
\]
where $\lambda$ is the Lagrange multiplier. Thus, at every critical point, all the $u_i$ are equal for $i \in S$. To compute the optimum, set the value of each $u_i$ to be $t$ and the value of $\|u_{\bar{S}}\|$ to be $v \in [0, 1]$; our problem reduces to:
\[
\max\ \frac{k}{n}\big[(kt)^3 + 3v^2 kt\big] \quad \text{s.t.} \quad kt^2 + v^2 = 1.
\]
By substituting the constraint into the objective, we reduce the objective function to:
\[
f(t) = \frac{k}{n}\big[(kt)^3 + 3(1 - kt^2)kt\big] = \frac{k}{n}\big[(k^3 - 3k^2)t^3 + 3kt\big].
\]
For sufficiently large $k$, $f(t)$ is monotonically increasing over the interval $[-1/\sqrt{k}, 1/\sqrt{k}]$, so the maximum is at the endpoint $t = 1/\sqrt{k}$. This gives an optimal value of $k^{5/2}/n$.

A simple extension is that if we fix any particular $u_j$, or $\|u_{\bar{S}}\|$, then the same conclusion holds: in the optimum all $u_i$ are equal for any $i \in S$ where $i \ne j$.

Proof of Theorem 6. Our reduction from planted clique to moment maximization is as follows:
1. Maximize the third moment $\mathbb{E}_{x \sim D_S}[(x \cdot u)^3]$.
2. Output the $k$ largest components of $u$ as the planted clique.

We will show that this algorithm will output $S$. In particular, if $\|u_{\bar{S}}\| \ge 1/\sqrt{k+1}$, then $u$ cannot be within an additive $k^{3/2}/n$ error of the optimum.

Claim 1. Let $u^*$ be the optimal solution to (7) restricted to the set $\{u \in \mathbb{R}^n : \|u_{\bar{S}}\| \ge 1/\sqrt{k+1}\}$, then
\[
\mathbb{E}[(x \cdot u^*)^3] \le \frac{k^{5/2}}{n} - \frac{k^{3/2}}{n}.
\]
On the other hand, if $\|u_{\bar{S}}\| < 1/\sqrt{k+1}$, then we show that if there exists an $i \in S$ where $|u_i| \le \|u_{\bar{S}}\|$, then $u$ cannot be within an additive $2k^{3/2}/(3n)$ of the optimum.

Claim 2. Let $u^*$ be the optimal solution to (7) restricted to the set $\{u \in \mathbb{R}^n : \|u_{\bar{S}}\| \le 1/\sqrt{k+1},\ u_1 \le \|u_{\bar{S}}\|\}$, then
\[
\mathbb{E}[(x \cdot u^*)^3] \le \frac{k^{5/2}}{n} - \frac{2k^{3/2}}{3n}.
\]
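Hence, our reduction outputs the correct $S$. We conclude by giving the proofs of the claims.

Proof of Claim 1. Fix $\|u_{\bar{S}}\| = 1/\sqrt{k+1}$. By Lemma 11, we can upper bound the third moment by taking all $u_i$ for $i \in S$ equal:
\[
\mathbb{E}[(x \cdot u)^3] \le \frac{k}{n}\Big(\sum_{i \in S} u_i\Big)^3 + \frac{3k}{n}\|u_{\bar{S}}\|^2 \sum_{i \in S} u_i = \frac{k}{n}\Big(\frac{k}{\sqrt{k+1}}\Big)^3 + \frac{3k}{n} \cdot \frac{k}{(k+1)\sqrt{k+1}}.
\]
The first term can be estimated by $(1 - \epsilon)^m \le 1 - (m - 1/4)\epsilon$ for sufficiently small $\epsilon$:
\[
\Big(\frac{k}{\sqrt{k+1}}\Big)^3 = \Big(\frac{k}{\sqrt{k}}\Big)^3 \cdot \Big(\sqrt{\frac{k}{k+1}}\Big)^3 \le k^{3/2}\Big(1 - \frac{5}{4(k+1)}\Big).
\]
The second term is small, and we can absorb it into the first term. Hence:
\[
\mathbb{E}[(x \cdot u)^3] \le \frac{k^{5/2}}{n}\Big(1 - \frac{5}{4(k+1)} + \frac{3}{\sqrt{k}\,(k+1)^{3/2}}\Big) \le \frac{k^{5/2}}{n}\Big(1 - \frac{1}{k}\Big).
\]
A similar calculation shows that the following function is monotone decreasing in $v$ over $[0, 1]$:
\[
g(v) = \max_{w \in \mathbb{R}^k : \|w\|^2 + v^2 = 1} \frac{k}{n}\Big[\Big(\sum_{i=1}^k w_i\Big)^3 + 3v^2 \sum_{i=1}^k w_i\Big].
\]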
Thus, we can extend our upper bound to all $\|u_{\bar{S}}\| \ge 1/\sqrt{k+1}$.

Proof of Claim 2. Lemma 11 implies that in $u^*$, we must have all remaining $u_i$ equal for $i \in S$ and $i \ne 1$. Thus, taking $u_i = t$ for $i \in S$, $i \ne 1$, and $u_1 = \|u_{\bar{S}}\| = v$ gives us the simpler program:
\[
\max_{\{t, v \in \mathbb{R} \,:\, (k-1)t^2 + 2v^2 = 1,\ |v| \le 1/\sqrt{k+1}\}} \frac{k}{n}\Big[((k-1)t + v)^3 + 3v^2((k-1)t + v)\Big].
\]
Let us estimate the second term:
\[
3v^2((k-1)t + v) \le \frac{3}{k+1} \cdot \max_{w \in \mathbb{R}^k : \|w\| = 1} \sum_{i=1}^k w_i \le \frac{3\sqrt{k}}{k+1}.
\]
The first term can be bounded by relaxing the $|v| \le 1/\sqrt{k+1}$ constraint:
\[
\max_{\{t, v \in \mathbb{R} \,:\, (k-1)t^2 + 2v^2 = 1,\ |v| \le 1/\sqrt{k+1}\}} ((k-1)t + v)^3 \le \max_{\{t, v \in \mathbb{R} \,:\, (k-1)t^2 + 2v^2 = 1\}} ((k-1)t + v)^3.
\]
If we take the Lagrangian $L$ and compute the first order conditions:
\[
\frac{\partial L}{\partial t} = (k-1)((k-1)t + v)^2 - 2\lambda(k-1)t = 0, \qquad
\frac{\partial L}{\partial v} = ((k-1)t + v)^2 - 4\lambda v = 0.
\]
Thus, we must have $t = 2v$ for any critical points where $t$ or $v$ are nonzero. This implies that $v = 1/\sqrt{4k - 2}$. With these parameters, we have:
\[
((k-1)t + v)^3 = ((2k-1)v)^3 = \Big(\frac{2k-1}{\sqrt{4k-2}}\Big)^3 = (k-1)^{3/2}.
\]
Hence, we have:
\[
\mathbb{E}[(x \cdot u)^3] \le \frac{k}{n}(k-1)^{3/2} = \frac{k^{5/2}}{n} - \frac{k\big(k^{3/2} - (k-1)^{3/2}\big)}{n}.
\]
The desired bound follows from:
\[
k^{3/2} - (k-1)^{3/2} = \frac{3}{2}\int_{k-1}^{k} \sqrt{t}\, dt \ge \frac{2}{3}\sqrt{k-1}.
\]
We conclude by applying Theorem 3.
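The closed form for the third moment used at the start of the proof of Lemma 11 can be verified by exact enumeration on a small instance. The sketch below assumes the $\pm 1$ encoding of $D_S$ (with probability $k/n$ the coordinates in $S$ are set to 1, all other coordinates uniform in $\{-1, 1\}$); since Problem 1 is stated elsewhere in the paper, this encoding is our assumption, as are the values of `n`, `S` and `u`.

```python
from itertools import product
from fractions import Fraction

n, S = 6, (0, 1)
k = len(S)
u = [Fraction(v, 10) for v in (3, 4, 1, -2, 5, 2)]   # arbitrary test vector (need not be unit)
cube = list(product([-1, 1], repeat=n))

def dot(x):
    return sum(xi * ui for xi, ui in zip(x, u))

def plant(x):
    """Force the coordinates in S to 1, leaving the others untouched."""
    return tuple(1 if i in S else xi for i, xi in enumerate(x))

# Exact third moment under the assumed +-1 version of D_S.
unplanted = sum(dot(x) ** 3 for x in cube) / len(cube)
planted = sum(dot(plant(x)) ** 3 for x in cube) / len(cube)
third_moment = Fraction(k, n) * planted + (1 - Fraction(k, n)) * unplanted

# Closed form from the proof: (k/n)(sum_S u_i)^3 + (3k/n) ||u_Sbar||^2 sum_S u_i.
sum_S = sum(u[i] for i in S)
norm_rest_sq = sum(ui ** 2 for i, ui in enumerate(u) if i not in S)
closed_form = Fraction(k, n) * sum_S ** 3 + Fraction(3 * k, n) * norm_rest_sq * sum_S
```

Because the identity is polynomial in $u$, it holds for unnormalized vectors as well, which is why the test vector is not scaled to the unit sphere.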
6 Relationship to Statistical Queries in Learning
We will now use Corollary 1 to demonstrate that our work generalizes the notion of statistical query dimension and the statistical query lower bounds from learning theory.

In an instance of a PAC learning problem, the learner has access to random examples of an unknown boolean function $f' : X' \to \{-1, 1\}$ from a set of boolean functions $C$ (whenever necessary, we use $'$ to distinguish variables from the identically named ones in the context of general search problems). A random example is a pair including a point and its label $(x', c(x'))$ such that $x'$ is drawn randomly from an unknown distribution $D'$. For $\epsilon > 0$, the goal of an $\epsilon$-accurate learning algorithm is to find, with high probability, a boolean hypothesis $h'$ for which $\Pr_{x' \sim D'}[h'(x') \ne f'(x')] \le \epsilon$.

A statistical query (SQ) learning algorithm [35] has access to a statistical query oracle for the unknown function $f'$ and distribution $D'$ in place of random examples. A query to the SQ oracle is a function $\phi : X' \times \{-1, 1\} \to [-1, 1]$ that depends on both the example $x'$ and its label $\ell$. To such a query the oracle returns a value $v$ which is within $\tau$ of $\mathbb{E}_{D'}[\phi(x', c(x'))]$, where $\tau$ is the tolerance parameter. An SQ algorithm does not depend on the randomness of examples and hence must always succeed.

Blum et al. [10] defined the statistical query dimension or SQ-DIM of a set of functions $C$ and distribution $D'$ over $X'$ as follows (we present a simplification and strengthening due to Yang [53]).

Definition 11 ([10]). For a concept class $C$ and distribution $D'$, SQ-DIM($C, D'$) $= d'$ if $d'$ is the largest value for which there exist $d'$ functions $c_1, c_2, \ldots, c_{d'} \in C$ such that for every $i \ne j$, $|\langle c_i, c_j \rangle_{D'}| \le 1/d'$.

Blum et al. [10] proved that if a class of functions is learnable using only a polynomial number of statistical queries of inverse polynomial tolerance then its statistical query dimension is polynomial. Yang [53] strengthened their result and proved the following bound (see [46] for a simpler proof).
Theorem 18 ([53]). Let $C$ be a class of functions and $D'$ be a distribution over $X'$ and let $d' = $ SQ-DIM($C, D'$). Then any SQ learning algorithm for $C$ over $D'$ that makes $q$ queries of tolerance $1/d'^{1/3}$ and outputs an $\epsilon$-accurate hypothesis for $\epsilon \le 1/2 - 1/(2d'^{1/3})$ satisfies $q \ge d'^{1/3}/2 - 1$.

In this result the distribution $D'$ is fixed and known to the learner (such learning is referred to as distribution-specific) and it can be used to lower bound the complexity of learning $C$ even in a weak sense. Specifically, this covers the case when the learning algorithm is only required to output a hypothesis $h'$ such that $\Pr_{x' \sim D'}[h'(x') \ne c(x')] \le 1/2 - \gamma'$ for some inverse polynomial $\gamma'$ (that is, $\epsilon \le 1/2 - \gamma'$).
It is not hard to see that we can cast this learning problem as an optimization problem over distributions, and by doing so, we will obtain that our statistical dimension implies a lower bound on learning which is stronger than that of Yang [53].

Let $L = (C, D', \epsilon)$ be an instance of a distribution-specific learning problem of a class of functions $C$ over distribution $D'$ to accuracy $1 - \epsilon$. We define the following $2\epsilon$-optimization problem $Z_L$ over distributions. The domain is all the labeled points, or $X = X' \times \{-1, 1\}$. When the target function equals $c \in C$ the learning algorithm gets samples from the distribution $D_c$ over $X$, where $D_c(x', c(x')) = D'(x')$ and $D_c(x', -c(x')) = 0$. Therefore we define the set of distributions over which we optimize to be $D_L = \{D_c \mid c \in C\}$. Note that the STAT oracle for $D_c$ with tolerance $\tau$ is equivalent to the statistical query oracle for $c$ over $D'$ with tolerance $\tau$.

We can take the class of functions $F_L$ over which a learning algorithm optimizes to be the set of all boolean functions over $X$ of the form $f(x', \ell) = f'(x') \cdot \ell$ for some boolean function $f'$ over $X'$ (an efficient learning algorithm can only output circuits of polynomial size but this distinction is not important for our information-theoretic bounds). We define $Z_L$ to be the problem of $2\epsilon$-optimizing over $F_L$ and $D_L$. Note that for $f \in F_L$ and $D_c \in D_L$,
\[
\mathbb{E}_{D_c}[f(x)] = \mathbb{E}_{D'}[f'(x') \cdot c(x')] = 1 - 2 \Pr_{D'}[f'(x') \ne c(x')]
\]
and therefore learning to accuracy $1 - \epsilon$ is equivalent to $2\epsilon$-optimizing over $F_L$ and $D_L$.

We claim that the SQ-DIM($C, D'$)-based lower bound given in Theorem 18 is a special (and substantially simpler) case of our statistical dimension lower bound for $Z_L$ (Cor. 1).

Theorem 19. Let $C$ be a class of functions and $D'$ be a distribution over $X'$ and let $d' = $ SQ-DIM($C, D'$). Denote by $L = (C, D', \epsilon)$ the instance of learning $C$ over $D'$ for $\epsilon = 1/2 - 1/(2d'^{1/3})$. Then
\[
SD(Z_L, \gamma = 1/d', \beta = 1) \ge d' - \frac{1}{1/d'^{2/3} - 1/d'}.
\]

Proof. Let $c_1, c_2, \ldots, c_{d'}$ be the almost uncorrelated functions in $C$ implied by the definition of SQ-DIM($C, D'$). We define the reference distribution $D$ as the distribution for which for every $(x', \ell) \in X$, $D(x', \ell) = D'(x')/2$. We note that this ensures that $D(x', \ell)$ is non-vanishing only when $D'(x')$ is non-vanishing and hence the function $\frac{D_c}{D} - 1$ will be well-defined for all $c \in C$. For every $c \in C$, we have
\[
\frac{D_c(x', c(x'))}{D(x', c(x'))} - 1 = 2 - 1 = 1 \quad \text{and} \quad \frac{D_c(x', -c(x'))}{D(x', -c(x'))} - 1 = 0 - 1 = -1.
\]
Therefore, $\frac{D_c(x', \ell)}{D(x', \ell)} - 1 = \ell \cdot c(x')$. This implies that for any $c_i, c_j \in C$,
\[
\Big\langle \frac{D_{c_i}}{D} - 1, \frac{D_{c_j}}{D} - 1 \Big\rangle_D = \mathbb{E}_D[\ell \cdot c_i(x') \cdot \ell \cdot c_j(x')] = \mathbb{E}_{D'}[c_i(x') \cdot c_j(x')] = \langle c_i, c_j \rangle_{D'}.
\]
Hence
1. for any $c \in C$, $\left\| \frac{D_c(x)}{D(x)} - 1 \right\|_D^2 = 1$;
2. for any $i \ne j \le d'$, $\left| \left\langle \frac{D_{c_i}(x)}{D(x)} - 1, \frac{D_{c_j}(x)}{D(x)} - 1 \right\rangle_D \right| \le 1/d'$.
These properties imply that d0 functions in C give d0 distributions in DL whose distinguishing functions are almost uncorrelated. This is essentially the condition required to obtain a lower bound of d0 on SD(ZL , 1/d0 , 1). The only issue is that we need to exclude distributions for which any given f ∈ FL is 2-optimal. We claim that it is easy to bound the number of distribution which are 2-optimal for a fixed f (x0 , `) = f 0 (x0 ) · ` and whose distinguishing functions are almost uncorrelated. First, note that the condition of 2-optimality of f for Dc states that E [f (x)] ≥ 1 − 2 ≥ 1/d
01/3
.
Dc
On the other hand, $E_D[f(x)] = 0$ and therefore $E_{D_c}[f(x)] - E_D[f(x)] \ge 1/d'^{1/3}$. This implies that if we view $f(x)$ as a query function, then the expectations of the query function relative to $D_c$ and $D$ differ by at least $\tau = 1/d'^{1/3}$. In the proof of Corollary 1, we proved that this is possible for at most $(\beta - \gamma)/(\tau^2 - \gamma)$ distributions with pairwise correlations $(\gamma, \beta)$. For our parameters this gives a bound of $\frac{1}{1/d'^{2/3} - 1/d'}$ distributions. Hence for $m = d' - \frac{1}{1/d'^{2/3} - 1/d'}$ we obtain that for every $f \in \mathcal{F}_L$ there exist $m$ distributions $\{D_1, \ldots, D_m\} \subseteq \{D_{c_1}, \ldots, D_{c_{d'}}\} \setminus \mathcal{Z}_L^{-1}(f)$ such that
1. for any $i \le m$, $\left\| \frac{D_i(x)}{D(x)} - 1 \right\|_D^2 = 1$;
2. for any $i \ne j \le m$, $\left\langle \frac{D_i(x)}{D(x)} - 1, \frac{D_j(x)}{D(x)} - 1 \right\rangle_D \le 1/d'$.
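As a rough numerical illustration of this counting step, the quantities can be evaluated for a concrete dimension. The value $d' = 1000$ is an assumption chosen here only because it has a convenient cube root; it does not appear in the text.

```latex
% Illustrative values only: d' = 1000 is an assumption made for this example.
\begin{align*}
\tau &= 1/d'^{1/3} = 1/10, \qquad \gamma = 1/d' = 1/1000, \qquad \beta = 1, \\
\frac{\beta - \gamma}{\tau^2 - \gamma} &= \frac{0.999}{0.01 - 0.001} = 111
  \;\le\; \frac{1}{1/d'^{2/3} - 1/d'} = \frac{1}{0.009} \approx 111.1, \\
m &= d' - \frac{1}{1/d'^{2/3} - 1/d'} \approx 1000 - 111.1 = 888.9 .
\end{align*}
```

In this instance roughly 889 of the 1000 almost uncorrelated distributions survive the exclusion. More generally, the excluded count equals $d'^{2/3}/(1 - d'^{-1/3})$, which grows like $d'^{2/3}$ and is therefore of lower order than $d'$.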
Applying Corollary 1, we get the following lower bound, which is twice as large as the $d'^{1/3}/2 - 1$ bound of Yang [53].

Corollary 15. Let $C$ be a class of functions and $D'$ be a distribution over $X'$, let $d' = $ SQ-DIM$(C, D')$ and let $\epsilon = 1/2 - 1/(2d'^{1/3})$. Then any SQ learning algorithm requires at least $d'^{1/3} - 2$ queries of tolerance $1/d'^{1/3}$ to $\epsilon$-accurately learn $C$ over $D'$.
6.1 Honest Statistical Queries
We now turn to the Honest SQ model [29, 52], which inspired our notion of unbiased statistical algorithms. In the Honest SQ model, the learner has access to an HSQ oracle and can again evaluate queries that are functions of the data points and their labels. As with our 1-STAT oracle, the queries are evaluated on an “honest” sample drawn from the target distribution. More precisely, the HSQ oracle accepts a function $\phi : X' \times \{-1, 1\} \to \{-1, 1\}$ and a sample size $t > 0$, draws $x'_1, \ldots, x'_t \sim D'$, and returns the value $\frac{1}{t} \sum_{i=1}^{t} \phi(x'_i, c(x'_i))$. The total sample count of an algorithm is the sum of the sample sizes it passes to HSQ. We note that using our one-sample-per-query-function oracle 1-STAT, one can simulate the estimation of queries from larger samples in the straightforward way while obtaining the same sample complexity. Therefore HSQ is equivalent to our 1-STAT oracle.

We first observe that our direct simulation in Theorem 10 implies that the Honest SQ learning model is equivalent (up to polynomial factors) to the SQ learning model. We are not aware of this observation having been made before (although Valiant [49] implicitly uses it to show that evolvable concept classes are also learnable in the SQ model). We now show that using Corollary 3 we can derive sample complexity bounds on honest statistical query algorithms for learning.
Corollary 16. Let $C$ be a class of functions, $D'$ be a distribution over $X'$, $d' = $ SQ-DIM$(C, D')$ and $\epsilon = 1/2 - 1/(2d'^{1/3})$. Then the sample complexity of any Honest SQ algorithm for $\epsilon$-accurate learning of $C$ over $D'$ is $\tilde{\Omega}(\sqrt{d'})$.

This recovers the bound of Yang [53] up to polynomial factors.
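The simulation of an HSQ request by repeated 1-STAT queries, described earlier in this subsection, can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the sampler for $D'$ and the target function are toy choices, and for simplicity the toy 1-STAT oracle returns the $\{-1, 1\}$ value of $\phi$ directly.

```python
import random


class OneStatOracle:
    """Toy 1-STAT oracle: each query is answered on one fresh labeled example."""

    def __init__(self, draw_point, concept, rng):
        self.draw_point = draw_point  # hypothetical sampler for D'
        self.concept = concept        # target function c
        self.rng = rng
        self.samples_used = 0         # total examples drawn so far

    def query(self, phi):
        x = self.draw_point(self.rng)
        self.samples_used += 1
        return phi(x, self.concept(x))


def hsq_via_one_stat(oracle, phi, t):
    """Answer the HSQ request (phi, t) by averaging t independent 1-STAT answers."""
    if t <= 0:
        raise ValueError("sample size t must be positive")
    return sum(oracle.query(phi) for _ in range(t)) / t


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy setting: D' uniform on {0,1}^4, c = parity of the first two bits.
    draw = lambda r: tuple(r.randint(0, 1) for _ in range(4))
    target = lambda x: -1 if x[0] ^ x[1] else 1
    oracle = OneStatOracle(draw, target, rng)
    estimate = hsq_via_one_stat(oracle, lambda x, label: label, t=200)
    print("estimate:", estimate, "samples used:", oracle.samples_used)
```

Each HSQ request with sample size $t$ consumes exactly $t$ examples through the 1-STAT oracle, so the total sample count of the simulation equals the sum of the requested sample sizes.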
Acknowledgments

We thank Benny Applebaum, Avrim Blum, Uri Feige, Ravi Kannan, Michael Kearns, Robi Krauthgamer, Moni Naor, Jan Vondrak, and Avi Wigderson for insightful comments and helpful discussions.
References

[1] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie. Testing k-wise and almost k-wise independence. In Proceedings of STOC, pages 496–505, 2007.
[2] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. In SODA, pages 594–598, 1998.
[3] B. Applebaum, B. Barak, and A. Wigderson. Public-key cryptography from different assumptions. In STOC, pages 171–180, 2010.
[4] S. Arora, B. Barak, M. Brunnermeier, and R. Ge. Computational complexity and information asymmetry in financial products (extended abstract). In ICS, pages 49–65, 2010.
[5] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[6] A. Bhaskara, M. Charikar, E. Chlamtac, U. Feige, and A. Vijayaraghavan. Detecting high log-densities: an $O(n^{1/4})$ approximation for densest k-subgraph. In STOC, pages 201–210, 2010.
[7] A. Bhaskara, M. Charikar, A. Vijayaraghavan, V. Guruswami, and Y. Zhou. Polynomial integrality gaps for strong SDP relaxations of densest k-subgraph. In SODA, pages 388–405, 2012.
[8] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proceedings of PODS, pages 128–138, 2005.
[9] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
[10] A. Blum, M. L. Furst, J. C. Jackson, M. J. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pages 253–262, 1994.
[11] C. Brubaker. Extensions of principal component analysis. PhD thesis, School of CS, Georgia Tech, 2009.
[12] S. Brubaker and S. Vempala. Random tensors and planted cliques. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, volume 5687, pages 406–419, 2009.
[13] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2006.
[14] Y. Dekel, O. Gurel-Gurevich, and Y. Peres. Finding hidden cliques in linear time with high probability. In Proceedings of ANALCO, pages 67–75, 2011.
[15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[16] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer-Verlag, 1999.
[17] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. Math. Program., 114(1):101–114, 2008.
[18] U. Feige. Relations between average case complexity and approximation complexity. In IEEE Conference on Computational Complexity, page 5, 2002.
[19] U. Feige and R. Krauthgamer. Finding and certifying a large hidden clique in a semirandom graph. Random Struct. Algorithms, 16(2):195–208, 2000.
[20] U. Feige and D. Ron. Finding hidden cliques in linear time. In Proceedings of AofA, pages 189–204, 2010.
[21] V. Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.
[22] V. Feldman and V. Kanade. Computational bounds on statistical query learning. Journal of Machine Learning Research - COLT Proceedings Track, 23:16.1–16.22, 2012.
[23] A. M. Frieze and R. Kannan. A new approach to the planted clique problem. In FSTTCS, pages 187–198, 2008.
[24] A. E. Gelfand and A. F. M. Smith. Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.
[25] J. Håstad. Some optimal inapproximability results. J. ACM, 48:798–859, July 2001.
[26] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[27] E. Hazan and R. Krauthgamer. How hard is it to approximate the best Nash equilibrium? SIAM J. Comput., 40(1):79–91, 2011.
[28] C. J. Hillar and L.-H. Lim. Most tensor problems are NP hard. CoRR, abs/0911.1393, 2009.
[29] J. Jackson. On the efficiency of noise-tolerant PAC algorithms derived from statistical queries. Annals of Mathematics and Artificial Intelligence, 39(3):291–313, Nov. 2003.
[30] M. Jerrum. Large cliques elude the Metropolis process. Random Struct. Algorithms, 3(4):347–360, 1992.
[31] A. Juels and M. Peinado. Hiding cliques for cryptographic security. Des. Codes Cryptography, 20(3):269–280, 2000.
[32] R. Kannan. Personal communication.
[33] R. Karp. Reducibility among combinatorial problems. In R. Miller and J. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.
[34] R. Karp. Probabilistic analysis of graph-theoretic algorithms. In Proceedings of Computer Science and Statistics 12th Annual Symposium on the Interface, page 173, 1979.
[35] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
[36] S. Khot. Ruling out PTAS for graph min-bisection, densest subgraph and bipartite clique. In FOCS, pages 136–145, 2004.
[37] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[38] L. Kucera. Expected complexity of graph partitioning problems. Discrete Applied Mathematics, 57(2-3):193–212, 1995.
[39] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.
[40] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[41] L. Minder and D. Vilenchik. Small clique detection and approximate Nash equilibria. 5687:673–685, 2009.
[42] J. P. Nešetřil. On the complexity of the subgraph problem. Commentationes Mathematicae Universitatis Carolinae, 26(2):415–419, 1985.
[43] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50(302):157–175, 1900.
[44] B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 521–532, 1995.
[45] R. Servedio. Computational sample complexity and attribute-efficient learning. Journal of Computer and System Sciences, 60(1):161–178, 2000.
[46] B. Szörényi. Characterizing statistical query learning: Simplified notions and proofs. In ALT, pages 186–200, 2009.
[47] M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82:528–550, 1987.
[48] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
[49] L. G. Valiant. Evolvability. J. ACM, 56(1), 2009.
[50] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probab. and its Applications, 16(2):264–280, 1971.
[51] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, Jan. 1985.
[52] K. Yang. On learning correlated Boolean functions using statistical queries. In Proceedings of ALT, pages 59–76, 2001.
[53] K. Yang. New lower bounds for statistical query learning. J. Comput. Syst. Sci., 70(4):485–509, 2005.