
arXiv:1311.3494v3 [cs.LG] 6 Feb 2014

Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation

Ohad Shamir
Weizmann Institute of Science
[email protected]

Abstract

Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits), and more. However, we currently have little understanding of how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where any algorithm which has a small memory footprint (or can use only a bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for a variety of settings.

1 Introduction

Information constraints play a key role in machine learning. Of course, the main constraint is the availability of only a finite data set, from which the learner is expected to generalize. However, many problems currently researched in machine learning can be characterized as learning with additional information constraints, arising from the manner in which the learner may interact with the data. Some examples include:

• Communication constraints in distributed learning: There has been much work in recent years on learning when the training data is distributed among several machines (with [13, 2, 27, 42, 30, 8, 16, 34] being just a few examples). Since the machines may work in parallel, this potentially allows significant computational speed-ups and the ability to cope with large datasets. On the flip side, communication rates between machines are typically much slower than their processing speeds, and a major challenge is to perform these learning tasks with minimal communication.


• Memory constraints: The standard implementation of many common learning tasks requires memory which is super-linear in the data dimension. For example, principal component analysis (PCA) requires us to estimate eigenvectors of the data covariance matrix, whose size is quadratic in the data dimension and can be prohibitive for high-dimensional data. Another example is kernel learning, which requires manipulation of the Gram matrix, whose size is quadratic in the number of data points. There has been considerable effort in developing and analyzing algorithms for such problems with a reduced memory footprint (e.g. [38, 4, 9, 48, 44]).

• Online learning constraints: The need for fast and scalable learning algorithms has popularised the use of online algorithms, which work by sequentially going over the training data and incrementally updating a (usually small) state vector. Well-known special cases include gradient descent and mirror descent algorithms (see e.g. [45, 46]). The requirement of sequentially passing over the data can be seen as one type of information constraint, and the small state these algorithms often maintain can be seen as another type of memory constraint.

• Partial-information constraints: A common situation in machine learning is when the available data is corrupted, sanitized (e.g. due to privacy constraints), has missing features, or is otherwise partially accessible. There has also been considerable interest in online learning with partial information, where the learner only gets partial feedback on its performance. This has been used to model various problems in web advertising, routing and multiclass learning. Perhaps the most well-known case is the multi-armed bandits problem [17, 7, 6], with many other variants being developed, such as contextual bandits [35, 36], combinatorial bandits [20], and more general models such as partial monitoring [17, 12].
Although these examples come from very different domains, they all share the common feature of information constraints on how the learning algorithm can interact with the training data. In some specific cases (most notably multi-armed bandits, and also in the context of certain distributed protocols [8, 50]) we can even formalize the price we pay for these constraints, in terms of degraded sample complexity or regret guarantees. However, we currently lack a general information-theoretic framework, which directly quantifies how such constraints can impact performance. For example, are there cases where any online algorithm, which goes over the data one-by-one, must have a worse sample complexity than (say) empirical risk minimization? Are there situations where a small memory footprint provably degrades the learning performance? Can one quantify how a constraint of getting only a few bits from each example affects our ability to learn? To the best of our knowledge, there are currently no generic tools which allow us to answer such questions.

In this paper, we make a first step in developing such a framework. We consider a general class of learning processes, characterized only by information-theoretic constraints on how they may interact with the data (and independent of any specific learning problem semantics). As special cases, these include online algorithms with memory constraints, certain types of distributed algorithms, as well as learning with partial information. We identify cases where any such algorithm must be worse than what can be attained without such information constraints. Armed with these generic tools, we establish several new results for specific learning problems:

• We prove that for some learning and estimation problems - in particular, sparse PCA and sparse covariance estimation in R^d - no online algorithm can attain statistically optimal performance (in terms of sample complexity) with less than Ω̃(d²) memory. To the best of our knowledge, this is the first formal example of a memory/sample-complexity trade-off.

• We show that for similar types of problems, there are cases where no distributed algorithm (which is based on a non-interactive or serial protocol on i.i.d. data) can attain optimal performance with less than Ω̃(d²) communication per machine. To the best of our knowledge, this is the first formal example of a communication/sample-complexity trade-off, in the regime where the communication budget is larger than the data dimension, and the examples at each machine come from the same underlying distribution.

• We prove a 'Heisenberg-like' uncertainty principle, which generalizes existing lower bounds for online learning with partial information. In the context of learning with expert advice, it implies that the number of rounds T, times the number of bits b extracted from each loss vector, must be larger than the dimension d in order to get non-trivial regret. This holds no matter what these b bits are – a single coordinate (as in multi-armed bandits), some information on several coordinates (as in semi-bandit feedback), a linear projection (as in bandit optimization), some feedback signal from a restricted set (as in partial monitoring), etc. Remarkably, it holds even if the online learner is allowed to adaptively choose which bits of the loss vector it retains at each round.
• We demonstrate the existence of simple (toy) stochastic optimization problems where any algorithm which uses memory linear in the dimension (e.g. stochastic gradient descent or mirror descent) cannot be statistically optimal.

Related Work

In stochastic optimization, there has been much work on lower bounds for sequential algorithms, starting from the seminal work of [40], and including more recent works such as [1]. [43] also consider such lower bounds from a more general information-theoretic perspective. However, these results all hold in an oracle model, where data is assumed to be made available in a specific form (such as a stochastic gradient estimate). As already pointed out in [40], this does not directly translate to the more common setting, where we are given a dataset and wish to run a simple sequential optimization procedure. Indeed, recent works exploited this gap to get improved algorithms using more sophisticated oracles, such as the availability of prox-mappings [41]. Moreover, we are not aware of cases where these lower bounds indicate a gap between the attainable performance of any sequential algorithm and batch learning methods (such as empirical risk minimization).

In the context of distributed learning and statistical estimation, information-theoretic lower bounds have recently been shown in the pioneering work [50]. Assuming communication budget constraints on different machines, the paper identifies cases where these constraints affect statistical performance. Our results (in the context of distributed learning) are very similar in spirit, but there are two important differences. First, their lower bounds pertain to parametric estimation in R^d, and are non-trivial when the budget size per machine is much smaller than d. However, for most natural applications, sending O(d) bits is rather cheap, since one needs O(d) bits anyway just to communicate the algorithm's output. In contrast, our basic results pertain to simpler detection problems, and lead to non-trivial lower bounds in the natural regime where the budget size is the same as or even much larger than the data dimension (e.g. as much as quadratic). The second difference is that their work focuses mostly on non-interactive distributed protocols, while we address a more general class, which allows some interaction and also includes information-constrained settings beyond distributed learning. Strong lower bounds in the context of distributed learning have also been shown in [8], but they do not apply to a regime where examples across machines come from the same distribution, and where the communication budget is much larger than what is needed to specify the output.

As mentioned earlier, there are well-known lower bounds for multi-armed bandit problems and other online learning settings with partial information. However, they are specific to the partial-information feedback considered.
For example, the standard multi-armed bandit lower bound [7] pertains to a setting where we can view a single coordinate of the loss vector, but doesn't apply as-is when we can view more than one coordinate (as in semi-bandit feedback [33, 5], or bandits with side-information [37]), receive a linear projection (as in bandit linear optimization), or receive a different type of partial feedback (such as in partial monitoring [19]). In contrast, our results are generic and can directly apply to any such setting.

The inherent limitations of streaming and distributed algorithms, including memory and communication constraints, have been extensively studied within theoretical computer science (e.g. [3, 10, 22, 39, 11]). This research provides quite powerful and general results, which also apply to algorithms beyond those considered here (such as fully interactive distributed algorithms). Unfortunately, almost all these results consider tasks unrelated to learning, and/or adversarially generated data, and thus do not apply to statistical learning tasks, where the data is assumed to be drawn i.i.d. from some underlying distribution. [49, 26] do consider i.i.d. data, but they focus on problems such as detecting graph connectivity and counting distinct elements, rather than learning problems such as those considered here. On the flip side, there are works on memory-efficient algorithms with formal guarantees for statistical problems (e.g. [38, 9, 32, 23]), but these do not consider lower bounds or provable trade-offs.


2 Information-Constrained Protocols

We begin with a few words about notation. We use bold-face letters (e.g. x) to denote vectors, with x_j denoting the j-th coordinate. In particular, we let e_j ∈ R^d denote the j-th standard basis vector. We use the standard asymptotic notation O(·), Ω(·) to hide constants, and Õ(·), Ω̃(·) to hide logarithmic factors. log(·) refers to the natural logarithm, and log₂(·) to the base-2 logarithm.

We begin by defining the following class of algorithms, corresponding to online learning algorithms with bounded memory:

Definition 1 (b-Memory Online Protocols). A given algorithm is a b-memory online protocol on m i.i.d. instances if it has the following form, for arbitrary functions {f_t}_{t=1}^m, f returning an output of at most b bits:

• For t = 1, . . . , m
  – Let X^t be the t-th instance
  – Compute message W^t = f_t(X^t, W^{t−1})
• Return W = f(W^m)

Note that the functions {f_t}_{t=1}^m, f are completely arbitrary, may depend on m, and can also be randomized. The crucial assumption is that their outputs W^t, W are constrained to be only b bits. Intuitively, b-memory online protocols maintain a state vector of size b, which is incrementally updated after each instance is received. We note that the majority of online learning and stochastic optimization algorithms have bounded memory. For example, for linear predictors, most single-pass gradient-based algorithms maintain a state whose size is proportional to the size of the parameter vector being optimized.

We will also define and prove results for the following, much more general class of information-constrained algorithms:

Definition 2 ((b, n, m) Protocol). Given access to a sequence of mn i.i.d. instances, an algorithm is a (b, n, m) protocol if it has the following form, for arbitrary functions f_t returning an output of at most b bits, and an arbitrary function f:

• For t = 1, . . . , m
  – Let X^t be a fresh batch of n i.i.d. instances
  – Compute message W^t = f_t(X^t, W^1, W^2, . . . , W^{t−1})
• Return W = f(W^1, . . . , W^m)

As in b-memory online protocols, we constrain each W^t to consist of at most b bits. However, unlike b-memory online protocols, here each W^t may depend not just on W^{t−1}, but on the entire sequence of previous messages W^1, . . . , W^{t−1}. b-memory online protocols are a special case when n = 1. Examples of (b, n, m) protocols, beyond b-memory online protocols, include:

• Non-interactive and serial distributed algorithms: There are m machines, and each machine receives an independent sample X^t of size n. It then sends a message W^t = f_t(X^t) (which here depends only on X^t). A centralized server then combines the messages to compute an output f(W^1, . . . , W^m). This includes, for instance, divide-and-conquer style algorithms proposed for distributed stochastic optimization (e.g. [51, 52]). A serial variant of the above is when there are m machines, and one-by-one, each machine t broadcasts some information W^t to the other machines, which depends on X^t as well as the previous messages sent by machines 1, 2, . . . , (t − 1).

• Online learning with partial information: We sequentially receive d-dimensional examples/loss vectors, and from each of these we can extract and use only b bits of information, where b ≪ d. For example, this includes most types of bandit problems.

• Mini-batch online learning algorithms: The data is streamed one-by-one or in mini-batches of size n, with mn instances overall. An algorithm sequentially updates its state based on a b-dimensional vector extracted from each example/batch (such as a gradient or gradient average), and returns a final result after all the data is processed. This includes most gradient-based algorithms we are aware of, as well as distributed versions of these algorithms (such as parallelizing a mini-batch processing step as in [28, 24]).

We note that our results can be generalized to allow the size of the messages W^t to vary across t, and even to be chosen in a data-dependent manner. In our work, we contrast the performance attainable by any algorithm corresponding to such protocols with constraint-free protocols, which are allowed to interact with the sampled instances in any manner.
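To make Definition 2 concrete, here is a minimal, hypothetical sketch of the (b, n, m) protocol template in Python (the names `run_protocol`, `message_fns` and `combine_fn` are ours, not the paper's); messages are represented as bit-tuples so the b-bit constraint can be checked explicitly:

```python
def run_protocol(sample_stream, message_fns, combine_fn, b, n, m):
    """Generic (b, n, m) protocol: m rounds, each seeing a fresh batch
    of n i.i.d. instances plus all previously emitted b-bit messages."""
    messages = []
    for t in range(m):
        batch = [next(sample_stream) for _ in range(n)]   # fresh batch X^t
        w_t = message_fns[t](batch, messages)             # W^t = f_t(X^t, W^1..W^{t-1})
        assert len(w_t) <= b, "each message is at most b bits"
        messages.append(w_t)
    return combine_fn(messages)                           # W = f(W^1, ..., W^m)
```

A b-memory online protocol is the special case n = 1 in which each `f_t` looks only at the last emitted message.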

3 Basic Results

All our results are based on simple 'hide-and-seek' statistical estimation problems, for which we show a strong gap between the attainable performance of information-constrained protocols and constraint-free protocols. We will consider two variants of this problem, with different applications. Our first problem, parameterized by a dimension d, bias ρ, and sample size mn, is defined as follows:

Definition 3 (Hide-and-seek Problem 1). Consider the set of distributions {Pr_j(·)}_{j=1}^d over {−1, 1}^d defined as

  Pr_j(x) = 2^{−(d−1)} (1/2 + ρ x_j).

Given an i.i.d. sample of mn instances generated from Pr_j(·), where j is unknown, detect j.

In words, Pr_j(·) corresponds to picking all coordinates other than j to be ±1 uniformly at random, and independently picking coordinate j to be +1 with probability 1/2 + ρ, and −1 with probability 1/2 − ρ. It is easily verified that this creates instances with zero-mean coordinates, except coordinate j, whose expectation is 2ρ. The goal is to detect j based on an empirical sample.
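As a sanity check, the distribution Pr_j(·) and the unconstrained empirical-average detector can be simulated directly (a sketch; `sample_instance` and `detect_biased_coordinate` are our hypothetical names, not the paper's):

```python
import random

def sample_instance(d, j, rho, rng):
    """One draw from Pr_j: every coordinate is +/-1 uniformly at random,
    except coordinate j, which is +1 with probability 1/2 + rho."""
    x = [1 if rng.random() < 0.5 else -1 for _ in range(d)]
    x[j] = 1 if rng.random() < 0.5 + rho else -1
    return x

def detect_biased_coordinate(sample):
    """Constraint-free detector: the coordinate with the highest
    empirical average."""
    d = len(sample[0])
    return max(range(d), key=lambda i: sum(x[i] for x in sample))
```

With ρ = 1/4, a few hundred instances suffice to recover j with overwhelming probability, matching the O(log(d)) sample complexity of the unconstrained detector up to constants.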


We now present our first main result, which shows that for this hide-and-seek problem there is a large regime where detecting j is information-theoretically possible, but any (b, n, m) protocol will fail to do so with high probability.

Theorem 1. Consider hide-and-seek problem 1 on d > 1 coordinates, with some bias ρ ≤ 1/(4n). Then for any estimate J̃ of the biased coordinate returned by any (b, n, m) protocol, there exists some coordinate j such that

  Pr_j(J̃ = j) ≤ 3/d + 8 √(mb/d).

However, given mn samples, if J̃ is the coordinate with the highest empirical average, then

  Pr_j(J̃ = j) ≥ 1 − 2d exp(−(1/2) mnρ²).    (1)

To give a concrete example, let us see how this theorem applies in the special case of b-memory online protocols, in which case we can take n = 1. In particular, suppose we choose the bias ρ of coordinate j to be 1/4. This makes the hide-and-seek problem extremely easy without information constraints: one coordinate j will be +1 with probability 3/4, while all the other coordinates will be +1 with probability only 1/2. Thus, given only O(log(d)) i.i.d. instances, we can easily determine j with arbitrarily high probability, just by finding the coordinate with the highest empirical average. However, the theorem implies that any b-memory online protocol, where b ≪ d, will fail with high probability to detect j – at least as long as the sample size m is much smaller than d/b. When m ≥ d/b, our lower bound becomes vacuous. However, this is not just an artifact of the analysis: if (for instance) m ≥ Ω(d log(d)), it is possible to detect j with bounded memory¹. Thus, the result implies a trade-off between sample complexity and memory complexity, which is exponential in d: without memory constraints, one can detect j after O(log(d)) i.i.d. instances, whereas with memory constraints, the number of instances needed scales linearly with d. Since Thm. 1 also applies to (b, n, m) protocols in general, similar reasoning implies trade-offs between information and sample complexity in other situations, such as distributed learning.

The proof of the theorem appears in Appendix A. However, the technical details may obfuscate the proof's simple intuition, which we now turn to explain. First, the bound in Eq. (1) is just a straightforward application of Hoeffding's inequality and a union bound, ensuring that we can estimate the biased coordinate when mn ≫ log(d)/ρ². However, in a setting where the instances are given to us one-by-one or in batches, this requires us to maintain an estimate for all d coordinates simultaneously. Unfortunately, when we can only output a small number b of bits based on each instance/batch, we cannot accurately update the estimate for all d coordinates. For example, we can provide some

¹ The idea is to split the instance sequence into d segments of length O(log(d)) each, and use each segment i to determine whether coordinate i is the biased coordinate with high probability.
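A minimal sketch of this segment-based detector (our hypothetical implementation of the footnote's idea, assuming a large bias such as ρ = 1/4 so the biased coordinate clearly stands out within its segment); note that its state is just one candidate index and two small counters, i.e. O(log d) bits:

```python
def detect_with_bounded_memory(sample_stream, d, seg_len):
    """Scan d segments of seg_len instances each; during segment i we
    only count +1's on coordinate i, then keep the best candidate so far."""
    best_i, best_count = -1, -1
    for i in range(d):
        count = 0
        for _ in range(seg_len):
            x = next(sample_stream)
            if x[i] == 1:
                count += 1
        # The biased coordinate accumulates about (1/2 + rho) * seg_len.
        if count > best_count:
            best_i, best_count = i, count
    return best_i
```

With seg_len = O(log(d)) this consumes m = O(d log(d)) instances in total, matching the regime in which the theorem's lower bound becomes vacuous.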


[Figure 1 diagram: the coordinates X_1^k, X_2^k, . . . , X_j^k, . . . , X_d^k of a sample feed into the message W^k; see the caption below.]

Figure 1: Illustration of the relationship between j, the coordinates 1, 2, . . . , j, . . . , d of the sample X^t, and the message W^t. The coordinates are independent of each other, and most of them are just ±1 uniformly at random. Only coordinate j has a different distribution, and hence contains some information on j.

information on all d coordinates, but then the information we provide per coordinate (and in particular, on the biased coordinate j) will be very small. Alternatively, we can provide accurate information on O(b) coordinates, but we don't know which coordinate is the important coordinate j, hence we are likely to 'miss' it. In fact, the proof relies on showing that no matter what, a (b, n, m) protocol cannot provide more than b/d bits of information (in expectation) on coordinate j, unless it already 'knows' j. As there are only m rounds overall, the amount of information conveyed about coordinate j is at most mb/d. If this is much smaller than 1, there is insufficient information to accurately detect j.

From a more information-theoretic viewpoint, one can view this as a result on the information transmission rate of the channel illustrated in Figure 1. In this channel, the message j is sent at random through one of d independent binary symmetric channels (corresponding to the j-th coordinate in the data instances). Even though W^t constitutes b bits, the information on j it can convey is no larger than b/d in expectation, since it doesn't 'know' which of the channels to decode.

For some of the applications we consider, hide-and-seek problem 1 isn't appropriate, and we will need the following alternative problem, which is again parameterized by a dimension d, bias ρ, and sample size mn:

Definition 4 (Hide-and-seek Problem 2). Consider the set of distributions {Pr_j(·)}_{j=1}^d over {−e_i, +e_i}_{i=1}^d, defined as

  Pr_j(e_i) = 1/(2d) if i ≠ j,  1/(2d) + ρ/d if i = j;
  Pr_j(−e_i) = 1/(2d) if i ≠ j,  1/(2d) − ρ/d if i = j.

Given an i.i.d. sample of mn instances generated from Pr_j(·), where j is unknown, detect j.

In words, Pr_j(·) corresponds to picking ±e_i, where i is chosen uniformly at random, and the sign is chosen uniformly if i ≠ j, and to be positive (resp. negative) with probability 1/2 + ρ (resp. 1/2 − ρ) if i = j. It is easily verified that this creates sparse instances with zero-mean coordinates, except coordinate j, whose expectation is 2ρ/d.

We now present a result analogous to Thm. 1 for this new hide-and-seek problem.

Theorem 2. Consider hide-and-seek problem 2 on d > 1 coordinates, with some bias ρ ≤ min{1/27, 1/(9 log(d)), d/(14n)}. Then for any estimate J̃ of the biased coordinate returned by any (b, n, m) protocol, there exists some coordinate j such that

  Pr_j(J̃ = j) ≤ 3/d + 11 √(mb/d).

However, given mn samples, if J̃ is the coordinate with the highest empirical average, then

  Pr_j(J̃ = j) ≥ 1 − d exp(−(mn/d)ρ²).

The proof appears in Appendix A, and uses an intuition similar to that of Thm. 1. As in the previous theorem, it implies a regime with performance gaps between (b, n, m) protocols and unconstrained algorithms.

The theorems above hold for any (b, n, m) protocol, and in particular for b-memory online protocols (since they are a special case of (b, 1, m) protocols). However, for b-memory online protocols, the following simple observation will allow us to further strengthen our results:

Theorem 3. Any b-memory online protocol over m instances is also a (b, κ, ⌊m/κ⌋) protocol, for any positive integer κ ≤ m.

The proof is immediate: given a batch of κ instances, we can always feed the instances one by one to our b-memory online protocol, and output the final message after ⌊m/κ⌋ such batches are processed, ignoring any remaining instances. This makes the algorithm a type of (b, κ, ⌊m/κ⌋) protocol.

As a result, when discussing b-memory online protocols for some particular value of m, we can actually apply Thm. 1 and Thm. 2 with n, m replaced by κ, ⌊m/κ⌋, where κ is a free parameter we can tune to attain the most convenient bound.
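The reduction in Thm. 3 is purely mechanical; a hypothetical sketch (`as_batch_protocol` and `step_fn` are our names): the wrapper feeds each batch of κ instances one by one through the b-memory update, and emits only the final b-bit state.

```python
def as_batch_protocol(step_fn, kappa):
    """Turn a b-memory online update W^t = step_fn(x, W^{t-1}) into the
    message function of a (b, kappa, floor(m/kappa)) protocol."""
    def batch_step(batch, prev_messages):
        # Resume from the last emitted state (None before the first batch).
        state = prev_messages[-1] if prev_messages else None
        for x in batch[:kappa]:
            state = step_fn(x, state)
        return state  # still at most b bits: it is just the protocol's state
    return batch_step
```

Intermediate states inside a batch are never emitted, which is exactly why the message budget b is unchanged.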

4 Applications

4.1 Online Learning with Partial Information

Consider the standard setting of learning with expert advice, defined as a game over T rounds, where in each round t a loss vector ℓ_t ∈ [0, 1]^d is chosen, and the learner (without knowing ℓ_t) needs to pick an action i_t from a fixed set {1, . . . , d}, after which the learner suffers loss ℓ_{t,i_t}. The goal of the learner is to minimize the regret with respect to the best fixed action in hindsight,

  Σ_{t=1}^T ℓ_{t,i_t} − min_i Σ_{t=1}^T ℓ_{t,i}.

We are interested in partial-information variants, where the learner doesn't get to see and use ℓ_t, but only some partial information on it. For example, in standard multi-armed bandits, the learner can only view ℓ_{t,i_t}. The following theorem is a simple corollary of Thm. 1, and we provide a proof in Appendix B.1.
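In this notation, the regret of a played action sequence can be computed directly (a small illustrative helper of ours, not from the paper):

```python
def regret(losses, actions):
    """Regret with respect to the best fixed action in hindsight:
    sum_t losses[t][actions[t]] - min_i sum_t losses[t][i]."""
    T, d = len(losses), len(losses[0])
    incurred = sum(losses[t][actions[t]] for t in range(T))
    best_fixed = min(sum(losses[t][i] for t in range(T)) for i in range(d))
    return incurred - best_fixed
```

Note that the regret can be negative on a single realization; the theorems below bound its expectation.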

Theorem 4. Suppose d ≥ 30. For any algorithm which can view only b bits of each loss vector, there is a distribution over loss vectors ℓ_t ∈ [0, 1]^d with the following property: if the loss vectors are sampled i.i.d. from this distribution, then

  min_j E[Σ_{t=1}^T ℓ_t(j_t) − Σ_{t=1}^T ℓ_t(j)] ≥ 0.01T,

as long as Tb ≤ 0.01d.

As a result, we get that for any algorithm with any partial-information feedback model (where b bits are extracted from each d-dimensional loss vector), it is impossible to get sub-linear regret guarantees in less than Ω(d/b) rounds. Remarkably, this holds even if the algorithm is allowed to examine each loss vector ℓ_t and choose which b bits of information it wishes to retain. In contrast, full-information algorithms (e.g. Hedge [31]) can get sub-linear regret in O(log(d)) rounds. Another interpretation of this result is as a 'Heisenberg-like' uncertainty principle: given T rounds and b bits extracted per round, we must have Tb ≥ Ω(d) in order for the regret to be sub-linear.

When b = O(1), the bound in the theorem matches a standard lower bound for multi-armed bandits, in terms of the relation² of d and T. However, since the b bits extracted are arbitrary, it also implies lower bounds for other partial-information settings. For example, we immediately get an Ω(d/k) lower bound when we are allowed to view k coordinates instead of 1, corresponding to (say) the semi-bandit feedback model ([20]), or the side-observation model of [37] with a fixed upper bound k on the number of side-observations. In partial monitoring ([19]), we get an Ω(d/k) lower bound where k is the logarithm of the feedback matrix width. In attribute-efficient learning (e.g. [21]), a simple reduction implies an Ω(d/k) sample-complexity lower bound when we are constrained to view at most k features of each example. In general, for any current or future feedback model which corresponds to getting a limited number of bits from the loss vector, our results imply a non-trivial lower bound.

4.2 Stochastic Optimization

We now turn to consider an example from stochastic optimization, where our goal is to approximately minimize F(h) = E_Z[f(h; Z)] given access to m i.i.d. instantiations of Z, whose distribution is unknown. This setting has received much attention in recent years, and can be used to model many statistical learning problems. In this section, we show a stochastic optimization problem where information-constrained protocols provably pay a performance price compared to non-constrained algorithms. We emphasize that it is going to be a very simple toy problem, and is not meant to represent anything realistic. We present it for two reasons: first, it illustrates another type of situation where information-constrained protocols may fail (in particular, problems involving matrices). Second, the intuition of the construction is also used in the more realistic problem of sparse PCA and covariance estimation, considered in the next section.

The construction is as follows: Suppose we wish to solve min_{(w,v)} F(w, v) = E_Z[f((w, v); Z)],

² Based on a conjectured improvement of our results, it should also be possible to prove an Ω̃(√((d/b)T)) regret lower bound, which holds for any d, T and any algorithm extracting b bits from the loss vector. See Sec. 5 for more details.


where f((w, v); Z) = w^⊤ Z v, Z ∈ [−1, +1]^{d×d}, and w, v range over all vectors in the simplex (i.e. w_i, v_i ≥ 0 and Σ_{i=1}^d w_i = Σ_{i=1}^d v_i = 1). Then we have, for instance, the following:

Theorem 5. Suppose d ≥ 10. Given m i.i.d. instantiations Z¹, . . . , Z^m from any distribution over Z, let (Ĩ, J̃) = arg min_{i,j} (1/m) Σ_{t=1}^m Z^t_{i,j}. Then the solution (e_Ĩ, e_J̃) satisfies, with probability at least 1 − δ,

  F(e_Ĩ, e_J̃) − min_{w,v} F(w, v) ≤ 2 √(2 log(d²/δ)/m).

However, for any (b, 1, m) protocol where mb < d²/290, and any solution w̃, ṽ returned by the protocol given m i.i.d. instances, there exists a distribution over Z such that, with probability > 1/2,

  F(w̃, ṽ) − min_{w,v} F(w, v) ≥ 1/8.

The upper bound in the theorem follows from a simple concentration-of-measure argument, noting that if E[Z] has a minimal element at location (i*, j*), then F(w, v) = E[f((w, v); Z)] is minimized at w = e_{i*}, v = e_{j*}, and attains a value equal to that minimal element. The lower bound follows by considering distributions where Z ∈ {−1, +1}^{d×d} with probability 1, each of the d² entries is chosen independently, and E[Z] is zero except at some coordinate (i*, j*) where it equals −1/4. For such distributions, getting optimization error smaller than a constant reduces to detecting (i*, j*), and this in turn reduces to the hide-and-seek problem defined in Definition 3, over d² coordinates and with a bias ρ = 1/4. However, we showed that no information-constrained protocol will succeed if mb is much smaller than the number of coordinates, which is d² in this case. We provide a full proof in Appendix B.2.

According to the theorem, there is a sample-size regime m where we can get a small optimization error without information constraints, using a very simple procedure, yet any (say) stochastic gradient-based optimization algorithm (whose memory is linear in the dimension d), or any non-interactive distributed algorithm with per-machine communication budget linear in d, will fail to get sub-constant error.
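The unconstrained plug-in procedure from Theorem 5 is just an entrywise arg min over the empirical mean matrix; a sketch with our hypothetical helper name:

```python
def plugin_minimizer(samples):
    """Given matrices Z^1, ..., Z^m (lists of lists), return the index
    pair (I, J) minimizing the empirical mean (1/m) * sum_t Z^t[i][j];
    the corresponding solution to min F(w, v) is then (e_I, e_J)."""
    m, d = len(samples), len(samples[0])
    best, best_val = (0, 0), float("inf")
    for i in range(d):
        for j in range(d):
            val = sum(Z[i][j] for Z in samples) / m
            if val < best_val:
                best, best_val = (i, j), val
    return best
```

Note that this procedure stores all d² running averages, which is exactly the quadratic memory an information-constrained protocol cannot afford.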

4.3 Sparse PCA and Sparse Covariance Estimation

The sparse PCA problem ([53]) is a standard and well-known statistical estimation problem, defined as follows: we are given an i.i.d. sample of vectors x ∈ R^d, and we assume that there is some direction, corresponding to some sparse vector v (of cardinality at most k), such that the variance E[(v^⊤x)²] along that direction is larger than along any other direction. Our goal is to find that direction.

We will focus here on the simplest possible form of this problem, where the maximizing direction v is assumed to be 2-sparse, i.e. there are only 2 non-zero coordinates v_i, v_j. In

that case, E[(v> x)2 ] = v12 E[x21 ] + v22 E[x22 ] + 2v1 v2 E[xi xj ]. Following previous work (e.g. [14]), we even assume that E[x2i ] = 1 for all i, in which case the sparse PCA problem reduces to detecting a coordinate pair (i∗ , j ∗ ), i∗ < j ∗ for which xi∗ , xj ∗ are maximally correlated. A special case is a simple and natural sparse covariance estimation problem ([15, 18]), where we assume that all covariates are uncorrelated (E[xi xj ] = 0) except for unique correlated pair of covariates (i∗ , j ∗ ) which we need to detect. This setting bears a resemblance to the example seen in the context of stochastic optimization in section 4.2: We have a d × d stochastic matrix xx> , and we need to detect a single biased entry at location (i∗ , j ∗ ). Unfortunately, these stochastic matrices are rank-1, and do not have independent entries as in the example considered in section 4.2. Thus, we need to use a somewhat different construction, relying on a distribution supported on sparse vectors. The intuition is that then each xx> would be sparse, and the situation can be reduced to the hide-and-seek problem considered in Definition 4 and Thm. 2. Formally, for 2-sparse PCA (or covariance estimation) problems, we show the following gap between the attainable performance of any b-memory online protocol, or even any (b, n, m) protocol for a particular choice of n, and a simple plug-in estimator without information constraints. Theorem 6. Consider the class of 2-sparse PCA (or covariance estimation) problems in d ≥ 9 dimensions as described above, and all distributions such that: 1. E[x2i ] = 1 for all i. 2. For a unique pair of distinct coordinates (i∗ , j ∗ ), it holds that E[xi∗ xj ∗ ] = τ > 0, whereas E[xi xj ] = 0 for all distinct coordinate pairs (i, j) 6= (i∗ , j ∗ ). 3. For any i < j, if xg i xj isthe empirical average of xi xj over m i.i.d. instances, then τ 2 Pr |xg i xj − E[xi xj ]| ≥ 2 ≤ 2 exp (−mτ /6). 
Then the following holds:

• Let $(\tilde{I},\tilde{J}) = \arg\max_{i<j} \widetilde{x_ix_j}$. Then for any distribution as above, $\Pr((\tilde{I},\tilde{J}) = (i^*,j^*)) \ge 1 - d^2\exp(-m\tau^2/6)$. In particular, when the bias $\tau$ equals $\Theta(1/d\log(d))$,
\[ \Pr\left((\tilde{I},\tilde{J}) = (i^*,j^*)\right) \ge 1 - d^2\exp\left(-\Omega\left(\frac{m}{d^2\log^2(d)}\right)\right). \]

• For any estimate $(\tilde{I},\tilde{J})$ of $(i^*,j^*)$ returned by any $b$-memory online protocol using $m$ instances, or any $(b,\, d(d-1),\, \lceil \frac{m}{d(d-1)} \rceil)$ protocol, there exists a distribution with bias $\tau = \Theta(1/d\log(d))$ as above such that
\[ \Pr\left((\tilde{I},\tilde{J}) = (i^*,j^*)\right) \le O\left(\frac{1}{d^2} + \sqrt{\frac{m}{d^4/b}}\right). \]

The proof appears in Appendix B.3. The theorem implies that in the regime where $b \ll d^4/\log^2(d)$, we can choose any $m$ such that $db \ll m \ll d^2\log^2(d)$, and get that the chances of the protocol detecting $(i^*,j^*)$ are arbitrarily small, even though the empirical average reveals $(i^*,j^*)$ with arbitrarily high probability. Thus, in this sparse PCA / covariance estimation setting, any online algorithm with sub-quadratic memory cannot be statistically optimal for all sample sizes. The same holds for any $(b,n,m)$ protocol in an appropriate regime of $(n,m)$, such as the distributed algorithms discussed earlier. To the best of our knowledge, this is the first result which explicitly shows that memory constraints can incur a statistical cost for a standard estimation problem. It is interesting that sparse PCA was also recently shown to be affected by computational constraints on the algorithm's runtime ([14]). The theorem shows a performance gap in the regime where the number of instances $m$ is much larger than $d^2$ and much smaller than $d^4/b$. However, this is not the only possible regime, and it is possible to establish such gaps for similar problems already when $m$ is linear in $b$ (up to log-factors) - see the proof in Appendix B.3 for more details. Also, the distribution on which information-constrained protocols are shown to fail satisfies the theorem's conditions, but is 'spiky' and rather unnatural. Proving a similar result for a 'natural' data distribution (e.g. Gaussian) remains an interesting open problem, as further discussed in Sec. 5.
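To make the constraint-free baseline in Theorem 6 concrete, here is a minimal sketch of the plug-in estimator from the first bullet: average $x_ix_j$ over the sample and return the maximizing pair. The data model below (Gaussian coordinates with a single correlated pair, and the particular values of the dimension, sample size and correlation strength) is an illustrative assumption, not the 'spiky' construction used in the lower bound.

```python
import itertools
import numpy as np

def plugin_detect(X):
    """Plug-in estimator: return the pair (i, j), i < j, maximizing the
    empirical average of x_i * x_j over the sample X of shape (m, d)."""
    m, d = X.shape
    C = X.T @ X / m  # empirical second-moment matrix
    return max(itertools.combinations(range(d), 2), key=lambda ij: C[ij])

# Toy data model (an assumption for illustration): unit-variance coordinates,
# with coordinates (2, 5) correlated with strength tau, all others independent.
rng = np.random.default_rng(0)
d, m, tau = 8, 20000, 0.3
Z = rng.standard_normal((m, d))
Z[:, 5] = tau * Z[:, 2] + np.sqrt(1 - tau**2) * rng.standard_normal(m)
print(plugin_detect(Z))  # with high probability recovers the pair (2, 5)
```

Note that this estimator maintains the full $d\times d$ empirical second-moment matrix; the theorem's point is that protocols with substantially less memory cannot always match it.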

5  Discussion and Open Questions

In this paper, we investigated cases where a generic type of information-constrained algorithm has strictly inferior statistical performance compared to constraint-free algorithms. As special cases, we demonstrated such gaps for memory-constrained and communication-constrained algorithms (e.g. in the context of sparse PCA and covariance estimation), as well as online learning with partial information and stochastic optimization. These results are based on explicitly considering the information-theoretic structure of the problem, and depend only on the number of bits extracted from each data point. We believe these results form a first step towards a fuller understanding of how information constraints affect learning ability in a statistical setting. There are several immediate questions left open.

One question is whether our bounds for $(b,n,m)$ protocols in Thm. 1 and Thm. 2 can be improved. We conjecture this is true, and that the bound should actually depend on $mn\rho^2 b/d$ rather than $mb/d$, and without constraints on the bias $\rho$.³ This would allow us, for instance, to show performance gaps for much wider sample size regimes, as well as recover a stronger version of Thm. 4 for online learning with partial information, implying a regret lower bound of $\tilde{\Omega}(\sqrt{(d/b)/T})$ for any number of rounds $T$.

A second open question is whether there are convex stochastic optimization problems for which online or distributed algorithms are provably inferior to constraint-free algorithms (the example discussed in section 4.2 refers to an easily-solvable yet non-convex problem). Due to the large current effort in developing scalable algorithms for convex learning problems, this would establish that one must pay a statistical price for using such memory-and-time efficient algorithms.

A third open question is whether the results for non-interactive (or serial) distributed algorithms can be extended to more interactive algorithms, where the different machines can communicate over several rounds. As mentioned earlier, there is a rich literature on the communication complexity of interactive distributed algorithms within theoretical computer science, but we don't know how to 'import' these results to a statistical setting based on i.i.d. data.

A fourth open question relates to the sparse-PCA / covariance estimation result. The hardness result we gave for information-constrained protocols uses a tailored distribution, which has a sufficiently controlled tail behavior but is 'spiky' and not sub-Gaussian uniformly in the dimension. Thus, it would be interesting to establish similar hardness results for 'natural' distributions (e.g. Gaussian). More generally, there is much work remaining in extending the results here to other learning problems and other information constraints.

³The looseness potentially comes from the application of Lemma 3 and Jensen's inequality in the proof of Thm. 1, which causes the bound to lose the property of becoming trivial if $\Pr_j(\cdot) = \Pr_0(\cdot)$.

References

[1] A. Agarwal, P. Bartlett, P. Ravikumar, and M. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249, 2012.
[2] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. arXiv preprint arXiv:1110.4198, 2011.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, 1996.
[4] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. In NIPS, 2013.
[5] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Minimax policies for combinatorial prediction games. In COLT, 2011.
[6] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[8] M. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.
[9] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In NIPS, 2013.
[10] Z. Bar-Yossef, T. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. In FOCS, 2002.
[11] B. Barak, M. Braverman, X. Chen, and A. Rao. How to compress interactive communication. In STOC, 2010.
[12] G. Bartók, D. Foster, D. Pál, A. Rakhlin, and C. Szepesvári. Partial monitoring - classification, regret bounds, and algorithms. 2013.
[13] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
[14] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In COLT, 2013.
[15] J. Bien and R. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807-820, 2011.
[16] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[17] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[18] T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494), 2011.
[19] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[20] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.
[21] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. The Journal of Machine Learning Research, 12:2857-2878, 2011.
[22] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In CCC, 2003.
[23] S. Chien, K. Ligett, and A. McGregor. Space-efficient estimation of robust statistics and distribution testing. In ICS, 2010.
[24] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, 2011.
[25] T. Cover and J. Thomas. Elements of information theory. John Wiley & Sons, 2006.
[26] M. Crouch, A. McGregor, and D. Woodruff. Stochastic streams: Sample complexity vs. space complexity. In MASSIVE, 2013.
[27] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML, 2011.
[28] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165-202, 2012.
[29] S. S. Dragomir. Upper and lower bounds for Csiszár's f-divergence in terms of the Kullback-Leibler distance and applications. In Inequalities for Csiszár f-Divergence in Information Theory. RGMIA Monographs, 2000.
[30] J. Duchi, A. Agarwal, and M. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2012.
[31] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[32] S. Guha and A. McGregor. Space-efficient sampling. In AISTATS, 2007.
[33] A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369-2403, 2007.
[34] A. Kyrola, D. Bickson, C. Guestrin, and J. Bradley. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
[35] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
[36] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
[37] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In NIPS, 2011.
[38] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. arXiv preprint arXiv:1307.0032, 2013.
[39] S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.
[40] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
[41] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[42] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[43] M. Raginsky and A. Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 57(10):7036-7056, 2011.
[44] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[45] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[46] S. Sra, S. Nowozin, and S. Wright. Optimization for Machine Learning. MIT Press, 2011.
[47] I. Taneja and P. Kumar. Relative information of type s, Csiszár's f-divergence, and information inequalities. Information Sciences, 166(1-4):105-125, 2004.
[48] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[49] D. Woodruff. The average-case complexity of counting distinct elements. In ICDT, 2009.
[50] Y. Zhang, J. Duchi, M. Jordan, and M. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, 2013.
[51] Y. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. In NIPS, 2012.
[52] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. In COLT, 2013.
[53] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.

A  Proofs from Sec. 3

The proofs use several standard quantities and results from information theory - see Appendix C for more details. They also make use of a few technical lemmas (presented in Subsection A.1), and a key lemma (in Subsection A.2) which quantifies how information-constrained protocols cannot provide information on all coordinates simultaneously. The proofs themselves appear in Subsection A.3 and Subsection A.4.

A.1  Technical Lemmas

Lemma 1. Suppose that $d > 1$, and for some fixed distribution $\Pr_0(\cdot)$ over the messages $w^1,\ldots,w^m$ computed by an information-constrained protocol, it holds that
\[ \sqrt{\frac{2}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w^1\ldots w^m)\,||\,\Pr_j(w^1\ldots w^m)\right)} \le B. \]
Then there exists some $j$ such that
\[ \Pr_j(\tilde{J} = j) \le \frac{3}{d} + 2B. \]

Proof. By concavity of the square root, we have
\[ \sqrt{\frac{2}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w^1\ldots w^m)||\Pr_j(w^1\ldots w^m)\right)} \ge \frac{1}{d}\sum_{j=1}^{d}\sqrt{2\,D_{kl}\left(\Pr_0(w^1\ldots w^m)||\Pr_j(w^1\ldots w^m)\right)}. \]
Using Pinsker's inequality and the fact that $\tilde{J}$ is some function of the messages $w^1,\ldots,w^m$ (independent of the data distribution), this is at least
\begin{align*}
&\frac{1}{d}\sum_{j=1}^{d}\sum_{w^1\ldots w^m}\left|\Pr_0(w^1\ldots w^m) - \Pr_j(w^1\ldots w^m)\right| \\
&\ge \frac{1}{d}\sum_{j=1}^{d}\left|\sum_{w^1\ldots w^m}\left(\Pr_0(w^1\ldots w^m) - \Pr_j(w^1\ldots w^m)\right)\Pr\left(\tilde{J}=j\mid w^1\ldots w^m\right)\right| \\
&= \frac{1}{d}\sum_{j=1}^{d}\left|\Pr_0(\tilde{J}=j) - \Pr_j(\tilde{J}=j)\right|.
\end{align*}
Thus, we may assume that
\[ \frac{1}{d}\sum_{j=1}^{d}\left|\Pr_0(\tilde{J}=j) - \Pr_j(\tilde{J}=j)\right| \le B. \]

The argument now uses a basic variant of the probabilistic method. If the expression above is at most $B$, then for at least $d/2$ values of $j$, it holds that $|\Pr_0(\tilde{J}=j) - \Pr_j(\tilde{J}=j)| \le 2B$. Also, since $\sum_{j=1}^{d}\Pr_0(\tilde{J}=j) = 1$, for at least $2d/3$ values of $j$ it holds that $\Pr_0(\tilde{J}=j) \le 3/d$. Combining the two observations, and assuming that $d > 1$, there must exist some value of $j$ for which both $|\Pr_0(\tilde{J}=j) - \Pr_j(\tilde{J}=j)| \le 2B$ and $\Pr_0(\tilde{J}=j) \le 3/d$, hence $\Pr_j(\tilde{J}=j) \le \frac{3}{d} + 2B$ as required.

Lemma 2. Let $p, q$ be distributions over a product domain $A_1 \times A_2 \times \ldots \times A_d$, where each $A_i$ is a finite set. Suppose that for some $j \in \{1,\ldots,d\}$, the following equality holds for all $z = (z_1,\ldots,z_d) \in A_1\times\ldots\times A_d$:
\[ p(\{z_i\}_{i\neq j}\,|\,z_j) = q(\{z_i\}_{i\neq j}\,|\,z_j). \]
Also, let $E$ be an event such that $p(E|z) = q(E|z)$ for all $z$. Then
\[ p(E) = \sum_{z_j} p(z_j)\,q(E|z_j). \]

Proof.
\begin{align*}
p(E) &= \sum_{z} p(z)\,p(E|z) = \sum_{z} p(z)\,q(E|z) \\
&= \sum_{z_j} p(z_j) \sum_{\{z_i\}_{i\neq j}} p(\{z_i\}_{i\neq j}\,|\,z_j)\,q(E\,|\,z_j, \{z_i\}_{i\neq j}) \\
&= \sum_{z_j} p(z_j) \sum_{\{z_i\}_{i\neq j}} q(\{z_i\}_{i\neq j}\,|\,z_j)\,q(E\,|\,z_j, \{z_i\}_{i\neq j}) \\
&= \sum_{z_j} p(z_j)\,q(E|z_j).
\end{align*}
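Lemma 2's conclusion is easy to sanity-check numerically. The sketch below (an illustration, not part of the proof) builds joint distributions $p, q$ on a two-coordinate product domain that share the conditional distribution of the other coordinate given $z_j$ and the conditional event probabilities, and verifies that $p(E) = \sum_{z_j} p(z_j)\,q(E|z_j)$; the Dirichlet-sampled parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = 3  # each of the two coordinates takes values in {0, 1, 2}; j is the first

# Lemma 2's conditions: p and q differ only in the marginal of z_j; the
# conditional of the other coordinate given z_j, and the event probability
# e(z) = p(E|z) = q(E|z), are identical under both distributions.
p_marg = rng.dirichlet(np.ones(A))        # p(z_j)
q_marg = rng.dirichlet(np.ones(A))        # q(z_j)
cond = rng.dirichlet(np.ones(A), size=A)  # shared p(z_other | z_j) = q(z_other | z_j)
e = rng.random((A, A))                    # shared event probability e(z_j, z_other)

p = p_marg[:, None] * cond                # joint p(z_j, z_other)

p_E = (p * e).sum()                       # p(E)
q_E_given_zj = (cond * e).sum(axis=1)     # q(E | z_j)
rhs = (p_marg * q_E_given_zj).sum()       # sum over z_j of p(z_j) q(E | z_j)
print(np.isclose(p_E, rhs))               # True: matches the lemma's conclusion
```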

Lemma 3 ([29], Proposition 1; [47], Corollary 4.3). Let $p, q$ be two distributions on a discrete set, such that $\max_x \frac{p(x)}{q(x)} \le c$. Then
\[ D_{kl}\left(p(\cdot)||q(\cdot)\right) \le c\, D_{kl}\left(q(\cdot)||p(\cdot)\right). \]

Lemma 4. Suppose we throw $n$ balls independently and uniformly at random into $d > 1$ bins, and let $K_1,\ldots,K_d$ denote the number of balls in each of the $d$ bins. Then for any $\epsilon \ge 0$ such that $\epsilon \le \min\left\{\frac{1}{6}, \frac{1}{2\log(d)}, \frac{d}{3n}\right\}$, it holds that
\[ \mathbb{E}\left[\exp\left(\epsilon \max_j K_j\right)\right] < 13. \]

Proof. Each $K_j$ can be written as $\sum_{i=1}^{n} \mathbb{1}(\text{ball } i \text{ fell into bin } j)$, and has expectation $n/d$. Therefore, by a standard multiplicative Chernoff bound, for any $\gamma \ge 0$,
\[ \Pr\left(K_j > (1+\gamma)\frac{n}{d}\right) \le \exp\left(-\frac{\gamma^2}{2(1+\gamma)}\frac{n}{d}\right). \]
By a union bound, this implies that
\[ \Pr\left(\max_j K_j > (1+\gamma)\frac{n}{d}\right) \le \sum_{j=1}^{d}\Pr\left(K_j > (1+\gamma)\frac{n}{d}\right) \le d\exp\left(-\frac{\gamma^2}{2(1+\gamma)}\frac{n}{d}\right). \]
In particular, if $\gamma + 1 \ge 6$, we can upper bound the right hand side by the simpler expression $d\exp(-(1+\gamma)n/3d)$. Letting $\tau = \gamma + 1$, we get that for any $\tau \ge 6$,
\[ \Pr\left(\max_j K_j > \tau\frac{n}{d}\right) \le d\exp\left(-\frac{\tau n}{3d}\right). \tag{2} \]
Define $c = \max\{8, d^{3\epsilon}\}$. Using the inequality above and the non-negativity of $\exp(\epsilon \max_j K_j)$, we have
\begin{align*}
\mathbb{E}\left[\exp\left(\epsilon \max_j K_j\right)\right] &= \int_{t=0}^{\infty} \Pr\left(\exp\left(\epsilon \max_j K_j\right) \ge t\right) dt \le c + \int_{t=c}^{\infty} \Pr\left(\exp\left(\epsilon \max_j K_j\right) \ge t\right) dt \\
&= c + \int_{t=c}^{\infty} \Pr\left(\max_j K_j \ge \frac{\log(t)}{\epsilon}\right) dt = c + \int_{t=c}^{\infty} \Pr\left(\max_j K_j \ge \frac{\log(t)d}{\epsilon n}\cdot\frac{n}{d}\right) dt.
\end{align*}
Since we assume $\epsilon \le d/3n$ and $c \ge 8$, it holds that $\exp(6\epsilon n/d) \le \exp(2) < 8 \le c$, which implies $\log(c)d/\epsilon n \ge 6$. Therefore, for any $t \ge c$, it holds that $\log(t)d/\epsilon n \ge 6$. This allows us to use Eq. (2) to upper bound the expression above by
\[ c + d\int_{t=c}^{\infty} \exp\left(-\frac{\log(t)d}{\epsilon n}\cdot\frac{n}{3d}\right) dt = c + d\int_{t=c}^{\infty} t^{-1/3\epsilon}\, dt. \]
Since we assume $\epsilon \le 1/6$, we have $1/(3\epsilon) \ge 2$, and therefore we can solve the integral to get
\[ c + \frac{d}{\frac{1}{3\epsilon}-1}\, c^{1-\frac{1}{3\epsilon}} \le c + d\, c^{1-\frac{1}{3\epsilon}}. \]
Using the value of $c$, and since $1 - \frac{1}{3\epsilon} \le -1$, this is at most
\[ \max\{8, d^{3\epsilon}\} + d\cdot\left(d^{3\epsilon}\right)^{1-\frac{1}{3\epsilon}} = \max\{8, d^{3\epsilon}\} + d^{3\epsilon}. \]
Since $\epsilon \le 1/2\log(d)$, this is at most $\max\{8, \exp(3/2)\} + \exp(3/2) < 13$ as required.
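The constant 13 in Lemma 4 can also be checked by simulation. The sketch below picks arbitrary values of $n$ and $d$, sets $\epsilon$ to the largest value allowed by the lemma's condition, and Monte Carlo-estimates $\mathbb{E}[\exp(\epsilon \max_j K_j)]$; the specific parameters and trial count are illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of Lemma 4 (illustration only): throw n balls into d bins
# and estimate E[exp(eps * max_j K_j)] under the lemma's constraint on eps.
rng = np.random.default_rng(0)
n, d = 30, 10
eps = min(1 / 6, 1 / (2 * np.log(d)), d / (3 * n))

trials = 20000
throws = rng.integers(0, d, size=(trials, n))  # bin index of each of the n balls
max_load = np.array([np.bincount(row, minlength=d).max() for row in throws])
estimate = np.exp(eps * max_load).mean()       # estimates E[exp(eps * max_j K_j)]
print(estimate < 13)  # the lemma guarantees the true expectation is below 13
```

In practice the estimate is far below 13 (around 2 for these parameters); the lemma's constant is deliberately loose.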

A.2  A Key Lemma

The following key lemma quantifies how a bounded-size message $W$ is incapable of simultaneously capturing much information on many almost-independent random variables.

Lemma 5. Let $Z_1,\ldots,Z_d$ be independent random variables, and let $W$ be a random variable which can take at most $2^b$ values. Then
\[ \frac{1}{d}\sum_{j=1}^{d} I(W; Z_j) \le \frac{b}{d}. \]

Proof. We have
\[ \frac{1}{d}\sum_{j=1}^{d} I(W; Z_j) = \frac{1}{d}\sum_{j=1}^{d}\left(H(Z_j) - H(Z_j|W)\right). \]
Using the fact that $\sum_{j=1}^{d} H(Z_j|W) \ge H(Z_1\ldots Z_d|W)$, this is at most
\begin{align}
&\frac{1}{d}\sum_{j=1}^{d} H(Z_j) - \frac{1}{d}H(Z_1\ldots Z_d|W) \nonumber\\
&= \frac{1}{d}\sum_{j=1}^{d} H(Z_j) - \frac{1}{d}\left(H(Z_1\ldots Z_d) - I(Z_1\ldots Z_d; W)\right) \nonumber\\
&= \frac{1}{d}I(Z_1\ldots Z_d; W) + \frac{1}{d}\left(\sum_{j=1}^{d} H(Z_j) - H(Z_1\ldots Z_d)\right). \tag{3}
\end{align}
Since $Z_1\ldots Z_d$ are independent, $\sum_{j=1}^{d} H(Z_j) = H(Z_1\ldots Z_d)$, hence the above equals
\[ \frac{1}{d}I(Z_1\ldots Z_d; W) = \frac{1}{d}\left(H(W) - H(W|Z_1\ldots Z_d)\right) \le \frac{1}{d}H(W), \]
which is at most $b/d$ since $W$ is only allowed to take $2^b$ values.
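Lemma 5 is tight: with $d$ i.i.d. uniform bits and a $b=1$ bit message $W = Z_1$, we get $I(W;Z_1)=1$ and $I(W;Z_j)=0$ for $j>1$, so the average is exactly $b/d$. The sketch below verifies this by exact enumeration; the helper `mutual_info_bits` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from itertools import product

def mutual_info_bits(d, message):
    """Average (1/d) * sum_j I(W; Z_j) in bits, for d i.i.d. uniform bits
    Z_1..Z_d and W = message(z), computed by exact enumeration."""
    joint = {}  # (w, j, z_j) -> probability
    for z in product([0, 1], repeat=d):
        w = message(z)
        for j, zj in enumerate(z):
            joint[w, j, zj] = joint.get((w, j, zj), 0.0) + 2.0 ** -d
    total = 0.0
    for j in range(d):
        # marginal of W (restricted to the entries recorded for this j)
        pw = {}
        for (w, jj, zj), pr in joint.items():
            if jj == j:
                pw[w] = pw.get(w, 0.0) + pr
        # I(W; Z_j) = sum p(w, z_j) log2( p(w, z_j) / (p(w) p(z_j)) ), p(z_j) = 1/2
        for (w, jj, zj), pr in joint.items():
            if jj == j and pr > 0:
                total += pr * np.log2(pr / (pw[w] * 0.5))
    return total / d

d = 6
avg = mutual_info_bits(d, lambda z: z[0])  # one-bit message W = Z_1
print(np.isclose(avg, 1 / d))  # True: the average information is exactly b/d
```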

A.3  Proof of Thm. 1

On top of the distributions $\Pr_j(\cdot)$ defined in the hide-and-seek problem (Definition 3), we define an additional 'reference' distribution $\Pr_0(\cdot)$, which corresponds to the instances $x$ being chosen uniformly at random from $\{-1,+1\}^d$ (i.e. there is no biased coordinate). The upper bound in the theorem statement is an immediate consequence of Hoeffding's inequality and a union bound.

Let $w^1,\ldots,w^m$ denote the messages computed by the protocol. To show the lower bound, it is enough to prove that
\[ \frac{2}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w^1\ldots w^m)\,\big|\big|\,\Pr_j(w^1\ldots w^m)\right) \le 15\frac{mb}{d}, \tag{4} \]
since then by applying Lemma 1, we get that for some $j$, $\Pr_j(\tilde{J}=j) \le (3/d) + 2\sqrt{15mb/d} \le (3/d) + 8\sqrt{mb/d}$ as required. Using the chain rule, the left hand side in Eq. (4) equals
\begin{align}
&\frac{2}{d}\sum_{j=1}^{d}\sum_{t=1}^{m} \mathbb{E}_{w^1\ldots w^{t-1}\sim \Pr_0}\left[D_{kl}\left(\Pr_0(w^t|w^1\ldots w^{t-1})||\Pr_j(w^t|w^1\ldots w^{t-1})\right)\right] \nonumber\\
&= 2\sum_{t=1}^{m} \mathbb{E}_{w^1\ldots w^{t-1}\sim \Pr_0}\left[\frac{1}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w^t|w^1\ldots w^{t-1})||\Pr_j(w^t|w^1\ldots w^{t-1})\right)\right]. \tag{5}
\end{align}
Let us focus on a particular choice of $t$ and values $w^1\ldots w^{t-1}$. To simplify the presentation, we drop the $t$ superscript from the message $w^t$, and denote the previous messages $w^1\ldots w^{t-1}$ as $\hat{w}$. Thus, we consider the quantity
\[ \frac{1}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w|\hat{w})\,||\,\Pr_j(w|\hat{w})\right). \tag{6} \]
Recall that $w$ is some function of $\hat{w}$ and a set of $n$ instances received in the current round. Let $x_j$ denote the vector of values at coordinate $j$ across these $n$ instances. We can now use Lemma 2, where $p(\cdot) = \Pr_j(\cdot|\hat{w})$, $q(\cdot) = \Pr_0(\cdot|\hat{w})$ and $A_i = \{-1,+1\}^n$ (i.e. the vector of values at a single coordinate $i$ across the $n$ data points). The lemma's conditions are satisfied since $x_i$ for $i\neq j$ has the same distribution under $\Pr_0(\cdot|\hat{w})$ and $\Pr_j(\cdot|\hat{w})$, and also $w$ is only a function of $x_1\ldots x_d$ and $\hat{w}$. Thus, we can rewrite Eq. (6) as
\[ \frac{1}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w|\hat{w})\,\Big|\Big|\,\sum_{x_j}\Pr_0(w|x_j,\hat{w})\Pr_j(x_j|\hat{w})\right). \]
Using Lemma 3, we can reverse the arguments in the relative entropy term, and upper bound the above by
\[ \frac{1}{d}\sum_{j=1}^{d}\left(\max_w \frac{\Pr_0(w|\hat{w})}{\sum_{x_j}\Pr_0(w|x_j,\hat{w})\Pr_j(x_j|\hat{w})}\right) D_{kl}\left(\sum_{x_j}\Pr_0(w|x_j,\hat{w})\Pr_j(x_j|\hat{w})\,\Big|\Big|\,\Pr_0(w|\hat{w})\right). \tag{7} \]
The max term equals
\[ \max_w \frac{\sum_{x_j}\Pr_0(w|x_j,\hat{w})\Pr_0(x_j|\hat{w})}{\sum_{x_j}\Pr_0(w|x_j,\hat{w})\Pr_j(x_j|\hat{w})} \le \max_{x_j}\frac{\Pr_0(x_j|\hat{w})}{\Pr_j(x_j|\hat{w})}, \]
and using Jensen's inequality and the fact that relative entropy is convex in its arguments, we can upper bound the relative entropy term by
\begin{align*}
&\sum_{x_j}\Pr_j(x_j|\hat{w})\,D_{kl}\left(\Pr_0(w|x_j,\hat{w})\,||\,\Pr_0(w|\hat{w})\right) \\
&\le \left(\max_{x_j}\frac{\Pr_j(x_j|\hat{w})}{\Pr_0(x_j|\hat{w})}\right)\sum_{x_j}\Pr_0(x_j|\hat{w})\,D_{kl}\left(\Pr_0(w|x_j,\hat{w})\,||\,\Pr_0(w|\hat{w})\right).
\end{align*}
The sum in the expression above equals the mutual information between the message $w$ and the coordinate vector $x_j$ (seen as random variables with respect to the distribution $\Pr_0(\cdot|\hat{w})$). Writing this as $I_{\Pr_0(\cdot|\hat{w})}(w; x_j)$, we can thus upper bound Eq. (7) by
\begin{align*}
&\frac{1}{d}\sum_{j=1}^{d}\left(\max_{x_j}\frac{\Pr_0(x_j|\hat{w})}{\Pr_j(x_j|\hat{w})}\right)\left(\max_{x_j}\frac{\Pr_j(x_j|\hat{w})}{\Pr_0(x_j|\hat{w})}\right) I_{\Pr_0(\cdot|\hat{w})}(w; x_j) \\
&\le \left(\max_{j,x_j}\frac{\Pr_0(x_j|\hat{w})}{\Pr_j(x_j|\hat{w})}\right)\left(\max_{j,x_j}\frac{\Pr_j(x_j|\hat{w})}{\Pr_0(x_j|\hat{w})}\right)\frac{1}{d}\sum_{j=1}^{d} I_{\Pr_0(\cdot|\hat{w})}(w; x_j).
\end{align*}
Since $\{x_j\}_j$ are independent of each other and $w$ contains at most $b$ bits, we can use the key Lemma 5 to upper bound the above by
\[ \left(\max_{j,x_j}\frac{\Pr_0(x_j|\hat{w})}{\Pr_j(x_j|\hat{w})}\right)\left(\max_{j,x_j}\frac{\Pr_j(x_j|\hat{w})}{\Pr_0(x_j|\hat{w})}\right)\frac{b}{d}. \]
Now, recall that for any $j$, $x_j$ refers to a column of $n$ i.i.d. entries, drawn independently of any previous messages $\hat{w}$, where under $\Pr_0$ each entry is chosen to be $\pm 1$ with equal probability, whereas under $\Pr_j$ each is chosen to be $1$ with probability $\frac{1}{2}+\rho$, and $-1$ with probability $\frac{1}{2}-\rho$. Therefore, we can upper bound the expression above by
\[ \max\left\{\left(\frac{1/2+\rho}{1/2}\right)^n, \left(\frac{1/2}{1/2-\rho}\right)^n\right\}^2 \frac{b}{d}. \]
Since we assume $\rho \le 1/4n$, it is easy to verify that the expression above is at most
\[ \left((1+4\rho)^n\right)^2\frac{b}{d} \le \left((1+1/n)^n\right)^2\frac{b}{d} \le \exp(2)\frac{b}{d}. \]
To summarize, this constitutes an upper bound on Eq. (7), which in turn is an upper bound on Eq. (6), i.e. on any individual term inside the expectation in Eq. (5). Thus, we can upper bound Eq. (5) by $2\exp(2)mb/d < 15mb/d$. This shows that Eq. (4) indeed holds, which as explained earlier implies the required result.
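The upper-bound step invoked at the start of the proof (Hoeffding's inequality plus a union bound over the $d$ coordinates) corresponds to the obvious constraint-free estimator for the hide-and-seek problem: average every coordinate over all instances and return the coordinate with the largest empirical mean. A sketch under illustrative parameter choices (the specific dimension, sample size, bias and biased coordinate below are assumptions):

```python
import numpy as np

# Constraint-free estimator for the hide-and-seek problem of Definition 3
# (sketch; all parameter values are illustrative assumptions).
rng = np.random.default_rng(0)
d, N, rho, j_star = 20, 5000, 0.1, 7

# Each instance is uniform on {-1,+1}^d, except coordinate j_star, which is
# +1 with probability 1/2 + rho.
X = rng.choice([-1, 1], size=(N, d))
X[:, j_star] = np.where(rng.random(N) < 0.5 + rho, 1, -1)

# By Hoeffding's inequality and a union bound, the largest empirical mean
# identifies j_star except with probability at most 2d * exp(-N * rho**2 / 2).
print(int(np.argmax(X.mean(axis=0))))  # with high probability recovers j_star
```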

A.4  Proof of Thm. 2

On top of the distributions $\Pr_j(\cdot)$ defined in the hide-and-seek problem (Definition 4), we define an additional 'reference' distribution $\Pr_0(\cdot)$, which corresponds to the instances being chosen uniformly at random from $\{-e_i, +e_i\}_{i=1}^{d}$ (i.e. there is no biased coordinate).

We start by proving the upper bound. Any individual entry of the instances has magnitude at most $1$, and $\mathbb{E}[x_j^2] \le \frac{1}{d}$ with respect to any $\Pr_J(\cdot)$. Applying Bernstein's inequality, the probability of its empirical average deviating from its mean by at least $\rho/d$ is at most
\[ 2\exp\left(-\frac{mn(\rho/d)^2}{\frac{2}{d} + \frac{2}{3}\frac{\rho}{d}}\right) = 2\exp\left(-\frac{mn\rho^2}{\left(2+\frac{2}{3}\rho\right)d}\right). \]
Since $\rho \le 1/27$, this can be upper bounded by $2\exp(-mn\rho^2/(3d))$. Thus, by a union bound, with probability at least $1 - 2d\exp(-mn\rho^2/(3d))$, the empirical averages of all coordinates will deviate from their respective means by less than $\rho/d$. If this occurs, we can detect $J$ simply by computing the empirical averages and picking the estimate $\tilde{J}$ to be the coordinate with the highest empirical average.

The proof of the lower bound is quite similar to the lower bound proof of Thm. 1, but with a few more technical intricacies. Let $w^1,\ldots,w^m$ denote the messages computed by the protocol. To show the lower bound, it is enough to prove that
\[ \frac{2}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr_0(w^1\ldots w^m)\,\big|\big|\,\Pr_j(w^1\ldots w^m)\right) \le \frac{26mb}{d}, \tag{8} \]
since then by applying Lemma 1, we get that for some $j$, $\Pr_j(\tilde{J}=j) \le (3/d) + 2\sqrt{26mb/d} \le (3/d) + 11\sqrt{mb/d}$ as required. Using the chain rule, the left hand side in Eq. (8) equals