The Limitations of Optimization from Samples

Eric Balkanski∗

Aviad Rubinstein†

Yaron Singer‡

arXiv:1512.06238v2 [cs.DS] 7 Apr 2016

Abstract

In this paper we consider the following question: can we optimize decisions on models learned from data and be guaranteed that we achieve desirable outcomes? We formalize this question through a novel framework called optimization from samples (OPS). In the OPS framework, we are given sampled values of a function drawn from some distribution and the objective is to optimize the function under some constraint. We show that there are classes of functions which have desirable learnability and optimizability guarantees and for which no reasonable approximation for optimization from samples is achievable. In particular, our main result shows that even for maximization of coverage functions under a cardinality constraint k, there exists a hypothesis class of functions that cannot be approximated within a factor of n^{−1/4+ε} (for any constant ε > 0) of the optimal solution, from samples drawn from the uniform distribution over all sets of size at most k. In the general case of monotone submodular functions, we show an n^{−1/3+ε} lower bound and an almost matching Ω̃(n^{−1/3})-optimization from samples algorithm. On the positive side, if a monotone subadditive function has bounded curvature we obtain desirable guarantees. We also show that additive and unit-demand functions can be optimized from samples to within arbitrarily good precision, and that budget additive functions can be optimized from samples to a factor of 1/2.



∗ School of Engineering and Applied Sciences, Harvard University. [email protected]. This research was supported by a Smith Family Graduate Science and Engineering Fellowship and by NSF grant CCF-1301976, CAREER CCF-1452961.
† Department of Electrical Engineering and Computer Sciences, UC Berkeley. [email protected]. This research was supported by a Microsoft Research PhD Fellowship, NSF grant CCF-1408635, and Templeton Foundation grant 3966, and done in part at the Simons Institute for the Theory of Computing.
‡ School of Engineering and Applied Sciences, Harvard University. [email protected]. This research was supported by NSF grant CCF-1301976, CAREER CCF-1452961.

Figure 1: Examples of shortest path and influence maximization. (a) Shortest path; (b) Influence maximization.

1 Introduction

The traditional approach in optimization typically assumes some underlying model that is known to the algorithm designer, and the focus is on optimizing an objective defined on the given model. In the shortest path problem, for example, we are given a weighted graph and the goal is to find the shortest weighted path from a source to a destination. In the influence maximization problem we are given a weighted graph in which the weights encode the likelihood of one node to forward information to its neighbors, and the goal is to select a subset of nodes to spread information so as to maximize the expected number of nodes in the network that eventually receive information.

In many applications we do not know the models we optimize over, and employ machine learning techniques to approximate them. This is often a reasonable approach since machine learning makes predictions by observing data and estimating parameters of the model in sophisticated ways. For finding good driving routes, for example, we could first observe road traffic and fit weights to a graph that represents the travel time on a road network, and then apply a shortest path calculation. To find influential individuals, we would observe messages forwarded on Twitter, fit weights to a graph that represents the diffusion model, and then apply an influence maximization algorithm. Naturally, the way we estimate the model has an effect on our decision, but by how much? To start this discussion, let's consider the examples below, which illustrate two extreme scenarios.

Example: small estimation errors can lead to poor decisions. Figure 1 illustrates two networks, one for the shortest path problem (Figure 1a) and the other for the influence maximization problem (Figure 1b). Suppose that in both networks the weights are estimated up to a (1 ± ε) factor of error, for some fixed ε > 0. How does this estimation error affect the decision made by the algorithm?
In the shortest path problem, if every edge weight is estimated to be (1 + ε) times its true weight, applying the shortest path algorithm yields a good solution even on the distorted model: in the worst case the path selected is a (1 + ε) approximation. In the influence maximization example, by contrast, a small error in the estimation of the parameters leads the algorithm to make a very poor decision. For the problem of selecting a single node to spread information, the correct solution is to select the red node, and the value in this case is n. But suppose now that the weight of every edge in the chain is estimated to be (1 − ε) instead of 1, and the edges incident to the grey node are all estimated correctly as having weight 1. In this case, the estimated expected number of nodes that receive information from the red node is evaluated as only a constant rather than n, and for any fixed ε and sufficiently large n, the algorithm selects the grey node instead, which reaches far fewer nodes. Thus, a small estimation error results in a decision that is only an n^{1−ε} approximation to the optimal solution.
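A quick numerical sanity check of the chain estimate (a sketch with hypothetical parameters n and ε, not tied to the figure's exact construction): under the (1 − ε) edge estimates, the expected number of chain nodes the red node reaches is the geometric sum Σ_{i=1}^{n} (1 − ε)^i ≈ 1/ε, a constant independent of n.

```python
# Numerical check of the chain example (hypothetical parameters n and eps):
# with every chain edge underestimated as (1 - eps), the red node's
# estimated influence collapses from n to roughly 1/eps, a constant.
def estimated_chain_influence(n, eps):
    # The i-th node on the chain is reached with estimated probability
    # (1 - eps)**i, so the estimated expected reach is a geometric sum.
    return sum((1 - eps) ** i for i in range(1, n + 1))

n, eps = 100_000, 0.01
est = estimated_chain_influence(n, eps)
true_value = n  # with the true weights (all 1), the red node reaches all n nodes
print(est)  # ~ (1 - eps)/eps = 99.0, independent of n
```

For fixed ε the estimate stays bounded as n grows, which is exactly why the algorithm abandons the red node.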


The general consensus is to treat modeling and decisions separately: machine learning is typically responsible for creating a model from observed data, and algorithms then optimize the decision using this model as if it were exact. But as shown above, this approach can be problematic, as the optimization critically depends on the guarantees of the learned model.

Optimization from samples: from predictions to decisions. In this paper we investigate the limitations and possibilities of optimization from sampled data. At a high level, we are interested in understanding whether the fact that we can statistically learn a model, and optimize over the model when given full information, implies that we can optimize decisions from sampled data. In other words, does our ability to make good predictions imply we can make good decisions?

1.1 Optimization from samples

To formalize our question, we need to establish what we mean by model, prediction, decision, learnable, and optimizable.

• Model. We consider a model to be a real-valued function. In the shortest path example, the function is given a path as input and returns its expected latency. For maximizing influence, the function receives a set of individuals and returns the expected number of individuals that receive information.

• Predictions. In our context, a prediction is the ability to predict the behavior of the given model well. That is, for a given function that represents the model, we consider cases in which we can learn the function. The standard framework in the literature for learning set functions [24, 5, 2, 25, 26, 3] is Probably Mostly Approximately Correct (α-PMAC) learnability due to Balcan and Harvey [4], which is a generalization of Valiant's celebrated Probably Approximately Correct (PAC) learnability [53]. Informally, these concepts of learnability guarantee that after observing polynomially many samples of sets and their function values, one can construct a surrogate function that is likely to α-approximately mimic the behavior of the function observed from the samples (see Appendix D for formal definitions).

• Decisions. We are interested in making "good" decisions. Given the function that represents the model, a good decision has value that is close to the optimal value for an objective defined by this function. In the routing example, when the function represents the length of paths, we wish to minimize the function under the constraint that the path is feasible (i.e. it actually exists in the graph). For the influence example, we wish to optimize the influence function under the constraint that we select no more than k nodes.
We use the standard notion of multiplicative approximations: we say that a class of functions is α-optimizable under some constraint if there is a polynomial-time algorithm that approximates the optimal solution to within a factor α > 0. Optimization from samples is interesting when the functions (models) are both approximately learnable and optimizable. If a decision cannot be optimized on a given model, it would be impossible to optimize the same decision with partial information. Similarly, if the behavior of the model cannot be predicted from observations, then (at least intuitively) optimizing decisions on the model from observations seems hard.


A framework for optimization from samples (OPS). We define the optimization from samples criterion as follows. We say that a hypothesis class of functions defined over a ground set N is optimizable from samples (OPS) under constraint M ⊆ 2^N for some given distribution D if there exists an algorithm whose input is polynomially many samples {S_i, f(S_i)}, where the sets S_i are drawn i.i.d. from D and the function f : 2^N → R is in the hypothesis class, and which, with high probability over the samples and the choices of the algorithm, returns S ∈ M s.t. f(S) ≥ α · max_{T∈M} f(T), for some constant α > 0. For example, suppose that for any distribution a function f is (1 − ε)-PMAC learnable for any constant ε > 0, and there exists a constant-factor approximation algorithm for max_{S:|S|≤k} f(S) (given oracle access to f). If we observe polynomially many samples from the uniform distribution D over all sets of size at most k, can we then obtain a constant factor approximation to max_{S:|S|≤k} f(S)?
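As a schematic, the OPS input for the cardinality-constrained case can be generated as follows (a minimal sketch; the function names, the toy additive instance, and all parameters are illustrative, not from the paper):

```python
import math
import random

def sample_set_at_most_k(n, k, rng):
    """Draw a uniformly random subset of {0,...,n-1} of size at most k.

    First pick the size s with probability proportional to C(n, s),
    then a uniform s-subset, so all subsets of size <= k are equally likely.
    """
    weights = [math.comb(n, s) for s in range(k + 1)]
    size = rng.choices(range(k + 1), weights=weights)[0]
    return frozenset(rng.sample(range(n), size))

def ops_input(f, n, k, num_samples, rng):
    """The only access an OPS algorithm gets: i.i.d. samples (S, f(S))."""
    return [(S, f(S)) for S in
            (sample_set_at_most_k(n, k, rng) for _ in range(num_samples))]

# Toy instance: an additive function (one class the paper shows *is*
# optimizable from samples to arbitrary precision).
rng = random.Random(0)
w = [rng.random() for _ in range(20)]
f = lambda S: sum(w[i] for i in S)
samples = ops_input(f, n=20, k=5, num_samples=100, rng=rng)
```

An OPS algorithm must output a feasible set given only `samples`, with no further oracle access to f.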

1.2 Our results

In this paper we mainly focus on the special case in which M is a uniform matroid, i.e., sets of size at most k,¹ and the distribution D is the uniform distribution over all feasible sets, i.e. the uniform distribution over all sets of size at most k.² Thus, we observe polynomially many random sets of size at most k and their values, and the goal is to (approximately) solve max_{S:|S|≤k} f(S).

Main result: impossibility of optimization from samples for coverage functions. Our main result is a spoiler: coverage functions cannot be optimized from samples. A function is called coverage if there exists a family of sets T_1, . . . , T_n covering subsets of a universe {a_1, . . . , a_m} with weights w(a_j) such that for all S, f(S) = Σ_{a_j ∈ ∪_{i∈S} T_i} w(a_j). A coverage function is compact if the universe is of polynomial size in n.

Theorem. For every constant ε > 0, there exists a hypothesis class of coverage functions that is not n^{−1/4+ε}-optimizable from samples, and a hypothesis class of compact coverage functions that is not 2^{−Ω(√log n)}-optimizable from samples.

It is well known that given access to a value oracle of a coverage function, one can obtain a 1 − 1/e approximation to max_{S:|S|≤k} f(S) using polynomially many queries to the oracle [45], which is the best approximation possible [22]. In terms of learnability, for any constant ε > 0 coverage functions are (1 − ε)-PMAC learnable [2]. This holds for any distribution, and in particular for the distribution over sets of size at most k.³ Thus, despite having strong learnability and optimizability guarantees, coverage functions cannot be optimized from samples. This implies that while the agenda of learnability of coverage (and submodular) functions is interesting in its own right, it has no consequences for optimization. In addition, the following points are worth noting:

• Coverage functions are arguably the canonical example of submodular functions.
They are heavily used in machine learning [50, 56, 30, 39, 1, 41, 51], data-mining [12, 37, 47, 17, 46, 48, 14, 29], mechanism design [15, 40, 19, 20, 8, 18, 7], and privacy [31, 24]. In many of these applications (in machine learning and data mining in particular), the functions are learned

¹ The unconstrained case leads to trivial results since a random set achieves a constant factor approximation for non-monotone submodular functions [23].
² Observing feasible sets is an important modeling decision, as otherwise trivial results arise, as further discussed in Appendix C.
³ This result is information theoretic and does not imply the existence of an efficient algorithm for finding the desired surrogate function, which is aligned with our impossibility results being due to sample complexity and not computational complexity.


from data and the goal is to optimize the function under a cardinality constraint. Thus, for many real-world problems, solutions with provable guarantees may be impossible to obtain from sampled data.

• The class of compact coverage functions has additional guarantees, beyond being learnable and optimizable. Any compact coverage function can be exactly recovered, i.e., learned exactly for all sets, using polynomially many (adaptive) queries to a value oracle [9]. In contrast, there are monotone submodular functions for which no algorithm can recover the function using fewer than exponentially many value queries. Despite being a distinguished class within submodular functions with enough structure to be exactly recovered via adaptive queries, compact coverage functions are inapproximable from samples.

• These lower bounds extend to non-adaptive queries over sets of size at most ck for any constant c > 0.

• Small cardinality constraints k are natural to consider, as there are real-world settings where optimization over small sets suffices. If log n ≤ k ≤ n^{1/2−ε}, then we obtain a lower bound of k^{1/2−ε} for coverage functions.

• The lower bounds hold even when the structure of the coverage function is known, and only the weights w(a_j) are unknown. Knowing the structure is only interesting in the compact case, as otherwise for every set S there can be an element a_i in the universe covered exactly by S.

Additional results

Tight bounds for submodular functions. A function f : 2^N → R is monotone if f(S) ≤ f(T) for any S ⊆ T, and submodular if f(S ∪ T) ≤ f(S) + f(T) − f(S ∩ T) for any S, T ⊆ N. For general monotone submodular functions, we show an even stronger lower bound as well as a nearly matching upper bound.

Theorem. For monotone submodular functions, no optimization-from-samples algorithm can obtain an approximation ratio better than n^{−1/3+ε} for any constant ε > 0. Furthermore, there exists an Ω̃(n^{−1/3})-optimization from samples algorithm.

We construct OXS functions for the lower bound (these are a special case of gross substitutes functions). Like coverage functions, for any constant ε > 0 the class of OXS functions is (1 − ε)-PMAC learnable [2]. Regarding optimizability, submodular functions also admit a 1 − 1/e approximation guarantee [45]. Therefore, we have an almost tight optimization-from-samples inapproximability for a class of functions which has both strong learnability and optimizability guarantees. The following points are worth noting:

• Throughout the past decade, there has been a great deal of interest in learning submodular functions [28, 4, 5, 2, 27, 25, 16, 26, 3]. The interest in submodular functions is primarily due to their desirable optimization guarantees. Positive learnability results are obtainable under distributional assumptions; these results are discussed in the additional related work.

• The main ingredient of the nearly matching upper bound involves estimating the expected marginal contribution of an element to a random set. The algorithm is quite simple, yet non-trivial, and the main interest is in its analysis.

• For general monotone submodular functions, for small k where log n ≤ k ≤ n^{1/3−ε}, the lower bound is k^{1−ε}.

• On a positive note, we show that if the function has bounded curvature, desirable guarantees are obtainable. The curvature measures how close a function is to being additive and has recently been studied in the context of submodular functions [54, 36, 35, 55, 49].

Tight bounds for subadditive functions. A function f : 2^N → R is subadditive if S, T ⊆ N implies f(S ∪ T) ≤ f(S) + f(T). For the more general class of monotone subadditive functions, a known result in the value query model gives an n^{−1/2+ε}-inapproximability [42]. Since the value query model is strictly stronger than the model in which we observe samples, this implies an n^{−1/2+ε}-inapproximability for optimization from samples. For this class, we show that simply selecting the sample with the largest value or a random set of size k provides a matching n^{−1/2} upper bound.

Constant factor approximations for bounded curvature, unit-demand, additive, and budget additive functions. For the unit-demand, additive, and budget-additive classes of functions, we give near-optimal and constant factor optimization-from-samples results. The result for unit-demand is particularly interesting as it separates the problem of optimizing from samples from the problem of recoverability. As discussed above, for curvature c, we obtain a ((1 − c)² − o(1))-optimization from samples algorithm, which holds for monotone subadditive functions.

1.2.1 Beyond set functions

Thinking about models as set functions is a useful abstraction, but optimization from samples can be considered for general optimization problems. Instead of the max-k-cover problem, one may ask whether samples of spanning trees can be used for finding an approximately minimum spanning tree. Similarly, one may ask whether shortest paths, matching, maximum likelihood in phylogenetic trees, or any other problem where crucial aspects of the objective function are learned from data, is optimizable from samples.

1.3 Paper organization

Following a discussion of additional related work, the lower bound for coverage functions is presented in Section 2. This section starts with a technical overview of the result (Subsection 2.1), then describes a general framework for constructing OPS lower bounds (Subsection 2.2), which is used to show lower bounds for general (Subsection 2.3) and compact (Subsection 2.4) coverage functions; these lower bounds are then extended to additional distributions and to non-adaptive queries (Subsection 2.5). Section 3 improves the lower bound for submodular functions (Subsection 3.1) and gives a near matching upper bound (Subsection 3.2). The notion of recoverability is defined and discussed in Section 4, followed by the notion of bounded curvature (Section 5), for which we present positive results for functions with bounded curvature, additive functions, and budget additive functions. Finally, the appendix starts with a review of standard concentration bounds (Appendix A) and linear algebra results (Appendix B). The modeling choice of considering sets sampled uniformly at random is then further justified (Appendix C), and the PAC and PMAC learning models are formally defined (Appendix D). We then present additional results for subadditive functions (Appendix E) and small cardinality constraints k (Appendix F).

1.4 Additional related work

Revenue maximization from samples. The discrepancy between the model on which algorithms optimize and the true state of nature has recently been studied in algorithmic mechanism


design. Most closely related to our work are several recent papers (e.g. [13, 21, 10, 34, 43, 11]) that also consider models that bypass the learning algorithm and let the mechanism designer access samples from a distribution rather than an explicit Bayesian prior. In contrast to our negative conclusions, these papers achieve mostly positive results. In particular, Huang et al. [34] show that the obtainable revenue is much closer to the optimum than the information-theoretic bound on learning the valuation distribution.

Comparison to online learning and reinforcement learning. Another line of work which combines decision-making and learning is online learning (see the survey [33]). In online learning, a player iteratively makes decisions. For each decision, the player incurs a cost, and the cost function for the current iteration is immediately revealed. The objective is to minimize regret, which is the difference between the sum of the costs of the decisions of the player and the sum of the costs of the best fixed decision. The fundamental difference with our framework is that decisions are made online after each observation, instead of offline given a collection of observations. The benchmarks, regret in one case and the optimal solution in the other, are not comparable. A similar comparison can be made with the problem of reinforcement learning, where the player typically interacts with a Markov decision process (MDP). At each iteration, an action is chosen in an online manner and the player receives a reward based on the action and the state of the MDP she is in. Again, this differs from our setting where there is one offline decision to be made given a collection of observations.

Additional learning results for submodular functions. In addition to the PMAC learning results previously mentioned for coverage and OXS functions, there are multiple learning results for submodular functions.
Monotone submodular functions are α-PMAC learnable over product distributions for some constant α under some assumptions [4]. Our lower bounds for coverage functions also hold for product distributions. Impossibility results arise for general distributions, in which case submodular functions are not Ω̃(n^{−1/3})-PMAC learnable [4]. Finally, submodular functions can be (1 − ε)-PMAC learned over the uniform distribution over all sets with running time and sample complexity exponential in ε and polynomial in n [25]. This exponential dependency is necessary since 2^{Ω(ε^{−2/3})} samples are needed to learn submodular functions with ℓ_1-error of ε over this distribution [27].

2 Main Result

In this section we describe an information-theoretic (sample complexity) lower bound for optimization from samples. This lower bound is for the class of coverage functions, which, as discussed above, is a natural and important class of submodular functions with many applications. We show this lower bound for the problem of maximizing the function under a cardinality constraint, which is the celebrated Max-k-Cover problem.

Theorem 1. For every constant ε > 0, for the problem of maximizing a function under a cardinality constraint, there exists a hypothesis class of coverage functions that is not n^{−1/4+ε}-optimizable from samples, and a hypothesis class of compact coverage functions that is not 2^{−Ω(√log n)}-optimizable from samples.

Again, this lower bound is for the uniform distribution over sets of size at most k (and is extended to additional distributions in Subsection 2.5). We begin with a technical overview and continue with the full details of the proof.
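For concreteness, here is a minimal sketch of a coverage function f(S) = Σ_{a_j ∈ ∪_{i∈S} T_i} w(a_j) together with the classical greedy algorithm that, given value-oracle access, achieves the 1 − 1/e guarantee for Max-k-Cover [45]. The instance below is a toy example of ours, purely illustrative, not the lower-bound construction.

```python
def coverage_value(S, cover_sets, weights):
    """f(S) = total weight of universe elements covered by the T_i, i in S."""
    covered = set().union(*(cover_sets[i] for i in S)) if S else set()
    return sum(weights[a] for a in covered)

def greedy_max_cover(cover_sets, weights, k):
    """Repeatedly add the set with the largest marginal coverage."""
    S = set()
    for _ in range(k):
        best = max((i for i in range(len(cover_sets)) if i not in S),
                   key=lambda i: coverage_value(S | {i}, cover_sets, weights))
        S.add(best)
    return S

# Toy instance: universe {0,...,5} with unit weights, three cover sets.
T = [{0, 1, 2}, {2, 3}, {4, 5}]
w = {a: 1.0 for a in range(6)}
S = greedy_max_cover(T, w, k=2)
print(sorted(S), coverage_value(S, T, w))  # [0, 2] 5.0
```

The point of this section is that no analogue of this guarantee survives when the oracle is replaced by samples.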

2.1 Technical overview

It is well known that in the value query model a coverage function can be optimized within a (1 − 1/e) approximation guarantee. Therefore, if we want to prove non-constant lower bounds in the OPS model we need to develop different machinery. At a high level, information theoretic lower bounds in the value query model typically proceed as follows: the objective function is defined on some random partition of the ground set, and optimizing the function requires knowing the partition. The objective function and the partition are such that the algorithm learns no information about the partition with high probability, which establishes the lower bound. This technique for value-query lower bounds is limited by the need to hide all the information about the partition from the algorithm.

In contrast, lower bounds in OPS can employ a broader class of functions to trick the algorithm. In particular, the objective function is also defined on a partition, but the algorithm can learn the entire partition. This allows us to work with a broader class of functions. The power of the lower bound designer comes from the inability of the algorithm to actively query the partition. In our constructions, the optimal solution is one part T_j of the partition T_0, T_1, . . . , T_m. Even though the algorithm may learn the different parts T_0, T_1, . . . , T_m of the partition, it cannot learn j, i.e. it cannot learn which part corresponds to the optimal solution. The main steps of the proof can be summarized as follows:

• A framework for OPS lower bounds (Subsection 2.2). We begin by describing a general framework for constructing lower bounds in the OPS model. The goal of this part is to reduce the task of proving an OPS lower bound to constructing good and bad functions, which have very specific properties. The lower bound is shown on the hypothesis class of coverage functions F = {f_j(·) : 1 ≤ j ≤ m} defined over a ground set N partitioned into T_0, T_1, . . . , T_m, where T_j is the optimal solution for f_j(·). For a given f_j(·), the elements in T_0, T_j, and ∪_{i≠0,j} T_i are called the zero, the good, and the bad elements respectively. The coverage function f_j(S) is defined as the sum of the value of the good elements in that set and of the value of the bad elements in that set (the zero elements have no value), i.e.:

    f_j(S) = g(S ∩ T_j) + Σ_{i=1, i≠j}^{m} b(S ∩ T_i),

where g(·) and b(·) are called the good and the bad functions. We are interested in good and bad functions g(·) and b(·) which satisfy the following three criteria:

– First, good and bad functions must have equal value g(S_1) = b(S_2) on small sets S_1, S_2 of equal size;

– Second, to obtain an inapproximability, good and bad functions must have a significant gap g(S_1) ≫ b(S_2) on sets S_1, S_2 of size k/m;

– Lastly, we require the good functions to have small curvature. This curvature intuitively says that the value of the good function grows sufficiently from a set of size k/m to a set of size k.

For the uniform distribution over subsets of size at most k, each sample intersects with few non-zero elements, which by the first criterion implies that the algorithm cannot distinguish the good and bad elements from samples. Given pairs of good and bad functions that respect the properties above with specific parameters, one can construct lower bounds for coverage functions in the OPS model. From this point onwards, the goal is to construct good and bad functions that satisfy the aforementioned criteria.

• Constructing the Good and the Bad Coverage Functions (Subsection 2.3). Due to the rigid structure of coverage functions, the main technical challenge for this lower bound is to construct good and bad coverage functions that combine the first and second criteria. If two coverage functions cover equal value g(S_1) = b(S_2) on all small sets S_1, S_2, it seems that the value covered by these two functions on larger sets should be similar. Our good and bad functions are convex combinations g(S) = |S| + Σ_{t_1} x_{t_1} C_{1/t_1}(S) and b(S) = Σ_{t_2} x_{t_2} C_{1/t_2}(S) of coverage functions C_{1/t}(·) to be defined. The equal-value-on-small-sets criterion is ensured by writing it as a system of linear equations with variables x_t. It then remains to construct a family of coverage functions C_{1/t}(·) such that (1) there exists a solution to the system of linear equations and (2) the entries x_t of this solution are bounded. If the entries x_t are bounded, then the additive term |S| dominates g(·) for large sets and the second and third criteria are satisfied. Our construction for the family of functions C_{1/t}(·) involves weights defined in terms of the binomial distribution. This construction satisfies (1) by interpreting the C_{1/t}(·) as polynomials to show that the matrix corresponding to the system of linear equations can be inverted, and (2) by using Hadamard's inequality to bound the coefficients x_t.

• From general to compact coverage functions (Subsection 2.4). The construction above requires a universe of exponential size, where for every subset S of a part T_i of good or bad elements, there is a unique "child" a_j in the universe covered exactly by S. We also obtain lower bounds for compact coverage functions, which are defined over a universe of polynomial size. The main difficulty here is to construct compact coverage functions that have equal value on all sets of size y ≤ ℓ for some non-trivial ℓ. The main tool to build such functions is constructions of ℓ-wise independent random variables. More precisely, we subsample the universe to further reduce its size. Random subsampling does not work here because we want g(·) and b(·) to look exactly identical on small sets. Instead, notice that when we have a child for every subset of elements, picking a uniformly random child is analogous to picking a uniformly random subset of elements. We choose a subset of the children that is analogous to sampling elements from an ℓ-wise independent distribution, for an appropriately tuned parameter ℓ.

• From uniform to more general distributions (Subsection 2.5). The lower bound holds for any distribution such that every sample contains a small number of non-zero elements with high probability. We show that this result extends to any distribution over sets of size at most ck, product distributions such that sets drawn are of expected size at most ck, and non-adaptive queries of sets of size at most ck, for any constant c > 0.
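The template f_j(S) = g(S ∩ T_j) + Σ_{i≠j} b(S ∩ T_i) and the role of the "identical on small sets" criterion can be sketched with toy symmetric functions (the g, b, ℓ, and partition below are hypothetical placeholders, not the paper's actual construction):

```python
# Toy instantiation of the lower-bound template F(g, b) with symmetric
# good/bad functions that agree on sets of size <= ell. The specific g, b
# here are hypothetical placeholders, not the paper's construction.
ell = 2

def g(y):   # good function: keeps growing past ell
    return float(y)

def b(y):   # bad function: matches g up to ell, then flattens
    return float(min(y, ell))

def make_f_j(parts, j):
    """f_j(S) = g(|S ∩ T_j|) + sum_{i != j} b(|S ∩ T_i|); zero elements add 0."""
    def f(S):
        return sum((g if i == j else b)(len(S & T)) for i, T in enumerate(parts))
    return f

parts = [frozenset(range(5 * i, 5 * i + 5)) for i in range(3)]  # T_1..T_3, k = 5
f0, f1 = make_f_j(parts, 0), make_f_j(parts, 1)

# Any sample meeting each part in at most ell elements cannot tell
# f0 from f1 -- the crux of the indistinguishability argument.
S = {0, 1, 5, 6}                   # two elements from each of T_1, T_2
assert f0(S) == f1(S)
print(f0(parts[0]), f1(parts[0]))  # 5.0 2.0: large gap on a full part
```

Under the uniform distribution over small feasible sets, samples of the first kind are overwhelmingly likely, so the values observed reveal nothing about which part is good.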

2.2 A framework for OPS lower bounds

The hypothesis class of functions for the lower bound is F(g, b) = {f_1, . . . , f_m} and is defined below in terms of two functions g(·) and b(·), called the good and the bad functions respectively. In this subsection, the problem of showing the lower bound is reduced to constructing g(·) and b(·) which satisfy certain properties. A function f(·) ∈ F(g, b) is defined in terms of the following partition of the ground set. The elements are first partitioned into n − m · k zero elements and m · k non-zero elements, with m such that m · k < n. For all j, the value of f_j(·) is independent of the zero elements. The non-zero elements are further partitioned into m parts T_1, · · · , T_m, each of equal size k. For function f_j(·), the elements in T_j are called the good elements and their value is given by the good function, i.e., f_j(S_j) = g(S_j) for S_j ⊆ T_j. Similarly, the elements in T_i, for i ≠ j, are called the bad elements and f_j(S_i) = b(S_i) for S_i ⊆ T_i.



Figure 2: Sketches of the good and bad functions g(S) and b(S), which only depend on the size of S, as submodular functions on the left and as coverage functions on the right. A coverage function cannot behave like the bad function b(·) on the left.

Definition 1. Given two functions g(·) and b(·) and a collection {T_i}_{i=1}^{m} of m disjoint sets of size k, the hypothesis class of functions F(g, b) is defined to be F(g, b) := {f_1, · · · , f_m} with

    f_j(S) := g(S ∩ T_j) + Σ_{i=1, i≠j}^{m} b(S ∩ T_i).

The desired properties for the good and bad functions are defined and parametrized below in Definition 2. Intuitively, we wish to have good and bad functions which look identical on small sets, but which have a large gap on larger sets (see Figure 2). The main result of this subsection then shows a lower bound in terms of these properties. We consider functions b(·) and g(·) that are symmetric, i.e., functions that have the same value on subsets of the same size. We abuse notation and denote by b(y) and g(y) the value of a set of size y. We also let ℓ be some number dependent on n, to be defined later.

Definition 2. Two functions g(·) and b(·) have an (α, β, k, m)-gap if

• Identical on small sets. g(y) = b(y) for all 0 ≤ y ≤ ℓ.


• Gap α between good and bad function. g(k/m) ≥ α · b(k/m).

• Curvature β of good function for large values. g(k)/g(k/m) = (1 − β)m.

If at most ℓ elements of each part T_i of non-zero elements are in each sample S, then the "identical on small sets" property implies that the good elements T_j and the bad elements ∪_{i≠j} T_i cannot be distinguished. Under a simple condition, this occurs with high probability by the following concentration bound.

Lemma 1. Let T ⊆ N and let S be sampled uniformly from all sets of size at most k. If k · |T| ≤ n^{1−ε} for some constant ε > 0, then Pr(|S ∩ T| ≥ ℓ) = n^{−Ω(ℓ)}.

Proof. We start by considering a subset L of T of size ℓ. We first bound the probability that L is a subset of a sample S:

    Pr(L ⊆ S) ≤ Π_{e∈L} Pr(e ∈ S) ≤ Π_{e∈L} (k/n) = (k/n)^ℓ.

We then bound the probability that |S ∩ T| > ℓ with a union bound over the events that a set L is a subset of S, for all subsets L of T of size ℓ:

    Pr(|S ∩ T| > ℓ) ≤ Σ_{L⊆T : |L|=ℓ} Pr(L ⊆ S) ≤ C(|T|, ℓ) · (k/n)^ℓ ≤ (k·|T|/n)^ℓ ≤ n^{−εℓ}

where the last inequality follows from the assumption that k · |T| ≤ n^{1−ε}.

Using the indistinguishability of the good and bad elements due to the previous concentration bound, the following theorem gives an impossibility result for optimizing F(g, b) from samples in terms of the (α, β, k, m)-gap of the good and bad functions g(·) and b(·).

Theorem 2. Let g(·) and b(·) be two submodular functions with an (α, β, k, m)-gap for 0 < m < k < n^{1/2−Ω(1)}. Then the class F(g, b) defined above is not ((1 − β) · min{α, m}/2)^{−1}-optimizable from samples.

Proof. Choose f_r(·) ∈ F(g, b) u.a.r. to be optimized. Consider an algorithm that observes polynomially many samples drawn uniformly from the collection of feasible solutions. We show that, with high probability over the samples, the value of f_r on those samples is independent of r. In particular, this means that the algorithm can learn no information about r. By Lemma 1, the intersection of a sample S and a part T_i of non-zero elements has size at most ℓ, except with probability n^{−Ω(ℓ)}. By a union bound, the same holds simultaneously for all polynomially many samples and all the parts T_i of non-zero elements. Therefore, with high probability, at least 1 − n^{−Ω(ℓ)}, the algorithm's output is independent of r; we henceforth assume this is the case. Let S be a set returned by any algorithm. Since S is independent of r, we have E_r[|S ∩ T_r|] ≤ k/m. Thus,

    E_r[f_r(S)] = E_r[g(S ∩ T_r) + Σ_{i≠r} b(S ∩ T_i)]
                ≤ g(k/m) + m · b(k/m)                      (submodularity)
                ≤ g(k/m) + (m/α) · g(k/m)                  (gap α between g(·) and b(·))
                ≤ (1 + m/α) · g(k)/((1 − β)·m)             (curvature β)
                ≤ (2/((1 − β)·min{α, m})) · f_r(T_r).

2.3 Constructing the good and the bad coverage functions

The focus is now on constructing good and bad functions that satisfy the properties from Definition 2, which is the main technical difficulty for this lower bound. In particular, in order for the bad function to increase slowly (or not at all) on large sets, there has to be a non-trivial interaction between elements even on small sets (this is related to coverage functions being second-order supermodular [38]). The good function must then have a similar interaction on small sets while still growing quickly on large sets. The next lemma shows that there exist b(·) and g(·) that satisfy these properties. Our main result (Theorem 1) then follows from this lemma combined with Theorem 2.

[Figure 3 depicts the ℓ × ℓ matrix M with entries M_{ij} = C_{1/t_j}(i): column j lists the values C_{1/t_j}(1), ..., C_{1/t_j}(ℓ), and row i corresponds to the "identical on small sets" constraint for sets of size |S| = i, with unknowns x_1, ..., x_ℓ on the good-function side.]

Figure 3: The matrix M.

Lemma 2. For every constant ε > 0, there exist coverage functions b(·) and g(·) that admit an (α = n^{1/4−ε}, β = o(1), k = n^{1/2−ε}, m = n^{1/4−ε})-gap, and compact coverage functions b(·) and g(·) that admit an (α = 2^{Ω(√log n)}, β = o(1), k = 2^{√log n}, m = 2^{√log n / 2})-gap.

The rest of this section is devoted to the proof of Lemma 2. We construct a good and a bad coverage function that are convex combinations of the symmetric coverage functions C_{1/t_j} defined below and of the symmetric additive function. The coefficients are obtained by solving a system of ℓ linear equations Mx = y, where M is a square matrix with column j equal to [C_{1/t_j}(1), ..., C_{1/t_j}(ℓ)], and where y = [1, ..., ℓ] is the symmetric additive function. The ℓ rows correspond to the ℓ constraints needed to obtain the "identical on small sets" property, as illustrated in Figure 3. For compact coverage functions, the convex combination is over coverage functions Ĉ_{1/t_j} such that Ĉ_{1/t_j}(y) = C_{1/t_j}(y) for all y ≤ ℓ, so the matrix M is identical. For the rest of this subsection, we mostly focus on general coverage functions and C_{1/t_j}, and consider the compact coverage case and Ĉ_{1/t_j} in more detail in the next subsection. Let x* be the solution to this system of linear equations. The bad function is then the sum of the coverage functions x*_j · C_{1/t_j} that have positive coefficients x*_j, and the good function is the sum of the additive function and the coverage functions with negative coefficients. Formally, the bad function is defined as

    b(y) := Σ_{j : x*_j > 0} x*_j · C_{1/t_j}(y)

and the good function as

    g(y) := y + Σ_{j : x*_j < 0} (−x*_j) · C_{1/t_j}(y).
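The linear-algebraic step can be sketched numerically. The closed form used below for the symmetric coverage functions C_{1/t} is an illustrative assumption (it is not the paper's exact definition of C_{1/t}); everything else follows the text: solve Mx = y and split the coefficients by sign.

```python
import numpy as np

# Illustrative stand-in for a symmetric coverage function C_{1/t}: t children,
# each covered by any single element with probability 1/t (an assumption for
# this sketch only).
def C(t, y):
    return t * (1.0 - (1.0 - 1.0 / t) ** y)

ell = 4
ts = [2, 4, 8, 16]                              # toy choices of t_1, ..., t_l
M = np.array([[C(t, y) for t in ts] for y in range(1, ell + 1)])
y_vec = np.arange(1, ell + 1, dtype=float)      # the symmetric additive function
x = np.linalg.solve(M, y_vec)                   # coefficients x*

# split the combination by coefficient sign, as in the definitions of b and g
bad  = lambda s: sum(xj * C(tj, s) for xj, tj in zip(x, ts) if xj > 0)
good = lambda s: s + sum(-xj * C(tj, s) for xj, tj in zip(x, ts) if xj < 0)

# "identical on small sets": g(s) = b(s) for every s <= ell, by construction
print([round(good(s) - bad(s), 6) for s in range(ell + 1)])   # all ≈ 0
```

For s > ℓ the additive term in g keeps growing while the coverage terms in b saturate, which is the source of the gap.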



    Pr(|S ∩ T_i| > ℓ) ≤ Σ_{L⊆T_i : |L|=ℓ} Pr(L ⊆ S) ≤ C(|T_i|, ℓ) · (ck/n)^ℓ ≤ (ck·|T_i|/n)^ℓ ≤ n^{−ε′ℓ}

where the last inequality follows from ck·|T_i| ≤ n^{1−ε′}. Thus, we obtain Lemma 1 for the parts T_i. The proof of Theorem 2 (with a union bound over all samples) then holds identically by Lemma 1 for the parts T_i. The remaining steps follow identically as previously: the same constructions for the good and bad functions g(·) and b(·) are used, and it suffices to combine Theorem 2 and Lemma 2 to complete the proof. All the extensions mentioned previously then follow almost immediately from the above lemma.

Corollary 1. Assume
• samples drawn from a distribution over sets of size at most ck, or
• samples drawn from a product distribution with marginal probabilities at most ck/n, or
• non-adaptive queries of sets of size at most ck,
for some constant c > 0. Then there exists a hypothesis class of coverage functions that is not n^{−1/4+ε}-optimizable from samples, for any constant ε > 0, and a compact hypothesis class of coverage functions that is not 2^{−Ω(√log n)}-optimizable from samples.

Proof. Consider a sample (or query) S, so |S| ≤ ck or Pr(e ∈ S) ≤ ck/n, and a uniformly random set L of size ℓ; then Pr(L ⊆ S) ≤ Π_{e∈L} Pr(e ∈ S) ≤ (ck/n)^ℓ.
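The concentration behind these bounds is easy to observe empirically. The simulation below (with toy parameters chosen so that k·|T| is well below n, as in the hypothesis of Lemma 1) estimates the tail probability.

```python
import random

def intersection_tail(n, k, T_size, ell, trials=20000, seed=1):
    """Monte-Carlo estimate of Pr(|S ∩ T| >= ell) for S a uniformly random
    k-subset (almost all sets of size at most k have size exactly k)."""
    rng = random.Random(seed)
    T = set(range(T_size))
    hits = sum(1 for _ in range(trials)
               if len(set(rng.sample(range(n), k)) & T) >= ell)
    return hits / trials

# toy parameters: k * |T| = 2000 while n = 10000
p = intersection_tail(n=10000, k=50, T_size=40, ell=3)
print(p)   # small: the union bound gives at most C(40,3) * (50/10000)^3 ≈ 1.2e-3
```

The empirical estimate sits comfortably under the union bound, illustrating why all polynomially many samples have small intersections with high probability.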

3 Submodular Functions

Coverage functions are a subclass of submodular functions. In this section, we improve the n^{−1/4+ε} impossibility result for coverage functions to n^{−1/3+ε} for submodular functions and, surprisingly, show that this bound is tight (up to logarithmic terms). This construction is shown for the class of OXS functions, another subclass of submodular functions. A function f : 2^N → R is called or-of-xors (OXS) if for some ℓ > 0:

    f(S) = max_{(S_1,...,S_ℓ) partitions S} Σ_{i=1}^ℓ f_i(S_i)

where each f_i(S) is a unit-demand function, i.e., f_i(S) = max_{j∈S} w_j for some weights w_1, ..., w_n. As discussed in the introduction, submodular functions have desirable learnability guarantees in addition to their optimization guarantees.

As in the previous section, the hard instances are built over a partition of the elements into zero, good, and bad elements. The advantage of OXS functions lies in the way they allow us to handle non-zero elements (see the technical overview in Subsection 2.1 for a reminder about zero and non-zero elements). While the algorithm can still distinguish between zero and non-zero elements, OXS functions allow us to hide information about the partition of the non-zero elements. We show that this lower bound is essentially tight using a non-trivial optimization-from-samples algorithm. At a high level, the algorithm uses the observed samples to estimate the expected marginal contribution of each element (i.e., the marginal contribution of an element to a random set). The main crux is in its analysis, and in particular in proving the n^{−1/3} bound. We give an overview of this analysis in Subsection 3.2.
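Since the value of an OXS function is a maximum-weight matching between the elements of S and the unit-demand functions (each f_i uses at most its single best element), it can be evaluated by brute force on tiny instances; the weights below are toy assumptions.

```python
from itertools import permutations

def oxs_value(weights, S):
    """Brute-force evaluation of an OXS function on a small instance.

    weights[i] maps elements to the value unit-demand function f_i gives them.
    """
    S = list(S)
    l = len(weights)
    if not S:
        return 0.0
    best = 0.0
    if len(S) <= l:
        # try every injective assignment of elements to distinct functions
        for funcs in permutations(range(l), len(S)):
            best = max(best, sum(weights[f].get(e, 0.0) for f, e in zip(funcs, S)))
    else:
        # more elements than functions: choose which l elements get matched
        for elems in permutations(S, l):
            best = max(best, sum(weights[i].get(e, 0.0) for i, e in enumerate(elems)))
    return best

# two unit-demand functions over elements {0, 1, 2} with toy weights
w = [{0: 3.0, 1: 1.0, 2: 0.0},
     {0: 2.0, 1: 2.0, 2: 1.0}]
print(oxs_value(w, {0, 1}))   # 5.0: element 0 goes to f_0 and element 1 to f_1
```

This matching view is also what lets the lower-bound instances hide the partition of the non-zero elements: the best partition is recomputed for every queried set.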

3.1 Improving the impossibility result

While the lower bound construction for submodular functions resembles the lower bound for coverage functions, the crucial difference is that the bad elements are not partitioned, i.e., the value of the bad elements is b(∪_{i=1,i≠j}^m S_i) instead of Σ_{i=1,i≠j}^m b(S_i), which lowers the value of k bad elements. This is possible because the algorithm cannot learn any information about the partition of the non-zero elements, unlike in the coverage case. Recall that for coverage functions, there needs to be some small cancellation (i.e., some small overlap in the children covered by the parents) between a small number of elements of a part of the partition in order to obtain a large cancellation between a large number of these elements. This cancellation is what allows the algorithm to potentially learn the partition of the non-zero elements. In the submodular case, budget additive functions are used to obtain a set of elements such that a small number of these elements have no cancellations (i.e., they behave additively), but a large number of them have a large cancellation. A budget additive function f(·) is defined as f(S) = min{B, Σ_{i∈S} w_i} for some threshold B causing the cancellation and weights w_i acting additively up to this threshold.

Theorem 3. For every constant ε > 0, there exists a hypothesis class of submodular functions that is not n^{−1/3+ε}-optimizable from samples.

Proof. As for coverage functions, the construction contains good, bad, and zero elements, but the bad elements are not partitioned. Fix the set T of non-zero elements of size n^{2/3−ε}. The good elements are n^{1/3} random elements from T, the remaining elements in T are bad elements, and the elements not in T are zero elements. Formally, the hypothesis class of functions F contains a function f^G, for every subset G of T of size n^{1/3}, with value

    f^G(S) = |S ∩ G| + min{|S ∩ (T \ G)|, log log n}.

Choose f^G ∈ F uniformly at random to be optimized under cardinality constraint k = |G| = n^{1/3}.
By Lemma 1, a sample S_i has intersection of size at most log log n with T, except with probability n^{−Ω(log log n)}. By a union bound, the same holds simultaneously for all polynomially many samples. Therefore, with probability at least 1 − n^{−Ω(log log n)}, the algorithm's output is independent of G; we henceforth assume this is the case. Let S be a set returned by any algorithm. Since S is independent of G, we have

    E_G[|S ∩ G|] ≤ k · |G|/|T| = n^{1/3} · n^{1/3}/n^{2/3−ε} = n^ε.

Thus,

    E_G[f^G(S)] = E_G[|S ∩ G| + min{|S ∩ (T \ G)|, log log n}] ≤ n^ε + log log n,

whereas the set of good elements is feasible and has value n^{1/3}.
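The indistinguishability at the heart of this proof can be checked directly on a scaled-down instance; the sizes and the cap below are toy stand-ins for n^{2/3−ε}, n^{1/3}, and log log n.

```python
import random

def f_G(S, G, T, cap):
    """The hard instance from Theorem 3: good elements G count additively and
    bad elements T \\ G are capped at `cap` (standing in for log log n)."""
    return len(S & G) + min(len(S & (T - G)), cap)

rng = random.Random(0)
n, k, cap = 2000, 20, 3
T = set(range(100))                      # non-zero elements (toy sizes)
G1 = set(rng.sample(sorted(T), 10))      # two candidate sets of good elements
G2 = set(rng.sample(sorted(T), 10))

# whenever |S ∩ T| <= cap the cap does not bind, so f_G(S) = |S ∩ T|
# independently of G: such samples reveal nothing about the good elements
agree = all(
    f_G(S, G1, T, cap) == f_G(S, G2, T, cap) == len(S & T)
    for S in (set(rng.sample(range(n), k)) for _ in range(1000))
    if len(S & T) <= cap
)
print(agree, f_G(G1, G1, T, cap))   # True 10: yet G1 has full value 10 under f_{G1}
```

Every sample with a small intersection with T evaluates identically under both candidate functions, while the good set itself is far more valuable, mirroring the gap in the proof.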

We also note that there exists a construction less involved than the one for coverage functions which yields an n^{−1/4+ε} impossibility result for submodular functions using the framework from Section 2.2. Let the bad function be a budget additive function with budget log log n, i.e., b(y) = min{y, log log n}, and let the good function be the additive function, i.e., g(y) = y. These functions have an (α = n^{1/4−ε}, β = 0, k = n^{1/2−ε}, m = n^{1/4−ε})-gap (the properties for such a gap are easily verified), which gives an n^{−1/4+ε} impossibility result for submodular functions by Theorem 2.

3.2 An algorithm which computes expected marginal contributions

We provide an Ω̃(n^{−1/3})-optimization from samples algorithm for submodular functions, described formally as Algorithm 2. The algorithm and an overview of the analysis are below.

Theorem 4. Let f(·) be a monotone submodular function; then Algorithm 2 is an Ω̃(n^{−1/3})-optimization from samples algorithm.

Overview. In the appendix, we give a simple n^{−1/2}-optimization from samples algorithm for subadditive functions. The biggest advantage of submodular functions over subadditive functions is that we can meaningfully talk about the marginal contribution of an element. As we discussed earlier, even for the special case of coverage functions the marginal contribution of an element varies greatly when measured with respect to different sets. Nevertheless, with polynomially many samples we can learn each element's marginal contribution to a uniformly random set, and use that to prune out the very worst elements. We then divide elements into log n bins based on their average marginal contributions. Up to logarithmic factors, we can without loss of generality restrict our attention to just one bin of size t. We now analyze the following three simple approximation algorithms:

1/k-approximation. We can obtain an Ω(1/k)-approximation by picking the single element with the largest average marginal contribution. This part of the analysis extends to subadditive functions.

t/n-approximation. We can obtain an Ω̃(t/n)-approximation by picking a uniformly random k-subset. Specifically, we compare the value f(S) of a random subset S with f(S ∪ S*), where S* is the optimal subset. Roughly, a t/n-fraction of the elements in S will be from the optimal bin, and each element from that bin contributes to S at least as much as it contributes to S ∪ S*.

k/t-approximation. We can obtain an Ω̃(k/t)-approximation by either picking a uniformly random subset again, or picking a uniformly random subset from the optimal bin. The intuition is that a uniformly random subset from the optimal bin includes, in expectation, a k/t-fraction of the optimal subset from the optimal bin.

By using each algorithm with probability 1/3, we guarantee an Ω̃((1/k · k/t · t/n)^{1/3}) = Ω̃(n^{−1/3})-approximation.

Learning marginal contributions. To improve over subadditive functions, we compute arbitrarily good estimates v̂_i of the expected marginal contribution of an element to a random set of size k − 1 (Algorithm 1). The true marginal contribution of an element to a random set of size k − 1 is denoted by

    v_i := E_{S : |S|=k−1, i∉S}[f_S(i)].

The estimated expected marginal contribution v̂_i of an element i to a random set is the difference between the average value of a set in S_{k,i}, defined to be all samples of size k containing i, and the average value of a set in S_{k−1,−i}, defined to be all samples of size k − 1 not containing i. The average value of a collection of sets S is denoted by avg(S), i.e., avg(S) := (Σ_{S∈S} f(S))/|S|.

Algorithm 1 MargCont: divides elements into bins according to their estimated expected marginal contributions to a random set of size k − 1.
    Input: S = {S_i : (S_i, f(S_i)) is a sample}
    v̂_i ← avg(S_{k,i}) − avg(S_{k−1,−i})
    v̂_max ← max_i v̂_i
    B_j ← {i : v̂_max/2^j ≤ v̂_i < v̂_max/2^{j−1}}
    return (B_1, ..., B_{3 log n})
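A minimal sketch of MargCont, assuming the samples are given as (set, value) pairs. With exhaustive samples of an additive function the estimates are exact, which makes the behavior easy to check.

```python
import math
from itertools import combinations

def marg_cont_bins(samples, elements, k, n_bins):
    """Estimate each element's expected marginal contribution to a random set
    of size k-1 (avg over S_{k,i} minus avg over S_{k-1,-i}) and group the
    elements into geometrically shrinking bins."""
    v_hat = {}
    for i in elements:
        in_k  = [v for S, v in samples if len(S) == k and i in S]
        out_k = [v for S, v in samples if len(S) == k - 1 and i not in S]
        v_hat[i] = sum(in_k) / len(in_k) - sum(out_k) / len(out_k)
    v_max = max(v_hat.values())
    bins = [[] for _ in range(n_bins)]      # bin j holds [v_max/2^{j+1}, v_max/2^j)
    for i, v in v_hat.items():
        if v <= 0:
            continue                         # too small for any bin: pruned
        j = min(n_bins - 1, int(math.floor(math.log2(v_max / v))))
        if v >= v_max / 2 ** (j + 1):
            bins[j].append(i)
    return v_hat, bins

# exhaustive samples of sizes k and k-1 from a toy additive function
elements = range(6)
vals = {i: float(i + 1) for i in elements}
f = lambda S: sum(vals[i] for i in S)
samples = [(frozenset(S), f(S)) for r in (2, 3) for S in combinations(elements, r)]
v_hat, bins = marg_cont_bins(samples, elements, k=3, n_bins=4)
print([round(v_hat[i], 6) for i in elements])   # recovers [1.0, 2.0, ..., 6.0]
```

For additive functions the estimator is unbiased, and with exhaustive samples it recovers each v_i exactly; the binning then groups elements by value scale.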

Algorithm 2 An Ω̃(n^{−1/3})-optimization from samples algorithm for submodular functions.
    Input: S = {S_i : (S_i, f(S_i)) is a sample}
    With probability 1/3:
        return S ← argmax_{S_i∈S} f(S_i)                      (sample with largest value)
    With probability 1/3:
        return S, a set of size k u.a.r.                      (a random set)
    With probability 1/3:
        (B_1, ..., B_{3 log n}) ← MargCont(S)
        Pick j ∈ {1, 2, ..., 3 log n} u.a.r.
        return S, a subset of B_j of size min{|B_j|, k} u.a.r. (a random set from a random bin)
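Algorithm 2 itself is only a strategy mixer, as the sketch below shows; the binning step is assumed to be supplied separately (e.g., by a MargCont-style routine), and the names are illustrative.

```python
import random

def ops_algorithm(samples, elements, k, bins, rng):
    """Sketch of Algorithm 2: choose one of three strategies uniformly at random.

    `samples` is a list of (frozenset, value) pairs; `bins` is the output of a
    MargCont-style binning step (computed elsewhere in this sketch).
    """
    strategy = rng.randrange(3)
    if strategy == 0:                        # the sample with the largest value
        return max(samples, key=lambda sv: sv[1])[0]
    if strategy == 1:                        # a uniformly random feasible set
        return frozenset(rng.sample(list(elements), k))
    B = bins[rng.randrange(len(bins))]       # a random subset of a random bin
    return frozenset(rng.sample(B, min(len(B), k)))

samples = [(frozenset({0, 1}), 5.0), (frozenset({2, 3}), 9.0)]
bins = [[0, 1, 2], [3]]
print(ops_algorithm(samples, range(10), 2, bins, random.Random(0)))
```

Since each branch runs with probability 1/3, the expected value of the output is at least a third of the best of the three guarantees analyzed below.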

Lemma 11. Let f be a monotone subadditive function. Then, with high probability, the estimates v̂_i are ε-close, for any ε ≥ f(S*)/poly(n), to the true expected marginal contribution of element i to a random set S of size k − 1, i.e., |v̂_i − v_i| ≤ ε.

Proof. We assume w.l.o.g. that k ≤ n/2 (otherwise, a random subset of size k is a 1/2-approximation). The most likely size of a sample is k, so the probability that a sample is of size k is at least 2/n. Since C(n, k−1) ≥ C(n, k)/n, the probability that a sample is of size k − 1 is at least 2/n². A given element i has probability at least 1/n of being in a sample and probability at least 1/2 of not being in a sample. Therefore, to observe at least n^c samples of size k which contain i and at least n^c samples of size k − 1 which do not contain i, n^{c+3} samples are sufficient with high probability. We assumed that ε ≥ f(S*)/n^{c′} for some constant c′; we pick c ≥ 2c′ + 1. Then, by Hoeffding's inequality (Lemma 18 with m = n^c and b = f(S*)),

    Pr(|avg(S_{k,i}) − E_{S : |S|=k, i∈S}[f(S)]| ≥ ε/2) ≤ 2e^{−2n^c (ε/2)²/f(S*)²} ≤ 2e^{−n^{c−2c′}/2} ≤ 2e^{−n/2}

and similarly,

    Pr(|avg(S_{k−1,−i}) − E_{S : |S|=k−1, i∉S}[f(S)]| ≥ ε/2) ≤ 2e^{−n/2}.

Since v̂_i = avg(S_{k,i}) − avg(S_{k−1,−i}) and v_i = E_{S : |S|=k, i∈S}[f(S)] − E_{S : |S|=k−1, i∉S}[f(S)], the claim then holds with high probability.


Let B_q be the bin with the largest value from the optimal solution, i.e., q := argmax_j f(S* ∩ B_j); we call this bin the optimal bin. We also define S*_q := S* ∩ B_q and denote by t the number of elements in bin B_q. For simplicity, we denote the expected value of a uniformly random set S of size k by E_S[f(S)].

Claim 2. With the notation defined above,

    f(S*_q) ≥ Ω(1/log n) · (f(S*) − E_S[f(S)]).

Proof. Let S*_{>3 log n} be the set of elements in S* that are not in any bin. Then, by subadditivity, f(S*_q) is a 1/(3 log n)-approximation to f(S* \ S*_{>3 log n}), and thus also a 1/(3 log n)-approximation to f(S*) − f(S*_{>3 log n}). We upper bound f(S*_{>3 log n}) with the value of a random set of size k as follows:

    f(S*_{>3 log n}) ≤ E_S[f(S*_{>3 log n} ∪ S)]                          (monotonicity)
                     ≤ E_S[f(S) + Σ_{i∈S*_{>3 log n}\S} f_S(i)]           (submodularity)
                     ≤ E_S[f(S) + Σ_{i∈S*_{>3 log n}\S} v_i]              (submodularity)
                     ≤ E_S[f(S) + Σ_{i∈S*_{>3 log n}\S} v̂_i] + O(f(S*)/n²)  (Lemma 11)
                     ≤ (1 + o(1)) E_S[f(S)] + k · v̂_max/n³                (not in a bin)
                     ≤ (1 + o(1)) E_S[f(S)].                              (argmax_i v̂_i ∈ S w.p. at least 1/k)

1/k-approximation. Lemma 12. For any monotone subadditive function f(·), the sample S with the largest value among at least (n/k) log n samples is a 1/k-approximation to f(S*) with high probability.

Proof. By subadditivity, there exists an element i* such that {i*} is a 1/k-approximation to the optimal solution. By monotonicity, any set which contains i* is a 1/k-approximation to the optimal solution. After observing (n/k) log n samples, the probability of never observing a set that contains i* is polynomially small.

t/n-approximation. Lemma 13. Let f be a monotone submodular function. Then a uniformly random subset of size k is an Ω̃(t/n)-approximation to f(S*).

Proof. We show that a uniformly random subset is an Ω(t/n)-approximation to f(S*_q), and then

the lemma follows by Claim 2. We first upper bound the value of S*_q:

    f(S*_q) ≤ E_S[f(S*_q ∪ S)]                                   (monotonicity)
            ≤ E_S[f(S) + Σ_{i∈S*_q\S} f_S(i)]                    (submodularity)
            ≤ E_S[f(S)] + Σ_{i∈S*_q} v_i                         (linearity of expectation)
            ≤ E_S[f(S)] + Σ_{i∈S*_q} v̂_i + O(f(S*)/n²)           (Lemma 11)
            ≤ E_S[f(S)] + k · v̂_max/2^q + O(f(S*)/n²)            (definition of B_q)
            = E_S[f(S)] · (1 + o(1)) + k · v̂_max/2^q.

Next, we lower bound the expected value of a random subset S of size k:

    E_S[f(S)] ≥ E_S[Σ_{i∈S} v_i]                                 (submodularity)
              ≥ E_S[Σ_{i∈S∩B_q} v̂_i] − O(f(S*)/n²)               (Lemma 11)
              ≥ E_S[|S ∩ B_q|] · v̂_max/2^{q+1} − O(f(S*)/n²)     (definition of B_q)
              = (kt/n) · v̂_max/2^{q+1} − O(f(S*)/n²)             (|S| = k and |B_q| = t)
              = (kt/(2n)) · (v̂_max/2^q) · (1 − o(1)).

By combining the lower bound on E_S[f(S)] and the upper bound on f(S*_q), we get that

    f(S*_q) ≤ E_S[f(S)] · (1 + o(1)) + k · v̂_max/2^q ≤ O(n/t) · E_S[f(S)].

k/t-approximation. The approximation obtained from a random set in a random bin follows from the next lemma.

Lemma 14. For any monotone subadditive function f(·), a uniformly random set S of size k is a k/n-approximation to f(N).

Proof. Partition the ground set into sets of size k uniformly at random. A uniformly random set of this partition is a k/n-approximation to f(N) in expectation by subadditivity. A uniformly random set of this partition is also a uniformly random set of size k.

Corollary 2. The better of a uniformly random feasible set and a uniformly random subset of size min{k, |B_j|} of a random bin B_j is an Ω̃(k/t)-approximation to f(S*), assuming that t ≥ k.

Proof. With probability Ω(1/log n), the random bin is B_q. The expected value of a random subset of B_q of size k is a k/t-approximation to f(B_q) by Lemma 14, and hence a k/t-approximation to f(S*_q) by monotonicity. Finally, by Claim 2, f(S*_q) is an Ω(1/log n)-approximation to f(S*) − E_S[f(S)].
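Lemma 14 is easy to observe empirically on a toy coverage function (a monotone submodular, hence subadditive, function); the coverage instance and sizes below are assumptions for illustration.

```python
import random

def coverage(cover_sets, S):
    """f(S) = number of children covered: monotone submodular, hence subadditive."""
    covered = set()
    for e in S:
        covered |= cover_sets[e]
    return len(covered)

rng = random.Random(0)
n, k = 30, 6
# each of the n elements covers 10 random children out of 100 (toy instance)
cover_sets = {e: set(rng.sample(range(100), 10)) for e in range(n)}
full = coverage(cover_sets, range(n))

vals = [coverage(cover_sets, rng.sample(range(n), k)) for _ in range(2000)]
avg = sum(vals) / len(vals)
print(round(avg / full, 3), k / n)   # the empirical ratio exceeds k/n
```

As the lemma's random-partition argument predicts, the average value of a random k-set is well above the k/n fraction of f(N); for coverage functions the ratio is typically much larger.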

4 Recoverability

The largely negative results from the sections above lead to the question: how well must a function be learned for it to be optimizable from samples? One extreme is a notion we refer to as recoverability (REC). A function is recoverable if it can be learned everywhere within an approximation of 1 ± 1/n² from samples. Does a function need to be learnable everywhere for it to be optimizable from samples?

Definition 6. A function f(·) is recoverable for distribution D if there exists an algorithm which, given a polynomial number of samples drawn from D, outputs a function f̃(·) such that for all sets S in the support of D,

    (1 − 1/n²) · f(S) ≤ f̃(S) ≤ (1 + 1/n²) · f(S)

with high probability over the samples and the randomness of the algorithm.

This notion of recoverability is similar to the problem of approximating a function everywhere from Goemans et al. [28]. The differences are that recoverability is from samples, whereas their setting allows value queries, and that recoverability requires the approximation to be within 1 ± 1/n². It is important for us to be within such bounds, and not within some arbitrarily small constant, because such perturbations can still lead to an O(n^{−1/2+δ}) impossibility result for optimization [32]. We show that if a monotone submodular function f(·) is recoverable, then it is optimizable from samples by using the greedy algorithm on the recovered function f̃(·). The proof is similar to the classical analysis of the greedy algorithm and is deferred to the appendix.

Theorem 5. Let D be a distribution over feasible sets under a cardinality constraint. If a monotone submodular function f(·) is recoverable for D, then it is (1 − 1/e − o(1))-optimizable from samples from D. For additive functions, it is (1 − o(1))-optimizable from samples.

We show that additive functions are in REC under a mild condition. A function f(·) is additive if there exist v_1, ..., v_n such that for all subsets S, f(S) = Σ_{i∈S} v_i.
The previous result implies that additive functions are optimizable from samples under this mild condition. In the next section, we show that additive functions are optimizable from samples without any assumption, as a corollary of a curvature result instead of recoverability.

Lemma 15. Let f(·) be an additive function with values v_1, ..., v_n, with v_max = max_i v_i and v_min = min_i v_i, and let D be the uniform distribution over feasible sets under a cardinality constraint. If v_min ≥ v_max/poly(n), then f(·) is recoverable for D.

Proof. We have already shown that the expected marginal contribution of an element to a random set of size k − 1 can be estimated from samples for submodular functions.4 In the case of additive functions, this marginal contribution of an element is its value v_i.

4 For simplicity, this proof uses estimations that we know how to compute. However, the values v_i can be recovered exactly by solving a system of linear equations in which each row corresponds to a sample, provided that the matrix of this system is invertible, which is the case with a sufficiently large number of samples by results from random matrix theory such as the survey by Blake and Studholme [6].


We apply Lemma 11 with ε = v_i/n² to compute v̂_i such that |v̂_i − v_i| ≤ v_i/n². Note that ε = v_i/n² satisfies ε ≥ f(S*)/poly(n) since v_min ≥ v_max/poly(n). Let f̃(S) = Σ_{i∈S} v̂_i; then f̃(S) ≤ Σ_{i∈S} (1 + 1/n²)·v_i = (1 + 1/n²)·f(S) and f̃(S) ≥ Σ_{i∈S} (1 − 1/n²)·v_i = (1 − 1/n²)·f(S).
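The exact-recovery route mentioned in the footnote can be sketched with a least-squares solve over the 0/1 sample-indicator matrix; the sizes and values below are toy assumptions.

```python
import random
import numpy as np

rng = random.Random(0)
n, k = 15, 5
true_v = np.array([rng.uniform(1.0, 2.0) for _ in range(n)])   # v_min close to v_max

# each sample is a uniformly random feasible set of size k with its additive value
A = np.zeros((4 * n, n))
y = np.zeros(4 * n)
for row in range(4 * n):
    S = rng.sample(range(n), k)
    A[row, S] = 1.0
    y[row] = true_v[S].sum()

# with enough samples the 0/1 sample matrix has full column rank, so the
# values are recovered exactly (the random-matrix argument in the footnote)
recovered, *_ = np.linalg.lstsq(A, y, rcond=None)
print(float(np.abs(recovered - true_v).max()))   # ≈ 0 up to numerical error
```

The system is consistent by construction, so full column rank suffices for exact recovery; the footnote's random-matrix argument guarantees this with high probability.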

Corollary 3. Let f(·) be an additive function with values v_1, ..., v_n, with v_max = max_i v_i and v_min = min_i v_i, and let D be the uniform distribution over feasible sets under a cardinality constraint. If v_min ≥ v_max/poly(n), then f(·) is (1 − o(1))-optimizable from samples from D.

We also note that submodular functions that are a c-junta for some constant c are recoverable. A function f(·) is a c-junta [44, 25, 52] if it depends only on a set of elements T of size c. If c is constant, then with enough samples, T can be learned since each element not in T is in at least one sample which does not contain any element of T with high probability. For each subset of T, with high probability, there is also at least one sample which intersects with T in exactly that subset, so f(·) is exactly recoverable.

The previous results lead us to the question of whether a function needs to be recoverable to be optimizable from samples. We show that this is not the case, since unit-demand functions are optimizable from samples yet not recoverable. A function f(·) is a unit-demand function if f(S) = max_{i∈S} v_i for some v_1, ..., v_n.

Lemma 16. Unit-demand functions are not recoverable for k ≥ 2, but they are 1-optimizable from samples.

Proof. We first show that unit-demand functions are not recoverable. Define a hypothesis class of functions F which contains n unit-demand functions f_j(·) with v_1 = j/n and v_i = 1 for i ≥ 2, for all integers 1 ≤ j ≤ n. We wish to recover the function f_j(·) with j picked uniformly at random. With high probability, the sample {e_1} is not observed when k ≥ 2, so the values of all observed samples are independent of j. Unit-demand functions are therefore not recoverable. Unit-demand functions, on the other hand, are 1-optimizable from samples. With at least n log n samples, at least one sample contains, with high probability, the best element e* := argmax_{e_i} v_i. Any set containing the best element is an optimal solution.
Therefore, an algorithm which returns the sample with largest value obtains an optimal solution with high probability. We conclude that functions do not need to be learnable everywhere to be optimizable from samples.

5 Bounded Curvature

The curvature c of a function measures how close the function is to being additive. In this section, we give a simple ((1 − c)² − o(1))-optimization from samples algorithm for monotone subadditive functions. We begin by formally defining the curvature of a function.

Definition 7. The curvature c of a monotone subadditive function f(·) is defined as

    c := 1 − min_{e∈N, S⊆N} f_{S\e}(e)/f(e).

This definition implies that

    f_S(e) ≥ (1 − c)·f(e) ≥ (1 − c)·f_T(e)

for all S, T and all e ∉ S ∪ T, since f(e) ≥ f(T ∪ e) − f(T) = f_T(e), where the first inequality is by subadditivity. As in Section 3, we denote by v_i and v̂_i the true and estimated marginal contribution of an element i to a random set of size k − 1.

Algorithm 3 MaxMargCont: A ((1 − c)² − o(1))-optimization from samples algorithm for subadditive functions with curvature c.
    Input: S = {S_i : (S_i, f(S_i)) is a sample}
    v̂_i ← avg(S_{k,i}) − avg(S_{k−1,−i})
    return S ← argmax_{|T|=k} Σ_{i∈T} v̂_i

Theorem 6. Let f(·) be a monotone subadditive function with curvature c. Then MaxMargCont is a ((1 − c)² − o(1))-optimization from samples algorithm.

Proof. Let S* = {e*_1, ..., e*_k} be the optimal solution and S = {e_1, ..., e_k} be the set returned by Algorithm 3. Let S*_i := {e*_1, ..., e*_i} and S_i := {e_1, ..., e_i}. If e ∉ S, then

    f_{S_{i−1}}(e_i) ≥ (1 − c)·v_{e_i}                     (curvature)
                    ≥ (1 − c)·v̂_{e_i} − f(S*)/n²           (Lemma 11 with ε = f(S*)/((1 − c)n²))
                    ≥ (1 − c)·v̂_e − f(S*)/n²               (e_i ∈ S, e ∉ S, and by Algorithm 3)
                    ≥ (1 − c)²·f_T(e) − 2f(S*)/n²           (curvature)

for any set T which does not contain e, and we conclude that

    f(S) = Σ_{i=1}^k f_{S_{i−1}}(e_i) ≥ (1 − c)²·(Σ_{i=1}^k f_{S*_{i−1}}(e*_i)) − 2k·f(S*)/n² = ((1 − c)² − o(1))·f(S*).

As in Section 4, but this time without any assumption on the function, we obtain as a corollary that additive functions can be optimized from samples arbitrarily well.

Corollary 4. Additive functions are (1 − o(1))-optimizable from samples.

Proof. Additive functions have curvature c = 0.

Budget additive functions. Budget additive functions are a simple extension of the class of additive functions that behave additively up to a threshold value B, i.e., f(S) = min{B, Σ_{i∈S} w_i}. Unlike additive functions, which have curvature 0, budget additive functions have curvature 1. Thus, no approximation guarantee can be obtained from the main curvature result for budget additive functions. Nevertheless, we show that MaxMargCont*, which is identical to MaxMargCont except that it returns the sample with the highest value with probability 1/2, is a 1/2-optimization from samples algorithm for budget additive functions.


Theorem 7. MaxMargCont* is a (1/2 − o(1))-optimization from samples algorithm for budget additive functions.

Proof. There are two cases, and in each case we obtain at least a (1 − o(1))-approximation with probability 1/2. If there is at least one sample with value B, then with probability 1/2 the sample with the highest value is returned by MaxMargCont*, and that sample is an optimal solution. If all samples have value less than B, then the function is additive over the samples and we can use the previous result for additive functions. Formally, let f′(T) := Σ_{i∈T} w_i, which is identical to f(·) but without the threshold B. For all samples T, we have f(T) = f′(T) since all samples have value less than B. Thus, with probability 1/2, the set S returned by MaxMargCont* is argmax_{|T|=k} Σ_{i∈T} v̂_i, which is a (1 − o(1))-approximation to f′(S*), where S* is the optimal solution for f′(·), by Corollary 4, since the samples are also according to f′(·) and f′(·) is an additive function. If f(S) ≥ B, then S is optimal. Otherwise, note that if S* is optimal for f′(·), it is also optimal for f(·), and we obtain f(S) = f′(S) ≥ (1 − o(1))·f′(S*) ≥ (1 − o(1))·f(S*).

References [1] Ioannis Antonellis, Anish Das Sarma, and Shaddin Dughmi. Dynamic covering for recommendation systems. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 26–34. ACM, 2012. [2] Ashwinkumar Badanidiyuru, Shahar Dobzinski, Hu Fu, Robert Kleinberg, Noam Nisan, and Tim Roughgarden. Sketching valuation functions. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages 1025–1035, 2012. URL http://portal.acm.org/citation.cfm?id= 2095197&CFID=63838676&CFTOKEN=79617016. [3] Maria-Florina Balcan. Learning submodular functions with applications to multi-agent systems. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2015, Istanbul, Turkey, May 4-8, 2015, page 3, 2015. URL http://dl.acm.org/citation.cfm?id=2772882. [4] Maria-Florina Balcan and Nicholas J. A. Harvey. Learning submodular functions. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pages 793–802, 2011. doi: 10.1145/1993636.1993741. URL http://doi.acm.org/10.1145/1993636.1993741. [5] Maria-Florina Balcan, Florin Constantin, Satoru Iwata, and Lei Wang. Learning valuation functions. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 4.1–4.24, 2012. URL http://www.jmlr.org/proceedings/ papers/v23/balcan12b/balcan12b.pdf. [6] Ian F Blake and Chris Studholme. Properties of random matrices and applications. Unpublished report available at http://www. cs. toronto. edu/˜ cvs/coding, 2006. [7] Christian Borgs, Michael Brautbar, Jennifer Chayes, and Brendan Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 946–957. SIAM, 2014.


[8] Dave Buchfuhrer, Michael Schapira, and Yaron Singer. Computation and incentives in combinatorial public projects. In Proceedings of the 11th ACM conference on Electronic commerce, pages 33–42. ACM, 2010. [9] Deeparnab Chakrabarty and Zhiyi Huang. Testing coverage functions. In Automata, Languages, and Programming, pages 170–181. Springer, 2012. [10] Shuchi Chawla, Jason D. Hartline, and Denis Nekipelov. Mechanism design for data science. In ACM Conference on Economics and Computation, EC ’14, Stanford , CA, USA, June 8-12, 2014, pages 711–712, 2014. doi: 10.1145/2600057.2602881. URL http://doi.acm.org/10. 1145/2600057.2602881. [11] Yu Cheng, Ho Yee Cheung, Shaddin Dughmi, Ehsan Emamjomeh-Zadeh, Li Han, and ShangHua Teng. Mixture selection, mechanism design, and signaling. In FOCS, 2015. To appear. [12] Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. Max-cover in map-reduce. In Proceedings of the 19th international conference on World wide web, pages 231–240. ACM, 2010. [13] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 243–252. ACM, 2014. [14] Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, and Andrew Tomkins. The discoverability of the web. In Proceedings of the 16th international conference on World Wide Web, pages 421–430. ACM, 2007. [15] Shahar Dobzinski and Michael Schapira. An improved approximation algorithm for combinatorial auctions with submodular bidders. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1064–1073. Society for Industrial and Applied Mathematics, 2006. [16] Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. Influence function learning in information diffusion networks. In Proceedings of the... International Conference on Machine Learning. International Conference on Machine Learning, volume 2014, page 2016. NIH Public Access, 2014. 
[17] Nan Du, Yingyu Liang, Maria-Florina F Balcan, and Le Song. Learning time-varying coverage functions. In Advances in neural information processing systems, pages 3374–3382, 2014. [18] Shaddin Dughmi. A truthful randomized mechanism for combinatorial public projects via convex optimization. In Proceedings of the 12th ACM conference on Electronic commerce, pages 263–272. ACM, 2011. [19] Shaddin Dughmi and Jan Vondr´ ak. Limitations of randomized mechanisms for combinatorial auctions. Games and Economic Behavior, 92:370–400, 2015. [20] Shaddin Dughmi, Tim Roughgarden, and Qiqi Yan. From convex optimization to randomized mechanisms: toward optimal combinatorial auctions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 149–158. ACM, 2011. [21] Shaddin Dughmi, Li Han, and Noam Nisan. Sampling and representation complexity of revenue maximization. In Web and Internet Economics - 10th International Conference, 28

WINE 2014, Beijing, China, December 14-17, 2014. Proceedings, pages 277–291, 2014. doi: 10.1007/978-3-319-13129-0 22. URL http://dx.doi.org/10.1007/978-3-319-13129-0_22. [22] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45 (4):634–652, 1998. [23] Uriel Feige, Vahab S Mirrokni, and Jan Vondrak. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011. [24] Vitaly Feldman and Pravesh Kothari. Learning coverage functions and private release of marginals. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, pages 679–702, 2014. URL http://jmlr.org/proceedings/papers/ v35/feldman14a.html. [25] Vitaly Feldman and Jan Vondr´ ak. Optimal bounds on approximation of submodular and XOS functions by juntas. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 227–236, 2013. doi: 10.1109/FOCS.2013.32. URL http://dx.doi.org/10.1109/FOCS.2013.32. [26] Vitaly Feldman and Jan Vondr´ ak. Tight bounds on low-degree spectral concentration of submodular and xos functions. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 923–942. IEEE, 2015. [27] Vitaly Feldman, Pravesh Kothari, and Jan Vondr´ak. Representation, approximation and learning of submodular functions using low-rank decision trees. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 711–740, 2013. URL http://jmlr.org/proceedings/papers/v30/Feldman13.html. [28] Michel X Goemans, Nicholas JA Harvey, Satoru Iwata, and Vahab Mirrokni. Approximating submodular functions everywhere. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 535–544. Society for Industrial and Applied Mathematics, 2009. [29] Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. 
Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1019–1028. ACM, 2010. [30] Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. Near-optimal sensor placements in gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 265–272. ACM, 2005. [31] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. SIAM Journal on Computing, 42(4):1494–1520, 2013. [32] Avinatan Hassidim and Yaron Singer. Submodular optimization under noise. 2015. Working paper. [33] Elad Hazan. Draft: Introduction to online convex optimization. In Foundations and Trends in Optimization, vol. XX, no. XX, pages 1–172. 2016.

29

[34] Zhiyi Huang, Yishay Mansour, and Tim Roughgarden. Making the most of your samples. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC ’15, Portland, OR, USA, June 15-19, 2015, pages 45–60, 2015. doi: 10.1145/2764468.2764475. URL http://doi.acm.org/10.1145/2764468.2764475. [35] Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Advances in Neural Information Processing Systems, pages 2436–2444, 2013. [36] Rishabh K Iyer, Stefanie Jegelka, and Jeff A Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. In Advances in Neural Information Processing Systems, pages 2742–2750, 2013. ´ Tardos. Maximizing the spread of influence through [37] David Kempe, Jon Kleinberg, and Eva a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003. [38] Nitish Korula, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Online submodular welfare maximization: Greedy beats 1/2 in random order. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 889–898, 2015. doi: 10.1145/2746539.2746626. URL http://doi.acm.org/10. 1145/2746539.2746626. [39] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, volume 7, pages 1650–1654, 2007. [40] Benny Lehmann, Daniel Lehmann, and Noam Nisan. Combinatorial auctions with decreasing marginal utilities. In Proceedings of the 3rd ACM conference on Electronic Commerce, pages 18–28. ACM, 2001. [41] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 510–520. Association for Computational Linguistics, 2011. 
[42] Vahab Mirrokni, Michael Schapira, and Jan Vondr´ak. Tight information-theoretic lower bounds for welfare maximization in combinatorial auctions. In Proceedings of the 9th ACM conference on Electronic commerce, pages 70–77. ACM, 2008. [43] Jamie Morgenstern and Tim Roughgarden. The pseudo-dimension of nearly-optimal auctions. In NIPS, page Forthcoming, 12 2015. URL papers/auction-pseudo.pdf. [44] Elchanan Mossel, Ryan O’Donnell, and Rocco P Servedio. Learning juntas. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 206–212. ACM, 2003. [45] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions ii. Math. Programming Study 8, pages 73–87, 1978. [46] Barna Saha and Lise Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In SDM, volume 9, pages 697–708. SIAM, 2009.

30

[47] Lior Seeman and Yaron Singer. Adaptive seeding in social networks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 459–468. IEEE, 2013. [48] Yaron Singer. How to win friends and influence people, truthfully: influence maximization mechanisms for social networks. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 733–742. ACM, 2012. [49] Maxim Sviridenko, Jan Vondr´ ak, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1134–1148. SIAM, 2015. [50] Ashwin Swaminathan, Cherian V Mathew, and Darko Kirovski. Essential pages. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01, pages 173–182. IEEE Computer Society, 2009. [51] Hiroya Takamura and Manabu Okumura. Text summarization model based on maximum coverage problem and its variant. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 781–789. Association for Computational Linguistics, 2009. [52] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 11–20. IEEE, 2012. [53] Leslie G. Valiant. A Theory of the Learnable. Commun. ACM, 27(11):1134–1142, 1984. doi: 10.1145/1968.1972. URL http://doi.acm.org/10.1145/1968.1972. [54] Jan Vondr´ ak. Submodularity and curvature: the optimal algorithm. RIMS Kokyuroku Bessatsu B, 23:253–266, 2010. [55] Peng-Jun Wan, Ding-Zhu Du, Panos Pardalos, and Weili Wu. Greedy approximations for minimum submodular cover with submodular cost. Computational Optimization and Applications, 45(2):463–474, 2010. [56] Yisong Yue and Thorsten Joachims. 
Predicting diverse subsets using structural svms. In Proceedings of the 25th international conference on Machine learning, pages 1224–1231. ACM, 2008.

31

Appendix A: Concentration Bounds

Lemma 17 (Chernoff bound). Let $X_1, \ldots, X_n$ be independent indicator random variables. Let $X = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[X]$. For $0 < \delta < 1$,
\[ \Pr\left(|X - \mu| \ge \delta \mu\right) \le 2e^{-\mu \delta^2 / 3}. \]

Lemma 18 (Hoeffding's inequality). Let $X_1, \ldots, X_m$ be independent random variables with values in $[0, b]$. Let $\bar{X} = \frac{1}{m} \sum_{i=1}^m X_i$. Then for every $0 < \epsilon < 1$,
\[ \Pr\left(|\bar{X} - \mathbb{E}[\bar{X}]| \ge \epsilon\right) \le 2e^{-2m\epsilon^2 / b^2}. \]
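As a sanity check on the constants, the Chernoff bound above can be compared against the exact binomial tail; the sketch below does so in pure Python (the parameters $n = 1000$, $p = 1/2$, $\delta = 0.2$ are arbitrary illustrative choices, not values used in the paper):

```python
from math import comb, exp

# X ~ Bin(n, p), viewed as a sum of n independent indicator variables
n, p, delta = 1000, 0.5, 0.2
mu = n * p

# exact tail probability Pr(|X - mu| >= delta * mu)
tail = sum(comb(n, x) * p**x * (1 - p)**(n - x)
           for x in range(n + 1) if abs(x - mu) >= delta * mu)

# Chernoff upper bound 2 * exp(-mu * delta^2 / 3)
bound = 2 * exp(-mu * delta**2 / 3)
assert tail <= bound
```

For these parameters the true tail is many orders of magnitude below the bound, which is expected: Chernoff trades tightness for generality.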

Appendix B: Linear Algebra

We recall two basic results from linear algebra.

Theorem (Cramer's rule). Let $M$ be an invertible matrix. The solution to the linear system $Mx = y$ is given by $x_i = \frac{\det M_i}{\det M}$, where $M_i$ is the matrix $M$ with the $i$-th column replaced by the vector $y$.

Theorem (Hadamard's inequality). $|\det M| \le \prod_i \|v_i\|$, where $\|v_i\|$ denotes the Euclidean norm of the $i$-th column of $M$.
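Both facts are easy to verify numerically; the following sketch checks them on a small hand-picked $3 \times 3$ system using exact rational arithmetic (the matrix and right-hand side are arbitrary examples):

```python
from fractions import Fraction
from math import sqrt, prod

def det3(M):
    # cofactor expansion of a 3x3 determinant along the first row
    a, b, c = M[0]; d, e, f = M[1]; g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

M = [[Fraction(2), Fraction(1), Fraction(0)],
     [Fraction(1), Fraction(3), Fraction(1)],
     [Fraction(0), Fraction(1), Fraction(2)]]
y = [Fraction(1), Fraction(2), Fraction(3)]

# Cramer's rule: x_i = det(M_i) / det(M)
dM = det3(M)
x = []
for i in range(3):
    Mi = [row[:] for row in M]
    for r in range(3):
        Mi[r][i] = y[r]
    x.append(det3(Mi) / dM)

# verify that M x = y exactly
for r in range(3):
    assert sum(M[r][c] * x[c] for c in range(3)) == y[r]

# Hadamard: |det M| <= product of column Euclidean norms
col_norms = prod(sqrt(sum(float(M[r][c]) ** 2 for r in range(3)))
                 for c in range(3))
assert abs(float(dM)) <= col_norms
```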

Appendix C: Additional Discussion about the Framework

We justify our modeling decision to insist on having the sets sampled uniformly at random from the same family of sets we aim to optimize over. We emphasize the reasoning through the following points.

• First, we argue that it makes sense to fix some distribution. We do this to avoid trivialities in our framework. Clearly, one cannot hope to optimize a function from every distribution: consider, for example, the degenerate distribution which simply returns the empty set on every sample. Similarly, it is not so interesting to ask about optimizing from any distribution: this is trivial when the distribution always returns the optimal set.

• Once we agree that we should fix some distribution, two natural choices come to mind: the uniform distribution over all feasible sets, and the uniform distribution over all sets. We argue that the former is more interesting (in what context would one observe the values of infeasible solutions?). Nevertheless, we emphasize the robustness of our results by showing a simple impossibility result for the uniform distribution over all sets. Additional lower bounds for some other distributions are discussed in Subsection 2.5.

Theorem 8. There exists a hypothesis class of submodular functions that is not $4n^{-1/2}$-optimizable from samples from the uniform distribution over all sets.


Proof. Assume that samples are drawn uniformly from all sets of elements, or equivalently, from the product distribution with marginal probabilities $1/2$. Consider the case where $k = n^{1/2}$ and the hypothesis class of functions $\mathcal{F}$ contains the function $f_T(\cdot)$, for every set $T$ of size $n^{1/2}$, defined as
\[ f_T(S) = \min\left\{ |S \cap T|, \frac{n^{1/2}}{4} \right\}. \]
We pick a set $T$ of size $n^{1/2}$ uniformly at random. By Lemma 17, with $X_i$ indicating whether element $e_i$ is in $S \cap T$, $\mu = n \cdot \Pr(e \in S) \Pr(e \in T) = n^{1/2}/2$, and constant $0 < \delta < 1$, we have $|S \cap T| \ge n^{1/2}/4$ with high probability, for all samples. So $f_T(S) = n^{1/2}/4$, which is independent of $T$, for all samples $S$ with high probability. It follows that the output of any algorithm is independent of $T$ with high probability. An algorithm therefore picks, in expectation over the choice of $T$ and the randomness of the algorithm, $n^{1/2} \cdot n^{1/2}/n = 1$ element of $T$. So for any algorithm, there is some function $f_T(\cdot) \in \mathcal{F}$ for which the algorithm outputs a set of expected value at most $1$. The optimal solution is $T$ and has value $n^{1/2}/4$.

We also note that for $k \le n/2$, and with $n^c$ samples from the uniform distribution over all feasible sets under a cardinality constraint $k$, there are at least $n^{c-1}$ samples of size exactly $k$ in expectation. So the distribution over all feasible sets gives more information than the distribution over all sets of size exactly $k$.
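The heart of this construction is that almost every sample hits the cap $n^{1/2}/4$ and therefore reveals nothing about the hidden set $T$. This can be simulated directly (a standalone sketch; $n = 10{,}000$, $200$ samples, and the seed are arbitrary choices):

```python
import random

random.seed(0)
n = 10000
r = int(n ** 0.5)            # |T| = n^{1/2} = 100
T = set(random.sample(range(n), r))
cap = r // 4                 # value cap n^{1/2} / 4 = 25

def f_T(S):
    # the hard function f_T(S) = min{|S ∩ T|, n^{1/2}/4}
    return min(len(S & T), cap)

# samples drawn uniformly over all sets: each element included w.p. 1/2
trials = 200
capped = sum(
    1 for _ in range(trials)
    if f_T({e for e in range(n) if random.random() < 0.5}) == cap
)
print(capped, "of", trials, "samples take the T-independent value", cap)
```

Since $|S \cap T|$ concentrates around $n^{1/2}/2 = 50$, essentially all (here, all or all but one of the) samples evaluate to exactly $25$, as the lemma predicts.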

Appendix D: Learning Models

As a model for statistical learnability we use the notion of PAC learnability due to Valiant [53] and its generalization to real-valued functions, PMAC learnability, due to Balcan and Harvey [4]. Let $\mathcal{F}$ be a hypothesis class of functions $\{f_1, f_2, \ldots\}$ where $f_i : 2^N \to \mathbb{R}$. Given precision parameters $\epsilon > 0$ and $\delta > 0$, the input to a learning algorithm is a set of samples $\{(S_i, f(S_i))\}_{i=1}^t$, where the $S_i$'s are drawn i.i.d. from some distribution $D$ and the number of samples $t$ is polynomial in $1/\epsilon$, $1/\delta$, and $n$. The learning algorithm outputs a function $\tilde{f} : 2^N \to \mathbb{R}$ that should approximate $f$ in the following sense.

• $\mathcal{F}$ is PAC-learnable on distribution $D$ if there exists a (not necessarily polynomial time) learning algorithm such that for every $\epsilon, \delta > 0$:
\[ \Pr_{S_1, \ldots, S_t \sim D}\left[ \Pr_{S \sim D}\left[ \tilde{f}(S) = f(S) \right] \ge 1 - \epsilon \right] \ge 1 - \delta \]

• $\mathcal{F}$ is $\alpha$-PMAC-learnable on distribution $D$ if there exists a (not necessarily polynomial time) learning algorithm such that for every $\epsilon, \delta > 0$:
\[ \Pr_{S_1, \ldots, S_t \sim D}\left[ \Pr_{S \sim D}\left[ \alpha \cdot f(S) \le \tilde{f}(S) \le f(S) \right] \ge 1 - \epsilon \right] \ge 1 - \delta \]

A class $\mathcal{F}$ is PAC-learnable (respectively, $\alpha$-PMAC-learnable) if it is PAC-learnable ($\alpha$-PMAC-learnable) on every distribution $D$.
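The inner condition of the $\alpha$-PMAC definition is just an empirical event over sets drawn from $D$, so it can be checked by sampling. Below is a minimal sketch; the functions `f`, `f_tilde`, and the toy distribution are illustrative stand-ins, not objects from the paper:

```python
import random

def pmac_inner_ok(f, f_tilde, alpha, sample_sets, eps):
    """Empirically check Pr_S[alpha * f(S) <= f_tilde(S) <= f(S)] >= 1 - eps
    over a list of sets drawn from the distribution D."""
    good = sum(1 for S in sample_sets
               if alpha * f(S) <= f_tilde(S) <= f(S))
    return good / len(sample_sets) >= 1 - eps

# toy example: f additive and positive, f_tilde a uniform 0.9-underestimate
f = lambda S: sum(S) + 1
f_tilde = lambda S: 0.9 * f(S)

rng = random.Random(0)
sets = [{e for e in range(10) if rng.random() < 0.5} for _ in range(100)]
print(pmac_inner_ok(f, f_tilde, alpha=0.5, sample_sets=sets, eps=0.05))
```

With $\alpha = 0.5$ the check passes on every sample (since $0.5 f \le 0.9 f \le f$ pointwise), while $\alpha = 0.95$ would fail on every sample.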

Appendix E: Tight Bounds for Subadditive Functions

A function $f : 2^N \to \mathbb{R}$ is subadditive if for any $S, T \subseteq N$ we have $f(S \cup T) \le f(S) + f(T)$. Since subadditive functions are a superclass of submodular functions, all the lower bounds apply to subadditive functions as well. Still, it seems natural to ask whether non-trivial guarantees are obtainable for subadditive functions.

Theorem 9. For subadditive functions, there exists an $n^{-1/2}$-optimization from samples algorithm. Furthermore, no algorithm can do better than $n^{-1/2+\epsilon}$ for any constant $\epsilon > 0$.

Proof. In the value query model, where $f(S)$ can be queried for any set $S$, subadditive functions cannot be $n^{-1/2+\epsilon}$-approximated for any constant $\epsilon > 0$ by a polynomial-time algorithm; this follows from [42]. Any algorithm for optimization from samples can trivially be extended to an algorithm in the value query model, so the $n^{-1/2+\epsilon}$ lower bound for optimizing subadditive functions from samples follows. For the upper bound, if $k \le n^{1/2}$, we return the sample with the largest value, which by Lemma 12 is a $1/k$-approximation. If $k \ge n^{1/2}$, we return a random set of size $k$, which by Lemma 14 is a $k/n$-approximation to $f(N)$. By monotonicity, it is a $k/n$-approximation to the optimum.
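The two-case algorithm in the upper bound is simple enough to state as code. The sketch below is illustrative, not the paper's implementation: `samples` is assumed to be a list of (feasible set, value) pairs, and the function name is hypothetical.

```python
import random

def subadditive_ops(samples, n, k, rng=None):
    """n^{-1/2}-optimization from samples for monotone subadditive f:
    return the best observed sample when k is small, and a uniformly
    random feasible set of size k when k is large."""
    rng = rng or random.Random(0)
    if k <= n ** 0.5:
        # the sample with the largest observed value (a 1/k-approximation)
        best_set, _ = max(samples, key=lambda sv: sv[1])
        return best_set
    # a uniformly random set of size k (a k/n-approximation)
    return set(rng.sample(range(n), k))
```

For example, with $n = 100$ and $k = 5 \le n^{1/2}$, the algorithm simply returns the most valuable sample it has seen; for $k = 50$ it ignores the samples entirely and returns a random size-$k$ set.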

Appendix F: Lower Bounds for Small k

We extend our impossibility results to the case where $k$ is small. As with the main lower bounds for coverage and submodular functions, we obtain stronger lower bounds for submodular functions because, unlike in the coverage case, an algorithm cannot even learn the partition of the non-zero elements.

Theorem 10. Let $k \ge \log n$. Then there exists a hypothesis class of coverage functions that is not $k^{-1/2+\epsilon}$-optimizable from samples and a hypothesis class of submodular functions that is not $k^{-1+\epsilon}$-optimizable from samples, assuming $k \le n^{1/3-\epsilon}$, for any constant $\epsilon > 0$.

Proof. For coverage functions, consider the same good and bad functions $g(\cdot)$ and $b(\cdot)$ as in the analysis of the main lower bound for coverage functions, and pick $\ell = \log \log \log n$ and $m = \sqrt{\log n}$. Similarly as for Lemma 5, the gap $\alpha$ between $g(\cdot)$ and $b(\cdot)$ is then
\[ \frac{g(k/m)}{b(k/m)} \ge \frac{k}{(\log \log \log n)^{4 \log \log \log n}} \ge \epsilon' k^{1/2 - \epsilon} \]
since $k \ge \log n$, where $\epsilon' > 0$ is some small constant. Similarly, the curvature $\beta$ satisfies
\[ \frac{g(k)}{g(k/m)} \ge \frac{k}{k/m + (\log \log \log n)^{4 \log \log \log n}} \ge \frac{k}{(1 + o(1)) k/m} = (1 - o(1)) m. \]
Thus, $g(\cdot)$ and $b(\cdot)$ have a $(k^{1/2 - \epsilon}, o(1), k, k^{1/2})$-gap, and by Theorem 2 there exists a class of coverage functions which is not $k^{-1/2+\epsilon}$-optimizable from samples. For submodular functions, the proof follows identically to that of Theorem 3, but with $T$ of size $k^2$ and with $k$ good elements.

Appendix G: Missing Proof from Section 4

We restate Theorem 5 for convenience.

Theorem 5. Let $D$ be a distribution over feasible sets under a cardinality constraint. If a monotone submodular function $f(\cdot)$ is recoverable for $D$, then it is $(1 - 1/e - o(1))$-optimizable from samples from $D$. For additive functions, it is $(1 - o(1))$-optimizable from samples.

Proof. We show that the greedy algorithm, run with the surrogate $\tilde{f}(\cdot)$ of a recoverable function, performs well. The proof follows the classical analysis of the greedy algorithm. We start with submodular functions and denote by $S_i = \{e_1, \ldots, e_i\}$ the set obtained after the $i$-th iteration. Let $S^\star$ be the optimal solution; then by submodularity,
\[
f(S^\star) \le f(S_{i-1}) + \sum_{e \in S^\star \setminus S_{i-1}} f_{S_{i-1}}(e)
\le f(S_{i-1}) + \sum_{e \in S^\star \setminus S_{i-1}} \left( \frac{1 + 1/n^2}{1 - 1/n^2} f(S_i) - f(S_{i-1}) \right)
\]
where the second inequality follows from $\tilde{f}(S_i) \ge \tilde{f}(S_{i-1} \cup \{e\})$ for all $e \in S^\star \setminus S_{i-1}$ by the greedy algorithm, so $(1 + 1/n^2) f(S_i) \ge (1 - 1/n^2) f(S_{i-1} \cup \{e\})$. We therefore get that
\[
f(S^\star) \le (1 - k) f(S_{i-1}) + k \cdot \frac{1 + 1/n^2}{1 - 1/n^2} f(S_i).
\]
By induction, and similarly as in the analysis of the greedy algorithm, we then get that
\[
f(S_k) \ge \left( \frac{1 - 1/n^2}{1 + 1/n^2} \right)^k \left( 1 - (1 - 1/k)^k \right) f(S^\star).
\]
Since
\[
\left( \frac{1 - 1/n^2}{1 + 1/n^2} \right)^k \ge \left( 1 - \frac{2}{n^2} \right)^k \ge 1 - \frac{2k}{n^2} \ge 1 - \frac{2}{n}
\]
and $1 - (1 - 1/k)^k \ge 1 - 1/e$, the greedy algorithm achieves a $(1 - 1/e - o(1))$-approximation for submodular functions.

For additive functions, let $S$ be the set returned by the greedy algorithm and $\hat{v}_i = \tilde{f}(\{i\})$; then
\[
f(S) = \sum_{i \in S} v_i \ge \frac{1}{1 + 1/n^2} \sum_{i \in S} \hat{v}_i \ge \frac{1}{1 + 1/n^2} \sum_{i \in S^\star} \hat{v}_i \ge \frac{1 - 1/n^2}{1 + 1/n^2} f(S^\star).
\]
We therefore get a $(1 - o(1))$-approximation for additive functions.
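The algorithm analyzed here is just the standard greedy algorithm run on the recovered surrogate instead of $f$ itself. A minimal sketch (the function name is illustrative, and `f_tilde` stands for any estimate of $f$ accurate to within a $1 \pm 1/n^2$ factor):

```python
def greedy_surrogate(n, k, f_tilde):
    """Greedily build a size-k subset of {0, ..., n-1}, scoring each
    candidate set with the surrogate f_tilde recovered from samples."""
    S = set()
    for _ in range(k):
        best = max((e for e in range(n) if e not in S),
                   key=lambda e: f_tilde(S | {e}))
        S.add(best)
    return S
```

When the surrogate is an exact additive function, this reduces to picking the $k$ highest-value elements, which is why the additive case in the proof loses only the $1 \pm 1/n^2$ recovery error.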
