On the Optimal Sample Complexity for Best Arm Identification

Report 4 Downloads 44 Views
On the Optimal Sample Complexity for Best Arm Identification

arXiv:1511.03774v1 [cs.LG] 12 Nov 2015

Lijie Chen Jian Li Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University, China. Email: [email protected], [email protected] Abstract We study the best arm identification (Best-1-Arm) problem, which is defined as follows. We are given n stochastic bandit arms. The ith arm has a reward distribution Di with an unknown mean µi . Upon each play of the ith arm, we can get a reward, sampled i.i.d. from Di . We would like to identify the arm with largest mean with probability at least 1 − δ, using as few samples as possible. We also study an important special case where there are only two arms, which we call the Sign-ξ problem. We achieve a very detailed understanding of the optimal sample complexity of Sign-ξ, simplifying and significantly extending a classical result by Farrell in 1964, with a completely new proof. Using the new lower bound for Sign-ξ, we obtain the first lower bound for Best-1-Arm that goes beyond the classic Mannor-Tsitsiklis lower bound, by an interesting reduction from Sign-ξ to Best-1-Arm. To complement our lower bound, we also provide a nontrivial algorithm for Best-1-Arm, which achieves a worst case optimal sample complexity, improving upon several prior upper bounds on the same problem.

1

Introduction

In this paper, we study the following fundamental problem. Definition 1.1. Best-1-Arm: We are given n arms A1 , . . . , An . The ith arm Ai has a reward distribution Di with an unknown mean µi ∈ [0, 1]. We assume that all reward distributions have 1-sub-Gaussian tails (see Definition 2.1), which is a standard assumption in the stochastic multi-armed bandit literature. Upon each play of Ai , we can get a reward value sampled i.i.d. from Di . Our goal is to identify the arm with largest mean using as few samples as possible. We assume here that the largest mean is strictly larger than the second largest (i.e., µ[1] > µ[2] ) to ensure the uniqueness of the solution. The problem is also called the pure exploration problem in the stochastic multi-armed bandit literature. We also study the following natural problem Sign-ξ. Definition 1.2. Sign-ξ: ξ is a fixed constant. We are given a single arm with unknown mean µ 6= ξ. The goal is to decide whether µ > ξ or µ < ξ. Define the gap of the problem to be ∆ = |µ − ξ|. Again, we assume that the distribution of the arm is 1-sub-Gaussian. In fact, Sign-ξ can be viewed as a special case of Best-1-Arm where there are only two arms and we know the mean of one arm. Hence, a thorough understanding of the sample complexity of Sign-ξ is very useful for deriving tighter sample complexity bounds for Best-1-Arm. Definition 1.3. For a fixed value δ ∈ (0, 1), we say that an algorithm A for Best-1-Arm (or Sign-ξ) is δ-correct, if given any Best-1-Arm(or Sign-ξ) instance, A returns the correct answer with probability at least 1 − δ. We say that an algorithm A for Best-1-Arm is an (ǫ, δ)-PAC algorithm, if given any Best-1-Arm instance and any confidence level δ > 0, A returns an ǫ-optimal arm with probability at least 1 − δ. Here we say an arm Ai is ǫ-optimal if µ[1] − µi ≤ ǫ. The studies of both problems have a long history dating back to 1950s [2, 4, 3, 26, 14]. We first discuss the Sign-ξ problem. It is well known that for any δ-correct algorithm (for constant δ) A that can distinguish 1

two Gaussian arms with means ξ + ∆ and ξ − ∆ (the values of ξ and ∆ are known beforehand), the expected number of samples required by A is Ω(∆−2 ), which is optimal (e.g., [10]). This can be seen as a lower bound for Sign-ξ as well. However, a tighter lower bound of Sign-ξ was in fact provided by Farrell back in 1964 [14]. He showed that for any δ-correct A for Sign-ξ (where the distribution is in the exponential family), it holds that lim sup ∆→0

TA [∆] > 0, ln ln ∆−1

(1)

∆−2

where TA [∆] is the expected number of samples taken by A on an instance with gap ∆. Farrell’s presult cru Xt cially relies on the Law of Iterated Logarithm (LIL), which roughly states that lim sup Xi / 2t log log t = t

i=1

1 almost surely where Xi ∼ N (0, 1) for all i. Comparing with the Ω(∆−2 ) lower bound, the extra ln ln ∆−1 factor is caused by the fact that we do not known the gap ∆ beforehand. The above result implies that −1 ∆−2 [2] ln ln ∆[2] is also a lower bound for Best-1-Arm. Bechhofer [2] formulated the Best-1-Arm problem for Gaussians in 1954, which was followed by several subsequent papers (e.g., [4, 3, 26, 5]). Most of those earlier work on Best-1-Arm focused on designing practically efficient δ-correct algorithms (verified by numerical simulations). These early advances are summarized in the monograph [5]. The last decade has witnessed a resurgence of interest in the Best-1-Arm problem and its optimal  Xn  −2 sample complexity. Even-Dar, Mannor and Mansour [11] showed an O ∆[i] (ln δ −1 + ln n + ln ∆−1 ) [i] i=2 upper bound, where ∆[i] = µ[1] − µ[i] , using the so-called successive elimination procedure. Mannor and  Xn  −2 −1 Tsitsiklis [25] showed that for any δ-correct algorithm for Best-1-Arm, it requires Ω ∆[i] ln δ i=2 samples in expectation for any instance. We note that the Mannor-Tsitsiklis lower bound is an instance-wise lower bound, i.e., any Best-1-Arm instance requires the stated number of samples. Several subsequent work [15, 21, 17] obtained improved upper  bounds. See Table 1 for the details. The current best known bound is  X n −1 −2 −1 , due to Karnin, Koren and Somekh [22]. Jamieson, Malloy, Nowak and O ∆[i] ln ln ∆[i] + ln δ i=2

Bubeck [18] obtained a UCB-type algorithm (called lil’UCB), which achieves the same sample complexity and is also efficient in practice. We refer the above bound as the KKS-JMNS bound) See [19] for a survey on both theoretical bounds and experimental results for those algorithms. In fact, some researchers have believed that the KKS upper bound is already optimal, since it matches Farrell’s lower bound (1) for two arms. Both [18] and [19] explicitly referred the upper bound as “optimal”. In [18], it states that “The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL)”. So it appears that the problem is already completely solved, at least theoretically (except a “slight” generalization of the lower bound to more than two arms left yet to be proved). However, as we will demonstrate, the problem is far more subtle than we expected. The current best upper bound (i.e., the KKS-JMNS bound) is tight only for two arms (or O(1) arms) and in the worst case sense (not instance optimal in the sense of [13, 1]).

1.1

Our Contributions

We need some notations to state our results formally. Let {µ[1] , µ[2] , . . . , µ[n] } be the means of the n arms, sorted in the nondecreasing order (ties are broken in an arbitrary but consistent manner). We use A[i] to denote the arm with mean µ[i] . In the Best-1-Arm problem, we define the gap for the arm A[i] to be ∆[i] = µ[1] − µ[i] , which is an important quantity to measure the sample complexity. We use A to denote an algorithm and TA (I) to be the expected number of total arm pulls (i.e., samples) by A on the instance I. In this paper, we sometimes mention the sample complexity of an algorithm and its running time interchangeably, since for all of our algorithms (as well as all algorithms we are aware of for Best-1-Arm and Sign-ξ) the running time is at most a constant times the number of samples. Moreover, all of our lower bounds are information theoretic, and bound the number of samples regardless of the running time of the algorithm (as all existing lower bounds for Best-1-Arm and Sign-ξ). Hence, sometimes when we informally speak that an algorithm must be “slow”, which really means it requires many samples. We believe such slight abuse should not cause any confusion. 2

1.1.1

Sign-ξ

First we consider the Sign-ξ problem. We first emphasize that the lower bound (1) is not an instancewise lower bound.1 In particular, the lim sup asserts that the existence of infinite number of instances that require ∆−2 ln ln ∆−2 samples (as ∆ → 0). This does not rule out the possibility of an algorithm that requires o(∆−2 ln ln ∆−2 ) samples for almost all instances (with respect to some natural distribution of the instances). In fact, we show that it is impossible to obtain an instance-wise ∆−2 ln ln ∆−2 lower bound by presenting an algorithm that requires much less samples infinitely often. In particular, we show that for any function T (∆) which grows faster than ∆−2 (for example ∆−2 α(∆−1 ), where α(x) is the inverse Ackermann function), there is a δ-correct algorithm for Sign-ξ (which is a variant of the Exponential-Gap-Elimination algorithm in [22]) that needs fewer than T (∆) samples for an infinite number of instances as ∆ → 0. More formally, we have the following theorem. Theorem 1.4. For any function T on (0, 1] such that lim sup T (∆)∆2 = +∞ and for any fixed constant ∆→+0

δ > 0, there exists a δ-correct algorithm A for Sign-ξ, such that lim inf ∆→+0

TA (∆) = 0. T (∆)

Now, we turn to the lower bound side. Let the target lower bound be F (∆) = c∆−2 ln ln ∆−1 , where c is a small universal constant. For simplicity, assume that all the reward distributions are Gaussian with σ = 1. Let TA (∆) = max(TA (Aξ+∆ ), TA (Aξ−∆ )), in which Aξ+∆ and Aξ−∆ denote the arms with means ξ + ∆ and ξ − ∆, respectively. For an algorithm A, if TA (∆) ≥ F (∆), we say ∆ is a “slow point” (or A is slow at ∆), otherwise, it is a “fast point”. The lower bound (1) merely asserts that there are infinite slow points for any algorithm A. We achieve a much refined understanding by characterizing that under what prior distributions over ∆, any algorithm must be slow for almost all instances (and when not). Theorem 1.5. (Informal) Let A be any δ-correct algorithm for Sign-ξ. ρ : N → R+ is an increasing unbounded convex function. Assume that the gap parameter ∆ is distributed uniformly over the discrete set {e−ρ(0) , e−ρ(1) , . . . , e−ρ(N ) }. A requires Ω(F (∆)) samples for almost all instances as N approach to infinity if and only if ρ(x) grows no faster than some polynomial function (i.e., ρ(x) = xO(1) ). When ρ is increasing, unbounded and concave, the same holds if and only lim inf xρ′ (x)/ρ(x)β > 0 for some constant β > 0. x→+∞

We note that the theorem is also useful for proving lower bounds for Best-1-Arm (see Theorem 1.6 below). We also prove similar results for the continuous analogues of the above discrete prior distributions. The precise statements can be found in Theorem 4.3 and Theorem 4.4. Our result covers wide classes of natural prior distributions. For example, for the uniform distribution over {1, e−1, . . . , e−N } (as N → +∞), our theorem states that any algorithm must be slow for almost all instances. But for the uniform distribution over {1, 1/2, 1/3, . . . , 1/N } (as N → +∞), there is some algorithm that can be fast for a nontrivial proportion. Our proofs is very different from, and much simpler than the complicated proof in [14]. It would be transparent from our proof why the ln ln ∆−1 term is essential: The main reason is that ∆ is not known beforehand (if ∆ is known, Sign-ξ can be solved in O(∆−2 ln δ −1 ) time). Intuitively, an algorithm has to “guess” and “verify” (in some sense) the true ∆ value. As a result, if the algorithm is “lucky” in allocating the time for verifying the right guess of ∆, it may stop earlier and thus be faster than F (∆) for some ∆s. But as we will demonstrate, if an algorithm stops earlier on larger ∆, it would hurt the accuracy for smaller ∆, and there is no way to be always lucky. This is the only factor accounting for the ln ln ∆−1 . While Farrell’s proof attributes the ln ln ∆−1 factor to the Law of Iterative Logarithm (see also [18]), our proof shows that the ln ln ∆−1 factor exists due to algorithmic reasons, which is a new perspective. Furthermore, we note that [14] assumes the reward distributions are from the exponential family. While our proof only utilizes the KL divergence between different reward distributions, which applies to nonexponential family as well. 1

To the contrary, the Mannor-Tsitsiklis bound

X

i

∆−2 ln δ−1 is an instance-wise lower bound [25]. [i]

3

Finally, we would like to remark that it may be possible to generalize Theorem 1.5 to more general prior distributions. Nevertheless, Theorem 1.5 already asserts the existence of very wide class of prior distributions of instances, for which no algorithm can be much faster than ∆−2 ln ln ∆−1 for a nontrivial proportion. Combined with Theorem 1.4, we can also see that it is not possible to obtain a tight instancewise lower bound (lim sup and lim inf must differ). Hence, we have enough reasons to be content with the O(∆−2 ln ln ∆−1 ) upper bound (which can be achieved by [22, 18] or our Algorithm 1), and are inclined to call Ω(∆−2 ln ln ∆−1 ) a nearly instance-wise lower bound for Sign-ξ. 1.1.2

Best-1-Arm

Now, we consider the general problem with narms. First, we note that the lower bound XBest-1-Arm n −1 −1 algorithm for Best-1-Arm. If such a result (1) does not rule out an O ∆[i] ln δ + ∆−2 [2] ln ln ∆[2] i=2 exists, it is clearly an improvement over [22, 18]. Moreover, it must be (worst case) optimal since both Xn −1 ∆[i] ln δ −1 and ∆−2 [2] ln ln ∆[2] are clearly the lower bound of the problem. However, we show it is not i=2   Xn ∆[i] ln ln n possible to get such an upper bound by presenting a class of instances that require Ω i=2 samples. Theorem 1.6. There exist constants c, c1 > 0 and N ∈ N such that, for any δ < 0.005 and any δn X ∆−1 correct algorithm A, and any n ≥ N , there exists a n arms instance I such that TA [I] ≥ c · [i] ln ln n. −1 Furthermore, ∆−2 [2] ln ln ∆[2]

i=2

n c1 X −1 < ∆ ln ln n. · ln n i=2 [i]

Xn ∆−2 Note that the second statement of the theorem says that [i] ln ln n is the dominating term (so i=2 that the theorem is not vacant). This is the first lower bound that goes beyond the Mannor-Tsitsiklis lower Xn −1 ∆−2 [25]. The proof of the theorem is also interesting in its own right. We provide a bound [i] ln δ i=2 nontrivial reduction from the Sign-ξ problem to the Best-1-Arm problem, and utilize our previous lower bound for Sign-ξ to obtain the desired lower bound for Best-1-Arm. More concretely, we construct a class of instances for Best-1-Arm, and show that if there is an algorithm that can solve those instances faster than the target lower bound, we can use the algorithm to solve a nontrivial proportion of a class of Sign-ξ instances faster than ∆−2 ln ln ∆−1 time, which leads to a contradiction by our lower bound on Sign-ξ. Note that the old lower bound in [14] cannot be used here since it does not preclude the existence of such an algorithm for Sign-ξ. Xn −1 ∆−2 Finally, we return to the question whether [i] ln ln ∆[i] is the right lower bound for Best-1-Arm, i=2 which was in fact the initial motivation of our work. In terms of instance optimality, the answer is certainly no in light of our previous discussion of Sign-ξ. However, even in the sense of “nearly instance optimality”, the answer is still negative, unless there are only O(1) arms. We achieve this by providing a nontrivial (worst case) optimal algorithm, which performs strictly better than the KKS-JMNS bound for very wide class of instances (those such that ln ln ∆−1 [i] ≪ ln ln n and the first term does not dominate). Theorem 1.7. For any δ < 0.1, there is a δ-correct algorithm for Best-1-Arm which needs at most n n   X X −2 −1 −2 −1 −1 ∆ ln ln min(n, ∆ ) ∆ ln δ + O ∆−2 ln ln ∆ + [i] [i] [i] [2] [2] i=2

i=2

samples in expectation. Note that our upper bound is never worse than the KKS-JMNS bound in [22, 18]. In fact, all three additive terms in our upper bounds are necessary (first term due to [14] or Theorem 1.5, second term due to [25] and the third due to Theorem 1.6), hence our algorithm is indeed (worst case) optimal. Theorem 1.7 has a few interesting consequences we would like to point out. For example, it is not −2 −1 ln ln ∆−1 possible to construct a class of infinite instances that requires Ω(n∆[2] [2] ) samples unless ln ln ∆[2] = 4

O(ln ln n). This is somewhat surprising: Consider a very basic family of instances in this class: there are n − 1 arms with mean 0.5 and 1 arm with mean 0.5 + ∆. The Mannor-Tsitsiklis lower bound Ω n∆−2 for this instance (even when ∆ is known) is in fact a directed sum-type result: roughly speaking, in order to solve Best-1-Arm, we essentially need to solve n − 1 independent copies of Sign-ξ with gap ∆. However, our upper bound in Theorem 1.7 indicates that the role of ∆[2] is different from the others ∆[i] s. Hence, if we want to go beyond Mannor-Tsitsiklis, 2 Best-1-Arm can not be thought as n independent copies of Sign-ξ. In fact, from the analysis of the algorithm, we can see that the first term and the rest come from very different procedures: the first term is used for “estimating” the gap distribution and the rest for “verifying” and “eliminating” suboptimal arms. Now, we provide a high level idea of our algorithm. Our algorithm is heavily influenced by the elegant Exponential-Gap-Elimination algorithm [22], and goes well beyond it. In order to highlight the technical novelty of our algorithm, we provide a very brief introduction to the Exponential-Gap-Elimination algorithm which runs in round. In the rth round of Exponential-Gap-Elimination, we first try to identify an ǫr -optimal arm Ar (where ǫr = O(2−r )), using the classical PAC algorithm in [12]. Then, using the empirical mean of Ar as a threshold, the algorithm tries to eliminate those arms with smaller means. In fact, comparing with the previous elimination-based algorithms, such as [11, 12, 7], Exponential-GapElimination seems to be the most aggressive one, which is the main reason that Exponential-GapElimination improves on the previous results. However, we show that Exponential-Gap-Elimination may be over-aggressive, and we may benefit from delaying the elimination for some rounds, if we cannot eliminate a substantial number of arms in this round. To exploit this fact, we develop a procedure called FRAC-TEST, which, roughly speaking, can inform us about the distribution of the gaps and decide whether or not we should do elimination in this round. Furthermore, the analysis of the running time of our algorithm is quite challenging as well. For that purpose, we need to carefully choose a potential function to amortize the costs over different iterations.   Xn −1 −1 for a ∆−2 Our algorithm can achieve an even better upper bound O ∆−2 [i] ln δ [2] ln ln ∆[2] + i=2

special but important class of instances, which we call clustered instances. 3 Note that this bound is nearly instance optimal, since it matches the Mannor-Tsitsiklis instance-wise lower bound plus the nearly instance−1 wise lower bound ∆−2 [2] ln ln ∆[2] . In fact, the aforementioned instances (n − 1 arms with mean 0.5 and 1 arm with mean 0.5 + ∆) are clustered instances. Even for such basic instances, a tight bound is not known so far! After a careful examination, we find that all previous algorithms are suboptimal on the very basic examples, while our algorithm can achieve the O(∆−2 ln ln ∆−1 + n∆−2 ln δ −1 ), which is nearly instance optimal. By slightly modifying our algorithm, we can easily obtain an (ǫ, δ)-PAC algorithm for Best-1-Arm, which improves several prior work. See Table 2 for the detailed information. 1.1.3

δ-correct Algorithms

In the Best-1-Arm literature, there exist at least three different classes of δ-correct algorithms. The first type has an overall expected running time bound, the second requires that the algorithm outputs the correct answer and terminates before a fixed running time bound with probability at least 1 − δ, and the third only has an expected running time bound conditioning on some 1 − δ probability event. Such discrepancy has been overlooked in most prior work. Several prior algorithms belong to the second and third types (e.g., [11, 18, 22, 15]), which has the potential drawback that the algorithm may behave arbitrarily (even loop forever) with probability δ. Some previous work has noticed the issue, but we are not aware of a systematic way to deal with it. For example, Kalyanakrishnan et al. [21] explicitly stated that “However, their (i.e., [12]) elimination algorithm could incur high sample complexity on the δ-fraction of the runs on which mistakes are made—we think it unlikely that elimination algorithms can yield an expected sample complexity bound smaller than ... ”. In this paper, we formally define the three classes of δ-correct algorithms, and present a general transformation from the second or third type to the first type (Theorem 2.8), which addresses the issue raised in [21]. In order to do so, we apply a trick called parallel simulation, which runs an infinite number of different versions of the same algorithm (with different confidence parameters) in different rates. 2

In other words, we want the lower bound to reflect the hardness caused by not knowing ∆i s. We say the instances is clustered if the cardinality of the set {⌊ln ∆−1 ⌋}n i=2 is bounded by a constant. See Theorem 6.20 [i] for more details. 3

5

Source 2002, Even-Dar et al. [11] 2012, Gabillon et al. [15] 2013, Jamieson et al. [17] 2012, Kalyanakrishnan et al. [21] 2013, Jamieson et al. [17] 2013, Karnin et al. [22] 2014, Jamieson et al. [18] This paper This paper (clustered instances)

Sample Complexity   Xn ln δ −1 + ln n + ln ∆−1 ∆−2 [i] [i]   Xi=2 Xn n ln δ −1 + ln ∆−2 ∆−2 [i] [i] i=2   Xn  Xi=2 n ln ln ∆−2 + ln δ −1 ∆−2 [i] [j] j=2   Xi=2 Xn n −1 ln δ + ln ∆−2 ∆−2 [i] i=2 [i] i=2  Xn Xn ln δ −1 · ln ln δ −1 · ∆−2 + ∆−2 ln ∆−1 [i] [i] [i] i=2 i=2   Xn −1 ln ln ∆−1 ∆−2 [i] + ln δ [i]   Xi=2 n ln ln ∆−1 + ln δ −1 ∆−2 [i] [i]   Xi=2 n −1 ln δ −1 + ln ln min(n, ∆−1 ) + ∆−2 ∆−2 [i] [2] ln ln ∆[2] [i] Xi=2 n −1 −1 ∆−2 + ∆−2 [i] ln δ [2] ln ln ∆[2] i=2

Guarantee Type weakly T -time weakly T -time weakly T -time expected T -time weakly T -time weakly T -time weakly T -time expected T -time expected T -time

Table 1: Sample complexity upper bounds for different δ-correct algorithms. Source 2002, Even-Dar et al. [11] 2012, Gabillon et al. [15] 2012, Kalyanakrishnan et al. [21] 2013, Karnin et al. [22] This paper

Sample Complexity nǫ−2 · ln δ −1   Xn Xn −1 ∆−2 ∆−2 i,ǫ · ln i,ǫ + ln δ  Xi=2  Xi=2 n n −1 ∆−2 ∆−2 i,ǫ · ln i,ǫ + ln δ i=2 Xi=2 n −1 −1  ∆−2 i,ǫ ln ln ∆i,ǫ + ln δ Xi=2  n −1 −2 −1 ∆−2 + ln ln min(n, ∆−1 i,ǫ ln δ i,ǫ ) + ∆2,ǫ ln ln ∆2,ǫ i=2

Guarantee Type worst case weakly T -time expected T -time weakly T -time expected T -time

Table 2: Sample complexity upper bounds for different (ǫ, δ)-PAC algorithms. Here, ∆i,ǫ = max(∆[i] , ǫ)

1.2

Related Work

Sign-ξ and A/B testing: The Sign-ξ problem is closely related to the A/B testing problem in the literature, in which we have two arms with unknown means and the goal is to decide which one is larger. It is easy to see that a lower bound for Sign-ξ is also a lower bound for the the A/B testing problem. Kaufmann et al. [23] studied the optimal sample complexity for the A/B testing problem. However, their focus is on the limiting behavior of the sample complexity when the confidence level δ approaches to zero, while we are interested in the case where the gap ∆ approaches to zero but δ is a constant (hence in their case, the ln ln ∆−1 factor is absorbed by the O(ln δ −1 ) factor). They also considered the fixed budgeted setting, in which the number of pulls is fixed in advance and we want to minimize the mis-identification probability. Best-k-Arm: One natural generalization of Best-1-Arm is the Best-k-Arm problem, which asks for the top-k arms instead of just the top-1. The problem has also been studied extensively for the last few years [15, 16, 28, 21, 20, 7, 24, 23]. Most lower and upper bounds for Best-k-Arm are variants of those for Best-1-Arm, and the bounds also depend on the gap parameters. But in this case, the gaps are typically defined to be the distance to µ[k] or µ[k+1] . Chen et al. [9] initiated the study of the combinatorial pure exploration problem, which generalizes the cardinality constraint in Best-k-Arm to more general combinatorial constraints. For example, assume that the arms are the elements of a matroid, and the goal is to identify the base with the maximum total mean. PAC learning: The worst case sample complexity of Best-1-Arm in the PAC setting is also well studied. There is a matching lower and upper bound Ω(n ln δ −1 /ǫ2 ) [12, 25, 11]. The worst case sample complexity for Best-k-Arm in the PAC setting has also been well studied [20, 21, 28, 8]. In fact, there are several metrics to define an ǫ-approximate solution. For example, Cao et al. [8] required that the top-i arm in the solution is at most ǫ away from the µ[i] for i ∈ [k], while Zhou et al. [28] required that only the average mean of the solution is at most ǫ away from the top-k arms. Matching (worst case) upper and lower bounds (the bounds only depend on n, k, ǫ, δ) are obtained for the above problems.

6

2

Preliminaries

Definition 2.1. Let R > 0, we say a distribution D on R has R-sub-Gaussian tail (or D is R-sub-Gaussian) if for the random variable X drawn from D and any t ∈ R, we have that E[exp(tX − tE[X])] ≤ exp(R2 t2 /2). It is well known that the family of R-sub-Gaussian distributions contains all distributions with support on [0, R] as well as many unbounded distributions such as Gaussian distributions with variance R2 . Then we recall a standard concentration inequality for R-sub-Gaussian random variables. Lemma 2.2. (Hoeffding’s inequality) Let X1 , . . . , Xn be n i.i.d. random variables drawn from a R-sub Gaussian distribution D. Let µ = Ex∼D [x]. Then for any ǫ > 0, we have that " n #   1 X nǫ2 Pr Xi − µ ≥ ǫ ≤ 2 exp − 2 . n 2R i=1

For simplicity of exposition, we assume all reward distributions are 1-sub-Gaussian in the paper. Suppose A is an algorithm for Best-1-Arm(or Sign-ξ). Let the given instance be I. Let E be an event and PrA,I [E] be the probability that the event E happens when running A on instance I. When A is clear from the context, we omit the subscript A and simply write PrI [E]. Similarly, if X is a random variable, we use EA,I [X] to denote the expectation of X when running A on instance I. Sometimes, A takes an additional confidence parameter δ, and we write Aδ to denote the algorithm A with the fixed confidence parameter δ. Let τi be the random variable that denotes the number of pulls from arm i (when the algorithm and n X τi be the total the problem instance are clear from the context) and EI [τi ] be its expectation. Let τ = i=1

number of samples taken by A. The Kullback-Leibler (KL) divergence of any two distributions p and q is defined to be   Z dp KL(p, q) = log (x) dp(x) if q ≪ p dq

where q ≪ p means that dp(x) = 0 whenever dq(x) = 0. For any two real numbers x, y ∈ (0, 1), let H(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)) be the relative entropy function. N (µ, σ 2 ) denotes the Gaussian distribution with mean µ and standard deviation σ. Many lower bounds in the bandit literature rely on certain “changes of distributions” argument. The following version (Lemma 1 in [23]) is crucial to us. Lemma 2.3. (Change of distribution) [23] We use an algorithm A for a bandit problem with n arms. 4 Let I (with arm distributions {Di }i∈[n] ) and I ′ (with arm distributions {Di′ }i∈[n] ) be two instances. Let E be an event, 5 such that 0 < PrA,I (E) < 1. Then, we have n X i=1

EI [τi ]KL(Di , Di′ ) ≥ H(PrA,I (E), PrA,I ′ (E)).

We also need the following well known fact about the KL divergence between two Gaussian distributions. Lemma 2.4. KL(N (µ1 , σ 2 ), N (µ2 , σ 2 )) = 4

(µ1 − µ2 )2 . 2σ 2

We make no assumption on the behavior of A in this lemma. For example, A may even output incorrect answers with high probability. 5 More rigorously, E should be in the σ-algebra Fτ where τ is a stopping time with respect to the filtration {Ft }t≥0 .

7

2.1

Three Types of δ-correct Algorithms

There are three kinds of algorithm for various bandit arm identification problems in the literature. The first kind (which we call expected-T -time δ-correct algorithms) requires the algorithm outputs the correct answer with probability 1 − δ and its expected running time E[τ ] is at most T (δ, I) on instance I. For example, some LUCB-type algorithms belong to this class [21]. The second kind (which we call weakly T -time δ-correct algorithms) requires that the algorithm outputs the correct answer and terminates before certain steps with probability at least 1 − δ. However, with probability δ, there is no guarantee on its behavior (e.g., runs forever). Hence, the expected running time may be unbounded. Most of the previous algorithms belong to this class [22, 18, 15]. Moreover, there can be another kind (which we call weakly expected-T -time δ-correct algorithms). An algorithm of this kind should output the correct answer with probability at least 1 − δ, and the conditional expected running time is upper bounded by T (δ, I). We can easily see that the third one is a strictly weaker guarantee than the first one and the second one. However, the first one and the second one are uncomparable to each other. Arguably, the first one is a better guarantee, since we would like the algorithm to terminate in finite time, even if it outputs the wrong answer. We first define the three algorithm classes rigorously. Definition 2.5. Let A be an algorithm for some problem P. A takes an additional input δ as the confidence level. Let I be the set of all valid instances for P. We write Aδ to denote the algorithm A with a fixed confidence level δ. 1. For a function T : (0, 1) × I → R+ , we call A a weakly T -time δ-correct algorithm iff there exists δ0 ∈ (0, 1) such that for any δ ∈ (0, δ0 ) and instance I ∈ I: Pr[Aδ returns the correct answer on I ∧ τ ≤ T (δ, I)] ≥ 1 − δ, where τ is the running time of A on I. 2. We call A an expected-T -time δ-correct algorithm iff there exists δ0 ∈ (0, 1) such that for any δ ∈ (0, δ0 ) and instance I ∈ I: TAδ [I] ≤ T (δ, I)

and

Pr[Aδ returns the correct answer on I] ≥ 1 − δ.

3. We call A a weakly expected-T -time δ-correct algorithm iff there exists δ0 ∈ (0, 1) such that for any δ ∈ (0, δ0 ) and instance I ∈ I, there exists an event E that PrAδ ,I [E] ≥ 1 − δ ∧ EAδ ,I [τ | E] ≤ T (δ, I)

and

Pr[Aδ returns the correct answer on I | E] = 1.

We call the above event E a good event. We need a mild assumption on the running times for our general transformation. Definition 2.6. We say a function T : (0, 1) × I → R is a reasonable time bound, if there exists 0 < δ0 < 1 such that for all 0 < δ ′ < δ < δ0 and I ∈ I we have that T (δ ′ , I) ≤

ln δ ′−1 T (δ, I). ln δ −1

Remark 2.7. The running times of most previous algorithms are of the form α(I) + β(I) ln δ −1 , where α and β only depend on I (e.g., see Table 1). Such running times are obviously reasonable time bounds. Theorem 2.8. Suppose T is a reasonable time bound. If A is a weakly expected-T -time δ-correct algorithm (or a weakly T -time δ-correct algorithm) for the problem P, then there exists an algorithm B which is expected-O(T )-time δ-correct. Moreover, B can be constructed explicitly from A.

8

Algorithm 1: TEST-SIGN(A, δ, {Λi}) Data: The single arm A with unknown mean µ 6= ξ, confidence level δ, the reference sequence {Λi }. Result: Whether µ > ξ or µ < ξ. 1 for r = 1 to +∞ do 2 ǫr = Λr /2 3 δr = δ/10r2 4 Pull A for tr = 2 ln(2/δr )/ǫ2r times. Let µ br denote its average reward. r 5 if µ b > ξ + ǫr then 6 Return µ > ξ 7 8

3

if µ br < ξ − ǫr then Return µ < ξ

Algorithm for Sign-ξ

In this section we prove Theorem 1.4 by presenting a class of algorithms for Sign-ξ. Each algorithm here is in fact a simple variant of the Exponential-Gap-Elimination algorithm in [22]. Now, we begin our description of the algorithm. Our algorithm takes an infinite sequence S = {Λi }+∞ i=1 as input, which we call the reference sequence. Definition 3.1. We say an infinite sequence S = {Λi }+∞ i=1 is a reference sequence if the following statements hold: 1. 0 < Λi < 1, for all i. 2. There exists a constant 0 < c < 1 such that for all i, Λi+1 ≤ c · Λi . Our algorithm TEST-SIGN takes a confidence level δ and the reference sequence {Λi } as input. It runs in rounds. In the rth round, the algorithm takes a number of samples (the actual number depends on r, and can be found in Algorithm 1) from the arm and let µ br be the empirical mean. If µ br ∈ ξ ± Λr /2, we decide that the gap ∆ is smaller than the reference gap Λr and we should proceed to the next round with a smaller reference gap. If µ br is larger than ξ + Λr /2, we decide µ > ξ. If µ br is smaller than ξ − Λr /2, we decide µ < ξ. The pseudocode can be found in algorithm 1. Theorem 3.2. Fix a confidence level δ > 0 and an arbitrary reference sequence S = {Λi }∞ i=1 . Suppose that the given instance has a gap ∆. Let κ be the smallest i such that Λi ≤ ∆. With probability at least 1 − δ, TEST-SIGN determines whether µ > ξ or µ < ξ correctly and uses O((ln δ −1 + ln κ)Λ−2 κ ) samples in total.

The proof for the above theorem is somewhat similar to the analysis of the Exponential-Gap-Elimination algorithm in [22] (in fact, simpler since there is only one arm). The details can be found in the appendix. Now, let T (δ, I) = (ln δ −1 + ln κ)Λ−2 κ , in which I is a valid Sign-ξ instance. It is not hard to see T (δ, I) here is a reasonable time bound. Algorithm 1 is weakly T -time δ-correct, so we can apply Theorem 2.8 to make it expected-T -time δ-correct. Corollary 3.3. For TEST-SIGN with the reference sequence S = {Λi }. After applying the construction in Theorem 2.8, we have algorithm A for Sign-ξ such that given confidence level δ, it takes O((ln δ −1 + ln κ)Λ−2 κ )) total samples in expectation, and outputs correctly with probability at least 1 − δ. Finally, we prove Theorem 1.4. Proof of Theorem 1.4. First, we can easily construct a reference sequence {Λi } such that 1. 0 < Λi < 1 and Λi+1 ≤ Λi /2 for all i. 2. T (Λi ) ≥ i · Λ−2 for all i (this is possible since lim sup T (∆)/∆−2 = +∞). i ∆→+0

9

Then we apply Corollary 3.3 with reference sequence {Λi }+∞ i=1 . Let the algorithm be A. For any fixed δ, we can see that TA (Λi ) C(ln δ −1 + ln i)Λ−2 i lim = 0, ≤ lim i→+∞ T (Λi ) i→+∞ iΛ−2 i where C is the constant hidden in the big-O in Corollary 3.3. This implies that lim inf TA (∆)/T (∆) = 0. ∆→+0

Remark 3.4. If we use the reference sequence {Λi = e−i }, then by Corollary 3.3, the constructed algorithm A is δ-correct for Sign-ξ, and it takes O(∆−2 (ln δ −1 + ln ln ∆−1 )) samples in expectation on instance with gap ∆.

4

Sign-ξ

In this section, we prove the lower bound for Sign-ξ, where ξ ∈ R is a fixed constant. For simplicity of exposition, we first assume the distributions of the arms are Gaussian distributions with σ = 1 (but the mean is unknown, of course). We will discuss how the assumption can be lifted to obtain lower bounds for other distributions at the end of the section.

4.1

Notations

We first introduce some notations. Let A denote an algorithm for Sign-ξ. Let Aµ be an arm with mean µ. Let TA (∆) = max(TA (Aξ+∆ ), TA (Aξ−∆ )). Let F (∆) = ∆−2 · ln ln ∆−1 , which is the benchmark we want to compare. Fix an algorithm A. For a random event E, PrA,Aµ [E] denotes the probability that E happens if we run A on arm Aµ . For notational simplicity, when A is clear from the context, we abbreviate it as Prµ [E]. Similarly, we write Eµ [X] as a short hand notation for EA,Aµ [X]. Let 1{expr} be the indicator function which equals 1 when expr is true and 0 otherwise. Throughout this section, we assume 0 < δ < 0.01 is a fixed constant. Recall our goal is to prove Theorem 1.5. We need to define nature distribution families for the gap parameter ∆. First, we define function ρ : [0, +∞) → R to control the distributions. Definition 4.1. We say ρ : [0, +∞) → R is a granularity function if ρ is twice differentiable, strictly increasing, nonnegative and unbounded. Next, we define ΓdN (ρ) (N ∈ N) to be the following discrete probability distribution of the gap parameter 6

∆:

ΓdN (ρ) := uniform{e−ρ(0) , e−ρ(1) , . . . , e−ρ(N ) }.

We use e−ρ(i) to make sure the limit is zero. We remark that a choice other than the exponential function is possible, but as we will see, the exponential function is the most natural regime to think about the algorithms and lower bounds. Similarly, we define the continuous analogue of Γdi (ρ) as ΓcN (ρ) := e−ρ(x)

where x ∼ uniform[0, N ],

Now, we are ready to define what is the meaning of “Any algorithm requires Ω(F (∆)) samples for almost all instances”. Definition 4.2. Let {ΓN }+∞ N =1 be a sequence of distributions, all supported on (0, 1]. We say that any δcorrect algorithm for Sign-ξ requires Ω(F (∆)) samples for almost all instances with respect to {ΓN }+∞ N =1 if and only if there exists a universal constant c > 0 such that for any δ-correct algorithm A, we have that lim ΓN ({∆ | TA (∆) < c · F (∆)}) = 0.

N →+∞ 6

We stress that this is not a Bayesian setting since we require the algorithm is δ-correct for every Sign-ξ instance.

10

We briefly discuss the reason why we need an infinite sequence of distributions {ΓN }+∞ N =1 . In the previous section, we have already seen from Theorem 1.4 that there is no instance-wise Ω(F (∆)) lower bound. So we settle for proving that Ω(F (∆)) is a lower bound holds for “almost all instances” with respect to natural distributions (of ∆) over (0, 1] (we would like to call Ω(F (∆)) an almost instance-wise lower bound). Suppose +∞ X ∆ follows a fixed distribution Γ. Now, let an = Γ([2−n , 2−n+1 )). Clearly we have an = 1. Furthermore, n=1

we assume there are infinite many an such that an > 0 as otherwise F (∆) is bounded by a constant on all instance. Fix n and consider the algorithm An , which is obtained by applying Corollary 3.3 with the reference −2 sequence {Λi = 2−n+1−i }+∞ ) = o(F (∆)) number of samples i=1 . We can see the algorithm uses less than O(∆ −n −n+1 in expectation if ∆ ∈ [2 , 2 ). It means for any c > 0, we can pick a big enough n with an > 0 such that Γ({∆ | TAn (∆) < c · F (∆)}) ≥ an > 0. However, the existence of such An which can solve a nontrivial proportion of instances faster is not a very useful or insightful result as we do not know n and An (for large n) clearly is not a very good algorithm. To resolve this issue, we use an infinite sequence of distributions {ΓN }+∞ N =1 , instead of a single distribution, and investigate the limiting behavior of any algorithm. If the granularity function ρ behaves arbitrarily, it seems to be impossible to analyze the problem. However, for ρ being either convex or concave, we manage to provide complete characterization when Sign-ξ requires Ω(F (∆)) for almost all instances. The main results of this section are the following theorems. Theorem 4.3. (Almost instance-wise lower bound, for convex granularity function) Let ρ be a convex granularity function. Any δ-correct algorithm for Sign-ξ needs Ω(F (∆)) samples for almost all instances with γ respect to {ΓdN (ρ)}+∞ N =1 iff there exists γ > 0 such that ρ(x) = O(x ). In other words, the lower bound holds iff ρ(x) grows no faster than some polynomial function. For the continuous case, any δ-correct algorithm for Sign-ξ requires Ω(F (∆)) samples for almost all instances with respect to {ΓcN (ρ)}+∞ N =1 . It may seem strange that the result for the discrete case differs from that for the continuous analogue for convex granularity functions. Intuitively, the reason is as follows: if ρ(x) grows faster than any polynomial, then we can design a reference sequence to make all the points eρ(i) , i ∈ N fast points. However, in the continuous case, the span of each interval [eρ(i+1) , eρ(i) ) is too large in some sense, and being quick at its endpoints does not make the whole interval quick points. But for concave granularity functions, such difference does not exist, as in the following theorem. Theorem 4.4. (Almost instance-wise lower bound, for concave granularity function) Let ρ be a concave granularity function. Any δ-correct algorithm for Sign-ξ needs Ω(F (∆)) samples for almost all instances xρ′ (x) > 0. The same statement holds inf with respect to {ΓdN (ρ)}+∞ N =1 iff there exists γ > 0 such that lim x→+∞ ρ(x)γ +∞ c for {ΓN (ρ)}N =1 as well. The rest of the section is devoted to prove the above two theorems. We need a few notations. For an integer N ∈ N, a δ-correct algorithm A for Sign-ξ, and a function g : R → R, define C(A, g, N ) =

N X  1 There exists some ∆ ∈ [e−i , e−i+1 ) such that: TA (∆) < g(∆) . i=1

Intuitively, it is the number of intervals [e−i , e−i+1 ) among the first N intervals that contains a fast point with respect to g. The following two lemmas are crucial to us, and may be useful for proving lower bounds for related problems. Lemma 4.5. For any γ > 0, we have a constant c1 > 0 (which depends on γ) such that lim

N →+∞ A′ is

C(A′ , c1 F, N ) = 0. Nγ δ-correct

sup

In other words, the number of intervals containing fast points with respect to Ω(F ) can be smaller than any polynomial. 11

Lemma 4.5 has a continuous analogue. For N ∈ N and function g : R → R, let ′

C (A, g, N ) =

Z

N

0

 1 TA (e−x ) < g(e−x ) dx.

That is the mass of fast points respect to g in interval [0, N ].

Lemma 4.6. For any γ > 0, we have a constant c1 > 0 (which depends on γ) such that lim

N →+∞ A′

C′ (A′ , c1 F, N ) = 0. Nγ is δ-correct sup

In other words, the mass of fast points with respect to Ω(F ) can be smaller than any polynomial. The proofs of Theorem 4.3 and Theorem 4.4 are largely based on Lemma 4.5, Lemma 4.6 and Corollary 3.3 and the properties of convex/concave functions. The ideas of the proofs are not difficult, but the details of the calculations are somewhat tedious, which we defer to the appendix. Below we provide the proofs for Lemma 4.5 and Lemma 4.6, which contain the core ideas of our lower bound.

4.2

Proof for Lemma 4.5 and Lemma 4.6

We first show a simple but convenient lemma based on Lemma 2.3. Lemma 4.7. Let I1 (with reward distribution D1 ) and I2 (with reward distribution D2 ) be two instances of 1 Sign-ξ. Let E be a random event and τ be the total number of samples taken by A. Suppose PrD1 [E] ≥ . 2 Then, we have  1 PrD2 [E] ≥ exp −2ED1 [τ ]KL(D1 , D2 ) . 4 Proof. Applying Lemma 2.3, we have that

ED1 [τ ] · KL(D1 , D2 ) ≥ H(PrD1 [E], PrD2 [E]) ≥ H



1 , PrD2 [E] 2





1 · ln 2



1 4 PrD2 [E](1 − PrD2 [E])



.

Hence, 4 PrD2 [E](1 − PrD2 [E]) ≥ exp(−2ED1 [τ ] · KL(D1 , D2 )), from which the lemma follows easily. From now on, suppose A is a δ-correct algorithm for Sign-ξ. We define two events: EU = [A outputs “µ > ξ”], E(∆) = EU ∧ [d∆−2 ≤ τ ≤ 5TA (∆)], where τ is the number of samples taken by A and d is a universal constant to be specified later. The following lemma is a key tool, which can be used to partition the event EU into several disjoint parts. Lemma 4.8. For any ∆ > 0 and d < H(0.2, 0.01)/2, we have that Prξ+∆ [E(∆)] = PrA,Aξ+∆ [E(∆)] ≥

1 . 2

Proof. First, we can see that Prξ+∆ [EU ] ≥ 1 − δ ≥ 0.99 since A is δ-correct and “µ > ξ” is the right answer. Now, we claim Prξ+∆ [τ < d∆−2 ] < 0.25. Suppose to the contrary that Prξ+∆ [τ < d∆−2 ] ≥ 0.25. We can see that Prξ+∆ [EU ∧ τ < d∆−2 ] ≥ 0.25 − δ > 0.2. Consider the following algorithm A′ : A′ simulates A for d∆−2 steps. If A halts, A′ outputs what A outputs, otherwise A′ outputs nothing. Let EV be the event that A′ outputs µ > ξ. Clearly, we have PrA′ ,ξ+∆ [EV ] > 0.2. On the other hand, PrA′ ,ξ−∆ [EV ] < δ, since A is a δ-correct algorithm. So by Lemma 2.3, we have that EA′ ,ξ+∆ [τ ]KL(N (ξ + ∆, σ), N (ξ − ∆, σ)) = EA′ ,ξ+∆ [τ ]2∆2 ≥ H(0.2, δ) ≥ H(0.2, 0.01). 12

Since d∆−2 ≥ EA′ ,ξ+∆ [τ ], we have d ≥ H(0.2, 0.01)/2. But this contradicts the condition of the lemma. Hence, we must have Prξ+∆ [τ < d∆−2 ] < 0.25. Finally, we can see that Prξ+∆ [E(∆)] ≥ Prξ+∆ [EU ] − Prξ+∆ [τ < d∆−2 ] − Prξ+∆ [τ > 5TA (∆)] ≥ 1 − 0.01 − 0.25 − Prξ+∆ [τ > 5EA,ξ+∆ [τ ]] ≥ 1 − 0.01 − 0.25 − 0.2 ≥ 0.5

where the first inequality follows from the union bound and the second from Markov inequality. Lemma 4.9. For any δ-correct algorithm A, and any finite sequence {∆i }ni=1 such that 1. the events {E(∆i )} are disjoint,

7

and 0 < ∆i+1 < ∆i for all 1 ≤ i ≤ n − 1;

2. there exists a constant c > 0 such that TA (∆i ) ≤ c · F (∆i ) for all 1 ≤ i ≤ n, then it must hold that:

n X i=1

Proof. Suppose for contradiction that

exp{−2c · F (∆i ) · ∆2i } ≤ 4δ.

n X i=1

exp{−2c · F (∆i ) · ∆2i } > 4δ.

1 1 (∆i + α)2 ≤ ∆n . By Lemma 2.4, we can see that KL(N (ξ + ∆i , σ), N (ξ − α, σ)) = 5 2σ 2 1 (1.2∆i )2 ≤ ∆2i . By Lemma 4.7, we have: 2 Let α =

Prξ−α [EU ] ≥

n X i=1

n

Prξ−α [E(∆i )] ≥

n

1X 1X exp{−2Eξ+∆i [τ ] ∆2i } ≥ exp{−2c · F (∆i ) · ∆2i } > δ. 4 i=1 4 i=1

Note that we need Lemma 4.8(1) (i.e., Prξ+∆i [E(∆i )] ≥ 1/2 ) in order to apply Lemma 4.7 for the second inequality. The above inequality means that A outputs a wrong answer for instance ξ − α with probability > δ, which contradicts that A is δ-correct. Now, we try to utilize Lemma 4.9 on a carefully constructed sequence {∆i }. The construction of the sequence {∆i } requires quite a bit calculation. To facilitate the calculation, we provide a sufficient condition for the disjointness of the sequence, as follows. Lemma 4.10. A is any δ-correct algorithm for Sign-ξ. c > 0 is a universal constant. The sequence {∆i }N i=0 satisfies the following properties: 1. 1/e > ∆1 > ∆2 > . . . > ∆N ≥ α > 0. 2. For all i ∈ [N ], we have that TA (∆i ) ≤ c · F (∆i ). 3. Let Li = ln ∆−1 i . We have Li+1 − Li >

1 ln c + ln 5 − ln d ln ln ln α−1 + c1 , in which c1 = . 2 2

Then, the events {E(∆1 ), E(∆2 ), . . . , E(∆N )} are disjoint. Proof. We only need to show the intervals for each E(∆i ) are disjoint. In fact, it suffices to show it holds for two adjacent events E(∆i ) and E(∆i+1 ). Since 5TA (∆i ) ≤ 5c·F (∆i ), we only need to show 5c·F (∆i ) < d∆−2 i+1 , which is equivalent to ln c + ln 5 + 2Li + ln ln Li < ln d + 2Li+1 . By simple manipulation, this is further equivalent to Li+1 − Li > (ln c+ ln 5 − ln d+ ln ln Li )/2. Since α ≤ ∆i , we have ln ln ln α−1 ≥ ln ln Li , which concludes the proof. 7

More concretely, the intervals [d∆−2 i , 5TA (∆i )] are disjoint.

13

Now, everything is ready to prove Lemma 4.5. Proof of Lemma 4.5. Suppose for contradiction, for any c1 > 0, the limit is not zero. This is equivalent to lim sup N →+∞ A′

C(A′ , c1 F, N ) > 0. Nγ is δ−correct sup

(2)

γ , the above can lead to a contradiction. 4 First, we can see that (2) is equivalent to the existence of an infinite increasing sequence {Ni }i and a positive number β > 0 such that We claim that for c1 =

C(A′ , c1 F, Ni ) > β, Niγ is δ−correct

for all i.

sup

A′

Consider some large enough Ni in the above sequence. The above formula implies that there exists a δ-correct algorithm A such that C(A, c1 F, Ni ) ≥ βNiγ . We maintain a set S, which is initially empty. For each 2 ≤ j ≤ Ni , if there exists ∆ ∈ [e−j , e−j+1 ) such that TA (∆) ≤ c1 F (∆), then we add one such ∆ into the set S. We have |S| ≥ C(A, c1 F, Ni ) − 1 ≥ βNiγ − 1 (−1 comes from that j starts from 2). Let   ln c1 + ln 5 − ln d + ln ln Ni +1 . b= 2 We keep only the 1st, (1 + b)th, (1 + 2b)th, . . . elements in S, and remove the rest. With a slight abuse of |S| notation, rename the elements in S by {∆i }i=1 , sorted in decreasing order. 1 It is not difficult to see that > ∆1 > ∆2 > . . . > ∆|S| ≥ e−Ni > 0. By the way we choose the elements, e for 1 ≤ i < |S|, we have ln c1 + ln 5 − ln d 1 −1 ln ∆−1 > + ln ln N. i+1 − ln ∆i 2 2 Recall that we also have TA (∆i ) ≤ c1 F (∆i ) for all i. Hence, we can apply Lemma 4.10 and conclude that all events {E(∆i )} are disjoint. We have |S| ≥ (βNiγ − 1)/b, for large enough Ni (we can choose such Ni since {Ni } approaches to infinity), it implies |S| ≥ βNiγ / ln ln Ni . Then, we can get |S| X j=1

exp{−2c1 · F (∆j ) · ∆−2 j }=

|S| X j=1

exp{−γ · ln ln ∆−1 j /2}

|S| X −γ/2 −γ/2 (ln ∆−1 ≥ |S| · Ni = j ) j=1

−γ/2

≥βNiγ / ln ln Ni · Ni

γ/2

= βNi

/ ln ln Ni .

The inequality in the second line holds since ∆j ≥ e−Ni for all j ∈ [|S|]. Since γ > 0, we can choose Ni large γ/2 enough such that βNi / ln ln Ni > 4δ, which renders a contradiction to Lemma 4.9. Now we prove Lemma 4.6, which is a simple consequence of Lemma 4.5.

14

Proof of Lemma 4.6. The lemma follows from the following observation: Z ⌈N ⌉ C′ (A, g, N ) ≤ 1{TA (e−x ) < g(e−x )}dx 0

=

⌈N ⌉ Z X

i

1{TA (e−x ) < g(e−x )}dx

i−1

i=1

⌈N ⌉



X i=1

1{There exists ∆ ∈ [e−i , e−i+1 ) such that: TA (∆) < g(∆)}

≤ C(A, g, ⌈N ⌉). Hence this lemma just follows from Lemma 4.5.

4.3

Generalization to other distributions

In this subsection, we briefly discuss how to generalize our results to other 1-sub-Gaussian distributions as well. Note that all our proofs rely on Corollary 3.3 (for the upper bound) and Lemma 4.5 (for the lower bound). It is easy to see that Corollary 3.3 works for all 1-sub-Gaussian distributions. For Lemma 4.5, the only property of the distribution we used is the KL divergence through Lemma 2.3. Hence, we can use our framework to handle other classes of distributions with similar mean-KL-divergence relations. More concretely, let M be a specific class of reward distributions (all having 1-sub-Gaussian tail). Each distribution in M is uniquely indexed by its mean. Let Dµ ∈ M denote the distribution with mean µ. We also assume all the means lie in range (a, b). For each µ ∈ (a, b), there exists a corresponding reward distribution Dµ ∈ M. Suppose for all µ1 , µ2 ∈ (a, b), we have KL(Dµ1 , Dµ2 ) ≤ C|µ1 − µ2 |2 for some constant C > 0. Then, all our previous proofs work exactly in the same way and Theorem 4.3 and Theorem 4.4 also hold for any δ-correct algorithm A for Sign-ξ. As an important example, the class of Bernoulli distributions exhibits the above property. Let Bµ denote (p − q)2 the Bernoulli distribution with mean µ. We have KL(Bp , Bq ) ≤ (see (2.8) in [6]). Then for all q(1 − q) 0 < ǫ < 0.5, for all p, q ∈ (0.5 − ǫ, 0.5 + ǫ), we have KL(Bp , Bq ) ≤

4 |p − q|2 , 1 − 4ǫ2

since q(1 − q) ≥ (0.5 − ǫ)(0.5 + ǫ) when q ∈ (0.5 − ǫ, 0.5 + ǫ). This shows for any ǫ > 0 and ξ ∈ (0.5 − ǫ, 0.5 + ǫ), Jian: check! what is ξ here? for any ǫ? Theorem 4.3 and Theorem 4.4 hold for all δ-correct algorithms for Sign-ξ when all reward distributions are Bernoulli.

5

Lower bound for Best-1-Arm

Our goal in this section is to prove Theorem 1.6. In this section, δ is a fixed constant such that 0 < δ < 0.005. We use [0, N ] to denote the set of integers {0, 1, . . . , N }. To show the lower bound of Best-1-Arm, we provide a reduction from Sign-ξ to Best-1-Arm. We need one more simple lemma about Sign-ξ. Lemma 5.1. For any δ ′ -correct algorithm A for Sign-ξ with δ ′ ≤ 0.01, there exist constants N0 ∈ N and c1 > 0 such that for all N ≥ N0 : |{TA (∆) < c1 · ∆−2 ln N | ∆ = 2−i , i ∈ [0, N ]|} ≤ 0.1 N +1 Proof. Let ρ(t) = t · ln 2. So e−ρ(t) = 2−t . Clearly, ρ is a convex granularity function. Note that δ ′ ≤ 0.01. So by Theorem 4.3, there exists c1 such that: lim ΓdN (ρ)({∆ | TA (∆) < 2c1 · ∆−2 ln ln ∆−1 }) = 0.

N →+∞

15

This implies that there exists N ′ ∈ N, such that for all N ≥ N ′ : ΓdN (ρ)({∆ | TA (∆) < 2c1 · ∆−2 ln ln ∆−1 }) ≤ 0.05. Therefore, we can pick N0 ≥ max(N ′ , 1000), such that for all N ≥ N0 : ΓdN (ρ)({∆ | TA (∆) < c1 ∆−2 ln N })

√ √ N }) + ΓdN (ρ)({∆ | ln ∆−1 < N }) √ ≤ ΓdN (ρ)({∆ | TA (∆) < 2c1 ∆−2 ln ln ∆−1 }) + ΓdN (ρ)({∆ | ln ∆−1 < N })

≤ ΓdN (ρ)({∆ | TA (∆) < c1 ∆−2 ln N ∧ ln ∆−1 ≥ ≤ 0.05 + 0.05 ≤ 0.1

√ In fact, the second inequality follows from 2c1 ∆−2 ln ln ∆−1 ≥√2c1 · ∆−2 ln N = c1 · ∆−2 ln N . The √ third inequality follows from the fact that ∆ = e−ρ(t) and ln ∆−1 < N corresponds to ρ(t) = t · ln 2 < N . For any algorithm A for Best-1-Arm, let Aperm be the algorithm which first randomly permutes the input arms, then runs A. More precisely, given an arm instance I with n arms, Aperm first chooses a random permutation π on n elements in I uniformly, then simulates A on the instance π ◦ I and returns what A returns. It is not difficult to see that the running time of Aperm only depends on the set {Di } of reward distributions of the instance, not their particular order. Clearly, if A is a δ-correct algorithm for any instance of Best-1-Arm, so is Aperm. Furthermore, we have the following simple lemma, which says we only need to prove a lower bound for Aperm . Lemma 5.2. For any instance I, there exists a permutation π such that TA (π ◦ I) ≥ TAperm (I). Proof. By the definition of Aperm , TAperm (I) =

1 n!

X

π∈Sym(n)

TA (π ◦ I),

where Sym(n) is the set of all n! permutations of {1, . . . , n}. Now we prove theorem 1.6, the high-level idea is to construct some “balanced” instances for Best-1Arm, and show that if an algorithm A is “fast” on those instances, we can construct a fast algorithm for Sign-ξ, which leads to a contradiction to Lemma 5.1. Proof of Theorem 1.6. In this proof, we assume all distributions are Gaussian random variables with σ = 1. Without loss of generality, we can assume N0 in Lemma 5.1 is an even integer, and N0 > 10 such that 4 2 · 4N0 ≥ · 4N0 + N0 + 2. Let N = 2 · 4N0 . 3 For every n ≥ N , we pick the largest even integer n1 such that 2 · 4n1 ≤ n. Clearly n1 ≥ N0 > 10 and n 1 X n 4 4k + n1 + 2 ≤ · 4n1 + n1 + 2 ≤ 2 · 4n1 . Also, by the choice of n1 , we have 2 · 4n1 +2 > n hence 4n1 > . 3 8 k=0 Consider the following Best-1-Arm instance Iinit with n arms: 1. There is a single arm with mean ξ. 2. For each k ∈ [0, n1 ], there are 4n1 −k arms with mean ξ − 2−k . 3. There are n −

n1 X

k=0

4k − 1 arms with mean ξ − 2.

For a Best-1-Arm instance I, let n(I) be the number of arms in I, and ∆[i] (I) be the gap ∆[i] according n(I) X ∆[i] (I)−2 . to I. We denote H(I) = i=2

Now we define a class of Best-1-Arm instances {IS } where each S ⊆ [0, n1 ]. Each IS is formed as follows: for every k ∈ S, we add one more arm with mean ξ − 2−k to Iinit ; finally we remove |S| arms with 16

mean ξ − 2 (by our choice of n1 there are enough such arms to remove). Obviously, there are still n arms in every instance IS . Let c be a universal constant to be specified later (in particular c does not dependent on n). Now we claim that for any δ-correct algorithm A for Best-1-Arm, there must exist an instance IS such that TAperm (IS ) > c · H(IS ) · ln n1 = Ω(H(IS ) ln ln n). Suppose for contradiction that there exists a δ-correct A such that TAperm (IS ) ≤ c · H(IS ) · ln n1 for all S. For the Best-1-Arm instance IS , we use NSk to denote the number of arms with gap 2−k . And let akS · 4k be the expected number of samples taken from an arm with gap 2−k by Aperm . Then we have: n1 X

k=0

Since 4

n1 −k



NSk

≤4

n1 −k

H(IS ) =

NSk (4k · akS ) ≤ TAperm (IS ) ≤ c · H(IS ) · ln n1 .

+ 1, we can see that

n1 X

k=0

NSk · 4k + (n −

n1 X

k=0

4k − 1 − |S|) · 2−2 ≤ 2

n1 X

4n1 +

k=0

1 · n. 4

Thus we have that n1 X

k=0

4n1 −k (4k · akS ) ≤

Simplify it a bit and note

n1 X

k=0

NSk (4k · akS ) ≤ c · H(IS ) · ln n1 ≤ c 2

k=0

4n1

1 + n 4

!

· ln n1 .

n < 4n1 , we get 8 n1 X

k=0

which is equivalent to:

n1 X

n1 X

k=0

4n1 · akS ≤ c · 2(n1 + 2)4n1 ln n1 ,

akS ≤ 2c(n1 + 2) ln n1 ≤ 3c(n1 + 1) ln n1 .

The last inequality follows n1 ≥ N0 > 10. n1 X c1 , in which c1 is the constant in Lemma 5.1. Then since akS ≤ 3c(n1 + 1) ln n1 = Now we set c = 30 k=0 c1 1 (n1 + 1) · ln n1 , we can see for any S, there are at most 0.1 fraction of elements in {akS }nk=0 satisfying 10 akS ≥ c1 · ln n1 . Let U =  {IS | |S|= n1 /2}, V = {IS | |S| = n1 /2 + 1} be two sets of Best-1-Arm instances. Notice that n1 + 1 |U | = |V | = (since n1 is even). n1 /2 Now, fix S ∈ U . Consider the following two algorithms for Sign-ξ: 1. A′S : Given the single arm A with unknown mean µ, replace one arm with mean ξ − 2 in IS with A to get instance I ′ . Run Aperm on I ′ , output µ > ξ if Aperm selects A as the best arm, otherwise output µ < ξ. 2. A′′S : Given the single arm A with unknown mean µ, construct an artificial arm A′ with mean 2ξ − µ from A. 8 Replace one arm with mean ξ − 2 in IS with A′ to get instance I ′ . Run Aperm on I ′ , output µ < ξ if Aperm selects A′ as the best arm, otherwise output µ > ξ. 8 That is, whenever the algorithm pulls A′ , we pull A to get a reward r, and return 2ξ − r as the reward for A′ . Note although we do not know µ, A′ is clearly an arm with mean 2ξ − µ.

17

Since A^{perm} is δ-correct for Best-1-Arm, it is not hard to see that A′_S and A′′_S are both δ-correct for Sign-ξ. Now, consider the algorithm A_S for Sign-ξ which runs as follows: it simulates A′_S and A′′_S simultaneously; each time it takes a sample from the input arm, it feeds the sample to both A′_S and A′′_S. If A′_S (resp. A′′_S) terminates first, it returns the output of A′_S (resp. A′′_S); in case of a tie, it returns the output of A′_S. Clearly, if both A′_S and A′′_S are correct, then A_S must be correct. Therefore, A_S is 2δ-correct for Sign-ξ.

Now we analyze the expected total number of samples taken by A_S on an arm A with mean µ and gap ∆ = |ξ − µ| = 2^{-k}. Suppose k ∉ S. A key observation is the following: if µ < ξ, then the instance constructed in A′_S is exactly I_{S∪{k}}; otherwise µ > ξ, and since 2ξ − (ξ + ∆) = ξ − ∆ = ξ − 2^{-k}, the instance constructed in A′′_S is exactly I_{S∪{k}} (the order of arms in the constructed instance and in I_{S∪{k}} may differ, but as A^{perm} randomly permutes the arms beforehand, this does not matter). That is, either T_{A′_S}(A) = a^k_{S∪{k}} · 4^k or T_{A′′_S}(A) = a^k_{S∪{k}} · 4^k. Since A_S terminates as soon as either of them terminates, we clearly have T_{A_S}(A) ≤ min(T_{A′_S}(A), T_{A′′_S}(A)) ≤ a^k_{S∪{k}} · 4^k for an arm A with gap 2^{-k} when k ∉ S.

Now, for S ∈ U, let bad_S = {k ∈ [0, n_1] | k ∉ S and a^k_{S∪{k}} ≥ c_1 ln n_1}. We have that
$$\sum_{S \in U} |bad_S| \le \sum_{S \in V}\sum_{k=0}^{n_1} \mathbf{1}\{a_S^k \ge c_1 \ln n_1\} \le \frac{n_1+1}{10}|V| = \frac{n_1+1}{10}|U|.$$
In other words, there exists S ∈ U such that |bad_S| ≤ (n_1+1)/10.

Now, consider the algorithm A_S for this S. For all k ∈ [0, n_1] \ (S ∪ bad_S), we can see that T_{A_S}(2^{-k}) ≤ a^k_{S∪{k}} · 4^k < c_1 · 4^k ln n_1. But this implies that
$$\frac{\big|\{T_{A_S}(\Delta) < c_1 \cdot \Delta^{-2}\ln n_1 \mid \Delta = 2^{-i},\, i \in [0, n_1]\}\big|}{n_1+1} \ge \frac{|[0, n_1] \setminus (S \cup bad_S)|}{n_1+1} \ge 0.4,$$
which contradicts Lemma 5.1 since 2δ < 0.01. So there must exist an I_S such that T_{A^{perm}}(I_S) > c · H(I_S) · ln n_1. By Lemma 5.2, there exists a permutation π on I_S such that $T_A(\pi \circ I_S) \ge \frac{c}{2}\sum_{i=2}^{n}\Delta_i^{-2}\ln\ln n$. This finishes the first part of the theorem.

To see that $\Delta_2^{-2}\ln\ln\Delta_2^{-1}$ is not the dominating term, simply notice that
$$\Delta_2^{-2}\ln\ln\Delta_2^{-1} = 4^{n_1}\ln(n_1 \cdot \ln 2) \le 4^{n_1}\ln n_1 \le \frac{1}{n_1}\sum_{k=0}^{n_1} N_S^k \cdot 4^k \ln n_1 \le \frac{2\ln 4}{\ln n}\sum_{i=2}^{n}\Delta_i^{-2}\ln\ln n.$$

This proves the second statement of the theorem.

6 Upper bound for Best-1-Arm

In this section we prove Theorem 1.7 by presenting a novel algorithm for Best-1-Arm. Our final algorithm builds on several useful components.

6.1 Useful Building Blocks

1. Uniform Sampling: The first building block is the simple uniform sampling algorithm.

Algorithm 2: UNIF-SAMPL(S, ǫ, δ)
Data: Arm set S, approximation level ǫ, confidence level δ.
Result: For each arm a, output the empirical mean µ̂_[a].
1: For each arm a ∈ S, sample it 2ǫ^{-2} ln(2 · δ^{-1}) times. Let µ̂_[a] be the empirical mean.

We have the following lemma for Algorithm 2, which is simply a consequence of Hoeffding's inequality (Lemma 2.2).
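To make the building block concrete, the following is a minimal Python sketch of UNIF-SAMPL. The sampling oracle pull(a) (returning one i.i.d. reward of arm a) and the arm representation are assumptions made purely for illustration.

import math

def unif_sampl(arms, eps, delta, pull):
    """Sample every arm 2*eps^{-2}*ln(2/delta) times and return the
    empirical means, as in Algorithm 2."""
    t = math.ceil(2 * eps ** -2 * math.log(2 / delta))
    return {a: sum(pull(a) for _ in range(t)) / t for a in arms}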

Algorithm 3: FRAC-TEST(S, c_l, c_r, δ, t, ǫ)
Data: Arm set S, range parameters c_l, c_r, confidence level δ, threshold t, approximation parameter ǫ.
1: cnt ← 0
2: tot ← ln(2 · δ^{-1})(ǫ/3)^{-2}/2
3: for i = 1 to tot do
4:   Pick a random arm a_i ∈ S uniformly.
5:   µ̂_[a_i] ← UNIF-SAMPL({a_i}, (c_r − c_l)/2, ǫ/3)
6:   if µ̂_[a_i] < (c_l + c_r)/2 then
7:     cnt ← cnt + 1
8: if cnt/tot > t then
9:   Return True
10: else
11:   Return False

Lemma 6.1. For each arm a ∈ S, we have that Pr[|µ_[a] − µ̂_[a]| ≥ ǫ] ≤ δ.

2. Median Elimination: We need the MED-ELIM algorithm from [12], which is a classic (ǫ, δ)-PAC algorithm for Best-1-Arm. The algorithm takes parameters ǫ, δ > 0 and a set S of n arms, and returns an ǫ-optimal arm with probability 1 − δ. The algorithm runs in rounds. In each round, it samples every remaining arm a uniform number of times, and then discards the half of the arms with the lowest empirical means (hence the name median elimination). It outputs the final arm that survives. We denote the procedure by MED-ELIM(S, ǫ, δ). We use this algorithm in a black-box manner; its performance is summarized in the following lemma.

Lemma 6.2. Let µ_[1] be the maximum mean value. MED-ELIM(S, ǫ, δ) returns an arm with mean at least µ_[1] − ǫ with probability at least 1 − δ, using a budget of at most O(|S| log(1/δ)/ǫ^2) pulls.
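The following is a schematic Python rendering of the median elimination idea, not the exact procedure of [12]: the per-round sample sizes and schedule below are only indicative, and pull is the same assumed sampling oracle as above.

import math

def med_elim(arms, eps, delta, pull):
    """Median elimination sketch: repeatedly sample all surviving arms and
    drop the half with the lowest empirical means, until one arm remains."""
    S = list(arms)
    eps_l, delta_l = eps / 4, delta / 2
    while len(S) > 1:
        t = math.ceil(4 / eps_l ** 2 * math.log(3 / delta_l))
        emp = {a: sum(pull(a) for _ in range(t)) / t for a in S}
        S.sort(key=lambda a: emp[a], reverse=True)
        S = S[: max(1, len(S) // 2)]          # keep the better half
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2
    return S[0]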

3. Fraction test: Another key tool here is an estimation procedure, which can be used to gain some information about the distribution of the arms. The algorithm takes parameters (S, c_l, c_r, δ, t, ǫ), where S is the set of arms, δ is the confidence level, c_l < c_r are real numbers called range parameters, t ∈ (0, 1) is the threshold, and ǫ is a small positive constant. Typically, c_l and c_r are very close. The goal of the algorithm, roughly speaking, is to distinguish whether there are still many arms in S with small means (w.r.t. c_r), or whether the majority of arms already have large means. The precise guarantee the algorithm achieves can be found in Lemma 6.3.

The algorithm runs in ln(2 · δ^{-1})(ǫ/3)^{-2}/2 iterations. In each iteration, it samples an arm a_i uniformly from S, and takes O(ln ǫ^{-1}(c_r − c_l)^{-2}) independent samples from a_i. Then, we look at the fraction of iterations in which the empirical mean of a_i is less than (c_l + c_r)/2. If this fraction is larger than t, the algorithm returns True; otherwise, it returns False.

For ease of notation, we define S^{≥c} := {a ∈ S | µ_[a] ≥ c}, that is, the set of all arms in S with means ≥ c. Similarly, we define S^{>c}, S^{≤c} and S^{<c}.

Lemma 6.3. Suppose ǫ < 0.1 and t ∈ (ǫ, 1 − ǫ). With probability 1 − δ, the following hold:
• If FRAC-TEST outputs True, then |S^{>c_r}| < (1 − t + ǫ)|S| (or equivalently |S^{≤c_r}| > (t − ǫ)|S|).
• If FRAC-TEST outputs False, then |S^{<c_l}| < (t + ǫ)|S| (or equivalently |S^{≥c_l}| > (1 − t − ǫ)|S|).
Moreover, the number of samples taken by the algorithm is O(ln δ^{-1} ǫ^{-2} ∆^{-2} ln ǫ^{-1}), in which ∆ = c_r − c_l.

The proof is based on fairly simple applications of the Chernoff bound and the union bound. The details can be found in the appendix.

Remark 6.4. If ǫ is a fixed constant, then the number of samples is simply O(ln δ^{-1} ∆^{-2}).
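A Python sketch of FRAC-TEST mirroring Algorithm 3, reusing the unif_sampl sketch above; pull is again the assumed sampling oracle.

import math
import random

def frac_test(arms, c_l, c_r, delta, t, eps, pull):
    """Return True if, empirically, more than a t fraction of uniformly
    drawn arms look 'small' relative to the midpoint (c_l + c_r)/2."""
    cnt = 0
    tot = math.ceil(math.log(2 / delta) * (eps / 3) ** -2 / 2)
    for _ in range(tot):
        a = random.choice(arms)                       # uniform random arm
        mu_hat = unif_sampl([a], (c_r - c_l) / 2, eps / 3, pull)[a]
        if mu_hat < (c_l + c_r) / 2:
            cnt += 1
    return cnt / tot > t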

Algorithm 4: ELIMINATION(S, c_l, c_r, δ)
Data: Arm set S, range parameters c_l, c_r, confidence level δ.
Result: A set of arms after elimination.
1: S_1 ← S
2: c_m ← (c_l + c_r)/2
3: for r = 1 to +∞ do
4:   δ_r ← δ/(10 · 2^r)
5:   if FRAC-TEST(S_r, c_l, c_m, δ_r, 0.075, 0.025) then
6:     UNIF-SAMPL(S_r, (c_r − c_m)/2, δ_r)
7:     S_{r+1} ← {a ∈ S_r | µ̂_[a] > (c_m + c_r)/2}
8:   else
9:     Return S_r

4. Eliminating arms: The final ingredient is an elimination procedure, which can be used to eliminate most arms below a given threshold. The procedure takes four parameters (S, c_l, c_r, δ) as input, where S is a set of arms, c_l < c_r are the range parameters, and δ is the confidence level. It outputs a subset of S and guarantees that upon termination, most of the remaining arms have means at least c_l, with probability 1 − δ.

Now, we briefly describe the procedure, which runs in iterations (a schematic implementation is also sketched after Lemma 6.5). It maintains the current set S_r of arms, which is initially S. In each iteration, it first applies FRAC-TEST on S_r (the detailed parameters can be found in Algorithm 4). If FRAC-TEST returns True, which means that there are still many arms with small means in S_r, we sample all arms in S_r uniformly and retain those with empirical means at least (c_m + c_r)/2. If FRAC-TEST returns False, the algorithm terminates and returns the remaining arms. The guarantee of ELIMINATION is summarized in the following lemma. The proof again follows from fairly straightforward applications of the Chernoff bound and the union bound, and can be found in the appendix.

Lemma 6.5. Suppose δ < 0.1. Let S′ = ELIMINATION(S, c_l, c_r, δ). Let A_1 be the best arm among S, with mean µ_[A_1] ≥ c_r. Then with probability at least 1 − δ, the following statements hold:
1. A_1 ∈ S′ (the best arm survives);
2. |S′^{≤c_l}| < 0.1|S′| (only a small fraction of arms have means less than c_l);
3. The number of samples is O(|S| ln δ^{-1} ∆^{-2}), in which ∆ = c_r − c_l.
Note that with probability at most δ, there is no guarantee for any of the above statements.
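A Python sketch of ELIMINATION following Algorithm 4, built on the frac_test and unif_sampl sketches above.

def elimination(arms, c_l, c_r, delta, pull):
    """Repeatedly test for small arms and remove them, until only a small
    fraction of arms below the threshold remains (cf. Algorithm 4)."""
    S = list(arms)
    c_m = (c_l + c_r) / 2
    r = 1
    while True:
        delta_r = delta / (10 * 2 ** r)
        if frac_test(S, c_l, c_m, delta_r, 0.075, 0.025, pull):
            emp = unif_sampl(S, (c_r - c_m) / 2, delta_r, pull)
            S = [a for a in S if emp[a] > (c_m + c_r) / 2]
        else:
            return S
        r += 1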

6.2 Our Algorithm

Now, everything is ready for us to describe our algorithm for Best-1-Arm. We provide a high-level description here; all detailed parameters can be found in Algorithm 5. The algorithm runs in rounds and maintains the current set S_r of arms. Initially, S_1 is the set of all arms S. In round r, the algorithm tries to eliminate a set of suboptimal arms, while making sure the best arm is not eliminated. First, it applies the MED-ELIM procedure to find an ǫ_r/4-optimal arm, where ǫ_r = 2^{-r}. Suppose it is a_r. Then, we take a number of samples from a_r to estimate its mean (denote the empirical mean by µ̂_[a_r]). Unlike previous algorithms [11, 22], which eliminate either a fixed fraction of arms or the arms with means much less than that of a_r, we use FRAC-TEST to check whether there are many arms with means much less than that of a_r. If it returns True, we apply the ELIMINATION procedure to eliminate those arms (for the purpose of analysis, we need to use MED-ELIM again, but with a tighter confidence level, to find an ǫ_r/4-optimal arm b_r). If it returns False, the algorithm decides that it is not judicious to do elimination in this round (since we would need to spend many samples but would only discard very few arms, which is wasteful), simply sets S_{r+1} to be S_r, and proceeds to the next round.

Algorithm 5: DISTRIBUTION-BASED-ELIMINATION(S, δ)
Data: Arm set S, confidence level δ.
Result: The best arm.
1: h ← 1
2: S_1 ← S
3: for r = 1 to +∞ do
4:   if |S_r| = 1 then
5:     Return the only arm in S_r
6:   ǫ_r ← 2^{-r}
7:   δ_r ← δ/(50r^2)
8:   a_r ← MED-ELIM(S_r, ǫ_r/4, 0.01)
9:   µ̂_[a_r] ← UNIF-SAMPL({a_r}, ǫ_r/4, δ_r)
10:  if FRAC-TEST(S_r, µ̂_[a_r] − 1.5ǫ_r, µ̂_[a_r] − 1.25ǫ_r, δ_r, 0.4, 0.1) then
11:    δ_h ← δ/(50h^2)
12:    b_r ← MED-ELIM(S_r, ǫ_r/4, δ_h)
13:    µ̂_[b_r] ← UNIF-SAMPL({b_r}, ǫ_r/4, δ_h)
14:    S_{r+1} ← ELIMINATION(S_r, µ̂_[b_r] − 0.5ǫ_r, µ̂_[b_r] − 0.25ǫ_r, δ_h)
15:    h ← h + 1
16:  else
17:    S_{r+1} ← S_r
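Putting the building blocks together, the following is a high-level Python sketch of Algorithm 5. It follows the pseudocode above line by line, with pull the assumed sampling oracle and the helper sketches from Section 6.1.

def distribution_based_elimination(arms, delta, pull):
    """Sketch of DISTRIBUTION-BASED-ELIMINATION (Algorithm 5)."""
    S = list(arms)
    h, r = 1, 1
    while True:
        if len(S) == 1:
            return S[0]
        eps_r = 2.0 ** -r
        delta_r = delta / (50 * r ** 2)
        a_r = med_elim(S, eps_r / 4, 0.01, pull)
        mu_a = unif_sampl([a_r], eps_r / 4, delta_r, pull)[a_r]
        if frac_test(S, mu_a - 1.5 * eps_r, mu_a - 1.25 * eps_r,
                     delta_r, 0.4, 0.1, pull):
            delta_h = delta / (50 * h ** 2)
            b_r = med_elim(S, eps_r / 4, delta_h, pull)
            mu_b = unif_sampl([b_r], eps_r / 4, delta_h, pull)[b_r]
            S = elimination(S, mu_b - 0.5 * eps_r, mu_b - 0.25 * eps_r,
                            delta_h, pull)
            h += 1
        # otherwise keep S unchanged and proceed to the next round
        r += 1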

We devote the rest of the section to proving that Algorithm 5 indeed solves the Best-1-Arm problem and achieves the sample complexity stated in Theorem 1.7. To simplify the argument, we first describe the event we will condition on for the rest of the proof, and show that the algorithm indeed finds the best arm under this condition.

Lemma 6.6. Let E_G denote the event that all procedure calls in lines 9, 10, 12, 13, 14 return correctly for all rounds. E_G happens with probability at least 1 − δ. Moreover, conditioning on E_G, the algorithm outputs the correct answer.

Proof. By Lemma 6.1, Lemma 6.3 and Lemma 6.5, we can simply bound the total error probability with a union bound over all procedure calls in all rounds:
$$\sum_{r=1}^{+\infty} 2\delta_r + \sum_{h=1}^{+\infty} 3\delta_h \le \delta \cdot 5\sum_{i=1}^{+\infty} \frac{1}{50 i^2} \le \delta.$$

To prove correctness, it suffices to show that the best arm A_1 is never eliminated in line 14. Conditioning on event E_G, for all r, we have µ_[b_r] ≤ µ_[A_1] and |µ̂_[b_r] − µ_[b_r]| < ǫ_r/4, thus µ̂_[b_r] < µ_[A_1] + ǫ_r/4. Clearly, this means µ_[A_1] ≥ µ̂_[b_r] − 0.25ǫ_r. Then by Lemma 6.5, we know that A_1 survives round r. Note that the correctness of MED-ELIM (line 8) is not included in event E_G.
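As a quick numerical sanity check of the union bound in the proof of Lemma 6.6 (the weights 2 and 3 and the confidence values δ/(50 i^2) are those used above):

total = 5 * sum(1.0 / (50 * i * i) for i in range(1, 10 ** 6))
print(total)   # ~ 0.1645, so the total failure probability is at most ~ 0.17 * delta <= delta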

6.3 Analysis of the running time

Now, let A denote Algorithm 5, and let
$$T(\delta, I) = \sum_{i=2}^{n} \Delta_{[i]}^{-2}\big(\ln\delta^{-1} + \ln\ln\min(n, \Delta_{[i]}^{-1})\big) + \Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}$$
be the target upper bound we want to prove. We need to prove that E_{A_δ,I}[τ | E_G] = O(T(δ, I)) for δ < 0.1. In the rest of the proof, we condition on the event E_G unless stated otherwise. Let A_1 denote the best arm.


We need some notation. First, for all s ∈ N, define the sets B_s, V_s and U_s as:
$$B_s = \{a \mid 2^{-s} \le \Delta_{[a]} < 2^{-s+1}\}, \qquad V_s = \bigcup_{r=s}^{+\infty} B_r, \qquad U_s = \bigcup_{r=1}^{s} B_r.$$
Let ǫ_s = 2^{-s}. It is also convenient to use the equivalent definitions for V_s and U_s:
$$V_s = \{a \mid \mu_{[A_1]} - 2\epsilon_s < \mu_{[a]} < \mu_{[A_1]}\}, \qquad U_s = \{a \mid \mu_{[a]} \le \mu_{[A_1]} - \epsilon_s\}.$$

Let max_s be the maximum s such that B_s is not empty. We start with a lemma concerning the ranges of µ̂_[a_r] and µ̂_[b_r].

Lemma 6.7. Conditioning on E_G, the following statements hold:
1. For any round r, µ̂_[a_r] < µ_[A_1] + ǫ_r/4. In addition, if in that round MED-ELIM (line 8) returns an ǫ_r/4-approximation correctly, then we also have µ̂_[a_r] > µ_[A_1] − 2ǫ_r/4.
2. For any round r, if b_r exists (i.e., the algorithm enters line 12), then µ̂_[b_r] < µ_[A_1] + ǫ_r/4 and µ̂_[b_r] > µ_[A_1] − 2ǫ_r/4.

Proof. Since we condition on E_G, we have |µ̂_[a_r] − µ_[a_r]| < ǫ_r/4, and clearly µ_[a_r] ≤ µ_[A_1]. Hence, µ̂_[a_r] < µ_[A_1] + ǫ_r/4. If in addition a_r is an ǫ_r/4-approximation of A_1, we have µ_[a_r] ≥ µ_[A_1] − ǫ_r/4, and then µ̂_[a_r] > µ_[A_1] − 2ǫ_r/4. Note that conditioning on E_G, b_r is always an ǫ_r/4-approximation of A_1, hence the second claim follows in exactly the same way.

Then we give an upper bound on h (updated in line 15). Indeed, in the following analysis we only need an upper bound on h during the first max_s rounds. We introduce the definition first.

Definition 6.8. Given an instance I, conditioning on E_G, we denote the maximum value of h during the first max_s rounds by H(I). It is easy to see that H(I) ≤ max_s.

Lemma 6.9. H(I) = O(ln n) for any instance I.

Proof. Suppose we enter line 12 at round r. By Lemma 6.7, we have µ̂_[a_r] < µ_[A_1] + ǫ_r/4 and µ̂_[b_r] > µ_[A_1] − 2ǫ_r/4. Hence µ̂_[b_r] > µ̂_[a_r] − 0.75ǫ_r. By Lemma 6.3, at least a 0.3 fraction of arms in S_r have mean ≤ µ̂_[a_r] − 1.25ǫ_r. But by Lemma 6.5, after line 14, at most a 0.1 fraction of arms in S_{r+1} have mean ≤ µ̂_[b_r] − 0.5ǫ_r, and note that µ̂_[b_r] − 0.5ǫ_r > µ̂_[a_r] − 1.25ǫ_r. That means |S_r| drops by at least a constant fraction whenever we enter line 14. Therefore, h can increase at most O(ln n) times.

Remark 6.10. Conditioning on E_G, for every round r ≤ max_s, we have h ≤ r in line 12. Thus, h ≤ min(H(I), r), and ln δ_h^{-1} = O(ln δ^{-1} + ln[min(H(I), r)]).

Now, we describe some behaviors of MED-ELIM and ELIMINATION before we analyze the running time.

Lemma 6.11. If FRAC-TEST (line 10) outputs True, then more than a 0.3 fraction of arms in S_r have mean ≤ µ_[A_1] − ǫ_r; in other words, |U_r ∩ S_r| > 0.3|S_r|. Moreover, we have that |U_r ∩ S_{r+1}| ≤ 0.1|S_{r+1}| (i.e., we can eliminate a significant portion in this round).

Proof. By Lemma 6.7, µ̂_[a_r] < µ_[A_1] + ǫ_r/4 and µ̂_[b_r] > µ_[A_1] − 2ǫ_r/4. Now consider the parameters for FRAC-TEST. Let c_r = µ̂_[a_r] − 1.25ǫ_r; then c_r < µ_[A_1] − ǫ_r. By Lemma 6.3, when FRAC-TEST outputs True, more than a (0.4 − 0.1) = 0.3 fraction of arms in S_r have mean ≤ c_r ≤ µ_[A_1] − ǫ_r. Clearly, µ_[a] ≤ µ_[A_1] − ǫ_r is equivalent to a ∈ U_r for an arm a. Now consider the parameters for ELIMINATION. Let c_l = µ̂_[b_r] − 0.5ǫ_r; then c_l > µ_[A_1] − ǫ_r. Again, µ_[a] ≤ µ_[A_1] − ǫ_r is equivalent to a ∈ U_r. Then by Lemma 6.5, after the elimination, we have |U_r ∩ S_{r+1}| ≤ |S_{r+1}^{≤c_l}| ≤ 0.1|S_{r+1}|.

Lemma 6.12. Consider a round r. Suppose MED-ELIM (line 8) returns a correct ǫ_r/4-approximation a_r. Then, the following statements hold:

1. If FRAC-TEST (line 10) outputs True, then more than a 0.3 fraction of arms in S_r have mean ≤ µ_[A_1] − ǫ_r; in other words, |U_r ∩ S_r| > 0.3|S_r|. Moreover, we have that |U_r ∩ S_{r+1}| ≤ 0.1|S_{r+1}|.
2. If it outputs False, then at least a 0.5 fraction of arms in S_r have mean at least µ_[A_1] − 2ǫ_r; in other words, |V_r ∩ S_r| + 1 > 0.5|S_r|.

Proof. Since MED-ELIM (line 8) returns correctly, by Lemma 6.7, µ̂_[a_r] > µ_[A_1] − 2ǫ_r/4 and µ̂_[a_r] < µ_[A_1] + ǫ_r/4. Now consider the parameters for FRAC-TEST: let c_l = µ̂_[a_r] − 1.5ǫ_r and c_r = µ̂_[a_r] − 1.25ǫ_r. It is easy to see that c_l > µ_[A_1] − 2ǫ_r and c_r < µ_[A_1] − ǫ_r. The first claim just follows from Lemma 6.11 (note that Lemma 6.11 does not require the output of MED-ELIM (line 8) to be correct). By Lemma 6.3, if FRAC-TEST outputs False, then at least a (1 − 0.4 − 0.1) = 0.5 fraction of arms in S_r have mean ≥ c_l > µ_[A_1] − 2ǫ_r. And for an arm a, µ_[a] > µ_[A_1] − 2ǫ_r is equivalent to a ∈ V_r or a being the best arm A_1 itself.

We also need the following lemma describing the behavior of the algorithm when r > max_s.

Lemma 6.13. For each round r > max_s, the algorithm terminates if MED-ELIM returns an ǫ_r/4-approximation correctly, which happens with probability at least 0.99.

Proof. If we already have |S_r| = 1 at the beginning of round r, then there is nothing to prove, since the algorithm halts immediately. So we can assume |S_r| > 1. Suppose in round r, MED-ELIM returns a correct ǫ_r/4-approximation. Then conditioning on E_G, FRAC-TEST must output True: if it output False, then by Lemma 6.12, |V_r ∩ S_r| + 1 > 0.5|S_r|; but V_r = ∅ since r > max_s, so 1 > 0.5|S_r|, which means |S_r| = 1, a contradiction. Then, by Lemma 6.12, |U_r ∩ S_{r+1}| ≤ 0.1|S_{r+1}|, which is equivalent to |V_{r+1} ∩ S_{r+1}| + 1 ≥ 0.9|S_{r+1}|. Note that |V_{r+1} ∩ S_{r+1}| = 0 as V_{r+1} = ∅. So 1 ≥ 0.9|S_{r+1}|, or equivalently |S_{r+1}| = 1. Thus the algorithm terminates just after round r. Finally, as MED-ELIM returns a correct ǫ_r/4-approximation with probability at least 0.99, our proof is complete.

We analyze the expected number of samples used by each subprocedure separately. We first consider FRAC-TEST (line 10) and UNIF-SAMPL (lines 9, 13) and prove the following lemma, since it is simpler.

Lemma 6.14. Conditioning on event E_G, the expected number of samples incurred by FRAC-TEST (line 10) and UNIF-SAMPL (lines 9, 13) is
$$O\big(\Delta_{[2]}^{-2}(\ln\delta^{-1} + \ln\ln\Delta_{[2]}^{-1})\big).$$

Proof. By Lemma 6.13, for any round r > max_s, the algorithm halts w.p. at least 0.99. So we can bound the expected number of samples incurred by FRAC-TEST and UNIF-SAMPL by
$$\sum_{r=1}^{\max_s} c_4 \cdot \ln\delta_r^{-1}\,\epsilon_r^{-2} + \sum_{r=\max_s+1}^{+\infty} c_4 \cdot (0.01)^{r-\max_s-1}\,\ln\delta_r^{-1}\,\epsilon_r^{-2},$$
in which c_4 is a big enough constant such that in round r the number of samples taken by FRAC-TEST and UNIF-SAMPL together is bounded by c_4 · ln δ_r^{-1} ǫ_r^{-2}. It is not hard to see that the first sum is dominated by its last term while the second sum is dominated by its first term. So the bound is O(∆_{[2]}^{-2}(ln δ^{-1} + ln ln ∆_{[2]}^{-1})), since max_s is Θ(ln ∆_{[2]}^{-1}).

Next, we analyze MED-ELIM (lines 8, 12) and ELIMINATION (line 14). In the following we only consider samples due to these two procedures.

Lemma 6.15. Conditioning on event E_G, the expected number of samples incurred by MED-ELIM (lines 8, 12) and ELIMINATION (line 14) is
$$O\Big(\sum_{i=2}^{n} \Delta_{[i]}^{-2}\big(\ln\delta^{-1} + \ln[\min(H(I), \ln\Delta_{[i]}^{-1})]\big)\Big).$$


We devote the rest of the section to the proof of Lemma 6.15, which is significantly more involved. We first need a lemma which provides an upper bound on the number of samples for one round.

Lemma 6.16. Let c_3 be a sufficiently large constant. The number of samples in round r ≤ max_s can be bounded by
$$\begin{cases} c_3 \cdot |S_r|\epsilon_r^{-2} & \text{if FRAC-TEST outputs False,}\\ c_3 \cdot |S_r|\epsilon_r^{-2}(\ln\delta^{-1} + \ln[\min(H(I), r)]) & \text{if FRAC-TEST outputs True.}\end{cases}$$
And if r > max_s, the number of samples can be bounded by
$$c_3 \cdot |S_r|\epsilon_r^{-2}\big(\ln\delta^{-1} + \ln(H(I) + r - \max_s)\big).$$

Proof. Note that conditioning on event E_G, ELIMINATION always returns correctly. So we let c_3 be a constant such that MED-ELIM(S_r, ǫ_r/4, 0.01) takes no more than c_3/3 · |S_r|ǫ_r^{-2} samples, and MED-ELIM(S_r, ǫ_r/4, δ_h) and ELIMINATION(S_r, µ̂_[a_r] − 1.5ǫ_r, µ̂_[a_r] − 1.25ǫ_r, δ_h) both take no more than c_3/3 · |S_r|ǫ_r^{-2}(ln δ^{-1} + ln[min(H(I), r)]) samples conditioning on E_G. The latter is due to the fact that ln δ_h^{-1} = O(ln δ^{-1} + ln[min(H(I), r)]) for r ≤ max_s. And if r > max_s, we have h ≤ H(I) + r − max_s, and the bounds follow from a simple calculation.

Proof of Lemma 6.15. We prove the lemma inductively. Let T(r, M) denote the maximum expected total number of samples the algorithm takes at and after round r, given that |S_r ∩ U_{r−1}| ≤ M. In other words, it is an upper bound on the expected number of samples we will take further, provided that we are at the beginning of round r and there are at most M "small" arms left. By definition, T(1, 0) is the final upper bound on the total expected number of samples taken by the algorithm. Let c_1 = 4c_3 and c_2 = 60c_3.^9 For ease of notation, we let l(x) = ln[min(H(I), x)]. We first consider the case where r = max_s + 1 and prove the following bound on T(r, M):
$$T(r, M) \le \big(\ln\delta^{-1} + \ln H(I)\big)\, c_1 \cdot M \cdot \epsilon_r^{-2}. \qquad (3)$$

Clearly there is nothing to prove for the base case M = 0, so we consider M ≥ 1. Now, suppose the first round after r in which MED-ELIM (line 8) correctly returns an ǫ_r/4-approximation is r′ ≥ r. Clearly, this happens with probability at most 0.01^{r′−r} (all rounds in between fail). By Lemma 6.13, the algorithm terminates after round r′. Moreover, we have |V_r ∩ S_r| = 0 since V_r = ∅, so S_r consists of the single best arm A_1 and M arms in U_{r−1}. By Lemma 6.16, the number of samples is bounded by
$$\sum_{i=r}^{r'} c_3\big(\ln\delta^{-1} + \ln[H(I) + i - \max_s]\big)(1 + M)\epsilon_i^{-2}.$$
Hence, we can bound T(r, M) by:
$$\begin{aligned} T(r, M) &\le \sum_{r'=r}^{+\infty} (0.01)^{r'-r} \cdot \sum_{i=r}^{r'} c_3\big(\ln\delta^{-1} + \ln[H(I) + i - \max_s]\big)(1 + M)\epsilon_i^{-2}\\ &\le \sum_{r'=r}^{+\infty} (0.01)^{r'-r} \cdot 2c_3 M \sum_{i=r}^{r'} \big(\ln\delta^{-1} + \ln[H(I) + i - \max_s]\big)\epsilon_i^{-2}\\ &\le \sum_{r'=r}^{+\infty} (0.01)^{r'-r} \cdot 3c_3 M \big(\ln\delta^{-1} + \ln[H(I) + r' - \max_s]\big)\epsilon_{r'}^{-2}\\ &\le 4c_3 M \big(\ln\delta^{-1} + \ln H(I)\big)\epsilon_r^{-2}. \end{aligned}$$

Now, we analyze the more challenging case where r ≤ max_s. For ease of notation, we let
$$C_{r,s} = (\ln\delta^{-1} + l(s))\sum_{i=r}^{s}\epsilon_i^{-2}.$$

^9 Although these constants are chosen somewhat arbitrarily, they need to satisfy certain relations (which will be clear from the proof), and it is necessary to make them explicit.


When r > s, we let C_{r,s} = 0. We also define
$$P(r) = c_2 \cdot \Big(\sum_{s=r}^{+\infty} C_{r,s}|B_s| + C_{r,\max_s}\Big).$$
P(r) can be viewed as a potential function. Notice that when r > max_s, we have P(r) = 0. We show inductively that
$$T(r, M) \le (\ln\delta^{-1} + l(r)) \cdot c_1 \cdot M \cdot \epsilon_r^{-2} + P(r). \qquad (4)$$
We note that (3) is in fact consistent with (4) (when r > max_s, we have P(r) = 0 and l(r) = ln H(I)). Before proving (4), we need an important inequality for P(r). If r ≤ max_s, we have:
$$P(r) - P(r+1) = c_2 \cdot \Big(\sum_{s=r}^{+\infty}(\ln\delta^{-1} + l(s)) \cdot \epsilon_r^{-2}|B_s| + (\ln\delta^{-1} + l(\max_s)) \cdot \epsilon_r^{-2}\Big) \ge c_2 \cdot \epsilon_r^{-2}(\ln\delta^{-1} + l(r))(|V_r| + 1). \qquad (5)$$

The induction hypothesis assumes that the inequality holds for r + 1 and all M. We need to prove that it also holds for r and all M. Now we are at round r. Conditioning on event E_G, there are three cases we need to consider.

Case 1: MED-ELIM (line 8) returns an ǫ_r/4-approximation of the best arm A_1, and FRAC-TEST outputs True.

Let D = B_r ∩ S_r and E = V_{r+1} ∩ S_r. Then we have |S_r| = |D| + |E| + M + 1. By Lemma 6.12, we have |U_r ∩ S_{r+1}| ≤ 0.1|S_{r+1}|, which means |V_{r+1} ∩ S_{r+1}| + 1 ≥ 0.9|S_{r+1}|. So, we have that
$$|U_r \cap S_{r+1}| \le \frac{1}{9}\big(|V_{r+1} \cap S_{r+1}| + 1\big) \le \frac{1}{9}(|E| + 1).$$
Therefore, we can see that the number of samples is bounded by
$$c_3 \cdot |S_r|\epsilon_r^{-2}(\ln\delta^{-1} + l(r)) + T\big(r+1, \tfrac{1}{9}(|E| + 1)\big),$$
where the first additive term is the number of samples in this round, and is bounded by Lemma 6.16. By the induction hypothesis, we have:
$$\begin{aligned} T\big(r+1, \tfrac{1}{9}(|E|+1)\big) &\le (\ln\delta^{-1} + l(r+1)) \cdot c_1 \cdot \tfrac{1}{9}(|E|+1) \cdot \epsilon_{r+1}^{-2} + P(r+1)\\ &\le (\ln\delta^{-1} + l(r+1)) \cdot c_1 \cdot \tfrac{1}{9}(|E|+1) \cdot \epsilon_{r+1}^{-2} + P(r) - c_2 \cdot (\ln\delta^{-1} + l(r))(|V_r|+1)\epsilon_r^{-2} && \text{(by (5))}\\ &\le (\ln\delta^{-1} + l(r)) \cdot \big(\tfrac{5}{9}c_1|E| + \tfrac{5}{9}c_1 - c_2|D| - c_2|E| - c_2\big)\epsilon_r^{-2} + P(r) && (|E| + |D| \le |V_r|) \end{aligned}$$
Therefore, we have the following bound:
$$\begin{aligned} T(r, M) &\le \epsilon_r^{-2}(\ln\delta^{-1} + l(r))\big(c_3 \cdot |S_r| + \tfrac{5}{9}c_1|E| + \tfrac{5}{9}c_1 - c_2|D| - c_2|E| - c_2\big) + P(r)\\ &\le \epsilon_r^{-2}(\ln\delta^{-1} + l(r))\big((c_3 - c_2)|D| + (c_3 + \tfrac{5}{9}c_1 - c_2)|E| + c_3 M + \tfrac{5}{9}c_1 + c_3 - c_2\big) + P(r)\\ &\le \epsilon_r^{-2}(\ln\delta^{-1} + l(r))\,c_3 M + P(r) && (\tfrac{5}{9}c_1 + c_3 - c_2 < 0,\ c_3 + \tfrac{5}{9}c_1 - c_2 < 0) \end{aligned}$$
In the second inequality, we use the fact that |S_r| = |D| + |E| + M + 1.

Case 2: MED-ELIM (line 8) returns an ǫ_r/4-approximation of the best arm A_1, and FRAC-TEST outputs False.

By Lemma 6.12, we can see that |V_r ∩ S_r| + 1 = |D| + |E| + 1 > 0.5|S_r|. So M < 0.5|S_r|, thus
$$|D| + |E| + 1 \ge M. \qquad (6)$$

We can see that the total number of samples is bounded by
$$c_3 \cdot |S_r|\epsilon_r^{-2} + T(r+1, M+|D|) \le (\ln\delta^{-1} + l(r))\,c_3 \cdot |S_r|\epsilon_r^{-2} + T(r+1, M+|D|).$$
In the above, the first term is due to the number of samples in this round, by Lemma 6.16. The second term is an upper bound on the number of samples starting at round r + 1. Since we do not eliminate any arm in this case, we have |U_r ∩ S_{r+1}| ≤ M + |D|. From the induction hypothesis, we have:
$$\begin{aligned} T(r+1, M+|D|) &\le (\ln\delta^{-1} + l(r+1)) \cdot c_1 \cdot (M+|D|) \cdot \epsilon_{r+1}^{-2} + P(r+1)\\ &\le (\ln\delta^{-1} + l(r)) \cdot c_1 \cdot 5(M+|D|) \cdot \epsilon_r^{-2} + P(r) - c_2 \cdot (\ln\delta^{-1} + l(r))(|V_r|+1)\epsilon_r^{-2} && \text{(by (5))}\\ &\le (\ln\delta^{-1} + l(r))\big(5c_1 M + 5c_1|D| - c_2|D| - c_2|E| - c_2\big)\epsilon_r^{-2} + P(r) && (|E| + |D| \le |V_r|) \end{aligned} \qquad (7)$$

Plugging this into the bound, we have:
$$\begin{aligned} T(r, M) &\le (\ln\delta^{-1} + l(r))c_3 \cdot |S_r|\epsilon_r^{-2} + (\ln\delta^{-1} + l(r))\big(5c_1 M + 5c_1|D| - c_2|D| - c_2|E| - c_2\big) \cdot \epsilon_r^{-2} + P(r)\\ &\le (\ln\delta^{-1} + l(r))\big(c_3|D| + c_3|E| + c_3 M + c_3 + 5c_1 M + 5c_1|D| - c_2|D| - c_2|E| - c_2\big) \cdot \epsilon_r^{-2} + P(r)\\ &\le (\ln\delta^{-1} + l(r))\big(c_3|E| + c_3 + (5c_1 + c_3)|D| - (c_2 - 5c_1 - c_3)(|D| + |E| + 1)\big) \cdot \epsilon_r^{-2} + P(r)\\ &\le P(r) && (c_2 - 5c_1 - c_3 > 5c_1 + c_3 > c_3) \end{aligned}$$
In the second inequality, we use the fact that $(5c_1 + c_3)M - (5c_1 + c_3)(|D| + |E| + 1) \le 0$ due to (6). In both Case 1 and Case 2, we have the following bound:
$$T(r, M) \le (\ln\delta^{-1} + l(r)) \cdot c_3 \cdot M \cdot \epsilon_r^{-2} + P(r). \qquad (8)$$

Case 3: MED-ELIM (line 8) returns an arm which is not an ǫ_r/4-approximation of the best arm A_1.

In this case, we can simply bound the number of samples by c_3 · |S_r|ǫ_r^{-2}(ln δ^{-1} + l(r)) + T(r+1, M+|D|). The first term is still due to the number of samples in this round, by Lemma 6.16. The second term is an upper bound on the samples taken starting at round r + 1, since |U_r ∩ S_{r+1}| ≤ M + |D| in any case. Then, we have that
$$\begin{aligned} c_3 \cdot |S_r|\epsilon_r^{-2}(\ln\delta^{-1} + l(r)) + T(r+1, M+|D|) &\le (\ln\delta^{-1} + l(r))\big(c_3(M + |D| + |E| + 1)\epsilon_r^{-2} + 5c_1(M+|D|)\epsilon_r^{-2} - c_2\epsilon_r^{-2}(|E| + |D| + 1)\big) + P(r) && \text{(by (7))}\\ &\le (\ln\delta^{-1} + l(r))\big((c_3 + 5c_1 - c_2)|D| + (c_3 - c_2)|E| + (c_3 + 5c_1)M + c_3 - c_2\big)\epsilon_r^{-2} + P(r) \end{aligned}$$
Since c_3 + 5c_1 − c_2 ≤ 0 and c_3 − c_2 ≤ 0, we have the following bound:
$$T(r, M) \le (\ln\delta^{-1} + l(r))(c_3 + 5c_1) \cdot M\epsilon_r^{-2} + P(r). \qquad (9)$$

Note that (9) is larger than what we need to prove (in particular, the constant is larger). So, we combine the three cases together as follows. Recall that MED-ELIM (line 8) returns a correct ǫ_r/4-approximation with probability p (p ≥ 0.99). By (8), (9) and c_3 + (1 − p) · 5c_1 ≤ c_3 + 0.05c_1 ≤ c_1, we have that
$$T(r, M) \le (\ln\delta^{-1} + l(r))\big(p \cdot (c_3 \cdot M \cdot \epsilon_r^{-2}) + (1-p) \cdot (c_3 + 5c_1) \cdot M\epsilon_r^{-2}\big) + P(r) \le (\ln\delta^{-1} + l(r))\big((c_3 + 0.05c_1) \cdot M \cdot \epsilon_r^{-2}\big) + P(r) \le (\ln\delta^{-1} + l(r))\,c_1 \cdot M \cdot \epsilon_r^{-2} + P(r).$$

Then, the number of samples is bounded by T(1, 0) ≤ P(1). Note that C_{1,s} ≤ 2(ln δ^{-1} + l(s))ǫ_s^{-2} for any s. By a simple calculation of P(1), we can see that the overall sample complexity for MED-ELIM and ELIMINATION is
$$T(1, 0) \le P(1) = O\Big(\sum_{i=2}^{n} \Delta_{[i]}^{-2}\big(\ln\delta^{-1} + \ln[\min(H(I), \ln\Delta_{[i]}^{-1})]\big)\Big).$$

This finally completes the proof of Lemma 6.15.

Putting Lemma 6.15 and Lemma 6.14 together, we have the following corollary.

Corollary 6.17. Algorithm 5 is a weakly expected-T-time δ-correct algorithm for Best-1-Arm, where
$$T = O\Big(\sum_{i=2}^{n} \Delta_{[i]}^{-2}\big(\ln\delta^{-1} + \ln[\min(H(I), \ln\Delta_{[i]}^{-1})]\big) + \Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}\Big).$$
We can apply Theorem 2.8 to transform Algorithm 5 into an expected-T-time δ-correct algorithm.

Combining the above corollary and Lemma 6.9, which asserts H(I) = O(ln n), we complete the proof for Theorem 1.7.

6.4 Nearly Instance Optimal Bounds for Clustered Instances

Recall that B_s = {a | 2^{-s} ≤ ∆_[a] < 2^{-s+1}}.

Definition 6.18. We say an instance I is clustered if |{i | B_i ≠ ∅, 1 ≤ i ≤ max_s}| is bounded by a constant.

In this case, we can obtain a nearly instance optimal algorithm for such instances. For this purpose, we only need to establish a tighter bound on H(I).

Lemma 6.19. H(I) ≤ 2 · |{i | B_i ≠ ∅, 1 ≤ i ≤ max_s}|.

Proof. Let s be an index such that B_s is not empty. Let s′ be the largest index < s such that B_{s′} is not empty; if no such index exists, let s′ = 0. We show that during rounds s′ + 1, s′ + 2, . . . , s − 1, we can call ELIMINATION (or equivalently, increase h) at most once. Suppose otherwise that we call ELIMINATION in rounds r and r′ with s′ < r < r′ < s, and further assume that there is no other call to ELIMINATION between rounds r and r′. Then clearly S_{r′} = S_{r+1}. Now by Lemma 6.11, we have |U_r ∩ S_{r+1}| ≤ 0.1|S_{r+1}|, but this means |U_{r′} ∩ S_{r′}| ≤ 0.1|S_{r′}|, as U_{r′} = U_r and S_{r′} = S_{r+1}. Again by Lemma 6.11, this contradicts the fact that in round r′ FRAC-TEST outputs True. As B_{max_s} is not empty, we can partition the rounds 1, 2, . . . , max_s into at most 2 · |{i | B_i ≠ ∅, 1 ≤ i ≤ max_s}| groups: each group is either a single round corresponding to a non-empty set B_s, or the rounds between two rounds corresponding to two adjacent non-empty sets B_{s′} and B_s. Therefore, h can increase by at most 1 in each group, which concludes the proof.

The following theorem is an immediate consequence of the above lemma and Corollary 6.17.

Theorem 6.20. There is an expected-T-time δ-correct algorithm for clustered instances, where
$$T(\delta, I) = O\Big(\sum_{i=2}^{n} \Delta_{[i]}^{-2}\ln\delta^{-1} + \Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}\Big).$$

Example 6.21. For example, consider a very simple yet important instance where there are n − 1 arms with mean 0.5 and a single arm with mean 0.5 + ∆. In fact, a careful examination of all previous algorithms (including [18, 22]) shows that they all require $\Omega\big(n\Delta^{-2}(\ln\delta^{-1} + \ln\ln\Delta^{-1})\big)$ samples even on this particular instance. However, our algorithm only requires $O\big(n\Delta^{-2}\ln\delta^{-1} + \Delta^{-2}\ln\ln\Delta^{-1}\big)$ samples. Our bound is nearly instance optimal, since the first term matches the instance-wise lower bound n∆^{-2} ln δ^{-1}. This is perhaps the best bound for such instances we can hope for.
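For a rough sense of the gap in Example 6.21, one can compare the two bounds numerically, ignoring constant factors; the concrete values of n, ∆ and δ below are arbitrary choices for illustration.

import math

n, Delta, delta = 10 ** 6, 1e-3, 0.01
previous = n * Delta ** -2 * (math.log(1 / delta) + math.log(math.log(1 / Delta)))
ours = n * Delta ** -2 * math.log(1 / delta) + Delta ** -2 * math.log(math.log(1 / Delta))
print(previous / ours)   # ~ 1.4 here; the ratio grows as delta gets closer to a constant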


6.5 PAC Algorithm

Finally, we discuss how to convert our algorithm into an (ǫ, δ)-PAC algorithm for Best-1-Arm.

Theorem 6.22. For any ǫ < 0.5 and δ < 0.1, there exists an expected-T-time (ǫ, δ)-PAC algorithm for Best-1-Arm, where
$$T(\delta, I) = O\Big(\sum_{i=2}^{n} \Delta_{i,\epsilon}^{-2}\big(\ln\delta^{-1} + \ln\ln\min(n, \Delta_{i,\epsilon}^{-1})\big) + \Delta_{2,\epsilon}^{-2}\ln\ln\Delta_{2,\epsilon}^{-1}\Big),$$
and ∆_{i,ǫ} = max(∆_{[i]}, ǫ).

Proof. Given parameters ǫ, δ, we run DISTRIBUTION-BASED-ELIMINATION with confidence δ/2 only for the first ⌈ln ǫ^{-1}⌉ rounds (if it terminates before that, we just return its output). After that, we invoke MED-ELIM with confidence δ/2 to find an ǫ-optimal arm among S_{⌈ln ǫ^{-1}⌉}. Clearly we are correct with probability at least 1 − δ. The analysis of the sample complexity is exactly the same as for the original DISTRIBUTION-BASED-ELIMINATION. We can also apply Theorem 2.8 to get an expected-T-time δ-correct algorithm.
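A sketch of this conversion in Python; elimination_rounds is a hypothetical variant of the Algorithm 5 sketch above that stops after a prescribed number of rounds and returns the surviving set, exactly as described in the proof.

import math

def pac_best_arm(arms, eps, delta, pull):
    """(eps, delta)-PAC conversion sketch: run the main elimination loop for
    ceil(ln(1/eps)) rounds, then hand the survivors to MED-ELIM."""
    R = math.ceil(math.log(1 / eps))
    S = elimination_rounds(arms, delta / 2, R, pull)   # hypothetical variant of Algorithm 5
    if len(S) == 1:
        return S[0]
    return med_elim(S, eps, delta / 2, pull)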

7 Parallel Simulation

Suppose we have a class of algorithms {A_i} (possibly infinite) for the same problem P. We want to construct a new algorithm B for problem P which simulates {A_i} in a parallel fashion. The details of the construction are specified as follows.

Definition 7.1 (Parallel Simulation). The new algorithm B simulates A_i with rate r_i ∈ N^+, for all i. More specifically, B runs in rounds. In the r-th round, B simulates each algorithm A_i such that r_i divides r (i.e., r_i | r) for one step. If there is more than one such algorithm, B simulates them in increasing order of their indices. If any such A_i requires a sample, B takes a fresh sample and feeds it to A_i. B terminates whenever any A_i terminates, and B outputs what that A_i outputs. We denote this new algorithm by B = SIM({A_i}, {r_i}).

Now we prove the main result of this section.

Proof of Theorem 2.8. The construction of B is very simple: let B_δ = SIM({A_i}, {r_i}), in which A_i = A_{δ/2^i} (that is, algorithm A with confidence parameter δ/2^i) and r_i = 2^i. Now we prove that B is an expected-O(T)-time δ-correct algorithm for problem P. Suppose the given instance is I. Let E_i be a good event for A_i on instance I. First, by a simple union bound, we can see that the probability that B_δ outputs the correct answer is at least $1 - \sum_{i=1}^{+\infty} \delta/2^i = 1 - \delta$ (B_δ returns the correct answer if all A_i return the correct answer).

Now, assume δ < δ′_0 = min(0.1, δ_0). Let us analyze the running time of B_δ on instance I. For ease of argument, we can think of B as executing in a slightly different way: B does not terminate as before, but keeps simulating all algorithms until all of them terminate (or run forever). The output of B is still the same as that of the A_i which terminates first, and the running time of B is determined by that first terminated A_i. Partition this probability space into disjoint events {F_i}, in which F_i is the event that all events E_j with j < i do not happen and E_i happens. Note that {F_i} is indeed a partition of the probability space (Pr[∪_i F_i] = 1). This is simply because $\lim_{i\to+\infty} \Pr[E_i] \ge \lim_{i\to+\infty} (1 - \delta/2^i) = 1$.

Let τ be the running time of B. Since each A_i uses its own independent samples, we have that
$$E_{A_i,I}[\tau \mid F_i] = E_{A_i,I}[\tau \mid E_i] \le T(\delta/2^i, I).$$

Moreover, for A_i to run one step, each A_j with j ≠ i runs at most 2^{i−j} steps. Thus, the running time of B_δ conditioning on F_i is bounded by:
$$E_{B_\delta,I}[\tau \mid F_i] \le T(\delta/2^i, I) \cdot \sum_{j=1}^{+\infty} 2^{i-j} \le T(\delta/2^i, I) \cdot 2^i.$$
Furthermore, by the independence of the different A_i's, we note that
$$\Pr[F_i] \le \prod_{k=1}^{i-1} \delta/2^k \le \delta^{i-1}.$$

Now, we can bound the expected running time of B as follows:
$$\begin{aligned} E_{B_\delta,I}[\tau] &= \sum_{i=1}^{+\infty} \Pr[F_i] \cdot E_{B_\delta,I}[\tau \mid F_i]\\ &\le \sum_{i=1}^{+\infty} \delta^{i-1} \cdot T(\delta/2^i, I) \cdot 2^i\\ &\le 2\sum_{i=1}^{+\infty} (2\delta)^{i-1} \cdot T(\delta, I) \cdot \Big(\frac{\ln\delta^{-1} + i\ln 2}{\ln\delta^{-1}}\Big)\\ &\le 2\sum_{i=1}^{+\infty} (2\delta)^{i-1}(1 + i) \cdot T(\delta, I)\\ &\le 2\sum_{i=1}^{+\infty} (0.2)^{i-1}(1 + i) \cdot T(\delta, I)\\ &\le 6T(\delta, I). \end{aligned}$$
In the second inequality, we use the fact that T is a reasonable time bound. To summarize, B is an expected-O(T)-time δ-correct algorithm. Since a weakly T-time δ-correct algorithm is also a weakly expected-T-time δ-correct algorithm, we immediately obtain the transformation from a weakly T-time δ-correct algorithm to an expected-O(T)-time δ-correct algorithm as well.
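For concreteness, here is a Python sketch of the parallel simulation B = SIM({A_i}, {r_i}) with r_i = 2^i. The generator-based interface for each simulated A_i is an assumption made for illustration; Definition 7.1 only requires that B can advance any A_i by one step and feed it fresh samples.

def sim(make_alg):
    """Parallel simulation with rates r_i = 2^i (Definition 7.1).
    make_alg(i) is assumed to return a generator simulating A_{delta/2^i}
    on the given instance: it yields once per step and `return`s its answer."""
    algs = {}          # lazily created simulated copies A_1, A_2, ...
    r = 0
    while True:
        r += 1
        i = 1
        while 2 ** i <= r:
            if r % (2 ** i) == 0:            # rate r_i = 2^i divides the round index
                gen = algs.setdefault(i, make_alg(i))
                try:
                    next(gen)                # advance A_i by one step
                except StopIteration as done:
                    return done.value        # output what the first finisher outputs
            i += 1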

8 Concluding Remarks

Besides the new lower and upper bounds for Sign-ξ and Best-1-Arm, our work also raises some interesting questions. The first one is the notion of near instance optimality in our study of the Sign-ξ problem. We do not have a rigorous definition for this notion yet. However, as we argued in the introduction, it is not possible to obtain instance optimal algorithms for Sign-ξ, and there are very wide classes of distributions of instances for which we cannot do better than ∆^{-2} ln ln ∆^{-1}; hence we should be content with the ∆^{-2} ln ln ∆^{-1} bound. Moreover, the smoothed analysis framework [27] does not seem to apply to our problem either. It would be interesting to see whether there are other related problems exhibiting similar phenomena, for which we can derive nearly instance optimal bounds. In light of this new notion, our work in fact opens the avenue of seeking the nearly instance optimal bound for Best-1-Arm, which we think is the most interesting open problem arising from this paper. Note that for clustered instances, we already have such an algorithm. We conjecture that the general case is similar to the clustered case, in the sense that the optimal bound is of the form ∆_{[2]}^{-2} ln ln ∆_{[2]}^{-1} + L(I), where L(I) is an instance-wise lower bound. Our techniques may also be helpful for obtaining better bounds for the Best-k-Arm problem, or even the combinatorial bandit selection problem. In an ongoing work, we already have some partial results on applying some of the ideas in this paper to obtain improved upper and lower bounds for Best-k-Arm. Our current algorithm for Best-1-Arm may not be very efficient in practice due to the large constant hidden in the big-O. However, we believe that some of our ideas (in particular, elimination based on the distribution of the gaps) may be useful in developing practical heuristics. Simplifying our algorithm and designing more practical versions of it are left as interesting future directions.


9 Acknowledgment

We would like to thank Anupam Gupta for several interesting discussions at the beginning of this work, and for helpful comments on an earlier version of the paper. In fact, the question whether $\sum_i \Delta_{[i]}^{-2}\ln\ln\Delta_{[i]}^{-1}$ is the instance-wise lower bound for Best-1-Arm was raised during a discussion with him.

References [1] P. Afshani, J. Barbay, and T. M. Chan. Instance-optimal geometric algorithms. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 129–138. IEEE, 2009. [2] R. E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathematical Statistics, pages 16–39, 1954. [3] R. E. Bechhofer. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics, 14(3):408–429, 1958. [4] R. E. Bechhofer, C. W. Dunnett, and M. Sobel. A tow-sample multiple decision procedure for ranking means of normal populations with a common unknown variance. Biometrika, 41(1-2):170–176, 1954. [5] R. E. Bechhofer, J. Kiefer, and M. Sobel. Sequential identification and ranking procedures: with special reference to Koopman-Darmois populations, volume 3. University of Chicago Press, 1968. [6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012. [7] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. arXiv preprint arXiv:1205.3181, 2012. [8] W. Cao, J. Li, Y. Tao, and Z. Li. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Neural Information Processing Systems, 2015. [9] S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014. [10] H. Chernoff. Sequential Analysis and Optimal Design. Society for Industrial and Applied Mathematics (SIAM), 1972. [11] E. Even-Dar, S. Mannor, and Y. Mansour. Pac bounds for multi-armed bandit and markov decision processes. In Computational Learning Theory, pages 255–270. Springer, 2002. [12] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multiarmed bandit and reinforcement learning problems. The Journal of Machine Learning Research, 7:1079– 1105, 2006. [13] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences, 66(4):614–656, 2003. [14] R. Farrell. Asymptotic behavior of expected sample size in certain one sided tests. The Annals of Mathematical Statistics, pages 36–72, 1964. [15] V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012. [16] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems, pages 2222–2230, 2011.


[17] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. On Finding the Largest Mean Among Many. ArXiv preprint arXiv:1306.3917, June 2013. [18] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. COLT, 2014. [19] K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014. [20] S. Kalyanakrishnan and P. Stone. Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 511–518, 2010. [21] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 655–662, 2012. [22] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1238–1246, 2013. [23] E. Kaufmann, O. Capp´e, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. arXiv preprint arXiv:1407.4443, 2014. [24] E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251, 2013. [25] S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. The Journal of Machine Learning Research, 5:623–648, 2004. [26] E. Paulson. A sequential procedure for selecting the population with the largest mean from k normal populations. The Annals of Mathematical Statistics, pages 174–180, 1964. [27] D. A. Spielman and S.-H. Teng. Smoothed analysis: an attempt to explain the behavior of algorithms in practice. Communications of the ACM, 52(10):76–84, 2009. [28] Y. Zhou, X. Chen, and J. Li. Optimal pac multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 217–225, 2014.

A Missing Proofs in Section 3

Theorem 3.2 (restated). Fix a confidence level δ > 0 and an arbitrary reference sequence S = {Λ_i}_{i=1}^{∞}. Suppose that the given instance has gap ∆. Let κ be the smallest i such that Λ_i ≤ ∆. With probability at least 1 − δ, TEST-SIGN determines whether µ > ξ or µ < ξ correctly and uses O((ln δ^{-1} + ln κ)Λ_κ^{-2}) samples in total.

Proof of Theorem 3.2. For any round r, by Hoeffding's inequality (Lemma 2.2), we have that
$$\Pr\big(|\hat{\mu}_r - \mu| \ge \epsilon_r\big) \le 2\exp(-\epsilon_r^2\, t_r/2) = \delta_r. \qquad (10)$$
Then, by a union bound, with probability $1 - \sum_{r=1}^{+\infty}\delta_r = 1 - \delta\cdot\sum_{r=1}^{+\infty} 1/(10r^2) \ge 1 - \delta$, we have |µ̂_r − µ| < ǫ_r for all r. Denote this event by E. We now prove that conditioning on E, Algorithm 1 is correct. Let k be the round in which the algorithm returns the answer. By the definition of κ, we know that Λ_κ ≤ ∆. Then in round κ, we have |µ̂_κ − µ| < Λ_κ/2 ≤ ∆/2. Thus, |µ̂_κ − ξ| ≥ |µ − ξ| − |µ̂_κ − µ| > ∆/2 ≥ ǫ_κ. Therefore, we can see that k ≤ κ, which shows that the algorithm terminates on or before round κ. In round k, if we

have µ̂_k > ξ + ǫ_k, we must have µ > ξ since |µ̂_k − µ| < ǫ_k. The case µ̂_k < ξ − ǫ_k is completely symmetric, which proves correctness.

Now, we analyze the number of samples. It is easy to see that the total number of samples is at most
$$\sum_{r=1}^{\kappa} t_r = \sum_{r=1}^{\kappa} 8\ln(2/\delta_r)/\Lambda_r^2.$$
By the definition of the reference sequence, Λ_r ≥ c^{r−κ}Λ_κ for 1 ≤ r ≤ κ. Hence, we have that
$$\sum_{r=1}^{\kappa} 8(\ln\delta^{-1} + \ln 20 + \ln r)/\Lambda_r^2 \le \Lambda_\kappa^{-2}\sum_{r=1}^{\kappa} c^{2(\kappa-r)} \cdot 8(\ln\delta^{-1} + \ln 20 + \ln r) = O\big((\ln\delta^{-1} + \ln\kappa)\Lambda_\kappa^{-2}\big).$$

This finishes the proof of the theorem.
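For illustration, the following is a Python sketch of the doubling-style test analysed above. The exact statement of Algorithm 1 is not reproduced in this appendix, so the round structure below (sample sizes t_r, half-width ǫ_r = Λ_r/2, confidence δ_r = δ/(10 r^2)) is our reading of the proof, and pull() is an assumed sampling oracle returning one reward of the input arm.

import math

def test_sign(xi, delta, ref_seq, pull):
    """Round r: take t_r = 8*ln(2/delta_r)/Lambda_r^2 samples and stop as soon
    as the empirical mean is eps_r = Lambda_r/2 away from xi."""
    for r, lam in enumerate(ref_seq, start=1):
        delta_r = delta / (10 * r ** 2)
        eps_r = lam / 2
        t_r = math.ceil(8 * math.log(2 / delta_r) / lam ** 2)
        mu_hat = sum(pull() for _ in range(t_r)) / t_r
        if mu_hat > xi + eps_r:
            return +1          # decide mu > xi
        if mu_hat < xi - eps_r:
            return -1          # decide mu < xi
    raise RuntimeError("reference sequence exhausted")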

B

κ X



Missing Proofs in Section 4

B.1

Proof for the Theorem 4.3

Theorem 4.3 (restate). Let ρ be a convex granularity function. Any δ-correct algorithm for Sign-ξ needs Ω(F (∆)) samples for almost all instances with respect to {ΓdN (ρ)}+∞ N =1 iff there exists γ > 0 such that ρ(x) = O(xγ ). In other words, the lower bound holds iff ρ(x) grows no faster than some polynomial function. For the continuous case, any δ-correct algorithm for Sign-ξ requires Ω(F (∆)) samples for almost all instances with respect to {ΓcN (ρ)}+∞ N =1 . Now we have enough technical tools to prove Theorem 4.3. We deal with the discrete distribution ΓdN (ρ) first. Suppose ρ is a convex granularity function. Indeed, we need to prove two parts. 1. If there exists γ > 0 such that ρ(x) is O(xγ ), then any δ-correct algorithm for Sign-ξ requires at least Ω(F (∆)) samples for almost all instances with respect to {ΓdN (ρ)}+∞ N =1 . 2. If ρ(x) grows faster than O(xγ ) for any γ > 0, then there exists a δ-correct algorithm A such that for any c1 > 0. lim sup ΓdN (ρ)({TA (∆) < c1 · F (∆)}) > 0. n→+∞

We begin with the first part. Proof of Theorem 4.3. (first part, discrete distribution) Let A be a δ-correct algorithm for Sign-ξ. Suppose ρ(x) is O(xγ ). So there exist constants c2 , M > 0 such that ⌈ρ(x)⌉ ≤ c2 · xγ for all x ≥ M . Note that ρ′ (0) = α > 0 as ρ is strictly increasing. Since ρ is convex, we have ρ(x + 1) − ρ(x) ≥ ρ′ (x) ≥ ′ ρ (0) = α for x ≥ 0. N X 1{TA (e−ρ(i) ) < c1 F (e−ρ(i) )}, in Now, pick N ∈ N+ and N > M . Then ⌈ρ(N )⌉ ≤ c2 · N γ . Let CN = i=0

which c1 is a constant to be determined later. Now, consider the following quantity: ⌈(ρ(N )⌉ ′ CN

=

X i=1

1{there exists t ∈ [0, N ] such that TA (e−ρ(t) ) < c1 F (e−ρ(t) ) and ρ(t) ∈ (i − 1, i]}.

For each interval (i − 1, i], it contains less than β = α−1 + 1 numbers in {ρ(t)}+∞ t=0 as ρ(t + 1) − ρ(t) ≥ α for all t. Hence, we have: 1 ′ CN ≥ · CN . β

32

′ Recall the definition in Lemma 4.5, it is easy to see that CN ≤ C(A, c1 F, ⌈ρ(N )⌉) ≤ C(A, c1 F, ⌊c2 N γ ⌋). ′ We apply Lemma 4.5 with the constant 1/2γ: there exists c1 > 0 such that

C(A, c′1 F, N ) = 0. N →+∞ N 1/2γ lim

Now, we specify c1 = c′1 . This implies that there exists N ′ such that for all N ≥ N ′ , we have C(A, c1 F, N ) ≤ ′ N 1/2γ . Therefore, for N such that N > M and ⌊c2 N γ ⌋ > N ′ , we have CN ≤ C(A, c1 F, ⌊c2 N γ ⌋) ≤ √ √ 1/2γ ′ γ 1/2γ (c2 N ) ≤ c3 N , where c3 = c2 . So CN ≤ βCN ≤ βc3 N . Hence, we can see that √ βc3 N CN d ≤ lim = 0, lim ΓN (ρ)({TA (∆) < c1 · F (∆)}) = lim N →+∞ N + 1 N →+∞ N →+∞ N + 1 which concludes the proof for the first part. Now we prove the second part, for which we need to provide an algorithm that needs o(F (∆)) samples for a nontrivial fraction of instances with respect to ΓdN (ρ). Proof of Theorem 4.3. (second part, discrete distribution) Suppose for all γ > 0, we have lim sup x→+∞

ρ(x) > 0. xγ

Since ρ is increasing, this is equivalent to saying that for all γ > 0 ρ(N ) = +∞. γ N ∈N+ ,N →+∞ N lim sup

Note that since ρ is strictly increasing, ρ′ (0) = α > 0. And since ρ is convex, we have ρ(x + 1) − ρ(x) ≥ ρ′ (x) ≥ ρ′ (0) = α for x ≥ 0. Let Λt = e−ρ(t) , and c = e−α . Then we have Λt+1 ≤ c · Λt for all t. Applying Corollary 3.3 with the reference sequence {Λt }, we obtain a δ-correct algorithm A for Sign-ξ, −2 −2 which uses O((ln δ −1 + ln t)Λ−2 (we treat δ as a constant) on instances with t ) = O(ln t · Λt ) ≤ C ln t · Λt gap Λt . Now, since ρ grows faster than any polynomial, for any c1 > 0, there exists an infinite increasing sequence 2C/c1 + . {Ni }+∞ i=1 such that Ni ∈ N and ρ(Ni ) > Ni Consider some Ni in the above sequence and the behavior of A. Notice that for any integer t ∈ [Ni , Ni2 ], we have 2C/c1 ≥ tC/c1 . ρ(t) > Ni Recall that Λt = e−ρ(t) . Hence, taking log of the above inequality, we can see that ln ln Λ−1 t > (C/c1 ) · ln t.

−2 −1 So A uses at most C ln t · Λ−2 samples on instance with gap Λt = e−ρ(t) , for any integer t < c1 · Λt ln ln Λt 2 t ∈ [Ni , Ni ]. Now, we can easily see that

Ni2 − Ni = 1, i→+∞ Ni2

lim ΓdN 2 (ρ)({TA (∆) < c1 F (∆)}) ≥ lim

i→+∞

i

which implies that lim sup ΓdN (ρ)({TA (∆) < c1 F (∆)}) = 1. N →+∞

Then we move to continuous case. Proof of Theorem 4.3. (continuous distribution) Let A be a δ-correct algorithm for √ Sign-ξ. Now, by Lemma 4.6, there exist constants M ≥ 1, c1 > 0 such that, for all N ≥ M , C′ (A, c1 F, N ) ≤ N . Since ρ is differentiable, we have: Z ρ(N ) C′ (A, c1 F, ρ(N )) ≥ 1{TA (e−x ) < c1 · F (e−x )}dx ρ(0)

=

Z

0

N

1{TA (e−ρ(t) ) < c1 · F (e−ρ(t) )}ρ′ (t)dt. 33

Let ϕ(t) = 1{TA (e−ρ(t) ) < c1 · F (e−ρ(t) )}. Pick a M2 such that ρ(M2 ) ≥ M and ρ(M2 ) − ρ(0) > 10 So we have: Z N p ϕ(x)ρ′ (t)dt ≤ ρ(N ) for all N ≥ M2 .

p ρ(M2 ).

0

We want to show that as N → +∞, Z need to provide an upper bound for

ΓcN (ρ)({TA (∆) N

< c1 F (∆)}) =

RN 0

ϕ(t)dt → 0. For this purpose we N

ϕ(t)dt.

0

Consider the following optimization problem: Find a function h(t) : R → [0, 1] such that OP TN =

Z

N

h(t)dt

0

is maximized, subject to the constraint that Z

R

0

Clearly we can see

Z

N

0

h(t)ρ′ (t)dt ≤

p ρ(R)

for all R ∈ [M2 , N ].

(11)

ϕ(t)dt ≤ OP TN . Thus, it suffices to upper bound OP TN . Since ρ′ (t) is increasing

and continuous, it is not hard to see that there exists an optimal solution h⋆ satisfying following conditions:

• h∗ is continuous on [M2 , N ] and all the constraints in (11) hold tight (i.e., the equality holds). p   p ρ(M2 ) + ρ(0) , otherwise h⋆ (x) = 0. First note ρ−1 ρ(M2 ) + ρ(0) < • For x ∈ [0, M2 ), h⋆ (x) = 1 if x ≤ ρ−1   Z ρ−1 √ρ(M2 )+ρ(0) p p M2 since ρ(M2 )+ρ(0) < ρ(M2 ). In addition, ρ′ (t)dt = ρ( M2 ) so it is consistent 0

with the first claim.

Then, we take derivative with respect to R on both sides of (11):

Hence, h⋆ (R) = Thus, we have:

1 ρ′ (R). h⋆ (R)ρ′ (R) = p 2 ρ(R)

1 p . Since ρ is increasing and unbounded, we have 2 ρ(R)

lim ΓcN (ρ)({TA (∆) ≤ c1 F (∆)}) =

N →+∞

lim

N →+∞

RN 0

lim h⋆ (N ) =

N →+∞

ϕ(t)dt ≤ lim N →+∞ N

RN 0

1 p = 0. 2 ρ(N )

h⋆ (t)dt = 0, N

which concludes the proof.

B.2

Proof for Theorem 4.4

Theorem 4.4 (restate). Let ρ be a concave granularity function. Any δ-correct algorithm for Sign-ξ needs Ω(F (∆)) samples for almost all instances with respect to {ΓdN (ρ)}+∞ N =1 iff there exists γ > 0 such that xρ′ (x) +∞ c > 0. The same statement holds for {ΓN (ρ)}N =1 as well. lim inf x→+∞ ρ(x)γ We prove Theorem 4.4 in this subsection. Let ρ be a concave granularity function. Again, we need to prove two parts: 10

This is possible as ρ(x) −

p

ρ(x) =

p p ρ(x)( ρ(x) − 1) is strictly increasing and unbounded when ρ(x) ≥ 1.

34

xρ′ (x) > 0, then any δ-correct algorithm takes Ω(F (∆)) total x→+∞ ρ(x)γ +∞ c samples respect to {ΓdN (ρ)}+∞ N =1 and {ΓN (ρ)}N =1 .

1. If there exists γ > 0 such that lim inf

2. If for any γ > 0, we have lim inf

x→+∞

c1 > 0,

xρ′ (x) = 0, then there exist algorithms A and A′ such that for any ρ(x)γ

lim sup ΓdN (ρ){TA (∆) > c1 · F (∆)} > 0;

and

N →+∞

lim sup ΓcN (ρ){TA′ (∆) < c1 · F (∆)} > 0. N →+∞

Now we prove the first part for the discrete distribution. Proof of Theorem 4.4. (first part, discrete distribution) By the assumption of the theorem, there exist N ρ′ (N ) positive number α and integer M , such that for all N ≥ M , we have ≥ α (or equivalently ρ(N )γ γ αρ(N ) ). Moreover, we can assume without loss of generality that γ < 1. ρ′ (N ) ≥ N N X 1{TA (e−ρ(i) ) < c1 F (e−ρ(i) )}, in which c1 is a constant to be determined later. Let CN = i=0

Consider the following quantity: ⌈(ρ(N )⌉ ′ CN

=

X i=1

1{There exists t ∈ [0, N ] such that TA (e−ρ(t) ) < c1 F (e−ρ(t) ) and ρ(t) ∈ (i − 1, i]}.

Note that for any t ≤ N − 1, we have ρ(t + 1) − ρ(t) ≥ ρ′ (t + 1) ≥ ρ′ (N ) since ρ is concave. So we can see that a single interval (i − 1, i] contains at most 1/ρ′ (N ) + 1 numbers in {ρ(t)}N t=0 , which implies that ′ CN ≤ CN · (1/ρ′ (N ) + 1).

By Lemma 4.5, there exist constants c′1 , M1 such that for all N > M1 , we have: C(A, c′1 F, N ) ≤ N γ/2 . Now we specify c1 = c′1 . For N > M and ρ(N ) > M1 , we have that ′ CN ≤ C(A, c1 F, ⌈ρ(N )⌉) ≤ ⌈ρ(N )⌉γ/2 ≤ 2ρ(N )γ/2 .

Thus, we can see that  CN ≤ 2ρ(N )γ/2 · 1 +

N αρ(N )γ



≤ 2ρ(N )γ/2 +

2N . αρ(N )γ/2

By the concavity of ρ, we have ρ(N ) ≤ ρ(0) + ρ′ (0) · N . Using the fact that γ < 1, we have lim

N →+∞

ΓdN (ρ)({TA (∆)

  CN (ρ(0) + ρ′ (0) · N )γ/2 2 < c1 F (∆)}) = lim = 0, 2· ≤ lim + N →+∞ N + 1 N →+∞ N αρ(N )γ/2

which concludes the proof for the discrete case. Now we move to continuous case. Proof of Theorem 4.4. (first part, continuous distribution) By Lemma 4.6, there exist constants M2 and c1 > 0 such that, for all N ≥ M2 , we have C′ (A, c1 F, N ) ≤ N γ/2 .

35

Let DN =

Z

N

1{TA(e−ρ(t) ) < c1 F (e−ρ(t) )}dt. We can also see that

0 ′

C (A, c1 F, ρ(N )) ≥ =

Z

ρ(N )

ρ(0)

Z

0

N

1{TA (e−x ) < c1 · F (e−x )}dx

1{TA (e−ρ(t) ) < c1 · F (e−ρ(t) )}ρ′ (t)dt.



By the concavity of ρ, ρ (t) is decreasing. Hence, we have that DN · ρ′ (N ) ≤ C′ (A, c1 F, ρ(N )), from which we can see that DN ≤

N ρ(N )γ/2 for ρ(N ) ≥ M2 and N ≥ M. = γ αρ(N ) /N αρ(N )γ/2

Thus, we have that lim ΓcN (ρ)({TA (∆) < c1 F (∆)}) =

N →+∞

DN 1 = 0, ≤ lim N →+∞ N N →+∞ αρ(N )γ/2 lim

which concludes the proof for the continuous case. Finally, we are ready to prove the second part. Proof of Theorem 4.4. (second part) In fact, we provide only one algorithm A, and show it suffices for both the discrete and continuous cases. As ρ is concave and increasing, we have that for any i ∈ N, N ρ′ (N ) ⌈x⌉ρ′ (⌈x⌉) ⌈x⌉ρ′ (x) ⌈x⌉ xρ′ (x) · = lim inf ≤ lim inf = lim inf = 0. x→+∞ ρ(⌈x⌉)1/i x→+∞ ρ(x)1/i x→+∞ x N ∈N,N →+∞ ρ(N )1/i ρ(x)1/i lim inf

(12)

Our algorithm is an instantiation of Algorithm 1 in Section 3. For this purpose, we need to construct a reference sequence {Λi } from ρ. First we construct an integer sequence {Mi } inductively. Suppose we have already constructed M1 , M2 , . . . , Mi−1 . By (12), we can pick Mi ∈ N large enough such that the following statements hold: 1. Mi ρ′ (Mi ) ≤ 2. ρ(Mi )1/i ≥

1 · ρ(Mi )1/i ; i

i−1 X

k=1

ρ(Mk )1/k + 2 · i;

3. ρ(Mi ) > ρ(i · Mi−1 ) + 1 if i > 1.

 Then, we  construct−tthe reference sequence {Λi } as follows: For each i, and for every integer t ∈ ⌈ρ(Mi )⌉, ⌈ρ((i+ 1)Mi )⌉ , we add e into the sequence. It is easy to see that for all i, Λi+1 ≤ e−1 · Λi . Now, we apply Corollary 3.3 using {Λi } as the reference sequence to get the δ-correct algorithm A for Sign-ξ. On an instance with gap ∆, the number of samples used by the algorithm is upper bounded by −2 −2 O((ln δ −1 + ln κ)Λ−2 κ ) = O(ln κ · Λκ ) ≤ C ln κ · Λκ ,

where κ is the smallest i′ such that Λi′ ≤ ∆. Since ρ is concave, for each i, we can see ρ((i + 1)Mi ) − ρ(Mi ) ≤ iMi · ρ′ (Mi ) ≤ ρ(Mi )1/i , and note that we add at most ρ((i + 1)Mi ) − ρ(Mi ) + 2 ≤ ρ(Mi )1/i + 2 numbers to {Λi }. Now, fix i. Let us analyze the behavior of A on Γd(i+1)Mi (ρ) (and Γc(i+1)Mi (ρ)). For all t ∈ [Mi , (i + 1)Mi ],   let ∆ = e−ρ(t) . We can also see that ⌈ρ(t)⌉ ∈ ⌈ρ(Mi )⌉, ⌈ρ((i + 1)Mi )⌉ . So, this implies e−⌈ρ(t)⌉ ∈ {Λj }, and 36

κ≤

i X

(ρ(Mk )1/k + 2) ≤ 2ρ(Mi )1/i . Clearly, Λκ = e−⌈ρ(t)⌉ ≥ e−1 · ∆. Also, ρ(Mi ) ≤ ρ(t) = ln ∆−1 . Thus,

k=1

the expected number of samples is less than C ln κ · Λ−2 κ ≤ C · (ln 2 +

1 1 · ln ρ(Mi )) · e2 ∆−2 ≤ C ′ · (ln 2 + ln ln ∆−1 ) · ∆−2 , i i

where we let C ′ = Ce2 . Now, for any c1 > 0, we choose a big enough i0 such that for all ∆ ≤ e−ρ(Mi0 ) , C ′ · (ln 2 +

1 ln ln ∆−1 ) < c1 ln ln ∆−1 . i0

Therefore, we can see that lim sup ΓdN (ρ)({TA (∆) < c1 F (∆)}) ≥ lim sup Γd(i+1)Mi (ρ)({TA (∆) < c1 F (∆)}) i>i0 ,i→+∞

N →+∞



lim

i>i0 ,i→+∞

i = 1. i+1

The second equality holds since the algorithm achieves the sample complexity better than c1 ∆−2 ln ln ∆−1 for all ∆ = e−ρ(t) , t ∈ [Mi , (i + 1)Mi ]. Exactly the same argument holds for the continuous case ΓcN (ρ), which concludes the whole proof.

C

Missing Proofs in Section 6

Lemma 6.3 (restate). Suppose ǫ < 0.1 and t ∈ (ǫ, 1 − ǫ). With probability 1 − δ, the following hold: • If FRAC-TEST outputs True, then |S >cr | < (1 − t + ǫ)|S| (or equivalently |S ≤cr | > (t − ǫ)|S|). • If FRAC-TEST outputs False, then |S (1 − t − ǫ)|S|).

Moreover the number of samples taken by the algorithm is O(ln δ −1 ǫ−2 ∆−2 ln ǫ−1 ), in which ∆ = cr − cl .

Proof of Lemma 6.3. Let Sa = S cr , Na = |Sa |, Nb = |Sb |, N = |S|. For each iteration i and the arm ai in line 4, by Lemma 6.1, we have Pr[|ˆ µ[ai ] − µ[ai ] | ≥ (cr − cl )/2] ≤ ǫ/3. Hence, if µ[ai ] < cl , then cl + cr cl + cr Pr[ˆ µ[ai ] < ] ≥ 1 − ǫ/3. Similarly, if µ[ai ] > cr , then Pr[ˆ µ[ai ] < ] ≤ ǫ/3. 2 2   cl + cr Let Xi be the indicator Boolean variable 1 µ ˆ[ai ] < . Clearly Xi s are i.i.d. From the algorithm, 2 tot X ˆ = cnt/tot, which is the empirical value of E. By Xi . Let E = E[Xi ]. Let E we can see that cnt = i=1

ˆ ≥ ǫ/3] ≤ δ. Chernoff bound, we can easily get that Pr[|E − E| ˆ < ǫ/3, which happens with probability In the rest of the proof, we condition on the event that |E − E| ˆ at least 1 − δ. Suppose E > t. Then we have E > t − ǫ/3. It is also easy to see that:  X  cl + cr E ·N = Pr µ ˆ[a] < ≤ (N − Nb ) · 1 + (ǫ/3)Nb = N − (1 − ǫ/3)Nb , 2 a∈S

as ǫ < 0.1,

1 < (1 + 2ǫ/3). So, we have proved the first claim: 1 − ǫ/3 Nb ≤

(1 − E)N (1 − t + ǫ/3)N ≤ < (1 − t + ǫ/3)(1 + 2ǫ/3)N ≤ (1 − t + ǫ)N. 1 − ǫ/3 1 − ǫ/3

ˆ ≤ t. Then we have E ≤ t + ǫ/3. We also have The second claim is completely symmetric. Suppose E E · N ≥ Na (1 − ǫ/3). So, Na ≤

(t + ǫ/3)N E·N ≤ < (t + ǫ/3)(1 + 2ǫ/3)N ≤ (t + ǫ)N. 1 − ǫ/3 1 − ǫ/3

Finally, the upper bound for the number of samples can be verified by a direct calculation. 37

C.1

Proof for Lemma 6.5

Lemma 6.5 (restate). Suppose δ < 0.1. Let S ′ = ELIMINATION(S, cl , cr , δ). Let A1 be the best arm among S, with mean µ[A1 ] ≥ cr . Then with probability at least 1 − δ, the following statements hold 1. A1 ∈ S ′

(the best arm survives);

2. |S ′≤cl | < 0.1|S ′ |

(only a small fraction of arms have means less than cl );

3. The number of samples is O(|S| ln δ −1 ∆−2 ), in which ∆ = cr − cl . Note that with probability at most δ, there is no guarantee for any of the above statements. Before proving Lemma 6.5, we first describe two events (which happen with high probability) that we condition on, in order to simplify the argument. Lemma C.1. With probability at least 1 − δ/2, it holds that in all round r, FRAC-TEST outputs correctly, cr − cm and |ˆ µ[A1 ] − µ[A1 ] | < . 2 Proof. Fix a round r. FRAC-TEST outputs incorrectly with probability at most δr . By Theorem 6.1, cr − cm Pr(|ˆ µ[A1 ] − µ[A1 ] | ≥ ) ≤ δr . 2 +∞ +∞ X X The lemma follows from a simple union bound over all rounds: 2 δr ≤ 2δ 0.1/2r ≤ δ/2. r=1

r=1

Lemma C.2. Let Nr = |Sr≤cm |. Then with probability at least 1 − δ/2, for all rounds r in which Algorithm 4 1 does not terminate, Nr+1 ≤ Nr . 4 cr − cm ] ≤ δr . So Pr[ˆ µ[a] > µ[a] − µ[a] | ≥ Proof. Suppose a ∈ Sr≤cm . By Theorem 6.1, we have that Pr[|ˆ 2 cm + cr ] ≤ δr . Then, we can see that E[Nr+1 ] ≤ δr Nr . By Markov inequality, we can see that Pr(Nr+1 > 2 δr N r 1 = 4δr . Nr ) ≤ 1 4 4 Nr +∞ +∞ X X 0.1/2r ≤ δ/2. 4δr ≤ 4δ Again, the lemma follows by a simple union bound: r=1

r=1

Proof of Lemma 6.5. With probability at least 1 − δ, both statements in Lemma C.1 and Lemma C.2 hold. Let that event be EG . Now we prove Lemma 6.5 under the condition that EG holds. Now we prove all the claims one by one. cr − cm (1) For the first claim, note that for all r, conditioning on EG , we have that |ˆ µ[A1 ] − µ[A1 ] | < , or 2 cm + cr cr − cm = . Hence, A1 survives all rounds and A1 ∈ S ′ . equivalently µ ˆ[A1 ] > cr − 2 2 (2) For the second claim, note that conditioning on event EG , FRAC-TEST always outputs correctly. Then suppose the algorithm terminates at round r, which means FRAC-TEST(Sr , cl , cm , δr , 0.075, 0.025) outputs False. By Lemma 6.3, we have |Sr (0.075 − 0.025)|Sr | = 0.05|Sr |. Then, we have that 3 |Sr+1 | ≤ |Sr | − (|Nr | − |Nr+1 |) ≤ |Sr | − |Nr | ≤ 0.99|Sr |. 4 Suppose the algorithm terminates at round r′ . Let c1 be a large enough constant (so that c1 ln δr ∆−2 |Sr | is an upper bound for the samples taken by UNIF-SAMPL in round r and c1 ln δr ∆−2 is an upper bound for the samples taken by FRAC-TEST in round r). Then, the number of samples is bounded by: 38



r X r=1



−2

c1 (∆

−2

ln δr |Sr | + ∆

−2

ln δr ) ≤ 2c1 |S|∆

r X r=1

(ln δ −1 + r ln 2 + ln 10) · 0.99r−1



−2

≤ 2c1 |S|∆

−2

≤ 2c1 |S|∆

r X r=1

(ln δ −1 (r + 1) + ln 10) · 0.99r−1

ln δ

−1

+∞ X r=1

(r + 1) · 0.99

r−1

≤ 2c1 |S|∆−2 ln δ −1 · 10100 + 100 · ln 10

+



+∞ X r=1

ln 10 · 0.99

So the number of samples is O(|S| ln δ −1 ∆−2 ), which concludes the proof of the lemma.

39

r−1

!