arXiv:1506.03378v1 [cs.LG] 10 Jun 2015
On the Prior Sensitivity of Thompson Sampling
Lihong Li, Microsoft Research, Redmond, WA 98052 ([email protected])
Che-Yu Liu, Princeton University, Princeton, NJ 08544 ([email protected])

Abstract

The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm's regret is low (high) when the prior is good (bad), little is known about the exact dependence. In this paper, we fully characterize the algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case. These results also provide insights into the general sensitivity of the algorithm to the choice of priors. In particular, with $p$ being the prior probability mass of the true reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the bad- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on the discovery of a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the literature, to the best of our knowledge.
1 Introduction

This paper studies Thompson Sampling, also known as probability matching and posterior sampling, an increasingly popular strategy that addresses the exploration-exploitation tradeoff in stochastic multi-armed bandit problems. In this problem, an agent is repeatedly faced with K possible actions. At each time step t = 1, . . . , T, the agent chooses an action $I_t \in A := \{1, \ldots, K\}$, then receives reward $X_{I_t,t} \in \mathbb{R}$. An eligible action-selection strategy chooses actions at step t based only on past observed rewards $H_t = \{I_s, X_{I_s,s};\, 1 \le s < t\}$ and potentially on an external source of randomness. More background on the bandit problem can be found in [1].

We make the following stochastic assumption on the underlying reward-generating mechanism. Let Θ be a countable¹ set of possible reward-generating models. When θ ∈ Θ is the true underlying model, the rewards $(X_{i,t})_{t\ge 1}$ are i.i.d. random variables taking values in [0, 1], drawn from some known distribution $\nu_i(\theta)$ with mean $\mu_i(\theta)$. Of course, the agent knows neither the true underlying model nor the optimal action that yields the highest expected reward. The performance of the agent is measured by the regret incurred for not always selecting the optimal action. More precisely, the frequentist regret (or regret for short) for an eligible action-selection strategy π under a certain reward-generating model θ is defined as
\[
R_T(\theta, \pi) := \mathbb{E}\left[\sum_{t=1}^{T}\left(\max_{i \in A}\mu_i(\theta) - \mu_{I_t}(\theta)\right)\right],
\]
1 Note that in this paper, we do not impose any continuity structure on the reward distributions ν(θ) with respect to θ ∈ Θ. Therefore, it is easy to see that when Θ is uncountable, the (frequentist) regret of Thompson Sampling in the worst-case scenario is linear in time under most underlying models θ ∈ Θ.
where the expectation is taken with respect to the rewards $(X_{i,t})_{i\in A,\, t\ge 1}$, generated according to the model θ, and the potential external source of randomness. If one imposes a prior distribution p over Θ, then it is natural to consider the following notion of average regret, known as the Bayes regret:
\[
\bar{R}_T(\pi) := \mathbb{E}_{\theta \sim p}\left[R_T(\theta, \pi)\right] = \sum_{\theta \in \Theta} R_T(\theta, \pi)\, p(\theta).
\]
The setup above is a discretized version of rather general bandit problems; we adopt this abstract formulation to simplify exposition and analysis. For example, the standard K-armed bandit is a special case, where Θ is the Cartesian product of the sets of reward distributions of all arms. As another example, in linear bandits, Θ is a set of candidate coefficient vectors that determine the reward expectations (see, e.g., [2]). Although we assume rewards are bounded, this assumption is needed only in Lemma 14 in Appendix A; many useful results in the paper still hold with unbounded rewards.

1.1 Thompson Sampling and Related Work

The Thompson Sampling strategy was proposed in probably the very first paper on multi-armed bandits [3]. This strategy takes as input a prior distribution $p_1$ for θ ∈ Θ. At each time t, let $p_t$ be the posterior distribution for θ given the prior $p_1$ and the history $H_t = \{I_s, X_{I_s,s};\, 1 \le s < t\}$. Thompson Sampling selects an action randomly according to its posterior probability of being the optimal action. Equivalently, Thompson Sampling first draws a model $\theta_t$ from $p_t$ (independently from the past given $p_t$) and then pulls $I_t \in \arg\max_{i\in A}\mu_i(\theta_t)$. For concreteness, we assume that the distributions $(\nu_i(\theta))_{i\in A,\,\theta\in\Theta}$ are absolutely continuous with respect to some common measure ν on [0, 1], with likelihood functions $(\ell_i(\theta)(\cdot))_{i\in A,\,\theta\in\Theta}$. The posterior distributions $p_t$ can be computed recursively by Bayes' rule as follows:
\[
p_{t+1}(\theta) = \frac{p_t(\theta)\,\ell_{I_t}(\theta)(X_{I_t,t})}{\sum_{\eta\in\Theta} p_t(\eta)\,\ell_{I_t}(\eta)(X_{I_t,t})}.
\]
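To make the update above concrete, the following is a minimal Python sketch of Thompson Sampling over a finite model set with Bernoulli-reward arms, assuming the posterior can be stored as an explicit probability vector; the function and variable names (thompson_sampling, means, prior) are ours for illustration and do not come from the paper.

```python
import numpy as np

def thompson_sampling(means, prior, T, true_model, rng):
    """Thompson Sampling over a finite model set with Bernoulli rewards.

    means[theta][i] is the mean of arm i under model theta, prior is the prior
    mass over models, and rewards are drawn from the (unknown) true_model.
    Returns the cumulative frequentist regret under the true model.
    """
    posterior = np.array(prior, dtype=float)
    best_mean = np.max(means[true_model])
    regret = 0.0
    for _ in range(T):
        theta = rng.choice(len(posterior), p=posterior)   # draw a model from the posterior
        arm = int(np.argmax(means[theta]))                 # play the drawn model's best arm
        reward = rng.binomial(1, means[true_model][arm])   # Bernoulli reward from the true model
        # Bayes' rule: multiply by the likelihood of the observed reward and renormalize
        lik = np.array([m[arm] if reward == 1 else 1.0 - m[arm] for m in means])
        posterior *= lik
        posterior /= posterior.sum()
        regret += best_mean - means[true_model][arm]
    return regret
```

Because Θ is finite here, the posterior update is exactly the recursion displayed above, with the likelihood of a Bernoulli observation playing the role of $\ell_{I_t}(\theta)(X_{I_t,t})$.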
We denote by $TS(p_1)$ the Thompson Sampling strategy with prior $p_1$. Recently, Thompson Sampling has gained a lot of interest, largely due to its empirical successes [4, 5, 6, 7]. Furthermore, this strategy is often easy to combine with complex reward models and easy to implement; see, e.g., [8]. All of this has inspired a number of theoretical analyses of this old strategy. For the classic K-armed bandits, regret bounds comparable to those of the more widely studied UCB algorithms have been obtained [9, 10, 11], matching a well-known asymptotic lower bound [12]. For linear bandits of dimension d, an $O(d\sqrt{T})$ upper bound has been proved [13]. All these bounds, while providing interesting insights about the algorithm, assume non-informative priors (often uniform priors), and essentially show that Thompson Sampling has a regret comparable to other more popular strategies, especially those based on upper confidence bounds. Unfortunately, the bounds do not show what role the prior plays in the performance of the algorithm. In contrast, [14] analyzes a variant of Thompson Sampling, giving a bound that depends explicitly on the entropy of the prior. However, their bound has an $O(T^{2/3})$ dependence on T that is likely sub-optimal.

Another line of work in the literature focuses on the Bayes regret with an informative prior. [15] proved that for any prior in the two-armed case, Thompson Sampling is a 2-approximation to the optimal strategy that minimizes the "stochastic" (Bayes) regret. [16] and [17] showed that in the K-armed case, the Bayes regret is always upper bounded by $O(\sqrt{KT})$ for any prior $p_1$. The result was later improved in [18] to a prior-dependent bound $O(\sqrt{H(q)KT})$, where q is the prior distribution of the optimal action, defined as $q(i) = \mathbb{P}_{\theta\sim p_1}(i = \arg\max_{j\in A}\mu_j(\theta))$, and $H(q) = -\sum_{i=1}^{K} q(i)\log q(i)$ is the entropy of q. While this bound elegantly quantifies, in terms of averaged regret, how Thompson Sampling exploits prior distributions, it does not tell how well Thompson Sampling works in individual problems. Indeed, in the analysis of Bayes regret, it is unclear what a "good" prior means from a theoretical perspective, as the notion of Bayes regret essentially assumes the prior is correct. In the extreme case where the prior $p_1$ is a point mass, H(q) = 0 and the Bayes regret is trivially 0.
To the best of our knowledge, our work is the first to consider the frequentist regret of Thompson Sampling with an informative prior. Specifically, we focus on understanding Thompson Sampling's sensitivity to the choice of prior, making progress towards resolving an interesting and important open question. The findings also have useful implications for practical applications of Thompson Sampling.

1.2 Main Results and Discussions

Naturally, we expect the regret of Thompson Sampling to be small when the true reward-generating model is given a large prior probability mass, and vice versa. An interesting and important question, as addressed in this work, is to understand the sensitivity of Thompson Sampling, in terms of regret, to the prior distribution it takes as input. Our results fully characterize the worst-case dependence of Thompson Sampling on the prior in the special yet meaningful case where K = 2 and |Θ| = 2, and provide insight into the more general case. As we will see, even such a seemingly simple case is highly nontrivial to analyze.

2-Actions-And-2-Models case: In this case, K = 2, Θ = {θ1, θ2}, and under the model θi, action i is the optimal action; that is, $\mu_1(\theta_1) > \mu_2(\theta_1)$ and $\mu_2(\theta_2) > \mu_1(\theta_2)$. The main result is summarized in the following theorem. Precise statements are given in Theorem 7, Theorem 11 and Theorem 12 in later sections.

Theorem 1. (Main Result) Consider the 2-Actions-And-2-Models case under a certain mild smoothness assumption. Assume without loss of generality that θ1 is the true reward-generating model and let $p_1$ be a prior over Θ = {θ1, θ2}. When $p_1(\theta_1)$ is small, the regret of Thompson Sampling is upper bounded by $O\big(\sqrt{T/p_1(\theta_1)}\big)$. When $p_1(\theta_1)$ is sufficiently large, the regret is upper bounded by $O\big(\sqrt{(1-p_1(\theta_1))T}\big)$. Furthermore, these upper bounds are tight up to a constant factor, in the sense that there exist 2-Actions-And-2-Models instances where the lower bounds are $\Omega\big(\sqrt{T/p_1(\theta_1)}\big)$ and $\Omega\big(\sqrt{(1-p_1(\theta_1))T}\big)$ for the two cases, respectively.
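To see the two regimes of Theorem 1 empirically, one can sweep the prior mass of the true model in a small 2-Actions-And-2-Models Bernoulli instance using the thompson_sampling sketch from Section 1.1; the means, horizon, and number of runs below are arbitrary illustrative choices, not parameters used anywhere in the analysis.

```python
import numpy as np

# Two models over two Bernoulli arms; arm i is optimal under model theta_i.
means = [np.array([0.6, 0.4]),   # theta_1 (taken as the true model below)
         np.array([0.4, 0.6])]   # theta_2

T, runs = 5000, 200
rng = np.random.default_rng(0)
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    avg_regret = np.mean([
        thompson_sampling(means, [p, 1.0 - p], T, true_model=0, rng=rng)
        for _ in range(runs)
    ])
    print(f"prior mass on true model = {p:.2f}: average regret ~ {avg_regret:.1f}")
```

Averaged over enough runs, the regret should shrink as the prior mass on θ1 grows, qualitatively matching the $\sqrt{T/p_1(\theta_1)}$ and $\sqrt{(1-p_1(\theta_1))T}$ regimes of Theorem 1.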
The lower bounds in the 2-Actions-And-2-Models case easily imply the lower bounds in the general case; see Appendix B for the proof.

Corollary 2. (General Lower Bounds) Consider the case with two actions and an arbitrary countable Θ. Let $p_1$ be a prior over Θ and $\theta^* \in \Theta$ be the true model. Then, there exist problem instances where the regrets of Thompson Sampling are $\Omega\big(\sqrt{T/p_1(\theta^*)}\big)$ and $\Omega\big(\sqrt{(1-p_1(\theta^*))T}\big)$ for small $p_1(\theta^*)$ and large $p_1(\theta^*)$, respectively.
These lower bounds show that the performance of Thompson Sampling can be quite sensitive to the choice of input prior. Note that upper bounds in the general case can be derived easily from the results in [16]. In fact, it is easy to see that
\[
R_T(\theta^*, TS(p_1)) \le \frac{\bar{R}_T(TS(p_1))}{p_1(\theta^*)} = O\!\left(\frac{\sqrt{H(q)KT}}{p_1(\theta^*)}\right),
\]
where $\theta^* \in \Theta$ is the true model. On one hand, in the 2-Actions-And-2-Models case with θ1 being the true model, the above upper bound becomes $O\big(\sqrt{\log\frac{1}{p_1(\theta_1)} \cdot \frac{T}{p_1(\theta_1)}}\big)$ for small $p_1(\theta_1)$ and $O\big(\sqrt{\log\frac{1}{1-p_1(\theta_1)} \cdot (1-p_1(\theta_1))T}\big)$ for large $p_1(\theta_1)$. Our upper bounds in Theorem 1 remove the extraneous logarithmic terms in these upper bounds. On the other hand, the above general upper bound can be further upper bounded by $O\big(\frac{\sqrt{T}}{p_1(\theta^*)}\big)$ for small $p_1(\theta^*)$ and $O\big(\sqrt{\log\frac{1}{1-p_1(\theta^*)} \cdot (1-p_1(\theta^*))T}\big)$ for large $p_1(\theta^*)$. We conjecture that these general upper bounds can be improved to match our lower bounds in Corollary 2, especially for small $p_1(\theta^*)$. But it remains open how to extend our proof techniques for the 2-Actions-And-2-Models case to get tight general upper bounds.

It is natural to compare Thompson Sampling to exponentially weighted algorithms, a well-known family of algorithms that can also take advantage of prior knowledge. If we see each model θ ∈ Θ as an expert who recommends the optimal action based on the distributions specified by θ, and use the prior $p_1$ as the initial weights assigned to the experts, then the EXP4 algorithm [19] has a regret of
\[
O\!\left(KT\gamma + \frac{1}{\gamma}\log\frac{1}{p_1(\theta^*)}\right).
\]
For the sake of simplicity, we only do the comparison in the 2-Actions-And-2-Models case. By trying to match or even beat the upper bounds in Theorem 1, we reach the choice $\gamma = \sqrt{H(p_1)/T}$. Assuming that θ1 is the true model, the bound becomes $O\big(\sqrt{\log\frac{1}{p_1(\theta_1)} \cdot \frac{T}{p_1(\theta_1)}}\big)$ for small $p_1(\theta_1)$, and $O\big(\sqrt{\log\frac{1}{1-p_1(\theta_1)} \cdot (1-p_1(\theta_1))T}\big)$ for large $p_1(\theta_1)$. Thus, although EXP4 is not a Bayesian
algorithm, it has the same worst-case dependence on prior as Thompson Sampling, up to logarithmic factors. This is partly explained by the fact that such algorithms are designed to perform well in the worst-case (adaptive adversarial) scenario. On the contrary, by design, Thompson Sampling takes advantage of prior information more efficiently in most cases, especially when there is certain structure on the model space Θ; see [17] for an example. Note that in this paper, we do not impose any structure on Θ, thus our lower bounds do not contradict existing results in the literature with non-informative priors (where p(θ∗ ) can be very small as Θ is typically large).
Finally, our proof techniques used here are new in the Thompson Sampling literature, to the best of our knowledge. The key innovation is the finding that the inverse of the posterior probability of the true underlying model is a martingale (Theorem 4). It allows us to use results and techniques from martingale theory to quantify the time and probability that the posterior distribution hits a certain threshold. Then, the regret of Thompson Sampling can be analyzed separately before and after hitting times. We believe that Theorem 4 is an elegant and fundamental property of Thompson Sampling and can be used to obtain other interesting results about Thompson Sampling. Note that a martingale property was also used in [20] and [15] to study the Bayesian multi-armed bandit problem, in the sense that the reward at the current state is the same as the expected reward over the distribution of the next state when a play is made in the current state. Their martingale property is very different from ours because theirs applies to the reward at the current state while ours involves the inverse of the posterior distribution.
2 Preliminaries In this section, we introduce two useful properties of Thompson Sampling that are essential to proving our upper bounds in Section 3: the Markov property and a martingale property. Their proofs are given in Appendix C. Throughout this paper, for a random variable Y , we will use the shorthand Et [Y ] for the conditional expectation E[Y |Ht ]. Moreover, we denote by Eθ [Y ] the expectation of Y when θ is the true underlying model, i.e. when Xi,t has distribution νi (θ). The notation Pθ [·] is similarly defined. Furthermore, we use the shorthand a ∧ b for min{a, b}. Lemma 3. (Markov Property) Regardless of the true underlying model, the stochastic process (pt )t≥1 is a Markov process. Theorem 4. (Martingale Property) When θ∗ ∈ Θ is the underlying model, the stochastic process (pt (θ∗ )−1 )t≥1 is a martingale with respect to the filtration (Ht )t≥1 . In the 2-Actions-And-2-Models case, the martingale property above implies the following lemma which will be used repeatedly in the proofs of our results. Its proof is given in Appendix C. Lemma 5. Consider the 2-Actions-And-2-Models case. Let A, B ∈ (0, 1) such that A > p1 (θ1 ) > B. Define the hitting times and hitting probabilities by τA = inf{t ≥ 1, pt (θ1 ) ≥ A},
$\tau_B = \inf\{t \ge 1,\ p_t(\theta_1) \le B\}$, $q_{A,B} = \mathbb{P}^{\theta_1}(\tau_A < \tau_B)$, and $q_{B,A} = \mathbb{P}^{\theta_1}(\tau_A > \tau_B)$.

Then $\tau_A < +\infty$ almost surely. Furthermore, assume that there exists a constant $\gamma > 0$ such that $p_{\tau_B}(\theta_1) \ge \gamma$; then
\[
q_{A,B} = \frac{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1}] - p_1(\theta_1)^{-1}}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1}] - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1}]}
\quad\text{and}\quad
q_{B,A} = \frac{p_1(\theta_1)^{-1} - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1}]}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1}] - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1}]}.
\]
Finally, $q_{B,A} \le \frac{B}{p_1(\theta_1)}$ and $q_{B,A} \le \frac{1 - p_1(\theta_1)}{A - B}$.
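The martingale property of Theorem 4 can be checked numerically: under the true model, the conditional expectation of $p_{t+1}(\theta^*)^{-1}$ given the history equals $p_t(\theta^*)^{-1}$. The following sketch verifies one step exactly, by averaging over the action draw and the Bernoulli reward in the 2-Actions-And-2-Models setting; the helper name and the numbers are ours for illustration only.

```python
import numpy as np

def expected_inverse_posterior(means, posterior, true_model):
    """E[1/p_{t+1}(theta*) | H_t] under the true model, computed exactly.

    Assumes the 2-Actions-And-2-Models setting in which arm i is optimal under
    model theta_i, so the probability of playing arm i equals posterior[i].
    """
    total = 0.0
    for arm, p_arm in enumerate(posterior):               # action sampled from the posterior
        for reward in (0, 1):                             # Bernoulli reward under the true model
            p_reward = means[true_model][arm] if reward == 1 else 1.0 - means[true_model][arm]
            lik = np.array([m[arm] if reward == 1 else 1.0 - m[arm] for m in means])
            new_posterior = posterior * lik / np.dot(posterior, lik)
            total += p_arm * p_reward / new_posterior[true_model]
    return total

means = [np.array([0.6, 0.4]), np.array([0.4, 0.6])]
posterior = np.array([0.3, 0.7])
print(expected_inverse_posterior(means, posterior, true_model=0))  # equals 1/0.3 = 3.333...
```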
3 Upper Bounds

In this section, we focus on the 2-Actions-And-2-Models case. We present and prove our results on the upper bounds for the frequentist regret of Thompson Sampling. The following smoothness assumption is needed only in this section, but not in the next section (for lower bounds).

Assumption 1. (Smoothness) There exists s > 1 such that, ν-almost surely, for i ∈ {1, 2},
\[
\frac{\ell_i(\theta_1)}{s} \le \ell_i(\theta_2) \le s \cdot \ell_i(\theta_1).
\]

Lemma 6. Under Assumption 1, regardless of whether θ1 or θ2 is the true underlying model, for any θ ∈ {θ1, θ2}, ν-almost surely,
\[
\frac{p_t(\theta)}{s} \le p_{t+1}(\theta) \le s \cdot p_t(\theta).
\]

Proof. This lemma is an immediate consequence of Assumption 1. See Appendix C for the proof.

While Assumption 1 does not hold for all distributions, it does hold for some important ones, such as Bernoulli distributions with means other than 0 and 1. On one hand, the assumption essentially avoids situations where one Bayesian update can change posteriors by too much (analogous to bounded gradients or rewards in most of the online-learning literature). On the other hand, smaller s-values in the assumption tend to create hard problems for Thompson Sampling, since models are less distinguishable. Therefore, the assumption does not remove much of the core difficulty in analyzing Thompson Sampling.

We are now ready to state our main upper bound results in Theorem 7. Its proof, given in Appendix D, relies on Propositions 8, 9 and 10, which we develop in the rest of this section. Although the proofs of the three propositions use similar analytic techniques, they differ in many important details. Due to space limitation, we only sketch the proofs of Propositions 8 and 10. Complete proofs of all three propositions are given in Appendices E–G.

Theorem 7. Consider the 2-Actions-And-2-Models case and assume that Assumption 1 holds. Then the regret of Thompson Sampling with prior $p_1$ satisfies
\[
R_T(\theta_1, TS(p_1)) \le 1490\, s \sqrt{\frac{T}{p_1(\theta_1)}}.
\]
Moreover, when $p_1(\theta_1) \ge 1 - \frac{1}{8s^2}$, we have
\[
R_T(\theta_1, TS(p_1)) \le 14560\, s^4 \sqrt{(1 - p_1(\theta_1))T}.
\]
Remark. The above upper bounds have the same dependence on T and $p_1(\theta_1)$ as the lower bounds in Section 4. Moreover, the leading constants (1490s and 14560s⁴) are both increasing functions of the problem-specific smoothness parameter s. Because problems with small s are generally hard for Thompson Sampling, our upper bounds are tight up to a universal constant for a fairly general class of hard problems. We believe that the dependence on s is an artifact of our proof techniques and can be removed to obtain tight upper bounds for all problem instances of the 2-Actions-And-2-Models case.

We introduce some notation. Let $\Delta = \mu_1(\theta_1) - \mu_2(\theta_1)$, $\Delta_1 = |\mu_1(\theta_1) - \mu_1(\theta_2)|$ and $\Delta_2 = |\mu_2(\theta_1) - \mu_2(\theta_2)|$. Obviously, $\Delta \le \Delta_1 + \Delta_2$. To simplify notation, define the regret function $R_T(\cdot)$ by $R_T(p_1(\theta_1)) = R_T(\theta_1, TS(p_1))$. Then $R_T(\cdot)$ is a decreasing function (see Lemma 16 in Appendix A for the proof) and one has
\[
R_T(p_1(\theta_1)) = R_T(\theta_1, TS(p_1)) = \Delta \cdot \mathbb{E}^{\theta_1}\left[\sum_{t=1}^{T} p_t(\theta_2)\right] \le \Delta T.
\]
Proposition 8. Consider the 2-Actions-And-2-Models case and assume that Assumption 1 holds. Then for any T > 0 and $p_1(\theta_1) \in (0, 1)$, we have
\[
R_T(p_1(\theta_1)) \le \left(96 \log\frac{3s}{2} + 6\right)\sqrt{\frac{T}{p_1(\theta_1)}} + R_T\!\left(\frac{1}{3}\right).
\]

Proof sketch. We consider the case where θ1 is the true reward-generating model and use the notation defined in Lemma 5. First, the desired inequality is trivial if $p_1(\theta_1) \ge \frac{1}{3}$, since $R_T(\cdot)$ is a decreasing function by Lemma 16 (Appendix A). Moreover, if $\Delta \le 2\sqrt{\frac{1}{p_1(\theta_1)T}}$, then $R_T(p_1(\theta_1)) \le \Delta T \le 2\sqrt{\frac{T}{p_1(\theta_1)}}$, which completes the proof. Therefore, we can assume that $p_1(\theta_1) \le \frac{1}{3}$ and $\Delta > 2\sqrt{\frac{1}{p_1(\theta_1)T}}$. Let $A = \frac{3}{2}p_1(\theta_1)$ and $B = \frac{1}{\Delta}\sqrt{\frac{p_1(\theta_1)}{T}}$. Then it is easy to see that $B \le \frac{1}{2}p_1(\theta_1) \le \frac{1}{2} \le 1 - A$.

Now, the first step is to upper bound $\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1]$. By Lemma 13(a) in Appendix A, we have for $t \le \tau_A \wedge \tau_B - 1$,
\[
\mathbb{E}_t^{\theta_1}\left[\log(p_t(\theta_1)^{-1}) - \log(p_{t+1}(\theta_1)^{-1})\right] \ge \frac{1}{2}p_t(\theta_1)p_t(\theta_2)^2\Delta_1^2 + \frac{1}{2}p_t(\theta_2)^3\Delta_2^2 \ge \frac{p_t(\theta_2)^2 B}{2}(\Delta_1^2 + \Delta_2^2) \ge \frac{B\Delta^2}{16}.
\]
In other words, $\big(\log(p_t(\theta_1)^{-1}) + t\frac{B\Delta^2}{16}\big)_{t \le \tau_A \wedge \tau_B}$ is a supermartingale. Applying Doob's optional stopping theorem to the stopping times $\sigma_1 = t \wedge \tau_A \wedge \tau_B$ and $\sigma_2 = 1$, and letting $t \to +\infty$ by using Lebesgue's dominated convergence theorem and the monotone convergence theorem, we have
\[
\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1] \le \frac{16}{B\Delta^2}\,\mathbb{E}^{\theta_1}\left[\log\frac{p_{\tau_A \wedge \tau_B}(\theta_1)}{p_1(\theta_1)}\right] \le \frac{16}{B\Delta^2}\log\frac{sA}{p_1(\theta_1)} = \frac{16}{B\Delta^2}\log\frac{3s}{2},
\]
where we have used Lemma 6 in the second-to-last step. Next, the regret of Thompson Sampling can be decomposed as follows:
\begin{align*}
R_T(p_1(\theta_1)) &= \Delta \cdot \mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1] + q_{B,A} \cdot \mathbb{E}^{\theta_1}\!\left[R_T(p_{\tau_B}(\theta_1)) \,\middle|\, \tau_A > \tau_B\right] + q_{A,B} \cdot \mathbb{E}^{\theta_1}\!\left[R_T(p_{\tau_A}(\theta_1)) \,\middle|\, \tau_A < \tau_B\right] \\
&\le \frac{16}{B\Delta}\log\frac{3s}{2} + \frac{B}{p_1(\theta_1)}\Delta T + R_T\!\left(\frac{3}{2}p_1(\theta_1)\right) = \left(16\log\frac{3s}{2} + 1\right)\sqrt{\frac{T}{p_1(\theta_1)}} + R_T\!\left(\frac{3}{2}p_1(\theta_1)\right),
\end{align*}
where in the second-to-last step we have used the facts that $q_{B,A} \le \frac{B}{p_1(\theta_1)}$ (by Lemma 5), $p_{\tau_A}(\theta_1) \ge A = \frac{3}{2}p_1(\theta_1)$, and $R_T(\cdot)$ is a decreasing function (by Lemma 16 in Appendix A). Because the above recurrence inequality holds for all $p_1(\theta_1) \le \frac{1}{3}$, simple calculations lead to the desired inequality.

Proposition 9. Consider the 2-Actions-And-2-Models case and assume that Assumption 1 holds. Then for any T > 0 and $p_1(\theta_1) \le \frac{1}{2}$, we have
\[
R_T(p_1(\theta_1)) \le \left(\frac{16s}{p_1(\theta_1)^2} + 1\right)\sqrt{T} + \frac{1}{2}R_T\!\left(\frac{p_1(\theta_1)}{2s}\right).
\]

Proposition 10. Consider the 2-Actions-And-2-Models case and assume that Assumption 1 holds. We also assume that $\Delta \ge \frac{1}{\sqrt{(1-p_1(\theta_1))T}}$ and define the function $Q_T(\cdot)$ by $Q_T(x) = R_T(1-x)$. Then for any T > 0 and $p_1(\theta_2) \le \frac{1}{8s^2}$, we have
\[
Q_T(p_1(\theta_2)) - Q_T\!\left(\frac{p_1(\theta_2)}{4s^2}\right) \le 360\, s^4 \sqrt{p_1(\theta_2)T} + \frac{4}{11s}\left[Q_T(4s^2 p_1(\theta_2)) - Q_T(p_1(\theta_2))\right].
\]
Proof sketch. We consider the case where θ1 is the true reward-generating model and use the notation defined in Lemma 5. Fix $p_1(\theta_1) \ge 1 - \frac{1}{8s^2}$ and let $A = 1 - \frac{1}{4s^2}(1 - p_1(\theta_1))$ and $B = 1 - 4s(1 - p_1(\theta_1))$. Then $A > p_1(\theta_1) > B \ge \frac{1}{2}$. The first step is to upper bound $\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1]$. By Lemma 13(b) in Appendix A, we have for $t \le \tau_A \wedge \tau_B - 1$,
\[
\mathbb{E}_t^{\theta_1}\left[(1 - p_{t+1}(\theta_1))^{-1} - (1 - p_t(\theta_1))^{-1}\right] \ge \frac{p_t(\theta_1)^2}{2p_t(\theta_2)}\Delta_1^2 + \frac{p_t(\theta_1)}{2}\Delta_2^2 \ge \frac{\Delta^2}{16}.
\]
In other words, $\big((1 - p_t(\theta_1))^{-1} - t\frac{\Delta^2}{16}\big)_{t \le \tau_A \wedge \tau_B}$ is a submartingale. Applying Doob's optional stopping theorem to the stopping times $\sigma_1 = t \wedge \tau_A \wedge \tau_B$ and $\sigma_2 = 1$, and letting $t \to +\infty$ by using Lebesgue's dominated convergence theorem and the monotone convergence theorem, one has
\[
\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1] \le \frac{16}{\Delta^2}\,\mathbb{E}^{\theta_1}\left[(1 - p_{\tau_A \wedge \tau_B}(\theta_1))^{-1}\right] \le \frac{16s}{\Delta^2(1 - A)},
\]
where we have used Lemma 6 in the last step. Next, the regret of Thompson Sampling can be decomposed as follows:
\begin{align*}
R_T(p_1(\theta_1)) &\le \Delta(1 - B)\,\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1] + q_{A,B} \cdot \mathbb{E}^{\theta_1}\!\left[R_T(p_{\tau_A}(\theta_1)) \,\middle|\, \tau_A < \tau_B\right] + q_{B,A} \cdot \mathbb{E}^{\theta_1}\!\left[R_T(p_{\tau_B}(\theta_1)) \,\middle|\, \tau_A > \tau_B\right] \\
&\le 256\, s^4 \sqrt{(1 - p_1(\theta_1))T} + q_{A,B} \cdot R_T\!\left(1 - \frac{1}{4s^2}(1 - p_1(\theta_1))\right) + q_{B,A} \cdot R_T\!\left(1 - 4s^2(1 - p_1(\theta_1))\right),
\end{align*}
where in the last step we have used the facts that $p_{\tau_A}(\theta_1) \ge A$, $p_{\tau_B}(\theta_1) = 1 - p_{\tau_B}(\theta_2) \ge 1 - s(1 - B)$ (by Lemma 6), and $R_T(\cdot)$ is a decreasing function (by Lemma 16 in Appendix A). Finally, we get the desired recurrence inequality in the statement by rearranging the newly obtained inequality and observing that, by Lemma 5,
\[
q_{B,A} \le \frac{1 - p_1(\theta_1)}{A - B} = \frac{1}{4s - \frac{1}{4s^2}} \le \frac{4}{15s} \le \frac{4}{15}.
\]
4 Lower Bounds

In this section, we provide lower bounds for the frequentist regret of Thompson Sampling in the 2-Actions-And-2-Models case. The proof of Theorem 11 is given in Appendix H.

Theorem 11. Consider the 2-Actions-And-2-Models case. Let $p_1$ be a prior distribution and $T \ge \frac{1}{1 - p_1(\theta_1)}$. Consider the following specific problem instance with Bernoulli reward distributions:
\[
\nu_1(\theta_1) = \nu_1(\theta_2) = \mathrm{Bern}\!\left(\tfrac{1}{2}\right), \qquad \nu_2(\theta_1) = \mathrm{Bern}\!\left(\tfrac{1}{2} - \Delta\right), \qquad \nu_2(\theta_2) = \mathrm{Bern}\!\left(\tfrac{1}{2} + \Delta\right),
\]
where $\Delta = \sqrt{\frac{1}{8(1 - p_1(\theta_1))T}}$. Then the regret of Thompson Sampling with prior $p_1$ satisfies
\[
R_T(\theta_1, TS(p_1)) \ge \frac{1}{10\sqrt{2}}\sqrt{(1 - p_1(\theta_1))T}.
\]

Theorem 12. Consider the 2-Actions-And-2-Models case. Let $p_1$ be a prior distribution and $T \ge \frac{1}{p_1(\theta_1)}$. Consider the following specific problem instance:
\[
\nu_1(\theta_1) = \mathrm{Bern}\!\left(\tfrac{1}{2} + \Delta\right), \qquad \nu_1(\theta_2) = \mathrm{Bern}\!\left(\tfrac{1}{2} - \Delta\right), \qquad \nu_2(\theta_1) = \nu_2(\theta_2) = \mathrm{Bern}\!\left(\tfrac{1}{2}\right),
\]
where $\Delta = \sqrt{\frac{1}{8 p_1(\theta_1)T}}$. Then, if $p_1(\theta_1) \le \frac{1}{2}$, the regret of Thompson Sampling with prior $p_1$ satisfies
\[
R_T(\theta_1, TS(p_1)) \ge \frac{1}{168\sqrt{2}}\sqrt{\frac{T}{p_1(\theta_1)}}.
\]
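For intuition, the two hard instances above are straightforward to instantiate and feed into the earlier simulation sketch; the helper below only constructs the Bernoulli means with the gaps $\Delta$ prescribed by Theorems 11 and 12 (the horizon and prior mass in the example call are arbitrary illustrative values).

```python
import numpy as np

def hard_instances(T, p1_theta1):
    """Bernoulli arm means for the lower-bound instances of Theorems 11 and 12."""
    gap11 = np.sqrt(1.0 / (8.0 * (1.0 - p1_theta1) * T))   # Delta in Theorem 11
    gap12 = np.sqrt(1.0 / (8.0 * p1_theta1 * T))            # Delta in Theorem 12
    instance_thm11 = [np.array([0.5, 0.5 - gap11]),         # theta_1: arm 2 slightly worse
                      np.array([0.5, 0.5 + gap11])]         # theta_2: arm 2 slightly better
    instance_thm12 = [np.array([0.5 + gap12, 0.5]),         # theta_1: arm 1 slightly better
                      np.array([0.5 - gap12, 0.5])]         # theta_2: arm 1 slightly worse
    return instance_thm11, instance_thm12

inst11, inst12 = hard_instances(T=10000, p1_theta1=0.05)
```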
Proof. Let $A = \frac{3}{2}p_1(\theta_1)$. Recall that $\tau_A = \inf\{t \ge 1,\ p_t(\theta_1) \ge A\}$. Using Lemma 13(c) and Lemma 15, one has for $t \le \tau_A - 1$,
\begin{align*}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right] &\le \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)p_t(\theta_2)\,\mathbb{E}^{\theta_1}\!\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right] \\
&= p_t(\theta_1)^2 p_t(\theta_2)\,\mathbb{E}^{\theta_1}\!\left[\frac{\ell_1(\theta_1)(X_{1,t})}{\ell_1(\theta_2)(X_{1,t})} - 1\right] \le 32A^2\Delta^2 = 72\,p_1(\theta_1)^2\Delta^2.
\end{align*}
Therefore, $\big(p_t(\theta_1) - 72\,p_1(\theta_1)^2\Delta^2\, t\big)_{t \le \tau_A}$ is a supermartingale. Now, using Doob's optional stopping theorem, one has for any $t \ge 1$,
\[
\mathbb{E}^{\theta_1}\left[p_{t \wedge \tau_A \wedge T}(\theta_1) - (t \wedge \tau_A \wedge T)\, 72\,p_1(\theta_1)^2\Delta^2\right] \le p_1(\theta_1) - 72\,p_1(\theta_1)^2\Delta^2.
\]
Moreover, using Lebesgue's dominated convergence theorem and the monotone convergence theorem,
\[
\mathbb{E}^{\theta_1}\left[p_{t \wedge \tau_A \wedge T}(\theta_1) - (t \wedge \tau_A \wedge T)\,72\,p_1(\theta_1)^2\Delta^2\right] \longrightarrow \mathbb{E}^{\theta_1}\left[p_{\tau_A \wedge T}(\theta_1) - (\tau_A \wedge T)\,72\,p_1(\theta_1)^2\Delta^2\right]
\]
as $t \to +\infty$. Hence,
\[
\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{1}{72\,p_1(\theta_1)^2\Delta^2}\,\mathbb{E}^{\theta_1}\left[p_{\tau_A \wedge T}(\theta_1) - p_1(\theta_1)\right].
\]
On one side, if $\mathbb{P}^{\theta_1}(\tau_A \wedge T = T) \ge \frac{1}{21}$, then $\mathbb{E}^{\theta_1}[\tau_A \wedge T] \ge \mathbb{P}^{\theta_1}(\tau_A \wedge T = T)\,T \ge \frac{T}{21}$. On the other side, if $\mathbb{P}^{\theta_1}(\tau_A \wedge T = \tau_A) \ge \frac{20}{21}$, then $\mathbb{E}^{\theta_1}[p_{\tau_A \wedge T}(\theta_1)] \ge \mathbb{P}^{\theta_1}(\tau_A \wedge T = \tau_A)\,A \ge \frac{10}{7}p_1(\theta_1)$ and thus
\[
\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{1}{72\,p_1(\theta_1)^2\Delta^2}\left(\frac{10}{7}p_1(\theta_1) - p_1(\theta_1)\right) = \frac{T}{21}.
\]
In both cases, we have $\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{T}{21}$. Finally, one has
\[
R_T(\theta_1, TS(p_1)) = \Delta\,\mathbb{E}^{\theta_1}\left[\sum_{t=1}^{T}(1 - p_t(\theta_1))\right] \ge \Delta\,\mathbb{E}^{\theta_1}\left[\sum_{t=1}^{\tau_A \wedge T - 1}(1 - p_t(\theta_1))\right] \ge \Delta(1 - A)\,\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{\Delta T}{84} = \frac{1}{168\sqrt{2}}\sqrt{\frac{T}{p_1(\theta_1)}},
\]
where we have used the fact that $1 - A = 1 - \frac{3}{2}p_1(\theta_1) \ge \frac{1}{4}$.
5 Conclusions

In this work, we studied an important aspect of the popular Thompson Sampling strategy for stochastic bandits: its sensitivity to the prior. Focusing on a special yet representative case, we fully characterized the worst-case dependence of regret on the prior, both for the good- and bad-prior cases, with matching upper and lower bounds. The lower bounds are also extended to a more general case, quantifying the inherent sensitivity of the algorithm when the prior is poor. These results suggest a few interesting directions for future work, only three of which are given here. One is to close the gap between upper and lower bounds for the general case. We conjecture that a tighter upper bound is likely to match the lower bound in Corollary 2. The second is to consider prior sensitivity for structured stochastic bandits, where models in Θ are related in certain ways. For example, in the discretized version of the multi-armed bandit problem considered by [10], the prior probability mass of the true model is exponentially small when a uniform prior is used, but a strong frequentist regret bound is still possible. Sensitivity analysis for such problems can provide useful insights and guidance for applications of Thompson Sampling. Finally, it remains open whether there exists an algorithm whose worst-case regret bounds are better than those of Thompson Sampling for any range of $p_1(\theta^*)$, with θ* being the true underlying model. We conjecture that the answer is negative, especially in the 2-Actions-And-2-Models case.
References

[1] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[2] Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011.
[3] W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[4] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 2011.
[5] T. Graepel, J. Quiñonero Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML), pages 13–20, 2010.
[6] B. C. May, N. Korda, A. Lee, and D. S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13:2069–2106, 2012.
[7] S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.
[8] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In Proceedings of the Thirty-First International Conference on Machine Learning (ICML), pages 100–108, 2014.
[9] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.
[10] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
[11] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT), 2012.
[12] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[13] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
[14] L. Li. Generalized Thompson sampling for contextual bandits. Technical Report MSR-TR-2013-136, Microsoft Research, 2013.
[15] S. Guha and K. Munagala. Stochastic regret minimization via Thompson sampling. In Proceedings of the 27th Annual Conference on Learning Theory (COLT), 2014.
[16] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[17] S. Bubeck and C.-Y. Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 2013.
[18] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 2014.
[19] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[20] S. Guha and K. Munagala. Approximation algorithms for Bayesian multi-armed bandit problems. arXiv preprint arXiv:1306.3525v2, 2013.
A Technical Lemmas
Lemma 13. Consider the 2-Actions-And-2-Models case. We have the following equalities and inequalities concerning various functionals of the process $(p_t(\theta_1))_{t \ge 1}$.

(a) For $t \ge 1$,
\[
\mathbb{E}_t^{\theta_1}\left[\log(p_t(\theta_1)^{-1}) - \log(p_{t+1}(\theta_1)^{-1})\right] = \sum_{i \in \{1,2\}} p_t(\theta_i)\,\mathrm{KL}\big(\nu_i(\theta_1),\, p_t(\theta_1)\nu_i(\theta_1) + p_t(\theta_2)\nu_i(\theta_2)\big) \ge \frac{1}{2}\sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_2)^2\,|\mu_i(\theta_1) - \mu_i(\theta_2)|^2.
\]

(b) For $t \ge 1$,
\[
\mathbb{E}_t^{\theta_1}\left[(1 - p_{t+1}(\theta_1))^{-1} - (1 - p_t(\theta_1))^{-1}\right] = \sum_{i \in \{1,2\}} p_t(\theta_i)\,\frac{p_t(\theta_1)}{p_t(\theta_2)}\,\mathbb{E}^{\theta_1}\!\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right] \ge \frac{p_t(\theta_1)^2}{2p_t(\theta_2)}\,|\mu_1(\theta_1) - \mu_1(\theta_2)|^2 + \frac{p_t(\theta_1)}{2}\,|\mu_2(\theta_1) - \mu_2(\theta_2)|^2.
\]

(c) For $t \ge 1$, $\mathbb{E}^{\theta_1}[p_{t+1}(\theta_1)] \ge \mathbb{E}^{\theta_1}[p_t(\theta_1)]$ and
\[
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right] \le \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)p_t(\theta_2)\,\mathbb{E}^{\theta_1}\!\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right].
\]

(d) $R_T(\theta_1, TS(p_1)) \le \Delta T(1 - p_1(\theta_1))$.

Proof. Recall that
\[
p_{t+1}(\theta_1) = \frac{p_t(\theta_1)\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})}, \qquad
p_{t+1}(\theta_2) = \frac{p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})}{p_t(\theta_1)\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})},
\]
and $I_t = i$ with probability $p_t(\theta_i)$ for $i \in \{1, 2\}$. We carry out the following computations.
(a)
\begin{align*}
\mathbb{E}_t^{\theta_1}\left[\log(p_t(\theta_1)^{-1}) - \log(p_{t+1}(\theta_1)^{-1})\right]
&= \mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})}\right] \\
&= \sum_{i \in \{1,2\}} p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_i(\theta_1)(X_{i,t})}{p_t(\theta_1)\ell_i(\theta_1)(X_{i,t}) + p_t(\theta_2)\ell_i(\theta_2)(X_{i,t})}\right] \\
&= \sum_{i \in \{1,2\}} p_t(\theta_i)\,\mathrm{KL}\big(\nu_i(\theta_1),\, p_t(\theta_1)\nu_i(\theta_1) + p_t(\theta_2)\nu_i(\theta_2)\big) \\
&\ge \frac{1}{2}\sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_2)^2\,|\mu_i(\theta_1) - \mu_i(\theta_2)|^2,
\end{align*}
where the last step follows from Lemma 14.

(b)
\begin{align*}
\mathbb{E}_t^{\theta_1}\left[(1 - p_{t+1}(\theta_1))^{-1} - (1 - p_t(\theta_1))^{-1}\right]
&= \mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_2)^{-1} - p_t(\theta_2)^{-1}\right] \\
&= \mathbb{E}_t^{\theta_1}\left[\frac{p_t(\theta_1)\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})}{p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})}\right] - \frac{1}{p_t(\theta_2)} \\
&= \frac{p_t(\theta_1)}{p_t(\theta_2)}\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{\ell_{I_t}(\theta_2)(X_{I_t,t})} - 1\right] \\
&= \sum_{i \in \{1,2\}} p_t(\theta_i)\,\frac{p_t(\theta_1)}{p_t(\theta_2)}\,\mathbb{E}^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right] \\
&\ge \sum_{i \in \{1,2\}} p_t(\theta_i)\,\frac{p_t(\theta_1)}{p_t(\theta_2)}\,\mathbb{E}^{\theta_1}\left[\log\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})}\right] \\
&= \frac{p_t(\theta_1)^2}{p_t(\theta_2)}\,\mathrm{KL}(\nu_1(\theta_1), \nu_1(\theta_2)) + p_t(\theta_1)\,\mathrm{KL}(\nu_2(\theta_1), \nu_2(\theta_2)) \\
&\ge \frac{p_t(\theta_1)^2}{2p_t(\theta_2)}\,|\mu_1(\theta_1) - \mu_1(\theta_2)|^2 + \frac{p_t(\theta_1)}{2}\,|\mu_2(\theta_1) - \mu_2(\theta_2)|^2,
\end{align*}
where we have used the inequality $x - 1 \ge \log x$ and the last step follows from Lemma 14.

(c)
\begin{align*}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right]
&= p_t(\theta_1)\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\ell_{I_t}(\theta_2)(X_{I_t,t})} - 1\right] \\
&= \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{p_t(\theta_1)\ell_i(\theta_1)(X_{i,t}) + p_t(\theta_2)\ell_i(\theta_2)(X_{i,t})} - 1\right].
\end{align*}
On one hand, using the inequality $x - 1 \ge \log x$, we have
\begin{align*}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right]
&\ge \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)\,\mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_i(\theta_1)(X_{i,t})}{p_t(\theta_1)\ell_i(\theta_1)(X_{i,t}) + p_t(\theta_2)\ell_i(\theta_2)(X_{i,t})}\right] \\
&= \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)\,\mathrm{KL}\big(\nu_i(\theta_1),\, p_t(\theta_1)\nu_i(\theta_1) + p_t(\theta_2)\nu_i(\theta_2)\big) \ge 0.
\end{align*}
On the other hand, using Jensen's inequality on the convex function $x \mapsto x^{-1}$, one has
\begin{align*}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right]
&\le \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)\,\mathbb{E}_t^{\theta_1}\left[\ell_i(\theta_1)(X_{i,t})\left(\frac{p_t(\theta_1)}{\ell_i(\theta_1)(X_{i,t})} + \frac{p_t(\theta_2)}{\ell_i(\theta_2)(X_{i,t})}\right) - 1\right] \\
&= \sum_{i \in \{1,2\}} p_t(\theta_i)p_t(\theta_1)p_t(\theta_2)\,\mathbb{E}^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right].
\end{align*}

(d) By definition of the regret and part (c), one has
\[
R_T(\theta_1, TS(p_1)) = \Delta\,\mathbb{E}^{\theta_1}\left[\sum_{t=1}^{T} p_t(\theta_2)\right] = \Delta\,\mathbb{E}^{\theta_1}\left[\sum_{t=1}^{T}(1 - p_t(\theta_1))\right] \le \Delta T(1 - p_1(\theta_1)).
\]
Lemma 14. Let $\alpha \in [0, 1]$, and let $\nu_1$ and $\nu_2$ be two probability distributions on [0, 1] with means $\mu_1$ and $\mu_2$. Then we have
\[
\mathrm{KL}(\nu_1, \alpha\nu_1 + (1 - \alpha)\nu_2) \ge \frac{(1 - \alpha)^2\,|\mu_1 - \mu_2|^2}{2}.
\]

Proof. Let $\nu_1$ and $\nu_2$ be absolutely continuous with respect to some measure $v$ with density functions $\ell_1$ and $\ell_2$. On one hand, by Pinsker's inequality, we have
\[
\mathrm{KL}(\nu_1, \alpha\nu_1 + (1 - \alpha)\nu_2) \ge \frac{1}{2}\left(\int_0^1 |\ell_1(x) - \alpha\ell_1(x) - (1 - \alpha)\ell_2(x)|\, dv(x)\right)^2 .
\]
On the other hand,
\[
|\mu_1 - \mu_2| = \left|\int_0^1 (\ell_1(x)x - \ell_2(x)x)\, dv(x)\right| \le \int_0^1 |\ell_1(x) - \ell_2(x)|\, dv(x),
\]
which completes the proof.

Lemma 15. Let $-\frac{1}{\sqrt{8}} \le \Delta \le \frac{1}{\sqrt{8}}$. Let $\ell_1$ and $\ell_2$ be the density functions of the Bernoulli distributions $\mathrm{Bern}(\frac{1}{2} + \Delta)$ and $\mathrm{Bern}(\frac{1}{2} - \Delta)$ with respect to the counting measure on [0, 1]. Then
\[
\mathbb{E}_{X \sim \mathrm{Bern}(\frac{1}{2} + \Delta)}\left[\frac{\ell_1(X)}{\ell_2(X)} - 1\right] \le 32\Delta^2.
\]

Proof. The result follows from the following straightforward computation:
\begin{align*}
\mathbb{E}_{X \sim \mathrm{Bern}(\frac{1}{2} + \Delta)}\left[\frac{\ell_1(X)}{\ell_2(X)} - 1\right]
&= \left(\frac{1}{2} + \Delta\right)\left(\frac{\frac{1}{2} + \Delta}{\frac{1}{2} - \Delta} - 1\right) + \left(\frac{1}{2} - \Delta\right)\left(\frac{\frac{1}{2} - \Delta}{\frac{1}{2} + \Delta} - 1\right) \\
&= \left(\frac{1}{2} + \Delta\right)\frac{2\Delta}{\frac{1}{2} - \Delta} + \left(\frac{1}{2} - \Delta\right)\frac{-2\Delta}{\frac{1}{2} + \Delta} \\
&= 2\Delta\left(\frac{\frac{1}{2} + \Delta}{\frac{1}{2} - \Delta} - \frac{\frac{1}{2} - \Delta}{\frac{1}{2} + \Delta}\right) = \frac{4\Delta^2}{\frac{1}{4} - \Delta^2} \le 32\Delta^2.
\end{align*}
Lemma 16. Recall that in Section 3, we defined the regret function RT (·) in the 2-Actions-And-2Models case by RT (p1 (θ1 )) = RT (θ1 , T S(p1)). Then RT is a decreasing function of p1 (θ1 ). Proof. The proof is inspired by the dynamic-programming argument used in Section 3 of [15]. We (i) assume that θ1 is the true reward-generating model. For arm i ∈ {1, 2}, define RT (α) as the regret of the policy that starts with the prior p1 = (α, 1−α), plays arm i for the first step, and then executes Thompson Sampling for the remaining T − 1 steps. It is easy to see that (2)
(1)
RT (α) = αRT (α) + (1 − α)RT (α) .
(1)
We now prove by induction on T that RT (·) is a decreasing function. For the base case of T = 1, R(α) = 1 − α is obviously decreasing. Now, suppose Rt (·) is decreasing for all t < T , and we will show that RT (·) is also decreasing. The proof proceeds in three main steps. (1)
(2)
Step One: This step is devoted to showing that both RT and RT are decreasing functions of α. By definition, we have αℓ1 (θ1 )(Z) (1) RT (α) = EZ∼µ1 (θ1 ) RT −1 αℓ1 (θ1 )(Z) + (1 − α)ℓ1 (θ2 )(Z) αℓ2 (θ1 )(Z) (2) . RT (α) = ∆ + EZ∼µ2 (θ1 ) RT −1 αℓ2 (θ1 )(Z) + (1 − α)ℓ2 (θ2 )(Z) 12
Since RT −1 (α) is decreasing with α ∈ (0, 1), it follows that ℓ1 (θ1 )(Z) (1) RT (α) = EZ∼µ1 (θ1 ) RT −1 ℓ1 (θ1 )(Z) + (1/α − 1)ℓ1 (θ2 )(Z) (2)
is a decreasing function of α. Similarly, RT (α) is also a decreasing function. (1)
(2)
Step Two: This step is to show that the functions RT and RT satisfy (1)
(2)
RT (α) ≤ RT (α)
(2)
for any T and α ∈ (0, 1). We prove the claim by mathematical induction on T . The base case (1) (2) (1) (2) where T = 1 is trivial, since R1 (α) ≡ 0 and R1 (α) ≡ ∆. Now suppose Rt (α) ≤ Rt (α) (1) (2) for all t < T . Then for every t < T , Rt (α) ≤ Rt (α) ≤ Rt (α), because of Equation 1 and the induction hypothesis. It follows that, αℓ1 (θ1 )(Z) (1) RT (α) = EZ∼µ1 (θ1 ) RT −1 αℓ1 (θ1 )(Z) + (1 − α)ℓ1 (θ2 )(Z) αℓ1 (θ1 )(Z) (2) ≤ EZ∼µ1 (θ1 ) RT −1 αℓ1 (θ1 )(Z) + (1 − α)ℓ1 (θ2 )(Z) αℓ1 (θ1 )(Z)ℓ2 (θ1 )(Z ′ ) = ∆ + EZ∼µ1 (θ1 ) EZ ′ ∼µ2 (θ1 ) RT −2 , αℓ1 (θ1 )(Z)ℓ2 (θ1 )(Z ′ ) + (1 − α)ℓ1 (θ2 )(Z)ℓ2 (θ2 )(Z ′ ) and that
αℓ2 (θ1 )(Z) = ∆ + EZ∼µ2 (θ1 ) RT −1 αℓ2 (θ1 )(Z) + (1 − α)ℓ2 (θ2 )(Z) αℓ2 (θ1 )(Z) (1) ∆ + EZ∼µ2 (θ1 ) RT −1 αℓ2 (θ1 )(Z) + (1 − α)ℓ2 (θ2 )(Z) αℓ1 (θ1 )(Z ′ )ℓ2 (θ1 )(Z) ′ . ∆ + EZ∼µ2 (θ1 ) EZ ∼µ1 (θ1 ) RT −2 αℓ1 (θ1 )(Z ′ )ℓ2 (θ1 )(Z) + (1 − α)ℓ1 (θ2 )(Z ′ )ℓ2 (θ2 )(Z)
(2) RT (α)
≥ =
(2)
(1)
Thus, RT (α) ≥ RT (α) by Fubini’s theorem. Step Three: This step finishes the induction step, based on results established in the previous two steps. For any 0 < α < β < 1, we have RT (β)
(1)
(2)
(1)
(2)
(1)
(2)
= βRT (β) + (1 − β)RT (β) ≤ βRT (α) + (1 − β)RT (α) ≤ αRT (α) + (1 − α)RT (α) = RT (α) , (i)
where the equalities are from Equation 1, the first inequality is from the monotonicity of RT (·) established in Step One, and the second inequality is from Equation 2. We have thus proved that Rt (·) is a decreasing function for t = T , and finished the inductive step.
B Proof of Corollary 2 Proof. Let pe1 be the prior over {θ1 , θ2 } defined as pe1 (θ1 ) = p1 (θ∗ ) and pe1 (θ2 ) = p1 (Θ\{θ∗ }). By Theorem 1, there exists a 2-Actions-And-2-Models problem instance q P (defined by νi (θj ), i, j =
1, 2) where the regret of Thompson Sampling with prior pe1 is Ω(
T p e1 (θ1 ) ) ∗
for small pe1 (θ1 ). Now
consider the problem instance Q for the general Θ case defined as νi (θ ) = νi (θ1 ) for i = 1, 2 and νi (θ) = νi (θ2 ) for i = 1, 2 and θ ∈ Θ\{θ∗ }. It is easy to see that Thompson Sampling with prior p1 under Q has exactly the same regret as Thompson Sampling with prior q qpe1 under P. Thus, under T T ∗ Q, the regret of Thompson Sampling with prior p1 is Ω( pe1 (θ1 ) ) = Ω( p1 (θ ∗ ) ) for small p1 (θ ). p The Ω( (1 − p1 (θ∗ ))T ) lower bound for large p1 (θ∗ ) can be similarly obtained. 13
C
Proof of Lemma 3, Lemma 4, Lemma 5 and Lemma 6
Proof. of Lemma 3. Let θ∗ be the true underlying model. Recall that pt (θ)ℓIt (θ)(XIt ,t ) . η∈Θ pt (η)ℓIt (η)(XIt ,t )
pt+1 (θ) = P
Note that It is drawn from pt independent of the past and Xi,t is drawn from µi (θ∗ ). Hence, the distribution of pt+1 only depends on pt and µi (θ), i = 1, . . . , K, θ ∈ Θ. The reward distributions µi (θ) are fixed before the evolution of the process pt . Thus, the distribution of pt+1 only depends on pt , not on ps , s = 1, . . . , t − 1. This shows that pt is a Markov process. Proof. of Lemma 4. First recall that conditioned on Ht , pt is deterministic. Then one has P η∈Θ pt (η)ℓIt (η)(XIt ,t ) θ∗ ∗ −1 θ∗ Et [pt+1 (θ ) ] = Et pt (θ∗ )ℓIt (θ∗ )(XIt ,t ) P K X ∗ η∈Θ pt (η)ℓi (η)(Xi,t ) θ θ∗ Pt (It = i)Et = pt (θ∗ )ℓi (θ∗ )(Xi,t ) i=1 Z P K X η∈Θ pt (η)ℓi (η)(x) θ∗ θ∗ ∗ Pt (It = i)Et = ℓ (θ )(x) dν(x) i pt (θ∗ )ℓi (θ∗ )(x) i=1 Z X K X ∗ ∗ Pθt (It = i)Eθt pt (η)ℓi (η)(x) dν(x) = pt (θ∗ )−1 i=1
=
pt (θ∗ )−1
K X
η∈Θ
∗
Pθt (It = i) = pt (θ∗ )−1 ,
i=1
where the second last equality follows from the fact that
R
ℓi (η)(x) dν(x) = 1 for any η ∈ Θ.
Proof. of Lemma 5. In this proof, we consider the case where θ1 is the true reward-generating model. We first argue that τA < +∞ almost surely. Define the event E = {τA = +∞}. Under the event E, pt (θ1 ) is always upper bounded by A for any t. Thus RT (θ1 , T S(p1 )) = ∆ · Eθ1
T X t=1
pt (θ2 ) ≥ Pθ1 (E)∆(1 − A)T.
It follows that ¯ T (T S(p1 )) ≥ p1 (θ1 ) RT (θ1 , T S(p1 )) ≥ p1 (θ1 )Pθ1 (E)∆(1 − A)T. R
¯ T (T S(p1 )) is always upper bounded by However, it was proven in [17] that the Bayes risk R √ O( T ). Therefore we must have Pθ1 (E) = 0; that is τA < +∞ almost surely. This implies that pτA ∧τB (θ1 ) is well defined and qA,B + qB,A = 1. Now, by Theorem 4, (pt (θ1 )−1 )t≥1 is a martingale. It is easy to verify that τA and τB are both stopping times with respect to the filtration (Ht )t≥1 . Then it follows from Doob’s optional stopping theorem that for any t, Eθ1 [pt∧τA ∧τB (θ1 )−1 ] = p1 (θ1 )−1 . Moreover, for any t ≥ 1, pt∧τA ∧τB (θ1 )−1 ≤ γ −1 (Note that by definition, γ ≤ B). Hence, by Lebesgue’s dominated convergence theorem, Eθ1 [pt∧τA ∧τB (θ1 )−1 ] −→ Eθ1 [pτA ∧τB (θ1 )−1 ] as t → +∞. Thus, p1 (θ1 )−1 = Eθ1 [pτA ∧τB (θ1 )−1 ] = qA,B Eθ1 [pτA (θ1 )−1 ] + qB,A Eθ1 [pτB (θ1 )−1 ]. The above equality combined with qA,B + qB,A = 1 gives the desired expressions for qA,B and qB,A . Finally, we have qB,A =
p1 (θ1 )−1 − Eθ1 [pτA (θ1 )−1 ] p1 (θ1 )−1 B ≤ ≤ −1 −1 θ −1 θ θ 1 1 1 E [pτB (θ1 ) ] − E [pτA (θ1 ) ] E [pτB (θ1 ) ] p1 (θ1 ) 14
and qB,A =
p1 (θ1 )−1 − Eθ1 [pτA (θ1 )−1 ] θ 1 E [pτB (θ1 )−1 ] − Eθ1 [pτA (θ1 )−1 ]
≤
p1 (θ1 )−1 − 1 AB 1 − p1 (θ1 ) 1 − p1 (θ1 ) = ≤ . −1 −1 B −A p1 (θ1 ) A − B A−B
Proof. of Lemma 6. Without loss of generality, we assume that θ = θ1 . Recall that pt+1 (θ1 ) pt (θ1 )
= =
ℓIt (θ1 )(XIt ,t ) pt (θ1 )ℓIt (θ1 )(XIt ,t ) + pt (θ2 )ℓIt (θ2 )(XIt ,t ) 1 ℓ (θ )(X
)
,t pt (θ1 ) + pt (θ2 ) ℓIIt (θ12 )(XIIt ,t ) t
t
Therefore we have 1 pt+1 (θ1 ) 1 1 ≤ ≤ ≤ ≤ s, s pt (θ1 ) + pt (θ2 )s pt (θ1 ) pt (θ1 ) + pt (θ2 ) 1s
which completes the proof.
D
Proof of Theorem 7
Proof of the First Inequality: Let β = 96 log 3s 2 + 6. By Proposition 8 and Proposition 9, √ √ 1 1 1 √ 1 1 1 RT ≤ (144s + 1) T + RT ≤ (144s + 1) T + β 6sT + RT 3 2 6s 2 2 3 Therefore,
√ √ 1 T. ≤ 288s + β 6s + 2 RT 3
Using again Proposition 8, one has for any p1 (θ1 ) ∈ (0, 1), s 1 T RT (p1 (θ1 )) ≤ β + RT p1 (θ1 ) 3 s √ √ T ≤ β + 288s + β 6s + 2 T p1 (θ1 ) s s √ T T ≤ β + 288s + β 6s + 2 p1 (θ1 ) p1 (θ1 ) s s √ T T ≤ 1490s ≤ 288s + β( 6s + 1) + 2 p1 (θ1 ) p1 (θ1 )
√ √ √ 6s + 1 ≤ 4 s where the last step follows from the inequalities β = 96 log 3s 2 + 6 ≤ 300 s and for s > 1. 1 Proof of the Second Inequality: Fix p1 (θ1 ) ≥ 1 − 8s12 . First, if ∆ ≤ √ , then (1−p1 (θ1 ))T p by Lemma 13(d), RT (p1 (θ1 )) ≤ (1 − p1 (θ1 ))∆T ≤ (1 − p1 (θ1 ))T . Hence, we can assume 1 . It follows from Proposition 10 that for any integer h ≥ 1, as long as that ∆ ≥ √ (1−p1 (θ1 ))T
15
(4s2 )h−1 p1 (θ2 ) ≤
1 8s2 ,
one has
1 p1 (θ2 )) 4s2 h h−1 q X 4 k 4 360s4 (4s2 )k p1 (θ2 )T + QT ((4s2 )h p1 (θ2 )) ≤ 11s 11s k=0 h h−1 X 8 k p 4 ≤ 360s4 p1 (θ2 )T + QT ((4s2 )h p1 (θ2 )) 11 11s k=0 h p 4 4 QT ((4s2 )h p1 (θ2 )) . ≤ 1320s p1 (θ2 )T + 11s QT (p1 (θ2 )) − QT (
Let h be the smallest integer such that (4s2 )h p1 (θ2 ) > 8s12 . On one hand, (4s2 )h−1 p1 (θ2 ) ≤ 8s12 implies that 1 − (4s2 )h p1 (θ2 ) ≥ 12 . Using the first inequality of Theorem 7 and the fact that the function RT (·) is decreasing, one has √ 1 QT ((4s2 )h p1 (θ2 )) = RT (1 − (4s2 )h p1 (θ2 )) ≤ RT ( ) ≤ 1490s 2T . 2 √ p 1 4 h 1 h 2 h On the other hand, (4s ) p1 (θ2 ) > 8s2 implies that 2 2s p1 (θ2 ) > 2s > 11s . Hence, for p1 (θ2 ) ≤ 8s12 , p p 1 QT (p1 (θ2 )) − QT ( 2 p1 (θ2 )) ≤ (1320s4 + 5960s2) p1 (θ2 )T ≤ 7280s4 p1 (θ2 )T . 4s Thus, for any integer m, one has s k m m−1 X 1 1 4 p1 (θ2 )T + QT p1 (θ2 ) QT (p1 (θ2 )) ≤ 7280s 4s2 4s2 k=0 m m−1 X 1 k p 1 7280s4 p1 (θ2 )T + RT 1 − p (θ ) ≤ 1 2 2 4s2 k=0 m p 1 ≤ 14560s4 p1 (θ2 )T + p1 (θ2 )∆T 4s2 where we have used Lemma p 13(d) in the last step. Finally, letting mpgo to infinity, we get QT (p1 (θ2 )) ≤ 14560s4 p1 (θ2 )T , that is, RT (p1 (θ1 )) ≤ 14560s4 (1 − p1 (θ1 ))T .
E Proof of Proposition 8
Proof. In this proof, we consider the case where θ1 is the true reward-generating model. We use the notation defined in Lemma 5. First, the desired inequality is trivial if p1 (θ1 ) ≥ 13 since RT (·) is a decreasing function by Lemma 16. Let p1 (θ1 ) ≤ 13 , A = 32 p1 (θ1 ) and take B > 0 such that B ≤ 12 p1 (θ1 ). The exact value of B will be specified later. It is easy to see that A ≤ 21 and B ≤ 21 ≤ 1 − A. We decompose the rest of the proof into three steps. Step One: This step is devoted to upper bounding Eθ1 [τA ∧ τB − 1]. Note that by the definition of τA and τB , one has for t ≤ τA ∧ τB − 1, B ≤ pt (θ1 ) ≤ A ≤ 21 and pt (θ2 ) ≥ 1 − A ≥ 21 ≥ B. Thus, by Lemma 13(a), we have for t ≤ τA ∧ τB − 1, 1 1 Eθt 1 log(pt (θ1 )−1 ) − log(pt+1 (θ1 )−1 ) ≥ pt (θ1 )pt (θ2 )2 ∆21 + pt (θ2 )3 ∆22 2 2 B∆2 pt (θ2 )2 B 2 2 (∆1 + ∆2 ) ≥ , ≥ 2 16 where we have used ∆21 + ∆22 ≥ 12 (∆1 + ∆2 )2 ≥ 12 ∆2 . Rearranging, we get for t ≤ τA ∧ τB − 1, B∆2 B∆2 θ1 −1 ≤ log(pt (θ1 )−1 ) + t . Et log(pt+1 (θ1 ) ) + (t + 1) 16 16 16
2 In other words, log(pt (θ1 )−1 ) + t B∆ 16
t≤τA ∧τB
is a supermartingale.
Now, using Doob’s optional stopping theorem, one has for any t ≥ 1, B∆2 B∆2 −1 θ1 ≤ log(p1 (θ1 )−1 ) + . log(pt∧τA ∧τB (θ1 ) ) + (t ∧ τA ∧ τB ) E 16 16 Also, by Lemma 6, log(pt∧τA ∧τB (θ1 )−1 ) ≤ log Bs for any t ≤ 1. Using Lebesgue’s dominated convergence theorem and the monotone convergence theorem, B∆2 B∆2 Eθ1 log(pt∧τA ∧τB (θ1 )−1 ) + (t ∧ τA ∧ τB ) −→ Eθ1 log(pτA ∧τB (θ1 )−1 ) + (τA ∧ τB ) 16 16 as t → +∞. Hence, 16 pτA ∧τB (θ1 ) sA 3s 16 16 θ1 θ1 ≤ log E log log , = E [τA ∧ τB − 1] ≤ 2 2 2 B∆ p1 (θ1 ) B∆ p1 (θ1 ) B∆ 2
where we have used Lemma 6 in the second last step.
Step Two: In this step, we establish a recurrence inequality for the regret function RT (·). By Lemma 3, (pt (θ1 ))t≥1 and (pt (θ2 ))t≥1 are both Markov processes. Thus, the regret of Thompson Sampling can be decomposed as follows RT (p1 (θ1 ))
θ1
= ∆·E
= ∆ · Eθ1
T X
pt (θ2 ) t=1 τA ∧τ B −1 X t=1 θ1
pt (θ2 ) + qB,A · Eθ1 [RT (pτB (θ1 ))|τA > τB ]
+ qA,B · E [RT (pτA (θ1 ))|τA < τB ]
≤ ∆ · Eθ1 [τA ∧ τB − 1] + qB,A ∆T + Eθ1 [RT (pτA (θ1 ))|τA < τB ] 16 3 3s B ≤ log + ∆T + RT p1 (θ1 ) , B∆ 2 p1 (θ1 ) 2 where in the last step, we have used the facts that qB,A ≤ 3 2 p1 (θ1 ), and RT (·) is a decreasing function.
B p1 (θ1 )
(by Lemma 5), pτA (θ1 ) ≥ A =
Step Three: The recurrence inequality established in the previous step and an appropriate choice ofqthe parameter B allow us to get the desired q upper bound on RT (p1 (θ1 )). On oneqside, if ∆ ≤ 1 T . On the other side, if ∆ > 2 p1 (θ11 )T , we 2 p1 (θ1 )T , then RT (p1 (θ1 )) ≤ ∆T ≤ 2 p1 (θ 1) q q p1 (θ1 ) p1 (θ1 ) 1 1 take B = ∆ ≤ 21 p1 (θ1 ). Then for any T . This choice of B is eligible since ∆ T 1 p1 (θ1 ) ≤ 3 , 3 3s B 16 log + ∆T + RT p1 (θ1 ) RT (p1 (θ1 )) ≤ B∆ 2 p1 (θ1 ) 2 s 3 3s T = 16 log +1 + RT p1 (θ1 ) . 2 p1 (θ1 ) 2 h−1 It follows that for any integer h ≥ 1, as long as 32 p1 (θ1 ) ≤ 31 , one has s ! h k h−1 X 2 3s T 3 RT (p1 (θ1 )) ≤ 16 log p1 (θ1 ) +1 + RT 2 3 p1 (θ1 ) 2 k=0 ! r !−1 h s 3s 2 T 3 16 log p1 (θ1 ) +1 + RT ≤ 1− 3 2 p1 (θ1 ) 2 ! h s 3s 3 T ≤ 96 log p1 (θ1 ) . +6 + RT 2 p1 (θ1 ) 2 17
h Finally, by taking h to be the smallest integer such that 32 p1 (θ1 ) > 13 and using the fact that the function RT (·) is decreasing, we get s 1 3s T , +6 + RT RT (p1 (θ1 )) ≤ 96 log 2 p1 (θ1 ) 3
which completes the proof.
F Proof of Proposition 9 Proof. In this proof, we consider the case where θ1 is the true reward-generating model. We use the notation defined in Lemma 5. Fix T > 0 and p1 (θ1 ) ≤ 21 . Let B = 12 p1 (θ1 ) and take A > p1 (θ1 ). The exact value of A will be specified later. We decompose the proof into three steps. Step One: This step is devoted to upper bounding Eθ1 [τA ∧ τB − 1]. By Lemma 13(b), we have for t ≤ τA ∧ τB − 1, pt (θ1 )2 2 pt (θ1 ) 2 Eθt 1 (1 − pt+1 (θ1 ))−1 − (1 − pt (θ1 ))−1 ≥ ∆ + ∆2 2pt (θ2 ) 1 2 B 2 ∆2 1 2 2 1 B ∆1 + B∆22 ≥ , ≥ 2 2 4 where we have used ∆21 + ∆22 ≥ 21 (∆1 + ∆2 )2 ≥ 12 ∆2 . Rearranging, we get for t ≤ τA ∧ τB − 1, B 2 ∆2 B 2 ∆2 θ1 −1 ≥ (1 − pt (θ1 ))−1 − t . Et (1 − pt+1 (θ1 )) − (t + 1) 4 4 2 2 In other words, (1 − pt (θ1 ))−1 − t B 4∆ is a submartingale. t≤τA ∧τB
Now, using Doob’s optional stopping theorem, one has for any t ≥ 1, B 2 ∆2 B 2 ∆2 −1 θ1 ≥ (1 − p1 (θ1 ))−1 − . (1 − pt∧τA ∧τB (θ1 )) − (t ∧ τA ∧ τB ) E 4 4 Moreover, by Lemma 6, s (1 − pt∧τA ∧τB (θ1 ))−1 = pt∧τA ∧τB (θ2 )−1 ≤ s · pt∧τA ∧τB −1 (θ2 )−1 ≤ 1−A for any t ≤ 1. Using Lebesgue’s dominated convergence theorem and the monotone convergence theorem, B 2 ∆2 B 2 ∆2 −→ Eθ1 (1 − pτA ∧τB (θ1 ))−1 − (τA ∧ τB ) Eθ1 (1 − pt∧τA ∧τB (θ1 ))−1 − (t ∧ τA ∧ τB ) 4 4 as t → +∞. Hence, 4s 16s 4 = . Eθ1 [τA ∧ τB − 1] ≤ 2 2 Eθ1 (1 − pτA ∧τB (θ1 ))−1 ≤ 2 2 2 B ∆ B ∆ (1 − A) p1 (θ1 ) ∆2 (1 − A) Step Two: In this step, we establish a recurrence inequality for the regret function RT (·). By Lemma 3, (pt (θ1 ))t≥1 and (pt (θ2 ))t≥1 are both Markov processes. Thus, the regret of Thompson Sampling can be decomposed as follows RT (p1 (θ1 ))
θ1
= ∆·E
= ∆ · Eθ1
T X
pt (θ2 ) t=1 τA ∧τ B −1 X t=1 θ1
pt (θ2 ) + qB,A · Eθ1 [RT (pτB (θ1 ))|τA > τB ]
+ qA,B · E [RT (pτA (θ1 ))|τA < τB ]
≤ ∆Eθ1 [τA ∧ τB − 1] + qB,A · Eθ1 [RT (pτB (θ1 ))|τA > τB ] + Eθ1 [RT (pτA (θ1 ))|τA < τB ] 16s 1 1 ≤ + RT p1 (θ1 ) + (1 − A)∆T, p1 (θ1 )2 ∆(1 − A) 2 2s 18
1 where in the last step, we have used the facts that qB,A ≤ p1B (θ1 ) = 2 (by Lemma 5), 1 p1 (θ1 ) (by Lemma 6), RT (pτA (θ1 )) ≤ (1 − pτA (θ1 ))∆T ≤ (1 − A)∆T (by pτB (θ1 ) ≥ Bs = 2s Lemma 13(d))and RT (·) is a decreasing function.
Step Three: Finally, we establish the desired recurrence inequality by √ appropriately choosing the value of A. On one side, if ∆ ≤ √2T , then RT (p1 (θ1 )) ≤ ∆T ≤ 2 T . On the other side, if 1 1 . This choice of A is eligible since 1 − ∆√ ≥ 21 ≥ p1 (θ1 ). Then ∆ > √2T , we take A = 1 − ∆√ T T 1 for any p1 (θ1 ) ≤ 2 , 16s 1 1 RT (p1 (θ1 )) ≤ + R p (θ ) + (1 − A)∆T T 1 1 p1 (θ1 )2 ∆(1 − A) 2 2s √ 1 16s 1 + 1 R p1 (θ1 ) . ≤ T + T 2 p1 (θ1 ) 2 2s
G Proof of Proposition 10 Proof. In this proof, we consider the case where θ1 is the true reward-generating model. We use the notation defined in Lemma 5. Fix T > 0 and p1 (θ1 ) ≥ 1 − 8s12 . Let A = 1 − 4s12 (1 − p1 (θ1 )) and B = 1 − 4s(1 − p1 (θ1 )). Then it is easy to see that A > p1 (θ1 ) > B and B ≥ 21 . The proof is decomposed into two steps. Step One: This step is devoted to upper bounding Eθ1 [τA ∧ τB − 1]. By Lemma 13(b), we have for t ≤ τA ∧ τB − 1, Eθt 1 (1 − pt+1 (θ1 ))−1 − (1 − pt (θ1 ))−1
≥ ≥
pt (θ1 )2 2 pt (θ1 ) 2 ∆ + ∆2 2pt (θ2 ) 1 2 1 2 2 1 ∆2 B ∆1 + B∆22 ≥ 2 2 16
where we have used ∆21 + ∆22 ≥ 21 (∆1 + ∆2 )2 ≥ 12 ∆2 . Rearranging, we get for t ≤ τA ∧ τB − 1, B2 B2 θ1 −1 Et (1 − pt+1 (θ1 )) − (t + 1) ≥ (1 − pt (θ1 ))−1 − t . 16 16 2 is a submartingale. In other words, (1 − pt (θ1 ))−1 − t ∆ 16 t≤τA ∧τB
Now, using Doob’s optional stopping theorem, one has for any t ≥ 1, ∆2 ∆2 Eθ1 (1 − pt∧τA ∧τB (θ1 ))−1 − (t ∧ τA ∧ τB ) ≥ (1 − p1 (θ1 ))−1 − . 16 16 Moreover, by Lemma 6, (1 − pt∧τA ∧τB (θ1 ))−1 = pt∧τA ∧τB (θ2 )−1 ≤ s · pt∧τA ∧τB −1 (θ2 )−1 ≤
s 1−A
for any t ≥ 1. Using Lebesgue’s dominated convergence theorem and the monotone convergence theorem, ∆2 ∆2 −1 θ1 −1 θ1 (1 − pt∧τA ∧τB (θ1 )) − (t ∧ τA ∧ τB ) E (1 − pτA ∧τB (θ1 )) − (τA ∧ τB ) −→ E 16 16 as t → +∞. Hence, Eθ1 [τA ∧ τB − 1] ≤
16s 16 θ1 E (1 − pτA ∧τB (θ1 ))−1 ≤ 2 . 2 ∆ ∆ (1 − A) 19
Step Two: In this step, we establish the desired recurrence inequality. By Lemma 3, (pt (θ1 ))t≥1 and (pt (θ2 ))t≥1 are both Markov processes. Thus, the regret of Thompson Sampling can be decomposed as follows RT (p1 (θ1 ))
θ1
= ∆·E
= ∆ · Eθ1
T X
pt (θ2 ) t=1 τA ∧τ B −1 X t=1 θ1
pt (θ2 ) + qA,B · Eθ1 [RT (pτA (θ1 ))|τA < τB ]
+ qB,A · E [RT (pτB (θ1 ))|τA > τB ]
≤ ∆(1 − B)Eθ1 [τA ∧ τB − 1] + qA,B · RT (A) + qB,A · RT (1 − s(1 − B)) p 1 ≤ 256s4 (1 − p1 (θ1 ))T + qA,B · RT (1 − 2 (1 − p1 (θ1 ))) 4s + qB,A · RT 1 − 4s2 (1 − p1 (θ1 )) ,
where in last two steps, we have used the definition of A, B and the facts that pτA (θ1 ) ≥ A, pτB (θ1 ) = 1 − pτB (θ2 ) ≥ 1 − s(1 − B) ( by Lemma 6) and RT (·) is a decreasing function. Rearranging the newly obtained inequality, we get for p1 (θ1 ) ≥ 1 −
1 8s2 ,
1 (1 − p1 (θ1 ))) 4s2 256s4 p qB,A RT 1 − 4s2 (1 − p1 (θ1 )) − RT (p1 (θ1 )) . (1 − p1 (θ1 ))T + 1 − qB,A 1 − qB,A
RT (p1 (θ1 )) − RT (1 − ≤
By Lemma 5,
qB,A ≤
1 4 1 − p1 (θ1 ) 4 = ≤ . ≤ 1 A−B 15s 15 4s − 4s2
Therefore, we obtain the desired recurrence inequality by observing that
4 256s4 4 3840s4 qB,A ≤ ≤ 15s 4 ≤ ≤ 360s4 and . 1 − qB,A 11 1 − qB,A 11s 1 − 15s
H Proof of Theorem 11 Proof. Using Lemma 13(b) and Lemma 15, one has pt (θ1 ) θ1 ℓi (θ1 )(Xi,t ) E −1 pt (θ2 ) ℓi (θ2 )(Xi,t ) i∈{1,2} θ1 ℓ2 (θ1 )(X2,t ) − 1 ≤ 32∆2 . = pt (θ1 )E ℓ2 (θ2 )(X2,t ) ≤ Then for any t ≤ T , Eθ1 pt (θ2 )−1 ≤ 1−p11 (θ1 ) + 32(t − 1)∆2 = 1−p11 (θ1 ) 1 + 4(t−1) T −1 5 θ1 θ1 pt (θ2 )−1 ≥ 1−p1 (θ1 ) . By Jensen’s inequality, we have for any t ≤ T , E [pt (θ2 )] ≥ E p P T 1−p1 (θ1 ) 1−p (θ ) 1 1 . Hence, RT (θ1 , T S(p1 )) = ∆·Eθ1 t=1 pt (θ2 ) ≥ ∆T ≥ 101√2 (1 − p1 (θ1 ))T 5 5 . Eθt 1 pt+1 (θ2 )−1 − pt (θ2 )−1 =
X
20
pt (θi )