
arXiv:1506.03374v1 [cs.LG] 10 Jun 2015

Contextual Bandits with Global Constraints and Objective

Nikhil R. Devanur Microsoft Research [email protected]

Shipra Agrawal Microsoft Research [email protected]

Lihong Li Microsoft Research [email protected]

Abstract

We consider the contextual version of a multi-armed bandit problem with global convex constraints and a concave objective function. In each round, the outcome of pulling an arm is a context-dependent vector, and the global constraints require the average of these vectors to lie in a certain convex set. The objective is a concave function of this average vector. The learning agent competes with an arbitrary set of context-dependent policies. This problem is a common generalization of problems considered by [9] and [3], with important applications. We give computationally efficient algorithms with near-optimal regret, generalizing the approach of [2] for the non-constrained version of the problem. For the special case of budget constraints, our regret bounds match those of [9], answering their main open question of obtaining a computationally efficient algorithm.

1 Introduction

Multi-armed bandits (e.g., [13]) are a classic model for studying the exploration-exploitation tradeoff faced by a decision-making agent, which learns to maximize cumulative reward through sequential experimentation in an initially unknown environment. The contextual bandit problem [21], also known as associative reinforcement learning [10], generalizes multi-armed bandits by allowing the agent to take actions based on contextual information: in every round, the agent observes the current context, takes an action, and observes a reward that is a random variable with distribution conditioned on the context and the taken action. Despite many recent advances and successful applications of bandits, one of the major limitations of the standard setting is the lack of "global" constraints, which are common in many important real-world applications. For example, actions taken by a robot arm may have different levels of power consumption, and the total power consumed by the arm is limited by the capacity of its battery. In online advertising, each advertiser has her own budget, so that her advertisement cannot be shown more than a certain number of times. In dynamic pricing, there are a certain number of objects for sale, and the seller offers prices to a sequence of buyers with the goal of maximizing revenue, but the number of sales is limited by the supply. Recently, a few authors started to address this limitation by considering very special cases, such as a single resource with a budget constraint [16, 19, 20, 22, 26, 27], and application-specific bandit problems such as the ones motivated by online advertising [14, 23], dynamic pricing [6, 11], and crowdsourcing [7, 24, 25]. Subsequently, [8] introduced a general problem capturing most previous formulations. In this problem, which they call Bandits with Knapsacks (BwK), there are d different resources, each with a pre-specified budget. Each action taken by the agent results in a d-dimensional resource consumption vector, in addition to the regular (scalar) reward.

The goal of the agent is to maximize the total reward while keeping the cumulative resource consumption below the budget. The BwK model was further generalized to the BwCR (Bandits with convex Constraints and concave Rewards) model [3], which allows arbitrary concave objectives and convex constraints on the sum of the resource consumption vectors over all rounds. Both papers adapted the popular Upper Confidence Bound (UCB) technique to obtain near-optimal regret guarantees. However, the focus was on the non-contextual setting.

There has been significant recent progress [2, 17] in algorithms for general (instead of linear [1, 15]) contextual bandits, where the context and reward can have arbitrary correlation and the algorithm competes with some arbitrary set of context-dependent policies. [17] achieved the optimal regret bound for this remarkably general contextual bandit problem, assuming access to the policy set only through a linear optimization oracle, instead of explicit enumeration of all policies as in previous work [5, 12]. However, the algorithm presented in [17] was computationally inefficient. [2] presented a simpler algorithm that is computationally efficient and achieves an optimal regret bound. Combining contexts and resource constraints, [3] also considers a static linear contextual version of BwCR where the expected reward is linear in the context.¹ [28] considers the special case of random linear contextual bandits with a single budget constraint, and gives near-optimal regret guarantees for it. [9] extends the general contextual version of bandits with arbitrary policy sets to allow budget constraints, thus obtaining a contextual version of BwK, a problem they call Resourceful Contextual Bandits (RCB). We will refer to this problem as CBwK. They give a computationally inefficient algorithm, based on [17], with a regret that is optimal in most regimes. They posed the open question of achieving computational efficiency while maintaining a similar, or even sub-optimal, regret.

Main Contributions In this paper, we formulate and study a common generalization of BwCR and RCB: in every round the agent observes a context, takes one of K actions, and then observes a d-dimensional vector; the goal of the agent is to maximize a concave objective function of the average of these vectors while ensuring that the average of the vectors is inside (or close to) a given convex set. The agent competes with some arbitrary set of context-dependent policies. We call this problem Contextual Bandits with convex Constraints and concave Rewards (henceforth, CBwCR). Our problem formulation (CBwCR) allows for an arbitrary convex constraint and concave objective on the average of reward vectors. This is substantially more general than the problem with budget constraints (CBwK) considered in [9], and allows for many more interesting applications, some of which were discussed in [3]. It is also considerably more general than the contextual version considered in [3], where the context was fixed and the dependence was linear. We present a simple and computationally efficient algorithm for CBwCR based on the ILOVETOCONBANDITS algorithm in [2]. A key feature of our techniques is that we need to modify the algorithm in [2] only in a very minimal, almost black-box, fashion, thus retaining the structural simplicity of the algorithm while obtaining substantially more general results. Our algorithm achieves an optimal (in many regimes) regret bound for CBwCR.
When applied to the special case of CBwK, our techniques provide a regret bound that matches that of [9]. This resolves the main open question in [9] of obtaining computationally efficient algorithms. Furthermore, unlike [9], we do not need to know the distribution over contexts.

Organization In Section 2, we define the CBwCR problem and its interesting special cases, and state our regret bounds precisely (Theorems 1 and 2). To illustrate our algorithm and proof techniques, we first consider (in Section 3) a simpler "feasibility" version of this problem, where the goal is simply to satisfy the given constraints and there is no objective function. Then, we extend this algorithm to the general CBwCR problem in Section 4. Finally, in Section 5, we apply this algorithm to the important special case of budget constraints (CBwK).

2 Preliminaries and Main Results

We now define the CBwCR problem. In this paper, we will use boldface letters to denote vectors.

¹ In particular, each arm is associated with a fixed vector, and the resulting outcomes for this arm have expected value linear in this vector.


Let A be a finite set of K actions and X be a space of possible contexts (for instance, this could be the feature space in supervised learning). To begin with, the algorithm is given a concave function f defined on the domain [0, 1]^d, and a convex set S ⊆ [0, 1]^d. Thereafter, in every round t ∈ [T], the agent first observes a context x_t ∈ X, then chooses an action a_t ∈ A, and finally observes a d-dimensional vector v_t(a_t) ∈ [0, 1]^d. For convenience, we will refer to v_t(a_t) as a reward vector, but, as will become clear from the definition of the objective and constraints, in general v_t(a_t) can represent cost or reward or a combination of the two. The goal of the agent is to take actions such that the average reward vector, v̄_T := (1/T) Σ_{t=1}^T v_t(a_t), maximizes f(v̄_T) and is inside (or close to) the set S. Furthermore, we make the stochastic assumption that the context and reward vectors (x_t, {v_t(a) : a ∈ A}) for t = 1, 2, ..., T are drawn i.i.d. (independent and identically distributed) from an unknown distribution D over X × [0, 1]^{d×A}. Although in the above formulation only functions of the sum (or average) of reward vectors are allowed, this is equivalent to allowing functions of any decomposable aggregate of reward vectors: objectives (or constraints) of the form f( (1/T) Σ_t g(r_t(a_t)) ), for any arbitrary scalar- or vector-valued g, can be handled by simply defining v_t(a) = g(r_t(a)).

Policy Set. Following [2, 9, 17], our algorithms compete with an arbitrary set of policies. Let Π ⊆ A^X be a finite set of policies² that map contexts x ∈ X to actions a ∈ A. With global constraints and/or a nonlinear global objective, distributions over policies in Π can be strictly more powerful than any policy in Π itself. Our algorithms compete with this more powerful set, which is a stronger guarantee than simply competing with fixed policies in Π. For this purpose, define C(Π) := {P ∈ [0, 1]^Π : Σ_{π∈Π} P(π) = 1} as the set of all convex combinations of policies in Π. For a context x ∈ X, choosing actions with P ∈ C(Π) is equivalent to following a randomized policy that selects action a ∈ A with probability P(a|x) = Σ_{π∈Π: π(x)=a} P(π); we therefore also refer to P as a (mixed) policy. Similarly, define C_0(Π) := {P ∈ [0, 1]^Π : Σ_{π∈Π} P(π) ≤ 1} as the set of all non-negative weights over Π that sum to at most 1. Clearly, C(Π) ⊂ C_0(Π). Let R(P) := E_{(x,v)∼D}[ E_{π∼P}[ v(π(x)) ] ] denote the expected reward vector of policy P ∈ C(Π). We call a policy P ∈ C(Π) feasible if R(P) ∈ S. We assume that there exists a feasible policy in C(Π). Define an optimal policy P* ∈ C(Π) as a feasible policy whose expected reward vector maximizes the objective function:

    P* = arg max_{P∈C(Π)} f(R(P))   s.t.  R(P) ∈ S.    (1)
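As a concrete illustration of this interaction protocol, the following Python sketch simulates T rounds and reports the two quantities that enter the average regrets; the `environment` and `agent` interfaces, as well as `f` and `dist_to_S`, are hypothetical stand-ins and not part of the paper.

```python
import numpy as np

def run_cbwcr(T, agent, environment, f, dist_to_S, d):
    """Sketch of the CBwCR protocol (hypothetical interfaces, not the paper's code).

    environment.draw() returns (x_t, {v_t(a)}) drawn i.i.d. from D;
    agent.act(x) returns an action; agent.update(...) receives the feedback;
    f is the concave objective and dist_to_S computes d(., S)."""
    v_sum = np.zeros(d)
    for t in range(1, T + 1):
        x_t, v_t = environment.draw()     # context and the vectors of all actions
        a_t = agent.act(x_t)              # the agent sees only x_t before acting
        v_sum += np.asarray(v_t[a_t])     # only the chosen action's vector is revealed
        agent.update(x_t, a_t, v_t[a_t])
    v_bar = v_sum / T                     # average reward vector, bar-v_T
    return f(v_bar), dist_to_S(v_bar)     # the quantities behind avg-regret^1 and avg-regret^2
```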

Accessing policies in Π by explicit enumeration is impractical. For the purpose of efficient implementation, we instead access Π only via an optimization oracle. Employing such an optimization oracle is common when considering contextual bandits with an arbitrary set of policies [2, 17, 21]. Following previous work, we call this optimization oracle an "arg max oracle", or AMO.

Definition 1. For a set of policies Π, the arg max oracle (AMO) is an algorithm which, for any sequence of contexts and observation vectors (x_1, v_1), ..., (x_t, v_t) ∈ X × [0, 1]^{d×A}, any d, and any concave function g, returns

    arg max_{P∈C(Π)}  g( (1/t) Σ_{τ=1}^t Σ_π P(π) v_τ(π(x_τ)) ) .    (2)

This AMO is stronger than those used previously [17, 2], since it optimizes a concave function g. However, we can use the Frank-Wolfe algorithm [18] when g is smooth, or an algorithm from [4] otherwise, to solve the optimization problem (2) by repeatedly calling a linear AMO, i.e., the same AMO as in [17, 2], where the objective is simply a sum of rewards obtained in every step.

Regret. We are interested in minimizing two types of average regret.³ The average regret in objective measures the amount of suboptimality in maximizing the objective function f:

    avg-regret¹(T) := f(R(P*)) − f(v̄_T) .

² The policies may be randomized in general, but for our results we may assume without loss of generality that they are deterministic. As observed by [9], we may replace randomized policies with deterministic policies by appending a random seed to the context. This blows up the size of the context space, which does not appear in our regret bounds.
³ In regular versions of bandit problems, the objective value is the sum of rewards over all steps, which scales with T, and the regret typically scales with √T. But in our formulation, the objective f((1/T) Σ_t v_t(a_t)) is defined over average observations; therefore, to be consistent with the popular terminology, we call our regret "average regret".
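To illustrate the reduction just described, here is a Python sketch of Frank-Wolfe applied to a smooth concave g, accessing the policy class only through a linear AMO; `grad_g` and `linear_amo` are hypothetical interfaces (the latter standing in for the standard cost-sensitive oracle of [17, 2]), and the 2/(k+2) step size is a common default rather than a choice prescribed by the paper.

```python
import numpy as np

def frank_wolfe_amo(grad_g, linear_amo, contexts, vectors, d, n_iters=50):
    """Sketch: maximize a smooth concave g over C(Pi) using only a linear AMO.

    linear_amo(w) is assumed to return the single policy pi maximizing
    sum_tau <w, v_tau(pi(x_tau))>; vectors[tau][a] is the observed d-dim vector."""
    t = len(contexts)

    def avg_vec(pi):
        # u_pi = (1/t) * sum_tau v_tau(pi(x_tau)), a point in [0, 1]^d
        return sum(np.asarray(vectors[tau][pi(contexts[tau])]) for tau in range(t)) / t

    pi0 = linear_amo(np.ones(d))           # any vertex of C(Pi) to start from
    mixture = {pi0: 1.0}                   # sparse weights over policies
    u = avg_vec(pi0)                       # current average reward vector

    for k in range(1, n_iters + 1):
        pi_k = linear_amo(grad_g(u))       # linearize g at u; the AMO picks the best policy
        gamma = 2.0 / (k + 2)              # standard Frank-Wolfe step size
        for pi in mixture:
            mixture[pi] *= (1 - gamma)
        mixture[pi_k] = mixture.get(pi_k, 0.0) + gamma
        u = (1 - gamma) * u + gamma * avg_vec(pi_k)

    return mixture, u                      # approximate maximizer of (2) and its value point
```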


The average regret in constraints measures the amount of constraint violation, and is defined as the distance of the average reward vector from S:

    avg-regret²(T) := d(v̄_T, S) ,

where d(v, S) = inf_{s∈S} ‖v − s‖ for some ℓ_q-norm ‖·‖ (with 1 ≤ q ≤ ∞).

Special Cases. We also consider the following interesting special cases of the problem.

Feasibility problem (CBwC): In this special case, referred to as CBwC, there are only constraints and no objective function f, and the aim of the algorithm is to have the average reward vector v̄_T be in the set S. The performance of the algorithm is measured by the distance of v̄_T to S, i.e., by avg-regret²(T). We will first illustrate our algorithm and proof techniques for this simpler yet nontrivial case.

Budget Constraints (CBwK): This is the problem studied in [9], which they call "Resourceful Contextual Bandits", and which we refer to as CBwK. Here, the reward vector observed at time t can be broken down into two components: a scalar reward r_t(a_t) ∈ [0, 1] and a (d − 1)-dimensional cost vector v_t(a_t) ∈ [0, 1]^{d−1}. The objective is to maximize Σ_{t=1}^T r_t(a_t) while ensuring that Σ_t v_t(a_t) ≤ B·1, where 1 is the vector of all 1s and B > 0 is some scalar. The budget constraint Σ_t v_t(a_t) ≤ B·1 is equivalent to using a constraint set S of the form {v : 0 ≤ v ≤ (B/T)·1}. In this case, we are never allowed to overshoot the budget, i.e., avg-regret²(T) is required to be 0, and we only need to bound avg-regret¹(T). Hence we assume that the set A has the option of "doing nothing", yielding 0 reward and 0 cost; once the resources are consumed, we can abort or take the "do nothing" option for the remaining rounds.

Main Results. Our main result is a computationally efficient low-regret algorithm for CBwCR. The algorithm needs to know a certain problem-dependent parameter Z, defined by Assumption 2 in Section 4. (Such knowledge is not required for CBwC, however.) The algorithm accesses Π through the AMO. Assume that the function f is L-Lipschitz with respect to the ℓ_q-norm ‖·‖.

Theorem 1. For the CBwCR problem, there is an algorithm that takes as input a certain problem-dependent parameter Z ≥ L (satisfying Assumption 2), makes Õ(√(KT ln(|Π|))) calls to the AMO, and with probability at least 1 − δ has regret

    avg-regret¹(T) = O( (‖1_d‖Z/√T) ( √(K ln(T|Π|/δ)) + √(ln(d/δ)) ) ) ,
    avg-regret²(T) = O( (‖1_d‖/√T) ( √(K ln(T|Π|/δ)) + √(ln(d/δ)) ) ) .

For the CBwC problem, no knowledge of Z is needed, and the algorithm has the same avg-regret²(T) as above.

For the CBwK problem, the regret bound depends on OPT = f(R(P*)) and B. We get the same regret bound as [9]; they present a detailed discussion of the optimality of this bound.

Theorem 2. For the CBwK problem, there is an algorithm that makes Õ(√(KT ln(|Π|))) calls to the AMO, and with probability at least 1 − δ has regret

    avg-regret¹(T) = O( ( OPT/(B/T) + 1 ) √( Kd ln(dT|Π|/δ) / T ) ) .

Note: [9] state their regret bound in terms of the total reward; hence their optimum is T times larger than ours, and we need to scale their bound appropriately to compare with ours. After proper scaling, [9, Theorem 1] gives the same regret bound as Theorem 2.

3 Feasibility Problem (CBwC)

It will be useful to first illustrate our algorithm and proof techniques for the special case of the feasibility problem, CBwC, before extending it to the more general CBwCR problem.

In CBwC, there is no objective function f, and the aim of the algorithm is to keep the average reward vector v̄_T = (1/T) Σ_{t=1}^T v_t(a_t) in the set S. The performance of the algorithm is measured by the distance from the set S, i.e., by avg-regret²(T).

3.1 Algorithm
The algorithm shares the same structure as the ILOVETOCONBANDITS algorithm for contextual bandits [2], with important changes necessary to deal with global constraints. For completeness, we provide the main algorithm below.

Algorithm 1 ILOVETOCONBANDITS [2]
Input: Epoch schedule 0 = τ_0 < τ_1 < τ_2 < ..., allowed failure probability δ ∈ (0, 1).
1: Initialize weights Q_0 := 0 ∈ C_0(Π) and epoch m := 1.
   Define µ_m := min{ 1/(2K), √( ln(16τ_m²|Π|/δ) / (Kτ_m) ) } for all m ≥ 0.
2: for round t = 1, 2, ... do
3:   Observe context x_t ∈ X.
4:   (a_t, p_t(a_t)) := Sample(x_t, Q_{m−1}, P_{τ_{m−1}}, µ_{m−1}).
5:   Select action a_t and observe the reward vector v_t(a_t) ∈ [0, 1]^d.
6:   if t = τ_m then
7:     Let Q_m be a solution to (OP) with history H_t and minimum probability µ_m.
8:     m := m + 1.
9:   end if
10: end for

The algorithm is given a finite policy class Π, and aims to compete with the best mixed policy in C(Π). It proceeds in epochs with pre-defined lengths; epoch m consists of time steps τ_{m−1} + 1 to τ_m, inclusive. At the beginning of an epoch, it computes a mixed policy in C(Π), which is used for the whole epoch. The details of the process Sample for sampling an action from the computed mixed policy are provided in Appendix B.1.

There are several key challenges in finding the "right" mixed policy. Ideally, it should concentrate fast enough on the empirically best mixed policy (based on the data observed so far), in order to have small regret; the probability of choosing an action must be large enough to enable sufficient exploration; and it should be efficiently computable. As we will show, all of these can be addressed by solving a properly defined optimization problem, with a structure similar to the one defined for contextual bandits in [2], despite the additional technical challenges of dealing with mixed policies and global constraints.

A few definitions are in order before we describe the optimization problem. Let H_t denote the history of chosen actions and observations before time t, consisting of records of the form (x_τ, a_τ, v_τ(a_τ), p_τ(a_τ)), where x_τ, a_τ, v_τ(a_τ) denote, respectively, the context, the action taken, and the reward vector observed at time τ, and p_τ(a_τ) denotes the probability with which action a_τ was taken. (Recall that our algorithm selects actions in a randomized way using a mixed policy.) Although H_t contains observation vectors only for chosen actions, it can be "completed" using the trick of importance sampling: for every (x_τ, a_τ, v_τ(a_τ), p_τ(a_τ)) ∈ H_t, define the fictitious observation vector v̂_τ by

    v̂_τ(a) := ( v_τ(a_τ) / p_τ(a_τ) ) · I{a_τ = a} .

Clearly, v̂_τ is an unbiased estimator of v_τ: for every a, E_{a_τ}[v̂_τ(a)] = v_τ(a), where the expectation is over the randomization in selecting a_τ.

With the "completed" history, it is straightforward to obtain an unbiased estimate of the expected reward vector for every policy P ∈ C(Π):

    R̂_t(P) := (1/t) Σ_{τ=1}^t Σ_{π∈Π} v̂_τ(π(x_τ)) P(π) .
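The fictitious observations above are the standard inverse-propensity trick; the following minimal Python sketch computes R̂_t for a single deterministic policy, assuming a simplified record format for H_t (this helper is illustrative, not code from the paper).

```python
import numpy as np

def ips_estimate(history, policy, d):
    """Sketch of the inverse-propensity estimate R_hat_t(pi) for one policy.

    history: list of records (x, a, v_a, p_a) = context, chosen action,
    observed d-dim vector for that action, probability with which it was chosen."""
    total = np.zeros(d)
    for (x, a, v_a, p_a) in history:
        # fictitious vector: v_hat(a') = (v(a)/p(a)) * 1{a' == a}
        if policy(x) == a:
            total += np.asarray(v_a) / p_a
    return total / len(history)
```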

It is easy to verify that E[R̂_t(P)] = R(P). Denote by P_t the empirically optimal policy: P_t := arg min_{P∈C(Π)} d(R̂_t(P), S). For P ∈ C(Π), its regret is defined as

    Reg(P) := ‖1_d‖⁻¹ ( d(R(P), S) − d(R(P*), S) ) = ‖1_d‖⁻¹ d(R(P), S) ,

an estimate of the regret of policy P at time t can be obtained as:

    Reĝ_t(P) := ‖1_d‖⁻¹ ( d(R̂_t(P), S) − d(R̂_t(P_t), S) ) .

Note that Reĝ_t(P_t) = 0 by definition.

We are now ready to describe the optimization problem, (OP). It aims to find a mixed policy Q ∈ C_0(Π) (the algorithm will assign any remaining mass from Q to the empirically best policy P_t to get a policy in C(Π)). This is equivalent to finding Q′ ∈ C(Π) and α ∈ [0, 1], and returning Q = αQ′. Let Q^µ denote a smoothed projection of Q, assigning minimum probability µ to every action: Q^µ(a|x) := (1 − Kµ)Q(a|x) + µ. The parameter µ_m is defined in Algorithm 1, and ψ := 100.

Optimization Problem (OP)
Find Q = αQ′, for some Q′ ∈ C(Π) and 0 ≤ α ≤ 1, such that

    Reĝ_t(Q′) / (ψµ_m) ≤ 2K ,
    ∀P ∈ C(Π):  Ê_{x∈H_t} E_{π∼P}[ 1 / Q^{µ_m}(π(x)|x) ] ≤ Reĝ_t(P)/(ψµ_m) + 2K/α .

The first constraint in (OP) ensures that, under Q, the empirical regret is "small". In the second constraint, the left-hand side, as shown in the analysis, is an upper bound on the variance of R̂_t(P). This constraint therefore requires sufficient exploration over mixed policies with low empirical regret. This is one of the keys to the analysis, since it ensures that competing against P_t is essentially the same as competing against P*. These two constraints are critical for deriving the regret bound in Section 3.2.

To solve (OP), we access the policy set Π via the AMO defined in Definition 1. In Appendix B.2, we provide implementation details, which demonstrate that (OP) can be solved whenever required in the algorithm using the AMO, and is implementable using at most Õ(√(KT ln(|Π|))) calls to the AMO.

3.2 Regret Analysis
The main result of this section is the following theorem, which states that the average regret of our algorithm due to constraint violation diminishes to 0 at the rate Õ(1/√T).

Theorem 3. With probability at least 1 − δ, in T rounds our algorithm for CBwC achieves

    avg-regret²(T) = O( (‖1_d‖/√T) ( √(K ln(T|Π|/δ)) + √(ln(d/δ)) ) ) .

The proof structure is similar to the proof of [2, Theorem 2], with major differences coming from the changes necessary to deal with mixed policies. A complete proof is given in Appendix C. Here, we only sketch the key steps and give informal intuition.

We start by defining a "good" event E (Definition 4), in which, for all t, the empirical average reward vector R̂_t(P) for any mixed policy P is close to the true average R(P), and the variance estimates (the left-hand side of the second constraint in (OP)) allow one to upper-bound the variance of the empirical average. Lemma 11 shows that E holds with high probability. Therefore, we can assume E holds when proving the regret bound.

Now suppose E holds. Using the constraints in (OP), one can show (Lemma 15) that the empirical regret Reĝ_t(P) and the actual regret Reg(P) are close for every P ∈ C(Π). Therefore, the first constraint in (OP), which bounds the empirical regret of the chosen mixed policy, implies an upper bound on the actual regret of this mixed policy. Properly chosen scaling factors (ψ and µ_m) result in the desired bound in Theorem 3.

4 The general CBwCR problem

This section extends the algorithm from the previous section to the general CBwCR problem defined in Section 2. Recall that the aim is to maximize f( (1/T) Σ_t v_t(a_t) ) while ensuring that (1/T) Σ_t v_t(a_t) ∈ S.

A direct way to extend the algorithm from the previous section would be to reduce CBwCR to the feasibility problem with constraint set S′ = {v : f(v) ≥ OPT, v ∈ S}, where OPT := f(R(P*)). However, this requires knowledge of OPT. If OPT is estimated, the errors in the estimation of OPT at all time steps t would add up in the regret; thus, this approach would tolerate only Õ(1/√t) per-step estimation errors. In this section, we propose an alternate approach of combining the objective value and the distance from the constraints using a parameter Z, which captures the tradeoff between the two quantities. We may still need to estimate this parameter Z; however, Z will appear only in second-order regret terms, so a constant-factor approximation of Z suffices to obtain the optimal order of regret bounds. This makes the estimation task relatively easy and enables us to get better problem-specific bounds. As a specific example, for the special case of budget constraints (CBwK), we will use Z = OPT/(B/T), so we still need to estimate OPT. However, it is sufficient to get an O(1) approximation of OPT, which in the other approach would give a linear regret; see Section 5 for details.

In the rest of this section, we make the following assumption.

Assumption 2. Assume we are given Z ≥ L such that for all P ∈ C(Π),

    f(R(P)) ≤ f(R(P*)) + (Z/2) d(R(P), S) .    (3)

Intuitively, Z measures the sensitivity of the optimization problem to violation of the constraints. In Lemma 16 (in Appendix D), we provide a constructive proof of the existence of such a Z. In fact, its smallest value is given by the optimal Lagrangian dual variable for the optimization problem max_{P∈C(Π)} f(R(P)) s.t. d(R(P), S) ≤ γ, as γ → 0. This observation also suggests that Z could potentially be estimated by solving for the optimal dual variable of an estimated version of this convex program that uses R̂_t(P) instead of R(P), constructed over some t initial rounds.

Below, we present an algorithm for CBwCR. It uses the same basic ideas as the algorithm for the feasibility problem in the previous section. The main new idea is to use the parameter Z to combine the objective with the constraints.

4.1 Algorithm
We use Algorithm 1 with the same optimization problem (OP) as described in Section 3, but with new definitions of Reg(P), P_t, and Reĝ_t(P), as given below. Recall that P* is the optimal policy given by Equation (1), L is the Lipschitz factor of f with respect to the norm ‖·‖, and Z is as specified in Assumption 2. We now define the regret of policy P ∈ C(Π) as

    Reg(P) := (1/(‖1_d‖Z)) ( f(R(P*)) − f(R(P)) + Z d(R(P), S) ) .

The best empirical policy is now given by

    P_t := arg max_{P∈C(Π)}  f(R̂_t(P)) − Z d(R̂_t(P), S) ,

and an estimate of the regret of policy P ∈ C(Π) at time t is

    Reĝ_t(P) := (1/(‖1_d‖Z)) ( f(R̂_t(P_t)) − Z d(R̂_t(P_t), S) − ( f(R̂_t(P)) − Z d(R̂_t(P), S) ) ) .
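For intuition, the scalarized score that defines P_t above can be written as a one-line helper; the particular `f`, distance function, and numbers below are illustrative assumptions (a CBwK-style objective with a box constraint), not values taken from the paper.

```python
import numpy as np

def combined_score(R_hat, f, dist_to_S, Z):
    """Sketch: the scalarization f(R_hat(P)) - Z * d(R_hat(P), S) used to rank policies."""
    return f(R_hat) - Z * dist_to_S(R_hat)

# illustrative example: objective = first coordinate, box constraint v <= 0.5 in l_inf
f = lambda v: v[0]
dist_box = lambda v: float(max(0.0, np.max(v[1:] - 0.5)))
print(combined_score(np.array([0.7, 0.6, 0.4]), f, dist_box, Z=2.0))   # 0.7 - 2.0*0.1 = 0.5
```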

4.2 Regret Analysis: Proof of Theorem 1
We prove that Algorithm 1 and (OP), with the above new definition of Reĝ_t(P), achieve the regret bounds of Theorem 1 for the CBwCR problem. A complete proof of this theorem is given in Appendix D. Here, we sketch some key steps. The first step of the proof is to use the constraints in (OP) to show that the empirical regret Reĝ_t(P) and the actual regret Reg(P) are close for every P ∈ C(Π).

This is proven in Lemma 17. Therefore, the first constraint in (OP), which bounds the empirical regret of the chosen policy Q′, implies a bound on the actual regret Reg(Q′). What remains is to show that a bound on this quantity is sufficient to bound both the regret in objective (avg-regret¹(T)) and the regret in constraints (avg-regret²(T)). This effectively amounts to bounding f(R(P*)) − f(R(P)) and d(R(P), S), respectively, in terms of Reg(P) for any policy P. Bounding the objective term is relatively simple, since by the definition of Reg(P), f(R(P*)) − f(R(P)) ≤ Z‖1_d‖Reg(P). To bound the distance, we utilize Assumption 2: f(R(P*)) ≥ f(R(P)) − (Z/2) d(R(P), S) for all P ∈ C(Π), so that

    ‖1_d‖Z Reg(P) = f(R(P*)) − f(R(P)) + Z d(R(P), S) ≥ (Z/2) d(R(P), S) ,

thus bounding d(R(P), S) in terms of Reg(P).

5 Budget Constraints (CBwK)

In this section, we apply the algorithm from the previous section to the special case of budget constraints. In this case, the reward vector observed at time t is of the form (r_t(a_t), v_t(a_t)) ∈ [0, 1]^d, and the aim is to maximize Σ_{t=1}^T r_t(a_t) while ensuring that Σ_{t=1}^T v_t(a_t) ≤ B·1.

This is essentially equivalent to the CBwCR problem with f being the function that returns the first component of the vector, i.e., f((r, v)) = r, and S = {(r, v) : v ≤ (B/T)·1}. However, one difference is that we are not allowed to violate the budget constraints at all, so the algorithm has to stop once the budget of any resource is fully consumed. We take care of this by setting S = {(r, v) : v ≤ ((B − B′)/T)·1} for some large enough B′, and using the ℓ∞ norm to measure distance: if the ℓ∞ distance from S is at most B′/T, then the actual budgets are not violated, and we ensure that this happens with high probability. Note that since the algorithm is allowed to stop early or "do nothing", we are effectively competing with all policies in C_0(Π) instead of C(Π). The value of the optimal mixed policy in C_0(Π) is

    OPT :=  max_{P∈C_0(Π)}  E_{(x,r,v)∼D, π∼P}[ r(π(x)) ]
            s.t.  E_{(x,r,v)∼D, π∼P}[ v(π(x)) ] ≤ (B/T)·1 .    (4)

Applying the algorithm from the previous section requires knowledge of a parameter Z satisfying Assumption 2. The following lemma shows that it is enough to know OPT. (All proofs from this section are in Appendix E.)

Lemma 4. Z/2 = max{ OPT/(B/T), 1 } satisfies Assumption 2, with f and S as above and the ℓ∞ norm for distance.

By definition, this means that any Z that is greater than or equal to 2OPT/(B/T) + 1 satisfies Assumption 2. Therefore it suffices to estimate an upper bound on OPT, the smaller the better, since Z appears in the regret term. We estimate OPT by using the outcomes of the first

    T_0 := 12Kd ln(d|Π|/δ) T/B

rounds, during which we do pure exploration (i.e., play an action in A uniformly at random). The following lemma provides a bound on the Z that we estimate.

Lemma 5. Using the first T_0 rounds of pure exploration, one can compute a quantity Z such that, with probability at least 1 − δ,

    max{ OPT/(B/T), 1 } ≤ Z/2 ≤ 6OPT/(B/T) + 2 .

Now, given such a Z, the algorithm for the general CBwCR problem (see Section 4.1) can be used as is. As mentioned earlier, in the definitions of Reg(P), Reĝ_t(P), and P_t, we use the ℓ∞ distance, and f((r, v)) = r. Note that f is 1-Lipschitz for all norms, and that ‖1‖_∞ = 1. In order to make sure that the budget constraints are not violated, for some large enough constant c, we set aside a budget of

    B′ := c √( KT ln(|Π|/δ) ) .

The entire algorithm is as follows:


1. Use the first T_0 rounds to do pure exploration and compute a Z as given by Lemma 5.
2. From the remaining budget for each resource, set aside a B′ amount.
3. Run the algorithm from Section 4.1 for the remaining time steps with the remaining budget.
4. Abort when the full actual budget B is consumed for any component j.

By definition, the algorithm does not violate the budget constraints. The regret bound for this algorithm is stated in Theorem 2. The proof essentially follows from using Theorem 1 to bound the regret in Step 3, and accounting for the loss of budget in Steps 1 and 2.
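A minimal sketch of this outer loop is given below; all subroutines (`pure_explore_round`, `estimate_Z`, `cbwcr_step`, `budget_used`) are hypothetical interfaces standing in for the components described in Sections 4 and 5, and the constant c in B′ is left unspecified, as in the text.

```python
import math

def cbwk_wrapper(T, B, K, d, n_policies, delta,
                 pure_explore_round, estimate_Z, cbwcr_step, budget_used):
    """Sketch of the four-step CBwK algorithm above (hypothetical interfaces)."""
    # Step 1: pure exploration for T0 rounds, then compute Z (Lemma 5)
    T0 = math.ceil(12 * K * d * math.log(d * n_policies / delta) * T / B)
    for t in range(T0):
        pure_explore_round(t)              # play an action uniformly at random
    Z = estimate_Z()

    # Step 2: set aside a safety budget B' = c * sqrt(K T ln(|Pi|/delta))
    c = 1.0                                # "large enough" constant, unspecified in the text
    B_prime = c * math.sqrt(K * T * math.log(n_policies / delta))

    # Step 3: run the CBwCR algorithm of Section 4.1 on the remaining rounds
    for t in range(T0, T):
        # Step 4: abort once the full budget B of any resource is consumed
        if max(budget_used()) >= B:
            break
        cbwcr_step(t, Z, B - B_prime)

    return Z, B_prime
```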

References
[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2012.
[2] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In ICML, 2014.
[3] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC '14, 2014.
[4] S. Agrawal and N. R. Devanur. Fast algorithms for online stochastic convex programming. In SODA, pages 1405–1424, 2015.
[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[6] M. Babaioff, S. Dughmi, R. D. Kleinberg, and A. Slivkins. Dynamic pricing with limited supply. ACM Trans. Economics and Comput., 3(1):4, 2015.
[7] A. Badanidiyuru, R. Kleinberg, and Y. Singer. Learning on a budget: posted price mechanisms for online procurement. In Proc. of the 13th ACM EC, pages 128–145. ACM, 2012.
[8] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In FOCS, pages 207–216, 2013.
[9] A. Badanidiyuru, J. Langford, and A. Slivkins. Resourceful contextual bandits. In Proceedings of The Twenty-Seventh Conference on Learning Theory (COLT-14), pages 1109–1134, 2014.
[10] A. G. Barto and P. Anandan. Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15(3):360–375, 1985.
[11] O. Besbes and A. Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.
[12] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proc. of the 14th AIStats, pages 19–26, 2011.
[13] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[14] D. Chakrabarti and E. Vee. Traffic shaping to optimize ad delivery. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC '12, 2012.
[15] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 2011.
[16] W. Ding, T. Qin, X.-D. Zhang, and T.-Y. Liu. Multi-armed bandit with budget constraint and variable costs. In Proc. of the 27th AAAI, pages 232–238, 2013.
[17] M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Proc. of the 27th UAI, pages 169–178, 2011.
[18] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics, 3:95–110, 1956.
[19] S. Guha and K. Munagala. Approximation algorithms for budgeted learning problems. In STOC, pages 104–113, 2007.
[20] A. György, L. Kocsis, I. Szabó, and C. Szepesvári. Continuous time associative bandit problems. In Proc. of the 20th IJCAI, pages 830–835, 2007.

[21] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, pages 1096–1103, 2008.
[22] O. Madani, D. J. Lizotte, and R. Greiner. The budgeted multi-armed bandit problem. In Learning Theory, pages 643–645. Springer, 2004.
[23] S. Pandey and C. Olston. Handling advertisements of unknown quality in search advertising. In Advances in Neural Information Processing Systems, pages 1065–1072, 2006.
[24] A. Singla and A. Krause. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In Proc. of the 22nd WWW, pages 1167–1178, 2013.
[25] A. Slivkins and J. W. Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges (position paper). CoRR, abs/1308.1746, 2013.
[26] L. Tran-Thanh, A. C. Chapman, E. M. de Cote, A. Rogers, and N. R. Jennings. Epsilon-first policies for budget-limited multi-armed bandits. In Proc. of the 24th AAAI, 2010.
[27] L. Tran-Thanh, A. C. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In AAAI, 2012.
[28] H. Wu, R. Srikant, X. Liu, and C. Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. CoRR, abs/1504.06937, 2015.

A Concentration Inequalities

Lemma 6. (Freedman's inequality for martingales [12]) Let X_1, X_2, ..., X_T be a sequence of real-valued random variables. Assume for all t ∈ {1, 2, ..., T}, |X_t| ≤ R and E[X_t | X_1, ..., X_{t−1}] = 0. Define S := Σ_{t=1}^T X_t and V := Σ_{t=1}^T E[X_t² | X_1, ..., X_{t−1}]. For any δ ∈ (0, 1) and λ ∈ [0, 1/R], with probability at least 1 − δ,

    S ≤ (e − 2)λV + ln(1/δ)/λ .

Lemma 7. (Multiplicative version of Chernoff bounds) Let X_1, ..., X_n denote independent random samples from a distribution supported on [a, b], and let µ := E[Σ_i X_i]. Then, for all ε > 0,

    Pr( | Σ_{i=1}^n X_i − µ | ≥ εµ ) ≤ exp( −µε² / (3(b − a)²) ) .

Corollary 8. (to Lemma 7) Let X_1, ..., X_n denote independent random samples from a distribution supported on [a, b], and let µ̄ := (1/n) E[Σ_i X_i]. Then, for all ρ > 0, with probability at least 1 − ρ,

    | (1/n) Σ_{i=1}^n X_i − µ̄ | ≤ (b − a) √( 3µ̄ log(1/ρ) / n ) .

Proof. Given ρ > 0, use Lemma 7 with

    ε = (b − a) √( 3 log(1/ρ) / µ ) ,

to get that the probability of the event | Σ_{i=1}^n X_i − µ | > εµ = (b − a) √( 3µ log(1/ρ) ) is at most

    exp( −µε² / (3(b − a)²) ) = exp( −log(1/ρ) ) = ρ .  □

B Algorithmic Details for Section 3

B.1 Main Algorithm

Algorithm 1 provides the main algorithm used to solve the feasibility problem in Section 3. It requires two subroutines, one for solving (OP) for Q_m, and the other for sampling an action given Q_m. Solving (OP) is nontrivial, and is the focus of most of this section. The sampling process, Sample(x, Q, P, µ), takes the following as input: x (context), Q ∈ C_0(Π) (mixed policy returned by the optimization problem (OP) for the current epoch), P (default mixed policy), and µ > 0 (a scalar for the minimum action-selection probability). Since Q may not be a proper distribution (as its weights may sum to a number less than 1), Sample first computes Q̃ ∈ C(Π) by assigning any remaining mass (from Q) to the default policy P. Then, it picks an action from the smoothed projection Q̃^µ of this distribution, defined as Q̃^µ(a|x) = (1 − Kµ)Q̃(a|x) + µ, ∀a ∈ A.
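A minimal sketch of Sample, assuming policies and weight vectors are represented as maps from a context to a length-K array of action probabilities (a hypothetical representation chosen only for illustration):

```python
import numpy as np

def sample_action(x, Q, default_P, mu, K):
    """Sketch of Sample(x, Q, P, mu): complete Q with the default policy, smooth, and sample."""
    q = np.asarray(Q(x), dtype=float)                # Q's action probabilities at x (may sum to < 1)
    p_default = np.asarray(default_P(x), dtype=float)
    q_tilde = q + (1.0 - q.sum()) * p_default        # assign the missing mass to the default policy
    q_mu = (1.0 - K * mu) * q_tilde + mu             # smoothed projection: every action gets >= mu
    a = int(np.random.choice(K, p=q_mu))
    return a, q_mu[a]                                # the action and the probability it was chosen
```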

B.2 Solving Optimization Problem (OP) by Coordinate Descent

At the end of every epoch m of Algorithm 1, we solve the optimization problem (OP) to find Q_m ∈ C_0(Π). In the optimization problem (OP) described in the main text (Section 3), Q ∈ C_0(Π) was expressed as αQ′ for some Q′ ∈ C(Π). It is easy to see that any Q ∈ C_0(Π) can also be expressed as a linear combination of multiple mixed policies in C(Π):

    Q = Σ_{P∈C(Π)} α_P(Q) P ,

for some constants {α_P(Q)}_{P∈C(Π)}, so that

    ∀P ∈ C(Π): α_P(Q) ≥ 0   and   Σ_{P∈C(Π)} α_P(Q) ≤ 1 .

Note that the coefficients {α_P(Q)} may not be unique. Now, consider the following version of the (OP) problem.

Optimization Problem (OP)
Let b_P := Reĝ_t(P) / (ψµ_m), ∀P ∈ C(Π), where ψ := 100.
Find Q = ( Σ_{P∈C(Π)} α_P(Q) P ) ∈ C_0(Π), such that

    Σ_{P∈C(Π)} α_P(Q) b_P ≤ 2K ,
    ∀P ∈ C(Π):  Ê_{x∈H_t} E_{π∼P}[ 1 / Q^{µ_m}(π(x)|x) ] ≤ b_P + 2K .

Due to convexity of Reĝ_t(P), the above version of (OP) and the earlier version described in the main text (in Section 3) are equivalent, i.e., any feasible solution to one version provides a feasible solution to the other. To see this, first note that a solution αQ′ for Q′ ∈ C(Π) to the earlier (OP) problem trivially gives a solution Q = αQ′ to the above. For the other side, suppose we are given a solution Q to the above; set Q′ = Σ_{P∈C(Π)} (α_P(Q)/α) P, with α = Σ_{P∈C(Π)} α_P(Q). Then, by Jensen's inequality,

    α Reĝ_t(Q′) ≤ α Σ_{P∈C(Π)} (α_P(Q)/α) Reĝ_t(P) = Σ_{P∈C(Π)} α_P(Q) Reĝ_t(P) ≤ 2Kψµ_m .

Therefore, the first constraint is satisfied. Also, since αQ′ = Q, the second constraint is trivially satisfied. Therefore, αQ′ is a feasible solution to the earlier (OP).

In the rest of the paper, we assume (OP) to be the above optimization problem. We solve it using the coordinate descent algorithm described below, which assigns a non-zero weight α_P(Q) to at most one new policy P ∈ C(Π) in every iteration.

Let us fix m and use the shorthand µ for µ_m. The optimization problem (OP) described above is of exactly the same form as the optimization problem in [2], except that the policy set being considered is C(Π) instead of Π. We can solve it using a coordinate descent algorithm similar to [2, Algorithm 2].

Algorithm 2 Coordinate Descent Algorithm for Solving (OP)
Input: History H_t, minimum probability µ > 0, initial weights Q_init ∈ C_0(Π).
1: Q ← Q_init.
2: loop
3:   Define, for all P ∈ C(Π),
        V_P(Q) := E_{π∼P}[ Ê_{x∼H_t}[ 1 / Q^µ(π(x)|x) ] ] ,
        S_P(Q) := E_{π∼P}[ Ê_{x∼H_t}[ 1 / (Q^µ(π(x)|x))² ] ] ,
        D_P(Q) := V_P(Q) − (2K + b_P) .
4:   if Σ_{P∈C(Π)} α_P(Q)(2K + b_P) > 2K then
5:     Replace Q by cQ, so that Q ∈ C_0(Π), where
            c := 2K / Σ_{P∈C(Π)} α_P(Q)(2K + b_P) < 1 .    (5)
6:   end if
7:   if there is a P ∈ C(Π) for which D_P(Q) > 0 then
8:     Update the coefficient for P by
            α_P(Q) ← α_P(Q) + ( V_P(Q) + D_P(Q) ) / ( 2(1 − Kµ) S_P(Q) ) .
9:   else
10:    Halt and output the current set of weights Q.
11:  end if
12: end loop

Lemma 9. Algorithm 2 can be implemented using one call to the AMO (Definition 1) in the beginning, before the loop is started, and one call for each iteration of the loop thereafter.

Proof. Initially, one needs to compute P_t, which can be done by calling the AMO once on (x_i, v̂_i) for i = 1, ..., t, to minimize d( (1/t) Σ_{i=1}^t Σ_π P(π) v̂_i(π(x_i)), S ).

What remains to show is that we can identify a P for which D_P(Q) > 0, whenever one exists, by one call to the AMO. All the other steps can be performed efficiently for Q with sparse support. Now,

    D_P(Q) = V_P(Q) − (2K + b_P)
           = V_P(Q) − ( 2K + Reĝ_t(P)/(ψµ) )
           = (1/t) Σ_{i=1}^t Σ_π P(π) / Q^µ(π(x_i)|x_i) − ( 2K + ( d(R̂_t(P), S) − d(R̂_t(P_t), S) ) / (ψµ‖1_d‖) ) .

Finding P such that D_P(Q) > 0 requires solving arg max_{P∈C(Π)} D_P(Q). Define the sequence of contexts and (d + 1)-dimensional reward vectors (x_i, ṽ_i) for i = 1, ..., t as

    ṽ_i(a) := [ 1/Q^µ(a|x_i) ,  v̂_i(a)/(ψµ‖1_d‖) ] .

And define the concave function g: R^{d+1} → R as

    g(v) = vᵀe_1 − d(vᵀe_{−1}, S), for any v ∈ R^{d+1},

where e_1 denotes the vector with 1 in the first component and 0 elsewhere, and e_{−1} denotes the vector with 0 in the first component and 1 elsewhere. Then, it is easy to check that

    D_P(Q) = g( (1/t) Σ_{i=1}^t Σ_{π∈Π} P(π) ṽ_i(π(x_i)) ) − 2K + d(R̂_t(P_t), S)/(ψµ‖1_d‖) .

Since −2K + d(R̂_t(P_t), S)/(ψµ‖1_d‖) is a constant independent of P,

    arg max_{P∈C(Π)} D_P(Q) = arg max_{P∈C(Π)} g( (1/t) Σ_{i=1}^t Σ_π P(π) ṽ_i(π(x_i)) ) .

Thus, by calling the AMO once on the sequence {(x_i, ṽ_i)}_{i=1,2,...,t}, we can find the P that maximizes D_P(Q), and thus identify a P for which D_P(Q) > 0, whenever one exists.  □

Lemma 10. The number of times Step 8 of the algorithm is performed is bounded by 4 ln(1/(Kµ))/µ.

Proof. This follows by applying the analysis of Algorithm 2 in [2] (refer to Section 5 there) with the policy set being C(Π) instead of Π. (Their analysis holds for any value of the constant µ, and any constants b_P for policies in the policy set being considered.)  □

Now, since in epoch m, µ = µ_m ≥ √( d_{τ_m} / (Kτ_m) ), where d_t = ln(16t²|Π|/δ), this proves that the algorithm converges in at most O( √(KT ln(T|Π|/δ)) ln(T ln(T|Π|)) ) = Õ( √(KT ln(|Π|)) ) iterations of the loop.

Finally, we note that representing a mixture policy P ∈ C(Π) may require specifying up to |Π| coefficients, one for each π ∈ Π. In some cases it may suffice to consider P with small support only. For example, in CBwK, where S is given by the linear budget constraints, it can be shown that P* and P_t have support on at most d points; therefore, the agent may compete against mixed policies with support of size at most d. In general, the maximum size of support that needs to be considered depends on S. Another possibility is to choose a class Θ of mixed policies in place of C(Π), so that each P ∈ Θ can be compactly represented (e.g., in a parametric way). The results in the paper, including the regret analysis and the AMO-based optimization procedure, may be adapted to work with such a Θ.

C Regret Analysis for Section 3: Feasibility Problem

Fix the epoch schedule 0 = τ_0 < τ_1 < τ_2 < ..., such that τ_m < τ_{m+1} ≤ 2τ_m. The following quantities are defined for convenience:

    µ_m := min{ 1/(2K), √( d_{τ_m} / (Kτ_m) ) } ,
    d_t := ln(16t²|Π|/δ) ,
    m_0 := min{ m ∈ N : d_{τ_m}/τ_m ≤ 1/(4K) } ,
    t_0 := min{ t ∈ N : d_t/t ≤ 1/(4K) } ,
    ρ := sup_{m≥m_0} √( τ_m / τ_{m−1} ) .

A few quick observations are in order. For m ≥ m_0, µ_m = √( d_{τ_m} / (Kτ_m) ). Furthermore, d_t/t is non-increasing in t, and µ_m is non-increasing in m. Finally, ρ ≤ √2 since τ_{m+1} ≤ 2τ_m.
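For concreteness, the sketch below evaluates these quantities for the doubling schedule τ_m = 2^m, which satisfies τ_m < τ_{m+1} ≤ 2τ_m; the schedule and the constants passed in are illustrative choices, not values fixed by the paper.

```python
import math

def epoch_constants(K, n_policies, delta, n_epochs=20):
    """Sketch: mu_m, m_0 and rho for the doubling schedule tau_m = 2**m."""
    d = lambda t: math.log(16 * t * t * n_policies / delta)      # d_t = ln(16 t^2 |Pi| / delta)
    taus = [2 ** m for m in range(1, n_epochs + 1)]
    mus = [min(1.0 / (2 * K), math.sqrt(d(tau) / (K * tau))) for tau in taus]
    # m_0: first epoch with d_{tau_m}/tau_m <= 1/(4K)
    m0 = next((m for m, tau in enumerate(taus, start=1) if d(tau) / tau <= 1.0 / (4 * K)), None)
    rho = math.sqrt(2.0)        # sup of sqrt(tau_m / tau_{m-1}) for this schedule
    return taus, mus, m0, rho
```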

Definition 3 (Variance estimates). Define the following for any probability distributions P, Q ∈ C(Π), any policy π ∈ Π, and µ ∈ [0, 1/K]:

    V(Q, π, µ) := E_{x∼D_X}[ 1 / Q^µ(π(x)|x) ] ,
    V̂_m(Q, π, µ) := Ê_{x∼H_{τ_m}}[ 1 / Q^µ(π(x)|x) ] ,
    V(Q, P, µ) := E_{π∼P}[ V(Q, π, µ) ] ,
    V̂_m(Q, P, µ) := E_{π∼P}[ V̂_m(Q, π, µ) ] ,

where Ê_{x∼H_{τ_m}} denotes the average over records in the history H_{τ_m}. Furthermore, let m(t) := min{m ∈ N : t ≤ τ_m} be the index of the epoch containing round t, and define

    V_t(P) := max_{0 ≤ m < m(t)} { V(Q̃_m, P, µ_m) } ,

for all t ∈ N and P ∈ C(Π).

Definition 4. Define E as the event that the following statements hold:

• For all probability distributions P, Q ∈ C(Π) and all m ≥ m_0,

    V(Q, P, µ_m) ≤ 6.4 V̂_m(Q, P, µ_m) + 81.3K .    (6)

• For all P ∈ C(Π), all epochs m and all rounds t in epoch m, any δ ∈ (0, 1), and any choice of λ_{m−1} ∈ [0, µ_{m−1}],

    (1/‖1_d‖) ‖R̂_t(P) − R(P)‖ ≤ V_t(P) λ_{m−1} + d_t/(t λ_{m−1}) .    (7)

Lemma 11. Pr(E) ≥ 1 − (δ/2).

Proof. Lemma 10 in [2] can be applied as is to show that, with probability 1 − δ/4,

    V(Q, π, µ_m) ≤ 6.4 V̂_m(Q, π, µ_m) + 81.3K

for all Q ∈ C(Π) and π ∈ Π. Now, taking expectations on both sides over π ∼ P, we get the first condition.

For the second condition, the proof is similar to the proof of Lemma 11 in [2], but with some changes to account for distributions over policies. Fix a component j of the reward vector, a policy π ∈ Π, and an epoch m ∈ N. Then,

    R̂_t(π)_j − R(π)_j = (1/t) Σ_{i=1}^t Z_i ,

where Z_i := v_i(π(x_i))_j − v̂_i(π(x_i))_j. Round i is in epoch m(i) ≤ m, so

    |Z_i| ≤ 1 / Q̃^{µ_{m(i)−1}}_{m(i)−1}(π(x_i)|x_i) ≤ 1/µ_{m(i)−1} ≤ 1/µ_{m−1} ,

by the definition of the fictitious reward vector v̂_t. Furthermore, E[Z_i | H_{t−1}] = 0 and

    E[Z_i² | H_{t−1}] ≤ E[ v̂_i(π(x_i))² | H_{i−1} ] ≤ V(Q̃_{m(i)−1}, π, µ_{m(i)−1}) ,

from the definition of the fictitious reward and of V(Q, π, µ).

Let U(π) := (1/t) Σ_{i=1}^t V(Q̃_{m(i)−1}, π, µ_{m(i)−1}) ≥ (1/t) Σ_{i=1}^t E[Z_i² | H_{t−1}]. Then, by Freedman's inequality (Lemma 6) and a union bound applied to the sums (1/t) Σ_{i=1}^t Z_i and (1/t) Σ_{i=1}^t (−Z_i), we have that, with probability at least 1 − 2δ/(16t²|Π|), for all λ_{m−1} ∈ [0, µ_{m−1}],

    (1/t) Σ_{i=1}^t Z_i ≤ (e − 2) U(π) λ_{m−1} + ln(16t²|Π|/δ)/(t λ_{m−1}) ,  and
    −(1/t) Σ_{i=1}^t Z_i ≤ (e − 2) U(π) λ_{m−1} + ln(16t²|Π|/δ)/(t λ_{m−1}) .

Taking a union bound over all choices of t and π ∈ Π, we have that, with probability at least 1 − δ/4, for all π and t,

    R̂_t(π)_j − R(π)_j ≤ (e − 2) U(π) λ_{m−1} + d_t/(t λ_{m−1}) ,  and    (8)
    R(π)_j − R̂_t(π)_j ≤ (e − 2) U(π) λ_{m−1} + d_t/(t λ_{m−1}) .         (9)

Note that

    E_{π∼P}[U(π)] = E_{π∼P}[ (1/t) Σ_{i=1}^t V(Q̃_{m(i)−1}, π, µ_{m(i)−1}) ]
                  = (1/t) Σ_{i=1}^t V(Q̃_{m(i)−1}, P, µ_{m(i)−1})
                  ≤ (1/t) Σ_{i=1}^t V_t(P) = V_t(P) ,

by the definition of V(Q, P, µ). Also, by definition, E_{π∼P}[R(π)_j] = R(P)_j and E_{π∼P}[R̂(π)_j] = R̂(P)_j. Therefore, taking expectations with respect to π ∼ P on both sides of Equations (8) and (9), we get that, with probability 1 − δ/4, for all P ∈ C(Π),

    R̂_t(P)_j − R(P)_j ≤ (e − 2) V_t(P) λ_{m−1} + d_t/(t λ_{m−1}) ,
    R(P)_j − R̂_t(P)_j ≤ (e − 2) V_t(P) λ_{m−1} + d_t/(t λ_{m−1}) .

That is, with probability 1 − δ/4, for all t and all P ∈ C(Π), we have

    (‖1_d‖)⁻¹ ‖R̂_t(P) − R(P)‖ ≤ (e − 2) V_t(P) λ_{m−1} + d_t/(t λ_{m−1}) .  □

Lemma 12. Assume event E holds. Then for all m ≤ m_0, all rounds t in epoch m, and all policies P ∈ C(Π),

    (‖1_d‖)⁻¹ ‖R̂_t(P) − R(P)‖ ≤ max{ √( 4Kd_t V_t / t ), 4Kd_t/t } .    (10)

Proof. By the definition of m_0, for all m′ < m_0, µ_{m′} = 1/(2K). Therefore µ_{m−1} = 1/(2K). First consider the case when √( d_t/(tV_t) ) < µ_{m−1} = 1/(2K). Then, substitute λ_{m−1} = √( d_t/(tV_t) ) to get

    (‖1_d‖)⁻¹ ‖R̂_t(P) − R(P)‖ ≤ √( 4d_t V_t / t ) .

Otherwise, V_t ≤ 4K²d_t/t. Substituting λ_{m−1} = µ_{m−1} = 1/(2K), we get

    (‖1_d‖)⁻¹ ‖R̂_t(P) − R(P)‖ ≤ 4Kd_t/t .  □

Lemma 13. Assume event E holds. Then, for all m, all rounds t in epoch m, and all choices of distributions P ∈ C(Π),

    (‖1_d‖)⁻¹ ‖R̂_t(P) − R(P)‖ ≤ { max{ √( 4Kd_t V_t(P) / t ), 4Kd_t/t } ,      m ≤ m_0,
                                   V_t(P) µ_{m−1} + d_t/(t µ_{m−1}) ,            m > m_0.    (11)

Proof. Follows from the definition of event E and Lemma 12.  □

Lemma 14. (Equivalent of Lemma 12 in [2]) Assume event E holds. For any round t ∈ N and any policy P ∈ C(Π), let m ∈ N be the epoch achieving the max in the definition of V_t(P). Then,

    V_t(P) ≤ { 2K ,                                     if µ_m = 1/(2K),
               θ_1 K + Reĝ_{τ_m}(P)/(θ_2 µ_m) ,          if µ_m < 1/(2K),

where θ_1 = 94.1 and θ_2 = ψ/6.4 are universal constants.

Proof. Fix a round t and a policy distribution P ∈ C(Π). Let m < m(t) be the epoch achieving the max in the definition of V_t(P) (Definition 3), so V_t(P) = V(Q̃_m, P, µ_m). If µ_m = 1/(2K), this immediately implies V_t(P) ≤ 2K by definition. If µ_m < 1/(2K), then µ_m = min{ 1/(2K), √( d_{τ_m}/(Kτ_m) ) } = √( d_{τ_m}/(Kτ_m) ), and we have

    V(Q̃_m, P, µ_m) ≤ 6.4 V̂_m(Q̃_m, P, µ_m) + 81.3K
                    ≤ 6.4 V̂_m(Q_m, P, µ_m) + 81.3K
                    ≤ 6.4 ( 2K + Reĝ_{τ_m}(P)/(ψµ_m) ) + 81.3K
                    = θ_1 K + Reĝ_{τ_m}(P)/(θ_2 µ_m) ,

where the first step is from Equation (6) (which holds in event E); the second step is from the observation that Q̃_m(π) ≥ Q_m(π) for all π ∈ Π; the third step is from the constraint in (OP) that Q_m satisfies; and the last step follows from the universal constants θ_1 and θ_2 defined earlier.  □

Lemma 15. (Equivalent of Lemma 13 in [2]) Assume event E holds. Define c_0 := 4ρ(1 + θ_1). For all epochs m ≥ m_0, all rounds t ≥ t_0 in epoch m, and all policies P ∈ C(Π),

    Reg(P) ≤ 2 Reĝ_t(P) + c_0 K µ_m ,
    Reĝ_t(P) ≤ 2 Reg(P) + c_0 K µ_m .

Proof. The proof is by induction on m. For the base case, we have m = m_0 and t ≥ t_0 in epoch m_0. Then, from Lemma 13, using the facts that V_t ≤ 2K and that 4Kd_t/t ≤ 1 for t ≥ t_0 in epoch m_0, we get, for all P ∈ C(Π), that

    (‖1_d‖)⁻¹ ‖R̂_t(P) − R(P)‖ ≤ max{ √( 4Kd_t V_t(P) / t ), 4Kd_t/t } ≤ √( 8Kd_t / t ) .

Then, using the triangle inequality, the definition of P_t, and the existence of an admissible distribution P* ∈ C(Π) with d(R(P*), S) = 0, we have

    ‖1_d‖ |Reĝ_t(P) − Reg(P)| = |d(R̂_t(P), S) − d(R̂_t(P_t), S) − d(R(P), S)|
        ≤ |d(R̂_t(P), S) − d(R(P), S)| + d(R̂_t(P_t), S)
        ≤ |d(R̂_t(P), S) − d(R(P), S)| + d(R̂_t(P*), S)
        ≤ ‖R̂_t(P) − R(P)‖ + ‖R̂_t(P*) − R(P*)‖ ,

so that

    |Reĝ_t(P) − Reg(P)| ≤ 2 √( 8Kd_t / t ) ≤ c_0 K µ_{m_0} ,

for c_0 ≥ 4√2. The base case then follows from the non-negativity of Reg(P) and Reĝ_t(P).

For the induction step, fix some epoch m > m0 and assume for all epochs m′ < m, all rounds t′ ≥ t0 in epoch m′ , and all distributions P ∈ C(Π) that,

Then, we have the following,

Reg(P ) ≤ d ′ (P ) ≤ Reg t

d (P )) k1d k(Reg(P ) − Reg t

= ≤



d (P ) ≤ Reg(P ) − Reg t

d ′ (P ) + c0 Kµm′ 2Reg t

2Regt′ (P ) + c0 Kµm′ .

ˆ t (P ), S) − d(R ˆ t (Pt ), S)) d(R(P ), S) − (d(R ˆ t (P ), S) + d(R ˆ t (P ∗ ), S) d(R(P ), S) − d(R

ˆ t (P ) − R(P )k + kR ˆ t (P ∗ ) − R(P ∗ )k kR 2dt (Vt (P ) + Vt (P ∗ ))µm−1 + , tµm−1

(12)

where we have used the definition of Pt , the triangle inequality, and Equation (11). Similarly, d (P ) − Reg(P )) k1d k(Reg t

d (P ) − Reg(P ) Reg t

ˆ t (P ), S) − d(R ˆ t (Pt ), S) − d(R(P ), S) = d(R ˆ t (P ), S) − d(R ˆ t (Pt ), S) − d(R(P ), S) + d(R(Pt ), S) ≤ d(R     ˆ t (P ), S) − d(R(P ), S) + d(R(Pt ), S) − d(R ˆ t (Pt ), S) = d(R

ˆ t (P ) − R(P )k + kR ˆ t (Pt ) − R(Pt )k ≤ kR 2dt , ≤ (Vt (P ) + Vt (Pt ))µm−1 + tµm−1

(13)

where we have used the definition of Pt , the triangle inequality, and Equation (11). By Lemma 14, there exist epochs m′ , m′′ < m such that Vt (P ) ≤ Vt (P ∗ ) ≤

  d t (P ) 1 Reg , I µm ′ < θ 2 µm ′ 2K  d (P ∗ )  Reg 1 t ′′ θ1 K + . I µm < θ2 µm′′ 2K θ1 K +

If µm′ < 1/(2K), then m0 ≤ m′ ≤ m − 1, and the inductive hypothesis implies d τ (P ) Reg m′ θ2 µ

m′



2Reg(P ) + c0 Kµm′ c0 K 2Reg(P ) c0 K 2Reg(P ) = + ≤ + , ′ ′ θ 2 µm θ2 θ 2 µm θ2 θ2 µm−1

where the last step uses the fact that µm′ ≥ µm−1 for m′ ≤ m − 1. Therefore, no matter whether µm′ < 1/(2K) or not, we always have   2 c0 Kµm−1 + Reg(P ) . Vt (P )µm−1 ≤ θ1 + (14) θ2 θ2 If µm′′ < 1/(2K), then m0 ≤ m′′ ≤ m − 1, and the inductive hypothesis implies ∗ d Reg τm′′ (P )

θ2 µm′′



2Reg(P ∗ ) + c0 Kµm′′ c0 K = , θ 2 µj θ2

where the last step uses the fact that Reg(P ∗ ) = 0. Therefore, no matter whether µm′′ < 1/(2K) or not, we always have   c0 Vt (P ∗ )µm−1 ≤ θ1 + Kµm−1 . (15) θ2 Combining Equations (12), (14) and (15) gives   c0 2dt 1 d Regt (P ) + 2(θ1 + )Kµm−1 + . (16) Reg(P ) ≤ 1 − 2/θ2 θ2 tµm−1 17

Since m > m0 , the definition of ρ ensures that µm−1 ≤ ρµm . Also, since t > τm−1 ,

dt tµm−1

Kµ2m−1



µm−1 ≤ ρKµm . Applying these inequalities and the facts c0 = 4ρ(1 + θ1 ) and θ2 ≥ 8ρ in Equation (16), we have thus proved

d t (P ) + c0 Kµm . Reg(P ) ≤ 2Reg

(17)

The other part can be proved similarly. By Lemma 14, there exist epochs m′′ < m such that  d (Pt )  1 Reg t . I µm′′ < Vt (Pt ) ≤ θ1 K + θ2 µm′′ 2K

If µm′′ < 1/(2K), then m0 ≤ m′′ ≤ m − 1, and the inductive hypothesis together with Equation (17) imply d Reg τm′′ (Pt ) θ2 µm′′



d t ) + c0 Kµm′′ ) + c0 Kµm′′ 2Reg(Pt ) + c0 Kµm′′ 2(2Reg(P ≤ . θ2 µm′′ θ2 µm′′

d t ) = 0 by definition, the above upper bound is simplified to Since Reg(P d Reg τm′′ (Pt ) θ2 µm′′



3c0 Kµm′′ 3c0 K . = θ2 µm′′ θ2

Therefore, no matter whether µm′′ < 1/(2K) or not, we always have Vt (Pt )µm−1 ≤ (θ1 +

3c0 )Kµm−1 . θ2

(18)

Combining Equations (13), (14) and (18) gives d (P ) ≤ (1 + 2 )Reg(P ) + 2(θ1 + 2c0 )Kµm−1 + 2dt . Reg t θ2 θ2 tµm−1

Since m > m0 , the definition of ρ ensures that µm−1 ≤ ρµm . Also, since t > τm−1 , Kµ2m−1

(19) dt tµm−1



µm−1 ≤ ρKµm . Applying these inequalities and the facts c0 = 4ρ(1 + θ1 ) and θ2 ≥ 8ρ in Equation (16), we have thus proved the second part in the inductive statement:

and the whole lemma.

d t (P ) ≤ 2Reg(P ) + c0 Kµm , Reg



C.1 Main Proof
We are now ready to prove Theorem 3. By Lemma 11, event E holds with probability at least 1 − (δ/2). Hence, it suffices to prove the regret upper bound whenever E holds.

Recall from Appendix B that the algorithm samples a_t at time t in epoch m from the smoothed projection Q̃^{µ_{m−1}}_{m−1} of Q̃_{m−1}. Also, recall from Appendix B.2 that Q̃_m, for any m, is represented as a linear combination of policies P ∈ C(Π) as follows:

    Q̃_m = Σ_{P∈C(Π)} α_P(Q̃_m) P = Σ_{P∈C(Π)} α_P(Q_m) P + ( 1 − Σ_{P∈C(Π)} α_P(Q_m) ) P_t .

(Q̃_m assigns all the remaining weight from Q_m to P_t.)

˜t = Q ˜ m(t)−1 , µt = µm(t)−1 , where m(t) denotes the epoch in which time step t lies: m(t) = Let Q m for t ∈ [τm−1 + 1, τm ]. Then, 1 1X 1 1 X ˜ t ), S) ≤ ˜ t ), S) d( R(Q d(R(Q k1dk T t k1dk T t 1 1 X X ˜ t )d(R(P ), S) ≤ αP (Q k1dk T t P ∈C(Π)

=

1X X ˜ t )Reg(P ) , αP (Q T t P ∈C(Π)

18

where we have used Jensen’s inequality twice. For m ≤ m0 , µm−1 =

1 2K .

So, trivially, for t in epoch m ≤ m0 , X ˜ m−1 )Reg(P ) ≤ c0 Kψµm−1 . αP (Q

P ∈C(Π)

Suppose E holds. Then, Lemma 15 implies that for all epochs m ≥ m0 , all rounds t ≥ t0 in epoch m, and all policies P ∈ C(Π), we have d (P ) + c0 Kµm . Reg(P ) ≤ 2Reg t

Therefore, for t in such epochs m, using the first condition in OP (from Section B.2), we get X X ˜ m−1 )(2Reg ˜ m−1 )Reg(P ) ≤ d (P ) + c0 Kψµm−1 ) αP (Q αP (Q t P ∈C(Π)

P ∈C(Π)

X

=

P ∈C(Π)



d (P ) + c0 Kψµm−1 ) αP (Qm−1 )(2Reg t

(c0 + 2)Kψµm−1 .

d (Pt ) = 0. The equality in above holds because Reg t

Substituting, we get, 1 X ˜ t ), S) ≤ d( R(Q T t

k1d kKψ(c0 + 2) X µm−1 (τm − τm−1 ) . T m

(20)

P P ˜ t ), S). Fix a Next, we show that the actual regret, d( T1 t vt (at ), S), is close to d( T1 t R(Q component i ∈ [d] and let [v]i be the ith component of vector v. Recall that the algorithm samples ˜ µt . Define the random variable at step t by at from Q t X ˜ t (π)[vt (π(xt ))]i + µt . Zt := [vt (at )]i − (1 − Kµt )Q π∈Π

It is easy to see E[Zt |Ht−1 ] = 0, so the Azuma-Hoeffding inequality for martingale sequences implies that, with probability at least 1 − δ/(2d), r T 4d 1 X 1 ǫ := ln ≥| Zt | . 2T δ T t=1

Applying a union bound over i ∈ [d], we have with probability at least 1 − δ/2 that k

T X 1X 1X ˜ t )k ≤ ǫk1d k + (K + 1) vt (at ) − R(Q µt k1d k , T t T t T t=1

(21)

which implies, together with the triangle inequality, that d(

T X 1X 1 X ˜ t ), S) ≤ ǫk1d k + (K + 1) vt (at ), S) − d( R(Q µt k1d k . T t T t T t=1

Combining (20) and (22), we get 1X vt (at ), S) ≤ d( T t

k1d kKψ(c0 + 4) X µm−1 (τm − τm−1 ) + ǫk1d k . T m

Applying an upper bound [2, Lemma 16] on the sum over µm−1 above gives ! r 8dτm (T ) τm(T ) 1 X k1d kKψ(c0 + 4) τm0 d( + ǫk1d k . + vt (at ), S) ≤ T t T 2K K 19

(22)

(23)

(24)

By definition of m0 , τm0 −1 ≤ 4Kdτm0 −1 , so τm0 ≤ 2τm0 −1 ≤ 8Kdτm0 −1 ≤ 8KdT = 8K ln

16T 2|Π| . δ

Furthermore, note that τm(T ) ≤ 2τm(T )−1 ≤ 2(T − 1) < 2T , so 64T 2|Π| . δ With these further bounding, Equation (24) becomes dτm(T ) ≤ ln

1 X d( vt (at ), S) T t

K 16T 2 |Π| ln + T δ

≤ k1d kψ(4c0 + 16)

r

K 64T 2|Π| ln T δ

!

+ ǫk1dk(25) .

Substituting ǫ, one gets the final regret lower bound stated in the theorem: 1X avg-regret2 (T ) = d( vt (at ), S) T t ! r r 4d K 64T 2 |Π| 1 K 16T 2|Π| + k1d k ln + ln ln ≤ k1d kψ(4c0 + 16) T δ T δ 2T δ !! r r K T |Π| 1 d K T |Π| = O k1d k . ln + ln + ln T δ T δ T δ Note that a regret bound of above order is trivial unless T ≥ K ln(T |Π|/δ). Making that assumption, we get from above: !! r r K T |Π| 1 d 2 avg-regret (T ) = O k1d k . ln + ln T δ T δ

D Appendix for Section 4: General CBwCR

D.1 Existence of sensitivity parameter Z

Lemma 16. There always exists Z ≥ L satisfying Assumption 2.

Proof. We have

    OPT = max_{P∈C(Π)} f(R(P)) such that R(P) ∈ S .

For any γ ≥ 0, define

    OPT_γ := max_{P∈C(Π)} f(R(P)) such that d(R(P), S) ≤ γ .

Suppose that there exists λ* such that for all γ ≥ 0, OPT_γ ≤ OPT + λ*γ. Now, for any given P ∈ C(Π), consider γ = d(R(P), S). Then,

    f(R(P)) ≤ OPT_γ ≤ OPT + λ*γ = f(R(P*)) + λ* d(R(P), S) .

Therefore, Z = max{2λ*, L} would satisfy Assumption 2. In the remainder, we prove that there exists λ* ≥ 0 such that for all γ ≥ 0, OPT_γ ≤ OPT + λ*γ.

Let Ω := {x : ∃P ∈ C(Π), x = R(P)}. Then Ω is a convex set, and OPT_γ can be written as the following convex optimization problem over x:

    OPT_γ := max_{x∈Ω} f(x) such that d(x, S) ≤ γ .

Then, applying Lagrangian duality for convex programs,
\[
\text{OPT}^\gamma = \min_{\lambda\ge 0}\ \max_{x\in\Omega}\ f(x) + \lambda\big(\gamma - d(x, S)\big)\,.
\]


By Fenchel duality, for any concave function $f$,
\[
f(x) = \min_{\phi:\|\phi\|_*\le L} f^*(\phi) - \phi^T x\,,
\]
where $f^*$ is the Fenchel dual of $f$. Similarly, for the distance function,
\[
d(x, S) = \max_{\theta:\|\theta\|_*\le 1} \theta^T x - h_S(\theta)\,,
\]

where $h_S(\theta) := \max_{v\in S} \theta^T v$. Substituting in the expression for $\text{OPT}^\gamma$, we get
\begin{align*}
\text{OPT}^\gamma
&= \min_{\lambda\ge 0}\ \max_{x\in\Omega}\ \min_{\|\phi\|_*\le L,\,\|\theta\|_*\le 1}\ \big(f^*(\phi) - \phi^T x\big) + \lambda\gamma - \lambda\big(\theta^T x - h_S(\theta)\big) \\
&= \min_{\lambda\ge 0,\,\|\phi\|_*\le L,\,\|\theta\|_*\le 1}\ \max_{x\in\Omega}\ f^*(\phi) - \phi^T x - \lambda\theta^T x + \lambda h_S(\theta) + \lambda\gamma \\
&= \min_{\lambda\ge 0,\,\|\phi\|_*\le L,\,\|\theta\|_*\le 1}\ f^*(\phi) + h_\Omega(-\phi - \lambda\theta) + \lambda h_S(\theta) + \lambda\gamma \\
&\le f^*(\phi^*) + h_\Omega(-\phi^* - \lambda^*\theta^*) + \lambda^* h_S(\theta^*) + \lambda^*\gamma \\
&= \text{OPT} + \lambda^*\gamma\,,
\end{align*}
where $h_\Omega(y) := \max_{x\in\Omega} y^T x$ denotes the support function of $\Omega$.

Here, $\phi^*, \theta^*, \lambda^*$ denote the optimal values of these variables for the corresponding optimization problem when $\gamma = 0$ (i.e., for the convex program corresponding to OPT). This completes the proof of the lemma statement.

From the above, we can also observe that $\text{OPT}^\gamma$ is a non-decreasing concave function of $\gamma$, with gradient $\lambda^*(\gamma) \ge 0$, where $\lambda^*(\gamma)$ is the optimal value of the dual variable corresponding to the distance constraint $d(x, S) \le \gamma$. Therefore, $\text{OPT}^{\gamma'} \le \text{OPT} + \lambda^*(\gamma)\gamma'$ for all $\gamma' \ge \gamma$. However, $\lambda^* \ge \lambda^*(\gamma)$ for any $\gamma \ge 0$, and $\lambda^*(\gamma) \to \lambda^*$ as $\gamma \to 0$. Therefore, $\lambda^*$ is the smallest value that satisfies this bound for all $\gamma \ge 0$. $\square$
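As a sanity check on the sensitivity bound $\text{OPT}^\gamma \le \text{OPT} + \lambda^*\gamma$, the following Python sketch brute-forces a toy instance (a hypothetical two-dimensional $\Omega$, a linear objective, a box constraint set $S$, and Euclidean distance; none of this is from the paper) and inspects the slopes $(\text{OPT}^\gamma - \text{OPT})/\gamma$.
\begin{verbatim}
import numpy as np

# Toy check of Lemma 16's observation: (OPT^gamma - OPT)/gamma is (approximately)
# non-increasing in gamma, so its limit as gamma -> 0 is the smallest valid lambda*.
rng = np.random.default_rng(0)

pts = rng.uniform(-1, 1, size=(200_000, 2))
omega = pts[np.linalg.norm(pts, axis=1) <= 1.0]      # Omega = unit disk (convex)

f_vals = omega.sum(axis=1)                            # f(x) = x1 + x2 (concave)
dist_S = np.linalg.norm(np.maximum(omega - 0.3, 0.0), axis=1)   # distance to box S

def opt_gamma(gamma):
    """Brute-force OPT^gamma = max f(x) over x in Omega with d(x, S) <= gamma."""
    return f_vals[dist_S <= gamma].max()

OPT = opt_gamma(0.0)
gammas = np.linspace(0.05, 1.0, 20)
slopes = [(opt_gamma(g) - OPT) / g for g in gammas]
print("OPT =", round(OPT, 3))
print("slopes:", [round(s, 3) for s in slopes])   # roughly non-increasing in gamma
\end{verbatim}
The brute-force maximization over a random sample only approximates the convex program, so small Monte Carlo wiggles in the printed slopes are expected.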

D.2 Regret Analysis

Lemma 17. Assume event $E$ holds. Define $c_0 := 4\rho(1+\theta_1)$. For all epochs $m \ge m_0$, all rounds $t \ge t_0$ in epoch $m$, and all policies $P \in C(\Pi)$,
\begin{align*}
\mathrm{Reg}(P) &\le 2\widehat{\mathrm{Reg}}_t(P) + c_0 K \mu_m\,, \\
\widehat{\mathrm{Reg}}_t(P) &\le 2\,\mathrm{Reg}(P) + c_0 K \mu_m\,,
\end{align*}
for $\mathrm{Reg}(P)$, $\widehat{\mathrm{Reg}}_t(P)$ as defined in Section 4.1.

Proof. The proof is by induction. For the base case, let $m = m_0$ and $t \ge t_0$ in epoch $m$. Then, from Lemma 13, using that $V_t \le 2K$ and $\frac{4Kd_t}{t} \le 1$ for $t \ge t_0$ in epoch $m_0$, we get
\[
(\|\mathbf{1}_d\|)^{-1}\,\|\hat R_t(P) - R(P)\| \;\le\; \sqrt{\frac{8Kd_t}{t}}
\]
for all $P \in C(\Pi)$. Now, for all $P \in C(\Pi)$,
\begin{align*}
\|\mathbf{1}_d\| Z\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big)
&= f(\hat R_t(P_t)) - f(\hat R_t(P)) - f(R(P^*)) + f(R(P)) \\
&\quad - Z\big(d(\hat R_t(P_t), S) - d(\hat R_t(P), S) + d(R(P), S)\big)\,. \tag{26}
\end{align*}

By the assumption about $Z$, we have that $f(R(P^*)) \ge f(R(P_t)) - Z d(R(P_t), S)$. Substituting in (26), we get
\begin{align*}
\|\mathbf{1}_d\| Z\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big)
&\le f(\hat R_t(P_t)) - f(\hat R_t(P)) - f(R(P_t)) + Z d(R(P_t), S) + f(R(P)) \\
&\quad - Z\big(d(\hat R_t(P_t), S) - d(\hat R_t(P), S) + d(R(P), S)\big) \\
&\le Z\|\hat R_t(P_t) - R(P_t)\| + Z\|\hat R_t(P) - R(P)\|\,,
\end{align*}
so that
\[
\|\mathbf{1}_d\|\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big) \;\le\; \|\hat R_t(P_t) - R(P_t)\| + \|\hat R_t(P) - R(P)\|\,. \tag{27}
\]

For the other side, by the definition of $P_t$, we have that $f(\hat R_t(P_t)) - Z d(\hat R_t(P_t), S) \ge f(\hat R_t(P^*)) - Z d(\hat R_t(P^*), S)$. Substituting in (26), and using that $d(R(P^*), S) = 0$, we get
\begin{align*}
\|\mathbf{1}_d\| Z\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big)
&\ge f(\hat R_t(P^*)) - f(\hat R_t(P)) - f(R(P^*)) + f(R(P)) \\
&\quad - Z\big(d(\hat R_t(P^*), S) - d(\hat R_t(P), S) + d(R(P), S)\big) \\
&\ge -Z\|\hat R_t(P^*) - R(P^*)\| - Z\|\hat R_t(P) - R(P)\|\,,
\end{align*}
so that
\[
\|\mathbf{1}_d\|\big(\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P)\big) \;\le\; \|\hat R_t(P^*) - R(P^*)\| + \|\hat R_t(P) - R(P)\|\,. \tag{28}
\]
Substituting $\|\hat R_t(P) - R(P)\| \le \|\mathbf{1}_d\|\sqrt{8Kd_t/t}$ in (27) and (28), we obtain
\[
\big|\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big| \;\le\; 2\sqrt{\frac{8Kd_t}{t}} \;\le\; c_0 K \mu_m\,.
\]

Now, fix some epoch $m > m_0$. We assume as the inductive hypothesis that for all epochs $m' < m$, all rounds $t'$ in epoch $m'$, and all $P \in C(\Pi)$,
\begin{align*}
\mathrm{Reg}(P) &\le 2\widehat{\mathrm{Reg}}_{t'}(P) + c_0 K \mu_{m'}\,, \\
\widehat{\mathrm{Reg}}_{t'}(P) &\le 2\,\mathrm{Reg}(P) + c_0 K \mu_{m'}\,.
\end{align*}

Fix a round $t$ in epoch $m$ and a policy $P \in C(\Pi)$. Using Equation (28) and Equation (7) (which holds under event $E$),
\begin{align*}
\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P)
&\le (\|\mathbf{1}_d\|)^{-1}\|\hat R_t(P^*) - R(P^*)\| + (\|\mathbf{1}_d\|)^{-1}\|\hat R_t(P) - R(P)\| \\
&\le \big(V_t(P) + V_t(P^*)\big)\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}}\,. \tag{29}
\end{align*}
Similarly, using Equation (27),
\[
\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P) \;\le\; \big(V_t(P_t) + V_t(P)\big)\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}}\,. \tag{30}
\]

The remaining proof follows exactly the same steps as those in the proof of Lemma 15 after Equation (13). $\square$

D.2.1 Proof of Theorem 1

We are now ready to prove Theorem 1. By Lemma 11, event $E$ holds with probability at least $1 - \delta/2$. Hence, it suffices to prove the regret upper bound whenever $E$ holds.

Recall from Appendix B that the algorithm samples $a_t$ at time $t$ in epoch $m$ from the smoothed projection $\tilde Q_{m-1}^{\mu_{m-1}}$ of $\tilde Q_{m-1}$. Also, recall from Appendix B.2 that $\tilde Q_m$ for any $m$ is represented as a linear combination of $P \in C(\Pi)$ as follows:
\[
\tilde Q_m = \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_m)\, P = \sum_{P\in C(\Pi)} \alpha_P(Q_m)\, P + \Big(1 - \sum_{P\in C(\Pi)} \alpha_P(Q_m)\Big) P_t
\]
($\tilde Q_m$ assigns all the remaining weight from $Q_m$ to $P_t$).

Let $\tilde Q_t = \tilde Q_{m(t)-1}$, $\mu_t = \mu_{m(t)-1}$, where $m(t)$ denotes the epoch in which time step $t$ lies: $m(t) = m$ for $t \in [\tau_{m-1}+1, \tau_m]$.

Now, using Jensen's inequality,
\begin{align*}
f(R(P^*)) - f\Big(\frac{1}{T}\sum_t R(\tilde Q_t)\Big)
&\le \frac{1}{T}\sum_t \Big(f(R(P^*)) - f(R(\tilde Q_t))\Big) \\
&\le \frac{1}{T}\sum_t \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_t)\big(f(R(P^*)) - f(R(P))\big) \\
&= \frac{1}{T}\sum_t \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_t)\big(Z\|\mathbf{1}_d\|\,\mathrm{Reg}(P) - Z d(R(P), S)\big) \\
&\le \|\mathbf{1}_d\|\,\frac{Z}{T}\sum_t \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_t)\,\mathrm{Reg}(P)\,,
\end{align*}

where we have used Jensen's inequality twice, in the first and second inequalities. The last inequality simply follows from the non-negativity of the distance function. To bound the distance, we use that by Assumption 2, for all $P \in C(\Pi)$,
\[
f(R(P^*)) \;\ge\; f(R(P)) - \frac{Z}{2} d(R(P), S)\,,
\]
so that
\[
Z\|\mathbf{1}_d\|\,\mathrm{Reg}(P) \;=\; f(R(P^*)) - f(R(P)) + Z d(R(P), S) \;\ge\; \frac{Z}{2} d(R(P), S)\,.
\]

Therefore, using Jensen's inequality and the bound $d(R(P), S) \le 2\|\mathbf{1}_d\|\,\mathrm{Reg}(P)$ just derived,
\begin{align*}
d\Big(\frac{1}{T}\sum_t R(\tilde Q_t), S\Big)
&\le \frac{1}{T}\sum_t d(R(\tilde Q_t), S) \\
&\le \frac{1}{T}\sum_t \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_t)\, d(R(P), S) \\
&\le \frac{2\|\mathbf{1}_d\|}{T}\sum_t \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_t)\,\mathrm{Reg}(P)\,.
\end{align*}
Now, for $m \le m_0$, $\mu_{m-1} = \frac{1}{2K}$. So, trivially, for $t$ in epoch $m \le m_0$,
\[
\sum_{P\in C(\Pi)} \alpha_P(\tilde Q_{m-1})\,\mathrm{Reg}(P) \;\le\; c_0 K \psi \mu_{m-1}\,.
\]

Suppose $E$ holds. Then, Lemma 17 implies that for all epochs $m \ge m_0$, all rounds $t \ge t_0$ in epoch $m$, and all policies $P \in C(\Pi)$, we have
\[
\mathrm{Reg}(P) \;\le\; 2\widehat{\mathrm{Reg}}_t(P) + c_0 K \mu_m\,.
\]

Therefore, for $t$ in such epochs $m$, using the first condition in OP (from Section B.2), we get
\begin{align*}
\sum_{P\in C(\Pi)} \alpha_P(\tilde Q_{m-1})\,\mathrm{Reg}(P)
&\le \sum_{P\in C(\Pi)} \alpha_P(\tilde Q_{m-1})\big(2\widehat{\mathrm{Reg}}_t(P) + c_0 K \psi \mu_{m-1}\big) \\
&= \sum_{P\in C(\Pi)} \alpha_P(Q_{m-1})\big(2\widehat{\mathrm{Reg}}_t(P) + c_0 K \psi \mu_{m-1}\big) \\
&\le (c_0 + 2) K \psi \mu_{m-1}\,.
\end{align*}
The equality above holds because $\widehat{\mathrm{Reg}}_t(P_t) = 0$.

Substituting, we get
\begin{align*}
f(R(P^*)) - f\Big(\frac{1}{T}\sum_t R(\tilde Q_t)\Big) &\le \frac{Z\|\mathbf{1}_d\| K \psi (c_0+2)}{T}\sum_m \mu_{m-1}(\tau_m - \tau_{m-1})\,, \tag{31} \\
d\Big(\frac{1}{T}\sum_t R(\tilde Q_t), S\Big) &\le \frac{2\|\mathbf{1}_d\| K \psi (c_0+2)}{T}\sum_m \mu_{m-1}(\tau_m - \tau_{m-1})\,. \tag{32}
\end{align*}

Applying an upper bound [2, Lemma 16] on the sum over $\mu_{m-1}$ above gives (also refer to Appendix C.1 for a more detailed explanation of this bound)
\[
\sum_m \mu_{m-1}(\tau_m - \tau_{m-1}) \;\le\; 4\left(\ln\frac{16T^2|\Pi|}{\delta} + \sqrt{\frac{T}{K}\ln\frac{64T^2|\Pi|}{\delta}}\right). \tag{33}
\]
Also, from Equation (21),
\[
\Big\|\frac{1}{T}\sum_t v_t(a_t) - \frac{1}{T}\sum_t R(\tilde Q_t)\Big\| \;\le\; \epsilon\|\mathbf{1}_d\| + \frac{K+1}{T}\sum_{t=1}^T \mu_t \|\mathbf{1}_d\|\,, \tag{34}
\]
where $\epsilon = \sqrt{\frac{1}{2T}\ln\frac{4d}{\delta}}$.
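As a numerical aside, the sketch below evaluates the left-hand side of (33) under an assumed doubling epoch schedule and the usual choice $\mu_m = \min\{\frac{1}{2K}, \sqrt{d_{\tau_m}/(K\tau_m)}\}$ with $d_t = \ln(16t^2|\Pi|/\delta)$ (both are assumptions borrowed from the style of [2], used here only for illustration), and compares it against the right-hand side.
\begin{verbatim}
import math

# Numerical check of the epoch-sum bound (33) under assumed (illustrative) choices:
# doubling epochs tau_m = 2^m capped at T, and mu_m = min(1/(2K), sqrt(d_tau / (K tau))).
K, n_policies, delta, T = 10, 1000, 0.05, 10**6

def d_of(t):
    return math.log(16 * t**2 * n_policies / delta)

def mu(tau):
    return min(1.0 / (2 * K), math.sqrt(d_of(tau) / (K * tau)))

taus = [1]
while taus[-1] < T:
    taus.append(min(2 * taus[-1], T))

lhs = sum(mu(taus[m - 1]) * (taus[m] - taus[m - 1]) for m in range(1, len(taus)))
rhs = 4 * (math.log(16 * T**2 * n_policies / delta)
           + math.sqrt((T / K) * math.log(64 * T**2 * n_policies / delta)))
print(f"LHS = {lhs:.1f}, RHS = {rhs:.1f}")   # LHS comes out below RHS
\end{verbatim}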

Substituting these bounds, and using the Lipschitz condition for $f$, we get
\begin{align*}
\text{avg-regret}^1(T) = f(R(P^*)) - f\Big(\frac{1}{T}\sum_t v_t(a_t)\Big)
&\le \frac{Z\|\mathbf{1}_d\| K \psi (c_0+4)}{T}\cdot 4\left(\ln\frac{16T^2|\Pi|}{\delta} + \sqrt{\frac{T}{K}\ln\frac{64T^2|\Pi|}{\delta}}\right) + L\|\mathbf{1}_d\|\sqrt{\frac{1}{2T}\ln\frac{4d}{\delta}} \\
&= O\left(\|\mathbf{1}_d\| Z\left(\sqrt{\frac{K}{T}\ln\frac{T|\Pi|}{\delta}} + \sqrt{\frac{1}{T}\ln\frac{d}{\delta}} + \frac{K}{T}\ln\frac{T|\Pi|}{\delta}\right)\right).
\end{align*}
Similarly, using the triangle inequality, we obtain
\[
\text{avg-regret}^2(T) = d\Big(\frac{1}{T}\sum_t v_t(a_t), S\Big) = O\left(\|\mathbf{1}_d\|\left(\sqrt{\frac{K}{T}\ln\frac{T|\Pi|}{\delta}} + \sqrt{\frac{1}{T}\ln\frac{d}{\delta}} + \frac{K}{T}\ln\frac{T|\Pi|}{\delta}\right)\right).
\]
Then, using the assumption $T \ge K\ln(T|\Pi|/\delta)$ (otherwise, the bound is trivial), we observe that the last term is dominated by the first, which gives the theorem statement.

E Regret Analysis for Section 5: Budget Constraints

Proof of Lemma 4. Let $\text{OPT}^\gamma$ denote the value of the optimal mixed policy when the budget constraints are relaxed to $\mathbb{E}_{(x,v),\pi\sim P}[v(\pi(x))] \le \frac{B}{T} + \gamma$. Suppose that $\text{OPT}^\gamma > \text{OPT} + \frac{Z}{2}\gamma = \text{OPT}(1 + T\gamma/B)$. Let $\bar P \in C_0(\Pi)$ be the optimal policy that achieves $\text{OPT}^\gamma$. Then, we can scale down $\bar P$ to obtain $P = \bar P/(1 + T\gamma/B)$. Now, $P \in C_0(\Pi)$ thus constructed is a feasible policy, since
\[
\mathbb{E}_{(x,v)\sim D,\pi\sim P}[v(\pi(x))] = \frac{1}{1 + T\gamma/B}\,\mathbb{E}_{(x,v)\sim D,\pi\sim \bar P}[v(\pi(x))] \le \Big(\frac{B}{T} + \gamma\Big)\Big/\Big(1 + \frac{T\gamma}{B}\Big) = \frac{B}{T}\,,
\]
and its value is at least $\frac{\text{OPT}^\gamma}{1 + T\gamma/B} > \text{OPT}$. Thus, we arrive at a contradiction.

Finally, observing that $L = 1$ in this case, we obtain the lemma statement. $\square$

Proof of Theorem 2. In addition to the regret in Step 3 of the algorithm, we need to consider the losses due to Steps 1 and 2. The regret in Step 3 follows from Theorem 1, and is equal to
\[
\tilde O\left(\big(T\,\text{OPT}/B + 1\big)\sqrt{\frac{K\ln(|\Pi|/\delta)}{T}}\right).
\]
A reduction of $B'$ in the budget may at most lead to a reduction of $\text{OPT}\,B'/B$ in the objective. Therefore, Step 2 may induce an additional regret of the same order.

Now for Step 1: the maximum budget consumption in the first $T_0$ rounds is $T_0$. We may assume that
\[
B \ge c\sqrt{KTd\ln(d|\Pi|/\delta)}
\]
for some large enough constant $c$, since otherwise our bound on avg-regret$^1$ is larger than OPT and holds trivially. This implies that
\[
T_0 = \frac{12Kd\log\big(\frac{d|\Pi|}{\delta}\big)\,T}{B} \;\le\; \sqrt{KTd\ln(d|\Pi|/\delta)}\,. \qquad \square
\]

E.1 Estimating OPT (Proof of Lemma 5)

We use the first few rounds to do pure exploration, that is, $a_\tau$ is picked uniformly at random from the set of arms, and use the outcomes from these rounds to compute an estimate of OPT. Let
\[
\tilde r_t(a) := r_t(a)\cdot \mathbb{I}(a = a_t)\,, \qquad \tilde v_t(a) := v_t(a)\cdot \mathbb{I}(a = a_t)\,.
\]

Note that $\tilde r_t(a) \in [0,1]$ and $\tilde v_t(a) \in [0,1]^d$. Since $a_\tau$ is picked uniformly at random from the set of arms,
\[
\mathbb{E}[\tilde r_t(a) \mid H_{t-1}] = \frac{1}{K}\mathbb{E}[r(a)]\,, \qquad \mathbb{E}[\tilde v_t(a) \mid H_{t-1}] = \frac{1}{K}\mathbb{E}[v(a)]\,.
\]
For any policy $P \in C_0(\Pi)$, let
\begin{align*}
r(P) &:= \mathbb{E}_{(x,r,v)\sim D,\pi\sim P}[r(\pi(x))]\,, & \hat r_t(P) &:= \frac{K}{t}\sum_{\tau\in[t]} \mathbb{E}_{\pi\sim P}[\tilde r_\tau(\pi(x_\tau))]\,, \\
v(P) &:= \mathbb{E}_{(x,r,v)\sim D,\pi\sim P}[v(\pi(x))]\,, & \hat v_t(P) &:= \frac{K}{t}\sum_{\tau\in[t]} \mathbb{E}_{\pi\sim P}[\tilde v_\tau(\pi(x_\tau))]
\end{align*}
be the actual and empirical means for a given policy $P$, and let $|\mathrm{supp}(P)|$ denote the size of the support of $P$. Observe that for any $P \in C_0(\Pi)$, $\mathbb{E}[\hat r_t(P) \mid H_{t-1}] = r(P)$ and $\mathbb{E}[\hat v_t(P) \mid H_{t-1}] = v(P)$.

Lemma 18. For all $\delta > 0$, let $\eta := \sqrt{3K\log(d|\Pi|/\delta)}$. Then for any $t$, with probability $1 - \delta$, for all $P \in C_0(\Pi)$,
\begin{align*}
|\hat r_t(P) - r(P)| &\le \eta\sqrt{|\mathrm{supp}(P)|\, r(P)/t}\,, \\
\forall\, j,\quad |\hat v_t(P)_j - v(P)_j| &\le \eta\sqrt{|\mathrm{supp}(P)|\, v(P)_j/t}\,.
\end{align*}

Proof. Fix a policy $\pi \in \Pi$. Consider the random variables $X_\tau = \tilde r_\tau(\pi(x_\tau))$, for $\tau \in [t]$. Note that $X_\tau \in [0,1]$, $\mathbb{E}[X_\tau] = \frac{1}{K}r(\pi)$, and $\frac{1}{t}\sum_{\tau\in[t]} X_\tau = \frac{1}{K}\hat r_t(\pi)$. Applying Corollary 8 to these variables, we get that with probability $1 - \delta/(d|\Pi|)$,
\[
\Big|\frac{1}{K}\hat r_t(\pi) - \frac{1}{K}r(\pi)\Big| \;\le\; \sqrt{3\log(d|\Pi|/\delta)\, r(\pi)/(Kt)}\,.
\]
Equivalently,
\[
|\hat r_t(\pi) - r(\pi)| \;\le\; \eta\sqrt{r(\pi)/t}\,.
\]
Similarly, applying the same corollary to all components of $\hat v_t(\pi)$ and taking a union bound over all $\pi$ gives the lemma for all $\pi \in \Pi$.

Now consider a policy $P \in C_0(\Pi)$:
\begin{align*}
|\hat r_t(P) - r(P)| &\le \mathbb{E}_{\pi\sim P}\big[|\hat r_t(\pi) - r(\pi)|\big] \\
&\le \mathbb{E}_{\pi\sim P}\big[\eta\sqrt{r(\pi)/t}\big] \\
&\le \eta\sqrt{|\mathrm{supp}(P)|\,\mathbb{E}_{\pi\sim P}[r(\pi)]/t} \\
&= \eta\sqrt{|\mathrm{supp}(P)|\, r(P)/t}\,,
\end{align*}
where the third inequality follows from the Cauchy-Schwarz inequality restricted to the support of $P$. $\square$

We solve a relaxed optimization problem on the sample to compute our estimate. Define $\widehat{\text{OPT}}{}^\gamma_t$ as the value of the optimal mixed policy in $C_0(\Pi)$ on the empirical distribution up to time $t$, when the budget constraints are relaxed by $\gamma$:
\[
\widehat{\text{OPT}}{}^\gamma_t := \max_{P\in C_0(\Pi)} \hat r_t(P) \quad \text{s.t.}\quad \hat v_t(P) \le \Big(\frac{B}{T} + \gamma\Big)\mathbf{1}\,. \tag{35}
\]

Let $P_t \in C_0(\Pi)$ be the policy that achieves this maximum in (35). Let (as earlier) $P^*$ denote the optimal policy w.r.t. $D$, i.e., the policy that achieves the maximum in the definition of OPT.

Lemma 19 ([9], Lemma 5). $|\mathrm{supp}(P^*)|, |\mathrm{supp}(P_t)| \le d$.
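When the policy class is small enough to enumerate explicitly, the estimates $\hat r_t$, $\hat v_t$ can be formed directly from the exploration data and (35) becomes a linear program over the mixture weights. The Python sketch below does both on a toy instance; this enumeration is purely illustrative (the paper's algorithm never enumerates $\Pi$), the data and sizes are synthetic, and $C_0(\Pi)$ is treated as the set of sub-distributions over $\Pi$, consistent with the scaling argument in the proof of Lemma 4. The last line also forms the quantity $Z = 2\max\{2\widehat{\text{OPT}}{}^\gamma_t/(B/T), 1\}$ used right below.
\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
K, d, n_ctx, n_pi = 4, 3, 6, 8           # toy sizes (hypothetical)
B, T, t = 2000.0, 10000.0, 5000          # budget, horizon, exploration rounds
gamma = B / (2 * T)

policies = rng.integers(0, K, size=(n_pi, n_ctx))   # pi(x) = policies[pi, x]
xs = rng.integers(0, n_ctx, size=t)                  # contexts seen during exploration
acts = rng.integers(0, K, size=t)                    # arms drawn uniformly at random
rew = rng.uniform(size=(t, K))                       # r_tau(a)
cons = rng.uniform(size=(t, K, d))                   # v_tau(a)

# Importance-weighted estimates: r_hat(pi) = (K/t) sum_tau r_tau(pi(x_tau)) I(pi(x_tau)=a_tau)
r_hat = np.zeros(n_pi)
v_hat = np.zeros((n_pi, d))
for p in range(n_pi):
    chosen = policies[p, xs]
    hit = (chosen == acts).astype(float)
    r_hat[p] = (K / t) * np.sum(rew[np.arange(t), chosen] * hit)
    v_hat[p] = (K / t) * (cons[np.arange(t), chosen] * hit[:, None]).sum(axis=0)

# (35): maximize w.r_hat s.t. w.v_hat <= (B/T + gamma) 1, w a sub-distribution over Pi.
res = linprog(
    c=-r_hat,                                        # linprog minimizes, so negate
    A_ub=np.vstack([v_hat.T, np.ones((1, n_pi))]),   # d budget rows + total-mass row
    b_ub=np.append(np.full(d, B / T + gamma), 1.0),
    bounds=[(0, None)] * n_pi,
    method="highs",
)
opt_hat_gamma = -res.fun
Z = 2 * max(2 * opt_hat_gamma / (B / T), 1.0)        # the setting used for Lemma 5
print(round(opt_hat_gamma, 3), round(Z, 3))
\end{verbatim}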

Lemma 5 is now an immediate consequence of the following lemma, by setting
\[
\frac{Z}{2} = \max\Big\{\frac{2\widehat{\text{OPT}}{}^\gamma_t}{B/T},\, 1\Big\}\,.
\]

Lemma 20. Suppose that for the first $t = 12Kd\log\big(\frac{d|\Pi|}{\delta}\big)T/B$ rounds the algorithm does pure exploration, pulling each arm with equal probability, and let $\gamma = \frac{B}{2T}$. Then with probability at least $1 - \delta$,
\[
\text{OPT} \;\le\; \max\Big\{2\widehat{\text{OPT}}{}^\gamma_t,\, \frac{B}{T}\Big\} \;\le\; \frac{2B}{T} + 6\,\text{OPT}\,.
\]

Proof. Let $\eta = \sqrt{3K\log(d|\Pi|/\delta)}$ be as in the proof of Lemma 18. Note that then $\eta\sqrt{d/t} = \sqrt{B/(4T)}$ and $\gamma = \eta\sqrt{dB/(Tt)}$. By Lemma 18, with probability $1 - \delta$, we have that

\[
\hat v_t(P^*) \le \Big(\frac{B}{T} + \gamma\Big)\mathbf{1}\,,
\]
and therefore $P^*$ is a feasible solution to the optimization problem (35), and hence $\widehat{\text{OPT}}{}^\gamma_t \ge \hat r_t(P^*)$. Again from Lemma 18,
\[
\hat r_t(P^*) \;\ge\; \text{OPT} - \eta\sqrt{d\,\text{OPT}/t} \;=\; \text{OPT} - \tfrac{1}{2}\sqrt{\text{OPT}\,B/T}\,.
\]
Now either $B/T \ge \text{OPT}$, or otherwise
\[
\text{OPT} - \tfrac{1}{2}\sqrt{\text{OPT}\,B/T} \;\ge\; \text{OPT}/2\,.
\]
In either case, the first inequality in the lemma holds.

On the other hand, again from Lemma 18,
\[
\forall\, j,\quad v(P_t)_j - \eta\sqrt{d\,v(P_t)_j/t} \;\le\; \hat v_t(P_t)_j \;\le\; \frac{B}{T} + \gamma \;=\; \frac{3B}{2T} \;=\; \frac{9B}{4T} - \eta\sqrt{\frac{9dB}{4Tt}}\,.
\]
The second inequality holds since $P_t$ is a feasible solution to (35). The function $f(x) = x - c\sqrt{x}$ is increasing in the interval $[c^2/4, \infty)$, and therefore $v(P_t)_j \le 9B/(4T)$, and $P_t$ is a feasible solution to

the optimization problem (4) with budgets multiplied by $9/4$. This increases the optimal value of (4) by at most a factor of $9/4$, and hence $r(P_t) \le 9\,\text{OPT}/4$. Also from Lemma 18,
\[
\widehat{\text{OPT}}{}^\gamma_t = \hat r_t(P_t) \;\le\; r(P_t) + \eta\sqrt{d\,r(P_t)/t} \;\le\; \frac{9\,\text{OPT}}{4} + \sqrt{\frac{9\,\text{OPT}\,B}{16T}}\,.
\]
Once again, if $\text{OPT} \ge B/T$, we get from the above that $\widehat{\text{OPT}}{}^\gamma_t \le 3\,\text{OPT}$. Otherwise, we get that $\widehat{\text{OPT}}{}^\gamma_t \le 9\,\text{OPT}/4 + 3B/(4T)$. In either case, the second inequality of the lemma holds. $\square$