Decision Tree Algorithms for the Contextual Bandit Problem


Raphaël Féraud, Robin Allesiardo, Tanguy Urvoy, Fabrice Clérot

arXiv:1504.06952v16 [cs.LG] 29 Sep 2015

[email protected] and [email protected] and [email protected] and [email protected] Orange Labs, 2, avenue Pierre Marzin, 22307, Lannion, France


Abstract

To address the contextual bandit problem, we propose online decision tree algorithms. The analysis of the proposed algorithms is based on the sample complexity needed to find the optimal decision stump. Then, the decision stumps are assembled in a decision tree, Bandit Tree, and in a random collection of decision trees, Bandit Forest. We show that the proposed algorithms are optimal up to a logarithmic factor. The dependence of the sample complexity upon the number of contextual variables is logarithmic. The computational cost of the proposed algorithms with respect to the time horizon is linear. These analytical results allow the proposed algorithms to be efficient in real applications, where the number of events to process is huge, and where we expect that some contextual variables, chosen from a large set, have potentially non-linear dependencies with the rewards. When Bandit Trees are assembled into a Bandit Forest, the analysis is done against a strong reference, the Random Forest built knowing the joint distribution of contexts and rewards. We show that the expected dependent regret bound against this strong reference is logarithmic with respect to the time horizon. In the experiments done to illustrate the theoretical analysis, Bandit Tree and Bandit Forest obtain promising results in comparison with state-of-the-art algorithms.

Keywords: multi-armed bandits, contextual bandit, sample complexity, online decision tree, random forest.

1. Introduction

Since the beginning of this century, a vast quantity of connected devices has been deployed. These devices communicate through wired and wireless networks, thus generating infinite streams of events. By interacting with these streams of events, machine learning algorithms optimize the choice of ads on a website, choose the best human machine interface, recommend products on a web shop, ensure self-care of set top boxes, assign the best wireless network to each mobile phone, etc. With the rising internet of things, the number of decisions (or actions) to be taken by more and more autonomous devices further increases. In order to control the cost and the potential risk of deploying a lot of machine learning algorithms in the long run, we need scalable algorithms which provide worst case theoretical guarantees.


Most of these applications require taking and optimizing decisions with only partial feedback: only the reward of the taken decision is known. Does the user click on the proposed ad? The most relevant ad is never revealed; only the click on the proposed ad is observed. This well known problem is called the multi-armed bandit (mab) problem (see Lai and Robbins (1985)). In its most basic formulation, it can be stated as follows: there are K decisions, each having an unknown distribution of bounded rewards. At each step, one has to choose a decision and receives a reward. The performance of a mab algorithm is assessed in terms of regret (or opportunity loss) with regard to the unknown optimal decision. Optimal solutions have been proposed to solve this problem using a stochastic formulation in Auer et al (2002), using a Bayesian formulation in Kaufman et al (2012), or using an adversarial formulation in Auer et al (2002); Cesa-Bianchi and Lugosi (2006). While these approaches focus on the minimization of the cumulated regret, the PAC setting (see Valiant (1984)), or $(\epsilon, \delta)$-best-arm identification, focuses on the sample complexity (i.e. the number of time steps) needed to find an $\epsilon$-approximation of the best arm with a failure probability of $\delta$. This formulation has been studied for the mab problem in Even-Dar et al (2002); Bubeck et al (2009), for the voting bandit problem in Urvoy et al (2013), and for the linear bandit problem in Soare et al (2014).

Several variations of the mab problem have been introduced in order to fit the practical constraints coming from ad serving or marketing optimization. These variations include for instance the death and birth of arms in Chakrabarti et al (2008), availability of actions in Kleinberg et al (2008), or a drawing without replacement in Féraud and Urvoy (2012, 2013). But more importantly, in most of these applications, a rich contextual information is also available. For instance, in ad-serving optimization we know the web page, the position in the web page, or the profile of the user. This contextual information must be exploited in order to decide which ad is the most relevant to display.

Following Langford and Zhang (2007); Dudik et al (2011), in order to analyze the proposed algorithms, we formalize below the contextual bandit problem (see Algorithm 1). Let $x_t \in \{0,1\}^M$ be a vector of binary values describing the environment at time $t$. Let $y_t \in [0,1]^K$ be a vector of bounded rewards at time $t$, and $y_{k_t}(t)$ be the reward of the action (decision) $k_t$ at time $t$. Let $D_{x,y}$ be a joint distribution on $(x, y)$. Let $A$ be a set of $K$ actions. Let $\pi: \{0,1\}^M \rightarrow A$ be a policy, and $\Pi$ the set of policies.

Algorithm 1 The contextual bandit problem
repeat
  $(x_t, y_t)$ is drawn according to $D_{x,y}$
  $x_t$ is revealed to the player
  The player chooses an action $k_t = \pi_t(x_t)$
  The reward $y_{k_t}(t)$ is revealed
  The player updates its policy $\pi_t$
  $t = t + 1$
until $t = T$

Notice that this setting can easily be extended to categorical variables through a binary encoding. The optimal policy $\pi^*$ maximizes the expected gain:


$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{D_{x,y}}\left[y_{\pi(x_t)}\right]$$

Let $k_t^* = \pi^*(x_t)$ be the action chosen by the optimal policy at time $t$. The performance of the player policy is assessed in terms of the expectation of the cumulated regret against the optimal policy with respect to $D_{x,y}$:

$$\mathbb{E}_{D_{x,y}}[R(T)] = \sum_{t=1}^{T} \mathbb{E}_{D_{x,y}}\left[y_{k_t^*}(t) - y_{k_t}(t)\right],$$

where $R(T)$ is the cumulated regret, and $T$ is the time horizon.
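As a concrete reading of Algorithm 1, the following Python sketch simulates the interaction protocol with a synthetic joint distribution $D_{x,y}$ and a placeholder uniformly random policy. The environment, the weight matrix W and all numerical values are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, T = 5, 3, 10_000            # context bits, actions, time horizon

# Synthetic joint distribution D_{x,y}: each action's reward is a Bernoulli draw
# whose parameter depends on the context bits (purely illustrative).
W = rng.uniform(0.0, 1.0, size=(K, M))

def draw_context_and_rewards():
    x = rng.integers(0, 2, size=M)              # x_t in {0,1}^M
    p = (W @ x + 1.0) / (M + 1.0)               # reward probabilities in (0,1]
    y = (rng.random(K) < p).astype(float)       # y_t in [0,1]^K (binary here)
    return x, y

def policy(x):
    """Placeholder policy pi_t: uniform random choice over the K actions."""
    return int(rng.integers(0, K))

cumulative_reward = 0.0
for t in range(T):
    x_t, y_t = draw_context_and_rewards()       # (x_t, y_t) ~ D_{x,y}
    k_t = policy(x_t)                           # the player chooses k_t = pi_t(x_t)
    cumulative_reward += y_t[k_t]               # only y_{k_t}(t) is revealed
    # a contextual bandit algorithm would update pi_t here from (x_t, k_t, y_t[k_t])

print(f"average reward of the random policy: {cumulative_reward / T:.3f}")
```

Any of the algorithms discussed below would replace the placeholder policy and add an update step that uses only the revealed reward $y_{k_t}(t)$.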

2. Previous work

The contextual bandit problem differs from the full-knowledge online classification problem, where the rewards of all decisions are observed: here, the loss function is unknown at training time, and exploration is required to estimate it. However, from a practical perspective, online regression algorithms can be applied to the contextual bandit problem. For instance, during the exploration time the decisions are played at random to build a probability estimation tree (see Provost and Domingos (2003), Ikonomovska et al (2010)), or any regression model which estimates the probability of reward given the context and the decision. Then, during the exploitation time, the decisions leading to the maximum probability of reward are chosen. This reasonable approach leads to efficient experimental results using a Random Forest in Salperwyck and Urvoy (2011), or a Committee of Multi-Layer Perceptrons in Allesiardo et al (2014). The first drawback of these approaches is that no regret bound is provided: the risk when deploying a lot of models is not controlled. The second drawback is that they do not scale with the number of decisions. Indeed, their total computational cost in $O(KT)$ is a handicap for applications where $K$ is large. Moreover, when $K$ is large, it is not unlikely to obtain a non-robust estimation for one of the decisions. This is problematic because, as the decision which has the maximum probability of reward is selected at each round, in the worst case the performance of this approach follows the worst estimation.

A naive solution to adapt the mab algorithms to the contextual bandit problem consists of allocating one independent mab for each possible context $x_t$. The number of possible contexts being the cross product of the sets of values of the contextual vector, this approach can lead to an exponential number of arms. The time spent exploring these arms will tear down the gain of contextual information. Monte-Carlo Tree Search based on UCT (see Kocsis and Szepesvári (2006)) was successfully applied to the game of Go in Hoock et al (2010). UCT can also be used in order to explore and exploit the tree structure of contextual variables (see Gaudel and Sébag (2010)). However, the size of the tree structure needed to explore and exploit $D$ contextual variables being $2^D \binom{M}{D}$, this approach is limited to small context sizes.

An interesting approach to analyze the contextual bandit problem is to assume a linear relation between the context variables and the reward of each action. LinUCB (Chu et al (2011); Li et al (2010)) chooses the arm which maximizes the upper confidence bound


of the estimated reward under the linear assumption. Filippi et al (2010) uses a similar approach based on the Generalized Linear Model framework. Agrawal and Goyal (2013) adapts a well known Bayesian approach, Thompson Sampling (see Thompson (1933)), to the case of linear payoff functions. The linear approaches can achieve a dependent regret bound in $O(\log^2 T)$ (see Abbasi-Yadkori and Szepesvári (2011)). Besides the validity of the linear assumption, the main drawback of these algorithms is that they require an $M \times M$ matrix inversion at each update of the model. The computational cost of this incremental inversion being in $O(M^2)$, they may become intractable for large sets of contextual variables, or require approximating this inversion, with a loss of regret guarantees. For applications, such as ad serving, where one can build thousands of contextual variables, this limitation is a problematic issue. The Banditron algorithm (see Kakade et al (2008)) also uses a linear model, the Perceptron, to evaluate the cumulated reward of each action. In comparison to linear bandits, the main drawbacks of this algorithm are that it requires an exploration factor which penalizes the minimization of the regret, and that its unregularized stochastic algorithm is less efficient at finding the optimal weights. Its strength is its low computational cost.

Another approach is to define a finite set of policies $\Pi$; then, to address the contextual bandit problem, one has to choose the best one. The Exp4 algorithm (see Auer et al (2002); Beygelzimer et al (2007)) maintains a weight for each policy in order to draw the estimated best one at each step. The chief strength of Exp4 is its ability to deal with non-stationary reward distributions. It achieves an optimal regret bound in $O(\sqrt{TK \log |\Pi|})$. However, the computational cost of each step of this approach is linear in the number of policies, which limits its use to small sets of policies. To circumvent this problem, Langford and Zhang (2007) alternates exploration and exploitation phases in order to optimize the choice of the best policy. The computational cost is in $O(\log |\Pi|)$ for each time step. Most batch learning models, including Deep Learning (see Bengio (2009)), SVM (see Cortes and Vapnik (1995)), or Random Forest (see Breiman (2001)), can be learnt during the exploration phases and then used during the exploitation phases to choose the best actions. This interesting approach suffers from a major shortcoming: a suboptimal regret bound in $O(T^{2/3}(K \log |\Pi|)^{1/3})$. Dudik et al (2011) maintains a weight for each policy to solve an optimization problem in order to estimate the optimal distribution of policies, and then draws a policy with respect to this distribution. It achieves a near optimal regret in $O(\sqrt{TK \log(|\Pi|/\delta)}) + K \log(T|\Pi|)/\delta$. However, the number of calls to the optimization problem is in $\text{poly}(T)$, while the computational cost with respect to the time horizon is in $O(\text{poly}(T, \log |\Pi|))$. In Agrawal et al (2014), the number of calls to the optimization problem is reduced to $\sqrt{KT}$ with the same regret bound. This reduces the total computational cost, which is still in $O(T\sqrt{TK \log |\Pi|})$. For large values of the time horizon $T$, a total computational cost beyond $O(T)$ is a handicap. For instance, the number of page views of a web site can reach the order of billions per day.

3. Our contribution

We propose decision tree algorithms to tackle the contextual bandit problem. Decision trees work by partitioning the input space into hyper-boxes. They can be seen as a combination of rules, where only one rule is selected for a given input vector. Finding the optimal tree structure (i.e. the optimal combination of rules) is NP-hard. For this reason, a greedy


approach is used to build decision trees offline (see Breiman et al (1984); Quinlan (1993); Kass (1980)) or online (see Domingos and Hulten (2000)). The key concept behind greedy decision trees is the decision stump. While Monte-Carlo Tree Search approaches (see Kocsis and Szepesvári (2006)) focus on regret minimization (i.e. maximization of gains), the analysis of the proposed algorithms is based on the sample complexity needed to find the optimal decision stump with high probability. This formalization facilitates the analysis of decision tree algorithms. Indeed, to build a decision tree under limited resources, one needs to eliminate most of the possible branches. The sample complexity is the stopping criterion we need to stop the exploration of unpromising branches. The decision stumps are assembled in a decision tree, Bandit Tree, and in a random collection of L decision trees of maximum depth D, Bandit Forest. We show that the proposed algorithms are optimal up to a factor $\log 1/\Delta$, where $\Delta$ is the minimum gap between the best variable or the best action and the second best. The dependence of the sample complexity upon the number of contextual variables is logarithmic. The computational cost of the proposed algorithms with respect to the time horizon is linear. The regret analysis of this algorithm is done with respect to a strong reference: the Random Forest built knowing the joint distribution of the contexts and rewards. We show that the expected dependent regret bound against this strong reference is logarithmic with respect to the time horizon.

In comparison to the algorithms based on the search for the best policy from a finite set of policies, our approach has several advantages. First, we take advantage of the fact that we know the structure of the set of policies to obtain a linear computational cost with respect to the time horizon T. Second, as our approach does not need to store a weight for each possible tree, we can use deeper rules without exceeding the memory resources. In comparison to the approaches based on a linear model, our approach also has several advantages. First, it is better suited for the case where the dependence between the rewards and the contexts is not linear. Second, the dependence of the regret bounds of the proposed algorithms on the number of contextual variables is in the order of $O(\log M)$, while the one of LinUCB is in $O(\sqrt{M})$ (in fact, the dependence of the gap dependent regret bound on the number of contextual variables is in $O(M^2)$, see Theorem 5 in Abbasi-Yadkori and Szepesvári (2011)). Third, its computational cost with respect to the time horizon in $O(LMDT)$ allows the processing of large sets of variables, while linear bandits are penalized by the update of an $M \times M$ matrix at each step, which leads to a computational cost in $O(KM^2 T)$. The price to pay for these advantages is that for ill-shaped distributions, where the dependencies between inputs and outputs can be null at depth $d$ and high at depth $d + 1$, the optimal greedy tree, and therefore Bandit Tree, may not work well in comparison to the optimal tree, which is non-greedy. The optimal tree is impossible to build since it is an NP-hard problem, and impossible to enumerate in a finite set due to the exponential number of possible trees. Finally, to overcome this limitation, we propose an algorithm, Bandit Forest, which is a Random Forest (see Breiman (2001)) of Bandit Trees built online. The analysis of this algorithm is done with respect to a strong reference: the Random Forest built knowing the joint distribution of the contexts and rewards. We show that the expected dependent regret bound against this strong reference is logarithmic with respect to the time horizon.

The paper is organized as follows:
• in section 4, we present and analyze the decision stumps,
• in section 5, we detail and analyze the Bandit Tree algorithm, built by a recursive application of decision stumps,
• in section 6, to reduce the regret during the exploration phases, we propose an improved learning algorithm for Bandit Tree, which explores the splitting variables and the actions at the same time,
• in section 7, we detail and analyze the Bandit Forest algorithm, which is a random collection of Bandit Trees,
• in section 8, we illustrate the theoretical results on contextual bandit problems,
• in the appendix, we provide a table of notations and the proofs.

4. The decision stump

In this section, we consider a model which consists of a decision stump based on the values of a single contextual variable, chosen from a set of M binary variables.

4.1 A gentle start

In order to explain the principle and to introduce the notations, before describing the decision stumps used to build the Bandit Tree, we illustrate our approach on a toy problem. Let $k_1$ and $k_2$ be two actions. Let $x_{i_1}$ and $x_{i_2}$ be two binary random variables describing the context. The relevant probabilities and rewards are summarized in Table 1.

Table 1: The mean reward of actions k1 and k2 knowing each context value, and on the last line the probability to observe this context.

Variable $x_{i_1}$:                          Variable $x_{i_2}$:
                       $v_0$     $v_1$                              $v_0$     $v_1$
$\mu^{i_1}_{k_1}|v$     0         1          $\mu^{i_2}_{k_1}|v$     1/4       3/4
$\mu^{i_1}_{k_2}|v$     3/5       1/6        $\mu^{i_2}_{k_2}|v$     9/24      5/8
$P(x_{i_1} = v)$        5/8       3/8        $P(x_{i_2} = v)$        3/4       1/4

$\mu^{i_1}_{k_2}|v$ denotes the mean reward of action $k_2$ when $x_{i_1} = v$ is observed, and $P(x_{i_1} = v)$ denotes the probability of observing $x_{i_1} = v$.


We compare the strategies of different players. Player 1 only uses the uncontextual expected rewards, while Player 2 uses the knowledge of $x_{i_1}$ to decide. According to Table 1, the best strategy for Player 1 is to always choose action $k_2$; his expected reward will be $\mu_{k_2} = 7/16$. Player 2 is able to adapt his strategy to the context: his best strategy is to choose $k_2$ when $x_{i_1} = v_0$ and $k_1$ when $x_{i_1} = v_1$. According to Table 1, his expected reward will be:

$$\mu^{i_1} = P(x_{i_1} = v_0) \cdot \mu^{i_1}_{k_2}|v_0 + P(x_{i_1} = v_1) \cdot \mu^{i_1}_{k_1}|v_1 = \mu^{i_1}_{k_2, v_0} + \mu^{i_1}_{k_1, v_1} = 3/4,$$

where $\mu^{i_1}_{k_2, v_0}$ and $\mu^{i_1}_{k_1, v_1}$ denote respectively the expected reward of the action $k_2$ when $x_{i_1} = v_0$ is observed, and the expected reward of the action $k_1$ when $x_{i_1} = v_1$ is observed. Whatever the expected rewards of each value, the player who uses this knowledge is the expected winner. Indeed, we have:

$$\mu^i = \max_k \mu^i_{k,v_0} + \max_k \mu^i_{k,v_1} \geq \max_k \left(\mu^i_{k,v_0} + \mu^i_{k,v_1}\right) \geq \max_k \mu_k$$

Now, if a third player uses the knowledge of the contextual variable $x_{i_2}$, his expected reward will be:

$$\mu^{i_2} = \mu^{i_2}_{k_2, v_0} + \mu^{i_2}_{k_1, v_1} = 15/32$$

Player 2 remains the expected winner, and therefore $x_{i_1}$ is the best contextual variable to decide between $k_1$ and $k_2$. The best contextual variable is the one which maximizes the expected reward of the best actions for each of its values. We use this principle to build a reward-maximizing decision stump.

4.2 Variable selection stump

Let $V$ be the set of variables, and $A$ be the set of actions. Let $\mu^i_{k,v} = \mathbb{E}_{D_{x,y}}[y_k \cdot \mathbf{1}_{x_i = v}]$ be the expected reward of the action $k$ with respect to $D_{x,y}$ when the value $v$ of the binary variable $x_i$ is observed. The expected reward when using variable $x_i$ to select the best action is the sum of the expected rewards of the best actions for each of its possible values:

$$\mu^i = \sum_{v \in \{0,1\}} \max_k \mu^i_{k,v}$$

The optimal variable to be used to select the best action is $i^* = \arg\max_{i \in V} \mu^i$. The algorithm Variable Selection chooses the best variable. The Round-robin function explores the actions of $A$ sequentially (see Algorithm 2, line 4). Each time $t_k$ the reward of the selected action $k$ is unveiled, the estimated expected rewards of the values $\hat{\mu}^i_{k,v}$ and all the estimated rewards of the variables $\hat{\mu}^i$ are updated (see VariableElimination, lines 2-7). This parallel exploration strategy allows the algorithm to explore the variable set efficiently.
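As a quick numerical check of these definitions on the toy problem of Section 4.1, the Python snippet below recomputes $\mu^{i_1}$, $\mu^{i_2}$ and the context-free action rewards from Table 1 with exact fractions; it is only an illustration, not part of the original paper.

```python
from fractions import Fraction as F

# Table 1: mean reward of each action given each value, and P(x_i = v).
mu = {
    "i1": {0: {"k1": F(0), "k2": F(3, 5)}, 1: {"k1": F(1), "k2": F(1, 6)}},
    "i2": {0: {"k1": F(1, 4), "k2": F(9, 24)}, 1: {"k1": F(3, 4), "k2": F(5, 8)}},
}
p = {"i1": {0: F(5, 8), 1: F(3, 8)}, "i2": {0: F(3, 4), 1: F(1, 4)}}

# mu^i_{k,v} = E[y_k 1_{x_i = v}] = P(x_i = v) * E[y_k | x_i = v]
def mu_kv(i, k, v):
    return p[i][v] * mu[i][v][k]

# mu^i = sum over values of the best action's expected reward
def mu_var(i):
    return sum(max(mu_kv(i, k, v) for k in ("k1", "k2")) for v in (0, 1))

# Context-free expected reward of each action: mu_k = sum_v mu^i_{k,v}
mu_k = {k: mu_kv("i1", k, 0) + mu_kv("i1", k, 1) for k in ("k1", "k2")}

print(mu_var("i1"), mu_var("i2"))     # 3/4 and 15/32
print(max(mu_k.values()))             # 7/16: best context-free action (k2)
print(max(("i1", "i2"), key=mu_var))  # i1 is the optimal variable i*
```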


When all the actions have been played once (LastAction function in VariableElimination, line 8), irrelevant variables are eliminated if:

$$\hat{\mu}^{i'} - \hat{\mu}^i + \epsilon \geq 4\sqrt{\frac{1}{2t_k}\log\frac{4KM t_k^2}{\delta}}, \qquad (1)$$

where $i' = \arg\max_i \hat{\mu}^i$, and $t_k$ is the number of times the action $k$ has been played.

Algorithm 2 Variable Selection
1: Initialization: $\forall k\; t_k = 0$, $S = V$, $\forall (i,k,v)\; \hat{\mu}^i_{k,v} = 0$, $\forall i\; \hat{\mu}^i = 0$
2: repeat
3:   Receive the context vector $x_t$
4:   Play $k$ = Round-robin($A$)
5:   Receive reward $y_k(t)$
6:   $t_k = t_k + 1$
7:   $S$ = VariableElimination($t_k$, $k$, $x_t$, $y_k(t)$, $S$, $A$)
8:   $t = t + 1$
9: until $|S| = 1$

1: Function VariableElimination($t$, $k$, $x_t$, $y_k$, $S$, $A$)
2: for each remaining variable $i \in S$ do
3:   for each value $v$ do
4:     $\hat{\mu}^i_{k,v} = \frac{y_k}{t}\mathbf{1}_{x_i = v} + \frac{t-1}{t}\hat{\mu}^i_{k,v}$
5:   end for
6:   $\hat{\mu}^i = \sum_{v \in \{0,1\}} \max_k \hat{\mu}^i_{k,v}$
7: end for
8: if $k$ = LastAction($A$) then
9:   Remove irrelevant variables from $S$ according to equation 1, or 4
10: end if
11: return $S$
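The sketch below is a minimal, self-contained Python rendering of Variable Selection (Algorithm 2 with the VariableElimination function and elimination rule (1)) on a synthetic problem where only $x_0$ is informative. The environment, the choice $\epsilon = 0$ and all constants are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
M, K, delta, eps = 20, 3, 0.05, 0.0

# Synthetic problem: variable x_0 is informative, the other variables are noise.
def draw(k):
    x = rng.integers(0, 2, size=M)
    p = 0.8 if x[0] == k % 2 else 0.2           # the reward only depends on x_0
    return x, float(rng.random() < p)

S = list(range(M))                               # remaining candidate variables
mu_hat = np.zeros((M, 2, K))                     # estimates of mu^i_{k,v}
n = 0                                            # plays of each action (round-robin rounds)
while len(S) > 1:
    n += 1
    for k in range(K):                           # play every action once per round
        x, y = draw(k)
        for i in S:                              # incremental update of mu^i_{k,v}
            for v in (0, 1):
                mu_hat[i, v, k] = y * (x[i] == v) / n + (n - 1) / n * mu_hat[i, v, k]
    # mu^i = sum_v max_k mu^i_{k,v}, then apply elimination rule (1)
    scores = {i: sum(mu_hat[i, v, :].max() for v in (0, 1)) for i in S}
    best = max(scores.values())
    radius = 4.0 * math.sqrt(math.log(4 * K * M * n * n / delta) / (2 * n))
    S = [i for i in S if best - scores[i] + eps < radius]

print("selected variable:", S[0])                # expected: 0 (with high probability)
```

With a gap of roughly 0.3 between the informative variable and the noise variables, the loop typically stops after a few thousand round-robin rounds and returns variable 0.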

The parameter $\delta \in (0, 1]$ corresponds to the probability of failure. The parameter $\epsilon$ is used for practical reasons: it tunes the convergence speed of the algorithm. In particular, when two variables provide the same expected reward, the use of $\epsilon > 0$ ensures that the algorithm stops. The value of $\epsilon$ has to be of the same order of magnitude as the best mean reward we want to select. In the analysis of the algorithms, we will consider the case where $\epsilon = 0$. Lemma 1 analyzes the sample complexity (the number of iterations before stopping) of Variable Selection.

Lemma 1: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the sample complexity of Variable Selection needed to obtain $P(i' \neq i^*) \leq \delta$ is:

$$t^* = \frac{64K}{\Delta_1^2}\log\frac{4KM}{\delta\Delta_1}, \text{ where } \Delta_1 = \min_{i \neq i^*}\left(\mu^{i^*} - \mu^i\right)$$


The proof of Lemma 1 can be found in the appendix. Lemma 1 shows that the dependence of the sample complexity needed to select the optimal variable is in $O(\log M)$. This means that Variable Selection can be used to process large sets of contextual variables, and hence can be easily extended to categorical variables through a binary encoding, with only a logarithmic impact on the sample complexity. Finally, Lemma 2 shows that the variable selection stump is optimal up to a factor $\log 1/\Delta_1$.

Lemma 2: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal variable $i^*$ has a sample complexity of at least:

$$\Omega\left(\frac{1}{\Delta_1^2}\log\frac{1}{\delta}\right)$$

The proof of Lemma 2 can be found in the appendix.

4.3 Action selection stump

To complete a decision stump, one needs to provide an algorithm which optimizes the choice of the best action knowing the variable selection stump. Any stochastic bandit algorithm such as UCB (see Auer et al (2002)) or TS (see Thompson (1933); Kaufman et al (2012)) can be used. For the consistency of the analysis, we choose Successive Elimination from Even-Dar et al (2002, 2006) (see Algorithm 3), which we have renamed Action Selection. Let $\mu_k = \mathbb{E}_{D_y}[y_k]$ be the expected reward of the action $k$ taken with respect to $D_y$. The estimated expected reward of the action is denoted by $\hat{\mu}_k$.

Algorithm 3 Action Selection
1: Initialization: $t = 0$, $A = \{1, ..., K\}$, $\forall k\; \hat{\mu}_k = 0$ and $t_k = 0$
2: repeat
3:   Play $k$ = Round-robin($A$)
4:   Receive reward $y_k(t)$
5:   $t_k = t_k + 1$
6:   $A$ = ActionElimination($t_k$, $k$, $y_k(t)$, $A$)
7:   $t = t + 1$
8: until $|A| = 1$

1: Function ActionElimination($t$, $k$, $x_t$, $y_k$, $A$)
2: $\hat{\mu}_k = \frac{y_k}{t} + \frac{t-1}{t}\hat{\mu}_k$
3: if $k$ = LastAction($A$) then
4:   Remove irrelevant actions from $A$ according to equation 2, or 5
5: end if
6: return $A$

The irrelevant actions in the set $A$ are successively eliminated when:

$$\hat{\mu}_{k'} - \hat{\mu}_k + \epsilon \geq 2\sqrt{\frac{1}{2t_k}\log\frac{4K t_k^2}{\delta}}, \qquad (2)$$

where $k' = \arg\max_k \hat{\mu}_k$, and $t_k$ is the number of times the action $k$ has been played.

Lemma 3: when $K \geq 2$ and $\epsilon = 0$, the sample complexity of Action Selection needed to obtain $P(k' \neq k^*) \leq \delta$ is:

$$t^* = \frac{64K}{\Delta_2^2}\log\frac{4K}{\delta\Delta_2}, \text{ where } \Delta_2 = \min_k\left(\mu_{k^*} - \mu_k\right).$$
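To make Algorithm 3 and elimination rule (2) concrete, here is a minimal Python sketch on made-up Bernoulli arms; the means, $\delta$ and $\epsilon = 0$ are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
K, delta, eps = 4, 0.05, 0.0
true_means = [0.3, 0.5, 0.7, 0.45]               # illustrative Bernoulli reward means

A = list(range(K))                               # remaining candidate actions
mu_hat = np.zeros(K)                             # estimates of mu_k
n = 0                                            # plays of each remaining action
while len(A) > 1:
    n += 1
    for k in A:                                  # Round-robin over the remaining actions
        y = float(rng.random() < true_means[k])
        mu_hat[k] = y / n + (n - 1) / n * mu_hat[k]
    best = max(mu_hat[k] for k in A)
    radius = 2.0 * math.sqrt(math.log(4 * K * n * n / delta) / (2 * n))
    A = [k for k in A if best - mu_hat[k] + eps < radius]   # elimination rule (2)

print("selected action:", A[0])                  # expected: 2 (mean 0.7), w.h.p.
```

Eliminated actions are no longer played, so the remaining actions always share the same play count $t_k$, which is what rule (2) assumes.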

The proof of Lemma 3 is the same as the one provided for Successive Elimination in Even-Dar et al (2006). Finally, Lemma 4 states that the action selection stump is optimal up to a factor $\log 1/\Delta_2$ (see Mannor and Tsitsiklis (2004), Theorem 1, for the proof).

Lemma 4: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal action $k^*$ has a sample complexity of at least:

$$\Omega\left(\frac{K}{\Delta_2^2}\log\frac{1}{\delta}\right)$$

4.4 Analysis of a decision stump

The decision stump is a variable selection stump followed by two action selection stumps: one per value of the binary variable (see Algorithm 4). The optimal decision stump uses the best variable to choose the best actions. It plays at time $t$:

$$k_t^* = \arg\max_k \mu^{i^*}_k|v, \text{ where } i^* = \arg\max_i \mu^i, \text{ and } v = x_{i^*}(t).$$

Theorem 1: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the sample complexity needed to obtain $P(i' \neq i^*) \leq \delta$ and $P(k_t' \neq k_t^*) \leq \delta$ is:

$$t^* = \frac{64K}{\Delta_1^2}\log\frac{4KM}{\delta\Delta_1} + \frac{128K}{\Delta_2^2}\log\frac{4K}{\delta\Delta_2},$$

where $\Delta_1 = \min_{i \neq i^*}\left(\mu^{i^*} - \mu^i\right)$, and $\Delta_2 = \min_{k \neq k^*, v \in \{0,1\}}\left(\mu^{i^*}_{k^*}|v - \mu^{i^*}_k|v\right)$. The cumulative regret against the optimal decision stump is defined by:

$$R(T) = \sum_{t=1}^{T}\left(y_{k_t^*}(t) - y_{k_t}(t)\right)$$


Algorithm 4 Decision Stump Learning
1: Initialization: $t = 0$, $\forall k\; t_k = 0$, $\forall (i,v,k)\; t_{i,v,k} = 0$ and $A_{i,v} = A$, $S = V$, $\forall (i,k,v)\; \hat{\mu}^i_{k,v} = 0$ and $\hat{\mu}^i_k|v = 0$, $\forall i\; \hat{\mu}^i = 0$
2: repeat
3:   Receive the context vector $x_t$
4:   DecisionStump($x_t$, $y_k(t)$, $A$, $S$)
5:   $t = t + 1$
6: until $t = T$

1: Function DecisionStump($x_t$, $y_k$, $A$, $S$)
2: if $|S| > 1$ then
3:   Play $k$ = Round-robin($A$)
4:   Receive reward $y_k(t)$
5:   $t_k = t_k + 1$
6:   $S$ = VariableElimination($t_k$, $k$, $x_t$, $y_k$, $S$, $A$)
7: else
8:   $v = x_i(t)$
9:   Play $k$ = Round-robin($A_{i,v}$)
10:   Receive reward $y_k(t)$
11:   $t_{i,v,k} = t_{i,v,k} + 1$
12:   $A_{i,v}$ = ActionElimination($t_{i,v,k}$, $k$, $x_t$, $y_k$, $A_{i,v}$)
13: end if

Corollary 1.1: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the expected cumulative regret of Decision Stump Learning against the optimal decision stump is bounded at time $T$ by:

$$\mathbb{E}_{D_{x,y}}[R(T)] \leq \frac{64K}{\Delta_1^2}\log\frac{4KMT}{\Delta_1} + \frac{128K}{\Delta_2^2}\log\frac{4KT}{\Delta_2} + 2$$

The proofs of Theorem 1 and Corollary 1.1 are provided in the appendix. Corollary 1.1 shows that the problem dependent regret bound of the decision stump based on Variable Elimination and then Action Elimination is in $O\left(\frac{K}{\Delta^2}\log\frac{KMT}{\Delta}\right)$. In comparison to the naive approach, which consists of allocating one UCB per value of the contextual variables (regret bound in $O\left(\frac{KM}{\Delta}\log KT\right)$), on one hand the obtained regret bound suffers factors $1/\Delta^2$ versus $1/\Delta$ for UCB, and on the other hand, it benefits from a factor $\log M$ versus $M$. The factors in $1/\Delta^2$ come from the fact that the proposed algorithm eliminates variables. Theorem 2 provides a lower bound on the sample complexity, showing that these factors are inherent to the decision stump problem. Notice that for the linear bandit problem, the same factor $1/\Delta^2$ was obtained in the lower bound (see Lemma 2 in Soare et al (2014)). The elimination of variables is a necessary condition to build a decision tree under limited memory resources. As we shall see below, the factor $\log M$ versus $M$ for the naive approach dramatically reduces the regret bound when the decision stumps are stacked in a decision tree.


Theorem 2: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal decision stump has a sample complexity of at least:

$$\Omega\left(\left(\frac{1}{\Delta_1^2} + \frac{2K}{\Delta_2^2}\right)\log\frac{1}{\delta}\right)$$

The proof of Theorem 2 can be found in the appendix. Theorem 2 shows that the learning algorithm of the decision stump is optimal up to a factor $\log 1/\Delta$. Notice that these factors could be removed following the same approach as the one developed for the Median Elimination algorithm in Even-Dar et al (2006). Despite the optimality of this algorithm, we did not choose it for practical reasons. Indeed, it introduces another parameter (the number of elimination phases), which is tricky to tune. Moreover, as it suppresses 1/4 of the variables (or actions) at the end of each elimination phase, Median Elimination is not well suited when a small number of variables (or actions) provide a lot of reward and the others do not: in this case the algorithm spends a lot of time eliminating irrelevant variables. This case is precisely the one where we would like to use a local model such as a decision tree.

5. Bandit Tree

5.1 Principle

Our approach addresses the contextual bandit problem with $K$ actions and $M$ context features by building a decision tree of depth $D$: Bandit Tree. The successive application of variable selection stumps up to depth $d < D$ selects a part of the context vector $x_t$ as a sequence of variable values of length $d$. We call this selection a context path, and we encode it as an ordered list of variable and value couples: $c_d(x_t) = (x_{i_1} = x_{i_1}(t)), ..., (x_{i_d} = x_{i_d}(t))$. Each context vector $x_t$ is related to a unique context path of the decision tree, and to simplify notation, in the following we write only $c_d$. The expected rewards, used above to analyze the decision stumps, are defined here knowing the context path $c_d$: $\mu^i|c_d$ denotes the expected reward of the variable $x_i$ knowing the context path $c_d$, and $\mu_k|c_D$ denotes the expected reward of the action $k$ knowing the context path $c_D$. For each context path $c_d$, we will consider the set $V_{c_d}$ of variable indices it does not cover. This set is defined recursively by $V_{c_0} = \{1, ..., M\}$ and $V_{c_{d+1}} = V_{c_d} \setminus \{i_{d+1}\}$. For a context path of depth $D$, an action selection stump optimizes the choice of the action knowing the path $c_D$. The depth $D$ is a parameter, which corresponds to the complexity of the Bandit Tree. It can be seen as a parameter used to bound the VC-dimension (see Vapnik (1995)) of Bandit Tree, or as the number of strong features (see Biau (2012)) of the contextual bandit problem. Its influence on the sample complexity will be analyzed in the following.

The optimal greedy tree is the one obtained by a greedy algorithm using the knowledge of $D_{x,y}$ (see Algorithm 5). It can be viewed as a weak reference, since in the general case the greedy algorithm does not find the optimal tree. However, the optimal tree is impossible to build since it is an NP-hard problem. The optimal greedy tree is a good reference, which is commonly employed when decision tree approaches are used for classification tasks. The greedy algorithms are optimal for a large class of problems called greedoids. Interested readers may find in Korte et al (1991) a formal characterization of this class of problems.


Algorithm 5 OptimalGreedy($c^*_d$, $d$)
1: if $d < D$ then
2:   $i^*_{d+1} = \arg\max_{i \in V_{c_d}} \mu^i|c_d$
3:   OptimalGreedy($c^*_d + (x_{i^*_{d+1}} = 0)$, $d + 1$)
4:   OptimalGreedy($c^*_d + (x_{i^*_{d+1}} = 1)$, $d + 1$)
5: else
6:   $k^*|c^*_D = \arg\max_k \mu_k|c^*_D$
7: end if

Hereafter, we mark with a star the context paths $c^*_d$ and the decisions $k^*$ obtained by the optimal greedy tree.

5.2 Bandit Tree Learning

The optimal greedy tree requires the knowledge of the joint distribution $D_{x,y}$. In this section, we propose a decision tree algorithm which exhibits a logarithmic expected regret against the optimal greedy tree. At each time step $t$, an environment vector $x_t$ is drawn. The corresponding path $c_d(x_t)$ is selected using a tree search in the current Bandit Tree (see Algorithm 6, line 5). The DecisionStump function is called for the path $c_d$ (see Algorithm 6, line 6). To take into account the $D$ sequential decisions to be taken, irrelevant variables are eliminated using a slight modification of inequality (1). A possible next variable $x_i$ is eliminated when:

$$\hat{\mu}^{i'}|c_d - \hat{\mu}^i|c_d + \epsilon \geq 4\sqrt{\frac{1}{2t_{c_d,k}}\log\frac{4KMD\, t_{c_d,k}^2}{\delta}}, \qquad (3)$$

where $i' = \arg\max_i \hat{\mu}^i|c_d$, and $t_{c_d,k}$ is the number of times the path $c_d$ and the action $k$ have been drawn. When only a single variable remains for the path $c_d$, if $d < D$ the tree is unfolded by creating a new path $c_{d+1}$ for each value of this variable (see Algorithm 6, lines 7-12).

Theorem 3: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the sample complexity needed to obtain the optimal greedy tree with a probability at least $1 - \delta$ is:

$$t^* = 2^D\left(\frac{64K}{\Delta_1^2}\log\frac{4KMD}{\delta\Delta_1} + \frac{128K}{\Delta_2^2}\log\frac{4K}{\delta\Delta_2}\right),$$

where $\Delta_1 = \min_{i \neq i^*, c_d}\left(\mu^{i^*}|c_d - \mu^i|c_d\right)$, and $\Delta_2 = \min_{k \neq k^*}\left(\mu_{k^*}|c^*_D - \mu_k|c^*_D\right)$.

The Bandit Tree learning algorithm uses a localized explore-then-exploit approach: some paths may remain in the exploration state while others (the most frequent ones) have already selected their best actions. Corollary 3.1 shows that this valuable property leads to a low regret bound.


Algorithm 6 Bandit Tree Learning
1: Initialization: $t = 1$, $c_0 = ()$
2: NewPath($c_0$, $V$)
3: repeat
4:   Receive the context vector $x_t$
5:   Search the corresponding context path $c_d(x_t)$
6:   DecisionStump($x_t$, $y_k(t)$, $A$, $S_{c_d}$)
7:   if $|S_{c_d}| = 1$ and $d < D$ then
8:     $V_{c_d} = V_{c_d} \setminus \{i\}$
9:     for each value $v$ of the remaining variable $i$ do
10:      NewPath($c_d + (x_i = v)$, $V_{c_d}$)
11:    end for
12:  end if
13:  $t = t + 1$
14: until $t = T$

1: Function NewPath($c_d$, $V$)
2: $S_{c_d} = V$, $\forall (i,v)\; A_{c_d,i,v} = A$
3: $t_{c_d,k} = 0$, $\forall (i,v,k)\; t_{c_d,i,v,k} = 0$
4: $\forall (i,k,v)\; \hat{\mu}^i_{k,v}|c_d = 0$ and $\hat{\mu}^i_k|(v, c_d) = 0$, $\forall i\; \hat{\mu}^i|c_d = 0$
5: $d = d + 1$
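To make the tree mechanics of Algorithm 6 concrete, here is a compact, self-contained Python sketch of Bandit Tree learning on a small synthetic problem. It keeps one node per context path (the role of NewPath), searches the path of each incoming context, applies elimination rule (3) for variable selection, and applies rule (2) at depth D for action selection. Several choices are simplifications or assumptions: the reward distribution is made up, $\epsilon > 0$ is used to stop on near-ties (as suggested in Section 4.2), the empirical best variable or action is always kept so that candidate sets never empty, and depth-D paths skip variable selection and go straight to action selection.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
M, K, D, delta, eps, T = 4, 2, 2, 0.05, 0.1, 150_000

def draw():
    """Synthetic D_{x,y}: x_0 and x_1 are the relevant variables (illustrative)."""
    x = rng.integers(0, 2, size=M)
    if x[0] == 0:
        p = [0.1, 0.9]                      # action 1 dominates whatever x_1 is
    elif x[1] == 0:
        p = [0.3, 0.5]                      # action 1 is best
    else:
        p = [0.9, 0.1]                      # action 0 is best
    return x, np.array([float(rng.random() < q) for q in p])

class Node:
    """State attached to one context path c_d (the role of NewPath)."""
    def __init__(self, depth, free_vars):
        self.depth, self.free = depth, list(free_vars)
        self.S = list(free_vars)            # candidate splitting variables
        self.mu = np.zeros((M, 2, K))       # estimates of mu^i_{k,v} | c_d
        self.A = list(range(K))             # remaining actions (used at depth D)
        self.mu_k = np.zeros(K)             # estimates of mu_k | c_D
        self.n, self.rr = 0, 0              # completed rounds, round-robin pointer
        self.children = None                # (split variable, {value: child Node})

def search(node, x):                        # tree search for the path c_d(x_t)
    while node.children is not None:
        i, kids = node.children
        node = kids[int(x[i])]
    return node

root = Node(0, range(M))
for t in range(T):
    x, y = draw()
    node = search(root, x)
    if node.depth < D:                      # variable-selection phase, rule (3)
        k, node.rr = node.rr, (node.rr + 1) % K
        n = node.n + 1                      # plays of action k on this path
        for i in node.S:
            for v in (0, 1):
                node.mu[i, v, k] = y[k] * (x[i] == v) / n + (n - 1) / n * node.mu[i, v, k]
        if node.rr == 0:                    # last action of the round: try to eliminate
            node.n = n
            score = {i: sum(node.mu[i, v, :].max() for v in (0, 1)) for i in node.S}
            i_best = max(node.S, key=score.get)
            radius = 4 * math.sqrt(math.log(4 * K * M * D * n * n / delta) / (2 * n))
            node.S = [i for i in node.S      # keep the empirical best so S never empties
                      if i == i_best or score[i_best] - score[i] + eps < radius]
            if len(node.S) == 1:            # unfold: one new path per value of the variable
                i = node.S[0]
                free = [j for j in node.free if j != i]
                node.children = (i, {v: Node(node.depth + 1, free) for v in (0, 1)})
    else:                                   # depth D: action selection with rule (2)
        if len(node.A) == 1:
            continue                        # exploitation: the best action is known
        k = node.A[node.rr]
        node.rr += 1
        m = node.n + 1
        node.mu_k[k] = y[k] / m + (m - 1) / m * node.mu_k[k]
        if node.rr == len(node.A):
            node.rr, node.n = 0, m
            k_best = max(node.A, key=lambda kk: node.mu_k[kk])
            radius = 2 * math.sqrt(math.log(4 * K * m * m / delta) / (2 * m))
            node.A = [kk for kk in node.A
                      if kk == k_best or node.mu_k[k_best] - node.mu_k[kk] + eps < radius]

def describe(node, path="root"):
    if node.children is None:
        k = node.A[0] if len(node.A) == 1 else node.A
        print(f"{path}: depth {node.depth}, best action {k}")
    else:
        i, kids = node.children
        for v in (0, 1):
            describe(kids[v], f"{path} -> x{i}={v}")

describe(root)   # expected (w.h.p.): a split on x0 at the root, then x1 under x0=1
```

With these settings the root typically splits on $x_0$ and the $x_0 = 1$ branch on $x_1$, with the $x_0 = 1, x_1 = 1$ leaf settling on action 0; under $x_0 = 0$ every split is equivalent, so an arbitrary variable is selected there, which is precisely the near-tie situation the parameter $\epsilon$ is meant to handle.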

Corollary 3.1: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the expected cumulative regret of the Bandit Tree learning algorithm against the optimal greedy tree is bounded at time $T$ by:

$$\mathbb{E}_{D_{x,y}}[R(T)] \leq 2^D\left(\frac{64K}{\Delta_1^2}\log\frac{4KMDT}{\Delta_1} + \frac{128K}{\Delta_2^2}\log\frac{4KT}{\Delta_2} + 2\right)$$

Corollary 3.1 shows that the problem dependent regret bound against the optimal greedy tree is in $O(2^D K \log(KMDT))$. The proofs of Theorem 3 and Corollary 3.1 are provided in the appendix. The dependence on the number of contextual variables $M$ is in the order of $O(\log M)$, which means that the Bandit Tree learning algorithm can process large sets of contextual variables, and hence can easily be adapted to categorical variables. In comparison, the naive approach, which allocates one mab per value of $D$ contextual variables chosen from $M$, obtains a regret bound in $O\left(\binom{M}{D} 2^D K \log KT\right)$. Despite a major downsizing of the pre-factor, the dependence of the regret bound on the depth $D$ is still exponential. This means that, like all decision trees, Bandit Tree is well suited for cases where there is a small subset of relevant variables belonging to a large set of variables ($D \ll M$).

For $z > 0$ we have:

$$P\left(\Theta - t^* p_{ij} \geq t^* \Delta_{ij}\right) \geq P\left(Z \geq \frac{t^* \Delta_{ij}}{\sqrt{t^* p_{ij}(1 - p_{ij})}}\right), \qquad (17)$$

Then, we use the lower bound of the error function (see Chu (1955)):

$$P(Z \geq z) \geq 1 - \sqrt{1 - \exp\left(-\frac{z^2}{2}\right)}$$

Therefore, we have:

$$P\left(\Theta - t^* p_{ij} \geq t^* \Delta_{ij}\right) \geq 1 - \sqrt{1 - \exp\left(-\frac{t^* \Delta_{ij}^2}{2 p_{ij}(1 - p_{ij})}\right)} \geq 1 - \sqrt{1 - \exp\left(-\frac{t^* \Delta_{ij}^2}{p_{ij}}\right)} \geq \frac{1}{2}\exp\left(-\frac{t^* \Delta_{ij}^2}{p_{ij}}\right)$$


As $\delta = P\left(\Theta - t^* p_{ij} \geq t^* \Delta_{ij}\right)$ and $p_{ij} = 1/2 - \Delta_{ij}$, we have:

$$\log\delta \geq \log\frac{1}{2} - \frac{t^* \Delta_{ij}^2}{1/2 - \Delta_{ij}} \geq \log\frac{1}{2} - 2 t^* \Delta_{ij}^2$$

Hence, we have:

$$t^* = \Omega\left(\frac{1}{\Delta_{ij}^2}\log\frac{1}{\delta}\right)$$

Then, we need to use the fact that, as all the values of all the variables are observed when one action is played, the $M(M-1)/2$ estimations of bias are solved in parallel. In the worst case, $\min_{ij} \Delta_{ij} = \min_j \Delta_{i^* j} = \Delta$. Thus any algorithm needs at least a sample complexity $t^*$, where:

$$t^* = \Omega\left(\frac{1}{\Delta^2}\log\frac{1}{\delta}\right)$$

10.4 Theorem 1 Proof

The proof of Theorem 1 is a direct application of Lemma 1 and Lemma 3.

10.5 Corollary 1.1 Proof

Let $t^*$ be the number of time steps needed to build the optimal decision stump with high probability. When $T \leq t^*$, the cumulative regret is bounded by $t^*$. In the following, we consider that $T > t^*$. The expected cumulative regret at time $T$ is:

$$\mathbb{E}_{D_{x,y}}[R(T)] = \mathbb{E}_{D_{x,y}}\left[\sum_{t=1}^{t^*} r(t) + \sum_{t=t^*+1}^{T} r(t)\right]$$

Then, if we bound the regret by 1 at each time step $t$ during the exploration phase, we obtain:

$$\mathbb{E}_{D_{x,y}}[R(T)] \leq t^* + (T - t^*)\,\mathbb{E}_{D_{x,y}}[r(t)] \leq t^* + (T - t^*)\left[1 \cdot P(k_t \neq k_t^*) + 0 \cdot P(k_t = k_t^*)\right]$$

From the union bound, we have:

$$\mathbb{E}_{D_{x,y}}[R(T)] \leq t^* + (T - t^*)\left[P(i \neq i^*) + P(k \neq k^*)\right] \leq t^* + 2(T - t^*)\delta$$

Now, if we choose $\delta = \frac{1}{T}$, we obtain:

$$\mathbb{E}_{D_{x,y}}[R(T)] \leq t^* + 2\,\frac{T - t^*}{T} \leq t^* + 2$$

The total sample complexity of a decision stump based on one contextual variable is given by $t^* = t_1^* + 2t_2^*$, where $t_1^*$ and $t_2^*$ are respectively the number of time steps needed to select the best variable and the number of time steps needed to select the best action for each value of the best variable. Using Lemma 1 and Lemma 3, with $\delta = \frac{1}{T}$, we complete the proof.

10.6 Theorem 2 Proof

The proof of Theorem 2 is a direct application of Lemma 2 for the variable selection and Lemma 4 for the two action selections.

10.7 Theorem 3 Proof

The proof of Theorem 3 uses Lemma 1 and Lemma 3. Using the slight modification of the variable elimination inequality proposed in section 4.2, Lemma 1 states that for each decision stump, we have:

$$P\left(i^*_d|c_{d-1} \neq i'_d|c_{d-1}\right) \leq \frac{\delta_1}{D}$$

Then, from the union bound, we have:

$$P(c_D \neq c^*_D) \leq P\left(i^*_1 \neq i'_1\right) + P\left(i^*_2|c_1 \neq i'_2|c_1\right) + ... + P\left(i^*_D|c_{D-1} \neq i'_D|c_{D-1}\right) \leq \delta_1$$

For the action corresponding to the path $c_D$, Lemma 3 states that:

$$P(k \neq k^*) \leq \delta_1$$

Using Lemma 1 and Lemma 3, and then summing the sample complexity of each decision stump, we bound the sample complexity of a path $c_D$:

$$t^*_{c_D} \leq \sum_{d \leq D} \frac{64K}{\Delta_1^2}\log\frac{4KMD}{\delta\Delta_{c_d}} + \sum_{k \neq k^*} \frac{128}{\Delta_2^2}\log\frac{4K}{\delta\Delta_k},$$

where $\delta = 2\delta_1$. Then, summing over the paths $c_D$, we complete the proof.



10.8 Corollary 3.1 Proof

The proof of Corollary 3.1 uses Theorem 3 for the sample complexity, and then the same arguments as those of Corollary 1.1.

10.9 Theorem 4 Proof

To build a Bandit Tree of depth $D$, any greedy algorithm needs to solve $\sum_{d < D}$