JMLR: Workshop and Conference Proceedings vol 49:1–4, 2016

Open Problem: Second order regret bounds based on scaling time

Yoav Freund

YFREUND@UCSD.EDU

UCSD, San Diego, CA

Abstract

We argue that the second order bounds given in Cesa-Bianchi et al. (2006), which accumulate the square of the loss of each action separately, are loose. We propose a different form of second order bound and conjecture that it is satisfied by NormalHedge (Chaudhuri et al., 2009).

1. Background

1.1. Motivation

The upper bounds on the regret of exponential weights algorithms Littlestone and Warmuth (1989); Cesa-Bianchi et al. (1993); Freund and Schapire (1999) have a leading term of the form $\sqrt{n \log N}$, where $n$ is the length of the sequence and $N$ is the number of experts/actions. The bounds assume that the loss (gain) per iteration is in a bounded range, typically $[0, 1]$. Obviously, if the range is restricted (a priori) to $[0, 1/2]$ then the bound is also halved. Consider a scenario in which the range is $[0, 1]$ but the actual observed losses are in the range $[0, 1/2]$; we would like an algorithm that performs (almost) as well as an algorithm that knew a priori that the range is $[0, 1/2]$. The a priori knowledge is consequential because multiplicative weights algorithms use it to choose the learning rate. For a more general formulation of the problem, suppose that the losses in iteration $t$ are in the range $[0, a_t]$. Then we are seeking an algorithm with a bound of the form

$$\sqrt{\sum_{t=1}^{T} a_t^2}. \qquad (1)$$
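To see why the a priori range matters, recall the standard Hoeffding-style analysis of Hedge (this is the usual textbook calculation, not an argument made in this paper): for losses known in advance to lie in $[0, a]$, running Hedge with learning rate $\eta$ gives

$$\mathrm{Regret}_n \;\le\; \frac{\ln N}{\eta} + \frac{\eta\, n\, a^2}{8},
\qquad
\eta^* = \sqrt{\frac{8\ln N}{n a^2}}
\;\Longrightarrow\;
\mathrm{Regret}_n \;\le\; a\sqrt{\tfrac{n}{2}\,\ln N}.$$

Halving the known range $a$ thus halves the bound, but only because $\eta$ is tuned using $a$; realizing the improvement from a smaller observed range requires exactly the a priori knowledge discussed above.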

1.2. Existing second order bound

In Cesa-Bianchi et al. (2006) the authors prove, for a multiplicative weights algorithm, a regret bound that satisfies (1). Specifically, the leading term in the regret bound is $\sqrt{Q^*_n \ln N}$, where

$$Q^*_n = \sum_{t=1}^{n} x^2_{k^*_n, t},$$
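As a concrete reading of this definition, here is a minimal sketch of how $Q^*_n$ could be computed once all payoffs are known; the function name and the $(n \times N)$ array layout are my own choices, not from Cesa-Bianchi et al. (2006).

```python
import numpy as np

def q_star(payoffs):
    """Q*_n: the sum of squared instantaneous payoffs x_{k*,t} of the action
    k*_n with the highest total payoff after n rounds.
    `payoffs` is an (n, N) array with payoffs[t, i] = x_{i,t}."""
    k_star = int(np.argmax(payoffs.sum(axis=0)))   # best action in hindsight
    return float(np.sum(payoffs[:, k_star] ** 2))  # Q*_n
```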

$x_{i,t}$ is the instantaneous payoff of action $i$ at time $t$, and $k^*_n$ is the action with the highest total payoff at time $n$.

1.3. Drawback of existing bound

While the bound in Cesa-Bianchi et al. (2006) is tighter than previous bounds on some sequences, this tightness comes at the cost of being significantly looser on other sequences.


To demonstrate the problem, consider the following example (see the figure below). Consider a game with two actions and four phases. One action is the "initially leading action" (ILA) and the other action is either the "low variance action" (LVA) or the "high variance action" (HVA). I will use the term "gain" to refer to the cumulative payoff. At the end of the first phase the gain of ILA is large (we call it "the lead") while the gain of HVA or LVA is zero. In phase 2 the lead remains constant. Note that, if the lead is sufficiently large, NormalHedge will assign zero weight to the second action throughout phase 2. As a result, it does not matter to the algorithm (and should not matter to the bounds) whether the second action is LVA or HVA. In phases 3 and 4 the fates of the two actions reverse. The gain of ILA remains constant, but the gain of the other action increases by one at each step. The length of phase 3 is equal to that of phase 1. When phase 3 ends, the two gain sequences intersect. In phase 4 the gain of the non-ILA action keeps increasing. It is clear that at the intersection point the two actions should have the same weight, and that in phase 4 the non-ILA action should become the higher weighted one. (If you are not convinced that this is the case, consider switching the actions at the point of intersection.) All this holds whether the non-ILA action is high or low variance (i.e. has a large or small $Q^*_n$). In other words, the fact that the laggard action has a high variance should not affect the online algorithm, or the regret bounds, in phase 4.¹

[Figure: total payoff as a function of time. Marked elements: the initially leading action, the lead, the intersection point, the high variance action, the low variance action, and phases 1-4.]
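To make the example concrete, the following sketch builds one possible payoff sequence with the four phases described above. The specific phase lengths, and the alternating $\pm 1$ payoffs used to give the HVA a large $Q^*_n$ while keeping its gain near zero, are my own illustrative choices, not specified in the paper.

```python
import numpy as np

def four_phase_payoffs(lead=100, phase2_len=100, phase4_len=100, high_variance=True):
    """Per-step payoffs of shape (T, 2).
    Column 0: the initially leading action (ILA).
    Column 1: the low variance (LVA) or high variance (HVA) action.
    Phase 3 has the same length as phase 1, so the gains intersect at its end."""
    phase1_len = phase3_len = lead
    T = phase1_len + phase2_len + phase3_len + phase4_len
    p = np.zeros((T, 2))
    p[:phase1_len, 0] = 1.0                    # phase 1: ILA builds its lead
    start3 = phase1_len + phase2_len
    p[start3:, 1] = 1.0                        # phases 3-4: the other action gains 1 per step
    if high_variance:                          # HVA: alternating +/-1 payoffs, gain stays near 0
        t = np.arange(start3)
        p[:start3, 1] = np.where(t % 2 == 0, 1.0, -1.0)
    return p

gains = four_phase_payoffs().cumsum(axis=0)    # cumulative payoffs, as plotted in the figure
```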

2. An alternative definition of second order bounds

We seek an online algorithm with regret bounds that satisfy condition (1) and that is also a uniform improvement over existing regret bounds. In other words, there is no sequence on which the new regret bound is worse (in the leading term).

¹ In Cesa-Bianchi et al. (2006) only the bound is affected by $Q^*_n$; the algorithm does not depend on it. In other work the value of $Q^*_n$ affects the weights.


I propose that instead of defining the second order term separately for each sequence, we define a single second order term for each time step. A different way to express this is that the time step, which is defined as 1 in the standard formulation, is redefined so that steps with high variance cause a bigger increase than steps with smaller variance.

To figure out the correct scaling of time, consider a standard worst case construction where each payoff is $+1$ or $-1$ with equal probabilities, independently of all other payoffs. In other words, each action's total payoff performs a random walk. Imagine now that the range of the payoffs at time $t$ is $\pm a_t$ for some $0 \le a_t \le 1$. A consequence of the central limit theorem is that the distribution of the total payoffs at time $t$ is $\mathcal{N}\!\left(0, \sum_{i=1}^{t} a_i^2\right)$. Using standard lower bounds on regret we see that the regret is at least

$$\sqrt{\left(\sum_{i=1}^{t} a_i^2\right) \ln N}.$$

In this bound the sum $\sum_{i=1}^{t} a_i^2$ replaces the number of steps $t$. Another way to say this is that we redefine time to be equal to the sum. Redefining the length of time step $i$ to be $a_i^2$ gives an upper bound that is tight for the case where the payoff of each action at step $i$ is $-a_i$ or $+a_i$ with probabilities $1/2, 1/2$.

2.1. Conjecture

We conjecture the existence of an algorithm whose regret bound has the form $C\sqrt{\sum_{t=1}^{T} \Delta_t \ln N}$ for some universal constant $C$, where $\Delta_t \le 1$ are defined as follows. At step $t$ we compute two weight vectors $w_{t,i}$ and $u_{t,i}$. Both weight vectors are distributions (non-negative and summing to one). The weights $w_{t,i}$ are the hedging weights and define the one step regret of the algorithm relative to action $i$. Denoting the payoff vector at iteration $t$ by $p_t$, we can write the instantaneous regret with respect to the $i$th action as

$$r_{t,i} = w_t \cdot p_t - p_{t,i}.$$

The weights $u_{t,i}$, which, I believe, are different from $w_{t,i}$, are used to define the $t$th time step:

$$\Delta_t = \sum_{i} u_{t,i}\, r_{t,i}^2.$$
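A minimal sketch of this computation, assuming the payoffs and both weight sequences are given as arrays; the array layout and the default $C = 1$ are placeholder choices of mine, and the conjecture does not specify how $u_{t,i}$ should be chosen.

```python
import numpy as np

def conjectured_bound(payoffs, w, u, C=1.0):
    """Evaluate C * sqrt(sum_t Delta_t * ln N) for payoffs (T, N), hedging
    weights w (T, N) and time-scaling weights u (T, N); each row of w and u
    is a distribution over the N actions."""
    T, N = payoffs.shape
    algo_payoff = np.sum(w * payoffs, axis=1, keepdims=True)  # w_t . p_t for each round
    r = algo_payoff - payoffs                                 # r_{t,i} = w_t . p_t - p_{t,i}
    delta = np.sum(u * r ** 2, axis=1)                        # Delta_t = sum_i u_{t,i} r_{t,i}^2
    return C * np.sqrt(np.sum(delta) * np.log(N))
```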

3. Questions and Prizes

We state two open problems, one easier and one harder.

• Bound for exponential weights algorithm ($200): Prove a bound of the conjectured form for a multiplicative weights algorithm with a fixed learning rate.

• Bound for algorithm with no a priori knowledge ($500): Give a parameter free algorithm with a bound of the form above. The bound needs to hold uniformly for all time steps and for all fractions $\epsilon$ of the best actions. In other words, the bound should have the form $\sqrt{\sum_{t=1}^{T} \Delta_t \ln(1/\epsilon)}$ and hold for all $T$ and all $\epsilon$. (I believe that such a bound holds for the version of NormalHedge described in Chaudhuri et al. (2009); a sketch of that algorithm's weight update is given below.)
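For reference, here is a sketch of the NormalHedge weight computation as I read Chaudhuri et al. (2009): weights are proportional to $[R_{i,t}]_+ \exp\!\big([R_{i,t}]_+^2/(2c_t)\big)$, where $R_{i,t}$ is the cumulative regret to action $i$ and $c_t$ normalizes the average potential to $e$. This is a reconstruction under that assumption, not code from that paper; consult the paper for the exact statement.

```python
import numpy as np

def normalhedge_weights(R, tol=1e-10):
    """Weights for the next round given cumulative regrets R (length N),
    following the update described in the lead-in: find c with
    (1/N) * sum_i exp([R_i]_+^2 / (2c)) = e, then set
    w_i proportional to [R_i]_+ * exp([R_i]_+^2 / (2c))."""
    R_plus = np.maximum(np.asarray(R, dtype=float), 0.0)
    N = len(R_plus)
    if not np.any(R_plus > 0):                 # no action is ahead of the algorithm yet
        return np.full(N, 1.0 / N)
    potential = lambda c: np.mean(np.exp(R_plus ** 2 / (2.0 * c)))
    lo = np.max(R_plus) ** 2 / 1400.0          # potential(lo) >= exp(700)/N >> e
    hi = max(np.max(R_plus) ** 2, 1.0)         # potential(hi) <= exp(1/2) < e
    while hi - lo > tol * hi:                  # bisection: potential is decreasing in c
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if potential(mid) > np.e else (lo, mid)
    c = 0.5 * (lo + hi)
    w = R_plus * np.exp(R_plus ** 2 / (2.0 * c))
    return w / w.sum()
```

Here $R_i$ would accumulate $p_{t,i} - w_t \cdot p_t$ (the payoff of action $i$ minus the algorithm's payoff) over the rounds played so far; that bookkeeping is omitted from the sketch.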

References

Nicolò Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 382–391, 1993.


Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2):321–352, 2006. ISSN 1573-0565. doi: 10.1007/s10994-006-5001-7. URL http://dx.doi.org/10.1007/s10994-006-5001-7.

Kamalika Chaudhuri, Yoav Freund, and Daniel J. Hsu. A parameter-free hedging algorithm. CoRR, abs/0903.2851, 2009. URL http://arxiv.org/abs/0903.2851.

Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

Nick Littlestone and Manfred Warmuth. The weighted majority algorithm. In 30th Annual Symposium on Foundations of Computer Science, pages 256–261, October 1989.
