Regret and Convergence Bounds for a Class of Continuum-Armed Bandit Problems

Eric W. Cope
Abstract—We consider a class of multi-armed bandit problems where the set of available actions can be mapped to a convex, compact region of a finite-dimensional space, a setting sometimes denoted the "continuum-armed bandit" problem. The paper establishes bounds on the efficiency of any arm-selection procedure under certain conditions on the class of possible underlying reward functions. Both finite-time lower bounds on the growth rate of the regret and asymptotic upper bounds on the rates of convergence of the selected control values to the optimum are derived. We explicitly characterize the dependence of these convergence rates on the minimal rate of variation of the mean reward function in a neighborhood of the optimal control. The bounds can be used to demonstrate the asymptotic optimality of the Kiefer-Wolfowitz method of stochastic approximation with regard to a large class of possible mean reward functions.

Index Terms—Adaptive control, sequential decision procedures, stochastic approximation.

Manuscript received August 16, 2006; revised November 26, 2008. First published May 27, 2009; current version published June 10, 2009. Recommended by Associate Editor A. Lim. The author is with IBM Zurich Research Laboratory, Rüschlikon 8803, Switzerland. Digital Object Identifier 10.1109/TAC.2009.2019797
I. INTRODUCTION AND BACKGROUND

The well-known multi-armed bandit problem concerns a hypothetical gambler who makes successive plays from a given set of slot machines with unequal and initially unknown payoff distributions. After each play, the gambler immediately receives a random reward, which has not only a monetary but also an informational value for the gambler, as it provides data on the payoff distribution of that machine. The gambler must determine how to sequentially choose machines to play so as to win as much reward as possible. Multi-armed bandit problems have received considerable interest, not only because such models may be applied in many practical contexts, but also because their simple structure permits insight into the key trade-offs between acquiring information ("exploration") and gathering reward ("exploitation"), trade-offs which are common to most reinforcement learning and sequential search problems.

We shall primarily be concerned with a special class of bandit problems where the set of actions can be represented as a compact subset of a finite-dimensional metric space. This problem, which has sometimes been referred to as the "continuum-armed bandit" problem [1], differs from the standard bandit model in that the choice of actions at each stage is not discrete but instead ranges over a continuum. Examples of practical settings where such problems arise include pricing a new product with uncertain demand in order to maximize revenue, controlling the transmission power of a wireless communication system in a noisy
channel to maximize the number of bits transmitted per unit of power, and calibrating the temperature or levels of other inputs to a reaction so as to maximize the yield of a chemical process. In each case, it is natural to assume that the mean reward (expected per-period sales revenue, average bits transmitted per unit power, and mean chemical yield) is a function of a continuous control variable (price, power level, and temperature), and that this function obeys some continuity properties over the control domain. Indeed, continuity assumptions of some form are essential if the problem of choosing from a nondenumerable set of controls in discrete time periods is to be meaningful.

We shall derive efficiency bounds for a wide-ranging class of continuum-armed bandit models. First, we provide lower bounds on the growth rate of the regret, which measures the expected value of perfect information to any given control strategy, i.e., the difference between the maximum expected reward achievable if perfect information about the mean reward function were available in advance and the reward accumulated by the given strategy. The bound on regret that we derive establishes a performance limit that can be used to judge the efficiency of any strategy. This performance bound is shown to hold uniformly over time, and we derive both the optimal learning rate as well as a bounding constant.

Second, we provide asymptotic upper bounds on the convergence rate of the controls selected by any strategy to the optimal control value. The difficulty of locating the optimal control value depends in part on how the mean reward function behaves within a neighborhood of the maximum; we characterize this dependence when the function can be locally approximated by a polynomial function. Although these convergence rates only indirectly measure the reward performance of any control strategy, they provide us with insight into the optimal balance between exploration and exploitation in selecting controls. Knowing the optimal convergence rate of the controls to the optimum is of benefit in the design of control strategies because the bounds explicitly indicate how these trade-offs are best managed in the long run.

Finally, we shall show that a version of the Kiefer-Wolfowitz (K-W) method of stochastic approximation can (asymptotically) achieve the regret growth and convergence rate bounds we derive. This indicates not only that the bounds are tight (up to a constant factor), but also that this classical algorithm is efficient with respect to the optimal rates indicated by these bounds. These results additionally allow us to directly compare the control problem we consider in this paper with the related problem of estimating the location of the maximum of a function, for which similar efficiency bounds have been previously derived by other authors [4]. A different version of K-W has previously been shown to be efficient for the estimation problem [6]; as a result, we can directly contrast
how the optimal exploration-exploitation trade-offs are handled differently under the control and estimation objectives.

A. Problem Formulation

The problem and our results may be stated more precisely as follows. Suppose that in each of a sequence of time periods $t = 1, 2, \ldots$, an action $x_t$ is selected from a convex, compact set $C$, and a random reward $Y_t$ is subsequently received before the start of time period $t+1$. Define the mean reward function $f(x) = \mathbb{E}[Y_t \mid x_t = x]$, and assume that $f(x)$ is bounded for all $x \in C$ and that $f$ does not change with $t$. Let $Q(\cdot \mid x)$ denote the conditional distribution of $Y_t$ given $x_t = x$ and $f$, for all $x \in C$ and all $t$. Although $f$ is initially unknown, it is assumed to belong to a known class $\mathcal{F}$ of functions. We assume that for every $f \in \mathcal{F}$, there exists a point $x^*(f) \in C$ such that $f$ is continuous at $x^*(f)$ and uniquely achieves its maximum value at $x^*(f)$. We shall use $x^*(f)$ to indicate the dependence of the maximizer on $f$, as well as $f^* = f(x^*(f))$ to denote the value of $f$ at its point of maximum. In addition, for a given set of functions $\mathcal{G} \subseteq \mathcal{F}$, we shall frequently use the notation $X^*(\mathcal{G})$ to denote the set of maxima of all functions in $\mathcal{G}$; thus, $X^*(\mathcal{F})$ denotes the set of all maxima of functions in $\mathcal{F}$.

Let $\mathcal{H}_t = (x_1, Y_1, \ldots, x_{t-1}, Y_{t-1})$ represent the available data at the beginning of time period $t$. Define an allocation rule $\phi$ to be any function which, when given $\mathcal{H}_t$ as input, returns a random control value $x_t \in C$ whose conditional distribution given $\mathcal{H}_t$ does not depend on $f$. Note that together with $f$ and $Q$, the rule $\phi$ determines the distribution of the sequence $(x_t, Y_t)_{t \ge 1}$. We shall therefore denote probabilities and expectations taken with respect to this distribution by the operators $P_{\phi, f, Q}$ and $E_{\phi, f, Q}$, respectively, or, treating the distribution as fixed, simply by $P$ and $E$.

In control problems such as we are considering here, the most important measure of the performance of an allocation rule is the rate at which it accumulates (expected) reward. Equivalently, this performance may also be measured in terms of the opportunity cost associated with having to learn the location of the maximum $x^*(f)$, which is frequently referred to as the regret of the allocation rule. We define the $n$-step regret associated with an allocation rule $\phi$, a function $f$, and a reward distribution $Q$ according to

$$R_n(\phi, f, Q) = n\,f(x^*(f)) - \sum_{t=1}^{n} E_{\phi, f, Q}\bigl[f(x_t)\bigr]. \qquad (1)$$

Regret measures the expected difference in value between having perfect information about $f$ (and therefore applying the optimal control $x^*(f)$ in each time period) and using allocation rule $\phi$ without knowledge of $f$. We seek allocation rules whose regret grows slowly with $n$.

In addition to the regret, we shall also explore the optimal rate of convergence of the selected controls to the maximum under any allocation rule. To measure the convergence rate, we define the $q$th moment of the time-$n$ control distance of a given allocation rule $\phi$ for a particular $f$ and $Q$ to be

$$D_n^{(q)}(\phi, f, Q) = E_{\phi, f, Q}\,\|x_n - x^*(f)\|^q \qquad (2)$$

for $q > 0$. Although this convergence rate is not a direct measure of the performance of an allocation rule, allocation rules where the control distance converges quickly to zero will typically achieve a low rate of regret growth. Note that the control distance only measures the convergence rate of the selected control sequence $x_n$ to $x^*(f)$; it does not measure the convergence rate of the best estimator of $x^*(f)$ based on the data $\mathcal{H}_n$. As mentioned in the introduction, this rate has been established by other authors for the related problem of determining a sequential sampling scheme that efficiently estimates the location of the maximum of $f$. We shall explore the connections between the control problem and this estimation problem later in this article.

B. Summary of Main Results

We now preview the main results to be proven in this paper. Under certain conditions on $\mathcal{F}$ and $Q$, we shall characterize finite-time lower bounds for the growth rate of the regret (valid for any time period $n$), as well as asymptotic upper bounds for the rate of convergence to zero of the control distance. We shall derive bounds for any allocation rule $\phi$ on the worst-case (over all $f \in \mathcal{F}$) rate of regret growth and control distance convergence. We may summarize these results as follows. First, define the worst-case regret and control distance measures $\sup_{f \in \mathcal{F}} R_n(\phi, f, Q)$ and $\sup_{f \in \mathcal{F}} D_n^{(q)}(\phi, f, Q)$.

If there exists some function in $\mathcal{F}$ that satisfies, for some exponent and positive constants, a growth condition of the form (3) in a neighborhood of its maximum, and if there is a continuum of functions in $\mathcal{F}$ that are "close to" or "approximate" this function (see Condition II.6 below), then we show that there exists a positive constant such that (4) holds for any $n$. We shall also provide a formula to compute the largest value of this constant that can be derived from the arguments presented below for which (4) is valid. Furthermore, we show that the asymptotic bounds (5) and (6) hold for the control distance.
We also demonstrate that, under some minor additional restrictions on $\mathcal{F}$ and $Q$, the growth rates implied by (4)–(6) are achievable by the rule $\phi_{KW}$, in the sense that its regret and control distance attain the corresponding rates as $n \to \infty$ for every $f \in \mathcal{F}$, where $\phi_{KW}$ denotes a particular classic version of the K-W allocation rule that we define in Section I-D. This shows that the lower bounds implied by (4) and (5) are tight (up to
a constant factor) and that $\phi_{KW}$ is rate-optimal with respect to these bounds.
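To make definitions (1) and (2) concrete, the following sketch simulates a simple continuum-armed bandit and estimates both quantities for an arbitrary allocation rule. The quadratic mean reward function, the Gaussian noise, and the naive uniform-sampling rule are illustrative assumptions only and are not constructions used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem instance: mean reward f with maximizer x_star on C = [0, 1].
x_star = 0.3
f = lambda x: 1.0 - (x - x_star) ** 2                   # mean reward function
reward = lambda x: f(x) + 0.1 * rng.standard_normal()   # noisy observation

def uniform_rule(history):
    """A deliberately naive allocation rule: ignore the history and
    sample the control uniformly on C.  Any rule mapping the history
    to a control could be plugged in here."""
    return rng.uniform(0.0, 1.0)

n = 1000
controls, rewards = [], []
for t in range(n):
    x_t = uniform_rule(list(zip(controls, rewards)))
    controls.append(x_t)
    rewards.append(reward(x_t))

# n-step regret, cf. (1): n * f(x*) - sum_t f(x_t), estimated from a single run.
regret = n * f(x_star) - sum(f(x) for x in controls)
# q-th moment of the time-n control distance, cf. (2), here with q = 2.
control_distance = abs(controls[-1] - x_star) ** 2

print(f"empirical regret after {n} steps: {regret:.1f}")
print(f"squared control distance at time {n}: {control_distance:.4f}")
```

Any candidate allocation rule can be substituted for `uniform_rule`, which makes this a convenient harness for comparing empirical regret growth across strategies.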
C. Related Results

We now review some results on optimal regret growth obtained by previous authors for related multi-armed bandit problems. In the case where the set of bandit arms is finite, Lai and Robbins [14] proposed an allocation rule that achieves a regret growth rate of $O(\log n)$, and further established that this rate is a lower bound on the growth rate of any control policy whose regret grows more slowly than every positive power of $n$ [13], [18]. These results were generalized by Yakowitz and Lowe [20] and Burnetas and Katehakis [3] to hold in a nonparametric context. Later, Lai and Yakowitz [15] showed that when the control set is countably infinite, a regret growth rate determined by any prescribed nondecreasing, divergent sequence is attainable. These bounds are all asymptotic in character; finite-time bounds for the growth rate of the regret were later derived by Kulkarni and Lugosi [11] for the two-armed bandit and by Auer et al. [2] for the general multi-armed bandit problem. Mannor and Tsitsiklis [16] also derive finite-time lower bounds on the regret for the multi-armed bandit problem in the context of the PAC ("probably approximately correct") framework.

Agrawal [1] considered a "continuum-armed bandit" problem similar to the control problem we are considering when the dimension is one, and proposed an allocation rule whose regret growth rate can be brought arbitrarily close to a polynomial rate determined by the smoothness exponent, for a class of functions that are uniformly locally Lipschitz with a given exponent. Kleinberg [10] established lower and upper bounds on the optimal regret for this function class, and provided an algorithm which achieves the upper bound. The regret bounds of the present paper (which were obtained independently of and concurrently with Kleinberg's work) are based on a different set of assumptions about the function class $\mathcal{F}$, which are likely to apply in most practical settings. Comparing these bound results, it is clear that the assumptions about $\mathcal{F}$ are critical in determining the optimal regret rates for this problem.

There has been a long tradition of research into the related problem of how to sequentially choose a set of design points in order to efficiently estimate the location of the maximum of a regression function. In fact, results similar to those derived here for the convergence rates of the control distance have been previously established for the estimation problem. In this problem, the goal is to find an allocation (sampling) rule and a sequence of estimates $\hat{x}_n$ such that $\hat{x}_n$ converges to the location of the maximum as quickly as possible. This problem differs from the one we are considering because $\hat{x}_n$ is not constrained to be the point at which a sample is drawn, but may be any function of the data collected so far. However, the problem is similar enough that it is interesting to compare the results. Chen [4] established a class of lower bounds for the rate of convergence of estimates of the maximum of a function under various smoothness assumptions on functions in $\mathcal{F}$. These lower bounds were previously shown to be achievable by a stochastic approximation-type algorithm proposed by Fabian [6]. In particular, for classes of functions whose $s$th derivative exists and is bounded, with $s$ an odd number, it is possible for the estimator to converge to the location of the maximum at
an optimal rate determined by $s$. Similar results were given by Polyak and Tsybakov [17]. The bound results of this paper follow a similar proof strategy as those of [4], although substantial modifications are required to extend these methods to the control problems considered here.

Algorithms for the estimation problem are often also appropriate for use in the control problem setting we are considering. However, algorithms which are efficient for estimation are not necessarily efficient for the control objectives of minimizing regret and control distance. In particular, when the function is known to be smooth, it is often possible to achieve a faster rate of convergence for the estimator $\hat{x}_n$ by allowing the sample points to converge to the optimum more slowly than is strictly required to achieve consistency of the estimator. Under the control objective, however, it is desirable for the control points $x_t$ to converge to the optimum as quickly as possible, since this results in the lowest regret growth rate. For example, the results mentioned in the last paragraph indicate that a version of K-W achieves the optimal convergence rate for the estimator $\hat{x}_n$; for this algorithm, however, the control sequence converges to the optimum at a rate that is much slower than the rate we have indicated is optimal for our control objectives. The following section gives a formal definition of the versions of K-W we shall consider, along with some relevant convergence results.

D. Kiefer-Wolfowitz Stochastic Approximation

Kiefer and Wolfowitz [9] introduced an iterative stochastic gradient-following algorithm for locating the maximum of a unimodal function. The algorithm proceeds by estimating the gradient at a current estimate of the maximum using a finite-difference approximation, and then updating this estimate by taking a step in the direction of the estimated gradient. The iterates produced by this algorithm can be shown to converge to the maximum with probability one if the step sizes and finite-difference interval widths decrease to zero at an appropriate rate. Although many variants and improvements of the Kiefer-Wolfowitz method have been proposed since the algorithm was first described, we shall only focus on what may be considered to be the classical form of the algorithm for multidimensional settings. Showing that this classical algorithm is optimal with regard to the regret and convergence bounds derived in this paper is sufficient to establish that the bounds are tight.

We present the basic K-W procedure as described in Kushner and Yin [12], p. 13. Let $\{a_t\}$ and $\{c_t\}$ denote positive sequences of step sizes and finite-difference interval widths, respectively, and let $e_i$ denote the $i$th column of the identity matrix. The algorithm then proceeds as follows: given an estimate $x_t$ of the maximum in period $t$, take measurements at the points $x_t + c_t e_i$ and $x_t - c_t e_i$, for $i = 1, \ldots, d$. The $i$th coordinate of the next iterate is then computed according to the finite-difference recursion (7). Suitable modifications to this procedure using projection or "resetting" operators are possible whenever $x_t$, $x_t + c_t e_i$, or $x_t - c_t e_i$ fall outside the bounds of $C$; see [12], pp. 77–79, or [8]. Note that allowing multiple measurements to be taken per iteration does not affect the rates of convergence or regret growth of this algorithm.
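A minimal sketch of this classical procedure, under assumed parameter schedules, is given below. The step-size and interval-width sequences ($a_t = a/t$ and $c_t = c\,t^{-1/4}$), the division by $2c_t$ in the finite difference, and the clipping used to keep iterates inside the domain are illustrative choices and are not taken from (7) or from the tuning analyzed later in the paper.

```python
import numpy as np

def kiefer_wolfowitz(sample_reward, x0, lower, upper, n_iter=500, a=0.5, c=0.5):
    """Classical multidimensional Kiefer-Wolfowitz iteration (a sketch).

    sample_reward(x) returns a noisy observation of the mean reward at x.
    At iteration t the gradient is estimated by central finite differences
    along each coordinate, and the iterate moves in the estimated direction.
    """
    x = np.asarray(x0, dtype=float)
    d = x.size
    for t in range(1, n_iter + 1):
        a_t = a / t            # step size (assumed schedule)
        c_t = c / t ** 0.25    # finite-difference width (assumed schedule)
        grad = np.zeros(d)
        for i in range(d):
            e_i = np.zeros(d)
            e_i[i] = 1.0
            y_plus = sample_reward(x + c_t * e_i)
            y_minus = sample_reward(x - c_t * e_i)
            grad[i] = (y_plus - y_minus) / (2.0 * c_t)
        x = np.clip(x + a_t * grad, lower, upper)  # crude way to stay inside C
    return x

# Illustrative usage on an assumed concave quadratic with Gaussian noise.
rng = np.random.default_rng(1)
x_star = np.array([0.2, -0.5])
noisy = lambda x: -np.sum((x - x_star) ** 2) + 0.1 * rng.standard_normal()
print("final iterate:", kiefer_wolfowitz(noisy, x0=[0.0, 0.0], lower=-1.0, upper=1.0))
```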
The following condition on the mean reward function is sufficient for establishing the rate of convergence of K-W (cf. [6]).

Condition I.1: Suppose $f$ is three times continuously differentiable for all $x \in C$, and there exist positive constants such that the required uniform bounds on these derivatives hold for all $x \in C$.
Let $\phi_{KW}$ denote the version of K-W with the initial iterate set to an arbitrary fixed vector in $C$ and with particular choices of the step-size and interval-width sequences $a_t$ and $c_t$. Assuming that $f$ has bounded third derivatives in a neighborhood of its optimum, and that the variance of the distribution $Q(\cdot \mid x)$ is bounded for all $x \in C$, then according to Theorem 5.1 of [6] the moments of the distance between the iterates of $\phi_{KW}$ and the optimum converge to zero at a known polynomial rate.

This result will be important in establishing the optimality of $\phi_{KW}$ for any class of functions satisfying these conditions, as well as some additional conditions which shall be presented in the next section. Although we shall only present results for the standard K-W procedure outlined above, we note that similar results on the convergence of moments of the estimators have been provided by Polyak and Tsybakov [17] and Gerencsér [8], under conditions similar to Condition I.1, for related stochastic approximation techniques. Polyak and Tsybakov's method is based on kernel regression techniques, and they showed that it uniformly achieves a polynomial estimator convergence rate for a class of unimodal functions that have continuous partial derivatives up to a given order and whose highest derivatives satisfy a Hölder condition. Their conditions are weaker than those of Fabian [6] (which are themselves weaker than the conditions stated in Condition I.1); in particular, they require that the function have only one or two continuous partial derivatives. In addition, their algorithm requires only one measurement per iteration, as opposed to the $2d$ measurements per iteration of K-W. Gerencsér [8] provides a convergence analysis for the simultaneous perturbation stochastic approximation (SPSA) method [19] on a compact domain. The SPSA method, like Polyak and Tsybakov's, has the advantage that the number of measurements required in each iteration does not increase with the dimension of the search space, although the order of the rate of convergence is not improved by this device. Gerencsér's proof, unlike Fabian's, also explicitly deals with the case where the domain is compact, in which the SPSA method uses a "resetting" technique every time the iterate falls outside $C$; see [8].
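For comparison with the coordinate-wise finite differences of K-W, here is a minimal sketch of the simultaneous perturbation idea: all coordinates are perturbed at once by a random sign vector, so only two measurements are needed per iteration regardless of the dimension. The gain and perturbation schedules are assumptions for illustration and do not reproduce the conditions analyzed in [8] or [19]; no resetting step is included.

```python
import numpy as np

def spsa(sample_reward, x0, n_iter=500, a=0.5, c=0.5, rng=None):
    """Simultaneous perturbation stochastic approximation (maximization form, a sketch).

    All d coordinates are perturbed simultaneously by a Rademacher vector,
    so each iteration needs only two reward measurements, independent of d.
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_iter + 1):
        a_t = a / t
        c_t = c / t ** 0.25
        delta = rng.choice([-1.0, 1.0], size=x.size)      # random +/-1 perturbation
        y_plus = sample_reward(x + c_t * delta)
        y_minus = sample_reward(x - c_t * delta)
        ghat = (y_plus - y_minus) / (2.0 * c_t * delta)    # per-coordinate estimate
        x = x + a_t * ghat
    return x

# Illustrative usage on an assumed concave quadratic with Gaussian noise.
rng = np.random.default_rng(2)
x_star = np.array([0.2, -0.5, 0.1])
noisy = lambda x: -np.sum((x - x_star) ** 2) + 0.1 * rng.standard_normal()
print("SPSA iterate:", spsa(noisy, x0=np.zeros(3), rng=rng))
```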
II. BOUNDS ON REGRET GROWTH AND CONVERGENCE RATES

A. Conditions on the Function Class

Condition II.1: Let $C$ be convex and compact, and let $\mathcal{F}$ be a class of functions such that:
1) For every $f \in \mathcal{F}$, there exists a point $x^*(f)$ such that $f$ is both continuous at $x^*(f)$ and uniquely achieves its maximum value at $x^*(f)$.
2) The set of maxima $X^*(\mathcal{F})$ is convex and contains an open ball.
3) The range of every $f \in \mathcal{F}$ on $C$ is bounded.

Our results require that the class $\mathcal{F}$ also be rich enough to contain a set of functions whose maxima range over a continuous set of values. The following definitions make this statement precise.

Definition II.2: Let $f \in \mathcal{F}$, let $u$ be a given direction, and let $\alpha$ and $p$ be given positive constants. We say that $f$ is $(\alpha, p)$-approximated in the direction $u$ if there exists a $\delta_0 > 0$ such that for all $\delta$ satisfying $0 < \delta \le \delta_0$, there exists an $f_\delta \in \mathcal{F}$ satisfying, for all $x \in C$, the conditions (8)–(10).

The conditions (8)–(10) may be understood as follows: any such function $f_\delta$ takes its maximum value at a fixed distance $\delta$ from the maximum of $f$ (8), and its values are "close" to the values of $f$ over the domain $C$, in the sense defined by (10). Equation (9) provides an additional condition on the rate of change of $f_\delta$ around its own maximum (parameterized by the exponent $p$) that will determine the bound on the optimal rate of convergence of the controls.

Remark II.3: Any function that is $(\alpha, p)$-approximated in a direction satisfies (11). Also, note that the right-hand side of (10) is an increasing function of $\delta$; therefore, if the inequality (10) is satisfied for some $\delta$, it is also satisfied for all larger values as well.

The following examples illustrate a set of functions containing an element that is $(\alpha, p)$-approximated in a direction $u$.

Example II.4: Let $C$ be the unit disk, let $u$ be a given direction, and let the parameters $\alpha$ and $p$ be given. For $m$ in the indicated range, define the functions $h_p(\cdot\,; m)$ according to (12). For a particular choice of the parameters, the functions $h_2(\cdot\,; m)$ represent a set of quadratic functions. Each function $h_p(\cdot\,; m)$ takes its unique maximum at a point determined by $m$, and it is possible to show that the $p$th derivative (in any direction) of $h_p(\cdot\,; m)$ at that point is negative and bounded above for any $m$.
Fig. 1. Functions of the type $h_p(\cdot\,; m)$. Panel (a) pictures the functions for the smaller value of $p$ with $m$ ranging in $\{0.0, 0.1, 0.2, 0.3, 0.4\}$; panel (b) displays the functions for the larger value of $p$ with $m$ ranging in the same set.
From this it follows that we can find a constant such that (9) holds. In addition, a uniform bound on the difference between the corresponding function values holds for all $x \in C$, and thus (10) is satisfied. Thus, if these requirements hold for some admissible choice of the constants, then the base function of the family is $(\alpha, p)$-approximated in the direction $u$.

Example II.5: Let the dimension be two, let $C$ be the unit disk centered at $(0,0)$ with interior $C^\circ$, and let $\mathcal{F}$ be the indicated class of functions. Suppose $f \in \mathcal{F}$ is such that its maximum is interior to $C$, and let $u$ be a vector on the boundary of the unit disk such that the associated approximating functions remain in the class. Then the required bounds hold for all admissible $\delta$. This demonstrates that every $f \in \mathcal{F}$ whose maximum is interior to $C$ is (2,2)-approximated in all directions.

Condition II.6: There exists an $f \in \mathcal{F}$ such that, for some direction $u$ and some values of $\alpha$ and $p$, $f$ is $(\alpha, p)$-approximated in the direction $u$.

The rate at which a function may be said to vary about its maximum determines how quickly the controls can be driven toward that maximum. To see what effect this can have on the problem of efficiently locating the maximum, consider again the functions $h_p(\cdot\,; m)$ of Example II.4, defined on the interval $[-1, 1]$, for two different values of $p$, one small and one large. (Since the domain is one-dimensional, we shall omit the direction in specifying the approximation.) The functions for both values of $p$, with $m$ ranging in the set $\{0.0, 0.1, 0.2, 0.3, 0.4\}$, are depicted in Fig. 1. Note that the curves for the larger value of $p$ are much flatter around their optima than the corresponding curves for the smaller value. As a result, the distance between function values is much smaller near the maximum for the flatter functions, even though the sets of maxima of these functions are identical in both graphs. We may infer from these pictures that it can be harder to estimate the location of the maximum of the function to a given degree of accuracy in the case where $\mathcal{F}$ contains functions like those pictured at right than those pictured at left in this figure, and thus we should expect slower convergence rates when the value of $p$ is large. Indeed, our bound results below indicate that we should expect a slower optimal convergence rate for $(\alpha, p)$-approximated functions for larger values of $p$.

Fig. 1 also indicates that even if the rate of convergence of the controls to the optimum is slow in the case where $p$ is large, the rate of regret growth may not be any faster, because in this case the values of the functions change only very slightly near the optimum. In fact, our bounds on regret growth indicate that even though the controls must converge more slowly when $p$ is higher, the optimal rate of regret growth is insensitive to the value of $p$.
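To visualize the flatness effect just described, the sketch below plots a hypothetical one-dimensional family in the spirit of Example II.4, here taken to be $h_p(x; m) = -|x - m|^p$ on $[-1, 1]$; this specific form and the two exponents shown are assumptions for illustration and need not match the exact definition in (12).

```python
import numpy as np
import matplotlib.pyplot as plt

def h(x, m, p):
    """Hypothetical stand-in for the family of Example II.4: a unimodal
    function with maximum at m whose flatness near the optimum is
    governed by the exponent p."""
    return -np.abs(x - m) ** p

x = np.linspace(-1.0, 1.0, 400)
ms = [0.0, 0.1, 0.2, 0.3, 0.4]

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)
for ax, p in zip(axes, (2, 6)):          # a small and a large exponent
    for m in ms:
        ax.plot(x, h(x, m, p), label=f"m = {m}")
    ax.set_title(f"p = {p}")
    ax.set_xlabel("x")
axes[0].set_ylabel("h(x; m)")
axes[0].legend(fontsize="small")
plt.tight_layout()
plt.show()
```

For the larger exponent the curves are nearly indistinguishable near their maxima, which is why locating the maximizer is harder there even though the reward lost by a slightly misplaced control is small.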
B. Conditions on the Reward Distribution
Condition II.7: Let $\nu$ denote a $\sigma$-finite measure on the Borel sets of the real line. For each $f \in \mathcal{F}$ and $x \in C$, let $Q(\cdot \mid x)$ be a distribution function with an associated density $q(\cdot \mid x)$ with respect to $\nu$, such that the expectation of a random variable with this density exists and is equal to $f(x)$ for all $x \in C$. Also, $q(\cdot \mid x)$ is positive on the support of $\nu$ for all $x$. Suppose that for each $y$ in the support of $\nu$ and every $x \in C$, the first two derivatives of $q(y \mid x)$ with respect to the mean value, denoted $\dot q$ and $\ddot q$, exist and are continuous. Furthermore, for any $x$, if $Y$ is a random variable with density $q(\cdot \mid x)$ with respect to $\nu$, then the expectation in (13) is finite and continuous at each $x$.

It is straightforward to show that this condition holds for a wide class of sampling distributions, including exponential families of distributions such as the Normal, Bernoulli, and Poisson.
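As a quick numerical sanity check of this kind of condition for the Normal case, the sketch below differentiates the Normal density with respect to its mean and verifies that the expected squared score is finite (equal to $1/\sigma^2$). This only illustrates the flavor of Condition II.7; the exact expression in (13) is not reproduced here.

```python
import numpy as np
from scipy import integrate

sigma = 0.7
mu = 1.3

def q(y):            # Normal density with mean mu, standard deviation sigma
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def dq_dmu(y):       # first derivative of the density with respect to the mean
    return q(y) * (y - mu) / sigma ** 2

def d2q_dmu2(y):     # second derivative with respect to the mean
    return q(y) * (((y - mu) / sigma ** 2) ** 2 - 1.0 / sigma ** 2)

# Expected squared score E[(dq/dmu / q)^2]; for the Normal family this equals
# 1/sigma^2, so an expectation of the type appearing in (13) is finite.
val, _ = integrate.quad(lambda y: (dq_dmu(y) / q(y)) ** 2 * q(y), -np.inf, np.inf)
print(val, 1.0 / sigma ** 2)
```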
Condition II.7 allows us to establish the following technical lemma, whose proof is given in Appendix A.

Lemma II.8: Suppose Condition II.7 holds.
a) The following equalities hold for all fixed $x \in C$ and $f \in \mathcal{F}$, where $Y$ is a random variable with density $q(\cdot \mid x)$ with respect to $\nu$.
b) Let $\phi$ be an allocation rule and let $f$ and $g$ be distinct functions in $\mathcal{F}$. For any $t$, let $P_f$ and $P_g$ denote the probability measures $P_{\phi, f, Q}$ and $P_{\phi, g, Q}$, respectively, restricted to $\mathcal{H}_t$. Then for any $t$ the density of $P_g$ with respect to $P_f$ exists and may be expressed as in (14).
c) Let $f$ be an $(\alpha, p)$-approximated function in some direction $u$, and let $\delta_0$ be given as in Definition II.2. Then there exists a finite value such that (15) holds for all $\delta \le \delta_0$ and all $f_\delta$ satisfying (10), where the expectation in (15) is taken with respect to a random variable with density $q(\cdot \mid x)$ with respect to $\nu$.
d) Suppose Conditions II.1, II.6, and II.7 hold, and let $f$ be $(\alpha, p)$-approximated in some direction $u$. Let $t$ be fixed, and suppose $\phi$ is an allocation rule such that (16) holds for some constant. Then there exists an $f_\delta$ satisfying (10) such that, for any event measurable with respect to $\mathcal{H}_t$, the probabilities assigned to it under $f$ and under $f_\delta$ differ by no more than the bound established in the Appendix.

The quantity on the left-hand side of the inequality (16) may be considered a performance metric that is related to both the regret and the control distance of any given allocation rule. Part (d) of Lemma II.8 establishes that, for any allocation rule that achieves a certain level of performance of this metric at any time $t$ on an $(\alpha, p)$-approximated function $f$, there exists a function $f_\delta$ that is close enough to $f$ that it is difficult to distinguish from the data collected using $\phi$ which of these functions is the true problem instance. This fact will be exploited in the theorem of the next subsection to show that no such allocation rule can distinguish with sufficient accuracy whether the true maximum lies at the maximum of $f$ or that of $f_\delta$ by time $t$, which is crucial in establishing the bounds we desire.

C. Main Results

Our main results can be summarized in the following theorem.

Theorem II.9: Suppose that Conditions II.1, II.6, and II.7 hold.
a) Then, under the corresponding restriction on the parameters, there exists a positive constant such that (4) holds for every allocation rule $\phi$ and for all $n$.
b) Under a further restriction on the parameters, (5) and (6) hold for every allocation rule $\phi$.

Proof: Suppose that (4) is false, i.e., there exists an allocation rule $\phi$ such that for any positive constant there exists an $n$ such that (17) holds. Let $\alpha$ and $p$ be given as in Condition II.6, and let $f$ denote an $(\alpha, p)$-approximated function in $\mathcal{F}$, with $x^*$ denoting its point of maximum. Let the constant and $n$ be fixed to satisfy this hypothesis for the given rule $\phi$. Since (11) holds by Remark II.3, (17) implies (18), thus satisfying the condition of Lemma II.8(d).

Our next step is to show that, under the given conditions, there exist a $\delta$ and a function $f_\delta$ such that (19) holds for the given rule and the time period $n$. Choose $\delta$ according to (34), let $M$ be a suitably chosen integer, and define an appropriate positive spacing. Define functions with distinct maxima such that (20) and (21) hold for all of them. (For an example where this condition is satisfied, choose these functions from among the functions satisfying (8)–(10); Remark II.3 then implies that (20) is satisfied.)

Let $P_j$ represent the restriction of the probability measure corresponding to the $j$th of these functions to the data available at time $n$. According to Lemma II.8(d), the total variation distance between any $P_j$ and the measure corresponding to $f$ is bounded as stated there. For each $j$, define the corresponding sets of outcomes. Since $\phi$ is an allocation rule, the associated relations hold for each $j$, and the condition (21) implies the analogous relation for these sets. Next, define the probabilities of these sets under each measure.
Applying the above results, we obtain a bound which in turn implies the desired inequality; i.e., we have proven (19). Application of the Markov inequality to (19) yields (23), which implies (24). Note that certain of the quantities above are independent of those appearing in (17), while the bound just obtained can depend on them. Using (34), this bound is a decreasing function of the chosen tolerance, so that if we choose the tolerance close enough to zero, we may also choose a constant that generates a contradiction between (17) and (24). This implies that the regret at any time $n$ must satisfy (25) for some value of the constant; furthermore, a suitable bound for this constant may be computed as the unique positive solution of the corresponding equation. This proves part (a).

To prove part (b), suppose that the negation of (5) holds for some allocation rule $\phi$ for all $n$. This implies (26). Then there exists a constant such that, for some function in the approximating family, (22) holds, thus satisfying the condition of Lemma II.8(d). The argument following (18) can be used to show that for any such rule there exists a constant such that, for any $n$, there exists a function for which (22) holds. Application of the Markov inequality then yields (27), contradicting (26). This proves (5). This argument can also be used to prove (6): suppose that instead of (26) we assume (28) for some allocation rule. This would again lead us to conclude (18), which again results in (27), which contradicts (28). This proves (6) and concludes the proof of part (b).

To understand the applicability of the theorem under various values of $\alpha$ and $p$, recall that the constant $p$ represents the maximal degree of a bounding polynomial around the approximated function, and thus the rate of curvature at the maximum of the function. The role of the constant $\alpha$ in the conditions of the theorem has to do primarily with the degree of closeness of the $(\alpha, p)$-approximated function; as long as $\alpha$ is within an acceptable range for the given $p$, the regret bound in part (a) of the theorem holds. The value of $\alpha$ is more important in part (b) of the theorem, however, as it figures in the convergence rate of the $q$th moment of the control distance.

III. ACHIEVEMENT OF THE BOUNDS

A. Kiefer-Wolfowitz Stochastic Approximation

We now show that the allocation rule $\phi_{KW}$ defined in Section I-D can achieve the bounds established in the previous section when Condition II.1 is met, along with the following condition.

Condition III.1: There exist positive constants such that Condition I.1 holds for all $f \in \mathcal{F}$. In addition, there exists a constant such that (29) holds for all $f \in \mathcal{F}$ and all $x \in C$.
Example III.2: Consider the function class in Example II.5. It is clear that this class of functions satisfies both Conditions II.1 and III.1.
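A small experiment of the following kind can be used to probe the achievability result below empirically: run the classical K-W recursion on a smooth concave function of the sort allowed by Example III.2 and record how the cumulative regret and the control distance evolve with $n$. The specific function, noise level, gain sequences, and regret bookkeeping (two measurements per coordinate per iteration) are illustrative assumptions, and the printed values are empirical rather than the constants of the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
x_star = np.array([0.3, -0.2])
f = lambda x: 1.0 - np.sum((x - x_star) ** 2)           # smooth concave mean reward
sample = lambda x: f(x) + 0.1 * rng.standard_normal()   # noisy reward observation

def run_kw(n, a=0.5, c=0.5):
    """Run the classical K-W recursion for n iterations and return the
    cumulative regret and the final control distance (both empirical)."""
    x = np.zeros_like(x_star)
    regret = 0.0
    for t in range(1, n + 1):
        a_t, c_t = a / t, c / t ** 0.25
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x); e[i] = 1.0
            grad[i] = (sample(x + c_t * e) - sample(x - c_t * e)) / (2.0 * c_t)
            # regret charged for the two measurement points of this coordinate
            regret += 2 * f(x_star) - f(x + c_t * e) - f(x - c_t * e)
        x = np.clip(x + a_t * grad, -1.0, 1.0)
    return regret, np.linalg.norm(x - x_star)

for n in (100, 1000, 10000):
    reg, dist = run_kw(n)
    print(f"n = {n:6d}   regret = {reg:9.2f}   control distance = {dist:.4f}")
```

Comparing the printed values across the three horizons gives a crude empirical read on the growth rates that Theorem III.4 bounds.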
Condition III.3: There exist positive constants such that, for each $f \in \mathcal{F}$ and each $x \in C$, the stated uniform bound holds.

Theorem III.4: Under Conditions III.1 and III.3, the bounds (30) and (31) hold.

Proof: By Lemma 4.1 of [6], Conditions III.1 and III.3 imply that for all $f \in \mathcal{F}$ and all moments of interest there exist a positive integer $n_0$ and a real number such that $n \ge n_0$ implies (32).
Fabian's proof of this lemma makes clear that the values of $n_0$ and the accompanying constant depend on the choice of $f$ only through the constants appearing in Conditions III.1 and III.3. Since we assume that these constants are valid for all $f \in \mathcal{F}$, we can assume that for every moment order there exist an $n_0$ and a constant such that (32) holds for all $f \in \mathcal{F}$. Suppose that we now index the distance measure by the dimensions of the control space and define a similar expression for each coordinate. As the bound holds for all such versions of the distance measure, this proves (30).

Next, we turn to (31). The following is referred to as Chung's lemma; we state a special case of the version of this lemma that was proved in [6] as Lemma 4.2.

Lemma III.5: Let the indicated real numbers be given, subject to the stated restrictions. If the displayed recursion holds for sufficiently large $n$, where the correction terms are nonpositive, then the stated conclusion follows.

Choose the constants in the lemma so that the required inequalities hold; then, for sufficiently large $n$, the desired recursion is satisfied. Letting the appropriate sequence play the role of the iterates in Chung's lemma shows that there exists a constant such that the bound needed for (31) holds. This proves (31).

The K-W algorithm is Markovian in that at each stage it relies only on the result of the most recent observations to select the next control. Theorem III.4 indicates, however, that K-W achieves the rates implied by Theorem II.9 in the case considered here, even though the lower bounds are valid for any allocation rule, which may in general use the entire history of observations to select the next control. While this result may be somewhat surprising in this regard, note that the implied optimality of K-W is only rate-optimality; the constant factor associated with the regret growth of K-W need not match the constant of Theorem II.9(a). We must conclude that the K-W regret growth and convergence rates, as well as the bounds on these rates, are sharp up to a constant factor.

B. General Achievement of Bounds
The previous section demonstrates that the bounds established in Section II-C are tight in the case where both Conditions II.1 and III.1 hold for the function class. However, the tightness of the bounds can be demonstrated for a much wider range of function classes than those which lie in the intersection of these conditions. Cope [5] presented a class of allocation rules that nearly achieve the optimal rates for one-dimensional domains, in the sense that for any prescribed margin there exists a rule such that (33) holds. This result requires that the function class be comprised of unimodal functions for which there exists a constant, independent of the problem instance, such
that the following implications hold.
Analyses are also presented which indicate that these results can be extended to the multidimensional case as well. Both the above conditions and Condition III.1 require that the functions in $\mathcal{F}$ be unimodal.
IV. CONCLUSION

The results of this paper establish two distinct sets of bounds for continuum-armed bandits: finite-time lower bounds on the regret of any allocation rule, and asymptotic upper bounds on the convergence rate of the controls to the optimum of the unknown reward function. The rate of the latter bounds was shown to be functionally dependent on the degree of curvature of functions at their point of maximum (as measured by the minimal degree of a bounding polynomial). In the case of the regret, we have provided not only the optimal growth rate, but also determined a constant factor for the growth rate. It is important to keep in mind, however, that all the bound results are dependent on the assumed function class from which problem instances may be drawn; in this paper, we have selected a function class which imposes general smoothness and approximability assumptions on the functions that are likely to hold in practical situations. We may therefore compare the results of this paper to those of Kleinberg [10], who assumes that problem instances may be drawn from a wider class of functions, and correspondingly obtains a lower bound for regret that grows more quickly than the one we derive here.

We have additionally shown that the regret growth and convergence rate bounds are asymptotically attainable (at least up to a constant factor of the rates obtained here) by the well-known Kiefer-Wolfowitz method of stochastic approximation. This is in some ways surprising, as the bounds are meant to hold for any allocation rule, which in general may depend on the entire history of past selections and rewards, and yet they can be met by a Markovian algorithm that selects controls only according to the most recent observations. The results also indicate that the rate of convergence of the Kiefer-Wolfowitz design sequence, which is explicitly controlled by the parameter settings of the algorithm, is the optimal rate, up to a constant factor. This fact may also be used to specify the design of other sequential search methods in continuous spaces. It also shows that the bounds obtained are optimal up to a constant factor.

APPENDIX
PROOF OF LEMMA II.8

(a) Demonstration of these equalities follows a standard analysis; see, e.g., [7], Lemma 9.2.1 and the proof of Theorem 9.2.4.

(b) The proof proceeds by induction. Consider first the initial time period. Under $\phi$, the first control is chosen according to some probability measure on the Borel sets of $C$ that does not depend on the problem instance. Then, for any event, the positivity conditions of Condition II.7 allow the corresponding density to be written in the required form, which establishes (14) for the first period. Now suppose (14) holds for some $t$, so that the claimed representation is valid for any event in $\mathcal{H}_t$. The conditional probability measure of the next control given $\mathcal{H}_t$ is determined by $\phi$; in addition, the conditional density of the next reward given the control and $f$ is $q(\cdot \mid x)$. We may then write the density of the extended history as the product of these factors, which establishes the claim.

(c) Let $f$ be an $(\alpha, p)$-approximated function. From Condition II.1.3), the range of values of $f$ is restricted to some bounded interval, and the range of any function $f_\delta$ satisfying (10) must be contained in a correspondingly enlarged interval, where the enlargement is determined by the maximum distance of any point in $C$ to the maximum of $f$.
The relevant expectation, as a function of its arguments, is finite and continuous over the compact region indicated, and thus achieves its maximum over this domain. Therefore, for any $f_\delta$ satisfying (10), the quantity appearing in (15) must be bounded for all $x \in C$.

(d) Suppose that the conditions stated in the lemma are satisfied for a given $f$, $t$, and allocation rule. We shall assume for convenience, and with no loss of generality, that the indicated normalizations hold. Define the quantity in (34), where the constants are as in Definition II.2, the bound is as given in (15), and the remaining constant is given in the condition of the lemma. According to the condition that $f$ is $(\alpha, p)$-approximated, for any admissible $\delta$ there is an $f_\delta$ that satisfies the conditions stated in the lemma. Let the two probability measures be the restrictions of $P_{\phi, f, Q}$ and $P_{\phi, f_\delta, Q}$ to $\mathcal{H}_t$, respectively. By Lemma II.8(b), the density of the second with respect to the first is given by the likelihood ratio in (14).
We can similarly derive a companion bound and, using (34) and the Markov inequality, we can derive a further inequality. This implies that (35) holds. Let the quantities be defined as before, and note that, since the relevant bound holds for all points of the domain, a corresponding inequality follows, implying that for some positive sequence, (36) holds for each index, with the constants as indicated. Let an arbitrary event in $\mathcal{H}_t$ be given, and let the associated event and density be denoted accordingly; from (36) we have a bound on its probability under one measure in terms of the other, and so we can write the corresponding inequality, which implies the required relation. By Lemma II.8(a), we may derive a further identity. Applying the same argument to the remaining quantities, we obtain the corresponding bound, so we may write (37). This establishes that the total variation distance between the two restricted measures is less than the stated bound.
REFERENCES
[1] R. Agrawal, "The continuum-armed bandit problem," SIAM J. Control Optim., vol. 33, pp. 1926–1951, 1995.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, pp. 235–256, 2002.
[3] A. Burnetas and M. Katehakis, "Optimal adaptive policies for sequential allocation problems," Adv. Appl. Math., vol. 17, pp. 122–142, 1996.
[4] H. Chen, "Lower rate of convergence for locating the maximum of a function," Ann. Statist., vol. 16, pp. 1330–1334, 1988.
[5] E. W. Cope, "Stochastic Search for Optimum Under a Control Objective," Ph.D. dissertation, Dept. Mgmt. Sci. and Eng., Stanford Univ., Stanford, CA, 2004.
[6] V. Fabian, "Stochastic approximation of minima with improved asymptotic speed," Ann. Math. Statist., vol. 38, pp. 191–200, 1967.
[7] V. Fabian and J. Hannan, Introduction to Probability and Mathematical Statistics. New York: Wiley, 1985.
[8] L. Gerencsér, "Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting," IEEE Trans. Automat. Control, vol. 44, no. 5, pp. 894–905, May 1999.
[9] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Ann. Math. Statist., vol. 23, pp. 462–466, 1952.
[10] R. Kleinberg, "Nearly tight bounds for the continuum-armed bandit problem," in Proc. Adv. Neural Information Processing Syst. (NIPS) 17, Vancouver, BC, Canada, 2004, pp. 697–704.
[11] V. Kulkarni and G. Lugosi, "Finite-time lower bounds for the two-armed bandit problem," IEEE Trans. Automat. Control, vol. 45, no. 4, pp. 711–714, Apr. 2000.
[12] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications. New York: Springer, 1997.
[13] T. L. Lai and H. Robbins, "Adaptive design and stochastic approximation," Ann. Statist., vol. 7, pp. 1196–1221, 1979.
[14] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Adv. Appl. Math., vol. 6, pp. 4–22, 1985.
[15] T. L. Lai and S. Yakowitz, "Machine learning and nonparametric bandit theory," IEEE Trans. Automat. Control, vol. 40, no. 7, pp. 1199–1209, Jul. 1995.
[16] S. Mannor and J. N. Tsitsiklis, "The sample complexity of exploration in the multi-armed bandit problem," J. Machine Learning Res., vol. 5, pp. 623–648, 2004.
[17] B. T. Polyak and A. B. Tsybakov, "Optimal order of accuracy of search algorithms in stochastic optimization," Problems Inform. Transmission, vol. 26, pp. 126–133, 1990.
[18] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, pp. 400–407, 1951.
[19] J. C. Spall and J. A. Cristion, "Model-free control of nonlinear stochastic systems with discrete-time measurements," IEEE Trans. Automat. Control, vol. 43, no. 9, pp. 1198–1210, Sep. 1998.
[20] S. Yakowitz and W. Lowe, "Nonparametric bandit methods," Ann. Oper. Res., vol. 28, pp. 297–312, 1991.
Eric W. Cope received the Ph.D. degree in management science and engineering from Stanford University, Stanford, CA, in 2004. He is a Research Staff Member at the IBM Zurich Research Lab, Rüschlikon, Switzerland, working in the Information Analytics Group. Prior to joining IBM, he was an Assistant Professor in Operations and Logistics at the Sauder School of Business, University of British Columbia, Vancouver, BC, Canada.