Advances in Applied Mathematics 17, 122-142 (1996), Article No. 0007
Optimal Adaptive Policies for Sequential Allocation Problems

Apostolos N. Burnetas, Department of Operations Research, Case Western Reserve University, Cleveland, Ohio 44106-7235

and Michael N. Katehakis, GSM and RUTCOR, Rutgers University, Newark, New Jersey 07102-1895

Received March 10, 1995
Consider the problem of sequential sampling from $m$ statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters $\theta \in \Theta$, it is shown that there exists a class $C_R$ of adaptive policies with the following properties. (i) The expected $n$-horizon reward $V_n^{\pi_0}(\theta)$ under any policy $\pi_0$ in $C_R$ is equal to $n\mu^*(\theta) - M(\theta)\log n + o(\log n)$, as $n \to \infty$, where $\mu^*(\theta)$ is the largest population mean and $M(\theta)$ is a constant. (ii) Policies in $C_R$ are asymptotically optimal within a larger class $C_{UF}$ of "uniformly fast convergent" policies, in the sense that $\lim_{n\to\infty} (n\mu^*(\theta) - V_n^{\pi_0}(\theta))/(n\mu^*(\theta) - V_n^{\pi}(\theta)) \le 1$, for any $\pi \in C_{UF}$ and any $\theta \in \Theta$ such that $M(\theta) > 0$. Policies in $C_R$ are specified via easily computable indices, defined as unique solutions to dual problems that arise naturally from the functional form of $M(\theta)$. In addition, the assumptions are verified for populations specified by nonparametric discrete univariate distributions with finite support. In the case of normal populations with unknown means and variances, we leave as an open problem the verification of one assumption. (c) 1996 Academic Press, Inc.
0196-8858/96 $18.00. Copyright (c) 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

1. INTRODUCTION

Consider, for any $a = 1, \ldots, m$, the i.i.d. random variables $Y_a$, $Y_{aj}$, $j = 1, 2, \ldots$, with univariate density function $f_a(y, \theta_a)$ with respect to some known measure $\nu_a$, where $\theta_a$ is a vector of unknown parameters $(\theta_{a1}, \ldots, \theta_{a k_a})$. For each $a$, $k_a$ is known and the vector $\theta_a$ belongs to some
known set $\Theta_a$ that in general depends on $a$ and is a subset of $\mathbb{R}^{k_a}$. The functional form of $f_a(\cdot, \cdot)$ is known and allowed to depend on $a$. The information specified by $(\Theta_a, f_a(\cdot, \cdot), \nu_a)$ is said in the literature to define population $a = 1, \ldots, m$. The practical interpretation of the model is that $Y_{aj}$ represents a reward received the $j$th time population $a$ is sampled. The objective is to determine an adaptive rule for sampling from the $m$ populations so as to maximize the sum of realized rewards $S_n = X_0 + X_1 + \cdots + X_{n-1}$ as $n \to \infty$, where $X_t$ is $Y_{ak}$ if at time $t$ population $a$ is sampled for the $k$th time.

In their paper [25] on this problem, Lai and Robbins give a method for constructing adaptive allocation policies that converge fast to an optimal one under complete information and possess the remarkable property that their finite-horizon expected loss due to ignorance ("regret") attains, asymptotically, a minimum value. The analysis was based on a theorem (Theorem 1) establishing the existence of an asymptotic lower bound for the regret of any policy in a certain class of candidate policies; see UF policies below. The knowledge of the functional form of this lower bound was used to construct, via suitably defined "upper confidence bounds" for the sample means of each population, adaptive allocation policies that attain it. The assumptions that they made for the partial information model restricted the applicability of the method to the case in which each population is specified by a density that depends on a single unknown parameter, as is the case of a single-parameter exponential family.

The contributions of this paper are the following. (a) It is shown that Theorem 1 holds, under no parametric assumptions, for a suitable unique extension of the coefficient in the lower bound; see Theorem 1(1), below. (b)
We give the explicit form of a new set of indices that are defined as the unique solutions to dual problems that arise naturally from the definition of the (new) lower bound. (c) We give sufficient conditions under which the adaptive allocation policies that are defined by these indices possess the optimality properties of Theorem 1(2), below. (d) We show that the sufficient conditions hold for an arbitrary nonparametric, discrete, univariate distribution. (e) We discuss the problem of normal populations with unknown variance, where we leave as an open problem the verification of one sufficient condition.

We first discovered the form of the indices used in this paper when we employed the dynamic programming approach to study a Bayes version of this problem [6, 7]. The ideas involved in the present paper are a natural extension of [25]; they are essentially a simplification of work in [8] on dynamic programming. Our work is related to that of [33], which obtained adaptive policies with regret of order $O(\log n)$, as in our Theorem 1, for general nonparametric
models, under appropriate assumptions on the rate of convergence of the estimates. Starting with [31, 3], the literature on versions of this problem is large; see [24, 26, 22, 4, 15, 17-20] for work on the so-called multiarmed bandit problem and [16, 29, 30, 1, 12, 27, 14, 5, 2, 28] for more general dynamic programming extensions. For a survey see also [18].
2. THE PARTIAL INFORMATION MODEL

The statistical framework used below is as follows. For any population $a$ let $(\mathcal{Y}_a^{(n)}, \mathcal{B}_a^{(n)})$ denote the sample space of a sample of size $n$: $(Y_{a1}, \ldots, Y_{an})$, $1 \le n \le \infty$. For each $\theta_a \in \Theta_a$, let $P_{\theta_a}$ be the probability measure on $\mathcal{B}_a^{(1)}$ generated by $f_a(y, \theta_a)$ and $\nu_a$, and $P_{\theta_a}^{(n)}$ the measure on $\mathcal{B}_a^{(n)}$ generated by $n$ independent replications of $Y_a$. In what follows, $P_{\theta_a}^{(n)}$ will often be abbreviated as $P_{\theta_a}$. The joint (product) sample space for the $m$ populations will be denoted by $(\mathcal{Y}^{(n)}, \mathcal{B}^{(n)})$, and the probability measure on $\mathcal{B}^{(n)}$ will be denoted by $P_\theta^{(n)}$, abbreviated as $P_\theta$, where $\theta = (\theta_1, \ldots, \theta_m) \in \Theta := \Theta_1 \times \cdots \times \Theta_m$.

2.1. Sample Paths, Adaptive Policies, and Statistics

Let $A_t, X_t$, $t = 0, 1, \ldots$, denote respectively the action taken (i.e., population sampled) and the outcome observed at period $t$. A history or sample path at time $n$ is any feasible sequence of actions and observations during the first $n$ time periods, i.e., $\omega_n = (a_0, x_0, \ldots, a_{n-1}, x_{n-1})$. Let $(\Omega^{(n)}, \mathcal{F}^{(n)})$, $1 \le n \le \infty$, denote the sample space of the histories $\omega_n$, where $\Omega^{(n)}$ is the set of all histories $\omega_n$ and $\mathcal{F}^{(n)}$ the $\sigma$-field generated by $\Omega^{(n)}$. Events will be defined on $\mathcal{F}^{(n)}$ or on $\mathcal{B}_a^{(n)}$ and will be denoted by capital letters. The complement of an event $B$ will be denoted by $\bar B$.

A policy $\pi$ represents a generally randomized rule for selecting actions (populations) based on the observed history; i.e., $\pi$ is a sequence $\{\pi_0, \pi_1, \ldots\}$ of history-dependent probability measures on the set of populations $\{1, \ldots, m\}$, so that $\pi_t(a) = \pi_t(a, \omega_t)$ is the probability that policy $\pi$ selects population $a$ at time $t$ when the observed history is $\omega_t$. Any policy $\pi$ generates a probability measure on $\mathcal{F}^{(n)}$ that will be denoted by $P_\theta^\pi$ (cf. [10, p. 47]). Let $C$ denote the set of all policies. Expectation under a policy $\pi \in C$ will be denoted by $E_\theta^\pi$.
For notational convenience we may use $\pi_t$ to denote also the action selected by a policy $\pi$ at time $t$. Given the history $\omega_n$, let $T_n(a)$ denote the number of times population $a$ has been sampled, $T_n(a) := \sum_{t=0}^{n-1} 1\{\pi_t = a\}$. Finally, assume that there are estimators $\hat\theta_a^{T_n(a)} = g_a(Y_{a1}, \ldots, Y_{a T_n(a)}) \in \Theta_a$ for $\theta_a$. The initial estimates $\hat\theta_a^0$
are arbitrary, unless otherwise specified. Properties of the estimators are given by conditions (A2) and (A3) below.

Remark 1. Note the distinction between the policy-dependent $(\Omega^{(n)}, \mathcal{F}^{(n)}, P_\theta^\pi)$ and policy-independent $(\mathcal{Y}_a^{(n)}, \mathcal{B}_a^{(n)}, P_{\theta_a}^{(n)})$ probability spaces; see also [33]. However, since $\hat\theta_a^j$ is a function of $Y_{a1}, \ldots, Y_{aj}$ only, it is easy to see by conditioning that the following type of relations hold, for any sequence of subsets $F_{naj}$ of $\Theta_a$, $n, j \ge 1$:

$$P_\theta^\pi\left(\hat\theta_a^{T_n(a)} \in F_{n a T_n(a)},\; T_n(a) = j\right) \le P_{\theta_a}\left(\hat\theta_a^j \in F_{naj}\right), \tag{2.1}$$

$$P_\theta^\pi\left(\hat\theta_a^{T_n(a)} \in F_{n a T_n(a)}\right) \le P_{\theta_a}\left(\hat\theta_a^j \in F_{naj} \text{ for some } j \le n\right). \tag{2.2}$$
2.2. Unobservable Quantities

We next list notation regarding the unobservable quantities: the population means $\mu_a$, the Kullback-Leibler information number $I(\theta_a, \theta_a')$, the set $O(\theta)$ of "optimal" populations for any parameter value $\theta$, the subset $\Delta\Theta_a(\theta_a)$ of the parameter space $\Theta_a$ that consists of all parameter values for which population $a$ is uniquely optimal (henceforth called critical), the minimum discrimination information $K_a(\theta)$ for the hypothesis that population $a$ is critical, the analogous quantities $\Delta\Theta_a(\theta_a, \varepsilon)$ and $J_a(\theta_a, \theta; \varepsilon)$ for $\mu^*(\theta) - \varepsilon$, for any $\varepsilon > 0$, the set $B(\theta)$ of all critical populations, and the constant $M(\theta)$, as follows:

(1) (a) $\mu_a(\theta_a) := E_{\theta_a} Y_a$,
(b) $I(\theta_a, \theta_a') := E_{\theta_a} \log\left(f_a(Y_a; \theta_a)/f_a(Y_a; \theta_a')\right)$,

(2) (a) $\mu^* = \mu^*(\theta) := \max_{a=1,\ldots,m} \{\mu_a(\theta_a)\}$,
(b) $O(\theta) := \{a : \mu_a = \mu^*(\theta)\}$,

(3) (a) $\Delta\Theta_a(\theta_a, \varepsilon) := \{\theta_a' \in \Theta_a : \mu_a(\theta_a') > \mu^*(\theta) - \varepsilon\}$, for $\varepsilon > 0$,
(b) $w_a(\theta_a, z) := \inf\{I(\theta_a, \theta_a') : \mu_a(\theta_a') > z\}$, for $-\infty \le z \le \infty$,
(c) $J_a(\theta_a, \theta; \varepsilon) := w_a(\theta_a, \mu^*(\theta) - \varepsilon) = \inf\{I(\theta_a, \theta_a') : \theta_a' \in \Delta\Theta_a(\theta_a, \varepsilon)\}$, for $\varepsilon \ge 0$,

(4) (a) $\Delta\Theta_a(\theta_a) := \Delta\Theta_a(\theta_a, 0) = \{\theta_a' \in \Theta_a : \mu_a(\theta_a') > \mu^*(\theta)\}$,
(b) $B(\theta) := \{a : a \notin O(\theta) \text{ and } \Delta\Theta_a(\theta_a) \ne \emptyset\}$,

(5) (a) $K_a(\theta) := J_a(\theta_a, \theta; 0) = \inf\{I(\theta_a, \theta_a') : \theta_a' \in \Delta\Theta_a(\theta_a)\}$, for $a \in B(\theta)$,
(b) $M(\theta) := \sum_{a \in B(\theta)} (\mu^*(\theta) - \mu_a(\theta_a))/K_a(\theta)$.

In the definition of $M(\theta)$ we have used the fact that $K_a(\theta) \in (0, \infty)$, $\forall a \in B(\theta) \subseteq \bar O(\theta)$, which is a consequence of the fact that $I(\theta_a, \theta_a') = 0$ only when $\theta_a = \theta_a'$.
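As a concrete illustration (not part of the paper's formal development), the quantities $I(\theta_a, \theta_a')$, $K_a(\theta)$, and $M(\theta)$ take a simple form for Bernoulli populations with $\Theta_a = (0, 1)$: there $\mu_a(\theta_a) = \theta_a$, and, when $\mu^* < 1$, the infimum defining $K_a(\theta)$ is approached at $\theta_a' = \mu^*$, so $K_a(\theta) = I(\theta_a, \mu^*)$. A minimal sketch in Python, under the assumption that all populations are Bernoulli:

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler information number I(p, q) for Bernoulli parameters."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def M_constant(theta):
    """M(theta) = sum over critical populations a of (mu* - mu_a)/K_a(theta).
    For Bernoulli arms on (0, 1) with mu* < 1, K_a(theta) = I(theta_a, mu*)."""
    mu_star = max(theta)
    total = 0.0
    for theta_a in theta:
        if theta_a < mu_star:  # a is not in O(theta)
            total += (mu_star - theta_a) / kl_bernoulli(theta_a, mu_star)
    return total
```

For example, with $\theta = (0.5, 0.3)$ one gets $K_2(\theta) = I(0.3, 0.5) \approx 0.0823$ and $M(\theta) \approx 2.43$, so the optimal regret in Theorem 1 grows like $2.43 \log n$.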
Under the assumptions made in [25] the constant $K_a(\theta)$ reduces to $I(\theta_a, \theta^*)$, thus giving the form for $M(\theta)$ used in that paper.

2.3. Optimality Criteria

Let $V_n^\pi(\theta) = E_\theta^\pi \sum_{t=0}^{n-1} X_t$ and $V_n(\theta) = n\mu^*(\theta)$ denote respectively the expected total reward during the first $n$ periods under policy $\pi$ and the optimal $n$-horizon reward, which would be achieved if the true value $\theta$ were known to the experimenter. Let $R_n^\pi(\theta) = V_n(\theta) - V_n^\pi(\theta)$ represent the loss, or regret, due to partial information, when policy $\pi$ is used; maximization of $V_n^\pi(\theta)$ with respect to $\pi$ is equivalent to minimization of $R_n^\pi(\theta)$. In general it is not possible to find an adaptive policy that minimizes $R_n^\pi(\theta)$ uniformly in $\theta$. However, if we let $g^\pi(\theta) = \lim_{n\to\infty} V_n^\pi(\theta)/n$, then $\lim_{n\to\infty}(V_n(\theta) - V_n^\pi(\theta))/n = \lim_{n\to\infty} R_n^\pi(\theta)/n = \mu^*(\theta) - g^\pi(\theta) \ge 0$, $\forall\theta$.

A policy $\pi$ will be called uniformly convergent (UC) or uniformly fast convergent (UF) if, $\forall\theta \in \Theta$ as $n \to \infty$, $R_n^\pi(\theta) = o(n)$ (for UC) or $R_n^\pi(\theta) = o(n^\alpha)$, $\forall\alpha > 0$ (for UF). A UF policy $\pi_0$ will be called of uniformly maximal convergence rate (UM) if $\limsup_{n\to\infty} R_n^{\pi_0}(\theta)/R_n^\pi(\theta) \le 1$, $\forall\theta \in \Theta$ such that $M(\theta) > 0$, for all UF policies $\pi$. Note that according to this definition a UM policy has maximum rate of convergence only for those values of the parameter space for which $M(\theta) > 0$; when $M(\theta) = 0$ it is UF. Let $C_{UC} \supset C_{UF} \supset C_{UM}$ denote the classes of UC, UF, and UM policies, respectively.
3. THE MAIN THEOREM

We start by giving the explicit form of the indices $U_a(\omega_k)$ which define a class of adaptive policies that will be shown to be UM under conditions (A1)-(A3) below. For $a = 1, \ldots, m$, $\theta_a \in \Theta_a$, and $0 \le \gamma \le \infty$, let

$$u_a(\theta_a, \gamma) = \sup_{\theta_a' \in \Theta_a} \{\mu_a(\theta_a') : I(\theta_a, \theta_a') < \gamma\}. \tag{3.1}$$

Given the history $\omega_k$ and the statistics $T_k(a)$, $\hat\theta_a = \hat\theta_a^{T_k(a)}$ for $\theta_a$, define the index $U_a(\omega_k)$ as

$$U_a(\omega_k) = u_a\left(\hat\theta_a, \log k / T_k(a)\right), \tag{3.2}$$

for $k \ge 1$; $U_a(\omega_0) = \mu_a(\hat\theta_a^0)$. We assume, throughout, that, when $j = 0$ in a ratio of the form $\log k / j$, the latter is equal to $\infty$.
Note also that $U_a(\omega_k)$ is a function of $k$, $T_k(a)$, and $\hat\theta_a^{T_k(a)}$ only, and that we allow $T_k(a) = 0$ in (3.2), in which case $U_a(\omega_k) = \sup_{\theta_a' \in \Theta_a}\{\mu_a(\theta_a')\}$; in applications this will be equivalent to taking some small number of samples from each population to begin with.

Remark 2. (a) For all $a$ and $\omega_k$, $U_a(\omega_k) \ge \mu_a(\hat\theta_a)$, i.e., the indices are inflations of the current estimates for the means. In addition, $U_a(\omega_k)$ is increasing in $k$ and decreasing in $T_k(a)$, thus giving a higher chance for the "undersampled" actions to be selected. In the case of a one-dimensional parameter vector, they yield the same value as those in [25, 23].

(b) The analysis remains the same if in the definition of $U_a(\omega_k)$ we replace $\log k / j$ by a function of the form $(\log k + h(\log k))/j$, where $h(t)$ is any function of $t$ with $h(t) = o(t)$ as $t \to \infty$. Up to this equivalence, the index $U_a(\omega_k)$ is uniquely defined.

(c) We note that $u_a(\theta_a, \gamma)$ and $w_a(\theta_a, z)$ are connected by the following duality relation. The condition $u_a(\theta_a, \gamma) > z$ implies $w_a(\theta_a, z) \le \gamma$. In addition, when for $\gamma = \gamma_0$ the supremum in $u_a(\theta_a, \gamma_0)$ is attained at some $\theta_a^0 = \theta_a^0(\gamma_0) \in \Theta_a$ such that $I(\theta_a, \theta_a^0) = \gamma_0$ (as is the case, for example, when $\mu_a(\theta_a')$ is a linear function of $\theta_a'$), $\theta_a^0$ also attains the infimum in $w_a(\theta_a, z_0)$ for $z_0 = u_a(\theta_a, \gamma_0)$, i.e., $u_a(\theta_a, \gamma_0) = \mu_a(\theta_a^0) = z_0$, and $w_a(\theta_a, z_0) = I(\theta_a, \theta_a^0) = \gamma_0$. This type of duality is well known in finance [32, p. 113].

(d) For $z \in \mathbb{R}$, let $W_a(\omega_k, z) = w_a(\hat\theta_a^{T_k(a)}, z)$. It follows from (c) above that, $\forall\omega_k$, the condition $U_a(\omega_k) > z$ implies the condition $W_a(\omega_k, z) \le \log k / T_k(a)$. Furthermore, when the supremum in $u_a(\hat\theta_a, \log k/T_k(a))$ is attained at some $\theta_a^0 = \theta_a^0(\omega_k) \in \Theta_a$ such that $I(\hat\theta_a, \theta_a^0) = \log k / T_k(a)$,

$$U_a(\omega_k) = \mu_a(\theta_a^0), \tag{3.3}$$

$$W_a\left(\omega_k, \mu_a(\theta_a^0)\right) = I(\hat\theta_a, \theta_a^0) = \log k / T_k(a) = J_a\left(\hat\theta_a, \theta; \mu^*(\theta) - \mu_a(\theta_a^0)\right). \tag{3.4}$$
The conditions given below are sufficient conditions for Theorem 1(2).

Condition A1. $\forall\theta \in \Theta$ and $\forall a \notin O(\theta)$ such that $\Delta\Theta_a(\theta_a, 0) = \emptyset$ and $\Delta\Theta_a(\theta_a, \varepsilon) \ne \emptyset$, $\forall\varepsilon > 0$, the following relation holds: $\lim_{\varepsilon\to 0} J_a(\theta_a, \theta; \varepsilon) = \infty$.

Condition A2. $P_{\theta_a}(\|\hat\theta_a^k - \theta_a\| > \varepsilon) = o(1/k)$, as $k \to \infty$, $\forall\varepsilon > 0$, $\forall\theta_a \in \Theta_a$, $\forall a$.

Condition A3. $P_{\theta_a}(u_a(\hat\theta_a^j, \log k/j) < \mu_a(\theta_a) - \varepsilon$, for some $j \le k) = o(1/k)$, as $k \to \infty$, $\forall\varepsilon > 0$, $\forall\theta_a \in \Theta_a$, $\forall a$.
Remark 3. To see the significance of condition (A1), consider the following examples.

EXAMPLE 1. Take $m = 2$, $\Theta_1 = [0, 1]$, $\Theta_2 = [0, 0.5]$, $\mu^*(\theta) = \mu_1(\theta_1) = 0.5 > \mu_2(\theta_2)$, $f_a(y; \theta_a) = \theta_a^x (1 - \theta_a)^{(1-x)}$, $x = 0, 1$.

EXAMPLE 2. Take $\Theta_2 = [0, 0.51]$ in Example 1.

EXAMPLE 3. Take $\Theta_1 = \Theta_2 = [0, 1]$, $\mu^*(\theta) = \mu_1(\theta_1) = 1 > \mu_2(\theta_2)$, in Example 1.

Situations such as in Example 1 are excluded, while Examples 2 and 3 satisfy (A1).

Remark 4. (a) Note that (A2) is a condition on the rate of convergence of $\hat\theta_a^k$ to $\theta_a$, and it holds in the usual case that $\hat\theta_a^k$ is either equal to, or follows the same distribution as, the mean of i.i.d. random variables $Z_j$ with finite moment generating function in a neighborhood around zero. In this case (A2) can be verified using large deviation arguments. This implies that $\sum_{k=1}^{n-1} P_{\theta_a}(\|\hat\theta_a^k - \theta_a\| > \varepsilon) = o(\log n)$, as $n \to \infty$.

(b) From the continuity of $I(\theta_a, \theta_a')$ and, hence, of $J_a(\theta_a, \theta; \varepsilon)$ in $\theta_a$, it follows that the event $\{J_a(\hat\theta_a^k, \theta; \varepsilon) < J_a(\theta_a, \theta; \varepsilon) - \delta\}$ is contained in the event $\{\|\hat\theta_a^k - \theta_a\| > \eta\}$, for some $\eta = \eta(\delta) > 0$. Thus, condition (A2) implies $P_{\theta_a}[J_a(\hat\theta_a^k, \theta; \varepsilon) < J_a(\theta_a, \theta; \varepsilon) - \delta] = o(1/k)$, as $k \to \infty$. The latter can be written in the form below, as required for the proof of Proposition 2(a): $\forall\delta > 0$,

$$\sum_{k=1}^{n-1} P_{\theta_a}\left(J_a(\hat\theta_a^k, \theta; \varepsilon) < J_a(\theta_a, \theta; \varepsilon) - \delta\right) = o(\log n) \quad \text{as } n \to \infty.$$

Remark 5. Condition (A3) can be written as $\sum_{k=1}^{n-1} P_{\theta_a}[u_a(\hat\theta_a^j, \log k/j) < \mu_a(\theta_a) - \varepsilon$, for some $j \le k] = o(\log n)$, as $n \to \infty$. It is used in this form for the proof of Proposition 2(b).
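The role of (A1) in Examples 1-3 above can be checked numerically: in Example 1 the quantity $J_2(\theta_2, \theta; \varepsilon)$ stays bounded as $\varepsilon \to 0$ (it tends to $I(\theta_2, 0.5) < \infty$), while in Example 3 it blows up. A sketch using the Bernoulli Kullback-Leibler number, assuming $\theta_2 = 0.3$ for concreteness (the choice of $\theta_2$ is ours, purely for illustration):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-15
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def J2(theta2, mu_star, theta2_sup, eps):
    """J_2(theta_2, theta; eps) = inf{ I(theta_2, t) : t in Theta_2, t > mu* - eps }.
    Since I(theta_2, .) increases to the right of theta_2, the infimum is
    approached at t = min(mu* - eps, sup Theta_2) (clipped below at theta_2)."""
    t = min(max(mu_star - eps, theta2), theta2_sup)
    return kl_bernoulli(theta2, t)

# Example 1: Theta_2 = [0, 0.5], mu* = 0.5  ->  J2 stays bounded as eps -> 0
# Example 3: Theta_2 = [0, 1],   mu* = 1    ->  J2 -> infinity as eps -> 0
```

This makes the exclusion of Example 1 visible: there the lower-bound coefficient cannot distinguish "nearly optimal" from "critical" values, and (A1) fails.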
Let $C_R$ denote the class of policies which in every period select any action with the largest index value $U_a(\omega_k) = u_a(\hat\theta_a, \log k/T_k(a))$. We can now state the following theorem.

THEOREM 1. (1) For any $\theta \in \Theta$ and $a \in B(\theta)$ such that $K_a(\theta) \ne 0$, the following is true, $\forall\pi \in C_{UF}$:

$$\liminf_{n\to\infty} E_\theta^\pi T_n(a)/\log n \ge 1/K_a(\theta).$$

(2) If conditions (A1), (A2), and (A3) hold and $\pi_0 \in C_R$, then, $\forall\theta \in \Theta$:

(a) $$\limsup_{n\to\infty} E_\theta^{\pi_0} T_n(a)/\log n \le 1/K_a(\theta), \quad \text{if } a \in B(\theta),$$
$$\lim_{n\to\infty} E_\theta^{\pi_0} T_n(a)/\log n = 0, \quad \text{if } a \notin B(\theta),$$

(b) $0 \le R_n^{\pi_0}(\theta) = M(\theta)\log n + o(\log n)$, as $n \to \infty$, $\forall\theta \in \Theta$,

(c) $C_R \subseteq C_{UM}$.
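A policy in $C_R$ is straightforward to simulate. The sketch below is an illustration under the assumption of Bernoulli populations (the paper itself treats general finite-support distributions in Section 4); at each period $k$ it plays an action maximizing $U_a(\omega_k) = u_a(\hat\theta_a, \log k / T_k(a))$, with $\log k / 0 := \infty$:

```python
import math
import random

def kl_bernoulli(p, q):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def index_value(theta_hat, gamma):
    # Bernoulli case of (3.1): sup{ q : I(theta_hat, q) < gamma }, by bisection
    lo, hi = theta_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(theta_hat, mid) < gamma:
            lo = mid
        else:
            hi = mid
    return lo

def run_cr_policy(theta, horizon, rng):
    """Simulate a C_R policy on Bernoulli arms with true means `theta`."""
    m = len(theta)
    counts, sums, reward = [0] * m, [0.0] * m, 0.0
    for k in range(horizon):
        indices = []
        for a in range(m):
            if counts[a] == 0:
                indices.append(1.0)  # log k / 0 = infinity: sup of the mean
            else:
                gamma = math.log(max(k, 1)) / counts[a]
                indices.append(index_value(sums[a] / counts[a], gamma))
        a = max(range(m), key=indices.__getitem__)  # largest index wins
        x = 1.0 if rng.random() < theta[a] else 0.0
        counts[a] += 1
        sums[a] += x
        reward += x
    return reward, counts
```

With $\theta = (0.8, 0.5)$ and horizon 2000, the suboptimal population receives only a logarithmic number of samples, consistent with Theorem 1(2a).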
Proof. Parts (1) and (2a) are proved in Propositions 1 and 2, respectively. For part (2b), note first that

$$R_n^\pi(\theta) = n\mu^*(\theta) - \sum_{a=1}^m E_\theta^\pi \sum_{t=1}^{T_n(a)} Y_{at} = \sum_{a=1}^m \left(\mu^*(\theta) - \mu_a(\theta_a)\right) E_\theta^\pi T_n(a), \quad \forall\pi \in C.$$

Using the definition of $M(\theta)$ in Subsection 2.2, $\forall\pi_0 \in C_R$, $\forall\theta \in \Theta$,

$$\limsup_{n\to\infty} R_n^{\pi_0}(\theta)/\log n \le M(\theta) \quad \text{(from part (2a))};$$

hence, $C_R \subseteq C_{UF}$. Thus, it follows from part (1) that

$$\liminf_{n\to\infty} R_n^{\pi_0}(\theta)/\log n \ge M(\theta),$$

and the proof is easy to complete, using the above observations. To show the last claim (c), we need only divide both $R_n^{\pi_0}(\theta)$ and $R_n^\pi(\theta)$ by $M(\theta)\log n$, when $M(\theta) > 0$.

Remark 6. (a) It is instructive to compare the maximum expected finite-horizon reward under complete information about $\theta$ (Eq. (3.5)) with the asymptotic expression for the expected finite-horizon reward of a UM policy $\pi_0$ under partial information about $\theta$ (Eq. (3.6)), established by Theorem 1:

$$V_n(\theta) = n\mu^*(\theta), \tag{3.5}$$

$$V_n^{\pi_0}(\theta) = n\mu^*(\theta) - M(\theta)\log n + o(\log n) \quad (\text{as } n \to \infty). \tag{3.6}$$
(b) The results of Theorem 1 can be expressed in terms of the rate of convergence of $V_n^\pi(\theta)/n$ to $\mu^*(\theta)$, as follows. If $\pi \in C_{UC}$ then $\lim_{n\to\infty} V_n^\pi(\theta)/n = \mu^*(\theta)$ for all $\theta$; no claim regarding the rate of convergence can be made. If $\pi \in C_{UF}$ then it is also true that $|V_n^\pi(\theta)/n - \mu^*(\theta)| = o(n^\alpha)/n$ for all $\theta$ (and $\forall\alpha > 0$); therefore $V_n^\pi(\theta)/n$ converges to $\mu^*(\theta)$ at least as fast as $\log n / n$. The UM policy $\pi_0$ is such that, for all $\theta \in \Theta$ with $M(\theta) > 0$, the rate of convergence of $V_n^{\pi_0}(\theta)/n$ to $\mu^*(\theta)$ is the maximum among all policies in $C_{UF} \subset C_{UC} \subset C$, and is equal to $M(\theta)\log n / n$.

(c) For $\theta \in \Theta$ such that $M(\theta) = 0$, it is true that $V_n^{\pi_0}(\theta) = n\mu^*(\theta) + o(\log n)$; therefore $V_n^{\pi_0}(\theta)/n$ converges to $\mu^*(\theta)$ faster than $\log n / n$. However, this does not necessarily represent the fastest rate of convergence.

In the proof of Proposition 1 below, we use the notation $\theta_a' := (\theta_1, \ldots, \theta_{a-1}, \theta_a', \theta_{a+1}, \ldots, \theta_m)$, $\forall\theta_a' \in \Delta\Theta_a(\theta_a)$, and the following remark.

Remark 7. (a) The definition of $\Delta\Theta_a(\theta_a)$ implies that if $a \in B(\theta) \ne \emptyset$, then $O(\theta_a') = \{a\}$, $\forall\theta_a' \in \Delta\Theta_a(\theta_a)$, and thus $E_{\theta_a'}^\pi(n - T_n(a)) = o(n^\alpha)$, $\forall\alpha > 0$ and $\forall\theta_a' \in \Delta\Theta_a(\theta_a)$, $\forall\pi \in C_{UF}$, the latter being a consequence of the definition of $C_{UF}$.

(b) Let $Z_t$ be i.i.d. random variables such that $S_n/n = \sum_{t=1}^n Z_t/n$ converges a.s. $(P)$ to a constant $\mu$, let $b_n$ be an increasing sequence of positive constants such that $b_n \to \infty$, and let $\lfloor b_n \rfloor$ denote the integer part of $b_n$. Then $\max_{k \le \lfloor b_n \rfloor}\{S_k\}/b_n$ converges a.s. $(P)$ to $\mu$ and, $\forall\delta > 0$,

$$P\left(\max_{k \le \lfloor b_n \rfloor}\{S_k\}/b_n > \mu + \delta\right) = o(1) \quad (\text{as } n \to \infty).$$

PROPOSITION 1. If $\pi \in C_{UF}$ then, for any $a \in B(\theta) \ne \emptyset$,

$$\liminf_{n\to\infty} E_\theta^\pi T_n(a)/\log n \ge 1/K_a(\theta). \tag{3.7}$$
Proof. The proof is an adaptation of the proof of Theorem 1 in [25] for the constant $K_a(\theta)$. From the Markov inequality it follows that

$$P_\theta^\pi\left(T_n(a)/\log n \ge 1/K_a(\theta)\right) \le E_\theta^\pi T_n(a)\, K_a(\theta)/\log n, \quad \forall n > 1.$$

Thus, to show (3.7), it suffices to show that

$$\lim_{n\to\infty} P_\theta^\pi\left(T_n(a)/\log n \ge 1/K_a(\theta)\right) = 1$$

or, equivalently,

$$\lim_{n\to\infty} P_\theta^\pi\left(T_n(a)/\log n < (1 - \varepsilon)/K_a(\theta)\right) = 0, \quad \forall\varepsilon > 0. \tag{3.8}$$
By the definition of $K_a(\theta)$ we have, $\forall\delta > 0$, $\exists\theta_a' = \theta_a'(\delta) \in \Delta\Theta_a(\theta_a)$ such that $K_a(\theta) < I(\theta_a, \theta_a') < (1 + \delta)K_a(\theta)$. Fix such a $\delta > 0$ and $\theta_a'$, let $I^\delta = I(\theta_a, \theta_a')$, and define the sets $A_n^\delta := \{T_n(a)/\log n < (1 - \delta)/I^\delta\}$ and $C_n^\delta := \{\log L_{T_n(a)} \le (1 - \delta/2)\log n\}$, where $\log L_k = \sum_{i=1}^k \log\left(f_a(Y_{ai}; \theta_a)/f_a(Y_{ai}; \theta_a')\right)$. We will show that $P_\theta^\pi(A_n^\delta) = P_\theta^\pi(A_n^\delta C_n^\delta) + P_\theta^\pi(A_n^\delta \bar C_n^\delta) = o(1)$, as $n \to \infty$, $\forall\delta > 0$. Indeed,

$$P_\theta^\pi(A_n^\delta C_n^\delta) \le n^{1-\delta/2} P_{\theta_a'}^\pi(A_n^\delta C_n^\delta) \le n^{1-\delta/2} P_{\theta_a'}^\pi(A_n^\delta) \le n^{1-\delta/2} E_{\theta_a'}^\pi\left(n - T_n(a)\right)/\left(n - (1-\delta)\log n/I^\delta\right) = o(n^\alpha)/\left(n^{\delta/2}(1 - O(\log n)/n)\right) = o(1),$$

for $\alpha < \delta/2$. The first inequality follows from the observation that on $C_n^\delta \cap \{T_n(a) = k\}$ we have $f_a(Y_{a1}; \theta_a) \cdots f_a(Y_{ak}; \theta_a) \le n^{1-\delta/2} f_a(Y_{a1}; \theta_a') \cdots f_a(Y_{ak}; \theta_a')$; note also that $e^{(1-\delta/2)\log n} = n^{1-\delta/2}$. The third relation is the Markov inequality and the fourth is due to Remark 7(a) above. To see that $P_\theta^\pi(A_n^\delta \bar C_n^\delta) = o(1)$, note that

$$P_\theta^\pi(A_n^\delta \bar C_n^\delta) \le P_\theta^\pi\left(\max_{k \le \lfloor b_n \rfloor}\{\log L_k\} > (1 - \delta/2)\log n\right) = P_\theta^\pi\left(\max_{k \le \lfloor b_n \rfloor}\{\log L_k\}/b_n > I^\delta\left(1 + \delta/(2(1 - \delta))\right)\right) \le P_{\theta_a}\left(\max_{k \le \lfloor b_n \rfloor}\{\log L_k\}/b_n > I^\delta\left(1 + \delta/(2(1 - \delta))\right)\right),$$

where $b_n := (1 - \delta)\log n/I^\delta$ and the last inequality follows using an argument like that in Remark 1. Thus the result follows from Remark 7(b), since $\log L_k/k \to I^\delta$ a.s. $(P_{\theta_a})$.

To complete the proof of (3.8), it suffices to notice that the choices of $\delta$ and $\theta_a'(\delta)$ imply $(1 - \delta)/I^\delta > (1 - \delta)/((1 + \delta)K_a(\theta)) > (1 - \varepsilon)/K_a(\theta)$, and $P_\theta^\pi(T_n(a)/\log n < (1 - \varepsilon)/K_a(\theta)) \le P_\theta^\pi(A_n^\delta) = o(1)$, when $\delta < \varepsilon/(2 - \varepsilon)$.

To facilitate the proof of Proposition 2 below, we introduce some notation and state a remark. For any $\varepsilon > 0$, let

$$T_n^{(1)}(a, \varepsilon) = \sum_{k=1}^{n-1} 1\left(\pi_k = a,\; U_a(\omega_k) > \mu^*(\theta) - \varepsilon\right),$$
and

$$T_n^{(2)}(a, \varepsilon) = \sum_{k=1}^{n-1} 1\left(\pi_k = a,\; U_a(\omega_k) \le \mu^*(\theta) - \varepsilon\right).$$

Remark 8. Let $Z_t$ be any sequence of constants (or random variables) and let $\tau_k := \sum_{t=0}^{k-1} 1\{Z_t = a\}$. This definition of $\tau_k$ implies that (pointwise, if we have random variables)

$$\sum_{k=1}^{n-1} 1\{Z_k = a,\; \tau_k \le c\} \le c + 1.$$

Indeed, note that $\sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k = i\} \le 1$, $\forall i = 0, \ldots, \lfloor c \rfloor$. Therefore $\sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k \le c\} = \sum_{k=1}^{n-1} \sum_{i=0}^{\lfloor c \rfloor} 1\{Z_k = a, \tau_k = i\} = \sum_{i=0}^{\lfloor c \rfloor} \sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k = i\} \le \lfloor c \rfloor + 1 \le c + 1$.
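Remark 8 is a purely combinatorial bound, and it can be sanity-checked directly. A small sketch (illustrative only) that evaluates the left-hand side on arbitrary sequences:

```python
import random

def count_with_small_tau(z, a, c):
    """Evaluate sum_{k=1}^{n-1} 1{Z_k = a, tau_k <= c}, where
    tau_k = #{t < k : Z_t = a}; by Remark 8 the result is at most c + 1."""
    tau = 1 if z[0] == a else 0  # tau_1 counts occurrences at t = 0
    total = 0
    for k in range(1, len(z)):
        if z[k] == a and tau <= c:
            total += 1
        if z[k] == a:
            tau += 1  # tau now counts occurrences among t = 0, ..., k
    return total

# Property check on random sequences: the bound of Remark 8 always holds.
rng = random.Random(0)
for _ in range(100):
    z = [rng.choice("abc") for _ in range(50)]
    c = rng.uniform(0, 10)
    assert count_with_small_tau(z, "a", c) <= c + 1
```

The bound holds because $\tau_k$ is nondecreasing and increases by one exactly at the indices $k$ with $Z_k = a$, so each value $i \in \{0, \ldots, \lfloor c \rfloor\}$ of $\tau_k$ is "used up" at most once.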
PROPOSITION 2. For any $\theta \in \Theta$ the following are true:

(a) Under (A1) and (A2), if $\pi_0 \in C_R$ and $a \notin O(\theta)$, then

$$\lim_{\varepsilon\to 0} \limsup_{n\to\infty} E_\theta^{\pi_0} T_n^{(1)}(a, \varepsilon)/\log n \le 1/K_a(\theta), \quad \text{if } a \in B(\theta), \tag{3.9}$$

$$\lim_{\varepsilon\to 0} \limsup_{n\to\infty} E_\theta^{\pi_0} T_n^{(1)}(a, \varepsilon)/\log n = 0, \quad \text{if } a \notin B(\theta). \tag{3.10}$$

(b) Under (A3), if $\pi_0 \in C_R$, then $\lim_{n\to\infty} E_\theta^{\pi_0} T_n^{(2)}(a, \varepsilon)/\log n = 0$, $\forall a$ and $\forall\varepsilon > 0$.

(c) Under (A1), (A2), and (A3), if $\pi_0 \in C_R$, then $\limsup_{n\to\infty} E_\theta^{\pi_0} T_n(a)/\log n$ is less than or equal to $1/K_a(\theta)$ if $a \in B(\theta)$, and it is equal to $0$ if $a \notin B(\theta)$.
Proof. (a) Fix $\pi_0 \in C_R$, $\theta \in \Theta$, $a \notin O(\theta)$, i.e., $\mu^* > \mu_a(\theta_a)$. Let $\varepsilon \in (0, \mu^* - \mu_a(\theta_a))$, and consider two cases.

Case 1. There exists $\varepsilon_0 > 0$ such that $\Delta\Theta_a(\theta_a, \varepsilon_0) = \emptyset$. For any $\varepsilon < \varepsilon_0$ and any $\theta_a' \in \Theta_a$ it is true that $\mu_a(\theta_a') \le \mu^*(\theta) - \varepsilon_0 < \mu^*(\theta) - \varepsilon$; therefore, $T_n^{(1)}(a, \varepsilon) = 0$ for all $\varepsilon < \varepsilon_0$, and (3.10) holds.

Case 2. $\Delta\Theta_a(\theta_a, \varepsilon) \ne \emptyset$, $\forall\varepsilon > 0$. Note that $J_a(\theta_a, \theta; \varepsilon) > 0$, $\forall\varepsilon \in (0, \mu^* - \mu_a(\theta_a))$. Let $J_\varepsilon = J_a(\theta_a, \theta; \varepsilon)$ and $\hat J_\varepsilon = J_a(\hat\theta_a^{T_k(a)}, \theta; \varepsilon)$; then, $\forall\delta > 0$, we have sample-path-wise:

$$T_n^{(1)}(a, \varepsilon) \le \sum_{k=1}^{n-1} 1\left(\pi_k^0 = a,\; \hat J_\varepsilon \le \log k/T_k(a)\right)$$
$$= \sum_{k=1}^{n-1} 1\left(\pi_k^0 = a,\; \hat J_\varepsilon \le \log k/T_k(a),\; \hat J_\varepsilon \ge J_\varepsilon - \delta\right) + \sum_{k=1}^{n-1} 1\left(\pi_k^0 = a,\; \hat J_\varepsilon \le \log k/T_k(a),\; \hat J_\varepsilon < J_\varepsilon - \delta\right)$$
$$\le \sum_{k=1}^{n-1} 1\left(\pi_k^0 = a,\; T_k(a) \le \log n/(J_\varepsilon - \delta)\right) + \sum_{k=1}^{n-1} 1\left(\pi_k^0 = a,\; \hat J_\varepsilon < J_\varepsilon - \delta\right)$$
$$\le \log n/(J_\varepsilon - \delta) + 1 + \sum_{k=1}^{n-1} 1\left(\pi_k^0 = a,\; \hat J_\varepsilon < J_\varepsilon - \delta\right).$$
ny1 0
Ý 1 Ž p k0 s a, ˆJ« - J« y d . ks1
s Eup
0
TnŽ a .
Ý js0
F Eup
1 Ja uˆaj , u ; « - J« y d
ž ž
/
/
ny1 0
Ý 1 ž Ja ž uˆaj , u ; « / - J« y d /
js0 ny1
F Eu a
Ý 1 ž Ja ž uˆaj , u ; « / - J« y d /
s o Ž log n . ,
js0
where the second inequality follows from Remark 1 and the last equality 0 follows from Remark 4Žb.. Therefore, Eup TnŽ1. Ž a, « .rlog n F log nr JaŽ ua , u ; « . q oŽlog n.. Thus the proof of part Ža. is complete since lim « ª 0 JaŽ ua , u ; « . s K aŽ u ., if DQaŽ ua , 0. / B, from the definition of K aŽ u ., and lim « ª 0 JaŽ ua , u ; « . s `, if DQaŽ ua , 0. s B, from ŽA1.. Žb. Note first that for p 0 g CR , the following inequality holds pointwise on V Ž n. : ny1
TnŽ2. Ž a, « . F
Ý 1 ž u a* Ž uˆa*j , log krj . F m* Ž u . y « , for some
ks1
;a* g O Ž u . , ;u g Q.
jFk ,
/
134
BURNETAS AND KATEHAKIS
1 Ž 0 Ž . Ž . . Indeed, TnŽ2. Ž a, « . s Ý ny ks1 1 p k s a, Ua v k F m* u y e , and, since p g 0 CR , the condition p k s a implies that UaŽ v k . s max a9Ua9Ž v k . G Ua* Ž v k .; thus, the event p k0 s a, UaŽ v k . F m*Ž u . y « 4 is contained in the event Ua* Ž v k . F m*Ž u . y « 4 , for any a* g O Ž u .. The latter event is contained in the event u a* Ž uˆa*j , k, j . F m*Ž u . y « , for some j F k .4 . Therefore, using also Ž2.2.,
Eup TnŽ2. Ž a, « . F 0
ny1
Ý Pu
a*
ž u Ž uˆ a*
j a* ,
k, j . F m* Ž u . y « ,
ks1
for some j F k s o Ž log n . ,
/
by Condition (A3). The proof of (c) follows from (a) and (b) when we let $\varepsilon \to 0$, since $T_n(a) \le 1 + T_n^{(1)}(a, \varepsilon) + T_n^{(2)}(a, \varepsilon)$, $\forall n \ge 1$, $\forall\varepsilon > 0$.

4. APPLICATIONS OF THEOREM 1

4.1. Discrete Distributions with Finite Support

Assume that the observations $Y_{aj}$ from population $a$ follow a univariate discrete distribution, i.e., $f_a(y, p_a) = p_{ay} 1\{Y_a = y\}$, $y \in S_a = \{r_{a1}, \ldots, r_{a d_a}\}$, where the unknown parameters $p_{ay}$ are in $\Theta_a = \{p_a \in \mathbb{R}^{d_a} : p_{ay} > 0, \forall y = 1, \ldots, d_a, \sum_y p_{ay} = 1\}$, and the $r_{ay}$ are known. Here we use the notation $\theta_a = p_a$, $\theta = p = (p_1, \ldots, p_m)$, and $\nu_a$ is the counting measure on $\{r_{a1}, \ldots, r_{a d_a}\}$. Thus we can write $I(p_a, q_a) = \sum_{y=1}^{d_a} p_{ay} \log(p_{ay}/q_{ay})$, $\mu_a(p_a) = r_a' p_a = \sum_y r_{ay} p_{ay}$, $\mu^* = \mu^*(p) = \max_a \{r_a' p_a\}$, and $\Delta\Theta_a(p, \varepsilon) = \{q_a : \mu_a(q_a) > \mu^*(p) - \varepsilon\}$, where $r_a'$ denotes the transpose of the vector $r_a$. Note that computation of the constant $K_a(p)$ as a function of $p$ involves the minimization of a convex function subject to two linear constraints; hence,

$$K_a(p) = w_a\left(p_a, \mu^*(p)\right) = \min_{q_a \ge 0}\left\{I(p_a, q_a) : r_a' q_a \ge \mu^*(p),\; \sum_{y=1}^{d_a} q_{ay} = 1\right\}. \tag{4.1}$$
For any estimators $\hat p_a^t$ of $p_a$, the computation of the index $U_a(\omega_k)$ involves the solution of the dual problem of (4.1) (with $p_a$ replaced by $\hat p_a^t$ in (4.1)), which, in this case, is a problem of maximization of a linear function subject to a constraint with convex level sets and a linear constraint; hence,

$$U_a(\omega_k) = u_a\left(\hat p_a^t, \log k/t\right) = \max_{q_a \ge 0}\left\{r_a' q_a : I(\hat p_a^t, q_a) \le \log k/t,\; \sum_{y=1}^{d_a} q_{ay} = 1\right\}. \tag{4.2}$$
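Problem (4.2) is low dimensional and can be attacked through a one-dimensional dual search: at an optimum the KL constraint is active and a standard Lagrangian computation gives $q_y \propto \hat p_{ay}/(\eta - r_{ay})$ for a scalar $\eta > \max_y r_{ay}$. The paper does not spell out an algorithm, so the following is only a sketch, under the assumptions that $\hat p_{ay} > 0$ for all $y$, not all $r_{ay}$ are equal, and $0 < \log k/t < \infty$:

```python
import math

def discrete_index(p, r, gamma, iters=200):
    """max{ r'q : I(p, q) <= gamma, q in the simplex }  (cf. (4.2)).
    Bisection on the dual variable eta > max(r): q_y(eta) ~ p_y/(eta - r_y),
    and I(p, q(eta)) decreases to 0 as eta grows."""
    assert gamma > 0 and all(py > 0 for py in p)
    r_max = max(r)

    def kl_and_value(eta):
        w = [py / (eta - ry) for py, ry in zip(p, r)]
        s = sum(w)
        q = [wy / s for wy in w]
        kl = sum(py * math.log(py / qy) for py, qy in zip(p, q))
        val = sum(qy * ry for qy, ry in zip(q, r))
        return kl, val

    lo = r_max + 1e-12
    hi = r_max + 1.0
    while kl_and_value(hi)[0] > gamma:  # expand until the KL constraint is met
        hi = r_max + 2.0 * (hi - r_max)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_and_value(mid)[0] > gamma:
            lo = mid
        else:
            hi = mid
    return kl_and_value(hi)[1]
```

For a two-point support this reduces to the Bernoulli computation; e.g., discrete_index([0.5, 0.5], [0.0, 1.0], 0.2) is about 0.787.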
In Proposition 3 below, it is shown that Conditions (A1), (A2), and (A3) are satisfied for estimators defined from the observations from population $a$. Given $\omega_k$ with $T_k(a) = t$, define:

(1) For $t \ge 1$, $n_t(y; a) = \sum_{j=1}^t 1(Y_{aj} = r_{a,y})$, $f_t(y; a) = n_t(y; a)/t$, and $f_t(a) = [f_t(y; a)]_{y \in S_a}$.

(2) For $t \ge 0$, let $\hat p_{a,y}^t = (1 - w_t)/d_a + w_t f_t(y; a)$, where $w_t = t/(d_a + t)$, and let $\hat p_a^t = [\hat p_{a,y}^t]_{y \in S_a}$.

In the proof of Proposition 3 we make use of the following quantities and properties:

(1) For $a = 1, \ldots, m$ and $p_a, q_1, q_2 \in \Theta_a$, let $l(p_a; q_1, q_2) = I(p_a, q_2) - I(p_a, q_1) := \sum_{y \in S_a} p_{ay} \log[q_{1y}/q_{2y}]$.

(2) Let $L_t(q_1, q_2) = \prod_{j=1}^t q_{1, Y_{aj}}/q_{2, Y_{aj}}$.

(3) For $p_a \in \Theta_a$, let $F_{kt}(p_a) = \{q \in \Theta_a : I(p_a, q) \le \log k/t\}$. Note that $U_a(\omega_k) = \sup\{r_a' q : q \in F_{k, T_k(a)}(\hat p_a^{T_k(a)})\}$.
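The smoothed frequency estimator $\hat p_a^t$ defined above mixes the empirical frequencies with the uniform distribution, so every coordinate stays strictly positive and the $t = 0$ estimate is interior. A direct sketch:

```python
def smoothed_estimate(samples, support):
    """p_hat_{a,y}^t = (1 - w_t)/d_a + w_t * f_t(y; a), with w_t = t/(d_a + t).
    `samples` are the observed values Y_a1, ..., Y_at (each a member of
    `support`), and `support` lists S_a = {r_a1, ..., r_a,d_a}."""
    d = len(support)
    t = len(samples)
    w = t / (d + t)  # t = 0 gives w = 0, i.e., the uniform vector
    counts = {y: 0 for y in support}
    for x in samples:
        counts[x] += 1
    return [(1 - w) / d + ((w * counts[y] / t) if t else 0.0) for y in support]
```

For example, with samples (1, 1, 0, 1) on support {0, 1} the estimate is (1/3, 2/3): the raw frequency (1/4, 3/4) pulled toward uniform with weight $1 - w_4 = 1/3$.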
Remark 9. (a) $\log L_t(q_1, q_2) = t\, l(f_t(a); q_1, q_2)$.

(b) $\sup_{q_1} l(p_a; q_1, q_2) = l(p_a; p_a, q_2) = I(p_a, q_2)$.

(c) $e^{t I(f_t(a), q_2)} = e^{t\, l(f_t(a); f_t(a), q_2)} = \sup_{q_1} L_t(q_1, q_2)$.

(d) $t\, l(\hat p_a^t; q_1, q_2) = w_t\left[b(q_1, q_2) + t\, l(f_t(a); q_1, q_2)\right]$, where $b(q_1, q_2) = \sum_y \log[q_{1y}/q_{2y}]$.

(e) $t\, I(\hat p_a^t, q_2) \le w_t\left[b_0(q_2) + t\, I(f_t(a), q_2)\right]$, where $b_0(q_2) = -\sum_y \log q_{2y}$.

Indeed, (a) follows from the observation that

$$\log L_t(q_1, q_2) = \sum_{j=1}^t \log\left[q_{1, Y_{aj}}/q_{2, Y_{aj}}\right] = \sum_{j=1}^t \sum_{y=1}^{d_a} 1(Y_{aj} = y) \log\left[q_{1y}/q_{2y}\right] = t\, l\left(f_t(a); q_1, q_2\right).$$

(b) is a restatement of the information inequality $-I(p_a, q_1) \le 0$. (c) follows from (b). To see (d) and (e), recall that $\hat p_a^t = (1 - w_t)/d_a + w_t f_t(a)$, where $w_t = t/(t + d_a)$; note that $t(1 - w_t)/d_a = w_t$, and use (a) (for (d)) and (b) (for (e)).

PROPOSITION 3. The discrete distribution model satisfies Conditions (A1), (A2), and (A3) of Theorem 1.

Proof. (1) It is easy to see that Condition (A1) holds. Indeed, note that, $\forall\varepsilon \ge 0$, $\Delta\Theta_a(p; \varepsilon) \ne \emptyset$ if and only if $\max_y r_{ay} > \mu^*(p) - \varepsilon$. Thus if $\Delta\Theta_a(p, \varepsilon) \ne \emptyset$, $\forall\varepsilon > 0$, and $\Delta\Theta_a(p) = \emptyset$, then $\max_y r_{ay} = \mu^*(p)$ and $\lim_{\varepsilon\to 0} J(p_a, p; \varepsilon) = I(p_a, q^e) = \infty$, where $q^e$ is the unit vector of $\mathbb{R}^{d_a}$ with nonzero component corresponding to $\max_y r_{ay}$.

(2) We next show that Condition (A2) holds. Since $1 - w_t \to 0$, it follows from the definition of $\hat p_{a,y}^t$ that for any $\varepsilon > 0$ there exists $t_0(\varepsilon) \ge 1$ such that