

Value Iteration in a Class of Average Controlled Markov Chains with Unbounded Costs: Necessary and Sufficient Conditions for Pointwise Convergence
Author(s): Rolando Cavazos-Cadena and Emmanuel Fernández-Gaucherand
Source: Journal of Applied Probability, Vol. 33, No. 4 (Dec., 1996), pp. 986-1002
Published by: Applied Probability Trust
Stable URL: http://www.jstor.org/stable/3214980


J. Appl. Prob. 33, 986-1002 (1996)
Printed in Israel
© Applied Probability Trust 1996

VALUE ITERATION IN A CLASS OF AVERAGE CONTROLLED MARKOV CHAINS WITH UNBOUNDED COSTS: NECESSARY AND SUFFICIENT CONDITIONS FOR POINTWISE CONVERGENCE

ROLANDO CAVAZOS-CADENA,* Universidad Autónoma Agraria Antonio Narro
EMMANUEL FERNÁNDEZ-GAUCHERAND,** The University of Arizona

Abstract

This work concerns controlled Markov chains with denumerable state space, (possibly) unbounded cost function, and an expected average cost criterion. Under a Lyapunov function condition, together with mild continuity-compactness assumptions, a simple necessary and sufficient criterion is given so that the relative value functions and differential costs produced by the value iteration scheme converge pointwise to the solution of the optimality equation; this criterion is applied to obtain convergence results when the cost function is bounded below or bounded above.

CONTROLLED MARKOV CHAINS; AVERAGE COST CRITERION; LYAPUNOV FUNCTION CONDITION; VALUE ITERATION SCHEME; POINTWISE CONVERGENCE; NECESSARY AND SUFFICIENT CONDITIONS

AMS 1991 SUBJECT CLASSIFICATION: PRIMARY 93E20; 90C40

Received 23 January 1995; revision received 11 July 1995.
* Postal address: Departamento de Estadística y Cálculo, Universidad Autónoma Agraria Antonio Narro, Buenavista, Saltillo COAH 25315, Mexico.
** Postal address: Systems and Industrial Engineering Department, The University of Arizona, Tucson AZ 85721, USA.
* Research supported by a U.S.-México Collaborative Research Program funded by the National Science Foundation under Grant NSF-INT 9201430, and by the Consejo Nacional de Ciencia y Tecnología (CONACyT), México.
** The work of Cavazos-Cadena was partially supported by the PSF Organization under Grant No. 200150/94-01, and by the MAXTOR Foundation for Applied Probability and Statistics (MAXFAPS) under Grant No. 01-01-56/01-94.

1. Introduction

This work considers controlled Markov chains (CMC) with denumerable state space and an average cost criterion. The cost function is (possibly) unbounded and, besides standard continuity-compactness conditions, the main assumption on the structure of the model is that the Lyapunov function condition (LFC), introduced by Hordijk in [14], holds true. The LFC implies the existence of a (generally unbounded) solution of the average cost optimality equation (ACOE), yielding optimal stationary policies. In this context, the main contribution of this paper is to formulate a simple criterion so
that the differential costs and relative value functions produced by the value iteration (VI) scheme converge pointwise to the solution of the ACOE; see Theorem 3.1, which establishes a necessary and sufficient condition for convergence. Next, this criterion is used to obtain convergent approximations of the solution of the ACOE; see Theorem 3.2. The relation of these theorems with other results already available in the literature is briefly discussed in Section 6.

The remainder of the paper is organized as follows. In Section 2 the decision model and all the structural assumptions are introduced, together with some of their basic consequences. Next, in Section 3 the VI procedure is described and the main results of this paper are stated in Theorems 3.1 and 3.2. The proof of Theorem 3.1 is presented in Section 5 after the necessary technical preliminaries given in Section 4, and the paper concludes with some brief comments in Section 6.

Notation. As usual $\mathbb{R}$ and $\mathbb{N}$ stand for the sets of real numbers and non-negative integers, respectively; if $a, b \in \mathbb{R}$, set $a \wedge b := \min\{a, b\}$ and $a \vee b := \max\{a, b\}$. On the other hand, given an event $W$, the corresponding indicator function is denoted by $I[W]$.

2. Decision model

Following standard notation, our model of a CMC is described by the four-tuple $(S, A, C, P)$, where the state space $S$ is a denumerable set, the compact metric space $A$ is the action set, $C$ is the cost function and $P(\cdot) = [p_{xy}(\cdot)]$ is the (controlled) transition law. The interpretation of this model is as follows: at each time $t \in \mathbb{N}$ the state of a dynamical system is observed, say $X_t = x \in S$, and an action $A_t = a \in A$ is chosen. As a consequence, a cost $C(x, a)$ is incurred and, regardless of the previous states and actions, the state of the system at time $t + 1$ will be $X_{t+1} = y \in S$ with probability $p_{xy}(a)$; see [1], [11] or [17] for details. Notice that it is assumed that all actions in $A$ are available at every state and decision time; as shown in [2] this does not imply any loss of generality. The following is a rather standard assumption.

Assumption 2.1. For each $x, y \in S$, the mappings $a \mapsto p_{xy}(a)$ and $a \mapsto C(x, a)$ are continuous in $a \in A$.

Control policies. A policy (or control strategy) is a (measurable, possibly randomized) rule for choosing actions; at each time $t \in \mathbb{N}$ the selected control may depend on the current state as well as on the record of previous states and actions. Given the initial state $X_0 = x$ and the policy $\pi$ being used, the distribution $P_x^\pi$ of the state-action process $\{(X_t, A_t)\}$ is uniquely determined, and $E_x^\pi$ stands for the corresponding expectation operator; see, for instance, [1], [11], [13] and [17] for details. Now set $\mathbb{F} := \prod_{x \in S} A$ and notice that $\mathbb{F}$ consists of all functions $f: S \to A$, and that it is a compact metric space in the product topology [8]. A policy $\pi$ is stationary if there exists $f \in \mathbb{F}$ such that when the system is in progress under $\pi$, action $f(x)$ is applied whenever the observed state is $X_t = x$, regardless of $t \in \mathbb{N}$; in this case $\pi$ and $f$ are naturally identified. A Markovian policy is a sequence $\pi := \{f_t \in \mathbb{F}, t \in \mathbb{N}\}$; under this policy action $A_t = f_t(x)$ is applied at time $t$ if $X_t = x$ is the observed state.


Performance index. The (lim-inf expected) average cost at state $x \in S$ under policy $\pi$ is defined by

$$J(x, \pi) := \liminf_{k \to \infty} \frac{1}{k+1} E_x^\pi\left[\sum_{t=0}^{k} C(X_t, A_t)\right], \tag{2.1}$$

and

$$J^*(x) := \inf_{\pi} J(x, \pi) \tag{2.2}$$

is the optimal average cost at state $x$. A policy $\pi^*$ is average optimal (AO) if $J(x, \pi^*) = J^*(x)$ for all $x \in S$. The use of the limit inferior in (2.1) implicitly assumes an optimistic viewpoint from the decision-maker, since what is being optimized is the best performance attained by a policy. On the other hand, an opposite pessimistic viewpoint could be adopted, by using the limit superior instead in (2.1). However, under Assumption 2.2 below, both criteria are equivalent; see [3].
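As an illustration of (2.1), the following minimal sketch evaluates the finite-horizon Cesàro average $\frac{1}{k+1} E_x\left[\sum_{t=0}^{k} C(X_t, A_t)\right]$ for a fixed stationary policy on a finite chain. The two-state transition matrix, the costs and the function name `average_cost` are hypothetical choices made for this example only; they do not come from the paper, and a finite chain is of course far simpler than the denumerable setting treated here.

```python
# Toy two-state chain under a fixed stationary policy (hypothetical numbers).
P = [[0.5, 0.5],
     [0.5, 0.5]]          # P[x][y]: transition probability x -> y under the policy
cost = [1.0, 3.0]         # cost[x]: one-stage cost C(x, f(x))


def average_cost(P, cost, x0, k):
    """Return (1/(k+1)) * E[sum_{t=0}^{k} cost(X_t)] for a chain started at x0."""
    n = len(P)
    dist = [1.0 if x == x0 else 0.0 for x in range(n)]      # law of X_0
    total = 0.0
    for _ in range(k + 1):
        total += sum(dist[x] * cost[x] for x in range(n))   # E[cost(X_t)]
        # Push the distribution one step forward to obtain the law of X_{t+1}.
        dist = [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]
    return total / (k + 1)


estimate = average_cost(P, cost, x0=0, k=9999)
print(estimate)
```

For this chain the stationary distribution is uniform, so the printed value should be close to the long-run average cost 2; the averages converge here, so the limit inferior and limit superior coincide.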

Optimality equation. To establish the existence of optimal stationary policies it is necessary to complement Assumption 2.1 with additional conditions [1], [19], [17]. One such condition is introduced in Assumption 2.2 below. First, let $z \in S$ be an arbitrarily selected state, fixed throughout the remainder of the paper, and define the first passage time $T$ as follows:

$$T := \min\{n > 0 \mid X_n = z\}, \tag{2.3}$$

where, by the usual convention, the minimum of the empty set is $\infty$. The next condition was introduced by Hordijk in [14]; see also [1], [4], [5], [6], [7], [9], [10].

Assumption 2.2. Lyapunov function condition. There exists a (Lyapunov) function $\ell: S \to [0, \infty)$ satisfying the following conditions (i)-(iii):
(i) $1 + |C(x, a)| + \sum_{y \neq z} p_{xy}(a)\ell(y) \le \ell(x)$ for all $x \in S$ and $a \in A$.
(ii) For each $x \in S$, the mapping $f \mapsto \sum_{y} p_{xy}(f(x))\ell(y) = E_x^f[\ell(X_1)]$ is continuous in $f \in \mathbb{F}$.
(iii) For each $f \in \mathbb{F}$ and $x \in S$, $E_x^f[\ell(X_n) I[T > n]] \to 0$ as $n \to \infty$.
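To make condition (i) concrete, the sketch below checks the inequality $1 + |C(x,a)| + \sum_{y \neq z} p_{xy}(a)\ell(y) \le \ell(x)$ for a candidate function on a small finite model. The model data, the candidate values and the helper name `satisfies_lfc_i` are assumptions introduced for illustration; only part (i) is tested, and parts (ii) and (iii) are not examined by this code.

```python
# Hypothetical two-state, two-action model; z is the distinguished state.
p = {  # p[(x, a)][y] = p_xy(a)
    (0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
    (1, 0): [0.5, 0.5], (1, 1): [0.8, 0.2],
}
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
z = 0


def satisfies_lfc_i(ell, p, C, z):
    """Check 1 + |C(x,a)| + sum_{y != z} p_xy(a) * ell[y] <= ell[x] for all (x, a)."""
    for (x, a), row in p.items():
        drift = sum(row[y] * ell[y] for y in range(len(row)) if y != z)
        if 1.0 + abs(C[(x, a)]) + drift > ell[x]:
            return False
    return True


ok_candidate = satisfies_lfc_i([7.0, 9.0], p, C, z)    # large enough values
bad_candidate = satisfies_lfc_i([1.0, 1.0], p, C, z)   # too small to dominate the costs
print(ok_candidate, bad_candidate)
```

Note that the sum skips $y = z$: mass returning to the reference state is not charged, which is what allows a non-negative $\ell$ to satisfy the inequality at every state.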

Notice that the above condition imposes a growth restriction on the one-stage cost function. Moreover, under continuity and boundedness conditions for the one-stage cost function and a communicating assumption (under every stationary policy), it has been shown in [7] that the Lyapunov function condition is equivalent to several stability/ergodicity conditions on the controlled transition law of the system. Under Assumptions 2.1 and 2.2 the average cost optimality equation (ACOE) given by (2.4) below has a solution yielding an optimal stationary policy.

Lemma 2.1. Suppose that Assumptions 2.1 and 2.2 hold true. Then there exist $h: S \to \mathbb{R}$ and $g \in \mathbb{R}$ such that (i)-(iv) below hold true.
(i) $g = J^*(x)$ for each $x \in S$.
(ii) $h(z) = 0$ and for some constant $c > 0$, $|h(x)| \le c\,\ell(x)$ for all $x \in S$.
(iii) The ACOE is satisfied by $g$ and $h(\cdot)$, i.e.

$$g + h(x) = \min_{a \in A}\left[C(x, a) + \sum_{y} p_{xy}(a)h(y)\right], \quad x \in S. \tag{2.4}$$

(iv) An optimal stationary policy exists: for each $x \in S$ the right-hand side of (2.4), considered as a function of $a \in A$, has a minimizer $f^*(x)$, and the corresponding policy $f^* \in \mathbb{F}$ is optimal.
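The ACOE (2.4) can be checked directly on a small finite model. The sketch below does this for a hypothetical two-state, two-action example with reference state $z = 0$; the model data, the candidate pair $g = 2$, $h = (0, 2)$ and the helper name `acoe_rhs` are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical two-state, two-action model with reference state z = 0.
states, actions, z = [0, 1], [0, 1], 0
p = {  # p[(x, a)][y] = p_xy(a)
    (0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
    (1, 0): [0.5, 0.5], (1, 1): [0.8, 0.2],
}
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}


def acoe_rhs(x, h):
    """Return (minimum, minimizer) of a -> C(x, a) + sum_y p_xy(a) * h[y]."""
    values = {a: C[(x, a)] + sum(p[(x, a)][y] * h[y] for y in states)
              for a in actions}
    best = min(values, key=values.get)
    return values[best], best


g, h = 2.0, [0.0, 2.0]           # candidate solution, normalized so that h[z] = 0
residuals = [g + h[x] - acoe_rhs(x, h)[0] for x in states]
f_star = [acoe_rhs(x, h)[1] for x in states]   # minimizers, as in Lemma 2.1(iv)
print(residuals, f_star)
```

Zero residuals indicate that the candidate pair solves (2.4) for this toy model, and the recorded minimizers define the stationary policy described in part (iv).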

A proof of this result can be found in [14, ch. 5]; see also [6] for a proof of (ii). In addition to this lemma other (somewhat technical) consequences of Assumptions 2.1 and 2.2 will be used; they are stated in Lemmas A.1-A.3 in the appendix. Notice that $g$ in Lemma 2.1 is uniquely determined, since it is the optimal average cost at every state. The function $h$ is also unique, as established in Lemma A.2(iv). As already mentioned, the main objective of this paper is to study the VI procedure to obtain approximations to the solution $(g, h(\cdot))$ of the ACOE. Theorems 3.1 and 3.2 in the next section point in this direction and require the following additional condition.

Assumption 2.3. For each $a \in A$, $p_{zz}(a) > 0$.

Remark 2.1. Assumption 2.3 is readily verifiable. Furthermore, it does not imply any loss of generality, since it can be obtained by making an appropriate transformation on the transition law. In fact, suppose that $M = (S, A, C, P)$ satisfies Assumptions 2.1 and 2.2 and define the transformed transition law $P^* = [p^*_{xy}(\cdot)]$ as follows:

$$p^*_{xy}(a) := (1 - \alpha)\delta_{xy} + \alpha\, p_{xy}(a), \quad (x, a) \in S \times A, \ y \in S,$$

where $\alpha \in (0, 1)$ is a given number and $\delta_{xy} := 1$ (resp. 0) if $x = y$ (resp. $x \neq y$). Now set $M^* := (S, A, C, P^*)$, which clearly satisfies Assumptions 2.1 and 2.3. Moreover, it is not difficult to see that $\ell^*(\cdot) = \ell(\cdot)/\alpha$ is a Lyapunov function for $M^*$, so that Assumption 2.2 is also satisfied by $M^*$. On the other hand, $M$ and $M^*$ are equivalent CMC in the following sense. Let the pair $(g, h)$ be the solution to the ACOE for model $M$ and let $(g^*, h^*)$ be the corresponding pair for model $M^*$. Then (a) $g^* = g$, (b) $h^* = h/\alpha$, and (c) a policy $f \in \mathbb{F}$ is optimal for model $M$ if and only if $f$ is optimal for $M^*$. Furthermore, (d) a policy $f \in \mathbb{F}$ is such that, for all states $x$, $f(x)$ minimizes the mapping $a \mapsto C(x, a) + \sum_y p_{xy}(a)h(y)$, if and only if it minimizes $a \mapsto C(x, a) + \sum_y p^*_{xy}(a)h^*(y)$. The transformation $p \to p^*$ was introduced by Schweitzer in [20]; see also [17, pp. 371-373].

3. Main results

This section contains the main results in the paper. To begin with, the necessary notions are introduced.

Definition 3.1. The VI scheme.
(i) The sequence $\{V_n: S \to \mathbb{R},\ n = -1, 0, 1, \ldots\}$ of value iteration functions is recursively defined as follows: $V_{-1} \equiv 0$ and, for $n \ge 0$,

$$V_n(x) = \min_{a \in A}\left[C(x, a) + \sum_{y} p_{xy}(a)V_{n-1}(y)\right], \quad x \in S. \tag{3.1}$$

(ii) The relative value functions $R_n: S \to \mathbb{R}$ are defined by $R_n(x) := V_n(x) - V_n(z)$, for $x \in S$, $n = -1, 0, 1, 2, \ldots$.
(iii) For each $x \in S$ and $n \in \mathbb{N}$ define the $n$th differential cost at $x$ by $g_n(x) := V_n(x) - V_{n-1}(x)$.
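The recursion in Definition 3.1 translates almost line by line into code when the state and action sets are finite. The following sketch computes $V_n$, $R_n$ and $g_n$ for a hypothetical two-state, two-action model with $z = 0$; the data and the function name `value_iteration` are illustrative assumptions, and the finite setting sidesteps the unbounded-cost issues that the paper addresses.

```python
# Hypothetical two-state, two-action model with reference state z = 0.
states, actions, z = [0, 1], [0, 1], 0
p = {  # p[(x, a)][y] = p_xy(a); note p_zz(a) > 0 for both actions
    (0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
    (1, 0): [0.5, 0.5], (1, 1): [0.8, 0.2],
}
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}


def value_iteration(n_max):
    """Return lists V, R, g indexed by n = 0..n_max, following Definition 3.1."""
    V_prev = [0.0 for _ in states]                 # V_{-1} = 0
    V_hist, R_hist, g_hist = [], [], []
    for _ in range(n_max + 1):
        # Recursion (3.1): one-stage cost plus expected previous value.
        V = [min(C[(x, a)] + sum(p[(x, a)][y] * V_prev[y] for y in states)
                 for a in actions) for x in states]
        V_hist.append(V)
        R_hist.append([V[x] - V[z] for x in states])       # R_n(x) = V_n(x) - V_n(z)
        g_hist.append([V[x] - V_prev[x] for x in states])  # g_n(x) = V_n(x) - V_{n-1}(x)
        V_prev = V
    return V_hist, R_hist, g_hist


V, R, g = value_iteration(50)
print(g[-1], R[-1])
```

In this toy model the differential costs settle at 2 and the relative values at $(0, 2)$, while $V_n$ itself grows linearly in $n$; this is why the scheme is analyzed through $g_n$ and $R_n$ rather than through $V_n$.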

Remark 3.1. From Assumption 2.1, a standard induction argument yields that for each $x \in S$ and $n \in \mathbb{N}$ ([1], [11]), the mapping

$$a \mapsto C(x, a) + \sum_{y} p_{xy}(a)V_{n-1}(y), \quad a \in A$$

is continuous. Therefore, the minimum in (3.1) is indeed attained.

The following are well known results; see [1], [11], [17].

Lemma 3.1. (i) The value iteration functions satisfy

$$V_n(x) = \inf_{\pi} E_x^\pi\left[\sum_{t=0}^{n} C(X_t, A_t)\right], \quad x \in S. \tag{3.2}$$

(ii) Furthermore, there exists a Markovian policy $\pi^n$ which attains the infimum in (3.2).

The differential costs and relative value functions are natural candidates to approximate the solution $(g, h(\cdot))$ of the ACOE. Consider the following conditions.

C1. For all $x \in S$, $g_n(x) \to g$ and $R_n(x) \to h(x)$ as $n \to \infty$.

C2. The sequence $\{g_n(z)\}$ is bounded, i.e. there exists $b \in [0, \infty)$ such that $|g_n(z)| \le b$ for all $n \in \mathbb{N}$.

It is clear that C1 implies C2. The first main result of the paper is as follows.

Theorem 3.1. Suppose that Assumptions 2.1-2.3 hold true. Then (i) and (ii) below hold true.
(i) Conditions C1 and C2 are equivalent.
(ii) Suppose that C2 holds, and that for each $n \in \mathbb{N}$, policy $f_n \in \mathbb{F}$ is such that, for each $x \in S$, $f_n(x)$ is a minimizer of the mapping

$$a \mapsto C(x, a) + \sum_{y} p_{xy}(a)R_n(y), \quad a \in A. \tag{3.3}$$

Then

$$\text{every limit point of } \{f_n\} \subset \mathbb{F} \text{ is AO.} \tag{3.4}$$
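A small experiment suggests why Assumption 2.3 appears among the hypotheses. In the hypothetical single-action, two-state periodic chain below (our own toy example, not one from the paper), $p_{zz} = 0$ and the differential costs $g_n(z)$ stay bounded but oscillate, so boundedness alone does not give convergence. After applying the transformation of Remark 2.1 with $\alpha = 0.5$, the same scheme converges. The helper names `run_vi` and `schweitzer` are assumptions of this sketch.

```python
def run_vi(P, cost, z, n_max):
    """Single-action VI: return ([g_n(z)], [R_n]) for n = 0..n_max, with V_{-1} = 0."""
    n = len(P)
    V_prev = [0.0] * n
    g_z, R = [], []
    for _ in range(n_max + 1):
        V = [cost[x] + sum(P[x][y] * V_prev[y] for y in range(n)) for x in range(n)]
        g_z.append(V[z] - V_prev[z])                 # differential cost at z
        R.append([V[x] - V[z] for x in range(n)])    # relative value function
        V_prev = V
    return g_z, R


def schweitzer(P, alpha):
    """Transformed law p*_xy = (1 - alpha) * delta_xy + alpha * p_xy (Remark 2.1)."""
    n = len(P)
    return [[(1.0 - alpha) * (1.0 if x == y else 0.0) + alpha * P[x][y]
             for y in range(n)] for x in range(n)]


P = [[0.0, 1.0],
     [1.0, 0.0]]            # deterministic 2-cycle, so p_zz = 0 for z = 0
cost = [0.0, 2.0]
z = 0

g_periodic, _ = run_vi(P, cost, z, 40)
g_star, R_star = run_vi(schweitzer(P, 0.5), cost, z, 40)
print(g_periodic[-2:], g_star[-1], R_star[-1])
```

For the original chain the ACOE is solved by $g = 1$, $h = (0, 1)$, so the limits observed for the transformed model, $g^* = 1$ and $h^* = (0, 2)$, agree with items (a) and (b) of Remark 2.1 for $\alpha = 0.5$.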

A proof of this result is contained in Section 5. By part (i), establishing convergence in C1 is equivalent to verifying the apparently weaker condition of boundedness of the sequence $\{g_n(z)\}$. This criterion is now used to obtain the following.

Theorem 3.2. Suppose that Assumptions 2.1-2.3 hold true and that for some constant $B \in [0, \infty)$, either $C(\cdot, \cdot) \ge -B$ or $C(\cdot, \cdot) \le B$. Then, C1 holds.


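Proof. By Theorem 3.1, it is sufficient to verify C2. With this in mind, suppose first that $C(\cdot, \cdot) \ge -B$ and observe that (see (3.2))

$$V_n(z) = \min_{\pi} E_z^\pi\left[\sum_{t=0}^{n} C(X_t, A_t)\right] \ge \min_{\pi} E_z^\pi\left[-B + \sum_{t=0}^{n-1} C(X_t, A_t)\right] = -B + \min_{\pi} E_z^\pi\left[\sum_{t=0}^{n-1} C(X_t, A_t)\right] = -B + V_{n-1}(z),$$

so that $V_n(z) - V_{n-1}(z) \ge -B$, $n \in \mathbb{N}$. Since $V_n(z) - V_{n-k}(z) = \sum_{r=0}^{k-1}\left[V_{n-r}(z) - V_{n-r-1}(z)\right]$, it follows that

$$V_n(z) - V_{n-k}(z) \ge -kB \quad \text{for } 0 \le k \le n + 1. \tag{3.5}$$

In particular, setting $k = n + 1$ and recalling that $V_{-1} \equiv 0$,

$$V_n(z) \ge -(n + 1)B. \tag{3.6}$$

On the other hand, with $\pi^n$ as in Lemma 3.1(ii),

$$\begin{aligned} V_n(z) &= E_z^{\pi^n}\left[\sum_{t=0}^{n} C(X_t, A_t)\right] \\ &= E_z^{\pi^n}\left[\sum_{t=0}^{n} C(X_t, A_t) I[T \le n]\right] + E_z^{\pi^n}\left[\sum_{t=0}^{n} C(X_t, A_t) I[T > n]\right] \\ &= E_z^{\pi^n}\left[\left(\sum_{t=0}^{T-1} C(X_t, A_t) + V_{n-T}(z)\right) I[T \le n]\right] + E_z^{\pi^n}\left[\sum_{t=0}^{n} C(X_t, A_t) I[T > n]\right], \end{aligned}$$

so that

$$V_n(z) - V_{n-1}(z) = E_z^{\pi^n}\left[\left(\sum_{t=0}^{T-1} C(X_t, A_t) + V_{n-T}(z) - V_{n-1}(z)\right) I[T \le n]\right] + E_z^{\pi^n}\left[\sum_{t=0}^{n} C(X_t, A_t) I[T > n]\right] - V_{n-1}(z) P_z^{\pi^n}[T > n].$$

Since $T \ge 1$, (3.5) yields that $-(V_{n-1}(z) - V_{n-T}(z)) I[T \le n] \le (T - 1)B\, I[T \le n]$, and from (3.6), $-V_{n-1}(z) \le nB \le (n + 1)B$. Therefore,

$$\begin{aligned} V_n(z) - V_{n-1}(z) &\le E_z^{\pi^n}\left[\left(\sum_{t=0}^{T-1} C(X_t, A_t) + T \cdot B\right) I[T \le n]\right] + E_z^{\pi^n}\left[\sum_{t=0}^{n} C(X_t, A_t) I[T > n]\right] + (n + 1)B\, P_z^{\pi^n}[T > n] \\ &\le E_z^{\pi^n}\left[\sum_{t=0}^{T-1} (C(X_t, A_t) + B) I[T \le n]\right] + E_z^{\pi^n}\left[\sum_{t=0}^{n} (C(X_t, A_t) + B) I[T > n]\right] \end{aligned}$$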

[...]

$$\ge \sum_{y} p_{xy}(a(x))L(y),$$

where Lemma 4.1(ii) and Lemma A.3(ii) were used to obtain the second inequality, whereas the third one follows from the definition of $L(y)$ as the limit inferior of the whole sequence $\{g_n(y)\}$. Setting $f(x) := a(x)$, $x \in S$, it follows that $L(x) \ge E_x^f[L(X_1)]$ for all $x \in S$, and then Lemma A.2(iii) yields that $L(x) \ge L(z)$ for all $x \in S$. Similarly, it can be established that $U(\cdot) \le U(z)$, whereas the inequality $L(\cdot) \le U(\cdot)$ is clear; see Definition 4.1.

(ii) By induction, we need only consider the $s = 1$ case, which we do next. Suppose that $\lim_k g_{n(k)}(z) = L(z)$ and let $L'(z)$ be an arbitrary limit point of $\{g_{n(k)-1}(z)\}$. It is clear that the desired conclusion will be reached if it is proved that $L'(z) = L(z)$. To verify this equality observe that without loss of generality it can be assumed that

$$g_{n(k)-1}(z) \to L'(z) \quad \text{as } k \to \infty. \tag{4.6}$$

In the arguments used to establish part (i) set $x = z$ and take an additional subsequence (if necessary) such that (4.4) holds with $x = z$. In this case (4.5) becomes

$$\begin{aligned} L(z) &\ge \liminf_{k} \sum_{y} p_{zy}(a_k(z))\, g_{n(k)-1}(y) \\ &\ge \sum_{y} p_{zy}(a(z)) \liminf_{k} g_{n(k)-1}(y) \quad \text{by Lemmas 4.1(ii) and A.3(ii)} \\ &\ge \sum_{y} p_{zy}(a(z))L(y) \\ &\ge L(z), \end{aligned} \tag{4.7}$$


where the last inequality follows from part (i). Since $\liminf_k g_{n(k)-1}(\cdot) \ge L(\cdot)$, (4.7) yields that

$$\liminf_{k} g_{n(k)-1}(y) = L(y) \quad \text{if } p_{zy}(a(z)) > 0. \tag{4.8}$$

Then Assumption 2.3, (4.6) and (4.8) together yield that $L'(z) = L(z)$ and, as already mentioned, this completes the proof of part (ii), whereas (iii) can be established in a similar way.

Remark 4.1. The key point in parts (ii) and (iii) in Lemma 4.2 is that $\{n(k)\}$ and $\{n(k) - s\}$, $s$ a positive integer, are totally different sequences, in general. The latter is not a delayed or shifted version of the former.

Lemma 4.3. (i) There exist functions $\overline{R}_s: S \to \mathbb{R}$, $s \in \mathbb{N}$, such that
(a) $|\overline{R}_s(\cdot)| \le (b \vee 1)\ell(\cdot)$, and
(b) $U(z) + \overline{R}_s(x) = \min_{a}\left[C(x, a) + \sum_{y} p_{xy}(a)\overline{R}_{s+1}(y)\right]$ for all $x \in S$ and $s \in \mathbb{N}$.
Similarly,
(ii) There exist functions $\underline{R}_s: S \to \mathbb{R}$, $s \in \mathbb{N}$, such that
(a) $|\underline{R}_s(\cdot)| \le (b \vee 1)\ell(\cdot)$, and
(b) $L(z) + \underline{R}_s(x) = \min_{a}\left[C(x, a) + \sum_{y} p_{xy}(a)\underline{R}_{s+1}(y)\right]$ for all $x \in S$ and $s \in \mathbb{N}$.

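Proof. (i) Pick an increasing sequence $\{n(k)\} \subset \mathbb{N}$ such that $g_{n(k)}(z) \to U(z)$ as $k \to \infty$, and notice that Lemma 4.2(ii) yields that for all $s \in \mathbb{N}$

$$\lim_{k \to \infty} g_{n(k)-s}(z) = U(z). \tag{4.9}$$

For $k$ large enough, $n(k) - s \ge 0$, and the recursive relation (3.1) with $n(k) - s$ instead of $n$ can be written (see Definition 3.1) as

$$g_{n(k)-s}(z) + R_{n(k)-s}(x) = \min_{a}\left[C(x, a) + \sum_{y} p_{xy}(a)R_{n(k)-s-1}(y)\right]. \tag{4.10}$$

Now set $R_t(\cdot) = 0$ for $t$ [...]

[...] $\varepsilon > 0$ there exists an integer $n(\varepsilon) > 0$ such that

$$\sum_{y \notin S_k} p_{xy}(a)\ell(y) \le \varepsilon \quad \text{for } k \ge n(\varepsilon),\ a \in A. \tag{A.2}$$

Then, for all $k \ge n(\varepsilon)$ and $a \in A$,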


$$\left|\sum_{y \in S} p_{xy}(a)W(y) - \sum_{y \in S_k} p_{xy}(a)W(y)\right| \le \sum_{y \notin S_k} p_{xy}(a)|W(y)| \le B \sum_{y \notin S_k} p_{xy}(a)\ell(y) \le B\varepsilon.$$

Thus, as a uniform limit of continuous functions, $a \mapsto \sum_{y \in S} p_{xy}(a)W(y)$ is itself a continuous mapping.

(ii) Notice that $W(x) \le E_x^f[W(X_1)] = W(z)P_x^f[T = 1] + E_x^f[W(X_1)I[T > 1]]$. Using this, a simple induction argument yields that, for all $x \in S$ and $n \in \mathbb{N}$,

$$W(x) \le W(z)P_x^f[T \le n + 1] + E_x^f[W(X_{n+1})I[T > n + 1]] \le W(z)P_x^f[T \le n + 1] + B\,E_x^f[\ell(X_{n+1})I[T > n + 1]].$$

Observe now that $P_x^f[T < \infty] = 1$ (a consequence of Lemma A.1(i)), so that taking the limit as $n \to \infty$ in the above inequality and using Assumption 2.2(iii), it follows that $W(\cdot) \le W(z)$. The proof of part (iii) follows along the same lines.

(iv) Lemma 3.3 in [13] and (A.1) together imply that, for all $x \in S$,

$$|W_1(x) - W_2(x)| \le \sup_{a} \sum_{y} p_{xy}(a)|W_1(y) - W_2(y)|.$$

Observing that $|W_1(\cdot) - W_2(\cdot)| \le 2B\ell(\cdot)$, the compactness of $A$ and part (i) together imply that there exists a policy $f \in \mathbb{F}$ such that

$$|W_1(x) - W_2(x)| \le \sum_{y} p_{xy}(f(x))|W_1(y) - W_2(y)| = E_x^f[|W_1(X_1) - W_2(X_1)|], \quad x \in S,$$

and part (ii) yields that $|W_1(\cdot) - W_2(\cdot)| \le |W_1(z) - W_2(z)|$. Then $W_1(\cdot) = W_2(\cdot)$ follows since $W_1(z) = W_2(z) = 0$.

Lemma A.3. Let the functions $W: S \to \mathbb{R}$, and $W_n: S \to \mathbb{R}$, $n \in \mathbb{N}$, be such that for some constant $B > 0$, $|W(\cdot)| \le$
