STRUCTURED SOLUTIONS FOR STOCHASTIC CONTROL PROBLEMS*

Emmanuel Fernández-Gaucherand†, Steven I. Marcus§ and Aristotle Arapostathis‡

Key Words: Stochastic Control, Controlled Markov Processes, Structured Solutions.

ABSTRACT

We consider the discrete-time stochastic control problem for controlled Markov processes (CMP), with an average cost criterion. We show how structural properties in the model can be used to obtain a functional characterization of optimal values and policies, in the form of an average cost optimality equality (ACOE). In particular, convex CMP are defined as those models for which the (discounted) value functions are convex. This convexity is used to obtain the ACOE as a limit of the corresponding discounted optimality equations, as the discounting vanishes, i.e. as the discount factor tends to one. We further comment on the potential algorithmic impact of this and other structured solutions.

* This work was supported in part by the Texas Advanced Technology Program under Grant No. 003658-093, in part by the Air Force Office of Scientific Research under Grant AFOSR-91-0033, and in part by the National Science Foundation under Grant CDR-8803012.
† Systems and Industrial Engineering Department, The University of Arizona, Tucson, Arizona 85721. Email: [email protected].
§ Department of Electrical Engineering and Systems Research Center, The University of Maryland, College Park, Maryland 20742. Email: [email protected].
‡ Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712-1084. Email: [email protected].

I. Introduction

A Controlled Markov Process (CMP) is a discrete-time stochastic dynamical system specified by the five-tuple $(X, U, \mathcal{U}, P, c)$, where $X$ is the state space; $U$ is the action, or control, space; $\mathcal{U}(x) \subseteq U$ is the set of feasible actions (or control inputs) when the system is in state $x \in X$; each pair $(x,u)$ in $X \times U$ determines a transition law $P(\cdot \mid x, u)$; and $c : X \times U \to \mathbb{R}$ is the one-stage cost function. See [ABFGM], [BS], [HLM1] for more details. A control strategy, or policy, is a rule $\pi$ for making decisions, based on the available information. At a given time $t$, the available information is the set $h_t$ of observed states and actions taken up to that time, i.e. $h_t = (X_0, U_0, X_1, \ldots, U_{t-1}, X_t)$. Each policy $\pi$ incurs a stream of costs $\{c(X_0, U_0), c(X_1, U_1), \ldots\}$. Depending upon the problem requirements, different criteria can be used to evaluate the performance of the system under the policy used. The following criteria are frequently used in many diverse application areas. In the equations below, $E_x^\pi$ denotes the expectation operator under the policy $\pi$, when $X_0 = x$.
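For concreteness, a finite CMP can be written down directly from the five-tuple. The following minimal Python sketch encodes a hypothetical two-state, two-action example (all numerical values are our own illustration, not taken from the text); it is reused in the computational sketches below.

```python
import numpy as np

# A hypothetical finite CMP (X, U, U(x), P, c) with two states and two actions.
X = [0, 1]                         # state space
U = [0, 1]                         # action space
feasible = {0: [0, 1], 1: [0, 1]}  # U(x): feasible actions at each state x

# Transition law P(y | x, u), stored as P[x, u, y]; each row sums to one.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])

# One-stage cost c(x, u).
c = np.array([[1.0, 0.5],
              [2.0, 1.5]])
```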

Discounted Cost (DC): For $0 < \beta < 1$, the discount factor, and a policy $\pi$, the total discounted cost incurred by $\pi$ over the infinite planning horizon is given by
\[
J_\beta(x, \pi) := E_x^\pi \left[ \sum_{t=0}^{\infty} \beta^t c(X_t, U_t) \right];
\]
the optimal value function, i.e. the minimum of $J_\beta(x, \pi)$ over all $\pi$, is denoted by $J_\beta^*(x)$.

Average Cost (AC): The expected long-run average cost incurred by the policy $\pi$ is given by
\[
J(x, \pi) := \limsup_{N \to \infty} \frac{1}{N}\, E_x^\pi \left[ \sum_{t=0}^{N-1} c(X_t, U_t) \right];
\]
the optimal average cost is denoted by $J^*(x)$.

The (DC) and (AC) criteria can be seen as two opposite extremes among the criteria that can be considered for infinite horizon problems, in the sense that the first captures primarily the performance of the system at the present and in the near future, due to the discounting, while the second captures only the performance in the distant future; see the comprehensive survey of this problem in [ABFGM]. To obtain a reasonable compromise, one can combine these criteria in a weighted sum. This approach was suggested by Feinberg [FEI], and recently it has been extensively studied by Krass et al. [KFS] for the case of finite state and control sets, and by the authors [FGM] for the situation of general (Borel) spaces.
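For reference, a representative form of such a weighted criterion (our notation; see [FEI], [KFS], [FGM] for the precise formulations studied there) is the convex combination
\[
J_\lambda(x, \pi) := \lambda\, J_\beta(x, \pi) + (1 - \lambda)\, J(x, \pi), \qquad \lambda \in [0, 1],
\]
which recovers the (DC) criterion at $\lambda = 1$ and the (AC) criterion at $\lambda = 0$.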

II. The Stochastic Control Problem

As is well known [ABFGM], [BS], [HLM1], the solution of the infinite horizon stochastic control problem under the above criteria, i.e. the functional characterization and computation of optimal values and policies, is related to the following dynamic programming-like functional equations.

The Discounted Cost Optimality Equation (DCOE):
\[
J_\beta^*(x) = \inf_{u \in \mathcal{U}(x)} \left\{ c(x,u) + \beta \int_X J_\beta^*(y)\, P(dy \mid x, u) \right\} = T_\beta(J_\beta^*)(x), \qquad x \in X,
\]
where the operator $T_\beta(\cdot)$ is defined in the obvious way.
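To illustrate the computational role of the DCOE: in the bounded-cost case $T_\beta$ is a contraction with modulus $\beta$, so fixed-point iteration converges to $J_\beta^*$. The sketch below runs value iteration on the hypothetical finite CMP defined in the Introduction; it is our own illustration, not an algorithm taken from the text.

```python
import numpy as np

def value_iteration(P, c, feasible, beta=0.95, tol=1e-10):
    """Iterate J <- T_beta(J); the fixed point solves the DCOE."""
    n = c.shape[0]
    J = np.zeros(n)
    while True:
        Q = c + beta * (P @ J)  # Q[x, u] = c(x,u) + beta * sum_y P(y|x,u) J(y)
        J_new = np.array([min(Q[x, u] for u in feasible[x]) for x in range(n)])
        if np.max(np.abs(J_new - J)) < tol:
            # A stationary policy attaining the infimum in the DCOE:
            policy = {x: min(feasible[x], key=lambda u: Q[x, u]) for x in range(n)}
            return J_new, policy
        J = J_new
```

With the toy data above, `value_iteration(P, c, feasible)` returns an approximation of $J_\beta^*$ and a minimizing stationary policy; the error contracts geometrically at rate $\beta$.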

The Average Cost Optimality Equation (ACOE):
\[
\rho(x) + h(x) = \inf_{u \in \mathcal{U}(x)} \left\{ c(x,u) + \int_X h(y)\, P(dy \mid x, u) \right\} = T(h)(x), \qquad x \in X,
\]

where the operator $T(\cdot)$ is defined in the obvious way. Among the most important consequences deriving from these optimality equations are that (under some conditions) they provide an optimality criterion for actions: minimizing actions are optimal; they allow iterative schemes to compute optimal values (value iteration); and they allow algorithmic methods to compute and improve decision rules (policy iteration). When a (DC) criterion is used, a rather complete theory is available [BE], [BS], [DY], [HLM1], [KV]. For the average cost, one looks for conditions under which appropriate solutions to the ACOE exist. A solution is a pair $(\rho(\cdot), h(\cdot))$ of real-valued functions on $X$. There is a vast literature concerning the problem of existence and functional characterization of average cost optimal policies when the state space $X$ is countable and/or the one-stage cost function $c(\cdot,\cdot)$ is bounded [ABFGM], [BE], [HLM1]. However, this is not the case for the situation when the state space is a general (Borel) space, e.g. $X = \mathbb{R}^n$, and the one-stage cost function is unbounded. Necessary and sufficient conditions for a bounded solution to the ACOE have been recently given by the authors [FAM1]. However, this type of solution is not natural for problems involving an infinite number of states and an unbounded cost function. Recently, much research activity has been devoted to finding conditions for the functional characterization and existence of average optimal values and policies for the case of unbounded cost functions. Sennott [SEN1]-[SEN3] treated the situation of countable state space and finite action set, and Hernández-Lerma and Lasserre [HLL], [HLM2] extended these results to a general space setting. However, in these references the authors only show existence of solutions to an average cost optimality inequality (ACOI):

\[
\rho(x) + h(x) \geq \inf_{u \in \mathcal{U}(x)} \left\{ c(x,u) + \int_X h(y)\, P(dy \mid x, u) \right\}.
\]

Actually, it has been recently shown in [CC] that strict inequality is possible under the conditions in [HLL] and [SEN2]. The fact that equality is not shown prevents one from, e.g., quantifying the deviation from optimality for a policy $\pi$ via Mandl's discrepancy function [ABFGM, Theorem 6.3], [SM]; this discrepancy function is very important in order to, e.g., analyze the performance of adaptive schemes [FAM2], [HLM1], [SM]. Also, policy improvement in a policy iteration algorithm [TIJMS, Sect. 3.2] cannot be implemented if only an inequality result is available. In essence, the problem derives from the fact that the left-hand side of the ACOI is not guaranteed to equal the term on the right at optimal (minimizing) actions, and may be strictly greater than it. Thus, although the ACOI gives a criterion for the existence of stationary average cost optimal policies, it is not useful from an algorithmic standpoint. Hence, the following question is very relevant: What useful properties, shared by large and important problem classes, can be used to further show that an ACOE holds, and how can these properties be exploited to aid in the development of tractable algorithmic solutions? We address the above question by concentrating on structured solutions to stochastic control models. By a structured solution we mean a model for which value functions and/or optimal policies have some special dependence on the (initial) state. For linear stochastic control problems, structural results have played a very important role, both theoretically and algorithmically, e.g. the celebrated LQG case. For more general (nonlinear) models like the ones we propose to study, structured solutions have been established and exploited only for specific models, but not for broad model classes that may share a unifying structure [HIN1]-[HIN2], [TIJMS]. Some useful structural properties for value functions and policies are:
(i) monotonicity of the value functions (with respect to an appropriate order relation) in the state; this can be used to compute numerical approximations that monotonically interpolate the functions using a finite grid, and monotonicity with respect to the actions can lead to the search for optimal actions among the largest elements of $\mathcal{U}(x)$ only; (ii) convexity of value functions in the state and actions leads likewise to convex interpolations, and also to the search for optimal actions among the extreme points of $\mathcal{U}(x)$ (assumed to be a convex set). Although these ideas have been the object of some research [HIN1]-[HIN2], there is nevertheless still much to be done in the development and actual implementation of analytical and algorithmic solutions that can accommodate large problem classes, as well as in the formulation of verifiable and unifying conditions that induce the required convex or monotone properties.

III. Convex Controlled Markov Processes

The use of, e.g., convexity of value functions in novel ways can further help in the analytical study of the infinite horizon stochastic control problem under an average cost criterion. In [FG], it was shown how convexity properties of the discounted value function can be used to give a partial answer to the question previously posed. The main idea is to be able to extract an appropriately convergent subsequence, as $\beta \uparrow 1$, of the differential discounted value functions
\[
h_\beta(x) := J_\beta^*(x) - J_\beta^*(\bar{x}), \qquad \forall x \in X,
\]
where $\bar{x} \in X$ (the reference state) is kept fixed, so that the ACOE can be obtained by taking limits in the DCOE (see [ABFGM, Sect. 6]). If $\{h_\beta(\cdot)\}$ is bounded, uniformly in $\beta$, and equicontinuous, then the Arzelà-Ascoli Theorem can be used to properly take the required limits. This idea was studied by Ross [RO]; however, the required uniform boundedness and equicontinuity conditions are very difficult to show in general, and most likely do not hold for problems with an uncountable state space and unbounded costs. It has been shown in [FG] that, under a convexity condition on the discounted value functions, local uniform boundedness and local equicontinuity properties can be shown for $\{h_\beta(\cdot)\}$, and thus the Arzelà-Ascoli Theorem can be successfully used to obtain the ACOE by taking limits in the DCOE. The framework is the following. Let $X$ be an open convex subset of a separable Banach space, e.g. $\mathbb{R}^n$. In particular, $X$ is a Borel space.

Definition: If the value functions $J_\beta^*(\cdot)$ are convex functions, then we say that $(X, U, \mathcal{U}, P, c)$ is a convex CMP.
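A familiar instance, anticipating the Remark below: for a linear model with convex costs, say
\[
X_{t+1} = A X_t + B U_t + W_t, \qquad c(x,u) = x^\top Q x + u^\top R u, \quad Q \succeq 0, \ R \succ 0,
\]
with $\{W_t\}$ i.i.d. and zero-mean, the discounted value functions are convex (indeed quadratic plus a constant) in the state [BE], so the model is a convex CMP.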


The next assumption is quite standard, cf. [HLL], [SEN2].

Assumption A: For each $x \in X$, $P(B \mid x, u)$ is continuous in $u \in \mathcal{U}(x)$, for each (Borel) set $B \subseteq X$; there exist a non-negative, upper semicontinuous function $b : X \to \mathbb{R}$, a constant $M \geq 0$, and a sequence $\{\beta_n\} \subset (0,1)$ with $\beta_n \uparrow 1$, such that for all $x \in X$:

(i) $J_{\beta_n}^*(x) < \infty$;
(ii) $-M \leq h_{\beta_n}(x) \leq b(x)$;
(iii) $\int_X b(y)\, P(dy \mid x, u) < \infty, \qquad \forall u \in \mathcal{U}(x)$.

The next result is a simple consequence of Lemma 4.3 in [HLL]; see also [FG, Lemma 4.1].

Lemma: There exists a constant $\rho^*$ and a subsequence $\beta_{n'} \uparrow 1$ of $\{\beta_n\}$ such that
\[
\lim_{n' \to \infty} (1 - \beta_{n'})\, J_{\beta_{n'}}^*(x) = \rho^*, \qquad \forall x \in X.
\]
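On a finite model the Lemma can be observed directly: $(1-\beta) J_\beta^*(x)$ stabilizes at a constant $\rho^*$ as $\beta \uparrow 1$, while $h_\beta$ converges pointwise. A sketch, reusing the hypothetical `value_iteration` routine and toy CMP from above:

```python
x_bar = 0  # fixed reference state
for beta in [0.9, 0.99, 0.999, 0.9999]:
    # Value iteration slows down as beta -> 1, but remains exact in the limit of tol.
    J_beta, _ = value_iteration(P, c, feasible, beta=beta, tol=1e-8)
    h_beta = J_beta - J_beta[x_bar]        # differential discounted value h_beta(x)
    rho_est = (1 - beta) * J_beta[x_bar]   # approaches rho* as beta -> 1
    print(f"beta={beta}: rho_est={rho_est:.6f}, h_beta={h_beta}")
```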

Finally, we assume the required convexity properties.

Assumption B: $J_{\beta_{n'}}^*(\cdot)$ is a convex function.

Then, the results in [HLL], [HLM2], [SEN2] can be strengthened as follows.

Theorem: Under Assumptions A-B, we have that (i) there is a constant $\rho^*$ and a convex function $h : X \to \mathbb{R}$ with
\[
-M \leq h(x) \leq b(x), \qquad \forall x \in X,
\]
such that the pair $(\rho^*, h)$ is a solution to the ACOE, i.e.,
\[
\rho^* + h(x) = \inf_{u \in \mathcal{U}(x)} \left\{ c(x,u) + \int_X h(y)\, P(dy \mid x, u) \right\}, \qquad x \in X; \tag{1}
\]
(ii) there exists a stationary deterministic policy $\pi^*$ which is average optimal, and every stationary deterministic policy $\pi$ attaining the infimum in (1) is average optimal; (iii) $J^*(x) = \rho^*$, for all $x \in X$.


Proof: We make the usual semicontinuity assumptions on the transition kernel, and nonnegativity of the cost function [ABFGM], [BS], [DY]. In addition, we assume that the cost function has the compact level sets (CLS) property, i.e. for each $\lambda \in \mathbb{R}$, the set $\{(x,u) \mid c(x,u) \leq \lambda\}$ is compact. Then, under these conditions it can be shown that: (a) the DCOE holds; (b) a deterministic stationary policy is discount optimal if and only if it attains the infimum in the DCOE; and (c) one such policy exists [ABFGM], [BS], [FG], [HLM1].

(i) By Assumption B, $h_{\beta_{n'}}(\cdot)$ is convex, and thus it exhibits a local Lipschitz property, as follows. Let $x_0 \in X$ be given, and let $\varepsilon > 0$ be small enough so that
\[
B(x_0, 2\varepsilon) := \{ y' \in X \mid \|x_0 - y'\| \leq 2\varepsilon \}
\]
is contained in $X$, and let $x, y \in B(x_0, \varepsilon) := \{ y' \in X \mid \|x_0 - y'\| \leq \varepsilon \}$. Since $h_{\beta_{n'}}(\cdot)$ is convex and, by Assumption A(ii), bounded on $B(x_0, 2\varepsilon)$ uniformly in $n'$, it is Lipschitz on $B(x_0, \varepsilon)$ with a Lipschitz constant independent of $n'$ [ROY]. Hence $\{h_{\beta_{n'}}(\cdot)\}$ is locally uniformly bounded and locally equicontinuous, and by the Arzelà-Ascoli Theorem a further subsequence (not relabeled) converges, uniformly on compact sets, to a convex function $h : X \to \mathbb{R}$ with $-M \leq h(x) \leq b(x)$. Now let $\pi_{n'}^*$ denote a deterministic stationary discount optimal policy for the discount factor $\beta_{n'}$. Then, given $x \in X$ and $\varepsilon > 0$, there is $N(\varepsilon, x)$ such that for all $n' \geq N(\varepsilon, x)$:

\[
(1 - \beta_{n'})\, J_{\beta_{n'}}^*(\bar{x}) + h_{\beta_{n'}}(x) = c(x, \pi_{n'}^*) + \beta_{n'} \int_X h_{\beta_{n'}}(y)\, P(dy \mid x, \pi_{n'}^*) \tag{4}
\]
\[
\leq \rho^* + h(x) + \varepsilon,
\]
where, e.g., $c(x, \pi_{n'}^*) := c(x, \pi_{n'}^*(x))$. Recall that $X$ is an open set, and that $h_{\beta_{n'}}(\cdot)$ is convex and bounded below; hence $h_{\beta_{n'}} \in \mathcal{L}(X)$, where $\mathcal{L}(X)$ denotes the set of lower semicontinuous and bounded below functions. Let
\[
U_{n'}(x, \varepsilon) := \left\{ u \in \mathcal{U}(x) \,\middle|\, c(x,u) + \beta_{n'} \int_X h_{\beta_{n'}}(y)\, P(dy \mid x, u) \leq \rho^* + h(x) + \varepsilon \right\}.
\]
Then, by semicontinuity and condition (CLS), $U_{n'}(x, \varepsilon)$ is closed, and $\pi_{n'}^*(x) \in U_{n'}(x, \varepsilon)$, for all $n' \geq N(\varepsilon, x)$. Furthermore, since $h_{\beta_{n'}}(\cdot) \geq -M$, we have that
\[
U_{n'}(x, \varepsilon) \subseteq \{ u \in \mathcal{U}(x) \mid c(x,u) \leq \rho^* + h(x) + \varepsilon + M \},
\]
a set which is compact by (CLS). Extracting a limit point $\pi^*(x)$ of $\{\pi_{n'}^*(x)\}$ and taking limits in (4), it follows that the pair $(\rho^*, h)$ satisfies the ACOE (1), with the stationary deterministic policy $\pi^*$ attaining the infimum. Iterating (1) under $\pi^*$ and using $h(\cdot) \geq -M$, we obtain, for each $N$,
\[
N \rho^* \geq \sum_{t=0}^{N-1} E_x^{\pi^*} \left[ c(X_t, U_t) \right] - \left[ M + h(x) \right].
\]

Dividing by $N$ and taking the limit superior, we obtain

\[
J(x, \pi^*) \leq \rho^* \leq J^*(x) \leq J(x, \pi^*),
\]
and (ii) and (iii) follow. $\Box$

Remark: Work related to the developments above is that on linear quadratic control problems, where convexity of value functions is obtained [BE]; see also [FG] and [HLM2]. Also, Dynkin [DYN] introduced the concept of stochastic concave dynamic programming, but his interest was in concavity properties of one-stage cost functions in the control actions, and its implications for the existence of minimizing actions.

The assumption that $J_{\beta_{n'}}^*(\cdot)$ is convex is the cornerstone of our developments. Therefore, in order to envision the type of problems for which our results may be applicable, it is necessary to identify conditions under which such convexity is obtained; see [HIN1] for some conditions of this nature. One natural approach to the above problem is to use induction via the value iteration scheme [ABFGM], [BS], [HLM1] to propagate the desired property [BTE], [HIN1], [STI]:

• First, for the null function $f(x) = 0$, for all $x \in X$, conditions on $c(\cdot,\cdot)$ are given such that
\[
J_\beta^{(1)}(x) = \inf_{u \in \mathcal{U}(x)} \{ c(x,u) \}, \qquad \forall x \in X,
\]
is convex (concave).


• Next, for the inductive step, assuming that $J_\beta^{(k)}(\cdot)$ is convex (concave), give conditions for the dynamic programming map $T_\beta(\cdot)$ to preserve this property.
• Finally, convexity (concavity) of $J_\beta^*(\cdot)$ follows by the convergence of value iteration to the optimal value function; a numerical illustration of this propagation is sketched below.
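As a heuristic numerical companion to this induction (our own toy example: scalar linear dynamics, convex costs, a bounded action interval, and a state grid; the discretization only approximates $T_\beta$), the sketch below applies $T_\beta$ repeatedly and tests discrete convexity of each iterate:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy convex CMP (an illustrative assumption, not from the text):
# X_{t+1} = 0.5*x + u + W, c(x,u) = x**2 + u**2, W uniform on {-0.5, 0, 0.5}.
grid = np.linspace(-10.0, 10.0, 201)   # state grid; successors stay inside it
noise = [-0.5, 0.0, 0.5]
beta = 0.9

def T(J):
    """One application of T_beta on the grid; J is evaluated between grid
    points by linear interpolation, which preserves convexity."""
    def q(x):
        return minimize_scalar(
            lambda u: x**2 + u**2 + beta * np.mean(
                [np.interp(0.5 * x + u + w, grid, J) for w in noise]),
            bounds=(-1.0, 1.0), method="bounded").fun
    return np.array([q(x) for x in grid])

def discretely_convex(J, tol=1e-6):
    # Discrete convexity: second differences non-negative, up to tolerance.
    return bool(np.all(np.diff(J, 2) >= -tol))

J = np.zeros_like(grid)          # null initial function f = 0
for k in range(1, 21):
    J = T(J)                     # value iteration converges to J*_beta
    print(k, discretely_convex(J))
```

Since exact partial minimization over the convex action interval preserves convexity, each iterate should test convex; this is only a sanity check on a toy model, not a proof.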

REFERENCES

[ABFGM] A. Arapostathis, V. Borkar, E. Fernández-Gaucherand, M.K. Ghosh and S.I. Marcus, Controlled Markov Processes with an Average Cost Criterion: A Survey, SIE Working Paper #91-032 (submitted to SIAM Journal on Control & Optimization).
[BE] D.P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, 1987.
[BS] D.P. Bertsekas and S.E. Shreve, Stochastic Optimal Control: The Discrete Time Case, Academic Press, New York, 1978.
[BTE] F.J. Beutler and D. Teneketzis, Routing in Queueing Networks Under Imperfect Information: Stochastic Dominance and Thresholds, Stochastics and Stochastics Reports, 26 (1989) 81-100.
[CC] R. Cavazos-Cadena, A Counterexample on the Optimality Equation in Markov Decision Chains with the Average Cost Criterion, Syst. Control Lett. 16 (1991) 387-392.
[DY] E.B. Dynkin and A.A. Yushkevich, Controlled Markov Processes, Springer Verlag, New York, 1979.
[DYN] E.B. Dynkin, Stochastic Concave Dynamic Programming, Math. USSR Sbornik 16 (1972) 501-515.
[FAM1] E. Fernández-Gaucherand, A. Arapostathis and S.I. Marcus, Remarks on the Existence of Solutions to the Average Cost Optimality Equation in Markov Decision Processes, Syst. Control Lett. 15 (1990) 425-432.
[FAM2] E. Fernández-Gaucherand, A. Arapostathis and S.I. Marcus, Analysis of an Adaptive Control Scheme for a Partially Observed Controlled Markov Chain, SIE Working Paper #91-038 (under revision for IEEE Transactions on Automatic Control).
[FEI] E.A. Feinberg, Controlled Markov Processes with Arbitrary Numerical Criteria, Theory Probab. Appl. 27 (1982) 486-503.
[FG] E. Fernández-Gaucherand, Controlled Markov Processes on the Infinite Planning Horizon: Optimal & Adaptive Control, Ph.D. Thesis, The University of Texas at Austin, August 1991.
[FGM] E. Fernández-Gaucherand, M.K. Ghosh and S.I. Marcus, Controlled Markov Processes in the Infinite Planning Horizon with a Weighted Cost Criterion, to appear in the Proceedings of the IV Latin American Congress in Probability and Mathematical Statistics, México City, México, 1990.
[HIN1] K.F. Hinderer, On the Structure of Solutions of Stochastic Dynamic Programs, in: Proc. 7th Conf. on Probability Theory, Brasov, Romania, (1984) 173-182.
[HIN2] K.F. Hinderer, Increasing Lipschitz Continuous Maximizers of Some Dynamic Programs, Annals of Operations Research, 29 (1991) 565-586.
[HLL] O. Hernández-Lerma and J.B. Lasserre, Average Cost Optimal Policies for Markov Control Processes with Borel State Space and Unbounded Costs, Syst. Control Lett. 15 (1990) 349-356.
[HLM1] O. Hernández-Lerma, Adaptive Markov Control Processes, Springer Verlag, New York, 1989.
[HLM2] O. Hernández-Lerma, Average Optimality in Dynamic Programming on Borel Spaces: Unbounded Costs and Controls, Syst. Control Lett. 17 (1991) 237-242.
[KFS] D. Krass, J.A. Filar and S. Sinha, A Weighted Markov Decision Process, to appear in Operations Research.
[KV] P.R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs, 1986.
[RO] S.M. Ross, Arbitrary State Markovian Decision Processes, Ann. Math. Stat. 39 (1968) 2118-2122.
[ROY] H.L. Royden, Real Analysis, 2nd ed., Macmillan, New York, 1968.

[SEN1] L.I. Sennott, A New Condition for the Existence of Optimal Stationary Policies in Average Cost Markov Decision Processes, Oper. Res. Lett. 5 (1986) 17-23.
[SEN2] L.I. Sennott, Average Cost Optimal Stationary Policies in Infinite State Markov Decision Processes with Unbounded Costs, Oper. Res. 37 (1989) 626-633.
[SEN3] L.I. Sennott, Average Cost Semi-Markov Decision Processes and the Control of Queueing Systems, Probab. in Eng. & Info. Sci. 3 (1989) 247-272.
[SM] A. Shwartz and A.M. Makowski, Comparing Policies in Markov Decision Processes: Mandl's Lemma Revisited, Math. Oper. Res. 15 (1990) 155-174.
[STI] S. Stidham, Scheduling, Routing, and Flow Control in Stochastic Networks, in: Stochastic Differential Systems, Stochastic Control Theory and Applications, W. Fleming and P.L. Lions, Eds., The IMA Volumes in Mathematics and Its Applications 10, 529-561, Springer Verlag, Berlin, 1988.
[TIJMS] H.C. Tijms, Stochastic Modelling and Analysis: A Computational Approach, John Wiley, Chichester, 1986.