IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 51, NO. 9, SEPTEMBER 2006

A Policy Improvement Method in Constrained Stochastic Dynamic Programming

Hyeong Soo Chang

Abstract—This note presents a formal method of improving a given base-policy such that the performance of the resulting policy is no worse than that of the base-policy at all states in constrained stochastic dynamic programming. We consider finite horizon and discounted infinite horizon cases. The improvement method induces a policy iteration-type algorithm that converges to a local optimal policy.

Index Terms—Constrained Markov decision process, dynamic programming, policy improvement, policy iteration.

Manuscript received May 24, 2005; revised April 4, 2006. Recommended by Associate Editor S. Dey. This work was supported by the Ministry of Commerce, Industry and Energy under the 21st Century Frontier Program (Intelligent Robot Project) in 2006. The author is with the Department of Computer Science and Engineering, and also affiliated with the Program of Integrated Biotechnology, Sogang University, Seoul 121-742, Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/TAC.2006.880801

I. INTRODUCTION

Consider a Markov decision process (MDP) [8] $M = (X, A, P, C)$ with a finite state set $X$, a finite admissible action set $A(x)$, $x \in X$, a nonnegative cost function $C : X \times A(X) \to \mathbb{R}^+$, and a transition function $P$ that maps $\{(x,a) \mid x \in X, a \in A(x)\}$ to the set of probability distributions over $X$. We denote the probability of making a transition to state $y \in X$ when taking action $a \in A(x)$ at state $x \in X$ by $P^a_{xy}$. Let $\Pi$ be the set of all Markovian (history-independent) nonstationary deterministic policies $\pi = \{\pi_0, \ldots, \pi_{H-1}\}$ over a finite horizon $H$ with $\pi_t : X \to A(X)$, $t = 0, \ldots, H-1$. Define the objective function value of a policy $\pi \in \Pi$ over $H$ with an initial state $x \in X$ as

$V_H^\pi(x) = E\left[ \sum_{t=0}^{H-1} \alpha^t C(X_t, \pi_t(X_t)) \,\middle|\, X_0 = x \right], \quad x \in X$  (1)

where $X_t$ is a random variable denoting the state at time $t$ and $\alpha \in (0,1]$ is a discount factor.

The MDP is associated with a constraint threshold function $\phi$ defined over $X$ and a constraint cost function $D : X \times A(X) \to \mathbb{R}^+$, where each feasible policy $\pi \in \Pi$ needs to satisfy the constraint inequality $J_H^\pi(x) \le \phi(x)$, $x \in X$, where the constraint function value of $\pi$ is defined such that

$J_H^\pi(x) = E\left[ \sum_{t=0}^{H-1} \beta^t D(X_t, \pi_t(X_t)) \,\middle|\, X_0 = x \right], \quad x \in X$  (2)

for a discount factor $\beta \in (0,1]$, with $X_t$ being a random variable denoting the state at time $t$. We assume that $V_0^\pi(x) = J_0^\pi(x) = 0$ for all $\pi \in \Pi$ and $x \in X$, and that $\phi(x) \in [\min_{\pi \in \Pi} J_H^\pi(x), \infty)$ for all $x \in X$.

This note considers the problem of designing a policy $\tilde{\pi}$ that improves a given feasible base-policy $\pi$ such that $V_H^{\tilde{\pi}}(x) \le V_H^\pi(x)$, $x \in X$, while satisfying $J_H^{\tilde{\pi}}(x) \le \phi(x)$, $x \in X$.

Recently, Chen and Blankenship [4] considered the constrained stochastic control problem of obtaining an optimal policy $\pi^*$ that achieves $\min_{\pi \in \Pi_\phi} V_H^\pi(x)$, $x \in X$, where $\Pi_\phi = \{\pi \mid \pi \in \Pi, J_H^\pi(x) \le \phi(x), \forall x \in X\}$, and provided dynamic programming (DP) equations satisfied by $\pi^*$ with an extension of the policy space by a threshold embedding. Several works also exist for characterizing the optimal policies of constrained infinite-horizon stochastic DP problems (see, e.g., [1], [5], and [6]). However, to the author's best knowledge, there is still no formal work that deals with policy improvement (improving a given base-policy such that the performance of the resulting policy is no worse than that of the base-policy at all states) in constrained (finite or infinite horizon) stochastic DP, as also noted in [2]. Bertsekas [2] studied rollout algorithms for improving a given base-policy in constrained deterministic DP, but the algorithms do not admit a straightforward extension to stochastic problems, as mentioned in [2]. Related works are policy improvement methods in vector-valued MDPs [9], [7], where a decision maker obtains a vector-valued cost at each decision time. The improvement methods in [9] and [7] are not directly cast into our problem setup, but as in those methods, it may be possible to consider a partial order on the policy space for a given "preference" relation. However, this note's goal is to design a policy that improves on a given policy at all states.
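The finite-horizon values (1) and (2), together with the partial values $V^\pi_n$ and $J^\pi_n$ used in the next section, can be computed by backward induction with $V^\pi_0 = J^\pi_0 = 0$. The sketch below is not part of the note; it is a minimal illustration assuming a tabular MDP stored as NumPy arrays, where P[a] is the transition matrix of action a, C and D are state-by-action cost arrays, and pi[t][x] is the action index prescribed at time t in state x (all of these names and layouts are hypothetical).

import numpy as np

def evaluate_policy(P, C, D, pi, alpha, beta):
    """Backward induction for the n-horizon objective values V[n] and constraint
    values J[n] of a nonstationary policy pi (pi[t][x] = action at time t, state x)."""
    H, num_states = len(pi), C.shape[0]
    V = np.zeros((H + 1, num_states))   # V[0] = 0, as assumed in the note
    J = np.zeros((H + 1, num_states))   # J[0] = 0
    for n in range(1, H + 1):
        t = H - n                       # the decision rule pi[t] governs the last n stages
        for x in range(num_states):
            a = pi[t][x]
            V[n, x] = C[x, a] + alpha * P[a][x] @ V[n - 1]
            J[n, x] = D[x, a] + beta * P[a][x] @ J[n - 1]
    return V, J                         # V[H], J[H] are the values in (1) and (2)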

II. POLICY IMPROVEMENT METHODS

Suppose that a sequence of nonnegative real numbers $\{\epsilon_n, n = 0, \ldots, H\}$ is given such that $\epsilon_{n-1} \le \epsilon_n$ for all $n = 1, \ldots, H$, with $\epsilon_0 = 0$. Define the action set $\tilde{F}_n(x)$, $n = 1, \ldots, H$, as

$\tilde{F}_n(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} J^\pi_{n-1}(y) + \epsilon_{n-1} \le J^\pi_n(x) + \epsilon_n \right\}, \quad x \in X.$  (3)

Note that, from the property of the sequence $\{\epsilon_n, n = 0, \ldots, H\}$, $\pi_{H-n}(x) \in \tilde{F}_n(x)$ for all $x \in X$. Define a policy $\tilde{\pi} \in \Pi$ as

$\tilde{\pi}_{H-n}(x) \in \arg\min_{a \in \tilde{F}_n(x)} \left\{ C(x,a) + \alpha \sum_{y \in X} P^a_{xy} V^\pi_{n-1}(y) \right\}, \quad x \in X$  (4)

for $n = 1, \ldots, H$.

Lemma 2.1: For a given policy $\pi \in \Pi$ such that $J^\pi_H(x) \le \phi(x)$, $x \in X$, define a policy $\tilde{\pi} \in \Pi$ as in (4) with a sequence $\{\epsilon_n\}$ such that $\epsilon_{n-1} \le \epsilon_n$, $\epsilon_0 = 0$, $n = 1, \ldots, H$. Then, for any $x \in X$, $J^{\tilde{\pi}}_H(x) \le J^\pi_H(x) + \epsilon_H$ and $V^{\tilde{\pi}}_H(x) \le V^\pi_H(x)$.

Proof: We show by induction on $n$ that for $n = 0, \ldots, H$, $J^{\tilde{\pi}}_n(x) \le J^\pi_n(x) + \epsilon_n$, $x \in X$. The base case is true because $J^{\tilde{\pi}}_0(x) = J^\pi_0(x) = 0$ for all $x \in X$. Assume that for all $x \in X$, $J^{\tilde{\pi}}_{n-1}(x) \le J^\pi_{n-1}(x) + \epsilon_{n-1}$. Then, for any $x \in X$,

$J^{\tilde{\pi}}_n(x) = D(x, \tilde{\pi}_{H-n}(x)) + \beta \sum_{y \in X} P^{\tilde{\pi}_{H-n}(x)}_{xy} J^{\tilde{\pi}}_{n-1}(y)$
$\le D(x, \tilde{\pi}_{H-n}(x)) + \beta \sum_{y \in X} P^{\tilde{\pi}_{H-n}(x)}_{xy} J^\pi_{n-1}(y) + \epsilon_{n-1}$  (by the induction hypothesis and $\beta \le 1$)
$\le J^\pi_n(x) + \epsilon_n$  (because $\tilde{\pi}_{H-n}(x) \in \tilde{F}_n(x)$).

It follows that $J^{\tilde{\pi}}_H(x) \le J^\pi_H(x) + \epsilon_H$, $x \in X$.

For the objective function value, we use a similar induction. The base case is true for $n = 0$. Assume that for all $x \in X$, $V^{\tilde{\pi}}_{n-1}(x) \le V^\pi_{n-1}(x)$. Then, for any $x \in X$,

$V^{\tilde{\pi}}_n(x) = C(x, \tilde{\pi}_{H-n}(x)) + \alpha \sum_{y \in X} P^{\tilde{\pi}_{H-n}(x)}_{xy} V^{\tilde{\pi}}_{n-1}(y)$
$\le C(x, \tilde{\pi}_{H-n}(x)) + \alpha \sum_{y \in X} P^{\tilde{\pi}_{H-n}(x)}_{xy} V^\pi_{n-1}(y)$  (by the induction hypothesis)
$\le C(x, \pi_{H-n}(x)) + \alpha \sum_{y \in X} P^{\pi_{H-n}(x)}_{xy} V^\pi_{n-1}(y) = V^\pi_n(x)$

where the last inequality follows because $\pi_{H-n}(x), \tilde{\pi}_{H-n}(x) \in \tilde{F}_n(x)$ and from the definition of $\tilde{\pi}_{H-n}(x)$. Therefore, $V^{\tilde{\pi}}_H(x) \le V^\pi_H(x)$, $x \in X$.
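A minimal sketch of the improvement step (3)-(4), under the same hypothetical tabular representation as above; A_sets[x] lists the admissible action indices of state x, V and J are the base-policy value arrays from the evaluation sketch, and eps is any nondecreasing slack sequence with eps[0] = 0 (the function name and argument layout are illustrative, not the note's).

def improve_policy(P, C, D, A_sets, pi, V, J, eps, alpha, beta, tol=1e-12):
    """Construct the improved policy of (3)-(4): restrict each state's choices to the
    feasible set F_n(x) induced by the slack sequence eps, then pick the action that
    minimizes the one-stage objective cost plus the base-policy's tail objective value."""
    H, num_states = len(pi), C.shape[0]
    pi_tilde = [list(row) for row in pi]
    for n in range(1, H + 1):
        t = H - n
        for x in range(num_states):
            feasible = [a for a in A_sets[x]
                        if D[x, a] + beta * P[a][x] @ J[n - 1] + eps[n - 1]
                           <= J[n, x] + eps[n] + tol]   # tol guards against round-off
            # The base action pi[t][x] always satisfies the test, so 'feasible' is nonempty.
            pi_tilde[t][x] = min(feasible,
                                 key=lambda a: C[x, a] + alpha * P[a][x] @ V[n - 1])
    return pi_tilde

By Lemma 2.1, the returned policy is no worse than the base-policy at every state and exceeds its constraint value by at most eps[H].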


From Lemma 2.1, we can see that as long as we ensure that $J^\pi_H(x) + \epsilon_H \le \phi(x)$, $x \in X$, the newly defined policy $\tilde{\pi}$ satisfies the constraint and improves the base-policy $\pi$. We now present a sequence $\{\epsilon_n\}$ with the required properties.

Define a sequence $\{\epsilon_n\}$, $n = 0, \ldots, H$, as

$\epsilon_n = \min_{x \in X} \left( L_n(x) - J^\pi_n(x) \right), \quad n = H, \ldots, 1, \qquad \epsilon_0 = 0$  (5)

where $L_H(x) = \phi(x)$, $x \in X$, and for $n = H-1, \ldots, 1$,

$L_n(x) = \max_{a \in G_n(x)} \left\{ D(x,a) + \beta \sum_{y \in X} P^a_{xy} J^\pi_{n-1}(y) \right\}, \quad x \in X$

where the set $G_n(x)$ for $n = H-1, \ldots, 1$ is defined as

$G_n(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} J^\pi_{n-1}(y) \le J^\pi_n(x) + \epsilon_{n+1} \right\}, \quad x \in X.$

The rationale behind the above definitions is as follows. In general, we may consider a sequence of nonnegative functions $\varphi_n$, $n = 0, \ldots, H$, defined over $X$ and define the action set $\tilde{F}_n(x)$, $n = 1, \ldots, H$, as

$\tilde{F}_n(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} \left( J^\pi_{n-1}(y) + \varphi_{n-1}(y) \right) \le J^\pi_n(x) + \varphi_n(x) \right\}, \quad x \in X.$

If for all $x \in X$ and $n = 1, \ldots, H$, $\sum_{y \in X} P^{\pi_{H-n}(x)}_{xy} \varphi_{n-1}(y) \le \varphi_n(x)$, then $\pi_{H-n}(x) \in \tilde{F}_n(x)$ for all $n = 1, \ldots, H$ and $x \in X$. This implies that $V^{\tilde{\pi}}_H(x) \le V^\pi_H(x)$, $x \in X$, and $J^{\tilde{\pi}}_H(x) \le J^\pi_H(x) + \varphi_H(x)$, $x \in X$. We would then consider the slackness $\varphi_H(x) = \phi(x) - J^\pi_H(x)$ to make $J^\pi_H(x) + \varphi_H(x) \le \phi(x)$. The best achievable improvement of $J^\pi_{H-1}(x)$ within the slackness $\varphi_H(x)$ is tempted to be $L_{H-1}(x)$, and the best achievable improvement of $J^\pi_{H-2}(x)$ within the slackness $\varphi_{H-1}(x) = L_{H-1}(x) - J^\pi_{H-1}(x)$ is tempted to be $L_{H-2}(x)$, and this way of reasoning may continue. However, each improvement made at each state at a given horizon contributes to the slackness of a state at the larger horizon in the average sense. In other words, we need to design a sequence of functions $\varphi_n$ such that the condition

$\sum_{y \in X} P^{\pi_{H-n}(x)}_{xy} \varphi_{n-1}(y) \le \varphi_n(x)$

holds for all $x \in X$ and $n = 1, \ldots, H$. This seems difficult to achieve in general. However, notice that for the condition to hold, the following condition must hold:

$\min_{x \in X} \varphi_{n-1}(x) \le \min_{x \in X} \varphi_n(x), \quad n = 1, \ldots, H.$

We can then let $\epsilon_n = \min_{x \in X} \varphi_n(x)$, where $\varphi_n(x) = L_n(x) - J^\pi_n(x)$, $n = 1, \ldots, H$, $\varphi_0(x) = 0$, $\forall x \in X$, with $L_n(x)$ defined as above, and we can use the sequence $\{\epsilon_n\}$ for defining the sets $\{\tilde{F}_n\}$ as we did before in (3).

Theorem 2.1: For a given policy $\pi \in \Pi$ such that $J^\pi_H(x) \le \phi(x)$, $x \in X$, define a policy $\tilde{\pi} \in \Pi$ as in (4) with $\{\epsilon_n\}$, $n = 0, \ldots, H$, defined as in (5). Then, for any $x \in X$,

$J^{\tilde{\pi}}_H(x) \le \phi(x) \quad \text{and} \quad V^{\tilde{\pi}}_H(x) \le V^\pi_H(x).$

Proof: We show by backwards induction that $\epsilon_{n-1} \le \epsilon_n$ with $\epsilon_n \ge 0$ for $n = 1, \ldots, H$. For the case of $n = H$, $\epsilon_H = \min_{x \in X}(L_H(x) - J^\pi_H(x)) = \min_{x \in X}(\phi(x) - J^\pi_H(x)) \ge 0$. Consider

$G_{H-1}(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} J^\pi_{H-2}(y) \le J^\pi_{H-1}(x) + \epsilon_H \right\}, \quad x \in X.$

Because $\pi_1(x)$ is in $G_{H-1}(x)$ from $\epsilon_H \ge 0$, $G_{H-1}(x)$ is a nonempty set. Therefore, from the definition of $L_{H-1}(x)$, $0 \le L_{H-1}(x) - J^\pi_{H-1}(x)$, and for $a^* \in \arg\max_{a \in G_{H-1}(x)} \{ D(x,a) + \beta \sum_{y \in X} P^a_{xy} J^\pi_{H-2}(y) \}$, $x \in X$,

$D(x, a^*) + \beta \sum_{y \in X} P^{a^*}_{xy} J^\pi_{H-2}(y) = L_{H-1}(x) \le J^\pi_{H-1}(x) + \epsilon_H.$

Therefore, $\epsilon_{H-1} \ge 0$ and $\epsilon_{H-1} \le \epsilon_H$.

Recall that $\epsilon_n = \min_{x \in X}(L_n(x) - J^\pi_n(x))$, $n = H, \ldots, 1$, $\epsilon_0 = 0$. Assume that $\epsilon_n \ge 0$ ($n \ne H$). Consider

$G_{n-1}(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} J^\pi_{n-2}(y) \le J^\pi_{n-1}(x) + \epsilon_n \right\}, \quad x \in X.$

Because $\pi_{H-n+1}(x)$ is in $G_{n-1}(x)$, $G_{n-1}(x)$ is a nonempty set. Therefore, from the definition of $L_{n-1}(x)$, $0 \le L_{n-1}(x) - J^\pi_{n-1}(x) \le \epsilon_n$ for any $x \in X$. Therefore, $\epsilon_{n-1} = \min_{x \in X}(L_{n-1}(x) - J^\pi_{n-1}(x)) \le \epsilon_n$ and $\epsilon_{n-1} \ge 0$.

It follows by Lemma 2.1 with the defined sequence $\{\epsilon_n\}$ that, for $x \in X$,

$J^{\tilde{\pi}}_H(x) \le J^\pi_H(x) + \epsilon_H = J^\pi_H(x) + \min_{x \in X}\left( L_H(x) - J^\pi_H(x) \right) \le \phi(x)$

and $V^{\tilde{\pi}}_H(x) \le V^\pi_H(x)$, $x \in X$.

Note that the value of $\epsilon_n$ is determined in a greedy manner. To preserve the constraint value at all states, we need to set $\epsilon_H = \min_{x \in X}(\phi(x) - J^\pi_H(x))$. We could instead have set $\epsilon_{H-1}$ to $\max_{x \in X}(L_{H-1}(x) - J^\pi_{H-1}(x))$, but this would reduce the search space at horizon $H$ because we would then have a smaller $\tilde{F}_H$ due to the harder constraint. Therefore, we set $\epsilon_{H-1} = \min_{x \in X}(L_{H-1}(x) - J^\pi_{H-1}(x))$. This argument applies to the other horizons subsequently.
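A sketch of computing the slack sequence (5) under the same hypothetical representation; it works backwards from eps[H] through the sets $G_n(x)$ and the quantities $L_n(x)$, with phi an array holding the constraint thresholds (again, the names are illustrative).

import numpy as np

def compute_eps(P, D, A_sets, J, phi, beta):
    """Backward computation of the slack sequence (5): eps[H] = min_x (phi(x) - J[H,x]),
    then eps[n] = min_x (L_n(x) - J[n,x]) with L_n taken as a max over G_n(x)."""
    H, num_states = J.shape[0] - 1, J.shape[1]
    eps = np.zeros(H + 1)               # eps[0] stays 0 by definition
    eps[H] = np.min(phi - J[H])
    for n in range(H - 1, 0, -1):
        L_n = np.empty(num_states)
        for x in range(num_states):
            # G_n(x): actions whose constraint backup stays within eps[n+1] of J[n,x].
            G = [a for a in A_sets[x]
                 if D[x, a] + beta * P[a][x] @ J[n - 1] <= J[n, x] + eps[n + 1]]
            L_n[x] = max(D[x, a] + beta * P[a][x] @ J[n - 1] for a in G)
        eps[n] = np.min(L_n - J[n])
    return eps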

The more intuitive policy improvement method can be obtained by setting $\epsilon_n = 0$, $n = 0, \ldots, H$, which is immediate from the result of Lemma 2.1. The resulting method is conservative in that we first obtain a candidate action set associated with each state that preserves the constraint value satisfied by the base-policy at each state, and then we generate an improved policy for the objective value from the candidate set. For a policy $\hat{\pi} \in \Pi$ defined as

$\hat{\pi}_{H-n}(x) \in \arg\min_{a \in \hat{F}_n(x)} \left\{ C(x,a) + \alpha \sum_{y \in X} P^a_{xy} V^\pi_{n-1}(y) \right\}, \quad x \in X$  (6)

for $n = 1, \ldots, H$, with the set $\hat{F}_n(x)$ for $n = 1, \ldots, H$ obtained by letting $\epsilon_n = 0$ for all $n$ in $\tilde{F}_n(x)$, we have that for any $x \in X$, $J^{\hat{\pi}}_H(x) \le \phi(x)$ and $V^{\tilde{\pi}}_H(x) \le V^{\hat{\pi}}_H(x) \le V^\pi_H(x)$ because $\hat{F}_n(x) \subseteq \tilde{F}_n(x)$, $x \in X$.

The rationale of the previous methods can be related to the extension idea of the rollout algorithm for MDPs by Bertsekas [2]. Roughly, at each time $t = 0, 1, \ldots, H-1$, the (extended) algorithm considers replacing the action prescribed by the given base-policy at the state at time $t$ by generating a set of candidate actions that can still preserve the given constraint. The algorithm simply chooses the action that minimizes the objective function value of taking each action in the candidate set at time $t$ and following the base-policy from time $t+1$. Our methods are not a direct generalization of his rollout algorithms for constrained deterministic DP, but the idea of generating a feasible action set at each horizon (time) and improving the base-policy from that set is similar.

We can design a policy iteration-type algorithm from the policy improvement method considered here. However, the algorithm will converge only to a local optimal policy, unlike policy iteration based on Howard's policy improvement [8] for MDPs. We generate a sequence of policies $\{\pi^0, \pi^1, \ldots, \pi^n\}$ such that $\pi^k$ is generated from $\pi^{k-1}$ by applying the policy improvement method here. The initial policy $\pi^0$ needs to satisfy the constraint function $\phi$. One way of obtaining $\pi^0$ is to take $\pi^0 \in \arg\min_{\pi \in \Pi} J^\pi_H(x)$, $x \in X$, which would give the biggest slackness relative to $\phi(x)$, $x \in X$. It is guaranteed that the monotonicity of the objective value functions holds: $V^{\pi^0}_H(x) \ge V^{\pi^1}_H(x) \ge \cdots \ge V^{\pi^n}_H(x)$, $x \in X$, with $J^{\pi^i}_H(x) \le \phi(x)$, $x \in X$, $i = 0, 1, \ldots, n$.
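A sketch of the resulting policy iteration-type loop, reusing the hypothetical helpers sketched earlier (evaluate_policy, compute_eps, improve_policy); it stops when the improvement step no longer changes the policy, which corresponds to the local convergence discussed above.

def constrained_policy_iteration(P, C, D, A_sets, phi, pi0, alpha, beta, max_iters=100):
    """Policy iteration-type loop built from the improvement method: evaluate, compute
    the slack sequence (5), improve, and stop when the policy no longer changes."""
    pi = [list(row) for row in pi0]     # pi0 must be feasible: J_H(x) <= phi(x) for all x
    for _ in range(max_iters):
        V, J = evaluate_policy(P, C, D, pi, alpha, beta)
        eps = compute_eps(P, D, A_sets, J, phi, beta)
        new_pi = improve_policy(P, C, D, A_sets, pi, V, J, eps, alpha, beta)
        if new_pi == pi:                # no change: a local optimum has been reached
            break
        pi = new_pi
    return pi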

A. Multipolicy Improvement

So far, we have considered the case of improving a single base-policy. Suppose that we have a nonempty set $\Lambda$ of multiple nonstationary base-policies such that for any $\pi \in \Lambda$, $J^\pi_H(x) \le \phi(x)$, $x \in X$. We wish to design a single policy that improves the performances of all base-policies while preserving the constraint. For the nonconstrained case, a policy called "parallel rollout" [3] has been proposed. In this subsection, based on the parallel rollout, we provide a method of combining all policies in $\Lambda$ with a condition for the performance improvement and the constraint satisfaction.

Consider a sequence of nonnegative numbers $\{\delta_n\}$, $n = 0, \ldots, H$, with $\delta_0 = 0$ such that

$\delta_n - \delta_{n-1} = \max_{x \in X} \left( \max_{\pi \in \Lambda} J^\pi_n(x) - \min_{\pi \in \Lambda} J^\pi_n(x) \right)$

for $n = 1, \ldots, H$. Define the set $I_n(x)$, $n = 1, \ldots, H$, as

$I_n(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} \min_{\pi \in \Lambda} J^\pi_{n-1}(y) + \delta_{n-1} \le \min_{\pi \in \Lambda} J^\pi_n(x) + \delta_n \right\}, \quad x \in X$

and define a policy $\pi_m$ for multipolicy improvement as

$\pi_{m,H-n}(x) \in \arg\min_{a \in I_n(x)} \left\{ C(x,a) + \alpha \sum_{y \in X} P^a_{xy} \min_{\pi \in \Lambda} V^\pi_{n-1}(y) \right\}, \quad x \in X$  (7)

for $n = 1, \ldots, H$.

Theorem 2.2: For a given nonempty set $\Lambda \subseteq \Pi$, consider $\pi_m \in \Pi$ as in (7). Then, for any $x \in X$, $V^{\pi_m}_H(x) \le \min_{\pi \in \Lambda} V^\pi_H(x)$, and, for a given $x \in X$, if $\sum_{n=1}^{H} \max_{y \in X}\left( \max_{\pi \in \Lambda} J^\pi_n(y) - \min_{\pi \in \Lambda} J^\pi_n(y) \right) \le \phi(x) - \min_{\pi \in \Lambda} J^\pi_H(x)$, then $J^{\pi_m}_H(x) \le \phi(x)$.

Proof: First, for any $\pi \in \Lambda$ and any $x \in X$,

$D(x, \pi_{H-n}(x)) + \beta \sum_{y \in X} P^{\pi_{H-n}(x)}_{xy} \min_{\pi' \in \Lambda} J^{\pi'}_{n-1}(y) - \min_{\pi' \in \Lambda} J^{\pi'}_n(x)$
$\le D(x, \pi_{H-n}(x)) + \beta \sum_{y \in X} P^{\pi_{H-n}(x)}_{xy} J^\pi_{n-1}(y) - \min_{\pi' \in \Lambda} J^{\pi'}_n(x)$
$= J^\pi_n(x) - \min_{\pi' \in \Lambda} J^{\pi'}_n(x)$
$\le \max_{\pi' \in \Lambda} J^{\pi'}_n(x) - \min_{\pi' \in \Lambda} J^{\pi'}_n(x)$
$\le \max_{y \in X}\left( \max_{\pi' \in \Lambda} J^{\pi'}_n(y) - \min_{\pi' \in \Lambda} J^{\pi'}_n(y) \right) = \delta_n - \delta_{n-1}.$

Therefore, we have that $\pi_{H-n}(x) \in I_n(x)$ for all $\pi \in \Lambda$ and $x \in X$.

We now use induction on the horizon value to show that $V^{\pi_m}_H(x) \le \min_{\pi \in \Lambda} V^\pi_H(x)$, $x \in X$. The base case is trivially true because $V^{\pi_m}_0(x) = V^\pi_0(x) = 0$ for all $x \in X$ and $\pi \in \Lambda$. Assume that for all $x \in X$, $V^{\pi_m}_{n-1}(x) \le \min_{\pi \in \Lambda} V^\pi_{n-1}(x)$. Then, for any $x \in X$,

$V^{\pi_m}_n(x) = C(x, \pi_{m,H-n}(x)) + \alpha \sum_{y \in X} P^{\pi_{m,H-n}(x)}_{xy} V^{\pi_m}_{n-1}(y)$
$\le C(x, \pi_{m,H-n}(x)) + \alpha \sum_{y \in X} P^{\pi_{m,H-n}(x)}_{xy} \min_{\pi \in \Lambda} V^\pi_{n-1}(y)$  (by the induction hypothesis)
$\le C(x, \pi'_{H-n}(x)) + \alpha \sum_{y \in X} P^{\pi'_{H-n}(x)}_{xy} \min_{\pi \in \Lambda} V^\pi_{n-1}(y)$ for any $\pi' \in \Lambda$  (since $\pi'_{H-n}(x) \in I_n(x)$, $\forall \pi' \in \Lambda$, and from the definition of $\pi_{m,H-n}(x)$)
$\le C(x, \pi'_{H-n}(x)) + \alpha \sum_{y \in X} P^{\pi'_{H-n}(x)}_{xy} V^{\pi'}_{n-1}(y) = V^{\pi'}_n(x)$ for any $\pi' \in \Lambda$

which implies that for all $x \in X$, $V^{\pi_m}_n(x) \le \min_{\pi \in \Lambda} V^\pi_n(x)$.

Furthermore, by induction on the horizon value, we can show that $J^{\pi_m}_H(x) \le \min_{\pi \in \Lambda} J^\pi_H(x) + \delta_H$, which implies, by the definition of $\{\delta_n\}$, that for any $x \in X$,

$J^{\pi_m}_H(x) \le \min_{\pi \in \Lambda} J^\pi_H(x) + \sum_{n=1}^{H} \max_{y \in X}\left( \max_{\pi \in \Lambda} J^\pi_n(y) - \min_{\pi \in \Lambda} J^\pi_n(y) \right).$

It follows that, for a given $x \in X$, if $\sum_{n=1}^{H} \max_{y \in X}\left( \max_{\pi \in \Lambda} J^\pi_n(y) - \min_{\pi \in \Lambda} J^\pi_n(y) \right) \le \phi(x) - \min_{\pi \in \Lambda} J^\pi_H(x)$, then $J^{\pi_m}_H(x) \le \phi(x)$.
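A sketch of the multipolicy construction (7) under the same hypothetical representation; Js and Vs are lists holding the per-policy value arrays of the base-policies in $\Lambda$ (one evaluate_policy result per policy), and delta is the sequence $\{\delta_n\}$ built from the value gaps.

import numpy as np

def multipolicy_improve(P, C, D, A_sets, Js, Vs, alpha, beta):
    """Combine the base-policies in Lambda via the sets I_n(x) and the rule (7).
    Js[k], Vs[k] are the (H+1) x |X| value arrays of the k-th base-policy."""
    H, num_states = Js[0].shape[0] - 1, C.shape[0]
    Jmin = np.min(np.stack(Js), axis=0)             # min over Lambda of J_n(x)
    Jmax = np.max(np.stack(Js), axis=0)
    Vmin = np.min(np.stack(Vs), axis=0)             # min over Lambda of V_n(x)
    # delta[n] - delta[n-1] = max_x (max_pi J_n - min_pi J_n); the n = 0 gap is zero.
    delta = np.cumsum(np.max(Jmax - Jmin, axis=1))
    pi_m = [[0] * num_states for _ in range(H)]
    for n in range(1, H + 1):
        t = H - n
        for x in range(num_states):
            I = [a for a in A_sets[x]
                 if D[x, a] + beta * P[a][x] @ Jmin[n - 1] + delta[n - 1]
                    <= Jmin[n, x] + delta[n]]
            pi_m[t][x] = min(I, key=lambda a: C[x, a] + alpha * P[a][x] @ Vmin[n - 1])
    return pi_m, delta

Per Theorem 2.2, the constraint is then guaranteed at a state x whenever delta[H] <= phi[x] - Jmin[H, x].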

III. INFINITE HORIZON CASE

In this section, we restrict the policy space to the set $\Pi_s$ of all stationary policies $\pi : X \to A(X)$, and we use the subscript $\infty$ on $V^\pi$ and $J^\pi$ to denote the infinite-horizon value functions obtained by letting $H \to \infty$ in (1) and (2), under the assumption that $\alpha \in (0,1)$, $\beta \in (0,1)$, and $\pi_t = \pi$, $\forall t \ge 0$. Let $B(X)$ be the space of real-valued bounded functions on $X$. We define operators $T_\pi : B(X) \to B(X)$ and $T_{c,\pi} : B(X) \to B(X)$ for $\pi \in \Pi_s$ as

$T_\pi(v)(x) = C(x, \pi(x)) + \alpha \sum_{y \in X} P^{\pi(x)}_{xy} v(y), \quad x \in X$
$T_{c,\pi}(v)(x) = D(x, \pi(x)) + \beta \sum_{y \in X} P^{\pi(x)}_{xy} v(y), \quad x \in X$

respectively, for $v \in B(X)$. It is well known (see, e.g., [8]) that for each policy $\pi \in \Pi_s$ there exist corresponding unique $v$ and $u$ in $B(X)$ such that, for $x \in X$,

$T_\pi(v)(x) = v(x) \text{ with } v(x) = V^\pi_\infty(x), \qquad T_{c,\pi}(u)(x) = u(x) \text{ with } u(x) = J^\pi_\infty(x).$

Lemma 3.1: Given $\pi \in \Pi_s$ and $\epsilon \ge 0$, suppose that there exist $v$ and $u$ in $B(X)$ for which

$T_\pi(v)(x) \le v(x) + \epsilon, \quad x \in X$
$T_{c,\pi}(u)(x) \le u(x) + \epsilon, \quad x \in X.$

Then, for all $x \in X$, $V^\pi_\infty(x) \le v(x) + \epsilon/(1-\alpha)$ and $J^\pi_\infty(x) \le u(x) + \epsilon/(1-\beta)$.

Proof: By successive applications of the $T_\pi$-operator to both sides of the first inequality and the monotonicity property of the operator, we have that for all $x \in X$,

$\lim_{n \to \infty} T^n_\pi(v)(x) \le v(x) + \lim_{n \to \infty} \epsilon \left( 1 + \alpha + \alpha^2 + \cdots + \alpha^{n-1} \right).$

It is well known (see, e.g., [8]) that $T_\pi$ is a contraction mapping in $B(X)$ and that iterative application of $T_\pi$ to any initial value function converges to the fixed point $V^\pi_\infty$. Therefore, $\lim_{n \to \infty} T^n_\pi(v)(x) = V^\pi_\infty(x)$, $x \in X$, which gives the first bound. The same arguments apply to the $T_{c,\pi}$-operator case.
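A sketch of the two operators and of the fixed-point iteration invoked in the proof, assuming a stationary policy stored as its transition matrix P_pi and its one-stage cost vectors c_pi and d_pi (hypothetical names).

import numpy as np

def T_op(v, c_pi, P_pi, alpha):
    """(T_pi v)(x) = C(x, pi(x)) + alpha * sum_y P_{xy}^{pi(x)} v(y)."""
    return c_pi + alpha * P_pi @ v

def Tc_op(u, d_pi, P_pi, beta):
    """(T_{c,pi} u)(x) = D(x, pi(x)) + beta * sum_y P_{xy}^{pi(x)} u(y)."""
    return d_pi + beta * P_pi @ u

def fixed_point(op, v0, tol=1e-10, max_iters=100000):
    """Iterate a contraction operator from v0 until numerical convergence; the limit
    of T_pi^n v is V_infinity^pi and the limit of T_{c,pi}^n u is J_infinity^pi."""
    v = v0
    for _ in range(max_iters):
        v_next = op(v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

For example, V_inf = fixed_point(lambda v: T_op(v, c_pi, P_pi, alpha), np.zeros(len(c_pi))) approximates $V^\pi_\infty$.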

Given $\pi$ and $\epsilon \ge 0$, define the set $\tilde{F}(x)$ as

$\tilde{F}(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} J^\pi_\infty(y) \le J^\pi_\infty(x) + \epsilon \right\}, \quad x \in X$

and a policy $\tilde{\pi} \in \Pi_s$ as

$\tilde{\pi}(x) \in \arg\min_{a \in \tilde{F}(x)} \left\{ C(x,a) + \alpha \sum_{y \in X} P^a_{xy} V^\pi_\infty(y) \right\}, \quad x \in X.$  (8)

Theorem 3.1: For a given policy $\pi \in \Pi_s$ such that $J^\pi_\infty(x) \le \phi(x)$, $x \in X$, define a policy $\tilde{\pi} \in \Pi_s$ as in (8) with $\epsilon = (1-\beta) \min_{x \in X}(\phi(x) - J^\pi_\infty(x))$. Then, for any $x \in X$,

$J^{\tilde{\pi}}_\infty(x) \le \phi(x) \quad \text{and} \quad V^{\tilde{\pi}}_\infty(x) \le V^\pi_\infty(x).$

Proof: Because $\epsilon \ge 0$, for any $x \in X$, $\pi(x) \in \tilde{F}(x)$. Because then $T_{\tilde{\pi}}(V^\pi_\infty)(x) \le T_\pi(V^\pi_\infty)(x) = V^\pi_\infty(x)$, $x \in X$ (since $\tilde{\pi}(x)$ minimizes over $\tilde{F}(x)$, which contains $\pi(x)$), we have from Lemma 3.1 (with $\epsilon = 0$) that $V^{\tilde{\pi}}_\infty(x) \le V^\pi_\infty(x)$, $x \in X$. Furthermore, because $T_{c,\tilde{\pi}}(J^\pi_\infty)(x) \le J^\pi_\infty(x) + \epsilon$, $x \in X$, from Lemma 3.1,

$J^{\tilde{\pi}}_\infty(x) \le J^\pi_\infty(x) + \epsilon/(1-\beta), \quad x \in X.$

Therefore, with $\epsilon = (1-\beta) \min_{x \in X}(\phi(x) - J^\pi_\infty(x))$,

$J^{\tilde{\pi}}_\infty(x) \le J^\pi_\infty(x) + \min_{x \in X}\left( \phi(x) - J^\pi_\infty(x) \right) \le \phi(x), \quad x \in X.$
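A sketch of the improvement step of Theorem 3.1 under the hypothetical tabular representation used earlier; V_inf and J_inf may be obtained from the operator iteration above (or by solving the corresponding linear systems), and eps is set as in the theorem.

import numpy as np

def improve_stationary(P, C, D, A_sets, V_inf, J_inf, phi, alpha, beta, tol=1e-12):
    """Construct the stationary policy of (8) with eps = (1 - beta) * min_x (phi(x) - J_inf(x))."""
    eps = (1.0 - beta) * np.min(phi - J_inf)
    pi_tilde = []
    for x in range(C.shape[0]):
        F = [a for a in A_sets[x]
             if D[x, a] + beta * P[a][x] @ J_inf <= J_inf[x] + eps + tol]
        # The base action pi(x) passes this test, so F is nonempty.
        pi_tilde.append(min(F, key=lambda a: C[x, a] + alpha * P[a][x] @ V_inf))
    return pi_tilde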

For multipolicy improvement, given a nonempty set $\Lambda \subseteq \Pi_s$ of base-policies, define a policy $\pi_m$ as

$\pi_m(x) \in \arg\min_{a \in I(x)} \left\{ C(x,a) + \alpha \sum_{y \in X} P^a_{xy} \min_{\pi \in \Lambda} V^\pi_\infty(y) \right\}, \quad x \in X$  (9)

where, with $\delta = \max_{x \in X} \left( \max_{\pi \in \Lambda} J^\pi_\infty(x) - \min_{\pi \in \Lambda} J^\pi_\infty(x) \right)$,

$I(x) = \left\{ u \,\middle|\, u \in A(x),\ D(x,u) + \beta \sum_{y \in X} P^u_{xy} \min_{\pi \in \Lambda} J^\pi_\infty(y) \le \min_{\pi \in \Lambda} J^\pi_\infty(x) + \delta \right\}, \quad x \in X.$

For any $\pi \in \Lambda$ and any $x \in X$,

$D(x, \pi(x)) + \beta \sum_{y \in X} P^{\pi(x)}_{xy} \min_{\pi' \in \Lambda} J^{\pi'}_\infty(y) - \min_{\pi' \in \Lambda} J^{\pi'}_\infty(x) \le J^\pi_\infty(x) - \min_{\pi' \in \Lambda} J^{\pi'}_\infty(x) \le \max_{y \in X}\left( \max_{\pi' \in \Lambda} J^{\pi'}_\infty(y) - \min_{\pi' \in \Lambda} J^{\pi'}_\infty(y) \right) = \delta.$

Therefore, we have that $\pi(x) \in I(x)$ for all $\pi \in \Lambda$ and $x \in X$. It follows that, by showing $T_{\pi_m}(\Psi)(x) \le \Psi(x)$, $x \in X$, with $\Psi(x) = \min_{\pi \in \Lambda} V^\pi_\infty(x)$, $x \in X$, we have via Lemma 3.1 (with $\epsilon = 0$) that $V^{\pi_m}_\infty(x) \le \min_{\pi \in \Lambda} V^\pi_\infty(x)$, $x \in X$. Furthermore, with reasoning similar to the previous case, we can show that if $\delta/(1-\beta) \le \phi(x) - \min_{\pi \in \Lambda} J^\pi_\infty(x)$ for $x \in X$, then $J^{\pi_m}_\infty(x) \le \phi(x)$.
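A brief sketch of the stationary multipolicy rule (9); J_list and V_list hold the infinite-horizon value vectors of the base-policies in $\Lambda$ (hypothetical names), and the final flag checks the sufficient condition for constraint satisfaction derived above.

import numpy as np

def multipolicy_improve_stationary(P, C, D, A_sets, J_list, V_list, phi, alpha, beta):
    """Combine stationary base-policies per (9) using the single set I(x) and gap delta."""
    Jmin = np.min(np.stack(J_list), axis=0)
    Vmin = np.min(np.stack(V_list), axis=0)
    delta = np.max(np.max(np.stack(J_list), axis=0) - Jmin)
    pi_m = []
    for x in range(C.shape[0]):
        I = [a for a in A_sets[x]
             if D[x, a] + beta * P[a][x] @ Jmin <= Jmin[x] + delta]
        pi_m.append(min(I, key=lambda a: C[x, a] + alpha * P[a][x] @ Vmin))
    # Sufficient condition derived above for the constraint to hold at every state:
    feasible_everywhere = delta / (1.0 - beta) <= np.min(phi - Jmin)
    return pi_m, bool(feasible_everywhere)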

IV. CONCLUDING REMARKS

Suppose that we are given an initial state probability distribution $\mu$ over $X$. Consider the stochastic control problem of obtaining an optimal policy $\pi^* \in \Pi_\mu$ that achieves $\min_{\pi \in \Pi_\mu} \sum_{x \in X} \mu(x) V^\pi_H(x)$, where $\Pi_\mu = \{ \pi \mid \pi \in \Pi, \sum_{x \in X} \mu(x) J^\pi_H(x) \le \sum_{x \in X} \mu(x) \phi(x) \}$. This problem is basically a constrained global optimization problem over the policy space, and we can solve it by randomized search methods, e.g., constrained simulated annealing. The purpose of this note is to give a first demonstration of designing a formal policy-improvement method for constrained stochastic dynamic programming. Some experimental investigations will be necessary to make the approach fruitful for applications.

ACKNOWLEDGMENT

The author thanks Prof. D. Bertsekas for his comments on the initial idea of this note, and the anonymous reviewers for their comments, which improved the quality of the note.

REFERENCES

[1] E. Altman, Constrained Markov Decision Processes. Boca Raton, FL: CRC, 1998.
[2] D. P. Bertsekas, "Rollout algorithms for constrained dynamic programming," Mass. Inst. Technol., Cambridge, MA, Tech. Rep. LIDS-2646, 2005.
[3] H. S. Chang, R. Givan, and E. K. P. Chong, "Parallel rollout for on-line solution of partially observable Markov decision processes," Discrete Event Dyna. Syst.: Theory Appl., vol. 14, no. 3, pp. 309–341, 2004.
[4] R. C. Chen and G. L. Blankenship, "Dynamic programming equations for discounted constrained stochastic control," IEEE Trans. Autom. Control, vol. 49, no. 5, pp. 699–709, May 2004.
[5] E. A. Feinberg and A. Schwartz, "Constrained discounted dynamic programming," Math. Oper. Res., vol. 21, no. 4, pp. 922–945, 1996.
[6] E. A. Feinberg and A. Schwartz, "Constrained dynamic programming with two discount factors: Applications and an algorithm," IEEE Trans. Autom. Control, vol. 44, no. 3, pp. 628–631, 1999.
[7] N. Furukawa, "Characterization of optimal policies in vector-valued Markovian decision processes," Math. Oper. Res., vol. 5, no. 2, pp. 271–279, 1980.
[8] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley, 1994.
[9] K. Wakuta, "Vector-valued Markov decision processes and the systems of linear inequalities," Stoch. Process. Appl., vol. 56, pp. 159–169, 1995.