
Systems & Control Letters 59 (2010) 760–766. doi:10.1016/j.sysconle.2010.08.013


An actor–critic algorithm with function approximation for discounted cost constrained Markov decision processes

Shalabh Bhatnagar
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India
E-mail address: [email protected]

Article history: Received 20 July 2010; received in revised form 24 August 2010; accepted 24 August 2010; available online 8 October 2010.

Keywords: Constrained Markov decision processes; Infinite horizon discounted cost criterion; Function approximation; Actor–critic algorithm; Simultaneous perturbation stochastic approximation

Abstract. We develop in this article the first actor–critic reinforcement learning algorithm with function approximation for a problem of control under multiple inequality constraints. We consider the infinite horizon discounted cost framework in which both the objective and the constraint functions are suitable expected policy-dependent discounted sums of certain sample path functions. We apply the Lagrange multiplier method to handle the inequality constraints. Our algorithm makes use of multi-timescale stochastic approximation and incorporates a temporal difference (TD) critic and an actor that makes a gradient search in the space of policy parameters using efficient simultaneous perturbation stochastic approximation (SPSA) gradient estimates. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal policy.

1. Introduction

The problem that we are concerned with in this article is of finding an optimal policy for a constrained Markov decision process (C-MDP) when (a) the transition probabilities are not known and (b) a feature-based representation is used. The latter is particularly useful when the state and action spaces are large or unmanageable. A textbook treatment of C-MDPs can be found in [1]. Reinforcement learning (RL) [2,3] has proved to be a useful paradigm for such problems and has largely been studied in the case of regular (unconstrained) Markov decision processes (MDPs) [4,5]. Algorithms based on both value and policy iteration techniques have been developed and studied in this scenario.

Actor–critic algorithms [6,7] are a class of RL algorithms that are based on the policy iteration method. Whereas the critic addresses a problem of prediction by estimating the value function for a given policy update, the actor updates the policy itself and is concerned with the problem of control. Temporal difference (TD) learning [3,8] has been found to be one of the most effective methods for the problem of prediction. For the problem of control, policy gradient methods [9] have been successfully applied. The policy in these methods is represented via a parameterized class of functions that are differentiable in the parameter. In actor–critic algorithms based on policy gradients, the actor's parameter is updated along
the direction of the performance gradient. The performance itself, for any given parameter update, is estimated by the critic.

In this paper we present an actor–critic algorithm with function approximation for a C-MDP and prove its convergence. Our aim is to find an optimal policy that minimizes an infinite horizon discounted cost function subject to prescribed bounds on additional cost functions under that policy. The additional cost functions have a structure similar to the objective function itself, i.e., they are also expected discounted sums of some other single-stage cost functions. Thus the cost functions in both the objective and the constraints are policy dependent. It is also important to note that the value functions corresponding to the single-stage costs cannot in general be analytically determined as functions of the parameter. Hence the constraint set is a priori not properly defined either. Thus any solution methodology cannot be based on directly projecting iterates after each update onto the constraint set formed from the inequality constraints.

Constrained MDPs find immense applications in many domains. For instance, in communication networks, a problem of interest is to maximize the throughput (i.e., the rate at which packets are delivered to the destination) subject to constraints on the delay as well as packet loss in transmission, see [10,11]. One can find similarly important problems in many other domains.

Our development is based on forming the Lagrangian by incorporating the (multiple) inequality constraints in the objective. The primal is a new actor–critic algorithm that combines ideas from temporal difference learning, policy gradient methods and simultaneous perturbation stochastic approximation (SPSA) [12,13], while the dual corresponds to a recursive search in the space of Lagrange multipliers. A Lagrange-based approach for solving a C-MDP with a single constraint has been used in [14]. The algorithm there, however, works for the look-up table case and does not use function approximation, i.e., it requires storage of all the states and actions, and performs updates in the space of all randomized policies. Further, it has been designed for the long-run average cost objective and not the discounted cost. We use feature-based representations, as a result of which our problem becomes one of optimizing parameters in a constrained optimization setting. We consider multiple inequality constraints and the infinite horizon discounted cost criterion. Ours is the first work that develops an actor–critic algorithm with function approximation for control with multiple inequality constraints and under the discounted cost objective.

The rest of the paper is organized as follows: in Section 2, we present the framework and problem formulation. We present in Section 3 our constrained actor–critic algorithm. In Section 4, we present the convergence proof. Finally, Section 5 contains the concluding remarks.

2. The framework and problem formulation

By an MDP we mean a stochastic process {X_n} taking values in a set S (called the state space), that is governed by a control sequence {Z_n} and satisfies the controlled Markov property (below). Let A(i) be the set of feasible actions in state i and A = ∪_{i∈S} A(i) be the set of all actions or the action space. We assume that both S and A are finite sets. The controlled Markov property satisfied by {X_n} is the following:
$$P(X_{n+1} = j \mid X_m, Z_m, m \le n) = p(X_n, Z_n, j) \quad \text{a.s.},$$
where p : S × A × S → [0, 1] is a given function for which Σ_{j∈S} p(i, a, j) = 1, ∀a ∈ A(i), i ∈ S.

A policy is a decision rule for selecting actions. We call π̄ = {µ_0, µ_1, ...}, where µ_n : S → A, an admissible policy when µ_n(i) ∈ A(i) ∀i ∈ S. When µ_n ≡ µ, ∀n ≥ 0, where µ is independent of n, we call π̄ (or many times µ itself) a stationary deterministic policy (SDP). A stationary randomized policy (SRP) π is specified via a probability distribution π(i, ·) over A(i), ∀i ∈ S. It is easy to see that under any given SDP or SRP, {X_n} is a Markov chain. We make the following assumption:

Assumption 1. The Markov chain {X_n} under any SRP π is irreducible.

It follows from Assumption 1 that {X_n} is also positive recurrent under any SRP because S is a finite set. Let c(n), g_1(n), ..., g_N(n), n ≥ 0 denote certain single-stage costs that we assume are nonnegative, real-valued, uniformly bounded and mutually independent random variables. In addition, given the current state and action (X_n and Z_n), c(n), g_k(n), k = 1,...,N are conditionally independent of the previous states and actions (X_m, Z_m, m < n). The evolution in a C-MDP proceeds as follows: at instant n, the state X_n is observed and action Z_n is chosen. This results in the (N+1)-length string of costs c(n), g_1(n), ..., g_N(n) and the process moves to a new state X_{n+1} at instant n+1. Let c(i, a), g_k(i, a) be defined via c(i, a) = E[c(n) | X_n = i, Z_n = a], g_k(i, a) = E[g_k(n) | X_n = i, Z_n = a], ∀n ≥ 0, k = 1,...,N, respectively. (Note the abuse of notation here.) It is shown in Theorem 3.1 of [1] that SRPs correspond to a complete class of policies for the problem that we consider, i.e., it is sufficient to find an optimal policy within the class of SRPs. Under an SRP π, let d^π(i) denote the stationary probability of the Markov process being in state i ∈ S and let d^π = (d^π(i), i ∈ S)^⊤.

Our aim is to find an SRP that, for a given initial distribution β (over states), minimizes (over all SRPs π) the discounted cost
$$J^\beta(\pi) = \sum_{i\in S}\beta(i)\, U^\pi(i), \qquad (1)$$
subject to the constraints
$$S_k^\beta(\pi) = \sum_{i\in S}\beta(i)\, W_k^\pi(i) \le \alpha_k, \qquad (2)$$
k = 1,...,N. Here
$$U^\pi(i) \stackrel{\triangle}{=} E\left[\sum_{m=0}^{\infty}\gamma^m c(m)\ \Big|\ X_0 = i,\ \pi\right], \qquad W_k^\pi(i) \stackrel{\triangle}{=} E\left[\sum_{m=0}^{\infty}\gamma^m g_k(m)\ \Big|\ X_0 = i,\ \pi\right],$$
k = 1,...,N. Here 0 < γ < 1 is a given discount factor. Also, in (2), α_1,...,α_N are given positive constants. We assume that there exists at least one SRP π for which all the inequality constraints (2) are satisfied. Under this requirement, it is shown in Theorem 3.8 of [1] that an optimal SRP π* that uses at most N randomizations exists. The constraints (2) can alternatively be written as
$$G_k^\beta(\pi) = S_k^\beta(\pi) - \alpha_k \le 0, \qquad (3)$$
k = 1,...,N.

Let λ̄ = (λ_1,...,λ_N)^⊤ denote the vector of Lagrange multipliers λ_1,...,λ_N ∈ R^+ ∪ {0} and let L^β(π, λ̄) denote the Lagrangian
$$L^\beta(\pi,\bar\lambda) = J^\beta(\pi) + \sum_{k=1}^{N}\lambda_k\, G_k^\beta(\pi). \qquad (4)$$

For a relaxed MDP problem for which the single-stage cost is c(i, a) + Σ_{k=1}^N λ_k(g_k(i, a) − α_k) in state i when the action chosen is a, the Bellman equation for optimality corresponds to
$$V^{*,\bar\lambda}(i) = \min_{a}\left[c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big) + \gamma\sum_{j\in S}p(i,j,a)\, V^{*,\bar\lambda}(j)\right], \qquad (5)$$
for all i ∈ S, where V^{*,λ̄}(·) denotes the value function for a given vector λ̄ of Lagrange parameters. Further, let V^{π,λ̄}(·) denote the value function under a given SRP π and Lagrange parameters λ̄. The Poisson equation in this case corresponds to
$$V^{\pi,\bar\lambda}(i) = \sum_{a\in A(i)}\pi(i,a)\left[c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big) + \gamma\sum_{j\in S}p(i,j,a)\, V^{\pi,\bar\lambda}(j)\right], \qquad (6)$$
∀i ∈ S. The solutions V^{*,λ̄}(·) and V^{π,λ̄}(·) to (5) and (6) respectively can both be seen to be unique [4,5].
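The relaxed formulation (4)–(6) is straightforward to evaluate exactly when the model is known. The sketch below is not part of the original paper: it is a minimal NumPy illustration under an assumed small tabular C-MDP (random p(i, a, j), c(i, a), g_k(i, a), a fixed SRP π and bounds α_k), and it computes U^π, W_k^π, J^β(π), S_k^β(π) and the relaxed value function V^{π,λ̄} of (6) by solving the associated linear systems.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, N, gamma = 4, 3, 2, 0.9                    # assumed small example sizes

P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p(i, a, j)
c = rng.uniform(0.0, 1.0, size=(nS, nA))           # c(i, a)
g = rng.uniform(0.0, 1.0, size=(N, nS, nA))        # g_k(i, a)
alpha = np.full(N, 4.0)                            # constraint bounds alpha_k
pi = rng.dirichlet(np.ones(nA), size=nS)           # an SRP pi(i, a)
beta = np.full(nS, 1.0 / nS)                       # initial distribution beta

# Policy-averaged transition matrix and single-stage costs.
P_pi = np.einsum('ia,iaj->ij', pi, P)              # P^pi
c_pi = np.einsum('ia,ia->i', pi, c)                # averaged c(i, a)
g_pi = np.einsum('ia,kia->ki', pi, g)              # averaged g_k(i, a)

# U^pi and W_k^pi solve (I - gamma P^pi) x = cost, cf. (1)-(2).
A = np.eye(nS) - gamma * P_pi
U = np.linalg.solve(A, c_pi)
W = np.linalg.solve(A, g_pi.T).T

J_beta = beta @ U                                  # objective (1)
S_beta = W @ beta                                  # constraint values (2)
G_beta = S_beta - alpha                            # (3): the policy is feasible iff G_beta <= 0

# Relaxed value function V^{pi,lambda} of (6) for a given multiplier vector.
lam = np.array([0.5, 1.0])
relaxed_cost = c_pi + lam @ (g_pi - alpha[:, None])
V_relaxed = np.linalg.solve(A, relaxed_cost)

# V^{pi,lambda} is affine in lambda:
# V^{pi,lambda} = U^pi + sum_k lam_k * (W_k^pi - alpha_k / (1 - gamma)).
assert np.allclose(V_relaxed, U + lam @ (W - alpha[:, None] / (1 - gamma)))
```

The final check uses the fact that V^{π,λ̄} is affine in λ̄, the same linearity that is exploited later in the proof of Theorem 3.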

In what follows, we shall restrict our attention to SRPs π that depend on a parameter θ = (θ_1,...,θ_d)^⊤ and consider an analogous problem of finding the optimum (θ, λ̄)-tuple. Let {π^θ(i, a), i ∈ S, a ∈ A(i), θ ∈ C ⊂ R^d} denote the parameterized class of SRPs. Here, the set C in which θ takes values is assumed to be a compact and convex subset of R^d. From now on, π itself will represent the parameterized SRP π^θ. The following is a standard requirement in policy gradient methods.


Assumption 2. For any a ∈ A(i), i ∈ S , π (i, a) is continuously differentiable in θ .
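As an illustration only (the paper leaves the policy class abstract), a Boltzmann/softmax parameterization over assumed state–action features φ(i, a), distinct from the critic features f_i used below, is one standard choice that satisfies Assumption 2; a minimal sketch follows.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi^theta(i, .) proportional to exp(theta . phi(i, a)) over the feasible
    actions of state i; phi_s is an |A(i)| x d array of state-action features.
    Such a class is continuously differentiable in theta (Assumption 2)."""
    z = phi_s @ theta
    z -= z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_action(theta, phi_s, rng):
    """Draw an action index from pi^theta(i, .)."""
    return rng.choice(len(phi_s), p=softmax_policy(theta, phi_s))
```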

3. The actor–critic algorithm for C-MDP

Recall that
$$L^\beta(\pi,\bar\lambda) = J^\beta(\pi) + \sum_{k=1}^{N}\lambda_k\, G_k^\beta(\pi).$$
Upon plugging in the expressions for J^β(π) and G_k^β(π) above, one obtains
$$L^\beta(\pi,\bar\lambda) = \sum_{i\in S}\beta(i)U^\pi(i) + \sum_{k=1}^{N}\lambda_k\left(\sum_{i\in S}\beta(i)W_k^\pi(i) - \alpha_k\right) = \sum_{i\in S}\beta(i)\left[U^\pi(i) + \sum_{k=1}^{N}\lambda_k\big(W_k^\pi(i) - \alpha_k\big)\right],$$
since Σ_{i∈S} β(i)α_k = α_k. Thus,
$$\nabla_\theta L^\beta(\pi,\bar\lambda) = \sum_{i\in S}\beta(i)\left[\nabla_\theta U^\pi(i) + \sum_{k=1}^{N}\lambda_k\nabla_\theta W_k^\pi(i)\right].$$
The initial distribution β can be expressed as the vector β = (β(i), i ∈ S)^⊤. When β = e_i, the ith unit vector, i.e., β(i) = 1 and β(j) = 0 ∀j ≠ i, then
$$L^{e_i}(\pi,\bar\lambda) = U^\pi(i) + \sum_{k=1}^{N}\lambda_k\big(W_k^\pi(i) - \alpha_k\big).$$
Similarly, when β = e_i,
$$G_k^{e_i}(\pi) = S_k^{e_i}(\pi) - \alpha_k = W_k^\pi(i) - \alpha_k, \quad k = 1,\ldots,N.$$

In what follows, we shall use linear function approximation on both L^{e_i}(π, λ̄) and G_k^{e_i}(π), i.e., we approximate L^{e_i}(π, λ̄) ≈ v^{π,λ̄⊤} f_i and G_k^{e_i}(π) ≈ w_k^{π⊤} f_i, k = 1,...,N, respectively, where v^{π,λ̄}, w_k^π ∈ R^{d_1}, ∀k = 1,...,N, are suitable parameters and f_i is a d_1-dimensional feature vector f_i = (f_i(1),...,f_i(d_1))^⊤ associated with state i. Thus, we also have
$$L^\beta(\pi,\bar\lambda) = \sum_{i\in S}\beta(i)\, L^{e_i}(\pi,\bar\lambda) \approx \sum_{i\in S}\beta(i)\, v^{\pi,\bar\lambda\top} f_i, \qquad (7)$$
$$G_k^\beta(\pi) = \sum_{i\in S}\beta(i)\, G_k^{e_i}(\pi) \approx \sum_{i\in S}\beta(i)\, w_k^{\pi\top} f_i. \qquad (8)$$
The approximation in (7) will be used to update the actor parameter θ for a given vector λ̄ of Lagrange parameters, while the one in (8) will be used to update λ̄.

Suppose Φ is a |S| × d_1 matrix whose kth column (k = 1,...,d_1) is f(k) = (f_i(k), i ∈ S)^⊤. Here |S| denotes the cardinality of S. The following is a standard requirement, see for instance [8].

Assumption 3. The basis functions {f(k), k = 1,...,d_1} are linearly independent. Further, d_1 ≤ |S| and Φ has full rank.

Given an SRP π and the vector of Lagrange multipliers λ̄, we define the temporal difference error at instant n, n ≥ 0, for the relaxed MDP as
$$\delta_n^{\pi,\bar\lambda} = c(n) + \sum_{k=1}^{N}\lambda_k\big(g_k(n) - \alpha_k\big) + \gamma\, v^{\pi,\bar\lambda\top} f_{X_{n+1}} - v^{\pi,\bar\lambda\top} f_{X_n}, \qquad (9)$$
where X_n (X_{n+1}) is the state of the process at instant n (n+1). We also define N additional temporal difference errors associated with the constraint functions. These will be used in the Lagrange parameter updates. The kth (k ∈ {1,...,N}) such temporal difference error at instant n, n ≥ 0 is defined according to
$$\chi_n^{\pi,k} = \big(g_k(n) - \alpha_k\big) + \gamma\, w_k^{\pi\top} f_{X_{n+1}} - w_k^{\pi\top} f_{X_n}. \qquad (10)$$

We propose a multi-timescale stochastic approximation algorithm for our problem. Let a(n), b(n) and d(n), n ≥ 0 be three step-size schedules that satisfy the conditions
$$\sum_n a(n) = \sum_n b(n) = \sum_n d(n) = \infty, \qquad \sum_n\big(a^2(n) + b^2(n) + d^2(n)\big) < \infty,$$
$$\lim_{n\to\infty}\frac{b(n)}{a(n)} = \lim_{n\to\infty}\frac{d(n)}{b(n)} = 0. \qquad (11)$$

For n ≥ 0, let θ(n) = (θ_1(n),...,θ_d(n))^⊤ and λ̄(n) = (λ_1(n),...,λ_N(n))^⊤ denote the nth updates of θ and λ̄ respectively. Recall that C ⊂ R^d is a compact and convex set in which θ takes values. Let Γ : R^d → C denote an operator that is defined as follows: for x = (x_1,...,x_d)^⊤ ∈ R^d, Γ(x) = (Γ_1(x_1),...,Γ_d(x_d))^⊤ is the nearest point of x in the set C. In particular, Γ(x) = x if x ∈ C. We also let Γ̂ : R → [0, P] denote the projection Γ̂(x) = max(0, min(x, P)) for any x ∈ R, where P < ∞ is a large positive constant. This operator will be used in the Lagrange parameter updates (see (18)) because we require them to be nonnegative.
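As a concrete illustration (not prescribed by the paper), schedules such as a(n) = (n+1)^{-0.55}, b(n) = (n+1)^{-0.75} and d(n) = (n+1)^{-1} satisfy (11), and when C is taken to be a box the projections Γ and Γ̂ reduce to componentwise clipping. The hypothetical helpers below encode exactly these choices; they are reused in the algorithm sketch that follows the recursions (14)–(18).

```python
import numpy as np

def step_sizes(n):
    """One choice of schedules satisfying (11): divergent sums, summable squares,
    and b(n) = o(a(n)), d(n) = o(b(n)). The exponents are illustrative only."""
    m = n + 1                                   # avoid division by zero at n = 0
    return m ** -0.55, m ** -0.75, m ** -1.0    # a(n), b(n), d(n)

def Gamma(theta, C_low=-10.0, C_high=10.0):
    """Projection onto C, assumed here to be the box [C_low, C_high]^d,
    applied componentwise as in the definition of Gamma."""
    return np.clip(theta, C_low, C_high)

def Gamma_hat(lam, P=1e3):
    """Projection of the Lagrange multipliers onto [0, P]."""
    return np.clip(lam, 0.0, P)
```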

We consider two parallel simulations of the C-MDP, {X_n} and {X_n'}, that are respectively governed by the randomized policies {π_n} and {π_n'}. Let {Z_n} and {Z_n'} denote the sequences of actions chosen according to the above policies. Here π_n is obtained from the parameter θ(n), while π_n' is obtained from θ(n) + δΔ(n), where δ > 0 is a small constant and Δ(n), n ≥ 0 is a vector Δ(n) = (Δ_1(n),...,Δ_d(n))^⊤ of independent random variables Δ_j(n), j = 1,...,d, each having the distribution Δ_j(n) = ±1 with probability 1/2. Let c(n), g_k(n) (resp. c'(n), g_k'(n)), k = 1,...,N, n ≥ 0 denote the random costs incurred (corresponding to the objective and constraint functions) in the two simulations. Note here that E[c(n) | X_n = i, Z_n = a] = E[c'(n) | X_n' = i, Z_n' = a] = c(i, a) and E[g_k(n) | X_n = i, Z_n = a] = E[g_k'(n) | X_n' = i, Z_n' = a] = g_k(i, a), k = 1,...,N.

Let v_n, v_n' be the nth updates of the weight parameters v^{π,λ̄}, v^{π',λ̄} in the two systems. Further, let δ_n, δ_n', n ≥ 0 be the temporal difference errors at instant n in the two systems (under the parameter updates v_n, v_n') that are defined according to
$$\delta_n = c(n) + \sum_{k=1}^{N}\lambda_k\big(g_k(n) - \alpha_k\big) + \gamma\, v_n^\top f_{X_{n+1}} - v_n^\top f_{X_n}, \qquad (12)$$
$$\delta_n' = c'(n) + \sum_{k=1}^{N}\lambda_k\big(g_k'(n) - \alpha_k\big) + \gamma\, v_n'^\top f_{X'_{n+1}} - v_n'^\top f_{X'_n}, \qquad (13)$$
respectively. Let w_{k,n}, k = 1,...,N denote the nth update of the weight parameter w_k^π associated with the kth additional temporal difference error, which is denoted χ_{k,n}, n ≥ 0 and defined according to
$$\chi_{k,n} = \big(g_k(n) - \alpha_k\big) + \gamma\, w_{k,n}^\top f_{X_{n+1}} - w_{k,n}^\top f_{X_n}.$$
We show in Theorem 1 that v_n, v_n' and w_{k,n} (obtained recursively from the algorithm below) asymptotically converge with probability one to v^{π,λ̄}, v^{π',λ̄} and w_k^π, respectively, and also characterize these quantities.

Our actor–critic algorithm for discounted cost C-MDP is as follows: let X_0 = X_0' = s_0 for some s_0 ∈ S. For n ≥ 0, k = 1,...,N,
i = 1,...,d,
$$v_{n+1} = v_n + a(n)\,\delta_n f_{X_n}, \qquad (14)$$
$$v'_{n+1} = v'_n + a(n)\,\delta'_n f_{X'_n}, \qquad (15)$$
$$\theta_i(n+1) = \Gamma_i\left(\theta_i(n) + b(n)\sum_{s_0\in S}\beta(s_0)\,\frac{(v_n - v'_n)^\top f_{s_0}}{\delta\,\Delta_i(n)}\right), \qquad (16)$$
$$w_{k,n+1} = w_{k,n} + a(n)\,\chi_{k,n} f_{X_n}, \qquad (17)$$
$$\lambda_k(n+1) = \hat\Gamma\left(\lambda_k(n) + d(n)\sum_{s_0\in S}\beta(s_0)\, w_{k,n}^\top f_{s_0}\right). \qquad (18)$$

Recursions (14)–(15) are the critic updates while (16) is the actor update. Further, (18) is the Lagrange parameter update, for which we require (17). In practice, one is often interested in optimizing system performance when starting from a given small subset of S. The initial distribution β would then take positive values only over those states. The summations in (16) and (18) would then also be only over those states. As an example, in the setting of communication networks, one is often interested in starting the system with no customers. In such a case, β would assign a value of 1 to the 'empty' state and 0 to all other states.

Note also that unlike our algorithm, the algorithms in [7] depend only on single simulation trajectories as they are for the long-run average cost criterion and rely on a policy gradient result (cf. Lemma 4 of [7]). For the case of average cost C-MDPs, it is conceivable to extend those algorithms using only one simulation and a result analogous to Lemma 4 of [7] for the relaxed problem. In [9], a policy gradient result for discounted cost MDPs under look-up table representations is also provided (in addition to that for the long-run average cost criterion). Extending the same to the case of function approximation, however, appears difficult. This is the reason for incorporating gradient estimates in our algorithm based on SPSA. SPSA-based gradient estimates were first proposed in [12] and have been widely studied and found to be highly efficient in various settings (even those involving high-dimensional parameters). The SPSA estimates in [12] involved two-sided balanced estimates. Our gradient estimates, on the other hand, are one-sided and resemble those presented in [13]. We use one-sided estimates primarily because for updates of the Lagrange parameter (cf. (17)–(18)) we require a simulation with the running parameter θ(n) itself. Using a balanced gradient estimate would therefore come at the cost of an additional simulation (the resulting procedure would then require three simulations), which we avoid in our procedure.
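Putting the TD errors (12)–(13) and recursions (14)–(18) together, the following sketch (not from the paper; a schematic NumPy rendering under assumed interfaces) shows one way the two-simulation loop can be organized. The simulator env_step, the policy sampler sample_action (e.g., the softmax class sketched in Section 2, adapted to look up the features of the current state), the feature map features and the projections Gamma, Gamma_hat (e.g., the clipping helpers above) are all assumed to be user-supplied; the step-size exponents are the illustrative ones given earlier.

```python
import numpy as np

def constrained_actor_critic(env_step, sample_action, features, beta_states, beta_probs,
                             Gamma, Gamma_hat, theta0, s0, d1, N, alpha_bounds,
                             gamma=0.9, delta=0.1, num_iter=100000, seed=0):
    """Schematic rendering of (12)-(18).

    env_step(state, action, rng) -> (cost, constraint_costs[N], next_state)
    sample_action(theta, state, rng) -> action drawn from pi^theta(state, .)
    features(state) -> d1-dimensional feature vector f_state
    beta_states, beta_probs: support and probabilities of the initial distribution beta
    """
    rng = np.random.default_rng(seed)
    d = theta0.size
    theta = theta0.copy()
    lam = np.zeros(N)                       # Lagrange multipliers
    v, v_p = np.zeros(d1), np.zeros(d1)     # critics for the two simulations
    w = np.zeros((N, d1))                   # critics for the constraint costs
    X, Xp = s0, s0                          # states of the two parallel simulations
    f_beta = sum(p * features(s) for s, p in zip(beta_states, beta_probs))  # sum_s beta(s) f_s

    for n in range(num_iter):
        a_n, b_n, d_n = (n + 1) ** -0.55, (n + 1) ** -0.75, (n + 1) ** -1.0
        Delta = rng.choice([-1.0, 1.0], size=d)             # SPSA perturbation

        # Simulation 1 runs with theta, simulation 2 with theta + delta * Delta.
        Z = sample_action(theta, X, rng)
        Zp = sample_action(theta + delta * Delta, Xp, rng)
        c_n, g_n, Xnext = env_step(X, Z, rng)
        cp_n, gp_n, Xpnext = env_step(Xp, Zp, rng)

        fX, fXnext = features(X), features(Xnext)
        fXp, fXpnext = features(Xp), features(Xpnext)

        # TD errors (12), (13) and the N constraint TD errors.
        delta_td = c_n + lam @ (g_n - alpha_bounds) + gamma * v @ fXnext - v @ fX
        delta_td_p = cp_n + lam @ (gp_n - alpha_bounds) + gamma * v_p @ fXpnext - v_p @ fXp
        chi = (g_n - alpha_bounds) + gamma * w @ fXnext - w @ fX

        # Critic updates (14), (15), (17) on the fastest timescale.
        v = v + a_n * delta_td * fX
        v_p = v_p + a_n * delta_td_p * fXp
        w = w + a_n * chi[:, None] * fX[None, :]

        # Actor update (16): one-sided SPSA estimate on the intermediate timescale.
        theta = Gamma(theta + b_n * ((v - v_p) @ f_beta) / (delta * Delta))

        # Lagrange multiplier update (18) on the slowest timescale.
        lam = Gamma_hat(lam + d_n * (w @ f_beta))

        X, Xp = Xnext, Xpnext

    return theta, lam, v, w
```

The three step-size exponents realize the timescale separation required by (11): the critics (14), (15) and (17) move fastest, the actor (16) slower, and the Lagrange multipliers (18) slowest.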



4. Convergence analysis

Recall that b(n) = o(a(n)) and d(n) = o(b(n)) (see (11)). Hence, we also have d(n) = o(a(n)). The theory of multi-timescale stochastic approximation, see Chapter 6 of [15], allows us to treat θ(n), Δ(n) and λ̄(n) as constants θ, Δ and λ̄ (say) while analyzing (14)–(15). The policies π_n and π_n' can therefore be treated as constants (i.e., time invariant or stationary policies) π and π' respectively. Let P^π and P^{π'} be the transition probability matrices with elements p^π(i, j) = Σ_{a∈A(i)} π(i, a)p(i, j, a) and p^{π'}(i, j) = Σ_{a∈A(i)} π'(i, a)p(i, j, a), i, j ∈ S, respectively. Let d^π = (d^π(i), i ∈ S)^⊤ and d^{π'} = (d^{π'}(i), i ∈ S)^⊤ denote the stationary distributions under policies π and π'. Further, let D^π and D^{π'} respectively denote the diagonal matrices with elements d^π(i) and d^{π'}(i), i ∈ S, along their diagonals. In the following, ‖·‖ shall denote the Euclidean norm.

Define T, T', T_k : R^{|S|} → R^{|S|}, k = 1,...,N as the operators
$$T(J)(i) = \sum_{a\in A(i)}\pi(i,a)\left[c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big) + \gamma\sum_{j\in S}p(i,j,a)\, J(j)\right],$$
$$T'(J)(i) = \sum_{a\in A(i)}\pi'(i,a)\left[c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big) + \gamma\sum_{j\in S}p(i,j,a)\, J(j)\right],$$
$$T_k(J)(i) = \sum_{a\in A(i)}\pi(i,a)\left[\big(g_k(i,a)-\alpha_k\big) + \gamma\sum_{j\in S}p(i,j,a)\, J(j)\right],$$
∀i ∈ S. We let L(J) = (L(J)(i), i ∈ S)^⊤, where L could be any of the operators T, T', T_k, k = 1,...,N, respectively. Let C^π, C^{π'} and G_k^π, k = 1,...,N denote the column vectors
$$C^\pi = \left(\sum_{a\in A(i)}\pi(i,a)\left[c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big)\right],\ i\in S\right)^\top,$$
$$C^{\pi'} = \left(\sum_{a\in A(i)}\pi'(i,a)\left[c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big)\right],\ i\in S\right)^\top,$$
$$G_k^\pi = \left(\sum_{a\in A(i)}\pi(i,a)\big(g_k(i,a)-\alpha_k\big),\ i\in S\right)^\top.$$
We have the following result.

Theorem 1. Under Assumptions 1–3, with θ(n) ≡ θ, Δ(n) ≡ Δ and λ̄(n) ≡ λ̄ (for given θ, Δ and λ̄), ∀n ≥ 0, the iterates {v_n}, {v_n'} and {w_{k,n}}, k = 1,...,N, governed by recursions (14), (15) and (17) respectively, satisfy v_n → v^{π,λ̄}, v_n' → v^{π',λ̄} and w_{k,n} → w_k^π with probability one. The quantities v^{π,λ̄}, v^{π',λ̄} and w_k^π themselves are the unique solutions to
$$\Phi^\top D^\pi \Phi\, v^{\pi,\bar\lambda} = \Phi^\top D^\pi T(\Phi v^{\pi,\bar\lambda}), \qquad (19)$$
$$\Phi^\top D^{\pi'} \Phi\, v^{\pi',\bar\lambda} = \Phi^\top D^{\pi'} T'(\Phi v^{\pi',\bar\lambda}), \qquad (20)$$
$$\Phi^\top D^\pi \Phi\, w_k^\pi = \Phi^\top D^\pi T_k(\Phi w_k^\pi), \qquad (21)$$
k = 1,...,N, respectively. In particular, they satisfy
$$v^{\pi,\bar\lambda} = -\big(\Phi^\top D^\pi(\gamma P^\pi - I)\Phi\big)^{-1}\Phi^\top D^\pi C^\pi, \qquad (22)$$
$$v^{\pi',\bar\lambda} = -\big(\Phi^\top D^{\pi'}(\gamma P^{\pi'} - I)\Phi\big)^{-1}\Phi^\top D^{\pi'} C^{\pi'}, \qquad (23)$$
$$w_k^\pi = -\big(\Phi^\top D^\pi(\gamma P^\pi - I)\Phi\big)^{-1}\Phi^\top D^\pi G_k^\pi. \qquad (24)$$
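As an aside (not in the paper), the characterization in Theorem 1 is easy to verify numerically: for an assumed randomly generated transition model, policy and feature matrix, the explicit expression (22) satisfies the projected fixed-point equation (19). A minimal NumPy check is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, d1, N, gamma = 5, 3, 2, 1, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))       # p(i, a, j)
pi = rng.dirichlet(np.ones(nA), size=nS)            # SRP pi(i, a)
c = rng.uniform(size=(nS, nA))
g = rng.uniform(size=(N, nS, nA))
alpha = np.ones(N)
lam = rng.uniform(size=N)
Phi = rng.normal(size=(nS, d1))                     # full column rank w.p. 1

P_pi = np.einsum('ia,iaj->ij', pi, P)               # P^pi
# Stationary distribution d^pi: left eigenvector of P^pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi = d_pi / d_pi.sum()
D_pi = np.diag(d_pi)

# C^pi: policy-averaged relaxed single-stage cost vector.
relaxed = c + np.einsum('k,kia->ia', lam, g - alpha[:, None, None])
C_pi = np.einsum('ia,ia->i', pi, relaxed)

# Explicit solution (22).
A = Phi.T @ D_pi @ (gamma * P_pi - np.eye(nS)) @ Phi
v = -np.linalg.solve(A, Phi.T @ D_pi @ C_pi)

# Projected fixed point (19): Phi^T D^pi Phi v = Phi^T D^pi T(Phi v).
T_Phi_v = C_pi + gamma * P_pi @ (Phi @ v)
assert np.allclose(Phi.T @ D_pi @ Phi @ v, Phi.T @ D_pi @ T_Phi_v)
```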

Proof. We show here the claim for v^{π,λ̄}. The claims for v^{π',λ̄} and w_k^π follow in a similar manner. Our analysis is based on showing the stability and convergence of the iterates, for which we use the results in [16]. Consider the critic recursion (14). The ODE associated with (14) is the following:
$$\dot v^1 = \sum_{i\in S} d^\pi(i)\left[\sum_{a\in A(i)}\pi(i,a)\left(c(i,a) + \sum_{k=1}^{N}\lambda_k\big(g_k(i,a)-\alpha_k\big) + \gamma\, v^{1\top}\sum_{j\in S}p(i,j,a)f_j - v^{1\top}f_i\right)\right]f_i. \qquad (25)$$
In vector–matrix notation, (25) corresponds to
$$\dot v^1 = \Phi^\top D^\pi\big(T(\Phi v^1) - \Phi v^1\big) = g^1(v^1). \qquad (26)$$
Note that g^1(v^1) is Lipschitz continuous in v^1. Now define g^1_∞(v^1) as
$$g^1_\infty(v^1) = \lim_{r\to\infty}\frac{g^1(r v^1)}{r} = \Phi^\top D^\pi(\gamma P^\pi - I)\Phi\, v^1,$$
where I is the identity matrix. Consider now the ODE
$$\dot v^1 = g^1_\infty(v^1). \qquad (27)$$
It can be inferred from Theorem 2 of [8] that the matrix Φ^⊤D^π(γP^π − I)Φ is negative definite. For the sake of completeness, we provide the necessary (though brief) argument. For x ∈ R^{|S|}, define a weighted Euclidean norm ‖x‖_{D^π} according to ‖x‖_{D^π} = (x^⊤D^πx)^{1/2}. Note that ‖x‖²_{D^π} = x^⊤D^πx = ‖(D^π)^{1/2}x‖². Now for any function V ∈ R^{|S|}, we have
$$\|P^\pi V\|^2_{D^\pi} = V^\top P^{\pi\top} D^\pi P^\pi V = \sum_{i\in S} d^\pi(i)\, E^2\big[V(X_{n+1})\mid X_n = i, \pi\big] \le \sum_{i\in S} d^\pi(i)\, E\big[V^2(X_{n+1})\mid X_n = i, \pi\big] = \sum_{j\in S} d^\pi(j)\, V^2(j) = \|V\|^2_{D^\pi},$$
where the inequality follows from Jensen's inequality and the subsequent equality from the stationarity of d^π under P^π. Hence, for any x ≠ 0,
$$x^\top\Phi^\top D^\pi(\gamma P^\pi - I)\Phi x \le (\gamma - 1)\|\Phi x\|^2_{D^\pi} \le -C_1\|x\|^2, \qquad (28)$$
for some 0 < C_1 < ∞, since Φ has full rank (Assumption 3). Thus the origin is the unique globally asymptotically stable equilibrium of (27). Now let v^1 = v^{π,λ̄} be a solution to
$$g^1(v^1) = \Phi^\top D^\pi\big(T(\Phi v^1) - \Phi v^1\big) = 0. \qquad (29)$$
Note that (29) corresponds to the linear system of equations
$$\Phi^\top D^\pi C^\pi + \Phi^\top D^\pi(\gamma P^\pi - I)\Phi\, v^1 = 0. \qquad (30)$$
Now since we have already shown that Φ^⊤D^π(γP^π − I)Φ is negative definite, it is of full rank and invertible. Hence v^{π,λ̄} is the unique solution to (30) and corresponds to (22). Assumptions (A1)–(A2) of [16] are now satisfied and the claim follows from Theorems 2.1–2.2(i) of [16]. □

Lemma 1. The quantities v^{π,λ̄} and w_k^π, k = 1,...,N, are continuously differentiable in θ.

Proof. Consider v^{π,λ̄} first. Recall from Theorem 1 that v^{π,λ̄} is obtained according to (22). From Assumption 2, it is easy to see that C^π and P^π are continuously differentiable in θ. Further, from an application of Theorem 2, pp. 402–403 of [17], it can be seen that the stationary distribution d^π of the process {X_n} is continuously differentiable in θ. Hence, D^π is also continuously differentiable in θ. Now writing out the inverse of the matrix Φ^⊤D^π(γP^π − I)Φ explicitly using Cramer's rule, one can see that v^{π,λ̄} is continuously differentiable in θ. Similarly, w_k^π, k = 1,...,N, can be seen to be continuously differentiable in θ as well. □

By an abuse of notation, we shall many times simply denote v^{π,λ̄} as v^{θ,λ̄} and w_k^π as w_k^θ, respectively. We now consider the recursion (16). As before, because d(n) = o(b(n)) (see (11)), we let λ̄(n) ≡ λ̄ (a constant) in the analysis of (16). Let χ(·) be a vector field on the set C in which θ takes values. Define another vector field
$$\bar\Gamma(\chi(y)) = \lim_{0<\eta\to 0}\frac{\Gamma\big(y + \eta\,\chi(y)\big) - y}{\eta}, \qquad (31)$$
and consider the ODE
$$\dot\theta = \bar\Gamma\left(-\nabla_\theta\sum_{s_0\in S}\beta(s_0)\, v^{\theta,\bar\lambda\top} f_{s_0}\right). \qquad (32)$$
Let
$$K_{\bar\lambda} = \left\{\theta\in C : \bar\Gamma\left(-\nabla_\theta\sum_{s_0\in S}\beta(s_0)\, v^{\theta,\bar\lambda\top} f_{s_0}\right) = 0\right\}$$
denote the set of equilibria of (32) and, given ϵ > 0, let K^ϵ_λ̄ denote the set of all points within a distance ϵ of K_λ̄.

Theorem 2. Under Assumptions 1–3, for given λ̄ and any ϵ > 0, there exists δ_0 > 0 such that for all δ ∈ (0, δ_0), the iterates θ(n), n ≥ 0 obtained according to (16) satisfy θ(n) → K^ϵ_λ̄ as n → ∞, with probability one.

Proof. In lieu of Theorem 1, the recursion (16) can be replaced with
$$\theta_i(n+1) = \Gamma_i\left(\theta_i(n) - b(n)\sum_{s_0\in S}\beta(s_0)\left[\frac{\big(v^{\theta(n)+\delta\Delta(n),\bar\lambda} - v^{\theta(n),\bar\lambda}\big)^\top f_{s_0}}{\delta\,\Delta_i(n)} + \epsilon_1(n)\right]\right), \qquad (33)$$
where ϵ_1(n) → 0 as n → ∞ (from Theorem 1). This recursion is stable because of the projection Γ_i. Further, the ODE associated with (33) is
$$\dot\theta_i = \bar\Gamma_i\left(-\sum_{s_0\in S}\beta(s_0)\, E\left[\frac{\big(v^{\theta+\delta\Delta,\bar\lambda} - v^{\theta,\bar\lambda}\big)^\top f_{s_0}}{\delta\,\Delta_i}\ \Big|\ \theta,\bar\lambda\right]\right), \qquad (34)$$
where Δ_i = ±1 w.p. 1/2 and Δ_j is independent of Δ_k for all j ≠ k. The expectation E[·] in (34) is taken w.r.t. the common distribution of Δ_i. Using a suitable Taylor's expansion, one obtains
$$\frac{\big(v^{\theta+\delta\Delta,\bar\lambda} - v^{\theta,\bar\lambda}\big)^\top f_{s_0}}{\delta\,\Delta_i} = \nabla_{\theta_i} v^{\theta,\bar\lambda\top} f_{s_0} + \sum_{l=1,\,l\ne i}^{d}\frac{\Delta_l}{\Delta_i}\,\nabla_{\theta_l} v^{\theta,\bar\lambda\top} f_{s_0} + \epsilon_2(\delta)^\top f_{s_0},$$
where ϵ_2(δ) → 0 as δ → 0. Now,
$$E\left[\sum_{l=1,\,l\ne i}^{d}\frac{\Delta_l}{\Delta_i}\,\nabla_{\theta_l} v^{\theta,\bar\lambda\top} f_{s_0}\ \Big|\ \theta,\bar\lambda\right] = 0,$$
since Δ_l/Δ_i = ±1 w.p. 1/2 ∀l ≠ i. Hence,
$$E\left[\frac{\big(v^{\theta+\delta\Delta,\bar\lambda} - v^{\theta,\bar\lambda}\big)^\top f_{s_0}}{\delta\,\Delta_i}\ \Big|\ \theta,\bar\lambda\right] = \nabla_{\theta_i} v^{\theta,\bar\lambda\top} f_{s_0} + \epsilon_2(\delta)^\top f_{s_0} \to \nabla_{\theta_i} v^{\theta,\bar\lambda\top} f_{s_0} \ \text{ as } \ \delta\to 0.$$
Thus, the trajectories of (34) converge to those of (32) uniformly on compacts when starting from the same initial condition (in both). Let Z^λ̄(·) be defined according to
$$Z^{\bar\lambda}(\theta) = \sum_{i\in S}\beta(i)\, v^{\theta,\bar\lambda\top} f_i = v^{\theta,\bar\lambda\top} V,$$
where V = Σ_{i∈S} β(i)f_i is a d_1-dimensional vector. Then corresponding to the ODE (32), we have
$$\frac{dZ^{\bar\lambda}(\theta)}{dt} = \nabla Z^{\bar\lambda}(\theta)^\top\dot\theta = V^\top\nabla_\theta v^{\theta,\bar\lambda}\,\bar\Gamma\left(-\nabla_\theta v^{\theta,\bar\lambda\top} V\right) < 0 \quad \forall\theta\notin K_{\bar\lambda}.$$
Thus, Z^λ̄(·) is a strict Liapunov function for (32). The claim now follows from Theorem 1, pp. 339, of [18]. □

Next, we consider the recursion (18). For any λ ∈ R and a bounded, continuous function ω : R → R, let
$$\hat{\bar\Gamma}(\omega(\lambda)) = \lim_{0<\eta\to 0}\frac{\hat\Gamma\big(\lambda + \eta\,\omega(\lambda)\big) - \lambda}{\eta}.$$
Let
$$F = \left\{\bar\lambda = (\lambda_1,\ldots,\lambda_N)^\top\ \Big|\ \lambda_k\in[0,P],\ \hat{\bar\Gamma}\left(\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda}\top} f_{s_0}\right) = 0,\ \forall k = 1,\ldots,N,\ \theta_{\bar\lambda}\in K_{\bar\lambda}\right\}.$$
Further, let F̌ denote the set of local maxima of the function
$$H^\beta(\cdot) = \min_\theta\sum_{s_0\in S}\beta(s_0)\, v^{\theta,\cdot\,\top} f_{s_0}.$$

Theorem 3. Under Assumptions 1–3, λ̄(n) = (λ_1(n),...,λ_N(n))^⊤, n ≥ 0, obtained from (18) satisfy λ̄(n) → λ̄* as n → ∞, with probability one, for some λ̄* ∈ F. Further, F = F̌, i.e., λ̄* is a local maximum of H^β(·).

Proof. In lieu of Theorems 1 and 2, the recursion (18) can be replaced with
$$\lambda_k(n+1) = \hat\Gamma\left(\lambda_k(n) + d(n)\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda(n)}\top} f_{s_0}\right), \qquad (35)$$
k = 1,...,N. Now from an application of Theorem 2, Chapter 2 of [15] (see the third extension to the basic scheme described on pp. 17, Chapter 2, of [15]), λ̄(n) → λ̄* for some λ̄* = (λ*_1,...,λ*_N)^⊤ such that θ_λ̄* ∈ K_λ̄* and $\hat{\bar\Gamma}(\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}) = 0$, ∀k = 1,...,N, viz., λ̄* ∈ F.

Now note that from Theorem 1, v^{π,λ̄} = ΠC^π while w_k^π = ΠG_k^π, where Π = (Φ^⊤D^π(I − γP^π)Φ)^{-1}Φ^⊤D^π. Now C^π can be written as
$$C^\pi = \bar C^\pi + \sum_{k=1}^{N}\lambda_k\, G_k^\pi,$$
where C̄^π = (Σ_{a∈A(i)} π(i, a)c(i, a), i ∈ S)^⊤. Thus,
$$\Pi C^\pi = \Pi\bar C^\pi + \sum_{k=1}^{N}\lambda_k\,\Pi G_k^\pi, \quad\text{or}\quad v^{\pi,\bar\lambda} = \Pi\bar C^\pi + \sum_{k=1}^{N}\lambda_k\, w_k^\pi.$$
Hence,
$$\nabla_{\lambda_k} v^{\pi,\bar\lambda} = w_k^\pi.$$
From the above, recursion (18) corresponds to a gradient ascent (for all k = 1,...,N) and, for given λ̄, its associated ODE is
$$\dot\lambda_k(t) = \hat{\bar\Gamma}\left(\sum_{i\in S}\beta(i)\,\nabla_{\lambda_k} v^{\theta_{\bar\lambda},\bar\lambda\top} f_i\right) = \hat{\bar\Gamma}\left(\nabla_{\lambda_k}\sum_{i\in S}\beta(i)\, v^{\theta_{\bar\lambda},\bar\lambda\top} f_i\right), \qquad (36)$$
k = 1,...,N. As with [14], one can invoke the envelope theorem of mathematical economics (see pp. 964–966 of [19]) to conclude that (36) corresponds to
$$\dot\lambda_k(t) = \hat{\bar\Gamma}\left(\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda}\top} f_{s_0}\right),$$
k = 1,...,N, interpreted in the 'Caratheodory' sense, see Lemma 4.3 of [14]. The above can be seen to have the local maxima of H^β(·) as its asymptotically stable attractors. The claim follows. □
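The identity ∇_{λ_k} v^{π,λ̄} = w_k^π used in the proof above can also be checked numerically. The sketch below (not from the paper; random instances are assumed, generated as in the earlier checks) compares a finite-difference derivative of (22) with respect to λ_k against the explicit expression (24).

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, d1, N, gamma = 5, 3, 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
pi = rng.dirichlet(np.ones(nA), size=nS)
c = rng.uniform(size=(nS, nA))
g = rng.uniform(size=(N, nS, nA))
alpha = np.ones(N)
Phi = rng.normal(size=(nS, d1))

P_pi = np.einsum('ia,iaj->ij', pi, P)
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
D_pi = np.diag(d_pi / d_pi.sum())
A = Phi.T @ D_pi @ (gamma * P_pi - np.eye(nS)) @ Phi

def v_of(lam):
    """Explicit critic solution (22) as a function of the multiplier vector."""
    relaxed = c + np.einsum('k,kia->ia', lam, g - alpha[:, None, None])
    C_pi = np.einsum('ia,ia->i', pi, relaxed)
    return -np.linalg.solve(A, Phi.T @ D_pi @ C_pi)

# w_k^pi from (24), compared with a finite difference of (22) in lambda_k.
lam = rng.uniform(size=N)
for k in range(N):
    G_k = np.einsum('ia,ia->i', pi, g[k] - alpha[k])
    w_k = -np.linalg.solve(A, Phi.T @ D_pi @ G_k)
    eps = 1e-6
    dv = (v_of(lam + eps * np.eye(N)[k]) - v_of(lam)) / eps
    assert np.allclose(dv, w_k, atol=1e-4)      # grad_{lambda_k} v^{pi,lam} = w_k^pi
```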

Proposition 1. Let λ̄* ∈ F be such that λ*_k ∈ [0, P) for all k = 1,...,N. Then
$$\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} \le 0, \quad \forall k = 1,\ldots,N.$$

Proof. Suppose not. Then there exists a k ∈ {1,...,N} for which $\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} > 0$. Since λ*_k ∈ [0, P) by definition,
$$\hat\Gamma\left(\lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}\right) = \lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}$$
for η sufficiently small. Hence,
$$\hat{\bar\Gamma}\left(\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}\right) = \lim_{\eta\to 0}\frac{\hat\Gamma\left(\lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}\right) - \hat\Gamma(\lambda^*_k)}{\eta} = \lim_{\eta\to 0}\frac{\lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} - \lambda^*_k}{\eta} = \sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} > 0,$$
which is a contradiction since λ̄* ∈ F. Note above that Γ̂(λ*_k) = λ*_k since λ*_k ∈ [0, P). The claim follows. □

In the case of λ̄* ∈ F such that λ*_k (the kth component of λ̄*), for some k, equals P and $\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} > 0$, we have
$$\hat\Gamma\left(\lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}\right) = \lambda^*_k = P.$$
In this case, $\hat{\bar\Gamma}(\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}) = 0$. Such λ̄* shall correspond to spurious fixed points of the algorithm, see [20]. By selecting P to be large enough, one can ensure in most practical applications that the scheme will avoid getting stuck at a spurious fixed point. We now have the following result.

Proposition 2. For λ̄* ∈ F, if $\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} < 0$ for some k ∈ {1,...,N}, then λ*_k = 0.

Proof. We consider the following two possibilities: (a) λ*_k = 0 and (b) 0 < λ*_k ≤ P, respectively. Consider (a) first. It is easy to see that for λ*_k = 0 and $\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} < 0$, we have that $\hat\Gamma(\lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}) = 0$ as well, for all η > 0. Now consider (b). Note that for any 0 < λ*_k ≤ P, one can find an η_0 > 0 such that for all 0 < η ≤ η_0,
$$\hat\Gamma\left(\lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}\right) = \lambda^*_k + \eta\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} \in (0, P).$$
Thus
$$\hat{\bar\Gamma}\left(\sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0}\right) = \sum_{s_0\in S}\beta(s_0)\, w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} < 0,$$
which is a contradiction since λ̄* ∈ F. Thus, for λ̄* ∈ F, $\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} < 0$ for some k ∈ {1,...,N} is only possible provided λ*_k = 0. □

Remark 1. Let θ_λ̄* be such that $\sum_{s_0\in S}\beta(s_0) w_k^{\theta_{\bar\lambda^*}\top} f_{s_0} < 0$, ∀k = 1,...,N, namely θ_λ̄* lies in the interior of the constraint region (obtained from the inequality constraints). Then from Proposition 2, λ*_k = 0, ∀k = 1,...,N, which implies that L^{e_i}(θ_λ̄*, λ̄*) = U^π(i) ≈ v^{π,0⊤} f_i. The problem in such a case becomes unconstrained (i.e., has no functional inequality constraints). This corresponds to the case of a regular (unconstrained) MDP. The set K_0 (that corresponds to K_λ̄* with λ*_k = 0, ∀k = 1,...,N) in this case becomes
$$K_0 = \left\{\theta\in C\ \Big|\ \bar\Gamma\left(\sum_{s_0\in S}\beta(s_0)\,\nabla_\theta\big(v^{\theta,0}\big)^\top f_{s_0}\right) = 0\right\}.$$
Our algorithm will then converge to a local minimum of Σ_{i∈S} β(i)U^π(i) ≈ Σ_{i∈S} β(i) v^{θ,0⊤} f_i within the set C.

5. Conclusions

We presented the first actor–critic reinforcement learning algorithm with function approximation for infinite horizon discounted cost control under multiple inequality constraints. We applied the Lagrange multiplier method to handle the inequality constraints. Our algorithm combined various aspects of temporal difference learning, policy gradient methods and simultaneous perturbation stochastic approximation. We proved the almost sure convergence of the algorithm to a locally optimal policy. As future work, the numerical performance of this algorithm can be tested in some real-life settings. One can also develop similar algorithms for long-run average cost constrained MDPs.

References

[1] E. Altman, Constrained Markov Decision Processes, Chapman and Hall/CRC Press, 1999.
[2] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] R.S. Sutton, A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[4] D.P. Bertsekas, Dynamic Programming and Optimal Control, vol. II, third edition, Athena Scientific, Belmont, MA, 2007.
[5] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley, New York, 1994.
[6] V.R. Konda, J.N. Tsitsiklis, On actor–critic algorithms, SIAM Journal on Control and Optimization 42 (4) (2003) 1143–1166.
[7] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, M. Lee, Natural actor–critic algorithms, Automatica 45 (2009) 2471–2482.
[8] J.N. Tsitsiklis, B. Van Roy, An analysis of temporal difference learning with function approximation, IEEE Transactions on Automatic Control 42 (5) (1997) 674–690.
[9] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Advances in Neural Information Processing Systems (NIPS) 12, MIT Press, 2000, pp. 1057–1063.
[10] A. Lazar, Optimal flow control of a class of queuing networks in equilibrium, IEEE Transactions on Automatic Control 28 (1983) 1001–1007.
[11] F. Vakil, A. Lazar, Flow control protocols for integrated networks with partially observed voice traffic, IEEE Transactions on Automatic Control AC-32 (1987) 2–14.
[12] J.C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Transactions on Automatic Control 37 (3) (1992) 332–341.
[13] H.F. Chen, T.E. Duncan, B. Pasik-Duncan, A Kiefer–Wolfowitz algorithm with randomized differences, IEEE Transactions on Automatic Control 44 (3) (1999) 442–453.
[14] V.S. Borkar, An actor–critic algorithm for constrained Markov decision processes, Systems and Control Letters 54 (2005) 207–213.
[15] V.S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press and Hindustan Book Agency, 2008.
[16] V.S. Borkar, S.P. Meyn, The O.D.E. method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization 38 (2) (2000) 447–469.
[17] P.J. Schweitzer, Perturbation theory and finite Markov chains, Journal of Applied Probability 5 (1968) 401–413.
[18] M.W. Hirsch, Convergent activation dynamics in continuous time networks, Neural Networks 2 (1989) 331–349.
[19] A. Mas-Colell, M.D. Whinston, J.R. Green, Microeconomic Theory, Oxford University Press, Oxford, UK, 1995.
[20] H.J. Kushner, G.G. Yin, Stochastic Approximation Algorithms and Applications, Springer Verlag, New York, 1997.