
A Policy Gradient Method for Semi-Markov Decision Processes with Application to Call Admission Control

Sumeetpal S. Singh∗, Vladislav B. Tadić and Arnaud Doucet

Abstract

Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which one optimises the SMDP performance criterion with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it using stochastic approximation. We apply our algorithm to call admission control.

Index Terms

Stochastic processes; Semi-Markov decision process; Policy gradient; Two-time scale; Call admission control.



∗ Corresponding author.

S. Singh is with the Signal Processing Group, Department of Engineering, Cambridge University, CB2 1PZ Cambridge, UK. Email: [email protected] Tel: +44 1223 332 784 Fax: +44 1223 332 662
V. Tadić is with the Department of Automatic Control and Systems Engineering, The University of Sheffield, S1 3JD Sheffield, UK. Email: [email protected] Tel: +44 114 222 5198 Fax: +44 (0)114 222 5661
A. Doucet is with the Signal Processing Group, Department of Engineering, Cambridge University, CB2 1PZ Cambridge, UK. Email: [email protected] Tel: +44 1223 332 676 Fax: +44 1223 332 662


I. INTRODUCTION

A semi-Markov decision process (SMDP) can be solved using classical methods such as dynamic programming (DP) and policy iteration. However, these classical methods suffer from the curse of dimensionality: the memory required to represent the optimal value function grows with the size of the state space and becomes prohibitive for problems with large state and action spaces. Secondly, classical methods require knowledge of the transition probabilities of the SMDP (i.e., a model). In applications such as call admission control (CAC), the state space is large and the SMDP probabilities are not known a priori [8], [10], [11]. To overcome the memory requirement and the need for a model, reinforcement learning (RL) with function approximation is used [3]. RL combines DP and stochastic approximation (SA) to learn the optimal value function online from a family of compactly represented parameterised functions. RL for MDPs is well studied [3] and has recently been extended to SMDPs [4], [7], [8].

Inspired by the work in [1], [12], the contribution of this paper is to present an alternative approach in which the policy of an SMDP is represented by its own function approximator, yielding a family of parameterised policies. The gradient of the SMDP performance criterion with respect to the policy parameters is then estimated online. Using this estimated gradient, an SA algorithm updates the policy parameters to improve performance. This approach, known as the policy gradient method [1], [12], also overcomes the memory requirement and the need for a model. In this paper, the performance criterion used is the average cost, and we prove the convergence of the derived SMDP gradient estimator under mild regularity conditions. We apply our algorithm to CAC and, in simulations, demonstrate convergence of the online algorithm to the optimal policy. The advantage of the policy gradient framework is that it can be applied to optimise an SMDP


subject to average cost constraints by using recent results on constrained stochastic approximation [15]. This is important in CAC for multi-class networks, where the constraints are used to impose upper bounds on the blocking probabilities of the various user classes. Additionally, we consider the general case in which no knowledge of the semi-Markov kernel, or of quantities derived from it, is assumed (see Section III for details). Hence, estimating the SMDP gradient does not reduce to estimating the realisation matrix only, as is done in [5].

Outline: Section II presents the average cost SMDP problem. In Section III, the SMDP gradient is derived, an online algorithm for estimating it is presented, and a convergence result is given. Section IV presents algorithm OLSMDP, which optimises an SMDP online based on a single sample path. The application to CAC, with some numerical examples, is given in Section V. All proofs appear in the Appendix.

II. PROBLEM FORMULATION

Let $\{(x_k^\theta, \tau_k^\theta, a_k^\theta)\}_{k\ge 0}$ be a controlled semi-Markov process, where $x_k^\theta \in X := \{1, \dots, n\}$, $X$ being the state space of the embedded chain $\{x_k^\theta\}_{k\ge 0}$, and $a_k^\theta \in A$ is the control (or action) applied at time $k$. The control space $A$ is finite. $\theta = [\theta_1, \dots, \theta_K]^T \in \mathbb{R}^K$ is the parameter that determines the control policy in effect, where typically $K < n$. $\mathbb{R}^K$ parameterises a family of stationary randomised policies as follows. Let $\tilde{u} : \mathbb{R}^K \times X \to \Pi(A)$, where $\Pi(A)$ denotes the set of all probability measures on $A$. At epoch $k$, the control $a_k^\theta$ to be applied is sampled from the measure $\tilde{u}(\theta, x_k^\theta, \cdot)$. $\tau_k^\theta \in \mathbb{R}_+ := [0, \infty)$, $k > 0$, is the time the process $\{x_k^\theta\}_{k\ge 0}$ dwells in state $x_{k-1}^\theta$ or, equivalently, the time interval between the $(k-1)$-th and the $k$-th transition. Let $Q$ be the semi-Markov kernel that specifies the distribution of $(x_{k+1}^\theta, \tau_{k+1}^\theta)$ given $(x_k^\theta, a_k^\theta)$:

$$P(\tau_{k+1}^\theta \le \tau,\ x_{k+1}^\theta = j \mid x_k^\theta = i,\ a_k^\theta = a) = Q_{ij}(\tau, a). \qquad (1)$$

Note that $Q$ is independent of $\theta$; the dependence of $\{(x_k^\theta, \tau_k^\theta, a_k^\theta)\}_{k\ge 0}$ on $\theta$ arises only through the policy $\tilde{u}$. We make the following standard stability assumptions [2, Chapter 5]:

$$\int_0^\infty \tau\, Q_{ij}(d\tau, a) < \infty, \quad \forall i, j, a, \qquad (2)$$

$$\tau(i, a) := \sum_{j \in X} \int_0^\infty \tau\, Q_{ij}(d\tau, a) > 0, \quad \forall i, a. \qquad (3)$$

Let $x^\theta(t)$ and $a^\theta(t)$ be, respectively, the right-continuous interpolations of $\{x_k^\theta\}_{k\ge 0}$ and $\{a_k^\theta\}_{k\ge 0}$ defined by $(x^\theta(t), a^\theta(t)) = (x_k^\theta, a_k^\theta)$ for all $t \in [t_k, t_{k+1})$, where $t_k := t_{k-1} + \tau_k^\theta$, $k > 0$, $t_0 := 0$. $\{(x^\theta(t), a^\theta(t))\}_{t \in \mathbb{R}_+}$ is the continuous-time semi-Markov process. Let $c : X \times A \to \mathbb{R}$ be a suitably defined cost function. For a given $\theta \in \mathbb{R}^K$, define the average cost function

$$J(\theta) = \lim_{T \to \infty} T^{-1}\, \mathrm{E}\Big\{ \int_0^T c(x^\theta(t), a^\theta(t))\, dt \Big\}. \qquad (4)$$

The aim is to minimise $J(\theta)$ with respect to the parameter $\theta$. We conclude this section with an assumption that ensures $J(\theta)$ is a well-defined quantity with a limit independent of the initial distribution of $x_0^\theta$. A finite-state discrete-time homogeneous Markov chain with state space $X$ and transition probability matrix $[P_{ij}]$ is said to be a unichain if any two closed sets of states have a non-empty intersection [14]. (A set of states $\bar{X} \subset X$ is closed if $\sum_{j \in \bar{X}} P_{ij} = 1$ for all $i \in \bar{X}$.) Note that if there exists a state $i^*$ that is accessible from every state $j \in X$, then the Markov chain is a unichain. (In CAC, the state that corresponds to an empty system is accessible from all other states of the system.) Finally, note that the semi-Markov kernel in (1) specifies the transition probabilities

$$P_{ij}(a) := P(x_{k+1}^\theta = j \mid x_k^\theta = i,\ a_k^\theta = a) = \lim_{\tau \to \infty} Q_{ij}(\tau, a). \qquad (5)$$


Assumption 1: For any $\theta \in \mathbb{R}^K$, the embedded chain $\{x_k^\theta\}_{k\ge 0}$ with transition probability $P(\theta)_{ij} := P(x_{k+1}^\theta = j \mid x_k^\theta = i) = \sum_{a \in A} P_{ij}(a)\, \tilde{u}(\theta, i, a)$ is a unichain.

Let $\{\pi(\theta, i) : i \in X\}$ be the unique stationary distribution of $\{x_k^\theta\}_{k\ge 0}$; uniqueness follows from Assumption 1, and $\lim_{T \to \infty} (1/T) \sum_{k=0}^{T-1} P(\theta)^k_{ij} = \pi(\theta, j)$ independently of $i$ [14, Theorem 2.3.3], where $P(\theta)^k_{ij}$ is the $ij$-th element of the matrix $(P(\theta))^k$ and $P^0(\theta) := I$. Under Assumption 1, by Theorem 3.5.1 of [14],

$$J(\theta) = \frac{\sum_{i \in X, a \in A} c(i, a)\, \tau(i, a)\, \tilde{u}(\theta, i, a)\, \pi(\theta, i)}{\sum_{i \in X, a \in A} \tau(i, a)\, \tilde{u}(\theta, i, a)\, \pi(\theta, i)} =: \frac{\eta_c(\theta)}{\eta_\tau(\theta)}. \qquad (6)$$
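To make (6) concrete, the following Python sketch evaluates $J(\theta)$ as the ratio $\eta_c(\theta)/\eta_\tau(\theta)$ when the stationary distribution $\pi(\theta, \cdot)$, the policy $\tilde{u}(\theta, \cdot, \cdot)$, the costs $c(i, a)$ and the expected sojourn times $\tau(i, a)$ are available as arrays. This exact evaluation is only an illustration, not part of the online algorithm developed below, and the array-based interface is an assumption of the sketch.

```python
import numpy as np

def average_cost(pi, u_tilde, c, tau):
    """Evaluate J(theta) via (6) as eta_c(theta) / eta_tau(theta).

    pi      : (n,)   stationary distribution pi(theta, .) of the embedded chain
    u_tilde : (n, m) randomised policy, u_tilde[i, a] = u~(theta, i, a)
    c       : (n, m) one-stage costs c(i, a)
    tau     : (n, m) expected sojourn times tau(i, a)
    """
    pi, u_tilde, c, tau = map(np.asarray, (pi, u_tilde, c, tau))
    eta_c = np.sum(pi[:, None] * u_tilde * c * tau)     # numerator of (6)
    eta_tau = np.sum(pi[:, None] * u_tilde * tau)       # denominator of (6)
    return eta_c / eta_tau
```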

III. ONLINE ESTIMATION OF THE GRADIENT FOR J(θ)

In this section, we derive a single-sample-path-based estimator for $\nabla J$ by assuming knowledge of $\nabla\tilde{u}/\tilde{u}$ only, and not of the semi-Markov kernel (1) or quantities derived from it (e.g., (5) or (3)). We have

$$\nabla J(\theta) = \big[\eta_\tau(\theta)\, \nabla \eta_c(\theta) - \eta_c(\theta)\, \nabla \eta_\tau(\theta)\big]\, \eta_\tau(\theta)^{-2} \qquad (7)$$

where the following convention for $\nabla$ is adopted. For a scalar $b(\theta) \in \mathbb{R}$, the vector $\nabla b := [\partial b/\partial\theta_1, \dots, \partial b/\partial\theta_K]^T$. For a vector $b(\theta) \in \mathbb{R}^n$, the matrix $\nabla b := [\partial b/\partial\theta_1, \dots, \partial b/\partial\theta_K]$. Given $B(\theta) \in \mathbb{R}^{n \times m}$ and vectors $a \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $a^T(\nabla B)b := [a^T (\partial B/\partial\theta_1) b, \dots, a^T (\partial B/\partial\theta_K) b]^T$. By virtue of Assumption 1, $\eta_c$ and $\eta_\tau$ are performance criteria for average cost MDPs with cost functions $c(x_k, a_k)\tau(x_k, a_k)$ and $\tau(x_k, a_k)$ respectively, i.e.,

$$\eta_c(\theta) = \lim_{T \to \infty} T^{-1}\, \mathrm{E}_\theta\Big\{ \sum_{k=0}^{T-1} c(x_k, a_k)\, \tau(x_k, a_k) \Big\}, \qquad (8)$$

and similarly for $\eta_\tau(\theta)$. (We have dropped the superscript $\theta$ on the processes in favour of the subscript on the expectation operator $\mathrm{E}$ to indicate that the randomised policy $\theta$ is in use.) Define


the $i$-th component of $\bar{c}(\theta) \in \mathbb{R}^{|X|}$ to be

$$\bar{c}(\theta, i) := \mathrm{E}_\theta\{ c(x_k, a_k)\, \tau(x_k, a_k) \mid x_k = i \} = \sum_{a \in A} c(i, a)\, \tau(i, a)\, \tilde{u}(\theta, i, a). \qquad (9)$$

$\bar{c}$ is the "smoothed" version of the cost $c$ and is a function of the state $x_k$ only. Then $\eta_c(\theta) = \sum_{i \in X} \pi(\theta, i)\, \bar{c}(\theta, i)$. Omitting the dependence on $\theta$,¹ the gradient is

$$\nabla \eta_c = (\nabla \pi)^T \bar{c} + (\nabla \bar{c})^T \pi. \qquad (10)$$

Given $\beta \in [0, 1)$, define the vector $\eta_{\bar{c},\beta}(\theta) \in \mathbb{R}^{|X|}$ (by specifying its $i$-th component) as follows:

$$\eta_{\bar{c},\beta}(\theta, i) := e_i^T \Big[ \sum_{k=0}^{\infty} \beta^k P(\theta)^k \bar{c}(\theta) \Big] \qquad (11)$$

where $e_i \in \mathbb{R}^{|X|}$ is the unit vector with the $i$-th component equal to 1. ($\eta_{\bar{c},\beta}(\theta)$ is the value function of an infinite-horizon $\beta$-discounted MDP with transition probability matrix $P(\theta)$ and cost $\bar{c}(\theta)$.) Then, due to [1, Theorem 2],

$$\lim_{\beta \to 1} \pi^T (\nabla P(\theta))\, \eta_{\bar{c},\beta} = (\nabla \pi)^T \bar{c}, \quad \forall \theta \in \mathbb{R}^K. \qquad (12)$$

This result implies that

$$\nabla_\beta \eta_c(\theta) := \pi(\theta)^T (\nabla P(\theta))\, \eta_{\bar{c},\beta}(\theta) + (\nabla \bar{c}(\theta))^T \pi(\theta) \qquad (13)$$

is a good estimate of $\nabla \eta_c(\theta)$ when $\beta$ is close to 1. Furthermore, we show that $\pi^T (\nabla P(\theta))\, \eta_{\bar{c},\beta} + (\nabla \bar{c})^T \pi$ can be estimated online using only a single sample path of the SMDP (see algorithm SMDPBG below). The effect of $\beta$ in the gradient estimator presented below is a bias-variance trade-off: a small $\beta$ gives a small variance but a large bias, and vice versa [1].

¹ Throughout this paper, the dependence on θ is omitted whenever there is no risk of ambiguity.


As was done for $\nabla \eta_c$, we can similarly define a biased gradient for $\nabla \eta_\tau$:

$$\nabla_\beta \eta_\tau := \pi^T (\nabla P(\theta))\, \eta_{\bar{\tau},\beta} + (\nabla \bar{\tau})^T \pi \qquad (14)$$

where the $i$-th components of the vectors $\bar{\tau}(\theta), \eta_{\bar{\tau},\beta}(\theta) \in \mathbb{R}^{|X|}$ are defined as

$$\bar{\tau}(\theta, i) := \sum_{a \in A} \tau(i, a)\, \tilde{u}(\theta, i, a), \qquad \eta_{\bar{\tau},\beta}(\theta, i) := e_i^T \Big[ \sum_{k=0}^{\infty} \beta^k P(\theta)^k \bar{\tau}(\theta) \Big]. \qquad (15)$$

In summary, due to [1, Theorem 2], $\lim_{\beta \to 1} \nabla_\beta J(\theta) = \nabla J(\theta)$, where

$$\nabla_\beta J(\theta) := \big[\eta_\tau(\theta)\, \nabla_\beta \eta_c(\theta) - \eta_c(\theta)\, \nabla_\beta \eta_\tau(\theta)\big]\, \eta_\tau(\theta)^{-2}. \qquad (16)$$

Values of $\beta$ close to 1 lead to a large variance in the estimates of $\nabla_\beta J$ obtained by algorithm SMDPBG below. However, if $P(\theta)$ has distinct eigenvalues and a short mixing time, $\beta$ need not be too close to 1 for $\nabla_\beta \eta_c$ and $\nabla_\beta \eta_\tau$ to approximate $\nabla \eta_c$ and $\nabla \eta_\tau$ well, respectively (and hence for $\nabla_\beta J$ to approximate $\nabla J$ well); see [1, Theorem 3] for a precise statement. In the case study in Section V, we observe that the error is small even for $\beta = 0.5$.

By assuming knowledge of $\nabla\tilde{u}/\tilde{u}$ only and by observing only a single trajectory of the SMDP, algorithm SMDPBG below produces a sequence of iterates $\{(\eta_c^{(k)}, \eta_\tau^{(k)}, \triangle c_k, \triangle \tau_k)\}_{k\ge 0}$ that converges to $(\eta_c, \eta_\tau, \nabla_\beta \eta_c, \nabla_\beta \eta_\tau)$ almost surely (a.s.) as $k$ tends to infinity; see Theorem 4 below. Thus, at time $k$, we use the following (biased) estimate of $\nabla J(\theta)$:

$$\eta_\tau^{(k)} \triangle c_k - \eta_c^{(k)} \triangle \tau_k, \qquad (17)$$

which also converges almost surely to $\eta_\tau \nabla_\beta \eta_c - \eta_c \nabla_\beta \eta_\tau$. Alternatively, we may use (17) multiplied by $(\eta_\tau^{(k)})^{-2}$. We may omit $\eta_\tau(\theta)^{-2}$ since (2) and (3) imply that $\eta_\tau(\theta)$ is bounded away from 0 from below and bounded from above, both uniformly in $\theta$. Note that $-\eta_\tau \nabla \eta_c + \eta_c \nabla \eta_\tau$ is a direction of descent, as it makes a negative inner product with $\nabla J$. Let

$$\phi(\theta, x, a) := \frac{\nabla \tilde{u}(\theta, x, a)}{\tilde{u}(\theta, x, a)} \quad (\in \mathbb{R}^K). \qquad (18)$$

Algorithm 2 (SMDP Biased Gradient (SMDPBG)): Given $\beta \in (0, 1)$, a fixed $\theta \in \mathbb{R}^K$, the policy $\tilde{u}(\theta, \cdot, \cdot)$, and an arbitrary SMDP sample path $\{(x_k, \tau_k, a_k)\}_{k\ge 0}$ generated using $\tilde{u}(\theta, \cdot, \cdot)$. Set $z_0 = \triangle c_0 = \triangle \tau_0 = 0 \in \mathbb{R}^K$, $\eta_c^{(0)} = \eta_\tau^{(0)} = 0$, $\gamma_k = k^{-1}$. For $k \ge 0$, update

$$z_{k+1} = \beta z_k + \phi(\theta, x_k, a_k), \qquad (19)$$
$$\triangle c_{k+1} = \triangle c_k + \gamma_{k+1}\big( c(x_{k+1}, a_{k+1})\, \tau_{k+2}\, [z_{k+1} + \phi(\theta, x_{k+1}, a_{k+1})] - \triangle c_k \big), \qquad (20)$$
$$\triangle \tau_{k+1} = \triangle \tau_k + \gamma_{k+1}\big( \tau_{k+2}\, [z_{k+1} + \phi(\theta, x_{k+1}, a_{k+1})] - \triangle \tau_k \big), \qquad (21)$$
$$\eta_c^{(k+1)} = \eta_c^{(k)} + \gamma_{k+1}\big( c(x_k, a_k)\, \tau_{k+1} - \eta_c^{(k)} \big), \qquad (22)$$
$$\eta_\tau^{(k+1)} = \eta_\tau^{(k)} + \gamma_{k+1}\big( \tau_{k+1} - \eta_\tau^{(k)} \big). \qquad (23)$$

$\eta_c^{(k)}$ and $\eta_\tau^{(k)}$ are the sample average estimators of $\eta_c(\theta)$ and $\eta_\tau(\theta)$, respectively. In (19), if $\tilde{u}(\theta, x_k, a_k) = 0$, we set $\nabla\tilde{u}(\theta, x_k, a_k)/\tilde{u}(\theta, x_k, a_k) = 0$. Note that only knowledge of $\nabla\tilde{u}/\tilde{u}$ is assumed, and not the semi-Markov kernel (1) or any quantities derived from it (e.g., (5) or (3)).

As in [1], we make the following boundedness assumption on the derivatives of the policy.

Assumption 3:

$$\left| \frac{\partial \tilde{u}(\theta, i, a)}{\partial \theta_k}\, \tilde{u}(\theta, i, a)^{-1} \right| < \infty, \quad \forall k, \theta, i, a. \qquad (24)$$


If $\tilde{u}(\theta, i, a)$ is 0, then boundedness requires the numerator to also be 0. The proof of the following theorem appears in the Appendix.

Theorem 4: For Algorithm SMDPBG, let (2) and Assumption 1 hold. Then $\eta_c^{(k)}$ and $\eta_\tau^{(k)}$ converge almost surely to $\eta_c$ and $\eta_\tau$, respectively. If, in addition, Assumption 3 and

$$\mathrm{E}\{\tau_{k+1}^2 \mid x_k = i,\ a_k = a\} < \infty, \quad \forall i, a, \qquad (25)$$

also hold, then $\triangle c_k$ and $\triangle \tau_k$ converge almost surely to $\nabla_\beta \eta_c$ and $\nabla_\beta \eta_\tau$, respectively.
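To illustrate how the recursions (19)-(23) are driven by a single sample path, here is a minimal Python sketch of SMDPBG. The interfaces are assumptions of the sketch: `sample_path` is an iterator yielding the observed triples $(x_k, \tau_k, a_k)$, `phi(theta, x, a)` returns the score $\nabla\tilde{u}(\theta, x, a)/\tilde{u}(\theta, x, a)$ of (18), and `cost(x, a)` returns $c(x, a)$.

```python
import numpy as np

def smdpbg_step(state, window, theta, gamma, beta, phi, cost):
    """One pass of the SMDPBG recursions (19)-(23).

    state  : dict with keys 'z', 'dc', 'dt', 'eta_c', 'eta_tau'
    window : ((x_k, tau_k, a_k), (x_{k+1}, tau_{k+1}, a_{k+1}), (x_{k+2}, tau_{k+2}, a_{k+2}))
    phi    : score function of (18), phi(theta, x, a) = grad u~ / u~
    cost   : one-stage cost c(x, a)
    """
    (xk, _, ak), (x1, t1, a1), (_, t2, _) = window
    z = beta * state['z'] + phi(theta, xk, ak)                                                 # (19)
    dc = state['dc'] + gamma * (cost(x1, a1) * t2 * (z + phi(theta, x1, a1)) - state['dc'])    # (20)
    dt = state['dt'] + gamma * (t2 * (z + phi(theta, x1, a1)) - state['dt'])                   # (21)
    eta_c = state['eta_c'] + gamma * (cost(xk, ak) * t1 - state['eta_c'])                      # (22)
    eta_tau = state['eta_tau'] + gamma * (t1 - state['eta_tau'])                               # (23)
    return {'z': z, 'dc': dc, 'dt': dt, 'eta_c': eta_c, 'eta_tau': eta_tau}

def smdpbg(sample_path, phi, cost, theta, beta, num_steps):
    """Run SMDPBG along a single sample path generated under the fixed policy u~(theta, ., .)."""
    K = len(theta)
    state = {'z': np.zeros(K), 'dc': np.zeros(K), 'dt': np.zeros(K), 'eta_c': 0.0, 'eta_tau': 0.0}
    window = [next(sample_path) for _ in range(3)]      # (x_0, tau_0, a_0), (x_1, tau_1, a_1), (x_2, tau_2, a_2)
    for k in range(num_steps):
        state = smdpbg_step(state, window, theta, 1.0 / (k + 1), beta, phi, cost)   # gamma_{k+1} = (k+1)^{-1}
        window = window[1:] + [next(sample_path)]       # slide the three-transition window
    return state['eta_c'], state['eta_tau'], state['dc'], state['dt']
```

Note that the update at step $k$ looks two transitions ahead (it uses $\tau_{k+2}$), which is why a three-transition window is maintained.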

IV. ONLINE OPTIMIZATION OF AN SMDP (OLSMDP)

An online algorithm that minimises $J(\theta)$ based on a single observed sample path of the SMDP is now presented. The algorithm is based on two-time-scale SA and updates $\theta$ at every time step.

Algorithm 5 (OLSMDP): Initialisation: given $\beta \in (0, 1)$, $\theta_0 \in \mathbb{R}^K$, and positive step sizes $\{\gamma_k\}_{k\ge 1}$, $\{\alpha_k\}_{k\ge 1}$ satisfying $\gamma_k \to 0$, $\sum_k \gamma_k = \infty$, $\sum_k \gamma_k^2 < \infty$ (and the same for $\{\alpha_k\}_{k\ge 1}$) and $\lim_{k\to\infty} \alpha_k \gamma_k^{-1} = 0$. Recursion: For $k \ge 0$, sample $a_{k+2} \sim \tilde{u}(\theta_k, x_{k+2}, \cdot)$, update (19)-(23) with $\theta = \theta_k$, and

$$\theta_{k+1} = \theta_k - \alpha_{k+1}\big( \eta_\tau^{(k+1)} \triangle c_{k+1} - \eta_c^{(k+1)} \triangle \tau_{k+1} \big). \qquad (26)$$

We assume that arbitrary $x_0$, $\tau_0$, and $(a_0, x_1, \tau_1, a_1)$ are generated with $\theta_0$. (26) is the actual SA step that uses $\eta_\tau^{(k+1)} \triangle c_{k+1} - \eta_c^{(k+1)} \triangle \tau_{k+1}$ as the biased estimate of $\nabla J(\theta_k)$. In the numerical example presented in Section V, the step sizes used are

$$\gamma_k = k^{-\zeta_1}, \quad \alpha_k = k^{-\zeta_2}, \quad \zeta_1, \zeta_2 \in (0.5, 1), \quad \zeta_2 > \zeta_1. \qquad (27)$$


As $\alpha_k$ tends to zero faster than $\gamma_k$, $\theta_k$ is said to evolve on a slower time scale than the gradient estimator, which comprises all the terms updated with $\gamma_k$. The convergence of two-time-scale SA is studied in [16]. We remark that, under suitable regularity conditions, it may be shown that $\theta_k$ converges to $\{\theta : \nabla_\beta J(\theta) = 0\}$ a.s.
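A sketch of the resulting two-time-scale loop is given below in Python; it reuses `smdpbg_step` from the SMDPBG sketch in Section III, and the environment interface `next_transition` (which returns the next state, sojourn time and action, with the action drawn from $\tilde{u}(\theta, \cdot, \cdot)$) is a placeholder rather than part of the paper.

```python
import numpy as np

def olsmdp(initial_window, next_transition, phi, cost, theta0,
           beta=0.5, zeta1=0.45, zeta2=0.55, num_steps=15000):
    """Sketch of OLSMDP (Algorithm 5), reusing smdpbg_step from the Section III sketch.

    initial_window  : the first three observed (x, tau, a) triples, generated with theta_0
    next_transition : placeholder environment interface; given theta and the last observed
                      triple, it returns the next (state, sojourn time, action), with the
                      action drawn from u~(theta, ., .)
    """
    theta = np.array(theta0, dtype=float)
    K = len(theta)
    state = {'z': np.zeros(K), 'dc': np.zeros(K), 'dt': np.zeros(K), 'eta_c': 0.0, 'eta_tau': 0.0}
    window = list(initial_window)
    for k in range(num_steps):
        gamma = (k + 1) ** (-zeta1)     # fast time scale: gradient estimator (19)-(23)
        alpha = (k + 1) ** (-zeta2)     # slow time scale: policy update, zeta_2 > zeta_1, cf. (27)
        state = smdpbg_step(state, window, theta, gamma, beta, phi, cost)
        theta = theta - alpha * (state['eta_tau'] * state['dc']
                                 - state['eta_c'] * state['dt'])           # (26)
        window = window[1:] + [next_transition(theta, window[-1])]
    return theta
```

Because $\alpha_k/\gamma_k \to 0$, the gradient estimator effectively sees a quasi-static $\theta_k$, which is the standard two-time-scale argument analysed in [16].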

V. APPLICATION OF OLSMDP TO CALL ADMISSION CONTROL

Consider a multi-service loss network that services $\{1, 2, \dots, K\}$ classes of users. A class $i$ user arrives according to a Poisson process with intensity $\lambda_i$, has a service time exponentially distributed with mean $1/\mu_i$, and uses $b_i \in \mathbb{Z}_+$ of the total available bandwidth $M$. We assume Poisson arrivals and departures but no knowledge of the numerical values of the intensities. This immediately precludes the use of value iteration or linear programming (LP) to solve the CAC problem [14]. An arriving class $i$ user is either admitted or rejected. Admitted users remain in the system for the duration of their service time. Let $x(t) = [x_1(t), \dots, x_K(t)]^T \in X$ be the state of the network at time $t$, where the state space $X := \{[x_1, \dots, x_K]^T : x_i \in \mathbb{Z}_+,\ \sum_i x_i b_i \le M\}$. Let $\{t_k\}_{k\ge 0}$ be the instances when the process $\{x(t)\}_{t \in \mathbb{R}_+}$ changes state; by convention, set $t_0 = 0$. Let $x_k$ denote $x(t_k)$. The decision epochs are the instances $t_k$, $k \ge 0$. At each $t_k$, an admit or reject decision is made for each possible user type that may arrive in $(t_k, t_{k+1}]$. If action $a = (a^1, \dots, a^K) \in \{0, 1\}^K\ (=: A)$ is selected, then a class $i$ user that arrives in the interval $(t_k, t_{k+1}]$ is admitted if $a^i = 1$. Let each $\theta \in \mathbb{R}^K$ define a randomised policy as follows. At $t_k$, let $x_k = [x_k^1, \dots, x_k^K] \in X$. The action $a_k = (a^1, \dots, a^K)$ is selected by setting $a^i = 1$ with probability $\big(1 + e^{b^T x_k - \theta_i}\big)^{-1}$.

[Fig. 1. Numerical performance of OLSMDP. (a) Convergence: cost $J(\theta_k)$ vs. $k$ for step-size exponents $(\zeta_1, \zeta_2) = (0.6, 0.7)$, $(0.5, 0.6)$ and $(0.45, 0.55)$. (b) OLSMDP vs. ideal OLSMDP: the components $\theta_k(1)$, $\theta_k(2)$ compared with $\theta_k^{\mathrm{ideal}}(1)$, $\theta_k^{\mathrm{ideal}}(2)$ vs. $k$.]

The aim of CAC is to minimise a weighted sum of the blocking probabilities of the various user classes. Let $e_i \in \mathbb{R}^K$ be the unit vector with 1 in the $i$-th position. For $i = 1, \dots, K$, define

$$c_i(x, a) := \min\big( 1 - a^i + 1 - I_X(x + e_i),\ 1 \big), \quad x \in X,\ a \in A. \qquad (28)$$

Then minimising (4) with $c = \sum_{i=1}^K \nu_i c_i$, where the weights $\nu_i \in \mathbb{R}_+$, indeed minimises $\sum_{i=1}^K \nu_i \times (\text{class-}i\text{ blocking probability})$ over the policy space $\mathbb{R}^K$ (see [10], [11] for details).
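In words, (28) charges class $i$ one unit of cost whenever an arriving class-$i$ user would be blocked, either because $a^i = 0$ or because admitting it would violate the bandwidth constraint ($x + e_i \notin X$). A small illustrative Python check of this definition follows (the feasibility test encodes the indicator $I_X$; the function name and interface are assumptions of the sketch).

```python
import numpy as np

def blocking_cost(x, a, b, M, nu):
    """Weighted blocking cost  c(x, a) = sum_i nu_i c_i(x, a)  as in (28)."""
    x, a, b, nu = map(np.asarray, (x, a, b, nu))
    K = len(x)
    # I_X(x + e_i): admitting one more class-i user must respect the bandwidth M
    feasible = np.array([b @ (x + np.eye(K, dtype=int)[i]) <= M for i in range(K)], dtype=float)
    c_i = np.minimum((1 - a) + (1 - feasible), 1.0)   # 1 iff class i would be blocked
    return float(nu @ c_i)
```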

We consider the scenario with two user classes ($K = 2$), $\lambda_1 = \lambda_2 = 1$, $\mu_1 = 0.4$, $\mu_2 = 1$, $b_1 = b_2 = 1$, $M = 4$ and weights $\nu_1 = 0.85$, $\nu_2 = 0.15$. The optimal non-randomised policy, constructed using an LP [14], yields an average cost of 0.2393. The (near) optimal $\theta$ value is $[15, 3]^T$, with an average cost of 0.2549. Figure 1(a) shows the convergence of the cost $J(\theta_k)$ for $\{\theta_k\}_{k\ge 0}$ generated by OLSMDP for the different step sizes in (27). In (26), we also replaced the estimated gradient with the true gradient $\eta_\tau(\theta_k)\nabla\eta_c(\theta_k) - \eta_c(\theta_k)\nabla\eta_\tau(\theta_k)$; this ideal scenario is shown in Figure 1(b) as $\{\theta_k^{\mathrm{ideal}}\}_{k\ge 0}$ and is contrasted with $\{\theta_k\}_{k\ge 0}$ generated by OLSMDP via (26). These plots correspond to $(\zeta_1, \zeta_2) = (0.45, 0.55)$ and $\beta = 0.5$. Note that the gradient estimator performs well: $\theta_k$ follows $\theta_k^{\mathrm{ideal}}$ closely, even for $\beta = 0.5$.


VI. CONCLUSION

We have presented a policy gradient method for the online optimisation of an SMDP based on a single sample path. The proposed method requires no knowledge of the semi-Markov transition kernel. The gradient estimator is based on the discounted score method of [1], while the optimisation step is performed using two-time-scale SA. An advantage of the policy gradient approach is that it extends easily to the case where the SMDP is also subject to average cost constraints, as SA can also be used for constrained stochastic optimisation [15].

VII. APPENDIX

In this section, we prove Theorem 4. The convergence analysis proceeds by viewing the process $\{(x_k^\theta, \tau_k^\theta, a_k^\theta)\}_{k\ge 0}$ as a general state space Markov chain (see [9] for a definition). The Markov nature of $\{(x_k^\theta, \tau_k^\theta, a_k^\theta)\}_{k\ge 0}$ follows from (1), which asserts that the dependence of $(x_{k+1}^\theta, \tau_{k+1}^\theta)$ on $\{(x_i^\theta, \tau_i^\theta, a_i^\theta) : i = 1, \dots, k\}$ reduces to $(x_k^\theta, a_k^\theta)$ only, and from the fact that $a_{k+1}^\theta$ is sampled from $\tilde{u}(\theta, x_{k+1}^\theta, \cdot)$. The analysis below is carried out for an arbitrary fixed $\theta \in \mathbb{R}^K$; with some exceptions for clarity, we omit $\theta$ from the notation.

Define the Markov chain $\{Z_k\}_{k\ge 0}$ with state space $X \mathbb{R}_+ A$ (denoting the Cartesian product of the three sets) and measurable sets $\mathcal{B}(X \mathbb{R}_+ A) := \sigma\{ABC : A \in 2^X,\ B \in \mathcal{B}(\mathbb{R}_+),\ C \in 2^A\}$, where $\mathcal{B}(\mathbb{R}_+)$ denotes the Borel sets of $\mathbb{R}_+$. For convenience, let

$$(i, [0, \tau], a) := \{i\} \times [0, \tau] \times \{a\}, \quad (i, \tau, a) \in X \mathbb{R}_+ A. \qquad (29)$$

Let $K$ denote the ($\theta$-dependent) transition kernel of $\{Z_k\}_{k\ge 0}$, defined as follows. For each $(i, a) \in XA$, let $K(\cdot\,|\,i, a)$ be the unique measure on $X \mathbb{R}_+ A$ that satisfies

$$K((j, [0, \tau], a')\,|\,i, a) = Q_{ij}(\tau, a)\, \tilde{u}(j, a') \qquad (30)$$

for sets $(j, \tau, a') \in X \mathbb{R}_+ A$. ($K$ depends on $\theta$ through the policy $\tilde{u}$ only.) Extend the definition of $K$ to yield the kernel $\bar{K}$ as follows: for $z \in \{i\} \times \mathbb{R}_+ \times \{a\}$,

$$\bar{K}(B\,|\,z) = K(B\,|\,i, a), \quad B \in \mathcal{B}(X \mathbb{R}_+ A). \qquad (31)$$

There exists a sample space $(\Omega, \mathcal{F})$ and, for each $z_0 \in X \mathbb{R}_+ A$, a probability $P_{z_0}$ such that $\{Z_k\}_{k\ge 0}$ is a time-homogeneous Markov chain on $(\Omega, \mathcal{F}, P_{z_0})$ with transition probability kernel (31) and initial state $z_0$ [9]. Let $x_k$, $\tau_k$ and $a_k$ be the projections of $Z_k$ onto $X$, $\mathbb{R}_+$ and $A$ respectively, i.e., $x_k = f_x(Z_k)$ where $f_x(z) = i$ for all $z \in \{i\} \times \mathbb{R}_+ \times A$, $i \in X$ (and similarly for $\tau_k$ and $a_k$). For integrable $f : X \mathbb{R}_+ A \to \mathbb{R}$ (i.e., $\mathrm{E}\{|f(Z_{k+1})|\} < \infty$), (31) implies

$$\mathrm{E}\{f(Z_{k+1})\,|\,Z_k\} = \mathrm{E}\{f(Z_{k+1})\,|\,x_k, a_k\}. \qquad (32)$$

This is true because

$$\mathrm{E}\{f(Z_{k+1})\,|\,Z_k\} = \sum_{i,a} \Big( \int f(z)\, K(dz\,|\,i, a) \Big)\, I_{\{i\} \times \mathbb{R}_+ \times \{a\}}(Z_k) = \int f(z)\, K(dz\,|\,x_k, a_k).$$

Using (32), $P((x_{k+1}, a_{k+1}) = (j, a'),\ \tau_{k+1} \le \tau \,|\, (x_0, a_0), \tau_0, \dots, (x_k, a_k), \tau_k) = P((x_{k+1}, a_{k+1}) = (j, a'),\ \tau_{k+1} \le \tau \,|\, (x_k, a_k))$, which is the desired semi-Markov property [6]. Let $x'_k$, $\tau'_k$ and $a'_k$, $k \ge 0$, be the realisation of states, sojourn times and actions generated as described in Section II. Then $(x'_k, \tau'_k, a'_k)$, $k \ge 0$, is a realisation of $\{Z_k\}_{k\ge 0}$, since $\bar{K}$ admits a decomposition through $Q$ and $\tilde{u}$. Also, (32) implies that $\{(x_k, a_k)\}_{k\ge 0}$ is a finite-state Markov chain with transition probability $P_{ij}(a)\, \tilde{u}(j, a')$.

Lemma 6: Assumption 1 implies that the Markov chain $\{(x_k, a_k)\}_{k\ge 0}$ with transition probability $P((x_{k+1}, a_{k+1}) = (j, a') \,|\, (x_k, a_k) = (i, a)) = P_{ij}(a)\, \tilde{u}(j, a')$ is a unichain. Furthermore, let $(i, a)$ be any recurrent state of this chain and $\tau^{\mathrm{hit}}_{(i,a)} := \inf\{k > 0 : (x_k, a_k) = (i, a)\}$. Then

$$\mathrm{E}\big\{\tau^{\mathrm{hit}}_{(i,a)} \,\big|\, (x_0, a_0) = (j, a')\big\} < \infty, \quad \forall (j, a') \in XA. \qquad (33)$$

Proof: Showing that $\{(x_k, a_k)\}_{k\ge 0}$ is a unichain is straightforward and is omitted. (33) then follows from [14, Theorem 2.2.7].

Define the probability measure $\nu$ on $X \mathbb{R}_+ A$ to be

$$\nu(\cdot) := \sum_{i \in X, a \in A} K(\cdot\,|\,i, a)\, \pi(i)\, \tilde{u}(i, a) \qquad (34)$$

where $\pi$ is the unique stationary distribution of the $X$-valued Markov chain with transition probability matrix $[\sum_a P_{ij}(a)\tilde{u}(i, a)]_{i,j}$, i.e., $P(\theta)_{ij}$ (see Assumption 1). Note that

$$\nu(\{j\} \times \mathbb{R}_+ \times \{a'\}) = \lim_{\tau \to \infty} \nu((j, [0, \tau], a')) = \sum_{i \in X, a \in A} P_{ij}(a)\, \tilde{u}(j, a')\, \pi(i)\, \tilde{u}(i, a) = \pi(j)\, \tilde{u}(j, a'),$$

which is the unique stationary distribution of the process $\{(x_k, a_k)\}_{k\ge 0}$. Also, it is immediately apparent from the definition of $\nu$ that $\nu$ is the unique invariant probability measure of $\bar{K}$.

Lemma 7: $\{Z_k\}_{k\ge 0}$ is $\nu$-irreducible.

Proof: We need to show that if $\nu(A) > 0$, then for every $z \in X \mathbb{R}_+ A$ there exists a $k > 0$ such that $\bar{K}^k(A\,|\,z) > 0$ [9], where $\bar{K}^k$ is the $k$-step transition kernel defined recursively as $\bar{K}^1 = \bar{K}$, $\bar{K}^{k+1}(A\,|\,z) = \int \bar{K}^k(dz'\,|\,z)\, \bar{K}(A\,|\,z')$, $k > 0$. Iterating the kernel $\bar{K}$ gives

$$\bar{K}^{k+1}(A\,|\,z_0) = \sum_{i,a} K(A\,|\,i, a)\, [P(a_0)P(\theta)^{k-1}]_{i_0, i}\, \tilde{u}(i, a), \qquad (35)$$

where $z_0 = (i_0, t_0, a_0) \in X \mathbb{R}_+ A$, $k > 0$. Since $\nu(A) > 0$, there exist $i, a$ such that $K(A\,|\,i, a) > 0$, $\pi(i) > 0$, $\tilde{u}(i, a) > 0$ (see (34)). Also, since $\pi(i) > 0$, there exists $k > 0$ such that $[P(a_0)P(\theta)^k]_{i_0, i} > 0$. Thus $\bar{K}^{k+1}(A\,|\,z_0) > 0$.

Lemma 8: $\{Z_k\}_{k\ge 0}$ is Harris recurrent.

Proof: One needs to show that any set that has positive measure w.r.t. $\nu$ is visited infinitely often with probability 1. A set $C \in \mathcal{B}(X \mathbb{R}_+ A)$ is $\xi$-petite if $\bar{K}(A\,|\,z) \ge \xi(A)$ for all $z \in C$, $A \in \mathcal{B}(X \mathbb{R}_+ A)$, where $\xi$ is a non-trivial measure on $\mathcal{B}(X \mathbb{R}_+ A)$ [9, p. 121]. Let $(i, a)$ be a recurrent state of the chain $\{(x_k, a_k)\}_{k\ge 0}$. We let $\xi(\cdot) = K(\cdot\,|\,i, a)$ and the petite set $C = \{i\} \times \mathbb{R}_+ \times \{a\}$. Use the hitting-time criterion for Harris recurrence [9, Proposition 9.1.7(ii)]: if there exists a petite set $C$ in $\mathcal{B}(X \mathbb{R}_+ A)$ such that $P_{z_0}(\tau_C^{\mathrm{hit}} < \infty) = 1$ for all $z_0 \in X \mathbb{R}_+ A$, then $\{Z_k\}_{k\ge 0}$ is Harris recurrent. With the set $C$ given above, $\mathrm{E}_{z_0}\{\tau_C^{\mathrm{hit}}\} = \mathrm{E}_{z_0}\{\tau^{\mathrm{hit}}_{(i,a)}\} < \infty$ for any $z_0$ by Lemma 6.

Since $\{Z_k\}_{k\ge 0}$ has an invariant probability measure $\nu$, is $\nu$-irreducible and is Harris recurrent, $\{Z_k\}_{k\ge 0}$ is positive Harris [9]. We will need the following final lemma.

Lemma 9: Let $\{X_k\}_{k\ge 0}$ be an $\mathbb{R}^d$-valued (or a closed subset of $\mathbb{R}^d$) homogeneous Markov chain, defined on the probability space $(\Omega, \mathcal{F}, P)$, with a unique invariant probability measure $\nu$ and transition probability $K(\cdot\,|\,x)$, $x \in \mathbb{R}^d$.

(i) If $\{X_k\}_{k\ge 0}$ is positive Harris and the Borel-measurable $f : \mathbb{R}^{d(j+1)} \to \mathbb{R}$ ($j \ge 0$) satisfies $\int \nu(dx_0) \int K(dx_1|x_0) \cdots \int K(dx_j|x_{j-1})\, |f(x_0, x_1, \dots, x_j)| < \infty$, then

$$\lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} f(X_k, \dots, X_{k+j}) = \int \nu(dx_0) \int K(dx_1|x_0) \cdots \int K(dx_j|x_{j-1})\, f(x_0, x_1, \dots, x_j) \quad \text{a.s.} \qquad (36)$$

(ii) Let the Borel-measurable $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfy $\int \nu(dx_0)\, \|\phi(x_0)\|^2 < \infty$ and $\int \nu(dx_0) \int K(dx_1|x_0)\, g(x_0, x_1)^2 < \infty$. If $\{X_k\}_{k\ge 0}$ is positive Harris, then

$$\lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} g(X_{k+1}, X_{k+2}) \sum_{j=0}^{k} \beta^{k-j} \phi(X_j) = \sum_{k=0}^{\infty} \beta^k \int \nu(dx_0) \int K(dx_1|x_0) \cdots \int K(dx_{k+2}|x_{k+1})\, \phi(x_0)\, g(x_{k+1}, x_{k+2}) \quad \text{a.s.} \qquad (37)$$

Proof: Part (i) is Lemma 9 of [13], while (ii) is a special case of Lemma 2 of [13].

Proof of Theorem 4: Fix $\theta$ and consider the Markov chain $\{Z_k\}_{k\ge 0}$ with transition kernel (31) and stationary distribution (34). Let $\mathrm{E}_\nu$ denote the expectation when the initial distribution of the chain is $\nu$, i.e., for an integrable Borel-measurable $f : (X \mathbb{R}_+ A)^{j+1} \to \mathbb{R}$,

$$\mathrm{E}_\nu\{f(Z_0, \dots, Z_j)\} = \int \nu(dz_0) \int \bar{K}(dz_1|z_0) \cdots \int \bar{K}(dz_j|z_{j-1})\, f(z_0, z_1, \dots, z_j).$$

Note that $\eta_c^{(T)} = T^{-1} \sum_{k=0}^{T-1} c(x_k, a_k)\, \tau_{k+1}$. From (2) and the fact that $\{Z_k\}_{k\ge 0}$ is positive Harris, Lemma 9(i) implies $\lim_{T\to\infty} \eta_c^{(T)} = \mathrm{E}_\nu\{c(x_0, a_0)\, \tau_1\}$ (a.s.) $= \eta_c$. (The result follows for $\eta_\tau^{(T)}$ by setting $c \equiv 1$.) Similarly, we now prove convergence of $\triangle c_k$ to $\nabla_\beta \eta_c$; convergence of $\triangle \tau_k$ to $\nabla_\beta \eta_\tau$ follows by setting $c \equiv 1$. Expanding $\triangle c_T$ (20), one gets

$$\frac{1}{T} \sum_{k=0}^{T-1} c(x_{k+1}, a_{k+1})\, \tau_{k+2}\, \frac{\nabla\tilde{u}(x_{k+1}, a_{k+1})}{\tilde{u}(x_{k+1}, a_{k+1})} \qquad (38)$$

$$+\ \frac{1}{T} \sum_{k=0}^{T-1} c(x_{k+1}, a_{k+1})\, \tau_{k+2} \sum_{j=0}^{k} \beta^{k-j}\, \frac{\nabla\tilde{u}(x_j, a_j)}{\tilde{u}(x_j, a_j)}. \qquad (39)$$

For (38), by (2) and (24), we may apply Lemma 9(i) to yield

$$\lim_{T\to\infty} \frac{1}{T} \sum_{k=0}^{T-1} c(x_{k+1}, a_{k+1})\, \tau_{k+2}\, \frac{\nabla\tilde{u}(x_{k+1}, a_{k+1})}{\tilde{u}(x_{k+1}, a_{k+1})} = \mathrm{E}_\nu\Big\{ c(x_1, a_1)\, \frac{\nabla\tilde{u}(x_1, a_1)}{\tilde{u}(x_1, a_1)}\, \tau_2 \Big\} \ (\text{a.s.}) = \sum_{i \in X, a \in A} \pi(i)\, \tilde{u}(i, a)\, c(i, a)\, \frac{\nabla\tilde{u}(i, a)}{\tilde{u}(i, a)}\, \tau(i, a) = (\nabla\bar{c})^T \pi.$$

The second equality follows by conditioning on $Z_1$ and using (3), and then (34). For the limit of (39), apply Lemma 9(ii); (24) and (25) imply that the square-integrability conditions are satisfied. Thus

$$\lim_{T\to\infty} \frac{1}{T} \sum_{k=0}^{T-1} c(x_{k+1}, a_{k+1})\, \tau_{k+2} \sum_{j=0}^{k} \beta^{k-j}\, \frac{\nabla\tilde{u}(x_j, a_j)}{\tilde{u}(x_j, a_j)} = \sum_{k=0}^{\infty} \beta^k\, \mathrm{E}_\nu\Big\{ \frac{\nabla\tilde{u}(x_0, a_0)}{\tilde{u}(x_0, a_0)}\, c(x_{k+1}, a_{k+1})\, \tau_{k+2} \Big\} \quad \text{a.s.} \qquad (40)$$

We now show that the right-hand side of (40) is equal to $\pi^T (\nabla P(\theta))\, \eta_{\bar{c},\beta}$. By (11), $\pi^T \frac{\partial P}{\partial \theta_m} \eta_{\bar{c},\beta} = \sum_{k=0}^{\infty} \beta^k\, \pi^T \frac{\partial P(\theta)}{\partial \theta_m} P(\theta)^k \bar{c}$. For any $k \ge 0$,

$$\pi^T \frac{\partial P(\theta)}{\partial \theta_m} P(\theta)^k \bar{c} = \sum_{i,j,a} \pi(i)\, \tilde{u}(i, a)\, \frac{\partial\tilde{u}(i, a)/\partial\theta_m}{\tilde{u}(i, a)}\, P_{ij}(a)\, \big[P(\theta)^k \bar{c}\big](j) = \mathrm{E}_\nu\Big\{ \frac{\partial\tilde{u}(x_0, a_0)/\partial\theta_m}{\tilde{u}(x_0, a_0)}\, c(x_{k+1}, a_{k+1})\, \tau_{k+2} \Big\}. \qquad (41)$$

To obtain the last equality, for any $f : XA \to \mathbb{R}$, use (35) to evaluate $\mathrm{E}_\nu\{f(x_{k+1}, a_{k+1})\,|\,Z_0\}$.

REFERENCES

[1] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, pp. 319-350, 2001.
[2] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2. Athena Scientific, Belmont, 1995.
[3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, 1996.
[4] S. J. Bradtke and M. O. Duff, "Reinforcement learning methods for continuous-time Markov decision problems," Advances in Neural Information Processing Systems 7, 1995.
[5] X.-R. Cao, "Semi-Markov decision problems and performance sensitivity analysis," IEEE Trans. Automat. Contr., vol. 48, no. 5, pp. 758-769, May 2003.
[6] E. Cinlar, Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ, 1974.
[7] A. Gosavi, "Reinforcement learning for long-run average cost," European Journal of Operational Research, pp. 654-674, 2004.
[8] P. Marbach, O. Mihatsch and J. N. Tsitsiklis, "Call admission control and routing in integrated services networks using neuro-dynamic programming," IEEE JSAC, vol. 18, no. 2, pp. 197-208, Feb. 2000.
[9] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[10] K. W. Ross, Multiservice Loss Models for Broadband Telecommunication Networks. Springer-Verlag, London, 1995.
[11] S. Singh, V. Krishnamurthy and H. V. Poor, "Integrated voice/data call admission control for wireless DS-CDMA systems with fading," IEEE Trans. Signal Proc., vol. 50, no. 6, pp. 1483-1495, June 2002.
[12] R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems 12, pp. 1057-1063, 2000.
[13] V. Tadić, "On the convergence of temporal-difference learning with linear function approximation," Machine Learning, vol. 42, no. 3, pp. 241-267, 2001.
[14] H. C. Tijms, Stochastic Models: An Algorithmic Approach. John Wiley & Sons, Chichester, 1994.
[15] V. Tadić and A. Doucet, "Two time-scale stochastic approximation for constrained stochastic optimization and constrained Markov decision problems," in Proc. ACC, 2003.
[16] V. Tadić and S. P. Meyn, "Asymptotic properties of two time-scale stochastic approximation algorithms with constant step sizes," in Proc. ACC, 2003.