Automatica 46 (2010) 1297–1304


Simulation-based optimization of Markov decision processes: An empirical process theory approach✩

Rahul Jain a,b,∗, Pravin Varaiya c

a EE Department, University of Southern California, Los Angeles, CA 90089, USA
b ISE Department, University of Southern California, Los Angeles, CA 90089, USA
c EECS Department, University of California, Berkeley, CA 94720, USA

Article info

Article history: Received 1 December 2008; Received in revised form 10 September 2009; Accepted 21 April 2010; Available online 18 June 2010.
Keywords: Markov decision processes; Learning algorithms; Monte Carlo simulation; Stochastic control; Optimization

Abstract

We generalize and build on the PAC Learning framework for Markov Decision Processes developed in Jain and Varaiya (2006). We consider the reward function to depend on both the state and the action. Both the state and action spaces can potentially be countably infinite. We obtain an estimate for the value function of a Markov decision process, which assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the convergence of the empirical average to the expected reward uniformly for a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework to obtain an ε-optimal policy from simulation. We provide the sample complexity of such an approach.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

It is well-known that solving Markov decision processes using dynamic programming is computationally intractable. Thus various approximate dynamic programming techniques as well as simulation-based techniques have been developed. We propose an empirical process theory approach to simulation-based optimization of Markov decision processes. Empirical process theory (EPT) (Pollard, 1990) studies the uniform behavior of a class G of measurable functions in the law of large numbers (as well as the central limit theorem (Dudley, 1999)) regime. In particular, EPT studies the conditions under which

$$\Pr\left\{\sup_{g\in\mathcal{G}}\left|\frac{1}{n}\sum_{i=1}^{n} g(X_i) - E_P[g(X)]\right| > \epsilon\right\} \to 0 \qquad (1)$$

where $E_P$ is the expectation with respect to the probability measure P, and the rate of convergence. Convergence results in EPT typically use concentration of measure inequalities such as those of Chernoff and Hoeffding. The rate of convergence of an empirical average to the expected value depends on the exponent in the upper bound of such inequalities. Led by Talagrand, there has been a lot of effort towards improving this exponent (Ledoux, 2001; Talagrand, 1995). The goal of this paper is to extend the reach of this rich and rapidly developing theory to Markov decision processes and Multi-armed bandit problems, and to use this framework to solve the optimal policy search problem.

Consider an MDP with a set of policies Π. The value function assigns to each π ∈ Π its expected discounted reward V(π). We estimate V from independent samples of the discounted reward by the empirical mean, V̂(π). We obtain the number of samples n(ε, δ) (or sample complexity) needed so that the probability



$$\Pr\left\{\sup_{\pi\in\Pi}\left|\hat V(\pi) - V(\pi)\right| > \epsilon\right\} < \delta. \qquad (2)$$

✩ The material in this paper was partially presented at the 46th IEEE Conference on Decision and Control, December 12–14, 2007, New Orleans, Louisiana, USA and at the Stochastic Processes and Applications (SPA), August 2007. This paper was recommended for publication in revised form by Associate Editor George Yin under the direction of Editor Ian R. Petersen.
∗ Corresponding author at: Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA. Tel.: +1 213 740 2246; fax: +1 213 740 4447. E-mail addresses: [email protected] (R. Jain), [email protected] (P. Varaiya). doi:10.1016/j.automatica.2010.05.021

The approach we present is broadly inspired by Haussler (1992), Vapnik and Chervonenkis (1981) and Vidyasagar (2003), and was first initiated in Kolmogorov and Tihomirov (1961). Thus, we would like to reduce the problem in Eq. (2) to understanding the geometry of Π in terms of its covering number. (The covering number is the minimal number of elements needed to approximate any element of the set Π to a given accuracy.) We first relate the


covering numbers of the space of stationary stochastic policies and the space of Markov chains that they induce. We relate these to the space of simulation functions that simulate the policies when the policy space is convex. These results together yield the rate of convergence of the empirical estimate to the expected value for discounted-reward MDPs. The problem is non-trivial because obtaining the empirical discounted reward from simulation involves an iteration of simulation functions. The geometry of the space of iterated simulation functions is much more complex than that of the original space.

The problem of uniform convergence of the empirical average to the value function for discounted MDPs was studied in Kearns, Mansour, and Ng (1999) and Ng and Jordan (2000) in a machine learning context. While Kearns et al. (1999) only considered finite state and action spaces, Ng and Jordan (2000) obtains conditions for uniform convergence in terms of the geometric characteristics (e.g., covering numbers or the P-dimension¹) of the simulation function space as opposed to those of the more natural policy space. Large deviations results for finite state and action spaces for the empirical state-action frequencies and general reward functions were obtained in Altman and Zeitouni (1994) and Mannor and Tsitsiklis (2005). A different approach, more akin to importance sampling, is explored in Peshkin and Mukherjee (2001).

¹ Please see Section 2 for definitions and Anthony and Bartlett (1999) and Vidyasagar (2003) for more details.

While the problem of uniform estimation of the value function for discounted and average-reward partially observed MDPs is of interest in itself, we also present a framework for simulation-based optimal policy search. Simulation-based estimates such as those proposed in this paper have been used in a gradient-based method for finding Nash equilibrium policies in a pursuit-evasion game problem (Shim, Kim, & Sastry, 2003), though the theoretical understanding is far from complete. Other simulation-based methods to find approximations to the optimal policy include Bartlett and Tewari (2007), Baxter and Bartlett (2001) and Marbach and Tsitsiklis (2003).

This paper makes two contributions: (i) It extends results in Jain and Varaiya (2006) from the case where the reward function depends only on the state to the case where it depends on both the state and action. (ii) It then presents how this framework can be used to find approximately optimal policies. Finally, we conclude with a simulation-based optimization framework for Multi-armed bandits.

2. Preliminaries

Consider an MDP M, with state space X and action space A, transition probability function $P_a(x, x')$, initial state distribution λ̄, reward function r(x, a) (which depends on both state and action) with values in [0, R], and discount factor 0 < γ < 1. The value function for a policy π is the expected discounted reward
$$V(\pi) = E\left[\sum_{t=1}^{\infty}\gamma^{t}\, r(x_t, a_t)\right],$$

in which $(x_t, a_t)$ is the state-action pair in the t-th step under policy π. Let $\Pi_0$ denote the space of all stationary stochastic policies $\{\pi(x, a) : a \in A,\ x \in X,\ \sum_a \pi(x, a) = 1\}$ and let $\Pi \subseteq \Pi_0$ be the subset of policies of interest. The MDP M under a fixed stationary policy π induces a Markov chain with transition probability function $P_\pi(x, x') = \sum_a P_a(x, x')\pi(x, a)$. The initial distribution on the Markov chains is λ̄, and we identify $P_\pi$ with the Markov chain. Denote $\mathcal{P} := \{P_\pi : \pi \in \Pi\}$.
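For intuition, on a small finite truncation of the state space the induced chain and a discounted value can be computed directly. The sketch below is illustrative only: the array layouts and function names are assumptions, and it uses the standard per-state Bellman convention $V(x) = E[\sum_{t\ge 0}\gamma^t r(x_t,a_t)\mid x_0=x]$, which differs from the definition above, where the sum starts at t = 1 and the initial state is averaged over λ̄.

```python
import numpy as np

def induced_chain(P, pi):
    """Induced Markov chain P_pi(x, x') = sum_a P_a(x, x') * pi(x, a).
    P: (A, S, S) array of kernels P_a; pi: (S, A) stationary stochastic policy."""
    return np.einsum('axy,xa->xy', P, pi)

def discounted_value(P, pi, r, gamma):
    """Per-state discounted value under pi via the Bellman fixed point
    V = r_pi + gamma * P_pi V, with r_pi(x) = sum_a pi(x, a) r(x, a)."""
    P_pi = induced_chain(P, pi)
    r_pi = (pi * r).sum(axis=1)           # expected one-step reward under pi
    S = P_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```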

Let X be an arbitrary set and λ be a probability measure on X. Given a set F of real-valued functions on X and a metric ρ on R, let $d_{\rho(\lambda)}$ be the pseudo-metric on F with respect to the measure λ,
$$d_{\rho(\lambda)}(f, g) = \int \rho(f(x), g(x))\,\lambda(dx).$$

A subset G ⊆ F is an ε-net for F if ∀f ∈ F, ∃g ∈ G with $d_{\rho(\lambda)}(f, g) < \epsilon$. The size of the minimal ε-net is the ε-covering number, denoted $N(\epsilon, F, d_{\rho(\lambda)})$. The ε-capacity of F under the ρ metric is $C(\epsilon, F, \rho) = \sup_\lambda N(\epsilon, F, d_{\rho(\lambda)})$, where λ is any probability measure on X. Essentially, the ε-net can be seen as a subset of functions that can ε-approximate any function in F. The covering number is a measure of the richness of the function class: the richer it is, the more approximating functions we need for a given measure of approximation ε. The capacity makes it independent of the underlying measure λ on X. (See Kolmogorov and Tihomirov (1961) for an elegant treatment of covering numbers.)

Let F be a set of real-valued functions from X to [0, 1]. Say that F P-shatters $\{x_1, \ldots, x_n\}$ if there exists a witness vector $c = (c_1, \ldots, c_n)$ such that the set $\{(\eta(f(x_1) - c_1), \ldots, \eta(f(x_n) - c_n)) : f \in F\}$ has cardinality $2^n$; η(·) is the sign function. The largest such n is P-dim(F). This is a generalization of the VC-dimension, and for {0, 1}-valued functions the two definitions are equivalent. Other combinatorial dimensions, such as the fat-shattering dimension introduced in Alon, Ben-David, Cesa-Bianchi, and Haussler (1997), yield both an upper and a lower bound on the covering numbers, but in this paper we will use the P-dim. Results using fat-shattering dimensions can be established similarly.

3. The simulation model

We estimate the value V(π) of a policy π ∈ Π from independent samples of the discounted rewards. The samples are generated by a simulation 'engine' $(q_\pi, \bar h)$. This is a deterministic function to which we feed two noise sequences $\nu = (\nu_1, \nu_2, \ldots)$ and $\omega = (\omega_1, \omega_2, \ldots)$ (where $\nu_i$ and $\omega_i$ are i.i.d. random noises uniformly distributed on Ω = [0, 1]) and different initial states and actions $(x_0^1, a_0^1), \ldots, (x_0^n, a_0^n)$ (with $(x_0^i, a_0^i)$ i.i.d. with distribution λ). (Note that we have put an initial distribution on the initial actions, but these play no role; the action at t = 1, $a_1$, is determined by the policy as a function of $x_0$.) The engine then generates a state and action sequence, with the state sequence the same as the Markov chain corresponding to π, $P_\pi(x, y) = \sum_a P_a(x, y)\pi(x, a)$. The estimate of V(π) is the average of the total discounted reward starting with different initial states. Because simulation cannot be performed indefinitely, we truncate the simulation at some time T, after which the contribution to the total discounted reward falls below ε/2 for a required estimation error bound ε. T is the ε/2-horizon time, which depends on the discount factor γ and the bound on the reward function r, and is independent of the policies π ∈ Π.

The function $\bar h : X \times A \times \Omega \to X$ gives the next state x' given that the current state is x, the action taken is a, and the noise is $\omega_i$. Many simulation functions are possible. We will work with the following simple simulation model, for X = N and $A = [1 : N_A]$, with $N_A$ finite.

Definition 1 (Simple Simulation Model). The simple simulation model $(q_\pi, \bar h)$ for a given MDP with policy π ∈ Π is given by
$$q_\pi(x, \nu) = \inf\{b \in A : \nu \in [Q^{\pi,x}(b - 1), Q^{\pi,x}(b))\}, \quad\text{where } Q^{\pi,x}(b) := \sum_{b' \le b} \pi(x, b'),$$
and
$$\bar h(x, a, \omega) = \inf\{y \in X : \omega \in [F_{a,x}(y - 1), F_{a,x}(y))\},$$
in which $F_{a,x}(y) := \sum_{y' \le y} P_a(x, y')$ is the c.d.f. corresponding to the transition probability function $P_a(x, y)$.
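Concretely, the simple simulation model is inverse-CDF sampling driven by the uniform noise. A minimal sketch for a finite (truncated) state space and 0-indexed actions follows; the function and array names are illustrative assumptions, not from the paper:

```python
import numpy as np

def q_pi(pi_row, nu):
    """Action-simulation function q_pi(x, nu): inverse CDF of pi(x, .).
    pi_row: 1-D array of action probabilities pi(x, .); nu in [0, 1)."""
    cdf = np.cumsum(pi_row)          # Q^{pi,x}(b) = sum_{b' <= b} pi(x, b')
    return int(np.searchsorted(cdf, nu, side='right'))   # smallest b with nu < Q(b)

def h_bar(P_a_row, omega):
    """State-simulation function h_bar(x, a, omega): inverse CDF of P_a(x, .).
    P_a_row: 1-D array of next-state probabilities P_a(x, .); omega in [0, 1)."""
    cdf = np.cumsum(P_a_row)         # F_{a,x}(y) = sum_{y' <= y} P_a(x, y')
    return int(np.searchsorted(cdf, omega, side='right'))
```

With these two maps, one step of the engine is just a = q_pi(pi[x], nu) followed by x = h_bar(P[a, x], omega).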


This is the simplest method of simulation: for example, to simulate a probability distribution on a discrete state space, we partition the unit interval such that the first subinterval has length equal to the mass on the first state, the second subinterval has length equal to the mass on the second state, and so on. It is a surprising fact that there are other simulation functions h' that generate the same distribution, but which have a much larger complexity than $\bar h$.

The state and action sequence $\{(x_t, a_t)\}_{t=0}^{\infty}$ for policy π is obtained by
$$a_{t+1} = q_\pi(x_t, \nu_{t+1}), \qquad x_{t+1} = f_\pi(x_t, \nu_{t+1}, \omega_{t+1}) = \bar h(x_t, q_\pi(x_t, \nu_{t+1}), \omega_{t+1}),$$
where $\nu_t, \omega_t \in \Omega$ are noises. The initial state-action pair $(x_0, a_0)$ is drawn according to the given initial state-action distribution λ (with marginal distribution λ̄ over X).

Denote $z_t = (x_t, a_t) \in Z := X \times A$ and $\xi_t = (\nu_t, \omega_t) \in \Omega_2 := \Omega \times \Omega$; then
$$s_\pi(z_t, \xi_t) = z_{t+1} = (q_\pi(x_t, \nu_{t+1}), f_\pi(x_t, \nu_{t+1}, \omega_{t+1})).$$
The function $s_\pi : Z \times \Omega_2 \to Z$ is called the simulation system for the policy π. We denote $Q = \{q_\pi : \pi \in \Pi\}$, $F = \{f_\pi : \pi \in \Pi\}$ and the set of all simulation systems induced by Π by $S = \{s_\pi : \pi \in \Pi\} = Q \times F$. Let µ denote the Lebesgue measure on $\Omega_2$. Then, $\Pr\{z_{t+1} = z' \mid z_t = z, \pi\} = \mu\{\xi : s_\pi(z, \xi) = z'\}$. Unless specified otherwise, S will denote the set of simulation functions for the policy space Π under the simple simulation model.

The question now is how the complexity of the space Q compares with that of the policy space Π. We connect the two when Π is convex.

Lemma 1. Suppose Π is convex with P-dimension² d. Let Q be the corresponding space of simple simulation functions that simulate Π. Then, P-dim(Q) = d.

The proof essentially follows the argument in the proof of Lemma 4.2 in Jain and Varaiya (2006) and will not be repeated here.

² We refer the reader to Jain and Varaiya (2006) for a preliminary discussion of P-dimension, ε-covering numbers and ε-capacity.

4. Sample complexity for discounted-reward MDPs

Consider an MDP M with countably infinite state space X = N and finite action space $A = [1 : N_A]$ (for some $N_A \in N$), transition probability function $P_a(x, x')$, initial state distribution λ̄, reward function r(x, a), and discount factor γ < 1. The value function is the total discounted reward for a policy π in some set of stationary policies Π. Let θ be the left-shift operator on $\Omega^\infty$, $\theta(\omega_1, \omega_2, \ldots) = (\omega_2, \omega_3, \ldots)$.

We redefine Q to be the set of measurable functions from $W := X \times A \times \Omega^\infty \times \Omega^\infty$ onto itself which simulate Π, the set of given stationary policies, under the simple simulation model to generate the action sequence. The action-simulation function $q_\pi(x, a, \nu, \omega)$ may only depend on the current state x (though we define a more general function for notational simplicity) and $\nu_1$, the first component of the sequence $\nu = (\nu_1, \nu_2, \ldots)$ (with $\nu_i$ i.i.d. and uniform on [0, 1]). Similarly, we redefine F to be the set of measurable functions from $W := X \times A \times \Omega^\infty \times \Omega^\infty$ onto itself which simulate Π, the set of given stationary policies, under the simple simulation model

to generate the state sequence. Thus, state-simulation functions fπ (x, a, ν, ω) may depend only on the current state x (but again we define a more general function), and ξ1 , the first component of the sequence ξ = (ξ1 , ξ2 , . . .) (with ξi = (νi , ωi ) each i.i.d. and uniform [0, 1]). As before, S = Q × F , the set of simulation systems that simulate the policies Π , and generate the state-action sequence. Thus, the results and discussion of the previous section hold. For a policy π , our simulation system is

$$(z_{t+1}, \theta\xi) = s_\pi(z_t, \xi),$$
in which $z_{t+1}$ is the next state-action pair starting from the current state-action pair $z_t$, and the simulator also outputs the shifted noise sequences $\theta\xi$. This definition of the simulation function is introduced to facilitate the iteration of simulation functions. Let $S^2 := \{s \circ s : W \times W \to W \times W,\ s \in S\}$ and $S^t$ its generalization to t iterations. Let µ be a probability measure on $\Omega_2^\infty$ and λ the initial distribution on Z. Denote the product measure on W by $P = \lambda \times \mu$, and on $W^n$ by $P^n$. Define the two pseudo-metrics on S:

$$\rho_P(s_1, s_2) = \sum_{z} \lambda(z)\,\mu\{\xi : s_1(z, \xi) \neq s_2(z, \xi)\},$$
and
$$d_{L_1(P)}(s_1, s_2) := \sum_{z} \lambda(z) \int \left|s_1(z, \xi) - s_2(z, \xi)\right| d\mu(\xi).$$
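Both pseudo-metrics admit straightforward Monte Carlo estimates; for instance, $\rho_P(s_1, s_2)$ is just the probability that the two simulation systems disagree on a random pair (z, ξ) drawn from λ × µ. The small sketch below is illustrative only; the callables and the sampler are assumptions:

```python
import numpy as np

def rho_P_estimate(s1, s2, sample_z, n=10_000, rng=None):
    """Monte Carlo estimate of rho_P(s1, s2) = sum_z lambda(z) mu{xi: s1(z,xi) != s2(z,xi)}.
    sample_z() draws z = (x, a) from the initial distribution lambda;
    s1, s2 map (z, xi), with xi = (nu, omega) uniform on [0,1)^2, to the next z."""
    rng = rng or np.random.default_rng(0)
    disagree = 0
    for _ in range(n):
        z = sample_z()
        xi = (rng.random(), rng.random())     # (nu, omega) ~ mu
        disagree += (s1(z, xi) != s2(z, xi))
    return disagree / n
```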

The $\rho_P$ and $d_{L_1(P)}$ pseudo-metrics for the functions in Q and F are defined similarly. We present a key technical result which relates the covering number of the iterated functions $S^t$ under the ρ pseudo-metric with the covering number for Q under the $L_1$ pseudo-metric.

Lemma 2. Let λ be the initial distribution on Z and let $\lambda_s$ be the (one-step) distribution given by $\lambda_s(y) = \sum_z \lambda(z)\mu\{\xi : s(z, \xi) = (y, \theta\xi)\}$ for s ∈ S. Suppose that
$$K := \max\left\{\sup_{s \in S,\ y \in Z} \frac{\lambda_s(y)}{\lambda(y)},\ 1\right\} < \infty. \qquad (3)$$
Then,
$$N(\epsilon, S^t, \rho_P) \le N(\epsilon/K^t, Q, d_{L_1(P)}).$$

The proof of Lemma 2 is in the Appendix. The condition of the lemma essentially means that under the distribution λ the change in the probability mass on any state-action pair under any policy after one transition is uniformly bounded.

The estimation procedure is this. Obtain n initial state-action pairs $z_0^{(1)}, \ldots, z_0^{(n)}$ drawn i.i.d. according to λ, and n noise trajectories $\xi^{(1)}, \ldots, \xi^{(n)} \in \Omega_2^\infty$ (Ω = [0, 1]) drawn according to µ, the product measure on $\Omega_2^\infty$ of uniform probability measures on $\Omega_2$. Denote the samples by $\Xi^n = \{(z_0^1, \xi^1), \ldots, (z_0^n, \xi^n)\}$, drawn with measure $P^n$.
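Combined with the simple simulation model of Section 3, this estimation procedure amounts to the following sketch (a hypothetical, self-contained illustration; the truncated discounted sum over t = 0, ..., T anticipates the estimator used in Theorem 1 below):

```python
import numpy as np

def _inv_cdf(p, u):
    """Inverse-CDF sample from a probability vector p, given u in [0, 1)."""
    return int(np.searchsorted(np.cumsum(p), u, side='right'))

def value_estimate(pi, P, r, lam, gamma, n, T, rng=None):
    """Empirical estimate: average over n independent runs, each T steps long,
    of the truncated discounted reward, driven by i.i.d. uniform noise.
    pi: (S, A) policy; P: (A, S, S) kernels; r: (S, A) rewards;
    lam: (S, A) initial state-action distribution lambda."""
    rng = rng or np.random.default_rng(0)
    S, A = pi.shape
    total = 0.0
    for _ in range(n):
        idx = rng.choice(S * A, p=lam.ravel())   # z_0 = (x_0, a_0) ~ lambda
        x, a = divmod(idx, A)
        ret = 0.0
        for t in range(T + 1):
            ret += gamma ** t * r[x, a]          # gamma^t r(x_t, a_t)
            a = _inv_cdf(pi[x], rng.random())    # a_{t+1} = q_pi(x_t, nu_{t+1})
            x = _inv_cdf(P[a, x], rng.random())  # x_{t+1} = h_bar(x_t, a_{t+1}, omega_{t+1})
        total += ret
    return total / n
```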

This is our first main result.

Theorem 1. Let (Z, Γ, λ) be the measurable state-action space and r(x, a) the real-valued reward function, with values in [0, R]. Let $\Pi \subseteq \Pi_0$, the space of stationary stochastic policies, and S the space of simple simulation systems of Π. Suppose that P-dim(Q) ≤ d and the initial state-action distribution λ is such that $K := \max\{\sup_{s \in S,\ z \in Z} \lambda_s(z)/\lambda(z),\ 1\} < \infty$. Let $\hat V_n^T(\pi)$ be the estimate of V(π) obtained by averaging the reward from n samples, each T steps long. Then, for ε, δ > 0,
$$P^n\left\{\sup_{\pi \in \Pi} \left|\hat V_n^T(\pi) - V(\pi)\right| > \epsilon\right\} < \delta$$


if
$$n \ge n_0(\epsilon, \delta) := \frac{32R^2}{\alpha^2}\left(\log\frac{4}{\delta} + 2d\log\frac{32eR}{\alpha} + T\log K\right). \qquad (4)$$
Here $T := \log\!\left(\frac{\epsilon(1-\gamma)}{2R}\right)/\log\gamma$ is the ε/2-horizon time and $\alpha = \epsilon/2(T + 1)$.

Proof. Fix a policy π. Let $s_\pi$ be the corresponding simple simulation system that yields the state-action sequence given a noise sequence ξ. Define the function $R_t^\pi(z_0, \xi) := r \circ s_\pi \circ \cdots \circ s_\pi(z_0, \xi)$, with $s_\pi$ composed t times. Let $\mathcal{R}_t := \{R_t^\pi : Z \times \Omega_2^\infty \to [0, R],\ \pi \in \Pi\}$. Let V(π) be the expected discounted reward and $V^T(\pi)$ the expected discounted reward truncated up to T steps. Let $\hat V_n^T(\pi) = \frac{1}{n}\sum_{i=1}^n\left[\sum_{t=0}^T \gamma^t R_t^\pi(z_0^i, \xi^i)\right]$. Then,
$$|V(\pi) - \hat V_n^T(\pi)| \le |V(\pi) - V^T(\pi)| + |V^T(\pi) - \hat V_n^T(\pi)| \le |V^T(\pi) - \hat V_n^T(\pi)| + \frac{\epsilon}{2} \le \sum_{t=0}^T \left|\frac{1}{n}\sum_{i=1}^n \left[R_t^\pi(z_0^i, \xi^i) - E(R_t^\pi)\right]\right| + \frac{\epsilon}{2}.$$
The expectation is with respect to the product measure P = λ × µ. We show that with high probability each term in the sum over t is bounded by α = ε/2(T + 1). Note that (after abusing the definition of the reward function a bit) we have
$$d_{L_1(P)}(r \circ s_1^t, r \circ s_2^t) = \int |r(s_1^t(z, \xi)) - r(s_2^t(z, \xi))|\, d\mu(\xi)\, d\lambda(z) \le R \cdot \sum_z \lambda(z)\,\mu\{\xi : s_1^t(z, \xi) \neq s_2^t(z, \xi)\},$$
which as in Lemma 2 implies that
$$d_{L_1(P)}(r \circ s_1^t, r \circ s_2^t) \le R \cdot K^T d_{L_1(P)}(q_1, q_2).$$
From Theorem 5.7 in Vidyasagar (2003) (also Theorem 3 in Haussler (1992)), Lemma 2, and the inequality above, we get
$$P^n\left[\sup_{R_t \in \mathcal{R}_t}\left|\frac{1}{n}\sum_{i=1}^n R_t(z_0^i, \xi^i) - E(R_t)\right| > \alpha\right] \le 4\,\mathcal{C}(\alpha/16, \mathcal{R}_t, d_{L_1})\exp(-n\alpha^2/32R^2) \le 4\left(\frac{32eRK^T}{\alpha}\log\frac{32eRK^T}{\alpha}\right)^{d}\exp(-n\alpha^2/32R^2).$$
This implies that the estimation error is bounded by α with probability at least 1 − δ if the number of samples is
$$n \ge \frac{32R^2}{\alpha^2}\left(\log\frac{4}{\delta} + 2d\log\frac{32eR}{\alpha} + T\log K\right). \qquad \square$$

This theorem ensures that if we perform n ≥ n₀(ε, δ) simulation runs and average over them to obtain $\hat V_n^T(\pi)$, then this estimate is within ε of the actual value V(π) for a policy π with probability greater than 1 − δ; moreover, this number n can be used for any policy π ∈ Π, where the policy space Π satisfies the hypothesis of the theorem. The argument of the proof is essentially function approximation. What makes it complicated is that we are dealing with iterated function spaces. We now give a simple example to illustrate the magnitude of the sample complexity.

Example 1. Consider an MDP with four actions ($N_A = 4$), reward function upper bound R = 1, ε = δ = 0.1, γ = 0.7, policy space P-dimension d = 2 and K = 1.01. Then, using Theorem 1, we get that the time horizon is T = 12, the sample complexity is n₀ = 849,260 and the number of random numbers needed to be generated is n₀T, i.e., approximately 9 million. This can be run on an Intel Core 2 Duo system in around a second.
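For reference, the ε/2-horizon time and the bound (4), as reconstructed above, can be transcribed directly. This is only a sketch: the exact figures depend on rounding conventions for T and α and on the base of the logarithms, which are assumptions here.

```python
import math

def horizon_time(eps, gamma, R):
    """eps/2-horizon time: smallest T with gamma^T * R / (1 - gamma) <= eps / 2."""
    return math.ceil(math.log(eps * (1.0 - gamma) / (2.0 * R)) / math.log(gamma))

def n0(eps, delta, R, d, K, gamma):
    """Sample complexity n_0(eps, delta) per Eq. (4), with alpha = eps / (2 * (T + 1))."""
    T = horizon_time(eps, gamma, R)
    alpha = eps / (2.0 * (T + 1))
    return math.ceil(32 * R**2 / alpha**2
                     * (math.log(4 / delta)
                        + 2 * d * math.log(32 * math.e * R / alpha)
                        + T * math.log(K)))
```

For the parameters of Example 1 (R = 1, ε = δ = 0.1, γ = 0.7) this gives T = 12, matching the example; the numerical value of n₀ is sensitive to the conventions noted above.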

5. Partially observable MDPs with general policies

We now consider partially observed discounted-reward MDPs with general policies (non-stationary, with memory). The setup is as before, except that the policy depends on observations y ∈ Y, governed by the (conditional) probability τ(y|x) of observing y ∈ Y when the state is x ∈ X. Let $h_t$ denote the history $(a_0, y_0, a_1, y_1, \ldots, a_t, y_t)$ of observations and actions before time t. The results of Section 4 extend when the policies are non-stationary; however, there are many subtleties regarding the domains and ranges of the simulation functions and the measures, and some details are different.

Let $H_t = \{h_t = (a_0, y_0, a_1, y_1, \ldots, a_t, y_t) : a_s \in A,\ y_s \in Y,\ 0 \le s \le t\}$ (we introduce $a_0$ for notational convenience, but the next action and state might only depend on $y_0$). Let Π be the set of policies $\pi = (\pi_1, \pi_2, \ldots)$, with $\pi_{t+1} : H_t \times A \to [0, 1]$ a probability measure on A conditioned on $h_t \in H_t$. Let $\Pi_t$ denote the set of all policies $\pi_t$ at time t with π ∈ Π. We can simulate a policy π in the following manner:
$$a_{t+1} = q_{\pi,t+1}(h_t, \nu_{t+1}),$$
$$x_{t+1} = f_{\pi,t+1}(x_t, h_t, \nu_{t+1}, \omega_{t+1}) := \bar h(x_t, a_{t+1}, \omega_{t+1}),$$
$$y_{t+1} = g_{\pi,t+1}(x_t, h_t, \nu_{t+1}, \omega_{t+1}, \zeta_{t+1}) := g(x_{t+1}, \zeta_{t+1}),$$
where $\nu_t, \omega_t, \zeta_t$ are i.i.d. uniform on [0, 1], and $q_{\pi,t}$, $\bar h$ and g are simulation functions under the simple simulation model (which simulate $\pi_{t+1}$, $P_a$ and τ respectively). The initial state-action pair $(x_0, a_0)$ is drawn according to the given initial state-action distribution λ. Denote $z_t = (x_t, h_t) \in Z_t := X \times H_t$ and $\xi_t = (\nu_t, \omega_t, \zeta_t) \in \Omega_3 := \Omega \times \Omega \times \Omega$, and $\xi = (\xi_1, \xi_2, \ldots)$; then
$$s_{\pi,t+1}(z_t, \xi_t) = (z_{t+1}, \theta\xi) = \big(q_{\pi,t+1}(h_t, \nu_{t+1}),\ f_{\pi,t+1}(x_t, h_t, \nu_{t+1}, \omega_{t+1}),\ g_{\pi,t+1}(x_t, h_t, \nu_{t+1}, \omega_{t+1}, \zeta_{t+1}),\ h_t,\ \theta\xi\big)$$

where θ is a left-shift operator. The function sequence $s_\pi = (s_{\pi,1}, s_{\pi,2}, \ldots)$ will be called the simulation system for a general policy π. The function sequences $q_\pi = (q_{\pi,1}, q_{\pi,2}, \ldots)$, $f_\pi = (f_{\pi,1}, f_{\pi,2}, \ldots)$ and $g_\pi = (g_{\pi,1}, g_{\pi,2}, \ldots)$ will be called the action, state and observation simulation functions corresponding to policy π. We denote $Q = \{q_\pi : \pi \in \Pi\}$, $F = \{f_\pi : \pi \in \Pi\}$, $G = \{g_\pi : \pi \in \Pi\}$ and the set of all simulation functions induced by Π by $S = \{s_\pi : \pi \in \Pi\}$. Note that we can define $s_{\pi,t} \circ \cdots \circ s_{\pi,1}$; we shall denote it by $s_\pi^t$. We first connect the P-dimensions of $\Pi_t$ and $Q_t := \{q_{\pi,t} : \pi \in \Pi\}$ ($S_t$ and $F_t$ will be defined similarly).

Lemma 3. Suppose $\Pi_t$ is convex and P-dim($\Pi_t$) = d. Then, P-dim($Q_t$) = d.

The proof follows that of Lemma 1 and is omitted. Let µ be a probability measure on $\Omega_3^\infty$ and $\lambda_t$ a measure on $Z_{t-1}$. Denote the product measure on $W_{t-1} = \Omega_3^\infty \times Z_{t-1}$ by $P_t = \lambda_t \times \mu$, and on $W_{t-1}^n$ by $P_t^n$. Define the two pseudo-metrics on $S_t$:

$$\rho_t(s_{1t}, s_{2t}) = \sum_{z \in Z_{t-1}} \lambda_t(z)\,\mu\{\xi : s_{1t}(z, \xi) \neq s_{2t}(z, \xi)\},$$
and
$$d_{L_1(P_t)}(s_{1t}, s_{2t}) := \sum_{z \in Z_{t-1}} \lambda_t(z) \int |s_{1t}(z, \xi) - s_{2t}(z, \xi)|\, d\mu(\xi).$$

The ρt and dL1 (Pt ) pseudo-metrics for the function spaces Qt and Ft are defined similarly. Define for s ∈ S and z ∈ Zt ,

$$\lambda_{s^t}(z) := \sum_{z' \in Z_0} \lambda(z')\,\mu\{\xi : s^t(z', \xi) = (z, \theta^t\xi)\}, \qquad (5)$$

which is a probability measure on Zt . We now state the extension of the technical lemma needed for the main theorem of this section.


Lemma 4. Let λ be a probability measure on $Z_0$ and $\lambda_{s^t}$ be the probability measure on $Z_t$ as defined above. Suppose that P-dim($Q_t$) ≤ d for all t ≥ 1, and that there exist probability measures $\lambda_t$ on $Z_t$ such that $K := \max\left\{\sup_t \sup_{s^t \in S^t,\ z \in Z_t} \frac{\lambda_{s^t}(z)}{\lambda_{t+1}(z)},\ 1\right\} < \infty$. Then, for 1 ≤ t ≤ T,
$$N(\epsilon, S^t, \rho_1) \le N\!\left(\frac{\epsilon}{Kt}, Q_t, \rho_t\right)\cdots N\!\left(\frac{\epsilon}{Kt}, Q_1, \rho_1\right) \le \left(\frac{2eKt}{\epsilon}\log\frac{2eKt}{\epsilon}\right)^{dt}.$$

The proof can be found in the Appendix. We now obtain our sample complexity result.

Theorem 2. Let (Z, Γ, λ) be the measurable state-action space, Y the observation space, $P_a(x, x')$ the state transition function and τ(y|x) the conditional probability measure that determines the observations. Let r(x, a) be the real-valued reward function bounded in [0, R]. Let Π be the set of stochastic policies (non-stationary and with memory in general) and S the set of simple simulation systems that simulate π ∈ Π. Suppose that P-dim($Q_t$) ≤ d for all t ≥ 1, and let the probability measures λ, µ (on $Z_0$ and $\Omega_3^\infty$ respectively) and $\lambda_{t+1}$, a probability measure on $Z_t$ (with $\lambda_1 = \lambda$), be such that $K := \max\left\{\sup_t \sup_{s^t \in S^t,\ z \in Z_t} \frac{\lambda_{s^t}(z)}{\lambda_{t+1}(z)},\ 1\right\}$ is finite, where $\lambda_{s^t}$ is as defined above. Let $\hat V_n^T(\pi)$ be the estimate of V(π) obtained from n samples with T time steps. Then, given any ε, δ > 0, with probability at least 1 − δ,
$$\sup_{\pi\in\Pi} |\hat V_n^T(\pi) - V(\pi)| < \epsilon$$
for $n \ge \frac{32R^2}{\alpha^2}\left(\log\frac{4}{\delta} + 2dT\left(\log\frac{32eR}{\alpha} + \log KT\right)\right)$, where T is the ε/2-horizon time and α = ε/2(T + 1).

The result in this more general setting is similar in structure to Theorem 1. Making sure that the hypothesis is satisfied by computing K is more complicated. Further, we also use additional randomness in simulation to account for partial observability. We note that while dynamic programming formulations exist to solve partially observed MDPs, they reduce to dynamic programming on continuum state spaces (Davis & Varaiya, 1973) through the introduction of belief states, and hence are not computationally feasible. Thus, simulation-based methods as developed here are very attractive for such problems.

6. Simulation-based optimization

We propose a simulation-based optimization framework based on the empirical process theory for Markov decision processes developed above and in Jain and Varaiya (2006). Given an MDP M with a convex and compact policy space Π, let Q be the set of simple action-simulation functions that simulate the policies in Π. Let $\hat Q_\epsilon$ be an ε-net for Q under the $d_{L_1}$ metric, and denote the set of policies in Π corresponding to this ε-net by $\hat\Pi_\epsilon$. We know from Lemma 1 that if the P-dimension of Π is some finite d, the P-dimension of Q is also d, which implies that the ε-net $\hat\Pi_\epsilon$ is finite, and its cardinality is known to be bounded by $n_1(\epsilon, N_A) = 2(x\log x)^d$ (see Anthony and Bartlett (1999), Haussler (1992) and Vidyasagar (2003)), where $x = 2eN_A/\epsilon$ and $N_A$ is the cardinality of the action set A.

We pick each $\pi \in \hat\Pi_\epsilon$ and simulate n sample-paths for T time steps, where T is the ε/2-horizon time for a given ε > 0. Thus, we obtain estimates $\hat V_n^T(\pi)$. We pick
$$\hat\pi^* \in \arg\sup_{\pi\in\hat\Pi_\epsilon} \hat V_n^T(\pi).$$
The optimal policy is, of course,
$$\pi^* \in \arg\sup_{\pi\in\Pi} V(\pi).$$
We also define
$$\tilde\pi \in \arg\sup_{\pi\in\Pi} \hat V_n^T(\pi).$$
Define the regret of a policy as
$$\varrho(\pi) := V(\pi^*) - V(\pi).$$
Now, from the results of Section 4, we know that for a given ε, δ > 0 and with n ≥ n₀(ε/3, δ), with probability at least 1 − δ,
$$V(\pi) \le \hat V_n^T(\pi) + \epsilon/3, \quad \forall \pi \in \Pi, \qquad (6)$$
and in particular for π = π*. Also,
$$\hat V_n^T(\pi) \le V(\pi) + \epsilon/3, \quad \forall \pi \in \Pi, \qquad (7)$$
and in particular for π = π̂*.

Consider any π₁, π₂ ∈ Π with simulation functions q₁, q₂ ∈ Q and corresponding s₁, s₂ ∈ S. Then, from the proof of Theorem 1, we know that
$$d_{L_1}(\hat V_n^T(\pi_1), \hat V_n^T(\pi_2)) \le R\cdot\sum_{t=0}^{T}\gamma^t d_{L_1}(s_1^t, s_2^t) \le R\cdot\sum_{t=0}^{T}\gamma^t K^t d_{L_1}(q_1, q_2).$$
Then, if $d_{L_1}(q_1, q_2) \le \varepsilon$,
$$d_{L_1}(\hat V_n^T(\pi_1), \hat V_n^T(\pi_2)) \le R\varepsilon\,\frac{1-(\gamma K)^{T+1}}{1-\gamma K} \le \epsilon/3$$
for γK < 1 and $\varepsilon \le \frac{1-\gamma K}{3R(1-(\gamma K)^{T+1})}$.

Now, note that if we take π₁ = π*, then by the definition of an ε-net there exists a π₂ = π̂ ∈ $\hat\Pi_\epsilon$ such that $d_{L_1}(q_1, q_2) \le \varepsilon$. Thus,
$$\hat V_n^T(\pi^*) \le \hat V_n^T(\hat\pi) + \epsilon/3 \le \hat V_n^T(\hat\pi^*) + \epsilon/3. \qquad (8)$$
From Eqs. (6)–(8), we get that V(π*) ≤ V(π̂*) + ε, i.e., π̂* is an ε-optimal policy. Formally, we have shown the following.

Theorem 3. Given an MDP with countably infinite state space and a finite action space (of cardinality $N_A$), with a convex, compact policy space Π with P-dim(Π) = d, an ε, δ > 0, and γ < 1/K, if we estimate the value function for each policy in a given ε-net for Q (with $\varepsilon < \frac{1-\gamma K}{3R(1-(\gamma K)^{T+1})}$) by doing n₀(ε/3, δ) simulation runs, each for T time steps, then the obtained policy π̂* is ε-optimal, in the sense that
$$P^n\{\varrho(\hat\pi^*) > \epsilon\} < \delta.$$
Moreover, the sample complexity is given by
$$n_0(\epsilon/3, \delta)\cdot n_1(\varepsilon, N_A) \sim O\!\left(\frac{1}{\epsilon^3}\log^2\frac{1}{\epsilon},\ \log\frac{1}{\delta}\right),$$
polynomial in 1/ε.

Example 2. We now consider the same example as in Example 1. In this case, n₁ ∼ 586 million. Using the brute-force method on a machine with 10,000 Intel Core 2 Duo processors, this would take approximately 16.28 h. However, the performance can be significantly improved by using intelligent heuristics as well as intelligent (gradient-based) search algorithms. The results in this paper provide the framework for such methods to be developed in the future.
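The brute-force method referred to above simply evaluates the simulation-based estimate for every policy in the ε-net and keeps the best one. A minimal sketch follows, assuming the net and a per-policy estimator (for instance, the value-estimation sketch in Section 4) are supplied:

```python
def best_policy_in_net(policy_net, estimate_value):
    """Brute-force epsilon-net search: return the policy maximizing the
    simulation-based estimate over a finite net of policies.
    policy_net: iterable of policies; estimate_value: callable returning the
    estimate for a policy (e.g. from n_0(eps/3, delta) runs of T steps each)."""
    best_pi, best_v = None, float('-inf')
    for pi in policy_net:
        v = estimate_value(pi)
        if v > best_v:
            best_pi, best_v = pi, v
    return best_pi, best_v
```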


7. Multi-armed bandits

We now consider a specific model of stochastic systems, namely the Multi-armed bandit model. Such models have a special structure which allows for more tractable solutions using forward induction, as compared to the backward induction of dynamic programming formulations. Nevertheless, their solution in many settings is computationally challenging, and simulation-based methods provide a viable alternative.

Consider a Multi-armed bandit M with (finitely many) $N_A$ machines, each machine being a Markov process on the countably infinite state space X. When the arm of machine a is pulled, the system moves from the current state x to a state x' ∈ X according to some state transition function $P_a(x, x')$. Note that our framework is quite general: not only do we allow restless machines, we also allow the states of the machines to be dependent. We will assume that the reward received is some bounded and known function r(x, a) ∈ [0, R]. The observation at time t is $y_t = r(x_t, a_t)$. Let $h_{t+1} = (a_0, y_0, \ldots, a_t, y_t)$ denote the history known at time t. A policy then is $\pi = (\pi_1, \pi_2, \ldots)$, where $\pi_t(a, h_t)$ is the probability of pulling arm a given the current history $h_t$. We will assume we have a policy space Π.

Let $V(\pi) = E\left[\sum_{t=0}^{\infty}\gamma^t r(x_t, a_t)\right]$ denote the expected total discounted reward under policy π, where 0 < γ < 1 is the discount factor. We will assume that Π is convex and compact, and P-dim($\Pi_t$) ≤ d, finite for each t. It is well known that picking arms according to the Gittins index rule yields the optimal 'exploration versus exploitation trade-off' and maximizes V(π) (Gittins & Jones, 1974). However, computing the Gittins index is computationally complex, and the problem is known to be intractable in some cases (Bertsimas & Nino-Mora, 1996). Thus, we extend the simulation-based framework we have developed for Markov decision processes to find an ε-optimal policy for such problems.

Our simulation system for a Multi-armed bandit M is the following. Let ν, ω denote uniform i.i.d. noise sequences from $\Omega^\infty$ as before. Then, given a policy $\pi = (\pi_1, \pi_2, \ldots)$, the action at time t + 1 is $a_{t+1} = q_{\pi,t}(h_t, \nu_{t+1})$ and the state at time t + 1 is $x_{t+1} = f_{\pi,t}(x_t, h_t, \nu_{t+1}, \omega_{t+1}) = \bar h(x_t, a_{t+1}, \omega_{t+1})$. Denote $z_t = (x_t, h_t) \in Z_t := X \times H_t$ and $\xi_t = (\nu_t, \omega_t) \in \Omega_2$. Thus,

$$(z_{t+1}, \theta\xi) = s_{\pi,t+1}(z_t, \xi)$$
summarizes the simulation system for π, where, as before, θ is the left-shift operator. We denote $\Pi_t$, $H_t$, $Q_t$, $F_t$, $S_t$ and Π, Q, F, S as before. It follows straightforwardly from Lemma 3 that if $\Pi_t$ is convex and P-dim($\Pi_t$) = d < ∞, then P-dim($Q_t$) = d also. Let µ be a probability measure on $\Omega_2^\infty$ and $\lambda_t$ a probability measure on $Z_t$. Denote the product measure by $P_t = \lambda_t \times \mu$. Defining the pseudo-metrics $\rho_t$ and $d_{L_1}$ on $S_t$ and $Q_t$ as before, we can establish that Lemma 4 also holds for the Multi-armed bandit problem. We can then establish the following result.

Theorem 4. Consider the Multi-armed bandit M with $P_a(x, x')$ the state transition function for arm a. Let r(x, a) be the real-valued reward function bounded in [0, R], with discount factor 0 < γ < 1. Let Π be the set of stochastic policies (non-stationary and with memory in general), and S the set of simulation systems that simulate policies π ∈ Π under the simple simulation model. Suppose that P-dim($Q_t$) = d < ∞, and let the probability measures λ, µ (on $Z_0$ and $\Omega_2^\infty$ respectively) and $\lambda_{t+1}$, a probability measure on $Z_t$ (with $\lambda_1 = \lambda$), be such

that $K := \max\left\{\sup_t \sup_{s^t \in S^t,\ z \in Z_t} \frac{\lambda_{s^t}(z)}{\lambda_{t+1}(z)},\ 1\right\}$ is finite, where $\lambda_{s^t}$ is as defined in Eq. (5). Let $\hat V_n^T(\pi)$ be the estimate of V(π) obtained from n samples with T time steps. Then, given any ε, δ > 0, with probability at least 1 − δ,
$$\sup_{\pi\in\Pi} |\hat V_n^T(\pi) - V(\pi)| < \epsilon$$
for n ≥ n₀(ε, δ) with
$$n_0(\epsilon, \delta) := \frac{32R^2}{\alpha^2}\left(\log\frac{4}{\delta} + 2dT\left(\log\frac{32eR}{\alpha} + \log KT\right)\right),$$
where T is the ε/2-horizon time and α = ε/2(T + 1).

Theorem 5. Given a Multi-armed bandit M with countably infinite state space X and a finite action space A (of cardinality $N_A$), with a convex, compact policy space Π with P-dim(Π) = d < ∞, an ε, δ > 0, and γ < 1/K, if we estimate the value function for each policy in a given ε-net for Q (with $\varepsilon < \frac{1-\gamma K}{3R(1-(\gamma K)^{T+1})}$) by doing n₀(ε/3, δ) simulation runs, each for T time steps, then the obtained policy π̂* is ε-optimal, in the sense that
$$P^n\{\varrho(\hat\pi^*) > \epsilon\} < \delta.$$
Moreover, the sample complexity is given by
$$n_0(\epsilon/3, \delta)\cdot n_1(\varepsilon, N_A) \sim O\!\left(\frac{1}{\epsilon^3}\log^2\frac{1}{\epsilon},\ \log\frac{1}{\delta}\right),$$
a polynomial in 1/ε.
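As in Section 4, the value of a history-dependent bandit policy can be estimated by averaging simulated runs under the simple simulation model. The sketch below is illustrative only; the exact pairing of rewards with time indices, and the names used, are assumptions:

```python
import numpy as np

def bandit_value_estimate(policy, P, r, x0_dist, gamma, n, T, rng=None):
    """Estimate the discounted value of a history-dependent bandit policy by
    averaging n simulated runs of T steps.
    policy(history) -> arm probabilities; P: (A, S, S) per-arm kernels;
    r: (S, A) rewards; x0_dist: initial state distribution."""
    rng = rng or np.random.default_rng(0)

    def inv_cdf(p, u):
        # simple simulation model: inverse CDF driven by uniform noise
        return int(np.searchsorted(np.cumsum(p), u, side='right'))

    total = 0.0
    for _ in range(n):
        x = inv_cdf(x0_dist, rng.random())
        history = []                                    # (arm, observation) pairs
        ret = 0.0
        for t in range(T + 1):
            a = inv_cdf(policy(history), rng.random())  # arm drawn from the policy given h_t
            x = inv_cdf(P[a, x], rng.random())          # state of the pulled machine moves under P_a
            y = r[x, a]                                 # observed reward for this pull
            ret += gamma ** t * y
            history.append((a, y))
        total += ret
    return total / n
```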

8. Conclusions

The paper considers simulation-based value function estimation methods for Markov decision processes (MDPs). Uniform sample complexity results are presented for the discounted-reward case. The combinatorial complexity of the space of simulation functions under the proposed simple simulation model is shown to be the same as that of the underlying space of induced Markov chains when the latter is convex. Using ergodicity and weak mixing leads to similar uniform sample complexity results for the average-reward case, when a reference Markov chain exists. Extensions of the results are obtained when the MDP is partially observable with general policies. Remarkably, the sample complexity results have the same order for both completely and partially observed MDPs when stationary and memoryless policies are used. Sample complexity results for discounted-reward Markov games can be deduced easily as well.

The results can be seen as an extension of the theory of Probably Approximately Correct (PAC) learning for partially observable Markov decision processes (POMDPs) and games. PAC theory is related to the system identification problem. One of the key contributions of this paper is the observation that how we simulate an MDP matters for obtaining uniform estimates. This is a new (and surprising) observation. Thus, the results of this paper can also be seen as first steps towards developing an empirical process theory for Markov decision processes. Such a theory would go a long way in establishing a theoretical foundation for computer simulation of complex engineering systems.

We have used Hoeffding's inequality in obtaining the rate of convergence for discounted-reward MDPs and the McDiarmid–Azuma inequality for the average-reward MDPs, though more sophisticated and tighter inequalities of Talagrand (1995) can be used as well. This would yield better results and is part of future work. While here we have developed a theoretical foundation for simulation-based optimization of Markov decision processes, in future work we plan to develop gradient-based search algorithms that can be used to choose a smaller set of policies to be evaluated

R. Jain, P. Varaiya / Automatica 46 (2010) 1297–1304

(as opposed to the whole ε-net in Section 6). One advantage of the approach in this paper is that it will find the global maximum, while gradient-based algorithms are guaranteed to find only local maxima. We also plan to do a comparative empirical study of the simulation-based methods proposed here and those in Cao (2007) and Chang, Fu, Hu, and Marcus (2007).

Appendix A. Proof of Lemma 2

Proof. Consider any $s_1 = (q_1, f_1), s_2 = (q_2, f_2) \in S$ and $z \in Z$. Then,
$$\begin{aligned}
\mu\{\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi)\}
&= \mu\{\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi)\} + \mu\{\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi)\}\\
&= \mu\{\cup_y (\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (y, \theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_y (\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (y, \theta^{t-1}\xi))\}\\
&\le \mu\{\cup_y (\xi : s_1(y, \theta^{t-1}\xi) \neq s_2(y, \theta^{t-1}\xi),\ s_1^{t-1}(z,\xi) = (y, \theta^{t-1}\xi))\} + \mu\{\xi : s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi)\}\\
&\le \sum_y \mu\{\xi : s_1(y, \theta^{t-1}\xi) \neq s_2(y, \theta^{t-1}\xi) \mid s_1^{t-1}(z,\xi) = (y, \theta^{t-1}\xi)\}\,\mu\{\xi : s_1^{t-1}(z,\xi) = (y, \theta^{t-1}\xi)\}\\
&\quad + \mu\{\xi : s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi)\}.
\end{aligned}$$
It is easy to argue that $\lambda_{s_1^{t-1}}(y) \le K^{t-1}\lambda(y)$, where $\lambda_{s_1^{t-1}}(y) = \sum_z \lambda(z)\mu\{\xi : s_1^{t-1}(z,\xi) = (y, \theta^{t-1}\xi)\}$. Thus, multiplying both the RHS and the LHS of the above sequence of inequalities by λ(z) and summing over z, and observing that
$$\mu\{\xi : s_1(y, \theta^{t-1}\xi) \neq s_2(y, \theta^{t-1}\xi)\} = \mu\{\xi' : s_1(y, \xi') \neq s_2(y, \xi')\},$$
we get that the first part of the RHS is
$$\sum_y \lambda_{s_1^{t-1}}(y)\,\mu\{\xi' : s_1(y, \xi') \neq s_2(y, \xi')\} \le K^{t-1}\cdot\sum_y \lambda(y)\,\mu\{\xi' : s_1(y, \xi') \neq s_2(y, \xi')\}.$$
This implies that
$$\rho_P(s_1^t, s_2^t) \le K^{t-1}\rho_P(s_1, s_2) + \rho_P(s_1^{t-1}, s_2^{t-1}) \le (K^{t-1} + K^{t-2} + \cdots + 1)\,\rho_P(s_1, s_2) \le K^t\rho_P(s_1, s_2),$$
where the second inequality is obtained by induction. Now,
$$\rho_P(s_1, s_2) = \sum_z \lambda(z)\mu\{\xi : s_1(z,\xi) \neq s_2(z,\xi)\} \le \sum_z \lambda(z)\mu\{\xi : q_1(z,\xi) \neq q_2(z,\xi)\} \le \sum_z \lambda(z)\int |q_1(z,\xi) - q_2(z,\xi)|\,d\mu(\xi),$$
where the first inequality follows because the event $\{q_1 = q_2, f_1 \neq f_2\}$ has zero probability, and the second inequality follows because µ is a probability measure and $|q_1 - q_2| \ge 1$ when $q_1 \neq q_2$. Thus $\rho_P(s_1^t, s_2^t) \le K^t\cdot d_{L_1}(q_1, q_2)$, which proves the required assertion. □

Appendix B. Proof of Lemma 4

The proof of Lemma 4 is similar, but the details are somewhat more involved.

Proof. Consider any $s_1^t, s_2^t \in F^t$ and $z \in Z$. Then,
$$\begin{aligned}
\mu\{\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi)\}
&= \mu\{\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi)\} + \mu\{\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi)\}\\
&= \mu\{\cup_{z'\in Z_{t-1}} (\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (z', \theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_{z'\in Z_{t-1}} (\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (z', \theta^{t-1}\xi))\}\\
&\le \mu\{\cup_{z'\in Z_{t-1}} (\xi : s_1^t(z,\xi) \neq s_2^t(z,\xi),\ s_1^{t-1}(z,\xi) = s_2^{t-1}(z,\xi) = (z', \theta^{t-1}\xi))\}\\
&\quad + \mu\{\cup_{z'\in Z_{t-1}} (\xi : s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi),\ s_1^{t-1}(z,\xi) = (z', \theta^{t-1}\xi))\}\\
&\le \sum_{z'\in Z_{t-1}} \mu\{\xi : s_{1t}(z', \theta^{t-1}\xi) \neq s_{2t}(z', \theta^{t-1}\xi) \mid s_1^{t-1}(z,\xi) = (z', \theta^{t-1}\xi)\}\,\mu\{\xi : s_1^{t-1}(z,\xi) = (z', \theta^{t-1}\xi)\}\\
&\quad + \mu\{\xi : s_1^{t-1}(z,\xi) \neq s_2^{t-1}(z,\xi)\}.
\end{aligned}$$
Multiplying both the RHS and the LHS of the above sequence of inequalities by λ(z) and summing over z, and observing again that
$$\mu\{\xi : s_{1t}(z', \theta^{t-1}\xi) \neq s_{2t}(z', \theta^{t-1}\xi)\} = \mu\{\xi' : s_{1t}(z', \xi') \neq s_{2t}(z', \xi')\},$$
we get that the first part of the RHS is
$$\sum_{z'\in Z_{t-1}} \lambda_{s_1^{t-1}}(z')\,\mu\{\xi' : s_{1t}(z', \xi') \neq s_{2t}(z', \xi')\} \le K\cdot\sum_{z'\in Z_{t-1}} \lambda_t(z')\,\mu\{\xi' : s_{1t}(z', \xi') \neq s_{2t}(z', \xi')\}.$$
This by induction implies
$$\rho_1(s_1^t, s_2^t) \le K\left(\rho_t(s_{1t}, s_{2t}) + \cdots + \rho_1(s_{11}, s_{21})\right),$$
which implies the first inequality. For the second inequality, note that the ρ pseudo-metric and the $L_1$ pseudo-metric are related thus:
$$\sum_{z\in Z_{t-1}} \lambda_t(z)\,\mu\{\xi : s_{1t}(z,\xi) \neq s_{2t}(z,\xi)\} \le \sum_{z\in Z_{t-1}} \lambda_t(z)\,\mu\{\xi : q_{1t}(z,\xi) \neq q_{2t}(z,\xi)\} \le \sum_{z\in Z_{t-1}} \lambda_t(z)\int |q_{1t}(z,\xi) - q_{2t}(z,\xi)|\,d\mu(\xi),$$
where the first inequality follows because only the events $\{q_{1t} \neq q_{2t}, f_{1t} = f_{2t}, g_{1t} = g_{2t}\}$ and $\{q_{1t} \neq q_{2t}, f_{1t} \neq f_{2t}\}$ have non-zero probability and non-zero $L_1$ distance, and both these events are contained in $\{q_{1t} \neq q_{2t}\}$. The second inequality follows because µ is a probability measure and $|q_{1t} - q_{2t}| \ge 1$ when $q_{1t} \neq q_{2t}$, and thus $\rho_t(s_{1t}, s_{2t}) \le K\cdot d_{L_1}(q_{1t}, q_{2t})$, which proves the required assertion. □

References

Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimension, uniform convergence and learnability. Journal of the ACM, 44, 615–631.
Altman, E., & Zeitouni, O. (1994). Rate of convergence of empirical measures and costs in controlled Markov chains and transient optimality. Mathematics of Operations Research, 19(4), 955–974.


Anthony, M., & Bartlett, P. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.
Bartlett, P. L., & Tewari, A. (2007). Sample complexity of policy search with known dynamics. In Advances in neural information processing systems: Vol. 19. MIT Press.
Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Bertsimas, D., & Nino-Mora, J. (1996). Conservation laws, extended polymatroids and multiarmed bandit problems: a polyhedral approach to indexable systems. Mathematics of Operations Research, 21(2), 257–306.
Cao, X. R. (2007). Stochastic learning and optimization — a sensitivity-based approach. Springer-Verlag.
Chang, H.-S., Fu, M., Hu, J., & Marcus, S. (2007). Simulation-based algorithms for Markov decision processes. Springer-Verlag.
Davis, M. H. A., & Varaiya, P. P. (1973). Dynamic programming conditions for partially observed stochastic systems. SIAM Journal of Control, 11(2), 226–261.
Dudley, R. (1999). Uniform central limit theorems. Cambridge University Press.
Gittins, J., & Jones, D. (1974). A dynamic allocation index for the sequential allocation of experiments. In J. Gani (Ed.), Progress in statistics (pp. 241–266). North-Holland.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural nets and other learning applications. Information and Computation, 100(1), 78–150.
Jain, R., & Varaiya, P. P. (2006). Simulation-based uniform value function estimates of Markov decision processes. SIAM Journal of Control and Optimization, 45(5), 1633–1656.
Kearns, M., Mansour, Y., & Ng, A. Y. (1999). Approximate planning in large POMDPs via reusable trajectories. In Proc. Neural Information Processing Systems (NIPS) conference. MIT Press.
Kolmogorov, A. N., & Tihomirov, V. M. (1961). ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations Series, 2(17), 277–364.
Ledoux, M. (2001). The concentration of measure phenomenon. Mathematical surveys and monographs: Vol. 89. American Mathematical Society.
Mannor, S., & Tsitsiklis, J. N. (2005). On the empirical state-action frequencies in Markov decision processes under general policies. Mathematics of Operations Research, 30(3), 545–561.
Marbach, P., & Tsitsiklis, J. N. (2003). Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13, 111–148.
Ng, A. Y., & Jordan, M. I. (2000). Pegasus: a policy search method for large MDPs and POMDPs. In Proc. UAI conference.
Peshkin, L., & Mukherjee, S. (2001). Bounds on sample size for policy evaluation in Markov environments. In Lecture Notes in Computer Science: Vol. 2111 (pp. 616–630).

Pollard, D. (1990). Empirical process theory and applications. Institute of Mathematical Statistics.
Shim, D. H., Kim, H. J., & Sastry, S. S. (2003). Decentralized nonlinear model predictive control of multiple flying robots in dynamic environments. In Proc. IEEE Conference on Decision and Control 2003.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'IHÉS, 81, 73–205.
Vapnik, V., & Chervonenkis, A. (1981). Necessary and sufficient conditions for convergence of means to their expectations. Theory of Probability and its Applications, 26(3), 532–553.
Vidyasagar, M. (2003). Learning and generalization: With applications to neural networks (2nd ed.). Springer-Verlag.

Rahul Jain is an Assistant Professor of Electrical Engineering at the University of Southern California, Los Angeles, CA. Prior to joining USC, he was a postdoctoral member of the research staff at the Mathematical Sciences Division of the IBM T.J. Watson Research Center, Yorktown Heights, NY. He received his Ph.D. in EECS in 2004, and an M.A. in Statistics in 2002, both from the University of California, Berkeley. He also received an M.S. in ECE from Rice University in 1999 and completed his undergraduate work with a B.Tech in EE in 1997 from the Indian Institute of Technology, Kanpur. He won a Best paper award at the ValueTools Conference 2009 and is recipient of the CAREER Award from the National Science Foundation in 2010. He has diverse research interests with a current focus on Game Theory and Economics for Networks, and Stochastic Control and Learning.

Pravin Varaiya is Professor of the Graduate School in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. From 1975 to 1992 he was also Professor of Economics at Berkeley. His current research interests include transportation networks, electric power systems, and hybrid systems. His honors include a Guggenheim Fellowship, three Honorary Doctorates, the Field Medal and Bode Prize of the IEEE Control Systems Society, and the Richard E. Bellman Control Heritage Award. He is a Fellow of IEEE, a member of the National Academy of Engineering, and a Fellow of the American Academy of Arts and Science.