
arXiv:1504.06043v1 [cs.SY] 23 Apr 2015

Stability of Stochastic Approximations with ‘Controlled Markov’ Noise and Temporal Difference Learning

Arunselvan Ramaswamy¹ and Shalabh Bhatnagar²

¹ [email protected]   ² [email protected]
¹,² Department of Computer Science and Automation, Indian Institute of Science, Bangalore - 560012, India.

April 24, 2015

Abstract

In this paper we present a ‘stability theorem’ for stochastic approximation (SA) algorithms with ‘controlled Markov’ noise. Such algorithms were first studied by Borkar in 2006. Specifically, sufficient conditions are presented which guarantee the stability of the iterates. Further, under these conditions the iterates are shown to track a solution to the differential inclusion defined in terms of the ergodic occupation measures associated with the ‘controlled Markov’ process. As an application of our main result we present an improvement to a general form of temporal difference learning algorithms. Specifically, we present sufficient conditions for their stability and convergence using our framework. This paper builds on the works of Borkar and of Benveniste, Métivier and Priouret.

1 Introduction

Let us begin by considering the general form of stochastic approximation algorithms:

x_{n+1} = x_n + a(n)(h(x_n) + M_{n+1}),  (1)

where

(i) h : R^d → R^d is a Lipschitz continuous function;
(ii) {a(n)}_{n≥0} is the given step-size sequence such that Σ_{n=0}^{∞} a(n) = ∞ and Σ_{n=0}^{∞} a(n)² < ∞;
(iii) {M_n}_{n≥1} is a sequence of square integrable martingale difference terms.

In 1996, Benaïm [3] showed that the asymptotic behavior of recursion (1) can be determined by studying the asymptotic behavior of the associated o.d.e. ẋ(t) = h(x(t)).
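For concreteness, recursion (1) can be simulated for a toy choice of h of our own making (purely illustrative, not part of the analysis): h(x) = −x, whose o.d.e. ẋ(t) = −x(t) has the origin as a global attractor, step sizes a(n) = 1/(n+1), and bounded zero-mean noise M_{n+1}:

```python
import random

def sa_iterates(x0, num_steps, seed=0):
    """Simulate x_{n+1} = x_n + a(n) * (h(x_n) + M_{n+1}) for h(x) = -x.

    a(n) = 1/(n+1) satisfies sum a(n) = inf and sum a(n)^2 < inf (condition (ii));
    M_{n+1} ~ Uniform(-1, 1) is a bounded martingale difference (condition (iii)).
    """
    rng = random.Random(seed)
    x = x0
    for n in range(num_steps):
        a_n = 1.0 / (n + 1)
        h_x = -x                      # h is Lipschitz with constant 1
        m_next = rng.uniform(-1, 1)   # zero-mean square integrable noise
        x = x + a_n * (h_x + m_next)
    return x

# The iterates track the o.d.e. xdot = -x and settle near its equilibrium 0.
print(abs(sa_iterates(5.0, 200000)))
```

With these choices the iterate reduces to a running average of the noise, so it converges to the o.d.e. equilibrium despite the persistent perturbations; this is the behavior the ODE method formalizes.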


This technique is popularly known as the ODE method and was originally developed by Ljung in 1977 [9]. In [3] it is assumed that sup_{n≥0} ‖x_n‖ < ∞ a.s.; in other words, the iterates are assumed to be stable. In many cases the stability assumption becomes a bottleneck in using the ODE method. This bottleneck was overcome by Borkar and Meyn in 1999 [8]. Specifically, they developed sufficient conditions that guarantee the ‘stability and convergence’ of recursion (1). In many applications, the noise-process is Markovian in nature. Stochastic approximation algorithms with ‘Markov noise’ have been extensively studied by Benveniste et al. [5]. These results have been extended to the case when the noise is ‘controlled Markov’ by Borkar [6]. Specifically, the asymptotics of the iterates are described via a limiting differential inclusion (DI) that is defined in terms of the ergodic occupation measures of the Markov process. As explained in [6], the motivation for such a study stems from the fact that in many cases the noise-process is not Markov, but its lack of the Markov property comes through its dependence on a time-varying ‘control’ process. In particular this is the case with many reinforcement learning algorithms. In [6], the iterates are assumed to be stable, which as explained earlier poses a bottleneck, especially in analyzing algorithms from reinforcement learning. The aim of this paper is to overcome this bottleneck. In other words, we present sufficient conditions for the ‘stability and convergence’ of stochastic approximation algorithms with ‘controlled Markov’ noise. Finally, as an application setting, we consider a general form of the temporal difference learning algorithms in reinforcement learning and present weaker sufficient conditions (than those in the literature) that guarantee their stability and convergence using our framework.

The organization of this paper is as follows. In Section 2.1 we present the definitions and notations used in this paper. In Section 2.2 we discuss the assumptions involved in proving the stability of the iterates given by (3).
In Section 3 we show the stability of the iterates under the assumptions outlined in Section 2.2 (Theorem 1). In Section 4 we present additional assumptions which, coupled with the assumptions from Section 2.2, are used to prove the ‘stability and convergence’ of recursion (3) (Theorem 2). Specifically, Theorem 2 states that under the aforementioned sets of assumptions the iterates are stable and converge to an internally chain transitive invariant set associated with ẋ(t) ∈ ĥ(x(t)). For the definition of ĥ the reader is referred to Section 4. In Section 5 we discuss an application of Theorem 2: we present sufficient conditions for the ‘stability and convergence’ of a general form of temporal difference learning algorithms in reinforcement learning.

2 Preliminaries and Assumptions

2.1 Notations & Definitions

In this section we present the definitions and notations used in this paper, for the purpose of easy reference. Note that they can be found in Benaïm et al. [4], Aubin et al. [1], [2] and Borkar [7].

Marchaud Map: A set-valued map h : R^n → {subsets of R^m} is called a Marchaud map if it satisfies the following properties:

(i) For each x ∈ R^n, h(x) is convex and compact.
(ii) (point-wise boundedness) For each x ∈ R^n, sup_{w∈h(x)} ‖w‖ < K(1 + ‖x‖) for some K > 0.
(iii) h is an upper-semicontinuous map. We say that h is upper-semicontinuous if, given sequences {x_n}_{n≥1} (in R^n) and {y_n}_{n≥1} (in R^m) with x_n → x, y_n → y and y_n ∈ h(x_n), n ≥ 1, we have y ∈ h(x). In other words, the graph of h, {(x, y) : y ∈ h(x), x ∈ R^n}, is closed in R^n × R^m.

If the set-valued map H : R^d → {subsets of R^d} is Marchaud, then the differential inclusion (DI) given by

ẋ(t) ∈ H(x(t))  (2)

is guaranteed to have at least one solution that is absolutely continuous. The reader is referred to Aubin & Cellina [1] for more details. If x is an absolutely continuous map satisfying (2), then we say that x ∈ Σ. A set-valued semiflow Φ associated with (2) is defined on [0, +∞) × R^d as follows: Φ_t(x) := {x(t) | x ∈ Σ, x(0) = x}. For B × M ⊂ [0, +∞) × R^d, define Φ_B(M) := ∪_{t∈B, x∈M} Φ_t(x).

Let M ⊆ R^d; the ω-limit set is defined by ω_Φ(M) := ∩_{t≥0} Φ_{[t,+∞)}(M). Similarly, the limit set of a solution x is given by L(x) = ∩_{t≥0} x([t, +∞)).

Invariant Set: M ⊆ R^d is invariant if for every x ∈ M there exists a trajectory, x, entirely in M with x(0) = x. In other words, x ∈ Σ with x(t) ∈ M for all t ≥ 0.

Internally Chain Transitive Set: M ⊂ R^d is said to be internally chain transitive if M is compact and for every x, y ∈ M, ǫ > 0 and T > 0 we have the following: there exist Φ¹, ..., Φⁿ, n solutions to the differential inclusion ẋ(t) ∈ h(x(t)), a sequence x_1 (= x), ..., x_{n+1} (= y) ⊂ M and n real numbers t_1, t_2, ..., t_n greater than T such that Φ^i_{t_i}(x_i) ∈ N^ǫ(x_{i+1}) and Φ^i_{[0,t_i]}(x_i) ⊂ M for 1 ≤ i ≤ n. The sequence (x_1 (= x), ..., x_{n+1} (= y)) is called an (ǫ, T) chain in M from x to y.

Given x ∈ R^d and A ⊆ R^d, define the distance between x and A by d(x, A) := inf{‖x − y‖ | y ∈ A}. We define the δ-open neighborhood of A by N^δ(A) := {x | d(x, A) < δ} and the δ-closed neighborhood of A by N̄^δ(A) := {x | d(x, A) ≤ δ}.

Attracting Set: A ⊆ R^d is an attracting set if it is compact and there exists a neighborhood U such that for any ǫ > 0 there exists T(ǫ) ≥ 0 such that Φ_{[T(ǫ),+∞)}(U) ⊂ N^ǫ(A). Then U is called the fundamental neighborhood of A. If, in addition to being compact, the attracting set is also invariant then it is called an attractor. The basin of attraction of A is given by B(A) = {x | ω_Φ(x) ⊂ A}. The set A is Lyapunov stable if for all δ > 0 there exists ǫ > 0 such that Φ_{[0,+∞)}(N^ǫ(A)) ⊆ N^δ(A). We use T(ǫ) and T_ǫ interchangeably to denote the dependence of T on ǫ. The open ball of radius r around the origin is represented by B_r(0), while the closed ball is represented by B̄_r(0).

Upper limit of sequences of sets: Let {K_n}_{n≥1} be a sequence of sets in R^d. The upper limit of {K_n}_{n≥1} is given by Limsup_{n→∞} K_n := {y | lim_{n→∞} d(y, K_n) = 0}.

2.2 Assumptions

Let us consider a stochastic approximation algorithm with ‘controlled Markov’ noise in R^d:

x_{n+1} = x_n + a(n)[h(x_n, y_n) + M_{n+1}],  (3)

where

(i) h : R^d × S → R^d is a jointly continuous map, with S a compact metric space. The map h is Lipschitz continuous in the first component, and its Lipschitz constant does not change with the second component. Let the Lipschitz constant be L. This is assumption (1) in Section 2 of Borkar [6]; here we call it (A1).
(ii) The step-size sequence {a(n)}_{n≥0} is such that a(n) > 0 for all n ≥ 0, Σ_{n=0}^{∞} a(n) = ∞ and Σ_{n=0}^{∞} a(n)² < ∞. Without loss of generality let sup_{n≥0} a(n) ≤ 1. This is assumption (3) in Section 2 of Borkar [6]; here we call it (A3).
(iii) {M_n}_{n≥1} is a sequence of square integrable martingale difference terms that also contribute to the noise. They are related to {x_n}_{n≥0} by E[‖M_{n+1}‖² | F_n] ≤ K(1 + ‖x_n‖²), where n ≥ 0. This is assumption (2) in Section 2 of Borkar [6]; here we call it (A2).
(iv) {y_n}_{n≥0} is the S-valued ‘controlled Markov’ process. Note that S is assumed to be Polish in [6]. As stated in (A1), in this paper we let S be a compact metric space, hence Polish.

Among the assumptions made in [6], (A1)−(A3) are relevant to prove the stability of the iterates. The remaining assumptions are listed in Section 4, where we present the result on the ‘stability and convergence’ of the iterates given by (3). See Borkar [6] for more details.

For each c ≥ 1, we define functions h_c : R^d × S → R^d by h_c(x, y) := h(cx, y)/c.

We define the limiting map h_∞ : R^d × S → {subsets of R^d} by h_∞(x, y) := Limsup_{c→∞} {h_c(x, y)}, where Limsup is the upper limit of a sequence of sets (see Section 2.1). For each x ∈ R^d define H(x) := c̄o(∪_{y∈S} h_∞(x, y)), the closed convex hull of ∪_{y∈S} h_∞(x, y).
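To see the scaling at work, consider a one-dimensional family of the form h(x, y) = a(y)x + b(y) (the form that reappears in Section 5; the particular a and b below are our own illustrative choices). Then h_c(x, y) = a(y)x + b(y)/c, so the bounded part is scaled away and h_∞(x, y) = {a(y)x}:

```python
def h(x, y):
    """Hypothetical h: linear part a(y)*x plus bounded offset b(y)."""
    a_y = 2.0 + y        # plays the role of the linear coefficient a(y)
    b_y = 10.0 * y       # plays the role of the bounded offset b(y)
    return a_y * x + b_y

def h_c(c, x, y):
    """Scaled map h_c(x, y) = h(c*x, y) / c."""
    return h(c * x, y) / c

x, y = 1.5, 0.3
limit = (2.0 + y) * x    # h_infinity(x, y) = {a(y)*x}
for c in (1, 10, 1000):
    print(c, h_c(c, x, y))
# As c grows, h_c(x, y) approaches a(y)*x: the offset b(y)/c vanishes.
print(abs(h_c(1e9, x, y) - limit))
```

Intuitively, h_∞ captures only the growth of h at infinity, which is exactly what governs whether the iterates can blow up.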

We replace the stability assumption (sup_{n≥0} ‖x_n‖ < ∞) of [6] with the following two assumptions.

(S1) If c_n ↑ ∞, y_n → y and lim_{n→∞} h_{c_n}(x, y_n) = u for some u ∈ R^d, then u ∈ h_∞(x, y).

(S2) There exists an attracting set, A, associated with ẋ(t) ∈ H(x(t)) such that sup_{u∈A} ‖u‖ < 1. Further, B̄_1(0) is a subset of some fundamental neighborhood of A.

Assumption (T2), discussed in Section 5, is a sufficient condition for (S2) to be satisfied. One could say that (T2) constitutes the ‘Lyapunov function’ condition for the DI. We shall show that H is a Marchaud map in Lemma 2. As explained in [1], it follows that the DI ẋ(t) ∈ H(x(t)) has at least one solution that is absolutely continuous. Hence assumption (S2) is meaningful.

We begin by showing that h_c satisfies (A1) for all c ≥ 1. Fix x_1, x_2 ∈ R^d, y ∈ S and c ≥ 1. We have

‖h_c(x_1, y) − h_c(x_2, y)‖ = ‖h(cx_1, y)/c − h(cx_2, y)/c‖ ≤ L‖cx_1 − cx_2‖/c = L‖x_1 − x_2‖.

We thus have that h_c is Lipschitz continuous in the first component with Lipschitz constant L. Further, for a fixed c this constant does not change with y. Since c was arbitrarily chosen, it follows that L is the Lipschitz constant associated with every h_c. It is trivially true that h_c is a jointly continuous map.

Fix c ≥ 1, x ∈ R^d and y ∈ S. Then ‖h_c(x, y) − h_c(0, y)‖ ≤ L‖x − 0‖, hence ‖h_c(x, y)‖ ≤ ‖h_c(0, y)‖ + L‖x‖. Since h(0, ·) is a continuous function on S (a compact set) and c ≥ 1, we have ‖h_c(0, ·)‖_∞ ≤ ‖h(0, ·)‖_∞ ≤ M for some 0 < M < ∞. Thus

‖h_c(x, y)‖ ≤ K(1 + ‖x‖), where K = L ∨ M.

We may assume without loss of generality that K is such that E[‖M_{n+1}‖² | F_n] ≤ K(1 + ‖x_n‖²) also holds for all n ≥ 0 (assumption (A2)). Again, K does not change with c.

Fix x ∈ R^d and y ∈ S. As explained in the previous paragraph we have sup_{c≥1} ‖h_c(x, y)‖ ≤ K(1 + ‖x‖). The upper limit of {h_c(x, y)}_{c≥1}, Limsup_{c→∞} {h_c(x, y)}, is clearly non-empty. Recall that h_∞(x, y) = Limsup_{c→∞} {h_c(x, y)} and H(x) = c̄o(∪_{y∈S} h_∞(x, y)). Hence,

sup_{u∈h_∞(x,y)} ‖u‖ ≤ K(1 + ‖x‖) and sup_{u∈H(x)} ‖u‖ ≤ K(1 + ‖x‖).  (4)

We need to show that H is a Marchaud map. Before we do that, let us prove an auxiliary result.

Lemma 1. Suppose x_n → x in R^d, y_n → y in S, c_n ↑ ∞ and lim_{n→∞} h_{c_n}(x_n, y_n) = u. Then u ∈ h_∞(x, y).

Proof. Consider the following inequality:

‖h_{c_n}(x, y_n) − u‖ ≤ ‖h_{c_n}(x_n, y_n) − u‖ + ‖h_{c_n}(x, y_n) − h_{c_n}(x_n, y_n)‖.

Since ‖h_{c_n}(x, y_n) − h_{c_n}(x_n, y_n)‖ ≤ L‖x_n − x‖ and lim_{n→∞} h_{c_n}(x_n, y_n) = u, we get lim_{n→∞} h_{c_n}(x, y_n) = u. It follows from (S1) that u ∈ h_∞(x, y).

The following is a direct consequence of Lemma 1: if x_n → x in R^d, {y_n} ⊂ S and c_n → ∞, then d(h_{c_n}(x_n, y_n), H(x)) → 0. If this were not so, then without loss of generality we would have d(h_{c_n}(x_n, y_n), H(x)) > ǫ for some ǫ > 0. Since S is compact, there exists {m(n)} ⊆ {n} such that lim_{m(n)→∞} y_{m(n)} = y and h_{c_{m(n)}}(x_{m(n)}, y_{m(n)}) → u for some y ∈ S and some u ∈ R^d. We have x_{m(n)} → x, y_{m(n)} → y, c_{m(n)} → ∞ and h_{c_{m(n)}}(x_{m(n)}, y_{m(n)}) → u. It follows from Lemma 1 that u ∈ h_∞(x, y) ⊆ H(x). This is a contradiction.

Lemma 2. H is a Marchaud map.

Proof. Recall that H(x) = c̄o(∪_{y∈S} h_∞(x, y)). As explained earlier (cf. (4)), sup_{u∈H(x)} ‖u‖ ≤ K(1 + ‖x‖). Hence H is point-wise bounded. From the definition of H it follows that H(x) is convex and compact for each x ∈ R^d.


It is left to show that H is upper semi-continuous. Let x_n → x, u_n → u and u_n ∈ H(x_n), n ≥ 1. We need to show that u ∈ H(x). If this is not true, then there exists a linear functional on R^d, say f, such that sup_{v∈H(x)} f(v) ≤ α − ǫ and f(u) ≥ α + ǫ, for some α ∈ R and ǫ > 0. Since u_n → u, there exists N such that for each n ≥ N, f(u_n) ≥ α + ǫ/2, i.e., H(x_n) ∩ [f ≥ α + ǫ/2] ≠ ∅, where [f ≥ a] denotes the set {x | f(x) ≥ a}. For the sake of notational convenience let us denote ∪_{y∈S} h_∞(x, y) by A(x) for all x ∈ R^d. We claim that A(x_n) ∩ [f ≥ α + ǫ/2] ≠ ∅ for all n ≥ N. We shall prove this claim later; for now we assume that the claim is true and proceed.

Pick w_n ∈ A(x_n) ∩ [f ≥ α + ǫ/2] for each n ≥ N. Let w_n ∈ h_∞(x_n, y_n) for some y_n ∈ S. Since {w_n}_{n≥N} is norm bounded, it contains a convergent subsequence, say {w_{n(k)}}_{k≥1} ⊆ {w_n}_{n≥N}. Let lim_{k→∞} w_{n(k)} = w. Since w_{n(k)} ∈ h_∞(x_{n(k)}, y_{n(k)}), there exists c_{n(k)} ∈ N such that ‖w_{n(k)} − h_{c_{n(k)}}(x_{n(k)}, y_{n(k)})‖ < 1/n(k). The sequence {c_{n(k)}}_{k≥1} is chosen such that c_{n(k+1)} > c_{n(k)} for each k ≥ 1. Since {y_{n(k)}}_{k≥1} is from a compact set, it has a convergent subsequence; for the sake of notational convenience (without loss of generality) we assume that the sequence itself has a limit, i.e., y_{n(k)} → y for some y ∈ S. We thus have: c_{n(k)} ↑ ∞, x_{n(k)} → x, y_{n(k)} → y, and h_{c_{n(k)}}(x_{n(k)}, y_{n(k)}) → w (since w_{n(k)} → w and ‖w_{n(k)} − h_{c_{n(k)}}(x_{n(k)}, y_{n(k)})‖ < 1/n(k)). It follows from Lemma 1 that w ∈ h_∞(x, y). Since w_{n(k)} → w and f(w_{n(k)}) ≥ α + ǫ/2 for each k ≥ 1, we have f(w) ≥ α + ǫ/2. This contradicts sup_{w∈H(x)} f(w) ≤ α − ǫ.

It remains to prove that A(x_n) ∩ [f ≥ α + ǫ/2] ≠ ∅ for all n ≥ N. If this were not true, then there would exist {m(k)}_{k≥1} ⊆ {n ≥ N} such that A(x_{m(k)}) ⊆ [f < α + ǫ/2] for all k. It follows that H(x_{m(k)}) = c̄o(A(x_{m(k)})) ⊆ [f ≤ α + ǫ/2] for each k ≥ 1. Since u_{m(k)} → u (being a subsequence of {u_n}), there exists N_1 such that for all m(k) ≥ N_1, f(u_{m(k)}) ≥ α + 3ǫ/4. This leads to a contradiction.

3 Stability Theorem

Let us construct the linearly interpolated trajectory x(t), t ∈ [0, ∞), from the sequence {x_n}_{n≥0}. Define t(0) := 0 and t(n) := Σ_{i=0}^{n−1} a(i) for all n ≥ 1. Let x(t(n)) := x_n and, for t ∈ (t(n), t(n+1)),

x(t) := ((t(n+1) − t)/(t(n+1) − t(n))) x(t(n)) + ((t − t(n))/(t(n+1) − t(n))) x(t(n+1)).

Define T_0 := 0 and T_n := min{t(m) : t(m) ≥ T_{n−1} + T} for n ≥ 1. Observe that there exists a subsequence {m(n)} of N such that T_n = t(m(n)) for all n ≥ 0.

We use x(·) to construct the rescaled trajectory x̂(t), t ≥ 0. For t ∈ [T_n, T_{n+1}), n ≥ 0, define x̂(t) := x(t)/r(n), where r(n) := ‖x(T_n)‖ ∨ 1. Also, let x̂(T_{n+1}⁻) := lim_{t↑T_{n+1}} x̂(t), t ∈ [T_n, T_{n+1}). The rescaled martingale difference terms are given by M̂_{k+1} := M_{k+1}/r(n), t(k) ∈ [T_n, T_{n+1}).

We define a piece-wise constant trajectory, ẑ(·), using the rescaled trajectory as follows: for t ∈ [t(m), t(m+1)) with T_n ≤ t(m) < t(m+1) ≤ T_{n+1}, define ẑ(t) := h_{r(n)}(x̂(t(m)), y_m). Let us define another piece-wise constant trajectory using {y_n}_{n≥0}: let y(t) := y_n for all t ∈ [t(n), t(n+1)).

Recall that A is an attracting set associated with ẋ(t) ∈ H(x(t)) (see assumption (S2) in Section 2.2). Let δ_1 := sup_{u∈A} ‖u‖; then δ_1 < 1. Choose δ_2, δ_3 and δ_4 such that δ_1 < δ_2 < δ_3 < δ_4 < 1. Fix T := T(δ_2 − δ_1), where T(·) is defined in Section 2.1. If x(·) is a solution to ẋ(t) ∈ H(x(t)) with ‖x(0)‖ ≤ 1, then ‖x(t)‖ < δ_2 for all t ≥ T(δ_2 − δ_1).

Consider the following recursion: x(t(k+1)) = x(t(k)) + a(k)(h(x(t(k)), y_k) + M_{k+1}), such that t(k), t(k+1) ∈ [T_n, T_{n+1}). Multiplying both sides by 1/r(n), we get the following rescaled recursion:

x̂(t(k+1)) = x̂(t(k)) + a(k)(h_{r(n)}(x̂(t(k)), y_k) + M̂_{k+1}).  (5)

Note that E[‖M̂_{k+1}‖² | F_k] ≤ K(1 + ‖x̂(t(k))‖²).
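The bookkeeping behind the interpolated trajectory (grid times t(n), epoch times T_n = t(m(n)) and scales r(n) = ‖x(T_n)‖ ∨ 1) can be sketched in code. This is our own illustration, not part of the analysis; the iterate and step-size sequences below are placeholders:

```python
def trajectory_scaffold(iterates, steps, T):
    """Build t(n), the epoch indices m(n) with T_n = t(m(n)), and scales r(n).

    iterates: the sequence {x_n} (scalars here, for simplicity),
    steps:    the step sizes {a(n)},
    T:        the epoch length T(delta_2 - delta_1) from the construction above.
    """
    # t(0) = 0, t(n) = a(0) + ... + a(n-1)
    t = [0.0]
    for a in steps:
        t.append(t[-1] + a)
    # T_0 = 0; T_{n+1} is the first grid point t(m) with t(m) >= T_n + T
    m_of = [0]
    for m, tm in enumerate(t):
        if tm >= t[m_of[-1]] + T:
            m_of.append(m)
    # r(n) = ||x(T_n)|| v 1 : the scale used on [T_n, T_{n+1})
    r = [max(abs(iterates[m]), 1.0) for m in m_of if m < len(iterates)]
    return t, m_of, r

xs = [3.0, 2.0, 0.5, 0.2, 0.1, 0.05]
ts, ms, rs = trajectory_scaffold(xs, steps=[1.0, 0.5, 0.5, 0.25, 0.25], T=1.0)
print(ts)  # cumulative times t(n)
print(ms)  # indices m(n) with T_n = t(m(n))
print(rs)  # r(0) = 3.0 since ||x_0|| > 1; once inside the unit ball, r clips at 1
```

On each epoch [T_n, T_{n+1}) the trajectory is divided by r(n), so x̂ always starts an epoch inside the closed unit ball; this is the normalization that the stability argument below exploits.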

The following two lemmas can be found in Borkar & Meyn [8] (which, however, does not consider ‘controlled Markov’ noise). It is shown there that the ‘martingale noise’ sequence converges almost surely. We present the results below using our setting.

Lemma 3. sup_{t≥0} E‖x̂(t)‖² < ∞.

Proof. Recall that T_n = t(m(n)) and T_{n+1} = t(m(n+1)). It is enough to show that

sup_{m(n)<k<m(n+1)} E[‖x̂(t(k))‖²] ≤ M

for some M (> 0) that is independent of n. Let us fix n and k such that n ≥ 0 and m(n) < k < m(n+1). Consider the following rescaled recursion:

x̂(t(k)) = x̂(t(k−1)) + a(k−1)(ẑ(t(k−1)) + M̂_k).

Unfolding the above we get

x̂(t(k)) = x̂(t(m(n))) + Σ_{l=m(n)}^{k−1} a(l)(ẑ(t(l)) + M̂_{l+1}).

Taking expectation of the square of the norm on both sides we get

E‖x̂(t(k))‖² = E‖x̂(t(m(n))) + Σ_{l=m(n)}^{k−1} a(l)(ẑ(t(l)) + M̂_{l+1})‖².

It follows from the Minkowski inequality that

E^{1/2}‖x̂(t(k))‖² ≤ E^{1/2}‖x̂(T_n)‖² + Σ_{l=m(n)}^{k−1} a(l)(E^{1/2}‖ẑ(t(l))‖² + E^{1/2}‖M̂_{l+1}‖²).

For each l such that m(n) ≤ l ≤ k−1, ‖ẑ(t(l))‖ = ‖h_{r(n)}(x̂(t(l)), y(t(l)))‖ ≤ K(1 + ‖x̂(t(l))‖). Further, E[‖M̂_{l+1}‖² | F_l] ≤ K(1 + ‖x̂(t(l))‖²). Observe that T_{n+1} − T_n ≤ T + 1 (since sup_n a(n) ≤ 1). Using these observations we get:

E^{1/2}‖x̂(t(k))‖² ≤ 1 + Σ_{l=m(n)}^{k−1} a(l)[K E^{1/2}(1 + ‖x̂(t(l))‖)² + √K E^{1/2}(1 + ‖x̂(t(l))‖²)],

E^{1/2}‖x̂(t(k))‖² ≤ 1 + Σ_{l=m(n)}^{k−1} a(l)[K(1 + E^{1/2}‖x̂(t(l))‖²) + √K(1 + E^{1/2}‖x̂(t(l))‖²)],

E^{1/2}‖x̂(t(k))‖² ≤ [1 + (K + √K)(T + 1)] + (K + √K) Σ_{l=m(n)}^{k−1} a(l) E^{1/2}‖x̂(t(l))‖².

Applying the discrete version of the Gronwall inequality we now get

E^{1/2}‖x̂(t(k))‖² ≤ [1 + (K + √K)(T + 1)] e^{(K+√K)(T+1)}.

Let us define M := ([1 + (K + √K)(T + 1)] e^{(K+√K)(T+1)})². Clearly M is independent of n and the claim follows.

Lemma 4. The sequence ζ̂_n, n ≥ 0, converges almost surely, where ζ̂_n := Σ_{k=0}^{n−1} a(k)M̂_{k+1} for all n ≥ 1.

Proof. It is enough to prove that

Σ_{k=0}^{∞} E[‖a(k)M̂_{k+1}‖² | F_k] < ∞ a.s.

Instead, we prove that

E[Σ_{k=0}^{∞} a(k)² E[‖M̂_{k+1}‖² | F_k]] < ∞.

From assumption (A2) we get

E[Σ_{k=0}^{∞} a(k)² E[‖M̂_{k+1}‖² | F_k]] ≤ Σ_{k=0}^{∞} a(k)² K(1 + E‖x̂(t(k))‖²).

The claim now follows from Lemma 3 and (A3).

Let x^n(t), t ∈ [0, T], be the solution (up to time T) to ẋ^n(t) = ẑ(T_n + t) with initial condition x^n(0) = x̂(T_n). Clearly,

x^n(t) = x̂(T_n) + ∫_0^t ẑ(T_n + s) ds.  (6)

Lemma 5. lim_{n→∞} sup_{t∈[T_n, T_n+T]} ‖x^n(t) − x̂(t)‖ = 0 a.s.

Proof. Let t ∈ [t(m(n)+k), t(m(n)+k+1)) be such that T_n ≤ t(m(n)+k) < t(m(n)+k+1) ≤ T_{n+1}, where n ≥ 0. First we prove the lemma when t(m(n)+k+1) < T_{n+1}. Consider the following:

x̂(t) = ((t(m(n)+k+1) − t)/a(m(n)+k)) x̂(t(m(n)+k)) + ((t − t(m(n)+k))/a(m(n)+k)) x̂(t(m(n)+k+1)).

Substituting for x̂(t(m(n)+k+1)) in the above equation we get

x̂(t) = ((t(m(n)+k+1) − t)/a(m(n)+k)) x̂(t(m(n)+k)) + ((t − t(m(n)+k))/a(m(n)+k)) [x̂(t(m(n)+k)) + a(m(n)+k)(h_{r(n)}(x̂(t(m(n)+k)), y_{m(n)+k}) + M̂_{m(n)+k+1})],

hence

x̂(t) = x̂(t(m(n)+k)) + (t − t(m(n)+k))(h_{r(n)}(x̂(t(m(n)+k)), y_{m(n)+k}) + M̂_{m(n)+k+1}).

Unfolding x̂(t(m(n)+k)) (see (5)), we get

x̂(t) = x̂(T_n) + Σ_{l=0}^{k−1} a(m(n)+l)(h_{r(n)}(x̂(t(m(n)+l)), y_{m(n)+l}) + M̂_{m(n)+l+1}) + (t − t(m(n)+k))(h_{r(n)}(x̂(t(m(n)+k)), y_{m(n)+k}) + M̂_{m(n)+k+1}).  (7)

Recall that

x^n(t) = x̂(T_n) + ∫_0^t ẑ(T_n + s) ds.

Splitting the above integral, we get

x^n(t) = x̂(T_n) + Σ_{l=0}^{k−1} ∫_{t(m(n)+l)}^{t(m(n)+l+1)} ẑ(s) ds + ∫_{t(m(n)+k)}^{t} ẑ(s) ds.

Thus,

x^n(t) = x̂(T_n) + Σ_{l=0}^{k−1} a(m(n)+l) h_{r(n)}(x̂(t(m(n)+l)), y_{m(n)+l}) + (t − t(m(n)+k)) h_{r(n)}(x̂(t(m(n)+k)), y_{m(n)+k}).  (8)

From (7) and (8), we get the following:

‖x^n(t) − x̂(t)‖ ≤ ‖Σ_{l=0}^{k−1} a(m(n)+l) M̂_{m(n)+l+1}‖ + ‖(t − t(m(n)+k)) M̂_{m(n)+k+1}‖,

hence

‖x^n(t) − x̂(t)‖ ≤ ‖ζ̂_{m(n)+k} − ζ̂_{m(n)}‖ + ‖ζ̂_{m(n)+k+1} − ζ̂_{m(n)+k}‖.

If t(m(n)+k+1) = T_{n+1}, then in the above set of equations we may replace x̂(t(m(n)+k+1)) with x̂(T_{n+1}⁻); the arguments remain the same. Since ζ̂_n, n ≥ 1, converges almost surely, the lemma follows.

Recall that T = T(δ_2 − δ_1). Let us view {x^n([0, T]) | n ≥ 0} and {x̂([T_n, T_n + T]) | n ≥ 0} as subsets of C([0, T], R^d) (endowed with the sup-norm ‖·‖_∞). We claim that {x^n([0, T]) | n ≥ 0} is equicontinuous and point-wise bounded almost surely. Since ‖x^n(0)‖ = ‖x̂(T_n)‖ ≤ 1, we can use the Gronwall inequality to show that sup_{n≥0} ‖x^n(·)‖_∞ < ∞ almost surely, where ‖x^n(·)‖_∞ = sup_{t∈[0,T]} ‖x^n(t)‖. Hence we conclude that the aforementioned set is almost surely point-wise bounded.

Now we show that the family of functions is almost surely equicontinuous. Recall that sup_{t≥0} E‖x̂(t)‖² < ∞ a.s. and ‖ẑ(t)‖ ≤ K(1 + ‖x̂([t])‖), where [t] := max{t(m) | t(m) ≤ t}. Hence sup_{t≥0} ‖ẑ(t)‖ < ∞ a.s. For δ > 0, we have the following:

‖x^n(t+δ) − x^n(t)‖ ≤ ∫_t^{t+δ} ‖ẑ(s)‖ ds.

Since sup_{t≥0} ‖ẑ(t)‖ < ∞ a.s., it follows that

‖x^n(t+δ) − x^n(t)‖ ≤ ∫_t^{t+δ} M ds = Mδ,

where M is a constant (possibly sample path dependent) such that sup_{t≥0} ‖ẑ(t)‖ ≤ M. Hence we conclude that {x^n([0, T]) | n ≥ 0} is equicontinuous. It follows from the Arzelà-Ascoli theorem that {x^n([0, T]) | n ≥ 0} is relatively compact in C([0, T], R^d). From Lemma 5 it follows that {x̂([T_n, T_n + T]) | n ≥ 0} is also relatively compact in C([0, T], R^d).

Using Gronwall's inequality we can show that sup_{k≥0} ‖x_k‖ < ∞ a.s. if and only if sup_{n≥0} ‖x(T_n)‖ < ∞ a.s. To prove the stability of the iterates it is therefore enough to show that sup_{n≥0} r(n) < ∞ a.s., given that the recursion satisfies (A1)−(A3), (S1) and (S2) (see Section 2.2). If sup_{n≥0} r(n) = ∞, then there exists {l} ⊆ {n} such that r(l) ↑ ∞. In the lemma that follows we characterize the limit set of {x̂([T_l, T_l + T]) | {l} ⊆ {n} and r(l) ↑ ∞} in C([0, T], R^d).


Lemma 6. Let {l} ⊆ {n} be such that r(l) ↑ ∞. Any limit of {x̂([T_l, T_l + T]) | {l} ⊆ {n} and r(l) ↑ ∞} in C([0, T], R^d) is of the form x(t) = x(0) + ∫_0^t z(s) ds, where x(0) ∈ B̄_1(0) and z : [0, T] → R^d is a measurable function such that z(t) ∈ H(x(t)), t ∈ [0, T].

Proof. For t ≥ 0 define [t] := max{t(m) | t(m) ≤ t}. Fix t_0 ∈ [T_n, T_{n+1}); then ẑ(t_0) = h_{r(n)}(x̂([t_0]), y([t_0])). Since ‖h_{r(n)}(x̂([t_0]), y([t_0]))‖ ≤ K(1 + ‖x̂([t_0])‖), we have ‖ẑ(t_0)‖ ≤ K(1 + ‖x̂([t_0])‖). It follows from Lemma 3 that ‖ẑ(t)‖ < ∞ a.s. Recall that {x̂(T_l + ·) | {l} ⊆ {n}} is relatively compact in C([0, T], R^d). Without loss of generality we may assume that

x̂(T_l + ·) → x(·) in C([0, T], R^d), for some x(·) ∈ C([0, T], R^d);
ẑ(T_l + ·) → z(·) weakly in L²([0, T], R^d), for some z(·) ∈ L²([0, T], R^d).

It follows from Lemma 5 that x^l(·) → x(·) in C([0, T], R^d). Letting r(l) → ∞ in the equation

x^l(t) = x^l(0) + ∫_0^t ẑ(T_l + s) ds,

we get

x(t) = x(0) + ∫_0^t z(s) ds.

Since ‖x^l(0)‖ = ‖x̂(T_l)‖ ≤ 1, we have ‖x(0)‖ ≤ 1. Further, since ẑ(T_l + ·) → z(·) weakly in L²([0, T], R^d), it follows from the Banach-Saks theorem that there exists {k(l)} ⊆ {l} such that

(1/N) Σ_{l=1}^{N} ẑ(T_{k(l)} + ·) → z(·) strongly in L²([0, T], R^d).

Further, there exists {m(N)} ⊆ {N} such that

(1/m(N)) Σ_{l=1}^{m(N)} ẑ(T_{k(l)} + ·) → z(·) a.e. on [0, T].  (9)

Fix t_0 ∈ [0, T] such that (9) holds, i.e.,

lim_{m(N)→∞} (1/m(N)) Σ_{l=1}^{m(N)} ẑ(T_{k(l)} + t_0) = z(t_0).  (10)

We know that ẑ(T_{k(l)} + t_0) = h_{r(k(l))}(x̂([T_{k(l)} + t_0]), y([T_{k(l)} + t_0])). Note that y([T_{k(l)} + t_0]) = y(T_{k(l)} + t_0).

We claim the following: for any ǫ > 0 there exists N such that for all n ≥ N, ‖x̂(t(m)) − x̂(t(m+1))‖ < ǫ, where T_n ≤ t(m) < t(m+1) < T_{n+1}. If t(m+1) = T_{n+1}, then we claim that ‖x̂(t(m)) − x̂(T_{n+1}⁻)‖ < ǫ. We shall prove this later; for now we assume it to be true and proceed.

Since x̂(T_{k(l)} + t_0) → x(t_0), it follows from the above claim that x̂([T_{k(l)} + t_0]) → x(t_0). Since r(k(l)) ↑ ∞, it follows from Lemma 1 that

lim_{r(k(l))↑∞} d(h_{r(k(l))}(x̂([T_{k(l)} + t_0]), y([T_{k(l)} + t_0])), H(x(t_0))) = 0, i.e.,
lim_{r(k(l))↑∞} d(ẑ(T_{k(l)} + t_0), H(x(t_0))) = 0.

Further, since H(x(t_0)) is convex and compact, it follows from equation (10) that z(t_0) ∈ H(x(t_0)). On the measure zero subset of [0, T] where (9) does not hold, the value of z(·) can be modified to ensure that z(t) ∈ H(x(t)) for all t ∈ [0, T].

It is left to prove the claim that was made earlier. We first show that given any ǫ > 0 there exists N such that n ≥ N implies ‖x̂(t(m)) − x̂(t(m+1))‖ < ǫ, where T_n ≤ t(m) < t(m+1) < T_{n+1}. We know that

x̂(t(m+1)) = x̂(t(m)) + a(m)(h_{r(n)}(x̂(t(m)), y(t(m))) + M̂_{m+1}).

Hence,

‖x̂(t(m)) − x̂(t(m+1))‖ ≤ a(m)‖h_{r(n)}(x̂(t(m)), y(t(m)))‖ + ‖ζ̂_{m+1} − ζ̂_m‖.

From (4), the above inequality becomes

‖x̂(t(m)) − x̂(t(m+1))‖ ≤ a(m)K(1 + ‖x̂(t(m))‖) + ‖ζ̂_{m+1} − ζ̂_m‖.

It follows from Lemmas 3 & 4 that a(m)K(1 + ‖x̂(t(m))‖) → 0 and ‖ζ̂_{m+1} − ζ̂_m‖ → 0, respectively, in the ‘almost sure’ sense. In other words, there exists N (possibly sample path dependent) such that the claim holds. The second part of the claim concerns the situation when t(m+1) = T_{n+1}; the proof follows in a similar manner.

Theorem 1 (The Stability Theorem). Under assumptions (A1)−(A3), (S1) & (S2), sup_{n≥0} ‖x_n‖ < ∞ a.s.

Proof. Define B := {sup_{t≥0} ‖x̂(t)‖ < ∞} ∩ {ζ̂_n converges}. It is enough to show that sup_{n≥0} ‖x_n‖ < ∞ on B. Let us assume the contrary, i.e., sup_{n≥0} ‖x_n‖ = ∞ on D ⊆ B such that P(D) > 0. Fix ω ∈ D and choose N such that the following hold:

1. For all m(l) ≥ N, sup_{t∈[0,T]} ‖x̂(T_l + t) − x^l(t)‖ < δ_3 − δ_2. This is possible since ‖x̂(T_l + ·) − x^l(·)‖ → 0 on B (Lemma 5). Recall that {m(n)} ⊆ N is such that t(m(n)) = T_n for all n ≥ 0.

2. For all m(l) ≥ N, ‖x̂(T_{l+1}⁻)‖ < δ_4. This is possible since ‖x̂(T_l + T) − x̂(T_{l+1}⁻)‖ → 0 as l → ∞ and ‖x̂(T_l + T)‖ < δ_3 for large m(l).

3. For all m(l) ≥ N, r(l) > 1.

We have

‖x(T_{l+1})‖/r(l) over ‖x(T_l)‖/r(l), i.e., ‖x(T_{l+1})‖/‖x(T_l)‖ = ‖x̂(T_{l+1}⁻)‖/‖x̂(T_l)‖.  (11)

For m(l) > N, we have ‖x̂(T_{l+1}⁻)‖ < δ_4 and ‖x̂(T_l)‖ = 1. Hence it follows from (11) that

‖x(T_{l+1})‖/‖x(T_l)‖ < δ_4 (< 1).  (12)

Let us closely analyze the implication of (12). Since ‖x(T_l)‖ > 1, it follows that ‖x(T_{l+1})‖ < δ_4‖x(T_l)‖; further, if ‖x(T_{l+1})‖ ≥ 1, we have ‖x(T_{l+2})‖ < δ_4‖x(T_{l+1})‖ < δ_4²‖x(T_l)‖. We see that the trajectory has a tendency to fall into the unit ball at an exponential rate (from the outside). Let t_0 < T_l be the last time that the trajectory ‘jumps’ from inside the unit ball to the outside; such a time exists since the trajectory falls exponentially into the unit ball while ‖x(T_l)‖ > 1, so this jump is of length at least ‖x(T_l)‖ − 1. Since r(l) ↑ ∞, the trajectory is forced to make larger and larger ‘last jumps’ from inside the unit ball to the outside, with the lengths of these jumps (≥ r(l) − 1) ‘running off’ to infinity. Further, these jumps are made within time T + 1. Using Gronwall's inequality we however get a contradiction.

4 Convergence Theorem

We begin this section by presenting the additional assumptions imposed on recursion (3). These assumptions are coupled with those made in Section 2.2 to prove that the iterates are stable and converge to an internally chain transitive invariant set associated with a DI that is defined in terms of the ergodic occupation measures associated with the ‘Markov process’. The additional assumptions made are similar to those in Borkar [6]. We list them below.

(B1) {y_n}_{n≥0} is an S-valued Markov process with two associated control processes: {x_n}_{n≥0} and another random process {z_n}_{n≥0} taking values in a compact metric space U. Thus

P(y_{n+1} ∈ A | y_m, z_m, x_m, m ≤ n) = ∫_A p(dy | y_n, z_n, x_n), n ≥ 0,

for A Borel in S. The map (y, z, x) ∈ S × U × R^d → p(dw | y, z, x) ∈ P(S) is continuous; further, it is uniformly continuous on compacts in the x variable with respect to the other variables. Here P(S) denotes the space of probability measures on S.

Let ϕ : S → P(U), written y ↦ ϕ(y, dz), be a measurable map. Suppose the Markov process has a (possibly non-unique) invariant probability measure η_{x,ϕ}(dy) ∈ P(S); we can then define the corresponding ergodic occupation measure

Ψ_{x,ϕ}(dy, dz) := η_{x,ϕ}(dy) ϕ(y, dz) ∈ P(S × U).  (13)

Let D(x) be the set of all such ergodic occupation measures for a prescribed x. It can be shown that D(x) is closed and convex for each x ∈ R^d. Further, the map x ↦ D(x) is upper-semicontinuous. For a proof of the aforementioned results the reader is referred to Chapter 6.2 of [7].

(B2) D(x) is compact.

Let us define a P(S × U)-valued random process µ(t) = µ(t, dy dz), t ≥ 0, by µ(t) := δ_{y_n, z_n} for t ∈ [t(n), t(n+1)), n ≥ 0. For t > s ≥ 0, define µ_s^t ∈ P(S × U × [s, t]) by µ_s^t(A × B) := (1/(t−s)) ∫_B µ(τ, A) dτ for A, B Borel in S × U and [s, t], respectively.

(B3) Almost surely, for t > 0, the set {µ_s^{s+t}, s ≥ 0} remains tight.

Define h̃(x, ν) := ∫ h(x, y) ν(dy, U) for ν ∈ P(S × U). We use this to define the following DI:

ẋ(t) ∈ ĥ(x(t)), where ĥ(x) := {h̃(x, ν) | ν ∈ D(x)}.  (14)
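For intuition, in the special case where S and U are finite (a sketch of ours, not the general Polish-space setting of [6]), η_{x,ϕ} is a stationary distribution of the ϕ-controlled chain and the ergodic occupation measure (13) is simply Ψ(y, z) = η(y)ϕ(y, z). A minimal computation with a hypothetical two-state kernel:

```python
def stationary(P, iters=10000):
    """Power iteration for a stationary row vector of a 2x2 stochastic matrix P."""
    pi = [0.5, 0.5]
    for _ in range(iters):
        pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
              pi[0] * P[0][1] + pi[1] * P[1][1]]
    return pi

# Hypothetical controlled kernel p(.|y, z) on S = {0, 1}, U = {0, 1}, for a fixed x:
# under the stationary randomized policy phi, the effective kernel is
# P[y][y'] = sum_z phi(y, z) * p(y' | y, z).
p = {0: [[0.9, 0.1], [0.2, 0.8]],   # p(.|y, z=0), rows indexed by y
     1: [[0.5, 0.5], [0.5, 0.5]]}   # p(.|y, z=1)
phi = [[0.5, 0.5], [1.0, 0.0]]      # phi(y, z): row y is a distribution over z

P = [[sum(phi[y][z] * p[z][y][yp] for z in (0, 1)) for yp in (0, 1)] for y in (0, 1)]
eta = stationary(P)
# Ergodic occupation measure (13): Psi(y, z) = eta(y) * phi(y, z)
Psi = [[eta[y] * phi[y][z] for z in (0, 1)] for y in (0, 1)]
print(Psi)
print(sum(map(sum, Psi)))  # Psi is a probability measure on S x U
```

In the general setting D(x) collects all such Ψ over all stationary randomized policies ϕ and invariant measures η_{x,ϕ}, and (14) averages h against each of them.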

Theorem 2 (Stability & Convergence). Under assumptions (A1)−(A3), (S1), (S2) and (B1)−(B3), almost surely the iterates given by (3) are stable and converge to an internally chain transitive invariant set associated with ẋ(t) ∈ ĥ(x(t)).

Proof. Under assumptions (A1)−(A3), (S1) and (S2), the stability of the iterates follows from Theorem 1. Now, we invoke Theorem 3.1 of Borkar [6] to conclude that the iterates converge to an internally chain transitive invariant set associated with ẋ(t) ∈ ĥ(x(t)).

5 Application to temporal difference learning

Temporal difference (TD) learning is an important prediction method which combines ideas from Monte Carlo methods and dynamic programming. It has mostly been used to solve problems from reinforcement learning. There are several variants of TD algorithms. Consider the general form of a TD algorithm with ‘controlled Markov’ noise:

x_{n+1} = x_n + a(n)(h(x_n, y_n) + M_{n+1}),  (15)

where

(i) h : R^d × S → R^d is of the form h(x, y) = A(y)x + b(y). Here A : S → R^{d×d} is a matrix-valued function and b : S → R^d is a vector-valued function.
(ii) {a(n)}_{n≥0} is the given step-size sequence such that Σ_{n=0}^{∞} a(n) = ∞ and Σ_{n=0}^{∞} a(n)² < ∞. Recall that this is assumption (A3).
(iii) {M_n}_{n≥1} is the sequence of square integrable martingale difference terms such that E[‖M_{n+1}‖² | F_n] ≤ K(1 + ‖x_n‖²), n ≥ 0. Recall that this is assumption (A2).
(iv) {y_n}_{n≥0} is an S-valued ‘controlled Markov’ process. We assume that S is a compact metric space.
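Before imposing conditions on A and b, here is a toy instance of (15) of our own making: d = 1, a two-state uncontrolled chain {y_n}, A(y) < 0 so that the averaged dynamics are stable, and bounded martingale noise. The particular A, b and transition probabilities are illustrative assumptions, not taken from any specific TD variant:

```python
import random

def td_like_recursion(num_steps, seed=1):
    """Simulate x_{n+1} = x_n + a(n) * (A(y_n) x_n + b(y_n) + M_{n+1})."""
    rng = random.Random(seed)
    A = {0: -1.0, 1: -2.0}   # A(y): negative, so the averaged o.d.e. is stable
    b = {0: 1.0, 1: 2.0}     # b(y): bounded offsets
    x, y = 0.0, 0
    for n in range(num_steps):
        a_n = 1.0 / (n + 1)
        m_next = rng.uniform(-0.1, 0.1)
        x = x + a_n * (A[y] * x + b[y] + m_next)
        # y_n: a Markov chain that flips state with probability 0.3
        if rng.random() < 0.3:
            y = 1 - y
    return x

x_final = td_like_recursion(100000)
print(x_final)  # stays bounded, near the equilibrium of the averaged dynamics
```

Averaged over the chain's stationary distribution (here uniform over the two states), the mean dynamics are ẋ = −1.5x + 1.5, with equilibrium x* = 1; the simulated iterate hovers near this value, which is the kind of ‘stability and convergence’ behavior the conditions below guarantee.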

For a detailed exposition on TD algorithms the reader is referred to Tsitsiklis and Van Roy [10]. In this section, we impose conditions on A and b that guarantee the 'stability and convergence' of the iterates given by (15).

Remark: It is important to note that our TD algorithm (15) is more general than the regular TD update with function approximation, as in (say) Tsitsiklis and Van Roy [10]. In particular, the regular TD update with function approximation can be written (see [10]) in the form (15). Note also that, unlike the usual analyses of TD, we do not assume that the Markov process {y_n} is (a) finite state and (b) ergodic under the given stationary policy.

We state the first of the two assumptions below.

(T1) A : S → R^{d×d} and b : S → R^d are continuous maps.

We show that (15) satisfies (A1)–(A3) and (S1) if it satisfies (T1). Since A and b are continuous maps, it follows that h is jointly continuous. Since A is continuous, the range of A, A(S) ⊂ R^{d×d}, is compact. Define L := sup_{M ∈ A(S)} ‖M‖ (< ∞). We have the following:

‖h(x_1, y) − h(x_2, y)‖ ≤ ‖A(y)‖ · ‖x_1 − x_2‖ ≤ L‖x_1 − x_2‖.

Hence h is Lipschitz continuous in the first component, with Lipschitz constant L; further, this constant does not change with the second component, so (A1) is satisfied. Assumptions (A2) and (A3) are trivially satisfied. We have h_c(x, y) = {A(y)x + b(y)/c} and h_∞(x, y) = {A(y)x}.

Now we show that (S1) is satisfied. Let c_n ↑ ∞, y_n → y and lim_{n→∞} h_{c_n}(x, y_n) = u. We need to show that u ∈ h_∞(x, y). Since A is continuous, lim_{n→∞} A(y_n)x = A(y)x. Since b is a bounded function, lim_{n→∞} b(y_n)/c_n = 0. Hence we get u = lim_{n→∞} h_{c_n}(x, y_n) = A(y)x ∈ h_∞(x, y).

Before we state our second assumption we present an auxiliary result.

Lemma 7. Let H : R^d → {subsets of R^d} be a Marchaud map. Let A be an associated attracting set that is also Lyapunov stable. Let B be a compact subset of the basin of attraction of A. Then for all ε > 0 there exists T(ε) such that Φ_{[T(ε),+∞)}(B) ⊆ N^ε(A).

Proof. Since A is Lyapunov stable, corresponding to N^ε(A) there exists N^δ(A) such that Φ_{[0,+∞)}(N^δ(A)) ⊆ N^ε(A). Fix x_0 ∈ B. Since B is contained in the basin of attraction of A, there exists t(x_0) > 0 such that Φ_{t(x_0)}(x_0) ⊆ N^{δ/4}(A). Further, from the upper semi-continuity of the flow it follows that, for some δ(x_0) > 0 and all x ∈ N^{δ(x_0)}(x_0), Φ_{t(x_0)}(x) ⊆ N^{δ/4}(Φ_{t(x_0)}(x_0)); see Chapter 2 of Aubin and Cellina [1]. Hence Φ_{t(x_0)}(x) ⊆ N^δ(A) for all x ∈ N^{δ(x_0)}(x_0). Since A is Lyapunov stable, we get Φ_{(t(x_0),+∞)}(x) ⊆ N^ε(A). In this manner we obtain t(x) and δ(x) for each x ∈ B; the collection {N^{δ(x)}(x) : x ∈ B} is an open cover of B. Let {N^{δ(x_i)}(x_i) | 1 ≤ i ≤ m} be a finite sub-cover. If we define T(ε) := max{t(x_i) | 1 ≤ i ≤ m}, then Φ_{[T(ε),+∞)}(B) ⊆ N^ε(A).
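The scaling limit h_{c_n}(x, y_n) → h_∞(x, y) used in the verification of (S1) above can also be checked numerically. The specific maps A and b, the limit point y, and the coupling y_n = y + 1/c_n below are all hypothetical choices made purely for illustration.

```python
import numpy as np

# Hypothetical continuous maps A and b on a compact subset of R.
def A(y):
    return np.array([[-1.0, y], [0.0, -2.0]])

def b(y):
    return np.array([np.sin(y), np.cos(y)])

def h_c(x, y, c):
    # h_c(x, y) = h(cx, y)/c = A(y)x + b(y)/c
    return A(y) @ x + b(y) / c

x = np.array([1.0, -1.0])
y = 0.5                                  # the limit point of the sequence {y_n}
for c in [1e1, 1e3, 1e5]:
    y_n = y + 1.0 / c                    # y_n -> y as c = c_n grows
    gap = np.linalg.norm(h_c(x, y_n, c) - A(y) @ x)
    print(c, gap)                        # the gap shrinks as c grows
```

The printed gap decays roughly like 1/c: the A(y_n)x term converges to A(y)x by continuity of A, while the bounded term b(y_n)/c is crushed by the scaling, exactly as in the argument above.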

We have H(x) = co({A(y)x | y ∈ S}). It follows from Lemma 2 that H is a Marchaud map. We state our second assumption below.

(T2) Let ε > 0 and V : B̄_{1+ε}(0) → [0, ∞). Let Λ be a compact subset of B_1(0); clearly sup_{u∈Λ} ‖u‖ < 1. Let the following hold:

(i) For all t ≥ 0, Φ_t(B_{1+ε}(0)) ⊆ B_{1+ε}(0), where Φ_t(·) is a solution to the DI ẋ(t) ∈ H(x(t)).

(ii) V^{−1}(0) = Λ.

(iii) V is continuous, and for all x ∈ B̄_{1+ε}(0) \ Λ, y ∈ Φ_t(x) and t > 0, we have V(y) < V(x).

Proposition 3.25 of Benaïm et al. [4]: Under (T2), Λ is a Lyapunov stable attracting set; further, there exists an attractor A contained in Λ whose basin contains B_{1+ε}(0).

Since A ⊂ Λ and B_{1+ε}(0) is contained in the basin of attraction of A, it follows that B_{1+ε}(0) is contained in the basin of attraction of Λ. By Lemma 7, B̄_1(0) is contained in some fundamental neighborhood of Λ. Further, sup_{z∈Λ} ‖z‖ < 1. Hence (S2) is satisfied. Note that the attracting set associated with ẋ(t) ∈ co({A(y)x(t) | y ∈ S}) in (S2) is Λ.

Theorem 3. Under assumptions (A1)–(A3), (T1), (T2) and (B1)–(B3), almost surely the iterates given by (15) are stable and converge to an internally chain transitive invariant set associated with ẋ(t) ∈ ĥ(x(t)).

Proof. We have shown that assumptions (A1)–(A3), (S1) and (S2) are satisfied by (15). It follows from Theorem 1 that the iterates are stable. Further, we have assumed that (B1)–(B3) are satisfied by (15). It follows from Theorem 2 that the iterates converge to an internally chain transitive invariant set associated with ẋ(t) ∈ ĥ(x(t)).

Let us consider the special case when A is a constant map, i.e., A(y) = M for all y ∈ S. We then get the following recursion:

x_{n+1} = x_n + a(n) [M x_n + b(y_n) + M_{n+1}],    (16)

where M ∈ R^{d×d} and b : S → R^d is a continuous map. Hence (16) satisfies assumption (T1). As explained before, (16) also satisfies (A1)–(A3) and (S1). It follows from the definitions of {h_c}_{c≥1} and H that h_c(x, y) = {Mx + b(y)/c} and h_∞(x, y) = {Mx}; H(x) = co(∪_{y∈S} h_∞(x, y)) = {Mx}.

Note that the DI ẋ(t) ∈ H(x(t)) is really the o.d.e. ẋ(t) = Mx(t) here.


Let us assume that all eigenvalues of M have strictly negative real parts. Then the origin is a globally asymptotically stable equilibrium point (a globally attracting set that is also Lyapunov stable) associated with ẋ(t) = Mx(t); see 11.2.3 of Borkar [7]. Now we show that recursion (16) satisfies assumption (T2). Solving ẋ(t) = Mx(t), we get Φ_t(x(0)) = e^{Mt} x(0) for t ≥ 0. For t > 0 and x(0) ∈ R^d \ {0}, we have ‖Φ_t(x(0))‖ < ‖x(0)‖ since all the eigenvalues of M have strictly negative real parts. Let us define the following:

1. ε := 1.

2. V : B̄_2(0) → [0, ∞) as V(x) := ‖x‖.

3. Λ := {0} (the origin).

As explained earlier, for t ≥ 0 we have ‖Φ_t(x)‖ ≤ ‖x‖; hence Φ_t(B_2(0)) ⊆ B_2(0), so (T2)(i) holds. Recall that V(x) = ‖x‖ for all x ∈ B̄_2(0). It follows from the definition of V that V^{−1}(0) = Λ, so (T2)(ii) holds. Fix x_0 ∈ B̄_2(0) \ {0} and t > 0; we have ‖Φ_t(x_0)‖ < ‖x_0‖, hence V(Φ_t(x_0)) < V(x_0), so (T2)(iii) holds. Since the recursion given by (16) satisfies (T1) and (T2), it follows from Theorem 3 that the iterates are 'stable and convergent'.
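A short simulation illustrates this special case. The matrix M below (eigenvalues −1 and −2), the step sizes, and the noise scale are hypothetical choices, and b is taken to be identically zero so that the limiting o.d.e. ẋ(t) = Mx(t) drives the iterates toward the origin.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical M whose eigenvalues (-1 and -2) have strictly negative real parts.
M = np.array([[-1.0, 0.5],
              [0.0, -2.0]])

def b(y):
    # Bounded continuous b, taken to be identically 0 here, so the
    # limiting o.d.e. has the origin as its global attractor.
    return np.zeros(2)

x = np.array([5.0, -3.0])
for n in range(20000):
    a_n = 1.0 / (n + 1)                      # step sizes satisfying (A3)
    y_n = rng.uniform(-1.0, 1.0)             # stand-in for the Markov state
    noise = 0.01 * rng.standard_normal(2)    # martingale difference term
    x = x + a_n * (M @ x + b(y_n) + noise)   # recursion (16)

print(np.linalg.norm(x))
```

Despite the persistent noise, the decaying step sizes average it out and the iterates approach the equilibrium at the origin, as Theorem 3 predicts for this recursion.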

6 Conclusions

We presented in this paper general sufficient conditions for the stability and convergence of stochastic approximation algorithms with 'controlled Markov' noise. To the best of our knowledge, this is the first time that sufficient conditions for the stability of stochastic approximations with 'controlled Markov' noise have been provided. We further studied an application of our results to a temporal difference learning algorithm and showed that the algorithm is stable and asymptotically convergent under weaker requirements than those found in other analyses in the literature. An interesting future direction would be to extend this analysis to the case of multi-timescale stochastic approximations, which would encompass actor-critic algorithms, another important class of algorithms in reinforcement learning.

References

[1] J. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory. Springer, 1984.

[2] J. Aubin and H. Frankowska. Set-Valued Analysis. Birkhäuser, 1990.

[3] M. Benaïm. A dynamical system approach to stochastic approximations. SIAM Journal on Control and Optimization, 34(2):437–472, 1996.

[4] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.

[5] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.

[6] V. S. Borkar. Stochastic approximation with 'controlled Markov' noise. Systems & Control Letters, 55(2):139–145, 2006.

[7] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

[8] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 1999.

[9] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 1977.

[10] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.