Rate of Convergence and Error Bounds for LSTD(λ)

Manel Tagorti and Bruno Scherrer
INRIA Nancy Grand Est, Team MAIA
[email protected], [email protected]

arXiv:1405.3229v1 [cs.LG] 13 May 2014

May 14, 2014

Abstract

We consider LSTD(λ), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). It computes a linear approximation of the value function of a fixed policy in a large Markov Decision Process. Under a β-mixing assumption, we derive, for any value of λ ∈ (0, 1), a high-probability estimate of the rate of convergence of this algorithm to its limit. We deduce a high-probability bound on the error of this algorithm that extends (and slightly improves) the one derived by Lazaric et al. (2012) in the specific case where λ = 0. In particular, our analysis sheds some light on the choice of λ with respect to the quality of the chosen linear space and the number of samples, and complies with simulations.

1  Introduction

In a large Markov Decision Process context, we consider LSTD(λ), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). It is a popular algorithm for estimating a projection onto a linear space of the value function of a fixed policy. Such a value estimation procedure can for instance be useful in a policy iteration context to eventually estimate an approximately optimal controller (Bertsekas and Tsitsiklis, 1996; Szepesvári, 2010). The asymptotic almost sure convergence of LSTD(λ) was proved by Nedic and Bertsekas (2002). Under a β-mixing assumption, and given a finite number of samples n, Lazaric et al. (2012) derived a high-probability error bound with a Õ(1/√n) rate¹ in the restricted situation where λ = 0. To our knowledge, however, similar finite-sample error bounds are not known in the literature for λ > 0. The main goal of this paper is to fill this gap. This is all the more important since it is known that the parameter λ allows one to control the quality of the asymptotic solution: by moving λ from 0 to 1, one can continuously move from an oblique projection of the value function (Scherrer, 2010) to its orthogonal projection, and consequently improve the corresponding guarantee (Tsitsiklis and Roy, 1997) (restated in Theorem 2, Section 3).

The paper is organized as follows. Section 2 starts by describing the LSTD(λ) algorithm and the necessary background. Section 3 then contains our main result (Theorem 1): for all λ ∈ (0, 1), we will show that LSTD(λ) converges to its limit at a rate Õ(1/√n). We shall then deduce a global error bound (Corollary 1) that sheds some light on the role of the parameter λ, and discuss some of its interesting practical consequences. Section 4 then provides a detailed proof of our claims. Finally, Section 5 concludes and describes potential future work.

2  LSTD(λ) and Related background

We consider a Markov chain M taking its values on a finite or countable state space² X, with transition kernel P. We assume M ergodic³; consequently, it admits a unique stationary distribution µ. For any K ∈ R, we denote by B(X, K) the set of measurable functions defined on X and bounded by K. We consider a reward function r ∈ B(X, Rmax) for some Rmax ∈ R, that provides the quality of being in some state. The value function v related to the Markov chain M is defined, for any state i, as the average discounted sum of rewards along infinitely long trajectories starting from i:

∀i ∈ X,  v(i) = E[ Σ_{j=0}^{∞} γ^j r(X_j) | X_0 = i ],

where γ ∈ (0, 1) is a discount factor. It is well known that the value function v is the unique fixed point of the linear Bellman operator T:

∀i ∈ X,  T v(i) = r(i) + γ E[v(X_1) | X_0 = i].

It can easily be seen that v ∈ B(X, Vmax) with Vmax = Rmax/(1 − γ).

When the size |X| of the state space is very large, one may consider approximating v by using a linear architecture. Given some d ≪ |X|, we consider a feature matrix Φ of dimension |X| × d. For any x ∈ X, φ(x) = (φ_1(x), ..., φ_d(x))^T is the feature vector in state x. For any j ∈ {1, ..., d}, we assume that the feature function φ_j : X → R belongs to B(X, L) for some finite L. Throughout the paper, and without loss of generality⁴, we will make the following assumption.

¹ Throughout the paper, we shall write f(n) = Õ(g(n)) as a shorthand for f(n) = O(g(n) log^k g(n)) for some k ≥ 0.
² We restrict our focus to finite/countable state spaces mainly because it eases the presentation of our analysis. Though this requires some extra work, we believe the analysis we make here can be extended to more general state spaces.
³ In our countable state space situation, ergodicity holds if and only if the chain is aperiodic and irreducible, that is, formally, if and only if: ∀(x, y) ∈ X², ∃n_0, ∀n ≥ n_0, P^n(x, y) > 0.
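As a concrete illustration of the two displays above, here is a minimal sketch (NumPy assumed; the 3-state chain and rewards are hypothetical) that computes v by solving the linear system (I − γP)v = r and checks the fixed-point property and the bound Vmax = Rmax/(1 − γ).

```python
import numpy as np

# Hypothetical 3-state ergodic chain: P is the transition kernel, r the reward.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# v is the unique fixed point of T v = r + gamma * P v, i.e. v = (I - gamma P)^{-1} r.
v = np.linalg.solve(np.eye(3) - gamma * P, r)

# Fixed-point check and the bound |v(i)| <= R_max / (1 - gamma).
assert np.allclose(r + gamma * P @ v, v)
assert np.all(np.abs(v) <= np.max(np.abs(r)) / (1 - gamma))
```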

Assumption 1. The feature functions (φ_j)_{j∈{1,...,d}} are linearly independent.

Let S be the subspace generated by the vectors (φ_j)_{1≤j≤d}. We consider the orthogonal projection Π onto S with respect to the µ-weighted quadratic norm

‖f‖_µ = √( Σ_{x∈X} |f(x)|² µ(x) ).

It is well known that this projection has the following closed form:

Π = Φ (Φ^T D_µ Φ)^{-1} Φ^T D_µ,    (1)

where D_µ is the diagonal matrix with the elements of µ on the diagonal. The goal of LSTD(λ) is to estimate a solution of the equation v = Π T^λ v, where the operator T^λ is defined as a weighted arithmetic mean of the applications of the powers T^i of the Bellman operator T for all i ≥ 1:

∀λ ∈ (0, 1), ∀v,  T^λ v = (1 − λ) Σ_{i=0}^{∞} λ^i T^{i+1} v.    (2)
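The closed form (1) is easy to check numerically. The sketch below (hypothetical stationary distribution µ and feature matrix Φ, NumPy assumed) builds Π and verifies that it is idempotent and non-expansive in ‖·‖_µ, two facts used repeatedly in the sequel.

```python
import numpy as np

mu = np.array([0.3, 0.4, 0.3])          # a stationary distribution (assumed given here)
D_mu = np.diag(mu)
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])            # |X| x d feature matrix with d = 2, made up for illustration

# Closed form (1): orthogonal projection onto span(Phi) w.r.t. the mu-weighted norm.
Pi = Phi @ np.linalg.inv(Phi.T @ D_mu @ Phi) @ Phi.T @ D_mu

mu_norm = lambda f: np.sqrt(np.sum(mu * f ** 2))
f = np.random.randn(3)
assert np.allclose(Pi @ Pi, Pi)                      # Pi is a projection
assert mu_norm(Pi @ f) <= mu_norm(f) + 1e-12         # non-expansive w.r.t. ||.||_mu
```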

Note in particular that when λ = 0, one has T^λ = T. By using the facts that T^i is affine and ‖P‖_µ = 1 (Tsitsiklis and Roy, 1997; Nedic and Bertsekas, 2002), it can be seen that the operator T^λ is a contraction mapping of modulus (1 − λ)γ/(1 − λγ) ≤ γ; indeed, for any vectors u, v:

‖T^λ u − T^λ v‖_µ ≤ (1 − λ) ‖ Σ_{i=0}^{∞} λ^i (T^{i+1} u − T^{i+1} v) ‖_µ
 = (1 − λ) ‖ Σ_{i=0}^{∞} λ^i (γ^{i+1} P^{i+1} u − γ^{i+1} P^{i+1} v) ‖_µ
 ≤ (1 − λ) Σ_{i=0}^{∞} λ^i γ^{i+1} ‖u − v‖_µ
 = ((1 − λ)γ/(1 − λγ)) ‖u − v‖_µ.

⁴ This assumption is not fundamental: in theory, we can remove any set of features that makes the family linearly dependent; in practice, the algorithm we are going to describe can use the pseudo-inverse instead of the inverse.
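To put a number on this modulus: with γ = 0.95 and λ = 0.9, one gets (1 − λ)γ/(1 − λγ) = 0.095/0.145 ≈ 0.66, noticeably smaller than γ = 0.95. The modulus decreases to 0 as λ → 1 and equals γ when λ = 0.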


Since the orthogonal projector Π is non-expansive with respect to µ (Tsitsiklis and Roy, 1997), the operator Π T^λ is contracting and thus the equation v = Π T^λ v has one and only one solution, which we shall denote v_LSTD(λ) since it is what the LSTD(λ) algorithm converges to (Nedic and Bertsekas, 2002). As v_LSTD(λ) belongs to the subspace S, there exists a θ ∈ R^d such that

v_LSTD(λ) = Φθ = Π T^λ Φθ.

If we replace Π and T^λ with their expressions (Equations 1 and 2), it can be seen that θ is a solution of the equation Aθ = b (Nedic and Bertsekas, 2002), such that for any i,

A = Φ^T D_µ (I − γP)(I − λγP)^{-1} Φ = E_{X_{−∞}∼µ}[ Σ_{k=−∞}^{i} (γλ)^{i−k} φ(X_k) (φ(X_i) − γφ(X_{i+1}))^T ]    (3)

and

b = Φ^T D_µ (I − γλP)^{-1} r = E_{X_{−∞}∼µ}[ Σ_{k=−∞}^{i} (γλ)^{i−k} φ(X_k) r(X_i) ],    (4)

where u^T is the transpose of u. Since for all x, φ(x) is of dimension d, we see that A is a d × d matrix and b is a vector of size d. Under Assumption 1, it can be shown (Nedic and Bertsekas, 2002) that the matrix A is invertible, and thus v_LSTD(λ) = Φ A^{-1} b is well defined.

The LSTD(λ) algorithm that is the focus of this article can now be precisely described. Given one trajectory X_1, ..., X_n generated by the Markov chain, the expectation-based expressions of A and b in Equations (3)-(4) suggest to compute the following estimates:

Â = (1/(n−1)) Σ_{i=1}^{n−1} z_i (φ(X_i) − γφ(X_{i+1}))^T

and

b̂ = (1/(n−1)) Σ_{i=1}^{n−1} z_i r(X_i),

where

z_i = Σ_{k=1}^{i} (λγ)^{i−k} φ(X_k)    (5)

is the so-called eligibility trace. The algorithm then returns v̂_LSTD(λ) = Φ θ̂ with⁵ θ̂ = Â^{-1} b̂, which is a (finite-sample) approximation of v_LSTD(λ). Using a variation of the law of large numbers, Nedic and Bertsekas (2002) showed that both Â and b̂ converge almost surely to A and b respectively, which implies that v̂_LSTD(λ) tends to v_LSTD(λ). The main goal of the remainder of the paper is to deepen this analysis: we shall estimate the rate of convergence of v̂_LSTD(λ) to v_LSTD(λ), and bound the approximation error ‖v̂_LSTD(λ) − v‖_µ of the overall algorithm.
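These estimates translate directly into a few lines of code. The following sketch (NumPy assumed; the trajectory, features and rewards are hypothetical inputs) builds Â and b̂ using the recursive form of the eligibility trace, z_i = λγ z_{i−1} + φ(X_i), which follows from Equation (5), and returns θ̂; a pseudo-inverse is used in place of the inverse, as allowed by footnote 4.

```python
import numpy as np

def lstd_lambda(phi, rewards, gamma, lam):
    """LSTD(lambda) estimate from one trajectory.

    phi     : array of shape (n, d), phi[i] = feature vector of the state visited at time i
    rewards : array of shape (n,),   rewards[i] = reward in that state
    Returns theta_hat such that v_hat = Phi @ theta_hat.
    """
    n, d = phi.shape
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    z = np.zeros(d)                       # eligibility trace, Equation (5)
    for i in range(n - 1):
        z = lam * gamma * z + phi[i]      # z_i = sum_k (lam*gamma)^{i-k} phi(X_k)
        A_hat += np.outer(z, phi[i] - gamma * phi[i + 1])
        b_hat += z * rewards[i]
    A_hat /= n - 1
    b_hat /= n - 1
    return np.linalg.pinv(A_hat) @ b_hat  # pseudo-inverse instead of inverse

# Hypothetical usage on a simulated trajectory of the small chain used above.
rng = np.random.default_rng(0)
P = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
r = np.array([1.0, 0.0, 2.0])
states = [0]
for _ in range(5000):
    states.append(rng.choice(3, p=P[states[-1]]))
theta_hat = lstd_lambda(Phi[states], r[np.array(states)], gamma=0.9, lam=0.5)
```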

3  Main results

This section contains our main results. Our key assumption for the analysis is that the Markov chain process that generates the states has some mixing property⁶.

Assumption 2. The process (X_n)_{n≥1} is β-mixing, in the sense that its i-th coefficient

β_i = sup_{t≥1} E[ sup_{B ∈ σ(X_{t+i}^{∞})} |P(B | σ(X_1^t)) − P(B)| ]

tends to 0 when i tends to infinity, where X_l^j = {X_l, ..., X_j} for j ≥ l and σ(X_l^j) is the sigma-algebra generated by X_l^j. Furthermore, (X_n)_{n≥1} mixes at an exponential decay rate with parameters β > 0, b > 0, and κ > 0, in the sense that β_i ≤ β e^{−b i^κ}.

⁵ We will see in Theorem 1 that Â is invertible with high probability for a sufficiently big n.
⁶ A stationary ergodic Markov chain is always β-mixing.


Intuitively, the β_i coefficients measure the degree of dependence of samples separated by i time steps (the smaller the coefficient, the more independence). We are now ready to state the main result of the paper, which provides a rate of convergence for LSTD(λ).

Theorem 1. Let Assumptions 1 and 2 hold and let X_1 ∼ µ. For any n ≥ 1 and δ ∈ (0, 1), define

I(n, δ) = 32 Λ(n, δ) max{ (Λ(n, δ)/b)^{1/κ}, 1 },

where Λ(n, δ) = log(8n²/δ) + log(max{4e², nβ}). Let n_0(δ) be the smallest integer such that

∀n ≥ n_0(δ),  (2dL²/((1 − γ)ν√(n−1))) ( √( (log(n−1)/log(1/(λγ)) + 1) I(n−1, δ) ) + (log(n−1)/log(1/(λγ))) + … ) …

Our statement is of the form "∀δ, ∀n ≥ n_0(δ), P(n) holds with probability 1 − δ", while theirs is of the form "∀n, ∀δ, P(n) holds with probability 1 − δ". Furthermore, under the same assumptions, the global error bound obtained by Lazaric et al. (2012), in the restricted case where λ = 0, has the following form:

‖ṽ_LSTD(0) − v‖_µ ≤ (4√2/(1 − γ)) ‖v − Πv‖_µ + Õ(1/√n),

where ṽ_LSTD(0) is the truncation (with Vmax) of the pathwise LSTD solution⁸, while we get in this analysis

‖v̂_LSTD(0) − v‖_µ ≤ (1/(1 − γ)) ‖v − Πv‖_µ + Õ(1/√n).

The term corresponding to the approximation error is a factor 4√2 better with our analysis. Moreover, contrary to what we do here, the analysis of Lazaric et al. (2012) does not imply a rate of convergence for LSTD(λ) (a bound on ‖v_LSTD(0) − v̂_LSTD(0)‖_µ). Their argument, based on a model of regression with Markov design, consists in directly bounding the global error. Our two-step argument (bounding the estimation error with respect to ‖·‖_µ, and then the approximation error with respect to ‖·‖_µ) allows us to get a tighter result.

As we have already mentioned, λ = 1 minimizes the bound on the approximation error ‖v − v_LSTD(λ)‖ (the first term in the r.h.s. in Corollary 1) while λ = 0 minimizes the bound on the estimation error ‖v_LSTD(λ) − v̂_LSTD(λ)‖ (the second term). For any n and for any δ, there hence exists a value λ* that minimizes the global error bound by making an optimal compromise between the approximation and estimation errors. Figure 1 illustrates through simulations the interplay between λ and n. The optimal value λ* depends on the process mixing parameters (b, κ and β) as well as on the quality of the policy space ‖v − Πv‖_µ, which are quantities that are usually unknown in practice. However, when the number of samples n tends to infinity, it is clear that this optimal value λ* tends to 1. The next section contains a detailed proof of Theorem 1.

⁸ See (Lazaric et al., 2012) for more details.
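Since the mixing parameters and ‖v − Πv‖_µ are unknown, a practical (if crude) way to exploit this trade-off is to scan a grid of λ values and score each on held-out transitions. The sketch below is purely illustrative and is not the procedure behind the paper's simulations; it assumes the hypothetical lstd_lambda helper from the earlier sketch and uses a held-out one-step temporal-difference residual as the (assumed) selection criterion.

```python
import numpy as np

def select_lambda(phi_train, r_train, phi_valid, r_valid, gamma, grid=None):
    """Pick lambda on a grid via a held-out TD-residual score (illustrative heuristic only).

    phi_valid / r_valid must come from a contiguous validation trajectory so that
    consecutive rows correspond to consecutive transitions.
    """
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    best_lam, best_score = None, np.inf
    for lam in grid:
        theta = lstd_lambda(phi_train, r_train, gamma, lam)   # from the previous sketch
        v = phi_valid @ theta
        td = r_valid[:-1] + gamma * v[1:] - v[:-1]            # one-step TD residuals
        score = np.mean(td ** 2)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
```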


4  Proof of Theorem 1

In this section, we develop the arguments underlying the results of the previous section. The proof is organized in two parts. In a first, preliminary part, we prove a concentration inequality for vector processes: a general result that is based on infinitely-long eligibility traces. Then, in a second part, we actually prove Theorem 1: we apply this result to the error in estimating A and b, and relate these errors to that on v_LSTD(λ).

4.1  Concentration inequality for infinitely-long trace-based estimates

One of the first difficulties in the analysis of LSTD(λ) is that the variables A_i = z_i(φ(X_i) − γφ(X_{i+1}))^T (respectively b_i = z_i r(X_i)) are not independent. Thus standard concentration results (like Lemma 6, described in Appendix A) for quantifying the speed at which the estimates converge to their limit cannot be used. As both terms Â and b̂ have the same structure, we will consider here a matrix of the following general form:

Ĝ = (1/(n−1)) Σ_{i=1}^{n−1} G_i    (7)
with G_i = z_i (τ(X_i, X_{i+1}))^T,    (8)

where z_i, defined in Equation (5), satisfies z_i = Σ_{k=1}^{i} (λγ)^{i−k} φ(X_k), and τ : X² → R^k is such that, for 1 ≤ i ≤ k, τ_i belongs to B(X², L')⁹ for some finite L'.

The variables G_i are computed from one single trajectory; they are therefore significantly dependent. Nevertheless, with the mixing assumption (Assumption 2), we can overcome this difficulty by using a blocking technique due to Yu (1994). This technique leads us back to the independent case. However, the transition from the mixing case to the independent one requires stationarity (Lemma 5), while G_i, as a σ(X^{i+1})-measurable function of the non-stationary vector (X_1, ..., X_{i+1}), does not define a stationary process. In order to satisfy the stationarity condition we will approximate G_i by its truncated stationary version G_i^m. This is possible if we approximate z_i by its m-truncated version:

z_i^m = Σ_{k=max(i−m+1,1)}^{i} (λγ)^{i−k} φ(X_k).

Since the function φ is bounded by some constant L and the influence of old events is controlled by some power of λγ < 1, it is easy to check that ‖z_i − z_i^m‖_∞ ≤ (L/(1−λγ)) (λγ)^m. If we choose m such that m > log(n−1)/log(1/(λγ)), we obtain ‖z_i − z_i^m‖_2 = O(1/n). Therefore it seems reasonable to approximate Ĝ with the process Ĝ^m satisfying

Ĝ^m = (1/(n−1)) Σ_{i=1}^{n−1} G_i^m,    (9)
with G_i^m = z_i^m (τ(X_i, X_{i+1}))^T.    (10)
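A quick numerical sanity check of this truncation argument (hypothetical bounded features, NumPy assumed): with m of order log(n−1)/log(1/(λγ)), the truncated trace z_i^m stays within L(λγ)^m/(1−λγ) = O(1/n) of z_i.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma, lam = 1000, 4, 0.9, 0.8
phi = rng.uniform(-1.0, 1.0, size=(n, d))        # |phi_j| <= L = 1
m = int(np.ceil(np.log(n - 1) / np.log(1.0 / (lam * gamma))))

i = n - 1
ks = np.arange(1, i + 1)
weights = (lam * gamma) ** (i - ks)
z_i = (weights[:, None] * phi[ks - 1]).sum(axis=0)              # full trace z_i
z_i_m = (weights[-m:, None] * phi[ks[-m:] - 1]).sum(axis=0)     # m-truncated trace z_i^m

L = 1.0
bound = L * (lam * gamma) ** m / (1 - lam * gamma)              # geometric tail bound
assert np.max(np.abs(z_i - z_i_m)) <= bound + 1e-12
```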

For all i ≥ m, G_i^m is a σ(X^{m+1})-measurable function of the stationary vector Z_i = (X_{i−m+1}, X_{i−m+2}, ..., X_{i+1}). So we can apply the blocking technique of Yu (1994) to G_i^m, but before doing so we have to check whether G_i^m indeed defines a β-mixing process. It can be shown (Yu, 1994) that any measurable function f of a β-mixing process is a β^f-mixing process with β^f ≤ β, so we only have to prove that the process Z_i is β-mixing. For that we need to relate its β coefficients to those of (X_i)_{i≥1}, on which Assumption 2 is made. This is the purpose of the following lemma.

Lemma 1. Let (X_n)_{n≥1} be a β-mixing process. Then (Z_n)_{n≥1} = (X_{n−m+1}, X_{n−m+2}, ..., X_{n+1})_{n≥1} is a β-mixing process such that its i-th β-mixing coefficient β_i^Z satisfies β_i^Z ≤ β_{i−m}^X.

⁹ We denote X^i = X × X × ... × X (i times) for i ≥ 1.


Proof. Let Γ = σ(Z_1, ..., Z_t). By definition we have Γ = σ(Z_j^{-1}(B) : j ∈ {1, ..., t}, B ∈ σ(X^{m+1})). For all j ∈ {1, ..., t} we have Z_j^{-1}(B) = {ω ∈ Ω, Z_j(ω) ∈ B}. For B = B_0 × ... × B_m, we observe that Z_j^{-1}(B) = {ω ∈ Ω, X_j(ω) ∈ B_0, ..., X_{j+m}(ω) ∈ B_m}. Then we have

Γ = σ(X_j^{-1}(B) : j ∈ {1, ..., t + m}, B ∈ σ(X)) = σ(X_1, ..., X_{t+m}).

Similarly we can prove that σ(Z_{t+i}^{∞}) = σ(X_{t+i}^{∞}). Then, letting β_i^X be the i-th β-mixing coefficient of the process (X_n)_{n≥1}, we have

β_i^X = sup_{t≥1} E[ sup_{B ∈ σ(X_{t+i}^{∞})} |P(B | σ(X_1, ..., X_t)) − P(B)| ].

Similarly, for the process (Z_n)_{n≥1} we can see that

β_i^Z = sup_{t≥1} E[ sup_{B ∈ σ(Z_{t+i}^{∞})} |P(B | σ(Z_1, ..., Z_t)) − P(B)| ].

By applying what we developed above, we obtain

β_i^Z = sup_{t≥1} E[ sup_{B ∈ σ(X_{t+i}^{∞})} |P(B | σ(X_1, ..., X_{t+m})) − P(B)| ].

Denote t' = t + m; then for i > m we have

β_i^Z = sup_{t'≥m+1} E[ sup_{B ∈ σ(X_{t'+i−m}^{∞})} |P(B | σ(X_1, ..., X_{t'})) − P(B)| ] ≤ β_{i−m}^X.

Let ‖·‖_F denote the Frobenius norm, satisfying, for M ∈ R^{d×k}, ‖M‖²_F = Σ_{l=1}^{d} Σ_{j=1}^{k} (M_{l,j})². We are now ready to prove the concentration inequality for the infinitely-long-trace β-mixing process Ĝ.

Lemma 2. Let Assumptions 1 and 2 hold and let X_1 ∼ µ. Define the d × k matrix G_i such that

G_i = Σ_{k=1}^{i} (λγ)^{i−k} φ(X_k) (τ(X_i, X_{i+1}))^T.    (11)

Recall that φ = (φ_1, ..., φ_d) is such that for all j, φ_j ∈ B(X, L), and that τ ∈ B(X², L'). Then for all δ in (0, 1), with probability 1 − δ,

‖ (1/(n−1)) Σ_{i=1}^{n−1} G_i − (1/(n−1)) Σ_{i=1}^{n−1} E[G_i] ‖₂ ≤ (2√(d×k) L L'/((1−λγ)√(n−1))) √( (log(n−1)/log(1/(λγ)) + 1) J(n−1, δ) ) + ε(n),

where

J(n, δ) = 32 Γ(n, δ) max{ (Γ(n, δ)/b)^{1/κ}, 1 },
Γ(n, δ) = log(2/δ) + log(max{4e², nβ}),
ε(n) = 2 √(d×k) L L' ⌈log(n−1)/log(1/(λγ))⌉ / ((n−1)(1−λγ)).

Note that, with respect to the quantities I and Λ introduced in Theorem 1, the quantities we introduce here are such that J(n, δ) = I(n, 4n²δ) and Γ(n, δ) = Λ(n, 4n²δ).

Proof. The proof amounts to showing that i) the approximation due to considering the estimate Ĝ^m with truncated traces instead of Ĝ is bounded by ε(n), and then ii) applying the blocking technique of Yu (1994) in a way somewhat similar to—but technically slightly more involved than—what Lazaric et al. (2012) did for LSTD(0). We defer the technical arguments to Appendix A for readability.

Using a very similar proof, we can derive a (simpler) general concentration inequality for β-mixing processes:

Lemma 3. Let Y = (Y_1, ..., Y_n) be random variables taking their values in the space R^d, generated from a stationary exponentially β-mixing process with parameters β, b and κ, and such that for all i, ‖Y_i − E[Y_i]‖₂ ≤ B₂ almost surely. Then for all δ > 0,

P{ ‖ (1/n) Σ_{i=1}^{n} Y_i − (1/n) Σ_{i=1}^{n} E[Y_i] ‖₂ ≤ (B₂/√n) √J(n, δ) } > 1 − δ,

where J(n, δ) is defined as in Lemma 2.

Remark 2. If the variables Y_i were independent, we would have β_i = 0 for all i, that is, we could choose β = 0 and b = ∞, so that J(n, δ) reduces to 32 log(8e²/δ) = O(1), and we recover standard results such as the one we describe in Lemma 6 in Appendix A. Furthermore, the price to pay for having a β-mixing assumption (instead of simple independence) lies in the extra coefficient J(n, δ), which is Õ(1); in other words, it is rather mild.
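To give a feel for the magnitude of this mixing correction, here is a small sketch evaluating J(n, δ) for hypothetical mixing parameters; it grows only polylogarithmically in n, consistent with Remark 2.

```python
import numpy as np

def J(n, delta, beta=1.0, b=1.0, kappa=1.0):
    """J(n, delta) = 32 * Gamma * max{(Gamma / b)^(1/kappa), 1}, with Gamma as in Lemma 2."""
    Gamma = np.log(2.0 / delta) + np.log(max(4 * np.e ** 2, n * beta))
    return 32.0 * Gamma * max((Gamma / b) ** (1.0 / kappa), 1.0)

for n in [10 ** 2, 10 ** 4, 10 ** 6]:
    print(n, J(n, delta=0.05))   # grows roughly like log(n)^2 for these (made-up) parameters
```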

4.2  Proof of Theorem 1

Having introduced the corresponding concentration inequality for infinitely-long trace-based estimates, we are ready to prove Theorem 1. The first important step of the proof of Theorem 1 consists in deriving the following lemma.

Lemma 4. Write ε_A = Â − A, ε_b = b̂ − b, and let ν be the smallest eigenvalue of the matrix Φ^T D_µ Φ. For all λ ∈ (0, 1), the estimate v̂_LSTD(λ) satisfies¹⁰

‖v_LSTD(λ) − v̂_LSTD(λ)‖_µ ≤ ((1 − λγ)/((1 − γ)√ν)) ‖(I + ε_A A^{-1})^{-1}‖₂ ‖ε_A θ − ε_b‖₂.

Furthermore, if for some ε and C, ‖ε_A‖₂ ≤ ε < C ≤ 1/‖A^{-1}‖₂, then Â is invertible and

‖(I + ε_A A^{-1})^{-1}‖₂ ≤ 1/(1 − ε/C).

¹⁰ When Â is not invertible, we take v̂_LSTD(λ) = ∞ and the inequality is always satisfied since, as we will see shortly, the invertibility of Â is equivalent to that of (I + ε_A A^{-1}).
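The second part of Lemma 4 is a standard perturbation argument; the sketch below (NumPy assumed, random matrices made up for illustration) checks the Neumann-series bound ‖(I + ε_A A^{-1})^{-1}‖₂ ≤ 1/(1 − ε/C) numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d)) + d * np.eye(d)        # a well-conditioned invertible matrix
C = 1.0 / np.linalg.norm(np.linalg.inv(A), 2)          # C <= 1 / ||A^{-1}||_2 (here, equality)
eps_A = rng.standard_normal((d, d))
eps_A *= (0.5 * C) / np.linalg.norm(eps_A, 2)          # enforce ||eps_A||_2 = eps = C/2 < C
eps = np.linalg.norm(eps_A, 2)

A_hat = A + eps_A
lhs = np.linalg.norm(np.linalg.inv(np.eye(d) + eps_A @ np.linalg.inv(A)), 2)
assert np.linalg.det(A_hat) != 0.0                     # A_hat is invertible
assert lhs <= 1.0 / (1.0 - eps / C) + 1e-10            # the bound of Lemma 4, second part
```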

Proof. Starting from the definitions of v_LSTD(λ) and v̂_LSTD(λ), we have

v̂_LSTD(λ) − v_LSTD(λ) = Φθ̂ − Φθ = Φ A^{-1} (Aθ̂ − b).    (12)

On the one hand, with the expression of A in Equation (3), writing M = (1 − λ)γ P (I − λγP)^{-1} and M_µ = Φ^T D_µ Φ, and using some linear algebra arguments, we can observe that

Φ A^{-1} = Φ ( Φ^T D_µ (I − γP)(I − λγP)^{-1} Φ )^{-1}
 = Φ ( Φ^T D_µ (I − λγP − (1 − λ)γP)(I − λγP)^{-1} Φ )^{-1}
 = Φ (M_µ − Φ^T D_µ M Φ)^{-1}.

Since the matrices A and M_µ are invertible, the matrix (I − M_µ^{-1} Φ^T D_µ M Φ) is also invertible, and then

Φ A^{-1} = Φ (I − M_µ^{-1} Φ^T D_µ M Φ)^{-1} M_µ^{-1}.

We know from Tsitsiklis and Roy (1997) that ‖Π‖_µ = 1—the projection matrix Π is defined in Equation (1)—and ‖P‖_µ = 1. Hence, we have ‖ΠM‖_µ = (1 − λ)γ/(1 − λγ) < 1 and the matrix (I − ΠM) is invertible. We can use the identity X(I − YX)^{-1} = (I − XY)^{-1} X with X = Φ and Y = M_µ^{-1} Φ^T D_µ M, and obtain

Φ A^{-1} = (I − ΠM)^{-1} Φ M_µ^{-1}.    (13)

On the other hand, using the facts that Aθ = b and Âθ̂ = b̂, we can see that

Aθ̂ − b = Aθ̂ − b − (Âθ̂ − b̂)
 = b̂ − b − ε_A θ̂
 = b̂ − b − ε_A θ + ε_A θ − ε_A θ̂
 = b̂ − b − (Â − A)θ + ε_A (θ − θ̂)
 = b̂ − Âθ − (b − Aθ) + ε_A A^{-1} (Aθ − Aθ̂)
 = b̂ − Âθ + ε_A A^{-1} (b − Aθ̂).

Then we have

Aθ̂ − b = b̂ − Âθ − ε_A A^{-1} (Aθ̂ − b).

Consequently,

Aθ̂ − b = (I + ε_A A^{-1})^{-1} (b̂ − Âθ) = (I + ε_A A^{-1})^{-1} (ε_b − ε_A θ),    (14)

where the last equality follows from the identity Aθ = b. Using Equations (13) and (14), Equation (12) can be rewritten as follows:

v̂_LSTD(λ) − v_LSTD(λ) = (I − ΠM)^{-1} Φ M_µ^{-1} (I + ε_A A^{-1})^{-1} (ε_b − ε_A θ).    (15)

Now we will try to bound ‖Φ M_µ^{-1} (I + ε_A A^{-1})^{-1} (ε_b − ε_A θ)‖_µ. Notice that for all x,

‖Φ M_µ^{-1} x‖_µ = √( x^T M_µ^{-1} Φ^T D_µ Φ M_µ^{-1} x ) = √( x^T M_µ^{-1} x ) ≤ (1/√ν) ‖x‖₂,    (16)

where ν is the smallest (real) eigenvalue of the Gram matrix M_µ. By taking the norm in Equation (15) and using the above relation, we get

‖v̂_LSTD(λ) − v_LSTD(λ)‖_µ ≤ ‖(I − ΠM)^{-1}‖_µ ‖Φ M_µ^{-1} (I + ε_A A^{-1})^{-1} (ε_b − ε_A θ)‖_µ
 ≤ ‖(I − ΠM)^{-1}‖_µ (1/√ν) ‖(I + ε_A A^{-1})^{-1} (ε_A θ − ε_b)‖₂
 ≤ ‖(I − ΠM)^{-1}‖_µ (1/√ν) ‖(I + ε_A A^{-1})^{-1}‖₂ ‖ε_A θ − ε_b‖₂.

The first part of the lemma is obtained by using the fact that ‖ΠM‖_µ = (1 − λ)γ/(1 − λγ) < 1, which implies that

‖(I − ΠM)^{-1}‖_µ = ‖ Σ_{i=0}^{∞} (ΠM)^i ‖_µ ≤ Σ_{i=0}^{∞} ‖ΠM‖_µ^i ≤ 1/(1 − (1−λ)γ/(1−λγ)) = (1 − λγ)/(1 − γ).    (17)

We are now going to prove the second part of the lemma. Since A is invertible, the matrix Â is invertible if and only if the matrix Â A^{-1} = (A + ε_A) A^{-1} = I + ε_A A^{-1} is invertible. Let us denote by ρ(ε_A A^{-1}) the spectral radius of the matrix ε_A A^{-1}. A sufficient condition for Â A^{-1} to be invertible is that ρ(ε_A A^{-1}) < 1. From the inequality ρ(M) ≤ ‖M‖₂ for any square matrix M, we can see that for any C and ε that satisfy ‖ε_A‖₂ ≤ ε < C < 1/‖A^{-1}‖₂, we have

ρ(ε_A A^{-1}) ≤ ‖ε_A A^{-1}‖₂ ≤ ‖ε_A‖₂ ‖A^{-1}‖₂ ≤ ε/C < 1.

It follows that the matrix Â is invertible and

‖(I + ε_A A^{-1})^{-1}‖₂ = ‖ Σ_{i=0}^{∞} (−ε_A A^{-1})^i ‖₂ ≤ Σ_{i=0}^{∞} (ε/C)^i = 1/(1 − ε/C).

This concludes the proof of Lemma 4.

To finish the proof of Theorem 1, Lemma 4 suggests that we should control both terms ‖ε_A‖₂ and ‖ε_A θ − ε_b‖₂ with high probability. This is what we do now.

Controlling ‖ε_A‖₂. By the triangle inequality, we can see that

‖ε_A‖₂ ≤ ‖E[ε_A]‖₂ + ‖ε_A − E[ε_A]‖₂.    (18)

Write Â_{n,k} = φ(X_k)(φ(X_n) − γφ(X_{n+1}))^T. For all n and k, we have ‖Â_{n,k}‖₂ ≤ 2dL². We can bound the first term of the r.h.s. of Equation (18) as follows, by replacing A with its expression in (3):

‖E[ε_A]‖₂ = ‖ A − E[ (1/(n−1)) Σ_{i=1}^{n−1} Σ_{k=1}^{i} (λγ)^{i−k} Â_{i,k} ] ‖₂
 = ‖ E[ (1/(n−1)) Σ_{i=1}^{n−1} ( Σ_{k=−∞}^{i} (λγ)^{i−k} Â_{i,k} − Σ_{k=1}^{i} (λγ)^{i−k} Â_{i,k} ) ] ‖₂
 = ‖ E[ (1/(n−1)) Σ_{i=1}^{n−1} (λγ)^i Σ_{k=−∞}^{0} (λγ)^{−k} Â_{i,k} ] ‖₂
 ≤ (1/(n−1)) Σ_{i=1}^{n−1} (λγ)^i (2dL²/(1−λγ))
 ≤ (1/(n−1)) (2dL²/(1−λγ)²) = ε₀(n).

Let (δ_n) be a parameter in (0, 1) depending on n, that we will fix later. A consequence of Equation (18) and the just-derived bound is that

P{‖ε_A‖₂ ≥ ε₁(n, δ_n)} ≤ P{‖ε_A − E[ε_A]‖₂ ≥ ε₁(n, δ_n) − ε₀(n)} ≤ δ_n

if we choose ε₁(n, δ_n) such that (cf. Lemma 2)

ε₁(n, δ_n) − ε₀(n) = (4dL²/((1−λγ)√(n−1))) √( (log(n−1)/log(1/(λγ)) + 1) J(n−1, δ_n) ) + ε(n),

where ε(n) = 4mdL²/((n−1)(1−λγ)), that is, if

ε₁(n, δ_n) = (4dL²/((1−λγ)√(n−1))) √( (log(n−1)/log(1/(λγ)) + 1) J(n−1, δ_n) ) + ε(n) + ε₀(n).    (19)

Controlling ‖ε_A θ − ε_b‖₂. By using the fact that Aθ = b, the definitions of Â and b̂, and the fact that φ(x)^T θ = [Φθ](x), we have

ε_A θ − ε_b = Âθ − b̂
 = (1/(n−1)) Σ_{i=1}^{n−1} z_i (φ(X_i) − γφ(X_{i+1}))^T θ − (1/(n−1)) Σ_{i=1}^{n−1} z_i r(X_i)
 = (1/(n−1)) Σ_{i=1}^{n−1} z_i ([Φθ](X_i) − γ[Φθ](X_{i+1}) − r(X_i))
 = (1/(n−1)) Σ_{i=1}^{n−1} z_i ∆_i,

where, since v_LSTD(λ) = Φθ, ∆_i is the following number:

∆_i = v_LSTD(λ)(X_i) − γ v_LSTD(λ)(X_{i+1}) − r(X_i).

We can control ‖ε_A θ − ε_b‖₂ by following the same proof steps as above. In fact we have

‖ε_A θ − ε_b‖₂ ≤ ‖ε_A θ − ε_b − E[ε_A θ − ε_b]‖₂ + ‖E[ε_A θ − ε_b]‖₂,    (20)

and ‖E[ε_A θ − ε_b]‖₂ ≤ ‖E[ε_A]‖₂ ‖θ‖₂ + ‖E[ε_b]‖₂. From what has been developed before, we can see that ‖E[ε_A]‖₂ ≤ ε₀(n) = (1/(n−1)) · 2dL²/(1−λγ)². Similarly we can show that ‖E[ε_b]‖₂ ≤ (1/(n−1)) · √d L Rmax/(1−λγ)². We can hence conclude that

‖E[ε_A θ − ε_b]‖₂ ≤ (1/(n−1)) (2dL²/(1−λγ)²) ‖θ‖₂ + (1/(n−1)) √d L Rmax/(1−λγ)² = ε₀''(n).

As a consequence of Equation (20) and the just-derived bound, we have

P(‖ε_A θ − ε_b‖₂ ≥ ε₂(δ_n)) ≤ P(‖ε_A θ − ε_b − E[ε_A θ − ε_b]‖₂ ≥ ε₂(δ_n) − ε₀''(n)) ≤ δ_n

if we choose ε₂(δ_n) such that (cf. Lemma 2)

ε₂(δ_n) = (2√d L ‖∆_i‖_∞/((1−λγ)√(n−1))) √( (log(n−1)/log(1/(λγ)) + 1) J(n−1, δ_n) ) + 2√d L ‖∆_i‖_∞ ⌈log(n−1)/log(1/(λγ))⌉/((n−1)(1−λγ)) + ε₀''(n).    (21)

It remains to compute a bound on ‖∆_i‖_∞. To do so, it suffices to bound v_LSTD(λ). For all x ∈ X, we have

|v_LSTD(λ)(x)| = |φ(x)^T θ| ≤ ‖φ(x)‖₂ ‖θ‖₂ ≤ √d L ‖θ‖₂,

where the first inequality is obtained from the Cauchy-Schwarz inequality. We thus need to bound ‖θ‖₂. On the one hand, we have

‖v_LSTD(λ)‖_µ = ‖Φθ‖_µ = √(θ^T M_µ θ) ≥ √ν ‖θ‖₂,

and on the other hand, we have

‖v_LSTD(λ)‖_µ = ‖(I − ΠM)^{-1} Π (I − λγP)^{-1} r‖_µ ≤ Rmax/(1 − γ) = Vmax.

Therefore

‖θ‖₂ ≤ Vmax/√ν.

We can conclude that

∀x ∈ X,  |v_LSTD(λ)(x)| ≤ √d L Vmax/√ν.

Then for all i we have

|∆_i| = |v_LSTD(λ)(X_i) − γ v_LSTD(λ)(X_{i+1}) − r(X_i)| ≤ √d L Vmax/√ν + γ √d L Vmax/√ν + (1 − γ)Vmax.

Since Φ^T D_µ Φ is a symmetric matrix, we have ν ≤ ‖Φ^T D_µ Φ‖₂. We can see that

‖Φ^T D_µ Φ‖₂ ≤ d max_{j,k} |φ_k^T D_µ φ_j| = d max_{j,k} |φ_k^T D_µ^{1/2} D_µ^{1/2} φ_j| ≤ d max_{j,k} ‖φ_k‖_µ ‖φ_j‖_µ ≤ dL²,

so that ν ≤ dL². It follows that, for all i,

|∆_i| ≤ √d L Vmax/√ν + γ √d L Vmax/√ν + (√d L/√ν)(1 − γ)Vmax = 2 (√d L/√ν) Vmax.

Conclusion of the proof. We are ready to conclude the proof. Now that we know how to control both terms ‖ε_A‖₂ and ‖ε_A θ − ε_b‖₂, we can see that

P{ ∃n ≥ 1, {‖ε_A‖₂ ≥ ε₁(n, δ_n)} ∪ {‖ε_A θ − ε_b‖₂ ≥ ε₂(n, δ_n)} }
 ≤ Σ_{n=1}^{∞} ( P{‖ε_A‖₂ ≥ ε₁(n, δ_n)} + P{‖ε_A θ − ε_b‖₂ ≥ ε₂(n, δ_n)} )
 ≤ 2 Σ_{n=1}^{∞} δ_n,

which is at most δ when the parameters δ_n are chosen proportional to δ/n² (recall that Σ_{n≥1} 1/n² = π²/6).

A  Proof of Lemma 2

Writing

ε₁ = (1/(n−1)) Σ_{i=1}^{m−1} (G_i − E[G_i])

and

ε₂ = (1/(n−1)) Σ_{i=m}^{n−1} ( (z_i − z_i^m) τ(X_i, X_{i+1})^T − E[(z_i − z_i^m) τ(X_i, X_{i+1})^T] ),

we have

(1/(n−1)) Σ_{i=1}^{n−1} (G_i − E[G_i]) = (1/(n−1)) Σ_{i=m}^{n−1} (G_i − E[G_i]) + ε₁
 = (1/(n−1)) Σ_{i=m}^{n−1} ( z_i τ(X_i, X_{i+1})^T − E[z_i τ(X_i, X_{i+1})^T] ) + ε₁
 = (1/(n−1)) Σ_{i=m}^{n−1} ( z_i^m τ(X_i, X_{i+1})^T − E[z_i^m τ(X_i, X_{i+1})^T] ) + ε₁ + ε₂
 = (1/(n−1)) Σ_{i=m}^{n−1} (G_i^m − E[G_i^m]) + ε₁ + ε₂.    (22)

For all i, we have ‖z_i‖_∞ ≤ L/(1−λγ), ‖G_i‖_∞ ≤ LL'/(1−λγ), and ‖z_i − z_i^m‖_∞ ≤ (λγ)^m L/(1−λγ). As a consequence—using ‖M‖₂ ≤ ‖M‖_F ≤ √(d×k) ‖x‖_∞ for M ∈ R^{d×k}, with x the vector obtained by concatenating all the columns of M—we can see that

‖ε₁ + ε₂‖₂ ≤ 2(m−1)√(d×k) L L'/((n−1)(1−λγ)) + 2(λγ)^m √(d×k) L L'/(1−λγ).    (23)

By concatenating all its columns, the d × k matrix G_i^m may be seen as a single vector U_i^m of size dk. Then, for all ε > 0,

P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (G_i^m − E[G_i^m])‖₂ ≥ ε ) ≤ P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (G_i^m − E[G_i^m])‖_F ≥ ε )
 = P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (U_i^m − E[U_i^m])‖₂ ≥ ε ).    (24)

The variables U_i^m define a stationary β-mixing process (Lemma 1). To deal with the β-mixing assumption, we use the decomposition technique proposed by Yu (1994), which consists in dividing the stationary sequence U_m^m, ..., U_{n−1}^m into 2µ_{n−m} blocks of length a_{n−m} (we assume here that n − m = 2 a_{n−m} µ_{n−m}). The blocks are of two kinds: those which contain the even indexes, E = ∪_{l=1}^{µ_{n−m}} E_l, and those with the odd indexes, H = ∪_{l=1}^{µ_{n−m}} H_l. Thus, by grouping the variables into blocks, we get

P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (U_i^m − E[U_i^m])‖₂ ≥ ε )
 ≤ P( ‖Σ_{i∈H} (U_i^m − E[U_i^m])‖₂ + ‖Σ_{i∈E} (U_i^m − E[U_i^m])‖₂ ≥ (n−m)ε )    (25)
 ≤ P( ‖Σ_{i∈H} (U_i^m − E[U_i^m])‖₂ ≥ (n−m)ε/4 ) + P( ‖Σ_{i∈E} (U_i^m − E[U_i^m])‖₂ ≥ (n−m)ε/4 )    (26)
 = 2 P( ‖Σ_{i∈H} (U_i^m − E[U_i^m])‖₂ ≥ (n−m)ε/4 ),    (27)

where Equation (25) follows from the triangle inequality, Equation (26) from the fact that the event {X + Y ≥ a} implies {X ≥ a/2} or {Y ≥ a/2}, and Equation (27) from the assumption that the process is stationary. Since H = ∪_{l=1}^{µ_{n−m}} H_l, we have

P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (U_i^m − E[U_i^m])‖₂ ≥ ε ) ≤ 2 P( ‖Σ_{l=1}^{µ_{n−m}} Σ_{i∈H_l} (U_i^m − E[U_i^m])‖₂ ≥ (n−m)ε/4 )
 = 2 P( ‖Σ_{l=1}^{µ_{n−m}} (U(H_l) − E[U(H_l)])‖₂ ≥ (n−m)ε/4 ),    (28)

where we defined U(H_l) = Σ_{i∈H_l} U_i^m. Now consider the sequence of identically distributed independent blocks (U'(H_l))_{l=1,...,µ_{n−m}} such that each block U'(H_l) has the same distribution as U(H_l). We are going to use the following technical result.

Lemma 5 (Yu (1994)). Let X_1, ..., X_n be a sequence of samples drawn from a stationary β-mixing process with coefficients {β_i}. Let X(H) = (X(H_1), ..., X(H_{µ_{n−m}})), where for all j, X(H_j) = (X_i)_{i∈H_j}. Let X'(H) = (X'(H_1), ..., X'(H_{µ_{n−m}})) with the X'(H_j) independent and such that for all j, X'(H_j) has the same distribution as X(H_j). Let Q and Q' be the distributions of X(H) and X'(H) respectively. For any measurable function h : X^{a_n µ_n} → R bounded by B, we have

|E_Q[h(X(H))] − E_{Q'}[h(X'(H))]| ≤ B µ_n β_{a_n}.

By applying Lemma 5, Equation (28) leads to:

P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (U_i^m − E[U_i^m])‖₂ ≥ ε ) ≤ 2 P( ‖Σ_{l=1}^{µ_{n−m}} (U'(H_l) − E[U'(H_l)])‖₂ ≥ (n−m)ε/4 ) + 2 µ_{n−m} β_{a_{n−m}}.    (29)
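The block construction itself is easy to visualize; the sketch below (hypothetical sizes, NumPy assumed) simply splits the indices m, ..., n−1 into 2µ_{n−m} alternating blocks of length a_{n−m}, in the spirit of Yu's technique.

```python
import numpy as np

def yu_blocks(m, n, a):
    """Split the indices m, ..., n-1 into alternating blocks H_1, E_1, H_2, E_2, ... of length a.

    Assumes n - m = 2 * a * mu for some integer mu (as in the text).
    """
    idx = np.arange(m, n)
    mu = (n - m) // (2 * a)
    blocks = idx.reshape(2 * mu, a)          # consecutive blocks of length a
    H = blocks[0::2]                         # blocks H_1, ..., H_mu
    E = blocks[1::2]                         # blocks E_1, ..., E_mu
    return H, E

H, E = yu_blocks(m=10, n=10 + 2 * 5 * 20, a=20)   # mu = 5 blocks of each kind
assert H.shape == E.shape == (5, 20)
```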

The variables U'(H_l) are independent. Furthermore, it can be seen that ( Σ_{l=1}^{µ_{n−m}} U'(H_l) − E[U'(H_l)] ) is a σ(U'(H_1), ..., U'(H_{µ_{n−m}})) martingale:

E[ Σ_{l=1}^{µ_{n−m}} U'(H_l) − E[U'(H_l)] | U'(H_1), ..., U'(H_{µ_{n−m}−1}) ]
 = Σ_{l=1}^{µ_{n−m}−1} ( U'(H_l) − E[U'(H_l)] ) + E[ U'(H_{µ_{n−m}}) − E[U'(H_{µ_{n−m}})] ]
 = Σ_{l=1}^{µ_{n−m}−1} ( U'(H_l) − E[U'(H_l)] ).

We can now use the following concentration result for martingales.

Lemma 6 (Hayes (2005)). Let X = (X_0, ..., X_n) be a discrete-time martingale taking values in a Euclidean space such that X_0 = 0 and for all i, ‖X_i − X_{i−1}‖₂ ≤ B₂ almost surely. Then for all ε,

P{ ‖X_n‖₂ ≥ ε } < 2e² e^{ −ε²/(2n B₂²) }.

Indeed, taking X_{µ_{n−m}} = Σ_{l=1}^{µ_{n−m}} U'(H_l) − E[U'(H_l)], and observing that ‖X_i − X_{i−1}‖₂ = ‖U'(H_l) − E[U'(H_l)]‖₂ ≤ a_{n−m} C with C = 2√(d×k) L L'/(1−λγ), the lemma leads to

P( ‖Σ_{l=1}^{µ_{n−m}} (U'(H_l) − E[U'(H_l)])‖₂ ≥ (n−m)ε/4 ) ≤ 2e² exp( −(n−m)²ε²/(32 µ_{n−m} (a_{n−m} C)²) )
 = 2e² exp( −(n−m)ε²/(16 a_{n−m} C²) ),

where the second line is obtained by using the fact that 2 a_{n−m} µ_{n−m} = n − m. With Equations (28) and (29), we finally obtain

P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (U_i^m − E[U_i^m])‖₂ ≥ ε ) ≤ 4e² exp( −(n−m)ε²/(16 a_{n−m} C²) ) + 2(n−m) β^U_{a_{n−m}}.

The vector U_i^m is a function of Z_i = (X_{i−m+1}, ..., X_{i+1}), and Lemma 1 tells us that for all j > m,

β_j^U ≤ β_j^Z ≤ β_{j−m}^X ≤ β e^{−b(j−m)^κ}.

So the equation above may be rewritten as

P( ‖(1/(n−m)) Σ_{i=m}^{n−1} (U_i^m − E[U_i^m])‖₂ ≥ ε ) ≤ 4e² exp( −(n−m)ε²/(16 a_{n−m} C²) ) + 2(n−m) β e^{−b(a_{n−m}−m)^κ} = δ'.    (30)

We now follow a reasoning similar to that of Lazaric et al. (2012) in order to get the same exponent 1 l m κ+1 2 in both of the above exponentials. Taking an−m − m = C2 (n−m) with C2 = (16C 2 ζ)−1 , and b ζ=

an−m an−m −m ,

we have 0



2

δ ≤ (4e + (n − m)β) exp − min Define Λ(n, δ) = log and

s (δ) =

2

b (n − m)2 C2



! 1  k+1 1 2 ,1 (n − m)C2  . 2

  2 + log(max{4e2 , nβ}), δ

Λ(n − m, δ) max C2 (n − m) 15



 κ1 Λ(n − m, δ) ,1 . b

(31)

It can be shown that  exp − min

b (n − m)((δ))2 C2



! 1  k+1 1 2 ,1 (n − m)C2 ((δ)) ≤ exp (−Λ(n − m, δ)) . 2

(32)

Indeed11 , there are two cases:  o n b 1. Suppose that min 2 (n−m)((δ)) C2 , 1 = 1. Then ! 1   k+1 b 1 2 exp − min ,1 (n − m)C2 ((δ)) (n − m)((δ))2 C2 2  k1 !  Λ(n − m, δ) = exp −Λ(n − m, δ) max ,1 b 

≤ exp (−Λ(n − m, δ)) .  o   n b b 2. Suppose now that min (n−m)((δ))2 C2 , 1 = (n−m)((δ))2 C2 . Then 1 !     k+1 1 k 1 k 1 k+1 1 k+1 Λ(n − m, δ) 2 k+1 exp − b ((n − m)C2 ((δ)) ) = exp − b (Λ(n − m, δ) k+1 max ,1 2 2 b   1 k 1 k+1 k+1 = exp − Λ(n − m, δ) max {Λ(n − m, δ), b} 2

≤ exp (−Λ(n − m, δ)) . By combining Equations (31) and (32), we get δ 0 ≤ (4e2 + (n − m)β) exp (−Λ(n − m, δ)) . If we replace Λ(n − m, δ) with its expression, we obtain exp (−Λ(n − m, δ)) =

δ max{4e2 , (n − m)β}−1 . 2

Since 4e2 max{4e2 , (n − m)β}−1 ≤ 1 and (n − m)β max{4e2 , (n − m)β}−1 ≤ 1, we consequently have δ0 ≤ 2

δ ≤ δ. 2

Now, note that since an−m − m ≥ 1, we have an−m an−m − m + m ζ= = ≤ 1 + m. an−m − m an−m − m n o κ1 Let J(n, δ) = 32Λ(n, δ) max Λ(n,δ) , 1 . Then Equation (30) is reduced to b

!

1 n−1 X 1 C

m m (Ui − E[Ui ]) ≥ √ P (ζJ(n − m, δ)) 2 ≤ δ.

n − m n−m i=m 2 q n−1 1 n−1 √ 1 Since J(n, δ) is an increasing function on n, and √n−1(n−m) = √n−m , we have n−m ≥ n−m

!

1 n−1

X 1 C

m m 2 P (Gi − E[Gi ]) ≥ √ (ζJ(n − 1, δ))

n − 1 n−1 i=m 2

!

1 n−1

X 1 C n−1

m m 2 ≤ P (Gi − E[Gi ]) ≥ √ ((m + 1)J(n − 1, δ))

n − m

n−1n−m i=m 2

!

1 n−1

X 1 C

m ≤ P (Gm ((m + 1)J(n − m, δ)) 2 . i − E[Gi ]) ≥ √

n − m

n − m i=m 2

11 This

inequality exists in Lazaric et al. (2012), and is developped here for completeness.

16

(33)

By using Equations (24) and (33), we deduce that

!

1 n−1

X 1 C

m m 2 (Gi − E[Gi ]) ≥ √ ((m + 1)J(n − 1, δ)) P ≤ δ.

n − 1

n−1 i=m

(34)

2

By combining Equations (22), (23),(34), plugging the value of C =

√ 2 dkLL0 1−λγ ,

and taking m =

l

log (n−1) 1 log λγ

m ,

we get the announced result.

References

Archibald, T., McKinnon, K., and Thomas, L. (1995). On the generation of Markov decision processes. Journal of the Operational Research Society, 46, 354–361.

Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2–3), 233–246.

Hayes, T. P. (2005). A large-deviation inequality for vector-valued martingales. Manuscript.

Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13, 3041–3074.

Nedic, A. and Bertsekas, D. P. (2002). Least squares policy evaluation algorithms with linear function approximation. Theory and Applications, 13, 79–110.

Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In ICML.

Scherrer, B. and Lesner, B. (2012). On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. In NIPS 2012, Advances in Neural Information Processing Systems, South Lake Tahoe, United States.

Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan and Claypool.

Tsitsiklis, J. N. and Roy, B. V. (1997). An analysis of temporal-difference learning with function approximation. Technical report, IEEE Transactions on Automatic Control.

Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 19, 3041–3074.

Yu, H. (2010). Convergence of least-squares temporal difference methods under general conditions. In ICML.
