arXiv:1604.00151v1 [cs.SY] 1 Apr 2016
Gradient-based learning algorithms with constant-error estimators: stability and convergence

Arun Selvan R.¹ and Shalabh Bhatnagar²

¹ [email protected]  ² [email protected]
¹,² Department of Computer Science and Automation, Indian Institute of Science, Bangalore - 560012, India.

April 4, 2016

Abstract

Implementations of stochastic gradient search algorithms such as back propagation typically rely on finite difference (FD) approximation methods. These methods are used to approximate the objective function gradient in steepest descent algorithms, as well as the gradient and Hessian inverse in Newton-based schemes. The convergence analyses of such schemes critically require that the perturbation parameters in the estimators of the gradient/Hessian approach zero. In practice, however, the perturbation parameter is often held fixed at a 'small' constant, resulting in constant-error estimates. In this paper we present a theoretical framework based on set-valued dynamical systems to analyze such schemes. Easily verifiable conditions are presented for stability and convergence when such FD estimators are used for the gradient/Hessian. In addition, our framework dispenses with a critical restriction on the step sizes (learning rates) when FD estimators are used.
1 Introduction
In 1951 Robbins & Monro [11] developed the first stochastic approximation algorithm to solve the root finding problem. Inspired by the Robbins-Monro algorithm, Kiefer & Wolfowitz [7] developed a method to stochastically estimate the maximum of a function. An important feature shared by the two algorithms is that the objective function cannot be directly observed; instead, only noisy observations N(·) of the objective are available at each stage. Below we state the Kiefer-Wolfowitz algorithm, which updates a d-dimensional parameter x_n using a gradient ascent scheme to find the maximum:

x_{n+1} = x_n + a(n) [ ( (N(x_n + p(n)ξ_1) − N(x_n − p(n)ξ_1))/(2p(n)), …, (N(x_n + p(n)ξ_d) − N(x_n − p(n)ξ_d))/(2p(n)) )^T + M_{n+1} ].    (1)
Here ξ_i is the vector with 1 in the i-th place and 0 elsewhere, a(n) is the given step-size sequence and M_{n+1} is the martingale difference noise, see [15]. The above recursion is a prototypical example of an important technique in machine learning: the finite difference (FD) method for gradient estimation. The p(n)'s in (1) are called the perturbation parameters. In an FD method the perturbation parameters are used to control the estimation error, i.e., the difference between the true gradient and the estimate. For convergence, the estimation errors need to vanish over time. This is ensured if p(n) → 0 and Σ_n a(n)²/p(n)² < ∞. Since learning rates are determined by step sizes, the aforementioned conditions can severely limit the step-size options and wreak havoc with the learning process. As a workaround, implementations often fix p(n) = p ∀n, where p > 0 is a 'small' constant. With this, even though the estimation errors do not vanish asymptotically, they remain bounded at each stage under reasonable assumptions. However, a proof of convergence in this setting is unavailable in the literature, as all proofs require p → 0. In this paper, we provide a theoretical framework for stability and a precise convergence analysis for both cases: p → 0 and p held fixed, i.e., p ↛ 0. Since we do not require p → 0, our analysis only requires the standard step-size condition Σ_n a(n)² < ∞ in place of the restrictive Σ_n a(n)²/p(n)² < ∞. More recent FD methods covered by our results include random directions stochastic approximation (RDSA) [9], simultaneous perturbation stochastic approximation (SPSA) [13] and the smoothed functional algorithm [12].

A multilayer perceptron (MLP) is an artificial neural network (ANN) with one or more hidden layers. Supervised training of an MLP can be viewed as the problem of minimizing the associated cost function. Stochastic gradient descent (SGD) is a popular method employed to solve this problem, as exemplified by the back propagation algorithm [6]. If E_avg is the cost function, then the SGD iteration for the cost minimization problem is given by θ_{n+1} = θ_n − a(n)∇_θ E_avg|_{θ=θ_n}. Although easy to implement, SGD suffers from a slow convergence rate. This problem is overcome by using Newton's method. Although the rate of convergence is vastly improved, Newton's method is computationally intensive since it involves computing the inverse of the Hessian, H^{−1}, which often cannot be computed directly. These problems are overcome by the quasi-Newton method.
On the one hand, the quasi-Newton method only requires an estimate of the Hessian inverse, which can be readily calculated using one of many FD methods (for one such method see [14]); on the other hand, its rate of convergence is comparable to that of Newton's method. In other words, the back propagation algorithm run with the quasi-Newton method is given by the following update rule:

θ_{n+1} = θ_n − a(n) [ H^{−1}(θ_n)∇_θ E_avg|_{θ=θ_n} + ξ_n + M_{n+1} ].    (2)
In the above, ξ_n is the estimation error at stage n and {M_{n+1}}_{n≥0} is the martingale difference noise sequence. In the context of reinforcement learning, SGD procedures are similarly used for policy optimization, via either policy gradient algorithms [16] or actor-critic methods [8]. As stated earlier, convergence of (2) to the minimum set requires that ξ_n → 0; see, for instance, [3]. Remark 1 in Section 2.2 compares our assumptions with those in [3]. To ensure that ξ_n → 0, the perturbation parameters need constant "tuning". For example, if the Kiefer-Wolfowitz estimator is used to estimate ∇_θ E_avg, then, as stated earlier, one needs to ensure that p(n) → 0 for ξ_n → 0. In addition, one requires Σ_n a(n)²/p(n)² < ∞ to ensure convergence to the minimum set. These conditions greatly limit our choice of step sizes. Since the learning rate is determined by the step size used, this could potentially interfere with the learning process. In addition to the aforementioned conditions, convergence is only guaranteed when the iterates are bounded almost surely, i.e., sup_n ‖θ_n‖ < ∞ a.s., a condition that is hard to verify in most cases. We improve upon existing frameworks in the literature by dispensing with the requirement that ξ_n → 0; we only require that ‖ξ_n‖ ≤ ǫ ∀n, for some fixed ǫ > 0. In other words, if the Kiefer-Wolfowitz estimator is used as a gradient estimator, it is no longer required that p(n) → 0. It is also worth noting that there is no requirement on Σ_n a(n)²/p(n)². In our scheme, the
gradient estimators are allowed to make a maximum error of ǫ at each stage, where ǫ > 0 is fixed a priori. We call such estimators constant-error gradient estimators. Typically, constant-error gradient estimators are implemented using the FD methods mentioned earlier, with their perturbation parameters tuned at the beginning of the simulation and then kept fixed for the rest of the run. The aim of this paper is to develop weak yet easily verifiable sufficient conditions for both stability and convergence of gradient-based learning algorithms that use constant-error gradient estimators. Specifically, we show that algorithms like the one given by (2) that satisfy our assumptions (Section 2.2) are stable and converge to an arbitrarily small neighborhood, N, of the minimum set, i.e., arbitrarily close to the minimum. The neighborhood N is related to the estimation errors made: if we wish to ensure convergence to a δ-neighborhood of the minimum set (for fixed δ > 0), we show that there exists ǫ(δ) > 0 such that if ‖ξ_n‖ ≤ ǫ(δ) ∀n, the iterates given by (2) indeed converge almost surely to the δ-neighborhood of the minimum set. In what follows we consider the following generalized gradient scheme to minimize a given objective function F : R^d → R:

x_{n+1} = x_n + a(n) [g(x_n) + M_{n+1}],    (3)

where g(x_n) is the estimate of ðF(x_n) obtained by the constant-error finite difference method used; we define ðF(x_n) := −∇F(x)|_{x=x_n} if standard SGD is used to find the minimum, and ðF(x_n) := −H^{−1}(x_n)∇F(x)|_{x=x_n} if Newton's method is used, with H the Hessian. In other words, (3) combines SGD and Newton's method by using ðF as a placeholder for −∇F if SGD is the algorithm at hand, or for −H^{−1}∇F if Newton's method is considered. In Section 2.2 we shall impose assumptions on ðF instead of writing the restrictions for −∇F and −H^{−1}∇F separately. As mentioned earlier, the estimation error at each stage, ‖ðF(x_n) − g(x_n)‖, is bounded by a fixed constant ǫ > 0; further, ǫ is determined by the neighborhood N^δ of the minimum set to which the iterates are required to converge,
see Corollary 2. Although using constant-error gradient estimators greatly eases the computational strain, the stability of the iterates is no longer guaranteed. In Section 2.2, easily verifiable sufficient conditions for stability and convergence of gradient-based learning algorithms, such as (3), that use constant-error gradient estimators are presented. Section 2.4 contains an outline of the proof of stability and convergence of (3) under the assumptions listed in Section 2.2. Section 3 contains the main results: stability of the iterates given by (3) (Theorem 1) and convergence of (3) to some neighborhood of the minimum set (Theorem 2). In the same section we also provide a relation between the estimation errors and the neighborhood of the minimum set to which the iterates converge (Corollary 2).
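To make the central object of the paper concrete, the following sketch implements the Kiefer-Wolfowitz recursion (1) as a constant-error gradient estimator: the perturbation parameter is held at a fixed p instead of a vanishing p(n). The quadratic objective, noise level, initial point and step sizes are illustrative choices of ours, not taken from the paper.

```python
import random

def kw_step(x, a_n, p, noisy_f):
    """One Kiefer-Wolfowitz ascent step with a fixed perturbation p.

    Each coordinate of the gradient is estimated by a central finite
    difference of two noisy function evaluations, as in recursion (1).
    """
    d = len(x)
    g = []
    for i in range(d):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += p
        x_minus[i] -= p
        g.append((noisy_f(x_plus) - noisy_f(x_minus)) / (2.0 * p))
    return [x[j] + a_n * g[j] for j in range(d)]

def run_kw(x0, n_steps=2000, p=0.05):
    # Noisy observations of f(x) = -(x_1^2 + x_2^2), maximized at the origin.
    f = lambda x: -sum(v * v for v in x) + random.gauss(0.0, 0.01)
    x = list(x0)
    for n in range(1, n_steps + 1):
        x = kw_step(x, 1.0 / n, p, f)  # a(n) = 1/n, p(n) = p held fixed
    return x

random.seed(0)
x_final = run_kw([2.0, -1.5])
```

Even though p never shrinks, the iterates settle near the maximizer, up to an error controlled by p and the noise; this is exactly the regime the paper analyzes.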
2 Preliminaries and assumptions

2.1 Definitions and notations
The definitions and notations used in this paper are similar to those in Benaïm et al. [2], Aubin et al. [1] and Borkar [4]. In this section, we present a few for easy reference.

A set-valued map h : R^n → {subsets of R^m} is called a Marchaud map if it satisfies the following properties: (i) For each x ∈ R^n, h(x) is convex and compact. (ii) (point-wise boundedness) For each x ∈ R^n, sup_{w∈h(x)} ‖w‖ < K(1 + ‖x‖) for some K > 0. (iii) h is an upper-semicontinuous map. We say that h is upper-semicontinuous if, given sequences {x_n}_{n≥1} (in R^n) and {y_n}_{n≥1} (in R^m) with x_n → x, y_n → y and y_n ∈ h(x_n), n ≥ 1, we have y ∈ h(x). In other words, the graph of h, {(x, y) : y ∈ h(x), x ∈ R^n}, is closed in R^n × R^m.

Let H be a Marchaud map on R^d. The differential inclusion (DI) given by

ẋ ∈ H(x)    (4)
is guaranteed to have at least one solution that is absolutely continuous. The reader is referred to [1] for more details. We say that x ∈ Σ if x is an absolutely continuous map that satisfies (4). The set-valued semiflow Φ associated with (4) is defined on [0, +∞) × R^d as Φ_t(x) = {x(t) | x ∈ Σ, x(0) = x}. For B × M ⊂ [0, +∞) × R^d, define Φ_B(M) = ∪_{t∈B, x∈M} Φ_t(x).

Let M ⊆ R^d. The ω-limit set is defined by ω_Φ(M) = ∩_{t≥0} cl(Φ_{[t,+∞)}(M)), where cl(·) denotes closure. Similarly, the limit set of a solution x is given by L(x) = ∩_{t≥0} cl(x([t, +∞))). M ⊆ R^d is invariant if for every x ∈ M there exists a trajectory x entirely in M with x(0) = x; in other words, x ∈ Σ with x(t) ∈ M for all t ≥ 0. For x ∈ R^d and A ⊆ R^d, d(x, A) := inf{‖x − y‖ | y ∈ A}. We define the δ-open neighborhood of A by N^δ(A) := {x | d(x, A) < δ} and the δ-closed neighborhood of A by N̄^δ(A) := {x | d(x, A) ≤ δ}. The open ball of radius r around the origin is represented by B_r(0), while the closed ball is
represented by B̄_r(0).

Internally chain transitive set: M ⊂ R^d is said to be internally chain transitive if M is compact and for every x, y ∈ M, ǫ > 0 and T > 0 the following holds: there exist solutions Φ¹, …, Φⁿ of the differential inclusion ẋ(t) ∈ h(x(t)), a sequence x_1 (= x), …, x_{n+1} (= y) ⊂ M and n real numbers t_1, t_2, …, t_n greater than T such that Φ^i_{t_i}(x_i) ∈ N^ǫ(x_{i+1}) and Φ^i_{[0,t_i]}(x_i) ⊂ M for 1 ≤ i ≤ n. The sequence (x_1 (= x), …, x_{n+1} (= y)) is called an (ǫ, T) chain in M from x to y.

A ⊆ R^d is an attracting set if it is compact and there exists a neighborhood U such that for any ǫ > 0, ∃ T(ǫ) ≥ 0 with Φ_{[T(ǫ),+∞)}(U) ⊂ N^ǫ(A). Such a U is called a fundamental neighborhood of A. If, in addition to being compact, the attracting set is also invariant, then it is called an attractor. The basin of attraction of A is given by B(A) = {x | ω_Φ(x) ⊂ A}. A is called Lyapunov stable if for all δ > 0, ∃ ǫ > 0 such that Φ_{[0,+∞)}(N^ǫ(A)) ⊆ N^δ(A). We use T(ǫ) and T_ǫ interchangeably to denote the dependence of T on ǫ.

Let {K_n}_{n≥1} be a sequence of sets in R^d. The upper-limit of {K_n}_{n≥1} is given by Limsup_{n→∞} K_n := {y | liminf_{n→∞} d(y, K_n) = 0}, and the lower-limit by Liminf_{n→∞} K_n := {y | lim_{n→∞} d(y, K_n) = 0}. We may interpret that the lower-limit collects the limit points of {K_n}_{n≥1} while the upper-limit collects its accumulation points.
2.2 Assumptions
Recall that we have the following gradient scheme in R^d to minimize F : R^d → R:

x_{n+1} = x_n + a(n) [g(x_n) + M_{n+1}], where g(x_n) ∈ G(x_n) ∀n.    (5)
We list our assumptions below.

(A1) G(x) := ðF(x) + B̄_ǫ(0) for some fixed ǫ > 0. ðF is a continuous function such that ‖ðF(x)‖ ≤ K(1 + ‖x‖) for all x ∈ R^d, for some K > 0. As explained in Section 1, ðF is a placeholder for −∇F(x)|_{x=x_n} or −H^{−1}(x_n)∇F(x)|_{x=x_n}, depending on whether the algorithm at hand is SGD or Newton's method, respectively.

From (A1) it follows that sup_{y∈G(x)} ‖y‖ ≤ K(1 + ‖x‖) + ǫ. Without loss of generality K is such that sup_{y∈G(x)} ‖y‖ ≤ K(1 + ‖x‖). Further, since ðF is a continuous function, it follows that G is upper-semicontinuous (see Section 2.1 for the definition of an upper-semicontinuous map). To see this, let x_n → x and y_n → y with y_n ∈ G(x_n) ∀n. We need to show that y ∈ G(x). We have y_n = ðF(x_n) + ξ_n, where ξ_n ∈ B̄_ǫ(0) for each n. Since ðF(x_n) → ðF(x) and y_n → y, we have ξ_n → ξ for some ξ ∈ B̄_ǫ(0). In other words, y = ðF(x) + ξ and ðF(x) + ξ ∈ G(x), i.e., y ∈ G(x).

(A2) {a(n)}_{n≥0} is a scalar sequence such that a(n) > 0 ∀n, Σ_{n≥0} a(n) = ∞ and Σ_{n≥0} a(n)² < ∞. Without loss of generality we let sup_n a(n) ≤ 1.

(A3) {M_n}_{n≥1} is a square integrable martingale difference sequence with respect to the filtration F_n := σ(x_0, M_1, …, M_n), n ≥ 0. Further, E[‖M_{n+1}‖² | F_n] ≤ K(1 + ‖x_n‖²) for n ≥ 0 and some constant K > 0. Without loss of generality the same constant K works for both (A1) and (A3).
For each c ≥ 1, we define G_c(x) := {y/c | y ∈ G(cx)}. Define G∞(x) := co⟨Limsup_{c→∞} G_c(x)⟩; see Section 2.1 for the definition of Limsup. Given S ⊆ R^d, the convex closure of S, denoted co⟨S⟩, is the closure of the convex hull of S. It is worth noting that Limsup_{c→∞} G_c(x) is non-empty for every x ∈ R^d. Further, we show that G∞ is a Marchaud map in Lemma 1. In other words, ẋ(t) ∈ G∞(x(t)) has at least one solution that is absolutely continuous, see [1].

(A4) ẋ(t) ∈ G∞(x(t)) has an attractor set A such that A ⊆ B_a(0) and B̄_a(0) is a fundamental neighborhood of A.

Since A ⊆ B_a(0) is compact, we have sup_{x∈A} ‖x‖ < a. Let us fix the following sequence of real numbers: sup_{x∈A} ‖x‖ = δ_1 < δ_2 < δ_3 < δ_4 < a.
(A5) Let {c_n}_{n≥1} be an increasing sequence of integers such that c_n ↑ ∞ as n → ∞. Further, let x_n → x and y_n → y as n → ∞, with y_n ∈ G_{c_n}(x_n) ∀n. Then y ∈ G∞(x).

In most cases, verifying stability (sup_n ‖x_n‖ < ∞ a.s.) is hard. Assumptions (A4) & (A5) are easily verifiable conditions which, in conjunction with (A1)−(A3), give stability (and convergence to the minimum set). A sufficient condition for (A4) is the existence of a global Lyapunov function associated with ẋ(t) ∈ G∞(x(t)). Conditions (A4) and (A5) generalize the conditions for stability and convergence of stochastic approximation algorithms such as (1), given by Borkar and Meyn [5], to cases where the mean field is set-valued, as in (3).

Remark 1. As mentioned earlier, SGDs with errors such as (3) are also analyzed in [3]. There the objective function F is such that ∇F is Lipschitz continuous, conditions are imposed on ‖∇F‖², and the estimation errors are bounded by a product of step sizes and gradient norms (this restricts the choice of step sizes). In our paper, we assume that ∇F is merely continuous; further, there are no assumptions that couple the estimation errors and the step sizes. Also, in [3] there is a possibility that the iterates are unstable; however, if the iterates are stable, the estimation errors go to zero and the iterates converge to the minima. We provide easily verifiable sufficient conditions for stability and convergence to a small neighborhood of the minimum set without requiring the estimation errors to go to zero.

Lemma 1. G∞ is a Marchaud map.

Proof. From the definitions of G∞ and G we have that G∞(x) is convex, compact and sup_{y∈G∞(x)} ‖y‖ ≤ K(1 + ‖x‖) for every x ∈ R^d. It is left to show that G∞ is an upper-semicontinuous map. Let x_n → x, y_n → y and y_n ∈ G∞(x_n) for all n ≥ 1. We need to show that y ∈ G∞(x). We present a proof by contradiction. Since G∞(x) is convex and compact, y ∉ G∞(x) implies that there exists a linear functional f on R^d such that sup_{z∈G∞(x)} f(z) ≤ α − ǫ and f(y) ≥ α + ǫ, for some α ∈ R and ǫ > 0. Since y_n → y, there exists N > 0 such that f(y_n) ≥ α + ǫ/2 for all n ≥ N. In other words, G∞(x_n) ∩ [f ≥ α + ǫ/2] ≠ ∅ for all n ≥ N. We use the notation [f ≥ a] to denote the set {x | f(x) ≥ a}. For the sake of convenience, let us denote the set Limsup_{c→∞} G_c(x) by A(x), where x ∈ R^d.

We claim that A(x_n) ∩ [f ≥ α + ǫ/2] ≠ ∅ for all n ≥ N. We prove this claim later; for now we assume that the claim is true and proceed. Pick z_n ∈ A(x_n) ∩ [f ≥ α + ǫ/2] for each n ≥ N. It can be shown that {z_n}_{n≥N} is norm bounded and hence contains a convergent subsequence, {z_{n(k)}}_{k≥1} ⊆ {z_n}_{n≥N}. Let lim_{k→∞} z_{n(k)} = z. Since z_{n(k)} ∈ Limsup_{c→∞} G_c(x_{n(k)}), ∃ c_{n(k)} ∈ N such that ‖w_{n(k)} − z_{n(k)}‖ < 1/n(k), where w_{n(k)} ∈ G_{c_{n(k)}}(x_{n(k)}). We choose the sequence {c_{n(k)}}_{k≥1} such that c_{n(k+1)} > c_{n(k)} for each k ≥ 1. We have the following: c_{n(k)} ↑ ∞, x_{n(k)} → x, w_{n(k)} → z and w_{n(k)} ∈ G_{c_{n(k)}}(x_{n(k)}), for all k ≥ 1. It follows from assumption (A5) that z ∈ G∞(x). Since z_{n(k)} → z and f(z_{n(k)}) ≥ α + ǫ/2 for each k ≥ 1, we have f(z) ≥ α + ǫ/2. This contradicts the earlier conclusion that sup_{z∈G∞(x)} f(z) ≤ α − ǫ.

It remains to prove that A(x_n) ∩ [f ≥ α + ǫ/2] ≠ ∅ for all n ≥ N. If this were not true, then ∃ {m(k)}_{k≥1} ⊆ {n ≥ N} such that A(x_{m(k)}) ⊆ [f < α + ǫ/2] for all k. It follows that G∞(x_{m(k)}) = co⟨A(x_{m(k)})⟩ ⊆ [f ≤ α + ǫ/2] for each k ≥ 1. On the other hand, since y_{m(k)} ∈ G∞(x_{m(k)}) and y_{m(k)} → y, ∃ N_1 such that f(y_{m(k)}) ≥ α + 3ǫ/4 for all m(k) ≥ N_1. This is a contradiction.
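As a worked illustration of the objects above, suppose, hypothetically, that ðF(x) = −x (plain SGD on F(x) = ‖x‖²/2), so that G(x) = −x + B̄_ǫ(0). Then G_c(x) = −x + B̄_{ǫ/c}(0): the error ball shrinks as c grows, G∞(x) = {−x}, and ẋ ∈ G∞(x) has the global attractor {0}, so (A4) holds. The numeric sketch below (our own toy check, with an illustrative ǫ) samples G_c(x) and confirms that its deviation from −x is at most ǫ/c:

```python
import math
import random

EPS = 0.5  # the fixed error bound from (A1); illustrative value


def dF(x):
    # Hypothetical mean field dF(x) = -x, i.e. SGD on F(x) = ||x||^2 / 2.
    return [-v for v in x]


def sample_Gc_deviation(x, c, n_samples=2000, rng=random.Random(1)):
    """Sample points of G_c(x) = {y/c : y in G(cx)} and return the largest
    observed distance to -x, the candidate limit G_infty(x) = {-x}."""
    d = len(x)
    mean = [v / c for v in dF([c * v for v in x])]  # equals -x exactly here
    worst = 0.0
    for _ in range(n_samples):
        # Draw an error vector with uniform direction and norm at most EPS.
        direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in direction)) or 1.0
        radius = EPS * rng.random()
        # A point of G_c(x): mean part plus the error scaled down by c.
        point = [mean[i] + (radius / c) * direction[i] / norm for i in range(d)]
        dist = math.sqrt(sum((point[i] + x[i]) ** 2 for i in range(d)))
        worst = max(worst, dist)
    return worst


x = [1.0, 2.0]
dev_1 = sample_Gc_deviation(x, c=1.0)
dev_100 = sample_Gc_deviation(x, c=100.0)
# dev_1 <= EPS while dev_100 <= EPS / 100: the error ball collapses as c grows.
```

This is the mechanism behind (A4)/(A5) in this example: the constant error ǫ is scaled away in G_c, so the limiting dynamics see only the mean field −x.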
2.3 Constructing the linearly interpolated trajectories using the iterates
We begin by constructing the linearly interpolated trajectory x(t), t ∈ [0, ∞), from {x_n}_{n≥0}. Define t(0) := 0 and t(n) := Σ_{i=0}^{n−1} a(i). Let x(t(n)) := x_n and, for t ∈ (t(n), t(n+1)),

x(t) := ((t(n+1) − t)/(t(n+1) − t(n))) x(t(n)) + ((t − t(n))/(t(n+1) − t(n))) x(t(n+1)).
Now, using {g(x_n)}_{n≥0}, we get the following piece-wise constant trajectory: g(t) := g(x_n) for t ∈ [t(n), t(n+1)), n ≥ 0. To show stability, we use a projective scheme where the projections are onto the closed ball of radius a around the origin, B̄_a(0). Let us define the rescaled trajectories x̂(·) and ĝ(·) obtained as a consequence of these projections. First, [0, ∞) is divided into intervals of length T := T(δ_2 − δ_1) + 1. The time T(δ_2 − δ_1) is such that Φ_t(x_0) ∈ N^{δ_2−δ_1}(A) for t ≥ T(δ_2 − δ_1), where Φ_t(x_0) denotes a solution to ẋ(t) ∈ G∞(x(t)) at time t with initial condition x_0 ∈ B̄_a(0). Note that T(δ_2 − δ_1) is independent of the initial condition x_0; see Section 2.1 for more details. Define T_0 := 0 and T_n := min{t(m) : t(m) ≥ T_{n−1} + T}, n ≥ 1. Clearly, there exists a subsequence {t(m(n))}_{n≥0} of {t(n)}_{n≥0} such that T_n = t(m(n)) ∀n ≥ 0. In what follows we use t(m(n)) and T_n interchangeably.

We are now ready to construct the rescaled trajectories. First, we construct x̂(t), t ≥ 0, as follows: for t ∈ [T_n, T_{n+1}), n ≥ 0, let x̂(t) := x(t)/r(n), where r(n) = (‖x(T_n)‖/a) ∨ 1 (a is defined in (A4)). Also, let x̂(T⁻_{n+1}) := lim_{t↑T_{n+1}} x̂(t). The 'rescaled g iterates' are given by ĝ(t) := g(t)/r(n), t ∈ [T_n, T_{n+1}), and the rescaled martingale noise terms by M̂_{k+1} := M_{k+1}/r(n), t(k) ∈ [T_n, T_{n+1}), n ≥ 0. Note that E[‖M̂_{k+1}‖² | F_k] ≤ K(1 + ‖x̂(t(k))‖²).
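The constructions of this subsection translate directly into code. The sketch below builds the time grid t(n), the linear interpolation x(t) and the scaling factor r(n) from a sequence of iterates; the iterates and step sizes are dummy data of ours, and only the constructions themselves follow the text:

```python
def make_time_grid(a):
    """t(0) = 0, t(n) = sum_{i < n} a(i)."""
    t = [0.0]
    for step in a:
        t.append(t[-1] + step)
    return t

def interpolate(t_grid, xs, t):
    """Linearly interpolated trajectory x(t) between x(t(n)) and x(t(n+1))."""
    # Find n with t(n) <= t < t(n+1).
    n = max(i for i in range(len(t_grid) - 1) if t_grid[i] <= t)
    t0, t1 = t_grid[n], t_grid[n + 1]
    w = (t1 - t) / (t1 - t0)
    return [w * xs[n][j] + (1 - w) * xs[n + 1][j] for j in range(len(xs[n]))]

def rescale_factor(x_Tn, a_radius):
    """r(n) = (||x(T_n)|| / a) v 1, so that x_hat(T_n) lies in the closed
    ball of radius a around the origin."""
    norm = sum(v * v for v in x_Tn) ** 0.5
    return max(norm / a_radius, 1.0)

# Dummy iterates in R^1 with step sizes a(n) = 1/(n+1):
a = [1.0 / (n + 1) for n in range(10)]
xs = [[float(n)] for n in range(11)]
t_grid = make_time_grid(a)
# Interpolate halfway between t(3) and t(4): the result lies between x_3 and x_4.
mid = interpolate(t_grid, xs, 0.5 * (t_grid[3] + t_grid[4]))
```

The projective scheme of the text then amounts to dividing t_grid into T-length windows and dividing the trajectory by rescale_factor at the start of each window.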
2.4 Outline of the proof
As explained previously, we divide time "roughly" into intervals of length T. At the beginning of every T-length interval, T_n, we check whether x(T_n) is outside B̄_a(0). If so, the trajectory {x(T_n + t) | 0 ≤ t < T_{n+1} − T_n} is scaled by a/‖x(T_n)‖, i.e., projected onto B̄_a(0); otherwise, the trajectory is scaled by 1, i.e., left unchanged. If the iterates given by (3) are unstable (sup_n ‖x_n‖ = ∞), then it can be shown that sup_n r(n) = ∞. Let {T_{l(n)}}_{n≥0} ⊆ {T_n}_{n≥0} be such that ‖x(T_{l(n)})‖ ↑ ∞. We show that any limit of {x̂(T_{l(n)} + t), 0 ≤ t ≤ T | n ≥ 0} (the rescaled T-length trajectories) in C([0, T], R^d) is a solution to ẋ(t) ∈ G∞(x(t)). In other words, the "rescaled versions of the unstable T-length trajectories" track a solution to ẋ(t) ∈ G∞(x(t)). We then use assumption (A4) to conclude that the original "unstable T-length trajectories" {x(T_{l(n)} + t), 0 ≤ t ≤ T | n ≥ 0} are forced to make jumps that "run off" to infinity, the key point being that these long jumps are made within T-length intervals. But the length of a jump within a fixed time interval is upper-bounded by a fixed constant, by the discrete version of Gronwall's inequality. This gives a contradiction, and hence stability. Once stability is established, Theorem 3.6 & Lemma 3.8 from Benaïm, Hofbauer and Sorin [2] are invoked to conclude that the iterates converge to a closed, connected, internally chain transitive, invariant set of ẋ(t) ∈ G(x(t)).

The objective of running (3) is to ensure convergence to the minimum set. Since it is no longer required that the estimation errors vanish over time, the best one can hope for is almost sure convergence to a δ-neighborhood of the minimum set. Here δ > 0 is the allowance that the person running the simulation is willing to make in order that the sensitivity parameters be held constant, for ease of computation among other reasons. Given δ, we use Theorem 2.1 from Benaïm, Hofbauer and Sorin [10] to get ǫ(δ), the maximum estimation error permitted at each stage. In other words, we fix ǫ := ǫ(δ) in (A1). Using all of the above, we conclude that the iterates converge almost surely to a δ-neighborhood of the minimum set.
2.5 Limits of the "rescaled unstable T-length trajectories" in C([0, T], R^d)
First we present a few auxiliary lemmas that are needed in the proof of Theorem 1. We do not present proofs for the first three lemmas, since they can be found in Borkar [4] or Benaïm, Hofbauer and Sorin [2]. The first two lemmas concern the almost sure convergence of the rescaled martingale noise, while the rest are needed to prove the stability theorem, Theorem 1.
Lemma 2. sup_{t∈[0,T]} E‖x̂(t)‖² < ∞.
Lemma 3. The rescaled sequence {ζ̂_n}_{n≥1}, where ζ̂_n = Σ_{k=0}^{n−1} a(k)M̂_{k+1}, is convergent almost surely.

The following lemma states that the rescaled trajectories are bounded almost surely.

Lemma 4. sup_{t∈[0,∞)} ‖x̂(t)‖ < ∞ a.s.

Now we define a few terms that are used later on. Let A = {ω | {ζ̂_n(ω)}_{n≥1} converges}. Since ζ̂_n, n ≥ 1, converges on A, there exists M_ω < ∞, possibly sample path dependent, such that ‖Σ_{l=0}^{k−1} a(m(n) + l)M̂_{m(n)+l+1}‖ ≤ M_ω, where M_ω is independent of n and k. Also, let sup_{t≥0} ‖x̂(t)‖ ≤ K_ω, where K_ω := (1 + M_ω + (T + 1)K) e^{K(T+1)}
is also a constant that is sample path dependent. Let x^n(t), t ∈ [0, T], be the solution (up to time T) to ẋ^n(t) = ĝ(T_n + t) with initial condition x^n(0) = x̂(T_n); recall the definition of ĝ(·) from Section 2.3. Clearly, we have

x^n(t) = x̂(T_n) + ∫_0^t ĝ(T_n + z) dz.    (6)
The following two lemmas are inspired by ideas from Benaïm, Hofbauer and Sorin [2] as well as Borkar [4]. The first states that the limit sets of {x^n(·) | n ≥ 0} and {x̂(T_n + ·) | n ≥ 0} coincide in C([0, T], R^d).

Lemma 5. lim_{n→∞} sup_{t∈[T_n, T_n+T]} ‖x^n(t) − x̂(t)‖ = 0 a.s.
Proof. Let t ∈ [t(m(n)+k), t(m(n)+k+1)) with t(m(n)+k+1) ≤ T_{n+1}. We first assume that t(m(n)+k+1) < T_{n+1}. We have

x̂(t) = ((t(m(n)+k+1) − t)/a(m(n)+k)) x̂(t(m(n)+k)) + ((t − t(m(n)+k))/a(m(n)+k)) x̂(t(m(n)+k+1)).

Substituting x̂(t(m(n)+k+1)) = x̂(t(m(n)+k)) + a(m(n)+k)(ĝ(t(m(n)+k)) + M̂_{m(n)+k+1}) in the above equation, we get

x̂(t) = x̂(t(m(n)+k)) + (t − t(m(n)+k)) (ĝ(t(m(n)+k)) + M̂_{m(n)+k+1}).

Unfolding x̂(t(m(n)+k)) over k, we get

x̂(t) = x̂(T_n) + Σ_{l=0}^{k−1} a(m(n)+l) (ĝ(t(m(n)+l)) + M̂_{m(n)+l+1}) + (t − t(m(n)+k)) (ĝ(t(m(n)+k)) + M̂_{m(n)+k+1}).    (7)
Let us consider x^n(t) = x̂(T_n) + ∫_0^t ĝ(T_n + z) dz. Splitting the integral, we get

x^n(t) = x̂(T_n) + Σ_{l=0}^{k−1} a(m(n)+l) ĝ(t(m(n)+l)) + (t − t(m(n)+k)) ĝ(t(m(n)+k)).    (8)

From (7) and (8), it follows that
‖x^n(t) − x̂(t)‖ ≤ ‖ Σ_{l=0}^{k−1} a(m(n)+l) M̂_{m(n)+l+1} + (t − t(m(n)+k)) M̂_{m(n)+k+1} ‖,

and hence

‖x^n(t) − x̂(t)‖ ≤ ‖ζ̂_{m(n)+k} − ζ̂_{m(n)}‖ + ‖ζ̂_{m(n)+k+1} − ζ̂_{m(n)+k}‖.

If t(m(n)+k+1) = T_{n+1}, then in the proof we may replace x̂(t(m(n)+k+1)) with x̂(T⁻_{n+1}); the arguments remain the same. Since ζ̂_n, n ≥ 1, converges almost surely, the desired result follows.

Let us view the sets {x^n(t), t ∈ [0, T] | n ≥ 0} and {x̂(T_n + t), t ∈ [0, T] | n ≥ 0} as subsets of C([0, T], R^d). Since {x^n(t), t ∈ [0, T] | n ≥ 0} is equi-continuous and point-wise bounded, it follows from the Arzelà-Ascoli theorem that it is relatively compact. Further, it follows from Lemma 5 and the aforementioned that the set {x̂(T_n + t), t ∈ [0, T] | n ≥ 0} is also relatively compact in C([0, T], R^d).

Lemma 6. Let r(n) ↑ ∞. Then any limit point of {x̂(T_n + t), t ∈ [0, T] : n ≥ 0} is of the form x(t) = x(0) + ∫_0^t g∞(s) ds, where g∞ : [0, T] → R^d is a measurable function with g∞(t) ∈ G∞(x(t)), t ∈ [0, T].

Proof. For t ≥ 0, let [t] := max{t(k) | t(k) ≤ t}. For t ∈ [T_n, T_{n+1}) we have ĝ(t) ∈ G_{r(n)}(x̂([t])) and ‖ĝ(t)‖ ≤ K(1 + ‖x̂([t])‖), since G_{r(n)} is a Marchaud map (K is the constant associated with the point-wise boundedness property). It follows from Lemma 4 that sup_{t∈[0,∞)} ‖ĝ(t)‖ < ∞ a.s. Using the observations made
earlier, we can deduce that there exists a subsequence of N, say {l} ⊆ {n}, such that x̂(T_l + ·) → x(·) in C([0, T], R^d) and ĝ(T_l + ·) → g∞(·) weakly in L²([0, T], R^d). From Lemma 5 it follows that x^l(·) → x(·) in C([0, T], R^d). Letting r(l) ↑ ∞ in

x^l(t) = x^l(0) + ∫_0^t ĝ(T_l + z) dz, t ∈ [0, T],

we get x(t) = x(0) + ∫_0^t g∞(z) dz for t ∈ [0, T]. Since ‖x̂(T_n)‖ ≤ 1 we have ‖x(0)‖ ≤ 1.
Since ĝ(T_l + ·) → g∞(·) weakly in L²([0, T], R^d), there exists {l(k)} ⊆ {l} such that

(1/N) Σ_{k=1}^{N} ĝ(T_{l(k)} + ·) → g∞(·) strongly in L²([0, T], R^d).

Further, there exists {N(m)} ⊆ {N} such that

(1/N(m)) Σ_{k=1}^{N(m)} ĝ(T_{l(k)} + ·) → g∞(·) a.e. on [0, T].

Let us fix t_0 ∈ {t | (1/N(m)) Σ_{k=1}^{N(m)} ĝ(T_{l(k)} + t) → g∞(t), t ∈ [0, T]}; then

lim_{N(m)→∞} (1/N(m)) Σ_{k=1}^{N(m)} ĝ(T_{l(k)} + t_0) = g∞(t_0).

Since G∞(x(t_0)) is convex and compact (Lemma 1), to show that g∞(t_0) ∈ G∞(x(t_0)) it is enough to show that lim_{l(k)→∞} d(ĝ(T_{l(k)} + t_0), G∞(x(t_0))) = 0. If not, ∃ ǫ > 0 and {n(k)} ⊆ {l(k)} such that d(ĝ(T_{n(k)} + t_0), G∞(x(t_0))) > ǫ. Since {ĝ(T_{n(k)} + t_0)}_{k≥1} is norm bounded, it contains a convergent subsequence. For the sake of convenience we assume that lim_{k→∞} ĝ(T_{n(k)} + t_0) = g_0 for some g_0 ∈ R^d. Since ĝ(T_{n(k)} + t_0) ∈ G_{r(n(k))}(x̂([T_{n(k)} + t_0])) and lim_{k→∞} x̂([T_{n(k)} + t_0]) = x(t_0), it follows from assumption (A5) that g_0 ∈ G∞(x(t_0)). This leads to a contradiction.

Note that in the statement of Lemma 6 we can replace 'r(n) ↑ ∞' by 'r(k) ↑ ∞', where {r(k)} is a subsequence of {r(n)}. Specifically, we can conclude that any limit point of {x̂(T_k + t), t ∈ [0, T]}_{{k}⊆{n}} in C([0, T], R^d), conditioned on r(k) ↑ ∞, is of the form x(t) = x(0) + ∫_0^t g∞(z) dz, where g∞(t) ∈ G∞(x(t)) for t ∈ [0, T]. It should be noted that g∞(·) may be sample path dependent. Recall δ_1, δ_2, δ_3, δ_4 from Section 2.2 (see the lines following (A4)). The following is an immediate consequence of Lemma 6.

Corollary 1. ∃ 1 < R_0 < ∞ such that for all r(l) > R_0, ‖x̂(T_l + ·) − x(·)‖ < δ_3 − δ_2, where {l} ⊆ N and x(·) is a solution (up to time T) of ẋ(t) ∈ G∞(x(t)) with ‖x(0)‖ ≤ 1. The form of x(·) is as given by Lemma 6.
Proof. Assume to the contrary that ∃ r(l) ↑ ∞ such that x̂(T_l + ·) is at least δ_3 − δ_2 away from any solution to the DI. It follows from Lemma 6 that there exists a subsequence of {x̂(T_l + t), 0 ≤ t ≤ T : l ⊆ N} guaranteed to converge, in C([0, T], R^d), to a solution of ẋ(t) ∈ G∞(x(t)) with ‖x(0)‖ ≤ 1. This is a contradiction.

It is worth noting that R_0 may be sample path dependent. Since T = T(δ_2 − δ_1) + 1, we get ‖x̂([T_l + T])‖ < δ_3 for all T_l such that r(l) > R_0.
3 Main Results
In this section we show that gradient-based learning algorithms given by (3) are stable and converge to a "small neighborhood" of the minimum set, provided assumptions (A1)−(A5) are satisfied. If sup_n r(n) < ∞, then the iterates are stable and there is nothing to prove. If, on the other hand, sup_n r(n) = ∞, there exists {l} ⊆ {n} such that r(l) ↑ ∞. It follows from Lemma 6 that any limit point of {x̂(T_l + t), t ∈ [0, T] : {l} ⊆ {n}} is of the form x(t) = x(0) + ∫_0^t g∞(s) ds, where g∞(t) ∈ G∞(x(t)) for t ∈ [0, T]. From assumption (A4), we have ‖x(T)‖ < δ_2. Since the time intervals are roughly T apart, for large values of r(n) we have ‖x̂(T⁻_{n+1})‖ < δ_3, where x̂(T⁻_{n+1}) = lim_{t↑T_{n+1}} x̂(t), t ∈ [T_n, T_{n+1}). We are now ready to prove the stability of (3) under (A1)−(A5).

Theorem 1 (Stability of the gradient scheme given by (3)). Under assumptions (A1)−(A5), the iterates given by (3) are stable, i.e., sup_n ‖x_n‖ < ∞ a.s.
Proof. As explained earlier, it is sufficient to consider the case sup_n r(n) = ∞. Let {l} ⊆ {n} be such that r(l) ↑ ∞. Recall that T_l = t(m(l)) and that [T_l + T] = max{t(k) | t(k) ≤ T_l + T}. We have ‖x(T)‖ < δ_2, since x(·) is a solution, up to time T, of the DI ẋ(t) ∈ G∞(x(t)); recall that T = T(δ_2 − δ_1) + 1. From Lemma 6 we conclude that there exists N such that all of the following hold:
(i) m(l) ≥ N ⟹ ‖x̂([T_l + T])‖ < δ_3.
(ii) n ≥ N ⟹ a(n) < (δ_4 − δ_3)/(K(1 + K_ω) + M_ω).
(iii) n > m ≥ N ⟹ ‖ζ̂_n − ζ̂_m‖ < M_ω.
(iv) m(l) ≥ N ⟹ r(l) > R_0.
In the above, R_0 is defined in the statement of Corollary 1, and K_ω, M_ω are explained after Lemma 4. Recall that sup_{x∈A} ‖x‖ = δ_1 < δ_2 < δ_3 < δ_4 < a (see the lines following (A4) in Section 2.2). Let m(l) ≥ N and t(m(l+1)) = t(m(l) + k + 1) for some k ≥ 0. Clearly, from the manner in which the sequence {T_n} is defined, we have t(m(l) + k) = [T_l + T]. As defined earlier, x̂(T⁻_{n+1}) = lim_{t↑t(m(n+1))} x̂(t), t ∈ [T_n, T_{n+1}), n ≥ 0. We have

x̂(T⁻_{l+1}) = x̂(t(m(l) + k)) + a(m(l) + k) (ĝ(t(m(l) + k)) + M̂_{m(l)+k+1}).
Taking norms on both sides, we get

‖x̂(T⁻_{l+1})‖ ≤ ‖x̂(t(m(l) + k))‖ + a(m(l) + k)‖ĝ(t(m(l) + k))‖ + a(m(l) + k)‖M̂_{m(l)+k+1}‖.

From the way we have chosen N, we conclude that ‖ĝ(t(m(l) + k))‖ ≤ K(1 + ‖x̂(t(m(l) + k))‖) ≤ K(1 + K_ω) and ‖M̂_{m(l)+k+1}‖ = ‖ζ̂_{m(l)+k+1} − ζ̂_{m(l)+k}‖ ≤ M_ω. Thus we get

‖x̂(T⁻_{l+1})‖ ≤ ‖x̂(t(m(l) + k))‖ + a(m(l) + k)(K(1 + K_ω) + M_ω).

Finally, we have ‖x̂(T⁻_{l+1})‖ < δ_4 and

‖x(T⁻_{l+1})‖ / ‖x(T_l)‖ = ‖x̂(T⁻_{l+1})‖ / ‖x̂(T_l)‖ < δ_4/a < 1.    (9)

It follows from (9) that ‖x(T_{n+1})‖ < (δ_4/a)‖x(T_n)‖ if ‖x(T_n)‖ > R_0. From Corollary 1 and the aforementioned, we get that the trajectory falls at an exponential rate until it enters B̄_{R_0}(0). Let t ≤ T_l, t ∈ [T_n, T_{n+1}) with n + 1 ≤ l, be the last time that x(t) jumps from B̄_{R_0}(0) to the outside of the ball. It follows that ‖x(T_{n+1})‖ ≥ ‖x(T_l)‖. Since r(l) ↑ ∞, x(t) would be forced to make larger and larger jumps within an interval of length T + 1. This leads to a contradiction, since the maximum jump within any fixed time interval can be bounded using the Gronwall inequality.

We now give the main theorem.

Theorem 2 (Stability and convergence of (3)). Under assumptions (A1)−(A5), almost surely, the iterates {x_n}_{n≥0} of the gradient-based scheme given by (3) are bounded and converge to a closed, connected, internally chain transitive and invariant set of ẋ(t) ∈ G(x(t)).

Proof. The stability of the iterates is shown in Theorem 1. The convergence can be proved, under assumptions (A1)−(A3) and the stability of the iterates, in exactly the same manner as in Theorem 3.6 & Lemma 3.8 of Benaïm, Hofbauer and Sorin [2].

Since the estimation errors do not vanish over time, we can only expect the iterates to converge to some neighborhood of the minimum set. In what follows we show that we have control over this neighborhood. For the iterates given by (3) to converge to a δ-neighborhood of the minimum set of F, where δ > 0 is fixed a priori, the estimation error made at each stage is required to be bounded by ǫ(δ), where ǫ(δ) is given by the following result.

Corollary 2. Given δ > 0, ∃ ǫ(δ) > 0 such that choosing 2ǫ ≤ ǫ(δ) in (A1) implies that the iterates given by (3) converge to N^δ(M), where M is the minimum set of F.
Proof. The graph of the set-valued map $G$ is defined by $\mathrm{Graph}(G) := \{(x, y) \mid x \in \mathbb{R}^d,\ y \in G(x)\}$; see (A1) in Section 2.2 for the definition of $G$. Clearly $\mathrm{Graph}(G) \subset \mathrm{Graph}(N^{2\epsilon}(\eth F))$. Suppose that the global and the local minima coincide; then $\mathcal{M}$ is the global attractor of $\dot{x}(t) = \eth F(x(t))$, and every compact set $K$ is a fundamental neighborhood of $\mathcal{M}$. We get the following from Theorem 2.1 of Benaïm, Hofbauer and Sorin [10]: given $\delta > 0$ and a compact fundamental neighborhood $K$ of $\mathcal{M}$, there exists $\epsilon(\delta) > 0$ such that the DI $\dot{x}(t) \in H(x(t))$ has a unique attractor $\mathcal{M}' \subseteq N^{\delta}(\mathcal{M})$, as long as $H(x) \subseteq N^{\epsilon(\delta)}(\eth F(x))$ for each $x \in \mathbb{R}^d$. Further, $K$ is also the fundamental neighborhood associated with $\mathcal{M}'$.

It follows from Theorem 1 that the iterates given by (3) belong to some compact set $K$ that may be sample-path dependent. Given $\delta > 0$ and $K$, we obtain the $\epsilon(\delta)$ associated with $\dot{x}(t) = \eth F(x(t))$ from Theorem 2.1 of [10]. If we fix $2\epsilon = \epsilon(\delta)$ in (A1), then $G(x) \subseteq N^{\epsilon(\delta)}(\eth F(x))$, and it follows that $\dot{x}(t) \in G(x(t))$ has an attractor inside $N^{\delta}(\mathcal{M})$ whose fundamental neighborhood is $K$. From Theorem 2, the iterates given by (3) track a solution to $\dot{x}(t) \in G(x(t))$; hence they converge to $N^{\delta}(\mathcal{M})$.
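The behavior promised by Corollary 2 can be illustrated numerically. Below is a minimal sketch of our own (not the paper's exact scheme (3)): gradient descent on $F(x) = x_1^2 + x_2^2$ using a Kiefer-Wolfowitz-style central-difference gradient estimator built from noisy observations $N(x) = F(x) + \text{noise}$, with the perturbation parameter $p$ held at a small constant throughout. The function names and parameter values are illustrative choices, not from the paper.

```python
import random

random.seed(0)

def F(x):
    """Objective F(x) = sum of squares; minimum set M = {(0, 0)}."""
    return sum(t * t for t in x)

def noisy_obs(x, sigma=0.1):
    """Noisy measurement N(x) of the objective, as in Kiefer-Wolfowitz."""
    return F(x) + random.gauss(0.0, sigma)

def fd_gradient(x, p):
    """Central-difference gradient estimate with a *fixed* perturbation p
    that is never annealed to zero (constant-error estimator)."""
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += p
        xm = list(x); xm[i] -= p
        g.append((noisy_obs(xp) - noisy_obs(xm)) / (2.0 * p))
    return g

def run(x0, p=0.1, steps=2000):
    x = list(x0)
    for n in range(1, steps + 1):
        a_n = 1.0 / n                      # step-size a(n); p is NOT annealed
        g = fd_gradient(x, p)
        x = [xi - a_n * gi for xi, gi in zip(x, g)]
    return x

x_final = run([2.0, -3.0])
dist = max(abs(t) for t in x_final)        # sup-norm distance to M
```

Consistent with Corollary 2, the iterates settle in a small neighborhood of the minimum set rather than necessarily hitting the minimizer exactly; shrinking $p$ (and the observation noise) shrinks that neighborhood.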
4 Conclusion
This paper provides easily verifiable sufficient conditions for the stability of gradient-based learning algorithms that use constant-error finite difference estimators. Using the framework presented here, we showed convergence of the iterates to a small neighborhood of the minimum set even when the FD methods make constant estimation errors at each stage, unlike frameworks hitherto present in the literature, which require the estimation errors to vanish over time. Implementing the algorithm becomes easier since the perturbation parameters of the FD method can be tuned at the beginning of the simulation and then held constant until the end. Further, we have dispensed with the assumption that couples the step-sizes and the perturbation parameters, i.e., $\sum_n \frac{a(n)^2}{p(n)^2} < \infty$. Since the learning rate is governed by the choice of step-size, it is certainly desirable that such a coupling be dispensed with.
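To see why this coupling is restrictive, consider the classical annealed choices $a(n) = n^{-1}$ and $p(n) = n^{-\gamma}$: the summand of the coupling condition decays like $n^{-(2-2\gamma)}$, so the condition holds only when $\gamma < 1/2$, forcing the perturbation to shrink slowly relative to the step-size. With a constant $p$, the condition collapses to the usual $\sum_n a(n)^2 < \infty$. The following small numerical check (our own example, not from the paper) illustrates the dichotomy:

```python
# Partial sums of a(n)^2 / p(n)^2 with a(n) = 1/n and p(n) = n^(-gamma).
# The summand is n^(-(2 - 2*gamma)): for gamma < 1/2 the series converges,
# for gamma > 1/2 it diverges.

def partial_sum(gamma, N=100000):
    """Partial sum of a(n)^2 / p(n)^2 over n = 1, ..., N-1."""
    return sum((n ** -1.0) ** 2 / (n ** -gamma) ** 2 for n in range(1, N))

s_ok = partial_sum(0.25)    # summand ~ n^-1.5: partial sums stabilize
s_bad = partial_sum(0.75)   # summand ~ n^-0.5: partial sums keep growing
```

Comparing partial sums at different truncation points makes the contrast concrete: for $\gamma = 0.25$ the sums barely move past a few thousand terms, while for $\gamma = 0.75$ they grow roughly like $2\sqrt{N}$.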
References

[1] J. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory. Springer, 1984.
[2] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, pages 328–348, 2005.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[5] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38:447–469, 1999.
[6] S. S. Haykin. Neural Networks and Learning Machines, volume 3. Pearson Education, Upper Saddle River, 2009.
[7] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[8] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In NIPS, volume 13, pages 1008–1014, 1999.
[9] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media, 2012.
[10] M. Benaïm, J. Hofbauer, and S. Sorin. Perturbations of set-valued dynamical systems, with applications to game theory. Dynamic Games and Applications, 2(2):195–205, 2012.
[11] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[12] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo Method, volume 707. John Wiley & Sons, 2011.
[13] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
[14] J. C. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10):1839–1853, 2000.
[15] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons, 2005.
[16] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063, 1999.