CONVERGENCE AND CONVERGENCE RATE OF STOCHASTIC GRADIENT SEARCH IN THE CASE OF MULTIPLE AND NON-ISOLATED EXTREMA

VLADISLAV B. TADIĆ∗

∗ Department of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, United Kingdom ([email protected]).
Abstract. The asymptotic behavior of stochastic gradient algorithms is studied. Relying on results from differential geometry (the Lojasiewicz gradient inequality), the almost sure point-convergence is demonstrated and relatively tight almost sure bounds on the convergence rate are derived. In sharp contrast to all existing results of this kind, the asymptotic results obtained here do not require the objective function (associated with the stochastic gradient search) to have an isolated minimum at which the Hessian of the objective function is strictly positive definite. Using the obtained results, the asymptotic behavior of recursive prediction error identification methods is analyzed. The convergence and convergence rate of supervised learning algorithms are also studied relying on these results.

Key words. Stochastic gradient search, point-convergence, convergence rate, Lojasiewicz gradient inequality, system identification, recursive prediction error, ARMA models, machine learning, supervised learning, feedforward neural networks.

AMS subject classifications. Primary 62L20; Secondary 90C15, 93E12, 93E35.
1. Introduction. Stochastic optimization is at the core of many engineering, statistics and finance problems. A stochastic optimization problem can be described as the minimization (or maximization) of an objective function in a situation when only noise-corrupted observations of the function values are available. Such a problem can be solved efficiently by stochastic gradient search, a stochastic approximation version of the deterministic steepest descent method. Due to their excellent performance (generality, robustness, low complexity, easy implementation), stochastic gradient algorithms have gained wide attention in the literature and have found a broad range of applications in diverse areas such as signal processing, system identification, automatic control, machine learning, operations research, statistical inference, econometrics and finance (see e.g. [2], [7], [9], [10], [11], [16], [17], [22], [24], [25], [26] and references cited therein).

Various asymptotic properties of stochastic gradient algorithms have been the subject of a number of papers and books (see [1], [14], [16], [24], [26] and references cited therein). Among them, the almost sure convergence and the convergence rate have received the greatest attention, as these properties most precisely characterize the asymptotic behavior and efficiency of stochastic gradient search. Although the existing results provide a good insight into the convergence and convergence rate, they hold only under very restrictive conditions. More specifically, the existing results require the objective function (which the stochastic gradient search minimizes) to have an isolated minimum such that the Hessian of the objective function is strictly positive definite at the minimum and such that the attraction domain of the minimum is visited infinitely often by the algorithm iterates. However, in the case of complex, high-dimensional, highly nonlinear algorithms, this is not only hard (if possible at all) to verify, but is likely not to be true.

In this paper, the convergence and convergence rate of stochastic gradient search are analyzed when the objective function has multiple, non-isolated minima (notice
that at a non-isolated minimum, the Hessian can be semi-definite at best). Using some results of differential geometry (the Lojasiewicz gradient inequality), the almost sure point-convergence is demonstrated and relatively tight almost sure bounds on the convergence rate are derived. The obtained results cover a wide class of complex stochastic gradient algorithms. We show how they can be used to analyze the asymptotic behavior of recursive prediction error algorithms for identification of linear stochastic systems. We also show how the convergence and convergence rate of supervised learning in feedforward neural networks can be analyzed using the results obtained here.

The paper is organized as follows. In Section 2, stochastic gradient algorithms with additive noise are considered and the main results of the paper are presented. Section 3 is devoted to stochastic gradient algorithms with Markovian dynamics. Sections 4 and 5 contain examples of the results reported in Sections 2 and 3. In Section 4, supervised learning algorithms for feedforward neural networks are studied, while recursive prediction error algorithms for identification of linear stochastic systems are analyzed in Section 5. Sections 6 – 9 contain the proofs of the results presented in Sections 2 – 5.

2. Main Results. In this section, the convergence and convergence rate of the following algorithm are analyzed:

θ_{n+1} = θ_n − α_n (∇f(θ_n) + ξ_n),    n ≥ 0.    (2.1)
Here, f : R^{d_θ} → R is a differentiable function, while {α_n}_{n≥0} is a sequence of positive real numbers. θ_0 is an R^{d_θ}-valued random variable defined on a probability space (Ω, F, P), while {ξ_n}_{n≥0} is an R^{d_θ}-valued stochastic process defined on the same probability space. To allow more generality, we assume that for each n ≥ 0, ξ_n is a random function of θ_0, ..., θ_n. In the area of stochastic optimization, recursion (2.1) is known as a stochastic gradient search (or stochastic gradient algorithm), while the function f(·) is referred to as an objective function. For further details, see [22], [26] and references given therein.

Throughout the paper, unless otherwise stated, the following notation is used. The Euclidean norm is denoted by ‖·‖, while d(·, ·) stands for the distance induced by the Euclidean norm. S is the set of stationary points of f(·), i.e., S = {θ ∈ R^{d_θ} : ∇f(θ) = 0}. The sequence {γ_n}_{n≥0} is defined by γ_0 = 0 and

γ_n = \sum_{i=0}^{n-1} α_i

for n ≥ 1. For t ∈ (0, ∞) and n ≥ 0, a(n, t) is the integer defined as

a(n, t) = max{k ≥ n : γ_k − γ_n ≤ t}.

Algorithm (2.1) is analyzed under the following assumptions:

Assumption 2.1. lim_{n→∞} α_n = 0 and \sum_{n=0}^{∞} α_n = ∞.

Assumption 2.2. There exists a real number r ∈ (1, ∞) such that
ξ = lim sup_{n→∞} max_{k≥n} ‖ \sum_{i=n}^{k} α_i γ_i^r ξ_i ‖ < ∞.
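To fix ideas, the following minimal sketch simulates recursion (2.1) and the quantities γ_n and a(n, t) defined above. The step-size choice α_n = 1/(n+1)^{2/3}, the Gaussian noise and the quadratic test function are illustrative assumptions and are not part of the setting considered here.

```python
import numpy as np

def stochastic_gradient_search(grad_f, theta0, n_iter, rng, noise_std=0.1):
    """Simulate recursion (2.1): theta_{n+1} = theta_n - alpha_n * (grad f(theta_n) + xi_n)."""
    theta = np.array(theta0, dtype=float)
    alphas = np.arange(1, n_iter + 1) ** (-2.0 / 3.0)      # assumed step sizes alpha_n
    gammas = np.concatenate(([0.0], np.cumsum(alphas)))     # gamma_0 = 0, gamma_n = sum_{i<n} alpha_i
    path = [theta.copy()]
    for n in range(n_iter):
        xi = noise_std * rng.standard_normal(theta.shape)   # noise xi_n (illustrative assumption)
        theta = theta - alphas[n] * (grad_f(theta) + xi)
        path.append(theta.copy())
    return np.array(path), gammas

def a(gammas, n, t):
    """a(n, t) = max{k >= n : gamma_k - gamma_n <= t}."""
    k = n
    while k + 1 < len(gammas) and gammas[k + 1] - gammas[n] <= t:
        k += 1
    return k

# Example: f(theta) = ||theta||^2 / 2, so grad f(theta) = theta.
rng = np.random.default_rng(0)
path, gammas = stochastic_gradient_search(lambda th: th, [1.0, -2.0], 5000, rng)
print(path[-1], a(gammas, 100, 1.0))
```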
The proofs are provided in Section 6.

As an immediate consequence of the previous theorems, we get the following corollaries:

Corollary 2.1. Let Assumptions 2.1 – 2.3 hold. Then, the following is true:
(i) ‖∇f(θ_n)‖² = o(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = o(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = o(γ_n^{-q̂}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞} ∩ {ξ = 0, r̂ > r}.
(ii) ‖∇f(θ_n)‖² = O(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = O(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = O(γ_n^{-q̂}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞} ∩ {ξ = 0, r̂ > r}^c.
(iii) ‖∇f(θ_n)‖² = o(γ_n^{-p}) and |f(θ_n) − f(θ̂)| = o(γ_n^{-p}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞}, where p = min{1, r}.

In the literature on stochastic and deterministic optimization, the asymptotic behavior of gradient search is usually characterized by the convergence of the sequences {∇f(θ_n)}_{n≥0}, {f(θ_n)}_{n≥0} and {θ_n}_{n≥0} (see e.g., [3], [4], [23], [24] and references quoted therein). Similarly, the convergence rate can be described by the rates at which {∇f(θ_n)}_{n≥0}, {f(θ_n)}_{n≥0} and {θ_n}_{n≥0} tend to the sets of their limit points. In the case of algorithm (2.1), this kind of information is provided by Theorems 2.1, 2.2 and Corollary 2.1. Theorem 2.1 claims that almost surely, algorithm (2.1) is point-convergent and does not exhibit limit cycles. Theorem 2.2 and Corollary 2.1 provide relatively tight upper bounds on the convergence rate of {∇f(θ_n)}_{n≥0}, {f(θ_n)}_{n≥0}
and {θ_n}_{n≥0}. These bounds can be thought of as a combination of the convergence rate of the gradient flow dθ/dt = −∇f(θ) (characterized by the Lojasiewicz exponent μ_{θ̂}) and the rate of the noise averages \sum_{i=n}^{k} α_i ξ_i (expressed through the parameter r and the sequence {γ_n}_{n≥0}). Basically, Theorem 2.2 and Corollary 2.1 claim that the convergence rate of {‖∇f(θ_n)‖²}_{n≥0} and {f(θ_n)}_{n≥0} is the slower of the rates O(γ_n^{-r̂μ̂}) (the rate of the gradient flow dθ/dt = −∇f(θ) sampled at the instants {γ_n}_{n≥0}) and O(γ_n^{-rμ̂}) (the rate of the noise averages max_{k≥n} ‖\sum_{i=n}^{k} α_i ξ_i‖^{μ̂}).

Apparently, the results of Theorems 2.1, 2.2 and Corollary 2.1 are of a local nature: They hold only on the event where algorithm (2.1) is stable (i.e., where the sequence {θ_n}_{n≥0} is bounded). Stating results on the convergence and convergence rate in such a local form is quite sensible for the following reasons. The stability of stochastic gradient search is based on well-understood arguments which are rather different from the arguments used in the analysis of the convergence and convergence rate. Moreover, and more importantly, it is straightforward to get a global version of the results provided in Theorems 2.1, 2.2 and Corollary 2.1 by combining the theorems with the methods used to verify or ensure stability (e.g., with the results of [6] and [8]).

The point-convergence and convergence rate of stochastic gradient search (and stochastic approximation) have been the subject of a large number of papers and books (see [1], [14], [16], [24], [26] and references cited therein). Although the existing results provide a good insight into the asymptotic behavior and efficiency of stochastic gradient algorithms, they are based on fairly restrictive assumptions: Literally all of them require the objective function f(·) to have an isolated minimum θ* (sometimes even to be strongly unimodal) such that the Hessian ∇²f(θ*) is strictly positive definite and such that {θ_n}_{n≥0} visits the attraction domain of θ* infinitely many times w.p.1. Unfortunately, in the case of high-dimensional, highly nonlinear stochastic gradient algorithms (such as online machine learning and recursive identification), it is hard (if not impossible) to show even the existence of an isolated minimum, let alone the definiteness of ∇²f(θ*) or the infinitely many visits of {θ_n}_{n≥0} to the attraction domain of θ*. Moreover, and more importantly, these requirements are unlikely to be satisfied by a high-dimensional, highly nonlinear algorithm, as the objective function associated with such an algorithm is prone to having manifolds of (non-isolated) minima and (non-isolated) saddles, each of which is a potential limit point of the algorithm iterates (e.g., a recursive prediction error identification method exhibits this behavior when the candidate models are overparameterized or do not match the true system).

Relying on the Lojasiewicz gradient inequality, Theorems 2.1, 2.2 and Corollary 2.1 overcome the described difficulties: Both theorems and their corollary allow the objective function f(·) to have multiple, non-isolated minima, impose no restriction on the values of ∇²f(·) (notice that ∇²f(·) cannot be strictly definite at a non-isolated minimum or maximum) and do not require (a priori) {θ_n}_{n≥0} to exhibit any particular behavior (i.e., to visit infinitely often the attraction domain of an isolated minimum). Moreover, they cover a broad class of complex stochastic gradient algorithms (see Sections 4 and 5; see also [28], [29]).
To the best of our knowledge, these are the only results on the convergence and convergence rate of stochastic search which enjoy such features. Regarding the results of Theorems 2.1, 2.2 and Corollary 2.1, it is worth mentioning that they are not just a combination of the Lojasiewicz inequality and the existing techniques for the asymptotic analysis of stochastic gradient search and stochastic approximation. On the contrary, the existing techniques seem to be completely
inapplicable to high-dimensional, highly nonlinear stochastic gradient search. The reason comes from the fact that these techniques crucially rely on the following Lyapunov function:

w(θ) = (θ − θ*)^T ∇²f(θ*)(θ − θ*),

where θ* is an isolated minimum such that ∇²f(θ*) is strictly positive definite and such that the attraction domain of θ* is visited by {θ_n}_{n≥0} infinitely many times w.p.1. In this paper, we take an entirely different approach whose main steps can be summarized as follows:
1. The convergence of {f(θ_n)}_{n≥0} is demonstrated.
2. A 'singular' Lyapunov function

v(θ) = (f(θ) − f̂)^{-1/p} if f(θ) > f̂, and v(θ) = 0 otherwise,

is constructed, where f̂ = lim_{n→∞} f(θ_n) and p is a suitable positive constant. Relying on this function, the convergence rate of {f(θ_n)}_{n≥0} and {∇f(θ_n)}_{n≥0} is evaluated.
3. Using the results derived at Step 2, the convergence rate of sup_{k≥n} ‖θ_k − θ_n‖ is assessed.
4. Applying the results of Step 3, the point-convergence of {θ_n}_{n≥0} is demonstrated. Then, refining the convergence rates derived at Steps 2 and 3, the results of Theorem 2.2 are obtained.

At the core of our approach is the singular Lyapunov function v(·). Although subtle techniques are needed to handle such a function (see Section 6), v(·) provides an intuitively clear explanation of the results of Theorem 2.2 and Corollary 2.1. The explanation is based on the heuristic analysis of the following two cases.¹

Case 2.1: lim inf_{n→∞} γ_n^{μ̂r} (f(θ_n) − f̂) = −∞ and sup_{n≥0} ‖θ_n‖ < ∞, where μ̂ is defined in Theorem 2.2. In this case, there exists an increasing integer sequence {n_k}_{k≥0} such that lim_{k→∞} γ_{n_k}^{μ̂r} (f(θ_{n_k}) − f̂) = −∞. Owing to Assumption 2.3, we have

‖∇f(θ_n)‖ ≥ ( |f(θ_n) − f̂| / M̂ )^{1/μ̂}    (2.8)
for sufficiently large n, where M̂ = M_{θ̂}. Consequently, lim_{k→∞} γ_{n_k}^r ‖∇f(θ_{n_k})‖ = ∞. On the other side, the Taylor formula yields

f(θ_n) ≈ f(θ_{n_k}) − (∇f(θ_{n_k}))^T \sum_{i=n_k}^{n-1} α_i (∇f(θ_i) + ξ_i)
       ≈ f(θ_{n_k}) − (γ_n − γ_{n_k}) ‖∇f(θ_{n_k})‖² − (∇f(θ_{n_k}))^T \sum_{i=n_k}^{n-1} α_i ξ_i
       ≤ f(θ_{n_k}) − ‖∇f(θ_{n_k})‖ ( (γ_n − γ_{n_k}) ‖∇f(θ_{n_k})‖ − ‖ \sum_{i=n_k}^{n-1} α_i ξ_i ‖ )
¹ Throughout this analysis, we assume that Theorem 2.1 is true. We also assume sup_{k≥n} ‖ \sum_{i=n}^{k} α_i ξ_i ‖ = O(γ_n^{-r}) as n → ∞, which is slightly stronger than what Assumption 2.2 and Lemma 6.1 yield.
for n ≥ n_k and sufficiently large k ≥ 0. Since γ_n − γ_{n_k} ≥ 1 for n > a(n_k, 1) and since

sup_{k≥n} ‖ \sum_{i=n}^{k} α_i ξ_i ‖ = O(γ_n^{-r})    (2.9)
when n → ∞, we get f(θ_n) ≤ f(θ_{n_k}) < f̂ for n > a(n_k, 1) and sufficiently large k ≥ 0. However, this is not possible, as lim_{n→∞} f(θ_n) = f̂. Thus, Case 2.1 cannot occur.

Case 2.2: lim sup_{n→∞} γ_n^p (f(θ_n) − f̂) = ∞, p < μ̂ min{r, r̂} and sup_{n≥0} ‖θ_n‖ < ∞, where μ̂, r̂ are defined in Theorem 2.2. Similarly to the previous case, there exists an increasing integer sequence {n_k}_{k≥0} such that lim_{k→∞} γ_{n_k}^p (f(θ_{n_k}) − f̂) = ∞. Then, (2.8) implies

lim_{k→∞} γ_{n_k}^r (f(θ_{n_k}) − f̂)^{1/μ̂} ≥ lim_{k→∞} γ_{n_k}^{p/μ̂} (f(θ_{n_k}) − f̂)^{1/μ̂} = ∞    (2.10)
(notice that p/μ̂ ≤ r). On the other side, the Taylor formula and (2.8) yield

v(θ_n) ≈ v(θ_{n_k}) + \frac{ (∇f(θ_{n_k}))^T \sum_{i=n_k}^{n-1} α_i (∇f(θ_i) + ξ_i) }{ p (f(θ_{n_k}) − f̂)^{1+1/p} }
       ≈ v(θ_{n_k}) + \frac{1}{ p (f(θ_{n_k}) − f̂)^{1+1/p} } ( (γ_n − γ_{n_k}) ‖∇f(θ_{n_k})‖² + (∇f(θ_{n_k}))^T \sum_{i=n_k}^{n-1} α_i ξ_i )
       ≥ v(θ_{n_k}) + \frac{ γ_n − γ_{n_k} }{ 2 p M̂^{2/μ̂} (f(θ_{n_k}) − f̂)^{1+1/p−2/μ̂} } + \frac{ ‖∇f(θ_{n_k})‖ }{ p (f(θ_{n_k}) − f̂)^{1+1/p} } ( \frac{ (γ_n − γ_{n_k}) ‖∇f(θ_{n_k})‖ }{ 2 } − ‖ \sum_{i=n_k}^{n-1} α_i ξ_i ‖ )    (2.11)
for n ≥ n_k and sufficiently large k ≥ 0. Since lim_{n→∞} f(θ_n) = f̂ and since 1 + 1/p − 2/μ̂ = 1/p − 1/(μ̂r̂) ≥ 0, relations (2.9) – (2.11) imply

v(θ_n) ≥ v(θ_{n_k}) + N̂ (γ_n − γ_{n_k})

for n > a(n_k, 1), sufficiently large k ≥ 0 and N̂ = 1/(2pM̂^{2/μ̂}). Consequently,

f(θ_n) − f̂ ≤ ( v(θ_{n_k}) + N̂ (γ_n − γ_{n_k}) )^{-p}    (2.12)

for n > a(n_k, 1) and sufficiently large k ≥ 0. However, this is impossible, as (2.12) yields lim sup_{n→∞} γ_n^p (f(θ_n) − f̂) < ∞. Hence, Case 2.2 cannot happen.

As neither of Cases 2.1 and 2.2 is possible, we conclude that f(θ_n) converges to f̂ at the rate O(γ_n^{-p̂}). Since γ_k − γ_n ≥ 1 for k > a(n, 1) and since

f(θ_k) − f(θ_n) ≈ −(γ_k − γ_n) ‖∇f(θ_n)‖² − (∇f(θ_n))^T \sum_{i=n}^{k-1} α_i ξ_i
               ≤ −((γ_k − γ_n) − 1/2) ‖∇f(θ_n)‖² + \frac{1}{2} ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖²
for k ≥ n and sufficiently large n ≥ 0, we deduce

‖∇f(θ_n)‖² ≤ −2 (f(θ_k) − f(θ_n)) + ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖²

for k > a(n, 1) and sufficiently large n ≥ 0. As an immediate consequence, we have that ‖∇f(θ_n)‖² converges to zero at the rate O(γ_n^{-p̂}).

The evaluation of the convergence rate of {θ_n}_{n≥0} is much more complicated (so that it cannot be briefly summarized here; the details are provided in Lemmas 6.6 and 6.10) and is based on the following reasoning:
‖θ_k − θ_n‖ ≤ ‖ θ_k − θ_n + \sum_{i=n}^{k-1} α_i ξ_i ‖ + ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖
            = ‖ \sum_{i=n}^{k-1} α_i ∇f(θ_i) ‖ + ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖
            ≈ (γ_k − γ_n) ‖∇f(θ_n)‖ + ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖
            ≈ −\frac{1}{‖∇f(θ_n)‖} ( f(θ_k) − f(θ_n) + (∇f(θ_n))^T \sum_{i=n}^{k-1} α_i ξ_i ) + ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖
            ≤ −\frac{ f(θ_k) − f(θ_n) }{ ‖∇f(θ_n)‖ } + 2 ‖ \sum_{i=n}^{k-1} α_i ξ_i ‖,

where k ≥ n and n ≥ 0 is sufficiently large.

The heuristic analysis of Cases 2.1 and 2.2 carried out above indicates that the convergence rates of {f(θ_n)}_{n≥0} and {∇f(θ_n)}_{n≥0} reported in Theorem 2.2 are rather tight (if not optimal; for a discussion of the tightness of the rate of {θ_n}_{n≥0}, see Remark 6.4). The same conclusion is suggested by the following two special cases:

Case 2.3: ξ_n = 0 for each n ≥ 0. Due to Assumption 2.3 and (2.8), we have

\frac{d(f(θ(t)) − f̂)}{dt} = −‖∇f(θ(t))‖² ≤ −( \frac{ f(θ(t)) − f̂ }{ M̂ } )^{2/μ̂}
for a solution θ(·) of dθ/dt = −∇f(θ) satisfying lim_{t→∞} f(θ(t)) = f̂ and θ([0, ∞)) ⊆ {θ ∈ R^{d_θ} : ‖θ − θ̂‖ ≤ δ_{θ̂}}. Consequently, f(θ(t)) − f̂ = O(t^{-μ̂/(2−μ̂)}) = O(t^{-μ̂r̂}). As {θ_n}_{n≥0} is asymptotically equivalent to θ(·) sampled at the instants {γ_n}_{n≥0}, we get f(θ_n) − f̂ = O(γ_n^{-μ̂r̂}). The same result is implied by Theorem 2.1 and Corollary 2.1.

Case 2.4: f(θ) = θ^T Aθ, where A is a strictly positive definite matrix. Recursion (2.1) reduces to a linear stochastic approximation algorithm in this case. For such an algorithm, the tightest bound on the convergence rate of {f(θ_n)}_{n≥0} and {‖∇f(θ_n)‖²}_{n≥0} is O(γ_n^{-2r}) if ξ > 0 and o(γ_n^{-2r}) if ξ = 0 (see [27]). The same rate is predicted by Theorem 2.2 and Corollary 2.1.
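As a quick numerical illustration of Case 2.4 (not part of the original analysis), the following sketch runs recursion (2.1) on a quadratic objective with i.i.d. zero-mean noise and prints f(θ_n) together with γ_n, so that the decay of f(θ_n) can be compared with the O(γ_n^{-2r}) bound discussed above. The matrix A, the step sizes and the noise level are assumptions made only for this illustration.

```python
import numpy as np

def simulate_quadratic(n_iter=200_000, a_exp=0.9, noise_std=0.1, seed=1):
    """Run recursion (2.1) on f(theta) = theta^T A theta (Case 2.4) and record f(theta_n)."""
    rng = np.random.default_rng(seed)
    A = np.array([[2.0, 0.5], [0.5, 1.0]])            # assumed strictly positive definite matrix
    theta = np.array([5.0, -3.0])
    gamma = 0.0
    records = []
    for n in range(n_iter):
        alpha = 0.5 * (n + 1.0) ** (-a_exp)           # assumed step size
        grad = 2.0 * A @ theta                        # grad f(theta) = 2 A theta
        xi = noise_std * rng.standard_normal(2)       # i.i.d. zero-mean noise xi_n
        theta = theta - alpha * (grad + xi)
        gamma += alpha
        if (n + 1) % (n_iter // 10) == 0:
            records.append((n + 1, gamma, float(theta @ A @ theta)))
    return records

# f(theta_n) should decay roughly like a negative power of gamma_n,
# in line with the O(gamma_n^{-2r}) bound discussed in Case 2.4.
for n, gamma, f_val in simulate_quadratic():
    print(f"n={n:7d}  gamma_n={gamma:10.2f}  f(theta_n)={f_val:.3e}")
```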
3. Stochastic Gradient Algorithms with Markovian Dynamics. In order to illustrate the results of Section 2 and to set up a framework for the analysis carried out in Sections 4 and 5, we apply Theorems 2.1, 2.2 and Corollary 2.1 to stochastic gradient algorithms with Markovian dynamics. These algorithms are defined by the following difference equation:

θ_{n+1} = θ_n − α_n F(θ_n, Z_{n+1}),    n ≥ 0.    (3.1)
In this recursion, F : R^{d_θ} × R^{d_z} → R^{d_θ} is a Borel-measurable function, while {α_n}_{n≥0} is a sequence of positive real numbers. θ_0 is an R^{d_θ}-valued random variable defined on a probability space (Ω, F, P), while {Z_n}_{n≥0} is an R^{d_z}-valued stochastic process defined on the same probability space. {Z_n}_{n≥0} is a Markov process controlled by {θ_n}_{n≥0}, i.e., there exists a family of transition probability kernels {Π_θ(·, ·)}_{θ∈R^{d_θ}} (defined on R^{d_z}) such that

P(Z_{n+1} ∈ B | θ_0, Z_0, ..., θ_n, Z_n) = Π_{θ_n}(Z_n, B) w.p.1

for any Borel-measurable set B ⊆ R^{d_z} and n ≥ 0. In the context of stochastic gradient search, F(θ_n, Z_{n+1}) is regarded as an estimator of ∇f(θ_n). Algorithm (3.1) is analyzed under the following assumptions.

Assumption 3.1. lim_{n→∞} α_n = 0, lim sup_{n→∞} |α_{n+1}^{-1} − α_n^{-1}| < ∞ and \sum_{n=0}^{∞} α_n = ∞. There exists a real number r ∈ (1, ∞) such that \sum_{n=0}^{∞} α_n² γ_n^{2r} < ∞.

Assumption 3.2. There exist a differentiable function f : R^{d_θ} → R and a Borel-measurable function F̃ : R^{d_θ} × R^{d_z} → R^{d_θ} such that ∇f(·) is locally Lipschitz continuous and such that

F(θ, z) − ∇f(θ) = F̃(θ, z) − (ΠF̃)(θ, z)

for each θ ∈ R^{d_θ}, z ∈ R^{d_z}, where (ΠF̃)(θ, z) = \int F̃(θ, z′) Π_θ(z, dz′).

Assumption 3.3. For any compact set Q ⊂ R^{d_θ} and s ∈ (0, 1), there exists a Borel-measurable function ϕ_{Q,s} : R^{d_z} → [1, ∞) such that

max{‖F(θ, z)‖, ‖F̃(θ, z)‖, ‖(ΠF̃)(θ, z)‖} ≤ ϕ_{Q,s}(z),
‖(ΠF̃)(θ′, z) − (ΠF̃)(θ″, z)‖ ≤ ϕ_{Q,s}(z) ‖θ′ − θ″‖^s

for all θ, θ′, θ″ ∈ Q, z ∈ R^{d_z}.

Assumption 3.4. Given a compact set Q ⊂ R^{d_θ} and s ∈ (0, 1),

sup_{n≥0} E[ ϕ²_{Q,s}(Z_n) I_{{τ_Q ≥ n}} | θ_0 = θ, Z_0 = z ] < ∞

for all θ ∈ R^{d_θ}, z ∈ R^{d_z}, where τ_Q = inf{n ≥ 0 : θ_n ∉ Q}.

The main results on the convergence rate of recursion (3.1) are contained in the next theorem.

Theorem 3.1. Let Assumptions 3.1 – 3.4 hold, and suppose that f(·) (introduced in Assumption 3.2) satisfies Assumption 2.3. Then, the following is true:
(i) θ̂ = lim_{n→∞} θ_n exists and satisfies ∇f(θ̂) = 0 w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞}.
(ii) ‖∇f(θ_n)‖² = o(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = o(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = o(γ_n^{-q̂}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞} ∩ {r̂ > r}.
(iii) ‖∇f(θ_n)‖² = O(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = O(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = O(γ_n^{-q̂}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞} ∩ {r̂ ≤ r}.
(iv) ‖∇f(θ_n)‖² = o(γ_n^{-p}) and |f(θ_n) − f(θ̂)| = o(γ_n^{-p}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞}.

The proof is provided in Section 7. p, p̂, q̂ and r̂ are defined in Theorem 2.2 and Corollary 2.1.

Assumption 3.1 is related to the sequence {α_n}_{n≥0}. It holds if α_n = 1/n^a for n ≥ 1, where a ∈ (3/4, 1] is a constant (in that case, γ_n = O(n^{1−a}) as n → ∞, while r can be any number satisfying 0 < r < (a − 1/2)/(1 − a)). On the other side, Assumptions 3.2 – 3.4 correspond to the stochastic process {Z_n}_{n≥0} and are quite standard for the asymptotic analysis of stochastic approximation algorithms with Markovian dynamics. Assumptions 3.2 – 3.4 were introduced by Metivier and Priouret in [20] (see also [1, Part II]), and later generalized by Kushner and his co-workers (see [14] and references cited therein). However, neither the results of Metivier and Priouret, nor the results of Kushner and his co-workers provide any information on the point-convergence and convergence rate of stochastic gradient search in the case of multiple, non-isolated minima.

Regarding Theorem 3.1, the following note is also in order. As already mentioned at the beginning of the section, the purpose of the theorem is to illustrate the results of Theorem 2.1 and to provide a framework for studying the examples presented in the next sections. Since these examples fit perfectly into the framework developed by Metivier and Priouret, the more general assumptions and settings of [14] are not considered here, in order to keep the exposition as concise as possible.

4. Example 1: Supervised Learning. In this section, online algorithms for supervised learning in feedforward neural networks are analyzed using the results of Theorems 2.2 and 3.1. To avoid unnecessary technical details and complicated notation, only two-layer perceptrons are considered here. However, the obtained results can be extended to other feedforward neural networks such as radial basis function networks.

The input-output function of a two-layer perceptron can be defined as

G_θ(x) = \sum_{i=1}^{M} a_i ψ( \sum_{j=1}^{N} b_{i,j} x_j ).
Here, ψ : R → R is a differentiable function, while M and N are positive integers. a_1, ..., a_M, b_{1,1}, ..., b_{M,N} and x_1, ..., x_N are real numbers, while θ = [a_1 ⋯ a_M b_{1,1} ⋯ b_{M,N}]^T, x = [x_1 ⋯ x_N]^T and d_θ = M(N + 1). ψ(·) represents the network activation function. x is the network input, while G_θ(x) is the output. θ is the vector of the network parameters to be tuned through the process of supervised learning.

Let π(·, ·) be a probability measure on R^N × R, while

f(θ) = \frac{1}{2} \int (y − G_θ(x))² π(dx, dy)

for θ ∈ R^{d_θ}. Then, mean-square error based supervised learning in feedforward neural networks can be described as the minimization of f(·) in a situation when only samples from π(·, ·) are available. For more details on neural networks and supervised learning, see e.g., [10], [11] and references cited therein.

The function f(·) is usually minimized by the following stochastic gradient algorithm:

θ_{n+1} = θ_n + α_n (Y_n − G_{θ_n}(X_n)) H_{θ_n}(X_n),    n ≥ 0.    (4.1)
In this recursion, {α_n}_{n≥0} is a sequence of positive real numbers, while H_θ(·) = ∇_θ G_θ(·). θ_0 is an R^{d_θ}-valued random variable defined on a probability space (Ω, F, P), while {(X_n, Y_n)}_{n≥0} is an R^N × R-valued stochastic process defined on the same probability space. In the context of supervised learning, {(X_n, Y_n)}_{n≥0} is regarded as a training sequence.

The asymptotic behavior of algorithm (4.1) is analyzed under the following assumptions:

Assumption 4.1. ψ(·) is real-analytic. Moreover, ψ(·) has a (complex-valued) continuation ψ̂(·) with the following properties:
(i) ψ̂(z) maps z ∈ C into C (C denotes the set of complex numbers).
(ii) ψ̂(x) = ψ(x) for all x ∈ R.
(iii) There exist real numbers ε ∈ (0, 1), K ∈ [1, ∞) such that ψ̂(·) is analytic on V̂_ε = {z ∈ C : d(z, R) ≤ ε} and such that

max{|ψ̂(z)|, |ψ̂′(z)|} ≤ K

for all z ∈ V̂_ε (ψ̂′(·) is the first derivative of ψ̂(·)).

Assumption 4.2. {(X_n, Y_n)}_{n≥0} are i.i.d. random variables distributed according to the probability measure π(·, ·). There exists a real number L ∈ [1, ∞) such that ‖X_0‖ ≤ L and |Y_0| ≤ L w.p.1.

Our main results on the properties of the objective function f(·) and algorithm (4.1) are contained in the next two theorems.

Theorem 4.1. Let Assumptions 4.1 and 4.2 hold. Then, f(·) is analytic on the entire R^{d_θ}, i.e., it satisfies Assumption 2.3.

Theorem 4.2. Let Assumptions 3.1, 4.1 and 4.2 hold. Then, the following is true:
(i) θ̂ = lim_{n→∞} θ_n exists and satisfies ∇f(θ̂) = 0 w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞}.
(ii) ‖∇f(θ_n)‖² = o(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = o(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = o(γ_n^{-q̂}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞} ∩ {r̂ > r}.
(iii) ‖∇f(θ_n)‖² = O(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = O(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = O(γ_n^{-q̂}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞} ∩ {r̂ ≤ r}.
(iv) ‖∇f(θ_n)‖² = o(γ_n^{-p}) and |f(θ_n) − f(θ̂)| = o(γ_n^{-p}) w.p.1 on {sup_{n≥0} ‖θ_n‖ < ∞}.

The proofs are provided in Section 8. p, p̂, q̂ and r̂ are defined in Theorem 2.2 and Corollary 2.1.

Assumption 4.1 is related to the network activation function. It holds when ψ(·) is a logistic function² or a standard Gaussian density³, which are the most popular activation functions in feedforward neural networks. Assumption 4.2 corresponds to the training sequence {(X_n, Y_n)}_{n≥0}, and is common for the analysis of supervised learning.
² The complex-valued logistic function can be defined as h(z) = (1 + exp(−z))^{-1} for z ∈ C. Since |1 + exp(−z)|² = 1 + exp(−2Re(z)) + 2exp(−Re(z))cos(Im(z)) ≥ 1 + exp(−2Re(z)) when |Im(z)| ≤ π/2, h(·) is analytic on {z ∈ C : d(z, R) ≤ π/2}. For the same reason, max{|h(z)|, |h′(z)|} ≤ 1 on {z ∈ C : d(z, R) ≤ π/2}.
³ The complex-valued standard Gaussian density is defined by h(z) = (2π)^{-1/2} exp(−z²/2) for z ∈ C. It is analytic on the entire C. As (1 + |z|)exp(−z²/2) ≤ (1 + |Re(z)| + |Im(z)|)exp(−Re²(z)/2 + Im²(z)/2) ≤ 3e when |Im(z)| ≤ 1, we have max{|h(z)|, |h′(z)|} ≤ 3e on {z ∈ C : d(z, R) ≤ 1}.
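For concreteness, the following sketch implements the two-layer perceptron G_θ(·), its parameter gradient H_θ(·) = ∇_θ G_θ(·) and the online update (4.1) with a logistic activation. The network sizes, the step sizes and the training distribution are illustrative assumptions, not quantities prescribed by the analysis above.

```python
import numpy as np

def perceptron_output_and_grad(theta, x, M, N):
    """G_theta(x) = sum_i a_i * psi(sum_j b_{ij} x_j) and H_theta(x) = grad_theta G_theta(x)."""
    a = theta[:M]
    B = theta[M:].reshape(M, N)
    z = B @ x                                   # pre-activations
    h = 1.0 / (1.0 + np.exp(-z))                # logistic activation psi
    G = float(a @ h)
    dpsi = h * (1.0 - h)                        # psi'(z) for the logistic function
    grad_a = h                                  # dG/da_i = psi(z_i)
    grad_B = (a * dpsi)[:, None] * x[None, :]   # dG/db_{ij} = a_i * psi'(z_i) * x_j
    return G, np.concatenate([grad_a, grad_B.ravel()])

def online_learning(samples, M, N, seed=0):
    """Recursion (4.1): theta_{n+1} = theta_n + alpha_n * (Y_n - G_theta_n(X_n)) * H_theta_n(X_n)."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal(M * (N + 1))
    for n, (x, y) in enumerate(samples):
        alpha = (n + 1.0) ** (-0.8)             # assumed step size
        G, H = perceptron_output_and_grad(theta, x, M, N)
        theta = theta + alpha * (y - G) * H
    return theta

# Illustrative training sequence: bounded inputs, noisy targets of an unknown function.
rng = np.random.default_rng(1)
samples = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, size=3)
    y = float(np.tanh(x.sum())) + 0.05 * float(rng.standard_normal())
    samples.append((x, y))
print(online_learning(samples, M=5, N=3)[:5])   # first few entries of the fitted parameter vector
```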
The asymptotic properties of supervised learning algorithms have been studied in a large number of papers (see [10], [11] and references cited therein). Unfortunately, the available literature does not provide any information on the point-convergence and convergence rate which can be verified for feedforward neural networks with nonlinear activation functions. The main difficulty comes from the fact that the existing results on the convergence and convergence rate of stochastic gradient search require the objective function f(·) to have an isolated minimum θ* such that ∇²f(θ*) is strictly positive definite and such that {θ_n}_{n≥0} visits the attraction domain of θ* infinitely many times w.p.1. Since f(·) is highly nonlinear, these requirements are not only hard (if possible at all) to show, but are rather likely not to hold. Theorem 4.2 does not invoke any such requirements and covers some of the most widely used feedforward neural networks.

5. Example 2: Identification of Linear Stochastic Dynamical Systems. In this section, the general results presented in Sections 2 and 3 are applied to the asymptotic analysis of recursive prediction error algorithms for identification of linear stochastic systems. To avoid unnecessary technical details and complicated notation, only the identification of one-dimensional ARMA models is considered here. However, it is straightforward to generalize the obtained results to any linear stochastic system.

To state the problem of recursive prediction error identification in ARMA models, we use the following notation. M and N are positive integers. For a_1, ..., a_M ∈ R and b_1, ..., b_N ∈ R, let

A_θ(z) = 1 − \sum_{k=1}^{M} a_k z^{-k},    B_θ(z) = 1 + \sum_{k=1}^{N} b_k z^{-k},
where θ = [a_1 ⋯ a_M b_1 ⋯ b_N]^T and z ∈ C (C denotes the set of complex numbers). Moreover, let d_θ = M + N and Θ = {θ ∈ R^{d_θ} : B_θ(z) = 0 ⇒ |z| > 1}. {Y_n}_{n≥0} is a real-valued signal generated by the actual system (i.e., by the system being identified). For θ ∈ Θ, {Y_n^θ}_{n≥0} is the output of the ARMA model

A_θ(q) Y_n^θ = B_θ(q) W_n,    n ≥ 0,    (5.1)

where {W_n}_{n≥0} is a real-valued white noise and q^{-1} is the backward time-shift operator. {ε_n^θ}_{n≥0} is the process generated by the recursion

B_θ(q) ε_n^θ = A_θ(q) Y_n,    n ≥ 0,    (5.2)

while Ŷ_n^θ = Y_n − ε_n^θ for n ≥ 0. Ŷ_n^θ represents a mean-square optimal estimate of Y_n given Y_0, ..., Y_{n-1} (which the model (5.1) can provide; for details see e.g., [16], [17]). Consequently, ε_n^θ can be interpreted as the estimation error of Ŷ_n^θ.

The parametric identification in ARMA models can be stated as follows: Given a realization of {Y_n}_{n≥0}, estimate the values of θ for which the model (5.1) provides the best approximation to the signal {Y_n}_{n≥0}. If the identification is based on the prediction error principle, this estimation problem reduces to the minimization of the asymptotic mean-square prediction error

f(θ) = \frac{1}{2} \lim_{n→∞} E[(ε_n^θ)²]
over Θ. As the asymptotic value of the second moment of ε_n^θ is rarely available analytically, f(·) is minimized by a stochastic gradient (or stochastic Newton) algorithm. Such an algorithm is defined by the following difference equations:

φ_n = [Y_n ⋯ Y_{n-M+1}  ε_n ⋯ ε_{n-N+1}]^T,    (5.3)
ε_{n+1} = Y_{n+1} − φ_n^T θ_n,    (5.4)
ψ_{n+1} = φ_n − [ψ_n ⋯ ψ_{n-N+1}] Dθ_n,    (5.5)
θ_{n+1} = θ_n + α_n ψ_{n+1} ε_{n+1},    n ≥ 0.    (5.6)
In this recursion, {α_n}_{n≥0} denotes a sequence of positive reals. D is an N × (M + N) matrix whose entries are d_{i,j} = 1 if j = M + i, 1 ≤ i ≤ N, and d_{i,j} = 0 otherwise. {Y_n}_{n≥−M} is a real-valued stochastic process defined on a probability space (Ω, F, P), while θ_0 ∈ Θ, ε_0, ..., ε_{1−N} ∈ R and ψ_0, ..., ψ_{1−N} ∈ R^{d_θ} are random variables defined on the same probability space. θ_0, ε_0, ..., ε_{1−N}, ψ_0, ..., ψ_{1−N} represent the initial conditions of the algorithm (5.3) – (5.6).

In the literature on system identification, recursion (5.3) – (5.6) is known as the recursive prediction error algorithm for ARMA models (for more details see [16], [17] and references cited therein). It usually involves a projection (or truncation) device which ensures that the estimates {θ_n}_{n≥0} remain in Θ. However, in order to avoid unnecessary technical details and to keep the exposition as concise as possible, this aspect of algorithm (5.3) – (5.6) is not discussed here. Instead, similarly to [15] – [17], we state our asymptotic results (Theorem 5.2) in a local form.

Algorithm (5.3) – (5.6) is analyzed under the following assumptions:

Assumption 5.1. There exist a positive integer L, a matrix A ∈ R^{L×L}, a vector b ∈ R^L and R^L-valued stochastic processes {X_n}_{n>−M}, {V_n}_{n>−M} (defined on (Ω, F, P)) such that the following holds:
(i) X_{n+1} = AX_n + V_n and Y_n = b^T X_n for n > −M.
(ii) The eigenvalues of A lie in {z ∈ C : |z| < 1}.
(iii) {V_n}_{n≥−M} are i.i.d. and independent of θ_0, X_{1−M}, ε_0, ..., ε_{1−N}, ψ_0, ..., ψ_{1−N}.
(iv) E‖V_0‖⁴ < ∞.

Assumption 5.2. For any compact set Q ⊂ Θ,

sup_{n≥0} E[(ε_n⁴ + ‖ψ_n‖⁴) I_{{τ_Q ≥ n}}] < ∞,    (5.7)

where τ_Q = inf{n ≥ 0 : θ_n ∉ Q}.
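To make algorithm (5.3) – (5.6) concrete, the following sketch simulates it for assumed orders M = N = 1, with data generated by an assumed ARMA(1,1) system; as in the discussion above, no projection onto Θ is enforced.

```python
import numpy as np

def rpe_arma11(Y, a_exp=0.9):
    """Recursive prediction error algorithm (5.3)-(5.6) for an ARMA(1,1) model (M = N = 1)."""
    theta = np.zeros(2)                  # theta_n = [a_1, b_1]
    eps = 0.0                            # epsilon_0 (initial condition)
    psi = np.zeros(2)                    # psi_0 (initial condition)
    for n in range(len(Y) - 1):
        alpha = (n + 1.0) ** (-a_exp)                 # assumed step size
        phi = np.array([Y[n], eps])                   # (5.3): phi_n = [Y_n, eps_n]^T
        eps_next = Y[n + 1] - phi @ theta             # (5.4): eps_{n+1} = Y_{n+1} - phi_n^T theta_n
        psi_next = phi - theta[1] * psi               # (5.5): here D theta_n = b_{1,n}
        theta = theta + alpha * psi_next * eps_next   # (5.6)
        eps, psi = eps_next, psi_next
    return theta

# Assumed data-generating system: Y_{n+1} = 0.7 Y_n + W_{n+1} + 0.3 W_n (stable, invertible ARMA(1,1)).
rng = np.random.default_rng(2)
W = rng.standard_normal(50_001)
Y = np.zeros(50_001)
for n in range(50_000):
    Y[n + 1] = 0.7 * Y[n] + W[n + 1] + 0.3 * W[n]
print(rpe_arma11(Y))   # estimate of (a_1, b_1); no projection onto Theta is enforced here
```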
Our main result on the analyticity of f(·) is contained in the next theorem.

Theorem 5.1. Suppose that {Y_n}_{n≥0} is a weakly stationary process such that

\sum_{n=0}^{∞} |Cov(Y_0, Y_n)| < ∞.
Then, f(·) is analytic on the entire Θ, i.e., the following is true: For any compact set Q ⊂ Θ and any a ∈ f(Q), there exist real numbers δ_{Q,a} ∈ (0, 1], μ_{Q,a} ∈ (1, 2], M_{Q,a} ∈ [1, ∞) such that (2.2) is satisfied for all θ ∈ Q fulfilling |f(θ) − a| ≤ δ_{Q,a}.

Let Λ be the event defined by

Λ = { sup_{n≥0} ‖θ_n‖ < ∞, inf_{n≥0} d(θ_n, ∂Θ) > 0 }.
Then, our main result on the convergence and convergence rate of algorithm (5.3) – (5.6) reads as follows.

Theorem 5.2. Let Assumptions 3.1, 5.1 and 5.2 hold. Then, the following is true:
(i) θ̂ = lim_{n→∞} θ_n exists and satisfies ∇f(θ̂) = 0 w.p.1 on Λ.
(ii) ‖∇f(θ_n)‖² = o(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = o(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = o(γ_n^{-q̂}) w.p.1 on Λ ∩ {r̂ > r}.
(iii) ‖∇f(θ_n)‖² = O(γ_n^{-p̂}), |f(θ_n) − f(θ̂)| = O(γ_n^{-p̂}) and ‖θ_n − θ̂‖ = O(γ_n^{-q̂}) w.p.1 on Λ ∩ {r̂ ≤ r}.
(iv) ‖∇f(θ_n)‖² = o(γ_n^{-p}) and |f(θ_n) − f(θ̂)| = o(γ_n^{-p}) w.p.1 on Λ.

The proofs are provided in Section 9. p, p̂, q̂ and r̂ are defined in Theorem 2.2 and Corollary 2.1.

Assumption 5.1 corresponds to the signal {Y_n}_{n≥0}. It is quite common for the asymptotic analysis of recursive identification algorithms (see e.g., [1, Part I]) and covers all stable linear Markov models. Assumption 5.2 is related to the stability of the subrecursion (5.3) – (5.5) and its output {ε_n}_{n≥0}, {ψ_n}_{n≥0}. In this or a similar form, Assumption 5.2 is involved in most of the asymptotic results on recursive prediction error identification algorithms. E.g., [16, Theorems 4.1 – 4.3] (which are probably the most general results of this kind) require the sequence {(ε_n, ψ_n)}_{n≥0} to visit a fixed compact set infinitely often w.p.1 on the event Λ. When {Y_n}_{n≥0} is generated by a stable linear Markov system, such a requirement is practically equivalent to (5.7).

Various aspects of recursive prediction error identification in linear stochastic systems have been the subject of numerous papers and books (see [16], [17] and references cited therein). Despite providing a deep insight into the asymptotic behavior of recursive prediction error identification algorithms, the available results do not offer information about the point-convergence and convergence rate which can be verified for models of a moderate or high order (e.g., M and N equal to three or above). The main difficulty is the same as in the case of supervised learning. The existing results on the convergence and convergence rate of stochastic gradient search require f(·) to have an isolated minimum θ* such that ∇²f(θ*) is strictly positive definite and such that {θ_n}_{n≥0} visits the attraction domain of θ* infinitely many times w.p.1. Unfortunately, f(·) is so complex (even for relatively small M and N) that these requirements are not only impossible to verify, but are likely not to be true. Theorem 5.2 relies on none of them.

Regarding Theorems 5.1 and 5.2, it should be mentioned that these results can be generalized in several ways. E.g., it is straightforward to extend them to practically any stable multiple-input, multiple-output linear system. Moreover, it is possible to show that the results also hold for signals {Y_n}_{n≥0} satisfying mixing conditions of the type [16, Condition S1, p. 169].

6. Proof of Theorems 2.1 and 2.2. In this section, the following notation is used. Let Λ be the event

Λ = { sup_{n≥0} ‖θ_n‖ < ∞ }.
For ε ∈ (0, ∞), let ϕ_ε(ξ) = ϕ(ξ) + ε.
For 0 ≤ n < k, let ζ_{n,n} = ζ′_{n,n} = ζ″_{n,n} = 0, φ_{n,n} = φ′_{n,n} = φ″_{n,n} = 0 and

ζ′_{n,k} = \sum_{i=n}^{k-1} α_i ξ_i,
ζ″_{n,k} = \sum_{i=n}^{k-1} α_i (∇f(θ_i) − ∇f(θ_n)),
ζ_{n,k} = ζ′_{n,k} + ζ″_{n,k},
φ′_{n,k} = (∇f(θ_n))^T ζ_{n,k},
φ″_{n,k} = −\int_0^1 (∇f(θ_n + s(θ_k − θ_n)) − ∇f(θ_n))^T (θ_k − θ_n) ds,
φ_{n,k} = φ′_{n,k} + φ″_{n,k}.

Then, it is straightforward to show

θ_k − θ_n = −\sum_{i=n}^{k-1} α_i ∇f(θ_i) − ζ′_{n,k} = −(γ_k − γ_n) ∇f(θ_n) − ζ_{n,k},    (6.1)
f(θ_k) − f(θ_n) = −(γ_k − γ_n) ‖∇f(θ_n)‖² − φ_{n,k}    (6.2)
ˆ and Q ˆ are random sets defined by B [ ˆ= B θ0 ∈ Rdθ : kθ0 − θk ≤ δθ /2 ,
ˆ = cl(B) ˆ Q
ˆ θ∈A
on event Λ, and by ˆ = A, ˆ B
ˆ = Aˆ Q
outside Λ (δθ is specified in Remark 2.1). Overriding the definition of µ ˆ, pˆ, rˆ, in ˆ µ ˆ M ˆ as Theorem 2.2, we specify random quantities δ, ˆ, pˆ, rˆ, C, ˆ ˆ ˆ ˆ ˆ, δˆ = δQ, ˆ = µQ, ˆ , M = MQ, ˆ fˆ, µ ˆ fˆ, C = CQ f ( 1/(2 − µ ˆ), if µ ˆ fˆ 0, otherwise
for θ ∈ Rdθ . ˆ pˆ, rˆ, ˆ is compact and satisfies Aˆ ⊂ intQ. ˆ Thus, δ, Remark 6.1. On event Λ, Q ˆ ˆ C, M , v(·) are well-defined on Λ (what happens with these quantities outside Λ does not affect the results presented in this section). On the other side, Assumption 2.3 implies ˆ k∇f (θ)kµˆ |f (θ) − fˆ| ≤ M
(6.3)
ˆ ˆ satisfying |f (θ) − fˆ| ≤ δ. on Λ for all θ ∈ Q Remark 6.2. Regarding the notation, the following note is also in order: ˜ symbol is used for a locally defined quantity, i.e., for a quantity whose definition holds only in the proof where such a quantity appears. Lemma 6.1. Let Assumptions 2.1 and 2.2 hold. Then, there exists an event N0 ∈ F such that P (N0 ) = 0 and lim sup γnr n→∞
max n≤k≤a(n,1)
0 kζn,k k≤ξ τ1,ε (notice that γk − γn ≤ 1 for n ≤ k ≤ a(n, 1)). Due to (6.1), we have (γk − γn )k∇f (θn )k2 =k∇f (θn )kk(γk − γn )∇f (θn )k =k∇f (θn )kkθk − θn + ζn,k k for 0 ≤ n ≤ k. Combining this with (6.2), (6.12) and the first part of (6.11), we get 2 (f (θk ) − f (θn )) = − k∇f (θn )kkθk − θn + ζn,k k − (γk − γn )k∇f (θn )k2 − 2φn,k ≤ − k∇f (θn )kkθk − θn k − (γk − γn )k∇f (θn )k2 + k∇f (θn )kkζn,k k + 2|φn,k | ≤ − k∇f (θn )kkθk − θn k − (γk − γn )k∇f (θn )k2 + C˜4 (γk − γn )2 k∇f (θn )k2 + C˜4 γn−r k∇f (θn )k(ξ + ε) + γn−2r (ξ + ε)2 = − k∇f (θn )kkθk − θn k − 1 − C˜4 (γk − γn ) (γk − γn )k∇f (θn )k2 + C˜4 γn−r k∇f (θn )k(ξ + ε) + γn−2r (ξ + ε)2 18
for τ1,ε < n ≤ k ≤ a(n, 1). Consequently, (6.14) yields 2 (f (θk ) − f (θn )) ≤ − k∇f (θn )kkθk − θn k − 3(γk − γn )k∇f (θn )k2 /4 + C˜4 γn−r k∇f (θn )k(ξ + ε) + γn−2r (ξ + ε)2 for τ1,ε < n ≤ k ≤ a(n, tˆ). Then, (6.8) implies that (6.7) is true for n > τ1,ε . Lemma 6.3. Suppose that Assumptions 2.1 – 2.3 hold. Then, limn→∞ ∇f (θn ) = 0 on Λ \ N0 . Proof. The lemma’s assertion is proved by contradiction. We assume that lim supn→∞ k∇f (θn )k > 0 for some sample ω ∈ Λ \ N0 (notice that all formulas which follow in the proof correspond to this sample). Then, there exists a ∈ (0, ∞) and an increasing sequence {lk }k≥0 (both depending on ω) such that lim inf k→∞ k∇f (θlk )k > a. Since lim inf k→∞ f (θa(lk ,tˆ) ) ≥ fˆ, Lemma 6.2 (inequality (6.6)) gives fˆ − lim inf f (θlk ) ≤ lim sup(f (θa(lk ,tˆ) ) − f (θlk )) k→∞
k→∞
≤ − (tˆ/2) lim inf k∇f (θlk )k2 k→∞
2ˆ
≤ − a t/2. Therefore, lim inf k→∞ f (θlk ) ≥ fˆ+ atˆ2 /2. Consequently, there exist b, c ∈ R (depending on ω) such that fˆ < b < c < fˆ + atˆ2 /2, b < fˆ + δˆ and lim supn→∞ f (θn ) > c. Thus, there exist sequences {mk }k≥0 , {nk }k≥0 (depending on ω) with the following properties: mk < nk < mk+1 , f (θmk ) < b, f (θnk ) > c and f (θn ) ≥ b
max
mk f (θmk ) = f (θmk +1 ) − (f (θmk +1 ) − f (θmk )) ≥ b − (f (θmk +1 ) − f (θmk )) for k ≥ 0, (6.17) yields limk→∞ f (θmk ) = b. As f (θnk )−f (θmk ) > c−b for k ≥ 0, (6.18) implies a(mk , tˆ) < nk for all, but infinitely many k (otherwise, lim inf k→∞ (f (θnk ) − f (θmk )) ≤ 0 would follow from (6.18)). Consequently, lim inf k→∞ f (θa(mk ,tˆ) ) ≥ b (due to (6.16)), while Lemma 6.2 (inequality (6.6)) gives 0 ≤ lim sup f (θa(mk ,tˆ) ) − b = lim sup(f (θa(mk ,tˆ) ) − f (θmk )) k→∞
k→∞
≤ − (tˆ/2) lim inf k∇f (θmk )k2 . k→∞
Therefore, limk→∞ k∇f (θmk )k = 0. Moreover, there exists k0 ≥ 0 (depending on ω) ˆ and f (θm ) ≥ (fˆ + b)/2 for k ≥ k0 (notice that limk→∞ f (θm ) = such that θmk ∈ Q k k ˆ ˆ and 0 < (b − fˆ)/2 ≤ f (θm ) − fˆ ≤ δˆ for k ≥ k0 b > (f + b)/2). Consequently, θmk ∈ Q k 19
(notice that f (θmk ) < b < fˆ + δˆ for k ≥ 0). Then, owing to (6.3) (i.e., to Assumption 3.3), we have ˆ k∇f (θm )kµˆ 0 < (b − fˆ)/2 ≤ f (θmk ) − fˆ ≤ M k for k ≥ k0 . However, this directly contradicts the fact limk→∞ k∇f (θmk )k = 0. Hence, limn→∞ ∇f (θn ) = 0 on Λ \ N0 . Lemma 6.4. Suppose that Assumptions 2.1 – 2.3 hold. Then, limn→∞ f (θn ) = fˆ on Λ \ N0 . Proof. We use contradiction to prove the lemma’s assertion: Suppose that fˆ < lim supn→∞ f (θn ) for some sample ω ∈ Λ \ N0 (notice that all formulas which follow in the proof correspond to this sample). Then, there exists a ∈ R (depending on ω) such that fˆ < a < fˆ + δˆ and lim supn→∞ f (θn ) > a. Thus, there exists an increasing sequence {nk }k≥0 (depending on ω) such that f (θnk ) < a and f (θnk +1 ) ≥ a for k ≥ 0. On the other side, Lemma 6.2 (inequality (6.5)) implies lim sup(f (θnk +1 ) − f (θnk )) ≤ 0.
(6.19)
k→∞
Since a > f (θnk ) = f (θnk +1 ) − (f (θnk +1 ) − f (θnk )) ≥ a − (f (θnk +1 ) − f (θnk )) for k ≥ 0, (6.19) yields limk→∞ f (θnk ) = a. Moreover, there exists k0 ≥ 0 (deˆ and f (θn ) ≥ (fˆ + a)/2 for k ≥ k0 (notice that pending on ω) such that θnk ∈ Q k ˆ and 0 < (a − fˆ)/2 ≤ f (θn ) − fˆ ≤ δˆ limk→∞ f (θnk ) = a > (fˆ + a)/2). Thus, θnk ∈ Q k for k ≥ k0 (notice that f (θnk ) < a < fˆ + δˆ for k ≥ 0). Then, due to (6.3) (i.e., to Assumption 2.3), we have ˆ k∇f (θn )kµˆ 0 < (a − fˆ)/2 ≤ f (θnk ) − fˆ ≤ M k for k ≥ k0 . However, this directly contradicts the fact limn→∞ ∇f (θn ) = 0. Hence, limn→∞ f (θn ) = fˆ on Λ \ N0 . Lemma 6.5. Suppose that Assumptions 2.1 – 2.3 hold. Then, there exist random ˆ M ˆ ) and for any real quantities Cˆ2 , Cˆ3 (which are deterministic functions of pˆ, C, number ε ∈ (0, ∞), there exists a non-negative integer-valued random quantity τ2,ε such that the following is true: 1 ≤ Cˆ2 , Cˆ3 < ∞, 0 ≤ τ2,ε < ∞ everywhere and u(θa(n,tˆ) ) − u(θn ) + tˆk∇f (θn )k2 /4 IAn,ε ≤ 0, (6.20) u(θa(n,tˆ) ) − u(θn ) + (tˆ/Cˆ3 ) u(θn ) IBn,ε ≤ 0, (6.21) v(θa(n,tˆ) ) − v(θn ) − (tˆ/Cˆ3 )(ϕε (ξ))−ˆµ/pˆ ICn,ε ≥ 0 (6.22) on Λ \ N0 for n ≥ τ2,ε , where n o n o An,ε = γnpˆ|u(θn )| ≥ Cˆ2 (ϕε (ξ))µˆ ∪ γnpˆk∇f (θn )k2 ≥ Cˆ2 (ϕε (ξ))µˆ , n o Bn,ε = γnpˆu(θn ) ≥ Cˆ2 (ϕε (ξ))µˆ ∩ {ˆ µ = 2}, n o n o Cn,ε = γnpˆu(θn ) ≥ Cˆ2 (ϕε (ξ))µˆ ∩ u(θa(n,tˆ) ) > 0 ∩ {ˆ µ < 2} . 20
Remark 6.3. Inequalities (6.20) – (6.22) can be represented in the following equivalent form: Relations γnpˆ|u(θn )| ≥ Cˆ2 (ϕε (ξ))µˆ ∨ γnpˆk∇f (θn )k2 ≥ Cˆ2 (ϕε (ξ))µˆ ∧ n > τ2,ε =⇒ u(θa(n,tˆ) ) ≤ u(θn ) − tˆk∇f (θn )k2 /4,
(6.23)
γnpˆu(θn ) ≥ Cˆ2 (ϕε (ξ))µˆ ∧ µ ˆ = 2 ∧ n > τ2,ε =⇒ u(θa(n,tˆ) ) ≤ 1 − tˆ/Cˆ3 u(θn ),
(6.24)
γnpˆu(θn ) ≥ Cˆ2 (ϕε (ξ))µˆ ∧ u(θa(n,tˆ) ) > 0 ∧ µ ˆ < 2 ∧ n > τ2,ε =⇒ v(θa(n,tˆ) ) ≥ v(θn ) + (tˆ/Cˆ3 )(ϕε (ξ))−ˆµ/pˆ
(6.25)
are true on Λ \ N0 . ˆ and Cˆ3 = 4ˆ ˆ 2 . Moreover, let ε ∈ (0, ∞) Proof. Let C˜ = 8Cˆ1 /tˆ, Cˆ2 = C˜ 2 M pM be an arbitrary real number. Then, owing to Lemma 6.1 and 6.4, it is possible to construct a non-negative inter-valued random quantity τ2,ε such that τ1,ε ≤ τ2,ε < ∞ ˆ ˆ |u(θn )| ≤ δ, everywhere and such that θn ∈ Q, ˆ γn−p/2 (ϕε (ξ))µˆ/2 ≥ γn−r (ξ + ε),
(6.26)
ˆ µ γn−p/ˆ ϕε (ξ)
(6.27)
≥
γn−r (ξ
+ ε)
on Λ \ N0 for n > τ2,ε .4 Since τ2,ε ≥ τ1,ε on Λ \ N0 , Lemma 6.2 (inequality (6.6)) yields u(θa(n,tˆ) ) − u(θn ) ≤ − tˆk∇f (θn )k2 /2 + Cˆ1 γn−r k∇f (θn )k(ξ + ε) + γn−2r (ξ + ε)2 (6.28) ˆ and |u(θn )| ≤ δˆ on Λ \ N0 for n > τ2,ε , (6.3) (i.e., on Λ \ N0 for n > τ2,ε . As θn ∈ Q Assumption 2.3) implies ˆ k∇f (θn )kµˆ |u(θn )| ≤ M
(6.29)
on Λ \ N0 for n > τ2,ε . Let ω be an arbitrary sample from Λ \ N0 (notice that all formulas which follow in the proof correspond to this sample). First, we show (6.20). We proceed by contradiction: Suppose that (6.20) is violated for some n > τ2,ε . Therefore, u(θa(n,tˆ) ) − u(θn ) > −tˆk∇f (θn )k2 /4
(6.30)
and at least one of the following two inequalities is true: |u(θn )| ≥ Cˆ2 γn−pˆ(ϕε (ξ))µˆ , k∇f (θn )k2 ≥ Cˆ2 γn−pˆ(ϕε (ξ))µˆ .
(6.31) (6.32)
4 To conclude that (6.26) holds on Λ\N for all but finitely many n, notice that p ˆ/2 < min{r, rˆ} ≤ 0 r when µ ˆ < 2 and that the left and right hand sides of the inequality in (6.26) are equal when µ ˆ = 2. In order to deduce that (6.27) is true on Λ \ N0 for all but finitely many n, notice that pˆ/ˆ µ = r, ϕε (ξ) ≥ ξ + ε when r ≤ rˆ and that pˆ/ˆ µ = rˆ < r when r > rˆ.
21
If (6.31) holds, then (6.27), (6.29) imply ˆ µ ˆ )1/ˆµ ≥ (Cˆ2 /M ˆ )1/ˆµ γn−p/ˆ ˜ n−r (ξ + ε) k∇f (θn )k ≥ (|u(θn )|/M ϕε (ξ) ≥ Cγ
ˆ )1/ˆµ = C˜ 2/ˆµ ≥ C˜ owing to µ (notice that (Cˆ2 /M ˆ ≤ 2). On the other side, if (6.32) is satisfied, then (6.26) yields 1/2 ˆ ˜ n−r (ξ + ε). k∇f (θn )k ≥ Cˆ2 γn−p/2 (ϕε (ξ))µˆ/2 ≥ Cγ
Thus, as a result of one of (6.31), (6.32), we get ˜ −r (ξ + ε). k∇f (θn )k ≥ Cγ n Consequently, tˆk∇f (θn )k2 /8 ≥ (C˜ tˆ/8)γn−r k∇f (θn )k(ξ + ε) = Cˆ1 γn−r k∇f (θn )k(ξ + ε), tˆk∇f (θn )k2 /8 ≥ (C˜ 2 tˆ/8)γ −2r (ξ + ε)2 ≥ Cˆ1 γ −2r (ξ + ε)2 n
n
(notice that C˜ tˆ/8 = Cˆ1 , C˜ 2 tˆ/8 ≥ C˜ tˆ/8 = Cˆ1 ). Combining this with (6.28), we get u(θa(n,tˆ) ) − u(θn ) ≤ −tˆk∇f (θn )k2 /4,
(6.33)
which directly contradicts (6.30). Hence, (6.20) is true for n > τ2,ε . Then, as a result of (6.29) and the fact that Bn,ε ⊆ An,ε for n ≥ 0, we get u(θa(n,tˆ) ) − u(θn ) + (tˆ/Cˆ3 ) u(θn ) IBn,ε ˆ tˆ/Cˆ3 ) k∇f (θn )k2 IB ≤ u(θa(n,tˆ) ) − u(θn ) + (M n,ε ≤ u(θa(n,tˆ) ) − u(θn ) + tˆk∇f (θn )k2 /4 IBn,ε ≤ 0 ˆ ). for n > τ2,ε (notice that u(θn ) > 0 on Bn,ε for each n ≥ 0; also notice that Cˆ3 ≥ 4M Thus, (6.21) is true for n > τ2,ε . Now, let us prove (6.22). To do so, we again use contradiction: Suppose that (6.21) does not hold for some n > τ2,ε . Consequently, we have µ ˆ < 2, u(θa(n,tˆ) ) > 0 and γnpˆ u(θn ) ≥ Cˆ2 (ϕε (ξ))µˆ > 0, v(θ ˆ ) − v(θn ) < (tˆ/Cˆ3 )(ϕε (ξ))−ˆµ/pˆ. a(n,t)
(6.34) (6.35)
Combining (6.34) with (already proved) (6.20), we get (6.33), while µ ˆ < 2 implies 2/ˆ µ = 1 + 1/(ˆ µrˆ) ≤ 1 + 1/ˆ p
(6.36)
(notice that rˆ = 1/(2 − µ ˆ) owing to µ ˆ < 2; also notice that pˆ = µ ˆ min{r, rˆ} ≤ µ ˆrˆ). As ˆ 0 < u(θn ) ≤ δ ≤ 1 (due to (6.34) and the definition of τ2,ε ), inequalities (6.29), (6.36) yield 2/ˆµ 1+1/pˆ ˆ 2 ˆ k∇f (θn )k2 ≥ u(θn )/M ≥ (u(θn )) /M 22
(6.37)
ˆ 2/ˆµ ≤ M ˆ 2 due to µ ˆ ≥ 1). Since k∇f (θn )k > 0 and 0 < (notice that M ˆ < 2, M u(θa(n,tˆ) ) < u(θn ) (due to (6.29), (6.33)), inequalities (6.33), (6.37) give u(θn ) − u(θa(n,tˆ) ) tˆ u(θn ) − u(θa(n,tˆ) ) ˆ2 ≤M ≤ 1+1/pˆ 4 k∇f (θn )k2 (u(θn )) Z u(θn ) du ˆ2 =M 1+1/pˆ u(θa(n,tˆ) ) (u(θn )) Z u(θn ) du ˆ2 ≤M 1+1/ pˆ u(θa(n,tˆ) ) u ˆ 2 v(θ ˆ ) − v(θn ) . =ˆ pM a(n,t) Therefore, ˆ 2 ) = (tˆ/Cˆ3 ), v(θa(n,tˆ) ) − v(θn ) ≥ tˆ/(4ˆ pM which directly contradicts (6.35). Thus, (6.22) is satisfied for n > τ2,ε . Lemma 6.6. Suppose that Assumptions 2.1 – 2.3 hold. Then, there exists a ˆ such that the following random quantity Cˆ4 (which is a deterministic function of C) ˆ is true: 1 ≤ C4 < ∞ everywhere and kθa(n,tˆ) − θn k ≤ −γns u(θa(n,tˆ) ) − u(θn ) (φε,s (ξ))−1 + Cˆ4 γn−s φε,s (ξ) (6.38) on Λ \ N0 for n > τ1,ε and any ε ∈ (0, ∞), s ∈ (1, r], where ( 1 + ξ + ε, if r = s, r > rˆ φε,s (ξ) = . ϕε (ξ), otherwise Proof. Let ε ∈ (0, ∞), s ∈ (1, r] be arbitrary real numbers, while Cˆ4 = 10Cˆ12 /tˆ. Moreover, let ω be an arbitrary sample from Λ \ N0 (notice that all formulas which follow in the proof correspond to this sample), while n > τ1,ε is an arbitrary integer. To prove (6.38), we consider separately the cases k∇f (θn )k ≥ (4Cˆ1 /tˆ)γn−s φε,s (ξ) and k∇f (θn )k ≤ (4Cˆ1 /tˆ)γn−s φε,s (ξ). Case k∇f (θn )k ≥ (4Cˆ1 /tˆ)γn−s φε,s (ξ): Since s ≤ r and φε,s (ξ) ≥ ξ + ε, we have k∇f (θn )k ≥ (4Cˆ1 /tˆ)γn−r (ξ + ε). Therefore, (tˆ/4)k∇f (θn )k2 ≥ Cˆ1 γn−r k∇f (θn )k(ξ + ε), (tˆ/4)k∇f (θn )k2 ≥ (4Cˆ12 /tˆ)γn−2r (ξ + ε)2 ≥ Cˆ1 γn−2r (ξ + ε)2 . Then, Lemma 6.2 (inequality (6.7)) yields k∇f (θn )kkθa(n,tˆ) − θn k ≤ − 2 u(θa(n,tˆ) ) − u(θn ) − tˆk∇f (θn )k2 /2 + Cˆ1 γn−r k∇f (θn )k(ξ + ε) + γn−2r (ξ + ε)2 ≤ − 2 u(θa(n,tˆ) ) − u(θn ) . 23
Consequently, kθa(n,tˆ) − θn k ≤ − 2k∇f (θn )k−1 u(θa(n,tˆ) ) − u(θn ) ≤ − (2Cˆ1 /tˆ)−1 γns u(θa(n,tˆ) ) − u(θn ) (φε,s (ξ))−1 ≤ − γns u(θa(n,tˆ) ) − u(θn ) (φε,s (ξ))−1 + Cˆ4 γn−s φε,s (ξ). Hence, (6.38) is true when k∇f (θn )k ≥ (4Cˆ1 /tˆ)γn−s φε,s (ξ). Case k∇f (θn )k ≤ (4Cˆ1 /tˆ)γn−s φε,s (ξ): As s ≤ r and φε,s (ξ) ≥ ξ + ε, Lemma 6.2 (inequalities (6.4), (6.5)) implies kθa(n,ˆr) − θn k ≤Cˆ1 k∇f (θn )k + γn−r (ξ + ε) ≤(Cˆ4 /2)γn−s φε,s (ξ), u(θa(n,tˆ) ) − u(θn ) ≤Cˆ1 γn−r k∇f (θn )k(ξ + ε) + γn−2r (ξ + ε)2 ≤(Cˆ4 /2)γn−2s (φε,s (ξ))2 . Combining this, we get kθa(n,tˆ) − θn k ≤ − γns u(θa(n,tˆ) ) − u(θn ) (φε,s (ξ))−1 + γns u(θa(n,tˆ) ) − u(θn ) (φε,s (ξ))−1 + (Cˆ4 /2)γn−s φε,s (ξ) ≤ − γns u(θa(n,tˆ) ) − u(θn ) (φε,s (ξ))−1 + Cˆ4 γn−s φε,s (ξ). Thus, (6.38) holds when k∇f (θn )k ≤ (4Cˆ1 /tˆ)γn−s φε,s (ξ). Lemma 6.7. Suppose that Assumptions 2.1 – 2.3 hold. Then, u(θn ) ≥ −Cˆ2 γn−pˆ(ϕε (ξ))µˆ
(6.39)
on Λ \ N0 for n > τ2,ε and any ε ∈ (0, ∞). Furthermore, there exists a random ˆ M ˆ ) such that the quantity Cˆ5 ∈ [1, ∞) (which is a deterministic function of pˆ, C, ˆ following is true: 1 ≤ C5 < ∞ everywhere and k∇f (θn )k2 ≤ Cˆ5 ψ(u(θn )) + γn−pˆ(ϕε (ξ))µˆ (6.40) on Λ \ N0 for n > τ2,ε and any ε ∈ (0, ∞), where function ψ(·) is defined by ψ(x) = x I(0,∞) (x), x ∈ R. Proof. Let Cˆ5 = 4Cˆ2 /tˆ, while ε ∈ (0, ∞) is an arbitrary real number. Moreover, ω is an arbitrary sample from Λ \ N0 (notice that all formulas which follow in the proof correspond to this sample). First, we prove (6.39). To do so, we use contradiction: Assume that (6.39) is not satisfied for some n > τ2,ε . Define {nk }k≥0 recursively by n0 = n and nk = a(nk−1 , tˆ) for k ≥ 1. Let us show by induction that {u(θnk )}k≥0 is non-increasing: Suppose that u(θnl ) ≤ u(θnl−1 ) for 0 ≤ l ≤ k. Consequently, u(θnk ) ≤ u(θn0 ) ≤ −Cˆ2 γn−0pˆ(ϕε (ξ))µˆ ≤ −Cˆ2 γn−kpˆ(ϕε (ξ))µˆ (notice that {γn }n≥0 is increasing). Then, Lemma 6.5 (relations (6.20), (6.23)) yields u(θnk+1 ) − u(θnk ) ≤ −tˆk∇f (θnk )k2 /4 ≤ 0, 24
i.e., u(θnk+1 ) ≤ u(θnk ). Thus, {u(θnk )}k≥0 is non-increasing. Therefore, lim sup u(θnk ) ≤ u(θn0 ) < 0. n→∞
However, this is not possible, as limn→∞ u(θn ) = 0 (due to Lemma 6.4). Hence, (6.39) indeed holds for n > τ2,ε . Now, (6.40) is demonstrated. Again, we proceed by contradiction: Suppose that (6.40) is violated for some n > τ2,ε . Consequently, k∇f (θn )k2 ≥ Cˆ5 γn−pˆ(ϕε (ξ))µˆ ≥ Cˆ2 γn−pˆ(ϕε (ξ))µˆ (notice that Cˆ5 ≥ Cˆ2 ), which, together with Lemma 6.5 (relations (6.20), (6.23)), yields u(θa(n,tˆ) ) − u(θn ) ≤ −tˆk∇f (θn )k2 /4. Then, (6.39) implies k∇f (θn )k2 ≤(4/tˆ) u(θn ) − u(θa(n,tˆ) ) −pˆ µ ˆ ≤(4/tˆ) ψ(u(θn )) + Cˆ2 γa(n, (ϕ (ξ)) ε tˆ) − p ˆ µ ˆ ≤Cˆ5 ψ(u(θn )) + γn (ϕε (ξ)) . However, this directly contradicts our assumption that n violates (6.40). Thus, (6.40) is indeed satisfied for n > τ2,ε . Lemma 6.8. Suppose that Assumptions 2.1 – 2.3 hold. Then, there exists a ˆ M ˆ ) such that the random quantity Cˆ6 (which is a deterministic function of pˆ, C, ˆ following is true: 1 ≤ C6 < ∞ everywhere and lim inf γnpˆ u(θn ) ≤ Cˆ6 (ϕε (ξ))µˆ n→∞
(6.41)
on Λ \ N0 for any ε ∈ (0, ∞). Proof. Let Cˆ6 = Cˆ2 + Cˆ3pˆ. We prove (6.41) by contradiction: Assume that (6.41) is violated for some sample ω from Λ \ N0 (notice that the formulas which follow in the proof correspond to this sample) and some real number ε ∈ (0, ∞). Consequently, there exists n0 > τ2,ε (depending on ω, ε) such that u(θn ) ≥ Cˆ6 γn−pˆ(ϕε (ξ))µˆ
(6.42)
for n ≥ n0 . Let {nk }k≥0 be defined recursively by nk = a(nk−1 , tˆ) for k ≥ 1. In what follows in the proof, we consider separately the cases µ ˆ < 2 and µ ˆ = 2. Case µ ˆ < 2: Due to (6.42), we have −1/pˆ γnk (ϕε (ξ))−ˆµ/pˆ. v(θnk ) ≤Cˆ6
On the other side, Lemma 6.5 (relations (6.22), (6.25)) and (6.42) yield v(θnk+1 ) − v(θnk ) ≥ (tˆ/Cˆ3 )(ϕε (ξ))−ˆµ/pˆ ≥ (1/Cˆ3 )(γnk+1 − γnk )(ϕε (ξ))−ˆµ/pˆ 25
for k ≥ 0 (notice that tˆ ≥ γnk+1 − γnk ). Therefore, (1/Cˆ3 )(γnk − γn0 )(ϕε (ξ))−ˆµ/pˆ ≤
k−1 X
(v(θni+1 ) − v(θni ))
i=0
=v(θnk ) − v(θn0 ) −1/pˆ
≤Cˆ6
γnk (ϕε (ξ))−ˆµ/pˆ
for k ≥ 1. Thus, −1/pˆ (1 − γn0 /γnk ) ≤ Cˆ3 Cˆ6
for k ≥ 1. However, this is impossible, since the limit process k → ∞ (applied to the 1/pˆ previous relation) yields Cˆ3 ≥ Cˆ6 (notice that Cˆ6 > Cˆ3pˆ). Hence, (6.41) holds when µ ˆ < 2. Case µ ˆ = 2: As a result of Lemma 6.5 (relations (6.21), (6.24)) and (6.42), we get u(θnk+1 ) ≤ (1 − tˆ/Cˆ3 )u(θnk ) ≤ 1 − (γnk+1 − γnk )/Cˆ3 u(θnk ) for k ≥ 0. Consequently, u(θnk ) ≤u(θn0 )
k Y
1 − (γni − γni−1 )/Cˆ3
i=1
≤u(θn0 ) exp −(1/Cˆ3 )
k X
! (γni − γni−1 )
i=1
=u(θn0 ) exp −(γnk − γn0 )/Cˆ3 for k ≥ 0. Then, (6.42) yields Cˆ6 (ϕε (ξ))µˆ ≤ u(θn0 )γnpˆk exp −(γnk − γn0 )/Cˆ3 for k ≥ 0. However, this is not possible, as the limit process k → ∞ (applied to the previous relation) implies Cˆ6 (ϕε (ξ))µˆ ≤ 0. Thus, (6.41) holds also when µ ˆ = 2. Lemma 6.9. Suppose that Assumptions 2.1 – 2.3 hold. Then, there exists a ˆ M ˆ ) such that the random quantity Cˆ7 (which is a deterministic function of pˆ, C, following is true: 1 ≤ Cˆ7 < ∞ everywhere and lim sup γnpˆ u(θn ) ≤ Cˆ7 (ϕε (ξ))µˆ
(6.43)
n→∞
on Λ \ N0 for any ε ∈ (0, ∞). Proof. Let C˜1 = 3Cˆ1 Cˆ5 , C˜2 = 6C˜1 Cˆ2 + Cˆ3pˆ + Cˆ6 and Cˆ7 = 2(C˜1 + C˜2 )2 . We use contradiction to show (6.43): Suppose that (6.43) is violated for some sample ω from Λ \ N0 (notice that the formulas which appear in the proof correspond to this sample) and some real number ε ∈ (0, ∞). Then, it can be deduced from Lemma 6.8 26
that there exist n0 > m0 > τ2,ε (depending on ω, ε) such that pˆ γm u(θm0 ) ≤ C˜2 (ϕε (ξ))µˆ , 0 γnpˆ u(θn ) ≥ Cˆ7 (ϕε (ξ))µˆ ,
(6.44) (6.45)
0
0
u(θn ) > C˜2 (ϕε (ξ))µˆ ,
min
γnpˆ
max
γnpˆ u(θn ) < Cˆ7 (ϕε (ξ))µˆ
m0 0 0
(6.51)
(notice that (γm0 +1 /γm0 )pˆ ≤ (γl0 /γm0 )pˆ ≤ 2; also notice that C˜2 /2 ≥ 3C˜1 ), while (6.44), (6.48), (6.50) imply −pˆ u(θn ) ≤(1 + C˜1 )u(θm0 ) + C˜1 γm (ϕε (ξ))µˆ 0 ≤(C˜1 + C˜2 + C˜1 C˜2 )γ −pˆ(ϕε (ξ))µˆ m0
C˜1 + C˜2 + C˜1 C˜2 ). Due to (6.45), (6.47), (6.52), we have l0 < n0 . On the other side, since x + C˜1 ψ(x) ≥ 0 only if x ≥ 0 and since x + C˜1 ψ(x) = (1 + C˜1 )x for x ≥ 0, inequality (6.51) implies −pˆ −pˆ u(θm0 ) ≥(1 + C˜1 )−1 (C˜2 /2 − C˜1 )γm (ϕε (ξ))µˆ ≥ Cˆ2 γm (ϕε (ξ))µˆ 0 0
27
(6.53)
(notice that C˜2 /2 − C˜1 ≥ C˜1 (3Cˆ2 − 1) ≥ 2C˜1 Cˆ2 ≥ (1 + C˜1 )Cˆ2 ). In what follows in the proof, we consider separately the cases µ ˆ < 2 and µ ˆ = 2. Case µ ˆ < 2: Owing to Lemma 6.5 (relations (6.22), (6.25)) and (6.44), (6.53), we have v(θl0 ) ≥v(θm0 ) + (tˆ/Cˆ3 )(ϕε (ξ))−ˆµ/pˆ −1/pˆ ≥ C˜2 γm0 + Cˆ3−1 (γl0 − γm0 ) (ϕε (ξ))−ˆµ/pˆ −1/pˆ ˆ −1 > min{C˜2 , C3 }γl0 (ϕε (ξ))−ˆµ/pˆ −1/pˆ
=C˜2
γl0 (ϕε (ξ))−ˆµ/pˆ
−1/pˆ (notice that tˆ ≥ γl0 − γm0 ; also notice C˜2 < Cˆ3−1 ). Consequently,
u(θl0 ) = (v(θl0 ))
−pˆ
< C˜2 γl−0 pˆ(ϕε (ξ))µˆ .
However, this directly contradicts (6.46) and the fact that l0 < n0 . Thus, (6.43) holds when µ ˆ < 2. Case µ ˆ = 2: Using Lemma 6.5 (relations (6.21), (6.24)) and (6.53), we get u(θl0 ) ≤ 1 − tˆ/Cˆ3 u(θm0 ). Then, (6.44), (6.48) yield u(θl0 ) ≤C˜2 (1 − tˆ/Cˆ3 )(γl0 /γm0 )pˆγl−0 pˆ(ϕε (ξ))µˆ ≤ C˜2 γl−0 pˆ(ϕε (ξ))µˆ . However, this is impossible due to (6.46) and the fact that l0 < n0 . Hence, (6.43) also in the case µ ˆ = 2. Lemma 6.10. Suppose that Assumptions 2.1 – 2.3 hold. Then, for any real ˆs (which is a numbers ε ∈ (0, ∞), s ∈ (1, r] ∩ (1, pˆ), there exist a random quantity B ˆ ˆ deterministic function of s, pˆ, C, M and does not depend on ε) and a non-negative ˆs < ∞, 0 ≤ σε,s < ∞ integer-valued quantity σε,s such that the following is true: 1 ≤ B everywhere and ˆ ˆs γn−p+s ˆs γn−s+1 φε,s (ξ) sup kθk − θn k ≤ B (ϕε (ξ))µˆ (φε,s (ξ))−1 + B
(6.54)
k≥n
on Λ \ N0 for n > σε,s . Proof. Let ε ∈ (0, ∞), s ∈ (1, r] ∩ (1, pˆ) be arbitrary real numbers, while C˜1 = ˆ 2(C2 + Cˆ7 ), C˜2 = 2C˜1 Cˆ5 , C˜3 = 2Cˆ1 C˜2 . Moreover, let ˜s = (3/tˆ)(C˜1 + Cˆ4 )(1 + s/(ˆ B p − s) + 1/(s − 1)) ˆs = B ˜s + 2C˜3 . and B It is straightforward to show γa(n,tˆ) − γn = tˆ + O(αa(n,tˆ) ) and s s s s γa(n, − γ =γ 1 − 1 − (γ − γ )/γ ˆ ˆ n ˆ ˆ n a(n,t) a(n,t) t) a(n,t) −1 −2 s =γa(n,tˆ) stˆγa(n,tˆ) + O(γa(n,tˆ) )
(6.55)
for n → ∞. On the other side, Lemmas 6.7 and 6.9 imply lim sup γnpˆ|u(θn )| ≤ max{Cˆ2 , Cˆ7 }(ϕε (ξ))µˆ n→∞
28
(6.56)
on Λ \ N0 , while Lemma 6.7 and (6.56) yield lim sup γnpˆk∇f (θn )k2 ≤Cˆ5 lim sup γnpˆψ(u(θn )) + Cˆ5 (ϕε (ξ))µˆ n→∞
n→∞
≤2Cˆ5 max{Cˆ2 , Cˆ7 }(ϕε (ξ))µˆ
(6.57)
on the same event. Then, owing to (6.55) – (6.57), it is possible to construct a nonnegative integer-valued random quantity σε,s such that τ1,ε ≤ σε,s < ∞ everywhere and such that γa(n,tˆ) − γn ≥ tˆ/2, s γa(n, tˆ)
−
γns
≤s
(6.58)
s−1 γa(n, , tˆ)
(6.59)
|u(θn )| ≤ C˜1 γn−pˆ(ϕε (ξ))µˆ , ˆ k∇f (θn )k ≤ C˜2 γ −p+s (ϕε (ξ))µˆ (φε,s (ξ))−1 + C˜2 γ −s+1 φε,s (ξ) n
(6.60) (6.61)
n
on Λ \ N0 for n > σε,s .5 Let ω is an arbitrary sample from Λ \ N0 (notice that all formulas which follow in the proof correspond to this sample). Moreover, let {nk }k≥0 be recursively defined by n0 = σε,s + 1 and nk+1 = a(nk , tˆ) for k ≥ 0. Then, due to Lemma 6.6, we have kθnl − θnk k ≤
l−1 X
kθni+1 − θni k
i=k
≤
l−1 X
l−1 X γns i u(θni ) − u(θni+1 ) (φε,s (ξ))−1 + Cˆ4 γn−s φε,s (ξ) i
i=k
≤
l X
i=k
(γns i − γns i−1 )|u(θni )|(φε,s (ξ))−1 + Cˆ4
i=k+1 + γns l |u(θnl )|(φε,s (ξ))−1
+
l−1 X
γn−s φε,s (ξ) i
i=k s γnk |u(θnk )|(φε,s (ξ))−1
for 0 ≤ k ≤ l. Consequently, (6.59), (6.60) yield l X
kθnl − θnk k ≤C˜1 s (ϕε (ξ))µˆ (φε,s (ξ))−1
ˆ γn−ip+s−1 + Cˆ4 φε,s (ξ)
i=k+1 ˆ ˆ + C˜1 (γn−kp+s + γn−lp+s )(ϕε (ξ))µˆ (φε,s (ξ))−1
l−1 X
γn−s i
i=k
(6.62)
for 0 ≤ k ≤ l. Since γnl = γnk +
l−1 X
(γni+1 − γni ) ≥ γnk + (tˆ/2)(l − k)
i=k
for $0\leq k\leq l$ (owing to (6.58)), we get
$$\sum_{i=k}^{\infty}\gamma_{n_i}^{-\lambda-1} \leq \sum_{i=0}^{\infty}(\gamma_{n_k}+i\hat t/2)^{-\lambda-1} \leq \gamma_{n_k}^{-\lambda-1} + \int_0^{\infty}(\gamma_{n_k}+u\hat t/2)^{-\lambda-1}\,du \leq 3\lambda^{-1}\hat t^{-1}\gamma_{n_k}^{-\lambda}$$
for $k\geq 0$ and $\lambda\in(0,\infty)$. Then, (6.62) implies
\begin{align}
\|\theta_{n_l}-\theta_{n_k}\| &\leq \tilde C_1\big(2+3\hat t^{-1}s(\hat p-s)^{-1}\big)\gamma_{n_k}^{-\hat p+s}(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,s}(\xi))^{-1} + 3\hat C_4\hat t^{-1}(s-1)^{-1}\gamma_{n_k}^{-s+1}\phi_{\varepsilon,s}(\xi) \nonumber\\
&\leq \tilde B_s\gamma_{n_k}^{-\hat p+s}(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,s}(\xi))^{-1} + \tilde B_s\gamma_{n_k}^{-s+1}\phi_{\varepsilon,s}(\xi) \tag{6.63}
\end{align}
for $0\leq k\leq l$. On the other side, since $s-1<r$ and $\phi_{\varepsilon,s}(\xi)\geq\xi+\varepsilon$, Lemma 6.2 (inequality (6.4)) and (6.61) yield
\begin{align*}
\|\theta_k-\theta_n\| &\leq \hat C_1\big(\|\nabla f(\theta_n)\|+\gamma_n^{-r}(\xi+\varepsilon)\big) \leq \hat C_1\big(\|\nabla f(\theta_n)\|+\gamma_n^{-s+1}\phi_{\varepsilon,s}(\xi)\big)\\
&\leq \tilde C_3\gamma_n^{-\hat p+s}(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,s}(\xi))^{-1} + \tilde C_3\gamma_n^{-s+1}\phi_{\varepsilon,s}(\xi)
\end{align*}
for $\sigma_{\varepsilon,s}<n\leq k\leq a(n,\hat t)$ (notice that $\sigma_{\varepsilon,s}\geq\tau_{1,\varepsilon}$). Combining this with (6.63), we obtain
\begin{align*}
\|\theta_k-\theta_n\| &\leq \|\theta_k-\theta_{n_j}\| + \|\theta_{n_j}-\theta_{n_i}\| + \|\theta_{n_i}-\theta_n\|\\
&\leq \tilde B_s\gamma_{n_i}^{-\hat p+s}(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,s}(\xi))^{-1} + \tilde B_s\gamma_{n_i}^{-s+1}\phi_{\varepsilon,s}(\xi)\\
&\quad + \tilde C_3\big(\gamma_n^{-\hat p+s}+\gamma_{n_j}^{-\hat p+s}\big)(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,s}(\xi))^{-1} + \tilde C_3\big(\gamma_n^{-s+1}+\gamma_{n_j}^{-s+1}\big)\phi_{\varepsilon,s}(\xi)\\
&\leq \hat B_s\gamma_n^{-\hat p+s}(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,s}(\xi))^{-1} + \hat B_s\gamma_n^{-s+1}\phi_{\varepsilon,s}(\xi)
\end{align*}
for $\sigma_{\varepsilon,s}<n\leq k$, $1\leq i\leq j$ satisfying $n_{i-1}\leq n<n_i$, $n_j\leq k<n_{j+1}$. Then, it is obvious that (6.54) is true.

Lemma 6.11. Suppose that Assumptions 2.1 -- 2.3 hold. Then, there exists a random quantity $\hat C_8$ (which is a deterministic function of $r$, $\hat\mu$, $\hat C$, $\hat M$) such that the following is true: $1\leq\hat C_8<\infty$ everywhere and
$$\limsup_{n\rightarrow\infty}\gamma_n^{\hat q}\sup_{k\geq n}\|\theta_k-\theta_n\| \leq \hat C_8\varphi_\varepsilon(\xi) \tag{6.64}$$
on $\Lambda\setminus N_0$ for any $\varepsilon\in(0,\infty)$.

Proof. Let $\hat s=\min\{r,\hat r\}$ and $\hat C_8=2\hat B_{\hat s}$, while $\varepsilon\in(0,\infty)$ is an arbitrary real number. Moreover, let $\omega$ be an arbitrary sample from $\Lambda\setminus N_0$ (notice that all formulas which follow in the proof correspond to this sample). In order to show (6.64), we consider separately the cases $r<\hat r$ and $r\geq\hat r$.

Case $r<\hat r$: We have $\hat s=r$, $\hat q=r-1$ and $\hat p=\hat\mu r=r(2-1/\hat r)$. Consequently, $\hat p-r=r-r/\hat r>0$, $\hat p-2r+1=1-r/\hat r>0$ (notice that $r/\hat r<1<r$), i.e., $\hat s<\hat p$, $\hat q<\hat p-\hat s$. Then, Lemma 6.10 implies
$$\limsup_{n\rightarrow\infty}\gamma_n^{\hat q}\sup_{k\geq n}\|\theta_k-\theta_n\| \leq \hat B_{\hat s}\phi_{\varepsilon,\hat s}(\xi) \leq \hat C_8\varphi_\varepsilon(\xi)$$
(notice that $\lim_{n\rightarrow\infty}\gamma_n^{-\hat p+\hat q+\hat s}=0$ and $\phi_{\varepsilon,\hat s}(\xi)=\varphi_\varepsilon(\xi)$). Thus, (6.64) holds when $r<\hat r$.

Case $r\geq\hat r$: We have $\hat s=\hat r$, $\hat q=\hat r-1$ and $\hat p=\hat\mu\hat r=2\hat r-1$. Therefore, $\hat q=\hat p-\hat s$ and $\hat s=(\hat p+1)/2<\hat p$ (notice that $\hat p>1$). Then, Lemma 6.10 yields
\begin{align*}
\limsup_{n\rightarrow\infty}\gamma_n^{\hat q}\sup_{k\geq n}\|\theta_k-\theta_n\| &\leq \hat B_{\hat s}(\varphi_\varepsilon(\xi))^{\hat\mu}(\phi_{\varepsilon,\hat s}(\xi))^{-1} + \hat B_{\hat s}\phi_{\varepsilon,\hat s}(\xi)\\
&\leq \hat B_{\hat s}(\varphi_\varepsilon(\xi))^{\hat\mu-1} + \hat B_{\hat s}\varphi_\varepsilon(\xi)\\
&\leq \hat C_8\varphi_\varepsilon(\xi)
\end{align*}
(notice that $\phi_{\varepsilon,\hat s}(\xi)=\varphi_\varepsilon(\xi)$, since both $r=\hat s$, $r>\hat r$ cannot hold if $r\geq\hat r$; also notice that $\hat\mu-1\leq 1$ and that $\varphi_\varepsilon(\xi)\geq 1$ if $r\geq\hat r$). Hence, (6.64) is true when $r\geq\hat r$.
Proof of Theorems 2.1 and 2.2. Owing to Lemmas 6.3 and 6.10, $\hat\theta=\lim_{n\rightarrow\infty}\theta_n$ exists and satisfies $\nabla f(\hat\theta)=0$ on $\Lambda\setminus N_0$. Thus, Theorem 2.2 holds. In addition, we have $\hat Q\subseteq\{\theta\in\mathbb{R}^{d_\theta}:\|\theta-\hat\theta\|\leq\delta_{\hat\theta}\}$ on $\Lambda\setminus N_0$ ($\delta_{\hat\theta}$ is specified in Remark 2.1). Therefore, on $\Lambda\setminus N_0$, the random quantities $\hat\mu$, $\hat p$, $\hat r$ defined in this section coincide with $\hat\mu$, $\hat p$, $\hat r$ specified in Theorem 2.2 (see Remark 2.1). Similarly, $\hat C$, $\hat M$ introduced in this section are identical to $C_{\hat\theta}$, $M_{\hat\theta}$ (specified in Section 2) on $\Lambda\setminus N_0$. Thus, Theorem 2.1 is true.

Let $\hat K=2\hat C_5(\hat C_5+\hat C_7)+\hat C_8$. Then, Lemmas 6.5, 6.8 and the limit process $\varepsilon\rightarrow 0$ imply
$$\limsup_{n\rightarrow\infty}\gamma_n^{\hat p}|u(\theta_n)| \leq (\hat C_2+\hat C_7)(\varphi(\xi))^{\hat\mu} \leq \hat K(\varphi(\xi))^{\hat\mu}$$
on $\Lambda\setminus N_0$. Consequently, Lemma 6.5 yields
$$\limsup_{n\rightarrow\infty}\gamma_n^{\hat p}\|\nabla f(\theta_n)\|^2 \leq \hat C_5(\varphi(\xi))^{\hat\mu} + \hat C_5\limsup_{n\rightarrow\infty}\gamma_n^{\hat p}\psi(u(\theta_n)) \leq \hat K(\varphi(\xi))^{\hat\mu}$$
on $\Lambda\setminus N_0$. On the other side, using Lemma 6.11, we get
$$\limsup_{n\rightarrow\infty}\gamma_n^{\hat q}\|\theta_n-\hat\theta\| \leq \hat C_8\varphi(\xi) \leq \hat K\varphi(\xi)$$
on $\Lambda\setminus N_0$. Hence, Theorem 2.2 holds, too.

Remark 6.4. Owing to Lemma 6.10, $\{\theta_n\}_{n\geq 0}$ converges to $\hat\theta$ on $\Lambda\setminus N_0$ at the rate $O\big(\gamma_n^{-\min\{\hat p-s,\,s-1\}}\big)$, where $s\in(1,r]\cap(1,\hat p)$. It is straightforward to show
$$\hat q = \max_{s\in(1,r]\cap(1,\hat p)}\min\{\hat p-s,\,s-1\}.$$
This suggests that $O(\gamma_n^{-\hat q})$ is the tightest bound on the convergence rate of $\{\theta_n\}_{n\geq 0}$ which can be obtained by the arguments Lemmas 6.6, 6.10 and 6.11 are based on.
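To spell this maximization out (an added illustration; it only uses the expressions for $\hat p$ and $\hat q$ already quoted in the proof of Lemma 6.11): the map $s\mapsto\min\{\hat p-s,\,s-1\}$ increases up to $s^*=(\hat p+1)/2$ and decreases afterwards. If $r<\hat r$, then $\hat p=r(2-1/\hat r)$ and $s^*>r$, so the maximum over $(1,r]\cap(1,\hat p)=(1,r]$ is attained at $s=r$ and equals
$$\min\{\hat p-r,\,r-1\} = r-1 = \hat q,$$
since $\hat p-r=r-r/\hat r\geq r-1$. If $r\geq\hat r$, then $\hat p=2\hat r-1$ and $s^*=\hat r\leq r$, so the maximum is
$$\min\{\hat p-s^*,\,s^*-1\} = \hat r-1 = \hat q.$$
In both cases the maximum coincides with the value of $\hat q$ used in the proof of Lemma 6.11.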
7. Proof of Theorem 3.1. The following notation is used in this section. For $\theta\in\mathbb{R}^{d_\theta}$, $z\in\mathbb{R}^{d_z}$, $E_{\theta,z}(\cdot)$ denotes $E(\cdot|\theta_0=\theta,Z_0=z)$. Moreover, let $\xi_n=F(\theta_n,Z_{n+1})-\nabla f(\theta_n)$,
\begin{align*}
\xi_{1,n} &= \tilde F(\theta_n,Z_{n+1}) - (\Pi\tilde F)(\theta_n,Z_n),\\
\xi_{2,n} &= (\Pi\tilde F)(\theta_n,Z_n) - (\Pi\tilde F)(\theta_{n-1},Z_n),\\
\xi_{3,n} &= -(\Pi\tilde F)(\theta_n,Z_{n+1})
\end{align*}
for $n\geq 1$. Then, it is obvious that algorithm (3.1) admits the form (2.1), while Assumption 3.2 yields
$$\sum_{i=n}^{k}\alpha_i\gamma_i^r\xi_i = \sum_{i=n}^{k}\alpha_i\gamma_i^r\xi_{1,i} + \sum_{i=n}^{k}\alpha_i\gamma_i^r\xi_{2,i} + \sum_{i=n}^{k}(\alpha_i\gamma_i^r-\alpha_{i+1}\gamma_{i+1}^r)\xi_{3,i} + \alpha_{k+1}\gamma_{k+1}^r\xi_{3,k} - \alpha_n\gamma_n^r\xi_{3,n-1} \tag{7.1}$$
for $1\leq n\leq k$.
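The last three terms in (7.1) come from a summation by parts. A minimal sketch, assuming that Assumption 3.2 (stated in Section 3 and not reproduced here) is the usual Poisson-equation condition $F(\theta,z)-\nabla f(\theta)=\tilde F(\theta,z)-(\Pi\tilde F)(\theta,z)$: in that case $\xi_i-\xi_{1,i}-\xi_{2,i}=\xi_{3,i}-\xi_{3,i-1}$, and, with $a_i=\alpha_i\gamma_i^r$,
$$\sum_{i=n}^{k}a_i(\xi_{3,i}-\xi_{3,i-1}) = \sum_{i=n}^{k}(a_i-a_{i+1})\xi_{3,i} + a_{k+1}\xi_{3,k} - a_n\xi_{3,n-1},$$
which yields (7.1).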
Lemma 7.1. Let Assumption 3.1 hold. Then, there exists a real number $s\in(0,1)$ such that $\sum_{n=0}^{\infty}\alpha_n^{1+s}\gamma_n^r<\infty$.

Proof. Let $p=(2+2r)/(2+r)$, $q=(2+2r)/r$, $s=(2+r)/(2+2r)$. Then, using the H\"older inequality, we get
$$\sum_{n=0}^{\infty}\alpha_n^{1+s}\gamma_n^r = \sum_{n=1}^{\infty}(\alpha_n^2\gamma_n^{2r})^{1/p}\left(\frac{\alpha_n}{\gamma_n^2}\right)^{1/q} \leq \left(\sum_{n=1}^{\infty}\alpha_n^2\gamma_n^{2r}\right)^{1/p}\left(\sum_{n=1}^{\infty}\frac{\alpha_n}{\gamma_n^2}\right)^{1/q}.$$
Since $\gamma_{n+1}/\gamma_n=1+\alpha_n/\gamma_n=O(1)$ for $n\rightarrow\infty$ and
$$\sum_{n=1}^{\infty}\frac{\alpha_n}{\gamma_n^2} = \sum_{n=1}^{\infty}\frac{\gamma_{n+1}-\gamma_n}{\gamma_n^2} \leq \sum_{n=1}^{\infty}\frac{\gamma_{n+1}^2}{\gamma_n^2}\int_{\gamma_n}^{\gamma_{n+1}}\frac{dt}{t^2} \leq \left(\max_{n\geq 0}\frac{\gamma_{n+1}^2}{\gamma_n^2}\right)\frac{1}{\gamma_1},$$
it is obvious that $\sum_{n=0}^{\infty}\alpha_n^{1+s}\gamma_n^r$ converges.
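As a sanity check on the exponents in the H\"older step (an added verification; it uses only the values of $p$, $q$, $s$ defined in the proof), note that $1/p+1/q=(2+r)/(2+2r)+r/(2+2r)=1$ and
\begin{align*}
\frac{2}{p}+\frac{1}{q} &= \frac{2(2+r)+r}{2+2r} = \frac{4+3r}{2+2r} = 1+s,\\
\frac{2r}{p}-\frac{2}{q} &= \frac{2r(2+r)-2r}{2+2r} = \frac{2r(1+r)}{2+2r} = r,
\end{align*}
so that $(\alpha_n^2\gamma_n^{2r})^{1/p}(\alpha_n\gamma_n^{-2})^{1/q}=\alpha_n^{1+s}\gamma_n^r$, exactly the summand on the left-hand side.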
Proof of Theorem 3.1. Let $Q\subset\mathbb{R}^{d_\theta}$ be an arbitrary compact set, while $s\in(0,1)$ is a real number such that $\sum_{n=0}^{\infty}\alpha_n^{1+s}\gamma_n^r<\infty$. Obviously, it is sufficient to show that $\sum_{n=0}^{\infty}\alpha_n\gamma_n^r\xi_n$ converges w.p.1 on $\bigcap_{n=0}^{\infty}\{\theta_n\in Q\}$. Due to Assumption 3.1, we have
\begin{align*}
\alpha_{n-1}^s\alpha_n\gamma_n^r &= \big(1+\alpha_n^{-1}(\alpha_{n-1}-\alpha_n)\big)^s\alpha_n^{1+s}\gamma_n^r = O(\alpha_n^{1+s}\gamma_n^r),\\
(\alpha_{n-1}-\alpha_n)\gamma_n^r &= (\alpha_n^{-1}-\alpha_{n-1}^{-1})\big(1+\alpha_n^{-1}(\alpha_{n-1}-\alpha_n)\big)\alpha_n^2\gamma_n^r = O(\alpha_n^2\gamma_n^r),\\
\alpha_n(\gamma_{n+1}^r-\gamma_n^r) &= \alpha_n\gamma_n^r\big((1+\alpha_n/\gamma_n)^r-1\big) = \alpha_n\gamma_n^r\big(r\alpha_n/\gamma_n+o(\alpha_n/\gamma_n)\big) = o(\alpha_n^2\gamma_n^r)
\end{align*}
as $n\rightarrow\infty$. Consequently,
\begin{gather}
\sum_{n=0}^{\infty}\alpha_n^s\alpha_{n+1}\gamma_{n+1}^r < \infty, \tag{7.2}\\
\sum_{n=0}^{\infty}|\alpha_n\gamma_n^r-\alpha_{n+1}\gamma_{n+1}^r| \leq \sum_{n=0}^{\infty}\alpha_n|\gamma_n^r-\gamma_{n+1}^r| + \sum_{n=0}^{\infty}|\alpha_n-\alpha_{n+1}|\gamma_{n+1}^r < \infty. \tag{7.3}
\end{gather}
On the other side, as a result of Assumption 3.3, we get
\begin{align*}
E_{\theta,z}\big(\|\xi_{1,n}\|^2 I_{\{\tau_Q>n\}}\big) &\leq 2E_{\theta,z}\big(\varphi_{Q,s}^2(Z_{n+1})I_{\{\tau_Q>n\}}\big) + 2E_{\theta,z}\big(\varphi_{Q,s}^2(Z_n)I_{\{\tau_Q>n-1\}}\big),\\
E_{\theta,z}\big(\|\xi_{2,n}\|I_{\{\tau_Q>n\}}\big) &\leq E_{\theta,z}\big(\varphi_{Q,s}(Z_n)\|\theta_n-\theta_{n-1}\|^s I_{\{\tau_Q>n-1\}}\big) \leq \alpha_{n-1}^s E_{\theta,z}\big(\varphi_{Q,s}^2(Z_n)I_{\{\tau_Q>n-1\}}\big),\\
E_{\theta,z}\big(\|\xi_{3,n}\|^2 I_{\{\tau_Q>n\}}\big) &\leq E_{\theta,z}\big(\varphi_{Q,s}^2(Z_{n+1})I_{\{\tau_Q>n\}}\big)
\end{align*}
for all $\theta\in\mathbb{R}^{d_\theta}$, $z\in\mathbb{R}^{d_z}$, $n\geq 1$. Then, Assumption 3.1 and (7.2) yield
\begin{align*}
E_{\theta,z}\left(\sum_{n=1}^{\infty}\alpha_n^2\gamma_n^{2r}\|\xi_{1,n}\|^2 I_{\{\tau_Q>n\}}\right) &\leq 4\left(\sum_{n=1}^{\infty}\alpha_n^2\gamma_n^{2r}\right)\sup_{n\geq 0}E_{\theta,z}\big(\varphi_{Q,s}^2(Z_n)I_{\{\tau_Q\geq n\}}\big) < \infty,\\
E_{\theta,z}\left(\sum_{n=1}^{\infty}\alpha_n\gamma_n^r\|\xi_{2,n}\|I_{\{\tau_Q>n\}}\right) &\leq \left(\sum_{n=1}^{\infty}\alpha_{n-1}^s\alpha_n\gamma_n^r\right)\sup_{n\geq 0}E_{\theta,z}\big(\varphi_{Q,s}^2(Z_n)I_{\{\tau_Q\geq n\}}\big) < \infty
\end{align*}
for any $\theta\in\mathbb{R}^{d_\theta}$, $z\in\mathbb{R}^{d_z}$, while (7.3) implies
\begin{align*}
E_{\theta,z}\left(\sum_{n=1}^{\infty}|\alpha_n\gamma_n^r-\alpha_{n+1}\gamma_{n+1}^r|\,\|\xi_{3,n}\|I_{\{\tau_Q>n\}}\right) &\leq \left(\sum_{n=1}^{\infty}|\alpha_n\gamma_n^r-\alpha_{n+1}\gamma_{n+1}^r|\right)\left(\sup_{n\geq 0}E_{\theta,z}\big(\varphi_{Q,s}^2(Z_n)I_{\{\tau_Q\geq n\}}\big)\right)^{1/2} < \infty,\\
E_{\theta,z}\left(\sum_{n=1}^{\infty}\alpha_{n+1}^2\gamma_{n+1}^{2r}\|\xi_{3,n}\|^2 I_{\{\tau_Q>n\}}\right) &\leq \left(\sum_{n=1}^{\infty}\alpha_{n+1}^2\gamma_{n+1}^{2r}\right)\sup_{n\geq 0}E_{\theta,z}\big(\varphi_{Q,s}^2(Z_n)I_{\{\tau_Q\geq n\}}\big) < \infty
\end{align*}
for each $\theta\in\mathbb{R}^{d_\theta}$, $z\in\mathbb{R}^{d_z}$. Since
$$E_{\theta,z}\big(\xi_{1,n}I_{\{\tau_Q>n\}}\big|\mathcal{F}_n\big) = \big(E_{\theta,z}(\tilde F(\theta_n,Z_{n+1})|\mathcal{F}_n) - (\Pi\tilde F)(\theta_n,Z_n)\big)I_{\{\tau_Q>n\}} = 0$$
w.p.1 for every $\theta\in\mathbb{R}^{d_\theta}$, $z\in\mathbb{R}^{d_z}$, $n\geq 1$, it can be deduced easily that the series
$$\sum_{n=1}^{\infty}\alpha_n\gamma_n^r\xi_{1,n}, \qquad \sum_{n=1}^{\infty}\alpha_n\gamma_n^r\xi_{2,n}, \qquad \sum_{n=1}^{\infty}(\alpha_n\gamma_n^r-\alpha_{n+1}\gamma_{n+1}^r)\xi_{3,n}$$
converge w.p.1 on $\bigcap_{n=0}^{\infty}\{\theta_n\in Q\}$, as well as that $\lim_{n\rightarrow\infty}\alpha_n\gamma_n^r\xi_{3,n-1}=0$ w.p.1 on the same event. Owing to this and (7.1), we have that $\sum_{n=0}^{\infty}\alpha_n\gamma_n^r\xi_n$ converges w.p.1 on $\bigcap_{n=0}^{\infty}\{\theta_n\in Q\}$.
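For the reader's convenience, two standard facts suffice for this last step (an added note; they are classical and the filtration and measurability details are omitted): if $\{M_n\}_{n\geq 1}$ is a martingale-difference sequence with $\sum_{n=1}^{\infty}E\|M_n\|^2<\infty$, then $\sum_{n=1}^{\infty}M_n$ converges almost surely, while if $\sum_{n=1}^{\infty}E\|x_n\|<\infty$, then $\sum_{n=1}^{\infty}x_n$ converges absolutely almost surely. The first fact applies to the stopped terms $\alpha_n\gamma_n^r\xi_{1,n}I_{\{\tau_Q>n\}}$, the second to the terms involving $\xi_{2,n}$ and $\xi_{3,n}$; moreover, $\alpha_n\gamma_n^r\xi_{3,n-1}\rightarrow 0$ on the same event since the terms of the convergent series $\sum_n\alpha_{n+1}^2\gamma_{n+1}^{2r}\|\xi_{3,n}\|^2 I_{\{\tau_Q>n\}}$ tend to zero.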
8. Proof of Theorems 4.1 and 4.2. In this section, we use the following notation. For $\theta\in\mathbb{R}^{d_\theta}$, $x\in\mathbb{R}^N$, $y\in\mathbb{R}$ and $z=[x^T\ y]^T$, let $F(\theta,z)=-(y-G_\theta(x))H_\theta(x)$, while $Z_{n+1}=[X_n^T\ Y_n]^T$ for $n\geq 0$. With this notation, it is obvious that algorithm (4.1) admits the form (3.1).
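As an illustration of the type of recursion analyzed here, the following sketch trains the single-hidden-layer network $G_\theta(x)=\sum_{i=1}^{M}a_i\psi\big(\sum_{j=1}^{N}b_{i,j}x_j\big)$ by stochastic gradient descent on the squared prediction error, with $H_\theta$ taken to be $\nabla_\theta G_\theta$. This is only a minimal sketch: the sigmoid activation, the synthetic data, the step sizes, the descent sign convention and all function names are illustrative assumptions and are not taken from the paper (the exact recursion (4.1) and Assumptions 4.1 and 4.2 are stated in Section 4, which is not reproduced here).

```python
import numpy as np

def psi(t):
    # Illustrative activation (sigmoid); the paper only requires an analytic activation.
    return 1.0 / (1.0 + np.exp(-t))

def psi_prime(t):
    s = psi(t)
    return s * (1.0 - s)

def G(a, B, x):
    # G_theta(x) = sum_i a_i * psi(sum_j B[i, j] * x_j)
    return a @ psi(B @ x)

def grad_G(a, B, x):
    # Gradient of G_theta(x) with respect to (a, B); plays the role of H_theta(x).
    h = B @ x
    grad_a = psi(h)                          # dG/da_i
    grad_B = np.outer(a * psi_prime(h), x)   # dG/dB_{i,j}
    return grad_a, grad_B

def sgd(samples, M, N, alpha0=0.1, seed=0):
    # Assumed update theta_{n+1} = theta_n - alpha_n * F(theta_n, Z_{n+1}),
    # with F(theta, z) = -(y - G_theta(x)) * H_theta(x) as in Section 8.
    rng = np.random.default_rng(seed)
    a = rng.normal(scale=0.1, size=M)
    B = rng.normal(scale=0.1, size=(M, N))
    for n, (x, y) in enumerate(samples, start=1):
        alpha = alpha0 / n                   # illustrative step-size choice
        err = y - G(a, B, x)
        grad_a, grad_B = grad_G(a, B, x)
        a += alpha * err * grad_a            # -alpha * F = +alpha * err * grad G
        B += alpha * err * grad_B
    return a, B

if __name__ == "__main__":
    # Usage: synthetic regression data with bounded inputs and outputs.
    rng = np.random.default_rng(1)
    data = [(rng.uniform(-1, 1, size=3), float(rng.uniform(-1, 1))) for _ in range(5000)]
    a, B = sgd(data, M=4, N=3)
    print(a, B)
```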
Proof of Theorem 4.1. Let $\theta=[a_1\cdots a_M\ b_{1,1}\cdots b_{M,N}]^T\in\mathbb{R}^{d_\theta}$, while
$$\delta_\theta = \frac{\varepsilon}{2KLMN(1+\|\theta\|)}$$
and $\hat U_\theta=\{\eta\in\mathbb{C}^{d_\theta}:\|\eta-\theta\|<\delta_\theta\}$ ($\varepsilon$ is specified in Assumption 4.1). Moreover, for $\eta=[c_1\cdots c_M\ d_{1,1}\cdots d_{M,N}]^T\in\mathbb{C}^{d_\theta}$, $x=[x_1\cdots x_N]^T\in\mathbb{R}^N$, let
$$\hat G_\eta(x) = \sum_{i=1}^{M}c_i\hat\psi\left(\sum_{j=1}^{N}d_{i,j}x_j\right), \qquad \hat f(\eta) = \frac{1}{2}\int(y-\hat G_\eta(x))^2\,\pi(dx,dy).$$
Then, we have
$$\left|\sum_{j=1}^{N}d_{i,j}x_j - \sum_{j=1}^{N}b_{i,j}x_j\right| \leq \sum_{j=1}^{N}|d_{i,j}-b_{i,j}|\,|x_j| \leq \delta_\theta LN < \varepsilon$$
for all $\eta=[c_1\cdots c_M\ d_{1,1}\cdots d_{M,N}]^T\in\hat U_\theta$, $1\leq i\leq M$ and each $x=[x_1\cdots x_N]^T\in\mathbb{R}^N$ satisfying $\|x\|\leq L$. Consequently, Assumption 4.1 implies
\begin{align*}
\left|\sum_{i=1}^{M}c_i\hat\psi\Big(\sum_{j=1}^{N}d_{i,j}x_j\Big) - \sum_{i=1}^{M}a_i\hat\psi\Big(\sum_{j=1}^{N}b_{i,j}x_j\Big)\right| &\leq \sum_{i=1}^{M}|c_i-a_i|\left|\hat\psi\Big(\sum_{j=1}^{N}d_{i,j}x_j\Big)\right| + \sum_{i=1}^{M}|a_i|\left|\hat\psi\Big(\sum_{j=1}^{N}d_{i,j}x_j\Big)-\hat\psi\Big(\sum_{j=1}^{N}b_{i,j}x_j\Big)\right|\\
&\leq \delta_\theta KM + K\sum_{i=1}^{M}|a_i|\left|\sum_{j=1}^{N}d_{i,j}x_j-\sum_{j=1}^{N}b_{i,j}x_j\right|\\
&\leq \delta_\theta KM + \delta_\theta KLMN\|\theta\| < \varepsilon
\end{align*}
for any $\eta=[c_1\cdots c_M\ d_{1,1}\cdots d_{M,N}]^T\in\hat U_\theta$ and each $x=[x_1\cdots x_N]^T\in\mathbb{R}^N$ satisfying $\|x\|\leq L$. Then, it can be deduced that for all $x\in\mathbb{R}^N$ satisfying $\|x\|\leq L$, $\hat G_\eta(x)$ is analytic in $\eta$ on $\hat U_\theta$. On the other side, Assumption 4.1 yields
\begin{align*}
|\hat G_\eta(x)| &\leq \sum_{i=1}^{M}|c_i|\left|\hat\psi\Big(\sum_{j=1}^{N}d_{i,j}x_j\Big)\right| \leq KM\|\eta\|,\\
\left|\frac{\partial}{\partial c_k}\hat G_\eta(x)\right| &= \left|\hat\psi\Big(\sum_{j=1}^{N}d_{k,j}x_j\Big)\right| \leq K,\\
\left|\frac{\partial}{\partial d_{k,l}}\hat G_\eta(x)\right| &= \left|\hat\psi'\Big(\sum_{j=1}^{N}d_{k,j}x_j\Big)c_k x_l\right| \leq KL\|\eta\|
\end{align*}
for all $\eta=[c_1\cdots c_M\ d_{1,1}\cdots d_{M,N}]^T\in\hat U_\theta$, $1\leq k\leq M$, $1\leq l\leq N$ and each $x=[x_1\cdots x_N]^T\in\mathbb{R}^N$ satisfying $\|x\|\leq L$. Therefore,
$$\|\nabla_\eta\hat G_\eta(x)\| \leq KLMN(1+\|\eta\|)$$
for any $\eta\in\hat U_\theta$ and each $x\in\mathbb{R}^N$ satisfying $\|x\|\leq L$. Thus,
$$\|\nabla_\eta(y-\hat G_\eta(x))^2\| = 2|y-\hat G_\eta(x)|\,\|\nabla_\eta\hat G_\eta(x)\| \leq 2K^2L^2M^2N(1+\|\eta\|)^2$$
for all $\eta\in\hat U_\theta$ and each $x\in\mathbb{R}^N$, $y\in\mathbb{R}$ satisfying $\|x\|\leq L$, $|y|\leq L$. Then, the dominated convergence theorem and Assumption 4.2 imply that $\hat f(\cdot)$ is differentiable on $\hat U_\theta$. Consequently, $\hat f(\cdot)$ is analytic on $\hat U_\theta$. Since $f(\theta)=\hat f(\theta)$ for all $\theta\in\mathbb{R}^{d_\theta}$, we conclude that $f(\cdot)$ is real-analytic on the whole of $\mathbb{R}^{d_\theta}$.

Proof of Theorem 4.2. As $\{Z_n\}_{n\geq 0}$ can be interpreted as a Markov chain whose transition kernel does not depend on $\{\theta_n\}_{n\geq 0}$, it is straightforward to show that Assumptions 3.2 and 3.3 hold. The theorem's assertion then follows directly from Theorem 3.1.

9. Proof of Theorems 5.1 and 5.2. In this section, we use the following notation. For $n\geq 0$, let
$$Z_n = [X_n^T\ Y_n\cdots Y_{n-M+1}\ \varepsilon_n\ \psi_n^T\cdots\varepsilon_{n-N+1}\ \psi_{n-N+1}^T]^T,$$
while $d_z=L+(M+N)(N+1)$. For $\theta\in\Theta$, let $\varepsilon_0^\theta=\cdots=\varepsilon_{-N+1}^\theta=0$, $\psi_0^\theta=\cdots=\psi_{-N+1}^\theta=0$, while $\{\varepsilon_n^\theta\}_{n\geq 0}$, $\{\psi_n^\theta\}_{n\geq 0}$ are defined by the following recursion:
\begin{align*}
\phi_{n-1}^\theta &= [Y_{n-1}\cdots Y_{n-M}\ \varepsilon_{n-1}^\theta\cdots\varepsilon_{n-N}^\theta]^T,\\
\varepsilon_n^\theta &= Y_n - (\phi_{n-1}^\theta)^T\theta,\\
\psi_n^\theta &= \phi_{n-1}^\theta - [\psi_{n-1}^\theta\cdots\psi_{n-N}^\theta]D\theta,\\
Z_n^\theta &= [X_n^T\ Y_n\cdots Y_{n-M+1}\ \varepsilon_n^\theta\ (\psi_n^\theta)^T\cdots\varepsilon_{n-N+1}^\theta\ (\psi_{n-N+1}^\theta)^T]^T, \qquad n\geq 1.
\end{align*}
Then, it is straightforward to verify that $\{\varepsilon_n^\theta\}_{n\geq 0}$ satisfies the recursion (5.2), as well as that $\psi_n^\theta=\nabla_\theta\varepsilon_n^\theta$ for $n\geq 0$. Moreover, it can be deduced easily that there exist a matrix-valued function $G_\theta:\Theta\rightarrow\mathbb{R}^{d_z\times d_z}$ and a matrix $H\in\mathbb{R}^{d_z\times L}$ with the following properties: (i) $G_\theta$ is linear in $\theta$ and its eigenvalues lie in $\{z\in\mathbb{C}:|z|<1\}$ for each $\theta\in\Theta$; (ii) the equations
$$Z_{n+1}^\theta = G_\theta Z_n^\theta + HV_n, \qquad Z_{n+1} = G_{\theta_n}Z_n + HV_n$$
hold for all $\theta\in\Theta$, $n\geq 0$.

The following notation is also used in this section. For $\theta\in\Theta$, $x\in\mathbb{R}^L$, $y_1,\dots,y_M\in\mathbb{R}$, $e_1,\dots,e_N\in\mathbb{R}$, $f_1,\dots,f_N\in\mathbb{R}^{d_\theta}$, and $z=[x^T\ y_1\cdots y_M\ e_1\ f_1^T\cdots e_N\ f_N^T]^T$, let $\phi(z)=e_1^2$, $F(\theta,z)=f_1e_1$, while
$$\Pi_\theta(z,B) = E\big(I_B(G_\theta z+HV_0)\big)$$
for a Borel-measurable set $B$ from $\mathbb{R}^{d_z}$. Then, it can be deduced easily that recursion (5.3) -- (5.6) admits the form of the algorithm considered in Section 3. Furthermore, it can be shown that
$$(\Pi^n\phi)(\theta,0) = E\big((\varepsilon_n^\theta)^2\big), \tag{9.1}$$
$$(\Pi^nF)(\theta,0) = E\big(\psi_n^\theta\varepsilon_n^\theta\big) = \nabla_\theta(\Pi^n\phi)(\theta,0) \tag{9.2}$$
for each $\theta\in\Theta$, $n\geq 0$.
Proof of Theorem 5.1. Let $m=E(Y_0)$ and $r_k=r_{-k}=\mathrm{Cov}(Y_0,Y_k)$ for $k\geq 0$, while
$$\varphi(\omega) = \sum_{k=-\infty}^{\infty}r_ke^{-i\omega k}$$
for $\omega\in[-\pi,\pi]$. Moreover, for $\theta\in\Theta$, $z\in\mathbb{C}$, let $C_\theta(z)=A_\theta(z)/B_\theta(z)$, while
$$\alpha_\theta = 1+\max_{\omega\in[-\pi,\pi]}|A_\theta(e^{i\omega})|, \qquad \beta_\theta = \min_{\omega\in[-\pi,\pi]}|B_\theta(e^{i\omega})|, \qquad \delta_\theta = \frac{\beta_\theta}{4d_\theta\alpha_\theta}.$$
Obviously, $1\leq\alpha_\theta<\infty$, $0<\beta_\theta,\delta_\theta<\infty$ (notice that the zeros of $B_\theta(\cdot)$ are outside $\{z\in\mathbb{C}:|z|\leq 1\}$).

As $\sum_{k=0}^{\infty}|r_k|<\infty$, $|\varphi(\cdot)|$ is uniformly bounded. Consequently, the spectral theory for stationary processes (see e.g. [7, Chapter 2]) yields
$$\lim_{n\rightarrow\infty}E(\varepsilon_n^\theta) = C_\theta(1)m, \qquad \lim_{n\rightarrow\infty}\mathrm{Cov}(\varepsilon_n^\theta,\varepsilon_{n+k}^\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}|C_\theta(e^{i\omega})|^2\varphi(\omega)e^{i\omega k}\,d\omega$$
for all $\theta\in\Theta$, $k\geq 0$ (notice that $\varepsilon_n^\theta=C_\theta(q)Y_n$ and the poles of $C_\theta(\cdot)$ are in $\{z\in\mathbb{C}:|z|>1\}$). Therefore,
$$f(\theta) = \frac{1}{4\pi}\int_{-\pi}^{\pi}|C_\theta(e^{i\omega})|^2\varphi(\omega)\,d\omega + |C_\theta(1)|^2\frac{m^2}{2} \tag{9.3}$$
for any $\theta\in\Theta$.
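As a simple illustration of (9.3) (an added example, under the extra assumption that $\{Y_n\}$ is zero-mean white noise with variance $r_0$): in that case $m=0$ and $\varphi(\omega)\equiv r_0$, so (9.3) reduces to
$$f(\theta) = \frac{r_0}{4\pi}\int_{-\pi}^{\pi}|C_\theta(e^{i\omega})|^2\,d\omega.$$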
On the other side, it is straightforward to verify
\begin{gather*}
\frac{\partial}{\partial a_k}A_\theta(e^{i\omega}) = -e^{-i\omega k}, \qquad \frac{\partial^2}{\partial a_{k_1}\partial a_{k_2}}A_\theta(e^{i\omega}) = 0,\\
\frac{\partial^{l_1+\cdots+l_N}}{\partial b_1^{l_1}\cdots\partial b_N^{l_N}}\frac{1}{B_\theta(e^{i\omega})} = -(l_1+l_2+\cdots+l_N)!\,e^{-i\omega(l_1+2l_2+\cdots+Nl_N)}\left(-\frac{1}{B_\theta(e^{i\omega})}\right)^{l_1+l_2+\cdots+l_N+1}
\end{gather*}
for every $\theta=[a_1\cdots a_M\ b_1\cdots b_N]^T\in\Theta$, $\omega\in[-\pi,\pi]$, $1\leq k,k_1,k_2\leq M$, $l_1,\dots,l_N\geq 0$. Thus,
$$\left|\frac{\partial^{k_1+\cdots+k_M+l_1+\cdots+l_N}}{\partial a_1^{k_1}\cdots\partial a_M^{k_M}\partial b_1^{l_1}\cdots\partial b_N^{l_N}}C_\theta(e^{i\omega})\right| = \left|\frac{\partial^{k_1+\cdots+k_M}}{\partial a_1^{k_1}\cdots\partial a_M^{k_M}}A_\theta(e^{i\omega})\right|\left|\frac{\partial^{l_1+\cdots+l_N}}{\partial b_1^{l_1}\cdots\partial b_N^{l_N}}\frac{1}{B_\theta(e^{i\omega})}\right| \leq (l_1+\cdots+l_N)!\,\alpha_\theta(1/\beta_\theta)^{l_1+\cdots+l_N+1}$$
for all $\theta=[a_1\cdots a_M\ b_1\cdots b_N]^T\in\Theta$, $\omega\in[-\pi,\pi]$, $k_1,\dots,k_M\geq 0$, $l_1,\dots,l_N\geq 0$. Then, it can be deduced easily that
$$\left|\frac{\partial^{k_1+\cdots+k_{d_\theta}}}{\partial\vartheta_1^{k_1}\cdots\partial\vartheta_{d_\theta}^{k_{d_\theta}}}C_\theta(e^{i\omega})\right| \leq (k_1+\cdots+k_{d_\theta})!\,(\alpha_\theta/\beta_\theta)^{k_1+\cdots+k_{d_\theta}+1}$$
for all $\theta\in\Theta$, $\omega\in[-\pi,\pi]$, $k_1,\dots,k_{d_\theta}\geq 0$ ($\vartheta_i$ denotes the $i$-th component of $\theta$). Since
$$\frac{\partial^{k_1+\cdots+k_{d_\theta}}}{\partial\vartheta_1^{k_1}\cdots\partial\vartheta_{d_\theta}^{k_{d_\theta}}}|C_\theta(e^{i\omega})|^2 = \sum_{j_1=0}^{k_1}\cdots\sum_{j_{d_\theta}=0}^{k_{d_\theta}}\binom{k_1}{j_1}\cdots\binom{k_{d_\theta}}{j_{d_\theta}}\,\frac{\partial^{j_1+\cdots+j_{d_\theta}}}{\partial\vartheta_1^{j_1}\cdots\partial\vartheta_{d_\theta}^{j_{d_\theta}}}C_\theta(e^{i\omega})\cdot\frac{\partial^{(k_1-j_1)+\cdots+(k_{d_\theta}-j_{d_\theta})}}{\partial\vartheta_1^{k_1-j_1}\cdots\partial\vartheta_{d_\theta}^{k_{d_\theta}-j_{d_\theta}}}\overline{C_\theta(e^{i\omega})}$$
for each $\theta\in\Theta$, $\omega\in[-\pi,\pi]$, $k_1,\dots,k_{d_\theta}\geq 0$, we have
\begin{align*}
\left|\frac{\partial^{k_1+\cdots+k_{d_\theta}}}{\partial\vartheta_1^{k_1}\cdots\partial\vartheta_{d_\theta}^{k_{d_\theta}}}|C_\theta(e^{i\omega})|^2\right| &\leq (k_1+\cdots+k_{d_\theta})!\left(\frac{\alpha_\theta}{\beta_\theta}\right)^{k_1+\cdots+k_{d_\theta}+2}\sum_{j_1=0}^{k_1}\cdots\sum_{j_{d_\theta}=0}^{k_{d_\theta}}\binom{k_1}{j_1}\cdots\binom{k_{d_\theta}}{j_{d_\theta}}\binom{k_1+\cdots+k_{d_\theta}}{j_1+\cdots+j_{d_\theta}}^{-1}\\
&\leq (k_1+\cdots+k_{d_\theta})!\left(\frac{\alpha_\theta}{\beta_\theta}\right)^{k_1+\cdots+k_{d_\theta}+2}\sum_{j_1=0}^{k_1}\cdots\sum_{j_{d_\theta}=0}^{k_{d_\theta}}\binom{k_1}{j_1}\cdots\binom{k_{d_\theta}}{j_{d_\theta}}\\
&\leq (k_1+\cdots+k_{d_\theta})!\left(\frac{2\alpha_\theta}{\beta_\theta}\right)^{k_1+\cdots+k_{d_\theta}+2}
\end{align*}
(the last step follows since $\sum_{j_i=0}^{k_i}\binom{k_i}{j_i}=2^{k_i}$)
≤
kdθ =0
1
k1 =0
=
dθ
2 X k +···+kdθ ∞ ∞ X 2αθ (k1 + · · · + kdθ )! 2αθ δθ 1 ··· βθ k1 ! · · · kdθ ! βθ 2αθ βθ
2 X ∞
kdθ =0
X
n=0 0≤k1 ,...,kd ≤n θ k1 +···kdθ =n
(k1 + · · · + kdθ )! k1 ! · · · kdθ !
2αθ δθ βθ
k1 +···+kdθ
2 X ∞
n 2dθ αθ δθ βθ n=0 2 X ∞ n 1 2αθ 0, ˜ > 0, Q \ V 6= ∅ =⇒ inf{k∇f (θ)k : θ ∈ Q \ Q} |f (θ) − a| ˜ |f (θ) − a| ≤ δ˜Q,a ≤ M ˜ < ∞. Q ∩ S 6= ∅ =⇒ sup : θ ∈ Q, Q,a k∇f (θ)kµ˜Q,a ˜ Q,a are well-defined and enjoy the following properties: Consequently, δ˜Q,a , µ ˜Q,a , M ˜ ˜ Q,a < ∞ and 0 < δQ,a ≤ 1, 1 < µ ˜Q,a ≤ 2, 1 ≤ M ˜ Q,a k∇f (θ)kµ˜Q,a |f (θ) − a| ≤ M for all θ ∈ Q satisfying |f (θ) − a| ≤ δ˜Q,a . Hence, the claim holds. REFERENCES [1] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, 1990. [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996. [3] D. P. Bertsekas, Nonlinear Programming, 2nd edition, Athena Scientific, 1999. 38
[4] D. P. Bertsekas and J. N. Tsitsiklis, Gradient convergence in gradient methods with errors, SIAM Journal on Optimization, 10 (2000), pp. 627-642.
[5] E. Bierstone and P. D. Milman, Semianalytic and subanalytic sets, Institut des Hautes Études Scientifiques, Publications Mathématiques, 67 (1988), pp. 5-42.
[6] V. S. Borkar and S. P. Meyn, The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), pp. 447-469.
[7] P. E. Caines, Linear Stochastic Systems, Wiley, 1988.
[8] H.-F. Chen, Stochastic Approximation and Its Application, Kluwer, 2002.
[9] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, Wiley, 2002.
[10] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, 2001.
[11] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, 1998.
[12] S. G. Krantz and H. R. Parks, A Primer of Real Analytic Functions, Birkhäuser, 2002.
[13] K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l'Institut Fourier (Grenoble), 48 (1998), pp. 769-783.
[14] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd edition, Springer-Verlag, 2003.
[15] L. Ljung, Analysis of a general recursive prediction error identification algorithm, Automatica, 27 (1981), pp. 89-100.
[16] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, MIT Press, 1983.
[17] L. Ljung, System Identification: Theory for the User, 2nd edition, Prentice Hall, 1999.
[18] S. Łojasiewicz, Sur le problème de la division, Studia Mathematica, 18 (1959), pp. 87-136.
[19] S. Łojasiewicz, Sur la géométrie semi- et sous-analytique, Annales de l'Institut Fourier (Grenoble), 43 (1993), pp. 1575-1595.
[20] M. Métivier and P. Priouret, Applications of a Kushner-Clark lemma to general classes of stochastic algorithms, IEEE Transactions on Information Theory, 30 (1984), pp. 140-151.
[21] A. Nedić and D. P. Bertsekas, Convergence rate of incremental subgradient algorithms, in S. Uryasev and P. M. Pardalos (eds.), Stochastic Optimization: Algorithms and Applications, Kluwer, pp. 263-304.
[22] G. Ch. Pflug, Optimization of Stochastic Models: The Interface Between Simulation and Optimization, Kluwer, 1996.
[23] B. T. Polyak and Y. Z. Tsypkin, Criterion algorithms of stochastic optimization, Automation and Remote Control, 45 (1984), pp. 766-774.
[24] B. T. Polyak, Introduction to Optimization, Optimization Software, 1987.
[25] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, 2007.
[26] J. C. Spall, Introduction to Stochastic Search and Optimization, Wiley, 2003.
[27] V. B. Tadić, On the almost sure rate of convergence of linear stochastic approximation, IEEE Transactions on Information Theory, 50 (2004), pp. 401-409.
[28] V. B. Tadić, Convergence rate of stochastic gradient search in the case of multiple and non-isolated minima, extended version of this paper, available at arXiv.org as arXiv:0904.4229v2.
[29] V. B. Tadić, Analyticity, convergence and convergence rate of recursive maximum likelihood estimation in hidden Markov models, submitted, available at arXiv.org as arXiv:0904.4264v1.