SIAM J. OPTIM. Vol. 8, No. 2, pp. 506–531, May 1998
© 1998 Society for Industrial and Applied Mathematics
AN INCREMENTAL GRADIENT(-PROJECTION) METHOD WITH MOMENTUM TERM AND ADAPTIVE STEPSIZE RULE∗

PAUL TSENG†

Abstract. We consider an incremental gradient method with momentum term for minimizing the sum of continuously differentiable functions. This method uses a new adaptive stepsize rule that decreases the stepsize whenever sufficient progress is not made. We show that if the gradients of the functions are bounded and Lipschitz continuous over a certain level set, then every cluster point of the iterates generated by the method is a stationary point. In addition, if the gradients of the functions have a certain growth property, then the method is either linearly convergent in some sense or the stepsizes are bounded away from zero. The new stepsize rule is much in the spirit of heuristic learning rules used in practice for training neural networks via backpropagation. As such, the new stepsize rule may suggest improvements on existing learning rules. Finally, extension of the method and the convergence results to constrained minimization is discussed, as are some implementation issues and numerical experience.

Key words. incremental gradient method, gradient projection, convergence analysis, backpropagation, nonlinear neural network training

AMS subject classifications. 49M07, 49M37, 90C30

PII. S1052623495294797
1. Introduction. Consider the problem of minimizing, over the n-dimensional real space ℜ^n, a function f : ℜ^n → ℜ of the form

(1)    f(x) = ∑_{i=1}^{m} f_i(x),
where f_i, i = 1, ..., m, are continuously differentiable functions from ℜ^n to ℜ. Our interest in this problem stems from an important special case, that of nonlinear neural network training, in which x is the vector of weights in the neural network and f_i(x) is the corresponding output error for the ith training example. (See [8], [9], [12], [13] for more detailed discussions of this connection.) Extension of our results to the constrained minimization of f will be discussed in section 5. We will focus on the following iterative method for solving the preceding problem whereby, for a given x_1^0 ∈ ℜ^n, we generate a sequence {(x_1^t, ..., x_{m+1}^t)}_{t=0,1,...} according to

(2)    x_{i+1}^t := x_i^t − α^t d_i^t,   i = 1, ..., m,       x_1^{t+1} := x_{m+1}^t,

where α^t is a positive scalar (the "stepsize") and

(3)    d_i^t := ∇f_i(x_i^t) + ζ d_m^{t−1}   if i = 1,       d_i^t := ∇f_i(x_i^t) + ζ d_{i−1}^t   if i > 1,

with d_m^{−1} = 0 and ζ ∈ [0, 1). Here, each direction d_i^t is a weighted sum of the previous direction (the "momentum term") and the gradient of f_i at x_i^t.

∗Received by the editors November 15, 1995; accepted for publication (in revised form) January 30, 1997. This work was supported by National Science Foundation grant CCR-9311621.
http://www.siam.org/journals/siopt/8-2/29479.html
†Department of Mathematics, University of Washington, Seattle, WA 98195 (tseng@math.washington.edu).

Thus, unlike
conventional gradient methods, this method does not use the gradient of f to take a step but only the gradient of one of the f_i. In the special case where m = 1, this method reduces to the steepest descent method for ζ = 0 and to the heavy-ball method [17, p. 65] for ζ ≥ 0. For general m and ζ = 0, this method is reminiscent of a nonlinear least squares algorithm of Davidon [5], which has been further studied by Pappas [15] and Bertsekas [2]. When applied to neural network training, this method reduces to the very popular on-line backpropagation algorithm with a momentum term (with "training/learning rates" identified with stepsizes), as conceived by Werbos [20], Le Cun [11], Parker [16], and Rumelhart, Hinton, and Williams [18] (see the discussions in [21]). Numerical experience suggests that it is typically beneficial to choose ζ > 0. (In [18, p. 330], a value of ζ ≈ .9 is recommended.) An interesting one-parameter generalization of this method for ζ = 0 and of the steepest descent method was recently studied in [3].

A key issue concerns the stepsizes {α^t}_{t=0,1,...} which, to quote from [14], "are often crucial for the success of the algorithm." In the case of neural network training, various heuristic rules for choosing the stepsize have been proposed, the most popular of which entail keeping the stepsize fixed for as long as "progress" is made and decreasing the stepsize otherwise. However, these heuristic rules are justified only by extensive experimentation (see [10, p. 124], [19], and references therein and in [6]). More recently, stepsize rules have been proposed for the special case of ζ = 0, for which global convergence can be shown under mild assumptions on f, f_1, ..., f_m. One such rule, studied in [3], [8], [12], [13], [22], [23], requires the stepsizes {α^t}_{t=0,1,...} to be square summable but not summable, i.e.,

∑_{t=0}^{∞} (α^t)^2 < ∞,       ∑_{t=0}^{∞} α^t = ∞.
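As a quick numerical illustration of this requirement (the particular sequence and cutoff below are our own illustrative choices, not from the paper), the classical stepsize α^t = 1/(t+1) is square summable but not summable:

```python
import math

# alpha^t = 1/(t+1): the sum of squares converges (to pi^2/6),
# while the plain sum diverges like log N.
N = 100_000
sum_sq = sum(1.0 / (t + 1) ** 2 for t in range(N))   # partial sum, approaches pi^2/6
sum_plain = sum(1.0 / (t + 1) for t in range(N))     # harmonic partial sum, ~ log N

gap = math.pi ** 2 / 6 - sum_sq   # remaining tail, shrinks like 1/N
```

Such rules guarantee convergence for ζ = 0 but force the stepsize itself to vanish, which is the drawback discussed next.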
The reference [8] also considers the more general stepsize rule in which square summability of {α^t}_{t=0,1,...} is replaced by α^t → 0. These rules, however, always require the stepsize to tend to zero. To see the drawback of this, suppose f_1, ..., f_m are identical with Lipschitz continuous gradients. Then the method (2)–(3) with ζ = 0 is just the steepest descent method for minimizing f, for which it is well known that convergence does not require the stepsize to tend to zero (and that a stepsize tending to zero yields slow convergence). A second rule, proposed in [12], requires that

α^t ∝ ‖ ∑_{i=1}^{m} ∇f_i(x_i^{t−1}) ‖^2.
This rule uses information about f , but it still requires the stepsize to tend to zero (the t−1 right-hand side tends to zero as x1t−1 , ..., xm all tend to a stationary point of f ) and in practice the convergence seems to be slow. A third rule, proposed in [9], chooses αt to be the largest element of {αt−1 , ωαt−1 , ω 2 αt−1 , ...} for which the following sufficient descent condition is satisfied:
2 m m
X
X
∇fi (xti ) 2 , f (x1t+1 ) ≤ f (xt1 ) − ǫ1 αt ∇fi (xti ) − ǫ2 (αt )2
i=1
i=1
where ω ∈ (0, 1), ǫ1 ∈ (0, 1), ǫ2 > 0 are parameters. This rule uses information about f and does not always require the stepsize to tend to zero. Moreover, it is in the spirit of heuristic rules used in practice in that it keeps the stepsize fixed for as long
as sufficient descent is made and decreases the stepsize otherwise. On the other hand, this rule still tends to make the stepsize very small since it requires sufficient descent at every iteration and, when the term ∑_{i=1}^{m} ‖∇f_i(x_i^t)‖^2 is bounded away from zero, the stepsize must necessarily tend to zero. In addition, the preceding stepsize rules apply to the case of ζ = 0. It is unclear whether these rules can be extended to the case of ζ > 0, which is the case of practical interest. (The work of [13] considers a version of the method that uses a momentum term. However, the momentum term uses only the history of the method from the start of the current iteration.)

In this paper, we propose a new rule (see (5)–(6) and (7)–(9)) for choosing the stepsizes {α^t}_{t=0,1,...} for which convergence of the method (2)–(3) can be shown for any ζ ∈ [0, (.5)^{1/m}). (Note that (.5)^{1/m} > .9 for m ≥ 8, so the restriction on ζ is mild.) This new rule, like the rule proposed in [9] for the case of ζ = 0, does not always require the stepsize to tend to zero: it keeps the stepsize fixed for as long as descent in an overall sense is achieved and decreases the stepsize otherwise. Unlike the rule of [9], this new stepsize rule does not require descent at every iteration and, as such, the stepsize tends to remain large, which is essential for good convergence. We show that the method (2)–(3) using this stepsize rule has desirable global convergence properties (see Proposition 3.4). Moreover, in the case where ∇f_1, ..., ∇f_m grow at most linearly in norm with ∇f (which, in the context of neural network training, amounts to the neural network being trainable so as to achieve zero output error on the training examples), either the method is linearly convergent in some sense or the stepsize is bounded away from zero (see Proposition 4.2). The method and the convergence results can also be extended, with suitable modifications, to the problem of constrained minimization of f (see section 5).
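The check-and-restart behavior of the new rule (made precise in section 2) can be sketched schematically as follows. This is only an illustration under our own naming: the callables `advance` and `acceptable` are placeholders standing in for one iteration of (2)–(3) and for acceptance conditions like (7)–(9), and are not part of the paper's formulation.

```python
def run_between_checks(state, s, h, alpha, omega, acceptable, advance):
    """Advance from iteration s up to iteration h with a constant trial stepsize.

    If the acceptance test fails, shrink the stepsize by the factor omega and
    restart from iteration s (the iterate saved at the previous check).
    """
    while True:
        trial = state
        for _ in range(s, h):
            trial = advance(trial, alpha)   # one iteration at stepsize alpha
        if acceptable(trial, alpha):        # stand-in for conditions such as (7)-(9)
            return trial, alpha
        alpha *= omega                      # decrease stepsize, redo from iteration s
```

For instance, with `advance = lambda x, a: x + a` and an acceptance test that only passes once `alpha <= 0.25`, the loop halves a unit stepsize twice before accepting; note that descent need not be checked at every single iteration, only at the checkpoints.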
We note that neither d_1^t, ..., d_m^t nor their sum need be a descent direction for f, so conventional convergence arguments cannot be applied here. Moreover, for ζ > 0, the proofs are further complicated by the dependence of d_i^t on the entire past history of the method up to then. And while some of our proof ideas are adapted from [12] and [13], much of the argument is new due to the use of a new stepsize rule and the presence of the momentum term. In section 6, we discuss implementation issues and numerical experience with the method.

A few words about our notation: For any x, y ∈ ℜ^n, ⟨x, y⟩ denotes the usual inner product of vectors x and y and ‖x‖ denotes the Euclidean norm of x, i.e., ‖x‖ = ⟨x, x⟩^{1/2}. For any continuously differentiable function h : ℜ^n → ℜ, we say that ∇h, the gradient of h, is Lipschitz continuous (with constant λ ≥ 0) on a subset X of ℜ^n if

‖∇h(x) − ∇h(x′)‖ ≤ λ‖x − x′‖   ∀x, x′ ∈ X.

For any two integers t > τ ≥ 0 and any scalar θ > 0, we denote

θ_τ^t := θ^{τm} + θ^{τm+1} + ··· + θ^{tm−1}.

Thus, θ_s^τ + θ_τ^t = θ_s^t for any t > τ > s. For any i ∈ {1, ..., m} and positive integer j, we define

i ⊖ j := ((i − j − 1) mod m) + 1.

Thus i ⊖ j = i − j for j = 1, ..., i − 1 and i ⊖ j = i − j + m for j = i, ..., i + m − 1 and so on.
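To make the preceding description concrete, here is a minimal sketch (an illustration under our own naming, not the paper's implementation) of one pass i = 1, ..., m of the update (2)–(3), together with the cyclic index i ⊖ j defined above:

```python
def incremental_pass(x, grads, alpha, zeta, d_prev):
    """One outer iteration t of (2)-(3), for scalar or vector-like x.

    grads  : list of m callables, grads[i-1](x) playing the role of grad f_i(x)
    d_prev : d_m^{t-1}, the last direction of the previous pass (0 for t = 0)
    Returns (x_1^{t+1}, d_m^t).
    """
    d = d_prev
    for g in grads:
        d = g(x) + zeta * d      # momentum recursion (3)
        x = x - alpha * d        # incremental step (2)
    return x, d


def circ_minus(i, j, m):
    """The cyclic index i (-) j = ((i - j - 1) mod m) + 1, mapping into {1, ..., m}."""
    return (i - j - 1) % m + 1
```

For m = 1 and ζ = 0, `incremental_pass` is one steepest-descent step; e.g., for f(x) = .5x^2 with gradient x, it maps x = 1 to 1 − α.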
2. Method description. In this section, we describe in detail the method (2)–(3) for the unconstrained minimization of f given by (1) and the new rule for choosing the stepsize α^t adaptively. To describe this new stepsize rule, we follow [12] and make the following assumption about f, f_1, ..., f_m and the initial iterate x_1^0.

Assumption A. There exist scalars η > f(x_1^0) and ρ > 0 such that, for i = 1, ..., m, ∇f_i is bounded and Lipschitz continuous (with some constant λ_i ≥ 0) on the set

R_ρ^η := { x ∈ ℜ^n : f(x) ≤ η } + ρB,   where B = { x ∈ ℜ^n : ‖x‖ ≤ 1 }.

Assumption A is quite mild and, in particular, is satisfied when f_1, ..., f_m are twice differentiable and the level set { x ∈ ℜ^n : f(x) ≤ η } is bounded for some η > f(x_1^0), as is typically the case with neural network training. (See [12, section 3] for further discussions.) The new stepsize rule for choosing α^t depends on η, ρ, and λ_1, ..., λ_m and, in the spirit of the Armijo–Goldstein stepsize rule for gradient descent methods, periodically checks if a certain descent condition is satisfied since the previous check was made and, if not, decreases the stepsize and restarts the method from when the previous check was made. Below, we formally state the method (2)–(3) using this stepsize rule. Following [3], we will call this method the incremental gradient method.

Incremental gradient method (with momentum term). Choose any x_1^0 ∈ ℜ^n such that Assumption A holds for some η, ρ and λ_1, ..., λ_m. Choose any ζ ∈ [0, (.5)^{1/m}) and let

(4)    δ_1 := (1 − 2ζ^m)/(1 − ζ),   δ_2 := .5ζ^m/(1 − ζ),   δ_3 := .5(λ_1 + ··· + λ_m)(1 + ζ^m).

(By choice of ζ, we have δ_1 > 0.) Choose any ω ∈ (0, 1) and any subsequence T of {1, 2, ...} containing 1. Choose any positive scalars ǫ_0, ǫ_1, ǫ_2, ǫ_3 satisfying ǫ_1 < δ_1 and ǫ_2 ζ^m/(1 − ζ^m) < η − f(x_1^0).

Step 0. Let α^0 be the largest element of {ǫ_0, ωǫ_0, ω^2 ǫ_0, ...} for which x_2^0, ..., x_m^0, x_1^1 given by (2)–(3) with t = 0 satisfy the following two conditions:

(5)    (x_2^0, ..., x_m^0, x_1^1) ∈ (R_ρ^η)^m,
(6)    f(x_1^1) ≤ η − δ_3 (α^0 β^0)^2 − ǫ_2 ζ^m/(1 − ζ^m),

where β^0 is given by (10).

Step 1. For each s ∈ T, let h = min{t ∈ T : t > s} and generate (x_1^h, d_m^{h−1}, α^{h−1}) from (x_1^s, d_m^{s−1}, α^{s−1}) as follows: If ∇f(x_1^s) = 0, we stop. Else, we let α^{h−1} be the largest element of {α^{s−1}, ωα^{s−1}, ω^2 α^{s−1}, ...} for which x_2^t, ..., x_m^t, x_1^{t+1} given by (2)–(3) with α^t = α^{h−1}, t = s, ..., h−1, satisfy the following three conditions:

(7)    (x_2^t, ..., x_m^t, x_1^{t+1}) ∈ (R_ρ^η)^m,   t = s, ..., h − 1,
(8)    f(x_1^h) ≤ η − (δ_2 + ǫ_1) p^h − δ_3 q^h + δ_2 (1 − ζ) u^h + δ_3 (1 − ζ) v^h − ǫ_2 ζ^{hm}/(1 − ζ^m),
(9)    ‖∇f(x_1^s) − g^s‖ ≤ ǫ_3 ‖g^s‖,

where, for t = s, ..., h − 1, we define
(10)    g^t := ∑_{i=1}^{m} ∇f_i(x_i^t),       β^t := max{ ‖g^t‖, ∑_{i=1}^{m} ‖d_i^t‖ },

(11)    p^{t+1} := p^t + α^t ‖g^t‖^2,    u^{t+1} := u^t + a^t,    a^{t+1} := ζ^m a^t + (1 + ζ + ··· + ζ^{m−1}) α^t ‖g^t‖^2,
        q^{t+1} := q^t + (α^t β^t)^2,    v^{t+1} := v^t + b^t,    b^{t+1} := ζ^m b^t + (1 + ζ + ··· + ζ^{m−1}) (α^t β^t)^2,

with p^1 := u^1 := a^1 := q^0 := v^0 := b^0 := 0.

Roughly speaking, the stepsize rule checks at each iteration s ∈ T whether the current stepsize is acceptable (i.e., satisfies (7)–(9)) for all iterations between s and the next element h of T and, if not, it decreases the stepsize by the factor ω and restarts the method from iteration s. Thus, if the elements of T are far apart, then the rule makes this check infrequently, but it needs to backtrack further whenever the stepsize needs to be decreased. In our testing (see section 6), checking every 10 iterations, i.e., T = {1, 11, 21, 31, ...}, worked well. A more sophisticated strategy might be to check more frequently at the beginning, e.g., T = {1, 2, 5, 11, 21, 31, ...}. We remark that we can also increase the stepsize, provided that this is done only a finite number of times. For the other parameters in the stepsize rule, it suffices to choose ǫ_1, ǫ_2 reasonably small, choose ǫ_3, η, ρ reasonably large, choose ǫ_0 near 1, and choose ω, ζ away from 0 and 1. Only the parameter λ_1 + ··· + λ_m, which is problem dependent, requires significant fine tuning (if this is too large, then the stepsize becomes too small; if this is too small, then the stepsize remains large but the method experiences large oscillations). In our testing (see section 6), the choice of ǫ_1 = ǫ_2 = .00001, ǫ_3 = 1000, η = 1.5f(x_1^0) + 100, ρ = ∞, ǫ_0 = 1, ω = .5, ζ = .8, and λ_1 + ··· + λ_m = 1 worked well. However, to solve a wider range of problems, we would need to update λ_i on-line by, for example, ‖∇f_i(x_i^1) − ∇f_i(x_i^0)‖/‖x_i^1 − x_i^0‖, for i = 1, ..., m. Of the three conditions (7)–(9), both (7) and (9) are quite unrestrictive since ρ is typically large (e.g., in [8] and [13], it is assumed that ρ = ∞) and we can choose ǫ_3 arbitrarily large. To see that (8) is also unrestrictive, first note that (8) does not require the f-value to be monotonically decreasing in any sense but only that the f-value be less than η by some positive quantity.
This positive quantity, which is easily computable using the updating formula (11), depends on f, f_1, ..., f_m and is increasing with t only in a long-run sense. More precisely, a straightforward calculation from (11) shows that

(12)    p^t = ∑_{τ=1}^{t−1} α^τ ‖g^τ‖^2,    q^t = ∑_{τ=0}^{t−1} (α^τ β^τ)^2,    u^t = ∑_{τ=1}^{t−2} ζ_0^{t−τ−1} α^τ ‖g^τ‖^2,    v^t = ∑_{τ=0}^{t−2} ζ_0^{t−τ−1} (α^τ β^τ)^2.

Since ζ_0^{t−τ−1} ≈ 1/(1 − ζ) for τ = 0, ..., t − 2 (this is true especially for large m), we have p^t ≈ (1 − ζ) u^{t+1} and q^t ≈ (1 − ζ) v^{t+1}, so the right-hand side of (8) may increase or decrease with h, depending on how α^τ ‖g^τ‖^2 and (α^τ β^τ)^2 change with τ, though in the long run the tendency is towards a decrease. To see this, we note from (12) and ζ_0^{t−τ−1} ≤ 1/(1 − ζ) for τ = 0, ..., t − 2 that (8) implies

(13)    f(x_1^h) ≤ η − ǫ_1 ∑_{τ=1}^{h−1} α^τ ‖g^τ‖^2 − δ_2 α^{h−1} ‖g^{h−1}‖^2 − δ_3 (α^{h−1} β^{h−1})^2 − ǫ_2 ζ^{hm}/(1 − ζ^m).
The second term on the right-hand side, which is the dominant term there, is decreasing with h.
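The equivalence between the running updates (11) and the closed forms (12) is easy to verify numerically. In the sketch below, the sequences standing in for α^t, ‖g^t‖, and β^t, and the constants ζ and m, are arbitrary illustrative data, not taken from the paper's experiments:

```python
zeta, m, T = 0.8, 3, 7
alpha = [0.5 * 0.9 ** k for k in range(T)]       # placeholder stepsizes alpha^t
gnorm = [1.0 / (k + 1) for k in range(T)]        # placeholder values of ||g^t||
beta = [gnorm[k] + 0.3 for k in range(T)]        # placeholder values of beta^t
S = sum(zeta ** k for k in range(m))             # 1 + zeta + ... + zeta^(m-1)

def zeta_0(k):
    # zeta_0^k = 1 + zeta + ... + zeta^(km-1), the paper's theta_tau^t with tau = 0, t = k
    return sum(zeta ** j for j in range(k * m))

# Recursions (11), with p^1 = u^1 = a^1 = 0 and q^0 = v^0 = b^0 = 0.
p = [0.0, 0.0]; u = [0.0, 0.0]; a = [0.0, 0.0]
q = [0.0]; v = [0.0]; b = [0.0]
for t in range(T):
    q.append(q[t] + (alpha[t] * beta[t]) ** 2)
    v.append(v[t] + b[t])
    b.append(zeta ** m * b[t] + S * (alpha[t] * beta[t]) ** 2)
    if t >= 1:
        p.append(p[t] + alpha[t] * gnorm[t] ** 2)
        u.append(u[t] + a[t])
        a.append(zeta ** m * a[t] + S * alpha[t] * gnorm[t] ** 2)

# Closed forms (12) at a fixed t, computed directly.
t = T - 1
p_direct = sum(alpha[s] * gnorm[s] ** 2 for s in range(1, t))
q_direct = sum((alpha[s] * beta[s]) ** 2 for s in range(t))
u_direct = sum(zeta_0(t - s - 1) * alpha[s] * gnorm[s] ** 2 for s in range(1, t - 1))
v_direct = sum(zeta_0(t - s - 1) * (alpha[s] * beta[s]) ** 2 for s in range(t - 1))
```

The recursive form is what makes the right-hand side of (8) cheap to maintain: each iteration costs O(1) scalar updates regardless of t.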
3. Global convergence analysis. In this section we show that the incremental gradient method of section 2 has desirable global convergence properties (see Proposition 3.4). Throughout, we will assume that Assumption A holds. First, we have the following technical lemma.

LEMMA 3.1. For any t ∈ {0, 1, ...}, any α^t > 0, and any x_1^t, ..., x_{m+1}^t in R_ρ^η satisfying (2), we have

‖∇f_i(x_j^t) − ∇f_i(x_i^t)‖ ≤ λ_i α^t β^t,   for 1 ≤ i, j ≤ m + 1,

where β^t is given by (10).

Proof. By (2) and (10), for any 1 ≤ j ≤ i ≤ m + 1, we have

‖x_j^t − x_i^t‖ = α^t ‖ ∑_{l=j}^{i−1} d_l^t ‖ ≤ α^t β^t.

A similar argument shows the above inequality also holds for any 1 ≤ i ≤ j ≤ m + 1. Since x_1^t, ..., x_{m+1}^t are in R_ρ^η, the above inequality, together with ∇f_i being Lipschitz continuous (with constant λ_i) on R_ρ^η for i = 1, ..., m, yields the desired inequality.

Under Assumption A, there exist positive scalars β_1, ..., β_m such that

(14)    ‖∇f_i(x)‖ ≤ β_i   ∀x ∈ R_ρ^η,   i = 1, ..., m.

Let

(15)    β := β_1 + ··· + β_m,       λ := λ_1 + ··· + λ_m.
The next lemma shows that, for α^t sufficiently small, x_1^t, ..., x_{m+1}^t satisfying (2)–(3) remain in R_ρ^η for t = 0, 1, ....

LEMMA 3.2. For any t ∈ {0, 1, ...}, any α^τ ∈ (0, ρ(1 − ζ)/β], and any x_1^τ, ..., x_{m+1}^τ satisfying (2)–(3) for τ = 0, 1, ..., t, and such that (x_1^τ, ..., x_m^τ) ∈ (R_ρ^η)^m for τ = 0, 1, ..., t − 1 and f(x_1^t) ≤ η, we have that x_1^t, ..., x_m^t are in R_ρ^η, as is the line segment joining x_1^t with x_1^{t+1}.

Proof. First, we claim that

(16)    ‖x_l^t − x_1^t‖ ≤ (ρ(1 − ζ)/β) ∑_{j=1}^{l−1} ∑_{k=0}^{tm+j−1} ζ^k β_{j⊖k},

for l = 1, ..., m + 1. We prove this by induction on l. Clearly, (16) holds for l = 1. Suppose (16) holds for l = 1, ..., i for some i ∈ {1, ..., m}; we show below that it also holds for l = i + 1. For l = 1, ..., i, since (16) holds and the right-hand side of (16) is bounded above by ρ, we have ‖x_l^t − x_1^t‖ ≤ ρ so that (cf. f(x_1^t) ≤ η) x_l^t ∈ R_ρ^η. Then, (2)–(3) yields

‖x_{i+1}^t − x_i^t‖ = α^t ‖ ∑_{k=0}^{tm+i−1} ζ^k ∇f_{i⊖k}( x_{i⊖k}^{⌊(tm+i−1−k)/m⌋} ) ‖ ≤ α^t ∑_{k=0}^{tm+i−1} ζ^k β_{i⊖k} ≤ (ρ(1 − ζ)/β) ∑_{k=0}^{tm+i−1} ζ^k β_{i⊖k},
where the first inequality follows from x_1^0, x_2^0, ..., x_{i−1}^t, x_i^t ∈ R_ρ^η and (14). Since (16) holds for l = i, this shows that (16) holds for l = i + 1.

Since (16) holds and the right-hand side of (16) is bounded above by ρ, f(x_1^t) ≤ η implies x_l^t ∈ R_ρ^η for l = 1, ..., m + 1. Moreover, (16) with l = m + 1 and (2) implies ‖x_1^{t+1} − x_1^t‖ ≤ ρ, so the line segment joining x_1^t with x_1^{t+1} lies in R_ρ^η.

By using Lemma 3.1, we obtain the following lemma estimating the decrease in f-value per iteration of the incremental gradient method.

LEMMA 3.3. For any t ∈ {1, 2, ...}, any α^τ > 0, and any x_1^τ, ..., x_{m+1}^τ in R_ρ^η satisfying (2)–(3) for τ = 0, 1, ..., t, and such that the line segment joining x_1^t with x_1^{t+1} lies in R_ρ^η, we have

(17)    f(x_1^{t+1}) ≤ f(x_1^t) − (δ_1 + δ_2) α^t ‖g^t‖^2 + λ(1.5 + 2ζζ_0^∞)(α^t β^t)^2 + δ_2 (1 − ζ) ∑_{τ=1}^{t−1} ζ_{t−τ−1}^{t−τ} α^t ‖g^τ‖^2 + δ_3 (1 − ζ) ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ} (α^τ β^τ)^2 + α^t ‖g^t‖ ζ^{tm} β̄,

where δ_1, δ_2, δ_3 are given by (4), g^t, β^t are given by (10), and β̄ := ‖ ∑_{j=1}^{m} ∑_{i=j}^{m} ζ^{i−j} ∇f_j(x_j^0) ‖. Similarly, for any α^0 > 0 and any x_1^0, ..., x_m^0, x_1^1 in R_ρ^η satisfying (2)–(3) (with t = 0) and such that the line segment joining x_1^0 with x_1^1 lies in R_ρ^η, we have

(18)    f(x_1^1) ≤ f(x_1^0) − α^0 ‖g^0‖^2 + 1.5λ(α^0 β^0)^2 + 2α^0 ‖g^0‖ β^0.
Proof. Fix any t ∈ {1, 2, ...}. Since the line segment joining x_1^t with x_1^{t+1} lies in R_ρ^η and, by ∇f = ∇f_1 + ··· + ∇f_m (cf. (1)) and Assumption A, ∇f is Lipschitz continuous (with constant λ given by (15)) on R_ρ^η, we obtain from the intermediate value theorem (see [4, p. 639]) that

(19)    f(x_1^{t+1}) ≤ f(x_1^t) + ⟨∇f(x_1^t), x_1^{t+1} − x_1^t⟩ + .5λ‖x_1^{t+1} − x_1^t‖^2.

Using (2) and (10), we bound the rightmost term in (19) as follows:

(20)    ‖x_1^{t+1} − x_1^t‖ = α^t ‖ ∑_{i=1}^{m} d_i^t ‖ ≤ α^t β^t.

The second term on the right-hand side of (19) can be bounded as follows:

(21)    ⟨∇f(x_1^t), x_1^{t+1} − x_1^t⟩ = ⟨∇f(x_1^t) − g^t, x_1^{t+1} − x_1^t⟩ + ⟨ g^t, −α^t ∑_{i=1}^{m} d_i^t + α^t ζ_0^1 g^t ⟩ − α^t ζ_0^1 ‖g^t‖^2
            ≤ ‖∇f(x_1^t) − g^t‖ ‖x_1^{t+1} − x_1^t‖ + α^t ‖g^t‖ ‖ ∑_{i=1}^{m} d_i^t − ζ_0^1 g^t ‖ − α^t ζ_0^1 ‖g^t‖^2,

where the equality follows from (2). Also, we have from (3) and (10) that
‖ ∑_{i=1}^{m} d_i^t − ζ_0^1 g^t ‖
  = ‖ ∑_{i=1}^{m} [ ∑_{τ=0}^{t−1} ( ∑_{j=m−i+1}^{m} ζ^{(t−τ)m−j} ∇f_{i+j−m}(x_{i+j−m}^{τ+1}) + ∑_{j=1}^{m−i} ζ^{(t−τ)m−j} ∇f_{i+j}(x_{i+j}^τ) ) + ∑_{j=1}^{i} ζ^{tm+i−j} ∇f_j(x_j^0) ] − ζ_0^1 g^t ‖
  = ‖ ∑_{τ=0}^{t−1} ∑_{j=1}^{m} ζ^{(t−τ)m−j} ( ∑_{k=1}^{j} ∇f_k(x_k^{τ+1}) + ∑_{k=j+1}^{m} ∇f_k(x_k^τ) ) + ∑_{j=1}^{m} ∑_{i=j}^{m} ζ^{tm+i−j} ∇f_j(x_j^0) − ζ_0^1 g^t ‖
  = ‖ ∑_{τ=0}^{t−1} ∑_{j=1}^{m} ζ^{(t−τ)m−j} g_j^{τ+1} + ∑_{j=1}^{m} ∑_{i=j}^{m} ζ^{tm+i−j} ∇f_j(x_j^0) − ∑_{j=1}^{m} ζ^{m−j} g^t ‖
  = ‖ ∑_{τ=0}^{t−2} ∑_{j=1}^{m} ζ^{(t−τ)m−j} g_j^{τ+1} + ∑_{j=1}^{m} ∑_{i=j}^{m} ζ^{tm+i−j} ∇f_j(x_j^0) + ∑_{j=1}^{m} ζ^{m−j} (g_j^t − g^t) ‖
  ≤ ∑_{τ=0}^{t−2} ∑_{j=1}^{m} ζ^{(t−τ)m−j} ( ‖g^{τ+1}‖ + ‖g_j^{τ+1} − g^{τ+1}‖ ) + ‖ ∑_{j=1}^{m} ∑_{i=j}^{m} ζ^{tm+i−j} ∇f_j(x_j^0) ‖ + ∑_{j=1}^{m} ζ^{m−j} ‖g_j^t − g^t‖
  = ∑_{τ=0}^{t−2} ∑_{j=1}^{m} ζ^{(t−τ)m−j} ‖g^{τ+1}‖ + ∑_{τ=0}^{t−1} ∑_{j=1}^{m} ζ^{(t−τ)m−j} ‖g_j^{τ+1} − g^{τ+1}‖ + ζ^{tm} β̄
  ≤ ∑_{τ=0}^{t−2} ∑_{j=1}^{m} ζ^{(t−τ)m−j} ‖g^{τ+1}‖ + λ ∑_{τ=0}^{t−1} ∑_{j=0}^{m−1} ζ^{(t−τ)m−j} (α^{τ+1} β^{τ+1} + α^τ β^τ) + ζ^{tm} β̄

(22)
  = ∑_{τ=1}^{t−1} ζ_{t−τ}^{t−τ+1} ‖g^τ‖ + λζ ( ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ+1} α^τ β^τ + ζ_0^1 α^t β^t − ζ_t^{t+1} α^0 β^0 ) + ζ^{tm} β̄,

where the second equality follows from interchanging the order of the summations over i and j and then making the substitution k = i + j − m (respectively, k = i + j) in the first (respectively, second) summation inside the doubly nested parentheses; the third equality follows by letting

(23)    g_j^{τ+1} := ∑_{k=1}^{j} ∇f_k(x_k^{τ+1}) + ∑_{k=j+1}^{m} ∇f_k(x_k^τ);
the fifth equality uses the fact g_m^{τ+1} = g^{τ+1}; the second inequality follows from (15) and the following consequence of (10) and Lemma 3.1 (since x_1^τ, ..., x_{m+1}^τ and x_1^{τ+1}, ..., x_{m+1}^{τ+1} are in R_ρ^η, with x_{m+1}^τ = x_1^{τ+1}):

‖g_j^{τ+1} − g^{τ+1}‖ = ‖ ∑_{k=j+1}^{m} ( ∇f_k(x_k^τ) − ∇f_k(x_{m+1}^τ) + ∇f_k(x_1^{τ+1}) − ∇f_k(x_k^{τ+1}) ) ‖ ≤ ∑_{k=j+1}^{m} λ_k (α^τ β^τ + α^{τ+1} β^{τ+1})

for j = 1, ..., m − 1 and for τ = 0, 1, ..., t − 1. By a similar argument, we have that

(24)    ‖∇f(x_1^t) − g^t‖ = ‖ ∑_{i=1}^{m} ( ∇f_i(x_1^t) − ∇f_i(x_i^t) ) ‖ ≤ ∑_{i=1}^{m} λ_i α^t β^t = λ α^t β^t.
Using (20)–(24) to bound the right-hand side of (19) and then using β^t to bound ‖g^t‖ (cf. (10)) yields

f(x_1^{t+1}) ≤ f(x_1^t) − ζ_0^1 α^t ‖g^t‖^2 + 1.5λ(α^t β^t)^2 + α^t ‖g^t‖ ∑_{τ=1}^{t−1} ζ_{t−τ}^{t−τ+1} ‖g^τ‖ + α^t β^t λζ ( ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ+1} α^τ β^τ + ζ_0^1 α^t β^t ) + α^t ‖g^t‖ ζ^{tm} β̄
  ≤ f(x_1^t) − ζ_0^1 α^t ‖g^t‖^2 + 1.5λ(α^t β^t)^2 + .5α^t ( ζ_1^t ‖g^t‖^2 + ∑_{τ=1}^{t−1} ζ_{t−τ}^{t−τ+1} ‖g^τ‖^2 ) + λζ ( .5( (ζ_0^t + ζ_1^{t+1})(α^t β^t)^2 + ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ+1} (α^τ β^τ)^2 ) + ζ_0^1 (α^t β^t)^2 ) + α^t ‖g^t‖ ζ^{tm} β̄
  = f(x_1^t) − (ζ_0^1 − .5ζ_1^t) α^t ‖g^t‖^2 + λ(1.5 + .5ζζ_0^t + .5ζζ_1^{t+1} + ζζ_0^1)(α^t β^t)^2 + .5α^t ζ^m ∑_{τ=1}^{t−1} ζ_{t−τ−1}^{t−τ} ‖g^τ‖^2 + .5λζ(1 + ζ^m) ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ} (α^τ β^τ)^2 + α^t ‖g^t‖ ζ^{tm} β̄,

where the second inequality follows from using ab ≤ .5(a^2 + b^2) and the equality uses properties of ζ_i^j. Bounding from above ζ_1^t by ζ_1^∞ = ζ^m/(1 − ζ) and ζ_0^t, ζ_1^{t+1}, ζ_0^1 by ζ_0^∞ in the above expression and then using (4) yields (17). The proof of (18) is very similar.

Below, we state and prove the main global convergence result for the incremental gradient method of section 2. The proof uses Lemmas 3.1–3.3 to show that the method is well defined and uses the observation (13) to show that (29) holds (unless ∇f(x_1^s) = 0 for some s ∈ T or lim inf_{t→∞} f(x_1^t) = −∞). The latter in turn is used to show that {g^t}_{t=0,1,...} → 0.

PROPOSITION 3.4. The sequences {(x_1^t, ..., x_{m+1}^t)}_{t=0,1,...} and {α^t}_{t=0,1,...} generated by the incremental gradient method (see (2)–(3) and (4)–(11)) are well defined.
Moreover, either (i) ∇f(x_1^s) = 0 for some s ∈ T or (ii) lim inf_{t→∞} f(x_1^t) = −∞ or (iii) {g^t}_{t=0,1,...} → 0 and {∇f(x_1^s)}_{s∈T} → 0, where T is the subsequence of {1, 2, ...} specified in the method.

Proof. We will show by induction on t that α^t is well defined for t = 0, 1, .... Since f(x_1^0) < η, for α^0 ≤ ρ(1 − ζ)/β, Lemma 3.2 shows that x_1^0, ..., x_m^0 are in R_ρ^η, as is the line segment joining x_1^0 with x_1^1. Thus (5) holds and, by Lemma 3.3, (18) holds. Since ǫ_2 ζ^m/(1 − ζ^m) < η − f(x_1^0), the latter implies that (6) holds for all α^0 sufficiently small, so α^0, being the largest element of {ǫ_0, ωǫ_0, ω^2 ǫ_0, ...} for which (6) holds, is well defined.

Now assume that, for some s ∈ T, α^t is well defined for t = 0, 1, ..., s − 1. If ∇f(x_1^s) = 0, then we stop with case (i). Suppose instead that ∇f(x_1^s) ≠ 0 and let h = min{t ∈ T : t > s}. We argue in the next two paragraphs that α^t is well defined for t = s, ..., h − 1.

Since g^t → ∇f(x_1^s) and β^t is bounded as α^t → 0 for t = s, ..., h − 1, we have

(25)    α^t ≤ ρ(1 − ζ)/β,    α^t ≤ (δ_1 − ǫ_1)‖g^t‖^2 / ( (λ(1.5 + 2ζζ_0^∞) + δ_3)(β^t)^2 ),    α^t ≤ ǫ_2/(‖g^t‖ β̄),    t = s, ..., h − 1,    α^s ≤ ǫ_3 ‖g^s‖/(λβ^s),

for any α^s = ··· = α^{h−1} sufficiently small. We claim that, whenever (25) holds, then

(26)    f(x_1^{t+1}) ≤ η,    (x_2^t, ..., x_m^t, x_1^{t+1}) ∈ (R_ρ^η)^m,
for t = s−1, ..., h−1. Clearly (26) holds for t = s−1. (This follows from (5)–(6) when s = 1, and follows from (7) with t = s − 1 and (13) with h = s, with the latter implied by (8) with h = s, when s > 1.) Suppose that, for some k ∈ {s, ..., h − 1}, the relation (26) holds for t = s−1, ..., k−1. Then, (x_1^t, ..., x_m^t) ∈ (R_ρ^η)^m for t = 0, 1, ..., k−1 (since (26) holds for t = s − 1, ..., k − 1 and (7) holds for t = 0, 1, ..., s − 1 and f(x_1^0) < η) and, for each t ∈ {s, ..., k}, f(x_1^t) ≤ η. Since we also have α^t ≤ ρ(1 − ζ)/β (by (25)) for each t ∈ {s, ..., k}, Lemma 3.2 then yields that x_1^t, ..., x_m^t and the line segment joining x_1^t with x_1^{t+1} all lie in R_ρ^η, and, by Lemma 3.3, (17) holds. Using the second and third inequalities in (25), together with the fact α^t ≤ α^τ for τ = 0, 1, ..., t − 1, to bound the right-hand side of (17) yields

(27)    f(x_1^{t+1}) ≤ f(x_1^t) − (δ_2 + ǫ_1) α^t ‖g^t‖^2 − δ_3 (α^t β^t)^2 + δ_2 (1 − ζ) ∑_{τ=1}^{t−1} ζ_{t−τ−1}^{t−τ} α^τ ‖g^τ‖^2 + δ_3 (1 − ζ) ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ} (α^τ β^τ)^2 + ǫ_2 ζ^{tm}
for t = s, ..., k. Also, we have that the inequality (8) holds for h = s. (For s = 1, this follows from (6); for s > 1, this follows from α^{s−1} being chosen in Step 1 such that (8) holds with h = s.) By using (12), we can rewrite this inequality equivalently as

f(x_1^s) ≤ η − (δ_2 + ǫ_1) ∑_{τ=1}^{s−1} α^τ ‖g^τ‖^2 − δ_3 ∑_{τ=0}^{s−1} (α^τ β^τ)^2 + δ_2 (1 − ζ) ∑_{τ=1}^{s−2} ζ_0^{s−τ−1} α^τ ‖g^τ‖^2 + δ_3 (1 − ζ) ∑_{τ=0}^{s−2} ζ_0^{s−τ−1} (α^τ β^τ)^2 − ǫ_2 ζ^{sm}/(1 − ζ^m).
Summing (27) over all t ∈ {s, ..., k} and then adding to it the above inequality, we obtain

(28)
f(x_1^{k+1}) ≤ η − (δ_2 + ǫ_1) ∑_{τ=1}^{k} α^τ ‖g^τ‖^2 − δ_3 ∑_{τ=0}^{k} (α^τ β^τ)^2 − ǫ_2 ( ζ^{sm}/(1 − ζ^m) − ∑_{t=s}^{k} ζ^{tm} )
      + δ_2 (1 − ζ) ( ∑_{t=s}^{k} ∑_{τ=1}^{t−1} ζ_{t−τ−1}^{t−τ} α^τ ‖g^τ‖^2 + ∑_{τ=1}^{s−2} ζ_0^{s−τ−1} α^τ ‖g^τ‖^2 )
      + δ_3 (1 − ζ) ( ∑_{t=s}^{k} ∑_{τ=0}^{t−1} ζ_{t−τ−1}^{t−τ} (α^τ β^τ)^2 + ∑_{τ=0}^{s−2} ζ_0^{s−τ−1} (α^τ β^τ)^2 )
  = η − (δ_2 + ǫ_1) ∑_{τ=1}^{k} α^τ ‖g^τ‖^2 − δ_3 ∑_{τ=0}^{k} (α^τ β^τ)^2 − ǫ_2 ζ^{(k+1)m}/(1 − ζ^m)
      + δ_2 (1 − ζ) ∑_{τ=1}^{k−1} ζ_0^{k−τ} α^τ ‖g^τ‖^2 + δ_3 (1 − ζ) ∑_{τ=0}^{k−1} ζ_0^{k−τ} (α^τ β^τ)^2
  ≤ η − (δ_2 + ǫ_1) ∑_{τ=1}^{k} α^τ ‖g^τ‖^2 − δ_3 ∑_{τ=0}^{k} (α^τ β^τ)^2 + δ_2 ∑_{τ=1}^{k−1} α^τ ‖g^τ‖^2 + δ_3 ∑_{τ=0}^{k−1} (α^τ β^τ)^2,
where the equality follows from exchanging the order of summation and using the definition of ζ_i^j; the last inequality follows from the fact ζ_0^j ≤ ζ_0^∞ = 1/(1 − ζ) for all j > 0. The right-hand side of the above inequality is less than η, so the claim (26) holds for t = k. By induction on k, it follows that (26) holds for t = s − 1, ..., h − 1.

Thus, if (25) holds, then (26) holds for t = s − 1, ..., h − 1, so that (7) holds. In addition, our argument showed that (28) holds for k = s, ..., h − 1, so that, upon letting k = h − 1 in (28) and using (12), we obtain (8). Also, we have from (24) with t = s and the last inequality in (25) that ‖∇f(x_1^s) − g^s‖ ≤ λα^s β^s ≤ ǫ_3 ‖g^s‖, so (9) holds. Thus (7)–(9) hold whenever α^s = ··· = α^{h−1} are sufficiently small. Then α^s, ..., α^{h−1}, being the largest element of {α^{s−1}, ωα^{s−1}, ...} such that (7)–(9) hold, are well defined. This completes the induction step and shows that α^t is well defined for all t = 0, 1, ....

There are three cases: either (i) ∇f(x_1^s) = 0 for some s ∈ T or else, since (8) holds and hence (13) holds for all h ∈ T, (ii) lim inf_{t→∞} f(x_1^t) = −∞, or (iii)

(29)    ∑_{t=1}^{∞} α^t ‖g^t‖^2 < ∞.

Also, we have from f(x_1^0) < η and (5) and (7) (with h = min{t ∈ T : t > s}) for all s ∈ T that

(30)    (x_1^t, ..., x_m^t) ∈ (R_ρ^η)^m,    t = 0, 1, ....
We claim that in case (iii), {g t } → 0. Since {αt } is monotonically decreasing, either lim inf t→∞ αt > 0 or {αt } ↓ 0. In the first case, (29) yields {g t } → 0. Consider now the second case. We showed earlier that, for each s ∈ T , (7)–(9) hold whenever
(25) holds, where h = min{t ∈ T : t > s}. Then the choice of α^s = ··· = α^{h−1} as the largest element of {α^{s−1}, ωα^{s−1}, ...} for which (7)–(9) hold implies either α^s = α^{s−1} or

(31)    α^s/ω > min_{t=s,...,h−1} { ρ(1 − ζ)/β, (δ_1 − ǫ_1)‖g^t‖^2 / ( (λ(1.5 + 2ζζ_0^∞) + δ_3)(β^t)^2 ), ǫ_2/(‖g^t‖ β̄), ǫ_3 ‖g^s‖/(λβ^s) }.

Since (30) holds, then for t = 0, 1, ... we have from (10) and (14)–(15) that

‖g^t‖ = ‖ ∑_{i=1}^{m} ∇f_i(x_i^t) ‖ ≤ ∑_{i=1}^{m} β_i = β

and from (3) and (14)–(15) that

∑_{i=1}^{m} ‖d_i^t‖ = ∑_{i=1}^{m} ‖ ∑_{k=0}^{tm+i−1} ζ^k ∇f_{i⊖k}( x_{i⊖k}^{⌊(tm+i−1−k)/m⌋} ) ‖ ≤ ∑_{i=1}^{m} ∑_{k=0}^{tm+i−1} ζ^k β_{i⊖k} ≤ β/(1 − ζ),
so (10) yields β^t ≤ β/(1 − ζ). Since {α^s} ↓ 0, (31) holds for all s in some subsequence of T, and it follows that g^t → 0 for t along some subsequence of {0, 1, ...}.

We now argue that g^t → 0 for t along the entire sequence {0, 1, ...}. Suppose this is not the case, so there exists ǫ > 0 such that ‖g^t‖ > ǫ for all t along some subsequence of {0, 1, ...}. The following argument is a modification of the proof of [13, Theorem 2.1]. Consider any t such that ‖g^t‖ ≥ ǫ. Since {‖g^t‖}_{t=0,1,...} contains a subsequence that tends to zero, there exists a smallest integer t′ > t such that ‖g^{t′}‖ < ǫ/2. Then,

ǫ/2 ≤ ‖g^t‖ − ‖g^{t′}‖
    ≤ ‖g^t − g^{t′}‖
    = ‖ ∑_{τ=t}^{t′−1} ∑_{i=1}^{m} ( ∇f_i(x_i^{τ+1}) − ∇f_i(x_1^{τ+1}) + ∇f_i(x_{m+1}^τ) − ∇f_i(x_i^τ) ) ‖
    ≤ ∑_{τ=t}^{t′−1} ∑_{i=1}^{m} λ_i (α^{τ+1} β^{τ+1} + α^τ β^τ)
    ≤ (2λβ/(1 − ζ)) ∑_{τ=t}^{t′−1} α^τ,

where the equality uses (2) and (10); the fourth inequality follows from (30) and Lemma 3.1; the last inequality follows from (15), β^τ ≤ β/(1 − ζ) for all τ, and the monotone decreasing property of {α^τ}_{τ=0,1,...}. We also have that ‖g^τ‖ ≥ ǫ/2 for τ = t, ..., t′ − 1, which together with the above relation yield

∑_{τ=t}^{t′−1} α^τ ‖g^τ‖^2 ≥ (ǫ^2/4) ∑_{τ=t}^{t′−1} α^τ ≥ (ǫ^2/4) · ǫ(1 − ζ)/(4λβ).

Since the number of such t is infinite, it follows that ∑_{τ=1}^{∞} α^τ ‖g^τ‖^2 = ∞, a contradiction of (29). Since {g^t} → 0 and (9) holds for all s ∈ T, it follows that {∇f(x_1^s)}_{s∈T} → 0.
4. Convergence rate and stepsize analysis. In this section, we show that under a growth assumption on ∇f_1, ..., ∇f_m (see Assumption B below), the incremental gradient method either is linearly convergent in some sense or has its stepsize bounded away from zero (see Proposition 4.2). This result gives an explanation of the observed behavior that, on some problems, the stepsize remains bounded away from zero (see the numerical experience reported in section 6). To establish our result, we first need the following technical lemma.

LEMMA 4.1. For any α^t ∈ (0, α^0] and any x_1^t, ..., x_{m+1}^t in R_ρ^η satisfying (2), with d_1^t, ..., d_m^t given by (3), for t = 0, 1, ..., we have

(32)    ‖d_i^t‖ ≤ ∑_{τ=0}^{t−1} μ_{t−τ}^{t−τ+1} h^τ + (1 + ··· + μ^{i−1}) h^t,

for i = 1, ..., m and t = 0, 1, ..., where we let

(33)    h^t := max_{i=1,...,m} ‖∇f_i(x_1^t)‖,   t = 0, 1, ...,       μ := ( (m − 1) α^0 max_{i=1,...,m} λ_i + ζ )^{1/m}.
Proof. Clearly, (32) holds for t = 0 and i = 1. Suppose that, for some s ≥ 0 and some 1 ≤ j ≤ m, (32) holds for i = 1, ..., m if t < s and for i = 1, ..., j if t = s. First, consider the case j = m. Then, by (3) with t = s + 1 and (32) with t = s and i = m,

    ‖d_1^{s+1}‖ = ‖∇f_1(x_1^{s+1}) + ζ d_m^s‖ ≤ ‖∇f_1(x_1^{s+1})‖ + ζ ‖d_m^s‖ ≤ h^{s+1} + ζ ( ∑_{τ=0}^{s} μ^{s−τ+1} h^τ ).

Since ζ ≤ μ^m, this implies (32) holds for t = s + 1 and i = 1. Second, consider the case j < m. Then, by (3) for t = s and i = j + 1,

    ‖d_{j+1}^s‖ = ‖∇f_{j+1}(x_{j+1}^s) + ζ d_j^s‖
        ≤ ‖∇f_{j+1}(x_{j+1}^s) − ∇f_{j+1}(x_1^s)‖ + ‖∇f_{j+1}(x_1^s)‖ + ζ ‖d_j^s‖
        ≤ α^s λ_{j+1} (‖d_j^s‖ + ··· + ‖d_1^s‖) + ‖∇f_{j+1}(x_1^s)‖ + ζ ‖d_j^s‖
        ≤ j α^s λ_{j+1} ( ∑_{τ=0}^{s−1} μ^{s−τ+1} h^τ + (1 + ··· + μ^{j−1}) h^s ) + h^s + ζ ( ∑_{τ=0}^{s−1} μ^{s−τ+1} h^τ + (1 + ··· + μ^{j−1}) h^s )
        = (j α^s λ_{j+1} + ζ) ∑_{τ=0}^{s−1} μ^{s−τ+1} h^τ + (j α^s λ_{j+1} + ζ)(1 + ··· + μ^{j−1}) h^s + h^s
        ≤ μ^m ∑_{τ=0}^{s−1} μ^{s−τ+1} h^τ + (1 + ··· + μ^j) h^s,

where the second inequality follows from x_{j+1}^s, x_1^s ∈ R_ρ^η, the Lipschitz continuity of ∇f_{j+1} on R_ρ^η (with constant λ_{j+1}), and (2); the third inequality follows from (32) with i = 1, ..., j and (33); the last inequality follows from j α^s λ_{j+1} + ζ ≤ μ^m. This implies (32) holds for t = s and i = j + 1. Thus, by induction on t and i, (32) holds for all i = 1, ..., m and all t = 0, 1, ....
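For later reference (Proposition 4.2 below requires μ < 1), the constant μ of (33) is straightforward to compute; the numbers in the following sketch are purely illustrative:

```python
def mu_constant(m, alpha0, lams, zeta):
    """mu := ((m-1) * alpha0 * max_i lambda_i + zeta)**(1/m), cf. (33)."""
    return ((m - 1) * alpha0 * max(lams) + zeta) ** (1.0 / m)

# Illustrative constants: m = 3 component functions with Lipschitz
# constants 1, 2, 4, momentum zeta = 0.5, initial stepsize alpha0 = 0.05.
mu = mu_constant(m=3, alpha0=0.05, lams=[1.0, 2.0, 4.0], zeta=0.5)
# mu = (2*0.05*4 + 0.5)**(1/3) = 0.9**(1/3), which is less than 1, so this
# alpha0 is small enough for the hypothesis of Proposition 4.2.
```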
Consider the following growth assumption on f_1, ..., f_m.

Assumption B. There exists c_1 > 0 such that

    max_{i=1,...,m} ‖∇f_i(x)‖ ≤ c_1 ‖∇f(x)‖    ∀x ∈ R_ρ^η.

Assumption B roughly says that the size of ∇f_1, ..., ∇f_m should grow no faster than linearly with the size of ∇f. In particular, this assumption requires that ∇f_1(x) = ··· = ∇f_m(x) = 0 at any stationary point x of f. This requirement, though restrictive, is not entirely unrealistic for certain applications. For example, in the application to neural network training, this requirement amounts to being able to train the neural network to achieve zero output error on the learning examples. In fact, it is possible for this requirement to fail and still have the stepsize bounded away from zero. Consider the example of n = 1, m = 2 and f_1(x) = x, f_2(x) = −x. Then f ≡ 0 and Assumption A holds with η = ρ = ∞ and λ_1 = λ_2 = 0. Upon applying the incremental gradient method with, say, ζ = 0 and any choice of x_1^0, ω, T, and ε_0, ε_1, ε_2, ε_3 satisfying ε_1 < 1, we find that δ_2 = δ_3 = g^t = p^t = 0 and, in particular, α^t = ε_0 for all t = 0, 1, .... In contrast, the other stepsize rules mentioned in section 1 would require the stepsize to tend to zero on this example. (This example is degenerate in the sense that every x ∈ ℜ is a stationary point of f. However, it can easily be modified so that x_1^0 is not a stationary point, etc.) In general, if the iterates are in a region where f_1, ..., f_m are nearly linear, then the stepsize will tend not to decrease.

By using Lemma 4.1 and the fact (see the proof of Proposition 3.4) that (31) holds whenever α^s ≠ α^{s−1}, we have the following convergence rate and stepsize result for the incremental gradient method. The result roughly says that, under Assumption B, either h^t given by (33) tends to zero linearly in some sense or α^t is bounded away from zero. The proof uses the idea that if h^t does not tend to zero linearly, then neither does ‖∇f(x_1^t)‖ (by Assumption B), from which it can be shown that ‖d_i^t‖ = O(‖∇f(x_1^t)‖) (see (35)).
This in turn can be used to show that the right-hand side of (31) is bounded away from zero.

PROPOSITION 4.2. Assume Assumption B (in addition to Assumption A) is satisfied and let {(x_1^t, ..., x_{m+1}^t)}_{t=0,1,...} and {α^t}_{t=0,1,...} be generated by the incremental gradient method (2)–(3) and (4)–(13) (which, by Proposition 3.4, are well defined), with α^0 chosen sufficiently small so that μ given by (33) is less than 1. If there exist c_2 ≥ 1 and σ ∈ (μ, 1) such that

(34)    σ^{t−τ+1} h^τ ≤ c_2 h^t,    τ = 0, ..., t − 1,

for t = 1, 2, ..., where h^t is given by (33), then lim inf_{t→∞} α^t > 0.

Proof. Fix any t ∈ {1, 2, ...}. Since (30) holds, Lemma 4.1, c_2 ≥ 1, and σ < 1 show that, for each i = 1, ..., m, d_i^t given by (3) satisfies

(35)    ‖d_i^t‖ ≤ ∑_{τ=0}^{t−1} μ^{t−τ+1} h^τ + (1 + ··· + (μ/σ)^{m−1}) c_2 h^t
              ≤ ∑_{τ=0}^{t−1} μ^{t−τ+1} (c_2 h^t / σ^{t−τ+1}) + (1 + ··· + (μ/σ)^{m−1}) c_2 h^t
              ≤ ∑_{τ=0}^{t} (μ^{t−τ+1}/σ^{t−τ}) c_2 h^t
              ≤ c_2 h^t / (1 − μ/σ)
              ≤ c_3 ‖∇f(x_1^t)‖,
where the second inequality follows from (34); the third inequality follows from the fact that (a_1 + ··· + a_m)/(b_1 + ··· + b_m) ≤ max_{i=1,...,m} a_i/b_i ≤ a_1/b_1 + ··· + a_m/b_m; the last inequality follows by using Assumption B and letting c_3 := c_2 c_1/(1 − μ/σ). Then (10) and (15) yield

(36)    ‖g^t − ∇f(x_1^t)‖ ≤ ∑_{i=1}^{m} ‖∇f_i(x_i^t) − ∇f_i(x_1^t)‖
                          ≤ ∑_{i=1}^{m} λ_i ‖x_1^t − x_i^t‖
                          = ∑_{i=1}^{m} λ_i α^t ‖d_1^t + ··· + d_{i−1}^t‖
                          ≤ λ α^t m c_3 ‖∇f(x_1^t)‖,
where d_i^t is given by (3), the second inequality follows from ∇f_i being Lipschitz continuous on R_ρ^η (with constant λ_i), and the equality follows from (2). Thus,

    ‖g^t‖ ≥ ‖∇f(x_1^t)‖ − ‖g^t − ∇f(x_1^t)‖ ≥ (1 − λ α^t m c_3) ‖∇f(x_1^t)‖

and, by (10) and (35)–(36),

    β^t = max{ ‖g^t‖, ∑_{i=1}^{m} ‖d_i^t‖ }
        ≤ max{ ‖g^t − ∇f(x_1^t)‖ + ‖∇f(x_1^t)‖, m c_3 ‖∇f(x_1^t)‖ }
        ≤ max{ λ α^t m c_3 + 1, m c_3 } ‖∇f(x_1^t)‖.

Thus, if {α^s} ↓ 0, then (31) must hold for all s along some subsequence of T. On the other hand, the above two inequalities and ‖g^t‖ ≤ β^t (which hold for all t) show that the right-hand side of (31) is bounded away from zero, which implies α^s is bounded away from zero for all s along this subsequence, a contradiction to {α^s} ↓ 0.

Note that, for all t, h^t ≥ (1/m) ∑_{i=1}^{m} ‖∇f_i(x_1^t)‖ ≥ (1/m) ‖∇f(x_1^t)‖ while, by x_1^t ∈ R_ρ^η and Assumption B, h^t ≤ c_1 ‖∇f(x_1^t)‖. Thus, under Assumption B, (34) is equivalent to

    σ^{t−τ+1} ‖∇f(x_1^τ)‖ ≤ c_2′ ‖∇f(x_1^t)‖,    τ = 0, 1, ..., t − 1,

for some constant c_2′. In the very special case where f_1 = ··· = f_m (so Assumption B holds trivially), the above proof can be modified to show that α^t generated by the incremental gradient method with ζ = 0 (i.e., steepest descent) is always bounded away from zero.
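The degenerate example discussed in this section (n = 1, m = 2, f_1(x) = x, f_2(x) = −x, with ζ = 0) can be checked directly: one incremental pass of (2)–(3) returns to the starting point and the aggregate direction vanishes, so an adaptive rule sees no reason to cut the stepsize. A minimal sketch, with illustrative helper names:

```python
def incremental_pass(x, grads, alpha, zeta=0.0, d_prev=0.0):
    """One pass of (2)-(3): visit each f_i in turn, updating the momentum
    direction d = grad f_i(x) + zeta*d and stepping x; also accumulate the
    sum of the directions (the analogue of g^t here)."""
    g_sum = 0.0
    for g in grads:
        d_prev = g(x) + zeta * d_prev
        g_sum += d_prev
        x = x - alpha * d_prev
    return x, d_prev, g_sum

# f1(x) = x, f2(x) = -x, so f = f1 + f2 is identically zero.
grads = [lambda x: 1.0, lambda x: -1.0]
x1, d, g = incremental_pass(0.3, grads, alpha=0.25, zeta=0.0)
# With zeta = 0 the pass steps to x - alpha and straight back: x1 == 0.3,
# and the aggregate direction g is exactly zero, so an adaptive stepsize
# rule never sees a reason to decrease the stepsize on this problem.
```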
5. Extension to constrained problems. In this section, we consider an extension of the incremental gradient method of section 2 to the problem of minimizing, over a nonempty closed convex set X ⊆ ℜ^n, the function f given by (1). Such constrained problems arise in neural network training when bounds are placed on the weights of the neural network, which corresponds to X being a box. Due to the presence of the constraint set X, the formulas for updating x_i^t and d_i^t need to be modified, much as in [12]. We show that, analogous to Proposition 3.4, this extended method has desirable global convergence properties (see Proposition 5.4).

Consider the following iterative method for solving the preceding problem whereby, for a given x_1^0 ∈ X, we generate a sequence (x_1^t, ..., x_{m+1}^t), t = 0, 1, ..., according to

(37)    x_{i+1}^t := x_i^t − α^t d_i^t,  i = 1, ..., m,    x_1^{t+1} := [x_{m+1}^t]^+,

where α^t is a positive scalar and

(38)    d_i^t := ∇f_i(x_i^t) + ζ ∇f_m(x_m^{t−1})    if i = 1,
        d_i^t := ∇f_i(x_i^t) + ζ ∇f_{i−1}(x_{i−1}^t)    if i > 1,

with x_m^{−1} = x_1^0 given. (Here, [·]^+ denotes the orthogonal projection operator onto X. This operator can be evaluated fairly easily if X is a box, a Euclidean sphere, or a simplex.) In the context of neural network training, the projection in (37) corresponds to the oft-used practice of truncating the weights of the neural network at their respective bounds. Note that (38) sets d_i^t to be a weighted sum of the two most recently computed gradients. This contrasts with (3), which sets d_i^t to be a weighted sum of all previously computed gradients. Thus, while (37) reduces to (2) when X = ℜ^n, the formula (38) does not reduce to (3) when X = ℜ^n, so (37)–(38) is a different method from (2)–(3). Our results can be extended to the case where d_i^t is a weighted sum of the K (K ≥ 1 and constant) most recently computed gradients though, for simplicity, we will not consider this more general case here. It is an open question whether our results can be extended to the case where (38) is replaced by (3). Also, we note that in the case where ζ = 0, the method (37)–(38) reduces to the approximate gradient-projection method studied in [12]. Our preliminary numerical experience suggests that ζ > 0 (e.g., ζ = .1) is typically preferable.

As with the incremental gradient method of section 2, we propose a new rule for choosing the stepsize α^t adaptively. To describe this new stepsize rule, we make the following assumption, analogous to Assumption A, about f, f_1, ..., f_m, X, and the initial iterate x_1^0.

Assumption C. There exist scalars η > f(x_1^0) and ρ > 0 such that, for i = 1, ..., m, ∇f_i is bounded and Lipschitz continuous (with some constant λ_i) on the set
    X_ρ^η := { x ∈ X : f(x) ≤ η } + ρB,    where B := { x ∈ ℜ^n : ‖x‖ ≤ 1 }.

Assumption C is quite mild and, in particular, is satisfied when f_1, ..., f_m are twice differentiable and the level set { x ∈ X : f(x) ≤ η } is bounded for some η > f(x_1^0). The new stepsize rule for choosing α^t, analogous to the one of section 2, depends on η, ρ, and λ_1, ..., λ_m; it periodically checks whether a certain descent condition has been satisfied since the previous check was made and, if not, decreases the stepsize and restarts
the method from when the previous check was made. Below, we formally state the method (37)–(38) using this stepsize rule. We will call this method the incremental gradient-projection method.

Incremental gradient-projection method (with 1-memory momentum term).
Choose any x_1^0 ∈ X such that Assumption C holds for some η, ρ, and λ_1, ..., λ_m. Choose any ζ ∈ (0, ∞). Choose any ω ∈ (0, 1) and any subsequence T of {1, 2, ...} containing 1. Choose any positive scalars ε_0, ε_1, ε_2 satisfying ε_1 < 1 + ζ.

Step 0. Let α^0 be the largest element of {ε_0, ωε_0, ω²ε_0, ...} for which x_2^0, ..., x_m^0, x_1^1 given by (37)–(38) with t = 0 satisfy the following two conditions:

(39)    (x_2^0, ..., x_m^0, x_1^1) ∈ (X_ρ^η)^m,

(40)    f(x_1^1) ≤ η − λ_m ζ (α^0 β^0)²,

where β^0 is given by (10).

Step 1. For each s ∈ T, let h = min{t ∈ T : t > s} and generate (x_1^h, d_m^{h−1}, α^{h−1}) from (x_1^s, d_m^{s−1}, α^{s−1}) as follows: If r^s = 0, we stop. Else, we let α^{h−1} be the largest element of {α^{s−1}, ωα^{s−1}, ω²α^{s−1}, ...} for which x_2^t, ..., x_m^t, x_1^{t+1} given by (37)–(38) with α^t = α^{h−1}, t = s, ..., h − 1, satisfy the following three conditions:

(41)    (x_2^t, ..., x_m^t, x_1^{t+1}) ∈ (X_ρ^η)^m,    t = s, ..., h − 1,

(42)    f(x_1^h) ≤ η − ε_1 ∑_{τ=1}^{h−1} α^τ ‖r̂^τ‖² − λ_m ζ (α^{h−1} β^{h−1})²,

(43)    ‖r^s − r̂^s‖ ≤ ε_2 ‖r̂^s‖,

where, for t = s, ..., h − 1, we define

(44)    r^t := [x_1^t − ∇f(x_1^t)]^+ − x_1^t,    r̂^t := [x_1^t − g^t]^+ − x_1^t,

with g^t and β^t given by (10).

Like the stepsize rule of section 2, the above stepsize rule checks at each iteration s ∈ T whether the current stepsize is acceptable (i.e., satisfies (41)–(43)) for all iterations between s and the next element h of T and, if not, decreases the stepsize by the factor ω and restarts the method from iteration s. Note that the conditions (40) and (42) are much simpler than their counterparts (6) and (8) of section 2. This is because here d_i^t depends only on the two most recently computed gradients. The quantity r̂^t may be viewed as an approximation to r^t, the "natural residual" at x_1^t.

To establish the global convergence of the incremental gradient-projection method, we first need the following three technical lemmas, analogous to Lemmas 3.1–3.3. We assume throughout that Assumption C holds.

LEMMA 5.1. For any t ∈ {0, 1, ...}, any α^t > 0, and any x_1^t, ..., x_m^t, x_1^{t+1} satisfying (37), we have, for i = 1, ..., m, that ‖∇f_i(x_1^t) − ∇f_i(x_i^t)‖ ≤ λ_i α^t β^t whenever x_1^t, x_i^t ∈ X_ρ^η, and that ‖∇f_i(x_1^{t+1}) − ∇f_i(x_i^t)‖ ≤ 2 λ_i α^t β^t whenever x_i^t, x_1^{t+1} ∈ X_ρ^η, where β^t is given by (10).
Proof. We have from (37) and (10) that, for any i ∈ {1, ..., m + 1},

    ‖x_1^t − x_i^t‖ = α^t ‖∑_{l=1}^{i−1} d_l^t‖ ≤ α^t β^t.

By a similar reasoning, we have ‖x_{m+1}^t − x_i^t‖ ≤ α^t β^t which, together with the above inequality with i = m + 1, yields

    ‖x_1^{t+1} − x_i^t‖ ≤ ‖x_1^{t+1} − x_{m+1}^t‖ + ‖x_{m+1}^t − x_i^t‖ ≤ ‖x_1^t − x_{m+1}^t‖ + ‖x_{m+1}^t − x_i^t‖ ≤ α^t β^t + α^t β^t,

where the second inequality follows from the observation that x_1^{t+1} is the point in X nearest in Euclidean distance to x_{m+1}^t and x_1^t ∈ X. The above inequalities, together with ∇f_i being Lipschitz continuous (with constant λ_i) on X_ρ^η for i = 1, ..., m, yield the desired results.

Under Assumption C, there exist positive scalars β_1, ..., β_m such that

(45)    ‖∇f_i(x)‖ ≤ β_i    ∀x ∈ X_ρ^η,  i = 1, ..., m.

Let β and λ be given by (15). Also, let

(46)    δ_0 := ρ/(2β(1 + ζ)).
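Before turning to the remaining lemmas, one iteration of the update (37)–(38), together with the natural residual of (44), can be sketched for a box constraint as follows. This is an illustrative sketch only (all names are hypothetical), not the full method with its stepsize rule:

```python
import numpy as np

def project_box(x, lo, hi):
    """Orthogonal projection [.]^+ onto the box X = {x : lo <= x <= hi}."""
    return np.clip(x, lo, hi)

def igp_iteration(x1, grad_prev, grads, alpha, zeta, lo, hi):
    """One iteration t of (37)-(38): m unprojected inner steps with a
    1-memory momentum term, then projection of x_{m+1}^t back onto X.
    grad_prev carries grad f_m(x_m^{t-1}) over from the previous iteration."""
    x = x1
    for g in grads:
        gi = g(x)                      # grad f_i at the current inner point
        d = gi + zeta * grad_prev      # (38): current plus previous gradient
        x = x - alpha * d              # (37): inner step, no projection yet
        grad_prev = gi
    return project_box(x, lo, hi), grad_prev   # (37): x_1^{t+1} := [x_{m+1}^t]^+

def natural_residual(x, full_grad, lo, hi):
    """r := [x - grad f(x)]^+ - x from (44); r = 0 iff x is stationary."""
    return project_box(x - full_grad(x), lo, hi) - x

# Illustration: f_1(x) = 0.5*(x-2)^2, f_2(x) = 0.5*(x-3)^2 over X = [0, 1].
grads = [lambda x: x - 2.0, lambda x: x - 3.0]
full_grad = lambda x: 2.0 * x - 5.0
x, gprev = np.array([0.5]), np.zeros(1)
for _ in range(10):
    x, gprev = igp_iteration(x, gprev, grads, alpha=0.1, zeta=0.1, lo=0.0, hi=1.0)
# The iterates are driven to the constrained minimizer x = 1 (the projection
# of the unconstrained minimizer 2.5 onto the box), where the residual vanishes.
```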
LEMMA 5.2. For any t ∈ {0, 1, ...}, any α^t ∈ (0, δ_0], and any x_m^{t−1}, x_1^t, ..., x_m^t, x_1^{t+1} satisfying (37)–(38) and such that x_m^{t−1} ∈ X_ρ^η, x_1^t ∈ X, and f(x_1^t) ≤ η, we have that x_1^t, ..., x_m^t are in X_ρ^η, as is the line segment joining x_1^t with x_1^{t+1}.

Proof. We claim that

(47)    ‖x_l^t − x_1^t‖ ≤ (ρ/(2β(1 + ζ))) ∑_{k=1}^{l−1} (β_k + ζ β_{k⊖1})

for l = 1, ..., m + 1. We prove this by induction on l. Clearly, (47) holds for l = 1. Suppose (47) holds for l = 1, ..., i for some i ∈ {1, ..., m}; we show below that it also holds for l = i + 1. Since (47) holds for l = 1, ..., i and the right-hand side of (47) is bounded above by ρ/2, we have (cf. f(x_1^t) ≤ η and x_1^t ∈ X) x_1^t, ..., x_i^t ∈ X_ρ^η, as well as x_m^{t−1} ∈ X_ρ^η. Then, (37)–(38), (45), and α^t ≤ δ_0 yield

    ‖x_{i+1}^t − x_i^t‖ = α^t ‖∇f_i(x_i^t) + ζ ∇f_{i⊖1}(x_{i⊖1}^{⌊(tm+i−2)/m⌋})‖ ≤ α^t (β_i + ζ β_{i⊖1}) ≤ (ρ/(2β(1 + ζ))) (β_i + ζ β_{i⊖1}).

Since (47) holds for l = i, this shows that (47) holds for l = i + 1. The claim (47), together with f(x_1^t) ≤ η and x_1^t ∈ X, implies that x_l^t ∈ X_ρ^η for l = 1, ..., m + 1. Also, since x_1^{t+1} is the point in X nearest in Euclidean distance to x_{m+1}^t (see (37)) and x_1^t ∈ X, we have

    ‖x_{m+1}^t − x_1^{t+1}‖ ≤ ‖x_{m+1}^t − x_1^t‖ ≤ ρ/2,
where the last inequality follows from (47) with l = m + 1 and the bound ρ/2 on the right-hand side of (47). Thus,

    ‖x_1^t − x_1^{t+1}‖ ≤ ‖x_1^t − x_{m+1}^t‖ + ‖x_{m+1}^t − x_1^{t+1}‖ ≤ ρ

and, hence, (cf. f(x_1^t) ≤ η and x_1^t ∈ X) the line segment joining x_1^t with x_1^{t+1} lies in X_ρ^η.

By using Lemma 5.1, we obtain the third lemma, which estimates the decrease in the value of f per iteration of the incremental gradient-projection method.

LEMMA 5.3. For any t ∈ {1, 2, ...}, any α^τ ∈ (0, 1/(1 + ζ)], and any x_m^{τ−1}, x_1^τ, ..., x_m^τ, x_1^{τ+1} satisfying (37)–(38) for τ = t − 1, t, and such that x_1^{t−1} ∈ X and both x_m^{t−1}, x_m^t and the line segment joining x_1^t with x_1^{t+1} lie in X_ρ^η, we have

(48)    f(x_1^{t+1}) ≤ f(x_1^t) − (1 + ζ) α^t ‖r̂^t‖² + (1.5λ + 2λ_m ζ)(α^t β^t)² + λ_m ζ (α^{t−1} β^{t−1})²,

where β^t is given by (10) and r̂^t is given by (44). Similarly, for any α^0 ∈ (0, 1/(1 + ζ)] and any x_m^{−1}, x_1^0, ..., x_m^0, x_1^1 satisfying (37)–(38) (with t = 0) and such that x_m^{−1} = x_1^0 ∈ X and both x_m^0 and the line segment joining x_1^0 with x_1^1 lie in X_ρ^η, we have

(49)    f(x_1^1) ≤ f(x_1^0) − (1 + ζ) α^0 ‖r̂^0‖² + (1.5λ + λ_m ζ)(α^0 β^0)².
Proof. Fix any t ∈ {1, 2, ...}. Since the line segment joining x_1^t with x_1^{t+1} lies in X_ρ^η and, by ∇f = ∇f_1 + ··· + ∇f_m (cf. (1)) and Assumption C, ∇f is Lipschitz continuous (with constant λ given by (15)) on X_ρ^η, we obtain from the intermediate value theorem (see [4, p. 639]) that

(50)    f(x_1^{t+1}) ≤ f(x_1^t) + ⟨∇f(x_1^t), x_1^{t+1} − x_1^t⟩ + .5λ ‖x_1^{t+1} − x_1^t‖².

Using (37) and x_1^t ∈ X, we bound the rightmost term in (50) as follows:

(51)    ‖x_1^{t+1} − x_1^t‖ = ‖[x_1^t − α^t ∑_{i=1}^m d_i^t]^+ − [x_1^t]^+‖ ≤ α^t ‖∑_{i=1}^m d_i^t‖ ≤ α^t β^t,

where the first inequality follows from the nonexpansive property of the projection operator [·]^+, and the last inequality follows from (10). The second term on the right-hand side of (50) can be bounded as follows:

    ⟨∇f(x_1^t), x_1^{t+1} − x_1^t⟩
      = ⟨∇f(x_1^t) − g^t, x_1^{t+1} − x_1^t⟩ + ⟨g^t, [x_1^t − α^t(1 + ζ)g^t]^+ − x_1^t⟩ + ⟨g^t, x_1^{t+1} − [x_1^t − α^t(1 + ζ)g^t]^+⟩
      ≤ ‖∇f(x_1^t) − g^t‖ ‖x_1^{t+1} − x_1^t‖ − (1/(α^t(1 + ζ))) ‖[x_1^t − α^t(1 + ζ)g^t]^+ − x_1^t‖² + ‖g^t‖ ‖x_1^{t+1} − [x_1^t − α^t(1 + ζ)g^t]^+‖
      ≤ ‖∇f(x_1^t) − g^t‖ ‖x_1^{t+1} − x_1^t‖ − α^t(1 + ζ) ‖r̂^t‖² + ‖g^t‖ ‖x_1^{t+1} − [x_1^t − α^t(1 + ζ)g^t]^+‖
      ≤ ‖∇f(x_1^t) − g^t‖ ‖x_1^{t+1} − x_1^t‖ − α^t(1 + ζ) ‖r̂^t‖² + α^t ‖g^t‖ ‖∑_{i=1}^m d_i^t − (1 + ζ) g^t‖,
where the first inequality follows from the Cauchy–Schwarz inequality and the following well-known property of [·]^+:

    ⟨x − y, [y]^+ − x⟩ ≤ −‖[y]^+ − x‖²    ∀x ∈ X, ∀y ∈ ℜ^n;

the second inequality follows from α^t(1 + ζ) ≤ 1 and ‖[x − γd]^+ − x‖ ≥ γ ‖[x − d]^+ − x‖ for all γ ∈ [0, 1] (see Lemma 1 in [7]) and (44); the last inequality follows from x_1^{t+1} = [x_1^t − α^t ∑_{i=1}^m d_i^t]^+ (see (37)) and the nonexpansive property of [·]^+. Also, we have from (10) and (38) that

    ‖∑_{i=1}^m d_i^t − (1 + ζ) g^t‖ = ζ ‖∇f_m(x_m^{t−1}) − ∇f_m(x_m^t)‖
      ≤ ζ (‖∇f_m(x_m^{t−1}) − ∇f_m(x_1^t)‖ + ‖∇f_m(x_1^t) − ∇f_m(x_m^t)‖)
      ≤ ζ λ_m (2α^{t−1} β^{t−1} + α^t β^t),

where the last inequality follows from x_m^{t−1}, x_1^t, x_m^t ∈ X_ρ^η and Lemma 5.1. By a similar argument, we have that (24) holds. Combining the above two inequalities with (24) and ‖g^t‖ ≤ β^t (see (10)) yields
    ⟨∇f(x_1^t), x_1^{t+1} − x_1^t⟩ ≤ λ (α^t β^t)² − α^t(1 + ζ) ‖r̂^t‖² + α^t β^t ζ λ_m (2α^{t−1} β^{t−1} + α^t β^t)
      ≤ λ (α^t β^t)² − α^t(1 + ζ) ‖r̂^t‖² + ζ λ_m ((α^{t−1} β^{t−1})² + 2(α^t β^t)²),

where the last inequality uses ab ≤ .5(a² + b²). This, together with (50)–(51), proves (48). The proof of (49) is very similar.

Below we state and prove the global convergence result for the incremental gradient-projection method. The proof uses Lemmas 5.1–5.3.

PROPOSITION 5.4. The sequences {(x_1^t, ..., x_{m+1}^t)}_{t=0,1,...} and {α^t}_{t=0,1,...} generated by the incremental gradient-projection method (see (37)–(38) and (39)–(44)) are well defined. Moreover, either (i) r^s = 0 for some s ∈ T, or (ii) lim inf_{t→∞} f(x_1^t) = −∞, or (iii) {r̂^t} → 0 and {r^t}_{t∈T} → 0, where T is the subsequence of {1, 2, ...} specified in the method.

Proof. We show by induction on t that α^t is well defined for t = 0, 1, .... Since x_m^{−1} = x_1^0 ∈ X and f(x_1^0) < η, for α^0 ≤ min{δ_0, 1/(1 + ζ)}, Lemma 5.2 shows that x_1^0, ..., x_m^0 are in X_ρ^η, as is the line segment joining x_1^0 with x_1^1. Thus (39) holds and, by Lemma 5.3, (49) holds. Since f(x_1^0) < η, the latter implies that (40) holds for all α^0 sufficiently small, so α^0, being the largest element of {ε_0, ωε_0, ω²ε_0, ...} for which (40) holds, is well defined.

Now assume that, for some s ∈ T, α^t is well defined for t = 0, 1, ..., s − 1. If ∇f(x_1^s) = 0, then we stop with case (i). Suppose instead ∇f(x_1^s) ≠ 0 and let h = min{t ∈ T : t > s}. We argue in the next two paragraphs that α^t is well defined for t = s, ..., h − 1. Since g^t → ∇f(x_1^s) and β^t is bounded as α^t → 0 for t = s, ..., h − 1, we have

(52)    α^t ≤ min{δ_0, 1/(1 + ζ)},    α^t ≤ (1 + ζ − ε_1) ‖r̂^t‖² / ((1.5λ + 3λ_m ζ)(β^t)²),  t = s, ..., h − 1,    α^s ≤ ε_2 ‖r̂^s‖ / (λ β^s),

for any α^s = ··· = α^{h−1} sufficiently small. We claim that, whenever (52) holds, then

(53)    f(x_1^{t+1}) ≤ η,    (x_2^t, ..., x_m^t, x_1^{t+1}) ∈ (X_ρ^η)^m,
for t = s − 1, ..., h − 1. Clearly (53) holds for t = s − 1. (This follows from (39)–(40) when s = 1, and from (41) with t = s − 1 and (42) with h = s when s > 1.) Suppose that, for some k ∈ {s, ..., h − 1}, the relation (53) holds for t = s − 1, ..., k − 1. From (37), we also have x_1^{t+1} ∈ X for t = s − 1, ..., k − 1. Then, for each t ∈ {s, ..., k}, we have f(x_1^t) ≤ η, x_m^{t−1} ∈ X_ρ^η, and x_1^t ∈ X, which together with Lemma 5.2 and α^t ≤ δ_0 (see (52)) imply that x_1^t, ..., x_m^t and the line segment joining x_1^t with x_1^{t+1} all lie in X_ρ^η; so, by Lemma 5.3 and α^t ≤ 1/(1 + ζ) (see (52)), (48) holds. Then, using the second inequality in (52) to bound the right-hand side of (48) yields

(54)    f(x_1^{t+1}) ≤ f(x_1^t) − ε_1 α^t ‖r̂^t‖² − λ_m ζ (α^t β^t)² + λ_m ζ (α^{t−1} β^{t−1})²

for t = s, ..., k. Also, we have that the inequality (42) holds for h = s. (For s = 1, this follows from (40); for s > 1, this follows from α^{s−1} being chosen in Step 1 such that (42) holds for h = s.) Summing (54) over all t ∈ {s, ..., k} and then adding this to (42) with h = s, we obtain

(55)    f(x_1^{k+1}) ≤ η − ε_1 ∑_{t=1}^{k} α^t ‖r̂^t‖² − λ_m ζ (α^k β^k)².

The right-hand side of this inequality is less than η, so the claim (53) holds for t = k. By induction on k, it follows that (53) holds for t = s − 1, ..., h − 1.

Thus, if (52) holds, then (53) holds for t = s − 1, ..., h − 1, so that (41) holds. In addition, our argument showed that (55) holds for k = s, ..., h − 1, so that, upon letting k = h − 1 in (55), we obtain that (42) holds. To see that (43) holds, we use (44) and (24) with t = s to obtain

    ‖r^s − r̂^s‖ = ‖[x_1^s − ∇f(x_1^s)]^+ − [x_1^s − g^s]^+‖ ≤ ‖∇f(x_1^s) − g^s‖ ≤ λ α^s β^s ≤ ε_2 ‖r̂^s‖,

where the first inequality uses the nonexpansive property of [·]^+ and the last inequality is due to the last inequality in (52). Thus α^s, ..., α^{h−1}, being the largest element of {α^{s−1}, ωα^{s−1}, ...} such that (41)–(43) hold, are well defined. This completes the induction step and shows that α^t is well defined for all t = 0, 1, ....

There are three cases: either (i) r^s = 0 for some s ∈ T or else, since (42) holds for all h ∈ T, (ii) lim inf_{t→∞} f(x_1^t) = −∞, or (iii)

(56)    ∑_{t=1}^{∞} α^t ‖r̂^t‖² < ∞.

Also, we have from f(x_1^0) < η and (39) and (41) (with h = min{t ∈ T : t > s}) for all s ∈ T that

(57)    (x_1^t, ..., x_m^t) ∈ (X_ρ^η)^m,    t = 0, 1, ....

We claim that, in case (iii), {r̂^t} → 0. Since {α^t} is monotonically decreasing, either lim inf_{t→∞} α^t > 0 or {α^t} ↓ 0. In the first case, (56) yields {r̂^t} → 0. Consider now the second case. We showed earlier that, for each s ∈ T, (41)–(43) hold whenever (52) holds for t = s, ..., h − 1, where h = min{t ∈ T : t > s}. Then, the choice of α^s = ··· = α^{h−1} as the largest element of {α^{s−1}, ωα^{s−1}, ...} such that (41)–(43) hold implies either α^s = α^{s−1} or

(58)    α^s > ω · min{ min{δ_0, 1/(1 + ζ)},  min_{t=s,...,h−1} (1 + ζ − ε_1) ‖r̂^t‖² / ((1.5λ + 3λ_m ζ)(β^t)²),  ε_2 ‖r̂^s‖ / (λ β^s) }.
Since (57) holds, for t = 0, 1, ... we have from (44), (10), (45), and (15) that

    ‖r̂^t‖ = ‖[x_1^t − g^t]^+ − [x_1^t]^+‖ ≤ ‖g^t‖ = ‖∑_{i=1}^m ∇f_i(x_i^t)‖ ≤ ∑_{i=1}^m β_i = β

and from (38), (45), and (15) that

    ∑_{i=1}^m ‖d_i^t‖ = ∑_{i=1}^m ‖∇f_i(x_i^t) + ζ ∇f_{i⊖1}(x_{i⊖1}^{⌊(tm+i−2)/m⌋})‖ ≤ ∑_{i=1}^m (β_i + ζ β_{i⊖1}) = β(1 + ζ),

so (10) yields β^t ≤ β(1 + ζ). Since {α^s} ↓ 0, (58) holds for all s in some subsequence of T, and it follows that r̂^t → 0 for t along some subsequence of {0, 1, ...}.

We now argue that r̂^t → 0 for t along the entire sequence {0, 1, ...}. Suppose this is not the case, so there exists ε > 0 such that ‖r̂^t‖ > ε for all t along some subsequence of {0, 1, ...}. The following argument is a modification of the proof of [12, Prop. 2]. Consider any t such that ‖r̂^t‖ ≥ ε. Since {r̂^t}_{t=0,1,...} contains a subsequence that tends to zero, there exists a smallest integer t′ > t such that ‖r̂^{t′}‖ < ε/2. Then (44) yields

    ε/2 ≤ ‖r̂^t‖ − ‖r̂^{t′}‖
        ≤ ‖r̂^t − r̂^{t′}‖
        = ‖([x_1^t − ∑_{i=1}^m ∇f_i(x_i^t)]^+ − x_1^t) − ([x_1^{t′} − ∑_{i=1}^m ∇f_i(x_i^{t′})]^+ − x_1^{t′})‖
        ≤ ‖(x_1^t − ∑_{i=1}^m ∇f_i(x_i^t)) − (x_1^{t′} − ∑_{i=1}^m ∇f_i(x_i^{t′}))‖ + ‖x_1^t − x_1^{t′}‖
        ≤ 2 ‖x_1^t − x_1^{t′}‖ + ‖∑_{i=1}^m (∇f_i(x_i^t) − ∇f_i(x_i^{t′}))‖
        ≤ 2 ∑_{τ=t}^{t′−1} ‖x_1^τ − x_1^{τ+1}‖ + ∑_{τ=t}^{t′−1} ∑_{i=1}^m (‖∇f_i(x_i^{τ+1}) − ∇f_i(x_1^{τ+1})‖ + ‖∇f_i(x_1^{τ+1}) − ∇f_i(x_i^τ)‖)
        ≤ 2 ∑_{τ=t}^{t′−1} α^τ β^τ + ∑_{τ=t}^{t′−1} ∑_{i=1}^m λ_i (α^{τ+1} β^{τ+1} + 2 α^τ β^τ)
        = ∑_{τ=t}^{t′−1} ( 2 α^τ β^τ + λ (α^{τ+1} β^{τ+1} + 2 α^τ β^τ) )
        ≤ (2 + 3λ) β(1 + ζ) ∑_{τ=t}^{t′−1} α^τ,

where the third inequality uses the nonexpansive property of [·]^+; the fourth inequality uses the triangle inequality and (37); the fifth inequality follows from (10), (51), (57), and Lemma 5.1; and the last inequality follows from β^τ ≤ β(1 + ζ) for all τ and the monotone decreasing property of {α^τ}_{τ=0,1,...}. We also have that ‖r̂^τ‖ ≥ ε/2 for τ = t, ..., t′ − 1, which together with the above relation yields

    ∑_{τ=t}^{t′−1} α^τ ‖r̂^τ‖²  ≥  (ε²/4) ∑_{τ=t}^{t′−1} α^τ  ≥  (ε²/4) · ε/(2(2 + 3λ)β(1 + ζ)).
Since the number of such t is infinite, this implies ∑_{τ=1}^{∞} α^τ ‖r̂^τ‖² = ∞, a contradiction of (56). Since {r̂^t} → 0 and (43) holds for all s ∈ T, it follows that {r^s}_{s∈T} → 0.

We may ask whether a convergence rate and stepsize result analogous to Proposition 4.2 holds for the incremental gradient-projection method. This, however, appears unlikely, since the proof of Proposition 4.2 requires that ∇f_1(x) = ··· = ∇f_m(x) = 0 at a stationary point x of the problem. Such an assumption is reasonable for an unconstrained problem but not for a constrained problem.

6. Implementation issues and numerical experience. To gain some insight into the implementation issues associated with the incremental gradient(-projection) method and its practical performance, we implemented the method to train a single-hidden-layer feedforward neural network, and compared the performance of the method (which, in this case, is effectively on-line backpropagation) with the conjugate gradient method using the Polak–Ribiere update and the Armijo stepsize rule. In this section we report our findings.

First, we briefly describe the problem of training a single-hidden-layer feedforward neural network (see [12, section 3] for a more detailed discussion). In this problem, we are given a collection of vectors (I(1), O(1)), ..., (I(m), O(m)) in ℜ^M × ℜ^L ("training examples" of input and desired output), and the goal is to minimize the output error function f given by (1) with f_i : ℜ^{MN+LN+N+L} → [0, ∞), i = 1, ..., m, given by

    f_i(u_1, ..., u_N, v_1, ..., v_N, ω_1, ..., ω_N, z) = ‖ ∑_{k=1}^{N} v_k σ(⟨I(i), u_k⟩ + ω_k) + z − O(i) ‖²,

where u_1, ..., u_N ∈ ℜ^M, v_1, ..., v_N ∈ ℜ^L, ω_1, ..., ω_N ∈ ℜ, z ∈ ℜ^L, and with σ : ℜ → ℜ (the "sigmoidal activation function") a user-chosen continuous function satisfying σ(θ) → 1 as θ → ∞ and σ(θ) → 0 as θ → −∞. (In neural network terminology, N is the number of hidden neurons, u_1, ..., u_N, v_1, ..., v_N are the weights on the neural connections, and ω_1, ..., ω_N, z are biases at the neurons.) In our testing we used the following standard choice of σ:

    σ(θ) = 1/(1 + exp(−θ/10))

and set N according to the specific training examples, as is done in practice. We have two specific test problems. For our first test problem, that of computing the parity of {0, 1}-vectors [10, p. 131], we have m = 5, M = 4, L = 1, and

    I(1) = (0, 0, 0, 0)ᵀ, O(1) = 0,    I(2) = (1, 0, 0, 0)ᵀ, O(2) = 1,    I(3) = (1, 1, 0, 0)ᵀ, O(3) = 0,    ···,    I(5) = (1, 1, 1, 1)ᵀ, O(5) = 0,

and we set N = 1. (Here the desired output is 0 when the input has an even number of 1's and, otherwise, the desired output is 1.) For our second test problem, that of
recognizing the characters of 0, 1, 2, 3, 4, we have m = 5, M = 15, L = 3, and

    I(1) = [1 1 1; 1 0 1; 1 0 1; 1 0 1; 1 1 1] (written as a vector in ℜ^15), O(1) = (0, 0, 0)ᵀ,    ···,    I(5) = [1 0 1; 1 0 1; 1 1 1; 0 0 1; 0 0 1], O(5) = (1, 0, 0)ᵀ,
and we set N = 3. (Here the "1"s in I(1), written as a 5×3 matrix, form the character 0, and similarly for I(2), ..., I(5). The desired output O(i) is the binary representation of i − 1, for i = 1, ..., 5.) We also experimented with a version of the problem in which the desired output was changed to L = 1 and O(1) = 0, ..., O(5) = 4. However, though all methods converged faster on this version, the trained neural network was much less accurate in recognizing corrupted input. Two key features of each function f_i are that (i) it has multiple local minima, and (ii) evaluating ∇f_i requires roughly the same work as evaluating f_i.

Next, we describe the three methods we implemented. The first method, referred to as ALG1, is the incremental gradient method of section 2. For this method, we used the standard choice of ε_0 = 1, ω = .5, T = {1, 11, 21, 31, ...} and, after some experimentation, found the choice of ζ = .8 to work best (much better than the memoryless choice of ζ = 0). To avoid the stepsizes becoming too small, it is desirable to choose η, ε_3 large and ε_1, ε_2 small; for our implementation, we chose η = 1.5 f(x_1^0) + 100, ε_3 = 1000, ε_1 = ε_2 = .00001 and estimated λ_1 + ··· + λ_m and ρ by 1 and ∞, respectively. The second method, referred to as ALG2, is the incremental gradient-projection method of section 5 (with X = ℜ^n). For this method, we used the standard choice of ε_0 = 1, ω = .5, T = {1, 11, 21, 31, ...} and, after some experimentation, found the choice of ζ = .1 to work best. Analogously to ALG1, we chose η = 1.5 f(x_1^0) + 100, ε_1 = .00001, ε_2 = 1000, and estimated λ_m and ρ by 1 and ∞, respectively. The third method, referred to as ALG3, is the conjugate gradient method using the efficient Polak–Ribiere update and the Armijo stepsize rule [1, pp. 20, 57]. More precisely, the method generates, for any given x^0 ∈ ℜ^n, a sequence x^0, x^1, ... according to x^{t+1} := x^t − α^t d^t, where
    d^t := ∇f(x^t) + β^t d^{t−1}    if t > 0,        d^t := ∇f(x^t)    if t = 0,

    β^t := ⟨∇f(x^t), ∇f(x^t) − ∇f(x^{t−1})⟩ / ‖∇f(x^{t−1})‖²,

and α^t is the largest element of {ε_0, ε_0 ω, ε_0 ω², ...} for which x^{t+1} given above satisfies

    f(x^{t+1}) ≤ f(x^t) − σ α^t ⟨∇f(x^t), d^t⟩.

Here ε_0 > 0, ω ∈ (0, 1), and σ ∈ (0, .5) are user-chosen parameters. In our implementation, we used the standard choice of ε_0 = 1, ω = .5, σ = .1 and, to ensure convergence, incorporated a steepest descent restart (i.e., we replace d^t by ∇f(x^t) whenever ⟨∇f(x^t), d^t⟩ < .00001 ‖∇f(x^t)‖ ‖d^t‖). For all three methods, each component of the starting point (i.e., x_1^0 for ALG1 and ALG2, and x^0 for ALG3) was randomly generated according to the uniform distribution on the interval [0, 1], and the termination criterion was f(x) ≤ 10⁻⁸. All methods were coded in Matlab Version 4.2a and were run
TABLE 1
Performance of ALG1, ALG2, and ALG3 on the two test problems.

                               ALG1                ALG2                ALG3
    Problem                ngrad¹   nfunc²     ngrad¹   nfunc²     ngrad¹   nfunc²
    Parity                 222.3    30.2       602.3    61.3       650.2    1775.0
    Character Recognition  185.6    19.6       629.0    64.0       227.0    713.3

¹ ngrad denotes the number of times that ∇f_1, ..., ∇f_m have been evaluated.
² nfunc denotes the number of times that f_1, ..., f_m have been evaluated.
on a Decstation 5000. Table 1 gives the number of gradient and function evaluations for the three methods, averaged over three runs (all with a standard deviation of less than 20 percent). From Table 1, it can be seen that ALG1 requires fewer gradient evaluations and function evaluations (which are the most expensive operations) than either ALG2 or ALG3. Since a gradient evaluation requires roughly the same work as a function evaluation, so that the total work is roughly the sum of the gradient and function evaluations, we see that ALG1 requires less than one-third the total work of either ALG2 or ALG3, while ALG3 requires the most total work. Thus, for our test problems at least, we can draw the following conclusions: (i) the incremental gradient method using an unlimited-memory momentum term is more efficient than the conjugate gradient method using the Polak–Ribiere update and Armijo stepsize rule; (ii) the incremental gradient method using an unlimited-memory momentum term is more efficient than that using a one-memory momentum term (which in turn is more efficient than that using no momentum term at all). We note that the stepsize rule is also crucial to the efficiency of the incremental gradient method. When we took ALG1 and replaced its stepsize rule by the well-known (nonadaptive) stepsize rule α^t = c/t (which produces stepsizes that are square summable but not summable), the convergence became agonizingly slow, regardless of the choice of the constant c > 0. In contrast, the stepsizes in both ALG1 and ALG2 remained at the value .5 after an initial decrease (the large stepsizes, as well as the presence of the momentum term, appear to be key to the good performance of ALG1), while the stepsizes in ALG3 varied between .125 and 1. On the other hand, we caution that these results are for some small test problems only, and much more extensive testing is needed to determine the efficiency of the incremental gradient(-projection) method in general.

REFERENCES

[1] D.
P. BERTSEKAS, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, New York, 1982.
[2] D. P. BERTSEKAS, Incremental least squares methods and the extended Kalman filter, SIAM J. Optim., 6 (1996), pp. 807–822.
[3] D. P. BERTSEKAS, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7 (1997), pp. 913–926.
[4] D. P. BERTSEKAS AND J. N. TSITSIKLIS, Parallel and Distributed Computation: Numerical Methods, Prentice–Hall, Englewood Cliffs, NJ, 1989.
[5] W. C. DAVIDON, New least-square algorithms, J. Optim. Theory Appl., 18 (1976), pp. 187–197.
[6] T. DENOEUX AND R. LENGELLÉ, Initializing back propagation networks with prototypes, Neural Networks, 6 (1993), pp. 351–363.
[7] E. M. GAFNI AND D. P. BERTSEKAS, Two-metric projection methods for constrained optimization, SIAM J. Control Optim., 22 (1984), pp. 936–964.
[8] A. A. GAIVORONSKI, Convergence properties of back-propagation for neural nets via theory of stochastic gradient methods, Part I, Optim. Methods Software, 4 (1994), pp. 117–134.
[9] L. GRIPPO, A class of unconstrained minimization methods for neural network training, Optim. Methods Software, 4 (1994), pp. 135–150.
[10] J. HERTZ, A. KROGH, AND R. G. PALMER, Introduction to the Theory of Neural Computation, Addison–Wesley, Redwood City, CA, 1991.
[11] Y. LE CUN, Une procédure d'apprentissage pour réseau à seuil assymétrique, in Proc. Cognitiva '85, Paris, France, pp. 599–604.
[12] Z.-Q. LUO AND P. TSENG, Analysis of an approximate gradient projection method with applications to the backpropagation algorithm, Optim. Methods Software, 4 (1994), pp. 85–101.
[13] O. L. MANGASARIAN AND M. V. SOLODOV, Serial and parallel backpropagation convergence via nonmonotone perturbed minimization, Optim. Methods Software, 4 (1994), pp. 103–116.
[14] M. F. MØLLER, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 6 (1993), pp. 525–533.
[15] T. N. PAPPAS, Solution of Nonlinear Equations by Davidon's Least Squares Method, M.Sc. thesis, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 1982.
[16] D. B. PARKER, Learning-Logic, Center for Computational Research in Economics and Management Science Report TR-47, Massachusetts Institute of Technology, Cambridge, MA, 1985.
[17] B. T. POLYAK, Introduction to Optimization, Optimization Software, New York, 1987.
[18] D. E. RUMELHART, G. E. HINTON, AND R. J. WILLIAMS, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Rumelhart and McClelland, eds., MIT Press, Cambridge, MA, 1986, pp. 318–362.
[19] G. TESAURO, Y. HE, AND S. AHMAD, Asymptotic convergence of back propagation, Neural Comput., 1 (1989), pp. 382–391.
[20] P. J. WERBOS, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Committee on Applied Mathematics, Harvard University, Cambridge, MA, 1974.
[21] P. J. WERBOS, Backpropagation through time: What it does and how to do it, Proc. IEEE, 78 (1990), pp. 1550–1560.
[22] H. WHITE, Learning in artificial neural networks: A statistical perspective, Neural Comput., 1 (1989), pp. 425–464.
[23] H. WHITE, Some asymptotic results for learning in single hidden-layer feedforward network models, J. Amer. Statist. Assoc., 84 (1989), pp. 1003–1013.