Approximations and Generalized Newton Methods

Diethard Klatte · Bernd Kummer
Submitted 29 January 2016 / Revised 23 February 2016
Abstract We study local convergence of generalized Newton methods for both equations and inclusions by using known and new approximations and regularity properties at the solution. Including Kantorovich-type settings, our goal is to obtain statements about all (not only some) Newton sequences with appropriate initial points. Our basic tools are results of [31], [37] and [40], mainly about Newton maps and modified successive approximation, but also graph-approximations of multifunctions and other techniques. Typical examples and simplifications of existing methods are added.

Keywords Generalized Newton method · inclusion · generalized equation · Newton map · successive approximation · graph-approximation · regularity · Kantorovich-Newton method

Mathematics Subject Classification (2010) 49J53 · 49K40 · 90C31 · 65J05
1 Introduction

In this paper, we present approaches to Newton methods for non-smooth functions and inclusions which are mostly studied in the framework of generalized equations 0 ∈ f(x) + F(x), where f : X → Y is a function and F : X ⇒ Y a multifunction. The Newton steps are defined by approximations f̂ of f and the zeros of 0 ∈ f̂(x) + F(x). Setting F ≡ {0}, equations can be handled; but setting f ≡ 0, one learns nothing about solving 0 ∈ F(x). So we shall also pay attention to such inclusions.

Starting more than 30 years ago, the extension of Newton's method to non-smooth and multivalued settings and its application to nonlinear complementarity problems, KKT systems, variational inequalities, generalized equations and other model classes has become a broad field of research. We do not intend to outline these developments; let us mention only some early contributions [20, 28, 35, 37, 39, 47, 48, 50, 53] and some recent monographs [14, 17, 24, 25, 31, 60] which reflect the various aspects of this field.

The purpose of the present paper is to give a concise unified view of the convergence analysis of such methods by connecting a certain type of approximation with the desired kind of convergence and a concrete iteration scheme. Thereby, we focus on local convergence analysis, including also Kantorovich-type results, while global convergence is not considered. It will turn out in this context that equations and generalized equations can be handled in a very similar manner, but Example 8 also shows a considerable difference. Our studies essentially use approaches and results of the authors' monograph [31] and the second author's papers [37, 40].

Diethard Klatte, IBW, Universität Zürich, Moussonstrasse 15, CH-8044 Zürich, Switzerland. E-mail:
[email protected]
Bernd Kummer, Institut für Mathematik, Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany. E-mail:
[email protected]
In section 2, we present several tools that are needed throughout the paper. In particular, we introduce a general iteration scheme for solving inclusions and derive basic estimates for the later convergence analysis. In section 3 we start by considering Newton maps [31], which are motivated by the necessary and sufficient convergence result Thm. 3, and discuss some concrete realizations. For the simplest non-smooth class of PC^1 functions, however, we recall an observation which justifies applying the usual Newton method to a single fixed C^1 function. The consequences in section 3.2 concern stationary point systems and even methods for solving inclusions 0 ∈ F(x) by linear auxiliary systems. Section 4 shows how various types of generalized derivatives can be used for locally Lipschitz functions and that the Newton map condition now becomes the approximation condition (CA)*, introduced in [31], [39]. We add this part from [31] in order to discuss the necessity of (CA)*, in section 5, when continuous functions are taken into account. There, following [21], we also motivate the need for automatic differentiation when handling functions which are composed of differentiable and (mostly simple) non-smooth functions. The related approximations are then not necessarily defined by generalized derivatives (like point-based approximations). Under general settings, we apply in section 6 the successive approximation scheme of [31, 42], which has various applications. Originally used for deriving implicit function statements, we exploit it directly for Newton methods. In contrast to other papers, e.g. [1, 3, 11, 12, 14], our estimates hold for all constructed Newton sequences, not only if x_{k+1}, among all Newton iterates, is taken in an appropriate, not described manner, or is close enough to the solution. This also includes Kantorovich-type statements, based on metric (pseudo-) regularity in Prop. 5. In section 7, for solving 0 ∈ F(x), we discuss an approximation concept for gph F, which was introduced in [40] and allows applications in convex analysis, too. The often used assumption of metric regularity is here replaced by the simpler upper regularity. Prop. 6 extends the results to generalized equations with non-differentiable functions. Throughout, we shall discuss the necessity of our hypotheses and present typical examples.
2 Pre-Requisites

2.1 Basic definitions and consequences of Lipschitz behavior

In the whole paper, if nothing else is specified, X and Y will denote Banach spaces with elements x, y, respectively. In order to say that f : X → Y is locally Lipschitz we write f ∈ C^{0,1}(X, Y), while f ∈ C^{1,1}(X, Y) means that f has locally Lipschitzian first derivatives. As usual we shall use O- and o-type functions, which have the properties O(x) → 0 if ‖x‖ → 0 and o(x)/‖x‖ → 0 if 0 ≠ ‖x‖ → 0. We say that some property holds for x near x̄ if it holds for sufficiently small ‖x − x̄‖. A multifunction F : X ⇒ Y is closed if gph F := {(x, y) | y ∈ F(x)} is closed in X × Y. Generalized equations will often be written as 0 ∈ h(x) + M(x) if the traditional symbols are otherwise occupied.

Now we recall several concepts of local Lipschitz properties for a mapping F : X ⇒ Y and/or its inverse S = F^{-1} : Y ⇒ X at a point (x_0, y_0) ∈ gph F or (y_0, x_0) ∈ gph S, respectively. We restrict our studies to Banach spaces in order to get a uniform representation; generalizations to metric spaces are easily possible. The terminology for the subsequent Lipschitz properties (D1)-(D4) is rather different (and permanently changing and extending) in the literature. Therefore, we shall often recall our definitions, which follow the authors' book [31]. Given (y_0, x_0) ∈ gph S,

(D1) S is said to be pseudo-Lipschitz at (y_0, x_0) if there are neighborhoods U ∋ x_0, V ∋ y_0 and some L > 0 such that

∀(y, x) ∈ (V × U) ∩ gph S and y′ ∈ V : ∃x′ ∈ S(y′) : ‖x′ − x‖ ≤ L‖y′ − y‖.   (2.1)

This notion was introduced and investigated in [4]; it is also called Aubin property [54].
(D2) S is called locally upper Lipschitz (briefly locally u.L.) at (y_0, x_0) if there are neighborhoods U ∋ x_0, V ∋ y_0 and some L > 0 such that (2.1) holds with (y′, x′) = (y_0, x_0),

∀(y, x) ∈ (V × U) ∩ gph S : ‖x − x_0‖ ≤ L‖y − y_0‖.   (2.2)
If the same holds with U = X, we call S globally upper Lipschitz.

(D3) S is called Lipschitz lower semi-continuous (Lipschitz l.s.c.) at (y_0, x_0) if there is a neighborhood V ∋ y_0 and some L > 0 such that (2.1) holds with (y, x) = (y_0, x_0),

∀y′ ∈ V ∃x′ ∈ S(y′) : ‖x′ − x_0‖ ≤ L‖y′ − y_0‖.   (2.3)
(D4) S is called strongly Lipschitz at (y_0, x_0) if the neighborhoods in (D1) can be taken in such a way that, in addition, S(y′) ∩ U is single-valued for all y′ ∈ V. Then S(y′) ∩ U is locally (near y_0) a Lipschitz function.

Under (D2), S(y_0) ∩ U is equal to {x_0}, and empty sets S(y) are permitted for y near y_0. The constant L is called a rank (or modulus) of the related stability. If F stands for a function f ∈ C^1(IR^n, IR^n), all these properties coincide with det Df(x_0) ≠ 0.

Definition 1 (Regularity) If S is strongly Lipschitz at (y_0, x_0), then F = S^{-1} is called strongly regular at (x_0, y_0). Similarly, at the related points: If S is pseudo-Lipschitz then F is called pseudo-regular. If S is locally u.L. and S(y) ∩ U ≠ ∅ for all y ∈ V, then F is said to be upper regular. As before, a constant L for the related Lipschitz property of S is said to be a rank of regularity.

Hence each of these types of regularity of F implies that F^{-1} is Lipschitz l.s.c. at (y_0, x_0). Obviously, strong and pseudo-regularity are persistent under small variations of (x_0, y_0) in gph F. Often, pseudo-regular is called metrically regular; the slightly different definitions are equivalent.

Remark 1 (equivalent definitions) For deriving needed estimates, the neighborhoods U and V may be replaced, in the definitions, by open or closed balls around x_0, y_0, respectively. We shall use that they have the same radius after multiplying one norm with some factor.

Remark 2 (limits of regularity assumptions) Though pseudo- and strong regularity are broadly used hypotheses in the contexts of stability of solutions or convergence of Newton-type methods, one has to consider the limits of these assumptions: One finds nowhere verifiable criteria which allow us to check them even for continuous IR^n-functions (in contrast to f ∈ C^1(IR^n, IR^n), where det Df(x_0) ≠ 0 is necessary and sufficient). Also for the nice class of optimization problems min{f(x) | g(x) ≤ 0} with C^3 functions f, g on IR^n, which permit a deeply developed theory of critical points [26, 27], intrinsic conditions for strong (or pseudo-) regularity of the stationary primal solutions under canonical perturbations (and MFCQ in place of LICQ) do not exist (differently from upper regularity); for details cf. [32].

Remark 3 (regularity and methods) In the context of solution methods, pseudo-regularity of a crucial mapping is often the basic assumption, and the points x′ and x in (D1) are the iterates x_{k+1} and x_k, respectively. Accordingly, then some x_{k+1} fulfills the desired estimate. So convergence requires to choose the "right one", which is insufficient for an algorithm without an appropriate selection rule. Similarly, strong and upper regularity imply that x_{k+1} exists in some ball x̄ + rB (uniquely or not) where it satisfies the estimate in question. Since x̄ is unknown and r may be small, this is again insufficient if x_{k+1} ∉ x̄ + rB may also happen. However, the latter is a usual situation in the world of nonlinear methods.

In the spirit of the last remark, it is useful to know whether a pseudo-regular mapping F is even strongly regular at the same point. For X = Y = IR^n, this holds if

1. F(x) = f(x) + N_C(x), cf. [13], if f ∈ C^1 and N_C is the usual normal-map of a set C = {x | g(x) ≤ 0, h(x) = 0} where (g, h) ∈ C^2(IR^n, IR^m).   (2.4)

However, strong and pseudo-regularity do not coincide for any mapping which describes the KKT-optimality system in a nonlinear programming problem, see [31, example BE4] with no constraints and a C^{1,1} objective.
2. F is a so-called generalized Kojima-function with C^1 data, cf. [31, Cor. 7.22].
3. F is assigned to critical points of certain cone-constrained variational problems [34].
4. F = ∂f is the subdifferential of a convex function f : IR^n → IR, [31, Thm. 5.4].
5. More generally, for x in some (real) Hilbert space, (D1) and (D4) coincide for the mappings S(a) := X_glob(a) of global minimizers to min_x f(x) − ⟨a, x⟩ and for X_glob(a, b) to min{f(x) − ⟨a, x⟩ | g(x) ≤ b} with arbitrary functions f and g [31, Corollary 4.7].
6. Furthermore, if f ∈ C^{0,1}(IR^n, IR^n) is directionally differentiable at x_0 and pseudo-regular at (x_0, 0), then the zero x_0 is isolated [19, 18]. This implies that f is upper regular at (x_0, 0).
Let us also note that F(x) = Y ∀x ∈ X is pseudo-regular, but not upper regular, while f(x) = x + x^2 sin(1/x) for x ≠ 0 and f(0) = 0 (discussed in [30]) is upper, but not pseudo-regular at 0.

2.2 The iteration schemes

In order to solve an inclusion 0 ∈ Γ(x) for given Γ : X ⇒ Y, our most general iteration schemes are described, as already in [37], by a multifunction Σ : X × X ⇒ Y in such a way that some initial point x_0 must be given, x = x_k is the current iteration point and any solution x′ = x_{k+1} of 0 ∈ Σ(·, x) is the next one. In other words, the concrete choice of Σ characterizes the considered method, and

x′ ∈ S(x) := Σ(·, x)^{-1}(0) defines the next iterates.   (2.5)

Mostly, Σ(·, x) stands for some (multi-)function which approximates Γ near x. Let

r(·) = o(·) with r(0) = 0   or   r(·) = q‖·‖, 0 < q < 1.   (2.6)
Considering a solution x̄ of the inclusion, we shall ask for local convergence

‖x_{k+1} − x̄‖ ≤ r(x_k − x̄) as long as x_{k+1} ∈ S(x_k) exists, provided that ‖x_0 − x̄‖ is sufficiently small.   (2.7)
Equivalently, one may require S(x) ⊂ x̄ + r(x − x̄)B if ‖x − x̄‖ is sufficiently small, while the additional condition S(x) ≠ ∅ for x near x̄ ensures, by ‖x′ − x̄‖ < ‖x − x̄‖, the existence of x_{k+1} in each step. Therefore, Σ describes a well defined method that generates a convergent sequence x_k → x̄ with local convergence property (2.7) if and only if

∅ ≠ S(x) ⊂ x̄ + r(x − x̄)B for x near x̄.   (2.8)
This implies S(x̄) = {x̄}.

Remark 4 In terms of the definitions of section 2.1, (2.8) requires that the mapping S of next iterates is Lipschitz l.s.c. (D3) and globally upper Lipschitz at (x̄, x̄), with fixed modulus q < 1 or, in the stronger (r = o) case, with each modulus q < 1.

Example 1 For a generalized equation 0 ∈ Γ(x) := f(x) + M(x) with f ∈ C^1, the setting Σ(x′, x) = f(x) + Df(x)(x′ − x) + M(x′) requires to solve

0 ∈ f(x_k) + Df(x_k)(x_{k+1} − x_k) + M(x_{k+1}),   (2.9)

a standard scheme in the literature.
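For intuition, the following minimal sketch (our illustration, not taken from the paper) implements scheme (2.9) in one dimension for M = N_C, the normal cone of an interval C = [a, b]: if Df(x_k) > 0, the subproblem is solved by projecting the unconstrained Newton point onto C.

    import math

    def josephy_newton(f, Df, x0, a, b, tol=1e-12, kmax=50):
        # one step of (2.9): 0 in f(x_k) + Df(x_k)(x - x_k) + N_[a,b](x)
        # reduces, for Df(x_k) > 0, to projecting the Newton point onto [a, b]
        x = x0
        for _ in range(kmax):
            x_new = min(max(x - f(x) / Df(x), a), b)
            if abs(x_new - x) <= tol:
                break
            x = x_new
        return x

    # e.g. 0 in (e^x - 2) + N_[0, 0.5](x); the solution is the endpoint 0.5:
    print(josephy_newton(lambda x: math.exp(x) - 2.0, math.exp, 0.1, 0.0, 0.5))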
To illustrate Remark 3, consider f(x) = x ∈ IR, M(x) = {0} if |x| < 1, M(x) = IR otherwise.

Example 2 If Γ = f is a function, one can interpret Σ by some object like a generalized derivative. The difference Σ(x′, x) − f(x) is a multifunction depending on x′ and x (or on x′ − x and x). Calling it Gf(x)(x′ − x), we obtain Σ(x′, x) = f(x) + Gf(x)(x′ − x) with some "generalized derivative Gf(x) : X ⇒ Y of f at x", which describes the method by

0 ∈ f(x_k) + Gf(x_k)(x_{k+1} − x_k) = Σ(x_{k+1}, x_k).   (2.10)
The inverse of Gf(x) now defines our iterates and solution sets via S(x) = x + Gf(x)^{-1}(−f(x)), and (2.8) turns into

∅ ≠ Gf(x)^{-1}(−f(x)) ⊂ x̄ − x + r(x − x̄)B for x near x̄.

Based on iteration schemes like (2.9) or (2.10), we intend to connect the type of approximations with the desired (or possible) kinds of convergence and concrete iteration rules Σ.
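As a schematic illustration (ours, with all helper names hypothetical), the abstract scheme (2.5) can be driven by any user-supplied selection from S(x); monitoring the ratio ‖x_{k+1} − x̄‖/‖x_k − x̄‖ against a known test solution x̄ makes the convergence type (2.7) visible.

    def iterate_scheme(next_iterate, x0, x_bar, kmax=25):
        # next_iterate(x): any solution of 0 in Sigma(., x), i.e. a point of S(x)
        x = x0
        for k in range(kmax):
            x_new = next_iterate(x)
            ratio = abs(x_new - x_bar) / max(abs(x - x_bar), 1e-300)
            print(k, x_new, ratio)     # ratio -> 0 corresponds to r = o in (2.7)
            x = x_new
        return x

    # e.g. classical Newton for f(x) = x^3 - 2, where S(x) = {x - f(x)/Df(x)}:
    iterate_scheme(lambda x: x - (x**3 - 2.0) / (3.0 * x**2), 1.0, 2.0 ** (1.0 / 3.0))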
2.3 Some known generalized derivatives

Generalized derivatives may be a help for characterizing regularity or solution methods like for smooth functions, provided they are available. Given a multifunction F : X ⇒ Y, the contingent derivative CF(x_0, y_0)(u) of F at (x_0, y_0) ∈ gph F in direction u ∈ X (also called graphical derivative or Bouligand derivative) consists of all limits v = lim t_k^{-1}[y_k − y_0] where y_k ∈ F(x_0 + t_k u_k) for certain sequences t_k ↓ 0 and u_k → u. If F is a function then y_k = F(x_k) is unique, and one writes more simply CF(x_0)(u). For any function f : X → Y, the set of limits

T f(x)(u) = {v | ∃t_k ↓ 0, x_k → x such that v = lim t_k^{-1}[f(x_k + t_k u) − f(x_k)]}

is the Thibault derivative of f at x in direction u (notation from [31]); it is also called strict graphical derivative or paratingent derivative or limit set. For f ∈ C^{0,1}(IR^n, IR^m), the set D of all x ∈ IR^n such that the Fréchet-derivative Df(x) exists has full Lebesgue measure [49]. In consequence, the B-subdifferential of f at x, defined by

∂_B f(x) = {A | ∃x_k → x, x_k ∈ D with A = lim Df(x_k)},

is not empty. Its convex hull ∂^{CL} f(x) = conv ∂_B f(x) is Clarke's generalized Jacobian [9, 10] - a non-empty, compact set. Writing ∂^{CL} f(x)(u) = {Au | A ∈ ∂^{CL} f(x)} and similarly ∂_B f(x)(u), the (possibly proper) inclusions

∂_B f(x)(u) ⊂ T f(x)(u) ⊂ ∂^{CL} f(x)(u) and conv T f(x)(u) = ∂^{CL} f(x)(u)

hold true. Notice however that - as in all double limit constructions (hence also for so-called limiting normals or limiting coderivatives) - computing the sets T f(x)(u) and ∂^{CL} f(x) may be a hard job, even if f is piecewise linear.

The following statement presents interrelations between stability and generalized derivatives.

Theorem 1 [31, Thm. 5.1] (regularity and derivatives). Let F : X = IR^n ⇒ Y = IR^m be closed and z_0 = (x_0, y_0) ∈ gph F. Then:

F is upper regular at z_0 ⇔ F^{-1} is Lipschitz l.s.c. at (y_0, x_0) and 0 ∈ CF(z_0)(u) implies u = 0.   (2.11)
If F^{-1} is Lipschitz l.s.c. at (y_0, x_0) then there exists L > 0 such that

B ⊂ CF(z_0)(LB).   (2.12)
F is pseudo-regular at z_0 with rank L ⇔ ∃ε > 0 : B ⊂ CF(z)(LB) for all z ∈ gph F ∩ (z_0 + εB).   (2.13)
F = f ∈ C^{0,1}(IR^n, IR^n) is strongly regular at (x_0, f(x_0)) ⇔ 0 ∈ T f(x_0)(u) implies u = 0.   (2.14)
If X is a normed space, the conditions (2.11) are necessary for upper regularity. If X and Y are B-spaces, condition (2.13) is sufficient for pseudo-regularity, cf. [4, Thm. 4, section 7.5]. Statement (2.14) is the inverse function theorem of [38], where also [15, Thm. 1F.2] and chain rules for T f can be found, while Clarke's inverse function theorem [9] says that f ∈ C^{0,1}(IR^n, IR^n) is strongly regular at (x_0, f(x_0)) if all matrices A ∈ ∂^{CL} f(x_0) are non-singular. Since T f(·)(·) is both closed and positively homogeneous in u, (2.14) also means

∃c > 0, δ > 0 such that v ∈ T f(x)(u) ⇒ ‖v‖ ≥ c‖u‖ if x ∈ x_0 + δB.   (2.15)
For the analysis of perturbed generalized equations 0 ∈ F̂ := g + F, where the perturbing function g : X → Y is Lipschitz on some set Ω ⊂ X, the quantities

sup(g, Ω) := sup{‖g(x)‖_Y | x ∈ Ω} and Lip(g, Ω) := inf{L > 0 | ‖g(x) − g(x′)‖_Y ≤ L‖x − x′‖ ∀x, x′ ∈ Ω}

are important since Lip(g, Ω) plays the role of sup(‖Dg‖, Ω) for g ∈ C^1. Provided that x ∈ int Ω, it follows directly from the definitions that, with β = Lip(g, Ω),

CF(x, y)(u) ⊂ CF̂(x, y + g(x))(u) + β‖u‖B ⊂ CF(x, y)(u) + 2β‖u‖B

and, if F = f is a function,

T F(x)(u) ⊂ T F̂(x)(u) + β‖u‖B ⊂ T F(x)(u) + 2β‖u‖B.   (2.16)
So one can use (2.13) and (2.15) along with the estimate (2.16) for analyzing pseudo-regularity of F̂ = g + F and strong regularity of f̂ = f + g, respectively, with g ∈ C^{0,1}.
Theorem 2 (i) If F : IR^n ⇒ IR^m is pseudo-regular at (x_0, y_0) with rank L, then F̂ is pseudo-regular at (x_0, y_0 + g(x_0)) with rank λ^{-1} whenever λ = L^{-1} − Lip(g, x_0 + rB) > 0 for some r > 0. (ii) If f ∈ C^{0,1}(IR^n, IR^n) is strongly regular at (x_0, f(x_0)), then so is f̂ at (x_0, f̂(x_0)) if c − Lip(g, x_0 + rB) > 0 for some r > 0 with c from (2.15).

Proof (i) To apply condition (2.13), let v ∈ Y and β = Lip(g, x_0 + rB). By assumption, some u ∈ L‖v‖B fulfills v ∈ CF(x, y)(u). By (2.16), we may write v = v′ − w′ with some v′ ∈ CF̂(x, y + g(x))(u) and w′ ∈ β‖u‖B. So we can estimate:

‖v′‖ = ‖v + w′‖ ≥ ‖v‖ − ‖w′‖ ≥ ‖v‖ − β‖u‖ ≥ L^{-1}‖u‖ − β‖u‖ = λ‖u‖.

Hence the assertion follows from (2.13). (ii) In the situation (2.15), the assertion follows analogously from (2.16).

For B-spaces, these statements remain true (e.g., due to Thm. 9), but the above proofs fail since neither the condition in terms of CF nor Mordukhovich's coderivative condition [0 ∉ D*F(z_0)(v) if v ≠ 0] of [45] are necessary for pseudo-regularity if X = l^2 and Y = IR. We refer to example [31, BE.2]. Hence one cannot use CF (and similarly T f) for proving Thm. 2 in infinite dimension.

2.4 Particular nonsmooth functions

Particularly composed functions and point based approximations

We start with composed functions of the type smooth ∘ Lipschitz; the spaces of the functions are added. Let

f(x) = h(γ(x)) where h ∈ C^1(Z, Y), γ ∈ C^{0,1}(X, Z),
and Σ(x′, x) = f(x) + Dh(γ(x))(γ(x′) − γ(x)) be its partial linearization at x.   (2.17)
Such functions appear in [53] and have useful properties. If we restrict all arguments to some region where γ has Lipschitz rank L_γ, the function

g_x(x′) = f(x′) − Σ(x′, x)   (2.18)
satisfies, as known from the usual case of f = h ∈ C^1 and γ = id,

‖g_x(x′)‖ = ‖f(x′) − f(x) − Dh(γ(x))(γ(x′) − γ(x))‖
  = ‖∫_0^1 [Dh(γ(x + t(x′ − x))) − Dh(γ(x))] (γ(x′) − γ(x)) dt‖
  ≤ sup_{t∈(0,1)} ‖Dh(γ(x + t(x′ − x))) − Dh(γ(x))‖ L_γ ‖x′ − x‖
  = O(‖x′ − x̄‖ + ‖x − x̄‖) ‖x′ − x‖;   (2.19)

‖g_x(x′) − g_x(x″)‖ = ‖f(x′) − f(x″) − Dh(γ(x))(γ(x′) − γ(x″))‖
  ≤ sup_{t∈(0,1)} ‖Dh(γ(x″ + t(x′ − x″))) − Dh(γ(x))‖ L_γ ‖x′ − x″‖
  = O(‖x′ − x″‖ + ‖x − x̄‖) ‖x′ − x″‖.   (2.20)
In the context of Robinson's [53] point-based approximation (PBA), f and Σ are arbitrary continuous functions on an open set Ω ⊂ X and Ω × Ω, respectively. Σ is called a PBA for f on Ω if there is a constant K such that (for all x, x̄, x′, x″ ∈ Ω)

(a) ‖f(x′) − Σ(x′, x)‖ ≤ (1/2) K ‖x′ − x‖^2,
(b) ‖[Σ(x″, x) − Σ(x″, x̄)] − [Σ(x′, x) − Σ(x′, x̄)]‖ ≤ K ‖x − x̄‖ ‖x″ − x′‖.   (2.21)
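To make (2.21)(a) concrete, here is a small numerical check (our sketch, not from [53]) for the composition (2.17) with h = sin ∈ C^{1,1} and γ = |·|: then Σ(x′, x) = sin|x| + cos|x| (|x′| − |x|), and K = 1 works since |sin b − sin a − cos(a)(b − a)| ≤ (b − a)^2/2 and ||x′| − |x|| ≤ |x′ − x|.

    import math, random

    f = lambda x: math.sin(abs(x))                                  # f = h(gamma(x))
    Sigma = lambda xp, x: f(x) + math.cos(abs(x)) * (abs(xp) - abs(x))

    random.seed(0)
    for _ in range(10000):
        x, xp = random.uniform(-2, 2), random.uniform(-2, 2)
        # PBA bound (2.21)(a) with K = 1:
        assert abs(f(xp) - Sigma(xp, x)) <= 0.5 * (xp - x) ** 2 + 1e-12
    print('bound (2.21)(a) holds on all samples')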
It was a basic observation in [53] that the conditions (2.21) can be (locally) satisfied for f in (2.17) with h ∈ C^{1,1}. Replacing f(x′) by Σ(x′, x̄) in (2.18), then, for each x̄ ∈ Ω, the difference

g_x(x′) = Σ(x′, x) − Σ(x′, x̄)   (2.22)
describes Σ(·, x) as a perturbation of Σ(·, x̄) by a continuous function g_x which satisfies

‖g_x(x′)‖ ≤ ‖Σ(x′, x) − f(x′)‖ + ‖f(x′) − Σ(x′, x̄)‖ ≤ (1/2) K (‖x′ − x‖^2 + ‖x′ − x̄‖^2);   (2.23)
‖g_x(x′) − g_x(x″)‖ = ‖(Σ(x′, x) − Σ(x′, x̄)) − (Σ(x″, x) − Σ(x″, x̄))‖ ≤ K ‖x − x̄‖ ‖x″ − x′‖.   (2.24)
Next restrict all arguments to a ball Ω = Ω_r = x̄ + rB. Then the estimates (2.19), (2.20), (2.23), (2.24) ensure estimates for sup(g, Ω) and Lip(g, Ω), namely

‖g_x(x̄)‖ ≤ o(r), ‖g_x(x′) − g_x(x″)‖ ≤ O(r) ‖x″ − x′‖ and
‖g_x(x′)‖ ≤ ‖g_x(x̄)‖ + ‖g_x(x′) − g_x(x̄)‖ ≤ o(r) + O(r) r ≤ o(r).   (2.25)
Hence, though generally nonsmooth, the "linearizations" Σ obey properties which are known to be important for Newton's method in the smooth case. In particular, Thm. 2 can be applied (for finite dimensions) to the functions Σ(·, x) and perturbations g_x near x̄ if r is small enough, in order to verify persistence of pseudo- [strong] regularity under small perturbations by g_x and for estimating the related Lipschitz ranks via (2.16) as well.

PC^1 functions and pseudo-smooth functions

A function f ∈ C^{0,1}(IR^n, IR^m) is called piecewise smooth if there are functions f^s ∈ C^1(IR^n, IR^m) (s = 1, ..., N) such that the sets I(x) = {s | f(x) = f^s(x)} are non-empty (∀x ∈ IR^n); briefly f = PC^1(f^1, ..., f^N), f ∈ PC^1. For basic properties of these functions we refer to [55]. We only mention the possible descriptions of ∂^{CL} f as conv{Df^s(x) | x ∈ cl int I^{-1}(s)} and conv{Df^s(x) | x ∈ I^{-1}(s) and ∃x′ → x : Df(x′) = Df^s(x′)}, cf. [55, Prop. A.4.1] and [37, Prop. 4], respectively. The needed functions f^s with x ∈ cl int I^{-1}(s) are called essentially active at x.

A function f ∈ C^{0,1}(IR^n, IR^m) which is C^1 on an open and dense subset Ω ⊂ IR^n is called pseudo-smooth. We denote this class of functions by C^1_Ω. Such f appear in many applications; they cover the class PC^1 and many so-called NCP-functions [57]. They have locally bounded derivatives on Ω and nonempty sets

D°f(x) = {A | ∃x_k → x, x_k ∈ Ω such that A = lim Df(x_k)},

which could be called the small B-subdifferential. For PC^1 functions, D°f(x) = ∂_B f(x) is valid. Several further relations between these sets become evident by an example which was made for checking Newton's method when f ∈ C^{0,1}.
Example 3 In [37, Sect. 2.3], a real function f ∈ C^1_Ω \ PC^1 was constructed and analyzed in detail, which is globally Lipschitz, directionally differentiable, strongly regular and satisfies

f(0) = 0, Df(0) = 1, 0 ∉ Ω, D°f(0) = {1/2, 2} and ∂_B f(0) = {1/2, 1, 2} ≠ ∂^{CL} f(0) = [1/2, 2].

It was shown that if one starts at any x_0 ≠ 0 where Df(x_0) exists, then the standard (smooth) Newton method generates an alternating sequence x_0, x_1, x_2 = −x_1, x_3 = x_1, ... with x_k ∈ Ω (∀k). Note that this function has also been discussed (in detail and with pictures) both in Example BE.1 of [31] and Example 7.4.1 of [17].

In the next two sections, we shall consider derivative based generalized Newton methods.

3 Linear Auxiliary Problems for Equations and Inclusions

In this section, we recall the concept of Newton maps [31, 37] for equations, which allows necessary and sufficient conditions for local superlinear convergence and covers the popular approach via semismooth functions. Then we extend this to inclusions.

3.1 Newton maps and Newton functions for equations

Newton's method for computing a zero x̄ of a C^1 function f : X → Y (Banach spaces) is determined by the iterations f(x_k) + A(x_{k+1} − x_k) = 0, where A = Df(x_k) ∈ Lin(X, Y) and x_0 is given. The local superlinear convergence of this method means that, with some o-type function r, it holds

‖x_{k+1} − x̄‖ ≤ r(x_k − x̄) for x_0 near x̄.   (3.1)
Let us now try to construct a procedure which recurrently solves linear problems in order to find a zero for nondifferentiable f with the same local behavior. Then we have to think about the
choice of A in each step. So let us repeat the introduction of a Newton map in [31]. Notice that f : X → Y may be even arbitrary. In the framework of section 2.2 we shall obtain Σ(x′, x) = f(x) + N(x)(x′ − x). Let N be any multifunction which assigns, to each x ∈ X, a non-empty set

N(x) ⊂ Lin(X, Y),   (3.2)
and let x̄ ∈ X. We interpret N(x) as the permitted Newton operators for the iterations

f(x_k) + A(x_{k+1} − x_k) = 0 with some A ∈ N(x_k), x_0 given.   (3.3)
Definition 2 (Newton map) We call N a Newton map (briefly N-map) for f at x̄ if

A(x − x̄) ∈ f(x) − f(x̄) + o(x − x̄)B ∀A ∈ N(x).   (3.4)
Definition 3 (Newton-regularity) We say that N is Newton-regular (briefly N-regular) at x̄ if there are constants K^+, K^- such that

A^{-1} exists and ‖A‖ ≤ K^+ and ‖A^{-1}‖ ≤ K^- hold for all A ∈ N(x) and sufficiently small ‖x − x̄‖.   (3.5)
If only the existence of a related K^- is required, we speak about weak N-regularity. Our notations are motivated by their relations to Newton's method. If both conditions (3.4) and (3.5) are satisfied, we also say that N is a regular N-map at x̄. Similarly, if (3.4) and weak N-regularity hold true, we call N a weakly regular N-map. The elements x_{k+1} in (3.3) depend on the selected elements A. So we require that the convergence (3.1) should hold independently of the choice of A ∈ N(x_k).

Theorem 3 (see [31, Lemma 10.1]). Suppose that f(x̄) = 0 and the mapping N in (3.2) is N-regular at x̄, with K^-, K^+ according to (3.5). Then, the method (3.3) fulfills the condition (3.1) if and only if N is an N-map of f at x̄. In this case, it holds with o from (3.4),

‖x_{k+1} − x̄‖ ≤ K^- o(x_k − x̄) and (1/2)(K^-)^{-1} ‖x − x̄‖ ≤ ‖f(x) − f(x̄)‖ ≤ 2K^+ ‖x − x̄‖ for x near x̄.   (3.6)
The latter means that f^{-1} is necessarily locally u.L. at (0, x̄) in the sense of (D2), the zero is isolated and f is "pointwise" Lipschitz.

Remark 5 The existence of K^+ in (3.5) is only needed for verifying the "only if" direction. Hence, the convergence (3.1) is already ensured if N is a weakly regular N-map.

Property (3.4), called Newton differentiability in [24] and (after requiring this for all x̄ near the zero) slantly differentiable in [22], is some generalization of differentiability for nonsmooth functions. To investigate Newton's method, mappings N satisfying (3.4) and particular realizations have been considered, perhaps first, in [37, Prop. 3]. Particular N-maps and the related Newton-regularity are discussed in [31, section 10], too. The same is true for the following general properties and more or less known interrelations.

Remark 6 The union of two N-maps or the convex hull of an N-map are again N-maps (for f at x̄). The same holds for all nonempty-valued submappings N̂ ⊂ N of an N-map, e.g., for N̂ = ∂_B f if N = ∂^{CL} f is an N-map for f ∈ C^{0,1}(IR^n, IR^n). Moreover, the definition of N(x) at x = x̄ plays no role for N being an N-map at x̄. Hence we may assume that N(x̄) = {E}, after which only x ≠ x̄ must be considered in (3.5), too.

A single-valued selection Rf of an N-map N for f at x̄ is called a Newton function for f at x̄.

Theorem 4 [31, Thm. 6.12] (existence and chain rule for Newton functions) (i) Every locally Lipschitz function f : X → Y possesses, at each x̄, a Newton function Rf being locally bounded by a Lipschitz constant L for f near x̄.
(ii) Let h : X → Y and g : Y → Z be locally Lipschitz with Newton functions Rh at x̄ and Rg at h(x̄), respectively. Then the canonically composed function Rf(x) = Rg(h(x)) Rh(x) defines a Newton function of f(·) = g(h(·)) at x̄.

The function Rf under (i) does not use local behavior of f near x and depends additionally on x̄. So (i) does not help for solution methods unless one knows Rf without using x̄. By (ii), Newton functions (hence also N-maps) satisfy a common chain rule.

Remark 7 (Perturbations and inexact Newton methods) If N is an N-map of f ∈ C^{0,1}(X, Y) at a zero x̄, so is Ñ(x) = N(x) + ‖f(x)‖ B_Lin. Hence, approximating N(x) with accuracy ‖f(x)‖ means passing from one N-map to another. So the concept of Newton maps automatically includes also so-called inexact Newton methods. Further perturbations of N are contained in [31, chapter 10]. For readers interested in approximations by Broyden-updates [5] when f is a nonsmooth Hilbert space function, we refer to [20] and (in finite dimension) to [35], two of the first papers concerning nonsmooth Newton methods at all. For generalized equations, recent Broyden-type results can be found in [3] and [8].

Specific Newton maps for f ∈ C^{0,1}(IR^n, IR^m)

For such functions, every N-map N of f at x̄ fulfills (with possibly new o-type functions)

N(x̄ + u)u ⊂ Cf(x̄)(u) + o(u)B ⊂ T f(x̄)(u) + o(u)B ⊂ ∂^{CL} f(x̄)(u) + o(u)B.

Further, with our settings, f is semismooth if and only if x ↦ ∂^{CL} f(x) is a Newton map. This notion, based on Mifflin [44], has been used for Newton's method by [47, 48] (in a slightly different, but equivalent form) and in many subsequent papers. Detailed presentations of semismooth Newton methods can be found, e.g., in [17, 24, 25] with interesting extensions in [60]. Notice, however, that Example 3 shows: The non-singularity of ∂^{CL} f does not imply semismoothness of a Lipschitzian transformation f : IR^n ↔ IR^n, as asserted in [23] (last page). There are even real Lipschitz functions of this type which are nowhere semismooth, cf. [31, Expl. BE.0].

For pseudo-smooth functions f ∈ C^1_Ω, the single-valued selections of the mapping D°f are natural candidates for being Newton functions, since D°f(·) ⊂ N(·) holds necessarily for all closed mappings N satisfying N(x) = {Df(x)} ∀x ∈ Ω, hence also for all closed N-maps N which assign (as usual), to x ∈ Ω, the Jacobian. In [31] we have presented a subclass of C^1_Ω, called locally PC^1 functions, which covers the class PC^1 and includes the Euclidean norm. If f is locally PC^1, then N = D°f is automatically an N-map and Cf(x̄)(u) ⊂ D°f(x̄)u. Because of space restrictions, we skip the details and refer to §6.4 in [31] instead. For piecewise smooth f ∈ PC^1 and f(x̄) = 0, N-maps are, e.g.,

N_1(x) = {Df^s(x) | s ∈ I(x)} where I(x) = {s | f^s(x) = f(x)},
N_2(x) = {Df^s(x) | s ∈ J(x)} where J(x) = {s | ‖f^s(x) − f(x)‖ ≤ ‖f(x)‖^2}, or
N_3(x) = D°f(x) = ∂_B f(x), N_4(x) = ∂^{CL} f(x).   (3.7)
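As a hedged illustration (a toy example of ours, not from [31]), the following sketch runs method (3.3) with the B-subdifferential choice N_3 from (3.7) for the PC^1 function f(x) = x^3 + |x|, whose pieces f^1 = x^3 + x (x ≥ 0) and f^2 = x^3 − x (x < 0) vanish at the zero x̄ = 0, where ∂_B f(0) = {1, −1} contains only nonsingular elements.

    def f(x):
        return x**3 + abs(x)

    def bdiff_element(x):
        # one element A of N_3(x) = bdiff f(x); either of the two limits works at x = 0
        return 3.0 * x**2 + (1.0 if x >= 0 else -1.0)

    x = 0.5
    for k in range(8):
        x = x - f(x) / bdiff_element(x)      # Newton step (3.3) with A in N_3(x_k)
        print(k, x)                          # superlinear decay to the zero 0

The printed errors decay superlinearly here, in line with Thm. 3, since N_3 is an N-map of this f at 0.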
Clearly, the N-regularity condition (3.5) becomes weaker as the sets N(x) become smaller. However, under standard regularity assumptions the situation becomes very easy.

3.2 PC^1 equations and Newton maps for inclusions

Solving PC^1 equations or KKT-systems with the usual Newton method

The well-known theory of generalized Newton methods for PC^1 functions f : IR^n → IR^n mostly uses the hypothesis of non-singularity for all Df^s(x̄) with s ∈ I(x̄). This is a direct and canonical generalization of the usual C^1 case and is even a necessary condition in order to obtain a "regular B-derivative" ∂_B f(x̄) at x̄ if all s ∈ I(x̄) are essential. Almost all papers, however, fail to take into consideration that this hypothesis is strong enough for avoiding non-smooth Newton methods at all, as noted in [31, section 10.2]. This can be seen as follows.
The hypothesis ensures that x̄ is a (strongly) regular zero for each function f^s, s ∈ I(x̄). Hence Newton's method converges as usual to x̄ for the C^1 function f^s and initial points x_0 ∈ x̄ + ε_s B with some ε_s > 0. For small δ > 0 and x_0 ∈ x̄ + δB, also ∅ ≠ I(x_0) ⊂ I(x̄) is obviously true by continuity of f. Therefore, it holds:

Proposition 1 If all derivatives Df^s(x̄), s ∈ I(x̄), are regular and ‖x_0 − x̄‖ < min{δ, min_s ε_s}, then it suffices to choose any s_0 ∈ I(x_0) and to apply the usual Newton method to the C^1 function f^{s_0} by keeping s_0 fixed, even if f^{s_0}(x_k) ≠ f(x_k) holds at some iteration point x_k.

So, local superlinear convergence is obvious. This is not only simpler than the usually proposed active index set strategy. Mainly, it also allows to apply all modifications of Newton's method to f^{s_0} (and to extend these modifications to active index set strategies in an evident manner). Notice also that the proposition allows to replace the function f^{s_0}, at any step k, by another function f^s which is active at x_k, since ‖x_k − x̄‖ < min{δ, ε_s} remains true. These facts, however, do not imply that Rf(x) = Df^s(x) is a Newton function for f at x̄ if s ∈ I(x̄); consider, e.g., f(x) = |x| which is PC^1 with f^1 = x, f^2 = −x and f^3 = 7x. Here, f^3 is not essential and cannot appear as f^{s_0}, but f^3 could be used to compute the zero, too. Obviously, the proposition needs the generating functions f^s of f explicitly; knowing alone that f ∈ PC^1 does not help. But this applies also to active index set methods, and they need the condition ‖x_0 − x̄‖ < min{δ, min_s ε_s}, too.

In particular, KKT systems of optimization problems min{f(x) | x ∈ IR^n, g_i(x) ≤ 0, i = 1, ..., m} with f, g_i ∈ C^2(IR^n, IR) can be handled in this way. The KKT-points are (up to a simple transformation) the zeros of the (Kojima-) function

F(x, y) = ( Df(x) + Σ_i y_i^+ Dg_i(x), g_1(x) − y_1^-, ..., g_m(x) − y_m^- ) ∈ IR^{n+m},

where y_i^+ = max{0, y_i} and y_i^- = min{0, y_i}. F is a PC^1 function since so are y^+ and y^-. Following Prop. 1 one obtains as a corollary:

Proposition 2 Let F be strongly regular at a zero (x̄, ȳ) and ε > 0 be sufficiently small. Then, for ‖(x_0, y_0) − (x̄, ȳ)‖ < ε, replace above all (y_i^+, y_i^-) by (y_i, 0) if y_{0i} > 0 and by (0, y_i) otherwise. Then, Newton's method applied to the related C^1 function, say F^s, converges as usual to (x̄, ȳ).

The hypothesis implies regularity of the derivatives DF^s(x̄, ȳ) and DF^s(x_0, y_0) for (x_0, y_0) near (x̄, ȳ). The same holds with additional equality constraints of type C^2 and for generalized Kojima-functions where (in particular) any C^1 function Φ = Φ(x) of the same dimension may replace Df(x). In this way, variational conditions, games or complementarity problems can be written as equations. For proofs and details, cf. [31, chapter 7]. Other "derivatives" of F can be found in [33].

Selections, projections, PC^1-functions and Newton maps for general inclusions

Now we intend to use Thm. 3 for dealing with inclusions 0 ∈ F(x) where F : X ⇒ Y is closed. We want to solve them again via certain linear auxiliary problems which modify (3.3), namely

f_k + A_k(x_{k+1} − x_k) = 0, A_k ∈ N(x_k), (k ≥ 0); having x_{k+1}, select some f_{k+1} ∈ F(x_{k+1}),
where x_0 ∈ X and f_0 ∈ F(x_0) are given and ∅ ≠ N(x) ⊂ Lin(X, Y) is N-regular.   (3.8)
Once more, we ask for superlinear convergence (3.1) to a zero x̄ of F. Accordingly, all N-maps below are N-maps at x̄. In addition, we suppose

(S) All f_{k+1} are uniquely defined by x_{k+1}, (k ≥ 0).

Then there is a selection function f(·) ∈ F(·), continuous or not, such that f_k = f(x_k) holds for all steps, and method (3.8) coincides with (3.3), i.e., f(x_k) + A_k(x_{k+1} − x_k) = 0, A_k ∈ N(x_k). Hence we obtain another trivial but useful

Proposition 3 Supposing (S), the method (3.8) satisfies the convergence condition (3.1) if and only if so does (3.3) for some selection function f ∈ F, defined near x̄.
So we may apply Thm. 3, which asks for a regular N-map of f, requires small ‖x_0 − x̄‖ and the conditions (3.4) and (3.5). In order to find a suitable selection f along with a Newton map, the mapping F should be sufficiently simple. In particular, F(x) ≠ ∅ for x near x̄ must be supposed. Then the projections f(x) of the origin onto F(x) are interesting candidates whenever they exist, i.e., one could select f(x) ∈ argmin_{y∈F(x)} ‖y‖. It is even hard to suggest a better one if F : IR^n ⇒ IR^p is polyhedral, i.e., if gph F is the union of a finite number of convex polyhedra. If

lim sup_{x→x̄} dist(0, F(x)) = 0 (i.e., F is l.s.c. at (x̄, 0)),   (3.9)
then f is piecewise linear and continuous near x̄ since f(x̄) = 0 ∈ F(x̄). Applied to a generalized equation 0 ∈ F(x) = h(x) + M(x), the selection becomes f = h + m with m ∈ M. Let Nh be any N-map for h. Then the existence of an N-map Nf implies the existence of an N-map for m (and vice versa), namely Nm = Nf − Nh, defined by the sets {A_1 − A_2 | A_1 ∈ Nf(x), A_2 ∈ Nh(x)}. Hence we should again look for a selection m ∈ M with a simple N-map. Recalling the projections onto F(x), our candidates are the elements m(x) ∈ argmin_{y∈M(x)} ‖y − h(x)‖. If h ∈ PC^1 and M : IR^n ⇒ IR^p is polyhedral then, again under (3.9) and with the Euclidean norm, m and h + m are (semismooth) PC^1-functions near x̄. Thus various generalized Newton methods can be applied, even Prop. 1 if p = n. The remaining question of weak N-regularity for Nh + Nm turns into a regularity condition for h + m at x̄.

4 Nonlinear Auxiliary Problems for C^{0,1} Functions

We include this section, which needs only normed spaces X and Y and follows [31, section 10.3] and [39], in order to point out the meaning of the subsequent condition (CA)*, a nonlinear form of a Newton map condition for C^{0,1} functions. The condition is crucial for the subsequent approximation method and appears (with a quadratic o-type function and for continuous functions) in the definition of a PBA (2.21)(a) after setting there x = x̄. So the question arises whether (CA)* is generally necessary for superlinear convergence of Newton-type methods with continuous f. An answer will be given in section 5 by the examples 6 and 7.

Here, we suppose that f ∈ C^{0,1}(X, Y), x̄ ∈ X, f(x̄) = 0 and that Gf : (X, X) ⇒ Y is any multifunction satisfying the general supposition ∅ ≠ Gf(x, u) and Gf(x, 0) = {0}. We want to solve the auxiliary problems, being linear or non-linear,

0 ∈ f(x_k) + Gf(x_k, u), to put x_{k+1} := x_k + u,   (4.1)
up to some error. The existence of exact solutions x_{k+1} for (4.1) is not required. Our notation Gf(x, u) in place of Gf(x)(u) is motivated by the fact that we shall obtain the strongest statements when Gf(x, u) is some kind of a (multivalued) generalized derivative of f at x in direction u, though Thm. 5 will also hold for other mappings.

Convergence for approximate and exact solutions

Given any α ≥ 0 we investigate the algorithm ALG(α): Having x_k, find some u such that

∅ ≠ α‖f(x_k)‖B ∩ [f(x_k) + Gf(x_k, u)] and put x_{k+1} := x_k + u.   (4.2)
The parameter α prescribes the accuracy of our algorithm when solving (4.1). If (4.1) holds true, we call u and x_{k+1} = x_k + u exact solutions. Though (4.2) can also be written as an "exact inclusion", 0 ∈ f(x_k) + Ĝ_α f(x_k, u) where Ĝ_α f(x, u) = Gf(x, u) + α‖f(x)‖B, we continue with model (4.2).

Definition 4 (feasibility) We call the triple (f, Gf, x̄) feasible if, for each q ∈ (0, 1), there are positive r and α such that, whenever ‖x_0 − x̄‖ ≤ r, the process (4.2) has solutions and necessarily generates iterates satisfying ‖x_{k+1} − x̄‖ ≤ q‖x_k − x̄‖.
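Under the standard setting (4.5) below, Gf(x, u) = {Au | A ∈ N(x)}, the acceptance test in (4.2) reads ‖f(x_k) + Au‖ ≤ α‖f(x_k)‖. The following sketch is ours (the inexact step is produced artificially by truncating the exact Newton correction) and shows one way to realize ALG(α) in IR^n.

    import numpy as np

    def alg_alpha(f, A_of, x0, alpha=0.1, kmax=25, tol=1e-12):
        # A_of(x): one matrix A in N(x); u is accepted if it passes the test (4.2)
        x = np.asarray(x0, dtype=float)
        for _ in range(kmax):
            fx = f(x)
            if np.linalg.norm(fx) <= tol:
                break
            u = np.linalg.solve(A_of(x), -fx)        # exact solution of (4.1)
            u *= 1.0 - 0.5 * alpha                   # an inexact but feasible step:
            # ||f(x_k) + A u|| = (alpha/2) ||f(x_k)|| <= alpha ||f(x_k)||
            x = x + u
        return x

    # e.g. f(x) = (x_1^2 - 1, x_2 + |x_1| - 1) with a Jacobian-like selection:
    f = lambda x: np.array([x[0]**2 - 1.0, x[1] + abs(x[0]) - 1.0])
    A_of = lambda x: np.array([[2.0 * x[0], 0.0], [1.0 if x[0] >= 0 else -1.0, 1.0]])
    print(alg_alpha(f, A_of, [2.0, 2.0]))            # approaches (1, 0)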
Notice that no condition like x_{k+1} ∈ Ω is required. To ensure feasibility of (f, Gf, x̄), we will impose the following conditions for x near x̄, which now replace (3.5) and (3.4) for Newton maps:

(CI) ‖v‖ ≥ c‖u‖ ∀v ∈ Gf(x, u) ∀u ∈ X (c > 0 fixed),
(CA) f(x) − f(x̄) + Gf(x, u) ⊂ Gf(x, x − x̄ + u) + o(x − x̄)B ∀u ∈ X.   (4.3)

With u = x̄ − x in (CA) and Gf(x, 0) = {0}, we obtain a weaker condition, namely

(CA)* f(x) − f(x̄) + Gf(x, x̄ − x) ⊂ o(x − x̄)B (again for x near x̄).   (4.4)
The condition (CI) means some injectivity of Gf or, in other words, that Gf(x, ·)^{-1} is globally u.L. at (0, 0) (uniformly for x near x̄). Condition (CA) requires some type of approximation. In particular, the settings of section 3 are still possible (called standard settings),

Gf(x, u) := N(x)u := {Au | A ∈ N(x)} where N(x) ⊂ Lin(X, Y).   (4.5)
In this case, (CA)* is just the N-map condition (3.4) and coincides with (CA) since, for linear functions A satisfying (CA)*, we have f(x) − f(x̄) + A(x̄ − x) ⊂ o(x − x̄)B, which yields (CA) by adding Au to both sides. For X = Y = IR^n, weak N-regularity and (CI) obviously coincide, and we are in the framework of section 3.1. The "Inexact Nonsmooth Newton Method" 7.2.6 in [17] is algorithm ALG(α), specified to f : IR^n → IR^n and Gf from (4.5). In what follows, strong regularity or surjectivity of f are not explicitly required for the analysis of the procedure (4.2); avoiding such hypotheses is realistic for equations arising from control problems. Basic ideas on this topic (where, however, the correct choice of related function spaces often plays the main role) can be found, e.g., in [2], [22] and [58-60]. For the proofs of the next three theorems we refer to [31, section 10.3]. The constant L stands for Lip(f, x̄ + δB).

Theorem 5 (Convergence of ALG(α)). (i) The triple (f, Gf, x̄) is feasible if there exist c > 0, δ > 0 and a function o(·) such that, for all x ∈ x̄ + δB, the conditions (CI) and (CA) are satisfied. Having (CI), (CA) and c, δ, let q ∈ (0, 1), α ∈ (0, (1/2) c q L^{-1}] and let r ∈ (0, δ] be small enough such that o(x − x̄) ≤ (1/2) c ‖x − x̄‖ ∀x ∈ x̄ + rB. Then, the convergence can be quantified as follows: (ii) If also o(x − x̄) ≤ (1/2) α c ‖x − x̄‖ ∀x ∈ x̄ + rB holds true, then q, α and r fulfill the requirements in the definition of feasibility. (iii) If all x_{k+1} are exact solutions of (4.2) with r from (ii), then they fulfill c‖x_{k+1} − x̄‖ ≤ o(x_k − x̄) with o(·) from (CA) if ‖x_0 − x̄‖ ≤ r.

Necessity of the required conditions

Under several particular settings, the technical condition (CA) may be replaced by (CA)*. However, in the current section, we shall exploit that Gf(x, ·) is positively homogeneous.

Theorem 6 (Condition (CA)). Suppose X = Y = IR^n, and let Gf denote any of the following generalized directional derivatives:
1. Gf(x, u) = N(x)u as for the standard setting (4.5),
2. Gf(x, u) = T f(x)(u),
3. Gf(x, u) = Cf(x)(u),
4. Gf(x, u) = ∂^{CL} f(x)u (Clarke's Jacobian applied to u),
5. Gf(x, u) = f′(x; u) (usual directional derivatives, provided that they exist),
6. Gf(x, u) = {Au | A = Df^s(x) and f^s(x) = f(x)} if f ∈ PC^1(f^1, ..., f^m).
Then, the conditions (CA) and (CA)* are equivalent.

Generally, the equivalence (CA) ⇔ (CA)* may fail.

Example 4 If x̄ = f(x̄) = 0, the setting Gf(x, u) = f(x + u) − f(x) leads in (CA) to f(x) + f(x + u) ∈ f(2x + u) + o(x − x̄)B, which usually fails to hold, but in (CA)* to 0 = f(x) − f(x̄) + f(x + x̄ − x) − f(x) ∈ o(x − x̄)B, which is trivially true.
Also the necessity of (CI) and (CA) can be characterized for several settings.

Theorem 7 (Condition (CI)). Suppose that f ∈ C^{0,1}(IR^n, IR^n) and f(x̄) = 0.
(i) Let Gf(x, u) = T f(x)(u). Then (CI) holds at x̄ ⇔ (CI) holds for x near x̄ ⇔ f is strongly regular at (x̄, 0). Under (CI), condition (CA) holds true if and only if (f, Gf, x̄) is feasible.
(ii) Let Gf(x, u) = ∂^{CL} f(x)u. Then (CI) holds at x̄ ⇔ (CI) holds for x near x̄ ⇔ ∂^{CL} f(x̄) is non-singular. This condition is stronger than strong regularity.
(iii) Let Gf(x, u) = Cf(x)(u). Then (CI) holds at x̄ ⇔ f^{-1} is locally u.L. at (0, x̄).
(iv) Let Gf(x, u) = f′(x; u), provided that directional derivatives exist near x̄. Then, under strong regularity of f, (CA) holds true if and only if (f, Gf, x̄) is feasible. Under pseudo-regularity, (CI) is satisfied for x near x̄.

Summarizing, the conditions (CI) and (CA)* are, at least for ALG(α) in the context of C^{0,1} functions and nonlinear approximations, similarly crucial as N-regularity and N-maps in Thm. 3. Solutions of the exact auxiliary problems (4.1) will obviously exist
(i) if - for all x ≠ x̄ near x̄ - the used mappings Gf(x, ·) : X ⇒ Y are surjective,
(ii) or if (stronger) X = Y = IR^n, Gf(x, u) = Cf(x)(u) and pseudo-regularity of f at x̄ holds true; cf. Thm. 1.
We are, however, not happy with condition (ii) since it excludes f(x) = |x|.

Modification and global convergence

In [6, 7], the iteration (4.2) in the Newton scheme ALG(α) has been replaced by a non-monotone path search which leads to a local convergence behavior similar to Thm. 5, but also allows a global convergence result. This has been applied in [7] for the setting Gf(x, u) = {f′(x; u)} (directional derivative) to C^{1,1}-optimization problems arising in (generalized) semi-infinite optimization and Nash equilibrium problems. For a survey of global nonsmooth Newton methods via path search, line search or trust region ideas, we refer to [17, Chapt. 8].
5 Some non-derivative approaches for nonsmooth functions

After studying Newton maps and other derivative-like objects for nonsmooth functions, let us turn to some "non-derivative approaches" for Newton's method. Having in mind the trivial setting Gf(x, u) = f(x + u) − f(x), after which

0 ∈ f(x) + Gf(x, u) ⇔ f(x + u) = 0,   (5.1)
it becomes evident that f(x + u) − f(x) should be approximated in some appropriate way. We first consider the following ideas:

(A1) If f is composed of a finite number of PC^1-functions h^i = h^i(x_i), one could construct Gf(x, u) with the help of directional derivatives (h^i)′(x_i; u_i) (and related chain rules for sums, products and quotients). We then obtain the directional derivative of f.

(A2) If certain directional derivatives, say (h^j)′(x_j; u_j), are not available or difficult to compute, one can replace them by the differences h^j(x_j + u_j) − h^j(x_j). In particular, this is possible if f is composed of C^1-functions h^i and piecewise linear functions h^j.

As elaborated in [21], the resulting models then reflect the original structure in an often preferable manner, and the "derivatives" can be determined by automatic differentiation (if f is defined in a hierarchic manner by the functions h^k, like a tree in a graph).

The need of automatic differentiation

To be more concrete, assume more simply that f = h^N(x_N) ∘ ... ∘ h^i(x_i) ∘ ... ∘ h^1(x_1) : IR^n → IR^m is composed of PC^1 functions h^i(x_i) of appropriate dimensions. Then f is again PC^1, x_1 = x ∈ IR^n and h^N(x_N) ∈ IR^m. Let us look at the above approximations f̂ for f(x + u) − f(x).
(A1) uses the directional derivative f̂ = f′(x; u), and f′(x; u) is computable via the directional derivatives of all h^i as f′(x; u) = (h^N)′(x_N; u_N) where, recursively, x_1 = x, u_1 = u and

x_{i+1} = h^i(x_i), u_{i+1} = (h^i)′(x_i; u_i), i = 1, ..., N − 1.   (5.2)
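A hedged sketch of both recursions (all helper names are ours): each stage carries its value map h^i and, if available, its directional derivative; stages without one are propagated via differences as in (A2). The middle, smooth stage of Example 5 below is handled by its exact directional derivative in both variants.

    import math

    def propagate(stages, x, u):
        # stages: list of pairs (h, dh) with dh(x, u) = h'(x; u), or (h, None);
        # (h, None) triggers the (A2) replacement u -> h(x + u) - h(x)
        for h, dh in stages:
            u = dh(x, u) if dh is not None else h(x + u) - h(x)
            x = h(x)
        return x, u      # x = f(x_1), u = approximation of f(x_1 + u_1) - f(x_1)

    abs_dd = lambda x, u: abs(u) if x == 0 else math.copysign(1.0, x) * u
    mid = (lambda z: 5.0 * math.sin(z) + 1.0, lambda z, w: 5.0 * math.cos(z) * w)
    stages_A1 = [(abs, abs_dd), mid, (abs, abs_dd)]      # full directional derivative
    stages_A2 = [(abs, None),   mid, (abs, None)]        # differences for the abs-stages
    print(propagate(stages_A1, 0.3, 0.01), propagate(stages_A2, 0.3, 0.01))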
Clearly, setting f̂ = f(x + u) − f(x) makes no sense since it does not simplify the equation. (A2), however, requires to do that for the involved difficult functions h^j. Then (h^j)′(x_j; u_j) is replaced by u_{j+1} = h^j(x_j + u_j) − h^j(x_j) in (5.2), and x_{j+1} = h^j(x_j). Now the approximation f̂ is no longer positively homogeneous in the direction u, as already the approximation under (2.17).

Example 5 Let f = abs(5 sin(abs(x)) + 1) = h^3 ∘ h^2 ∘ h^1, x_1 = x, u_1 = u. Then one may write ("exponents" are here indexes)

h^1(x_1) = abs(x_1), h^2(x_2) = 5 sin(x_2) + 1, h^3(x_3) = abs(x_3).

Here, (A1) yields

x_2 = abs(x), u_2 = (abs)′(x; u), x_3 = 5 sin(abs(x)) + 1, u_3 = 5 cos(abs(x)) u_2,

which implies

f̂ = f′(x; u) = (h^3)′(x_3; u_3) = (abs)′( 5 sin(abs(x)) + 1 ; 5 cos(abs(x)) (abs)′(x; u) ),

while (A2) with h^j = abs for j ≠ 2 corresponds to

x_2 = abs(x), u_2 = abs(x + u) − abs(x), x_3 = 5 sin(abs(x)) + 1, u_3 = 5 cos(abs(x)) u_2,

and (h^3)′(x_3; u_3) is replaced by (abs)(x_3 + u_3) − (abs)(x_3), which yields

f̂ = abs( 5 sin(abs(x)) + 1 + 5 cos(abs(x)) [abs(x + u) − abs(x)] ) − abs( 5 sin(abs(x)) + 1 ).

The example shows that automatic differentiation is, in principle, an unavoidable pre-requisite for dealing with nonsmooth Newton methods, even if they involve only standard functions like the absolute value. Fortunately, in many concrete variational problems the non-smoothness arises from complementarity conditions only and allows simpler procedures for handling this situation, since the abs- (or max-) function is involved in a primitive manner. But for hierarchic problems like MPECs, functions as above may occur, and it becomes important that f̂ is still of type PC^1.

Difficulties and condition (CA)* for non-Lipschitz functions

Our general assumption of f ∈ C^{0,1} excluded continuous, not locally Lipschitz functions in section 4. By formula (3.6), we motivated in [31] the restriction to f ∈ C^{0,1} in the context of standard settings (4.5). For nonlinear approximations and f ∈ D := C \ C^{0,1}, the situation is not simpler. In the paper [23], devoted to Newton's method for continuous f, one finds no f ∈ D such that the proposed method converges or the hypotheses are satisfied. We also found nowhere an example of a PBA for continuous f, different from the functions in (2.17) with h ∈ C^{1,1}. So it is even not clear whether a Newton-like method may converge for f ∈ D.

To give an answer and to characterize the difficulties for f ∈ D, let us check superlinear convergence and condition (CA)* for real functions f ∈ D which are C^1 near x ≠ x̄. So one can use the usual Newton steps at x ≠ x̄. For f(x) = sgn(x)|x|^q, 0 < q < 1, both superlinear convergence and the crucial condition (CA)* are violated. The following strongly regular examples from [43] indicate that Newton's method may superlinearly converge for f ∈ D, while (CA)* may hold or not. In both examples, put f(0) = 0 and f(x) = −f(−x) for x < 0.

Example 6 Superlinear local convergence, though (CA)* is violated: f(x) = x(1 − ln x) if x > 0. Evidently, f is continuous and, for x > 0, it holds Df(x) = −ln x and

x_new = x − x(1 − ln x)/(−ln x) = x + x/ln x − x = x/ln x.

This implies

O_1(x) := x_new/x = 1 − f(x)/(x Df(x)) = 1 − x(1 − ln x)/(−x ln x) = 1 − (−1/ln x + 1) = 1/ln x → 0 for x ↓ 0,

and (CA)* fails due to

O_2(x) := f(x)/x − Df(x) = x(1 − ln x)/x + ln x ≡ 1.
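A quick numeric check of Example 6 (our verification script): Newton for this f reads x_{k+1} = x_k/ln|x_k| for both signs, and the ratios |x_{k+1}|/|x_k| = 1/|ln|x_k|| tend to 0, confirming superlinear convergence although (CA)* fails.

    import math

    x = 0.1
    for k in range(6):
        x_new = x / math.log(abs(x))           # the Newton step for f(x) = x(1 - ln|x|)
        print(k, x_new, abs(x_new) / abs(x))   # ratio = 1/|ln|x_k|| -> 0
        x = x_new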
Example 7 Superlinear local convergence and (CA)* hold true: f(x) = x(1 + ln(−ln x)) if x > 0. Consider small x > 0, which yields f > 0 and, for x ↓ 0,

Df(x) = (1 + ln(−ln x)) + x (1/(−ln x)) (−1/x) = 1 + ln(−ln x) + 1/ln x → ∞,

O_2 = f/x − Df = (1 + ln(−ln x)) − (1 + ln(−ln x) + 1/ln x) = −1/ln x → +0,

O_1 = 1 − f/(x Df) = 1 − (1 + ln(−ln x))/(1 + ln(−ln x) + 1/ln x) = (1/ln x)/(1 + ln(−ln x) + 1/ln x) → −0.

Similarly, negative x can be handled. Thus the assertions are verified.

The property lim sup |f(x) − f(x̄)|/|x − x̄| = ∞ of these examples is justified by [43, Thm. 4.1]: For real f ∈ D with finite lim sup, which are not locally Lipschitz near x̄, the method 0 ∈ f(x_k) + Cf(x_k)(x_{k+1} − x_k) never satisfies |x_{k+1} − x̄| = o(x_k − x̄) for all x_0 near x̄.
6 Modified Successive Approximation for Generalized Equations

In this section, we study zeros of F : X ⇒ Y after small nonlinear variations near to (x_0, y_0) ∈ gph F. Our basic model is a generalized equation

g(x) ∈ F(x), x ∈ Ω, and its solution sets S(g) ⊂ Ω,   (6.1)

where Ω is some ball around x_0 and g : Ω → Y is Lipschitz on Ω. Obviously, (6.1) coincides with the fixed point condition x ∈ F^{-1}(g(x)), x ∈ Ω. Particular inclusions (6.1) and assigned Newton-type methods have been studied already in [50] and [28], and its parametric form g(x, t) ∈ F(x) was the subject of the pioneering paper [52] for C^1-functions g under the viewpoint of strong regularity. In this context, the mapping F^{-1} ∘ g is a contraction whenever F is strongly regular and Lip(g, Ω) is small.

The following approach is based on [41, 42] and [31, section 4.1], where metric spaces X and Y have been permitted and several comments and further applications can be found. As a basic tool, we construct and estimate solutions to (6.1) directly by modified successive approximation, i.e., by selecting x_{k+1} ∈ F^{-1}(g(x_k)) such that ‖x_{k+1} − x_k‖ is "sufficiently small". Under several hypotheses, the result is a sequence which converges to a solution x* ∈ S(g). It is worth noting that the requirement of small ‖x_0 − x̄‖ for some x̄ ∈ S(g) will not appear. Instead, we use some regularity near the initial point and a Lipschitz property for g.

Similar assumptions characterize Kantorovich-type statements for Newton's method, which are based on small values of ‖f(x_0)‖ instead of a small distance ‖x_0 − x̄‖ of the initial point to a (usually unknown) solution. Such conditions also require that the variation of Df near x_0 and (or) ‖Df(x_0)^{-1}‖ are small enough. Related estimates are the key for various results in [29, Kap. XVIII] and many papers on numerical methods for solving equations. If x_0 is close to a regular zero x̄ of f, these conditions hold true by continuity arguments. Let us recall a classical statement.

Theorem 8 [46]. Let f ∈ C^{1,1}(X, Y), Ω ⊂ X be open and convex, x_0 ∈ Ω and Lip(Df, Ω) ≤ β. Suppose that Df(x_0)^{-1} exists with norm L and that h := L β η ≤ 1/2 holds with η = ‖Df(x_0)^{-1} f(x_0)‖. Finally, put δ* = (1/(Lβ))(1 − √(1 − 2h)), δ** = (1/(Lβ))(1 + √(1 − 2h)) and suppose that S := x_0 + δ*B ⊂ Ω. Then, the Newton iterates are well defined, lie in S and converge to a zero x* of f which is unique in Ω ∩ [x_0 + int δ**B]. If even h < 1/2, the convergence is quadratic.

A priori, a check of these conditions may be rather hard, even if f(x) = x + (1/3)x^3 with Ω = (−p, p). All the more, this holds for related statements in the context of generalized equations. The proofs, however, require in both cases the explicit study of the appearing sequences. To obtain additional applications, we shall construct elements x_k ∈ X and v_k ∈ Y, independently of any function g : X → Y.
6.1 The approximation scheme and its properties

Given any (x_0, y_0) ∈ gph F and v_0 ∈ Y, we consider the following

Process P(λ, β, x_0, y_0, v_0). Let λ > 0, β > 0. For describing the initial step at (x_0, v_0) like the others, put v_{-1} = y_0. Hence x_0 ∈ F^{-1}(v_{-1}) = F^{-1}(y_0). Beginning with k = 0,

find x_{k+1} ∈ F^{-1}(v_k) with ‖x_k − x_{k+1}‖ ≤ dist(x_k, F^{-1}(v_k)) + λ‖v_k − v_{k-1}‖
and choose any v_{k+1} such that ‖v_{k+1} − v_k‖ ≤ β‖x_{k+1} − x_k‖.   (6.2)
Clearly, x_{k+1} is an approximate projection of x_k onto F^{-1}(v_k) with error ≤ λ‖v_k − v_{k-1}‖.

Properties of P(λ, β, x_0, y_0, v_0)

P0. Some related x_{k+1} exist as long as F^{-1}(v_k) ≠ ∅. Indeed, if v_k ≠ v_{k-1} then x_{k+1} exists by definition of the point-to-set distance. If v_k = v_{k-1}, one may put x_{k+1} = x_k since x_k ∈ F^{-1}(v_{k-1}) = F^{-1}(v_k). The existence of v_{k+1} is trivial.

P1. The process becomes stationary (x_{k+1} = x_k) if one selects v_{k+1} = v_k.

P2. For the model (6.1), we will put v_k = g(x_k) if Lip(g, Ω) ≤ β holds with some set Ω, defined later, cf. Remark 10.

P3. If x_{k+1} (hence also v_{k+1}) exists, (6.2) yields, with d_k = dist(x_k, F^{-1}(v_k)) and α := ‖v_0 − y_0‖,

‖x_1 − x_0‖ ≤ d_0 + λα and ‖x_{k+1} − x_k‖ ≤ d_k + λ‖v_k − v_{k-1}‖ ≤ d_k + λβ‖x_k − x_{k-1}‖ ∀k ≥ 1.   (6.3)

P4. Having an estimate

d_k ≤ L‖v_k − v_{k-1}‖ (k ≥ 0) where θ := β(L + λ) < 1,   (6.4)
then (6.3) ensures, with L̃ = L + λ,

‖x_1 − x_0‖ ≤ L̃α and, if k ≥ 1,
‖x_{k+1} − x_k‖ ≤ L̃‖v_k − v_{k-1}‖ ≤ L̃β‖x_k − x_{k-1}‖ = θ‖x_k − x_{k-1}‖,
‖v_{k+1} − v_k‖ ≤ β‖x_{k+1} − x_k‖ ≤ βL̃‖v_k − v_{k-1}‖ = θ‖v_k − v_{k-1}‖,   (6.5)

as well as

‖x_{k+1} − x_0‖ ≤ ‖x_1 − x_0‖ + Σ_{1≤i≤k} ‖x_{i+1} − x_i‖ ≤ L̃α + L̃αθ + ... + L̃αθ^k ≤ (1 − θ)^{-1} L̃α,
‖v_{k+1} − y_0‖ ≤ α + Σ_{0≤i≤k} ‖v_{i+1} − v_i‖ ≤ (1 − θ)^{-1} α.   (6.6)

In consequence, the points stay in any given neighborhoods of x_0 and y_0 provided that α = ‖v_0 − y_0‖ is sufficiently small. Moreover, if the process generates infinite sequences {x_k, v_k}, they converge in the complete spaces X, Y (by boundedness of the sum of norms) to

x* ∈ x_0 + (1 − θ)^{-1} L̃αB and v* ∈ y_0 + (1 − θ)^{-1} αB, respectively.   (6.7)
P5. More generally, under the boundedness Σ_{k=1}^∞ d_k < C and λβ < 1, (6.3) guarantees convergence of {x_k, v_k} by estimates like (6.5) and (6.6). Condition (6.4) is a special case of this situation.
P6. Following the estimate in [1, page 181] (though some index is wrong there), (6.5) ensures linear convergence of the sequences to (x*, v*) with factor q = θ/(1 − 2θ) < 1, if θ < 1/3.
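To see the process at work, here is a minimal sketch (Python; F, g and all constants are illustrative) with the choice v_k = g(x_k) of P2, where F is strongly regular, so F^{-1} is single-valued and the projection step in (6.2) is exact:

import numpy as np

# Process P(lam, beta, x0, y0, v0) with v_k = g(x_k), cf. P2 and setting S1.
# Illustrative data: F(x) = 2x, hence F^{-1}(v) = v/2 (rank L = 1/2), and
# g has Lipschitz rank beta = 0.3, so theta = beta*(L + lam) < 1 for small lam.
F_inv = lambda v: v / 2.0
g     = lambda x: 0.3 * np.cos(x)

x, y0 = 1.0, 2.0                     # (x0, y0) in gph F
v = g(x)                             # v_0; alpha = |v_0 - y0| is the initial defect
for k in range(30):
    x_next = F_inv(v)                # exact projection onto F^{-1}(v_k)
    if abs(x_next - x) < 1e-12:
        break
    x, v = x_next, g(x_next)         # |v_{k+1} - v_k| <= beta*|x_{k+1} - x_k|

print("x* =", x, " residual g(x*) - F(x*) =", g(x) - 2.0 * x)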
Particular settings for process P(λ, β, x0, y0, v0)

S1 In order to solve (6.1), put v_k = g(x_k) and β = Lip(g, Ω) as mentioned above.
S2 Let F = ∂f be the subdifferential of a convex function f : X = IR^n → IR. Put g(x) = βx. Then x ∈ F^{-1}(g(x_k)) means βx_k ∈ ∂f(x) and 0 ∈ −βx_k + ∂f(x). Hence, given x_k, we require x_{k+1} ∈ argmin_{x∈X} (f(x) − β⟨x_k, x⟩). A solution g(x*) ∈ F(x*) now solves the conjugate problem x* ∈ argmin_{x∈X} (f(x) − β⟨x*, x⟩).
S3 Let f and g be as in S2. Put F(x) = βx + ∂f(x). Then x ∈ F^{-1}(g(x_k)) means g(x_k) ∈ F(x) and 0 ∈ β(x − x_k) + ∂f(x). Hence x_{k+1} minimizes the Moreau-Yosida approximation f(x) + (1/2)β‖x − x_k‖², and one has g(x*) ∈ F(x*) ⇔ x* ∈ argmin_{x∈X} f(x). In this case, the algorithm minimizes f by a proximal point method; a small numerical sketch follows after Remark 10 below.
S4 To solve H(x) ∩ F(x) ≠ ∅, i.e., 0 ∈ −H(x) + F(x), for closed H, F : X ⇒ Y, assume v0 ∈ H(x0) and select v_{k+1} ∈ H(x_{k+1}) with ‖v_{k+1} − v_k‖ ≤ β‖x_{k+1} − x_k‖. The latter is possible if H is pseudo-Lipschitz with rank β on Ω.

Under the setting S1, β stands for a Lipschitz modulus of g, λ allows an approximation in the shortest-distance problem min_x {d(x, x_k) | x ∈ F^{-1}(g(x_k))}, and α is the distance between (x0, y0) and the initial point (x0, v0) = (x0, g(x0)) of the process. Next, according to Remark 1, we take the same radius δ for the balls around x0 and y0.

Theorem 9 [31, Thm. 4.2] (modified successive approximation). Let F : X ⇒ Y be closed and pseudo-regular at (x0, y0) ∈ gph F with rank L and balls U = x0 + δB_X, V = y0 + δB_Y. Suppose that α := ‖v0 − y0‖ and β are small enough such that
θ := β(L + λ) < 1
and
α < δ(1 − θ)(max{1, L + λ})^{-1}.
(6.8)
Then P(λ, β, x0, y0, v0) generates convergent sequences {x_k} ⊂ U, {v_k} ⊂ V such that the following holds:
(i) The sequences satisfy (6.5), (6.6), (6.7) and (x*, v*) ∈ gph F.
(ii) If, in particular, v_k = g(x_k) for all k ≥ 0 in P(λ, β, x0, y0, v0) and Lip(g, U) ≤ β, then g(x*) ∈ F(x*).
(iii) If, in particular, v_k ∈ H(x_k) for all k ≥ 0 in P(λ, β, x0, y0, v0) and H : X ⇒ Y is closed as under S4, then v* ∈ H(x*) ∩ F(x*), hence 0 ∈ −H(x*) + F(x*).

Proof (i) It suffices to verify, by complete induction, that the points x_{k+1}, v_{k+1} under (6.5) really exist. To this aim we apply pseudo-regularity to the points x_k, v_k, v_{k−1}, which already satisfy the given estimates due to (6.6) and the choice of our constants. For k = 0, it holds x0 ∈ F^{-1}(y0) and d(v0, v_{−1}) = ‖v0 − y0‖ = α < δ. By pseudo-regularity of F, some x1 ∈ F^{-1}(v0) thus fulfills d(x1, x0) ≤ L‖v0 − v_{−1}‖ = Lα. Hence also (6.4) holds for k = 0. For k > 0, any x_k ∈ F^{-1}(v_{k−1}) of the process has been defined according to the previous steps, and (6.6) ensures x_k ∈ x0 + δB and v_{k−1}, v_k ∈ y0 + δB. Thus, again by pseudo-regularity of F, some x_{k+1} ∈ F^{-1}(v_k) fulfills d(x_{k+1}, x_k) ≤ L‖v_k − v_{k−1}‖, and (6.4) holds again. Hence all points x_{k+1}, v_{k+1} of the process exist and stay in U and V, respectively, and the properties of the process ensure the assertions. Clearly, (x*, v*) ∈ gph F follows from closedness of F.
(ii) The inclusion g(x*) ∈ F(x*) follows from x_{k+1} ∈ F^{-1}(g(x_k)) and passing to the limit.
(iii) Similarly, v* ∈ H(x*) ∩ F(x*) follows from x_{k+1} ∈ F^{-1}(v_k) and v_k ∈ H(x_k). ⊔⊓

Remark 8 (Strongly regular F) Under strong regularity of F with rank L at (x0, y0), all x_{k+1} of our construction are uniquely defined by v_k. In the situation S1, Lip(g, U_δ) ≤ β then implies that x ↦ φ(g(x)) = F^{-1}(g(x)) ∩ U_δ is a contraction which maps U_δ into itself. Thus x* is also the unique fixed point of φ ∘ g on U_δ.

Remark 9 (Family of mappings) The theorem indicates only one set of assumptions which ensures the crucial estimate (6.4) as well as F^{-1}(v_k) ≠ ∅. So it is quite obvious that Thm. 9(ii) also holds for a family of mappings F_k and functions g_k with Lip(g_k, x0 + δB) ≤ β as long as they fulfill the requirements concerning pseudo-regularity of F and the initial conditions (6.8) for g with v0 := g_k(x0). Thus all estimates for x_k and v_k = g_k(x_k) remain true. In particular, (6.5) and (6.6) show that our assumptions for the initial point even hold with α_k := ‖v_{k+1} − v_k‖ ≤ θ‖v_k − v_{k−1}‖ < α. Hence the limits x* and v* exist again.

Remark 10 To simplify our applications as in [31], we put λ = 1 in P(λ, β, x0, y0, v0) and claim (stronger)
β < (1/2)(L + 1)^{-1} and α := ‖v0 − y0‖ < (1/2)δ(L + 1)^{-1}. (6.9)
Then (6.8) is satisfied with θ := β(L + 1) < 1/2, after which (i) presents the simpler estimates ‖x_k − x0‖ ≤ 2(L + 1)α < δ and ‖v_k − y0‖ ≤ 2α < δ. For the situation S1, we have α = ‖g(x0) − y0‖ and may put β = Lip(g, U).
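As an illustration of S3 (announced there), a minimal proximal point sketch (Python; the convex function f and the parameter β are illustrative), where each step minimizes the Moreau-Yosida approximation:

from scipy.optimize import minimize_scalar

# Setting S3: x_{k+1} minimizes f(x) + (1/2)*beta*|x - x_k|^2, i.e. the
# proximal point method for min f. Illustrative data:
f    = lambda x: abs(x - 1.0) + 0.5 * x**2   # convex, argmin f = {1}
beta = 0.5

x = 5.0
for k in range(40):
    xk = x
    x = minimize_scalar(lambda z: f(z) + 0.5 * beta * (z - xk)**2).x
    if abs(x - xk) < 1e-10:
        break
print("approximate minimizer:", x)           # tends to 1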
6.2 Generalized equations under strong or pseudo-regularity

We are now going to consider the approach of section 2.2 for solving 0 ∈ Γ(x) with closed Γ : X ⇒ Y via a method given by Σ : X × X ⇒ Y, where Σ(·, x) is a continuous translation of Σ(·, x̄). With the following proposition, both implicit mappings with parameter x and Newton convergence can be studied in a unified manner.

Proposition 4 Suppose that 0 ∈ Γ(x̄) ∩ Σ(x̄, x̄), let Σ(·, x̄) be closed and pseudo-regular at (x̄, 0) with rank L, and assume that Σ(x′, x) = g_x(x′) + Σ(x′, x̄) holds with some function g_x satisfying
sup(g_x, Ω_r) ≤ o(r) and Lip(g_x, Ω_r) ≤ O(r),
∀x ∈ Ω_r := x̄ + rB.
(6.10)
(i) Then, if x ∈ Ω_r and r is sufficiently small, there is some x′ ∈ Ω_r with 0 ∈ Σ(x′, x) and ‖x′ − x̄‖ ≤ 2(L + 1)‖g_x(x̄)‖.
(ii) For the method 0 ∈ Σ(x_{k+1}, x_k), such iterates x_{k+1} exist for all k and satisfy ‖x_{k+1} − x̄‖ = o(x_k − x̄), provided that ‖x0 − x̄‖ is sufficiently small.

Proof We apply Thm. 9(ii), based on pseudo-regularity of F = Σ(·, x̄) at (x̄, 0), for g = g_x with initial point (x0, v0) := (x̄, g_x(x̄)). If x ∈ Ω_r and r is small, then (x̄, g_x(x̄)) is arbitrarily close to (x̄, 0) and Lip(g_x, Ω_r) is arbitrarily small. Hence Thm. 9 (with the settings of Remark 10) ensures assertion (i) with x′ = x*. By (6.10), it also holds ‖g_x(x̄)‖ = o(x − x̄). Hence ‖x′ − x̄‖ = o(x − x̄). Identifying x_k = x, x_{k+1} = x′, also (ii) is true. ⊔⊓

This way we do not obtain the inclusion of (2.8), but at least
∅ ≠ S(x) ∩ [x̄ + o(x − x̄)B] for x near x̄.
(6.11)
Supplement Under strong regularity in place of pseudo-regularity in Prop. 4, and again for r small enough, the solutions x′ ∈ Ω_r to 0 ∈ Σ(x′, x) are unique, and it follows, stronger,
card(Ω_r ∩ S(x)) = 1 and Ω_r ∩ S(x) ⊂ x̄ + o(x − x̄)B for x near x̄.
(6.12)
This tells us for the same method: if r > 0 and ‖x0 − x̄‖ are small enough, there exist unique iterates x_{k+1} ∈ Ω_r for all k, and they fulfill ‖x_{k+1} − x̄‖ = o(x_k − x̄).

Applications for generalized equations For Γ = f + M, where M is closed, let us apply Prop. 4 to two situations.
Case 1: If f = h ∘ γ, with h ∈ C¹ and γ ∈ C^{0,1} as under (2.17), put
Σ(x′, x) = f(x) + Dh(γ(x))(γ(x′) − γ(x)) + M(x′) if x ≠ x̄, and Σ(x′, x) = f(x′) + M(x′) if x = x̄,
with g_x(x′) = f(x) − f(x′) + Dh(γ(x))(γ(x′) − γ(x)).
Case 2: If f is continuous and there is a PBA Σ̂ for f near x̄, put Σ(x′, x) = Σ̂(x′, x) + M(x′) and g_x(x′) = Σ̂(x′, x) − Σ̂(x′, x̄).
In both situations, (6.10) holds true due to (2.25) since adding the multivalued term M(x′) does not change the needed estimate. Also the proofs remain the same with or without M. For the role of Ω_r in (6.12) we refer to example 1.
If we replace Σ(x′, x) by any mapping Σ̃(x′, x) = g̃(x′) + Σ(x′, x) such that g̃ fulfills (6.10), then we are in the context of inexact Newton methods and obtain, evidently, the same statements. Slight generalizations of the case f ∈ C¹ by passing to uniform strict differentiability as, e.g., in [15] ensure condition (6.10) [by definition] too.
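To make Case 1 concrete in the smooth situation h = f, γ = id: the iteration 0 ∈ f(x_k) + Df(x_k)(x_{k+1} − x_k) + M(x_{k+1}) is the Josephy-Newton method. A minimal sketch (Python; the data and the choice M = N_{[0,∞)} are illustrative) for a one-dimensional complementarity problem, where the linearized inclusion is solved by case distinction:

import numpy as np

# Josephy-Newton for 0 in f(x) + N_{[0,inf)}(x), i.e. the complementarity
# problem x >= 0, f(x) >= 0, x*f(x) = 0. Each step solves
#   0 in f(x_k) + Df(x_k)(x - x_k) + N_{[0,inf)}(x).
f  = lambda x: np.exp(x) - 2.0 + x    # illustrative, Df > 0
Df = lambda x: np.exp(x) + 1.0

x = 2.0
for k in range(12):
    x_newton = x - f(x) / Df(x)       # root of the linearization
    # if the root is infeasible, the normal cone at 0 absorbs the residual:
    x_next = x_newton if x_newton >= 0 else 0.0
    print(k, x_next)
    if abs(x_next - x) < 1e-14:
        break
    x = x_next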
6.3 Kantorovich-type statements for generalized equations

Now we intend to show how Thm. 9 can be used to obtain a Kantorovich-type result for generalized equations 0 ∈ f(x) + M(x) with composed function f = h ∘ γ, under the assumptions of Case 1 above. Instead of small ‖f(x0)‖, we have to require that ‖f(x0) + m0‖ is small for some m0 ∈ M(x0). While for Prop. 4 the Thm. 9 was applied only for verifying the existence of the next iterate x′ = x_{k+1}, now the process (6.2) of Thm. 9 will directly define the whole sequence {x_k}. Let us first consider the modified Newton method
0 ∈ f(x_k) + H0(γ(x_{k+1}) − γ(x_k)) + M(x_{k+1})
(6.13)
with fixed operator H0 := Dh(γ(x0)). If the set X_{k+1} of the related solutions has more than one element, we require to select any x_{k+1} ∈ X_{k+1} with
‖x_{k+1} − x_k‖ ≤ dist(x_k, X_{k+1}) + ‖g(x_k) − g(x_{k−1})‖ if k > 0,
‖x_{k+1} − x_k‖ ≤ dist(x_k, X_{k+1}) + ‖g(x0) − y0‖ if k = 0, (6.14)
where g(x) = f(x0) + H0(γ(x) − γ(x0)) − f(x). Notice that O_h(r, x0) = sup_{x,x′∈x0+rB} ‖Dh(γ(x)) − Dh(γ(x′))‖ vanishes for r ↓ 0 and is just rK if Dh is Lipschitz with rank K on x0 + δ0B.

Proposition 5 Suppose that x0 ∈ X and m0 ∈ M(x0) are given in such a way that
(i) the mapping Γ = f + M is pseudo-regular at (x0, y0) = (x0, f(x0) + m0) with rank L0, balls U = x0 + δ0B, V = y0 + δ0B, and Lγ = Lip(γ, x0 + δ0B) < ∞,
(ii) some 0 < δ < c := min{1, δ0} satisfies β := O_h(δ, x0) Lγ < δ0(3(L0 + 1))^{-1} and θ := β(L0 + 1) < 1/2,
(iii) α := ‖f(x0) + m0‖ < (1/2)δ(2L0 + 1)^{-1}.
Then the iterates x_{k+1} of method (6.13), (6.14) exist, remain in x0 + δB and converge to a solution x* of 0 ∈ f(x) + M(x).
Remarks 1. A δ > 0 with the properties (ii) really exists, but (iii) is an additional requirement on the initial point which may be violated.
2. How can one fulfill the assumptions for x0 near x̄ if Γ is pseudo-regular at (x̄, 0) (as in Prop. 4, Case 1)? By persistence of pseudo-regularity, one may suppose that (i) holds true for all (x0, y0) ∈ gph Γ near (x̄, 0). Next require (ii) for x̄ in place of x0, i.e., choose δ̄ with 0 < δ̄ < c, β := O_h(δ̄, x̄) Lγ < δ0(3(L0 + 1))^{-1} and θ := β(L0 + 1) < 1/2. Setting δ = (1/2)δ̄, then (ii) is true for all x0 ∈ x̄ + δB. Hence (i) and (ii) hold for all (x0, y0) ∈ gph Γ close enough to (x̄, 0). Finally, for small ‖y0‖, some x0 ∈ Γ^{-1}(y0) ∩ (x̄ + L‖y0‖B) exists by pseudo-regularity of Γ. Therefore (x0, y0) = (x0, f(x0) + m0) then fulfills all assumptions if ‖y0‖ was sufficiently small. Notice, however, that not necessarily all x0 near x̄ are assigned to some y0 with small norm, even if Γ is strongly regular:

Example 8 (a strongly regular multifunction) Put f(x1, x2) = x1 ∈ IR, M(x1, x2) = {1/x2} if x2 ≠ 0, M(x1, x2) = {0} if x2 = 0. For |y| < 1, the solution of y ∈ Γ(x) := f(x) + M(x) is unique and Lipschitz: x = (y, 0). If x → (0, 0) with x2 ≠ 0, then ‖y‖ → ∞ for all y ∈ Γ(x). Hence, for such points x0 near x̄, we cannot satisfy the assumptions of Prop. 5 since ‖m0‖ becomes large. The Newton step at x0 would generate the solution x1 = x̄, but the difference of the assigned y-values is too big to apply any regularity at (x̄, 0). Therefore, in similar statements, e.g., [12, Thm. 4], additional hypotheses about the situation after the first iteration often occur.
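Before returning to the role of M: in the simplest case M ≡ {0}, γ = id, h = f, method (6.13) is the classical modified (chord) Newton method with frozen derivative H0 = Df(x0). A minimal sketch (Python; data illustrative):

import numpy as np

# Modified Newton method (6.13) with M = {0}, gamma = id, h = f:
# H0 = Df(x0) stays fixed, each step solves 0 = f(x_k) + H0*(x_{k+1} - x_k).
f  = lambda x: np.arctan(x)
Df = lambda x: 1.0 / (1.0 + x**2)

x0 = 0.4
H0 = Df(x0)                  # frozen operator
x  = x0
for k in range(25):
    x = x - f(x) / H0        # linear convergence, factor ~ |1 - Df(x*)/H0|
    print(k, x)
    if abs(f(x)) < 1e-12:
        break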
On the other hand, if M(x) is fixed or if Γ fulfills, as in the context of polyhedral mappings (3.9),
lim sup_{x→x̄} dist(−f(x̄), M(x)) = 0 (i.e., M is l.s.c. at (x̄, −f(x̄))), (6.16)
then we find, for each x0 close enough to x̄, some appropriate m0 satisfying also (iii). Hence the properties of M now indeed play some role.

Proof of Proposition 5 Define
F(x) = f(x0) + H0(γ(x) − γ(x0)) + M(x), g(x) = f(x0) + H0(γ(x) − γ(x0)) − f(x).
(6.17)
The initial mapping Γ = f + M fulfills g + Γ = F, y0 = f(x0) + m0 ∈ F(x0), and the equivalences
g(x) ∈ F(x) ⇔ −f(x) ∈ M(x) ⇔ 0 ∈ Γ(x).
In addition, we obtain
x_{k+1} ∈ F^{-1}(g(x_k)) ⇔ g(x_k) ∈ F(x_{k+1})
⇔ f(x0) + H0(γ(x_k) − γ(x0)) − f(x_k) ∈ f(x0) + H0(γ(x_{k+1}) − γ(x0)) + M(x_{k+1})
⇔ 0 ∈ f(x_k) + H0[γ(x_{k+1}) − γ(x_k)] + M(x_{k+1}).
Hence the iterations of P(λ, β, x0, y0, v0) (for λ = 1) correspond to method (6.13), (6.14). We continue by checking the assumptions of Thm. 9. First we investigate the Lipschitz rank of g with respect to x′, x ∈ x0 + rB and 0 < r < δ0:
‖g(x′) − g(x)‖ = ‖H0γ(x′) − f(x′) − (H0γ(x) − f(x))‖
= ‖h(γ(x)) − h(γ(x′)) − H0(γ(x) − γ(x′))‖
≤ sup_{x,x′∈x0+rB} ‖Dh(γ(x)) − Dh(γ(x′))‖ ‖γ(x) − γ(x′)‖.
(6.18)
Hence
Lip(g, x0 + rB) ≤ O_h(r, x0) Lγ = O(r).
(6.19)
This yields, due to g(x0) = 0,
sup(g, x0 + rB) ≤ r Lip(g, x0 + rB) = o(r).
(6.20)
Because of (i), and since F = g + Γ where g is a small Lipschitz function, also F is pseudo-regular with some rank L at (x0, y0) and neighborhoods U = x0 + δB, V = y0 + δB, provided that µ(δ) := max{sup(g, x0 + δB), Lip(g, x0 + δB)} is small enough. Using [41, Thm. 2.4], this is true if the constants L, δ and µ(δ) satisfy L = 2L0 and if δ and µ(δ) are sufficiently small in comparison with δ0 and L0; by (6.19), (6.20) this holds due to the choice of δ in (ii). Since g(x0) = 0 gives v0 = g(x0) = 0 and α = ‖v0 − y0‖ = ‖f(x0) + m0‖, condition (iii) ensures the initial estimate (6.9). Hence Thm. 9(ii) applies: the iterates exist, stay in U and converge to some x* with g(x*) ∈ F(x*), i.e., 0 ∈ f(x*) + M(x*). ⊔⊓

For equations, Robinson [50] extended Newton's method by using approximations Gf(x) of Df and conditions which require, with certain constants ε > 0, K > 0 and c, that for y, y′ ∈ εB_Y
(i) the equation f(x0) + Gf(x0)(x − x0) = y has solutions x ∈ Ω, and (ii) the assigned solutions x = x(y) satisfy ‖x(y) − x(y′)‖ ≤ K‖y − y′‖.
(6.22)
For the skillful interplay of the constants K and c in Robinson's convergence theorem (under the additional hypothesis that ‖x1 − x0‖ is sufficiently small for the first iterate x1 and that Ω is big enough), we refer the reader to the original paper. In our terminology, (6.22)(i) claims solvability of the equation
f(x0) + Gf(x0)(x − x0) = y, x ∈ Ω (6.23)
in some ball y ∈ εB_Y, while (6.22)(ii) means Lipschitz behavior of the solutions x = x(y). So the function x′ ↦ F(x′) := Σ(x′, x0) is strongly regular at (x0, f(x0)).
With g_x(x′) = Σ(x′, x) − Σ(x′, x0), the equation 0 = g_x(x′) + F(x′) describes the solutions of Σ(x′, x) = f(x) + Gf(x)(x′ − x) = 0 and the assigned Newton method (2.10).
With ĝ_x(x′) = f(x) + Gf(x0)(x′ − x) − Σ(x′, x0), the equation 0 = ĝ_x(x′) + F(x′) describes the solutions of Σ̂(x′, x) = f(x) + Gf(x0)(x′ − x) = 0 and the modified Newton method as in (6.13) with M ≡ 0.
Both models can be handled, by section 6.3, for the composed functions (2.17). However, for the general setting (2.21) of PBAs, we do not see any possibility of applying Prop. 5; it relies on the existence of Dh.

Remark 12 Nevertheless, there is some (Kantorovich-type) statement if arbitrary PBAs of f replace the linearizations in f + M under pseudo-regularity at the initial point (x0, y0), cf. [12, Thm. 4]. Unfortunately, this asserts once more only the existence of a related Newton sequence without giving the concrete construction. The same is true for Thm. 6.3 in [1], where systems 0 ∈ f(x) + F(x) are considered with multivalued f and F. Hence solutions of ∅ ≠ −f(x) ∩ F(x) are searched for. The convergence is based on Hausdorff-metric approximations of f for the auxiliary problems.
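A toy numerical check of (6.22) (Python; the scalar f and the selection Gf are illustrative): the linearization at x0 is solvable for all sampled y, and its solutions x(y) behave Lipschitz with rank K = 1/|Gf(x0)|.

import numpy as np

# Check of (6.22) for f(x) = 2x + |x| with Gf(x) a selection of 2 + sign(x).
f  = lambda x: 2.0 * x + abs(x)
Gf = lambda x: (2.0 + np.sign(x)) if x != 0 else 3.0

x0, eps = 1.0, 0.5
G0 = Gf(x0)

ys = np.linspace(-eps, eps, 11)      # sample y in eps*B
xs = x0 + (ys - f(x0)) / G0          # (6.22)(i): solutions of f(x0)+G0(x-x0)=y
K  = np.max(np.abs(np.diff(xs)) / np.abs(np.diff(ys)))
print("Lipschitz rank K =", K, "= 1/|G0| =", 1.0 / abs(G0))   # (6.22)(ii)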
We considered model (6.1) in order to show how closely its theory is connected with the usual Newton method. Approximations of multifunctions will be studied in section 7.1, and the intersection of multifunctions can be handled by Thm. 9(iii) in the setting S4, too.

Note. Looking at the function class (2.17), one could believe that, after setting h = id, Newton methods exist for arbitrary Lipschitz functions f = γ. But we have to stop the reader's enthusiasm: though we obtain all C^{0,1} functions f, the auxiliary equation Σ(x′, x) = 0 for Newton's method in (2.17) remains just the original one: f(x′) = 0. Also method (6.13) then leads us again to the original problem 0 ∈ f(x_{k+1}) + M(x_{k+1}).
7 Nonlinear Approximations for General Multifunctions 0 ∈ Γ(x)

We are now going to investigate inclusions under less traditional hypotheses which a priori do not use the structure of generalized equations or derivatives of composed functions as under (2.17). This will be done in the framework of the general approximation scheme introduced in §2.2. Hence we study 0 ∈ Γ(x), where Γ : X ⇒ Y is closed, and consider a mapping Σ : X × X ⇒ Y as well as the iterations 0 ∈ Σ(x_{k+1}, x_k) according to (2.5) with the (solution) sets S(x) = Σ(·, x)^{-1}(0). As before, we study conditions which imply that sequences with x_{k+1} ∈ S(x_k) converge to a zero x̄ of Γ.

7.1 Graph approximation of multivalued Σ(·, x̄)

In [37, 40], the inclusion was studied with general mappings Σ : X × X ⇒ Y. Solvability of 0 ∈ Σ(x_{k+1}, x_k) and the existence of a zero x̄ of Γ are assumed and discussed for particular cases. The conditions for convergence concern
(G1) the inverse of Σ(·, x̄), namely: Φ = Σ(·, x̄)^{-1} has to be locally u.L. at (0, x̄), cf. (D2);
(G2) a relation between the graphs of Σ(·, x) and Σ(·, x̄) via the requirement:
if x′ ∈ S(x) ∩ [x̄ + εB]
then
dist((x′, 0), gph Σ(·, x̄)) ≤ τ
(7.1)
where dist corresponds to the max-norm of X × Y and τ = τ(ε, x − x̄) is small; see (G3). Explicitly, this condition claims: given a solution x′ ∈ x̄ + εB assigned to x, there is some (x″, y′) such that both y′ ∈ Σ(x″, x̄) and max{‖x″ − x′‖, ‖y′ − 0‖} ≤ τ;
(G3) an estimate for the function τ = τ(ε, z) near (0, 0_X), namely: ∃ real functions a_i ≥ 0 (i = 0, 1) such that lim_{s↓0} a_i(s)s^{-i} = 0 and
τ(ε, z) = a0(ε)‖z‖ + a0(‖z‖)ε + a1(ε) + a1(‖z‖).
(7.2)
Notice that ε estimates ‖x′ − x̄‖ and z stands for x − x̄ in (G2); a0 = O, a1 = o, and τ(ε, x − x̄) = O(ε)‖x − x̄‖ + O(‖x − x̄‖)ε + o(ε) + o(‖x − x̄‖). The interplay of these conditions for superlinear convergence is described by

Theorem 10 [40, p. 244] Suppose (G1), (G2) and (G3). Then, for each q ∈ (0, 1), there is some ε > 0 such that all solutions x_{k+1} ∈ S(x_k) ∩ (x̄ + εB_X) (as long as they exist) fulfill ‖x_{k+1} − x̄‖ ≤ q‖x_k − x̄‖ whenever ‖x0 − x̄‖ ≤ ε.
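For intuition about the conclusion, a small experiment (Python; the strongly regular f and the selection Gf ∈ ∂f are illustrative) prints the quotients ‖x_{k+1} − x̄‖/‖x_k − x̄‖ of a nonsmooth Newton iteration; they tend to 0, so every q ∈ (0, 1) works near x̄:

import numpy as np

# Nonsmooth Newton for f(x) = 2x + |x| + x^2, zero xbar = 0 (strongly regular);
# Gf(x) is a selection of the generalized derivative for x != 0.
f  = lambda x: 2.0 * x + abs(x) + x**2
Gf = lambda x: 2.0 + np.sign(x) + 2.0 * x

x, xbar = 0.8, 0.0
err_prev = abs(x - xbar)
for k in range(8):
    x = x - f(x) / Gf(x)          # here x_{k+1} = x_k^2 / (3 + 2 x_k) > 0
    err = abs(x - xbar)
    if err == 0.0:
        print(k, "exact zero reached")
        break
    print(k, "quotient:", err / err_prev)   # tends to 0: superlinear
    err_prev = err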
7.2 Discussion of the hypotheses

The hypotheses (G1), (G2), (G3) look very artificial, so let us discuss them from two viewpoints: how can they be satisfied, and what about their necessity for the convergence of Newton's method? We start by considering a simpler (upper) approximation for the images of Σ(·, x) only.
Remark 13 The graph estimate (7.1) in (G2) is satisfied under a stronger condition for the Σ-images alone (which requires x″ = x′), namely
(G2)′
Σ(x′, x̄ + z) ⊂ Σ(x′, x̄) + τ(ε, z)B_Y
∀x′ ∈ x̄ + εB_X.
(7.3)
The proof is as follows: if 0 ∈ Σ(x′, x̄ + z) and ‖x′ − x̄‖ ≤ ε, put x″ = x′ ∈ x′ + τB_X to obtain 0 ∈ Σ(x″, x̄ + z) ⊂ Σ(x″, x̄) + τB_Y from (7.3). Thus Σ(x″, x̄) ∩ τB_Y ≠ ∅, which implies (7.1). The condition (G2)′ requires some upper semicontinuity of Σ(·, x) at x̄, measured by τ.

Proposition 6 Let Γ = f + M be any multifunction with f from (2.17) and let
Σ(x′, x) = f(x) + Dh(γ(x))(γ(x′) − γ(x)) + M(x′)
(similar to Prop. 4, case 1).
Then the conditions (G2)′ and (G3) are satisfied with a function τ of the form τ = a1(‖z‖) + a0(‖z‖)ε. They are also satisfied if one applies a PBA Σ̂ as under Case 2 of Prop. 4, namely with a function of the form τ = a0(ε)‖z‖ + a1(‖z‖) + a1(ε).

Proof Assume ‖x′ − x̄‖ ≤ ε and a ∈ Σ(x′, x). We have to find some b ∈ Σ(x′, x̄) such that a − b has sufficiently small norm τ. Since a = f(x) + Dh(γ(x))(γ(x′) − γ(x)) + m′ holds with some m′ ∈ M(x′), we obtain b := f(x̄) + Dh(γ(x̄))(γ(x′) − γ(x̄)) + m′ ∈ Σ(x′, x̄). So let us estimate ‖a − b‖, where we abbreviate H̄ = Dh(γ(x̄)), H_x = Dh(γ(x)), γ′ = γ(x′), γ = γ(x), γ̄ = γ(x̄):
‖a − b‖ = ‖f(x) + H_x(γ′ − γ) − [f(x̄) + H̄(γ′ − γ̄)]‖
= ‖f(x) − f(x̄) + H_x(γ′ − γ) − H̄(γ′ − γ̄)‖
= ‖[h(γ) − h(γ̄) − H̄(γ − γ̄)] + H̄(γ − γ̄) + H_x(γ′ − γ) − H̄(γ′ − γ̄)‖
= ‖o(γ − γ̄) + H̄(γ − γ̄ + γ̄ − γ′) + H_x(γ′ − γ)‖
= ‖o(γ − γ̄) + H̄(γ − γ′) + H_x(γ′ − γ)‖
≤ o(x − x̄) + ‖(H_x − H̄)(γ′ − γ)‖
= o(x − x̄) + ‖(H_x − H̄)(γ′ − γ̄ − γ + γ̄)‖
≤ o(x − x̄) + O(x − x̄)‖γ′ − γ̄‖ + O(x − x̄)‖γ − γ̄‖.
Hence
‖a − b‖ ≤ o(x − x̄) + O(x − x̄)‖x′ − x̄‖.
(7.4)
In consequence, condition (7.3) holds with τ = a1(‖z‖) + a0(‖z‖)ε. In the second case, we may use formula (2.23) and continue with
‖x′ − x‖² + ‖x′ − x̄‖² ≤ (‖x′ − x̄‖ + ‖x̄ − x‖)² + ‖x′ − x̄‖² = 2‖x′ − x̄‖‖x̄ − x‖ + ‖x̄ − x‖² + 2‖x′ − x̄‖²,
which is of the form a0(ε)‖z‖ + a1(‖z‖) + a1(ε). ⊔⊓
Obviously, M played no role; the estimate with or without m′ is the same as for M ≡ {0}.

Corollary 1 After replacing Σ(·, x̄) in Prop. 6 by Σ(x′, x̄) = f(x′) + M(x′) (without using H̄ explicitly), one obtains a similar result with τ = a1(‖z‖) + a0(‖z‖)ε + a1(ε).

Proof Indeed, we only have to estimate (deleting m′ as above) d = ‖f(x) + H_x(γ′ − γ) − f(x′)‖ = ‖f(x) + H_x(γ′ − γ) − f(x̄) − H̄(γ′ − γ̄) + o(x′ − x̄)‖, which yields by (7.4) d ≤ o(x − x̄) + O(x − x̄)‖x′ − x̄‖ + o(x′ − x̄). ⊔⊓

Applications in convex analysis The approximations (G2), (G3) above are not only of interest for the settings of Prop. 6 or in view of Newton's method. In order to show how the hypotheses can be satisfied, we present two other examples coming from convex analysis.
1. Proximal points with large exponents: For minimizing a convex function f : IR^n → IR, consider the solvable problems min_ξ f(ξ) + ‖ξ − x‖^p (p > 2, x ∈ IR^n) and
Σ(x′, x) = ∂_{x′}[f(x′) + ‖x′ − x‖^p] = p‖x′ − x‖^{p−2}(x′ − x) + ∂f(x′).
Then 0 ∈ Σ(x′, x) ⇔ x′ ∈ argmin_ξ [f(ξ) + ‖ξ − x‖^p].
(7.5)
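A quick numerical sketch of this iteration (Python; f and p illustrative), where each step minimizes f(ξ) + ‖ξ − x_k‖^p:

from scipy.optimize import minimize_scalar

# Proximal point steps with large exponent p > 2, cf. (7.5):
#   x_{k+1} in argmin_xi [ f(xi) + |xi - x_k|^p ].
# Illustrative data: f(x) = |x| with argmin f = {0} = {xbar}.
f, p = (lambda x: abs(x)), 4

x = 1.0
for k in range(10):
    xk = x
    x = minimize_scalar(lambda z: f(z) + abs(z - xk)**p).x
    print(k, x)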
We assume that x̄ ∈ argmin f. The iterations require finding a minimizer x_{k+1} of f(ξ) + ‖ξ − x_k‖^p. To show (G2) and (G3), we have to consider some x′ ∈ x̄ + εB with 0 ∈ Σ(x′, x), which reads 0 ∈ p‖x′ − x‖^{p−2}(x′ − x) + ∂f(x′), and to find some (x″, y) with y ∈ Σ(x″, x̄) and max{‖x″ − x′‖, ‖y‖} ≤ τ. Here, y ∈ Σ(x″, x̄) means by (7.5) that y ∈ p‖x″ − x̄‖^{p−2}(x″ − x̄) + ∂f(x″). Setting x′ = x″ = h, we know that some related y fulfills p^{-1}‖y‖ ≤ ‖D(h, x)‖, where
D(h, x) = ‖h − x‖^{p−2}(h − x) − ‖h − x̄‖^{p−2}(h − x̄)
and
‖D(h, x)‖ ≤ ‖h − x‖^{p−1} + ‖h − x̄‖^{p−1} = 2‖h − x‖^{p−1} + [‖h − x̄‖^{p−1} − ‖h − x‖^{p−1}].
The bracket can be handled by applying the mean-value theorem to the real function φ(s) = s^{p−1}: φ(s) − φ(s′) = (p − 1)θ^{p−2}(s − s′) with some θ between s′ and s. Accordingly, it follows finally (after further calculations [40]):
Case 1: ‖h − x‖ > ‖h − x̄‖ yields (G2) with τ ≤ a1(ε) + a1(‖z‖).
Case 2: ‖h − x‖ ≤ ‖h − x̄‖ yields (G2) with τ ≤ a1(ε) + a0(ε)‖z‖.
2. Minimizing convex f via so-called ε-subdifferentials: Let f : IR^n → IR be convex, x̄ ∈ argmin f and Γ(x) = ∂f(x). Given any p > 2 and an iteration point x ∈ IR^n, let δ(x) = |f(x) − f(x̄)|^p and let Σ(x′, x) be the set of all so-called δ(x)-subgradients (ε is already occupied) g of f at x′, i.e.,
g ∈ Σ(x′, x) ⇔ f(ξ) ≥ f(x′) + ⟨g, ξ − x′⟩ − δ(x) ∀ξ ∈ IR^n.
(7.6)
Then we obtain x′ ∈ S(x) ⇔ 0 ∈ Σ(x′, x) ⇔ f(ξ) ≥ f(x′) − δ(x) ∀ξ ∈ IR^n. So our method is quite trivial, but first we are interested in the conditions (G2), (G3) only. For showing (G2), we have to consider x′ ∈ S(x) ∩ (x̄ + εB) and to verify ∃(x″, y) with y ∈ Σ(x″, x̄) and max{‖x″ − x′‖, ‖y‖} ≤ τ for some τ from (G3). The relation y ∈ Σ(x″, x̄) coincides with f(ξ) ≥ f(x″) + ⟨y, ξ − x″⟩ − 0 ∀ξ ∈ IR^n, i.e., y ∈ ∂f(x″).
By x′ ∈ S(x) we have δ(x) + inf f ≥ f(x′). Using Ekeland's variational principle [16] with the distance α = √δ(x), there is some x″ ∈ x′ + αB which minimizes f(·) + √δ(x) d(·, x″). Hence, by the maximum-relation between ∂f and directional derivatives, some y satisfies y ∈ ∂f(x″) ∩ √δ(x) B. So we observe y ∈ Σ(x″, x̄), ‖y‖ ≤ √δ(x) and ‖x″ − x′‖ ≤ √δ(x) as well. This is condition (G2) with τ = √δ(x). Finally, recall that f is locally Lipschitz: |f(x) − f(x̄)| ≤ L‖x − x̄‖. For z = x − x̄, then also δ(x) ≤ (L‖z‖)^p is valid, and (G2), (G3) hold with
τ ≤ √δ(x) ≤ √((L‖z‖)^p) = a1(‖z‖).
To solve 0 ∈ Σ(x′, x), we have to know δ(x) = |f(x) − f(x̄)|^p for the step at the iterate x. This is true if, as in many approximation problems, f(x̄) = 0 holds for a minimizer. Otherwise, to ensure that δ(x) in (7.6) is still small enough, one needs δ(x) ≤ |f(x) − f(x̄)|^p, which can only be realized with sufficiently precise lower estimates of |f(x) − f(x̄)|.
In order to obtain a procedure which indeed determines points x′ and related δ(x)-subgradients, consider the perturbed mapping Σ̂(x′, x) = Σ(x′, x) + √δ(x) B. Then x′ ∈ Ŝ(x) ⇔ 0 ∈ Σ̂(x′, x) ⇔ there are y ∈ √δ(x) B and g ∈ Σ(x′, x) such that y + g = 0. Hence, given x, we are looking for pairs (x′, g) with g ∈ Σ(x′, x) and small ‖g‖,
‖g‖ ≤ √δ(x) and f(ξ) ≥ f(x′) + ⟨g, ξ − x′⟩ − δ(x) ∀ξ ∈ IR^n. (7.7)
Ekeland's principle can be applied as above, but for φ(·) = f(·) − ⟨g, ·⟩ in place of f: x′ satisfies δ(x) + inf_ξ [f(ξ) − ⟨g, ξ⟩] ≥ f(x′) − ⟨g, x′⟩. With the distance α = √δ(x) we obtain: some x″ ∈ x′ + αB minimizes φ(·) + √δ(x) d(·, x″). So some y satisfies y ∈ ∂φ(x″) ∩ √δ(x) B, i.e., y ∈ (∂f(x″) − g) ∩ √δ(x) B. Thus h = y + g fulfills ‖h‖ ≤ 2√δ(x) and h ∈ ∂f(x″) = Σ(x″, x̄) = Σ̂(x″, x̄). The upper bound τ = √δ(x) from above now becomes τ = 2√δ(x) with the same properties. In particular, τ does not depend on the distance ε = ‖x′ − x̄‖ and is of type a1(‖z‖).
For both examples, the existence of the iterates x_{k+1} follows from well-known facts of convex
analysis, and the condition (G1) requires upper-Lipschitz behavior of (∂f)^{-1}. The regularity properties (strong, pseudo, upper) of ∂f for convex f on IR^n have been completely characterized in [31, Thm. 5.4]. In particular, upper regularity simply means quadratic growth at x̄, i.e., ∃ε > 0 : f(x̄ + u) ≥ f(x̄) + ε‖u‖² ∀u ∈ εB. The other regularity properties are useful for the versions S2 and S3 of our process P(λ, β, x0, y0, v0) in section 6.1.

The conditions (G1), (G2), (G3) for strongly regular Lipschitz functions Let Γ = f with f ∈ C^{0,1}(IR^n, IR^n) be strongly regular at a zero x̄. According to (2.10), put
Σ(x′, x) = {f(x′)} if x = x̄, Σ(x′, x) = f(x) + Gf(x)(x′ − x) if x ≠ x̄, (7.8)
where we suppose that either Gf(x)(u) = f′(x; u) is the usual directional derivative (provided it exists) or Gf(x)(u) = Tf(x)(u) consists of the Thibault derivative, cf. section 2.3. Under these assumptions, it has been shown in [40, section 2.3]:
Necessity. If, in Thm. 10, the iterates exist for all k and converge superlinearly, then the conditions (G1), (G2), (G3) are satisfied. The proof used results of [39], consequences of Thm. 1 as well as the conditions (CI), (CA) and (CA)* of section 4.
Sufficiency. Conversely, we know from Thm. 10 that superlinear convergence holds true under the conditions (G1), (G2), (G3) if our auxiliary problems are solvable in x̄ + εB. For Gf = Tf, this follows from the inverse derivative rule and non-emptiness of Tf^{-1}(f(x))(v). For Gf = f′, one may first use that 0 ∈ f(x) + Cf(x)(u) has solutions u due to Thm. 1, see (2.13). Since f ∈ C^{0,1} and f is directionally differentiable, Cf(x)(u) consists of f′(x; u) only.
Hence, for both settings of (7.8) and strongly regular f ∈ C^{0,1}(IR^n, IR^n), our conditions are necessary and sufficient for superlinear convergence and the existence of x_{k+1} in Thm. 10. The stronger condition (7.3) = (G2)′ can also be combined with a simpler function τ, namely
(G3)′
τ(ε, z) ≤ c‖z‖ + a0(‖z‖)ε + a1(‖z‖)
(c ≥ 0).
(7.9)
Then the following has been shown, but for functions Γ = f only and with possibly larger q than in Thm. 10:

Theorem 11 [37, Thm. 1] Suppose (G1), (G2)′ and (G3)′ and let cL < 1. Then, for each q ∈ (cL, 1), there is some ρ > 0 such that all x_{k+1} ∈ S(x_k) ∩ (x̄ + ρB_X) (as long as they exist) fulfill ‖x_{k+1} − x̄‖ ≤ q‖x_k − x̄‖, provided that ‖x0 − x̄‖ ≤ ρ.

Clearly, c = 0 and the existence of all x_{k+1} are sufficient for local superlinear convergence.

7.3 Solvability of the auxiliary problems under upper regularity of Σ(·, x̄)

For both theorems 10 and 11, the auxiliary problems are obviously solvable whenever (G1) holds as local upper regularity of Σ(·, x̄) and this regularity is persistent with respect to the variation of Σ(·, x̄) into Σ(·, x) under the additional hypotheses (G2) and (G3). We already know that such persistence can be based on successive approximation and pseudo-regularity. Another way consists in applying Kakutani's theorem. The latter needs convexity but permits general perturbations of a mapping H : X ⇒ Y without violating solvability of y ∈ H(x). Here, H plays the role of Σ(·, x̄) and the perturbation G stands for Σ(·, x). For example, assume we want to minimize a convex function f on IR^n. Then H = ∂f (or Γ = ∂f) is the basic multifunction, and G := G_t := ∂(tf + (1 − t)g) could be the subdifferential of some "homotopy function" tf + (1 − t)g or of f determined with some error. In this context, G is not only a continuous translation of H. Below, we need some upper regularity of H at (X0, 0), where X0 is a set and U = X0 + βB_X, V = βB_Y:
∃L > 0, β > 0 such that ∅ ≠ [H^{-1}(y) ∩ U] ⊂ X0 + L‖y‖B ∀y ∈ V.
(7.10)
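As a toy check (Python; the data are illustrative): for f(x) = x² one has H = ∂f with H(x) = {2x} and H^{-1}(y) = {y/2}, so (7.10) holds at X0 = {0} with rank L = 1/2, reflecting the quadratic growth of f mentioned above.

import numpy as np

# Check of upper regularity (7.10) for H = gradient of f(x) = x^2:
# H^{-1}(y) = {y/2}, X0 = {0}, U = V = beta*B, rank L = 1/2.
beta, L = 1.0, 0.5
for y in np.linspace(-beta, beta, 9):
    x = y / 2.0                            # the unique point of H^{-1}(y)
    assert abs(x) <= L * abs(y) + 1e-15    # dist(x, X0) <= L*|y|
print("(7.10) verified on the sample with L =", L)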
Theorem 12 [31, Thm. 4.5] (persistence of upper regularity). Let H, G : IR^n ⇒ IR^m be closed and let X0 ⊂ H^{-1}(0) be non-empty, convex and compact. Let H be upper regular at (X0, 0) in the sense of (7.10), let the ranges G(x) and H^{-1}(y) (for x ∈ U, y ∈ V) be non-empty and convex, and let G(x) ∩ K ≠ ∅ (∀x ∈ U) for some bounded set K. Finally, let 0 < δ < min{β, βL^{-1}}. Then G has a zero in X0 + δLB if
δB ∩ H(x) ⊂ G(x) + δB ∀x ∈ X0 + δLB.
The same radii for U and V can be arranged by Remark 1. If even G(x) ⊂ H(x) + δB ∀x ∈ U, we may conclude
[0 ∈ G(x) and x ∈ U] ⇒ y ∈ H(x) for some y ∈ δB ⇒ dist(x, X0) ≤ Lδ by (7.10).
This gives an upper estimate of the solutions. For similarly perturbed inclusions in B-spaces and related comments we refer to [31, 36]; for generalized equations where G(x) = h(x) + H(x), to [51]. Clearly, the hypotheses of the theorem are hard to satisfy in the context of nonconvex or non-monotone models, though no Lipschitz condition is required. If H is pseudo-regular at (x0, 0), then there exist a convex, compact set X0 ∋ x0 and a submapping Ĥ ⊂ H which satisfies (7.10), cf. [31, Thm. 2.10].

References

1. S. Adly, R. Cibulka, H. van Ngai. Newton's method for solving inclusions using set-valued approximations. SIAM J. Optim., 25, 159–184 (2015)
2. W. Alt. Stability of Solutions and the Lagrange–Newton Method for Nonlinear Optimization and Optimal Control Problems. Habilitationsschrift, Universität Bayreuth (1990)
3. F.J. Aragón Artacho, M. Belyakov, A.L. Dontchev, M. López. Local convergence of quasi-Newton methods under metric regularity. Computational Optimization and Appl., 58, 225–247 (2014)
4. J.-P. Aubin and I. Ekeland. Applied Nonlinear Analysis. Wiley, New York, 1984
5. C.G. Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19, 577–593 (1965)
6. S. Bütikofer. Globalizing a nonsmooth Newton method via nonmonotone path search. Mathematical Methods of OR, 68, 235–256 (2008)
7. S. Bütikofer and D. Klatte. A nonsmooth Newton method with path search and its use in solving C^{1,1} programs and semi-infinite problems. SIAM Journal on Optimization, 20, 2381–2412 (2010)
8. R. Cibulka, A.L. Dontchev, M.H. Geoffroy. Inexact Newton methods and Dennis–Moré theorems for nonsmooth generalized equations. SIAM J. Control and Opt., 53, 1003–1019 (2015)
9. F.H. Clarke. On the inverse function theorem. Pacific J. of Mathematics, 64, 97–102 (1976)
10. F.H. Clarke. Optimization and Nonsmooth Analysis. Wiley, New York, 1983
11. A.L. Dontchev. Local convergence of the Newton method for generalized equations. C. R. Acad. Sci. Paris, Ser. I, 322, 327–331 (1996)
12. A.L. Dontchev. Local analysis of a Newton-type method based on partial linearization. Lectures in Applied Mathematics, 32, 295–306 (1996)
13. A.L. Dontchev and R.T. Rockafellar. Characterizations of strong regularity for variational inequalities over polyhedral convex sets. SIAM Journal on Optimization, 6, 1087–1105 (1996)
14. A.L. Dontchev and R.T. Rockafellar. Implicit Functions and Solution Mappings; A View from Variational Analysis. Springer, Dordrecht-Heidelberg-London-New York, 2009 (2nd edition 2014)
15. A.L. Dontchev and R.T. Rockafellar. Newton's method for generalized equations: a sequential implicit function theorem. Math. Progr. B, 123, 139–159 (2010)
16. I. Ekeland. On the variational principle. J. Math. Anal. Appl., 47, 324–353 (1974)
17. F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems, Vol. II. Springer, New York, 2003
18. P. Fusek. Eigenschaften pseudo-regulärer Funktionen und einige Anwendungen auf Optimierungsaufgaben. Dissertation, Fachbereich Mathematik, HU Berlin, Febr. 1999
19. P. Fusek. Isolated zeros of Lipschitzian metrically regular R^n functions. Optimization, 49, 425–446 (2001)
20. A. Griewank. The local convergence of Broyden-like methods on Lipschitzian problems in Hilbert spaces. SIAM Journal on Numerical Analysis, 24, 684–705 (1987)
21. A. Griewank. On stable piecewise linearization and generalized algorithmic differentiation. Optimization Methods and Software (2013), DOI: 10.1080/10556788.2013.796683
22. M. Hintermüller, K. Ito, K. Kunisch. The primal-dual active set strategy as a semi-smooth Newton method. SIAM J. Optim., 13, 865–888 (2003)
23. T. Hoheisel, C. Kanzow, B.S. Mordukhovich, H. Phan. Generalized Newton's method based on graphical derivatives. Nonlinear Analysis, 75, 1324–1340 (2012)
24. K. Ito and K. Kunisch. Lagrange Multiplier Approach to Variational Problems and Applications. SIAM series Advances in Design and Control: 15, Philadelphia, 2008
25. A.F. Izmailov and M.V. Solodov. Newton-Type Methods for Optimization and Variational Problems. Springer, 2014
26. H.Th. Jongen, P. Jonker and F. Twilt. Nonlinear Optimization in R^n, I: Morse Theory, Chebychev Approximation. Peter Lang Verlag, Frankfurt a.M.-Bern-New York, 1983
27. H.Th. Jongen, P. Jonker and F. Twilt. Nonlinear Optimization in R^n, II: Transversality, Flows, Parametric Aspects. Peter Lang Verlag, Frankfurt a.M.-Bern-New York, 1986
28. N.H. Josephy. Newton's method for generalized equations and the PIES energy model. Ph.D. Dissertation, Department of Industrial Engineering, University of Wisconsin-Madison, 1979
29. L.W. Kantorovich and G.P. Akilov. Funktionalanalysis in normierten Räumen. Akademie Verlag, Berlin, 1964
30. D. Klatte, A. Kruger and B. Kummer. From convergence principles to stability and optimality conditions. Journal of Convex Analysis, 19, 1043–1072 (2012)
31. D. Klatte and B. Kummer. Nonsmooth Equations in Optimization - Regularity, Calculus, Methods and Applications. Kluwer Academic Publ., Dordrecht-Boston-London, 2002
32. D. Klatte and B. Kummer. Strong Lipschitz stability of stationary solutions for nonlinear programs and variational inequalities. SIAM Optimization, 16, 96–119 (2005)
33. D. Klatte and B. Kummer. Newton methods for stationary points: an elementary view of regularity conditions and solution schemes. Optimization, 56, 441–462 (2007)
34. D. Klatte and B. Kummer. Aubin property and uniqueness of solutions in cone constrained optimization. Mathematical Methods of OR, 77, 291–304 (2013)
35. M. Kojima and S. Shindo. Extension of Newton and quasi-Newton methods to systems of PC^1 equations. Journ. of OR Soc. of Japan, 29, 352–374 (1986)
36. B. Kummer. Generalized equations: solvability and regularity. Math. Program. Studies, 21, 199–212 (1984) (preprint Nr. 30, Sektion Mathematik, HU Berlin, 1982)
37. B. Kummer. Newton's method for non-differentiable functions. In J. Guddat et al., eds., Advances in Math. Optimization. Akademie Verlag Berlin, Ser. Math. Res., 45, 114–125 (1988)
38. B. Kummer. Lipschitzian inverse functions, directional derivatives and application in C^{1,1} optimization. J. Optim. Theory Appl., 70, 559–580 (1991)
39. B. Kummer. Newton's method based on generalized derivatives for nonsmooth functions: convergence analysis. In W. Oettli and D. Pallaschke, editors, Advances in Optimization, 171–194. Springer, Berlin (1992)
40. B. Kummer. Approximation of multifunctions and superlinear convergence. In R. Durier and C. Michelot, eds., Recent Developments in Optimization, Springer, Ser. Lecture Notes in Economics and Mathematical Systems, Vol. 429, 243–251, Berlin (1995)
41. B. Kummer. Lipschitzian and pseudo-Lipschitzian inverse functions and applications to nonlinear programming. In A.V. Fiacco, ed., Mathematical Programming with Data Perturbations, Marcel Dekker, New York, 201–222 (1997)
42. B. Kummer. Metric regularity: characterizations, nonsmooth variations and successive approximation. Optimization, 46, 247–281 (1999)
43. B. Kummer. Newton's method for continuous functions? Script for students (2012), http://www2.mathematik.hu-berlin.de/~kummer/scripts/ContNewton.pdf
44. R. Mifflin. Semismooth and semiconvex functions in constrained optimization. SIAM J. Control Optim., 15, 957–972 (1977)
45. B.S. Mordukhovich. Complete characterization of openness, metric regularity and Lipschitzian properties of multifunctions. Transactions American Math. Soc., 340, 1–35 (1993)
46. J.M. Ortega. The Newton–Kantorovich theorem. The American Mathematical Monthly, 75, 658–660 (1968)
47. J.-S. Pang and L. Qi. Nonsmooth equations: motivation and algorithms. SIAM J. Optim., 3, 443–465 (1993)
48. L. Qi and J. Sun. A nonsmooth version of Newton's method. Math. Program., 58, 353–367 (1993)
49. H. Rademacher. Über partielle und totale Differenzierbarkeit von Funktionen mehrerer Variabeln und über die Transformation der Doppelintegrale. Math. Ann., 79, 340–359 (1919)
50. S.M. Robinson. Extension of Newton's method to nonlinear functions with values in a cone. Numer. Math., 19, 341–347 (1972)
51. S.M. Robinson. Generalized equations and their solutions, Part I: Basic theory. Math. Program. Study, 10, 128–141 (1979)
52. S.M. Robinson. Strongly regular generalized equations. Math. Oper. Res., 5, 43–62 (1980)
53. S.M. Robinson. Newton's method for a class of nonsmooth functions. Set-Valued Analysis, 2, 291–305 (1994) (preprint Univ. of Wisconsin-Madison, 1988)
54. R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Springer, Berlin, 1998
55. S. Scholtes. Introduction to Piecewise Differentiable Equations. Institut für Statistik und Mathematische Wirtschaftstheorie, Preprint No. 53/1994, Universität Karlsruhe (1994)
56. S. Scholtes. Combinatorial Structures in Nonlinear Programming. Working Paper 6/2002, University of Cambridge, The Judge Institute of Management Studies (2002)
57. D. Sun and L. Qi. On NCP functions. J. of Computational Optimization and Appl., 13, 201–220 (1999)
58. M. Ulbrich. Nonsmooth Newton-like Methods for Variational Inequalities and Constrained Optimization Problems in Function Spaces. Habilitationsschrift, Technische Universität München, Fakultät für Mathematik, June 2001, revised February 2002
59. M. Ulbrich. Semismooth Newton methods for operator equations in function spaces. SIAM J. Optim., 13, 805–841 (2003)
60. M. Ulbrich. Semismooth Newton Methods for Variational Inequalities and Constrained Optimization Problems in Function Spaces. SIAM, Philadelphia, 2011
Optim., 3, 443–465 (1993). 48. L. Qi and J. Sun. A nonsmooth version of Newton’s method. Math. Program., 58, 353–367 (1993) 49. H. Rademacher. Ueber partielle und totale Differenzierbarkeit von Funktionen mehrerer Variabeln und ueber die Transformation der Doppelintegrale. Math. Ann. 79, 340-359 (1919) 50. S. M. Robinson Extension of Newton’s method to nonlinear functions with values in a cone. Numer. Math., 19, 341–347 (1972) 51. S. M. Robinson. Generalized equations and their solutions, Part I: Basic theory. Math. Program. Study, 10, 128–141 (1979) 52. S. M. Robinson. Strongly regular generalized equations. Math. Oper. Res., 5, 43–62 (1980). 53. S. M. Robinson. Newton’s method for a class of nonsmooth functions. Set-Valued Analysis, 2, 291–305 (1994) (preprint Univ. of Wisconsin, Michigan, 1988) 54. R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, Berlin, 1998. 55. S. Scholtes. Introduction to Piecewise Differentiable Equations. Institut f¨ ur Statistik und Mathematische Wirtschaftstheorie, Preprint No. 53/1994, Universit¨ at Karlsruhe, (1994). 56. S. Scholtes. Combinatorial Structures in Nonlinear Programming. Working Paper 6/2002, University of Cambridge, The Judge Institute of Management Studies, (2002) 57. D. Sun and L. Qi. On NCP functions. J. of Computational Optimization and Appl. Vol. 13, 201–220, (1999) 58. M. Ulbrich. Nonsmooth Newton-like Methods for Variational Inequalities and Constrained Optimization Problems in Function Spaces. Technische Universitaet Muenchen, Fakultaet fuer Mathematik, June 2001, revised February 2002, Habilitationsschrift. 59. M. Ulbrich. Semismooth Newton methods for operator equations in function spaces. SIAM J. Optim. 13, 805-841, (2003) 60. M. Ulbrich. Semismooth Newton methods for variational inequalities and constrained optimization problems in function spaces, SIAM, Philadelphia, 2011