SOLVING VARIATIONAL INEQUALITIES WITH STOCHASTIC ...

Report 6 Downloads 118 Views
SOLVING VARIATIONAL INEQUALITIES WITH STOCHASTIC MIRROR-PROX ALGORITHM ANATOLI JUDITSKY∗ , ARKADI NEMIROVSKI† , AND CLAIRE TAUVEL‡

arXiv:0809.0815v1 [math.OC] 4 Sep 2008

September 5, 2008 Abstract. In this paper we consider iterative methods for stochastic variational inequalities (s.v.i.) with monotone operators. Our basic assumption is that the operator possesses both smooth and nonsmooth components. Further, only noisy observations of the problem data are available. We develop a novel Stochastic Mirror-Prox (SMP) algorithm for solving s.v.i. and show that with the convenient stepsize strategy it attains the optimal rates of convergence with respect to the problem parameters. We apply the SMP algorithm to Stochastic composite minimization and describe particular applications to Stochastic Semidefinite Feasability problem and Eigenvalue minimization. Key words. Nash variational inequalities, stochastic convex-concave saddle-point problem, large scale stochastic approximation, reduced complexity algorithms for convex optimization AMS subject classifications. 90C15, 65K10, 90C47

1. Introduction. Let Z be a convex compact set in Euclidean space E with inner product h·, ·i, k · k be a norm on E (not necessarily the one associated with the inner product), and F : Z → E be a monotone mapping: (1.1)

∀(z, z ′ ∈ Z) : hF (z) − F (z ′ ), z − z ′ i > 0)

We are interested to approximate a solution to the variational inequality (v.i.) (1.2)

find z∗ ∈ Z : hF (z), z∗ − zi 6 0

∀z ∈ Z

associated with Z, F . Note that since F is monotone on Z, the condition in (1.2) is implied by hF (z∗ ), z − z∗ i > 0 for all z ∈ Z, which is the standard definition of a (strong) solution to the v.i. associated with Z, F . The inverse – a solution to v.i. as defined by (1.2) (a “weak” solution) is a strong solution as well – also is true, provided, e.g., that F is continuous. An advantage of the concept of weak solution is that such a solution always exists under our assumptions (F is well defined and monotone on a convex compact set Z). We quantify the inaccuracy of a candidate solution z ∈ Z by the error (1.3)

Errvi (z) := maxhF (u), z − ui; u∈Z

note that this error is always > 0 and equals zero iff z is a solution to (1.2). In what follows we impose on F , aside of the monotonicity, the requirement (1.4)

∀(z, z ′ ∈ Z) : kF (z) − F (z ′ )k∗ 6 Lkz − z ′ k + M

with some known constants L > 0, M > 0. From now on, (1.5)

kξk∗ = max hξ, zi z:kzk61

∗ LJK,

Universit´ e J. Fourier, B.P. 53, 38041 Grenoble Cedex 9, France, [email protected] Institute of Technology, Atlanta, Georgia 30332, USA, [email protected], research of this author was partly supported by the NSF award DMI-0619977. ‡ LJK, Universit´ e J. Fourier, B.P. 53, 38041 Grenoble Cedex 9, France, [email protected] † Georgia

1

2

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

is the norm conjugate to k · k. We are interested in the case where (1.2) is solved by an iterative algorithm based on a stochastic oracle representation of the operator F (·). Specifically, when solving the problem, the algorithm acquires information on F via subsequent calls to a black box (“stochastic oracle”, SO). At i-th call, i = 0, 1, ..., the oracle gets as input a search point zi ∈ Z (this point is generated by the algorithm on the basis of the information accumulated so far) and returns the vector Ξ(zi , ζi ), where {ζi ∈ RN }∞ i=1 is a sequence of i.i.d. (and independent of the queries of the algorithm) random variables. We suppose that the Borel function Ξ(z, ζ) is such that  (1.6) ∀z ∈ Z : E {Ξ(z, ζ1 )} = F (z), E kΞ(z, ζi ) − F (z)k2∗ 6 N 2 .

We call a monotone v.i. (1.1), augmented by a stochastic oracle (SO), a stochastic monotone v.i. (s.v.i.). To motivate our goal, let us start with known results [5] on the limits of performance of iterative algorithms for solving large-scale stochastic v.i.’s. To “normalize” the situation, assume that Z is the unit Euclidean ball in E = Rn and that n is large. In this case, the rate hof convergence i of a whatever algorithm for solving v.i.’s √ . In other words, for a properly chosen poscannot be better than O(1) Lt + M+N t itive absolute constant C, for every number of steps t, all large enough values of n and any algorithm B for solving s.v.i.’s on the unit ball of Rn , one can point out a monotone s.v.i. satisfying (1.4), (1.6) and such that the expected error of the approximate h isolution z˜t generated by B after t steps , applied to such s.v.i., is at least √ c Lt + M+N for some c > 0. To the best of our knowledge, no one of existing algot rithms allows to achieve, uniformly in the dimension, this convergence rate. In fact, the “best approximations” available are given by Robust Stochastic Approximation √ (see [3] and references therein) with the guaranteed rate of convergence O(1) L+M+N t and extra-gradient-type algorithms for solving deterministic monotone v.i.’s with Lipschitz continuous operators (see [6, 9, 10, 11]), which attains the accuracy O(1) Lt in M the case of M = N = 0 or O(1) √ when L = N = 0. t The goal of this paper is to demonstrate that a specific Mirror-Prox algorithm [6] for solving monotone v.i.’s with Lipschitz continuous operators can be extended onto monotone s.v.i.’s to yield, uniformly in the dimension, the optimal rate of converh i L M+N √ gence O(1) t + . We present the corresponding extension and investigate it t in details: we show how the algorithm can be “tuned” to the geometry of the s.v.i. in question, derive bounds for the probability of large deviations of the resulting error, etc. We also present a number of applications where the specific structure of the rate of convergence indeed “makes a difference”. The main body of the paper is organized as follows: in Section 2, we describe several special cases of monotone v.i.’s we are especially interested in (convex Nash equilibria, convex-concave saddle point problems, convex minimization). We single out these special cases since here one can define a useful “functional” counterpart ErrN (·) of the just defined error Errvi (·); both ErrN and Errvi will participate in our subsequent efficiency estimates. Our main development – the Stochastic Mirror Prox (SMP) algorithm – is presented in Section 3. Some general results obout the performance of the SMP are presented in Section 3.2. Then in Section 4 we present SMP for Stochastic composite minimization and discuss its applications to Stochastic Semidefinite Feasability problem and Eigenvalue minimization. All technical proofs are collected in the appendix.

STOCHASTIC MIRROR-PROX ALGORITHM

3

Notations. In the sequel, lowercase Latin letters denote vectors (and sometimes matrices). Script capital letters, like E, Y, denote Euclidean spaces; the inner product in such a space, say, E, is denoted by h·, ·iE (or merely h·, ·i, when the corresponding space is clear from the context). Linear mappings from one Euclidean space to another, say, from E to F , are denoted by boldface capitals like A (there are also some reserved boldface capitals, like E for expectation, Rk for the k-dimensional coordinate space, and Sk for the space of k × k symmetric matrices). A∗ stands for the conjugate to mapping A: if A : E → F, then A∗ : F → E is given by the identity hf, AeiF = hA∗ f, eiE for f ∈ F, e ∈ E. When both the origin and the destination space of a linear map, like A, are the standard coordinate spaces, the map is identified with its matrix A, and A∗ is identified with AT . For a norm k · k on E, k · k∗ stands for the conjugate norm, see (1.5). For Euclidean spaces E1 , ..., Em , E = E1 × ... × Em denotes their Euclidean direct product, so that a vector from E isPa collection u = [u1 ; ...; um ] (“MATLAB notation”) of vectors uℓ ∈ Eℓ , and hu, viE = ℓ huℓ , vℓ iEℓ . Sometimes we allow ourselves to write (u1 , ..., um ) instead of [u1 ; ...; um ]. 2. Preliminaries. 2.1. Nash v.i.’s and functional error. In the sequel, we shall be especially interested in a special case of v.i. (1.2) – in a Nash v.i. coming from a convex Nash Equilibrium problem, and in the associated functional error measure. The Nash Equilibrium problem can be described as follows: there are m players, i-th of them choosing a point zi from a given set Zi . The loss of i-th player is a given function φi (z) of the collection z = (z1 , ..., zm ) ∈ Z = Z1 × ... × Zm of player’s choices. With slight abuse of notation, we use for φi (z) also the notation φi (zi , z i ), where z i is the collection of choices of all but the i-th players. Players are interested to minimize their losses, and Nash equilibrium zb is a point from Z such that for every i the function φi (zi , zbi ) attains its minimum in zi ∈ Zi at zi = zbi (so that in the state zb no player has an incentive to change his choice, provided that the other players stick to their choices). We call a Nash equilibrium problem convex, if for every i Zi is a compact convex set, φi (zi , z i ) is a Lipschitz continuous function convex in zi and concave in z i , and Pm the function Φ(z) = i=1 φi (z) is convex. It is well known (see, e.g., [8]) that setting   F (z) = F 1 (z); . . . ; F m (z) , F i (z) ∈ ∂zi φi (zi , z i ), i = 1, ..., m where ∂zi φi (zi , z i ) is the subdifferential of the convex function φi (·, z i ) at a point zi , we get a monotone operator such that the solutions to the corresponding v.i. (1.2) are exactly the Nash equilibria. Note that since φi are Lipschitz continuous, the associated operator F can be chosen to be bounded. For this v.i. one can consider, along with the v.i.-accuracy measure Errvi (z), the functional error measure ErrN (z) =

m  X i=1

 φi (z) − min φi (wi , z i ) wi ∈Zi

This accuracy measure admits a transparent justification: this is the sum, over the players, of the incentives for a player to change his choice given that other players stick to their choices. Special cases: saddle points and minimization. An important by its own right particular case of Nash Equilibrium problem is an antagonistic 2-person game, where

4

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

m = 2 and Φ(z) ≡ 0 (i.e., φ2 (z) ≡ −φ1 (z)). The convex case of this problem corresponds to the situation when φ(z1 , z2 ) ≡ φ1 (z1 , z2 ) is a Lipschitz continuous function which is convex in z1 ∈ Z1 and concave in z2 ∈ Z2 , the Nash equilibria are exactly the saddle points (min in z1 , max in z2 ) of φ on Z1 × Z2 , and the functional error becomes ErrN (z1 , z2 ) =

max

(u1 ,u2 )∈Z

[φ(z1 , u1 ) − φ(u1 , z2 )] .

Recall that the convex-concave saddle point problem minz1 ∈Z1 maxz2 ∈Z2 φ(z1 , z2 ) gives rise to the “primal-dual” pair of convex optimization problems (P ) : min φ(z1 ), z1 ∈Z1

(D) : max φ(z2 ), z2 ∈Z2

where φ(z1 ) = max φ(z1 , z2 ), φ(z2 ) = min φ(z1 , z2 ). z2 ∈Z2

z1 ∈Z1

The optimal values Opt(P ) and Opt(D) in these problems are equal, the set of saddle points of φ (i.e., the set of Nash equilibria of the underlying convex Nash problem) is exactly the direct product of the optimal sets of (P ) and (D), and ErrN (z1 , z2 ) is nothing but the sum of non-optimalities of z1 , z2 considered as approximate solutions to respective optimization problems:     ErrN (z1 , z2 ) = φ(z1 ) − Opt(P ) + Opt(D) − φ(z2 ) .

Finally, the “trivial” case m = 1 of the convex Nash Equilibrium is the problem of minimizing a Lipschitz continuous convex function φ(z) = φ1 (z1 ) over the convex compact set Z = Z1 , In this case, the functional error becomes the usual residual in terms of the objective: ErrN (z) = φ(z) − min φ. Z

In the sequel, we refer to the v.i. (1.2) coming from a convex Nash Equilibrium problem as Nash v.i., and to the two just outlined particular cases of the Nash v.i. as the Saddle Point and the Minimization v.i., respectively. It is easy to verify that in the Saddle Point/Minimization case the functional error ErrN (z) is > Errvi (z); this is not necessary so for a general Nash v.i. 2.2. Prox-mapping. We once for ever fix a norm k · k on E; k · k∗ stands for the conjugate norm, see (1.5). A distance-generating function for Z is, by definition, a continuous convex function ω(·) : Z → R such that 1. if Z o be the set of all points z ∈ Z such that the subdifferential ∂ω(z) of ω(·) at z is nonempty, then the subdifferential of ω admits a continuous selection on Z o : there exists a continuous on Z o vector-valued function ω ′ (z) such that ω ′ (z) ∈ ∂ω(z) for all z ∈ Z o ; 2. for certain α > 0, ω(·) is strongly convex, modulus α, w.r.t. the norm k · k: (2.1)

∀(z, z ′ ∈ Z o ) : hω ′ (z) − ω ′ (z ′ ), z − z ′ i > αkz − z ′ k2 .

In the sequel, we fix a distance-generating function ω(·) for Z and assume that ω(·) and Z “fit” each other, meaning that one can easily solve problems of the form (2.2)

min [ω(z) + he, zi] , z∈Z

e ∈ E.

5

STOCHASTIC MIRROR-PROX ALGORITHM

The prox-function associated with the distance-generating function ω is defined as V (z, u) = ω(u) − ω(z) − hω ′ (z), u − zi : Z o × Z → R+ . We set (2.3)

(a) Θ(z) = maxu∈Z V (z, u) [z ∈ Z o ]; (b) (c) Θ = Θ(zc ); (d)

zc Ω

= =

argminZ ω(z); p 2Θ/α.

Note that zc is well defined (since Z is a convex compact set and ω(·) is continuous and strongly convex on Z) and belongs to Z o (since 0 ∈ ∂ω(zc )). Note also that due to the strong convexity of ω and the origin of zc we have α ∀(u ∈ Z) : ku − zc k2 6 Θ 6 max ω(z) − ω(zc ); (2.4) z∈Z 2 in particular we see that (2.5)

Z ⊂ {z : kz − zc k 6 Ω}.

Prox-mapping. Given z ∈ Z o , we associate with this point and ω(·) the proxmapping P (z, ξ) = argmin {ω(u) + hξ − ω ′ (z), ui} ≡ argmin {V (z, u) + hξ, ui} : E → Z o . u∈Z

u∈Z

We illustrate the just-defined notions with three basic examples. Example 1: Euclidean setup. Here E is RN with the standard inner product, k · k2 is the standard Euclidean norm on RN (so that k · k∗ = k · k) and ω(z) = 12 z T z (i.e., Z o = Z, α = 1). Assuming for the sake of simplicity that 0 ∈ Z, zc = 0, Ω = maxz∈Z kzk2 and Θ = 12 Ω2 . The prox-function and the prox-mapping are given by V (z, u) = 12 kz − uk22 , P (z, ξ) = argminu∈Z k(z − ξ) − uk2 . N Example 2: Simplex PN setup. Here E is R , N > 1, with the standard inner product, kzk = kzk1 := j=1 |zj | (so that kξk∗ = maxj |ξj |), Z is a closed convex subset of the standard simplex DN = {z ∈ RN : z > 0,

o

PN

j=1 zj

containing its barycenter, and ω(z) = ′

N X

zj = 1}

j=1

ln zj is the entropy. Then

Z = {z ∈ Z : z > 0} and ω (z) = [1 + ln z1 ; ...; 1 + ln zN , z ∈ Z o . It is easily seen (see, e.g., [3]) that here α = 1, zc = [1/N ; ...; 1/N ], Θ 6 ln(N ) (the √ latter inequality becomes equality when Z contains a vertex of DN ), and thus Ω 6 2 ln N . The prox-function is V (z, u) =

N X

uj ln(uj /zj ),

j=1

and the prox-mapping is easy to compute when Z = DN : !−1 N X zi exp{−ξi } zj exp{−ξj }. (P (z, ξ))j = i=1

6

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

Example 3: Spectahedron setup. This is the “matrix analogy” of the Simplex setup. Specifically, now E is the space of N × N block-diagonal symmetric matrices, N > 1, of a given block-diagonal structure equipped with the Frobenius inner product PN ha, biF = Tr(ab) and the trace norm |a|1 = i=1 |λi (a)|, where λ1 (a) > ... > λN (a) are the eigenvalues of a symmetric N × N matrix a; the conjugate norm |a|∞ is the usual spectral norm (the largest singular value) of a. Z is assumed to be a closed convex subset of the spectahedron S = {z ∈ E : z  0, Tr(z) = 1} containing the matrix N −1 IN . The distance-generating function is the matrix entropy ω(z) =

N X

λj (z) ln λj (z),

j=1

so that Z o = {z ∈ Z : z ≻ 0} and Ω′ (z) = ln(z). This setup, √ similarly to the Simplex one, results in α = 1, zc = N −1 IN , Θ = ln N and Ω = 2 ln N [2]. When Z = S, it is relatively easy to compute the prox-mapping (see [2, 6]); this task reduces to the singular value decomposition of a matrix from E. It should be added that the matrices from S are exactly the matrices of the form a = H(b) ≡ (Tr(exp{b}))−1 exp{b} with b ∈ E. Note also that when Z = S, the prox-mapping becomes “linear in matrix logarithm”: if z = H(a), then P (z, ξ) = H(a − ξ). 3. Stochastic Mirror-Prox algorithm. 3.1. Mirror-Prox algorithm with erroneous information. We are about to present the Mirror-Prox algorithm proposed in [6]. In contrast to the original version of the method, below we allow for errors when computing the values of F – we assume that given a point z ∈ Z, we can compute an approximation Fb (z) ∈ E of F (z). The t-step Mirror-Prox algorithm as applied to (1.2) is as follows: Algorithm 3.1. 1. Initialization: Choose r0 ∈ Z o and stepsizes γτ > 0, 1 6 τ 6 t. 2. Step τ , τ = 1, 2, ..., t: Given rτ −1 ∈ Z o , set ( wτ = P (rτ −1 , γτ Fb(rτ −1 )), (3.1) rτ = P (rτ −1 , γτ Fb(wτ )) . When τ < t, loop to step t + 1. 3. At step t, output " t #−1 t X X zbt = (3.2) γτ γτ wτ . τ =1

τ =1

The preliminary technical result on the outlined algorithm is as follows. Theorem 3.2. Consider t-step algorithm 3.1 as applied to a v.i. (1.2) with a monotone operator F satisfying (1.4). For τ = 1, 2, ..., let us set ∆τ = F (wτ ) − Fb(wτ );

for z belonging to the trajectory {r0 , w1 , r1 , ..., wt , rt } of the algorithm, let ǫz = kFb (z) − F (z)k∗ ,

7

STOCHASTIC MIRROR-PROX ALGORITHM

and let {yτ ∈ Z o }tτ =0 be the sequence given by the recurrence (3.3)

yτ = P (yτ −1 , γτ ∆τ ), y0 = r0 .

Assume that α γτ 6 √ , 3L

(3.4) Then

Errvi (b zt ) 6

(3.5)

t X

γτ

τ =1

!−1

Γ(t),

where Errvi (b zt ) is defined in (1.3), (3.6)

Γ(t) = 2Θ(r0 ) +

 t X 3γ 2 τ

τ =1

+

t X



2

M + (ǫrτ −1

ǫ2 + ǫwτ ) + wτ 3 2



hγτ ∆τ , wτ − yτ −1 i

τ =1

and Θ(·) is defined by (2.3). Finally, when (1.2) is a Nash v.i., one can replace Errvi (b zt ) in (3.5) with ErrN (b zt ). 3.2. Main result. From now on, we focus on the case when Algorithm 3.1 solves monotone v.i. (1.2), and the corresponding monotone operator F is represented by a stochastic oracle. Specifically, at the i-th call to the SO, the input being z ∈ Z, the oracle returns the vector Fb = Ξ(z, ζi ),, where {ζi ∈ RN }∞ i=1 is a sequence of i.i.d. random variables, and Ξ(z, ζ) : Z × RN → E is a Borel function. We refer to this specific implementation of Algorithm 3.1 as to Stocastic Mirror Prox (SMP) algorithm. In the sequel, we impose on the SO in question the following assumption, slightly milder than (1.6): Assumption I: With some µ ∈ [0, ∞), for all z ∈ Z we have (3.7)

(a) (b)

kE{Ξ(z, ζi ) − F (z)} k ∗ 6 µ E kΞ(z, ζi ) − F (z)k2∗ 6 M 2 .

In some cases, we augment Assumption I by the following Assumption II: For all z ∈ Z and all i we have  (3.8) E exp{kΞ(z, ζi ) − F (z)k2∗ /M 2 } 6 exp{1}.

Note that Assumption II implies (3.7.b), since   exp{E kΞ(z, ζi ) − F (z)k2∗ /M 2 } 6 E exp{Ξ(z, ζi ) − F (z)k2∗ /M 2 }

by the Jensen inequality. Remark 3.3. Observe that that the accuracy of Algorithm 3.1 (cf. (3.6)) depends in the same way on the “size” of perturbation ǫz = kFb(z)−F (z)k∗ and the bound M of (1.4) on the variation of the non-Lipschitz component of F . This is why, to simplify

8

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

the presentation, we decided to use the same bound M for the scale of perturbation Ξ(z, ζi ) − F (z) in (3.7), (3.8). Remark 3.4. From now on, we assume that the starting point r0 in Algorithm 3.1 is the minimizer zc of ω(·) on Z. Further, to avoid unnecessarily complicated formulas (and with no harm to the efficiency estimates) we stick to the constant stepsize policy γτ ≡ γ, 1 6 τ 6 t, where t is a fixed in advance number of iterations of the algorithm. Our main result is as follows: Theorem 3.5. Let v.i. (1.2) with monotone operator F satisfying (1.4) be solved by t-step Algorithm 3.1 using a SO, and let the stepsizes γτ ≡ γ, 1 6 τ 6 t, satisfy 0 < γ 6 √α3L , see (1.4). Then (i) Under Assumption I, one has  2   αΩ 21M 2 γ E Errvi (b zt ) 6 K0 (t) ≡ (3.9) + 2µΩ, + tγ 2α

where M is the constant from (1.4) and Ω is given by (2.3). (ii) Under Assumptions I, II, one has, in addition to (3.9), for any Λ > 0,  (3.10) Prob Errvi (b zt ) > K0 (t) + ΛK1 (t) 6 exp{−Λ2 /3} + exp{−Λt}, where

K1 (t) =

2M Ω 7M 2 γ + √ . 2α t

In the case of a Nash v.i., Errvi (·) in (3.9), (3.10) can be replaced with ErrN (·). When optimizing the bound (3.9) in γ, we get the following Corollary 3.6. In the situation of Theorem 3.5, let the stepsizes γτ ≡ γ be chosen according to # " r α αΩ 2 (3.11) . γ = min √ , 3L M 21t Then under Assumption I one has (3.12)

h 2 i  √ E Errvi (b zt 6 K0∗ (t) ≡ max 47 Ω t L + 7 ΩM + 2µΩ, . t

(see (2.3)). Under Assumptions I, II, one has, in addition to (3.12), for any Λ > 0,  (3.13) Prob Errvi (b zt ) > K0∗ (t) + ΛK1∗ (t) 6 exp{−Λ2 /3} + exp{−Λt} with

K1∗ (t) =

7 ΩM √ . 2 t

In the case of a Nash v.i., Errvi (·) in (3.12), (3.13) can be replaced with ErrN (·). 3.3. Comparison with Robust Mirror SA Algorithm. Consider the case of a Nash s.v.i. with operator F satisfying (1.4) with L = 0, and let the SO be unbiased (i.e., µ = 0). In this case, the bound (3.12) reads (3.14)

7ΩM E {ErrN (b zt } 6 √ , t

9

STOCHASTIC MIRROR-PROX ALGORITHM

where 2

M = max



sup kF (z) − F (z

z,z ′ ∈Z



)k2∗ ,

 sup E kΞ(z, ζi ) − F (z)k2∗

z∈Z



The bound (3.14) looks very much like the efficiency estimate ΩM E {ErrN (˜ zt )} 6 O(1) √ t

(3.15)

(from now on, all O(1)’s are appropriate absolute positive constants) for the approximate solution z˜t of the t-step Robust Mirror SA (RMSA) algorithm [3]1) . In the latter estimate, Ω is exactly the same as in (3.14), and M is given by    2 M = max sup kF (z)k2∗ ; sup E kΞ(z, ζi ) − F (z)k2∗ . z

z∈Z

Note that we always have M 6 2M , and typically M and M are of the same order of magnitude; it may happen, however (think of the case when F is “almost constant”), that M ≪ M . Thus, the bound (3.14) never is worse, and sometimes can be much better than the SA bound (3.15). It should be added that as far as implementation is concerned, the SMP algorithm is not more complicated than the RMSA (cf. the description of Algorithm 3.1 with the description rt = P (rt−1 , Fb (rt−1 )), " t #−1 t X X zbt = γτ γτ rτ , τ =1

τ =1

of the RMSA). The just outlined advantage of SMP as compared to the usual Stochastic Approximation is not that important, since “typically” M and M are of the same order. We believe that the most interesting feature of the SMP algorithm is its ability to take advantage of a specific structure of a stochastic optimization problem, namely, insensitivity to the presence in the objective of large, but smooth and well-observable components. We are about to consider several less straightforward applications of the outlined insensitivity of the SMP algorithm to smooth well-observed components in the objective. 4. Application to Stochastic Approximation: Stochastic composite minimization. 4.1. Problem description. Consider the optimization problem as follows (cf. [5]): (4.1)

min φ(x) := Φ(φ1 (x), ..., φm (x)),

x∈X

where 1) In this reference, only the Minimization and the Saddle Point problems are considered. However, the results of [3] can be easily extended to s.v.i.’s.

10

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

1. X ⊂ X is a convex compact; the embedding space X is equipped with a norm k · kx , and X – with a distance-generating function ωx (x) with certain parameters αx , Θx , Ωx w.r.t. the norm k · kx ; 2. φℓ (x) : X → Eℓ , 1 6 ℓ 6 m, are Lipschitz continuous mappings taking values in Euclidean spaces Eℓ equipped with norms (not necessarily the Euclidean ones) k · k(ℓ) with conjugates k · k(ℓ,∗) and with closed convex cones Kℓ . We suppose that φℓ are Kℓ -convex, i.e. for any x, x′ ∈ X, λ ∈ [0, 1], φℓ (λx + (1 − λ)x′ ) 6Kℓ λφℓ (x) + (1 − λ)φℓ (x′ ), where the notation a 6K b ⇔ b >K a means that b − a ∈ K. In addition to these structural restrictions, we assume that for all v, v ′ ∈ X, h ∈ X , (a) (b)

(4.2)

k[φ′ℓ (v) − φ′ℓ (v ′ )]hk(ℓ) 6 [Lx kv − v ′ kx + Mx ]khkx k[φ′ℓ (v)]hk(ℓ) 6 [Lx Ωx + Mx ]khkx

for certain selections φ′ℓ (x) ∈ ∂ Kℓ φℓ (x), x ∈ X 2) and certain nonnegative constants Lx and Mx . 3. Functions φℓ (·) are represented by an unbiased SO. At i-th call to the oracle, x ∈ X being the input, the oracle returns vectors fℓ (x, ζi ) ∈ Eℓ and linear mappings Gℓ (x, ζi ) from X to Eℓ , 1 6 ℓ 6 m ({ζi } are i.i.d. random vectors) such that for any x ∈ X and i = 1, 2, ...,

(4.3)

(a) E {f   ℓ (x, ζi )} = φℓ (x), 1 6 ℓ 6 m (b) E max kfℓ (x, ζi ) − φℓ (x)k2(ℓ) 6 Mx2 Ω2x ; 16ℓ6m

(c)

(d)

E( {Gℓ (x, ζi )} = φ′ℓ (x), 1 6 ℓ 6 m, )

E

max k[Gℓ (x, ζi ) − φ′ℓ (x)]hk2(ℓ)

h∈X khkx 61

6 Mx2 , 1 6 ℓ 6 m.

4. Φ(·) is a convex function on E = E1 × ... × Em given by the representation (4.4)

Φ(u1 , ..., um ) = max y∈Y

(m X ℓ=1

)

huℓ , Aℓ y + bℓ iEℓ − Φ∗ (y)

for uℓ ∈ Eℓ , 1 6 ℓ 6 m. Here (a) Y ⊂ Y is a convex compact set containing the origin; the embedding Euclidean space Y is equipped with a norm k·ky , and Y - with a distancegenerating function ωy (y) with parameters αy , Θy , Ωy w.r.t. the norm k · ky ; (b) The affine mappings y 7→ Aℓ y + bℓ : Y → Eℓ are such that Aℓ y + bℓ ∈ Kℓ∗ for all y ∈ Y and all ℓ; here Kℓ∗ is the cone dual to Kℓ ; 2) For a K-convex function φ : X → E (X ⊂ X is convex, K ⊂ E is a closed convex cone) and x ∈ X, the K-subdifferential ∂ K φ(x) is comprised of all linear mappings h 7→ Ph : X → E such that φ(u) >K φ(x) + P(u − x) for all u ∈ X. When φ is Lipschitz continuous on X, ∂ K φ(x) 6= ∅ for all x ∈ X; if φ is differentiable at x ∈ int X (as it is the case almost everywhere on int X), one has ∂φ(x) ∈ ∂ K φ(x). ∂x

11

STOCHASTIC MIRROR-PROX ALGORITHM

(c) Φ∗ (y) is a given convex function on Y such that kΦ′∗ (y) − Φ′∗ (y ′ )ky,∗ 6 Ly ky − y ′ ky + My

(4.5)

for certain selection Φ′∗ (z) ∈ ∂Φ∗ (y), y ∈ Y . Example: Stochastic Matrix Minimax problem (SMMP). For 1 6 ℓ 6 m, let Eℓ = Spℓ be the space of symmetric pℓ × pℓ matrices equipped with the Frobenius inner product pℓ hA, BiF = Tr(AB) and the spectral norms | · |∞ , and let Kℓ be the cone S+ of symmetric positive semidefinite pℓ × pℓ matrices. Consider the problem ! m X T (P ) min max λmax Pjℓ φℓ (x)Pjℓ , x∈X 16j6k

ℓ=1

where Pjℓ are given pℓ × qj matrices, and λmax (A) is the maximal eigenvalue of a symmetric matrix A. Observing that for a symmetric p × q matrix A one has λmax (A) = max Tr(AS) S∈Sq

where Sq = {S ∈ Sq+ : Tr(S) = 1}. When denoting by Y the set of all symmetric positive semidefinite block-diagonal matrices y = Diag{y1 , ..., yk } with unit trace and diagonal blocks yj of sizes qj × qj , we can represent (P ) in the form of (4.1), (4.4) with ! m X T Φ(u) := max λmax Pjℓ uℓ Pjℓ 16j6k

=

=

=

ℓ=1

max

y=Diag{y1 ,...,yk }∈Y

max

y=Diag{y1 ,...,yk }∈Y

max

y=Diag{y1 ,...,yk }∈Y

k X

Tr

j=1

m X ℓ=1

m X ℓ=1

m X

T Pjℓ uℓ Pjℓ yk

ℓ=1





Tr uℓ 

k X j=1

!



T Pjℓ yk Pjℓ 

huℓ , Aℓ yiF

Pk T yj Pjℓ ). The set Y is the spectahedron in the space Sq of (we put Aℓ y = j=1 Pjℓ symmetric block-diagonal matrices with k diagonal blocks of the sizes qj × qj , 1 6 j 6 Pk k. When equipping Y with the spectahedron setup, we get αy = 1, Θy = ln( j=1 qj ) q Pk and Ωy = 2 ln( j=1 qj ), see Section 2.2. Observe that in the simplest case of k = m, pj = qj , 1 6 j 6 m and Pjℓ equal to Ip for j = ℓ and to 0 otherwise, the SMMP problem becomes   (4.6) min max λmax (φℓ (x)) . x∈X

16ℓ6m

If, in addition, pj = qj = 1 for all j, we arrive at the usual (“scalar”) minimax problem   (4.7) min max φℓ (x) x∈X

16ℓ6m

12

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

with convex real-valued functions φℓ . Observe that in the case of (4.4), the optimization problem (4.1) is nothing but the primal problem associated with the saddle point problem " # m X (4.8) hφℓ (x), Aℓ y + bℓ iEℓ − Φ∗ (y) min max φ(x, y) = x∈X y∈Y

ℓ=1

and the cost function in the latter problem is Lipschitz continuous and convex-concave due to the Kℓ -convexity of φℓ (·) and the condition Aℓ y + bℓ ∈ Kℓ∗ whenever y ∈ Y . The associated Nash v.i. is given by the domain Z and the monotone mapping "m # m X X F (z) ≡ F (x, y) = (4.9) [φ′ℓ (x)]∗ [Aℓ y + bℓ ]; − A∗ℓ φℓ (x) + Φ′∗ (y) . ℓ=1

ℓ=1

The advantage of the v.i. reformulation of (4.1) is that F is linear in φℓ (·), so that the initial unbiased SO for φℓ induces an unbiased stochastic oracle for F , specifically, the oracle "m # m X X ∗ ∗ ′ Ξ(x, y, ζi ) = (4.10) Gℓ (x, ζi )[Aℓ y + bℓ ]; − Aℓ fℓ (x, ζi ) + Φ∗ (y) . ℓ=1

ℓ=1

We are about to use this oracle in order to solve the stochastic composite minimization problem (4.1) by the SMP algorithm. 4.2. Setup for the SMP as applied to (4.9). In retrospect, the setup for SMP we are about to present is a kind of the best – resulting in the best possible efficiency estimate (3.12) – we can build from the entities participating in the description of the problem (4.1). Specifically, we equip the space E = X × Y with the norm q k[(x, y)k ≡ kxk2x /Ω2x + kyk2y /Ω2y ; the conjugate norm clearly is

k(ξ, η)k∗ =

q Ω2x kξk2x,∗ + Ω2y kηk2y,∗ .

Finally, we equip Z = X × Y with the distance-generating function ω(x, y) =

1 1 ωx (x) + ωy (y). 2 αx Ωx αy Ω2y

The SMP-related properties of our setup are summarized in the following Lemma 4.1. Let (4.11)

A=

max

y∈Y:kyky 61

m X ℓ=1

kAℓ yk(ℓ,∗) , B =

m X ℓ=1

kbℓ k(ℓ,∗) .

(i) The parameters of the just defined distance-generating function ω w.r.t. the √ just defined norm k · k are α = 1, Θ = 1, Ω = 2. (ii) One has (4.12)

∀(z, z ′ ∈ Z) : kF (z) − F (z ′ )k∗ 6 Lkz − z ′ k + M,

STOCHASTIC MIRROR-PROX ALGORITHM

13

where L = 5AΩx Ωy [Ωx Lx + Mx ] + BΩ2x Lx + Ω2y Ly M = [2AΩy + kbk1 ] Ωx Mx + Ωy My Besides this, (4.13)

∀(z ∈ Z, i) : E {Ξ(z, ζi )} = F (z);

 E kΞ(z, ζi ) − F (z)k2∗ 6 M 2 .

Furethermore, if relations (4.3.b,d) are strengthened to    2 2 E exp max kfℓ (x, ζi ) − φℓ (x)k(ℓ) /(Ωx M ) 6 exp{1}, )) ( (16ℓ6m (4.14) 6 exp{1}, 1 6 ℓ 6 m, E exp max k[Gℓ (x) − φ′ℓ (x)]hk2(ℓ) /M 2 h∈X , khkx 61

then (4.15)

 E exp{kΞ(z, ζi ) − F (z)k2∗ /M 2 } 6 exp{1}.

Combining Lemma 4.1 with Corollary (3.6) we get explicit efficiency estimates for the SMP algorithm as applied to the Stochastic composite minimization problem (4.1). 4.3. Application to Stochastic Semidefinite Feasibility problem. Assume we are interested to solve a feasible system of matrix inequalities (4.16)

ψℓ (x)  0, ℓ = 1, ..., m & x ∈ X,

where m > 1, X ⊂ X is as in the description of the Stochastic composite problem, and ψℓ (·) take values in the spaces Eℓ = Spℓ of symmetric pℓ × pℓ matrices. We equip Eℓ with the Frobenius inner product, the semidefinite cone Kℓ = Sp+ℓ and the spectral norm k · k(ℓ) = | · |∞ (recall that |A|∞ is the maximal singular value of matrix A). We assume that ψℓ are Lipschitz continuous and Kℓ = Sp+ℓ -convex functions on X such that for all x, x′ ∈ X and for all ℓ one has max

(4.17)

h∈X , khkx 61

max

h∈X , khkx 61

|[ψℓ′ (x) − ψℓ′ (x′ )]h|∞ 6 Lℓ kx − x′ k(ℓ) + Mℓ ,

|ψℓ′ (x)h|∞ 6 Lℓ Ωx + Mℓ

for certain selections ψℓ′ (x) ∈ ∂ Kℓ ψℓ (x), x ∈ X, with some known nonnegative constants Lℓ , Mℓ . We assume that ψℓ (·) are represented by an SO which at i-th call, the input being b ℓ (x, ζi ) from X to x ∈ X, returns the matrices fbℓ (x, ζi ) ∈ Spℓ and the linear maps G Eℓ such that for all x ∈ X it holds n o n o b ℓ (x, ζi ) = ψ ′ (x), 1 6 ℓ 6 m (a) E fbℓ (x, ζi ) = ψℓ (x), E G   ℓ (b) E max |fbℓ (x, ζi ) − ψℓ (x)|2∞ /(Ωx Mℓ )2 6 1 (4.18) ) (16ℓ6m b ℓ (x, ζi ) − ψ ′ (x)]h|2∞ /M 2 6 1, 1 6 ℓ 6 m. (c) E max |[G h∈X , khkx 61





Given a number t of steps of the SMP algorithm, let us act as follows.

14

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

A. We compute the m quantities µℓ = (4.19)

Ω√ x Lℓ t

+ Mℓ , ℓ = 1, ..., m, and set √ µ t µ , Mx = µ. µ = max µℓ , βℓ = , φℓ (·) = βℓ ψℓ (·), Lx = 16ℓ6m µℓ Ωx

Note that by construction βℓ > 1 and Lx /Lℓ > βℓ , Mx /Mℓ > βℓ for all ℓ, so that the functions φℓ satisfy (4.2) with the just defined Lx , Mx . Further, the SO for ψℓ (·)’s can be converted into an SO for φℓ (·)’s by setting b ℓ (x, ζ). fℓ (x, ζ) = βℓ fbℓ (x, ζ), Gℓ (x, ζ) = βℓ G

By (4.18), this oracle satisfies (4.3). B. We then build the Stochastic Matrix Minimax problem (4.20)

min max λmax (φi (x)),

x∈X 16ℓ6m

associated with the just defined φ1 , ..., φm , that is, the Stochastic composite problem (4.1) associated with φ1 , ..., φm and the outer function Φ(u1 , ..., um ) = max λmax (uℓ ) = max 16ℓ6m

y∈Y

m X ℓ=1

huℓ , yℓ iF ,

Y = {y = Diag{y1 , ..., ym } ∈ Y = Sp1 × ... × Spm : y  0, Tr(y) = 1} ⊂ Y = Sp1 × ... × Spm . Thus in the notation from (4.4) we have Aℓ y = yℓ , bℓ = 0, Φ∗ ≡ 0. Hence Lx = Mx = 0, and Y is a spectahedron. We equip Y and Y with the Spectahedron setup, arriving at v u m m X X u αy = 1, Θy = ln pℓ , Ωy = t2 ln pℓ . ℓ=1

ℓ=1

C. We have specified all entities participating in the description of the Stochastic composite problem. It is immediately seen that these entities satisfy all conditions of Section 4.1. We can now solve the resulting Stochastic composite problem by t-step SMP algorithm with the setup presented in Section 4.2. The corresponding convex-concave saddle point problem is min max

x∈X y∈Y

m X ℓ=1

βℓ hψℓ (x), yℓ iF ;

with the monotone operator and SO, respectively, "m # X ′ ∗ F (z) ≡ F (x, y) = βℓ [ψℓ (x)] yℓ ; − Diag {α1 ψ1 (x), ..., αm ψm (x)} , Ξ((x, y), ζ) =

"

ℓ=1

m X ℓ=1

# o b ∗ (x, ζ)yℓ ; − Diag α1 fb1 (x, ζ), ..., αm fbm (x, ζ)) . βℓ G ℓ n

Combining Lemma 4.1, Corollary 3.6 and taking into account the origin of the quantities Lx , Mx , and that A = 1, B = 0 3) , we arrive at the following result: 3)

Pm

See (4.11) andPnote that we are in the case when bℓ = 0 and k · k(ℓ,∗) is the trace norm; thus, = m ℓ=1 |yℓ |1 = |y|1 = kyky .

ℓ=1 kAℓ yk(ℓ,∗)

STOCHASTIC MIRROR-PROX ALGORITHM

15

Proposition 4.2. With the outlined construction, the resulting s.v.i. reads find z∗ ∈ Z = X × Y : hF (z), z − z∗ i > 0 ∀z ∈ Z,

(4.21)

for the monotone operator F which satisfies (1.4) with "

L = 10 ln

m X ℓ=1

pℓ

# 12

" m # 12 X √ Ωx µ( t + 1), M = 4 ln pℓ Ωx µ; ℓ=1

Beside this, the resulting SO for F satisfies (4.13) with the just defined value of M . Let now  " −1 # 12 m X √ γ = 10 3 ln pℓ Ωx µ( t + 1) , 1 6 τ 6 t. ℓ=1

When applying to (4.21) the t-step SMP algorithm with the constant stepsizes γτ ≡ γ (cf. (3.11) and note that we are in the situation α = Θ = 1), we get an approximate solution zbt = (b xt , ybt ) such that

1 Pm  Ωx [ln ℓ=1 pℓ ] 2 µ √ max βℓ λmax (ψℓ (b xt )) 6 80 16ℓ6m t √ (cf. (3.12) and take into account that we are in the case of Ω = 2, while the optimal value in (4.20) is nonpositive, since (4.16) is feasible). Furthermore, if assumptions (4.18.b,c) are strengthened to   2 2 b E max exp{|fℓ (x, ζi ) − ψℓ (x)|∞ /(Ωx Mℓ ) } 6 exp{1}, 16ℓ6m   ′ 2 2 b E exp{ max |[Gℓ (x, ζi ) − ψℓ (x)]h|∞ /Mℓ } 6 exp{1}, 1 6 ℓ 6 m,

(4.22)

E



h∈X , khkx 61

then, in addition to (4.22), we have for any Λ > 0: (

Prob

max βℓ λmax (ψℓ (b xt )) > 80

16ℓ6m

Ωx [ln

) 1 1 Pm P 2 2 15 [ln m ℓ=1 pℓ ] µ ℓ=1 pℓ ] µ √ √ +Λ t t

6 exp{−Λ2 /3} + exp{−Λt}.

Discussion. Imagine that instead of solving the system of matrix inequalities (4.16), we were interested to solve just a single matrix inequality ψℓ (x)  0, x ∈ X. When solving this inequality by the SMP algorithm as explained above, the efficiency estimate would be    Ωx L ℓ Mℓ µℓ 1/2 1/2 ℓ E ψℓ (b xt ) 6 O(1) [ln(pℓ + 1)] Ωx + √ = O(1) [ln(pℓ + 1)] Ωx √ t t t 1/2 −1 Ωx µ = O(1) [ln(pℓ + 1)] βℓ √ , t

16

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

(recall that the matrix inequality in question is feasible), where x bℓt is the resulting approximate solution. Looking at (4.22), we see that the expected accuracy of the SMP P as applied, in the aforementioned manner, to (4.16) is only by a logarithmic in ℓ pℓ factor worse: "

xt )} 6 O(1) ln (4.23) E {ψℓ (b

m X ℓ=1

pℓ

#1/2

Ωx µ βℓ−1 √ t

"

= O(1) ln

m X ℓ=1

pℓ

#1/2

Ωx µℓ √ . t

Thus, as far as the quality of the SPM-generated solution is concerned, passing from solving a single matrix inequality to solving a system of m inequalities is “nearly costless”. As an illustration, consider the case where some of ψℓ are “easy” – smooth and easy-to-observe (Mℓ = 0), while the remaining ψℓ are “difficult”, i.e., might be non-smooth and/or difficult-to-observe (Lℓ = 0). In this case, (4.23) reads E {ψℓ (b xt )} 6

(

P 1/2 O(1) [ln m pℓ ] Pℓ=1 1/2 m O(1) [ln ℓ=1 pℓ ]

Ω2x Lℓ t , Ωx √Mℓ , t

ψℓ is easy, ψℓ is difficult.

In other words, the violations of the easy and the√ difficult constraints in (4.16) converge to 0 as t → ∞ with the rates O(1/t) and O(1/ t), respectively. It should be added that when X is the unit Euclidean ball in X = Rn and X,√X are equipped with the Euclidean setup, the rates of convergence O(1/t) and O(1/ t) are the best rates one can achieve without imposing bounds on n and/or imposing additional restrictions on ψℓ ’s. 4.4. Eigenvalue optimization via SMP. The problem we are interested in now is (4.24)

Opt = X

=

min f (x) := λmax (A0 + x1 A1 + ... + xn An ), Pn {x ∈ Rn : x > 0, i=1 xi = 1},

x∈X

where A0 , A1 , ..., An , n > 1, are given symmetric matrices with common blockdiagonal structure (p1 , ..., pm ). I.e., all Aj are block-diagonal with diagonal blocks Aℓj of sizes pℓ × pℓ , 1 6 ℓ 6 m. We denote p(κ) =

m X

pκℓ , κ = 1, 2, 3; pmax = max pℓ . ℓ

ℓ=1

Setting φℓ : X 7→ Eℓ = Spℓ , φℓ (x) = Aℓ0 +

n X

xj Aℓj , 1 6 ℓ 6 m,

j=1

we represent (4.24) as a particular case of the Matrix Minimax problem (4.6), with all functions φℓ (x) being affine and X being the standard simplex in X = Rn . Now, since Aj are known in advance, there is nothing stochastic in our problem, and it can be solved either by interior point methods, or by “computationally cheap” gradient-type methods; these latter methods are preferable when the problem is largescale and medium accuracy solutions are sought. For instance, one can apply the t-step

STOCHASTIC MIRROR-PROX ALGORITHM

17

(deterministic) Mirror Prox algorithm from [6] to the saddle point reformulation (4.8) of our specific Matrix Minimax problem, i.e., to the saddle point problem Pn min max hy, A0 + j=1 xj Aj iF , x∈X y∈Y (4.25)  Y = y = Diag{y1 , ..., ym } : yℓ ∈ Sp+ℓ , 1 6 ℓ 6 m, Tr(Y ) = 1 .

The accuracy of the approximate solution x ˜t of the (deterministic) Mirror Prox algorithm is [6, Example 2] p ln(n) ln(p(1) )A∞ f (˜ xt ) − Opt 6 O(1) . t This efficiency estimate is the best known so far among those attainable with “computationally cheap” deterministic methods. On the other hand, the complexity of one step of the algorithm is dominated, up to an absolute constant factor, by the necessity, given x ∈ X and y ∈ Y , P 1. to compute the matrix A0 + nj=1 xj Aj and the vector [Tr(Y A1 ); ...; Tr(Y An )]; 2. to compute the eigenvalue decomposition of y. When using the standard Linear Algebra, the computational effort per step is Cdet = O(1)[np(2) + p(3) ] arithmetic operations. We are about to demonstrate that one can equip the deterministic problem in question by an “artificial” SO in such a way that the associated SMP algorithm, under certain circumstances, exhibits better performance than deterministic algorithms. Let us consider the following construction of the SO for F (different from the SO (4.10)!). Observe that the monotone operator associated with the saddle point problem (4.25) is     Xn   xj Aj  . F (x, y) = [Tr(yA1 ); ...; Tr(yAn )]; −A0 − j=1 {z } | | {z } x

(4.26)

F (x,y)

F y (x,y)

Given x ∈ X, y = Diag{y1 , ..., ym } ∈ Y , we build a random estimate Ξ = [Ξx ; Ξy ] of F (x, y) = [F x (x, y); F y (x, y)] as follows: 1. we generate a realization  of a random variable taking values 1, ..., n with probabilities x1 , ..., xn (recall that x ∈ X, the standard simplex, so that x indeed can be seen as a probability distribution), and set (4.27)

Ξy = A0 + A ;

2. we compute the Pmquantities νℓ = Tr(yℓ ), 1 6 ℓ 6 m. Since y ∈ Y , we have νℓ > 0 and ℓ=1 νℓ = 1. We further generate a realization ı of random variable taking values 1, ..., m with probabilities ν1 , ..., νm , and set (4.28)

Ξx = [Tr(Aı1 y¯ı ); ...; Tr(Aın y¯ı )], y¯ı = (Tr(yı ))−1 yı .

The just defined random estimate Ξ of F (x, y) can be expressed as a deterministic function Ξ(x, y, η) of (x, y) and random variable η uniformly distributed on [0, 1]. Given x, y and η, the value of this function can be computed with the arithmetic cost

18

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

O(1)(n(pmax )2 + p(2) ) (indeed, O(1)(n + p(1) ) operations are needed to convert η into ı and , O(1)p(2) operations are used to write down the y-component −A0 − A of Ξ, and O(1)n(pmax )2 operations are needed to compute Ξx ). Now consider the SO’s Ξk (k is a positive integer) obtained by averaging the outputs of k calls to our basic oracle Ξ. Specifically, at the i-t call to the oracle Ξk , z = (x, y) ∈ Z = X × Y being the input, the oracle returns the vector Ξk (z, ζi ) =

k 1X Ξ(z, ηis ), k s=1

where ζi = [ηi1 ; ...; ηik ] and {ηis }16i, 16s6k are independent random variables uniformly distributed on [0, 1]. Note that the arithmetic cost of a single call to Ξk is Ck = O(1)k(n(pmax )2 + p(2) ). The Nash v.i. associated with the saddle point problem (4.25) with the stochastic oracle Ξk (k being the first parameter of our construction) specify a Nash s.v.i. on the domain Z = X × Y . Let us equip the standard simplex X and its embedding space X = Rn with the Simplex setup, and the spectahedron Y and its embedding space Y = Sp1 × ... × Spm with the Spectahedron setup (see Section 2.2). Let us next combine the x- and the y-setups, exactly as explained in the beginning of Section 4.2, into an SMP setup for the domain Z = X × Y – a distance-generating function ω(·) and a norm k · k on the embedding space Rn × (Sp1 × ... × Spℓ ) of Z. The SMP-related properties of the resulting setup are summarized in the following statement. Lemma 4.3. Let n > 3, p(1) > 3. Then (i) The parameters of the just defined distance-generating function ω w.r.t. the √ just defined norm k · k are α = 1, Θ = 1, Ω = 2. (ii) For any z, z ′ ∈ Z one has (4.29)

kF (z) − F (z ′ )k∗ 6 Lkz − z ′ k, L = 2 ln(n) + 4 ln(p(1) ).

Besides this, for any (z ∈ Z, i = 1, 2, ..., (4.30)

(a) (b)

E {Ξ  k (z, ζi )} = F (z); 2 E exp{kΞ(z, ζi ) − F (z)k2∗ /M√ } 6 exp{1}, M = 27[ln(n) + ln(p(1) )]A∞ / k.

REFERENCES [1] Azuma, K. Weighted sums of certain dependent random variables. T¨ okuku Math. J., 19 (1967), 357-367. [2] Ben-Tal, A., Nemirovski, A. “Non-Euclidean restricted memory level method for large-scale convex optimization” – Math. Progr. 102 (2005), 407–456. [3] Juditsky, A. Lan, G., Nemirovski, A., Shapiro, A., Stochastic Approximation Approach to Stochastic Programming, http://www.optimization-online.org/DB HTML/2007/09/1787.html [4] Juditsky, A., Nemirovski, A. (2008), Large Deviations of Vector-valued Martingales in 2Smooth Normed Spaces E-print: http://www.optimization-online.org/DB HTML/2008/04/1947.html [5] Nemirovski, A., Yudin, D., Problem complexity and method eciency in Optimization J. Wiley & Sons (1983).

STOCHASTIC MIRROR-PROX ALGORITHM

19

[6] A. Nemirovski, “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems” – SIAM J. Optim. 15 (2004), 229-251. [7] Lu, Z., Nemirovski A., Monteiro, R. “Large-Scale Semidefinite Programming via Saddle Point Mirror-Prox Algorithm”, Math. Progr., 109 (2007), 211-237. [8] Nemirovski, A., Onn, S., Rothblum, U. (2007), “Accuracy certificates for computational problems with convex structure” – submitted to Mathematics of Operations Research. E-print: http://www.optimization-online.org/DB HTML/2007/04/1634.html [9] Nesterov, Yu. “Smooth minimization of non-smooth functions”, Math. Progr., 103 (2005), 127-152. [10] Nesterov, Yu. “Excessive gap technique in nonsmooth convex minimization”, SIAM J. Optim., 16 (2005), 235-249. [11] Nesterov, Yu. “Dual extrapolation and its applications to solving variational inequalities and related problems”, Math. Progr. 109 (2007) 319-344.

5. Appendix. 5.1. Proof of Theorem 3.2. We start with the following simple observation: if re is a solution to (2.2), then ∂Z ω(re ) contains −e and thus is nonempty, so that re ∈ Z o . Moreover, one has hω ′ (re ) − e, u − re i > 0 ∀u ∈ Z.

(5.1)

Indeed, by continuity argument, it suffices to verify the inequality in the case when u ∈ rint(Z) ⊂ Z o . For such an u, the convex function f (t) = ω(re + t(u − re )) + hre + t(u − re ), ei, t ∈ [0, 1] is continuous on [0, 1] and has a continuous on [0, 1] field of subgradients g(t) = hω ′ (re + t(u − re )) + e, u − re i. It follows that the function is continuously differentiable on [0, 1] with the derivative g(t). Since the function attains its minimum on [0, 1] at t = 0, we have g(0) > 0, which is exactly (5.1). At least the first statement of the following Lemma is well-known: Lemma 5.1. For every z ∈ Z o , the mapping ξ 7→ P (z, ξ) is a single-valued mapping of E onto Z o , and this mapping is Lipschitz continuous, specifically, kP (z, ζ) − P (z, η)k 6 α−1 kζ − ηk∗

(5.2)

∀ζ, η ∈ E.

Besides this, (5.3)

(a) (b)

∀(u ∈ Z) : V (P (z, ζ), u) 6 V (z, u) + hζ, u − P (z, ζ)i − V (z, P (z, ζ)) kζk2 6 V (z, u) + hζ, u − zi + 2α∗ .

Proof. Let v ∈ P (z, ζ), w ∈ P (z, η). As Vu′ (z, u) = ω ′ (u) − ω ′ (z), invoking 5.1, we have v, w ∈ Z o and (5.4)

hω ′ (v) − ω ′ (z) + ζ, v − ui 6 0

(5.5)

hω ′ (w) − ω ′ (z) + η, w − ui 6 0 ∀u ∈ Z.

∀u ∈ Z.

Setting u = w in (5.4) and u = v in (5.5), we get hω ′ (v) − ω ′ (z) + ζ, v − wi 6 0, hω ′ (w) − ω ′ (z) + η, v − wi > 0,

20

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

whence hω ′ (w) − ω ′ (v) + [η − ζ], v − wi > 0, or kη − ζk∗ kv − wk > hη − ζ, v − wi > hω ′ (v) − ω ′ (w), v − wi > αkv − wk2 , and (5.2) follows. This relation, as a byproduct, implies that P (z, ·) is single-valued. To prove (5.3), let v = P (z, ζ). We have V (v, u) − V (z, u) = [ω(u) − hω ′ (v), u − vi − ω(v)] − [ω(u) − hω ′ (z), u − zi − ω(z)] = hω ′ (v) − ω ′ (z) + ζ, v − ui + hζ, u − vi − [ω(v) − hω ′ (z), v − zi − ω(z)] (due to (5.4)) 6 hζ, u − vi − V (z, v),

as required in (a) of (5.3). The bound (b) of (5.3) is obtained from (5.3) using the Young inequality: hζ, z − vi 6

kzk2∗ α + kz − vk2 . 2α 2

Indeed, observe that by definition, V (z, ·) is strongly convex with parameter α, and V (z, v) > α2 kz − vk2 , so that hζ, u − vi − V (z, v) = hζ, u − zi + hζ, z − vi − V (z, v) 6 hζ, u − zi +

kζk2∗ . 2α

We have the following simple corollary of Lemma 5.1: Corollary 5.2. Let ξ1 , ξ2 , ... be a sequence of elements of E. Define the sequence o {yτ }∞ τ =0 in Z as follows: yτ = P (yτ −1 , ξτ ),

y0 ∈ Z o .

Then yτ is a measurable function of y0 and ξ1 , ..., ξτ such that (5.6)

(∀u ∈ Z) :

h−

t X

ξτ , ui 6 V (y0 , u) +

τ =1

t X

ζτ ,

τ =1

with (5.7)

|ζτ | 6 rkξτ k∗ (here r = max kuk); ζτ 6 −hξτ , yτ −1 i + u∈Z

kξτ k2∗ . 2α

Proof. Using the bound (b) of (5.3) with ζ = ξt and z = yt−1 (so that yt = P (yt−1 , ξt ) we obtain for any u ∈ Z: V (yt , u) − V (yt−1 , u) − hξt , ui 6 −hξt , yt i − V (yt−1 , yt ) ≡ ζt . Note that ζt = max[−hξt , vi − V (yt−1 , v)], v∈Z

so that −rkξt k∗ 6 −hξt , yt−1 i 6 ζt 6 rkξt k∗ .

21

STOCHASTIC MIRROR-PROX ALGORITHM

Further, due to the strong convexity of V , ζt = −hξt , yt−1 i + [−hξt , yt − yt−1 i − V (yt−1 , yt )] 6 −hξt , yt−1 i +

kξt k2∗ . 2α

When summing up from τ = 1 to τ = t we arrive at the corollary. We also need the following result. Lemma 5.3. Let z ∈ Z o , let ζ, η be two points from E, and let w = P (z, ζ),

r+ = P (z, η)

Then for all u ∈ Z one has

(a) kw − r+ k 6 α−1 kζ − ηk∗ (b) V (r+ , u) − V (z, u) 6 hη, u − wi +

(5.8)

kζ−ηk2∗ 2α

− α2 kw − zk2 .

Proof. (a): this is nothing but (5.2). (b): Using (a) of (5.3) in Lemma 5.1 we can write for u = r+ : V (w, r+ ) 6 V (z, r+ ) + hζ, r+ − wi − V (z, w). This results in (5.9)

V (z, r+ ) > V (w, r+ ) + V (z, w) + hζ, w − r+ i.

Using (5.3) with η substituted for ζ we get V (r+ , u) 6 V (z, u) + hη, u − r+ i − V (z, r+ )

= V (z, u) + hη, u − wi + hη, w − r+ i − V (z, r+ ) [by (5.9)] 6 V (z, u) + hη, u − wi + hη − ζ, w − r+ i − V (z, w) − V (w, r+ ) α 6 V (z, u) + hη, u − wi + hη − ζ, w − r+ i − [kw − zk2 + kw − r+ k2 ], 2 due to the strong convexity of V . To conclude the bound (b) of (5.8) it suffices to note that by the Young inequality, hη − ζ, w − r+ i 6

kη − ζk2∗ α + kw − r+ k2 . 2α 2

We are able now to prove Theorem 3.2. By (1.4) we have that

(5.10)

kFb (wτ ) − Fb (rτ −1 )k2∗ 6 (Lkrτ −1 − wτ k + M + ǫrτ −1 + ǫwτ )2 6 3L2 kwτ − rτ −1 k2 + 3M 2 + 3(ǫrτ −1 + ǫwτ )2 .

Let us now apply Lemma 5.3 with z = rτ −1 , ζ = γτ Fb (rτ −1 ), η = γτ Fb(wτ ) (so that w = wτ and r+ = rτ ). We have for any u ∈ Z hγτ Fb (wτ ), wτ − ui + V (rτ , u) − V (rτ −1 , u) γ2 α 6 τ kFb (wτ ) − Fb (rτ −1 )k2 − kwτ − rτ −1 k2 2α 2  α 3γτ2 L2  kwτ − rτ −1 k2 + M 2 + (ǫrτ −1 + ǫwτ )2 − kwτ − rτ −1 k2 [by (5.10)] 6 2α 2  3γτ2  2 2 M + (ǫrτ −1 + ǫwτ ) [by (3.4)] 6 2α

22

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

When summing up from τ = 1 to τ = t we obtain t X

hγτ Fb(wτ ), wτ − ui 6 V (r0 , u) − V (rt , u) +

τ =1

6 Θ(r0 ) +

t X 3γτ2 

τ =1



t X  3γτ2  2 M + (ǫrτ −1 + ǫwτ )2 2α τ =1

 M 2 + (ǫrτ −1 + ǫwτ )2 .

Hence, for all u ∈ Z, t X

hγτ F (wτ ), wτ − ui

τ =1

t X 3γτ2 

6 Θ(r0 ) + (5.11)

τ =1 t X

= Θ(r0 ) +

τ =1

+

t X



3γτ2 2α



t  X hγτ ∆τ , wτ − ui M 2 + (ǫrτ −1 + ǫwτ )2 +

 M 2 + (ǫrτ −1 + ǫwτ )2 +

τ =1 t X

hγτ ∆τ , wτ − yτ −1 i

τ =1

hγτ ∆τ , yτ −1 − ui,

τ =1

where yτ are given by (3.3). Since the sequences {yτ }, {ξτ = γτ ∆τ } satisfy the premise of Corollary 5.2, we have (∀u ∈ Z) :

Pt

τ =1 hγτ ∆τ , yτ −1

− ui 6 V (r0 , u) + Pt γ2 6 Θ(r0 ) + τ =1 2ατ ǫ2wτ ,

Pt

γτ2 2 τ =1 2α k∆τ k∗

and thus (5.11) implies that for any u ∈ Z (5.12)

t X

hγτ F (wτ ), wτ − ui 6 2Θ(r0 ) +

τ =1

+

t X

hγτ ∆τ , wτ − yτ −1 i

τ =1

  t X ǫ2 3γτ2 M 2 + (ǫrτ −1 + ǫwτ )2 + wτ 2α 3 τ =1

To complete the proof of (3.5) in the general case, note that since F is monotone, (5.12) implies that for all u ∈ Z, t X

τ =1

γτ hF (u), wτ − ui 6 Γ(t),

where   X t t X ǫ2wτ 3γτ2 2 2 M + (ǫrτ −1 + ǫwτ ) + + hγτ ∆τ , wτ − yτ −1 i Γ(t) = 2Θ(r0 ) + 2α 3 τ =1 τ =1 (cf. (3.6)), whence ∀(u ∈ Z) : hF (u), zbt − ui 6

"

t X

τ =1

γτ

#−1

Γ(t).

STOCHASTIC MIRROR-PROX ALGORITHM

23

When taking the supremum over u ∈ Z, we arrive at (3.5). In the case of a Nash v.i., setting wτ = (wτ,1 , ..., wτ,m ) and u = (u1 , ..., um ) and recalling the origin of F , due to the convexity of φi (zi , z i ) in zi , for all u ∈ Z we get from (5.12): t X

τ =1

γτ

m X i=1

Setting φ(z) =

t X

[φi (wτ ) − φi (ui , (wτ )i )] 6

Pm

i=1

m X

γτ

τ =1

i=1

hF i (wτ ), (wτ )i − ui i 6 Γ(t).

φi (z), we get t X

τ =1

"

γτ φ(wτ ) −

m X

i

#

φi (ui , (wτ ) ) 6 Γ(t).

i=1

Recalling that φ(·) is convex and φi (ui , ·) are concave, i = 1, ..., m, the latter inequality implies that # " t #" m X X i φi (ui , (b zt ) ) 6 Γ(t), zt ) − γτ φ(b τ =1

i=1

or, which is the same, m X i=1

"

φi (b zt ) −

m X

#

i

φi (ui , (b zt ) ) 6

i=1

"

t X

γτ

τ =1

#−1

Γ(t).

This relation holds true for all u = (u1 , ..., um ) ∈ Z; taking maximum of both sides in u, we get ErrN (b zt ) 6

"

t X

γτ

τ =1

#−1

Γ(t).

5.2. Proof of Theorem 3.5. In what follows, we use the notation from Theorem 3.2. By this theorem, in the case of constant stepsizes γτ ≡ γ we have −1

(5.13)

Errvi (b zt ) 6 [tγ]

Γ(t),

where Γ(t) = 2Θ + (5.14)

6 2Θ +

 t t  X ǫ2 3γ 2 X M 2 + (ǫrτ −1 + ǫwτ )2 + wτ + γ h∆τ , wτ − yτ −1 i 2α τ =1 3 τ =1

t t i X 7γ 2 X h 2 h∆τ , wτ − yτ −1 i. M + ǫ2rτ −1 + ǫ2wτ + γ 2α τ =1 τ =1

For a Nash v.i., Errvi in this relation can be replaced with ErrN . Note that by description of the algorithm rτ −1 is a deterministic function of ζ N (τ −1) and wτ is a deterministic function of ζ M(τ ) for certain increasing sequences of integers {M (τ )}, {N (τ )} such that N (τ − 1) < M (τ ) < N (τ ). Therefore ǫrτ −1 is a deterministic function of ζ N (τ −1)+1 , and ǫwτ and ∆τ are deterministic functions of

24

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

ζ M(τ )+1 . Denoting by Ei the expectation w.r.t. ζi , we conclude that under assumption I we have n o  (5.15) EN (τ −1)+1 ǫ2rτ −1 6 M 2 , EM(τ )+1 ǫ2wτ 6 M 2 , kEM(τ )+1 {∆τ } k∗ 6 µ,

and under assumption II, in addition, o n EN (τ −1)+1 exp{ǫ2rτ −1 M −2 } 6 exp{1}, (5.16)  EM(τ )+1 exp{ǫ2wτ M −2 } 6 exp{1}. Now, let

t i 7γ 2 X h 2 M + ǫ2rτ −1 + ǫ2wτ . Γ0 (t) = 2α τ =1

We conclude by (5.15) that E {Γ0 (t)} 6

(5.17)

21γ 2 M 2 t . 2α

Further, yτ −1 clearly is a deterministic function of ζ M(τ −1)+1 , whence wτ − yτ −1 is a deterministic function of ζ M(τ ) . Therefore (5.18)

EM(τ )+1 {h∆τ , wτ − yτ −1 i} = hEM(τ )+1 {∆τ } , wτ − yτ −1 i 6 µkwτ − yτ −1 k 6 2µΩ,

where the concluding p inequality follows from the fact that Z is contained in the k · kball of radius Ω = 2Θ/α centered at zc , see (2.5). From (5.18) it follows that ) ( t X h∆τ , wτ − yτ −1 i 6 2µγtΩ. E γ τ =1

Combining the latter relation, (5.13), (5.14) and (5.17), we arrive at (3.9). (i) is proved. To prove (ii), observe, first, that setting Jt =

t h i X M −2 ǫ2rτ −1 + M −2 ǫ2wτ , τ =1

we get (5.19)

Γ0 (t) =

7γ 2 M 2 [t + Jt ] . 2α

At the same time, we can write Jt =

2t X

ξj ,

j=1

where ξj > 0 is a deterministic function of ζ I(j) for certain increasing sequence of integers {I(j)}. Moreover, when denoting by Ej conditional expectation over ζ I(j) , ζ I(j)+1 ..., ζ I(j)−1 being fixed, we have Ej {exp{ξj }} 6 exp{1},

25

STOCHASTIC MIRROR-PROX ALGORITHM

see (5.16). It follows that      k k+1      X X ξj } exp{ξk+1 } ξj } = E Ek+1 exp{ E exp{      j=1 j=1     k k     X X ξj } . (5.20) ξj }Ek+1 {exp{ξk+1 }} 6 exp{1}E exp{ = E exp{     j=1

j=1

Whence E[exp{J}] 6 exp{2t}, and applying the Tchebychev inequality, we get ∀Λ > 0 : Prob {J > 2t + Λt} 6 exp{−Λt}. Along with (5.19) it implies that   21γ 2 M 2 t 7γ 2 M 2 t ∀Λ > 0 : Prob Γ0 (t) > (5.21) 6 exp{−Λt}. +Λ 2α 2α

Let now ξτ = h∆τ , wτ − yτ −1 i. Recall that wτ − yτ +1 is a deterministic function of ζ M(τ ) . Besides this, we have seen that kwτ − yτ −1 k 6 D ≡ 2Ω. Taking into account (5.15), (5.16), we get (5.22)

(a) EM(τ )+1 {ξ  τ } 6 ρ ≡ µD, (b) EM(τ )+1 exp{ξτ2 R−2 } 6 exp{1}, with R = M D.

Observe that exp{x} 6 x+exp{9x2 /16} for all x. Thus (5.22.b) implies for 0 6 s 6 (5.23)

4 3R

EM(τ )+1 {exp{sξτ }} 6 sρ + exp{9s2 R2 /16} 6 exp{sρ + 9s2 R2 /16}.

Further, we have sξτ 6 83 s2 R2 + 23 ξτ2 R−2 , hence for all s > 0,   2 2   2  3s R 2 2ξτ 2 2 . 6 exp EM(τ )+1 {exp{sξτ }} 6 exp{3s R /8}EM(τ )+1 exp + 3R2 8 3 4 When s > 3R , the latter quantity is 6 3s2 R2 /4, which combines with (5.23) to imply that for s > 0,

EM(τ )+1 {exp{sξτ }} 6 exp{sρ + 3s2 R2 /4}.

(5.24)

Acting as in (5.20), we derive from (5.24) that ) ( t X ξτ } 6 exp{stρ + 3s2 tR2 /4}, s > 0 ⇒ E exp{s τ =1

and by the Tchebychev inequality, for all Λ > 0, ( t ) X √ √ Prob ξτ > tρ + ΛR t 6 inf exp{3s2 tR2 /4 − sΛR t} = exp{−Λ2 /3}. s>0

τ =1

Finally, we arrive at ( t ) h X √i Prob γ h∆τ , wτ − yτ −1 i > 2γ µt + ΛM t Ω 6 exp{−Λ2 /3}. (5.25) τ =1

for all Λ > 0. Combining (5.13), (5.14), (5.21) and (5.25), we get (3.10).

26

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

5.3. Proof of Lemma 4.1. Proof of (i). We clearly have Z o = X o × Y o , and ω(·) is indeed continuously differentiable on this set. Let z = (x, y) and z ′ = (x′ , y ′ ), z, z ′ ∈ Z. Then 1 1 hωx′ (x) − ωx′ (x′ ), x − x′ i + hω ′ (y), y − y ′ i 2 αx Ωx αy Ω2y y 1 1 > 2 kx − x′ k2x + 2 ky − y ′ k2y > k[x′ − x; y ′ − y]k2 . Ωx Ωy

hω ′ (z) − ω ′ (z ′ ), z − z ′ i =

Thus, ω(·) is strongly convex on Z, modulus α = 1, w.r.t. the norm k · k. Further, the minimizer of ω(·) on Z clearly is zc = (xc , yc ), and 1 1 Θx + Θy = 1, αx Ω2x αy Ω2y p √ so that Θ = 1, whence Ω = 2Θ/α = 2. Proof of (ii). 10 . Let z = (x, y) and z ′ = (x′ , y ′ ) with z, z ′ ∈ Z. Observe that ky − y ′ ky 6 2Ωy and thus Θ=

ky ′ ky 6 2Ωy

(5.26)

due to 0 ∈ Y . On the other hand, we have from (4.9) F (z ′ ) − F (z) = [∆x ; ∆y ], where ∆x =

m X

[φ′ℓ (x′ ) − φ′ℓ (x)]∗ [ATℓ y ′ + bℓ ] +

ℓ=1 m X

∆y = −

ℓ=1

m X ℓ=1

[φ′ℓ (x)]∗ Aℓ [y ′ − y],

A∗ℓ [φℓ (x) − φℓ (x′ )] + Φ′∗ (y ′ ) − Φ′∗ (y).

We have k∆x kx,∗ =

6

max

h∈X khkx 61 m X ℓ=1

=

m X ℓ=1

6

hh,

m X ℓ=1

+

" " "

m h X ℓ=1

i [φ′ℓ (x′ ) − φ′ℓ (x)]∗ [ATℓ y ′ + bℓ ] + [φ′ℓ (x)]∗ Aℓ [y ′ − y] iX

max

hh, [φ′ℓ (x′ )



max

h[φ′ℓ (x′ )

φ′ℓ (x)]h, ATℓ y ′

h∈X khkx 61

h∈X khkx 61

max

h∈X khkx 61

max

h∈X khkx 61



φ′ℓ (x)]∗ [ATℓ y ′

+ bℓ ]iX +

+ bℓ iX +

max hh, [φ′ℓ (x)]∗ Aℓ [y ′ h∈X ,khkx 61

max h[φ′ℓ (x)]h, Aℓ [y ′ h∈X ,khkx 61

− y]iX

− y]iX

#

k[φ′ℓ (x′ ) − φ′ℓ (x)]hk(ℓ) kAℓ y ′ + bℓ k(ℓ,∗)

kφ′ℓ (x)hk(ℓ) kAℓ [y ′

#

− y]k(ℓ,∗) .

Then by (4.2), |∆x kx,∗ 6

m X ℓ=1

"

[Lx kx − x′ kx + Mx ][kAℓ y ′ k(ℓ,∗) + kbℓ k(ℓ,∗) ] + [Lx Ωx + Mx ]kAℓ [y − y ′ ]k(ℓ,∗)

= [Lx kx − x′ kx + Mx ]

m X

[kAℓ y ′ k(ℓ,∗) + kbℓ k(ℓ,∗) ] + [Lx Ωx + Mx ]

ℓ=1

6 [Lx kx − x′ kx + Mx ][Aky ′ ky + B] + [Lx Ωx + Mx ]Aky − y ′ ky ,

m X ℓ=1

kAℓ [y − y ′ ]k(ℓ,∗)

#

27

STOCHASTIC MIRROR-PROX ALGORITHM

by definition of A and B. Next, due to (5.26) we get by definition of k · k k∆x kx,∗ 6 [Lx kx − x′ kx + Mx ][2AΩy + B] + [Lx Ωx + Mx ]Aky − y ′ ky

6 [Lx Ωx kz − z ′ k + Mx ][2AΩy + B] + [Lx Ωx + Mx ]AΩy kz − z ′ k,

what implies (a) : k∆x kx,∗ 6 [Ωx [2AΩy + B]Lx + 2AΩy [Lx Ωx + Mx ]] kz − z ′ k + [2AΩy + B]Mx Further, k∆y ky,∗ =

max

hη, −

η∈Y,kηky 61

max

6

η∈Y,kηky 61

=

max

η∈Y,kηky 61

max

6

η∈Y,kηky 61

max

6

η∈Y,kηky 61

m X

A∗ℓ [φℓ (x) − φℓ (x′ )] + Φ′∗ (y ′ ) − Φ′∗ (y)iY

ℓ=1

m X

hη, A∗ℓ [φℓ (x) − φℓ (x′ )]iY + kΦ′∗ (y ′ ) − Φ′∗ (y)ky,∗

ℓ=1

m X

hAℓ η, φℓ (x) − φℓ (x′ )iEℓ + kΦ′∗ (y ′ ) − Φ′∗ (y)ky,∗

ℓ=1

m X

kAℓ ηk(ℓ,∗) kφℓ (x) − φℓ (x′ )k(ℓ) kΦ′∗ (y ′ ) − Φ′∗ (y)ky,∗

ℓ=1

m X

kAℓ ηk(ℓ,∗) [Lx Ωx + Mx ]kx − x′ kx [Ly ky − y ′ ky + My ],

ℓ=1

by (4.2.b) and (4.5). Now k∆y ky,∗ 6 A[Lx Ωx + Mx ]kx − x′ kx + [Ly ky − y ′ ky + My ], and we come to (b) : k∆y ky,∗ 6 [Ωx A[Lx Ωx + Mx ] + Ωy Ly ] kz − z ′ k + My . From (a) and (b) it follows that kF (z) − F (z ′ )k∗ 6 Ωx k∆x kx,∗ + Ωy k∆y ky,∗   6 Ω2x [2AΩy + B]Lx + 3AΩx Ωy [Lx Ωx + Mx ] + Ly Ω2y kz − z ′ k +Ωx [2AΩy + B]Mx + Ωy My . We have justified (4.12) 20 . Let us verify (4.13). The first relation in (4.13) is readily given by (4.3.a,c). Let us fix z = (x, y) ∈ Z and i, and let ∆ = (5.27)

=

F (z) − Ξ(z, ζi ) ℓ=1

|

As we have seen, (5.28)

ψℓ

m m z }| { X X [ [φ′ℓ (x) − Gℓ (x, ζi )]∗ [Aℓ y + bℓ ]; − A∗ℓ [φℓ (x) − fℓ (x, ζi )] .

{z

∆x

m X ℓ=1

} |

ℓ=1

kψℓ k(ℓ,∗) 6 2AΩy + B

{z

∆y

}

28

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

Besides this, for uℓ ∈ Eℓ we have k

m X ℓ=1

A∗ℓ uℓ ky,∗ =

m X A∗ℓ uℓ , ηiY = h

max

η∈Y, kηky 61

max

6

η∈Y, kηky 61

(5.29)

6

max

ℓ=1

 



X

max

m X uℓ , Aℓ ηiY h

η∈Y, kηky 61



ℓ=1

kuℓ k(ℓ) kAℓ ηk(ℓ,∗) 

16ℓ6m

max kuℓ k(ℓ)

η∈Y, kηky 61 16ℓ6m

 X

16ℓ6m

kAℓ ηk(ℓ,∗) = A max kuℓ k(ℓ) . 16ℓ6m

Hence, setting uℓ = φℓ (x) − fℓ (x, ζi ) we obtain (5.30) k∆y ky,∗ = k

Further,

m X

A∗ℓ [φℓ (x) − fℓ (x, ζ)]ky,∗ 6 A max kφℓ (x) − fℓ (x, ζ)k(ℓ) . 16ℓ6m ℓ=1 | {z }

k∆x kx,∗ = =

ξ=ξ(ζi )

max

hh,

h∈X , khkx 61

max

h∈X , khkx 61

max

6

h∈X , khkx 61

6

m X ℓ=1

m X ℓ=1

m X ℓ=1

h[φ′ℓ (x) − Gℓ (x, ζi )]h, ψℓ iX

m X

max

ℓ=1

h∈X , khkx 61

|

[φ′ℓ (x) − Gℓ (x, ζi )]∗ ψℓ iX

k[φ′ℓ (x) − Gℓ (x, ζi )]hk(ℓ) kψℓ k(ℓ,∗) k[φ′ℓ (x) − Gℓ (x, ζi )]hk(ℓ) kψℓ k(ℓ,∗) | {z } {z } ρℓ ξℓ =ξℓ (ζi )

Invoking (5.28), we conclude that (5.31)

k∆x kx,∗ 6

where all ρℓ > 0,

P



m X

ρℓ ξℓ ,

ℓ=1

ρℓ 6 2AΩy + B and

ξℓ = ξℓ (ζi ) =

max

h∈X , khkx 61

k[φ′ℓ (x) − Gℓ (x, ζi )]hk(ℓ)

Denoting by p2 (η) the second moment of a scalar random variable η, observe that p(·) is a norm on the space of square summable random variables representable as deterministic functions of ζi , and that p(ξ) 6 Ωx Mx , p(ξℓ ) 6 Mx by (4.3.b,d). Now by (5.30), (5.31),  1    1   E k∆k2∗ 2 = E Ω2x k∆x k2x,∗ + Ω2y k∆y k2y,∗ 2

29

STOCHASTIC MIRROR-PROX ALGORITHM

6 p (Ωx k∆x kx,∗ + Ωy k∆y ky,∗ ) 6 p Ωx 6 Ωx

X ℓ

m X ℓ=1

ρℓ ξℓ + Ωy Aξ

!

ρℓ max p(ξℓ ) + Ωy Ap(ξ) ℓ

6 Ωx [2AΩy + B]Mx + Ωy AΩx Mx , and the latter quantity is 6 M , see (4.12). We have established the second relation in (4.13). 30 . It remains to prove that in the case of (4.14), relation (4.15) takes place. To this end, one word the reasoning from item 20 with the function  can repeat  word2 by 2 pe (η) = inf t > 0 : E exp{η /t } 6 exp{1} in the role of p(η). Note that similarly to p(·), pe (·) is a norm on the space of random variables η which are deterministic functions of ζi and are such that pe (η) < ∞. 5.4. Proof of Lemma 4.3. Item (i) can be verified exactly as in the case of Lemma 4.1; the facts expressed in (i) depend solely on the construction from Section 4.2 preceding the latter Lemma, and are independent of what are the setups for X, X and Y, Y. Let us verify item (ii). Note that we are in the situation p k(x, y)k = pkxk21 /(2 ln(n)) + |y|21 /(4 ln(p(1) )), (5.32) k(ξ, η)k∗ = 2 ln(n)kξk2∞ + 4 ln(p(1) )|η|2∞ . For z = (x, y), z ′ = (x′ , y ′ ) ∈ Z we have 

F (z)−F (z ′ ) = ∆x = [Tr((y − y ′ )A1 ); ...; Tr((y − y ′ )An )]; ∆y = −

n X j=1

whence

k∆x k∞ 6 |y − y ′ |1 max |Aj |∞ 6 16j6n

|∆y |∞ 6 kx − x′ k∞ max |Aj |∞ 16j6n

and



(xj − x′j )Aj  .

p 2 ln(n)A∞ kz − z ′ k, q 6 2 ln(p(1) )A∞ kz − z ′ k,

k(∆x , ∆y )k∗ 6 [2 ln(n) + 4 ln(p(1) )]kz − z ′ k, as required in (4.29). Further, relation (4.30.a) is clear from the construction of Ξk . To prove (4.30.b), observe that when (x, y) ∈ Z, we have (see (4.27), (4.28)) kΞx (x, y, η)k∞ 6 |¯ yı | max |Aıj |∞ 6 A∞ , 16j6n

x

x

and, since F (x, y) = E {Ξ (x, y, ζ}, (5.33)

kΞx (x, y, η) − F x (x, y)k∞ 6 2A∞ .

Clearly, (5.34)

|Ξy (x, y, η) − F y (x, y)|∞ = |A −

n X j=1

xj Aj |∞ 6 2A∞ .

30

A. JUDITSKY, A. NEMIROVSKI AND C. TAUVEL

Applying [4, Theorem 2.1(iii), Example 3.2, Lemma 1], we derive from (5.33) and (5.34) that for every (x, y) ∈ Z and every i = 1, 2, ... it holds  2 E exp{kΞxk (x, y, ζi ) − F x (x, y)k2∞ /Nk,x } 6 exp{1},   p Nk,x = 2A∞ 2 exp{1/2} ln(n) + 3 k −1/2 and

 2 E exp{kΞyk (x, y, ζi ) − F y (x, y)k2∞ /Nk,y } 6 exp{1},   q Nk,y = 2A∞ 2 exp{1/2} ln(p(1) ) + 3 k −1/2 . Combining the latter bounds with (5.32) we conclude (4.30.b).