Accelerating Optimization via Adaptive Prediction


Mehryar Mohri
Courant Institute and Google
251 Mercer Street, New York, NY 10012
[email protected]

Scott Yang
Courant Institute
251 Mercer Street, New York, NY 10012
[email protected]

September 21, 2015

Abstract

We present a powerful general framework for designing data-dependent optimization algorithms, building upon and unifying recent techniques in adaptive regularization, optimistic gradient predictions, and problem-dependent randomization. We first present a series of new regret guarantees that hold at any time and under very minimal assumptions, and then show how different relaxations recover existing algorithms, both basic as well as more recent sophisticated ones. Finally, we show how combining adaptivity, optimism, and problem-dependent randomization can guide the design of algorithms that benefit from more favorable guarantees than recent state-of-the-art methods.

1 Introduction

Online convex optimization algorithms represent key tools in modern machine learning. These are flexible algorithms used for solving a variety of optimization problems in classification, regression, ranking, and probabilistic inference. These algorithms typically process one sample at a time with an update per iteration that is often computationally cheap and easy to implement. As a result, they can be substantially more efficient both in time and space than standard batch learning algorithms, which often have optimization costs that are prohibitive for very large data sets.

In the standard scenario of online convex optimization [20], at each round $t = 1, 2, \ldots$, the learner selects a point $x_t$ out of a compact convex set $K$ and incurs loss $f_t(x_t)$, where $f_t$ is a convex function defined over $K$. The learner's objective is to find an algorithm $\mathcal{A}$ that minimizes the regret with respect to a fixed point $x^*$:
$$\mathrm{Reg}_T(\mathcal{A}, x^*) = \sum_{t=1}^{T} f_t(x_t) - f_t(x^*),$$

that is, the difference between the learner's cumulative loss and the loss in hindsight incurred by $x^*$, or with respect to the maximum over all $x^*$ in $K$, $\mathrm{Reg}_T(\mathcal{A}) = \max_{x^* \in K} \mathrm{Reg}_T(\mathcal{A}, x^*)$. We will assume only that the learner has access to the gradient or an element of the sub-gradient of the loss functions $f_t$, but that the loss functions $f_t$ can be arbitrarily singular and flat, e.g., not necessarily strongly convex or strongly smooth. This is the most general setup of convex optimization in the full information setting. It can be applied to standard convex optimization and online learning tasks as well as to many optimization problems in machine learning such as those of SVMs, logistic regression, and ridge regression. Favorable bounds in online convex optimization can also be translated into strong learning guarantees in the standard scenario of batch supervised learning using online-to-batch conversion guarantees [8, 2, 11].

In the scenario of online convex optimization just presented, minimax optimal rates can be achieved by standard algorithms such as online gradient descent [20]. However, general minimax optimal rates may be too conservative. Recently, adaptive regularization methods have been introduced for standard descent methods to achieve tighter data-dependent regret bounds (see [1], [6], [10], [9], [13]). Specifically, in the framework of [9], if $\{r_t\}_{t=1}^{\infty}$ is a sequence of regularization functions such that $r_0 + \sum_{s=1}^{t}(f_s + r_s)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$, then regret bounds of the following form have been given:
$$\mathrm{Reg}_T(\mathcal{A}, x) \le r_{0:T}(x) + \sum_{t=1}^{T} \|g_t\|_{(t),*}^2,$$
where $r_{0:T}$ is a shorthand for $r_0 + \sum_{s=1}^{T} r_s$ and $g_t \in \partial f_t(x_t)$ is an element of the subgradient of $f_t$ at $x_t$. A common example of regularization is the proximal $l_2$-norm, $r_t(x) = \|x - x_t\|_2^2$, which prevents updates from straying far away from the current point. Examples of more sophisticated forms of adaptivity are the AdaGrad family of algorithms [6]. Note, however, that this upper bound on the regret can still be very large, even if the functions $f_t$ admit some favorable properties (e.g., $f_t \equiv f$, linear). This is because the dependence is directly on the norm of the $g_t$'s.

An alternative line of research has been investigated by a series of recent publications that have analyzed online learning in "slowly-varying" scenarios [7, 5, 14, 4]. In the framework of [14], if $R$ is a self-concordant function, $\|\cdot\|_{(t)} = \|\cdot\|_{\nabla^2 R(x_t)}$ is the semi-norm induced by its Hessian at the point $x_t$ (the norm induced by a symmetric positive definite (SPD) matrix $A$ is defined for any $x$ by $\|x\|_A = \sqrt{x^\top A x}$), and $\tilde g_{t+1} = \tilde g_{t+1}(g_1, \ldots, g_t, x_1, \ldots, x_t)$ is a "prediction" of a time $t+1$ subgradient $g_{t+1}$ based on information up to time $t$, then one can obtain regret bounds of the following form:
$$\mathrm{Reg}_T(\mathcal{A}, x) \le \frac{1}{\eta}\, R(x) + 2\eta \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2.$$
Here, $\|\cdot\|_{(t),*}$ denotes the dual norm of $\|\cdot\|_{(t)}$: for any $x$, $\|x\|_{(t),*} = \sup_{\|y\|_{(t)} \le 1} x^\top y$. This guarantee can be very favorable in the optimistic case where $\tilde g_t \approx g_t$ for all $t$. Nevertheless, it admits the drawback that much less control is available over the induced norm, since it is difficult to predict, for a given self-concordant function $R$, the behavior of its Hessian at the points $x_t$ selected by an algorithm.

This paper presents a powerful general framework for designing online convex optimization algorithms combining adaptive regularization and optimistic gradient prediction, which helps address several of the issues just pointed out. Our framework builds upon and unifies recent techniques in adaptive regularization, optimistic gradient predictions, and problem-dependent randomization. In Section 2, we describe a series of adaptive and optimistic algorithms for which we prove strong regret guarantees, including a new Adaptive and Optimistic Follow-the-Regularized-Leader (AO-FTRL) algorithm (Section 2.1) and a more general version of this algorithm with composite terms (Section 2.2). These new regret guarantees hold at any time and under very minimal assumptions. We also show how different relaxations recover both basic existing algorithms as well as more recent sophisticated ones. Next, in Section 3, we further combine adaptivity and optimism with problem-dependent randomization, which helps us devise algorithms benefiting from more favorable guarantees than recent state-of-the-art methods.

2 Adaptive and Optimistic Follow-the-Regularized-Leader algorithms

2.1 AO-FTRL algorithm

Algorithm 1 shows the pseudocode of our Adaptive and Optimistic Follow-the-Regularized-Leader (AO-FTRL) algorithm. The following result provides a regret guarantee for proximal regularizers.

Theorem 1 (AO-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions, such that $\operatorname{argmin}_{x \in K} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$ (i.e., $r_{0:t}$ is 1-strongly convex with respect to $\|\cdot\|_{(t)}$). Then, the following regret bound holds for AO-FTRL (Algorithm 1):
$$\mathrm{Reg}_T(\text{AO-FTRL}, x) = \sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2.$$

Proof. Recall that $x_{t+1} = \operatorname{argmin}_x\, (g_{1:t} + \tilde g_{t+1})\cdot x + r_{0:t}(x) = \operatorname{argmin}_x\, h_{0:t}(x)$, and let $y_t = \operatorname{argmin}_x\, x\cdot g_{1:t} + r_{0:t}(x)$. Then, by convexity, the following inequality holds:
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \sum_{t=1}^{T} g_t\cdot(x_t - x) = \sum_{t=1}^{T} (g_t - \tilde g_t)\cdot(x_t - y_t) + \tilde g_t\cdot(x_t - y_t) + g_t\cdot(y_t - x).$$
Now, we first prove by induction on $T$ that for all $x \in K$ the following inequality holds:
$$\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t \le \sum_{t=1}^{T} g_t\cdot x + r_{0:T}(x).$$
For $T = 1$, since $\tilde g_1 = 0$ and $r_t \ge 0$, the inequality follows by the definition of $y_t$. Now, suppose the inequality holds at iteration $T$. Then, we can write
$$\begin{aligned}
\sum_{t=1}^{T+1} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t
&= \left[\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1}\\
&\le \left[\sum_{t=1}^{T} g_t\cdot x_{T+1} + r_{0:T}(x_{T+1})\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(induction hypothesis for } x = x_{T+1}\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot x_{T+1} + r_{0:T+1}(x_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(since } r_t \ge 0 \text{ for all } t\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot y_{T+1} + r_{0:T+1}(y_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(by definition of } x_{T+1}\text{)}\\
&\le g_{1:T+1}\cdot y + r_{0:T+1}(y) \quad \text{for any } y. && \text{(by definition of } y_{T+1}\text{)}
\end{aligned}$$
Thus, we have that $\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T}(x) + \sum_{t=1}^{T}(g_t - \tilde g_t)\cdot(x_t - y_t)$, and it suffices to bound $\sum_{t=1}^{T}(g_t - \tilde g_t)\cdot(x_t - y_t)$. Notice that, by duality, one can immediately write $(g_t - \tilde g_t)\cdot(x_t - y_t) \le \|g_t - \tilde g_t\|_{(t),*}\,\|x_t - y_t\|_{(t)}$. To bound $\|x_t - y_t\|_{(t)}$ in terms of the gradient, recall first that since $r_t$ is proximal,
$$x_t = \operatorname{argmin}_x\; h_{0:t-1}(x) + r_t(x), \qquad y_t = \operatorname{argmin}_x\; h_{0:t-1}(x) + r_t(x) + (g_t - \tilde g_t)\cdot x.$$
The fact that $r_{0:t}(x)$ is 1-strongly convex with respect to the norm $\|\cdot\|_{(t)}$ implies that $h_{0:t-1} + r_t$ is as well. In particular, it is 1-strongly convex at the points $x_t$ and $y_t$. But this then implies that the conjugate function is 1-strongly smooth on the image of the gradient, including at $\nabla(h_{0:t-1} + r_t)(x_t) = 0$ and $\nabla(h_{0:t-1} + r_t)(y_t) = -(g_t - \tilde g_t)$ (see Lemma 1 in the appendix or [15] for a general reference), which means that
$$\left\|\nabla\big((h_{0:t-1} + r_t)^*\big)(-(g_t - \tilde g_t)) - \nabla\big((h_{0:t-1} + r_t)^*\big)(0)\right\|_{(t)} \le \|g_t - \tilde g_t\|_{(t),*}.$$
Since $\nabla((h_{0:t-1} + r_t)^*)(-(g_t - \tilde g_t)) = y_t$ and $\nabla((h_{0:t-1} + r_t)^*)(0) = x_t$, we have that $\|x_t - y_t\|_{(t)} \le \|g_t - \tilde g_t\|_{(t),*}$.

Algorithm 1 AO-FTRL
1: Input: regularization function $r_0 \ge 0$.
2: Initialize: $\tilde g_1 = 0$, $x_1 = \operatorname{argmin}_{x\in K} r_0(x)$.
3: for $t = 1, \ldots, T$ do
4: Compute $g_t \in \partial f_t(x_t)$.
5: Construct regularizer $r_t \ge 0$.
6: Predict the gradient $\tilde g_{t+1} = \tilde g_{t+1}(g_1, \ldots, g_t, x_1, \ldots, x_t)$.
7: Update $x_{t+1} = \operatorname{argmin}_{x\in K}\; g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x)$.
8: end for

The regret bound just presented can be vastly superior to the adaptive methods of [6], [10], and others. For instance, one common choice of gradient prediction is $\tilde g_{t+1} = g_t$, so that for slowly varying gradients (e.g., nearly "flat" functions), $g_t - \tilde g_t \approx 0$, but $\|g_t\|_{(t)} = \|g\|_{(t)}$. Moreover, for reasonable gradient predictions, $\|\tilde g_{t+1}\|_{(t)} \approx \|g_t\|_{(t)}$ generally, so that in the worst case, Algorithm 1's regret will be at most a factor of two more than standard methods. At the same time, the use of non-self-concordant regularization allows one to more explicitly control the induced norm in the regret bound as well as provide more efficient updates than those of [14]. Section 2.3.1 presents an upgraded version of online gradient descent as an example, where our choice of regularization allows our algorithm to accelerate as the gradient predictions become more accurate.

Note that the assumption of strong convexity of $h_{0:t}$ is not a significant constraint, as any quadratic or entropic regularizer from the usual mirror descent algorithms will satisfy this property. Moreover, if the loss functions $\{f_t\}_{t=1}^{\infty}$ themselves are 1-strongly convex, then one can set $r_{0:t} \equiv 0$ and still get a favorable induced norm $\|\cdot\|_{(t),*}^2 = \frac{1}{t}\|\cdot\|_2^2$. If the gradients and gradient predictions are uniformly bounded, this recovers the worst-case $\log(T)$ regret bounds. At the same time, Algorithm 1 would also still retain the potentially highly favorable data-dependent and optimistic regret bound.

Steinhardt and Liang (2014) [18] also studied adaptivity and optimism in online learning. If, in the proof above, we assume their condition
$$r_{0:t+1}^*(-\eta\, g_{1:t}) \le r_{0:t}^*\big(-\eta\,(g_{1:t} - \tilde g_t)\big) - \eta\, x_t^\top(g_t - \tilde g_t),$$
then we obtain the following regret bound: $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x) \le \frac{r_1^*(0) + r_{0:T+1}(x)}{\eta}$. Our algorithm, however, is in general easier to use, since it holds for any sequence of regularization functions and does not require checking for that condition.

In some cases, it may be preferable to use non-proximal adaptive regularization. Since non-adaptive non-proximal FTRL corresponds to dual averaging, this scenario arises, for instance, when one wishes to use regularizers such as the negative entropy to derive algorithms from the Exponentiated Gradient (EG) family (see [16] for background). We thus present the following theorem for this family of algorithms: the Adaptive Optimistic Follow-the-Regularized-Leader, General version (AO-FTRL-Gen).

Theorem 2 (AO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$ (i.e., $r_{0:t}$ is 1-strongly convex with respect to $\|\cdot\|_{(t)}$). Then, the following regret bound holds for AO-FTRL (Algorithm 1):
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2.$$

Due to spatial constraints, the proof of this theorem, as well as the proofs of all further results in the remainder of Section 2, are presented in Appendix 5. As in the case of proximal regularization, Algorithm 1 applied to general regularizers still admits the same benefits over the standard adaptive algorithms. In particular, the above algorithm is an easy upgrade over any dual averaging algorithm, as sketched below. Section 2.3.2 illustrates one such example for the Exponentiated Gradient algorithm.
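For concreteness, the following minimal Python sketch instantiates Algorithm 1 in the simplest dual-averaging setting: a single fixed quadratic regularizer $r_0(x) = \frac{1}{2\eta}\|x\|_2^2$ with $r_t \equiv 0$ for $t \ge 1$, an unconstrained domain, and the martingale-type prediction $\tilde g_{t+1} = g_t$. Under these assumptions the update of Algorithm 1 has the closed form $x_{t+1} = -\eta\,(g_{1:t} + \tilde g_{t+1})$; the step size and the toy gradient sequence are illustrative choices, not prescribed by the analysis.

import numpy as np

def ao_ftrl_dual_averaging(grads, eta=0.1):
    """Minimal AO-FTRL sketch under simplifying assumptions: fixed regularizer
    r_0(x) = ||x||_2^2 / (2 * eta), r_t = 0 for t >= 1, unconstrained domain, and
    the prediction g~_{t+1} = g_t, so that the Algorithm 1 update reduces to
    x_{t+1} = -eta * (g_{1:t} + g~_{t+1}) (optimistic dual averaging)."""
    d = len(grads[0])
    g_sum = np.zeros(d)      # g_{1:t}
    x = np.zeros(d)          # x_1 = argmin_x r_0(x)
    iterates = [x.copy()]
    for g in grads:          # g plays the role of g_t, a subgradient at x_t
        g_sum += g
        g_pred = g           # martingale-type prediction g~_{t+1} = g_t
        x = -eta * (g_sum + g_pred)
        iterates.append(x.copy())
    return iterates

# toy usage: slowly varying gradients, where the prediction g~_t is accurate
rng = np.random.default_rng(0)
grads = [np.array([1.0, -2.0]) + 0.01 * rng.standard_normal(2) for _ in range(100)]
print(ao_ftrl_dual_averaging(grads)[-1])

When the gradients vary slowly, the prediction term shifts the iterate toward where the next gradient is expected to point, which is precisely the source of the $\|g_t - \tilde g_t\|$ dependence in the bounds above.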

2.2 CAO-FTRL algorithm (Composite Adaptive Optimistic Follow-the-Regularized-Leader)

In some cases, we may wish to impose some regularization on our original optimization problem to ensure properties such as generalization (e.g., $l_2$-norm in SVM) or sparsity (e.g., $l_1$-norm in Lasso). This can be treated directly by modifying the regularization in our FTRL update. However, if we wish for the regularization penalty to appear in the regret expression but do not wish to linearize it (which could mitigate effects such as sparsity), then some extra care needs to be taken. Algorithm 2 gives the pseudocode of our algorithm when using such composite functions. The following results provide its regret guarantees. They admit the same advantages over former algorithms as for the non-composite case.

Algorithm 2 CAO-FTRL
1: Input: regularization function $r_0 \ge 0$, composite functions $\{\psi_t\}_{t=1}^{\infty}$ where $\psi_t \ge 0$.
2: Initialize: $\tilde g_1 = 0$, $x_1 = \operatorname{argmin}_{x\in K} r_0(x)$.
3: for $t = 1, \ldots, T$ do
4: Compute $g_t \in \partial f_t(x_t)$.
5: Construct regularizer $r_t \ge 0$.
6: Predict the next gradient $\tilde g_{t+1} = \tilde g_{t+1}(g_1, \ldots, g_t, x_1, \ldots, x_t)$.
7: Update $x_{t+1} = \operatorname{argmin}_{x\in K}\; g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$.
8: end for

Theorem 3 (CAO-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions, such that $\operatorname{argmin}_{x\in K} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then the following regret bounds hold for CAO-FTRL (Algorithm 2):
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2,$$
$$\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2.$$

Corollary 1. With the following suitable choices of the parameters in Theorem 3, the following regret bounds can be recovered:
1. Adaptive FTRL-Prox of [9] (up to a constant factor of 2): $\tilde g \equiv 0$.
2. Optimistic FTRL of [14]: $r_{0:t} \equiv 0$, $\psi_1 = \eta R$ where $\eta > 0$ and $R$ is a self-concordant function, $\psi_t = 0$ for all $t \ge 2$.

Theorem 4 (CAO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the following regret bounds hold for CAO-FTRL (Algorithm 2):
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2,$$
$$\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2.$$

2.3 Applications

2.3.1 Adaptive and Optimistic Gradient Descent

Corollary 2 (AO-GD). Let $K \subset \times_{i=1}^{n}[-R_i, R_i]$ be an $n$-dimensional rectangle, and set
$$r_{0:t} = \sum_{i=1}^{n}\sum_{s=1}^{t} \frac{\sqrt{\sum_{a=1}^{s}(g_{a,i} - \tilde g_{a,i})^2} - \sqrt{\sum_{a=1}^{s-1}(g_{a,i} - \tilde g_{a,i})^2}}{2R_i}\,(x_i - x_{s,i})^2.$$
Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret bound holds:
$$\mathrm{Reg}_T(x) \le 4\sum_{i=1}^{n} R_i\sqrt{\sum_{t=1}^{T}(g_{t,i} - g_{t-1,i})^2}.$$

Since the regularization function decomposes over coordinates, the AO-GD update can be executed in time linear in the dimension (the same as for standard gradient descent). Moreover, since the gradient prediction is simply the last gradient received, the algorithm also does not require much more storage than the standard gradient descent algorithm. However, as we mentioned in the general case, the regret bound here can be significantly more favorable than the standard $O\big(G\sqrt{2T\sum_{i=1}^{n} R_i^2}\,\big)$ bound of online gradient descent, or even its adaptive variants.
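The following Python sketch is one possible implementation of the AO-GD update of Corollary 2, assuming the box domain $K = \times_{i=1}^{n}[-R_i, R_i]$ and the last-gradient prediction $\tilde g_{t+1} = g_t$. Since the FTRL objective is separable and quadratic in each coordinate, the constrained argmin is the clipped unconstrained minimizer; the constant eps is a hypothetical safeguard against division by zero before any regularization has accumulated.

import numpy as np

def ao_gd(grads, R, eps=1e-12):
    """Sketch of AO-GD (Corollary 2) on the box prod_i [-R_i, R_i] with prediction
    g~_{t+1} = g_t. The proximal regularizer r_t(x) = sum_i sigma_{t,i}/2 * (x_i - x_{t,i})^2
    uses sigma_{t,i} = (sqrt(S_{t,i}) - sqrt(S_{t-1,i})) / R_i, where
    S_{t,i} = sum_{a<=t} (g_{a,i} - g~_{a,i})^2. The per-coordinate FTRL argmin over
    the box is the clipped unconstrained minimizer."""
    R = np.asarray(R, dtype=float)
    d = R.shape[0]
    x = np.zeros(d)                 # x_1
    g_sum = np.zeros(d)             # g_{1:t}
    g_prev = np.zeros(d)            # prediction g~_t (g~_1 = 0)
    S = np.zeros(d)                 # running sums S_{t,i}
    sigma_sum = np.zeros(d)         # sum_s sigma_{s,i}
    centers = np.zeros(d)           # sum_s sigma_{s,i} * x_{s,i}
    for g in grads:                 # g = g_t, a subgradient at the current x_t
        S_new = S + (g - g_prev) ** 2
        sigma = (np.sqrt(S_new) - np.sqrt(S)) / R
        sigma_sum += sigma
        centers += sigma * x
        S = S_new
        g_sum += g
        g_pred = g                  # g~_{t+1} = g_t
        x = np.clip((centers - (g_sum + g_pred)) / (sigma_sum + eps), -R, R)
        g_prev = g_pred
    return x

# toy usage: a fixed linear loss, so predictions are exact after the first round
print(ao_gd([np.array([0.5, -1.0])] * 50, R=[1.0, 1.0]))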

2.3.2 Adaptive and Optimistic Exponentiated Gradient

Corollary 3 (AO-EG). Let $K = \Delta_n$ be the $n$-dimensional simplex and $\varphi\colon x \mapsto \sum_{i=1}^{n} x_i\log(x_i)$ the negative entropy. Assume that $\|g_t\| \le C$ for all $t$ and set
$$r_{0:t} = 2\sqrt{\frac{C + \sum_{s=1}^{t}\|g_s - \tilde g_s\|_2^2}{\log(n)}}\,\big(\varphi + \log(n)\big).$$
Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret bound holds:
$$\mathrm{Reg}_T(\mathcal{A}, x) \le 2\sqrt{2\log(n)\left(C + \sum_{t=1}^{T-1}\|g_t - g_{t-1}\|_2^2\right)}.$$

The above algorithm admits the same advantages over predecessors as the AO-GD algorithm. Moreover, observe that this bound holds at any time and does not require the tuning of any learning rate. Steinhardt and Liang [18] also introduce a similar algorithm for EG, one that could actually be more favorable if the optimal a posteriori learning rate could be used. However, as pointed out by the authors, this is typically not possible, and instead, one would need to settle for a $\max_i \|g_{s,i} - \tilde g_{s,i}\|_2^2$-type bound in practice. This makes AO-EG more attractive from an implementation standpoint.
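A minimal Python sketch of the AO-EG update of Corollary 3 follows, assuming the prediction $\tilde g_{t+1} = g_t$. With the entropic regularizer $r_{0:t} = c_t(\varphi + \log n)$, where $c_t = 2\sqrt{(C + \sum_{s\le t}\|g_s - \tilde g_s\|_2^2)/\log n}$, the FTRL argmin over the simplex is a softmax of the predicted cumulative gradient scaled by $1/c_t$; the toy loss sequence is illustrative only.

import numpy as np

def ao_eg(grads, C, n):
    """Sketch of AO-EG (Corollary 3) on the simplex with prediction g~_{t+1} = g_t.
    For the adaptive entropic regularizer r_{0:t} = c_t * (phi + log n), the FTRL update
    is x_{t+1} proportional to exp(-(g_{1:t} + g~_{t+1}) / c_t)."""
    x = np.full(n, 1.0 / n)     # x_1: uniform distribution
    g_sum = np.zeros(n)
    g_prev = np.zeros(n)        # g~_1 = 0
    sq_sum = 0.0                # sum_s ||g_s - g~_s||_2^2
    for g in grads:             # g = g_t, a subgradient at the current x_t
        sq_sum += float(np.sum((g - g_prev) ** 2))
        c = 2.0 * np.sqrt((C + sq_sum) / np.log(n))
        g_sum += g
        g_pred = g              # g~_{t+1} = g_t
        z = -(g_sum + g_pred) / c
        w = np.exp(z - np.max(z))   # numerically stable softmax
        x = w / np.sum(w)
        g_prev = g_pred
    return x

# toy usage: a fixed linear loss over 4 experts; mass concentrates on the best one
print(ao_eg([np.array([0.3, 0.1, 0.4, 0.2])] * 100, C=1.0, n=4))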

Algorithm 3 CAOS-FTRL
1: Input: regularization function $r_0 \ge 0$, composite functions $\{\psi_t\}_{t=1}^{\infty}$ where $\psi_t \ge 0$.
2: Initialize: $\tilde g_1 = 0$, $x_1 = \operatorname{argmin}_{x\in K} r_0(x)$.
3: for $t = 1, \ldots, T$ do
4: Query $\hat g_t$ where $\mathbb{E}[\hat g_t \mid x_1, \ldots, x_t, \hat g_1, \ldots, \hat g_{t-1}] = g_t \in \partial f_t(x_t)$.
5: Construct regularizer $r_t \ge 0$.
6: Predict the next gradient $\tilde g_{t+1} = \tilde g_{t+1}(\hat g_1, \ldots, \hat g_t, x_1, \ldots, x_t)$.
7: Update $x_{t+1} = \operatorname{argmin}_{x\in K}\; \hat g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$.
8: end for

3 Adaptive Optimistic and Stochastic Follow-the-Regularized-Leader algorithms

3.1 CAOS-FTRL algorithm (Composite Adaptive Optimistic Stochastic Follow-the-Regularized-Leader)

We now generalize the scenario to that of stochastic online convex optimization, where, instead of exact subgradient elements $g_t$, we receive only estimates. Specifically, we assume access to a sequence of vectors of the form $\hat g_t$, where $\mathbb{E}[\hat g_t \mid g_1, \ldots, g_{t-1}, x_1, \ldots, x_t] = g_t$. This extension is in fact well-documented in the literature (see [16] for a reference), and the extension of our adaptive and optimistic variant follows accordingly. For completeness, we provide the proofs of the following theorems in Appendix 8.

Theorem 5 (CAOS-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions, such that $\operatorname{argmin}_{x\in K} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $\hat g_t$ given the history of noisy gradients $\hat g_1, \ldots, \hat g_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t}(x) = \hat g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update $x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the following regret bounds:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) - f_t(x)\right] \le \mathbb{E}\left[\psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right],$$
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x)\right] \le \mathbb{E}\left[r_{0:T}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t),*}^2\right].$$

Theorem 6 (CAOS-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let $\tilde g_t$ be the learner's estimate of $\hat g_t$ given the history of noisy gradients $\hat g_1, \ldots, \hat g_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume furthermore that the function $h_{0:t}(x) = \hat g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update $x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the regret bounds:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) - f_t(x)\right] \le \mathbb{E}\left[\psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right],$$
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x)\right] \le \mathbb{E}\left[r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right].$$

The algorithm above enjoys the same advantages over its non-adaptive or non-optimistic predecessors. Moreover, the choice of the adaptive regularizers $\{r_t\}_{t=1}^{\infty}$ and gradient predictions $\{\tilde g_t\}_{t=1}^{\infty}$ now also depends on the randomness of the gradients received. While masked in the above regret bounds, this interplay will come up explicitly in the following two examples, where we, as the learner, impose randomness into the problem.
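Before turning to these examples, the short numerical check below (not part of the analysis) illustrates the unbiasedness requirement $\mathbb{E}[\hat g_t \mid \cdot\,] = g_t$ for the importance-weighted coordinate estimate used in the next subsection; the gradient vector and sampling distribution are arbitrary illustrative values.

import numpy as np

# Spot-check that the importance-weighted coordinate estimate
# g_hat = (g . e_i) e_i / p_i, with i ~ p, satisfies E[g_hat] = g.
rng = np.random.default_rng(1)
g = np.array([0.5, -1.5, 2.0])
p = np.array([0.2, 0.3, 0.5])
samples = rng.choice(len(g), size=200_000, p=p)
counts = np.bincount(samples, minlength=len(g))
g_hat_mean = counts * (g / p) / len(samples)   # empirical average of the estimates
print(g_hat_mean)                               # approximately equal to g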

3.2 Applications

3.2.1 Randomized Coordinate Descent with Adaptive Probabilities

Randomized coordinate descent is a method that is often used for very large-scale problems where it is impossible to compute and/or store entire gradients at each step. It is also effective for directly enforcing sparsity in a solution, since the support of the final point $x_t$ cannot be larger than the number of updates introduced. The standard randomized coordinate descent update is to choose a coordinate uniformly at random (see e.g. [17]). Nesterov (2012) [12] analyzed random coordinate descent in the context of loss functions with higher regularity and showed that one can attain better bounds by using non-uniform probabilities.

In the randomized coordinate descent framework, at each round $t$ we specify a distribution $p_t$ over the $n$ coordinates and pick a coordinate $i_t \in \{1, \ldots, n\}$ randomly according to this distribution. From here, we then construct an unbiased estimate of an element of the subgradient: $\hat g_t = \frac{(g_t\cdot e_{i_t})\,e_{i_t}}{p_{t,i_t}}$. This technique is common in the online learning literature, particularly in the context of the multi-armed bandit problem (see e.g. [3] for more information). The following theorem can be derived by applying Theorem 5 to the gradient estimates just constructed. We provide a proof in Appendix 9.

Theorem 7 (CAO-RCD). Assume $K \subset \times_{i=1}^{n}[-R_i, R_i]$. Let $i_t$ be a random variable sampled according to the distribution $p_t$, and let
$$\hat g_t = \frac{(g_t\cdot e_{i_t})\,e_{i_t}}{p_{t,i_t}}, \qquad \hat{\tilde g}_t = \frac{(\tilde g_t\cdot e_{i_t})\,e_{i_t}}{p_{t,i_t}},$$
be the estimated gradient and estimated gradient prediction. Let
$$r_{0:t} = \sum_{i=1}^{n}\sum_{s=1}^{t} \frac{\sqrt{\sum_{a=1}^{s}(\hat g_{a,i} - \hat{\tilde g}_{a,i})^2} - \sqrt{\sum_{a=1}^{s-1}(\hat g_{a,i} - \hat{\tilde g}_{a,i})^2}}{2R_i}\,(x_i - x_{s,i})^2$$
be the adaptive regularization. Then, the regret of the algorithm can be bounded by:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha_t\psi(x_t) - f_t(x) - \alpha_t\psi(x)\right] \le 4\sum_{i=1}^{n} R_i\sqrt{\sum_{t=1}^{T}\mathbb{E}\left[\frac{(g_{t,i} - \tilde g_{t,i})^2}{p_{t,i}}\right]}.$$

In general, we do not have access to an element of the subgradient $g_t$ before we sample according to $p_t$. However, if we assume that we have some per-coordinate upper bound on an element of the subgradient, uniform in time, i.e., $|g_{t,j}| \le L_j$ for all $t \in \{1, \ldots, T\}$ and $j \in \{1, \ldots, n\}$, then we can use the fact that $|g_{t,j} - \tilde g_{t,j}| \le \max\{L_j - \tilde g_{t,j}, \tilde g_{t,j}\}$ to motivate setting $\tilde g_{t,j} := \frac{L_j}{2}$ and $p_{t,j} = \frac{(R_j L_j)^{2/3}}{\sum_{k=1}^{n}(R_k L_k)^{2/3}}$ (by computing the optimal distribution). This yields the following regret bound.

Corollary 4 (CAO-RCD-Lipschitz). Assume that at any time $t$ the following per-coordinate Lipschitz bounds hold on the loss function: $|g_{t,i}| \le L_i$ for all $i \in \{1, \ldots, n\}$. Set $p_{t,i} = \frac{(R_i L_i)^{2/3}}{\sum_{j=1}^{n}(R_j L_j)^{2/3}}$ as the probability distribution at time $t$, and set $\tilde g_{t,i} = \frac{L_i}{2}$. Then, the regret of the algorithm can be bounded as follows:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha_t\psi(x_t) - f_t(x) - \alpha_t\psi(x)\right] \le 2\sqrt{T}\left(\sum_{i=1}^{n}(R_i L_i)^{2/3}\right)^{3/2}.$$

An application of Hölder's inequality will reveal that this bound is strictly smaller than the $2RL\sqrt{nT}$ bound one would obtain from randomized coordinate descent using the uniform distribution. Moreover, the algorithm above still entertains the intermediate data-dependent bound of Theorem 7. Notice the similarity between the sampling distribution generated here and the one suggested by Nesterov (2012) [12]. However, Nesterov assumed higher regularity in his algorithm (i.e., $f_t \in C^{1,1}$) and generated his probabilities from there. In our setting, we only need $f_t \in C^{0,1}$. We can also derive the analogous mini-batch update, which is provided in Appendix 9.
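The Python sketch below shows one step of the sampling scheme of Corollary 4, assuming known per-coordinate bounds $L_i$ and box radii $R_i$: it forms the distribution $p_i \propto (R_i L_i)^{2/3}$, the prediction $\tilde g_i = L_i/2$, and the importance-weighted estimates $\hat g_t$ and $\hat{\tilde g}_t$ that feed the CAOS-FTRL update. The helper name and toy numbers are illustrative, not part of the original presentation.

import numpy as np

def cao_rcd_step(g, R, L, rng):
    """One CAO-RCD sampling step (Corollary 4 setting): sample a coordinate from
    p_i proportional to (R_i * L_i)^(2/3), use the prediction g~_i = L_i / 2, and form
    the importance-weighted gradient and prediction estimates. The full subgradient
    g is used here only to read off the sampled coordinate."""
    R, L = np.asarray(R, dtype=float), np.asarray(L, dtype=float)
    p = (R * L) ** (2.0 / 3.0)
    p /= p.sum()
    i = rng.choice(len(p), p=p)
    g_hat = np.zeros(len(p)); g_hat[i] = g[i] / p[i]                      # estimated gradient
    g_tilde_hat = np.zeros(len(p)); g_tilde_hat[i] = (L[i] / 2) / p[i]    # estimated prediction
    return i, p, g_hat, g_tilde_hat

rng = np.random.default_rng(0)
print(cao_rcd_step(np.array([0.9, -0.2, 0.05]), R=[1.0, 1.0, 1.0], L=[1.0, 0.25, 0.1], rng=rng))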

3.2.2 Stochastic Regularized Empirical Risk Minimization

Many learning algorithms can be viewed as instances of regularized empirical risk minimization (e.g., SVM, logistic regression, Lasso), where the goal is to minimize an objective function of the following form:
$$H(x) = \sum_{j=1}^{m} f_j(x) + \alpha\psi(x).$$
If we denote the first term by $F(x) = \sum_{j=1}^{m} f_j(x)$, then we can view this objective in our CAOS-FTRL framework, where $f_t \equiv F$ and $\psi_t \equiv \alpha\psi$. In the same spirit as for non-uniform random coordinate descent, we can estimate the gradient of $H$ at $x_t$ by sampling according to some distribution $p_t$ and use importance weighting to generate an unbiased estimate: if $g_t \in \partial F(x_t)$ and $g_t^j \in \partial f_j(x_t)$, then
$$g_t = \sum_{j=1}^{m} g_t^j \approx \frac{g_t^{j_t}}{p_{t,j_t}}.$$
This motivates the design of an algorithm similar to the one derived for randomized coordinate descent. The full details are provided in Appendix 10, including a mini-batch version which can be very useful due to the variance reduction of the gradient prediction.
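As a small illustration of this estimate, the Python sketch below samples one term of the sum according to $p_t$ and importance-weights it; the per-example gradient functions and the distribution are hypothetical stand-ins.

import numpy as np

def erm_importance_gradient(grad_fns, x, p, rng):
    """Importance-weighted gradient estimate for regularized ERM: sample j ~ p over
    the m loss terms and return g_hat = g_j(x) / p_j, an unbiased estimate of the
    full gradient sum_j g_j(x)."""
    j = rng.choice(len(grad_fns), p=p)
    return grad_fns[j](x) / p[j]

# toy usage with three quadratic losses f_j(x) = 0.5 * ||x - c_j||^2
centers = [np.array([1.0]), np.array([-2.0]), np.array([0.5])]
grad_fns = [lambda x, c=c: x - c for c in centers]
p = np.array([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)
est = np.mean([erm_importance_gradient(grad_fns, np.array([0.0]), p, rng) for _ in range(100_000)], axis=0)
exact = sum(f(np.array([0.0])) for f in grad_fns)
print(est, exact)   # the empirical average approaches the exact full gradient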

4 Conclusion

We presented a general framework for developing efficient adaptive and optimistic algorithms for online convex optimization. Building upon recent advances in adaptive regularization and predictable online learning, we improved upon each method. We demonstrated the power of this approach by deriving algorithms with better guarantees than those commonly used in practice. In addition, we extended adaptive and optimistic online learning to the randomized setting. Here, we highlighted an additional source of problem-dependent adaptivity (that of prescribing the sampling distribution), and we showed how one can perform better than traditional naive uniform sampling.


References

[1] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In NIPS, pages 65–72, 2007.
[2] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[4] Chao-Kai Chiang, Chia-Jung Lee, and Chi-Jen Lu. Beating bandits in gradually evolving worlds. In COLT, pages 210–227, 2013.
[5] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT, pages 6.1–6.20, 2012.
[6] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, pages 257–269, 2010.
[7] Elad Hazan and Satyen Kale. Better algorithms for benign bandits. In SODA, pages 38–47, 2009.
[8] N. Littlestone. From on-line to batch learning. In COLT, pages 269–284, 1989.
[9] H. Brendan McMahan. Analysis techniques for adaptive online learning. CoRR, 2014.
[10] H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In COLT, pages 244–256, 2010.
[11] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[12] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, pages 341–362, 2012.
[13] Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. CoRR, 2013.
[14] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, pages 993–1019, 2013.
[15] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
[16] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, pages 107–194, 2012.
[17] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, pages 1865–1892, 2011.
[18] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML, pages 1593–1601, 2014.
[19] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling. CoRR, 2014.
[20] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.


Appendix

5 Proofs for Section 2

Lemma 1 (Duality Between Smoothness and Convexity for Convex Functions). Let $K$ be a convex set and $f\colon K \to \mathbb{R}$ be a convex function. Suppose $f$ is 1-strongly convex at $x_0$. Then $f^*$, the Legendre transform of $f$, is 1-strongly smooth at $y_0 = \nabla f(x_0)$.

Proof. Notice first that for any pair of convex functions $f, g\colon K \to \mathbb{R}$, the fact that $f(x) \ge g(x)$ for all $x \in K$ implies that $f^*(y) \le g^*(y)$ for all $y$. Now, $f$ being 1-strongly convex at $x_0$ means that $f(x) \ge h(x) = f(x_0) + g_0\cdot(x - x_0) + \frac{1}{2}\|x - x_0\|_2^2$, where $g_0 = \nabla f(x_0) = y_0$. Thus, it suffices to show that $h^*(y) = f^*(y_0) + x_0\cdot(y - y_0) + \frac{1}{2}\|y - y_0\|_2^2$, since $x_0 = \nabla(h^*)(y_0)$. To see this, we can compute that
$$\begin{aligned}
h^*(y) &= \max_x\; y\cdot x - h(x)\\
&= y\cdot(y - y_0 + x_0) - h(x) && \text{(the max is attained for } y_0 + (x - x_0) = \nabla h(x) = y\text{)}\\
&= y\cdot(y - y_0 + x_0) - \Big[f(x_0) + y_0\cdot(x - x_0) + \tfrac{1}{2}\|x - x_0\|_2^2\Big]\\
&= \tfrac{1}{2}\|y - y_0\|_2^2 + y\cdot x_0 - f(x_0)\\
&= -f(x_0) + x_0\cdot y_0 + x_0\cdot(y - y_0) + \tfrac{1}{2}\|y - y_0\|_2^2\\
&= f^*(y_0) + x_0\cdot(y - y_0) + \tfrac{1}{2}\|y - y_0\|_2^2.
\end{aligned}$$

Theorem 2 (AO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$ (i.e., $r_{0:t}$ is 1-strongly convex with respect to $\|\cdot\|_{(t)}$). Then, the following regret bound holds for AO-FTRL (Algorithm 1):
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2.$$

Proof. Recall that $x_{t+1} = \operatorname{argmin}_x\, x\cdot(g_{1:t} + \tilde g_{t+1}) + r_{0:t}(x)$, and let $y_t = \operatorname{argmin}_x\, x\cdot g_{1:t} + r_{0:t-1}(x)$. Then, by convexity,
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \sum_{t=1}^{T} g_t\cdot(x_t - x) = \sum_{t=1}^{T} (g_t - \tilde g_t)\cdot(x_t - y_t) + \tilde g_t\cdot(x_t - y_t) + g_t\cdot(y_t - x).$$
Now, we first show via induction that for all $x \in K$ the following holds:
$$\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t \le \sum_{t=1}^{T} g_t\cdot x + r_{0:T-1}(x).$$
For $T = 1$, the fact that $r_t \ge 0$, $\tilde g_1 = 0$, and the definition of $y_t$ imply the result. Now suppose the result is true for time $T$. Then
$$\begin{aligned}
\sum_{t=1}^{T+1} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t
&= \left[\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1}\\
&\le \left[\sum_{t=1}^{T} g_t\cdot x_{T+1} + r_{0:T-1}(x_{T+1})\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(induction hypothesis for } x = x_{T+1}\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot x_{T+1} + r_{0:T}(x_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(since } r_t \ge 0 \text{ for all } t\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot y_{T+1} + r_{0:T}(y_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(by definition of } x_{T+1}\text{)}\\
&\le g_{1:T+1}\cdot y + r_{0:T}(y) \quad \text{for any } y. && \text{(by definition of } y_{T+1}\text{)}
\end{aligned}$$
Thus, we have that $\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T}(g_t - \tilde g_t)\cdot(x_t - y_t)$, and it suffices to bound $\sum_{t=1}^{T}(g_t - \tilde g_t)\cdot(x_t - y_t)$. By duality again, one can immediately get $(g_t - \tilde g_t)\cdot(x_t - y_t) \le \|g_t - \tilde g_t\|_{(t-1),*}\,\|x_t - y_t\|_{(t-1)}$. To bound $\|x_t - y_t\|_{(t-1)}$ in terms of the gradient, recall first that
$$x_t = \operatorname{argmin}_x\; h_{0:t-1}(x), \qquad y_t = \operatorname{argmin}_x\; h_{0:t-1}(x) + (g_t - \tilde g_t)\cdot x.$$
The fact that $r_{0:t-1}(x)$ is 1-strongly convex with respect to the norm $\|\cdot\|_{(t-1)}$ implies that $h_{0:t-1}$ is as well. In particular, it is strongly convex at the points $x_t$ and $y_t$. But this then implies that the conjugate function is smooth at $\nabla(h_{0:t-1})(x_t)$ and $\nabla(h_{0:t-1})(y_t)$, so that
$$\left\|\nabla(h_{0:t-1}^*)(-(g_t - \tilde g_t)) - \nabla(h_{0:t-1}^*)(0)\right\|_{(t-1)} \le \|g_t - \tilde g_t\|_{(t-1),*}.$$
Since $\nabla(h_{0:t-1}^*)(-(g_t - \tilde g_t)) = y_t$ and $\nabla(h_{0:t-1}^*)(0) = x_t$, we have that $\|x_t - y_t\|_{(t-1)} \le \|g_t - \tilde g_t\|_{(t-1),*}$.

Theorem 3 (CAO-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions, such that $\operatorname{argmin}_{x\in K} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then the following regret bounds hold for CAO-FTRL (Algorithm 2):
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2,$$
$$\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2.$$

Proof. For the first regret bound, define the auxiliary regularization functions $\tilde r_t(x) = r_t(x) + \psi_t(x)$, and apply Theorem 2 to get
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \tilde r_{0:T-1}(x) + \sum_{t=1}^{T}\|g_t - \tilde g_t\|_{(t-1),*}^2 = \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T}\|g_t - \tilde g_t\|_{(t-1),*}^2.$$
Notice that while $r_t$ is proximal, $\tilde r_t$, in general, is not, and so we must apply the theorem with general regularizers instead of the one with proximal regularizers.

For the second regret bound, we can follow the prescription of Theorem 1 while keeping track of the additional composite terms. Recall that $x_{t+1} = \operatorname{argmin}_x\, x\cdot(g_{1:t} + \tilde g_{t+1}) + r_{0:t+1}(x) + \psi_{1:t+1}(x)$, and let $y_t = \operatorname{argmin}_x\, x\cdot g_{1:t} + r_{0:t}(x) + \psi_{1:t}(x)$. We can compute that
$$\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - [f_t(x) + \psi_t(x)] \le \sum_{t=1}^{T} g_t\cdot(x_t - x) + \psi_t(x_t) - \psi_t(x) = \sum_{t=1}^{T} (g_t - \tilde g_t)\cdot(x_t - y_t) + \tilde g_t\cdot(x_t - y_t) + g_t\cdot(y_t - x) + \psi_t(x_t) - \psi_t(x).$$
Similar to before, we show via induction that for all $x \in K$,
$$\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t + \psi_t(x_t) \le r_{0:T}(x) + \sum_{t=1}^{T} g_t\cdot x + \psi_t(x).$$
For $T = 1$, the fact that $r_t \ge 0$, $\tilde g_1 = 0$, $\psi_1(x_1) = 0$, and the definition of $y_t$ imply the result. Now suppose the result is true for time $T$. Then
$$\begin{aligned}
\sum_{t=1}^{T+1} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t + \psi_t(x_t)
&= \left[\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t + \psi_t(x_t)\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1} + \psi_{T+1}(x_{T+1})\\
&\le \left[\sum_{t=1}^{T} g_t\cdot x_{T+1} + \psi_t(x_{T+1}) + r_{0:T}(x_{T+1})\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1} + \psi_{T+1}(x_{T+1}) && \text{(induction hypothesis for } x = x_{T+1}\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot x_{T+1} + r_{0:T+1}(x_{T+1}) + \psi_{1:T+1}(x_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(since } r_t \ge 0 \text{ for all } t\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot y_{T+1} + r_{0:T+1}(y_{T+1}) + \psi_{1:T+1}(y_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(by definition of } x_{T+1}\text{)}\\
&\le g_{1:T+1}\cdot y + r_{0:T+1}(y) + \psi_{1:T+1}(y) \quad \text{for any } y. && \text{(by definition of } y_{T+1}\text{)}
\end{aligned}$$
Thus, we have that
$$\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - [f_t(x) + \psi_t(x)] \le r_{0:T}(x) + \sum_{t=1}^{T}(g_t - \tilde g_t)\cdot(x_t - y_t),$$

and we can bound the sum in the same way as before, since the strong convexity properties of $h_{0:t}$ are retained due to the convexity of $\psi_t$.

Theorem 4 (CAO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t}\colon x \mapsto g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the following regret bounds hold for CAO-FTRL (Algorithm 2):
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2,$$
$$\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2.$$

Proof. For the first regret bound, define the auxiliary regularization functions $\tilde r_t(x) = r_t(x) + \psi_t(x)$, and apply Theorem 2 to get
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \tilde r_{0:T-1}(x) + \sum_{t=1}^{T}\|g_t - \tilde g_t\|_{(t-1),*}^2 = \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T}\|g_t - \tilde g_t\|_{(t-1),*}^2.$$
For the second bound, we can proceed as in the original proof, but now keep track of the additional composite terms. Recall that $x_{t+1} = \operatorname{argmin}_x\, x\cdot(g_{1:t} + \tilde g_{t+1}) + r_{0:t}(x) + \psi_{1:t+1}(x)$, and let $y_t = \operatorname{argmin}_x\, x\cdot g_{1:t} + r_{0:t-1}(x) + \psi_{1:t}(x)$. Then
$$\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \le \sum_{t=1}^{T} g_t\cdot(x_t - x) + \psi_t(x_t) - \psi_t(x) = \sum_{t=1}^{T} (g_t - \tilde g_t)\cdot(x_t - y_t) + \tilde g_t\cdot(x_t - y_t) + g_t\cdot(y_t - x) + \psi_t(x_t) - \psi_t(x).$$
Now, we show via induction that for all $x \in K$,
$$\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t + \psi_t(x_t) \le \sum_{t=1}^{T} g_t\cdot x + \psi_t(x) + r_{0:T-1}(x).$$
For $T = 1$, the fact that $r_t \ge 0$, $\tilde g_1 = 0$, $\psi_1(x_1) = 0$, and the definition of $y_t$ imply the result. Now suppose the result is true for time $T$. Then
$$\begin{aligned}
\sum_{t=1}^{T+1} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t + \psi_t(x_t)
&= \left[\sum_{t=1}^{T} \tilde g_t\cdot(x_t - y_t) + g_t\cdot y_t + \psi_t(x_t)\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1} + \psi_{T+1}(x_{T+1})\\
&\le \left[\sum_{t=1}^{T} g_t\cdot x_{T+1} + \psi_t(x_{T+1}) + r_{0:T-1}(x_{T+1})\right] + \tilde g_{T+1}\cdot(x_{T+1} - y_{T+1}) + g_{T+1}\cdot y_{T+1} + \psi_{T+1}(x_{T+1}) && \text{(induction hypothesis for } x = x_{T+1}\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot x_{T+1} + r_{0:T}(x_{T+1}) + \psi_{1:T+1}(x_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(since } r_t \ge 0 \text{ for all } t\text{)}\\
&\le \left[(g_{1:T} + \tilde g_{T+1})\cdot y_{T+1} + r_{0:T}(y_{T+1}) + \psi_{1:T+1}(y_{T+1})\right] + \tilde g_{T+1}\cdot(-y_{T+1}) + g_{T+1}\cdot y_{T+1} && \text{(by definition of } x_{T+1}\text{)}\\
&\le g_{1:T+1}\cdot y + r_{0:T}(y) + \psi_{1:T+1}(y) \quad \text{for any } y. && \text{(by definition of } y_{T+1}\text{)}
\end{aligned}$$
Thus, we have that $\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T}(g_t - \tilde g_t)\cdot(x_t - y_t)$, and the remainder follows as in the non-composite setting since the strong convexity properties are retained.

6 Proofs for Section 2.3.1

The following lemma is central to the derivation of regret bounds for many algorithms employing adaptive regularization. Its proof, via induction, can be found in Auer et al. (2002).

Lemma 2. Let $\{a_j\}_{j=1}^{\infty}$ be a sequence of non-negative numbers. Then
$$\sum_{j=1}^{t} \frac{a_j}{\sqrt{\sum_{k=1}^{j} a_k}} \le 2\sqrt{\sum_{j=1}^{t} a_j}.$$
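A quick numerical spot-check of Lemma 2 (not a substitute for the inductive proof) can be run as follows; the random sequences are arbitrary.

import numpy as np

# Spot-check: sum_j a_j / sqrt(sum_{k<=j} a_k) <= 2 * sqrt(sum_j a_j) for a_j >= 0.
rng = np.random.default_rng(0)
for _ in range(5):
    a = rng.random(1000) + 1e-12      # strictly positive to avoid 0/0 in the check
    cums = np.cumsum(a)
    lhs = np.sum(a / np.sqrt(cums))
    rhs = 2.0 * np.sqrt(cums[-1])
    print(f"{lhs:.4f} <= {rhs:.4f}: {lhs <= rhs}")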

Corollary 2 (AO-GD). Let $K \subset \times_{i=1}^{n}[-R_i, R_i]$ be an $n$-dimensional rectangle, and set
$$r_{0:t} = \sum_{i=1}^{n}\sum_{s=1}^{t} \frac{\sqrt{\sum_{a=1}^{s}(g_{a,i} - \tilde g_{a,i})^2} - \sqrt{\sum_{a=1}^{s-1}(g_{a,i} - \tilde g_{a,i})^2}}{2R_i}\,(x_i - x_{s,i})^2.$$
Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret bound holds:
$$\mathrm{Reg}_T(x) \le 4\sum_{i=1}^{n} R_i\sqrt{\sum_{t=1}^{T}(g_{t,i} - g_{t-1,i})^2}.$$

Proof. $r_{0:t}$ is 1-strongly convex with respect to the norm
$$\|x\|_{(t)}^2 = \sum_{i=1}^{n} \frac{\sqrt{\sum_{a=1}^{t}(g_{a,i} - \tilde g_{a,i})^2}}{R_i}\, x_i^2,$$
which has corresponding dual norm
$$\|x\|_{(t),*}^2 = \sum_{i=1}^{n} \frac{R_i}{\sqrt{\sum_{a=1}^{t}(g_{a,i} - \tilde g_{a,i})^2}}\, x_i^2.$$
By the choice of this regularization, the prediction $\tilde g_t = g_{t-1}$, and Theorem 3, the following holds:
$$\begin{aligned}
\mathrm{Reg}_T(\mathcal{A}, x) &\le \sum_{i=1}^{n}\sum_{s=1}^{T} \frac{\sqrt{\sum_{a=1}^{s}(g_{a,i} - \tilde g_{a,i})^2} - \sqrt{\sum_{a=1}^{s-1}(g_{a,i} - \tilde g_{a,i})^2}}{2R_i}\,(x_i - x_{s,i})^2 + \sum_{t=1}^{T}\|g_t - g_{t-1}\|_{(t),*}^2\\
&\le \sum_{i=1}^{n} 2R_i\sqrt{\sum_{t=1}^{T}(g_{t,i} - g_{t-1,i})^2} + \sum_{t=1}^{T}\sum_{i=1}^{n} \frac{R_i\,(g_{t,i} - g_{t-1,i})^2}{\sqrt{\sum_{a=1}^{t}(g_{a,i} - g_{a-1,i})^2}}\\
&\le \sum_{i=1}^{n} 2R_i\sqrt{\sum_{t=1}^{T}(g_{t,i} - g_{t-1,i})^2} + \sum_{i=1}^{n} 2R_i\sqrt{\sum_{t=1}^{T}(g_{t,i} - g_{t-1,i})^2}, && \text{(by Lemma 2)}
\end{aligned}$$
where the second inequality uses $(x_i - x_{s,i})^2 \le (2R_i)^2$ and a telescoping sum.

7 Proofs for Section 2.3.2

Corollary 3 (AO-EG). Let $K = \Delta_n$ be the $n$-dimensional simplex and $\varphi\colon x \mapsto \sum_{i=1}^{n} x_i\log(x_i)$ the negative entropy. Assume that $\|g_t\| \le C$ for all $t$ and set
$$r_{0:t} = 2\sqrt{\frac{C + \sum_{s=1}^{t}\|g_s - \tilde g_s\|_2^2}{\log(n)}}\,\big(\varphi + \log(n)\big).$$
Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret bound holds:
$$\mathrm{Reg}_T(\mathcal{A}, x) \le 2\sqrt{2\log(n)\left(C + \sum_{t=1}^{T-1}\|g_t - g_{t-1}\|_2^2\right)}.$$

Proof. Since the negative entropy $\varphi$ is 1-strongly convex with respect to the Euclidean norm, $r_{0:t}$ is $2\sqrt{\frac{C + \sum_{s=1}^{t}\|g_s - \tilde g_s\|_2^2}{\log(n)}}$-strongly convex with respect to the same norm. Applying Theorem 2 yields a regret bound of
$$\mathrm{Reg}_T(\mathcal{A}, x) \le r_{0:T-1}(x) + \sum_{t=1}^{T}\|g_t - \tilde g_t\|_{(t-1),*}^2 = 2\sqrt{\frac{C + \sum_{s=1}^{T-1}\|g_s - \tilde g_s\|_2^2}{\log(n)}}\,\big(\varphi(x) + \log(n)\big) + \sum_{t=1}^{T}\frac{1}{2}\sqrt{\frac{\log(n)}{C + \sum_{s=1}^{t-1}\|g_s - \tilde g_s\|_2^2}}\,\|g_t - \tilde g_t\|_2^2.$$
Bounding $\varphi(x) + \log(n) \le \log(n)$ on the simplex and applying the same telescoping argument as in Lemma 2 to the last sum gives
$$\mathrm{Reg}_T(\mathcal{A}, x) \le 2\sqrt{2\log(n)\left(C + \sum_{s=1}^{T-1}\|g_s - \tilde g_s\|_2^2\right)}.$$

8 Proofs for Section 3

Theorem 5 (CAOS-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions, such that $\operatorname{argmin}_{x\in K} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $\hat g_t$ given the history of noisy gradients $\hat g_1, \ldots, \hat g_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t}(x) = \hat g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update $x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the following regret bounds:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) - f_t(x)\right] \le \mathbb{E}\left[\psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right],$$
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x)\right] \le \mathbb{E}\left[r_{0:T}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t),*}^2\right].$$

Proof. Observe that
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) - f_t(x)\right] \le \sum_{t=1}^{T}\mathbb{E}\left[g_t\cdot(x_t - x)\right] = \sum_{t=1}^{T}\mathbb{E}\Big[\mathbb{E}[\hat g_t \mid \hat g_1, \ldots, \hat g_{t-1}, x_1, \ldots, x_t]\cdot(x_t - x)\Big] = \sum_{t=1}^{T}\mathbb{E}\Big[\mathbb{E}[\hat g_t\cdot(x_t - x) \mid \hat g_1, \ldots, \hat g_{t-1}, x_1, \ldots, x_t]\Big] = \sum_{t=1}^{T}\mathbb{E}\left[\hat g_t\cdot(x_t - x)\right].$$
This implies that upon taking an expectation, we can freely upper bound the difference $f_t(x_t) - f_t(x)$ by the noisy linearized estimate $\hat g_t\cdot(x_t - x)$. After that, we can apply Algorithm 2 on the gradient estimates to get the bounds:
$$\mathbb{E}\left[\sum_{t=1}^{T} \hat g_t\cdot(x_t - x)\right] \le \mathbb{E}\left[\psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T}\|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right],$$
$$\mathbb{E}\left[\sum_{t=1}^{T} \hat g_t\cdot(x_t - x) + \psi_t(x_t) - \psi_t(x)\right] \le \mathbb{E}\left[r_{0:T}(x) + \sum_{t=1}^{T}\|\hat g_t - \tilde g_t\|_{(t),*}^2\right].$$

Theorem 6 (CAOS-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let $\tilde g_t$ be the learner's estimate of $\hat g_t$ given the history of noisy gradients $\hat g_1, \ldots, \hat g_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume furthermore that the function $h_{0:t}(x) = \hat g_{1:t}\cdot x + \tilde g_{t+1}\cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update $x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the regret bounds:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) - f_t(x)\right] \le \mathbb{E}\left[\psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right],$$
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x)\right] \le \mathbb{E}\left[r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2\right].$$

Proof. The argument is the same as for Theorem 5, except that we now apply the bound of Theorem 4 at the end.

9 Proofs for Section 3.2.1

Theorem 7 (CAO-RCD). Assume $K \subset \times_{i=1}^{n}[-R_i, R_i]$. Let $i_t$ be a random variable sampled according to the distribution $p_t$, and let
$$\hat g_t = \frac{(g_t\cdot e_{i_t})\,e_{i_t}}{p_{t,i_t}}, \qquad \hat{\tilde g}_t = \frac{(\tilde g_t\cdot e_{i_t})\,e_{i_t}}{p_{t,i_t}},$$
be the estimated gradient and estimated gradient prediction. Let
$$r_{0:t} = \sum_{i=1}^{n}\sum_{s=1}^{t} \frac{\sqrt{\sum_{a=1}^{s}(\hat g_{a,i} - \hat{\tilde g}_{a,i})^2} - \sqrt{\sum_{a=1}^{s-1}(\hat g_{a,i} - \hat{\tilde g}_{a,i})^2}}{2R_i}\,(x_i - x_{s,i})^2$$
be the adaptive regularization. Then the algorithm will have regret bounded by:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha_t\psi(x_t) - f_t(x) - \alpha_t\psi(x)\right] \le 4\sum_{i=1}^{n} R_i\sqrt{\sum_{t=1}^{T}\mathbb{E}\left[\frac{(g_{t,i} - \tilde g_{t,i})^2}{p_{t,i}}\right]}.$$

Proof. We can first compute that
$$\mathbb{E}[\hat g_t] = \mathbb{E}\left[\frac{(g_t\cdot e_{i_t})\,e_{i_t}}{p_{t,i_t}}\right] = \sum_{i=1}^{n}\frac{(g_t\cdot e_i)\,e_i}{p_{t,i}}\,p_{t,i} = g_t,$$
and similarly for the gradient prediction $\tilde g_t$. Now, as in Corollary 2, the choice of regularization ensures us a regret bound of the form
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha_t\psi(x_t) - f_t(x) - \alpha_t\psi(x)\right] \le 4\sum_{i=1}^{n} R_i\,\mathbb{E}\left[\sqrt{\sum_{t=1}^{T}(\hat g_{t,i} - \hat{\tilde g}_{t,i})^2}\right].$$
Moreover, we can compute that
$$\mathbb{E}\left[\sqrt{\sum_{t=1}^{T}(\hat g_{t,i} - \hat{\tilde g}_{t,i})^2}\right] \le \sqrt{\sum_{t=1}^{T}\mathbb{E}\Big[\mathbb{E}_{i_t}\big[(\hat g_{t,i} - \hat{\tilde g}_{t,i})^2\big]\Big]} = \sqrt{\sum_{t=1}^{T}\mathbb{E}\left[\frac{(g_{t,i} - \tilde g_{t,i})^2}{p_{t,i}}\right]}.$$

Corollary 5 (CAO-RCD-Lipschitz-Mini-Batch). Assume $K \subset \times_{i=1}^{n}[-R_i, R_i]$. Let $\cup_{j=1}^{k}\{\Pi_j\} = \{1, \ldots, n\}$ be a partition of the coordinates, and let $e_{\Pi_j} = \sum_{i\in\Pi_j} e_i$. Assume we had the following Lipschitz condition on the partition: $\|g_t\cdot e_{\Pi_j}\| \le L_j$ for all $j \in \{1, \ldots, k\}$. Define $S_i = \sum_{j\in\Pi_i} R_j$. Set $p_{t,i} = \frac{(S_i L_i)^{2/3}}{\sum_{j=1}^{k}(S_j L_j)^{2/3}}$ as the probability distribution at time $t$, and set $\tilde g_{t,i} = \frac{L_i}{2}$. Then the algorithm will have regret bounded by:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha_t\psi(x_t) - f_t(x) - \alpha_t\psi(x)\right] \le 4\sqrt{T}\left(\sum_{i=1}^{k}(S_i L_i)^{2/3}\right)^{3/2}.$$

10 Further Discussion for Section 3.2.2

Theorem 8 (AOS-Reg-ERM). Assume $K \subset B_R$. Let $j_t$ be a random variable sampled according to $p_t$, and let $\hat g_t = \frac{g_t^{j_t}}{p_{t,j_t}}$, $\tilde g_t = \hat g_{t-1}$, be the estimated gradient and estimated gradient prediction. Let
$$r_{0:t} = \sum_{s=1}^{t} \frac{\sqrt{\sum_{a=1}^{s}\|\hat g_a - \hat g_{a-1}\|_2^2} - \sqrt{\sum_{a=1}^{s-1}\|\hat g_a - \hat g_{a-1}\|_2^2}}{2R}\,\|x - x_s\|_2^2$$
be the adaptive regularization. Then the algorithm will have regret bounded by:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x)\right] \le 4R\sqrt{\sum_{t=1}^{T}\sum_{i,j=1}^{m}\mathbb{E}\left[\frac{\left\|p_{t-1,j}\,g_t^i - p_{t,i}\,g_{t-1}^j\right\|_2^2}{p_{t,i}\,p_{t-1,j}}\right]}.$$

Proof. By applying Theorem 5 with the parameters in the hypothesis as well as the computation in Corollary 2, we get that
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x)\right] \le 4R\sqrt{\sum_{t=1}^{T}\mathbb{E}\left[\|\hat g_t - \hat g_{t-1}\|_2^2\right]}.$$
We can then compute that
$$\sum_{t=1}^{T}\mathbb{E}\left[\|\hat g_t - \hat g_{t-1}\|_2^2\right] = \sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{g_t^{j_t}}{p_{t,j_t}} - \frac{g_{t-1}^{j_{t-1}}}{p_{t-1,j_{t-1}}}\right\|_2^2\right] = \sum_{t=1}^{T}\sum_{i,j=1}^{m}\frac{\left\|p_{t-1,j}\,g_t^i - p_{t,i}\,g_{t-1}^j\right\|_2^2}{p_{t,i}\,p_{t-1,j}}.$$

As in random coordinate descent, while it is unlikely for one to have a priori information on the gradient of each function fi at time t, if we have some sort of upper bound, we can still scale our probabilities accordingly to produce better regret bounds than those under uniform sampling.


Corollary 6 (AOS-Reg-ERM-Lipschitz). Assume that at time $t$ we have the following per-function Lipschitz bounds on each loss function $f_i$ at point $x_t$: $\|g_t^i\| \le L_{t,i}$ for all $i \in \{1, \ldots, m\}$. Set $p_{t,i} = \frac{L_{t,i}}{\sum_{j=1}^{m} L_{t,j}}$ as the probability distribution at time $t$. Then the AOS-Regularized ERM algorithm will have regret bound:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x)\right] \le 4R\sqrt{\sum_{t=1}^{T}\sum_{i,j=1}^{m}\frac{\left\|p_{t-1,j}\,g_t^i - p_{t,i}\,g_{t-1}^j\right\|_2^2}{p_{t,i}\,p_{t-1,j}}} \le 8R\sqrt{\sum_{t=1}^{T}\left(\sum_{i=1}^{m} L_{t,i}\right)^2}.$$

Notice that when we have only a single function in our ERM objective (i.e., $m = 1$), this reduces to the AOS-FTRL algorithm from before. In practice, we expect a martingale-type prediction $\tilde g_{t+1} = \hat g_t$ to be more effective in a mini-batch setting, where the variance in the gradient is much smaller.

Corollary 7 (AOS-Reg-ERM with Non-Uniform Mini-batch Sampling). Let $\cup_{j=1}^{k}\{\Pi_j\} = \{1, \ldots, m\}$ be a partition of the functions $f_i$, and assume the following Lipschitz condition on each partition: $\left\|\sum_{i\in\Pi_j} g_t^i\right\| \le L_{t,j}$ for all $j \in \{1, \ldots, k\}$. Set $p_{t,i} = \frac{L_{t,i}}{\sum_{j=1}^{k} L_{t,j}}$ as the probability distribution at time $t$, and set $\hat g_t = \frac{\sum_{i\in\Pi_{j_t}} g_t^i}{p_{t,j_t}}$, $\tilde g_t = \hat g_{t-1}$, where $j_t \sim p_t$. Then the AOS-Reg-ERM Mini-batch algorithm will have regret bound:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x)\right] \le 4R\sqrt{\sum_{t=1}^{T}\sum_{i,j=1}^{k}\frac{\left\|p_{t-1,j}\sum_{a\in\Pi_i} g_t^a - p_{t,i}\sum_{b\in\Pi_j} g_{t-1}^b\right\|_2^2}{p_{t,i}\,p_{t-1,j}}} \le 8R\sqrt{\sum_{t=1}^{T}\left(\sum_{i=1}^{k} L_{t,i}\right)^2}.$$

A similar approach to Regularized ERM was developed independently by [19]. However, the one here improves upon that algorithm through the incorporation of adaptive regularization, optimistic gradient predictions, and the fact that we do not assume higher regularity conditions such as strong convexity for our loss functions.