Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods

Robert M. Freund (MIT), joint with Paul Grigas (Berkeley) and Rahul Mazumder (MIT)

Penn State University, March 2018



How can optimization inform statistics (and machine learning)?

Paper in preparation (this talk): Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods

A “cousin” paper of ours: A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives



Outline

Optimization primer: two basic first-order methods for convex optimization

Logistic regression perspectives: statistics "vs." machine learning

A pair of condition numbers for the logistic regression problem:

when the sample data is non-separable: a condition number for the degree of non-separability of the dataset, informing the convergence guarantees of Greedy Coordinate Descent (GCD) and Stochastic Gradient Descent (SGD), and guarantees on reaching linear convergence (thanks to Bach)

when the sample data is separable: a condition number for the degree of separability of the dataset, informing convergence guarantees to deliver an approximate maximum margin classifier


Review of Two Basic First-Order Methods for Convex Optimization

Two Basic First-Order Methods for Convex Optimization:

Greedy Coordinate Descent (GCD) method: "go in the best coordinate direction"

Stochastic Gradient Descent (SGD) method: "go in the direction of the negative of the stochastic estimate of the gradient"



Convex Optimization

The problem of interest is:

$$F^* := \min_{x \in \mathbb{R}^p} \; F(x)$$

where $F(\cdot)$ is differentiable and convex:

$$F(\lambda x + (1-\lambda)y) \le \lambda F(x) + (1-\lambda) F(y) \quad \text{for all } x, y, \text{ and all } \lambda \in [0,1]$$

Let $\|x\|$ denote the given norm on the variables $x \in \mathbb{R}^p$


Norms and Dual Norms

Let $\|x\|$ be the given norm on the variables $x \in \mathbb{R}^p$

The dual norm is $\|s\|_* := \max_x \{s^T x : \|x\| \le 1\}$

Some common norms and their dual norms:

$\ell_2$-norm: $\|x\|_2 = \sqrt{\sum_{j=1}^p |x_j|^2}$, with dual norm $\|s\|_* = \|s\|_2$

$\ell_1$-norm: $\|x\|_1 = \sum_{j=1}^p |x_j|$, with dual norm $\|s\|_* = \|s\|_\infty$

$\ell_\infty$-norm: $\|x\|_\infty = \max\{|x_1|, \ldots, |x_p|\}$, with dual norm $\|s\|_* = \|s\|_1$
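As a quick numerical sanity check on the $\ell_1$/$\ell_\infty$ pairing above (a sketch, not from the talk; it assumes numpy and scipy are available), the dual norm $\|s\|_* = \max\{s^T x : \|x\|_1 \le 1\}$ can be computed as a small linear program and compared against $\|s\|_\infty$:

```python
# Sketch: verify numerically that the dual of the l1-norm is the l_inf-norm.
# Solves max{ s^T x : ||x||_1 <= 1 } as an LP with x = u - v, u, v >= 0.
import numpy as np
from scipy.optimize import linprog

def dual_norm_of_l1(s):
    p = len(s)
    # variables z = [u; v]; maximize s^T(u - v)  <=>  minimize -s^T u + s^T v
    c = np.concatenate([-s, s])
    # sum(u) + sum(v) <= 1 enforces ||x||_1 <= 1 at the optimum
    A_ub = np.ones((1, 2 * p))
    b_ub = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
    return -res.fun

s = np.array([3.0, -1.0, 2.0])
print(dual_norm_of_l1(s), np.linalg.norm(s, np.inf))  # both print 3.0
```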


Lipschitz constant for the Gradient

$$F^* := \min_{x \in \mathbb{R}^p} \; F(x)$$

We say that $\nabla F(\cdot)$ is Lipschitz with parameter $L_F$ if:

$$\|\nabla F(x) - \nabla F(y)\|_* \le L_F \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^p$$

where $\|\cdot\|_*$ is the dual norm


Matrix Operator Norm

Let $M$ be a linear operator (matrix) $M : \mathbb{R}^p \to \mathbb{R}^n$, with norm $\|x\|_a$ on $\mathbb{R}^p$ and norm $\|v\|_b$ on $\mathbb{R}^n$

The operator norm of $M$ is given by:

$$\|M\|_{a,b} := \max_{x \ne 0} \; \frac{\|Mx\|_b}{\|x\|_a}$$
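The guarantees later in the talk are stated in terms of the specific operator norms $\|X\|_{1,2}$, $\|X\|_{2,2}$, and $\|X\|_{2,\infty}$ of the data matrix. A small sketch (not from the talk) computing them with numpy, using the standard closed forms: maximum column $\ell_2$ norm, largest singular value, and maximum row $\ell_2$ norm, respectively:

```python
# Sketch: the three operator norms of an n x p data matrix X (rows are x_i^T)
# that appear in the computational guarantees below.
import numpy as np

def operator_norms(X):
    norm_12 = np.max(np.linalg.norm(X, axis=0))    # ||X||_{1,2}   = max_j ||X e_j||_2
    norm_22 = np.linalg.norm(X, 2)                 # ||X||_{2,2}   = largest singular value
    norm_2inf = np.max(np.linalg.norm(X, axis=1))  # ||X||_{2,inf} = max_i ||x_i||_2
    return norm_12, norm_22, norm_2inf

X = np.random.randn(50, 10)
print(operator_norms(X))
```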


Greedy Coordinate Descent Method: "go in the best coordinate direction"

$$F^* := \min_{x \in \mathbb{R}^p} \; F(x)$$

Greedy Coordinate Descent
Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:

1. Compute gradient $\nabla F(x^k)$
2. Compute $j_k \in \arg\max_{j \in \{1,\ldots,p\}} |\nabla F(x^k)_j|$ and set $d^k \leftarrow \mathrm{sgn}(\nabla F(x^k)_{j_k})\, e_{j_k}$
3. Choose step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$


Greedy Coordinate Descent ≡ Steepest Descent in the $\ell_1$-Norm

$$F^* := \min_{x \in \mathbb{R}^p} \; F(x)$$

Steepest Descent method in the $\ell_1$-norm
Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:

1. Compute gradient $\nabla F(x^k)$
2. Compute direction: $d^k \leftarrow \arg\max_{\|d\|_1 \le 1} \{\nabla F(x^k)^T d\}$
3. Choose step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$


Greedy Coordinate Descent ≡ Steepest Descent in the $\ell_1$-Norm, cont.

$$d^k \leftarrow \arg\max_{\|d\|_1 \le 1} \{\nabla F(x^k)^T d\}$$

[Figure: the steepest-descent direction $d^k$ over the unit $\ell_1$ ball]


Metrics for Evaluating Greedy Coordinate Descent

$$F^* := \min_{x \in \mathbb{R}^p} \; F(x)$$

Assume $F(\cdot)$ is convex and $\nabla F(\cdot)$ is Lipschitz with parameter $L_F$:

$$\|\nabla F(x) - \nabla F(y)\|_\infty \le L_F \|x - y\|_1 \quad \text{for all } x, y \in \mathbb{R}^p$$

Two sets of interest:

$S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$

$S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions


Metrics for Evaluating Greedy Coordinate Descent, cont.

$S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$

$S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions

$$\mathrm{Dist}_0 := \max_{x \in S_0} \; \min_{x^* \in S^*} \; \|x - x^*\|_1$$

[Figure: the level set $S_0$ containing $x^0$, the optimal set $S^*$, and the distance $\mathrm{Dist}_0$]

(In high-dimensional machine learning problems, $S^*$ can be very big)


Computational Guarantees for Greedy Coordinate Descent

$$\mathrm{Dist}_0 := \max_{x \in S_0} \; \min_{x^* \in S^*} \; \|x - x^*\|_1$$

Theorem: Objective Function Value Convergence (essentially [Beck and Tetruashvili 2014], [Nesterov 2003])
If the step-sizes are chosen using the rule:

$$\alpha_k = \frac{\|\nabla F(x^k)\|_\infty}{L_F} \quad \text{for all } k \ge 0,$$

then for each $k \ge 0$ the following inequality holds:

$$F(x^k) - F^* \;\le\; \frac{2 L_F (\mathrm{Dist}_0)^2}{\hat{K}_0 + k}, \quad \text{where } \hat{K}_0 := \frac{2 L_F (\mathrm{Dist}_0)^2}{F(x^0) - F^*}.$$

Some properties of GCD:

GCD performs variable selection

GCD imparts implicit regularization

Just one tuning parameter (number of iterations)
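A minimal sketch of Greedy Coordinate Descent with the theorem's step-size rule (illustrative code, not the authors' implementation; the caller is assumed to supply `grad_F` and a valid Lipschitz constant `L_F` for the $\ell_1$/$\ell_\infty$ norm pair):

```python
# Sketch: Greedy Coordinate Descent with step-size alpha_k = ||grad F(x^k)||_inf / L_F.
import numpy as np

def greedy_coordinate_descent(grad_F, L_F, x0, num_iters):
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_F(x)
        j = np.argmax(np.abs(g))        # best coordinate direction
        alpha = np.abs(g[j]) / L_F      # step-size rule from the theorem
        x[j] -= alpha * np.sign(g[j])   # x^{k+1} = x^k - alpha_k * sgn(g_j) * e_j
    return x
```

Only one coordinate of $x$ changes per iteration, which is the source of the variable-selection and implicit-regularization behavior noted above.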


Implicit Regularization and Variable Selection Properties

Artificial example: $n = 1000$, $p = 100$, true model has 5 non-zeros

[Figure: coefficient and sparsity profiles of the GCD iterates as a function of iteration count]

Compare with explicit regularization schemes ($\ell_1$, $\ell_2$, etc.)


How do GCD and SGD Inform Logistic Regression?

Some questions: How do the computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent specialize to the case of Logistic Regression?

Can we say anything further about the convergence properties of these methods in the special case of Logistic Regression?

What role does problem structure/conditioning play in these guarantees?



Elementary Properties of the Logistic Loss Function

$$L_n^* := \min_{\beta} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\big(1 + \exp(-y_i \beta^T x_i)\big)$$

Recall that logistic regression "ideally" seeks $\beta$ for which $y_i x_i^T \beta \gg 0$ for all $i$:

$y_i = +1 \;\Rightarrow\; x_i^T \beta \gg 0$

$y_i = -1 \;\Rightarrow\; x_i^T \beta \ll 0$

[Figure: the logistic loss $\ln(1 + \exp(-t))$ as a function of $t = y_i \beta^T x_i$]
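A sketch (not the authors' code) of $L_n(\beta)$ and its gradient, assuming `X` is the $n \times p$ data matrix with rows $x_i^T$ and `y` has entries in $\{-1, +1\}$; the loss is computed via `logaddexp` to avoid overflow:

```python
# Sketch: logistic loss L_n(beta) = (1/n) sum_i ln(1 + exp(-y_i x_i^T beta))
# and its gradient.
import numpy as np

def logistic_loss(beta, X, y):
    t = y * (X @ beta)                         # t_i = y_i * x_i^T beta
    return np.mean(np.logaddexp(0.0, -t))      # stable ln(1 + exp(-t_i)), averaged

def logistic_grad(beta, X, y):
    t = y * (X @ beta)
    sigma = np.exp(-np.logaddexp(0.0, t))      # = 1 / (1 + exp(t_i)), computed stably
    return -(X.T @ (y * sigma)) / len(y)       # -(1/n) sum_i y_i x_i / (1 + exp(t_i))
```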


Geometry of the Data: Separable and Non-Separable Data

[Figure: (a) Data is Non-Separable; (b) Data is Separable]


Very/Mildly Separable and Non-Separable Data

[Figure: (a) Data is Very Non-Separable; (b) Data is Very Separable; (c) Data is Mildly Non-Separable; (d) Data is Mildly Separable]


Separable and Non-Separable Data

Separable Data: the data is separable if there exists $\bar\beta$ for which

$$y_i \cdot \bar\beta^T x_i > 0 \quad \text{for all } i = 1, \ldots, n$$

Non-Separable Data: the data is non-separable if it is not separable, namely, every $\beta$ satisfies

$$y_i \cdot \beta^T x_i \le 0 \quad \text{for at least one } i \in \{1, \ldots, n\}$$


Separable Data and Non-Attainment of Optimum

$$L_n^* := \min_{\beta} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\big(1 + \exp(-y_i \beta^T x_i)\big)$$

The data is separable if there exists $\bar\beta$ for which

$$y_i \cdot \bar\beta^T x_i > 0 \quad \text{for all } i = 1, \ldots, n$$

If $\bar\beta$ separates the data, then $L_n(\theta\bar\beta) \to 0 \;(= L_n^*)$ as $\theta \to +\infty$

Perhaps trying to optimize the logistic loss function is unlikely to be effective at finding a "good" linear classifier....




Results in the Non-Separable Case


Non-Separable Data and Problem Behavior/Conditioning

Let us quantify the degree of non-separability of the data.

[Figure: (a) Very non-separable data; (b) Mildly non-separable data]

We will relate this to problem behavior/conditioning....


Non-Separability Condition Number DegNSEP*

Definition of Non-Separability Condition Number DegNSEP*:

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

DegNSEP* is the least average misclassification error (over all normalized classifiers)

DegNSEP* > 0 if and only if the data is strictly non-separable
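For illustration (a sketch, not from the talk), the DegNSEP* objective, the average misclassification error of a normalized classifier, can be evaluated as follows; DegNSEP* itself is the minimum of this quantity over the nonconvex set $\|\beta\|_1 = 1$, which this sketch does not attempt to compute:

```python
# Sketch: evaluate (1/n) * sum_i [y_i beta^T x_i]^- for a classifier normalized
# to ||beta||_1 = 1, where [t]^- = max(-t, 0) is the negative part.
import numpy as np

def avg_misclassification_error(beta, X, y):
    beta = beta / np.linalg.norm(beta, 1)       # normalize so that ||beta||_1 = 1
    margins = y * (X @ beta)
    return np.mean(np.maximum(-margins, 0.0))
```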


Non-Separability Measure DegNSEP*

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

[Figure: (a) DegNSEP* is large; (b) DegNSEP* is small]


DegNSEP* and Problem Behavior/Conditioning

$$L_n^* := \min_{\beta} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\big(1 + \exp(-y_i \beta^T x_i)\big)$$

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

Theorem: Non-Separability and Sizes of Optimal Solutions
Suppose that the data is non-separable and DegNSEP* > 0. Then

1. the logistic regression problem LR attains its optimum,

2. for every optimal solution $\beta^*$ of LR it holds that $\|\beta^*\|_1 \le \dfrac{L_n^*}{\mathrm{DegNSEP}^*} \le \dfrac{\ln(2)}{\mathrm{DegNSEP}^*}$, and

3. for any $\beta$ it holds that $\|\beta\|_1 \le \dfrac{L_n(\beta)}{\mathrm{DegNSEP}^*}$.


Computational Guarantees for Greedy Coordinate Descent: Non-Separable Case

Theorem: Computational Guarantees for Greedy Coordinate Descent: Non-Separable Case
Consider GCD applied to the Logistic Regression problem with step-sizes $\alpha_k := \dfrac{4n\,\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is non-separable. Then for each $k \ge 0$ it holds that:

(i) (training error): $L_n(\beta^k) - L_n^* \;\le\; \dfrac{2(\ln 2)^2\, \|X\|_{1,2}^2}{k \cdot n \cdot (\mathrm{DegNSEP}^*)^2}$

(ii) (gradient norm): $\displaystyle\min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_\infty \;\le\; \|X\|_{1,2}\, \sqrt{\dfrac{\ln 2 - L_n^*}{2n(k+1)}}$

(iii) (regularization): $\|\beta^k\|_1 \;\le\; \dfrac{1}{\|X\|_{1,2}}\, \sqrt{k}\, \sqrt{8n(\ln 2 - L_n^*)}$
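A sketch (not the authors' code) of GCD specialized to logistic regression with the step-size from the theorem, $\alpha_k = 4n\,\|\nabla L_n(\beta^k)\|_\infty / \|X\|_{1,2}^2$:

```python
# Sketch: Greedy Coordinate Descent on the logistic loss with the theorem's step-size.
# Assumes X is n x p with rows x_i^T and y has entries in {-1, +1}.
import numpy as np

def gcd_logistic(X, y, num_iters):
    n, p = X.shape
    norm_12_sq = np.max(np.linalg.norm(X, axis=0)) ** 2       # ||X||_{1,2}^2
    beta = np.zeros(p)
    for _ in range(num_iters):
        t = y * (X @ beta)
        sigma = np.exp(-np.logaddexp(0.0, t))                  # 1 / (1 + exp(t_i))
        g = -(X.T @ (y * sigma)) / n                           # grad L_n(beta)
        j = np.argmax(np.abs(g))                               # greedy coordinate
        alpha = 4.0 * n * np.abs(g[j]) / norm_12_sq            # theorem's step-size
        beta[j] -= alpha * np.sign(g[j])
    return beta
```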



Computational Guarantees for Stochastic Gradient Descent: Non-Separable Case

Theorem: Computational Guarantees for Stochastic Gradient Descent: Non-Separable Case
Consider SGD applied to the Logistic Regression problem with step-sizes $\alpha_i := \dfrac{\sqrt{8n\ln 2}}{\sqrt{k+1}\,\|X\|_{2,2}\,\|X\|_{2,\infty}}$ for $i = 0, \ldots, k$, and suppose that the data is non-separable. Then it holds that:

(i) (training error): $\displaystyle \mathbb{E}\Big[\min_{0 \le i \le k} L_n(\beta^i)\Big] - L_n^* \;\le\; \frac{1}{\sqrt{k+1}} \cdot \frac{(L_n^*)^2\, \|X\|_{2,\infty}^2}{4\sqrt{2}\,\ln(2)\,(\mathrm{DegNSEP}^*)^2}$

(ii) (gradient norm): $\displaystyle \mathbb{E}\Big[\min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_2^2\Big] \;\le\; \frac{1}{\sqrt{k+1}} \cdot \frac{\sqrt{\ln 2}\,\|X\|_{2,2}\,\|X\|_{2,\infty}}{\sqrt{2n}}$

(iii) (regularization): $\displaystyle \|\beta^k\|_2 \;\le\; \sqrt{k+1}\left(\frac{\sqrt{8n\ln 2}}{\|X\|_{2,2}} + \frac{\sqrt{2\ln 2}\; n\,\|X\|_{2,\infty}}{\|X\|_{2,2}}\right)$
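A sketch (not the authors' code) of SGD for logistic regression with a constant step-size of the form in the theorem (as reconstructed above), sampling one data point uniformly at each iteration:

```python
# Sketch: SGD on the logistic loss with a constant step-size
# alpha = sqrt(8 n ln 2) / (sqrt(k+1) * ||X||_{2,2} * ||X||_{2,inf}).
# Assumes X is n x p with rows x_i^T and y has entries in {-1, +1}.
import numpy as np

def sgd_logistic(X, y, k, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    alpha = np.sqrt(8.0 * n * np.log(2.0)) / (
        np.sqrt(k + 1.0) * np.linalg.norm(X, 2) * np.max(np.linalg.norm(X, axis=1)))
    beta = np.zeros(p)
    for _ in range(k + 1):
        i = rng.integers(n)                       # sample one data point uniformly
        t = y[i] * (X[i] @ beta)
        sigma = np.exp(-np.logaddexp(0.0, t))     # 1 / (1 + exp(t))
        beta += alpha * y[i] * sigma * X[i]       # step along minus the stochastic gradient
    return beta
```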



Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression

For logistic regression, does GCD exhibit linear convergence?


Some Definitions/Notation

Definitions:

$R := \max_{i \in \{1,\ldots,n\}} \|x_i\|_2$ (maximum $\ell_2$ norm of the feature vectors)

$H(\beta^*)$ denotes the Hessian of $L_n(\cdot)$ at an optimal solution $\beta^*$

$\lambda_{\mathrm{pmin}}(H(\beta^*))$ denotes the smallest non-zero (and hence positive) eigenvalue of $H(\beta^*)$


Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Reaching Linear Convergence of GCD for Logistic Regression
Consider GCD applied to the Logistic Regression problem with step-sizes $\alpha_k := \dfrac{4n\,\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is non-separable. Define:

$$\check{k} := \frac{16\, p\, \ln(2)^2\, \|X\|_{1,2}^4\, R^2}{9\, n^2\, (\mathrm{DegNSEP}^*)^2\, \lambda_{\mathrm{pmin}}(H(\beta^*))^2}.$$

Then for all $k \ge \check{k}$, it holds that:

$$L_n(\beta^k) - L_n^* \;\le\; \big(L_n(\beta^{\check{k}}) - L_n^*\big) \left(1 - \frac{\lambda_{\mathrm{pmin}}(H(\beta^*))\, n}{p \cdot \|X\|_{1,2}^2}\right)^{k - \check{k}}.$$


Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:

The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014]

Furthermore, we can bound:

$$\lambda_{\mathrm{pmin}}(H(\beta^*)) \;\ge\; \frac{1}{4n}\, \lambda_{\mathrm{pmin}}(X^T X)\, \exp\!\left(-\frac{\ln(2)\, \|X\|_{1,\infty}}{\mathrm{DegNSEP}^*}\right)$$

As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in" and also what the rate of linear convergence is guaranteed to be

Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing . . . )


DegNSEP* and "Perturbation to Separability"

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

Theorem: DegNSEP* is the "Perturbation to Separability"

$$\mathrm{DegNSEP}^* = \inf_{\Delta x_1, \ldots, \Delta x_n} \; \frac{1}{n} \sum_{i=1}^n \|\Delta x_i\|_\infty \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i),\; i = 1, \ldots, n, \text{ are separable}$$


Illustration of Perturbation to Separability



Results in the Separable Case


Separable Data and Problem Behavior/Conditioning

Let us quantify the degree of separability of the data.

[Figure: (a) Very separable data; (b) Barely separable data]

We will relate this to problem behavior/conditioning....


Separability Condition Number DegSEP*

Definition of Separability Condition Number DegSEP*:

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1$$

DegSEP* maximizes the minimal classification value $[y_i \beta^T x_i]$ (over all normalized classifiers)

DegSEP* is simply the "maximum margin" in machine learning parlance

DegSEP* > 0 if and only if the data is separable
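Since the constraint $\|\beta\|_1 \le 1$ is polyhedral, DegSEP* is the optimal value of a linear program. A sketch (not from the talk) using scipy's `linprog`, with $\beta$ split into positive and negative parts:

```python
# Sketch: DegSEP* = max{ delta : y_i x_i^T beta >= delta for all i, ||beta||_1 <= 1 },
# solved as an LP with beta = u - v, u, v >= 0.  Assumes X is n x p, y in {-1,+1}^n.
import numpy as np
from scipy.optimize import linprog

def deg_sep(X, y):
    n, p = X.shape
    A = y[:, None] * X                                   # rows are y_i * x_i^T
    c = np.zeros(2 * p + 1)                              # variables z = [u, v, delta]
    c[-1] = -1.0                                         # maximize delta
    # margin constraints: -y_i x_i^T (u - v) + delta <= 0 for all i
    A_ub = np.hstack([-A, A, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # norm constraint: sum(u) + sum(v) <= 1, so ||beta||_1 <= 1
    A_ub = np.vstack([A_ub, np.hstack([np.ones((1, 2 * p)), np.zeros((1, 1))])])
    b_ub = np.append(b_ub, 1.0)
    bounds = [(0, None)] * (2 * p) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    beta = res.x[:p] - res.x[p:2 * p]
    return res.x[-1], beta                               # DegSEP* and a margin-maximizing beta
```

A positive returned value indicates separable data, consistent with the characterization above.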


Separability Measure DegSEP*

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1$$

[Figure: (a) DegSEP* is large; (b) DegSEP* is small]


DegSEP* and Problem Behavior/Conditioning

$$L_n^* := \min_{\beta} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\big(1 + \exp(-y_i \beta^T x_i)\big)$$

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1$$

Theorem: Separability and Non-Attainment
Suppose that the data is separable. Then DegSEP* > 0, $L_n^* = 0$, and LR does not attain its optimum.

Despite this, it turns out that the Steepest Descent family and also Stochastic Gradient Descent are reasonably effective at finding an approximate margin maximizer, as we shall shortly see....


Margin Function ρ(β)

$$\rho(\beta) := \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i]$$

[Figure: (a) ρ(β) is small; (b) ρ(β) is large]


Computational Guarantees for Greedy Coordinate Descent: Separable Case

Theorem: Computational Guarantees for Greedy Coordinate Descent: Separable Case
Consider GCD applied to the Logistic Regression problem with step-sizes $\alpha_k := \dfrac{4n\,\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is separable. Then:

(i) (margin bound): there exists $i \le \left\lceil \dfrac{3.7\, n\, \|X\|_{1,2}^2}{(\mathrm{DegSEP}^*)^2} \right\rceil$ for which the normalized iterate $\bar\beta^i := \beta^i / \|\beta^i\|_1$ satisfies $\rho(\bar\beta^i) \ge \dfrac{0.18 \cdot \mathrm{DegSEP}^*}{n}$

(ii) (shrinkage): $\|\beta^k\|_1 \;\le\; \dfrac{1}{\|X\|_{1,2}}\, \sqrt{k}\, \sqrt{8n \ln 2}$

(iii) (gradient norm): $\displaystyle\min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_\infty \;\le\; \|X\|_{1,2}\, \sqrt{\dfrac{\ln 2}{2n(k+1)}}$



Computational Guarantees for Stochastic Gradient Descent: Separable Case

Theorem: Computational Guarantees for Stochastic Gradient Descent: Separable Case
Consider SGD applied to the Logistic Regression problem with step-sizes $\alpha_i := \dfrac{\sqrt{8n\ln 2}}{\sqrt{k+1}\,\|X\|_{2,2}\,\|X\|_{2,\infty}}$ for $i = 0, \ldots, k$, where

$$k := \left\lceil \frac{28.1\, n^3\, \|X\|_{2,2}^2\, \|X\|_{2,\infty}^2}{\gamma^2\, (\mathrm{DegSEP}^*)^4} \right\rceil$$

and $\gamma \in (0, 1]$. If the data is separable, then:

$$\mathbb{P}\left[\,\exists\, i \in \{0,\ldots,k\} \text{ s.t. } \rho(\bar\beta^i) \ge \frac{\gamma\, (\mathrm{DegSEP}^*)^2}{20\, n^2\, \|X\|_{2,\infty}}\right] \;\ge\; 1 - \gamma,$$

where $\bar\beta^i := \beta^i / \|\beta^i\|_1$ are the normalized iterates of SGD.


DegSEP* and "Perturbation to Non-Separability"

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1$$

Theorem: DegSEP* is the "Perturbation to Non-Separability"

$$\mathrm{DegSEP}^* = \inf_{\Delta x_1, \ldots, \Delta x_n} \; \max_{i \in \{1,\ldots,n\}} \|\Delta x_i\|_\infty \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i),\; i = 1, \ldots, n, \text{ are non-separable}$$


Illustration of Perturbation to Non-Separability



Other Issues

Some other topics not mentioned (still ongoing):

Other first-order methods for logistic regression (gradient descent, accelerated gradient descent, other randomized methods, etc.)

High-dimensional regime $p > n$: define $\mathrm{DegNSEP}^*_k$ and $\mathrm{DegSEP}^*_k$ by restricting $\beta$ to satisfy $\|\beta\|_0 \le k$

Numerical experiments comparing methods

Other...


Summary

Some old and new results for Greedy Coordinate Descent and Stochastic Gradient Descent

Analyzing these methods for Logistic Regression: separable/non-separable cases

Non-Separable case:

condition number DegNSEP*

computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent, including reaching linear convergence

Separable case:

condition number DegSEP*

computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent, including computing an approximate maximum margin classifier


Results for Some other Methods


Standard Accelerated Gradient Method (AGM)

$$P: \quad F^* := \min_{x \in \mathbb{R}^p} \; F(x)$$

Lipschitz gradient: $\|\nabla f(y) - \nabla f(x)\|_2 \le L\, \|y - x\|_2$ for all $x, y \in \mathbb{R}^p$

Accelerated Gradient Method (AGM)
Given $x^0 \in \mathbb{R}^p$ and $z^0 := x^0$, and $i \leftarrow 0$. Define step-size parameters $\theta_i \in (0,1]$ recursively by $\theta_0 := 1$ and $\theta_{i+1}$ satisfying $\dfrac{1}{\theta_{i+1}^2} - \dfrac{1}{\theta_{i+1}} = \dfrac{1}{\theta_i^2}$.

At iteration $k$:

1. Update:
$y^k \leftarrow (1 - \theta_k)\, x^k + \theta_k z^k$
$x^{k+1} \leftarrow y^k - \frac{1}{L} \nabla f(y^k)$
$z^{k+1} \leftarrow z^k + \frac{1}{\theta_k}(x^{k+1} - y^k)$
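A minimal sketch (not the authors' code) of AGM as stated above; `grad_f` and a Lipschitz constant `L` for the gradient are assumed to be supplied by the caller:

```python
# Sketch: standard Accelerated Gradient Method with the theta recursion
# 1/theta_{k+1}^2 - 1/theta_{k+1} = 1/theta_k^2, theta_0 = 1.
import numpy as np

def agm(grad_f, L, x0, num_iters):
    x = np.array(x0, dtype=float)
    z = x.copy()
    theta = 1.0
    for _ in range(num_iters):
        y = (1.0 - theta) * x + theta * z
        x_next = y - grad_f(y) / L
        z = z + (x_next - y) / theta
        x = x_next
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta ** 2))  # solves the recursion
    return x
```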


Computational Guarantees for Accelerated Gradient Method (AGM) for Logistic Regression

Theorem: Computational Guarantees for Accelerated Gradient Method (AGM) for Logistic Regression
Consider the AGM applied to the Logistic Regression problem initiated at $\beta^0 := 0$, and suppose that the data is non-separable. Then for each $k \ge 0$ it holds that:

(training error): $L_n(\beta^k) - L_n^* \;\le\; \dfrac{2(\ln 2)^2\, \|X\|_{2,2}^2}{n \cdot (k+1)^2 \cdot (\mathrm{DegNSEP}^*)^2}$


AGM with Simple Re-Starting (AGM-SRS)

Assume that $0 < F^* := \min_x F(x)$

Accelerated Gradient Method with Simple Re-Starting (AGM-SRS)
Initialize with $x^0 \in \mathbb{R}^p$. Set $x_{1,0} \leftarrow x^0$, $i \leftarrow 1$. At outer iteration $i$:

1. Initialize inner iteration: $j \leftarrow 0$
2. Run inner iterations. At inner iteration $j$:
If $\dfrac{F(x_{i,j})}{F(x_{i,0})} \ge 0.8$, then: $x_{i,j+1} \leftarrow \mathrm{AGM}(F(\cdot), x_{i,0}, j+1)$, $j \leftarrow j+1$, and go to step 2.
Else $x_{i+1,0} \leftarrow x_{i,j}$, $i \leftarrow i+1$, and go to step 1.

"$x_{i,j} \leftarrow \mathrm{AGM}(F(\cdot), x_{i,0}, j)$" denotes assigning to $x_{i,j}$ the $j$-th iterate of AGM applied with objective function $F(\cdot)$ using the initial point $x_{i,0} \in \mathbb{R}^p$
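A sketch (not the authors' code) of the re-starting scheme, reusing the `agm` sketch from the previous slide; a cap on the number of inner iterations is added here purely as a safeguard and is not part of the scheme as stated:

```python
# Sketch: AGM with Simple Re-Starting.  From each restart point, AGM is re-run
# for j+1 iterations (j = 0, 1, 2, ...) until the objective drops below 80% of
# its value at the restart point, and that iterate becomes the next restart point.
import numpy as np

def agm_srs(F, grad_f, L, x0, num_restarts, max_inner=1000):
    x_restart = np.array(x0, dtype=float)
    for _ in range(num_restarts):
        x = x_restart
        for j in range(max_inner):
            x = agm(grad_f, L, x_restart, j + 1)   # j+1 AGM iterations from the restart point
            if F(x) < 0.8 * F(x_restart):          # sufficient decrease reached: restart here
                break
        x_restart = x
    return x_restart
```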


Computational Guarantee for AGM with Simple Re-Starting for Logistic Regression

Computational Guarantee for Accelerated Gradient Method with Simple Re-Starting for Logistic Regression
Consider the AGM with Simple Re-Starting applied to the Logistic Regression problem initiated at $\beta^0 := 0$, and suppose that the data is non-separable. Within a total number of computed iterates $k$ that does not exceed

$$\frac{5.8\, \|X\|_{2,2}}{\sqrt{n} \cdot \mathrm{DegNSEP}^*} + \frac{8.4\, \|X\|_{2,2} \cdot L_n^*}{\sqrt{n} \cdot \mathrm{DegNSEP}^* \cdot \varepsilon},$$

the algorithm will deliver an iterate $\beta^k$ for which $L_n(\beta^k) - L_n^* \le \varepsilon$.


Back-up Slides: Related Results for AdaBoost


AdaBoost: First Problem of Interest

AdaBoost is also Greedy Coordinate Descent, but replaces the logistic loss function with the log-exponential loss:

$$L_l^* := \min_{\lambda \ge 0} \; L_l(\lambda) = \ln\left(\frac{1}{m} \sum_{i=1}^m \exp\big(-(A\lambda)_i\big)\right).$$

Data: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathbb{R}^n$ is the $i$-th feature vector and $y_i \in \{-1, +1\}$

Here $A := YX$, i.e., $A_{ij} := y_i (x_i)_j$

Note that $\lambda^*$ is a linear separator of the data if and only if $A\lambda^* > 0$

Assume for convenience that for every column $A_j$, $-A_j$ is also a column of $A$


AdaBoost: Second Problem of Interest

$\Delta_n := \{x \in \mathbb{R}^n : e^T x = 1,\; x \ge 0\}$ is the standard simplex in $\mathbb{R}^n$

Recall that $\lambda^*$ is a linear separator of the data if and only if $A\lambda^* > 0$

The margin of a classifier $\lambda \in \mathbb{R}^n$ is:

$$p(\lambda) := \min_{i \in \{1,\ldots,m\}} (A\lambda)_i = \min_{w \in \Delta_m} w^T A \lambda$$

It makes sense to look for a classifier with large margin, i.e., to solve:

$$M: \quad \rho^* := \max_{\lambda \in \Delta_n} \; p(\lambda).$$


Dual of the Maximum Margin Problem

The "edge" of a vector of weights on the data, $w \in \Delta_m$, is:

$$f(w) := \max_{j \in \{1,\ldots,n\}} w^T A_j = \max_{\lambda \in \Delta_n} w^T A \lambda$$

The (linear programming) dual of the maximum margin problem is the problem of minimizing the edge:

$$E: \quad f^* := \min_{w \in \Delta_m} f(w)$$

AdaBoost is three algorithms:

A boosting method based on a scheme for (multiplicatively) updating a vector of weights on the data

Greedy Coordinate Descent applied to minimize the log-exponential loss function

A version of the Mirror Descent method applied to the above problem E


Computational Guarantees for AdaBoost

Theory for Greedy Coordinate Descent and Mirror Descent leads to computational guarantees for AdaBoost, stated in terms of the margin bound $\rho^* - p(\lambda^{k+1})$ (separable data) and the gradient bound $\min_{i \in \{0,\ldots,k\}} \|\nabla L_l(\hat\lambda^i)\|_\infty$ and loss bound $L_l(\hat\lambda^k) - L_l^*$ (non-separable data):

"edge rule" $\alpha_k = \|\nabla L_l(\hat\lambda^k)\|_\infty$: margin bound $\sqrt{\frac{2\ln(m)}{k+1}}$, gradient bound $\sqrt{\frac{2\ln(m)}{k+1}}$, loss bound $\frac{8\ln(m)^2}{(\mathrm{NSEP}_l^*)^2\, k}$

"line-search" $\alpha_k = \frac{1}{2}\ln\!\left(\frac{1+r_k}{1-r_k}\right)$: margin bound $\sqrt{\frac{2\ln(m)}{k+1}}$, gradient bound $\sqrt{\frac{2\ln(m)}{k+1}}$, loss bound $\frac{8\ln(m)^2}{(\mathrm{NSEP}_l^*)^2\, k}$

"constant" $\alpha_i := \sqrt{\frac{2\ln(m)}{k+1}}$ for $i = 0, \ldots, k$: margin bound $\sqrt{\frac{2\ln(m)}{k+1}}$, gradient bound $\sqrt{\frac{2\ln(m)}{k+1}}$

"adaptive": gradient bound and loss bound $\sqrt{\frac{\ln(m)}{2}} \cdot \frac{2 + \ln(k+1)}{2(\sqrt{k+2}-1)}$

$\mathrm{NSEP}_l^*$ is a "non-separability condition number" for the log-exponential loss