
New Results for Sparsity-inducing Methods for Logistic Regression

Robert M. Freund (MIT), joint with Paul Grigas (Berkeley) and Rahul Mazumder (MIT)

SIOPT Vancouver, May 2017



How can optimization inform statistics (and machine learning)?

Paper in preparation (this talk): New Results for Sparsity-inducing Methods for Logistic Regression

A “cousin” paper of ours: A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives



Outline

- Optimization primer: some "old" results and new observations for Greedy Coordinate Descent (GCD)
- Logistic regression perspectives: statistics and machine learning
- When the sample data is non-separable:
  - a "condition number" for the degree of non-separability, informing the convergence properties of GCD
  - reaching linear convergence of GCD (thanks to Bach)
- When the sample data is separable:
  - a "condition number" for the degree of separability of the data, informing convergence to a certificate of separability
- Under construction: a different convergence result for an "accelerated" (but non-sparse) method for logistic regression (thanks to Renegar)


Primer on Greedy Coordinate Descent

Some “Old” Results and New Observations for the Greedy Coordinate Descent Method



Greedy Coordinate Descent

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Greedy Coordinate Descent: initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:
1. Compute the gradient $\nabla F(x^k)$
2. Compute $j_k \in \arg\max_{j \in \{1,\dots,p\}} |\nabla F(x^k)_j|$ and set $d^k \leftarrow \mathrm{sgn}(\nabla F(x^k)_{j_k})\, e_{j_k}$
3. Choose a step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$

(A runnable sketch of these steps appears below.)
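To make the four steps concrete, here is a minimal runnable sketch in NumPy; the function names are ours, and the step-size rule $\alpha_k = \|\nabla F(x^k)\|_\infty / L_F$ anticipates the rule used in the guarantees later in the talk.

```python
import numpy as np

def greedy_coordinate_descent(grad_F, x0, L_F, num_iters):
    """Minimal sketch of GCD with step-size alpha_k = ||grad F(x^k)||_inf / L_F."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_F(x)                     # step 1: gradient
        j = int(np.argmax(np.abs(g)))     # step 2: greedy coordinate j_k
        alpha = np.abs(g[j]) / L_F        # step 3: step-size
        x[j] -= alpha * np.sign(g[j])     # step 4: x^{k+1} = x^k - alpha_k d^k
    return x

# Usage on a toy quadratic F(x) = 0.5 ||x - c||_2^2, whose gradient is x - c;
# for this F the (l1, l_inf) Lipschitz constant is L_F = 1:
c = np.array([1.0, -2.0, 0.5])
x_hat = greedy_coordinate_descent(lambda x: x - c, np.zeros(3), L_F=1.0, num_iters=50)
```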


Greedy Coordinate Descent ≡ ℓ1-Steepest Descent

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Steepest Descent method in the ℓ1-norm: initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:
1. Compute the gradient $\nabla F(x^k)$
2. Compute the direction: $d^k \leftarrow \arg\max_{\|d\|_1 \le 1} \{\nabla F(x^k)^T d\}$
3. Choose a step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$


Greedy Coordinate Descent ≡ ℓ1-Steepest Descent, cont.

$$d^k \in \arg\max_{\|d\|_1 \le 1} \{\nabla F(x^k)^T d\}$$

[Figure: a linear objective over the ℓ1 ball is maximized at a vertex, i.e., at a signed coordinate vector $\pm e_{j_k}$ — which is exactly the greedy coordinate direction. Hence the two methods coincide.]


Computational Guarantees for Greedy Coordinate Descent

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Assume $F(\cdot)$ is convex and $\nabla F(\cdot)$ is Lipschitz with parameter $L_F$:
$$\|\nabla F(x) - \nabla F(y)\|_\infty \le L_F \|x - y\|_1 \quad \text{for all } x, y \in \mathbb{R}^p$$

Two sets of interest:
- $S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$
- $S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions


Metrics for Evaluating Greedy Coordinate Descent, cont.

- $S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$
- $S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions

$$\mathrm{Dist}_0 := \max_{x \in S_0} \; \min_{x^* \in S^*} \|x - x^*\|_1$$

[Figure: the level set $S_0$, the optimal set $S^*$, and the distance $\mathrm{Dist}_0$ between them.]

(In high-dimensional machine learning problems, $S^*$ can be very big)


Computational Guarantees for Greedy Coordinate Descent

$$\mathrm{Dist}_0 := \max_{x \in S_0} \; \min_{x^* \in S^*} \|x - x^*\|_1$$

Theorem: Objective Function Value Convergence (essentially [Beck and Tetruashvili 2013])
If the step-sizes are chosen using the rule:
$$\alpha_k = \frac{\|\nabla F(x^k)\|_\infty}{L_F} \quad \text{for all } k \ge 0,$$
then for each $k \ge 0$ the following inequality holds:
$$F(x^k) - F^* \;\le\; \left[\frac{1}{F(x^0) - F^*} + \frac{k}{2 L_F (\mathrm{Dist}_0)^2}\right]^{-1} \;<\; \frac{2 L_F (\mathrm{Dist}_0)^2}{k}.$$

Note that $\alpha_k \to 0$ as $\|\nabla F(x^k)\|_\infty \to 0$
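Where the $O(1/k)$ rate comes from: the following one-step sketch is standard descent-lemma reasoning under the $(\ell_1, \ell_\infty)$ Lipschitz condition (our reconstruction, not taken verbatim from the talk). The step-size rule above is exactly the minimizer of the right-hand side, and the resulting per-iteration decrease also drives the gradient-norm and shrinkage bounds on the next slides.

```latex
% One iteration of GCD: since x^{k+1} - x^k = -alpha_k * sgn(grad F(x^k)_{j_k}) e_{j_k}
% has l1-norm alpha_k, the (l1, l_inf) Lipschitz condition gives
\begin{align*}
F(x^{k+1})
  &\le F(x^k) + \nabla F(x^k)^T (x^{k+1} - x^k) + \tfrac{L_F}{2}\,\|x^{k+1} - x^k\|_1^2 \\
  &=   F(x^k) - \alpha_k \,\|\nabla F(x^k)\|_\infty + \tfrac{L_F}{2}\,\alpha_k^2 .
\end{align*}
% Minimizing the right-hand side over alpha_k yields alpha_k = ||grad F(x^k)||_inf / L_F
% and the guaranteed per-iteration decrease
\[
F(x^{k+1}) \;\le\; F(x^k) - \frac{\|\nabla F(x^k)\|_\infty^2}{2\,L_F}\,.
\]
```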


Computational Guarantees for GCD, cont.

Theorem: Gradient Norm Convergence
For any step-size sequence $\{\alpha_k\}$ and for each $k \ge 0$, it holds that:
$$\min_{i \in \{0,\dots,k\}} \|\nabla F(x^i)\|_\infty \;\le\; \frac{F(x^0) - F^* + \frac{L_F}{2} \sum_{i=0}^{k} \alpha_i^2}{\sum_{i=0}^{k} \alpha_i}.$$

If the step-sizes are chosen using the rule:
$$\alpha_k = \frac{\|\nabla F(x^k)\|_\infty}{L_F} \quad \text{for all } k \ge 0,$$
then for each $k \ge 0$ the following inequality holds:
$$\min_{i \in \{0,\dots,k\}} \|\nabla F(x^i)\|_\infty \;\le\; \sqrt{\frac{2 L_F \left(F(x^0) - F^*\right)}{k+1}}.$$


Computational Guarantees for GCD, cont.

Theorem: Iterate Shrinkage
For any step-size sequence $\{\alpha_k\}$, it holds for each $k \ge 0$ that:
$$\|x^k\|_1 \;\le\; \|x^0\|_1 + \sum_{i=0}^{k-1} \alpha_i.$$

If the step-sizes are chosen using the rule:
$$\alpha_k = \frac{\|\nabla F(x^k)\|_\infty}{L_F} \quad \text{for all } k \ge 0,$$
then for each $k \ge 0$ it holds that:
$$\|x^k\|_1 \;\le\; \|x^0\|_1 + \sqrt{k}\,\sqrt{\frac{2\left(F(x^0) - F^*\right)}{L_F}}.$$


Logistic Regression

- statistics perspective
- machine learning perspective


Logistic Regression: Statistics Perspective

Example: Predicting Parole Violation. Predict P(violate parole) based on age, gender, time served, offense class, multiple convictions, NYC, etc.

        Violator  Male   Age  TimeServed  Class  Multiple  InCity
 1             0     1  49.4        3.15      D         0       1
 2             1     1  26.0        5.95      D         1       0
 3             0     1  24.9        2.25      D         1       0
 4             0     1  52.1       29.22      A         0       0
 5             0     1  35.9       12.78      A         1       1
 6             0     1  25.9        1.18      C         1       1
 7             0     1  19.0        0.54      D         0       0
 8             0     1  43.2        1.07      C         0       1
 9             0     1  31.6        1.17      E         0       0
10             0     1  40.7        4.64      B         1       1
11             0     1  53.9       21.61      A         0       1
12             0     1  28.5        3.23      D         1       0
13             0     1  36.1        3.71      D         0       1
14             0     1  48.8        1.17      D         0       0
15             0     1  37.6        4.62      C         0       0
16             0     1  42.5        1.75      D         0       1
...          ...   ...   ...         ...    ...       ...     ...
6098           0     1  55.0        0.72      E         0       0
6099           0     1  49.6       29.88      A         0       1
6100           0     1  22.4        2.85      D         0       1
6101           0     1  44.8        1.76      D         1       0
6102           0     0  45.3        1.03      E         0       0


Logistic Regression for Prediction

- $Y \in \{-1, 1\}$ is a Bernoulli random variable: $P(Y = 1) = p$, $P(Y = -1) = 1 - p$
- $x = (x_1, \dots, x_p) \in \mathbb{R}^p$ is the vector of independent variables
- $P(Y = 1)$ depends on the values of the independent variables $x_1, \dots, x_p$

The logistic regression model is:
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-\beta^T x}}$$
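As code, the model is a two-line sketch (the helper name and the example numbers are ours):

```python
import numpy as np

def predict_prob(beta, x):
    """P(Y = 1 | x) = 1 / (1 + exp(-beta^T x)) under the logistic model above."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

beta = np.array([0.8, -1.2])     # hypothetical coefficients
x = np.array([1.0, 0.5])         # one observation's independent variables
print(predict_prob(beta, x))     # estimated P(Y = 1 | x)
```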


Logistic Regression for Prediction, continued

The logistic regression model is:
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-\beta^T x}}$$

Data records are $(x_i, y_i)$, $i = 1, \dots, n$ (the parole data table shown above).

Let us construct an estimate of $\beta$ based on the data $(x_i, y_i)$, $i = 1, \dots, n$.


Logistic Regression: Maximum Likelihood Estimation

$$\begin{aligned}
\max_\beta \;& \left(\prod_{y_i = 1} \frac{1}{1 + e^{-\beta^T x_i}}\right)\left(\prod_{y_i = -1} \left(1 - \frac{1}{1 + e^{-\beta^T x_i}}\right)\right) \\
&= \max_\beta \; \left(\prod_{y_i = 1} \frac{1}{1 + e^{-\beta^T x_i}}\right)\left(\prod_{y_i = -1} \frac{1}{1 + e^{\beta^T x_i}}\right) \\
&= \max_\beta \; \prod_{i=1}^n \frac{1}{1 + e^{-y_i \beta^T x_i}} \\
&\equiv \; \min_\beta \; \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + e^{-y_i \beta^T x_i}\right) =: L_n(\beta)
\end{aligned}$$


Logistic Regression: Maximum Likelihood Optimization Problem

The logistic regression optimization problem is:
$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

[Figure: the logistic loss plotted against the margin $y \beta^T x$.]

The logistic term is a 1-smoothing of $f(\alpha) = \max\{0, -\alpha\}$ (≡ shifted "hinge loss")
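For concreteness, here is a sketch of $L_n$ and its gradient in NumPy (helper names ours); the gradient formula $\nabla L_n(\beta) = -\frac{1}{n}\sum_i \frac{y_i x_i}{1 + e^{y_i \beta^T x_i}}$ follows by differentiating each term, and `np.logaddexp(0, -m)` computes $\ln(1 + e^{-m})$ stably.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """L_n(beta) = (1/n) sum_i ln(1 + exp(-y_i x_i^T beta))."""
    margins = y * (X @ beta)                       # margins y_i x_i^T beta
    return np.mean(np.logaddexp(0.0, -margins))    # stable ln(1 + e^{-m})

def logistic_grad(beta, X, y):
    """grad L_n(beta) = -(1/n) sum_i y_i x_i / (1 + exp(y_i x_i^T beta))."""
    margins = y * (X @ beta)
    w = 1.0 / (1.0 + np.exp(margins))              # per-example weights in (0, 1)
    return -(X.T @ (w * y)) / len(y)
```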


Logistic Regression: Machine Learning Perspective


Logistic Regression as Binary Classification

Data: $(x_i, y_i) \in \mathbb{R}^p \times \{-1, 1\}$, $i = 1, \dots, n$
- $x = (x_1, \dots, x_p) \in \mathbb{R}^p$ is the vector of features (independent variables)
- $y \in \{-1, 1\}$ is the response/label

Task: predict $y$ based on the linear function $\beta^T x$, where $\beta \in \mathbb{R}^p$ are the model coefficients.

Loss function: $\ell(y, \beta^T x)$ represents the loss incurred when the truth is $y$ but our classification/prediction was based on $\beta^T x$.

Loss Minimization Problem:
$$\min_\beta \; \frac{1}{n} \sum_{i=1}^n \ell(y_i, \beta^T x_i)$$


Loss Functions for Binary Classification

Some common loss functions used for binary classification (sketched in code below):
- 0-1 loss: $\ell(y, \beta^T x) := 1(y \beta^T x < 0)$
- Hinge loss: $\ell(y, \beta^T x) := \max(0, 1 - y \beta^T x)$
- Logistic loss: $\ell(y, \beta^T x) := \ln(1 + \exp(-y \beta^T x))$

[Figure: the 0-1, hinge, and logistic losses plotted against the margin. Here "Margin" = $y \beta^T x$.]
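The three losses as functions of the margin $m = y \beta^T x$, in a short sketch (our code):

```python
import numpy as np

zero_one_loss = lambda m: (m < 0).astype(float)    # 1(y beta^T x < 0)
hinge_loss    = lambda m: np.maximum(0.0, 1.0 - m) # max(0, 1 - m)
logistic_loss = lambda m: np.logaddexp(0.0, -m)    # ln(1 + exp(-m)), computed stably

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss), ("logistic", logistic_loss)]:
    print(f"{name:>8}: {loss(margins)}")
```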


Advantages of the Logistic Loss Function

Why use the logistic loss function for classification?
- Computational advantages: convex, smooth
- Fits the previous statistical model of conditional probability: $P(Y = y \mid x) = \frac{1}{1 + \exp(-y \beta^T x)}$
- Makes sense when the data is non-separable
- Robust to misspecification of class labels


Logistic Regression Problem of Interest, continued

Alternate versions of the optimization problem add regularization and/or sparsification:

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right) + \lambda \|\beta\|_p \quad \text{s.t. } \|\beta\|_0 \le k$$

Aspirations:
- Good predictive performance on new (out-of-sample) observations
- Models that are more interpretable (e.g., sparse)


Greedy Coordinate Descent for Logistic Regression


Greedy Coordinate Descent for Logistic Regression

Initialize at $\beta^0 \leftarrow 0$, $k \leftarrow 0$. At iteration $k \ge 0$:
1. Compute $\nabla L_n(\beta^k)$
2. Compute $j_k \in \arg\max_{j \in \{1,\dots,p\}} |\nabla L_n(\beta^k)_j|$
3. Set $\beta^{k+1} \leftarrow \beta^k - \alpha_k\,\mathrm{sgn}(\nabla L_n(\beta^k)_{j_k})\, e_{j_k}$

(A runnable sketch of this loop appears below.)

Why use Greedy Coordinate Descent for Logistic Regression?
- Scalable and effective when $n, p \gg 0$ and maybe $p > n$
- GCD performs variable selection
- GCD imparts implicit regularization
- Just one tuning parameter (the number of iterations)
- Connections to boosting (LogitBoost)
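Here is the sketch referred to above: GCD specialized to logistic regression, with the step size $\alpha_k = 4n\|\nabla L_n(\beta^k)\|_\infty / \|X\|_{1,2}^2$ used in the computational guarantees later in the talk (our code; variable names are ours).

```python
import numpy as np

def gcd_logistic(X, y, num_iters):
    """GCD for logistic regression, alpha_k = 4n ||grad||_inf / ||X||_{1,2}^2."""
    n, p = X.shape
    L = np.max(np.linalg.norm(X, axis=0)) ** 2 / (4.0 * n)   # L = ||X||_{1,2}^2 / (4n)
    beta = np.zeros(p)                                       # beta^0 = 0
    for _ in range(num_iters):
        margins = y * (X @ beta)
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n    # grad L_n(beta^k)
        j = int(np.argmax(np.abs(grad)))                     # greedy coordinate j_k
        beta[j] -= (np.abs(grad[j]) / L) * np.sign(grad[j])  # alpha_k = ||grad||_inf / L
    return beta   # at most num_iters non-zeros: variable selection for free
```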


Implicit Regularization and Variable Selection Properties

Artificial example: $n = 1000$, $p = 100$, true model has 5 non-zeros

[Figure: GCD coefficient profiles across iterations on this example, illustrating implicit regularization and variable selection.]

Compare with explicit regularization schemes (ℓ1, ℓ2, etc.)


How Can GCD Inform Logistic Regression?

Some questions:
- How do the computational guarantees for Greedy Coordinate Descent specialize to the case of Logistic Regression?
- What role does problem structure/conditioning play in these guarantees?
- Can we say anything further about the convergence properties of Greedy Coordinate Descent in the special case of Logistic Regression?


Basic Properties of the Logistic Loss Function

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

- $L_n(\cdot)$ is convex
- $\nabla L_n(\cdot)$ is $L = \frac{1}{4n}\|X\|_{1,2}^2$-Lipschitz:
$$\|\nabla L_n(\beta) - \nabla L_n(\beta')\|_\infty \;\le\; \frac{1}{4n}\|X\|_{1,2}^2\,\|\beta - \beta'\|_1, \quad \text{where } \|X\|_{1,2} := \max_{j=1,\dots,p} \|X_j\|_2$$
- For $\beta^0 := 0$ it holds that $L_n(\beta^0) = \ln(2)$
- $L_n^* \ge 0$
- If $L_n^* = 0$, then the optimum is not attained (something is "wrong" or "very wrong"). We will see later that "very wrong" is actually good....

(A quick numerical check of these properties follows below.)
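The sanity check referred to above, on random data (our own sketch, not from the talk): it verifies the $(\ell_1, \ell_\infty)$ Lipschitz bound at a pair of random points and that $L_n(0) = \ln 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)

def grad_Ln(beta):
    return -(X.T @ (y / (1.0 + np.exp(y * (X @ beta))))) / n

L = np.max(np.linalg.norm(X, axis=0)) ** 2 / (4.0 * n)     # (1/4n) ||X||_{1,2}^2

b1, b2 = rng.standard_normal(p), rng.standard_normal(p)
lhs = np.max(np.abs(grad_Ln(b1) - grad_Ln(b2)))            # ||grad difference||_inf
rhs = L * np.sum(np.abs(b1 - b2))                          # L * ||b1 - b2||_1
print(lhs <= rhs)                                          # Lipschitz bound: True
print(np.isclose(np.mean(np.logaddexp(0.0, -y * (X @ np.zeros(p)))), np.log(2)))  # L_n(0) = ln 2
```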


Basic Properties, continued

$$L_n^* := \min_{\beta} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

[Figure: the logistic loss as a function of the margin $y \beta^T x$.]

Logistic regression "ideally" seeks $\beta$ for which $y_i x_i^T \beta > 0$ for all $i$:
- $y_i > 0 \Rightarrow x_i^T \beta > 0$
- $y_i < 0 \Rightarrow x_i^T \beta < 0$


Geometry of the Data: Separable and Non-Separable Data

[Figure, four panels: (a) Separable Data; (b) Not Separable Data; (c) Mildly Non-Separable Data; (d) Very Non-Separable Data]


Separable Data

The data is separable if there exists $\bar\beta$ for which $y_i \cdot \bar\beta^T x_i > 0$ for all $i = 1, \dots, n$.

Equivalently, $Y X \bar\beta > 0$, where $Y := \mathrm{diag}(y)$.


Separable Data, continued

$$L_n^* := \min_\beta \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

The data is separable if there exists $\bar\beta$ for which $Y X \bar\beta > 0$, where $Y := \mathrm{diag}(y)$.

If $\bar\beta$ separates the data, then $L_n(\theta\bar\beta) \to 0\ (= L_n^*)$ as $\theta \to +\infty$.

Perhaps trying to optimize the logistic loss function is unlikely to be effective at finding a "good" linear separator?


Strictly Non-Separable Data

We say that the data is strictly non-separable if:
$$Y X \beta \neq 0 \;\Rightarrow\; Y X \beta \ngeq 0$$
(i.e., every non-zero vector $Y X \beta$ has at least one negative component).

[Figure, two panels: (a) Strictly Non-Separable; (b) Not Strictly Non-Separable]


Strict Non-separability and Problem Behavior/Conditioning

Theorem: Attaining Optima
When the data is strictly non-separable, the logistic regression problem attains its optimum.

Let us quantify the degree of non-separability of the data and relate this to problem behavior/conditioning....

[Figure, two panels: (a) Mildly non-separable data; (b) Very non-separable data]


Non-Separability Measure DistSEP*

Definition of the Non-Separability Measure DistSEP*:
$$\mathrm{DistSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t. } \|\beta\|_1 = 1,$$
where $[a]^- := \max\{0, -a\}$ denotes the negative part.

- DistSEP* is the least average misclassification error
- DistSEP* > 0 if and only if the data is strictly non-separable

(A numerical probe of this measure is sketched below.)
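Evaluating the objective at any candidate direction gives an upper bound on DistSEP*; the random-search sketch below (our code, a crude probe rather than an exact solver for the minimization above) illustrates the measure.

```python
import numpy as np

def nonsep_objective(beta, X, y):
    """(1/n) sum_i [y_i beta^T x_i]^-  for a given beta (the DistSEP* objective)."""
    return np.mean(np.maximum(0.0, -y * (X @ beta)))

def distsep_upper_bound(X, y, num_samples=10_000, seed=0):
    """Crude Monte Carlo upper bound on DistSEP*: sample directions on the l1 sphere."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(num_samples):
        beta = rng.standard_normal(X.shape[1])
        beta /= np.sum(np.abs(beta))           # normalize so that ||beta||_1 = 1
        best = min(best, nonsep_objective(beta, X, y))
    return best
```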


Non-Separability Measure DistSEP*

$$\mathrm{DistSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t. } \|\beta\|_1 = 1$$

[Figure, two panels: (a) DistSEP* is small; (b) DistSEP* is large]


DistSEP* and "Distance to Separability"

$$\mathrm{DistSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t. } \|\beta\|_1 = 1$$

Theorem: DistSEP* is the "Distance to Separability"
$$\mathrm{DistSEP}^* = \inf_{\Delta x_1, \dots, \Delta x_n} \; \frac{1}{n} \sum_{i=1}^n \|\Delta x_i\|_\infty \quad \text{s.t. } (x_i + \Delta x_i, y_i),\ i = 1, \dots, n \ \text{are separable}$$


DistSEP* and Problem Behavior/Conditioning

$$L_n^* := \min_\beta \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

$$\mathrm{DistSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t. } \|\beta\|_1 = 1$$

Theorem: Strict Non-Separability and Sizes of Optimal Solutions
Suppose that the data is strictly non-separable, and let $\beta^*$ be an optimal solution of the logistic regression problem. Then
$$\|\beta^*\|_1 \;\le\; \frac{\ln(2)}{\mathrm{DistSEP}^*}, \quad \text{whereby} \quad \mathrm{Dist}_0 \;\le\; \frac{2\ln(2)}{\mathrm{DistSEP}^*}.$$


Computational Guarantees for GCD: Non-Separable Case

Theorem: Computational Guarantees for GCD: Non-Separable Case
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is strictly non-separable. Then for each $k \ge 0$ it holds that:

(i) (training error): $L_n(\beta^k) - L_n^* \;\le\; \dfrac{2(\ln 2)^2 \|X\|_{1,2}^2}{k \cdot n \cdot (\mathrm{DistSEP}^*)^2}$

(ii) (gradient norm): $\min_{i \in \{0,\dots,k\}} \|\nabla L_n(\beta^i)\|_\infty \;\le\; \|X\|_{1,2} \sqrt{\dfrac{\ln(2) - L_n^*}{2n(k+1)}}$

(iii) (regularization): $\|\beta^k\|_1 \;\le\; \dfrac{\sqrt{k}}{\|X\|_{1,2}} \sqrt{8n(\ln(2) - L_n^*)}$

(iv) (sparsity): $\|\beta^k\|_0 \le k$



Reaching Linear Convergence

Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression

For logistic regression, does Greedy Coordinate Descent exhibit linear convergence?


Some Definitions/Notation

Definitions:
- $R := \max_{i \in \{1,\dots,n\}} \|x_i\|_2$ (maximum norm of the feature vectors)
- $H(\beta^*)$ denotes the Hessian of $L_n(\cdot)$ at an optimal solution $\beta^*$
- $\lambda_{\mathrm{pmin}}(H(\beta^*))$ denotes the smallest non-zero (and hence positive) eigenvalue of $H(\beta^*)$


Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Reaching Linear Convergence of GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is strictly non-separable. Define:
$$\check{k} := \frac{16 \ln(2)^2 \|X\|_{1,2}^2 R^2 p}{9 n (\mathrm{DistSEP}^*)^2 \lambda_{\mathrm{pmin}}(H(\beta^*))^2}.$$
Then for all $k \ge \check{k}$, it holds that:
$$L_n(\beta^k) - L_n^* \;\le\; \left(L_n(\beta^{\check{k}}) - L_n^*\right) \left(1 - \frac{\lambda_{\mathrm{pmin}}(H(\beta^*))\, n}{\|X\|_{1,2}^2\, p}\right)^{k - \check{k}}.$$


Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:
- The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014]
- Furthermore, we can bound:
$$\lambda_{\mathrm{pmin}}(H(\beta^*)) \;\ge\; \frac{1}{4n}\,\lambda_{\mathrm{pmin}}(X^T X)\,\exp\!\left(-\frac{\ln(2)\,\|X\|_{1,\infty}}{\mathrm{DistSEP}^*}\right)$$
- As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in" and also what the rate of linear convergence is guaranteed to be
- Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing . . . )


Separability and Problem Behavior/Conditioning

[Figure: separable data]


Separable Data, continued

$$L_n^* := \min_\beta \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

Recall the data is separable if there exists $\bar\beta$ for which $Y X \bar\beta > 0$, where $Y := \mathrm{diag}(y)$.

If $\bar\beta$ separates the data, then $L_n(\theta\bar\beta) \to 0\ (= L_n^*)$ as $\theta \to +\infty$.

Despite this, it turns out that GCD is reasonably effective at finding a "good" linear separator, as we shall shortly see....


Margin Function ρ(β)

$$\rho(\beta) := \min_{i \in \{1,\dots,n\}} \; [y_i \beta^T x_i]$$


Separability Measure DistNSEP*

Definition of the Separability Measure DistNSEP*:
$$\mathrm{DistNSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \rho(\beta) \quad \text{s.t. } \|\beta\|_1 = 1$$

- DistNSEP* is the maximum margin over all (normalized) β
- DistNSEP* > 0 if and only if the data is separable

(An LP computation of DistNSEP* is sketched below.)
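Since ρ is concave and positively homogeneous, DistNSEP* can be computed by a small linear program: maximize $t$ subject to $y_i x_i^T \beta \ge t$ for all $i$ and $\|\beta\|_1 \le 1$, with β split into positive and negative parts. The sketch below (our construction, not from the talk) uses scipy.optimize.linprog; it returns $t^* > 0$ together with a separating $\beta^*$ exactly when the data is separable, and $t^* = 0$ otherwise, so it doubles as a certificate of separability.

```python
import numpy as np
from scipy.optimize import linprog

def distnsep(X, y):
    """Max l1-normalized margin: returns (t*, beta*); t* > 0 iff the data is separable."""
    n, p = X.shape
    YX = y[:, None] * X                               # rows are y_i x_i^T
    # variables z = [beta_plus (p), beta_minus (p), t]; linprog minimizes, so c = -t
    c = np.zeros(2 * p + 1)
    c[-1] = -1.0
    A_margin = np.hstack([-YX, YX, np.ones((n, 1))])  # t - y_i x_i^T beta <= 0
    A_l1 = np.append(np.ones(2 * p), 0.0)             # sum(beta_plus + beta_minus) <= 1
    A_ub = np.vstack([A_margin, A_l1])
    b_ub = np.append(np.zeros(n), 1.0)
    bounds = [(0, None)] * (2 * p) + [(None, None)]   # beta_plus, beta_minus >= 0; t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    beta = res.x[:p] - res.x[p:2 * p]
    return -res.fun, beta
```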


Separability Measure DistNSEP*

$$\mathrm{DistNSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \rho(\beta) \quad \text{s.t. } \|\beta\|_1 = 1$$

[Figure illustrating DistNSEP*]


DistNSEP* and "Distance to Non-Separability"

$$\mathrm{DistNSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \rho(\beta) \quad \text{s.t. } \|\beta\|_1 = 1$$

Theorem: DistNSEP* is the "Distance to Non-Separability"
$$\mathrm{DistNSEP}^* = \inf_{\Delta x_1, \dots, \Delta x_n} \; \max_{i \in \{1,\dots,n\}} \|\Delta x_i\|_\infty \quad \text{s.t. } (x_i + \Delta x_i, y_i),\ i = 1, \dots, n \ \text{are non-separable}$$


Computational Guarantees for GCD: Separable Case

Theorem: Computational Guarantees for GCD: Separable Case
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is separable. Then:

(i) (margin bound): there exists $i \le \dfrac{3.7\, n \|X\|_{1,2}^2}{(\mathrm{DistNSEP}^*)^2}$ for which the normalized iterate $\bar\beta^i := \beta^i / \|\beta^i\|_1$ satisfies
$$\rho(\bar\beta^i) \;\ge\; \frac{0.18 \cdot \mathrm{DistNSEP}^*}{n}$$

(ii) (gradient norm): $\min_{i \in \{0,\dots,k\}} \|\nabla L_n(\beta^i)\|_\infty \;\le\; \|X\|_{1,2} \sqrt{\dfrac{\ln(2) - L_n^*}{2n(k+1)}}$

(iii) (regularization): $\|\beta^k\|_1 \;\le\; \dfrac{\sqrt{k}}{\|X\|_{1,2}} \sqrt{8n(\ln(2) - L_n^*)}$

(iv) (sparsity): $\|\beta^k\|_0 \le k$



Other Issues

Some other topics not mentioned today (still ongoing):
- Other "GCD-type"/"boosting-type" methods suggested by connections to Mirror Descent and the Frank-Wolfe method
- The high-dimensional regime p > n: define DistSEP*_k and DistNSEP*_k by restricting β to satisfy $\|\beta\|_0 \le k$
- Numerical experiments comparing methods
- Further investigation of the properties of other step-size choices for Greedy Coordinate Descent


Summary

- Some "old" results and new observations for the Greedy Coordinate Descent Method
- Analyzing GCD for Logistic Regression: separable/non-separable cases
- Non-Separable case:
  - behavioral/condition measure DistSEP*
  - computational guarantees for GCD, including reaching linear convergence
- Separable case:
  - behavioral/condition measure DistNSEP*
  - computational guarantees for GCD, including computing a reasonably good separator