New Results for Sparsity-inducing Methods for Logistic Regression

New Results for Sparsity-inducing Methods for Logistic Regression Robert M. Freund (MIT) joint with Paul Grigas (Berkeley) and Rahul Mazumder (MIT)

Cornell University, December 2016

How can optimization inform statistics (and machine learning)?

Paper in preparation (this talk): “New Results for Sparsity-inducing Methods for Logistic Regression”

A “cousin” paper available online: “A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives”

Outline
- Optimization primer: some "old" results and new observations for the Greedy Coordinate Descent (GCD) method
- Logistic regression: Statistics perspective, Machine Learning perspective
- A "condition number" for the logistic regression problem:
  - the degree of non-separability of the data
  - data perturbation to separability of the data
  - informing the convergence properties of Greedy Coordinate Descent
- Reaching linear convergence of Greedy Coordinate Descent for logistic regression (thanks to Bach)
- Different convergence for an "accelerated" (but non-sparse) method for logistic regression (thanks to Renegar)

Primer on Greedy Coordinate Descent

Primer: Some “Old” Results and New Observations for the Greedy Coordinate Descent Method

Gradient Descent ≡ ℓ2-Steepest Descent

The problem of interest is:
$$F^* := \min_{x} \; F(x) \quad \text{s.t.} \quad x \in \mathbb{R}^p$$
where $F(x)$ is convex and differentiable.

Steepest Descent method for minimizing $F(x)$:
- Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$
- At iteration $k$:
  1. Compute gradient $\nabla F(x^k)$
  2. Choose step-size $\hat{\alpha}_k$
  3. Set $x^{k+1} \leftarrow x^k - \hat{\alpha}_k \nabla F(x^k)$
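As an illustration (not part of the talk), here is a minimal Python sketch of the three-step loop above; the function names and the quadratic example objective are placeholder choices.

```python
import numpy as np

def steepest_descent(grad_F, x0, step_size, num_iters):
    """Minimal gradient (l2-steepest) descent loop: x^{k+1} = x^k - alpha_k * grad F(x^k)."""
    x = x0.copy()
    for k in range(num_iters):
        g = grad_F(x)                  # step 1: compute gradient
        alpha_k = step_size(k, g)      # step 2: choose step-size
        x = x - alpha_k * g            # step 3: take the step
    return x

# Example: F(x) = 0.5 * ||Ax - b||^2, a placeholder smooth convex objective
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_F = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad F for this example
x_hat = steepest_descent(grad_F, np.zeros(5), lambda k, g: 1.0 / L, 500)
```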

- Connections to boosting (LogitBoost)
- Just one tuning parameter (number of iterations)
- GCD performs variable selection
- GCD imparts implicit regularization
Implicit Regularization and Variable Selection Properties

Artificial example: n = 1000, p = 100, true model has 5 non-zeros



[Figures: plots for the artificial example]

Compare with explicit regularization schemes (ℓ1, ℓ2, etc.)

Connections to Boosting

- In boosting, the goal is to combine multiple "weak" models into a more powerful "committee" (here a weak model corresponds to a feature)
- AdaBoost ([Schapire 1990], [Y. Freund 1995], [Y. Freund and Schapire 1996], ...) is a widely popular boosting algorithm for classification
- AdaBoost can be interpreted as Greedy Coordinate Descent to minimize the exponential loss function ([Mason et al. 2000])
- LogitBoost ([Friedman et al. 2000]), ≡ Greedy Coordinate Descent for Logistic Regression, replaces the exponential loss with the logistic loss

How Can Optimization Inform Logistic Regression?

Some questions:
- How do the computational guarantees for Greedy Coordinate Descent specialize for Logistic Regression?
- What role does problem structure/conditioning play in these guarantees?
- Can we say anything further about the convergence properties of Greedy Coordinate Descent in the special case of Logistic Regression?

Optimization Properties, Non-Separability, Complexity

Optimization Properties, Non-Separability, and Computational Guarantees

Basic Properties of the (Empirical) Logistic Loss

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

Relatively simple to show that:
- $L_n(\cdot)$ is convex
- $\nabla L_n(\cdot)$ is $L = \frac{1}{4n}\|X\|_{1,2}^2$-Lipschitz:
  $$\|\nabla L_n(\beta) - \nabla L_n(\beta')\|_\infty \le \tfrac{1}{4n}\|X\|_{1,2}^2 \, \|\beta - \beta'\|_1 \,, \quad \text{where } \|X\|_{1,2} := \max_{j=1,\ldots,p} \|X_j\|_2$$
- For $\beta^0 := 0$ it holds that $L_n(\beta^0) = \ln(2)$
- $L_n^* \ge 0$
- If $L_n^* = 0$, then the optimum is not attained (something is "wrong")
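A minimal Python sketch (my own illustration, not from the talk) of the quantities above: the empirical logistic loss, its gradient, and the Lipschitz constant ‖X‖²₁,₂/(4n); the function names are assumptions.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """L_n(beta) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T beta))."""
    margins = y * (X @ beta)
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_loss_grad(beta, X, y):
    """grad L_n(beta) = -(1/n) * X^T (y * sigmoid(-y * X beta))."""
    margins = y * (X @ beta)
    weights = y / (1.0 + np.exp(margins))    # y_i * sigmoid(-margin_i)
    return -(X.T @ weights) / X.shape[0]

def lipschitz_constant(X):
    """L = ||X||_{1,2}^2 / (4n), where ||X||_{1,2} is the largest column 2-norm."""
    n = X.shape[0]
    X_12 = np.max(np.linalg.norm(X, axis=0))
    return X_12 ** 2 / (4.0 * n)
```

At beta = 0 the loss evaluates to ln(2), matching the property stated above.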


Basic Properties, continued

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

[Figure: the logistic loss as a function of the margin $y_i \beta^T x_i$]

Logistic regression "ideally" seeks $\beta$ for which $y_i x_i^T \beta > 0$ for all $i$:
- $y_i > 0 \Rightarrow x_i^T \beta > 0$
- $y_i < 0 \Rightarrow x_i^T \beta < 0$


Geometry of the Data: Separable and Non-Separable Data

[Figures, four panels:]
(a) Strictly Separable Data
(b) Not Strictly Separable Data
(c) "Almost Separable" Data
(d) "Very Non-Separable" Data


Linearly Separable Data

We are given data: $(x_i, y_i) \in \mathbb{R}^p \times \{-1, +1\}$, $i = 1, \ldots, n$
- Let $X \in \mathbb{R}^{n \times p}$ be the data matrix: $x_i$ is the $i$th row of $X$, and $X_j$ denotes the $j$th column
- $y \in \{-1, 1\}^n$ is the vector of labels

Linearly Separable Data: The data is linearly separable with separator $\bar\beta$ if $y_i \cdot \bar\beta^T x_i > 0$ for all $i = 1, \ldots, n$. Equivalently, $YX\bar\beta > 0$, where $Y := \mathrm{diag}(y)$.


Linearly Separable Data, continued

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

The data is linearly separable with separator $\bar\beta$ if $YX\bar\beta > 0$, where $Y := \mathrm{diag}(y)$.

If $\bar\beta$ linearly separates the data, then $L_n(\theta\bar\beta) \to 0 \ (= L_n^*)$ as $\theta \to +\infty$.

Thus the logistic loss function is not effective at finding a "good" linear separator.


Strictly Non-Separable Data

We say that the data is strictly non-separable if:
$$YX\beta \ne 0 \;\Rightarrow\; YX\beta \ngeq 0$$

[Figures, two panels:]
(a) Strictly Non-Separable
(b) Not Strictly Non-Separable


Strict Non-Separability, continued

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

The data is strictly non-separable if: $YX\beta \ne 0 \Rightarrow YX\beta \ngeq 0$

Theorem: Attaining Optima
When the data is strictly non-separable, then the (empirical) logistic regression problem attains its optimum (and conversely).


Strict Separability and Problem Behavior/Conditioning

Theorem: Attaining Optima
When the data is strictly non-separable, then the (empirical) logistic regression problem attains its optimum (and conversely).

Q: Can we quantify the degree of non-separability of the data and relate this to problem behavior/conditioning?

[Figures, two scatter plots:]
(a) Mildly non-separable data
(b) Very non-separable data


Non-Separability Measure NSEP*

Definition of Non-Separability Measure NSEP*:
$$\mathrm{NSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]_- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

- NSEP* is the least average misclassification error
- NSEP* > 0 if and only if the data is strictly non-separable
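As a hedged illustration (not from the talk), the sketch below evaluates the NSEP* objective at a candidate ℓ1-normalized direction and gives a crude sampled upper bound; computing NSEP* exactly requires solving the optimization problem above, which this sketch does not do. Function names are assumptions.

```python
import numpy as np

def nsep_objective(beta, X, y):
    """Average negative part of the margins, (1/n) * sum_i [y_i beta^T x_i]_-,
    evaluated at a direction beta rescaled so that ||beta||_1 = 1."""
    beta = beta / np.sum(np.abs(beta))        # enforce the ||beta||_1 = 1 constraint
    margins = y * (X @ beta)
    return np.mean(np.maximum(-margins, 0.0))

def nsep_upper_bound(X, y, num_samples=10000, seed=0):
    """Crude upper bound on NSEP*: minimize the objective over random directions."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(num_samples):
        beta = rng.standard_normal(X.shape[1])
        best = min(best, nsep_objective(beta, X, y))
    return best
```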


Non-Separability Measure NSEP*

$$\mathrm{NSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]_- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

[Figures, two scatter plots:]
(a) NSEP* is small
(b) NSEP* is large


NSEP* and Problem Behavior/Conditioning

$$L_n^* := \min_{\beta \in \mathbb{R}^p} \; L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

$$\mathrm{NSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]_- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

Theorem: Strict Non-Separability and Sizes of Optimal Solutions
Suppose that the data is strictly non-separable, and let $\beta^*$ be an optimal solution of the logistic regression problem. Then
$$\|\beta^*\|_1 \le \frac{\ln(2)}{\mathrm{NSEP}^*} \,.$$


NSEP* and Problem Behavior/Conditioning, cont.

Theorem: Strict Non-Separability and Sizes of Optimal Solutions
Suppose that the data is strictly non-separable, and let $\beta^*$ be an optimal solution of the logistic regression problem. Then $\|\beta^*\|_1 \le \frac{\ln(2)}{\mathrm{NSEP}^*}$.

- $S_0 := \{\beta \in \mathbb{R}^p : L_n(\beta) \le L_n(\beta^0)\}$ is the level set of the initial point $\beta^0$
- $S^* := \{\beta \in \mathbb{R}^p : L_n(\beta) = L_n^*\}$ is the set of optimal solutions
- $\mathrm{Dist}_0 := \max_{\beta \in S_0} \min_{\beta^* \in S^*} \|\beta - \beta^*\|_1$

$$\mathrm{Dist}_0 \le \frac{2\ln(2)}{\mathrm{NSEP}^*}$$


Computational Guarantees for GCD for Logistic Regression

Theorem: Computational Guarantees for GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n \|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is strictly non-separable. Then for each $k \ge 0$ it holds that:

(i) (training error): $L_n(\beta^k) - L_n^* \le \dfrac{2(\ln(2))^2 \|X\|_{1,2}^2}{k \cdot n \cdot (\mathrm{NSEP}^*)^2}$

(ii) (gradient norm): $\displaystyle\min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_\infty \le \|X\|_{1,2} \sqrt{\dfrac{\ln(2) - L_n^*}{2n(k+1)}}$

(iii) (shrinkage): $\|\beta^k\|_1 \le \dfrac{\sqrt{k}}{\|X\|_{1,2}} \sqrt{8n(\ln(2) - L_n^*)}$

(iv) (sparsity): $\|\beta^k\|_0 \le k$
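A minimal sketch (my own illustration, not the authors' code) of Greedy Coordinate Descent for logistic regression with the step-size from the theorem; function and variable names are assumptions.

```python
import numpy as np

def gcd_logistic(X, y, num_iters):
    """Greedy Coordinate Descent on the empirical logistic loss, using the
    step-size alpha_k = 4n * ||grad L_n(beta^k)||_inf / ||X||_{1,2}^2 from the theorem."""
    n, p = X.shape
    X_12_sq = np.max(np.linalg.norm(X, axis=0)) ** 2       # ||X||_{1,2}^2
    beta = np.zeros(p)
    for k in range(num_iters):
        margins = y * (X @ beta)
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n  # grad L_n(beta)
        j = np.argmax(np.abs(grad))                        # greedy coordinate choice
        alpha_k = 4.0 * n * np.abs(grad[j]) / X_12_sq      # prescribed step-size
        beta[j] -= alpha_k * np.sign(grad[j])              # move only coordinate j
    return beta
```

Since the method starts at beta = 0 and touches one coordinate per iteration, property (iv), the sparsity bound ‖β^k‖₀ ≤ k, holds by construction.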



Other Step-size Choices

Theorem: Computational Guarantees for GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with an arbitrary step-size sequence $\{\alpha_k\}$. Then for each $k \ge 0$ it holds that:

(i) (gradient norm): $\displaystyle\min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_\infty \le \dfrac{\ln(2) - L_n^* + \frac{\|X\|_{1,2}^2}{8n} \sum_{i=0}^{k} \alpha_i^2}{\sum_{i=0}^{k} \alpha_i}$

(ii) (shrinkage): $\|\beta^k\|_1 \le \sum_{i=0}^{k} \alpha_i$

(iii) (sparsity): $\|\beta^k\|_0 \le k$

- Other step-size sequences are interesting since one may want to consider less aggressive fitting methods
- The bound on the gradient norm arises from a certain equivalence with the Mirror Descent method


NSEP* and "Distance to Separability"

$$\mathrm{NSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]_- \quad \text{s.t.} \quad \|\beta\|_1 = 1$$

Theorem: NSEP* is the "Distance to Separability"
$$\mathrm{NSEP}^* = \inf_{\Delta x_1, \ldots, \Delta x_n} \; \frac{1}{n} \sum_{i=1}^n \|\Delta x_i\|_\infty \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i), \ i = 1, \ldots, n \ \text{are linearly separable}$$




Reaching Linear Convergence

Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression


Local Curvature of the Logistic Loss Function

[Figure: the logistic loss as a function of the margin]

While the logistic loss behaves linearly in some regions, it has curvature near zero, and often the margin values at the optimal solution β* are concentrated in this region.


GCD and Local Curvature of the Logistic Loss Function

[Figure: the logistic loss as a function of the margin]

Q: Does Greedy Coordinate Descent adapt to the local curvature of the logistic loss at the optimal solution β*?

A: Yes (as we will now demonstrate ...)


Some Definitions/Notation

Definitions:
- $R := \max_{i \in \{1,\ldots,n\}} \|x_i\|_2$ (maximum norm of the feature vectors)
- $H(\beta^*)$ denotes the Hessian of $L_n(\cdot)$ at an optimal solution $\beta^*$
- $\lambda_{\mathrm{pmin}}(H(\beta^*))$ denotes the smallest non-zero (and hence positive) eigenvalue of $H(\beta^*)$


Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Reaching Linear Convergence of GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n \|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is strictly non-separable. Define:
$$\check{k} := \frac{16 \ln(2)^2 \|X\|_{1,2}^2 R^2 p}{9 n (\mathrm{NSEP}^*)^2 \lambda_{\mathrm{pmin}}(H(\beta^*))^2} \,.$$
Then for all $k \ge \check{k}$ it holds that:
$$L_n(\beta^k) - L_n^* \;\le\; \left(L_n(\beta^{\check{k}}) - L_n^*\right) \left(1 - \frac{\lambda_{\mathrm{pmin}}(H(\beta^*))\, n}{\|X\|_{1,2}^2 \, p}\right)^{k - \check{k}} \,.$$
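A sketch (my own illustration, not from the talk) of how one might compute the quantities in the theorem numerically; it assumes an approximate optimal solution beta_star and a value for NSEP* are supplied as inputs, since computing NSEP* exactly requires solving its own optimization problem.

```python
import numpy as np

def linear_rate_quantities(X, y, beta_star, nsep_star):
    """Return lambda_pmin(H(beta*)), the threshold k_check, and the guaranteed linear rate."""
    n, p = X.shape
    X_12_sq = np.max(np.linalg.norm(X, axis=0)) ** 2    # ||X||_{1,2}^2
    R = np.max(np.linalg.norm(X, axis=1))               # max feature-vector norm
    probs = 1.0 / (1.0 + np.exp(-X @ beta_star))        # sigmoid(x_i^T beta*)
    H = (X.T * (probs * (1.0 - probs))) @ X / n         # Hessian of L_n at beta*
    eigvals = np.linalg.eigvalsh(H)
    lam_pmin = np.min(eigvals[eigvals > 1e-10])         # smallest non-zero eigenvalue
    k_check = (16 * np.log(2) ** 2 * X_12_sq * R ** 2 * p
               / (9 * n * nsep_star ** 2 * lam_pmin ** 2))
    rate = 1.0 - lam_pmin * n / (X_12_sq * p)           # guaranteed linear rate factor
    return lam_pmin, k_check, rate
```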


Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:
- The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014]
- This property also yields a bound of the form
  $$L_n(\beta) - L_n^* \le \frac{2\|\nabla L_n(\beta)\|_2^2}{\lambda_{\mathrm{pmin}}(H(\beta^*))}$$
  if $\|\nabla L_n(\beta)\|_2$ is small enough
- As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in" and also what the rate of linear convergence is guaranteed to be
- Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing ... this result is quite new)


Other Issues

Some other topics not mentioned today (still ongoing):
- Other "GCD-type"/"boosting-type" methods suggested by connections to Mirror Descent and the Frank-Wolfe method
- High-dimensional regime $p > n$: define $\mathrm{NSEP}^*_k$ by restricting $\beta$ to satisfy $\|\beta\|_0 \le k$
- Numerical experiments comparing methods
- Further investigation of the properties of other step-size choices for Greedy Coordinate Descent


Summary

- Some "old" results and new observations for the Greedy Coordinate Descent Method
- NSEP* for Logistic Regression problems that:
  - measures the degree of non-separability of the data
  - informs the convergence properties of Greedy Coordinate Descent
- Computational guarantees for Greedy Coordinate Descent for Logistic Regression:
  - $O\left(\frac{1}{(\mathrm{NSEP}^*)^2 k}\right)$ global objective value convergence
  - Reaching linear convergence
  - Other guarantees in terms of norm of the gradient, shrinkage of the iterates


Back-up Slides: Related Results for AdaBoost


AdaBoost: First Problem of Interest

AdaBoost is also Greedy Coordinate Descent, but replaces the logistic loss function with the log-exponential loss:
$$L_l^* := \min_{\lambda \ge 0} \; L_l(\lambda) = \ln\left(\frac{1}{m} \sum_{i=1}^m \exp\left(-(A\lambda)_i\right)\right) \,.$$

- Data: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathbb{R}^n$ is the $i$th feature vector and $y_i \in \{-1, +1\}$
- Here $A := YX$, i.e., $A_{ij} := y_i (x_i)_j$
- Note that $\lambda^*$ is a linear separator of the data if and only if $A\lambda^* > 0$
- Assume for convenience that for every column $A_j$, $-A_j$ is also a column of $A$
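A small sketch (not from the talk) of the log-exponential loss and its gradient; the normalized exponential weights w computed inside the gradient play the role of AdaBoost's distribution of weights on the data. Function names are assumptions.

```python
import numpy as np

def log_exp_loss(lam, A):
    """L_l(lambda) = log( (1/m) * sum_i exp(-(A lambda)_i) )."""
    return np.log(np.mean(np.exp(-(A @ lam))))

def log_exp_loss_grad(lam, A):
    """Gradient: -A^T w, where w_i is proportional to exp(-(A lambda)_i) and sums to 1."""
    scores = np.exp(-(A @ lam))
    w = scores / scores.sum()      # normalized exponential weights on the observations
    return -(A.T @ w)
```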


AdaBoost: Second Problem of Interest

- $\Delta_n := \{x \in \mathbb{R}^n : e^T x = 1, \ x \ge 0\}$ is the standard simplex in $\mathbb{R}^n$
- Recall that $\lambda^*$ is a linear separator of the data if and only if $A\lambda^* > 0$
- The margin of a classifier $\lambda \in \mathbb{R}^n$ is:
  $$p(\lambda) := \min_{i \in \{1,\ldots,m\}} (A\lambda)_i = \min_{w \in \Delta_m} w^T A\lambda$$
- It makes sense to look for a classifier with large margin, i.e., to solve:
  $$M: \quad \rho^* := \max_{\lambda \in \Delta_n} p(\lambda) \,.$$
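Since M is a linear program over the simplex, it can be solved directly; below is a hedged sketch (my own illustration, not part of the talk) using scipy.optimize.linprog, which is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin(A):
    """Solve M: rho* = max_{lambda in Delta_n} min_i (A lambda)_i as a linear program.
    Variables are (lambda, rho); we maximize rho subject to A lambda >= rho * 1."""
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                            # minimize -rho
    A_ub = np.hstack([-A, np.ones((m, 1))])                 # rho - (A lambda)_i <= 0
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # sum(lambda) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]               # lambda >= 0, rho free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]                             # (lambda*, rho*)
```

Note that ρ* > 0 exactly when the data is linearly separable, consistent with the characterization above.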


Dual of the Maximum Margin Problem

The "edge" of a vector of weights on the data, $w \in \Delta_m$, is:
$$f(w) := \max_{j \in \{1,\ldots,n\}} w^T A_j = \max_{\lambda \in \Delta_n} w^T A\lambda$$

The (linear programming) dual of the maximum margin problem is the problem of minimizing the edge:
$$E: \quad f^* := \min_{w \in \Delta_m} f(w) \,.$$

AdaBoost is three algorithms:
- A boosting method based on a scheme for (multiplicatively) updating a vector of weights on the data
- Greedy Coordinate Descent applied to minimize the log-exponential loss function
- A version of the Mirror Descent method applied to the above problem E


Computational Guarantees for AdaBoost

Theory for Greedy Coordinate Descent and Mirror Descent leads to computational guarantees for AdaBoost.

[Table: for each step-size strategy below, a margin bound on $\rho^* - p(\hat\lambda^{k+1})$ for separable data, and a gradient bound on $\min_{i \in \{0,\ldots,k\}} \|\nabla L_l(\hat\lambda^i)\|_\infty$ and a loss bound on $L_l(\hat\lambda^k) - L_l^*$ for non-separable data.]

Step-size strategies:
- "edge rule": $\alpha_k = \|\nabla L_l(\hat\lambda^k)\|_\infty$
- "line-search": $\alpha_k = \frac{1}{2}\ln\left(\frac{1+r_k}{1-r_k}\right)$
- "constant": $\alpha_i := \sqrt{\frac{2\ln(m)}{k+1}}$ for $i = 0, \ldots, k$
- "adaptive"

The bounds in the table are of the form $\sqrt{\frac{2\ln(m)}{k+1}}$, $\frac{8\ln(m)^2}{(\mathrm{NSEP}_l^*)^2\, k}$, and $\sqrt{\frac{\ln(m)\,[2+\ln(k+1)]}{2\sqrt{2}(\sqrt{k+2}-1)}}$.

$\mathrm{NSEP}_l^*$ is a "non-separability condition number" for the log-exponential loss.
