
Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods

Robert M. Freund (MIT), Paul Grigas (Berkeley), and Rahul Mazumder (MIT)

INFORMS Denver, March 2018


How can optimization inform statistics (and machine learning)?

Paper in preparation (this talk): Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods

A “cousin” paper of ours: A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives


Outline

Optimization review: Greedy Coordinate Descent (GCD) and Stochastic Gradient Descent (SGD)

A pair of condition numbers for the logistic regression problem:

when the sample data is non-separable: a condition number for the degree of non-separability of the dataset, informing the convergence guarantees of GCD and SGD, and guarantees on reaching linear convergence (thanks to Bach)

when the sample data is separable: a condition number for the degree of separability of the dataset, informing convergence guarantees of GCD and SGD to deliver an approximate maximum-margin classifier


Review of Greedy Coordinate Descent (GCD) and Stochastic Gradient Descent (SGD)

Two basic first-order methods for convex optimization:

Greedy Coordinate Descent (GCD) method: "go in the best coordinate direction"

Stochastic Gradient Descent (SGD) method: "go in the direction of the negative of the stochastic estimate of the gradient"


Convex Optimization

The problem of interest is:

F^* := \min_{x \in \mathbb{R}^p} F(x)

where F(\cdot) is differentiable and convex:

F(\lambda x + (1-\lambda)y) \le \lambda F(x) + (1-\lambda)F(y) \quad \text{for all } x, y, \text{ and all } \lambda \in [0,1]

Let \|x\| denote the given norm on the variables x \in \mathbb{R}^p


Norms and Dual Norms

Let \|x\| be the given norm on the variables x \in \mathbb{R}^p

The dual norm is \|s\|_* := \max_x \{ s^T x : \|x\| \le 1 \}

Some common norms and their dual norms:

Name             | Norm         | Definition                                   | Dual Norm
\ell_2-norm      | \|x\|_2      | \|x\|_2 = \sqrt{\sum_{j=1}^p |x_j|^2}        | \|s\|_* = \|s\|_2
\ell_1-norm      | \|x\|_1      | \|x\|_1 = \sum_{j=1}^p |x_j|                 | \|s\|_* = \|s\|_\infty
\ell_\infty-norm | \|x\|_\infty | \|x\|_\infty = \max\{|x_1|, \ldots, |x_p|\}  | \|s\|_* = \|s\|_1
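A quick numerical sanity check of two of these pairings (a sketch, not from the talk; all names here are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=6)

# Dual of the l1-norm: max { s^T x : ||x||_1 <= 1 } is attained at a signed
# coordinate vector e_j with j = argmax |s_j|, giving ||s||_inf.
j = np.argmax(np.abs(s))
x_star = np.zeros_like(s)
x_star[j] = np.sign(s[j])
assert np.isclose(s @ x_star, np.max(np.abs(s)))

# Dual of the l2-norm: max { s^T x : ||x||_2 <= 1 } is attained at s/||s||_2,
# giving ||s||_2 (Cauchy-Schwarz).
assert np.isclose(s @ (s / np.linalg.norm(s)), np.linalg.norm(s))
```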


Lipschitz Constant for the Gradient

F^* := \min_{x \in \mathbb{R}^p} F(x)

We say that \nabla F(\cdot) is Lipschitz with parameter L_F if:

\|\nabla F(x) - \nabla F(y)\|_* \le L_F \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^p

where \|\cdot\|_* is the dual norm
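An observation connecting this definition to the logistic-regression step-sizes used later in the talk (my inference from the slides, not stated explicitly there): matching the generic GCD step-size rule \alpha_k = \|\nabla F(x^k)\|_\infty / L_F against the step-size \alpha_k = 4n\|\nabla L_n(\beta^k)\|_\infty / \|X\|_{1,2}^2 used for logistic regression suggests that

\[ L_{L_n} \;=\; \frac{\|X\|_{1,2}^2}{4n} \]

serves as the Lipschitz constant of \nabla L_n(\cdot) with respect to the \ell_1-norm; indeed the logistic Hessian satisfies H(\beta) = \frac{1}{n} X^T D X with 0 \preceq D \preceq \frac{1}{4} I, so every entry of H(\beta) is at most \|X\|_{1,2}^2/(4n) in magnitude.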


Matrix Operator Norm

Let M be a linear operator (matrix) M : \mathbb{R}^p \to \mathbb{R}^n, with norm \|x\|_a on \mathbb{R}^p and norm \|v\|_b on \mathbb{R}^n

The operator norm of M is given by:

\|M\|_{a,b} := \max_{x \ne 0} \frac{\|Mx\|_b}{\|x\|_a}
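The theorems later in the talk use \|X\|_{1,2}, \|X\|_{2,2}, and \|X\|_{2,\infty}; these particular operator norms have simple closed forms, sketched below (helper names are mine):

```python
import numpy as np

def op_norm_1_2(M):
    # ||M||_{1,2}: the extreme points of the l1 unit ball are +/- e_j,
    # so the maximum of ||Mx||_2 is the largest column l2-norm.
    return np.max(np.linalg.norm(M, axis=0))

def op_norm_2_2(M):
    # ||M||_{2,2} is the largest singular value of M.
    return np.linalg.norm(M, 2)

def op_norm_2_inf(M):
    # ||M||_{2,inf}: |m_i^T x| <= ||m_i||_2 ||x||_2 for each row m_i,
    # with equality attainable, so this is the largest row l2-norm.
    return np.max(np.linalg.norm(M, axis=1))
```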


Greedy Coordinate Descent Method: "go in the best coordinate direction"

F^* := \min_{x \in \mathbb{R}^p} F(x)

Greedy Coordinate Descent

Initialize at x^0 \in \mathbb{R}^p, k \leftarrow 0. At iteration k:

1. Compute the gradient \nabla F(x^k)

2. Compute j_k \in \arg\max_{j \in \{1,\ldots,p\}} |\nabla F(x^k)_j| and set d^k \leftarrow \mathrm{sgn}(\nabla F(x^k)_{j_k})\, e_{j_k}

3. Choose a step-size \alpha_k

4. Set x^{k+1} \leftarrow x^k - \alpha_k d^k
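A minimal NumPy sketch of this loop (grad_F, step_size, and num_iters are placeholder names of mine, not from the talk):

```python
import numpy as np

def greedy_coordinate_descent(grad_F, x0, step_size, num_iters):
    """Greedy coordinate descent: at each iteration, step along the single
    coordinate whose partial derivative has the largest magnitude."""
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        g = grad_F(x)                    # step 1: full gradient
        j = np.argmax(np.abs(g))         # step 2: best coordinate j_k
        alpha = step_size(g)             # step 3: step-size rule
        x[j] -= alpha * np.sign(g[j])    # step 4: x^{k+1} = x^k - alpha_k d^k
    return x
```

With the step-size rule of the convergence theorem below, step_size would be lambda g: np.max(np.abs(g)) / L_F.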


Metrics for Evaluating Greedy Coordinate Descent

F^* := \min_{x \in \mathbb{R}^p} F(x)

Assume F(\cdot) is convex and \nabla F(\cdot) is Lipschitz with parameter L_F:

\|\nabla F(x) - \nabla F(y)\|_\infty \le L_F \|x - y\|_1 \quad \text{for all } x, y \in \mathbb{R}^p

Two sets of interest:

S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\} is the level set of the initial point x^0

S^* := \{x \in \mathbb{R}^p : F(x) = F^*\} is the set of optimal solutions


Metrics for Evaluating Greedy Coordinate Descent, cont.

S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\} is the level set of the initial point x^0

S^* := \{x \in \mathbb{R}^p : F(x) = F^*\} is the set of optimal solutions

\mathrm{Dist}_0 := \max_{x \in S_0} \min_{x^* \in S^*} \|x - x^*\|_1

[Figure: the level set S_0 containing x^0, the optimal set S^*, and the distance Dist_0 between them]

(In high-dimensional machine learning problems, S^* can be very big)


Computational Guarantees for Greedy Coordinate Descent

\mathrm{Dist}_0 := \max_{x \in S_0} \min_{x^* \in S^*} \|x - x^*\|_1

Theorem: Objective Function Value Convergence (essentially [Beck and Tetruashvili 2014], [Nesterov 2003])

If the step-sizes are chosen using the rule:

\alpha_k = \frac{\|\nabla F(x^k)\|_\infty}{L_F} \quad \text{for all } k \ge 0,

then for each k \ge 0 the following inequality holds:

F(x^k) - F^* \le \frac{2 L_F (\mathrm{Dist}_0)^2}{\hat{K}_0 + k}, \quad \text{where } \hat{K}_0 := \frac{2 L_F (\mathrm{Dist}_0)^2}{F(x^0) - F^*}.
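A direct consequence, inverting the bound (my restatement, not on the slide): to guarantee F(x^k) - F^* \le \varepsilon it suffices to run

\[ k \;\ge\; \frac{2 L_F (\mathrm{Dist}_0)^2}{\varepsilon} \;-\; \hat{K}_0 \]

iterations, i.e., the method has the familiar O(1/\varepsilon) sublinear complexity.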


Separable and Non-Separable Data

Given sample data (x_i, y_i), i = 1, \ldots, n, with x_i \in \mathbb{R}^p and y_i \in \{-1, 1\}:

Separable Data: The data is separable if there exists \beta for which

y_i \cdot \beta^T x_i > 0 \quad \text{for all } i = 1, \ldots, n

Non-Separable Data: The data is non-separable if it is not separable, namely, every \beta satisfies

y_i \cdot \beta^T x_i \le 0 \quad \text{for at least one } i \in \{1, \ldots, n\}


Separable and Non-Separable Data

[Figure: (a) Data is Non-Separable; (b) Data is Separable]


Results in the Non-Separable Case


Non-Separable Data and Problem Behavior/Conditioning

Let us quantify the degree of non-separability of the data.

[Figure: (a) Very non-separable data; (b) Mildly non-separable data]

We will relate this to problem behavior/conditioning....


Non-Separability Condition Number DegNSEP*

Definition of the Non-Separability Condition Number DegNSEP*:

\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1

where [t]^- := \max\{-t, 0\} denotes the negative part.

DegNSEP* is the least average misclassification error (over all normalized classifiers)

DegNSEP* > 0 if and only if the data is strictly non-separable
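Evaluating the objective at any \ell_1-normalized \beta gives an upper bound on DegNSEP*; a minimal sketch (function name and interface are mine; X is the n-by-p data matrix, y the +/-1 labels):

```python
import numpy as np

def avg_misclassification_error(beta, X, y):
    """(1/n) * sum_i [y_i beta^T x_i]^- for an l1-normalized beta;
    any such beta yields an upper bound on DegNSEP*."""
    beta = beta / np.linalg.norm(beta, 1)   # enforce ||beta||_1 = 1
    margins = y * (X @ beta)                # y_i * beta^T x_i
    return np.mean(np.maximum(-margins, 0.0))
```

Computing DegNSEP* itself requires minimizing over all normalized classifiers, which the paper treats; the snippet only certifies upper bounds.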


Non-Separability Measure DegNSEP*

\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1

[Figure: (a) DegNSEP* is large; (b) DegNSEP* is small]


Computational Guarantees for Greedy Coordinate Descent: Non-Separable Case

Theorem: Computational Guarantees for Greedy Coordinate Descent, Non-Separable Case

Consider GCD applied to the Logistic Regression problem with step-sizes \alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2} for all k \ge 0, and suppose that the data is non-separable. Then for each k \ge 0 it holds that:

(i) (training error): L_n(\beta^k) - L_n^* \le \frac{2(\ln 2)^2 \|X\|_{1,2}^2}{k \cdot n \cdot (\mathrm{DegNSEP}^*)^2}

(ii) (regularization): \|\beta^k\|_1 \le \frac{\sqrt{k}}{\|X\|_{1,2}} \sqrt{8n(\ln 2 - L_n^*)}
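A sketch of GCD specialized to the logistic loss with this step-size (names and interface are mine; assumes y_i in {-1,+1}):

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def logistic_loss_grad(beta, X, y):
    # L_n(beta) = (1/n) sum_i ln(1 + exp(-y_i beta^T x_i)) and its gradient
    z = y * (X @ beta)
    loss = np.mean(np.logaddexp(0.0, -z))
    grad = -(X.T @ (y * expit(-z))) / len(y)
    return loss, grad

def gcd_logistic(X, y, num_iters):
    n, p = X.shape
    norm_12_sq = np.max(np.linalg.norm(X, axis=0)) ** 2   # ||X||_{1,2}^2
    beta = np.zeros(p)
    for _ in range(num_iters):
        _, g = logistic_loss_grad(beta, X, y)
        j = np.argmax(np.abs(g))
        alpha = 4.0 * n * np.abs(g[j]) / norm_12_sq       # theorem's step-size,
        beta[j] -= alpha * np.sign(g[j])                  # since |g[j]| = ||g||_inf
    return beta
```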


Computational Guarantees for Stochastic Gradient Descent: Non-Separable Case

Theorem: Computational Guarantees for Stochastic Gradient Descent, Non-Separable Case

Consider SGD applied to the Logistic Regression problem with step-sizes \alpha_i := \frac{\sqrt{8n\ln 2}}{\sqrt{k+1}\,\|X\|_{2,2}\,\|X\|_{2,\infty}} for i = 0, \ldots, k, and suppose that the data is non-separable. Then it holds that:

(i) (training error): \mathbb{E}\left[\min_{0 \le i \le k} L_n(\beta^i)\right] - L_n^* \le \frac{1}{\sqrt{k+1}}\left(\frac{(L_n^*)^2\,\|X\|_{2,\infty}^2}{4\sqrt{2\ln 2}\,(\mathrm{DegNSEP}^*)^2} + \sqrt{2n\ln 2}\,\frac{\|X\|_{2,\infty}}{\|X\|_{2,2}}\right)

(ii) (regularization): \|\beta^k\|_2 \le \sqrt{k+1}\,\frac{\sqrt{8n\ln 2}}{\|X\|_{2,2}}
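A sketch of the corresponding SGD loop (names are mine; assumes uniform sampling of one example per iteration and y_i in {-1,+1}):

```python
import numpy as np
from scipy.special import expit

def sgd_logistic(X, y, k, seed=0):
    """SGD on the logistic loss with the constant step-size from the
    theorem above, run for k + 1 iterations."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    norm_22 = np.linalg.norm(X, 2)                  # ||X||_{2,2}
    norm_2inf = np.max(np.linalg.norm(X, axis=1))   # ||X||_{2,inf}
    alpha = np.sqrt(8.0 * n * np.log(2)) / (np.sqrt(k + 1.0) * norm_22 * norm_2inf)
    beta = np.zeros(p)
    for _ in range(k + 1):
        i = rng.integers(n)
        # stochastic gradient of ln(1 + exp(-y_i x_i^T beta))
        g = -y[i] * expit(-y[i] * (X[i] @ beta)) * X[i]
        beta -= alpha * g
    return beta
```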


Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression

For logistic regression, does GCD exhibit linear convergence?


Some Definitions/Notation

Definitions:

R := \max_{i \in \{1,\ldots,n\}} \|x_i\|_2 (the maximum \ell_2-norm of the feature vectors)

H(\beta^*) denotes the Hessian of L_n(\cdot) at an optimal solution \beta^*

\lambda_{pmin}(H(\beta^*)) denotes the smallest non-zero (and hence positive) eigenvalue of H(\beta^*)
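For concreteness, the logistic Hessian and its smallest non-zero eigenvalue can be computed as below (a sketch with names of mine; here H(\beta) = \frac{1}{n}X^T D X with D_{ii} = \sigma(z_i)\sigma(-z_i) and z_i = y_i x_i^T\beta, since y_i^2 = 1):

```python
import numpy as np
from scipy.special import expit

def logistic_hessian(beta, X, y):
    # H(beta) = (1/n) X^T D X with D_ii = sigma(z_i) * sigma(-z_i),
    # where z_i = y_i x_i^T beta (the labels square away since y_i^2 = 1).
    z = y * (X @ beta)
    d = expit(z) * expit(-z)
    return (X.T * d) @ X / len(y)

def lambda_pmin(H, tol=1e-10):
    # smallest non-zero (hence positive) eigenvalue of the PSD matrix H
    eigs = np.linalg.eigvalsh(H)
    return eigs[eigs > tol].min()
```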


Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Reaching Linear Convergence of GCD for Logistic Regression

Consider GCD applied to the Logistic Regression problem with step-sizes \alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2} for all k \ge 0, and suppose that the data is non-separable. Define:

\check{k} := \frac{16\, p\, (\ln 2)^2\, \|X\|_{1,2}^4\, R^2}{9\, n^2\, (\mathrm{DegNSEP}^*)^2\, \lambda_{pmin}(H(\beta^*))^2}

Then for all k \ge \check{k}, it holds that:

L_n(\beta^k) - L_n^* \le \left(L_n(\beta^{\check{k}}) - L_n^*\right) \left(1 - \frac{\lambda_{pmin}(H(\beta^*))\, n}{p \cdot \|X\|_{1,2}^2}\right)^{k - \check{k}}


Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:

The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014]

Furthermore, we can bound:

\lambda_{pmin}(H(\beta^*)) \;\ge\; \frac{1}{4n}\, \lambda_{pmin}(X^T X)\, \exp\left(-\frac{\ln(2)\,\|X\|_{1,\infty}}{\mathrm{DegNSEP}^*}\right)

As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in", and also of what the rate of linear convergence is guaranteed to be

Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing...)


DegNSEP* and "Perturbation to Separability"

\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\|_1 = 1

Theorem: DegNSEP* is the "Perturbation to Separability"

\mathrm{DegNSEP}^* = \inf_{\Delta x_1,\ldots,\Delta x_n} \; \frac{1}{n}\sum_{i=1}^n \|\Delta x_i\|_\infty \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i),\; i = 1,\ldots,n \text{ are separable}


Illustration of Perturbation to Separability

[Figure]


Results in the Separable Case


Separable Data and Problem Behavior/Conditioning

Let us quantify the degree of separability of the data.

[Figure: (a) Very separable data; (b) Barely separable data]

We will relate this to problem behavior/conditioning....


Separability Condition Number DegSEP*

Definition of the Separability Condition Number DegSEP*:

\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1

DegSEP* maximizes the minimal classification value [y_i \beta^T x_i] (over all normalized classifiers)

DegSEP* is simply the "maximum margin" in machine learning parlance

DegSEP* > 0 if and only if the data is separable
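Unlike DegNSEP*, this quantity is directly computable as a linear program by splitting \beta into positive and negative parts; a sketch via scipy (function name and interface are mine):

```python
import numpy as np
from scipy.optimize import linprog

def deg_sep(X, y):
    """DegSEP* = max { t : y_i x_i^T beta >= t for all i, ||beta||_1 <= 1 },
    solved as an LP with beta = b_plus - b_minus, b_plus, b_minus >= 0."""
    n, p = X.shape
    S = y[:, None] * X                        # row i is y_i * x_i
    c = np.zeros(2 * p + 1)
    c[-1] = -1.0                              # maximize t (linprog minimizes)
    A_margin = np.hstack([-S, S, np.ones((n, 1))])            # t - y_i beta^T x_i <= 0
    A_l1 = np.concatenate([np.ones(2 * p), [0.0]])[None, :]   # sum(b+ + b-) <= 1
    A_ub = np.vstack([A_margin, A_l1])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * p) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    beta = res.x[:p] - res.x[p:2 * p]
    return res.x[-1], beta                    # (DegSEP*, a margin-maximizing beta)
```

With beta = 0 and t = 0 always feasible, the optimal value is 0 exactly when the data is non-separable, matching the last bullet above.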


Separability Measure DegSEP*

\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1

[Figure: (a) DegSEP* is large; (b) DegSEP* is small]


DegSEP* and Problem Behavior/Conditioning

L_n^* := \min_{\beta} \; L_n(\beta) := \frac{1}{n}\sum_{i=1}^n \ln(1 + \exp(-y_i \beta^T x_i))

\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1

Theorem: Separability and Non-Attainment

Suppose that the data is separable. Then DegSEP* > 0, L_n^* = 0, and LR does not attain its optimum. (For a separating \beta, scaling \beta \to \lambda\beta with \lambda \to \infty drives every loss term to 0, so the infimum 0 is approached but never attained at any finite \beta.)

Despite this, it turns out that Greedy Coordinate Descent and also Stochastic Gradient Descent are reasonably effective at finding an approximate margin maximizer....


Margin Function \rho(\beta)

\rho(\beta) := \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i]

[Figure: (a) \rho(\beta) is small; (b) \rho(\beta) is large]
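In code this is a one-liner, using the same (assumed) conventions as the earlier sketches:

```python
import numpy as np

def margin(beta, X, y):
    # rho(beta) = min_i y_i * beta^T x_i
    return np.min(y * (X @ beta))
```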


Computational Guarantees for Greedy Coordinate Descent: Separable Case

Theorem: Computational Guarantees for Greedy Coordinate Descent, Separable Case

Consider GCD applied to the Logistic Regression problem with step-sizes \alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2} for all k \ge 0, and suppose that the data is separable. Then:

(i) (margin bound): there exists i \le \left\lceil \frac{3.7\, n\, \|X\|_{1,2}^2}{(\mathrm{DegSEP}^*)^2} \right\rceil for which the normalized iterate \bar\beta^i := \beta^i / \|\beta^i\|_1 satisfies \rho(\bar\beta^i) \ge \frac{0.18 \cdot \mathrm{DegSEP}^*}{n}

(ii) (shrinkage): \|\beta^k\|_1 \le \frac{\sqrt{k}}{\|X\|_{1,2}} \sqrt{8n\ln 2}


Computational Guarantees for Stochastic Gradient Descent: Separable Case

Theorem: Computational Guarantees for Stochastic Gradient Descent, Separable Case

Consider SGD applied to the Logistic Regression problem with step-sizes \alpha_i := \frac{\sqrt{8n\ln 2}}{\sqrt{k+1}\,\|X\|_{2,2}\,\|X\|_{2,\infty}} for i = 0, \ldots, k, where

k := \left\lceil \frac{28.1\, n^3\, \|X\|_{2,2}^2\, \|X\|_{2,\infty}^2}{\gamma^2\, (\mathrm{DegSEP}^*)^4} \right\rceil

and \gamma \in (0,1]. If the data is separable, then:

P\left(\exists\, i \in \{0,\ldots,k\} \;\text{s.t.}\; \rho(\bar\beta^i) \ge \frac{\gamma\,(\mathrm{DegSEP}^*)^2}{20\, n^2\, \|X\|_{2,\infty}}\right) \ge 1 - \gamma,

where \bar\beta^i := \beta^i / \|\beta^i\|_1 are the normalized iterates of SGD.


DegSEP* and "Perturbation to Non-Separability"

\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \; \min_{i \in \{1,\ldots,n\}} \; [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\|_1 \le 1

Theorem: DegSEP* is the "Perturbation to Non-Separability"

\mathrm{DegSEP}^* = \inf_{\Delta x_1,\ldots,\Delta x_n} \; \max_{i \in \{1,\ldots,n\}} \|\Delta x_i\|_\infty \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i),\; i = 1,\ldots,n \text{ are non-separable}


Illustration of Perturbation to Non-Separability

[Figure]


Other Issues

Some other topics not mentioned (still ongoing):

Other first-order methods for logistic regression (gradient descent, accelerated gradient descent, other randomized methods, etc.)

The high-dimensional regime p > n: define DegNSEP*_k and DegSEP*_k by restricting \beta to satisfy \|\beta\|_0 \le k

Numerical experiments comparing methods

Other...


Summary

Some old and new results for Greedy Coordinate Descent and Stochastic Gradient Descent

Analyzing these methods for Logistic Regression: the separable and non-separable cases

Non-Separable case:

the condition number DegNSEP*

computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent, including reaching linear convergence

Separable case:

the condition number DegSEP*

computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent, including computing an approximate maximum-margin classifier