Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods

Robert M. Freund (MIT), Paul Grigas (Berkeley), and Rahul Mazumder (MIT)
INFORMS Denver, March 2018
How can optimization inform statistics (and machine learning)?
Paper in preparation (this talk): Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods
A “cousin” paper of ours: A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
Outline

- Optimization review: Greedy Coordinate Descent (GCD) and Stochastic Gradient Descent (SGD)
- A pair of condition numbers for the logistic regression problem:
  - when the sample data is non-separable: a condition number for the degree of non-separability of the dataset
    - informing the convergence guarantees of GCD and SGD
    - guarantees on reaching linear convergence (thanks to Bach)
  - when the sample data is separable: a condition number for the degree of separability of the dataset
    - informing convergence guarantees of GCD and SGD to deliver an approximate maximum margin classifier
Review of Greedy Coordinate Descent (GCD) and Stochastic Gradient Descent (SGD)
Two Basic First-Order Methods for Convex Optimization:

- Greedy Coordinate Descent (GCD) method: "go in the best coordinate direction"
- Stochastic Gradient Descent (SGD) method: "go in the direction of the negative of the stochastic estimate of the gradient"
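For concreteness, here is a minimal Python sketch of the generic SGD update (the callable `stoch_grad` and the step-size schedule are illustrative assumptions, not part of the talk; a matching GCD sketch appears with its formal description below):

```python
import numpy as np

def sgd(stoch_grad, x0, step_sizes):
    """Generic SGD for min_x F(x): repeatedly step along the negative
    of an unbiased stochastic estimate of the gradient of F.
    stoch_grad : callable x -> random vector g with E[g] = grad F(x)"""
    x = x0.copy()
    for alpha in step_sizes:
        x = x - alpha * stoch_grad(x)   # stochastic gradient step
    return x
```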
Convex Optimization

The problem of interest is:

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

where $F(\cdot)$ is differentiable and convex:

$$F(\lambda x + (1-\lambda)y) \le \lambda F(x) + (1-\lambda)F(y) \quad \text{for all } x, y, \text{ and all } \lambda \in [0,1]$$

Let $\|x\|$ denote the given norm on the variables $x \in \mathbb{R}^p$.
Norms and Dual Norms

Let $\|x\|$ be the given norm on the variables $x \in \mathbb{R}^p$. The dual norm is $\|s\|_* := \max_x \{s^T x : \|x\| \le 1\}$.

Some common norms and their dual norms:

- $\ell_2$-norm: $\|x\|_2 = \sqrt{\sum_{j=1}^p |x_j|^2}$, with dual norm $\|s\|_* = \|s\|_2$
- $\ell_1$-norm: $\|x\|_1 = \sum_{j=1}^p |x_j|$, with dual norm $\|s\|_* = \|s\|_\infty$
- $\ell_\infty$-norm: $\|x\|_\infty = \max\{|x_1|, \dots, |x_p|\}$, with dual norm $\|s\|_* = \|s\|_1$
Lipschitz constant for the Gradient

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

We say that $\nabla F(\cdot)$ is Lipschitz with parameter $L_F$ if:

$$\|\nabla F(x) - \nabla F(y)\|_* \le L_F \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^p$$

where $\|\cdot\|_*$ is the dual norm.
Matrix Operator Norm

Let $M$ be a linear operator (matrix) $M : \mathbb{R}^p \to \mathbb{R}^n$, with norm $\|x\|_a$ on $\mathbb{R}^p$ and norm $\|v\|_b$ on $\mathbb{R}^n$. The operator norm of $M$ is given by:

$$\|M\|_{a,b} := \max_{x \ne 0} \frac{\|Mx\|_b}{\|x\|_a}$$
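The specific operator norms that appear in the guarantees later in the talk ($\|X\|_{1,2}$, $\|X\|_{2,2}$, $\|X\|_{2,\infty}$, $\|X\|_{1,\infty}$) all have simple closed forms. A minimal numpy sketch (the helper name and interface are illustrative assumptions):

```python
import numpy as np

def op_norm(M, a, b):
    """Operator norm ||M||_{a,b} = max_{x != 0} ||Mx||_b / ||x||_a,
    for the (a, b) pairs used later in the talk."""
    if (a, b) == (1, 2):          # max column l2-norm
        return np.linalg.norm(M, axis=0).max()
    if (a, b) == (2, np.inf):     # max row l2-norm
        return np.linalg.norm(M, axis=1).max()
    if (a, b) == (2, 2):          # largest singular value
        return np.linalg.norm(M, ord=2)
    if (a, b) == (1, np.inf):     # largest absolute entry
        return np.abs(M).max()
    raise ValueError("norm pair not implemented")
```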
Greedy Coordinate Descent Method: "go in the best coordinate direction"

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Greedy Coordinate Descent

Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:

1. Compute the gradient $\nabla F(x^k)$
2. Compute $j_k \in \arg\max_{j \in \{1,\dots,p\}} |\nabla F(x^k)_j|$ and set $d^k \leftarrow \mathrm{sgn}(\nabla F(x^k)_{j_k})\, e_{j_k}$
3. Choose step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$
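A minimal Python sketch of this method, using the step-size rule $\alpha_k = \|\nabla F(x^k)\|_\infty / L_F$ from the convergence theorem below (the callable `grad_F` and the iteration budget are illustrative assumptions):

```python
import numpy as np

def greedy_coordinate_descent(grad_F, x0, L_F, num_iters=1000):
    """Sketch of GCD with step-size alpha_k = ||grad F(x^k)||_inf / L_F.
    grad_F : callable returning grad F(x)
    L_F    : Lipschitz constant of grad F w.r.t. the (l1, l_inf) norm pair"""
    x = x0.copy()
    for _ in range(num_iters):
        g = grad_F(x)
        jk = np.argmax(np.abs(g))        # index of the best coordinate
        alpha = np.abs(g[jk]) / L_F      # step-size rule from the theorem
        x[jk] -= alpha * np.sign(g[jk])  # move along -sgn(g_jk) * e_jk
    return x
```

Note that the argmax step requires a full gradient, which is exactly the per-iteration cost the guarantees below charge.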
Metrics for Evaluating Greedy Coordinate Descent

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Assume $F(\cdot)$ is convex and $\nabla F(\cdot)$ is Lipschitz with parameter $L_F$:

$$\|\nabla F(x) - \nabla F(y)\|_\infty \le L_F \|x - y\|_1 \quad \text{for all } x, y \in \mathbb{R}^p$$

Two sets of interest:

- $S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$
- $S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions
Metrics for Evaluating Greedy Coordinate Descent, cont.

- $S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$
- $S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions

$$\mathrm{Dist}_0 := \max_{x \in S_0}\ \min_{x^* \in S^*} \|x - x^*\|_1$$

(figure: the level set $S_0$, the initial point $x^0$, the optimal set $S^*$, and $\mathrm{Dist}_0$)
(In high-dimensional machine learning problems, $S^*$ can be very big)
Computational Guarantees for Greedy Coordinate Descent

$$\mathrm{Dist}_0 := \max_{x \in S_0}\ \min_{x^* \in S^*} \|x - x^*\|_1$$

Theorem: Objective Function Value Convergence (essentially [Beck and Tetruashvili 2014], [Nesterov 2003])

If the step-sizes are chosen using the rule:

$$\alpha_k = \frac{\|\nabla F(x^k)\|_\infty}{L_F} \quad \text{for all } k \ge 0,$$

then for each $k \ge 0$ the following inequality holds:

$$F(x^k) - F^* \le \frac{2 L_F (\mathrm{Dist}_0)^2}{\hat{K}_0 + k}, \quad \text{where} \quad \hat{K}_0 := \frac{2 L_F (\mathrm{Dist}_0)^2}{F(x^0) - F^*}.$$
Separable and Non-Separable Data

Separable Data: The data is separable if there exists $\beta$ for which

$$y_i \cdot \beta^T x_i > 0 \quad \text{for all } i = 1, \dots, n$$

Non-Separable Data: The data is non-separable if it is not separable, namely, every $\beta$ satisfies

$$y_i \cdot \beta^T x_i \le 0 \quad \text{for at least one } i \in \{1, \dots, n\}$$
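Checking separability is a linear programming feasibility problem: by homogeneity, the data is separable exactly when some $\beta$ achieves $y_i \beta^T x_i \ge 1$ for all $i$. A hedged scipy sketch (the helper name and interface are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """LP feasibility check for linear separability of (x_i, y_i):
    find beta with y_i * beta^T x_i >= 1 for all i.
    X is n x p; y has entries in {-1, +1}."""
    n, p = X.shape
    A_ub = -(y[:, None] * X)            # -y_i x_i^T beta <= -1
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * p)
    return res.status == 0              # feasible  <=>  separable
```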
Separable and Non-Separable Data
(figures: (a) data is non-separable; (b) data is separable)
Results in the Non-Separable Case
Non-Separable Data and Problem Behavior/Conditioning
Let us quantify the degree of non-separability of the data.
(figures: (a) very non-separable data; (b) mildly non-separable data)

We will relate this to problem behavior/conditioning...
Non-Separability Condition Number DegNSEP∗

Definition of Non-Separability Condition Number DegNSEP∗:

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \left[y_i \beta^T x_i\right]^- \quad \text{s.t.}\ \|\beta\|_1 = 1$$

where $[t]^- := \max\{0, -t\}$ denotes the negative part.
DegNSEP∗ is the least average misclassification error (over all normalized classifiers)
DegNSEP∗ > 0 if and only if the data is strictly non-separable
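For any candidate classifier this objective is cheap to evaluate; a hedged numpy sketch (the helper name is an illustrative assumption), which yields an upper bound on DegNSEP∗ for any nonzero $\beta$:

```python
import numpy as np

def avg_misclassification_error(beta, X, y):
    """DegNSEP* objective: (1/n) * sum_i [y_i beta^T x_i]^- for beta
    normalized to ||beta||_1 = 1; upper-bounds DegNSEP* for any beta."""
    beta = beta / np.abs(beta).sum()        # normalize onto the l1 sphere
    return np.maximum(-(y * (X @ beta)), 0.0).mean()
```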
Non-Separability Measure DegNSEP∗

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \left[y_i \beta^T x_i\right]^- \quad \text{s.t.}\ \|\beta\|_1 = 1$$

(figures: (a) DegNSEP∗ is large; (b) DegNSEP∗ is small)
Computational Guarantees for Greedy Coordinate Descent: Non-Separable Case

Theorem: Consider GCD applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is non-separable. Then for each $k \ge 0$ it holds that:

(i) (training error): $L_n(\beta^k) - L_n^* \le \dfrac{2(\ln 2)^2\, \|X\|_{1,2}^2}{k \cdot n \cdot (\mathrm{DegNSEP}^*)^2}$

(ii) (regularization): $\|\beta^k\|_1 \le \dfrac{\sqrt{k}}{\|X\|_{1,2}} \sqrt{8n(\ln 2 - L_n^*)}$
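A hedged Python sketch of GCD specialized to the logistic loss with exactly this step-size rule (the helper name and iteration budget are illustrative assumptions; recall $\|X\|_{1,2}$ is the maximum column $\ell_2$-norm):

```python
import numpy as np

def gcd_logistic(X, y, num_iters=1000):
    """Sketch of GCD on the logistic loss L_n with step-size
    alpha_k = 4n ||grad L_n(beta^k)||_inf / ||X||_{1,2}^2."""
    n, p = X.shape
    norm_X_12 = np.linalg.norm(X, axis=0).max()   # ||X||_{1,2}
    beta = np.zeros(p)
    for _ in range(num_iters):
        margins = y * (X @ beta)
        # gradient of (1/n) sum_i ln(1 + exp(-y_i x_i^T beta))
        g = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        jk = np.argmax(np.abs(g))                 # best coordinate
        alpha = 4 * n * np.abs(g[jk]) / norm_X_12**2
        beta[jk] -= alpha * np.sign(g[jk])
    return beta
```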
Computational Guarantees for Stochastic Gradient Descent: Non-Separable Case

Theorem: Consider SGD applied to the Logistic Regression problem with step-sizes $\alpha_i := \frac{\sqrt{8n\ln 2}}{\sqrt{k+1}\,\|X\|_{2,2}\,\|X\|_{2,\infty}}$ for $i = 0, \dots, k$, and suppose that the data is non-separable. Then it holds that:

(i) (training error):

$$E\left[\min_{0 \le i \le k} L_n(\beta^i)\right] - L_n^* \le \frac{1}{\sqrt{k+1}}\left[\frac{(L_n^*)^2\,\|X\|_{2,\infty}^2}{4\sqrt{2\ln 2}\,(\mathrm{DegNSEP}^*)^2} + \sqrt{2\ln(2)\,n}\,\frac{\|X\|_{2,\infty}}{\|X\|_{2,2}}\right]$$

(ii) (regularization):

$$\|\beta^k\|_2 \le \sqrt{k+1}\,\frac{\sqrt{8n\ln 2}}{\|X\|_{2,2}}$$
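A hedged Python sketch of SGD specialized to the logistic loss with this fixed-horizon step-size (sampling one observation uniformly gives an unbiased estimate of $\nabla L_n$; the helper name and random seed are illustrative assumptions):

```python
import numpy as np

def sgd_logistic(X, y, k):
    """Sketch of SGD on the logistic loss with the theorem's step-size
    alpha_i = sqrt(8 n ln 2) / (sqrt(k+1) ||X||_{2,2} ||X||_{2,inf})."""
    n, p = X.shape
    norm_22 = np.linalg.norm(X, ord=2)            # ||X||_{2,2}: spectral norm
    norm_2inf = np.linalg.norm(X, axis=1).max()   # ||X||_{2,inf}: max row l2-norm
    alpha = np.sqrt(8 * n * np.log(2)) / (np.sqrt(k + 1) * norm_22 * norm_2inf)
    rng = np.random.default_rng(0)
    beta = np.zeros(p)
    for _ in range(k + 1):
        i = rng.integers(n)                       # sample one observation
        m = y[i] * (X[i] @ beta)
        g = -y[i] * X[i] / (1.0 + np.exp(m))      # unbiased estimate of grad L_n
        beta -= alpha * g
    return beta
```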
Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression
For logistic regression, does GCD exhibit linear convergence?
Some Definitions/Notation

Definitions:

- $R := \max_{i \in \{1,\dots,n\}} \|x_i\|_2$ (maximum $\ell_2$-norm of the feature vectors)
- $H(\beta^*)$ denotes the Hessian of $L_n(\cdot)$ at an optimal solution $\beta^*$
- $\lambda_{\mathrm{pmin}}(H(\beta^*))$ denotes the smallest non-zero (and hence positive) eigenvalue of $H(\beta^*)$
Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Consider GCD applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is non-separable. Define:

$$\check{k} := \frac{16\, p\, (\ln 2)^2\, \|X\|_{1,2}^4\, R^2}{9\, n^2\, (\mathrm{DegNSEP}^*)^2\, \lambda_{\mathrm{pmin}}(H(\beta^*))^2}.$$

Then for all $k \ge \check{k}$ it holds that:

$$L_n(\beta^k) - L_n^* \le \left(L_n(\beta^{\check{k}}) - L_n^*\right)\left(1 - \frac{\lambda_{\mathrm{pmin}}(H(\beta^*))\, n}{p \cdot \|X\|_{1,2}^2}\right)^{k - \check{k}}.$$
Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:

- The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014]
- Furthermore, we can bound:

$$\lambda_{\mathrm{pmin}}(H(\beta^*)) \ge \frac{1}{4n}\,\lambda_{\mathrm{pmin}}(X^T X)\,\exp\left(-\frac{\ln(2)\,\|X\|_{1,\infty}}{\mathrm{DegNSEP}^*}\right)$$

- As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in" and also what the rate of linear convergence is guaranteed to be
- Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing . . . )
DegNSEP∗ and "Perturbation to Separability"

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \left[y_i \beta^T x_i\right]^- \quad \text{s.t.}\ \|\beta\|_1 = 1$$

Theorem: DegNSEP∗ is the "Perturbation to Separability"

$$\mathrm{DegNSEP}^* = \inf_{\Delta x_1, \dots, \Delta x_n}\ \frac{1}{n}\sum_{i=1}^n \|\Delta x_i\|_\infty \quad \text{s.t.}\ (x_i + \Delta x_i, y_i),\ i = 1, \dots, n \text{ are separable}$$
Illustration of Perturbation to Separability
Results in the Separable Case
Separable Data and Problem Behavior/Conditioning
Let us quantify the degree of separability of the data.
(figures: (a) very separable data; (b) barely separable data)

We will relate this to problem behavior/conditioning...
Separability Condition Number DegSEP∗

Definition of Separability Condition Number DegSEP∗:

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p}\ \min_{i \in \{1,\dots,n\}}\ \left[y_i \beta^T x_i\right] \quad \text{s.t.}\ \|\beta\|_1 \le 1$$

- DegSEP∗ maximizes the minimal classification value $[y_i \beta^T x_i]$ (over all normalized classifiers)
- DegSEP∗ is simply the "maximum margin" in machine learning parlance
- DegSEP∗ > 0 if and only if the data is separable
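With the $\ell_1$ constraint, computing DegSEP∗ is itself a linear program after splitting $\beta = \beta^+ - \beta^-$ with $\beta^\pm \ge 0$. A hedged scipy sketch (the helper name and interface are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def deg_sep(X, y):
    """DegSEP* = max_{||beta||_1 <= 1} min_i y_i beta^T x_i as an LP.
    Variables z = [beta+ (p), beta- (p), t]; maximize t subject to
    y_i x_i^T (beta+ - beta-) >= t and sum(beta+ + beta-) <= 1."""
    n, p = X.shape
    c = np.zeros(2 * p + 1)
    c[-1] = -1.0                                    # minimize -t
    A = y[:, None] * X                              # rows y_i x_i^T
    A_margin = np.hstack([-A, A, np.ones((n, 1))])  # t - y_i x_i^T beta <= 0
    A_l1 = np.hstack([np.ones((1, 2 * p)), np.zeros((1, 1))])
    A_ub = np.vstack([A_margin, A_l1])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * p) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return -res.fun                                 # optimal t = DegSEP*
```

The same LP certifies separability: the data is separable exactly when the optimal value is positive.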
Separability Measure DegSEP∗

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p}\ \min_{i \in \{1,\dots,n\}}\ \left[y_i \beta^T x_i\right] \quad \text{s.t.}\ \|\beta\|_1 \le 1$$

(figures: (a) DegSEP∗ is large; (b) DegSEP∗ is small)
DegSEP∗ and Problem Behavior/Conditioning

$$L_n^* := \min_{\beta}\ L_n(\beta) := \frac{1}{n}\sum_{i=1}^n \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$$

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p}\ \min_{i \in \{1,\dots,n\}}\ \left[y_i \beta^T x_i\right] \quad \text{s.t.}\ \|\beta\|_1 \le 1$$

Theorem: Separability and Non-Attainment

Suppose that the data is separable. Then DegSEP∗ > 0, $L_n^* = 0$, and LR does not attain its optimum. (Indeed, scaling any separating $\beta$ by $t \to \infty$ drives every loss term $\ln(1 + \exp(-t\, y_i \beta^T x_i))$ to 0, so the infimum 0 is approached but never attained.)

Despite this, it turns out that Greedy Coordinate Descent and also Stochastic Gradient Descent are reasonably effective at finding an approximate margin maximizer....
Margin function ρ(β)

$$\rho(\beta) := \min_{i \in \{1,\dots,n\}}\ \left[y_i \beta^T x_i\right]$$

(figures: (a) ρ(β) is small; (b) ρ(β) is large)
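The guarantees below are stated in terms of $\rho$ evaluated at normalized iterates; for reference, a one-line numpy sketch (the helper name is an illustrative assumption):

```python
import numpy as np

def margin(beta, X, y):
    """rho(beta) = min_i y_i beta^T x_i (positive iff beta separates the data)."""
    return (y * (X @ beta)).min()
```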
Computational Guarantees for Greedy Coordinate Descent: Separable Case

Theorem: Consider GCD applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n\|\nabla L_n(\beta^k)\|_\infty}{\|X\|_{1,2}^2}$ for all $k \ge 0$, and suppose that the data is separable.

(i) (margin bound): there exists $i \le \dfrac{3.7\, n\, \|X\|_{1,2}^2}{(\mathrm{DegSEP}^*)^2}$ for which the normalized iterate $\bar\beta^i := \beta^i / \|\beta^i\|_1$ satisfies $\rho(\bar\beta^i) \ge \dfrac{0.18 \cdot \mathrm{DegSEP}^*}{n}$.

(ii) (shrinkage): $\|\beta^k\|_1 \le \dfrac{\sqrt{k}}{\|X\|_{1,2}}\sqrt{8n\ln 2}$
Computational Guarantees for Stochastic Gradient Descent: Separable Case

Theorem: Consider SGD applied to the Logistic Regression problem with step-sizes $\alpha_i := \frac{\sqrt{8n\ln 2}}{\sqrt{k+1}\,\|X\|_{2,2}\,\|X\|_{2,\infty}}$ for $i = 0, \dots, k$, where

$$k := \frac{28.1\, n^3\, \|X\|_{2,2}^2\, \|X\|_{2,\infty}^2}{\gamma^2\, (\mathrm{DegSEP}^*)^4}$$

and $\gamma \in (0,1]$. If the data is separable, then:

$$P\left[\exists\, i \in \{0,\dots,k\}\ \text{s.t.}\ \rho(\bar\beta^i) \ge \frac{\gamma\,(\mathrm{DegSEP}^*)^2}{20\, n^2\, \|X\|_{2,\infty}}\right] \ge 1 - \gamma,$$

where $\bar\beta^i := \beta^i / \|\beta^i\|_1$ are the normalized iterates of SGD.
DegSEP∗ and "Perturbation to Non-Separability"

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p}\ \min_{i \in \{1,\dots,n\}}\ \left[y_i \beta^T x_i\right] \quad \text{s.t.}\ \|\beta\|_1 \le 1$$

Theorem: DegSEP∗ is the "Perturbation to Non-Separability"

$$\mathrm{DegSEP}^* = \inf_{\Delta x_1, \dots, \Delta x_n}\ \max_{i \in \{1,\dots,n\}} \|\Delta x_i\|_\infty \quad \text{s.t.}\ (x_i + \Delta x_i, y_i),\ i = 1, \dots, n \text{ are non-separable}$$
Illustration of Perturbation to Non-Separability
Other Issues
Some other topics not mentioned (still ongoing):

- Other first-order methods for logistic regression (gradient descent, accelerated gradient descent, other randomized methods, etc.)
- High-dimensional regime $p > n$: define $\mathrm{DegNSEP}^*_k$ and $\mathrm{DegSEP}^*_k$ by restricting $\beta$ to satisfy $\|\beta\|_0 \le k$
- Numerical experiments comparing methods
Other...
Summary

- Some old and new results for Greedy Coordinate Descent and Stochastic Gradient Descent
- Analyzing these methods for Logistic Regression: separable/non-separable cases
- Non-Separable case:
  - condition number DegNSEP∗
  - computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent, including reaching linear convergence
- Separable case:
  - condition number DegSEP∗
  - computational guarantees for Greedy Coordinate Descent and Stochastic Gradient Descent, including computing an approximate maximum margin classifier