New Results for Sparsity-inducing Methods for Logistic Regression

Robert M. Freund (MIT)
joint with Paul Grigas (Berkeley) and Rahul Mazumder (MIT)

SIOPT Vancouver, May 2017
How can optimization inform statistics (and machine learning)?
Paper in preparation (this talk): New Results for Sparsity-inducing Methods for Logistic Regression
A “cousin” paper of ours: A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
Outline

- Optimization primer: some “old” results and new observations for Greedy Coordinate Descent (GCD)
- Logistic regression perspectives: statistics and machine learning
- When the sample data is non-separable:
  - a “condition number” for the degree of non-separability, informing the convergence properties of GCD
  - reaching linear convergence of GCD (thanks to Bach)
- When the sample data is separable:
  - a “condition number” for the degree of separability of the data, informing convergence to a certificate of separability
- Under construction: a different convergence result for an “accelerated” (but non-sparse) method for logistic regression (thanks to Renegar)
Primer on Greedy Coordinate Descent
Some “Old” Results and New Observations for the Greedy Coordinate Descent Method
Greedy Coordinate Descent

    F* := min  F(x)
          s.t. x ∈ R^p

Greedy Coordinate Descent:
Initialize at x^0 ∈ R^p, k ← 0. At iteration k:
1. Compute the gradient ∇F(x^k)
2. Compute j_k ∈ arg max_{j ∈ {1,...,p}} |∇F(x^k)_j| and set d^k ← sgn(∇F(x^k)_{j_k}) e_{j_k}
3. Choose a step-size α_k
4. Set x^{k+1} ← x^k − α_k d^k
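A minimal NumPy sketch of the four steps above, assuming the caller supplies a gradient oracle `grad_F` and a Lipschitz constant `L_F` (both names are illustrative; the step-size rule shown is the one analyzed on the following slides):

```python
import numpy as np

def greedy_coordinate_descent(grad_F, x0, L_F, num_iters):
    """Sketch of GCD: move along the coordinate with the largest
    absolute partial derivative at each iteration."""
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        g = grad_F(x)                    # step 1: full gradient
        jk = int(np.argmax(np.abs(g)))   # step 2: greedy coordinate j_k
        dk = np.sign(g[jk])              # signed unit direction e_{j_k}
        alpha = np.abs(g[jk]) / L_F      # step 3: alpha_k = ||grad F(x^k)||_inf / L_F
        x[jk] -= alpha * dk              # step 4: x^{k+1} = x^k - alpha_k d^k
    return x
```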
Greedy Coordinate Descent ≡ ℓ1-Steepest Descent

    F* := min  F(x)
          s.t. x ∈ R^p

Steepest Descent method in the ℓ1-norm:
Initialize at x^0 ∈ R^p, k ← 0. At iteration k:
1. Compute the gradient ∇F(x^k)
2. Compute the direction: d^k ← arg max_{‖d‖₁ ≤ 1} {∇F(x^k)^T d}
3. Choose a step-size α_k
4. Set x^{k+1} ← x^k − α_k d^k
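The equivalence can be checked numerically: the linear function ∇F(x^k)^T d is maximized over the ℓ1 ball at a signed coordinate vector, which is exactly the GCD direction. A small sketch, solving the LP over the ball with scipy's `linprog` by splitting d = u − v with u, v ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
g = rng.standard_normal(8)  # stand-in for the gradient of F at x^k

# max g^T d  s.t. ||d||_1 <= 1, via d = u - v, u, v >= 0, sum(u) + sum(v) <= 1
c = -np.concatenate([g, -g])          # linprog minimizes, so negate the objective
A_ub = np.ones((1, 16))               # sum(u) + sum(v) <= 1
res = linprog(c, A_ub=A_ub, b_ub=[1.0], bounds=[(0, None)] * 16)
d_lp = res.x[:8] - res.x[8:]

# GCD direction: signed unit vector on the largest-|gradient| coordinate
jk = int(np.argmax(np.abs(g)))
d_gcd = np.zeros(8)
d_gcd[jk] = np.sign(g[jk])

print(np.allclose(d_lp, d_gcd, atol=1e-8))  # True (up to ties / solver tolerance)
```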
Greedy Coordinate Descent ≡ ℓ1-Steepest Descent, cont.

    d^k ∈ arg max_{‖d‖₁ ≤ 1} {∇F(x^k)^T d}

[Figure: the ℓ1-steepest descent direction d^k]
Computational Guarantees for Greedy Coordinate Descent

    F* := min  F(x)
          s.t. x ∈ R^p

Assume F(·) is convex and ∇F(·) is Lipschitz with parameter L_F:

    ‖∇F(x) − ∇F(y)‖∞ ≤ L_F ‖x − y‖₁  for all x, y ∈ R^p

Two sets of interest:
- S₀ := {x ∈ R^p : F(x) ≤ F(x^0)} is the level set of the initial point x^0
- S* := {x ∈ R^p : F(x) = F*} is the set of optimal solutions
Metrics for Evaluating Greedy Coordinate Descent, cont.

- S₀ := {x ∈ R^p : F(x) ≤ F(x^0)} is the level set of the initial point x^0
- S* := {x ∈ R^p : F(x) = F*} is the set of optimal solutions

    Dist₀ := max_{x ∈ S₀} min_{x* ∈ S*} ‖x − x*‖₁

[Figure: the level set S₀, the optimal set S*, and Dist₀]

(In high-dimensional machine learning problems, S* can be very big)
Computational Guarantees for Greedy Coordinate Descent

    Dist₀ := max_{x ∈ S₀} min_{x* ∈ S*} ‖x − x*‖₁

Theorem: Objective Function Value Convergence (essentially [Beck and Tetruashvili 2013])
If the step-sizes are chosen using the rule:

    α_k = ‖∇F(x^k)‖∞ / L_F  for all k ≥ 0,

then for each k ≥ 0 the following inequality holds:

    F(x^k) − F* ≤ 1 / ( 1/(F(x^0) − F*) + k/(2 L_F (Dist₀)²) ) < 2 L_F (Dist₀)² / k .

Note that α_k → 0 as ‖∇F(x^k)‖∞ → 0
Computational Guarantees for GCD, cont.

Theorem: Gradient Norm Convergence
For any step-size sequence {α_k} and for each k ≥ 0, it holds that:

    min_{i ∈ {0,...,k}} ‖∇F(x^i)‖∞ ≤ ( F(x^0) − F* + (L_F/2) Σ_{i=0}^k α_i² ) / ( Σ_{i=0}^k α_i ) .

If the step-sizes are chosen using the rule:

    α_k = ‖∇F(x^k)‖∞ / L_F  for all k ≥ 0,

then for each k ≥ 0 the following inequality holds:

    min_{i ∈ {0,...,k}} ‖∇F(x^i)‖∞ ≤ √( 2 L_F (F(x^0) − F*) / (k+1) ) .
Computational Guarantees for GCD, cont.

Theorem: Iterate Shrinkage
For any step-size sequence {α_k}, it holds for each k ≥ 0 that:

    ‖x^k‖₁ ≤ ‖x^0‖₁ + Σ_{i=0}^{k−1} α_i .

If the step-sizes are chosen using the rule:

    α_k = ‖∇F(x^k)‖∞ / L_F  for all k ≥ 0,

then for each k ≥ 0 it holds that:

    ‖x^k‖₁ ≤ ‖x^0‖₁ + √k · √( 2(F(x^0) − F*) / L_F ) .
Logistic Regression

Logistic regression from two perspectives: statistics and machine learning
Logistic Regression: Statistics Perspective

Example: Predicting Parole Violation
Predict P(violate parole) based on age, gender, time served, offense class, multiple convictions, NYC, etc.

        Violator  Male   Age  TimeServed  Class  Multiple  InCity
   1           0     1  49.4        3.15      D         0       1
   2           1     1  26.0        5.95      D         1       0
   3           0     1  24.9        2.25      D         1       0
   4           0     1  52.1       29.22      A         0       0
   5           0     1  35.9       12.78      A         1       1
   6           0     1  25.9        1.18      C         1       1
   7           0     1  19.0        0.54      D         0       0
   8           0     1  43.2        1.07      C         0       1
   9           0     1  31.6        1.17      E         0       0
  10           0     1  40.7        4.64      B         1       1
  11           0     1  53.9       21.61      A         0       1
  12           0     1  28.5        3.23      D         1       0
  13           0     1  36.1        3.71      D         0       1
  14           0     1  48.8        1.17      D         0       0
  15           0     1  37.6        4.62      C         0       0
  16           0     1  42.5        1.75      D         0       1
 ...         ...   ...   ...         ...    ...       ...     ...
6098           0     1  55.0        0.72      E         0       0
6099           0     1  49.6       29.88      A         0       1
6100           0     1  22.4        2.85      D         0       1
6101           0     1  44.8        1.76      D         1       0
6102           0     0  45.3        1.03      E         0       0
Logistic Regression for Prediction

Y ∈ {−1, 1} is a Bernoulli random variable:

    P(Y = 1) = p,  P(Y = −1) = 1 − p

x = (x₁, ..., x_p) ∈ R^p is the vector of independent variables; P(Y = 1) depends on the values of the independent variables x₁, ..., x_p

The logistic regression model is:

    P(Y = 1 | x) = 1 / (1 + e^{−β^T x})
Logistic Regression for Prediction, continued

The logistic regression model is:

    P(Y = 1 | x) = 1 / (1 + e^{−β^T x})

Data records are (x_i, y_i), i = 1, ..., n (the parole data table above)

Let us construct an estimate of β based on the data (x_i, y_i), i = 1, ..., n
Logistic Regression: Maximum Likelihood Estimation

    max_β  ∏_{y_i = 1} ( 1 / (1 + e^{−β^T x_i}) ) · ∏_{y_i = −1} ( 1 − 1 / (1 + e^{−β^T x_i}) )

    = max_β  ∏_{y_i = 1} ( 1 / (1 + e^{−β^T x_i}) ) · ∏_{y_i = −1} ( 1 / (1 + e^{β^T x_i}) )

    = max_β  ∏_{i=1}^n  1 / (1 + e^{−y_i β^T x_i})

    ≡ min_β  (1/n) Σ_{i=1}^n ln(1 + e^{−y_i β^T x_i}) =: L_n(β)
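A direct transcription of L_n(·) (a sketch; `np.logaddexp` computes ln(1 + e^{−t}) stably instead of forming the exponential directly):

```python
import numpy as np

def logistic_loss(beta, X, y):
    """L_n(beta) = (1/n) * sum_i ln(1 + exp(-y_i * beta^T x_i)),
    where X is the n-by-p data matrix and y is in {-1, +1}^n."""
    margins = y * (X @ beta)                     # margins y_i * beta^T x_i
    return np.mean(np.logaddexp(0.0, -margins))  # stable ln(1 + e^{-m})
```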
Logistic Regression: Maximum Likelihood Optimization Problem

The logistic regression optimization problem is:

    L_n* := min  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))
            s.t. β ∈ R^p

[Figure: the logistic loss as a function of the margin y β^T x]

The logistic term is a 1-smoothing of f(α) = max{0, −α} (≡ shifted “hinge loss”)
Logistic Regression: Machine Learning Perspective
Logistic Regression as Binary Classification

Data: (x_i, y_i) ∈ R^p × {−1, 1}, i = 1, ..., n
- x = (x₁, ..., x_p) ∈ R^p is the vector of features (independent variables)
- y ∈ {−1, 1} is the response/label

Task: predict y based on the linear function β^T x, where β ∈ R^p are the model coefficients

Loss function: ℓ(y, β^T x) represents the loss incurred when the truth is y but our classification/prediction was based on β^T x

Loss Minimization Problem:

    min_β  (1/n) Σ_{i=1}^n ℓ(y_i, β^T x_i)
Loss Functions for Binary Classification

Some common loss functions used for binary classification:
- 0-1 loss: ℓ(y, β^T x) := 1(y β^T x < 0)
- Hinge loss: ℓ(y, β^T x) := max(0, 1 − y β^T x)
- Logistic loss: ℓ(y, β^T x) := ln(1 + exp(−y β^T x))

[Figure: the 0-1, hinge, and logistic losses as functions of the margin]

Here “Margin” = y β^T x
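A quick numerical comparison of the three losses as functions of the margin m = y β^T x (a small sketch; note the logistic loss equals ln 2 ≈ 0.693 at m = 0):

```python
import numpy as np

def loss_01(m):       return (m < 0).astype(float)     # 1(y beta^T x < 0)
def loss_hinge(m):    return np.maximum(0.0, 1.0 - m)  # max(0, 1 - m)
def loss_logistic(m): return np.logaddexp(0.0, -m)     # ln(1 + e^{-m})

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("0-1", loss_01), ("hinge", loss_hinge),
                ("logistic", loss_logistic)]:
    print(f"{name:8s}", np.round(f(margins), 3))
```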
Advantages of the Logistic Loss Function

Why use the logistic loss function for classification?
- Computational advantages: convex, smooth
- Fits the previous statistical model of conditional probability: P(Y = y | x) = 1 / (1 + exp(−y β^T x))
- Makes sense when the data is non-separable
- Robust to misspecification of class labels
Logistic Regression Problem of Interest, continued

Alternate versions of the optimization problem add regularization and/or sparsification:

    L_n* := min  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i)) + λ‖β‖_p
            s.t. β ∈ R^p, ‖β‖₀ ≤ k

Aspirations:
- Good predictive performance on new (out-of-sample) observations
- Models that are more interpretable (e.g., sparse)
Greedy Coordinate Descent for Logistic Regression
Greedy Coordinate Descent for Logistic Regression

Initialize at β^0 ← 0, k ← 0. At iteration k ≥ 0:
1. Compute ∇L_n(β^k)
2. Compute j_k ∈ arg max_{j ∈ {1,...,p}} |∇L_n(β^k)_j|
3. Set β^{k+1} ← β^k − α_k sgn(∇L_n(β^k)_{j_k}) e_{j_k}

Why use Greedy Coordinate Descent for Logistic Regression?
- Scalable and effective when n, p ≫ 0 and maybe p > n
- GCD performs variable selection
- GCD imparts implicit regularization
- Just one tuning parameter (number of iterations)
- Connections to boosting (LogitBoost)
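Specializing the generic GCD sketch to L_n(·): the gradient is ∇L_n(β) = −(1/n) Σ_i y_i σ(−y_i β^T x_i) x_i with σ(t) = 1/(1 + e^{−t}), and the step-size rule analyzed below uses the Lipschitz constant (1/4n)‖X‖²₁,₂. A minimal sketch under those formulas:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))  # clipped for stability

def grad_Ln(beta, X, y):
    """Gradient of L_n: -(1/n) X^T (y * sigmoid(-y * X beta))."""
    m = y * (X @ beta)
    return -(X.T @ (y * sigmoid(-m))) / len(y)

def gcd_logistic(X, y, num_iters):
    """GCD for logistic regression with alpha_k = 4n ||grad||_inf / ||X||_{1,2}^2."""
    n, p = X.shape
    X12 = np.max(np.linalg.norm(X, axis=0))  # ||X||_{1,2} = max_j ||X_j||_2
    beta = np.zeros(p)                       # beta^0 = 0
    for _ in range(num_iters):
        g = grad_Ln(beta, X, y)
        jk = int(np.argmax(np.abs(g)))
        alpha = 4.0 * n * np.abs(g[jk]) / X12**2
        beta[jk] -= alpha * np.sign(g[jk])
    return beta  # at most num_iters nonzero coordinates
```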
Implicit Regularization and Variable Selection Properties

Artificial example: n = 1000, p = 100, true model has 5 non-zeros

[Figures: behavior of GCD across iterations on the artificial example]

Compare with explicit regularization schemes (ℓ1, ℓ2, etc.)
How Can GCD Inform Logistic Regression?

Some questions:
- How do the computational guarantees for Greedy Coordinate Descent specialize to the case of Logistic Regression?
- What role does problem structure/conditioning play in these guarantees?
- Can we say anything further about the convergence properties of Greedy Coordinate Descent in the special case of Logistic Regression?
Basic Properties of the Logistic Loss Function

    L_n* := min  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))
            s.t. β ∈ R^p

- L_n(·) is convex
- ∇L_n(·) is L = (1/4n)‖X‖²₁,₂ -Lipschitz:

      ‖∇L_n(β) − ∇L_n(β′)‖∞ ≤ (1/4n)‖X‖²₁,₂ ‖β − β′‖₁ ,  where ‖X‖₁,₂ := max_{j=1,...,p} ‖X_j‖₂

- For β^0 := 0 it holds that L_n(β^0) = ln(2)
- L_n* ≥ 0
- If L_n* = 0, then the optimum is not attained (something is “wrong” or “very wrong”); we will see later that “very wrong” is actually good....
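The Lipschitz claim is easy to sanity-check numerically on random data; in this sketch the bound should hold for every sampled pair of points:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)

def grad_Ln(beta):
    m = np.clip(y * (X @ beta), -30.0, 30.0)
    return -(X.T @ (y / (1.0 + np.exp(m)))) / n  # sigmoid(-m) = 1/(1 + e^m)

L = np.max(np.linalg.norm(X, axis=0))**2 / (4.0 * n)  # (1/4n) ||X||_{1,2}^2

for _ in range(1000):
    b1, b2 = rng.standard_normal(p), rng.standard_normal(p)
    lhs = np.max(np.abs(grad_Ln(b1) - grad_Ln(b2)))  # l_inf distance of gradients
    rhs = L * np.sum(np.abs(b1 - b2))                # L times l_1 distance of points
    assert lhs <= rhs + 1e-9
print("Lipschitz bound held on all sampled pairs")
```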
Basic Properties, continued

    L_n* := min  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))
            s.t. β ∈ R^p

[Figure: the logistic loss as a function of the margin y β^T x]

Logistic regression “ideally” seeks β for which y_i x_i^T β > 0 for all i:

    y_i > 0 ⇒ x_i^T β > 0
    y_i < 0 ⇒ x_i^T β < 0
Geometry of the Data: Separable and Non-Separable Data

[Figures: (a) Separable Data, (b) Not Separable Data, (c) Mildly Non-Separable Data, (d) Very Non-Separable Data]
Separable Data

The data is separable if there exists β̄ for which y_i · (β̄^T x_i) > 0 for all i = 1, ..., n

Equivalently, Y X β̄ > 0, where Y := diag(y)
Separable Data, continued

    L_n* := min  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))
            s.t. β ∈ R^p

The data is separable if there exists β̄ for which Y X β̄ > 0, where Y := diag(y)

If β̄ separates the data, then L_n(θβ̄) → 0 (= L_n*) as θ → +∞

Perhaps trying to optimize the logistic loss function is unlikely to be effective at finding a “good” linear separator?
Strictly Non-Separable Data

We say that the data is strictly non-separable if:

    Y X β ≠ 0  ⇒  Y X β ≱ 0

[Figures: (a) Strictly Non-Separable, (b) Not Strictly Non-Separable]
Strict Non-Separability and Problem Behavior/Conditioning

Theorem: Attaining Optima
When the data is strictly non-separable, the logistic regression problem attains its optimum.

Let us quantify the degree of non-separability of the data and relate this to problem behavior/conditioning....

[Figures: (a) Mildly non-separable data, (b) Very non-separable data]
Non-Separability Measure DistSEP∗

Definition of the Non-Separability Measure DistSEP∗:

    DistSEP∗ := min  (1/n) Σ_{i=1}^n [y_i β^T x_i]⁻
                s.t. β ∈ R^p, ‖β‖₁ = 1

(here [t]⁻ := max{0, −t} denotes the negative part of t)

- DistSEP∗ is the least average misclassification error
- DistSEP∗ > 0 if and only if the data is strictly non-separable
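Computing DistSEP∗ exactly requires minimizing over the ℓ1 sphere, which is a nonconvex set; however, evaluating the objective at any candidate β immediately gives an upper bound on DistSEP∗. A small sketch of that evaluation:

```python
import numpy as np

def avg_misclassification_error(beta, X, y):
    """(1/n) sum_i [y_i beta^T x_i]^- after normalizing ||beta||_1 = 1.
    Any candidate beta yields an upper bound on DistSEP*."""
    beta = beta / np.sum(np.abs(beta))         # scale onto the l1 sphere
    margins = y * (X @ beta)
    return np.mean(np.maximum(0.0, -margins))  # negative part [t]^- = max(0, -t)
```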
Non-Separability Measure DistSEP∗, continued

    DistSEP∗ := min_{β ∈ R^p, ‖β‖₁ = 1}  (1/n) Σ_{i=1}^n [y_i β^T x_i]⁻

[Figures: (a) DistSEP∗ is small, (b) DistSEP∗ is large]
DistSEP∗ and “Distance to Separability”

    DistSEP∗ := min_{β ∈ R^p, ‖β‖₁ = 1}  (1/n) Σ_{i=1}^n [y_i β^T x_i]⁻

Theorem: DistSEP∗ is the “Distance to Separability”

    DistSEP∗ = inf_{Δx₁, ..., Δx_n}  (1/n) Σ_{i=1}^n ‖Δx_i‖∞
               s.t. (x_i + Δx_i, y_i), i = 1, ..., n are separable
DistSEP∗ and Problem Behavior/Conditioning

    L_n* := min_{β ∈ R^p}  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

    DistSEP∗ := min_{β ∈ R^p, ‖β‖₁ = 1}  (1/n) Σ_{i=1}^n [y_i β^T x_i]⁻

Theorem: Strict Non-Separability and Sizes of Optimal Solutions
Suppose that the data is strictly non-separable, and let β* be an optimal solution of the logistic regression problem. Then:

    ‖β*‖₁ ≤ ln(2) / DistSEP∗ ,  whereby  Dist₀ ≤ 2 ln(2) / DistSEP∗ .
Computational Guarantees for GCD: Non-Separable Case

Theorem: Computational Guarantees for GCD, Non-Separable Case
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes α_k := 4n‖∇L_n(β^k)‖∞ / ‖X‖²₁,₂ for all k ≥ 0, and suppose that the data is strictly non-separable. Then for each k ≥ 0 it holds that:

(i) (training error):  L_n(β^k) − L_n* ≤ 2(ln 2)²‖X‖²₁,₂ / ( k · n · (DistSEP∗)² )

(ii) (gradient norm):  min_{i ∈ {0,...,k}} ‖∇L_n(β^i)‖∞ ≤ ‖X‖₁,₂ √( (ln(2) − L_n*) / (2n(k+1)) )

(iii) (regularization):  ‖β^k‖₁ ≤ (√k / ‖X‖₁,₂) √( 8n(ln(2) − L_n*) )

(iv) (sparsity):  ‖β^k‖₀ ≤ k
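Bound (i) can be inverted to estimate how many GCD iterations suffice for a target training-error gap ε. A sketch of the arithmetic (`dist_sep` is an estimate of DistSEP∗, which is typically unknown in practice):

```python
import numpy as np

def iters_for_gap(X, dist_sep, eps):
    """Smallest k making bound (i) at most eps:
    2 ln(2)^2 ||X||_{1,2}^2 / (k * n * DistSEP*^2) <= eps."""
    n = X.shape[0]
    X12 = np.max(np.linalg.norm(X, axis=0))  # ||X||_{1,2}
    return int(np.ceil(2.0 * np.log(2.0)**2 * X12**2 / (eps * n * dist_sep**2)))
```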
Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression

For logistic regression, does Greedy Coordinate Descent exhibit linear convergence?
Some Definitions/Notation

- R := max_{i ∈ {1,...,n}} ‖x_i‖₂ (maximum norm of the feature vectors)
- H(β*) denotes the Hessian of L_n(·) at an optimal solution β*
- λ_pmin(H(β*)) denotes the smallest non-zero (and hence positive) eigenvalue of H(β*)
Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Reaching Linear Convergence of GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes α_k := 4n‖∇L_n(β^k)‖∞ / ‖X‖²₁,₂ for all k ≥ 0, and suppose that the data is strictly non-separable. Define:

    ǩ := 16 (ln 2)² ‖X‖²₁,₂ R² p / ( 9n (DistSEP∗)² λ_pmin(H(β*))² ) .

Then for all k ≥ ǩ it holds that:

    L_n(β^k) − L_n* ≤ ( L_n(β^ǩ) − L_n* ) · ( 1 − λ_pmin(H(β*)) n / (‖X‖²₁,₂ p) )^{k − ǩ} .
Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:
- The proof relies on (a slight generalization of) the “generalized self-concordance” property of the logistic loss function due to [Bach 2014]
- Furthermore, we can bound:

      λ_pmin(H(β*)) ≥ (1/4n) λ_pmin(X^T X) · exp( − ln(2) ‖X‖₁,∞ / DistSEP∗ )

- As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence “kicks in” and also what the rate of linear convergence is guaranteed to be
- Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing . . . )
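Since the theorem gives both the onset ǩ and the per-iteration contraction factor in closed form, both can be computed directly from (estimates of) the problem quantities. A sketch of that arithmetic, with all inputs assumed given:

```python
import numpy as np

def linear_convergence_params(X, R, dist_sep, lam_pmin):
    """Return (k_check, rate): per the theorem, linear convergence with
    contraction factor `rate` per iteration kicks in after k_check iterations."""
    n, p = X.shape
    X12 = np.max(np.linalg.norm(X, axis=0))          # ||X||_{1,2}
    k_check = (16.0 * np.log(2.0)**2 * X12**2 * R**2 * p
               / (9.0 * n * dist_sep**2 * lam_pmin**2))
    rate = 1.0 - lam_pmin * n / (X12**2 * p)
    return k_check, rate
```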
Separability and Problem Behavior/Conditioning

[Figure: separable data]
Separable Data, continued

    L_n* := min_{β ∈ R^p}  L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

Recall the data is separable if there exists β̄ for which Y X β̄ > 0, where Y := diag(y)

If β̄ separates the data, then L_n(θβ̄) → 0 (= L_n*) as θ → +∞

Despite this, it turns out that GCD is reasonably effective at finding a “good” linear separator, as we shall shortly see....
Margin Function ρ(β)

    ρ(β) := min_{i ∈ {1,...,n}} [y_i β^T x_i]
Separability Measure DistNSEP∗

Definition of the Separability Measure DistNSEP∗:

    DistNSEP∗ := max  ρ(β)
                 s.t. β ∈ R^p, ‖β‖₁ = 1

- DistNSEP∗ is the maximum margin over all (normalized) β
- DistNSEP∗ > 0 if and only if the data is separable
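Because ρ(·) is concave and positively homogeneous, maximizing it over the ℓ1 ball is a linear program, and for separable data the maximum is attained on the sphere ‖β‖₁ = 1. A sketch using scipy's `linprog`, splitting β = u − v with u, v ≥ 0 (for non-separable data this LP returns 0 rather than the negative sphere value):

```python
import numpy as np
from scipy.optimize import linprog

def dist_nsep(X, y):
    """max { t : y_i x_i^T beta >= t for all i, ||beta||_1 <= 1 }.
    A positive value equals DistNSEP* and certifies separability;
    0 indicates the data is non-separable."""
    n, p = X.shape
    YX = y[:, None] * X
    c = np.zeros(2 * p + 1)
    c[-1] = -1.0                                       # minimize -t = maximize t
    A_margin = np.hstack([-YX, YX, np.ones((n, 1))])   # t - y_i x_i^T(u - v) <= 0
    A_ball = np.hstack([np.ones((1, 2 * p)), np.zeros((1, 1))])  # sum(u)+sum(v) <= 1
    A_ub = np.vstack([A_margin, A_ball])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * p) + [(None, None)]    # u, v >= 0; t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[-1]
```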
DistNSEP∗ and “Distance to Non-Separability”

    DistNSEP∗ := max_{β ∈ R^p, ‖β‖₁ = 1}  ρ(β)

Theorem: DistNSEP∗ is the “Distance to Non-Separability”

    DistNSEP∗ = inf_{Δx₁, ..., Δx_n}  max_{i ∈ {1,...,n}} ‖Δx_i‖∞
                s.t. (x_i + Δx_i, y_i), i = 1, ..., n are non-separable
Computational Guarantees for GCD: Separable Case

Theorem: Computational Guarantees for GCD, Separable Case
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes α_k := 4n‖∇L_n(β^k)‖∞ / ‖X‖²₁,₂ for all k ≥ 0, and suppose that the data is separable. Then for each k ≥ 0 it holds that:

(i) (margin bound):  there exists i ≤ 3.7 n ‖X‖²₁,₂ / (DistNSEP∗)² for which the normalized iterate β̄^i := β^i / ‖β^i‖₁ satisfies:

    ρ(β̄^i) ≥ 0.18 · DistNSEP∗ / n

(ii) (gradient norm):  min_{i ∈ {0,...,k}} ‖∇L_n(β^i)‖∞ ≤ ‖X‖₁,₂ √( (ln(2) − L_n*) / (2n(k+1)) )

(iii) (regularization):  ‖β^k‖₁ ≤ (√k / ‖X‖₁,₂) √( 8n(ln(2) − L_n*) )

(iv) (sparsity):  ‖β^k‖₀ ≤ k
Other Issues

Some other topics not mentioned today (still ongoing):
- Other “GCD-type”/“boosting-type” methods suggested by connections to Mirror Descent and the Frank-Wolfe method
- The high-dimensional regime p > n: define DistSEP∗_k and DistNSEP∗_k by restricting β to satisfy ‖β‖₀ ≤ k
- Numerical experiments comparing methods
- Further investigation of the properties of other step-size choices for Greedy Coordinate Descent
Summary

- Some “old” results and new observations for the Greedy Coordinate Descent Method
- Analyzing GCD for Logistic Regression: separable/non-separable cases
- Non-Separable case:
  - behavioral/condition measure DistSEP∗
  - computational guarantees for GCD, including reaching linear convergence
- Separable case:
  - behavioral/condition measure DistNSEP∗
  - computational guarantees for GCD, including computing a reasonably good separator