New Results for Sparsity-inducing Methods for Logistic Regression
Robert M. Freund (MIT), joint with Paul Grigas (Berkeley) and Rahul Mazumder (MIT)
Cornell University, December 2016
How can optimization inform statistics (and machine learning)?
Paper in preparation (this talk): “New Results for Sparsity-inducing Methods for Logistic Regression”
A “cousin” paper available online: “A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives”
Outline

- Optimization primer: some "old" results and new observations for the Greedy Coordinate Descent (GCD) method
- Logistic regression: statistics perspective, machine learning perspective
- A "condition number" for the logistic regression problem:
  - the degree of non-separability of the data
  - data perturbation to separability of the data
  - informing the convergence properties of Greedy Coordinate Descent
- Reaching linear convergence of Greedy Coordinate Descent for logistic regression (thanks to Bach)
- Different convergence for an "accelerated" (but non-sparse) method for logistic regression (thanks to Renegar)
Primer on Greedy Coordinate Descent
Primer: Some “Old” Results and New Observations for the Greedy Coordinate Descent Method
Gradient Descent ≡ ℓ2-Steepest Descent

The problem of interest is:

    F* := min_{x ∈ R^p} F(x)

where F(x) is convex and differentiable.

Steepest Descent method for minimizing F(x):
Initialize at x^0 ∈ R^p, k ← 0.
At iteration k:
  1. Compute the gradient ∇F(x^k)
  2. Choose a step-size α̂_k
  3. Set x^{k+1} ← x^k − α̂_k ∇F(x^k)
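Not part of the original slides: a minimal Python sketch of the three-step loop above, with a caller-supplied gradient and step-size rule. The quadratic objective in the usage line is only a made-up illustration.

```python
import numpy as np

def steepest_descent(grad_F, x0, step_size, num_iters):
    """Minimal sketch of the three-step loop above.

    grad_F    : callable returning the gradient of F at a point
    step_size : callable mapping the iteration counter k to alpha_k
    """
    x = np.array(x0, dtype=float)
    for k in range(num_iters):
        g = grad_F(x)          # 1. compute the gradient at x^k
        alpha = step_size(k)   # 2. choose the step-size alpha_k
        x = x - alpha * g      # 3. set x^{k+1} = x^k - alpha_k * gradient
    return x

# Illustrative use on F(x) = 0.5 * ||x||_2^2, whose gradient is x
x_final = steepest_descent(lambda x: x, np.ones(5), lambda k: 0.5, 50)
```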
- Connections to boosting (LogitBoost)
- Just one tuning parameter (number of iterations)
- GCD performs variable selection
- GCD imparts implicit regularization
Implicit Regularization and Variable Selection Properties

Artificial example: n = 1000, p = 100, true model has 5 non-zeros
Compare with explicit regularization schemes (ℓ1, ℓ2, etc.)
Connections to Boosting
- In boosting, the goal is to combine multiple "weak" models into a more powerful "committee" (here a weak model corresponds to a feature)
- AdaBoost ([Schapire 1990], [Y. Freund 1995], [Y. Freund and Schapire 1996], ...) is a widely popular boosting algorithm for classification
- AdaBoost can be interpreted as Greedy Coordinate Descent to minimize the exponential loss function ([Mason et al. 2000])
- LogitBoost ([Friedman et al. 2000]) replaces the exponential loss with the logistic loss, and is ≡ Greedy Coordinate Descent for Logistic Regression
How Can Optimization Inform Logistic Regression?
Some questions:
- How do the computational guarantees for Greedy Coordinate Descent specialize for Logistic Regression?
- What role does problem structure/conditioning play in these guarantees?
- Can we say anything further about the convergence properties of Greedy Coordinate Descent in the special case of Logistic Regression?
Optimization Properties, Non-Separability, Complexity
Optimization Properties, Non-Separability, and Computational Guarantees
Basic Properties of the (Empirical) Logistic Loss

    L_n^* := min_{β ∈ R^p} L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

Relatively simple to show that:
- L_n(·) is convex
- ∇L_n(·) is L-Lipschitz with L = (1/(4n)) ||X||²_{1,2}:

      ||∇L_n(β) − ∇L_n(β′)||_∞ ≤ (1/(4n)) ||X||²_{1,2} ||β − β′||_1 ,

  where ||X||_{1,2} := max_{j=1,...,p} ||X_j||_2
- For β^0 := 0 it holds that L_n(β^0) = ln(2)
- L_n^* ≥ 0
- If L_n^* = 0, then the optimum is not attained (something is "wrong")
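As an illustration of these quantities (not from the slides), here is a numpy sketch of L_n(β), ∇L_n(β), and the Lipschitz constant; the function names are my own and the exponentials are not numerically hardened.

```python
import numpy as np

def logistic_loss_and_grad(beta, X, y):
    """Empirical logistic loss L_n(beta) and its gradient.

    X is the (n, p) data matrix, y the (n,) label vector in {-1, +1}.
    """
    n = X.shape[0]
    margins = y * (X @ beta)                     # y_i * beta^T x_i
    loss = np.mean(np.log1p(np.exp(-margins)))   # L_n(beta)
    weights = 1.0 / (1.0 + np.exp(margins))      # exp(-m_i) / (1 + exp(-m_i))
    grad = -(X.T @ (y * weights)) / n            # gradient of L_n at beta
    return loss, grad

def lipschitz_constant(X):
    """L = ||X||_{1,2}^2 / (4n), where ||X||_{1,2} = max_j ||X_j||_2."""
    n = X.shape[0]
    max_col_norm = np.max(np.linalg.norm(X, axis=0))
    return max_col_norm ** 2 / (4.0 * n)
```

At β = 0 the loss returned is ln(2), matching the property listed above.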
Basic Properties, continued

    L_n^* := min_{β ∈ R^p} L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

[Figure: the logistic loss as a function of the margin y_i β^T x_i]

Logistic regression "ideally" seeks β for which y_i x_i^T β > 0 for all i:
    y_i > 0 ⇒ x_i^T β > 0
    y_i < 0 ⇒ x_i^T β < 0
Geometry of the Data: Separable and Non-Separable Data

[Figure panels: (a) Strictly Separable Data, (b) Not Strictly Separable Data, (c) "Almost Separable" Data, (d) "Very Non-Separable" Data]
Linearly Separable Data

We are given data (x_i, y_i) ∈ R^p × {−1, +1}, i = 1, ..., n.

Let X ∈ R^{n×p} be the data matrix: x_i is the i-th row of X and X_j denotes the j-th column. y ∈ {−1, 1}^n is the vector of labels.

Linearly Separable Data: the data is linearly separable with separator β̄ if y_i · (β̄^T x_i) > 0 for all i = 1, ..., n. Equivalently, Y X β̄ > 0, where Y := diag(y).
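A one-line check of this definition (illustrative only, with a hypothetical function name): β̄ separates the data exactly when every entry of Y X β̄ is positive.

```python
import numpy as np

def is_separator(beta_bar, X, y):
    """True iff beta_bar linearly separates the data, i.e. every entry of
    y * (X @ beta_bar) (equivalently Y X beta_bar) is strictly positive."""
    return bool(np.all(y * (X @ beta_bar) > 0))
```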
Linearly Separable Data, continued

    L_n^* := min_{β ∈ R^p} L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

The data is linearly separable with separator β̄ if Y X β̄ > 0, where Y := diag(y).

If β̄ linearly separates the data, then L_n(θ β̄) → 0 (= L_n^*) as θ → +∞. Thus the logistic loss function is not effective at finding a "good" linear separator.
Strictly Non-Separable Data

Strictly Non-Separable Data: we say that the data is strictly non-separable if:

    Y X β ≠ 0  ⇒  Y X β ≱ 0

[Figure panels: (a) Strictly Non-Separable, (b) Not Strictly Non-Separable]
Strict Non-Separability, continued

    L_n^* := min_{β ∈ R^p} L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

The data is strictly non-separable if:  Y X β ≠ 0  ⇒  Y X β ≱ 0

Theorem: Attaining Optima
When the data is strictly non-separable, then the (empirical) logistic regression problem attains its optimum (and conversely).
Strict Non-Separability and Problem Behavior/Conditioning

Theorem: Attaining Optima
When the data is strictly non-separable, then the (empirical) logistic regression problem attains its optimum (and conversely).

Q: Can we quantify the degree of non-separability of the data and relate this to problem behavior/conditioning?

[Figure panels: (a) Mildly non-separable data, (b) Very non-separable data]
Non-Separability Measure NSEP*

Definition of Non-Separability Measure NSEP*:

    NSEP* := min_{β ∈ R^p} (1/n) Σ_{i=1}^n [y_i β^T x_i]_−
             s.t.  ||β||_1 = 1

- NSEP* is the least average misclassification error
- NSEP* > 0 if and only if the data is strictly non-separable
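To make the definition concrete, the sketch below (not from the slides) evaluates the NSEP* objective, the average misclassification error, at a candidate β rescaled to the ℓ1 sphere; NSEP* itself is the minimum of this quantity over all such β, and that minimization is not performed here.

```python
import numpy as np

def nsep_objective(beta, X, y):
    """Average misclassification error (1/n) * sum_i [y_i beta^T x_i]_-
    at a candidate beta rescaled so that ||beta||_1 = 1.  NSEP* is the
    minimum of this value over the whole l1 sphere (not computed here)."""
    beta = beta / np.sum(np.abs(beta))                # enforce ||beta||_1 = 1
    margins = y * (X @ beta)                          # y_i * beta^T x_i
    return float(np.mean(np.maximum(-margins, 0.0)))  # negative parts [.]_-
```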
Non-Separability Measure NSEP*

    NSEP* := min_{β ∈ R^p} (1/n) Σ_{i=1}^n [y_i β^T x_i]_−
             s.t.  ||β||_1 = 1

[Figure panels: (a) NSEP* is small, (b) NSEP* is large]
NSEP* and Problem Behavior/Conditioning

    L_n^* := min_{β ∈ R^p} L_n(β) := (1/n) Σ_{i=1}^n ln(1 + exp(−y_i β^T x_i))

    NSEP* := min_{β ∈ R^p} (1/n) Σ_{i=1}^n [y_i β^T x_i]_−   s.t.  ||β||_1 = 1

Theorem: Strict Non-Separability and Sizes of Optimal Solutions
Suppose that the data is strictly non-separable, and let β* be an optimal solution of the logistic regression problem. Then

    ||β*||_1 ≤ ln(2) / NSEP* .
NSEP* and Problem Behavior/Conditioning, cont.

Theorem: Strict Non-Separability and Sizes of Optimal Solutions
Suppose that the data is strictly non-separable, and let β* be an optimal solution of the logistic regression problem. Then ||β*||_1 ≤ ln(2) / NSEP*.

Let S_0 := {β ∈ R^p : L_n(β) ≤ L_n(β^0)} be the level set of the initial point β^0, and let S* := {β ∈ R^p : L_n(β) = L_n^*} be the set of optimal solutions. Define

    Dist_0 := max_{β ∈ S_0} min_{β* ∈ S*} ||β − β*||_1 .

Then

    Dist_0 ≤ 2 ln(2) / NSEP* .
Computational Guarantees for GCD for Logistic Regression

Theorem: Computational Guarantees for GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes α_k := 4n ||∇L_n(β^k)||_∞ / ||X||²_{1,2} for all k ≥ 0, and suppose that the data is strictly non-separable. Then for each k ≥ 0 it holds that:

(i) (training error):  L_n(β^k) − L_n^* ≤ 2 (ln 2)² ||X||²_{1,2} / (k · n · (NSEP*)²)

(ii) (gradient norm):  min_{i ∈ {0,...,k}} ||∇L_n(β^i)||_∞ ≤ ||X||_{1,2} √( (ln 2 − L_n^*) / (2n·(k+1)) )

(iii) (shrinkage):  ||β^k||_1 ≤ (√k / ||X||_{1,2}) √( 8n (ln 2 − L_n^*) )

(iv) (sparsity):  ||β^k||_0 ≤ k
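A numpy sketch of the method in the theorem (my own illustration, not the authors' code): at each iteration pick the coordinate with the largest gradient magnitude and move it by α_k = 4n ||∇L_n(β^k)||_∞ / ||X||²_{1,2}, so after k iterations at most k coordinates are non-zero, consistent with guarantee (iv).

```python
import numpy as np

def gcd_logistic_regression(X, y, num_iters):
    """Sketch of Greedy Coordinate Descent for logistic regression with the
    step-size alpha_k = 4n * ||grad L_n(beta^k)||_inf / ||X||_{1,2}^2 from
    the theorem above.  Each iteration updates a single coordinate."""
    n, p = X.shape
    norm_X_12_sq = np.max(np.linalg.norm(X, axis=0)) ** 2   # ||X||_{1,2}^2
    beta = np.zeros(p)
    for k in range(num_iters):
        margins = y * (X @ beta)
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n   # gradient of L_n
        j = int(np.argmax(np.abs(grad)))                    # greedy coordinate
        alpha = 4.0 * n * np.abs(grad[j]) / norm_X_12_sq    # step-size alpha_k
        beta[j] -= alpha * np.sign(grad[j])                 # coordinate update
    return beta
```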
Other Step-size Choices

Theorem: Computational Guarantees for GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with an arbitrary step-size sequence {α_k}. Then for each k ≥ 0 it holds that:

(i) (gradient norm):  min_{i ∈ {0,...,k}} ||∇L_n(β^i)||_∞ ≤ ( ln 2 − L_n^* + (||X||²_{1,2} / (8n)) Σ_{i=0}^k α_i² ) / ( Σ_{i=0}^k α_i )

(ii) (shrinkage):  ||β^k||_1 ≤ Σ_{i=0}^k α_i

(iii) (sparsity):  ||β^k||_0 ≤ k

- Other step-size sequences are interesting since one may want to consider less aggressive fitting methods
- The bound on the gradient norm arises from a certain equivalence with the Mirror Descent method
NSEP* and "Distance to Separability"

    NSEP* := min_{β ∈ R^p} (1/n) Σ_{i=1}^n [y_i β^T x_i]_−   s.t.  ||β||_1 = 1

Theorem: NSEP* is the "Distance to Separability"

    NSEP* = inf_{Δx_1,...,Δx_n} (1/n) Σ_{i=1}^n ||Δx_i||_∞
            s.t.  (x_i + Δx_i, y_i), i = 1, ..., n  are linearly separable
Non-Separability Measure NSEP*

    NSEP* := min_{β ∈ R^p} (1/n) Σ_{i=1}^n [y_i β^T x_i]_−
             s.t.  ||β||_1 = 1

[Figure panels: (a) NSEP* is small, (b) NSEP* is large]
Reaching Linear Convergence
Reaching Linear Convergence using Greedy Coordinate Descent for Logistic Regression
Local Curvature of the Logistic Loss Function

[Figure: the logistic loss as a function of the margin]

While the logistic loss behaves linearly in some regions, it has curvature near zero, and often the margin values at the optimal solution β* are concentrated in this region.
GCD and Local Curvature of the Logistic Loss Function

[Figure: the logistic loss as a function of the margin]

Q: Does Greedy Coordinate Descent adapt to the local curvature of the logistic loss at the optimal solution β*?
A: Yes (as we will now demonstrate ...)
Some Definitions/Notation

Definitions:
- R := max_{i ∈ {1,...,n}} ||x_i||_2  (maximum norm of the feature vectors)
- H(β*) denotes the Hessian of L_n(·) at an optimal solution β*
- λ_pmin(H(β*)) denotes the smallest non-zero (and hence positive) eigenvalue of H(β*)
Reaching Linear Convergence of GCD for Logistic Regression

Theorem: Reaching Linear Convergence of GCD for Logistic Regression
Consider Greedy Coordinate Descent applied to the Logistic Regression problem with step-sizes α_k := 4n ||∇L_n(β^k)||_∞ / ||X||²_{1,2} for all k ≥ 0, and suppose that the data is strictly non-separable. Define:

    ǩ := 16 (ln 2)² ||X||²_{1,2} R² p / ( 9n (NSEP*)² λ_pmin(H(β*))² ) .

Then for all k ≥ ǩ it holds that:

    L_n(β^k) − L_n^* ≤ ( L_n(β^ǩ) − L_n^* ) ( 1 − λ_pmin(H(β*)) n / (||X||²_{1,2} p) )^{k − ǩ} .
Reaching Linear Convergence of GCD for Logistic Regression, cont.

Some comments:
- The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014]
- This property also yields a bound of the form

      L_n(β) − L_n^* ≤ 2 ||∇L_n(β)||²_2 / λ_pmin(H(β*))

  if ||∇L_n(β)||_2 is small enough
- As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in" and also what the rate of linear convergence is guaranteed to be
- Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing ... this result is quite new)
Other Issues

Some other topics not mentioned today (still ongoing):
- Other "GCD-type"/"boosting-type" methods suggested by connections to Mirror Descent and the Frank-Wolfe method
- The high-dimensional regime p > n: define NSEP*_k by restricting β to satisfy ||β||_0 ≤ k
- Numerical experiments comparing methods
- Further investigation of the properties of other step-size choices for Greedy Coordinate Descent
Summary

- Some "old" results and new observations for the Greedy Coordinate Descent Method
- NSEP* for Logistic Regression problems:
  - measures the degree of non-separability of the data
  - informs the convergence properties of Greedy Coordinate Descent
- Computational guarantees for Greedy Coordinate Descent for Logistic Regression:
  - O( 1 / ((NSEP*)² k) ) global objective value convergence
  - Reaching linear convergence
  - Other guarantees in terms of the norm of the gradient and shrinkage of the iterates
Back-up Slides: Related Results for AdaBoost
AdaBoost: First Problem of Interest

AdaBoost is also Greedy Coordinate Descent, but replaces the logistic loss function with the log-exponential loss:

    L_l^* := min_{λ ≥ 0} L_l(λ) = ln( (1/m) Σ_{i=1}^m exp(−(Aλ)_i) ) .

- Data: (x_1, y_1), ..., (x_m, y_m), where x_i ∈ R^n is the i-th feature vector and y_i ∈ {−1, +1}
- Here A := Y X, i.e., A_ij := y_i (x_i)_j
- Note that λ* is a linear separator of the data if and only if A λ* > 0
- Assume for convenience that for every column A_j, −A_j is also a column of A
AdaBoost: Second Problem of Interest

- Δ_n := {x ∈ R^n : e^T x = 1, x ≥ 0} is the standard simplex in R^n
- Recall that λ* is a linear separator of the data if and only if A λ* > 0
- The margin of a classifier λ ∈ R^n is:

      p(λ) := min_{i ∈ {1,...,m}} (Aλ)_i = min_{w ∈ Δ_m} w^T A λ

- It makes sense to look for a classifier with large margin, i.e., to solve:

      M:  ρ* := max_{λ ∈ Δ_n} p(λ) .
Dual of the Maximum Margin Problem

The "edge" of a vector of weights on the data, w ∈ Δ_m, is:

    f(w) := max_{j ∈ {1,...,n}} w^T A_j = max_{λ ∈ Δ_n} w^T A λ

The (linear programming) dual of the maximum margin problem is the problem of minimizing the edge:

    E:  f* := min_{w ∈ Δ_m} f(w) .

AdaBoost is three algorithms:
- A boosting method based on a scheme for (multiplicatively) updating a vector of weights on the data
- Greedy Coordinate Descent applied to minimize the log-exponential loss function
- A version of the Mirror Descent method applied to the above problem E
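For concreteness (not from the slides), the margin p(λ) and the edge f(w) can be computed directly from A:

```python
import numpy as np

def margin(lam, A):
    """Margin p(lambda) = min_i (A lambda)_i of a classifier lambda."""
    return float(np.min(A @ lam))

def edge(w, A):
    """Edge f(w) = max_j w^T A_j of a weight vector w on the data."""
    return float(np.max(A.T @ w))
```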
Computational Guarantees for AdaBoost

Theory for Greedy Coordinate Descent and Mirror Descent leads to computational guarantees for AdaBoost. For each step-size strategy, the guarantees cover a margin bound ρ* − p(λ^{k+1}) for separable data, and a gradient bound min_{i ∈ {0,...,k}} ||∇L_l(λ̂^i)||_∞ and a loss bound L_l(λ̂^k) − L_l^* for non-separable data:

- "edge rule":  α_k = ||∇L_l(λ̂^k)||_∞ — bounds of the form √(2 ln(m) / (k+1)), with a loss bound 8 ln(m)² / ((NSEP*_l)² k)
- "line-search":  α_k = (1/2) ln( (1 + r_k) / (1 − r_k) ) — bounds of the form √(2 ln(m) / (k+1)), with a loss bound 8 ln(m)² / ((NSEP*_l)² k)
- "constant":  α_i := √(2 ln(m) / (k+1)) for i = 0, ..., k — bounds of the form √(2 ln(m) / (k+1))
- "adaptive":  α_k = √(2 ln(m) / (k+1)) — bounds involving ln(m) [2 + ln(k+1)] / (2(√(k+2) − 1))

NSEP*_l is a "non-separability condition number" for the log-exponential loss.