Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods

Robert M. Freund (MIT), joint with Paul Grigas (Berkeley) and Rahul Mazumder (MIT)

ISI Marrakech, July 2017
How can optimization inform statistics (and machine learning)?
Paper in preparation (this talk): Condition Number Analysis of Logistic Regression, and its Implications for First-Order Solution Methods
A “cousin” paper of ours: A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
Outline

- Optimization primer: some "old" results and new observations for the family of steepest descent algorithms
- Logistic regression perspectives: statistics and machine learning
- A pair of condition numbers for the logistic regression problem:
  - When the sample data is non-separable: a condition number for the degree of non-separability of the dataset, informing the convergence guarantees of the steepest descent family, including guarantees on reaching linear convergence (thanks to Bach)
  - When the sample data is separable: a condition number for the degree of separability of the dataset, informing a convergence guarantee for delivering an approximate maximum margin classifier
Primer on Steepest Descent in a Given Norm
Some Old and New Results for Steepest Descent in a Given Norm
Steepest Descent in a Given Norm (SDGN)

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Let $\|\cdot\|$ be the given norm on the variables $x \in \mathbb{R}^p$.

Steepest Descent in a Given Norm (SDGN):
Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:
1. Compute the gradient $\nabla F(x^k)$
2. Compute the direction $d^k \leftarrow \arg\max_d \{\nabla F(x^k)^T d : \|d\| \le 1\}$
3. Choose the step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$
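For the common norms the direction subproblem in step 2 has a closed form, so the whole method fits in a few lines. Below is a minimal Python sketch covering the $\ell_1$, $\ell_2$, and $\ell_\infty$ cases; the function names and the step-size callback are illustrative choices, not notation from the talk.

```python
import numpy as np

def sdgn(grad_F, x0, step_size, num_iters, norm_ord=2):
    """Minimal sketch of Steepest Descent in a Given Norm (SDGN).
    grad_F: gradient oracle for F; step_size: callable (k, grad) -> alpha_k;
    norm_ord in {1, 2, np.inf} selects the norm defining the direction."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad_F(x)
        if not np.any(g):
            break                                  # stationary point
        # d^k = arg max { g^T d : ||d|| <= 1 } in closed form per norm:
        if norm_ord == 2:
            d = g / np.linalg.norm(g)              # normalized gradient
        elif norm_ord == 1:
            d = np.zeros_like(g)                   # signed unit coordinate
            j = np.argmax(np.abs(g))
            d[j] = np.sign(g[j])
        else:
            d = np.sign(g)                         # l-infinity: sign vector
        x = x - step_size(k, g) * d
    return x
```

With `norm_ord=1` this is exactly greedy coordinate descent, and with `norm_ord=2` it is gradient descent; these are the two instances on the next slides.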
Greedy Coordinate Descent ≡ ℓ1-Steepest Descent

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Let $\|\cdot\| = \|\cdot\|_1$.

Steepest Descent method in the $\ell_1$-norm:
Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:
1. Compute the gradient $\nabla F(x^k)$
2. Compute the direction $d^k \leftarrow \arg\max_d \{\nabla F(x^k)^T d : \|d\|_1 \le 1\}$
3. Choose the step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$
Greedy Coordinate Descent ≡ ℓ1-Steepest Descent, cont.

$$d^k \in \arg\max_{\|d\|_1 \le 1} \{\nabla F(x^k)^T d\}$$

(Figure: the direction $d^k$ is a signed unit coordinate vector, an extreme point of the $\ell_1$ unit ball.)
Gradient Descent ≡ ℓ2-Steepest Descent

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Let $\|\cdot\| = \|\cdot\|_2$.

Steepest Descent method in the $\ell_2$-norm:
Initialize at $x^0 \in \mathbb{R}^p$, $k \leftarrow 0$. At iteration $k$:
1. Compute the gradient $\nabla F(x^k)$
2. Compute the direction $d^k \leftarrow \arg\max_d \{\nabla F(x^k)^T d : \|d\|_2 \le 1\}$
3. Choose the step-size $\alpha_k$
4. Set $x^{k+1} \leftarrow x^k - \alpha_k d^k$
Gradient Descent ≡ ℓ2-Steepest Descent, cont.

$$d^k \in \arg\max_{\|d\|_2 \le 1} \{\nabla F(x^k)^T d\} = \frac{\nabla F(x^k)}{\|\nabla F(x^k)\|_2}$$

(Figure: the direction $d^k$ is the normalized gradient $\nabla F(x^k)$ on the $\ell_2$ unit ball.)
Computational Guarantees for the Steepest Descent family

$$F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Assume $F(\cdot)$ is convex and $\nabla F(\cdot)$ is Lipschitz with parameter $L_F$:

$$\|\nabla F(x) - \nabla F(y)\|_* \le L_F \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^p,$$

where $\|\cdot\|_*$ is the usual dual norm.

Two sets of interest:
- $S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$
- $S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions
Metrics for Evaluating the Steepest Descent family, cont.

- $S_0 := \{x \in \mathbb{R}^p : F(x) \le F(x^0)\}$ is the level set of the initial point $x^0$
- $S^* := \{x \in \mathbb{R}^p : F(x) = F^*\}$ is the set of optimal solutions

$$\mathrm{Dist}_0 := \max_{x \in S_0} \, \min_{x^* \in S^*} \|x - x^*\|$$

(Figure: the level set $S_0$, the optimal set $S^*$, and the distance $\mathrm{Dist}_0$ between them.)

(In high-dimensional machine learning problems, $S^*$ can be very big.)
Computational Guarantees for the Steepest Descent family

$$\mathrm{Dist}_0 := \max_{x \in S_0} \, \min_{x^* \in S^*} \|x - x^*\|$$

Theorem: Objective Function Value Convergence (essentially [Beck and Tetruashvili 2014], [Nesterov 2003])
If the step-sizes are chosen using the rule

$$\alpha_k = \frac{\|\nabla F(x^k)\|_*}{L_F} \quad \text{for all } k \ge 0,$$

then for each $k \ge 0$ the following inequality holds:

$$F(x^k) - F^* \le \frac{2 L_F (\mathrm{Dist}_0)^2}{\hat{K}_0 + k}, \quad \text{where} \quad \hat{K}_0 := \frac{2 L_F (\mathrm{Dist}_0)^2}{F(x^0) - F^*}.$$
- GCD performs variable selection
- GCD imparts implicit regularization
- Just one tuning parameter (the number of iterations)
Implicit Regularization and Variable Selection Properties

Artificial example: n = 1000, p = 100, true model has 5 non-zeros.
Compare with explicit regularization schemes ($\ell_1$, $\ell_2$, etc.)
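A hypothetical re-creation of an experiment in this spirit: everything beyond n = 1000, p = 100, and the 5 true non-zeros (the data generation, the noise level, and the fixed step-size) is an assumption for illustration. The point is that the support of $\beta^k$ grows with the iteration count, the method's single tuning parameter.

```python
import numpy as np

def gcd_logistic(X, y, num_iters, alpha=0.1):
    """Greedy coordinate descent (= l1-SDGN) on the logistic loss,
    with a fixed step-size chosen for simplicity (an assumption; the
    talk's step-size rule appears on a later slide)."""
    n, p = X.shape
    beta = np.zeros(p)
    support_sizes = []
    for _ in range(num_iters):
        g = X.T @ (-y / (1.0 + np.exp(y * (X @ beta)))) / n
        j = np.argmax(np.abs(g))                 # best coordinate
        beta[j] -= alpha * np.sign(g[j])         # signed unit-coordinate step
        support_sizes.append(np.count_nonzero(beta))
    return beta, support_sizes

# Synthetic data: n = 1000, p = 100, true model with 5 non-zeros.
rng = np.random.default_rng(0)
n, p = 1000, 100
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = np.sign(X @ beta_true + 0.5 * rng.standard_normal(n))
beta, supports = gcd_logistic(X, y, num_iters=200)
print(supports[::50])  # support size grows slowly with the iteration count
```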
How Can SDGN Inform Logistic Regression?

Some questions:
- How do the computational guarantees for the Steepest Descent family specialize to the case of Logistic Regression?
- What role does problem structure/conditioning play in these guarantees?
- Can we say anything further about the convergence properties of the Steepest Descent family in the special case of Logistic Regression?
Elementary Properties of the Logistic Loss Function

$$L_n^* := \min_{\beta} \ L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

Logistic regression "ideally" seeks $\beta$ for which $y_i x_i^T \beta > 0$ for all $i$:
- $y_i > 0 \Rightarrow x_i^T \beta > 0$
- $y_i < 0 \Rightarrow x_i^T \beta < 0$

(Figure: the logistic loss $\ln(1 + e^{-t})$ as a function of $t = y_i \beta^T x_i$.)
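In code, this loss and its gradient are a few lines; a minimal sketch assuming numpy and labels $y_i \in \{-1, +1\}$ (the later algorithm sketches reuse this gradient computation):

```python
import numpy as np

def logistic_loss_and_grad(beta, X, y):
    """L_n(beta) = (1/n) sum_i ln(1 + exp(-y_i x_i^T beta)) and its
    gradient; X is n-by-p, y is in {-1, +1}^n."""
    n = X.shape[0]
    margins = y * (X @ beta)                   # m_i = y_i x_i^T beta
    loss = np.logaddexp(0.0, -margins).mean()  # stable ln(1 + exp(-m_i))
    weights = -y / (1.0 + np.exp(margins))     # = -y_i * sigma(-m_i)
    grad = X.T @ weights / n
    return loss, grad
```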
Geometry of the Data: Non-Separable and Separable Data

(Figure, four panels: (a) Very Non-Separable Data, (b) Very Separable Data, (c) Mildly Non-Separable Data, (d) Mildly Separable Data.)
Separable and Non-Separable Data

Separable Data: the data is separable if there exists $\bar{\beta}$ for which

$$y_i \cdot \bar{\beta}^T x_i > 0 \quad \text{for all } i = 1, \ldots, n.$$

Non-Separable Data: the data is non-separable if it is not separable, namely, every $\beta$ satisfies

$$y_i \cdot \beta^T x_i \le 0 \quad \text{for some } i \in \{1, \ldots, n\}.$$
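By rescaling $\beta$, the strict inequalities in the definition can be replaced by $y_i \beta^T x_i \ge 1$, so checking separability is an LP feasibility problem. A minimal sketch using scipy's linprog (the LP formulation is standard; the code itself is our illustration, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """The data is separable iff some beta satisfies y_i x_i^T beta >= 1
    for all i (rescaling turns '> 0' into '>= 1')."""
    n, p = X.shape
    A_ub = -(y[:, None] * X)       # -y_i x_i^T beta <= -1 for all i
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * p, method="highs")
    return res.status == 0         # feasible <=> separable
```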
Separable Data

$$L_n^* := \min_{\beta} \ L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

The data is separable if there exists $\bar{\beta}$ for which $y_i \cdot \bar{\beta}^T x_i > 0$ for all $i = 1, \ldots, n$.

If $\bar{\beta}$ separates the data, then $L_n(\theta \bar{\beta}) \to 0 \ (= L_n^*)$ as $\theta \to +\infty$.

Perhaps trying to optimize the logistic loss function is unlikely to be effective at finding a "good" linear classifier....
Separable and Non-Separable Data

(Figure, two panels: (a) Separable, (b) Non-Separable.)
Results in the Non-Separable Case
Non-Separable Data and Problem Behavior/Conditioning

Let us quantify the degree of non-separability of the data.

(Figure, two panels: (a) Very non-separable data, (b) Mildly non-separable data.)

We will relate this to problem behavior/conditioning....
Non-Separability Condition Number DegNSEP*

Definition of the Non-Separability Condition Number DegNSEP*:

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\| = 1,$$

where $[t]^- := \max\{-t, 0\}$ denotes the negative part.

- DegNSEP* is the least average misclassification error (over all normalized classifiers)
- DegNSEP* > 0 if and only if the data is strictly non-separable
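Computing DegNSEP* exactly is non-trivial because of the norm-equality constraint. Purely as an illustration of the quantity (this estimator is our assumption, not a method from the talk), the sketch below samples random unit $\ell_2$-norm classifiers and returns the smallest average negative-part margin found, which upper-bounds DegNSEP*:

```python
import numpy as np

def degnsep_upper_estimate(X, y, num_samples=5000, seed=0):
    """Monte Carlo upper-bound estimate of DegNSEP* in the l2 norm:
    min over sampled unit-norm beta of (1/n) sum_i [y_i beta^T x_i]^-."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((num_samples, X.shape[1]))
    B /= np.linalg.norm(B, axis=1, keepdims=True)   # ||beta||_2 = 1
    margins = y[None, :] * (B @ X.T)                # y_i beta^T x_i
    avg_neg_part = np.maximum(-margins, 0.0).mean(axis=1)
    return avg_neg_part.min()
```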
Non-Separability Measure DegNSEP*

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\| = 1$$

(Figure, two panels: (a) DegNSEP* is large, (b) DegNSEP* is small.)
DegNSEP* and Problem Behavior/Conditioning

$$L_n^* := \min_{\beta} \ L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\| = 1$$

Theorem: Non-Separability and Sizes of Optimal Solutions
Suppose that the data is non-separable and DegNSEP* > 0. Then:
1. the logistic regression problem LR attains its optimum,
2. for every optimal solution $\beta^*$ of LR it holds that $\|\beta^*\| \le \dfrac{L_n^*}{\mathrm{DegNSEP}^*} \le \dfrac{\ln(2)}{\mathrm{DegNSEP}^*}$, and
3. for any $\beta$ it holds that $\|\beta\| \le \dfrac{L_n(\beta)}{\mathrm{DegNSEP}^*}$.
Computational Guarantees for the Steepest Descent family: Non-Separable Case

Theorem: Consider SDGN applied to the Logistic Regression problem with step-sizes

$$\alpha_k := \frac{4n \|\nabla L_n(\beta^k)\|_*}{\|X\|_{\cdot,2}^2} \quad \text{for all } k \ge 0,$$

and suppose that the data is non-separable. Then for each $k \ge 0$ it holds that:

(i) (training error): $L_n(\beta^k) - L_n^* \le \dfrac{2(\ln 2)^2 \|X\|_{\cdot,2}^2}{k \cdot n \cdot (\mathrm{DegNSEP}^*)^2}$

(ii) (gradient norm): $\displaystyle \min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_* \le \|X\|_{\cdot,2} \sqrt{\frac{\ln 2 - L_n^*}{2n(k+1)}}$

(iii) (regularization): $\|\beta^k\| \le \dfrac{\sqrt{k}}{\|X\|_{\cdot,2}} \sqrt{8n(\ln 2 - L_n^*)}$
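A sketch of the $\ell_2$ instance of this scheme. Here we read $\|X\|_{\cdot,2}$ as the operator norm from the given norm to $\ell_2$, so in the $\ell_2$ case it is the spectral norm $\sigma_{\max}(X)$; this reading of the notation is our assumption, and under it the step $\alpha_k d^k$ reduces to gradient descent with step-size $1/L_F$, where $L_F = \|X\|_{\cdot,2}^2/(4n)$ bounds the Lipschitz constant of $\nabla L_n$.

```python
import numpy as np

def sdgn_logistic_l2(X, y, num_iters=1000):
    """l2-SDGN on the logistic loss with the talk's step-size
    alpha_k = 4n ||grad||_* / ||X||^2 (spectral norm assumed)."""
    n, p = X.shape
    X_norm_sq = np.linalg.norm(X, 2) ** 2              # sigma_max(X)^2
    beta = np.zeros(p)
    for _ in range(num_iters):
        margins = y * (X @ beta)
        g = X.T @ (-y / (1.0 + np.exp(margins))) / n   # gradient of L_n
        g_norm = np.linalg.norm(g)                     # dual norm of l2 = l2
        if g_norm == 0:
            break
        alpha = 4 * n * g_norm / X_norm_sq
        beta = beta - alpha * (g / g_norm)             # = beta - 4n*g/||X||^2
    return beta
```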
Reaching Linear Convergence

Reaching Linear Convergence using Steepest Descent in a Given Norm for Logistic Regression

For logistic regression, does SDGN exhibit linear convergence?
Some Definitions/Notation

Definitions:
- $R := \max_{i \in \{1,\ldots,n\}} \|x_i\|_2$ (maximum $\ell_2$ norm of the feature vectors)
- $H(\beta^*)$ denotes the Hessian of $L_n(\cdot)$ at an optimal solution $\beta^*$
- $\lambda_{\mathrm{pmin}}(H(\beta^*))$ denotes the smallest non-zero (and hence positive) eigenvalue of $H(\beta^*)$
- $\mathrm{NormRatio} := \max_{\beta \ne 0} \|\beta\| / \|\beta\|_2$
Reaching Linear Convergence of the Steepest Descent family for Logistic Regression

Theorem: Consider SDGN applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n \|\nabla L_n(\beta^k)\|_*}{\|X\|_{\cdot,2}^2}$ for all $k \ge 0$, and suppose that the data is non-separable. Define:

$$\check{k} := \frac{16 (\ln 2)^2 \|X\|_{\cdot,2}^4 \, R^2 \, (\mathrm{NormRatio})^2}{9 n^2 (\mathrm{DegNSEP}^*)^2 \, \lambda_{\mathrm{pmin}}(H(\beta^*))^2}.$$

Then for all $k \ge \check{k}$ it holds that:

$$L_n(\beta^k) - L_n^* \le \left(L_n(\beta^{\check{k}}) - L_n^*\right) \left(1 - \frac{\lambda_{\mathrm{pmin}}(H(\beta^*)) \, n}{\|X\|_{\cdot,2}^2 \, (\mathrm{NormRatio})^2}\right)^{k - \check{k}}.$$
Reaching Linear Convergence of the Steepest Descent family for Logistic Regression, cont.

Some comments:
- The proof relies on (a slight generalization of) the "generalized self-concordance" property of the logistic loss function due to [Bach 2014].
- Furthermore, we can bound:

$$\lambda_{\mathrm{pmin}}(H(\beta^*)) \ge \frac{1}{4n} \lambda_{\mathrm{pmin}}(X^T X) \exp\left(-\frac{\ln(2)\|X\|_{\cdot,\infty}}{\mathrm{DegNSEP}^*}\right)$$

- As compared to results of a similar flavor for other algorithms, here we have an exact characterization of when the linear convergence "kicks in" and also of what the rate of linear convergence is guaranteed to be.
- Q: Can we exploit this generalized self-concordance property in other ways? (still ongoing...)
DegNSEP* and "Perturbation to Separability"

$$\mathrm{DegNSEP}^* := \min_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n [y_i \beta^T x_i]^- \quad \text{s.t.} \quad \|\beta\| = 1$$

Theorem: DegNSEP* is the "Perturbation to Separability"

$$\mathrm{DegNSEP}^* = \inf_{\Delta x_1, \ldots, \Delta x_n} \ \frac{1}{n} \sum_{i=1}^n \|\Delta x_i\|_* \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i), \ i = 1, \ldots, n \ \text{are separable}$$
Illustration of Perturbation to Separability
Results for Some other Methods
Standard Accelerated Gradient Method (AGM)

$$\mathrm{P:} \quad F^* := \min_{x \in \mathbb{R}^p} F(x)$$

Lipschitz gradient: $\|\nabla f(y) - \nabla f(x)\|_2 \le L \|y - x\|_2$ for all $x, y \in \mathbb{R}^p$.

Accelerated Gradient Method (AGM):
Given $x^0 \in \mathbb{R}^p$ and $z^0 := x^0$, and $i \leftarrow 0$. Define step-size parameters $\theta_i \in (0,1]$ recursively by $\theta_0 := 1$ and $\theta_{i+1}$ satisfying $\frac{1}{\theta_{i+1}^2} - \frac{1}{\theta_{i+1}} = \frac{1}{\theta_i^2}$.

At iteration $k$, update:
$$y^k \leftarrow (1 - \theta_k) x^k + \theta_k z^k$$
$$x^{k+1} \leftarrow y^k - \tfrac{1}{L} \nabla f(y^k)$$
$$z^{k+1} \leftarrow z^k + \tfrac{1}{\theta_k}(x^{k+1} - y^k)$$
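A minimal sketch of this scheme, with the $\theta$-recursion solved in closed form each iteration (numpy assumed; solving $1/t^2 - 1/t = 1/\theta^2$ for $t$ gives the update below):

```python
import numpy as np

def agm(grad_f, L, x0, num_iters):
    """Standard Accelerated Gradient Method: grad_f is the gradient
    oracle and L the Lipschitz constant of grad_f."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    theta = 1.0                                   # theta_0 = 1
    for _ in range(num_iters):
        y = (1 - theta) * x + theta * z
        x_next = y - grad_f(y) / L
        z = z + (x_next - y) / theta
        x = x_next
        # theta_{i+1} solves 1/t^2 - 1/t = 1/theta_i^2; the root in (0, 1]
        # is t = 2 / (1 + sqrt(1 + 4/theta_i^2)):
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta**2))
    return x
```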
Computational Guarantees for the Accelerated Gradient Method (AGM) for Logistic Regression

Theorem: Consider the AGM applied to the Logistic Regression problem initiated at $\beta^0 := 0$, and suppose that the data is non-separable. Then for each $k \ge 0$ it holds that:

(training error): $L_n(\beta^k) - L_n^* \le \dfrac{2(\ln 2)^2 \|X\|_{2,2}^2}{n \cdot (k+1)^2 \cdot (\mathrm{DegNSEP}^*)^2}$
AGM with Simple Re-Starting (AGM-SRS)

Assume that $0 < F^* := \min_x F(x)$.

Accelerated Gradient Method with Simple Re-Starting (AGM-SRS):
Initialize with $x^0 \in \mathbb{R}^p$. Set $x^{1,0} \leftarrow x^0$, $i \leftarrow 1$.
At outer iteration $i$:
1. Initialize inner iteration: $j \leftarrow 0$.
2. Run inner iterations. At inner iteration $j$:
   If $\dfrac{F(x^{i,j})}{F(x^{i,0})} \ge 0.8$, then: $x^{i,j+1} \leftarrow \mathrm{AGM}(F(\cdot), x^{i,0}, j+1)$, $j \leftarrow j+1$, and go to step 2.
   Else $x^{i+1,0} \leftarrow x^{i,j}$, $i \leftarrow i+1$, and go to step 1.

Here "$x^{i,j} \leftarrow \mathrm{AGM}(F(\cdot), x^{i,0}, j)$" denotes assigning to $x^{i,j}$ the $j$-th iterate of AGM applied to the objective function $F(\cdot)$ from the initial point $x^{i,0} \in \mathbb{R}^p$.
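A sketch that follows the slide's notation literally, recomputing the $j$-th AGM iterate from the restart point at each inner step (in practice one would simply continue the inner AGM run); `agm` is the sketch from the previous slide, and the outer-iteration budget is an illustrative parameter:

```python
import numpy as np

def agm_srs(f, grad_f, L, x0, num_outer=10):
    """AGM with Simple Re-Starting: restart AGM from the current point
    whenever the objective falls below 0.8 times its value at the last
    restart point. Assumes 0 < F^*, as on the slide."""
    x_start = np.asarray(x0, dtype=float)
    for _ in range(num_outer):
        f_start = f(x_start)
        j = 1
        while True:
            x = agm(grad_f, L, x_start, j)   # j-th AGM iterate from x_start
            if f(x) / f_start < 0.8:         # enough decrease: restart here
                break
            j += 1
        x_start = x
    return x_start
```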
Computational Guarantee for AGM with Simple Re-Starting for Logistic Regression

Theorem: Consider AGM with Simple Re-Starting applied to the Logistic Regression problem initiated at $\beta^0 := 0$, and suppose that the data is non-separable. Within a total number of computed iterates $k$ that does not exceed

$$\frac{5.8 \|X\|_{2,2}}{\sqrt{n} \cdot \mathrm{DegNSEP}^*} + \frac{8.4 \|X\|_{2,2} \cdot \sqrt{L_n^*}}{\sqrt{n} \cdot \mathrm{DegNSEP}^* \cdot \sqrt{\varepsilon}},$$

the algorithm will deliver an iterate $\beta^k$ for which $L_n(\beta^k) - L_n^* \le \varepsilon$.
Results in the Separable Case
Separable Data and Problem Behavior/Conditioning

Let us quantify the degree of separability of the data.

(Figure, two panels: (a) Very separable data, (b) Barely separable data.)

We will relate this to problem behavior/conditioning....
Separability Condition Number DegSEP*

Definition of the Separability Condition Number DegSEP*:

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \ \min_{i \in \{1,\ldots,n\}} \ [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\| \le 1$$

- DegSEP* maximizes the minimal classification value $[y_i \beta^T x_i]$ (over all normalized classifiers)
- DegSEP* is simply the "maximum margin" in machine learning parlance
- DegSEP* > 0 if and only if the data is separable
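For a polyhedral norm, the max-margin problem is an LP. A sketch for the $\ell_\infty$-norm normalization $\|\beta\|_\infty \le 1$ using scipy (the norm choice is ours for illustration; with the $\ell_2$ norm this becomes a convex quadratically constrained problem instead):

```python
import numpy as np
from scipy.optimize import linprog

def degsep_linf(X, y):
    """DegSEP* under ||beta||_inf <= 1, as the LP:
    maximize t  s.t.  y_i x_i^T beta >= t,  -1 <= beta <= 1."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), [-1.0]])              # minimize -t
    A_ub = np.hstack([-(y[:, None] * X), np.ones((n, 1))])  # t - y_i x_i^T beta <= 0
    b_ub = np.zeros(n)
    bounds = [(-1.0, 1.0)] * p + [(None, None)]             # beta box, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return -res.fun    # > 0 iff the data is separable
```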
Separability Measure DegSEP*

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \ \min_{i \in \{1,\ldots,n\}} \ [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\| \le 1$$

(Figure, two panels: (a) DegSEP* is large, (b) DegSEP* is small.)
DegSEP* and Problem Behavior/Conditioning

$$L_n^* := \min_{\beta} \ L_n(\beta) := \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i \beta^T x_i)\right)$$

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \ \min_{i \in \{1,\ldots,n\}} \ [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\| \le 1$$

Theorem: Separability and Non-Attainment
Suppose that the data is separable. Then DegSEP* > 0, $L_n^* = 0$, and LR does not attain its optimum.

Despite this, it turns out that the Steepest Descent family is reasonably effective at finding an approximate margin maximizer, as we shall shortly see....
Margin function ρ(β)

$$\rho(\beta) := \min_{i \in \{1,\ldots,n\}} \ [y_i \beta^T x_i]$$
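In code, the margin of a classifier and of a normalized iterate are one line each (a sketch assuming numpy; the $\ell_2$ normalization is our illustrative choice of the norm):

```python
import numpy as np

def margin(beta, X, y):
    """rho(beta) = min_i y_i * beta^T x_i."""
    return np.min(y * (X @ beta))

def normalized_margin(beta, X, y):
    """Margin rho of the normalized classifier beta / ||beta||_2."""
    nrm = np.linalg.norm(beta)
    return margin(beta / nrm, X, y) if nrm > 0 else 0.0
```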
Computational Guarantees for the Steepest Descent family: Separable Case

Theorem: Consider SDGN applied to the Logistic Regression problem with step-sizes $\alpha_k := \frac{4n \|\nabla L_n(\beta^k)\|_*}{\|X\|_{\cdot,2}^2}$ for all $k \ge 0$, and suppose that the data is separable. Then:

(i) (margin bound): there exists $i \le \dfrac{3.7 \, n \|X\|_{\cdot,2}^2}{(\mathrm{DegSEP}^*)^2}$ for which the normalized iterate $\bar{\beta}^i := \beta^i / \|\beta^i\|$ satisfies $\rho(\bar{\beta}^i) \ge \dfrac{0.18 \cdot \mathrm{DegSEP}^*}{n}$.

(ii) (shrinkage): $\|\beta^k\| \le \dfrac{\sqrt{k}}{\|X\|_{\cdot,2}} \sqrt{8n \ln 2}$

(iii) (gradient norm): $\displaystyle \min_{i \in \{0,\ldots,k\}} \|\nabla L_n(\beta^i)\|_* \le \|X\|_{\cdot,2} \sqrt{\frac{\ln 2}{2n(k+1)}}$
DegSEP* and "Perturbation to Non-Separability"

$$\mathrm{DegSEP}^* := \max_{\beta \in \mathbb{R}^p} \ \min_{i \in \{1,\ldots,n\}} \ [y_i \beta^T x_i] \quad \text{s.t.} \quad \|\beta\| \le 1$$

Theorem: DegSEP* is the "Perturbation to Non-Separability"

$$\mathrm{DegSEP}^* = \inf_{\Delta x_1, \ldots, \Delta x_n} \ \max_{i \in \{1,\ldots,n\}} \|\Delta x_i\|_* \quad \text{s.t.} \quad (x_i + \Delta x_i, y_i), \ i = 1, \ldots, n \ \text{are non-separable}$$
Illustration of Perturbation to Non-Separability
Other Issues

Some other topics not mentioned (still ongoing):
- Other first-order methods for logistic regression (accelerated gradient descent, randomized methods, etc.)
- The high-dimensional regime $p > n$: define $\mathrm{DegNSEP}^*_k$ and $\mathrm{DegSEP}^*_k$ by restricting $\beta$ to satisfy $\|\beta\|_0 \le k$
- Numerical experiments comparing methods
- Other...
Summary

- Some old and new results for Steepest Descent in a Given Norm (SDGN)
- Analyzing SDGN for Logistic Regression: the separable and non-separable cases
- Non-Separable case:
  - the condition number DegNSEP*
  - computational guarantees for SDGN, including reaching linear convergence
- Separable case:
  - the condition number DegSEP*
  - computational guarantees for SDGN, including computing an approximate maximum margin classifier