K-HYPERPLANE HINGE-MINIMAX CLASSIFIER
Margarita Osadchy, Tamir Hazan, Daniel Keren
University of Haifa
Goal
• Non-linear binary classifier
• Imbalanced data sets
• Fast
• Scalable
Natural Applications
• Object Detection
• Fraud Detection
Relatively small number of samples
SVM: the zero-one loss is upper bounded by the hinge loss.
$\frac{1}{n}\sum_{i=1}^{n} \max\big(0,\, 1 - y_i w^T x_i\big) + \lambda \|w\|^2$
[Vapnik 2000; Zhang 2002; Bartlett & Mendelson 2003; Bousquet et al. 2004; Kakade et al. 2009]
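As a concrete reference, here is a minimal NumPy sketch of this regularized hinge objective (the names `X`, `y`, `w`, and the regularization weight `lam` are illustrative, not from the slides):

```python
import numpy as np

def svm_hinge_objective(w, X, y, lam=1.0):
    """Empirical regularized hinge loss: mean of max(0, 1 - y_i w^T x_i) plus lam * ||w||^2."""
    margins = y * (X @ w)                    # y_i * w^T x_i for every sample
    hinge = np.maximum(0.0, 1.0 - margins)   # per-sample hinge loss
    return hinge.mean() + lam * np.dot(w, w)
```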
(Infinitely) many training samples
Minimax: the zero-one risk is upper bounded by the worst-case risk among all data distributions $Z(\mu, \Sigma)$ with given mean and covariance:
$\sup_{z \sim Z(\mu,\Sigma)} \Pr(w^T z \le 0) \;=\; \frac{1}{1 + \frac{(w^T\mu)^2}{w^T\Sigma w}}$
[Lanckriet et al., 2003; Honorio & Jaakkola, 2014]
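For a single hyperplane, the worst-case probability above has a closed form; a small sketch, assuming $w^T\mu > 0$ (function and variable names are illustrative):

```python
import numpy as np

def minimax_risk_single(w, mu, Sigma):
    """Worst-case Pr(w^T z <= 0) over all distributions with mean mu and
    covariance Sigma (Chebyshev-type bound of Marshall & Olkin / Lanckriet et al.).
    Assumes w^T mu > 0."""
    d2 = (w @ mu) ** 2 / (w @ Sigma @ w)
    return 1.0 / (1.0 + d2)
```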
Imbalanced problems
• Positive set: small number of samples → Hinge risk
• Negative set: very large number of samples → Minimax risk
⇒ Hinge-Minimax Classifier
Hinge-Minimax Linear Classifier
$(x, y) \sim D$, $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$, $z \triangleq yx$
Classifier: $\hat{y}_w = \mathrm{sign}(w^T x)$
Zero-one risk: $L_D^{0/1}(w) = E_D\big[\mathbf{1}[\hat{y}_w(x) \ne y]\big]$
Hinge risk bound: $L_D^{0/1}(w) \le E_D\big[\max\{0,\, 1 - y\,w^T x\}\big] \triangleq L_D^{H}(w)$
Minimax risk bound: $L_D^{0/1}(w) \le \sup_{z \sim Z(\mu_z, \Sigma_z)} \Pr(w^T z \le 0) \triangleq L^{M}_{\mu_z,\Sigma_z}(w)$,
where $\mu_z = E_{(x,y)\sim D}[z]$ and $\Sigma_z = E_{(x,y)\sim D}\big[(z - \mu_z)(z - \mu_z)^T\big]$.
Hinge risk bound for the positive class $D^{+}$:
$L_D^{H,+1}(w) = E_{D^{+}}\big[\max\{0,\, 1 - w^T x\}\big]$
Minimax risk bound for the negative class $D^{-}$:
$L^{M,-1}_{\mu^{-},\Sigma^{-}}(w) = \sup_{x \sim Z(\mu^{-},\Sigma^{-})} \Pr(w^T x \ge 0) = \frac{1}{1 + \frac{(w^T\mu^{-})^2}{w^T\Sigma^{-} w}}$,
where $\mu^{-} = E_{(x,y)\sim D^{-}}[x]$ and $\Sigma^{-} = E_{(x,y)\sim D^{-}}\big[(x - \mu^{-})(x - \mu^{-})^T\big]$.
Hinge-minimax risk: $L_D^{MH}(w) \triangleq L_D^{H,+1}(w) + L^{M,-1}_{\mu^{-},\Sigma^{-}}(w)$
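A sketch of the empirical counterpart of $L_D^{MH}(w)$: the hinge risk estimated on positive samples plus the minimax term computed from the empirical negative mean and covariance (array names are illustrative; the closed form for the negative term assumes $w^T\hat{\mu}^{-} < 0$):

```python
import numpy as np

def hinge_minimax_risk(w, X_pos, X_neg):
    """Empirical hinge-minimax risk: hinge on the positive class plus the
    worst-case probability that a negative lands on the positive side."""
    hinge_pos = np.maximum(0.0, 1.0 - X_pos @ w).mean()   # L^{H,+1}
    mu = X_neg.mean(axis=0)                                # empirical mu^-
    Sigma = np.cov(X_neg, rowvar=False)                    # empirical Sigma^-
    if w @ mu < 0:                                         # negative mean on the correct side
        d2 = (w @ mu) ** 2 / (w @ Sigma @ w)
        minimax_neg = 1.0 / (1.0 + d2)                     # sup Pr(w^T x >= 0)
    else:
        minimax_neg = 1.0                                  # worst case is trivial otherwise
    return hinge_pos + minimax_neg
```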
ECCV 2012
• Applied the linear classifier to vision problems.
• Generalized to a kernel classifier with a fixed number of support vectors.
• Faster than SVM.
• But computing non-linear support vectors is still expensive.
Intersection of K Hyperplanes
$f_W(x) = +1$ if $W^T x > 0$ (i.e. $x$ is on the positive side of all $K$ hyperplanes), $-1$ otherwise
[Klivans & Servedio 2004; Arriaga & Vempala 1999]
Computationally costly for a large number of negative examples.
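A minimal sketch of this intersection classifier (the columns of `W` are the K hyperplanes; names are illustrative):

```python
import numpy as np

def predict_intersection(W, X):
    """f_W(x) = +1 iff x is on the positive side of all K hyperplanes, else -1."""
    scores = X @ W                        # shape (n_samples, K)
    inside = np.all(scores > 0, axis=1)   # W^T x > 0 componentwise
    return np.where(inside, 1, -1)
```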
Intersection of K Hyperplanes
$f_W(x) = +1$ if $W^T x > 0$, $-1$ otherwise
$L_D^{0/1}(W) \;\le\; L_D^{H,+1}(W) + L^{M,-1}_{\mu^{-},\Sigma^{-}}(W) \;\triangleq\; L_D^{MH}(W)$
$L_D^{H,+1}(W) = E_{D^{+}}\Big[\sum_{i=1}^{K} \max\{0,\, 1 - w_i^T x\}\Big]$
$L^{M,-1}_{\mu^{-},\Sigma^{-}}(W) = \sup_{x \sim Z(\mu^{-},\Sigma^{-})} \Pr(W^T x \ge 0)$
How can $L^{M,-1}_{\mu^{-},\Sigma^{-}}(W) = \sup_{x \sim Z(\mu^{-},\Sigma^{-})} \Pr(W^T x \ge 0)$ be computed?
$\{x : W^T x \ge 0\}$ is a convex set.
Let $Z(\mu, \Sigma)$ be all distributions with known mean $\mu$ and covariance $\Sigma$. For $K$ fixed hyperplanes $w_i$ ($i = 1, \dots, K$) (Marshall & Olkin, 1960):
$\sup_{x \sim Z(\mu,\Sigma)} \Pr(W^T x \ge 0) = \frac{1}{1 + d^2}, \qquad d^2 = \inf_{W^T x \ge 0} (x - \mu)^T \Sigma^{-1} (x - \mu).$
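Here $d^2$ is the squared Mahalanobis distance from $\mu$ to the acceptance region $\{x : W^T x \ge 0\}$. A sketch that solves this constrained problem numerically with SciPy (illustrative only, not the authors' implementation; any QP solver would do):

```python
import numpy as np
from scipy.optimize import minimize

def squared_mahalanobis_to_region(W, mu, Sigma):
    """d^2 = inf_{W^T x >= 0} (x - mu)^T Sigma^{-1} (x - mu)."""
    Sigma_inv = np.linalg.inv(Sigma)
    objective = lambda x: (x - mu) @ Sigma_inv @ (x - mu)
    # one inequality constraint w_i^T x >= 0 per hyperplane (column of W)
    constraints = [{"type": "ineq", "fun": lambda x, w=W[:, i]: w @ x}
                   for i in range(W.shape[1])]
    res = minimize(objective, x0=np.zeros_like(mu), constraints=constraints,
                   method="SLSQP")
    return res.fun, res.x   # d^2 and the minimizer x*
```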
Example
[Figure: three hyperplanes $w_1, w_2, w_3$ and the minimizer $x^{*} = \arg\min_{W^T x \ge 0} (x - \mu)^T \Sigma^{-1} (x - \mu)$.]
$\tilde{W}$ is the matrix whose columns are the hyperplanes satisfying $w_i^T x^{*} = 0$; here $\tilde{W} = [w_1 \; w_2]$.
How can $L^{M,-1}_{\mu^{-},\Sigma^{-}}(W) = \sup_{x \sim Z(\mu^{-},\Sigma^{-})} \Pr(W^T x \ge 0)$ be computed?
Let $Z(\mu, \Sigma)$ be all distributions with known mean $\mu$ and covariance $\Sigma$. For $K$ fixed hyperplanes $w_i$ ($i = 1, \dots, K$):
$\sup_{x \sim Z(\mu,\Sigma)} \Pr(W^T x \ge 0) = \frac{1}{1 + d^2}, \qquad d^2 = \inf_{W^T x \ge 0} (x - \mu)^T \Sigma^{-1} (x - \mu).$
Let $x^{*} = \arg\min_{W^T x \ge 0} (x - \mu)^T \Sigma^{-1} (x - \mu)$ and let $\tilde{W}$ be the matrix whose columns satisfy $w_i^T x^{*} = 0$. We showed that
$d^2 = \mu^T \tilde{W} \big(\tilde{W}^T \Sigma \tilde{W}\big)^{-1} \tilde{W}^T \mu$
(for $K = 1$ this reduces to $d^2 = \frac{(w^T\mu)^2}{w^T\Sigma w}$).
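Once the active hyperplanes $\tilde{W}$ (those with $w_i^T x^{*} = 0$) are known, the closed form above is a single linear solve; a sketch, assuming `W_tilde` has already been extracted from the minimizer $x^{*}$:

```python
import numpy as np

def d2_closed_form(W_tilde, mu, Sigma):
    """d^2 = mu^T W~ (W~^T Sigma W~)^{-1} W~^T mu over the active hyperplanes."""
    A = W_tilde.T @ Sigma @ W_tilde   # small Gram-like matrix over active hyperplanes
    v = W_tilde.T @ mu
    return v @ np.linalg.solve(A, v)
```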
Expected Risk for y = -1
$L^{M,-1}_{\mu^{-},\Sigma^{-}}(W) = \sup_{x \sim Z(\mu^{-},\Sigma^{-})} \Pr(W^T x \ge 0) = \frac{1}{1 + (\mu^{-})^T \tilde{W} \big(\tilde{W}^T \Sigma^{-} \tilde{W}\big)^{-1} \tilde{W}^T \mu^{-}}$
Uniform Generalization Bound
We showed:
$L_D^{MH}(W) \;\le\; L_S^{H,+1}(W) + L_S^{M,-1}(W) + O\!\left(K\sqrt{\frac{\log(1/\delta)}{m}}\right)$
where $L_S^{H,+1}(W)$ and $L_S^{M,-1}(W)$ are the empirical estimates of $L_D^{H,+1}(W)$ and $L^{M,-1}_{\mu^{-},\Sigma^{-}}(W)$, $\delta$ is the confidence parameter, and $m$ is the training set size.
Proof Sketch
1. Show $L_D^{H,+1}(W) \le L_S^{H,+1}(W) + O\!\left(K\sqrt{\frac{\log(1/\delta)}{m}}\right)$ (extension of the Rademacher complexity to $W \in \mathbb{R}^{K \times d}$).
2. Show $L^{M,-1}_{\mu^{-},\Sigma^{-}}(W) \le L_S^{M,-1}(W) + O\!\left(K\sqrt{\frac{\log(1/\delta)}{m}}\right)$ (shown for $K = 1$; the same steps follow for $K > 1$).
For K = 1
Compare the minimax risk computed with the true $(\mu, \Sigma)$ and with the empirical $(\hat{\mu}, \hat{\Sigma})$:
$\left|\frac{1}{1 + \frac{(w^T\mu)^2}{w^T\Sigma w}} - \frac{1}{1 + \frac{(w^T\hat{\mu})^2}{w^T\hat{\Sigma} w}}\right|$
Writing both terms over a common denominator, the numerator involves $w^T(\hat{\Sigma} - \Sigma)w$, $(w^T\mu)^2$, and $(w^T\hat{\mu})^2$. We show that this difference is bounded in terms of $\|\hat{\mu} - \mu\|$ and $\|\hat{\Sigma} - \Sigma\|$, where $\alpha$ bounds the minimal eigenvalue of $\Sigma$.
For K = 1 (cont.)
Using the Bernstein inequality for vectors [Gross, 2011; Candes & Plan, 2011], for $\|x\| \le 1$, the deviations $\|\hat{\mu} - \mu\|$ and $\|\hat{\Sigma} - \Sigma\|$ are bounded with probability $1 - \delta$ by terms of order
$\sqrt{\frac{32\log(1/\delta) + 1/4}{\hat{m}}} \qquad \text{and} \qquad \sqrt{\frac{32(\log(1/\delta) + 1/4)}{\hat{m}}},$
where $\hat{m}$ is the number of negative examples.
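As a purely illustrative check of the concentration these inequalities describe, the following simulation tracks how the empirical mean and covariance of a negative sample approach the true ones as $\hat{m}$ grows (the Gaussian data here is a stand-in; the slides assume bounded $\|x\| \le 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
mu = np.full(d, 0.3)       # illustrative true mean of the negative class
Sigma = np.eye(d)          # illustrative true covariance

for m_hat in (100, 1000, 10000):
    X = rng.multivariate_normal(mu, Sigma, size=m_hat)
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False)
    print(m_hat,
          np.linalg.norm(mu_hat - mu),                # ||mu_hat - mu||
          np.linalg.norm(Sigma_hat - Sigma, ord=2))   # spectral-norm deviation
```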
For K = 1 (cont.)
$L_S^{M,-1}(w) = \frac{\hat{m}}{m} \sup_{x \sim Z(\hat{\mu}, \hat{\Sigma})} \Pr(w^T x \ge 0)$
Estimate $\beta = E_{(x,y)\sim D}\big[\mathbf{1}[y = -1]\big]$ by its empirical mean $\hat{m}/m$ and use Hoeffding's inequality to bound its deviation from $\beta$.
Combining all together: with probability error of $3\delta$,
$L^{M,-1}_{\mu^{-},\Sigma^{-}}(w) \;\le\; L_S^{M,-1}(w) + c\sqrt{\frac{32\log(1/\delta) + 1/4}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}},$
where $c$ is a constant depending only on $\alpha$.
Algorithm
• Minimizes the bound:
$\min_W \;\; \sum_{i=1}^{K} \max\{0,\, 1 - w_i^T x\} \;+\; \sup_{x \sim Z(\hat{\mu}, \hat{\Sigma})} \Pr(W^T x \ge 0)$
(hinge term taken over the positive training examples)
• The hinge term is a convex optimization for K = 1; the sup term is hard to compute.
Approximation Algorithm
• Find the K hyperplanes in a greedy way, using convex optimization for a single hyperplane.
• Iteratively refine the K hyperplanes.
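A high-level sketch of this greedy-plus-refinement scheme (an interpretation of the slides, not the authors' code): each hyperplane is fit with the convex K = 1 solver against the negatives still accepted by the current intersection, and refinement re-fits each hyperplane with the others held fixed. `fit_single_hyperplane` is an assumed placeholder for the K = 1 solver, not something defined in the slides.

```python
import numpy as np

def fit_k_hyperplanes(X_pos, X_neg, K, fit_single_hyperplane, n_refine=25):
    """Greedy phase: add hyperplanes one at a time.
    Refinement phase: keep K-1 hyperplanes fixed and re-fit the remaining one."""
    hyperplanes = []
    remaining_neg = X_neg
    for _ in range(K):                                        # greedy phase
        w = fit_single_hyperplane(X_pos, remaining_neg)
        hyperplanes.append(w)
        W = np.column_stack(hyperplanes)
        # keep only the negatives that the current intersection still accepts
        remaining_neg = X_neg[np.all(X_neg @ W > 0, axis=1)]
    for _ in range(n_refine):                                 # refinement phase
        for k in range(K):
            others = [h for j, h in enumerate(hyperplanes) if j != k]
            if others:
                W_others = np.column_stack(others)
                neg_k = X_neg[np.all(X_neg @ W_others > 0, axis=1)]
            else:
                neg_k = X_neg                                 # K = 1: nothing held fixed
            hyperplanes[k] = fit_single_hyperplane(X_pos, neg_k)
    return np.column_stack(hyperplanes)
```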
Approximation Algorithm - Greedy Phase
[Figure: the greedy phase for K = 5, showing the intersection classifier after adding the hyperplanes for K = 1, 2, 3, 4, 5.]
Approximation Algorithm - Refinement Phase
• Keep K-1 hyperplanes fixed, find the K-th hyperplane.
[Figure: K = 5, refinement phase, 25 iterations.]
Experiments: Synthetic Data
• Test the robustness to imbalance in the data.
• 5000 2D points, equally partitioned into train, validation, and test sets.
[Figure: scatter plots of the synthetic 2D data.]
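One possible way to generate a toy set of this kind (the means, covariances, and imbalance ratio below are illustrative choices, not the ones used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, pos_fraction = 5000, 0.05               # imbalance ratio is illustrative
n_pos = int(n_total * pos_fraction)

X_pos = rng.multivariate_normal([2.0, 2.0], 0.5 * np.eye(2), size=n_pos)
X_neg = rng.multivariate_normal([-2.0, -2.0], 3.0 * np.eye(2), size=n_total - n_pos)
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_pos), -np.ones(len(X_neg))])

# equal three-way split into train / validation / test
perm = rng.permutation(len(X))
train_idx, val_idx, test_idx = np.array_split(perm, 3)
```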
Experiments: Letters Data Set
• UCI Machine Learning Repository [Murphy & Aha, 1994].
• 16-dimensional features, 26 letters of the English alphabet.
Experiments: Scene Recognition
• 397 categories of the SUN database [Xiao et al., 2010].
• Features: bag-of-words (BOW) of dense HOG with 300 words.
• The data is divided into 50 training and 50 test images, in 10 folds.
Thank you! Questions?