K-Hyperplane Hinge-Minimax Classifier
Margarita Osadchy, Tamir Hazan, Daniel Keren
University of Haifa

Goal
• Non-linear binary classifier
• Imbalanced data sets
• Fast
• Scalable

Natural Applications
• Object Detection
• Fraud Detection

Relatively small number of samples

SVM: the zero-one loss is upper bounded by the hinge loss.

$$\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\, 1 - y_i w^\top x_i\right) + \lambda\,\|w\|^2$$

[Vapnik 2000; Zhang 2002; Bartlett & Mendelson 2003; Bousquet et al. 2004; Kakade et al. 2009]
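As an illustration (ours, not from the slides; `lam` stands in for the regularization weight λ), a minimal NumPy sketch of this regularized hinge objective:

```python
import numpy as np

def hinge_objective(w, X, y, lam=0.1):
    """Regularized hinge objective: mean hinge loss plus lam * ||w||^2."""
    margins = y * (X @ w)                       # y_i * w^T x_i for each sample
    return np.maximum(0.0, 1.0 - margins).mean() + lam * np.dot(w, w)
```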

(Infinitely) many training samples

Minimax: the zero-one risk is upper bounded by the worst-case risk among all data distributions $Z(\mu, \Sigma)$ with the given mean and covariance:

$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(w^\top z \le 0) = \frac{1}{1 + \frac{(w^\top \mu)^2}{w^\top \Sigma w}}$$

[Lanckriet et al., 2003; Honorio & Jaakkola, 2014]
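A small sketch (ours) of this worst-case bound for a single hyperplane:

```python
import numpy as np

def minimax_error(w, mu, Sigma):
    """Worst-case Pr(w^T z <= 0) over all distributions with mean mu and
    covariance Sigma; assumes w^T mu > 0 so the bound is meaningful."""
    d2 = (w @ mu) ** 2 / (w @ Sigma @ w)   # squared Mahalanobis-like margin
    return 1.0 / (1.0 + d2)
```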

Imbalanced problems

• Positive set (small number of samples) → Hinge risk
• Negative set (very large number of samples) → Minimax risk

Combining the two gives the Hinge-Minimax Classifier.

Hinge-Minimax Linear Classifier

$(x, y) \sim D$, $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$, $z \triangleq yx$

Classifier: $y_w = \mathrm{sign}(w^\top x)$

Zero-one risk: $L^{0/1}_D(w) = E_D\left[\mathbf{1}[y_w(x) \ne y]\right]$

Hinge risk bound:
$$L^{0/1}_D(w) \le E_D\left[\max\{0,\, 1 - y\, w^\top x\}\right] \triangleq L^H_D(w)$$

Minimax risk bound:
$$L^{0/1}_D(w) \le \sup_{z \sim Z(\mu_z, \Sigma_z)} \Pr(w^\top z \le 0) \triangleq L^M_{\mu_z,\Sigma_z}(w)$$

where $\mu_z = E_{(x,y)\sim D}[z]$ and $\Sigma_z = E_{(x,y)\sim D}\left[(z - \mu_z)(z - \mu_z)^\top\right]$.

Hinge risk bound for the positive class $D^+$:
$$L^{H,+1}_D(w) = E_{D^+}\left[\max\{0,\, 1 - w^\top x\}\right]$$

Minimax risk bound for the negative class $D^-$:
$$L^{M,-1}_{\mu,\Sigma}(w) = \sup_{x \sim Z(\mu,\Sigma)} \Pr(w^\top x \ge 0) = \frac{1}{1 + \frac{(w^\top \mu)^2}{w^\top \Sigma w}}$$

where $\mu = E_{(x,y)\sim D^-}[x]$ and $\Sigma = E_{(x,y)\sim D^-}\left[(x - \mu)(x - \mu)^\top\right]$.

Combined:
$$L^{MH}_D(w) \triangleq L^{H,+1}_D(w) + L^{M,-1}_{\mu,\Sigma}(w)$$

ECCV 2012
• Applied the linear classifier to vision problems.
• Generalized to a kernel classifier with a fixed number of support vectors.
• Faster than SVM.
• But computing non-linear support vectors is still expensive.

Intersection of K Hyperplanes

$$f_W(x) = \begin{cases} +1 & \text{if } W^\top x > 0 \\ -1 & \text{otherwise} \end{cases}$$

Here $W = [w_1, \ldots, w_K]$ and $W^\top x > 0$ is meant entrywise: all K hyperplanes must be positive.

[Klivans & Servedio 2004; Arriaga & Vempala 1999]

Computationally costly for a large number of negative examples.
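A minimal sketch (ours) of this intersection-of-hyperplanes decision rule:

```python
import numpy as np

def predict(W, x):
    """+1 only if x lies on the positive side of all K hyperplanes.
    W has shape (d, K): one column per hyperplane normal."""
    return 1 if np.all(W.T @ x > 0) else -1
```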

Intersection of K Hyperplanes (cont.)

$$L^{0/1}_D(W) \le L^{H,+1}_D(W) + L^{M,-1}_{\mu,\Sigma}(W) \triangleq L^{MH}_D(W)$$

where
$$L^{H,+1}_{D^+}(W) = E_{D^+}\!\left[\sum_{i=1}^{K} \max\{0,\, 1 - w_i^\top x\}\right], \qquad L^{M,-1}_{\mu,\Sigma}(W) = \sup_{x \sim Z(\mu,\Sigma)} \Pr(W^\top x > 0) = \;?$$

Key observation: $\{x : W^\top x > 0\}$ is a convex set.

Theorem (Marshall & Olkin, 1960). Let $Z(\mu, \Sigma)$ be all distributions with known mean $\mu$ and covariance $\Sigma$. For K fixed hyperplanes $w_i$ ($i = 1, \ldots, K$),
$$\sup_{x \sim Z(\mu,\Sigma)} \Pr(W^\top x > 0) = \frac{1}{1 + d^2}, \qquad d^2 = \inf_{W^\top x \ge 0}\, (x - \mu)^\top \Sigma^{-1} (x - \mu).$$
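One way (a sketch of ours, using SciPy's general-purpose solver rather than anything from the slides) to evaluate $d^2$ numerically as the constrained program above:

```python
import numpy as np
from scipy.optimize import minimize

def squared_mahalanobis_margin(W, mu, Sigma):
    """d^2 = inf over {x : W^T x >= 0} of (x - mu)^T Sigma^{-1} (x - mu).
    W has shape (d, K); x0 = 0 lies on the boundary, hence is feasible."""
    Sinv = np.linalg.inv(Sigma)
    objective = lambda x: (x - mu) @ Sinv @ (x - mu)
    constraint = {"type": "ineq", "fun": lambda x: W.T @ x}  # entrywise >= 0
    res = minimize(objective, x0=np.zeros(len(mu)), constraints=[constraint])
    return res.fun
```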

Example: $W = [w_1\; w_2\; w_3]$

$$x^* = \arg\min_{x:\, W^\top x \ge 0}\, (x - \mu)^\top \Sigma^{-1} (x - \mu).$$

$\widetilde{W}$ is the matrix whose columns are the hyperplanes that satisfy $w^\top x^* = 0$; here $\widetilde{W} = [w_1\; w_2]$.

[Figure: three hyperplanes $w_1, w_2, w_3$ and the minimizer $x^*$ on the boundary of the intersection.]

M , 1  ,

L

(W )  sup Pr(W x  0)  ? T

x ~ Z (  , )

Let Z(μ, Σ) be all distributions with known mean μ and covariance Σ. For K fixed hypeplanes wi (i  1,...K ),

1 sup Pr(W x  0)  , 2 1 d x ~ Z (  , ) T

T 1 d 2  inf ( x   )  ( x   ). T W x 0

Let x*  arg min x ( x   )T  1 ( x   ) and W~ be a matrix with columns that satisfy wT x*  0, We showed that





~ ~ T ~ 1 ~ T d   W W W W  2

T

w    T

d2

2

wT w

Expected Risk for y = −1

$$L^{M,-1}_{\mu,\Sigma}(W) = \sup_{x \sim Z(\mu,\Sigma)} \Pr(W^\top x > 0) = \frac{1}{1 + \mu^\top \widetilde{W} \left(\widetilde{W}^\top \Sigma \widetilde{W}\right)^{-1} \widetilde{W}^\top \mu}.$$

Uniform Generalization Bound

We showed:
$$L^{MH}_D(W) \le L^{H,+1}_S(W) + L^{M,-1}_S(W) + O\!\left(K \sqrt{\frac{\log(1/\delta)}{m}}\right)$$

where $\delta$ is the confidence parameter, $L^{H,+1}_S(W)$ and $L^{M,-1}_S(W)$ are the empirical estimates of $L^{H,+1}_D(W)$ and $L^{M,-1}_{\mu,\Sigma}(W)$, and $m$ is the training set size.

Proof Sketch

1. Show
$$L^{H,+1}_D(W) \le L^{H,+1}_S(W) + O\!\left(K \sqrt{\frac{\log(1/\delta)}{m}}\right)$$
via an extension of Rademacher complexity to $W \in \mathbb{R}^{K \times d}$.

2. Show
$$L^{M,-1}_{\mu,\Sigma}(W) \le L^{M,-1}_S(W) + O\!\left(K \sqrt{\frac{\log(1/\delta)}{m}}\right)$$
first for K = 1; the proof for K > 1 follows the same steps.

For K = 1

$$\left|\frac{1}{1 + \frac{(w^\top \mu)^2}{w^\top \Sigma w}} - \frac{1}{1 + \frac{(w^\top \hat\mu)^2}{w^\top \hat\Sigma w}}\right| = \left|\frac{w^\top \Sigma w \left[(w^\top \hat\mu)^2 - (w^\top \mu)^2\right] + (w^\top \hat\mu)^2\, w^\top (\Sigma - \hat\Sigma)\, w}{\left(w^\top \Sigma w + (w^\top \mu)^2\right)\left(w^\top \hat\Sigma w + (w^\top \hat\mu)^2\right)}\right|$$

We show this is at most
$$\frac{1}{\alpha}\left(\|\mu - \hat\mu\|_2\, \|\mu + \hat\mu\|_2 + \|\Sigma - \hat\Sigma\|_2\right),$$
where $\alpha$ bounds the minimal eigenvalue of $\Sigma$.

For K = 1 (cont.)

Using the Bernstein inequality for vectors and matrices [Gross, 2011; Candes & Plan, 2011]:

$$\|\mu - \hat\mu\|_2 \le \sqrt{\frac{32\log(1/\delta) + 1/4}{\hat m}}, \qquad \|\Sigma - \hat\Sigma\|_2 \le \sqrt{\frac{32\left(\log(1/\delta) + 1/4\right)}{\hat m}} \quad \text{for } \|x\| \le 1,$$

where $\hat m$ is the number of negative examples.

For K = 1 (cont.)

$$L^{M,-1}_S(w) = \frac{\hat m}{m} \sup_{x \sim Z(\hat\mu, \hat\Sigma)} \Pr(w^\top x \ge 0)$$

Estimate $\rho = E_{(x,y)\sim D}\,\mathbf{1}[y = -1]$ by its empirical mean $\hat m / m$ and use the Hoeffding inequality to bound its deviation from $\rho$.

Combining it all together: with probability error of $3\delta$, for $m \ge \frac{1}{\rho}\log(1/\delta)$,
$$L^{M,-1}_{\mu,\Sigma}(w) \le L^{M,-1}_S(w) + c\,\sqrt{\frac{32\log(1/\delta) + 1/4}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}, \qquad c = \frac{2}{\alpha} + \frac{1}{\alpha^2} + 1.$$
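A plug-in sketch (ours) of this empirical minimax term:

```python
import numpy as np

def empirical_minimax_term(w, X_neg, m_total):
    """Plug-in estimate (m_hat / m) * sup Pr(w^T x >= 0) under the empirical
    mean and covariance of the negative examples (assumes w^T mu_hat < 0)."""
    m_hat = X_neg.shape[0]
    mu_hat = X_neg.mean(axis=0)
    Sigma_hat = np.cov(X_neg, rowvar=False)
    d2 = (w @ mu_hat) ** 2 / (w @ Sigma_hat @ w)
    return (m_hat / m_total) / (1.0 + d2)
```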

Algorithm

• Minimizes the bound:

$$\min_W\; \sum_i \|w_i\|^2 + \sum_{i=1}^{K} \max\{0,\, 1 - w_i^\top x^+\} + \sup_{x^- \sim Z(\hat\mu, \hat\Sigma)} \Pr(W^\top x^- > 0)$$

• The first two terms: convex optimization for K = 1.
• The sup term is hard to compute.

Approximation Algorithm
• Find the K hyperplanes greedily, using convex optimization for a single hyperplane.
• Iteratively refine the K hyperplanes, as sketched below.
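A high-level sketch (ours, one plausible reading of the slides; `solve_single_hyperplane` is a hypothetical stand-in for the convex K = 1 solver, which the slides do not spell out):

```python
import numpy as np

def fit_k_hyperplanes(X_pos, X_neg, K, solve_single_hyperplane, n_refine=25):
    """Greedy phase: add hyperplanes one at a time, each trained against the
    negatives still inside the current intersection. Refinement phase: hold
    K-1 hyperplanes fixed and re-fit the remaining one (assumes K >= 2)."""
    W = []
    neg = X_neg
    for _ in range(K):                       # greedy phase
        w = solve_single_hyperplane(X_pos, neg)
        W.append(w)
        # keep only negatives not yet separated by the current intersection
        neg = neg[np.all(neg @ np.array(W).T > 0, axis=1)]
    for _ in range(n_refine):                # refinement phase
        for i in range(K):                   # hold the others fixed
            others = np.array([w for j, w in enumerate(W) if j != i])
            active = X_neg[np.all(X_neg @ others.T > 0, axis=1)]
            W[i] = solve_single_hyperplane(X_pos, active)
    return np.array(W)
```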

Approximation Algorithm: Greedy Phase

[Figure: the greedy step, adding one hyperplane at a time.]

Approximation Algorithm: Greedy phase, K = 5

[Figure: panels showing the classifier after adding hyperplanes K = 1 through K = 5.]

Approximation Algorithm: Refinement Phase
• Keep K-1 hyperplanes fixed; find the K-th hyperplane.

[Figure: K = 5, refinement phase after 25 iterations.]

Experiments: Synthetic Data
• Tests robustness to imbalance in the data.
• 5000 2D points, equally partitioned into train, validation, and test sets.

[Figure: scatter plots of the synthetic 2D data sets; both axes range from −15 to 15.]

Experiments: Letters Data Set
• UCI Machine Learning Repository [Murphy & Aha, 1994].
• 16-dimensional features; 26 letters of the English alphabet.

Experiments: Scene Recognition
• 397 categories from the SUN database [Xiao et al., 2010].
• Features: bag of words over dense HOG with 300 words.
• The data is divided into 50 training and 50 test images per category, in 10 folds.

Questions?
Thank you!