Regularization Strategies and Empirical Bayesian Learning for MKL

Ryota Tomioka, Taiji Suzuki
Department of Mathematical Informatics, The University of Tokyo

2010-12-11, NIPS 2010 Workshop: New Directions in Multiple Kernel Learning


Overview

Our contribution

Relationships between different regularization strategies:
- Ivanov regularization (kernel weights)
- Tikhonov regularization (kernel weights)
- (Generalized) block-norm formulation (no kernel weights)
Are they equivalent? In which way?

Empirical Bayesian learning algorithm for MKL:
- Maximizes the marginalized likelihood.
- Can be considered as a non-separable regularization on the kernel weights.


Overview

Learning with a fixed kernel combination

Fixed kernel combination $k_d(x, x') = \sum_{m=1}^M d_m k_m(x, x')$. The problem

  \min_{\bar f \in \mathcal{H}(d),\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\bigl(y_i, \bar f(x_i) + b\bigr) + \frac{C}{2} \|\bar f\|_{\mathcal{H}(d)}^2

($\mathcal{H}(d)$ is the RKHS corresponding to the combined kernel $k_d$) is equivalent to learning M functions $(f_1, \ldots, f_M)$ as follows:

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2} \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}    (1)

where $\bar f(x) = \sum_{m=1}^M f_m(x)$. See Sec. 6 in Aronszajn (1950), Micchelli & Pontil (2005).
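As a concrete illustration of the fixed-combination problem, here is a minimal sketch, assuming a squared loss, precomputed kernel matrices K_m, and no bias term b; under these assumptions the combined-kernel problem reduces to kernel ridge regression on \bar K = \sum_m d_m K_m. The data and kernel choices below are hypothetical, not from the slides.

```python
import numpy as np

def fit_fixed_combination(Ks, d, y, C=1.0):
    """Kernel ridge regression with the fixed combined kernel sum_m d_m K_m.

    Ks : list of (N, N) kernel matrices K_m
    d  : kernel weights d_m >= 0
    y  : (N,) targets
    C  : constant multiplying the regularizer ||f||^2 / 2
    """
    K_bar = sum(dm * Km for dm, Km in zip(d, Ks))
    N = len(y)
    # Stationarity of 0.5*||y - K a||^2 + (C/2) a' K a gives (K + C I) a = y.
    alpha = np.linalg.solve(K_bar + C * np.eye(N), y)
    return alpha, K_bar

# Toy usage with random RBF kernels (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
Ks = [np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / s) for s in (0.5, 2.0, 8.0)]
y = rng.standard_normal(30)
alpha, K_bar = fit_fixed_combination(Ks, d=[0.2, 0.5, 0.3], y=y, C=1.0)
train_pred = K_bar @ alpha
```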


Regularization Strategies

Ivanov regularization

We can constrain the size of the kernel weights $d_m$:

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R},\, d_1 \ge 0, \ldots, d_M \ge 0} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2} \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m},    (2)

  \text{s.t.} \; \sum_{m=1}^M h(d_m) \le 1 \quad (h \text{ is convex and increasing}).

Equivalent to the more common expression:

  \min_{f \in \mathcal{H}(d),\, b \in \mathbb{R},\, d_1 \ge 0, \ldots, d_M \ge 0} \; \sum_{i=1}^N \ell\bigl(y_i, f(x_i) + b\bigr) + \frac{C}{2} \|f\|_{\mathcal{H}(d)}^2, \quad \text{s.t.} \; \sum_{m=1}^M h(d_m) \le 1.


Regularization Strategies

Tikhonov regularization

We can penalize the size of the kernel weights $d_m$:

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R},\, d_1 \ge 0, \ldots, d_M \ge 0} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2} \sum_{m=1}^M \Bigl( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \mu h(d_m) \Bigr).    (3)

Note that the above is equivalent to

  \min_{f \in \mathcal{H}(d),\, b \in \mathbb{R},\, d_1 \ge 0, \ldots, d_M \ge 0} \; \underbrace{\sum_{i=1}^N \ell\bigl(y_i, f(x_i) + b\bigr)}_{\text{data-fit}} + \underbrace{\frac{C}{2} \|f\|_{\mathcal{H}(d)}^2}_{f\text{-prior}} + \underbrace{\frac{C\mu}{2} \sum_{m=1}^M h(d_m)}_{d_m\text{-hyper-prior}}.

Regularization Strategies

Are these two formulations equivalent?

Previously thought that...
Yes, but the choice of the pair (C, µ) is complicated. ⇒ In the Tikhonov formulation we have to choose both C and µ! (Kloft et al., 2010)

We show that...
If we give up the constant 1 in the Ivanov constraint $\sum_{m=1}^M h(d_m) \le 1$, then:
- the correspondence follows via equivalent block-norm formulations;
- C and µ can be chosen independently;
- the constant 1 has no meaning.


Regularization Strategies

Ivanov ⇒ block-norm formulation 1 (known)

Let $h(d_m) = d_m^p$ ($\ell_p$-norm MKL); see Kloft et al. (2010).

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R},\, d_1 \ge 0, \ldots, d_M \ge 0} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2} \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}, \quad \text{s.t.} \; \sum_{m=1}^M d_m^p \le 1.

⇓ (Jensen's inequality)

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2} \Bigl( \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q \Bigr)^{2/q},

where $q = 2p/(1+p)$. The minimum is attained at $d_m \propto \|f_m\|_{\mathcal{H}_m}^{2/(1+p)}$.
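This equivalence suggests the standard alternating scheme for ℓ_p-norm MKL: for fixed weights solve the fixed-kernel-combination problem (1), then update the weights in closed form and rescale them onto the constraint set. The sketch below assumes a squared loss, no bias, and h(d) = d^p; it is an illustrative implementation, not the authors' code.

```python
import numpy as np

def lp_mkl(Ks, y, C=1.0, p=1.0, n_iter=50, eps=1e-12):
    """Alternating minimization for lp-norm MKL with squared loss (sketch).

    Repeats: (1) solve the fixed-kernel-combination problem,
             (2) update d_m proportional to ||f_m||^{2/(1+p)},
                 rescaled so that sum_m d_m^p = 1.
    """
    M, N = len(Ks), len(y)
    d = np.full(M, (1.0 / M) ** (1.0 / p))      # feasible start: sum d^p = 1
    for _ in range(n_iter):
        K_bar = sum(dm * Km for dm, Km in zip(d, Ks))
        alpha = np.linalg.solve(K_bar + C * np.eye(N), y)
        # Block norms: f_m = d_m * sum_i alpha_i k_m(x_i, .), so
        # ||f_m||_{H_m}^2 = d_m^2 * alpha' K_m alpha.
        norms = np.array([dm * np.sqrt(max(alpha @ Km @ alpha, 0.0))
                          for dm, Km in zip(d, Ks)])
        d = norms ** (2.0 / (1.0 + p)) + eps
        d /= (d ** p).sum() ** (1.0 / p)        # rescale back onto sum d^p = 1
    return d, alpha
```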


Regularization Strategies

Tikhonov ⇒ block-norm formulation 2 (new)

Let $h(d_m) = d_m^p$, $\mu = 1/p$ ($\ell_p$-norm MKL).

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R},\, d_1 \ge 0, \ldots, d_M \ge 0} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2} \sum_{m=1}^M \Bigl( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \frac{d_m^p}{p} \Bigr).

⇓ (Young's inequality)

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{q} \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q,

where $q = 2p/(1+p)$. The minimum is attained at $d_m = \|f_m\|_{\mathcal{H}_m}^{2/(1+p)}$.
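A quick numeric check of this step (illustrative, not from the slides): for h(d) = d^p the inner minimum over d_m of \|f_m\|^2/(2 d_m) + d_m^p/(2p) should equal \|f_m\|^q / q with q = 2p/(1+p), attained at d_m = \|f_m\|^{2/(1+p)}; the value of the block norm below is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p = 1.5
q = 2 * p / (1 + p)
norm_f = 0.7                                   # hypothetical block norm ||f_m||

obj = lambda d: norm_f ** 2 / (2 * d) + d ** p / (2 * p)
res = minimize_scalar(obj, bounds=(1e-8, 10.0), method="bounded")

d_star = norm_f ** (2.0 / (1.0 + p))           # claimed minimizer
print(res.x, d_star)                           # the two should agree
print(res.fun, norm_f ** q / q)                # minimum equals ||f_m||^q / q
```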


Regularization Strategies

The two block-norm formulations are equivalent

Block-norm formulation 1 (from Ivanov):

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{\tilde C}{2} \Bigl( \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q \Bigr)^{2/q}.

Block-norm formulation 2 (from Tikhonov):

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{q} \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q.

We just have to map $C$ and $\tilde C$. The implied kernel weights are normalized / unnormalized, respectively.


Regularization Strategies

Generalized block-norm formulation

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + C \sum_{m=1}^M g\bigl(\|f_m\|_{\mathcal{H}_m}^2\bigr),    (4)

where $g$ is a concave block-norm-based regularizer.

Example (Elastic-net MKL): $g(x) = (1 - \lambda)\sqrt{x} + \frac{\lambda}{2} x$, giving

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, b \in \mathbb{R}} \; \sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i) + b\Bigr) + C \sum_{m=1}^M \Bigl( (1 - \lambda) \|f_m\|_{\mathcal{H}_m} + \frac{\lambda}{2} \|f_m\|_{\mathcal{H}_m}^2 \Bigr).
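A small sketch of evaluating the generalized block-norm penalty, using the elastic-net g above as the concrete regularizer; the functional form of g comes from the slide, while the helper names and the numbers in the usage line are hypothetical.

```python
import numpy as np

def elastic_net_g(x, lam):
    """Concave block regularizer g(x) = (1 - lam) * sqrt(x) + (lam / 2) * x."""
    return (1.0 - lam) * np.sqrt(x) + 0.5 * lam * x

def block_penalty(block_norms, C, g, **g_kwargs):
    """Generalized block-norm penalty C * sum_m g(||f_m||^2)."""
    return C * sum(g(nm ** 2, **g_kwargs) for nm in np.asarray(block_norms, float))

# Example: penalty for three hypothetical block norms under elastic-net MKL.
print(block_penalty([0.0, 0.3, 1.2], C=1.0, g=elastic_net_g, lam=0.5))
```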


Regularization Strategies

Generalized block-norm ⇒ Tikhonov regularization

Theorem. The correspondence between the convex (kernel-weight-based) regularizer $h(d_m)$ and the concave (block-norm-based) regularizer $g(x)$ is given by

  \mu h(d_m) = -2 g^*\Bigl(\frac{1}{2 d_m}\Bigr),

where $g^*$ is the concave conjugate of $g$.

Proof: use the concavity of $g$ as

  \frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m} \ge g\bigl(\|f_m\|_{\mathcal{H}_m}^2\bigr) + g^*\Bigl(\frac{1}{2 d_m}\Bigr).

See also Palmer et al. (2006).
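A numeric illustration of the theorem (added here, not on the original slide): for g(x) = \sqrt{x} the concave conjugate is g^*(y) = \inf_x (xy - g(x)) = -1/(4y), so -2 g^*(1/(2 d_m)) = d_m, recovering the block 1-norm MKL row of the correspondence table (h(d_m) = d_m, µ = 1). The grid used for the conjugate is an arbitrary choice for the demo.

```python
import numpy as np

def concave_conjugate(g, y, xs=np.linspace(1e-6, 100.0, 200_000)):
    """Numerical concave conjugate g*(y) = inf_x (x * y - g(x)) on a grid."""
    return np.min(xs * y - g(xs))

g = np.sqrt                                    # block 1-norm MKL: g(x) = sqrt(x)
for d in (0.5, 1.0, 2.0):
    y = 1.0 / (2.0 * d)
    h_d = -2.0 * concave_conjugate(g, y)       # should be close to d
    print(d, h_d, -2.0 * (-1.0 / (4.0 * y)))   # numeric vs. analytic conjugate
```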


Regularization Strategies

Examples

Generalized Young's inequality: $x y \ge g(x) + g^*(y)$, where $g$ is concave and $g^*$ is the concave conjugate of $g$.

Example 1: let $g(x) = \sqrt{x}$; then $g^*(y) = -1/(4y)$ and

  \frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m} + \frac{d_m}{2} \ge \|f_m\|_{\mathcal{H}_m}    (L1-MKL).

Example 2: let $g(x) = x^{q/2}/q$ ($1 \le q \le 2$); then $g^*(y) = \frac{q-2}{2q} (2y)^{q/(q-2)}$ and

  \frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m} + \frac{d_m^p}{2p} \ge \frac{1}{q} \|f_m\|_{\mathcal{H}_m}^q    (\ell_p-norm MKL),

where $p := q/(2 - q)$.


Regularization Strategies

Correspondence

MKL model                               | block-norm g(x)                           | kernel weight h(d_m)                    | reg. const. µ
block 1-norm MKL                        | \sqrt{x}                                  | d_m                                     | 1
\ell_p-norm MKL                         | \frac{1+p}{2p} x^{p/(1+p)}                | d_m^p                                   | 1/p
Uniform-weight MKL (block 2-norm MKL)   | x/2                                       | I_{[0,1]}(d_m)                          | +0
block q-norm MKL (q > 2)                | \frac{1}{q} x^{q/2}                       | d_m^{-q/(q-2)}                          | -(q-2)/q
Elastic-net MKL                         | (1-\lambda)\sqrt{x} + \frac{\lambda}{2} x | \frac{(1-\lambda) d_m}{1 - \lambda d_m} | 1 - \lambda

$I_{[0,1]}(x)$ is the indicator function of the closed interval $[0, 1]$; i.e., $I_{[0,1]}(x) = 0$ if $x \in [0, 1]$, and $+\infty$ otherwise.


Empirical Bayesian MKL

Bayesian view

Tikhonov regularization as a hierarchical MAP estimation:

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\, d_1 \ge 0, \ldots, d_M \ge 0} \; \underbrace{\sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i)\Bigr)}_{\text{likelihood}} + \underbrace{\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m}}_{f_m\text{-prior}} + \underbrace{\mu \sum_{m=1}^M h(d_m)}_{d_m\text{-hyper-prior}}.

Hyper-prior over the kernel weights:
  d_m \sim \frac{1}{Z_1(\mu)} \exp(-\mu h(d_m)) \quad (m = 1, \ldots, M).

Gaussian process prior over the functions:
  f_m \sim \mathrm{GP}(f_m; 0, d_m k_m) \quad (m = 1, \ldots, M).

Likelihood:
  y_i \sim \frac{1}{Z_2(x_i)} \exp\Bigl(-\ell\bigl(y_i, \sum_{m=1}^M f_m(x_i)\bigr)\Bigr).
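For intuition, a generative sketch of this hierarchical model in the Gaussian-likelihood case; the exponential hyper-prior (h(d) = d), the RBF base kernels, and the noise level are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, mu, sigma_y = 40, 3, 1.0, 0.3
X = rng.uniform(-2, 2, size=N)

# Hypothetical base kernels (RBF with different widths), evaluated on X.
Ks = [np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * s ** 2)) for s in (0.2, 0.7, 2.0)]

# d_m ~ exp(-mu * h(d_m)) with h(d) = d, i.e. an exponential hyper-prior.
d = rng.exponential(scale=1.0 / mu, size=M)

# f_m ~ GP(0, d_m k_m), evaluated at the training inputs.
fs = [rng.multivariate_normal(np.zeros(N), dm * Km + 1e-8 * np.eye(N))
      for dm, Km in zip(d, Ks)]

# y_i ~ N(sum_m f_m(x_i), sigma_y^2)  (Gaussian likelihood).
y = sum(fs) + sigma_y * rng.standard_normal(N)
```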


Empirical Bayesian MKL

Marginalized likelihood

Assume a Gaussian likelihood $\ell(y, z) = \frac{1}{2\sigma_y^2}(y - z)^2$. The marginalized likelihood (omitting the hyper-prior for simplicity) is

  -\log p(\boldsymbol{y} \mid \boldsymbol{d}) = \underbrace{\frac{1}{2\sigma_y^2} \Bigl\| \boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m^{\mathrm{MAP}} \Bigr\|^2}_{\text{likelihood}} + \underbrace{\frac{1}{2} \sum_{m=1}^M \frac{\|f_m^{\mathrm{MAP}}\|_{\mathcal{H}_m}^2}{d_m}}_{f_m\text{-prior}} + \underbrace{\frac{1}{2} \log \bigl| \bar{K}(\boldsymbol{d}) \bigr|}_{\text{volume-based regularization}},

where $f_m^{\mathrm{MAP}}$ is the MAP estimate for fixed kernel weights $d_m$ ($m = 1, \ldots, M$) and $\bar{K}(\boldsymbol{d}) := \sigma_y^2 I_N + \sum_{m=1}^M d_m K_m$. See also Wipf & Nagarajan (2009).
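For the Gaussian case the same quantity can be evaluated directly (up to an additive constant) as \frac{1}{2}\boldsymbol{y}^\top \bar{K}(\boldsymbol{d})^{-1}\boldsymbol{y} + \frac{1}{2}\log|\bar{K}(\boldsymbol{d})|, since minimizing the first two terms over the f_m yields the usual GP marginal likelihood. A sketch using that equivalent form with a Cholesky factorization (an illustrative implementation, not the authors'):

```python
import numpy as np

def neg_log_marginal_likelihood(Ks, d, y, sigma_y):
    """-log p(y | d) up to an additive constant, Gaussian likelihood.

    Uses the Gaussian-process form 0.5 * y' Kbar^{-1} y + 0.5 * log|Kbar|,
    with Kbar(d) = sigma_y^2 I + sum_m d_m K_m.
    """
    N = len(y)
    K_bar = sigma_y ** 2 * np.eye(N) + sum(dm * Km for dm, Km in zip(d, Ks))
    L = np.linalg.cholesky(K_bar)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Kbar^{-1} y
    data_fit = 0.5 * y @ alpha
    log_det = np.sum(np.log(np.diag(L)))                   # 0.5 * log|K_bar|
    return data_fit + log_det

# Empirical Bayes then minimizes this objective over d_m >= 0
# (e.g. by the Method A / Method B updates in the appendix slides).
```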


Empirical Bayesian MKL

Comparing MAP and empirical Bayes objectives

Hyper-prior MAP (MKL):

  \underbrace{\sum_{i=1}^N \ell\Bigl(y_i, \sum_{m=1}^M f_m(x_i)\Bigr)}_{\text{likelihood}} + \underbrace{\frac{1}{2} \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}}_{f_m\text{-prior}} + \underbrace{\mu \sum_{m=1}^M h(d_m)}_{d_m\text{-hyper-prior (separable)}}.

Empirical Bayes:

  \underbrace{\frac{1}{2\sigma_y^2} \Bigl\| \boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m^{\mathrm{MAP}} \Bigr\|^2}_{\text{likelihood}} + \underbrace{\frac{1}{2} \sum_{m=1}^M \frac{\|f_m^{\mathrm{MAP}}\|_{\mathcal{H}_m}^2}{d_m}}_{f_m\text{-prior}} + \underbrace{\frac{1}{2} \log \bigl| \bar{K}(\boldsymbol{d}) \bigr|}_{\text{volume-based regularization (non-separable)}}.


Experiments

Caltech 101 dataset (classification): Cannon vs Cup

[Figure: classification accuracy (0.5 to 1) versus number of samples per class (10 to 50) for MKL (logit), Uniform, MKL (square), Elastic-net MKL (λ = 0.5), and Bayes MKL.]

The regularization constant C was chosen by 2×4-fold cross validation on the training set.


Experiments

Caltech 101 dataset: kernel weights

1,760 kernel functions:
- 4 SIFT features (hsvsift, sift, sift4px, sift8px)
- 22 spatial decompositions (including the spatial pyramid kernel)
- 2 kernel functions (Gaussian and χ²)
- 10 kernel parameters

[Figure: learned kernel weights over the 1,760 kernels for each method, with accuracies MKL (logit) 0.82, Uniform 0.92, MKL (square) 0.80, Elastic-net MKL (λ = 0.5) 0.97, Bayes MKL 0.82.]


Experiments

Caltech 101 dataset: kernel weights (detail)

[Figure: zoomed view of the kernel weights (kernel indices roughly 200 to 700), separating the χ²-kernel and Gaussian-kernel blocks, for MKL (logit), Uniform, MKL (square), Elastic-net MKL, and Bayes MKL; accuracies as on the previous slide.]


Summary

Summary

Two regularized kernel-weight learning formulations,
- Ivanov regularization and
- Tikhonov regularization,
are equivalent. No additional tuning parameter!

Both formulations reduce to block-norm formulations via Jensen's inequality / (generalized) Young's inequality.

Probabilistic view of MKL: a hierarchical Gaussian process model.

Elastic-net MKL performs similarly to uniform-weight MKL, but shows grouping of mutually dependent kernels.

Empirical-Bayes MKL and L1-MKL seem to make the solution overly sparse, but they often choose slightly different sets of kernels.

Code for Elastic-net MKL is available from http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/SpicyMKL


Stuffs

Acknowledgements

We would like to thank Hisashi Kashima and Shinichi Nakajima for helpful discussions. This work was supported in part by MEXT KAKENHI 22700138, 22700289, and NTT Communication Science Laboratories.


Stuffs

A brief proof

Minimize the Lagrangian:

  \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M} \; \frac{1}{2} \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \underbrace{\Bigl\langle g, \bar f - \sum_{m=1}^M f_m \Bigr\rangle_{\mathcal{H}(d)}}_{\text{equality constraint}},

where $g \in \mathcal{H}(d)$ is a Lagrange multiplier.

Fréchet derivative (in any direction $h_m \in \mathcal{H}_m$):

  \Bigl\langle h_m, \frac{f_m}{d_m} \Bigr\rangle_{\mathcal{H}_m} - \langle g, h_m \rangle_{\mathcal{H}(d)} = 0 \quad \Rightarrow \quad f_m(x) = \langle g, d_m k_m(\cdot, x) \rangle_{\mathcal{H}(d)}.

Maximize the dual:

  \max_{g \in \mathcal{H}(d)} \; -\frac{1}{2} \|g\|_{\mathcal{H}(d)}^2 + \langle g, \bar f \rangle_{\mathcal{H}(d)} = \frac{1}{2} \|\bar f\|_{\mathcal{H}(d)}^2.


Stuffs

References

- Aronszajn. Theory of reproducing kernels. Transactions of the AMS, 1950.
- Lanckriet et al. Learning the kernel matrix with semidefinite programming. JMLR, 2004.
- Bach et al. Multiple kernel learning, conic duality, and the SMO algorithm. ICML, 2004.
- Micchelli & Pontil. Learning the kernel function via regularization. JMLR, 2005.
- Cortes. Can learning kernels help performance? ICML, 2009.
- Cortes et al. Generalization bounds for learning kernels. ICML, 2010.
- Kloft et al. Efficient and accurate lp-norm multiple kernel learning. NIPS 22, 2010.
- Tomioka & Suzuki. Sparsity-accuracy trade-off in MKL. arXiv, 2010.
- Varma & Babu. More generality in efficient multiple kernel learning. ICML, 2009.
- Gehler & Nowozin. Let the kernel figure it out: principled learning of pre-processing for kernel classifiers. CVPR, 2009.
- Tipping. Sparse Bayesian learning and the relevance vector machine. JMLR, 2001.
- Palmer et al. Variational EM algorithms for non-Gaussian latent variable models. NIPS, 2006.
- Wipf & Nagarajan. A new view of automatic relevance determination. NIPS, 2008.


Stuffs

Method A: upper-bounding the log-det term

Use the upper bound

  \log \bigl| \bar{K}(\boldsymbol{d}) \bigr| \le \sum_{m=1}^M z_m d_m - \psi^*(\boldsymbol{z}).

Eliminate the kernel weights by explicit minimization (AM-GM inequality), and update the $\boldsymbol{f}_m$ as

  (\boldsymbol{f}_m)_{m=1}^M \leftarrow \mathop{\mathrm{argmin}}_{(\boldsymbol{f}_m)_{m=1}^M} \; \frac{1}{2\sigma_y^2} \Bigl\| \boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m \Bigr\|^2 + \sum_{m=1}^M \sqrt{z_m} \, \|\boldsymbol{f}_m\|_{K_m}.

Update the $z_m$ as (tightening the upper bound)

  z_m \leftarrow \mathrm{Tr}\Bigl( \bigl( \sigma_y^2 I_N + \sum_{m'=1}^M d_{m'} K_{m'} \bigr)^{-1} K_m \Bigr), \quad \text{where } d_m = \|\boldsymbol{f}_m\|_{K_m} / \sqrt{z_m}.

- Each $\boldsymbol{f}$-update is a reweighted L1-MKL problem.
- Each update step minimizes an upper bound of the empirical Bayes objective $-\log p(\boldsymbol{y} \mid \boldsymbol{d})$. A sketch of the loop follows below.
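A rough sketch of how the Method A loop could look, assuming a squared loss and solving the inner reweighted L1-MKL step approximately by proximal gradient (ISTA) on the parametrization f_m = K_m^{1/2} w_m (so that ||f_m||_{K_m} = ||w_m||_2); the inner solver, step counts, and jitter are my choices, not the authors' solver.

```python
import numpy as np

def method_a(Ks, y, sigma_y, n_outer=20, n_inner=200):
    """Sketch of Method A: alternate an approximate reweighted L1-MKL step
    for the functions with a z-update that tightens the log-det bound."""
    M, N = len(Ks), len(y)
    roots = [np.linalg.cholesky(Km + 1e-8 * np.eye(N)) for Km in Ks]  # K_m^{1/2}
    Phi = np.hstack(roots)                                  # (N, M*N) design
    L_grad = np.linalg.norm(Phi, 2) ** 2 / sigma_y ** 2     # Lipschitz constant
    w = np.zeros(M * N)
    z = np.ones(M)
    for _ in range(n_outer):
        # f-update: reweighted group lasso solved approximately by ISTA.
        for _ in range(n_inner):
            grad = Phi.T @ (Phi @ w - y) / sigma_y ** 2
            v = (w - grad / L_grad).reshape(M, N)
            shrink = np.maximum(0.0, 1.0 - (np.sqrt(z) / L_grad)
                                / np.maximum(np.linalg.norm(v, axis=1), 1e-12))
            w = (shrink[:, None] * v).ravel()
        # z-update: tighten the log-det upper bound.
        norms = np.linalg.norm(w.reshape(M, N), axis=1)      # ||f_m||_{K_m}
        d = norms / np.sqrt(z)
        K_bar = sigma_y ** 2 * np.eye(N) + sum(dm * Km for dm, Km in zip(d, Ks))
        K_bar_inv = np.linalg.inv(K_bar)
        z = np.array([np.trace(K_bar_inv @ Km) for Km in Ks])
    return d, w
```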


Stuffs

Method B: MacKay update

Use the fixed-point condition for the update of the weights:

  -\frac{\|\boldsymbol{f}_m\|_{K_m}^2}{d_m^2} + \mathrm{Tr}\Bigl( \bigl( \sigma_y^2 I_N + \sum_{m'=1}^M d_{m'} K_{m'} \bigr)^{-1} K_m \Bigr) = 0.

Update the $\boldsymbol{f}_m$ as

  (\boldsymbol{f}_m)_{m=1}^M \leftarrow \mathop{\mathrm{argmin}}_{(\boldsymbol{f}_m)_{m=1}^M} \; \frac{1}{2\sigma_y^2} \Bigl\| \boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m \Bigr\|^2 + \frac{1}{2} \sum_{m=1}^M \frac{\|\boldsymbol{f}_m\|_{K_m}^2}{d_m}.

Update the kernel weights $d_m$ as

  d_m \leftarrow \frac{\|\boldsymbol{f}_m\|_{K_m}^2}{\mathrm{Tr}\Bigl( \bigl( \sigma_y^2 I_N + \sum_{m'=1}^M d_{m'} K_{m'} \bigr)^{-1} d_m K_m \Bigr)}.

- Each update step is a fixed-kernel-weight learning problem (easy).
- Convergence is empirically OK (e.g., as for the RVM).
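A sketch of the Method B loop for the Gaussian-likelihood case, using the closed-form MAP solution f_m = d_m K_m \bar{K}(\boldsymbol{d})^{-1} \boldsymbol{y} for the fixed-weight step; the initialization, iteration count, and jitter are assumptions for the illustration, and convergence safeguards are omitted.

```python
import numpy as np

def method_b_mackay(Ks, y, sigma_y, n_iter=100):
    """Sketch of Method B (MacKay-style fixed-point updates), Gaussian case.

    For fixed weights d the MAP functions are f_m = d_m K_m alpha with
    alpha = (sigma_y^2 I + sum_m d_m K_m)^{-1} y, so that
    ||f_m||_{K_m}^2 = d_m^2 * alpha' K_m alpha.
    """
    M, N = len(Ks), len(y)
    d = np.ones(M)
    for _ in range(n_iter):
        K_bar = sigma_y ** 2 * np.eye(N) + sum(dm * Km for dm, Km in zip(d, Ks))
        K_bar_inv = np.linalg.inv(K_bar)
        alpha = K_bar_inv @ y
        for m, Km in enumerate(Ks):
            numer = d[m] ** 2 * (alpha @ Km @ alpha)         # ||f_m||_{K_m}^2
            denom = np.trace(K_bar_inv @ (d[m] * Km)) + 1e-12
            d[m] = numer / denom                             # MacKay update
    return d
```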