Regularization Strategies and Empirical Bayesian Learning for MKL

Ryota Tomioka, Taiji Suzuki
Department of Mathematical Informatics, The University of Tokyo

2010-12-11, NIPS 2010 Workshop: New Directions in Multiple Kernel Learning
Overview
Our contribution

Relationships between different regularization strategies:
- Ivanov regularization (kernel weights)
- Tikhonov regularization (kernel weights)
- (Generalized) block-norm formulation (no kernel weights)
Are they equivalent? In which way?

Empirical Bayesian learning algorithm for MKL:
- Maximizes the marginalized likelihood.
- Can be considered as a non-separable regularization on the kernel weights.
Learning with a fixed kernel combination

For a fixed kernel combination $k_d(x, x') = \sum_{m=1}^M d_m k_m(x, x')$, the problem
$$\underset{\bar f \in \mathcal{H}(d),\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\bigl(y_i,\ \bar f(x_i) + b\bigr) + \frac{C}{2}\,\|\bar f\|_{\mathcal{H}(d)}^2$$
($\mathcal{H}(d)$ is the RKHS corresponding to the combined kernel $k_d$) is equivalent to learning $M$ functions $(f_1, \ldots, f_M)$ as follows:
$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2}\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}, \qquad (1)$$
where $\bar f(x) = \sum_{m=1}^M f_m(x)$. See Sec. 6 in Aronszajn (1950) and Micchelli & Pontil (2005).
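To make the combined-kernel problem above concrete, here is a minimal numpy sketch (my own illustration on toy data, with the squared loss standing in for the generic loss $\ell$ and no bias term; the kernels, weights, and data are made-up assumptions, not the authors' code). It solves the combined-kernel problem and checks that the implied components $f_m = d_m k_m(\cdot, X)\alpha$ reproduce $\bar f$ and its RKHS norm.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data and two base kernels (assumptions for illustration only).
X = rng.randn(40, 3)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(40)

def gaussian_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

Ks = [gaussian_gram(X, w) for w in (0.5, 2.0)]   # K_1, K_2
d = np.array([0.3, 0.7])                          # fixed kernel weights d_m

# Combined Gram matrix K_d = sum_m d_m K_m.
K_d = sum(dm * Km for dm, Km in zip(d, Ks))

# Representer theorem: f_bar = sum_i alpha_i k_d(., x_i).
# Squared loss + (C/2)||f_bar||^2 gives alpha = (K_d + C I)^{-1} y
# (up to the exact scaling convention of the loss).
C = 1.0
alpha = np.linalg.solve(K_d + C * np.eye(len(y)), y)

# The implied components f_m(x_i) = d_m (K_m alpha)_i sum back to f_bar,
# and ||f_m||_{H_m}^2 = d_m^2 alpha' K_m alpha, so
# sum_m ||f_m||^2 / d_m = sum_m d_m alpha' K_m alpha = ||f_bar||^2_{H(d)}.
f_components = [dm * Km @ alpha for dm, Km in zip(d, Ks)]
print(np.allclose(sum(f_components), K_d @ alpha))             # True
print(np.isclose(sum(dm * alpha @ Km @ alpha for dm, Km in zip(d, Ks)),
                 alpha @ K_d @ alpha))                          # True
```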
Regularization Strategies
Ivanov regularization

We can constrain the size of the kernel weights $d_m$ by
$$\underset{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R},\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2}\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}, \quad \text{s.t.}\ \sum_{m=1}^M h(d_m) \le 1, \qquad (2)$$
where $h$ is convex and increasing. This is equivalent to the more common expression
$$\underset{\substack{f \in \mathcal{H}(d),\ b \in \mathbb{R},\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\bigl(y_i,\ f(x_i) + b\bigr) + \frac{C}{2}\,\|f\|_{\mathcal{H}(d)}^2, \quad \text{s.t.}\ \sum_{m=1}^M h(d_m) \le 1.$$
Tikhonov regularization

We can penalize the size of the kernel weights $d_m$ by
$$\underset{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R},\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2}\sum_{m=1}^M \Bigl(\frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \mu\, h(d_m)\Bigr). \qquad (3)$$
Note that the above is equivalent to
$$\underset{\substack{f \in \mathcal{H}(d),\ b \in \mathbb{R},\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \underbrace{\sum_{i=1}^N \ell\bigl(y_i,\ f(x_i) + b\bigr)}_{\text{data-fit}} + \underbrace{\frac{C}{2}\,\|f\|_{\mathcal{H}(d)}^2}_{f\text{-prior}} + \underbrace{\frac{C\mu}{2}\sum_{m=1}^M h(d_m)}_{d_m\text{-hyper-prior}}.$$
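To make the Tikhonov formulation concrete, here is a minimal numpy sketch of the standard alternating scheme for the squared loss and $h(d_m) = d_m^p$: solve a kernel ridge problem with the current weights, then update each $d_m$ in closed form by minimizing $\|f_m\|^2/d_m + \mu d_m^p$. The data, kernels, and constants are toy assumptions of mine, not the authors' implementation.

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(30, 2)
y = np.sign(X[:, 0] + 0.3 * rng.randn(30)).astype(float)

def gaussian_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

Ks = [gaussian_gram(X, w) for w in (0.3, 1.0, 3.0)]
C, p = 1.0, 1.5
mu = 1.0 / p                             # mu = 1/p as on the l_p-norm MKL slide
d = np.ones(len(Ks))                     # kernel weights d_m

for it in range(50):
    # f-step: squared loss + (C/2) sum_m ||f_m||^2 / d_m is kernel ridge
    # regression with the combined Gram matrix K_d = sum_m d_m K_m.
    K_d = sum(dm * Km for dm, Km in zip(d, Ks))
    alpha = np.linalg.solve(K_d + C * np.eye(len(y)), y)
    # ||f_m||_{H_m}^2 = d_m^2 alpha' K_m alpha for f_m = d_m K_m(., X) alpha.
    norms_sq = np.array([dm**2 * alpha @ Km @ alpha for dm, Km in zip(d, Ks)])
    # d-step: argmin_d ||f_m||^2 / d + mu d^p = (||f_m||^2 / (mu p))^(1/(1+p)).
    d = (norms_sq / (mu * p)) ** (1.0 / (1 + p)) + 1e-12

print("learned kernel weights:", np.round(d, 3))
```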
Are these two formulations equivalent?

Previously thought: yes, but the choice of the pair (C, µ) is complicated, because in the Tikhonov formulation we have to choose both C and µ (Kloft et al., 2010).

We show that, if we give up the constant 1 in the Ivanov constraint $\sum_{m=1}^M h(d_m) \le 1$:
- the two formulations correspond via equivalent block-norm formulations,
- C and µ can be chosen independently,
- the constant 1 has no meaning.
Ivanov ⇒ block-norm formulation 1 (known)

Let $h(d_m) = d_m^p$ ($\ell_p$-norm MKL); see Kloft et al. (2010):
$$\underset{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R},\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2}\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}, \quad \text{s.t.}\ \sum_{m=1}^M d_m^p \le 1.$$
⇓ (Jensen's inequality)
$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2}\Bigl(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q\Bigr)^{2/q},$$
where $q = 2p/(1+p)$. The minimum is attained at $d_m \propto \|f_m\|_{\mathcal{H}_m}^{2/(1+p)}$.
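A quick numeric sanity check of this reduction (my own illustration, on made-up norm values): for random norms $\|f_m\|$, the constrained minimum of $\sum_m \|f_m\|^2/d_m$ over $\sum_m d_m^p \le 1$ equals $(\sum_m \|f_m\|^q)^{2/q}$ and is attained at $d_m \propto \|f_m\|^{2/(1+p)}$.

```python
import numpy as np

rng = np.random.RandomState(2)
p = 1.5
q = 2 * p / (1 + p)
norms = rng.rand(5) + 0.1           # stand-ins for ||f_m||_{H_m}

def objective(d):
    return np.sum(norms**2 / d)

# Claimed minimizer: d_m proportional to ||f_m||^{2/(1+p)}, scaled so sum d_m^p = 1.
d_star = norms ** (2.0 / (1 + p))
d_star /= (d_star**p).sum() ** (1.0 / p)

claimed_min = (np.sum(norms**q)) ** (2.0 / q)
print(np.isclose(objective(d_star), claimed_min))        # True

# Random feasible weights never beat the closed-form minimizer.
for _ in range(1000):
    d = rng.rand(5) + 1e-6
    d /= (d**p).sum() ** (1.0 / p)                        # project onto sum d^p = 1
    assert objective(d) >= claimed_min - 1e-9
print("all random feasible d give a larger objective")
```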
Tikhonov ⇒ block-norm formulation 2 (new)

Let $h(d_m) = d_m^p$, $\mu = 1/p$ ($\ell_p$-norm MKL):
$$\underset{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R},\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{2}\sum_{m=1}^M \Bigl(\frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \frac{d_m^p}{p}\Bigr).$$
⇓ (Young's inequality)
$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{q}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q,$$
where $q = 2p/(1+p)$. The minimum is attained at $d_m = \|f_m\|_{\mathcal{H}_m}^{2/(1+p)}$.
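The per-kernel reduction above is a one-dimensional minimization that can be checked numerically. The following sketch (my own illustration, assuming scipy is available) confirms that $\min_{d>0}\ \frac{a^2}{2d} + \frac{d^p}{2p} = \frac{a^q}{q}$ with the minimizer $d = a^{2/(1+p)}$, where $a$ stands for $\|f_m\|_{\mathcal{H}_m}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p = 1.5
q = 2 * p / (1 + p)
a = 0.7                       # stands in for ||f_m||_{H_m}

obj = lambda d: a**2 / (2 * d) + d**p / (2 * p)

res = minimize_scalar(obj, bounds=(1e-8, 10.0), method="bounded")
print(np.isclose(res.x, a ** (2 / (1 + p)), atol=1e-4))   # minimizer d_m
print(np.isclose(res.fun, a**q / q, atol=1e-8))           # minimum value
```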
The two block-norm formulations are equivalent

Block-norm formulation 1 (from Ivanov):
$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{\tilde C}{2}\Bigl(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q\Bigr)^{2/q}.$$
Block-norm formulation 2 (from Tikhonov):
$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + \frac{C}{q}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^q.$$
We just have to map $C$ and $\tilde C$; the implied kernel weights are normalized/unnormalized, respectively.
Generalized block-norm formulation

$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + C \sum_{m=1}^M g\bigl(\|f_m\|_{\mathcal{H}_m}^2\bigr), \qquad (4)$$
where $g$ is a concave block-norm-based regularizer.

Example (Elastic-net MKL): $g(x) = (1-\lambda)\sqrt{x} + \frac{\lambda}{2}x$,
$$\underset{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\ b \in \mathbb{R}}{\text{minimize}}\ \ \sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Bigr) + C \sum_{m=1}^M \Bigl((1-\lambda)\|f_m\|_{\mathcal{H}_m} + \frac{\lambda}{2}\|f_m\|_{\mathcal{H}_m}^2\Bigr).$$
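Block-norm regularizers of this kind are typically handled with proximal methods. As an illustration (my own sketch, not necessarily how the SpicyMKL code mentioned at the end implements it), the proximal operator of the elastic-net block penalty $\eta\bigl((1-\lambda)\|v\| + \frac{\lambda}{2}\|v\|^2\bigr)$ has the closed form below: group soft-thresholding followed by shrinkage.

```python
import numpy as np

def elastic_net_group_prox(v, eta, lam):
    """argmin_x 0.5||x - v||^2 + eta*((1-lam)*||x|| + 0.5*lam*||x||^2).

    Group soft-thresholding by eta*(1-lam), then scaling by 1/(1 + eta*lam).
    """
    norm_v = np.linalg.norm(v)
    if norm_v <= eta * (1 - lam):
        return np.zeros_like(v)
    return (1 - eta * (1 - lam) / norm_v) * v / (1 + eta * lam)

# Tiny numeric check: the closed form beats random perturbations of itself.
rng = np.random.RandomState(3)
v, eta, lam = rng.randn(4), 0.8, 0.5
x_star = elastic_net_group_prox(v, eta, lam)
f = lambda x: 0.5 * np.sum((x - v) ** 2) + eta * ((1 - lam) * np.linalg.norm(x)
                                                  + 0.5 * lam * np.sum(x ** 2))
assert all(f(x_star) <= f(x_star + 0.01 * rng.randn(4)) for _ in range(200))
print("prox value:", np.round(x_star, 3))
```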
Generalized block-norm ⇒ Tikhonov regularization

Theorem. The correspondence between the convex (kernel-weight-based) regularizer $h(d_m)$ and the concave (block-norm-based) regularizer $g(x)$ is given by
$$\mu\, h(d_m) = -2\, g^*\!\Bigl(\frac{1}{2 d_m}\Bigr),$$
where $g^*$ is the concave conjugate of $g$.

Proof: use the concavity of $g$ as
$$\frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m} \ \ge\ g\bigl(\|f_m\|_{\mathcal{H}_m}^2\bigr) + g^*\!\Bigl(\frac{1}{2 d_m}\Bigr).$$
See also Palmer et al. (2006).
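The mapping in the theorem can be probed numerically. The sketch below (my own illustration) computes the concave conjugate $g^*(y) = \inf_x\,(xy - g(x))$ on a grid for $g(x) = \sqrt{x}$ and checks that $-2\,g^*(1/(2 d_m)) = d_m$, i.e., block 1-norm MKL corresponds to $h(d_m) = d_m$ with $\mu = 1$, as in the correspondence table that follows.

```python
import numpy as np

def concave_conjugate(g, y, xs):
    """g*(y) = inf_x (x*y - g(x)), approximated on the grid xs."""
    return np.min(xs[None, :] * y[:, None] - g(xs)[None, :], axis=1)

xs = np.linspace(1e-9, 200.0, 200000)        # grid over x >= 0
d = np.linspace(0.1, 3.0, 20)                # kernel weights d_m
y = 1.0 / (2 * d)

h_times_mu = -2 * concave_conjugate(np.sqrt, y, xs)

# For g(x) = sqrt(x): g*(y) = -1/(4y), hence -2 g*(1/(2d)) = d, i.e. mu h(d) = d.
print(np.allclose(h_times_mu, d, atol=1e-3))
```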
Examples

Generalized Young's inequality: $xy \ge g(x) + g^*(y)$, where $g$ is concave and $g^*$ is the concave conjugate of $g$.

Example 1: let $g(x) = \sqrt{x}$; then $g^*(y) = -1/(4y)$ and
$$\frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m} + \frac{d_m}{2} \ \ge\ \|f_m\|_{\mathcal{H}_m} \qquad \text{(L1-MKL)}.$$
Example 2: let $g(x) = x^{q/2}/q$ ($1 \le q \le 2$); then $g^*(y) = \frac{q-2}{2q}(2y)^{q/(q-2)}$ and
$$\frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m} + \frac{d_m^p}{2p} \ \ge\ \frac{1}{q}\|f_m\|_{\mathcal{H}_m}^q \qquad (\ell_p\text{-norm MKL}),$$
where $p := q/(2-q)$.
Correspondence

| MKL model                             | block-norm $g(x)$                          | kernel weight $h(d_m)$                    | reg. const. $\mu$ |
|---------------------------------------|--------------------------------------------|-------------------------------------------|-------------------|
| block 1-norm MKL                      | $\sqrt{x}$                                 | $d_m$                                     | $1$               |
| $\ell_p$-norm MKL                     | $\frac{1+p}{2p}\, x^{p/(1+p)}$             | $d_m^p$                                   | $1/p$             |
| Uniform-weight MKL (block 2-norm MKL) | $x/2$                                      | $I_{[0,1]}(d_m)$                          | $+0$              |
| block q-norm MKL ($q > 2$)            | $x^{q/2}/q$                                | $d_m^{-q/(q-2)}$                          | $-(q-2)/q$        |
| Elastic-net MKL                       | $(1-\lambda)\sqrt{x} + \frac{\lambda}{2}x$ | $\frac{(1-\lambda) d_m}{1 - \lambda d_m}$ | $1-\lambda$       |

$I_{[0,1]}(x)$ is the indicator function of the closed interval $[0,1]$; i.e., $I_{[0,1]}(x) = 0$ if $x \in [0,1]$, and $+\infty$ otherwise.
Empirical Bayesian MKL
Bayesian view

Tikhonov regularization as a hierarchical MAP estimation:
$$\underset{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M,\\ d_1 \ge 0, \ldots, d_M \ge 0}}{\text{minimize}}\ \ \underbrace{\sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i)\Bigr)}_{\text{likelihood}} + \underbrace{\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{2 d_m}}_{f_m\text{-prior}} + \underbrace{\mu \sum_{m=1}^M h(d_m)}_{d_m\text{-hyper-prior}}.$$
Hyper-prior over the kernel weights:
$$d_m \sim \frac{1}{Z_1(\mu)} \exp\bigl(-\mu\, h(d_m)\bigr) \quad (m = 1, \ldots, M).$$
Gaussian process prior for the functions:
$$f_m \sim \mathrm{GP}(f_m;\, 0,\, d_m k_m) \quad (m = 1, \ldots, M).$$
Likelihood:
$$y_i \sim \frac{1}{Z_2(x_i)} \exp\Bigl(-\ell\bigl(y_i,\ \textstyle\sum_{m=1}^M f_m(x_i)\bigr)\Bigr).$$
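To make the hierarchical model concrete, the following sketch (my own illustration, assuming a Gaussian likelihood and an exponential hyper-prior, i.e. $h(d) = d$) samples data from it: draw kernel weights, draw each $f_m$ at the training inputs from a zero-mean Gaussian with covariance $d_m K_m$, then draw $y$ around the sum.

```python
import numpy as np

rng = np.random.RandomState(4)
n, mu, sigma_y = 25, 1.0, 0.3
X = rng.randn(n, 2)

def gaussian_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

Ks = [gaussian_gram(X, w) for w in (0.5, 1.0, 2.0)]

# d_m-hyper-prior: exp(-mu * h(d_m)) with h(d) = d (an exponential distribution).
d = rng.exponential(scale=1.0 / mu, size=len(Ks))

# f_m-prior: GP(0, d_m k_m), evaluated at the training inputs.
fs = [rng.multivariate_normal(np.zeros(n), dm * Km + 1e-10 * np.eye(n))
      for dm, Km in zip(d, Ks)]

# Likelihood: Gaussian observation noise around the sum of the components.
y = sum(fs) + sigma_y * rng.randn(n)
print("sampled kernel weights:", np.round(d, 3), "and", len(y), "observations")
```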
Marginalized likelihood

Assume the Gaussian likelihood $\ell(y, z) = \frac{1}{2\sigma_y^2}(y - z)^2$. The marginalized likelihood (omitting the hyper-prior for simplicity) is
$$-\log p(\boldsymbol{y} \mid d) = \underbrace{\frac{1}{2\sigma_y^2}\Bigl\|\boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m^{\mathrm{MAP}}\Bigr\|^2}_{\text{likelihood}} + \underbrace{\frac{1}{2}\sum_{m=1}^M \frac{\|f_m^{\mathrm{MAP}}\|_{\mathcal{H}_m}^2}{d_m}}_{f_m\text{-prior}} + \underbrace{\frac{1}{2}\log\bigl|\bar K(d)\bigr|}_{\text{volume-based regularization}},$$
where $f_m^{\mathrm{MAP}}$ is the MAP estimate for fixed kernel weights $d_m$ ($m = 1, \ldots, M$) and $\bar K(d) := \sigma_y^2 I_N + \sum_{m=1}^M d_m K_m$. See also Wipf & Nagarajan (2009).
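For the Gaussian case this decomposition can be verified directly: up to the constant $\frac{N}{2}\log 2\pi$, the three terms above agree with the negative log density of $\boldsymbol{y} \sim \mathcal{N}(0, \bar K(d))$ when $\boldsymbol{f}_m^{\mathrm{MAP}} = d_m K_m \bar K(d)^{-1}\boldsymbol{y}$. The sketch below (my own check, on toy kernels and data) does exactly that.

```python
import numpy as np

rng = np.random.RandomState(5)
n, sigma_y = 20, 0.5
X = rng.randn(n, 2)

def gaussian_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

Ks = [gaussian_gram(X, w) for w in (0.5, 2.0)]
d = np.array([0.4, 1.3])
y = rng.randn(n)

K_bar = sigma_y**2 * np.eye(n) + sum(dm * Km for dm, Km in zip(d, Ks))
alpha = np.linalg.solve(K_bar, y)

# MAP components for fixed d: f_m = d_m K_m alpha, ||f_m||_{H_m}^2 = d_m^2 alpha' K_m alpha.
f_maps = [dm * Km @ alpha for dm, Km in zip(d, Ks)]
decomposed = (np.sum((y - sum(f_maps)) ** 2) / (2 * sigma_y**2)
              + 0.5 * sum(dm * alpha @ Km @ alpha for dm, Km in zip(d, Ks))
              + 0.5 * np.linalg.slogdet(K_bar)[1])

# Direct negative log marginal likelihood of y ~ N(0, K_bar), dropping (n/2) log(2 pi).
direct = 0.5 * y @ np.linalg.solve(K_bar, y) + 0.5 * np.linalg.slogdet(K_bar)[1]
print(np.isclose(decomposed, direct))    # True
```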
Comparing MAP and empirical Bayes objectives

Hyper-prior MAP (MKL):
$$\underbrace{\sum_{i=1}^N \ell\Bigl(y_i,\ \sum_{m=1}^M f_m(x_i)\Bigr)}_{\text{likelihood}} + \underbrace{\frac{1}{2}\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}}_{f_m\text{-prior}} + \underbrace{\mu \sum_{m=1}^M h(d_m)}_{d_m\text{-hyper-prior (separable)}}.$$
Empirical Bayes:
$$\underbrace{\frac{1}{2\sigma_y^2}\Bigl\|\boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m^{\mathrm{MAP}}\Bigr\|^2}_{\text{likelihood}} + \underbrace{\frac{1}{2}\sum_{m=1}^M \frac{\|f_m^{\mathrm{MAP}}\|_{\mathcal{H}_m}^2}{d_m}}_{f_m\text{-prior}} + \underbrace{\frac{1}{2}\log\bigl|\bar K(d)\bigr|}_{\text{volume-based regularization (non-separable)}}.$$
Experiments
Caltech 101 dataset (classification): Cannon vs Cup

[Figure: classification accuracy (0.5 to 1.0) vs. number of samples per class (0 to 50) for MKL (logit), Uniform, MKL (square), ElasticnetMKL (λ=0.5), and BayesMKL.]

The regularization constant C was chosen by 2×4-fold cross validation on the training set.
Caltech 101 dataset: kernel weights

1,760 kernel functions:
- 4 SIFT features (hsvsift, sift, sift4px, sift8px)
- 22 spatial decompositions (including the spatial pyramid kernel)
- 2 kernel functions (Gaussian and χ²)
- 10 kernel parameters
[Figure: learned weights over the 1,760 kernels (indices 0 to 2000) for each method, with accuracies MKL (logit) 0.82, Uniform 0.92, MKL (square) 0.80, ElasticnetMKL (λ=0.5) 0.97, BayesMKL 0.82.]
Caltech 101 dataset: kernel weights (detail)

[Figure: close-up of the learned kernel weights (kernel indices roughly 200 to 700) for MKL (logit), Uniform, MKL (square), ElasticnetMKL, and BayesMKL, with the χ²-kernel and Gaussian-kernel blocks marked; per-method accuracies as in the previous figure.]
Summary
- The two regularized kernel-weight learning formulations, Ivanov regularization and Tikhonov regularization, are equivalent. No additional tuning parameter!
- Both formulations reduce to block-norm formulations via Jensen's inequality / the (generalized) Young's inequality.
- Probabilistic view of MKL: a hierarchical Gaussian process model.
- Elastic-net MKL performs similarly to uniform-weight MKL, but shows grouping of mutually dependent kernels.
- Empirical-Bayes MKL and L1-MKL seem to make the solution overly sparse, but they often choose slightly different sets of kernels.
- Code for Elastic-net MKL is available from http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/SpicyMKL
Appendix
Acknowledgements
We would like to thank Hisashi Kashima and Shinichi Nakajima for helpful discussions. This work was supported in part by MEXT KAKENHI 22700138, 22700289, and NTT Communication Science Laboratories.
A brief proof

Minimize the Lagrangian
$$\min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M}\ \ \frac{1}{2}\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \underbrace{\Bigl\langle g,\ \bar f - \sum_{m=1}^M f_m \Bigr\rangle_{\mathcal{H}(d)}}_{\text{equality constraint}},$$
where $g \in \mathcal{H}(d)$ is a Lagrangian multiplier. The Fréchet derivative condition
$$\Bigl\langle h_m,\ \frac{f_m}{d_m} \Bigr\rangle_{\mathcal{H}_m} - \langle g,\ h_m \rangle_{\mathcal{H}(d)} = 0 \ \ (\forall h_m \in \mathcal{H}_m) \quad\Rightarrow\quad f_m(x) = \langle g,\ d_m k_m(\cdot, x) \rangle_{\mathcal{H}(d)}.$$
Maximize the dual:
$$\max_{g \in \mathcal{H}(d)}\ -\frac{1}{2}\|g\|_{\mathcal{H}(d)}^2 + \bigl\langle g,\ \bar f \bigr\rangle_{\mathcal{H}(d)} = \frac{1}{2}\|\bar f\|_{\mathcal{H}(d)}^2.$$
References

- Aronszajn. Theory of Reproducing Kernels. TAMS, 1950.
- Lanckriet et al. Learning the Kernel Matrix with Semidefinite Programming. JMLR, 2004.
- Bach et al. Multiple kernel learning, conic duality, and the SMO algorithm. ICML, 2004.
- Micchelli & Pontil. Learning the kernel function via regularization. JMLR, 2005.
- Cortes. Can learning kernels help performance? ICML, 2009.
- Cortes et al. Generalization Bounds for Learning Kernels. ICML, 2010.
- Kloft et al. Efficient and accurate lp-norm multiple kernel learning. NIPS 22, 2010.
- Tomioka & Suzuki. Sparsity-accuracy trade-off in MKL. arXiv, 2010.
- Varma & Babu. More Generality in Efficient Multiple Kernel Learning. ICML, 2009.
- Gehler & Nowozin. Let the kernel figure it out: principled learning of pre-processing for kernel classifiers. CVPR, 2009.
- Tipping. Sparse Bayesian learning and the relevance vector machine. JMLR, 2001.
- Palmer et al. Variational EM Algorithms for Non-Gaussian Latent Variable Models. NIPS, 2006.
- Wipf & Nagarajan. A new view of automatic relevance determination. NIPS, 2008.
Method A: upper-bounding the log-det term

Use the upper bound
$$\log\bigl|\bar K(d)\bigr| \ \le\ \sum_{m=1}^M z_m d_m - \psi^*(z),$$
and eliminate the kernel weights by explicit minimization (AM-GM inequality).

Update $\boldsymbol{f}_m$ as
$$(\boldsymbol{f}_m)_{m=1}^M \leftarrow \underset{(\boldsymbol{f}_m)_{m=1}^M}{\operatorname{argmin}}\ \ \frac{1}{2\sigma_y^2}\Bigl\|\boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m\Bigr\|^2 + \sum_{m=1}^M \sqrt{z_m}\,\|\boldsymbol{f}_m\|_{K_m}.$$
Update $z_m$ as (tightening the upper bound)
$$z_m \leftarrow \operatorname{Tr}\Bigl(\bigl(\sigma_y^2 I_N + \textstyle\sum_{m=1}^M d_m K_m\bigr)^{-1} K_m\Bigr),$$
where $d_m = \|f_m\|_{\mathcal{H}_m}/\sqrt{z_m}$.

Each update step is a reweighted L1-MKL problem, and each update step minimizes an upper bound of the negative log marginalized likelihood.
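The upper bound used in Method A can be checked numerically when $z$ is taken as the gradient of the concave function $d \mapsto \log|\bar K(d)|$ at a reference point $d^0$, i.e. $z_m = \operatorname{Tr}(\bar K(d^0)^{-1} K_m)$, with $\psi^*(z)$ absorbing the constant offset. The sketch below (my own check, on toy kernels) verifies that the resulting tangent plane upper-bounds the log-det term.

```python
import numpy as np

rng = np.random.RandomState(6)
n, sigma_y = 15, 0.5
X = rng.randn(n, 2)

def gaussian_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

Ks = [gaussian_gram(X, w) for w in (0.5, 1.5, 4.0)]

def logdet_Kbar(d):
    K_bar = sigma_y**2 * np.eye(n) + sum(dm * Km for dm, Km in zip(d, Ks))
    return np.linalg.slogdet(K_bar)[1]

# Linearize at a reference point d0: z_m = Tr(Kbar(d0)^{-1} K_m).
d0 = np.array([0.5, 1.0, 2.0])
K_bar0 = sigma_y**2 * np.eye(n) + sum(dm * Km for dm, Km in zip(d0, Ks))
z = np.array([np.trace(np.linalg.solve(K_bar0, Km)) for Km in Ks])
offset = logdet_Kbar(d0) - z @ d0           # plays the role of -psi*(z)

# Concavity of d -> log|Kbar(d)| makes the tangent plane an upper bound.
ok = True
for _ in range(200):
    d = rng.rand(3) * 3
    ok = ok and logdet_Kbar(d) <= z @ d + offset + 1e-9
print(ok)    # True
```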
Method B: MacKay update

Use the fixed-point condition for the update of the weights:
$$-\frac{\|\boldsymbol{f}_m\|_{K_m}^2}{d_m^2} + \operatorname{Tr}\Bigl(\bigl(\sigma_y^2 I_N + \textstyle\sum_{m=1}^M d_m K_m\bigr)^{-1} K_m\Bigr) = 0.$$
Update $\boldsymbol{f}_m$ as
$$(\boldsymbol{f}_m)_{m=1}^M \leftarrow \underset{(\boldsymbol{f}_m)_{m=1}^M}{\operatorname{argmin}}\ \ \frac{1}{2\sigma_y^2}\Bigl\|\boldsymbol{y} - \sum_{m=1}^M \boldsymbol{f}_m\Bigr\|^2 + \frac{1}{2}\sum_{m=1}^M \frac{\|\boldsymbol{f}_m\|_{K_m}^2}{d_m}.$$
Update the kernel weights $d_m$ as
$$d_m \leftarrow \frac{\|\boldsymbol{f}_m\|_{K_m}^2}{\operatorname{Tr}\Bigl(\bigl(\sigma_y^2 I_N + \sum_{m=1}^M d_m K_m\bigr)^{-1} d_m K_m\Bigr)}.$$
Each update step is a fixed-kernel-weight learning problem (easy). Convergence is empirically OK (e.g., as in the RVM).
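For the Gaussian-likelihood case the whole Method B loop fits in a few lines of numpy. The following is a minimal sketch under a toy setup of my own (it uses the RKHS-norm form $\|f_m\|_{\mathcal{H}_m}^2 = d_m^2\, \alpha^\top K_m \alpha$ of the MAP solution $f_m = d_m K_m \alpha$ with $\alpha = \bar K(d)^{-1}\boldsymbol{y}$); it illustrates the update rule, not the authors' implementation.

```python
import numpy as np

rng = np.random.RandomState(7)
n, sigma_y = 40, 0.3
X = rng.randn(n, 3)
y = np.sin(2 * X[:, 0]) + 0.3 * rng.randn(n)   # only the first input matters

def gaussian_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

# One kernel per input dimension; ideally only the first keeps a large weight.
Ks = [gaussian_gram(X[:, [j]], 1.0) for j in range(X.shape[1])]
d = np.ones(len(Ks))

for it in range(100):
    K_bar = sigma_y**2 * np.eye(n) + sum(dm * Km for dm, Km in zip(d, Ks))
    alpha = np.linalg.solve(K_bar, y)            # MAP solve for fixed weights
    K_bar_inv = np.linalg.inv(K_bar)
    # MacKay update: d_m <- ||f_m||_{H_m}^2 / Tr(K_bar^{-1} d_m K_m)
    #              = d_m * (alpha' K_m alpha) / Tr(K_bar^{-1} K_m).
    d = np.array([dm * (alpha @ Km @ alpha) / np.trace(K_bar_inv @ Km)
                  for dm, Km in zip(d, Ks)])

print("learned kernel weights:", np.round(d, 3))
```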
Update the kernel weights dm as ∥f m ∥2K m ´. dm ← ³ P −1 d K Tr (σ 2 I N + M d K ) m m m=1 m m Each update step is a fixed kernel weight leraning problem (easy). Convergence empirically OK (e.g., RVM). Ryota Tomioka (Univ Tokyo) Generalized MKL 2010-12-11 25 / 25