Regularization for Multiple Kernel Learning via Sum-Product Networks
Abstract

In this paper, we are interested in constructing general graph-based regularizers for multiple kernel learning (MKL), given a structure that describes how basis kernels are combined. In our method such structures are represented by sum-product networks (SPNs). Accordingly, we propose a new convex regularization method for MKL based on a path-dependent kernel weighting function that encodes the entire SPN structure. Under certain conditions, and from a probabilistic point of view, this function can be considered to follow multinomial distributions over the weights associated with the product nodes in SPNs. We also analyze the convexity of our regularizer and the complexity of the induced classifiers, and further propose an efficient wrapper algorithm to optimize our formulation. In our experiments, we apply our method to ......
1. Introduction

In the real world, information is often organized under certain structures, which can be considered as prior knowledge about the information. For instance, to understand a 2D scene, we can decompose the scene as "scene → objects → parts → regions → pixels" and reason about the relations between them (Ladicky, 2011). Using such structures, we can answer questions like "what and where the objects are" (Ladicky et al., 2010) and "what the geometric relations between the objects are" (Desai et al., 2011). Therefore, information structures are very important and useful for information integration and reasoning.

Multiple kernel learning (MKL) is a powerful tool for information integration, which aims to learn optimal kernels for a task by combining different basis kernels linearly (Rakotomamonjy et al., 2008; Xu et al., 2010; Kloft et al., 2011) or nonlinearly (Bach, 2008; Cortes et al., 2009; Varma & Babu, 2009) with certain constraints on the kernel weights. A nice review of different MKL algorithms was given in (Gönen & Alpaydın, 2011), and some regularization strategies on kernel weights were discussed in (Tomioka & Suzuki, 2011).

Recently, structure-induced regularization methods have been attracting more and more attention (Bach et al., 2011; Maurer & Pontil, 2012; van de Geer, 2013; Lin et al., 2014). For MKL, Bach (2008) proposed a hierarchical kernel selection (or, more precisely, kernel decomposition) method based on directed acyclic graphs (DAGs) using structured sparsity-inducing norms such as the $\ell_1$ norm or the block $\ell_1$ norm (Jenatton et al., 2011). Szafranski et al. (2010) proposed a composite kernel learning method based on tree structures, where the regularization term in their optimization formulation is a composite absolute penalty (Zhao et al., 2009). Although the structural information about how basis kernels are combined is taken into account when constructing these regularizers, the weights of the nodes in the structures appear independently in the regularizers. Such formulations actually weaken the connections between the nodes in the structures, making the learning rather easy.

To distinguish our work from previous research on regularization for MKL: (1) We utilize sum-product networks (SPNs) (Poon & Domingos, 2011) to describe the procedure of combining basis kernels. An SPN is a general and powerful deep graphical representation consisting of only sum nodes and product nodes. Considering that the optimal kernel in MKL is created using summations and/or multiplications between non-negative weights and basis kernels, this procedure can be naturally described by SPNs. Notice that in general SPNs may not describe kernel embedding directly (Zhuang et al., 2011; Strobl & Visweswaran, 2013); however, using Taylor series we can still approximate kernel embedding using SPNs. (2) We accordingly propose a convex regularization method based on a new path-dependent kernel weighting function, which encodes the entire structure of an SPN. This function can be considered to follow multinomial distributions, involving much stronger connections between the node weights. We also analyze the convexity of our regularizer and the
Rademacher complexity of the induced MKL classifiers. Further, we propose an efficient wrapper algorithm to solve our problem, in which the weights are updated using gradient descent methods (Palomar & Eldar, 2010).

The rest of this paper is organized as follows. In Section 2, we explain how to describe the kernel combination procedure using SPNs and introduce our path-dependent kernel weighting function based on SPNs. In Section 3, we provide the details of our regularization method, namely SPN-MKL, including the analysis of regularizer convexity and Rademacher complexity, and our optimization algorithm. We show our experimental results and comparisons among different methods on ......
2. Path-dependent Kernel Weighting Function

2.1. Sum-Product Networks

A sum-product network (SPN) is a rooted directed acyclic graph (DAG) whose internal nodes are sums and products (Poon & Domingos, 2011). Given an SPN for MKL, we denote a path from the root node to a leaf node (i.e. a kernel) as $m \in \mathcal{M}$, where $\mathcal{M}$ consists of all the paths, and a product node as $v \in \mathcal{V}$, where $\mathcal{V}$ consists of all such nodes. Along each path $m$, we call the sub-path between any pair of adjacent sum nodes, or between a leaf node and its adjacent sum node, a layer, denoted $m_l$ ($l \ge 1$), and denote the number of layers along $m$ as $N_m$. We denote the number of product nodes in layer $m_l$ as $N_{m_l}$. We also denote the weights associated with path $m$ and the weight of the $n$th product node in layer $m_l$ as $\boldsymbol{\beta}_m$ and $\beta_{m_l^n}$, respectively, where $\boldsymbol{\beta}_m$ is the vector consisting of all $\{\beta_{m_l^n}\}$. There is no weight associated with any sum node in the SPN.

Fig. 1 gives an example of constructing an SPN for basis kernel combination by embedding atomic SPNs into each other. Atomic SPNs in our method are SPNs with a single layer. Given the SPN shown at the bottom left of Fig. 1 and its node weights, we can easily calculate the optimal kernel as $K_{opt} = \beta_8(\beta_1 K_1 + \beta_2 K_2) + \beta_9(\beta_3 K_3 + \beta_4 K_4) \circ (\beta_5 K_5 + \beta_6 K_6 + \beta_7 K_7)$, where $\circ$ denotes the entry-wise product between two matrices. Moreover, we can rewrite $K_{opt}$ as $K_{opt} = \beta_8(\beta_1 K_1 + \beta_2 K_2) + \beta_9 \sum_{i=3}^{4}\sum_{j=5}^{7} \beta_i \beta_j (K_i \circ K_j)$, whose combination procedure can be described using the SPN at the bottom right of Fig. 1. Here, for all $i, j$, $K_i \circ K_j$ is a path-dependent kernel. For instance, the kernel corresponding to the path marked by the red edges in the bottom right figure is $K_4 \circ K_6$. In fact, such kernel combination procedures for MKL can always be represented using SPNs in similar ways, as shown in the bottom right figure.
Figure 1. An example (i.e. bottom left) of constructing an SPN for basis kernel combination by embedding atomic SPNs into each other. All the weights (i.e. β’s) associated with product nodes in the SPN are learned in our method. The red edges in the bottom right graph denote a path from the root to a path-dependent kernel. This figure is best viewed in color.
Traditionally, SPNs are considered as probabilistic models and learned in an unsupervised manner (Gens & Domingos, 2012; Peharz et al., 2013; Poon & Domingos, 2011). In our method, however, we only utilize SPNs as representations to describe the kernel combination procedure, and we learn the weights associated with their product nodes for MKL. In addition, from the perspective of structures for kernel combination, many existing MKL methods, e.g. (Rakotomamonjy et al., 2008; Cortes et al., 2009; Xu et al., 2010; Szafranski et al., 2010), can be considered special cases of ours.

2.2. Our Kernel Weighting Function

Given an SPN and its associated weights $\boldsymbol{\beta}$, we define our path-dependent kernel weighting function $g_m(\boldsymbol{\beta}_m)$ as follows: for all $m \in \mathcal{M}$,
$$g_m(\boldsymbol{\beta}_m) = \prod_{l=1}^{N_m} \prod_{n=1}^{N_{m_l}} \beta_{m_l^n}^{\frac{1}{N_m N_{m_l}}}. \qquad (1)$$
Taking the red path in Fig. 1 as an example, the kernel weighting function for this path is $g = \beta_9^{\frac{1}{2\times 1}} \beta_4^{\frac{1}{2\times 2}} \beta_6^{\frac{1}{2\times 2}}$, with $N_m = 2$, $N_{m_1} = 1$, and $N_{m_2} = 2$.

Given an SPN and any $m \in \mathcal{M}$, suppose $0 \le \beta_{m_l^n} \le 1$ for all $m_l^n$. Then, from the view of probability, since $N_m$ and $N_{m_l}$ are constants for every $m \in \mathcal{M}$, $g_m$ actually follows a multinomial distribution with variables $\boldsymbol{\beta}_m$ (ignoring the scaling factor). This is different from recent work (Gönen, 2012), where the kernel weights are assumed to follow multivariate normal distributions so that efficient inference can be performed. In contrast, our kernel weighting function is derived directly from the SPN structure, and under a certain simple condition it can guarantee the convexity of our proposed regularizer (see Lemma 1).
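To make the construction concrete, the following Python sketch (our illustration, not code from the paper; the kernel matrices and weight values are made up) combines toy basis kernels according to the bottom-left SPN of Fig. 1, verifies the path-dependent expansion, and evaluates Eq. 1 for the red path.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_kernel(n=5):
    # Random PSD matrix standing in for a basis kernel (illustrative only).
    X = rng.standard_normal((n, 3))
    return X @ X.T

# Basis kernels K1..K7 and product-node weights beta1..beta9 (hypothetical values).
K = {i: toy_kernel() for i in range(1, 8)}
beta = {i: w for i, w in enumerate([0.5, 0.5, 0.4, 0.6, 0.3, 0.3, 0.4, 0.7, 0.3], start=1)}

# K_opt for the bottom-left SPN of Fig. 1 (entry-wise product via numpy's *).
K_opt = beta[8] * (beta[1] * K[1] + beta[2] * K[2]) \
      + beta[9] * (beta[3] * K[3] + beta[4] * K[4]) * (beta[5] * K[5] + beta[6] * K[6] + beta[7] * K[7])

# Equivalent expansion over path-dependent kernels K_i ∘ K_j.
K_expanded = beta[8] * (beta[1] * K[1] + beta[2] * K[2]) \
           + beta[9] * sum(beta[i] * beta[j] * (K[i] * K[j]) for i in (3, 4) for j in (5, 6, 7))
assert np.allclose(K_opt, K_expanded)

def g_m(betas_per_layer):
    """Eq. 1: product over layers l and nodes n of beta^(1/(N_m * N_{m_l}))."""
    N_m = len(betas_per_layer)
    g = 1.0
    for layer in betas_per_layer:
        N_ml = len(layer)
        for b in layer:
            g *= b ** (1.0 / (N_m * N_ml))
    return g

# Red path of Fig. 1 (root -> beta9 -> {beta4, beta6} -> K4 ∘ K6): N_m = 2, N_{m_1} = 1, N_{m_2} = 2.
print(g_m([[beta[9]], [beta[4], beta[6]]]))
```

The final line reproduces the worked value $\beta_9^{1/2}\beta_4^{1/4}\beta_6^{1/4}$ from the text.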
3. SPN-MKL

3.1. Formulation

Given $N_x$ training samples $\{(x_i, y_i)\}$, where each $x_i \in \mathbb{R}^d$ is an input data vector and $y_i \in \{1, -1\}$ is its binary label, we formulate our SPN-MKL for binary classification as follows:
$$\min_{B, W, b} \; \sum_{m \in \mathcal{M}} \frac{\|w_m\|_2^2}{2\, g_m(\boldsymbol{\beta}_m)} \; + \; \lambda \sum_{m \in \mathcal{M}} \sum_{l=1}^{N_m} \sum_{n=1}^{N_{m_l}} \frac{\beta_{m_l^n}^{\,p_{m_l^n}}}{N_m N_{m_l}} \; + \; C \sum_i \ell(x_i, y_i; W, b) \qquad (2)$$
$$\text{s.t.} \quad \forall \beta \in B, \; \beta \ge 0,$$
where $B = \{\boldsymbol{\beta}_m\}_{\forall m \in \mathcal{M}}$ denotes the weight set, $W = \{w_m\}_{\forall m \in \mathcal{M}}$ denotes the classifier parameter set, $b$ denotes the bias term of the MKL classifier, and $\lambda \ge 0$, $C \ge 0$, $P = \{p_v\}_{\forall v \in \mathcal{V}}$ are predefined constants. For each $i$, $\ell(x_i, y_i; W, b) = \max\left\{0, \, 1 - y_i \left(\sum_{m \in \mathcal{M}} w_m^T \phi_m(x_i) + b\right)\right\}$ denotes the hinge loss, where $\phi_m(\cdot)$ denotes a path-dependent kernel mapping function and $(\cdot)^T$ denotes the matrix transpose operator. Our decision function for a given data point $\bar{x}$ is $f(\bar{x}; B, W, b) = \sum_{m \in \mathcal{M}} w_m^T \phi_m(\bar{x}) + b$. Moreover, we define, for all $m, l, n$, $\lim_{\beta_{m_l^n} \to 0^+} \left\{\|w_m\|_2^2 \, \beta_{m_l^n}^{-\frac{1}{N_m N_{m_l}}}\right\} = 0$. This constraint guarantees the continuity of our objective function.

Note that, unlike many existing MKL methods such as SimpleMKL (Rakotomamonjy et al., 2008), in Eq. 2 there is no $\ell_p$ norm constraint on the node weights $\beta$. This makes the weight learning procedure more flexible, dependent only on the data and the predefined SPN structure.
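As a quick illustration of Eq. 2, the helper below (ours, not from the paper; all argument names are hypothetical) evaluates the objective from precomputed quantities, mirroring the three terms of the formulation.

```python
import numpy as np

def spn_mkl_objective(w_sq_norms, g_vals, betas, p_vals, layer_sizes, hinge_losses, lam, C):
    """Evaluate Eq. 2 given precomputed quantities (illustrative sketch).

    w_sq_norms[m]  : ||w_m||_2^2 for each path m
    g_vals[m]      : g_m(beta_m) for each path m
    betas[m]       : list of layers, each a list of beta_{m_l^n} values on path m
    p_vals[m]      : matching nested list of exponents p_{m_l^n}
    layer_sizes[m] : (N_m, [N_{m_l} for each layer l]) for path m
    hinge_losses   : per-sample hinge loss values
    """
    fit = 0.5 * sum(w_sq_norms[m] / g_vals[m] for m in range(len(w_sq_norms)))
    reg = 0.0
    for m in range(len(betas)):
        N_m, N_ml_list = layer_sizes[m]
        for l, layer in enumerate(betas[m]):
            for n, b in enumerate(layer):
                reg += (b ** p_vals[m][l][n]) / (N_m * N_ml_list[l])
    return fit + lam * reg + C * float(np.sum(hinge_losses))
```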
3.2. Analysis

In this section, we analyze the properties of our proposed regularizer and the Rademacher complexity of the induced MKL classifier.

Lemma 1. For all $m \in \mathcal{M}$, $f(w_m, \boldsymbol{\beta}_m) = \frac{\|w_m\|_2^2}{g_m(\boldsymbol{\beta}_m)}$ is convex over both $w_m$ and $\boldsymbol{\beta}_m$.

Proof. Clearly, $f$ is continuous and differentiable with respect to $w_m$ and $\boldsymbol{\beta}_m$, respectively. Given arbitrary $w_m^{(0)}$, $w_m^{(1)}$, $\boldsymbol{\beta}_m^{(0)} \succeq 0$, and $\boldsymbol{\beta}_m^{(1)} \succeq 0$, where $\succeq$ denotes the entry-wise $\ge$ operator, based on the definition of a convex function we need to prove that
$$f(w_m^{(1)}, \boldsymbol{\beta}_m^{(1)}) \ge f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)}) + (w_m^{(1)} - w_m^{(0)})^T \left.\frac{\partial f(w_m, \boldsymbol{\beta}_m^{(0)})}{\partial w_m}\right|_{w_m = w_m^{(0)}} + (\boldsymbol{\beta}_m^{(1)} - \boldsymbol{\beta}_m^{(0)})^T \left.\frac{\partial f(w_m^{(0)}, \boldsymbol{\beta}_m)}{\partial \boldsymbol{\beta}_m}\right|_{\boldsymbol{\beta}_m = \boldsymbol{\beta}_m^{(0)}}.$$
Since
$$\left.\frac{\partial f(w_m, \boldsymbol{\beta}_m^{(0)})}{\partial w_m}\right|_{w_m = w_m^{(0)}} = \frac{2\, w_m^{(0)} f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)})}{\|w_m^{(0)}\|_2^2}, \quad \text{and} \quad \forall m_l^n, \; \left.\frac{\partial f(w_m^{(0)}, \boldsymbol{\beta}_m)}{\partial \beta_{m_l^n}}\right|_{\beta_{m_l^n} = \beta_{m_l^n}^{(0)}} = -\frac{f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)})}{N_m N_{m_l}\, \beta_{m_l^n}^{(0)}},$$
by substituting these equations into the target inequality, in the end we only need to prove that
$$f(w_m^{(1)}, \boldsymbol{\beta}_m^{(1)}) - \frac{2\,(w_m^{(1)})^T w_m^{(0)}\, f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)})}{\|w_m^{(0)}\|_2^2} + f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)}) \sum_{l=1}^{N_m}\sum_{n=1}^{N_{m_l}} \frac{\beta_{m_l^n}^{(1)}}{N_m N_{m_l}\, \beta_{m_l^n}^{(0)}}$$
$$\ge f(w_m^{(1)}, \boldsymbol{\beta}_m^{(1)}) - \frac{2\,(w_m^{(1)})^T w_m^{(0)}\, f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)})}{\|w_m^{(0)}\|_2^2} + f(w_m^{(0)}, \boldsymbol{\beta}_m^{(0)})\, \frac{g_m(\boldsymbol{\beta}_m^{(1)})}{g_m(\boldsymbol{\beta}_m^{(0)})} \;\ge\; \left( \frac{\|w_m^{(1)}\|_2}{\sqrt{g_m(\boldsymbol{\beta}_m^{(1)})}} - \frac{\|w_m^{(0)}\|_2\, \sqrt{g_m(\boldsymbol{\beta}_m^{(1)})}}{g_m(\boldsymbol{\beta}_m^{(0)})} \right)^2 \;\ge\; 0. \qquad (3)$$
Since Eq. 3 always holds, our lemma is proven.

Lemma 2. Given an SPN for MKL, for all $m \in \mathcal{M}$,
$$\frac{\|w_m\|_2^2}{g_m(\boldsymbol{\beta}_m)} \le \sum_{l=1}^{N_m} \sum_{n=1}^{N_{m_l}} \frac{\|w_m\|_2^2}{N_m N_{m_l}\, \beta_{m_l^n}}.$$

Proof. By the weighted arithmetic-geometric mean inequality,
$$\frac{\|w_m\|_2^2}{g_m(\boldsymbol{\beta}_m)} = \prod_{l,n} \left(\frac{\|w_m\|_2^2}{\beta_{m_l^n}}\right)^{\frac{1}{N_m N_{m_l}}} \le \sum_{l,n} \frac{\|w_m\|_2^2}{N_m N_{m_l}\, \beta_{m_l^n}}.$$

From Lemmas 1 and 2, we can see that our proposed regularizer is in fact a lower bound of a family of widely used MKL regularizers (Rakotomamonjy et al., 2008; Xu et al., 2010; Gönen & Alpaydın, 2011; Kloft et al., 2011), involving much stronger connections between the node weights.

Theorem 1 (Convex Regularization). Our regularizer in Eq. 2 is convex if $p_v \ge 1$ for all $v \in \mathcal{V}$.

Proof. When $p_v \ge 1$ for all $v \in \mathcal{V}$, each term $\frac{\beta_{m_l^n}^{\,p_{m_l^n}}}{N_m N_{m_l}}$ is convex over $B$. Then, based on Lemma 1 and since the summation of convex functions is still convex, our regularizer is convex.
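The bound in Lemma 2 is a weighted AM-GM inequality, which is easy to check numerically. The short sketch below is our illustration only; the random weights and layer sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def g_m(layers):
    # Eq. 1 applied to a list of layers of beta values.
    N_m = len(layers)
    return float(np.prod([b ** (1.0 / (N_m * len(layer))) for layer in layers for b in layer]))

for _ in range(1000):
    # Random path structure: a few layers, each with a random number of product nodes.
    layers = [list(rng.uniform(0.05, 1.0, size=rng.integers(1, 4))) for _ in range(rng.integers(1, 4))]
    w_sq = rng.uniform(0.1, 5.0)          # plays the role of ||w_m||_2^2
    lhs = w_sq / g_m(layers)              # left-hand side of Lemma 2
    N_m = len(layers)
    rhs = sum(w_sq / (N_m * len(layer) * b) for layer in layers for b in layer)
    assert lhs <= rhs + 1e-9              # Lemma 2 holds on every draw
```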
Theorem 2 (Rademacher Complexity). Denote our MKL classifier learned from Eq. 2 as $f(x; B, W, b) = \sum_m w_m^T \phi_m(x) + b$ and our regularizer in Eq. 2 as
$$R(B, W; \lambda, P) = R_1 + R_2 = \sum_{m \in \mathcal{M}} \frac{\|w_m\|_2^2}{2\, g_m(\boldsymbol{\beta}_m)} + \lambda \sum_{m \in \mathcal{M}} \sum_{l=1}^{N_m} \sum_{n=1}^{N_{m_l}} \frac{\beta_{m_l^n}^{\,p_{m_l^n}}}{N_m N_{m_l}}. \qquad (4)$$
Then the empirical Rademacher complexity $\hat{R}(f)$ of our classifier is upper-bounded by
$$\frac{2A}{N_x} \cdot \min_{B, W, b} \left\{ R(B, W; 1, \mathbf{1}) + C \sum_i \ell(x_i, y_i; W, b) \right\},$$
where $N_x$ denotes the total number of training samples, the constant $A = \left(\sum_{i=1}^{N_x} \sum_m K_m(x_i, x_i)\right)^{\frac{1}{2}}$, and, for all $m$ and $i$, $K_m(x_i, x_i) = \phi_m(x_i)^T \phi_m(x_i)$ denotes the $i$th diagonal element of the path-dependent kernel matrix $K_m$.

Proof. Given the Rademacher variables $\sigma$, based on the definition of Rademacher complexity we have
$$\hat{R}(f) = \mathbb{E}_\sigma\!\left[\sup_{f\in\mathcal{F}(B,W)} \frac{2}{N_x}\sum_{i=1}^{N_x}\sigma_i\, f(x_i; B, W, b)\right] = \mathbb{E}_\sigma\!\left[\sup_{f\in\mathcal{F}(B,W)} \frac{2}{N_x}\sum_{i=1}^{N_x}\sigma_i \sum_{m\in\mathcal{M}} w_m^T \phi_m(x_i)\right]$$
$$\le \frac{4}{N_x}\,\sup_{f\in\mathcal{F}} \left[\sum_m \frac{\|w_m\|_2^2}{2\, g_m(\boldsymbol{\beta}_m)}\right]^{\frac12} \left[\sum_m g_m(\boldsymbol{\beta}_m)\, \mathbb{E}_\sigma\Big\|\sum_{i=1}^{N_x}\sigma_i \phi_m(x_i)\Big\|^2\right]^{\frac12},$$
where the inequality follows from the Cauchy-Schwarz inequality. Since $\mathbb{E}_\sigma\|\sum_{i=1}^{N_x}\sigma_i \phi_m(x_i)\|^2 = \sum_{i=1}^{N_x} K_m(x_i, x_i) \le A^2$, and using $\sqrt{ab} \le \frac{1}{2}(a + b)$ together with the AM-GM bound $g_m(\boldsymbol{\beta}_m) \le \sum_{l,n} \frac{\beta_{m_l^n}}{N_m N_{m_l}}$, we further obtain
$$\hat{R}(f) \le \frac{2A}{N_x}\,\sup_{f\in\mathcal{F}} \left\{\sum_m \frac{\|w_m\|_2^2}{2\, g_m(\boldsymbol{\beta}_m)} + \sum_m \sum_{l=1}^{N_m}\sum_{n=1}^{N_{m_l}} \frac{\beta_{m_l^n}}{N_m N_{m_l}}\right\} \le \frac{2A}{N_x}\cdot \min_{B, W, b}\left\{R(B, W; 1, \mathbf{1}) + C\sum_i \ell(x_i, y_i; W, b)\right\}.$$
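The constant $A$ in Theorem 2 depends only on the diagonals of the path-dependent kernel matrices, so it can be computed before training. A minimal sketch (ours; the kernel list is assumed to be precomputed elsewhere):

```python
import numpy as np

def rademacher_constant_A(kernels):
    """A = (sum_i sum_m K_m(x_i, x_i))^(1/2) from Theorem 2.

    kernels: list of (N_x, N_x) path-dependent kernel matrices K_m.
    """
    return float(np.sqrt(sum(np.trace(K) for K in kernels)))

# Example with two toy kernels (illustrative values only).
X = np.random.default_rng(2).standard_normal((6, 3))
K_lin = X @ X.T                                                        # linear kernel
K_rbf = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))    # RBF kernel, gamma = 0.5
print(rademacher_constant_A([K_lin, K_rbf]))
```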
From Theorem 2 we can see that, with $\lambda = 1$ and $p_v = 1$ for all $v \in \mathcal{V}$, minimizing our objective function in Eq. 2 is equivalent to minimizing the upper bound on the Rademacher complexity of the induced MKL classifier. To enhance the flexibility of our method, we allow $\lambda$ and $P$ to be tuned according to the dataset.

3.3. Optimization

To optimize Eq. 2, we adopt a learning strategy similar to that of (Xu et al., 2010), updating the node weights $B$ and the classifier parameters $(W, b)$ alternately.

3.3.1. Learning (W, b) by Fixing B

We utilize the dual form of Eq. 2 to learn $(W, b)$. Let $\alpha \in \mathbb{R}^{N_x}$ be the vector of Lagrange multipliers and $y \in \{-1, 1\}^{N_x}$ be the vector of binary labels. Then optimizing the dual of Eq. 2 is equivalent to the following maximization problem:
$$\max_{\alpha} \; e^T\alpha - \frac{1}{2}(\alpha \circ y)^T \left(\sum_{m\in\mathcal{M}} g_m(\boldsymbol{\beta}_m) K_m\right)(\alpha \circ y) \qquad (5)$$
$$\text{s.t.} \quad 0 \preceq \alpha \preceq Ce, \quad y^T\alpha = 0,$$
where $\preceq$ denotes the entry-wise $\le$ operator. Based on Eq. 5, the optimal kernel is constructed as $K_{opt} = \sum_{m\in\mathcal{M}} g_m(\boldsymbol{\beta}_m) K_m$, and for all $m \in \mathcal{M}$, $w_m = g_m(\boldsymbol{\beta}_m)\sum_i \alpha_i y_i \phi_m(x_i)$. Therefore, the updating rule for $\|w_m\|_2^2$ is:
$$\forall m \in \mathcal{M}, \quad \|w_m\|_2^2 = g_m(\boldsymbol{\beta}_m)^2\, (\alpha \circ y)^T K_m (\alpha \circ y). \qquad (6)$$

3.3.2. Learning B by Fixing (W, b)

At this stage, minimizing our objective function in Eq. 2 is equivalent to minimizing $R(B, W; \lambda, P)$ in Eq. 4, provided that $\forall \beta \in B, \beta \ge 0$. For further usage, we rewrite $R_2$ in Eq. 4 as follows:
$$R_2 = \sum_{v\in\mathcal{V}} \sum_{m\in\mathcal{M}(v)} \sum_{l=1}^{N_m} \sum_{n=1}^{N_{m_l}} \frac{\lambda}{N_m N_{m_l}}\, \beta_v^{p_v}, \qquad (7)$$
where $\mathcal{M}(v)$ denotes the set of all paths that pass through product node $v$. Due to the complex structures of SPNs, in general there may not exist closed forms to update $B$; therefore, we use gradient descent methods.

(i) Convex regularization with $p_v \ge 1$ for all $v \in \mathcal{V}$. In this case our objective function is already convex, so we can calculate its gradient directly and use the following rule to update $B$: for all $v \in \mathcal{V}$,
$$\beta_v^{(k+1)} = \left[\beta_v^{(k)} - \eta_{k+1}\left(\nabla_{\beta_v} R_1(B^{(k)}) + \nabla_{\beta_v} R_2(B^{(k)})\right)\right]_+, \qquad (8)$$
where $\nabla_{\beta_v}$ denotes the first-order derivative with respect to $\beta_v$, $(\cdot)^{(k)}$ denotes the value at the $k$th iteration, $\eta_{k+1} \ge 0$ denotes the step size at the $(k+1)$th iteration, and $[\cdot]_+ = \max\{0, \cdot\}$.
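Both alternating steps are straightforward to prototype. For the (W, b) step of Section 3.3.1, one can hand the weighted kernel of Eq. 5 to any SVM solver that accepts precomputed kernels and then recover $\|w_m\|_2^2$ via Eq. 6. A sketch using scikit-learn (our choice of solver, not prescribed by the paper):

```python
import numpy as np
from sklearn.svm import SVC

def learn_W_b(kernels, g_vals, y, C=1.0):
    """One (W, b) step of the wrapper: solve the dual in Eq. 5 with B fixed,
    then update ||w_m||_2^2 with Eq. 6.

    kernels: list of (N_x, N_x) path-dependent kernel matrices K_m
    g_vals : list of g_m(beta_m) values, one per path
    y      : labels in {-1, +1}
    """
    K_opt = sum(g * K for g, K in zip(g_vals, kernels))     # K_opt = sum_m g_m(beta_m) K_m
    svm = SVC(C=C, kernel="precomputed").fit(K_opt, y)

    # alpha ∘ y restricted to support vectors (dual_coef_ already stores alpha_i * y_i).
    alpha_y = np.zeros(len(y))
    alpha_y[svm.support_] = svm.dual_coef_.ravel()

    # Eq. 6: ||w_m||^2 = g_m^2 * (alpha∘y)^T K_m (alpha∘y)
    w_sq = [g**2 * alpha_y @ K @ alpha_y for g, K in zip(g_vals, kernels)]
    return svm, w_sq
```

Here `svm.intercept_` plays the role of $b$, and the returned squared norms feed directly into the $\beta$ update of Eq. 8.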
Algorithm 1 SPN-MKL learning algorithm

Input: {(xi, yi)}, i = 1, ..., Nx; an SPN; {pv > 0} for all v ∈ V; {Km} for all m ∈ M; C
Output: α, B = {βv} for all v ∈ V

Initialize the kernel weights so that βv ≥ 0 for all v ∈ V;
repeat
    Update α using Eq. 5 while fixing B;
    (For multiclass cases, update {αc}, c ∈ C, using Eq. 10 while fixing B;)
    Update ||wm||² for all m using Eq. 6 while fixing α and B;
    (For multiclass cases, update ||wm||² for all m using Eq. 11 while fixing {αc} and B;)
    foreach v ∈ V do
        if pv ≥ 1 then
            Update βv using Eq. 8 while fixing w;
        else
            repeat
                Update βv^(k+1) using Eq. 8 and Eq. 9 while fixing w;
            until convergence;
        end
    end
until convergence;
return α, B

(ii) Non-convex regularization with $0 < p_v < 1$ for some $v \in \mathcal{V}$. In this case, since our objective function can be decomposed into a summation of convex functions (the terms of $R_2$ with $p_v \ge 1$) and concave functions (the terms of $R_2$ with $0 < p_v < 1$), we can utilize the concave-convex procedure (CCCP) (Yuille & Rangarajan, 2003) to optimize it. The weight updating rule for nodes with $0 < p_v < 1$ therefore changes to:
$$\beta_v^{(k+1)} = \arg\min_{\beta_v \ge 0} \left\{ R_1 + \beta_v\, \nabla_{\beta_v} R_2(B^{(k)}) \right\}. \qquad (9)$$
Again, Eq. 8 can be reused to solve Eq. 9 iteratively. To summarize, as long as $p_v > 0$ for all $v \in \mathcal{V}$, we can always optimize our objective function. We show our learning algorithm for binary SPN-MKL in Alg. 1. Note that once the weight of any product node reaches 0, it remains zero, which means that the product node and all the paths going through it can be permanently deleted from the SPN. This property can be used to simplify the SPN structure and accelerate the learning of our SPN-MKL.
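To make the alternating scheme of Alg. 1 concrete, the following sketch (ours; it reuses the hypothetical `learn_W_b` helper from the previous snippet, assumes the simplest single-layer SPN so that $g_m(\boldsymbol{\beta}_m) = \beta_m$, and assumes all $p_v = p \ge 1$ with a fixed step size) runs a few wrapper iterations for the convex case.

```python
import numpy as np

def spn_mkl_wrapper_flat(kernels, y, C=1.0, lam=1.0, p=1.0, eta=0.1, iters=20):
    """Wrapper loop of Alg. 1 for a single-layer (atomic) SPN, convex case.
    Illustrative sketch only; step size and iteration count are arbitrary.
    """
    M = len(kernels)
    beta = np.ones(M)                      # initialize all node weights >= 0
    for _ in range(iters):
        # Step 1: learn (W, b) with B fixed (Eq. 5) and update ||w_m||^2 (Eq. 6).
        svm, w_sq = learn_W_b(kernels, beta, y, C=C)   # helper from the previous sketch
        w_sq = np.asarray(w_sq)
        # Step 2: projected gradient step on beta (Eq. 8).
        # For a flat SPN: R1 = sum_m ||w_m||^2 / (2 beta_m), R2 = lam * sum_m beta_m^p.
        grad_R1 = -w_sq / (2.0 * np.maximum(beta, 1e-12) ** 2)
        grad_R2 = lam * p * np.maximum(beta, 1e-12) ** (p - 1.0)
        beta = np.maximum(0.0, beta - eta * (grad_R1 + grad_R2))
    return beta, svm
```

Consistent with the remark above, any $\beta_m$ that hits zero makes $g_m(\boldsymbol{\beta}_m) = 0$ and hence $\|w_m\|_2^2 = 0$ at the next (W, b) step, so that weight stays at zero.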
3.4. Multiclass SPN-MKL

For multiclass tasks, we generate a single optimal kernel for all the classes, and correspondingly modify Eq. 5 and Eq. 6 of binary SPN-MKL without changing the other steps. Using the "one vs. the rest" strategy, the modification is shown as follows:
$$\max_{\{\alpha_c\}_{c\in\mathcal{C}}} \; \sum_{c\in\mathcal{C}} \left[ e^T\alpha_c - \frac{1}{2}(\alpha_c \circ y_c)^T \left(\sum_{m\in\mathcal{M}} g_m(\boldsymbol{\beta}_m) K_m\right)(\alpha_c \circ y_c) \right] \qquad (10)$$
$$\text{s.t.} \quad \forall c \in \mathcal{C}, \; 0 \preceq \alpha_c \preceq Ce, \; y_c^T\alpha_c = 0,$$
$$\forall m, \quad \|w_m\|_2^2 = g_m(\boldsymbol{\beta}_m)^2 \sum_{c\in\mathcal{C}} (\alpha_c \circ y_c)^T K_m (\alpha_c \circ y_c), \qquad (11)$$
where $c \in \mathcal{C}$ denotes a class label in the label set $\mathcal{C}$, $\alpha_c$ denotes the class-specific vector of Lagrange multipliers, and $y_c$ denotes a binary label vector: for each $i$, if $y_i = c$, the $i$th entry of $y_c$ is set to 1, and otherwise to 0. The learning algorithm for multiclass SPN-MKL is listed in Alg. 1 as well.
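A minimal sketch of the multiclass (W, b) step, assuming a one-vs-rest wrapper around the same precomputed-kernel solver used earlier (our illustration; the ±1 encoding of $y_c$ is our assumption, made so that a standard SVM dual applies):

```python
import numpy as np
from sklearn.svm import SVC

def learn_W_b_multiclass(kernels, g_vals, labels, C=1.0):
    """One-vs-rest version of the (W, b) step (Eq. 10) with the Eq. 11 update.

    labels: arbitrary class labels; each class c gets one binary problem.
    Assumes y_c uses a +1/-1 encoding (our choice, not stated in the text).
    """
    K_opt = sum(g * K for g, K in zip(g_vals, kernels))
    classes = np.unique(labels)
    per_class_alpha_y = []
    for c in classes:
        y_c = np.where(labels == c, 1.0, -1.0)
        svm = SVC(C=C, kernel="precomputed").fit(K_opt, y_c)
        a = np.zeros(len(labels))
        a[svm.support_] = svm.dual_coef_.ravel()      # alpha_i * y_i for this class
        per_class_alpha_y.append(a)

    # Eq. 11: ||w_m||^2 = g_m^2 * sum_c (alpha_c ∘ y_c)^T K_m (alpha_c ∘ y_c)
    w_sq = [g**2 * sum(a @ K @ a for a in per_class_alpha_y)
            for g, K in zip(g_vals, kernels)]
    return w_sq
```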
4. Experiments

References

Bach, Francis. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, pp. 105–112, 2008.

Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Structured sparsity through convex optimization. CoRR, abs/1109.2397, 2011.

Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Learning non-linear combinations of kernels. In NIPS, pp. 396–404, 2009.

Desai, Chaitanya, Ramanan, Deva, and Fowlkes, Charless. Discriminative models for multi-class object layout. IJCV, 2011.

Gens, Robert and Domingos, Pedro. Discriminative learning of sum-product networks. In NIPS, pp. 3248–3256, 2012.

Gönen, Mehmet. Bayesian efficient multiple kernel learning. In ICML, 2012.

Gönen, Mehmet and Alpaydın, Ethem. Multiple kernel learning algorithms. JMLR, 12:2211–2268, 2011.

Jenatton, Rodolphe, Audibert, Jean-Yves, and Bach, Francis. Structured variable selection with sparsity-inducing norms. JMLR, 12:2777–2824, 2011.
Kloft, Marius, Brefeld, Ulf, Sonnenburg, Sören, and Zien, Alexander. ℓp-norm multiple kernel learning. JMLR, 12:953–997, 2011.

Ladicky, Lubor. Global Structured Models towards Scene Understanding. PhD thesis, Oxford Brookes University, 2011.

Ladicky, Lubor, Russell, Chris, Kohli, Pushmeet, and Torr, Philip H. S. Graph cut based inference with co-occurrence statistics. In ECCV, pp. 239–253, 2010.

Lin, Lijing, Higham, Nicholas J., and Pan, Jianxin. Covariance structure regularization via entropy loss function. Computational Statistics & Data Analysis, 72:315–327, 2014.

Maurer, Andreas and Pontil, Massimiliano. Structured sparsity and generalization. JMLR, 13:671–690, 2012.

Palomar, Daniel P. and Eldar, Yonina C. (eds.). Convex Optimization in Signal Processing and Communications. Cambridge University Press, 2010.

Peharz, Robert, Geiger, Bernhard, and Pernkopf, Franz. Greedy part-wise learning of sum-product networks. LNCS 8189, pp. 612–627. Springer, 2013.

Poon, Hoifung and Domingos, Pedro. Sum-product networks: A new deep architecture. In UAI, pp. 337–346, 2011.

Rakotomamonjy, Alain, Bach, Francis, Canu, Stéphane, and Grandvalet, Yves. SimpleMKL. JMLR, 9:2491–2521, 2008.

Strobl, Eric and Visweswaran, Shyam. Deep multiple kernel learning. CoRR, abs/1310.3101, 2013.

Szafranski, Marie, Grandvalet, Yves, and Rakotomamonjy, Alain. Composite kernel learning. Machine Learning, 79(1-2):73–103, 2010.

Tomioka, Ryota and Suzuki, Taiji. Regularization strategies and empirical Bayesian learning for MKL. JMLR, 2011.

van de Geer, Sara. Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 2013. To appear.

Varma, Manik and Babu, Bodla Rakesh. More generality in efficient multiple kernel learning. In ICML, 2009.

Xu, Zenglin, Jin, Rong, Yang, Haiqin, King, Irwin, and Lyu, Michael R. Simple and efficient multiple kernel learning by group lasso. In ICML, pp. 1175–1182, 2010.

Yuille, A. L. and Rangarajan, Anand. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.

Zhao, Peng, Rocha, Guilherme, and Yu, Bin. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 2009.

Zhuang, Jinfeng, Tsang, Ivor W., and Hoi, Steven C. H. Two-layer multiple kernel learning. In AISTATS, pp. 909–917, 2011.