Consistent Vector-valued Distribution Regression
Zoltán Szabó
Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU)
UCL Workshop on the Theory of Big Data January 8, 2015
The task
Samples: $\{(x_i, y_i)\}_{i=1}^{l}$. Goal: $f(x_i) \approx y_i$, find $f \in \mathcal{H}$.
Distribution regression: the $x_i$ are distributions, available only through samples $\{x_{i,n}\}_{n=1}^{N_i}$.
⇒ Training examples: labelled bags.
Example: aerosol prediction from satellite images
Bag:= points of a multispectral satellite image over an area. Label of a bag:= aerosol value.
Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5 − 8.5. Using distribution regression: without domain knowledge, 100 × RMSE = 7.81.
Wider context
Context:
machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).
Applications:
computer vision: image = collection of patch vectors,
network analysis: group of people = bag of friendship graphs,
natural language processing: corpus = bag of documents,
time-series modelling: user = set of trial time-series.
Several algorithmic approaches
1. Parametric fit: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures: [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels: [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis): [Póczos et al., 2011].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].
Theoretical guarantee?
MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression: require density estimation [Póczos et al., 2013, Oliva et al., 2014] + assumptions:
1. compact Euclidean domain,
2. output = $\mathbb{R}$.
Problem formulation
Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{l}$, where the $i$th bag is
$$\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \overset{\text{i.i.d.}}{\sim} x_i \in \mathcal{M}_1^+(D), \quad y_i \in Y.$$
Task: find a $\mathcal{M}_1^+(D) \to Y$ mapping based on $\hat{z}$.
Construction: distribution embedding ($\mu_x$) + ridge regression:
$$\mathcal{M}_1^+(D) \xrightarrow{\ \mu = \mu(k)\ } X \subseteq H = H(k) \xrightarrow{\ f \in \mathcal{H} = \mathcal{H}(K)\ } Y.$$
Our goal: risk bound compared to the regression function
$$f_\rho(\mu_x) = \int_Y y \, d\rho(y \mid \mu_x).$$
Goal in details
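Contribution: analysis of the excess risk
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = \mathcal{R}[f_{\hat{z}}^{\lambda}] - \mathcal{R}[f_\rho] \le g(l, N, \lambda) \to 0$$
and rates, where
$$\mathcal{R}[f] = \mathbb{E}_{(x,y)} \|f(\mu_x) - y\|_Y^2 \quad \text{(expected risk)},$$
$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2, \quad (\lambda > 0).$$
We consider two settings:
1. well-specified case: $f_\rho \in \mathcal{H}$,
2. misspecified case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.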
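Kernel (k, K), RKHS
$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if $\exists \varphi : D \to H$ (Hilbert space) such that $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$.
$\exists!$ RKHS: $H(k) = \{D \to \mathbb{R} \text{ functions}\}$, $\varphi(u) = k(\cdot, u)$.
Kernel examples:
$D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$):
$k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial,
$k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian,
Graphs, texts, time series, distributions.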
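The two kernel examples on $\mathbb{R}^d$ can be evaluated numerically as in the following minimal sketch. The function names, the sample points and the choice $p = 2$ are illustrative assumptions, not part of the slides or of the ITE toolbox.

```python
import numpy as np

def polynomial_kernel(A, B, theta=1.0, p=2):
    """k(a, b) = (<a, b> + theta)^p for all row pairs of A and B."""
    return (A @ B.T + theta) ** p

def gaussian_kernel(A, B, theta=1.0):
    """k(a, b) = exp(-||a - b||_2^2 / (2 theta^2)) for all row pairs of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * theta**2))

# Illustrative data (hypothetical): 6 points in R^3.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
G_poly = polynomial_kernel(points, points)    # 6 x 6 Gram matrix
G_gauss = gaussian_kernel(points, points)     # 6 x 6 Gram matrix
```

Both Gram matrices should be symmetric and positive semidefinite up to rounding, which is the finite-sample counterpart of $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$.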
Note: $\mathcal{H}(K) = \{X \to Y \text{ functions}\}$, $K(\mu_x, \mu_{x'}) \in \mathcal{L}(Y)$.
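Step-1 (distribution embedding): $\mathcal{M}_1^+(D) \xrightarrow{\ \mu\ } X \subseteq H(k)$
Given: kernel $k : D \times D \to \mathbb{R}$.
Mean embedding of a distribution $x, \hat{x}_i \in \mathcal{M}_1^+(D)$:
$$\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k),$$
$$\mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$
$Y = \mathbb{R}$, linear $K$ ⇒ set kernel:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \left\langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \right\rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$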
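The set kernel is the average of the base kernel over all pairs of points drawn from the two bags. A minimal sketch, assuming a Gaussian base kernel $k$ and NumPy arrays of shape (bag size, dimension) for the bags; the function name `set_kernel` and the toy data are hypothetical.

```python
import numpy as np

def set_kernel(bag_i, bag_j, theta=1.0):
    """Inner product of two empirical mean embeddings.

    Averages k(x_{i,n}, x_{j,m}) over all pairs (n, m); the base kernel k
    is assumed here to be Gaussian with bandwidth theta.
    """
    sq_dists = ((bag_i[:, None, :] - bag_j[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * theta**2)).mean()

# Illustrative bags (hypothetical data): N = 5 points in R^2 each.
rng = np.random.default_rng(0)
bag_a = rng.normal(loc=0.0, size=(5, 2))
bag_b = rng.normal(loc=1.0, size=(5, 2))
value = set_kernel(bag_a, bag_b)
```

Using `.mean()` reproduces the $1/N^2$ normalisation for equal bag sizes, and it also covers bags of different sizes, which is a convenience of this sketch rather than something stated on the slide.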
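Step-2 (ridge regression): analytical solution
Given: training sample $\hat{z}$, test distribution $t$.
Prediction:
$$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k} (\mathbf{K} + l \lambda I_l)^{-1} [y_1; \ldots; y_l], \tag{1}$$
$$\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathcal{L}(Y)^{l \times l}, \tag{2}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_l}, \mu_t)] \in \mathcal{L}(Y)^{1 \times l}. \tag{3}$$
In particular: $Y = \mathbb{R}$ ⇒ $\mathcal{L}(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ ⇒ $\mathcal{L}(Y) = \mathbb{R}^{d \times d}$.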
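For $Y = \mathbb{R}$ the prediction (1) reduces to an ordinary linear solve. The sketch below assumes the linear $K$ (set kernel) with a Gaussian base kernel; the names `set_kernel` and `predict` and the default bandwidth are illustrative assumptions, not the ITE toolbox interface.

```python
import numpy as np

def set_kernel(bag_a, bag_b, theta=1.0):
    """Average of a Gaussian base kernel over all cross-bag point pairs."""
    sq_dists = ((bag_a[:, None, :] - bag_b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * theta**2)).mean()

def predict(train_bags, labels, test_bag, lam, theta=1.0):
    """Evaluate k (K + l * lam * I_l)^{-1} [y_1; ...; y_l] for real-valued labels."""
    l = len(train_bags)
    # Gram matrix K of eq. (2) between the training bags.
    K = np.array([[set_kernel(a, b, theta) for b in train_bags] for a in train_bags])
    # Row vector k of eq. (3) between the training bags and the test bag.
    k = np.array([set_kernel(a, test_bag, theta) for a in train_bags])
    # Solve the regularised linear system instead of forming the inverse.
    coeffs = np.linalg.solve(K + l * lam * np.eye(l), np.asarray(labels, dtype=float))
    return float(k @ coeffs)
```

A call such as `predict(bags, y, new_bag, lam=1e-3)` returns one real number; $\lambda$ and the bandwidth would normally be chosen by cross-validation.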
Blanket assumptions
$D$: separable, topological domain.
$k$: bounded, continuous.
$K$: bounded, Hölder continuous ($h \in (0, 1]$: exponent).
$X = \mu\left(\mathcal{M}_1^+(D)\right) \in \mathcal{B}(H)$.
$Y$: separable Hilbert space.
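Performance guarantees (in human-readable format)
If in addition
1. well-specified case: $f_\rho$ is 'c-smooth' with 'b-decaying covariance operator' and $l \ge \lambda^{-\frac{1}{b} - 1}$, then
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^{h}(l)}{N^{h} \lambda^{3}} + \lambda^{c} + \frac{1}{l^{2} \lambda} + \frac{1}{l \lambda^{\frac{1}{b}}}. \tag{4}$$
2. misspecified case: $f_\rho$ is 's-smooth', $L^2_{\rho_X}$ is separable, and $\frac{1}{\lambda^2} \le l$, then
$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^{\frac{h}{2}}(l)}{N^{\frac{h}{2}} \lambda^{\frac{3}{2}}} + \frac{1}{\sqrt{l \lambda}} + \frac{\sqrt{\lambda^{\min(1,s)}}}{\lambda \sqrt{l}} + \lambda^{\min(1,s)}. \tag{5}$$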
Performance guarantee: example
Misspecified case: assume $s \ge 1$, $h = 1$ ($K$: Lipschitz),
term 1 = term 3 in (5) ⇒ $\lambda$; $l = N^{a}$ ($a > 0$),
$t = lN$: total number of samples processed.
Then
1. $s = 1$ ('most difficult' task): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.25}$,
2. $s \to \infty$ ('simplest' problem): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.5}$.
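Nonlinear K examples
$Y = \mathbb{R}$; $D$: compact, metric; $k$: universal ⇒ Hölder $K$-s:
$K_G$: $e^{-\frac{\|\mu_a - \mu_b\|_H^2}{2\theta^2}}$, $h = 1$
$K_e$: $e^{-\frac{\|\mu_a - \mu_b\|_H}{2\theta^2}}$, $h = \frac{1}{2}$
$K_C$: $\left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$, $h = 1$
$K_t$: $\left(1 + \|\mu_a - \mu_b\|_H^{\theta}\right)^{-1}$, $h = \frac{\theta}{2}$ ($\theta \le 2$)
$K_i$: $\left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-\frac{1}{2}}$, $h = 1$
They are functions of $\|\mu_a - \mu_b\|_H$ ⇒ computation: similar to set kernel.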
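Since each of these kernels depends on the bags only through $\|\mu_a - \mu_b\|_H$, the distance can be expanded as $\|\mu_a - \mu_b\|_H^2 = \langle \mu_a, \mu_a \rangle_H - 2 \langle \mu_a, \mu_b \rangle_H + \langle \mu_b, \mu_b \rangle_H$ and estimated with three set-kernel evaluations. The following sketch assumes a Gaussian base kernel; all function names and the separate bandwidth `theta_k` for the base kernel are illustrative assumptions.

```python
import numpy as np

def mean_inner(bag_a, bag_b, theta_k=1.0):
    """Empirical <mu_a, mu_b>_H with a Gaussian base kernel of bandwidth theta_k."""
    sq_dists = ((bag_a[:, None, :] - bag_b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * theta_k**2)).mean()

def embedding_distance(bag_a, bag_b, theta_k=1.0):
    """Empirical ||mu_a - mu_b||_H via the inner-product expansion."""
    d2 = (mean_inner(bag_a, bag_a, theta_k)
          - 2.0 * mean_inner(bag_a, bag_b, theta_k)
          + mean_inner(bag_b, bag_b, theta_k))
    return np.sqrt(max(d2, 0.0))  # clip tiny negative values caused by rounding

def nonlinear_kernels(dist, theta=1.0):
    """The five kernels of the slide as functions of dist = ||mu_a - mu_b||_H."""
    return {
        "K_G": np.exp(-dist**2 / (2.0 * theta**2)),
        "K_e": np.exp(-dist / (2.0 * theta**2)),
        "K_C": 1.0 / (1.0 + dist**2 / theta**2),
        "K_t": 1.0 / (1.0 + dist**theta),          # slide requires theta <= 2
        "K_i": 1.0 / np.sqrt(dist**2 + theta**2),
    }
```

The cost is dominated by the pairwise base-kernel evaluations, the same as for the set kernel.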
Summary
Problem: distribution regression.
Literature: large number of heuristics.
Contribution: a simple ridge solution is consistent; in particular, the set kernel is consistent (15-year-old open question).
Code ∈ ITE toolbox:
https://bitbucket.org/szzoli/ite/
Details (submitted to JMLR): http://arxiv.org/pdf/1411.2066.
Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.
Appendix: contents
Well/misspecified assumptions.
Topological definitions, separability.
Vector-valued RKHS.
Weak topology on $\mathcal{M}_1^+(D)$.
Measurability of $\mu$.
Universal kernel examples.
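Well-specified case: $\rho \in \mathcal{P}(b, c)$
Let the $T : \mathcal{H} \to \mathcal{H}$ covariance operator be
$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a)$$
with eigenvalues $t_n$ ($n = 1, 2, \ldots$).
Assumption: $\rho \in \mathcal{P}(b, c)$ = set of distributions on $X \times Y$ such that
$\alpha \le n^{b} t_n \le \beta$ ($\forall n \ge 1$; $\alpha > 0$, $\beta > 0$),
$\exists g \in \mathcal{H}$ such that $f_\rho = T^{\frac{c-1}{2}} g$ with $\|g\|_{\mathcal{H}}^2 \le R$ ($R > 0$),
where $b \in (1, \infty)$, $c \in [1, 2]$.
Intuition: $b$: effective input dimension, $c$: smoothness of $f_\rho$.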
Misspecified case: assumption
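Let $\tilde{T}$ be the extension of $T$ from $\mathcal{H}$ to $L^2_{\rho_X}$:
$$S_K^* : \mathcal{H} \hookrightarrow L^2_{\rho_X}, \qquad S_K : L^2_{\rho_X} \to \mathcal{H}, \qquad (S_K g)(\mu_u) = \int_X K(\mu_u, \mu_t) g(\mu_t) \, d\rho_X(\mu_t),$$
$$\tilde{T} = S_K^* S_K : L^2_{\rho_X} \to L^2_{\rho_X}.$$
Our range space assumption on $\rho$: $f_\rho \in \mathrm{Im}\left(\tilde{T}^{s}\right)$ for some $s \ge 0$.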
Misspecified case: note on the separability of L2ρX
$L^2_{\rho_X}$ is separable ⇔ the measure space with metric $d(A, B) = \rho_X(A \triangle B)$ is separable [Thomson et al., 2008].
Topological space, open sets
Given: $D \ne \emptyset$ set. $\tau \subseteq 2^{D}$ is called a topology on $D$ if:
1. $\emptyset \in \tau$, $D \in \tau$.
2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ ⇒ $O_1 \cap O_2 \in \tau$.
3. Arbitrary union: $O_i \in \tau$ ($i \in I$) ⇒ $\cup_{i \in I} O_i \in \tau$.
Then, $(D, \tau)$ is called a topological space; $O \in \tau$: open sets.
Closed-, compact set, closure, dense subset, separability
Given: $(D, \tau)$. $A \subseteq D$ is
closed if $D \setminus A \in \tau$ (i.e., its complement is open),
compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.
Closure of $A \subseteq D$:
$$\bar{A} := \bigcap_{A \subseteq C,\ C \text{ closed in } D} C. \tag{6}$$
$A \subseteq D$ is dense if $\bar{A} = D$.
$(D, \tau)$ is separable if $\exists$ countable, dense subset of $D$. Counterexample: $\ell^{\infty}$ / $L^{\infty}$.
Vector-valued RKHS
Definition: A Hilbert space $\mathcal{H} \subseteq Y^{X}$ of functions is an RKHS if
$$A_{\mu_x, y} : f \mapsto \langle y, f(\mu_x) \rangle_Y \tag{7}$$
is continuous for all $\mu_x \in X$, $y \in Y$.
= The evaluation functional is continuous in every direction.
Riesz representation theorem ⇒ $\exists K_{\mu_t} \in \mathcal{L}(Y, \mathcal{H})$:
$$K(\mu_x, \mu_t)(y) = (K_{\mu_t} y)(\mu_x), \quad (\forall \mu_x, \mu_t \in X),$$
or shortly
$$K(\cdot, \mu_t)(y) = K_{\mu_t} y, \tag{8}$$
$$\mathcal{H}(K) = \overline{\mathrm{span}}\{K_{\mu_t} y : \mu_t \in X, y \in Y\}. \tag{9}$$
Vector-valued RKHS – continued
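Examples ($Y = \mathbb{R}^d$):
1. $K_i : X \times X \to \mathbb{R}$ kernels ($i = 1, \ldots, d$). Diagonal kernel:
$$K(\mu_a, \mu_b) = \mathrm{diag}\left(K_1(\mu_a, \mu_b), \ldots, K_d(\mu_a, \mu_b)\right). \tag{10}$$
2. Combination of $D_j$ diagonal kernels [$D_j(\mu_a, \mu_b) \in \mathbb{R}^{r \times r}$, $A_j \in \mathbb{R}^{r \times d}$]:
$$K(\mu_a, \mu_b) = \sum_{j=1}^{m} A_j^* D_j(\mu_a, \mu_b) A_j. \tag{11}$$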
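For a fixed pair $(\mu_a, \mu_b)$ both constructions produce a $d \times d$ matrix. The sketch below evaluates (10) and (11) from scalar kernel values that are assumed to be computed already; the function names and all numbers are hypothetical, and for real matrices $A_j^*$ is the transpose.

```python
import numpy as np

def diagonal_kernel(scalar_kernel_values):
    """Eq. (10): place the scalar values K_1(mu_a, mu_b), ..., on a diagonal."""
    return np.diag(np.asarray(scalar_kernel_values, dtype=float))

def combined_kernel(A_list, D_list):
    """Eq. (11): sum_j A_j^T D_j A_j with A_j of shape (r, d), D_j of shape (r, r)."""
    return sum(A.T @ D @ A for A, D in zip(A_list, D_list))

# Hypothetical values with r = 2, d = 3, m = 2.
D1 = diagonal_kernel([0.9, 0.5])
D2 = diagonal_kernel([0.2, 0.7])
A1 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])
A2 = np.array([[0.0, 1.0, 1.0],
               [1.0, 0.0, 0.0]])
K_ab = combined_kernel([A1, A2], [D1, D2])   # 3 x 3 matrix, an element of L(R^3)
```

The matrices $A_j$ mix the $r$ scalar kernels across the $d$ output coordinates, so the combined kernel can couple outputs that a purely diagonal kernel treats independently.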
Weak topology on M+ 1 (D)
Def.: It is the weakest topology such that the mapping
$$L_h : (\mathcal{M}_1^+(D), \tau_w) \to \mathbb{R}, \qquad L_h(x) = \int_D h(u) \, dx(u)$$
is continuous for all $h \in C_b(D)$, where $C_b(D) = \{(D, \tau) \to \mathbb{R} \text{ bounded, continuous functions}\}$.
Measurability of µ
$k$: bounded, continuous ⇒ $\mu : (\mathcal{M}_1^+(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable.
$\mu$ measurable, $X \in \mathcal{B}(H)$ ⇒ $\rho$ on $X \times Y$: well-defined.
If $D$ is compact metric and $k$ is universal, then $\mu$ is continuous and $X \in \mathcal{B}(H)$.
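Universal kernel examples
On every compact subset of $\mathbb{R}^d$:
$$k(a, b) = e^{-\frac{\|a - b\|_2^2}{2\sigma^2}}, \quad (\sigma > 0),$$
$$k(a, b) = e^{\beta \langle a, b \rangle}, \quad (\beta > 0),$$
or more generally
$$k(a, b) = f(\langle a, b \rangle), \quad f(x) = \sum_{n=0}^{\infty} a_n x^n \quad (\forall a_n > 0).$$
On compact subsets of the open unit ball $\{a : \|a\|_2 < 1\}$:
$$k(a, b) = (1 - \langle a, b \rangle)^{-\alpha}, \quad (\alpha > 0).$$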
Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.
Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.
Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).
Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.
Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.
Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.
Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.
Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.
Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.
Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.
Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.
Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.
Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.
Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.
Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.
Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.