Consistent Vector-valued Distribution Regression

Zoltán Szabó

Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU)

UCL Workshop on the Theory of Big Data January 8, 2015

The task

Samples: $\{(x_i, y_i)\}_{i=1}^{l}$. Goal: $f(x_i) \approx y_i$, find $f \in \mathcal{H}$.

Distribution regression: the $x_i$ are distributions, available only through samples $\{x_{i,n}\}_{n=1}^{N}$.

⇒ Training examples: labelled bags.

Example: aerosol prediction from satellite images

Bag:= points of a multispectral satellite image over an area. Label of a bag:= aerosol value.

Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5 − 8.5. Using distribution regression: without domain knowledge, 100 × RMSE = 7.81.

Wider context

Context:
machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).

Applications:
computer vision: image = collection of patch vectors,
network analysis: group of people = bag of friendship graphs,
natural language processing: corpus = bag of documents,
time-series modelling: user = set of trial time-series.

Several algorithmic approaches

1. Parametric fit: Gaussian, MOG, exp. family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].

2. Kernelized Gaussian measures: [Jebara et al., 2004, Zhou and Chellappa, 2006].

3. (Positive definite) kernels: [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].

4. Divergence measures (KL, Rényi, Tsallis): [Póczos et al., 2011].

5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

Theoretical guarantee?

MIL dates back to [Haussler, 1999, Gärtner et al., 2002].

Sensible methods in regression require density estimation [Póczos et al., 2013, Oliva et al., 2014] + assumptions:

1. compact Euclidean domain,
2. output = $\mathbb{R}$.

Problem formulation

Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{l}$, where the $i$th bag is

$$\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \overset{\text{i.i.d.}}{\sim} x_i \in M_1^+(D), \quad y_i \in Y.$$

Task: find a $M_1^+(D) \to Y$ mapping based on $\hat{z}$.

Construction: distribution embedding ($\mu_x$) + ridge regression:

$$M_1^+(D) \xrightarrow{\mu = \mu(k)} X \subseteq H = H(k) \xrightarrow{f \in \mathcal{H} = \mathcal{H}(K)} Y.$$

Our goal: risk bound compared to the regression function

$$f_\rho(\mu_x) = \int_Y y \, \mathrm{d}\rho(y \mid \mu_x).$$

Goal in details

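Contribution: analysis of the excess risk

$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = \mathcal{R}[f_{\hat{z}}^{\lambda}] - \mathcal{R}[f_\rho] \le g(l, N, \lambda) \to 0 \quad \text{and rates},$$

$$\mathcal{R}[f] = \mathbb{E}_{(x,y)} \|f(\mu_x) - y\|_Y^2 \quad \text{(expected risk)},$$

$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2, \quad (\lambda > 0).$$

We consider two settings:

1. well-specified case: $f_\rho \in \mathcal{H}$,
2. misspecified case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.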

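Kernel ($k$, $K$), RKHS

$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if $\exists \varphi : D \to H$ (Hilbert space) such that $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$.

$\exists!$ RKHS: $H(k) = \{D \to \mathbb{R} \text{ functions}\}$, $\varphi(u) = k(\cdot, u)$.

Kernel examples: $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$):

$k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial,

$k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian,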

Graphs, texts, time series, distributions.

Note: $\mathcal{H}(K) = \{X \to Y \text{ functions}\}$, $K(\mu_x, \mu_{x'}) \in L(Y)$.
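The two kernel examples on this slide can be sketched numerically as follows. This is a minimal NumPy illustration, not code from the talk: the function names, the default parameter values and the toy data are assumptions made here.

```python
import numpy as np

def polynomial_kernel(A, B, theta=1.0, p=2):
    """Gram matrix of k(a, b) = (<a, b> + theta)^p over the rows of A and B."""
    return (A @ B.T + theta) ** p

def gaussian_kernel(A, B, theta=1.0):
    """Gram matrix of k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * theta**2))

# Toy data (illustrative only): 5 points in R^3.
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
G_poly = polynomial_kernel(points, points)
G_gauss = gaussian_kernel(points, points)
```

Both Gram matrices are symmetric, and the Gaussian one has ones on its diagonal since $k(a, a) = 1$.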

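Step-1 (distribution embedding): $M_1^+(D) \xrightarrow{\mu} X \subseteq H(k)$

Given: kernel $k : D \times D \to \mathbb{R}$.

Mean embedding of a distribution $x, \hat{x}_i \in M_1^+(D)$:

$$\mu_x = \int_D k(\cdot, u) \, \mathrm{d}x(u) \in H(k),$$

$$\mu_{\hat{x}_i} = \int_D k(\cdot, u) \, \mathrm{d}\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$

$Y = \mathbb{R}$, linear $K$ ⇒ set kernel:

$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$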
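The set kernel formula amounts to averaging the base kernel over all cross pairs of two bags. A minimal NumPy sketch, assuming a Gaussian base kernel $k$ (the function names and toy bags are illustrative, not from the talk):

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    """Gram matrix of k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * theta**2))

def set_kernel(bag_i, bag_j, theta=1.0):
    """Inner product of the empirical mean embeddings of two bags.

    Equals the average of k over all cross pairs; for equal bag sizes N
    this is the 1/N^2 double sum on the slide.
    """
    return gaussian_kernel(bag_i, bag_j, theta).mean()

# Two toy bags (illustrative only): 50 samples each from shifted Gaussians in R^2.
rng = np.random.default_rng(0)
bag_1 = rng.normal(loc=0.0, size=(50, 2))
bag_2 = rng.normal(loc=1.0, size=(50, 2))
value = set_kernel(bag_1, bag_2)
```

Using the mean rather than an explicit $1/N^2$ factor also covers bags of different sizes, which is a convenience of this sketch rather than something stated on the slide.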

Step-2 (ridge regression): analytical solution

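Given: training sample $\hat{z}$, test distribution $t$.

Prediction:

$$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k} (\mathbf{K} + l \lambda I_l)^{-1} [y_1; \ldots; y_l], \tag{1}$$

$$\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in L(Y)^{l \times l}, \tag{2}$$

$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_l}, \mu_t)] \in L(Y)^{1 \times l}. \tag{3}$$

In particular: $Y = \mathbb{R} \Rightarrow L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow L(Y) = \mathbb{R}^{d \times d}$.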
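For $Y = \mathbb{R}$ with the set kernel as $K$, the prediction formula (1)-(3) reduces to a standard kernel ridge solve. The sketch below is an assumption-laden illustration (Gaussian base kernel, hypothetical function names, synthetic bags whose label is the mean of the generating distribution); it is not the ITE toolbox implementation.

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    """Gram matrix of k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * theta**2))

def set_kernel_gram(bags_a, bags_b, theta=1.0):
    """Matrix of set-kernel values between two lists of bags."""
    return np.array([[gaussian_kernel(a, b, theta).mean() for b in bags_b] for a in bags_a])

def ridge_predict(train_bags, y, test_bags, lam, theta=1.0):
    """Evaluate k (K + l*lambda*I)^{-1} y for every test bag."""
    l = len(train_bags)
    K = set_kernel_gram(train_bags, train_bags, theta)  # eq. (2)
    k = set_kernel_gram(test_bags, train_bags, theta)   # one row of eq. (3) per test bag
    coef = np.linalg.solve(K + l * lam * np.eye(l), y)
    return k @ coef                                     # eq. (1)

# Synthetic task (illustrative only): each bag is sampled from N(m, 1), label = m.
rng = np.random.default_rng(0)
means = rng.uniform(-2.0, 2.0, size=30)
train_bags = [rng.normal(loc=m, size=(40, 1)) for m in means]
y = means.copy()
test_bags = [rng.normal(loc=m, size=(40, 1)) for m in (-1.0, 0.0, 1.0)]
pred = ridge_predict(train_bags, y, test_bags, lam=1e-3)
```

Solving the linear system instead of forming the inverse is a numerical choice made in this sketch; the result is the same quantity as in (1).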

Blanket assumptions

$D$: separable, topological domain.

$k$: bounded, continuous.

$K$: bounded, Hölder continuous ($h \in (0, 1]$: exponent).

$X = \mu(M_1^+(D)) \in \mathcal{B}(H)$.

$Y$: separable Hilbert space.

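Performance guarantees (in human-readable format)

If in addition

1. well-specified case: $f_\rho$ is 'c-smooth' with 'b-decaying covariance operator' and $l \ge \lambda^{-\frac{1}{b} - 1}$, then

$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^{h}(l)}{N^{h} \lambda^{3}} + \lambda^{c} + \frac{1}{l^{2} \lambda} + \frac{1}{l \lambda^{\frac{1}{b}}}. \tag{4}$$

2. misspecified case: $f_\rho$ is 's-smooth', $L^2_{\rho_X}$ is separable, and $\frac{1}{\lambda^2} \le l$, then

$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \frac{\log^{\frac{h}{2}}(l)}{N^{\frac{h}{2}} \lambda^{\frac{3}{2}}} + \frac{1}{\sqrt{l} \, \lambda} + \frac{\sqrt{\lambda^{\min(1,s)}}}{\lambda \sqrt{l}} + \lambda^{\min(1,s)}. \tag{5}$$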

Performance guarantee: example

Misspecified case: assume $s \ge 1$, $h = 1$ ($K$: Lipschitz).

(1st term) = (3rd term) in (5) ⇒ $\lambda$; $l = N^{a}$ ($a > 0$).

$t = lN^{a}$: total number of samples processed. Then

1. $s = 1$ ('most difficult' task): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.25}$,
2. $s \to \infty$ ('simplest' problem): $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.5}$.

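Nonlinear K examples

$Y = \mathbb{R}$; $D$: compact, metric; $k$: universal ⇒ Hölder $K$-s:

$K_G(\mu_a, \mu_b) = e^{-\frac{\|\mu_a - \mu_b\|_H^2}{2\theta^2}}$, $h = 1$

$K_e(\mu_a, \mu_b) = e^{-\frac{\|\mu_a - \mu_b\|_H}{2\theta^2}}$, $h = \frac{1}{2}$

$K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$, $h = 1$

$K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^{\theta}\right)^{-1}$, $h = \frac{\theta}{2}$ ($\theta \le 2$)

$K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-\frac{1}{2}}$, $h = 1$

They are functions of $\|\mu_a - \mu_b\|_H$ ⇒ computation: similar to set kernel.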
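Since each of these kernels depends only on $\|\mu_a - \mu_b\|_H$, one way to evaluate them on sampled bags is to expand the squared distance into three set-kernel values, $\langle \mu_a, \mu_a \rangle_H + \langle \mu_b, \mu_b \rangle_H - 2 \langle \mu_a, \mu_b \rangle_H$. The following NumPy sketch assumes a Gaussian base kernel; all function names and the toy bags are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    """Gram matrix of the base kernel k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * theta**2))

def embedding_dist_sq(bag_a, bag_b, theta_k=1.0):
    """||mu_a - mu_b||_H^2 for empirical embeddings, via three set-kernel values."""
    kaa = gaussian_kernel(bag_a, bag_a, theta_k).mean()
    kbb = gaussian_kernel(bag_b, bag_b, theta_k).mean()
    kab = gaussian_kernel(bag_a, bag_b, theta_k).mean()
    return max(kaa + kbb - 2.0 * kab, 0.0)

# The five kernels of the slide, written as functions of d2 = ||mu_a - mu_b||_H^2.
def K_G(d2, theta=1.0):
    return np.exp(-d2 / (2.0 * theta**2))

def K_e(d2, theta=1.0):
    return np.exp(-np.sqrt(d2) / (2.0 * theta**2))

def K_C(d2, theta=1.0):
    return 1.0 / (1.0 + d2 / theta**2)

def K_t(d2, theta=1.0):  # requires theta <= 2
    return 1.0 / (1.0 + np.sqrt(d2) ** theta)

def K_i(d2, theta=1.0):
    return 1.0 / np.sqrt(d2 + theta**2)

# Toy bags (illustrative only).
rng = np.random.default_rng(0)
bag_1 = rng.normal(loc=0.0, size=(50, 2))
bag_2 = rng.normal(loc=1.0, size=(50, 2))
d2 = embedding_dist_sq(bag_1, bag_2)
values = {f.__name__: f(d2) for f in (K_G, K_e, K_C, K_t, K_i)}
```

The cost is dominated by the three Gram-matrix averages, which is the sense in which the computation resembles that of the set kernel.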

Summary

Problem: distribution regression.

Literature: large number of heuristics.

Contribution: a simple ridge solution is consistent; in particular, so is the set kernel (a 15-year-old open question).

Code ∈ ITE toolbox:

https://bitbucket.org/szzoli/ite/

Details (submitted to JMLR): http://arxiv.org/pdf/1411.2066.

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

Appendix: contents

Well/misspecified assumptions.
Topological definitions, separability.
Vector-valued RKHS.
Weak topology on $M_1^+(D)$.
Measurability of $\mu$.
Universal kernel examples.

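Well-specified case: $\rho \in \mathcal{P}(b, c)$

Let the covariance operator $T : \mathcal{H} \to \mathcal{H}$ be

$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, \mathrm{d}\rho_X(\mu_a)$$

with eigenvalues $t_n$ ($n = 1, 2, \ldots$).

Assumption: $\rho \in \mathcal{P}(b, c)$ = set of distributions on $X \times Y$ such that

$\alpha \le n^{b} t_n \le \beta$ ($\forall n \ge 1$; $\alpha > 0$, $\beta > 0$),

$\exists g \in \mathcal{H}$ such that $f_\rho = T^{\frac{c-1}{2}} g$ with $\|g\|_{\mathcal{H}}^2 \le R$ ($R > 0$),

where $b \in (1, \infty)$, $c \in [1, 2]$.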

Intuition: b – effective input dimension, c – smoothness of fρ .

Misspecified case: assumption

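Let $\tilde{T}$ be the extension of $T$ from $\mathcal{H}$ to $L^2_{\rho_X}$:

$$S_K^* : \mathcal{H} \hookrightarrow L^2_{\rho_X},$$

$$S_K : L^2_{\rho_X} \to \mathcal{H}, \quad (S_K g)(\mu_u) = \int_X K(\mu_u, \mu_t) g(\mu_t) \, \mathrm{d}\rho_X(\mu_t),$$

$$\tilde{T} = S_K^* S_K : L^2_{\rho_X} \to L^2_{\rho_X}.$$

Our range space assumption on $\rho$: $f_\rho \in \mathrm{Im}(\tilde{T}^{s})$ for some $s \ge 0$.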

Misspecified case: note on the separability of $L^2_{\rho_X}$

$L^2_{\rho_X}$: separable ⇔ the measure space with $d(A, B) = \rho_X(A \triangle B)$ is so [Thomson et al., 2008].

Topological space, open sets

Given: $D \ne \emptyset$ set. $\tau \subseteq 2^{D}$ is called a topology on $D$ if:

1. $\emptyset \in \tau$, $D \in \tau$.
2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ ⇒ $O_1 \cap O_2 \in \tau$.
3. Arbitrary union: $O_i \in \tau$ ($i \in I$) ⇒ $\cup_{i \in I} O_i \in \tau$.

Then, $(D, \tau)$ is called a topological space; $O \in \tau$: open sets.

Closed-, compact set, closure, dense subset, separability

Given: $(D, \tau)$. $A \subseteq D$ is

closed if $D \setminus A \in \tau$ (i.e., its complement is open),

compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.

Closure of $A \subseteq D$:

$$\bar{A} := \bigcap_{A \subseteq C \text{ closed in } D} C. \tag{6}$$

$A \subseteq D$ is dense if $\bar{A} = D$.

$(D, \tau)$ is separable if $\exists$ countable, dense subset of $D$. Counterexample: $l^{\infty}$ / $L^{\infty}$.

Vector-valued RKHS

Definition: A Hilbert space $\mathcal{H} \subseteq Y^{X}$ of functions is an RKHS if

$$A_{\mu_x, y} : f \mapsto \langle y, f(\mu_x) \rangle_Y \tag{7}$$

is continuous for all $\mu_x \in X$, $y \in Y$.

= The evaluation functional is continuous in every direction.

Riesz representation theorem ⇒ $\exists K_{\mu_t} \in L(Y, \mathcal{H})$:

$$K(\mu_x, \mu_t)(y) = (K_{\mu_t} y)(\mu_x) \ (\forall \mu_x, \mu_t \in X), \text{ or shortly } K(\cdot, \mu_t)(y) = K_{\mu_t} y, \tag{8}$$

$$\mathcal{H}(K) = \overline{\operatorname{span}}\{K_{\mu_t} y : \mu_t \in X, y \in Y\}. \tag{9}$$

Vector-valued RKHS – continued

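Examples ($Y = \mathbb{R}^d$):

1. $K_i : X \times X \to \mathbb{R}$ kernels ($i = 1, \ldots, d$). Diagonal kernel:

$$K(\mu_a, \mu_b) = \operatorname{diag}(K_1(\mu_a, \mu_b), \ldots, K_d(\mu_a, \mu_b)). \tag{10}$$

2. Combination of $D_j$ diagonal kernels [$D_j(\mu_a, \mu_b) \in \mathbb{R}^{r \times r}$, $A_j \in \mathbb{R}^{r \times d}$]:

$$K(\mu_a, \mu_b) = \sum_{j=1}^{m} A_j^* D_j(\mu_a, \mu_b) A_j. \tag{11}$$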
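Constructions (10) and (11) can be illustrated for a single pair $(\mu_a, \mu_b)$ as below. This is only a sketch: the scalar kernel values are replaced by positive placeholder numbers, and the function names and dimensions are assumptions made for the example.

```python
import numpy as np

def diagonal_kernel(scalar_values):
    """Eq. (10): place the d scalar kernel values K_1(...), ..., K_d(...) on a diagonal."""
    return np.diag(np.asarray(scalar_values, dtype=float))

def combined_kernel(A_list, D_list):
    """Eq. (11): sum_j A_j^T D_j A_j (real matrices, so the adjoint is the transpose)."""
    return sum(A.T @ D @ A for A, D in zip(A_list, D_list))

# Placeholder inputs (illustrative only): d = 3 outputs, r = 2, m = 4 terms.
rng = np.random.default_rng(0)
d, r, m = 3, 2, 4
A_list = [rng.normal(size=(r, d)) for _ in range(m)]
# Each D_j stands in for a diagonal kernel evaluated at one fixed pair (mu_a, mu_b).
D_list = [diagonal_kernel(rng.uniform(0.1, 1.0, size=r)) for _ in range(m)]
K_ab = combined_kernel(A_list, D_list)  # a d x d matrix, i.e. an element of L(Y)
```

The result is a $d \times d$ matrix, matching $L(Y) = \mathbb{R}^{d \times d}$ for $Y = \mathbb{R}^d$.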

Weak topology on $M_1^+(D)$

Def.: It is the weakest topology such that the mapping

$$L_h : (M_1^+(D), \tau_w) \to \mathbb{R}, \quad L_h(x) = \int_D h(u) \, \mathrm{d}x(u)$$

is continuous for all $h \in C_b(D)$, where $C_b(D) = \{(D, \tau) \to \mathbb{R} \text{ bounded, continuous functions}\}$.

Measurability of µ

$k$: bounded, continuous ⇒ $\mu : (M_1^+(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable.

$\mu$ measurable, $X \in \mathcal{B}(H)$ ⇒ $\rho$ on $X \times Y$: well-defined.

If $D$ is compact metric and $k$ is universal, then $\mu$ is continuous and $X \in \mathcal{B}(H)$.

Universal kernel examples

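On every compact subset of $\mathbb{R}^d$:

$k(a, b) = e^{-\frac{\|a - b\|_2^2}{2\sigma^2}}$, ($\sigma > 0$),

$k(a, b) = e^{\beta \langle a, b \rangle}$, ($\beta > 0$), or more generally $k(a, b) = f(\langle a, b \rangle)$, $f(x) = \sum_{n=0}^{\infty} a_n x^n$ ($\forall a_n > 0$),

$k(a, b) = (1 - \langle a, b \rangle)^{-\alpha}$, ($\alpha > 0$), on compact subsets of the open unit ball.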

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.
