Regression on Probability Measures: A Simple and Consistent Algorithm
Zoltán Szabó (Gatsby Unit, UCL)

Joint work with
◦ Bharath K. Sriperumbudur (Department of Statistics, PSU),
◦ Barnabás Póczos (ML Department, CMU),
◦ Arthur Gretton (Gatsby Unit, UCL)

CRiSM Seminars Department of Statistics, University of Warwick May 29, 2015


The task

Samples: $\{(x_i, y_i)\}_{i=1}^{\ell}$. Goal: find $f \in \mathcal{H}$ with $f(x_i) \approx y_i$.

Distribution regression: the $x_i$ are distributions, each available only through samples $\{x_{i,n}\}_{n=1}^{N}$.

⇒ Training examples: labelled bags.


Example: aerosol prediction from satellite images

Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.

Relevance: climate research. Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5-8.5. Can distribution regression compete?


Wider context

Context:
◦ machine learning: multi-instance learning,
◦ statistics: point estimation tasks (without analytical formula).

Applications:
◦ computer vision: image = collection of patch vectors,
◦ network analysis: group of people = bag of friendship graphs,
◦ natural language processing: corpus = bag of documents,
◦ time-series modelling: user = set of trial time-series.


Several algorithmic approaches

1. Parametric fit: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis) [Póczos et al., 2011].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].


Theoretical guarantee?

MIL dates back to [Haussler, 1999, Gärtner et al., 2002].

Sensible methods in regression require density estimation [Póczos et al., 2013, Oliva et al., 2014, Reddi and Póczos, 2014] + assumptions:
1. compact Euclidean domain,
2. output = $\mathbb{R}$ ([Oliva et al., 2013] allows distribution outputs).



Kernel, RKHS

$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if there exists a feature map $\varphi : D \to H$ (a Hilbert space) with $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ $(\forall a, b \in D)$.

Kernel examples on $D = \mathbb{R}^d$ $(p > 0, \theta > 0)$:
◦ $k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial,
◦ $k(a, b) = e^{-\|a-b\|_2^2/(2\theta^2)}$: Gaussian,
◦ $k(a, b) = e^{-\theta\|a-b\|_1}$: Laplacian.

In the RKHS $H = H(k)$ (which exists and is unique): $\varphi(u) = k(\cdot, u)$.

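To make these concrete, here is a minimal NumPy sketch of the three Gram matrices (the function names and toy data are our own illustration, not part of the slides):

```python
import numpy as np

def polynomial_gram(A, B, p=2, theta=1.0):
    # k(a, b) = (<a, b> + theta)^p
    return (A @ B.T + theta) ** p

def gaussian_gram(A, B, theta=1.0):
    # k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * theta**2))

def laplacian_gram(A, B, theta=1.0):
    # k(a, b) = exp(-theta * ||a - b||_1)
    d1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-theta * d1)

A = np.random.randn(5, 3)                      # 5 points in R^3
G = gaussian_gram(A, A)
print(np.allclose(G, G.T))                     # Gram matrices are symmetric
print(np.all(np.linalg.eigvalsh(G) > -1e-10))  # ... and positive semi-definite
```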

Kernel: example domains (D)

Euclidean space: $D = \mathbb{R}^d$. Graphs, texts, time series, dynamical systems.

Distributions!





Problem formulation ($Y = \mathbb{R}$)

Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{\ell}$, where the $i$th bag $\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \overset{\text{i.i.d.}}{\sim} x_i \in \mathcal{P}(D)$ and $y_i \in \mathbb{R}$.

Task: find a $\mathcal{P}(D) \to \mathbb{R}$ mapping based on $\hat{z}$.

Construction: distribution embedding ($\mu_x$) + ridge regression:
$$\mathcal{P}(D) \xrightarrow{\mu = \mu(k)} X \subseteq H = H(k) \xrightarrow{f \in \mathcal{H} = \mathcal{H}(K)} \mathbb{R}.$$

Our goal: risk bound compared to the regression function
$$f_\rho(\mu_x) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_x).$$


Goal in details

Expected risk: $\mathcal{R}[f] = \mathbb{E}_{(x,y)} |f(\mu_x) - y|^2$.

Contribution: analysis of the excess risk
$$\mathcal{E}(\hat{f}_z^\lambda, f_\rho) = \mathcal{R}[\hat{f}_z^\lambda] - \mathcal{R}[f_\rho] \le g(\ell, N, \lambda) \to 0 \quad \text{and rates},$$
$$\hat{f}_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{\hat{x}_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2 \quad (\lambda > 0).$$

We consider two settings:
1. well-specified case: $f_\rho \in \mathcal{H}$,
2. misspecified case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.


Step-1: mean embedding

$k : D \times D \to \mathbb{R}$ kernel; canonical feature map: $\varphi(u) = k(\cdot, u)$. Mean embedding of a distribution $x \in \mathcal{P}(D)$ and of a bag $\hat{x}_i$:
$$\mu_x = \int_D k(\cdot, u)\, dx(u) \in H(k), \qquad \mu_{\hat{x}_i} = \int_D k(\cdot, u)\, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$

Linear $K$ ⇒ set kernel:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \left\langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \right\rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$

Nonlinear $K$ example:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = e^{-\|\mu_{\hat{x}_i} - \mu_{\hat{x}_j}\|_H^2/(2\sigma^2)}.$$
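A minimal NumPy sketch of both constructions, assuming a Gaussian base kernel $k$ on $D = \mathbb{R}^d$ (the helper names and toy bags are ours):

```python
import numpy as np

def base_gram(A, B, theta=1.0):
    # base kernel k(a, b) = exp(-||a - b||_2^2 / (2 theta^2)) on D = R^d
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * theta**2))

def set_kernel(bag_i, bag_j, theta=1.0):
    # linear K: <mu_i, mu_j>_H = (1/N^2) sum_{n,m} k(x_{i,n}, x_{j,m})
    return base_gram(bag_i, bag_j, theta).mean()

def gaussian_emb_kernel(bag_i, bag_j, theta=1.0, sigma=1.0):
    # nonlinear K: exp(-||mu_i - mu_j||_H^2 / (2 sigma^2)), using
    # ||mu_i - mu_j||_H^2 = <mu_i, mu_i> + <mu_j, mu_j> - 2 <mu_i, mu_j>
    d2 = (set_kernel(bag_i, bag_i, theta) + set_kernel(bag_j, bag_j, theta)
          - 2 * set_kernel(bag_i, bag_j, theta))
    return np.exp(-d2 / (2 * sigma**2))

bags = [np.random.randn(20, 3) for _ in range(4)]  # 4 bags, N = 20 points each
K = np.array([[gaussian_emb_kernel(bi, bj) for bj in bags] for bi in bags])
```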

Step-2: ridge regression (analytical solution)

Given: training sample $\hat{z}$, test distribution $t$. Prediction on $t$:
$$(\hat{f}_z^\lambda \circ \mu)(t) = \mathbf{k}(\mathbf{K} + \ell\lambda I_\ell)^{-1}[y_1; \ldots; y_\ell], \tag{1}$$
$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathbb{R}^{\ell \times \ell}, \tag{2}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathbb{R}^{1 \times \ell}. \tag{3}$$
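A minimal sketch of the prediction (1), with (2)-(3) assembled from any of the bag kernels above (the function name is ours; `gaussian_emb_kernel` refers to the previous sketch):

```python
import numpy as np

def ridge_predict(K, k, y, lam):
    # (1): f(t) = k (K + ell * lam * I)^{-1} [y_1; ...; y_ell]
    ell = K.shape[0]
    alpha = np.linalg.solve(K + ell * lam * np.eye(ell), y)
    return k @ alpha

# usage with the bag kernel from the previous sketch:
# K = np.array([[gaussian_emb_kernel(bi, bj) for bj in bags] for bi in bags])  # (2)
# k = np.array([gaussian_emb_kernel(b, test_bag) for b in bags])               # (3)
# y_hat = ridge_predict(K, k, y, lam=1e-3)
```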

Blanket assumptions: both settings

◦ $D$: separable, topological domain.
◦ $k$: bounded ($\sup_{u \in D} k(u, u) \le B_k \in (0, \infty)$), continuous.
◦ $K$: bounded; Hölder continuous: $\exists L > 0$, $h \in (0, 1]$ such that $\|K(\cdot, \mu_a) - K(\cdot, \mu_b)\|_{\mathcal{H}} \le L \|\mu_a - \mu_b\|_H^h$.
◦ $y$: bounded.
◦ $X = \mu(\mathcal{P}(D)) \in \mathcal{B}(H)$.


Well-specified case: performance guarantee

Difficulty of the task: $f_\rho$ is '$c$-smooth', with a '$b$-decaying' covariance operator.

Contribution: if $\ell \ge \lambda^{-\frac{1}{b}-1}$, then with high probability
$$\mathcal{E}(\hat{f}_z^\lambda, f_\rho) \le \underbrace{\frac{\log^h(\ell)}{N^h \lambda^3} + \lambda^c + \frac{1}{\ell^2 \lambda} + \frac{1}{\ell \lambda^{1/b}}}_{g(\ell, N, \lambda)}. \tag{4}$$

Here the first term is the price of seeing only the bags $\hat{x}_i$ (bag size $N$); the $\lambda^c$ term reflects the $c$-smoothness of $f_\rho$.

Well-specified case: example

Assume $b$ is 'large' ($1/b \approx 0$: 'small' effective input dimension) and $h = 1$ ($K$: Lipschitz). Set term ① = term ② in (4) ⇒ $\lambda$; let $\ell = N^a$ $(a > 0)$, and let $t = \ell N$ be the total number of samples processed. Then:
1. $c = 2$ ('smooth' $f_\rho$): $\mathcal{E}(\hat{f}_z^\lambda, f_\rho) \approx t^{-2/7}$ – faster convergence,
2. $c = 1$ ('non-smooth' $f_\rho$): $\mathcal{E}(\hat{f}_z^\lambda, f_\rho) \approx t^{-1/5}$ – slower.
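Where the exponents come from (a worked sketch under the stated simplifications: $h = 1$, $1/b \approx 0$, logarithmic factors dropped):

```latex
% Balance term 1 and term 2 of (4):
%   N^{-1}\lambda^{-3} = \lambda^{c} \;\Rightarrow\; \lambda = N^{-1/(c+3)}.
% With \ell = N^{a} and a = c/(c+3), the remaining terms 1/(\ell^{2}\lambda)
% and 1/\ell are of the same or smaller order, and t = \ell N = N^{1 + c/(c+3)}.
\mathcal{E}(\hat{f}_{z}^{\lambda}, f_{\rho})
  \approx \lambda^{c}
  = N^{-\frac{c}{c+3}}
  = t^{-\frac{c}{2c+3}}:
  \qquad t^{-2/7}\ (c = 2), \qquad t^{-1/5}\ (c = 1).
```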


Misspecified case: performance guarantee

Difficulty of the task: $f_\rho$ is '$s$-smooth' $(s > 0)$.

Contribution: if $L^2_{\rho_X}$ is separable and $\frac{1}{\lambda^2} \le \ell$, then with high probability
$$\mathcal{E}(\hat{f}_z^\lambda, f_\rho) \le \underbrace{\frac{\log^{h/2}(\ell)}{N^{h/2} \lambda^{3/2}} + \frac{1}{\sqrt{\ell\lambda}} + \frac{\lambda^{\min(1,s)}}{\sqrt{\lambda}\sqrt{\ell}} + \lambda^{\min(1,s)}}_{g(\ell, N, \lambda)}. \tag{5}$$

Here the first term is again the price of seeing only $\hat{x}_i$; the $\lambda^{\min(1,s)}$ terms reflect the $s$-smoothness of $f_\rho$.

Misspecified case: example

Assume $s \ge 1$ and $h = 1$ ($K$: Lipschitz). Set term ① = term ③ in (5) ⇒ $\lambda$; let $\ell = N^a$ $(a > 0)$ and let $t = \ell N$ be the total number of samples processed. Then:
1. $s = 1$ ('non-smooth' $f_\rho$): $\mathcal{E}(\hat{f}_z^\lambda, f_\rho) \approx t^{-0.25}$ – slower,
2. $s \to \infty$ ('smooth' $f_\rho$): $\mathcal{E}(\hat{f}_z^\lambda, f_\rho) \approx t^{-0.5}$ – faster convergence.

Notes on the assumptions: $\exists \rho$, $X \in \mathcal{B}(H)$

◦ $k$ bounded, continuous ⇒ $\mu : (\mathcal{P}(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable.
◦ $\mu$ measurable, $X \in \mathcal{B}(H)$ ⇒ $\rho$ on $X \times Y$ is well-defined.
◦ If (*) := [$D$ is compact metric, $k$ is universal], then $\mu$ is continuous and $X \in \mathcal{B}(H)$.

Notes on the assumptions: Hölder $K$ examples

In case of (*):
◦ $K_G(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H^2/(2\theta^2)}$: $h = 1$,
◦ $K_e(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H/(2\theta^2)}$: $h = \frac{1}{2}$,
◦ $K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2/\theta^2\right)^{-1}$: $h = 1$,
◦ $K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^\theta\right)^{-1}$ $(\theta \le 2)$: $h = \frac{\theta}{2}$,
◦ $K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-\frac{1}{2}}$: $h = 1$.

All are functions of $\|\mu_a - \mu_b\|_H$ ⇒ computation: similar to the set kernel.
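Since each of these depends only on $\|\mu_a - \mu_b\|_H$, three set-kernel evaluations suffice; a minimal sketch for $K_C$ (the helper names are ours, and `base_gram` can be any Gram routine for $k$, e.g., the Gaussian one above):

```python
import numpy as np

def emb_dist_sq(bag_a, bag_b, base_gram):
    # ||mu_a - mu_b||_H^2 from three set-kernel (mean Gram) evaluations
    return (base_gram(bag_a, bag_a).mean() + base_gram(bag_b, bag_b).mean()
            - 2 * base_gram(bag_a, bag_b).mean())

def K_cauchy(bag_a, bag_b, base_gram, theta=1.0):
    # K_C(mu_a, mu_b) = (1 + ||mu_a - mu_b||_H^2 / theta^2)^(-1), h = 1
    return 1.0 / (1.0 + emb_dist_sq(bag_a, bag_b, base_gram) / theta**2)
```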

Notes on the assumptions: misspecified case

$L^2_{\rho_X}$ is separable ⇔ the measure space with metric $d(A, B) = \rho_X(A \triangle B)$ is separable [Thomson et al., 2008].

Vector-valued output: $Y$ = separable Hilbert space

Objective function:
$$\hat{f}_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2 \quad (\lambda > 0).$$

$K(\mu_a, \mu_b) \in L(Y)$: operator-valued kernel, vector-valued RKHS.

Vector-valued output: analytical solution

Prediction on a new test distribution $t$:
$$(\hat{f}_z^\lambda \circ \mu)(t) = \mathbf{k}(\mathbf{K} + \ell\lambda I_\ell)^{-1}[y_1; \ldots; y_\ell], \tag{6}$$
$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in L(Y)^{\ell \times \ell}, \tag{7}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in L(Y)^{1 \times \ell}. \tag{8}$$

Specifically: $Y = \mathbb{R}$ ⇒ $L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ ⇒ $L(Y) = \mathbb{R}^{d \times d}$.
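For $Y = \mathbb{R}^d$, (6)-(8) form a block system; in the special case $K(\mu_a, \mu_b) = \kappa(\mu_a, \mu_b) I_d$ with a scalar kernel $\kappa$, it decouples into $d$ copies of the scalar solution (1). A minimal sketch under that assumption (names ours):

```python
import numpy as np

def vector_ridge_predict(K_scalar, k_scalar, Y, lam):
    # K(mu_a, mu_b) = kappa(mu_a, mu_b) * I_d: the (ell*d x ell*d) system of (6)
    # decouples, so each column of Y (ell x d) is solved as in the scalar case.
    ell = K_scalar.shape[0]
    alpha = np.linalg.solve(K_scalar + ell * lam * np.eye(ell), Y)  # ell x d
    return k_scalar @ alpha  # length-d prediction for the test bag
```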

Vector-valued output: $K$ assumptions

Boundedness and Hölder continuity of $K$:
1. Boundedness: $\|K_{\mu_a}\|_{HS}^2 = \mathrm{Tr}\left(K_{\mu_a}^* K_{\mu_a}\right) \le B_K \in (0, \infty)$ $(\forall \mu_a \in X)$.
2. Hölder continuity: $\exists L > 0$, $h \in (0, 1]$ such that $\|K_{\mu_a} - K_{\mu_b}\|_{L(Y, \mathcal{H})} \le L \|\mu_a - \mu_b\|_H^h$ $(\forall (\mu_a, \mu_b) \in X \times X)$.

Demo

Supervised entropy learning (RMSE: MERR = 0.75, DFDR = 2.02):
[Figure: true entropy vs. MERR prediction as a function of the rotation angle (β); RMSE boxplots for MERR and DFDR.]

Aerosol prediction from satellite images:
◦ State-of-the-art baseline: 7.5-8.5 (±0.1-0.6).
◦ MERR: 7.81 (±1.64).

Summary

◦ Problem: distribution regression.
◦ Literature: large number of heuristics.
◦ Contribution: a simple ridge solution is consistent; in particular, the set kernel is consistent (a 15-year-old open question).

Simplified version [$Y = \mathbb{R}$, $f_\rho \in \mathcal{H}$]: AISTATS-2015 (oral).

Summary – continued

Code in ITE: https://bitbucket.org/szzoli/ite/
Extended analysis (submitted to JMLR): http://arxiv.org/abs/1411.2066
Closely related research directions (Bayesian world): ∞-dimensional exponential family fitting, just-in-time kernel EP: accepted at UAI-2015.

Thank you for your attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

Appendix: contents

◦ Topological definitions, separability.
◦ Prior definitions (ρ).
◦ Universal kernel: definition, examples.
◦ Vector-valued RKHS.
◦ Demos: further details.
◦ Hausdorff metric.
◦ Weak topology on $\mathcal{P}(D)$.

Topological space, open sets

Given: a set $D \neq \emptyset$. $\tau \subseteq 2^D$ is called a topology on $D$ if:
1. $\emptyset \in \tau$, $D \in \tau$.
2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ ⇒ $O_1 \cap O_2 \in \tau$.
3. Arbitrary union: $O_i \in \tau$ $(i \in I)$ ⇒ $\cup_{i \in I} O_i \in \tau$.

Then $(D, \tau)$ is called a topological space; the $O \in \tau$ are the open sets.

Closed-, compact set, closure, dense subset, separability

Given: $(D, \tau)$. $A \subseteq D$ is
◦ closed if $D \setminus A \in \tau$ (i.e., its complement is open),
◦ compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.

Closure of $A \subseteq D$:
$$\bar{A} := \bigcap_{A \subseteq C \text{ closed in } D} C. \tag{9}$$

$A \subseteq D$ is dense if $\bar{A} = D$. $(D, \tau)$ is separable if there exists a countable, dense subset of $D$. Counterexample: $\ell^\infty$/$L^\infty$.

Prior (well-specified case): $\rho \in \mathcal{P}(b, c)$

Let $T : \mathcal{H} \to \mathcal{H}$ be the covariance operator
$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a)\, d\rho_X(\mu_a)$$
with eigenvalues $t_n$ $(n = 1, 2, \ldots)$.

Assumption: $\rho \in \mathcal{P}(b, c)$ = the set of distributions on $X \times Y$ with
◦ $\alpha \le n^b t_n \le \beta$ $(\forall n \ge 1;\ \alpha > 0,\ \beta > 0)$,
◦ $\exists g \in \mathcal{H}$ such that $f_\rho = T^{\frac{c-1}{2}} g$ with $\|g\|_{\mathcal{H}}^2 \le R$ $(R > 0)$,
where $b \in (1, \infty)$, $c \in [1, 2]$.

Intuition: $1/b$ – effective input dimension, $c$ – smoothness of $f_\rho$.

Prior: misspecified case

Let $\tilde{T}$ be defined via
$$S_K : L^2_{\rho_X} \to \mathcal{H}, \quad (S_K g)(\mu_u) = \int_X K(\mu_u, \mu_t) g(\mu_t)\, d\rho_X(\mu_t),$$
$$S_K^* : \mathcal{H} \hookrightarrow L^2_{\rho_X}, \qquad \tilde{T} = S_K^* S_K : L^2_{\rho_X} \to L^2_{\rho_X}.$$

Our range space assumption on $\rho$: $f_\rho \in \mathrm{Im}\left(\tilde{T}^s\right)$ for some $s \ge 0$.


Universal kernel: definition

Assume $D$ is compact metric and the kernel $k : D \times D \to \mathbb{R}$ is continuous. Then:
◦ Def-1: $k$ is universal if $H(k)$ is dense in $(C(D), \|\cdot\|_\infty)$.
◦ Def-2: $k$ is characteristic if $\mu : \mathcal{P}(D) \to H(k)$ is injective; universal if $\mu$ is injective on the finite signed measures on $D$.

Universal kernel: examples

On compact subsets of $\mathbb{R}^d$:
◦ $k(a, b) = e^{-\|a-b\|_2^2/(2\sigma^2)}$ $(\sigma > 0)$,
◦ $k(a, b) = e^{-\sigma\|a-b\|_1}$ $(\sigma > 0)$,
◦ $k(a, b) = e^{\beta\langle a, b\rangle}$ $(\beta > 0)$,
or more generally
$$k(a, b) = f(\langle a, b\rangle), \quad f(x) = \sum_{n=0}^{\infty} a_n x^n \quad (\forall a_n > 0).$$

Vector-valued RKHS: $\mathcal{H} = \mathcal{H}(K)$

Definition: a Hilbert space $\mathcal{H} \subseteq Y^X$ of functions is an RKHS if
$$A_{\mu_x, y} : f \in \mathcal{H} \mapsto \langle y, f(\mu_x) \rangle_Y \in \mathbb{R} \tag{10}$$
is continuous for all $\mu_x \in X$, $y \in Y$, i.e., the evaluation functional is continuous in every direction.


Vector-valued RKHS: $\mathcal{H} = \mathcal{H}(K)$ – continued

Riesz representation theorem ⇒ $\exists K(\mu_x|y) \in \mathcal{H}$ with
$$\langle y, f(\mu_x) \rangle_Y = \langle K(\mu_x|y), f \rangle_{\mathcal{H}} \quad (\forall f \in \mathcal{H}). \tag{11}$$

$K(\mu_x|y)$ is linear and bounded in $y$ ⇒ $K(\mu_x|y) = K_{\mu_x}(y)$ with $K_{\mu_x} \in L(Y, \mathcal{H})$.

$K$ construction:
$$K(\mu_x, \mu_t)(y) = (K_{\mu_t} y)(\mu_x), \quad K(\cdot, \mu_t)(y) = K_{\mu_t} y \quad (\forall \mu_x, \mu_t \in X), \tag{12}$$
i.e.,
$$\mathcal{H}(K) = \overline{\mathrm{span}}\{K_{\mu_t} y : \mu_t \in X, y \in Y\}. \tag{13}$$

Shortly: $K(\mu_x, \mu_t) \in L(Y)$ generalizes $k(u, v) \in \mathbb{R}$.

Vector-valued RKHS – examples: $Y = \mathbb{R}^d$

1. $K_i : X \times X \to \mathbb{R}$ kernels $(i = 1, \ldots, d)$; diagonal kernel:
$$K(\mu_a, \mu_b) = \mathrm{diag}\left(K_1(\mu_a, \mu_b), \ldots, K_d(\mu_a, \mu_b)\right). \tag{14}$$
2. Combination of diagonal kernels $D_j$ [$D_j(\mu_a, \mu_b) \in \mathbb{R}^{r \times r}$, $A_j \in \mathbb{R}^{r \times d}$]:
$$K(\mu_a, \mu_b) = \sum_{j=1}^{m} A_j^* D_j(\mu_a, \mu_b) A_j. \tag{15}$$

Demo-1: supervised entropy learning

◦ Problem: learn the entropy of the first coordinate of (rotated) Gaussians.
◦ Baseline: kernel smoothing based distribution regression (applying density estimation) =: DFDR.
◦ Performance: RMSE boxplot over 25 random experiments.
◦ Experience: more precise than the only theoretically justified method, by avoiding density estimation.

Demo-2: aerosol prediction – selected kernels

Kernel definitions $(p = 2, 3)$:
$$k_G(a, b) = e^{-\frac{\|a-b\|_2^2}{2\theta^2}}, \qquad k_e(a, b) = e^{-\frac{\|a-b\|_2}{2\theta^2}}, \tag{16}$$
$$k_C(a, b) = \frac{1}{1 + \frac{\|a-b\|_2^2}{\theta^2}}, \qquad k_t(a, b) = \frac{1}{1 + \|a-b\|_2^\theta}, \tag{17}$$
$$k_p(a, b) = (\langle a, b\rangle + \theta)^p, \qquad k_r(a, b) = 1 - \frac{\|a-b\|_2^2}{\|a-b\|_2^2 + \theta}, \tag{18}$$
$$k_i(a, b) = \frac{1}{\sqrt{\|a-b\|_2^2 + \theta^2}}, \tag{19}$$
$$k_{M,\frac{3}{2}}(a, b) = \left(1 + \frac{\sqrt{3}\|a-b\|_2}{\theta}\right) e^{-\frac{\sqrt{3}\|a-b\|_2}{\theta}}, \tag{20}$$
$$k_{M,\frac{5}{2}}(a, b) = \left(1 + \frac{\sqrt{5}\|a-b\|_2}{\theta} + \frac{5\|a-b\|_2^2}{3\theta^2}\right) e^{-\frac{\sqrt{5}\|a-b\|_2}{\theta}}. \tag{21}$$
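A minimal NumPy sketch of two of these base kernels, the Matérn ones in (20)-(21) (function names ours):

```python
import numpy as np

def matern32(a, b, theta=1.0):
    # (20): (1 + sqrt(3) r / theta) exp(-sqrt(3) r / theta), r = ||a - b||_2
    s = np.sqrt(3) * np.linalg.norm(a - b) / theta
    return (1 + s) * np.exp(-s)

def matern52(a, b, theta=1.0):
    # (21): (1 + sqrt(5) r / theta + 5 r^2 / (3 theta^2)) exp(-sqrt(5) r / theta)
    s = np.sqrt(5) * np.linalg.norm(a - b) / theta
    return (1 + s + s**2 / 3) * np.exp(-s)
```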

Existing methods: set metric based algorithms

Hausdorff metric [Edgar, 1995]:
$$d_H(X, Y) = \max\left\{\sup_{x \in X} \inf_{y \in Y} d(x, y),\ \sup_{y \in Y} \inf_{x \in X} d(x, y)\right\}. \tag{22}$$

A metric on the compact sets of a metric space $[(M, d);\ X, Y \subseteq M]$. 'Slight' problem: highly sensitive to outliers.
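For reference, a minimal NumPy sketch of (22) for finite point sets with the Euclidean ground metric (names ours); note how a single added outlier dominates the value:

```python
import numpy as np

def hausdorff(X, Y):
    # (22): max of the two directed distances sup_x inf_y d(x, y)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise d(x, y)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.random.randn(30, 2)
print(hausdorff(X, X))                               # 0: identical sets
print(hausdorff(X, np.vstack([X, [100.0, 100.0]])))  # huge: one outlier added
```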

Weak topology on $\mathcal{P}(D)$

Def.: the weakest topology $\tau_w$ on $\mathcal{P}(D)$ such that the mapping
$$L_h : (\mathcal{P}(D), \tau_w) \to \mathbb{R}, \quad L_h(x) = \int_D h(u)\, dx(u)$$
is continuous for all $h \in C_b(D)$, where $C_b(D) = \{(D, \tau) \to \mathbb{R} \text{ bounded, continuous functions}\}$.

References

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870-873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169-1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179-186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136-143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819-844.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935-975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J., Póczos, B., and Schneider, J. (2013). Distribution to distribution regression. International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049-1057.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706-714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507-515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599-608.

Reddi, S. J. and Póczos, B. (2014). k-NN regression on functional data with incomplete observations. In Conference on Uncertainty in Artificial Intelligence (UAI).

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648-655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119-1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226-2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430-441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47-68.

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917-929.