Regression on Probability Measures: A Simple and Consistent Algorithm
Zoltán Szabó (Gatsby Unit, UCL)
Joint work with: Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL).
CRiSM Seminars, Department of Statistics, University of Warwick, May 29, 2015.
The task
Samples: $\{(x_i, y_i)\}_{i=1}^{\ell}$. Goal: find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.
Distribution regression: the $x_i$ are distributions, each available only through samples $\{x_{i,n}\}_{n=1}^{N}$.
⇒ Training examples: labelled bags.
Example: aerosol prediction from satellite images
Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.
Relevance: climate research. Engineered methods [Wang et al., 2012]: $100 \times$ RMSE $= 7.5$–$8.5$. What about distribution regression?
Wider context
Context: machine learning (multi-instance learning); statistics (point estimation tasks without an analytical formula).
Applications:
- computer vision: image = collection of patch vectors,
- network analysis: group of people = bag of friendship graphs,
- natural language processing: corpus = bag of documents,
- time-series modelling: user = set of trial time series.
Several algorithmic approaches
1. Parametric fit: Gaussian, mixture of Gaussians, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis) [Póczos et al., 2011].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].
Theoretical guarantee?
MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression require density estimation [Póczos et al., 2013, Oliva et al., 2014, Reddi and Póczos, 2014] + assumptions:
1. compact Euclidean domain,
2. output = $\mathbb{R}$ ([Oliva et al., 2013] allows distribution outputs).
Kernel, RKHS
$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if $\exists \varphi : D \to H$ (a Hilbert space) feature map with $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in D$).
Kernel examples on $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$):
- polynomial: $k(a, b) = (\langle a, b \rangle + \theta)^p$,
- Gaussian: $k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$,
- Laplacian: $k(a, b) = e^{-\theta \|a - b\|_1}$.
In the RKHS $H = H(k)$ ($\exists!$): $\varphi(u) = k(\cdot, u)$.
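As a concrete illustration (a minimal sketch, not part of the original slides; the function names are mine), here are these three kernels in NumPy:

```python
import numpy as np

def polynomial_kernel(a, b, theta=1.0, p=2):
    # k(a, b) = (<a, b> + theta)^p
    return (a @ b + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * theta ** 2))

def laplacian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-theta * ||a - b||_1)
    return np.exp(-theta * np.sum(np.abs(a - b)))

a, b = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(gaussian_kernel(a, b))  # value in (0, 1]; equals 1 iff a == b
```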
Kernel: example domains (D)
Euclidean space: $D = \mathbb{R}^d$; graphs, texts, time series, dynamical systems.
Distributions!
Problem formulation ($Y = \mathbb{R}$)
Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{\ell}$,
$i$th bag: $\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \overset{i.i.d.}{\sim} x_i \in \mathcal{P}(D)$, $y_i \in \mathbb{R}$.
Task: find a $\mathcal{P}(D) \to \mathbb{R}$ mapping based on $\hat{z}$.
Construction: distribution embedding ($\mu_x$) + ridge regression:
$$\mathcal{P}(D) \xrightarrow{\;\mu = \mu(k)\;} X \subseteq H = H(k) \xrightarrow{\;f \in \mathcal{H} = \mathcal{H}(K)\;} \mathbb{R}.$$
Our goal: a risk bound compared to the regression function
$$f_\rho(\mu_x) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_x).$$
Goal in details
Expected risk: $\mathcal{R}[f] = \mathbb{E}_{(x,y)} |f(\mu_x) - y|^2$.
Contribution: analysis of the excess risk
$$\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) = \mathcal{R}[\hat{f}_{\hat{z}}^{\lambda}] - \mathcal{R}[f_\rho] \le g(\ell, N, \lambda) \to 0, \quad \text{and rates},$$
where
$$\hat{f}_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{\hat{x}_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2 \quad (\lambda > 0).$$
We consider two settings:
1. well-specified case: $f_\rho \in \mathcal{H}$,
2. misspecified case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.
Step-1: mean embedding
$k : D \times D \to \mathbb{R}$ kernel; canonical feature map: $\varphi(u) = k(\cdot, u)$.
Mean embedding of a distribution $x$ and of a bag $\hat{x}_i \in \mathcal{P}(D)$:
$$\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k), \qquad \mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$
Linear $K$ ⇒ set kernel:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \left\langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \right\rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$
Nonlinear $K$ example:
$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = e^{-\|\mu_{\hat{x}_i} - \mu_{\hat{x}_j}\|_H^2 / (2\sigma^2)}.$$
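To make the computation concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of the set kernel and the Gaussian-on-embeddings kernel between two bags, with a Gaussian base kernel $k$:

```python
import numpy as np

def base_gram(A, B, theta=1.0):
    """Gaussian base kernel k evaluated between all rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * theta**2))

def set_kernel(X, Y, theta=1.0):
    """Linear K: <mu_X, mu_Y>_H = mean of all pairwise base-kernel values."""
    return base_gram(X, Y, theta).mean()

def gaussian_emb_kernel(X, Y, theta=1.0, sigma=1.0):
    """Nonlinear K: Gaussian kernel on the mean embeddings.
    ||mu_X - mu_Y||_H^2 expands into three set-kernel terms."""
    sq_dist = (set_kernel(X, X, theta) + set_kernel(Y, Y, theta)
               - 2 * set_kernel(X, Y, theta))
    return np.exp(-sq_dist / (2 * sigma**2))

# two bags of N samples in R^2
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))
print(set_kernel(X, Y), gaussian_emb_kernel(X, Y))
```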
Step-2: ridge regression (analytical solution)
Given: training sample $\hat{z}$, test distribution $t$.
Prediction on $t$:
$$(\hat{f}_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda I_\ell)^{-1} [y_1; \ldots; y_\ell], \tag{1}$$
$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathbb{R}^{\ell \times \ell}, \tag{2}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathbb{R}^{1 \times \ell}. \tag{3}$$
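Putting Step-1 and Step-2 together: a hedged end-to-end sketch of the resulting predictor, implementing (1)–(3) with the linear set kernel (it reuses set_kernel from the previous sketch; names and toy data are mine):

```python
import numpy as np

def merr_fit_predict(bags, y, test_bags, lam=1e-3, theta=1.0):
    """Kernel ridge regression on mean embeddings, eqs. (1)-(3).
    bags / test_bags: lists of (N_i, d) arrays; y: (l,) array of labels."""
    l = len(bags)
    K = np.array([[set_kernel(bi, bj, theta) for bj in bags] for bi in bags])  # eq. (2)
    alpha = np.linalg.solve(K + l * lam * np.eye(l), np.asarray(y))  # (K + l*lam*I)^{-1} y
    k = np.array([[set_kernel(t, bi, theta) for bi in bags] for t in test_bags])  # eq. (3)
    return k @ alpha  # eq. (1), one prediction per test bag

# toy usage: the label is the mean of the bag-generating Gaussian
rng = np.random.default_rng(1)
means = rng.uniform(-1, 1, size=20)
bags = [rng.normal(m, 1.0, size=(30, 1)) for m in means]
test_bags = [rng.normal(0.5, 1.0, size=(30, 1))]
print(merr_fit_predict(bags, means, test_bags))  # should be close to 0.5
```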
Blanket assumptions: both settings
$D$: separable topological domain.
$k$: bounded ($\sup_{u \in D} k(u, u) \le B_k \in (0, \infty)$), continuous.
$K$: bounded; Hölder continuous: $\exists L > 0$, $h \in (0, 1]$ such that $\|K(\cdot, \mu_a) - K(\cdot, \mu_b)\|_{\mathcal{H}} \le L \|\mu_a - \mu_b\|_H^h$.
$y$: bounded. $X = \mu(\mathcal{P}(D)) \in \mathcal{B}(H)$.
Well-specified case: performance guarantee
Difficulty of the task: $f_\rho$ is '$c$-smooth', with a '$b$-decaying covariance operator'.
Contribution: if $\ell \ge \lambda^{-\frac{1}{b}-1}$, then with high probability
$$\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) \le \underbrace{\frac{\log^h(\ell)}{N^h \lambda^3} + \lambda^c + \frac{1}{\ell^2 \lambda} + \frac{1}{\ell \lambda^{\frac{1}{b}}}}_{g(\ell, N, \lambda)}. \tag{4}$$
The first term is the price of seeing each $x_i$ only through the bag $\hat{x}_i$; the $\lambda^c$ term reflects the $c$-smoothness of $f_\rho$.
Well-specified case: example
Assume: $b$ is 'large' ($1/b \approx 0$: 'small' effective input dimension), $h = 1$ ($K$: Lipschitz), $\lambda$ set by matching the first two terms of the bound (4), and $\ell = N^a$ ($a > 0$).
Let $t = \ell N$ be the total number of samples processed. Then:
1. $c = 2$ ('smooth' $f_\rho$): $\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-2/7}$ – faster convergence,
2. $c = 1$ ('non-smooth' $f_\rho$): $\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-1/5}$ – slower.
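One way to read off these rates (my own reconstruction, ignoring constants and log factors): matching the first two terms of (4) with $h = 1$ and $1/b \approx 0$ gives
$$\frac{\log \ell}{N \lambda^3} = \lambda^c \;\Rightarrow\; \lambda \approx \left(\frac{\log \ell}{N}\right)^{\frac{1}{c+3}}, \qquad \mathcal{E} \approx \lambda^c + \frac{1}{\ell} \approx N^{-\frac{c}{c+3}} + N^{-a}.$$
Choosing $a = \frac{c}{c+3}$ (the smallest $a$ for which the $1/\ell$ term does not dominate) and substituting $t = \ell N = N^{1+a}$ yields $\mathcal{E} \approx t^{-\frac{c}{2c+3}}$, i.e., $t^{-2/7}$ for $c = 2$ and $t^{-1/5}$ for $c = 1$.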
Misspecified case: performance guarantee
Difficulty of the task: $f_\rho$ is '$s$-smooth' ($s > 0$).
Contribution: if $L^2_{\rho_X}$ is separable and $\frac{1}{\lambda^2} \le \ell$, then with high probability
$$\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) \le \underbrace{\frac{\log^{\frac{h}{2}}(\ell)}{N^{\frac{h}{2}} \lambda^{\frac{3}{2}}} + \frac{1}{\sqrt{\ell \lambda}} + \frac{\sqrt{\lambda^{\min(1,s)}}}{\sqrt{\lambda \ell}} + \lambda^{\min(1,s)}}_{g(\ell, N, \lambda)}. \tag{5}$$
As in (4), the first term is the price of the $\hat{x}_i$ sampling; the $\lambda^{\min(1,s)}$ terms reflect the $s$-smoothness of $f_\rho$.
Misspecified case: example
Assume: $s \ge 1$, $h = 1$ ($K$: Lipschitz), $\lambda$ set by matching the first and third terms of the bound (5), and $\ell = N^a$ ($a > 0$).
Let $t = \ell N$ be the total number of samples processed. Then:
1. $s = 1$ ('non-smooth' $f_\rho$): $\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.25}$ – slower,
2. $s \to \infty$ ('smooth' $f_\rho$): $\mathcal{E}(\hat{f}_{\hat{z}}^{\lambda}, f_\rho) \approx t^{-0.5}$ – faster convergence.
Notes on the assumptions: $\exists \rho$, $X \in \mathcal{B}(H)$
$k$ bounded, continuous ⇒ $\mu : (\mathcal{P}(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable.
$\mu$ measurable, $X \in \mathcal{B}(H)$ ⇒ $\rho$ on $X \times Y$ is well-defined.
If (*) := [$D$ is compact metric, $k$ is universal], then $\mu$ is continuous and $X \in \mathcal{B}(H)$.
Notes on the assumptions: Hölder $K$ examples
In case of (*), example $K$-s with their Hölder exponents $h$:
- $K_G(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H^2 / (2\theta^2)}$: $h = 1$,
- $K_e(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H / (2\theta^2)}$: $h = \frac{1}{2}$,
- $K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$: $h = 1$,
- $K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^\theta\right)^{-1}$ ($\theta \le 2$): $h = \frac{\theta}{2}$,
- $K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-\frac{1}{2}}$: $h = 1$.
All are functions of $\|\mu_a - \mu_b\|_H$ ⇒ computation: similar to the set kernel (see the sketch below).
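Since each of these $K$-s is a radial profile applied to $\|\mu_a - \mu_b\|_H$, they all reduce to three set-kernel evaluations; a hedged sketch reusing set_kernel from the Step-1 sketch (the profile names mirror the list above, the $\theta$ values are illustrative):

```python
import numpy as np

def emb_sq_dist(X, Y, theta=1.0):
    """||mu_X - mu_Y||_H^2 expanded into three set-kernel evaluations."""
    return (set_kernel(X, X, theta) + set_kernel(Y, Y, theta)
            - 2 * set_kernel(X, Y, theta))

# radial profiles applied to the squared embedding distance d2
K_profiles = {
    "K_G": lambda d2, th=1.0: np.exp(-d2 / (2 * th**2)),           # h = 1
    "K_e": lambda d2, th=1.0: np.exp(-np.sqrt(d2) / (2 * th**2)),  # h = 1/2
    "K_C": lambda d2, th=1.0: 1.0 / (1.0 + d2 / th**2),            # h = 1
    "K_t": lambda d2, th=1.5: 1.0 / (1.0 + d2 ** (th / 2)),        # h = th/2 (th <= 2)
    "K_i": lambda d2, th=1.0: 1.0 / np.sqrt(d2 + th**2),           # h = 1
}
```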
Notes on the assumptions: misspecified case
$L^2_{\rho_X}$ is separable ⇔ the measure space with the metric $d(A, B) = \rho_X(A \triangle B)$ is separable [Thomson et al., 2008].
Vector-valued output: Y = separable Hilbert space
Objective function:
$$\hat{f}_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2 \quad (\lambda > 0).$$
$K(\mu_a, \mu_b) \in \mathcal{L}(Y)$: operator-valued kernel, vector-valued RKHS.
Vector-valued output: analytical solution
Prediction on a new test distribution $t$:
$$(\hat{f}_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda I_\ell)^{-1} [y_1; \ldots; y_\ell], \tag{6}$$
$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathcal{L}(Y)^{\ell \times \ell}, \tag{7}$$
$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathcal{L}(Y)^{1 \times \ell}. \tag{8}$$
Specifically: $Y = \mathbb{R}$ ⇒ $\mathcal{L}(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ ⇒ $\mathcal{L}(Y) = \mathbb{R}^{d \times d}$.
Vector-valued output: K assumptions
Boundedness and Hölder continuity of $K$:
1. Boundedness: $\|K_{\mu_a}\|_{HS}^2 = \mathrm{Tr}(K_{\mu_a}^* K_{\mu_a}) \le B_K \in (0, \infty)$ ($\forall \mu_a \in X$).
2. Hölder continuity: $\exists L > 0$, $h \in (0, 1]$ such that $\|K_{\mu_a} - K_{\mu_b}\|_{\mathcal{L}(Y, \mathcal{H})} \le L \|\mu_a - \mu_b\|_H^h$ ($\forall (\mu_a, \mu_b) \in X \times X$).
Demo
Supervised entropy learning: RMSE: MERR = 0.75, DFDR = 2.02.
[Figure: true entropy vs. MERR prediction as a function of the rotation angle ($\beta$); RMSE boxplots for MERR and DFDR.]
Aerosol prediction from satellite images: state-of-the-art baseline: $100 \times$ RMSE $= 7.5$–$8.5$ ($\pm 0.1$–$0.6$). MERR: $7.81$ ($\pm 1.64$).
Summary
Problem: distribution regression. Literature: a large number of heuristics. Contribution: a simple ridge solution is consistent; in particular, the set kernel is consistent (a 15-year-old open question).
Simplified version [$Y = \mathbb{R}$, $f_\rho \in \mathcal{H}$]: AISTATS-2015 (oral).
Summary – continued
Code in ITE: https://bitbucket.org/szzoli/ite/
Extended analysis (submitted to JMLR): http://arxiv.org/abs/1411.2066
Closely related research directions (Bayesian world): ∞-dimensional exponential family fitting, just-in-time kernel EP: accepted at UAI-2015.
Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.
Appendix: contents
- Topological definitions, separability.
- Prior definitions ($\rho$).
- Universal kernel: definition, examples.
- Vector-valued RKHS.
- Demos: further details.
- Hausdorff metric.
- Weak topology on $\mathcal{P}(D)$.
Topological space, open sets
Given: a set $D \ne \emptyset$. $\tau \subseteq 2^D$ is called a topology on $D$ if:
1. $\emptyset \in \tau$, $D \in \tau$.
2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ ⇒ $O_1 \cap O_2 \in \tau$.
3. Arbitrary union: $O_i \in \tau$ ($i \in I$) ⇒ $\cup_{i \in I} O_i \in \tau$.
Then $(D, \tau)$ is called a topological space; the $O \in \tau$ are the open sets.
Closed-, compact set, closure, dense subset, separability
Given $(D, \tau)$. $A \subseteq D$ is
- closed if $D \setminus A \in \tau$ (i.e., its complement is open),
- compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.
Closure of $A \subseteq D$:
$$\bar{A} := \bigcap_{A \subseteq C \text{ closed in } D} C. \tag{9}$$
$A \subseteq D$ is dense if $\bar{A} = D$.
$(D, \tau)$ is separable if $\exists$ a countable, dense subset of $D$. Counterexample: $\ell^\infty$/$L^\infty$.
Prior (well-specified case): $\rho \in \mathcal{P}(b, c)$
Let the covariance operator $T : \mathcal{H} \to \mathcal{H}$ be
$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a)$$
with eigenvalues $t_n$ ($n = 1, 2, \ldots$). Assumption: $\rho \in \mathcal{P}(b, c)$ = the set of distributions on $X \times Y$ such that
- $\alpha \le n^b t_n \le \beta$ ($\forall n \ge 1$; $\alpha > 0$, $\beta > 0$),
- $\exists g \in \mathcal{H}$ such that $f_\rho = T^{\frac{c-1}{2}} g$ with $\|g\|_{\mathcal{H}}^2 \le R$ ($R > 0$),
where $b \in (1, \infty)$, $c \in [1, 2]$.
Intuition: $1/b$ – effective input dimension, $c$ – smoothness of $f_\rho$.
Prior: misspecified case
Let $\tilde{T}$ be defined via
$$S_K : L^2_{\rho_X} \to \mathcal{H}, \quad (S_K g)(\mu_u) = \int_X K(\mu_u, \mu_t) g(\mu_t) \, d\rho_X(\mu_t), \qquad S_K^* : \mathcal{H} \hookrightarrow L^2_{\rho_X},$$
$$\tilde{T} = S_K^* S_K : L^2_{\rho_X} \to L^2_{\rho_X}.$$
Our range space assumption on $\rho$: $f_\rho \in \mathrm{Im}\, \tilde{T}^s$ for some $s \ge 0$.
Universal kernel: definition
Assume: $D$ is a compact metric space, and the kernel $k : D \times D \to \mathbb{R}$ is continuous. Then:
Def-1: $k$ is universal if $H(k)$ is dense in $(C(D), \|\cdot\|_\infty)$.
Def-2: $k$ is characteristic if $\mu : \mathcal{P}(D) \to H(k)$ is injective; universal if $\mu$ is injective on the finite signed measures of $D$.
Universal kernel: examples
On compact subsets of $\mathbb{R}^d$:
$$k(a, b) = e^{-\|a - b\|_2^2 / (2\sigma^2)} \quad (\sigma > 0),$$
$$k(a, b) = e^{-\sigma \|a - b\|_1} \quad (\sigma > 0),$$
$$k(a, b) = e^{\beta \langle a, b \rangle} \quad (\beta > 0),$$
or more generally
$$k(a, b) = f(\langle a, b \rangle), \quad f(x) = \sum_{n=0}^{\infty} a_n x^n \quad (\forall a_n > 0).$$
Vector-valued RKHS: $\mathcal{H} = \mathcal{H}(K)$
Definition: a Hilbert space $\mathcal{H} \subseteq Y^X$ of functions is an RKHS if
$$A_{\mu_x, y} : f \in \mathcal{H} \mapsto \langle y, f(\mu_x) \rangle_Y \in \mathbb{R} \tag{10}$$
is continuous for every $\mu_x \in X$, $y \in Y$.
= The evaluation functional is continuous in every direction.
Vector-valued RKHS: $\mathcal{H} = \mathcal{H}(K)$ – continued
Riesz representation theorem ⇒ $\exists K(\mu_x | y) \in \mathcal{H}$:
$$\langle y, f(\mu_x) \rangle_Y = \langle K(\mu_x | y), f \rangle_{\mathcal{H}} \quad (\forall f \in \mathcal{H}). \tag{11}$$
$K(\mu_x | y)$: linear, bounded in $y$ ⇒ $K(\mu_x | y) = K_{\mu_x}(y)$ with $K_{\mu_x} \in \mathcal{L}(Y, \mathcal{H})$.
$K$ construction:
$$K(\mu_x, \mu_t)(y) = (K_{\mu_t} y)(\mu_x) \quad (\forall \mu_x, \mu_t \in X), \tag{12}$$
$$K(\cdot, \mu_t)(y) = K_{\mu_t} y, \quad \text{i.e.,} \quad \mathcal{H}(K) = \overline{\mathrm{span}}\{K_{\mu_t} y : \mu_t \in X, y \in Y\}. \tag{13}$$
Shortly: $K(\mu_x, \mu_t) \in \mathcal{L}(Y)$ generalizes $k(u, v) \in \mathbb{R}$.
Vector-valued RKHS – examples: $Y = \mathbb{R}^d$
1. $K_i : X \times X \to \mathbb{R}$ kernels ($i = 1, \ldots, d$); diagonal kernel:
$$K(\mu_a, \mu_b) = \mathrm{diag}(K_1(\mu_a, \mu_b), \ldots, K_d(\mu_a, \mu_b)). \tag{14}$$
2. Combination of diagonal kernels $D_j$ [$D_j(\mu_a, \mu_b) \in \mathbb{R}^{r \times r}$, $A_j \in \mathbb{R}^{r \times d}$]:
$$K(\mu_a, \mu_b) = \sum_{j=1}^{m} A_j^* D_j(\mu_a, \mu_b) A_j. \tag{15}$$
(A computational sketch for the diagonal case follows below.)
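With the diagonal construction (14), the operator-valued ridge problem (6)–(8) decouples into $d$ independent scalar problems, one per output coordinate. A hedged sketch of this special case (my own illustration; the inputs are precomputed Gram matrices):

```python
import numpy as np

def diag_merr_predict(K_list, k_list, Y, lam):
    """Vector-valued MERR with the diagonal kernel (14).
    K_list[c]: (l, l) Gram matrix of K_c on the training embeddings;
    k_list[c]: (m, l) cross-Gram, test vs. training; Y: (l, d) labels."""
    l, d = Y.shape
    preds = []
    for c in range(d):  # diagonal K => the coordinates decouple
        alpha = np.linalg.solve(K_list[c] + l * lam * np.eye(l), Y[:, c])
        preds.append(k_list[c] @ alpha)
    return np.stack(preds, axis=1)  # (m, d) predicted outputs
```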
Demo-1: supervised entropy learning
Problem: learn the entropy of the first coordinate of (rotated) Gaussians.
Baseline: kernel-smoothing-based distribution regression (applying density estimation) =: DFDR.
Performance: RMSE boxplot over 25 random experiments.
Experience: more precise than the only theoretically justified alternative, by avoiding density estimation.
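For concreteness, a hedged sketch of the ground truth and the bag sampler in this demo (the covariance $\mathrm{diag}(1, 0.25)$ and the bag size are my illustrative choices, not the slides' exact settings):

```python
import numpy as np

def true_entropy_first_coord(beta, s1=1.0, s2=0.5):
    """Entropy of the first coordinate of a rotated 2D Gaussian.
    Rotating N(0, diag(s1^2, s2^2)) by beta gives first-coordinate variance
    s1^2 cos^2(beta) + s2^2 sin^2(beta); Gaussian entropy = 0.5 log(2 pi e var)."""
    var = (s1 * np.cos(beta)) ** 2 + (s2 * np.sin(beta)) ** 2
    return 0.5 * np.log(2 * np.pi * np.e * var)

def sample_bag(beta, N=500, s1=1.0, s2=0.5, rng=None):
    """A bag: N i.i.d. samples from the rotated Gaussian (the input distribution)."""
    rng = rng or np.random.default_rng()
    R = np.array([[np.cos(beta), -np.sin(beta)], [np.sin(beta), np.cos(beta)]])
    return rng.normal(size=(N, 2)) * np.array([s1, s2]) @ R.T
```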
Demo-2: aerosol prediction – selected kernels
Kernel definitions ($p = 2, 3$):
$$k_G(a, b) = e^{-\frac{\|a - b\|_2^2}{2\theta^2}}, \qquad k_e(a, b) = e^{-\frac{\|a - b\|_2}{2\theta^2}}, \tag{16}$$
$$k_C(a, b) = \frac{1}{1 + \frac{\|a - b\|_2^2}{\theta^2}}, \qquad k_t(a, b) = \frac{1}{1 + \|a - b\|_2^\theta}, \tag{17}$$
$$k_p(a, b) = (\langle a, b \rangle + \theta)^p, \qquad k_r(a, b) = 1 - \frac{\|a - b\|_2^2}{\|a - b\|_2^2 + \theta}, \tag{18}$$
$$k_i(a, b) = \frac{1}{\sqrt{\|a - b\|_2^2 + \theta^2}}, \tag{19}$$
$$k_{M,\frac{3}{2}}(a, b) = \left(1 + \frac{\sqrt{3} \|a - b\|_2}{\theta}\right) e^{-\frac{\sqrt{3} \|a - b\|_2}{\theta}}, \tag{20}$$
$$k_{M,\frac{5}{2}}(a, b) = \left(1 + \frac{\sqrt{5} \|a - b\|_2}{\theta} + \frac{5 \|a - b\|_2^2}{3\theta^2}\right) e^{-\frac{\sqrt{5} \|a - b\|_2}{\theta}}. \tag{21}$$
Existing methods: set metric based algorithms
Hausdorff metric [Edgar, 1995]:
$$d_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y), \ \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\}. \tag{22}$$
A metric on the compact sets of metric spaces [$(M, d)$; $X, Y \subseteq M$].
'Slight' problem: highly sensitive to outliers.
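A minimal NumPy sketch of (22) for finite point sets with the Euclidean ground metric (my own illustration):

```python
import numpy as np

def hausdorff(X, Y):
    """Hausdorff distance (22) between finite point sets X: (n, d), Y: (m, d)."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise d(x, y)
    return max(D.min(axis=1).max(),   # sup_x inf_y d(x, y)
               D.min(axis=0).max())   # sup_y inf_x d(x, y)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.1], [5.0, 0.0]])  # a single outlier in Y
print(hausdorff(X, Y))  # 4.0 -- the outlier dominates: the claimed sensitivity
```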
Weak topology on P(D)
Def.: the weakest topology on $\mathcal{P}(D)$ such that the mapping $L_h : (\mathcal{P}(D), \tau_w) \to \mathbb{R}$,
$$L_h(x) = \int_D h(u) \, dx(u),$$
is continuous for all $h \in C_b(D)$, where $C_b(D) = \{(D, \tau) \to \mathbb{R} \text{ bounded, continuous functions}\}$.
References

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J., Póczos, B., and Schneider, J. (2013). Distribution to distribution regression. International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and Póczos, B. (2014). k-NN regression on functional data with incomplete observations. In Conference on Uncertainty in Artificial Intelligence (UAI).

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.