Optimal rates for random Fourier feature kernel approximations
Zoltán Szabó* (Gatsby Unit, UCL)
Joint work with Bharath K. Sriperumbudur* (Department of Statistics, PSU); * equal contribution
AMPLab, UC Berkeley, November 20, 2015
Zoltán Szabó
Optimal Rates for RFFs
Outline
Kernels and kernel derivatives.
Random Fourier features (RFFs).
Guarantees on RFF approximation: uniform, $L^r$.
Kernel, RKHS
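$k : X \times X \to \mathbb{R}$ is a kernel on $X$ if there exists a feature map $\varphi : X \to H$ into a Hilbert space $H$ with $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in X$).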
Kernel examples: $X = \mathbb{R}^d$ ($p > 0$, $\theta > 0$)
$k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial,
$k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian,
$k(a, b) = e^{-\theta \|a - b\|_2}$: Laplacian.
In the RKHS $H = H(k)$ (which exists and is unique): $\varphi(u) = k(\cdot, u)$.
Kernel: example domains (X)
Euclidean space: $X = \mathbb{R}^d$. Graphs, texts, time series, dynamical systems, distributions.
Kernel: application example – ridge regression
Given: $\{(x_i, y_i)\}_{i=1}^{\ell}$, $H = H(k)$. Task: find $f \in H$ s.t. $f(x_i) \approx y_i$:
$J(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} [f(x_i) - y_i]^2 + \lambda \|f\|_H^2 \to \min_{f \in H}$ ($\lambda > 0$).
Analytical solution, $O(\ell^3)$, expensive:
$f(x) = [k(x_1, x), \ldots, k(x_\ell, x)] (G + \lambda \ell I)^{-1} [y_1; \ldots; y_\ell]$, $G = [k(x_i, x_j)]_{i,j=1}^{\ell}$.
Idea: $\hat{G}$, matrix-inversion lemma, fast primal solvers → RFF.
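The analytical solution can be sketched in a few lines of standard-library Python. This is only an illustration: the Gaussian kernel with θ = 0.5, the toy data, and the helper names `solve` and `fit_kernel_ridge` are assumptions made here, not part of the talk. The dense linear solve is the $O(\ell^3)$ step that motivates RFFs.

```python
import math

def gaussian_kernel(a, b, theta=0.5):
    # k(a, b) = exp(-||a - b||_2^2 / (2 theta^2)); theta = 0.5 is an arbitrary choice
    sq = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq / (2.0 * theta ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (cubic cost)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy of [A | b]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_kernel_ridge(xs, ys, lam, kernel=gaussian_kernel):
    """Return f(x) = [k(x_1, x), ..., k(x_l, x)] (G + lam * l * I)^{-1} y."""
    l = len(xs)
    A = [[kernel(xs[i], xs[j]) + (lam * l if i == j else 0.0) for j in range(l)]
         for i in range(l)]
    coef = solve(A, ys)
    return lambda x: sum(c * kernel(xi, x) for c, xi in zip(coef, xs))

# Toy 1-D data (hypothetical): noiseless samples of sin(3x).
xs = [(-1.0,), (-0.5,), (0.0,), (0.5,), (1.0,)]
ys = [math.sin(3.0 * x[0]) for x in xs]
f = fit_kernel_ridge(xs, ys, lam=1e-6)
```

With a small λ the fitted function nearly interpolates the training points; a large λ shrinks it towards zero. In practice one would replace `solve` by an optimized linear algebra routine.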
Kernels: more generally
Requirement: inner product on the inputs ($k : X \times X \to \mathbb{R}$).
Loss function ($\lambda > 0$): $J(f) = \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_{H(k)}^2 \to \min_{f \in H(k)}$.
By the representer theorem [$f(\cdot) = \sum_{i=1}^{\ell} \alpha_i k(\cdot, x_i)$]:
$J(\alpha) = \sum_{i=1}^{\ell} V(y_i, (G\alpha)_i) + \lambda \alpha^T G \alpha \to \min_{\alpha \in \mathbb{R}^{\ell}}$.
⇒ Only the kernel values $k(x_i, x_j)$ matter.
Kernel derivatives: application example
Motivation: fitting infinite-dimensional exponential family distributions [Sriperumbudur et al., 2014]:
$k$ ↔ sufficient statistics, rich family,
fitting = linear equation: coefficient matrix of size $(d\ell) \times (d\ell)$, $d = \dim(x)$; entries: kernel values and derivatives.
Kernel derivatives: more generally
Objective: $J(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} V\left(y_i, \{\partial^{p} f(x_i)\}_{p \in J_i}\right) + \lambda \|f\|_{H(k)}^2 \to \min_{f \in H(k)}$.
[Zhou, 2008, Shi et al., 2010, Rosasco et al., 2010, Rosasco et al., 2013, Ying et al., 2012]: semi-supervised learning with gradient information, nonlinear variable selection.
Kernel HMC [Strathmann et al., 2015].
Focus
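$X = \mathbb{R}^d$. $k$: continuous, shift-invariant [$k(x, y) = \tilde{k}(x - y)$].
By Bochner's theorem: $k(x, y) = \int_{\mathbb{R}^d} e^{i \omega^T (x - y)} \, d\Lambda(\omega) = \int_{\mathbb{R}^d} \cos\left(\omega^T (x - y)\right) d\Lambda(\omega)$.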
RFF trick [Rahimi and Recht, 2007] (MC): $\omega_{1:m} := (\omega_j)_{j=1}^{m} \stackrel{\text{i.i.d.}}{\sim} \Lambda$,
$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^{m} \cos\left(\omega_j^T (x - y)\right) = \int_{\mathbb{R}^d} \cos\left(\omega^T (x - y)\right) d\Lambda_m(\omega)$.
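The Monte Carlo estimator $\hat{k}$ can be tried out directly. The sketch below assumes the Gaussian kernel, whose spectral measure is $\Lambda = N(0, \theta^{-2} I_d)$; the function names, the seed and the test points are illustrative choices, not from the talk.

```python
import math
import random

def gaussian_kernel(x, y, theta=1.0):
    # Exact kernel: exp(-||x - y||_2^2 / (2 theta^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * theta ** 2))

def sample_frequencies(m, d, theta=1.0, seed=0):
    """Draw omega_1, ..., omega_m i.i.d. from Lambda = N(0, theta^{-2} I_d).

    Assumption: this Lambda is the spectral measure of the Gaussian kernel above.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / theta) for _ in range(d)] for _ in range(m)]

def rff_kernel(x, y, omegas):
    """k_hat(x, y) = (1/m) sum_j cos(omega_j^T (x - y))."""
    z = [a - b for a, b in zip(x, y)]
    total = sum(math.cos(sum(w * v for w, v in zip(om, z))) for om in omegas)
    return total / len(omegas)

x, y = (0.3, -0.2), (-0.5, 0.4)
exact = gaussian_kernel(x, y)
for m in (10, 100, 1000, 10000):
    approx = rff_kernel(x, y, sample_frequencies(m, d=2))
    print(m, abs(approx - exact))  # the error typically shrinks like 1/sqrt(m)
```

This checks the approximation at a single pair $(x, y)$ only; the guarantees discussed in the talk concern the error uniformly over a compact set $S$.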
RFF – existing guarantee, basically
Hoeffding inequality + union bound: $\|k - \hat{k}\|_{L^\infty(S)} = O_p\left(|S| \sqrt{\frac{\log m}{m}}\right)$, i.e. linear in $|S|$.
Characteristic function point of view [Csörgő and Totik, 1983] (asymptotic!):
1. $|S_m| = e^{o(m)}$ is the optimal rate for a.s. convergence,
2. for faster growing $|S_m|$: even convergence in probability fails.
Today: one-page summary
1. Finite-sample $L^\infty$-guarantee $\xrightarrow{\text{specifically}}$ $\|k - \hat{k}\|_{L^\infty(S)} = O_{a.s.}\left(\frac{\sqrt{\log |S|}}{\sqrt{m}}\right)$ ⇒ $S$ can grow exponentially [$|S_m| = e^{o(m)}$], which is optimal!
2. Finite-sample $L^r$ guarantees, $r \in [1, \infty)$.
3. Derivatives: $\partial^{p,q} k$.
. . . , where
Uniform ($r = \infty$), $L^r$ ($1 \le r < \infty$) norm:
$\|k - \hat{k}\|_{L^\infty(S)} := \sup_{x, y \in S} \left|k(x, y) - \hat{k}(x, y)\right|$,
$\|k - \hat{k}\|_{L^r(S)} := \left( \int_S \int_S |k(x, y) - \hat{k}(x, y)|^r \, dx \, dy \right)^{\frac{1}{r}}$.
Kernel derivatives: $\partial^{p,q} k(x, y) = \frac{\partial^{|p| + |q|} k(x, y)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d} \, \partial y_1^{q_1} \cdots \partial y_d^{q_d}}$, $|p| = \sum_{j=1}^{d} |p_j|$.
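$\|k - \hat{k}\|_{L^\infty(S)}$: proof idea
1. Empirical process form [$Pg := \int g \, dP$]: $\sup_{x, y \in S} \left|k(x, y) - \hat{k}(x, y)\right| = \sup_{g \in \mathcal{G}} |\Lambda g - \Lambda_m g| = \|\Lambda - \Lambda_m\|_{\mathcal{G}}$.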
2. $f(\omega_{1:m}) = \|\Lambda - \Lambda_m\|_{\mathcal{G}}$ concentrates (bounded difference): $\|\Lambda - \Lambda_m\|_{\mathcal{G}} \lesssim E_{\omega_{1:m}} \|\Lambda - \Lambda_m\|_{\mathcal{G}} + \frac{1}{\sqrt{m}}$.
3. $\mathcal{G}$ is 'nice' (uniformly bounded, separable Carathéodory) ⇒ $E_{\omega_{1:m}} \|\Lambda - \Lambda_m\|_{\mathcal{G}} \lesssim E_{\omega_{1:m}} \mathcal{R}(\mathcal{G}, \omega_{1:m})$, where $\mathcal{R}(\mathcal{G}, \omega_{1:m}) := E_{\varepsilon} \sup_{g \in \mathcal{G}} \left| \frac{1}{m} \sum_{j=1}^{m} \varepsilon_j g(\omega_j) \right|$.
Proof idea
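4. Using Dudley's entropy bound: $\mathcal{R}(\mathcal{G}, \omega_{1:m}) \lesssim \frac{1}{\sqrt{m}} \int_0^{|\mathcal{G}|_{L^2(\Lambda_m)}} \sqrt{\log \mathcal{N}(\mathcal{G}, L^2(\Lambda_m), u)} \, du$.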
5. $\mathcal{G}$ is smoothly parameterized by a compact set ⇒ $\mathcal{N}(\mathcal{G}, L^2(\Lambda_m), u) \le \left( \frac{4 |S| A}{u} + 1 \right)^d$, $A(\omega_{1:m}) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \|\omega_j\|_2^2}$.
6. Putting together [$|\mathcal{G}|_{L^2(\Lambda_m)} \le 2$, Jensen inequality] we get . . .
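$L^\infty$ result for $k$
Let $k$ be continuous, $\sigma^2 := \int \|\omega\|^2 \, d\Lambda(\omega) < \infty$. Then for $\forall \tau > 0$ and compact set $S \subset \mathbb{R}^d$
$\Lambda^m \left( \|\hat{k} - k\|_{L^\infty(S)} \ge \frac{h(d, |S|, \sigma) + \sqrt{2\tau}}{\sqrt{m}} \right) \le e^{-\tau}$,
$h(d, |S|, \sigma) := 32 \sqrt{2d \log(2|S| + 1)} + 16 \sqrt{\frac{2d}{\log(2|S| + 1)}} + 32 \sqrt{2d \log(\sigma + 1)}$.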
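To get a feel for the constants, the bound can be evaluated numerically. The sketch below is a plain transcription of $h$ and of the threshold $(h + \sqrt{2\tau}) / \sqrt{m}$; the function names and the example values ($d = 2$, $|S| = 1$, Gaussian kernel with $\theta = 1$, for which $\sigma^2 = d / \theta^2$) are illustrative assumptions.

```python
import math

def h(d, diam, sigma):
    """h(d, |S|, sigma) from the L^infty result; diam = |S| > 0 is the diameter of S."""
    log_s = math.log(2.0 * diam + 1.0)
    return (32.0 * math.sqrt(2.0 * d * log_s)
            + 16.0 * math.sqrt(2.0 * d / log_s)
            + 32.0 * math.sqrt(2.0 * d * math.log(sigma + 1.0)))

def uniform_error_bound(m, d, diam, sigma, tau):
    """Threshold that the uniform error exceeds with probability at most exp(-tau)."""
    return (h(d, diam, sigma) + math.sqrt(2.0 * tau)) / math.sqrt(m)

# Example values (hypothetical): Gaussian kernel, theta = 1, so sigma = sqrt(d).
d, diam = 2, 1.0
sigma = math.sqrt(d)
tau = math.log(100.0)  # failure probability at most 1/100
for m in (10**3, 10**4, 10**5, 10**6):
    print(m, uniform_error_bound(m, d, diam, sigma, tau))
```

The threshold decays like $1/\sqrt{m}$ and depends on $|S|$ only through $\log(2|S| + 1)$. The explicit constants are conservative, so the numerical values become informative only for fairly large $m$.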
Consequence-1 (Borel-Cantelli lemma)
A.s. convergence on compact sets: $\hat{k} \xrightarrow{m \to \infty} k$ at rate $\sqrt{\frac{\log |S|}{m}}$.
Growing diameter: $\frac{\log |S_m|}{m} \xrightarrow{m \to \infty} 0$ is enough, i.e. $|S_m| = e^{o(m)}$.
Specifically: asymptotic optimality [Csörgő and Totik, 1983, Theorem 2] (if $k(z)$ vanishes at $\infty$).
Consequence-2: $L^r$ result for $k$ ($1 \le r$)
Idea: note that $\|\hat{k} - k\|_{L^r(S)} = \left( \int_S \int_S |\hat{k}(x, y) - k(x, y)|^r \, dx \, dy \right)^{\frac{1}{r}} \le \|\hat{k} - k\|_{L^\infty(S)} \, \mathrm{vol}^{2/r}(S)$.
$\mathrm{vol}(S) \le \mathrm{vol}(B)$, where $B := \left\{ x \in \mathbb{R}^d : \|x\|_2 \le \frac{|S|}{2} \right\}$, and $\mathrm{vol}(B) = \frac{\pi^{d/2} |S|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)}$, $\Gamma(t) = \int_0^{\infty} u^{t-1} e^{-u} \, du$.
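The volume factor is easy to evaluate. The following sketch computes $\mathrm{vol}(B)$ with `math.gamma` and turns a given uniform error into the corresponding $L^r(S)$ bound; the function names and the example numbers are illustrative assumptions.

```python
import math

def ball_volume_bound(d, diam):
    """vol(B) = pi^{d/2} |S|^d / (2^d Gamma(d/2 + 1)) for the ball B of radius |S|/2."""
    return math.pi ** (d / 2.0) * diam ** d / (2.0 ** d * math.gamma(d / 2.0 + 1.0))

def lr_from_uniform(uniform_error, d, diam, r):
    """Upper bound on the L^r(S) error: uniform error times vol(B)^{2/r}."""
    return uniform_error * ball_volume_bound(d, diam) ** (2.0 / r)

# Sanity checks against familiar volumes (radius 1, i.e. diameter 2).
print(ball_volume_bound(1, 2.0))  # length of the interval [-1, 1]
print(ball_volume_bound(2, 2.0))  # area of the unit disc
print(ball_volume_bound(3, 2.0))  # volume of the unit ball in R^3

# Hypothetical example: a uniform error of 0.1 on a set of diameter 2 in R^2, r = 2.
print(lr_from_uniform(0.1, d=2, diam=2.0, r=2.0))
```

Since the factor is $\mathrm{vol}(B)^{2/r}$, it grows like $|S|^{2d/r}$, which is where the polynomial dependence on $|S|$ in the $L^r$ rate comes from.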
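$L^r$ result for $k$
Under the previous assumptions, and $1 \le r < \infty$:
$\Lambda^m \left( \|\hat{k} - k\|_{L^r(S)} \ge \left( \frac{\pi^{d/2} |S|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)} \right)^{2/r} \frac{h(d, |S|, \sigma) + \sqrt{2\tau}}{\sqrt{m}} \right) \le e^{-\tau}$.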
Hence, $\|\hat{k} - k\|_{L^r(S)} = O_{a.s.}\left( \frac{|S|^{2d/r} \sqrt{\log |S|}}{\sqrt{m}} \right)$; $L^r(S)$-consistency if this rate $\xrightarrow{m \to \infty} 0$.
Uniform guarantee: $|S_m| = e^{m^{\delta}}$, $\delta$ …
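Direct $L^r$ result for $k$: proof idea
1. $f(\omega_1, \ldots, \omega_m) = \|k - \hat{k}\|_{L^r(S)}$ concentrates (bounded difference): $\|k - \hat{k}\|_{L^r(S)} \le E_{\omega_{1:m}} \|k - \hat{k}\|_{L^r(S)} + \mathrm{vol}^{2/r}(S) \sqrt{\frac{2\tau}{m}}$.
2. By $L^r \cong (L^{r'})^*$ ($\frac{1}{r} + \frac{1}{r'} = 1$), the separability of $L^{r'}(S)$ ($r > 1$) and symmetrization: $E_{\omega_{1:m}} \|k - \hat{k}\|_{L^r(S)} \le \frac{2}{m} E_{\omega_{1:m}} (*)$, where $(*) := E_{\varepsilon} \left\| \sum_{i=1}^{m} \varepsilon_i \cos(\langle \omega_i, \cdot - \cdot \rangle) \right\|_{L^r(S)}$.
3. Since $L^r(S)$ is of type $\min(2, r)$ ['-rule'] $\exists C'_r$ such that $(*) \le C'_r \left( \sum_{i=1}^{m} \|\cos(\langle \omega_i, \cdot - \cdot \rangle)\|_{L^r(S)}^{\min(2, r)} \right)^{\frac{1}{\min(2, r)}}$.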
Kernel derivatives: $\mathbb{N}^{2d} \ni [p; q] \neq 0$
Goal: $\widehat{k^{p,q}}$, an approximation of $\partial^{p,q} k$. If
1. $\mathrm{supp}(\Lambda)$ is bounded: $C_{k,p,q} := E_{\omega \sim \Lambda}\left[ |\omega^{p+q}| \, \|\omega\|_2^2 \right] < \infty$: $L^\infty$, $L^r$ ✓, but this excludes the Gaussian, Laplacian, inverse multiquadric and Matérn kernels :( [$c_0$-universality ⇔ $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, if $k(z) \in C_0(\mathbb{R}^d)$].
2. $\mathrm{supp}(\Lambda)$ is unbounded: $\mathcal{G}$ becomes unbounded. [Rahimi and Recht, 2007]: 'Hoeffding → Bernstein', but
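Kernel derivatives: unbounded supp(Λ)
Assumptions [$h_a = \cos^{(a)}$, $S_\Delta = S - S$]:
1. $z \mapsto \nabla_z [\partial^{p,q} k(z)]$: continuous; $S \subset \mathbb{R}^d$: compact, $E_{p,q} := E_{\omega \sim \Lambda} |\omega^{p+q}| \, \|\omega\|_2 < \infty$.
2. $\exists L > 0$, $\sigma > 0$: $E_{\omega \sim \Lambda} |f(z; \omega)|^M \le \frac{M! \, \sigma^2 L^{M-2}}{2}$ ($\forall M \ge 2$, $\forall z \in S_\Delta$), where $f(z; \omega) = \partial^{p,q} k(z) - \omega^p (-\omega)^q h_{|p+q|}(\omega^T z)$.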
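The term $\omega^p (-\omega)^q h_{|p+q|}(\omega^T z)$ is what one obtains by differentiating $\cos(\omega^T (x - y))$ $p$ times in $x$ and $q$ times in $y$, so averaging it over the sampled frequencies gives a natural RFF estimate of $\partial^{p,q} k$. A minimal sketch, assuming a one-dimensional Gaussian kernel with $\theta = 1$ (so $\Lambda = N(0, 1)$) and illustrative function names:

```python
import math
import random

def monomial(omega, p):
    """omega^p = prod_i omega_i^{p_i} for a multi-index p."""
    out = 1.0
    for w, e in zip(omega, p):
        out *= w ** e
    return out

def rff_kernel_derivative(x, y, omegas, p, q):
    """(1/m) sum_j omega_j^p (-omega_j)^q h_{|p+q|}(omega_j^T (x - y))."""
    order = sum(p) + sum(q)
    z = [a - b for a, b in zip(x, y)]
    total = 0.0
    for om in omegas:
        t = sum(w * v for w, v in zip(om, z))
        h = math.cos(t + order * math.pi / 2.0)  # order-th derivative of cos at t
        total += monomial(om, p) * monomial([-w for w in om], q) * h
    return total / len(omegas)

# Assumption: Gaussian kernel exp(-(x - y)^2 / 2) in d = 1, hence omega ~ N(0, 1).
rng = random.Random(0)
omegas = [[rng.gauss(0.0, 1.0)] for _ in range(50000)]
x, y = (0.7,), (0.1,)
z = x[0] - y[0]

estimate = rff_kernel_derivative(x, y, omegas, p=(1,), q=(0,))
exact = -z * math.exp(-z ** 2 / 2.0)  # d/dx of exp(-(x - y)^2 / 2)
print(estimate, exact)
```

The summands are no longer bounded because of the polynomial factor in $\omega$, which is why the unbounded-support case needs moment conditions of the Bernstein type stated above.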
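Kernel derivatives: unbounded supp(Λ)
Then with $F_d := d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}}$, for any $\epsilon > 0$:
$\Lambda^m \left( \|\partial^{p,q} k - \widehat{\partial^{p,q} k}\|_{L^\infty(S)} \ge \epsilon \right) \le 2^{d-1} e^{-\frac{m \epsilon^2}{8 \sigma^2 \left(1 + \frac{\epsilon L}{2 \sigma^2}\right)}} + F_d \, 2^{\frac{4d-1}{d+1}} \left[ \frac{|S| (D_{p,q,S} + E_{p,q})}{\epsilon} \right]^{\frac{d}{d+1}} e^{-\frac{m \epsilon^2}{8 (d+1) \sigma^2 \left(1 + \frac{\epsilon L}{2 \sigma^2}\right)}}$,
where $D_{p,q,S} := \sup_{z \in \mathrm{conv}(S_\Delta)} \|\nabla_z [\partial^{p,q} k(z)]\|_2$.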
Summary
Finite-sample $L^\infty(S)$ guarantees $\xrightarrow{\text{spec.}}$ $|S_m| = e^{o(m)}$, asymptotically optimal!
$L^r(S)$ results (⇐ uniform, type of $L^r$).
Derivative approximation guarantees:
bounded spectral support: ✓
unbounded spectral support: trickier, to be continued ;)
Thank you for the attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation.
Csörgő, S. and Totik, V. (1983). On how long interval is the empirical characteristic function uniformly consistent? Acta Scientiarum Mathematicarum, 45:141–149.
Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS), pages 1177–1184.
Rosasco, L., Santoro, M., Mosci, S., Verri, A., and Villa, S. (2010). A regularization approach to nonlinear variable selection. JMLR W&CP, International Conference on Artificial Intelligence and Statistics (AISTATS), 9:653–660.
Rosasco, L., Villa, S., Mosci, S., Santoro, M., and Verri, A. (2013). Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714.
Shi, L., Guo, X., and Zhou, D.-X. (2010). Hermite learning with gradient data. Journal of Computational and Applied Mathematics, 233:3046–3059.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. (2014). Density estimation in infinite dimensional exponential families. Technical report. http://arxiv.org/pdf/1312.3516.pdf.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabó, Z., and Gretton, A. (2015). Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In Neural Information Processing Systems (NIPS).
Ying, Y., Wu, Q., and Campbell, C. (2012). Learning the coordinate gradients. Advances in Computational Mathematics, 37:355–378.
Zhou, D.-X. (2008). Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463.