
Optimal Rates for Random Fourier Feature Approximations

Zoltán Szabó∗, Gatsby Unit, UCL

Joint work with Bharath K. Sriperumbudur∗, Department of Statistics, PSU (∗ equal contribution)

Department of Computing Science, University of Alberta, November 24, 2015

Zoltán Szabó

Optimal Rates for RFFs

Outline

Kernels and kernel derivatives.
Random Fourier features (RFFs).
Guarantees on RFF approximation: uniform, $L^r$.

Kernel, RKHS

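$k : X \times X \to \mathbb{R}$ is a kernel on $X$ if there exists a feature map $\varphi : X \to H$ into a Hilbert space $H$ such that
\[ k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H \quad (\forall a, b \in X). \]

Kernel examples: $X = \mathbb{R}^d$ ($p > 0$, $\theta > 0$)
\[ k(a, b) = (\langle a, b \rangle + \theta)^p : \text{polynomial}, \]
\[ k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)} : \text{Gaussian}, \]
\[ k(a, b) = e^{-\theta \|a - b\|_2} : \text{Laplacian}. \]

In the RKHS $H = H(k)$ (which exists uniquely, $\exists!$): $\varphi(u) = k(\cdot, u)$.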
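The three example kernels can be written down in a few lines. The following is a minimal NumPy sketch, not code from the talk: the function names, the toy data and the restriction to an integer degree `p` are assumptions made for illustration. It builds the Gram matrices on random points and inspects their smallest eigenvalue, which should be non-negative up to rounding for a valid kernel.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances ||a_i - b_j||_2 between the rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def polynomial_kernel(a, b, theta=1.0, p=2):
    # Assumption: p is a positive integer, so the Gram matrix is positive semi-definite.
    return (a @ b.T + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))
    return np.exp(-pairwise_dist(a, b) ** 2 / (2.0 * theta ** 2))

def laplacian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-theta ||a - b||_2)
    return np.exp(-theta * pairwise_dist(a, b))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 hypothetical points in R^3
kernels = {
    "polynomial": polynomial_kernel,
    "gaussian": gaussian_kernel,
    "laplacian": laplacian_kernel,
}
grams = {name: k(X, X) for name, k in kernels.items()}
min_eigs = {name: np.linalg.eigvalsh(G).min() for name, G in grams.items()}
print(min_eigs)
```

The eigenvalue check is only a sanity test on one sample; it does not prove positive definiteness.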

Kernel: example domains (X)

Euclidean space: $X = \mathbb{R}^d$.
Graphs, texts, time series, dynamical systems, distributions.

Kernel: application example – ridge regression

Given: $\{(x_i, y_i)\}_{i=1}^{\ell}$, $H = H(k)$.
Task: find $f \in H$ s.t. $f(x_i) \approx y_i$,
\[ J(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} [f(x_i) - y_i]^2 + \lambda \|f\|_H^2 \to \min_{f \in H} \quad (\lambda > 0). \]

Analytical solution, $O(\ell^3)$ – expensive:
\[ f(x) = [k(x_1, x), \ldots, k(x_\ell, x)] (G + \lambda \ell I)^{-1} [y_1; \ldots; y_\ell], \quad G = [k(x_i, x_j)]_{i,j=1}^{\ell}. \]

Idea: $\hat{G}$, matrix-inversion lemma, fast primal solvers → RFF.
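To make the cost of the analytical solution concrete, here is a small sketch of kernel ridge regression with a Gaussian kernel. The data, the bandwidth `theta`, the value of `lam` and all function names are illustrative assumptions; the only part taken from the slide is the closed form $(G + \lambda \ell I)^{-1} y$.

```python
import numpy as np

def gaussian_gram(A, B, theta=1.0):
    """Gram matrix [k(a_i, b_j)] of the Gaussian kernel with bandwidth theta."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta ** 2))

def krr_fit(X, y, lam, theta=1.0):
    """Coefficients (G + lam * ell * I)^{-1} y; the dense solve costs O(ell^3)."""
    ell = X.shape[0]
    G = gaussian_gram(X, X, theta)
    return np.linalg.solve(G + lam * ell * np.eye(ell), y)

def krr_predict(X_train, coef, X_new, theta=1.0):
    """f(x) = [k(x_1, x), ..., k(x_ell, x)] @ coef for every row x of X_new."""
    return gaussian_gram(X_new, X_train, theta) @ coef

# Hypothetical toy problem: noisy samples of sin on [-3, 3].
rng = np.random.default_rng(0)
ell, lam = 200, 1e-3
X = rng.uniform(-3.0, 3.0, size=(ell, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=ell)

coef = krr_fit(X, y, lam)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
max_err = np.max(np.abs(krr_predict(X, coef, X_test) - np.sin(X_test[:, 0])))
print(max_err)
```

The $\ell \times \ell$ solve is the bottleneck that motivates replacing $G$ by a low-rank approximation.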

Kernels: more generally

Requirement: inner product on the inputs ($k : X \times X \to \mathbb{R}$).
Loss function ($\lambda > 0$):
\[ J(f) = \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_{H(k)}^2 \to \min_{f \in H(k)}. \]

By the representer theorem [$f(\cdot) = \sum_{i=1}^{\ell} \alpha_i k(\cdot, x_i)$]:
\[ J(\alpha) = \sum_{i=1}^{\ell} V(y_i, (G\alpha)_i) + \lambda \alpha^T G \alpha \to \min_{\alpha \in \mathbb{R}^{\ell}}. \]

$\Rightarrow$ $k(x_i, x_j)$ matters.

Kernel derivatives: application example

Motivation: fitting infinite-dimensional exponential family distributions [Sriperumbudur et al., 2014],
$k \leftrightarrow$ sufficient statistics, rich family,
fitting = linear equation:
coefficient matrix: $(d\ell) \times (d\ell)$, $d = \dim(x)$,
entries: kernel values and derivatives.

Kernel derivatives: more generally

Objective:
\[ J(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} V\big(y_i, \{\partial^p f(x_i)\}_{p \in J_i}\big) + \lambda \|f\|_{H(k)}^2 \to \min_{f \in H(k)}. \]

[Zhou, 2008, Shi et al., 2010, Rosasco et al., 2010, Rosasco et al., 2013, Ying et al., 2012]:
semi-supervised learning with gradient information,
nonlinear variable selection.

Kernel HMC [Strathmann et al., 2015].

Focus

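$X = \mathbb{R}^d$. $k$: continuous, shift-invariant [$k(x, y) = \tilde{k}(x - y)$].
By Bochner's theorem:
\[ k(x, y) = \int_{\mathbb{R}^d} e^{i \omega^T (x - y)} \, d\Lambda(\omega) = \int_{\mathbb{R}^d} \cos\big(\omega^T (x - y)\big) \, d\Lambda(\omega). \]

RFF trick [Rahimi and Recht, 2007] (MC): $\omega_{1:m} := (\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$,
\[ \hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^m \cos\big(\omega_j^T (x - y)\big) = \int_{\mathbb{R}^d} \cos\big(\omega^T (x - y)\big) \, d\Lambda_m(\omega). \]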
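The Monte Carlo construction can be sketched for the Gaussian kernel $e^{-\|x - y\|_2^2 / (2\theta^2)}$, whose spectral measure $\Lambda$ is the normal distribution $N(0, \theta^{-2} I)$. The function names, sample sizes and data below are assumptions chosen for illustration; the estimator is the $\hat{k}$ from the slide, evaluated through $\cos(a - b) = \cos a \cos b + \sin a \sin b$.

```python
import numpy as np

def sample_frequencies(m, d, theta, rng):
    """Draw omega_1, ..., omega_m i.i.d. from Lambda = N(0, I / theta^2)."""
    return rng.normal(scale=1.0 / theta, size=(m, d))

def rff_kernel(X, Y, omega):
    """k_hat(x, y) = (1/m) * sum_j cos(omega_j^T (x - y)) for all rows x of X, y of Y."""
    PX, PY = X @ omega.T, Y @ omega.T  # projections omega_j^T x and omega_j^T y
    m = omega.shape[0]
    return (np.cos(PX) @ np.cos(PY).T + np.sin(PX) @ np.sin(PY).T) / m

def gaussian_kernel(X, Y, theta):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta ** 2))

rng = np.random.default_rng(0)
d, theta = 3, 1.0
X = rng.uniform(-1.0, 1.0, size=(100, d))  # hypothetical points in a compact set S
K = gaussian_kernel(X, X, theta)

# Largest absolute error over the sampled pairs, for two feature counts m.
errors = {}
for m in (100, 10_000):
    omega = sample_frequencies(m, d, theta, rng)
    errors[m] = np.abs(K - rff_kernel(X, X, omega)).max()
print(errors)
```

The maximum over sampled pairs only lower-bounds the supremum over $S$; it is a quick empirical proxy, not the uniform error itself.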

RFF – existing guarantee, basically

Hoeffding inequality + union bound:
\[ \|k - \hat{k}\|_{L^\infty(S)} = O_p\Bigg( \underbrace{|S|}_{\text{linear}} \sqrt{\frac{\log m}{m}} \Bigg). \]

Characteristic function point of view [Csörgő and Totik, 1983] (asymptotic!):
1. $|S_m| = e^{o(m)}$ is the optimal rate for a.s. convergence,
2. for faster growing $|S_m|$: even convergence in probability fails.

Today: one-page summary

1. Finite-sample $L^\infty$-guarantee $\xrightarrow{\text{specifically}}$
\[ \|k - \hat{k}\|_{L^\infty(S)} = O_{a.s.}\Bigg( \frac{\sqrt{\log |S|}}{\sqrt{m}} \Bigg) \]
$\Rightarrow$ $S$ can grow exponentially [$|S_m| = e^{o(m)}$] – optimal!
2. Finite-sample $L^r$ guarantees, $r \in [1, \infty)$.
3. Derivatives: $\partial^{p,q} k$.

. . . , where

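Uniform ($r = \infty$), $L^r$ ($1 \le r < \infty$) norm:
\[ \|k - \hat{k}\|_{L^\infty(S)} := \sup_{x, y \in S} |k(x, y) - \hat{k}(x, y)|, \qquad \|k - \hat{k}\|_{L^r(S)} := \left( \int_S \int_S |k(x, y) - \hat{k}(x, y)|^r \, dx \, dy \right)^{1/r}. \]

Kernel derivatives:
\[ \partial^{p,q} k(x, y) = \frac{\partial^{|p| + |q|} k(x, y)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d} \, \partial y_1^{q_1} \cdots \partial y_d^{q_d}}, \qquad |p| = \sum_{j=1}^d |p_j|. \]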
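A natural estimator of a kernel derivative is obtained by differentiating the RFF approximation term by term. The worked equations below sketch this under the assumption that differentiation and integration can be interchanged (for instance when $\int |\omega^{p+q}| \, d\Lambda(\omega) < \infty$). The shorthands $\omega^p := \prod_{j=1}^d \omega_j^{p_j}$ and $h_a := \cos^{(a)}$ (the $a$-th derivative of cosine) are used for compactness.

```latex
% Each derivative in x_j brings down a factor omega_j, each derivative in y_j a factor -omega_j:
\[
\partial^{p,q} \cos\big(\omega^T (x - y)\big)
  = \omega^p (-\omega)^q \, h_{|p+q|}\big(\omega^T (x - y)\big).
\]
% Applying this under the Bochner integral and under its Monte Carlo average:
\[
\partial^{p,q} k(x, y)
  = \int_{\mathbb{R}^d} \omega^p (-\omega)^q \, h_{|p+q|}\big(\omega^T (x - y)\big) \, d\Lambda(\omega),
\qquad
\widehat{\partial^{p,q} k}(x, y)
  = \frac{1}{m} \sum_{j=1}^m \omega_j^p (-\omega_j)^q \, h_{|p+q|}\big(\omega_j^T (x - y)\big).
\]
% Example with d = 1, p = q = 1: h_2 = -cos, hence
\[
\frac{\partial^2}{\partial x \, \partial y} \cos\big(\omega (x - y)\big)
  = \omega \cdot (-\omega) \cdot \big(-\cos(\omega (x - y))\big)
  = \omega^2 \cos\big(\omega (x - y)\big).
\]
```

The extra factor $\omega^p (-\omega)^q$ is unbounded whenever supp($\Lambda$) is unbounded, which is why the derivative case needs moment conditions on $\Lambda$.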

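$\|k - \hat{k}\|_{L^\infty(S)}$: proof idea

1. Empirical process form [$Pg := \int g \, dP$]:
\[ \sup_{x, y \in S} |k(x, y) - \hat{k}(x, y)| = \sup_{g \in G} |\Lambda g - \Lambda_m g| = \|\Lambda - \Lambda_m\|_G. \]

2. $f(\omega_{1:m}) = \|\Lambda - \Lambda_m\|_G$ concentrates (bounded difference):
\[ \|\Lambda - \Lambda_m\|_G \lesssim E_{\omega_{1:m}} \|\Lambda - \Lambda_m\|_G + \frac{1}{\sqrt{m}}. \]

3. $G$ is 'nice' (uniformly bounded, separable Carathéodory) $\Rightarrow$
\[ E_{\omega_{1:m}} \|\Lambda - \Lambda_m\|_G \lesssim E_{\omega_{1:m}} R(G, \omega_{1:m}), \qquad R(G, \omega_{1:m}) := E_{\varepsilon} \sup_{g \in G} \left| \frac{1}{m} \sum_{j=1}^m \varepsilon_j g(\omega_j) \right|. \]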

Proof idea

4. Using Dudley's entropy bound:
\[ R(G, \omega_{1:m}) \lesssim \frac{1}{\sqrt{m}} \int_0^{|G|_{L^2(\Lambda_m)}} \sqrt{\log N(G, L^2(\Lambda_m), u)} \, du. \]

5. $G$ is smoothly parameterized by a compact set $\Rightarrow$
\[ N(G, L^2(\Lambda_m), u) \le \left( \frac{4 |S| A}{u} + 1 \right)^d, \qquad A(\omega_{1:m}) = \sqrt{\frac{1}{m} \sum_{j=1}^m \|\omega_j\|_2^2}. \]

6. Putting together [$|G|_{L^2(\Lambda_m)} \le 2$, Jensen inequality] we get . . .

L∞ result for k

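Let $k$ be continuous, $\sigma^2 := \int \|\omega\|_2^2 \, d\Lambda(\omega) < \infty$. Then for $\forall \tau > 0$ and compact set $S \subset \mathbb{R}^d$ (with $|S|$ denoting the diameter of $S$)
\[ \Lambda^m \left( \|\hat{k} - k\|_{L^\infty(S)} \ge \frac{h(d, |S|, \sigma) + \sqrt{2\tau}}{\sqrt{m}} \right) \le e^{-\tau}, \]
\[ h(d, |S|, \sigma) := 32 \sqrt{2d \log(2|S| + 1)} + 32 \sqrt{2d \log(\sigma + 1)} + 16 \sqrt{\frac{2d}{\log(2|S| + 1)}}. \]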
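The finite-sample bound is fully explicit, so it can be evaluated numerically. The sketch below assumes the constant $h$ is read from the slide as $32\sqrt{2d\log(2|S|+1)} + 32\sqrt{2d\log(\sigma+1)} + 16\sqrt{2d/\log(2|S|+1)}$ and that the threshold is $(h + \sqrt{2\tau})/\sqrt{m}$. The function names and the example inputs (a Gaussian kernel with $\theta = 1$ in $d = 3$, for which $\sigma^2 = d/\theta^2$) are illustrative assumptions.

```python
import math

def h(d, diam, sigma):
    """Constant h(d, |S|, sigma) of the uniform bound; diam is the diameter |S|."""
    log_term = math.log(2.0 * diam + 1.0)
    return (32.0 * math.sqrt(2.0 * d * log_term)
            + 32.0 * math.sqrt(2.0 * d * math.log(sigma + 1.0))
            + 16.0 * math.sqrt(2.0 * d / log_term))

def linf_bound(m, d, diam, sigma, tau):
    """Level eps such that P(||k_hat - k||_{L^inf(S)} >= eps) <= exp(-tau)."""
    return (h(d, diam, sigma) + math.sqrt(2.0 * tau)) / math.sqrt(m)

# Hypothetical setting: Gaussian kernel, theta = 1, d = 3, so sigma = sqrt(3).
d, diam, sigma, tau = 3, 2.0, math.sqrt(3.0), 3.0
for m in (10**4, 10**6, 10**8):
    print(m, linf_bound(m, d, diam, sigma, tau))
```

With these inputs $h$ is on the order of 200, so the bound becomes informative only for large $m$; the value of the result lies in its dependence on $|S|$ and $m$ rather than in the constants.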

Consequence-1 (Borel-Cantelli lemma)

A.s. convergence on compact sets: $\hat{k} \xrightarrow{m \to \infty} k$ at rate $\sqrt{\frac{\log |S|}{m}}$.
Growing diameter: $\frac{\log |S_m|}{m} \xrightarrow{m \to \infty} 0$ is enough, i.e. $|S_m| = e^{o(m)}$.
Specifically: asymptotic optimality [Csörgő and Totik, 1983, Theorem 2] (if $k(z)$ vanishes at $\infty$).

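Consequence-2: $L^r$ result for $k$ ($1 \le r$)

Idea: Note that
\[ \|\hat{k} - k\|_{L^r(S)} = \left( \int_S \int_S |\hat{k}(x, y) - k(x, y)|^r \, dx \, dy \right)^{1/r} \le \|\hat{k} - k\|_{L^\infty(S)} \, \mathrm{vol}^{2/r}(S). \]
$\mathrm{vol}(S) \le \mathrm{vol}(B)$, where $B := \left\{ x \in \mathbb{R}^d : \|x\|_2 \le \frac{|S|}{2} \right\}$, $\Rightarrow$
\[ \mathrm{vol}(B) = \frac{\pi^{d/2} |S|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)}, \qquad \Gamma(t) = \int_0^\infty u^{t-1} e^{-u} \, du. \]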
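The volume factor that converts the uniform bound into an $L^r$ bound is easy to compute. The following sketch uses only the Python standard library; the function names are hypothetical, and `diam` stands for the diameter $|S|$.

```python
import math

def ball_volume(d, diam):
    """Volume of the Euclidean ball of radius diam / 2 in R^d."""
    return math.pi ** (d / 2.0) * diam ** d / (2.0 ** d * math.gamma(d / 2.0 + 1.0))

def lr_factor(d, diam, r):
    """Upper bound vol(B)^(2/r) on the factor vol(S)^(2/r) multiplying the L^inf error."""
    return ball_volume(d, diam) ** (2.0 / r)

# Sanity checks against familiar cases: an interval, a disc and a 3-D ball of radius 1.
print(ball_volume(1, 2.0), ball_volume(2, 2.0), ball_volume(3, 2.0))
# The factor grows like |S|^(2d/r): doubling the diameter multiplies it by 2^(2d/r).
print(lr_factor(3, 4.0, 2.0) / lr_factor(3, 2.0, 2.0))
```

The $|S|^{2d/r}$ growth of this factor is what limits how fast the domain may grow in the $L^r$ setting.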

$L^r$ result for $k$

Under the previous assumptions, and $1 \le r < \infty$:
\[ \Lambda^m \left( \|\hat{k} - k\|_{L^r(S)} \ge \left( \frac{\pi^{d/2} |S|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)} \right)^{2/r} \frac{h(d, |S|, \sigma) + \sqrt{2\tau}}{\sqrt{m}} \right) \le e^{-\tau}. \]

Hence,
\[ \|\hat{k} - k\|_{L^r(S)} = O_{a.s.}\Bigg( \underbrace{\frac{|S|^{2d/r} \sqrt{\log |S|}}{\sqrt{m}}}_{L^r(S)\text{-consistency if } \xrightarrow{m \to \infty} 0} \Bigg). \]

Uniform guarantee: $|S_m| = e^{m^{\delta}}$ with $\delta < 1$.

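Direct $L^r$ result for $k$: proof idea

1. $f(\omega_1, \ldots, \omega_m) = \|k - \hat{k}\|_{L^r(S)}$ concentrates (bounded difference):
\[ \|k - \hat{k}\|_{L^r(S)} \le E_{\omega_{1:m}} \|k - \hat{k}\|_{L^r(S)} + \mathrm{vol}^{2/r}(S) \sqrt{\frac{2\tau}{m}}. \]

2. By $L^r \cong (L^{r'})^*$ ($\frac{1}{r} + \frac{1}{r'} = 1$), the separability of $L^{r'}(S)$ ($r > 1$) and symmetrization:
\[ E_{\omega_{1:m}} \|k - \hat{k}\|_{L^r(S)} \le \frac{2}{m} E_{\omega_{1:m}} \underbrace{E_{\varepsilon} \left\| \sum_{i=1}^m \varepsilon_i \cos(\langle \omega_i, \cdot - \cdot \rangle) \right\|_{L^r(S)}}_{=:(*)}. \]

3. Since $L^r(S)$ is of type $\min(2, r)$ [’-rule’] $\exists C_r'$ such that
\[ (*) \le C_r' \left( \sum_{i=1}^m \|\cos(\langle \omega_i, \cdot - \cdot \rangle)\|_{L^r(S)}^{\min(2, r)} \right)^{\frac{1}{\min(2, r)}}. \]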

Kernel derivatives: $\mathbb{N}^{2d} \ni [p; q] \ne 0$

Goal: $\widehat{\partial^{p,q} k}$. If
1. supp($\Lambda$) is bounded: $C_{k,p,q} := E_{\omega \sim \Lambda}\left[ |\omega^{p+q}| \, \|\omega\|_2^2 \right] < \infty$: $L^\infty$, $L^r$ ✓, but this excludes the Gaussian, Laplacian, inverse multiquadratic and Matérn kernels :( [$c_0$-universality $\Leftrightarrow$ supp($\Lambda$) $= \mathbb{R}^d$, if $k(z) \in C_0(\mathbb{R}^d)$].
2. supp($\Lambda$) is unbounded: $G$ becomes unbounded. [Rahimi and Recht, 2007]: 'Hoeffding → Bernstein', but . . .

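Kernel derivatives: unbounded supp($\Lambda$)

Assumptions [$h_a = \cos^{(a)}$, $S_\Delta = S - S$]:
1. $z \mapsto \nabla_z [\partial^{p,q} k(z)]$: continuous; $S \subset \mathbb{R}^d$: compact, $E_{p,q} := E_{\omega \sim \Lambda}\left[ |\omega^{p+q}| \, \|\omega\|_2 \right] < \infty$.
2. $\exists L > 0$, $\sigma > 0$:
\[ E_{\omega \sim \Lambda} |f(z; \omega)|^M \le \frac{M! \, \sigma^2 L^{M-2}}{2} \quad (\forall M \ge 2, \ \forall z \in S_\Delta), \]
where $f(z; \omega) = \partial^{p,q} k(z) - \omega^p (-\omega)^q h_{|p+q|}(\omega^T z)$.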

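Kernel derivatives: unbounded supp($\Lambda$)

Then with $F_d := d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}}$
\[ \Lambda^m \left( \|\partial^{p,q} k - \widehat{\partial^{p,q} k}\|_{L^\infty(S)} \ge \epsilon \right) \le 2^{d-1} e^{-\frac{m \epsilon^2}{8 \sigma^2 \left( 1 + \frac{\epsilon L}{2 \sigma^2} \right)}} + F_d \, 2^{\frac{4d-1}{d+1}} \left[ \frac{|S| (D_{p,q,S} + E_{p,q})}{\epsilon} \right]^{\frac{d}{d+1}} e^{-\frac{m \epsilon^2}{8 (d+1) \sigma^2 \left( 1 + \frac{\epsilon L}{2 \sigma^2} \right)}}, \]
where $D_{p,q,S} := \sup_{z \in \mathrm{conv}(S_\Delta)} \|\nabla_z [\partial^{p,q} k(z)]\|_2$.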

Summary

Finite-sample $L^\infty(S)$ guarantees $\xrightarrow{\text{spec.}}$ $|S_m| = e^{o(m)}$ – asymp. optimal!
$L^r(S)$ results ($\Leftarrow$ uniform, type of $L^r$).
Derivative approximation guarantees:
bounded spectral support: ✓
unbounded spectral support: trickier – to be continued ;)

Thank you for your attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation.


Csörgő, S. and Totik, V. (1983). On how long interval is the empirical characteristic function uniformly consistent? Acta Scientiarum Mathematicarum, 45:141–149.

Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS), pages 1177–1184.

Rosasco, L., Santoro, M., Mosci, S., Verri, A., and Villa, S. (2010). A regularization approach to nonlinear variable selection. JMLR W&CP – International Conference on Artificial Intelligence and Statistics (AISTATS), 9:653–660.

Rosasco, L., Villa, S., Mosci, S., Santoro, M., and Verri, A. (2013). Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714.

Shi, L., Guo, X., and Zhou, D.-X. (2010). Hermite learning with gradient data. Journal of Computational and Applied Mathematics, 233:3046–3059.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. (2014). Density estimation in infinite dimensional exponential families. Technical report. http://arxiv.org/pdf/1312.3516.pdf.

Strathmann, H., Sejdinovic, D., Livingstone, S., Szabó, Z., and Gretton, A. (2015). Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In Neural Information Processing Systems (NIPS).

Ying, Y., Wu, Q., and Campbell, C. (2012). Learning the coordinate gradients. Advances in Computational Mathematics, 37:355–378.

Zhou, D.-X. (2008). Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463.