Kernel Statistical Tests for Random Processes
ISNPS, Avignon, 2016
Arthur Gretton
Gatsby Unit, UCL

Two-Sample Test for Random Processes

Is P the same distribution as Q?

Outline

Testing for differences in marginal distributions of random processes (MMD):
• Markov chain convergence diagnostics
• Change point detection

Testing for independence between random processes (HSIC):
• Dependency structure in financial markets
• Brain region activation

Why tests designed for time series are needed:
• Most real data (in the brain!) are time series
• MCMC diagnostics require tests on time series (or throwing out most of the data)

Maximum mean discrepancy, two-sample test

Feature mean difference
• Two Gaussians with the same mean, different variances
• Idea: look at the difference in means of features of the RVs
• In the Gaussian case: second-order features of the form ϕ(x) = x²

[Figure: two Gaussian densities P_X and Q_X with equal means and different variances (left); the corresponding densities of the feature X² under P and Q (right)]

Feature mean difference
• Gaussian and Laplace distributions
• Same mean and same variance
• Difference in means using higher-order features

[Figure: Gaussian and Laplace densities P_X and Q_X with matched mean and variance]

. . . so let's explore feature representations!

Kernels: similarity between features

Kernel:
• We have two objects x and x′ from a set X (documents, images, . . . ). How similar are they?
• Define features of the objects:
  – ϕ_x are the features of x
  – ϕ_x′ are the features of x′
• A kernel is the dot product between these features: k(x, x′) := ⟨ϕ_x, ϕ_x′⟩_F
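As a toy illustration of the last point (not from the slides): for the explicit feature map ϕ_x = (x, x²) on scalars, the kernel is simply the dot product of the two feature vectors. A minimal numpy sketch, with all names chosen here for illustration:

```python
import numpy as np

def phi(x):
    """Explicit feature map phi_x = (x, x^2) for a scalar x."""
    return np.array([x, x**2])

def k(x, xp):
    """Kernel: the dot product between the features of x and x'."""
    return phi(x) @ phi(xp)

# For this feature map the kernel has the closed form x*x' + (x*x')^2.
x, xp = 1.5, -0.3
assert np.isclose(k(x, xp), x * xp + (x * xp)**2)
```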

Probabilities in feature space: the mean trick

The kernel trick
• Given x ∈ X for some set X, define the feature map ϕ_x ∈ F, ϕ_x = [. . . e_i(x) . . .]
• For a kernel k(x, x′): k(x, x′) = ⟨ϕ_x, ϕ_x′⟩_F

The mean trick
• Given a probability distribution P, define the mean embedding µ_P ∈ F, µ_P = [. . . E_P[e_i(X)] . . .]
• For a kernel k(x, x′): ⟨µ_P, µ_Q⟩_F = E_{P,Q} k(X, Y) for X ∼ P and Y ∼ Q

Need to ensure Bochner integrability of ϕ_x for x ∼ P: true for bounded kernels.

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

  MMD²(P, Q) = ‖µ_P − µ_Q‖²_F
             = ⟨µ_P, µ_P⟩_F + ⟨µ_Q, µ_Q⟩_F − 2⟨µ_P, µ_Q⟩_F
             = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y)
                   (a)            (a)             (b)

(a) = within-distribution similarity, (b) = cross-distribution similarity

A biased empirical estimate (V-statistic):

  MMD̂² = (1/n²) Σ_{i,j=1}^n [k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j)]
        = (1/n²) Σ_{i,j=1}^n ⟨ϕ_{x_i} − ϕ_{y_i}, ϕ_{x_j} − ϕ_{y_j}⟩_F,

where the summand defines a kernel on pairs, K((x_i, y_i), (x_j, y_j)) := ⟨ϕ_{x_i} − ϕ_{y_i}, ϕ_{x_j} − ϕ_{y_j}⟩_F.
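A minimal numpy sketch of the biased V-statistic above; the Gaussian kernel, the bandwidth σ and the toy data are illustrative assumptions rather than choices prescribed by the slides:

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-d samples."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between equal-size samples x and y."""
    Kxx = gaussian_gram(x, x, sigma)
    Kyy = gaussian_gram(y, y, sigma)
    Kxy = gaussian_gram(x, y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=500)   # sample with marginal P
y = rng.normal(0, 2, size=500)   # sample with marginal Q (same mean, larger variance)
print(mmd2_biased(x, y))
```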

The maximum mean discrepancy

[Figure: samples x_i ∼ P (dogs) and y_j ∼ Q (fish); the pooled kernel matrix has within-sample blocks k(dog_i, dog_j) and k(fish_i, fish_j), and cross blocks k(dog_i, fish_j), k(fish_j, dog_i)]

  MMD̂² = K_{P,P} + K_{Q,Q} − 2 K_{P,Q},

where each term denotes the average of the corresponding block of the pooled kernel matrix.

Statistical test using MMD
• Two hypotheses:
  – H0: null hypothesis (P = Q)
  – H1: alternative hypothesis (P ≠ Q)
• Observe dependent samples x := {x_1, . . . , x_t, . . . , x_n} with marginal distribution P, and y := {y_1, . . . , y_t, . . . , y_n} with marginal distribution Q
• If the empirical MMD̂² is
  – "far from zero": reject H0
  – "close to zero": accept H0
• Assumptions: (X_t)_{t∈Z} and (Y_t)_{t∈Z} are strictly stationary and τ-dependent with Σ_{r=1}^∞ √τ(r) < ∞, where

    E‖X_r − X̃_r‖₁ < τ(r),

  X_r is dependent on X_0, and X̃_r is a copy of X_r independent of X_0.

Statistical test using MMD
• "far from zero" vs "close to zero": where to set the threshold?
• One answer: the asymptotic distribution of MMD̂². When P = Q, [Leucht and Neumann, 2013]

    n MMD̂² ∼ Σ_{ℓ=1}^∞ λ_ℓ Q_ℓ² =: D_MMD

• where
  – ∫_X K(z, z′) ψ_i(z) dP_0(z) = λ_i ψ_i(z′)
  – Q_ℓ ∼ N(0, 1), correlated: cov(Q_u, Q_v) = Σ_{r=−∞}^∞ cov[ψ_u(Z_0), ψ_v(Z_r)]
  – z := (x, y)

[Figure: MMD density under H0; the empirical PDF of n × MMD² compared with the χ²-sum limit]

Asymptotics of MMD̂²: proof idea

First define an order-m truncation of K(z, z′):

  K^(m)(z, z′) = Σ_{ℓ=1}^m λ_ℓ ψ_ℓ(z) ψ_ℓ(z′).

We can prove that as m → ∞ the asymptotics of the truncation approach those of K. The associated V-statistic is:

  n V_n^(m) = (1/n) Σ_{s,t=1}^n Σ_{ℓ=1}^m λ_ℓ ψ_ℓ(Z_s) ψ_ℓ(Z_t)        (inner sum = K^(m)(z_s, z_t))
            = Σ_{ℓ=1}^m λ_ℓ ( n^{−1/2} Σ_{t=1}^n ψ_ℓ(Z_t) )²

Asymptotics of MMD̂²: proof idea

Under the assumptions on Z_t, we can apply a central limit theorem for weakly dependent random variables to the inner sum:

  n^{−1/2} Σ_{t=1}^n [ψ_1(Z_t) . . . ψ_ℓ(Z_t)]  →_d  [Q_1 . . . Q_ℓ]

Statistical test using MMD
• Given P = Q, want a threshold T such that P(MMD̂² > T) ≤ α

  MMD̂² = K_{P,P} + K_{Q,Q} − 2 K_{P,Q}

[Figure: MMD density under H0 and H1; null and alternative densities of n × MMD², with the 1−α quantile of the null marked and the Type II error shaded]

Statistical test using MMD
• Given P = Q, want a threshold T such that P(MMD̂² > T) ≤ 0.05
• Permutation for the empirical CDF [Arcones and Giné, 1992]

[Figure: null distribution and permutation, β = 0; null PDF of n × MMD² and the null PDF estimated by permutation]
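A hedged sketch of how such a permutation threshold could be computed, reusing the hypothetical mmd2_biased helper from the earlier snippet; the number of permutations and the level α = 0.05 are illustrative:

```python
import numpy as np

def mmd_permutation_threshold(x, y, n_perm=500, alpha=0.05, sigma=1.0, seed=0):
    """Estimate the (1 - alpha) null quantile of MMD^2 by permuting the pooled sample.

    Valid for i.i.d. observations: shuffling destroys any temporal dependence.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        stats.append(mmd2_biased(pooled[perm[:n]], pooled[perm[n:]], sigma))
    return np.quantile(stats, 1 - alpha)
```

Because the permutation step destroys temporal dependence, this estimate of the null is miscalibrated for time series with memory, as the following slides illustrate.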

Memory of the Processes

  X_t = β X_{t−1} + ε_t,    ε_t ∼ N(0, σ²) i.i.d.

[Figure: sample paths for β = 0.14 and β = 0.97]

The null distribution of the V-statistic is strongly affected by memory.
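A minimal sketch of how such AR(1) processes could be generated (σ, the sample size and the seed are illustrative choices):

```python
import numpy as np

def ar1(n, beta, sigma=1.0, seed=0):
    """Generate X_t = beta * X_{t-1} + eps_t with eps_t ~ N(0, sigma^2) i.i.d."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = beta * x[t - 1] + rng.normal(0, sigma)
    return x

x_weak = ar1(1000, beta=0.14)    # short memory
x_strong = ar1(1000, beta=0.97)  # long memory: successive values highly correlated
```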

Memory β = 0.0, 0.2, 0.4, 0.5: permutation for the null

[Figures: for β ∈ {0.0, 0.2, 0.4, 0.5}, the null PDF of n × MMD² compared with the null PDF estimated by permutation]

Wild bootstrap estimate of the asymptotic distribution

Define a new time series W_t* with the property

  cov(W_s*, W_t*) = ρ(|s − t| / ℓ_n),

where ℓ_n is a width parameter growing with n, and ρ is a window, e.g.

  cov(W_s*, W_t*) = exp(−|s − t| / ℓ_n).

X_t and Y_t are τ-dependent with Σ_{r=1}^∞ r² √τ(r) < ∞.

Wild bootstrap estimate of the null:

  V_n* := (1/n) Σ_{s,t=1}^n h((X_s, Y_s), (X_t, Y_t)) W_s* W_t*

As measured via the Prokhorov metric d_p,

  d_p( D_MMD , (1/n) Σ_{s,t=1}^n K((X_s, Y_s), (X_t, Y_t)) W_s* W_t* ) →_p 0   as n → ∞.
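A hedged sketch of the wild bootstrap: here W_t* is drawn as a stationary Gaussian AR(1) process whose autocovariance matches the exponential window above, and the bootstrapped statistic is the quadratic form V_n* from the slide with the kernel K on pairs. The choice of ℓ_n, the number of bootstrap draws, and the reuse of the hypothetical gaussian_gram helper from the earlier snippet are illustrative assumptions:

```python
import numpy as np

def wild_bootstrap_process(n, ln, rng):
    """Draw W*_t with mean 0, variance 1 and cov(W*_s, W*_t) = exp(-|s - t| / ln).

    Implemented as a stationary Gaussian AR(1) with coefficient a = exp(-1 / ln).
    """
    a = np.exp(-1.0 / ln)
    w = np.zeros(n)
    w[0] = rng.normal()
    for t in range(1, n):
        w[t] = a * w[t - 1] + np.sqrt(1 - a**2) * rng.normal()
    return w

def wild_bootstrap_threshold(x, y, n_boot=500, ln=20, alpha=0.05, sigma=1.0, seed=0):
    """Estimate the (1 - alpha) null quantile of n * MMD^2 via the wild bootstrap.

    K is the kernel on pairs, K((x_s, y_s), (x_t, y_t)) =
    k(x_s, x_t) - k(x_s, y_t) - k(y_s, x_t) + k(y_s, y_t), as on the V-statistic slide.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    K = (gaussian_gram(x, x, sigma) + gaussian_gram(y, y, sigma)
         - gaussian_gram(x, y, sigma) - gaussian_gram(y, x, sigma))
    stats = []
    for _ in range(n_boot):
        w = wild_bootstrap_process(n, ln, rng)
        stats.append((w @ K @ w) / n)   # V_n* = (1/n) sum_{s,t} K(Z_s, Z_t) W*_s W*_t
    return np.quantile(stats, 1 - alpha)
```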

How the proof works (1)

Again define a finite approximation,

  V_n^(m)* = (1/n) Σ_{s,t=1}^n K^(m)(Z_s, Z_t) W_s* W_t*
           = Σ_{k=1}^m λ_k ( n^{−1/2} Σ_{t=1}^n ψ_k(Z_t) W_t* )²,

which can be shown to converge as m → ∞. Define

  U_t* := [ψ_1(Z_t) . . . ψ_m(Z_t)] W_t*.

We need that in probability (as n → ∞),

  n^{−1/2} Σ_{t=1}^n U_t*  →_d  N(0, Σ_m)

How the proof works (2)

  cov( n^{−1/2} Σ_{s=1}^n ψ_j(Z_s) W_s* ,  n^{−1/2} Σ_{t=1}^n ψ_k(Z_t) W_t* )

    = (1/n) Σ_{s,t=1}^n ψ_j(Z_s) ψ_k(Z_t) ρ(|s − t| / ℓ_n)

    = (1/n) Σ_{s,t=1}^n ( ψ_j(Z_s) ψ_k(Z_t) − E[ψ_j(Z_s) ψ_k(Z_t)] ) ρ(|s − t| / ℓ_n)    (converges to 0)

      + Σ_{r=−∞}^∞ E[ψ_j(Z_0) ψ_k(Z_r)] ρ(|r| / ℓ_n) max{1 − |r|/n, 0}                   (converges to (Σ_m)_{j,k})

Memory β = 0.0, 0.2, 0.4, 0.5: wild bootstrap for the null

[Figures: for β ∈ {0.0, 0.2, 0.4, 0.5}, the null PDF of n × MMD² compared with the null PDF estimated by the wild bootstrap]

MCMC M.D. Experiment

Does P have the same marginal distribution as Q?

[Figure: trace plots of the two chains]

  Test (MMD)        Type I error
  Permutation       68%
  Wild bootstrap     6%

Testing Independence and the Hilbert-Schmidt Independence Criterion

MMD for independence
• Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
  Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

  HSIC²(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Figure: the kernel on pairs is the product of kernels on the components, κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′), illustrated with pairs of images]

HSIC using expectations of kernels: define an RKHS F on X with kernel k and an RKHS G on Y with kernel l. Then

  HSIC²(P_XY, P_X P_Y) = ‖E_XY[(ϕ_X − µ_{P_X}) ⊗ (ψ_Y − µ_{P_Y})]‖²_{F×G}
    = E_XY E_X′Y′ [k(x, x′) l(y, y′)] + E_X E_X′ [k(x, x′)] E_Y E_Y′ [l(y, y′)]
      − 2 E_X′Y′ [ E_X k(x, x′) E_Y l(y, y′) ].

HSIC: empirical estimate and intuition

[Figure: two sets of text snippets used to build intuition, one describing dogs ("Their noses guide them through life, and they're never happier than when following an interesting scent. They need plenty of exercise . . .") and one beginning "Known for their curiosity, intelligence, and excellent communication skills, the Japanese . . ."]
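The empirical estimate itself does not survive in the extracted text. As a hedged sketch, a standard biased (V-statistic) estimator is HSIĈ² = (1/n²) tr(K H L H), with H = I − 11ᵀ/n the centering matrix; below, the Gaussian kernels, bandwidths and toy data are illustrative choices, and the hypothetical gaussian_gram helper from the earlier snippet is reused:

```python
import numpy as np

def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased (V-statistic) estimate of HSIC^2: trace(K H L H) / n^2.

    K is the Gram matrix of k on the x sample, L the Gram matrix of l on the
    y sample, and H = I - (1/n) * ones((n, n)) centers both.
    """
    n = len(x)
    K = gaussian_gram(x, x, sigma_x)
    L = gaussian_gram(y, y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = x**2 + 0.1 * rng.normal(size=300)   # dependent on x, yet uncorrelated with it
print(hsic_biased(x, y))
```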