Kernel Statistical Tests for Random Processes
ISNPS, Avignon, 2016
Arthur Gretton
Gatsby Unit, UCL
Two-Sample Test for Random Processes
Is P the same distribution as Q?
Outline
Testing for differences in marginal distributions of random processes (MMD):
• Markov chain convergence diagnostics
• Change point detection
Testing for independence between random processes (HSIC):
• Dependency structure in financial markets
• Brain region activation
Why tests designed for time series are needed:
• Most real data (in the brain!) are time series
• MCMC diagnostics require tests on time series (or throwing out most of the data)
Maximum mean discrepancy, two-sample test
Feature mean difference
• Two Gaussians with same means, different variance
• Idea: look at difference in means of features of the RVs
• In Gaussian case: second-order features of the form $\varphi(x) = x^2$
[Figure: "Two Gaussians with different variances" (densities of $P_X$ and $Q_X$) and "Densities of feature $X^2$" (densities of $P_{X^2}$ and $Q_{X^2}$)]
Feature mean difference
• Gaussian and Laplace distributions
• Same mean and same variance
• Difference in means using higher-order features
[Figure: "Gaussian and Laplace densities": $P_X$ and $Q_X$]
. . . so let's explore feature representations!
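To make the "higher-order features" point concrete, here is a minimal numerical sketch (assuming NumPy; the sample size and seed are arbitrary choices): Gaussian and Laplace samples matched in mean and variance agree in their first- and second-order feature means, but differ in the fourth.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Same mean (0) and same variance (1): Laplace(scale=b) has variance 2*b^2.
x_gauss = rng.normal(loc=0.0, scale=1.0, size=n)
x_laplace = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2), size=n)

# Empirical means of the features x, x^2, x^4.
for p in (1, 2, 4):
    print(f"E[x^{p}]: Gaussian {np.mean(x_gauss**p):.3f}, "
          f"Laplace {np.mean(x_laplace**p):.3f}")
# The first two moments match; the fourth differs (3 vs. 6), so a
# higher-order feature mean separates the two distributions.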
Kernels: similarity between features
Kernel:
• We have two objects x and x′ from a set $\mathcal X$ (documents, images, . . . ). How similar are they?
• Define features of objects: $\varphi_x$ are features of x, $\varphi_{x'}$ are features of x′
• A kernel is the dot product between these features: $k(x, x') := \langle \varphi_x, \varphi_{x'} \rangle_{\mathcal F}$.
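As a small illustrative example (not from the slides; assumes NumPy) of a kernel being a dot product of explicit features: the homogeneous quadratic kernel on $\mathbb R^2$ equals the inner product of second-order monomial features.

import numpy as np

# Quadratic kernel k(x, x') = (x . x')^2 on R^2 equals the dot product of
# the explicit feature map phi(x) = [x1^2, x2^2, sqrt(2) x1 x2].
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(x, xp) ** 2)        # kernel evaluated directly
print(np.dot(phi(x), phi(xp)))   # dot product of features: same value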
Probabilities in feature space: the mean trick
The kernel trick
• Given $x \in \mathcal X$ for some set $\mathcal X$, define feature map $\varphi_x \in \mathcal F$, $\varphi_x = [\ldots\ e_i(x)\ \ldots]$
• For kernel $k(x, x')$, $k(x, x') = \langle \varphi_x, \varphi_{x'} \rangle_{\mathcal F}$
The mean trick
• Given probability P, define mean embedding $\mu_P \in \mathcal F$, $\mu_P = [\ldots\ \mathbf{E}_P[e_i(X)]\ \ldots]$
• For kernel $k(x, x')$, $\langle \mu_P, \mu_Q \rangle_{\mathcal F} = \mathbf{E}_{P,Q}\, k(X, Y)$ for $X \sim P$ and $Y \sim Q$.
Need to ensure Bochner integrability of $\varphi_x$ for $x \sim P$: true for bounded kernels.
The maximum mean discrepancy
The maximum mean discrepancy is the distance between feature means:
$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal F}^2 = \langle \mu_P, \mu_P \rangle_{\mathcal F} + \langle \mu_Q, \mu_Q \rangle_{\mathcal F} - 2\, \langle \mu_P, \mu_Q \rangle_{\mathcal F}$$
$$= \underbrace{\mathbf{E}_P\, k(x, x')}_{(a)} + \underbrace{\mathbf{E}_Q\, k(y, y')}_{(a)} - 2\, \underbrace{\mathbf{E}_{P,Q}\, k(x, y)}_{(b)}$$
(a) = within-distribution similarity, (b) = cross-distribution similarity
A biased empirical estimate (V-statistic):
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n^2} \sum_{i,j}^{n} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right] = \frac{1}{n^2} \sum_{i,j}^{n} \underbrace{\langle \varphi_{x_i} - \varphi_{y_i},\ \varphi_{x_j} - \varphi_{y_j} \rangle_{\mathcal F}}_{K((x_i, y_i),\, (x_j, y_j))}$$
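A minimal sketch of this biased V-statistic, assuming NumPy and a Gaussian kernel (the bandwidth, sample sizes and distributions below are illustrative choices, not those used on the slides):

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y."""
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1.0, size=(500, 1))
y = rng.normal(0, 1.5, size=(500, 1))   # same mean, larger variance
print(mmd2_biased(x, y))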
The maximum mean discrepancy
[Illustration: samples $\{\mathrm{dog}_i\} \sim P$ and $\{\mathrm{fish}_i\} \sim Q$; the Gram matrix splits into within-sample blocks $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{fish}_i, \mathrm{fish}_j)$ and cross blocks $k(\mathrm{dog}_i, \mathrm{fish}_j)$, $k(\mathrm{fish}_j, \mathrm{dog}_i)$]
$$\widehat{\mathrm{MMD}}^2 = \overline{K}_{P,P} + \overline{K}_{Q,Q} - 2\, \overline{K}_{P,Q}$$
where $\overline{K}_{\cdot,\cdot}$ denotes the average of the corresponding block of kernel values.
Statistical test using MMD
• Two hypotheses:
  – H0: null hypothesis (P = Q)
  – H1: alternative hypothesis (P ≠ Q)
• Observe dependent samples x := {x_1, . . . , x_t, . . . , x_n} with marginal distribution P, and y := {y_1, . . . , y_t, . . . , y_n} with marginal distribution Q
• If the empirical $\widehat{\mathrm{MMD}}^2$ is
  – "far from zero": reject H0
  – "close to zero": accept H0
• Assumptions: $(X_t)_{t \in \mathbb Z}$ and $(Y_t)_{t \in \mathbb Z}$ are strictly stationary and $\tau$-dependent with $\sum_{r=1}^{\infty} \sqrt{\tau(r)} < \infty$:
$$\mathbf{E}\,\| X_r - \tilde X_r \|_1 < \tau(r),$$
where $X_r$ is dependent on $X_0$ and $\tilde X_r$ is a copy of $X_r$ independent of $X_0$.
Statistical test using MMD
• "far from zero" vs. "close to zero": where is the threshold?
• One answer: the asymptotic distribution of $\widehat{\mathrm{MMD}}^2$. When P = Q, the asymptotic distribution is
$$n\, \widehat{\mathrm{MMD}}^2 \sim \sum_{\ell=1}^{\infty} \lambda_\ell\, Q_\ell^2 =: D_{\mathrm{MMD}} \qquad \text{[Leucht and Neumann, 2013]}$$
• where
  – $\int_{\mathcal X} K(z, z')\, \psi_i(z)\, dP_0(z) = \lambda_i\, \psi_i(z')$
  – $z := (x, y)$
  – $Q_\ell \sim \mathcal N(0, 1)$ correlated, $\mathrm{cov}(Q_u, Q_v) = \sum_{r=-\infty}^{\infty} \mathrm{cov}\left[ \psi_u(Z_0), \psi_v(Z_r) \right]$
[Figure: "MMD density under H0": empirical PDF of $n \times \mathrm{MMD}^2$ vs. the $\chi^2$-type sum]
Asymptotics of $\widehat{\mathrm{MMD}}^2$: proof idea
First define an order-m truncation of $K(z, z')$:
$$K^{(m)}(z, z') = \sum_{\ell=1}^{m} \lambda_\ell\, \psi_\ell(z)\, \psi_\ell(z').$$
We can prove that as $m \to \infty$ the asymptotics of the truncation approach those of K. The associated V-statistic is:
$$n V_n^{(m)} = \frac{1}{n} \sum_{s,t=1}^{n} \underbrace{\sum_{\ell=1}^{m} \lambda_\ell\, \psi_\ell(Z_s)\, \psi_\ell(Z_t)}_{K^{(m)}(z_s, z_t)} = \sum_{\ell=1}^{m} \lambda_\ell \left( n^{-1/2} \sum_{t=1}^{n} \psi_\ell(Z_t) \right)^{2}$$
Asymptotics of $\widehat{\mathrm{MMD}}^2$: proof idea
Under the assumptions on $Z_t$, we can apply a central limit theorem for weakly dependent random variables to the inner sum:
$$n^{-1/2} \sum_{t=1}^{n} \left[ \psi_1(Z_t)\ \ldots\ \psi_\ell(Z_t) \right] \xrightarrow{d} \left[ Q_1\ \ldots\ Q_\ell \right]$$
Statistical test using MMD
• Given P = Q, want threshold T such that $\mathbf{P}(\mathrm{MMD} > T) \le \alpha$
$$\widehat{\mathrm{MMD}}^2 = \overline{K}_{P,P} + \overline{K}_{Q,Q} - 2\, \overline{K}_{P,Q}$$
[Figure: "MMD density under H0 and H1": null and alternative densities of $n \times \mathrm{MMD}^2$, the $1-\alpha$ quantile of the null, and the Type II error region]
Statistical test using MMD
• Given P = Q, want threshold T such that $\mathbf{P}(\mathrm{MMD} > T) \le 0.05$
• Permutation for empirical CDF [Arcones and Giné, 1992]
[Figure: "Null distribution and permutation, β = 0": null PDF vs. null PDF from permutation, over $n \times \mathrm{MMD}^2$]
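A sketch of the permutation approach to the threshold, reusing the mmd2_biased helper sketched earlier (the number of permutations and level are placeholder values). Note that shuffling destroys temporal dependence, which is exactly why this calibration fails for processes with memory, as the next slides illustrate.

import numpy as np

def mmd_permutation_test(x, y, sigma=1.0, n_perm=500, alpha=0.05, rng=None):
    """Permutation estimate of the null distribution of n * MMD^2."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    stat = n * mmd2_biased(x, y, sigma)
    pooled = np.concatenate([x, y])
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(2 * n)            # shuffle the pooled sample
        xb, yb = pooled[perm[:n]], pooled[perm[n:]]
        null_stats[b] = n * mmd2_biased(xb, yb, sigma)
    threshold = np.quantile(null_stats, 1 - alpha)
    return stat, threshold, stat > threshold     # reject H0 if True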
Memory of the Processes
$$X_t = \beta X_{t-1} + \epsilon_t, \qquad \epsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal N(0, \sigma^2)$$
[Figure: sample paths for β = 0.14 and β = 0.97]
The null distribution of the V-statistic is strongly affected by memory.
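For reference, a sketch of the AR(1) process used in these memory illustrations (assuming NumPy; σ and the series length are placeholder values):

import numpy as np

def ar1(beta, n, sigma=1.0, rng=None):
    """Simulate X_t = beta * X_{t-1} + eps_t, eps_t ~ N(0, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = beta * x[t - 1] + rng.normal(0, sigma)
    return x

x_weak = ar1(0.14, 1000)    # short memory
x_strong = ar1(0.97, 1000)  # long memory: permutation null badly miscalibrated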
Memory β = 0.0, permutation for null
[Figure: "Null distribution and permutation, β = 0": null PDF vs. null PDF from permutation]
Memory β = 0.2, permutation for null
[Figure: "Null distribution and permutation, β = 0.2": null PDF vs. null PDF from permutation]
Memory β = 0.4, permutation for null
[Figure: "Null distribution and permutation, β = 0.4": null PDF vs. null PDF from permutation]
Memory β = 0.5, permutation for null
[Figure: "Null distribution and permutation, β = 0.5": null PDF vs. null PDF from permutation]
Wild bootstrap estimate of the asymptotic distribution
Define a new time series $W_t^*$ with the property
$$\mathrm{cov}(W_s^*, W_t^*) = \rho\left( |s - t| / \ell_n \right),$$
where $\ell_n$ is a width parameter growing with n, and ρ is a window, e.g.
$$\mathrm{cov}(W_s^*, W_t^*) = \exp\left( -|s - t| / \ell_n \right).$$
$X_t$ and $Y_t$ $\tau$-dependent with $\sum_{r=1}^{\infty} r^2 \sqrt{\tau(r)} < \infty$.
Wild bootstrap estimate of the null:
$$V_n^* := \frac{1}{n} \sum_{s,t=1}^{n} K\big( (X_s, Y_s), (X_t, Y_t) \big)\, W_s^* W_t^*$$
As measured via the Prokhorov metric $d_p$,
$$d_p\left( D_{\mathrm{MMD}},\ \frac{1}{n} \sum_{s,t=1}^{n} K\big( (X_s, Y_s), (X_t, Y_t) \big)\, W_s^* W_t^* \right) \xrightarrow{p} 0 \qquad \text{as } n \to \infty.$$
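A hedged sketch of the wild bootstrap procedure, reusing the gaussian_kernel helper from the earlier MMD sketch. The AR(1) construction of $W_t^*$ below is one standard way to obtain weights with the exponential covariance window on the slide; centring the weights, the bandwidth ell, and the other parameters are illustrative choices, not prescriptions from the slides.

import numpy as np

def wild_bootstrap_process(n, ell, rng):
    """AR(1) process W*_t with cov(W*_s, W*_t) = exp(-|s - t| / ell)."""
    a = np.exp(-1.0 / ell)
    w = np.empty(n)
    w[0] = rng.normal()
    for t in range(1, n):
        w[t] = a * w[t - 1] + np.sqrt(1 - a**2) * rng.normal()
    return w

def mmd_wild_bootstrap_test(x, y, sigma=1.0, ell=20, n_boot=500,
                            alpha=0.05, rng=None):
    """Wild bootstrap estimate of the null distribution of n * MMD^2."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    h = kxx + kyy - kxy - kxy.T           # K((x_i, y_i), (x_j, y_j))
    stat = h.mean() * n                   # n * MMD^2 (V-statistic)
    null_stats = np.empty(n_boot)
    for b in range(n_boot):
        w = wild_bootstrap_process(n, ell, rng)
        w = w - w.mean()                  # centre the bootstrap weights
        null_stats[b] = (w @ h @ w) / n   # (1/n) sum_{s,t} K_st W*_s W*_t
    threshold = np.quantile(null_stats, 1 - alpha)
    return stat, threshold, stat > threshold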
How the proof works (1)
Again define a finite approximation,
$$V_n^{(m)*} = \frac{1}{n} \sum_{s,t=1}^{n} K^{(m)}(Z_s, Z_t)\, W_s^* W_t^* = \sum_{k=1}^{m} \lambda_k \left( n^{-1/2} \sum_{t=1}^{n} \psi_k(Z_t)\, W_t^* \right)^{2}$$
which can be shown to converge as $m \to \infty$. Define
$$U_t^* := \left[ \psi_1(Z_t)\ \ldots\ \psi_m(Z_t) \right] W_t^*$$
We need that in probability (as $n \to \infty$),
$$\frac{1}{\sqrt n} \sum_{t=1}^{n} U_t^* \xrightarrow{d} \mathcal N(0, \Sigma_m)$$
How the proof works (2)
$$\mathrm{cov}\left( n^{-1/2} \sum_{s=1}^{n} \psi_j(Z_s) W_s^*,\ n^{-1/2} \sum_{t=1}^{n} \psi_k(Z_t) W_t^* \right) = \frac{1}{n} \sum_{s,t=1}^{n} \psi_j(Z_s)\, \psi_k(Z_t)\, \rho(|s - t|/\ell_n)$$
$$= \underbrace{\frac{1}{n} \sum_{s,t=1}^{n} \big( \psi_j(Z_s)\psi_k(Z_t) - \mathbf{E}\left[ \psi_j(Z_s)\psi_k(Z_t) \right] \big)\, \rho(|s - t|/\ell_n)}_{\text{converges to } 0} + \underbrace{\sum_{r=-\infty}^{\infty} \mathbf{E}\big( \psi_j(Z_0)\psi_k(Z_r) \big)\, \rho(|r|/\ell_n)\, \max\{1 - |r|/n,\ 0\}}_{\text{converges to } (\Sigma_m)_{j,k}}$$
Memory β = 0.0, wild bootstrap for null
[Figure: null PDF vs. wild bootstrap estimate of the null, β = 0.0]
Memory β = 0.2, wild bootstrap for null
[Figure: null PDF vs. wild bootstrap estimate of the null, β = 0.2]
Memory β = 0.4, wild bootstrap for null
[Figure: null PDF vs. wild bootstrap estimate of the null, β = 0.4]
Memory β = 0.5, wild bootstrap for null
[Figure: null PDF vs. wild bootstrap estimate of the null, β = 0.5]
MCMC M.D. Experiment
Does P have the same marginal distribution as Q?
[Figure: MCMC sample traces over 500 iterations]
Test (MMD)        Type I error
Permutation       68%
Wild bootstrap     6%
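A sketch of how such a test might be used as an MCMC convergence diagnostic, reusing the ar1 and mmd_wild_bootstrap_test sketches above as stand-ins (the chains, burn-in, and parameters are illustrative assumptions, not the experiment on the slide):

import numpy as np

# Two chains that should target the same distribution; here AR(1)
# processes stand in for real MCMC traces.
chain1 = ar1(0.97, 2000, rng=np.random.default_rng(1))
chain2 = ar1(0.97, 2000, rng=np.random.default_rng(2))

# Discard burn-in and compare the marginal distributions of the two chains.
x = chain1[500:].reshape(-1, 1)
y = chain2[500:].reshape(-1, 1)
stat, threshold, reject = mmd_wild_bootstrap_test(x, y, sigma=1.0, ell=50)
print("reject H0 (marginals differ):", reject)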
Testing Independence and the Hilbert-Schmidt Independence Criterion
MMD for independence
• Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]
$$\mathrm{HSIC}^2(P_{XY}, P_X P_Y) := \| \mu_{P_{XY}} - \mu_{P_X P_Y} \|^2$$
[Illustration: the kernel on pairs is a product of kernels on each space, $\kappa\big( (x, y), (x', y') \big) = k(x, x') \times l(y, y')$]
MMD for independence
HSIC using expectations of kernels: define RKHS $\mathcal F$ on $\mathcal X$ with kernel k, RKHS $\mathcal G$ on $\mathcal Y$ with kernel l. Then
$$\mathrm{HSIC}^2(P_{XY}, P_X P_Y) = \left\| \mathbf{E}_{XY}\big[ (\varphi_X - \mu_{P_X}) \otimes (\psi_Y - \mu_{P_Y}) \big] \right\|_{\mathcal F \times \mathcal G}^2$$
$$= \mathbf{E}_{XY} \mathbf{E}_{X'Y'}\big[ k(x, x')\, l(y, y') \big] + \mathbf{E}_X \mathbf{E}_{X'}\big[ k(x, x') \big]\, \mathbf{E}_Y \mathbf{E}_{Y'}\big[ l(y, y') \big] - 2\, \mathbf{E}_{X'Y'}\big[ \mathbf{E}_X k(x, x')\, \mathbf{E}_Y l(y, y') \big]$$
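A minimal sketch of a common biased empirical HSIC estimate in its trace form, reusing the gaussian_kernel helper from earlier (kernels, bandwidths, and data are illustrative). For random processes, the test threshold would again come from a wild bootstrap of the corresponding V-statistic rather than from permutation.

import numpy as np

def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC^2: (1/n^2) trace(K H L H), H the centring matrix."""
    n = len(x)
    k = gaussian_kernel(x, x, sigma_x)    # kernel k on X
    l = gaussian_kernel(y, y, sigma_y)    # kernel l on Y
    h = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    return np.trace(k @ h @ l @ h) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y = 0.5 * x + 0.5 * rng.normal(size=(300, 1))   # dependent pair
print(hsic_biased(x, y))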
HSIC: empirical estimate and intuition
!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',& -"#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*& 2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0& #;#%9$)#1&292'-& 2.(+'-&(0#%9$)#&2',&.#'-2/&)8.+/28(':&
@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',& #;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#& 92'-& 2.(+'-&(0#%9$)#&2',&.#'-2/&)8.+/28(':&
@'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',& #;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#&