Likelihood Ratio Tests and Singularities

Mathias Drton
Department of Statistics, University of Chicago

What’s this talk about?

• Algebraic statistics = application of algebraic geometry in statistics

• More focused view: Algebraic statistics = study of algebraic statistical models

    Properties of statistical model  ←→  (algebraic) geometry of parameter space

• Outline of this talk:
  1. Algebraic statistical models
  2. Geometry and likelihood ratio tests (Chernoff’s theorem)
  3. Likelihood ratio tests in factor analysis with one factor
  4. Next steps: multi-factor analysis

Mathias Drton

Page 2

1. Statistical models

• Statistical view of data: (repeated) observation/measurement of some random vector X ∈ R^m

• Goal: infer (aspects of) the unknown probability distribution of X

• Statistical model: a family P of probability distributions on R^m

• Parametric statistics (as contrasted with non-parametric statistics): finite-dimensional model, i.e., P = (Pθ | θ ∈ Θ) for Θ ⊆ R^k

• Example (repeatedly weighing an object with a scale):

    X1, …, Xn i.i.d. ~ N(µ, σ²),   µ ∈ R, σ² > 0.


Algebraic statistical models

• Suppose PΘ = (Pθ | θ ∈ Θ) is a probabilistically “well-behaved” model with parameter space Θ ⊆ R^k that is an open set.

• Definition: A submodel PΘ0 = (Pθ | θ ∈ Θ0) of PΘ is an algebraic statistical model if there exists a semi-algebraic set A ⊆ R^k such that Θ0 = A ∩ Θ.

• Example 1 (discrete data −→ Pachter & Sturmfels, 2005):
  – X takes only finitely many values [d] = {1, …, d}   (“A/C/G/T”, …)
  – Probability vectors (pi | i ∈ [d]) in the interior of the probability simplex ∆_{d−1} ⊆ R^d

• Example 2 (normal distribution):
  – X is a vector of real-valued observations   (gene expression, …)
  – Multivariate normal distributions Nm(µ, Σ) on R^m with mean vector µ ∈ R^m and positive definite covariance matrix Σ ∈ R^{m×m}


Univariate normal distribution

The normal distribution N(µ, σ²) has (Lebesgue) probability density function

    p_{µ,σ²}(x) = 1/√(2πσ²) · exp( −(x − µ)² / (2σ²) ),   x ∈ R.

If X ~ N(µ, σ²), then E[X] = µ ∈ R and Var[X] = σ² > 0.

[Figure: density curve of N(µ, σ²) with µ = 2 and σ = 1]
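As a quick numerical sanity check, the density formula above can be coded directly. This is a sketch in NumPy (the grid limits, step size, and tolerances are ad hoc choices, not from the talk):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Density p_{mu,sigma^2}(x) of N(mu, sigma^2), as displayed above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Riemann-sum checks for mu = 2, sigma = 1 (the plotted curve):
x = np.linspace(-8.0, 12.0, 400001)   # covers +-10 standard deviations
dx = x[1] - x[0]
p = normal_pdf(x, mu=2.0, sigma2=1.0)

total = p.sum() * dx                       # should be ~1
mean = (x * p).sum() * dx                  # should be ~mu = 2
var = ((x - mean) ** 2 * p).sum() * dx     # should be ~sigma^2 = 1
```

The three sums recover the normalization, mean, and variance stated on the slide up to discretization error.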

Multivariate normal distribution

The multivariate normal distribution Nm(µ, Σ) has pdf on R^m:

    p_{µ,Σ}(x) = 1/√((2π)^m det(Σ)) · exp( −(1/2)(x − µ)^t Σ^{−1} (x − µ) ).

If X ~ Nm(µ, Σ), then E[X] = µ = (µ1, …, µm)^t ∈ R^m and

    Var[X] = Σ = [ σ11 σ12 … σ1m ; σ12 σ22 … σ2m ; … ; σ1m σ2m … σmm ] ∈ R^{m×m} positive definite.

Linear transformations (change of basis): for A of full rank,

    AX + b ~ Nm(Aµ + b, AΣA^t).
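The change-of-basis rule can be checked empirically. Below is a NumPy sketch (the particular A, b, sample size, and seed are arbitrary choices of mine): sampling X ~ N2(µ, Σ) via a Cholesky factor and comparing the empirical moments of AX + b with Aµ + b and AΣA^t.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)

n = 200_000
X = mu + rng.standard_normal((n, 2)) @ L.T   # rows distributed as N_2(mu, Sigma)

A = np.array([[1.0, 1.0], [0.0, 2.0]])       # full rank
b = np.array([-1.0, 3.0])
Y = X @ A.T + b                              # rows should be N_2(A mu + b, A Sigma A^t)

emp_mean = Y.mean(axis=0)                    # close to A @ mu + b
emp_cov = np.cov(Y.T)                        # close to A @ Sigma @ A.T
```

Here A @ mu + b = (2, 5)^t and A Σ A^t = [ 3 3 ; 3 4 ], and the empirical moments match these up to Monte Carlo error.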

A bivariate density

Plot of the pdf p_{µ,Σ} of the bivariate normal distribution N2(µ, Σ) with

    µ = (2, 1)^t,   Σ = [ 1 0.5 ; 0.5 1 ].

[Figure: surface plot of the bivariate density over the (x, y)-plane]

Gaussian independence

• If X ~ Nm(µ, Σ) and A ⊆ [m], then XA = (Xi | i ∈ A) ~ NA(µA, Σ_{A×A}).

• Independence: XA ⊥⊥ XB iff σij = 0 for all i ∈ A, j ∈ B.

• Conditional independence: XA ⊥⊥ XB | XC iff rank(Σ_{AC×BC}) ≤ |C|, where Σ is partitioned into the blocks Σ_{A×A}, Σ_{A×B}, Σ_{A×C}, Σ_{B×A}, Σ_{B×B}, Σ_{B×C}, Σ_{C×A}, Σ_{C×B}, Σ_{C×C}.

• Example:  X1 ⊥⊥ X2 | X3  ⟺  det [ σ12 σ13 ; σ23 σ33 ] = 0

• Gaussian conditional independence models are algebraic statistical models!
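The determinant criterion is easy to verify on a small example. In the sketch below (my own illustration, not from the talk) X1 = Y + e1, X2 = Y + e2, X3 = Y with independent standard normal Y, e1, e2, so X1 ⊥⊥ X2 | X3 holds; the slide's 2×2 determinant vanishes, and as a cross-check so does the (1,2) entry of the inverse covariance, the standard Gaussian conditional-independence criterion.

```python
import numpy as np

# Covariance of (X1, X2, X3) = (Y + e1, Y + e2, Y):
Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])

# Algebraic criterion from the slide: X1 _||_ X2 | X3 iff this determinant is 0.
crit = Sigma[0, 1] * Sigma[2, 2] - Sigma[0, 2] * Sigma[1, 2]

# Cross-check: for Gaussians, X1 _||_ X2 | X3 iff (Sigma^{-1})_{12} = 0.
K = np.linalg.inv(Sigma)
```

Both quantities vanish for this Σ, while perturbing σ12 would make both non-zero.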

2. Likelihood ratio test

• Nested statistical models (Pθ | θ ∈ Θ0) ⊆ (Pθ | θ ∈ Θ)

• Using i.i.d. observations X^(1), …, X^(n) ∈ R^m, test

    H0: θ ∈ Θ0   vs.   H1: θ ∈ Θ \ Θ0

• If pθ(x) is the probability density function of Pθ, then the log-likelihood function is

    ℓn: Θ → R,   θ ↦ Σ_{i=1}^n log pθ(X^(i))

• Likelihood ratio statistic:

    λn = 2 ( sup_{θ∈Θ} ℓn(θ) − sup_{θ∈Θ0} ℓn(θ) )

Gaussian models with known covariance matrix Σ

• Suppose Pθ = Nm(µ, Σ) with Σ = Im known.

• Let X̄ = (1/n) Σ_{i=1}^n X^(i) be the sample mean vector.

• Then the log-likelihood function is

    ℓn(µ) = const. − (1/2) Σ_{i=1}^n (X^(i) − X̄ + X̄ − µ)^t (X^(i) − X̄ + X̄ − µ)
          = const. − (n/2) ||X̄ − µ||².

• Thus if Θ = R^m and Θ0 ⊆ R^m, then the LR statistic

    λn = 2 ( sup_{µ∈R^m} ℓn(µ) − sup_{µ∈Θ0} ℓn(µ) ) = n · inf_{µ∈Θ0} ||X̄ − µ||²

  is n times the squared Euclidean distance between X̄ and the set Θ0.

Toy example 1: Line (Σ = I2)

• Suppose PΘ = (N2(µ, I2) | µ ∈ R²) and the submodel is given by Θ0 = {µ | µ2 = µ1}.

• Suppose n = 3. Red point: x̄ = (−0.4, 0.8). Closest point on the line: (0.2, 0.2). LR statistic: n · (0.6² + 0.6²) = 2.16.

• If µ1 = µ2, could a “distance” of 2.16 or larger arise just due to chance? Or should we conclude that µ1 ≠ µ2?
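The numbers on this slide can be reproduced in a few lines. A sketch (the projection onto the line is standard linear algebra, nothing model-specific):

```python
import numpy as np

# Theta_0 is the line mu2 = mu1, spanned by the unit vector u = (1, 1)/sqrt(2).
n = 3
xbar = np.array([-0.4, 0.8])
u = np.array([1.0, 1.0]) / np.sqrt(2.0)

proj = (xbar @ u) * u                  # orthogonal projection of xbar onto the line
lam = n * np.sum((xbar - proj) ** 2)   # LR statistic n * ||xbar - proj||^2
print(proj, lam)                       # -> [0.2 0.2] and 2.16 (up to rounding)
```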

Gaussian models with known Σ (cont.)

• Suppose that the data were drawn from N2(µ0, I2) with µ01 = µ02.

• Rewrite the LR statistic as

    λn = n · inf_{µ∈Θ0} ||X̄ − µ||² = inf_{µ∈Θ0} || √n(X̄ − µ0) − √n(µ − µ0) ||²,

  where √n(X̄ − µ0) ~ N(0, I2).

• In this example, if µ is on the line Θ0 then so is √n(µ − µ0). Hence,

    λn = inf_{µ∈Θ0} ||Z − µ||²   for Z ~ N(0, I2).

• This distribution doesn’t depend on the unknown µ0!

• If µ0 is on the line then Prob(λn > 2.16) ≈ 0.14.

• Conclusion: the data show no strong evidence against “the line”.
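A Monte Carlo check of the ≈ 0.14 tail probability (sample size and seed are my choices): the squared distance from Z ~ N(0, I2) to the line µ2 = µ1 is (Z1 − Z2)²/2, which is χ²₁-distributed, so the exact tail is also available in closed form via the complementary error function.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Squared distance from Z ~ N(0, I2) to the line mu2 = mu1 is (Z1 - Z2)^2 / 2.
Z = rng.standard_normal((500_000, 2))
dist2 = (Z[:, 0] - Z[:, 1]) ** 2 / 2.0
p_mc = (dist2 > 2.16).mean()

# The distance is chi^2_1, so P(lambda_n > t) = erfc(sqrt(t / 2)) exactly.
p_exact = math.erfc(math.sqrt(2.16 / 2.0))
```

Both values come out near 0.14, matching the slide.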

Toy example 2: “Folium of Descartes” (Σ = I2)

• Suppose Θ0 = {µ | µ2² = µ1³ + µ1²}.   (Parametrization: f(t) = [t² − 1, t(t² − 1)])

• The LR statistic

    λn = inf_{µ∈Θ0} || √n(X̄ − µ0) − √n(µ − µ0) ||²

  is the squared distance between Z = √n(X̄ − µ0) ~ N(0, I2) and the set √n(Θ0 − µ0).

• As n → ∞, the sets √n(Θ0 − µ0) converge to the tangent cone of Θ0 at µ0.

Toy example 2: “Folium of Descartes” (Σ = I2), cont.

• Suppose Θ0 = {µ | µ2² = µ1³ + µ1²}.   (Parametrization: f(t) = [t² − 1, t(t² − 1)])

• If µ0 ≠ 0, then λn →d the squared distance between a N(0, I2) random vector and a line.

  Lemma: The squared distance between Z ~ N(0, Im) and an (m − k)-dimensional linear space has the distribution of Z1² + ⋯ + Zk². This so-called χ²k distribution doesn’t depend on the unknown µ0!

• If µ0 = 0, then λn →d the minimum of two independent χ²1 random variables.
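A simulation sketch of the µ0 = 0 limit (sample size, seed, and the checked threshold t = 1 are my choices): at the node the tangent cone is the pair of lines µ2 = ±µ1, the squared distances to the two lines are (Z1 ∓ Z2)²/2, and since (Z1 − Z2)/√2 and (Z1 + Z2)/√2 are independent standard normals, the limit is the minimum of two independent χ²₁ variables.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Squared distance from Z ~ N(0, I2) to the cone {mu2 = +-mu1}:
Z = rng.standard_normal((500_000, 2))
d2 = np.minimum((Z[:, 0] - Z[:, 1]) ** 2, (Z[:, 0] + Z[:, 1]) ** 2) / 2.0

# Tail check at t = 1: P(min of two indep. chi^2_1 > t) = erfc(sqrt(t/2))^2.
p_mc = (d2 > 1.0).mean()
p_exact = math.erfc(math.sqrt(0.5)) ** 2
```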

Toy example 3: Neil’s parabola (Σ = I2)

• Suppose Θ0 = {µ | µ2² = µ1³}.   (Parametrization: f(θ) = (θ², θ³))

• If µ0 = 0, then λn →d W, where W follows the ½-½ mixture of the χ²1 and χ²2 distributions, i.e.,

    Prob(W ≥ t) = ½ Prob(χ²1 ≥ t) + ½ Prob(χ²2 ≥ t).
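The mixture can be seen geometrically and simulated (a sketch; sample size, seed, and threshold are my choices): at the cusp the tangent cone is the half-line {µ1 ≥ 0, µ2 = 0}, so the squared distance from Z ~ N(0, I2) is Z2² when Z1 ≥ 0 (projection onto the half-line) and Z1² + Z2² when Z1 < 0 (nearest point is the origin), each case occurring with probability ½.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

Z = rng.standard_normal((500_000, 2))
d2 = np.where(Z[:, 0] >= 0, Z[:, 1] ** 2, Z[:, 0] ** 2 + Z[:, 1] ** 2)

# Mixture tail at t = 1: 0.5 * P(chi^2_1 > 1) + 0.5 * P(chi^2_2 > 1),
# with P(chi^2_1 > t) = erfc(sqrt(t/2)) and P(chi^2_2 > t) = exp(-t/2).
p_mc = (d2 > 1.0).mean()
p_exact = 0.5 * math.erfc(math.sqrt(0.5)) + 0.5 * math.exp(-0.5)
```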

Chernoff’s theorem (for exponential families)

• Tangent cone:

    TΘ(θ) = { lim_{n→∞} αn(θn − θ) | αn > 0, θn ∈ Θ, θn → θ }

• Θ ⊆ R^k is Chernoff-regular at θ if for all τ ∈ TΘ(θ) there exists a curve α: [0, ε) → Θ with α(0) = θ such that τ = lim_{t→0+} (α(t) − α(0))/t.

• Chernoff’s Theorem: Let PΘ be a regular exponential family. Let θ0 ∈ Θ0 ⊆ Θ ⊆ R^k be the true parameter point with Fisher information I(θ0). If Θ0 is Chernoff-regular at θ0, then the asymptotic distribution of the likelihood ratio statistic λn as n tends to infinity is the distribution of the squared Mahalanobis distance

    min_{τ ∈ TΘ0(θ0)} (Z̄ − τ)^t I(θ0) (Z̄ − τ),   Z̄ ~ Nk(0, I(θ0)^{−1}).

Chernoff’s theorem (cont.)

• Fisher information = symmetric matrix of expectations

    I(θ) = ( E[ (∂ log pθ(X)/∂θi) · (∂ log pθ(X)/∂θj) ] )_{ij}

• If I(θ0) = I(θ0)^{t/2} I(θ0)^{1/2}, then the squared Mahalanobis distance has the same distribution as the squared Euclidean distance

    min_{τ ∈ TΘ0(θ0)} ||Z − I(θ0)^{1/2} τ||²

  between Z ~ Nk(0, Ik) and the linearly transformed cone I(θ0)^{1/2} TΘ0(θ0).

• Lemma: A semi-algebraic set Θ ⊆ R^k is everywhere Chernoff-regular.

3. Factor analysis (one factor)

• Observe scores X = (X1, …, X4) of a student in m = 4 exams:

    X1: algebra   X2: analysis   X3: statistics   X4: physics

• A hidden variable Y (“general math ability”) may explain dependences:

    X1 ⊥⊥ X2 ⊥⊥ X3 ⊥⊥ X4 | Y

• One-factor analysis model: the family of Nm(µ, Σ) with µ ∈ R^m and Σ in the semi-algebraic set

    Fm = { ∆ + ΛΛ^t | ∆ > 0 diagonal, Λ ∈ R^m },   where ΛΛ^t has rank ≤ 1.

• The model is algebraic and of dimension dim(Fm) = min( 2m, (m+1 choose 2) ).

Tetrads

• Let Im ⊆ R[σij | 1 ≤ i ≤ j ≤ m] be the vanishing ideal of Fm.

• Decomposition: Σ = (σij) ∈ Fm  ⟹  Σ = ∆ + ΛΛ^t = diagonal + (rank ≤ 1)  ⟹  the off-diagonal 2×2 minors, the tetrads, are in Im.

• Theorem (De Loera/Sturmfels/Thomas, 1995): If m ≤ 3, then Im = {0}. If m ≥ 4, then the set of 2·(m choose 4) tetrads

    Tm = { σij σkℓ − σik σjℓ,  σiℓ σjk − σik σjℓ | 1 ≤ i < j < k < ℓ ≤ m }

  is the reduced Gröbner basis of Im with respect to a certain monomial order.

• Example:  I4 = ⟨σ12σ34 − σ13σ24, σ14σ23 − σ13σ24⟩

Variety versus model

• Inclusions:  Fm ⊂ Vpd(Im) ⊂ VR(Im) ⊂ V(Im)

• No equalities: e.g., for F3 the ideal is I3 = {0}, but

    (1/2) [ 3 1 1 ; 1 3 −1 ; 1 −1 3 ]  ∉  F3 = { [ σ11 + λ1², λ1λ2, λ1λ3 ; λ1λ2, σ22 + λ2², λ2λ3 ; λ1λ3, λ2λ3, σ33 + λ3² ] },

  since every Σ ∈ F3 has σ12 σ13 σ23 = λ1² λ2² λ3² ≥ 0, whereas here the product of the off-diagonal entries is negative.

Singularities and algebraic tangent cones

Theorem:
(i) A matrix Σ ∈ V(Im) is a singularity of the one-factor model if and only if Σ has at most one non-zero off-diagonal entry.
(ii) If Σ is diagonal, then the algebraic tangent cone is equal to V(Im).
(iii) If Σ has exactly one non-zero off-diagonal entry, say σ12 ≠ 0, then the algebraic tangent cone is the set of symmetric matrices Ψ = (ψij) with ψij = 0 for 3 ≤ i < j ≤ m and Ψ_{12×{3,…,m}} of rank 1:

    Ψ = [ ψ11 ψ12 ψ13 … ψ1m
          ψ12 ψ22 ψ23 … ψ2m
                  ψ33
                       ⋱
                          ψmm ]

(entries not shown are determined by symmetry or zero).

Tangent cones

Theorem:
(i) If Σ is diagonal, then the tangent cone is equal to the topological closure of

    { ∆ + ΓΓ^t | ∆ ∈ R^{m×m} diagonal, Γ ∈ R^{m×1} }.

(ii) If Σ has exactly one non-zero off-diagonal entry and this entry is positive, say σ12 > 0, then the tangent cone is the set of symmetric matrices

    Ψ = (ψij) = [ ψ11 ψ12 ψ13  … ψ1m
                  ψ12 ψ22 cψ13 … cψ1m
                          ψ33
                               ⋱
                                  ψmm ]

  for a multiplier c ∈ [σ12/σ11, σ22/σ12]. If σ12 < 0, then the same holds with a negative multiplier.

A nice connection to eigenvalues

• Tangent cone ⊆ algebraic tangent cone ⟹ the limiting distribution of the LR statistic λn is bounded below by the distribution of

    DA := min_{Ψ ∈ AFm(Σ)} (Z̄ − Ψ)^t I(Σ) (Z̄ − Ψ).

• At a singularity with σ12 ≠ 0, the algebraic tangent cone is a Cartesian product of a linear space, {0} ⊆ R^{(m−2 choose 2)}, and a set of rank-1 matrices, all inside R^{(m+1 choose 2)}.

• The Fisher information I(Σ) has entries σik σjℓ + σiℓ σjk and is block-diagonal here.

• Theorem: Let V ~ χ² with (m−2 choose 2) degrees of freedom, and let W be distributed like the smaller of the two eigenvalues of a 2×2 Wishart matrix with m − 2 degrees of freedom and scale parameter the identity matrix I2. If the true parameter point Σ is a singularity with σij ≠ 0, i < j, then DA ~ V + W.
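The distribution V + W is straightforward to simulate. A sketch for m = 5 (the choice of m, the number of replicates, and the seed are mine; the assertions below are only loose moment checks, since the mean of the smaller Wishart eigenvalue has no simple closed form):

```python
import numpy as np

rng = np.random.default_rng(5)

m = 5                                  # e.g. five observed variables
df = (m - 2) * (m - 3) // 2            # binom(m-2, 2) degrees of freedom, here 3
reps = 20_000

V = rng.chisquare(df, size=reps)       # V ~ chi^2_{binom(m-2,2)}

# W = smaller eigenvalue of a 2x2 Wishart(m-2, I_2) matrix G^t G.
G = rng.standard_normal((reps, m - 2, 2))
Wmat = np.einsum('rki,rkj->rij', G, G)
W = np.linalg.eigvalsh(Wmat)[:, 0]     # eigvalsh returns eigenvalues in ascending order

D = V + W                              # simulated draws from the law of D_A
```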

4. Multi-factor analysis

• Number of observed variables: m ∈ N;  number of hidden factors: ℓ ∈ N  (ℓ ≪ m)

• Factor analysis model {Nm(µ, Σ) | µ ∈ R^m, Σ ∈ Fm,ℓ} with covariance matrix parameter space

    Fm,ℓ = { ∆ + ΛΛ^t : ∆ > 0 diagonal, Λ ∈ R^{m×ℓ} }

• The model is algebraic and of dimension

    dim(Fm,ℓ) = min{ m(ℓ + 1) − (ℓ choose 2), (m+1 choose 2) }
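The dimension formula can be coded directly; for ℓ = 1 it reduces to the one-factor formula min(2m, (m+1 choose 2)) from earlier, since (1 choose 2) = 0. A small consistency sketch:

```python
from math import comb

def dim_F(m, ell):
    """Dimension of F_{m,ell} per the slide's formula."""
    return min(m * (ell + 1) - comb(ell, 2), comb(m + 1, 2))

# Consistency with the one-factor slide:
one_factor_ok = all(dim_F(m, 1) == min(2 * m, comb(m + 1, 2)) for m in range(1, 12))

dims_two_factor = [dim_F(m, 2) for m in range(4, 8)]
print(one_factor_ok, dims_two_factor)  # -> True [10, 14, 17, 20]
```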

Eliminating diagonal entries

• Vanishing ideal:

    Im,ℓ = { f ∈ R[σij | 1 ≤ i ≤ j ≤ m] | f(Σ) = 0 for all Σ ∈ Fm,ℓ }

• Theorem: Let Mm,ℓ ⊆ R[σij | 1 ≤ i ≤ j ≤ m] be the ideal generated by the (ℓ + 1) × (ℓ + 1) minors of Σ = (σij). Then Im,ℓ = Mm,ℓ ∩ R[σij | i < j].

Betti numbers of minimal generators

         ℓ = 1 |          ℓ = 2           |          ℓ = 3
   m     deg 2 |  deg 5   deg 3   deg 8   |  deg 7    deg 7   deg 4
        tetrad |  pentad  minor   ideal-th.| ideal-th. septad  minor
   4       2   |    –       –       –     |    –        –       –
   5      10   |    1       0       –     |    –        –       –
   6      30   |    6       5       –     |    –        –       –
   7      70   |   21      35      21     |    0       15       0
   8     140   |   56     140     168     |  140      120      14
   9     252   |  126     420     756     | 1386      540     126

Singularities for two factors: m = 5

• Pentad hypersurface (m = 5, ℓ = 2); singular locus of codimension 15 − 11 = 4.

• Two symmetry classes of singularities:

  1. One off-diagonal row zero (5 possible rows):

      [ σ11  0    0    0    0
        0    σ22  σ23  σ24  σ25
        0    σ23  σ33  σ34  σ35
        0    σ24  σ34  σ44  σ45
        0    σ25  σ35  σ45  σ55 ]

  2. All tetrads not involving one given off-diagonal entry are zero ((5 choose 2) = 10 off-diagonal entries).

Conclusion

• Many open problems, e.g. in factor analysis:
  – generators of Im,2
  – singular loci & tangent cones for Fm,2

• Many other statistical models could be considered

• Associated statistical/probabilistic problems, e.g. the study of random distances from tangent cones (useful bounds?)

REFERENCES

1. Likelihood Ratio Tests and Singularities. (In the works.)
2. (with S. Sullivant) Algebraic Statistical Models. Review paper for Statistica Sinica.
3. (with B. Sturmfels & S. Sullivant) Algebraic Factor Analysis: Tetrads, Pentads and Beyond. Probab. Theory and Related Fields, to appear in 2007.