Likelihood Ratio Tests and Singularities
Mathias Drton, Department of Statistics, University of Chicago
What’s this talk about?
• Algebraic statistics = application of algebraic geometry in statistics
• More focused view: algebraic statistics = study of algebraic statistical models
  Properties of statistical model ←→ (algebraic) geometry of parameter space
• Outline of this talk:
  1. Algebraic statistical models
  2. Geometry and likelihood ratio tests (Chernoff’s theorem)
  3. Likelihood ratio tests in factor analysis with one factor
  4. Next steps: multi-factor analysis
Mathias Drton
Page 2
1. Statistical models
• Statistical view of data: (repeated) observation/measurement of some random vector X ∈ R^m
• Goal: infer (aspects of) the unknown probability distribution of X
• Statistical model: a family P of probability distributions on R^m
• Parametric statistics (as contrasted with non-parametric statistics): finite-dimensional model, i.e., P = (Pθ | θ ∈ Θ) for Θ ⊆ R^k
• Example (repeatedly weighing an object with a scale):
  X1, ..., Xn ∼ iid N(µ, σ²),   µ ∈ R, σ² > 0.
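The weighing example can be sketched in a few lines of numpy: simulate n repeated measurements from N(µ, σ²) and recover the maximum likelihood estimates. The true weight and noise level below are made-up illustration values, not from the talk.

```python
import numpy as np

# Hypothetical weighing experiment: n measurements X_1, ..., X_n ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
mu_true, sigma_true, n = 10.0, 0.5, 10_000

x = rng.normal(mu_true, sigma_true, size=n)

# Maximum likelihood estimates in the N(mu, sigma^2) model:
mu_hat = x.mean()        # sample mean estimates mu
sigma2_hat = x.var()     # (1/n) * sum (x_i - x_bar)^2 estimates sigma^2
```

With n this large both estimates land close to the truth, illustrating the inference goal stated above.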
Algebraic statistical models
• Suppose PΘ = (Pθ | θ ∈ Θ) is a probabilistically “well-behaved” model with parameter space Θ ⊆ R^k that is an open set.
• Definition: A submodel PΘ0 = (Pθ | θ ∈ Θ0) of PΘ is an algebraic statistical model if there exists a semi-algebraic set A ⊆ R^k such that Θ0 = A ∩ Θ.
• Example 1 (discrete data −→ Pachter & Sturmfels, 2005):
  – X takes only finitely many values [d] = {1, ..., d}   (“A/C/G/T”, ...)
  – Probability vectors (pi | i ∈ [d]) in the interior of the probability simplex ∆_{d−1} ⊆ R^d
• Example 2 (normal distribution):
  – X is a vector of real-valued observations   (gene expression, ...)
  – Multivariate normal distributions Nm(µ, Σ) on R^m with mean vector µ ∈ R^m and covariance matrix Σ ∈ R^{m×m} positive definite
Univariate normal distribution
The normal distribution N(µ, σ²) has (Lebesgue) probability density function
  p_{µ,σ²}(x) = (1/√(2πσ²)) · exp( −(x − µ)² / (2σ²) ),   x ∈ R.
If X ∼ N(µ, σ²), then E[X] = µ ∈ R and Var[X] = σ² > 0.
[Figure: density curve of N(µ, σ²) with µ = 2 and σ = 1.]
Multivariate normal distribution
The multivariate normal distribution Nm(µ, Σ) has pdf on R^m:
  p_{µ,Σ}(x) = (1/√((2π)^m det(Σ))) · exp( −½ (x − µ)ᵗ Σ⁻¹ (x − µ) ).
If X ∼ Nm(µ, Σ), then E[X] = µ = (µ1, ..., µm)ᵗ ∈ R^m and

  Var[X] = Σ =  ( σ11  σ12  ...  σ1m )
                ( σ12  σ22  ...  σ2m )
                (  ⋮    ⋮    ⋱    ⋮  )
                ( σ1m  σ2m  ...  σmm )   ∈ R^{m×m} positive definite.

Linear transformations (change of basis): AX + b ∼ Nm(Aµ + b, AΣAᵗ) for A of full rank.
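The density formula transcribes directly into numpy; this is a minimal sketch, with a sanity check using the fact that at x = µ the exponent vanishes, so for Σ = I2 the density equals 1/(2π).

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N_m(mu, Sigma) evaluated at x, as in the formula above."""
    m = len(mu)
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^t Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

# Sanity check: density of N_2(0, I_2) at the mean is 1 / (2*pi).
val = mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
```

Using `np.linalg.solve` instead of forming Σ⁻¹ explicitly is the standard numerically stable choice.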
A bivariate density
Plot of the pdf p_{µ,Σ} of the bivariate normal distribution N2(µ, Σ) with
  µ = ( 2 )      Σ = ( 1    0.5 )
      ( 1 ),         ( 0.5  1   ).
[Figure: surface plot of the bivariate density over the (x, y)-plane.]
Gaussian independence
• If X ∼ Nm(µ, Σ) and A ⊆ [m], then XA = (Xi | i ∈ A) ∼ NA(µA, Σ_{A×A}).
• Independence: XA ⊥⊥ XB iff σij = 0 for all i ∈ A, j ∈ B.
• Conditional independence: XA ⊥⊥ XB | XC iff rank(Σ_{AC×BC}) ≤ |C|, with the partitioned covariance matrix

  Σ = ( Σ_{A×A}  Σ_{A×B}  Σ_{A×C} )
      ( Σ_{B×A}  Σ_{B×B}  Σ_{B×C} )
      ( Σ_{C×A}  Σ_{C×B}  Σ_{C×C} )

• Example:  X1 ⊥⊥ X2 | X3  ⇐⇒  det ( σ12  σ13 )
                                     ( σ23  σ33 )  = 0

• Gaussian conditional independence models are algebraic statistical models!
2. Likelihood ratio test
• Nested statistical models (Pθ | θ ∈ Θ0) ⊆ (Pθ | θ ∈ Θ)
• Using i.i.d. observations X^(1), ..., X^(n) ∈ R^m, test
  H0: θ ∈ Θ0   vs.   H1: θ ∈ Θ \ Θ0
• If pθ(x) is the probability density function of Pθ, then the log-likelihood function is
  ℓn : Θ → R,   θ ↦ Σ_{i=1}^n log pθ(X^(i))
• Likelihood ratio statistic:
  λn = 2 ( sup_{θ∈Θ} ℓn(θ) − sup_{θ∈Θ0} ℓn(θ) ).
Gaussian models with known covariance matrix Σ
• Suppose Pθ = Nm(µ, Σ) and Σ = Im is known.
• Let X̄ = (1/n) Σ_{i=1}^n X^(i) be the sample mean vector.
• Then the log-likelihood function is
  ℓn(µ) = const. − ½ Σ_{i=1}^n (X^(i) − X̄ + X̄ − µ)ᵗ (X^(i) − X̄ + X̄ − µ)
        = const. − (n/2) ||X̄ − µ||².
• Thus if Θ = R^m and Θ0 ⊆ R^m, then the LR statistic is
  λn = 2 ( sup_{µ∈R^m} ℓn(µ) − sup_{µ∈Θ0} ℓn(µ) ) = n · inf_{µ∈Θ0} ||X̄ − µ||²,
i.e., n times the squared Euclidean distance between X̄ and the set Θ0.
Toy example 1: Line (Σ = I2)
• Suppose PΘ = (N2(µ, I2) | µ ∈ R²).
• Submodel given by Θ0 = {µ | µ2 = µ1}; suppose n = 3.
  Red point: x̄ = (−0.4, 0.8); closest point on the line: (0.2, 0.2).
  LR statistic: n · (0.6² + 0.6²) = 2.16.
• If µ1 = µ2, could a “distance” of 2.16 or larger arise just due to chance? Or should we conclude that µ1 ≠ µ2?
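The computation on this slide is a one-line orthogonal projection; here is a direct numpy transcription of the toy example's numbers.

```python
import numpy as np

# Toy example 1: project x_bar onto the line Theta_0 = {mu : mu_2 = mu_1}
# and form lambda_n = n * squared distance.
n = 3
x_bar = np.array([-0.4, 0.8])

# Unit direction of the line mu_2 = mu_1:
u = np.array([1.0, 1.0]) / np.sqrt(2)
closest = (x_bar @ u) * u                       # projection -> (0.2, 0.2)

lambda_n = n * np.sum((x_bar - closest) ** 2)   # 3 * (0.6^2 + 0.6^2) = 2.16
```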
Gaussian models with known Σ (cont.)
• Suppose that the data were drawn from N2(µ0, I2) with µ01 = µ02.
• Rewrite the LR statistic as
  λn = n · inf_{µ∈Θ0} ||X̄ − µ||² = inf_{µ∈Θ0} || √n(X̄ − µ0) − √n(µ − µ0) ||²,
  where √n(X̄ − µ0) ∼ N(0, I2).
• In this example, if µ is on the line Θ0, then so is √n(µ − µ0). Hence,
  λn = inf_{µ∈Θ0} ||Z − µ||²   for Z ∼ N(0, I2).
• This distribution doesn’t depend on the unknown µ0!
• If µ0 is on the line, then Prob(λn > 2.16) ≈ 0.14.
• Conclusion: the data show no strong evidence against “the line”.
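The 0.14 on this slide is a χ²₁ tail probability: the squared distance from Z ∼ N(0, I2) to a line through the origin is chi-squared with 1 degree of freedom. It can be computed with the standard library alone, since P(χ²₁ ≥ t) = erfc(√(t/2)).

```python
import math

def chi2_1_tail(t):
    """P(chi^2_1 >= t), via the complementary error function."""
    return math.erfc(math.sqrt(t / 2))

# p-value for the observed LR statistic 2.16 from toy example 1:
p_value = chi2_1_tail(2.16)   # approximately 0.14, as on the slide
```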
Toy example 2: “Folium of Descartes” (Σ = I2)
• Suppose Θ0 = {µ | µ2² = µ1³ + µ1²};   (parametrization f(t) = [t² − 1, t(t² − 1)])
• The LR statistic
  λn = inf_{µ∈Θ0} || √n(X̄ − µ0) − √n(µ − µ0) ||²,   √n(X̄ − µ0) ∼ N(0, I2),
  is the squared distance between Z ∼ N(0, I2) and √n(Θ0 − µ0).
• As n → ∞: the sets √n(Θ0 − µ0) −→ the tangent cone of Θ0 at µ0.
Toy example 2: “Folium of Descartes” (Σ = I2), continued
• Suppose Θ0 = {µ | µ2² = µ1³ + µ1²};   (parametrization f(t) = [t² − 1, t(t² − 1)])
• If µ0 ≠ 0, then λn −→d “squared distance between an N(0, I2) random vector and a line”.
  Lemma: The squared distance between Z ∼ N(0, Im) and an (m − k)-dimensional linear space has the distribution of Z1² + · · · + Zk².
  This so-called χ²k-distribution doesn’t depend on the unknown µ0!
• If µ0 = 0, then λn −→d the minimum of two independent χ²1 random variables.
Toy example 3: Neil’s parabola (Σ = I2)
• Suppose Θ0 = {µ | µ2² = µ1³};   (parametrization f(θ) = (θ², θ³))
• If µ0 = 0, then λn −→d W, a ½-mixture of the χ²1- and χ²2-distributions, i.e.,
  Prob(W ≥ t) = ½ Prob(χ²1 ≥ t) + ½ Prob(χ²2 ≥ t).
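The mixture can be checked by simulation. At the cusp µ0 = 0, the tangent cone of Neil's parabola is the ray {µ2 = 0, µ1 ≥ 0}, so the squared distance from Z ∼ N(0, I2) to the cone is Z2² when Z1 ≥ 0 (a χ²₁ draw, probability ½) and Z1² + Z2² when Z1 < 0 (a χ²₂ draw, probability ½) — exactly the ½-mixture above. The Monte Carlo below compares the empirical tail with the closed-form mixture at a made-up threshold t = 2.

```python
import math
import numpy as np

# Squared distance from Z ~ N(0, I_2) to the ray {mu_2 = 0, mu_1 >= 0}:
rng = np.random.default_rng(1)
z = rng.standard_normal((200_000, 2))
dist2 = np.where(z[:, 0] >= 0,
                 z[:, 1] ** 2,                  # project onto the ray
                 z[:, 0] ** 2 + z[:, 1] ** 2)   # nearest cone point is the origin

# Mixture formula, using P(chi^2_1 >= t) = erfc(sqrt(t/2)), P(chi^2_2 >= t) = exp(-t/2):
t = 2.0
mixture_tail = 0.5 * math.erfc(math.sqrt(t / 2)) + 0.5 * math.exp(-t / 2)
empirical_tail = float((dist2 >= t).mean())
```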
Chernoff’s theorem (for exponential families)
• Tangent cone:
  TΘ(θ) = { lim_{n→∞} αn(θn − θ) | αn > 0, θn ∈ Θ, θn → θ }
• Θ ⊆ R^k is Chernoff-regular at θ if for all τ ∈ TΘ(θ) there exists a curve α : [0, ε) → Θ with α(0) = θ such that τ = lim_{t→0+} (α(t) − α(0))/t.
• Chernoff’s Theorem: Let PΘ be a regular exponential family, and let θ0 ∈ Θ0 ⊆ Θ ⊆ R^k be the true parameter point with Fisher information I(θ0). If Θ0 is Chernoff-regular at θ0, then the asymptotic distribution of the likelihood ratio statistic λn as n tends to infinity is the distribution of the squared Mahalanobis distance
  min_{τ ∈ TΘ0(θ0)} (Z̄ − τ)ᵗ I(θ0) (Z̄ − τ),   Z̄ ∼ Nk(0, I(θ0)⁻¹).
Chernoff’s theorem (cont.)
• Fisher information = symmetric matrix of expectations
  I(θ)ij = E[ (∂ log pθ(X)/∂θi) · (∂ log pθ(X)/∂θj) ]
• If I(θ0) = I(θ0)^{t/2} I(θ0)^{1/2}, then the squared Mahalanobis distance has the same distribution as the squared Euclidean distance
  min_{τ ∈ TΘ0(θ0)} ||Z − I(θ0)^{1/2} τ||²
  between Z ∼ Nk(0, Ik) and the linearly transformed cone I(θ0)^{1/2} TΘ0(θ0).
• Lemma: A semi-algebraic set Θ ⊆ R^k is everywhere Chernoff-regular.
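The reduction from Mahalanobis to Euclidean distance can be checked numerically for the simplest cone, a single ray {s·v : s ≥ 0}. The matrices and vectors below are made-up values, and the closed-form ray projection (clamping the projection coefficient at 0) is a standard fact used here for illustration, not a construction from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
I_fisher = A @ A.T + 3 * np.eye(3)        # a generic positive definite "Fisher information"
I_half = np.linalg.cholesky(I_fisher).T   # I_fisher = I_half^t @ I_half

z = rng.standard_normal(3)                # plays the role of Z-bar (whitened below)
v = rng.standard_normal(3)                # direction generating the cone {s*v : s >= 0}

# Mahalanobis projection onto the ray: minimize (z - s v)^t I (z - s v) over s >= 0.
s = max(0.0, (v @ I_fisher @ z) / (v @ I_fisher @ v))
d_mahal = (z - s * v) @ I_fisher @ (z - s * v)

# Euclidean projection of I^{1/2} z onto the transformed ray {s * I^{1/2} v}:
w, u = I_half @ z, I_half @ v
s2 = max(0.0, (w @ u) / (u @ u))
d_eucl = np.sum((w - s2 * u) ** 2)
```

Both computations give the same minimizer and the same minimal value, as the identity I = I^{t/2} I^{1/2} predicts.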
3. Factor analysis (one factor)
• Observe scores X = (X1, ..., X4) of a student in m = 4 exams:
  X1: algebra,  X2: analysis,  X3: statistics,  X4: physics
• A hidden variable Y (“general math ability”) may explain dependences:
  X1 ⊥⊥ X2 ⊥⊥ X3 ⊥⊥ X4 | Y
• One-factor analysis model: family of Nm(µ, Σ) with µ ∈ R^m and Σ in the semi-algebraic set
  Fm = { ∆ + ΛΛᵗ | ∆ > 0 diagonal, Λ ∈ R^m },   where ΛΛᵗ has rank ≤ 1.
• The model is algebraic and of dimension dim(Fm) = min{ 2m, (m+1 choose 2) }.
Tetrads
• Let Im ⊆ R[σij | 1 ≤ i ≤ j ≤ m] be the vanishing ideal of Fm.
• Decomposition: Σ = (σij) ∈ Fm ⟹ Σ = ∆ + ΛΛᵗ = diagonal + (rank ≤ 1) ⟹ the off-diagonal 2×2-minors (tetrads) lie in Im.
• Theorem (De Loera/Sturmfels/Thomas, 1995): If m ≤ 3, then Im = {0}. If m ≥ 4, then the set of 2·(m choose 4) tetrads
  Tm = { σij σkℓ − σik σjℓ ,  σiℓ σjk − σik σjℓ  |  1 ≤ i < j < k < ℓ ≤ m }
  is the reduced Gröbner basis of Im with respect to a certain monomial order.
• Example:  I4 = ⟨ σ12 σ34 − σ13 σ24 ,  σ14 σ23 − σ13 σ24 ⟩
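That the tetrads vanish on Fm follows from σij = λiλj off the diagonal; a quick numerical check for m = 4 (with randomly chosen, made-up ∆ and Λ) evaluates the two generators of I4 from the example.

```python
import numpy as np

# Random one-factor covariance matrix Sigma = Delta + lambda * lambda^t, m = 4:
rng = np.random.default_rng(3)
lam = rng.standard_normal(4)
delta = rng.uniform(0.5, 2.0, size=4)
Sigma = np.diag(delta) + np.outer(lam, lam)

s = Sigma  # 0-based indexing: s[i, j] = sigma_{i+1, j+1}
tetrad1 = s[0, 1] * s[2, 3] - s[0, 2] * s[1, 3]   # sigma12*sigma34 - sigma13*sigma24
tetrad2 = s[0, 3] * s[1, 2] - s[0, 2] * s[1, 3]   # sigma14*sigma23 - sigma13*sigma24
```

Both tetrads vanish (up to floating-point noise), since every off-diagonal entry factors as λiλj.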
Variety versus model
• Inclusions:  Fm ⊂ Vpd(Im) ⊂ VR(Im) ⊂ V(Im)
• No equalities: e.g. for F3 the ideal is I3 = {0}, but the matrix
  ( 3   1   1 )
  ( 1   3  −1 )
  ( 1  −1   3 )
  is not in
  F3 = { ( σ11 + λ1²   λ1λ2        λ1λ3
           λ1λ2        σ22 + λ2²   λ2λ3
           λ1λ3        λ2λ3        σ33 + λ3² ) },
  since every Σ ∈ F3 satisfies σ12 σ13 σ23 = λ1² λ2² λ3² ≥ 0, whereas here σ12 σ13 σ23 = −1.
Singularities and algebraic tangent cones
Theorem:
(i) A matrix Σ ∈ V(Im) is a singularity of the one-factor model if and only if Σ has at most one non-zero off-diagonal entry.
(ii) If Σ is diagonal, then the algebraic tangent cone is equal to V(Im).
(iii) If Σ has exactly one non-zero off-diagonal entry, say σ12 ≠ 0, then the algebraic tangent cone is the set of symmetric matrices Ψ = (ψij) with ψij = 0 for 3 ≤ i < j ≤ m and Ψ_{12×{3,...,m}} of rank 1:
  ( ψ11  ψ12  ψ13  ...  ψ1m )
  ( ψ12  ψ22  ψ23  ...  ψ2m )
  (           ψ33           )
  (                ⋱        )
  (                     ψmm )
Tangent cones
Theorem:
(i) If Σ is diagonal, then the tangent cone is equal to the topological closure of
  { ∆ + ΓΓᵗ | ∆ ∈ R^{m×m} diagonal, Γ ∈ R^{m×1} }.
(ii) If Σ has exactly one non-zero off-diagonal entry and this entry is positive, say σ12 > 0, then the tangent cone is the set of symmetric matrices
  Ψ = (ψij) =  ( ψ11  ψ12   ψ13  ...   ψ1m )
               ( ψ12  ψ22  cψ13  ...  cψ1m )
               (            ψ33            )
               (                 ⋱         )
               (                      ψmm  )
for c ∈ [σ12/σ11, σ22/σ12]. If σ12 < 0, then the same holds with a negative multiplier.
A nice connection to eigenvalues
• Tangent cone ⊆ algebraic tangent cone ⟹ the limiting distribution of the LR statistic λn is bounded below by the distribution of
  DA := min_{Ψ ∈ AFm(Σ)} (Z̄ − Ψ)ᵗ I(Σ) (Z̄ − Ψ).
• At a singularity with σ12 ≠ 0, the algebraic tangent cone is a Cartesian product of R^{m+1}, of {0} ⊆ R^{(m−2 choose 2)}, and of a set of rank-1 matrices.
• The Fisher information I(Σ) has entries σik σjℓ + σiℓ σjk and is here block-diagonal.
• Theorem: Let V ∼ χ² with (m−2 choose 2) degrees of freedom, and let W be distributed like the smaller of the two eigenvalues of a 2×2 Wishart matrix with m − 2 degrees of freedom and scale parameter the identity matrix I2. If the true parameter point Σ is a singularity with σij ≠ 0, i < j, then DA ∼ V + W.
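The limit law V + W is straightforward to simulate; here is a sketch for m = 6 (a made-up choice), drawing the 2×2 Wishart matrices as GᵗG for standard normal G with m − 2 rows, which is one standard way to generate Wishart(I2, m − 2) samples.

```python
import numpy as np

rng = np.random.default_rng(4)
m, reps = 6, 100_000

# V ~ chi^2 with binom(m-2, 2) = 6 degrees of freedom:
V = rng.chisquare(df=(m - 2) * (m - 3) // 2, size=reps)

# W = smaller eigenvalue of a 2x2 Wishart matrix with m-2 dof, identity scale:
G = rng.standard_normal((reps, m - 2, 2))
wishart = np.einsum('rki,rkj->rij', G, G)   # G^t G for each replicate
W = np.linalg.eigvalsh(wishart)[:, 0]       # eigvalsh sorts ascending

DA = V + W                                  # samples from the limit law of D_A
```

Since Wishart matrices are positive semidefinite, W ≥ 0, and E[DA] lies between E[V] = 6 and E[V] + E[λ1 + λ2] = 6 + 2(m − 2).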
4. Multi-factor analysis
• # observed variables: m ∈ N;   # hidden factors: ℓ ∈ N   (ℓ ≪ m)
• Factor analysis model {Nm(µ, Σ) | µ ∈ R^m, Σ ∈ Fm,ℓ} with covariance matrix parameter space
  Fm,ℓ = { ∆ + ΛΛᵗ | ∆ > 0 diagonal, Λ ∈ R^{m×ℓ} }
• The model is algebraic and of dimension
  dim(Fm,ℓ) = min{ m(ℓ + 1) − (ℓ choose 2),  (m+1 choose 2) }.
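The dimension formula is a direct transcription into code; the one-factor formula dim(Fm) = min{2m, (m+1 choose 2)} from earlier is recovered at ℓ = 1.

```python
from math import comb

def dim_factor_model(m, l):
    """Dimension of the covariance parameter space F_{m,l}, per the formula above."""
    return min(m * (l + 1) - comb(l, 2), comb(m + 1, 2))

# One factor (l = 1) recovers dim(F_m) = min(2m, binom(m+1, 2)):
dims_one_factor = [dim_factor_model(m, 1) for m in range(1, 6)]

# m = 5, l = 2: dimension 14 inside the 15-dimensional space of symmetric
# matrices -- a hypersurface, consistent with the pentad on the next slide.
dim_5_2 = dim_factor_model(5, 2)
```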
Eliminating diagonal entries
• Vanishing ideal
  Im,ℓ = { f | f(Σ) = 0 for all Σ ∈ Fm,ℓ } ⊆ R[σij | 1 ≤ i ≤ j ≤ m]
• Theorem: Let Mm,ℓ ⊆ R[σij | 1 ≤ i ≤ j ≤ m] be the ideal generated by the (ℓ + 1) × (ℓ + 1)-minors of Σ = (σij). Then Im,ℓ = Mm,ℓ ∩ R[σij | i < j].
Betti numbers of minimal generators

       |  ℓ = 1   |        ℓ = 2        |                     ℓ = 3
   m   |  deg 2   |  deg 5   |  deg 3   |  deg 8   |  deg 7   |  deg 7   |  deg 4
       | (tetrad) | (pentad) | (minor)  | (ideal-  | (ideal-  | (septad) | (minor)
       |          |          |          |  theor.) |  theor.) |          |
   4   |     2    |    —     |    —     |    —     |    —     |    —     |    —
   5   |    10    |    1     |    0     |    —     |    —     |    —     |    —
   6   |    30    |    6     |    5     |    —     |    —     |    —     |    —
   7   |    70    |   21     |   35     |   21     |    0     |   15     |    0
   8   |   140    |   56     |  140     |  168     |  140     |  120     |   14
   9   |   252    |  126     |  420     |  756     |  1386    |  540     |  126
Singularities for two factors: m = 5
• Pentad hypersurface (m = 5, ℓ = 2); singular locus of codimension 15 − 11 = 4.
• Two symmetry classes of singularities:
  1. One off-diagonal row zero (5 possible rows):
     ( σ11   0    0    0    0  )
     (  0   σ22  σ23  σ24  σ25 )
     (  0   σ23  σ33  σ34  σ35 )
     (  0   σ24  σ34  σ44  σ45 )
     (  0   σ25  σ35  σ45  σ55 )
  2. All tetrads not involving one given off-diagonal entry are zero ((5 choose 2) = 10 off-diagonal entries).
Conclusion
• Many open problems, e.g. in factor analysis:
  – generators of Im,2
  – singular loci & tangent cones for Fm,2
• Many other statistical models could be considered
• Associated statistical/probabilistic problems
  – e.g. study random distances from tangent cones (useful bounds?)

References
1. Likelihood Ratio Tests and Singularities. In preparation.
2. (with S. Sullivant) Algebraic Statistical Models. Review paper for Statistica Sinica.
3. (with B. Sturmfels & S. Sullivant) Algebraic Factor Analysis: Tetrads, Pentads and Beyond. Probab. Theory and Related Fields, to appear in 2007.