CONDITIONAL INDEPENDENCE MODELS FOR SEEMINGLY UNRELATED REGRESSIONS WITH INCOMPLETE DATA

MATHIAS DRTON, STEEN A. ANDERSSON, AND MICHAEL D. PERLMAN

Abstract. We consider normal ≡ Gaussian seemingly unrelated regressions (SUR) models with incomplete data (ID). Imposing a natural minimal set of conditional independence constraints, we find restricted SUR/ID models for which the likelihood function and the parameter space factor into the products of the likelihood functions and the parameter spaces of standard complete data multivariate analysis of variance models. Hence, the restricted model has a unimodal likelihood and permits explicit likelihood inference. The restricted model may be used to model the observed data directly. Alternatively, the maximum likelihood estimates in the restricted model can yield improved starting values for iterative methods that maximize the likelihood of the unrestricted SUR/ID model. In the development of our methodology, we review and extend existing results for complete data SUR models and the multivariate ID problem. The results are presented in the framework of both lattice conditional independence models and graphical Markov models based on acyclic directed graphs.

Date: October 1, 2003.
Key words and phrases. Acyclic directed graph, graphical model, incomplete data, lattice conditional independence model, MANOVA, maximum likelihood estimator, multivariate analysis, multivariate linear model, missing data, seemingly unrelated regressions.


Contents

1. Introduction
2. Lattice conditional independence theory
 2.1. Lattices and LCI models
 2.2. The algebra of generalized block-triangular matrices with lattice structure
 2.3. The MANOVA model
 2.4. The linear LCI model and its likelihood factorization
3. Lattice inclusion
 3.1. An inclusion criterion based on join-irreducible elements
 3.2. An inclusion criterion based on the algebra of generalized block-triangular matrices
4. Seemingly unrelated regressions
 4.1. The SUR model
 4.2. LCI restrictions for a SUR model
 4.3. Minimality of the LCI restrictions for a SUR model
5. Multivariate incomplete data
 5.1. ID patterns
 5.2. The ID lattice
6. Linear incomplete data models
 6.1. Linear ID subspaces
 6.2. LCI restrictions for a linear ID model
 6.3. Minimality of the LCI restrictions for a linear ID model
7. Seemingly unrelated regressions with incomplete data
 7.1. The SUR/ID model
 7.2. LCI restrictions for a SUR/ID model
 7.3. Minimality of the LCI restrictions for a SUR/ID model
8. Acyclic directed graph theory
 8.1. Directed graphs
 8.2. Normal graphical Markov models based on acyclic directed graphs
 8.3. Equivalence of transitive ADG and LCI models
 8.4. Construction of the TADGs equivalent to the parsimonious LCI models
9. Examples
10. Summary and Conclusion
References


1. Introduction

The seemingly unrelated regressions (SUR) model is an extension of the multivariate analysis of variance (MANOVA) model. In MANOVA, each of the observed variables is regressed on the same mean space. In SUR, this is relaxed by allowing different variables to be regressed on different mean spaces. The SUR model was made prominent in the 1960s by Zellner [40, 41], who established the asymptotic efficiency of his two-stage estimator. According to Goldberger [13, p. 323], it "plays a central role in contemporary econometrics". An introduction to SUR can be found, for example, in the econometrics textbooks by Goldberger [13] and by Greene [15], but also in the multivariate statistics monograph by Mardia, Kent, and Bibby [25].

In the SUR literature, several distributional assumptions have been considered. Recent examples include Kowalski et al. [17], who assume t-distributed errors, and Lefkovitch [20], who considers SUR models based on generalized linear models. In this paper, however, we deal exclusively with the classical case of the normal (Gaussian) model.

In general, likelihood inference in a normal SUR model requires iterative methods to maximize the likelihood function (LF). The most common method is the iterated version of Zellner's two-stage estimator; an alternative is Telser's method [36]. Normal SUR models are curved exponential families (van Garderen [37]) and the LF may be multimodal. A bivariate example with multimodal LF is studied in Drton and Richardson [11]. However, in the monotone (≡ triangular ≡ nested) case, in which the regression spaces for the different variables are totally ordered by inclusion, the LF is unimodal and explicit likelihood inference is possible by factoring the SUR model into a product of MANOVA models, cf. Andersson and Perlman [8]. Simple special cases can be found earlier, see e.g. Oksanen [30] for a bivariate example.

Andersson and Perlman's methodology [8] also covers nonmonotone SUR models. Using lattice conditional independence (LCI) theory, cf. [6], they show that a nonmonotone SUR model determines a unique minimal set of conditional independence (CI) restrictions s.t. the LCI-restricted nonmonotone SUR model allows for explicit likelihood inference and has a unimodal LF. As in the monotone case, the key idea is the factorization of the SUR model into a product of MANOVA models. In a Monte Carlo study, Wu and Perlman [39] compare the finite sample performance of the estimators obtained in the LCI-restricted SUR model to traditional methods such as ordinary least squares or Zellner's two-stage estimator.

LCI theory also may be applied to incomplete data (ID) problems where, as assumed throughout this article, data is missing at random and the missing data can be ignored in the formulation of the likelihood for the incomplete data (compare Little and Rubin [21, Ch. 6]). For the case of i.i.d. multivariate normal observations with monotone incomplete data, explicit likelihood inference is again possible since the LF can be factored s.t. each factor corresponds to a complete data MANOVA model (see Little and Rubin [21, Ch. 7] and Liu [22]). As noted by Murray [29], however, nonmonotone incomplete data can lead to a multimodal LF, and iterative methods such as the EM algorithm are needed to find the maximum likelihood estimator (MLE) (cf. Little and Rubin [21], Liu [22], and Schafer [31]). For the i.i.d. nonmonotone case, Andersson and Perlman [5] applied LCI theory to find a unique minimal set of LCI restrictions s.t. the model again becomes a product of MANOVA models and thus permits explicit likelihood inference. In particular, the LCI-restricted ID model has a unimodal LF. Earlier, Little and Rubin [21, Ch. 7] had introduced the idea of a CI restriction in a simple trivariate example.

These two applications of LCI theory come together in a SUR model with incomplete data, considered for example by Hwang [16], Schmidt [32], Sharma [33], and Swamy and Mehta [35], where the latter work in a Bayesian setting. The LF of the SUR/ID model inherits multimodality from both the SUR model and the ID model. Meng and Rubin [26, 27] introduce the ECM algorithm as a generalization of the EM algorithm, in which the M-step is replaced by several conditional or constrained maximization steps. They demonstrate in particular how ECM can be used to fit a SUR/ID model. In the present paper, we combine the LCI theories developed for SUR and for incomplete data to find a minimal set of LCI restrictions that guarantee a unimodal LF for the SUR/ID model and render explicit likelihood inference possible. We develop the methodology in the LCI framework but also show how, equivalently, the resulting minimal set of CIs can be found using graphical Markov models based on acyclic digraphs (ADG ≡ DAG), cf. [2, 3, 9].

The practical value of our results is two-fold. On one hand, the parsimonious LCI model may be reasonable and in good agreement with the data. In this case one avoids the use of iterative methods and the possible difficulty of having to decide which local maximum of the LF yields the most desirable estimate. On the other hand, even if we do not wish to impose CI restrictions, the parsimonious LCI model can be employed to obtain starting values for iterative procedures. These starting values may avoid non-convergence and/or lead to faster convergence than starting values from ordinary least squares, which assumes complete independence of all observed variables. Furthermore, the estimates in the LCI model may help identify the most desirable local maximum if one is confronted with a multimodal LF.

The paper is organized as follows. In Section 2 we introduce LCI theory. Since we aim to find minimal sets of CI restrictions, we show in Section 3 how two LCI models can be compared for inclusion. The applications of LCI theory to SUR and to the ID problem are presented in Sections 4 and 5, respectively. The first new result is given in Section 6, where we introduce and study the linear ID model. This model includes in particular the MANOVA model with incomplete data. We prove that the parsimonious LCI model for i.i.d. incomplete data (as found in Andersson and Perlman [5]) also allows a factorization of the linear ID model into a product of complete data MANOVA models.


Our main result is presented in Section 7. Here we show how a minimal set of LCI constraints can be found s.t. the SUR/ID model factors into a product of complete data MANOVA models. In Section 8 we show how a transitive ADG (TADG) can be determined by a SUR model, a linear ID model, or a SUR/ID model s.t. the graphical Markov model based on this TADG is equivalent to the parsimonious LCI model developed in Sections 4, 6, or 7, respectively. A series of examples in Section 9 illustrates our methodology. We conclude with a summary and some comments in Section 10.

2. Lattice conditional independence theory

2.1. Lattices and LCI models. Let Y ≡ (Y_i | i ∈ I) ∼ N(µ, Σ) be a normal ≡ Gaussian random vector in R^I, where I is a finite index set, µ ∈ R^I, and Σ ∈ P(I) (the cone of all real positive definite I × I matrices). Let K be a ring of subsets of I, that is, a subset of the power set 2^I closed under intersection and union, hence a finite distributive lattice. We always assume that ∅ ∈ K and I ∈ K. When referring to a lattice we will always mean a ring of subsets of I. The LCI model determined by K places the following CI constraints on the distribution of Y:

(2.1)    Y_K ⊥⊥ Y_L | Y_{K∩L}    ∀ K, L ∈ K,

or, less redundantly,

(2.2)    Y_{K\L} ⊥⊥ Y_{L\K} | Y_{K∩L}    ∀ K, L ∈ K.

Here, Y_K denotes the subvector (Y_i | i ∈ K) and ⊥⊥ denotes (conditional) independence. The set of all covariance matrices Σ s.t. Y satisfies the specified CIs is denoted by P(K). Likelihood inference for the normal LCI model

(2.3)    N(K) := (N(µ, Σ) | µ ∈ R^I, Σ ∈ P(K))

on R^I and its extension to the normal linear LCI model (cf. (2.15), Proposition 2.3, and Theorem 2.4) is based on the partially ordered set (poset) J(K) of join-irreducible elements of the lattice K, ordered by inclusion. For K ∈ K, K ≠ ∅, define

    ⟨K⟩ := ∪(L ∈ K | L ⊊ K),    [K] := K \ ⟨K⟩;

thus K = ⟨K⟩ ∪̇ [K]. Now define

(2.4)    J(K) := {K ∈ K | K ≠ ∅, ⟨K⟩ ⊊ K} = {K ∈ K | K ≠ ∅, [K] ≠ ∅}.

Equivalently, K ∈ J(K) iff K ≠ ∅ and K = L ∪ M ⇒ K = L or K = M. By Proposition 2.1 of [6], every set L ∈ K can be partitioned as

(2.5)    L = ∪̇([K] | K ∈ J(K), K ⊆ L).


In particular, the index set I can be partitioned as

(2.6)    I = ∪̇([K] | K ∈ J(K)),

and, for L ∈ J(K),

(2.7)    ⟨L⟩ = ∪̇([K] | K ∈ J(K), K ⊊ L).

See Andersson [1], Davey and Priestley [10, Ch. 2 and 5], or Graetzer [14, Ch. II] for further properties of the poset J(K). In particular, by Birkhoff's Representation Theorem the join-irreducible elements determine the lattice K uniquely.

2.2. The algebra of generalized block-triangular matrices with lattice structure. For two index sets I and J we denote the vector space of I × J matrices by R^{I×J}. However, if the set R^{I×J} acts on another set of matrices by left multiplication then we denote it by M(I × J), or by M(I) when I = J. Further, A_{I′×J′} denotes the I′ × J′ submatrix of a matrix A ∈ M(I × J). In accordance with the partition (2.6) of the index set I, we can partition a matrix A ∈ M(I) as

(2.8)    A = (A_{[L]×[M]} | L, M ∈ J(K)).

For each lattice K define

(2.9)    M(K) := {A ∈ M(I) | A_{[L]×[M]} = 0 ∀ M ⊈ L ∈ J(K)}.

It is shown in [6, Sect. 2.4] that M(K) is an algebra of generalized block-triangular matrices. In particular, for K ∈ J(K) it follows from (2.7) and (2.9) that the K × K submatrix A_K of A ∈ M(K) has the form

(2.10)    A_K = ( A_⟨K⟩   0
                  A_[K⟩   A_[K] ),

where A_⟨K⟩, A_[K⟩, and A_[K] are the ⟨K⟩ × ⟨K⟩, [K] × ⟨K⟩, and [K] × [K] submatrices of A (compare Remark 2.2 in [6]). The matrices in M(K) can be characterized alternatively as follows.

Proposition 2.1 (Proposition 2.2 in [6]). A matrix A ∈ M(K) iff one of the following two equivalent conditions is fulfilled:
(i) ∀y ∈ R^I, ∀L ∈ K : y_L = 0 ⇒ (Ay)_L = 0;
(ii) ∀y ∈ R^I, ∀L ∈ K : (Ay)_L = A_L y_L.

Condition (i) implies that M(K) is closed under matrix multiplication and contains the identity, and thus indeed is an algebra. For further details on M(K), see Andersson and Perlman [6, Sect. 2.4], [7].
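The zero pattern in (2.9) is straightforward to check computationally. The following Python sketch tests whether a given matrix respects the generalized block-triangular structure M(K); representing the lattice by its join-irreducible elements and their blocks [K], and fixing an explicit variable ordering, are assumptions made for this illustration only.

    import numpy as np

    def in_M_of_K(A, index, join_irr, tol=1e-12):
        """Test A in M(K) via (2.9): the [L] x [M] block must vanish whenever M is not a subset of L.

        A        : square array whose rows/columns follow the ordering in `index`
        index    : list of the elements of I, giving each variable's position
        join_irr : dict label -> (K, [K]) for every K in J(K), with K and [K] given as sets
        """
        pos = {i: p for p, i in enumerate(index)}
        for K_L, bL in join_irr.values():
            for K_M, bM in join_irr.values():
                if not K_M.issubset(K_L):
                    rows = [pos[i] for i in bL]
                    cols = [pos[j] for j in bM]
                    if np.abs(A[np.ix_(rows, cols)]).max() > tol:
                        return False
        return True

    # Toy lattice over I = {1, 2, 3} generated by {1} and {1, 2, 3}:
    # J(K) = {{1}, {1, 2, 3}} with blocks [{1}] = {1} and [{1, 2, 3}] = {2, 3}.
    join_irr = {"K1": ({1}, {1}), "K123": ({1, 2, 3}, {2, 3})}
    A = np.array([[1.0, 0.0, 0.0],
                  [2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0]])
    print(in_M_of_K(A, [1, 2, 3], join_irr))   # True: A is block lower-triangular as required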


2.3. The MANOVA model. Let N be a finite index set and assume that we observe the variables indexed by I on subjects indexed by N. Arranged in matrix form, we observe the random array

(2.11)    X ∼ N(µ, Σ ⊗ 1_N) ∈ R^{I×N},

where N denotes the normal distribution, 1_N is the N × N identity matrix, the columns X_{·j}, j ∈ N, of X are independent with common covariance matrix Σ ∈ P(I), and where EX ≡ µ ∈ R^{I×N} is the array of means. The classical MANOVA model (we do not make a distinction between a multivariate analysis of variance and a multivariate linear regression model) on R^{I×N} is defined as

(2.12)    N(U) := (N(µ, Σ ⊗ 1_N) | µ ∈ U, Σ ∈ P(I)),

where U is a MANOVA subspace of R^{I×N}, defined as a linear subspace U ⊆ R^{I×N} s.t.

(2.13)    M(I)U ⊆ U.

Proposition 2.2 (Characterization of MANOVA subspaces [4]). For a subspace U ⊆ R^{I×N}, the following statements are equivalent:
(i) U is a MANOVA subspace;
(ii) U = U^I ≡ ×(U | i ∈ I) where U is a subspace of R^N;
(iii) U = {γZ | γ ∈ M(I × J)} for some design matrix Z ∈ R^{J×N} and some finite index set J.

Proof. (iii)⇒(i): Obvious, since M(I)R^{I×J} = R^{I×J}. (i)⇒(ii): Let U_i be the projection of U onto R^{{i}×N}. By (2.13), U is invariant under left multiplication by permutation matrices, so U_i = U ⊆ R^N for all i ∈ I. Hence, U ⊆ ×(U_i | i ∈ I) ≡ U^I. Next, let τ_j ∈ R^N, j ∈ J, be a basis of U. Then, by definition of U as the image of a projection, there exists ζ^{(i,j)} ∈ U s.t. the ith row of ζ^{(i,j)} equals τ_j. Multiply ζ^{(i,j)} on the left by the I × I matrix having a one as ith diagonal entry and zeroes elsewhere and apply (2.13) to see that the I × N matrix µ^{(i,j)} with τ_j as ith row and zero entries elsewhere is an element of U. Since the matrices µ^{(i,j)}, i ∈ I, j ∈ J, span U^I, we obtain that U = U^I. (ii)⇒(iii): Choose a basis τ_j ∈ R^N, j ∈ J, of U. Define Z to be the J × N matrix with rows (τ_j | j ∈ J). Then U = {δZ | δ ∈ R^J}, so U = {γZ | γ ∈ R^{I×J}}.

The MLEs in a MANOVA model are available explicitly (compare e.g. Mardia, Kent, and Bibby [25, Ch. 6]):

(2.14)    µ̂ = XZ′(ZZ′)^{-1}Z,    Σ̂ = (1/|N|)(X − µ̂)(X − µ̂)′.

The MLE (µ̂, Σ̂) exists a.s. if |N| ≥ |I| + dim(U), where U is the row space of the design matrix Z; otherwise, the MLE never exists. If it exists, the MLE is the unique solution to the likelihood equations.
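As an illustration of (2.14), the following Python sketch computes the MANOVA MLEs for a data array X and a design matrix Z; the dimensions and the simulated data are arbitrary assumptions made for the example.

    import numpy as np

    def manova_mle(X, Z):
        """MLEs (2.14) for the MANOVA model X ~ N(mu, Sigma (x) 1_N) with mean rows in the row space of Z.

        X : |I| x |N| data array, Z : |J| x |N| design matrix.
        """
        p, n = X.shape
        if n < p + np.linalg.matrix_rank(Z):
            raise ValueError("MLE does not exist a.s.: need |N| >= |I| + dim(U).")
        P = Z.T @ np.linalg.solve(Z @ Z.T, Z)    # n x n projection onto the row space U of Z
        mu_hat = X @ P                           # mu_hat = X Z'(Z Z')^{-1} Z
        resid = X - mu_hat
        Sigma_hat = resid @ resid.T / n          # Sigma_hat = (X - mu_hat)(X - mu_hat)' / |N|
        return mu_hat, Sigma_hat

    rng = np.random.default_rng(0)
    Z = np.vstack([np.ones(20), rng.normal(size=20)])   # intercept plus one covariate
    X = rng.normal(size=(3, 20))
    mu_hat, Sigma_hat = manova_mle(X, Z)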


2.4. The linear LCI model and its likelihood factorization. Again assume (2.11). Andersson and Perlman [8] introduced and studied the linear LCI model on R^{I×N}:

(2.15)    N(U, K) := (N(µ, Σ ⊗ 1_N) | µ ∈ U, Σ ∈ P(K)),

where U is a K-linear subspace (or simply K-subspace) of R^{I×N}, defined as a linear subspace of R^{I×N} that satisfies

(2.16)    M(K)U ⊆ U.

In contrast to a MANOVA model, a linear LCI model restricts the covariance matrix Σ because P(K) ⊆ P(I) but, since M(K) ⊆ M(I), it allows the mean matrix µ to lie in a more general subspace while still permitting explicit likelihood inference (cf. Theorem 2.4).

Proposition 2.3 (Characterization of K-subspaces [8]). Let U be a linear subspace of R^{I×N}. For each K ∈ J(K) let U_[K] and U_⟨K⟩ denote the projections of U onto R^{[K]×N} and R^{⟨K⟩×N}, respectively. Then U is a K-subspace of R^{I×N} iff the following three conditions are satisfied:
(i) U = ×(U_[K] | K ∈ J(K));
(ii) ∀K ∈ J(K), U_[K] is a MANOVA subspace of R^{[K]×N};
(iii) ∀K ∈ J(K), M([K] × ⟨K⟩)U_⟨K⟩ ⊆ U_[K].

Proof. Theorem 4.2 in [8]; compare also Proposition 6.1.

Under the linear LCI model N(U, K), the LF factors according to the partitioning (2.6) (cf. [8]), as follows. For K ∈ K, let X_K (resp., µ_K) denote the K × N submatrix of X (resp., µ) and let Σ_K denote the K × K submatrix of Σ (cf. (2.11)). Partition X_K, µ_K, and Σ_K according to the decomposition K = ⟨K⟩ ∪̇ [K]:

(2.17)    X_K = ( X_⟨K⟩        µ_K = ( µ_⟨K⟩        Σ_K = ( Σ_⟨K⟩   Σ_⟨K]
                  X_[K] ),            µ_[K] ),              Σ_[K⟩   Σ_[K] ).

For each K ∈ J(K), the conditional distribution of X_[K] given X_⟨K⟩ is

(2.18)    (X_[K] | X_⟨K⟩) ∼ N_{[K]×N}( ξ_[K] + β_[K⟩ X_⟨K⟩, Λ_[K] ⊗ 1_N ),

where

(2.19)    ξ_[K] := µ_[K] − Σ_[K⟩ Σ_⟨K⟩^{-1} µ_⟨K⟩,
          β_[K⟩ := Σ_[K⟩ Σ_⟨K⟩^{-1},
          Λ_[K] := Σ_{[K]·⟨K⟩} = Σ_[K] − Σ_[K⟩ Σ_⟨K⟩^{-1} Σ_⟨K].

(Thus, β_[K⟩ is the matrix of regression coefficients for X_[K] given X_⟨K⟩ and Λ_[K] ⊗ 1_N is the conditional covariance matrix.) The family (ξ_[K], β_[K⟩, Λ_[K] | K ∈ J(K)) comprises the K-parameters of the model N(U, K).
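The K-parameters in (2.19) are ordinary multivariate-normal regression quantities and can be computed blockwise. A minimal Python sketch, assuming (for illustration only) that the rows of µ_K and Σ_K for one K ∈ J(K) are ordered with the ⟨K⟩ block first:

    import numpy as np

    def k_parameters(mu_K, Sigma_K, n_angle):
        """Compute (xi_[K], beta_[K>, Lambda_[K]) of (2.19) for one join-irreducible K.

        mu_K    : |K| x |N| mean matrix, rows ordered as (<K>, [K])
        Sigma_K : |K| x |K| covariance matrix, same ordering
        n_angle : number of variables in <K>
        """
        a, b = slice(0, n_angle), slice(n_angle, None)
        beta = Sigma_K[b, a] @ np.linalg.inv(Sigma_K[a, a])      # beta_[K> = Sigma_[K> Sigma_<K>^{-1}
        xi = mu_K[b, :] - beta @ mu_K[a, :]                      # xi_[K]  = mu_[K] - beta_[K> mu_<K>
        Lam = Sigma_K[b, b] - beta @ Sigma_K[b, a].T             # Lambda_[K] = Sigma_[K] - beta Sigma_<K]
        return xi, beta, Lam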


Theorem 2.4 (Factorization of a linear LCI model [8]). Let U be a K-subspace of R^{I×N}. Then the LF for the model N(U, K) factors as

(2.20)    f_{µ,Σ}(x) = ∏( f_{ξ_[K], β_[K⟩, Λ_[K]}(x_[K] | x_⟨K⟩) | K ∈ J(K) ),

where f_{ξ_[K], β_[K⟩, Λ_[K]}(x_[K] | x_⟨K⟩) is the LF for the MANOVA model (2.18) on R^{[K]×N}. Furthermore, the parameter space factors according to the bijective mapping

(2.21)    U × P(K) → ×( U_[K] × R^{[K]×⟨K⟩} × P([K]) | K ∈ J(K) ),
          (µ, Σ) ↦ ( ξ_[K], β_[K⟩, Λ_[K] | K ∈ J(K) ).

Proof. See Theorems 5.1 and 5.2 in [8] (compare also Theorem 3.1 in [6]).

Theorem 2.4 enables one to find the MLE of (µ, Σ) by first deriving the MLEs of the K-parameters from the usual formulas for the MLEs in a MANOVA model (see (2.14)), then using the reconstruction algorithm ([5, 6] or, in a slightly different appearance, [9]) to reconstruct the MLE of (µ, Σ) from its estimated K-parameters. In particular, the MLE of (µ, Σ) exists and is the unique solution to the likelihood equations for a.e. x ∈ R^{I×N} iff

(2.22)    |N| ≥ max( |[K]| + d_K + |⟨K⟩| | K ∈ J(K) ) = max( |K| + d_K | K ∈ J(K) ),

where d_K is the dimension of U_[K] divided by |[K]|. (Equivalently, if U_[K] = U^{[K]} then d_K is the dimension of U ⊆ R^N.)

3. Lattice inclusion

3.1. An inclusion criterion based on join-irreducible elements. As shown in the subsequent sections, for nonmonotone SUR models and/or nonmonotone ID models, LCI theory dictates the construction of minimal sets of CI restrictions that render explicit likelihood inference possible. In order to prove the minimality, we need to compare two LCI models based on different lattices. From the definitions in Section 2.1 it is obvious that

(3.1)    K ⊆ L =⇒ P(L) ⊆ P(K).

(The converse is false, as seen by the example K = {∅, {1}, {1, 2}} and L = {∅, {1, 2}} over the index set I = {1, 2}.) In the following sections we will use (3.1) to establish minimality of the CI restrictions imposed by an LCI model. Hence, we need to be able to compare lattices. The lemma presented in this subsection gives a criterion to check whether two lattices are nested by inclusion based on their join-irreducible elements. For i ∈ I, let

(3.2)    K_i := ∩(K ∈ K | i ∈ K)


be the smallest member of K containing the index i. As shown in the proof of Proposition 2.1 in [6], K_i is join-irreducible and, thus, is the smallest join-irreducible element containing i. In particular,

(3.3)    i ∈ [K_i] ≡ K_i \ ⟨K_i⟩.

Lemma 3.1 (Inclusion of join-irreducible elements). Let K and L be two lattices over the same index set I. Let K_i and L_i be the smallest join-irreducible elements of K and L, respectively, that contain the index i ∈ I. Then

(3.4)    K ⊆ L ⇐⇒ L_i ⊆ K_i    ∀ i ∈ I.

Moreover, if L_i ⊆ K_i for all i ∈ I then [L_i] ⊆ [K_i] for all i ∈ I.

Proof. First, apply Lemma 1 of [34] with S = K, F_i = K_i, B = K, where K_i is defined in (3.2), to obtain that

(3.5)    K ∈ K ⇐⇒ K = ∪(K_i | i ∈ K).

(⇒): If K ⊆ L then in particular K_i ∈ L for all i. But then, since i ∈ K_i, it follows from (3.5) with K replaced by L that

    K_i = ∪(L_j | j ∈ K_i) ⊇ L_i.

(⇐): Since L is closed under union, it suffices by (3.5) to show that all K_i are elements of L. Let K_i(L) := ∪(L ∈ J(L) | L ⊆ K_i) ∈ L; we shall show that K_i = K_i(L). By its definition, K_i(L) ⊆ K_i. To show K_i ⊆ K_i(L), let j ∈ K_i. Then K_j ⊆ K_i, since K_i ∈ J(K) and K_j is the smallest element of J(K) that contains j. Since L_j ⊆ K_j by assumption, L_j ⊆ K_i, hence j ∈ L_j ⊆ K_i(L).

The claim that [L_i] ⊆ [K_i] can be established as follows. Assume that j ∈ [L_i]. Then, since also j ∈ [L_j], it follows that j ∈ [L_i] ∩ [L_j], and we can deduce from (2.6) that L_j = L_i. Hence, j ∈ L_j = L_i ⊆ K_i, from which it follows that K_j ⊆ K_i. On the other hand, i ∈ L_i = L_j ⊆ K_j so K_i ⊆ K_j, which implies K_j = K_i. In particular, j ∈ [K_j] = [K_i], which establishes that [L_i] ⊆ [K_i].

3.2. An inclusion criterion based on the algebra of generalized block-triangular matrices. The inclusion of two lattices is also characterized by the inclusion of their associated algebras of generalized block-triangular matrices, which we present as Corollary 3.3 to the preparatory Lemma 3.2.
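Before turning to the matrix-algebra criterion, the join-irreducible criterion (3.4) is easy to check directly. In the following Python sketch a lattice is represented simply as a list of its member sets; this representation is an assumption made for the illustration, not notation used in the paper.

    def smallest_containing(lattice, i):
        """K_i of (3.2): intersection of all members of the lattice that contain i."""
        members = [set(K) for K in lattice if i in K]
        out = members[0]
        for K in members[1:]:
            out &= K
        return out

    def lattice_included(K_lattice, L_lattice, index_set):
        """Lemma 3.1: K is contained in L iff L_i is contained in K_i for every i in I."""
        return all(
            smallest_containing(L_lattice, i) <= smallest_containing(K_lattice, i)
            for i in index_set
        )

    # The example following (3.1): K = {{}, {1}, {1,2}} and L = {{}, {1,2}} over I = {1, 2}.
    K = [set(), {1}, {1, 2}]
    L = [set(), {1, 2}]
    print(lattice_included(K, L, {1, 2}))   # False: K is not a subset of L
    print(lattice_included(L, K, {1, 2}))   # True:  L is a subset of K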


Lemma 3.2. Let K be a lattice and define

(3.6)    K̃ := {K ⊆ I | ∀A ∈ M(K), ∀y ∈ R^I : y_K = 0 ⇒ (Ay)_K = 0}.

Then K̃ = K.

Proof. (⊇): Proposition 2.1(i) implies that K ⊆ K̃. (⊆): We show that L ∉ K ⇒ L ∉ K̃. For L ∉ K, set L− := ∪(K ∈ K | K ⊆ L) ∈ K and define L̄ := L \ L− ≠ ∅; set L+ := ∩(K ∈ K | K ⊇ L) ∈ K and define L̲ := L+ \ L ≠ ∅. Then for A ∈ M(K) and y ∈ R^I s.t. y_L = 0, we have

(3.7)    (Ay)_L̄ = A_{L̄×(I\L)} y_{I\L} = A_{L̄×L̲} y_{L̲} ≠ 0.

In (3.7), the first equality holds since y_L = 0, and the second equality is true because A ∈ M(K) implies by (2.9) that A_{(L+)×(I\L+)} = 0 ⇒ A_{L̄×(I\L+)} = 0. Finally, A_{L̄×L̲} y_{L̲} ≠ 0 if A and y are chosen appropriately, for example, such that A_{L̄×L̲} and y_{L̲} both have positive entries only. Note that choosing A_{L̄×L̲} with positive entries does not contradict A ∈ M(K) since there exists K ∈ J(K) s.t. L̄ ∪̇ L̲ = [K] (compare (2.9)). In conclusion, (Ay)_L̄ ≠ 0 and thus L ∉ K̃.

Corollary 3.3 (Inclusion of matrix algebras). Let K and L be two lattices over the same index set I with associated algebras of generalized block-triangular matrices M(K) and M(L), respectively. Then

(3.8)    K ⊆ L ⇐⇒ M(L) ⊆ M(K).

Proof. (⇒): By Proposition 2.1(ii), A ∈ M(L) iff

(3.9)    (Ay)_L = A_L y_L    ∀ L ∈ L.

Since K ⊆ L by assumption, (3.9) holds for all L ∈ K, hence A ∈ M(K) (compare Andersson and Perlman [7, Equation (2.3)]). (⇐): The inclusion M(L) ⊆ M(K) implies that K̃ ⊆ L̃, hence K ⊆ L by Lemma 3.2.

4. Seemingly unrelated regressions

4.1. The SUR model. The general normal SUR model on R^{I×N} is determined by a SUR pair S := (U, I_U) for R^{I×N}. Here, the SUR pattern U is a collection of distinct subspaces of R^N with |U| ≤ |I| and the SUR partition

(4.1)    I_U := (I_U | U ∈ U)

is a partition of I indexed by U, i.e. I_U is a family of non-empty pairwise disjoint subsets of I s.t.

(4.2)    I = ∪̇(I_U | U ∈ U).

The SUR linear subspace (or simply SUR subspace) of R^{I×N} induced by S is defined as

(4.3)    U_S := ×(U^{I_U} | U ∈ U) ⊆ R^{I×N}.


The general normal SUR model on R^{I×N} is defined to be

(4.4)    N(U_S) := (N(µ, Σ ⊗ 1_N) | µ ∈ U_S, Σ ∈ P(I)).

Less formally, in a SUR model each of the variables with index i ∈ I_U is regressed on the same linear regression subspace U ⊆ R^N, so these variables together follow the MANOVA model N(U^{I_U}) on R^{I_U×N}. These |U| MANOVA models are only seemingly unrelated because the variables in I_U and I_{U′} may be correlated if Σ is not diagonal. Note also that if |U| = 1, then U_S is a MANOVA subspace.

If the SUR pattern U is totally ordered with respect to inclusion, the SUR model is called nested, or monotone. If inclusion yields only a partial ordering of the regression spaces, the SUR model given by (4.4) is called nonnested, or nonmonotone. In a nonmonotone normal SUR model, the MLE of (µ, Σ) cannot be found explicitly; instead, iterative methods are required. Furthermore, Drton and Richardson [11] show that the LF may be multimodal and that the standard iterative methods may converge to different local maxima depending upon which starting value is used.

4.2. LCI restrictions for a SUR model. For any lattice K ⊆ 2^I, we can impose the associated LCI restrictions on the SUR model N(U_S) on R^{I×N} to obtain the LCI-restricted SUR model on R^{I×N}:

(4.5)    N(U_S, K) := (N(µ, Σ ⊗ 1_N) | µ ∈ U_S, Σ ∈ P(K)).

As shown by Andersson and Perlman [8], S determines a unique minimal lattice K_S ⊆ 2^I of subsets of I s.t. U_S becomes a K_S-subspace of R^{I×N}, hence N(U_S, K_S) becomes a linear LCI model on R^{I×N} amenable to explicit normal-theory likelihood inference, cf. Theorem 2.4. Minimality of K_S (cf. Theorem 4.3) implies that the corresponding set of LCI constraints is minimal ≡ parsimonious. In this subsection, we will review the construction of K_S and give an alternate proof of its minimality using the lattice inclusion Lemma 3.1. This alternate proof is more easily adapted to the case of incomplete data considered below.

The set of subspaces U is partially ordered by inclusion. For U ∈ U, define

(4.6)    K_U := ∪̇(I_{U′} | U′ ⊆ U, U′ ∈ U),

so K_U ⊆ K_{U′} iff U ⊆ U′ and K_U = K_{U′} iff U = U′. Thus the sets

(4.7)    P_S := {K_U | U ∈ U}

and U are in 1-1 correspondence and form isomorphic posets under inclusion. Note that P_S is totally ordered by inclusion iff the SUR pattern is monotone. The SUR lattice K_S is defined to be the lattice generated by P_S, i.e. the smallest ring containing each K_U, U ∈ U. The Birkhoff Representation Theorem (cf. [10, Theorem 5.12] or [1, Theorem 3.2(ii)]) yields that P_S = J(K_S) (also compare [8, Sect. 6]). It also follows easily from the definition of K_U that

(4.8)    K_U = ∩(K ∈ K_S | I_U ⊆ K),

the smallest join-irreducible element in K_S that contains I_U (or any i ∈ I_U). Furthermore (compare to (3.3)),

(4.9)    I_U = [K_U],    U ∈ U.

Theorem 4.1 (The parsimonious linear LCI model). Under the LCI constraints determined by the SUR lattice K_S, the LCI-restricted SUR model N(U_S, K_S) becomes a linear LCI model on R^{I×N}.

Proof. By Proposition 2.2(ii), U^{I_U} is a MANOVA subspace of R^{I_U×N}, hence U_S ≡ ×(U^{I_U} | U ∈ U) fulfills conditions (i) and (ii) of Proposition 2.3. Condition (iii) is evident from the isomorphism of the posets P_S and U under the inclusion orderings. Thus U_S is a K_S-subspace, as required.

By Theorems 2.4 and 4.1, the model N(U_S, K_S) on R^{I×N} factors as a product of MANOVA models. By (2.22), the MLE of (µ, Σ) in this model exists and is the unique solution to the likelihood equations for a.e. x ∈ R^{I×N} iff

(4.10)    |N| ≥ max( |K_U| + d_U | U ∈ U ) = max( |∪̇(I_{U′} | U′ ⊆ U)| + d_U | U ∈ U ),

where d_U is the dimension of U.
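The construction of the join-irreducible sets K_U in (4.6) can be illustrated with a few lines of Python; representing the SUR partition as index sets keyed by labels of the regression spaces, together with a user-supplied inclusion test for those spaces, is an assumption made only for this sketch.

    def sur_join_irreducibles(sur_partition, subspace_leq):
        """Compute K_U of (4.6) for every U in the SUR pattern; the values form P_S = J(K_S)."""
        return {
            U: set().union(*(I_V for V, I_V in sur_partition.items() if subspace_leq(V, U)))
            for U in sur_partition
        }

    # Bivariate nonmonotone SUR: variable 1 regressed on span(z1), variable 2 on span(z2),
    # with neither regression space contained in the other.
    sur_partition = {"span(z1)": {1}, "span(z2)": {2}}
    subspace_leq = lambda V, U: V == U           # only the trivial inclusions hold here
    print(sur_join_irreducibles(sur_partition, subspace_leq))
    # {'span(z1)': {1}, 'span(z2)': {2}}: K_S is generated by {1} and {2}, so the
    # parsimonious LCI restriction is the single independence Y_1 _||_ Y_2.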

Remark 4.2. By (2.16) and Corollary 3.3, it follows that if two lattices are nested as K ⊆ L then a K-subspace of R^{I×N} is also an L-subspace of R^{I×N}. Hence, for any lattice L ⊇ K_S, the model N(U_S, L) is also a linear LCI model on R^{I×N}. Recall, however, that the larger lattice induces the CIs from the smaller lattice plus possibly further CIs; cf. (3.1).

4.3. Minimality of the LCI restrictions for a SUR model. The next theorem states the minimality of the imposed LCI restrictions in Theorem 4.1, first shown by Andersson and Perlman [6]. Here we give a different proof using the lattice inclusion Lemma 3.1.

Theorem 4.3 (Lattice minimality for a SUR model). The SUR lattice K_S is uniquely minimal among all lattices L over I s.t. N(U_S, L) is a linear LCI model on R^{I×N}.

Proof. Consider a competing lattice L s.t. U_S is an L-subspace. For each i ∈ I, let L_i be the smallest join-irreducible element of L containing i and let U_i be the projection of U_S onto R^{{i}×N}. Then by Proposition 2.3(ii) and (iii) it follows that U_j ⊆ U_i whenever j ∈ L_i. Let U(i) be the unique member of U s.t. I_{U(i)} contains i. By the definition of K_{U(i)}, the inclusion U(j) = U_j ⊆ U_i = U(i) implies also that j ∈ K_{U(i)}. Thus, all the join-irreducible elements of L and K_S are nested as L_i ⊆ K_{U(i)}. Because K_{U(i)} = K_i (the smallest join-irreducible element of K_S that contains i), the lattice inclusion Lemma 3.1 shows that K_S ⊆ L.

5. Multivariate incomplete data

5.1. ID patterns. Consider a random array Y ∈ R^{I×N} where the variables and subjects are indexed by I and N, respectively. Now, however, some entries of Y may be missing. The ID pattern can be described by a subset I ⊆ I × N with the interpretation that (i, n) ∈ I iff the variable i ∈ I is observed on subject n ∈ N. To avoid trivialities, assume that no variable and no subject is entirely missing. The set I can be represented in two canonical ways.

First, for each n ∈ N let

(5.1)    I(n) := {i ∈ I | (i, n) ∈ I} ∈ 2^I \ {∅}

denote the set of all variables i that are observed on subject n, and define

(5.2)    I := {I(n) | n ∈ N} ⊆ 2^I \ {∅}.

The set I thus describes the pattern of partially observed column vectors and is called the column ID pattern. For each K ∈ I define

(5.3)    N_K := I^{-1}(K) ≡ {n ∈ N | I(n) = K} ≠ ∅,

that is, N_K indexes the subjects for which exactly the variables in K are observed. Then the family

(5.4)    N_I := (N_K | K ∈ I)

constitutes a partition of N, called the column ID partition, or simply column partition, so that

(5.5)    N = ∪̇(N_K | K ∈ I).

Second, for each i ∈ I let

(5.6)    N(i) := {n ∈ N | (i, n) ∈ I} ∈ 2^N \ {∅}

denote the set of all subjects n for which variable i is observed, and define

(5.7)    N := {N(i) | i ∈ I} ⊆ 2^N \ {∅}.

The set N thus describes the pattern of partially observed row vectors and is called the row ID pattern. For each M ∈ N define

(5.8)    I_M := N^{-1}(M) ≡ {i ∈ I | N(i) = M} ≠ ∅,

that is, I_M indexes the variables that are observed exactly on the subjects in M. Then the family

(5.9)    I_N := (I_M | M ∈ N)

constitutes a partition of I, called the row ID partition, or simply row partition, so that

(5.10)    I = ∪̇(I_M | M ∈ N).

In the literature, I has been referred to as the incomplete data pattern (Andersson and Perlman [5]), the missing data pattern (Little and Rubin [21, Sect. 1.2]), or the missingness pattern (Schafer [31, p. 16]). The column viewpoint leading to the pattern I is important when specifying a distributional assumption because we usually want subjects to be independent. In explicit likelihood inference, however, we regress a subset of variables in I on another subset of variables (compare Theorem 2.4), thus the row viewpoint leading to the pattern N is important for statistical analysis.

A column or row ID pattern that is totally ordered by inclusion is called monotone or nested (compare e.g. Little and Rubin [21]). Proposition 5.4 below shows that the column ID pattern I is monotone iff the row ID pattern N is monotone. If either pattern is not totally ordered by inclusion then we refer to it as nonmonotone or nonnested.

For statistical analysis, for certain K ⊆ I it will be necessary to consider the set of subjects N_K^+ for which all variables in K are observed together. Formally, for any K ⊆ I we define

(5.11)    N_K^+ := ∩(N(i) | i ∈ K) = ∪̇(N_{K′} | K′ ∈ I, K′ ⊇ K).

Note that for all M ∈ N, it holds by definition that

(5.12)    N_{I_M}^+ = M.
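As a concrete illustration of the objects just defined, the following Python sketch derives the column ID pattern I, the column partition (N_K), the row ID pattern N, the row partition (I_M), and the sets N_K^+ from a boolean observation mask; the array-based representation of the ID pattern is an assumption made for the example.

    import numpy as np

    def id_patterns(observed):
        """Derive the column/row ID patterns and partitions of Section 5.1 from a boolean mask.

        observed : |I| x |N| boolean array, True where variable i is observed on subject n.
        """
        p, n = observed.shape
        I_of = {j: frozenset(map(int, np.flatnonzero(observed[:, j]))) for j in range(n)}  # I(n), (5.1)
        N_of = {i: frozenset(map(int, np.flatnonzero(observed[i, :]))) for i in range(p)}  # N(i), (5.6)
        col_pattern = set(I_of.values())                                                   # I,    (5.2)
        row_pattern = set(N_of.values())                                                   # N,    (5.7)
        N_K = {K: {j for j, Ij in I_of.items() if Ij == K} for K in col_pattern}           # N_K,  (5.3)
        I_M = {M: {i for i, Ni in N_of.items() if Ni == M} for M in row_pattern}           # I_M,  (5.8)
        N_plus = lambda K: set.intersection(*(set(N_of[i]) for i in K))                    # N_K^+, (5.11)
        return col_pattern, N_K, row_pattern, I_M, N_plus

    # Three variables, five subjects; variable 2 is missing for the last two subjects.
    mask = np.array([[1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]], dtype=bool)
    col_pattern, N_K, row_pattern, I_M, N_plus = id_patterns(mask)
    print(sorted(map(sorted, col_pattern)))   # [[0, 1], [0, 1, 2]]
    print(N_plus({0, 2}))                     # {0, 1, 2}: subjects on which variables 0 and 2 are both observed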

5.2. The ID lattice. Let K_I ⊆ 2^I denote the column ID lattice (or simply column lattice), that is, the lattice generated in 2^I by the column ID pattern I. The row ID pattern N ⊆ 2^N \ {∅} does not directly generate a lattice in 2^I. However, N is partially ordered by inclusion, so we can proceed as follows (compare to §4.2). For M ∈ N, define

(5.13)    K_M := ∪̇(I_{M′} | M′ ∈ N, M′ ⊇ M),

so K_M ⊆ K_{M′} iff M ⊇ M′ and K_M = K_{M′} iff M = M′. It follows that the sets

(5.14)    P_N := {K_M | M ∈ N}

and N are in 1-1 correspondence and form anti-isomorphic posets under inclusion. The row ID lattice (or simply row lattice) K_N ⊆ 2^I is now defined to be the lattice generated in 2^I by P_N. Then as in §4.2,

(5.15)    P_N = J(K_N).

Moreover, it follows as in (4.8) and (4.9) that K_M is the smallest join-irreducible element of K_N containing I_M (or any i ∈ I_M) and that

(5.16)    I_M = [K_M].

Proposition 5.1 (ID lattice). The column and row lattices coincide, i.e. K_I = K_N, and jointly define the ID lattice K_I := K_I = K_N.

Proof. The smallest join-irreducible element of K_I containing i ∈ I is given by (compare (3.2))

(5.17)    K_i′ = ∩(K ∈ I | i ∈ K),    i ∈ I.

Now, j ∈ K_i′ iff i ∈ K implies j ∈ K for all K ∈ I, that is, iff variable j is observed on a subject n whenever variable i is observed on n. Thus, j ∈ K_i′ ⇐⇒ N(i) ⊆ N(j) ⇐⇒ j ∈ K_{N(i)} ∈ P_N. Hence K_i′ = K_{N(i)}, which implies that the lattices K_I and K_N have the same join-irreducible elements, so K_I = K_N.

Remark 5.2. As in the preceding proof, (3.2) implies that the join-irreducible elements of K_I can be described explicitly as follows. The smallest join-irreducible element containing i ∈ I consists of all j ∈ I s.t. if variable i is observed on a subject n, then so is variable j.

Remark 5.3. By (5.15) and (5.16), the row partition I_N in (5.9) can be expressed equivalently as

(5.18)    I_N = ([K] | K ∈ J(K_I)).

Proposition 5.4 (Monotonicity). The following conditions are equivalent:
(i) the column ID pattern I is monotone;
(ii) the row ID pattern N is monotone;
(iii) the set of join-irreducible elements J(K_I) is totally ordered by inclusion;
(iv) the ID lattice K_I is totally ordered by inclusion.

Proof. (iv)⇒(i): Obvious, since I generates K_I, and hence I ⊆ K_I. (i)⇒(iii): Equation (5.17) shows that all join-irreducible elements in J(K_I) are intersections of elements of I. Since we assume I to be monotone it follows that J(K_I) = I. Hence, J(K_I) is totally ordered by inclusion. (iii)⇒(iv): By (3.5), every element of a lattice is a union of join-irreducible elements, which implies that K_I is totally ordered if J(K_I) is. (ii)⇔(iii): This follows immediately from P_N = J(K_N) = J(K_I) and the definition of the sets K_M comprised by P_N (see (5.13) and (5.14)).
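Remark 5.2 gives an explicit recipe for the join-irreducible elements: the smallest join-irreducible element containing variable i is {j ∈ I | N(i) ⊆ N(j)}. A minimal Python sketch, reusing the boolean observation mask of the previous example (an illustrative assumption):

    import numpy as np

    def id_join_irreducible(observed, i):
        """Smallest join-irreducible element of the ID lattice containing variable i (Remark 5.2)."""
        N_i = set(map(int, np.flatnonzero(observed[i, :])))
        return {
            j for j in range(observed.shape[0])
            if N_i <= set(map(int, np.flatnonzero(observed[j, :])))
        }

    mask = np.array([[1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]], dtype=bool)
    print([id_join_irreducible(mask, i) for i in range(3)])
    # [{0, 1}, {0, 1}, {0, 1, 2}]: variables 0 and 1 are always observed together, so K_0 = K_1 = {0, 1};
    # variable 2 is observed only when 0 and 1 are, giving K_2 = {0, 1, 2} (a monotone pattern).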


In the development of LCI theory for ID models we will mainly adopt a column view, but for likelihood inference only the join-irreducible elements J(K_I) are required. Since we showed that J(K_I) = P_N, we need only to determine the row ID pattern N and the row partition I_N to be able to construct quickly the join-irreducible elements P_N.

6. Linear incomplete data models

6.1. Linear ID subspaces. Continuing the discussion from Section 5, suppose now that the complete data array satisfies

(6.1)    Y ∼ N(ν, Σ ⊗ 1_N),

where ν ∈ R^{I×N} and Σ ∈ P(I). Let X denote the ID array, that is, the observed part of Y. By the definition of I, the sample space for X is the vector space R^I, which can be written in the equivalent forms

(6.2)    R^I ≅ ×(R^{K×N_K} | K ∈ I) = ×(R^{I_M×M} | M ∈ N) ≅ ×(R^{[K]×N_K^+} | K ∈ J(K_I)),

where the last equivalence follows from (5.12), (5.16), and (5.18). The projection of the complete data array Y onto the ID array X is denoted by

(6.3)    p_I : R^{I×N} → R^I,    Y ↦ X := (X_K | K ∈ I),

where X_K is the K × N_K submatrix of Y. Then X satisfies

(6.4)    X ≡ (X_K | K ∈ I) ∼ ⊗(N(µ_K, Σ_K ⊗ 1_{N_K}) | K ∈ I) ∈ R^I,

where µ_K denotes the K × N_K submatrix of ν and Σ_K is the K × K submatrix of Σ. Here

(6.5)    E[X] = µ ≡ (µ_K | K ∈ I) := p_I(ν) ∈ R^I.

In §2.3 and §2.4, linear hypotheses about the mean array µ were given by MANOVA subspaces and K-subspaces defined by invariance under left multiplication by the naturally associated matrix algebras M(I) and M(K), respectively. To define an analogous class of subspaces in the ID case, we define the multiplication of an ID array x ∈ R^I by a matrix A ∈ M(I) as follows (cf. Andersson et al. [4]):

(6.6)    Ax := (A_K x_K | K ∈ I) ∈ R^I,

where A_K is the K × K submatrix of A. For the linear ID model introduced in §6.2, under appropriate LCI covariance restrictions, explicit likelihood inference is possible for the more

MATHIAS DRTON, STEEN A. ANDERSSON, AND MICHAEL D. PERLMAN

general linear hypothesis determined by a linear ID subspace (or simply K I -subspace) U of RI , that is, a linear subspace of RI that fulfills (compare (2.13) and (2.16)) M(KI )U ⊆ U,

(6.7)

where KI is the ID lattice defined in Proposition 5.1. Proposition 6.1 (Characterization of KI -subspaces of RI ). Let U be a linear subspace of RI . +

+

++ ++ For each K ∈ J (KI ), let U[K] and UhKi denote the projections of U onto R[K]×NK and RhKi×NK ,

respectively. Then U is a KI -subspace of RI iff the following three conditions are satisfied: ++ (i) U = ×(U[K] | K ∈ J (KI )); +

++ (ii) ∀K ∈ J (KI ), U[K] is a MANOVA subspace of R[K]×NK ; ++ ++ (iii) ∀K ∈ J (KI ), M([K] × hKi)UhKi ⊆ U[K] . + Proof. (⇒): Let A ∈ M(KI ) and µ ∈ U, so by (6.7), Aµ ∈ U. For K ∈ J (KI ), the K × NK

submatrix of Aµ is given by (Aµ)+ K (6.8)

(5.11)

=

(Aµ)K×NK 0 | K 0 ∈ I, K 0 ⊇ K

(6.6)

AK µK×NK 0 | K 0 ∈ I, K 0 ⊇ K

=

=

A K µ+ K,





where the last equality follows from Proposition 2.1(ii). This implies that +

[M(KI )U]K

(6.8)

=

+ M(KI )K UK

(2.10)

M(hKi)

0 M([K])

(6.7)

M([K] × hKi) ! ++ UhKi

=



=

!

++ UhKi ++ U[K]

!

++ U[K]

+ UK .

Thus, (6.9)

i i h h ++ ++ ++ ++ ⊆ U[K] , + M([K])U[K] [M(KI )U][K] = M([K] × hKi)UhKi

which yields (ii) and (iii).

++ ++ | K ∈ J (KI )). For any K ∈ J (KI ), choose a basis , U ⊆ ×(U[K] By the definition of U[K] +

++ ++ of U[K] and let τ be a member of this basis. Since U[K] is the projection of U onto R[K]×NK , ++ = τ . Multiplying ζ by the matrix in M(KI ) that has the identity there exists ζ ∈ U s.t. ζ[K]

matrix 1[K] in the [K]-th diagonal block and zeroes elsewhere shows that the element µ ∈ R I with µ++ [K] = τ and zeroes elsewhere is an element of U. Repeating this for all K ∈ J (K) and all ++ | K ∈ J (KI )), so (i) follows. basis elements τ shows that U contains a basis of ×(U[K]

CI MODELS FOR SUR WITH INCOMPLETE DATA

19

(⇐): If (ii) and (iii) are satisfied then the inclusion in (6.9) holds. This yields that   ++ M(KI )U ⊆ × [M(KI )U][K] | K ∈ J (KI ) (6.9)



(i)

=

  ++ | K ∈ J (KI ) × U[K]

U,

hence U satisfies (6.7). Corollary 6.2 (Restriction to a KI -subspace of RI ). The space U is a KI -subspace of RI iff there exists a KI -subspace V of RI×N s.t. U = pI (V). In particular, if V is a MANOVA subspace of RI×N then pI (V) is a KI -subspace of RI . Proof. (⇐) : If V is a KI -subspace of RI×N then Proposition 2.3 implies that U := pI (V) fulfills the conditions in Proposition 6.1, thus U is a KI -subspace of RI . (⇒) : Let U be a KI -subspace of RI , which can be written according to Proposition 6.1 as ++ U = ×(U[K] | K ∈ J (KI )).

(6.10) Define

+

++ V[K] := U[K] × R[K]×(N \NK ) ⊆ R[K]×N ,

(6.11) and let

V := ×(V[K] | K ∈ J (KI )) ⊆ RI×N .

(6.12)

Then by definition, pI (V) = U and V satisfies condition (i) of Proposition 2.3. Moreover, V also fulfills condition (ii) since for all K ∈ J (KI ),     + ++ M([K])V[K] = M([K])U[K] × M([K])R[K]×(N \NK )     + ++ × R[K]×(N \NK ) ⊆ U[K] =

V[K] .

+ + Here the inclusion is implied by Proposition 6.1(ii). Finally, because K 0 ⊆ K ⇒ NK 0 ⊇ NK , it

follows that VhKi , the projection of V onto RhKi×N , satisfies +

++ VhKi = ×(V[K 0 ] | K 0 ∈ J (KI ), K 0 ( K) ⊆ UhKi × RhKi×(N \NK ) .

Thus, M([K] × hKi)VhKi

⊆ ⊆ ⊆ =

  + ++ M([K] × hKi) UhKi × RhKi×(N \NK )     + ++ × M([K] × hKi)RhKi×(N \NK ) M([K] × hKi)UhKi     + ++ × R[K]×(N \NK ) U[K]

V[K] ,

20

MATHIAS DRTON, STEEN A. ANDERSSON, AND MICHAEL D. PERLMAN

which shows that condition (iii) of Proposition 2.3 holds and thus that V is a K I -subspace of RI×N . Note that this Corollary makes an existence statement only so that if V ⊆ R I×N is not a KI subspace of RI×N then pI (V) still might be a KI -subspace of RI . 6.2. LCI restrictions for a linear ID model. The challenge of maximum likelihood estimation when data is incomplete is somewhat similar to the SUR case. First, if the column or row ID pattern is nonmonotone then the MLE cannot be obtained explicitly. Second, the LF might be multimodal (cf. Murray [29]). But here again, based on the ID pattern/lattice, one can construct a parsimonious LCI model that yields explicit MLEs with guaranteed unimodality of the LF. The original work by Andersson and Perlman [5] treats the case of i.i.d. multivariate normal random vectors. Here, their approach is extended to a linear ID model. Let U be a KI -subspace of RI . The linear ID model N(U) on RI is defined as (recall (6.4) and (6.5)) (6.13)

N(U) := (⊗ (N (µK , ΣK ⊗ 1NK ) | K ∈ I) | µ ∈ U, Σ ∈ P(I)).

For any lattice K ⊆ 2I , define the LCI-restricted linear ID model on RI : (6.14)

N(U, K) := (⊗ (N (µK , ΣK ⊗ 1NK ) | K ∈ I) | µ ∈ U, Σ ∈ P(K)).

+ + + as of Y is fully observed. Partition XK submatrix XK For each K ∈ I, the K × NK ! ++ XhKi + (6.15) , XK = ++ X[K]

so by (6.2) (6.16)

X = (XK | K ∈ I) ++ = (X[K] | K ∈ J (KI )).

++ ++ Furthermore, let µ+ K , µhKi , and µ[K] denote the corresponding quantities when µ = E[X] from

(6.5) replaces X in (6.15). Then under the parsimonious LCI-restricted linear ID model N(U, K I ) ++ ++ on RI , for each K ∈ J (KI ) the conditional distribution of X[K] given XhKi is

(6.17) where

  ++ ++ ++ ++ + β[Ki XhKi , Λ[K] ⊗ 1N + , (X[K] | XhKi ) ∼ N[K]×N + ξ[K] K

++ −1 ++ ξ[K] := µ++ [K] − Σ[Ki ΣhKi µhKi ,

(6.18)

β[Ki := Σ[Ki Σ−1 hKi , Λ[K] := Σ[K]·hKi = Σ[K] − Σ[Ki Σ−1 hKi ΣhK] .

K

CI MODELS FOR SUR WITH INCOMPLETE DATA

21

++ The family (ξ[K] , β[Ki , Λ[K] | K ∈ J (KI )) comprises the KI -parameters of the model N(U, KI ) on

RI . The following factorization theorem extends the results in Section 3 of [5] from the i.i.d. case to the linear ID case. Theorem 6.3 (Factorization of the parsimonious LCI-restricted linear ID model). Let U be a KI -subspace of RI . Then the LF for the model N(U, KI ) on RI factors as  Y ++  fξ++ ,β[Ki ,Λ[K] x++ fµ,Σ (x) = (6.19) [K] xhKi , [K]

K∈J (KI )

 + ++ [K]×NK where fξ++ ,β[Ki ,Λ[K] x++ given by (6.17). [K] | xhKi is the LF of the MANOVA model on R [K]

Moreover, the parameter space factors according to the bijective mapping   ++ φI : U × P(KI ) → × U[K] × R[K]×hKi × P([K]) | K ∈ J (KI ) (6.20)   ++ (µ, Σ) 7→ ξ[K] , β[Ki , Λ[K] | K ∈ J (KI ) .

Proof. The factorization (6.19) of the LF follows immediately from the derivation of the fundamental factorization (3.12) in [5]. This derivation does not make use of any structure in the mean matrix µ, in particular, no use is made of the i.i.d. assumption. The bijectivity of the reparameterization φI can be seen as follows (compare also Proposition 6.1 in [9]). The restricted reparameterization (6.21)

 ψI : P(KI ) → β[Ki , Λ[K] K ∈ J (KI )   Σ 7→ R[K]×hKi × P([K]) K ∈ J (KI )

is bijective, as proved in Theorem 2.2 in [6]. Thus,

φI (µ, Σ) = φI (µ0 , Σ0 ) =⇒ ψI (Σ) = ψI (Σ0 ) ⇐⇒ Σ = Σ0 . To show that φI (µ, Σ) = φI (µ0 , Σ0 ) also implies µ = µ0 , choose a never-decreasing listing of the join-irreducible elements K ∈ J (KI ), i.e. find K1 , . . . , Kq , q = |J (KI )|, s.t. α < δ ⇒ Kδ 6⊆ Kα . ++

++ 0 Apply the definition (6.18) of ξ[K successively to find that µ++ [Kα ] = µ [Kα ] for α = 1, . . . , q. α]

This implies µ = µ0 and hence the injectivity of φI . The surjectivity of φI follows from an augmented version of the reconstruction algorithm in Section 3.3 of [5]. The augmentation consists in replacing the α-th step in which the next part of the mean column vector (denoted µ[α] in [5]) is obtained by a step in which the next part of the mean matrix is obtained, i.e. by ++ ++ µ++ [Kα ] = ξ[Kα ] + β[Kα i µhKα i . ++ I At the α-th step of the algorithm, µ++ hKα i ∈ UhKα i is constructed. Since U is a KI -subspace of R

it follows from Proposition 6.1 that (6.22)

++ ++ M([Kα ] × hKα i)UhK ⊆ U[K . αi α]

22

MATHIAS DRTON, STEEN A. ANDERSSON, AND MICHAEL D. PERLMAN

++ ++ The inclusion (6.22) implies that µ++ [Kα ] is constructed to be in U[Kα ] . Since U = × U[Kα ] | α =  1, . . . , q , the reconstructed µ is an element of U.  For given KI -parameters ξ ++ , β[Ki , Λ[K] K ∈ J (KI ) of N(U, KI ) on RI in the claimed image [K]

of φI , this augmented reconstruction algorithm yields (µ, Σ) ∈ U × P(KI ) s.t.   ++ φI (µ, Σ) = ξ[K] , β[Ki , Λ[K] | K ∈ J (KI ) . Thus, φI is surjective.

+

++ ++ Finally, since ξ[K] ranges through the entire MANOVA subspace U[K] of R[K]×NK , the model +

+ | observations. (6.17) is a MANOVA model on R[K]×NK based on |NK

Theorem 6.3 shows that, just as in the SUR case, if the LCI constraints given by K I are imposed on the linear ID model N(U) to produce the LCI-restricted linear ID model N(U, K I ) + then explicit likelihood inference is possible. However, different sample sizes |N K | apply to the

different regression factors. Thus, the necessary and sufficient condition for almost sure existence and uniqueness of the MLE in the LCI-restricted linear ID model N(U, KI ) is that (6.23)

+ |NK | ≥ |K| + dK

∀K ∈ J (KI ),

where dK is defined in (6.26). Remark 6.4. Note that a factorization theorem analogous to Theorem 6.3 holds for a LCIrestricted linear ID model N(U, L) whenever L ⊇ KI . To see this let Li and Ki denote the smallest join-irreducible elements of L and KI , respectively, that contain the index i ∈ I. Then, by the lattice inclusion Lemma 3.1, (6.24)

Li ⊆ Ki and [Li ] ⊆ [Ki ].

Hence, the proof of Theorem 6.3 applies since, due to (6.24), (i) the factorization of the LF of N(U, KI ) implies the factorization of the LF of N(U, L); (ii) the fact that (6.22) holds for Kα implies that (6.22) remains true if Kα is replaced by Li , where i ∈ I is such that Ki = Kα ; ++ ++ (iii) for all i ∈ I, U[L is a MANOVA subspace because U[K is one also. i] i]

6.3. Minimality of the LCI restrictions for a linear ID model. The next theorem shows the unique minimality of the lattice KI , which translates into parsimony of the induced conditional independences. Theorem 6.5 (Lattice minimality for a linear ID model). The ID lattice KI is uniquely minimal among all lattices L over I for which the model N(U, L) on RI admits factorizations of the LF and the parameter space as products of LFs and parameter spaces, respectively, of MANOVA models, as in Theorem 6.3.

CI MODELS FOR SUR WITH INCOMPLETE DATA

23

Proof. Let L be a competing lattice admitting a factorization as in (6.19) and (6.20) in Theorem 6.3 and let Li be the smallest join-irreducible element of L containing i ∈ I. Since the Li -th factor in the factorization must be a MANOVA model, a variable j ∈ Li is observed on every subject on which the variable i is observed. However, Ki , the smallest join-irreducible element of KI containing the index i, contains all j ∈ I s.t. variable j is observed whenever variable i is observed (see §5.2). Thus, Li ⊆ Ki for all i ∈ I. The lattice inclusion Lemma 3.1 then implies that KI ⊆ L, hence the minimality and uniqueness of KI . Remark 6.6. The distribution of the random ID array X in (6.4) is uniquely determined by the mean parameter µ ∈ RI from (6.5) and the covariance matrix Σ. In a linear ID model N(U) on RI , the parameter (µ, Σ) ∈ U × P(I) is identifiable iff Σ is identifiable, which holds iff ∪(K × K | K ∈ I) = I × I;

(6.25)

compare [5]. However, if U = pI (V) for a KI -subspace V of RI×N then ν ∈ V need not be uniquely identified by pI (ν). For this identifiability to hold, the projection pI : V → U must be bijective, or equivalently, dim(V) = dim(U). In applications, this condition can be verified as follows. Since V is a KI -subspace of RI×N , we can write the projection of V onto R[K]×N as (VK )[K] ++ divided by |[K]|. for a linear subspace VK ⊆ RN . Further, let dK be the dimension of U[K] +

++ Equivalently, if U[K] = (UK )[K] for some univariate linear subspace UK ⊆ RNK then

(6.26)

dK = dim(UK ).

Then X

|[K]| dim(VK ) | K ∈ J (KI ) X  ≥ |[K]|dK | K ∈ J (KI )

dim(V) = (6.27)



= dim(U),

so dim(V) = dim(U) iff dK = dim(VK ) for all K ∈ J (KI ). 7. Seemingly unrelated regressions with incomplete data 7.1. The SUR/ID model. We now combine the ID model considered in Sections 5 and 6 with the SUR model as considered in Section 4. We again observe X ∈ RI , normally distributed as in (6.4). Recall from (5.5) and (5.10) that the index sets N and I are partitioned as (N K | K ∈ I) and (IM | M ∈ N ), respectively. Further, recall from (6.2) that the sample space for X factors as RI = ×(RIM ×M | M ∈ N ). For each M ∈ N , let SM ≡ (UM , (IM )UM ) be a SUR pair for RIM ×M , that is, the SUR pattern UM is a collection of distinct subspaces of RM with |UM | ≤ |IM | and the SUR partition (IM )UM ≡ (IM,U | U ∈ UM ) is a partition of IM indexed by UM , so . (7.1) IM = ∪(IM,U | U ∈ UM ).

24

MATHIAS DRTON, STEEN A. ANDERSSON, AND MICHAEL D. PERLMAN

The SUR/ID partition (IM,U | M ∈ N , U ∈ UM )

(7.2)

of I is as least as fine as the row partition IN . Note that a variable i ∈ IM,U iff it is observed on all the subjects in M and on no other subject and is regressed on the mean space U ⊆ R M . The SUR subspace of RIM ×M induced by SM is given by USM = ×(U IM,U | U ∈ UM ) ⊆ RIM ×M

(7.3) (recall (4.3)). The collection

S := (SM | M ∈ N )

(7.4)

of SUR pairs is called a SUR/ID structure for RI . The SUR/ID subspace US of RI induced by S is defined to be the product space (7.5)

US := × (USM | M ∈ N ) = × (U IM,U | M ∈ N , U ∈ UM ).

Finally, the SUR/ID model N(US ) on RI is defined as (recall (6.4) and (6.5)) (7.6)

N(US ) := (⊗ (N (µK , ΣK ⊗ 1NK ) | K ∈ I) | µ ∈ US , Σ ∈ P(I))

(compare to (4.4) and (6.13)). Proposition 7.1 (Restriction to a SUR/ID subspace). The space U is a SUR/ID subspace of RI iff there exists a SUR subspace V of RI×N s.t. U = pI (V). Proof. (⇐) : Suppose that V is a SUR subspace of RI×N , i.e., V = VT ⊆ RI×N where T ≡ (U, IU ) is a SUR pair for RI×N . Then as in (4.2) and (4.3), . I = ∪(IU | U ∈ U), V = ×(U IU | U ∈ U). Furthermore, if VM denotes the projection of V onto RIM ×M then pI (V) = ×(VM | M ∈ N ). For all M ∈ N and U ∈ U, let UM be the projection of U onto RM . (Note that it is possible 0 for U 6= U 0 ∈ U.) Each VM is a SUR subspace of RIM ×M induced by the that UM = UM

SUR pair SM ≡ (Um , (IM )UM ), where UM = {UM | U ∈ U, M ∈ N , IU ∩ IM 6= ∅} with |UM | ≤ |{(M, U ) | (M, U ) ∈ N × U, IU ∩ IM 6= ∅}|. Thus pI (V) is a SUR/ID subspace of RI induced by the SUR/ID structure S = (SM | M ∈ N ).

CI MODELS FOR SUR WITH INCOMPLETE DATA

25

(⇒) : Suppose that U = US for some SUR/ID structure S ≡ ((UM , (IM )UM ) | M ∈ N ) for RI . For all M ∈ N , U ∈ UM , define VM,U

:=

U × RN \M ⊆ RN ,

V

:=

M,U ×(VM,U | M ∈ N , U ∈ UM ) ⊆ RI×N .

I

Let T be the SUR pair (U, IU ) with U = ∪(VM,U | M ∈ N , U ∈ UM ) and IU = (IV | V ∈ U), . where IV := ∪(IM,U | VM,U = V ). Then pI (V) = US , and V is a SUR subspace of RI×N induced by the SUR pair T. 7.2. LCI restrictions for a SUR/ID model. Let S ≡ (SM | M ∈ N ) be a SUR/ID structure for RI . The SUR/ID model N(US ) on RI inherits (possible) multimodality of the LF from both the SUR models and the ID problem, as well as the need for iterative methods to find MLEs. Here we combine the LCI theories developed for the SUR case and the ID case to find minimally restrictive LCI constraints that render the LF unimodal and allow explicit determination of the MLE. Note that for given µ and Σ, the factorization (6.19) of the LF still holds if we impose the LCI restriction Σ ∈ P(KI ) on N(US ). However, we must impose CI restrictions beyond those entailed by KI in order to obtain a factorization of the parameter space and to insure that every factor corresponds to a MANOVA model. For a (complete data) SUR model, the variables in IU with the common regression space U ∈ U are regressed on the variables in IU 0 only if the regression space U 0 ∈ U is a subspace of U , i.e. U 0 ⊆ U . For a linear ID model, the variables in IM , which are observed on exactly the subjects in M ∈ N , are regressed on the variables in IM 0 only if the variables in IM 0 are always observed with the variables in IM , i.e. M 0 ⊇ M . Combining these two ideas motivates a partial ordering on F := {(M, U ) | M ∈ N , U ∈ UM }

(7.7)

as follows. If M, M 0 ∈ N are nested as M ⊆ M 0 then we can project the space U ∈ UM 0 onto + RM ; denote the image of the projection by UM . Now define

(7.8)

(M 0 , U 0 ) ≤F (M, U ) ⇐⇒

The relation ≤F is a partial ordering:

 M 0 ⊇ M and (U 0 )+ M ⊆U .

(i) It is reflexive by definition. (ii) Moreover, if (M 00 , U 00 ) ≤F (M 0 , U 0 ) and (M 0 , U 0 ) ≤F (M, U ) then M 00 ⊇ M 0 ⊇ M . Fur0 + 0 00 + 0 0 + ther, (U 00 )+ M 0 ⊆ U and (U )M ⊆ U . Since M ⊇ M it follows that (U )M ⊆ (U )M ⊆ U ,

hence ≤F is transitive. (iii) If (M 0 , U 0 ) ≤F (M, U ) and (M, U ) ≤F (M 0 , U 0 ) then M = M 0 and U = U 0 . Thus ≤F is anti-symmetric.

26

MATHIAS DRTON, STEEN A. ANDERSSON, AND MICHAEL D. PERLMAN

By (7.7), the SUR/ID partition (7.2) can be rewritten as IF := (IF | F ∈ F).

(7.9)

Now define (compare (4.6) and (5.13)) . KF = ∪(IF 0 | F 0 ≤F F ),

(7.10)

so KF ⊆ KF 0 iff F ≤F F 0 and KF = KF 0 iff F = F 0 . The posets PF := {KF | F ∈ F}

(7.11)

and F, under inclusion and the partial ordering ≤F respectively, are isomorphic posets. We define the SUR/ID lattice KI,S ⊆ 2I to be the lattice generated by PF . In the construction of this lattice, each SUR pair SM ≡ (UM , (IM )UM ), M ∈ N , in S occurs within the corresponding layer RIM ×M of the ID sample space RI . As in §4.2 and §5.2, J (KI,S ) = PF , KF is the smallest join irreducible element of KI,S containing IF (or any i ∈ IF ), and IF = [KF ]. (Note that KI ⊆ KI,S .) For any lattice K ⊆ 2I , define the LCI-restricted SUR/ID model on RI : N(US , K) := (⊗ (N (µK , ΣK ⊗ 1NK ) | K ∈ I) | µ ∈ US , Σ ∈ P(K))

(7.12)

⊆ N(US ).

By Theorem 7.2 below, the parsimonious model N(US , KI,S ) on RI factors into a product of MANOVA models. In particular, the SUR/ID subspace US of RI is decomposed as a Cartesian +

++

product of MANOVA subspaces [US ][K] that are defined as the projections of US onto R[K]×NK , K ∈ J (KI,S ). Moreover, the K-th MANOVA model, K ∈ J (KI,S ), in the factorization in ++ ++ Theorem 7.2 arises from the conditional distribution of X[K] given XhKi , namely   ++ ++ ++ ++ (7.13) + β[Ki XhKi , Λ[K] ⊗ IN + . (X[K] | XhKi ) ∼ N[K]×N + ξ[K] K

K

Here the KI,S -parameters

++ (ξ[K] , β[Ki , Λ[K]

| K ∈ J (KI,S )) of the model N(US , KI,S ) on RI are

defined as in (6.18) but with KI,S replacing KI . Theorem 7.2 (Factorization of the parsimonious LCI-restricted SUR/ID model). Let S ≡ (SM | M ∈ N ) be a SUR/ID structure for RI . Then the LF for the model N(US , KI,S ) on RI factors as (7.14)

fµ,Σ (x) =

Y

K∈J (KI,S )

 ++  fξ++ ,β[Ki ,Λ[K] x++ [K] xhKi , [K]

 + ++ [K]×NK where fξ++ ,β[Ki ,Λ[K] x++ given by (7.13). [K] | xhKi is the LF of the MANOVA model on R [K]

The parameter space factors according to the bijective mapping   ++ φI,S : US × P(KI,S ) → × [US ][K] × R[K]×hKi × P([K]) K ∈ J (KI,S ) (7.15)   ++ , β[Ki , Λ[K] K ∈ J (KI,S ) . (µ, Σ) 7→ ξ[K]

CI MODELS FOR SUR WITH INCOMPLETE DATA

27

Proof. The proof of Theorem 6.3 applies with only two steps needing reconsideration. First, we +

++

check that for any K ∈ J (KI,S ) the subspace [US ][K] is a MANOVA subspace of R[K]×NK . By construction of J (KI,S ), the set [K] = [KF ] = IF for some F ∈ F. Hence for the unique M ∈ N ++

and U ∈ UM s.t. (M, U ) = F it follows from (7.5) that [US ][K] = U IF , which is a MANOVA subspace R

+ [K]×NK

.

Second, we show that the analogue to inclusion (6.22) holds, i.e. that for K ∈ J (K I,S ), ++

++

M([K] × hKi)[US ]hKi ⊆ [US ][K] ,

(7.16)

+

++

++

where [US ]hKi is the projection of US onto RhKi×NK . For i ∈ K, let [US ]i US onto R

+ {i}×NK

. Then, for i ∈ [K] and j ∈ hKi, (7.10) implies that

be the projection of

++ [US ]j

++

⊆ [US ]i , which

implies (7.16). The necessary and sufficient condition for almost sure existence and uniqueness of the MLE in the model N(US , KI,S ) is that + |NK | ≥ |K| + dK

(7.17)

∀K ∈ J (KI,S ),

++

where dK is the dimension of [US ][K] divided by |[K]|. More naturally, since K = KF for some ++

unique F = (M, U ) ∈ F, [US ][K] = U IF and dK = dim(U ). Remark 7.3. As in Remark 6.4, an analogue to Theorem 7.2 holds for the model N(U S , L) whenever L ⊇ KI,S . 7.3. Minimality of the LCI restrictions for a SUR/ID model. The SUR/ID lattice K I,S is the unique minimal lattice whose LCI constraints permit factorizations of the forms (7.14) and (7.15) for the LF and parameter space, respectively, in the associated LCI-restricted SUR/ID model. Theorem 7.4 (Lattice minimality for a SUR/ID model). The SUR/ID lattice K I,S is uniquely minimal among all lattices L ⊆ 2I s.t. the LCI-restricted SUR/ID model N(US , L) on RI admits a factorization of the LF and the parameter space into a product of LFs and parameter spaces of MANOVA models as in Theorem 7.2. Proof. Let L be a competing lattice permitting factorizations as in (7.14) and (7.15). Let L i , Si , and Ki be the smallest join-irreducible elements of the lattices L, KI , and KI,S containing index i ∈ I. The Li -th factor in the factorization is a MANOVA model on R proof of Theorem 6.5, Li ⊆ Si and R

+ [Li ]×NL

(7.18)

i

NL+i

=

NS+i .

Furthermore,

++ [US ][Li ]

+ [Li ]×NL

i

. Hence, as in the

is a MANOVA subspace of

and, from (7.16), it follows that ++

++

M([Li ] × hLi i)[US ]hLi i ⊆ [US ][Li ] . ++

Now (7.5) yields that [Li ] ⊆ IF ⊆ KF for some F = (M, U ) ∈ F and that [US ][Li ] = U [Li ] . . Further, the inclusion (7.18) implies that hLi i ⊆ ∪(IF 0 | F 0 ≤F F ) = KF . Therefore, Li ≡


. [Li ] ∪hLi i ⊆ KF . Since i ∈ [Li ] ⊆ IF it follows that KF = Ki and thus Li ⊆ Ki , for all i ∈ I, which implies KI,S ⊆ L by the lattice inclusion Lemma 3.1. Remark 7.5. In the SUR/ID model N(US ) on RI , the parameter (µ, Σ) ∈ US × P(I) is identifiable iff (6.25) holds. Moreover, if US = pI (VT ) is the restriction of a SUR subspace VT of RI×N induced by a SUR pair T ≡ (U, IU ) then, as in Remark 6.6, ν ∈ VT is identified by µ = pI (ν) + iff pI : VT → US is bijective. If we define UM to be the projection of U ∈ U onto RM then pI is + bijective iff dim(UM ) = dim(U ) for all M ∈ N and U ∈ U s.t. IU ∩ IM 6= ∅.

Remark 7.6. Let S be a SUR/ID structure for RI , and let T ≡ (U, IU ) be a SUR pair for RI×N s.t. the induced SUR/ID subspace US ⊆ RI is the projection onto RI of the SUR subspace VT ⊆ RI×N induced by T (compare Proposition 7.1). Then one can consider the set (7.19)

F 0 := {(M, U ) | M ∈ N , U ∈ U}

equipped with the partial ordering (7.20)

(M 0 , U 0 ) ≤F 0 (M, U ) ⇐⇒ (M 0 ⊇ M and U 0 ⊆ U ) .

The lattice induced by this partial ordering equals the lattice K(KI ∪ KT ) generated by the union of the ID lattice KI and the SUR lattice KT . However, this lattice is a larger lattice than KI,S , i.e. K(KI ∪ KT ) ⊇ KI,S , and in general KT 6⊆ KI,S (compare Example 9.6 in Section 9). Recall that smaller lattices lead to more parsimonious models.

8. Acyclic directed graph theory 8.1. Directed graphs. A graph is a pair (V, E), where V is a finite set of vertices and E ⊆ {(v, w) ∈ V × V | v 6= w} is a set of edges. An edge (v, w) ∈ E is undirected if (w, v) ∈ E and directed if (w, v) 6∈ E. Here we confine ourselves to directed graphs, i.e. graphs which contain only directed edges. We denote a (directed) edge in the directed graph D = (V, E) by v → w ∈ D. A path of length k ≥ 1 from vertex v to w in D is a sequence of distinct vertices (v 0 , v1 , . . . , vk ) s.t. v0 = v, vk = w, and vi−1 → vi ∈ D for all i = 1, . . . , k. If there exists a path from v to w, we write v