IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002
Information-Theoretic Characterizations of Conditional Mutual Independence and Markov Random Fields

Raymond W. Yeung, Senior Member, IEEE, Tony T. Lee, Senior Member, IEEE, and Zhongxing Ye, Senior Member, IEEE
Abstract—We take the point of view that a Markov random field is a collection of so-called full conditional mutual independencies. Using the theory of the $I$-Measure, we have obtained a number of fundamental characterizations related to conditional mutual independence and Markov random fields. We show that many aspects of conditional mutual independence and Markov random fields have very simple set-theoretic descriptions. New insights into the structure of conditional mutual independence and Markov random fields are obtained. Our results have immediate applications in the implication problem of probabilistic conditional independency and relational databases. Toward the end of the paper, we obtain a hypergraph characterization of a Markov random field which makes it legitimate to view a Markov random field as a hypergraph. Based on this result, we naturally employ the Graham Reduction, a tool from relational database theory, to recognize a Markov forest. This connection between Markov random fields and hypergraphs sheds some light on the possible role of hypergraph theory in the study of Markov random fields.

Index Terms—Conditional independence (CI), hypergraph, $I$-Measure, Markov random fields, relational database.
I. INTRODUCTION
A MARKOV random field is often regarded as a generalization of a one-dimensional discrete-time Markov chain in the sense that the time index for the latter is replaced by a space index for the former. Historically, the study of Markov random fields stems from statistical physics. The classical Ising model, which is defined on a rectangular lattice, was used to explain certain empirically observed facts about ferromagnetic materials. The foundation of the theory of Markov random fields may be found in [10] or [11] (also see [4]).

Manuscript received June 18, 1997; revised May 1, 1998, Jan. 18, 2000, and February 15, 2002. The work of Z. Ye was supported in part by the Chinese National Natural Science Foundation under Grant 10171066. R. W. Yeung and T. T. Lee are with the Department of Information Engineering, The Chinese University of Hong Kong, N.T., Hong Kong (e-mail: whyeung@ie.cuhk.edu.hk; [email protected]). Z. Ye is with the Department of Applied Mathematics, Shanghai Jiao Tong University, Shanghai 200030, China (e-mail: [email protected]). Communicated by S. Shamai, Associate Editor for Shannon Theory. Publisher Item Identifier S 0018-9448(02)05169-6.

It was described in [10] that the theory can be generalized to the context of an arbitrary undirected graph with the following formulation. Let $G = (V, E)$ be a graph, where $V$ is the set of vertices and $E$ is the set of edges. In this paper, all graphs are undirected, and we assume that there is no edge in $E$ which joins a vertex to itself. For any (possibly empty) subset $U$ of $V$, denote by $G \setminus U$ the graph obtained from $G$ by eliminating all the vertices in $U$ and all the edges joining a vertex in $U$. Let $s(U)$ be the number of components in $G \setminus U$, and denote the sets of vertices of these components by $V_1(U), \ldots, V_{s(U)}(U)$. If $s(U) > 1$, we say that $U$ is a cutset in $G$. We will denote by $\mathcal{X}$ the alphabet set of a random variable $X$.

Consider a collection of random variables $X_1, \ldots, X_n$ whose joint distribution is specified by a probability measure $P$ on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, where the random variable $X_i$ is associated with vertex $i$ in the graph $G$, so that $V = \{1, \ldots, n\}$. For an event $\{X_1 = x_1, \ldots, X_n = x_n\}$, we write $P(x)$ for $P(x_1, \ldots, x_n)$. Here we assume that $P(x) > 0$ for all $x$, so that the $I$-Measure [14] for $X_1, \ldots, X_n$ is well defined. To simplify notation, we will write $X_{\{i\}}$ as $X_i$, and we will not distinguish between $i$ and the singleton containing $i$. We now define two Markov properties for random variables associated with a graph $G$:

Definition 1 (Global Markov Property I (GMP-I)): Let $(U, V^1, V^2)$ be a partition of $V$ such that the sets of vertices $V^1$ and $V^2$ are disconnected in $G \setminus U$. Then the sets of random variables $X_{V^1}$ and $X_{V^2}$ are independent conditioning on $X_U$.

Definition 2 (Global Markov Property II (GMP-II)): For all cutsets $U$ in $G$, the sets of random variables $X_{V_1(U)}, \ldots, X_{V_{s(U)}(U)}$ are mutually independent conditioning on $X_U$.

Let us first show that GMP-I and GMP-II are equivalent. It is clear that GMP-II implies GMP-I. Assume GMP-I and consider any cutset $U$ in $G$. For $1 \le i \le s(U)$, consider the partition $(U, V^1, V^2)$ with $V^1 = V_i(U)$ and $V^2 = \bigcup_{j \neq i} V_j(U)$. Then $V^1$ and $V^2$ are disconnected in $G \setminus U$. By GMP-I, $X_{V^1}$ and $X_{V^2}$ are independent conditioning on $X_U$. This implies that $X_{V_1(U)}, \ldots, X_{V_{s(U)}(U)}$ are mutually independent conditioning on $X_U$. Thus GMP-I implies GMP-II, and hence GMP-I and GMP-II are equivalent. Henceforth, we will not distinguish GMP-I and GMP-II, and we will refer to both of them as the global Markov property (GMP). When $U = \emptyset$, GMP states that if the graph $G$ representing a probability measure $P$ has more than one component, i.e., $s(\emptyset) > 1$, then the sets of random variables $X_{V_1(\emptyset)}, \ldots, X_{V_{s(\emptyset)}(\emptyset)}$ are mutually independent. Here we
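The graph-theoretic notions above — the components $V_1(U), \ldots, V_{s(U)}(U)$ of $G \setminus U$ and the cutset condition $s(U) > 1$ — can be made concrete with a short sketch (the adjacency-list representation and function names here are ours, not the paper's):

```python
from collections import deque

def components(vertices, edges, removed=frozenset()):
    """Connected components of G \\ U: delete the vertices in `removed`
    together with all edges incident to them, then run BFS."""
    alive = set(vertices) - set(removed)
    adj = {v: set() for v in alive}
    for a, b in edges:
        if a in alive and b in alive:
            adj[a].add(b)
            adj[b].add(a)
    comps, seen = [], set()
    for v in alive:
        if v in seen:
            continue
        queue, comp = deque([v]), set()
        seen.add(v)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u] - seen:
                seen.add(w)
                queue.append(w)
        comps.append(comp)
    return comps

def is_cutset(vertices, edges, U):
    """U is a cutset iff s(U), the number of components of G \\ U, exceeds 1."""
    return len(components(vertices, edges, U)) > 1

# The path graph 1-2-3-4: removing vertex 2 disconnects {1} from {3, 4}.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4)]
print(is_cutset(V, E, {2}))   # True: vertex 2 is a cutset, s({2}) = 2
print(is_cutset(V, E, {4}))   # False: removing a leaf leaves G connected
```

For the chain "1–2–3–4" used repeatedly later in the paper, this reports $\{2\}$ as a cutset with $s(\{2\}) = 2$, while a leaf such as $\{4\}$ is not a cutset.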
regard unconditional independence as a special case of a Markov condition.

Definition 3 (Markov Random Field): The probability measure $P$, or equivalently, the random variables $X_1, \ldots, X_n$, are said to form a Markov random field represented by a graph $G$ if and only if the GMP for $G$ is satisfied by $X_1, \ldots, X_n$.

If $X_1, \ldots, X_n$ form a Markov random field represented by a graph $G$, we also say that $X_1, \ldots, X_n$ form a Markov graph $G$, or are represented by $G$. When $G$ is a chain, we say that $X_1, \ldots, X_n$ form a Markov chain. When $G$ is a tree, we say that $X_1, \ldots, X_n$ form a Markov tree. When $G$ is a forest, i.e., a graph consisting of one or more disjoint trees, we say that $X_1, \ldots, X_n$ form a Markov forest.

We point out that if $X_1, \ldots, X_n$ are represented by a graph $G$, it is not necessary that all the unconditional/conditional independencies among $X_1, \ldots, X_n$ are indicated in $G$. This may seem strange at first, but this interpretation is in fact consistent with our usual interpretation of a Markov chain, which is the starting point of the theory of Markov random fields. To illustrate our point, we consider the following example. For three mutually independent random variables $X_1, X_2$, and $X_3$, we say that they form the Markov chain "1–2–3," although the relation that $X_1$ and $X_3$ are independent conditioning on $X_2$ is not indicated in the chain. In general, $X_1, \ldots, X_n$ can be represented by more than one graph. In particular, it is easy to check from Definition 3 that $X_1, \ldots, X_n$ are always represented by $K_n$, the complete graph with $n$ vertices. Obviously, the graph $K_n$ specifies a degenerate Markov random field.

In essence, if $X_1, \ldots, X_n$ are represented by a graph $G$, then all the Markov conditions which are indicated in $G$ must be valid. However, it is not necessary that all valid Markov conditions are indicated in $G$.

Definition 4: A conditional mutual independency (CMI) on $X_1, \ldots, X_n$ is full if all of $X_1, \ldots, X_n$ are involved in the relation.

In the definition of GMP, each cutset $U$ in $G$ specifies a full CMI on $X_1, \ldots, X_n$, denoted by $\mathrm{FCI}(U)$. Formally, $\mathrm{FCI}(U)$ denotes the relation that $X_{V_1(U)}, \ldots, X_{V_{s(U)}(U)}$ are mutually independent conditioning on $X_U$. For a collection of cutsets $U_1, \ldots, U_k$ in $G$, we introduce the notation

$\mathrm{FCI}(U_1) \wedge \mathrm{FCI}(U_2) \wedge \cdots \wedge \mathrm{FCI}(U_k)$

where "$\wedge$" denotes "logical AND." Using this notation, $X_1, \ldots, X_n$ are represented by a graph $G$ if and only if

$\bigwedge_{U :\, U \text{ is a cutset in } G} \mathrm{FCI}(U). \qquad (1)$

Therefore, a Markov random field is simply a collection of full CMIs induced by a graph. We are interested in characterizations for $X_1, \ldots, X_n$ being represented by a graph $G$. The conditions given in (1) are such a characterization, but, in general, it is redundant and can be reduced. Let us consider the Markov chain "1–2–3–4." The characterization in (1) gives

$\mathrm{FCI}(\{2\}) \wedge \mathrm{FCI}(\{3\}) \wedge \mathrm{FCI}(\{2,3\}) \wedge \mathrm{FCI}(\{1,3\}) \wedge \mathrm{FCI}(\{2,4\}). \qquad (2)$

As we will see in Section III, $\mathrm{FCI}(\{2,3\})$, $\mathrm{FCI}(\{1,3\})$, and $\mathrm{FCI}(\{2,4\})$ are, in fact, implied by $\mathrm{FCI}(\{2\}) \wedge \mathrm{FCI}(\{3\})$. So, the characterization given in (1) is in general reducible.

In this paper, we take the point of view that a Markov random field is a collection of full CMIs. The $I$-Measure [14] is the main tool used in this paper, so we give a review of this theory in Section II. In Section III, we prove a basic property of a CMI. We also discuss a graph-theoretic analog of this property. This property can be used to simplify a collection of full CMIs as well as to simplify the characterization of a Markov random field. In Section IV, we obtain an $I$-Measure characterization of a CMI. If the CMI is full, then it corresponds to the $I$-Measure vanishing on a certain set of atoms called the image of the CMI. For a set of full CMIs, its image is simply the union of the images of the individual CMIs. A set of full CMIs is completely characterized by its image, and we call the image the canonical representation of the set of full CMIs. This is discussed in Section V. The aforementioned results lead to the $I$-Measure characterization of a Markov random field in Section VI. In Section VII, we introduce a canonical form for information expressions when the random variables involved form a Markov random field, and we show the uniqueness of this canonical form for very general classes of information expressions. In this section, we also discuss the dimension of Shannon information measures of a Markov random field. Sections VIII and IX are about an interesting connection between a Markov random field and a hypergraph. This connection is established in Section VIII. In Section IX, we discuss the use of the Graham Reduction in relational database theory to recognize a Markov forest. Concluding remarks are in Section X.

II. $I$-MEASURE PRELIMINARIES

In this section, we give a review of the main results regarding the $I$-Measure. For a detailed discussion of the theory, we refer the reader to [14], [22], [17]. Further results on the $I$-Measure can be found in [3].

Let $X_1, \ldots, X_n$ be jointly distributed discrete random variables, and let $\tilde X_i$ be a set corresponding to the random variable $X_i$. Define the universal set $\Omega$ to be $\bigcup_{i=1}^{n} \tilde X_i$ and let $\mathcal{F}_n$ be the $\sigma$-field generated by $\tilde X_1, \ldots, \tilde X_n$. The atoms of $\mathcal{F}_n$ have the form $\bigcap_{i=1}^{n} Y_i$, where $Y_i$ is either $\tilde X_i$ or $\tilde X_i^c$. Let $\mathcal{A}$ be the set of all the atoms of $\mathcal{F}_n$ except for $\bigcap_{i=1}^{n} \tilde X_i^c$, which is equal to the empty set by construction because $\Omega = \bigcup_{i=1}^{n} \tilde X_i$.

Note that $|\mathcal{A}| = 2^n - 1$. In the rest of the paper, when we refer to an atom of $\mathcal{F}_n$, we always mean an atom of $\mathcal{F}_n$ in $\mathcal{A}$. To simplify notation, we will use $\tilde X_U$ to denote $\bigcup_{i \in U} \tilde X_i$ and $X_U$ to denote $(X_i, i \in U)$ for any nonempty $U \subseteq N_n$, where $N_n = \{1, \ldots, n\}$. It was shown in [14] that there exists a signed
measure $\mu^*$ on $\mathcal{F}_n$ which is consistent with all Shannon information measures via the following substitution of symbols: for any (not necessarily disjoint) nonempty subsets $G$, $G'$, and $G''$ of $N_n$,

$\mu^*(\tilde X_G \cap \tilde X_{G'} \setminus \tilde X_{G''}) = I(X_G; X_{G'} \mid X_{G''}). \qquad (3)$

When $G'' = \emptyset$, we interpret (3) as $\mu^*(\tilde X_G \cap \tilde X_{G'}) = I(X_G; X_{G'})$. When $G = G'$, (3) becomes $\mu^*(\tilde X_G \setminus \tilde X_{G''}) = H(X_G \mid X_{G''})$. When $G = G'$ and $G'' = \emptyset$, (3) becomes

$\mu^*(\tilde X_G) = H(X_G). \qquad (4)$

Thus, (3) covers all the cases of Shannon information measures. Let

$\mathcal{B} = \{\tilde X_G : G \text{ is a nonempty subset of } N_n\}.$

Note that $|\mathcal{B}| = 2^n - 1$. Let $u$ be the $(2^n - 1)$-dimensional (column) vector of the values of $\mu^*$ on all the atoms of $\mathcal{F}_n$, and $h$ be the $(2^n - 1)$-vector of the values of $\mu^*$ on all the sets in $\mathcal{B}$, or, equivalently, all the joint entropies involving the random variables $X_1, \ldots, X_n$. Then

$h = C_n u \qquad (5)$

where $C_n$ is a unique $(2^n - 1) \times (2^n - 1)$ matrix (independent of $\mu^*$). An important characteristic of $C_n$ is that it is invertible [14], so we can write

$u = C_n^{-1} h. \qquad (6)$

In other words, $\mu^*$ is completely specified by the set of values $H(X_G)$, $G \in \mathcal{B}$, namely, all the joint entropies involving $X_1, \ldots, X_n$; and by virtue of (4), $\mu^*$ is the unique measure on $\mathcal{F}_n$ which is consistent with all Shannon information measures. Note that $\mu^*$ in general is not nonnegative. However, if $X_1, \ldots, X_n$ form a Markov chain, $\mu^*$ is always nonnegative [3].

Let $h_G$, $G \in \mathcal{B}$, be the coordinates of $\mathcal{H}_n = \mathbb{R}^{2^n - 1}$. Then for any probability measure $P$ on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, the corresponding joint entropies are represented by a vector in $\mathcal{H}_n$. On the other hand, a generic vector $h \in \mathcal{H}_n$ is said to be entropic if it corresponds to the joint entropies associated with some probability measure on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. Define the set

$\Gamma_n^* = \{h \in \mathcal{H}_n : h \text{ is entropic}\}.$

It is well known that for any $P$, quantities of the form given in (3), namely, all Shannon information measures, are always nonnegative. These inequalities, called the basic inequalities [16], comprise an outer bound on $\Gamma_n^*$. Recently, a new outer bound on $\Gamma_n^*$ in the form of an information inequality (i.e., an inequality involving only Shannon information measures) was obtained in [20]. This inequality is the first of its kind ever discovered. The characterization of $\Gamma_n^*$ is one of the most fundamental and difficult problems in information theory [16]. See [22] for a comprehensive discussion of information inequalities. Define the set

$\Psi_n^* = \{u \in \mathbb{R}^{2^n - 1} : C_n u \in \Gamma_n^*\}.$

In light of (5) and (6), a vector $u$ is in $\Psi_n^*$ if and only if its components are the values of a valid $I$-Measure on the atoms of $\mathcal{F}_n$. We now prove a useful characterization of $\Psi_n^*$.

Theorem 1: $\Psi_n^*$ contains the positive orthant of $\mathbb{R}^{2^n - 1}$.

Proof: It suffices to construct a $\mu^*$ which can take any set of nonnegative values on the nonempty atoms of $\mathcal{F}_n$. Recall that $\mathcal{A}$ is the set of all nonempty atoms of $\mathcal{F}_n$. Let $\{Y_A : A \in \mathcal{A}\}$ be mutually independent random variables. Now define the random variables $X_i$, $i = 1, \ldots, n$, by

$X_i = (Y_A : A \in \mathcal{A}, A \subseteq \tilde X_i). \qquad (7)$

We determine the $I$-Measure $\mu^*$ for $X_1, \ldots, X_n$ so defined as follows. Since the $Y_A$ are mutually independent, for any nonempty subset $G$ of $N_n$, we have

$H(X_G) = \sum_{A \in \mathcal{A}:\, A \subseteq \tilde X_G} H(Y_A). \qquad (8)$

On the other hand,

$H(X_G) = \mu^*(\tilde X_G) = \sum_{A \in \mathcal{A}:\, A \subseteq \tilde X_G} \mu^*(A). \qquad (9)$

Equating the right-hand sides of (8) and (9), we have

$\sum_{A \in \mathcal{A}:\, A \subseteq \tilde X_G} H(Y_A) = \sum_{A \in \mathcal{A}:\, A \subseteq \tilde X_G} \mu^*(A). \qquad (10)$

Evidently, we can make the above equality hold for all nonempty subsets $G$ of $N_n$ by taking

$\mu^*(A) = H(Y_A) \qquad (11)$

for all $A \in \mathcal{A}$. By the uniqueness of $\mu^*$, this is also the only possibility for $\mu^*$. Since $H(Y_A)$ can take any nonnegative value, $\mu^*$ can take any set of nonnegative values on the nonempty atoms of $\mathcal{F}_n$. The theorem is proved.

Theorem 2: If $u, u' \in \Psi_n^*$, then $u + u' \in \Psi_n^*$.

Proof: Consider any $u, u' \in \Psi_n^*$. Let $h = C_n u$ represent the joint entropies of the random variables $X_1, \ldots, X_n$, and let $h' = C_n u'$ represent the joint entropies of the random variables $X_1', \ldots, X_n'$. Let $(X_1, \ldots, X_n)$ and $(X_1', \ldots, X_n')$ be independent, and define random variables $Z_i$ by

$Z_i = (X_i, X_i') \qquad (12)$

for all $i = 1, \ldots, n$. Then for any nonempty subset $G$ of $N_n$,

$H(Z_G) = H(X_G) + H(X_G'). \qquad (13)$

Therefore, $h + h'$, which represents the joint entropies of the random variables $Z_1, \ldots, Z_n$, is in $\Gamma_n^*$. It follows that $u + u' = C_n^{-1}(h + h')$ is in $\Psi_n^*$. The theorem is proved.

The following corollary, which is apparent in the previous works but has never been stated explicitly, follows directly from the last two theorems.

Corollary 1: If $u \in \Psi_n^*$, then $u + u' \in \Psi_n^*$ for any $u'$ in the nonnegative orthant of $\mathbb{R}^{2^n - 1}$.
With this corollary, it is easy to perturb the values of an $I$-Measure on the atoms individually. Specifically, for any given $\mu^*$, we can increase the value of $\mu^*$ on a single atom while keeping its value on all the other atoms fixed. On the other hand, for a given set of joint entropies, it is in general very difficult to perturb a single joint entropy while keeping all other joint entropies fixed. This property of the $I$-Measure makes it an extremely useful tool for studying the structure of Shannon information measures.

We define the dimension of Shannon information measures as the dimension of the smallest subspace of $\mathcal{H}_n$ containing $\Gamma_n^*$. It follows from Theorem 1 that this smallest subspace is $\mathcal{H}_n$ itself. Thus, we see that the dimension of Shannon information measures for $n$ random variables is equal to $2^n - 1$.

To conclude, the theory of the $I$-Measure enables the use of the language and the rich set of tools in set theory to study the structure of Shannon information measures. As a consequence of the theory of the $I$-Measure, the information diagram (a special case of the Venn diagram) was introduced as a tool to visualize the relationship among information measures [14]. Fig. 1 shows the information diagram for the random variables $X_1, X_2, X_3$. Examples of applications of information diagrams can be found in [14], [3], [17], [15], and [19]. In [15] and [19], information diagrams were used for proving converse coding theorems.
Fig. 1. The information diagram for $X_1, X_2, X_3$.
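The linear relation in (5) and (6) can be checked numerically. The sketch below (our own construction, using $0$-based indices and one fixed enumeration of the nonempty subsets; it requires NumPy) recovers the values of $\mu^*$ on the $2^n - 1$ atoms from the joint entropies by solving the system $H(X_G) = \mu^*(\tilde X_G) = \sum_{U : U \cap G \neq \emptyset} \mu^*(A_U)$, where $A_U$ is the atom whose uncomplemented index set is $U$:

```python
import itertools
import math
import numpy as np

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, G):
    """Marginalize a joint pmf over tuples onto the coordinates in G."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in sorted(G))
        out[key] = out.get(key, 0.0) + p
    return out

def i_measure(joint, n):
    """Solve h = C_n u for u, the values of mu* on the 2^n - 1 atoms.
    Rows/columns are indexed by nonempty subsets of {0,...,n-1}; the
    (G, U) entry of C_n is 1 exactly when U intersects G, since the
    atom A_U lies inside the union X~_G iff U meets G."""
    subsets = [frozenset(s) for k in range(1, n + 1)
               for s in itertools.combinations(range(n), k)]
    C = np.array([[1.0 if U & G else 0.0 for U in subsets] for G in subsets])
    h = np.array([entropy(marginal(joint, G)) for G in subsets])
    u = np.linalg.solve(C, h)
    return dict(zip(subsets, u))

# Example: X1, X2 fair independent bits, X3 their modulo-2 sum.
joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
u = i_measure(joint, 3)
print(round(u[frozenset({0, 1, 2})], 6))   # -1.0 on the triple atom
```

For this distribution $\mu^*(\tilde X_1 \cap \tilde X_2 \cap \tilde X_3) = -1$ while $\mu^*(\tilde X_1 \cap \tilde X_2 \cap \tilde X_3^c) = 1$, illustrating that $\mu^*$ is in general a signed measure.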
III. A BASIC PROPERTY OF CONDITIONAL MUTUAL INDEPENDENCE (CMI)

In this paper, we take the point of view that a Markov random field is a collection of full CMIs. In this section, we first prove a basic property of CMI.

Theorem 3: Let $C$ and $T_i$, $1 \le i \le k$, be disjoint index sets, and let $W_i$ be a subset of $T_i$ for $1 \le i \le k$, where $k \ge 2$. Assume that there exist at least two $i$ such that $W_i \neq \emptyset$. Let $X_C$ and $X_{T_i}$, $1 \le i \le k$, be collections of random variables. If $X_{T_1}, \ldots, X_{T_k}$ are mutually independent conditioning on $X_C$, i.e.,

$H(X_{T_1}, \ldots, X_{T_k} \mid X_C) = \sum_{i=1}^{k} H(X_{T_i} \mid X_C) \qquad (14)$

then $X_{W_1}, \ldots, X_{W_k}$ are mutually independent conditioning on $X_{C \cup \bar W}$, where $\bar W = \bigcup_{i=1}^{k} (T_i \setminus W_i)$, i.e.,

$H(X_{W_1}, \ldots, X_{W_k} \mid X_{C \cup \bar W}) = \sum_{i=1}^{k} H(X_{W_i} \mid X_{C \cup \bar W}). \qquad (15)$

Proof: Assume that (14) is true. Consider

$H(X_{W_1}, \ldots, X_{W_k} \mid X_{C \cup \bar W}) = H(X_{T_1}, \ldots, X_{T_k} \mid X_C) - H(X_{\bar W} \mid X_C)$
$\qquad = \sum_{i=1}^{k} H(X_{T_i} \mid X_C) - H(X_{\bar W} \mid X_C)$
$\qquad \ge \sum_{i=1}^{k} H(X_{T_i} \mid X_C) - \sum_{i=1}^{k} H(X_{T_i \setminus W_i} \mid X_C)$
$\qquad = \sum_{i=1}^{k} H(X_{W_i} \mid X_{C \cup (T_i \setminus W_i)})$
$\qquad \ge \sum_{i=1}^{k} H(X_{W_i} \mid X_{C \cup \bar W}).$

In the second step we have used (14), and the inequalities follow because conditioning reduces entropy. On the other hand, by the chain rule

$H(X_{W_1}, \ldots, X_{W_k} \mid X_{C \cup \bar W}) = \sum_{i=1}^{k} H(X_{W_i} \mid X_{C \cup \bar W}, X_{W_1}, \ldots, X_{W_{i-1}}).$

Therefore,

$\sum_{i=1}^{k} H(X_{W_i} \mid X_{C \cup \bar W}, X_{W_1}, \ldots, X_{W_{i-1}}) \ge \sum_{i=1}^{k} H(X_{W_i} \mid X_{C \cup \bar W}).$

However, since conditioning reduces entropy, the $i$th term in the summation on the left-hand side is upper-bounded by the $i$th term in the summation on the right-hand side. Thus, we conclude that the above inequality is tight, and hence (15) holds. The theorem is proved.

Remark: The theorem can be proved more directly by considering the joint distribution. However, we prove the theorem by means of the basic inequalities in order to show that the subsequent characterizations of CMI and Markov random fields are all consequences of the basic inequalities.
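Theorem 3 can be spot-checked numerically. In the sketch below (the distribution is a toy example of ours), $T_1 = \{1, 2\}$ and $T_2 = \{3, 4\}$ index two mutually independent correlated pairs, so (14) holds with $C = \emptyset$; taking $W_1 = \{1\}$ and $W_2 = \{3\}$, the theorem predicts $I(X_1; X_3 \mid X_2, X_4) = 0$ (indices are $0$-based in the code):

```python
import itertools, math

def H(joint, G):
    """Joint entropy (bits) of the coordinates in G under pmf {tuple: prob}."""
    marg = {}
    for x, p in joint.items():
        k = tuple(x[i] for i in sorted(G))
        marg[k] = marg.get(k, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, A, B, C):
    """I(X_A; X_B | X_C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    A, B, C = set(A), set(B), set(C)
    return H(joint, A | C) + H(joint, B | C) - H(joint, A | B | C) - H(joint, C)

# X2 is a noisy copy of X1; X4 is a noisy copy of X3; the two pairs
# are independent, so X_{1,2} and X_{3,4} are independent (C empty).
pair = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
joint = {(a, b, c, d): pair[(a, b)] * pair[(c, d)]
         for a, b, c, d in itertools.product((0, 1), repeat=4)}

# Theorem 3 with W1 = {1} of T1 = {1,2} and W2 = {3} of T2 = {3,4}:
# X1 and X3 must be independent conditioning on X_{2,4}.
print(abs(cond_mi(joint, {0}, {2}, {1, 3})) < 1e-9)   # True
```

Within each pair the variables remain dependent (e.g., $I(X_1; X_2) > 0$), so the vanishing conditional mutual information above is not an artifact of a degenerate distribution.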
Fig. 2. A probability measure satisfying the local Markov property but not the GMP.
The following proposition is the graph-theoretic analog of Theorem 3. The proof is trivial and is omitted.

Proposition 1: Let $U$ and $V_1, \ldots, V_k$ be disjoint subsets of the vertex set $V$ of a graph $G$, and let $W_i$ be a subset of $V_i$ for $1 \le i \le k$, where $k \ge 2$. Assume that there exist at least two $i$ such that $W_i \neq \emptyset$. If $V_1, \ldots, V_k$ are disconnected from each other in $G \setminus U$, then those $W_i$ which are nonempty are disconnected from each other in $G \setminus (U \cup \bigcup_{i=1}^{k} (V_i \setminus W_i))$.

This proposition and Theorem 3 establish an analogy between the structure of CMI and the connectivity of a graph. This analogy will play a key role in proving the $I$-Measure characterization of a Markov random field in Section VI. This characterization is the bridge for the connection between a Markov random field and a hypergraph discussed in Section VIII. Theorem 3 specifies a set of CMIs which is implied by a CMI. This theorem will be useful when we discuss the effect of a CMI on the structure of Shannon information measures in the next section. Together with Proposition 1, it will be used to prove the $I$-Measure characterization of a Markov random field in Section VI.

We end this section with an application of Theorem 3. Toward the end of Section I, we claimed that the last three full CMIs in (2) are implied by the first two. By Theorem 3, we see that $\mathrm{FCI}(\{1,3\})$ is implied by $\mathrm{FCI}(\{3\})$, $\mathrm{FCI}(\{2,4\})$ is implied by $\mathrm{FCI}(\{2\})$, and $\mathrm{FCI}(\{2,3\})$ is implied by either $\mathrm{FCI}(\{2\})$ or $\mathrm{FCI}(\{3\})$. Thus, we see that the Markov chain "1–2–3–4" is completely characterized by $\mathrm{FCI}(\{2\}) \wedge \mathrm{FCI}(\{3\})$. We now further show that $\mathrm{FCI}(\{2\})$ and $\mathrm{FCI}(\{3\})$ do not imply each other. Let $X$ be a random variable such that $H(X) > 0$. If we let $X_1 = X_2 = X_4 = X$ and $X_3$ be a constant, then $\mathrm{FCI}(\{2\})$ is satisfied but $\mathrm{FCI}(\{3\})$ is not satisfied. So, $\mathrm{FCI}(\{2\})$ does not imply $\mathrm{FCI}(\{3\})$. Similarly, we can show that $\mathrm{FCI}(\{3\})$ does not imply $\mathrm{FCI}(\{2\})$ by letting $X_1 = X_3 = X_4 = X$ and $X_2$ be a constant. In this sense, we say that $\mathrm{FCI}(\{2\}) \wedge \mathrm{FCI}(\{3\})$ is an irreducible characterization of the Markov chain "1–2–3–4."

In the above example, we have seen how we can simplify the characterization of a Markov chain from five full CMIs to two CMIs by applications of Theorem 3. In general, for any set of full CMIs, by keeping those CMIs which are not implied by any other CMI in the set via an application of Theorem 3, we may be able to reduce the set. Since a Markov random field is a set of (full) CMIs, we can also apply this technique to reduce the set. We originally suspected that the remaining set is always irreducible, but we later discovered the following counterexample with the help of ITIP (software codeveloped by one of the authors) [18].

Example 1: Consider the Markov random field in Fig. 2. Upon eliminating redundant full CMIs via applications of Theorem 3, the Markov random field is specified by a certain reduced set of full CMIs. It will be shown later in Example 4 that two of the remaining full CMIs are implied by the others, so the characterization of the Markov random field in Fig. 2 can be simplified further.
This example shows that Theorem 3 alone is not powerful enough to eliminate all the redundant CMIs in a given set of CMIs, not even when all the CMIs are full. It appears that finding an irreducible subset of a set of CMIs (full or not) is in general a very difficult task.

IV. CMI AND THE $I$-MEASURE

In this section, we study the effect of a CMI on the structure of Shannon information measures in terms of the $I$-Measure. The advantage of using the $I$-Measure will become clear when we handle more than one full CMI simultaneously, for example, in a Markov random field.

Lemma 1: Let $X_{T_1}, \ldots, X_{T_k}$ be collections of random variables, where $k \ge 2$, and let $X_C$ be a collection of random variables such that $X_{T_1}, \ldots, X_{T_k}$ are mutually independent conditioning on $X_C$. Then

$\mu^*\left(\bigcap_{i=1}^{k} \tilde X_{T_i} \setminus \tilde X_C\right) = 0.$
Lemma 2: Let $\mu$ be a set-additive function, and let $S$ and $S_1, \ldots, S_k$ be sets, where $k \ge 2$. Then
and
(16)
where
is any proper subset of , , and the inner sum runs over all subsets of size
of
. Proof: See Appendix A. Proof of Lemma 1: We will prove the lemma using the set identity in (16). We note that for the terms inside the square , then the first term and bracket in (16), if the third term cancel each other, while the second term is zero. Thus, the terms in the square bracket sum to zero. The same is . Therefore, we only have to consider true if for which both and are nonempty. . In (16), let , We first prove the case for for , for , , and . Then (16) becomes
Using Theorem 3 and this lemma, we now prove the following important result.

Theorem 4: Let $C$ and $T_1, \ldots, T_k$ be disjoint index sets, where $k \ge 2$ and $C \cup T_1 \cup \cdots \cup T_k = N_n$, and let $X_C, X_{T_1}, \ldots, X_{T_k}$ be collections of random variables. Then $X_{T_1}, \ldots, X_{T_k}$ are mutually independent conditioning on $X_C$ if and only if the following holds: for any $W_i \subseteq T_i$, $1 \le i \le k$, if there exist at least two $i$ such that $W_i \neq \emptyset$, then

$\mu^*(A) = 0 \qquad (19)$

where $A$ is the atom in (20) whose set of uncomplemented indices is $W_1 \cup \cdots \cup W_k$.

Proof: We first prove the "if" part. Assume that for any $W_i \subseteq T_i$, $1 \le i \le k$, if there exist at least two $i$ such that $W_i \neq \emptyset$, then (19) holds. Thus,
where
consists of sets of the form (20)
(17) The terms inside the square bracket can be written as
for and there exists such with . By our assumption, if is such that there that for which , then . exists is possibly nonzero, then must be such Therefore, if for which . Now that there exists a unique , let be the set consisting of sets of the form in for , , and for . In other (20) with consists of atoms of the form words,
(18) As we have mentioned at the beginning of the proof, we and only need to consider the case when both are nonempty. Since and are independent conditioning on , we see that the quantity in (18) is equal to zero. Thus, the . lemma is proved for , we write For
with
and
. Then
Now
We then apply the lemma for
to see that Since
The lemma is proved.
is set-additive, we have
cises. We first illustrate these two propositions in the following example.
Hence, we have
Example 2: Consider
and full CMIs
and Then Im
By Lemma 4 in Appendix C, $X_{T_1}, \ldots, X_{T_k}$ are mutually independent conditioning on $X_C$. We now prove the "only if" part. Assume that $X_{T_1}, \ldots, X_{T_k}$ are mutually independent conditioning on $X_C$. For any collection of sets $W_i \subseteq T_i$, $1 \le i \le k$, if there exist at least two $i$ such that $W_i \neq \emptyset$, then by Theorem 3, $X_{W_1}, \ldots, X_{W_k}$ are mutually independent conditioning on $X_{C \cup \bigcup_i (T_i \setminus W_i)}$. By Lemma 1, we obtain (19). The theorem is proved.
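Theorem 4 quantifies over every choice of subsets $W_i \subseteq T_i$ with at least two of them nonempty. These choices, together with the conditioning sets $C \cup \bigcup_i (T_i \setminus W_i)$ produced by Theorem 3, can be enumerated mechanically; the sketch below (the data representation and function names are ours) does so for a small example:

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, including the empty set, as frozensets."""
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def implied_cmis(blocks, cond):
    """All CMIs implied via Theorem 3 by 'X_{T_1},...,X_{T_k} mutually
    independent given X_C': pick W_i subset of T_i (at least two nonempty);
    the nonempty W_i are mutually independent given C union (T_i - W_i)."""
    out = set()
    choices = [subsets(T) for T in blocks]
    def rec(i, picked):
        if i == len(blocks):
            Ws = [W for W in picked if W]
            if len(Ws) >= 2:
                extra = frozenset().union(
                    *(frozenset(T) - W for T, W in zip(blocks, picked)))
                out.add((frozenset(Ws), frozenset(cond) | extra))
            return
        for W in choices[i]:
            rec(i + 1, picked + [W])
    rec(0, [])
    return out

# FCI({2}) for the chain 1-2-3-4: X_{1} and X_{3,4} independent given X_2.
cmis = implied_cmis([{1}, {3, 4}], {2})
# Among the implied CMIs: X_1 and X_3 independent given X_{2,4}.
target = (frozenset({frozenset({1}), frozenset({3})}), frozenset({2, 4}))
print(target in cmis)   # True
```

For $\mathrm{FCI}(\{2\})$ of the Markov chain "1–2–3–4" (blocks $\{1\}$ and $\{3,4\}$, conditioning set $\{2\}$), the enumeration recovers, among others, the CMI underlying $\mathrm{FCI}(\{2,4\})$, in agreement with the reduction carried out in Section III.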
and Im
Theorem 5: Let $\mathrm{FCI}$ be a full CMI on $X_1, \ldots, X_n$. Then $\mathrm{FCI}$ holds if and only if $\mu^*(A) = 0$ for all $A \in \mathrm{Im}(\mathrm{FCI})$.

Proof: First, (21) is true if $\mathrm{FCI}$ is a full CMI. As discussed above, the set in (20) is an atom of $\mathcal{F}_n$. The theorem can then be proved by a direct application of Theorem 4 to the full CMI $\mathrm{FCI}$.
V. A CANONICAL REPRESENTATION OF FULL CMIS

As in the foregoing, we let $N_n = \{1, \ldots, n\}$. In Theorem 4, suppose
Let
be a nonempty atom of
. Define the set (23)
Note that
is uniquely specified by
because
(21) defines a full CMI
Then,
on
as follows: are mutually independent conditioning on The set in (20) can be written as
(24)

Define the weight of the atom $A$ as the number of the $\tilde X_i$ in $A$ which are not complemented. We now show that a full CMI is uniquely specified by $\mathrm{Im}(\mathrm{FCI})$. First, by letting $W_i = T_i$ for $1 \le i \le k$ in Definition 5, we see that the atom
(22) which is seen to be an atom of . Recall that is the set of nonempty atoms of . Theorem 4 asserts that if holds, then vanishes on a certain set of atoms of . We call this subset of the image of , which is formally defined below. for Definition 5: For any full CMI , is the subset of image of , denoted by Im , of atoms of the form in (22), where . there exist at least two such that Proposition 2: Let independency (FCI) on
, the consisting , and
Proposition 3: Let . Then
in be a full CMI
Im
These two propositions greatly simplify the description of . Their proofs are elementary and they are left as exer-
Im
; or i) ii) there exists an atom of the form
be a full conditional . Then
Im
on
is in Im , and it is the unique atom in Im with the largest can be determined. To determine weight. From this atom, , we define a relation on as follows. , is in if and only if For
Im
, where
or
.
Recall that $\mathcal{A}$ is the set of nonempty atoms of $\mathcal{F}_n$. The relation $R$ is reflexive and symmetric by construction, and is transitive by virtue of the structure of $\mathrm{Im}(\mathrm{FCI})$. In other words, $R$ is an equivalence relation which partitions $N_n \setminus C$ into $T_1, \ldots, T_k$. Therefore, $\mathrm{FCI}$ and $\mathrm{Im}(\mathrm{FCI})$ uniquely specify each other.

The image of a full CMI completely characterizes the effect of the CMI on the $I$-Measure for $X_1, \ldots, X_n$. The joint effect of more than one full CMI can easily be described in terms of the images of the individual full CMIs. Let

$\Pi = \{\mathrm{FCI}_1, \mathrm{FCI}_2, \ldots, \mathrm{FCI}_m\}$

be a set of full CMIs. By Theorem 5, $\mathrm{FCI}_j$ holds if and only if $\mu^*$ vanishes on the atoms in $\mathrm{Im}(\mathrm{FCI}_j)$. Then $\mathrm{FCI}_1, \ldots, \mathrm{FCI}_m$ hold simultaneously if and only if $\mu^*$ vanishes on the atoms in $\bigcup_{j=1}^{m} \mathrm{Im}(\mathrm{FCI}_j)$. This is summarized as follows.

Definition 6: The image of a set $\Pi$ of full CMIs is defined as

$\mathrm{Im}(\Pi) = \bigcup_{\mathrm{FCI} \in \Pi} \mathrm{Im}(\mathrm{FCI}).$

Theorem 6: Let $\Pi$ be a set of full CMIs for $X_1, \ldots, X_n$. Then $\Pi$ holds if and only if $\mu^*(A) = 0$ for all $A \in \mathrm{Im}(\Pi)$.

In linear system theory, when two signals convolute in the time domain, their transforms in the frequency domain multiply with each other. In probability theory, one can think of the image of a full CMI as the "transform" of the full CMI. When two full CMIs are imposed simultaneously, the union of the two images is taken.

In probability problems, we are often given a set of conditional independencies and we need to see whether another given conditional independency is logically implied. This is called the implication problem. The next theorem gives a solution to this problem if only full CMIs are involved.

Theorem 7: Let $\Pi_1$ and $\Pi_2$ be two sets of full CMIs. Then $\Pi_1$ implies $\Pi_2$ if and only if $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$.

Proof: We first prove that if $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$, then $\Pi_1$ implies $\Pi_2$. Assume $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$ and that $\Pi_1$ holds. Then, by Theorem 6, $\mu^*(A) = 0$ for all $A \in \mathrm{Im}(\Pi_1)$. Since $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$, this implies that $\mu^*(A) = 0$ for all $A \in \mathrm{Im}(\Pi_2)$. Again by Theorem 6, this implies that $\Pi_2$ also holds. Therefore, if $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$, then $\Pi_1$ implies $\Pi_2$.

We now prove that if $\Pi_1$ implies $\Pi_2$, then $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$. To prove this, we assume that $\Pi_1$ implies $\Pi_2$ but $\mathrm{Im}(\Pi_2) \not\subseteq \mathrm{Im}(\Pi_1)$, and we will show that this leads to a contradiction. Fix a nonempty atom $A_0 \in \mathrm{Im}(\Pi_2) \setminus \mathrm{Im}(\Pi_1)$. By Theorem 1, we can construct random variables $X_1, \ldots, X_n$ such that $\mu^*$ vanishes on all the atoms of $\mathcal{F}_n$ except for $A_0$. Then $\mu^*$ vanishes on all the atoms in $\mathrm{Im}(\Pi_1)$ but not on all the atoms in $\mathrm{Im}(\Pi_2)$. By Theorem 6, this implies that for the $X_1, \ldots, X_n$ so constructed, $\Pi_1$ holds but $\Pi_2$ does not hold. Therefore, $\Pi_1$ does not imply $\Pi_2$, which is a contradiction. The theorem is proved.

Remark: In the course of proving this theorem and all its preliminaries, we have used nothing more than the basic inequalities. Therefore, we have shown that the basic inequalities are a sufficient set of tools to solve the implication problem if only full CMIs are involved.

Corollary 2: Two sets of full CMIs are equivalent if and only if their images are identical.

Proof: Two sets of full CMIs $\Pi_1$ and $\Pi_2$ are equivalent if and only if $\Pi_1$ implies $\Pi_2$ and $\Pi_2$ implies $\Pi_1$. Then by the last theorem, this is equivalent to $\mathrm{Im}(\Pi_2) \subseteq \mathrm{Im}(\Pi_1)$ and $\mathrm{Im}(\Pi_1) \subseteq \mathrm{Im}(\Pi_2)$, i.e., $\mathrm{Im}(\Pi_1) = \mathrm{Im}(\Pi_2)$. The corollary is proved.

Thus, a set of full CMIs is completely characterized by its image in the sense that two sets of full CMIs have the same image if and only if they are equivalent. A set of full CMIs is a set of probabilistic constraints, but the characterization by its image is purely set-theoretic! This characterization offers an intuitive set-theoretic interpretation of the joint effect of full CMIs on the $I$-Measure for $X_1, \ldots, X_n$. For example, $\mathrm{Im}(\mathrm{FCI}_1) \cap \mathrm{Im}(\mathrm{FCI}_2)$ is interpreted as the effect commonly due to $\mathrm{FCI}_1$ and $\mathrm{FCI}_2$, $\mathrm{Im}(\mathrm{FCI}_1) \setminus \mathrm{Im}(\mathrm{FCI}_2)$ is interpreted as the effect due to $\mathrm{FCI}_1$ but not $\mathrm{FCI}_2$, etc.

Example 3: Consider two full CMIs on $X_1, \ldots, X_n$ whose images, obtained from Definition 5, are given in (25)–(28). It can readily be seen by using an information diagram that the image of one is a subset of the image of the other. Therefore, by Theorem 7, the latter CMI implies the former. Note that no probabilistic argument is involved in this proof.

Example 4: We can prove the claim in Example 1 by showing that the image of each of the two eliminated full CMIs is contained in the union of the images of the remaining full CMIs.
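Because images are finite sets of atoms, the implication test of Theorem 7 is directly mechanizable. The sketch below (the data representation and function names are ours) computes $\mathrm{Im}(\mathrm{FCI})$ per Definition 5 — identifying an atom with its set $W$ of uncomplemented indices — and checks implication by set inclusion, using the five cutset CMIs of the Markov chain "1–2–3–4" from (2):

```python
from itertools import combinations

def image(blocks, n):
    """Im(FCI) for the full CMI with blocks T_1,...,T_k and conditioning
    set C = {1..n} minus the blocks: the atoms, identified with their
    sets W of uncomplemented indices, such that W avoids C and meets at
    least two blocks (Definition 5)."""
    free = set().union(*blocks)          # indices not in C
    atoms = set()
    for r in range(1, len(free) + 1):
        for W in combinations(sorted(free), r):
            W = frozenset(W)
            if sum(1 for T in blocks if W & set(T)) >= 2:
                atoms.add(W)
    return atoms

def image_of_set(fci_list, n):
    """Definition 6: the union of the images of the individual FCIs."""
    out = set()
    for blocks in fci_list:
        out |= image(blocks, n)
    return out

def implies(pi1, pi2, n):
    """Theorem 7: Pi_1 implies Pi_2 iff Im(Pi_2) is a subset of Im(Pi_1)."""
    return image_of_set(pi2, n) <= image_of_set(pi1, n)

# The Markov chain 1-2-3-4.  Each FCI(U) is listed by the components of G \ U:
fci2  = [{1}, {3, 4}]    # FCI({2})
fci3  = [{1, 2}, {4}]    # FCI({3})
fci23 = [{1}, {4}]       # FCI({2,3})
fci13 = [{2}, {4}]       # FCI({1,3})
fci24 = [{1}, {3}]       # FCI({2,4})

# FCI({2}) and FCI({3}) already imply the other three cutset CMIs:
print(implies([fci2, fci3], [fci23, fci13, fci24], 4))   # True
print(implies([fci23, fci13, fci24], [fci2], 4))         # False
```

The second check confirms the irreducibility discussed in Section III: the three "small" cutset CMIs do not jointly recover $\mathrm{FCI}(\{2\})$.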
When a CMI is about the conditional independence of two sets of random variables, the CMI becomes a conditional independency (CI). Actually, there is no substantial difference between characterizing a set of full CMIs and a set of full CIs, because a full CMI is equivalent to a set of full CIs (see Appendix B). Previously, it was shown in [7] (see also [21]), [2] that full CIs are axiomatizable in that there exists a formal system for deriving all the full CIs that are logically implied by an arbitrary set of full CIs. As a consequence, the implication problem can be solved when all the independencies involved are full CIs. On the other hand, a nonaxiomatic method based on the standard chase algorithm has been reported in [12]. The axiomatizations in [7], [2] and their relation with our result are discussed in Appendix B. Basically, the results in [7], [2], [12], and our result are different characterizations of a set of full CMIs (or CIs). Compared with those in [7], [2], [12], our characterization has the
advantage that it is in a closed form, and it renders insight into the set-theoretic structure of the problem not possible otherwise. It was pointed out in [7] that full CIs have the same axiomatization as multivalued dependencies (MVDs) in a relational database, which can be viewed as a Bayesian network [8]. Therefore, our result has immediate application in the implication problem in relational database theory. In a recent comprehensive study on the implication problem for probabilistic CI [13], it was further pointed out that Bayesian networks and relational databases coincide on solvable classes of independencies (i.e., classes for which there exists a method to solve the implication problem within the class), which include the class of full conditional (mutual) independencies. Basically, the implication problem in both contexts is exactly the same within the solvable classes of independencies. We refer the reader to [13] for the details.
VI. $I$-MEASURE CHARACTERIZATION OF A MARKOV RANDOM FIELD

We are now ready to present the $I$-Measure characterization of a Markov random field. Let $G$ be a graph with vertex set $V = N_n$. We now define two types of atoms. For a nonempty atom $A$, let $W(A)$ be the set of indices $i$ such that $\tilde X_i$ is not complemented in $A$. If the subgraph of $G$ induced by $W(A)$, i.e., $G \setminus (V \setminus W(A))$, is connected, then $A$ is a Type I atom; otherwise, $A$ is a Type II atom. Note that the type of an atom depends on the given graph $G$. We let $\mathcal{T}_1$ and $\mathcal{T}_2$ be the sets of all Type I and Type II atoms of $\mathcal{F}_n$, respectively.

Theorem 8: $X_1, \ldots, X_n$ are represented by $G$ if and only if $\mu^*(A) = 0$ for all Type II atoms $A$.

Proof: The set of full CMIs specified by the graph $G$ can be written as $\Pi_G = \{\mathrm{FCI}(U) : U \text{ is a cutset in } G\}$ (cf. (1)). By Theorem 6, it suffices to prove that $\mathrm{Im}(\Pi_G) = \mathcal{T}_2$.

We first prove that $\mathrm{Im}(\Pi_G) \subseteq \mathcal{T}_2$. Consider an atom $A \in \mathrm{Im}(\mathrm{FCI}(U))$ for some cutset $U$ in $G$. By Definition 5, $W(A)$ is disjoint from $U$ and intersects at least two of the components $V_1(U), \ldots, V_{s(U)}(U)$ of $G \setminus U$. Since any path in $G$ between two distinct components of $G \setminus U$ must pass through $U$, and $W(A) \cap U = \emptyset$, the subgraph of $G$ induced by $W(A)$ is disconnected. Therefore, $A \in \mathcal{T}_2$, and hence $\mathrm{Im}(\Pi_G) \subseteq \mathcal{T}_2$.

We now prove that $\mathcal{T}_2 \subseteq \mathrm{Im}(\Pi_G)$. Consider any Type II atom $A$, so that the subgraph of $G$ induced by $W(A)$ is disconnected, and let $U = V \setminus W(A)$. Then $G \setminus U$ is disconnected, i.e., $U$ is a cutset in $G$, and $W(A)$ intersects every component of $G \setminus U$, of which there are at least two. By Definition 5, $A \in \mathrm{Im}(\mathrm{FCI}(U)) \subseteq \mathrm{Im}(\Pi_G)$. Therefore, $\mathrm{Im}(\Pi_G) = \mathcal{T}_2$, and the theorem is proved.

The "only if" part of this theorem is a generalization of [3, Theorem 1]. Note that by definition, an atom $A$ with $|W(A)| = 1$ is a Type I atom for any graph $G$. Thus, we see from the above theorem that conditional entropies of the form $H(X_i \mid X_{N_n \setminus \{i\}})$, which correspond to the values of $\mu^*$ on the weight-$1$ atoms, do not have any effect on whether $X_1, \ldots, X_n$ can be represented by a graph $G$. Theorem 8 asserts that whether a probability measure satisfies the set of Markov conditions induced by a given graph can be determined by identifying the zero atoms of the $I$-Measure; the values of the $I$-Measure on the nonzero atoms are irrelevant.

It is tempting to think that any Markov condition is a result of the $I$-Measure vanishing on certain atoms of $\mathcal{F}_n$. This is not true, however, as is seen in the following example. Let $X_1$, $X_2$, and $X_3$ be binary random variables taking values in $\{0, 1\}$, where $X_1$ and $X_2$ are independent and identically distributed, taking the values $0$ and $1$ with equal probability, and $X_3 = X_1 \oplus X_2$, with $\oplus$ denoting modulo $2$ addition. Then

$\mu^*(\tilde X_1 \cap \tilde X_2 \cap \tilde X_3) = -1 \quad \text{and} \quad \mu^*(\tilde X_1 \cap \tilde X_2 \cap \tilde X_3^c) = 1.$

Thus, we see that although $X_1$ and $X_2$ are independent, i.e., $\mu^*(\tilde X_1 \cap \tilde X_2) = 0$, $\mu^*$ does not vanish on either of the two atoms contributing to this independency.

As we have mentioned, the characterization in (1) of a Markov random field is in general reducible. By contrast, our characterization in Theorem 8 is irreducible in the sense that for any Type II atom $A_0$, $\mu^*(A) = 0$ for all Type II atoms $A \neq A_0$ does not imply $\mu^*(A_0) = 0$. This is proved in the next theorem. However, we point out that although the characterization in Theorem 8 is irreducible, it does not mean that this characterization involves the least number of constraints on $X_1, \ldots, X_n$. For example, the simplified characterization of a Markov random field introduced toward the end of Section III usually consists of a considerably smaller number of constraints, because a full CMI in general "covers" more than one Type II atom. Finding the smallest set of full CMIs representing a Markov random field is a set covering problem.
Theorem 9: The characterization of a Markov random field in Theorem 8 is irreducible.

Proof: Fix any Type II atom A_0. By Theorem 1, we can construct random variables X_1, X_2, …, X_n such that μ*(A_0) ≠ 0 and μ*(A) = 0 for all Type II atoms A ≠ A_0. Therefore, μ*(A) = 0 for all Type II atoms A ≠ A_0 does not imply μ*(A_0) = 0.

This theorem has the following immediate consequence.

Corollary 3: Any two graphs G_1 = (V, E_1) and G_2 = (V, E_2), where E_1 ≠ E_2, cannot give two equivalent sets of constraints on X_1, X_2, …, X_n.

Proof: Denote by T_2(G_1) and T_2(G_2) the sets of Type II atoms for G_1 and G_2, which are, respectively, the images of the sets of full CMIs specified by G_1 and G_2. If E_1 ≠ E_2, then either E_1 ⊄ E_2 or E_2 ⊄ E_1. Assume without loss of generality the former, and let (i, j) be an edge in E_1 but not in E_2. Then the atom in which exactly X̃_i and X̃_j appear uncomplemented is in T_2(G_2) but not in T_2(G_1), which implies T_2(G_1) ≠ T_2(G_2). By Theorem 9, we conclude that G_1 and G_2 do not give two equivalent sets of constraints on X_1, X_2, …, X_n.

We end this section with a remark. In [3], it was shown that if X_1, X_2, …, X_n form a Markov chain, then μ* is always nonnegative. As a Markov chain is a special case of a Markov forest, it is plausible that this result can be generalized to a Markov forest. We now give a counterexample to refute such a proposal. Let Z_1 and Z_2 be two independent binary random variables with uniform distribution on {0, 1}, and define X_1, X_2, X_3, and X_4 in terms of Z_1, Z_2, and Z_1 ⊕ Z_2, where ⊕ denotes modulo 2 addition. It is obvious that X_1, X_2, X_3, X_4 form the Markov forest shown in Fig. 3. Now a direct calculation shows that μ* takes a negative value on one of the atoms. Therefore, μ* is a signed measure in general when X_1, X_2, …, X_n form a Markov forest.

Fig. 3. A Markov tree representing X_1, X_2, X_3, X_4.
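A construction of this kind can be checked numerically. The sketch below uses a concrete choice of our own (an assumption for illustration, not necessarily the paper's): X1 = Z1, X3 = Z2, X2 = (Z1, Z2), and X4 = Z1 ⊕ Z2, which form a Markov tree with X2 at the center. μ* is then negative on a Type I atom even though every Type II atom vanishes:

```python
import math
from itertools import product, combinations

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Assumed construction: Z1, Z2 i.i.d. uniform bits;
# X1 = Z1, X2 = (Z1, Z2), X3 = Z2, X4 = Z1 XOR Z2 (a Markov tree centered at X2).
joint = {(z1, (z1, z2), z2, z1 ^ z2): 0.25 for z1, z2 in product((0, 1), repeat=2)}

def H(S):
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i - 1] for i in sorted(S))
        marg[key] = marg.get(key, 0.0) + p
    return entropy(marg)

def multi_info(indices):
    """I(X_i1; ...; X_ik) by inclusion-exclusion over joint entropies."""
    return sum((-1) ** (r + 1) * H(set(T))
               for r in range(1, len(indices) + 1)
               for T in combinations(indices, r))

mu_full = multi_info([1, 2, 3, 4])           # atom with all four uncomplemented (Type I)
mu_type2 = multi_info([1, 3, 4]) - mu_full   # I(X1;X3;X4|X2), a Type II atom
print(mu_full, mu_type2)  # -1.0 0.0
```

Here μ* equals −1 on the all-uncomplemented atom, so μ* is signed even though the four variables form a Markov tree.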
VII. A CANONICAL FORM FOR INFORMATION EXPRESSIONS CONDITIONING ON A MARKOV RANDOM FIELD

Any Shannon measure can be expressed as a linear combination of unconditional joint entropies by means of the following identity:

(29)

We will call an expression involving only Shannon measures an information expression. Using (29), an information expression can be expressed in terms of unconditional joint entropies only. This is called the canonical form in [16]. In fact, any invertible linear transformation of the joint entropies can be used for the purpose of defining a canonical form. A primary example is the set of values of μ* on the atoms of F_n.

The significance of a canonical form is its uniqueness. The uniqueness of the canonical form in [16] was proved for very general classes of expressions, including linear expressions as a special case, when no constraint is imposed on the joint entropies. As a consequence, for example, if we want to know whether two linear information expressions f and g are identical, we only need to express f − g in canonical form and check whether all the coefficients are zero. When certain constraints on Shannon measures are imposed (most often Markov constraints on the random variables), it is not clear whether in general there exists a canonical form (which has to be unique at least for linear information expressions). We refer the reader to [16, Sec. 4.4] for a discussion on this.

In general, if we want to check whether two information expressions f and g are identical under certain constraints on Shannon measures, we need to check that both f ≥ g and f ≤ g always hold. Most of these checkings can be done automatically by ITIP [18]. If the constraints on Shannon measures are in the form of a Markov graph G, from Theorem 8 we see that the values of μ* on all Type II atoms are zero. Thus, we are motivated to propose the following canonical form: for each Shannon measure in an information expression, express it as a summation of the values of μ* on the set of Type I atoms. Let us give an example. Let the Markov constraints be represented by the graph in Fig. 4, and consider an information expression in the corresponding random variables. Expanding each Shannon measure over the atoms and dropping every term in which the atom is a Type II atom, we obtain a summation of values of μ* on Type I atoms only, which is in the proposed canonical form. For the rest of the section, we will assume that the constraints on Shannon measures are in the form of a Markov graph G. We
will establish the uniqueness of the proposed canonical form for very general classes of information expressions in the same way as in [16]. Recall that T_1 and T_2 are the sets of all Type I and Type II atoms of F_n, respectively. A vector h is said to be G-entropic if it corresponds to the joint entropies associated with some probability measure on X_1, X_2, …, X_n whose I-Measure μ* is such that μ*(A) = 0 for all A ∈ T_2. For a G-entropic h, let ψ(h) denote the vector of values of μ* on the atoms in T_1. (Note that for such an h, ψ(h) determines h.) Further define the set

Ψ = {ψ(h) : h is G-entropic}.

Fig. 4. A Markov random field with four random variables.

Lemma 3: Ψ contains the positive orthant of R^{|T_1|}.

Proof: By Theorem 1, for any assignment of positive values to the atoms in T_1, we can construct random variables X_1, X_2, …, X_n such that μ* takes these values on the atoms in T_1 and μ*(A) = 0 for all A ∈ T_2. The lemma is proved.

Theorem 10: Let f be a measurable function on R^{|T_1|} such that the set {ψ : f(ψ) = 0} has zero Lebesgue measure. Then f cannot be identically zero on Ψ.

Proof: The proof is similar to that for [16, Theorem 1]. Since {ψ : f(ψ) = 0} has zero Lebesgue measure and Ψ contains the positive orthant of R^{|T_1|}, which has positive Lebesgue measure, Ψ cannot be a subset of {ψ : f(ψ) = 0}. Therefore, there exists ψ ∈ Ψ such that f(ψ) ≠ 0. Hence, f cannot be identically zero on Ψ.

The uniqueness of the proposed canonical form for very general classes of information expressions follows from this theorem. As an illustration, suppose we want to see whether two linear information expressions f and g are identical conditioning on a Markov graph G. We first express f − g in the proposed canonical form. Since f − g is linear, if f − g is not the zero function, then the set on which it vanishes has zero Lebesgue measure, and by the theorem, f and g are not identical. If f − g is the zero function, then obviously f and g are identical.

It is easy to think that the uniqueness of the proposed canonical form is trivial. Specifically, suppose a set of information quantities is an invertible linear transformation of the joint entropies. Then a linear information expression can be expressed uniquely in terms of these quantities. Now suppose we impose a set of constraints which correspond to setting some subset of these quantities to 0. Then it seems to follow trivially that any linear information expression can be expressed uniquely in terms of the remaining quantities. We now show that this is not true in general by giving a counterexample for n = 3. Consider an invertible transformation of the joint entropies of X_1, X_2, and X_3, and impose the condition that X_1, X_2, and X_3 are mutually independent, which corresponds to setting some of the transformed quantities to 0. Referring to the information diagram for X_1, X_2, X_3 in Fig. 1, we see that conditioning on X_1, X_2, and X_3 being mutually independent, an information expression (involving X_1, X_2, and X_3) can have no unique expression in terms of the remaining quantities. The explanation for this seemingly counter-intuitive phenomenon is the fact that these quantities are not free variables, as discussed in Section II. We refer the reader to [16] for further discussions on this subject. From this counterexample, it becomes clear that the uniqueness of the proposed canonical form is highly nontrivial.

An alternative canonical form for information expressions conditioning on a Markov random field was proposed by Reviewer B of this paper. This alternative canonical form does not involve the use of the I-Measure, so it may be more appealing to those who are not familiar with the I-Measure. We include a discussion on this in Appendix D for the convenience of the reader. The proof of the uniqueness of this alternative canonical form for very general classes of information expressions, however, still relies on Theorem 1.

In Section II, we define the dimension of Shannon information measures when there is no constraint on the random variables as the dimension of the smallest subspace containing the corresponding set of vectors, and we show that it is equal to 2^n − 1. When X_1, X_2, …, X_n form a Markov random field, we define the dimension of Shannon information measures as the dimension of the smallest subspace containing Ψ. It follows from Lemma 3 that this dimension is equal to |T_1|. In the following two examples, we determine |T_1| for a chain and for a "star."

Example 5: Let G be the chain "1–2–⋯–n." Then an atom is in T_1 if and only if the elements of its uncomplemented index set are consecutive.
In this case, let l and u be the first and the last elements of the uncomplemented index set, respectively. Then the atom is determined by the interval from l to u, so the total number of Type I atoms is n(n + 1)/2.

Example 6: Let G be the star with edges (1, i), i = 2, …, n. First, there are n Type I atoms whose uncomplemented index set is a singleton, namely {1}, {2}, …, {n}. For a Type I atom whose uncomplemented index set W satisfies |W| ≥ 2, W must contain the center vertex 1 as well as a nonempty subset of the other n − 1 vertices. Thus, the total number of such atoms is 2^{n−1} − 1. Hence, the total number of Type I atoms is n + 2^{n−1} − 1.
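The counts in Examples 5 and 6 can be checked by brute force. In the sketch below (our code; the closed forms n(n + 1)/2 and n + 2^{n−1} − 1 are the counts discussed above), an atom is identified with its set of uncomplemented indices and classified by connectivity of the induced subgraph:

```python
from itertools import combinations

def connected(vertices, edges):
    """True if the subgraph induced on the nonempty set `vertices` is connected."""
    verts = set(vertices)
    start = next(iter(verts))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            for u, w in ((a, b), (b, a)):
                if u == v and w in verts and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen == verts

def num_type1(n, edges):
    """Count atoms whose uncomplemented index set induces a connected subgraph."""
    return sum(1 for r in range(1, n + 1)
               for W in combinations(range(1, n + 1), r)
               if connected(W, edges))

for n in range(2, 9):
    chain = [(i, i + 1) for i in range(1, n)]
    star = [(1, i) for i in range(2, n + 1)]
    assert num_type1(n, chain) == n * (n + 1) // 2       # polynomial in n
    assert num_type1(n, star) == n + 2 ** (n - 1) - 1    # exponential in n
print("counts verified up to n = 8")
```

The assertions exhibit the polynomial versus exponential growth discussed next in the text.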
From these two examples, we see that the dimension of Shannon information measures of X_1, X_2, …, X_n is polynomial in n if the random variables form a Markov chain, but is exponential in n if the random variables form a Markov "star," although both a chain and a star are special cases of a tree.

VIII. A HYPERGRAPH ASSOCIATED WITH A MARKOV RANDOM FIELD

Let B = {B_i} be a hypergraph with ground set ∪_i B_i; the sets B_i are called hyperedges of the hypergraph B. We introduce the notion of a junction graph in the following definition.

Definition 7: Let B = {B_i, i ∈ V} be a hypergraph and G = (V, E) be a graph. Associate hyperedge B_i with vertex i of G, i ∈ V, and let B_U = ∪_{i ∈ U} B_i for any U ⊂ V. Then G is a junction graph for B if for any cutset U in G, the sets

B_{V_1(U)} − B_U, …, B_{V_{s(U)}(U)} − B_U

are disjoint, where V_1(U), …, V_{s(U)}(U) denote the vertex sets of the components of the graph obtained from G by eliminating the vertices in U.

Note that a hypergraph has, in general, more than one junction graph. In particular, the complete graph on V is always a junction graph for B.

Definition 8: If G is a junction graph for a hypergraph B and G is a forest (tree), then G is a junction forest (tree) for B.

Definition 9: If a hypergraph B has a junction tree, then B is an acyclic hypergraph.

Thus, a junction graph is a natural generalization of a junction forest. The latter is an important concept in hypergraph theory [5] as well as in relational database theory [1]. We will have a special discussion on the construction of junction forests in the next section.

In this section, we will define a hypergraph associated with any collection of random variables X_1, X_2, …, X_n. In terms of this hypergraph, we will establish another characterization of X_1, X_2, …, X_n forming a Markov graph G. To simplify notation, we write X_U for (X_i, i ∈ U) for a set U ⊂ V.

Consider any collection of random variables X_1, X_2, …, X_n. For i ∈ V, let B_i be the set of all atoms of F_n in X̃_i on which μ* has a nonzero value, i.e., the atoms A such that A ⊂ X̃_i and μ*(A) ≠ 0. Then

B(X) = {B_i, i ∈ V}

is a hypergraph whose ground set is the set of all atoms in F_n on which μ* has nonzero values. We will show in the next theorem that by examining B(X), it is possible to determine whether X_1, X_2, …, X_n form a Markov graph G. From Definition 7, we see that G is a junction graph for B(X) if and only if for any cutset U in G, the sets B_{V_1(U)} − B_U, …, B_{V_{s(U)}(U)} − B_U are disjoint.

Theorem 11: X_1, X_2, …, X_n are represented by G if and only if G is a junction graph for B(X).

The reader should compare the condition in this theorem with GMP-II. This theorem establishes an analogy between probability theory and hypergraph theory for a Markov graph, which is made possible by the use of the I-Measure.

Proof of Theorem 11: We first prove the "only if" part. Consider any cutset U in G. It suffices to show that the sets B_{V_1(U)} − B_U, …, B_{V_{s(U)}(U)} − B_U are disjoint. Suppose not, and consider an atom A lying in two of these sets. Then there exist i and j belonging to two different components of the graph obtained from G by eliminating the vertices in U such that A ⊂ X̃_i and A ⊂ X̃_j, while A ∉ B_u, and hence A ⊂ X̃_u^c, for every u ∈ U. Thus U ⊂ U_A, and i and j also belong to two different components of the graph obtained from G by eliminating the vertices in U_A. Therefore, this graph has at least two components, which implies that A is a Type II atom. By Theorem 8, μ*(A) = 0, contradicting the fact that A is in the ground set of B(X). Thus, G is a junction graph for B(X).

Now we prove the "if" part. Assume that G is a junction graph for B(X), and consider any Type II atom A, which in noncomplementary form can be written as

A = (∩_{i ∈ V − U_A} X̃_i) − X̃_{U_A}.   (30)

We will show that μ*(A) = 0; then, by Theorem 8, X_1, X_2, …, X_n are represented by G. Assume the contrary, that μ*(A) ≠ 0. Since A is a Type II atom, the graph obtained from G by eliminating the vertices in U_A has at least two components, so there is a cutset U ⊂ U_A of G and vertices i and j appearing in (30) that belong to two different components of the graph obtained from G by eliminating the vertices in U. Since μ*(A) ≠ 0 and (30) is in noncomplementary form, A ∈ B_i and A ∈ B_j, while A ∉ B_u for all u ∈ U_A. Hence, A lies in two of the sets B_{V_1(U)} − B_U, …, B_{V_{s(U)}(U)} − B_U. However, this is a contradiction because these sets are disjoint for any cutset U in G. Hence, μ*(A) = 0, and the theorem is proved.
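The junction graph condition of Definition 7 can be checked by direct enumeration. The sketch below (our code; the three-vertex hypergraphs are hypothetical examples, not from the paper) tries every cutset U and tests whether the component-wise unions of hyperedges, minus B_U, are disjoint:

```python
from itertools import combinations

def components(vertices, edges):
    """Connected components of the subgraph induced on `vertices`."""
    verts, comps = set(vertices), []
    while verts:
        stack, comp = [verts.pop()], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for a, b in edges:
                for u, w in ((a, b), (b, a)):
                    if u == v and w in verts:
                        verts.remove(w)
                        stack.append(w)
        comps.append(comp)
    return comps

def is_junction_graph(B, V, E):
    """Definition 7: B maps vertex i to hyperedge B_i; check every cutset U."""
    for r in range(1, len(V)):
        for U in combinations(V, r):
            rest = [v for v in V if v not in U]
            comps = components(rest, E)
            if len(comps) < 2:
                continue  # U is not a cutset
            b_u = set().union(*(B[u] for u in U))
            parts = [set().union(*(B[i] for i in c)) - b_u for c in comps]
            if any(p1 & p2 for p1, p2 in combinations(parts, 2)):
                return False
    return True

V, E = [1, 2, 3], [(1, 2), (2, 3)]  # the path 1-2-3
B_good = {1: {"a", "x"}, 2: {"x", "y"}, 3: {"y", "b"}}
B_bad = {1: {"a", "x"}, 2: {"y"}, 3: {"x", "b"}}
print(is_junction_graph(B_good, V, E))  # True
print(is_junction_graph(B_bad, V, E))   # False ("x" survives on both sides of cutset {2})
```

In the second hypergraph, the element "x" appears in hyperedges on both sides of the cutset {2} without appearing in B_2, exactly the situation the definition forbids.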
IX. MARKOV FOREST AND GRAHAM REDUCTION

As mentioned in the Introduction, a Markov random field is often regarded as a generalization of a Markov chain. In this regard, a Markov tree is an immediate generalization of a Markov chain, and a Markov forest is a trivial extension of a Markov tree. In the last section, we saw that X_1, X_2, …, X_n form a Markov forest G if and only if G is a junction forest for B(X). Also recall from Definition 9 that a hypergraph is acyclic if and only if it has a junction forest. Therefore, X_1, X_2, …, X_n form a Markov forest if and only if B(X) is acyclic. We will describe in the following a very simple procedure called the Graham Reduction for recognizing an acyclic hypergraph. In fact, if a hypergraph B is acyclic, then using the Graham Reduction, one can easily construct a junction forest for B. Thus, if we want to check whether X_1, X_2, …, X_n form a Markov forest, we only need to apply the Graham Reduction to the hypergraph B(X), and a forest representation of X_1, X_2, …, X_n can be obtained if one exists. The Graham Reduction is an important tool in relational database theory [1]. Relations between the Graham Reduction and information inequalities have been reported in [6]. The Graham Reduction is also potentially useful in other information theory problems concerning conditional independencies.

Definition 10 (Graham Reduction): Let B = {B_i} be a hypergraph. The Graham Reduction on B refers to repeated applications of the following two operations on B until no further operation is possible.

GR1: If a vertex v belongs to B_i but to no B_j with j ≠ i, delete v from B_i. If B_i becomes empty, eliminate B_i from B.

GR2: If B_i ⊂ B_j for some j ≠ i, eliminate B_i from B.

A process described above is referred to as a GR process. It was shown in [6] that a hypergraph is acyclic if and only if it is reduced to the empty set by the Graham Reduction. (This means that if a hypergraph cannot be reduced to the empty set by any particular GR process, then it is not acyclic.) Let us look at two examples.

Example 7: The hypergraph B with hyperedges … is not acyclic because neither of the operations of the Graham Reduction can be applied on B.
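The two operations translate directly into code. The sketch below (our implementation; the hyperedge labels and both example hypergraphs are stand-ins of our own, the second being the triangle, a minimal non-acyclic hypergraph) performs the Graham Reduction, records each GR2 absorption as a junction-forest edge in the manner described later in this section, and reports acyclicity:

```python
def graham_reduce(B):
    """Graham Reduction. B: dict hyperedge label -> set of vertices.
    Returns (is_acyclic, forest_edges), where forest_edges records GR2 absorptions."""
    B = {k: set(v) for k, v in B.items()}
    forest, changed = [], True
    while changed:
        changed = False
        # GR1: delete every vertex that belongs to exactly one hyperedge;
        # eliminate a hyperedge as soon as it becomes empty.
        for k, h in B.items():
            for v in list(h):
                if all(v not in B[j] for j in B if j != k):
                    h.remove(v)
                    changed = True
            if not h:
                del B[k]
                break
        else:
            # GR2: eliminate a hyperedge contained in another one (an absorption).
            for k in list(B):
                j = next((j for j in B if j != k and B[k] <= B[j]), None)
                if j is not None:
                    forest.append((k, j))  # B_k is absorbed into B_j
                    del B[k]
                    changed = True
                    break
    return not B, forest

# A path-like hypergraph (our stand-in example) reduces to the empty set ...
acyclic, forest = graham_reduce({"B1": {1, 2}, "B2": {2, 3}, "B3": {3, 4}})
print(acyclic, forest)  # True [('B1', 'B2'), ('B2', 'B3')]
# ... while the triangle does not: no vertex is private and no hyperedge is contained in another.
print(graham_reduce({"B1": {1, 2}, "B2": {2, 3}, "B3": {1, 3}})[0])  # False
```

The recorded absorption edges form a junction forest for the reduced hypergraph, matching the construction of F(B) described below.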
Fig. 5. The junction forest F(B) for the hypergraph B in Example 8.

TABLE I
THE STEPS OF A GR PROCESS FOR THE HYPERGRAPH B IN EXAMPLE 8

Example 8: Consider the hypergraph B whose hyperedges are given in Table I. Table I shows the steps in the Graham Reduction, and it is seen that B is acyclic. The steps are explained as follows. In Step 1, each vertex that belongs to exactly one hyperedge is deleted from that hyperedge by GR1. In Step 2, each hyperedge that has become a subset of another hyperedge is eliminated by GR2. In Step 3, the remaining vertices each belong to one hyperedge
only, so they are deleted by repeated applications of GR1. Note that there exists more than one way to reduce B to the empty set, so a GR process in general is not unique.

Suppose a hypergraph B = {B_i} is acyclic. Then it can be reduced to the empty set by the Graham Reduction. In the GR process, if B_i ⊂ B_j, then B_i is eliminated from B by GR2, and we say that B_i is absorbed into B_j. We can then construct a forest F(B) with respect to a GR process as follows. In F(B), there is an edge between vertex i and vertex j if and only if B_i is absorbed into B_j or B_j is absorbed into B_i. It is seen from a slight modification of the work in [6] that F(B) is always a junction forest for B. Fig. 5 shows the forest F(B) for the hypergraph in the last example with respect to the GR process in Table I. It is readily seen that F(B) is a junction forest for B. Before we end this section, we mention that the Graham Reduction is only one way to construct a junction forest. We refer the reader to [5] for further references on this subject.

X. CONCLUSION

This paper presents an information-theoretic treatment of CMI and Markov random fields. We take the point of view that a Markov random field is a collection of so-called full conditional mutual independencies. Using the theory of the I-Measure, we have obtained a number of fundamental characterizations related to CMI and Markov random fields. We show that many aspects of CMI and Markov random fields have very simple set-theoretic descriptions, and new insights into their structure are obtained. Our results have immediate applications in the implication problem of probabilistic conditional independency and in relational databases. Toward the end of the paper, we obtain a hypergraph characterization of a Markov random field which makes it legitimate to view a Markov random field as a hypergraph. Based on this result, we naturally employ the Graham Reduction, a tool from relational database theory, to recognize a Markov forest.
This connection between a Markov random field and a hypergraph sheds some light on the possible role of hypergraph theory in the study of Markov random fields.
APPENDIX A
PROOF OF LEMMA 2

We base our proof on the following simple variation of the inclusion-exclusion formula in [14]:

(31)

We consider the coefficient of each joint entropy term in (16) for any nonempty subset of N_n. There are three cases.

Case 1: The term appears as the first term and the third term in the square bracket in (16), but these two appearances cancel with each other. The term appears once more, and the coefficient of this appearance, evaluated with the binomial formula, matches the corresponding coefficient in (31).

Case 2: This case is the same as Case 1, and the coefficient is the same.

Case 3: The term appears only as the third term in the square bracket in (16), and its coefficient again matches the corresponding coefficient in (31).

Thus, in all cases, the coefficient of each joint entropy term in (16) is equal to that in (31), proving the lemma.

APPENDIX B
AXIOMATIZATION OF FCIs

In this appendix, we first show that a full CMI is equivalent to a set of full CIs.

Theorem 12: Let A_1, A_2, …, A_k and C be disjoint index sets, and let X_{A_1}, …, X_{A_k} and X_C be the corresponding collections of random variables. Then X_{A_1}, X_{A_2}, …, X_{A_k} are mutually independent conditioning on X_C if and only if for all i, X_{A_i} and (X_{A_j}, j ≠ i) are independent conditioning on X_C.

Proof: We first prove the "only if" part. Consider, for any i,

(32)

By the assumption that X_{A_1}, X_{A_2}, …, X_{A_k} are mutually independent conditioning on X_C, both inequalities in (32) are tight, and hence X_{A_i} and (X_{A_j}, j ≠ i) are independent conditioning on X_C, proving the "only if" part.

To prove the "if" part, consider

(33)

where the last equality follows from the assumption that for all i, X_{A_i} and (X_{A_j}, j ≠ i) are independent conditioning on X_C. On the other hand,
the reverse inequality holds in general. Therefore, the inequality in (33) is tight, proving the "if" part. The theorem is proved.

Since a full CMI is equivalent to a set of full CIs, in principle there is no difference between characterizing a set of full CMIs and a set of full CIs. However, if a set of full CIs is equivalent to a full CMI, the latter has a more compact description. In [7], it was shown that full CIs are axiomatizable in that there exists a formal system for deriving all the full CIs that are logically implied by an arbitrary set of full CIs. Let A and B be index sets, and let X_A and X_B be the corresponding collections of random variables. We now use "A ⊥ B | C" to denote X_A being independent of X_B conditioning on X_C. Then the following seven axioms are complete for full CIs:

A1) if …, then …;
A2) if …, then …;
A3) if …, then …;
A4) if …, then …;
A5) if …, then …;
A6) if …, then …;
A7) if … and …, then …;
i.e., all the full CIs implied by a given set of full CIs can be derived by invoking these seven axioms. It can be shown that A1) and A2) form a minimal complete set of axioms for full CIs. In [2], an alternative set of complete axioms for full CIs was obtained.

APPENDIX C
A CHARACTERIZATION OF CMI
In this appendix, we prove a characterization of CMI.

Lemma 4: X_{A_1}, X_{A_2}, …, X_{A_k} are mutually independent conditioning on X_C if and only if

(34)

Proof: It suffices to prove that (34) is equivalent to

(35)

Assume (35) is true, so that X_{A_1}, X_{A_2}, …, X_{A_k} are mutually independent conditioning on X_C. Since for all i, X_{A_i} is independent of (X_{A_j}, j ≠ i) conditioning on X_C, each conditional mutual information in (34) vanishes. Therefore, (35) implies (34).

Now assume that (34) is true. Since all the terms in the summation in (34) are nonnegative, they must all be equal to 0. In particular, the first term is equal to 0, and by symmetry it can be shown that the corresponding term is equal to 0 for all i. Then this implies (35) by Theorem 12 in Appendix B, completing the proof.

APPENDIX D
AN ALTERNATIVE CANONICAL FORM FOR INFORMATION EXPRESSIONS CONDITIONING ON A MARKOV RANDOM FIELD

Let G = (V, E) be a graph. For U ⊂ V, let G(U) be the smallest subgraph of G containing all the vertices in U. Let s(U) be the number of components in G(U), and denote the components by U_1, U_2, …, U_{s(U)}, where we write (X_i, i ∈ U_k) as X_{U_k}. Since under GMP, X_{U_k} is independent of the variables in the other components conditioning on the remaining random variables, GMP is equivalent to the statement that for all U ⊂ V such that s(U) > 1,

H(X_U | X_{V−U}) = Σ_{k=1}^{s(U)} H(X_{U_k} | X_{V−U_k}).   (36)

Now, every unconditional entropy can be written as a linear combination of conditional entropies of the form H(X_U | X_{V−U}), where we adopt the convention that such a term vanishes for U = ∅. By virtue of (29), this in turn implies that every information expression can be written as a linear combination of conditional entropies of the form H(X_U | X_{V−U}). For each term H(X_U | X_{V−U}), if s(U) = 1, i.e., G(U) is connected, we leave the term unchanged. Otherwise, we apply (36) repeatedly if necessary to obtain a linear combination of conditional entropies of the form H(X_U | X_{V−U}) for which G(U) is connected. This is the alternative canonical form for an information expression conditioning on a Markov graph G proposed by Reviewer B.

Consider U such that G(U) is connected. The conditional entropy H(X_U | X_{V−U}) corresponds to the Type I atom

(∩_{i ∈ U} X̃_i) − X̃_{V−U},

and this correspondence is one-to-one because U is uniquely specified by this Type I atom. Therefore, the total number of conditional entropies of the form H(X_U | X_{V−U}) for which G(U) is connected is exactly |T_1|. It can be shown that the set of conditional entropies of the form H(X_U | X_{V−U}) for which G(U) is connected is an invertible linear transformation of the set of values of μ* on the Type I atoms. Therefore, this alternative canonical form and the canonical form we propose in Section VII enjoy the same uniqueness property.

ACKNOWLEDGMENT

The authors wish to thank both reviewers for the three rounds of very helpful comments on the manuscript.
REFERENCES

[1] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis, "On the desirability of acyclic database schemes," J. Assoc. Comput. Mach., vol. 30, no. 3, pp. 479–513, 1983.
[2] D. Geiger and J. Pearl, "Logical and algorithmic properties of conditional independence and graphical models," Ann. Statist., vol. 21, pp. 2001–2021, 1993.
[3] T. Kawabata and R. W. Yeung, "The structure of the I-Measure of a Markov chain," IEEE Trans. Inform. Theory, vol. 38, pp. 1146–1149, May 1992.
[4] R. Kindermann and J. L. Snell, Markov Random Fields and Their Applications. Providence, RI: Amer. Math. Soc., 1980.
[5] S. L. Lauritzen, Graphical Models. Oxford, U.K.: Oxford Univ. Press, 1996.
[6] T. T. Lee, "An information-theoretic analysis of relational databases—Part I: Data dependencies and information metric; Part II: Information structures of database schemas," IEEE Trans. Software Eng., vol. SE-13, pp. 1049–1072, Oct. 1987.
[7] F. M. Malvestuto, "A unique formal system for binary decompositions of database relations, probability distributions, and graphs," Inform. Sci., vol. 59, pp. 21–52, 1992.
[8] J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann, 1988.
[9] J. Pearl and A. Paz, "Graphoids: A graph based logic for reasoning about relevance relations," in Advances in Artificial Intelligence—II, B. D. Boulay, D. Hogg, and L. Steel, Eds. Amsterdam, The Netherlands: North-Holland, 1987, pp. 357–363.
[10] C. Preston, Random Fields. Berlin, Germany: Springer-Verlag, 1974.
[11] F. Spitzer, "Random fields and interacting particle systems," M.A.A. Summer Seminar Notes, 1971.
[12] S. K. M. Wong, "Testing implication of probabilistic dependencies," in Proc. 12th Conf. Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 1996, pp. 545–553.
[13] S. K. M. Wong, C. J. Butz, and D. Wu, "On the implication problem for probabilistic conditional independency," Dept. Comp. Sci., Univ. Regina, Regina, SK, Canada, Tech. Rep. CS-99-03, Sept. 1999.
[14] R. W. Yeung, "A new outlook on Shannon's information measures," IEEE Trans. Inform. Theory, vol. 37, pp. 466–474, May 1991.
[15] ——, "Multilevel diversity coding with distortion," IEEE Trans. Inform. Theory, vol. 41, pp. 412–422, May 1995.
[16] ——, "A framework for linear information inequalities," IEEE Trans. Inform. Theory, vol. 43, pp. 1924–1934, Nov. 1997.
[17] ——, On entropy, information inequalities, and groups. [Online]. Available: http://www.ie.cuhk.edu.hk/people/raymond.php
[18] R. W. Yeung and Y.-O. Yan, ITIP (Information Theoretic Inequality Prover). [Online]. Available: http://www.ie.cuhk.edu.hk/IT_book
[19] R. W. Yeung and Z. Zhang, "Distributed source coding for satellite communications," IEEE Trans. Inform. Theory, vol. 45, pp. 1111–1120, May 1999.
[20] Z. Zhang and R. W. Yeung, "On characterization of entropy functions via information inequalities," IEEE Trans. Inform. Theory, vol. 44, pp. 1440–1452, July 1998.
[21] F. M. Malvestuto and M. Studený, "Comment on 'A unique formal system for binary decompositions of database relations, probability distributions, and graphs'," Inform. Sci., vol. 63, pp. 1–2, 1992.
[22] R. W. Yeung, A First Course in Information Theory. New York: Kluwer Academic/Plenum, 2002.