Inclusion Dependencies in XML: Extending Relational Semantics
Michael Karlinger¹, Millist Vincent², and Michael Schrefl¹
¹ Johannes Kepler University, Linz, Austria
² University of South Australia, Adelaide, Australia
Abstract. In this article we define a new type of integrity constraint in XML, called an XML inclusion constraint (XIND), and show that it extends the semantics of a relational inclusion dependency. This property is important in areas such as XML publishing and ‘data-centric’ XML, and is one that is not possessed by other proposals for XML inclusion constraints. We also investigate the implication problem for XINDs in complete XML documents, a class of XML documents that generalizes the notion of a complete relation, and present an axiom system that we show to be sound and complete.
1 Introduction
Integrity constraints are one of the oldest and most important topics in database research, and they find application in a variety of areas such as database design, data translation, query optimization and data storage [1]. With the adoption of the eXtensible Markup Language (XML) [8] as the industry standard for data interchange over the internet, and its increasing usage as a format for the permanent storage of data in database systems [20], the study of integrity constraints in XML has increased in importance in recent years (a survey of the topic is given in [11]). In this article we investigate the topic of inclusion constraints in XML. We use the syntactic framework of the keyref mechanism in XML Schema [19], where both the LHS and RHS of the constraint include a selector, which is used to select elements in the XML document, followed by a sequence of fields, which are used to specify the descendant nodes that are required to match in the document. This general idea of demanding selected elements in a document to have matching subnodes is also found in other approaches towards inclusion constraints in XML [5, 22, 16, 13]. While the syntactic framework of the keyref mechanism in XML Schema is an expressive one, both it and other work that use a similar framework have some important limitations from the perspective of semantics. The keyref mechanism imposes the restriction that there can be at most one node per field, which is frequently violated in practice, whereas the other approaches have the limitation that they do not always allow one to extend the semantics of a relational inclusion dependency (IND). We illustrate this last point by the following example.
In Figure 1, two relations teaches and offer are shown. teaches stores the details of courses taught by lecturers in a department, lec is the name of the lecturer, cno is the identifier of the course they are teaching, day is the day of the week that the course is being taught and sem is the semester in which the course is taught. offer stores all the courses offered by the university, where cno, day and sem have the same meaning as in teaches. We note that the key for offer is {cno, day, sem} and the key for teaches is {lec,cno,day,sem}, thus more than one lecturer can teach a course. The database also satisfies the IND teaches[cno, day, sem] ⊆ offer[cno, day, sem], which specifies that a lecturer can only teach courses that are being offered by the university. Suppose we now map the relational data to XML by first mapping the two relations to separate documents with root nodes offer and teaches and then combining these documents to a single document with root node uni, as shown in Figure 1. In particular, the tuples in relation offer were directly mapped to elements with tag course. Concerning the relation teaches, a nesting on {lec, day, sem} preceded the direct mapping of the (nested) tuples, within which tags course and info were introduced.
Flat relations:

teaches
lec  cno  day  sem
L1   C1   TUE  09S
L1   C1   MON  08W

offer
cno  day  sem
C1   TUE  09S
C1   MON  08W

Nested relation:

cno  {lec day sem}
C1   L1 TUE 09S
     L1 MON 08W

[XML document: tree diagram with root node uni and node markers ①, ②, ③; not recoverable from the extracted text]
Fig. 1. Example Relations and XML Document
Now the XML document satisfies an inclusion constraint because of the original IND, but one cannot express this inclusion constraint by σ = ((uni.teaches.course, [cno, info.day, info.sem]) ⊆ (uni.offer.course, [cno, day, sem])), where uni.teaches.course and uni.offer.course are the LHS and RHS selectors and [cno, info.day, info.sem] and [cno, day, sem] are the LHS and RHS fields, if one applies the semantics given in [5, 22, 16, 13]. The reason is that, since lecturer L1 teaches course C1 on more than one occasion, the semantics require that every possible combination of nodes from [cno, info.day, info.sem] within a uni.teaches.course node must have matching nodes in [cno, day, sem] within a uni.offer.course node. So, since one combination is {C1, TUE, 08W}, this requires a uni.offer.course node with child nodes {C1, TUE, 08W}, which clearly does not exist.
In this paper, we propose different semantics so that the constraint σ holds in our example. The key idea is that we do not allow arbitrary combinations of nodes from the LHS fields; we only allow nodes that are closely related by what we will define later as the closest property. So, for example, the combination {C1, TUE, 08W} is not considered, because its day and sem nodes lie in different info elements. The motivation for this restriction is that in the relational model data values that appear in the same tuple are more closely related than those that belong to different tuples, and our closest notion extends this idea to XML, and hence allows relational semantics to be extended.

Having an XML inclusion constraint that extends the semantics of an IND is important in several areas. Firstly, in the area of XML publishing [12], where a source relational database has to be mapped to a single predefined XML schema, knowing how relational integrity constraints map to XML integrity constraints allows the XML document to preserve more of the original semantics. This argument also applies to ‘data-centric’ XML [17, 20], where XML databases (not necessarily with predefined schemas) are generated from relational databases.

The first contribution of this article is to define an XML inclusion constraint (called XIND) that extends the semantics of an IND. While the constraint is defined for any XML document (tree), we show that in the special case where the XML tree is generated by first mapping complete relations to nested relations by an arbitrary sequence of nest operations, and then directly to an XML tree, the database satisfies the IND iff the XML tree satisfies the corresponding XIND.

The second contribution of this article is to address the implication problem for XINDs, in the context of a class of XML trees (documents) introduced in previous work by one of the authors [21], called complete XML trees. We present an axiom system for XIND implication and show that the system is sound and complete.
While our axiom system contains rules that parallel those of INDs [1], it also contains several additional rules that have no parallel in the IND system, reflecting the fact, as we now discuss, that complete XML trees are more general than complete relations.

The intuition and motivation for complete XML trees are as follows. Intuitively, a complete XML tree is one that contains ‘no missing data’, and its motivation is to extend the notion of a complete relation to an XML tree. However, we note that a complete XML tree is a more general notion than a complete relation since it includes trees that cannot be mapped to complete relations, such as those that contain duplicate nodes or subtrees, and trees that contain element leaf nodes rather than only text or attribute leaf nodes. Our motivation for considering XIND implication in complete XML documents is that while XML explicitly caters for irregularly structured data, it is also widely used in more traditional business applications involving regularly structured data [17, 20], often referred to as ‘data-centric’ XML, and complete XML trees are a natural subclass in such applications.
2 XML Trees, Paths and Reachable Nodes
In this section we present some preliminary definitions. First, following the model adopted by XPath and DOM [8], we model an XML document as a tree as follows. We assume countably infinite, disjoint sets E and A of element and attribute labels respectively, and the symbol S indicating text. The set of labels that can occur in an XML tree, L, is then defined by L = E ∪ A ∪ {S}.

Definition 1. An XML tree T is defined by T = (V, E, lab, val, vρ), where
- V is a finite, non-empty set of nodes;
- the total function lab : V → L assigns a label to every node in V. A node v is called an element node if lab(v) ∈ E, an attribute node if lab(v) ∈ A, and a text node if lab(v) = S;
- vρ ∈ V is a distinguished element node, called the root node, and lab(vρ) = ρ;
- the parent-child relation E ⊂ V × V defines the directed edges connecting the nodes in V and is required to form a tree structure rooted at node vρ. Thereby, for every edge (v, v̄) ∈ E, (i) v is an element node and is said to be the parent of v̄; conversely, v̄ is said to be a child of v; (ii) if v̄ is an attribute node, then there does not exist a node ṽ ∈ V and an edge (v, ṽ) ∈ E such that lab(ṽ) = lab(v̄) and ṽ ≠ v̄;
- the partial function val : V → string assigns a string value to every attribute and text node in V.

In addition to the parent of a node v in a tree T, we define its ancestor nodes, denoted by ancestor(v), to be the transitive closure of parents of v. An example of an XML tree is presented in Figure 2, where E = {ρ, offer, teaches, course, info} and A = {cno, day, lec, sem}. We note that our model allows mixed data, whereas the model used in some other work on XML integrity constraints, such as [6], does not.
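[Figure 2: tree diagram; the structure below is recovered from the node labels]
vρ (ρ)
  v1 offer
    v2 course
      v3 cno = C1
      v4 day = TUE
      v5 sem = 09S
    v6 course
      v7 cno = C1
      v8 day = MON
      v9 sem = 08W
  v10 teaches
    v11 course
      v12 cno = C1
      v13 info
        v14 lec = L1
        v15 day = TUE
        v16 sem = 09S
      v17 info
        v18 lec = L1
        v19 day = MON
        v20 sem = 08W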
Fig. 2. An XML tree
The notion of a path, which we now present together with some frequently required operators on paths, is central to all work on XML integrity constraints.

Definition 2. A path P = l1.···.ln is a non-empty sequence of labels (possibly with duplicates) from L. Given paths P = l1.···.ln and P̄ = l̄1.···.l̄m we define
- P to be a legal path, if l1 = ρ and li ∈ E for all i ∈ [1, n−1]³;
- P to be a prefix of P̄, denoted by P ⊆ P̄, if n ≤ m and li = l̄i for all i ∈ [1, n];
- P to be a strict prefix of P̄, denoted by P ⊂ P̄, if P ⊆ P̄ and n < m;
- the concatenation of P and P̄, denoted by P.P̄, to be l1.···.ln.l̄1.···.l̄m;
- the intersection of P and P̄ if both are legal paths, denoted by P ∩ P̄, to be the longest path that is a prefix of both P and P̄.

For example, referring to Figure 2, offer.course and ρ.cno.course are paths but not legal ones, whereas ρ.offer.course is a legal path. Also the path ρ.offer is a strict prefix of ρ.offer.course, and if P = ρ.offer.course.cno and P̄ = ρ.offer.course.sem, then P ∩ P̄ = ρ.offer.course. We now define a path instance, which is essentially a downward sequence of nodes in an XML tree.

Definition 3. A path instance p = v1.···.vn in a tree T = (V, E, lab, val, vρ) is a non-empty sequence of nodes in V such that v1 = vρ and for all i ∈ [2, n], vi−1 = parent(vi). The path instance p is said to be defined over a path P = l1.···.ln, if lab(vi) = li for all i ∈ [1, n].

We note that it follows from Definition 3 that a path instance p can be defined over only one path P. For example, referring to Figure 2, vρ.v1.v2 is a path instance defined over the path ρ.offer.course. The next definition specifies the set of nodes reachable in a tree T from the root node by following a path P.

Definition 4. Given a tree T = (V, E, lab, val, vρ) and a legal path P, the function N(P, T) returns the set of nodes defined by {v ∈ V | v is the final node in a path instance p and p is defined over P}.
For instance, if T is the tree in Figure 2 and P = ρ.offer.course.day, then N(P, T) = {v4, v8}. We note that it follows from our tree model that for every node v in a tree T there is exactly one path instance p such that v is the final node in p, and therefore N(P, T) ∩ N(P̄, T) = ∅ if P ≠ P̄. We therefore say that P is the path such that v ∈ N(P, T).
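Definitions 3 and 4 can be illustrated with a few lines of code. The following is a minimal sketch, assuming a hypothetical dictionary encoding of the tree in Figure 2 (node id mapped to label, parent id and value); the names FIG2, path_of and reachable are illustrative and not part of the formal model.

```python
# Hypothetical encoding of the tree in Figure 2: node id -> (label, parent id, value).
# "rho" stands for the root label; the node ids mirror the figure.
FIG2 = {
    "vr": ("rho", None, None),
    "v1": ("offer", "vr", None),
    "v2": ("course", "v1", None),
    "v3": ("cno", "v2", "C1"), "v4": ("day", "v2", "TUE"), "v5": ("sem", "v2", "09S"),
    "v6": ("course", "v1", None),
    "v7": ("cno", "v6", "C1"), "v8": ("day", "v6", "MON"), "v9": ("sem", "v6", "08W"),
    "v10": ("teaches", "vr", None),
    "v11": ("course", "v10", None),
    "v12": ("cno", "v11", "C1"),
    "v13": ("info", "v11", None),
    "v14": ("lec", "v13", "L1"), "v15": ("day", "v13", "TUE"), "v16": ("sem", "v13", "09S"),
    "v17": ("info", "v11", None),
    "v18": ("lec", "v17", "L1"), "v19": ("day", "v17", "MON"), "v20": ("sem", "v17", "08W"),
}


def path_of(tree, v):
    """Return the unique path (tuple of labels) of the path instance ending in v."""
    labels = []
    while v is not None:
        label, parent, _ = tree[v]
        labels.append(label)
        v = parent
    return tuple(reversed(labels))


def reachable(tree, path):
    """N(P, T): the nodes that end a path instance defined over the given path."""
    return {v for v in tree if path_of(tree, v) == tuple(path)}


print(sorted(reachable(FIG2, ("rho", "offer", "course", "day"))))  # ['v4', 'v8']
```

Because every node ends exactly one path instance, path_of is a function, which is why the sets returned by reachable for two different paths never overlap.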
3 Defining XML Inclusion Dependencies
In this section we present the syntax and semantics of our definition of an XIND, starting with the syntax.

³ [1, n] denotes the set {1, . . . , n}.
Definition 5. An XML Inclusion Dependency is a statement of the form ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n])), where P and P′ are paths called LHS and RHS selector, and P1, . . . , Pn and P′1, . . . , P′n are non-empty sequences of paths, called LHS and RHS fields, such that for all i ∈ [1, n], P.Pi and P′.P′i are legal paths ending in an attribute or text label.

We now make some observations on this definition. (i) As mentioned previously, our syntax is based on the keyref mechanism used in XML Schema and other recent work in the area. (ii) We only consider simple paths in the selectors and fields, whereas XML Schema allows a restricted form of XPath expressions. (iii) XML Schema also allows for relative constraints, whereby the inclusion constraint is only evaluated in part of the XML tree. We do not consider this. (iv) The restrictions on fields mean that we only consider inclusion between text/attribute nodes, whereas XML Schema also allows for inclusion between element nodes. We should mention that the restrictions discussed in (ii) - (iv) are not intrinsic to our approach, and our definition of an XIND can easily be extended to handle these extensions. Our reason for not considering these extensions here is so that we can concentrate on the main contribution of our paper, which is to apply different semantics to an XIND so as to extend relational semantics.

To define the semantics of an XIND, we first make a preliminary definition (first presented in [21]) that is central to our approach. The intuition behind it, which will be made more precise in the next section, is as follows. In defining relational integrity constraints such as FDs or INDs, it is implicit that the relevant data values from either the LHS or RHS of the constraint belong to the same tuple.
The closest definition extends this property of two data values belonging to the same tuple to XML, that is if two nodes in the XML tree satisfy the closest property, then ‘they belong to the same tuple’. Definition 6. Given nodes v1 and v2 in an XML tree T, the boolean function closest(v1 , v2 ) is defined to return true, iff there exists a node v21 such that (i) v21 ∈ aancestor(v1 ), and (ii) v21 ∈ aancestor(v2 ), and (iii) v21 ∈ N(P1 ∩ P2 , T), where P1 and P2 are the paths such that v1 ∈ N(P1 , T) and v2 ∈ N(P2 , T) and the aancestor function is defined by aancestor(v) = ancestor(v) ∪ {v}. For instance, in Figure 2 closest(v3 , v4 ) is true. This is because P = ρ.offer.course.cno and P¯ = ρ.offer.course.day are the paths such that v3 ∈ N(P, T) and v4 ∈ N(P¯ , T), and v3 and v4 have the common ancestor node v2 ∈ N(ρ.offer.course, T), where ρ.offer.course = P ∩ P¯ . However, closest(v3 , v8 ) is false since v3 and v8 have no common ancestor node in N(ρ.offer.course, T), and closest(v3 , v7 ) is false because v3 and v7 have no common ancestor node in N(ρ.offer.course.cno, T). This leads to the definition of the semantics of an XIND.
Definition 7. An XML tree T satisfies an XIND σ = ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n])), denoted by T ⊨ σ, iff whenever there exists an LHS selector node v and corresponding field nodes v1, . . . , vn such that
(i) v ∈ N(P, T),
(ii) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v ∈ ancestor(vi),
(iii) for all i, j ∈ [1, n], closest(vi, vj) = true,
then there exists an RHS selector node v′ and field nodes v′1, . . . , v′n such that
(i′) v′ ∈ N(P′, T),
(ii′) for all i ∈ [1, n], v′i ∈ N(P′.P′i, T) and v′ ∈ ancestor(v′i),
(iii′) for all i, j ∈ [1, n], closest(v′i, v′j) = true,
(iv′) for all i ∈ [1, n], val(vi) = val(v′i).

For instance, the XML tree in Figure 2 satisfies the XIND σ = ((ρ.teaches.course, [cno, info.day, info.sem]) ⊆ (ρ.offer.course, [cno, day, sem])). This is because the only sequences of LHS field nodes that satisfy the closest property are v12, v15, v16 and v12, v19, v20, and v12, v15, v16 is value equal to the sequence of RHS field nodes v3, v4, v5 and v12, v19, v20 is value equal to the sequence of RHS field nodes v7, v8, v9.

The essential difference between an XIND and other proposals is that we require the sequence of field nodes (both LHS and RHS) generated by the cross product to also satisfy the closest property, whereas other proposals do not contain this additional restriction [5, 22, 16, 13]. As a consequence, and as discussed in the introductory example, the constraint σ is violated in Figure 2 according to other proposals. We also make the point that we do not explicitly address the situation where there may be no node for an LHS field; we only require inclusion between LHS and RHS field nodes when there is a node for every LHS field. This is consistent with other work in the area, for example the approach towards keys for XML in [9].
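The closest property and the satisfaction test can be checked on the tree of Figure 2 with a short program. This is a minimal sketch under stated assumptions: the dictionary encoding FIG2 and the helper names (path_of, closest, field_tuples, satisfies) are hypothetical, and the use_closest switch is added only to contrast the definition above with a plain cross product of field nodes.

```python
from itertools import product

# Hypothetical encoding of Figure 2: node id -> (label, parent id, value); "rho" is the root label.
FIG2 = {
    "vr": ("rho", None, None), "v1": ("offer", "vr", None),
    "v2": ("course", "v1", None),
    "v3": ("cno", "v2", "C1"), "v4": ("day", "v2", "TUE"), "v5": ("sem", "v2", "09S"),
    "v6": ("course", "v1", None),
    "v7": ("cno", "v6", "C1"), "v8": ("day", "v6", "MON"), "v9": ("sem", "v6", "08W"),
    "v10": ("teaches", "vr", None), "v11": ("course", "v10", None),
    "v12": ("cno", "v11", "C1"), "v13": ("info", "v11", None),
    "v14": ("lec", "v13", "L1"), "v15": ("day", "v13", "TUE"), "v16": ("sem", "v13", "09S"),
    "v17": ("info", "v11", None),
    "v18": ("lec", "v17", "L1"), "v19": ("day", "v17", "MON"), "v20": ("sem", "v17", "08W"),
}


def aancestors(tree, v):
    """The node itself followed by all of its ancestors (ancestor-or-self)."""
    out = []
    while v is not None:
        out.append(v)
        v = tree[v][1]
    return out


def path_of(tree, v):
    """Path (tuple of labels) from the root down to v."""
    return tuple(tree[w][0] for w in reversed(aancestors(tree, v)))


def closest(tree, v1, v2):
    """Definition 6: a common ancestor-or-self node lies on the intersection of the two paths."""
    p1, p2 = path_of(tree, v1), path_of(tree, v2)
    k = 0
    while k < min(len(p1), len(p2)) and p1[k] == p2[k]:
        k += 1
    shared = set(aancestors(tree, v1)) & set(aancestors(tree, v2))
    return any(path_of(tree, w) == p1[:k] for w in shared)


def field_tuples(tree, selector, fields, use_closest=True):
    """Value tuples of field-node sequences below some selector node."""
    result = set()
    for v in (n for n in tree if path_of(tree, n) == selector):
        candidates = [[w for w in tree
                       if path_of(tree, w) == selector + f and v in aancestors(tree, w)[1:]]
                      for f in fields]
        for combo in product(*candidates):
            if not use_closest or all(closest(tree, a, b) for a in combo for b in combo):
                result.add(tuple(tree[w][2] for w in combo))
    return result


def satisfies(tree, lhs, rhs, use_closest=True):
    """Definition 7 read as: every LHS value tuple also occurs as an RHS value tuple."""
    return field_tuples(tree, *lhs, use_closest) <= field_tuples(tree, *rhs, use_closest)


LHS = (("rho", "teaches", "course"), [("cno",), ("info", "day"), ("info", "sem")])
RHS = (("rho", "offer", "course"), [("cno",), ("day",), ("sem",)])
print(satisfies(FIG2, LHS, RHS))                     # True
print(satisfies(FIG2, LHS, RHS, use_closest=False))  # False
```

With the closest filter the only LHS value tuples are (C1, TUE, 09S) and (C1, MON, 08W); without it the cross product also yields (C1, TUE, 08W), which has no match below any ρ.offer.course node.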
4 Extending Relational Semantics
In this section we justify our claim that an XIND extends the semantics of an IND by showing that in the case where the XML tree is generated from a complete database by a very general class of mappings, the database satisfies the IND if and only if the XML tree satisfies the corresponding XIND. To show this, we first define a general class of mappings from complete relational databases to XML trees. The presentation of the mapping procedure here is abbreviated because of space requirements, and we refer the reader to [21] for a more detailed presentation if needed.

The first step in the mapping procedure maps each initial flat relation to a nested relation by a sequence of nest operations. To be more precise, we recall that the nest operator νY(R∗) on a nested (or flat) relation R∗, where Y is a subset of the schema R of R∗, combines the tuples in R∗ which are equal on R∗[R − Y] into single tuples [7]. So if the initial flat relation is denoted by R, we perform an arbitrary sequence of nest operations νY1, . . . , νYn on R, and so the final nested relation R∗ is defined by R∗ = νYn(· · · νY1(R)).
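The effect of a single nest operation can be sketched as follows. This is an illustrative reading of the operator recalled above, not the formal definition from [7]: relations are assumed to be lists of dictionaries, the nested attribute is given a made-up name built from Y, and the function name nest is hypothetical.

```python
def nest(relation, attrs):
    """Group tuples that agree outside attrs and collect their attrs-parts as sub-tuples."""
    attrs = list(attrs)
    nested_name = "{" + ",".join(attrs) + "}"  # assumed name for the new nested attribute
    groups = []  # list of (rest, sub-tuples); linear search so nested values stay usable
    for t in relation:
        rest = {a: v for a, v in t.items() if a not in attrs}
        sub = {a: t[a] for a in attrs}
        for key, subs in groups:
            if key == rest:
                if sub not in subs:
                    subs.append(sub)
                break
        else:
            groups.append((rest, [sub]))
    return [{**rest, nested_name: subs} for rest, subs in groups]


# The flat relation teaches from Figure 1.
teaches = [
    {"lec": "L1", "cno": "C1", "day": "TUE", "sem": "09S"},
    {"lec": "L1", "cno": "C1", "day": "MON", "sem": "08W"},
]
nested = nest(teaches, ["lec", "day", "sem"])
print(len(nested), nested[0]["cno"])  # 1 C1
```

Both tuples of teaches agree on cno, so nesting on {lec, day, sem} yields a single tuple for C1 with two sub-tuples, matching the nested relation shown in Figure 1.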
For instance, in the introductory example the flat relation teaches is converted to a nested relation R∗ by R∗ = νY(teaches) with Y = {lec, day, sem}. The next step in the mapping procedure is to map the nested relation to an XML tree by converting each sub-tuple in the nested relation to a subtree in the XML tree, using a new element node for the root of the subtree, as illustrated in Example 1. While we do not claim that our method is the only way to map a relation to an XML tree, it does have two desirable features. The first is that it allows the initial flat relation to be nested arbitrarily, which is a desirable feature in data-centric applications of XML [18]. The second is that it has been shown that the mapping procedure is invertible [21], and so no information is lost by the transformation.

In the context of mapping multiple relations to XML, we extend the method just outlined as follows. We first map each relation to an XML tree as just discussed. We then replace the label in the root node by a label containing the name of the relation (which we assume to be unique), and construct a new XML tree with a new root node and with the XML trees just generated being principal subtrees. This procedure was used in the introductory example. This leads to the following important result, which justifies our claim that an XIND extends the semantics of an IND.

Theorem 1. Let complete flat relations R1 and R2 be mapped to an XML tree T by the method just outlined. Then R1 and R2 satisfy the IND R1[A1, . . . , An] ⊆ R2[B1, . . . , Bn], where R1 and R2 also denote the schemas of the relations, iff T satisfies the XIND ((ρ.R1, [PA1, . . . , PAn]) ⊆ (ρ.R2, [PB1, . . . , PBn])), where ρ.R1.PA1, . . . , ρ.R1.PAn, ρ.R2.PB1, . . . , ρ.R2.PBn represent the paths over which the path instances in T that end in leaf nodes are defined.
For instance, we deduce from the IND teaches[cno, day, sem] ⊆ offer[cno, day, sem] the XIND ((ρ.teaches, [course.cno, course.info.day, course.info.sem]) ⊆ (ρ.offer, [course.cno, course.day, course.sem])) and, from the inference rules to be given in the next section, this XIND is equivalent to the XIND given in the introductory example, namely ((ρ.teaches.course, [cno, info.day, info.sem]) ⊆ (ρ.offer.course, [cno, day, sem])). The crucial preliminary result needed to establish this theorem is the following lemma, established in [21].

Lemma 1. Let R be a flat relation over schema R = (A1, . . . , An), and let R be mapped to a tree T by the procedure just outlined. Then there exist leaf nodes v1, . . . , vn in T such that (i) vi ∈ N(PAi, T), for all i ∈ [1, n], and (ii) closest(vi, vj) is true, for all i, j ∈ [1, n], iff there exists a tuple t ∈ R such that t[Ai] = val(vi), for all i ∈ [1, n].

This lemma shows that if a complete relation is mapped to an XML tree by the method just outlined, then a set of data values appear in the same tuple of a relation if and only if the corresponding nodes in the XML tree pairwise satisfy the closest property. For instance, from the lemma we deduce that in our running example the data values C1, TUE, 09S appear in the same tuple of relation offer if and only if v3, v4, v5 pairwise satisfy the closest property.
5 Reasoning About XML Inclusion Dependencies
As discussed in Section 1, the focus in this article is on reasoning about core XINDs in complete XML trees, and in this section we first define these key concepts. In the remaining sections we then solve the implication and consistency problems for core XINDs in complete XML trees, using a chase algorithm. We note that we omit detailed proofs throughout this section and refer the reader to the full version of the paper, which can be obtained at http://www.dke.uni-linz.ac.at/xind/fullversion.pdf.

5.1 The Framework: Core XINDs in Complete XML Trees
From a general point of view, before requiring that the data in an XML document be complete one first has to specify the structure of the information that the document is expected to contain. We use a set of paths P to specify the structure of the expected information in an XML document (tree), and now define what we mean by an XML tree conforming to P.

Definition 8. A tree T is defined to conform to a set of paths P, if for every node v in T, if P is the path such that v ∈ N(P, T), then P ∈ P.

For example, if we denote the subtree rooted at node v1 in Figure 2 by T1, then T1 conforms to the set of paths P1 = {offer, offer.course, offer.course.cno, offer.course.day, offer.course.sem}. We now introduce the concept of a complete XML tree, which is an extension to XML of the notion of a complete relation. To understand the intuition, consider again the subtree T1 in Figure 2 and the set of paths P′1 = P1 ∪ {offer.course.max}. Then T1 also conforms to P′1, but we do not consider it to be complete w.r.t. P′1 because the existence of the path offer.course.max in P′1 means that we expect that every course in T1 will have a max, which is not satisfied by v2 and v6 in Figure 2. We now make this idea more precise.

Definition 9. If T is a tree that conforms to a set of paths P, then T is defined to be complete w.r.t. P, if whenever P and P̄ are paths in P such that P ⊂ P̄, and there exists a node v ∈ N(P, T), then there also exists a node v̄ ∈ N(P̄, T) such that v ∈ ancestor(v̄).

For instance, as just noted, T1 is not complete w.r.t. P′1 but it is complete w.r.t. P1. This example also illustrates an important point. Unlike the relational case, the completeness of a tree is only defined w.r.t. a specific set of paths and so, as we have just seen, a tree may conform to two different sets of paths, but may be complete w.r.t. one set but not the other. We also note that if a tree T is complete w.r.t. a set of paths P, then P is what we call downward-closed.
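Definitions 8 and 9 translate directly into a check over a tree and a set of paths. The sketch below assumes a hypothetical dictionary encoding of the subtree T1 (node id mapped to label and parent id) and represents paths as tuples of labels; the function names are illustrative.

```python
# Hypothetical encoding of the subtree T1 rooted at v1 in Figure 2: node id -> (label, parent id).
T1 = {
    "v1": ("offer", None),
    "v2": ("course", "v1"), "v3": ("cno", "v2"), "v4": ("day", "v2"), "v5": ("sem", "v2"),
    "v6": ("course", "v1"), "v7": ("cno", "v6"), "v8": ("day", "v6"), "v9": ("sem", "v6"),
}
P1 = {
    ("offer",), ("offer", "course"),
    ("offer", "course", "cno"), ("offer", "course", "day"), ("offer", "course", "sem"),
}
P1_PRIME = P1 | {("offer", "course", "max")}


def ancestors(tree, v):
    """Strict ancestors of v, nearest first."""
    out, v = [], tree[v][1]
    while v is not None:
        out.append(v)
        v = tree[v][1]
    return out


def path_of(tree, v):
    """Path (tuple of labels) from the root of the tree down to v."""
    chain = [v] + ancestors(tree, v)
    return tuple(tree[w][0] for w in reversed(chain))


def conforms(tree, paths):
    """Definition 8: the path of every node belongs to the given set of paths."""
    return all(path_of(tree, v) in paths for v in tree)


def is_complete(tree, paths):
    """Definition 9: every node on a path P has a descendant on each extension of P in paths."""
    if not conforms(tree, paths):
        return False
    for p in paths:
        for q in paths:
            if len(p) < len(q) and q[:len(p)] == p:  # p is a strict prefix of q
                for v in (n for n in tree if path_of(tree, n) == p):
                    if not any(path_of(tree, w) == q and v in ancestors(tree, w) for w in tree):
                        return False
    return True


print(is_complete(T1, P1), is_complete(T1, P1_PRIME))  # True False
```

T1 conforms to both sets of paths, but it is complete only w.r.t. P1, because neither course node has a descendant on offer.course.max.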
Here, downward-closed means that if P and P̄ are paths with P ∈ P and P̄ ⊂ P, then P̄ ∈ P. For example, the sets P1 and P′1 are downward-closed.

We now turn to the class of XINDs that we consider in our reasoning. It is natural to expect that if an XIND σ is intended to apply to an XML tree T, then the constraint imposed by σ should belong to the information represented by T. We incorporate this idea into our framework by requiring that the paths in an XIND be taken from the set of paths to which the targeted tree conforms, which we now define.

Definition 10. An XIND σ = ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n])) is defined to conform to a set of paths P, if for all i ∈ [1, n], P.Pi ∈ P and P′.P′i ∈ P.

We also place another restriction on an XIND, motivated by our belief that a consequence of a tree T satisfying an XIND σ should not be that a node in the tree has a fixed value. Suppose that for the RHS selector of σ we have P′ = ρ, and that for some i ∈ [1, n] the RHS field P′i consists of exactly one attribute label. Since there is only one root node, and in turn at most one attribute node in N(P′.P′i, T), the semantics of σ then mean that every node in N(P.Pi, T) ∪ N(P′.P′i, T) must have a fixed value. We believe that this is not the intent of an XIND, and that such a constraint should be specified instead in a DTD or XSD for T. Since the study of the interaction between structural constraints and integrity constraints is known to be a complex one [5], and outside the scope of this paper, we exclude such an XIND, and this leads to the following definition.

Definition 11. An XIND ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n])) is a core XIND, if for all i ∈ [1, n], P′i does not exclusively consist of one attribute label, if P′ = ρ.

5.2 Inference Rules for Core XINDs in Complete Trees
Table 1 presents a set of inference rules R1 - R6, where the symbol ⊢ denotes that the XINDs in the premise derive the XIND in the conclusion. We first present our result on the soundness of rules R1 - R6, and then illustrate the rules.

Theorem 2. The set of inference rules R1 - R6 is sound for the implication of core XINDs in complete XML trees.

Rules R1 - R3 correspond to the well-known inference rules for INDs [1], which is to be expected given Theorem 1 and the fact that XML trees generated from a complete relational database by the mapping described in Section 4 are a subclass of complete XML trees. The remaining rules have no parallels in the inference rules for INDs, and we now discuss them.

Rule R4 allows one to transfer a path from the end of the RHS selector in an XIND to the start of the RHS fields. For example, from R4 and the XIND ((ρ.teaches.course, [cno]) ⊆ (ρ.offer.course, [cno])), which is satisfied in the XML tree of Figure 2, we derive the XIND ((ρ.teaches.course, [cno]) ⊆ (ρ.offer, [course.cno])), whereby the last label in the RHS selector ρ.offer.course has been transferred to the start of the RHS fields. Rule R5 is the reverse of R4, whereby a path from the start of the RHS fields is transferred to the end of the RHS selector.

Rule R6 is a rule that, roughly speaking, allows one to union the LHS fields and the RHS fields of two XINDs, provided that the RHS fields intersect only at the root. For example, given the XINDs ((ρ.teaches.course, [cno]) ⊆ (ρ, [offer.course.cno])) and ((ρ.teaches.course, [info.lec]) ⊆ (ρ, [department.lec])), we can derive ((ρ.teaches.course, [cno, info.lec]) ⊆ (ρ, [offer.course.cno, department.lec])), since ρ.offer.course.cno ∩ ρ.department.lec = ρ. However, referring to the example in Figure 2, the XINDs ((ρ.teaches.course, [cno]) ⊆ (ρ, [offer.course.cno])) and ((ρ.teaches.course, [info.sem]) ⊆ (ρ, [offer.course.sem])) do not imply ((ρ.teaches.course, [cno, info.sem]) ⊆ (ρ, [offer.course.cno, offer.course.sem])), since ρ.offer.course.cno ∩ ρ.offer.course.sem ≠ ρ.

R1 Reflexivity
   {} ⊢ ((P, [P1, . . . , Pn]) ⊆ (P, [P1, . . . , Pn]))

R2 Permutated Projection
   ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n]))
   ⊢ ((P, [Pπ(1), . . . , Pπ(m)]) ⊆ (P′, [P′π(1), . . . , P′π(m)])),
   if {π(1), . . . , π(m)} ⊆ {1, . . . , n}

R3 Transitivity
   ((P, [P1, . . . , Pn]) ⊆ (P̄, [P̄1, . . . , P̄n])) ∧ ((P̄, [P̄1, . . . , P̄n]) ⊆ (P′, [P′1, . . . , P′n]))
   ⊢ ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n]))

R4 Upshift
   ((P, [P1, . . . , Pn]) ⊆ (P′.R, [P′1, . . . , P′n]))
   ⊢ ((P, [P1, . . . , Pn]) ⊆ (P′, [R.P′1, . . . , R.P′n]))

R5 Downshift
   ((P, [P1, . . . , Pn]) ⊆ (P′, [R.P′1, . . . , R.P′n]))
   ⊢ ((P, [P1, . . . , Pn]) ⊆ (P′.R, [P′1, . . . , P′n]))

R6 Union
   ((P, [P1, . . . , Pm]) ⊆ (ρ, [P′1, . . . , P′m])) ∧ ((P, [Pm+1, . . . , Pn]) ⊆ (ρ, [P′m+1, . . . , P′n]))
   ⊢ ((P, [P1, . . . , Pn]) ⊆ (ρ, [P′1, . . . , P′n])),
   if ρ.P′i ∩ ρ.P′j = ρ for all (i, j) ∈ [1, m] × [m+1, n]

Table 1. Inference Rules for XINDs
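Rules R4 - R6 are purely syntactic rewrites, so they can be sketched as functions on a small record type. The representation below (an XIND as four tuples of labels, with "rho" standing for ρ) and all names are assumptions made for illustration; the check that RHS fields stay non-empty after a downshift is likewise an assumption read off Definition 5.

```python
from typing import NamedTuple, Tuple

Path = Tuple[str, ...]


class XIND(NamedTuple):
    lhs_sel: Path
    lhs_fields: Tuple[Path, ...]
    rhs_sel: Path
    rhs_fields: Tuple[Path, ...]


def upshift(x: XIND, k: int = 1) -> XIND:
    """R4: move the last k labels of the RHS selector to the start of every RHS field."""
    if not 0 < k < len(x.rhs_sel):
        raise ValueError("the RHS selector must keep at least the root label")
    moved = x.rhs_sel[-k:]
    return x._replace(rhs_sel=x.rhs_sel[:-k],
                      rhs_fields=tuple(moved + f for f in x.rhs_fields))


def downshift(x: XIND, k: int = 1) -> XIND:
    """R5: move a common prefix of k labels from the RHS fields to the end of the RHS selector."""
    prefix = x.rhs_fields[0][:k]
    if any(f[:k] != prefix or len(f) <= k for f in x.rhs_fields):
        raise ValueError("all RHS fields must share the prefix and stay non-empty")
    return x._replace(rhs_sel=x.rhs_sel + prefix,
                      rhs_fields=tuple(f[k:] for f in x.rhs_fields))


def union(x: XIND, y: XIND) -> XIND:
    """R6: combine two XINDs with equal LHS selectors whose RHS selector is the root."""
    if x.lhs_sel != y.lhs_sel or x.rhs_sel != ("rho",) or y.rhs_sel != ("rho",):
        raise ValueError("R6 needs equal LHS selectors and the root as RHS selector")
    # rho.P'i and rho.P'j intersect only at the root iff their first labels differ.
    if any(f[0] == g[0] for f in x.rhs_fields for g in y.rhs_fields):
        raise ValueError("RHS fields must intersect only at the root")
    return XIND(x.lhs_sel, x.lhs_fields + y.lhs_fields, x.rhs_sel, x.rhs_fields + y.rhs_fields)


TC = ("rho", "teaches", "course")
x = XIND(TC, (("cno",),), ("rho", "offer", "course"), (("cno",),))
print(upshift(x).rhs_sel, upshift(x).rhs_fields)  # ('rho', 'offer') (('course', 'cno'),)
```

Applying downshift to the result of upshift returns the original XIND, which mirrors the statement that R5 is the reverse of R4.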
5.3 The Chase for Core XINDs in Complete Trees and Consistency
In this section we present the chase algorithm for core XINDs in complete trees, and then use it to solve the problem of consistency. As depicted in Algorithm 1, the chase is a recursive algorithm that takes as input (i) a set of paths P, (ii) a tree T that is complete w.r.t. P, and (iii) a set of XINDs Σ that conforms to P, and returns a complete tree T̄ that satisfies Σ. From a bird's-eye view, the chase halts if the input tree Ts for a (recursive) step s satisfies Σ, and otherwise it
(i) chooses an XIND σs = ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n])) from Σ, such that Ts ⊭ σs because of a sequence of nodes [v0, v1, . . . , vn], where v0 is an LHS selector node for σs and v1, . . . , vn are corresponding field nodes, and
(ii) creates new nodes in Ts, such that the resulting tree Ts+1 contains an RHS selector node v′0 and field nodes v′1, . . . , v′n that remove the violation.
Algorithm 1 Chase(P, T, Σ)
in:  A downward-closed sequence of legal paths P = [R1, . . . , Rm] ordered by length
     A tree T = (V, E, lab, val, vρ) that is complete w.r.t. P
     A sequence of XINDs Σ that conforms to P
out: Tree T̄ that is complete w.r.t. P and satisfies Σ

 1: if T ⊨ Σ then return T; end if;
 2: let σ = ((P, [P1, . . . , Pn]) ⊆ (P′, [P′1, . . . , P′n])) be the first XIND in Σ such that T ⊭ σ;
 3: let Y be the set of all sequences of nodes that violate σ in tree T;
 4: for i := 0 to n do                                        ▷ choose violation
 5:   repeat
 6:     choose sequences [v0, v1, . . . , vn] and [v̂0, v̂1, . . . , v̂n] from Y;
 7:     if vi ≺ v̂i then remove [v̂0, v̂1, . . . , v̂n] from Y; end if
 8:   until no more change to Y is possible;
 9: end for
10: let [v0, v1, . . . , vn] be the remaining sequence of nodes in Y;
11: let X be a set that exclusively contains the root node vρ;
12: for i := 2 to m do                                        ▷ remove violation
13:   if there exists path P′x ∈ [P′1, . . . , P′n] such that Ri ∩ P′.P′x ≠ ρ then
14:     create a new node v and add v to both V and X;
15:     let l1.···.lk be the sequence of labels in Ri;
16:     set lab(v) = lk;
17:     let v̂ be the node in N(l1.···.lk−1, T) ∩ X;
18:     add (v̂, v) to E such that v is the last child of v̂;
19:     if there exists path P′y ∈ [P′1, . . . , P′n] such that Ri = P′.P′y then
20:       set val(v) = val(vy);                               ▷ vy ∈ [v1, . . . , vn]
21:     else if lk is an attribute or text label then
22:       set val(v) = "0";
23:     end if
24:   end if
25: end for
26: return Chase(P, T, Σ);
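The node-creation loop of Algorithm 1 (Lines 11 - 25) can be sketched for the example of Figure 3. This is a simplified, hypothetical rendering rather than the authors' implementation: the tree is a dictionary from node id to (label, parent id, value) with a separate ordered child list, the set X is kept as a map from path to created node, the set of attribute/text labels is passed in explicitly, and fresh ids are derived from the tree size, which only works for this toy numbering.

```python
# Hypothetical encoding of tree Ts in Figure 3; "rho" stands for the root label.
TREE = {
    "vr": ("rho", None, None),
    "v1": ("ass", "vr", None), "v2": ("name", "v1", "A1"),
    "v3": ("ass", "vr", None), "v4": ("name", "v3", "A2"),
    "v5": ("emp", "vr", None), "v6": ("name", "v5", "A3"), "v7": ("room", "v5", "R1"),
}
CHILDREN = {"vr": ["v1", "v3", "v5"], "v1": ["v2"], "v3": ["v4"], "v5": ["v6", "v7"]}
# Downward-closed sequence of paths, ordered by length.
PATHS = [("rho",), ("rho", "ass"), ("rho", "emp"),
         ("rho", "ass", "name"), ("rho", "emp", "name"), ("rho", "emp", "room")]


def common_prefix_len(p, q):
    """Length of the intersection of two paths."""
    k = 0
    while k < min(len(p), len(q)) and p[k] == q[k]:
        k += 1
    return k


def remove_violation(tree, children, root, paths, rhs_sel, rhs_fields, lhs_values, leaf_labels):
    """Create one new path instance per relevant path so that the violation disappears."""
    targets = [rhs_sel + f for f in rhs_fields]   # the paths P'.P'x
    created = {paths[0]: root}                    # plays the role of the set X
    for r in paths[1:]:
        # Line 13: skip paths that meet every target only at the root.
        if not any(common_prefix_len(r, t) > 1 for t in targets):
            continue
        v = f"v{len(tree)}"                       # fresh id (assumption of this sketch)
        parent = created[r[:-1]]                  # Line 17: parent among the created nodes
        if r in targets:
            value = lhs_values[targets.index(r)]  # Line 20: copy the violating LHS value
        elif r[-1] in leaf_labels:
            value = "0"                           # Line 22: default value
        else:
            value = None
        tree[v] = (r[-1], parent, value)
        children.setdefault(parent, []).append(v)  # Line 18: add as last child
        created[r] = v
    return created


new_nodes = remove_violation(TREE, CHILDREN, "vr", PATHS,
                             ("rho", "emp"), [("name",)], ["A1"], {"name", "room"})
print(TREE["v8"], TREE["v9"], TREE["v10"])
```

For the violation [v1, v2] of σ this adds an emp node with a name child carrying A1 and a room child carrying "0", which is the tree Ts+1 described for Figure 3.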
Our proof techniques are based on certain characteristics of the trees generated by the steps of the chase. It is therefore essential that these tree characteristics, although dependent on the input, are independent of the specific sequence of steps used. That is, the trees generated in the chase must be unique for a given input, and we now outline how this uniqueness is achieved. The crucial prerequisite is to unambiguously determine the sequence of violating nodes that is chosen in a step s in removing an XIND violation. Thereby, the chase chooses σs to be the first XIND in Σ that is violated in Ts. We note that Σ is expected to be a sequence, rather than a set, of XINDs and that therefore σs is unique w.r.t. the input. In order to unambiguously choose a sequence of nodes [v0, v1, . . . , vn] that violate σs, in case there is more than one such sequence, we use the following, simplified notion of document-order.

Definition 12. In a tree T, node v is defined to precede node v̄ w.r.t. document-order, denoted by v ≺ v̄, if v is visited before v̄ in a pre-order traversal of T.

Based on this notion of document-order, the sequence of violating nodes that is, roughly speaking, the one in the top-left of Ts, is chosen at Lines 3-9. Besides the unambiguous choice of a sequence of violating nodes in a step s, the chase also ensures that the procedure for removing the violation is unique. Roughly speaking, the chase loops for this purpose over the paths in P and creates path instances accordingly (cf. Lines 13 - 25), such that the resulting tree Ts+1 is complete w.r.t. P and contains a sequence of nodes [v′0, v′1, . . . , v′n] that removes the violation. The desired uniqueness is basically achieved, since (i) P is expected to be a sequence, rather than a set, of paths, and therefore the order in which the paths in P are iterated in the loop at Line 12 is definite, and (ii) a new node is always added to the parent as the last child w.r.t. document-order (cf. Line 18).

For example, if Ts is the tree depicted in Figure 3, then [v1, v2] and [v3, v4] are the sequences violating XIND σ, which is depicted in the same figure. Then, the chase chooses the violation [v1, v2], since v1 ≺ v3, and creates nodes v8, v9 in tree Ts+1, which remove the violation. Also, in order that Ts+1 is complete w.r.t. the set of paths P in Figure 3, it adds node v10 as the last child of v8.
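[Figure: tree Ts with root ρ and children v1 (ass), v3 (ass), v5 (emp); tree Ts+1 additionally contains v8 (emp). Leaf nodes: v2 (name, A1), v4 (name, A2), v6 (name, A3), v7 (room, R1), v9 (name, A1), v10 (room, 0).]
P = {ρ, ρ.ass, ρ.ass.name, ρ.emp, ρ.emp.name, ρ.emp.room}
Σ = {σ = ((ρ.ass, [name]) ⊆ (ρ.emp, [name]))}
Fig. 3. Example Chase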
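The chase step on the example of Figure 3 can be replayed with a small program. The following is only a minimal sketch under simplifying assumptions, not the algorithm in full generality: it handles a single unary XIND (n = 1), it treats a sequence [v0, v1] as violating whenever the value of v1 occurs on no RHS field node, and the names `Node`, `nodes_on`, `chase_step`, `figure3` and the parameter `value_labels` (the labels assumed to be attribute or text labels) are hypothetical helpers introduced here. The leaf values are those read from Figure 3.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

Path = Tuple[str, ...]  # the path ρ.emp.name is written ("ρ", "emp", "name")


@dataclass(eq=False)  # identity-based equality: distinct nodes may share label and value
class Node:
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def preorder(node: Node) -> Iterator[Node]:
    """Visit nodes in document-order (Definition 12)."""
    yield node
    for child in node.children:
        yield from preorder(child)


def nodes_on(node: Node, path: Path) -> List[Node]:
    """Nodes reached from `node` along `path`; path[0] must be the label of `node`."""
    if node.label != path[0]:
        return []
    if len(path) == 1:
        return [node]
    return [hit for child in node.children for hit in nodes_on(child, path[1:])]


def chase_step(root: Node, P: List[Path], xind, value_labels) -> bool:
    """Repair the first violation (in document-order) of a unary XIND, if there is one."""
    sel, fld, sel_r, fld_r = xind  # ((P, [P1]) ⊆ (P', [P'1]))
    rhs_values = {v.value for v in nodes_on(root, sel_r + fld_r)}
    # Line 3: collect all violating sequences [v0, v1] (simplified unary test).
    Y = [(v0, v1)
         for v0 in nodes_on(root, sel)
         for v1 in nodes_on(v0, sel[-1:] + fld)
         if v1.value not in rhs_values]
    if not Y:
        return False
    # Lines 4-10: keep the sequence that comes first w.r.t. document-order.
    position = {id(n): i for i, n in enumerate(preorder(root))}
    _, v1 = min(Y, key=lambda seq: [position[id(v)] for v in seq])
    # Lines 11-25: instantiate every path that shares more than ρ with P'.P'1.
    target = sel_r + fld_r
    X: Dict[Path, Node] = {P[0]: root}
    for R in P[1:]:
        if R[:2] != target[:2]:
            continue
        v = Node(R[-1])
        X[R[:-1]].children.append(v)  # Line 18: v becomes the last child
        X[R] = v
        if R == target:
            v.value = v1.value        # Line 20: copy the violating value
        elif R[-1] in value_labels:
            v.value = "0"             # Line 22: default value
    return True


def figure3():
    """Tree Ts, paths P (ordered by length) and XIND σ of the running example."""
    root = Node("ρ", children=[
        Node("ass", children=[Node("name", "A1")]),
        Node("ass", children=[Node("name", "A2")]),
        Node("emp", children=[Node("name", "A3"), Node("room", "R1")]),
    ])
    P = [("ρ",), ("ρ", "ass"), ("ρ", "emp"),
         ("ρ", "ass", "name"), ("ρ", "emp", "name"), ("ρ", "emp", "room")]
    sigma = (("ρ", "ass"), ("name",), ("ρ", "emp"), ("name",))
    return root, P, sigma
```

Calling `chase_step(root, P, sigma, {"name", "room"})` once on the tree returned by `figure3()` appends a new emp node with children name = A1 and room = 0, which mirrors nodes v8, v9 and v10. Keeping the created nodes in a dictionary keyed by path plays the role of the set X at Line 17, since X contains at most one node per path.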
In addition to requiring the uniqueness of the resulting tree T̄ from the chase, we also require that the chase terminates after a finite number of steps. The argument for the termination of the chase is based on the following observation on the set U of distinct string-values assigned to attribute and text nodes in a tree T. A tree T must satisfy an XIND σ = ((P, [P1, ..., Pn]) ⊆ (P′, [P′1, ..., P′n])) if T contains, for every sequence of values u1, ..., un ∈ U × ··· × U, a sequence of nodes ṽ1 ∈ N(P′.P′1, T), ..., ṽn ∈ N(P′.P′n, T) that pairwise satisfy the closest property, such that for all i ∈ [1, n], val(ṽi) = ui. Since the number of string-values in the initial tree is finite, and the chase introduces at most one new string-value (cf. Line 22), but adds in every step, w.r.t. an XIND in Σ, a new sequence of nodes v′1, ..., v′n, and thus also a new sequence of values val(v′1), ..., val(v′n), it follows that the chase must terminate. The next lemma presents the properties of the chase, which we have illustrated in the previous discussion.
Lemma 2. An application Chase(P, T, Σ) of Algorithm 1 terminates and returns a unique tree T̄ that is complete w.r.t. P and satisfies Σ.
We now address the consistency problem, which we formulate as the question of whether there exists a tree T, for any given combination of a downward-closed set of paths P and a set of core XINDs Σ that conforms to P, such that T is complete w.r.t. P and satisfies Σ. We have the following result.
Theorem 3. The class of core XINDs in complete XML trees is consistent.
The correctness of Theorem 3 follows from the fact that there always exists a tree T that is complete w.r.t. a given set of paths P, and the result in Lemma 2 that the tree T̄ returned by the application Chase(P, T, Σ) of Algorithm 1 is complete w.r.t. P and satisfies the given set of core XINDs Σ.
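The termination argument preceding Lemma 2 can be turned into an explicit, if crude, bound on the number of chase steps. The following is only a sketch of that counting argument and is not stated in this form in the text. It assumes that U is the set of string-values in the input tree, that n_σ denotes the number of fields of an XIND σ (a symbol introduced here), and that each step for σ realizes on the RHS of σ a sequence of values that was not realized there before, since otherwise the chosen sequence of nodes would not have been a violation.

```latex
% Values available to the chase: U plus the single default value "0" (Line 22).
% Each XIND \sigma with n_\sigma fields admits at most (|U|+1)^{n_\sigma}
% distinct sequences of values, and every step for \sigma uses up one of them.
\[
  \#\,\text{chase steps}
  \;\le\;
  \sum_{\sigma \in \Sigma} \bigl(|U| + 1\bigr)^{n_\sigma}
\]
```

For the example in Figure 3, with U = {A1, A2, A3, R1} and a single unary XIND, this gives at most 5 steps. The bound is loose there, because only the two violating values A1 and A2 can ever trigger a step.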
5.4
Implication of Core XINDs in Complete Trees
We now use the chase algorithm to address the implication problem for core XINDs in complete XML trees. So given a set of XINDs Σ ∪ {σ} that conforms to a set of paths P, we use Σ ⊨ σ to denote that Σ implies σ, i.e. there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ and T ⊭ σ. We first construct a special initial tree Tσ (see footnote 4), which essentially has distinct values for LHS field nodes and is empty elsewhere. We then have the following theorem.
Theorem 4. Σ ⊢ σ iff T̄ ⊨ σ, where T̄ is the tree returned by the application Chase(P, Tσ, Σ) of Algorithm 1.
The key idea of the proof is to show by induction that the only XINDs satisfied by any intermediate tree during the chase are those derivable from Σ and rules R1 - R6. This theorem yields the following important result.
Corollary 1. The set of inference rules R1 - R6 is complete for the implication of core XINDs in complete XML trees.
This result follows since if Σ ⊬ σ, then the final tree T̄ ⊭ σ and so Σ ⊭ σ from Lemma 2. Combining these results with Theorem 2 also shows that the chase algorithm yields a decision procedure: Σ ⊨ σ iff T̄ ⊨ σ.
4 A detailed construction procedure is presented in the full version of the paper.
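The decision procedure Σ ⊨ σ iff T̄ ⊨ σ can be illustrated in the relational special case that XINDs are meant to extend. The sketch below is an analogue for unary inclusion dependencies over relations, not the construction of Tσ from the full version of the paper. The encoding of a database as a dictionary of row lists, the function names `satisfies`, `chase` and `implies`, and the distinguished value "x" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

IND = Tuple[str, str, str, str]      # (R, A, S, B) stands for R[A] ⊆ S[B]
Database = Dict[str, List[Dict[str, str]]]


def satisfies(db: Database, ind: IND) -> bool:
    """True if every A-value of relation R also occurs as a B-value of S."""
    r, a, s, b = ind
    return {row[a] for row in db[r]} <= {row[b] for row in db[s]}


def chase(db: Database, schema: Dict[str, List[str]], sigma_seq: List[IND]) -> Database:
    """Repair violations until all INDs hold; unconstrained columns receive "0"."""
    while True:
        for r, a, s, b in sigma_seq:          # first violated IND, in the order of Σ
            present = {row[b] for row in db[s]}
            missing = [row[a] for row in db[r] if row[a] not in present]
            if missing:
                new_row = {col: "0" for col in schema[s]}
                new_row[b] = missing[0]       # repair the first violating row
                db[s].append(new_row)
                break
        else:
            return db                         # no IND is violated any more


def implies(schema: Dict[str, List[str]], sigma_seq: List[IND], sigma: IND) -> bool:
    """Decide implication by chasing an initial database built from σ."""
    r, a, _, _ = sigma
    db: Database = {rel: [] for rel in schema}
    # Initial database: one LHS row with the distinct value "x" in the LHS column.
    db[r].append({col: ("x" if col == a else "0") for col in schema[r]})
    return satisfies(chase(db, schema, sigma_seq), sigma)
```

With the hypothetical schema `{"ass": ["name"], "emp": ["name", "room"], "person": ["name"]}` and Σ consisting of emp[name] ⊆ person[name] and ass[name] ⊆ emp[name], the call `implies(schema, sigma_seq, ("ass", "name", "person", "name"))` chases the value "x" from ass through emp into person and therefore reports that the dependency is implied, whereas the reverse dependency is not.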
6
Related Work
In recent years, several types of XML Integrity Constraints (XICs) such as XFDs or XKeys have been investigated. Because of space limitations, we restrict our attention in this section to inclusion-type constraints and refer the reader to [11] for a survey of other types of XICs. Early work on XICs has been done by Abiteboul and Vianu [2] and Buneman et al. [10]. They defined path inclusion constraints (PICs), which essentially require that whenever a node is reachable over one path, it must also be reachable over another path. Although a PIC and an XIND are both inclusion constraints, they are nevertheless different and cannot be directly compared. Given a set of nodes, an XIND forces XML documents to contain other nodes with corresponding values. In contrast, a PIC requires these nodes to be reachable over certain paths. Closer to XINDs are XFKeys, which have been defined in [13, 15, 14, 4]. However, these XICs constrain attribute nodes or text nodes to be children of one element node, and so cannot express an XIND, which allows for nodes to have different parents. Whereas the idref mechanism of DTD [8] is restricted to unary inclusion constraints, the keyref mechanism of XSD [19] is limited with respect to the possible number of matching path instances. For instance, consider the XIND ((ρ.holding.course, [cno]) ⊆ (ρ.offer, [course.cno])) in the introductory example in Section 1. Then the semantics of the keyref mechanism requires that any offer node has at most one descendant cno node, which is clearly not satisfied in this example. The XFKey defined in [3] is similar to the keyref mechanism and therefore subject to the same limitation. A subset of the authors also considered an XML inclusion constraint in [22]; however, there are several differences between that paper and our current one. Firstly, the syntactic framework adopted in [22] is less expressive and does not use the selector/field framework. 
Secondly, the semantics used in [22], although partly based on the closest concept, differs from the semantics used in this paper and as a result has a different axiom system and does not always preserve the semantics of an IND.
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
2. S. Abiteboul and V. Vianu. Regular Path Queries with Constraints. Computer and System Sciences, 58(3):428–452, 1999.
3. M. Arenas, W. Fan, and L. Libkin. What's Hard about XML Schema Constraints? In DEXA, volume 2453 of Lecture Notes in Computer Science, pages 269–278. Springer, 2002.
4. M. Arenas, W. Fan, and L. Libkin. Consistency of XML Specifications. In Inconsistency Tolerance, volume 3300 of Lecture Notes in Computer Science, pages 15–41. Springer, 2005.
5. M. Arenas, W. Fan, and L. Libkin. On the Complexity of Verifying Consistency of XML Specifications. SIAM Journal on Computing, 38(3):841–880, 2008.
6. M. Arenas and L. Libkin. A Normal Form for XML Documents. ACM Transactions on Database Systems, 29:195–232, 2004.
7. P. Atzeni and V. DeAntonellis. Relational Database Theory. Benjamin Cummings, 1993.
8. T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensible Markup Language (XML) 1.0 Fifth Edition. Technical report, W3C, 2008. http://www.w3.org/TR/2008/REC-xml-20081126.
9. P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Keys for XML. Computer Networks, 39(5):473–487, 2002.
10. P. Buneman, W. Fan, and S. Weinstein. Path constraints in semistructured databases. Computer Systems Sciences, 61(2):146–193, 2000.
11. W. Fan. XML Constraints: Specification, Analysis, and Applications. In DEXA Workshops, pages 805–809. IEEE Computer Society, 2005.
12. W. Fan. XML Publishing: Bridging Theory and Practice. In DBPL, pages 1–16, 2007.
13. W. Fan, G. M. Kuper, and J. Siméon. A Unified Constraint Model for XML. Computer Networks, 39(5):489–505, 2002.
14. W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. JACM, 49(3):368–406, 2002.
15. W. Fan and J. Siméon. Integrity Constraints for XML. In PODS, pages 23–34. ACM, 2000.
16. W. Fan and J. Siméon. Integrity constraints for XML. Computer and System Sciences, 66(1):254–291, 2003.
17. N. Klarlund, T. Schwentick, and D. Suciu. XML: Model, Schemas, Types, Logics, and Queries. In Logics for Emerging Applications of Databases, pages 1–41, 2003.
18. A. Møller and M. Schwartzbach. Introduction to XML and Web Technologies. Addison Wesley, 2006.
19. H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures Second Edition. Technical report, W3C, 2004. http://www.w3.org/TR/xmlschema-1/.
20. A. Vakali, B. Catania, and A. Maddalena. XML Data Stores: Emerging Practices. IEEE Internet Computing, 9(2):62–69, 2005.
21. M. W. Vincent, J. Liu, and M. K. Mohania. On the Equivalence between FDs in XML and FDs in Relations. Acta Informatica, 44(3-4):207–247, 2007.
22. M. W. Vincent, M. Schrefl, J. Liu, C. Liu, and S. Dogen. Generalized Inclusion Dependencies in XML. In APWeb, volume 3007 of Lecture Notes in Computer Science, pages 224–233. Springer, 2004.