Rewriting Union Queries Using Views

J. Wang ([email protected]), INT, Griffith University, Australia
R. Topor ([email protected]), CIT, Griffith University, Australia
M. Maher ([email protected]), National ICT, Australia

Abstract. The problem of finding contained rewritings of queries using views is of great importance in mediated data integration systems. In this paper, we first present a general approach for finding contained rewritings of unions of conjunctive queries with arbitrary built-in predicates. Our approach is based on an improved method for testing conjunctive query containment in this context. Although conceptually simple, our approach generalizes previous methods for finding contained rewritings of conjunctive queries and is more powerful in the sense that many rewritings that cannot be found using existing methods can be found by our approach. Furthermore, implication constraints [ZÖ] over the base relations can be easily handled. We then present a simplified approach which is less complete, but is much faster than the general approach, and it still finds maximum rewritings in several special cases. Our general approach finds more rewritings than previous algorithms such as the Bucket and the resolution-based algorithms. Our simplified approach generalizes the U-join and the MiniCon algorithms with no loss of efficiency.

Keywords: data integration, global schema, view, query, union query, constraint, implication constraint, contained rewriting.
© 2005 Kluwer Academic Publishers. Printed in the Netherlands.
constraint.tex; 15/04/2005; 10:26; p.1

1. Introduction

The problem of rewriting queries using views is of great importance in mediated data integration systems. In such systems, users are presented with a uniform interface through which queries are submitted. The uniform interface, also called the global schema, consists of a set of virtual relations which may not be physically stored. The actual data sources (i.e., the stored data) are regarded as logical views defined on the virtual relations [Ull00]. Thus in order to answer a user query, the system must first rewrite the query into one that is defined on the views only. In other words, given a query Q defined on the base relations, we need to find a query Qr defined on the view relations such that Qr gives correct answers to Q. If so, Qr is called a rewriting of Q. Usually two types of rewritings are sought: equivalent rewritings and contained rewritings. Equivalent rewritings are those that give exactly the same set of answers as the original query. Contained rewritings are those that give possibly only part of the answers to the original query.

In this paper, we study the problem of finding contained rewritings using views when the views are conjunctive queries with arbitrary built-in predicates (which we call general conjunctive queries) and the user query is a union of general conjunctive queries (referred to as a union query). We believe such problems are abundant in practical data integration systems, and one example is shown in Example 1.1 below. The rewritings we obtain are union queries. Our attention is focused on how to quickly find rewritings that give as many correct answers as possible, rather than on how efficiently the rewritings can be evaluated.

EXAMPLE 1.1. Imagine there is a global database schema about properties that includes the following relation schemas:
    property(pid, owner, year_built),
    house(pid, land_area),
    rental_property(pid, rent).
The meaning of the attributes is self-explanatory. Suppose there is a real estate agency which owns a database consisting of relation schemas that can be described as follows:
    v1(pid, owner) :− property(pid, owner, year_built),
    v2(pid, land_area) :− house(pid, land_area),
    v3(pid, rent) :− rental_property(pid, rent).
Suppose a user asks for the property IDs and the owners of houses built before the year 2000 or of rental properties built after 1995, that is, the user query is the union of the following two conjunctive queries:
    Q(pid, owner) :− property(pid, owner, year_built), house(pid, land_area), year_built < 2000
    Q(pid, owner) :− property(pid, owner, year_built), rental_property(pid, rent), year_built > 1995.
Although there are no rewritings using v1, v2 and v3 for either conjunctive query alone, there is a contained rewriting for their union:
    Q(pid, owner) :− v1(pid, owner), v2(pid, land_area), v3(pid, rent).
The rewriting gives the property IDs and the owners of some rental properties that are houses.
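The containment claimed in Example 1.1 can be checked on a small instance. The following Python sketch is only an illustration: the tuples are invented, the name `property_` (with a trailing underscore) is a hypothetical stand-in for the relation property, and each view is assumed to store all of its answers, which is one of the view instances allowed under the open world assumption.

```python
# Toy instance of the global schema in Example 1.1 (all tuples are invented).
property_ = {(1, "ann", 1990), (2, "bob", 1998), (3, "cy", 2005)}
house = {(1, 300), (2, 450)}
rental_property = {(2, 900), (3, 1200)}

# View extensions; here each view happens to store every answer to its definition.
v1 = {(pid, owner) for pid, owner, _ in property_}
v2 = set(house)
v3 = set(rental_property)

def union_query():
    """Answers to the union of the two conjunctive queries over the base relations."""
    q1 = {(pid, owner) for pid, owner, year in property_
          for hpid, _ in house if hpid == pid and year < 2000}
    q2 = {(pid, owner) for pid, owner, year in property_
          for rpid, _ in rental_property if rpid == pid and year > 1995}
    return q1 | q2

def rewriting():
    """Answers to Q(pid, owner) :- v1(pid, owner), v2(pid, _), v3(pid, _)."""
    return {(pid, owner) for pid, owner in v1
            for hpid, _ in v2 if hpid == pid
            for rpid, _ in v3 if rpid == pid}

# Every answer of the rewriting is also an answer of the union query.
contained = rewriting() <= union_query()
```

The check succeeds on any instance for the reason the example relies on: whatever the value of year_built, it is either below 2000 or above 1995, so a property that is both a house and a rental property always satisfies one of the two conjunctive queries.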
1.1. Previous Work

For a comprehensive survey on query rewriting using views, see [Hal01]. Here we only mention a few papers that are closely related to our work. Among the early algorithms for finding contained rewritings using views, the U-join algorithm [Qia96] and the Bucket algorithm [LRO96] were used to find rewritings of conjunctive queries using conjunctive views (in the Bucket algorithm, the views and the query may contain comparison predicates such as x < a, y = x), and the Inverse-rule algorithm [DG97] was used for rewriting Datalog programs using Datalog views. Later on, the MiniCon algorithm [PH01] and the Shared-Variable-Bucket (hereafter abbreviated as SVB) algorithm [Mit01] were developed as faster (but less powerful) versions of the Bucket algorithm. A resolution-based algorithm was proposed in [GM02] and was claimed to be a generalization of the previous methods. Arithmetic comparisons in the query and views were further investigated in [ALM], which extends the MiniCon algorithm to allow some non-distinguished variables in the views to be "exported" as distinguished variables. Algorithms for finding contained rewritings in the presence of functional dependencies, inclusion dependencies, and full dependencies are studied in [Gry99, DGL00]. More detailed descriptions of some of the above algorithms are in Section 6.

1.2. The Problem and Our Contribution

The problem we study is the rewriting of unions of general conjunctive queries using general conjunctive views. We also consider the case where implication constraints [ZÖ] over the base relations are present. As mentioned above, several previous algorithms consider rewritings of conjunctive queries with or without built-in predicates. However, for a union query such as Qu = Q1 ∪ Q2, it is not enough to find the rewritings for each of the conjunctive queries and then union them, as shown in the following example.

EXAMPLE 1.2. Let Qu = Q1 ∪ Q2, where Q1 and Q2 are
    q(y) :− p(x, y), p′(x′, y), x > y
and
    q(y) :− p(x, y), r(y, z), x ≤ y
respectively. Let the views be
    v1(y) :− p(x, y), x > 1,
    v2(y) :− r(y, x), y < x, and
    v3(y) :− p′(x, y).
Then q(y) :− v1(y), v2(y), v3(y) is a contained rewriting of Qu. This rewriting is not a union of the rewritings of Q1 and Q2.
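The reason the rewriting in Example 1.2 is contained in Qu can be spelled out as a short case analysis on its expansion. The derivation below is a sketch: the fresh variable names x_1, x_2, x_3 are labels introduced here for the non-head variables of the three views, and the second subgoal of Q1 is read as a relation distinct from p, written p′.

```latex
\begin{align*}
\text{Expansion:}\quad
  & q(y) \leftarrow p(x_1, y),\ r(y, x_2),\ p'(x_3, y),\ x_1 > 1,\ y < x_2 \\
\text{Case } x_1 > y:\quad
  & p(x_1, y),\ p'(x_3, y),\ x_1 > y
    \ \text{ satisfy the body of } Q_1 \quad (x \mapsto x_1,\ x' \mapsto x_3) \\
\text{Case } x_1 \le y:\quad
  & p(x_1, y),\ r(y, x_2),\ x_1 \le y
    \ \text{ satisfy the body of } Q_2 \quad (x \mapsto x_1,\ z \mapsto x_2)
\end{align*}
```

Since one of the two cases always holds, every answer y of the expansion is an answer of Q1 or of Q2, although neither query alone contains the expansion.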
Although the Inverse-rule algorithm in [DG97] considers rewritings of Datalog queries, it does not consider the case where built-in predicates are present. The resolution-based algorithm does not consider the case where the views have built-in predicates. In fact, none of the previous algorithms mentioned above can find the rewriting in Example 1.2. To our knowledge, when the query is a single general conjunctive query, the rewritings found by existing algorithms cannot have more subgoals than the original query¹. Therefore, rewritings such as that in the next example (Example 1.3) cannot be found by those algorithms. In addition, the presence of general-form implication constraints was not considered in previous work.

EXAMPLE 1.3. If the query is q(x) :− p(x, y, z), y < z, and the views are v1(x, y) :− p(x, y, 1) and v2(x, z) :− p(x, 1, z), then q(x) :− v1(x, y), v2(x, z), y < z is a rewriting.

In this paper, we first present an improved result on testing query containment. This result is used extensively in the analysis of our algorithms. We then present a general approach for rewriting union queries using general conjunctive views. Our general approach can, for instance, find the rewritings in Examples 1.1 through 1.3. Furthermore, implication constraints can be handled easily. To improve the efficiency of the general approach, we provide a simplified approach which is less complete but significantly faster. When used in rewriting general conjunctive queries, our simplified approach finds strictly more rewritings than the MiniCon algorithm when built-in predicates exist in the query, and it finds maximum rewritings (see the next section for the definition) in several special cases. When there are no built-in predicates in the query and the views, the simplified approach finds the same rewritings as the U-join algorithm with similar costs.

The rest of the paper is organized as follows. Section 2 provides the technical background.
In Section 3 we introduce the concept of relevance mappings and present the improved method for testing query containment. In Section 4 we present our general approach to rewriting union queries. The simplified approach is described in Section 5. Section 6 compares our approaches with some closely related work. Section 7 concludes the paper with a summary.

¹ There are two exceptions: when there are functional dependencies, the resolution-based method in [GM02] may find rewritings that contain more subgoals; when the query and views involve only semi-interval constraints, the method in [ALM] may find recursive Datalog rewritings.
2. Preliminaries

2.1. General Conjunctive Queries and Union Queries

We assume the existence of some variable domains, and for each domain, there is an infinite number of variables. We assume every variable x takes values only from its domain dom(x). If p is the name of a d-ary relation in the database, and X = (x1, ..., xd) is a tuple of variables and constants, then p(X) is called an atom, and x1, ..., xd are called the arguments of p(X). If the arguments of an atom are all constants, then the atom is called a ground atom. If c is the name of a d-ary built-in relation (such as the comparisons < and =), then c(X) is called a constraint. A general conjunctive query (GCQ) is of the form

    q(X) :− p1(X1), ..., pn(Xn), C        (1)
where q(X), p1(X1), ..., pn(Xn) are atoms (q is distinct from p1, ..., pn, n > 0), and C is a conjunction of constraints over the variables in X ∪ X1 ∪ · · · ∪ Xn. We call q(X) the head, p1(X1), ..., pn(Xn), C the body, and C the constraint of the query. Each atom pi(Xi) (i = 1, ..., n) is called a subgoal. We will refer to the sequence X as the output of the query, and any variable in X as a head variable. A distinguished variable is a head variable or a variable that is equated, by C, to a constant or a head variable.

Given a database instance I (consisting of some ground atoms) and a GCQ Q, if there is an assignment θ of constants to the variables appearing in Q such that θ maps every subgoal of Q to a ground atom in I, and θ satisfies the constraint C, then we say θ(X) is an answer to Q with respect to the instance I.

Note that we have not put any "safety" conditions on the GCQs. In particular, we allow infinite relations, and it is possible that a variable in the head does not appear in the body, as in q(x) :− p(y), y > 1. The answer set for this query, for example, is dom(x) (when the database instance contains a tuple that makes the body logically true) or ∅ (otherwise).

Two GCQs are said to be comparable if they have the same number of head arguments and the corresponding head arguments are from the same domains. A union query is a finite union of comparable GCQs. In particular, a GCQ is also a union query.
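The definition of an answer can be made concrete with a brute-force evaluator. The Python sketch below is an illustration under several assumptions that are not part of the paper: variables are represented as strings and constants as integers, all variables share one finite domain (the paper allows infinite domains), an instance is a set of (relation name, tuple) pairs, and the function name `answers` is hypothetical.

```python
from itertools import product

def answers(head, subgoals, constraint, instance, domain):
    """Enumerate the answers of a GCQ by trying every assignment theta.

    head: tuple of arguments; subgoals: list of (relation, argument tuple);
    constraint: function from an assignment dict to bool;
    instance: set of (relation, ground tuple); domain: finite iterable of constants.
    """
    is_var = lambda a: isinstance(a, str)
    variables = sorted({a for a in head if is_var(a)} |
                       {a for _, args in subgoals for a in args if is_var(a)})
    result = set()
    for values in product(domain, repeat=len(variables)):
        theta = dict(zip(variables, values))
        ground = lambda args: tuple(theta[a] if is_var(a) else a for a in args)
        # theta must map every subgoal to a ground atom of the instance and satisfy C.
        if all((rel, ground(args)) in instance for rel, args in subgoals) \
                and constraint(theta):
            result.add(ground(head))
    return result

# The unsafe query q(x) :- p(y), y > 1 from the text, over the domain {0, 1, 2}.
unsafe = dict(head=("x",), subgoals=[("p", ("y",))],
              constraint=lambda t: t["y"] > 1, domain=[0, 1, 2])
whole_domain = answers(instance={("p", (2,))}, **unsafe)
no_answers = answers(instance={("p", (1,))}, **unsafe)
```

With the tuple p(2) in the instance the body is satisfiable, so every value of the domain is an answer for x; with only p(1) the answer set is empty, which matches the two cases described above.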
Query containment and equivalence are defined in the usual way [Ull88]. We will use Q1 ⊑ Q2 and Q1 = Q2 to denote that Q1 is contained in Q2 and that Q1 is equivalent to Q2, respectively. We will use empty query to refer to any query whose answer set is empty for every database instance. Clearly, a GCQ is empty if and only if its constraint is unsatisfiable.

A GCQ is said to be in normal form if the arguments in every atom (head or subgoal) are distinct variables only, and the sets of variables in different atoms are pairwise disjoint. A GCQ is said to be in compact form if there are no explicit or implicit non-tautological equalities between two variables or between a variable and a constant (e.g., x ≥ y ∧ x ≤ y, where x = y is not a tautology) in the constraint. Clearly, every GCQ can be put in normal form. It can be put in compact form provided all the implicit equalities in the constraint can be found.

2.2. Implication Constraints

An implication constraint [ZÖ] is a formula of the form

    r1(X1), ..., rm(Xm), D → FALSE        (2)
where m > 0, r1, ..., rm are relation names, X1, ..., Xm are tuples of variables and constants, and D is a constraint over the variables in X1 ∪ · · · ∪ Xm. Functional dependencies and equality-generating dependencies are special cases of implication constraints.

Let ∆ be a set of implication constraints. If for any database instance I which satisfies the implication constraints in ∆, the answer set of Q1 is a subset of that of Q2, then we say Q1 is contained in Q2 under ∆, denoted $Q_1 \sqsubseteq_\Delta Q_2$.

2.3. Rewritings and Maximum Rewritings

We assume the existence of a set of base relations and a set W of views. A view is a GCQ defined on the base relations. Without loss of generality, we assume the arguments in the head of a view are distinct variables only. We refer to the relation in the head of the view as the view relation. There are two world assumptions [AD98]: under the closed world assumption, the view relation stores all of the answers to the view; under the open world assumption, the view relation stores possibly only part of the answers to the view. The open world assumption is usually used in data integration [Hal01] because it is usually not known whether all answers of the views are actually stored. In this paper, we will use the open world assumption.
For any base instance D consisting of instances of the base relations, we use W(D) to denote a view instance (with respect to D) consisting of instances of the view relations. Following the open world assumption [AD98], we assume each relation instance in W(D) may contain only part of the answers to the corresponding view computed using D. Now we can formally define rewritings.

DEFINITION 2.1. Let Q be a query defined on the base relations. A rewriting of Q is a query Qr defined solely on the view relations such that, for any base instance D, all of the answers to Qr computed using any view instance W(D) are correct answers to Q. When Qr is a general conjunctive (or union) query, we call it a general conjunctive (or union) rewriting. If Qr does not always produce the empty answer set, we call it a non-empty rewriting.

Note that a rewriting as defined above is a contained rewriting. To check whether a query Qr is a rewriting of Q, we need the expansion of Qr, as defined below.

DEFINITION 2.2. If Qr is a GCQ defined on the view relations, then the expansion $Q_r^{exp}$ of Qr is the GCQ obtained as follows: For each subgoal v(x1, ..., xk) of Qr, suppose v(y1, ..., yk) is the head of the corresponding view V, and σ is a mapping that maps the variable yi to the argument xi for i = 1, ..., k, and maps every other variable in V to a distinct new variable; then replace v(x1, ..., xk) with the body of σ(V).

It is possible that the expansion of a GCQ involves variables that appear only in the constraint, but not in any atoms. Such variables should be regarded as existential variables of the constraint, and be removed whenever possible. For example, if Qr is q(x) :− v(y), x = y + 1, and V is v(y) :− p(z), y = z + 1, then $Q_r^{exp}$ is q(x) :− p(z), x = z + 2.

The expansion $Q_u^{exp}$ of a union query Qu is the union of the expansions of the GCQs in Qu.

It is easy to see that Qr is a rewriting of Q if and only if $Q_r^{exp} \sqsubseteq Q$. There may be many different rewritings of a query. To compare them, we define maximum rewritings.

DEFINITION 2.3. A rewriting Q1 of Q is said to be a maximum rewriting with respect to a query language L if for any rewriting Q2 of Q in L, every answer to Q2 is an answer to Q1 for any view instance W(D) with respect to any base instance D.

Note that for a rewriting Q1 to be maximum under the open world assumption, it is not enough that $Q_2^{exp} \sqsubseteq Q_1^{exp}$ holds for any other rewriting Q2.
EXAMPLE 2.1. Let the query Q be q(x) :− p(x). Let the views V1 and V2 be v1(x) :− p(x) and v2(x) :− p(x) respectively. Clearly Q1: q(x) :− v1(x) and Q2: q(x) :− v2(x) are two conjunctive rewritings, and $Q_1^{exp} = Q_2^{exp}$. But neither Q1 nor Q2 is a maximum rewriting with respect to conjunctive queries or union queries because, although V1 and V2 are defined to be equivalent, v1 and v2 may contain different tuples in a view instance.

The above definition of rewritings extends straightforwardly to the case where implication constraints on the base relations exist.

DEFINITION 2.4. Let ∆ be a set of implication constraints on the base relations, Q be a query defined on the base relations, and W be a set of views. If Qr is a query defined on the view relations in W such that $Q_r^{exp} \sqsubseteq_\Delta Q$, then we say Qr is a rewriting of Q wrt W and ∆.

2.4. Inverse Rules and Inferred Constraints

Given a view V:
    v(X) :− p1(X1), ..., pn(Xn), C
we can compute a set of inverse rules [DG97]: First, replace each non-distinguished variable in the body with a distinct Skolem function. The resulting view is said to be Skolemized. Suppose ρ is the mapping that maps the non-distinguished variables to the corresponding Skolem functions; then the inverse rules are ρ(pi(Xi)) ← v(X) (for i = 1, ..., n). The left side of an inverse rule is called the head, and the right side is called the body. A variable in an inverse rule is said to be restricted if it appears as an independent argument of the head, that is, it appears in the head, and appears not only inside the Skolem functions. In addition, we will call ρ(C) the inferred constraint of the atom v(X).

EXAMPLE 2.2. For the view v(x, z) :− p1(x, y), p2(y, z), x > y, there are two inverse rules: p1(x, f(x, z)) ← v(x, z) and p2(f(x, z), z) ← v(x, z). In the first inverse rule x is a restricted variable, but z is not. In the second one, z is a restricted variable, but x is not. The inferred constraint of v(x, z) is x > f(x, z).
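The construction of inverse rules in Example 2.2 can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of [DG97]: the function name `inverse_rules` and its parameters are hypothetical, rules are rendered as plain strings, and only head variables are treated as distinguished (variables equated to a head variable or a constant by the constraint are not handled).

```python
def inverse_rules(view_name, head_vars, body_atoms, skolem_names):
    """Skolemize a view and return (rule text, restricted variables) per subgoal.

    skolem_names maps each non-distinguished variable to a Skolem function name;
    every Skolem term takes all head variables of the view as arguments.
    """
    head = f"{view_name}({', '.join(head_vars)})"

    def rho(arg):
        # Replace a non-distinguished variable by its Skolem term.
        if arg in skolem_names:
            return f"{skolem_names[arg]}({', '.join(head_vars)})"
        return arg

    rules = []
    for rel, args in body_atoms:
        rule = f"{rel}({', '.join(rho(a) for a in args)}) <- {head}"
        # Restricted variables occur as direct arguments of the rule head.
        restricted = [a for a in args if a in head_vars]
        rules.append((rule, restricted))
    return rules

# The view v(x, z) :- p1(x, y), p2(y, z), x > y of Example 2.2.
example = inverse_rules("v", ["x", "z"],
                        [("p1", ["x", "y"]), ("p2", ["y", "z"])],
                        {"y": "f"})
```

Applying the same substitution to the constraint x > y gives the inferred constraint x > f(x, z) stated in the example.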
If we have more than one view, we can generate a set of inverse rules from each of them. In this case, the inverse rules generated from different views must use different Skolem functions. In the sequel, when we say a rule, we mean an inverse rule. For simplicity, we also assume the rules are compact as defined below.

DEFINITION 2.5. The set of rules generated from a view is said to be compact if the inferred constraint does not imply a non-tautological equality between a constant and a Skolem function, or between a constant and a variable, or between a variable and a Skolem function, or between two Skolem functions, or between two variables².
² It is possible though that there exist equalities such as x = y + z.

3. An Improved Method for Testing Query Containment

Let us use Var(Q) (resp. Arg(Q)) to denote the set of variables (resp. the set of variables and constants) in the head and subgoals of a GCQ Q. A containment mapping from a GCQ Q2 to another GCQ Q1 is a mapping from Var(Q2) to Arg(Q1) such that it maps the output of Q2 to the output of Q1, and maps each subgoal of Q2 to a subgoal of Q1. The following lemma relates query containment to the existence of some particular containment mappings [Mah93]. Earlier work [Klu88] addressed containment in the context of specific kinds of constraints.

LEMMA 3.1. Let Qi (i = 0, ..., s) be comparable GCQs. Let Ci be the constraint in Qi.
(1) If there are containment mappings δi,1, ..., δi,ki from Qi to Q0 such that $C_0 \rightarrow \bigvee_{i=1}^{s} \bigvee_{j=1}^{k_i} \delta_{i,j}(C_i)$, then $Q_0 \sqsubseteq \bigcup_{i=1}^{s} Q_i$.
(2) If Q1, ..., Qs are in normal form, C0 is satisfiable, and $Q_0 \sqsubseteq \bigcup_{i=1}^{s} Q_i$, then there must be containment mappings δi,1, ..., δi,ki from Qi to Q0 such that $C_0 \rightarrow \bigvee_{i=1}^{s} \bigvee_{j=1}^{k_i} \delta_{i,j}(C_i)$.

The condition that Qi (i = 1, ..., s) are in normal form in (2) of Lemma 3.1 is necessary. Thus in order to use the lemma, we need to put all of the queries Q1, ..., Qs in normal form. This sometimes makes the application of Lemma 3.1 inconvenient. Next, we present a revised method for testing query containment using relevance mappings.

DEFINITION 3.1. Let Q be the query q(X) :− p1(X1), ..., pn(Xn), C, and T be a formula q(Y) :− p1(Y1), ..., pn(Yn). The relevance mapping from Q with respect to T is a mapping δ from Var(Q) that maps every variable x in Var(Q) to the argument in T which corresponds to the first occurrence of x in Q (checking from left to right). Let δ be the relevance mapping from Q with respect to T. Let {x1, ..., xm} be the set of all arguments in the head and subgoals of Q. For every argument xi in {x1, ..., xm}, suppose the set of all arguments in T that correspond to xi is {yi,1, ..., yi,ki}. The associated constraint of δ, denoted Eδ, is the constraint $\bigwedge_{i=1}^{m} \bigwedge_{j=1}^{k_i} (y_{i,j} = \delta(x_i))$. Note that if xi is a constant, then δ(xi) = xi.

For example, the relevance mapping from q(x) :− p(x, y), r(y, z) with respect to q(w) :− p(u, 1), r(v, w) maps x to w, y to 1, and z to w. The associated constraint of the mapping is u = w ∧ v = 1. Observe that the query is mapped to q(w) :− p(w, 1), r(1, w), which is equivalent to q(w) :− p(u, 1), r(v, w), u = w ∧ v = 1.

DEFINITION 3.2. Let Q2 be the GCQ q(X) :− p1(X1), ..., pn(Xn), C2, and Q1 be a GCQ comparable to Q2. A target T of Q2 in Q1 is a query q′(Y) :− p1(Y1), ..., pn(Yn) such that
1. q′(Y) is the head of Q1, and for each i ∈ {1, ..., n}, pi(Yi) is a subgoal of Q1 with the same relation name as that of pi(Xi).
2. If we denote the sequence of all arguments in the head and subgoals of Q2 by S2 = (x1, x2, ..., xm) and denote the sequence of all arguments in T by S1 = (y1, y2, ..., ym), then none of the following holds:
   a) There is an i such that xi and yi are different constants.
   b) There are i and j such that xi and xj are different constants, but yi and yj are the same variable.
   c) There are i and j such that yi and yj are different constants, but xi and xj are the same variable.
We call a relevance mapping δ from Q2 wrt any target T of Q2 in Q1 a relevance mapping from Q2 to Q1. Relevance mapping is an extension of containment mapping. Any containment mapping is a relevance mapping with the associated constraint equivalent to TRUE, and a relevance mapping is a containment mapping if and only if its associated constraint is equivalent to TRUE³.

³ We assume that every variable domain contains more than one element.
EXAMPLE 3.1. Let Q1 and Q2 be
    q1(w) :− h(w), p(x, y, 2, 1, u, u), p(1, 2, x, y, u, u), p(1, 2, 2, 1, x, y)
and
    q2(w) :− h(w), p(x, y, z, v, u, u), x < y, z > v,
respectively. There are three targets of Q2 in Q1:
    q1(w) :− h(w), p(x, y, 2, 1, u, u),
    q1(w) :− h(w), p(1, 2, x, y, u, u), and
    q1(w) :− h(w), p(1, 2, 2, 1, x, y).
Corresponding to these targets, there are three relevance mappings:
    δ1: w → w, x → x, y → y, z → 2, v → 1, u → u,
    δ2: w → w, x → 1, y → 2, z → x, v → y, u → u, and
    δ3: w → w, x → 1, y → 2, z → 2, v → 1, u → x,
with the associated constraints TRUE, TRUE and x = y respectively.
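The relevance mappings above follow mechanically from Definition 3.1. The following Python sketch computes a relevance mapping and its associated constraint; it assumes that the query and the target are given as parallel lists of argument tuples (head first), that variables are strings and constants are integers, and that the target already satisfies Definition 3.2. The function name `relevance_mapping` is hypothetical.

```python
def relevance_mapping(query_atoms, target_atoms):
    """Return (delta, E): the relevance mapping and its associated constraint.

    E is a set of pairs (y, t) standing for the equality y = t; trivially true
    equalities of the form t = t are omitted.
    """
    delta, equalities = {}, set()
    for q_args, t_args in zip(query_atoms, target_atoms):
        for x, y in zip(q_args, t_args):
            if isinstance(x, str) and x not in delta:
                delta[x] = y                      # first occurrence, left to right
            image = delta[x] if isinstance(x, str) else x   # constants map to themselves
            if y != image:
                equalities.add((y, image))
    return delta, equalities

# Q2 of Example 3.1 and its third target q1(w) :- h(w), p(1, 2, 2, 1, x, y).
q2 = [("w",), ("w",), ("x", "y", "z", "v", "u", "u")]
t3 = [("w",), ("w",), (1, 2, 2, 1, "x", "y")]
delta3, e3 = relevance_mapping(q2, t3)

# The illustration following Definition 3.1.
delta, e = relevance_mapping([("x",), ("x", "y"), ("y", "z")],
                             [("w",), ("u", 1), ("v", "w")])
```

For the third target the only non-trivial equality is y = x, which is the associated constraint x = y of δ3; for the illustration after Definition 3.1 the result is u = w ∧ v = 1.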
Let δ be a relevance mapping from Q2 to Q1 wrt T. Condition 2 in the definition of targets ensures that δ maps Q2 to a query δ(Q2) which is equivalent to T ∧ Eδ ∧ δ(C2), and that the constraint Eδ is satisfiable. We will refer to δ(Q2) as an image of Q2 in Q1. Clearly δ(Q2) ⊑ Q2. The next lemma implies that Q1 is contained in Q2 if and only if Q1 is contained in the union of all of the images of Q2 in Q1.

LEMMA 3.2. Let Ci be the constraint of the GCQ Qi (i = 0, 1, ..., s). Suppose C0 is satisfiable. Then $Q_0 \sqsubseteq \bigcup_{i=1}^{s} Q_i$ iff there are relevance mappings δi,1, ..., δi,ki from Qi (i = 1, ..., s) to Q0 such that $C_0 \rightarrow \bigvee_{i=1}^{s} \bigvee_{j=1}^{k_i} (\delta_{i,j}(C_i) \wedge E_{\delta_{i,j}})$, where $E_{\delta_{i,j}}$ is the associated constraint of δi,j.

Proof. Let us put Qi in normal form as follows: check every argument from left to right; for every constant c, replace it with a new variable x and conjoin x = c to Ci; for every variable y that has already appeared at a previous position, replace y with a new variable y′ and conjoin y′ = y to Ci. Let us denote the normal form obtained from Qi by Q′i and the constraint of Q′i by C′i for i = 1, ..., s. Clearly, for every relevance mapping δ from Qi to Q0, there is a corresponding containment mapping ρ from Q′i to Q0 such that ρ(C′i) = δ(Ci) ∧ Eδ. Conversely, for every containment mapping ρ from Q′i to Q0, if ρ(C′i) is satisfiable, then there is a corresponding relevance mapping δ from Qi to Q0 such that ρ(C′i) = δ(Ci) ∧ Eδ (we only need to consider the target ρ(q′i) :− ρ(P′i) of Qi in Q0, assuming Q′i is q′i :− P′i, C′i). Now suppose δi,1, ..., δi,ki are all of the relevance mappings from Qi to Q0, and ρi,1, ..., ρi,mi are all of the containment mappings from Q′i to Q0. It is easy to see that $\bigvee_{i=1}^{s} \bigvee_{j=1}^{k_i} (\delta_{i,j}(C_i) \wedge E_{\delta_{i,j}}) = \bigvee_{i=1}^{s} \bigvee_{j=1}^{m_i} \rho_{i,j}(C'_i)$. By Lemma 3.1, we know Lemma 3.2 holds.
Note that there is no need to put Q1, ..., Qs in any normal form. The next example demonstrates the application of Lemma 3.2.

EXAMPLE 3.2. Continuing with Example 3.1, since δ1(x < y, z > v) = x < y, δ2(x < y, z > v) = x > y and δ3(x < y, z > v) = TRUE, and x < y ∨ x > y ∨ x = y is always true, we know Q1 ⊑ Q2.

Lemma 3.2 has an additional advantage over Lemma 3.1: the number of mappings we need to consider can be drastically reduced in some cases. The next example shows one such case.

EXAMPLE 3.3. Let Q2 and Q1 be
    q(x, y) :− p(x, y, 0), p(x, y, z), p(x, u, z), x ≤ y
and
    q(x, y) :− p(x, y, 1), p(x, y, 2), ..., p(x, y, N), x < y
respectively. If we use Lemma 3.1 to test whether Q1 ⊑ Q2, then we need to put Q2 in normal form, and consider $N^3$ containment mappings from Q2 to Q1 (each of the three subgoals of the normal form can be mapped to any of the N subgoals of Q1). If we use Lemma 3.2, we can see Q1 ⋢ Q2 immediately because there are obviously no relevance mappings from Q2 to Q1: the constant 0 in p(x, y, 0) corresponds to a different constant in every subgoal of Q1.
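The absence of relevance mappings in Example 3.3 can be confirmed by checking condition 2(a) of Definition 3.2 subgoal by subgoal. The Python sketch below tests only that one condition, which is already enough here; the value of N, the representation (strings for variables, integers for constants) and the helper name `constants_clash` are assumptions made for illustration.

```python
# Subgoal argument tuples for Example 3.3; strings are variables, ints are constants.
N = 6  # illustrative size; any N >= 1 behaves the same way
q1_subgoals = [("x", "y", k) for k in range(1, N + 1)]
q2_subgoals = [("x", "y", 0), ("x", "y", "z"), ("x", "u", "z")]

def constants_clash(query_args, target_args):
    """Condition 2(a) of Definition 3.2: two different constants at one position."""
    return any(isinstance(a, int) and isinstance(b, int) and a != b
               for a, b in zip(query_args, target_args))

# For each subgoal of Q2, the subgoals of Q1 it could be paired with in a target.
candidates = [[t for t in q1_subgoals if not constants_clash(q, t)]
              for q in q2_subgoals]
candidate_counts = [len(c) for c in candidates]

# A target needs one candidate for every subgoal of Q2.
has_target = all(candidate_counts)
```

The first subgoal p(x, y, 0) has no candidate at all, so no target exists and no relevance mapping needs to be constructed.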
4. The General Approach for Rewriting Union Queries

Let Qu be the union query to be rewritten. Without loss of generality, we assume all the GCQs in Qu have the identical head. Our method for rewriting Qu consists of two major steps. In the first step, we generate a set of potential formulas (or p-formulas for short) which may or may not be rewritings; these p-formulas are generated separately for every GCQ in Qu. In the second step, we combine all these p-formulas to try to obtain correct rewritings.

4.1. Generating p-formulas

We assume the compact set IR of inverse rules has been computed in advance. To generate a p-formula for a GCQ, we need to find a destination first.

DEFINITION 4.1. Given the GCQ Q:
    q(X) :− p1(X1), ..., pn(Xn), C
and a set IR of compact inverse rules, a destination of Q wrt IR is a sequence DS of n atoms p1(Y1), ..., pn(Yn) such that
1. each atom pi(Yi) is the head of some rule in IR, and it has the same relation name as that of pi(Xi), the ith subgoal of Q;
2. there is no i such that a constant in pi(Xi) corresponds to a different constant in pi(Yi);
3. no two occurrences of the same variable in Q correspond to two different constants in DS, and no two occurrences of the same variable or Skolem function in the same atom of DS correspond to two different constants in Q.
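Condition 1 of Definition 4.1 amounts to choosing, for every subgoal of the query, one rule head with the same relation name. The Python sketch below enumerates these choices for the query and the inverse rules R1 to R5 of Example 4.1 in this section; rule heads are kept as plain strings, the relation written p′ in the example is spelled `p'`, and conditions 2 and 3 are not checked because they hold vacuously when neither the query nor the rule heads contain constants.

```python
from itertools import product

# Heads of the inverse rules R1..R5 of Example 4.1, grouped by relation name.
rule_heads = {
    "p'": ["p'(u)"],
    "p": ["p(f(y,z), y)", "p(y, z)"],
    "r": ["r(g(y,z), y)", "r(y, z)"],
}

# Relation names of the subgoals of q(u) :- p'(u), p(x, y), r(y, v), ..., in order.
query_subgoals = ["p'", "p", "r"]

# One rule head per subgoal; the Cartesian product lists every candidate destination.
destinations = list(product(*(rule_heads[rel] for rel in query_subgoals)))
```

The product has 1 × 2 × 2 = 4 elements, which are the four destinations listed in Example 4.1.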
Intuitively, a destination "connects" the subgoals of the query to the view atoms in a rewriting. Once a destination DS of Q is found, we can use it to (try to) construct a p-formula as follows:

1. For each atom pi(Yi) in DS, do the following: Suppose the arguments in pi(Yi) are y1, y2, ..., yl, and the corresponding arguments in pi(Xi) are x1, x2, ..., xl. Suppose pi(Yi) is the head of the rule pi(Yi) ← vi(Zi) (if there are rules that have the same head but different bodies, then choose each of them in turn to generate different p-formulas). Define a variable mapping φi as follows: For each restricted variable z ∈ Yi, if z first appears (checking from left to right) at position j, then map z to xj. For each variable z in vi(Zi) which does not appear in pi(Yi) as a restricted variable, let φi map z to a distinct new variable.

2. Construct a formula T: q(X) :− φ1(p1(Y1)), ..., φn(pn(Yn)). Construct the relevance mapping δ from Q wrt T (Skolem functions in the target are treated in the same way as other arguments in the target). We will get a GCQ

    q(X) :− φ1(p1(Y1)), ..., φn(pn(Yn)), δ(C) ∧ Eδ        (F)

3. Replace φi(pi(Yi)) with φi(vi(Zi)) (for i = 1, ..., n) in the above GCQ to get the formula

    q(X) :− φ1(v1(Z1)), ..., φn(vn(Zn)), δ(C) ∧ Eδ        (PF)

4. Suppose the inferred constraints of v1(Z1), ..., vn(Zn) are C1, ..., Cn respectively. If the constraint δ(C) ∧ Eδ ∧ φ1(C1) ∧ · · · ∧ φn(Cn) is satisfiable, then output the formula (PF) (remove duplicate atoms if possible).

The formula (PF) is the p-formula of Q we want. Any p-formula of a GCQ in the union Qu is called a p-formula of Qu.

The p-formula (PF) has the following property: If we replace each view atom with the corresponding Skolemized view body and treat the Skolem functions as variables, then we will get a GCQ (hereafter referred to as the expansion of (PF), denoted $(PF)^{exp}$) which is contained in Q. This is because (F) is a GCQ which is equivalent to δ(Q), and all subgoals and constraints of (F) are in the body of $(PF)^{exp}$.

LEMMA 4.1. The expansion of a p-formula of Q is a GCQ contained in Q.

Thus if there happen to be no Skolem functions in the p-formula, then the formula is a rewriting of Q.

THEOREM 4.1. For any query Q, if there are no Skolem functions in a p-formula of Q, then the p-formula is a rewriting of Q.

However, if there are Skolem functions in Eδ ∧ δ(C), then the formula (PF) is not a correct GCQ because the Skolem functions appear only in the constraint δ(C) ∧ Eδ, and their values cannot be determined⁴. So (PF) is not a rewriting if it contains Skolem functions.

⁴ Note that it is not correct to regard the Skolem functions in the constraint as usual existential variables in the constraint, since they represent special existential variables that appear in some view subgoals. For example, if v(x) :− p(x, y) is a view, q(x) :− p(x, y), x > y is a query, and f(x) is a Skolem function in the Skolemized view definition v(x) :− p(x, f(x)), then q(x) :− v(x), x > f(x) is not equivalent to q(x) :− v(x), ∃y x > y.

The following example demonstrates the above process of generating p-formulas.

EXAMPLE 4.1. Let the query be
    q(u) :− p′(u), p(x, y), r(y, v), x < y, y < v + 1.
Let the views be
    v1(u) :− p′(u),
    v2(y, z) :− p(x, y), p(y, z), x < z, and
    v3(y, z) :− r(x, y), r(y, z), x < z.
The compact inverse rules are:
    R1: p′(u) ← v1(u)
    R2: p(f(y, z), y) ← v2(y, z)
    R3: p(y, z) ← v2(y, z)
    R4: r(g(y, z), y) ← v3(y, z)
    R5: r(y, z) ← v3(y, z)
There are four destinations:
    (1) p′(u), p(f(y, z), y), r(y, z)
    (2) p′(u), p(y, z), r(y, z)
    (3) p′(u), p(f(y, z), y), r(g(y, z), y)
    (4) p′(u), p(y, z), r(g(y, z), y)
For the first destination, we first define the mappings
φ1 : u → u; φ2 : y → y; and φ3 : y → y, z → v, and construct the target q(u) :− p′(u), p(f(y, z), y), r(y, v). Then we obtain the following query wrt the above target:
q(u) :− p′(u), p(f(y, z), y), r(y, v), f(y, z) < y, y < v + 1.
Finally we replace p′(u), p(f(y, z), y), r(y, v) with v1(u), v2(y, z), v3(y, v) to get the p-formula
q(u) :− v1(u), v2(y, z), v3(y, v), f(y, z) < y, y < v + 1.
Similarly, for the second destination (2), we can get the p-formula
q(u) :− v1(u), v2(x, y), v3(y, v), x < y, y < v + 1.
The second p-formula is a rewriting because it involves no Skolem functions.

4.2. Obtaining Rewritings from the p-formulas

As noted earlier, when there are Skolem functions in a p-formula, the p-formula is not a rewriting. However, it is possible that such p-formulas can be combined to obtain correct rewritings. Generally, the following two steps are needed.

First, we choose some p-formulas and combine them into a single formula. Suppose we have chosen k p-formulas PF1, ..., PFk, where PFi is q(X) :− vi,1(Zi,1), ..., vi,ni(Zi,ni), Ci. Then the combined formula (CF) is
q(X) :− v1,1(Z1,1), ..., v1,n1(Z1,n1), ..., vk,1(Zk,1), ..., vk,nk(Zk,nk), C1 ∨ ··· ∨ Ck
where the variables which appear only in the view atoms (i.e., not in q(X)) of different p-formulas should be renamed to different variables.

Second, for the above combined formula, we try to remove those constraints that involve Skolem functions, for example, by replacing them with another constraint that does not involve Skolem functions. Generally, we need to utilize the inferred constraints of view atoms as follows: Let D be the conjunction of the inferred constraints of the view atoms in (CF). Write the constraint C1 ∨ ··· ∨ Ck in conjunctive normal form, and divide it into the conjunction of C′ and C″, where C′ involves Skolem functions, but C″ does not. If there exists a constraint D′ over the variables in X, Z1,1, ..., Zk,nk such that D ∧ D′ ∧ C″ is satisfiable and D ∧ D′ ∧ C″ → C′, then replace the constraint in the combined formula with C″ ∧ D′. The resulting query is clearly a rewriting of Qu.

Let us look at some examples. In Example 4.2, we get a rewriting from a single p-formula.

EXAMPLE 4.2. For the first p-formula in Example 4.1, the inferred constraints of v1(u), v2(y, z) and v3(y, v) are TRUE, f(y, z) < z, and
g(y, v) < v respectively. Their conjunction is f(y, z) < z ∧ g(y, v) < v. Since (z ≤ y) ∧ (f(y, z) < z) → f(y, z) < y, we can replace f(y, z) < y with z ≤ y and get the rewriting
q(u) :− v1(u), v2(y, z), v3(y, v), z ≤ y, y < v + 1.

In the next example, we combine two p-formulas by conjoining their view atoms and "disjuncting" their constraints.

EXAMPLE 4.3. For the views in Example 1.2, the inverse rules are
p(f1(y), y) ← v1(y)
r(y, f2(y)) ← v2(y)
p′(f3(y), y) ← v3(y).
For Q1 ≡ q(y) :− p(x, y), p′(x′, y), x > y, we find the destination p(f1(y), y), p′(f3(y), y) and then the p-formula q(y) :− v1(y), v3(y), f1(y) > y. For Q2 ≡ q(y) :− p(x, y), r(y, z), x ≤ y, we find the destination p(f1(y), y), r(y, f2(y)) and then the p-formula q(y) :− v1(y), v2(y), f1(y) ≤ y. Combining the two p-formulas gives the constraint f1(y) > y ∨ f1(y) ≤ y, which is always true, so we get the rewriting q(y) :− v1(y), v2(y), v3(y).

There are two remarks about the above general approach. The first concerns the completeness of the rewritings. Due to the inherent complexity of the problem (see [AD98]), we do not expect to find a rewriting which can produce all possible correct answers to the original query using the views. However, we do have the following theorem, which shows that p-formulas are an appropriate basis for performing rewritings.

THEOREM 4.2. For any non-empty union rewriting Qr of Qu, there are p-formulas PF1, ..., PFs of Qu such that $Q_r^{exp} \sqsubseteq \bigcup_{i=1}^{s} PF_i^{exp}$.

The proof of the above theorem is in Appendix A.

The second remark concerns the complexity of the general approach. The above approach is exponential in the number of views. The step of combining the p-formulas is particularly expensive because the number of possible combinations may grow explosively and the constraint implication problem is intractable in general. In practice, one can use various simplified versions of the general approach. For example, we may choose to consider some, rather than all, of the p-formulas. In Section 5, we focus on such a version which is
practically much more efficient, yet it still produces maximum rewritings in some special cases. We can also consider only some, rather than all, of the combinations (e.g., only combinations of p-formulas which involve common Skolem functions). We can also use a more systematic way of combining the p-formulas. For example, as a rule of thumb, we should check each single p-formula first to see whether we can get rewritings; then combine p-formulas with the same non-constraint part (modulo variable renaming); and finally conjoin the view atoms of chosen p-formulas as stated before.

4.3. Handling Implication Constraints

Suppose there is a set ∆ = {ici ≡ Ri(Yi), Di(Yi) → FALSE | i = 1, ..., s} of implication constraints. Without loss of generality we assume the variables in the implication constraints do not appear in the head of the query Qu, that is, X ∩ Yi = Ø. Our general method for handling these implication constraints is as follows:

First, for each implication constraint ici, we construct a GCQ (referred to as the induced query of ici) q(X) :− Ri(Yi), Di(Yi) and find the p-formulas of this query.

Second, we define a new union query Q′u which is the union of the original query Qu and the induced queries.

Third, we find the rewritings of Q′u as shown in the previous section, with one restriction when combining the p-formulas of Q′u to get rewritings: the p-formulas of the induced queries should always be combined with at least one p-formula of the original query Qu. In other words, we should not try to obtain rewritings from the p-formulas of the induced queries only (the reason is that any rewritings obtained in this way would be empty wrt ∆).

To see the correctness of the above method, we note that every induced query is empty under ∆. Therefore, for every query Q that is comparable to Qu, $Q \sqsubseteq_{\Delta} Q_u$ iff $Q \sqsubseteq Q'_u$. Treating Q as the expansion of a rewriting, we can get the following theorem immediately.

THEOREM 4.3. Any rewriting of Q′u is a rewriting of Qu wrt ∆, and any rewriting of Qu wrt ∆ is a rewriting of Q′u.

From Theorems 4.3 and 4.2, we can also get the corollary below.

COROLLARY 4.1. For every non-empty rewriting Qr of Qu wrt ∆, there are p-formulas of Q′u such that the union of the expansions of these p-formulas contains $Q_r^{exp}$.

The next example is modified from Example 1.2.
EXAMPLE 4.4. Let Qu = Q1 ∪ Q2, where Q1 and Q2 are q(y) :− p(x, y), p1(x1, y), x > y and q(y) :− p(x, y), r(y, z), x < y respectively. Let ∆ contain p(x, y), p2(y), x = y → FALSE only. The induced query is q(y) :− p(x, u), p2(u), x = u. Let the views be v1(y) :− p(x, y), v2(y) :− r(y, x), v3(y) :− p1(x, y), and v4(y) :− p2(y).
For Q1, we find the p-formula q(y) :− v1(y), v3(y), f1(y) > y.
For Q2, we find the p-formula q(y) :− v1(y), v2(y), f1(y) < y.
For the induced query, we find q(y) :− v1(u), v4(u), f1(u) = u.
Combining the three p-formulas, we get the combined formula
q(y) :− v1(y), v3(y), v1(y), v2(y), v1(u), v4(u), f1(y) > y ∨ f1(y) < y ∨ f1(u) = u.
Since u = y → f1(y) > y ∨ f1(y) < y ∨ f1(u) = u, we can replace the constraint in the combined formula with u = y to get the rewriting q(y) :− v1(y), v2(y), v3(y), v4(y).
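The implication used at the end of Example 4.4 can be spelled out as a short case analysis. The derivation below is only a sketch, and it assumes that the attribute domain is totally ordered, so that exactly one of <, =, > holds between any two values:

```latex
\begin{align*}
\text{Assume } u = y. \text{ Then } & f_1(u) = u \iff f_1(y) = y, \text{ so} \\
f_1(y) > y \;\lor\; f_1(y) < y \;\lor\; f_1(u) = u
  \;&\equiv\; f_1(y) > y \;\lor\; f_1(y) < y \;\lor\; f_1(y) = y \\
  \;&\equiv\; \mathrm{TRUE} \quad \text{(trichotomy)}.
\end{align*}
```

Under u = y the atoms v1(u) and v4(u) become v1(y) and v4(y), and dropping the duplicate v1(y) leaves the body v1(y), v2(y), v3(y), v4(y) of the rewriting stated in the example.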
5. A Simplified Approach

The general approach can be very expensive. The simplified approach imposes extra conditions on p-formulas, so that the rewriting process is simplified.

5.1. Retainable Destinations and Retainable p-formulas

DEFINITION 5.1. Given the GCQ Q as in (1) and a set IR of compact inverse rules, a destination DS of Q wrt IR is said to be retainable if
1. each distinguished variable or constant in pi(Xi) corresponds to a restricted variable or to a constant, and
2. all occurrences of the same non-distinguished variable in Q correspond either all to restricted variables and constants, or all to the same Skolem function.
For instance, among the four destinations in Example 4.1, only the first two are retainable.
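The two conditions of Definition 5.1 can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the term encoding (strings for variables, integers for constants, tuples for Skolem terms), the name `is_retainable`, and the spelling `p_prime` for p′ are all assumptions made for illustration. It also assumes the input already is a destination, and it reads "the same Skolem function" as "the same Skolem term".

```python
from typing import List, Set, Tuple, Union

# Hypothetical encoding (not from the paper): variable -> str, constant -> int,
# Skolem term -> ("f", ("y", "z")); an atom is (predicate name, argument tuple).
Term = Union[str, int, tuple]
Atom = Tuple[str, Tuple[Term, ...]]


def is_skolem(t: Term) -> bool:
    return isinstance(t, tuple)


def is_retainable(query_body: List[Atom], distinguished: Set[str],
                  destination: List[Atom]) -> bool:
    """Check conditions 1 and 2 of Definition 5.1 position by position."""
    seen = {}  # non-distinguished variable -> "plain" or the Skolem term it met
    for (qpred, qargs), (dpred, dargs) in zip(query_body, destination):
        if qpred != dpred or len(qargs) != len(dargs):
            return False  # the atom cannot correspond to this subgoal
        for q, d in zip(qargs, dargs):
            q_is_const = not isinstance(q, str)
            if q_is_const or q in distinguished:
                if is_skolem(d):  # condition 1 violated
                    return False
                continue
            kind = d if is_skolem(d) else "plain"
            if seen.setdefault(q, kind) != kind:  # condition 2 violated
                return False
    return True


# The query and the four destinations of Example 4.1 (u is distinguished).
f, g = ("f", ("y", "z")), ("g", ("y", "z"))
query = [("p_prime", ("u",)), ("p", ("x", "y")), ("r", ("y", "v"))]
destinations = [
    [("p_prime", ("u",)), ("p", (f, "y")), ("r", ("y", "z"))],
    [("p_prime", ("u",)), ("p", ("y", "z")), ("r", ("y", "z"))],
    [("p_prime", ("u",)), ("p", (f, "y")), ("r", (g, "y"))],
    [("p_prime", ("u",)), ("p", ("y", "z")), ("r", (g, "y"))],
]
results = [is_retainable(query, {"u"}, ds) for ds in destinations]
print(results)
```

In destinations (3) and (4) the variable y of the query faces a restricted variable in the p atom and the Skolem term g(y, z) in the r atom, which is what condition 2 rules out.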
Once we have found a retainable destination DS of Q, we can generate a p-formula as before. Note that even with a retainable destination, there may still be equalities of the form f(Z1) ≡ φi(f(Z)) = φj(f(Z)) ≡ f(Z2) in the Eδ part of the p-formula. In this case, we can remove the equality f(Z1) = f(Z2) by replacing it with Z1 = Z2. The resulting formula is called a retainable p-formula.

EXAMPLE 5.1. Let the query Q be q(x, x′) :− p(x, y), p(x′, y). Let the view V be v(x) :− p(x, y). The inverse rule is p(x, f(x)) ← v(x), and p(x, f(x)), p(x, f(x)) is a retainable destination of Q. Therefore, we can construct a target q(x, x′) :− p(x, f(x)), p(x′, f(x′)) and then get the p-formula q(x, x′) :− v(x), v(x′), f(x′) = f(x). Replacing f(x′) = f(x) with x′ = x, we get a retainable p-formula q(x, x′) :− v(x), v(x′), x′ = x which is equivalent to q(x, x) :− v(x). This is a rewriting of Q.

The retainable p-formula may still have Skolem functions if Q has a constraint involving non-distinguished variables. However, if Q does not have constraints, or the constraint of the GCQ involves distinguished variables only, then it is impossible for any retainable p-formula to have Skolem functions, thus every retainable p-formula will be a rewriting.

THEOREM 5.1. If Q is a GCQ without constraint, or the constraint of Q involves only distinguished variables, then every retainable p-formula of Q is a rewriting of Q.
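The step that turns a p-formula into a retainable p-formula, replacing f(Z1) = f(Z2) with Z1 = Z2, can be sketched as a substitution over the view atoms. This is an illustrative sketch only: the function name `remove_skolem_equalities`, the tuple encoding of atoms and Skolem terms, and the restriction to variable arguments are assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]           # (view name, variable arguments)
Skolem = Tuple[str, Tuple[str, ...]]         # (function name, variable arguments)


def remove_skolem_equalities(head_args: Tuple[str, ...],
                             view_atoms: List[Atom],
                             equalities: List[Tuple[Skolem, Skolem]]):
    """Replace each f(Z1) = f(Z2) by Z1 = Z2 and apply it as a substitution."""
    subst: Dict[str, str] = {}

    def resolve(v: str) -> str:
        while v in subst:
            v = subst[v]
        return v

    for (f1, args1), (f2, args2) in equalities:
        if f1 != f2 or len(args1) != len(args2):
            raise ValueError("only equalities between the same Skolem function are handled")
        for a, b in zip(args1, args2):
            a, b = resolve(a), resolve(b)
            if a != b:
                subst[a] = b  # equate a with b by rewriting a to b
    new_head = tuple(resolve(v) for v in head_args)
    new_atoms: List[Atom] = []
    for name, args in view_atoms:
        atom = (name, tuple(resolve(v) for v in args))
        if atom not in new_atoms:  # drop duplicates created by the substitution
            new_atoms.append(atom)
    return new_head, new_atoms


# Example 5.1: q(x, x') :- v(x), v(x'), f(x') = f(x)
new_head, new_atoms = remove_skolem_equalities(
    ("x", "x'"),
    [("v", ("x",)), ("v", ("x'",))],
    [(("f", ("x'",)), ("f", ("x",)))],
)
print(new_head, new_atoms)
```

For the p-formula of Example 5.1 this yields the head arguments (x, x) and the single atom v(x), matching q(x, x) :− v(x). Since Z1 = Z2 implies f(Z1) = f(Z2), the replacement can only strengthen the constraint, so the result stays contained in the original formula.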
5.2. Obtaining Retainable p-formulas Directly

We can modify the process for generating p-formulas so that the retainable p-formula can be obtained directly, that is, without having to substitute f(Z1) = f(Z2) with Z1 = Z2. To do this, we note that, in the p-formula generated from a retainable destination, there is the equality f(Z1) = f(Z2) if and only if (1) there are two atoms of DS which share the same Skolem function f(Z), (2) the occurrences of f(Z) in the two atoms correspond to the same non-distinguished variable y in Q, and (3) the restricted variables in the two atoms of DS are mapped to different arguments. Also, corresponding to f(Z1) = f(Z2), there must be the atoms v(Z1) and v(Z2) in the p-formula. Therefore, if we modify the process by grouping the atoms appropriately (as below), and define the same variable mapping φ for all atoms in each group (rather than
define a mapping for each atom), then equalities such as f(Z1) = f(Z2) cannot appear.

Let DS be a retainable destination of Q. The modified process replaces step 1 of the p-formula generating process with the following:

1.1 Define a relation ∼ among the atoms of DS such that two atoms p and q satisfy p ∼ q if they share a Skolem function f(Z) and the two occurrences of f(Z) in p and q correspond to the same non-distinguished variable of Q, or if there are some other atoms r1, ..., rt in DS such that p ∼ r1, r1 ∼ r2, ..., rt−1 ∼ rt, rt ∼ q. The relation ∼ is an equivalence relation. Divide the atoms in DS into disjoint groups G1, ..., Gk according to ∼. Clearly, the atoms in the same group are the heads of rules generated from the same view, that is, the rules have the same body v(Z).

1.2 For each group G in DS, do the following: Suppose the arguments in G are y1, y2, ..., yl, and the corresponding arguments in pi(Xi) are x1, x2, ..., xl. Denote the body of the rules corresponding to G by vG(ZG). Define a variable mapping φG as follows: For each restricted variable z in G, if z first appears (checking from left to right) at position i, then map z to xi. For each variable z in vG(ZG) which does not appear in G as a restricted variable, let φG map z to a distinct new variable. Define φi to be φG if pi(Yi) is in group G (i = 1, ..., n).

Also, in step 3, it is no longer necessary to replace every atom in the same group G of Q with the same view atom, as this results in duplicate atoms. Instead, we can replace the whole group of atoms with a single view atom. One can verify that applying the modified process to Example 5.1 generates the same p-formula.

5.3. When Does the Simplified Approach Find Maximum Rewritings?

The simplified approach is less powerful than the general approach in that it finds fewer rewritings. Nevertheless it still finds maximum rewritings in some special cases. In this section we list some of these cases.
We assume that the variables are all from infinite domains in this subsection.
First, when the query and views do not have constraints, the union of all retainable p-formulas is a maximum rewriting with respect to the language of union queries. This is implied by the next theorem.

THEOREM 5.2. Suppose the relation attributes are all from infinite domains. If the GCQ Q and the views do not have constraints, then for any general conjunctive rewriting Qr of Q, there are some retainable p-formulas Q1, ..., Qs defined on the view relations mentioned in Qr such that $Q_r \sqsubseteq Q_1 \cup \cdots \cup Q_s$.

The proof of the above theorem is in Appendix A. In Theorem 5.2 the condition that the variables are from infinite domains is necessary, as shown by the next example.

EXAMPLE 5.2. Let p(x, y, z) be a relation, where x is from the reals and y, z are from the domain {ON, OFF}. Consider the query Q: q(x) :− p(x, y, y) and the view V: v(x) :− p(x, y, ON), p(x, y, OFF). It should be clear that $V \sqsubseteq Q$. Thus q(x) :− v(x) is a rewriting of Q. However, there are no retainable destinations and hence we cannot find this rewriting using the simplified approach.

Second, we consider the case where the query has no constraints, and the constraints of the views are conjunctions of linear arithmetic constraints involving only distinguished variables, assuming all attributes are from the reals. A linear arithmetic constraint is a constraint of the form a1 x1 + a2 x2 + ··· + al xl op b, where a1, ..., al and b are constants, x1, ..., xl are variables, and op is one of <, >, ≤, ≥, =, ≠. We claim that in this case, the union of all retainable p-formulas is a maximum rewriting with respect to the language of unions of conjunctive queries with linear arithmetic constraints. This is due to the next theorem.

THEOREM 5.3. Suppose the relation attributes are all from the reals. If the query does not have constraints, and the constraints of the views are conjunctions of linear arithmetic constraints involving only distinguished variables, then for every general conjunctive rewriting Qr of Q whose constraint is a conjunction of linear arithmetic constraints, there is a retainable p-formula Qp of Q defined on the relations mentioned in Qr such that $Q_r^{exp} \sqsubseteq Q_p^{exp}$.

For the proof of the above theorem, see Appendix A. Note that if the attributes are from the integers then the above result does not hold. This is demonstrated in the following example.
EXAMPLE 5.3. Suppose all relation attributes in this example are from the integers. Let the query Q be q(x) :− p1(x, y), p2(x, y). Let the views be v1(x) :− p1(x, y), y > 0, y < 3 and v2(x) :− p2(x, 1), p2(x, 2). The inverse rules are p1(x, f(x)) ← v1(x), p2(x, 1) ← v2(x), and p2(x, 2) ← v2(x). Thus there is no retainable destination of Q, and hence no rewriting that can be obtained by the simplified approach. But the query q(x) :− v1(x), v2(x) is a rewriting of Q because its expansion q(x) :− p1(x, y), p2(x, 1), p2(x, 2), y > 0, y < 3 is contained in Q.

The condition that constraints of the views involve only distinguished variables of the views is also necessary, as shown in the next example.

EXAMPLE 5.4. Let Q be q(x) :− r(x). Let the view be v(y, z) :− r(x), s(y, z), y ≤ x, x ≤ z. The inverse rules are r(f(y, z)) ← v(y, z) and s(y, z) ← v(y, z). Therefore we do not have a retainable p-formula and hence no rewritings can be found by the simplified approach. However, it can be verified that q(x) :− v(x, x) is a rewriting of Q.

5.4. Finding All Retainable Destinations

There are different methods for finding all retainable destinations. Here we provide two methods, partly for the purpose of comparing with the MiniCon algorithm later. Suppose Q is q(X) :− p1(X1), ..., pn(Xn), C. Let us divide the rules into groups such that two rules are in the same group if and only if their heads have the same name. Let RSi be the group of rules whose heads have the same name as that of pi(Xi) for i = 1, ..., n.

Our first method is to build a retainable destination from an initially empty list DS by adding to it the head of an inverse rule from each of RS1, ..., RSn one by one. The head of a rule from RSi is addable to DS iff adding it to DS leaves DS consistent.
Here by consistent we mean that if we denote the subgoals of Q corresponding to the atoms in DS by DS′, then (1) every constant in DS′ corresponds to the same constant or to a restricted variable in DS; (2) every distinguished variable in DS′ corresponds to a restricted variable or to a constant in DS; (3) all occurrences of the same non-distinguished variable in DS′ correspond either all to restricted variables and constants, or all to the same Skolem function; (4) no two occurrences of the same variable in DS′ correspond to two different constants in DS, and no two occurrences of the same variable or Skolem function in the same atom of DS correspond to two different constants in DS′.

Our second method uses the concepts of partial destination and minimal partial destination. A partial destination is a set P of inverse rule heads such that (1) each rule is from a distinct group among RS1, ..., RSn (let us call the atom from RSi the mirror of pi(Xi)); (2) P is consistent; (3) every subgoal of Q involving a non-distinguished variable which corresponds to a Skolem function in P has a mirror in P. The empty set is a partial destination. A minimal partial destination is a partial destination such that removing any atom from it will make it no longer a partial destination. Observe that a partial destination that contains a mirror for every subgoal of Q is a retainable destination.

We first define a procedure FindMPD that, given any subgoal pi of Q, can find the set of minimal partial destinations that contain a mirror for pi. This procedure can be implemented, for example, by enumeration. Then starting from the first subgoal p1, we use FindMPD to find the set MPD1 of minimal partial destinations that contain a mirror for p1. Next, for each element PD ∈ MPD1 which is not a retainable destination, we choose a subgoal pi of Q (checking from left to right) that does not have a mirror in PD and call FindMPD to find the set MPDi of minimal partial destinations corresponding to pi, and append each of them to PD. Note that the extended lists are partial destinations. In the case that MPDi is empty, we simply discard PD. This process continues until each partial destination is either extended to a retainable destination or discarded because it cannot be extended to a retainable destination.
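The first method can be sketched as a small backtracking search. This is only an illustrative sketch under stated assumptions, not the authors' procedure: the names `consistent` and `retainable_destinations`, the term encoding, and the spelling `p_prime` for p′ are hypothetical. The consistency test follows one reading of conditions (1) to (4): a constant of Q must face the same constant or a variable, a distinguished variable must not face a Skolem term, a non-distinguished variable must face either only plain terms or one Skolem term, and a variable of Q must not face two different constants.

```python
from typing import Dict, List, Set, Tuple, Union

# Hypothetical encoding (not from the paper): variable -> str, constant -> int,
# Skolem term -> ("f", ("y", "z")); an atom is (predicate name, argument tuple).
Term = Union[str, int, tuple]
Atom = Tuple[str, Tuple[Term, ...]]


def consistent(qargs, dargs, distinguished: Set[str],
               kinds: Dict[str, Term], consts: Dict[str, int]) -> bool:
    """Check one subgoal of Q against one rule head; updates kinds and consts."""
    in_atom: Dict[str, int] = {}  # head variable -> constant of Q it faced here
    for q, d in zip(qargs, dargs):
        d_skolem = isinstance(d, tuple)
        if isinstance(q, int):  # q is a constant of Q
            if d_skolem or (isinstance(d, int) and d != q):
                return False  # condition (1)
            if isinstance(d, str) and in_atom.setdefault(d, q) != q:
                return False  # condition (4), within one atom
            continue
        if isinstance(d, int) and consts.setdefault(q, d) != d:
            return False  # condition (4), one variable of Q, two constants
        if q in distinguished:
            if d_skolem:
                return False  # condition (2)
        else:
            kind = d if d_skolem else "plain"
            if kinds.setdefault(q, kind) != kind:
                return False  # condition (3)
    return True


def retainable_destinations(query_body: List[Atom], distinguished: Set[str],
                            rule_heads: List[Atom]) -> List[List[Atom]]:
    """Pick one rule head per subgoal, backtracking when consistency fails."""
    found: List[List[Atom]] = []

    def extend(i, chosen, kinds, consts):
        if i == len(query_body):
            found.append(chosen)
            return
        qpred, qargs = query_body[i]
        for dpred, dargs in rule_heads:
            if dpred != qpred or len(dargs) != len(qargs):
                continue  # this head is not in the group RS_i
            k2, c2 = dict(kinds), dict(consts)
            if consistent(qargs, dargs, distinguished, k2, c2):
                extend(i + 1, chosen + [(dpred, dargs)], k2, c2)

    extend(0, [], {}, {})
    return found


# Rule heads R1..R5 and the query of Example 4.1.
f, g = ("f", ("y", "z")), ("g", ("y", "z"))
heads = [("p_prime", ("u",)), ("p", (f, "y")), ("p", ("y", "z")),
         ("r", (g, "y")), ("r", ("y", "z"))]
query = [("p_prime", ("u",)), ("p", ("x", "y")), ("r", ("y", "v"))]
ex41 = retainable_destinations(query, {"u"}, heads)

# Example 5.3: y would face both f(x) and a constant, so nothing is found.
ex53 = retainable_destinations(
    [("p1", ("x", "y")), ("p2", ("x", "y"))], {"x"},
    [("p1", ("x", ("f", ("x",)))), ("p2", ("x", 1)), ("p2", ("x", 2))])
print(len(ex41), len(ex53))
```

On Example 4.1 the search keeps exactly the two destinations that Section 5.1 calls retainable, and on Example 5.3 it returns nothing, in line with the text. The second method (minimal partial destinations) would prune earlier, but it is not sketched here.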
6. Comparison with Related Work

In this section, we compare our approaches with the most closely related algorithms: the Bucket algorithm, the resolution-based algorithm, the MiniCon algorithm, the SVB algorithm, and the U-join algorithm. These algorithms (all except the resolution-based algorithm) assume that different views and the query use disjoint sets of variables.

6.1. The Bucket Algorithm and the General Approach

The Bucket algorithm [LRO96] is as follows: For each subgoal pi of the query Q, a bucket Bi is created. If a view V has a subgoal p′i which
is unifiable with pi, then let φ map every distinguished variable in p′i to the corresponding argument in pi. If C ∧ φ(CV) is satisfiable (where C and CV are the constraints of Q and V respectively), then put the view atom φ(v) in Bi. Then one view atom is taken from each of the buckets to form the body of a query Q′ which has the head identical to that of Q. Then the algorithm tries to find a constraint C′ such that $Q'^{exp} \land C' \sqsubseteq Q$. If C′ can be found, then return Q′ ∧ C′ as a rewriting.

It can be easily verified that there are rewritings (such as that in Example 1.3) that can be found by our general approach, but not by the Bucket algorithm. On the other hand, any rewritings that can be found by the Bucket algorithm can also be found by our general approach. The reason is as follows. Suppose the Bucket algorithm considers the combination v1(Z1), ..., vk(Zk) (each vi(Zi) is from a distinct bucket) and finds a rewriting from F ≡ q(X) :− v1(Z1), ..., vk(Zk). That is, it adds a constraint C′ to the body of F and gets a rewriting q(X) :− v1(Z1), ..., vk(Zk), C′. To find this rewriting, the Bucket algorithm has to find some relevance mappings δ1, ..., δm from Q to $F^{exp}$, and determine $D \land C' \rightarrow \bigvee_{i=1}^{m} (\delta_i(C) \land E_{\delta_i})$, where D is the constraint in $F^{exp}$, C is the constraint in Q, and Eδi is the associated constraint of δi. Corresponding to each of these relevance mappings, say δi, there is a target of Q in $F^{exp}$. It is not difficult to see that corresponding to this target there is a destination of Q. The destination and the body of the target will be identical if an appropriate variable substitution is applied to the destination. By construction, the p-formula corresponding to the destination will have the same constraint as δi(C) ∧ Eδi if we apply the same variable substitution and replace the Skolem functions with appropriate variables. Therefore, if the Bucket algorithm can find C′, then our general approach can also find C′ (with the same reasoning about the constraints) when combining the corresponding p-formulas (appropriate variable substitutions may need to be applied to the p-formulas before combining them) and hence find the same rewriting.

As for efficiency, the Bucket algorithm does not need to combine p-formulas as in our general approach. However, for each query Q′ resulting from a combination of the atoms in the buckets, it needs to do a containment test and find the constraint C′. This test is expensive. Also, the Bucket algorithm sometimes considers useless combinations of view atoms in the buckets, while our general approach will reject such combinations (i.e., it will not generate p-formulas with the same set of view atoms in the body). For example, if the query is q(y1, y2) :− p1(x, y1), p2(x, y2), and the views are v1(y1) :− p1(1, y1) and v2(y2) :− p2(2, y2), then the Bucket algorithm will generate the bucket for p1(x, y1) containing v1(y1), and the bucket for p2(x, y2) containing v2(y2). It will then try to find a rewriting of the form
q(y1, y2) :− v1(y1), v2(y2), C′, only to find that C′ must be FALSE. But using our general approach, no destinations will be found, thus the combination v1(y1), v2(y2) will not need to be considered at all.

6.2. The Resolution-Based Algorithm and the General Approach

The resolution-based algorithm [GM02] can be used to rewrite queries using views where
− the query is of the form q :− G, where q is the query predicate, and G is a conjunction of intensional database (IDB) predicates and extensional database predicates. Each IDB predicate is defined by one or more range-restricted, non-recursive rules, possibly containing basic comparisons such as x < y and x >= 1;
− the views are conjunctive queries without constraints.

Since each rule of the query is range-restricted and non-recursive, the query is in effect a union of conjunctive queries with basic comparisons. The algorithm proceeds by first computing the inverse rules (called Clark Completion Resource Rules in [GM02]) from the views, and then trying to resolve the query rules with the inverse rules using unification. For instance, if q :− p1(X1), ..., pn(Xn) is a rule in the query, then the resolution of this rule goes on as follows: if p1(X1) and the head of an inverse rule p1(Y1) ← vi1(Z1) can be unified, then replace p1(X1) with σ(vi1(Z1)), where σ is the mgu of p1(X1) and p1(Y1) preferring the variables in the query. If the resolution is successful, i.e., all the base atoms are replaced with view atoms (and no Skolem functions are left), then the result of the resolution is a rewriting.

It is easy to see that the resolution-based algorithm produces rewritings which can be obtained from our p-formulas containing no Skolem functions by equating some variables and/or constants. Therefore, the rewritings found by the resolution-based algorithm are subsumed by our p-formulas which do not contain Skolem functions. The resolution-based algorithm cannot find the rewritings in, say, Examples 1.3, 1.2 and 4.2.

6.3. The MiniCon Algorithm, the SVB Algorithm, and the Simplified Approach

The MiniCon algorithm and the SVB algorithm are used for finding union rewritings of a general conjunctive query Q using general conjunctive views. The MiniCon algorithm [PH01] proceeds as follows:
Step 1. For every subgoal p of Q and every subgoal p′ of view V (suppose v(Z) is the head of V), find a least restrictive mapping h from Z to Z such that there exists a mapping φ s.t. φ(p) = h(p′). If h and φ exist, then extend the domain of φ to the variables in a minimum set G of subgoals of Q such that
(1) every subgoal in G is mapped to a subgoal of h(V) by φ;
(2) every distinguished variable in G is mapped to a distinguished variable in h(V);
(3) if a non-distinguished variable x in G is mapped to a non-distinguished variable of h(V), then (i) every subgoal of Q involving x is in G; (ii) all variables in the comparison predicates of Q that involve x are in the domain of φ, and h(d) and φ(C) are consistent, where d is the constraint of V;
(4) h(d) → φ(C′), where C′ is the conjunction of comparisons in Q involving only variables in the domain of φ, and C′ involves at least one non-distinguished variable of Q.
The tuple (h, v(h(Z)), φc, G) (where φc is the extended mapping) is called a minimum MCD in [PH01]. Among those minimum MCDs formed from the same h and φ, the algorithm only retains those that have the fewest subgoals in the G component.

Step 2. If there are minimum MCDs (h1, v1(Y1), φ1, G1), ..., (hk, vk(Yk), φk, Gk) such that G1, ..., Gk are pairwise disjoint and G1, ..., Gk together cover all the subgoals of Q, then
(1) if φi maps two or more variables x1, ..., xs in Gi to the same argument, then choose one of the variables as a representative, and denote the representative variable of xj by ECi(xj). Thus ECi(x1) = ··· = ECi(xs). For every variable x in Q, if ECi(x) ≠ ECj(x) (1 ≤ i, j ≤ k), then let EC(x) be one of them but consistently across all y for which ECi(y) = ECi(x);
(2) for each y ∈ Yi, if there exists x such that φi(x) = y, then let ψi(y) = x, otherwise let ψi(y) be a distinct new variable;
(3) create the rewriting q(EC(X)) :− v1(EC(ψ1(Y1))), ..., vk(EC(ψk(Yk))), EC(C′), where C′ is now the set of comparisons in Q which are not implied by the inferred constraints of the view atoms.

The SVB algorithm [Mit01] is very similar to the MiniCon algorithm. It constructs two types of buckets: the single-subgoal buckets and the shared-variable buckets. The single-subgoal buckets correspond to the MCDs which have a single subgoal in their G components, and the shared-variable buckets correspond to the minimum MCDs having two or more subgoals in their G components. The algorithm then constructs a rewriting by combining atoms from some buckets which represent disjoint sets of subgoals and which together represent all subgoals of Q. Comparison predicates are handled in a way similar to the way they are handled by MiniCon. Therefore, in the following discussion, we will use the MiniCon algorithm as a representative of the two.
There are some similarities between our simplified approach and the MiniCon algorithm. The minimum MCDs in MiniCon correspond to the minimal partial destinations, and the union of pairwise disjoint MCDs which cover all subgoals corresponds to a retainable destination in our approach. Therefore, we can claim that any rewriting that can be found by the MiniCon algorithm can also be found by our simplified approach. On the other hand, the handling of constraints by MiniCon is more restrictive, thus it may miss more rewritings than our simplified approach. For instance, the rewritings in Examples 4.2 and 1.3 can be found by our simplified approach but not by MiniCon.

6.4. The U-join Algorithm and the Simplified Approach

The U-join algorithm [Qia96] can be used to find contained rewritings of conjunctive queries using conjunctive views when neither the query nor the views have constraints. It proceeds as follows. First a set of inverse rules is obtained in the same way as in this paper. Then for each subgoal pi in the user query, a label Li is created. Define attr(Li) = Arg(pi). If r :− v is an inverse rule, and r and pi are unifiable, then the pair (σ(Arg(pi)), σ(v)) is inserted into Li provided that σ(q) does not contain any Skolem functions. Here, σ is the most general unifier of pi and r, and q is the head of the user query.

The U-join of two labels L1 and L2, denoted $L_1 \overset{u}{\bowtie} L_2$, is defined as follows. Let Y = attr(L1) ∩ attr(L2), and Z = attr(L2) − attr(L1). Define $attr(L_1 \overset{u}{\bowtie} L_2)$ = attr(L1) ∪ Z. If L1 contains a pair (t1, u1) and L2 contains a pair (t2, u2), then $L_1 \overset{u}{\bowtie} L_2$ contains the pair (σ(t1, t2[Z]), σ(u1 ∧ u2)) where σ is a most general unifier of t1[Y] and t2[Y] such that σ(u1 ∧ u2) does not contain Skolem functions, provided such σ exists. If (σ, vi1 ∧ ··· ∧ vin) is in the U-join of all labels corresponding to the subgoals of the query, and the head of the query is q, then qσ :− vi1, ..., vin is a conjunctive rewriting of Q. The union of all such conjunctive rewritings is returned by the U-join algorithm.

It is not difficult to see that the U-join algorithm and our simplified approach generate the same rewritings. This is because the condition in generating the label Li, "σ(q) does not contain any Skolem functions", has the same effect as requiring that no distinguished variable of the query corresponds to a Skolem function in the retainable destination, and the condition in U-joining L1 and L2, "σ is a most general unifier of t1[Y] and t2[Y] such that σ(u1 ∧ u2) does not contain Skolem functions", has the same effect as requiring that no argument in the query corresponds to two different Skolem functions or to both a restricted variable (constant) and a Skolem function in the retainable destination.
The efficiency of the two are similar because both need to do similar variable substitutions. Thus our simplified approach can be regarded as a generalization of the U-join algorithm to deal with built-in predicates. Our general approach generalizes the algorithm further so that union queries and implication constraints can be handled. 7. Conclusion We presented a general approach for rewriting union queries using views. When used to rewrite GCQs, our approach is more powerful than the Bucket and resolution-based algorithms. Furthermore, it can exploit implication constraints to find rewritings. A simplified version of the approach is less complete, but is faster and can still find maximum rewritings in some special cases. Our simplified approach is a generalization of the U-join, the MiniCon and the resolution-based algorithms. The worst-case complexity of both the general and the simplified approaches are exponential in the number of views. Empirical analysis is required to shed light on the real effectiveness of the approaches. Appendix A - Proofs of Theorems THEOREM 4.3. For any non-empty union rewriting Qr of Qu , there ∪si=1 PF exp are p-formulas PF 1 , . . . , PF s of Qu such that Qexp r i . Proof Suppose Qu = Q1 ∪ Q2 , where Q1 and Q2 are GCQs. The case where there are more GCQs in the union is similar. We only need to prove for the case where Qr is a general conjunctive rewriting. Suppose Qr is q(X) :− ψ1 (v1 (Z1 )), . . . , ψk (vk (Zk )), Cr , where v1 (Z1 ), . . . , vk (Zk ) are view atoms, Cr is the constraint of Qr , and ψi maps the variables in Zi to variables or constants. Denote the conjunction of the inferred constraints of ψ1 (v1 (Z1 )), . . . , ψk (vk (Zk )) by Qu , according to Lemma 3.2, there are relevance mapDr . Since Qexp r and relevance mappings ρ1 , . . . , ρk2 pings δ1 , . . . , δk1 from Q1 to Qexp r exp k2 from Q2 to Qr such that Dr ∧Cr → ∨k1 i=1 (δi (C1 )∧Eδi )∨∨i=1 (ρi (C2 )∧ Eρi ), where Ci is the constraint of Qi (i = 1, 2). 
Let $P$ be the set of subgoals in $Q_r^{exp}$. Consider the query

$$Q'_r \equiv q(X) :- P, \; \bigvee_{i=1}^{k1} (\delta_i(C_1) \wedge E_{\delta_i}) \vee \bigvee_{i=1}^{k2} (\rho_i(C_2) \wedge E_{\rho_i}).$$

Clearly $Q_r^{exp} \sqsubseteq Q'_r$. Thus, to show that $Q_r^{exp}$ is contained in the union of the expansions of some p-formulas, we only need to show that every query in
$\{q(X) :- P, \delta_i(C_1) \wedge E_{\delta_i} \mid i = 1, \ldots, k1\}$ or $\{q(X) :- P, \rho_j(C_2) \wedge E_{\rho_j} \mid j = 1, \ldots, k2\}$ is contained in the expansion of some p-formula. Consider $q(X) :- P, \delta_1(C_1) \wedge E_{\delta_1}$ (the others are similar). Suppose $Q_1$ is $q(X) :- r_1(X_1), \ldots, r_m(X_m), C_1$ and the target with respect to which $\delta_1$ is obtained is $T : q(X) :- r_1(Y_1), \ldots, r_m(Y_m)$. Since every body atom in $T$ is in the definition of some view atom (e.g., $\psi_1(v_1(Z_1))$) which appears in $Q_r$, without loss of generality we can assume $r_j(Y_j) = \psi_{i_j}(r_j(U_j))$ (for $j = 1, \ldots, m$; $i_j \in \{1, \ldots, k\}$), where $r_j(U_j)$ is the head of the rule $r_j(U_j) \leftarrow v_{i_j}(Z_{i_j})$ (assuming the view definitions are Skolemized). Then $r_1(U_1), \ldots, r_m(U_m)$ is a destination of $Q_1$ (this can be easily verified against Definitions 4.1 and 3.1). Let $\phi_j$ (for $j = 1, \ldots, m$) map every restricted variable in $U_j$ to the first argument in $X_j$ it corresponds to, and every unrestricted variable in the rule $r_j(U_j) \leftarrow v_{i_j}(Z_{i_j})$ to a distinct new variable, as in step 2 of the procedure for generating p-formulas. Then we get the p-formula

$$q(X) :- \phi_1(v_{i_1}(Z_{i_1})), \ldots, \phi_m(v_{i_m}(Z_{i_m})), \theta(C_1) \wedge E_\theta \qquad (A)$$

where $\theta$ is the relevance mapping from $Q_1$ with respect to $q(X) :- \phi_1(r_1(U_1)), \ldots, \phi_m(r_m(U_m))$. We will show that the expansion $(A)^{exp}$ of the p-formula (A) contains the query $q(X) :- P, \delta_1(C_1) \wedge E_{\delta_1}$.

By the definition of $\phi_j$ ($j = 1, \ldots, m$), $\phi_j$ maps a variable $u$ in $r_j(U_j)$ to a constant $\alpha$ only when the first corresponding argument in $r_j(X_j)$ is $\alpha$; $\phi_j$ maps two variables in $r_j(U_j)$ to the same argument only when the corresponding arguments in $r_j(X_j)$ are the same; and $\phi_j$ maps a variable $u$ in $r_j(U_j)$ and $\phi_h$ maps a variable $u'$ in $r_h(U_h)$ to the same argument only when the corresponding arguments in $r_j(X_j)$ and in $r_h(X_h)$ are the same. We also note that, if an argument in $r_j(U_j)$ is originally a constant, then the corresponding argument in $\psi_{i_j}(r_j(U_j))$ is the same constant; if two arguments in $r_j(U_j)$ are the same, then the corresponding two arguments in $\psi_{i_j}(r_j(U_j))$ must be the same; and if two arguments in $r_1(X_1), \ldots, r_m(X_m)$ are the same, then either the corresponding arguments in $\psi_{i_1}(r_1(U_1)), \ldots, \psi_{i_m}(r_m(U_m))$ are the same, or there will be an equation in $E_{\delta_1}$ between the two corresponding arguments. Therefore, if we denote the arguments in
$\phi_1(r_1(U_1)), \ldots, \phi_m(r_m(U_m))$ and in $\psi_{i_1}(r_1(U_1)), \ldots, \psi_{i_m}(r_m(U_m))$ by $(u_1, \ldots, u_l)$ and $(u'_1, \ldots, u'_l)$ respectively, then $u_i$ is a constant $\alpha$ or $E_\theta \rightarrow u_i = \alpha$ only if $u'_i = \alpha$ or $E_{\delta_1} \rightarrow u'_i = \alpha$; and $u_i = u_j$ or $E_\theta \rightarrow u_i = u_j$ only if $u'_i = u'_j$ or $E_{\delta_1} \rightarrow u'_i = u'_j$. In addition, we note that for those variables in $v_{i_1}(Z_{i_1}), \ldots, v_{i_m}(Z_{i_m})$ that do not appear in $r_1(U_1), \ldots, r_m(U_m)$ as restricted variables, $\phi_1, \ldots, \phi_m$ map them to distinct new variables. We also note that if $\phi_j$ ($j = 1, \ldots, m$) maps a variable $u$ in $U_j$ to a head variable of $Q_1$, say $x_1 \in X$, then the first occurrence of $u$ in $U_j$ must correspond to $x_1$, hence either $\psi_{i_j}(u) = x_1$ or there will be the equation $\psi_{i_j}(u) = x_1$ in $E_{\delta_1}$. Similarly, if there is an equation $\phi_j(u) = x_1$ in $E_\theta$ (e.g., in the case where the first occurrence of $u$ in $U_j$ corresponds to $\phi_j(u)$ in $Q_1$, but a later occurrence corresponds to $x_1$), then either $\psi_{i_j}(u) = x_1$ or there will be the equation $\psi_{i_j}(u) = x_1$ in $E_{\delta_1}$. Therefore $(A)^{exp}$ contains the expansion of the formula $q(X) :- \psi_{i_1}(v_{i_1}(Z_{i_1})), \ldots, \psi_{i_m}(v_{i_m}(Z_{i_m})), \delta_1(C_1) \wedge E_{\delta_1}$, which in turn contains the query $q(X) :- P, \delta_1(C_1) \wedge E_{\delta_1}$.

THEOREM 5.2. Suppose the relation attributes are all from infinite domains. If the query $Q_u$ and the views do not have constraints, then for any general conjunctive rewriting $Q_r$ of $Q$, there are some retainable p-formulas $Q_1, \ldots, Q_s$ defined on the view relations mentioned in $Q_r$ such that $Q_r \sqsubseteq Q_1 \cup \cdots \cup Q_s$.

Before proving Theorem 5.2, let us prove some lemmas.

LEMMA A.1. Let $Q$ be $q(X) :- p_1(X_1), \ldots, p_n(X_n)$. Suppose the views have no constraints. Let $Q_r$ be the query $q(X) :- \psi_1(v_1(Z_1)), \ldots, \psi_k(v_k(Z_k)), C_r$, where $v_1, \ldots, v_k$ are view relations and $\psi_i$ maps the variables in $Z_i$ to variables or constants. If there is a relevance mapping $\delta$ from $Q$ to $Q_r^{exp}$ such that $C_r \rightarrow E_\delta$ ($E_\delta$ is the associated constraint of $\delta$), then there is a retainable p-formula $F$ of $Q$ defined on the view relations mentioned in $Q_r$ such that $Q_r \sqsubseteq F$.

Proof. Denote $\psi_i(Z_i)$ by $Z'_i$ ($i = 1, \ldots, k$). That is, the rewriting is $q(X) :- v_1(Z'_1), \ldots, v_k(Z'_k), C_r$. Clearly, $C_r$ only involves variables in $X \cup Z'_1 \cup \cdots \cup Z'_k$; it does not involve any non-head variables of the views defining $v_1(Z'_1), \ldots, v_k(Z'_k)$
(recall that in expanding $Q_r$, the non-head variables of the views are renamed to distinct new variables, so we can assume the non-head variables of the views are all different from the variables in $Q$). Since $C_r \rightarrow E_\delta$, and all variables are from infinite domains, $E_\delta$ cannot involve non-head variables of the views either. Thus $Q_r = Q_r \wedge E_\delta$.

Since $Q$ and $Q_r^{exp}$ have identical heads, assuming $T \equiv q(X) :- p_1(Y_1), \ldots, p_n(Y_n)$ is the target with respect to which $\delta$ is obtained, there cannot be any head variable in $Q$ that corresponds to a non-head variable (in $T$) of the views, and there cannot be any non-head variable in $Q$ that corresponds to two different non-head variables, or to both a head variable (or constant) and a non-head variable of the views (otherwise, $E_\delta$ would involve non-head variables of the views). It can also be verified that if there is a non-head distinguished variable or constant $x$ in $Q$, then this $x$ cannot correspond to any non-distinguished variable $y$ of the views either, for otherwise $E_\delta$ would involve $y$. This means that, if we use the Skolemized views to expand $Q_r$ to $Q_r^{exp}$, then no distinguished variable of $Q$ corresponds to a Skolem function in $T$, and no variable corresponds to two different Skolem functions or to both a variable and a Skolem function. Therefore, corresponding to $p_1(Y_1), \ldots, p_n(Y_n)$ there is a retainable destination of $Q$ in $Q_r^{exp}$: since each $p_i(Y_i)$ is a subgoal of some view which appears in $Q_r$, there is a rule $p_i(U_i) \leftarrow v_{s_i}(Z_{s_i})$, where $s_i \in \{1, \ldots, k\}$ ($i = 1, \ldots, n$, and it is possible that $s_i = s_j$), and $\psi_{s_i}(p_i(U_i)) = p_i(Y_i)$. It is easy to verify that $p_1(U_1), \ldots, p_n(U_n)$ (denoted $DS$) is a retainable destination.

Corresponding to the above retainable destination and inverse rules, we can construct a retainable p-formula $F$ (we will see that the construction cannot fail). Next we show that $Q_r \sqsubseteq F$. Let us first recall the steps for the construction of $F$.
First, we divide $DS$ into groups $G_1, \ldots, G_h$. The subgoals of $Q$ are accordingly divided into $G'_1, \ldots, G'_h$. Then we define mappings $\phi_{G_1}, \ldots, \phi_{G_h}$ such that $\phi_{G_i}$ maps each restricted variable in $G_i$ to the first corresponding argument in $G'_i$ (counting from left to right), and maps each unrestricted variable in the rule corresponding to the group $G_i$ to a distinct new variable. Let $\phi_j$ be the same mapping as $\phi_{G_i}$ if $p_j(U_j)$ ($j = 1, \ldots, n$) is in group $G_i$ ($i = 1, \ldots, h$). We construct a relevance mapping $\theta$ from $Q$ with respect to $q(X) :- p_1(\phi_1(U_1)), \ldots, p_n(\phi_n(U_n))$ to get the formula $q(X) :- p_1(\phi_1(U_1)), \ldots, p_n(\phi_n(U_n)), E_\theta$. Finally, in the above formula we replace every atom which corresponds to the group $G_i$ with $v_{s_i}(\phi_i(Z_{s_i}))$ to get the retainable p-formula $F$:

$$(F) \quad q(X) :- v_{s_1}(\phi_1(Z_{s_1})), \ldots, v_{s_n}(\phi_n(Z_{s_n})), E_\theta$$

To show $Q_r \sqsubseteq F$, we compare $F$ with the following formula $F1$:

$$(F1) \quad q(X) :- v_{s_1}(\psi_{s_1}(Z_{s_1})), \ldots, v_{s_n}(\psi_{s_n}(Z_{s_n})), E_\delta$$
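The mapping step recalled above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are ours, not the paper's: atoms are given as plain argument tuples, variables are strings, constants are non-string values, the set of restricted variables is supplied by the caller, and the helper name `build_phi` and the fresh-variable prefix `_n` are hypothetical.

```python
from itertools import count


def build_phi(dest_atoms, query_atoms, restricted):
    """Sketch of one group mapping phi_{G_i} (hypothetical helper).

    dest_atoms  : argument tuples of the destination atoms in a group G_i
    query_atoms : argument tuples of the matching subgoals of Q (G'_i)
    restricted  : set of variable names treated as restricted
    """
    phi = {}
    # Assumption: names "_n1", "_n2", ... do not occur in the query.
    fresh = (f"_n{i}" for i in count(1))
    for u_args, x_args in zip(dest_atoms, query_atoms):
        for u, x in zip(u_args, x_args):
            if not isinstance(u, str):
                continue  # constants in the destination are left alone
            if u in phi:
                continue  # only the first (leftmost) occurrence decides
            # Restricted variables go to the first corresponding argument;
            # unrestricted variables go to a distinct new variable.
            phi[u] = x if u in restricted else next(fresh)
    return phi


dest = [("A", "B"), ("B", "C")]
query = [("x", "y"), ("z", 3)]
phi = build_phi(dest, query, restricted={"A", "B"})
```

Here `B` is sent to `y` because its first occurrence corresponds to `y`; its later occurrence, which corresponds to `z`, is the kind of mismatch that the associated constraint $E_\theta$ has to record as an equality.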
Let $S1 \equiv (z'_1, \ldots, z'_s)$ be the sequence of all arguments in the non-constraint part of $F$, and let $S2 \equiv (z''_1, \ldots, z''_s)$ be the sequence of all arguments in the non-constraint part of $F1$. We will show that, for any $i, j \in \{1, \ldots, s\}$, (1) if $z'_i$ is a constant $\alpha$ or $E_\theta \rightarrow z'_i = \alpha$, then either $z''_i$ is the same constant, or $E_\delta \rightarrow z''_i = \alpha$; (2) if $z'_i$ and $z'_j$ are the same argument or $E_\theta \rightarrow z'_i = z'_j$, then either $z''_i$ and $z''_j$ are the same argument, or $E_\delta \rightarrow z''_i = z''_j$. Therefore, if we construct a relevance mapping $\rho$ from $F$ with respect to $q(X) :- v_{s_1}(\psi_{s_1}(Z_{s_1})), \ldots, v_{s_n}(\psi_{s_n}(Z_{s_n}))$, then the condition $E_\delta \rightarrow \rho(E_\theta) \wedge E_\rho$ will hold (note that $\rho(E_\theta) = E_\theta$ because $E_\theta$ involves only the variables in $X$; note also that $E_\rho$ is a conjunction of equalities between two arguments in $S2$ that correspond to a common argument in $S1$, or between a constant $\alpha$ and a variable $z$ in $S2$ when $z$ corresponds to $\alpha$ at some position). Hence $Q_r \sqsubseteq F$ will be proved.

Next we show the above points (1) and (2). Since the unrestricted variables in the views $v_{s_i}(Z_{s_i})$ are mapped to distinct new variables by $\phi_i$, we can ignore them. In other words, we can assume that all variables in $v_{s_i}(Z_{s_i})$ are restricted variables. We can also assume that the (unrestricted) variables in an inverse rule appear in the same order in the head as in the body. Let $x_1$ be an argument in $X$. For convenience, we use $z_{i,j}$ to denote the $j$th argument of $v_{s_i}(Z_{s_i})$. Similarly, we use $z'_{i,j}$ and $z''_{i,j}$ to denote the $j$th argument of $v_{s_i}(\phi_i(Z_{s_i}))$ and $v_{s_i}(\psi_{s_i}(Z_{s_i}))$ respectively. We will use the symbols $z_{1,1}, z_{1,2}, z_{2,1}$, $z'_{1,1}, z'_{1,2}, z'_{2,1}$, and $z''_{1,1}, z''_{1,2}, z''_{2,1}$ to explain the reasoning behind points (1) and (2). Suppose $p_1(U_1)$ and $p_2(U_2)$ are in the groups $G_1$ and $G_2$ respectively. We consider the following possible cases.

(a) $z'_{1,1}$ is a constant $\alpha$. In this case, either $z_{1,1}$ is $\alpha$, or $z_{1,1}$ is a restricted variable but $\phi_1$ maps $z_{1,1}$ to $\alpha$. In the first case $z''_{1,1}$ is obviously $\alpha$. In the second case, the first occurrence of $z_{1,1}$ in the group $G_1$ must correspond to $\alpha$ in $Q$. Recall that atoms in the same group correspond to the same inverse rule; therefore, in $S2$ the argument corresponding to the position where the first occurrence of $z_{1,1}$ appears is also $z''_{1,1}$. This means that, in $T$, there is an occurrence of $z''_{1,1}$ which corresponds to $\alpha$ in $Q$. Hence $E_\delta \rightarrow z''_{1,1} = \alpha$.

(b) $z'_{1,1}$ and $z'_{1,2}$ are the same argument $z'$. In this case, either $z_{1,1}$ and $z_{1,2}$ are the same argument, or $\phi_1$ maps $z_{1,1}$ and $z_{1,2}$ both to $z'$. In the first case, $z''_{1,1}$ and $z''_{1,2}$ clearly must be the same argument. In the second case, the first occurrences of $z_{1,1}$ and $z_{1,2}$ in the group $G_1$ correspond to the same argument $z'$ in $Q$. Hence, in $T$, there is at least one occurrence of $z''_{1,1}$ and one occurrence of $z''_{1,2}$ corresponding to the same argument in $Q$. Hence $E_\delta \rightarrow z''_{1,1} = z''_{1,2}$.

(c) $z'_{1,1}$ and $z'_{2,1}$ are the same variable. In this case, if $p_1(U_1)$ and $p_2(U_2)$ are in the same group, then with the same reasoning as in (b)
we know either $z''_{1,1}$ and $z''_{2,1}$ are the same argument or $E_\delta \rightarrow z''_{1,1} = z''_{2,1}$. Suppose $p_1(U_1)$ and $p_2(U_2)$ are in different groups $G_1$ and $G_2$ respectively. Then $\phi_1$ maps $z_{1,1}$ and $\phi_2$ maps $z_{2,1}$ to the same argument. That is, the first occurrence of $z_{1,1}$ in $G_1$ and the first occurrence of $z_{2,1}$ in $G_2$ correspond to the same argument in $Q$. So there is an occurrence of $z''_{1,1}$ and there is an occurrence of $z''_{2,1}$ corresponding to the same argument in $Q$, hence $E_\delta \rightarrow z''_{1,1} = z''_{2,1}$.

(d) $z'_{1,1}$ is $x_1$. In this case, the first occurrence of $z_{1,1}$ in $G_1$ corresponds to $x_1$ in $Q$. Thus, in $T$, there is an occurrence of $z''_{1,1}$ corresponding to $x_1$ in $Q$. Hence $E_\delta \rightarrow x_1 = z''_{1,1}$.

(e) $z'_{1,1}$ and $z'_{2,1}$ are different arguments, but $E_\theta \rightarrow z'_{1,1} = z'_{2,1}$. In this case, $\phi_1$ maps $z_{1,1}$ to $z'_{1,1}$, $\phi_2$ maps $z_{2,1}$ to $z'_{2,1}$, and either there is an occurrence of $z'_{1,1}$ and an occurrence of $z'_{2,1}$ which correspond to the same argument $x$ in $Q$ (in this case $z'_{1,1} = x \wedge z'_{2,1} = x$ appears explicitly in $E_\theta$), or there are intermediate arguments $z'_1, \ldots, z'_h$ such that some occurrences of $z'_{1,1}$ and $z'_1$ correspond to the same argument in $Q$, some occurrences of $z'_1$ and $z'_2$ correspond to the same argument in $Q$, ..., and some occurrences of $z'_h$ and $z'_{2,1}$ correspond to the same argument in $Q$. If we can show $E_\delta \rightarrow z''_{1,1} = z''_{2,1}$ for the first case, then $E_\delta \rightarrow z''_{1,1} = z''_{2,1}$ clearly also holds for the second case. In the first case, without loss of generality, we assume that the first occurrence of $z'_{1,1}$ is at the position $(i, j)$, that is, $\phi_i$ maps $z_{i,j}$ to $z'_{1,1}$, and the first occurrence of $z'_{2,1}$ occurs at position $(a, b)$, that is, $\phi_a$ maps $z_{a,b}$ to $z'_{2,1}$. Then there is an occurrence of $z''_{i,j}$ and an occurrence of $z''_{a,b}$ that correspond to the same argument in $Q$. Hence $E_\delta \rightarrow z''_{i,j} = z''_{a,b}$. By point (b), we also know that $E_\delta \rightarrow z''_{1,1} = z''_{i,j}$ and $E_\delta \rightarrow z''_{2,1} = z''_{a,b}$. Therefore, $E_\delta \rightarrow z''_{1,1} = z''_{2,1}$. Similarly, if $E_\theta \rightarrow z'_{1,1} = z'_{1,2}$ then $E_\delta \rightarrow z''_{1,1} = z''_{1,2}$.

(f) $E_\theta \rightarrow z'_{1,1} = \alpha$. In this case, either one occurrence (not the first) of $z'_{1,1}$ corresponds to $\alpha$ in $Q$ (first case); or there are intermediate variables $z'_1, \ldots, z'_h$ such that some occurrences of $z'_{1,1}$ and $z'_1$ correspond to the same argument in $Q$, some occurrences of $z'_1$ and $z'_2$ correspond to the same argument in $Q$, ..., and some occurrence of $z'_h$ corresponds to $\alpha$ in $Q$ (second case); or $z'_{1,1}$ and $\alpha$ correspond to the same argument in $Q$ (third case). In the first case, either there is an occurrence of $z''_{1,1}$ (in $T$) corresponding to $\alpha$ in $Q$, and hence $E_\delta \rightarrow z''_{1,1} = \alpha$; or, when the occurrence of $z'_{1,1}$ corresponding to $\alpha$ appears at position $(i, j)$, i.e., $\phi_1$ maps $z_{1,1}$ and $\phi_i$ maps $z_{i,j}$ to the same symbol $z'_{1,1}$, we will have $E_\delta \rightarrow z''_{1,1} = z''_{i,j}$ and $E_\delta \rightarrow z''_{i,j} = \alpha$. Hence we also have $E_\delta \rightarrow z''_{1,1} = \alpha$. In the second case, due to point (e), we also have $E_\delta \rightarrow z''_{1,1} = \alpha$. The third case is a special case of (e), so $E_\delta \rightarrow z''_{1,1} = \alpha$ also holds.
(g) $E_\theta \rightarrow z'_{1,1} = x_1$. In this case, either one occurrence of $z'_{1,1}$ corresponds to $x_1$ in $Q$ (first case), or there are intermediate variables $z'_1, \ldots, z'_h$ such that some occurrences of $z'_{1,1}$ and $z'_1$ correspond to the same argument in $Q$, some occurrences of $z'_1$ and $z'_2$ correspond to the same argument in $Q$, ..., and some occurrence of $z'_h$ corresponds to $x_1$ in $Q$ (second case). Due to point (e), we only need to consider the first case. In the first case, we assume the occurrence of $z'_{1,1}$ that corresponds to $x_1$ appears at position $(i, j)$. That is, $\phi_i$ maps $z_{i,j}$ to $z'_{1,1}$. Thus $z''_{i,j}$ corresponds to $x_1$ in $Q$. So $E_\delta \rightarrow z''_{i,j} = x_1$. By point (e), we also know $E_\delta \rightarrow z''_{i,j} = z''_{1,1}$; therefore, $E_\delta \rightarrow z''_{1,1} = x_1$.
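Each case above ends with a claim of the form "a conjunction of equalities implies a given equality". When the conjunction consists only of equalities between variables and constants, and is satisfiable over an infinite domain, such a claim can be checked mechanically by merging equated terms into classes. The following Python sketch illustrates this with a union-find structure; the function name `entails` and the sample names `z11`, `zij`, `z21` are hypothetical, and the sketch assumes the input never equates two distinct constants.

```python
def entails(equalities, goal):
    """Return True if the conjunction of `equalities` implies `goal`.

    equalities : iterable of (term, term) pairs, read as term = term
    goal       : a single (term, term) pair
    Terms are hashable values: strings for variables, numbers for constants.
    Assumption: the conjunction is satisfiable (no two distinct constants
    are ever equated); this sketch does not check for that.
    """
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in equalities:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two equivalence classes

    a, b = goal
    return find(a) == find(b)


# Shape of case (f): from z11 = zij and zij = 3 we may conclude z11 = 3.
e_delta = [("z11", "zij"), ("zij", 3)]
implied = entails(e_delta, ("z11", 3))
not_implied = entails(e_delta, ("z11", "z21"))
```

The chain of intermediate arguments used in cases (e) to (g) corresponds to following several merged classes in turn, which is why those cases reduce to the direct ones.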
Before stating the next lemma, we need to define the IEI property of a constraint.

DEFINITION A.1. Let $C$ be a satisfiable conjunction of primitive constraints with variables over infinite data domains. We say that $C$ has the Independent Equality Implication (IEI) property if the following holds: for any non-tautological equalities $E_1, \ldots, E_k$, each between two variables in $C$ or between a variable in $C$ and a constant, whenever $C \rightarrow \bigvee_{i=1}^{k} E_i$, there exists some $i$ ($1 \le i \le k$) such that $C \rightarrow E_i$.

For instance, a conjunction of linear arithmetic constraints over the reals has the IEI property [LM92].

LEMMA A.2. Suppose the relation attributes are all from infinite domains. If the query $Q_u$ and the views do not have constraints, then for any general conjunctive rewriting $Q_r$ whose constraint has the IEI property, there is a retainable p-formula $Q_p$ of $Q_u$ defined on the view relations mentioned in $Q_r$ such that $Q_r \sqsubseteq Q_p$.

Proof. Let $Q_r(X) :- v_1(Z_1), \ldots, v_k(Z_k), C_r$ be a general conjunctive rewriting of $Q_u$ such that $C_r$ has the IEI property. Without loss of generality, we assume the head of $Q_r$ is identical to the head of $Q_u$, and $Z_i$ ($i = 1, \ldots, k$) is a tuple of distinct variables (we can make this assumption because $C_r \wedge E$ also has IEI, where $E$ is any conjunction of basic equalities involving variables from infinite domains). Since $Q_r^{exp} \sqsubseteq Q_u$, by Lemma 3.2, there must be relevance mappings $\delta_1, \ldots, \delta_m$ from the GCQs in $Q_u$ to $Q_r^{exp}$ such that $C_r \rightarrow \bigvee_{i=1}^{m} E_{\delta_i}$ (note that the constraint of $Q_r^{exp}$ is $C_r$).

Note that $E_{\delta_i}$ is a conjunction of tautologies and basic equalities involving variables over infinite domains. Therefore, there must be at least one $E_{\delta_i}$ which is TRUE or which involves variables in $Var(C_r)$ only. Otherwise, $E_{\delta_1}, \ldots, E_{\delta_m}$ are all non-tautological and involve at least one variable not in $C_r$, and since the variables are from infinite
domains, for any values assigned to the variables in $C_r$ that make $C_r$ true, we can pick some values for those variables not in $C_r$ such that none of $E_{\delta_1}, \ldots, E_{\delta_m}$ is true. This would contradict $C_r \rightarrow \bigvee_{i=1}^{m} E_{\delta_i}$. Without loss of generality, let us assume that $E_{\delta_1}, \ldots, E_{\delta_{m1}}$ ($0 < m1 \le m$) are those that are tautologies or involve only variables in $C_r$. Similar to the reasoning above, we can conclude that $C_r \rightarrow \bigvee_{i=1}^{m1} E_{\delta_i}$. In fact, for any valuation of $C_r$ that makes $C_r$ true, we can pick some values for those variables not in $C_r$ such that none of $E_{\delta_{m1+1}}, \ldots, E_{\delta_m}$ is true. On the other hand, since $C_r \rightarrow \bigvee_{i=1}^{m} E_{\delta_i}$, there must be at least one among $E_{\delta_1}, \ldots, E_{\delta_{m1}}$ that is true for that valuation. That is, $C_r \rightarrow \bigvee_{i=1}^{m1} E_{\delta_i}$.

By the assumption that $C_r$ has the IEI property, we know that there is some $i$ such that $C_r \rightarrow E_{\delta_i}$. In other words, there is a GCQ $Q$ in $Q_u$ and a relevance mapping $\delta$ from $Q$ to $Q_r^{exp}$ such that $C_r \rightarrow E_\delta$. By Lemma A.1, there is a retainable p-formula defined on the view relations mentioned in $Q_r$ that contains $Q_r$.

LEMMA A.3. Any constraint can be rewritten as a finite disjunction of constraints having the IEI property.

Proof. Let $C = C(x_1, \ldots, x_n)$ be an arbitrary constraint which does not have IEI. If there are $k$ basic equalities $E_1, \ldots, E_k$ involving the variables in $x_1, \ldots, x_n$ only, and $C \rightarrow \bigvee_{i=1}^{k} E_i$, then we can write $C = \bigvee_{i=1}^{k} (C \wedge E_i)$. If every disjunct $C \wedge E_i$ has IEI, we are done. Otherwise we do the same for every $C \wedge E_i$. This process will terminate because each time we conjoin an $E_i$ with $C$, the dimension, i.e., the number of equivalence classes of variables that are not equated to a constant (two variables are in the same equivalence class iff they are equated) in the constraint, is reduced.

Now the proof of Theorem 5.2 is simple: it follows directly from Lemma A.2 and Lemma A.3.
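As a small illustration of the splitting step in the proof of Lemma A.3, consider the following worked example. It is ours rather than the paper's, and it assumes that the non-linear constraint $x \cdot y = 0$ over the reals is admitted as a primitive constraint.

```latex
% Assumption (ours): C is the single constraint x * y = 0 over the reals.
\begin{align*}
C &\equiv (x \cdot y = 0), \\
C &\rightarrow (x = 0) \vee (y = 0),
  \quad\text{but}\quad C \not\rightarrow (x = 0)
  \ \text{and}\ C \not\rightarrow (y = 0), \\
C &= (C \wedge x = 0) \vee (C \wedge y = 0)
   \equiv (x = 0) \vee (y = 0).
\end{align*}
```

So $C$ itself does not have IEI, while each disjunct is a linear constraint over the reals and therefore has IEI by the remark after Definition A.1. The dimension drops from 2 (the classes $\{x\}$ and $\{y\}$) to 1 in each disjunct, since one of the two variables becomes equated to the constant 0.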
Proof of Theorem 5.2. By Lemma A.3, we can rewrite $Q_r$ as the union of a finite number of general conjunctive rewritings whose constraints have the IEI property. By Lemma A.2, for every such general conjunctive rewriting $Q'_r$, there is a retainable p-formula of $Q$ defined on the view relations mentioned in $Q'_r$ (hence mentioned in $Q_r$) which contains $Q'_r$.

THEOREM 5.3. Suppose the relation attributes are all from the reals. If the query does not have constraints, and the constraints of the views are conjunctions of primitive linear arithmetic constraints involving only distinguished variables, then for every general conjunctive
rewriting $Q_r$ of $Q$ whose constraint is a conjunction of primitive linear arithmetic constraints, there is a retainable p-formula $Q_p$ of $Q$ defined on the relations mentioned in $Q_r$ such that $Q_r^{exp} \sqsubseteq Q_p^{exp}$.

The proof of this theorem uses the fact that any conjunction of linear arithmetic constraints has the IEI property.

Proof. Let $Q_r(X) :- v_1(Z_1), \ldots, v_k(Z_k), C_r$ be a general conjunctive rewriting of $Q$. We assume the head of $Q_r$ is identical to the head of $Q$. This clearly involves no loss of generality. Since $Q_r^{exp} \sqsubseteq Q$, there must be relevance mappings $\delta_1, \ldots, \delta_k$ from $Q$ to $Q_r^{exp}$ such that $D \wedge C_r \rightarrow \bigvee_{i=1}^{k} E_{\delta_i}$, where $D \wedge C_r$ is the constraint in $Q_r^{exp}$ ($D$ is the constraint added to the body of $Q_r$ during the expansion). Clearly, $C_r$ involves only variables that appear in $Q_r$. Since $D \wedge C_r$ has the IEI property, we know that there is some $i$ such that $D \wedge C_r \rightarrow E_{\delta_i}$. Corresponding to the relevance mapping $\delta_i$, there is a target of $Q$ in $Q_r^{exp}$. If $C_r$ and the constraints of the views are conjunctions of primitive linear arithmetic constraints over the reals, and the constraints of the views involve only distinguished variables, then $D \wedge C_r$ involves only the variables in $Q_r$. Since $D \wedge C_r \rightarrow E_{\delta_i}$, the equalities in $E_{\delta_i}$ can involve only the variables in $Q_r$ (see footnote 5). Thus every distinguished variable in $Q$ can only correspond to a distinguished variable (of some view, not of $Q_r^{exp}$) or a constant in the target, and no non-distinguished variable in $Q$ can correspond to two different non-distinguished variables or to both a distinguished variable and a non-distinguished variable. Therefore, corresponding to this target of $Q$, there is a retainable destination of $Q$ with respect to the inverse rules generated from the views mentioned in $Q_r$. Using reasoning similar to that used in the proof of Lemma A.1, we can show that there is a retainable p-formula $Q_p$ of $Q$ defined on the relations mentioned in $Q_r$ such that $Q_r^{exp} \sqsubseteq Q_p^{exp}$.
5 Note that if the constraints in the views involve non-distinguished variables of the views, then the equalities in $E_{\delta_i}$ may also involve non-distinguished variables. For instance, suppose $x + y = 0$ is the constraint in $v_1(x)$ and $x = 0$ is the constraint in $v_2(x)$; then, according to the definition, $y$ is not distinguished in $v_1(x)$, and $x + y = 0 \wedge x = 0 \rightarrow y = 0$.

References

AD98. S. Abiteboul and O. Duschka. Complexity of answering queries using materialized views. In Proc. of PODS, pages 254–263, 1998.
ALM. F. N. Afrati, C. Li, and P. Mitra. Answering queries using views with arithmetic comparisons. In Proceedings of PODS'02, pages 209–220, Madison, Wisconsin, USA, 3-5 June.
DG97. O. M. Duschka and M. R. Genesereth. Answering recursive queries using views. In Proc. 16th PODS, pages 109–116, 1997.
DGL00. O. Duschka, M. Genesereth, and A. Levy. Recursive query plans for data integration. Journal of Logic Programming, special issue on Logic Based Heterogeneous Information Systems, pages 778–784, 2000.
GM02. J. Grant and J. Minker. A logic-based approach to data integration. Theory and Practice of Logic Programming, 2(3):323–368, 2002.
Gry99. J. Gryz. Query rewriting using views in the presence of functional and inclusion dependencies. Information Systems, 24(7):597–612, 1999.
Hal01. A. Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.
Klu88. A. C. Klug. On conjunctive queries containing inequalities. Journal of ACM, 35(1):146–160, 1988.
LM92. J. L. Lassez and K. McAloon. A canonical form for generalized linear constraints. Journal of Symbolic Computation, 13(1):1–24, January 1992.
LRO96. A. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of VLDB, pages 251–262, 1996.
Mah93. M. J. Maher. A logic programming view of CLP. In Proc. 10th International Conference on Logic Programming, pages 737–753, 1993.
Mit01. P. Mitra. An algorithm for answering queries efficiently using views. In Proc. of the 12th Australasian database conference, 2001.
PH01. R. Pottinger and A. Y. Halevy. Minicon: A scalable algorithm for answering queries using views. VLDB Journal, 10(2-3):182–198, 2001.
Qia96. X. Qian. Query folding. In Proc. of 12th ICDE, pages 48–55, 1996.
Ull88. J. D. Ullman. Principles of Database and Knowledge-Base Systems, volume 1 & 2. Computer Science Press, 1st edition, 1988.
Ull00. J. D. Ullman. Information integration using logical views. TCS: Theoretical Computer Science, 239(2):189–210, 2000.
ZO. X. Zhang and Z. M. Özsoyoglu. Implication and referential constraints: A new formal reasoning. IEEE TKDE, 9(6):894-910, Nov/Dec 1997.