View Disassembly

November 1997

View Disassembly|Godfrey & Gryz

Parke Godfrey
U.S. Army Research Laboratory, Adelphi, Maryland, U.S.A.
[email protected]

Jarek Gryz
Department of Computer Science, York University, Toronto, Canada
[email protected]

Abstract

We explore a new form of view rewrite called view disassembly. The objective is to rewrite views in order to "remove" certain sub-views (or unfoldings) of the view. This becomes pertinent for complex views which may be defined over other views and which may involve union. Such complex views arise necessarily in environments such as data warehousing and mediation over heterogeneous databases. View disassembly can be used for view and query optimization, preserving data security, making use of cached queries and materialized views, and view maintenance.

We provide computational complexity results for view disassembly. We show that finding optimal rewrites for disassembled views is at least NP-hard. However, we provide good news too. We provide an approximation algorithm that has much better run-time behavior. We show a pertinent class of unfoldings whose removal always results in a disassembled view that is simpler than the view itself. We also show the complexity of determining when a collection of unfoldings covers the view definition.

1 Introduction

Many database applications and environments, such as mediation over heterogeneous database sources and data warehousing for decision support, lead to complex view definitions. Views are often nested, defined over previously defined views, and may involve unions. Union is a necessity in mediation, as views in the meta-schema are defined to combine data from disparate sources. In these environments, view definition maintenance is of paramount importance.

There are many reasons why one might want to "remove" pieces (sub-views) from a given view, or from a query which involves views (footnote 1). Let us call a sub-view an unfolding of the view, as the view can be unfolded via its definition into more specific sub-views. These reasons include the following.

1. Some unfoldings of the view may be effectively cached from previous queries [2], or may be materialized views [12].
2. Some unfoldings may be known to evaluate empty, by reasoning over the integrity constraints [1].
3. Some unfoldings may match protected queries, which, for security, cannot be evaluated for all users [15].
4. Some unfoldings may be subsumed by previously asked queries, so are not of interest to the user.

What does it mean to remove an unfolding from a view or query? The modified view or query should not subsume (and thus, when evaluated, should never evaluate) the removed unfoldings, but should subsume "everything else" of the original view.

In case 1, one might want to separate out certain unfoldings, because they can be evaluated much less expensively (and, in a networked, distributed environment, be evaluated locally). Then, the "remainder query" could be evaluated separately [2]. In case 2, the unfoldings are free to evaluate, since it is known in advance that they must evaluate empty. If the remainder query is less expensive to evaluate than the original, this is an optimization. In case 3, when some unfoldings are protected, this does not mean that the "rest" of the query or view cannot be safely evaluated. In case 4, when a user is asking a series of queries, he or she may just be interested in the stream of answers returning. So any previously seen answers are no longer of interest. Furthermore, sub-views within a view definition can be rewritten so as to not overlap, which can optimize evaluation.

Footnote 1: In this paper, we use view and query synonymously.

[Figure 1: The AND/OR tree representation of the original query. The root is an AND (join) node over $v_1$, $v_2$, and $v_3$; each $v_i$ is an OR (union) node, with leaves $p$, $r$ under $v_1$, leaves $s$, $t$ under $v_2$, and leaves $u$, $w$ under $v_3$.]

In [5], we introduced the notion of a discounted query, or discounted view. This is a view paired with a collection of its unfoldings, called the unfoldings-to-discount. In [6], we investigate an evaluation strategy for discounted views that, in general, is better in performance than the evaluation of the view itself (not discounted). This showed how discounting could be used for optimization. However, such a specialized evaluation procedure is not always available. Furthermore, for views, it may be more elegant, and often more efficient, to simply rewrite the view. In this paper, we consider rewriting the view into a form that evaluates the same as the discounted view. We call this rewrite process view disassembly.

We represent queries and views in Datalog in this paper. View definitions expand via their Datalog rule definitions. We consider neither recursion nor negation in this paper. Thus all views are nonrecursive. A view (or query) can be represented as an AND/OR tree. In view disassembly, we consider algebraic rewrites of the views' corresponding AND/OR trees. Consider the following example.

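Example 1. Let there be three views, $v_1$, $v_2$, and $v_3$, in the database DB defined over base relations $p$, $r$, $s$, $t$, $u$, $w$:

$$v_1 \leftarrow p. \qquad v_2 \leftarrow s. \qquad v_3 \leftarrow u.$$
$$v_1 \leftarrow r. \qquad v_2 \leftarrow t. \qquad v_3 \leftarrow w.$$

Define the following query Q:

$$Q: \; \leftarrow v_1, v_2, v_3.$$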

Query Q can be represented as the "parse tree" of its relational algebra (RA) representation, which is an AND/OR tree, as shown in Figure 1 (footnote 2). Evaluating the query, in the order of operations as specified in its RA representation, is equivalent to materializing all nodes of the query tree. We refer to this type of evaluation (and representation) as bottom-up.
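
As a rough illustration of the bottom-up reading, the following Python sketch (not from the paper) encodes the AND/OR tree of Figure 1 and materializes every node from the leaves upward. The names `Leaf`, `Node`, `view`, and `evaluate`, and the toy relation contents, are assumptions made for this example. To stay close to the propositional examples, every base relation is taken to be a set of values over one shared attribute, so that a join reduces to set intersection.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    name: str  # a base relation such as "p"


@dataclass(frozen=True)
class Node:
    op: str  # "AND" for a join, "OR" for a union
    children: Tuple["Tree", ...]


Tree = Union[Leaf, Node]


def evaluate(tree: Tree, db: Dict[str, FrozenSet[int]]) -> FrozenSet[int]:
    """Materialize a node bottom-up: children first, then the node itself."""
    if isinstance(tree, Leaf):
        return db[tree.name]
    parts = [evaluate(child, db) for child in tree.children]
    if tree.op == "AND":
        return frozenset.intersection(*parts)  # join on the shared attribute
    return frozenset.union(*parts)  # union of the alternatives


def view(*names: str) -> Node:
    """A view defined as the union (OR) of the named base relations."""
    return Node("OR", tuple(Leaf(n) for n in names))


# Q <- v1, v2, v3 with v1 = p OR r, v2 = s OR t, v3 = u OR w (Example 1).
Q = Node("AND", (view("p", "r"), view("s", "t"), view("u", "w")))

# Toy contents, invented for the illustration.
DB = {
    "p": frozenset({1, 2}), "r": frozenset({3}),
    "s": frozenset({1}), "t": frozenset({3}),
    "u": frozenset({1, 4}), "w": frozenset({3}),
}

print(sorted(evaluate(Q, DB)))  # prints [1, 3]
```

Each call to `evaluate` on an inner node corresponds to materializing that node of the query tree, which is the sense in which the evaluation is bottom-up.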

Now assume that the following query F has been asked before, and that its results are stored in cache. (Equivalently, we could assume that this formula represents a materialized view, or is subsumed by an integrity constraint.)

$$F: \; \leftarrow p, s, u.$$

If F is subsumed by an integrity constraint, then its evaluation cannot possibly return any answers. Thus it can be eliminated from the query without changing the result. If F is a cached query, or a materialized view, it may still be beneficial to "remove" it from the query. We call the query after this removal a discounted query.

One way to achieve this is to rewrite Q as a union of joins over base tables, then to remove the join expression represented by F, and finally to evaluate the remaining join expressions. This may be very inefficient, however. The number of join expressions that remain to be evaluated may be exponential in the size of the intensional database (IDB, the collection of view definitions). Furthermore, we showed in [7] that such an evaluation plan (which we call a top-down evaluation plan) may require evaluating the same joins multiple times, and incur the expense that a given answer tuple may be computed many times (whenever base tables overlap). A top-down evaluation of Q from Figure 1 is the union of the eight join expressions from

$$\{p, r\} \times \{s, t\} \times \{u, w\}$$

A more efficient evaluation plan for the discounted query can be devised by rewriting the query so that the number of operations (unions plus joins) is minimized. (See Figure 2.) As a side effect of this operation, the redundancy in join evaluation, as well as the redundancy in answer tuple computation, is reduced [7] (footnote 3).

Our goal is to find good view disassemblies; that is, rewrites that result in small AND/OR trees. We aim to optimize over the number of nodes of the resulting AND/OR tree. (An ultimate goal might be to find a rewrite that results in the smallest AND/OR tree possible.) Some disassemblies can be exponential in the size of the original view's AND/OR tree. These, of course, are intractable to evaluate simply due to the size of their representation.

Footnote 2: This is not precisely the parse tree. We represent only the joins (ANDs) and unions (ORs). As in Datalog, the other operations, such as selects and projections, are implicit.

Footnote 3: We have a naive algorithm that finds the optimal rewrite in the case of a single unfolding-to-discount. Such a result would be difficult to find by hand.
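
The top-down plan described above can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary `IDB`, the function `top_down`, and the ten-view `wide_idb` are hypothetical names introduced here, and the encoding assumes the flat shape of Example 1, in which every view is a union of base relations.

```python
from itertools import product

# Example 1, with each view written as the list of base relations it unions.
IDB = {"v1": ["p", "r"], "v2": ["s", "t"], "v3": ["u", "w"]}


def top_down(query, idb):
    """Expand a join of flat views into its union of joins over base tables."""
    alternatives = [idb.get(atom, [atom]) for atom in query]
    return {frozenset(combo) for combo in product(*alternatives)}


joins = top_down(["v1", "v2", "v3"], IDB)
F = frozenset({"p", "s", "u"})
remaining = joins - {F}  # the discounted query, evaluated top-down

print(len(joins), len(remaining))  # prints 8 7

# A join of n views with two alternatives each expands to 2**n join expressions.
wide_idb = {f"v{i}": [f"a{i}", f"b{i}"] for i in range(10)}
print(len(top_down(list(wide_idb), wide_idb)))  # prints 1024
```

Removing F leaves seven join expressions to evaluate, and the second call shows how quickly the expansion grows with the number of joined views, which is why a compact rewrite of the tree is preferable to enumerating the joins.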

[Figure 2: The AND/OR tree representation of the modified query.]

In this paper, we present approaches to, and computational complexities of, view disassembly. In the first case, it may be that the unfoldings-to-discount (removed sub-views) cover the view. This means that the discounted view is equivalent to the null view (which is defined here to evaluate empty). We present the complexity of deciding the cover question in Section 2. In Section 3, we show that there are natural cases when the view can always be rewritten into a simpler form. This is true whenever an unfolding-to-discount is simple. In such cases, the rewrite is always an algebraic optimization. In Section 4, we consider the general case of rewriting a view in an algebraically absolute optimal way. We show that the complexity of a sub-case of this task (a sub-class of fairly simple views) is NP-complete over the size of the view's AND/OR tree. We show the general problem is even harder. In Section 5, we explore approximate optimality. We motivate a rewrite algorithm that produces a disassembled view (equivalent semantically to the discounted view expression) for which the complexity is over the number of unfoldings-to-discount, and not over the size of the view's AND/OR tree. Hence, this approach is tractable, in general, and can result in rewrites that are reasonably compact.

The work most closely related to view disassembly is [11]. The authors consider queries that involve nested union operations, and propose a technique for rewriting such queries when it is known that some of the joins evaluated as part of the query are empty. The technique in [11] applies, however, only to a class of simple queries, and no complexity issues are addressed.

Another research area related to view disassembly is multiple query optimization (MQO) [14]. The goal in multiple query optimization is to optimize batch evaluation of a collection of queries, rather than just a single query. Since the queries in the collection are arbitrarily related, they may not be related in any structured way. The techniques developed for MQO attempt to find and reuse common sub-expressions from the collection of queries, and are heuristics-based. We do not expect that the MQO techniques could result in the rewrites we propose in this paper. We can exploit the fact that our rewrites involve unfoldings that all come from the same view.

The problem of query tree rewrites for the purpose of optimization has also been considered in the context of deductive databases with recursion. In [9], the problem of detecting and eliminating redundant subgoal occurrences in proof trees generated by programs in the presence of functional dependencies is discussed. In [10], the residue method of [1] is extended to recursive queries.

In [5], we introduced a framework which we call intensional query optimization, which enables rewrites to be applied to non-conjunctive queries and views (ones which involve union). An initial discussion of complexity issues and possible algorithmic solutions appeared in [7]. In [6], we present an algorithm which incorporates unfolding removal into the query evaluation procedure. Hence this method is not an explicit query rewrite.

Our work in view disassembly is naturally related to all work on view and query rewrites. However, almost all work in view rewrites strives to find views that are equivalent semantically to the original. View disassembly does not. Rather, we are using rewrites in order to remove implicitly components (unfoldings) from the original view. This is a much different goal than that of previous view rewrite work, and so this requires a different treatment. Aside from the work listed above, we are not aware of any work on view rewrites that bears directly on view disassembly.

2 Discounting and Covers

We define a view (and, likewise, a query) to be a set of atoms. For instance, $\{v_1, v_2, v_3\}$ represents the query/view in Example 1 (footnote 4). Some of the atoms may be intensional; that is, they are written with view predicates defined over base table predicates and, perhaps, other views.

Footnote 4: We ignore the ordering of the atoms in the query, without loss of generality. We also only present "propositional" examples for simplicity's sake. It ought to be clear how the examples and techniques apply when the view's variables are explicit.

We provide a formal definition for an unfolding of a query.

Definition 2. Given query sets $Q$ and $U$, call $U$ a 1-step unfolding of query set $Q$ with respect to database DB iff, given some $q_i \in Q$ and a rule $\langle a \leftarrow b_1, \ldots, b_n. \rangle$ in IDB such that $q_i\theta = a\theta$ (for most general unifier $\theta$ [13]), then

$$U = \left( (Q - \{q_i\}) \cup \{b_1, \ldots, b_n\} \right)\theta$$

Denote this by $U \preceq_1 Q$. Call $U_1$ simply an unfolding of $Q$, written as $U_1 \preceq Q$, iff there is some finite collection of query sets $U_1, \ldots, U_k$ such that $U_1 \preceq_1 \cdots \preceq_1 U_k \preceq_1 Q$. An unfolding $U$ is called extensional iff, for every $q_i \in U$, atom $q_i$ is written with an extensional predicate (so is a base table). Call the unfolding intensional otherwise (in other words, a view definition).

One of the 1-step unfoldings of the query in Example 1 is $\{v_1, v_2, u\}$. One of the extensional unfoldings of the query is $\{r, t, w\}$.

It is easy to see how an unfolding's AND/OR tree can be "inscribed" in the view's AND/OR tree. (The atoms of the unfolding can be marked in the view's tree as shown in Figure 3 of Example 3.) This mapping may be ambiguous if there is repetition of atoms in the AND/OR tree. We assume for the sake of simplicity, and without loss of generality, that this mapping is never ambiguous, and that it is always clear how an unfolding's AND/OR tree is inscribed within the view's (footnote 5).

Footnote 5: Note that an extensional unfolding's AND/OR tree, as inscribed in the view's AND/OR tree, is simply an AND-tree. No unions remain, as the extensional unfolding is equivalent to a join of its atoms.

Let $Q$ be a view and $U_1, \ldots, U_k$ be unfoldings of $Q$. We define the notation $Q \setminus \{U_1, \ldots, U_k\}$ to be a discounted view of $Q$ with unfoldings-to-discount $U_1, \ldots, U_k$. Define $\mathit{unfolds}(Q)$ to be the set of all extensional unfoldings of $Q$. The meaning of $Q \setminus \{U_1, \ldots, U_k\}$ is intended to be:

$$\mathit{unfolds}(Q) - \bigcup_{i=1}^{k} \mathit{unfolds}(U_i)$$

(Note that our assumption of an unambiguous mapping of unfoldings to AND/OR trees means that any syntactically equivalent unfoldings of $U_i$ and $U_j$ are, indeed, the same unfolding.)

The discounted view notation is a convenience. Of course, it is not a usable view definition in the standard form. We seek how to rewrite the original view into a view equivalent in meaning to the discounted view. Clearly, the definition for the meaning of the discounted view provides one such rewrite: take the union of all the extensional unfoldings of the view minus all those of every unfolding-to-discount. However, this rewrite is not compact; in fact, there can be an exponential number of extensional unfoldings with respect to the size of the view's AND/OR tree. We are interested only in compact rewrites.

The first case we ought to consider is when the set of extensional unfoldings entailed by the discounted view is empty. In such a case, we say that the unfoldings-to-discount cover the view (or query). The degenerate case is $Q \setminus \{Q\}$. At the opposite end of the spectrum is $Q \setminus \mathit{unfolds}(Q)$. When a discounted view is covered, the most succinct disassembled view is the null view, which we define to evaluate to the empty answer set. Thus, we are interested in how to test when a discounted view is covered. As it happens, there are interesting, and unobvious, cases of discounted views which turn out to be covered. Furthermore, cover detection is computationally hard.

[Figure 3: Cover of the AND/OR tree in Example 3. The tree of Figure 1, with each node marked by the indices of the unfoldings $U_1$, $U_2$, $U_3$ in which it appears.]

Example 3. Consider the following unfoldings from Q in Example 1:

$U_1 : \{v_1, s, v_3\}$,
$U_2 : \{p, v_2, v_3\}$,
$U_3 : \{r, t, v_3\}$.

Figure 3 shows $U_1$, $U_2$, and $U_3$ marked in the tree from Figure 1 of Example 1. Since $\mathit{unfolds}(Q) \subseteq \bigcup_{i=1}^{3} \mathit{unfolds}(U_i)$, the set $\{U_1, U_2, U_3\}$ is a cover of $Q$.

We establish that determining that a discounted view is covered is coNP-complete over the number of unfoldings-to-discount. For the set-theoretic version, the input is the view's AND/OR tree and the trees of the unfoldings-to-discount.
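
The set-theoretic meaning of a discounted view, and the cover test of Example 3, can be checked by brute force on small inputs. The following Python sketch is an illustration, not the paper's procedure: the names `IDB`, `unfolds`, `discounted`, and `is_cover` are introduced here, atoms are propositional strings, and the enumeration is exponential in general, which is consistent with the hardness result stated above rather than a way around it.

```python
from itertools import product

# Nonrecursive IDB of Example 1: predicate -> list of rule bodies.
IDB = {
    "v1": [("p",), ("r",)],
    "v2": [("s",), ("t",)],
    "v3": [("u",), ("w",)],
}


def unfolds(atoms, idb):
    """Return the extensional unfoldings of a set of atoms as frozensets."""
    per_atom = []
    for atom in atoms:
        if atom in idb:  # intensional: try every rule body
            alts = set()
            for body in idb[atom]:
                alts |= unfolds(body, idb)
            per_atom.append(alts)
        else:  # extensional: the atom stands for itself
            per_atom.append({frozenset([atom])})
    return {frozenset().union(*combo) for combo in product(*per_atom)}


def discounted(query, discounts, idb):
    """unfolds(Q) minus the extensional unfoldings of every U_i."""
    removed = set()
    for u in discounts:
        removed |= unfolds(u, idb)
    return unfolds(query, idb) - removed


def is_cover(query, discounts, idb):
    """True when nothing of the view survives the discounting."""
    return not discounted(query, discounts, idb)


Q = ("v1", "v2", "v3")
U1, U2, U3 = ("v1", "s", "v3"), ("p", "v2", "v3"), ("r", "t", "v3")

print(is_cover(Q, [U1, U2, U3], IDB))  # prints True
print(sorted(map(sorted, discounted(Q, [U1, U2], IDB))))
```

With only $U_1$ and $U_2$ discounted, the two extensional unfoldings containing both $r$ and $t$ remain, which is exactly what $U_3$ removes.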

Definition 4. A discounted view instance V is a pair of an AND/OR tree and a list of AND/OR trees
which are clearly inscribable in the initial. Define COV as the subset of all discounted view instances that are covered.

Theorem 5. COV is coNP-complete.
Proof. By reduction from 3-SAT. (See Appendix for details.) □

It is perhaps a surprising result that the complexity of deciding the cover question for discounted views is dictated by the number of unfoldings-to-discount, and not by the size of the view. This is still bad news whenever one has many unfoldings-to-discount to consider. The good news is that the intractability is independent of the view definition's complexity. Furthermore, the news for cover detection seems good, in practice. Often, the number of unfoldings being considered is manageably small, so the cover check is tractable. In addition, we have empirically observed [8] that the average case for the cover check appears tractable even for significantly more unfoldings-to-discount, even though the worst case is known to be intractable.

Thus, the first step in view disassembly should be to check whether the discounted view is covered. We investigate next what can be done when it is not.

3 Simple Unfoldings

A disassembled view may cost more to evaluate than the original view. A degenerate case is, of course, the case of the disassembled view that is the collection of all the extensional unfoldings. In general, it cannot be guaranteed that the AND/OR tree for a best disassembled view would be more compact (hence would require fewer operations to evaluate) than the original view. (Example 1 demonstrated this.) In this section, we define a type of unfolding for which discounting is guaranteed to produce an AND/OR tree that is more compact than the AND/OR tree of the original view. We call such unfoldings simple. Let us define first the concept of a choice point atom.

Definition 6. Let $Q = \{q_1, \ldots, q_n\}$ be a view. $q_i$ is called a choice point atom iff there is more than one rule $\langle a \leftarrow b_1, \ldots, b_n. \rangle$ such that $q_i\theta = a\theta$, for most general unifier $\theta$.

All atoms in query Q of Example 1 (that is, $v_1$, $v_2$, and $v_3$) are choice point atoms.

Definition 7. Let $Q$ be a query and $U_0$ be an unfolding of $Q$ such that $U_0 \preceq_1 U_1 \preceq_1 \cdots \preceq_1 U_k = Q$. Then, $U_0$ is a simple unfolding of $Q$ iff, for all $i \le k$, the set $U_i - U_0$ contains at most one choice point atom.

Example 8. Let the query Q be as in Example 1, but now assume that, in addition to the rules for $v_1$, $v_2$, $v_3$, we also have the following rules:

$$p \leftarrow p_1, p_2. \qquad p_1 \leftarrow p_{1,1}. \qquad p_1 \leftarrow p_{1,2}.$$

Consider the following unfoldings of Q:

$U_1 : \{p, v_2, v_3\}$
$U_2 : \{p_{1,1}, p_2, v_2, v_3\}$
$U_3 : \{p, s, v_3\}$

Unfolding $U_1$ is simple, since $U_1 \preceq_1 Q$ and $Q - U_1 = \{v_1\}$. Unfolding $U_2$ is simple, since $U_2 \preceq_1 U_4 \preceq_1 U_1 \preceq_1 Q$, for which $U_4 = \{p_1, p_2, v_2, v_3\}$, and we know $U_4 - U_2 = \{p_1\}$, $U_1 - U_2 = \{p\}$, and $Q - U_2 = \{v_1\}$. Unfolding $U_3$ is not simple, however, because $Q - U_3 = \{v_1, v_2\}$, and both are choice point atoms.

It is easy to show that all 1-step unfoldings are simple. (Unfolding $U_1$ in Example 8 is a 1-step unfolding.) (footnote 6)

Footnote 6: Integrity constraints consisting of two atoms which subsumed such unfoldings were called semi-complete join pairs in [11].

We can now state an optimization theorem.

Theorem 9. Let $Q$ be a query and $U_0$ be a simple unfolding of $Q$. Then, an AND/OR tree for $Q \setminus \{U_0\}$ can be produced by removing one or more nodes from the AND/OR tree for $Q$.
Proof. (See Appendix.)

Consider unfolding $U_1$ in Example 8. Discounting $U_1$ from $Q$ is equivalent to pruning the node $p$ from the query tree in Figure 1.

Simple unfoldings are ideal when the goal of disassembly is optimization (that is, the discounted query or view must cost less to evaluate than the original). They are easy to detect (by definition) and remove (as shown for $U_1$ above), and the disassembled view is guaranteed to cost less to evaluate than the original view. We note also that a collection of unfoldings of Q, none of which is simple itself, may imply (that is, cover) an unfolding of Q which is simple. Consider a non-simple unfolding
$U_3$ of Example 8, and another non-simple unfolding $U_5 : \{r, s, v_3\}$. Unfoldings $U_3$ and $U_5$ together cover unfolding $U_6 = \{v_1, s, v_3\}$, which is simple. We address the problem of merging non-simple unfoldings into simple ones in Section 5.

It can be shown that the class of simple unfoldings is the only type of unfolding that guarantees the existence of a disassembled view that has fewer operations than the view.

4 AND/OR Tree Minimization

When the unfoldings-to-discount are not simple, the problem of finding the most compact of all AND/OR trees representing the disassembled view is intractable. In this section, we consider a view which is a two-way join over unions of base tables. In a sense, this is the simplest view for which the minimization problem is nontrivial; that is, this is the least complex view that contains non-simple unfoldings.

Thus, let the view Q be:

$$q \leftarrow a, b.$$

in which $a$ and $b$ are defined as follows:

$$a \leftarrow a_1. \quad \cdots \quad a \leftarrow a_n. \qquad\qquad b \leftarrow b_1. \quad \cdots \quad b \leftarrow b_n.$$

Define the set of unfoldings-to-discount as: $U_1 = \{a_{i_1}, b_{j_1}\}, \ldots, U_l = \{a_{i_l}, b_{j_l}\}$, for $i_k, j_k \in \{1, \ldots, n\}$, $1 \le k \le l$. Assume, without loss of generality, that for every $S \subseteq \{a_1, \ldots, a_n\}$, there is an atom (with the intensional predicate) $A_S$ defined as: $A_S \leftarrow a_k$ for all $a_k \in S$. We call the collection of all such atoms $\mathcal{A}$; that is,

$$\mathcal{A} = \{A_S \mid S \subseteq \{a_1, \ldots, a_n\} \text{ and } A_S \leftarrow a_k, \text{ for all } a_k \in S\}$$

Similarly define $B_R$ for all subsets $R$ of $\{b_1, \ldots, b_n\}$.

Then, the discounted query can be evaluated as a union of joins over atoms $A_S$ and $B_R$. We are interested in the maximal pairs of $A_S$'s and $B_R$'s such that none of the unfoldings-to-discount can be found in their cross. (By maximal, it is meant that no super-set of the chosen $A_S$ or no super-set of the $B_R$ chosen would also have this property.) That is, given such a pair $A_S$ and $B_R$, $a_{i_k} \notin A_S$ or $b_{j_k} \notin B_R$, for $1 \le k \le l$. Let $\mathcal{C}$ be the collection of all such maximal, consistent pairs of $A_S$'s and $B_R$'s.

$$\mathit{unfolds}(Q \setminus \{U_1, \ldots, U_l\}) = \bigcup_{\langle A_S, B_R \rangle \in \mathcal{C}} \mathit{unfolds}(\{A_S, B_R\})$$

Let $t$ be the cardinality of the collection $\mathcal{C}$. This number represents the number of trees that are needed to evaluate the discounted query. Let $W_i$, $1 \le i \le t$, be a query (unfolding) represented by each such tree and $op(W_i)$ be the total number of operations (that is, a single join and all unions) required to evaluate $W_i$ (footnote 7). Then,

$$op\Big(\bigcup_{i=1}^{t} W_i\Big) = \Big(\sum_{i=1}^{t} op(W_i)\Big) + t - 1.$$

We also require, without loss of generality, that the $W_i$'s do not overlap; that is, there does not exist an unfolding $U$ such that $U \preceq W_i$, $U \preceq W_j$, and $i \ne j$.

Footnote 7: It is easy to show that the optimal tree for the two-way join view constructed above must be a union of two level trees also. This means that this is a legitimate sub-class of the absolute optimization problem for view trees, thus the general problem is at least NP-hard.

We can now state the problem of Minimization of Discounted Query as follows.

Definition 10. Define the class Minimization of Discounted Query (MDQ) as follows. An instance is the triplet of a query set $Q$, a collection of unfoldings-to-discount $\mathcal{U}$, and a positive integer $K$. An instance belongs to MDQ iff there is a collection of unfoldings $W_1, \ldots, W_t$ defined as above, such that $op\big(\bigcup_{i=1}^{t} W_i\big) \le K$.

Theorem 11. Minimization of Discounted Query (MDQ) is NP-complete.
Proof. By reduction from a known NP-hard problem, minimum order partition into bipartite cliques (MOP) [3]. □

The NP-completeness result can be trivially generalized to the case where $a$ and $b$ in the query Q are defined through different numbers of rules (see Theorem 15 in the Appendix).

The minimization of more complex queries does not remain NP-complete. Consider a query $Q = \{p_1, \ldots, p_n\}$, where each of the $p_i$'s is defined through multiple rules as: $\langle p_i \leftarrow p_i^j. \rangle$ for $1 \le j \le k_i$. Since the number of extensional unfoldings for this query is exponential in the size of the original query, verifying the solution cannot be, in general, done in polynomial
time. It can be shown, however, that minimization of an arbitrarily complex query is in the class $\Sigma_2^p$.

Theorem 12. Let Q be an arbitrarily complex query. Then, minimization of Q is in $\Sigma_2^p$.
Proof. (See Appendix.) □

We conjecture that it may be complete in this class.

5 Approximation Solutions

In the previous section, we have shown that to find an algebraic rewrite for view disassembly which optimizes absolutely the number of algebraic operations (that is, the size of the AND/OR tree) is intractable. In this section, we investigate an approximation approach. The premise of the approach is to not rewrite the original view's AND/OR tree, but rather to find a collection of unfoldings of the view which "complete" the cover with respect to the unfoldings-to-discount.

This collection, call it $\mathcal{C}$, should have the following properties. Let $\mathcal{N}$ be the set of unfoldings-to-discount.

1. $\mathcal{N} \cup \mathcal{C}$ should be a cover of the view; that is, any extensional unfolding of the view is also an unfolding of some unfolding in $\mathcal{N}$ or $\mathcal{C}$.
2. No two unfoldings in $\mathcal{C}$ should overlap; that is, for $U, V \in \mathcal{C}$ ($U \ne V$), $U$ and $V$ have no unfolding in common. (Call $U$ and $V$ pair-wise independent in this case.)
3. Set $\mathcal{C}$ should be most general:
   a. no unfolding in $\mathcal{C}$ can be refolded at all, and still preserve the above properties; and
   b. for any $U \in \mathcal{C}$, $(\mathcal{N} \cup \mathcal{C}) - \{U\}$ is not a cover of the view.

We present an algorithm to accomplish such a rewrite called the unfold/refold algorithm (Algorithm 1). It works as follows. First, find an extensional unfolding which is not subsumed by any of the unfoldings-to-discount or any unfolding in the set $\mathcal{C}$ (that is, unfoldings generated so far). The routine new_unfolding performs that step. (This is the reverse of determining cover. Thus, the difficulty depends on the size of the collection of the unfoldings-to-discount.) Second, refold the unfolding (that is, find a super-unfolding) such that the super-unfolding does not subsume any of the unfoldings-to-discount. This is performed by the routine refolding (footnote 8).

    C := {}
    while new_unfolding(Q, N ∪ C, U)
        V := refolding(U, N, C)
        C := C ∪ {V}
    return parsimonious(C)

Algorithm 1: Unfold/refold algorithm for view disassembly.

Footnote 8: A version of the unfold/refold algorithm is implemented in the Carmin prototype [8]. It optimally performs the unfold step (hence, the covers test).

The entire procedure is then repeated until the unfold step fails; that is, there is no such extensional unfolding, meaning that a cover has been established. On subsequent cycles, during the refold procedure, the unfolding is refolded only so far as it does not overlap with any unfoldings already in $\mathcal{C}$ (condition 2 from above). In the end, parsimonious ensures that the unfolding collection $\mathcal{C}$ returned is minimal; that is, no member unfolding can be thrown away.

At completion, the union of the unfoldings produced is a disassembled view. It is equivalent semantically to the discounted view. The set of the unfoldings unioned with the set of the unfoldings-to-discount is equivalent semantically to the original view.

Example 13. Consider again Example 1. The algorithm is initialized with $\mathcal{C} := \{\}$. Assume that the first extensional unfolding to consider is $V = \{r, s, u\}$. Refolding $V$, we arrive at an unfolding $\{r, v_2, v_3\}$. The next extensional unfolding that does not overlap either with $\{r, v_2, v_3\}$ or with the unfolding-to-discount (which is $\{p, s, u\}$ in this case), is $V = \{p, s, w\}$. Refolding it produces an unfolding $\{p, v_2, w\}$ (which is pair-wise independent with $\{r, v_2, v_3\}$ in $\mathcal{C}$ from before). The last remaining extensional unfolding to consider is $V = \{p, t, u\}$. This one cannot be refolded any further. The AND/OR query tree representing the most-general unfoldings is shown in Figure 4.

Note that the resulting rewrite in the example has eleven nodes (algebraic operations) compared with nine nodes of the tree in Example 1 which represented the most compact rewrite.

The run-time complexity of the unfold/refold algorithm is dictated by the unfold step of each cycle. This depends on the size of the collection of unfoldings generated so far plus the number of unfoldings-to-discount. While this collection is small, the algorithm is tractable. Only when the collection has
grown large does the algorithm tend towards intractability. An advantage of the approach is that a threshold can be set, beyond which the rewrite computation is abandoned. On average, we expect the final cover not to be large.

[Figure 4: The result of the unfold/refold algorithm applied to the query and the unfolding-to-discount of Example 1.]

A variation on the unfold/refold algorithm can be used to find the collection of most general unfoldings that are covered by the unfoldings-to-discount. By most general, it is meant that no super-unfolding of those found is covered. This is a generalization of the check for cover discussed in Section 2. In the extreme case, if only a single most-general unfolding is found (that is, the view itself), then the discounted view itself is covered.

Ideally, we would always convert the set of unfoldings-to-discount into this most-general form. This most-general collection of unfoldings-to-discount is guaranteed always to be smaller than (or the same size as, at worst) the initial collection. Thus, it is a better input to the unfold/refold algorithm for view disassembly.

These techniques avail us many tools for rewriting views and queries for a number of purposes. For instance, by finding the most-general unfoldings-to-discount, one also identifies all the most-general simple unfoldings that are entailed by the unfoldings-to-discount. They are just the simple unfoldings that appear in the collection. For view or query optimization, the simple unfoldings can be pruned from the AND/OR tree, resulting in a smaller, simpler tree to evaluate (as shown in Section 3). If we "remove" only the simple unfoldings, but not the others, we are not evaluating the discounted view, but something in-between the view and the discounted view. However, when our goal is optimization, this is acceptable.

Example 14. Consider removing the following six (extensional) unfoldings from the AND/OR tree in Figure 1: $\{p, s, u\}$, $\{p, s, w\}$, $\{p, t, u\}$, $\{p, t, w\}$, $\{r, s, u\}$, and $\{r, s, w\}$. The most-general unfoldings implied are: $\{v_1, s, v_3\}$ and $\{p, v_2, v_3\}$. Both of these are simple unfoldings, so the original tree can be pruned.

6 Conclusions

Current database environments, such as mediation over heterogeneous data sources, necessitate complex views. View definitions may be nested (defined with other views) and may employ union. In these environments, it is important to be able to maintain views in the presence of other views, and to take advantage of cached queries, materialized views, and semantic constraints.

We have defined the notion of a discounted view, which is conceptually the view with some of its sub-views (unfoldings) "removed". In this paper, we have explored how to rewrite effectively the view into a form equivalent to a discounted view expression, thus "removing" the unfoldings-to-discount. We called such a rewrite a disassembled view. Disassembled views can be used for optimization, data security, and streamlining the query/answer cycle (by helping to eliminate answers already seen). Disassembled views can also be used to normalize collections of views by removing overlapping parts of the views' definitions. Thus disassembled view technology may play a useful role in data warehousing and other view intensive environments.

View disassembly, as most forms of view and query rewrites, can be computationally hard. We showed that finding optimal view disassembly rewrites is at least NP-hard. However, effective disassembled views can be found which are not necessarily algebraically optimal, but are compact. We explored an approximation approach via an unfold/refold algorithm which can result in compact disassembled views. The complexity of the algorithm is dictated by the number of unfoldings-to-discount, and not by the complexity of the view definition to be disassembled. Thus, there are effective tools for view disassembly.

We also have identified a class of unfoldings, called simple unfoldings, which can be easily removed from the view definition to result in a simpler view definition. This offers a powerful tool for semantic optimization of views. Furthermore, with use of the unfold/refold algorithm, we can find all the simple unfoldings implied by a collection of unfoldings-to-discount. In this paper, we also establish how we can

infer when a collection of unfoldings-to-discount covers the original view, meaning that the discounted view is void. This result has general application, and is fundamental to determining when a view is subsumed by a collection of views.
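As an illustration of the cover condition described above, the following is a minimal brute-force sketch, not the procedure of this paper. It assumes a flat view that is a join of unions over distinct atoms, represents each unfolding as a set of atoms, and treats an extensional unfolding as covered when some unfolding-to-discount is a subset of it. The names `extensional_unfoldings` and `covers` are hypothetical, and the enumeration is exponential in the number of unions.

```python
from itertools import product


def extensional_unfoldings(view):
    """Yield each extensional unfolding: one atom picked from every union.

    `view` is a list of unions (tuples of atom names) that are joined together.
    Assumption of this sketch: atom names are distinct across unions.
    """
    for choice in product(*view):
        yield frozenset(choice)


def covers(view, discounts):
    """Return True if every extensional unfolding of `view` is an unfolding
    of at least one element of `discounts` (each given as a set of atoms)."""
    return all(
        any(set(d) <= e for d in discounts)
        for e in extensional_unfoldings(view)
    )


# A small two-union view: (u1 or u2 or u3) joined with (v1 or v2 or v3).
view = [("u1", "u2", "u3"), ("v1", "v2", "v3")]

# Discounting a single extensional unfolding leaves eight others uncovered.
partial = covers(view, [{"u3", "v3"}])
# Discounting every choice on the u side covers the whole view.
total = covers(view, [{"u1"}, {"u2"}, {"u3"}])
print(partial, total)
```

When `covers` returns True, the discounted view is void in the sense used above. Nested view definitions would need a recursive enumeration that this sketch does not attempt.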

There is much practical work to be done on view disassembly. This includes the following.

• Improvements can be made to the unfold/refold algorithm. There are natural cases in which the unfold/refold rewrite would result in an exponential number of unfoldings. As we showed, the procedure can be curtailed in such cases. However, extensions to unfold/refold could cover more cases, allowing for compact representation when the simple unfold/refold algorithm presented here would not.

• There exist cases in which, if the original view tree were syntactically rewritten in a semantically preserving way, a candidate unfolding could be removed easily, but could not be "easily" removed with respect to the original tree. We should study how view disassembly could be combined effectively with other view rewrite procedures.

• A yet better understanding of the profile of view disassembly complexity, also with respect to other view rewrite techniques, would allow us to build perhaps better view disassembly algorithms. Empirical use of the unfold/refold algorithm, say in data warehousing environments, might also provide more insight.

References

[1] U. Chakravarthy, J. Grant, and J. Minker. Logic-based approach to semantic query optimization. ACM Transactions on Database Systems, 15(2):162-207, June 1990.

[2] S. Dar, M. Franklin, B. Jonsson, D. Srivastava, and M. Tan. Semantic data caching and replacement. In Proceedings of 22nd VLDB, pages 330-341, 1996.

[3] T. Feder and R. Motwani. Clique partitions, graph compressions and speeding-up algorithms. In Proceedings of the ACM Symposium on Theory of Computing, pages 123-133, 1991.

[4] Michael R. Garey and David S. Johnson. Computers and Intractability: a Guide to the Theory of NP-Completeness. A Series of Books in the Mathematical Sciences. W. H. Freeman and Company, New York, 1979.

[5] P. Godfrey and J. Gryz. A framework for intensional query optimization. In Proceedings of the Workshop on Deductive Databases and Logic Programming at JICSLP'96, pages 57-68, Bonn, Germany, September 1996.

[6] P. Godfrey and J. Gryz. Overview of dynamic query evaluation in intensional query optimization. In Proceedings of 5th DOOD, Montreux, Switzerland, December 1997. (To appear).

[7] P. Godfrey, J. Gryz, and J. Minker. Semantic query optimization for bottom-up evaluation. In Z. Ras and M. Michalewicz, editors, Proc. of the 9th ISMIS, pages 561-571, Zakopane, Poland, June 1996.

[8] Parke Godfrey. An Architecture and Implementation for a Cooperative Database System. PhD thesis, University of Maryland at College Park, College Park, Maryland 20742, 1997. In progress.

[9] L. V. S. Lakshmanan and H. J. Hernandez. Structural query optimization - a uniform framework for semantic query optimization in deductive databases. In Proc. PODS, pages 102-114, 1991.

[10] L. V. S. Lakshmanan and R. Missaoui. Pushing semantics inside recursion: A general framework for semantic optimization of recursive queries. In Proc. ICDE, pages 211-220, 1995.

[11] S. Lee, L. J. Henschen, and G. Z. Qadah. Semantic query reformulation in deductive databases. In Proc. IEEE International Conference on Data Engineering, pages 232-239. IEEE Computer Society Press, 1991.

[12] A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views. In Proc. PODS, pages 95-104, 1995.

[13] J. W. Lloyd. Foundations of Logic Programming. Springer-Verlag, second edition, 1987.

[14] T. Sellis and S. Ghosh. On the multiple-query optimization problem. TKDE, 2(2):262-266, June 1990.

[15] B. Thuraisingham and W. Ford. Security constraint processing in a multilevel secure distributed database management system. IEEE Transactions on Knowledge and Data Engineering, 7(2):274-293, April 1995.

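Figure 5: AND/OR tree representing a CNF propositional theory. [Figure not reproduced. Recoverable node labels: clause nodes $C_1, \ldots, C_k$, with leaves $v_{1,1}, \ldots, v_{1,3}$ for $C_1$ and $v_{k,1}, \ldots, v_{k,3}$ for $C_k$.]

Figure 6: Unfolding tree representing a contradiction in the CNF propositional theory. [Figure not reproduced. Recoverable node labels: clause nodes $C_1, \ldots, C_i, \ldots, C_j, \ldots, C_k$, with the leaves of Figure 5 for $C_1$ and $C_k$ and a leaf labeled $p$ for each of $C_i$ and $C_j$.]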

Appendix

Proof. [Theorem 5] COV is in NP. A witness that $V \in \mathrm{COV}$ is a complete AND/OR tree (that is, an AND tree) inscribable in the initial AND/OR tree (the view), but which cannot be inscribed in any AND/OR tree in the list. (This means that the extensional unfolding $V$ represents is not an extensional unfolding of any of the unfoldings-to-discount.) Tree $V$ is bounded by the size of the input, and the conditions above are polynomial in the size of the input to verify.

A reduction of 3-SAT to COV. Consider any 3-CNF propositional theory $T$, a 3-SAT instance candidate, restricted, without loss of generality, such that no propositional variable occurs more than three times in it [4]. Transform the 3-SAT instance into an AND/OR tree as follows. Let $C_1, \ldots, C_k$ represent the clauses of $T$. Let $v_{i,1}, \ldots, v_{i,3}$ represent the occurrences of the propositional variables (positive or negative occurrences) in clause $C_i$. Build an AND/OR tree as in Figure 5. Construct the list of AND/OR trees to discount as follows. For each pair of occurrences of a propositional variable $p$ and its negation $\neg p$, construct an AND/OR tree as follows. Assume $p$ occurs in $C_i$ and $\neg p$ in $C_j$, and $i \neq j$. Build the tree in Figure 6. There are at most $2k$ such trees. These represent the "contradictions".

Ask if there exists an extensional AND/OR tree of Figure 5 that is not inscribable in any of the trees as in Figure 6. There is, if and only if $T$ is in 3-SAT (that is, satisfiable). Such an AND/OR tree would not include both $p$ and $\neg p$, for any propositional variable $p$. It would, however, include a propositional variable from each clause $C_i$. The collection of these propositional variables represents a model of the propositional theory $T$. Thus, this is a witness that the propositional theory $T$ is in 3-SAT. Otherwise, if there is no such tree, then there is no model, and $T$ is not in 3-SAT. □

Proof. [Theorem 9] Let $U_0, U_1, \ldots, U_n = Q$ be a sequence of unfolding steps. Let $V_i \in U_i$ be an atom whose unfolding produces $U_{i-1}$, through the rule $R^i_0 : \langle A_i \leftarrow B_1, \ldots, B_n. \rangle$, with a most general unifier $\theta$, such that $V_i\theta = A_i\theta$. Then, $U_{i-1} = U_i - \{V_i\} \cup \{B_1, \ldots, B_n\}$. Clearly, there can be more than one rule through which $V_i$ can be unfolded.

There are two cases.

1. For each $i$, $1 \leq i \leq n$, $R^i_0$ is the only rule with the head $A_i$ such that there exists a $\theta$ with $V_i\theta = A_i\theta$, for which $V_i \in U_i$. Then, $\mathit{unfolds}(Q) = \mathit{unfolds}(U_{n-1}) = \cdots = \mathit{unfolds}(U_0)$. Thus $Q \setminus \{U_0\}$ is null and the query requires no operations to evaluate.

2. Let $U_i$ be the last unfolding (that is, the unfolding with the lowest index) in the sequence $U_0, \ldots, U_i, \ldots, U_n = Q$ for which there is more than one rule through which $V_i$ can be evaluated; that is, there exist $R^i_0, \ldots, R^i_m$ such that $R^i_j : \langle A^i_j \leftarrow B_{i1}, \ldots, B_{ik} \rangle$. Then, $U_i$ is equivalent to the union of unfoldings $W_0, \ldots, W_m$ where $W_j = U_i - \{V_i\} \cup \{B_{i1}, \ldots, B_{ik}\}$, $1 \leq j \leq m$. Note that replacing one of the $W_j$'s, say $W_l$, with its definition produces $U_{i-1}$. Also, since each of the subsequent unfolding steps between $U_{i-1}$ and $U_0$ involves a single rule, we have (by case 1 from above):

$$\mathit{unfolds}(Q \setminus \{U_0\}) = \mathit{unfolds}(Q \setminus \{U_{i-1}\}) = \mathit{unfolds}(Q \setminus \{W_l\})$$

Hence, ignoring rule $R^i_l$ in the unfolding of $V_i$ is equivalent to discounting unfolding $U_0$ from the query. □

Proof. [Theorem 11] It can be shown that MDQ is NP-hard by reducing a known NP-hard problem, minimum order partition into bipartite cliques (MOP) [3], to it. MOP can be defined as follows:


Let $G(U, V, E)$ be a bipartite graph with the vertex sets $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ and the edge set $E$. A bipartite clique in $G$ is a complete bipartite subgraph, and its order is the number of vertices in it. A clique partition for $G$ is a collection of bipartite cliques $C = \{C_1, \ldots, C_l\}$ such that the edge sets $E(C_1), \ldots, E(C_l)$ form a partition of the edge set $E$. The order of a collection of bipartite cliques, $or(C)$, is the sum of the orders of the individual cliques. The MOP problem can be stated as follows:

Minimum Order Partition (MOP)

Instance: A bipartite graph $G(U, V, E)$ with the vertex sets $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ and the edge set $E$, and a positive integer $K$.

Question: Is there a collection of bipartite cliques $C = \{C_1, \ldots, C_l\}$ that partitions $G$ such that $or(C) \leq K$?

The reduction is a straightforward mapping of atoms of a query to vertices of the graph, as in the following example. Consider the bipartite graph shown in Figure 7. The graph can be partitioned into two bipartite cliques, $(u_1, u_2, u_3, v_1, v_2)$ and $(u_1, u_2, v_3)$, marked with broken and dotted lines respectively. It is easy to see that this partition is minimal. Now consider a query $Q = (u_1 \cup u_2 \cup u_3) \Join (v_1 \cup v_2 \cup v_3)$, where $u_1, u_2, u_3, v_1, v_2, v_3$ represent atoms. Assume that there is one unfolding-to-discount $U = \{u_3, v_3\}$. Note that all extensional unfoldings of the query can be represented as edges in the graph, where the missing one, $\langle u_3, v_3 \rangle$, represents the unfolding $U$. Also, partitioning the graph into cliques is equivalent to clustering the set of all extensional unfoldings into subsets such that they do not overlap and, when unioned, are equivalent to the discounted query. We show in detail in the proof below that minimizing the number of vertices in the cliques is equivalent to minimizing the number of operations in the discounted query.

Figure 7: The query tree representation of the original query. [Figure not reproduced. Recoverable node labels: $u_1, u_2, u_3$ and $v_1, v_2, v_3$.]

It is easy to see that $\mathrm{MDQ} \in \mathrm{NP}$, since a nondeterministic algorithm need only guess a collection of $W$'s and check in polynomial time whether that collection can be evaluated with fewer than $K$ operations. Note that in the worst case the collection of $W$'s is of size $n^2$ (such a collection represents the number of extensional unfoldings of $Q$).

We transform MOP (minimum order partition of a graph into bipartite cliques), which has been shown to be NP-complete [3], to MDQ. Let $G(U, V, E)$ and the positive integer $K$ be an instance of MOP. Let $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ be the sets of vertices on the two sides of the bipartite graph. Let each of the vertices represent an atom of a query defined as $Q = (u_1 \cup \cdots \cup u_n) \Join (v_1 \cup \cdots \cup v_n)$. We also define a collection of unfoldings-to-discount, $U_1, \ldots, U_l$, as follows: $U_i = \{u_{i_1}, v_{i_k}\}$ iff $(u_{i_1}, v_{i_k}) \notin E$; that is, unfoldings-to-discount represent the missing edges between the two sides of the bipartite graph. Then, the remaining edges of the graph represent all extensional unfoldings of the query $Q \setminus \{U_1, \ldots, U_l\}$. It is easy to see that each of the cliques $C_i$, $1 \leq i \leq l$, in $C$ represents an unfolding $W_i$. Since no two cliques share edges, no two unfoldings representing them could share extensional unfoldings either.

What remains to be shown is the fact that the number of operations required to evaluate $W_1, \ldots, W_l$, that is, $op(\bigcup_{i=1}^{t} W_i)$, is equal to (or differs by a constant from) the order of the collection of cliques, $or(C)$. Consider a clique $C_i$ with vertices $u_{i_1}, \ldots, u_{i_k}, v_{j_1}, \ldots, v_{j_l}$. The order of this clique, $or(C_i)$, is $k + l$. Consider an unfolding $W_i$ that represents this clique in our transformation. It has the form of a query $(u_{i_1} \cup \cdots \cup u_{i_k}) \Join (v_{j_1} \cup \cdots \cup v_{j_l})$, hence it requires $(k - 1) + (l - 1) + 1 = k + l - 1$ operations. Then, $op(W_i) = or(C_i) - 1$. Let $C$ contain $t$ cliques. Then, by definition, the order of the collection, $or(C)$, is equal to the sum of the orders of all cliques; that is, $or(C) = \sum_{i=1}^{t} or(C_i)$, where $C_i$ is the $i$-th clique. On the other hand, evaluating the discounted query requires evaluating the union of all $W_i$'s, i.e., $\bigcup_{i=1}^{t} W_i$. Hence, we have:

$$op\Big(\bigcup_{i=1}^{t} W_i\Big) = \Big(\sum_{i=1}^{t} op(W_i)\Big) + t - 1 = \sum_{i=1}^{t} \big(or(C_i) - 1\big) + t - 1 = \Big(\sum_{i=1}^{t} or(C_i)\Big) - 1 = or(C) - 1.$$

□

Proof. [Theorem 12] It is easy to see that minimization of a query's operations is equivalent to minimization of operations in a propositional logic formula where joins are mapped to conjunctions and unions are mapped to disjunctions. This problem, known as MINIMUM EQUIVALENT EXPRESSION, is known to be in $\Sigma_2^p$ [4]. □
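The operation count in the proof of Theorem 11 can be checked numerically on the Figure 7 example. The sketch below is illustrative only: it assumes each clique is given as a pair (u-atoms, v-atoms), takes the second clique of the example to be (u1, u2, v3), and uses hypothetical function names.

```python
def order(clique):
    """or(C_i): number of vertices in a bipartite clique (us, vs)."""
    us, vs = clique
    return len(us) + len(vs)


def ops(clique):
    """op(W_i): (k - 1) unions + (l - 1) unions + 1 join."""
    us, vs = clique
    return (len(us) - 1) + (len(vs) - 1) + 1


def edges(clique):
    """Edges of the clique, i.e. the extensional unfoldings it represents."""
    us, vs = clique
    return {(u, v) for u in us for v in vs}


def total_order(cliques):
    """or(C): sum of the orders of the individual cliques."""
    return sum(order(c) for c in cliques)


def total_ops(cliques):
    """Operations for the union of all W_i: per-clique ops plus t - 1 unions."""
    return sum(ops(c) for c in cliques) + (len(cliques) - 1)


cliques = [
    (("u1", "u2", "u3"), ("v1", "v2")),
    (("u1", "u2"), ("v3",)),
]

# All edges of the complete 3-by-3 graph except the discounted <u3, v3>.
remaining = {(u, v) for u in ("u1", "u2", "u3") for v in ("v1", "v2", "v3")}
remaining.discard(("u3", "v3"))

covered = set().union(*(edges(c) for c in cliques))
disjoint = sum(len(edges(c)) for c in cliques) == len(covered)
print(covered == remaining, disjoint, total_order(cliques), total_ops(cliques))
```

For this partition the order is 8 and the operation count is 7, in line with the relation op = or(C) - 1 derived in the proof.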

Theorem 15. We generalize the NP-completeness result of Theorem 11 to a query $Q = (a_1 \cup \cdots \cup a_m) \Join (b_1 \cup \cdots \cup b_n)$, where $m$ is not necessarily equal to $n$. We call the minimization problem for such queries MDQ'. Minimization of Discounted Query (MDQ') is NP-complete.

Proof. [Theorem 15] It is easy to see that MDQ' $\in$ NP, since a nondeterministic algorithm need only guess a collection of $W$'s and check in polynomial time whether that collection can be evaluated with fewer than $K$ operations.

We show that MDQ' is NP-hard by transforming MDQ to MDQ'. Let the query $Q$ in MDQ' be $(a_1 \cup \cdots \cup a_m) \Join (b_1 \cup \cdots \cup b_n)$, where $n \geq m$, and let the set of unfoldings-to-discount be $U$.

We construct a query $Q'$ to be $(a_1 \cup \cdots \cup a_n) \Join (b_1 \cup \cdots \cup b_n)$, where $a_{m+1}, \ldots, a_n$ are arbitrary atoms. We also define a set of unfoldings-to-discount for this query to be

$$U' = U \cup \bigcup_{m+1 \leq i \leq n,\ 1 \leq j \leq n} \{a_i, b_j\}.$$

Note that by adding the set of unfoldings-to-discount $\bigcup_{m+1 \leq i \leq n,\ 1 \leq j \leq n} \{a_i, b_j\}$ we guarantee that $Q \setminus \{U\} = Q' \setminus \{U'\}$. Hence minimizing the number of operations in $Q \setminus \{U\}$ minimizes the number of operations in $Q' \setminus \{U'\}$. □

The above result can be easily generalized (by reduction to MDQ') to an arbitrary query more complex than the one defined in MDQ'.

Corollary 16. Let MDQ$_{ext}$ be the same problem as MDQ for an arbitrary query $Q$. Then MDQ$_{ext}$ is NP-hard.
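The padding step in the proof of Theorem 15 can be illustrated on a small instance. This is a sketch under stated assumptions rather than part of the proof: unfoldings are modeled as (a, b) pairs, the padded atom names are made up, and j ranges over all of b_1, ..., b_n so that every pair involving a padding atom is discounted.

```python
from itertools import product


def remaining_unfoldings(a_atoms, b_atoms, discounts):
    """Extensional unfoldings of (union of a) join (union of b), minus discounts."""
    return set(product(a_atoms, b_atoms)) - set(discounts)


def pad(a_atoms, b_atoms, discounts):
    """Pad the a-side up to len(b_atoms) atoms and discount every new pair."""
    m, n = len(a_atoms), len(b_atoms)
    if n < m:
        raise ValueError("this sketch assumes n >= m")
    # Hypothetical names for the arbitrary atoms a_{m+1}, ..., a_n.
    extra = [f"pad{i}" for i in range(m + 1, n + 1)]
    padded_discounts = set(discounts) | set(product(extra, b_atoms))
    return list(a_atoms) + extra, list(b_atoms), padded_discounts


a, b = ["a1", "a2"], ["b1", "b2", "b3"]
U = {("a2", "b3")}

a_pad, b_pad, U_pad = pad(a, b, U)
same = remaining_unfoldings(a, b, U) == remaining_unfoldings(a_pad, b_pad, U_pad)
print(len(a_pad) == len(b_pad), same)
```

Both sides of the padded query have the same number of atoms, and the set of extensional unfoldings left after discounting is unchanged, which is the property the proof relies on.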