Optimizing Statistical Queries by Exploiting ... - Semantic Scholar

Report 19 Downloads 196 Views
Optimizing Statistical Queries by Exploiting Orthogonality and Interval Properties of Grouping Relations Chang Li and X. Sean Wang Department of Information and Software Systems Engineering George Mason University, Fairfax, Virginia, 22030

Abstract

ucts comprise a large and quickly expanding industry. These products share many of the characteristics of a statistical database. However, they also have many special features, the most important of which is their emphasis on very short response time to queries since these products are to support interactive data analysis. The central concept in an OLAP system is a multidimensional cube. An n-dimensional cube consists of n relations, each being called a dimension, and for each combination of n tuples (one from each dimension), there is an associated measurement value. As an example, consider the cube STORES representing sales data collected by a department store chain. There are three dimensions: the store, the manufacturer and the date. On each day, each store reports the sales amount of the goods supplied by each manufacturer.1 An instance of this cube is shown in Figure 1. It is easily seen that an n-dimensional cube can be conceptually viewed as consisting of a (huge) category represented by the cross product of all the dimension relations (with attributes properly renamed). A measurement value is associated with each tuple in this cross product. For example, the value 5384.61 is associated with the cross product of the following three tuples: tuple hCA, Los Angelesi from Store dimension, the tuple hCAi from the Manu dimension and tuple h1995; Jan; 30i from the Date dimension. Any subset of the cross product is a called a (sub)category (or subcube) of the cube. In our example, each subset of tuples (without the Sales column) in Figure 1(b) is a subcategory. In OLAP systems, summaries of many subcategories are pre-computed. For any pre-computation to be useful, an ecient algorithm must exist that nds out the usable pre-computed summaries to speed up the processing of user queries. Most existing systems only pre-compute summaries for orthogonal categories.2 An orthogonal category is a cross prod-

A statistical query rst manipulates source category data to build a target category in the form of a grouping relation and then performs statistical functions on the associated measurement data. In this paper, the attributes in a grouping relation are partitioned into pair-wise disjoint sets, each called a dimension. A grouping relation is said to be orthogonal if it is equal to the cross product of the projections of itself on all the dimensions. Orthogonality is useful in searching for and using pre-computed summaries on other categories. However, a grouping relation is sometimes not orthogonal, but rather k-partially orthogonal (i.e., the union of k orthogonal ones). It is shown that it is NPcomplete to decide if a grouping relation is k-partially orthogonal. The paper then gives an algorithm to derive partial orthogonality. Also investigated in this paper are interval properties of grouping relations useful for optimizing statistical queries. An algorithm is described to derive interval properties.

1 Introduction

A statistical query can usually be divided into two stages: the categories in the database are rst manipulated to build target categories, and then statistical functions are performed on the (usually large) measurement data set associated with the target category. Much progress has been made in optimizing such queries [13, 14, 12]. In particular, the use of precomputed summaries to evaluate queries has been proposed and studied [17, 7, 4]. In order to decide which pre-computed results can be used, part of this strategy needs to compare the target categories of a query with the categories associated with the pre-computed summaries. The purpose of this paper is to investigate ecient comparison methods that exploit properties present in many categories, namely orthogonality and interval properties. Our investigation is carried out in a business application domain called \on-line analytical processing" or OLAP [5, 1, 3]. Commercial OLAP system prod-

1 To simplify the presentation, we assume there is at most one store in a city and at most one manufacturer in a state. 2 In fact, most existing OLAP systems only pre-compute categories which are obtained by restricting one or more dimen-

 Work was supported in part by the NSF grant IRI-9409769. Contact author: X. Wang ([email protected])

1

Store: State

CA CA VA

City

Los Angeles San Jose Fairfax

Manu: State

CA VA

Date: Year

1995 1995

Month

Jan Jan

Day

25 30

State

CA CA CA CA CA CA CA CA VA VA VA VA

Store City

Los Angeles Los Angeles Los Angeles Los Angeles San Jose San Jose San Jose San Jose Fairfax Fairfax Fairfax Fairfax

(a) The dimensions.

Manu State

CA CA VA VA CA CA VA VA CA CA VA VA

Year

1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995

Date Month

Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan

Day

25 30 25 30 25 30 25 30 25 30 25 30

Sales

9310.56 5384.61 7185.19 2764.28 4034.50 3444.08 2078.91 2156.35 2490.30 2816.41 5901.13 4623.54

(b) The associated sales amounts.

Figure 1: An instance of cube STORES. both the rst and the last tuples in Figure 1(b) are uct of subsets of tuples, one subset from each dimenin the subcategory while the fourth tuple, which is resion. For example, all the tuples (without the Sales quired for the subcategory to be orthogonal, is not. column) in Figure 1(b) with the rst column being The aforementioned dimension-at-a-time comparison CA form an orthogonal category. This category is the method cannot be used directly for this category. cross product of the rst two tuples from the Store dimension with all the tuples from the Manu and Date In this paper, we assume that the pre-computed dimensions. summaries are on orthogonal subcategories of a cube, Comparing two orthogonal categories for containbut the target categories of user queries may not be ment can be carried out eciently: One only needs to orthogonal. The question is how to eciently compare compare each dimension. That is, an orthogonal cata target category with an orthogonal category. egory C1 contains (all the tuples of) another orthogoWe revisit the example above. We know that the nal category C2 if and only if the containment is true target category is not orthogonal and dimension-aton each dimension. We call this method \dimensiona-time comparison does not directly apply. However, at-a-time". This method is much more ecient than one does not yet need to resort to the tuple-at-a-time the \tuple-at-a-time" method, which needs to be percomparison method. Indeed, let r be the obtained tarformed if the two categories are arbitrary. Indeed, get category. It can easily be seen that the relations suppose a cube has 10 dimensions and each dimenr1 = Store:State =CA (r) and r2 = Store:State =VA (r) are sion consists of at most 10 tuples. Then dimensionboth orthogonal, and r = r1 [ r2. Hence, r is the at-a-time comparison (for orthogonal categories) only union of two orthogonal categories. We call such catneeds to deal with 10 pairs of tuple sets, each having egories \partially orthogonal". The dimension-at-aat most 10 tuples; while tuple-at-a-time comparison time method can be used on each of the orthogonal (for arbitrary categories) has to deal with sets with subcategories in a partially orthogonal subcategory. up to 1010 tuples. Clearly, every category is a union of a nite number Unfortunately, not every target category yielded subcategories. Indeed, each category is a nite union by a query is orthogonal. As an example, conof tuples, and each tuple by itself is an orthogonal sider the category obtained by performing a selection category by de nition. Therefore, we are interested in Manu:State =Store:State on the tuples in Figure 1(b). The nding out whether a category is the union of a small meaning of this subcategory is to gather all the sales number of cubes. We show in this paper that it is NPamounts of the products manufactured and sold in the complete to decide if an arbitrary category is a union same state. This subcategory is not orthogonal since of k orthogonal categories. In light of this complexity sions to individual tuples. In our example cube STORES, there result, we turn to investigate methods that can demay exist, for each day, the pre-computed summary of the sales rive partial orthogonality of the target category from amounts by all the stores regarding all manufacturers. In other the input categories and the query expression. Speci words, these pre-computed categories are obtained by restricting to individual tuples on the date dimension. cally, we carefully study the changes of partial orthog2

onality in categories under each relational operation and give an algorithm that automatically determines the partial orthogonality in the target categories. Another strategy investigated in this paper is by using the interval properties of a category. In many situations, dimension-at-a-time comparison can be further simpli ed. Suppose we have two orthogonal categories speci ed on the same dimension and these two categories forms two intervals with respect to a given order on the dimension. Then we only need to compare the two end points of each category to see if one contains the other. This is especially useful for temporal dimensions, e.g., the Date dimension in cube STORES. Interval properties also arise from a special operation that we call ordered grouping, or roll operation. The operation is important to business applications. For example, in order to compute daily cumulative sale amounts, days on the Date dimension is arranged into groups: For each day, all the days between the very rst day on the Date dimension to the given day form a group. Clearly, each group of days is an interval with respect to time. Aggregation functions are applied to each group. In this paper, we study the interval properties via the concept of interval dependencies. Our main goal is to automatically derive interval properties. We again carefully study the changes of interval properties under relational operations. Orthogonality is in general related to many research topics in statistical databases, namely, aggregation derivability [10, 11], pre-computation [4] and physical data accessing [15, 12]. In particular, our work is strongly related to [4], where a mechanism of storing precomputed summaries based on orthogonal categories is developed. Speci cally, the system stores non-overlapping precomputed summaries of orthogonal categories. When a query is evaluated, the precomputed summaries are examined and the usable summaries are retrieved. However, the work in [4] is focused on how the precomputated summaries are stored in a tree structure and how the tree is searched and updated. The results of our paper can be viewed as complementary to the work of [4] in that we are focused on how to eciently compare categories. We believe that an ecient comparison is of essential importance to make use of precomputed summaries in an interactive system. Partial orthogonality is a special kind of multivalued dependency studied in the relational normalization theory [16]. Many deep results have been established on the subject. Orthogonality in the binary relation case is also formally studied under the name \well-connected" relations in [2]. In this paper, how-

ever, we are focused on the derivation of (partial) orthogonality from a query expression so that the orthogonality can be used in query optimization. Interval dependency studied in this paper is used to compress the representation of data sets in order to expedite the comparison procedure. A similar concept called compact domain was used in [8] to de ne a normal form for statistical relations. A compact domain describes an \interval" with respect to the domain of the attribute. In this paper, the notion of an interval is de ned with respect to the relation itself. Furthermore, we give a set of inference rules for such interval properties. Finally, we focus on the derivation of interval dependencies from query expressions. The remainder of the paper is divided into 5 sections. In Section 2, the notion of a multidimensional cube and that of a grouping query are reviewed. The orthogonality is studied in Section 3. In Section 4, the ordered grouping queries that produce interval properties are reviewed. The interval properties themselves are investigated in Section 5. Finally, Section 6 concludes the paper with a discussion of the paper and future research directions.

2 Multidimensional cubes and grouping queries

In this section, we review the notion of an ndimensional cube and the grouping algebra [9]. Throughout the paper, we assume two disjoint sets: a set of attribute names and a set of dimension names. We use A, B etc. to denote attribute names, R to denote nite sets of attribute names, and D to denote dimension names. Each symbol may be subscripted to denote di erent (set of) names of the same kind. Each attribute A is associated with a domain dom(A). Furthermore, we use V to denote a set of scalar values (integers, reals, etc.).

De nition Let n be a positive integer. An ndimensional cube scheme is a set f(D1 ; R1); : : :; (Dn ; Rn)g, where D1 , .. ., Dn are distinct dimension

names and R1, .. ., Rn are sets of attribute names. An n-dimensional cube, or a cube for short, on the scheme f(D1; R1); : : :; (Dn; Rn)g is a pair (F; ), where (i) F = f(D1; r1); : : :; (Dn; rn)g such that ri is a relation on Ri for each 1  i  n, and (ii)  is a mapping from f f(D1; t1);    ; (Dn; tn)g j ti 2 ri for each 1  i  ng to V .

Thus, a multidimensional cube consists of a set of dimensions and a value mapping. Each dimension is a relation and the value mapping maps each combination of tuples, one from each dimension, to a value. 3

Note that unlike an ordinary array, we use names to label di erent dimensions and there is no order among these dimensions. Operations like \transpose" are of no meaning for a cube. However, for convenience and when no confusion arises, we will assume that there is an order among the dimension names in a cube. That is, we will assume that a cube scheme is a list h(D1 ; R1); : : :; (Dn; Rn)i, and we will write a cube on the scheme h(D1 ; R1); : : :; (Dn; Rn)i as (r1; : : :; rn; ), assuming that ri is a relation on Ri for each 1  i  n, and that  is a mapping from f(t1; : : :; tn) j ti 2 ri for each 1  i  ng to the set V of scalar values. Each (t1 ; : : :; tn) here is called a coordinate of the cube. For example, consider the chain store example in Section 1. A 3-dimensional cube can be used to formalize the sales data of the chain store: STORES = (rStore ; rManu ; rDate ; Sales): The cube scheme of STORES is h(Date; RStore ); (Manu; RManu ); (Date; RDate )i, where RStore = fState; Cityg, RManu = fStateg and RDate = fYear; Month; Dayg. An important view point of a dimension is that the relation actually provides a \basic" way of grouping. Indeed, for a given dimension D, each tuple t of the dimension can be taken as the \name" of a group and the data in the group are all the values that are yielded by the value mapping from all possible combinations of t (from dimension D) with all the tuples from the other dimensions. In the above example, a store (represented by its city) gives the group of all the sales data ever reported by the store on all manufacturers. A day on the Date dimension gives the group the sales data reported by all the stores on all the manufacturers on that day. The above intuition of using a relation to group data in a cube is extended to an arbitrary relation, leading to the notion of a \grouping relation". First, we introduce the notion of a grouping scheme. De nition A dimension attribute is a pair of a dimension name and an attribute name. A grouping (relation) scheme is a nite set of dimension attributes and (ordinary) attribute names. Hence, a grouping scheme (denoted by R) is simply a traditional relation on ordinary attributes as well as dimension attributes. In the sequel, we shall use the notation D:A to denote a grouping attribute. The di erence between a dimension attribute and an ordinary attribute is that the dimension attribute is pre xed with a dimension name. In order to simplify the presentation, we assume there exists a dummy \dimension name" D0 and each ordinary attribute A is viewed as a dummy dimension attribute D0 :A.

A grouping relation on a grouping scheme is simply a relation on the grouping scheme by ignoring the dimension names. That is, a (grouping) tuple of a grouping relation on the scheme R is an \ordinary" tuple such that t[D:A] 2 dom(A) for each D:A in R, and a grouping relation is a set of grouping tuples. The symbol r, possibly with a subscript, is used to denote a grouping relation. Note that all the traditional relational operations are applicable to grouping relations. However, not all relational operations are meaningful due to the special semantics associated with grouping relations. We call the restricted relational algebra grouping algebra. The grouping algebra consists of all the traditional relational algebra operations. The projection , selection , union [, di erence ? and intersection \ are used simply treating every dimension attribute as a usual attribute (with D: as part of the name). The rename and cross product operations are as de ned in the traditional relation algebra but with the restrictions as follows: For the rename operation, the only acceptable forms are renD:A;D :B (r) and renD :A;D :B (r), where D0 is the dummy dimension name. For the cross product operation r1 r2 , the (non-dummy) dimension names appearing in the two relations must be disjoint. Hence, a rename operation cannot changes a regular attribute to dimension one. A cross product cannot be performed if the same dimension name appears in both relations. The reason is our desire for a so called \non-empty reference" property [9]. Suppose r is a grouping relation r and D1, .. ., Dk are all the (non-dummy) dimension names appearing in the scheme of r. Suppose that C is an n-dimensional cube with dimensions D1, ... , Dk , ... , Dn. Then each tuple t in r \selects" a set of coordinates from C as follows: For each coordinate (t1 , .. ., tn) of C, the coordinate is selected i ti[A] = t[Di :A] for each (nondummy) dimension attribute Di :A of t. The above selected data from a cube by tuples can be considered as the measurement data associated with the category represented by the grouping relation. A statistical query usually applies statistical functions on the data. Consider again the example in Figure 1. Let r be the grouping relation copied from rDate in the cube STORES. Then for each t 2 r (which represents a day), the sales data collected by t are all the sales data reported on day t. As another example, let rm = Date:Month (rDate ). Then for each tuple tm 2 rm (which represents a month), the collected data are all the sales data of the month of all the years (e.g., the sale data of all January, or February, ..., etc.). 0

4

0

0

3 Orthogonality

nd a reference attribute set that divide a relation into a small number of orthogonal subrelations. It is easily seen that a relation r is orthogonal wrt Z if and only if the multivalued dependency Z !! RD is satis ed by r for each D. By the inference axioms of multivalued dependencies [16], we have that if r ? Z then r ? Z 0 for all Z 0  Z. However, we choose to use our notation because we represent partial orthogonality in a compact form, and also because we are only concerned with the multivalued dependencies in the form Z !! RD . Furthermore, we are interested, in this paper, in the changes of (partial) orthogonality in relations under relational operations. Before we consider relational operations, we introduce the notion of an orthogonality graph to keep track of the impacts of operations to orthogonality. The reason for using orthogonality graph is to keep track of many reference attribute sets.

In this section we discuss orthogonality of grouping relations. For the convenience of presentation, for a given relation scheme R, we use RD to denote all the attributes in R with dimension name D.

De nition A relation r on R is said to be orthogonal if r = R 0 (r)  : : :  R k (r), where D0, . .. , Dk are all the dimension names (including the dummy one D0) appearing in R. D

D

As discussed in the introduction, in practice, not all relations are orthogonal, and some relations can be divided into several subrelations, each of which is orthogonal. However, the following result shows that it is intractable to nd a \good" way to divide a relation into orthogonal subrelations.

Proposition 1 For a given integer K, it is NPcomplete to decide if an arbitrary relation is a union of at most K orthogonal relations.3

De nition An orthogonality graph is a undirected

graph G = (R; E), where R is a set of (dummy or non-dummy) dimension attributes and E is a set of (undirected) edges. An edge can be a loop, i.e., a undirected edge from a vertex to itself. A grouping relation r on R is said be orthogonal wrt G = (R; E) if r ? Z for each vertex cover Z of G.

Note that to decide if a relation is orthogonal has low complexity. The diculty is how the relation is divided. In light of the above intractability, we seek a special way of dividing a relation:

De nition Let r be a relation on R and Z  R. Then r is said to be partially orthogonal wrt Z, denoted g ? Z, if for each tuple tZ 2 Z (r), Z =tZ (r) is orthogonal.

A vertex cover of a graph (R; E) is a subset Z of R such that for each edge in E, at least one of the vertex in the edge is in Z. The intuitive meaning of an orthogonality graph is that if the attributes are partitioned into pair-wise disjoint groups such that there is no edge between attributes of di erent groups, then the relation is a cross product of the projections on these groups. Clearly, for an orthogonal relation, there is no edge between attributes of di erent dimensions. If there is only one edge between attributes of di erent dimensions, then the graph will be orthogonal with respect to one of the attributes, i.e., either one of the attributes can serve as the reference attribute. This property is extended to a set of edges. Thus, if a relation is orthogonal wrt an orthogonality graph, then each vertex cover of the graph is a potential reference attribute set. Clearly, we need only to consider the minimal vertex covers, i.e., the vertex covers that do not properly contain other vertex covers. We note here that it is NP-hard to nd a minimal vertex cover [6]. However, since the size of an orthogonality graph is in terms of the number of attributes (instead of number of tuples), this complexity is bearable. Orthogonality graph has a very good property. For each orthogonality graph, there always exists a rela-

For example, the relation in Figure 2 is orthogonal wrt D1 :A. D1 :A D2 :B D3:C 1 a e 1 b e a f 1 1 b f c g 2 2 d g 2 c h d h 2 Figure 2: A grouping relation. If r is not orthogonal wrt Z, we use the notation r 6? Z. The attribute set Z in the above de nition is called a reference attribute set. For each relation, there may exist more than one reference attribute set. On the other hand, each relation must have a reference set, namely the scheme of the relation itself (since the relation is then divided into subrelations with each having only one tuple). An interesting problem is to 3 Due to space limitation, proofs are omitted.

5

tion that is orthogonal wrt the graph. Furthermore, we have: Proposition 2 Let G = (R; E) be an orthogonality graph and Z  R such that Z is not a vertex cover of G. Then there exists a relation r such that r ? G but r 6? Z. That is, no reference attribute sets other than vertex covers are \logically implied" by a graph. We now turn to consider relational operations. The rst operation is selection. Here we only consider conjunctive selection condition with equality and V inequalities. That is, each selection is of the form i yi1 i yi2 , where yij (j = 1; 2) is either an attribute or is a constant, and i is a symbol in f; ; 6=; =g. Note that a selection with a conjunctive condition can be decomposed into a sequence of selections each with the condition of the form yi1 i yi2 . Lemma 1 Let G = (R; E) be an orthogonality graph and r a relation on R such that r is orthogonal wrt G. Then 1. D:Ac(r) ? G, where c is a constant; 2. D:AD:B (r) ? G; 3. D:AD :B (r) ? (R; E [ f(D:A; D0:B)g), if D 6= D0; In other words, if a selection condition only concerns a single dimension, then the orthogonality does not change. However, if the selection condition is across two dimensions, then any vertex cover needs to include at least one attribute in the selection condition. Lemma 2 If r1 ? (R1 ; E1) and r2 ? (R2; E2), then r1  r2 ? (R1 [ R2; E1 [ E2). Note that no attribute can appear in both R1 and R2 by the de nition of the cross product. Hence, any vertex cover of (R1 [ R2; E1 [ E2) is a union of a vertex cover of (R1 ; E1) and a vertex cover of (R2 ; E2). The above result simply says that the reference attribute set of a cross product is the union of the reference sets for each relation participating in the cross product. Lemma 3 Let G = (R; E) be an orthogonality graph, Z a vertex cover of G and Y a subset of R such that Z  Y . Then Y (r) ? Z. That is, if the projection keeps a vertex cover of the graph, then we can still use the vertex cover as our reference set. If the condition Z  Y is not satis ed in the above lemma, the orthogonality information may be lost. Consider the relation r in Figure 2. Clearly, r ? D1 :A. Consider now the projection r0 = D :B;D :C (r). Obviously, relation r0 can still

be divided into two orthogonal subrelations. However, using any attribute set as a reference set will not give such a partition.4 Hence, in order to keep more orthogonality information, we consider keeping attribute D1 :A as a \tag" attribute. That is, we consider a tag attribute not as part of the relation, but rather it gives us a way of dividing the relation into orthogonal subrelations. In order to formalize the notion of tag attributes, we have:

De nition Let r and r0 be relations on schemes R and R0 , respectively, with R  R0 . Then r0 is said to

be equivalent to r on R if R (r0 ) = r. The attributes in R0 ? R are intended as tags. As will be seen later, tag attributes are system generated and hence will not be used in aggregations either as grouping attributes (i.e., the attributes X after the keyword group by) or as lters for selecting coordinates from a cube since these system generated attributes are all dummy. By using the concept of equivalence, we change the projection operation D :B;D :C (r) into a rename operation renD :A;D :A (r), where D0 :A0 is a tag attribute. However, due to Lemma 3, we do not need to keep all attributes that are projected out as tags. Instead, we can decide rst which reference set we intend to use, and then any attribute that is projected out and is not in the reference set can be safely discarded. This leads us to the following two phased algorithm. The rst phase is to determine the orthogonality graph of an expression e, and the second phase is to change the expression e to keep certain attributes as tags. The algorithm for the rst phase is given in Figure 3. The input expression e in the algorithm is assumed to be an expression with only selection, projection and cross production operations. Furthermore, the selection conditions are conjunctive conditions with equality and inequalities. Also, as the input of the algorithm, we assume that the input relations have their own orthogonality graphs. If the input relation only involves one dimension name, then the input orthogonality graph is the graph with the empty edge set. If the input relation involves more than one dimension and no orthogonality graph is known, then the input orthogonality graph can be derived by using the notion of a key. That is, there are three kinds of edges: (1) there is a loop edge for each attribute in the key, (2) there is an edge between each pair attributes in the key, and (3) there is an edge from each non-key attribute to each key attribute. For example, 1

1

0

2

0

2

0

4 Using D2 as the reference set will divide 0 into four orthogonal subrelations. The same goes for using D3 . :B

r

:C

3

6

4 Ordered grouping

if R = fA; B; C; Dg and A; B are the key. Then the orthogonality graph is (R; E), where E consists of the following seven edges (A; A), (B; B), (A; B), (C; A), (C; B), (E; A) and (E; B). The graph is built such that any minimal vertex cover must contain exactly all the key attributes. The second phase of the algorithm is to generate a new grouping algebra expression that is equivalent to the original one but with tags for preserving orthogonality. Suppose the orthogonality graph obtained by the rst phase in Figure 3 is G. First, nd a minimal vertex cover Z for G. The attributes of Z can be divided into two disjoint subsets Z = Z 0 [ Z 00: The attributes in Z 0 are from the input relations and the attributes in Z 00 are system generated. Clearly, the new expression must keep not only attributes in Z 0 and also the attributes in Z 00. Since system generated attributes are produced at step 4 in the rst phase when the projection operation is dealt with, for each attribute A 2 Z 00 , we can nd a subexpression Y (e1 ) in  e such that A 62 Y , and A is the renamed version of A. We change this subexpression into Y [Y (renA;A  (e1 )), where Y 00 is all the system generated names before this step that are contained in the expression renA;A  (e1 )) and in Z 00 . After this procedure is performed for each A 2 Z 00, a new expression e00 is generated from expression e.

In OLAP applications, many groupings are performed on an ordered dimension (e.g., the Date dimension in the SALES cube) and involved collecting certain intervals of the dimension into groups. For example, in order to perform \cumulative sum" over the Date dimension, the tuples on the Date dimension are collected into (overlapping) groups. In order to express groupings that are order related, we rst introduce the concept of a \tuple order" on relations. We assume there exists a total order on the domain of each attribute name. And we use the ~ where X is a set of (dimension) attributes, notation X, to denote a permutation of the attributes in X. De nition Let X be a set of dimension attributes and X~ = D1 :A1; : : :; Dk :Ak a permutation of X. Then ~ ), where  is a symbol in the set fdesc; ascg, O = (X; is called a tuple order on X and gives a total order O on the tuples of the scheme X as follows: For tuples t1 and t2 on X, t1 O t2 i either t1 = t2 , or there exists a 1  j < k such that (1) t1[D1 :A1; : : :; Dj :Aj ] = t2 [D1 :A1; : : :; Dj :Aj ] and (2) if  = desc, then t1 [Dj+1 :Aj +1 ] > t2 [Dj+1 :Aj +1], and if  = asc, then t1[Dj+1 :Aj +1] < t2 [Dj+1 :Aj +1]. That is, the tuples on X are ordered lexicographically, large to small (desc) or small to large (asc), by looking at the attributes in the order given by the permu~ For example, if X~ = A; B and  = asc, then tation X. the tuple (1; 2) is before the tuple (1; 3). It is clear that each tuple order is a total order. And hence, for each relation on X, it is easily seen that a tuple order on X induces a sequence of the tuples in the relation as follows: ~ ) be a total order on X and De nition Let O = (X; g a grouping relation on scheme X. We use O[r] to denote the sequence t1 ; : : :; tn, where ti O tj for all i < j and t1 , . .. , tn are all the tuples in g. The sequence O[r] is simply a \sorted" relation by using the attribute order speci ed in O. Clearly, this is similar to the \Ordered by" clause in SQL. Before we de ned ordered grouping operations, we also need the following notation regarding attribute name conversions. For each attribute name A (not a dimension attribute), we always assume that T-A, where T is an arbitrary symbol, is a new attribute name that have the same domain as A. Also, for a given set X = fD1 :A1; : : :; Dk :Ak g of dimension attributes, let T-X be the set of dummy dimension attributes fT-A1 ; : : :; T-Ak g. We are now ready to de ne our roll operation:

00

Proposition 3 For any given input relations that

are orthogonal wrt their corresponding orthogonality graphs, the result relation r00 of the expression e00 , where e00 is as described above, satis es the following two conditions: (1) r00 is equivalent to r with respect to R, where r is the result of e on the inputs and R is the scheme of r, and (2) r00 ? Z, where Z is the vertex cover used to derive e00 above. As an example, consider the expression A;D3 :A (D1 :B>D2 :C (R1 R2 )). Suppose the corresponding orthogonality graphs are G1 = (fA; D1:B g; ;) and G2 = (fD2 :C; D3:Ag; ;). Following the above algorithm, the output graph G of the rst phase ^ C; ^ D3:Ag; (B; ^ C)), ^ where B^ and C^ will be (fA; B; are new names from D1:B and D2:C, respectively. Then in the second phase, a minimal vertex cover ^ Suppose we choose B. ^ can be found as B^ or C. Then the output expression of second phase will be: A;D3 :A;B^ (ren B;B^ (D1 :B>D2 :C (r1  r2 ))). Assume now that the input relations are r1 and r2 . Let r0 = ^ A;D3 :A;B^ (ren B;B^ (D1 :B>D2 :C (r1  r2 ))). Then r0 ? B, 0 and r is equivalent to A;D3 :A (D1 :B>D2 :C (r1 r2 )) with respect to fA; D3:Ag. 7

Find an orthogonality graph for an expression: INPUT: e(R1 ; : : : ; Rk ) and G1 ; : : : ; Gk OUTPUT: G METHOD: The output graph G is recursively derived as follows. For each subexpression e1 and e2 of e, assume the orthogonality graph derived for e1 and e2 are G1 = (R1 ; E1 ) and G2 = (R2 ; E2 ), respectively. 1. If e = Ri , then let G = Gi ; 2. If e = e1  e2 , then let G = (R1 [ R2 ; E1 [ E2 ); 3. If e = cond(e1 ), then if cond = D:AD0 :B such that D 6= D0 , let G = (R1 ; E1 [f(D:A; D0:B )g; otherwise let G = G1 . 4. If e0 = Y (e1 ), then let G be the graph G1 but with each attribute in R1 ? Y be renamed to a system generated new name.

Figure 3: First phase: nd orthogonality graph.

De nition An interval dependency (or ID) is an exint ~ pression of the form X ?! Y , where X and Y are sets

Operation(Ordered Grouping) Let b (begin), s

(step) and l (length) be positive integers (s can be 0), r a grouping relation on scheme R, X a subset of R such that no (regular) attribute name appearing more than once in X and O a tuple order on X. Assume O[X (r)] = t1 ; : : :; tn. For each m  0, let im = b + s  m and jm = im + l, and if im  jm  n, then for each im  k  jm , let tmk be the tuple on the scheme start-X [ end-X [ X such that tmk [start-X] = tim , tmk [end-X] = tjm and tmk [X] = tk . Let r0 be the relation consisting of all tmk de ned above. Let b;s;l O (r), called a roll of r, be the grouping relation r ./ r0 .

of attributes and Y~ is a permutation of Y . A relation int ~ r on R, where X [ Y  R, is said to satisfy X ?! Y, int ~ denoted r j= X ?! Y , if for each t 2 X (r), there exist integers i and j such that Y (X =t (r)) = ti ; : : :; tj , where t1 ; : : :; tn is the tuple sequence of Y (r) under the tuple order O = (Y~ ; asc). int ~ Thus, a relation r satis es the ID X ?! Y if for each tuple t in X (r), the projection of X =t (r) on Y is an interval in the sequence of Y (r) under the tuple order (Y~ ; asc). Note that in the above de nition, we used asc. It is clear that Y (X =t (r)) is an interval under the tuple order (Y~ ; asc) i it is an interval under the tuple order (Y~ ; desc). As an example, the relation int in Figure 5 satis es the interval dependency A ?! hB; C i. Indeed, for the tuples with A = 1, the BC values are in the interval between 00 and 10 (inclusive) and for the tuples with A = 2, the BC values are in the interval between 01 and 11 (inclusive).

Thus, an ordered grouping operation rst sorts the tuples in X (r) by the given tuple order. Then it generates groups by taking intervals from the sequence (the result of the sorting). Speci cally, tuples from the positions between b and b + l are grouped together, tuples from the positions between [b + s; b + s + l] are grouped together, and so on. (Here, b is the beginning position of the rst interval, and s is the step length, i.e., the distance between the beginning of each two neighbor intervals, and l is the length of each interval.) Finally, each interval group generates a result group by joining to the original relation. Figure 4 shows an example of tuple order and the roll operation. The ordered grouping can be extended to the form b;s;l (r) for all l   b;s; (r) which is the union of  O O jrj. As an example, assume r is a1;3grouping relation ;3 (r) creates a on the scheme fy; m; dg. Then (y;asc ) grouping relation that groups every three years (nonoverlapping) into a group; 1(y;3;m;;3asc) (r) creates a grouping ;that groups every three months into a group. Also,  1(m;0;asc (r) creates a \cumulative" grouping on months. )

It is easily seen that b;s;l O (r) satis es the followint ~ int ~ ing three IDs: (1) start:Y ?! Y ,(2) end:Y ?! Y, int ~ 5 ~ (3) start:Y [ end:Y ?! Y . We note that if Y =

We start with the notion of an interval dependency formalizing interval properties.

5 We use to denote the concatenation of the permutations of two disjoint sets. That is, is a permutation of , where = , such that in the permutation all the

A 1 1 1 2 2 2

B 0 0 1 0 1 1

C 0 1 0 1 0 1

int Figure 5: A relation that satis es A?!h B; C i.

5 Interval properties

~W ~ V

~W ~ V

V [W

8

V \W

;

A1 a a b c

A2 A3 4 c 4 h 2 b 1 d

(i) relation r

start-A1

a a a b b

A1 a b c (ii) O[r] with tuple order O = (A1 ; asc)

end-A1

b b b c c

A1 a a b b c

A2 A3 c h b b d

4 4 2 2 1

(iii) O1;1;2 (r)

Figure 4: Example of tuple order and ordered grouping Y~1 Y~2, then the dependency (1) above logically implies int ~ (4) start:Y ?! Y1 . Here, we adopt the standard notion of logical implication. That is, for each relation r, r satis es dependency (4) if r satis es dependencies (1). In other words, dependency (4) is weaker since (1) logically implies (4) but (4) does not imply (1). This weaker dependency nevertheless is useful in the case where a projection is performed on the relation that removes Y2 and renders dependency (1) useless while (4) remains valid. For example, suppose that r int int satis es A ?! hB; C i. Then it satis es A ?! hB i. int Clearly, A ?! hB; C i is meaningless in AB (r). Howint ever, It is easily seen that AB (r) satis es A ?! hB i. We are therefore interested in deriving interval dependencies that are logically implied by the given ones. As in standard dependency theories, we use the following inference axioms: int (I1) X ?! X~ Y~ for all attribute sets X and Y and permutations X~ and Y~ ; int ~ ~ int ~ ~ Y1 Y2 if X ?! Y1 Y2 ; (I2) XY1 ?! int ~ int ~ ~ (I3) XY1 ?! Y1 Y2 if XY1 ?! Y2 ; int ~ int ~ ~ (I4) X ?! Y1 if X ?! Y1 Y2 . Note that dependencies derived from I1 are trivial, i.e., they are satis ed by all relations. It is not dicult to show that axioms I1{I4 are sound. That is, for each relation r, if r satis es the premise of the axiom (for I1, there is no premise), then r satis es the conclusion. It is still open though whether the axioms I1{4 are complete, i.e., whether the use of the axioms can derive all logically implied IDs. We can show however a restricted form of completeness. Speci cally, suppose R is a relation scheme and R~ is a particular permutation of R. We say that int ~ ~ Then X ?! Y is under R~ if Y~ is a subinterval6 of R. the four axioms are complete for all the IDs under a speci c R~ . In practice it seems unusual to have re-

lations known to satisfy (non-trivial) IDs I such that the right-hand sides are under di erent R~ . Proposition 4 Let R be a relation scheme and R~ a permutation of R. Then axioms I1{4 are (1) sound, and (2) complete for all the interval dependencies under R~ . As with the partial orthogonality of grouping relations, our main interest here is to study the change of interval dependencies under relation operations. First, we have the de nition: int ~ De nition Let X ?! Y be an interval dependency int ~ and op a unary relational operation. Then X ?! Y is said to be preserved by op if for all grouping relation int ~ int ~ g, g j= X ?! Y implies op(r) j= X ?! Y. int ~ Proposition 5 Let X ?! Y be an interval depen-

dency, y1 y2 a selection condition, and all the attributes appearing in the condition are Z. Then int ~ X ?! Y is preserved by cond i Z  X or Z  Y . Under the assumptions of the above proposition, suppose Z is a subset of XY but not a subset of X int ~ or Y . Hence, cond does not preserve X ?! Y . However, we may be interested in a \weaker" interval deint ~ 0 int ~ pendency X 0 ?! Y that is implied by X ?! Y and is preserved by cond . This weaker dependency can be derived by using axioms I2{3.

Proposition 6 The projection operation Z preint ~ serves X ?! Y if X [ Y  Z. In the above proposition, if X  Z but Y  6 Z, then we may use axiom I4 to reduce the set Y to Y 0 int ~ 0 such that Z preserves X ?! Y.

Proposition 7 Let r be a grouping relation on R. int ~ Then r preserves X ?! Y . Here, the unary operation r on the input r0 returns the r  r0.

attributes in appear before all the attributes in . 6 A sequence 1 k is said to be a subinterval of a sequence 1 such that such that j = i+j?1 m if there exists 1 for = 1 . V

W

a

b

b j

a

i 

a

b

;:::;k

9

We are now ready for an algorithm which derives interval dependencies. The queries we consider involve only selection, projection and cross product operations. Due to space limitation, we only brie y describe the algorithm. Basically, the algorithm tries to use the three propositions above and the inference axioms. We have a set of IDs to begin with. These IDs may be obtained by a roll operation. For the selection operation, if the selection condition is not entirely in the left or the right hand side of an ID, then the algorithm tries to move the attributes from left to right (or the other way) using the axioms I2{3 to force the selection condition to be entirely in the left or the right hand side. For projection, we try to use axiom I4 so that the attributes projected out are not part of the ID. The simplest operation is the cross product, in which we simply keep both sets (from the two relations) of IDs.

[3] S. K. Bansal. Real world requirements for decision support - implications for RDBMS. ACM SIGMOD 1995. [4] M. C. Chen and L. P. McNamee. On the data model and access method of summary data management. IEEE TKDE, 1(4), December 1989. [5] E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computerworld, 27, 26 July 1993. [6] M. R. Garey and D. S. Johnson. Computers and Intractability: A guide to the theory of NPcompleteness. W. H. Freeman and Co., 1979.

[7] S. P. Ghosh. Statistical relation tables for statistical database management. IEEE Software Engineering, 12(12):1106{1116, 1986. [8] S. P. Ghosh. Statistical relational model. In Z. Michalewicz, editor, Statistical And Scienti c Databases. Elli Horwood, 1991. [9] C. Li and X. Wang. Supporting on-line analytical processing: data model, query language and optimization. Technical report, 1995. [10] F. M. Malvestuto. The derivation problem for summary data. ACM SIGMOD 1988. [11] F. M. Malvestuto and M. Moscarini. Aggregation evaluability in statistical databases. In 15th VLDB 1989. [12] W. Ng and C. Ravishankar. A physical storage model for ecient statistical query processing. In Statistical and Scienti c Database Management, pages 97{106, 1994. [13] G. Ozsoyoglu, Z. M. Ozoyoglu, and F. Mata. A language and a physical organization technique for summary tables. In PODS 1985. [14] M. Rafanelli and F. L. Ricci. Me sto: A functional model for statistical entities. IEEE TKDE, 5(4), August 1993. [15] K. Seamons and M. Winslett. Physical scheme for large multidimensional arrays in scienti c computing applications. In Statistical and Scienti c Database Management, pages 218{227, 1994. [16] J.D. Ullman. Database and Knowledge-Base Systems. Computer Science Press, 1988. [17] A. C. Yao. On the complexity of maintaining partial sums. SIAM J. Comp., 14, 1985.

6 Conclusion

In this paper, we proposed to use orthogonality and interval properties in optimizing statistical queries. In particular, we introduced algorithms for deriving these properties for the target categories from the query expressions. The results were presented in terms of OLAP systems. However, we believe that the results are readily applicable to general statistical queries. The query expressions studied in the paper are those only involving selection (conjunctive ones), projection and cross product. It is interesting to see how the techniques can be extended to general queries. For disjunctive selection condition, we may be able to extend our orthogonality graph into a hypergraph such that an edge may have more than two incident vertices. The vertex cover needs to be extended correspondingly. For the union operation, we need to introduce a new tag attribute for each relation and use di erent values for the tag attribute in the two relations. The situation for the di erence operation is more complicated and is worth further investigation. Another interesting research direction is the optimization of ordered grouping by using interval properties. Ordered grouping may create many overlapped groups. The question is to obtain summaries for each group using the smallest number of calculations. This problem is in general NP-hard. Interval properties should be useful for reducing redundant calculations.

References

[1] Aberdeen Group. IRI software pro le. Technical report, Boston, Massachusetts, March 1994. [2] S. K. Arora and K. C. Smith. A theory of wellconnected relations. Information Sciences, 19:97{ 134, 1979. 10