Some Properties of Query Languages for Bags

Leonid Libkin*

Limsoon Wong†

Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104-6389, USA

email: {libkin, limsoon}@saul.cis.upenn.edu

Abstract

In this paper we study the expressive power of query languages for nested bags. We define the ambient bag language by generalizing the constructs of the relational language of Breazu-Tannen, Buneman and Wong, which is known to have precisely the power of the nested relational algebra. The relative strength of additional polynomial constructs is studied, and the ambient language endowed with the strongest combination of those constructs is chosen as a candidate for the basic bag language, which is called BQL (Bag Query Language). We prove that achieving the power of BQL in the relational language amounts to adding simple arithmetic to the latter. We show that BQL shares the shortcomings of the relational algebra: it cannot express recursive queries. In particular, the parity test is not definable in BQL. We consider augmenting BQL with powerbag and structural recursion to overcome this deficiency. In contrast to the relational case, where powerset and structural recursion are equivalent, the latter is stronger than the former for bags. We discuss problems with using structural recursion and suggest a new bounded loop construct which works uniformly for bags, sets and lists. It has the power of structural recursion and does not require any preconditions to be verified. We find relational languages equivalent to BQL with powerbag and structural recursion/bounded loop. Finally, we discuss orderings on bags for rigorous treatment of partial information.

* Supported in part by NSF Grant IRI-90-04137 and AT&T Doctoral Fellowship.
† Supported in part by NSF Grant IRI-90-04137 and ARO Grant DAALO3-89-C-0031PRIME.

1 Summary

Sets and bags are closely related structures. While sets have been studied intensively by the theoretical database community, bags have not received the same amount of attention. However, real implementations frequently use bags as the underlying data model. For example, the "select distinct" construct and the "select average of column" construct of SQL can be better explained if bags instead of sets are used.

In an earlier paper [5], Breazu-Tannen, Buneman, and Wong defined a language based on monads [20, 29] and structural recursion [3] for querying sets. In section 2 of this report, the same syntax is given a bag-theoretic semantics. We use this language as our ambient bag language and study its properties. Due to space limitations, we give only sketches of some of the proofs. Full proofs can be found in [18].

The ambient bag language is inadequate in expressive power as it stands; for example, it cannot express duplicate elimination. In section 3, additional primitives are proposed and their relative strength with respect to the ambient language is fully investigated. The primitive unique, which eliminates duplicates from a bag, is shown to be independent of the other primitives. A similar result was obtained by Van den Bussche and Paredaens in the setting of pure object-oriented databases [8]. The primitive monus, which subtracts one bag from another, is proved to be the strongest amongst the remaining primitives. This result was independently obtained by Albert [2]. However, his investigation of relative strength is not as complete as the one in this report. As a consequence, we regard the ambient language augmented with monus and unique as our basic bag language. This language will be called BQL (Bag Query Language).

The relationship between bag and set queries is studied in section 4. It is shown that the class of set functions computed by the ambient bag language endowed with equality on base types, an emptiness test, and unique is precisely the class of functions computed by the nested relational language of [5]. Furthermore, if equality at all types is available, then the former strictly includes the latter. Grumbach and Milo also examined the relationship between sets and bags [9]. However, they considered set functions on relations whose height of set nesting is at most 2. No such limit is imposed in this report.

The relationship between sets and bags can be examined from a different perspective. In the remainder of section 4, we investigate augmenting the set language of [5] to endow it with precisely the expressive power of our basic bag language BQL.
This is achieved by adding natural numbers, multiplication, subtraction, and a summation construct to the nested relational language. This also illustrates the natural relationship between bags and numbers.

In section 5, we use the connection to the nested relational language established in section 4 to prove several fundamental properties of BQL. In particular, we show the inexpressibility of properties (such as the parity test) on natural numbers that are simultaneously infinite and co-infinite.

Breazu-Tannen, Buneman, and Wong proved that the power of structural recursion on sets can be obtained by adding a powerset operator to their language [5]. However, this result is contingent upon the restriction that every type has a finite domain. In section 6, the powerbag primitive of Grumbach and Milo [9] is contrasted with structural recursion on bags. In particular, the latter is shown to be strictly more expressive than the former.

Although a powerbag primitive increases expressive power considerably, it is difficult to express efficient algorithms with it. While structural recursion does not have this deficiency, it requires the satisfaction of certain preconditions that cannot be automatically verified [4]. In section 6, a bounded loop construct which does not require the verification of any precondition is introduced. It is shown to be equivalent in expressive power to structural recursion over sets, bags, as well as lists. This confirms the intuition that structural recursion is just a special case of bounded loop. Furthermore, in contrast to the powerbag primitive, which gives us all elementary functions [9], structural recursion gives us all primitive recursive functions. Also in section 6 we show that nonpolynomial operations on bags are more powerful than their set analogs, and find the primitive that precisely fills the gap.

Finally, in section 7, we show how to extend to bags the approach of Buneman, Jung and Ohori [6] and Libkin [16], which uses certain partial orders to give semantics of databases with partial information. We extend the idea of Libkin and Wong [18] of defining an ordering whose meaning is "being more partial". Such an ordering is fully characterized for bags, and we demonstrate an efficient algorithm to test it.

Related work. The semantic aspects of programming with collections using structural recursion were studied by Breazu-Tannen and Subrahmanyam in [4]. In particular, they showed that certain preconditions have to be satisfied for structural recursion to be well defined. Breazu-Tannen, Buneman and Naqvi brought out the connection between structural recursion and database query languages [3]. Breazu-Tannen, Buneman and Wong avoided the need to check preconditions by placing a simple syntactic restriction on structural recursion [5]. The language so restricted has several equivalent formulations, one of them being NRC [5, 30]. This language is equivalent to the algebra of Abiteboul and Beeri [1] without the powerset operator. Wong [30] then proved that the language has the conservative extension property at all input/output heights. That is, the expressive power of the language is independent of the height of set nesting in the intermediate data. Libkin and Wong [19] then showed that in the presence of very simple arithmetic operators conservativity can be extended uniformly to all input/output heights for languages augmented with a bounded fixpoint operator, transitive closure, powerset and many other operators. In [17] Libkin and Wong extended the use of the language NRC to querying or-sets.
Grumbach and Milo [9] applied the algebra of Abiteboul and Beeri to bags. In particular, they investigated the relationship between set and bag languages restricted to certain input/output heights and the expressive power of bag languages with respect to the level of bag nesting. The basic bag language proposed in this report (BQL) is precisely the language of Grumbach and Milo without the powerbag operator.

Vickers [28] studied refinements of bags, which are a more general concept than the ordering we introduce in this paper. In particular, our ordering can be expressed as a refinement, but there exist certain refinements of bags which lead to counterintuitive results when applied in the study of partial information.

The expressive power of Datalog under set and bag semantics was compared in [21]. In particular, an example of a query was given that cannot be expressed under the former but can be expressed under the latter. In [27] Saraiya shows that Datalog can be simulated with structural recursion on sets, preserving the PTIME complexity, by using as an intermediate step the loop operator described in section 6.2, and proving in the process that loop can be simulated by structural recursion (half of theorem 6.3 below). Several complexity-theoretic results for program properties and transformations are then obtained by recourse to known results for Datalog.

2 The ambient nested bag language

The nested relational language proposed by Breazu-Tannen, Buneman, and Wong [5] is denoted by NRL here. We now define an ambient bag query language NBL. It is obtained by replacing the set constructs in NRL by the corresponding bag constructs. The language has two presentations, an algebraic one called NBA and a calculus-style one called NBC, which are equivalent in terms of expressive power.

Types. The types in NBL are either complex object types or function types $s \to t$ where $s$ and $t$ are complex object types. These types are the same as those of NRL except that bags $\{|s|\}$ instead of sets $\{s\}$ are used. The grammar for complex object types is given below.

$$s ::= b \mid \mathit{unit} \mid s \times s \mid \{|s|\}$$

A complex object type denotes a set of objects. $\mathit{unit}$ is a special base type having exactly one element, which we denote by $()$. $s \times t$ is the set of pairs whose first component is from $s$ and whose second component is from $t$. $\{|s|\}$ are finite bags containing elements of type $s$. A bag is different from a set in that it is sensitive to the number of times an element occurs in it while a set is not. Finally, $b$ are base types to be specified.

Expressions. The expressions of NBA and NBC are given in figure 1. The type superscripts are usually omitted as they can be inferred [13, 23]. The semantics of these constructs is similar to the semantics of NRL except that duplicates are not eliminated.

Semantics of NBA constructs is as follows. $Kc$ is the constant function that produces the constant $c$. $\mathit{id}$ is the identity function. $g \circ h$ is the composition of functions $g$ and $h$; that is, $(g \circ h)(d) = g(h(d))$. The bang $!$ produces $()$ on all inputs. $\pi_1$ and $\pi_2$ are the two projections on pairs. $\langle g, h \rangle$ is pair formation; that is, $\langle g, h \rangle(d) = (g(d), h(d))$. $K\{||\}$ produces the empty bag. $\uplus$ is the additive bag union. $\mathit{b\_\eta}$ forms singleton bags: $\mathit{b\_\eta}(x) = \{|x|\}$. $\mathit{b\_\mu}$ flattens a bag of bags: $\mathit{b\_\mu}\,\{|B_1, \ldots, B_n|\} = B_1 \uplus \cdots \uplus B_n$. $\mathit{b\_map}(f)$ applies $f$ to every item in the input bag. The function $\mathit{b\_\rho_2}$ is used for interaction between bags and pairs: $\mathit{b\_\rho_2}(x, y)$ pairs $x$ with every item in the bag $y$. For example, $\mathit{b\_\rho_2}(1, \{|1, 2|\})$ returns $\{|(1, 1), (1, 2)|\}$.

Semantics of the NBC constructs which differ from NBA constructs is as follows. $\{||\}$ is the empty bag. $\{|e|\}$ is the singleton bag containing $e$. $\biguplus\{|e_1 \mid x \in e_2|\}$ is the bag obtained by first applying the function $\lambda x.e_1$ to each item in the bag $e_2$ and then taking the bag union of the results. For example, $\biguplus\{|\{|x, x + 1|\} \mid x \in \{|1, 2, 3|\}|\}$ evaluates to $\{|1, 2, 2, 3, 3, 4|\}$.

Proposition 2.1 The languages NBA and NBC have the same expressive power. □

Therefore, we normally work with the component that is most convenient.
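
The bag-monad constructs described above can be illustrated with a small sketch. It is not part of the paper's formal development: it assumes flat bags of hashable Python values modelled with `collections.Counter`, and the function names (`b_eta`, `b_union`, `b_map`, `b_mu`, `b_rho2`, `ext`) are hypothetical transliterations of the operators. Because a `Counter` is not hashable, `b_mu` takes a list of bags in place of a bag of bags.

```python
from collections import Counter

def b_eta(x):
    """Singleton bag {|x|}."""
    return Counter([x])

def b_union(b1, b2):
    """Additive bag union: multiplicities are added."""
    return b1 + b2

def b_map(f, bag):
    """Apply f to every item, keeping duplicates."""
    out = Counter()
    for item, n in bag.items():
        out[f(item)] += n
    return out

def b_mu(bags):
    """Flatten a collection of bags (a list stands in for a bag of bags)."""
    out = Counter()
    for b in bags:
        out += b
    return out

def b_rho2(x, bag):
    """Pair x with every item of the bag."""
    return b_map(lambda y: (x, y), bag)

def ext(f, bag):
    """NBC comprehension: bag union of f(x) over every occurrence of x in bag."""
    out = Counter()
    for item, n in bag.items():
        for y, m in f(item).items():
            out[y] += n * m
    return out

# The two examples from the text.
pairs = b_rho2(1, Counter([1, 2]))
flat = ext(lambda x: Counter([x, x + 1]), Counter([1, 2, 3]))
```

Here `pairs` corresponds to the bag of pairs in the first example and `flat` to the six-element bag in the second; multiplicities are stored as counts rather than as repeated items.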

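EXPRESSIONS OF NBA

Category with Products

$$Kc : \mathit{unit} \to b \qquad \mathit{id}^{s} : s \to s \qquad \frac{h : r \to s \quad g : s \to t}{g \circ h : r \to t}$$

$$\pi_1^{s,t} : s \times t \to s \qquad \pi_2^{s,t} : s \times t \to t \qquad !^{s} : s \to \mathit{unit} \qquad \frac{g : r \to s \quad h : r \to t}{\langle g, h \rangle : r \to s \times t}$$

Bag Monad

$$\mathit{b\_\eta}^{s} : s \to \{|s|\} \qquad \mathit{b\_\mu}^{s} : \{|\{|s|\}|\} \to \{|s|\} \qquad \frac{f : s \to t}{\mathit{b\_map}(f) : \{|s|\} \to \{|t|\}}$$

$$K\{||\}^{s} : \mathit{unit} \to \{|s|\} \qquad \uplus^{s} : \{|s|\} \times \{|s|\} \to \{|s|\} \qquad \mathit{b\_\rho_2}^{s,t} : s \times \{|t|\} \to \{|s \times t|\}$$

EXPRESSIONS OF NBC

Lambda Calculus and Products

$$c : b \qquad x^{s} : s \qquad () : \mathit{unit} \qquad \frac{e_1 : s \to t \quad e_2 : s}{e_1\, e_2 : t} \qquad \frac{e : t}{\lambda x^{s}.e : s \to t}$$

$$\frac{e : s \times t}{\pi_1\, e : s \qquad \pi_2\, e : t} \qquad \frac{e_1 : s \quad e_2 : t}{(e_1, e_2) : s \times t}$$

Bag Monad

$$\{||\}^{s} : \{|s|\} \qquad \frac{e : s}{\{|e|\} : \{|s|\}} \qquad \frac{e_1 : \{|s|\} \quad e_2 : \{|s|\}}{e_1 \uplus e_2 : \{|s|\}} \qquad \frac{e_1 : \{|t|\} \quad e_2 : \{|s|\}}{\biguplus\{|e_1 \mid x^{s} \in e_2|\} : \{|t|\}}$$

Figure 1: Syntax of NBL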

3 Relative strength of bag operators

Breazu-Tannen, Buneman, and Wong [5] added equality tests $\mathit{eq}$ for all types $s$ to NRL. They showed that the presence of equality tests elevates NRL from a language that merely has structural manipulation capability to a full-fledged nested relational language. The question of what primitives to add to NBL to make it a useful nested bag language should now be considered. Unlike languages for sets, for which we have a well established yardstick, very little is known about bags. Due to this lack of an adequate guideline, a large number of primitives are considered.

Let us first fix some meta notations. A bag is just an unordered collection of items. $\mathit{count}(d, B)$ is defined to be the number of times the object $d$ occurs as an element in the bag $B$. The bag operations to be considered are listed below.

• $\mathit{monus} : \{|s|\} \times \{|s|\} \to \{|s|\}$. $\mathit{monus}(B_1, B_2)$ evaluates to a $B$ such that for every $d : s$, $\mathit{count}(d, B) = \mathit{count}(d, B_1) - \mathit{count}(d, B_2)$ if $\mathit{count}(d, B_1) > \mathit{count}(d, B_2)$; and $\mathit{count}(d, B) = 0$ otherwise.
• $\mathit{max} : \{|s|\} \times \{|s|\} \to \{|s|\}$. $\mathit{max}(B_1, B_2)$ evaluates to a $B$ such that for every $d : s$, $\mathit{count}(d, B) = \max(\mathit{count}(d, B_1), \mathit{count}(d, B_2))$.
• $\mathit{min} : \{|s|\} \times \{|s|\} \to \{|s|\}$. $\mathit{min}(B_1, B_2)$ evaluates to a $B$ such that for every $d : s$, $\mathit{count}(d, B) = \min(\mathit{count}(d, B_1), \mathit{count}(d, B_2))$.
• $\mathit{eq} : s \times s \to \{|\mathit{unit}|\}$. $\mathit{eq}(d_1, d_2) = \{|()|\}$ if $d_1 = d_2$; it evaluates to $\{||\}$ otherwise. That is, we are simulating booleans as a bag of type $\{|\mathit{unit}|\}$. True is represented by the singleton bag $\{|()|\}$ and False is represented by the empty bag $\{||\}$.
• $\mathit{member} : s \times \{|s|\} \to \{|\mathit{unit}|\}$. $\mathit{member}(d, B) = \{|()|\}$ if $\mathit{count}(d, B) > 0$; it evaluates to $\{||\}$ otherwise.
• $\mathit{subbag} : \{|s|\} \times \{|s|\} \to \{|\mathit{unit}|\}$. $\mathit{subbag}(B_1, B_2) = \{|()|\}$ if for every $d : s$, $\mathit{count}(d, B_1) \le \mathit{count}(d, B_2)$; it evaluates to $\{||\}$ otherwise.
• $\mathit{unique} : \{|s|\} \to \{|s|\}$. $\mathit{unique}(B)$ eliminates duplicates from $B$. That is, for every $d : s$, $\mathit{count}(d, B) > 0$ if and only if $\mathit{count}(d, \mathit{unique}(B)) = 1$.
Each of these operators has polynomial time complexity with respect to the size of the input. Hence every function definable in NBL(monus, max, min, eq, member, subbag, unique), where we have explicitly listed the additional primitives in brackets, has polynomial time and space complexity with respect to the size of the input. The expressive power of these primitives relative to NBL is compared here. In contrast to NRL, where all nonmonotonic primitives are interdefinable [5], these bag primitives differ considerably in expressive power. As a consequence of the theorem below, NBL(monus, unique) can be considered as the most powerful candidate for a standard bag query language. We denote NBL(monus, unique) by BQL.
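
As an illustration of the count-based definitions above, the following sketch models each primitive on flat bags represented as `collections.Counter` objects. The representation and the names `bag_max` and `bag_min` (chosen to avoid shadowing Python builtins) are assumptions made for this example only; booleans are encoded as bags of unit, as in the text.

```python
from collections import Counter

def true():
    """True is the singleton bag {|()|}."""
    return Counter([()])

def false():
    """False is the empty bag {||}."""
    return Counter()

def monus(b1, b2):
    # Counter subtraction keeps only strictly positive counts.
    return b1 - b2

def bag_max(b1, b2):
    # Counter union takes the maximum of the two counts.
    return b1 | b2

def bag_min(b1, b2):
    # Counter intersection takes the minimum of the two counts.
    return b1 & b2

def eq(d1, d2):
    return true() if d1 == d2 else false()

def member(d, bag):
    return true() if bag[d] > 0 else false()

def subbag(b1, b2):
    return true() if all(n <= b2[d] for d, n in b1.items()) else false()

def unique(bag):
    # Every element that occurs at least once is kept exactly once.
    return Counter(d for d, n in bag.items() if n > 0)
```

Each function touches every distinct element a bounded number of times, which is consistent with the polynomial bound stated above, although this sketch makes no claim about the cost model used in the paper.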

Theorem 3.1 monus can express all primitives other than unique. unique is independent of the rest of the primitives. min is equivalent to subbag and can express both max and eq. member and eq are interdefinable and both are independent of max. □

The results of theorem 3.1 can be visualized in the following diagram.

[Diagram: the primitives monus, max, min, subbag, eq, member and unique, arranged by relative expressive power as stated in theorem 3.1.]

The independence of unique was also proved by Van den Bussche and Paredaens [8], and the fact that monus is the strongest amongst the remaining primitives was also shown by Albert [2]. However, their comparison was incomplete. For example, the incomparability of max and eq was not reported. In contrast, the results presented in this section can be put together in theorem 3.1, which completely and strictly summarizes the relative strength of these primitives.
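
To make one part of theorem 3.1 concrete, the sketch below checks two standard identities on multiplicities that express min and max through monus and additive union: $\min(a, b) = a \mathbin{\dot{-}} (a \mathbin{\dot{-}} b)$ and $\max(a, b) = a + (b \mathbin{\dot{-}} a)$. These identities are well known, but they are offered here only as an illustration and are not claimed to be the constructions used in the paper's proof. Bags are modelled with `collections.Counter`, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

def monus(b1, b2):
    # Counter subtraction keeps only strictly positive counts.
    return b1 - b2

def min_via_monus(b1, b2):
    # min(a, b) = a -. (a -. b) on every multiplicity
    return monus(b1, monus(b1, b2))

def max_via_monus(b1, b2):
    # max(a, b) = a + (b -. a) on every multiplicity
    return b1 + monus(b2, b1)

def check_identities(max_count=3):
    """Exhaustively compare against Counter's built-in min (&) and max (|)."""
    for a1, c1, a2, c2 in product(range(max_count + 1), repeat=4):
        b1 = Counter("a" * a1 + "c" * c1)
        b2 = Counter("a" * a2 + "c" * c2)
        if min_via_monus(b1, b2) != (b1 & b2):
            return False
        if max_via_monus(b1, b2) != (b1 | b2):
            return False
    return True
```

Calling `check_identities()` compares the derived operators with the direct definitions on all pairs of bags over two elements with multiplicities up to `max_count`.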

4 Relationship between bags and sets

In this section, we study the relationship between bags and sets from two perspectives. First, we find a bag language whose set-theoretic expressive power is that of NRL(eq). Then we consider endowing NRL(eq) with new primitives that would give it precisely the expressive power of the basic bag language BQL.

4.1 Set-theoretic expressive power of bag languages

Several fragments of our nested bag language are compared with the nested relational language NRL(eq). This can be regarded as an attempt to understand the "set-theoretic" expressive power of these bag languages. In order to compare bags and sets, two technical devices are required for conversions between bags and sets. We use the following constructs for this purpose:

$$\frac{f : s \to t}{\mathit{bs\_map}(f) : \{|s|\} \to \{t\}} \qquad \frac{f : s \to t}{\mathit{sb\_map}(f) : \{s\} \to \{|t|\}}$$

The semantics is as follows. $\mathit{bs\_map}(f)(R)$ applies $f$ to every item in the bag $R$ and then puts the results into a set. For example, $\mathit{bs\_map}(\lambda x.1 + x)\{|1, 2, 3, 1, 4|\}$ returns the set $\{2, 3, 4, 5\}$. $\mathit{sb\_map}(f)(R)$ applies $f$ to every item in the set $R$ and then puts the results into a bag. For example, $\mathit{sb\_map}(\lambda x.4)\{1, 2, 3\}$ returns the bag $\{|4, 4, 4|\}$.
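
The two conversion maps can be sketched in a few lines. This is an illustrative model only: it assumes flat collections, with bags as `collections.Counter` objects and sets as Python `set` objects, and it passes `f` as an ordinary argument rather than currying it as in the typing rules.

```python
from collections import Counter

def bs_map(f, bag):
    """Apply f to every item of a bag and collect the results in a set."""
    return {f(x) for x in bag.elements()}

def sb_map(f, s):
    """Apply f to every item of a set and collect the results in a bag."""
    return Counter(f(x) for x in s)

# The two examples from the text.
as_set = bs_map(lambda x: 1 + x, Counter([1, 2, 3, 1, 4]))
as_bag = sb_map(lambda x: 4, {1, 2, 3})
```

The first call loses the duplicate produced by the repeated 1, while the second keeps one copy of 4 for each of the three set elements.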

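Let $s$ be a complex object type not involving bags. Then $\mathit{to\_bag}(s)$ is a complex object type obtained by converting all set brackets in $s$ to bag brackets. Every object $o$ of type $s$ is converted to an object $\mathit{to\_bag}(o)$ of type $\mathit{to\_bag}(s)$. Conversely, let $s$ be a complex object type not involving sets. Then $\mathit{from\_bag}(s)$ is a complex object type obtained by converting all bag brackets in $s$ to set brackets. Every object $o$ of type $s$ is converted to an object $\mathit{from\_bag}(o)$ of type $\mathit{from\_bag}(s)$. The conversion operations are given inductively below.

$$\begin{aligned}
\mathit{to\_bag}^{\mathit{unit}} &:= \lambda x.x \\
\mathit{to\_bag}^{s \times t} &:= \lambda x.(\mathit{to\_bag}^{s}(\pi_1 x), \mathit{to\_bag}^{t}(\pi_2 x)) \\
\mathit{to\_bag}^{\{s\}} &:= \mathit{sb\_map}(\mathit{to\_bag}^{s})
\end{aligned}$$

$$\begin{aligned}
\mathit{from\_bag}^{\mathit{unit}} &:= \lambda x.x \\
\mathit{from\_bag}^{s \times t} &:= \lambda x.(\mathit{from\_bag}^{s}(\pi_1 x), \mathit{from\_bag}^{t}(\pi_2 x)) \\
\mathit{from\_bag}^{\{|s|\}} &:= \mathit{bs\_map}(\mathit{from\_bag}^{s})
\end{aligned}$$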
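
The inductive conversions can be sketched as two recursive functions over nested values. The encoding is an assumption made for this example: pairs (and unit) are Python tuples, sets are `frozenset` objects, and bags are instances of a small hypothetical `Bag` class that is hashable so that bags can be nested. The sketch also passes any other value through unchanged, which treats base-type values as fixed points of the conversion; that clause is an assumption of this example.

```python
from collections import Counter

class Bag:
    """Hashable finite bag, so that bags can nest inside bags, sets and pairs."""

    def __init__(self, items=()):
        self._counts = Counter(items)

    def elements(self):
        return self._counts.elements()

    def __eq__(self, other):
        return isinstance(other, Bag) and self._counts == other._counts

    def __hash__(self):
        return hash(frozenset(self._counts.items()))

    def __repr__(self):
        return "Bag(%r)" % sorted(self.elements(), key=repr)

def to_bag(o):
    """Replace every set bracket by a bag bracket."""
    if isinstance(o, frozenset):
        return Bag(to_bag(x) for x in o)         # sb_map(to_bag)
    if isinstance(o, tuple):
        return tuple(to_bag(x) for x in o)       # componentwise on pairs
    return o                                     # assumed: other values unchanged

def from_bag(o):
    """Replace every bag bracket by a set bracket, dropping duplicates."""
    if isinstance(o, Bag):
        return frozenset(from_bag(x) for x in o.elements())  # bs_map(from_bag)
    if isinstance(o, tuple):
        return tuple(from_bag(x) for x in o)
    return o

nested = frozenset({(1, frozenset({2, 3})), (4, frozenset())})
round_trip = from_bag(to_bag(nested))
```

On objects that contain no bags, converting to bags and back returns the original object, as `round_trip` shows for `nested`; the converse fails in general because `from_bag` discards multiplicities.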

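Define $\mathit{SET}(\Gamma)$ to be the class of functions $f : s \to t$ where $s$ and $t$ are complex object types not involving bags and $\Gamma$ is a list of primitives such that there is $f' : \mathit{to\_bag}(s) \to \mathit{to\_bag}(t)$ definable in NBL($\Gamma$) and the diagram below commutes.

[Commutative diagram: top row $\mathit{to\_bag}(s) \xrightarrow{f'} \mathit{to\_bag}(t) \xrightarrow{\mathit{id}} \mathit{to\_bag}(t)$; bottom row $s \xrightarrow{f} t \xrightarrow{\mathit{id}} t$; vertical arrows $\mathit{to\_bag}^{s}$ from $s$ to $\mathit{to\_bag}(s)$, $\mathit{to\_bag}^{t}$ from $t$ to $\mathit{to\_bag}(t)$, and $\mathit{from\_bag}^{\mathit{to\_bag}(t)}$ from $\mathit{to\_bag}(t)$ to $t$.]

Let $\mathit{eq}_b$ be the equality test restricted to base types. Let $\mathit{empty} : \{|\mathit{unit}|\} \to \{|\mathit{unit}|\}$ be a primitive such that it returns the bag $\{|()|\}$ when applied to the empty bag and returns the empty bag otherwise. Then

Theorem 4.1
1. $\mathit{SET}(\mathit{unique}, \mathit{eq}_b, \mathit{empty}) = \mathrm{NRL}(\mathit{eq})$.
2. $\mathrm{NRL}(\mathit{eq}) \subsetneq \mathit{SET}(\mathit{unique}, \mathit{eq})$.
3. $\mathrm{NRL}(\mathit{eq})$ and $\mathit{SET}(\mathit{monus})$ are incomparable. □

The class $\mathit{SET}(\Gamma)$ is precisely the class of "set-theoretic" functions expressible in NBL($\Gamma$). Consequently, the above results say that NBL($\mathit{unique}, \mathit{eq}_b, \mathit{empty}$) is conservative over NRL(eq) in the sense that it has precisely the same set-theoretic expressive power. On the other hand, NBL(unique, eq) is a true extension of the set language. However, the presence of unique is in a technical sense essential for a bag language to be an extension of a set language.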
4.2 A set language equivalent to BQL

It was shown earlier that BQL = NBL(monus, unique) is the most powerful amongst the bag languages considered so far. From the foregoing discussion, this bag language is a true extension of NRL(eq). In this subsection, the relationship between sets and bags is studied from a different perspective. In particular, the precise amount of extra power BQL possesses over NRL(eq) is determined. Let us endow NRL(eq) with natural numbers $\mathbb{N}$ together with multiplication, subtraction, and summation as defined below.

• $\cdot : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$. The semantics of $\cdot$ is multiplication of natural numbers.
• $\dot{-} : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ (sometimes called modified subtraction). The semantics is as follows:  m0 n : m = 0n ? m ifif nn ? ?m