A Constraint-based Spatial Extension to SQL - Semantic Scholar

Report 4 Downloads 48 Views
A Constraint-based Spatial Extension to SQL Gabriel Kuper Bell Laboratories

Kyuseok Shim Bell Laboratories

Sridhar Ramaswamy Bell Laboratories

Jianwen Suy U C Santa Barbara

2. We de ne a constraint algebra, equivalent to CSQL. This algebra is modeled on the relational algebra, with extensions to handle spatial data, such as spatial selection, projection, and aggregates. CSQL queries can be converted into the algebra, and this conversion is used to implement the CSQL language. 3. We compare CSQL with previous proposals for SQL extensions to spatial data [1, 5, 4]. In particular, CSQL is quite similar to the OpenGIS standard, and one aspect of our work can be seen as providing the technology to implement this standard. In addition, however, CSQL has several features for handling aggregation that are not present in the OpenGIS standard. We also compare CSQL with the DEDALE language described in [5], and claim that our language improves on this proposal in several ways. These include a) the handling of nested queries which is closer, in CSQL, to the spirit of SQL and b) the explicit handling of empty sets which we claim is preferable for reasons of expressiveness to the implicit elimination of such sets in DEDALE. 4. We are in the process of implementing a prototype of CSQL using an extensible object-relational database system supporting Abstract Data Types (ADTs). In addition to using ADTs to extend a relational database system to support constraint data types, our goals also include providing the ability to speci cy optimization schemes for constraints. 2 A Constraint-based Data Model We introduce a constraint database model that is based on a combination of the constraint database model [6] and the nested relational database model (see e.g. [2]). Speci cally, spatial data is treated as (possibly in nite) sets of points in the space with speci ed dimensions. These sets are then used as values in tuples, similar to the way in which nested relations are used. We rst illustrate the constraint database idea with the following example. Example 2.1 Consider a database containing information about cities: name, state, population, and region of the city. By region we mean the geographic area covered by the city, which involves spatial data, while the other attributes represent non-spatial data. We represent this information by the relation Cities (partially shown): Cities Name State Population Region Santa Barbara California 150 region sb Santa Maria California 50 region sm Oxnard California 70 region o ::: ::: ::: :::

Abstract The constraint database model provides a uniform framework for handling spatial and non-spatial data. We describe our proposal for a query language for GISs based on this model. Our language is called CSQL, as it is based on SQL. CSQL provides the full query capabilities of SQL on the non-spatial part of the database while being tightly integrated with the spatial part. This allows us to optimize CSQL queries in a uni ed fashion, considering both the spatial and non-spatial attributes simultaneously. We are in the process of implementing a constraint database system using CSQL as a query language, via a relational algebra CALG, and an extensible database system. 1 Introduction The constraint database model [6] is an active area of research in the database community. The key notion of the model is to generalize the notion of a tuple to a conjunction of constraints, e.g., linear arithmetic constraints. It turns out that the relational algebra and calculus can be naturally extended to such a model. The constraint database framework using linear arithmetic constraints is particularly appropriate for describing spatial data. In contrast to most of the previous models for spatial data, the constraint database model has a uniform, relational query language that handles both spatial and nonspatial data. This provides the potential for integrating spatial and non-spatial data. Apart from the advantages this brings to the user, such an integration also has the potential of enabling us to use query 1optimization techniques in a GIS framework. The Cosmos project at Bell Labs was established to investigate these aspects of constraint-based database systems. In this paper, we describe our proposal for a constraint query language, called CSQL. This is a query language for spatial data based on SQL, being developed as a part of the Cosmos project. Our main contributions are: 1. We de ne the CSQL query language. CSQL is a relational query language that is modeled on SQL. In particular, CSQL provides, as a special case, the standard SQL query language on the non-spatial attributes in the database.  Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974 y Department of Computer Science, University of California, Santa

Barbara, CA 93106. Supported in part by NSF grants IRI-9411330 and IRI-9700370. Part of the work was done while the author was visiting Bell Laboratories 1 http://www-db.research.bell-labs.com/project/cosmos

In most GIS systems, region sb ; region sm , and region o are either represented using some spatial data structure outside the DBMS, or are treated as black box ADTs inside.

To appear in Proc. 6th ACM Symposium on Geographic Information Systems, November, 1998 1

'1 = x1 6120 ^ x1 6x2 6135 ^ x2 6x1 +70 ^ x1 +6x2 >420 '2 = 1206x1 6200 ^ x1 ?4x2 > ? 380 ^ x1 ?4x2 640

S1

The formula '1 _ '2 represents the union of S1 and S2 . It is known that each quanti er-free formula ' is equivalentWto Va formula in disjunctive normal form (DNF), i.e., '  i j ij , where ij is a linear constraint. In the remainder of this paper we assume that each value in dom(Rk ) (k > 1) is a quanti er-free formula in DNF. Equality of values in dom(Rk ) is de ned semantically, i.e. two values are equal (=) if the formulas are logically equivalent. A relation schema is a nite set of attribute name and domain pairs such that the attribute names are distinct. Two relation schemas are compatible if they have the same number of attributes, and each pair of corresponding attributes are associated with the same domain. Let R(A1 :1 ; :::; An :n ) be a relation schema. A tuple over R is a total mapping t from attributes in R to domain values such that for each 1 6 i 6 n, t(Ai) 2 dom(i). Intuitively, a tuple contains a single value for each non-spatial attribute and a nitely representable set for each spatial attribute in the relation. Two tuples are equal if their corresponding values are pairwise equal, i.e., if they represent the same information. A relation instance of R is a nite set of tuples over R. Database schemas and instances are de ned naturally. Example 2.3 Consider a database containing data about cities, stations for cellular phones, and various geographical features (such as rivers, roads, airports, etc.). The database schema consists of the following three relations:

S2

Figure 1: Two nitely representable regions Such an explicit separation of spatial and non-spatial attributes results in data models and languages that are hard to use and to optimize. In the constraint database model, we represent, conceptually, the regions as mathematical objects using polynomial or linear equations and inequalities. For example, region sb could be represented as a formula '(x1 ; x2 ) = 68 6 x1 6 70 ^ x2 6 2 ^ 70 6 x1 + 8x2 _ 70 6 x1 6 71 ^ x2 6 2 ^ 70 6 x1 ? 2x2 , where a point (x1 ; x2 ) is in region sb i '(x1 ; x2 ) is true. Constraint databases extend relational databases by allowing arithmetic constraints over real (or rational) numbers to be used in databases and query expressions. In this paper we focus on linear constraints, over real numbers, that use equality, order predicates (=; 6=; ; >), addition (+), and subtraction (?). Examples of such linear constraints include equations and inequalities of the following form: \x + x + x + y + y ? z = 3:2" and \x + x ? y + z + z 6 3:2". It is clear that linear equations and inequalities with k variables can always be written as \ki=1 ai xi = a0 " and \ki=1 ai xi 6 a0 ", where the xi 's are variables and the ai 's are rational numbers. In this model, spatial objects in k dimensional space are represented by formulas that are boolean combinations of linear constraints over k variables such as these. A database in our model may contain both traditional, non-spatial data as well as spatial data.2 While the former is described using the standard relational model, the latter is modeled using nested point sets. We assume the existence of (1) attribute names and (2) primitive domains which include the usual non-spatial data types such as strings, integers, real numbers etc. and a collection of spatial constraint data types Rk where k is a positive integer. Each domain  has a set of values, denoted by dom( ). For non-spatial domains  , dom( ) is de ned in the usual manner, e.g., set of integers, strings, etc. For each spatial domain Rk , dom(Rk ) consists of sets of points in k-dimensional real space that can be \ nitely represented" in a logic language for linear real arithmetic in the sense described below. De nition. Let S be a set of points in k-dimensional real space. S is nitely representable if there is a quanti erfree formula '(x1 ; :::; xk ) involving real numbers, =, 6, +, ?, and k distinct variables x1 ; :::; xk such that each point (a1 ; :::; ak ) 2 S i the formula '(a1 ; :::; ak ) is true. We use ? to denote the empty set, represented by an inconsistent formula. Example 2.2 Consider a database containing information about cellular phone stations and coverage areas. The regions S1 , and S2 shown in Figure 1 are nitely representable by formulas '1 , and '2 (respectively) shown below.

Cities(cname:string, state:string, pop:integer, region:R2 ) Stations(sid:integer, location:R3 , coverage:R2 ) GeoFeatures(fname:string, type:string, footprint:R2 )

Here the region of a city is the geographical area of the city, the location of a cellular phone station also includes the elevation as the third dimension, and the type of a feature indicates whether the feature is a river, a road, an airport, etc. Part of an instance of Stations is shown below. Stations sid location coverage 628 x1 = 80 ^ x2 = 120 ^ x3 = 1000 '1 627 x1 = 160 ^ y = 80 ^ x3 = 200 '2 ::: :::::: ::: where '1 ; '2 are the formulas in Example 2.2. 3 CSQL: Query Language 3.1 Design Rationale Commercial relational database systems provide declarative query languages for querying data. The most widely used relational query language is SQL [3]. Because SQL is a declarative language, users do not have to be concerned with how to access the database eciently: When presented with an SQL query, a relational database system uses query optimization techniques to select an ecient plan to process the query. SQL is extensible: users can access external data and operators with the use of user-de ned functions. We have designed CSQL with similar principles in mind: it is a simple high level declarative language for manipulating spatial and non-spatial data, in a uniform style and with an intuitive syntax. CSQL is a conservative extension to SQL, i.e., it is equivalent to SQL on non-spatial data.

2 Our model can, in fact, represent both temporal and spatial data, but this is outside the scope of the current paper.

2

that returns the total area of this set. If the footprints of different geographical features are disjoint, an equivalent version of the query is shown below. Here, the area of each footprint is computed separately, and the standard aggregate function sum is used to compute the total area.

3.2 The Syntax of CSQL We will use the relation schema from Example 2.3. The most common form of a query in SQL is: select columnlist from tablelist where conditionlist Here, columnlist is the list of projection columns in the answer to the query, tablelist is the list of tables referenced to in the query, and conditionlist contains a boolean combination of predicates. In conditionlist a predicate may have an operand which is itself an SQL expression. Any relational SQL expression is also a valid CSQL expression. In addition, CSQL contains the following features for handling constraint data types. Generalized Projection Operators: CSQL has two different types of projections. The rst is a projection on nonspatial attributes. This is basically the same as projection in relational databases, and is expressed in a select (or where) clause in the same way. However, for spatial attributes, there is also a second kind of projection. For example, if we only want to know the projection, on the variable x1 , of the geometry attribute of each tuple in the Cities relation, we write:

select type, sum(area(footprint)) from GeoFeatures group by type

Selection predicates on aggregation functions must appear in the having clause in CSQL. The select clause can use built-in and user-de ned functions together with an aggregate function. For example, the following query is valid in CSQL: select type, min(min(footprint.[1])) from GeoFeatures group by type

In this query, the outer use of min is an aggregate function, while the inner use is a spatial function (which role the min function plays in each case can be determined from the context). This query computes the minimum value of the rst coordinate of the region attribute for each tuple, and outputs the smallest of these values, for each geographical feature type. 3.3 Comparison with DEDALE We now show how typical queries, such as those described by the DEDALE project [5] can be expressed in CSQL. We describe a few typical queries here. Other queries in [5], expressed in CSQL, can be found in [7]. It is also straightforward to express queries from the Sequoia 2000 benchmark [9] in CSQL. In this section, we use the example given in [5]. The example has the relation Grounds(name, groundtype, owner, geometry), where the attribute groundtype denotes whether the land described by geometry (a 2-dimensional spatial attribute) is covered by forest or grass. Additional relations that are used in [5] include Cities, Waterpoints, Forests, Lakes, Roads, and Abbeys. All these relations have two attributes, name and geometry with the natural interpretations. In addition, there is a relation Boat with two attributes name and trajectory, which describes the name and the trajectories of boats. The trajectory attribute is a threedimensional attribute, that describes position of a boat over time.  Return those parcels that are adjacent to a lake.

select cname, region.[1] from Cities

To project on several dimensions, e.g., to project a 3dimensional spatial attribute location onto the rst two dimensions, we use the notation location.[1,2] in the select/where clause. Generalized Functions and Aggregate Functions: In relational SQL, all functions, including aggregate functions, take a single value as input and return a single value. In contrast to this, functions on the spatial component in a database of our model can accept and return a set of values. For example, the output of the following CSQL query consists of the name of each object, together with the smallest value of the x1 -coordinate of the corresponding spatial part. In this example, the min function returns the smallest real number from a set of numbers (which is represented as a nite set of intervals). select name, min(region.[1]) from Cities

Such generalized functions can also appear in the clause:

where

select cname from Cities where max(region.[1]) < 3

select G.name, G.geometry from Grounds G, Lakes L where boundary (G.geometry) contains (intersect (G.geometry, L.geometry))

Like min, the function max accepts a set of real numbers, but returns a single value. Additional functions in CSQL include area, volume, surface, and length. As for aggregate functions, CSQL contains all the builtin SQL aggregate functions avg, count, max, min, and sum for non-spatial attributes as well as two new aggregate functions, union and intersection, for spatial attributes with the obvious meaning. As an example consider the following CSQL query to compute the total area of each geographical feature type in the GeoFeatures relation:

Here, boundary and intersect are also spatial functions that compute the boundary of a region and the intersection of two regions respectively.  Return those parts of roads that are within a forest. select R.name, union(intersect(R.geometry,F.geometry)) from Roads R, Forests F where R.geometry intersects F.geometry group by R.name

select type, area(union(footprint)) from GeoFeatures group by type

Here intersects is a predicate testing the intersection of two spatial values. In CSQL, elimination of tuples with an empty set of spatial values must be done explicitly in the where clause. This is in contrast to DEDALE, where such elimination is done implicitly. We believe that this

Here union is the aggregate function that combines the footprints into a single set, and area is a generalized function 3

approach leads to a simpler and cleaner language. In the example above, if the where clause is removed, roads that have no intersection with a forest are also returned.  Return those water points that are south of Orange (assuming that the co-ordinate system is increasing along the North to South direction):

does not enforce implicit elimination of tuples with empty set in spatial attribute. In the GIS community, there have been several proposals to extend SQL to spatial data. Earlier e orts focuses adding topological predicates into SQL such as the Spatial SQL of Egenhofer [4]. The extension in the OpenGIS speci cation [1] extends SQL to handle spatial data types such as point, curve, surface, polygon etc. The OpenGIS language is essentially SQL with some built-in functions for these types. The new spatial data types are treated essentially in isolation, which complicates the language. For example, since the result of intersecting two polygons can be either a point, a line or polygon, queries have to take all these possibilities into account. In contrast, CSQL uses constraints to model spatial data which results in a uniform treatment of all spatial values. Consequently, the language is not only simpler and more uniform than the OpenGIS language, but also extensible so that new spatial operations and predicates can be easily added.

select W.name, W.geometry from Waterpoints W, Cities C where C.name = 'Orange' and min(W.geometry.[2])<max(C.geometry.[2])

As a comparison, in DEDALE [5] the above query was expressed as select W.name, W.geometry from Waterpoints W, Cities C where C.name = 'Orange' and restrict?(proj(W.geometry) on (x2) cross proj(C.geometry) on (x2)) with (x1 < x2)

The DEDALE version uses a nested selection with the \restrict" construct while the CSQL version performs a simple testing of the \ 0 attributes and A1 ; :::; An are distinct attribute names, then (A1;:::;An e) is an expression with the type obtained from S after renaming the attributes. 8. (Group by and aggregates) If S contains attributes A1 ; :::; An , a non-spatial attribute Ba , a spatial attribute Bs , and C is a new attribute not in S , then

one of min, max, avg, count, and sum.

 (Spatial aggregates) A1 ;:::;An;C=f (Bs) e is an expression of schema S [A1 ; :::; An ; B ], where f is either [ (union) or \ (intersection).

Here the attributes before \;" are group-by attributes. (Note that simultaneous applications of aggregate functions can be expressed separately and one at a time and followed by a natural join of these results. As an alternative, we could also allow simultaneous aggregate functions.) Example 4.1 Consider a database with the schema shown in Example 2.3. The following queries can be expressed in the algebra CALG. In the query expressions, we use cn for cname, fn for fname, fp for footprint, reg for region, t for type, cv for coverage, C for Cities, GF for GeoFeatures, and S for Stations. 1. Find lakes that are completely inside city regions. fn;cn fpreg (C  t=\lake"GF ) 2. Find cities with forests, and for each such city, list the city name and also the forest areas. cn;freg freg6=? freg:fp\reg (C  t=\forest"GF ), or cn;freg freg:fp\reg fpreg (C  t=\forest" GF ). Here the forest areas inside a city is created by the  operator. 3. Find cellular phone stations that cover parts of I-5. sid cvfp(S  fn=\I-5" GF ). 4. Find cities that are not completely covered by cellular phone stations. Temp1 = ;cover=[(cv) cv S , Answer = cn :(regcover) (C  Temp1 ). Note that the group-by constructor with union is used in de ning Temp1 , which computes the union of all regions covered by stations and results in one tuple. It then selects the names of the cities whose regions are not totally contained in the covered area. 5. Find cities that have the largest cellular phone coverage. Temp2 = cn;creg creg:reg\cover (C  Temp1 ), Answer = cn cn;largest=max(area) (area = sf(cvr)

cn; cvr = [(creg) Temp2 ). Here Temp1 is the same as in the previous query. Temp2 contains city names and cellular phone coverages (which may have overlaps). Since one city may have several stations, a union aggregate is needed to combine di erent regions into one. After computing the size of the coverage area for each city by the function sf, we can simply select the cities with the largest value. 5 Design of a Constraint ADT The constraint-based spatial component of CSQL can be realized by implementing CALG. We are doing this using a Constraint Abstract Data Type (CADT). The goal of the CADT is to provide a simple and ecient interface for dealing with constraints. One should note here that there is no one-one correspondence between the ADT operations and the constraint functions supported by CALG and CSQL. Rather, the set of operations supported by the CADT should 5

Figure 2 shows an operator tree that computes the waterpoints south of the city of Orange. This tree uses the constraint project and min operations. The tree rst uses the cross-product to associate the city of Orange with all the waterbodies and then selects the waterbodies that are south of Orange. (In order to keep the gure simple, some of the project operations have not been shown.) Alternately, an index can be used for this purpose as well. Since it seems clear that constraints have to co-exist with other (mostly relational) kinds of data, it seems best to implement the CSQL proposal in an extensible database system. We are implementing CSQL using the Predator system developed at Cornell [8]. We are experimenting with a variety of choices of internal representation (vector mode, storing the extreme points, etc.) in order to characterize the trade-o s involved therein. We are also very interested in designing and implementing optimization techniques that can take into account the properties of both the constraint and non-constraint attributes simultaneously.

σ$1 < $2 Cross product CADT::min CADT::min CADT::project(2) CADT::project(2) σName = "Orange” Fetch Waterpoints Fetch Cities

Figure 2: The operator tree in Section 5.3 be general enough to support the CSQL language (and algebra). The set of methods provided by CADT can be broadly be classi ed into three groups:  Creation and manipulation methods,  Property extraction methods, and,  Special functions. We discuss these groups in detail now. (Member functions of the CADT type are shown with a ::, and arguments to functions are shown in angular brackets.)

6 Conclusions We have presented a constraint-based proposal for modeling and querying spatial databases. It provides a simple and uniform representation for spatial entities. The query language that we develop, CSQL, is a natural and backward compatible extension to SQL. Further, our model o ers the possibility of optimizing CSQL queries in a uni ed fashion, considering both the spatial and non-spatial attributes simultaneously. We are currently in the process of implementing our proposal in an extensible object-relational database system.

5.1 Creation and Manipulation



    

DNF ! CADT & This function creates a CADT instance from a DNF formula. CADT::project() This method projects the constraint object onto the requested dimensions. CADT::union(), CADT::intersect(), CADT::difference(). These methods takes the union, intersection and the di erence of the constraint object with their argument. CADT::normalize() The function triggers the \normalizing" of the constraint object into a standard form. CADT::empty() The function tests for the emptiness (non-satis ability) of a constraint object. CADT::boundingBox() It returns the bounding box of the constraint object.

create:

Acknowledgment The authors thank Luc Segou n for his comments on an earlier draft of the paper and Hongjun Zhu for her comments on the CSQL syntax. References [1] Open GIS simple features speci cation for SQL. Version 1.0, March, 1998. [2] S. Abiteboul and C. Beeri. The power of languages for the manipulation of complex values. VLDB Journal, 4(4):727{ 794, October 1995. [3] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I.L. Traiger, B. W. Wade, and V. Watson. System R: Relational approach to database management. ACM TODS, 1(2):97{137, 1976. [4] M. J. Egenhofer. Spatial SQL: A query and presentation language. IEEE TKDE, 6(1):86{95, 1994. [5] S. Grumbach, P. Rigaux, and L. Segou n. The DEDALE system for complex spatial queries. In Proc. of ACM SIGMOD, 1998. [6] P. Kanellakis, G. Kuper, and P. Revesz. Constraint query languages. JCSS, 51(1):26{52, 1995. [7] G. Kuper, S. Ramaswamy, K. Shim, and J. Su. A constraintbased spatial extension to SQL. Tech. report, Bell Labs, 1998. [8] P. Seshadri and M. Paskin. PREDATOR: An OR-DBMS with enhanced data types. In Proc. of ACM SIGMOD, 1997. [9] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The SEQUOIA 2000 storage benchmark. In Proc. of ACM SIGMOD, 1993.

5.2 Property Extraction Methods and Special Functions  CADT::min(), CADT::max(), CADT::length(), CADT::area(), CADT::volume() The functions return the minimal and maximal values, the length, area, and volume of the constraint along the speci ed dimensions respectively.  CADT::buffer() It \bu ers" the constraint object by a distance given by the argument. 5.3 Expressing Queries using the CADT Interface In spite of this extremely simple interface, all the queries from the previous section can be expressed in terms of these ADT operations. We give an example to illustrate this. 6