Large Volume Spatial Data Management Based on Grid Computing
Liu Hua, Li De-ren, ZHU Xin-yan
National Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing
Wuhan University, Wuhan, China
Grid computing also provides quality of service (QoS) for data management and problem solving.
Abstract—As an emerging technology, grid computing has been applied to many projects, and the management of spatial data also faces new challenges. This paper introduces the application of grid computing to the management of spatial data and emphasizes the deployment and design of a spatial data grid.
II.
A spatial data grid, as its name implies, provides spatial data management and related services based on spatial data. It manages multi-scale and multi-precision spatial data, including vector data, raster images, and DEMs. These data can be stored in databases, files, or other forms, as required. The grid provides two kinds of services: basic services and incremental services. Basic services are services that every grid node needs to support, e.g. data browsing and querying, and statistics on economic attributes. Basic services are not registered in the service registration center. Incremental services are paid services, e.g. selling locally customized data, or rectifying PB-level images using a supercomputer located at a grid node or at a group of grid nodes controlled by that node. Incremental services are registered in the service registration center, which supports querying and connection by customers. This paper mainly discusses the design of basic services. The description of the spatial data grid design is divided into two parts: grid node deployment and grid node design.
Keywords-grid computing; spatial data grid; GIS; Globus
I.
INTRODUCTION
Spatial data management is the basic function of a Geographical Information System (GIS), as well as a very important index for measuring the capabilities of GIS software. Driven by database technology and the Internet, spatial data management has developed from the desktop to the WAN (Wide Area Network) and the Internet, and the volume of managed spatial data has grown from MB to PB. As human perception and understanding improve, spatial data management faces new situations such as the following:
• The amount of spatial data produced has accumulated considerably; sensors keep multiplying, and the received data is measured in PB.
• The scale of the problems to be solved keeps growing. Problems such as global climate analysis and earthquake forecasting cannot be solved with just a few supercomputers.
A. Grid Node Deployment
The concept of the grid comes from the electric power grid, so studying electric power infrastructure is helpful for configuring spatial data grid nodes. In China, the electric power grid is deployed by district, e.g. the Huadong and Huabei power grids, and electricity is carried from high-voltage lines down to household lines. This deployment also suits a spatial data grid. The spatial data grid is divided into levels. Every node acts as a district center that manages local spatial data and, at the same time, has the right to assign tasks to lower-level nodes. Higher levels provide stronger functions and faster connections. Following these principles, spatial data grid nodes can be arranged as follows: the servers of first-level nodes are supercomputers connected to one another at speeds on the order of GB per second, while the server of a lowest-level node is an ordinary computer with a dial-up network connection. Figure 1 is a logical view of grid node deployment. In general, customers request services from their local grid nodes, but if they need higher service quality or the local node has trouble, they are permitted to send requests to higher-level nodes. The dashed lines in Figure 1 describe this situation.
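The escalation rule described above (use the local node, fall back to a higher-level node when the local one is unavailable) can be sketched as follows. This is only an illustrative model: the `GridNode` class, the `available` flag, and the three-node example hierarchy are assumptions made for the sketch, not part of the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GridNode:
    """One node in the hierarchy; level 1 is the highest (hypothetical model)."""
    name: str
    level: int
    parent: Optional["GridNode"] = None
    available: bool = True


def resolve_node(local: GridNode) -> GridNode:
    """Return the local node if available, else the nearest available ancestor."""
    node: Optional[GridNode] = local
    while node is not None:
        if node.available:
            return node
        node = node.parent  # escalate one level up (a dashed line in Figure 1)
    raise RuntimeError("no available grid node on the path to the root")


# Hypothetical three-level hierarchy; the local node is out of service.
national = GridNode("national", level=1)
huadong = GridNode("Huadong", level=2, parent=national)
shanghai = GridNode("Shanghai", level=3, parent=huadong, available=False)
served_by = resolve_node(shanghai)
```

Here the request that would normally go to the Shanghai node is served by its parent. A request for higher service quality could be handled the same way by starting the search at a higher level.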
To sum up, spatial data management needs more storage space, computing resources, and other resources, but traditional modes of spatial data management cannot solve the above problems well. In recent years, the advent of grid computing technology has provided a new approach to spatial data management. A grid is a kind of infrastructure characterized by resource sharing. Grid computing realizes coordinated resource sharing and problem solving among dynamic virtual organizations (VOs). At present, grid computing is widely applied in simulation, medicine, geology, biology, military affairs, etc., and achieves results that traditional distributed technologies cannot. Using grid computing technology not only solves the above problems but also integrates the resources in a VO.
Funded by the National 863 Project (No. 2003AA132080). Supported by the National Key Basic Research and Development Program of China (No. 2004CB318206).
0-7803-9050-4/05/$20.00 ©2005 IEEE.
DESIGN OF SPATIAL DATA GRID
The metadata center returns to the customer the URIs of the spatial data grid nodes that store matching data, according to the data description. Because the matching data may not all reside in the same spatial data grid node, several URIs may be returned. Finally, the customer sends data requests to the grid nodes at the received URIs.
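The lookup step of a data request, in which a metadata center matches a data description against its records and returns one or more service URIs, can be sketched as below. The record fields, the bounding-box test, and the `example` URIs are assumptions made for illustration; the paper sends the description as an XML file and does not specify the matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical catalogue entry held by a metadata center."""
    data_type: str      # e.g. "vector", "raster", "DEM"
    scale: int          # scale denominator, e.g. 50000 for 1:50,000
    bbox: tuple         # (min_x, min_y, max_x, max_y)
    service_uri: str    # address of the grid node serving the data


def overlaps(a: tuple, b: tuple) -> bool:
    """True if two (min_x, min_y, max_x, max_y) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_uris(catalogue, data_type, bbox):
    """Return the distinct service URIs whose records match the request."""
    uris = []
    for rec in catalogue:
        if rec.data_type == data_type and overlaps(rec.bbox, bbox):
            if rec.service_uri not in uris:
                uris.append(rec.service_uri)
    return uris


# Placeholder catalogue: two nodes hold vector data covering the request area.
catalogue = [
    MetadataRecord("vector", 50000, (113.0, 29.0, 115.0, 31.0), "http://node-a.example/vector"),
    MetadataRecord("vector", 50000, (114.5, 30.0, 116.0, 32.0), "http://node-b.example/vector"),
    MetadataRecord("raster", 10000, (113.0, 29.0, 115.0, 31.0), "http://node-a.example/raster"),
]
uris = find_uris(catalogue, "vector", (114.0, 30.0, 115.0, 30.8))
```

The customer would then contact each returned URI in turn, which matches the case where the matching data is spread over several grid nodes.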
Figure 1. Deployment of Spatial Data Grid
Within a spatial data grid group, a certificate authority (CA) is established. Before joining the group, a grid node and its services must create the corresponding public key, private key, and unsigned security certificate, and then submit the certificate to the CA through a secure channel, e.g. email. After the CA has verified the grid node, it signs the certificate and returns it to the requester. In the same way, customers must pass a validity check by the CA before using grid services. The CA signs host certificates and customer certificates to ensure the security and legitimacy of spatial data grid nodes and grid customers.
Figure 2. Spatial Data Request
Because regional economic development is unbalanced, the different access frequencies of grid nodes should be considered when deploying grid nodes and metadata centers. Grid nodes can be added in hot-spot areas as required to balance the load, and nodes can be merged in areas with low access frequency and a single economic attribute. For example, nodes can be added in Shanghai to improve response speed, whereas in Tibet only one grid node needs to be set up.
The different types and extents of spatial data managed by grid nodes complicate data requests, so a certain number of metadata centers must be built according to the scale of the spatial data grid. A metadata center records information about spatial data, e.g. extent, type, precision, time, scale, and service address. A metadata information center is itself a grid that publishes spatial metadata. Metadata centers maintain their integrity and consistency through a subscribe/notify mechanism. They use LDAP (Lightweight Directory Access Protocol) to build the metadata information model and the query protocol. In this way, each grid node can modify only the metadata within its own management area. After one center changes its metadata, it notifies the other centers so that they change their metadata synchronously. Some abnormal situations should be anticipated, for example that other metadata centers cannot work normally because of a breakdown or shutdown. Therefore, once the other centers receive a notice, they are required to send a confirmation message. If the metadata center that sent the change notice does not receive a confirmation from another center within a limited time, it sends an email to that center. After the abnormal metadata center recovers, it modifies its metadata according to this email.
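The notify, confirm, and fall-back-to-email sequence can be sketched as a small in-memory model. Everything here is hypothetical: the `MetadataCenter` class, the `pending` list standing in for the email inbox, and the boolean return value standing in for "no confirmation within the time limit". A real deployment would use LDAP and a network timeout rather than direct method calls.

```python
class MetadataCenter:
    """Hypothetical metadata center that applies change notices from peers."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.metadata = {}
        self.pending = []   # stand-in for the email inbox read after recovery

    def receive(self, key, value):
        """Apply a change and confirm it; an offline center cannot answer."""
        if not self.online:
            return False    # models a confirmation that never arrives in time
        self.metadata[key] = value
        return True

    def recover(self):
        """Come back online and replay the changes that were missed."""
        self.online = True
        for key, value in self.pending:
            self.metadata[key] = value
        self.pending.clear()


def publish_change(source, peers, key, value):
    """Update the source, notify peers, and queue the change for silent peers."""
    source.metadata[key] = value
    unconfirmed = []
    for peer in peers:
        if not peer.receive(key, value):
            peer.pending.append((key, value))   # the "email" fallback
            unconfirmed.append(peer.name)
    return unconfirmed


a = MetadataCenter("A")
b = MetadataCenter("B")
c = MetadataCenter("C", online=False)           # simulated breakdown
missed = publish_change(a, [b, c], "wuhan/vector", {"scale": 50000})
c.recover()
```

After `c.recover()` all three centers hold the same record, which is the consistency property the confirmation step is meant to protect.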
The spatial data grid changes the mode of spatial data management from centralized management on a server to autonomous management by grid nodes. With this transformation, the scale of spatial data management is no longer limited by the capacity of a server's disk array but by the total capacity of the grid nodes. Since grid nodes can be added dynamically at any time, this total is in principle unbounded, so the spatial data grid is able to manage large-scale spatial data. At the same time, it can integrate the resources (especially computing resources) of all grid nodes and can even provide ordinary customers with a QoS that supercomputers alone cannot support.
B. Design of Spatial Data Grid Nodes
Spatial data grid nodes provide basic services and incremental services. This paper discusses the design of basic services. The spatial data grid node is the main part of spatial data management and the provider of the corresponding services, so its capacity, security, and robustness are closely tied to the efficiency and stability of the system. Figure 3 shows the design of a grid node.
In effect, metadata centers act as brokers of resources, spatial data grid nodes act as providers of resources, and customers act as requesters of resources. Figure 2 shows the typical process of a data request.
The principal parts of a grid node are the data management middleware and the metadata management middleware. Grid services are mapped to concrete physical resources by the middleware.
When requesting data, customers first send an XML file describing the data they need, including data type, precision, etc., to the nearest metadata center. During this process the metadata center may be congested, so that some requests cannot be answered immediately. In that case the metadata center forwards the request to another nearby center with a faster connection.
When an abnormality happens during transmission, the monitor resumes the interrupted part according to the backup information.
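The record-then-resume behaviour of the data transmission monitor can be sketched with an in-memory copy loop. This is not GridFTP: the `transfer` function, the `checkpoint` dictionary, and the `fail_after` fault injection are assumptions used only to show how recorded state lets a transfer continue from where it stopped.

```python
import io


def transfer(src, dst, checkpoint, chunk_size=4, fail_after=None):
    """Copy src to dst in chunks, recording progress in checkpoint["offset"].

    fail_after simulates a network fault after that many chunks.
    """
    start = checkpoint.get("offset", 0)
    src.seek(start)
    dst.seek(start)
    sent = 0
    while True:
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network fault")
        chunk = src.read(chunk_size)
        if not chunk:
            return
        dst.write(chunk)
        checkpoint["offset"] = src.tell()   # backup information for recovery
        sent += 1


source = io.BytesIO(b"spatial-data-payload")
target = io.BytesIO()
state = {}
try:
    transfer(source, target, state, fail_after=2)   # first attempt breaks
except ConnectionError:
    transfer(source, target, state)                 # monitor resumes the rest
```

Only the bytes after the recorded offset are sent on the second attempt, so the part that already arrived is not transmitted again.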
The load balancer inspects how spatial data is accessed. Once it detects that a large dataset is being accessed so often within a short interval that communication is blocked, it stores the dataset on several physical nodes and redirects some of the access requests to those other nodes.
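A minimal sketch of this policy, assuming a simple request counter per interval and round-robin routing over replicas, is given below. The `LoadBalancer` class, the `threshold` parameter, and the node names are hypothetical; the paper does not state how hot datasets are detected or how requests are spread.

```python
from collections import defaultdict


class LoadBalancer:
    """Hypothetical balancer: replicate a dataset once it is requested too often."""

    def __init__(self, nodes, threshold):
        self.nodes = list(nodes)          # physical nodes available for replicas
        self.threshold = threshold        # requests per interval that count as "hot"
        self.hits = defaultdict(int)      # request count in the current interval
        self.cursor = defaultdict(int)    # round-robin position per dataset
        self.replicas = {}                # dataset -> nodes holding a copy

    def register(self, dataset, node):
        """Record the single node that initially stores the dataset."""
        self.replicas[dataset] = [node]

    def route(self, dataset):
        """Count the request, replicate if hot, and return the node to serve it."""
        self.hits[dataset] += 1
        holders = self.replicas[dataset]
        if self.hits[dataset] > self.threshold and len(holders) < len(self.nodes):
            extra = [n for n in self.nodes if n not in holders]
            holders.extend(extra)         # store the dataset on the other nodes
        node = holders[self.cursor[dataset] % len(holders)]
        self.cursor[dataset] += 1
        return node

    def reset_interval(self):
        """Start a new measuring interval."""
        self.hits.clear()


lb = LoadBalancer(["node-1", "node-2", "node-3"], threshold=2)
lb.register("image-tile", "node-1")
served = [lb.route("image-tile") for _ in range(5)]
```

The first two requests go to the original node. From the third request on, the dataset counts as hot and requests rotate over all three nodes.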
The data management middleware also offers an attribute statistics service, but this service is executed and requested differently from a data request. In a data request, the metadata center is first queried for a set of addresses where the data is stored, and then the customer (the data requester) sends data requests to these addresses one by one. In the attribute statistics service, the result is the sum of the sub-results from all grid nodes, and the process is carried out not by the customer but by the upper-level grid node. For example, to compute the population under the age of 25 in Hubei province, the Hubei grid node sends the task to its sub-nodes, e.g. Wuhan and Ezhou. These sub-nodes in turn send the task to their own subordinates; for instance, the Wuhan sub-node completes the statistics for its local area and sends the task on to the Huangpi sub-node. Finally, every node submits its result to its higher-level node, which computes the combined result. In this example, the final result is produced at the Hubei province grid node. One difficulty arises in this process: on whose behalf does a grid node act when it sends a task to its sub-nodes? Must a sub-node revalidate the user's rights with the superior node? To solve these problems, the test case in this paper uses the GSI (Grid Security Infrastructure) of the Globus Toolkit, which provides proxy delegation and single sign-on mechanisms.
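The way a statistic is summed up the hierarchy can be sketched as a recursive aggregation. The `StatNode` class and the age lists are toy data invented for the sketch, and security delegation is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class StatNode:
    """Grid node holding local attribute records and a list of sub-nodes."""
    name: str
    ages: list = field(default_factory=list)        # ages of local residents (toy data)
    children: list = field(default_factory=list)


def count_below(node, age_limit):
    """Count local records below age_limit, then add the sub-results of sub-nodes."""
    local = sum(1 for age in node.ages if age < age_limit)
    return local + sum(count_below(child, age_limit) for child in node.children)


huangpi = StatNode("Huangpi", ages=[12, 40, 19])
wuhan = StatNode("Wuhan", ages=[23, 31, 8, 67], children=[huangpi])
ezhou = StatNode("Ezhou", ages=[25, 24])
hubei = StatNode("Hubei", children=[wuhan, ezhou])
total = count_below(hubei, 25)
```

Each node only needs its own data and the results of its direct sub-nodes, so the customer contacts a single node and never sees the lower levels.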
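Figure 3. Design of Grid Node (components shown: unified service interface; data management middleware with load balancer, backup, and data transmission components; metadata management middleware with metadata registration and modification)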
The metadata management middleware provides more detailed metadata than the metadata information grid, including the types of attribute data, data prices, etc. This middleware is responsible for publishing and modifying spatial metadata in the metadata information grid. The data management middleware is responsible for the storage, backup, and updating of spatial data, and also provides a standard input/output API. Basic services, such as file transmission and attribute statistics, are built on the data management middleware. To support high-quality services, the middleware offers tools for optimization and supervision, e.g. replica management and file transmission control.
III.
Owing to limited conditions, the case reported here is a small-scale test that manages only several hundred MB of data in all. Future work will emphasize management across multiple grids and of huge data volumes, as well as the cooperation of computing resources. Nevertheless, the completed test case demonstrates that spatial data can be managed efficiently through grid computing technology.
The middleware does not restrict the storage mode of spatial data: a database system, a file system, or a hybrid system can be used. In this way, every grid node can choose how to build its repository according to its actual situation and can make the fullest use of existing spatial data products. The precondition is that an abstraction layer providing a standard, consistent API is built on top of the storage systems.
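One way to picture such an abstraction layer is a small common interface with a file-backed and a database-backed implementation. The `SpatialStore` interface and its `put`/`get` methods are assumptions made for this sketch, and SQLite is used only as a self-contained stand-in for a database server.

```python
from abc import ABC, abstractmethod
import os
import sqlite3
import tempfile


class SpatialStore(ABC):
    """Hypothetical uniform API that hides how a node stores its data."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class FileStore(SpatialStore):
    """Keeps each dataset as one file under a root directory."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        with open(os.path.join(self.root, key), "wb") as fh:
            fh.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as fh:
            return fh.read()


class DatabaseStore(SpatialStore):
    """Keeps each dataset as a BLOB row (SQLite as a stand-in database)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, data BLOB)")

    def put(self, key, data):
        with self.conn:  # commits on success
            self.conn.execute(
                "INSERT OR REPLACE INTO blobs VALUES (?, ?)", (key, data))

    def get(self, key):
        row = self.conn.execute(
            "SELECT data FROM blobs WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return bytes(row[0])


# The same calling code works against either backend.
stores = [FileStore(tempfile.mkdtemp()), DatabaseStore()]
for store in stores:
    store.put("roads.bin", b"\x00\x01demo")
results = [store.get("roads.bin") for store in stores]
```

Services written against the interface do not need to know which backend a given node has chosen, which is the point of the precondition stated above.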
The test case runs on three computers. Two of them work as spatial data grid nodes; both run the Red Hat 9 operating system with Globus Toolkit 3.2 installed, and their IP addresses are 192.168.2.61 and 192.168.2.33, respectively. One of the grid node computers uses MySQL as the database to store vector data, the other manages ESRI shapefiles, and both manage image data stored in files. The third computer works as the metadata center and CA and runs the Windows 2000 operating system. Grid services are developed with the Globus Toolkit API, version 3.2. Figure 4 shows the overlay of vector data and image data from the two grid nodes.
If a spatial data grid node holds data that satisfy the request, the results are produced as files. This paper handles results in two ways: transmitting them directly to the customer, or storing them locally and permitting remote access. These correspond to the file transmission service and the remote data access service, respectively. Grid nodes choose between the services according to different strategies. When customers exchange data with the grid, if they have already paid for the service, the grid node transmits the files through the file transmission service, which uses the GridFTP protocol. Because network conditions are dynamic and heterogeneous, abnormalities may occur during data transmission, so a supervising tool is needed to handle them. The data transmission monitor in the data management middleware does this work. When the file transmission service starts, the monitor records the state of the transmission.
IMPLEMENTATION OF SPATIAL DATA GRID SERVICES
Figure 4. Result of Case Test
IV.
CONCLUSION
Using grid computing technology to manage large-scale spatial data changes the traditional pattern of spatial data management. Based on grid computing technology, the spatial data grid can become the infrastructure of GIS for sharing spatial data comprehensively and eliminating information islands. The advent of grid computing technology may greatly change GIS software: from running on the desktop at small scale to running on the Internet, and from solving problems of small extent to solving global ones. It is therefore necessary to strengthen research on grid computing and to deepen the combination of grid computing and GIS.