State Transfer Graph: An Efficient Tool for Webview Maintenance

Yan Zhang1 and Xiangdong Qin2

1 Center for Information Science, Peking Univ., Beijing, 100871, China [email protected]
2 Science and Technology Department, Hebei Univ., Baoding, 071002, China [email protected]

Abstract. As the web grows ever more volatile and data-heavy, traditional maintenance approaches for web warehouses perform poorly. In this paper, we first demonstrate that the requests to webviews exhibit continuity, and then we propose the State Transfer Graph (STG), an adaptive tool for webview maintenance. From an STG we can derive many concrete maintenance methods that effectively serve dynamically changing web warehouses. As a demonstration, we illustrate two particular methods named MEDI and VMF. The MEDI approach, which has three states for webviews, significantly outperforms the previous approaches when both web changes and query requests are frequent. The VMF method, which has five states for webviews, is more powerful when the web becomes more dynamic. When enhanced with a global structure, it improves query performance by up to 59.4% over the minimum-update approach.

1 Introduction

1.1 Problem Statement

A webview is a materialized view, usually defined by a query and generally related to a specified topic. Webviews enable users who are interested in specified topics to query a single data structure instead of issuing many queries over different data sources and structures [20]. Previous work [2][4][10] has presented many methods for scheduling the refresh of materialized views. However, as the web becomes more and more dynamic, webview maintenance remains a challenge, especially for large-scale web warehouses that have a tremendous number of webviews and relatively limited refreshing resources. Traditionally there are two extreme approaches for refreshing the integrated views: the lazy-update approach [12][13] re-computes a view for each query request, while the eager-update approach [20] re-computes the views for each update request. However, each of them is optimized only for some special cases. While the amount of data in web warehouses is increasing and the need to provide up-to-date results to the query load becomes more and more urgent, the time window available for refreshing the web warehouses has been shrinking [14]. Therefore, these two approaches can no longer satisfy users.

This motivates a third method: re-compute webviews only when necessary. On an update request, it delays the re-computation, remembers the update request, and turns on a flag. When a query request arrives, it checks the flag. If the flag is on, it re-computes the webview; if not, it directly uses the current value to answer the query. It is easy to prove that this approach performs the minimum number of updates, so it is called the minimum-update approach. It can be regarded as an improvement of the lazy-update approach. It works well when the data sources of the web warehouse (i.e., the real web information) are not very dynamic. However, its query latency is longer than that of the eager-update approach, and this problem becomes very serious when the web information changes rapidly (unfortunately, the real web is changing more and more rapidly). Therefore, we have to pursue more efficient techniques to keep pace with the real web.

1.2 Our Solution and Contribution

By analyzing the maintenance approaches in previous studies, we find a common characteristic: every approach can be described by a group of transfers among some particular webview states. For example, in the eager-update approach, a view has only one state, Active, which means the view is always up-to-date and can be used directly. In the lazy-update approach, a view has only one state, Sleeping, which means the view does not respond to any updates but is computed on the fly. The minimum-update approach provides two states, Active and Sleeping, for the webviews (see Section 3). This observation motivates us to propose the State Transfer Graph (STG), a general paradigm to describe the behaviors of webviews. A state describes a specific status of a webview, while a state transfer graph is an abstraction model that describes how a webview moves among these states. Therefore, a concrete approach can be drawn from a given STG.

According to our study, over a long period, query requests and update requests each tend to arrive in continuous runs, especially when the web becomes more active. This so-called continuity phenomenon makes our STG paradigm self-adaptive to the changing web. When the real web becomes more volatile, with tremendous data and requests, we can introduce more states to model the webview behaviors. As a result, a more suitable STG is produced and the maintenance method derived from this STG performs well. As a demonstration, we present two particular methods named MEDI and VMF in this paper. They have three and five states for webviews, respectively. We show that when the web is not very dynamic, the previous approaches work fairly well. However, when the web changes rapidly, the MEDI approach outperforms the previous approaches significantly. If the web keeps changing at this pace and the queries are also increasing rapidly, we have to apply the VMF method to the webviews. According to our experimental results, the VMF approach performs better than all the previous approaches. For example, it reduces the query latency by up to 44.1% compared with the minimum-update approach. Further, when we enhance VMF with a global structure, we achieve a query performance improvement of up to 59.4%. In addition, this approach is scalable to the dynamic web environment, which means it is beneficial whether the web is active or not.

The rest of this paper is organized as follows. Section 2 gives a brief description of the background and related work. Section 3 introduces “the continuity law”, explains the motivation for proposing STG, and presents the MEDI approach as an example. Section 4 demonstrates VMF, a more powerful method derived from STG. Section 5 discusses the enhancement of VMF with a global structure. Section 6 evaluates all the approaches mentioned in this paper. Finally, we conclude this paper in Section 7.

2 Background and Related Work

Materialized views are an important and powerful technique in data warehouses. They allow users to query potentially terabytes of detail data in seconds by rewriting the queries to use various materialized views. However, since it is meaningless to provide stale data to users, the popularity of materialized views necessitates efficient techniques for their maintenance, especially when the amount of data in a warehouse and the number of materialized views are rapidly increasing [2][8][14][17][21]. Previous work usually uses an incremental approach or builds a global plan. For example, Gupta et al. show that when update sizes are small in relation to the sizes of source data, the incremental method is generally less expensive than recomputing the views from scratch [9]. Agrawal et al. propose the SWEEP algorithm for efficient incremental view maintenance at data warehouses [2]. Mistry et al. exploit common sub-expressions in their efficient maintenance plan [14]. Goldstein and Larson present a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and describe how it can be incorporated in transformation-based optimizers [8]. Yi et al. try to achieve runtime self-maintenance with high probability by maintaining a dynamic top-k view [21].

The above work mainly focuses on the relational data model. As the semi-structured data model is becoming dominant in integrating heterogeneous data sources, researchers have begun to pursue powerful algorithms for semi-structured data. For example, Abiteboul et al. start from the graph-based data model OEM, develop an analytic cost model, and propose an algorithm that produces a set of queries that compute the changes to the view based upon a change to the source [1]. Papakonstantinou and Vassalos present an algorithm to find rewriting queries for a given semistructured query and a set of semistructured views [16]. Cluet et al. believe XML will play an important role in the world of the web, and they present a view mechanism for their Xyleme system, addressing the problems of defining, storing, and using views in a web-scale database [18].

However, these studies do not consider the difference between materialized views in traditional data warehouses and webviews in web warehouses. Traditional data warehouses usually know the changes of base data, while the data sources in a web environment usually do not propagate their changes to the information consumers. Therefore, a web warehouse has to employ a probing mechanism to detect the changes of base data [4].

Cho et al. synchronize an integrated database to improve its freshness by detecting the change frequency of the base data [3][5]. However, they focus on web pages and do not consider webviews, which consist of information that comes from multiple web pages.

3 The Continuity Law and the State Transfer Graph

Both query response time and system maintenance cost are important in a large-scale web warehouse. Therefore, we can formulate the webview refreshing problem as follows: how should the refreshing resource be allocated to each webview so that the average query response time is minimized? Generally speaking, this problem is NP-hard, just as the multiple query optimization problem [19]. However, there are some effective approximate algorithms.

3.1 The Continuity Law

Let us start our analysis from the eager-update approach [20]. Our motivation is to reduce the number of re-computations without significantly increasing the response time. With the explosion of web information, both the query requests and the update requests in web warehouses become more and more frequent. If each update to the base data leads to an immediate refresh of webviews, the heavy maintenance burden will cause serious performance degradation. Through long-term observation, we find that a webview does not receive constant attention during its lifespan. A webview is generally related to a specified topic, and users show special interest in a given topic only in some particular periods. Therefore, the query requests to a specific view are very frequent in these periods, and often continuous; the same holds for the update requests. In other words, a webview usually receives continuous query requests and continuous update requests, as shown in Figure 1.

Fig. 1. During the lifespan of a webview, the update and query requests dominate in turn. More particularly, they tend to cluster together.
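
The clustering illustrated in Fig. 1 can be made concrete with a small sketch. It assumes a request trace encoded as a string of 'Q' (query) and 'U' (update) characters; the function names, the window size, and the tie-breaking rule are illustrative choices rather than anything prescribed by the study. The first function measures how often a request has the same type as its predecessor, and the second guesses the next request type as the majority type within a recent window.

```python
from collections import Counter


def continuity_ratio(trace: str) -> float:
    """Fraction of requests whose type equals the previous request's type."""
    if len(trace) < 2:
        return 0.0
    same = sum(1 for prev, cur in zip(trace, trace[1:]) if prev == cur)
    return same / (len(trace) - 1)


def predict_next(trace: str, window: int = 5) -> str:
    """Guess the next request type as the majority type in the recent window.

    Ties are broken in favour of the most recent request (an assumption).
    """
    recent = trace[-window:]
    if not recent:
        raise ValueError("empty trace")
    counts = Counter(recent)
    if counts["Q"] == counts["U"]:
        return recent[-1]
    return "Q" if counts["Q"] > counts["U"] else "U"


clustered = "QQQQQUUUUUQQQQQ"    # requests arrive in runs
alternating = "QUQUQUQUQUQUQUQ"  # requests alternate strictly
print(continuity_ratio(clustered), continuity_ratio(alternating))
print(predict_next(clustered))
```

A trace with long runs yields a ratio close to 1, while a strictly alternating trace yields 0, so the ratio gives a rough indication of how much a maintenance policy could gain from anticipating the next request type.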

Previous studies also validate our observation of this continuity [6][7][11][15]. Cho et al. demonstrate that each web page has its infant, expansion, and maturity stages [6]. The page popularity increases rapidly in its expansion stage, and this increase is much more sudden under the search-dominant model than under the random-surfer model. We believe the same thing happens to webviews: at the infant stage, a webview is always silent, while after the expansion it receives more and more query requests, usually continuous. When studying the change frequency distribution, Ntoulas et al. observe that a significant fraction of web pages (around 50%) never changed at all during the course of their study, while another quite large portion (15%) changed very frequently [15]. These observations also suggest that query requests and update requests will each usually be continuous.

Based on the observation of this continuity, and through analyzing the behaviours of many webviews as well as running many experiments, we propose “the continuity law” of temporal locality for the requests to webviews, which we believe is both well founded and practical to apply: the next request to a webview is likely to have the same type as the requests that occurred most frequently in the recent period. Inspired by “the continuity law”, we can improve the previous maintenance methods. For example, there is no need to do an immediate re-computation for an update request if we know this update request is followed by another update request. To make this idea clearer, we introduce the term state for webviews and an effective tool, the state transfer graph, to describe the states further.

Fig. 2. The state transfer graphs.

3.2 State Transfer Graph

A state describes a specific status of a webview. Each state has its attributes, and the activity is the main one. The activity of a webview increases when it receives queries and decreases when it receives updates. The current state of a webview describes its current activity and reflects the ratio of query requests to update requests. A state transfer graph is an abstraction model that describes how a view moves among these states. It is a directed graph in which the nodes are the states, and the directed edges represent the transfer conditions as well as the actions that should be taken. We illustrate the state transfer graphs for the previous approaches in Figure 2. “Q+r” means that when a view receives a query (Q) request, it should be re-computed. In the eager-update approach, a view has one state, Active (A for short), which means the view is always up-to-date and can be used directly. To keep the view always Active, any update to the base data leads to an immediate refresh of this view. In the lazy-update approach, a view has one state, Sleeping (S for short), which means the view does not respond to any updates at all and is only evaluated on demand. For the minimum-update approach, there are two states for each view, Active and Sleeping, and a view transfers between these two states.
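
The three earlier approaches can be written down as small transition tables, which may help to see how an STG determines a maintenance method. The sketch below is an illustrative encoding, not the authors' implementation: the dictionary layout, the `replay` helper, and the sample trace are assumptions, and the minimum-update edges are read from the flag-based description in Section 1.1.

```python
from typing import Dict, Tuple

# (current state, request type) -> (next state, recompute the view?)
STG = Dict[Tuple[str, str], Tuple[str, bool]]

EAGER: STG = {
    ("A", "Q"): ("A", False),  # always fresh, answer directly
    ("A", "U"): ("A", True),   # every update triggers a refresh
}
LAZY: STG = {
    ("S", "Q"): ("S", True),   # "Q+r": recompute on every query
    ("S", "U"): ("S", False),  # updates are ignored
}
MINIMUM_UPDATE: STG = {
    ("A", "Q"): ("A", False),  # flag off: use the current value
    ("A", "U"): ("S", False),  # remember the update, turn the flag on
    ("S", "U"): ("S", False),  # flag is already on
    ("S", "Q"): ("A", True),   # flag on: recompute, then answer
}


def replay(stg: STG, start: str, trace: str) -> Tuple[str, int, int]:
    """Replay a trace of 'Q'/'U' requests.

    Returns (final state, number of recomputations,
    number of queries that had to wait for a recomputation).
    """
    state, recomputations, delayed_queries = start, 0, 0
    for request in trace:
        state, recompute = stg[(state, request)]
        if recompute:
            recomputations += 1
            if request == "Q":
                delayed_queries += 1
    return state, recomputations, delayed_queries


trace = "UUUQQ"  # three updates followed by two queries
for name, stg, start in [("eager", EAGER, "A"),
                         ("lazy", LAZY, "S"),
                         ("minimum-update", MINIMUM_UPDATE, "A")]:
    print(name, replay(stg, start, trace))
```

On this trace the eager table recomputes three times with no delayed query, the lazy table recomputes twice with both queries delayed, and the minimum-update table recomputes once with one delayed query, which mirrors the trade-off described in the text.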

Fig. 3. The MEDI approach.

Fig. 4. The VMF approach.

3.3 The MEDI Approach

Although the minimum-update approach achieves the minimum maintenance cost, sometimes we have to endure a long query response time, e.g., a sleeping view must be refreshed before being used. The reason is that the two existing states in the minimum-update approach represent the two extreme conditions. To alleviate this long latency, we introduce a state M (which stands for Median, a state between Active and Sleeping). A view in state M is up-to-date and can be used directly. Thus the query latency can be partially reduced. Figure 3 illustrates this approach, which is named MEDI (for MEDIan). In the MEDI approach, when a view in state A receives an update request, it is re-computed immediately, thereby avoiding the latency for the next query. If the next request happens to be a query, we benefit from this precomputation. However, the activity of this view decreases, so its state transfers to M. The MEDI approach takes advantage of “the continuity law”. By introducing more states into a system, we gain more flexibility to trade off the maintenance cost against the query response time. Moreover, we have more choices for the attributes of states. In fact, we can obtain much more improvement in overall system performance if we analyze more deeply and propose more appropriate states, as shown in the next section.
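
A three-state table in the spirit of MEDI can be sketched as follows. Only the edge out of state A on an update (re-compute, then move to M) is stated in the text; every other edge below is an assumption made to obtain a runnable example, since Figure 3 is not reproduced here. The `replay` helper and the traces are likewise hypothetical.

```python
from typing import Dict, Tuple

# (current state, request type) -> (next state, recompute the view?)
MEDI_SKETCH: Dict[Tuple[str, str], Tuple[str, bool]] = {
    ("A", "U"): ("M", True),   # stated in the text: refresh at once, drop to M
    ("A", "Q"): ("A", False),  # assumption: an active view stays active
    ("M", "Q"): ("A", False),  # assumption: M is fresh, a query raises activity
    ("M", "U"): ("S", False),  # assumption: a further update defers the refresh
    ("S", "U"): ("S", False),  # assumption: remember the update, keep sleeping
    ("S", "Q"): ("M", True),   # assumption: refresh on demand, regain activity
}


def replay(stg, start: str, trace: str) -> Tuple[str, int, int]:
    """Return (final state, recomputations, queries delayed by a recomputation)."""
    state, recomputations, delayed_queries = start, 0, 0
    for request in trace:
        state, recompute = stg[(state, request)]
        recomputations += recompute
        delayed_queries += recompute and request == "Q"
    return state, recomputations, delayed_queries


# Updates interleaved with queries: the eager refresh in A pays off.
print(replay(MEDI_SKETCH, "A", "UQUQ"))
# A run of updates: the first refresh is wasted and the query still waits.
print(replay(MEDI_SKETCH, "A", "UUUQ"))
```

Under these assumed edges, the interleaved trace incurs no delayed query at the price of two recomputations, whereas the run of updates shows the wasted precomputation that motivates a finer set of states.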

4 VMF Approach: A More Powerful Method from STG

The state transfer graph (STG) is a general paradigm to describe the behaviors of webviews. When web information becomes more volatile and fleeting, we can introduce more states for webviews to describe their volatility. As a result, we get a more suitable STG and then derive a better maintenance algorithm. In this section we develop a more efficient approach, VMF (View Maintenance Based on A Five-State Transfer Graph), which can be regarded as an upgraded version of the MEDI approach. In the MEDI approach, we refresh the webviews for each update request, hence the views can be used directly when queries arrive. However, if the next request is still an update, the previous re-computation is wasted. This suggests we should predict the next requests in the request sequence more accurately.

We believe more states bring more flexibility. By splitting the states A and S, we introduce five states for a webview, A (Active), SA (Semi-Active), M (Median), SS (Semi-Sleeping) and S (Sleeping), instead of the three states in MEDI. Each state has its attributes, and the main one is the activity. When a view receives a query request, its activity increases. In contrast, its activity decreases when it receives update requests. In general, the activities of states A, SA, M, SS and S decrease in that order. The first three states are up-to-date. The main difference between Active and Semi-Active is that a less active view usually cannot transfer to the Active state directly unless it obtains enough activity. The main difference between Sleeping and Semi-Sleeping is that a Sleeping view will not transfer to the Semi-Active state directly; it must go through the Median state. In VMF, our idea is to keep the active views active and the inactive views sleeping.

Figure 4 illustrates the idea of VMF. Q means a query request, U means an update request, r means re-computation of the view, and q and u represent the number of user query requests and the number of update requests, respectively. The edge $A \xrightarrow{U+r:\;q \ge u} SA$ means that when an active view receives an update request and q ≥ u, the view is re-computed and then transfers from state A to state SA.
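
The edge notation used above, a request type, an optional re-computation, and a condition on q and u, can be represented as a guarded edge. The sketch below encodes only the single edge spelled out in the text; the `Edge` class and the `step` function are hypothetical names, and the remaining VMF edges from Figure 4 are deliberately left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edge:
    src: str                           # state the edge leaves
    request: str                       # "Q" or "U"
    guard: Callable[[int, int], bool]  # condition on (q, u)
    recompute: bool                    # True for the "+r" action
    dst: str                           # state the edge enters


# The only edge given in the text: A --(U+r: q >= u)--> SA.
EDGES: List[Edge] = [
    Edge("A", "U", lambda q, u: q >= u, True, "SA"),
]


def step(state: str, request: str, q: int, u: int) -> Optional[Tuple[str, bool]]:
    """Return (next state, recompute?) or None if no listed edge applies."""
    for edge in EDGES:
        if edge.src == state and edge.request == request and edge.guard(q, u):
            return edge.dst, edge.recompute
    return None  # this sketch does not cover the remaining edges


print(step("A", "U", q=5, u=3))  # guard holds: recompute and move to SA
print(step("A", "U", q=2, u=3))  # guard fails: not covered by this sketch
```

Keeping the guard as a function of q and u means that further edges could be appended to `EDGES` without changing `step`.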

5 Algorithm Enhancement with A Global Structure

Although VMF achieves good system performance, some important factors are still omitted in the basic algorithm. For example, the relationship between the re-computation cost and the query latency is very important. Moreover, we should pay more attention to the views with high query frequencies. Since our goal is to minimize the query response time (under the constraint that the system capacity is a constant), we should allocate more system resources to the views whose re-computation costs are lower or whose importance is relatively higher. Thus we can achieve more benefit at the same cost. In Figure 5 we present the Enhanced-VMF approach, which illustrates this idea in detail. Here k is an abbreviation of $k_i$, defined as $k_i = \frac{q_i / TQ}{r_i / TR}$, where $q_i$ is the number of query requests to the view $v_i$, $TQ$ is the total number of query requests to all the views in the web warehouse, $r_i$ is the re-computation cost of this view, and $TR$ is the total re-computation cost of all the webviews. By introducing the parameter k, we can keep the important webviews up-to-date, thereby saving query response time. The following example demonstrates this enhancement. In VMF, when a Semi-Active view receives an update request, it transfers to Semi-Sleeping immediately. However, this may not be the case in the Enhanced-VMF approach. Whether it transfers to state SS is determined by the values of q, k and u. If k multiplied by q is larger than or equal to u, this view is re-computed and stays in state SA. Thus it can be used directly when receiving query requests. In the VMF approach, each view works separately. However, when we introduce the parameter $k_i$ for each view, all of the views collaborate. Therefore the parameter $k_i$ brings us a global structure.

Fig. 5. The Enhanced-VMF approach.
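
The definition of the parameter k and the Semi-Active rule can be checked with a short numeric sketch. The two views and all their counts and costs below are made-up values for illustration only, and the function names are hypothetical.

```python
def importance(q_i: float, total_queries: float,
               r_i: float, total_recompute_cost: float) -> float:
    """k_i = (q_i / TQ) / (r_i / TR): query share divided by cost share."""
    return (q_i / total_queries) / (r_i / total_recompute_cost)


def stays_semi_active(k: float, q: int, u: int) -> bool:
    """Enhanced-VMF rule for a Semi-Active view that receives an update:
    recompute and stay in SA when k * q >= u."""
    return k * q >= u


# Hypothetical warehouse with two views (illustrative numbers only).
TQ, TR = 100, 40                   # total queries, total recomputation cost
k_hot = importance(80, TQ, 10, TR)   # frequently queried, cheap to refresh
k_cold = importance(20, TQ, 30, TR)  # rarely queried, expensive to refresh
print(k_hot, k_cold)

# Same local counters (q = 3 queries, u = 5 updates), different outcomes.
print(stays_semi_active(k_hot, q=3, u=5))   # hot view is kept fresh in SA
print(stays_semi_active(k_cold, q=3, u=5))  # cold view is not
```

Because TQ and TR are sums over all views, a change in one view's query share or cost shifts the k values of the others, which is the sense in which the parameter couples the views together.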

6 Evaluation

6.1 Testbed Setup

We implemented the MEDI algorithm and the VMF algorithm, as well as the previous approaches, in our testbed, a web warehouse mainly focused on electronic commerce information. We collected 1.2 million web pages and stored them locally. After that we constructed 10,000 materialized views based on these real data. The building process of the global structure was very complicated. First, we detected the changes to the 1.2 million source web pages on the real web, and then chose the most frequently accessed pages and constructed a basic set of materialized views based on these pages. Second, we generated a large number of user query requests, simulating queries to the real websites. We recorded the queries and ranked them by query rate, added new materialized views that users wanted, and deleted the obsolete views and those nobody accessed. After one month, the web warehouse was finally built and could be used for evaluation.

6.2 Overall Results

In the experiments, we make the locally stored base data change randomly to simulate the behavior of the real web. The warehouse changes accordingly. We make the source data change at the rate of 15 pages per second, a reasonable number for e-commerce web pages. By stochastically choosing queries from the set of real user requests and varying the query frequency to the web warehouse, we obtain the experimental results for the basic comparison.

Fig. 6. Comparison of different approaches. The source data change frequency is 15 pages/second.

Fig. 7. Comparison of different approaches. The query frequency is 20 requests/second.

In Figure 6, the unit of “average maintenance cost” is the percentage of the system processing time that is spent on maintenance. Figure 6 shows that the performance of MEDI and VMF improves gracefully as the query rate increases. The eager-update approach achieves the shortest query response time of the four approaches. VMF and MEDI also perform well, whereas the lazy-update and minimum-update approaches take somewhat longer. Compared with the minimum-update approach, VMF reduces the query latency by up to 44.1%. In fact, when source data is fetched from the real web rather than over a LAN, obtaining the latest views takes much longer, so the advantage of VMF over the minimum-update approach will be even larger. We then keep the query frequency unchanged (20 requests/sec) and vary the change frequency of the base data. In Figure 7, although the eager-update approach achieves the minimum query response time, its maintenance cost increases linearly with the base data change frequency. VMF takes a little longer than eager-update; however, its maintenance cost increases very slowly with the change frequency, eventually reaching a threshold. The minimum-update approach has a considerably longer query response time, though it consumes the fewest system resources.

Fig. 8. Making the “web” more dynamic: increasing the base data change frequency.

6.3 The More Dynamic Web

Now we make the “web” more dynamic. We accelerate the update rate of the base data to, say, 30 pages per second. Keeping all the other parameters unchanged, we obtain new measurement results. As shown in Figure 8, when the change frequency increases, the maintenance cost of the materialized method increases markedly. In contrast, the cost of VMF increases only slightly. Compared with the results in Figure 6, although the query latencies of MEDI and VMF are similar, the gap between their maintenance costs becomes larger. This indicates that VMF is more suitable for a rapidly changing workload than MEDI.

6.4 Validating the Power of A Global Structure

Compared with the basic VMF approach, Enhanced-VMF adds a global structure. How powerful is a global structure? Let us examine it. In the pure Enhanced-VMF approach, we must obtain the re-computation cost and the query rate for each view, which is very complicated. For simplicity, we classify the views into different groups instead of calculating the values for each view, and regard the views in the same group as equivalent. We have two classification methods, using maintenance cost or query frequency.

First, we categorize the views into two groups according to their maintenance costs. The average maintenance cost of group 1 is 3.8 times that of group 2. We access the warehouse 20 times per second. The change frequency of the base data is 30 pages/second. According to our experimental results, the maintenance costs are almost equal in the VMF and the Enhanced-VMF approaches. However, the average query latency in the Enhanced-VMF approach is 0.046 sec, which is less than the average query latency in the VMF approach, 0.051 sec. The improvement is 9.8%. We change the average maintenance costs by choosing different views for the two groups and obtain similar results.

Using the query frequency to classify webviews is more effective than using the maintenance cost. We vary the average query frequencies of the two groups in our experiment. When the ratio of the average query rate between the two groups increases, the Enhanced-VMF approach achieves better results. As shown in Figure 9, Enhanced(I)-1, Enhanced(I)-2 and Enhanced(I)-3 represent the cases where the ratio of the average query rate between the two groups is 2, 4 and 8, respectively. In Enhanced(II), we partition the webviews into four groups and the query frequency ratio among the four groups is 1:2:4:8. The average query latency in Enhanced(II) is 0.037 sec. It improves the query performance by up to 27.5% over the VMF approach and 59.4% over the minimum-update approach. As we see, more groups can help; however, the additional benefit is very slight. In other words, it is not worthwhile to employ a much more complex maintenance algorithm for the marginal additional benefit.
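
The improvement figures quoted above follow from the reported latencies by the usual relative-reduction formula, as the short check below shows. The helper names are hypothetical. The last line derives the minimum-update latency that a 59.4% improvement would imply; that value is an inference from the quoted percentage, not a measurement reported in the text.

```python
def improvement(baseline: float, improved: float) -> float:
    """Relative latency reduction: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline


def implied_baseline(improved: float, reduction: float) -> float:
    """Baseline latency implied by an improved latency and a relative reduction."""
    return improved / (1.0 - reduction)


vmf, enhanced_cost, enhanced_ii = 0.051, 0.046, 0.037  # seconds, from the text

print(round(100 * improvement(vmf, enhanced_cost), 1))  # grouping by cost
print(round(100 * improvement(vmf, enhanced_ii), 1))    # Enhanced(II)
print(round(implied_baseline(enhanced_ii, 0.594), 3))   # inferred, not reported
```

The first two values reproduce the 9.8% and 27.5% quoted in the text.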

Fig. 9. Comparison of the VMF approach and the four enhanced approaches. In the Enhanced(I)-1, 2, 3 approaches, there are two groups of webviews. The Enhanced(II) approach has four groups for the webviews.

7 Conclusion

As the capacity of web warehouses is rapidly increasing, the time available for maintaining webviews is dwindling, which pushes researchers to pursue more efficient methods. In this paper, we first demonstrate that the requests to webviews obey “the continuity law”, and then we propose the State Transfer Graph (STG), a powerful and adaptive tool for webview maintenance. When the web becomes more dynamic, we can produce more efficient methods from the corresponding STGs. To the best of our knowledge, our work is the first study that employs a state-graph approach to solve the maintenance problems in web warehouses, or more generally, in data warehouses. In this paper we present a three-state transfer graph method named MEDI and a five-state transfer graph method named VMF. We demonstrate that both of them are very efficient; however, VMF outperforms MEDI when the web warehouse becomes more dynamic. As drawn from STG, VMF lacks co-operation among views, which means each view works separately. To enhance such an approach, we establish a global structure by introducing a parameter ki. As expected, this parameter enables the warehouse to keep the frequently accessed views up-to-date, thereby reducing the query latency significantly.

References

1. S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, and J. L. Wiener. Incremental maintenance for materialized views over semistructured data. In Proceedings of VLDB’98, Aug. 1998.
2. D. Agrawal, A. E. Abbadi, and A. Singh. Efficient view maintenance at data warehouses. In Proceedings of SIGMOD, pages 417–427, 1997.
3. J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proceedings of SIGMOD, 2000.
4. J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Transactions on Internet Technology, 3(3), August 2003.
5. J. Cho and A. Ntoulas. Effective change detection using sampling. In Proceedings of 28th International Conference on Very Large Databases, September 2002.
6. J. Cho and S. Roy. Impact of web search engines on page popularity. In Proceedings of the World-Wide Web Conference (WWW), May 2004.
7. G. K. Zipf. Human Behavior and Principle of Least-Effort. Addison-Wesley, Cambridge, MA, 1949.
8. J. Goldstein and P.-A. Larson. Optimizing queries using materialized views: a practical, scalable solution. In Proceedings of SIGMOD, 2001.
9. A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining views incrementally. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 157–166. ACM Press, 1993.
10. H. Gupta. Selection of views to materialize under maintenance cost constraint. In Proceedings of International Conference on Database Theory, pages 453–470, 1999.
11. L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. On the implications of Zipf’s law for web caching. In Proceedings of IEEE INFOCOM’99, March 1999.
12. B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and C. Schlepphorst. Managing semistructured data with FLORID: A deductive object-oriented perspective. Information Systems, 23(8):589–613, 1998.
13. B. Ludäscher, Y. Papakonstantinou, and P. Velikhov. A framework for navigation driven lazy mediators. In Proceedings of WebDB’99, 1999.
14. H. Mistry, P. Roy, S. Sudarshan, and K. Ramamritham. Materialized view selection and maintenance using multiquery optimization. In Proceedings of SIGMOD, 2001.
15. A. Ntoulas, J. Cho, and C. Olston. What’s new on the web? The evolution of the web from a search engine perspective. In Proceedings of the World-Wide Web Conference (WWW), May 2004.
16. Y. Papakonstantinou and V. Vassalos. Query rewriting for semistructured data. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 455–466. ACM Press, 1999.
17. N. Roussopoulos. Materialized views and data warehouses. SIGMOD Record, 27(1), March 1998.
18. S. Cluet, P. Veltri, and D. Vodislav. Views in a large scale XML repository. In Proceedings of 27th International Conference on Very Large Databases (VLDB2001), pages 271–280, September 2001.
19. T. Sellis and S. Ghosh. On the multiple-query optimization problem. IEEE Transactions on Knowledge and Data Engineering, 2(2):262–266, 1990.
20. L. Xyleme. A dynamic warehouse for XML data of the web. In IEEE Data Engineering Bulletin, 2001.
21. K. Yi, H. Yu, J. Yang, G. Xia, and Y. Chen. Efficient maintenance of materialized top-k views. In Proceedings of 19th International Conference on Data Engineering, March 2003.