Promotional Subspace Mining with EProbe Framework

Yan Zhang
Vermont Information Processing, 402 Watertower Circle, Colchester, Vermont 05446, U.S.A.
[email protected]

Yiyu Jia
Vermont Information Processing, 402 Watertower Circle, Colchester, Vermont 05446, U.S.A.
[email protected]

Wei Jin
Dept. of Computer Science, North Dakota State Univ., 258 IACC Building A9, Fargo, North Dakota 58108-6050, U.S.A.
[email protected]

ABSTRACT

In multidimensional data, Promotional Subspace Mining (PSM) aims to find outstanding subspaces for a given object and to discover meaningful rules from them. One major research issue in PSM is how to produce top subspaces efficiently given a predefined subspace ranking measure. A common approach is to compute an exact solution, searching the entire subspace search space and evaluating the target object's rank in every subspace, assisted with possible pruning strategies. In this paper, we propose EProbe, an Efficient Subspace Probing framework. This novel framework introduces the idea of an "early stop" of the top-subspace search process. The essential goal is to provide a scalable, cost-effective, and flexible solution whose accuracy can be traded for efficiency using adjustable parameters. The framework is especially useful when computation resources are insufficient and only a limited number of candidate subspaces can be evaluated. As a first attempt to seek solutions under the EProbe framework, we propose two novel algorithms, SRatio and SlidingCluster. Our experiments illustrate that these two algorithms produce a more effective subspace traversal order: the top-k subspaces included in the final results are evaluated in the early stage of the subspace traversal process.
1. INTRODUCTION
In many real-world applications, objects are ranked based on a predefined score measure. Whether a person or a merchandise product, an object can be ranked because comparable objects exist in the same category. For example, an athlete has other athletes as competitors in terms of performance score, and a product has other similar products as competitors in terms of sales revenue. It is worth noting that, in multi-dimensional data, a target object's rank may vary across subspaces. Finding the potential top subspaces can be very useful in many areas. One motivating application is to promote the target object using the subspaces in which it ranks high. For example, when promoting a notebook computer, the statement "ranked 3rd in terms of battery life" can be more informative to a potential buyer than "ranked 15th in an overall feature evaluation".
The problem then becomes: how do we pinpoint these top subspaces? In [1], the authors proposed Promotion Query, a database query facility that returns top-ranked subspaces for a multidimensional data set. They propose two categories of methods: (1) evaluating the entire subspace search space, assisted with some pruning techniques; and (2) constructing data cubes that perform some calculations off-line, so as to achieve fast query operations in OLAP applications. However, the major challenges facing these methods are scalability and the complexity of data cube design and maintenance.

With these research issues in mind, we suggest a completely different approach to producing top subspaces in multi-dimensional data. We propose EProbe, an Efficient Subspace Probing framework. As opposed to commonly adopted strategies that produce an exact solution, this novel framework aims at generating a flexible solution whose accuracy can be traded for efficiency using adjustable parameters. Ideally, when the computation resources suffice for the task, an exact solution is produced; otherwise, the accuracy of the result can be sacrificed in return for a computational speedup. As a first effort to implement the EProbe framework, we propose two heuristics, which lead to two novel algorithms, SRatio and SlidingCluster. The former applies score-ratio (SR) based subspace sorting to obtain a sorted subspace set. The latter adds subspace sampling from sliding subspace clusters. Both algorithms strive to achieve an "early stop" of the subspace search once a certain number of top subspaces have been probed and evaluated. Using two evaluation metrics, AVG TraceIndex and Coverage, we compare SRatio and SlidingCluster with a baseline algorithm, DFP (Depth First Parent Subspace Pruning). We show a remarkable superiority of algorithm SRatio (SlidingCluster with w = 1) over DFP, and a consistent and significant improvement of algorithm SlidingCluster over SRatio, when only a limited number of candidate subspaces can be evaluated.

The remainder of this article is organized as follows. In Section 2, we define relevant concepts and introduce the necessary notation. In Section 3, we present the EProbe framework and the heuristics behind algorithms SRatio and SlidingCluster. In Section 4, we describe our data and present the experimental results. We conclude and discuss future work in Section 5.
2. PROBLEM STATEMENT
We first introduce preliminaries and definitions in Section 2.1, and then formulate our problem in Section 2.2.
2.1 Preliminaries and Definitions
Assume we are given a data set D consisting of n instances. Assume also that D has d categorical attributes A = {A_1, A_2, ..., A_d}, an object attribute A_obj denoting object IDs, and a score attribute A_score denoting the non-negative score of the corresponding object in a specific subspace. We denote the domain from which a variable X takes its value by dom(X). Specifically, we denote the complete set of competitive objects by O = dom(A_obj), in which the target object ID t_a ∈ O is given by the user. We also have dom(A_score) = R^+.

A subspace is represented as S = {A_1 = a_1, A_2 = a_2, ..., A_d = a_d}, where a_i ∈ dom(A_i) or a_i = *. The value "*" means "any": every value of the corresponding attribute is included. S induces a projection D_S (⊆ D) of the data set and a subspace of objects O_S (⊆ O). We write the full attribute space as S_* = {A_1 = *, A_2 = *, ..., A_d = *}, or S_* = {*} for short. We denote by N_0 the number of candidate subspaces in the entire subspace search space. Due to space limits, please refer to [1] for the definitions of the concepts "Parent-Child Subspace", "Seen-Unseen Subspace", "Rank of an object", "Significance Sig", and "Promotiveness P".

Definition 1 (SR: Score Ratio of an Object). Let SUM_S(t_a) denote the score of target object t_a in subspace S, and let the total score of all objects in S be the sum of SUM_S(t) over t ∈ O_S. The score ratio SR of t_a in S is the ratio of these two quantities:

$$\mathrm{SR}_S(t_a) = \frac{\mathrm{SUM}_S(t_a)}{\sum_{t \in O_S} \mathrm{SUM}_S(t)} \qquad (1)$$
Definition 2 (TraceIndex of a Subspace). The TraceIndex of subspace S, TraceIndex(S), is the ordinal position at which S is evaluated; TraceIndex(S) ∈ [1, N_0].

Definition 3 (Distance between Two Subspaces). Given subspaces S_1 = {A_1 = a_11, A_2 = a_21, ..., A_d = a_d1} and S_2 = {A_1 = a_12, A_2 = a_22, ..., A_d = a_d2}, the distance between S_1 and S_2 is the Hamming distance between them, i.e. the number of positions at which the corresponding attribute values differ [2]:

$$\mathrm{Distance}(S_1, S_2) = \sum_{i=1}^{d} I(a_{i1} \neq a_{i2}) \qquad (2)$$

In Eq. (2), I denotes an indicator function, which returns 1 when the values of the corresponding attribute differ and 0 otherwise. If S_1 is a parent subspace of S_2, then Distance(S_1, S_2) = Distance(S_2, S_1) = 1.
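To make Definitions 1 and 3 concrete, the following is a minimal Java sketch of both measures. The paper's implementation uses JDK 1.6, but this fragment is ours, not the authors' code; the subspace encoding and the precomputed score map are assumptions.

    import java.util.Map;

    class SubspaceMeasures {
        // Eq. (1): score ratio of target object t_a in subspace S.
        // sumScores maps every object ID t in O_S to its aggregated score SUM_S(t).
        static double scoreRatio(String targetId, Map<String, Double> sumScores) {
            double total = 0.0;
            for (double s : sumScores.values()) total += s;
            Double target = sumScores.get(targetId);
            return (target == null || total == 0.0) ? 0.0 : target / total;
        }

        // Eq. (2): Hamming distance between two subspaces over the same d attributes.
        // Each subspace is a String[d] of attribute values, with "*" meaning "any".
        static int distance(String[] s1, String[] s2) {
            int diff = 0;
            for (int i = 0; i < s1.length; i++)
                if (!s1[i].equals(s2[i])) diff++;   // indicator I(a_i1 != a_i2)
            return diff;
        }
    }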
2.2 Problem Formulation
Based on the notation and definitions of Section 2.1, we state the target problem as follows. Assume we are given a multi-dimensional data set D, a target object t_a, and a subspace ranking measure P. Let I_k be the subspace set containing the k subspaces with the largest P, and let r (r ≤ N_0) be the number of subspaces that can be evaluated. For a given r, the proposed problem is to find a subspace set I_r (|I_r| = r) such that k_r, the number of these r subspaces that belong to I_k (k_r ≤ k), is maximized.
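Equivalently, writing \mathcal{S} for the set of all N_0 candidate subspaces (this symbol is our shorthand, not from the original notation), the problem is the budgeted selection

$$I_r^{*} = \arg\max_{I_r \subseteq \mathcal{S},\; |I_r| = r} k_r, \qquad k_r = |I_r \cap I_k|.$$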
3. EPROBE FRAMEWORK
EProbe is an efficient subspace probing framework that aims to provide a scalable, cost-effective, and flexible solution for producing top-k subspaces in multidimensional data. The essential concept of EProbe is an accuracy-efficiency tradable solution under limited computation resources. Following this framework, we propose the idea of an "early stop" of the subspace search, once a certain number of top subspaces have been probed and evaluated. Specifically, we intend to find a special subspace traversal order such that subspaces evaluated earlier have a higher probability of being included in the final top-k subspace result set. Ideally, when the computation resources suffice for the task, an exact solution is produced; otherwise, the accuracy of the result can be sacrificed in return for a computational speed-up.

The question is: how feasible is it to design an "early stop" criterion? It is impossible to define an exact measure that correlates positively with P while maintaining a time complexity at the same level as or lower than that of P, because the target object's Rank is non-monotonic over parent-child subspaces, and the subspace ranking measure P is a monotonic function of Rank. However, considering the attributes of real-world data, it is still feasible to produce a more efficient subspace traversal order. For a candidate subspace and a given target object, the value of the ranking measure P is closely related to two factors, one internal and one external. First, it is positively correlated with the score of the target object itself. Second, it is negatively correlated with the combined score effects of the other competitive objects. In other words, both a higher score of the target object and a lower score of the competitive objects contribute to a higher P. Motivated by the observation that the score distribution of all competitive objects partly reflects their comparative positions, we develop the following heuristic.

Heuristic 1: Given target object t_a, a subspace's score ratio SR tends to be positively correlated with the measure P.

Based on this heuristic, we propose algorithm SRatio, in which the subspace traversal order is determined by the score ratio SR: the SR of every candidate subspace is calculated first, and the P values are then evaluated in descending order of the corresponding SR (see the sketch after Figure 1 below). Although the score ratio SR is a good indicator of the comparative position of the target object among its peers, it cannot serve as a precise predictor of the P value, because the external factor, namely the score fluctuation of the other competitive objects, also plays an important role. As shown in Figure 1 (c), it is more effective if a learning and/or adjustment module that considers this external factor is appropriately designed.
[Figure 1: Algorithm Design under EProbe Framework. (a) random or non-metric based traversal; (b) a predefined metric, ordered traversal; (c) a predefined metric with learning and adjustment, ordered traversal.]

Figure 1 shows three progressively improved algorithm designs: design (a) is a random or non-metric based subspace traversal order; design (b) is a metric-based subspace traversal order; and design (c) is a metric-based traversal order enhanced with learning/adjustment. The algorithms DFP, SRatio, and SlidingCluster are representatives of these three designs, respectively. We compare the three methods in Section 4.
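The following Java fragment is a minimal sketch of design (b) as realized by SRatio, assuming SR values have been precomputed as in Section 2.1. Candidate and evaluateP are illustrative names of our own, and the bookkeeping of a full implementation is omitted.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    class SRatioTraversal {
        static class Candidate {
            String[] attrs;   // the subspace, "*" meaning "any"
            double sr;        // precomputed score ratio, Eq. (1)
            double p;         // ranking measure P, filled in when evaluated
        }

        // Stub for the expensive Promotiveness computation defined in [1].
        static double evaluateP(Candidate c) { return 0.0; }

        // Evaluate P in descending-SR order; stop after the budget r (early stop).
        static List<Candidate> traverse(List<Candidate> candidates, int r) {
            Collections.sort(candidates, new Comparator<Candidate>() {
                public int compare(Candidate a, Candidate b) {
                    return Double.compare(b.sr, a.sr);   // larger SR first (Heuristic 1)
                }
            });
            List<Candidate> evaluated = new ArrayList<Candidate>();
            for (int i = 0; i < Math.min(r, candidates.size()); i++) {
                Candidate c = candidates.get(i);
                c.p = evaluateP(c);
                evaluated.add(c);
            }
            return evaluated;
        }
    }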
Therefore, we introduce the following heuristic, from which the algorithm SlidingCluster is derived.

Heuristic 2: Subspaces with close SR and a close genealogy relationship tend to share similar values of the measure P.

The genealogy relationship between subspaces S_1 and S_2 is measured by the Distance measure of Eq. (2). When Distance(S_1, S_2) = 1, S_1 and S_2 are regarded as having the closest genealogy relationship. The heuristic rests on the following observation. An object's score in a subspace S is accumulated over a set of child subspaces of S. If the target object shows a high score in a parent subspace compared to the other objects, it is very likely, if not guaranteed, that it excels in one or more child subspaces as well. Therefore, there are many situations in which parent-child subspaces share very close values of the measure P.
[Figure 2: An Illustration of Algorithm SlidingCluster. SR-ordered clusters C_1, C_2, C_3, C_4, ..., C_v; window size w = 3; one subspace S_i is sampled from each cluster C_i in the window and its P_i is calculated; if MAX(P_1, P_2, P_3) = P_2, the remainder of C_2 is evaluated in the next step.]

Figure 2 illustrates the idea of algorithm SlidingCluster. First, as in algorithm SRatio, the candidate subspaces are ordered by the score ratio SR. The ordered sequence of candidate subspaces is then split into neighboring clusters based on Heuristic 2. Given a window size w, one candidate subspace is sampled from each of the top w clusters. After the measure P of these sampled subspaces is calculated, the subspace with the largest P is selected, together with its cluster, and the remaining subspaces in that cluster are evaluated. The next subspace cluster is then included in the current cluster window, and again a subspace is sampled from it, evaluated, and compared. This procedure continues until all the subspace clusters are evaluated.
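A condensed Java sketch of this loop is shown below, with the same illustrative Candidate/evaluateP stubs as before. The cluster partitioning itself (Heuristic 2) is assumed to be done already, clusters are assumed non-empty, and the choice of sampling each cluster's highest-SR member is our assumption.

    import java.util.ArrayList;
    import java.util.List;

    class SlidingClusterTraversal {
        static class Candidate { double sr, p; }
        static double evaluateP(Candidate c) { return 0.0; }   // stub for P from [1]

        // clusters: SR-ordered subspace clusters C_1..C_v; w: window size.
        static void traverse(List<List<Candidate>> clusters, int w) {
            int v = clusters.size();
            List<Integer> window = new ArrayList<Integer>();   // cluster indices in window
            Candidate[] sample = new Candidate[v];
            int next = 0;
            while (next < v && window.size() < w) {            // fill the initial window
                sample[next] = clusters.get(next).get(0);      // sample one subspace per cluster
                sample[next].p = evaluateP(sample[next]);
                window.add(next++);
            }
            while (!window.isEmpty()) {
                int best = window.get(0);                      // cluster with max sampled P
                for (int ci : window)
                    if (sample[ci].p > sample[best].p) best = ci;
                for (Candidate c : clusters.get(best))         // evaluate rest of that cluster
                    if (c != sample[best]) c.p = evaluateP(c);
                window.remove(Integer.valueOf(best));
                if (next < v) {                                // slide next cluster into window
                    sample[next] = clusters.get(next).get(0);
                    sample[next].p = evaluateP(sample[next]);
                    window.add(next++);
                }
            }
        }
    }

With w = 1 the window always contains a single cluster, so the traversal degenerates to SRatio's SR order, matching the w = 1 data points reported in Section 4.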
4. EXPERIMENTAL RESULTS
In Section 3, we proposed two new design components, SR-based subspace sorting and sliding-cluster subspace sampling, which distinguish our methods from random or non-metric based subspace exploration. Algorithm SRatio uses the first component, and algorithm SlidingCluster uses both. The essential purpose of our experiments is to explore whether these two algorithms produce a more effective subspace traversal order, i.e. one in which the top-k subspaces included in the final results are evaluated in the early stage of the subspace traversal process. For comparison purposes, we use a baseline method DFP, which applies a depth-first subspace evaluation order from the root subspace (S_* = {*}) to the leaf subspaces (dimension attributes with non-star values); a simplified sketch of this traversal follows. In the remainder of this section, we first introduce the experiment settings and evaluation measures in Section 4.1, and then report and analyze the experimental results in Section 4.2.
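The sketch below shows only the depth-first set-enumeration order over the subspace lattice; the parent-subspace pruning test that gives DFP its name is specified in [1] and is stubbed out here, and all names are illustrative.

    import java.util.List;

    class DepthFirstTraversal {
        static void evaluate(String[] subspace) { /* compute Rank and P here */ }
        static boolean prune(String[] subspace) { return false; }  // DFP's test, see [1]

        // domains.get(i) holds dom(A_i); subspace starts as all "*" (the root S_*).
        // Each subspace is visited exactly once: attributes are fixed in index order.
        static void dfs(String[] subspace, int start, List<List<String>> domains) {
            evaluate(subspace);
            if (prune(subspace)) return;
            for (int i = start; i < domains.size(); i++) {
                for (String value : domains.get(i)) {
                    subspace[i] = value;            // specialize attribute A_i
                    dfs(subspace, i + 1, domains);
                }
                subspace[i] = "*";                  // restore "any" and move on
            }
        }
    }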
4.1 Experiment Settings
We perform our experiments on a machine with an Intel Core i5 2.27GHz processor and 3.5GB of memory, running Linux (kernel 2.6.34). We implement algorithms DFP, SRatio, and SlidingCluster using JDK 1.6.
Table 1: Product_Sales Data Description

    Obj ID   Global Rank   N(S_i)   Cluster Size
    A        50              692     132
    B        28             2197     315
    C        27             4183     502
    D        23             6994     962
    E        18            12191    1243
    F        22            20200    1065
Objects can be ranked by their aggregated score. We use six different objects, A, B, C, D, E, and F, as the target objects in our experiments. Among 89 objects, they are ranked 50, 28, 27, 23, 18, and 22, respectively, in the full attribute space. N(S_i) denotes the number of candidate subspaces in the corresponding subspace search space. The "Cluster Size" column gives the number of subspace clusters produced based on Heuristic 2.

To evaluate the comparative results of the three algorithms, we use AVG TraceIndex and Coverage as the major evaluation measures, calculated according to Eq. (3) and Eq. (4), respectively. We report the results with two varying variables: the window size w, and different target objects with varying numbers of candidate subspaces N(S_i). We set the distance d = 1 and the threshold minsup = 100.

$$\text{AVG TraceIndex} = \frac{\sum_{i=1}^{k} \text{TraceIndex}(S_i)}{k} \qquad (3)$$

As defined in Definition 2, the TraceIndex of a subspace denotes the ordinal position at which the subspace is evaluated. For the top-k subspaces that are eventually produced, the objective is to evaluate them at earlier ordinal positions. Therefore, the average TraceIndex of the top-k subspaces, as shown in Eq. (3), is a good indicator of performance.

$$\text{Coverage} = \frac{\sum_{i=1}^{k} I(\text{TraceIndex}(S_i) \le r)}{k} \times 100\% \qquad (4)$$

In this study, the proposed problem is to obtain top subspaces when only a limited number of subspaces can be evaluated. When the subspace evaluation is terminated at an intermediate point, before the search space of N_0 subspaces has been fully traversed, we hope that a significant portion of the top-k subspaces has already been evaluated. The measure Coverage, as defined in Eq. (4), indicates the percentage of the subspaces in the final top-k result that have been evaluated among the first r subspaces. I denotes an indicator function, which returns 1 when the condition is true and 0 otherwise; it ensures that only subspaces with TraceIndex not greater than r are counted.
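Both measures are straightforward to compute; a small Java transcription of Eq. (3) and Eq. (4) (our own, with illustrative names) is:

    class Metrics {
        // traceIndex[i] is the ordinal evaluation position of the i-th
        // subspace of the final top-k result set.
        static double avgTraceIndex(int[] traceIndex) {
            long sum = 0;
            for (int t : traceIndex) sum += t;
            return (double) sum / traceIndex.length;        // Eq. (3)
        }

        static double coverage(int[] traceIndex, int r) {
            int covered = 0;
            for (int t : traceIndex)
                if (t <= r) covered++;                      // I(TraceIndex(S_i) <= r)
            return 100.0 * covered / traceIndex.length;     // Eq. (4), in percent
        }
    }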
4.2 Product_Sales Data

In this experiment, we use the Product_Sales data set. It contains 66,801 record tuples; each record has six categorical subspace attributes, with cardinalities 6, 305, 8, 25, 3, 30, and 4, respectively, and one non-negative numerical attribute Score. The theme of this data set is the sales revenue of different product brands; we do not disclose further details of the data for privacy reasons. Table 1 shows the characteristics of the target objects used in our experiments. We choose these six target objects mainly to investigate how the algorithms' performance is affected as the number of candidate subspaces increases.

Varying w. The parameter w represents the window size of subspace clusters applied in the subspace sampling procedure of algorithm SlidingCluster.
[Figure 3: Average TraceIndex Comparison. Panels (a)-(f), one per target object (N(S_A) = 692, N(S_B) = 2197, N(S_C) = 4183, N(S_D) = 6994, N(S_E) = 12191, N(S_F) = 20200); x-axis: window size w; y-axis: AVG TraceIndex; DFP values recorded in the legends: 91.16, 249.43, 329.87, 449.3, 452.69, and 734.39.]

Figures 3 (a)-(f) show the average TraceIndex of the top-k subspaces as the window size w increases, for target objects with different numbers of candidate subspaces. The x-axis represents the window size w, where w ∈ {1, 2, ..., 20}; the y-axis represents the AVG TraceIndex of the top-k subspaces. N(S_i) (i ∈ {A, B, C, D, E, F}) denotes the number of candidate subspaces in the corresponding subspace search space. The large "X" marks the data point of algorithm SRatio.
Figures 3 and 4 show the comparison results on the measures AVG TraceIndex and Coverage, respectively, for the three algorithms DFP, SRatio, and SlidingCluster, with varying window size w ∈ {1, 2, ..., 20} and k = r = 100. A trial with low AVG TraceIndex and high Coverage indicates a promising result. When the window size w = 1, algorithm SlidingCluster degenerates to SRatio; we therefore use a large "X" to mark the corresponding data point. For algorithm DFP, only Figure 3 (a) shows the corresponding data point: in the other trials the data points fall outside the plot area when the results are drawn on the same scale. As a remedy, we record the AVG TraceIndex of DFP in the corresponding legend.

Because the subspace traversal order of algorithm DFP is not manipulated, its average TraceIndex of the top-k subspaces is significantly higher than those produced by SRatio and SlidingCluster. Moreover, as the window size w increases, the average TraceIndex consistently decreases further. This evidence clearly indicates that our proposed methods effectively bring forward the average ordinal positions at which the top subspaces are evaluated. As a result, SRatio and SlidingCluster have a higher chance than DFP of producing the most important subspaces when the computation resources are limited and only the first r of the N_0 candidate subspaces can be evaluated.

We can draw the same conclusion from Figure 4, which shows the percentage of the top subspaces actually captured once the first r subspaces have been evaluated. Generally speaking, our experiments show a remarkable advance of algorithm SRatio over algorithm DFP, and a significant improvement of algorithm SlidingCluster over algorithm SRatio. As the window size w increases, the Coverage of SlidingCluster consistently improves. Taking Figure 4 (b) as an example, suppose the computation resources are only sufficient to evaluate the first r = 100 subspaces. For algorithm DFP, only 9% of the final top k = 100 subspaces will have been evaluated by the time the algorithm terminates at r = 100, whereas 80% and 87% of the top subspaces will have been evaluated by SRatio and by SlidingCluster with w = 20, respectively. Comparing SlidingCluster with w = 20 against the full subspace traversal of DFP, the computation saving is (N(S_B) − r)/N(S_B) = (2197 − 100)/2197 = 95.4%, while the Coverage of the result is 87%, with k = r = 100.
[Figure 4: Subspace Coverage Comparison with Varying w. Panels (a)-(f), one per target object (N(S_A) = 692 through N(S_F) = 20200); x-axis: window size w; y-axis: Coverage (%); curves for SlidingCluster and DFP.]

Figures 4 (a)-(f) show the top-k subspace coverage as the window size w increases, for target objects with different numbers of candidate subspaces. The x-axis represents the window size w, where w ∈ {1, 2, ..., 20}; the y-axis represents the Coverage of the top-k subspaces. N(S_i) (i ∈ {A, B, C, D, E, F}) denotes the number of candidate subspaces in the corresponding subspace search space. The large "X" marks the data point of algorithm SRatio.
5. CONCLUSION AND FUTURE WORK
In multi-dimensional data, promotional subspace mining (PSM) aims to find outstanding subspaces for promoting a given object among a group of competitive objects. One major research problem is how to produce the top subspaces efficiently given a predefined subspace ranking measure. In this paper, we proposed the novel EProbe framework to address the scalability issue of this problem under the assumption of limited computation resources. We instantiated the "early stop" idea by designing the algorithms SRatio and SlidingCluster, and showed very promising results in our experiments.

We plan future work in two directions. First, we will perform extensive experiments on data sets with different concepts and distributions, on different target objects within the same data set, and with relatively small or large numbers k of top subspaces. We will design and develop statistical models that relate the window size w, the number of subspaces r to be evaluated, and the expected Coverage to achieve. Second, we plan to refine the learning/adjustment module of the EProbe framework. In algorithm SlidingCluster, we assumed that the values of the measure P within the same subspace cluster are close. However, this assumption may be violated if the ranks of the competitive objects fluctuate sharply. We will design a self-adaptive machine learning unit that updates the window size w and the distance d according to the P values of already evaluated subspaces. We will also investigate more sophisticated partition clustering methods [3, 4].
6. REFERENCES
[1] Tianyi Wu, Tong Xin, Qiaozhu Mei, and Jiawei Han. Promotion analysis in multi-dimensional space. In VLDB 2009, pages 109-120, Lyon, France, 2009.
[2] R. W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147-160, 1950.
[3] Ian Davidson, Kiri L. Wagstaff, and Sugato Basu. Measuring constraint-set utility for partitional clustering algorithms. In PKDD 2006, pages 115-126, Berlin, Germany, 2006.
[4] Omar Alonso, Michael Gertz, and Ricardo Baeza-Yates. Clustering and exploring search results using timeline constructions. In ACM CIKM 2009, pages 97-106, Hong Kong, China, 2009.