Optimal sub-reducts with test cost constraint

Fan Min and William Zhu

Lab of Granular Computing, Zhangzhou Normal University, Zhangzhou 363000, China
[email protected], [email protected]

Abstract. Cost-sensitive learning extends classical machine learning by considering various types of costs of the data, such as test costs and misclassification costs. In many applications there is a test cost constraint due to limited money, time, or other resources, so one must deliberately choose a set of tests that preserves as much useful information for classification as possible. To cope with this issue, we define optimal sub-reducts with test cost constraint and the corresponding problem of finding them. The new problem is more general than two existing problems, namely the minimal test cost reduct problem and the 0-1 knapsack problem, and is therefore more complex than both. We propose two exhaustive algorithms to deal with it. One is straightforward; the other takes advantage of some properties of the problem. The efficiencies of the two algorithms are compared through experiments on the mushroom dataset. Some potential enhancements are also pointed out.

Keywords. Cost-sensitive learning, attribute reduction, test cost, constraint, exhaustive algorithm.
1 Introduction
Cost-sensitive learning has attracted much research interest in the past two decades. Two types of costs, namely test costs and misclassification costs [5], are the most often addressed. The test cost is the measurement cost of determining the value of an attribute a exhibited by an object [5,16]; hence, in the context of cost-sensitive learning, an attribute is also called a test. In some classification problems there are many available tests, and we would like to remove some of them to save test cost. An ideal solution minimizes the test cost while preserving the information of the decision system; then we can build classifiers as good as the ones built on the original decision system. This problem is called the minimal test cost reduct (MTR) problem and has been addressed in [3,7,8,15]. Unfortunately, in many applications the test cost one can afford is limited, and one has to sacrifice necessary information to keep the test cost under budget. Our problem is: given a test cost constraint, how do we choose a test set with which the information is preserved to the highest degree? In fact, a constraint on the test cost has been addressed by Susmaga [15]; however, there the constraint is required to be loose enough to preserve all necessary information, so the problem in [15] is different from ours.

In this paper, we propose the concept of optimal sub-reducts with test cost constraint (OSRT). Since our problem is to find all these sub-reducts, we call it the OSRT problem. We show that the problem is more general than two existing NP-hard problems: the minimal test cost reduct problem [3,7] and the 0-1 knapsack problem [6,17]. We propose two exhaustive algorithms to deal with it. The first is obtained directly from the problem definition. The other takes advantage of some properties of the problem and is more efficient than the first. Implemented in our open source software Coser [10], both can solve the OSRT problem on the mushroom dataset [1], which has 22 tests, in a number of seconds; therefore they are applicable to datasets with reasonable numbers of tests. Another important use is to evaluate the result quality of heuristic algorithms, which can be employed on large datasets.
2 Preliminaries
In this section, we first present the data model, where the test cost is represented by a vector. Then we review two related problems, namely the minimal test cost reduct problem and the 0-1 knapsack problem.
2.1 Test-cost-independent decision systems
The concept of a decision system is fundamental in data mining research. A decision system is often denoted as S = (U, C, D, {Va | a ∈ C ∪ D}, {Ia | a ∈ C ∪ D}), where U is a finite set of objects called the universe, C is the set of conditional attributes, D is the set of decision attributes, Va is the set of values for each a ∈ C ∪ D, and Ia : U → Va is an information function for each a ∈ C ∪ D. We often denote {Va | a ∈ C ∪ D} and {Ia | a ∈ C ∪ D} by V and I, respectively. A decision system is often represented by a decision table.

Cost-sensitive decision systems are more general than decision systems. We consider the simplest though most widely used model [8] as follows.

Definition 1. [8] A test-cost-independent decision system (TCI-DS) S is the 6-tuple

S = (U, C, D, V, I, c), (1)

where U, C, D, V, and I have the same meanings as in a decision system, and c : C → R+ is the test cost function. Test costs are independent of one another, that is, c(B) = ∑_{a∈B} c(a) for any B ⊆ C. We use a vector c = [c(a1), c(a2), . . . , c(a|C|)] to represent the cost function. Free tests are not considered in this work.
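To make the cost model concrete, the following minimal Python sketch represents the cost vector and evaluates c(B) for a test set B under the independence assumption. The names are illustrative, not taken from Coser; the cost values reuse the first entries of the vector in Example 1 below.

```python
# A minimal sketch of the test-cost-independent model.
cost = [96, 9, 36, 50, 75, 71, 9]  # c(a_1), ..., c(a_7): one cost per test

def test_cost(B):
    """c(B) = sum of c(a) over a in B, since test costs are independent."""
    return sum(cost[a] for a in B)

print(test_cost({0, 2, 5}))  # 96 + 36 + 71 = 203
```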
2.2 The minimal test cost reduct problem
Attribute reduction has been intensively studied by the rough set society. There are many extensions of the classical rough set model [11], such as covering-based [20,21], decision-theoretical [18], variable-precision [22], and dominance-based [2] rough set models. A number of definitions of relative reducts exist [4,12] for different rough set models. The definition based on the conditional information entropy is given below.

Definition 2. [14] Let S = (U, C, D, V, I) be a decision system, and H(D|B) be the conditional entropy of D w.r.t. B ⊆ C. Any R ⊆ C is a Shannon's entropy reduct iff:
1. H(D|R) = H(D|C), and
2. ∀a ∈ R, H(D|R − {a}) > H(D|C).

Sometimes we are interested in attribute subsets satisfying only one of these conditions.

Definition 3. Let S = (U, C, D, V, I) be a decision system and R ⊆ C. R is a super-reduct iff H(D|R) = H(D|C). R is a sub-reduct iff ∀a ∈ R, H(D|R − {a}) > H(D|R).

Super-reducts are especially useful in addition-deletion reduct construction approaches. They can be equivalently defined as supersets of reducts [19]. Sub-reducts are important in this paper because they do not contain redundant tests. Let Red(S), Red*(S) and Red_*(S) be the set of all reducts, the set of all super-reducts, and the set of all sub-reducts of S, respectively. According to Definitions 2 and 3,

Red(S) = Red*(S) ∩ Red_*(S), (2)

and

Red*(S) = ⋃_{R∈Red(S)} {R′ | R ⊆ R′ ⊆ C}. (3)
The aim of the classical reduct problem is to find a minimal reduct. When the test cost issue is involved, we are interested in reducts with minimal test costs. This type of reduct is defined as follows.

Definition 4. [7] Let S be a TCI-DS. Any R ∈ Red(S) where c(R) = min{c(R′) | R′ ∈ Red(S)} is called a minimal test cost reduct. We denote the set of all minimal test cost reducts by MTR(S).

In many applications we only need to find one element of MTR(S). This problem is the minimal test cost reduct (MTR) problem. It is more general than the classical reduct problem, which is NP-hard [13].
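Given Red(S) explicitly, Definition 4 reduces to a minimum-cost selection. A minimal sketch (the interface is ours; in practice enumerating Red(S) is itself the hard part):

```python
def minimal_test_cost_reducts(reducts, cost):
    """MTR(S): the reducts whose total test cost attains the minimum.
    `reducts` is an iterable of frozensets of test indices; `cost` maps
    a test index to its cost (illustrative interface, not Coser's)."""
    best = min(sum(cost[a] for a in R) for R in reducts)
    return [R for R in reducts if sum(cost[a] for a in R) == best]
```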
2.3 The 0-1 knapsack problem
The 0-1 knapsack problem appears in textbooks on data structures and algorithm design. It is defined as follows [17]: Given a set of items, each with a weight and a value, determine which items to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. The decision version of this problem is NP-complete.
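For contrast with the OSRT problem introduced below, the 0-1 knapsack problem admits a standard dynamic program that is pseudo-polynomial in the capacity. A textbook sketch:

```python
def knapsack_01(weights, values, capacity):
    """Max total value of a subset of items with total weight <= capacity.
    Classic O(n * capacity) dynamic program."""
    best = [0] * (capacity + 1)  # best[w]: max value within weight budget w
    for wt, val in zip(weights, values):
        for w in range(capacity, wt - 1, -1):  # descending: each item at most once
            best[w] = max(best[w], best[w - wt] + val)
    return best[capacity]

print(knapsack_01([2, 3, 4], [3, 4, 5], 5))  # -> 7 (items 0 and 1)
```

No analogous table works for the problem studied in this paper, because the "value" of a test (its entropy reduction) depends on which other tests are selected, as discussed in Section 3.3.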
3 The OSRT problem
This section formally defines the optimal sub-reducts with test cost constraint (OSRT) problem, and analyzes its relationships with the two problems mentioned in the last section.

3.1 Decision system conditional entropy
The quality of a classifier depends largely on the decision system. The conditional entropy of a decision system S = (U, C, D, V, I) is H(D|C). Generally, decision systems with lower conditional entropy are easier to classify. Specifically, if H(D|C) = 0, the decision system is consistent, and we can generate a rule set with 100% accuracy on the training set. To compute H(D|C), we need to divide U into subsets according to the attributes in C. Since each data item is read only once and each object can be placed into its subset directly, the time complexity is only

O(|U| × |C|). (4)

In this process, subsets containing objects of the same class are pure, and can be counted and discarded immediately. After division, objects in the remaining subsets are investigated further, with the objects of the majority class counted. The counting operation takes O(|U|) time and has no influence on the time complexity. We also need space to store the subset information, which is

O(|U| + |C| × max{|Va| : a ∈ C}). (5)
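The following sketch computes H(D|B) for any B ⊆ C by a single pass that partitions U by the B-values, matching the division cost in Equation (4). The dataset encoding (records as dicts) is ours, not Coser's.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(U, B, d):
    """H(D|B): U is a list of records (dicts), B a list of attribute
    names, d the decision attribute name."""
    blocks = defaultdict(list)
    for x in U:                                   # one pass: O(|U| * |B|)
        blocks[tuple(x[a] for a in B)].append(x[d])
    n = len(U)
    h = 0.0
    for labels in blocks.values():                # pure blocks contribute 0
        for cnt in Counter(labels).values():
            p = cnt / len(labels)
            h -= (len(labels) / n) * p * math.log2(p)
    return h
```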
3.2 Problem definition
Suppose that we are given a limited amount of test cost in terms of time, money, etc. In applications, it often happens that no reduct meets the test cost constraint. Therefore we have to sacrifice necessary information of the decision system. Naturally, we require that the reduced decision system have the minimal conditional entropy. This consideration brings us to the following definition.

Definition 5. Let S = (U, C, D, V, I, c) be a TCI-DS and m be the test cost upper bound. The set of all test sets subject to the constraint is

T(S, m) = {B ⊆ C | c(B) ≤ m}. (6)

In T(S, m), the set of all test sets with the minimal conditional entropy is

MT(S, m) = {B ∈ T(S, m) | H(D|B) = min{H(D|B′) | B′ ∈ T(S, m)}}. (7)

In MT(S, m), the set of all optimal sub-reducts is

PMT(S, m) = {B ∈ MT(S, m) | c(B) = min{c(B′) | B′ ∈ MT(S, m)}}. (8)
Any element in PMT(S, m) is called an optimal sub-reduct with test cost constraint, or an optimal sub-reduct for brevity. In Definition 5, Equation (6) ensures that the test cost constraint is met; this is the basic requirement of our problem. Equation (7) then ensures that the test set is optimal from the viewpoint of decision system conditional entropy; this is our primary optimization objective. Finally, Equation (8) ensures that the test cost is also minimized; this is our secondary objective. Without the secondary objective, redundant attributes may exist when m is greater than the test cost of a minimal test cost reduct. Since we have assumed in Definition 1 that no free test exists, Equation (8) also ensures that there is no redundant test. According to Definition 3, any element in PMT(S, m) must be a sub-reduct; this is why PMT(S, m) is called the set of all optimal sub-reducts. The problem of constructing PMT(S, m) is called the optimal sub-reducts with test cost constraint (OSRT) problem.
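Read operationally, Definition 5 is a three-stage filter. A brute-force sketch, reusing the hypothetical conditional_entropy helper from Section 3.1 (names and interface are ours):

```python
from itertools import chain, combinations

def osrt_bruteforce(U, C, d, cost, m):
    """PMT(S, m) by direct application of Equations (6)-(8).
    C is a list of attribute names; cost is indexed by position in C."""
    subsets = chain.from_iterable(
        combinations(range(len(C)), k) for k in range(1, len(C) + 1))
    T = [B for B in subsets if sum(cost[a] for a in B) <= m]       # Eq. (6)

    def H(B):
        return conditional_entropy(U, [C[a] for a in B], d)

    h_min = min(H(B) for B in T)
    MT = [B for B in T if H(B) == h_min]                           # Eq. (7)
    c_min = min(sum(cost[a] for a in B) for B in MT)
    return [B for B in MT if sum(cost[a] for a in B) == c_min]     # Eq. (8)
```

This is exactly the computation that Algorithm 1 below organizes into four steps.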
3.3 Problem analysis
Now we analyze the relationships between the new problem and the two problems mentioned in Section 2. When the test cost budget is enough for a reduct, the OSRT problem coincides with the minimal test cost reduct (MTR) problem.

Theorem 1. Let R ∈ MTR(S). Then

MTR(S) = PMT(S, m) ⇔ m ≥ c(R). (9)
The theorem was proved in [9]. The OSRT problem is also very similar to the 0-1 knapsack problem. First, each test has a cost, which corresponds to the weight of an item. Second, tests help decrease the decision system conditional entropy, which corresponds to the value of an item. The key difference is that the value of an item is fixed, while the "value" of a test is variable: it depends on the other selected tests. Therefore, the OSRT problem is more general, and more difficult, than the 0-1 knapsack problem.
4 Exhaustive algorithms
Due to the complexity of the new problem, exhaustive algorithms are inapplicable to large datasets. They are, however, important from the theoretical viewpoint. They also help to evaluate the performance of heuristic algorithms, often on small datasets. In the following we assume that m < c(R), where R ∈ MTR(S), so that our problem does not coincide with the MTR problem.
Algorithm 1 The straightforward exhaustive optimal sub-reduct algorithm
Input: S = (U, C, D, V, I, c), m
Output: PMT(S, m), the set of all optimal sub-reducts
Method: SESRA
1: Construct the set of all possible test sets;
2: Select elements satisfying the test cost constraint and obtain T(S, m);
3: Select elements with the minimal conditional entropy and obtain MT(S, m);
4: Select elements with the minimal test cost and obtain PMT(S, m).
4.1 The SESRA algorithm
Definition 5 directly indicates an exhaustive algorithm, called SESRA, as listed in Algorithm 1. We use one integer to represent a test set, with each bit representing whether or not a test is included. Specifically, we implemented our algorithm in Coser [10], where at most 63 tests are supported. For Step 1, the time and space complexities are both O(2^|C| × |C|). This is because the number of all possible test sets is 2^|C|, and each test set should be stored in a length-|C| array. Step 2 checks the cost of each test set, so its time complexity is also O(2^|C| × |C|). As a simple improvement, Steps 1 and 2 can be merged to save time and space. Because merging these two steps is straightforward, we still call the revised version of the algorithm SESRA. Since 2^|C| test sets are checked, the time complexity lower bound is Ω(2^|C|). The space requirement, however, can be reduced significantly, as illustrated by the following example.

Example 1. The mushroom dataset has 22 possible tests. Let the test cost array be [96, 9, 36, 50, 75, 71, 9, 51, 49, 3, 73, 97, 18, 76, 11, 11, 56, 34, 40, 38, 93, 13]. To check whether or not test sets satisfy the constraint, a total of 21,218,957 ≈ 2^22 × 5.06 summing operations are needed. In other words, there are about 5 summing operations per test set. On the other hand, there are 1464 test sets satisfying the constraint. Therefore the space requirement is only 1464 integers, instead of 2^22 integers, for storing test sets. This is an important reason why the algorithm can be employed in many applications with reasonable data sizes.

In order to have a better understanding of the algorithm, let us look at some experimental results as listed in the SESRA column of Table 1. The experimental settings will be discussed in Section 5. Averaged over 100 different test cost settings, there are 961.01 test sets satisfying the constraint, 1.81 of which have the minimal conditional entropy, and 1.02 of which are optimal. In fact, both MT(S, m) and PMT(S, m) often have only 1 or 2 elements, making the time for Step 4 negligible. About 8631.10/9818.55 ≈ 87.9% of the time is spent on computing the conditional entropy of 961.01 test sets, making Step 3 the bottleneck.
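A hedged sketch of the merged Steps 1 and 2, using one integer per test set as described above (bit i set means test i is selected; the function name is ours). Only the sets that pass the cost check are stored, which is the space saving of Example 1, and the early exit is why only about 5 summing operations per set are needed there:

```python
def constrained_test_sets(cost, m):
    """Enumerate all 2^|C| - 1 nonempty test sets as bitmasks and keep
    those with total cost <= m; only survivors are stored.
    For |C| = 22 this is about 4.2 million iterations -- fine for a
    sketch, though a real implementation would be in a faster language."""
    n = len(cost)
    survivors = []
    for mask in range(1, 1 << n):
        c = 0
        for i in range(n):
            if mask >> i & 1:
                c += cost[i]
                if c > m:          # early exit saves summing operations
                    break
        else:                      # loop finished without break: c <= m
            survivors.append(mask)
    return survivors
```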
4.2 The SESRA∗ algorithm
Next we discuss how to revise SESRA to provide better performance. The following propositions help to reduce some computation.
Proposition 1. Let ∅ ⊂ B′ ⊂ B ⊆ C. Then H(D|B) ≤ H(D|B′).

Proposition 2. Let B ∈ T(S, m) and ∅ ⊂ B′ ⊂ B. Then B′ ∈ T(S, m).

Proposition 1 indicates that to compute min{H(D|B) | B ∈ T(S, m)}, we do not have to check every element of T(S, m). Any element which is a proper subset of another element can be removed, and the set of test sets to be checked is

T′(S, m) = T(S, m) − {B′ ∈ T(S, m) | ∃B ∈ T(S, m) s.t. B′ ⊂ B}. (10)

Proposition 2 indicates that many elements of T(S, m) may be removed, so |T′(S, m)| may be significantly smaller than |T(S, m)|. According to Proposition 1 and Equation (10),

min{H(D|B) | B ∈ T′(S, m)} = min{H(D|B) | B ∈ T(S, m)}. (11)
Similar to Equation (7), let

MT′(S, m) = {B ∈ T′(S, m) | H(D|B) = min{H(D|B′) | B′ ∈ T′(S, m)}}. (12)

We know that

MT′(S, m) = MT(S, m) ∩ T′(S, m) ≠ ∅. (13)
Therefore MT′(S, m) always contains some test sets with the minimal conditional entropy. In most cases, however, not all test sets with the minimal conditional entropy are included in MT′(S, m). That is,

MT(S, m) ⊈ MT′(S, m). (14)

To make the matter worse,

PMT(S, m) ⊈ MT′(S, m). (15)
Similar to Equation (8), let

PMT′(S, m) = {B ∈ MT′(S, m) | c(B) = min{c(B′) | B′ ∈ MT′(S, m)}}. (16)

It is possible that

PMT′(S, m) ≠ PMT(S, m), (17)

which indicates that we may miss optimal sub-reducts by discarding some test sets as indicated by Equation (10). The reason lies in that an element in PMT(S, m) is included in PMT′(S, m) only if no proper superset of it meets the constraint. Fortunately, we have the following propositions to amend this flaw.

Proposition 3. ∀B ∈ MT(S, m), ∃B′ ∈ MT′(S, m) such that B ⊆ B′.

Proof. Because B ∈ MT(S, m) ⊆ T(S, m), ∃B′ ∈ T′(S, m) such that B ⊆ B′. On one hand, according to Proposition 1, H(D|B′) ≤ H(D|B). On the other hand, according to Equation (7) and B′ ∈ T′(S, m) ⊆ T(S, m), H(D|B) ≤ H(D|B′). Therefore H(D|B′) = H(D|B). Equation (12) indicates that B′ ∈ MT′(S, m). This completes the proof.
Algorithm 2 The SESRA∗ algorithm
Input: S = (U, C, D, V, I, c), m
Output: PMT(S, m), the set of all optimal sub-reducts
Method: SESRA-star
1: Construct test sets and, at the same time, obtain T(S, m);
2: Remove elements from T(S, m) and obtain T′(S, m);
3: Select elements with the minimal conditional entropy and obtain MT′(S, m);
4: Compute MT″(S, m) using the exhaustive attribute reduction algorithm;
5: Select elements with the minimal test cost and obtain PMT(S, m).
The following proposition provides an approach to compute a superset of the set of all optimal sub-reducts.

Proposition 4. Let

MT″(S, m) = ⋃ {RedM(U, B, D, V, I, c) | B ∈ MT′(S, m)}, (18)

where RedM(S) = RedM(U, C, D, V, I, c) is the set of all minimal test cost reducts of S. Then

PMT(S, m) ⊆ MT″(S, m). (19)

Proof. For any B ∈ PMT(S, m), according to Proposition 3, ∃B′ ∈ MT′(S, m) such that B ⊆ B′. On the other hand, B has the minimal test cost, therefore B ∈ RedM(U, B′, D, V, I, c). This completes the proof.

According to the above analysis, we obtain a new algorithm as listed in Algorithm 2. Step 1 corresponds to Steps 1 and 2 in Algorithm 1; as discussed earlier, SESRA also uses this enhanced version. Steps 2 through 4 of SESRA∗ correspond to Step 3 of SESRA. The reasons why SESRA∗ is faster include:
1. T(S, m) often contains many elements, and checking the conditional entropy of each element is time consuming;
2. MT(S, m), MT′(S, m) and MT″(S, m) contain very few elements (in most cases only 1); and
3. For B ∈ MT′(S, m), B contains very few redundant tests if any, therefore it takes little time to compute MT″(S, m).
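The subset-removal step of Equation (10) keeps exactly the maximal elements of T(S, m). With the bitmask representation from the sketch in Section 4.1, a set B′ is dominated iff some other survivor B satisfies B′ ⊂ B, i.e. B′ & B == B′. A hedged sketch (quadratic in |T(S, m)|, which is modest for the 1464 survivors of Example 1; a real implementation could bucket sets by popcount):

```python
def maximal_sets(survivors):
    """T'(S, m): drop every set that is a proper subset of another
    survivor. Bitmask containment test: B' subset of B iff B' & B == B'."""
    return [b for b in survivors
            if not any(b != other and b & other == b for other in survivors)]
```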
5 Experiments
The main purpose of our experiments is to compare the performance of the two algorithms. Experiments are undertaken on the mushroom dataset, where |U| = 8124 and |C| = 22. This data size is reasonable for many decision making applications. Parameter settings are as follows: test costs are random numbers ranging from 1 to 100, and m is set to 0.8 × c(R∗), where R∗ is a minimal test cost reduct. 100 different test cost vectors are generated, and both algorithms run 100 times. Results are summarized in Table 1.
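The setup can be reproduced along these lines. This is a sketch only: the cost range and the 0.8 factor follow the text, while the helper name minimal_reduct_cost_fn stands for an MTR solver not shown here.

```python
import random

def random_setting(num_tests, minimal_reduct_cost_fn, seed):
    """One experimental setting: uniform costs in [1, 100] and
    m = 0.8 * c(R*), where R* is a minimal test cost reduct."""
    rng = random.Random(seed)
    cost = [rng.randint(1, 100) for _ in range(num_tests)]
    m = 0.8 * minimal_reduct_cost_fn(cost)  # c(R*) from an MTR solver
    return cost, m
```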
Table 1. Results on the mushroom dataset (mean values for 100 test cost settings)

                                  SESRA       SESRA∗
Set size
  |T(S, m)|                       961.01      961.01
  |T′(S, m)|                      -           254.31
  |MT(S, m)|                      1.34        -
  |MT′(S, m)|                     -           1.04
  |MT″(S, m)|                     -           1.04
  |PMT(S, m)|                     1.01        1.01
Run time (ms)
  Candidates building             1685.73     1680.95
  Consistency computing           11425.12    2391.85
  Total                           13110.85    4072.80

"-" stands for inapplicable.
Table 1 shows that the number of test sets whose conditional entropy is checked is reduced from 961.01 to 254.31, about 1/4 of the initial value. Consequently, the time for the corresponding step is reduced from 11425.12 ms to 2391.85 ms, a little more than 1/5 of the initial value. Finally, the total time is reduced to about 1/3. In general, the improvement is significant.
6 Conclusions and further work
Exhaustive algorithms are undoubtedly the right choice for datasets of reasonable size. In this paper, we proposed the OSRT problem and two exhaustive algorithms to deal with it. SESRA is obtained directly from the problem definition, and SESRA∗ takes advantage of some properties of the problem. SESRA∗ is about three times faster than SESRA in our experiments (see Table 1). Both algorithms are applicable to the mushroom dataset. In the future we will revise SESRA∗ to support larger datasets. We will also develop heuristic algorithms for large datasets.
Acknowledgements. This work is supported in part by the National Natural Science Foundation of China under Grant No. 60873077/F020107.
References

1. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/mlrepository.html (1998)
2. Greco, S., Matarazzo, B., Slowinski, R., Stefanowski, J.: Variable consistency model of dominance-based rough sets approach. In: Rough Sets and Current Trends in Computing. LNCS, vol. 2005, pp. 170–181 (2000)
3. He, H., Min, F.: Accumulated cost based test-cost-sensitive attribute reduction. In: RSFDGrC. LNAI, vol. 6743, pp. 244–247 (2011)
4. Hu, Q., Yu, D., Liu, J., Wu, C.: Neighborhood rough set based heterogeneous feature subset selection. Information Sciences 178(18), 3577–3594 (2008)
5. Hunt, E.B., Marin, J., Stone, P.J. (eds.): Experiments in Induction. Academic Press, New York (1966)
6. Mathews, G.B.: On the partition of numbers. Proceedings of the London Mathematical Society 28, 486–490 (1897)
7. Min, F., He, H., Qian, Y., Zhu, W.: Test-cost-sensitive attribute reduction. Submitted to Information Sciences (2011)
8. Min, F., Liu, Q.: A hierarchical model for test-cost-sensitive decision systems. Information Sciences 179(14), 2442–2452 (2009)
9. Min, F., Zhu, W.: Attribute reduction with test-cost constraint. Journal of Electronic Science and Technology of China 9(2) (June 2011)
10. Min, F., Zhu, W.: Coser: Cost-sensitive rough sets, http://grc.fjzs.edu.cn/~fmin/coser/index.html (2011)
11. Pawlak, Z.: Rough sets and intelligent data analysis. Information Sciences 147(12), 1–12 (2002)
12. Qian, Y., Liang, J., Pedrycz, W., Dang, C.: Positive approximation: An accelerator for attribute reduction in rough set theory. Artificial Intelligence 174(9-10), 597–618 (2010)
13. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Intelligent Decision Support (1992)
14. Slezak, D.: Approximate entropy reducts. Fundamenta Informaticae 53(3-4), 365–390 (2002)
15. Susmaga, R.: Computation of minimal cost reducts. In: Foundations of Intelligent Systems. LNCS, vol. 1609, pp. 448–456. Springer (1999)
16. Turney, P.D.: Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research 2, 369–409 (1995)
17. Wikipedia: Knapsack problem, http://en.wikipedia.org/wiki/Knapsack_problem (2011)
18. Yao, Y., Zhao, Y.: Attribute reduction in decision-theoretic rough set models. Information Sciences 178(17), 3356–3373 (2008)
19. Yao, Y., Zhao, Y., Wang, J.: On reduct construction algorithms. In: Rough Sets and Knowledge Technology. pp. 297–304 (2006)
20. Zhu, W.: Topological approaches to covering rough sets. Information Sciences 177(6), 1499–1508 (2007)
21. Zhu, W., Wang, F.: Reduction and axiomization of covering generalized rough sets. Information Sciences 152(1), 217–230 (2003)
22. Ziarko, W.: Variable precision rough set model. Journal of Computer and System Sciences 46(1), 39–59 (1993)