Gupta, K.M., Aha, D.W., & Moore, P.G. (2006). Rough set feature selection algorithms for textual case-based classification. To appear in M.G. Göker & T.R. Roth-Berghofer (Eds.), Proceedings of the Eighth European Conference on Case-Based Reasoning (ECCBR-06). Ölüdeniz, Turkey: Springer.
Rough Set Feature Selection Algorithms for Textual Case-Based Classification

Kalyan Moy Gupta 1, David W. Aha 2, and Philip Moore 3

1 Knexus Research Corp.; Springfield, VA 22153; USA
2 Naval Research Laboratory (Code 5515); Washington, DC 20375; USA
3 ITT Industries; AES Division; Alexandria, VA 22303; USA

[email protected]
Abstract. Feature selection algorithms can reduce the high dimensionality of textual cases and increase case-based task performance. However, conventional algorithms (e.g., information gain) are computationally expensive. We previously showed that, on one dataset, a rough set feature selection algorithm can reduce computational complexity without sacrificing task performance. Here we test the generality of our findings on additional feature selection algorithms, add one data set, and improve our empirical methodology. We observed that features of textual cases vary in their contribution to task performance based on their part-of-speech, and adapted the algorithms to include a part-of-speech bias as background knowledge. Our evaluation shows that injecting this bias significantly increases task performance for rough set algorithms, and that one of these attained significantly higher classification accuracies than information gain. We also confirmed that, under some conditions, randomized training partitions can dramatically reduce training times for rough set algorithms without compromising task performance.
1 Introduction
Textual case-based reasoning (TCBR) is a case-based reasoning (CBR) subfield concerned with the use of textual knowledge sources (Weber et al., 2005). TCBR systems differ in the degree to which their text content is used; some are weakly textual, while others are strongly textual, meaning that textual information is the focus of reasoning (Wilson & Bradshaw, 2000). Applications such as email categorization, news categorization, and spam filtering require the use of strongly textual CBR methodologies. Most of these systems use a bag-of-words or term-based representation for cases (e.g., Wiratunga et al., 2004; Delany et al., 2005), which can be problematic for textual case bases that have thousands of features. For example, this huge dimensionality could reduce accuracies on classification tasks and/or result in large computational costs.

A variety of feature selection algorithms can be used to address this issue, including conventional algorithms such as document frequency, information gain, and mutual information (Yang & Pedersen, 1997). Wiratunga et al. (2004) extended these algorithms to include boosting and feature generalization with considerable success. However, some of these conventional algorithms have high computational complexity, which can be a problem when a TCBR system is applied to dynamic decision environments that require frequent case base maintenance.
Feature selection algorithms based on rough set theory (RST), rather than conventional algorithms, can potentially alleviate this high computational complexity and also increase the task performance of TCBR systems. RST is a relatively novel approach for decision making with incomplete information (Pawlak, 1991). Feature selection algorithms motivated by RST have been applied with much success in non-textual CBR systems (e.g., Pal & Shiu, 2004). Recently, these algorithms have been applied to textual data sets. For example, Chouchoulas and Shen (2001) applied a rough set algorithm called QuickReduct to select features for an email categorization task. Also, we applied a rough set feature selection algorithm, called Johnson’s reduct, to a multi-class classification problem (Gupta et al., 2005). We empirically demonstrated that this algorithm, for one data set, was an order of magnitude faster than information gain and yet provided comparable classification performance. We also introduced a methodology that randomly partitions a training set, and selects and merges features from each partition. This randomized training partitions procedure can dramatically reduce feature selection time. We showed that its combination with Johnson’s reduct was effective.

In this paper, we extend our earlier work on feature selection for TCBR classification tasks by exploring additional rough set algorithms. In particular, we introduce a variant of Li et al.’s (2006) relative dependency metric, called the marginal relative dependency metric, and explore its effectiveness with randomized training partitions. In addition, we introduce the notion of part-of-speech bias in textual case bases. This is based on our observation that textual features with different parts of speech may inherently differ in their ability to contribute to reasoning. For example, noun features may contribute more than verb features, as described in Section 3.4. Adapting rough set and conventional feature selection algorithms to incorporate this bias could improve their performance. We empirically investigate these issues on two data sets.

The rest of this paper is organized as follows. Section 2 introduces RST and two of its derivative feature selection algorithms. We also include a description of randomized training partitions and introduce the notion of part-of-speech bias. We present an empirical evaluation of the feature selection algorithms and their interaction with randomized training partitions and part-of-speech bias in Section 3. We review related work on feature selection in Section 4 and conclude with a discussion of our plans for future research in Section 5.
2 Rough Set Theoretic Feature Selection
2.1 Building Blocks of Rough Set Theory
For the sake of clarity for this audience, we use established CBR terminology, such as cases and features, to present the elements of RST. RST is based on a formal description of an information system (Pawlak, 1991). An information system S is a tuple S = ⟨C, F, V⟩, where:

C = {c1, c2, …, cn} denotes a non-empty, finite set of cases,
F = {f1, f2, …, fm} denotes a non-empty, finite set of features (or attributes), and
V = {V1, V2, …, Vm} is the set of value domains for the features in F.
A decision table is a special case of an information system in which we distinguish two kinds of features: (1) a class (or decision) feature fd, and (2) the standard conditional features Fp, which are used to predict the class of a case. Therefore, F = Fp ∪ {fd}.
Table 1. A case base fragment for hiring decisions

Case          f1 = age   f2 = experience   f3 = grades   fd = hired
c1 = Anna     21-30      none              good          yes
c2 = Bill     21-30      none              good          no
c3 = Cathy    21-30      4-6               average       no
c4 = Dave     31-40      1-3               excellent     yes
c5 = Emma     31-40      4-6               good          yes
c6 = Frank    31-40      4-6               good          yes
We will explain RST concepts using the trivial case base in Table 1, which pertains to making hiring decisions based on three features. Central to RST is the notion of indiscernibility. Examining the cases in Table 1, we see that cases c1 = Anna and c2 = Bill have identical values for all the features, and thus are indiscernible with respect to the three conditional features f1, f2, and f3. More broadly, a set of cases C' is indiscernible with respect to a set of features F' ⊆ F if the following is true:

IND(F', C) = { C' ⊆ C | ∀f ∈ F', ∀ci, cj (i ≠ j) ∈ C', f(ci) = f(cj) }   (1)

Thus, two cases are indiscernible with respect to features in F' if they have identical values for all the features in F'. An indiscernibility relation is an equivalence relation that partitions the set of cases into equivalence classes. Each equivalence class contains a set of indiscernible cases for the given set of features F'. For example, given the hiring decision table:

IND(F', C) = {{c1, c2}, {c3}, {c4}, {c5, c6}}

where F' = {age, experience, grades} and C = {c1, c2, c3, c4, c5, c6}. The equivalence class of a case ci with respect to selected features F' is denoted by [ci]F'.

Based on the equivalence classes, RST develops two kinds of set approximations. First, given sets C' ⊆ C and F' ⊆ F, the lower approximation of C' with respect to F' is defined as:

lower(C, F', C') = { c ∈ C | [c]F' ⊆ C' }   (2)

or the collection of cases whose equivalence classes are subsets of C'. Second, the upper approximation of C' with respect to F' is instead defined as:

upper(C, F', C') = { c ∈ C | [c]F' ∩ C' ≠ ∅ }   (3)

or the collection of cases whose equivalence classes have a non-empty intersection with C'. A set of cases C' is crisp (or definable) if lower(C, F', C') = upper(C, F', C'), and is otherwise rough. For example, in the hiring decision table, consider C'{hired=yes} = {c1, c4, c5, c6}. Then the lower and upper approximations of C'{hired=yes} with respect to F' = {age, experience, grades} are:

lower(C, F', C'{hired=yes}) = {c4, c5, c6} and upper(C, F', C'{hired=yes}) = {c1, c2, c4, c5, c6}

Case c1 is not included in the lower approximation because its equivalence class {c1, c2} is not a subset of C'{hired=yes}. However, it is included in the upper approximation because its equivalence class has a non-empty intersection with C'{hired=yes}.
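To make these operators concrete, the following minimal Python sketch (ours, for illustration; it is not from the original paper) computes the equivalence classes of Equation 1 and the approximations of Equations 2 and 3 on the Table 1 case base. All function and variable names are our own choices.

# Illustrative sketch (not the authors' code) of Equations 1-3 on Table 1.
cases = {
    "c1": {"age": "21-30", "experience": "none", "grades": "good",      "hired": "yes"},
    "c2": {"age": "21-30", "experience": "none", "grades": "good",      "hired": "no"},
    "c3": {"age": "21-30", "experience": "4-6",  "grades": "average",   "hired": "no"},
    "c4": {"age": "31-40", "experience": "1-3",  "grades": "excellent", "hired": "yes"},
    "c5": {"age": "31-40", "experience": "4-6",  "grades": "good",      "hired": "yes"},
    "c6": {"age": "31-40", "experience": "4-6",  "grades": "good",      "hired": "yes"},
}

def equivalence_classes(cases, features):
    # IND(F', C): cases with identical values on every selected feature
    # fall into the same equivalence class (Equation 1).
    partition = {}
    for name, values in cases.items():
        key = tuple(values[f] for f in features)
        partition.setdefault(key, set()).add(name)
    return list(partition.values())

def lower_approximation(cases, features, target):
    # Cases whose equivalence class is a subset of target (Equation 2).
    return {c for ec in equivalence_classes(cases, features) if ec <= target for c in ec}

def upper_approximation(cases, features, target):
    # Cases whose equivalence class intersects target (Equation 3).
    return {c for ec in equivalence_classes(cases, features) if ec & target for c in ec}

F = ["age", "experience", "grades"]
hired_yes = {c for c, v in cases.items() if v["hired"] == "yes"}  # {c1, c4, c5, c6}
print(sorted(lower_approximation(cases, F, hired_yes)))  # ['c4', 'c5', 'c6']
print(sorted(upper_approximation(cases, F, hired_yes)))  # ['c1', 'c2', 'c4', 'c5', 'c6']

Running this sketch reproduces the example above: the lower approximation excludes c1, while the upper approximation includes both c1 and c2.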
Another important RST element is the notion of a set called the positive region. The positive region of a decision feature fd with respect to F' ⊆ F is defined as:

POSF'(fd, C) = ∪ { lower(C, F', C') | C' ∈ IND({fd}, C) }   (4)

or the union of the F'-lower approximations corresponding to all the equivalence classes of fd. For example, the positive region of fd = hired with respect to F' = {age, experience, grades}, where lower(C, F', C'{hired=no}) = {c3}, is as follows:

POSF'(fd, C) = lower(C, F', C'{hired=yes}) ∪ lower(C, F', C'{hired=no}) = {c3, c4, c5, c6}

The positive region can be used to develop a measure of a feature’s ability to contribute information for decision making. A feature f ∈ F' makes no contribution, or is dispensable, if POSF'(fd, C) = POSF'−{f}(fd, C), and is indispensable otherwise. That is, a feature is dispensable if removing it from F' does not change the positive region of the decision feature. Therefore, features can be selected by checking whether they are indispensable with respect to a decision feature. A minimal set of features F' ⊆ F is called a reduct if POSF'(fd, C) = POSF(fd, C). Often, an information system has more than one possible reduct. Generating a reduct of minimal length is an NP-hard problem. Therefore, in practice, algorithms have been developed to generate one “good” reduct. Next, we present our adaptations of two such algorithms: (1) Johnson’s heuristic algorithm and (2) the marginal relative dependency algorithm.
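Continuing the illustrative sketch above (and reusing its cases, F, equivalence_classes, and lower_approximation definitions), the positive region of Equation 4 and the dispensability test take only a few more lines; again, the function names are our own:

def positive_region(cases, features, decision):
    # POS_F'(fd, C): union of the F'-lower approximations of every
    # equivalence class of the decision feature (Equation 4).
    pos = set()
    for decision_class in equivalence_classes(cases, [decision]):
        pos |= lower_approximation(cases, features, decision_class)
    return pos

def is_dispensable(cases, features, decision, f):
    # f is dispensable in F' if dropping it leaves the positive region unchanged.
    rest = [g for g in features if g != f]
    return positive_region(cases, rest, decision) == positive_region(cases, features, decision)

print(sorted(positive_region(cases, F, "hired")))  # ['c3', 'c4', 'c5', 'c6']
print(is_dispensable(cases, F, "hired", "age"))    # True: in Table 1, experience and
                                                   # grades alone induce the same partition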
2.2 Feature Selection with Johnson’s Heuristic Algorithm
We adapted Johnson’s (1974) heuristic to compute reducts as follows. It sequentially selects features by finding those that are most discernible for a given decision feature (see Figure 1). It computes a discernibility matrix M, where each cell mi,j of the matrix corresponding to cases ci and cj includes the conditional features in which the two cases’ values differ. Formally, we define strict discernibility as:

mi,j = { f ∈ Fp : f(ci) ≠ f(cj) } if fd(ci) ≠ fd(cj), and ∅ otherwise   (5)

JOHNSONSREDUCT(Fp, fd, C)
Input:  Fp: conditional features, fd: decision feature, C: cases
Output: R: reduct, R ⊆ Fp
  R ← ∅, F' ← Fp
  M ← computeDiscernibilityMatrix(C, F', fd)
  do
    fh ← selectHighestScoringFeature(M)
    R ← R ∪ {fh}
    for (i = 0 to |C|, j = i to |C|)
      mi,j ← ∅ if fh ∈ mi,j
    F' ← F' − {fh}
  until mi,j = ∅ ∀ i, j
  return R

Figure 1. Pseudocode for Johnson’s heuristic algorithm
Given such a matrix M, for each feature the algorithm counts the number of cells in which that feature appears. The feature fh with the highest number of entries is selected for addition to the reduct R. Then all the entries mi,j that contain fh are removed, and the next best feature is selected. This procedure is repeated until M is empty. The computational complexity of JOHNSONSREDUCT is O(VC²), where V is the (typically large) vocabulary size and bounds the number of times the do loop is executed. However, this is a loose upper bound that is better approximated by O(RC²), where R is the size of the selected reduct, which is typically much smaller than V.
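As a rough illustration of this greedy loop, the following Python sketch (again ours, reusing cases and F from the earlier sketch, and keeping only the non-empty cells of M as a list of feature sets) implements the same selection strategy:

def johnsons_reduct(cases, features, decision):
    # Build the strict discernibility matrix (Equation 5): one entry per
    # pair of cases with different decisions, holding the features on
    # which the pair disagrees. Empty entries are skipped.
    names = list(cases)
    matrix = []
    for i, ci in enumerate(names):
        for cj in names[i + 1:]:
            if cases[ci][decision] != cases[cj][decision]:
                entry = {f for f in features if cases[ci][f] != cases[cj][f]}
                if entry:
                    matrix.append(entry)
    reduct = set()
    while matrix:
        # Greedily select the feature that appears in the most remaining cells.
        counts = {}
        for entry in matrix:
            for f in entry:
                counts[f] = counts.get(f, 0) + 1
        best = max(counts, key=counts.get)
        reduct.add(best)
        # Remove every cell covered by the selected feature.
        matrix = [entry for entry in matrix if best not in entry]
    return reduct

print(johnsons_reduct(cases, F, "hired"))  # a two-feature reduct for Table 1,
                                           # e.g. {'age', 'experience'}

Note that ties between equally frequent features are broken arbitrarily here, so different runs may return different, equally sized reducts.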