2006 IEEE International Conference on Fuzzy Systems Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006
Fuzzy Data Mining by Heuristic Rule Extraction and Multiobjective Genetic Rule Selection

Hisao Ishibuchi, Member, IEEE, Yusuke Nojima, Member, IEEE, and Isao Kuwajima, Student Member, IEEE

Abstract — In this paper, we demonstrate that multiobjective genetic rule selection can significantly improve the accuracy-complexity tradeoff curve of fuzzy rule-based classification systems generated by a heuristic rule extraction procedure for classification problems with many continuous attributes. First, a prespecified number of fuzzy rules are extracted in a heuristic manner based on a rule evaluation criterion. This step can be viewed as fuzzy data mining. Then multiobjective genetic rule selection is applied to the extracted rules to find a number of non-dominated rule sets with respect to accuracy maximization and complexity minimization. This step can be viewed as a postprocessing procedure in fuzzy data mining. Experimental results show that multiobjective genetic rule selection finds a number of smaller rule sets with higher classification accuracy than heuristically extracted rule sets. That is, the accuracy-complexity tradeoff curve of heuristically extracted rule sets in fuzzy data mining is improved by multiobjective genetic rule selection. This observation suggests that multiobjective genetic rule selection plays an important role in fuzzy data mining as a postprocessing procedure.
This work was partially supported by the Japan Society for the Promotion of Science (JSPS) through a Grant-in-Aid for Scientific Research (B): KAKENHI (17300075). Hisao Ishibuchi, Yusuke Nojima, and Isao Kuwajima are with the Department of Computer Science and Intelligent Systems, Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Naka-ku, Sakai, Osaka 599-8531, Japan ([email protected], [email protected], [email protected]).
0-7803-9489-5/06/$20.00/©2006 IEEE

I. INTRODUCTION
Evolutionary multiobjective optimization (EMO) is one of the most active research areas in the field of evolutionary computation [1]-[3]. Recently, EMO algorithms have been employed in some studies on modeling and classification. For example, Kupinski & Anastasio [4] used an EMO algorithm to generate non-dominated neural networks on a receiver operating characteristic curve. Gonzalez et al. [5] generated non-dominated radial basis function networks of different sizes. Llora & Goldberg [6] used an EMO algorithm in Pittsburgh-style learning classifier systems. Abbass [7] used a memetic EMO algorithm (i.e., a hybrid EMO algorithm with local search) to speed up the backpropagation algorithm, where multiple neural networks of different sizes were evolved to find an appropriate network structure. Non-dominated neural networks were combined into a single ensemble classifier in [8]-[10]. The use of EMO algorithms to design ensemble classifiers was also proposed in Ishibuchi & Yamamoto [11], where multiple fuzzy rule-based classifiers of different sizes were generated. In some studies on fuzzy rule-based systems, EMO algorithms were
used to analyze the tradeoff structure between accuracy and interpretability [12]-[20]. For more recent studies, see [21].

In this paper, we use an EMO algorithm for fuzzy rule selection to examine the usefulness of multiobjective genetic rule selection as a postprocessing procedure in fuzzy data mining for pattern classification problems. A number of non-dominated rule sets (i.e., non-dominated fuzzy rule-based classification systems) with respect to their accuracy and complexity are found by EMO-based rule selection from fuzzy rules extracted by a heuristic rule extraction procedure in fuzzy data mining.

Genetic fuzzy rule selection for classification problems was first formulated as a single-objective combinatorial 0/1 optimization problem in Ishibuchi et al. [22], [23], where the fitness function of each rule set was defined as a weighted sum of its accuracy (i.e., the number of correctly classified training patterns) and its complexity (i.e., the number of fuzzy rules). This single-objective formulation was extended in [12] to a two-objective problem where non-dominated rule sets were found by an EMO algorithm. Then fuzzy rule selection was formulated as a three-objective problem in [13], where the total rule length (i.e., the total number of antecedent conditions over the fuzzy rules in each rule set) was used as an additional complexity measure. This three-objective formulation was also handled by a memetic EMO algorithm in [16] and a multiobjective fuzzy genetics-based machine learning algorithm in [20].

This paper is organized as follows. First we briefly explain multiobjective optimization and fuzzy rule selection in Section II. We use two objectives in fuzzy rule selection: the minimization of the error rate on training patterns and the minimization of the number of fuzzy rules. Next we explain a heuristic procedure for extracting fuzzy classification rules in Section III.
Several rule evaluation criteria used in the heuristic rule extraction procedure are compared in Section IV through computational experiments on some benchmark data sets in the UC Irvine machine learning repository. We also examine the accuracy-complexity tradeoff curve of extracted rule sets using various specifications of the number of fuzzy rules to be extracted. Then we demonstrate the usefulness of multiobjective genetic rule selection as a postprocessing procedure in fuzzy data mining in Section V. It is clearly shown that the accuracy-complexity tradeoff curve of heuristically extracted rule sets is improved by multiobjective genetic rule selection. That is, the accuracy of heuristically extracted rule sets is improved while their complexity is decreased by multiobjective genetic fuzzy rule selection. Finally Section VI concludes this paper.
II. MULTIOBJECTIVE FUZZY RULE SELECTION
In this section, we explain multiobjective optimization and multiobjective genetic fuzzy rule selection.

A. Multiobjective Optimization
Let us consider the following k-objective minimization problem:

  Minimize z = (f_1(y), f_2(y), ..., f_k(y)),   (1)
  subject to y ∈ Y,   (2)
where z is the objective vector, f_i(y) is the i-th objective to be minimized, y is the decision vector, and Y is the feasible region in the decision space. Let a and b be two feasible solutions of the k-objective minimization problem in (1)-(2). If the following condition holds, a can be viewed as being better than b:

  f_i(a) ≤ f_i(b) for all i, and f_j(a) < f_j(b) for at least one j.   (3)
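The dominance condition (3) translates directly into code. The following is a minimal sketch for minimization problems (function and variable names are ours, not from the paper):

```python
def dominates(a, b):
    """True if objective vector a dominates b under minimization:
    a is no worse than b in every objective (f_i(a) <= f_i(b) for all i)
    and strictly better in at least one (f_j(a) < f_j(b) for some j)."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def non_dominated(solutions):
    """Filter a list of objective vectors down to the non-dominated ones
    (the Pareto-optimal vectors within the given list)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]
```

For example, among the objective vectors (1, 5), (2, 2), and (4, 4), the vector (4, 4) is dominated by (2, 2), so `non_dominated` keeps only the first two.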
In this case, we say that a dominates b (equivalently, b is dominated by a). When b is not dominated by any other feasible solution (i.e., when there exists no feasible solution a that dominates b), the solution b is referred to as a Pareto-optimal solution of the k-objective minimization problem in (1)-(2). The set of all Pareto-optimal solutions forms the tradeoff surface in the objective space. This tradeoff surface is referred to as the Pareto front. Various EMO algorithms have been proposed to efficiently search for Pareto-optimal solutions [1]-[3].

B. Multiobjective Genetic Fuzzy Rule Selection
Let us assume that we have N fuzzy rules extracted for a pattern classification problem by a heuristic rule extraction procedure. Multiobjective genetic fuzzy rule selection is used to find Pareto-optimal rule sets from these N fuzzy rules with respect to the two goals of knowledge extraction: accuracy maximization and complexity minimization.

Let S be a subset of the extracted N fuzzy rules. The accuracy of the rule set S is measured by the error rate when all the training patterns are classified by S. We use a single winner rule-based method to classify each training pattern by S. That is, each pattern is classified by the single winner rule in S that has the maximum product of the rule weight and the compatibility grade with that pattern, as explained in the next section. We include the rejection rate in the error rate (i.e., training patterns with no compatible fuzzy rules in S are counted as errors in this paper). On the other hand, we measure the complexity of the rule set S by the number of fuzzy rules in S. Thus our fuzzy rule selection problem is formulated as follows:

  Minimize f_1(S) and f_2(S),   (4)
where f_1(S) is the error rate on training patterns by the rule set S and f_2(S) is the number of fuzzy rules in S. Any subset S of the N fuzzy rules can be represented by a binary string of length N as

  S = s_1 s_2 ... s_N,   (5)

where s_j = 1 and s_j = 0 mean that the j-th fuzzy rule is included in S and excluded from S, respectively. Such a binary string is handled as an individual in multiobjective genetic fuzzy rule selection. Since feasible solutions (i.e., arbitrary subsets of the N fuzzy rules) are represented by binary strings as in (5), we can apply almost all EMO algorithms with standard genetic operations to our multiobjective fuzzy rule selection problem in (4). In this paper, we use the NSGA-II algorithm [24] because it is a well-known high-performance EMO algorithm. Let P be the current population in NSGA-II. The outline of NSGA-II can be written as follows:
  Step 1: P := Initialize(P)
  Step 2: while a termination condition is not satisfied, do
  Step 3:   P' := Selection(P)
  Step 4:   P'' := Genetic Operations(P')
  Step 5:   P := Replace(P ∪ P'')
  Step 6: end while
  Step 7: return non-dominated solutions(P)

First, an initial population is generated in Step 1 in the same manner as in single-objective genetic algorithms. The genetic operations in Step 4 are also the same as those in single-objective genetic algorithms. Parent selection in Step 3 and generation update in Step 5 of NSGA-II are different from single-objective genetic algorithms: Pareto ranking and a crowding measure are used to evaluate each solution in Step 3 for parent selection and in Step 5 for generation update. For details of NSGA-II, see Deb [1] and Deb et al. [24].

In the application of NSGA-II to multiobjective genetic fuzzy rule selection, we use two problem-specific heuristic tricks to efficiently find small rule sets with high accuracy. One trick is biased mutation, where a larger probability is assigned to the mutation from 1 to 0 than to the mutation from 0 to 1. The other trick is the removal of unnecessary rules, which is a kind of local search. Since we use the single winner rule-based method for the classification of each pattern by the rule set S, some rules in S may be chosen as winner rules for no training patterns. By removing these rules from S, we can improve the second objective (i.e., the number of fuzzy rules in S) without degrading the first objective (i.e., the error rate on training patterns). The removal of unnecessary rules is performed after the first objective is calculated and before the second objective is calculated.

III. HEURISTIC FUZZY RULE EXTRACTION
In this section, we explain heuristic fuzzy rule extraction using rule evaluation criteria in data mining.

A.
Pattern Classification Problem
Let us assume that we have m training (i.e., labeled) patterns x_p = (x_p1, ..., x_pn), p = 1, 2, ..., m, from M classes in the n-dimensional continuous pattern space, where x_pi is the attribute value of the p-th training pattern for the i-th attribute (i = 1, 2, ..., n). For simplicity of explanation, we assume that all the attribute values have already been normalized into real numbers in the unit interval [0, 1]. That
is, x_pi ∈ [0, 1] for p = 1, 2, ..., m and i = 1, 2, ..., n. Thus the pattern space of our pattern classification problem is the n-dimensional unit hypercube [0, 1]^n.

B. Fuzzy Rules for Pattern Classification Problem
For our n-dimensional pattern classification problem, we use fuzzy rules of the following type:

  Rule R_q: If x_1 is A_q1 and ... and x_n is A_qn then Class C_q with CF_q,   (6)

where R_q is the label of the q-th fuzzy rule, x = (x_1, ..., x_n) is an n-dimensional pattern vector, A_qi is an antecedent fuzzy set (i = 1, 2, ..., n), C_q is a class label, and CF_q is a rule weight (i.e., a certainty grade). For other types of fuzzy rules for pattern classification problems, see [17], [25], [26].

Since we usually have no a priori information about an appropriate granularity of the fuzzy discretization for each attribute, we simultaneously use multiple fuzzy partitions with different granularities in fuzzy rule extraction. In our computational experiments, we use the four homogeneous fuzzy partitions with triangular fuzzy sets in Fig. 1. In addition to the 14 fuzzy sets in Fig. 1, we also use the domain interval [0, 1] itself as an antecedent fuzzy set in order to represent a don't care condition. That is, we use 15 antecedent fuzzy sets for each attribute in our computational experiments.

[Fig. 1. Four fuzzy partitions used in our computational experiments: homogeneous partitions of the attribute value domain [0, 1] into triangular fuzzy sets (14 fuzzy sets in total), with membership plotted against the attribute value.]

C. Fuzzy Rule Generation
Since we use 15 antecedent fuzzy sets for each attribute of our n-dimensional pattern classification problem, the total number of combinations of the antecedent fuzzy sets is 15^n. Each combination can be used in the antecedent part of the fuzzy rule in (6). Thus the total number of possible fuzzy rules is also 15^n. The consequent class C_q and the rule weight CF_q of each fuzzy rule R_q are specified from the given training patterns in the following heuristic manner.

First we calculate the compatibility grade of each pattern x_p with the antecedent part A_q of the fuzzy rule R_q using the product operation as

  μ_Aq(x_p) = μ_Aq1(x_p1) · ... · μ_Aqn(x_pn),   (7)

where μ_Aqi(·) is the membership function of A_qi. Next the confidence of the fuzzy rule "A_q ⇒ Class h" is calculated for each class (h = 1, 2, ..., M) as follows [17]:

  c(A_q ⇒ Class h) = ( Σ_{x_p ∈ Class h} μ_Aq(x_p) ) / ( Σ_{p=1}^{m} μ_Aq(x_p) ).   (8)

The consequent class C_q is specified by identifying the class with the maximum confidence:

  c(A_q ⇒ Class C_q) = max { c(A_q ⇒ Class h) | h = 1, 2, ..., M }.   (9)

The consequent class C_q can be viewed as the dominant class in the fuzzy subspace defined by the antecedent part A_q. When there is no pattern in the fuzzy subspace defined by A_q, we do not generate any fuzzy rules with A_q in the antecedent part. This specification method of the consequent class of fuzzy rules has been used in many studies since [27].

The rule weight CF_q of each fuzzy rule R_q has a large effect on the performance of fuzzy rule-based classification systems [28]. Different specifications of the rule weight have been proposed and examined in the literature. We use the following specification because good results were reported for it in the literature [17], [29]:

  CF_q = c(A_q ⇒ Class C_q) − Σ_{h=1, h≠C_q}^{M} c(A_q ⇒ Class h).   (10)

Let S be a set of fuzzy rules of the form in (6). A new pattern x_p is classified by the single winner rule R_w, which is chosen from the rule set S as follows:

  μ_Aw(x_p) · CF_w = max { μ_Aq(x_p) · CF_q | R_q ∈ S }.   (11)

As shown in (11), the winner rule R_w has the maximum product of the compatibility grade and the rule weight in S. For other fuzzy reasoning methods for pattern classification problems, see Cordon et al. [25] and Ishibuchi et al. [17], [26]. It should be noted that the choice of an appropriate rule weight specification depends on the type of fuzzy reasoning (i.e., single winner rule-based fuzzy reasoning) used in fuzzy rule-based classification systems [17], [29].
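To make the rule generation procedure concrete, the following sketch implements (7)-(11) in Python. It is a simplified illustration under our own naming and data-format assumptions (membership functions as callables, patterns as tuples, `None` for a don't care antecedent), not the authors' implementation:

```python
def compatibility(antecedent, pattern):
    """mu_Aq(x_p): product of the antecedent memberships, as in (7).
    Each antecedent entry is a membership function; None = don't care."""
    mu = 1.0
    for mf, x in zip(antecedent, pattern):
        mu *= 1.0 if mf is None else mf(x)
    return mu

def consequent_and_weight(antecedent, classes, patterns, labels):
    """Pick the consequent class by maximum confidence, (8)-(9), and
    compute the rule weight CFq by (10)."""
    total = sum(compatibility(antecedent, x) for x in patterns)
    if total == 0.0:  # empty fuzzy subspace: no rule is generated
        return None, 0.0
    conf = {h: sum(compatibility(antecedent, x)
                   for x, y in zip(patterns, labels) if y == h) / total
            for h in classes}
    cq = max(conf, key=conf.get)
    cf = conf[cq] - sum(v for h, v in conf.items() if h != cq)
    return cq, cf

def classify(rules, pattern):
    """Single-winner classification, (11): the compatible rule with the
    largest mu_Aq(x) * CFq decides; None means rejection."""
    winner, best = None, 0.0
    for antecedent, cls, cf in rules:
        v = compatibility(antecedent, pattern) * cf
        if v > best:
            winner, best = cls, v
    return winner
```

A rejected pattern (no compatible rule) returns `None`, which matches the paper's convention of counting rejections as errors.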
D. Rule Evaluation Criteria
Using the above-mentioned procedure, we can generate a large number of fuzzy rules by specifying the consequent class and the rule weight for each of the 15^n combinations of the antecedent fuzzy sets. It is, however, very difficult for human users to handle such a large number of generated fuzzy rules. It is also very difficult to intuitively understand long fuzzy rules with many antecedent conditions. Thus we generate only short fuzzy rules with a small number of antecedent conditions. It should be noted that don't care conditions with the special antecedent fuzzy set [0, 1] can be omitted from fuzzy rules. The rule length is the number of antecedent conditions excluding don't care conditions. We examine only short fuzzy rules of length Lmax or less (e.g., Lmax = 3). This restriction is intended to find a small number of short (i.e., simple) fuzzy rules with high interpretability. Among short fuzzy rules, we choose a prespecified number of good rules by a heuristic rule evaluation criterion. In the field of data mining, two rule evaluation criteria (i.e.,
confidence and support) have often been used [30], [31]. We have already shown the fuzzy version of the confidence criterion in (8). In the same manner, the support of the fuzzy rule "A_q ⇒ Class h" is calculated as follows [17]:

  s(A_q ⇒ Class h) = ( Σ_{x_p ∈ Class h} μ_Aq(x_p) ) / m.   (12)
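Given the fuzzy confidence (8) and support (12) of a candidate rule, each of the four heuristic rule evaluation criteria described below reduces to a simple scoring function. A minimal sketch (the function name, argument format, and threshold defaults are illustrative assumptions, not the paper's code):

```python
def criterion_score(c, s, s_other, criterion, min_conf=0.7, min_supp=0.04):
    """Score one candidate rule under a heuristic rule evaluation
    criterion: c = fuzzy confidence (8), s = fuzzy support (12),
    s_other = total support of rules with the same antecedent but a
    different consequent class. Candidate rules are then ranked per
    class and the top ones are extracted greedily."""
    if criterion == "support_min_conf":   # support with minimum confidence
        return s if c >= min_conf else float("-inf")
    if criterion == "conf_min_supp":      # confidence with minimum support
        return c if s >= min_supp else float("-inf")
    if criterion == "product":            # product of confidence and support
        return c * s
    if criterion == "support_diff":       # difference in support, as in (13)
        return s - s_other
    raise ValueError(criterion)
```

Rules scoring negative infinity are the "unqualified" rules that a thresholded criterion never extracts.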
In our computational experiments, we use the following four rule evaluation criteria to extract a prespecified number of short fuzzy rules for each class from numerical data:

Support with the minimum confidence level: Each rule is evaluated based on its support when its confidence is larger than or equal to the prespecified minimum confidence level. This criterion never extracts unqualified rules whose confidence is smaller than the minimum confidence level. Various values of the minimum confidence level (e.g., 0.1, 0.2, ..., 0.9) are examined in computational experiments.

Confidence with the minimum support level: Each rule is evaluated based on its confidence when its support is larger than or equal to the prespecified minimum support level. This criterion never extracts unqualified rules whose support is smaller than the minimum support level. Various values of the minimum support level (e.g., 0.01, 0.02, ..., 0.09) are examined in computational experiments.

Product of confidence and support: Each rule is evaluated based on the product of its confidence and support.

Difference in support: Each rule is evaluated based on the difference between its support and the total support of the other rules with the same antecedent part and different consequent classes. More specifically, the rule R_q with the antecedent fuzzy vector A_q and the consequent class C_q is evaluated as

  f(R_q) = s(A_q ⇒ Class C_q) − Σ_{h=1, h≠C_q}^{M} s(A_q ⇒ Class h).   (13)

This criterion can be viewed as a simplified version of a rule evaluation criterion used in an iterative fuzzy genetics-based machine learning algorithm called SLAVE [32]. We generate a prespecified number of fuzzy rules with the largest values of each criterion in a greedy manner for each class. As we have already mentioned, only short fuzzy rules of length Lmax or less are examined in heuristic rule extraction in order to find interpretable fuzzy rules.

IV. COMPUTATIONAL EXPERIMENT USING HEURISTIC FUZZY RULE EXTRACTION
In this section, we perform computational experiments using heuristic fuzzy rule extraction based on rule evaluation criteria. Extracted fuzzy rules are used as candidate rules in multiobjective genetic fuzzy rule selection in the next section.

A. Data Sets
We use the six data sets in Table I: Wisconsin breast cancer (Breast W), diabetes (Diabetes), glass identification (Glass), Cleveland heart disease (Heart C), sonar (Sonar), and wine recognition (Wine). These six data sets are available from the UC Irvine machine learning repository. Data sets with missing values are marked by "*" in the third column of Table I. Since we do not use incomplete patterns with missing values, the number of patterns in the third column does not include those patterns. All attribute values are normalized into real numbers in the unit interval [0, 1].

For comparison, the last two columns of Table I show the results reported in Elomaa & Rousu [33], where six variants of the C4.5 algorithm were examined. In [33], the generalization ability of each variant was evaluated by ten independent runs (with different data partitions) of the whole ten-fold cross-validation (10CV) procedure (i.e., 10 × 10CV). The last two columns of Table I show, for each data set, the best and worst error rates on test patterns among the six variants reported in [33].

In our computational experiments, we also use the 10CV procedure. As in [33], 10CV is iterated ten times (i.e., 10 × 10CV). In each of the 100 runs in 10 × 10CV, error rates are calculated on training patterns (90% of the given patterns) as well as on test patterns (10% of the given patterns).

TABLE I
DATA SETS USED IN OUR COMPUTATIONAL EXPERIMENTS

  Data set   Attributes   Patterns   Classes   C4.5 in [33] Best   C4.5 in [33] Worst
  Breast W        9         683*        2              5.1                 6.0
  Diabetes        8         768**       2             25.0                27.2
  Glass           9         214         6             27.3                32.2
  Heart C        13         297*        5             46.3                47.9
  Sonar          60         208         2             24.6                35.8
  Wine           13         178         3              5.6                 8.8

  * Incomplete patterns with missing values are not included.
  ** Some suspicious patterns with attribute value "0" are included.
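The 10 × 10CV protocol used throughout the experiments (ten independent runs of the whole ten-fold cross-validation procedure, giving 100 train/test splits) can be sketched as follows; the function name and seeding scheme are illustrative assumptions:

```python
import random

def ten_by_ten_cv(n_patterns, n_iters=10, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) index pairs for n_iters independent
    runs of n_folds-fold cross-validation: 10 x 10 = 100 splits, each
    with 90% of the patterns for training and 10% for testing."""
    rng = random.Random(seed)
    idx = list(range(n_patterns))
    for _ in range(n_iters):
        rng.shuffle(idx)  # a fresh data partition for each 10CV run
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f
                     for i in fold]
            yield train, test
```

Each of the 100 yielded splits would then drive one run of rule extraction (and, in Section V, rule selection), with error rates recorded on both partitions.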
B. Performance on Training Patterns
In heuristic fuzzy rule extraction, various specifications are used as the number of extracted fuzzy rules in order to examine the relation between the accuracy and the complexity of fuzzy rule-based systems. The number of extracted fuzzy rules for each class is specified as 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, and 100. The four rule evaluation criteria in Section III are used in heuristic rule extraction. When multiple fuzzy rules have the same value of a rule evaluation criterion, those rules are randomly ordered (i.e., random tie breaking). The maximum rule length Lmax is specified as Lmax = 2 for the sonar data set with 60 attributes and Lmax = 3 for the other data sets. That is, fuzzy rules of length 2 or less are examined for the sonar data set while those of length 3 or less are examined for the other data sets. We use this different specification because only the sonar data set involves a large number of attributes (i.e., it has a huge number of possible combinations of antecedent fuzzy sets). Among the four heuristic rule evaluation criteria, good results are obtained from the support with the minimum confidence level, the product of confidence and support, and
the difference in support. Experimental results using these criteria are summarized in Tables II-VII, where the average error rate on training patterns is calculated over 10 × 10CV. Five values of the minimum confidence level are examined in these tables. (In each table, the columns 0.5-0.9 give the support criterion with that minimum confidence level; Product and Diff. denote the product of confidence and support and the difference in support, respectively.) The best result (i.e., the lowest error rate) in each row is highlighted by underlined boldface in the original tables. From Tables II-VII, we can see that an increase in the number of extracted fuzzy rules does not always lead to a decrease in the error rates (e.g., see the last column of Table II). This observation suggests that the classification accuracy of heuristically extracted fuzzy rules can be improved by rule selection. We can also see that the choice of an appropriate rule evaluation criterion is problem-dependent.

TABLE II
ERROR RATES ON TRAINING PATTERNS (BREAST W)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1    9.47   9.47   8.99   7.24   9.33     5.89    5.81
    2    9.08   9.08   7.53   7.77   8.24     6.41    6.32
    3    7.60   7.60   5.29   6.69   6.45     6.39    6.47
    4    5.30   5.30   5.20   6.48   6.60     5.18    5.64
    5    5.27   5.27   5.43   6.46   6.93     4.81    4.72
   10    6.37   5.78   5.38   5.20   6.08     4.29    4.20
   20    5.12   4.84   4.29   4.10   4.45     3.61    3.61
   30    4.26   4.12   4.23   4.35   4.44     3.78    3.75
   40    4.46   4.44   4.43   4.39   4.41     3.82    3.76
   50    4.42   4.34   4.29   4.18   4.37     3.83    3.73
  100    3.94   3.94   3.94   3.94   4.26     3.81    3.70

TABLE III
ERROR RATES ON TRAINING PATTERNS (DIABETES)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   34.43  33.53  26.39  26.73  46.44    34.27   26.39
    2   34.96  33.02  26.48  26.44  41.65    34.13   26.44
    3   34.96  31.28  28.57  26.42  38.21    34.59   27.45
    4   34.90  31.55  29.04  26.34  35.63    33.57   28.02
    5   34.90  31.48  29.11  26.11  34.07    32.44   28.79
   10   34.90  30.40  29.13  26.15  31.14    30.17   29.72
   20   34.90  30.23  29.78  26.18  28.33    30.63   30.22
   30   34.66  30.34  30.13  26.41  26.96    30.47   30.34
   40   33.06  30.29  30.31  26.59  25.79    30.47   30.42
   50   31.74  30.29  30.37  26.72  24.61    30.61   30.51
  100   31.14  30.75  30.32  26.86  22.77    30.78   30.65

TABLE IV
ERROR RATES ON TRAINING PATTERNS (GLASS)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   47.49  43.51  38.53  63.65  76.77    49.37   53.27
    2   44.06  42.76  37.58  61.63  76.43    48.68   53.09
    3   42.54  42.15  37.08  59.78  76.02    48.17   53.01
    4   42.67  41.92  36.76  58.46  75.78    46.99   52.40
    5   42.92  41.53  36.55  57.63  75.69    45.37   50.47
   10   40.97  39.84  36.18  55.14  74.40    41.58   47.39
   20   39.70  38.09  34.76  52.64  72.54    40.72   43.51
   30   38.41  37.41  33.97  51.96  71.07    40.25   41.48
   40   38.00  36.68  33.41  51.42  70.18    39.91   40.02
   50   37.91  35.76  33.07  50.68  68.84    39.68   39.15
  100   35.92  33.16  32.79  49.39  66.96    38.41   36.76

TABLE V
ERROR RATES ON TRAINING PATTERNS (HEART C)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   47.12  24.86  25.65  40.34  43.90    46.97   25.20
    2   46.69  25.13  24.42  35.00  36.75    45.20   24.74
    3   46.21  25.13  23.86  32.56  30.72    43.86   24.13
    4   46.31  24.80  23.47  30.31  28.67    42.93   23.74
    5   46.14  24.71  23.28  28.78  27.58    42.57   23.46
   10   43.81  23.85  22.44  24.96  25.34    40.80   22.79
   20   42.84  23.53  21.12  22.51  22.60    42.01   22.21
   30   45.08  23.80  21.06  22.23  21.34    41.18   21.91
   40   45.71  23.69  21.28  22.16  19.70    39.70   21.60
   50   45.41  23.55  21.31  22.41  17.88    37.20   21.35
  100   45.92  23.77  21.64  22.55  12.71    28.65   20.84

TABLE VI
ERROR RATES ON TRAINING PATTERNS (SONAR)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   44.46  41.98  49.18  51.51  57.60    45.30   39.81
    2   44.32  42.91  47.22  50.13  57.61    44.67   39.87
    3   44.79  43.79  44.79  50.12  57.57    44.52   39.70
    4   44.88  43.81  44.58  50.06  57.44    44.22   39.68
    5   44.54  42.93  42.43  45.52  56.57    43.75   39.29
   10   44.78  42.94  41.14  43.91  55.83    40.08   38.64
   20   44.41  42.49  39.72  42.35  54.45    38.37   38.25
   30   44.09  42.00  39.01  41.07  53.29    37.69   37.78
   40   43.84  41.52  38.59  40.40  51.11    37.06   37.18
   50   43.61  40.99  38.00  39.61  49.54    36.67   36.43
  100   42.75  39.45  36.91  36.97  45.61    35.16   34.72

TABLE VII
ERROR RATES ON TRAINING PATTERNS (WINE)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1    9.60   9.81  32.46  14.36  11.14    10.03   10.15
    2   25.05  12.36   9.26   5.65   7.32     7.32    6.80
    3   14.03  13.24   8.10   5.44   6.01     6.21    6.77
    4   13.03  12.62   7.10   5.53   5.39     5.44    5.65
    5   13.42  11.76   5.96   5.56   5.32     5.54    5.38
   10   12.86   7.38   5.49   5.17   3.53     5.82    4.90
   20    9.47   5.07   5.11   4.91   3.45     4.90    4.11
   30    6.25   5.14   5.11   4.64   3.56     4.44    3.86
   40    5.03   4.96   5.01   4.31   3.46     4.05    3.63
   50    5.03   4.86   4.80   4.07   3.28     3.85    3.55
  100    4.43   4.69   4.09   3.34   3.09     3.46    3.22
C. Performance on Test Patterns
Experimental results on test patterns are summarized in Tables VIII-XI (due to the page limitation, we show experimental results only for the first four data sets). During the ten iterations of the whole 10CV procedure, the average error rates in Tables VIII-XI are calculated on test patterns, while those in Tables II-VII in the previous subsection are calculated on training patterns. We have almost the same observations from Tables VIII-XI for test patterns as from Tables II-VII for training patterns. That is, the choice of an appropriate rule evaluation criterion is problem-dependent, and an increase in the number of extracted fuzzy rules does not always increase classification accuracy.
TABLE VIII
ERROR RATES ON TEST PATTERNS (BREAST W)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1    9.60   9.60   9.07   7.35   9.35     6.59    6.49
    2    9.16   9.16   8.30   8.13   8.70     7.01    6.90
    3    8.34   8.34   5.62   7.09   6.65     7.07    7.06
    4    5.46   6.87   6.68   5.71   5.71     5.95    6.38
    5    5.54   5.54   5.46   6.62   7.06     5.42    5.27
   10    6.50   6.02   5.58   5.33   6.31     4.53    4.54
   20    5.41   5.26   4.77   4.42   4.72     3.76    3.81
   30    4.45   4.28   4.47   4.55   4.64     3.97    3.95
   40    4.69   4.61   4.58   4.53   4.58     4.03    4.04
   50    4.51   4.45   4.44   4.31   4.48     4.14    4.06
  100    4.00   4.00   3.98   3.97   4.45     4.17    4.04

V. COMPUTATIONAL EXPERIMENT USING MULTIOBJECTIVE GENETIC FUZZY RULE SELECTION
TABLE IX
ERROR RATES ON TEST PATTERNS (DIABETES)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   34.37  33.18  26.57  26.92  46.97    34.30   26.57
    2   34.97  32.33  26.67  26.53  42.28    34.21   26.59
    3   34.95  30.63  29.00  26.63  39.02    34.64   28.27
    4   34.90  31.03  29.30  26.55  36.45    34.24   28.79
    5   34.90  30.93  29.37  26.18  34.75    33.06   29.29
   10   34.90  30.41  29.63  26.57  31.87    30.42   30.09
   20   34.90  30.34  30.07  26.80  29.17    30.76   30.39
   30   34.78  30.54  30.32  27.15  27.87    30.73   30.56
   40   33.44  30.48  30.54  27.10  27.07    30.64   30.59
   50   32.19  30.45  30.63  27.37  26.17    30.75   30.67
  100   31.43  30.86  30.60  27.53  24.52    30.84   30.84
TABLE X
ERROR RATES ON TEST PATTERNS (GLASS)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   52.24  49.57  42.59  71.58  81.75    49.05   46.86
    2   49.14  49.10  41.94  70.79  81.61    48.16   47.24
    3   47.08  48.76  41.28  69.05  81.19    47.98   47.34
    4   47.40  48.76  41.09  68.07  81.05    48.26   47.63
    5   47.32  48.11  41.42  67.37  80.86    48.21   47.29
   10   46.19  47.13  42.46  65.48  80.01    47.34   47.72
   20   46.44  47.27  41.90  63.14  78.44    47.09   47.38
   30   46.53  47.04  41.38  62.30  76.48    47.71   46.82
   40   46.80  46.37  40.86  61.59  75.88    47.29   47.06
   50   47.04  46.14  41.39  60.44  74.42    47.15   46.18
  100   46.03  45.55  41.21  58.87  72.79    45.76   44.31
TABLE XI
ERROR RATES ON TEST PATTERNS (HEART C)

  Rules   0.5    0.6    0.7    0.8    0.9   Product   Diff.
    1   46.81  47.56  56.61  58.66  63.23    57.05   61.20
    2   46.24  47.02  53.95  56.90  63.23    56.41   60.99
    3   46.24  46.34  49.79  56.90  63.23    55.54   60.96
    4   46.21  46.31  49.28  56.73  63.09    53.92   59.78
    5   46.14  46.34  47.62  51.12  62.65    52.77   57.79
   10   46.11  45.94  47.29  49.24  61.78    49.74   56.08
   20   45.91  45.87  46.55  48.74  60.73    48.38   52.70
   30   45.87  45.74  46.41  48.16  61.11    46.77   50.95
   40   45.77  45.88  46.34  48.10  58.92    46.64   49.04
   50   45.77  45.94  46.17  48.06  57.81    46.07   48.46
  100   45.74  46.00  46.00  47.52  56.67    46.04   47.58
A. Settings of Computational Experiments
For each data set, we choose the heuristic rule evaluation criterion from which the lowest error rate on test patterns was obtained in the previous subsection for the case of 100 fuzzy rules for each class. For example, the support criterion with the minimum confidence level 0.8 is chosen for the Wisconsin breast cancer data set (see Table VIII). As in the previous section, we iterate the whole ten-fold cross-validation procedure ten times (10 × 10CV). In each of the 100 runs in 10 × 10CV, we generate 300 fuzzy rules for each class (i.e., 300M rules in total for an M-class classification problem) from the training patterns using the chosen heuristic rule evaluation criterion for each data set. Multiobjective genetic rule selection based on the NSGA-II algorithm is applied to the extracted 300M fuzzy rules for each data set using the following parameter specifications:

  Population size: 200 strings,
  Crossover probability: 0.8 (uniform crossover),
  Biased mutation probabilities: p_m(0 → 1) = 1/300M and p_m(1 → 0) = 0.1,
  Stopping condition: 5000 generations.

Multiple non-dominated rule sets are obtained by the NSGA-II algorithm in each of the 100 runs in 10 × 10CV. We calculate the error rates of each rule set on training patterns and test patterns. Then the average error rates on training patterns and test patterns are calculated over the obtained rule sets with the same number of fuzzy rules among the 100 runs. Only when rule sets with the same number of fuzzy rules are found in all 100 runs do we report the average error rates for that number of fuzzy rules.

B. Performance on Training Patterns
Experimental results on training patterns are shown in Fig. 2, where open circles denote the performance of non-dominated rule sets obtained by multiobjective genetic rule selection. For comparison, Fig. 2 also shows the experimental results of heuristic rule extraction using the same rule evaluation criterion as in the candidate rule generation in multiobjective genetic rule selection. For example, the closed circles in Fig. 2 (a) correspond to the error rates in Table II by the support criterion with the minimum confidence level 0.8. This criterion is used in the candidate rule generation as explained in the previous subsection. We can see from Fig. 2 that the accuracy of heuristically extracted fuzzy rules on training patterns is improved by multiobjective genetic rule selection for all six data sets.

C. Performance on Test Patterns
Experimental results on test patterns by multiobjective genetic rule selection are shown in Fig. 3 together with those by heuristic rule extraction. The results reported for the C4.5 algorithm in [33] are also shown in Fig. 3 for comparison (see the last two columns of Table I). From Fig. 3, we can see that the generalization ability of
6 4 2 2
4
6
8
10 20 40 60 80 100 200
50 45 40 35 30 25
35 30 25 20 2
Number of rules (a) Breast W
Heuristic rule extraction Multiobjective rule selection
5 10 15 20 25 50 100 150 200 250 500
Heuristic rule extraction Multiobjective rule selection
50
4
6
8
10 20 40 60 80 100 200
Heuristic rule extraction Multiobjective rule selection
30 20 10 0
2
4
6
Heuristic rule extraction Multiobjective rule selection 45 40 35 30 25 20
Number of rules (b) Diabetes
40
Number of rules (d) Heart C
Error rate on training patterns (%)
8
40
due to the increase in the number of fuzzy rules. We can also see that the generalization ability of obtained non-dominated fuzzy rule sets is comparable to the reported results by the C4.5 algorithm in many cases (except for Fig. 3 (c)).
8
10 20 40 60 80 100 200
Error rate on training patterns (%)
10
Error rate on training patterns (%)
Heuristic rule extraction Multiobjective rule selection
Error rate on training patterns (%)
Error rate on training patterns (%)
Error rate on training patterns (%)
heuristically extracted fuzzy rules is improved for the five data sets (except for Heart C in Fig. 3 (d)) by multiobjective genetic rule selection. In Fig. 3 (d), we observe the increase in the error rates of obtained non-dominated fuzzy rule sets
6
12 18 24 30 60 120 180 240 300 600
Number of rules (c) Glass
Heuristic rule extraction Multiobjective rule selection 10 8 6 4 2 0
Number of rules (e) Sonar
3
6
9
12 15 30 60 90 120 150 300
Number of rules (f) Wine
9 8 7 6 5 4 3
2
4
6
8
10 20 40 60 80 100 200
Number of rules (a) Breast W
C4.5 Worst C4.5 Best
50 48 46 44 42
40 38 36 34 32 30 28 26 24 22
5 10 15 20 25 50 100 150 200 250 500
Number of rules (d) Heart C
Heuristic rule extraction Multiobjective rule selection
2
4
6
C4.5 Worst C4.5 Best
8 10 20 40 60 80 100 200
45 40 35 30 25 4
6
8
10 20 40 60 80 100 200
Number of rules (e) Sonar
Fig. 3. Error rates on test patterns.
1639
C4.5 Worst C4.5 Best
50 45 40 35 30 6 12 18 24 30 60 120 180240 300 600
Number of rules (c) Glass
C4.5 Worst C4.5 Best
50
2
Heuristic rule extraction Multiobjective rule selection
Number of rules (b) Diabetes
Heuristic rule extraction Multiobjective rule selection
20
Error rate on test patterns (%)
10
Heuristic rule extraction Multiobjective rule selection
Error rate on test patterns (%)
C4.5 Worst C4.5 Best
Error rate on test patterns (%)
Heuristic rule extraction Multiobjective rule selection
Error rate on test patterns (%)
11
Error rate on test patterns (%)
Error rate on test patterns (%)
Fig. 2. Error rates on training patterns.
Heuristic rule extraction Multiobjective rule selection
C4.5 Worst C4.5 Best
12 10 8 6 4 2
3
6
9 12 15 30 60 90 120 150 300
Number of rules (f) Wine
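To make the rule selection step concrete, the following is a minimal Python sketch, not the authors' implementation: a candidate rule set is encoded as a binary string (bit = 1 means the rule is included), the biased mutation favors removing rules over adding them, and non-dominated rule sets are kept under the two minimization objectives (misclassified patterns, number of rules). All function names and the toy objective values are illustrative assumptions.

```python
import random

def biased_mutation(string, p_add, p_remove):
    """Flip 0 -> 1 with small probability p_add (e.g., 1/300M) and
    1 -> 0 with larger probability p_remove (e.g., 0.1), which biases
    the search toward small rule sets."""
    return [
        (1 - bit) if random.random() < (p_add if bit == 0 else p_remove)
        else bit
        for bit in string
    ]

def dominates(a, b):
    """Pareto dominance for minimization of (error count, number of rules)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(solutions):
    """Keep the objective vectors not dominated by any other."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

# Toy example: (misclassified patterns, number of rules) for five candidate
# rule sets; the non-dominated ones form the accuracy-complexity tradeoff.
candidates = [(10, 3), (8, 5), (8, 7), (12, 2), (6, 9)]
front = nondominated(candidates)  # (8, 7) is dominated by (8, 5)
```

An NSGA-II run would additionally apply non-dominated sorting and crowding-distance selection to the whole population each generation; the sketch above only isolates the two ingredients specific to rule selection, the removal-biased mutation and the Pareto filtering of (accuracy, complexity) pairs.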
VI. CONCLUSIONS

We showed that multiobjective genetic rule selection can decrease the number of heuristically extracted fuzzy rules while improving their classification accuracy on training patterns. Their generalization ability on test patterns was also improved by multiobjective genetic rule selection in many cases. Since a large number of fuzzy rules are usually extracted in a heuristic manner, our experimental results suggest the usefulness of multiobjective genetic fuzzy rule selection as a postprocessing procedure in fuzzy data mining with respect to the understandability of extracted knowledge.

REFERENCES

[1] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, John Wiley & Sons, Chichester, 2001.
[2] C. A. Coello Coello, D. A. van Veldhuizen, and G. B. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems, Kluwer Academic Publishers, Boston, 2002.
[3] C. A. Coello Coello and G. B. Lamont, Applications of Multi-Objective Evolutionary Algorithms, World Scientific, Singapore, 2004.
[4] M. A. Kupinski and M. A. Anastasio, “Multiobjective genetic optimization of diagnostic classifiers with implications for generating receiver operating characteristic curve,” IEEE Trans. on Medical Imaging, vol. 18, no. 8, pp. 675-685, August 1999.
[5] J. Gonzalez, I. Rojas, J. Ortega, H. Pomares, F. J. Fernandez, and A. F. Diaz, “Multiobjective evolutionary optimization of the size, shape, and position parameters of radial basis function networks for function approximation,” IEEE Trans. on Neural Networks, vol. 14, no. 6, pp. 1478-1495, November 2003.
[6] X. Llora and D. E. Goldberg, “Bounding the effect of noise in multiobjective learning classifier systems,” Evolutionary Computation, vol. 11, no. 3, pp. 278-297, September 2003.
[7] H. A. Abbass, “Speeding up back-propagation using multiobjective evolutionary algorithms,” Neural Computation, vol. 15, no. 11, pp. 2705-2726, November 2003.
[8] H. A. Abbass, “Pareto neuro-evolution: Constructing ensemble of neural networks using multi-objective optimization,” Proc. of Congress on Evolutionary Computation, pp. 2074-2080, Canberra, Australia, December 8-12, 2003.
[9] A. Chandra and X. Yao, “DIVACE: Diverse and accurate ensemble learning algorithm,” Lecture Notes in Computer Science 3177: Intelligent Data Engineering and Automated Learning - IDEAL 2004, Springer, Berlin, pp. 619-625, August 2004.
[10] A. Chandra and X. Yao, “Evolutionary framework for the construction of diverse hybrid ensemble,” Proc. of the 13th European Symposium on Artificial Neural Networks - ESANN 2005, pp. 253-258, Brugge, Belgium, April 27-29, 2005.
[11] H. Ishibuchi and T. Yamamoto, “Evolutionary multiobjective optimization for generating an ensemble of fuzzy rule-based classifiers,” Lecture Notes in Computer Science, vol. 2723, Genetic and Evolutionary Computation - GECCO 2003, pp. 1077-1088, Springer, Berlin, July 2003.
[12] H. Ishibuchi, T. Murata, and I. B. Turksen, “Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems,” Fuzzy Sets and Systems, vol. 89, no. 2, pp. 135-150, July 1997.
[13] H. Ishibuchi, T. Nakashima, and T. Murata, “Three-objective genetics-based machine learning for linguistic rule extraction,” Information Sciences, vol. 136, no. 1-4, pp. 109-133, August 2001.
[14] O. Cordon, M. J. del Jesus, F. Herrera, L. Magdalena, and P. Villar, “A multiobjective genetic learning process for joint feature selection and granularity and contexts learning in fuzzy rule-based classification systems,” in J. Casillas, O. Cordon, F. Herrera, and L. Magdalena (eds.), Interpretability Issues in Fuzzy Modeling, pp. 79-99, Springer, Berlin, 2003.
[15] F. Jimenez, A. F. Gomez-Skarmeta, G. Sanchez, H. Roubos, and R. Babuska, “Accurate, transparent and compact fuzzy models by multiobjective evolutionary algorithms,” in J. Casillas, O. Cordon, F. Herrera, and L. Magdalena (eds.), Interpretability Issues in Fuzzy Modeling, pp. 431-451, Springer, Berlin, 2003.
[16] H. Ishibuchi and T. Yamamoto, “Fuzzy rule selection by multiobjective genetic local search algorithms and rule evaluation measures in data mining,” Fuzzy Sets and Systems, vol. 141, no. 1, pp. 59-88, January 2004.
[17] H. Ishibuchi, T. Nakashima, and M. Nii, Classification and Modeling with Linguistic Information Granules: Advanced Approaches to Linguistic Data Mining, Springer, Berlin, November 2004.
[18] H. Wang, S. Kwong, Y. Jin, W. Wei, and K. F. Man, “Agent-based evolutionary approach for interpretable rule-based knowledge extraction,” IEEE Trans. on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 35, no. 2, pp. 143-155, May 2005.
[19] H. Wang, S. Kwong, Y. Jin, W. Wei, and K. F. Man, “Multi-objective hierarchical genetic algorithm for interpretable fuzzy rule-based knowledge extraction,” Fuzzy Sets and Systems, vol. 149, no. 1, pp. 149-186, January 2005.
[20] H. Ishibuchi and Y. Nojima, “Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning,” International Journal of Approximate Reasoning (in press).
[21] Y. Jin (ed.), Multi-Objective Machine Learning, Springer, Berlin, 2006.
[22] H. Ishibuchi, K. Nozaki, N. Yamamoto, and H. Tanaka, “Construction of fuzzy classification systems with rectangular fuzzy rules using genetic algorithms,” Fuzzy Sets and Systems, vol. 65, no. 2/3, pp. 237-253, August 1994.
[23] H. Ishibuchi, K. Nozaki, N. Yamamoto, and H. Tanaka, “Selecting fuzzy if-then rules for classification problems using genetic algorithms,” IEEE Trans. on Fuzzy Systems, vol. 3, no. 3, pp. 260-270, August 1995.
[24] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. on Evolutionary Computation, vol. 6, no. 2, pp. 182-197, April 2002.
[25] O. Cordon, M. J. del Jesus, and F. Herrera, “A proposal on reasoning methods in fuzzy rule-based classification systems,” International Journal of Approximate Reasoning, vol. 20, no. 1, pp. 21-45, January 1999.
[26] H. Ishibuchi, T. Nakashima, and T. Morisawa, “Voting in fuzzy rule-based systems for pattern classification problems,” Fuzzy Sets and Systems, vol. 103, no. 2, pp. 223-238, April 1999.
[27] H. Ishibuchi, K. Nozaki, and H. Tanaka, “Distributed representation of fuzzy rules and its application to pattern classification,” Fuzzy Sets and Systems, vol. 52, no. 1, pp. 21-32, November 1992.
[28] H. Ishibuchi and T. Nakashima, “Effect of rule weights in fuzzy rule-based classification systems,” IEEE Trans. on Fuzzy Systems, vol. 9, no. 4, pp. 506-515, August 2001.
[29] H. Ishibuchi and T. Yamamoto, “Rule weight specification in fuzzy rule-based classification systems,” IEEE Trans. on Fuzzy Systems, vol. 13, no. 4, pp. 428-435, August 2005.
[30] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, pp. 307-328, 1996.
[31] F. Coenen, P. Leng, and L. Zhang, “Threshold tuning for improved classification association rule mining,” Lecture Notes in Computer Science 3518: Advances in Knowledge Discovery and Data Mining - PAKDD 2005, pp. 216-225, Springer, Berlin, May 2005.
[32] A. Gonzalez and R. Perez, “SLAVE: A genetic learning system based on an iterative approach,” IEEE Trans. on Fuzzy Systems, vol. 7, no. 2, pp. 176-191, April 1999.
[33] T. Elomaa and J. Rousu, “General and efficient multisplitting of numerical attributes,” Machine Learning, vol. 36, no. 3, pp. 201-244, September 1999.