Local Cascade Generalization

João Gama
LIACC, FEP - University of Porto
Rua Campo Alegre, 823, 4150 Porto, Portugal
Phone: (+351) 2 6078830   Fax: (+351) 2 6003654
Email: [email protected]   http://www.up.pt/liacc/ML
Abstract

In a previous work we have presented Cascade Generalization, a new general method for merging classifiers. The basic idea of Cascade Generalization is to sequentially run the set of classifiers, at each step performing an extension of the original data by the insertion of new attributes. The new attributes are derived from the probability class distribution given by a base classifier. This constructive step extends the representational language for the high level classifiers, relaxing their bias. In this paper we extend this work by applying Cascade locally. At each iteration of a divide and conquer algorithm, a reconstruction of the instance space occurs by the addition of new attributes. Each new attribute represents the probability that an example belongs to a class, as given by a base classifier. We have implemented three Local Generalization Algorithms. The first merges a linear discriminant with a decision tree, the second merges a naive Bayes with a decision tree, and the third merges a linear discriminant and a naive Bayes with a decision tree. All the algorithms show an increase in performance when compared with the corresponding single models. Cascade also outperforms other methods for combining classifiers, like Stacked Generalization, and competes well against Boosting, with statistically significant confidence levels.
Keywords: Multiple Models, Constructive Induction, Merging Classifiers.
1 Introduction

The ability of a chosen algorithm to induce a good generalization depends on how appropriate the class model underlying the algorithm is for the given task. An algorithm's class model is the representation language it uses to express a generalization of the examples. The representation language for a standard decision tree is the DNF formalism that splits the instance space by axis-parallel hyper-planes, while the representation language for a linear discriminant function is a set of linear functions that split the instance space by oblique hyper-planes. Since different learning algorithms employ different knowledge representations and search heuristics, different search spaces are explored and diverse results are obtained. The problem of finding the appropriate bias for a given task is an active research area. We can consider two main lines: on the one hand, methods that try to select the most appropriate algorithm for the given task, for instance Schaffer's selection by Cross-Validation, and on the other hand, methods that combine the predictions of different algorithms, for instance Stacked Generalization [25]. This work follows the second research line. Instead of looking for methods that fit the data using a single representation language, we present a family of algorithms, under the generic name of Cascade Generalization, whose search space contains models that use different representation languages. Cascade Generalization was first presented in [14]. It performs an iterative composition of classifiers. At each iteration a classifier is generated and the input space is extended by the addition of new attributes. These take the form of a probability class distribution which is obtained, for each example, from the generated base classifier. The language of the final classifier is the language used by the high level generalizer. This language uses terms that are expressions from the language of the low
level classifiers. In this sense, Cascade Generalization generates a unified theory from the base theories. Here we extend the work presented in [14] by applying Cascade locally. In our implementation, Local Cascade Generalization generates a decision tree. The experimental study shows that this methodology usually improves both accuracy and theory size, with statistically significant differences. The next section presents the framework of Cascade Generalization. In Section 3 we define a new family of algorithms that apply Cascade Generalization locally. In Section 4 we review previous work in the area of multiple models. In Section 5 we perform an empirical study using UCI data sets. The last section presents an analysis of the results and concludes the paper.
2 Cascade Generalization

Consider a learning set $D = \{(\vec{x}_n, y_n)\}$, $n = 1, \ldots, N$, where $\vec{x}_n = [x_1, \ldots, x_m]$ is a multidimensional input vector and $y_n$ is the output variable. Since the focus of this paper is on classification problems, $y_n$ takes values from a set of predefined values, that is $y_n \in \{Cl_1, \ldots, Cl_c\}$, where $c$ is the number of classes. A classifier $\Im$ is a function that is applied to the training set $D$ in order to construct a model $\Im(D)$. The generated model is a mapping from the input space $X$ to the discrete output variable $Y$. When used as a predictor, represented by $\Im(\vec{x}, D)$, it assigns a $y$ value to the example $\vec{x}$. This is the traditional framework for classification tasks. Our framework requires that the predictor $\Im(\vec{x}, D)$ outputs a vector representing a conditional probability distribution $[p_1, \ldots, p_c]$, where $p_i$ represents the probability that the example $\vec{x}$ belongs to class $i$, i.e. $P(y = Cl_i \mid \vec{x})$. The class assigned to the example $\vec{x}$ is the one that maximizes this last expression. Most of the commonly used classifiers, such as naive Bayes and Discriminant, classify each example in this way. Other classifiers, for example C4.5, have a different strategy for classifying an example, but only small changes are required to obtain a probability class distribution. We define a constructive operator $\Phi(D', \Im(\vec{x}, D))$. This operator has two input parameters: a data set $D'$ and a predictor $\Im(\vec{x}, D)$. The classifier $\Im$ generates a theory from the training data $D$. For each example $\vec{x} \in D'$, the generated theory outputs a probability class distribution. For all the examples in $D'$ the operator concatenates the input vector $\vec{x}$ with the output probability class distribution. The output of $\Phi(D', \Im(\vec{x}, D))$ is a new data set $D''$. The cardinality
of $D''$ is equal to the cardinality of $D'$ (i.e. they have the same number of examples). Each example $\vec{x} \in D''$ has an equivalent example in $D'$, but augmented with $c$ new attributes. The new attributes are the elements of the vector of class probability distribution obtained when applying the classifier $\Im(\vec{x}, D)$ to the example $\vec{x}$. Cascade generalization is a sequential composition of classifiers that, at each generalization level, applies the $\Phi$ operator. Given a training set $L$, a test set $T$, and two classifiers $\Im_1$ and $\Im_2$, Cascade generalization proceeds as follows. Using classifier $\Im_1$, it generates the $Level_1$ data:
$Level_1train = \Phi(L, \Im_1(\vec{x}, L))$
$Level_1test = \Phi(T, \Im_1(\vec{x}, L))$
Classifier $\Im_2$ learns on the $Level_1$ training data and classifies the $Level_1$ test data:
$\Im_2(\vec{x}, Level_1train)$ for each $\vec{x} \in Level_1test$
These steps perform the basic sequence of a cascade generalization of classifier $\Im_2$ after classifier $\Im_1$. We represent the basic sequence by the symbol $\nabla$. The previous composition can be shortly represented by:
$\Im_2 \nabla \Im_1 = \Im_2(\vec{x}, Level_1train)$ for each $\vec{x} \in Level_1test$
which is equivalent to:
$\Im_2 \nabla \Im_1 = \Im_2(\vec{x}, \Phi(L, \Im_1(\vec{x}', L)))$ for each $\vec{x} \in \Phi(T, \Im_1(\vec{x}'', L))$
This is the simplest formulation of Cascade Generalization. Some possible extensions include the composition of $n$ classifiers and the parallel composition of classifiers. A composition of $n$ classifiers is represented by:
$\Im_n \nabla \Im_{n-1} \nabla \Im_{n-2} \nabla \ldots \nabla \Im_1$
In this case, Cascade Generalization generates $n-1$ levels of data. The high level theory is the one given by the $\Im_n$ classifier. A variant of cascade generalization, which includes several algorithms in parallel, can be represented in this formalism by:
$\Im_{n+1} \nabla [\Im_1, \ldots, \Im_n] = \Im_{n+1}(\vec{x}, \Phi(L, [\Im_1(\vec{x}', L), \ldots, \Im_n(\vec{x}', L)]))$ for each $\vec{x} \in \Phi(T, [\Im_1(\vec{x}'', L), \ldots, \Im_n(\vec{x}'', L)])$
The algorithms $\Im_1, \ldots, \Im_n$ run in parallel. The operator $\Phi(L, [\Im_1(\vec{x}', L), \ldots, \Im_n(\vec{x}', L)])$ returns a new data set $L'$ which contains the same number of examples as $L$. Each example in $L'$ contains $n \times c$ new attributes, where $c$ is the number of classes. Each algorithm in the set $\Im_1, \ldots, \Im_n$ contributes $c$ new attributes.
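A minimal sketch of the $\Phi$ operator and the basic sequence is given below. It assumes scikit-learn-style estimators that provide fit and predict_proba; the names phi and basic_sequence are illustrative and not part of the paper's notation.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def phi(X_prime, fitted_base):
    """Phi(D', Im(x, D)): concatenate each example in D' with the class
    probability distribution output by the already-fitted base classifier."""
    return np.hstack([X_prime, fitted_base.predict_proba(X_prime)])

def basic_sequence(X_train, y_train, X_test, im1, im2):
    """Im2 nabla Im1: fit Im1 on L, build the Level1 train/test data with Phi,
    then let Im2 learn on the Level1 training data and classify the test data."""
    im1.fit(X_train, y_train)
    level1_train = phi(X_train, im1)
    level1_test = phi(X_test, im1)
    im2.fit(level1_train, y_train)
    return im2.predict(level1_test)

# Example composition: a decision tree applied after a naive Bayes.
# y_pred = basic_sequence(X_train, y_train, X_test, GaussianNB(), DecisionTreeClassifier())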
3 Local Cascade Generalization

Most Machine Learning algorithms for supervised learning use a divide and conquer strategy that attacks a complex problem by dividing it into simpler problems and recursively applies the same strategy to the subproblems. Solutions of sub-problems can be combined to yield a solution of the complex problem. This is the basic idea behind well known decision tree based algorithms: ID3 (Quinlan, 1984), ASSISTANT (Kononenko et al., 1987), CART (Breiman et al., 1984), C4.5 (Quinlan, 1993), etc. The power of this approach comes from the ability to split the hyperspace into subspaces and fit each subspace with different functions. In our previous work [14] we have shown that Cascade significantly improves the performance of this type of learning algorithm. In this paper we explore the applicability of Cascade to the problems and subproblems that a divide and conquer algorithm must solve. The intuition behind this hypothesis is the same as behind any divide and conquer strategy: the relations that cannot be captured at the global level can be discovered in the simpler subproblems. Local cascade generalization is a composition of algorithms that is performed for each task when building the classifier. At each iteration of a divide and conquer algorithm, local cascade generalization is performed by applying the $\Phi$ operator. The effect is that the input space is reconstructed by the insertion of the new attributes. These new attributes are propagated down to the subtasks that the algorithm might consider. In this paper we restrict the use of local Cascade Generalization to decision tree based algorithms. However, it would be possible to use it with any divide and conquer algorithm. Figure 1 presents the general algorithm of local Cascade Generalization, applied to a decision tree. When growing the tree, at each decision node new attributes are computed by applying the $\Phi$ operator. The new attributes that are created there are propagated down the tree. The number of new attributes is equal to the number of classes of the examples that fall at this node. At different levels, the algorithm considers data sets with a different number of attributes and classes.
Input: A data set D, a base classifier $\Im$
Output: A decision tree
Function CGtree(D, $\Im$)
  IF stop_criterion(D) = TRUE
    return a leaf with the class probability distribution of D
  $D' = \Phi(D, \Im(\vec{x}, D))$
  Choose the attribute that maximizes the splitting criterion on $D'$
  For each partition $D'_i$ of the examples based on the chosen attribute's values
    $Tree_i$ = CGtree($D'_i$, $\Im$)
  return Tree as a decision node based on the chosen attribute, storing $\Im(D)$ and the descendants $Tree_i$
End
Figure 1: Local Cascade Algorithm based on a Decision Tree

Deeper nodes contain an increasing number of attributes. This could be a disadvantage of the system, but the number of new attributes is not constant. As the tree grows and the classes are discriminated, deeper nodes also contain examples from a decreasing number of classes. This means that as the tree grows the number of new attributes decreases. In order to be applied as a predictor, any CGTree must store, at each node, the model generated by the base classifier using the examples that fall at this node. When classifying a new example, the example traverses the tree in the usual way, but at each decision node it is extended by the insertion of the probability class distribution provided by the base classifier predictor at this node. Within the framework of local cascade generalization, we have developed CGLtree, which uses the $\Phi(D, Discrim(\vec{x}, D))$ operator in the constructive step. Each internal node of a CGLtree contains a discriminant function, which is used to build the new attributes. For each example $\vec{x}$, the value of a new attribute $A_i$ is computed using the probability $p(C_i \mid \vec{x})$ given by the linear discriminant function. At each decision node, the number of new attributes built by CGLtree is always equal to the number of classes among the examples that fall at this node. We use the following heuristic: we only consider a class $i$ if the number of examples at this node belonging to class $i$ is greater than $N$ times the number of attributes (this heuristic was suggested by Breiman et al. [3]). By default $N$ is 3. This implies
that at different nodes a different number of classes will be considered and a different number of new attributes will be added. In our empirical study we have used two other algorithms that locally apply Cascade Generalization: CGBtree, which uses $\Phi(D, naiveBayes(\vec{x}, D))$ as constructive operator, and CGBLtree, which uses $\Phi(D, [naiveBayes(\vec{x}, D), Discrim(\vec{x}, D)])$ as constructive operator. In all other aspects these algorithms are similar to CGLtree. There is one restriction on the application of the $\Phi(D', \Im(\vec{x}, D))$ operator: the classifier $\Im$ must return a probability class distribution for each $\vec{x} \in D'$. Any classifier that satisfies this requirement can be applied. It is even possible to imagine a CGTree whose internal nodes are trees themselves. For example, small modifications to C4.5 (two different methods are presented in [14, 23]) will allow the construction of a CGTree whose internal nodes are trees generated by C4.5.
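A minimal, self-contained sketch of the recursion in Figure 1 follows. It uses a Gaussian naive Bayes as the base classifier at every node (a stand-in for the paper's own base classifiers), a plain information-gain binary split instead of the gain ratio, and a simple depth/size stopping rule; all of these simplifications are assumptions of the sketch, not the paper's implementation.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    """Best (attribute, threshold) pair for a binary split, by information gain."""
    base_ent, best = entropy(y), (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            gain = base_ent - (left.mean() * entropy(y[left])
                               + (~left).mean() * entropy(y[~left]))
            if gain > best[2]:
                best = (j, t, gain)
    return best

def cg_tree(X, y, depth=0, max_depth=5, min_size=10):
    classes, counts = np.unique(y, return_counts=True)
    dist = dict(zip(classes, counts / counts.sum()))
    if len(classes) == 1 or len(y) < min_size or depth >= max_depth:
        return {"leaf": dist}
    base = GaussianNB().fit(X, y)                   # base classifier stored at this node
    X_ext = np.hstack([X, base.predict_proba(X)])   # constructive step: add P(Ci|x) attributes
    j, t, gain = best_split(X_ext, y)
    if j is None:
        return {"leaf": dist}
    left = X_ext[:, j] <= t
    return {"base": base, "attr": j, "thr": t,
            "left": cg_tree(X_ext[left], y[left], depth + 1, max_depth, min_size),
            "right": cg_tree(X_ext[~left], y[~left], depth + 1, max_depth, min_size)}

def classify(node, x):
    """x is a 1-D numpy array; it is extended at every decision node it traverses."""
    if "leaf" in node:
        return node["leaf"]
    x_ext = np.concatenate([x, node["base"].predict_proba(x.reshape(1, -1))[0]])
    child = node["left"] if x_ext[node["attr"]] <= node["thr"] else node["right"]
    return classify(child, x_ext)

Note how the extended example, not the original one, is passed to the subtrees, mirroring the propagation of the new attributes down the tree.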
4 Related Work

With respect to the final model, there are clear similarities between CGLtree and multivariate trees [5, 15]. Any multivariate tree is topologically equivalent to a three-layer inference network [18]. The constructive ability of our system is similar to the Cascade Correlation learning architecture [11]. Also, the final model of CGBtree is related to the recursive naive Bayes presented in [17]. In a previous work [13], we have compared the system Ltree, which is similar to CGLtree, with OC1 [19] and LMDT [5]. The focus of this paper is on methodologies for combining classifiers. As such, we review other methods that generate and combine multiple models.

4.1 Combining Classifications

We can consider two main lines of research. One group includes methods where all base classifiers are consulted in order to classify a query example. The other includes methods that characterize the area of expertise of the base classifiers and, for a query point, only ask the opinion of the experts. Voting is the most common method used to combine classifiers. As pointed out by Ali and Pazzani [1], this strategy is motivated by Bayesian learning theory, which stipulates that, in order to maximize predictive accuracy, instead of using just a single learning model one should ideally use all models in the hypothesis space. The vote of each hypothesis should be weighted by the posterior probability of that hypothesis given the training data. Several variants of the voting method can be found in the machine learning literature, from uniform voting, where the opinions of all base classifiers contribute to the final classification with the same strength, to weighted voting, where each base classifier has an associated weight, which could change over time, and which strengthens the classification given by that classifier. Ortega [20] presents the "Model Applicability Induction" approach for combining predictions from multiple models. The approach consists of learning, for each available model, a referee that characterizes the situations in which that model is able to make correct predictions. On future instances these referees are first consulted to select the most appropriate prediction model, and the prediction of the selected model is then returned.

4.2 Generating different models
Several methods for generating multiple models appear in the literature. Breiman [3] proposes bagging, which produces replications of the training set by sampling with replacement. Each replication of the training set has the same size as the original data, but some examples do not appear in it, while others may appear more than once. From each replication of the training set a classifier is generated. All classifiers are used to classify each example in the test set, usually using a uniform vote scheme. The boosting algorithm of Freund and Schapire [12] maintains a weight for each example in the training set that reflects its importance. Adjusting the weights causes the learner to focus on different examples, leading to different classifiers. Boosting is an iterative algorithm. At each iteration the weights are adjusted in order to reflect the performance of the corresponding classifier. The weight of the misclassified examples is increased. The final classifier aggregates the classifiers learned at each iteration by weighted voting. The weight of each classifier is a function of its accuracy. Wolpert [25] proposed Stacked Generalization, a technique that uses learning at two levels. A learning algorithm is used to determine how the outputs of the base classifiers should be combined. The original data set constitutes the level zero data. All the base classifiers run at this level. The level one data are the outputs of the base classifiers. Another learning process occurs using as input the level one data and as output the
final classification. This is a more sophisticated technique of cross validation that can reduce the error due to bias. Brodley [4] presents MCS, a hybrid algorithm that combines, in a single tree, nodes that are univariate tests, multivariate tests generated by linear machines, and instance-based learners. At each node MCS uses a set of If-Then rules to perform a hill-climbing search for the best hypothesis space and search bias for the given partition of the data set. The set of rules incorporates knowledge of experts. MCS uses a dynamic search control strategy to perform automatic model selection. MCS builds trees which can apply a different model in different regions of the instance space. Chan and Stolfo [7] present two schemes for classifier combination: arbiter and combiner. Both schemes are based on meta-learning, where a meta-classifier is generated from meta-data built from the predictions of the base classifiers. An arbiter is also a classifier and is used to arbitrate among predictions generated by different base classifiers. The training set for the arbiter is selected from all the available data using a selection rule. An example of a selection rule is "Select the examples whose classification the base classifiers cannot predict consistently". This arbiter, together with an arbitration rule, decides a final classification based on the base predictions. An example of an arbitration rule is "Use the prediction of the arbiter when the base classifiers cannot obtain a majority". Later [8], they extended this framework using arbiters/combiners in a hierarchical fashion, generating arbiter/combiner binary trees.
Wolpert [25] says that successful implementation of Stacked Generalization for classification tasks is a "black art", and the conditions under which stacking works are still unknown. Recently, Ting and Witten [23] have shown that successful stacked generalization requires the use of output class distributions rather than class predictions. In their experiments, only the MLR algorithm (a linear discriminant) was suitable as the level-1 generalizer. Cascade Generalization belongs to the family of stacking algorithms. In the experiments described in [14] we have used a bias-variance analysis as a criterion to select algorithms. The experiments suggest that at the top level an algorithm with low bias, like a decision tree, should be used. The main achievement of our proposed method is its ability to merge different models. As such, we get a single model whose components are terms of the base model language. The bias restriction imposed by using a single model is relaxed. Cascade gives a single and structured model for the data, and this is a strong advantage over the methods that combine classifiers by voting. Another advantage of Cascade Generalization is related to the use of probability class distributions. The usual learning algorithms produced by the Machine Learning community use categories when classifying examples. Combining classifiers by means of categorical classes loses the strength of the classifier in its prediction. The use of probability class distributions allows us to explore that information.
4.3 Discussion
Ali and Pazzani [1] and Tumer and Ghosh [24] present empirical and analytical results showing that "the combined error rate depends on the error rate of individual classifiers and the correlation among them". They suggest the use of "radically different types of classifiers" to reduce the correlation among the errors. This was our criterion when selecting the algorithms for the experimental work. We use three classifiers that have different behaviors under a bias-variance analysis: a naive Bayes, a Linear Discriminant, and a Decision Tree.
Earlier results of boosting or bagging are quite impressive. Using 10 iterations (i.e. generating 10 classifiers), Quinlan [22] reports reductions of the error rate between 10% and 19%. Quinlan argues that these techniques are mainly applicable to unstable classifiers. Both techniques require that the learning system is not stable, in order to obtain different classifiers when there are small changes in the training set. Under a bias-variance decomposition of the error [16], the reduction of the error observed with boosting or bagging is mainly due to a reduction in the variance. As mentioned by Ali et al. [1], "the number of training examples needed by Boosting increases as a function of the accuracy of the learned model. Boosting could not be used to learn many models on the modest training set sizes used in this paper."
5 Empirical Evaluation

5.1 The Algorithms
5.1.1 Naive Bayes

Bayes' theorem allows us to optimally predict the class of an unseen example, given a training set. The chosen class is the one that maximizes $p(C_i \mid E) = p(C_i)\,p(E \mid C_i)/p(E)$. If the attributes are
independent, $p(E \mid C_i)$ can be decomposed into the product $p(v_1 \mid C_i) \times \ldots \times p(v_k \mid C_i)$. Domingos and Pazzani [9] show that this procedure has a surprisingly good performance in a wide variety of domains, including many where there are clear dependencies between attributes. In our reimplementation of this algorithm, the required probabilities are estimated from the training set. In the case of nominal attributes we use counts. Continuous attributes were discretized; this has been found to produce better results than assuming a Gaussian distribution [10, 9]. The number of bins used is a function of the number of different values observed in the training set: $k = \max(1, 2 \log(\text{nr. of different values}))$. This heuristic was used in [10] and elsewhere with good overall results. Missing values were treated as another possible value for the attribute. In order to classify a query point, naive Bayes uses all of the available attributes. Langley [17] notes that naive Bayes relies on an important assumption: that the variability of the data set can be summarized by a single probabilistic description, and that this is sufficient to distinguish between the classes. From a bias-variance analysis, this implies that naive Bayes uses a reduced set of models to fit the data. The result is low variance but, if the data cannot be adequately represented by this set of models, large bias.
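A small sketch of these estimates is given below. The equal-width binning, the natural logarithm in the bin-count heuristic, and the Laplace correction are assumptions of the sketch; the text fixes only the number of bins.

import numpy as np

def n_bins(column):
    """Discretization heuristic: k = max(1, 2*log(nr. of distinct values)).
    The log base is not stated in the text; natural log is assumed here."""
    return max(1, int(round(2 * np.log(len(np.unique(column))))))

def discretize(X):
    """Equal-width binning of every (continuous) column; binning strategy is an assumption."""
    Xd = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins(X[:, j]) + 1)
        Xd[:, j] = np.digitize(X[:, j], edges[1:-1])
    return Xd

def naive_bayes_posterior(Xd, y, xq):
    """p(Ci|E) proportional to p(Ci) * prod_j p(v_j|Ci); probabilities estimated
    by counts with a Laplace correction. xq must be discretized with the same bins."""
    classes = np.unique(y)
    post = np.zeros(len(classes))
    for i, c in enumerate(classes):
        Xc = Xd[y == c]
        p = len(Xc) / len(y)                          # prior p(Ci)
        for j, v in enumerate(xq):
            nv = np.sum(Xc[:, j] == v)
            p *= (nv + 1) / (len(Xc) + len(np.unique(Xd[:, j])))
        post[i] = p
    return post / post.sum()                           # normalized p(Ci|E)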
5.1.2 Linear Discriminant

A linear discriminant function is a linear composition of the attributes for which the sum of squared differences between class means is maximal relative to the within-class variance. It is assumed that the attribute vectors for the examples of class $C_i$ are independent and follow a certain probability distribution with probability density function $f_i$. A new point with attribute vector $\vec{x}$ is then assigned to the class for which the probability density function $f_i(\vec{x})$ is maximal. This means that the points of each class are distributed in a cluster centered at $\mu_i$. The boundary separating two classes is a hyper-plane that passes through the midpoint of the two centers. If there are only two classes, a single hyper-plane is needed to separate them. In the general case of $q$ classes, $q - 1$ hyper-planes are needed to separate them. By applying the linear discriminant procedure described below, we get $q_{node} - 1$ hyper-planes. The equation of each hyper-plane is given by:
$H_i = \alpha_i + \sum_j \beta_{ij} x_j$, where $\alpha_i = -\frac{1}{2} \mu_i^T S^{-1} \mu_i$ and $\beta_i = S^{-1} \mu_i$.
We use a Singular Value Decomposition (SVD) to compute $S^{-1}$. SVD is numerically stable and is a tool for
detecting sources of collinearity. This last aspect is used as a method for reducing the features of each linear combination. A linear discriminant uses all, or almost all, of the available attributes when classifying a query point. Breiman [2] notes that, from a bias-variance analysis, the Linear Discriminant is a stable classifier, although it can fit only a small number of models. It achieves stability by having a limited set of models to fit the data. The result is low variance but, if the data cannot be adequately represented by this set of models, large bias.
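A sketch of how the hyper-plane coefficients can be computed is given below. It assumes that $S$ is the pooled within-class covariance matrix (a standard choice; the text does not define $S$ explicitly) and uses numpy's SVD-based pseudo-inverse in place of an explicit inverse; obtaining $p(C_i \mid \vec{x})$ from the $H_i$ scores (e.g. by a softmax) for the constructive step is also an assumption.

import numpy as np

def linear_discriminant(X, y):
    """Compute alpha_i and beta_i for H_i(x) = alpha_i + beta_i . x,
    with beta_i = S^+ mu_i and alpha_i = -0.5 * mu_i^T S^+ mu_i,
    where S is taken as the pooled within-class covariance matrix and
    S^+ its SVD-based pseudo-inverse (copes with collinear attributes).
    Assumes every class has at least two training examples."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    S = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1) for c in classes)
    S = S / (len(y) - len(classes))
    S_inv = np.linalg.pinv(S)                 # pseudo-inverse computed via SVD
    betas = means @ S_inv                     # one coefficient vector per class (S symmetric)
    alphas = -0.5 * np.einsum('ij,ij->i', betas, means)
    return alphas, betas

def discriminant_scores(alphas, betas, x):
    return alphas + betas @ x                 # H_i(x); argmax gives the predicted class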
5.1.3 Decision Tree

Dtree is our version of a decision tree. It uses the standard algorithm to build a decision tree. The splitting criterion is the gain ratio. The stopping criterion is similar to that of C4.5. The pruning mechanism is similar to the pessimistic error pruning of C4.5. Dtree uses a kind of smoothing process that usually improves the performance of tree-based classifiers. When classifying a new example, the example traverses the tree from the root to a leaf. In Dtree, the example is classified taking into account not only the class distribution at the leaf, but also the class distributions of all the nodes in the path; that is, all nodes in the path contribute to the final classification. Instead of computing the class distributions for all paths in the tree at classification time, as is done, for instance, in Buntine [6], Dtree computes a class distribution for all nodes when growing the tree. This is done recursively, taking into account the class distributions at the current node and at the predecessor of the current node, using the formula:
$P(C_i \mid e_n, e) = P(C_i \mid e_n) \, \frac{P(e \mid e_n, C_i)}{P(e \mid e_n)}$
where $P(e \mid e_n)$ is the probability that an example that falls at $Node_n$ goes to $Node_{n+1}$, and $P(e \mid e_n, C_i)$ is the probability that an example from class $C_i$ goes from $Node_n$ to $Node_{n+1}$ [21]. This recursive formulation allows Dtree to compute the required class distributions efficiently on the fly. The smoothed class distributions influence the pruning mechanism and the treatment of missing values; this is the most relevant difference from C4.5. A decision tree uses a subset of the available attributes to classify a query point. Kohavi and Wolpert [16] and Breiman [2, 3], among other researchers, note that decision trees are unstable classifiers: small variations in the training set can cause large changes in the resulting predictors. They have high variance, but they can fit any kind of data: the bias of a decision tree is low.
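A sketch of how the smoothed distributions can be propagated while growing the tree, directly from the formula above; the renormalization and the guard against empty classes are practical assumptions of the sketch, not part of the paper's description.

import numpy as np

def smoothed_child_distribution(parent_dist, parent_counts, child_counts):
    """P(Ci | e_n, e) = P(Ci | e_n) * P(e | e_n, Ci) / P(e | e_n),
    where e is the event 'the example goes from node n to node n+1'.
    parent_dist   : smoothed P(Ci | e_n) at the parent node (numpy array)
    parent_counts : per-class example counts at the parent node
    child_counts  : per-class example counts at this child node"""
    p_e_given_en = child_counts.sum() / parent_counts.sum()          # P(e | e_n)
    p_e_given_en_ci = child_counts / np.maximum(parent_counts, 1)    # P(e | e_n, Ci)
    dist = parent_dist * p_e_given_en_ci / p_e_given_en
    return dist / dist.sum()    # renormalize (guards against rounding and empty classes)

# At the root the smoothed distribution is simply the class frequencies; each
# child's distribution is then derived from its parent while the tree is grown.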
5.1.4 Local Cascade Generalization Algorithms

All the implemented Local Cascade Generalization algorithms are based on Dtree; that is, they use exactly the same splitting criterion, stopping criterion, pruning mechanism, etc. Moreover, they share many minor heuristics that individually are too small to mention, but collectively can make a difference. At each decision node, CGLtree applies the linear discriminant described above, while CGBtree applies the naive Bayes algorithm. CGBLtree applies the linear discriminant to the ordered attributes and the naive Bayes to the categorical attributes. In order to prevent overfitting, the construction of new attributes is constrained to a depth of 5. In addition, the level of pruning is greater than the level of pruning in Dtree.
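As one illustration of the constructive step at a CGBLtree node, the sketch below uses scikit-learn's LinearDiscriminantAnalysis and CategoricalNB as stand-ins for the paper's own discriminant and naive Bayes implementations; the attribute-index arguments and the depth limit wiring are illustrative, and categorical attributes are assumed to be integer-encoded.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import CategoricalNB

def cgbl_constructive_step(X, y, ordered_idx, categorical_idx, depth, max_depth=5):
    """Extend X with class-probability attributes: a linear discriminant on the
    ordered attributes and a naive Bayes on the categorical ones. Below the
    stated depth limit no new attributes are built (over-fitting guard)."""
    if depth > max_depth:
        return X
    new_cols = []
    if ordered_idx:
        lda = LinearDiscriminantAnalysis().fit(X[:, ordered_idx], y)
        new_cols.append(lda.predict_proba(X[:, ordered_idx]))
    if categorical_idx:
        nb = CategoricalNB().fit(X[:, categorical_idx].astype(int), y)
        new_cols.append(nb.predict_proba(X[:, categorical_idx].astype(int)))
    return np.hstack([X, *new_cols]) if new_cols else X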
5.2 The Datasets

We have chosen 17 data sets from the UCI repository. All of them were previously used in other comparative studies. Evaluation was done using 10-fold stratified Cross Validation (CV). Data sets were permuted once before the CV procedure. All algorithms were used with their default settings. At each iteration of CV, all algorithms were trained on the same training partition of the data. Classifiers were also evaluated on the same test partition of the data. Comparisons between algorithms were performed using paired t-tests with the significance level set at 95%. Table 1 presents the data set characteristics and the error rate and standard deviation of each base classifier. Relative to each algorithm, a +(-) sign in the first column means that the error rate of this algorithm is significantly better (worse) than that of Dtree. The error rate of C5.0 is presented for reference. These results provide evidence, once more, that no single algorithm is better overall.
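The evaluation protocol described above can be sketched as follows, assuming scikit-learn-style classifiers; the fold construction details (e.g. the random seed) are illustrative.

import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold

def paired_cv_comparison(model_a, model_b, X, y, seed=1):
    """10-fold stratified CV with both models trained/tested on the same folds,
    followed by a paired t-test on the per-fold error rates (95% level)."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(y))               # permute the data once before CV
    X, y = X[perm], y[perm]
    errs_a, errs_b = [], []
    for train, test in StratifiedKFold(n_splits=10).split(X, y):
        errs_a.append(1 - model_a.fit(X[train], y[train]).score(X[test], y[test]))
        errs_b.append(1 - model_b.fit(X[train], y[train]).score(X[test], y[test]))
    t, p = stats.ttest_rel(errs_a, errs_b)
    return np.mean(errs_a), np.mean(errs_b), p < 0.05   # mean errors + significance flag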
5.3 Local Cascade Generalization

Table 2a presents the results of local Cascade Generalization. Each column corresponds to a Cascade Generalization algorithm. Each algorithm is compared against its components using paired t-tests. For example, CGLtree is compared against Dtree and Discrim. A +(-) sign means that the composite model performs, with statistical significance, better (worse) than the respective component model. The trend in these results shows a clear improvement over the base classifiers. We never observe a degradation
of the error rate of a composite model with respect to all of its components. In some cases there is a significant increase in performance compared to all the components. For example, CGBLtree improves over the 3 components in 2 data sets, and over 2 components in 5 data sets. Table 2b presents the results of C5.0 boosting with the default parameter of 10 (that is, aggregating over 10 trees) and Stacked Generalization as it is defined in [23]. That is, the level-0 classifiers are C4.5 and Bayes, and the level-1 classifier is Discrim. The attributes for the level-1 data are the probability class distributions obtained from the level-0 classifiers using 5-fold stratified cross validation. Both Boosting and Stacked are compared against CGBLtree using paired t-tests with the significance level set to 95%. A +(-) sign means that Boosting or Stacked performs significantly better (worse) than CGBLtree. In this study, CGBLtree performs significantly better than Stacked in 5 data sets and never performs worse. Compared with C5.0 Boosting, CGBLtree significantly improves in 4 data sets and loses in 3 data sets. The improvement observed with Boosting is mainly due to the reduction of the variance component of the error rate while, in Cascade algorithms, the improvement is mainly due to the reduction of the bias. We intend, in the near future, to boost CGBLtree. Another dimension for comparison involves measuring the number of leaves. This corresponds to the number of different regions into which the instance space is partitioned by the algorithm. In almost all data sets (except Monks-2, where both Dtree and C5.0 produce a tree with only one leaf), any Cascade tree splits the instance space into half of the regions needed by Dtree or C5.0. This is a clear indication that Cascade models better capture the underlying structure of the data.
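For reference, the level-1 data used by the Stacked Generalization baseline could be built roughly as follows, assuming scikit-learn-style level-0 classifiers and that every class appears in every training fold; the function name is illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def level1_attributes(level0_models, X, y, n_splits=5):
    """Level-1 data for Stacked Generalization as described in [23]: the
    probability class distributions of the level-0 classifiers, obtained by an
    internal stratified cross-validation so that every example is described by
    out-of-fold predictions."""
    n_classes = len(np.unique(y))
    blocks = []
    for model in level0_models:
        probs = np.zeros((len(y), n_classes))
        for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
            probs[test] = model.fit(X[train], y[train]).predict_proba(X[test])
        blocks.append(probs)
    return np.hstack(blocks)   # a level-1 learner (e.g. a linear discriminant) is trained on this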
Dataset      Class  Nr.Ex.  Types            Dtree          C5.0            Bayes            Discrim
Australian   2      690     8 Ord, 6 Cont    14.37±6.18     13.63±4.36      15.07±3.76       14.05±5.23
Balance      3      625     4 Cont           21.91±4.63     21.92±4.93      - 30.08±7.01     + 13.14±2.46
Breast(W)    2      699     9 Ord            5.84±4.64      5.42±4.08       + 2.43±2.52      + 4.27±4.58
Diabetes     2      768     8 Cont           25.14±5.78     23.69±6.48      24.62±4.58       22.92±4.97
German       2      1000    17 Ord, 7 Cont   28.70±4.30     29.10±2.81      27.60±5.15       + 23.50±5.54
Glass        6      213     9 Cont           31.85±7.61     32.30±10.19     - 46.35±10.93    - 36.78±8.07
Heart        2      270     6 Ord, 7 Cont    25.16±9.84     22.96±8.69      + 15.93±8.56     + 15.93±4.29
Ionosphere   2      351     33 Cont          8.54±5.80      9.66±3.47       11.07±7.76       - 14.26±4.68
Iris         3      150     4 Cont           4.67±5.48      4.67±4.50       4.00±4.66        2.00±3.22
Monks-1      2      432     6 Ord            6.33±7.45      + 0.00±0.00     - 25.07±5.96     - 33.39±10.06
Monks-2      2      432     6 Ord            32.90±0.63     32.86±0.65      - 49.32±8.50     33.32±1.60
Monks-3      2      432     6 Ord            0.00±0.00      0.00±0.00       - 2.79±2.42      - 22.89±8.96
Satimage     6      6435    36 Cont          13.35±1.51     13.53±1.57      - 19.55±1.48     - 15.91±1.49
Segment      7      2310    18 Cont          3.64±1.13      3.38±1.34       - 10.22±0.74     - 8.18±0.83
Vehicle      4      846     18 Cont          28.11±4.87     27.27±5.48      - 37.70±2.18     + 22.34±2.87
Waveform     3      2581    21 Cont          23.38±3.40     - 24.88±2.94    + 18.52±2.24     + 15.15±1.86
Wine         3      178     13 Cont          6.66±6.32      7.19±7.44       2.22±3.88        + 0.56±1.76
Mean error rate                              16.50          16.02           20.15            17.56
Mean nr. leaves                              45.6           51.3

Table 1: Data Characteristics and Results of Base Classifiers

(a) Local Cascade Generalization
Dataset      CGLtree            CGBtree            CGBLtree
Australian   14.354±4.77        14.499±3.76        14.058±4.80
Balance      + + 7.016±2.68     + + 6.704±3.64     + + + 7.016±2.68
Breast(W)    + 3.280±2.59       + 2.712±2.27       3.280±2.68
Diabetes     23.565±3.12        26.693±5.87        23.565±3.12
German       24.700±4.19        27.100±5.48        - 25.300±5.25
Glass        33.866±9.26        + 27.004±7.51      + 33.866±9.26
Heart        + 17.037±5.58      + 16.667±6.11      + 17.037±5.58
Ionosphere   11.363±4.32        9.369±5.12         11.363±4.32
Iris         2.667±3.44         4.000±4.66         2.667±3.44
Monks-1      + 2.976±4.43       - + 14.372±8.69    + + 2.565±3.77
Monks-2      33.335±5.81        + + 13.874±6.97    + + + 11.120±5.36
Monks-3      + 0.698±1.12       + 0.465±0.47       + + 0.465±1.47
Satimage     + 12.385±1.44      + + 11.673±1.25    + + 12.385±1.44
Segment      + 3.853±1.22       + 4.416±1.47       + + 3.853±1.21
Vehicle      + 21.025±3.08      + 28.844±3.88      + + 21.025±3.08
Waveform     + 16.351±1.68      + 16.004±2.78      + 16.351±1.68
Wine         + 0.556±1.76       3.403±3.94         + 0.556±1.76
Mean error rate   13.47         13.40              12.15
Mean nr. leaves   23.9          23.7               22.9

(b) Boosting and Stacked
Dataset      C5Boost            Stacked
Australian   13.337±3.33        13.766±4.47
Balance      - 20.184±4.17      - 12.309±3.63
Breast(W)    3.135±3.20         2.427±2.52
Diabetes     24.728±5.46        22.657±5.42
German       23.200±2.35        24.800±4.24
Glass        25.020±10.09       35.753±6.20
Heart        19.630±9.25        16.667±8.24
Ionosphere   + 5.947±3.06       10.758±7.33
Iris         - 5.333±4.22       4.667±3.22
Monks-1      0.000±0.00         0.682±2.16
Monks-2      - 36.353±5.87      - 32.865±0.65
Monks-3      0.000±0.00         - 2.072±2.01
Satimage     + 9.062±1.07       - 13.303±1.63
Segment      + 1.905±1.05       3.420±1.35
Vehicle      - 24.922±3.71      - 27.731±5.06
Waveform     17.980±1.86        16.429±1.50
Wine         2.222±2.87         2.778±3.93
Mean error rate   13.70         14.30

Table 2: Results of (a) Local Cascade Generalization (b) Boosting and Stacked

6 Conclusions

This paper presents a new methodology for classifier combination. The basic idea of Cascade Generalization consists of a reformulation of the input space by means of the insertion of new attributes. A base classifier computes the new attributes. Each new attribute is the instantiation of $P(C_i \mid \vec{x})$ given by the predictor function generated by the base classifier for this example. In this sense, the new attributes are terms, or functions, in the representational language of the base classifier. This constructive step acts as a way of extending the description language of the high level classifiers. The number of new attributes is equal to the number of classes and, for each example, they are computed as the conditional probability that the example belongs to class $i$, as given by the base classifier. Cascade Generalization can be applied locally by any learning algorithm that uses a divide and conquer strategy. As pointed out by several researchers, successful combination of classifiers requires different syntactic models. For the implementation of the Local Cascade Generalization algorithms we have chosen three algorithms that have very different behavior from a bias-variance analysis: as high level classifier we use a decision tree, and as low level classifiers we use a naive Bayes, giving CGBtree, and a Linear Discriminant, giving CGLtree. At each decision node a constructive step is performed by applying the base classifier. The new axes incorporate new knowledge provided by the base classifiers. The
bias restriction imposed by using a single model class is relaxed in the directions given by the base classifiers. It is this kind of synergy among classifiers that Cascade explores. There are two main issues that differentiate Cascade from previous methods based on multiple models. The first is related to its ability to be applied locally, merging different models. We get a single model whose components are terms of the base model language, extending the high level model language. Cascade gives a single structured model for the data and, in this way, is better adapted to capture insights about the structure of the problem. The second point is related to the use of probability class distributions. Using these probabilities allows the system to use information about the strength of the classifier. This is very useful information, particularly when combining the predictions of
classifiers. We have shown that this methodology can improve the accuracy of the base classifiers, competing well with other methods for combining classifiers, while preserving the ability to provide a single, albeit structured, model for the data.
Acknowledgements
Gratitude is expressed for the support given by the FEDER and PRAXIS XXI projects and the Plurianual support attributed to LIACC. Thanks to P. Brazdil, colleagues from LIACC, and the anonymous reviewers for their valuable comments.
References

[1] K. Ali and M. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, Vol. 24, No. 1, 1996.
[2] L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, 1996.
[3] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[4] C. Brodley. Recursive automatic bias selection for classifier construction. Machine Learning, 20, 1995.
[5] C. Brodley and P. Utgoff. Multivariate trees. Machine Learning, 19, 1995.
[6] Wray Buntine. A Theory of Learning Classification Rules. PhD thesis, University of Sydney, 1990.
[7] P. Chan and S. Stolfo. A comparative evaluation of voting and meta-learning on partitioned data. In A. Prieditis and S. Russel, editors, Machine Learning: Proc. of the 12th International Conference. Morgan Kaufmann, 1995.
[8] P. Chan and S. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In KDD 95, 1995.
[9] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In L. Saitta, editor, Machine Learning: Proc. of the 13th International Conference. Morgan Kaufmann, 1996.
[10] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In A. Prieditis and S. Russel, editors, Machine Learning: Proc. of the 12th International Conference. Morgan Kaufmann, 1995.
[11] Scott E. Fahlman and Christian Lebiere. The Cascade-Correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, February 1990.
[12] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In L. Saitta, editor, Machine Learning: Proc. of the 13th International Conference. Morgan Kaufmann, 1996.
[13] J. Gama. Probabilistic linear tree. In D. Fisher, editor, Machine Learning: Proc. of the 14th International Conference. Morgan Kaufmann, 1997.
[14] J. Gama. Combining classifiers by constructive induction. In C. Nedellec and C. Rouveirol, editors, Machine Learning: ECML-98. Springer Verlag, 1998.
[15] G. John. Robust linear discriminant trees. In D. Fisher, editor, Learning from Data: Artificial Intelligence and Statistics V. Springer Verlag, 1996.
[16] R. Kohavi and D. Wolpert. Bias plus variance decomposition for zero-one loss functions. In L. Saitta, editor, Machine Learning: Proc. of the 13th International Conference. Morgan Kaufmann, 1996.
[17] P. Langley. Induction of recursive Bayesian classifiers. In P. Brazdil, editor, Machine Learning: ECML-93. LNAI 667, Springer Verlag, 1993.
[18] Pat Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.
[19] S. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 1994.
[20] J. Ortega. Exploiting multiple existing models and learning algorithms. In AAAI 96 - Workshop in Induction of Multiple Learning Models, 1995.
[21] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, 1988.
[22] R. Quinlan. Bagging, boosting and C4.5. In Procs. 13th American Association for Artificial Intelligence. AAAI Press, 1996.
[23] K. M. Ting and I. H. Witten. Stacked generalization: when does it work? In Procs. of the International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1997.
[24] K. Tumer and J. Ghosh. Classifier combining: analytical results and implications. In AAAI 96 - Workshop in Induction of Multiple Learning Models, 1995.
[25] D. Wolpert. Stacked generalization. Neural Networks, Vol. 5. Pergamon Press, 1992.