Proceedings of the 12th International Conference on Machine Learning, 497-505, 1995. Morgan Kaufmann.

A Comparison of Induction Algorithms for Selective and non-Selective Bayesian Classifiers

Moninder Singh
Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104-6389
[email protected]

Gregory M. Provan
Institute for the Study of Learning and Expertise, 2164 Staunton Court, Palo Alto, CA 94306
[email protected]

Abstract

In this paper we present a novel induction algorithm for Bayesian networks. This selective Bayesian network classifier selects a subset of attributes that maximizes predictive accuracy prior to the network learning phase, thereby learning Bayesian networks with a bias for small, high-predictive-accuracy networks. We compare the performance of this classifier with selective and non-selective naive Bayesian classifiers. We show that the selective Bayesian network classifier performs significantly better than both versions of the naive Bayesian classifier on almost all databases analyzed, and hence is an enhancement of the naive Bayesian classifier. Relative to the non-selective Bayesian network classifier, our selective Bayesian network classifier generates networks that are computationally simpler to evaluate and that display predictive accuracy comparable to that of Bayesian networks which model all features.

1 INTRODUCTION

Bayesian induction methods have proven to be an important class of algorithms that perform competitively with other well-known induction techniques such as decision trees and neural networks. Within the machine learning community, the most widely-studied Bayesian induction method is the naive Bayesian classifier (Kononenko, 1990; Langley, 1993). Despite its simplicity and the strong conditional independence assumptions it makes, the naive Bayesian classifier performs remarkably well. Within the Bayesian Artificial Intelligence community, the best-known Bayesian representation is the Bayesian network (Pearl, 1988). Induction algorithms for Bayesian networks (e.g. K2 (Cooper and Herskovits, 1992)) have been applied to many domains; however, no one has yet compared the performance of these Bayesian network induction methods to other induction methods.

In this paper we present a novel selective Bayesian network induction method, K2-AS, and compare its classification performance with selective and non-selective naive Bayesian classifiers. The selective Bayesian network classifier uses a subset of features that maximizes the predictive accuracy of the network: our goal is to construct networks that are simpler to evaluate, but which still have high predictive accuracy relative to networks induced with all features. Since a Bayesian network classifier is a natural extension of the naive Bayesian classifier in that it does not make restrictive conditional independence assumptions, it should perform better than the naive approach. Our experimental results comparing selective Bayesian network and naive Bayesian methods show that the selective Bayesian network classifier performs significantly better than both versions of the naive Bayesian classifier on almost all databases analyzed, and hence is an enhancement of the naive Bayesian classifier. Moreover, we have experimentally compared selective and non-selective Bayesian network classifiers, showing that the selective approach generates networks that are computationally simpler to evaluate and that display predictive accuracy comparable to that of Bayesian networks which model all features (Provan and Singh, 1995). The selective Bayesian network thus seems to be a good compromise between naive Bayesian classifiers and non-selective Bayesian networks.

We organize the remainder of the paper as follows. Section 2 summarizes the naive Bayesian classifier to which we compare our new learning approach. Section 3 introduces the Bayesian network learning algorithm (K2) that we modify, and describes our selective Bayesian network classifier (K2-AS). Section 4 summarizes the experimental results of our comparison of selective Bayesian and naive Bayesian classifiers, and Section 5 discusses some implications of our results. Section 6 compares and contrasts our approach with other related work. Finally, we summarize our contributions in Section 7.

2 NAIVE BAYESIAN CLASSIFIERS

The naive Bayesian classifier assumes that the attributes are conditionally independent given the class variable. The classification process involves a class variable C that can take on values $c_1, c_2, \ldots, c_m$, and a feature vector Z of q features that can take on a tuple of values denoted by $\{z_1, z_2, \ldots, z_q\}$. Given a case Z represented by an instantiation $\{z_1, z_2, \ldots, z_q\}$ of feature values, our classification task is to determine the class value $c_i$ into which Z falls. To perform this task, we assume that we have the prior probabilities for each value $c_i$ of the class variable, $P(c_i)$. Further, we assume that we have the conditional probability distributions for each feature value $z_j$ given class value $c_i$, $P(z_j \mid c_i)$.[1] Given this data, we can classify a new case $Z = \bigwedge_j z_j$ using Bayes' rule:

$$P(c_i \mid Z) = \frac{P(c_i)\,P(Z \mid c_i)}{P(Z)} = \frac{P(c_i)\,P(\bigwedge_j z_j \mid c_i)}{\sum_k P(\bigwedge_j z_j \mid c_k)\,P(c_k)}. \quad (1)$$

We can use the assumption of the independence of features within each class to rewrite equation 1 to give

$$P(c_i \mid \bigwedge_j z_j) = \frac{P(c_i)\,\prod_j P(z_j \mid c_i)}{\sum_k \prod_j P(z_j \mid c_k)\,P(c_k)}. \quad (2)$$

Induction consists simply in counting class values and class value/feature value pairs. This allows us to estimate $P(c_k)$ and $P(z_j \mid c_k)$. We refer to this naive Bayesian classifier (that models all attributes) as naive-ALL. Although its performance is remarkably good given its simplicity, this classifier is typically limited to learning classes that can be separated by a single decision boundary (Langley, 1993), and in domains in which the attributes are correlated given the class variable, its performance can be worse than other approaches which can account for such correlations. Bayesian networks can account for correlations among attributes, so they are a natural extension of the naive approach.

The selective naive Bayesian classifier (Langley and Sage, 1994) (referred to as naive-AS) is an extension to the naive Bayesian classifier, and is designed to perform better in domains with redundant attributes. The intuition is that if highly correlated attributes are not selected, the classifier should perform better given its attribute independence assumptions. Using forward selection of attributes, this approach uses a greedy search, at each point in the search, to select from the space of all attribute subsets the attribute which most improves accuracy on the entire training set. Attributes are added until the addition of any other attribute results in reduced accuracy.

[1] For simplicity of exposition, we will restrict our discussion to nominal domains.
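To make the counting-based induction concrete, the following is a minimal Python sketch of naive-ALL as in equation (2). It assumes nominal attributes and uses unsmoothed relative-frequency estimates; all function and variable names are ours, not the paper's.

    from collections import Counter, defaultdict
    from math import prod

    def train_naive_bayes(cases, labels):
        """Estimate P(c) and P(z_j | c) by counting (equation 2, no smoothing)."""
        class_counts = Counter(labels)
        cond_counts = defaultdict(Counter)        # (class, feature index) -> value counts
        for z, c in zip(cases, labels):
            for j, zj in enumerate(z):
                cond_counts[(c, j)][zj] += 1
        n = len(labels)
        priors = {c: k / n for c, k in class_counts.items()}
        def cond(zj, j, c):                       # P(z_j | c) as a relative frequency
            return cond_counts[(c, j)][zj] / class_counts[c]
        return priors, cond

    def classify(z, priors, cond):
        """Return the class maximizing P(c) * prod_j P(z_j | c)."""
        scores = {c: p * prod(cond(zj, j, c) for j, zj in enumerate(z))
                  for c, p in priors.items()}
        return max(scores, key=scores.get)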

3 BAYESIAN NETWORK CLASSIFIERS

This section describes the Bayesian network classifier which uses all features, and our selective extension to this classifier.

3.1 INDUCTION OF BAYESIAN NETWORKS

We first describe the basic algorithm on which our research is based, K2.[2] Assume that we have a database D of cases, where each case contains an instantiation for each of a set Z of q discrete features. $B_S$ denotes a belief network structure representing the features in Z. In a belief network, a node represents a feature, and the absence of an arc between two nodes denotes the independence of the two nodes given the remaining network structure. The posterior probability of a network given the data, $P(B_S \mid D)$, is proportional to the joint probability, so networks can be ranked according to their joint probabilities. The single most likely network is given by

$$B_{S_{\max}} = \arg\max_{B_S} P(B_S \mid D).$$

K2 constructs a Bayesian network from a set of features as follows. K2 selects the network (out of a set of possible networks exponential in the number of network nodes) which maximizes the joint probability $P(B_S, D)$, and hence the network's posterior probability. K2 requires an ordering on the features from which the network will be constructed. Given an ordering of the q features, K2 takes each successive feature in the ordering, adds it as a node $n_i$ in the network, and creates parents for $n_i$ in a greedy fashion: rather than evaluate all subsets of the network nodes $n_1, n_2, \ldots, n_{i-1}$ as parent nodes, K2 selects as a parent node the single node in $\{n_1, n_2, \ldots, n_{i-1}\}$ which most increases the posterior probability of the network structure. New parent nodes are added incrementally to $n_i$ as long as doing so increases the posterior probability of the network given the data.

Our new approach proceeds in two phases. The first phase computes a subset $\Phi \subseteq Z$ of features that generates the network $B_\Phi$ with highest predictive accuracy, where $B_\Phi$ denotes the network formed from the subset $\Phi \subseteq Z$ of features. The second phase computes the network (from the set of features $\Phi$) which maximizes the predictive accuracy over the test data.

[2] We use the nomenclature used in the papers on K2 by Herskovits and Cooper (Herskovits and Cooper, 1990; Cooper and Herskovits, 1992).
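As an illustration of the greedy parent search, the following Python sketch scores a node's candidate parent sets with the Cooper-Herskovits marginal likelihood that K2 maximizes, and adds one parent at a time while the score improves. This is a minimal sketch, not the authors' code: it assumes the cases are held in a pandas DataFrame of discrete attributes, and all function names are ours.

    from math import lgamma
    import pandas as pd

    def log_ch_score(data: pd.DataFrame, node: str, parents: list) -> float:
        """Log Cooper-Herskovits score of `node` with the given parent set."""
        r = data[node].nunique()                      # number of observed states of the node
        groups = [data] if not parents else [g for _, g in data.groupby(parents)]
        score = 0.0
        for g in groups:                              # one term per parent instantiation
            score += lgamma(r) - lgamma(len(g) + r)   # (r-1)! / (N_ij + r - 1)!
            for n_ijk in g[node].value_counts():
                score += lgamma(n_ijk + 1)            # N_ijk!
        return score

    def k2_parents(data, node, predecessors):
        """Greedily add the single best predecessor while the score improves."""
        parents, best = [], log_ch_score(data, node, [])
        candidates = list(predecessors)
        while candidates:
            scored = [(log_ch_score(data, node, parents + [c]), c) for c in candidates]
            new_best, choice = max(scored)
            if new_best <= best:
                break
            best, parents = new_best, parents + [choice]
            candidates.remove(choice)
        return parents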

The learning algorithm that we use, called CB (Singh and Valtorta, 1995), is a modified version of K2. Whereas K2 assumes a node ordering, CB uses conditional independence (CI) tests to generate a "good" node ordering, and then uses the K2 algorithm to generate the Bayesian network from the database D using this node ordering. The CB algorithm starts by using CI tests of order 0 and keeps constructing networks for increasing orders of CI tests as long as the predictive accuracy of the generated network keeps increasing. Since CB uses the K2 algorithm to generate the Bayesian network from a particular ordering, CB is correct in the same sense that K2 is (Singh and Valtorta, 1995). Singh and Valtorta show the importance of deriving a good node ordering (Singh and Valtorta, 1995), given the n! possible node orderings on n features.
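The control loop of CB, as described above, can be sketched in Python as follows. This is only an outline under our own interface assumptions: derive_ordering (a node ordering obtained from CI tests up to the given order), learn_k2, and accuracy are supplied by the caller, and none of these names come from the paper.

    def cb(train, holdout, max_ci_order, derive_ordering, learn_k2, accuracy):
        """Increase the CI-test order while predictive accuracy keeps improving."""
        best_net, best_acc = None, float("-inf")
        for order in range(max_ci_order + 1):
            ordering = derive_ordering(train, order)   # ordering from CI tests of this order
            net = learn_k2(train, ordering)
            acc = accuracy(net, holdout)
            if acc <= best_acc:                        # accuracy stopped improving
                break
            best_net, best_acc = net, acc
        return best_net, best_acc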

3.2 ATTRIBUTE SELECTION USING BAYESIAN NETWORKS

We implemented the attribute selection algorithm using the CB algorithm in both the attribute selection and the network construction phases. We call this approach K2-AS, since it uses the basic K2 algorithm allied with an Attribute Selection phase. The algorithm we use is what has been described as a wrapper model (John et al., 1994), in that "the feature subset selection algorithm conducts a search for a good subset using the induction algorithm itself as part of the evaluation function" (John et al., 1994, page 124).

Our learning approach consists of two main steps, attribute selection and network construction. In the attribute selection phase, we choose the set of attributes from which the final network is constructed. In the network construction phase, we construct the network from the subset of attributes selected in the previous phase. Finally, we test the predictive accuracy of the network. The algorithm used for the attribute selection phase is a forward selection algorithm, in that it starts with an empty set of features and adds features using a greedy search. This forward selection is just like K2. We now describe the different phases of the algorithm:

- Attribute selection phase: In this phase, K2-AS chooses the set of attributes $\Phi$, $\Phi \subseteq Z$, from which the final network is constructed. The algorithm starts with the initial assumption that $\Phi$ consists of only the class variable classvar. It then incrementally adds that attribute (from $Z - \Phi$) whose addition results in the maximum increase in the predictive accuracy of the network constructed from the resulting set of nodes. When there is no single attribute whose addition increases predictive accuracy, the algorithm stops adding attributes. We define $A(\Phi)$ to be the predictive accuracy of the set $\Phi$ of attributes, and $G(\Phi)$ to be the network constructed from the set $\Phi$ of attributes. This phase can be described as follows (a Python sketch of the same loop appears after this list):

      Φ ← {classvar}
      A_old ← A({classvar})
      NotDone ← True
      while NotDone do
          ∀ x ∈ Z − Φ, let B_Sx ← G(Φ ∪ {x})
          A_new ← max_x A(B_Sx)
          χ ← arg max_x A(B_Sx)
          if A_new > A_old then
              A_old ← A_new
              Φ ← Φ ∪ {χ}
          else NotDone ← false
      end {while}

- Network construction phase: K2-AS uses the final set of attributes $\Phi$ selected in the attribute selection phase to construct a network using the training data. For this purpose, the CB algorithm (Section 3.1) is used.

- Network evaluation phase: In order to test the quality of the network, we test the network for its predictive accuracy on the test data.
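As a companion to the pseudocode above, the following is a minimal Python sketch of the K2-AS forward attribute-selection loop. It is not the authors' implementation: the caller supplies learn_network (e.g. the CB/K2 learner of Section 3.1) and accuracy (predictive accuracy on held-out data), and all names are ours.

    def k2_as_select(all_attrs, class_var, train, holdout, learn_network, accuracy):
        """Forward selection: grow the attribute set while accuracy improves."""
        selected = [class_var]                     # Phi starts with the class variable only
        best = accuracy(learn_network(train, selected), holdout)
        remaining = [a for a in all_attrs if a != class_var]
        while remaining:
            # score every one-attribute extension of the current subset
            scored = [(accuracy(learn_network(train, selected + [x]), holdout), x)
                      for x in remaining]
            new_best, choice = max(scored)
            if new_best <= best:                   # no single attribute helps: stop
                break
            best = new_best
            selected.append(choice)
            remaining.remove(choice)
        return selected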

4 EXPERIMENTAL COMPARISON OF SELECTIVE BAYESIAN AND NAIVE BAYESIAN CLASSIFIERS

In our experiments we used a variety of databases acquired from the University of California, Irvine repository of machine learning databases (Murphy and Aha, 1992). The databases we used were Michalski's Soybean database, Schlimmer's Mushroom and Voting databases, the Gene-Splicing database due to Towell, Noordewier, and Shavlik,[3] and Shapiro's Chess Endgame database. Table 1 summarizes the databases used in terms of the number of cases, classes, and attributes (excluding the class variable), as well as the (average) number of attributes selected by K2-AS. We examined two variants of the attribute selection criterion, as incorporated in algorithms K2-AS and K2-AS