Geometric Decision Tree
arXiv:1009.3604v2 [cs.LG] 4 Nov 2010
Naresh Manwani and P. S. Sastry, Senior Member, IEEE
Abstract: In this paper we present a new algorithm for learning oblique decision trees. Most current decision tree algorithms rely on impurity measures to assess the goodness of hyperplanes at each node while learning the tree in a top-down fashion. These impurity measures do not properly capture the geometric structure in the data. Motivated by this, our algorithm assesses hyperplanes in a way that takes the geometric structure of the data into account. At each node of the decision tree, we find the clustering hyperplanes for the two classes and use their angle bisectors as the split rule at that node. We show through empirical studies that this idea leads to small decision trees and better performance. We also present some analysis to show that the angle bisectors of the clustering hyperplanes, which we use as split rules at each node, are solutions of an interesting optimization problem, and hence argue that this is a principled method for learning a decision tree.
Index Terms: Decision trees, oblique decision trees, multiclass classification, generalized eigenvalue problem.
Naresh Manwani and P. S. Sastry are with the Department of Electrical Engineering, Indian Institute of Science, Bangalore, 560012, India (e-mail: (naresh, [email protected])).

I. INTRODUCTION

The decision tree is a well-known and widely used method for classification. Its popularity is due to its simplicity and its easy interpretability as a classification rule. In a decision tree classifier, each non-leaf node is associated with a so-called split rule or decision function, which is a function of the feature vector and is often binary-valued. Each leaf node in the tree is associated with a class label. To classify a feature vector using a decision tree, at every non-leaf node that we encounter (starting with the root node) we branch to one of the children of that node based on the value assumed by the split rule of that node on the given
feature vector. This process follows a path in the tree, and when we reach a leaf, the class label of that leaf is assigned to the feature vector. In this paper we address the problem of learning an oblique decision tree from a set of labeled training samples. We present a novel algorithm that attempts to build the tree by capturing the geometric structure of the class regions in the feature space.

Decision trees can be broadly classified into two types: axis-parallel and oblique [6]. In an axis-parallel decision tree, the split rule at each node is a function of only one of the components of the feature vector; for example, at each node we may test whether a chosen component of the feature vector belongs to some interval. Axis-parallel decision trees are particularly attractive when all features are nominal (i.e., take only finitely many values); in such cases we can have a non-binary tree where at each node we test one feature value and the node can have as many children as the values assumed by that feature [13]. In general, axis-parallel decision trees represent class boundaries (in the feature space) by piecewise linear segments, where each segment is a line parallel to one of the feature axes. Axis-parallel decision trees are good when the decision boundaries are actually axis-parallel. In more general situations, however, we have to approximate even arbitrary linear segments of the class boundary with many axis-parallel pieces, and hence the size of the resulting tree becomes large. Oblique decision trees, on the other hand, use a decision function that depends on a linear combination of all feature components. Thus an oblique decision tree is a binary tree in which a hyperplane (in the feature space) is associated with each node. To classify a pattern, we follow a path in the tree by taking the left or right child at each node based on which side of the hyperplane (of that node) the feature vector falls. Oblique decision trees represent the class boundary as a general piecewise linear surface, and they are more versatile (and hence more popular) when features are real-valued. Omnivariate decision trees [15] combine axis-parallel splits, multivariate linear splits and multivariate non-linear splits, with a classifiability-based measure deciding which type of split is chosen at a given node. Since we are interested only in oblique decision trees, we do not discuss omnivariate decision trees further.

The approaches for learning oblique decision trees can be classified into two broad categories. In one set of approaches the structure of the tree is fixed beforehand and one tries to learn the optimal tree with this fixed structure. This methodology has been adopted by several researchers and different optimization algorithms have been proposed [9], [12], [4], [5], [3], [25]. The
problem with these approaches is that they are applicable only when we know the structure of the tree a priori, which is often not the case. The other class of approaches learns the tree in a top-down manner. Top-down approaches have been more popular in applications because of their versatility. Here we discuss only the top-down algorithms, because that is the approach we take.

Top-down approaches for learning decision trees are recursive algorithms that build the tree in a top-down fashion. We start with the given training data and decide on the 'best' hyperplane, which is assigned to the root of the tree. We then partition the training examples into two sets, which go to the left and right children of the root node, using this hyperplane. At each of the two child nodes we repeat the same procedure (using the appropriate subset of the training data). Thus, at each stage of the recursion, we find a hyperplane to associate with the current node based on the current subset of training examples, split this subset into two sets, and then learn the left and right subtrees. The recursion stops when the set of training examples that comes to a node is pure, that is, all these training patterns are of the same class; then we make it a leaf node and assign that class to it. (We can also use other stopping criteria; for example, a node may be made a leaf if, say, 95% of the training examples reaching it belong to one class.) A detailed survey of top-down decision tree algorithms is available in [22].

There are two main issues in top-down decision tree learning: given the training examples at a node, how do we rate the different hyperplanes that can be associated with this node, and, given a rating function, how do we find the optimal hyperplane at each node? One way of rating hyperplanes is to look for hyperplanes that are, in some sense, reasonably good classifiers for the training data at that node. In [17] two parallel hyperplanes are learned at each node such that one side of each hyperplane contains points of only one class and the space between the two hyperplanes contains the points that are not separable. The two sets of points that are well separated are then removed from the data and the same process is repeated on the remaining data. A slight variant of this algorithm is proposed in [14], where only one hyperplane is learned at each decision node, in such a way that one side of the hyperplane contains points of only one class. However, in many cases such approaches produce very large trees that have poor generalization performance. Another approach is to learn a linear classifier with least error (e.g., a least mean squared error classifier) at each node (see, e.g.,
[21]). Decision tree learning for multiclass classification problems using linear-classifier-based approaches is discussed in [16], [26]. At every node, all the classes are divided into two sets in such a way that a linear classifier separating these two sets gives maximum margin. To find which partition of the classes gives maximum margin, one needs to explore all possibilities at each node, which makes the approach computationally cumbersome; in [26] an efficient method is proposed to do this. In general, learning a 'best' linear classifier at each node may not be a good strategy, because such a hyperplane may not separate the data into two sets that are then more easily classified. Instead of finding a linear classifier at each node, Cline [2], a family of decision tree algorithms, uses various heuristics to determine the hyperplane at each node of the tree. However, no results are provided to show why these heuristics help or how one should choose among them.

In a decision tree, the hyperplane at a non-leaf node should split the data in such a way that it aids further classification; the hyperplane itself need not be a good classifier at that stage. In view of this, many top-down decision tree learning algorithms rate hyperplanes using so-called impurity measures. The main idea is as follows. Given the set of training patterns at a node and a hyperplane, we know the set of patterns that go into the left and right children of this node. If each of these two sets of patterns has a predominance of one class over the others, then, presumably, the hyperplane can be considered to have contributed positively to further classification. At any stage in the learning process, the purity of a node is a measure of how skewed the distribution of the different classes is in the set of patterns landing at that node: if the class distribution is nearly uniform, the node is highly impure; if the number of patterns of one class is much larger than that of all others, the purity of the node is high. The impurity measures used in these algorithms give a higher rating to a hyperplane that results in higher purity of the child nodes. The gini index, entropy and the twoing rule are some of the frequently used impurity measures [22]. A slightly different measure of goodness is the area under the ROC curve [10], also called AUCsplit; this measure is inversely related to the gini index of the parent node. Some popular algorithms that learn oblique decision trees by optimizing impurity measures are discussed in [22]. Many of the impurity measures are not differentiable with respect to the hyperplane parameters, so decision tree learning algorithms that use impurity measures need some search technique (rather than, e.g., gradient descent) to find the best hyperplane at each node.
All top-down decision tree algorithms employ greedy search: once a hyperplane is learned at a node, it is never changed. If a bad hyperplane is learned at some node, it cannot be rectified later, and the effect may be a large subtree at that node, which in turn may lead to poor generalization. To overcome such over-fitting, one prunes the learnt decision tree; pruning methods are discussed in detail in [22].

A problem with all impurity measures is that they depend only on the number of (training) patterns of different classes on either side of the hyperplane. Thus, if we change the class regions without changing the effective areas of the class regions on either side of a hyperplane, the impurity measure of the hyperplane does not change. The impurity measures therefore do not really capture the geometric structure of the class distributions (or of the class regions in the feature space). Also, all such algorithms need to optimize some average of the impurities of the child nodes, and often it is not clear what kind of average is appropriate. In [24] a different approach is suggested to overcome this problem with impurity measures: the cost function (the function for rating hyperplanes) gives low cost to hyperplanes that promote the 'degree of linear separability' of the sets of patterns landing at the child nodes. It has been found experimentally that the decision trees learned using this criterion are more compact than those learned using impurity measures. In [24] a simple heuristic is used to define what is meant by 'degree of linear separability'. Again, the problem with the proposed cost function is that it is not differentiable with respect to the parameters of the hyperplane, and the method uses a stochastic search technique called Alopex [27], [23] to find the optimal hyperplane at each node. This cost function also does not try to capture the geometry of the pattern classes.

In this paper we present a new decision tree learning algorithm which is based on the idea of capturing, to some extent, the geometric structure of the underlying class regions. For this we borrow ideas from some recent variants of the SVM method which are quite good at capturing (linear) geometric structure in the data. For a two-class classification problem, the Multisurface Proximal SVM (MPSVM) algorithm [18] finds two clustering hyperplanes, one for each class. Each hyperplane is close to the patterns of one class while being far from the patterns of the other class. New patterns are then classified based on their nearness to the hyperplanes. In problems where one pair of hyperplanes like this does not give sufficient accuracy, [18] suggests using the kernel trick to (effectively) learn the pair of hyperplanes in a high-dimensional space to which the patterns are transformed.
Motivated by Proximal SVM, we derive our decision tree approach as follows. At each node of the tree, we find the clustering hyperplanes as in the Multisurface Proximal SVM. After finding these hyperplanes, we choose the split rule at this node to be one of the angle bisectors of the two hyperplanes. Then we split the data based on the chosen angle bisector and recursively learn the left and right subtrees of this node. Since, in general, there are two angle bisectors, we select the one which is better according to an impurity measure. Thus the algorithm (explained in the next section) combines the ideas of linear tendencies in data and purity of nodes to find better decision trees. We also present some analysis that brings out interesting properties of our angle bisectors and explains why this may be a good technique for learning decision trees. We also show that this algorithm performs well by comparing it with other decision tree methods on a number of benchmark data sets.

The rest of the paper is organized as follows. (A preliminary version of this work was presented in [19], where some experimental results for only the 2-class case were given without any analysis.) We describe our algorithm in Section II. In Section III we present analysis that brings out properties of the angle bisectors of the clustering hyperplanes, which are the hyperplanes used in our method. Based on these results, we argue that the angle bisectors are a good choice of split rule at a node while learning the decision tree. Experimental results are given in Section IV, where we compare our algorithm with many standard decision tree algorithms on various synthetic and real-world datasets. We also compare the performance of our decision tree approach with the support vector machine (SVM) classifier. Finally, we conclude the paper with some discussion in Section V.

II. GEOMETRIC DECISION TREE

The performance of any top-down decision tree algorithm depends on the measure used to rate different hyperplanes at each node. The accuracy of the learned tree depends on the size of the tree (depth and number of leaf nodes), which is also a function of how we split the example set at every non-leaf node. If the split rule misses the right geometry of the classification boundary, then the resulting decision tree is likely to be large. The issue of having a suitable algorithm to find the hyperplane that optimizes the chosen measure or rating function at each node is also important. For example, for all impurity measures, the
optimization is difficult because finding the gradient of the impurity function with respect to the parameters of the hyperplane is not possible. Motivated by these considerations, here we propose a new criterion function for assessing the suitability of a hyperplane at a node that can capture the geometric structure of the class regions. For our criterion function, the optimization problem can also be solved more easily.

We first explain our method of finding the best hyperplane at each node by considering a 2-class problem. Given the set of training patterns at a node, we first find two hyperplanes, one for each class. Each hyperplane is such that it is closest to all patterns of one class and farthest from all patterns of the other class. (The precise formulation of this criterion is given in the next subsection.) We call these the clustering hyperplanes for the two classes. Because of the way they are defined, the clustering hyperplanes capture the dominant linear tendencies in the examples of each class that are useful for discriminating between the classes. Hence a hyperplane that passes in between them could be good for splitting the feature space. So, we take the hyperplane that bisects the angle between the clustering hyperplanes as the split rule at this node. Since, in general, there are two angle bisectors, we choose the bisector which is better based on an impurity measure, namely the gini index. If the two clustering hyperplanes happen to be parallel to each other, then we take a hyperplane midway between them as the split rule.

Fig. 1. An example to illustrate the proposed decision tree algorithm: (a) the hyperplane learnt at the root node using an algorithm (OC1) that relies on the gini index impurity measure; (b) the angle bisectors (solid lines) of the clustering hyperplanes (dashed lines) at the root node of this problem, obtained using our method.

Before presenting the full algorithm, we illustrate it through an example. Consider the two-dimensional classification problem shown in Fig. 1, where the two classes are not linearly separable. The hyperplane learnt at the root node using OC1, an oblique decision tree algorithm that uses the gini index impurity measure, is shown in Fig. 1(a). As can be seen, though this hyperplane promotes (average) purity of the child nodes, it does not really simplify the classification problem; it does not capture the symmetric distribution of class regions present in this problem. Fig. 1(b) shows the two clustering hyperplanes for the two classes and the two angle bisectors obtained through our algorithm at the root node. As can be seen, choosing either of the angle bisectors as the hyperplane at the root node to split the data results in linearly separable classification problems at both child nodes. Thus our idea of using the angle bisectors of the two clustering hyperplanes actually captures the right geometry of the classification problem. This is the reason we call our approach the "geometric decision tree". We also note here that neither of our angle bisectors scores high
on any impurity-based measure; if we use either of these hyperplanes as the split rule at the root, the two child nodes would be highly impure, as both contain roughly equal numbers of patterns of each class. This example is only to explain the motivation behind our approach. Not all classification problems have such a nice symmetric structure in their class regions. But in most problems our approach seems able to capture the geometric structure well, as seen from the results we present in Section IV.

A. The Geometric Decision Tree Algorithm for the 2-Class Case

Let $S = \{(x_i, y_i) : x_i \in \Re^d,\; y_i \in \{-1,+1\},\; i = 1,\dots,n\}$ be the training dataset. Let $C_+$ be the set of points for which $y_i = +1$ and let $C_-$ be the set of points for which $y_i = -1$. (We use $C_+$ and $C_-$ both to denote the sets of examples of the two classes and the labels of the two classes; the meaning will be clear from the context.) For an oblique decision tree learning algorithm, the main computational task is: given the set of data points at a node, find the best hyperplane to split the data. Let $S^t$ be the set of points at node $t$, and let $n^t_+$ and $n^t_-$ denote the numbers of patterns of the two classes at that node, with $S^t_+$ and $S^t_-$ being the sets of points of classes $C_+$ and $C_-$ respectively. Let $A \in \Re^{n^t_+ \times d}$ be the matrix whose rows are the points of class $C_+$ at node $t$, and similarly let $B \in \Re^{n^t_- \times d}$ be the matrix whose rows are the points of class $C_-$ at node $t$.
Let $h_1(w_1,b_1): w_1^T x + b_1 = 0$ and $h_2(w_2,b_2): w_2^T x + b_2 = 0$ be the two clustering hyperplanes, one for each class. The hyperplane $h_1$ is to be closest to all points of class $C_+$ and farthest from the points of class $C_-$; similarly, $h_2$ is to be closest to all points of class $C_-$ and farthest from the points of class $C_+$. To find the clustering hyperplanes we use the idea of the Proximal SVM (also called GEPSVM) [18]. Nearness of a set of points to a hyperplane is measured by the average of squared distances of all the points of that set from the hyperplane. The average of squared distances of the points of class $C_+$ from a hyperplane $h(w,b): w^T x + b = 0$ is
$$D_+(w,b) = \frac{1}{n^t_+\|w\|^2}\sum_{x_i\in C_+} |w^T x_i + b|^2,$$
where $\|\cdot\|$ denotes the standard Euclidean norm. Let $\tilde{w} = [w^T\; b]^T \in \Re^{d+1}$ and $\tilde{x} = [x^T\; 1]^T \in \Re^{d+1}$; then $w^T x_i + b = \tilde{w}^T\tilde{x}_i$. (Unless stated otherwise, all vectors are column vectors.) Note that, by the definition of the matrix $A$, we have $\sum_{x_i\in C_+} |w^T x_i + b|^2 = \|Aw + b\,e_{n^t_+}\|^2$, where $e_{n^t_+}$ is the $n^t_+$-dimensional column vector of ones. Now $D_+(w,b)$ can be further simplified as
$$D_+(w,b) = \frac{1}{n^t_+\|w\|^2}\sum_{x_i\in C_+} |w^T x_i + b|^2 = \frac{1}{n^t_+\|w\|^2}\sum_{x_i\in C_+} |\tilde{w}^T\tilde{x}_i|^2 = \frac{1}{n^t_+\|w\|^2}\,\tilde{w}^T\Big(\sum_{x_i\in C_+}\tilde{x}_i\tilde{x}_i^T\Big)\tilde{w} = \frac{1}{\|w\|^2}\,\tilde{w}^T G\tilde{w},$$
where $G = \frac{1}{n^t_+}\sum_{x_i\in C_+}\tilde{x}_i\tilde{x}_i^T = \frac{1}{n^t_+}[A\;\; e_{n^t_+}]^T[A\;\; e_{n^t_+}]$.

Similarly, the average of squared distances of the points of class $C_-$ from $h$ is $D_-(w,b) = \frac{1}{\|w\|^2}\tilde{w}^T H\tilde{w}$, where $H = \frac{1}{n^t_-}\sum_{x_i\in C_-}\tilde{x}_i\tilde{x}_i^T = \frac{1}{n^t_-}[B\;\; e_{n^t_-}]^T[B\;\; e_{n^t_-}]$ and $e_{n^t_-}$ is the $n^t_-$-dimensional vector of ones.

To find each clustering hyperplane we need to find $h$ such that one of $D_+$ or $D_-$ is maximized while the other is minimized. Hence the two clustering hyperplanes, specified by $\tilde{w}_1 = [w_1^T\; b_1]^T$ and $\tilde{w}_2 = [w_2^T\; b_2]^T$, can be formalized as the solutions of the following optimization problems:
$$\tilde{w}_1 = \operatorname*{argmin}_{\tilde{w}\neq 0} \frac{D_+(w,b)}{D_-(w,b)} = \operatorname*{argmin}_{\tilde{w}\neq 0} \frac{\tilde{w}^T G\tilde{w}}{\tilde{w}^T H\tilde{w}} = \operatorname*{argmax}_{\tilde{w}\neq 0} \frac{\tilde{w}^T H\tilde{w}}{\tilde{w}^T G\tilde{w}} \qquad (1)$$
$$\tilde{w}_2 = \operatorname*{argmin}_{\tilde{w}\neq 0} \frac{D_-(w,b)}{D_+(w,b)} = \operatorname*{argmin}_{\tilde{w}\neq 0} \frac{\tilde{w}^T H\tilde{w}}{\tilde{w}^T G\tilde{w}} \qquad (2)$$
Since $G = \frac{1}{n^t_+}\sum_{x_i\in C_+}\tilde{x}_i\tilde{x}_i^T$, it is easily verified that $G$ is a $(d+1)\times(d+1)$ symmetric positive semidefinite matrix, and $G$ is strictly positive definite when the matrix $A$ has full column rank. Similarly, the matrix $H$ is positive semidefinite and is strictly positive definite when $B$ has full column rank.
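As a concrete illustration of how $G$ and $H$ can be formed from the data at a node, here is a minimal sketch in Python/NumPy. It assumes the patterns at the node are given as arrays `X` (features) and `y` (labels in {-1, +1}); the function and variable names are our own, not part of the paper.

```python
import numpy as np

def second_moment_matrices(X, y):
    """Build the (d+1)x(d+1) matrices G and H used in problems (1)-(2).

    X : (n, d) array of feature vectors at the current node.
    y : (n,) array of labels in {-1, +1}.
    Returns G (from class C+) and H (from class C-).
    """
    A = X[y == +1]                      # points of class C+
    B = X[y == -1]                      # points of class C-
    # Augment each point as x_tilde = [x, 1]
    A_aug = np.hstack([A, np.ones((A.shape[0], 1))])
    B_aug = np.hstack([B, np.ones((B.shape[0], 1))])
    G = A_aug.T @ A_aug / A_aug.shape[0]   # G = (1/n+) [A e]^T [A e]
    H = B_aug.T @ B_aug / B_aug.shape[0]   # H = (1/n-) [B e]^T [B e]
    return G, H
```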
The problems given by (1) and (2) are standard optimization problems and their solutions essentially involve solving the following generalized eigenvalue problem [11]:
$$H\tilde{w} = \lambda G\tilde{w} \qquad (3)$$
It can be shown [11] that any $\tilde{w}$ that is a local solution of the optimization problems (1) and (2) satisfies Eq. (3), and the value of the corresponding objective function is given by the eigenvalue $\lambda$. Thus the problem of finding the parameters $(w_1,b_1)$ and $(w_2,b_2)$ of the two clustering hyperplanes reduces to finding the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem (3). It is easy to see that if $\tilde{w}_1$ is a solution of problem (3), then $k\tilde{w}_1$ is also a solution for any $k$. Here, for our purpose, we choose $k = 1/\|w_1\|$. That is, the clustering hyperplanes (obtained as the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem (3)) are $\tilde{w}_1 = [w_1^T\; b_1]^T$ and $\tilde{w}_2 = [w_2^T\; b_2]^T$, scaled such that $\|w_1\| = \|w_2\| = 1$.

We solve this generalized eigenvalue problem using the standard LU-decomposition based method [11] in the following way. Let the matrix $G$ have full rank, and let $G = FF^T$, which can be obtained using LU factorization. Then, from Eq. (3),
$$H\tilde{w} = \lambda FF^T\tilde{w} \;\Rightarrow\; F^{-1}H\tilde{w} = \lambda F^T\tilde{w} \;\Rightarrow\; F^{-1}HF^{-T}F^T\tilde{w} = \lambda F^T\tilde{w} \;\Rightarrow\; F^{-1}HF^{-T}\tilde{y} = \lambda\tilde{y},$$
where $\tilde{y} = F^T\tilde{w}$. This means that $\tilde{y}$ is an eigenvector of $F^{-1}HF^{-T}$. It is easily verified that $F^{-1}HF^{-T}$ is symmetric. For any $\tilde{x}$,
$$\tilde{x}^T F^{-1}HF^{-T}\tilde{x} = \frac{1}{n^t_-}\tilde{x}^T F^{-1}[B\;\; e_{n_-}]^T[B\;\; e_{n_-}]F^{-T}\tilde{x} = \frac{1}{n^t_-}\big\|[B\;\; e_{n_-}]F^{-T}\tilde{x}\big\|^2 \geq 0,$$
which shows that $F^{-1}HF^{-T}$ is positive semidefinite. So we can find orthonormal eigenvectors of $F^{-1}HF^{-T}$. If $\tilde{w}$ is an eigenvector of the generalized eigenvalue problem $H\tilde{w} = \lambda G\tilde{w}$, then $F^T\tilde{w}$ is an eigenvector of the symmetric positive semidefinite matrix $F^{-1}HF^{-T}$.
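A minimal sketch of this step in Python/SciPy is given below. It solves the generalized eigenvalue problem (3) directly with `scipy.linalg.eigh` (which, for a positive definite $G$, internally uses a Cholesky-type reduction equivalent to the one above) and then applies the scaling $\|w_1\| = \|w_2\| = 1$. The function name and the small regularization constant are our own illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def clustering_hyperplanes(G, H, reg=0.0):
    """Solve H w = lambda G w and return the two clustering hyperplanes.

    Returns w1_tilde (largest eigenvalue) and w2_tilde (smallest eigenvalue),
    each scaled so that its first d components have unit norm.
    """
    d_plus_1 = G.shape[0]
    # Optional tiny ridge in case G is close to singular (illustrative choice).
    G_reg = G + reg * np.eye(d_plus_1)
    eigvals, eigvecs = eigh(H, G_reg)      # generalized symmetric-definite problem
    w1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    w2 = eigvecs[:, 0]                     # eigenvector of the smallest eigenvalue
    w1 = w1 / np.linalg.norm(w1[:-1])      # scale so ||w1|| = 1 (ignore bias term)
    w2 = w2 / np.linalg.norm(w2[:-1])      # scale so ||w2|| = 1
    return w1, w2
```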
Once we find the clustering hyperplanes, the hyperplane we associate with the current node is one of the angle bisectors of these two hyperplanes. Let $w_3^T x + b_3 = 0$ and $w_4^T x + b_4 = 0$ be the angle bisectors of $w_1^T x + b_1 = 0$ and $w_2^T x + b_2 = 0$. It is easily shown that these hyperplanes are given by (in the case $w_1 \neq w_2$)
$$\tilde{w}_3 = \tilde{w}_1 + \tilde{w}_2 \qquad (4)$$
$$\tilde{w}_4 = \tilde{w}_1 - \tilde{w}_2 \qquad (5)$$
(Note that $\tilde{w}_1 = [w_1^T\; b_1]^T$ is such that $\|w_1\| = 1$, and similarly for $\tilde{w}_2$.) As mentioned earlier, we choose the angle bisector with lower impurity, and we use the gini index to measure impurity. Let $\tilde{w}^t$ be a hyperplane used for dividing the set of patterns $S^t$ into two parts $S^{tl}$ and $S^{tr}$. Let $n^{tl}_+$ and $n^{tl}_-$ denote the numbers of patterns of the two classes in $S^{tl}$, and $n^{tr}_+$ and $n^{tr}_-$ the numbers of patterns of the two classes in $S^{tr}$. Then the gini index of the hyperplane $\tilde{w}^t$ is given by [6]
$$\mathrm{gini}(\tilde{w}^t) = \frac{n^{tl}}{n^t}\Big[1 - \Big(\frac{n^{tl}_+}{n^{tl}}\Big)^2 - \Big(\frac{n^{tl}_-}{n^{tl}}\Big)^2\Big] + \frac{n^{tr}}{n^t}\Big[1 - \Big(\frac{n^{tr}_+}{n^{tr}}\Big)^2 - \Big(\frac{n^{tr}_-}{n^{tr}}\Big)^2\Big] \qquad (6)$$
where $n^t = n^t_+ + n^t_-$ is the number of points in $S^t$, $n^{tl} = n^{tl}_+ + n^{tl}_-$ is the number of points falling in $S^{tl}$, and $n^{tr} = n^{tr}_+ + n^{tr}_-$ is the number of points falling in $S^{tr}$. We choose $\tilde{w}_3$ or $\tilde{w}_4$ as the split rule for $S^t$ based on which of the two gives the lesser value of the gini index in Eq. (6). When the clustering hyperplanes are parallel (that is, when $w_1 = w_2$), we choose a hyperplane midway between them as the splitting hyperplane, given by $\tilde{w} = (w, b) = (w_1, (b_1+b_2)/2)$.

As is easy to see, in our method the optimization problem of finding the best hyperplane at each node is solved exactly, rather than by relying on a search technique based on local perturbations of the hyperplane parameters. The clustering hyperplanes are obtained by solving the eigenvalue problem; after that, to find the hyperplane at the node we only need to compare two hyperplanes based on the gini index.
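The following sketch (ours, with illustrative function names) shows how Eq. (6) can be used to pick between the two angle bisectors; it assumes the augmented points `X_aug` and labels `y` at the node, together with the two clustering hyperplane vectors from the previous step.

```python
import numpy as np

def gini_of_split(X_aug, y, w_tilde):
    """Gini index (Eq. 6) of the split induced by the hyperplane w_tilde."""
    side = X_aug @ w_tilde >= 0
    total = len(y)
    gini = 0.0
    for mask in (side, ~side):
        n_child = mask.sum()
        if n_child == 0:
            continue
        p_plus = (y[mask] == +1).mean()
        p_minus = (y[mask] == -1).mean()
        gini += (n_child / total) * (1.0 - p_plus**2 - p_minus**2)
    return gini

def choose_angle_bisector(X_aug, y, w1, w2):
    """Return the angle bisector (w1+w2 or w1-w2) with the smaller gini index."""
    w3, w4 = w1 + w2, w1 - w2              # Eqs. (4) and (5)
    return w3 if gini_of_split(X_aug, y, w3) <= gini_of_split(X_aug, y, w4) else w4
```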
The complete algorithm for learning the decision tree is as follows. At any given node, given the set of patterns $S^t$, we find the two clustering hyperplanes (by solving the generalized eigenvalue problem) and choose one of the two angle bisectors, based on the gini index, as the hyperplane associated with this node. We then use this hyperplane to split $S^t$ into two sets, the ones that go into the left and right child nodes, and recursively do the same at the two child nodes. The process starts at the root node of the tree, where $S^t$ is the complete set of example patterns. The recursion stops when the set of patterns at a node is such that the fraction of patterns belonging to the minority class falls below a user-specified threshold, or when the depth of the tree reaches a pre-specified maximum limit. A complete description is provided as Algorithm 1, which recursively calls the procedure GrowTreeBinary($S^t$); this procedure returns a subtree rooted at node $t$.

Algorithm 1: Binary Geometric Decision Tree
Input: $S = \{(x_i,y_i)\}_{i=1}^n$, Max-Depth, $\epsilon$
Output: Pointer to the root of a decision tree
begin
    Root = GrowTreeBinary($S$);
    return Root;
end

GrowTreeBinary($S^t$)
Input: Set of patterns $S^t$ at node $t$
Output: Pointer to a subtree
begin
    Find matrices $A$ and $B$ using the set $S^t$ of patterns at node $t$;
    Find $\tilde{w}_1$ and $\tilde{w}_2$, the solutions of optimization problems (1) and (2);
    Compute the angle bisectors $\tilde{w}_3$ and $\tilde{w}_4$ using Eqs. (4) and (5);
    Choose the angle bisector giving the lesser value of the gini index (cf. Eq. (6)); call it $\tilde{w}^*$;
    Let $\tilde{w}^t$ denote the split rule at node $t$; assign $\tilde{w}^t \leftarrow \tilde{w}^*$;
    Let $S^{tl} = \{x_i \in S^t \,|\, (\tilde{w}^t)^T\tilde{x}_i < 0\}$ and $S^{tr} = \{x_i \in S^t \,|\, (\tilde{w}^t)^T\tilde{x}_i \geq 0\}$;
    Define $\eta(S^t) = \min(n^t_+, n^t_-)/n^t$, where $n^t = |S^t|$, $n^t_+ = |S^t_+|$ and $n^t_- = |S^t_-|$;
    if ($\eta(S^{tl}) < \epsilon$) or (Tree-Depth = Max-Depth) then
        get a node $tl$ and make $tl$ a leaf node;
        assign the class label of the majority class to $tl$; make $tl$ the left child of $t$;
    else
        $tl$ = GrowTreeBinary($S^{tl}$); make $tl$ the left child of $t$;
    end
    if ($\eta(S^{tr}) < \epsilon$) or (Tree-Depth = Max-Depth) then
        get a node $tr$ and make $tr$ a leaf node;
        assign the class label of the majority class to $tr$; make $tr$ the right child of $t$;
    else
        $tr$ = GrowTreeBinary($S^{tr}$); make $tr$ the right child of $t$;
    end
    return $t$;
end
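Putting the pieces together, a compact Python sketch of the recursive procedure of Algorithm 1 might look as follows. It reuses the helper functions sketched earlier (`second_moment_matrices`, `clustering_hyperplanes`, `choose_angle_bisector`), all of which are our illustrative names rather than code from the paper, and it is a sketch rather than a faithful implementation.

```python
import numpy as np

class Node:
    def __init__(self, w=None, left=None, right=None, label=None):
        self.w, self.left, self.right, self.label = w, left, right, label

def grow_tree_binary(X, y, depth=0, max_depth=10, eps=0.15):
    """Recursive sketch of GrowTreeBinary (Algorithm 1)."""
    n_plus, n_minus = (y == +1).sum(), (y == -1).sum()
    eta = min(n_plus, n_minus) / len(y)
    if eta < eps or depth == max_depth:
        return Node(label=+1 if n_plus >= n_minus else -1)   # leaf: majority class
    G, H = second_moment_matrices(X, y)
    w1, w2 = clustering_hyperplanes(G, H)                    # solve Eq. (3)
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w_star = choose_angle_bisector(X_aug, y, w1, w2)         # Eqs. (4)-(6)
    right = X_aug @ w_star >= 0
    # Guard against a degenerate split that sends everything to one side.
    if right.all() or (~right).all():
        return Node(label=+1 if n_plus >= n_minus else -1)
    return Node(w=w_star,
                left=grow_tree_binary(X[~right], y[~right], depth + 1, max_depth, eps),
                right=grow_tree_binary(X[right], y[right], depth + 1, max_depth, eps))
```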
B. Handling the Small Sample Size Problem

In our method we solve the generalized eigenvalue problem using the standard LU-decomposition based technique. This works when the matrices $G$ and $H$ have full rank, which happens when the matrices $A$ and $B$ have full column rank. In general, if there are a large number of examples, then (under the usual assumption that no feature is a linear combination of the others) $A$ and $B$ have full column rank. (This is the case, for example, in the proximal SVM method [18], which also finds clustering hyperplanes in this manner.) However, in our decision tree algorithm, as we go down the tree the number of points falling at non-leaf nodes keeps decreasing. Hence there may be cases where, e.g., the number of samples falling at a non-leaf node is less than the dimension of the data. In such cases, either matrix $G$ or $H$, or both, become rank deficient. A similar situation sometimes occurs when finding the Fisher linear discriminant [8]. We describe a method of handling this small sample size problem at a node by adapting the technique presented in [8].

Suppose that the matrix $G$ has rank less than $d+1$; then its null space has dimension greater than or equal to one. Let $\mathcal{N}$ be the null space of $G$. If $G$ has rank $r$, then $\mathcal{N}$ has dimension $(d+1-r)$. Let $Q = [\alpha_1 \dots \alpha_{d+1-r}]$ be the matrix whose columns span $\mathcal{N}$; we choose $\alpha_1,\dots,\alpha_{d+1-r}$ to be orthonormal. Following the method given in [8], we first project all the points of class $C_-$ onto the null space of $G$: every vector $\tilde{x}$ belonging to class $C_-$, after projection, becomes $QQ^T\tilde{x}$. Let the matrix corresponding to $H$ after projection be $\tilde{H}$. Then
$$\tilde{H} = \frac{1}{n_-}\sum_{x\in C_-} QQ^T\tilde{x}\tilde{x}^T QQ^T = QQ^T H QQ^T.$$
Just like $\tilde{H}$, we construct $\tilde{G}$ as $\tilde{G} = QQ^T G QQ^T$. But the columns of $Q$ span the null space of $G$, so $\tilde{G} = 0$. Now the eigenvector corresponding to the largest eigenvalue of $\tilde{H}$ is selected as the desired vector (clustering hyperplane).

We now explain why this approach works. The whole analysis is based on the following result.

Theorem 1 ([8]): Suppose that $R$ is a set in $d$-dimensional space and, for all $x\in R$, $f(x)\geq 0$, $g(x)\geq 0$ and $f(x)+g(x) > 0$. Let $h_1(x) = f(x)/g(x)$ and $h_2(x) = f(x)/(f(x)+g(x))$. Then $h_1(x)$ has a maximum (including positive infinity) at a point $x_0$ in $R$ if $h_2(x)$ has a maximum at $x_0$.

Using Theorem 1, it is clear that a $\tilde{w}$ which maximizes the ratio $(\tilde{w}^T H\tilde{w})/(\tilde{w}^T G\tilde{w} + \tilde{w}^T H\tilde{w})$ will also maximize $(\tilde{w}^T H\tilde{w})/(\tilde{w}^T G\tilde{w})$. Given that $H$ is strictly positive definite and $G$ is positive semidefinite, $0 \neq \tilde{w}^T H\tilde{w} \leq \tilde{w}^T G\tilde{w} + \tilde{w}^T H\tilde{w}$. It is obvious that $(\tilde{w}^T H\tilde{w})/(\tilde{w}^T G\tilde{w} + \tilde{w}^T H\tilde{w}) = 1$ if and only if $\tilde{w}^T H\tilde{w} \neq 0$ and $\tilde{w}^T G\tilde{w} = 0$. This implies that each $\tilde{w} \in \mathcal{N}$ which satisfies $\tilde{w}^T H\tilde{w} \neq 0$ maximizes $(\tilde{w}^T H\tilde{w})/(\tilde{w}^T G\tilde{w} + \tilde{w}^T H\tilde{w})$. However, simply maximizing the modified ratio $(\tilde{w}^T H\tilde{w})/(\tilde{w}^T (G+H)\tilde{w})$ is not sufficient, because $\tilde{w}^T H\tilde{w}$ may not reach its maximum value [8]. Selecting $\tilde{w} \in \mathcal{N}$ enforces $\tilde{G} = 0$ and gives $\tilde{G} + \tilde{H} = \tilde{H}$. Since $\tilde{G}$ is identically zero, we have to select the $\tilde{w} \in \mathcal{N}$ which maximizes $\tilde{w}^T\tilde{H}\tilde{w}$. In that case, we choose the principal eigenvector of $\tilde{H}$ as $\tilde{w}_1$ (the clustering hyperplane of class $C_+$), which happens to be a clustering hyperplane in $\mathcal{N}$ that is closest to all the points of class $C_+$ and farthest from all the points of class $C_-$.

To summarize, when the matrix $G$ becomes rank deficient, we find its null space and project all the feature vectors of class $C_-$ onto this null space. The clustering hyperplane for class $C_+$ is then chosen as the principal eigenvector of the matrix $\tilde{H}$ described above. The small sample size problem can occur only when $G$ becomes rank deficient in the maximization problem given by Eq. (1), because it is a maximization problem and $G$ appears in the denominator; rank deficiency of the matrix $H$ does not affect the solution of this maximization problem. When both $G$ and $H$ are rank deficient, we follow the same procedure that would have been used when only $G$ were rank deficient.
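A minimal sketch of this null-space handling in Python/SciPy is given below; the function name is ours, and the routine assumes $G$ is rank deficient, as in the situation just described.

```python
import numpy as np
from scipy.linalg import null_space, eigh

def clustering_hyperplane_small_sample(G, H):
    """Small-sample case: project onto the null space of G and take the
    principal eigenvector of H_tilde = Q Q^T H Q Q^T (cf. Sec. II-B)."""
    Q = null_space(G)                 # orthonormal basis of the null space of G
    P = Q @ Q.T                       # projector onto the null space
    H_tilde = P @ H @ P
    eigvals, eigvecs = eigh(H_tilde)  # symmetric eigen-decomposition
    w1 = eigvecs[:, -1]               # principal eigenvector of H_tilde
    norm = np.linalg.norm(w1[:-1])
    return w1 / norm if norm > 0 else w1
```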
C. Geometric Decision Tree for Multiclass Classification

In this section we explain how the algorithm presented in the previous subsection can be generalized to handle more than two classes. Let $S = \{(x_i, y_i) : x_i \in \Re^d,\; y_i \in \{1,\dots,K\},\; i=1,\dots,n\}$ be the training dataset, where $K$ is the number of classes. Also let $n^t_k$ be the number of points of class $C_k$ at node $t$. At a node $t$ of the tree, we divide the set of points $S^t$ at that node into two subsets, $S^t_+$ and $S^t_-$: $S^t_+$ contains the points of the majority class in $S^t$, whereas $S^t_-$ contains the rest of the points. We treat this as a binary classification problem, find the clustering hyperplanes for $S^t_+$ and $S^t_-$, and compute the two angle bisectors. We choose one of the angle bisectors, based on the gini impurity, as the hyperplane associated with this node. We then use this hyperplane to split $S^t$ into two sets, $S^{tl}$ and $S^{tr}$, that go into the left and right child nodes.

If we reach the specified maximum limit on the depth of the tree, then we mark both child nodes as leaf nodes and assign to each the class label of its majority class. Otherwise, we do the following at each child node. If the fraction of points of the majority class in $S^{tl}$ is above some threshold, say $(1-\epsilon_1)$, then we mark it as a leaf node with the label of the majority class; otherwise we find a hyperplane to split these points using the same method as we did for $S^t$. We do the same with $S^{tr}$. In this way we recursively split the data until all nodes become leaf nodes.

The computation of whether the majority class is predominant in $S^{tl}$ and $S^{tr}$ can be done a little more efficiently. We explain this using the left child node. Let $S^{tl}_+$ be the subset of $S^t_+$ which goes to the left child of node $t$, and similarly let $S^{tl}_-$ be the subset of $S^t_-$ which goes to the left child. If the fraction of points of the smaller of the two sets $S^{tl}_+$ and $S^{tl}_-$ is less than some threshold, say $\epsilon_1$, then we check which of the two sets is in the minority. If $S^{tl}_-$ is in the minority, then we can assign the class label of the points in $S^{tl}_+$ to the left child and make it a leaf node. (This is because we know that all points in $S^t_+$, and hence in $S^{tl}_+$, are of the same class.) On the other hand, if $S^{tl}_+$ is in the minority, then we have to actually compute the fraction of points of the majority class in $S^{tl}$ to decide whether or not $S^{tl}$ needs to be split further.

A complete description of the decision tree method for multiclass classification is given in Algorithm 2, which recursively calls the procedure GrowTreeMulticlass($S^t$); this procedure learns a split rule for node $t$ and returns a subtree rooted at that node.

Algorithm 2: Multiclass Geometric Decision Tree
Input: $S = \{(x_i,y_i)\}_{i=1}^n$, Max-Depth, $\epsilon_1$, $\epsilon_2$
Output: Pointer to the root of a decision tree
begin
    Root = GrowTreeMulticlass($S$);
    return Root;
end

GrowTreeMulticlass($S^t$)
Input: Set of patterns $S^t$ at node $t$
Output: Pointer to a subtree
begin
    Divide the set $S^t$ into two parts, $S^t_+$ and $S^t_- = S^t \setminus S^t_+$, where $S^t_+$ contains the points of the majority class and $S^t_-$ contains the points of the remaining classes;
    Find matrix $A$ from the points in $S^t_+$ and matrix $B$ from the points in $S^t_-$;
    Find $\tilde{w}_1$ and $\tilde{w}_2$, the solutions of optimization problems (1) and (2);
    Compute the angle bisectors $\tilde{w}_3$ and $\tilde{w}_4$ using Eqs. (4) and (5);
    Choose the angle bisector giving the lesser value of the gini index (cf. Eq. (6)); call it $\tilde{w}^*$;
    Let $\tilde{w}^t$ denote the split rule at node $t$; assign $\tilde{w}^t \leftarrow \tilde{w}^*$;
    Let $S^{tl} = \{x_i \in S^t \,|\, (\tilde{w}^t)^T\tilde{x}_i < 0\}$ and $S^{tr} = \{x_i \in S^t \,|\, (\tilde{w}^t)^T\tilde{x}_i \geq 0\}$;
    Define $\eta_1(S^t) = \min(n^t_+, n^t_-)/n^t$, where $n^t = |S^t|$, $n^t_+ = |S^t_+|$ and $n^t_- = |S^t_-|$;
    Define $\eta_2(S^t) = \max(n^t_1,\dots,n^t_K)/n^t$;
    if (Tree-Depth = Max-Depth) then
        get a node $tl$ and make $tl$ a leaf node; assign the label of the majority class in $S^{tl}$ to $tl$; make $tl$ the left child of $t$;
    else if ($\eta_1(S^{tl}) < \epsilon_1$) then
        if ($n^{tl}_- < n^{tl}_+$) then
            get a node $tl$ and make $tl$ a leaf node; assign the class label of the points in $S^{tl}_+$ to $tl$; make $tl$ the left child of $t$;
        else if ($\eta_2(S^{tl}) \geq \epsilon_2$) then
            get a node $tl$ and make $tl$ a leaf node; assign the label of the majority class in $S^{tl}$ to $tl$; make $tl$ the left child of $t$;
        else
            $tl$ = GrowTreeMulticlass($S^{tl}$); make $tl$ the left child of $t$;
        end
    else
        $tl$ = GrowTreeMulticlass($S^{tl}$); make $tl$ the left child of $t$;
    end
    if (Tree-Depth = Max-Depth) then
        get a node $tr$ and make $tr$ a leaf node; assign the label of the majority class in $S^{tr}$ to $tr$; make $tr$ the right child of $t$;
    else if ($\eta_1(S^{tr}) < \epsilon_1$) then
        if ($n^{tr}_- < n^{tr}_+$) then
            get a node $tr$ and make $tr$ a leaf node; assign the class label of the points in $S^{tr}_+$ to $tr$; make $tr$ the right child of $t$;
        else if ($\eta_2(S^{tr}) \geq \epsilon_2$) then
            get a node $tr$ and make $tr$ a leaf node; assign the label of the majority class in $S^{tr}$ to $tr$; make $tr$ the right child of $t$;
        else
            $tr$ = GrowTreeMulticlass($S^{tr}$); make $tr$ the right child of $t$;
        end
    else
        $tr$ = GrowTreeMulticlass($S^{tr}$); make $tr$ the right child of $t$;
    end
    return $t$;
end
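The one-versus-rest reduction used at each node of the multiclass tree can be sketched as follows, again with our own illustrative function names and reusing the binary helpers introduced earlier.

```python
import numpy as np

def multiclass_node_split(X, y):
    """Form the binary problem used at a node of the multiclass tree:
    majority class vs. the rest, then reuse the binary split machinery."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    y_bin = np.where(y == majority, +1, -1)          # S_+^t vs. S_-^t
    G, H = second_moment_matrices(X, y_bin)
    w1, w2 = clustering_hyperplanes(G, H)
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w_star = choose_angle_bisector(X_aug, y_bin, w1, w2)
    right = X_aug @ w_star >= 0                      # split of S^t into S^tl, S^tr
    return w_star, right
```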
III. ANALYSIS

In this section we present some analysis of our algorithm. We consider only the binary classification problem. Given a set of patterns (feature vectors) $S$, our algorithm finds the two clustering hyperplanes and then the angle bisectors of these two clustering hyperplanes. In this section we prove some interesting properties of the angle bisector hyperplanes, to indicate why the angle bisectors may be a good choice (in a decision tree) for dividing the set of patterns $S$.

Let $S$ be a set of $n$ patterns of which $n_+$ are of class $C_+$ and $n_-$ are of class $C_-$. Recall that, as per our notation, $A$ is the matrix whose rows are the feature vectors of class $C_+$ and $B$ is the matrix whose rows are the feature vectors of class $C_-$. Let the sample mean of class $C_+$ be $\mu_+$ and that of class $C_-$ be $\mu_-$; both are $d$-dimensional column vectors, where $d$ is the feature space dimension. Let $\Sigma_+$ and $\Sigma_-$ be the sample covariance matrices. Then we have
$$\Sigma_+ = \frac{1}{n_+}\sum_{x\in C_+} (x-\mu_+)(x-\mu_+)^T = \frac{1}{n_+}(A - e_{n_+}\mu_+^T)^T (A - e_{n_+}\mu_+^T) \qquad (7)$$
where $e_{n_+}$ is the $n_+$-dimensional column vector having all elements equal to one. Similarly,
$$\Sigma_- = \frac{1}{n_-}\sum_{x\in C_-} (x-\mu_-)(x-\mu_-)^T = \frac{1}{n_-}(B - e_{n_-}\mu_-^T)^T (B - e_{n_-}\mu_-^T).$$
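For reference, the sample statistics in Eq. (7) can be computed directly; a small NumPy sketch (our own helper, not from the paper) is:

```python
import numpy as np

def class_statistics(X, y):
    """Sample means and covariance matrices of the two classes, as in Eq. (7)."""
    A, B = X[y == +1], X[y == -1]
    mu_plus, mu_minus = A.mean(axis=0), B.mean(axis=0)
    Sigma_plus = (A - mu_plus).T @ (A - mu_plus) / A.shape[0]
    Sigma_minus = (B - mu_minus).T @ (B - mu_minus) / B.shape[0]
    return mu_plus, mu_minus, Sigma_plus, Sigma_minus
```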
Case 1: $\Sigma_+ = \Sigma_- = \Sigma$

We first consider the case of pattern classes with equal covariance matrices. Then we have the following result.

Theorem 2: Let $S$ be a set of feature vectors with equal sample covariance matrices for the two classes. Then the angle bisector of the two clustering hyperplanes has the same orientation as the Fisher linear discriminant hyperplane.
Proof: For arbitrary $w\in\Re^d$ and $b\in\Re$ we have
$$\frac{1}{n_+}\|Aw + b e_{n_+}\|^2 = \frac{1}{n_+}\|(A - e_{n_+}\mu_+^T)w + e_{n_+}(\mu_+^T w + b)\|^2$$
$$= \frac{1}{n_+}\Big[w^T (A - e_{n_+}\mu_+^T)^T (A - e_{n_+}\mu_+^T)w + (\mu_+^T w+b)^2 e_{n_+}^T e_{n_+} + 2(\mu_+^T w+b)\, e_{n_+}^T (A - e_{n_+}\mu_+^T)w\Big]$$
$$= w^T\Sigma w + (w^T\mu_+ + b)^2 + \frac{2(\mu_+^T w+b)}{n_+}\sum_{x_i\in C_+}(x_i - \mu_+)^T w = \sigma^2 + \rho_+^2,$$
because $e_{n_+}^T e_{n_+} = n_+$, Eq. (7) holds, and $\sum_{x_i\in C_+}(x_i-\mu_+) = 0$. Here $\sigma^2 = w^T\Sigma w$ and $\rho_+ = w^T\mu_+ + b$. Similarly, with $\rho_- = w^T\mu_- + b$,
$$\frac{1}{n_-}\|Bw + b e_{n_-}\|^2 = \sigma^2 + \rho_-^2.$$

Let $f_1(w_1,b_1)$ be the objective function of optimization problem (1), whose minimizer is the clustering hyperplane for class $C_+$, and let $f_2(w_2,b_2)$ be the objective function of optimization problem (2), whose minimizer is the clustering hyperplane for class $C_-$. Then, since $D_+(w_1,b_1) = \|Aw_1 + b_1 e_{n_+}\|^2/(n_+\|w_1\|^2)$ and similarly for $D_-$, we have
$$f_1(w_1,b_1) = \frac{\sigma_1^2 + \rho_{1+}^2}{\sigma_1^2 + \rho_{1-}^2}, \qquad f_2(w_2,b_2) = \frac{\sigma_2^2 + \rho_{2-}^2}{\sigma_2^2 + \rho_{2+}^2},$$
where $\sigma_j^2 = w_j^T\Sigma w_j$, $\rho_{j+} = w_j^T\mu_+ + b_j$ and $\rho_{j-} = w_j^T\mu_- + b_j$, $j = 1,2$.

Now we take the derivative of $f_1(w_1,b_1)$ with respect to $(w_1,b_1)$ and of $f_2(w_2,b_2)$ with respect to $(w_2,b_2)$ using the chain rule, e.g.,
$$\frac{\partial f_1}{\partial w_1} = \frac{\partial f_1}{\partial\sigma_1^2}\frac{\partial\sigma_1^2}{\partial w_1} + \frac{\partial f_1}{\partial\rho_{1+}}\frac{\partial\rho_{1+}}{\partial w_1} + \frac{\partial f_1}{\partial\rho_{1-}}\frac{\partial\rho_{1-}}{\partial w_1},$$
and similarly for $\partial f_1/\partial b_1$, $\partial f_2/\partial w_2$ and $\partial f_2/\partial b_2$. By the definitions of $\sigma_j^2$, $\rho_{j+}$ and $\rho_{j-}$,
$$\frac{\partial\sigma_j^2}{\partial w_j} = 2\Sigma w_j, \quad \frac{\partial\rho_{ji}}{\partial w_j} = \mu_i, \quad \frac{\partial\sigma_j^2}{\partial b_j} = 0, \quad \frac{\partial\rho_{ji}}{\partial b_j} = 1, \qquad i = +,-,\; j = 1,2.$$
Equating the derivatives to zero, we get
$$2\frac{\partial f_1}{\partial\sigma_1^2}\Sigma w_1 + \frac{\partial f_1}{\partial\rho_{1+}}\mu_+ + \frac{\partial f_1}{\partial\rho_{1-}}\mu_- = 0 \qquad (8)$$
$$2\frac{\partial f_2}{\partial\sigma_2^2}\Sigma w_2 + \frac{\partial f_2}{\partial\rho_{2+}}\mu_+ + \frac{\partial f_2}{\partial\rho_{2-}}\mu_- = 0 \qquad (9)$$
$$\frac{\partial f_1}{\partial\rho_{1+}} + \frac{\partial f_1}{\partial\rho_{1-}} = 0 \qquad (10)$$
$$\frac{\partial f_2}{\partial\rho_{2+}} + \frac{\partial f_2}{\partial\rho_{2-}} = 0 \qquad (11)$$
Substituting (10) in (8) and (11) in (9), we get
$$2\frac{\partial f_1}{\partial\sigma_1^2}\Sigma w_1 = \frac{\partial f_1}{\partial\rho_{1-}}(\mu_+ - \mu_-), \qquad 2\frac{\partial f_2}{\partial\sigma_2^2}\Sigma w_2 = \frac{\partial f_2}{\partial\rho_{2-}}(\mu_+ - \mu_-),$$
which gives
$$w_1 = \frac{\partial f_1/\partial\rho_{1-}}{2\,\partial f_1/\partial\sigma_1^2}\,\Sigma^{-1}(\mu_+ - \mu_-) = \alpha_1\Sigma^{-1}(\mu_+ - \mu_-), \qquad w_2 = \frac{\partial f_2/\partial\rho_{2-}}{2\,\partial f_2/\partial\sigma_2^2}\,\Sigma^{-1}(\mu_+ - \mu_-) = \alpha_2\Sigma^{-1}(\mu_+ - \mu_-).$$
This means that the two clustering hyperplanes are parallel to each other. In this case, our angle bisector is the hyperplane parallel to both clustering hyperplanes and situated exactly midway between them. In other words, the only angle bisector $(w_3,b_3)$ is such that $w_3 \propto \Sigma^{-1}(\mu_+ - \mu_-)$. This is also the Fisher linear discriminant direction, which proves the theorem.

The above theorem says that if the sample covariance matrices are the same, then the angle bisector hyperplane $(w_3,b_3)$ is such that $w_3$ is the same as the normal to the Fisher linear discriminant hyperplane. This shows that in the equal covariance matrix case, the hyperplane we use is a good choice.
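Theorem 2 can be checked numerically: with two Gaussian classes sharing a covariance matrix, the normals of the two clustering hyperplanes (and hence the angle bisector) should be nearly parallel to $\Sigma^{-1}(\mu_+-\mu_-)$, up to sampling error. The following is a small sanity-check sketch under these assumptions, reusing the helper functions introduced earlier; the data-generation choices are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
Sigma = np.diag(np.linspace(1.0, 3.0, d))
mu_plus, mu_minus = np.ones(d), -np.ones(d)

Xp = rng.multivariate_normal(mu_plus, Sigma, n)
Xm = rng.multivariate_normal(mu_minus, Sigma, n)
X = np.vstack([Xp, Xm])
y = np.hstack([np.ones(n), -np.ones(n)])

G, H = second_moment_matrices(X, y)
w1, w2 = clustering_hyperplanes(G, H)

fisher = np.linalg.solve(Sigma, mu_plus - mu_minus)
fisher /= np.linalg.norm(fisher)
# Both normals (first d components) should be nearly parallel to the Fisher direction,
# so both absolute cosines should be close to 1.
print(abs(w1[:-1] @ fisher), abs(w2[:-1] @ fisher))
```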
Case 2: $\mu_+ = \mu_- = \mu$

Next we discuss the case where both classes have the same mean. We show that the angle bisectors are solutions of an optimization problem that is a reasonable choice for the hyperplane to be associated with a given node in the tree. We first show that both clustering hyperplanes (and hence the angle bisectors) pass through the common mean.

Theorem 3: If the sample means of the two classes are the same, then the clustering hyperplanes found by solving optimization problems (1) and (2) pass through the common mean.

Proof: Consider optimization problem (1), which finds the clustering hyperplane closest to all the points in class $C_+$ and farthest from all the points in class $C_-$. As explained earlier, the optimization problem solved is
$$P:\quad \max_{(w,b)\neq 0} \frac{\frac{1}{n_-}\sum_{x_i\in C_-}(x_i^T w+b)^2}{\frac{1}{n_+}\sum_{x_i\in C_+}(x_i^T w+b)^2} = \max_{\tilde{w}\neq 0} \frac{\tilde{w}^T H\tilde{w}}{\tilde{w}^T G\tilde{w}} \qquad (12)$$
This problem can be equivalently written as the constrained optimization problem
$$\max_{(w,b)\neq 0} \frac{1}{n_-}\sum_{x_i\in C_-}(x_i^T w+b)^2 \quad \text{s.t.} \quad \frac{1}{n_+}\sum_{x_i\in C_+}(x_i^T w+b)^2 = 1.$$
The Lagrangian of this constrained optimization problem is
$$L = \frac{1}{n_-}\sum_{x_i\in C_-}(x_i^T w+b)^2 - \lambda\Big(\frac{1}{n_+}\sum_{x_i\in C_+}(x_i^T w+b)^2 - 1\Big).$$
Equating the derivative of the Lagrangian with respect to $b$ to zero, we get
$$\frac{2}{n_-}\sum_{x_i\in C_-}(x_i^T w+b) - \frac{2\lambda}{n_+}\sum_{x_i\in C_+}(x_i^T w+b) = 0$$
$$\Rightarrow\; w^T\Big(\frac{1}{n_-}\sum_{x_i\in C_-}x_i\Big) + b - \lambda w^T\Big(\frac{1}{n_+}\sum_{x_i\in C_+}x_i\Big) - \lambda b = 0$$
$$\Rightarrow\; w^T\mu + b - \lambda w^T\mu - \lambda b = 0 \;\Rightarrow\; b = \frac{w^T\mu - \lambda w^T\mu}{\lambda - 1} \;\Rightarrow\; b = -w^T\mu.$$
This means that the clustering hyperplane for class $C_+$ passes through the common mean. In the same way we can show that the clustering hyperplane for class $C_-$ also passes through the common mean.
When $\mu_+ = \mu_- = \mu$, Theorem 3 says that $b = -w^T\mu$ and the clustering hyperplanes pass through the common mean $\mu$. Substituting this value of $b$ in (12), we get the optimization problem $P$ for finding $w$ as
$$P = \max_{w\neq 0} \frac{\frac{1}{n_-}\sum_{x_i\in C_-}(x_i^T w - w^T\mu)^2}{\frac{1}{n_+}\sum_{x_i\in C_+}(x_i^T w - w^T\mu)^2} = \max_{w\neq 0} \frac{w^T\big(\frac{1}{n_-}\sum_{x_i\in C_-}(x_i-\mu)(x_i-\mu)^T\big)w}{w^T\big(\frac{1}{n_+}\sum_{x_i\in C_+}(x_i-\mu)(x_i-\mu)^T\big)w} = \max_{w\neq 0} \frac{w^T\Sigma_- w}{w^T\Sigma_+ w} \qquad (13)$$
Hence $w_1$, the normal to the clustering hyperplane for class $C_+$, is the eigenvector corresponding to the maximum eigenvalue of the generalized eigenvalue problem $\Sigma_- w = \lambda\Sigma_+ w$, and $b_1 = -w_1^T\mu$. In the same way, $[w_2\; b_2]^T$, the parameter vector of the clustering hyperplane for class $C_-$, is such that $w_2$ is the eigenvector corresponding to the minimum eigenvalue of the problem $\Sigma_- w = \lambda\Sigma_+ w$ and $b_2 = -w_2^T\mu$.

Since an eigenvector is determined only up to a scale factor, under our notation we take $\|w_1\| = \|w_2\| = 1$. Since the ratio $\frac{w^T\Sigma_- w}{w^T\Sigma_+ w}$ is invariant to scaling of the vector $w$, we can maximize the ratio by constraining the denominator to have any constant value $\beta$. Hence the maximization problem in (13) can be recast as
$$\max_{w}\; w^T\Sigma_- w \quad \text{s.t.}\quad w^T\Sigma_+ w = \beta \qquad (14)$$
where the value of $\beta$ is chosen so that it is consistent with our scaling of $w_1$. The maximizer of problem (14) is $w_1$; hence $w_1^T\Sigma_+ w_1 = \beta$. Let $\lambda_1$ be the maximum value achieved, so that $w_1^T\Sigma_- w_1 = \lambda_1$. As per our notation, $w_2$ is the minimizer of the ratio $\frac{w^T\Sigma_- w}{w^T\Sigma_+ w}$. Let the minimum value achieved be $\lambda_2$; then $w_2^T\Sigma_+ w_2 = \beta$ and $w_2^T\Sigma_- w_2 = \lambda_2$.

Now the parameters of the two angle bisectors can be written as $(w_3,b_3) = (w_1+w_2,\; b_1+b_2)$ and $(w_4,b_4) = (w_1-w_2,\; b_1-b_2)$. It is easily seen that both angle bisectors also pass through the common mean.
We now show that the pair of vectors $w_3$ and $w_4$ is the solution of the following optimization problem:
$$\max_{w_a,w_b}\; w_a^T\Sigma_- w_b \quad \text{s.t.}\quad w_a^T\Sigma_+ w_a = 2\beta = w_b^T\Sigma_+ w_b, \quad w_a^T\Sigma_+ w_b = 0 \qquad (15)$$
Consider the candidate solution of (15) given by $w_a = w_1+w_2$ and $w_b = w_1-w_2$. Since $w_1$ and $w_2$ are generalized eigenvectors of $\Sigma_- w = \lambda\Sigma_+ w$ corresponding to distinct eigenvalues, we have $w_1^T\Sigma_+ w_2 = 0$, and hence
$$(w_1+w_2)^T\Sigma_+(w_1+w_2) = w_1^T\Sigma_+ w_1 + w_2^T\Sigma_+ w_2 + 2w_1^T\Sigma_+ w_2 = 2\beta$$
$$(w_1-w_2)^T\Sigma_+(w_1-w_2) = w_1^T\Sigma_+ w_1 + w_2^T\Sigma_+ w_2 - 2w_1^T\Sigma_+ w_2 = 2\beta$$
$$(w_1+w_2)^T\Sigma_+(w_1-w_2) = w_1^T\Sigma_+ w_1 - w_2^T\Sigma_+ w_2 = 0$$
Hence the pair of vectors $(w_1+w_2)$ and $(w_1-w_2)$ satisfies all the constraints of the optimization problem (15) and is a feasible solution. Now we show that it is also the optimal solution. We can rewrite the objective function of problem (15) as
$$w_a^T\Sigma_- w_b = \frac{1}{4}\Big[(w_a+w_b)^T\Sigma_-(w_a+w_b) - (w_a-w_b)^T\Sigma_-(w_a-w_b)\Big].$$
This difference is maximized when the first term is maximized and the second term is minimized. The constraints $w_a^T\Sigma_+ w_a = 2\beta = w_b^T\Sigma_+ w_b$ and $w_a^T\Sigma_+ w_b = 0$ together imply $(w_a+w_b)^T\Sigma_+(w_a+w_b) = 4\beta$ and $(w_a-w_b)^T\Sigma_+(w_a-w_b) = 4\beta$. For any real symmetric matrices $\Sigma$ and $\Sigma'$, if we want to maximize $w^T\Sigma w$ subject to the constraint $w^T\Sigma' w = K$ (where $K$ is a constant), the solution is the eigenvector corresponding to the maximum eigenvalue of the generalized eigenvalue problem $\Sigma w = \lambda\Sigma' w$; similarly, to minimize $w^T\Sigma w$ subject to the same constraint, the solution is the eigenvector corresponding to the minimum eigenvalue. Hence the optimum is attained when $(w_a+w_b)$ is the eigenvector corresponding to the maximum eigenvalue and $(w_a-w_b)$ is the eigenvector corresponding to the minimum eigenvalue of the generalized eigenvalue problem $\Sigma_- w = \lambda\Sigma_+ w$. Thus $w_a = w_1+w_2$ and $w_b = w_1-w_2$ constitute the solution of the optimization problem given by (15).

Thus we have derived an optimization problem for which the pair of angle bisectors used by our algorithm, namely $w_3 = w_1+w_2$ and $w_4 = w_1-w_2$, is the optimal solution. Now we try to interpret this optimization problem to argue that it is a good optimization problem to solve
when we want to find the best hyperplane to split the data at a node while learning a decision tree.

Let $X$ be a random feature vector from class $C_+$ and let $Y$ be a random feature vector from class $C_-$. Define new random variables
$$\hat{X}_a = w_a^T X, \quad \hat{X}_b = w_b^T X, \quad \hat{Y}_a = w_a^T Y, \quad \hat{Y}_b = w_b^T Y.$$
Now assume that we have enough samples from both classes so that empirical averages are close to the corresponding expectations. Then we can rewrite the objective function of the optimization problem (15) as
$$w_a^T\Sigma_- w_b = \frac{1}{n_-}\sum_{x\in C_-} w_a^T(x-\mu)(x-\mu)^T w_b = \frac{1}{n_-}\sum_{x\in C_-}(w_a^T x - w_a^T\mu)(x^T w_b - \mu^T w_b) \simeq E\big[(\hat{Y}_a - E[\hat{Y}_a])(\hat{Y}_b - E[\hat{Y}_b])\big] = \mathrm{cov}(\hat{Y}_a,\hat{Y}_b).$$
Similarly, we can rewrite the constraints of that problem as
$$w_a^T\Sigma_+ w_a \simeq E\big[(\hat{X}_a - E[\hat{X}_a])^2\big] = \mathrm{var}(\hat{X}_a), \quad w_b^T\Sigma_+ w_b \simeq E\big[(\hat{X}_b - E[\hat{X}_b])^2\big] = \mathrm{var}(\hat{X}_b), \quad w_a^T\Sigma_+ w_b \simeq E\big[(\hat{X}_a - E[\hat{X}_a])(\hat{X}_b - E[\hat{X}_b])\big] = \mathrm{cov}(\hat{X}_a,\hat{X}_b).$$
Hence the fact that the angle bisectors are solutions of the optimization problem (15) can be rewritten as
$$(w_1+w_2),\,(w_1-w_2) = \operatorname*{argmax}_{w_a,w_b}\; \mathrm{cov}(\hat{Y}_a,\hat{Y}_b) \quad \text{s.t.}\quad \mathrm{var}(\hat{X}_a) = 2\beta,\; \mathrm{var}(\hat{X}_b) = 2\beta,\; \mathrm{cov}(\hat{X}_a,\hat{X}_b) = 0 \qquad (16)$$
This optimization problem seeks to find $w_a$ and $w_b$ (which would be our angle bisectors) such that the covariance between $\hat{Y}_a$ and $\hat{Y}_b$ is maximized while keeping $\hat{X}_a$ and $\hat{X}_b$ uncorrelated. (The constraints on the variances are needed only to ensure that the optimization problem has a bounded solution.) $\hat{Y}_a$ and $\hat{Y}_b$ are the projections of a class-$C_-$ feature vector onto $w_a$ and $w_b$ respectively, and $\hat{X}_a$ and $\hat{X}_b$ are the projections of a class-$C_+$ feature vector onto $w_a$ and $w_b$. So we are looking for two directions such that the patterns of one class become uncorrelated when projected onto them, while the correlation between the projections of the other class's feature vectors is maximized. Thus our angle bisectors give us directions that are good for discriminating between the two classes, and hence we feel that our choice of the angle bisectors as the split rule is a sound one while learning a decision tree.

Case 3: $\Sigma_+ \neq \Sigma_-$ and $\mu_+ \neq \mu_-$

We next consider the general case of different covariance matrices and different means for the two classes. Once again, the objective is to formulate an intuitively understandable optimization problem for which the angle bisectors (of the clustering hyperplanes) are the solution. Recall that the parameters of the two clustering hyperplanes are $\tilde{w}_1$ and $\tilde{w}_2$, the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem $H\tilde{w} = \lambda G\tilde{w}$. Then, using arguments similar to those in Case 2, one can show that the angle bisectors are the solution of the following optimization problem:
$$\max_{\tilde{w}_a,\tilde{w}_b}\; \tilde{w}_a^T H\tilde{w}_b \quad \text{s.t.}\quad \tilde{w}_a^T G\tilde{w}_a = 2\beta = \tilde{w}_b^T G\tilde{w}_b, \quad \tilde{w}_a^T G\tilde{w}_b = 0 \qquad (17)$$
where $G = \frac{1}{n_+}\sum_{x_i\in C_+}\tilde{x}_i\tilde{x}_i^T$ and $H = \frac{1}{n_-}\sum_{x_i\in C_-}\tilde{x}_i\tilde{x}_i^T$.
Again, let $X$ be a random feature vector from class $C_+$ and $Y$ a random feature vector from class $C_-$. We define new random variables
$$\bar{X}_a = w_a^T X + b_a, \quad \bar{X}_b = w_b^T X + b_b, \quad \bar{Y}_a = w_a^T Y + b_a, \quad \bar{Y}_b = w_b^T Y + b_b.$$
As earlier, we assume that there are enough points from both classes that the empirical averages can be replaced by actual expectations. Then, as in the earlier case, we can
rewrite the optimization problem given by (17) as
$$(\tilde{w}_1+\tilde{w}_2),\,(\tilde{w}_1-\tilde{w}_2) = \operatorname*{argmax}_{\tilde{w}_a,\tilde{w}_b}\; E\big[\bar{Y}_a\bar{Y}_b\big] \quad \text{s.t.}\quad E\big[\bar{X}_a^2\big] = 2\beta = E\big[\bar{X}_b^2\big], \quad E\big[\bar{X}_a\bar{X}_b\big] = 0 \qquad (18)$$
This is very similar to the optimization problem we derived in the previous case, with the only difference being that the covariances are replaced by cross-expectations (correlations). So, finding the parameters of the two angle bisector hyperplanes is the same as finding two vectors $\tilde{w}_a$ and $\tilde{w}_b$ in $\Re^{d+1}$ such that the cross-expectation of the projections of class-$C_-$ points on these vectors is maximized while the cross-expectation of the projections of class-$C_+$ points on these vectors is kept at zero. (Note that we are considering points in the augmented feature space; that is, if $x\in\Re^d$ is a feature vector then we consider $\tilde{x} = [x^T\; 1]^T \in \Re^{d+1}$.) Again, $E[\bar{X}_a^2]$ and $E[\bar{X}_b^2]$ are kept constant to ensure that the solutions of the optimization problem are bounded. Once again we feel that the above shows that the angle bisectors (being
solutions of the above optimization problem) are a good choice as the split rule at a node in the decision tree. IV. E XPERIMENTS In this section we present empirical results to show the effectiveness of our decision tree learning algorithm. We test the performance of our algorithm on several synthetic and real datasets. We compare our approach with OC1 [20] and CART-LC [6], which are among the standard oblique decision tree algorithms at present. We also compare our approach with SVM classifier which is among the best generic classifiers today. We compare our approach with GEPSVM [18] also on binary classification problems. The experimental comparisons are presented on 3 synthetic datasets and 10 ‘real’ datasets from UCI ML repository [1]. Dataset Description We generated three synthetic piecewise-linearly-separable datasets in different dimensions which are described below 1) Dataset 1: 2×2 Checkerboard Dataset 2000 points are sampled uniformly from [−1 1]× ˜ 1 = [1 1 0]T [−1 1]. Here our augmented feature vector would be 3-dimensional. Let w 4
[x
Note that we are considering points in augmented feature space. That is, if x ∈ ℜd is a feature vector then we are considering 1]T ∈ ℜd+1 .
November 5, 2010
DRAFT
26
˜ 2 = [1 − 1 0]T be parameters of two lines (hyperplanes). Now the points are labeled and w as +1 or -1 based on the following rule 1, ˜ 1T x ˜≥0&w ˜ 2T x ˜ ≥ 0) OR (w ˜ 1T x ˜≤0&w ˜ 2T x ˜ ≤ 0) if (w y= −1, else
From our dataset of 2000 sampled points, 979 points are assigned label +1 and 1021 points are labeled -1 using this rule. Now all the points are rotated by an angle of π/6 with respect to the first axis in anti-clockwise direction, to form the final training set. 2) Dataset 2: 4×4 Checkerboard Dataset 2000 points are sampled uniformly from [0 4] × [0
4]. This whole square is divided into 16 unit squares having unit length in both
dimensions. These squares are given indices from 1 to 4 on both axis. If a point falls in a unit square such that the sum of its two indices is even, then we assign label +1 to that point, otherwise we assign label -1 to it. From our dataset of 2000 sampled points, 997 points are assigned label +1 and 1003 points are labeled -1 using this rule. Now all the points are rotated by an angle of π/6 with respect to the first axis in anti-clockwise direction. 3) Dataset 3: 10-Dimensional Synthetic Dataset 2000 points are sampled uniformly from [−1 1]10 . Consider three hyperplanes in ℜ10 whose parameters are given by vectors ˜ 1 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]T , w ˜ 2 = [1, −1, 0, 0, 1, 0, 0, 1, 0, 0, 0]T and w ˜ 3 = [0, 1, −1, 0, −1, 0, 1, 1, −1, 1, 0]T . Now the points are labeled +1 or -1 based on the w following rule
$$
y = \begin{cases}
1, & \text{if } (\tilde{w}_1^T\tilde{x} \ge 0 \;\&\; \tilde{w}_2^T\tilde{x} \ge 0 \;\&\; \tilde{w}_3^T\tilde{x} \ge 0) \text{ OR } (\tilde{w}_1^T\tilde{x} \le 0 \;\&\; \tilde{w}_2^T\tilde{x} \le 0 \;\&\; \tilde{w}_3^T\tilde{x} \ge 0) \\
   & \text{OR } (\tilde{w}_1^T\tilde{x} \le 0 \;\&\; \tilde{w}_2^T\tilde{x} \ge 0 \;\&\; \tilde{w}_3^T\tilde{x} \le 0) \text{ OR } (\tilde{w}_1^T\tilde{x} \ge 0 \;\&\; \tilde{w}_2^T\tilde{x} \le 0 \;\&\; \tilde{w}_3^T\tilde{x} \le 0) \\
-1, & \text{otherwise.}
\end{cases}
$$
Out of the 2000 sampled points, 1020 points are assigned label +1 and 980 points are labeled −1 using this rule.
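The following is a minimal sketch (our own code, with an arbitrary random seed) of how the two checkerboard datasets can be generated according to the rules above; Dataset 3 follows the same sign-pattern idea with three hyperplanes in ten dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is our choice


def rotate(X, theta=np.pi / 6):
    """Rotate 2-D points anti-clockwise by theta."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return X @ R.T


def dataset1(n=2000):
    """2x2 checkerboard: label by the sign pattern w.r.t. two lines, then rotate."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    Xa = np.hstack([X, np.ones((n, 1))])              # augmented rows [x^T 1]
    w1 = np.array([1.0, 1.0, 0.0])
    w2 = np.array([1.0, -1.0, 0.0])
    p1, p2 = Xa @ w1, Xa @ w2
    y = np.where(((p1 >= 0) & (p2 >= 0)) | ((p1 <= 0) & (p2 <= 0)), 1, -1)
    return rotate(X), y


def dataset2(n=2000):
    """4x4 checkerboard: label by parity of the unit-square indices, then rotate."""
    X = rng.uniform(0.0, 4.0, size=(n, 2))
    idx = np.minimum(np.floor(X).astype(int), 3) + 1  # square indices 1..4 per axis
    y = np.where((idx.sum(axis=1) % 2) == 0, 1, -1)
    return rotate(X), y
```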
Apart from these three datasets, we also tested the Geometric decision tree on several benchmark datasets downloaded from the UCI ML repository [1]. The ten datasets that we used are described in Table I. The US Congressional Votes dataset available in the UCI ML repository has many observations with missing values for some features; for our experiments, we use only those observations with no missing feature values. We also do not use all the observations in the Magic dataset: it has a total of 19,020 samples across the two classes, but for our experiments we randomly choose a total of 6000 points, with 3000 from each class.
TABLE I: Details of real-world datasets used from the UCI ML repository

Data set             | Dimension | # Points | # Classes | Class Distribution
---------------------|-----------|----------|-----------|----------------------
Breast-Cancer        | 10        | 683      | 2         | 444, 239
Bupa Liver Disorder  | 6         | 345      | 2         | 145, 200
Pima Indian          | 8         | 768      | 2         | 268, 500
Magic                | 10        | 6000     | 2         | 3000, 3000
Heart                | 13        | 270      | 2         | 150, 120
Votes                | 16        | 232      | 2         | 108, 124
Wine                 | 13        | 178      | 3         | 59, 71, 48
Vehicle              | 18        | 846      | 4         | 199, 217, 218, 212
Balance Scale        | 4         | 625      | 3         | 49, 288, 288
Glass                | 10        | 214      | 6         | 70, 76, 17, 29, 13, 9
Experimental Setup

We implemented the Geometric decision tree in MATLAB. For OC1 and CART-LC we used the downloadable package available on the internet. To learn SVM classifiers we use the LIBSVM [7] code; libsvm-2.84 uses a one-versus-rest approach for multiclass classification. We implemented GEPSVM in MATLAB. The Geometric decision tree has only one user-defined parameter, ε1 (the threshold on the fraction of points of the minority class used to decide whether a node is a leaf node). For all our experiments we chose ε1 between 0.1 and 0.2, and this range appears quite robust across all the datasets. SVM has two user-defined parameters: the penalty parameter C and the width parameter σ of the Gaussian kernel. The best values for these parameters are found using 5-fold cross-validation, and the results reported are with these parameters (a sketch of such a search is given below).
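The following is a minimal illustration of this kind of 5-fold cross-validation search. It uses scikit-learn's SVC as a stand-in for LIBSVM (which the paper actually used), the data and grid values are placeholders of our own, and scikit-learn parameterises the Gaussian kernel by gamma, which corresponds to 1/(2σ²).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for one of the benchmark datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate values for C and the kernel width sigma are our own choices.
sigmas = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "gamma": (1.0 / (2.0 * sigmas ** 2)).tolist(),   # gamma = 1 / (2 * sigma^2)
}

# 5-fold cross-validation over the grid; the best parameters are then used
# for the reported runs.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```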
Both OC1 and CART use 90% of the total number of points for training and 10% for pruning. OC1 needs two more user-defined parameters: the number of restarts (R) and the number of random jumps (J). For our experiments we set R = 20 and J = 5, which are the default values suggested in the package. For the cases where we use GEPSVM with a Gaussian kernel, we found the best width parameter σ using 5-fold cross-validation.

Simulation Results

We now discuss the performance of the Geometric decision tree (DT) in comparison with the other approaches on the different datasets. The results provided are based on 10 repetitions of 10-fold cross-validation. We show average values and standard deviations (computed over the 10 repetitions) of accuracy, time taken, number of leaf nodes and depth of the tree.

Table II shows comparison results of the Geometric decision tree with the other decision tree approaches. In the table we show the average and standard deviation⁵ of accuracy, size and depth of the tree, and time taken, for each algorithm on each problem. We can intuitively take the confidence interval of the estimated accuracy of an algorithm to be one standard deviation on either side of the average. Then we can say that, on a problem, one algorithm has significantly better accuracy than another if the confidence interval of the first algorithm lies completely to the right of that of the second.

From Table II we see that, over all the problems, the Geometric decision tree performs significantly better in accuracy than all the other decision tree approaches, except on the Breast-Cancer, Magic, Votes and Balance Scale datasets. On these datasets, the performance of the Geometric decision tree algorithm is comparable with the best other decision tree method in the sense that the confidence intervals overlap. The average accuracy of the Geometric decision tree is better than the other decision tree approaches on the Magic, Votes and Balance Scale datasets; on the Breast-Cancer dataset, its average accuracy is a little smaller than the best average accuracy. Thus, overall, the performance of the Geometric decision tree algorithm is better than or comparable to any other decision tree approach in terms of accuracy. In the majority of the cases the Geometric decision tree generates trees with smaller depth and fewer leaves compared to the other decision tree approaches. This supports the idea that our algorithm better exploits the geometric structure of the dataset while generating decision trees.
⁵We do not show the standard deviation if it is less than 0.001.
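As a small illustration (our own helper, not code from the paper), the overlap criterion described above can be written as:

```python
def significantly_better(mean_a, std_a, mean_b, std_b):
    """Heuristic used above: A is significantly better than B when the interval
    mean_a +/- std_a lies entirely to the right of mean_b +/- std_b."""
    return (mean_a - std_a) > (mean_b + std_b)

# Example with the 10-dimensional dataset accuracies from Table II:
print(significantly_better(79.59, 0.63, 68.71, 1.66))   # Geometric DT vs OC1 -> True
```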
TABLE II: Comparison results between Geometric DT and other decision tree approaches

Dataset           | Method       | Accuracy    | Time(sec)   | # leaves     | Depth
------------------|--------------|-------------|-------------|--------------|------------
2×2 checkerboard  | Geometric DT | 99.55±0.11  | 0.049       | 4            | 2
                  | OC1          | 98.44±0.27  | 2.73±0.27   | 17.38±2.6    | 9.02±1.39
                  | CART         | 96.32±2.62  | 2.43±3.33   | 26.15±3.16   | 9.97±0.96
4×4 checkerboard  | Geometric DT | 94.18±0.53  | 0.098       | 17.14±1.18   | 4.79±0.47
                  | OC1          | 93.09±0.46  | 4.14±0.44   | 82.83±6.36   | 15.14±1.66
                  | CART         | 88.16±2.75  | 4.24±2.57   | 91.79±9.06   | 14.61±1.11
10 Dimensional    | Geometric DT | 79.59±0.63  | 0.11±0.002  | 33.3±1.94    | 10.24±0.56
                  | OC1          | 68.71±1.66  | 16.19±1.1   | 48.59±14.69  | 11.19±1.79
                  | CART         | 66.25±1.77  | 38.29±8.5   | 25.34±6      | 6.9±0.95
Breast Cancer     | Geometric DT | 94.46±0.57  | 0.01±0.001  | 2.71±0.5     | 1.46±0.3
                  | OC1          | 94.89±0.81  | 1.52±0.13   | 6.82±0.95    | 3.66±0.62
                  | CART         | 95.60±0.6   | 0.1±0.01    | 3.98±0.92    | 2.42±0.63
Bupa Liver        | Geometric DT | 69.10±1.99  | 0.016       | 13.03±1.22   | 6.75±0.42
                  | OC1          | 66.26±2.49  | 3.19±2.31   | 10.59±5.69   | 4.86±1.88
                  | CART         | 63.48±3.5   | 0.09±0.005  | 12.31±6.83   | 5.54±2.11
Pima Indian       | Geometric DT | 76.83±0.58  | 0.01        | 2.41±0.59    | 1.24±0.32
                  | OC1          | 70.42±2.18  | 3.88±1.74   | 13.83±4.94   | 5.51±1.36
                  | CART         | 73.46±1.16  | 0.29±0.01   | 15.71±6.62   | 7.69±1.48
Magic             | Geometric DT | 80.57±0.17  | 0.23        | 4            | 3
                  | OC1          | 79.57±0.94  | 63.61±8.46  | 82.79±23.36  | 14.21±2.63
                  | CART         | -           | -           | -            | -
Heart             | Geometric DT | 83.11±0.86  | 0.004       | 2.22±0.35    | 1.18±0.24
                  | OC1          | 74.96±2.72  | 0.007       | 8.25±2.77    | 3.9±0.91
                  | CART         | 75.96±2.06  | 0.09±0.004  | 5.51±1.94    | 2.98±0.71
Votes             | Geometric DT | 96.51±0.66  | 0.003       | 2            | 1
                  | OC1          | 95.04±1.08  | 0.17±0.01   | 2.27±0.18    | 1.25±0.15
                  | CART         | 95.99±0.73  | 0.023±0.003 | 2.06±0.08    | 1.04±0.05
Wine              | Geometric DT | 97.15±0.72  | 0.004       | 4.01±0.03    | 2.01±0.03
                  | OC1          | 91.96±1.54  | 0.32±0.017  | 4.38±0.8     | 2.53±0.37
                  | CART         | 91.29±2.8   | 0.04±0.002  | 4.21±1       | 2.46±0.38
Vehicle           | Geometric DT | 77.16±0.72  | 0.06±0.002  | 34.25±1.2    | 9.39±0.24
                  | OC1          | 68.64±1.46  | 8.32±0.17   | 36.64±12.8   | 10.77±2.56
                  | CART         | 69.88±1.4   | 0.82±0.02   | 39.07±13.38  | 11.67±2.2
Balance Scale     | Geometric DT | 91.50±0.66  | 0.012       | 9.41±1.59    | 6.19±0.66
                  | OC1          | 91.09±0.69  | 1.53±0.1    | 10.41±2.06   | 5.5±0.76
                  | CART         | 85.52±1.08  | 0.12        | 8.26±3.76    | 4.38±1.32
Glass             | Geometric DT | 70.01±2.89  | 0.02        | 23.68±1.77   | 7.66±2.54
                  | OC1          | 63.74±1.83  | 7.42±0.21   | 11.65±3.05   | 4.95±0.77
                  | CART         | 68.27±3.49  | 0.83±0.03   | 12.33±2.46   | 5.32±0.59
Fig. 2. Comparisons of Geometric decision tree with OC1 (Oblique) on 4×4 checkerboard data: (a) hyperplanes learnt at the root node and its left child using the Geometric decision tree on the rotated 4×4 checkerboard data; (b) hyperplanes learnt at the root node and its left child using the OC1 (Oblique) decision tree on the rotated 4×4 checkerboard data.
Normally, when we learn smaller trees, the generalization error of the decision tree is likely to be small. Time-wise, the Geometric decision tree algorithm is much faster than the other decision tree approaches in all cases, as can be seen from the results in the table; in every case, the time taken by the Geometric decision tree is smaller by at least a factor of ten. We feel that this is because the problem of obtaining the best split rule at each node is solved using an efficient linear algebra algorithm in the case of the Geometric decision tree, while the other approaches have to resort to search techniques because optimizing impurity-based measures is hard.

We next consider comparisons of the Geometric decision tree algorithm with SVM. Table III shows comparison results of the Geometric decision tree with SVM and GEPSVM. GEPSVM with a linear kernel performs the same as the Geometric decision tree on the 2×2 checkerboard problem because, for this problem, the two approaches work in a similar way. But when more than two hyperplanes are required, GEPSVM with a Gaussian kernel performs worse than our decision tree approach. For example, on the 4×4 checkerboard dataset, GEPSVM achieves only about 88.39% average accuracy while our decision tree gives about 94.18% average accuracy.
TABLE III: Comparison results of Geometric decision tree with SVM and GEPSVM

Dataset           | Method       | Accuracy    | Time(sec)    | Kernel
------------------|--------------|-------------|--------------|---------
2×2 checkerboard  | Geometric DT | 99.55±0.11  | 0.049        | -
                  | SVM          | 99.45±0.09  | 0.08±0.001   | Gaussian
                  | GEPSVM       | 99.55±0.11  | 0.0002       | Linear
4×4 checkerboard  | Geometric DT | 94.18±0.53  | 0.098        | -
                  | SVM          | 96.49±0.37  | 0.11±0.001   | Gaussian
                  | GEPSVM       | 88.39±0.54  | 337.72±3.09  | Gaussian
10 Dimensional    | Geometric DT | 79.59±0.63  | 0.11±0.002   | -
                  | SVM          | 75.82±0.52  | 0.88±0.006   | Gaussian
                  | GEPSVM       | 65.23±0.65  | 280.97±1.11  | Gaussian
Breast Cancer     | Geometric DT | 94.46±0.57  | 0.01±0.001   | -
                  | SVM          | 97.1±0.12   | 0.009        | Gaussian
                  | GEPSVM       | 93.34±0.37  | 0.0003       | Linear
Bupa Liver        | Geometric DT | 69.10±1.99  | 0.016        | -
                  | SVM          | 69.65±1.15  | 0.014        | Gaussian
                  | GEPSVM       | 59.49±0.29  | 3.29±0.05    | Gaussian
Pima Indian       | Geometric DT | 76.83±0.58  | 0.01         | -
                  | SVM          | 76.95±0.29  | 14.08±2.41   | Gaussian
                  | GEPSVM       | 75.18±0.73  | 0.0003       | Linear
Magic             | Geometric DT | 80.57±0.17  | 0.23         | -
                  | SVM          | 80.66±0.12  | 3.11±0.05    | Gaussian
                  | GEPSVM       | 74.48±0.16  | 0.002±0.002  | Linear
Heart             | Geometric DT | 83.07±0.86  | 0.004        | -
                  | SVM          | 83.33±0.33  | 0.76±0.07    | Linear
                  | GEPSVM       | 81.89±0.89  | 0.0004       | Linear
Votes             | Geometric DT | 96.51±0.66  | 0.003        | -
                  | SVM          | 96.94±0.14  | 0.002        | Linear
                  | GEPSVM       | 96.54±0.92  | 0.0005       | Linear
Wine              | Geometric DT | 97.15±0.72  | 0.004        | -
                  | SVM          | 95.9±0.53   | 0.68±0.12    | Linear
                  | GEPSVM       | -           | -            | -
Vehicle           | Geometric DT | 77.16±0.72  | 0.06±0.002   | -
                  | SVM          | 79.67±0.8   | 0.08±0.001   | Gaussian
                  | GEPSVM       | -           | -            | -
Balance Scale     | Geometric DT | 91.50±0.66  | 0.012        | -
                  | SVM          | 91.68       | 0.014        | Linear
                  | GEPSVM       | -           | -            | -
Glass             | Geometric DT | 70.01±2.89  | 0.02         | -
                  | SVM          | 71.49±0.49  | 0.075        | Gaussian
                  | GEPSVM       | -           | -            | -
Moreover, with a Gaussian kernel, GEPSVM solves a generalized eigenvalue problem whose size is the number of points, whereas our decision tree solves a generalized eigenvalue problem whose size is the dimension of the data at each node (which is the case with GEPSVM only when it uses a linear kernel). This gives us an extra advantage in computational cost over GEPSVM. On all binary classification problems the Geometric decision tree outperforms GEPSVM.

The performance of the Geometric decision tree is comparable to that of SVM in terms of accuracy. The Geometric decision tree performs significantly better than SVM on the 10-dimensional synthetic dataset and the Wine dataset. It performs comparably to SVM on the 2×2 checkerboard, Bupa Liver, Pima Indian, Magic, Heart, Votes, Balance Scale and Glass datasets, and worse than SVM on the 4×4 checkerboard, Breast Cancer and Vehicle datasets. In terms of the time taken to learn the classifier, the Geometric decision tree is faster than SVM in all cases. At every node of the tree we solve a generalized eigenvalue problem, which takes time of the order of (d + 1)^3, where d is the dimension of the feature space. On the other hand, SVM solves a quadratic program whose time complexity is O(n^k), where k is between 2 and 3 and n is the number of points. So, in general, when the number of points is large compared to the dimension of the feature space, the Geometric decision tree learns the classifier faster than SVM. Thus, overall, the Geometric decision tree performs better than SVM.
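As a rough back-of-the-envelope illustration of this timing argument (our own arithmetic, using the Magic subset reported above with n = 6000 and d = 10, and counting only the eigen-decomposition at a node):

$$
(d+1)^3 = 11^3 \approx 1.3 \times 10^{3}, \qquad\text{versus}\qquad n^k \in [6000^2,\; 6000^3] \approx [3.6 \times 10^{7},\; 2.2 \times 10^{11}].
$$

Even after adding the roughly n(d+1)^2 ≈ 7.3 × 10^5 multiply-adds needed to form the matrices of the eigenvalue problem from the data at a node, and multiplying by the handful of internal nodes reported in Table II for this dataset, the tree construction remains orders of magnitude cheaper than a single SVM training run.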
Finally, in Figure 2 we show the effectiveness of our algorithm in capturing the geometric structure of the classification problem. We show the first two hyperplanes learnt by our approach and by OC1 on the 4×4 checkerboard data. We see that our approach learns the correct geometric structure of the classification boundary, whereas OC1, which uses the Gini index as its impurity measure, does not capture it.

V. CONCLUSIONS

In this paper we presented a new algorithm for learning oblique decision trees. The novelty is in learning hyperplanes (at each node in the top-down induction of a decision tree) that capture the geometric structure of the class regions. At each node we find the two clustering hyperplanes and choose one of their angle bisectors as the split rule. We presented some analysis to derive the optimization problem for which the angle bisectors are the solution, and based on this we argued that our method of choosing a hyperplane at each node is sound. Through extensive empirical studies we showed that the method performs better than the other decision tree approaches in terms of accuracy, size of the tree and time. We also showed that the classifier obtained with the Geometric decision tree is as good as that obtained with SVM, while the Geometric decision tree is learned faster than the SVM. Thus, overall, the algorithm presented here is a good and novel classification method.

REFERENCES

[1] D. J. Newman and A. Asuncion. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[2] M. Fatih Amasyali and Okan Ersoy. Cline: A new decision-tree family. IEEE Transactions on Neural Networks, 19(2):356–363, February 2008.
[3] K. P. Bennett and J. A. Blue. A support vector machine approach to decision trees. In Proceedings of the IEEE World Congress on Computational Intelligence, volume 3, pages 2396–2401, Alaska, USA, May 1998.
[4] Kristin P. Bennett. Global tree optimization: A non-greedy decision tree algorithm. In Computing Science and Statistics, pages 156–160, 1994.
[5] Kristin P. Bennett and Jennifer A. Blue. Optimal decision trees. Technical Report R.P.I. Math Report No. 214, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, 1996.
[6] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Statistics/Probability Series. Wadsworth and Brooks, Belmont, California, U.S.A., 1984.
[7] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[8] Li-Fen Chen, Hong Yuan Mark Liao, Ming-Tat Ko, Ja-Chen Lin, and Gwo Jong Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33:1713–1726, 2000.
[9] R. O. Duda and H. Fossum. Pattern classification by iteratively determined linear and piecewise linear discriminant functions. IEEE Transactions on Electronic Computers, EC-15(2):220–232, April 1966.
[10] Cèsar Ferri, Peter Flach, and José Hernández-Orallo. Learning decision trees using the area under the ROC curve. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 139–146, San Francisco, CA, USA, July 2002.
[11] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, third edition, 1996.
[12] Michael I. Jordan. A statistical approach to decision tree modeling. In Proceedings of the Seventh Annual Conference on Computational Learning Theory (COLT), pages 13–20, New Brunswick, New Jersey, USA, July 1994.
[13] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[14] T. Lee and J. A. Richards. Piecewise linear classification using seniority logic committee methods, with application to remote sensing. Pattern Recognition, 17(4):453–464, 1984.
[15] Yuanhong Li, Ming Dong, and Ravi Kothari. Classifiability-based omnivariate decision trees. IEEE Transactions on Neural Networks, 16(6):1547–1560, November 2005.
[16] Mingzhu Lu, C. L. Philip Chen, Jianbing Huo, and Xizhao Wang. Multi-stage decision tree based on inter-class and inner-class margin of SVM. In Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, USA, 2009.
[17] Olvi L. Mangasarian. Multisurface method of pattern separation. IEEE Transactions on Information Theory, 14(6):801–807, November 1968.
[18] Olvi L. Mangasarian and Edward W. Wild. Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):69–74, January 2006.
[19] Naresh Manwani and P. S. Sastry. A geometric algorithm for learning oblique decision trees. In Proceedings of the Third International Conference on Pattern Recognition and Machine Intelligence (PReMI-09), pages 25–31, New Delhi, India, December 2009.
[20] Sreerama K. Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1–32, 1994.
[21] Carla E. Brodley and Paul E. Utgoff. Linear machine decision trees. Technical Report 91-10, Department of Computer Science, University of Massachusetts, Amherst, Massachusetts 01003, USA, January 1991.
[22] Lior Rokach and Oded Maimon. Top-down induction of decision trees classifiers - a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 35(4):476–487, November 2005.
[23] P. S. Sastry, M. Magesh, and K. P. Unnikrishnan. Two timescale analysis of the Alopex algorithm for optimization. Neural Computation, 14(11):2729–2750, 2002.
[24] Shesha Shah and P. S. Sastry. New algorithms for learning and pruning oblique decision trees. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 29(4):494–505, November 1999.
[25] Alberto Suarez and James F. Lutsko. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, September 1999.
[26] Robert Tibshirani and Trevor Hastie. Margin trees for high dimensional classification. Journal of Machine Learning Research, 8:637–652, March 2007.
[27] K. P. Unnikrishnan and K. P. Venugopal. Alopex: A correlation-based learning algorithm for feedforward and recurrent neural networks. Neural Computation, 6(3):469–490, 1994.