Upper Bounds for Error Rates of Linear Combinations of Classifiers

Alejandro Murua, Member, IEEE

Abstract: A useful notion of weak dependence between many classifiers constructed with the same training data is introduced. It is shown that if this weak dependence is low and the expected margins are large, then decision rules based on linear combinations of these classifiers can achieve error rates that decrease exponentially fast. Empirical results with randomized trees and trees constructed via boosting and bagging show that weak dependence is present in these types of trees. Furthermore, these results also suggest that there is a trade-off between weak dependence and expected margins, in the sense that to compensate for low expected margins, there should be low mutual dependence between the classifiers involved in the linear combination.

Index Terms: Exponential bounds, weakly dependent classifiers, classification trees, machine learning.
1 INTRODUCTION

Recently, there has been a fair amount of interest in combining several classification trees so as to obtain better decision rules [1], [2], [3], [4], [5], [6], [7], [8]. The idea is similar to using several observations of a random variable to estimate the underlying mean: the sample average of all the observations is a much better estimator of the mean than any single observation, since the combination decreases the variance of the estimator and, hence, is "closer" to the true mean. With nonparametric estimators, such as classification trees, one also has to consider the bias of the estimators [9], [10], [6]. Hence, one should combine the observations so as to reduce not only the variance, but both the variance and the squared bias simultaneously (their sum is the mean squared error).

Several ways of combining classification trees have been proposed in the literature: linear combination of randomized trees [1], [11], bagging [5], adaptive bagging [7], and boosting [3]. These methods are summarized below. The aim of all these methods is to reduce the misclassification rate by cleverly mixing the classifiers. The basic procedure consists of constructing several classifiers, say $\{T_\ell\}_{\ell=1}^m$, and later combining them with weights $\{\alpha_\ell\}_{\ell=1}^m \subset \mathbb{R}$, so as to form a unique classifier $T = \sum_{\ell=1}^m \alpha_\ell T_\ell$, from which a powerful decision rule can be derived. For example, one could form the average tree $T = \sum_{\ell=1}^m T_\ell/m$ (i.e., $\alpha_\ell = m^{-1}$ for all $\ell$). The classification rule associated to $T$ consists of assigning a new data point to the class that maximizes $T$ [1].

Bagging and adaptive bagging were conceived with the explicit aim of reducing the mean squared error of the classifier, whereas boosting was conceived with the explicit aim of reducing the training misclassification error, hoping to reduce in this way the true error rate with high probability (see Section 3). In fact, Schapire et al. [12] give an upper bound on the true error rate for the two-class problem (i.e., when the
data is divided in two classes) in terms of the margin, which is a measure of confidence in the class assignments given by the classifier at hand (see Section 3 for more details). Here, we present an improved bound based on recent advances in the theory of empirical processes [13], in which the margin does not appear (see Section 3). It is our belief, however, that these bounds, based on PAC-learning theory [14], are, in general, too coarse. We think that bounds specifically suited to linear combinations of classifiers can give more insight into this problem (see Section 4). In fact, we show that large expected (i.e., average) margins do play an important role in achieving low error rates for these types of classifiers. However, large margins alone are not sufficient to assure low error rates. In fact, Corollary 1 in Section 4 clearly states that, in order to achieve low error rates, one has to combine classifiers whose mutual dependence is low as well.

The biggest challenge that all four of the above techniques must deal with is the fact that all of the classifiers must be trained with the same data. In order to avoid building the same classifier over and over again, some randomness must be included in the construction process. The four techniques mentioned above differ in the way they incorporate this randomness. For example, randomized trees are built by randomizing the set of admissible queries at each splitting node of a tree; bagging randomizes the trees by sampling at random from the training data; and adaptive bagging and boosting do something similar to bagging, but the construction process has to be sequential since the performance of previously constructed trees influences the sampling scheme for new trees. We conjecture that the "artificial" randomness built in during the construction of the classifiers produces what we call weakly dependent classifiers (see Sections 4 and 5).

In this paper, we introduce this new notion of mutual weak dependence, motivated by similar concepts developed in the time-series literature. It is shown that when both this dependence is very low and the expected margins (see (12)) are large, exponential upper bounds on the true error rate can be achieved by linearly combining all the classifiers. Our results (see Sections 4 and 5) suggest that there is a trade-off between weak dependence and expected margins, in the sense that to compensate for low expected margins, there
should be low mutual dependence between the classifiers involved in the linear combination. We note that, even though our exposition focuses on classifiers based on classification (or regression) trees, all our results are valid for all sorts of classifiers.

The paper is organized as follows: Section 2 gives a brief summary of the most popular and recent methods to construct and combine classification and/or regression trees. Section 3 presents an upper bound for the true error rate based on PAC-learning theory. In Section 4, we introduce the notion of weak dependence between classifiers and give an exponential upper bound on the true error rate. Sections 5 and 6 present some empirical results and conclusions, respectively. The Appendix contains the proofs of the main results of this paper.
2 COMBINING SEVERAL CLASSIFIERS
Let $\mathcal{C} = \{1, 2, \ldots, J\}$ be the class space and $(Y, X)$ be a random vector defined on a probability space $(\mathcal{C} \times \Omega, \mathcal{F}, P)$, with $Y$ taking values in $\mathcal{C}$, and $X$ in $\Omega$. Throughout this paper, a classifier will be a random vector $T: \Omega \to \mathbb{R}^J$; a decision rule will be a random variable $d(X) = d(T(X))$ taking values in the class space $\mathcal{C}$. For the particular case of classification trees [15], we will look at
$$T(X) = \sum_{k=1}^{r} (p_{1k}, p_{2k}, \ldots, p_{Jk})\, \mathbf{1}_{R_k}(X),$$
where $\{R_1, \ldots, R_r\}$ is a partition of the space $\Omega$, $\mathbf{1}_{A}(\cdot)$ is the indicator function associated to the set $A \subset \Omega$, and $(p_{1k}, p_{2k}, \ldots, p_{Jk})$ is the conditional probability distribution over the $J$ classes, given $X \in R_k$. Note that $T(X)$ also is the conditional probability distribution over the $J$ classes given $X$. For simplicity, we will denote $p_{jk}$ by $T(X, j)$, $j = 1, 2, \ldots, J$, making explicit the dependence on $X$. Note that although the explicit dependence on $R_k$ is now omitted, it is still present through $X$ since $X$ lies in $R_k$. A common decision rule for classification trees is $d(X) = \arg\max_{j=1,\ldots,J} T(X, j)$.

Let $D = \{(Y_i, X_i) \in \mathcal{C} \times \Omega,\ i = 1, \ldots, n\}$ be the training data. The following are four common techniques for constructing the classifiers $\{T_\ell\}_{\ell=1}^m$ and choosing their weights $\{\alpha_\ell\}_{\ell=1}^m$ (see Section 1).

1. Randomized Trees. Let $Q$ be the set of admissible splits at the nodes. The randomization of a tree is done at each node by choosing the best split from a small random subset of splits in $Q$. The $m$ trees arising from this process are combined by averaging the conditional distributions $T_\ell(X)$, i.e., $T(X) = m^{-1}\sum_{\ell=1}^m T_\ell(X)$.

2. Bagging. Let $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$ be the empirical probability distribution of the training data $D$ (this distribution assigns probability $n^{-1}$ to the point $X_i$, for all $i = 1, \ldots, n$). The trees are constructed by repeatedly sampling with replacement from $D$ a sample of size $n$ according to $\mathbb{P}_n$; each tree $T_\ell$ is based on one of these samples, $\ell = 1, \ldots, m$. As with randomized trees, the $m$ trees arising from this process are combined by averaging the conditional distributions $T_\ell(X)$, i.e., $T(X) = m^{-1}\sum_{\ell=1}^m T_\ell(X)$.
3. Adaptive Bagging. This is a sequential and iterative algorithm. Initially developed in the context of regression, adaptive bagging sees the training data class assignments $\{Y_i\}_{i=1}^n$ as the initial response vector in a regression and the data points $\{X_i\}_{i=1}^n$ as the predictor vectors, i.e., $Y_i = f(X_i) + \text{error}$, where $f$ is the unknown (to be estimated) regression function. It consists of iterating between the following two steps:

Step a. Given the current response vector, construct a fixed number $L$ of trees via bagging and keep their average $T(X) = L^{-1}\sum_{\ell=1}^L T_\ell(X)$.

Step b. Update the response vector by subtracting from it an estimate of the prediction based on the average tree $T(X)$, and repeat Step a with the modified response vector (we note the similarity of Step b with a similar step in projection pursuit regression [16]).

The $m$ trees are the $m$ average trees kept in each of the $m$ iterations of these two steps. The final "prediction" tree is the sum of these $m$ trees. For specific details, the reader is referred to [7].

4. Boosting. Let $\mathbb{P}_n$ be as above. The first tree is constructed as in bagging, while the second tree is constructed by replacing $\mathbb{P}_n$ with another empirical distribution derived from the empirical misclassification rate of $T_1$. More specifically, let $\epsilon_1 = \mathbb{P}_n(d(T_1(X)) \neq Y)$ be the training misclassification error of $T_1$. Replace $\mathbb{P}_n$ by a new empirical probability distribution $\mathbb{P}_n^{(2)}$ that reweights the misclassified data points by a factor of $(1 - \epsilon_1)/\epsilon_1$. Then, $T_2$ is constructed using a random sample of size $n$ drawn from $D$ according to $\mathbb{P}_n^{(2)}$. This step is the second iteration of boosting. This process is repeated $m$ times to give rise to $m$ classifiers (one classifier per iteration). The classifiers are combined using the weighted average $\sum_{\ell=1}^m \log\{(1-\epsilon_\ell)/\epsilon_\ell\}\, T_\ell(X)$, where the $\epsilon_\ell$ are the corresponding training misclassification rates associated with the empirical distributions $\mathbb{P}_n^{(\ell)}$ obtained during the $\ell$th iteration of boosting ($\mathbb{P}_n^{(\ell)}$ are defined in a similar fashion as $\mathbb{P}_n^{(2)}$). It can be shown that the boosting weights arise from minimizing an exponential upper bound on the empirical misclassification rate [12], [17].
3 PAC-LEARNING

In this section, we focus on the two-class problem and present a new upper bound for the error rate of any decision rule $d(\cdot)$. In particular, our bound improves upon that in [12] and suggests that the margins (see below) do not play any role in sharpening this type of bound. In contrast to the statistics literature, which is concerned with minimizing the mean squared error of an estimator [9], [10], [6], the machine learning literature has focused on minimizing the empirical misclassification rate [12]. This latter goal arises naturally from PAC-learning theory (see [14], [18] for an excellent introduction to PAC-learning). In this theory, one would like to be very confident that a particular estimate of the true error rate is very close to the true error rate (the true error rate is sometimes referred to as the generalization error rate). This statement can be translated
into probabilities as follows: for any given, hopefully small, $\epsilon > 0$ and $\delta > 0$, one would like to have
$$P(d(T(X)) \neq Y) \leq \mathbb{P}_n(d(T(X)) \neq Y) + \epsilon \qquad (1)$$
with probability at least $1 - \delta$ (recall that $\mathbb{P}_n$ stands for the empirical probability function and, hence, (1) is a probabilistic statement concerning the training sample $D$). Note that the right-hand side of (1) can be computed with training data. PAC-learning is mostly concerned with the number of training data points $n$ needed to achieve (1) for given $\epsilon$ and $\delta$.

The margins. Given a data point $X$, the margin of $X$ [19], [12], denoted $\mathrm{margin}(X)$, is the quantity
$$\mathrm{margin}(X) = T(X, Y) - \max\{T(X, j) : j \neq Y\}. \qquad (2)$$
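As a simple numerical illustration of (2) (the numbers are ours): with $J = 3$ classes, true class $Y = 2$, and a classifier output $T(X) = (0.2, 0.5, 0.3)$, the margin is $\mathrm{margin}(X) = T(X,2) - \max\{T(X,1), T(X,3)\} = 0.5 - 0.3 = 0.2 > 0$, so $X$ is correctly classified by $d(X) = \arg\max_j T(X,j)$; if instead the true class were $Y = 1$, then $\mathrm{margin}(X) = 0.2 - 0.5 = -0.3 < 0$, a misclassification.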
The margins give us an idea of how much confidence can be put in the classification induced by the classifier $T$. Note that a particular data point $X$ is misclassified by $T$ if and only if $\mathrm{margin}(X) < 0$. In particular, for the two-class problem, i.e., when the classes are $Y = -1$ or $Y = 1$, an easy calculation shows that a data point $X$ is misclassified if and only if $\mathrm{margin}(X) = Y(1 - 2T(X, -1)) < 0$. In what remains of this section, we will focus on the two-class problem.

Bearing PAC-learning in mind, boosting arises from the minimization of the empirical expectation of $\exp\{-Y(1 - 2T(X, -1))\}$, which is an upper bound of the empirical error rate $\mathbb{P}_n(d(T(X)) \neq Y)$ [12] and, hence, through (1), a high-probability upper bound for the true error rate. In an effort to explain why boosting works, Schapire et al. [12] derived the following error bound for the two-class problem, which holds for every $\theta > 0$ with probability at least $1 - \delta$, $0 < \delta < 1$:
$$P(d(T(X)) \neq Y) = P\big(Y(1 - 2T(X,-1)) \leq 0\big) \leq \mathbb{P}_n\big(Y(1 - 2T(X,-1)) \leq \theta\big) + O\!\left(\frac{1}{\sqrt{n}}\left(\frac{\nu \log^2(n/\nu)}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right), \qquad (3)$$
where $\nu$, the Vapnik-Cervonenkis (VC) dimension of the family $\mathcal{T}$ of classifiers, is a measure of the "size" or "complexity" of $\mathcal{T}$. For example, if $\mathcal{T}$ is finite, then $\nu = \log|\mathcal{T}|$, and if $\mathcal{T}$ is the family of classification trees, then $\nu = (\text{number of internal nodes of the tree}) \times (\text{dimension of } X)$. Note that $\nu$ "penalizes" families $\mathcal{T}$ that are too complex. The intuitive idea behind this is that a very complex family of classifiers has too much freedom to accommodate most of the training data. Hence, overtraining is likely to occur (in statistical terms, this is equivalent to constructing classifiers that minimize the bias while ignoring prediction errors). For an exact definition of the VC dimension see, for example, [20, p. 196]; for more details, see [20, chapter 13].

Note that if one maximizes the margin, then, for fixed and small $\theta$, the empirical probability given by the first term in the right-hand side of (3) might reach zero, leading to an upper bound of order
$$\sqrt{\left\{\nu \log^2(n/\nu)/\theta^2 + \log(1/\delta)\right\}/n}. \qquad (4)$$
However, bounds of a smaller order in $n$ can be derived independently of $\theta$. In fact, we have the following result:

Theorem 1. Let $\tilde{\mathcal{T}}$ be the family of classifiers formed by linearly combining finitely many classifiers from $\mathcal{T}$. Assume that the family $\tilde{\mathcal{T}}$ is a VC-class, i.e., $\tilde{\mathcal{T}}$ has finite VC-dimension $\tilde{\nu}$. Then, there exists $\delta_0 > 0$ such that for every $0 < \delta \leq \min\{\delta_0, 1\}$,
$$P(d(T(X)) \neq Y) \leq \mathbb{P}_n(d(T(X)) \neq Y) + \epsilon(\delta, n),$$
with probability at least $1 - \delta$, where
$$\epsilon(\delta, n) = \frac{2\tilde{\nu} + 3}{4C_0\sqrt{n}}\left[1 + \left(1 + \frac{8C_0^2}{(2\tilde{\nu}+3)^2}\left\{(\tilde{\nu}+1)\log\frac{KC_0^2}{(\tilde{\nu}+1)e^2} + \log\frac{KC_0}{e\,\delta}\right\}\right)^{1/2}\right] \qquad (5)$$
and
$$\delta_0 = K C_0 \left(\frac{KC_0^2}{\tilde{\nu} + 1}\right)^{\tilde{\nu}+1} e^{-2C_0^2}. \qquad (6)$$
$K$ and $C_0$ are universal constants, i.e., they only depend on the size of the family $\tilde{\mathcal{T}}$ and are independent of $n$ and $\delta$. In particular, $\epsilon(\delta, n) \geq C_0/\sqrt{n}$ for all $0 < \delta \leq \min\{\delta_0, 1\}$.

It is easy to see that Theorem 1 implies a bound similar to (3) for all $0 < \delta \leq 1$ ($\nu$ in (3) must be replaced by $\tilde{\nu}$) since, for the two-class problem,
$$\mathbb{P}_n(d(T(X)) \neq Y) = \mathbb{P}_n\big(Y(1 - 2T(X,-1)) < 0\big) \leq \mathbb{P}_n\big(Y(1 - 2T(X,-1)) \leq \theta\big).$$
Also, notice that $\log(n)$ does not appear in this latter bound. In fact, ignoring the constants, (5) gives an upper bound of order $n^{-1/2}\sqrt{\tilde{\nu} + \log(1/\delta)}$ for the misclassification rate. This yields a better bound in terms of $n$.

Theorem 1 is a direct consequence of Talagrand's inequalities [13]; its proof is given in the Appendix. The constant $\delta_0$ in Theorem 1 ensures that the condition $C_0/\sqrt{n} \leq \epsilon(\delta, n)$ in Talagrand's theorem holds. Unfortunately, explicit values for the constants $K$ and $C_0$ are, as yet, unknown (Talagrand's inequalities (see (17) below) are stated without explicit constants). For practical purposes, one can still use Alexander's inequality [21], which gives (1) with
$$\epsilon(\delta, n) = O\left(n^{-1/2}\left\{\log(1/\delta) + 1{,}024\,\tilde{\nu}\,\log(2{,}048\,\tilde{\nu}/e)\right\}^{1/2}\right),$$
for $n\epsilon^2 \geq 64$. This bound seems too large when compared to (5), due to the large constants 1,024 and 2,048 (in fact, one hopes that Talagrand's constants $K$ and $C_0$ are not as large as Alexander's).

It is important to note that even though Theorem 1 yields sharper bounds in terms of the sample size $n$, the VC dimension $\tilde{\nu}$ of $\tilde{\mathcal{T}}$ is usually much larger than $\nu$ (see [22]); therefore, when the sample size $n$ is not too large, (3) may be more practical than (5). In this sense, the introduction of the margin in (3) appears to be a technical artifact that counteracts the usually enormous increase in complexity (i.e., the VC dimension) due to the linear mixing of classifiers. We will
see later, in Section 4, that the expected margin (see (12)) does seem to play an important and "natural" role in reducing the classification error rates of linear combinations of classifiers; there, we derive sharper and more informative error bounds than those based on PAC-learning. In fact, we believe that bounds such as the one given in Theorem 1 are too coarse to be of practical value in applications where either the sample size $n$ is not large enough or the VC-dimension of $\tilde{\mathcal{T}}$ is big.
4 WEAKLY DEPENDENT CLASSIFIERS
In this section, we present an exponential bound for the error rate of decision rules arising from the linear combination of several classifiers, such as those reviewed in Section 2. The bounds are obtained under the assumption that the classifiers' mutual dependence is small. The motivation for considering weak dependence came from our work [11] on speech recognition using randomized relational trees. However, we observed similar weak dependence behavior with trees constructed either via boosting or bagging (see Section 5). Our notion of weak dependence borrows from the literature on time series with long memory; see, for example, [23], [24]. In the context of time series, the dependence of sigma-fields generated from observing the time series over disjoint intervals of time is assumed to asymptotically vanish as the distance between the intervals increases to infinity. Here, we have adopted the following useful definition of weak dependence. (Note: throughout the remainder of the paper, $E(\cdot)$ and $E(\cdot \mid \cdot)$ will stand for expectation and conditional expectation, respectively.)

Weak Dependence. Let $\{Z_\ell\}_{\ell \geq 1}$ be a collection of centered (i.e., $E(Z_\ell) = 0$) and bounded random variables. For each $\ell \geq 1$, consider the sigma-field $\mathcal{F}_\ell$ generated by the random variable $Z_\ell$, and for $k \geq 2$, consider $\mathcal{F}_1^{k-1} = \bigvee_{\ell=1}^{k-1} \mathcal{F}_\ell$. The collection $\{Z_\ell\}_{\ell \geq 1}$ is said to be $(\bar{\gamma}, C)$-weakly dependent if there exists a sequence of real numbers $\{\gamma_\ell\}_{\ell \geq 2} \subset [0, \bar{\gamma}]$ such that, for all $\ell > 1$ and $\lambda \in (0, 1/C]$,
$$\left| E\left(e^{\lambda Z_\ell} \mid \mathcal{F}_1^{\ell-1}\right) - E\left(e^{\lambda Z_\ell}\right) \right| \leq 2 e^{\lambda C}\, \gamma_\ell. \qquad (7)$$
Often, we will omit $(\bar{\gamma}, C)$ and will refer to a $(\bar{\gamma}, C)$-weakly dependent collection as a weakly dependent collection.

In the time-series context, it is usually assumed that the time series satisfies the so-called $\phi$-mixing condition, which in our setup corresponds to the condition
$$\left| P(A \cap B) - P(A)P(B) \right| \leq \phi_k\, P(A), \qquad (8)$$
for all $A \in \mathcal{F}_1^{k-1}$ and $B \in \mathcal{F}_k$, $k \geq 2$. Roussas and Ioannides [25] showed that $\phi$-mixing implies (7); we have kept the constant 2 in the right-hand side of (7) for this reason. We did not adopt condition (8) as our definition of weak dependence since it was proven to be too stringent in practice. In fact, our experiments (not shown here) yielded histograms of $\phi$-values in (8) with large peaks near zero, but positively skewed, reaching large $\phi$-values with extremely low frequencies; note, however, that in (8), those few unusually large $\phi$-values are the most important
ones. It is intuitive that the maximum $\phi$-value is not a good measure of the dependence between the classifiers; instead, the average $\phi$-value gives more information about this dependence. For this reason, we adopted (7) as our definition of weak dependence. Besides, (7) is all that we need to obtain exponential bounds on the error rate. In Section 5, we have included some empirical results. We note that for very large values of $\ell$, $\gamma_\ell$ might be large since the dependence among classifiers is expected to increase due to the limited amount of training data (see, however, our experiments in Section 5, where a rather steady pattern in the values of $\{\gamma_\ell\}_{\ell\geq 2}$ is observed).

Measuring the dependence among variables. Note that the collection $\{Z_\ell\}_{\ell\geq 1}$ is always $(1/2, C)$-weakly dependent since $0 < e^{\lambda Z_\ell} \leq e^{\lambda C}$, for all $\lambda \geq 0$. Hence, our notion of weak dependence becomes meaningful only when $\bar{\gamma} < 1/2$. For a $(\bar{\gamma}, C)$-weakly dependent collection, consider the log-geometric average
$$\Lambda(m) = \frac{1}{m-1}\sum_{\ell=2}^{m}\log\left(1 + 2e^{1/4}\gamma_\ell\right). \qquad (9)$$

Theorem 2. Let $\bar{\gamma} > 0$ and $C > 0$, and assume that $\{Z_\ell\}_{\ell\geq 1}$ is $(\bar{\gamma}, C)$-weakly dependent. Consider the empirical average $S_m = m^{-1}\sum_{\ell=1}^m Z_\ell$ and let $\Lambda(m)$ be as in (9). Then, for every $\epsilon > 0$,
$$P(S_m \geq \epsilon) \leq \exp\left\{-m\left(\frac{\epsilon^2}{4C^2} - \Lambda(m)\left(1 - \frac{1}{m}\right)\right)\right\} \qquad (10)$$
$$\leq \exp\left\{-m\left(\frac{\epsilon^2}{4C^2} - 2e^{1/4}\bar{\gamma}\left(1 - \frac{1}{m}\right)\right)\right\}. \qquad (11)$$

From (10), one sees that the dependence among the variables $\{Z_\ell\}$ has to be sufficiently small so that $\Lambda(m) \leq \epsilon^2\{4C^2(1 - 1/m)\}^{-1}$. Also note that this condition becomes more restrictive when the number of classifiers being averaged, $m$, is large, not just because of the explicit term in $m$, but, more importantly, due to the dependence of $\Lambda(m)$ on $m$. As a corollary of Theorem 2, one can derive the upper bound given by (15) on the error rate of a linear combination of several classifiers. Before stating this result, we need to generalize (7) to a stronger version for classifiers.

The Expected Margin. Recall the definition (2) of the margin for an individual data point $X$. A more interesting quantity from the point of view of summarizing the "goodness" of a decision rule is the (conditional) expected value of
the margins given the true class. For the $J$-class problem and given the collection of classifiers $\{T_\ell\}_{\ell=1}^m$, one needs to consider the expected margin of class $i$ given that the true class is $j$, $\gamma_{i|j}$, for all $i, j \in \mathcal{C}$, $i \neq j$. In mathematical terms, one has
$$\gamma_{i|j} = \frac{1}{m}\sum_{\ell=1}^{m}\left\{E(T_\ell(X, j) \mid Y = j) - E(T_\ell(X, i) \mid Y = j)\right\}.$$
The expected margins for each classifier, $E(T_\ell(X, j) \mid Y = j) - E(T_\ell(X, i) \mid Y = j)$, will be denoted by $\gamma^{(\ell)}_{i|j}$, $\ell = 1, \ldots, m$. Intuitively, if the decision rule based on a linear combination of classifiers does a good job, then the expected margins $\gamma_{i|j}$ are expected to be "large," i.e.,
$$\frac{1}{m}\sum_{\ell=1}^{m} E(T_\ell(X, j) \mid Y = j) \gg \frac{1}{m}\sum_{\ell=1}^{m} E(T_\ell(X, i) \mid Y = j), \quad\text{for } i \neq j,\ i, j \in \mathcal{C}; \qquad (12)$$
put differently, on average, the classification should be correct. For example, if the classifiers $T_\ell$ are classification trees, the expected margins will be large if the terminal nodes of the trees are nearly pure. Small values of $\gamma_{i|j}$ are an indication that the trees are having trouble finding good splits in the covariate space and, hence, the overall misclassification rate will be high.

In what follows, we will consider the random variables
$$Z_\ell(i, j) = T_\ell(X, i) - T_\ell(X, j) + \gamma^{(\ell)}_{i|j}.$$
Note that these variables are centered, i.e., $E(Z_\ell(i, j) \mid Y = j) = 0$.

Mutual weak dependence. Let $C$ be an upper bound for $|T_\ell(X, i) - T_\ell(X, j) + \gamma^{(\ell)}_{i|j}|$, for all $i, j \in \mathcal{C}$, $\ell = 2, \ldots, m$. We say that the collection of classifiers $\{T_\ell\}_{\ell=1}^m$ is $(\bar{\gamma}, C)$-mutually weakly dependent if, for every $j \in \mathcal{C}$, there exists a sequence $\{\gamma_{\ell|j}\}_{\ell=2}^m \subset [0, \bar{\gamma}]$ satisfying, for all $i \in \mathcal{C}$ and all $\lambda \in (0, 1/C]$,
$$\left| E\left(e^{\lambda(T_\ell(X,i) - T_\ell(X,j) + \gamma^{(\ell)}_{i|j})} \,\middle|\, \mathcal{F}_1^{\ell-1}, Y = j\right) - E\left(e^{\lambda(T_\ell(X,i) - T_\ell(X,j) + \gamma^{(\ell)}_{i|j})} \,\middle|\, Y = j\right)\right| \leq 2e^{\lambda C}\gamma_{\ell|j}. \qquad (13)$$
The $\gamma$-coefficients $\{\gamma_{\ell|j}\}_{j=1,\ell=2}^{J,m}$ will be referred to as mutual weak dependence coefficients. As in (9), consider the log-geometric averages
$$\Lambda^{(j)}(m) = \frac{1}{m-1}\sum_{\ell=2}^{m}\log\left(1 + 2e^{1/4}\gamma_{\ell|j}\right), \qquad j \in \mathcal{C}. \qquad (14)$$
We note that a mutually weakly dependent collection $\{T_\ell\}_{\ell\geq 1}$ is also weakly dependent since
$$E(\,\cdot \mid \mathcal{F}_1^{\ell-1}) = \sum_{j=1}^{J} E(\,\cdot \mid \mathcal{F}_1^{\ell-1}, Y = j)\, P(Y = j).$$
Hence, to be mutually weakly dependent is a stronger condition.

Corollary 1. Let $\bar{\gamma} > 0$ and $C > 0$. Assume that the collection $\{T_\ell\}_{\ell=1}^m$ is $(\bar{\gamma}, C)$-mutually weakly dependent. Consider the empirical averages $S(m|j) = m^{-1}\sum_{\ell=1}^m T_\ell(X, j)$, $j = 1, \ldots, J$, and let $\{\Lambda^{(j)}(m)\}_{j\in\mathcal{C}}$ be as in (14). If the expected margins $\gamma_{i|j} > 0$ for all $i \neq j$, $i, j \in \mathcal{C}$, then
$$P(\text{misclassification} \mid Y = j) \leq P\left(\max_{i\in\mathcal{C}} S(m|i) \neq S(m|j) \,\middle|\, Y = j\right) \leq \sum_{i\neq j}\exp\left\{-m\left(\frac{\gamma_{i|j}^2}{4C^2} - \Lambda^{(j)}(m)\left(1 - \frac{1}{m}\right)\right)\right\} \leq (J-1)\exp\left\{-m\left(\min_{i\neq j}\frac{\gamma_{i|j}^2}{4C^2} - 2e^{1/4}\bar{\gamma}\left(1 - \frac{1}{m}\right)\right)\right\}. \qquad (15)$$

Again, the proof of Corollary 1 is postponed to the Appendix. Obviously, as in (10), (15) is a useful bound only when $\Lambda^{(j)}(m) < \gamma_{i|j}^2/(4C^2)$ holds, $i \neq j$, $i, j \in \mathcal{C}$. As we mentioned earlier, and as our empirical results suggest, it may be very difficult to construct many classifiers from the same training data without building up significant dependence between them. Note that the amount of training data does not enter explicitly into this bound. However, it enters indirectly through the log-geometric averages, which measure the degree of dependence introduced between the classifiers as their number $m$ increases.

For classification trees, the dimension of the covariate (feature) space, as well as the dependence between the covariates, might play an important role in determining the number of trees that can be combined so as to have a mutually weakly dependent collection. For example, in randomized trees, the effect of randomizing the splits at the nodes is to somehow force different trees to look at different attributes of the data items (e.g., locations of pixels of a given intensity in an image, or of energy peaks in a spectrogram; see [1], [11]). This technique will work well, i.e., it will produce small values of $\gamma_{\ell|j}$, $\ell = 2, \ldots, m$, $j \in \mathcal{C}$, when the attributes considered are nearly independent (e.g., in a spectrogram, one might expect to see low correlation between energy peaks occurring at moderate to large distances over time).

The value of the constant $C$ is 2 when the classifiers $\{T_\ell\}_{\ell=1}^m$ are classification trees (e.g., randomized trees) since, in this case, each $T_\ell(X, i)$ is a probability and, hence, bounded by one. Boosting (in its different flavors [12], [17]) and bagging may produce rather large values of $C$. The actual value of $C$ is highly dependent on the specific classifiers involved. However, the values of the classifiers can easily be transformed to probabilities. This is done with a linear transformation based on the distances of $T_\ell(X, i)$ to each class. For example, for the two-class problem, if $d_{-1} = |T_\ell(X) + 1|$ and $d_1 = |T_\ell(X) - 1|$ are the distances of the predicted value $T_\ell(X)$ to $-1$ and $1$, respectively, then one can consider the transformation $\tilde{T}_\ell(X, -1) = d_1/(d_{-1} + d_1)$, $\tilde{T}_\ell(X, 1) = d_{-1}/(d_{-1} + d_1)$ (see the short sketch at the end of this section).

Note that, to assess the performance of a specific linear combination of classifiers, one must not only look at the values of the weak dependence coefficients, but also at the values of the expected margins. Judging from our empirical results in Section 5, there appears to be a trade-off between the amount of weak dependence and the strength of the expected margins, in the sense that different procedures will balance differently their weak-dependence and expected-margin properties to yield approximately the same error rate.

Amit and Geman [1] derive a simpler upper bound for the misclassification rate through Chebyshev's inequality; their bound depends on $\gamma_{i|j}$ as well as on the average correlation between any two trees. Empirical results with randomized trees and trees constructed via boosting show low correlation between trees [26]. Our own experiments in Section 5 not only corroborate these results, but also show that the stronger notion of mutual weak dependence introduced in this paper holds. However, our empirically estimated log-geometric averages $\Lambda^{(j)}(m)$ are not small enough to guarantee the exponential decay in the error rate given by Corollary 1 when a moderate to large number $m$ of trees is considered.
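As a concrete illustration of the two points above, here is a short sketch (ours) of the transformation of real-valued two-class outputs into probabilities and of the estimation of the expected margins (12) by sample averages; the array layout is an assumption, and classes are indexed from 0 for convenience.

```python
import numpy as np

def to_probabilities(t):
    """Two-class transformation based on the distances of T_l(X) to -1 and +1:
    returns (Ttilde(X,-1), Ttilde(X,+1)) = (d_1/(d_-1 + d_1), d_-1/(d_-1 + d_1))."""
    d_minus1, d_plus1 = np.abs(t + 1.0), np.abs(t - 1.0)
    denom = d_minus1 + d_plus1
    return d_plus1 / denom, d_minus1 / denom

def expected_margin(probs, y, i, j):
    """Sample estimate of gamma_{i|j} in (12).

    probs : array of shape (m, n, J), probs[l, s, c] = T_l(X_s, c)
    y     : length-n array of true classes (0-based)
    """
    mask = (y == j)
    per_tree = probs[:, mask, j].mean(axis=1) - probs[:, mask, i].mean(axis=1)
    return per_tree.mean()
```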
5 SOME EMPIRICAL RESULTS

Corollary 1 suggests looking at the expected margins as well as the weak dependence structure of a collection of classifiers in order to study the strength of decision rules based on linear combinations of classifiers. To this end, we study the properties of these types of decision rules on four data sets through two experiments. The first experiment comprises a large and complex data set: it consists of applying randomized trees to an isolated speech recognition task. The second experiment compares the performance of randomized trees, trees constructed via boosting, and trees constructed via bagging on three different data sets taken from the UCI repository [27]. We study the dependence between the resulting trees and estimate their mutual weak dependence coefficients and expected margins from data. Our experiments suggest both that mutual weak dependence is present in these trees and that there appears to be a trade-off between the extent of this weak dependence and the strength of the margins, in such a way that the resulting decision rule is optimal. We do not claim to present an exhaustive study of the properties of the different decision rules reviewed in this paper. Rather, we focus on acquiring a good idea of their properties and behavior in terms of both the weak dependence structure created by randomized trees, bagging, and boosting, and the relationship between weak dependence and expected margins.
5.1 Isolated Word Recognition via Randomized Trees

Randomized trees were applied to an 11-class problem consisting of the 11 spoken "digits" in $\mathcal{C} = \{\text{zero}, \text{one}, \text{two}, \text{three}, \ldots, \text{nine}, \text{oh}\}$. The data consist of studio-quality speech taken from the well-known TI/NIST Connected Digits Recognition task (also known as TI-digits). This corpus does not include the segmentation of the utterances. Hence, we hand-segmented those utterances corresponding to digit sequences of at most two digits. Due to this limitation, our training data constituted a very small subset of the data available (4,460 points in all). Our testing data consisted of 2,486 isolated digits taken from the testing portion of the corpus. Each utterance was converted to a variable-length Mel-scale spectrogram with 18 frequency bins. Here, we only focus on the dependence structure of 100 randomized trees constructed with these data. For more details on the recognition task and the exact procedure used to construct these trees, the reader is referred to [11].
For each pair of trees, we estimate the mutual weak dependence coefficients from the data. More specifically, for $k, \ell = 1, \ldots, m$, with $\ell < k \leq m$, and $i, j \in \mathcal{C}$, let $\gamma(\ell, k, i|j)$ be given by
$$\gamma(\ell, k, i|j) = \frac{e^{-\lambda C}}{2}\,\max\left| E\!\left(\exp\{\lambda(T_k(X,i) - T_k(X,j) + \gamma_{i|j})\} \,\middle|\, \mathcal{F}_\ell,\, Y = j\right) - E\!\left(\exp\{\lambda(T_k(X,i) - T_k(X,j) + \gamma_{i|j})\} \,\middle|\, Y = j\right)\right|_{\lambda = 1/C},$$
where the maximum is taken over the values of the conditioning sigma-field $\mathcal{F}_\ell$ (note that since, in this particular case, the trees are classification trees, $C = 2$). We estimate $\gamma(\ell, k, i|j)$ from data by $\hat{\gamma}(\ell, k, i|j)$, which is obtained by replacing conditional expectations by Gaussian kernel smoothers [28] and expectations by sample averages. The conditional expectations $E(\exp\{\lambda(T_k(X,i) - T_k(X,j) + \gamma_{i|j})\} \mid \mathcal{F}_1^{k-1}, Y = j)$ on sigma-fields involving more than one tree are not computed for lack of sufficient testing data. Instead, we estimate the mutual weak dependence coefficients $\{\gamma_{k|j}\}_{j=1,k=2}^{J,m}$ by
$$\hat{\gamma}_{k|j} = \max_{\ell=1,\ldots,k-1;\; i\in\mathcal{C},\, i\neq j} \hat{\gamma}(\ell, k, i|j). \qquad (16)$$
We also considered estimates of the conditional expectations given $\mathcal{F}_1^{k-1}$ by smoothing on $\sum_{\ell=1}^{k-1} T_\ell$. These yielded similar but smaller estimates of $\gamma(\ell, k, i|j)$. For this reason, we prefer the larger (more conservative) estimates given by (16).

The left plot in Fig. 1 shows the estimates based on (16). We note the similarity of the $\gamma$-coefficients in all the classes and across the trees. Judging by the small values of these coefficients, we can say that there is mutual weak dependence in these trees. The expected margins (right plot in Fig. 1) are moderately large. These two observations suggest a good recognition rate. In fact, the overall error rate of about 1.5 percent achieved by the decision rule based on the average of the 100 randomized decision trees is relatively low for these data [11].
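To make the estimation procedure concrete, the following sketch (ours, not the code used for the experiments) computes one plug-in estimate $\hat{\gamma}(\ell, k, i|j)$: the conditional expectation given $\mathcal{F}_\ell$ is replaced by a Nadaraya-Watson Gaussian kernel smoother in the one-dimensional summary $T_\ell(X,i) - T_\ell(X,j)$ (this choice of conditioning summary, the bandwidth, and fixing $\lambda = 1/C$ are our assumptions), the unconditional expectation by a sample average, and the maximum is taken over the observed conditioning values.

```python
import numpy as np

def nw_smoother(x, w, bandwidth=0.1):
    """Nadaraya-Watson estimate of E(w | x) at each observed x, Gaussian kernel."""
    z = (x[:, None] - x[None, :]) / bandwidth
    K = np.exp(-0.5 * z ** 2)
    return K @ w / K.sum(axis=1)

def gamma_hat(probs, y, l, k, i, j, C=2.0, bandwidth=0.1):
    """Plug-in estimate of gamma(l, k, i|j) for classification trees (C = 2).

    probs : array (m, n, J) with probs[t, s, c] = T_t(X_s, c); y : true classes (0-based).
    """
    lam = 1.0 / C                                       # largest admissible lambda
    mask = (y == j)
    gamma_ij = (probs[:, mask, j].mean(axis=1) - probs[:, mask, i].mean(axis=1)).mean()
    w = np.exp(lam * (probs[k, mask, i] - probs[k, mask, j] + gamma_ij))
    cond = nw_smoother(probs[l, mask, i] - probs[l, mask, j], w, bandwidth)
    return 0.5 * np.exp(-lam * C) * np.max(np.abs(cond - w.mean()))
```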
5.2 A Brief Comparison of Bagging, Boosting, and Randomized Trees

Trees constructed via bagging, randomized trees, and two flavors of boosting, GentleAdaBoost and LogitBoost (see [17] for a detailed exposition of these algorithms), were applied to three small data sets from the UCI repository [27]: the Ionosphere data, the Pima-Diabetes data, and the Wine data. The Ionosphere data set consists of radar data collected in Goose Bay, Labrador, by a system consisting of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The data are divided into two classes ("Good" and "Bad"). There are 34 covariates in this data set. The Pima-Diabetes data set is also a binary data set. The two classes indicate whether or not the subject shows signs of diabetes according to the World Health Organization criteria. All subjects are females of Pima Indian heritage. There are eight covariates in this data set. The Wine data set consists of a chemical study of Italian wines grown in the same region but from three different cultivars (the three classes). There are 13 covariates in this data set. We refer to the UCI repository [27] for more details on these data sets.
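For readers who wish to reproduce a comparison of this kind, the sketch below is a rough stand-in (ours): it uses off-the-shelf scikit-learn ensembles in place of the implementations used in the paper (AdaBoost rather than LogitBoost or GentleAdaBoost, and a random forest as a proxy for randomized trees), and only the Wine data, which ships with scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)                          # the UCI Wine data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=60, random_state=0),   # tree base learners by default
    "boosting": AdaBoostClassifier(n_estimators=60, random_state=0),
    "randomized trees": RandomForestClassifier(n_estimators=60, random_state=0),
}
for name, model in models.items():
    err = np.mean(model.fit(X_tr, y_tr).predict(X_te) != y_te)
    print(f"{name:16s} test error rate: {100 * err:.1f}%")
```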
Fig. 1. Estimates of (a) the mutual weak dependence coefficients $\gamma_{k|j}$, for $j \in \{$zero, one, $\ldots$, oh$\}$ and $k = 2, \ldots, 100$ (left plot); (b) the corresponding log-geometric averages $\Lambda^{(j)}$ (center plot) associated to each of the 11 digits in the TI-digits corpus; and (c) the expected margin trajectories (right plot). Each curve in the left plot tracks the $\gamma$-coefficients over the trees for each of the 11 digits. Each curve in the right plot tracks the expected margins $\tilde{\gamma}_{i|j}$ over the trees for each of the 110 combinations of $i, j \in \{$zero, one, $\ldots$, oh$\}$, $i \neq j$.
Figs. 2, 3, and 4 show the corresponding estimates of the weak-dependence coefficients $\gamma_{k|j}$ (top rows), $j \in \{1, \ldots, J\}$ and $k = 2, \ldots, m$, and of the expected margins $\tilde{\gamma}_{i|j}$ (bottom rows), $i, j \in \{1, \ldots, J\}$, $i \neq j$, for the Ionosphere, Pima-Diabetes, and Wine data sets, respectively.

Note that, although trees constructed via boosting are built in a sequential fashion, there does not appear to be any increasing trend in the dependence of new trees with respect to previous ones (look at the second and third columns of the first row in Figs. 2 and 3, and at the second
Fig. 2. Estimates of the mutual weak dependence coefficients $\gamma_{k|j}$ (top row), $j \in \{1, 2\}$ and $k = 2, \ldots, 60$, and of the corresponding trajectories of the expected margins $\tilde{\gamma}_{i|j}$ (bottom row), $i, j \in \{1, 2\}$, $i \neq j$, for the Ionosphere data set. Each curve tracks either the $\gamma$-coefficients (top row) or the expected margins (bottom row) over the trees for each of the two classes. The blue curves correspond to $\tilde{\gamma}_{1|2}$, and the green curves to $\tilde{\gamma}_{2|1}$. These coefficients were estimated from 60 trees constructed via bagging (first column), LogitBoost (second column), GentleAdaBoost (third column), and randomized trees (fourth column).
Fig. 3. Estimates of the mutual weak dependence coefficients $\gamma_{k|j}$ (top row), $j \in \{1, 2\}$ and $k = 2, \ldots, 60$, and of the corresponding trajectories of the expected margins $\tilde{\gamma}_{i|j}$ (bottom row), $i, j \in \{1, 2\}$, $i \neq j$, for the Pima-Diabetes data set. Each curve tracks either the $\gamma$-coefficients (top row) or the expected margins (bottom row) over the trees for each of the two classes. The blue curves correspond to $\tilde{\gamma}_{1|2}$, and the green curves to $\tilde{\gamma}_{2|1}$. These coefficients were estimated from 60 trees constructed via bagging (first column), LogitBoost (second column), GentleAdaBoost (third column), and randomized trees (fourth column).
column of the first row in Fig. 4). This is also observed with trees constructed via randomized trees and bagging (last and first columns of the upper rows in Figs. 2, 3, and 4, respectively). However, in these latter cases, this is less surprising since all these trees are built in parallel (i.e., they are built independently of each other). One can infer from these plots that all the trees built with these procedures present evidence of mutual weak dependence.

Fig. 5 plots the log-geometric averages yielded by the three different procedures on the three data sets. It is clear from this figure that all these procedures, including the two flavors of boosting, namely, LogitBoost and GentleAdaBoost, behave in a similar fashion. An exception can be seen in the randomized-tree based decision rule applied to the Wine data set. Here, the corresponding log-geometric averages look unusually large when compared with the ones yielded by bagging and boosting.

Table 1 shows the percentage error rates of each of the four decision rules on the corresponding data sets. Note that, despite the large log-geometric averages, randomized trees do better than bagging for the Wine data set. This can be explained by the consistently large expected margins yielded by the randomized trees (see Fig. 4), as opposed to those yielded by bagging.

All four procedures give high error rates for the Pima-Diabetes data set. This is consistent with the low expected margins observed in Fig. 3. Class 2, comprising those individuals that tested positive for diabetes, appears very hard to classify correctly. In fact, all four procedures correctly recognize only a few more than 50 percent of these subjects. This class is so hard to recognize that the randomized-tree based procedure yielded margins that were near zero or even negative (see the green line at the bottom right corner of Fig. 3).

From Figs. 3 and 4, one can observe a positive correlation between the values of the $\gamma$-coefficients and the expected margins associated to each class. More explicitly, those classes with low $\gamma$-values tend to be associated with low expected margins, and those classes with large $\gamma$-values tend to be associated with large expected margins. This observation suggests a trade-off between low weak dependence and large expected margins, so that the misclassification pattern associated to each individual class is similar for all the classes. This trade-off behavior is also seen across the different decision rules in Fig. 4. The consequence of this trade-off is that, to compensate for low margins, a procedure will construct trees with low weak dependence, so as to boost the recognition rates. An extreme case corresponds to constructing independent trees with very low margins.

Note that the worst performers on each of the three data sets, i.e., LogitBoost on the Ionosphere data, randomized trees on the Pima-Diabetes data, and bagging on the Wine data, present anomalies either in the values of their associated $\gamma$-coefficients or in their expected margins. The LogitBoost $\gamma$-coefficients for the Ionosphere data are consistently larger than those associated to the other three procedures (see upper row of Fig. 2). The expected margins associated to randomized trees on the Pima-Diabetes data
Fig. 4. Estimates of the mutual weak dependence coefficients $\gamma_{k|j}$ (top row), $j \in \{1, 2, 3\}$ and $k = 2, \ldots, m$, and of the corresponding trajectories of the expected margins $\tilde{\gamma}_{i|j}$ (bottom row), $i, j \in \{1, 2, 3\}$, $i \neq j$, for the Wine data set. Each curve tracks either the $\gamma$-coefficients (top row) or the expected margins (bottom row) over the trees for each of the three classes. The blue curves correspond to class $j = 1$, the green curves to class $j = 2$, and the light blue curves to class $j = 3$. The solid blue curve corresponds to $\tilde{\gamma}_{2|1}$, and the dashed blue curve to $\tilde{\gamma}_{3|1}$. The solid green curve corresponds to $\tilde{\gamma}_{1|2}$, and the dashed green curve to $\tilde{\gamma}_{3|2}$. The solid light blue curve corresponds to $\tilde{\gamma}_{1|3}$, and the dashed light blue curve to $\tilde{\gamma}_{2|3}$. These coefficients were estimated from 10 trees constructed via bagging (first column), LogitBoost (middle column), and randomized trees (third column).
and to bagging on the Wine data are much lower than the expected margins associated to the other procedures. Also, note that despite the very large expected margins associated to randomized trees on the Wine data, LogitBoost performs better on this data set.
Even though this is not a comprehensive study of the performance of these classifiers, these findings corroborate, in practice, what Corollary 1 says in theory, namely, that what matters in order to achieve low error rates is the combination of weak-dependence with high expected margins.
Fig. 5. Estimates of the log-geometric averages $\Lambda^{(j)}$. The blue bar corresponds to class 1, the green bar to class 2, and, for the Wine data set, the light blue bar corresponds to class 3. BG, LB, GAB, and RT refer to bagging, LogitBoost, GentleAdaBoost, and randomized-tree estimates, respectively.
TABLE 1 Percentage Error Rates Yielded by the Four Decision Rules Based on Linear Combination of Trees
6 CONCLUSION
A useful definition of mutual weak dependence between classifiers based on the same training data is presented. When this mutual weak dependence is very low, exponential upper error bounds for combinations of classifiers can be achieved. Our experiments show that three common methods used to construct multiple classifiers from the same training data (the randomized-tree procedure, boosting, and bagging) produce mutually weakly dependent trees. Moreover, the pattern observed in the mutual weak dependence coefficients is similar in all three procedures. This pattern is characterized by $\gamma$-coefficients that present a rather steady oscillation about a small $\gamma$-value. Also, this pattern is usually accompanied by a flat region (corresponding to an almost constant and low error rate) in the corresponding error rate curves. This is known to happen in randomized-tree and boosting-based decision rules [11], [12]. A similar pattern can be observed in bagging-based decision rules (see Fig. 6). There appears to be a trade-off between low weak dependence and large expected margins, so that the misclassification pattern associated to each individual class is similar for all the classes. A consequence of this trade-off is that, to compensate for low margins, a procedure will construct trees with low weak dependence, so as to boost the recognition rates.

Our empirical findings corroborate the statement of Corollary 1, i.e., what matters in order to achieve low error rates is the combination of weak dependence with high expected margins. The main implication here is that it is not enough to look for procedures that maximize the margins. More importantly, we should look for procedures that create classifiers with large expected margins while keeping the dependence between the classifiers low.
APPENDIX: PROOF OF THE MAIN RESULTS

Proof of Theorem 1. This theorem is a direct consequence of one of Talagrand's inequalities [13, inequality (1.4)], which we write here in a convenient form:
$$P\left(\sup_{T \in \tilde{\mathcal{T}}} \sqrt{n}\,\big|\mathbb{P}_n(T(X) \neq Y) - P(T(X) \neq Y)\big| \geq C\right) \leq K\left(\frac{KC^2}{V}\right)^{V} C\, e^{-2C^2}, \qquad (17)$$
for all $C \geq C_0$, with $C_0$ a large constant and $V = \tilde{\nu} + 1$ ($\tilde{\nu}$ is the VC-dimension of $\tilde{\mathcal{T}}$; we have adopted the definition of VC-dimension given in [22]). Writing $C = \epsilon(\delta, n)\sqrt{n}$, we obtain the lower bound $\epsilon(\delta, n) \geq C_0/\sqrt{n}$; hence, $\epsilon(\delta, n)$ cannot decrease at a faster rate than $n^{-1/2}$. To simplify the notation, we will write $\epsilon$ for $\epsilon(\delta, n)$ in all the equations below.
Fig. 6. Error rates for the Ionosphere data (right) and the Pima-Diabetes data (left), yielded by increasing the number of trees used for classification. The trees were constructed via bagging.
Since $C \geq 1$, we have that, given $\delta > 0$, it suffices to find a small $\epsilon$ satisfying
$$K\left(\frac{K\epsilon^2 n}{V}\right)^{V} \epsilon\sqrt{n}\; e^{-2\epsilon^2 n} \leq \delta \quad\text{and}\quad \epsilon^2 \geq \frac{C_0^2}{n}. \qquad (18)$$
Applying the logarithm on both sides of the first constraint, we obtain the equivalent constraint
$$(2V+1)\log(\epsilon\sqrt{n}) + V\log\frac{K}{V} + \log K - 2\epsilon^2 n \leq \log\delta. \qquad (19)$$
We will show that the positive root of the quadratic equation on $\epsilon$,
$$2\epsilon^2 n - \frac{2V+1}{C_0}\,\epsilon\sqrt{n} - \left\{V\log\frac{KC_0^2}{Ve^2} + \log\frac{KC_0}{e\,\delta}\right\} = 0, \qquad (20)$$
which is easily seen to be the expression in the right-hand side of (5), is a solution of (19), thus showing the theorem, provided that this root satisfies the second constraint in (18). Imposing this latter constraint, i.e., $\epsilon \geq C_0/\sqrt{n}$, with $\epsilon$ given by (5), leads to the condition $\delta \leq \delta_0$, with $\delta_0$ as in (6). To conclude the proof, note that if $\epsilon$ is a solution of (20), then $\epsilon$ satisfies the constraint (19) and, hence, the first constraint in (18). This is straightforward from the inequality
$$(2V+1)\log(\epsilon\sqrt{n}) + V\log\frac{K}{V} + \log K - 2\epsilon^2 n \leq \frac{2V+1}{C_0}\,\epsilon\sqrt{n} + V\log\frac{KC_0^2}{Ve^2} + \log\frac{KC_0}{e} - 2\epsilon^2 n, \qquad (21)$$
which is easily derived by applying the well-known inequality $\log(x) \leq x - 1$ to $x = \epsilon\sqrt{n}/C_0$ in the left-hand side of (21). □

Proof of Theorem 2. Let $S_k = \sum_{\ell=1}^k Z_\ell$, $k = 1, \ldots, m$. One has
$$E\left(e^{\lambda S_k}\right) = E\left(e^{\lambda S_{k-1}} e^{\lambda Z_k}\right) - E\left(e^{\lambda S_{k-1}}\right)E\left(e^{\lambda Z_k}\right) + E\left(e^{\lambda S_{k-1}}\right)E\left(e^{\lambda Z_k}\right) = E\left[e^{\lambda S_{k-1}}\left\{E\left(e^{\lambda Z_k} \mid \mathcal{F}_1^{k-1}\right) - E\left(e^{\lambda Z_k}\right)\right\}\right] + E\left(e^{\lambda S_{k-1}}\right)E\left(e^{\lambda Z_k}\right). \qquad (22)$$
Using (7) and the fact that
$$E\left(e^{\lambda Z_\ell}\right) \leq e^{\lambda^2 C^2}, \quad\text{for } \lambda C \leq 1,\ \ell = 1, \ldots, m, \qquad (23)$$
which we will show below, one obtains from the identity (22)
$$E\left(e^{\lambda S_k}\right) \leq E\left(e^{\lambda S_{k-1}}\right)\left(1 + 2e^{\lambda C}\gamma_k\right)e^{\lambda^2 C^2} \leq \cdots \leq E\left(e^{\lambda Z_1}\right)\prod_{\ell=2}^{k}\left(1 + 2e^{\lambda C}\gamma_\ell\right) e^{\lambda^2 C^2 (k-1)} \leq e^{\lambda^2 C^2 k}\prod_{\ell=2}^{k}\left(1 + 2e^{\lambda C}\gamma_\ell\right).$$
Let $\mathbf{1}_{\{S_m \geq \epsilon\}}$ be the indicator function associated to the event $\{S_m \geq \epsilon\}$. Note that
$$\mathbf{1}_{\{S_m \geq \epsilon\}} = \mathbf{1}_{\{\exp\{\lambda(S_m - \epsilon)\} \geq 1\}} \leq e^{\lambda(S_m - \epsilon)}\,\mathbf{1}_{\{S_m \geq \epsilon\}} \leq e^{\lambda(S_m - \epsilon)}, \qquad (24)$$
for all $\lambda > 0$. Using this last inequality (24) and the definition (9) of $\Lambda(m)$, one obtains
$$P(S_m \geq \epsilon) = E\left(\mathbf{1}_{\{S_m \geq \epsilon\}}\right) \leq E\left(e^{\lambda S_m}\right)e^{-\lambda\epsilon} \leq \exp\left\{m\lambda^2 C^2 + \Lambda(m)(m-1) - \lambda\epsilon\right\}.$$
Minimizing this latter expression over $\lambda$, one obtains $P(S_m \geq \epsilon) \leq \exp\{-\epsilon^2/(4mC^2) + \Lambda(m)(m-1)\}$, which holds for every $\epsilon \leq 2mC$. Finally, replacing $\epsilon$ by $m\epsilon$, one quickly gets (10), for every $\epsilon \leq 2C$. To show (11), one only needs to note that, by definition of $(\bar{\gamma}, C)$-weak dependence, $\max_{1\leq\ell\leq m}\log(1 + 2e^{1/4}\gamma_\ell) \leq \log(1 + 2e^{1/4}\bar{\gamma}) \leq 2e^{1/4}\bar{\gamma}$ and, hence, $\Lambda(m) \leq 2e^{1/4}\bar{\gamma}$.

It remains to show (23). Since $E(Z_\ell) = 0$ and $|Z_\ell| \leq C$, we have, using a Taylor series expansion for $e^x$,
$$E\left(e^{\lambda Z_\ell}\right) = E\left(1 + \lambda Z_\ell + \sum_{k\geq 2}\frac{(\lambda Z_\ell)^k}{k!}\right) \leq 1 + \sum_{k\geq 2}\frac{(\lambda C)^k}{k!} = 1 + \sum_{k\geq 1}\left\{\frac{(\lambda C)^{2k}}{(2k)!} + \frac{(\lambda C)^{2k+1}}{(2k+1)!}\right\} \leq 1 + \frac{1 + \lambda C}{2}\sum_{k\geq 1}\frac{(\lambda^2 C^2)^k}{k!} = 1 + \frac{1 + \lambda C}{2}\left(e^{\lambda^2 C^2} - 1\right) \leq e^{\lambda^2 C^2},$$
where we have used $\lambda C \leq 1$. □

Proof of Corollary 1. Note that
$$P\left(\max_{i\in\mathcal{C}} S(m|i) \neq S(m|j) \,\middle|\, Y = j\right) = P\left(\bigcup_{i\neq j}\{S(m|i) > S(m|j)\} \,\middle|\, Y = j\right) \leq \sum_{i\neq j} P\big(S(m|i) > S(m|j) \mid Y = j\big) = \sum_{i\neq j} P\big(S(m|i) - S(m|j) + \gamma_{i|j} > \gamma_{i|j} \mid Y = j\big).$$
Now, note that
$$S(m|i) - S(m|j) + \gamma_{i|j} = \frac{1}{m}\sum_{\ell=1}^{m}\left\{T_\ell(X, i) - T_\ell(X, j) + \gamma^{(\ell)}_{i|j}\right\}.$$
By hypothesis, the collection $\{T_\ell(X, i) - T_\ell(X, j) + \gamma^{(\ell)}_{i|j}\}_{\ell=1}^m$ is $(\bar{\gamma}, C)$-mutually weakly dependent. Hence, applying Theorem 2 to all these collections, for $i \neq j$, $i, j = 1, 2, \ldots, J$, one quickly obtains (15). □
ACKNOWLEDGMENTS

The author would like to thank Professor Jon Wellner for introducing him to Talagrand's inequalities and for his useful suggestions, Shuguang Song for making available his S-Plus code for trying several kinds of boosting algorithms, and Gregory Ridgeway for the many useful conversations on boosting. He would also like to thank the Department of Statistics at the University of Washington and the Institute for Mathematics and Its Applications in Minneapolis, Minnesota, for hosting him during the time this paper was conceived and revised, respectively. Finally, he would also like to thank the referees for their helpful suggestions and comments, which improved the readability of the paper and the presentation of both Theorem 1 and its proof.
REFERENCES

[1] Y. Amit and D. Geman, "Shape Quantization and Recognition with Randomized Trees," Neural Computation, vol. 9, pp. 1545-1588, 1997.
[2] Y. Freund, "Boosting a Weak Learning Algorithm by Majority," Information and Computation, vol. 121, no. 2, pp. 256-285, 1995.
[3] Y. Freund and R. Schapire, "Experiments with a New Boosting Algorithm," Machine Learning: Proc. 13th Int'l Conf., pp. 148-156, 1996.
[4] H. Drucker and C. Cortes, "Boosting Decision Trees," Advances in Neural Information Processing Systems 8, pp. 479-485, 1996.
[5] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[6] L. Breiman, "Bias, Variance and Arcing Classifiers," Technical Report 460, Dept. of Statistics, Univ. of Calif. at Berkeley, 1996.
[7] L. Breiman, "Using Adaptive Bagging to Debias Regressions," Technical Report 547, Dept. of Statistics, Univ. of Calif. at Berkeley, 1999.
[8] J.R. Quinlan, "Bagging, Boosting, and C4.5," Proc. 13th Nat'l Conf. Artificial Intelligence, pp. 725-730, 1996.
[9] S. Geman, E. Bienenstock, and R. Doursat, "Neural Networks and the Bias/Variance Dilemma," Neural Computation, vol. 4, pp. 1-58, 1992.
[10] R. Tibshirani, "Bias, Variance, and Prediction Error for Classification Rules," technical report, Dept. of Statistics, Univ. of Toronto, 1996.
[11] Y. Amit and A. Murua, "Speech Recognition Using Randomized Relational Decision Trees," IEEE Trans. Speech and Audio Processing, 2000.
[12] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods," Machine Learning: Proc. 14th Int'l Conf., 1997.
[13] M. Talagrand, "Sharper Bounds for Gaussian and Empirical Processes," The Annals of Probability, vol. 22, no. 1, pp. 28-76, 1994.
[14] T. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[15] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, Calif.: Wadsworth, 1984.
[16] J.H. Friedman and W. Stuetzle, "Projection Pursuit Regression," J. Am. Statistical Assoc., vol. 76, pp. 817-823, 1981.
[17] J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," Annals of Statistics, 2000.
[18] B.D. Ripley, Pattern Recognition and Neural Networks. Cambridge Univ. Press, 1996.
[19] B.E. Boser, I.M. Guyon, and V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. Fifth Ann. ACM Workshop Computational Learning Theory, pp. 144-152, 1992.
[20] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.
[21] K.S. Alexander, "Probability Inequalities for Empirical Processes and a Law of the Iterated Logarithm," Annals of Probability, vol. 12, pp. 1041-1067, 1984.
[22] A. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, 1996.
[23] I. Berkes and W. Philipp, "Approximation Theorems for Independent and Weakly Dependent Random Vectors," The Annals of Probability, vol. 7, pp. 29-54, 1979.
[24] G.G. Roussas and Y.G. Yatracos, "Minimum Distance Estimates with Rates under φ-Mixing," Festschrift for Lucien Le Cam, D. Pollard, E. Torgersen, and G.L. Yang, eds., pp. 337-344, Springer, 1997.
[25] G.G. Roussas and D. Ioannides, "Moment Inequalities for Mixing Sequences of Random Variables," Stochastic Analysis and Applications, vol. 5, no. 1, pp. 61-120, 1987.
[26] Y. Amit, personal communication, 2001.
[27] UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/mlrepository.html, 2000.
[28] T. Hastie and R. Tibshirani, Generalized Additive Models. Chapman and Hall, 1990.

Alejandro Murua received the Mathematical Civil Engineering diploma from the School of Engineering, University of Chile, in 1988, and the PhD degree in applied mathematics from Brown University in 1994. He has been on the faculty in the Division of Applied Mathematics at Brown University during 1993-1994, in the Department of Statistics at the University of Chicago during 1994-1998, and in the Department of Statistics at the University of Washington, Seattle, during 1998-2000. In 2001, he joined Insightful Corporation in Seattle as a member of its research and emerging technologies department. His main research interests focus on applications of statistics and probability to problems dealing with machine learning, natural language understanding, speech and object recognition, signal processing, and data mining. He is a member of the IEEE.