Dynamics of Variance Reduction in Bagging and Other Techniques Based on Randomisation

G. Fumera, F. Roli, and A. Serrau

Dept. of Electrical and Electronic Eng., University of Cagliari, Italy
{fumera,roli,serrau}@diee.unica.it

Abstract. In this paper the performance of bagging in classification problems is theoretically analysed, using a framework developed in works by Tumer and Ghosh and extended by the authors. A bias-variance decomposition is derived, which relates the expected misclassification probability attained by linearly combining classifiers trained on N bootstrap replicates of a fixed training set to that attained by a classifier trained on a single bootstrap replicate of the same training set. Theoretical results show that the expected misclassification probability of bagging has the same bias component as that of a single bootstrap replicate, while the variance component is reduced by a factor N. Experimental results show that the performance of bagging as a function of the number of bootstrap replicates follows our theoretical prediction quite well. Finally, it is shown that the theoretical results derived for bagging also apply to other methods for constructing multiple classifiers based on randomisation, such as the random subspace method and tree randomisation.

1 Introduction

Bagging [3] is the most popular method for constructing multiple classifier systems based on the "perturbing and combining" approach, which consists in combining multiple instances of a base classifier obtained by introducing some randomness into the training phase. These methods seem to be effective in reducing the variance component of the expected misclassification probability of a classifier, and are thus believed to be particularly useful for classifiers characterised by high variance and low bias, qualitatively defined by Breiman as "unstable", i.e. classifiers that undergo significant changes in response to small perturbations of the training set (or of other training parameters). However, it is not yet clear exactly how bagging affects the bias and variance of individual classifiers, nor for what kind of problems and classifiers it is most effective. Theoretical investigations like [4, 8] focused only on regression problems, while analytical models for classification problems turned out to be more difficult to develop, especially because no additive bias-variance decomposition exists for them. Moreover, although several decompositions have been proposed so far [2, 5, 7, 14, 17], no general consensus exists about which one is more appropriate for analysing the behaviour of classification algorithms. Therefore, only experimental analyses of

bagging have been presented so far for classification problems [6, 10, 14, 16]. Empirical evidence seems to confirm that the main effect of bagging is to reduce the variance component of the expected misclassification probability, although some exceptions have been pointed out, for instance in [10]. Moreover, no clear definition of "instability" has been provided yet that is directly related to the amount of variance reduction, or to the performance improvement, attainable by bagging, although some attempts have been made [10, 16].

In this paper we look at bagging from the perspective of the theoretical framework developed in works by Tumer and Ghosh [19, 20], and extended by Fumera and Roli in [9]. This model makes it possible to evaluate the error reduction attainable by linearly combining the outputs of individual classifiers, and provides a particular bias-variance decomposition of the expected misclassification probability. This decomposition accounts only for a fraction of the overall misclassification probability, and holds only under some assumptions. Nevertheless, we show that it can be exploited to analytically characterise the performance of bagging as a function of the number of bootstrap replicates N. This problem has not been considered so far in the literature. Indeed, bagging was proposed by Breiman as a method to approximate, using a single training set, an ideal "aggregated" predictor defined as the combination of the (possibly infinite) predictors obtained using all possible training sets of a fixed size. In practice, since there are $m^m$ different and equiprobable bootstrap replicates of a given training set of size $m$, bagging itself is approximated using $N \ll m^m$ replicates. Bagging has always been analysed "asymptotically", i.e. for values of N sufficiently high to provide a good approximation of its theoretical definition. Empirical evidence showed that "asymptotic" values of N are between 10 and 50, depending on the particular data set and classifier used [1, 3, 16], that the performance of bagging tends to improve for increasing N until the "asymptotic" value is reached [16], and that such improvement is mainly due to variance reduction [1, 3]. However, the dynamics with which the performance of bagging reaches its "asymptotic" value has never been investigated.

From this viewpoint, we show that our theoretical framework provides a simple analytical relationship between the expected misclassification probability of bagging and that of an individual classifier trained on a single bootstrap replicate of a fixed training set. This relationship shows that the performance of bagging improves as N increases, and that this improvement is entirely due to a reduction by a factor N of the variance of a single bootstrap replicate. We also show that our model theoretically supports the optimality of the simple average combining rule over the weighted average for classifiers generated by bagging. To the best of our knowledge, our model is the first to analytically characterise the dynamics of the variance reduction attained by bagging as a function of the number N of combined classifiers. Experiments carried out on the same data sets originally used in [3] support our theoretical predictions. The practical relevance of our results is that they provide a well grounded rule for choosing the number of bootstrap replicates N, which can be useful in applications characterised by strict requirements on computational complexity at operation time.
Moreover, we show that our theoretical results are not limited to bagging, but hold for any method based on independently generating individual classifiers using the same randomisation process, like Ho's random subspace method [12] and Dietterich and Kong's tree randomisation [5].
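For concreteness, the following is a minimal sketch of bagging with the simple-average combining rule, not the experimental code used in Sect. 3; the function name bagged_posteriors is ours, and scikit-learn's DecisionTreeClassifier is assumed as the base learner purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_posteriors(X_train, y_train, X_test, n_replicates=10, seed=0):
    # Train one tree per bootstrap replicate of the training set and return the
    # simple average of the estimated posteriors on the test points.
    # Assumes every class appears in each bootstrap replicate, so that the
    # columns returned by predict_proba are aligned across trees.
    rng = np.random.default_rng(seed)
    m = len(X_train)
    posterior_sum = 0.0
    for _ in range(n_replicates):
        idx = rng.integers(0, m, size=m)   # bootstrap replicate: m draws with replacement
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        posterior_sum = posterior_sum + tree.predict_proba(X_test)
    return posterior_sum / n_replicates    # simple-average combining rule

The predicted label of the bagged ensemble is then the class with the largest averaged posterior.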

2 Theoretical Analysis of Bagging

2.1 Expected misclassification probability of an individual classifier

Our analysis of bagging is based on the theoretical model developed in [19, 20] and extended in [9], which makes it possible to analytically evaluate the reduction of the expected misclassification probability attainable by linearly combining the outputs of an ensemble of classifiers. This model considers classifiers that provide approximations $\hat{P}_k(x)$ of the class posterior probabilities $P_k(x)$ (where $k$ denotes the class), and focuses on the expected value of the additional misclassification probability (from now on, added error) over the Bayes error attained on a given boundary between any two classes $i$ and $j$, in the case when the effect of using the approximated a posteriori probabilities is a shift of the ideal boundary. This situation is depicted in Fig. 1 for the case of a one-dimensional feature space. In the following we do not consider the case of multi-dimensional feature spaces, which is discussed in [18]. The approximation $\hat{P}_k(x)$ can be written as $P_k(x) + \epsilon_k(x)$, where $\epsilon_k(x)$ denotes the estimation error, which is assumed to be a random variable. One source of randomness in constructing a classifier (the one exploited by bagging) is the training set. Therefore, in the following we shall write the estimation errors by indicating explicitly their dependence on a training set $t$, as $\epsilon_i(x, t)$. However, we point out that all the following derivations also hold when other sources of randomness are considered (like the ones exploited by the random subspace method and by tree randomisation, or even random initial weights in neural networks). Under a first-order approximation of $P_i(x)$ and $P_j(x)$ around the ideal boundary $x^*$, and approximating the probability distribution $p(x)$ around the ideal boundary with $p(x^*)$, the additional misclassification probability for a given $t$ turns out to be

$$ z\,[\epsilon_i(x_b, t) - \epsilon_j(x_b, t)]^2 , \qquad (1) $$

where $x_b$ is the estimated class boundary (see Fig. 1), and $z$ is a constant equal to $\frac{p(x^*)}{2[P'_j(x^*) - P'_i(x^*)]}$ [9, 13, 19, 20]. It easily follows that the added error $E_{add}$, i.e. the expected value of eq. (1) over $t$, is

$$ E_{add} = z\left[(\beta_i - \beta_j)^2 + \sigma_i^2 + \sigma_j^2 - 2\,\mathrm{cov}_{ij}\right] , \qquad (2) $$

where $\beta_k$ and $\sigma_k^2$ denote the mean and variance (over $t$) of $\epsilon_k(x, t)$, $k = i, j$, while $\mathrm{cov}_{ij}$ denotes the covariance between $\epsilon_i(x, t)$ and $\epsilon_j(x, t)$; these quantities are assumed not to depend on $x$ around the considered class boundary. We point out that eq. (2) can be viewed as a bias-variance decomposition, since it expresses the added error as a function of the mean and variance over $t$ of the estimated a posteriori probabilities provided by a classifier.
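For completeness, eq. (2) follows from eq. (1) by the standard second-moment expansion of $\epsilon_i - \epsilon_j$:

\begin{align*}
E_{add}
  &= \mathbb{E}_t\!\left[\, z\,\bigl(\epsilon_i(x_b,t) - \epsilon_j(x_b,t)\bigr)^2 \right] \\
  &= z\left( \bigl(\mathbb{E}_t[\epsilon_i - \epsilon_j]\bigr)^2 + \mathrm{Var}_t[\epsilon_i - \epsilon_j] \right) \\
  &= z\left[ (\beta_i - \beta_j)^2 + \sigma_i^2 + \sigma_j^2 - 2\,\mathrm{cov}_{ij} \right].
\end{align*}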

Fig. 1. True posterior probabilities $P_i(x)$ and $P_j(x)$ around the ideal boundary $x^*$ between classes $i$ and $j$ (solid lines), and estimated posteriors $\hat{P}_i(x)$ and $\hat{P}_j(x)$, leading to the boundary $x_b$ (dashed lines). Lightly and darkly shaded areas represent the contribution of this class boundary to the Bayes error and to the additional misclassification error, respectively.

Under this viewpoint, it is analogous to the decomposition given by Friedman for a fixed $x$ in [7]. The difference between the decomposition (2) and the others proposed in the literature [2, 5, 7, 17] is that it holds only under the assumptions explained above. Although these assumptions are quite strict, as clearly explained in [13], they provide an additive decomposition, which is not attainable in classification problems without simplifying assumptions, as shown in [7]. In the following we show how this decomposition can be exploited to analytically characterise the performance of bagging as a function of the number of bootstrap replicates.

2.2 Added error of linearly combined classifiers

Consider a linear combination $\hat{P}_k^{ave}(x)$ (by simple averaging) of the estimated a posteriori probabilities provided by $N$ classifiers trained on different training sets $t_1, \ldots, t_N$:

$$ \hat{P}_k^{ave}(x) = \frac{1}{N}\sum_{m=1}^{N} \hat{P}_k^m(x) = P_k(x) + \frac{1}{N}\sum_{m=1}^{N} \epsilon_k(x, t_m) . $$

Under the same assumptions as in Sect. 2.1, the additional misclassification probability on the same class boundary of Fig. 1 turns out to be

$$ z\left[\frac{1}{N}\sum_{m=1}^{N} \bigl(\epsilon_i(x_b^{ave}, t_m) - \epsilon_j(x_b^{ave}, t_m)\bigr)\right]^2 , \qquad (3) $$

where $z$ is the same as in eq. (1), while $x_b^{ave}$ is the estimated class boundary, which can differ from that of a single classifier, $x_b$ [9, 13, 19, 20]. We now focus on the case when the estimation errors of the individual classifiers, $\epsilon_k(x, t_m)$, $m = 1, \ldots, N$, are i.i.d. random variables. This is the case of bagging: indeed, if $t_1, \ldots, t_N$ are bootstrap replicates of a fixed training set, then they are i.i.d. random variables, since each one is made up of $m$ samples drawn independently from the same distribution. This implies that $\epsilon_k(x, t_m)$, $m = 1, \ldots, N$, are also i.i.d. random variables. In this case, eq. (1) (which coincides with eq. (3) for $N = 1$) provides the misclassification probability of a classifier trained on a single bootstrap replicate $t$ of the considered training set, and eq. (2) is its expected value over all possible bootstrap replicates, while the expected value of eq. (3) provides the added error of $N$ bagged classifiers, over all possible realisations of $N$ bootstrap replicates of the same training set. Since all the training sets $t_1, \ldots, t_N$ are identically distributed, the estimation errors $\epsilon_k(x, t_m)$, $m = 1, \ldots, N$, in eq. (3) and $\epsilon_k(x, t)$ in eq. (1) are also identically distributed. Under the assumption of Sect. 2.1 that the mean and variance of the estimation errors are constant around the ideal boundary $x^*$, it follows that each $\epsilon_k(x_b^{ave}, t_m)$, $m = 1, \ldots, N$, has the same mean $\beta_k$ and variance $\sigma_k^2$ as $\epsilon_k(x_b, t)$, while $\epsilon_i(x_b^{ave}, t_m)$ and $\epsilon_j(x_b^{ave}, t_m)$, $m = 1, \ldots, N$, have the same covariance $\mathrm{cov}_{ij}$ as $\epsilon_i(x_b, t)$ and $\epsilon_j(x_b, t)$. Moreover, since the $\epsilon_k(x, t_m)$, $m = 1, \ldots, N$, are i.i.d., the differences $\epsilon_i(x_b^{ave}, t_m) - \epsilon_j(x_b^{ave}, t_m)$, $m = 1, \ldots, N$, are i.i.d. as well. It then follows that the added error $E_{add}^{ave}$ (the expected value of eq. (3) with respect to the estimation errors) is given by

$$ E_{add}^{ave} = z\left[(\beta_i - \beta_j)^2 + \frac{1}{N}\bigl(\sigma_i^2 + \sigma_j^2 - 2\,\mathrm{cov}_{ij}\bigr)\right] . \qquad (4) $$

It is worth noting that, according to the theoretical comparison between the simple and weighted average combining rules given in [9], based on the same theoretical framework considered here, the simple average is the optimal combining rule for classifiers generated by bagging, since the estimation errors of the individual classifiers are i.i.d.
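A quick numerical sanity check of eq. (4) can be obtained by simulating i.i.d. estimation errors with assumed, purely illustrative bias, variance and covariance values, for instance:

import numpy as np

rng = np.random.default_rng(0)
beta = np.array([0.05, -0.02])    # illustrative biases beta_i, beta_j
cov = np.array([[0.04, 0.01],     # illustrative sigma_i^2 and cov_ij
                [0.01, 0.09]])    # cov_ij and sigma_j^2
z = 1.0                           # z only rescales both sides of eq. (4)

bias_term = (beta[0] - beta[1]) ** 2
var_term = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]

for N in (1, 2, 5, 10, 25, 50):
    # draw N i.i.d. pairs (eps_i, eps_j) per trial, average them, square the difference
    eps = rng.multivariate_normal(beta, cov, size=(100_000, N))
    avg_diff = (eps[:, :, 0] - eps[:, :, 1]).mean(axis=1)
    empirical = z * np.mean(avg_diff ** 2)      # Monte Carlo estimate of eq. (3)'s expectation
    predicted = z * (bias_term + var_term / N)  # eq. (4)
    print(f"N={N:2d}  empirical={empirical:.5f}  predicted={predicted:.5f}")

The empirical values approach the constant bias term $(\beta_i - \beta_j)^2$ as $N$ grows, while the variance contribution decays as $1/N$.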

2.3 Analysis of bagging

In the previous section we obtained a bias-variance decomposition of the added error $E_{add}^{ave}$ of an ensemble of $N$ classifiers trained on bootstrap replicates of a fixed training set (when the simple average combining rule is used), under the assumptions of Sect. 2.1. We showed that this decomposition relates the bias and variance components of $N$ bagged classifiers to those of a classifier trained on a single bootstrap replicate. To the best of our knowledge, this is the first analytical model of the performance of bagging as a function of the number of bootstrap replicates $N$.

Equations (2) and (4) show that the added error attained by $N$ bagged classifiers has the same bias component as a classifier trained on a single bootstrap replicate, while its variance is reduced by a factor $N$. In other words, our model states that, for a fixed training set, the expected performance of bagging always improves as $N$ increases, and that this is due only to variance reduction. This could seem an optimistic conclusion, since it is known that bagging does not always improve the performance of a classifier. However, we point out that the improvement predicted by our theoretical model does not refer to an individual classifier trained on the whole training set, but to an individual classifier trained on a single bootstrap replicate, where the expectation is taken over all possible bootstrap replicates of a fixed training set. In other words, the above result does not imply that bagging a given classifier always improves its performance, but only that, for any fixed training set, the expected performance of $N$ bagged classifiers improves for increasing $N$. This qualitatively agrees with results like the ones in [16], where the performance of bagging was often found to improve for increasing $N$, even when an individual classifier (trained on the whole training set) outperformed bagging. Moreover, it also quantitatively agrees with the empirical observation that bagging more than 10 to 50 classifiers does not lead to significant performance improvements.

The above result can have great practical relevance, since it relates in a very simple way the performance improvement attainable by bagging to the number of bootstrap replicates $N$, and thus suggests a simple guideline for choosing the value of $N$. Although the maximum reduction of the expected misclassification probability attainable by bagging equals the variance component, which is unknown in real applications, combining $N$ bagged classifiers always reduces that component, on average, by a factor $N$. This guideline can be useful in particular for applications characterised by strict requirements on the computational complexity at operation time, where it is necessary to find a trade-off between the potential performance improvement attainable by a combining method like bagging and the value of $N$. So far, this problem has been addressed through the development of techniques for selecting a subset of $N$ classifiers out of an ensemble of $M > N$ classifiers generated by methods like bagging (see [11, 15, 16]). Selection techniques aim to keep the ensemble size small without worsening its performance [11, 15], or even to improve performance by discarding poor individual classifiers [16]. Under this viewpoint, our results do not directly allow one to understand whether selection techniques can be effective, since they provide only the average performance of an ensemble of $N$ classifiers randomly generated by bagging. The improvement attainable by selection techniques depends instead on how much the performance of different ensembles of $N$ classifiers varies: the larger this variability, the larger the gain attainable by selecting a proper ensemble of $N$ classifiers instead of a random one.
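As an illustration of this guideline (a hypothetical rule of thumb, not part of the model itself): since the average reduction of the variance component achieved by $N$ bagged classifiers is $E_V(1 - 1/N)$, retaining at least a fraction $f$ of the maximum attainable reduction $E_V$ requires $N \geq 1/(1 - f)$.

import math

def replicates_for_fraction(f, eps=1e-9):
    # Smallest N such that 1 - 1/N >= f, i.e. such that N bagged classifiers
    # capture at least a fraction f of the maximum variance reduction E_V.
    # The small eps guards against floating-point rounding in the division.
    return math.ceil(1.0 / (1.0 - f) - eps)

print(replicates_for_fraction(0.90))   # 10 replicates retain 90% of E_V
print(replicates_for_fraction(0.95))   # 20 replicates retain 95% of E_V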

2.4 Other Techniques Based on Randomisation

We point out that the results discussed above do not apply only to bagging. Indeed, as pointed out in Sect. 2.2, the added error of $N$ combined classifiers (3) is given by (4) whenever the estimation errors of the different classifiers, $\epsilon_k(x, t_m)$, $m = 1, \ldots, N$, are i.i.d. random variables. Besides bagging, this is the case for any method for constructing classifier ensembles based on some randomisation process in which the outputs of the individual classifiers depend on i.i.d. random variables. For instance, this happens in the random subspace method [12], where each classifier is trained on a randomly selected feature subspace, in tree randomisation [5], where the split at each node is randomly selected among the k best splits, and even in the simple case of ensembles of neural networks trained using random initial weights. Therefore, the results of our analysis apply to all such techniques besides bagging. This opens an interesting perspective towards a unifying view of techniques for generating multiple classifiers based on randomisation.
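To make the shared structure explicit, here is a minimal random-subspace-style sketch (the function name and the default subspace size of half the features are arbitrary illustrative choices; scikit-learn's decision tree is again assumed as the base learner). The only change with respect to the bagging sketch of Sect. 1 is the source of i.i.d. randomisation: feature subsets instead of bootstrap replicates.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_posteriors(X_train, y_train, X_test, n_classifiers=10,
                               subspace_size=None, seed=0):
    # Each classifier is trained on an i.i.d. randomly drawn feature subset of the
    # full training set; the estimated posteriors are combined by simple averaging.
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    k = subspace_size if subspace_size is not None else max(1, d // 2)
    posterior_sum = 0.0
    for _ in range(n_classifiers):
        feats = rng.choice(d, size=k, replace=False)   # i.i.d. randomisation across classifiers
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[:, feats], y_train)
        posterior_sum = posterior_sum + tree.predict_proba(X_test[:, feats])
    return posterior_sum / n_classifiers               # simple-average combining rule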

3 Experimental Results

In this section we present an experimental evaluation, on real data sets, of the behaviour of bagging as a function of the number of bootstrap replicates $N$, and a comparison with the behaviour predicted by the theoretical model of Sect. 2. The experiments have been carried out on the same well known data sets used by Breiman in [3], i.e. Wisconsin breast cancer, Diabetes, Glass, Ionosphere, Soybean disease and Waveform. Decision trees have been used as base classifiers, and the linear combining rule (simple average) has been applied to the estimates of the a posteriori probabilities provided by the individual trees. All data sets were randomly subdivided into a training and a test set of the same relative size as in [3]. To repeat the experiments, we constructed ten different training sets by randomly sampling without replacement a subset of the patterns from the original training set. Since we were interested in the expected error rate of bagging as a function of $N$, with respect to all possible realisations of $N$ bootstrap replicates of a fixed training set, we estimated this value in each run by averaging over ten different sequences of $N$ bootstrap replicates of the same training set, for $N = 1, \ldots, 50$ (where the value for $N = 1$ refers to a single bootstrap replicate). The values obtained in the ten runs were then averaged again.

In Figs. 2 and 3 we report the test set average misclassification rate and the standard deviation of bagging for $N = 1, \ldots, 50$, and of a single tree trained on the whole training set. We point out that the average misclassification rate of bagging as a function of $N$ cannot be directly compared to eq. (4), since the latter refers to a single class boundary and is valid only under the assumptions of Sect. 2. However, as suggested by eq. (4), our aim was to investigate whether the observed average overall misclassification rate of bagging as a function of $N$, $E(N)$, can be modelled as

$$ E(N) = E_B + \frac{1}{N} E_V , \qquad (5) $$

i.e. as the sum of a constant term and a term decreasing as $1/N$, where $E_B + E_V$ is the average misclassification rate $E(1)$ of a single bootstrap replicate, $E_B$ corresponds to the sum of the Bayes error and the bias component, and $E_V$ to the variance component. To verify this hypothesis, we fitted the curve of eq. (5) to the value of $E(1)$ for $N = 1$ (since we assume $E(1) = E_B + E_V$) and to the value of $E(50)$ for $N = 50$ (which should be the most "stable" value of $E(N)$), obtaining $E_V = \left(1 - \frac{1}{50}\right)^{-1}[E(1) - E(50)]$, and thus $E_B = E(1) - \left(1 - \frac{1}{50}\right)^{-1}[E(1) - E(50)]$. In Figs. 2 and 3 we report the curve $E_B + \frac{1}{N} E_V$ (dashed line), to be compared with the experimentally observed values of $E(N)$ (black circles), for $N = 2, \ldots, 49$.

From Figs. 2 and 3 we can see that, despite its simplicity, expression (5) fits the average misclassification rate of bagging quite well on five out of the six data sets considered, the exception being Waveform for small $N$. On these five data sets, for values of $N$ lower than 10, which are of great interest in practical applications, the deviation between the predicted and observed values of the misclassification rate is often lower than 0.01. A higher deviation can be observed for the smallest values of $N$ (between 2 and 4, depending on the data set).
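The two-point fit described above amounts to a few lines; the following sketch (with made-up error rates standing in for the measured $E(1)$ and $E(50)$) returns the predicted curve of eq. (5):

def fit_error_curve(E1, EN_max, N_max=50):
    # Fit eq. (5) to the observed average error rates at N = 1 and N = N_max:
    # E(1) = E_B + E_V and E(N_max) = E_B + E_V / N_max.
    E_V = (E1 - EN_max) / (1.0 - 1.0 / N_max)
    E_B = E1 - E_V
    return {N: E_B + E_V / N for N in range(1, N_max + 1)}

# Illustrative (made-up) values in place of the measured E(1) and E(50):
curve = fit_error_curve(E1=0.080, EN_max=0.045)
print(round(curve[10], 4))   # predicted average error rate of 10 bagged classifiers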

Fig. 2. Average test set misclassification rate (error rate vs. N) of N bagged classifiers (black circles) for N = 1, ..., 50 (where N = 1 refers to a single bootstrap replicate) on the Breast Cancer, Diabetes, Glass and Ionosphere data sets, with the standard deviation shown as error bars. The dashed line represents the behaviour of the expected misclassification probability of bagging predicted by eq. (5). Horizontal lines represent the average misclassification rate (continuous line) and the standard deviation (dotted lines) of an individual classifier trained on the whole training set.

Fig. 3. Average test set misclassification rate (error rate vs. N) of N bagged classifiers on the Waveform and Soybean Disease data sets; see the caption of Fig. 2.

In any case, on all six data sets it is evident that for $N > 10$ the residual improvement attainable by bagging does not exceed 10% of that attained for $N = 10$, in agreement with eq. (5). It can also be seen that for low $N$ (between 2 and 5, depending on the data set) the average misclassification rate of bagging can be higher than that of an individual classifier trained on the whole training set, but becomes lower as $N$ increases. The only exception is the Soybean data set, where the average performance of bagging remains very close to that of the individual classifier even for high values of $N$. However, on all data sets the standard deviation of the misclassification rate of bagging is lower than that of the individual classifier for $N$ approximately greater than 5. This means that, for a fixed training set, bagging $N$ classifiers reduces the risk of obtaining a higher misclassification rate than an individual classifier trained on the whole training set, even when the average misclassification rates are very similar, as in the Soybean data set. In particular, the fact that the standard deviation of bagging can be quite high even for high values of $N$ (for instance, it is about 0.04 on the Glass and Soybean data sets) shows that classifier selection techniques could give a potential improvement, although no well grounded selection criterion has been proposed so far. Accordingly, the theoretically grounded guideline derived in this work can be considered a contribution towards effective selection criteria.

References

1. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 36 (1999) 105–139
2. Breiman, L.: Bias, variance, and arcing classifiers. Tech. Rep., Dept. of Statistics, Univ. of California (1995)
3. Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123–140
4. Buhlmann, P., Yu, B.: Analyzing bagging. Ann. Statist. 30 (2002) 927–961
5. Dietterich, T.G., Kong, E.B.: Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Tech. Rep., Dept. of Computer Science, Oregon State Univ. (1995)
6. Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Machine Learning 40 (1999) 1–22
7. Friedman, J.H.: On bias, variance, 0-1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1 (1997) 55–77
8. Friedman, J.H., Hall, P.: On bagging and nonlinear estimation. Tech. Rep., Stanford University, Stanford, CA (2000)
9. Fumera, G., Roli, F.: A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans. Pattern Analysis and Machine Intelligence, in press
10. Grandvalet, Y.: Bagging can stabilize without reducing variance. In: Proc. Int. Conf. Artificial Neural Networks. LNCS, Springer (2001) 49–56
11. Banfield, H., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: A new ensemble diversity measure applied to thinning ensembles. In: Kittler, J., Roli, F. (eds.): Proc. Int. Workshop Multiple Classifier Systems. LNCS Vol. 2096, Springer (2003) 306–316
12. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 (1998) 832–844
13. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, Hoboken, NJ (2004)
14. Kohavi, R., Wolpert, D.H.: Bias plus variance decomposition for zero-one loss functions. In: Saitta, L. (ed.): Proc. Int. Conf. Machine Learning. Morgan Kaufmann (1996) 275–283
15. Latinne, P., Debeir, O., Decaestecker, C.: Limiting the number of trees in random forests. In: Kittler, J., Roli, F. (eds.): Proc. Int. Workshop Multiple Classifier Systems. LNCS Vol. 2096, Springer (2001) 178–187
16. Skurichina, M., Duin, R.P.W.: Bagging for linear classifiers. Pattern Recognition 31 (1998) 909–930
17. Tibshirani, R.: Bias, variance and prediction error for classification rules. Tech. Rep., Dept. of Statistics, University of Toronto (1996)
18. Tumer, K.: Linear and order statistics combiners for reliable pattern classification. PhD dissertation, The University of Texas, Austin (1996)
19. Tumer, K., Ghosh, J.: Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29 (1996) 341–348
20. Tumer, K., Ghosh, J.: Linear and order statistics combiners for pattern classification. In: Sharkey, A.J.C. (ed.): Combining Artificial Neural Nets. Springer (1999) 127–155