Empirical comparison of bagging ensembles created using weak learners for a regression problem

Karol Bańczyk¹, Olgierd Kempa², Tadeusz Lasota², Bogdan Trawiński¹

¹ Wrocław University of Technology, Institute of Informatics, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
² Wrocław University of Environmental and Life Sciences, Dept. of Spatial Management, ul. Norwida 25/27, 50-375 Wrocław, Poland
{karol.banczyk, tadeusz.lasota}@wp.pl, [email protected], [email protected]

Abstract. Experiments aimed at comparing the performance of bagging ensembles using three different test sets, composed of base, out-of-bag, and 30% holdout instances, were conducted. Six weak learners implemented in the WEKA data mining system were applied: conjunctive rules, decision stump, decision table, pruned model trees, rule model trees, and a multilayer perceptron. All algorithms were employed on real-world datasets derived from the cadastral system and the registry of real estate transactions, cleansed by property valuation experts. The analysis of the results was performed using a recently proposed statistical methodology comprising nonparametric tests followed by post-hoc procedures designed especially for multiple n×n comparisons. The results showed the lowest prediction error with the Base test set only in the case of the model trees and a neural network.

Keywords: ensemble models, bagging, out-of-bag, property valuation, WEKA
1 Introduction

Bagging ensembles, which together with boosting belong to the most popular multi-model techniques, have attracted the attention of many researchers for the last fifteen years. Bagging, which stands for bootstrap aggregating, was devised by Breiman [2] and is one of the most intuitive and simplest ensemble algorithms providing good performance. Diversity of learners is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn with replacement from the original training set. The subsets obtained in this way, also called bags, are then used to train different classification and regression models. Finally, the individual learners are combined through an algebraic expression, such as minimum, maximum, sum, mean, product, median, etc. [19]. Theoretical analyses and experimental results have proved the benefits of bagging, especially in terms of stability improvement and variance reduction of learners, for both classification and regression problems [3], [8], [9]. This family of methods combines the output of machine learning systems, called "weak learners" in the literature due to their performance [20], in order to obtain smaller prediction errors (in regression) or lower error rates (in classification).
The individual estimators must provide different patterns of generalization; thus diversity plays a crucial role in the training process. Otherwise the ensemble, also called a committee, would consist of essentially the same predictor and provide no better accuracy than a single one. It has been shown that an ensemble performs better when each individual learner is accurate and, at the same time, the learners make their errors on different instances. The size of the bootstrapped replicas in bagging is usually equal to the number of instances in the original dataset, and the base dataset (Base) is commonly used as a test set for each generated component model. However, it is claimed that this yields an optimistically biased, i.e. too low, estimate of the prediction error. Therefore out-of-bag samples (OoB) are used as test sets, i.e. those instances which belong to the Base dataset but were not drawn into the respective bags. These, in turn, may produce a pessimistically biased, i.e. too high, estimate of the prediction error. In consequence, correction estimators have been proposed which are linear combinations of the errors provided by the Base and OoB test sets [4], [7]; a well-known example is given at the end of this section.

So far we have investigated several methods of constructing regression models to assist with real estate appraisal: evolutionary fuzzy systems, neural networks, decision trees, and statistical algorithms, using the MATLAB, KEEL, RapidMiner, and WEKA data mining systems [13], [15], [17]. We have also studied ensemble models created with various fuzzy systems, neural networks, support vector machines, regression trees, and statistical regression [14], [16], [18].

The main goal of the study presented in this paper was to investigate the usefulness of 17 machine learning algorithms, available as Java classes in the WEKA software, for creating ensemble models that provide better performance than their single base models. The algorithms were applied to the real-world regression problem of predicting the prices of residential premises, based on historical data of sales/purchase transactions obtained from a cadastral system. In this paper a part of the results is presented, comprising six of the eight models which revealed a prediction error reduction of bagging ensembles compared to base models. The models were built using weak learners, including three simple ones: conjunctive rules, decision stump, and decision table, as well as pruned model trees, rule model trees, and a multilayer perceptron. The experiments also aimed to compare the impact of three different test sets, composed of base, out-of-bag, and 30% holdout instances, on the performance of bagging ensembles.
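As the example announced above: the .632 bootstrap estimator of Efron and Tibshirani [7] (a simplified sketch of the idea; their refined .632+ variant additionally adjusts the weights for overfitting) combines the two biased error estimates in fixed proportions:

    Err(.632) = 0.368 · err(Base) + 0.632 · err(OoB),

where err(Base) denotes the optimistic error measured on the Base set, err(OoB) the pessimistic out-of-bag error, and the weight 0.632 ≈ 1 − 1/e is the expected fraction of distinct original instances appearing in a bootstrap sample.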
2 Algorithms Used and Plan of Experiments

The investigation was conducted with an experimental multi-agent system [1] implemented in Java using learner classes available in the WEKA library. WEKA (Waikato Environment for Knowledge Analysis), a non-commercial and open-source data mining system, comprises tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes [5], [21]. WEKA encompasses many algorithms for classification and for numeric prediction, i.e. regression problems; the latter are interpreted as prediction of a continuous class. In our experiments we employed 17 learners taken from the WEKA library to create bagging ensembles, in order to examine how much they improve the performance of models to assist with real estate appraisal compared to single base models.
However, the following nine ensemble models did not provide a lower prediction error: GaussianProcesses, IBk, IsotonicRegression, KStar, LeastMedSq, LinearRegression, PaceRegression, REPTree, and SMOreg. In the case of the ConjunctiveRule, DecisionStump, DecisionTable, M5P, M5Rules, MultilayerPerceptron, LWL, and RBFNetwork learners, the bagging ensembles revealed better performance. Due to the limited space, only the results referring to the first six are presented in the paper; all of them belong to weak learners.

CJR – ConjunctiveRule. This class implements a single conjunctive rule learner. A rule consists of antecedents combined with the operator AND and a consequent for the classification/regression. If a test instance is not covered by the rule, it is predicted using the default value of the training data not covered by the rule. This learner selects an antecedent by computing the information gain of each antecedent and prunes the generated rule. For regression, the information is the weighted average of the mean squared errors of both the data covered and not covered by the rule.

DST – DecisionStump. Class for building and using a decision stump. It builds one-level binary decision trees for datasets with a categorical or numeric class, dealing with missing values by treating them as a separate value and extending a third branch from the stump. Regression is done based on the mean squared error.

DTB – DecisionTable. Class for building and using a simple decision table majority classifier. It evaluates feature subsets using best-first search and can use cross-validation for evaluation. An option uses the nearest-neighbour method, based on the same set of features, to determine the class of each instance that is not covered by a decision table entry, instead of the table's global majority.

M5P – Pruned Model Tree. Implements routines for generating M5 model trees. The algorithm is based on decision trees; however, instead of having values at the tree's nodes, it contains a multivariate linear regression model at each node. The input space is divided into cells using the training data and their outcomes, and then a regression model is built in each cell as a leaf of the tree.

M5R – M5Rules. Generates a decision list for regression problems using separate-and-conquer. The algorithm divides the parameter space into areas (subspaces) and builds a linear regression model in each of them. In each iteration an M5 model tree is generated and its "best" leaf is turned into a rule according to a given heuristic; the algorithm terminates when all the examples are covered.

MLP – MultiLayerPerceptron. A classifier that uses backpropagation to learn from instances. The network can be built by hand, created by an algorithm, or both, and it can be monitored and modified during training time. All the nodes in the network are sigmoid, except that when the class is numeric the output nodes become unthresholded linear units.

The dataset used in the experiments was drawn from a rough dataset containing over 50,000 records referring to residential premises transactions conducted in one big Polish city with a population of 640,000 within the eleven years from 1998 to 2008. In this period most transactions were made at non-market prices, when the council was selling flats to their current tenants on preferential terms.
First of all, transactional records referring to residential premises sold at market prices were selected. Then the dataset was confined to sales transaction data of apartments built
before 1997 and where the land was leased on terms of perpetual usufruct. The following five features were pointed out as the main drivers of premises prices: usable area of the premises, age of the building, number of rooms in the flat, number of storeys in the building, and distance from the city centre. The final dataset comprised 5303 records. Due to the fact that the prices of premises change substantially over time, the whole 11-year dataset could not be used to create data-driven machine learning models. It was therefore split into subsets covering individual years, on the assumption that within one year the prices of premises with similar attributes were roughly comparable. The sizes of the one-year data subsets are given in Table 1.

Table 1. Number of instances in one-year datasets

Year        1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
Instances    269   477   329   463   530   653   546   580   677   575   204
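For illustration, the six weak learners described earlier in this section can be instantiated through the WEKA Java API as in the minimal sketch below (class names as in WEKA 3.x; all parameters are left at their defaults here, whereas in the actual experiments the settings were tuned by trial and error):

    import weka.classifiers.Classifier;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.classifiers.rules.ConjunctiveRule;
    import weka.classifiers.rules.DecisionTable;
    import weka.classifiers.rules.M5Rules;
    import weka.classifiers.trees.DecisionStump;
    import weka.classifiers.trees.M5P;

    public class WeakLearners {
        // The six component learners compared in the paper, under default settings
        public static final Classifier[] LEARNERS = {
            new ConjunctiveRule(),        // CJR - single conjunctive rule
            new DecisionStump(),          // DST - one-level decision tree
            new DecisionTable(),          // DTB - decision table majority learner
            new M5P(),                    // M5P - pruned model tree
            new M5Rules(),                // M5R - rules extracted from model trees
            new MultilayerPerceptron()    // MLP - backpropagation neural network
        };
    }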
Three series of experiments were conducted, each for a different arrangement of training and test sets, for each one-year dataset separately. In the first two cases the base dataset comprised the whole one-year dataset, whereas in the third it was composed of the greater part obtained as the result of a 70%/30% random split of the one-year dataset. On the basis of each base dataset, 50 bootstrap replicates (bags) were created. These replicates were then used as training sets to generate models employing the individual learners. In order to assess the predictive capability of each model, three different test sets were used, namely the base dataset, the out-of-bag instances, and the 30% split, denoted in the rest of the paper as Base, OoB, and 30%H respectively. Diagrams illustrating the respective experiments are shown in Fig. 1, 2, and 3.

Normalization of the data was performed using the min-max approach, i.e. each feature x was rescaled to [0,1] as x' = (x − xmin)/(xmax − xmin). As performance functions, the root mean square error (RMSE) and the Correlation between predicted and actual values were used. As aggregation functions, averages were employed. Preliminary tuning tests were accomplished using the trial-and-error method in order to determine the best parameter settings of each learner for each arrangement. In order to determine the performance of the base single models, 10-fold cross-validation experiments were conducted.

Statistical analysis of the results of the experiments was performed using Wilcoxon signed-rank tests and recently proposed procedures adequate for multiple comparisons of many learning algorithms over multiple datasets [6], [10], [11], [12]. Their authors argue that the commonly used paired tests, i.e. the parametric t-test and its nonparametric alternative the Wilcoxon signed-rank test, are not adequate when conducting multiple comparisons due to the so-called multiplicity effect. They recommend the following methodology. First, the Friedman test or its more powerful derivative, the Iman and Davenport test, is carried out. Both tests can only inform the researcher about the presence of differences among all the compared samples of results. After the null hypotheses have been rejected, one can proceed with the post-hoc procedures in order to find the particular pairs of algorithms which differ. They comprise Bonferroni-Dunn's, Holm's, and Hochberg's procedures in the case of 1×n comparisons, and Nemenyi's, Shaffer's, and Bergmann-Hommel's procedures in the case of n×n comparisons. We used the Java programs available on the web page of the Research Group "Soft Computing and Intelligent Information Systems" at the University of Granada (http://sci2s.ugr.es/sicidm).
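The sketch below illustrates this scheme for a single learner and the Base test set (a minimal sketch assuming the WEKA 3.x Java API; the file name and the fixed random seed are illustrative only, and the actual investigation used a multi-agent implementation [1]). It creates 50 bootstrap bags, trains a component model on each, averages the component predictions, and scores the ensemble with RMSE; the OoB and 30%H arrangements differ only in which instances each model is scored on (the instances not drawn into the given bag, and a held-out 30% split, respectively):

    import java.util.Random;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaggingExperimentSketch {
        public static void main(String[] args) throws Exception {
            // Load one one-year dataset (the file name is illustrative only)
            Instances base = DataSource.read("data/premises-2003.arff");
            base.setClassIndex(base.numAttributes() - 1);   // price is the class attribute

            final int numBags = 50;          // 50 bootstrap replicates, as in the experiments
            Random rnd = new Random(1);
            double[] sumPred = new double[base.numInstances()];

            for (int b = 0; b < numBags; b++) {
                Instances bag = base.resample(rnd);   // sampling with replacement, same size as base
                M5P model = new M5P();                // component learner; any of the six can be used
                model.buildClassifier(bag);
                for (int i = 0; i < base.numInstances(); i++) {
                    sumPred[i] += model.classifyInstance(base.instance(i));
                }
            }

            // Ensemble output = average of component predictions (mean aggregation),
            // scored here on the Base test set with RMSE
            double sse = 0;
            for (int i = 0; i < base.numInstances(); i++) {
                double diff = sumPred[i] / numBags - base.instance(i).classValue();
                sse += diff * diff;
            }
            System.out.printf("Ensemble RMSE (Base test set): %.4f%n",
                    Math.sqrt(sse / base.numInstances()));
        }
    }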
Fig. 1. Schema of the experiments with Base test set (Base).
Fig. 2. Schema of the experiments with Out-of-bag test set (OoB).
Fig. 3. Schema of the experiments with 30% holdout test set (30%H).
3 Results of Experiments

The performance of the ensemble models built by CJR, DST, DTB, M5P, M5R, and MLP, in terms of RMSE and Correlation, is presented in Fig. 4–9 respectively. Each bar chart illustrates the relationship among the outcomes of the models with the Base, OoB, and 30%H test sets for the successive one-year datasets. Only the M5P, M5R, and MLP models confirm the observation, reported by many authors, that the Base test set provides an optimistic and OoB a pessimistic estimation of model accuracy; in turn, the 30%H test set gives higher values of RMSE because the training sets contained smaller numbers of instances. In the case of DST and DTB, the models revealed the same performance for both the Base and OoB test sets. For the Correlation between predicted and actual values exactly the inverse relationship could be observed; for this performance measure, the higher the value, the better.
Fig. 4. Comparison of ConjunctiveRule bagging ensembles using different test sets, in terms of RMSE (left chart) and Correlation (right chart)
Fig. 5. Comparison of DecisionStump bagging ensembles using different test sets, in terms of RMSE (left chart) and Correlation (right chart)
Fig. 6. Comparison of DecisionTable bagging ensembles using different test sets, in terms of RMSE (left chart) and Correlation (right chart)
Fig. 7. Comparison of M5P bagging ensembles using different test sets, in terms of RMSE (left chart) and Correlation (right chart)
Fig. 8. Comparison of M5Rules bagging ensembles using different test sets, in terms of RMSE (left chart) and Correlation (right chart)
Fig. 9. Comparison of MultiLayerPerceptron bagging ensembles using different test sets, in terms of RMSE (left chart) and Correlation (right chart)

Table 2. Results of Wilcoxon tests, in terms of RMSE, for bagging ensembles with consecutive pairs of test sets

Test sets       CJR   DST   DTB   M5P   M5R   MLP
Base vs OoB      ≈     ≈     ≈     +     +     +
Base vs 30%H     –     ≈     +     +     +     +
OoB vs 30%H      ≈     ≈     +     –     –     +
In Table 2 the results of the nonparametric Wilcoxon signed-rank tests evaluating the outcomes of the ensemble models using the Base, OoB, and 30%H test sets are presented. The null hypothesis stated that there were no significant differences in accuracy, in terms of RMSE, between the given pairs of models. In Table 2, +, –, and ≈ denote that the first test set in a pair performed significantly better than, significantly worse than, or statistically equivalently to the second one, respectively. The main outcome is as follows: for M5P, M5R, and MLP, the ensembles with the Base test sets showed significantly better performance than those with the OoB and 30%H test sets. For the simple learners the results were not so clear-cut: only for DTB did the models with the Base and OoB test sets reveal significantly lower RMSE than with 30%H, and the CJR ensemble with the 30%H test set was significantly better than with Base.
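For reference, the Wilcoxon signed-rank test applied above follows the formulation in [6] (sketched here without the tie-handling details): the absolute differences di in RMSE between the two compared variants over the N = 11 one-year datasets are ranked, and the rank sums of the positive and negative differences are compared,

    R+ = Σ(di>0) rank(|di|),   R− = Σ(di<0) rank(|di|),   T = min(R+, R−).

The null hypothesis of equal performance is rejected when T is smaller than the critical value for N = 11 at α = 0.05.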
Fig. 10. Performance of bagging ensembles with Base test set (left chart), OoB test set (middle chart), and 30%H test set (right chart) for the 2003 dataset
The experiments also allowed for a comparison of the ensembles built with the individual algorithms. For illustration, Fig. 10 shows the results for the models with the Base, OoB, and 30%H test sets for the 2003 one-year dataset; the charts for the other datasets are similar. Statistical tests adequate for multiple comparisons were made for all six algorithms over all 11 one-year datasets. These tests, described in the previous section, were accomplished for the models with the Base, OoB, and 30%H test sets separately. The Friedman and Iman-Davenport tests, which use the χ² and F statistics respectively, were performed on the average ranks. The calculated values of the χ² statistic were 52.92, 52.14, and 54.06 for the models with the Base, OoB, and 30%H test sets respectively, and the F statistics were 254.68, 182.49, and 578.19, whereas the critical values at α=0.05 are χ²(5)=11.07 and F(5,50)=2.40. This means that there are significant differences between some models. The average ranks of the individual ensembles are shown in Table 3, where the lower the rank value, the better the model. Thus, we were justified in proceeding to the post-hoc procedures. In Tables 4–6 the adjusted p-values of the Nemenyi, Holm, Shaffer, and Bergmann-Hommel procedures for n×n comparisons, in terms of RMSE of the bagging ensembles with the Base, OoB, and 30%H test sets respectively, are shown. In all tables, p-values lower than α=0.05 indicate that the respective models differ significantly.

Table 3. Average rank positions of model performance in terms of RMSE

Test set   Base         OoB          30%H
1st        M5R (1.45)   M5P (1.45)   M5R (1.18)
2nd        M5P (1.55)   DTB (1.64)   M5P (1.82)
3rd        DTB (3.00)   M5R (2.91)   DTB (3.00)
4th        MLP (4.00)   MLP (4.00)   MLP (4.00)
5th        CJR (5.09)   CJR (5.09)   CJR (5.00)
6th        DST (5.91)   DST (5.91)   DST (6.00)
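For reference, the statistics quoted above follow the standard formulation in [6]: with k = 6 algorithms ranked on N = 11 one-year datasets and average ranks Rj as in Table 3,

    χ²F = (12N / (k(k+1))) · (Σj Rj² − k(k+1)²/4),   FF = ((N−1) · χ²F) / (N(k−1) − χ²F),

where χ²F is compared against the χ² distribution with k−1 = 5 degrees of freedom and FF against the F distribution with (k−1, (k−1)(N−1)) = (5, 50) degrees of freedom. For example, the Base ranks in Table 3 yield χ²F ≈ 52.9 and FF ≈ 254.7, in agreement with the values reported above.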
Table 4. Adjusted p-values for n×n comparisons in terms of RMSE of bagging ensembles with Base test sets, showing 8 hypotheses rejected out of 15

Alg vs Alg     pUnadj     pNeme      pHolm      pShaf      pBerg
DST vs M5R     2.35E-08   3.52E-07   3.52E-07   3.52E-07   3.52E-07
DST vs M5P     4.50E-08   6.75E-07   6.30E-07   4.50E-07   4.50E-07
CJR vs M5R     5.15E-06   7.73E-05   6.70E-05   5.15E-05   5.15E-05
CJR vs M5P     8.81E-06   1.32E-04   1.06E-04   8.81E-05   5.29E-05
DST vs DTB     2.66E-04   0.003984   0.002921   0.002656   0.001859
M5R vs MLP     0.001418   0.021275   0.014183   0.014183   0.009928
M5P vs MLP     0.002091   0.031371   0.018823   0.014640   0.009928
CJR vs DTB     0.008765   0.131472   0.070119   0.061354   0.035059
Table 5. Adjusted p-values for n×n comparisons in terms of RMSE of bagging ensembles with OoB test sets, showing 8 hypotheses rejected out of 15

Alg vs Alg     pUnadj     pNeme      pHolm      pShaf      pBerg
DST vs M5P     2.35E-08   3.52E-07   3.52E-07   3.52E-07   3.52E-07
DST vs DTB     8.50E-08   1.28E-06   1.19E-06   8.50E-07   8.50E-07
CJR vs M5P     5.15E-06   7.73E-05   6.70E-05   5.15E-05   5.15E-05
CJR vs DTB     1.49E-05   2.23E-04   1.79E-04   1.49E-04   8.93E-05
DST vs M5R     1.69E-04   0.002542   0.001864   0.001694   0.001186
M5P vs MLP     0.001418   0.021275   0.014183   0.014183   0.009928
DTB vs MLP     0.003047   0.045702   0.027421   0.021328   0.012187
CJR vs M5R     0.006237   0.093555   0.049896   0.043659   0.024948
Table 6. Adjusted p-values for n×n comparisons in terms of RMSE of bagging ensembles with 30%H test sets, showing 7 hypotheses rejected out of 15

Alg vs Alg     pUnadj     pNeme      pHolm      pShaf      pBerg
DST vs M5R     1.54E-09   2.31E-08   2.31E-08   2.31E-08   2.31E-08
DST vs M5P     1.59E-07   2.38E-06   2.22E-06   1.59E-06   1.59E-06
CJR vs M5R     1.70E-06   2.55E-05   2.21E-05   1.70E-05   1.70E-05
CJR vs M5P     6.65E-05   9.97E-04   7.98E-04   6.65E-04   3.99E-04
DST vs DTB     1.69E-04   0.002542   0.001864   0.001694   0.001186
M5R vs MLP     4.11E-04   0.006168   0.004112   0.004112   0.002879
M5P vs MLP     0.006237   0.093555   0.056133   0.043659   0.024948
CJR vs DTB     0.012172   0.182573   0.097372   0.085201   0.048686
The main observations are as follows: M5P, M5R, and DTB revealed significantly better performance than the CJR and DST models for all three test sets. No significant differences were found within the group of M5P, M5R, and DTB models, nor within the group of CJR, DST, and MLP models. M5P and M5R were significantly better than MLP for two test sets, and DTB was significantly better than MLP for one test set.
4 Conclusions and Future Work

Experiments aimed at comparing the performance of bagging ensembles using three different test sets, composed of base, out-of-bag, and 30% holdout instances, were conducted. Six weak learners implemented in the WEKA data mining system were applied: conjunctive rules, decision stump, decision table, pruned model trees, rule model trees, and a multilayer perceptron. All algorithms were employed on the regression problem of property valuation, using real-world data derived from the cadastral system and cleansed by property valuation experts.

The expected pattern of the lowest prediction error with the Base test set and the highest error with the 30%H test set could be observed only in the case of the model trees and the neural network. The model trees and decision tables revealed significantly better performance than the conjunctive rule and decision stump learners.

It is planned to explore resampling methods ensuring faster data processing, such as random subspaces and subsampling, as well as techniques for determining the optimal sizes of multi-model solutions, leading to both low prediction error and an appropriate balance between accuracy and complexity.

Acknowledgments. This paper was partially supported by the Ministry of Science and Higher Education of Poland under grant no. N N519 407437.
References

1. Bańczyk, K.: Multi-agent system based on heterogeneous ensemble machine learning models. Master's Thesis, Wrocław University of Technology, Wrocław, Poland (2011)
2. Breiman, L.: Bagging Predictors. Machine Learning 24:2, pp. 123–140 (1996)
3. Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics 30, pp. 927–961 (2002)
4. Cordón, O., Quirin, A.: Comparing Two Genetic Overproduce-and-choose Strategies for Fuzzy Rule-based Multiclassification Systems Generated by Bagging and Mutual Information-based Feature Selection. International Journal of Hybrid Intelligent Systems 7:1, pp. 45–64 (2010)
5. Cunningham, S.J., Frank, E., Hall, M., Holmes, G., Trigg, L., Witten, I.H.: WEKA: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, New Zealand (2005)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, pp. 1–30 (2006)
7. Efron, B., Tibshirani, R.J.: Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92:438, pp. 548–560 (1997)
8. Friedman, J.H., Hall, P.: On bagging and nonlinear estimation. Journal of Statistical Planning and Inference 137:3, pp. 669–683 (2007)
9. Fumera, G., Roli, F., Serrau, A.: A theoretical analysis of bagging as a linear combination of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 30:7, pp. 1293–1299 (2008)
10. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180, pp. 2044–2064 (2010)
11. García, S., Fernández, A., Luengo, J., Herrera, F.: A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing 13:10, pp. 959–977 (2009)
12. García, S., Herrera, F.: An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9, pp. 2677–2694 (2008)
13. Graczyk, M., Lasota, T., Trawiński, B.: Comparative Analysis of Premises Valuation Models Using KEEL, RapidMiner, and WEKA. In: Nguyen, N.T., et al. (eds.): ICCCI 2009, LNCS (LNAI) 5796, pp. 800–812. Springer, Heidelberg (2009)
14. Graczyk, M., Lasota, T., Trawiński, B., Trawiński, K.: Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal. In: Nguyen, N.T., et al. (eds.): ACIIDS 2010, Part II, LNCS (LNAI) 5991, pp. 340–350. Springer, Heidelberg (2010)
15. Król, D., Lasota, T., Trawiński, B., Trawiński, K.: Investigation of Evolutionary Optimization Methods of TSK Fuzzy Model for Real Estate Appraisal. International Journal of Hybrid Intelligent Systems 5:3, pp. 111–128 (2008)
16. Krzystanek, M., Lasota, T., Telec, Z., Trawiński, B.: Analysis of Bagging Ensembles of Fuzzy Models for Premises Valuation. In: Nguyen, N.T., Le, M.T., Świątek, J. (eds.): ACIIDS 2010, Part II, LNCS (LNAI) 5991, pp. 330–339. Springer, Heidelberg (2010)
17. Lasota, T., Mazurkiewicz, J., Trawiński, B., Trawiński, K.: Comparison of Data Driven Models for the Validation of Residential Premises using KEEL. International Journal of Hybrid Intelligent Systems 7:1, pp. 3–16 (2010)
18. Lasota, T., Telec, Z., Trawiński, B., Trawiński, K.: Exploration of Bagging Ensembles Comprising Genetic Fuzzy Models to Assist with Real Estate Appraisals. In: Yin, H., Corchado, E. (eds.): IDEAL 2009, LNCS 5788, pp. 554–561. Springer, Heidelberg (2009)
19. Polikar, R.: Ensemble Learning. Scholarpedia 4:1, p. 2776 (2009)
20. Schapire, R.E.: The Strength of Weak Learnability. Machine Learning 5:2, pp. 197–227 (1990)
21. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)