Empirical comparison of resampling methods using genetic neural networks for a regression problem

Tadeusz Lasota¹, Zbigniew Telec², Grzegorz Trawiński³, Bogdan Trawiński²

¹ Wrocław University of Environmental and Life Sciences, Dept. of Spatial Management, ul. Norwida 25/27, 50-375 Wrocław, Poland
² Wrocław University of Technology, Institute of Informatics, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
³ Wrocław University of Technology, Faculty of Electronics, Wybrzeże S. Wyspiańskiego 27, 50-370 Wrocław, Poland

[email protected], [email protected], {zbigniew.telec, bogdan.trawinski}@pwr.wroc.pl
Abstract. This paper presents an investigation of m-out-of-n bagging with and without replacement using genetic neural networks. The study was conducted with a newly developed system, implemented in Matlab, for generating and testing hybrid and multiple models of computational intelligence with different resampling methods. All experiments were carried out on real-world data derived from a cadastral system and a registry of real estate transactions. The performance of the following methods was compared: classic bagging, out-of-bag, Efron's .632 correction, and repeated holdout. The overall result of our investigation was that the bagging ensembles created using genetic neural networks revealed prediction accuracy not worse than the experts' method employed in practice.

Keywords: ensemble models, genetic neural networks, bagging, subagging.
1 Introduction

Bagging, one of the most effective computationally intensive procedures for improving unstable regressors and classifiers [23], has attracted the attention of many researchers for the last fifteen years. Bagging, which stands for bootstrap aggregating, was devised by Breiman [3] and belongs to the most intuitive and simplest ensemble algorithms that provide good performance. Diversity of learners is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn with replacement from the original base dataset. The resulting training subsets, also called bags, are then used to train different classification or regression models. Finally, the individual learners are combined through an algebraic expression, such as minimum, maximum, sum, mean, product, median, etc. [22]. The classic form of bagging is the n-out-of-n bootstrap with replacement, where the number of samples in each bag equals the cardinality of the base dataset and the whole original dataset is used as a test set. In order to achieve better computational effectiveness, less demanding techniques were introduced, which consist in drawing smaller numbers of samples from the original dataset, with or
without replacement. The m-out-of-n bagging without replacement, where at each step m observations (m < n) are drawn at random from the base dataset without repetition, belongs to such variants. This alternative aggregation scheme was called subagging (subsample aggregating) by Bühlmann and Yu [4]. In the literature, resampling methods of the same nature as subagging are also named Monte Carlo cross-validation [20] or repeated holdout [2]. In turn, subagging with replacement was called moon-bagging by Biau et al. [1], standing for m-out-of-n bootstrap aggregating. The statistical mechanisms of the above-mentioned resampling techniques are not yet fully understood and remain under active theoretical and experimental investigation [1], [2], [4], [5], [8], [9], [20]. Theoretical analyses and experimental results to date have demonstrated the benefits of bagging, especially in terms of stability improvement and variance reduction of learners, for both classification and regression problems. Bagging techniques both with and without replacement may improve prediction accuracy in a range of settings. Moreover, the n-out-of-n bootstrap with replacement and n/2-out-of-n sampling without replacement, i.e. half-sampling, may give fairly similar results. The size of the bootstrapped replicas in bagging is usually equal to the number of instances in the original dataset, and the base dataset is commonly used as a test set for each generated component model. However, it is claimed that this leads to an optimistic underestimation of the prediction error. Therefore, the out-of-bag samples, i.e. those included in the base dataset but not drawn into the respective bags, are used to estimate the test error. These, in turn, may cause a pessimistic overestimation of the prediction error. In consequence, a correction of the out-of-bag prediction error was proposed [2], [7]. Among soft computing techniques employed to assist with real estate appraisal, the main focus has been on neural networks [14], [19], [21], [24].
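The m-out-of-n sampling schemes discussed above can be sketched in a few lines. The following is a minimal pure-Python illustration (not the paper's Matlab implementation; the function names are our own):

```python
import random
from statistics import mean

def draw_bags(data, m, n_bags, with_replacement=True, seed=0):
    """Draw n_bags training subsets of size m from data.

    with_replacement=True yields bootstrap bags (classic bagging when
    m == len(data)); with_replacement=False yields subagging /
    repeated-holdout subsets (requires m <= len(data))."""
    rng = random.Random(seed)
    if with_replacement:
        return [[rng.choice(data) for _ in range(m)] for _ in range(n_bags)]
    return [rng.sample(data, m) for _ in range(n_bags)]

def aggregate(predictions):
    """Combine component-model outputs by a simple average."""
    return mean(predictions)

# Half-sampling without replacement: 50 bags of size n/2
base = list(range(100))
bags = draw_bags(base, m=50, n_bags=50, with_replacement=False)
# Out-of-bag samples: base instances not drawn into a given bag
oob = [sorted(set(base) - set(bag)) for bag in bags]
```

For m = n with replacement this reproduces the classic n-out-of-n bootstrap; shrinking m gives the B70/B50/B30-style variants studied later in the paper.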
So far, we have investigated several methods of constructing regression models to assist with real estate appraisal: evolutionary fuzzy systems, neural networks, decision trees, and statistical algorithms using the MATLAB, KEEL, RapidMiner, and WEKA data mining systems [11], [15], [17]. We have also studied bagging ensemble models created with these computational intelligence techniques [12], [16], [18]. The goal of the study presented in this paper was to compare m-out-of-n bagging with and without replacement, with different sample sizes, against a property valuation method employed by professional appraisers in practice and against standard 10-fold cross-validation. Genetic neural networks were applied to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions obtained from a cadastral system. The investigation was conducted with a newly developed system, implemented in Matlab, for generating and testing hybrid and multiple models of computational intelligence using different resampling methods.
2 Methods Used and Experimental Setup

The investigation was conducted with our new experimental system implemented in the Matlab environment using the Neural Network, Fuzzy Logic, Global Optimization, and Statistics toolboxes [6], [10]. The system was designed to carry out research into
machine learning algorithms using various resampling methods and to construct and evaluate ensemble models for regression problems. The real-world dataset used in the experiments was drawn from a rough dataset containing over 50,000 records of residential premises transactions accomplished in a large Polish city with a population of 640,000 within the 11 years from 1998 to 2008. The final dataset comprised 5,213 samples for which the experts could estimate the value using their pairwise comparison method. Since the prices of premises change substantially over time, the whole 11-year dataset could not be used to create data-driven models; it was therefore split into 20 half-year subsets. The sizes of the half-year data subsets are given in Table 1.

Table 1. Number of instances in half-year datasets

1998-2  202    2003-2  386
1999-1  213    2004-1  278
1999-2  264    2004-2  268
2000-1  162    2005-1  244
2000-2  167    2005-2  336
2001-1  228    2006-1  300
2001-2  235    2006-2  377
2002-1  267    2007-1  289
2002-2  263    2007-2  286
2003-1  267    2008-1  181
In order to compare evolutionary machine learning algorithms with the techniques applied to property valuation, we asked experts to evaluate premises using their pairwise comparison method applied to historical data of sales/purchase transactions recorded in a cadastral system. The experts developed a computer program which simulated their routine work and was able to estimate the experts' prices of a great number of premises automatically. First of all, the whole area of the city was divided into 6 quality zones. Next, the premises located in each zone were classified into 243 groups determined by the following five quantitative features selected as the main price drivers: Area, Year, Storeys, Rooms, and Centre. The domain of each feature was split into three brackets as follows. Area denotes the usable area of the premises and comprises small flats up to 40 m2, medium flats in the bracket 40 to 60 m2, and big flats above 60 m2. Year (Age) means the year of a building's construction and distinguishes old buildings constructed before 1945, medium-age ones built in the period 1945 to 1960, and new buildings constructed between 1960 and 1996; buildings falling into these ranges are treated as in bad, medium, and good physical condition, respectively. Storeys reflects the height of a building and distinguishes low houses up to three storeys, multi-family houses of 4 to 5 storeys, and tower blocks above 5 storeys. Rooms designates the number of rooms in a flat, including the kitchen; the data contain small flats of up to 2 rooms, medium flats in the bracket 3 to 4, and big flats above 4 rooms. Centre stands for the distance from the city centre and distinguishes buildings located near the centre, i.e. up to 1.5 km, at a medium distance from the centre, in the bracket 1.5 to 5 km, and far from the centre, above 5 km. Then the prices of the premises were updated according to the trends of value changes over time.
Starting from the second half-year of 1998, the prices were updated for the last day of consecutive half-years. The trends were modelled by polynomials of degree three. The premises estimation procedure employed a two-year time window to take into consideration transaction data of similar premises.
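The three-bracket split of the five price drivers described above can be sketched as follows. This is our own Python illustration of the grouping idea; the function names, field names, and dictionary layout are assumptions, not the experts' actual program:

```python
def bracket(value, low, high):
    """Map a numeric feature into one of three brackets:
    0 = low, 1 = medium, 2 = high."""
    if value <= low:
        return 0
    if value <= high:
        return 1
    return 2

def group_key(flat):
    """Assign premises to one of the 3**5 = 243 groups within a quality
    zone, using the bracket boundaries given in the text."""
    return (
        bracket(flat["area"], 40, 60),       # usable area in m2
        bracket(flat["year"], 1944, 1960),   # construction year (old/medium/new)
        bracket(flat["storeys"], 3, 5),      # building height in storeys
        bracket(flat["rooms"], 2, 4),        # rooms, including the kitchen
        bracket(flat["centre"], 1.5, 5.0),   # distance from the centre in km
    )

# A hypothetical flat: 52 m2, built 1978, 4 storeys, 3 rooms, 2 km from centre
flat = {"area": 52, "year": 1978, "storeys": 4, "rooms": 3, "centre": 2.0}
key = group_key(flat)  # a 5-tuple identifying the flat's group
```

With three brackets per feature, the five-tuple key indeed distinguishes 3⁵ = 243 groups per zone, matching the number quoted in the text.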
1. Take the next premises to estimate.
2. Check the completeness of the values of all five features and note the transaction date.
3. Select all premises sold earlier than the one being appraised, within the current and one preceding year, and assigned to the same group.
4. If there are at least three such premises, calculate the average price, taking the prices updated for the last day of the given half-year.
5. Return this average as the estimated value of the premises.
6. Repeat steps 1 to 5 for all premises to be appraised.
7. For all premises not satisfying the condition in step 4, extend the quality zones by merging zones 1 & 2, 3 & 4, and 5 & 6. Moreover, extend the time window to include the current and two preceding years.
8. Repeat steps 1 to 5 for all remaining premises.

In our study we employed an evolutionary approach to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions obtained from a cadastral system, namely genetic neural networks (GNN). Our GNN approach consisted in evolving the connection weights of a feedforward backpropagation network with a predefined architecture comprising five neurons in the input and hidden layers. The whole set of weights in a chromosome was represented by real numbers. Similar solutions are described in [13], [25]. The following resampling methods and their variants were applied in the experiments and compared with standard 10-fold cross-validation (10cv) and the experts' method.

Classic: B100, B70, B50, B30 – m-out-of-n bagging with replacement with different sample sizes, using the whole base dataset as a test set. The numbers in the codes indicate what percentage of the base dataset was drawn to create each training set.

OoB: O100, O70, O50, O30 – m-out-of-n bagging with replacement with different sample sizes, tested on the out-of-bag (OoB) datasets. The numbers in the codes indicate what percentage of the base dataset was drawn to create each training set.
Efron's .632: E100, E70, E50, E30 – models representing Efron's .632 bootstrap method, which corrects the out-of-bag prediction error using a weighted average of the OoB error and the so-called resubstitution (apparent) error, with weights equal to 0.632 and 0.368, respectively [2], [7].

k-Holdout: H90, H70, H50, H30 – m-out-of-n bagging without replacement with different sample sizes. The numbers in the codes indicate what percentage of the base dataset was drawn to create each training set.

In the case of the bagging methods, 50 bootstrap replicates (bags) were created from each base dataset; the mean square error (MSE) was used as the performance function, and simple averages were employed as the aggregation function. The data were normalized using the min-max approach.
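The .632 correction amounts to a fixed weighted average of the two error estimates; a one-line sketch in Python (our own notation):

```python
def efron_632_error(oob_error, resubstitution_error):
    """Efron's .632 bootstrap estimate: a weighted average of the
    pessimistic out-of-bag error and the optimistic resubstitution
    (apparent) error, with weights 0.632 and 0.368."""
    return 0.632 * oob_error + 0.368 * resubstitution_error

# Example with illustrative (made-up) MSE values
err = efron_632_error(oob_error=0.020, resubstitution_error=0.010)
```

The corrected estimate always lies between the two inputs, pulling the OoB error toward the apparent error.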
3 Results of Experiments

The performance of the Classic, OoB, Efron's .632, and k-Holdout models created by genetic neural networks (GNN), in terms of MSE, is illustrated graphically in Figures 1 and 2, respectively. In each figure, the results for the 10cv and Expert methods are shown for comparison. The Friedman test performed with respect to the MSE values of all models built over the 20 half-year datasets showed that there are significant differences between some models. The average ranks of the individual models are shown in Table 2, where a lower rank value indicates a better model. Tables 3 and 4 present the results of the nonparametric Wilcoxon signed-rank test for pairwise comparison of model performance. The null hypothesis stated that there were no significant differences in accuracy, in terms of MSE, between a given pair of models. In both tables, + denotes that the model in the row performed significantly better than, – significantly worse than, and ≈ statistically equivalent to the one in the corresponding column. In turn, slashes (/) separate the results for the individual methods. The significance level considered for the null hypothesis rejection was 5%.
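The average ranks reported in Table 2 can be computed as in the following sketch: a simplified pure-Python version of the Friedman ranking step, which for brevity breaks ties arbitrarily (names and data are our own, not the paper's results):

```python
def average_ranks(mse_by_method):
    """Rank the methods on each dataset by MSE (rank 1 = lowest MSE =
    best) and average the ranks over all datasets, as in the first step
    of the Friedman test. Ties are broken arbitrarily here."""
    methods = list(mse_by_method)
    n_datasets = len(next(iter(mse_by_method.values())))
    totals = dict.fromkeys(methods, 0)
    for i in range(n_datasets):
        ordered = sorted(methods, key=lambda m: mse_by_method[m][i])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_datasets for m in methods}

# Toy example: three methods over four datasets (illustrative values)
mse = {
    "B100": [0.010, 0.012, 0.009, 0.011],
    "B30":  [0.015, 0.016, 0.014, 0.015],
    "10cv": [0.011, 0.011, 0.010, 0.012],
}
ranks = average_ranks(mse)
```

A full Friedman test would additionally compute the test statistic from these ranks; here only the ranking used for Table 2 is illustrated.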
Fig. 1 Performance of Classic (left) and OoB (right) models generated using GNN

Table 2. Average rank positions of models determined during the Friedman test

Position  Classic        OoB            Efron's .632   k-Holdout
1st       B100 (2.25)    10cv (2.50)    E100 (2.70)    10cv (2.60)
2nd       B70 (2.50)     O100 (3.00)    E70 (3.20)     H90 (2.70)
3rd       B50 (3.60)     Expert (3.10)  Expert (3.40)  Expert (3.00)
4th       Expert (3.60)  O70 (3.25)     E50 (3.50)     H70 (3.70)
5th       10cv (4.05)    O50 (4.00)     10cv (3.50)    H50 (3.95)
6th       B30 (5.00)     O30 (5.15)     E30 (4.70)     H30 (5.05)
Fig. 2 Performance of Efron’s .632 (left) and k-Holdout (right) models created by GNN
Table 3. Results of Wilcoxon tests for the performance of bagging with replacement models

          B/O/E100  B/O/E70  B/O/E50  B/O/E30  10cv    Expert
B/O/E100            ≈/≈/≈    –/≈/≈    –/–/–    –/≈/≈   ≈/≈/≈
B/O/E70   ≈/≈/≈              ≈/≈/≈    –/–/–    –/≈/≈   ≈/≈/≈
B/O/E50   +/≈/≈     ≈/≈/≈             –/–/–    ≈/+/≈   ≈/≈/≈
B/O/E30   +/+/+     +/+/+    +/+/+             +/+/≈   ≈/≈/≈
10cv      +/≈/≈     +/≈/≈    ≈/–/≈    –/–/≈            ≈/≈/≈
Expert    ≈/≈/≈     ≈/≈/≈    ≈/≈/≈    ≈/≈/≈    ≈/≈/≈
Table 4. Results of Wilcoxon tests for the performance of bagging without replacement models

          H90  H70  H50  H30  10cv  Expert
H90            ≈    ≈    –    ≈     ≈
H70       ≈         ≈    –    +     ≈
H50       ≈    ≈         –    ≈     ≈
H30       +    +    +         +     ≈
10cv      ≈    –    ≈    –          ≈
Expert    ≈    ≈    ≈    ≈    ≈
The general outcome is as follows: the performance of the experts' method fluctuated strongly, achieving excessively high MSE values for some datasets and the lowest values for others; its MSE ranged from 0.007 to 0.023. The models created over 30% subsamples performed significantly worse than those trained on bigger portions of the base datasets, for all methods. More specifically, no significant differences between B100 and B70 were observed, and B100 and B70 provided better results than 10cv. No significant differences were noticed among O100, O70, and 10cv; in turn, 10cv turned out to be better than O50. No significant differences among E100, E70, E50, and 10cv were seen. H90 and H50 did not show any significant difference when compared to H70 and 10cv. A separate Wilcoxon test showed that B100 performed significantly better than H50. Thus, our tests did not confirm the observation presented in the literature that classic bagging and half-sampling provide statistically equivalent results.
4 Conclusions and Future Work

The experiments aimed to compare the performance of bagging ensembles built using genetic neural networks over real-world data taken from a cadastral system, with different numbers of training samples drawn from the base dataset with and without replacement. The performance of the following methods was compared: three variants of m-out-of-n bagging with replacement, i.e. classic bagging, out-of-bag, and Efron's .632 correction, and one variant of m-out-of-n bagging without replacement, also called subagging or repeated holdout. Moreover, the predictive accuracy of a pairwise comparison method applied by professional appraisers in practice was compared with soft computing machine learning models for residential premises valuation. The overall results of our investigation were as follows. The bagging ensembles created using genetic neural networks revealed prediction accuracy not worse than the experts' method employed in practice. This confirms that automated valuation models can be successfully utilized to support appraisers' work. We plan to continue exploring resampling methods that ensure faster data processing, such as random subspaces and subsampling, as well as techniques for determining the optimal sizes of multi-model solutions. This can lead to both low prediction error and an appropriate balance among accuracy, complexity, and stability.

Acknowledgments. This paper was partially supported by the Ministry of Science and Higher Education of Poland under grant no. N N516 483840.
References

1. Biau, G., Cérou, F., Guyader, A.: On the Rate of Convergence of the Bagged Nearest Neighbor Estimate. Journal of Machine Learning Research 11, 687--712 (2010)
2. Borra, S., Di Ciaccio, A.: Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis 54:12, 2976--2989 (2010)
3. Breiman, L.: Bagging Predictors. Machine Learning 24:2, 123--140 (1996)
4. Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics 30, 927--961 (2002)
5. Buja, A., Stuetzle, W.: Observations on bagging. Statistica Sinica 16, 323--352 (2006)
6. Czuczwara, K.: Comparative analysis of selected evolutionary algorithms for optimization of neural network architectures. Master's Thesis (in Polish), Wrocław University of Technology, Wrocław, Poland (2010)
7. Efron, B., Tibshirani, R.J.: Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92:438, 548--560 (1997)
8. Friedman, J.H., Hall, P.: On bagging and nonlinear estimation. Journal of Statistical Planning and Inference 137:3, 669--683 (2007)
9. Fumera, G., Roli, F., Serrau, A.: A theoretical analysis of bagging as a linear combination of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 30:7, 1293--1299 (2008)
10. Góral, M.: Comparative analysis of selected evolutionary algorithms for optimization of fuzzy models for real estate appraisals. Master's Thesis (in Polish), Wrocław University of Technology, Wrocław, Poland (2010)
11. Graczyk, M., Lasota, T., Trawiński, B.: Comparative Analysis of Premises Valuation Models Using KEEL, RapidMiner, and WEKA. In: Nguyen, N.T. et al. (eds.) ICCCI 2009. LNAI, vol. 5796, pp. 800--812. Springer, Heidelberg (2009)
12. Graczyk, M., Lasota, T., Trawiński, B., Trawiński, K.: Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal. In: Nguyen, N.T. et al. (eds.) ACIIDS 2010. LNAI, vol. 5991, pp. 340--350. Springer, Heidelberg (2010)
13. Kim, D., Kim, H., Chung, D.: A Modified Genetic Algorithm for Fast Training Neural Networks. In: Wang, J., Liao, X., Yi, Z. (eds.) LNCS, vol. 3496, pp. 660--665. Springer, Heidelberg (2005)
14. Kontrimas, V., Verikas, A.: The mass appraisal of the real estate by computational intelligence. Applied Soft Computing 11:1, 443--448 (2011)
15. Król, D., Lasota, T., Trawiński, B., Trawiński, K.: Investigation of Evolutionary Optimization Methods of TSK Fuzzy Model for Real Estate Appraisal. International Journal of Hybrid Intelligent Systems 5:3, 111--128 (2008)
16. Krzystanek, M., Lasota, T., Telec, Z., Trawiński, B.: Analysis of Bagging Ensembles of Fuzzy Models for Premises Valuation. In: Nguyen, N.T. et al. (eds.) ACIIDS 2010. LNAI, vol. 5991, pp. 330--339. Springer, Heidelberg (2010)
17. Lasota, T., Mazurkiewicz, J., Trawiński, B., Trawiński, K.: Comparison of Data Driven Models for the Validation of Residential Premises using KEEL. International Journal of Hybrid Intelligent Systems 7:1, 3--16 (2010)
18. Lasota, T., Telec, Z., Trawiński, B., Trawiński, K.: Exploration of Bagging Ensembles Comprising Genetic Fuzzy Models to Assist with Real Estate Appraisals. In: Yin, H., Corchado, E. (eds.) IDEAL 2009. LNCS, vol. 5788, pp. 554--561. Springer, Heidelberg (2009)
19. Lewis, O.M., Ware, J.A., Jenkins, D.: A novel neural network technique for the valuation of residential property. Neural Computing & Applications 5:4, 224--229 (1997)
20. Molinaro, A.N., Simon, R., Pfeiffer, R.M.: Prediction error estimation: a comparison of resampling methods. Bioinformatics 21:15, 3301--3307 (2005)
21. Peterson, S., Flangan, A.B.: Neural Network Hedonic Pricing Models in Mass Real Estate Appraisal. Journal of Real Estate Research 31:2, 147--164 (2009)
22. Polikar, R.: Ensemble Learning. Scholarpedia 4:1, 2776 (2009)
23. Schapire, R.E.: The Strength of Weak Learnability. Machine Learning 5:2, 197--227 (1990)
24. Worzala, E., Lenk, M., Silva, A.: An Exploration of Neural Networks and Its Application to Real Estate Valuation. Journal of Real Estate Research 10:2, 185--201 (1995)
25. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87:9, 1423--1444 (1999)