On Feature Selection, Bias-Variance, and Bagging

Art Munson (1)    Rich Caruana (2)

(1) Department of Computer Science, Cornell University
(2) Microsoft Corporation

ECML-PKDD 2009
Task: Model Presence/Absence of Birds
Tried: SVMs, boosted decision trees, bagged decision trees, neural networks, ...
Ultimate goal: understand avian population dynamics.
Ran feature selection to find the smallest feature set with excellent performance.
Bagging Likes Many Noisy Features (?)
[Figure: European Starling, RMS vs. # features (0 to 30); a reference level marks bagging with all features.]
Surprised Reviewers
Reviewer A [I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set].
Reviewer B It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased.
Purpose of this Study
Does bagging often benefit from many features?
If so, why?
Outline
1 Story Behind the Paper
2 Background
3 Experiment 1: FS and Bias-Variance
4 Experiment 2: Weak, Noisy Features
Review of Bagging
Bagging: simple ensemble learning algorithm [Bre96]:
draw random sample of training data
train a model using sample (e.g. decision tree)
repeat N times (e.g. 25 times)
bagged predictions: average predictions of N models
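A minimal sketch of this procedure in Python with scikit-learn regression trees; the bootstrap sampling and tree settings here are generic illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Bagging sketch: train n_models trees on bootstrap samples, average predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = []
    for _ in range(n_models):
        # Draw a random (bootstrap) sample of the training data.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(X_test))
    # Bagged prediction: average the N models' predictions.
    return np.mean(all_preds, axis=0)
```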
Facts about Bagging
Surprisingly competitive performance & rarely overfits [BK99].
Main benefit is reducing variance of constituent models [BK99].
Improves ability to ignore irrelevant features [AP96].
Review of Bias-Variance Decomposition
Error of a learning algorithm on example x comes from 3 sources:
noise: intrinsic error / uncertainty in x’s true label
bias: how close, on average, the algorithm is to the optimal prediction
variance: how much the prediction changes if the training set changes
Error decomposes as: error(x) = noise(x) + bias(x) + variance(x)
On real problems, we cannot separately measure bias and noise.
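For squared error this decomposition can be written out explicitly; the following is one standard form (in the spirit of [BK99]), where D ranges over training sets, t over labels, y_D is the learned model's prediction, y* the optimal prediction, and ȳ the average prediction:

```latex
% Bias-variance decomposition of expected squared error at a test point x.
\mathbb{E}_{D,t}\big[(y_D - t)^2\big]
  = \underbrace{\mathbb{E}_t\big[(t - y^*)^2\big]}_{\text{noise}(x)}
  + \underbrace{(y^* - \bar{y})^2}_{\text{bias}(x)}
  + \underbrace{\mathbb{E}_D\big[(y_D - \bar{y})^2\big]}_{\text{variance}(x)}
```

With observed (noisy) test labels the first two terms cannot be separated, which is why the plots below report a combined bias/noise curve.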
Measuring Bias & Variance (Squared Error)
Generate empirical distribution of the algorithm’s predictions [BK99]:
Randomly sample 1/2 of the training data.
Train model using sample and make predictions y for test data.
Repeat R times (e.g. 20 times).
Compute average prediction \bar{y}_m for every test example.
For each test example x with true label t:
bias(x) = (t - \bar{y}_m)^2
variance(x) = (1/R) \sum_{i=1}^{R} (\bar{y}_m - y_i)^2
Average over test cases to get the expected bias & variance for the algorithm.
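A sketch of this estimator in Python, assuming a squared-error setting; the decision-tree model and the function and variable names are illustrative, not the paper's exact setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bias_variance(X_train, y_train, X_test, t_test, R=20, seed=0):
    """[BK99]-style estimate: half-sample R times, then decompose squared error."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((R, len(X_test)))
    for i in range(R):
        # Randomly sample 1/2 of the training data (without replacement).
        idx = rng.choice(n, size=n // 2, replace=False)
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds[i] = model.predict(X_test)
    y_bar = preds.mean(axis=0)                      # average prediction per test case
    bias = (t_test - y_bar) ** 2                    # bias(x); folds in noise on real data
    variance = ((preds - y_bar) ** 2).mean(axis=0)  # variance(x)
    return bias.mean(), variance.mean()             # expected bias & variance
```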
Review of Feature Selection

Forward Stepwise Feature Selection
Start from empty selected set.
Evaluate benefit of selecting each non-selected feature (train a model for each choice).
Select most beneficial feature.
Repeat search until stopping criteria are met.

Correlation-based Feature Filtering
Rank features by individual correlation with class label.
Choose cutoff point (by statistical test or cross-validation).
Keep features above cutoff point; discard the rest.
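A minimal sketch of the forward stepwise search in Python, assuming the "benefit" of a candidate feature is scored by cross-validated squared error with a decision tree; the scorer, model, and names are illustrative choices, not the paper's exact setup:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def forward_stepwise(X, y, max_features=10, cv=5):
    """Greedy forward selection; returns feature indices in the order selected."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < max_features:
        # Score the benefit of adding each candidate feature (higher is better).
        def benefit(f):
            cols = selected + [f]
            return cross_val_score(DecisionTreeRegressor(), X[:, cols], y,
                                   scoring="neg_mean_squared_error", cv=cv).mean()
        best = max(remaining, key=benefit)
        selected.append(best)
        remaining.remove(best)
    return selected
```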
Experiment 1: Bias-Variance of Feature Selection
Summary:
19 datasets
order features using feature selection (forward stepwise feature selection or correlation feature filtering, depending on dataset size)
estimate bias & variance at multiple feature set sizes (5-fold cross-validation)
[Figure: dataset sizes, plotting # features (1 to 1e+06) against # samples (100 to 100,000), both on log scales.]
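A sketch of how the per-subset-size evaluation could be organized in Python, assuming a correlation-based ranking; `estimate` stands in for a bias/variance estimator such as the one sketched earlier, and all names here are illustrative:

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank features by absolute Pearson correlation with the label (filter method)."""
    corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(corrs)[::-1]            # best-correlated features first

def curve_over_subset_sizes(X_train, y_train, X_test, t_test, sizes, estimate):
    """Evaluate the top-k ranked features for each k in `sizes` using `estimate`,
    e.g. a (bias+noise, variance) estimator like the one sketched above."""
    order = correlation_ranking(X_train, y_train)
    return {k: estimate(X_train[:, order[:k]], y_train,
                        X_test[:, order[:k]], t_test)
            for k in sizes}
```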
Case 1: No Improvement from Feature Selection
[Figure: covtype, MSE (split into variance and bias/noise) vs. # features (1 to 54), for a single decision tree and a bagged decision tree.]
Case 2: FS Improves Non-Bagged Model
[Figure: medis, MSE (variance and bias/noise) vs. # features (1 to 63); annotation marks that the non-bagged model overfits with too many features.]
Take Away Points
More features ⇒ lower bias/noise, higher variance.
Feature selection does not improve bagged model performance (1 exception).
Best subset size corresponds to the best bias/variance tradeoff point:
algorithm dependent
relevant features may be discarded if the variance increase outweighs the extra information
Why Does Bagging Benefit from so Many Features?
[Figure: cryst, MSE (variance and bias/noise) vs. # features (1 to 1,341).]
Hypothesis
Bagging improves base learner’s ability to benefit from weak, noisy features.
Experiment 2: Noisy Informative Features
Summary:
generate synthetic data (6 features)
duplicate 1/2 of the features 20 times
corrupt X% of values in the duplicated features
train single and bagged trees with corrupted features and 3 non-duplicated features
compare to: ideal, unblemished feature set, and no noisy features (3 non-duplicated only)
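A sketch of the duplication-and-corruption step in Python; the corruption scheme used here (swapping in values from random rows) and all names are illustrative assumptions, since the paper's exact scheme is not shown on the slide:

```python
import numpy as np

def add_noisy_duplicates(X, n_copies=20, frac_corrupt=0.3, seed=0):
    """Duplicate half of the features n_copies times and corrupt a fraction of
    the duplicated values; with 6 features this yields 3 clean + 60 noisy columns."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    dup = rng.choice(d, size=d // 2, replace=False)        # features to duplicate
    noisy = np.repeat(X[:, dup], n_copies, axis=1).astype(float)
    mask = rng.random(noisy.shape) < frac_corrupt          # values to corrupt
    noisy[mask] = rng.permutation(noisy, axis=0)[mask]     # corrupt selected values
    keep = np.delete(np.arange(d), dup)                    # the non-duplicated features
    return np.hstack([X[:, keep], noisy])
```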
Bagging Extracts More Info from Noisy Features
[Figure: damaged synthetic data, MSE (variance and bias/noise) vs. fraction of feature values corrupted (core, then 0.0 to 1.0). Annotations: "3 non-duplicated features (baseline)", "6 original features (ideal)", "everything else: 3 non-duplicated features + 60 noisy features".]
Conclusions
After training 9,060,936 decision trees . . .
Experiment 1:
More features ⇒ lower bias/noise, higher variance.
Feature selection does not improve bagged model performance.
Best subset size corresponds to the best bias/variance tradeoff point.
Experiment 2:
Bagged trees are surprisingly good at extracting useful information from noisy features.
Different weak features end up in different trees.
Bibliography
[AP96] Kamal M. Ali and Michael J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173-202, 1996.
[BK99] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.
[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Exception: Overfitting Pseudo-Identifiers
[Figure: bunting, MSE (variance and bias/noise) vs. # features (1 to 175).]