On Feature Selection, Bias-Variance, and Bagging

Art Munson (Department of Computer Science, Cornell University)
Rich Caruana (Microsoft Corporation)

ECML-PKDD 2009



Task: Model Presence/Absence of Birds

Tried: SVMs, boosted decision trees, bagged decision trees, neural networks, ...

Ultimate goal: understand avian population dynamics.

Ran feature selection to find the smallest feature set with excellent performance.


Bagging Likes Many Noisy Features (?)

[Figure: European Starling — RMS vs. # features (0 to 30), with bagging on all features shown as a reference line.]

Surprised Reviewers

Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]."

Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."


Purpose of this Study

Does bagging often benefit from many features?

If so, why?


Outline

1. Story Behind the Paper
2. Background
3. Experiment 1: FS and Bias-Variance
4. Experiment 2: Weak, Noisy Features


Review of Bagging

Bagging: a simple ensemble learning algorithm [Bre96]:
- Draw a random sample of the training data.
- Train a model on the sample (e.g., a decision tree).
- Repeat N times (e.g., 25 times).
- Bagged prediction: average the predictions of the N models.
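As a concrete reference, here is a minimal sketch of that procedure in Python, assuming scikit-learn decision trees; the trees and N = 25 are the slide's examples, not necessarily the exact setup used in the paper.

```python
# Minimal bagging sketch: bootstrap-sample, train a tree, average N models.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Train n_models trees on bootstrap samples and average their predictions."""
    rng = np.random.RandomState(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.randint(0, n, size=n)        # draw a random sample (with replacement)
        tree = DecisionTreeRegressor()
        tree.fit(X_train[idx], y_train[idx])   # train a model on the sample
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)              # bagged prediction: average of the N models
```

Averaging over many bootstrap-trained trees is what produces the variance reduction discussed on the next slide.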


Facts about Bagging

- Surprisingly competitive performance and rarely overfits [BK99].
- Main benefit is reducing the variance of the constituent models [BK99].
- Improves the ability to ignore irrelevant features [AP96].


Review of Bias-Variance Decomposition

The error of a learning algorithm on example x comes from three sources:
- noise: intrinsic error / uncertainty in x's true label
- bias: how close, on average, the algorithm is to the optimal prediction
- variance: how much the prediction changes if the training set changes

Error decomposes as: error(x) = noise(x) + bias(x) + variance(x)

On real problems, bias and noise cannot be measured separately.
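To see why bias and noise end up lumped together in practice (hence the "bias/noise" curves later), a short derivation in my own notation, not taken from the paper: let y_D(x) be the prediction of a model trained on training set D, \bar{y}(x) its average over training sets, and t the observed test label.

```latex
\begin{align*}
E_D\big[(y_D(x) - t)^2\big]
  &= E_D\big[\big((y_D(x) - \bar{y}(x)) + (\bar{y}(x) - t)\big)^2\big] \\
  &= \underbrace{E_D\big[(y_D(x) - \bar{y}(x))^2\big]}_{\text{variance}(x)}
   + \underbrace{(\bar{y}(x) - t)^2}_{\text{bias}(x)}
\end{align*}
```

The cross term vanishes because E_D[y_D(x) - \bar{y}(x)] = 0, and since t is the observed label rather than the noise-free target, the second term necessarily mixes bias with intrinsic noise.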



Measuring Bias & Variance (Squared Error)

Generate an empirical distribution of the algorithm's predictions [BK99]:
- Randomly sample 1/2 of the training data.
- Train a model on the sample and make predictions y for the test data.
- Repeat R times (e.g., 20 times).
- Compute the average prediction y_m for every test example.

For each test example x with true label t:

bias(x) = (t - y_m)^2
variance(x) = (1/R) * sum_{i=1..R} (y_m - y_i)^2

Average over test cases to get the expected bias & variance for the algorithm.
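A sketch of this measurement loop in Python, assuming NumPy; `model_factory` is a hypothetical hook for whichever learner (single or bagged tree) is being measured.

```python
# Empirical bias/variance measurement in the style described above.
import numpy as np

def bias_variance(model_factory, X_train, y_train, X_test, y_test, R=20, seed=0):
    rng = np.random.RandomState(seed)
    n = len(X_train)
    preds = np.empty((R, len(X_test)))
    for i in range(R):
        idx = rng.choice(n, size=n // 2, replace=False)  # random half of the training data
        model = model_factory()
        model.fit(X_train[idx], y_train[idx])
        preds[i] = model.predict(X_test)                 # predictions y_i for the test data
    y_mean = preds.mean(axis=0)                          # average prediction y_m per example
    bias = (y_test - y_mean) ** 2                        # bias(x) = (t - y_m)^2 (includes noise)
    variance = ((preds - y_mean) ** 2).mean(axis=0)      # variance(x) = mean_i (y_m - y_i)^2
    return bias.mean(), variance.mean()                  # expected bias & variance
```

For example, `bias_variance(lambda: DecisionTreeRegressor(), X_tr, y_tr, X_te, y_te)` would measure a single decision tree.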



Review of Feature Selection

Forward Stepwise Feature Selection
- Start from an empty selected set.
- Evaluate the benefit of adding each non-selected feature (train a model for each choice).
- Select the most beneficial feature.
- Repeat the search until a stopping criterion is met.

Correlation-based Feature Filtering
- Rank features by their individual correlation with the class label.
- Choose a cutoff point (by statistical test or cross-validation).
- Keep features above the cutoff point; discard the rest.
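For illustration, a compact version of the forward stepwise search; this assumes scikit-learn, and the 5-fold CV scoring plus the fixed `max_features` stopping rule are illustrative choices rather than the paper's exact protocol.

```python
# Forward stepwise feature selection: greedily add the feature that most
# improves cross-validated performance.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def forward_stepwise(X, y, max_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        def score(f):                                  # benefit of adding feature f
            cols = selected + [f]
            return cross_val_score(DecisionTreeRegressor(), X[:, cols], y,
                                   scoring="neg_mean_squared_error", cv=5).mean()
        best = max(remaining, key=score)               # most beneficial feature
        selected.append(best)
        remaining.remove(best)
    return selected
```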


Experiment 1: Bias-Variance of Feature Selection

- 19 datasets.
- Order features using feature selection (forward stepwise feature selection or correlation feature filtering, depending on dataset size).
- Estimate bias & variance at multiple feature set sizes, using 5-fold cross-validation.

[Figure: Summary of dataset sizes — # features (1 to 1e+06) vs. # samples (100 to 100,000), log-log scale.]

Case 1: No Improvement from Feature Selection

[Figure: covtype — MSE (split into variance and bias/noise) vs. # features (1 to 54), for a single decision tree and a bagged decision tree.]

Case 2: FS Improves Non-Bagged Model

[Figure: medis — MSE (split into variance and bias/noise) vs. # features (1 to 63); the non-bagged model overfits with too many features.]

Take Away Points

- More features ⇒ lower bias/noise, higher variance.
- Feature selection does not improve bagged model performance (1 exception).
- The best subset size corresponds to the best bias/variance tradeoff point:
  - algorithm dependent;
  - relevant features may be discarded if the variance increase outweighs the extra information.


Why Does Bagging Benefit from so Many Features?

[Figure: cryst — MSE (split into variance and bias/noise) vs. # features (1 to 1,341).]

Hypothesis

Bagging improves the base learner's ability to benefit from weak, noisy features.


Experiment 2: Noisy Informative Features

Summary:
- Generate synthetic data (6 features).
- Duplicate 1/2 of the features 20 times.
- Corrupt X% of the values in the duplicated features.
- Train single and bagged trees on the corrupted features plus the 3 non-duplicated features.
- Compare to: the ideal, unblemished feature set, and no noisy features (3 non-duplicated only).
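A rough sketch of how such a corrupted feature set could be generated; the uniform replacement noise and the `dup_cols` parameter naming which features are duplicated are my assumptions, since the slide does not specify them.

```python
# Duplicate selected columns many times and corrupt a fraction of their values.
import numpy as np

def make_noisy_copies(X, dup_cols, n_copies=20, corrupt_frac=0.3, seed=0):
    """Return the non-duplicated columns plus n_copies corrupted copies of dup_cols."""
    rng = np.random.RandomState(seed)
    noisy = []
    for _ in range(n_copies):
        block = X[:, dup_cols].copy()
        mask = rng.rand(*block.shape) < corrupt_frac     # corrupt X% of the values
        block[mask] = rng.uniform(block.min(), block.max(), size=mask.sum())
        noisy.append(block)
    keep = [c for c in range(X.shape[1]) if c not in dup_cols]
    return np.hstack([X[:, keep]] + noisy)               # e.g., 3 clean + 20*3 noisy columns
```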


Bagging Extracts More Info from Noisy Features

[Figure: MSE (split into variance and bias/noise) vs. fraction of feature values corrupted (0.0 to 1.0), plus a "core" point. Annotations: core = 6 original features (ideal); 3 non-duplicated features shown as the baseline; all other points use the 3 non-duplicated features + 60 noisy features.]

Conclusions

After training 9,060,936 decision trees . . .

Experiment 1:
- More features ⇒ lower bias/noise, higher variance.
- Feature selection does not improve bagged model performance.
- The best subset size corresponds to the best bias/variance tradeoff point.

Experiment 2:
- Bagged trees are surprisingly good at extracting useful information from noisy features.
- Different trees exploit different weak features.


Bibliography

[AP96] Kamal M. Ali and Michael J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173–202, 1996.

[BK99] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.

[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.


Exception: Overfitting Pseudo-Identifiers

[Figure: bunting — MSE (split into variance and bias/noise) vs. # features (1 to 175).]