DABB 09 13 Outliers

Data Analysis Brown Bag: September 2013

Outliers and Influential Points

Karen Grace-Martin

What is an outlier?

[Figures: Univariate Outliers; Multivariate Outliers]

Copyright 2013 The Analysis Factor http://TheAnalysisFactor.com

Where outliers come from
1. Errors
   - Measurement
   - Data entry
   - Sampling frame
2. Genuine but extreme

What is the problem with outliers?

[Figure: three panels, labeled as follows]
- Panel 1: Estimate of µ too low; estimate of σ too high
- Panel 2: Estimate of µ too high; estimate of σ too high
- Panel 3: Estimate of µ correct; estimate of σ too high

What is the problem with outliers?

[Figure: four regression panels, labeled as follows]
- Panel 1: Estimate of β too high; estimate of sβ too high
- Panel 2: Estimate of β correct; estimate of sβ too high
- Panel 3: Estimate of β correct; estimate of sβ correct
- Panel 4: Estimate of β too high; estimate of sβ correct

Detecting outliers
- Beyond 5th and 95th percentile
- Beyond [2, 2.5, 3] std deviations
- Graphs
- Influence Statistics
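The first two detection rules above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function names and the `scores` data are made up for the example and do not come from the slides.

```python
import numpy as np

def flag_by_percentile(x, lower=5, upper=95):
    """Flag values below the lower or above the upper percentile."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    return (x < lo) | (x > hi)

def flag_by_sd(x, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > k

# Hypothetical data: eleven typical scores and one extreme value.
scores = np.array([48, 50, 51, 49, 52, 47, 50, 53, 49, 51, 50, 120], dtype=float)

for k in (2.0, 2.5, 3.0):
    print(f"beyond {k} SD:", scores[flag_by_sd(scores, k)])
print("beyond 5th/95th percentile:", scores[flag_by_percentile(scores)])
```

Note the design difference: a percentile rule flags roughly the same share of cases in any data set, whether or not anything is unusual, while a standard deviation rule depends on a mean and SD that the outlier itself inflates.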

Influential Cases in Regression
- Leverage: hij, the influence of any given observed value (Yi) on any specific predicted value (Pj)
- Cook's distance: the change in the regression coefficients attributable to the deletion of case j
- Dfbeta: the change in an individual coefficient that occurs when a case is deleted
- Mahalanobis distance: the distance between an observation's values on the predictor variables and the mean of all cases; a multivariate statistic
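The four statistics above can be computed directly from the least-squares algebra. The sketch below assumes NumPy and an ordinary least-squares model with an intercept; `influence_stats` and the example data are hypothetical names and values chosen for illustration. It reports the diagonal of the hat matrix (hii), which is the usual per-case leverage.

```python
import numpy as np

def influence_stats(x, y):
    """Per-case leverage, Cook's distance, DFBETA and squared
    Mahalanobis distance for an OLS fit of y on x with an intercept."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xp = np.asarray(x, dtype=float).reshape(n, -1)   # predictors only
    X = np.column_stack([np.ones(n), Xp])            # add intercept column
    p = X.shape[1]

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.diag(X @ XtX_inv @ X.T)                   # leverage h_ii
    mse = resid @ resid / (n - p)

    # Cook's distance for each case.
    cooks_d = resid**2 / (p * mse) * h / (1 - h) ** 2
    # DFBETA: full-data coefficients minus coefficients with case i deleted.
    dfbeta = (XtX_inv @ X.T).T * (resid / (1 - h))[:, None]
    # Squared Mahalanobis distance of each case from the predictor means.
    Z = Xp - Xp.mean(axis=0)
    S_inv = np.linalg.inv(np.atleast_2d(np.cov(Xp, rowvar=False)))
    d2 = np.einsum("ij,jk,ik->i", Z, S_inv, Z)

    return {"beta": beta, "leverage": h, "cooks_d": cooks_d,
            "dfbeta": dfbeta, "mahalanobis_sq": d2}

# Hypothetical data: y is close to 2x, except the last case, which has an
# extreme x and a y far below the line.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.0])

res = influence_stats(x, y)
print("most influential case (Cook's D):", int(np.argmax(res["cooks_d"])))
print("DFBETA for that case (b0, b1):", res["dfbeta"][np.argmax(res["cooks_d"])])
```

With an intercept in the model, leverage and Mahalanobis distance carry the same information: hii = 1/n + d²i/(n − 1), where d²i is the squared Mahalanobis distance.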

Influential Cases

Cook’s D

dfBetas for β0 and β1

Techniques for dealing with outliers
1. Keep it and treat it as any other point
2. Trimming
3. Winsorizing
   - Type I: Assign it a value closer to the center, often the 95th percentile or 2 std deviations
   - Type II: Assign it a lesser weight
4. Transformations
5. Robust Statistics
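As a rough illustration of techniques 2 and 3, the sketch below trims and winsorizes at the 5th and 95th percentiles and then computes a down-weighted mean. It assumes NumPy; the function names, the `scores` data and the weight of 0.5 are arbitrary choices for the example, not values from the slides.

```python
import numpy as np

def trim(x, lower=5, upper=95):
    """Trimming: drop values outside the [lower, upper] percentile range."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    return x[(x >= lo) & (x <= hi)]

def winsorize(x, lower=5, upper=95):
    """Type I winsorizing: pull extreme values in to the percentile cutoffs."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    return np.clip(x, lo, hi)

def downweighted_mean(x, lower=5, upper=95, weight=0.5):
    """Type II winsorizing: keep extreme values but give them less weight."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    w = np.where((x < lo) | (x > hi), weight, 1.0)
    return np.average(x, weights=w)

# Hypothetical data: eleven typical scores and one extreme value.
scores = np.array([48, 50, 51, 49, 52, 47, 50, 53, 49, 51, 50, 120], dtype=float)

print("raw mean:         ", scores.mean())
print("trimmed mean:     ", trim(scores).mean())
print("winsorized mean:  ", winsorize(scores).mean())
print("down-weighted mean:", downweighted_mean(scores))
```

Trimming changes the sample size, while both kinds of winsorizing keep every case in the data set.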

Transformations

Robust Statistics
1. Rank based L-estimators
   - Median
   - Median Absolute Deviation: $\mathrm{MAD} = \mathrm{median}_i\left(\lvert k_i - \mathrm{median}_j(k_j)\rvert\right)$
   - Quantile Regression
2. Trimmed statistics
   - k% trimmed Mean and Std Dev
3. Maximum Likelihood based M-estimators
   - Huber weighting: IRLS
4. Resampling techniques
   - Bootstrap & Jackknife
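Several of the estimators listed above are short enough to sketch directly. The code below assumes NumPy and is only an illustration: the function names and data are hypothetical, the Huber tuning constant 1.345 and the MAD scaling factor 1.4826 are conventional choices rather than values from the slides, and the Huber example estimates a location (a robust mean) rather than a full regression.

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def trimmed_mean(x, pct=10):
    """Mean after dropping pct% of the cases from each tail."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(len(x) * pct / 100)          # number of cases cut per tail
    return x[g:len(x) - g].mean()

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location by iteratively reweighted least squares."""
    x = np.asarray(x, dtype=float)
    scale = 1.4826 * mad(x)              # robust scale, held fixed
    if scale == 0:
        raise ValueError("MAD is zero; cannot scale residuals")
    mu = np.median(x)
    for _ in range(max_iter):
        r = np.abs(x - mu) / scale
        # Weight 1 for small residuals, c/|r| for large ones.
        w = np.minimum(1.0, c / np.maximum(r, 1e-12))
        new_mu = np.sum(w * x) / np.sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def bootstrap_se(x, stat=np.median, n_boot=2000, seed=0):
    """Bootstrap standard error of a statistic."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    reps = [stat(rng.choice(x, size=len(x), replace=True)) for _ in range(n_boot)]
    return np.std(reps, ddof=1)

# Hypothetical data: eleven typical scores and one extreme value.
scores = np.array([48, 50, 51, 49, 52, 47, 50, 53, 49, 51, 50, 120], dtype=float)

print("mean:", scores.mean(), " median:", np.median(scores))
print("SD:", scores.std(ddof=1), " MAD:", mad(scores))
print("10% trimmed mean:", trimmed_mean(scores))
print("Huber location:", huber_location(scores))
print("bootstrap SE of median:", bootstrap_se(scores))
```

In this example the median, trimmed mean and Huber estimate all stay close to 50, while the ordinary mean is pulled toward the extreme value.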

Advantages and Disadvantages
• Trimming & Winsorizing both create bias in parameter estimates and standard errors and undervalue the outlier
• Winsorizing puts more weight on the full distribution; better in symmetric distributions
• Effects in the full data set can appear or disappear in winsorized or trimmed data
• In trimmed data, new outliers can appear
• Transformations can make interpretation difficult, but full data are retained
• Many robust statistics are still being researched and/or may not be available for all methods
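The point that new outliers can appear in trimmed data is easy to see with a small made-up example: once the most extreme value is removed, the standard deviation shrinks and a value that looked ordinary now falls beyond the cutoff. The sketch assumes NumPy and a 2 std deviation rule; the data and the function name are hypothetical.

```python
import numpy as np

def sd_outliers(x, k=2.0):
    """Flag values more than k sample standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > k

# Hypothetical data: ten typical values, one moderately high, one extreme.
values = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 100], dtype=float)

first_pass = sd_outliers(values)
trimmed = values[~first_pass]            # drop whatever was flagged
second_pass = sd_outliers(trimmed)       # re-check the trimmed data

print("flagged in full data:   ", values[first_pass])
print("flagged after trimming: ", trimmed[second_pass])
```

Here only 100 is flagged in the full data, but after it is trimmed, 30 lies more than 2 standard deviations from the new mean and is flagged in turn.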

Resources and Further Reading
- Robust Regression: http://www.stata-journal.com/sjpdf.html?articlenum=st0173
- Quantile Regression: http://cscu.cornell.edu/news/statnews/stnews70.pdf
- Outliers: An Evaluation of Methodologies: http://www.amstat.org/sections/srms/proceedings/y2012/files/304068_72402.pdf
- Quantitative Data Cleaning for Large Databases: http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf
