Missing Data Workshop: Module 1 Introduction and Overview
Missing Data Workshop: Effectively Dealing with Missing Data Without Biasing your Results
Karen Grace-Martin
Workshop Outline
Module 1: Introduction and Overview
  1. What is Missing Data?
  2. Missing Data Mechanisms
  3. The Four Main Approaches
  4. Complete Case Analysis
  5. Imputation
Copyright 2014 The Analysis Factor http://TheAnalysisFactor.com
Workshop Outline
Module 2: Multiple Imputation
  1. What it is
  2. When to use it
  3. How to do it: Continuous Variables
       The Four Steps
       Demonstrations
Workshop Outline
Module 3: Multiple Imputation: Special Cases
  1. Clustered Data
  2. Scale Data
  3. Categorical Data
  4. Missing Y
Workshop Outline
Module 4: Maximum Likelihood
  1. What is Maximum Likelihood
  2. EM Algorithm
  3. Multilevel Models
  4. Full Information Maximum Likelihood
  5. Demonstrations
Non-Ignorable Missing Data
Workshop Outline
Module 5: Missing Data Diagnosis
  1. Decision Factors in Choosing an Approach
  2. Missing Data Diagnosis: the Steps
  3. Demonstrations
Conclusions
1.1 What is Missing Data?
Missingness should hide a meaningful value
Examples
  Don’t know for questions about income (√)
  Don’t know for vote in election (?)
  In a longitudinal study:
    Loss to follow-up (√)
    Death (X)
1.1 What is Missing Data?
Unit Non-response
  People do not send back a survey
Item Non-response
  Leave answers blank
  Refuse to answer
  Don’t Know
Occasion Non-response
  Attrition
  Missed Sessions
1.1 What is Missing Data?
Related Concepts
 - Grouped, aggregated data
 - Rounded, range data
 - Censored data
 - Truncated data
 - Sample Selected data
 - Latent Variables
1.1 What is Missing Data?
Missing Data Structure
[Figure: variables V1 and V2, with observed and missing portions marked]
1.2 Missing Data Mechanisms
Determinants of Wages from the 1985 Current Population Survey
  n = 534 for full data set
  Variables include wages and worker characteristics
1.2 Missing Data Mechanisms
Definition: By what kind of process are values missing?
1.2 Missing Data Mechanisms
Missing Completely at Random (MCAR)
 - No systematic relationship to Yobs or Ymiss
Missing at Random (MAR)
 - Systematic relationship to Yobs
Non-ignorable (NI)
 - Systematic relationship to Ymiss
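The three mechanisms above are often written as statements about the probability of missingness. The notation below is a sketch under one assumption not in the slides: R denotes the missingness indicator (which values are missing), while Yobs and Ymiss are the observed and missing parts of the data as above.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{miss}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{miss}}) = P(R \mid Y_{\text{obs}}) \\
\text{NI:}\quad   & P(R \mid Y_{\text{obs}}, Y_{\text{miss}}) \text{ depends on } Y_{\text{miss}}
                    \text{ even after conditioning on } Y_{\text{obs}}
\end{align*}
```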
1.2 Missing Data Mechanisms
Experience and Education are MAR
  20% of union members are missing on experience
  50% of union non-members are missing on experience
  50% of women are missing on education
  30% of men are missing on education
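To make the MAR idea concrete, here is a minimal Python sketch of how missingness like the experience example above could be generated. It is an illustration only, not the procedure used to build the workshop data set: the synthetic workers, the function names make_mar_experience and missing_rate, and the seeds are all hypothetical. The key point is that the chance of a missing exper value depends only on the observed union variable, never on exper itself.

```python
import random

def make_mar_experience(rows, p_missing_union=0.20, p_missing_nonunion=0.50, seed=0):
    """Blank out 'exper' with a probability that depends only on the observed 'union' flag."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        # Missingness probability is a function of union status only (MAR).
        p = p_missing_union if row["union"] == 1 else p_missing_nonunion
        new_row = dict(row)
        if rng.random() < p:
            new_row["exper"] = None
        out.append(new_row)
    return out

def missing_rate(rows, union_value):
    """Share of rows in one union group whose 'exper' value is missing."""
    group = [r for r in rows if r["union"] == union_value]
    return sum(r["exper"] is None for r in group) / len(group)

# Hypothetical synthetic workers (not the 1985 CPS data).
data_rng = random.Random(1)
workers = [{"union": data_rng.randint(0, 1), "exper": data_rng.randint(0, 40)}
           for _ in range(10000)]

mar = make_mar_experience(workers)
print("missing among union members:    ", round(missing_rate(mar, 1), 2))
print("missing among union non-members:", round(missing_rate(mar, 0), 2))
```

Making the probability depend on the value of exper itself (for example, higher for long careers) would turn the same sketch into a non-ignorable mechanism.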
1.3 Approaches to Missing Data
Goals for any approach:
  Give unbiased parameter estimates
  Have adequate power
  Give accurate standard errors and p-values
1.3 Approaches to Missing Data
1. Complete Case Analysis
2. Imputation
3. Multiple Imputation
4. Full Information Maximum Likelihood
1.3 Approaches to Missing Data
Approach can meet goals depending on:
  1. Missing Data Mechanism
  2. Percentage of Missing Data
  3. Distribution of Missing Data
  4. Statistical Analysis to be Done
1.4 Complete Case Analysis
Analyze only those cases that have no missing data
[Figure: two panels, each showing variables V1 and V2]
1.4 Complete Case Analysis
Advantages:
  Easy and Simple
  Fine for small amounts of missing data
Disadvantages:
  Loss of information
  Increased standard errors
  Unbiased only when data are MCAR
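Complete case analysis (listwise deletion) is simple enough to sketch in a few lines of Python. The variable names educat, exper, and lnwage follow the workshop example, but the five rows and the helper names complete_cases and valid_n are hypothetical and serve only to show how quickly cases are lost when missingness is spread across variables.

```python
def complete_cases(rows, variables):
    """Keep only the rows with no missing value on any analysis variable."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]

def valid_n(rows, variable):
    """Number of rows with an observed value on one variable."""
    return sum(r[variable] is not None for r in rows)

# Hypothetical rows; None marks a missing value.
rows = [
    {"educat": 16,   "exper": 10,   "lnwage": 2.3},
    {"educat": None, "exper": 5,    "lnwage": 1.9},
    {"educat": 12,   "exper": None, "lnwage": 2.0},
    {"educat": 14,   "exper": 21,   "lnwage": 2.6},
    {"educat": None, "exper": None, "lnwage": 1.7},
]

analysis_vars = ["educat", "exper", "lnwage"]
kept = complete_cases(rows, analysis_vars)

for v in analysis_vars:
    print(v, "valid N =", valid_n(rows, v))
print("valid N (listwise) =", len(kept))
```

The listwise N can be much smaller than the valid N of any single variable, which is where the loss of information and the larger standard errors come from.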
1.4 Complete Case Analysis
Descriptive Statistics
Variable                                         N     Minimum  Maximum  Mean    Std. Deviation
educat      Years of Education Full              534   2        18       13.02   2.615
educat_mar  Years of Education MAR               308   2        18       12.97   2.763
exper       Number of Years of Work Experience   534   0        55       17.82   12.380
exper_mar   Years of Experience                  283   0        54       17.30   12.320
Valid N (listwise)                               164
1.4 Complete Case Analysis
Correlations
educat = Years of Education Full; exper = Number of Years of Work Experience; lnwage = Natural log of Wages

                               educat     exper      lnwage
educat   Pearson Correlation   1          -.353**    .380**
         Sig. (2-tailed)                  .000       .000
         N                     534        534        534
exper    Pearson Correlation   -.353**    1          .108*
         Sig. (2-tailed)       .000                  .013
         N                     534        534        534
lnwage   Pearson Correlation   .380**     .108*      1
         Sig. (2-tailed)       .000       .013
         N                     534        534        534
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Correlations
educat_mar = Years of Education MAR; exper_mar = Years of Experience; lnwage = Natural log of Wages
[Table values not recoverable from the source.]
**. Correlation is significant at the 0.01 level (2-tailed).
1.4 Complete Case Analysis
               Full Data Set        Complete Case Analysis
Parameter      B         Sig.       B         Sig.
Intercept      .745      .000       .716      .001
exper          .011      .000       .012      .000
educat         .092      .000       .089      .000
marr           .079      .062       .131      .062
union          .201      .000       .151      .046
sex            -.232     .000       -.290     .000
south          -.105     .015       -.041     .552
[race=1]       -.098     .097       -.180     .062
[race=2]       -.086     .333       -.096     .560
[race=3]       0(a)      .          0(a)      .
Full Data Set: n = 534, Adjusted R2 = .300
Complete Case Analysis: n = 164, Adjusted R2 = .352
1.5 Imputation
Definition: replace missing values with an estimate, then analyze the full data set as if the imputed values were actual observed values.
[Figure: two panels of V1 and V2; legend: Observed, Missing, Imputed]
1.5 Imputation
Types of Imputation:
  Mean Substitution
  Cold deck
  Hot deck
  Interpolation and extrapolation
  Regression
  Stochastic regression
1.5 Imputation
Advantages:
  Preserves original sample size
  Preserves observed data
  Handles missing data once
Disadvantages:
  Can create bias
  Often does not preserve associations among variables
  Requires specifying correct imputation model
  Understates uncertainty
1.5 Imputation
Regression Imputation
1.5 Imputation
Case Summaries (a)
educat_mar = Years of Education MAR; educat_mean = Years of Education Mean Imputed; educat_rir = Years of Education Regression Imputed with random

Case      educat_mar   educat_mean   educat_rir
1         16           16.00         16
2         14           14.00         14
3         13           13.00         13
4         17           17.00         17
5         .            12.97         16
6         14           14.00         14
7         .            12.97         15
8         16           16.00         16
9         15           15.00         15
10        .            12.97         14
11        .            12.97         22
12        12           12.00         12
13        .            12.97         13
14        12           12.00         12
15        .            12.97         14
Total N   9            15            15
a. Limited to first 15 cases.
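The effect of mean substitution on spread can be checked directly on the 15 educat_mar cases listed above. This is a sketch, not the workshop's procedure: it fills the gaps with the mean of the nine observed cases shown here, whereas the slide's imputed value of 12.97 is the mean of all 308 observed cases in the full data set. The variable names observed, fill_value, and imputed are introduced only for illustration.

```python
from statistics import mean, stdev

# educat_mar for the first 15 cases, as listed in the Case Summaries table;
# None marks a missing value.
educat_mar = [16, 14, 13, 17, None, 14, None, 16, 15, None, None, 12, None, 12, None]

observed = [v for v in educat_mar if v is not None]

# Assumption: impute with the mean of these nine observed cases,
# not the full-sample mean of 12.97 used on the slide.
fill_value = mean(observed)
imputed = [fill_value if v is None else v for v in educat_mar]

print("observed N =", len(observed), " imputed N =", len(imputed))
print("mean before:", round(mean(observed), 2), " after:", round(mean(imputed), 2))
print("SD before:  ", round(stdev(observed), 2), " after:", round(stdev(imputed), 2))
```

The mean is unchanged, but every imputed case sits exactly at the mean, so the standard deviation shrinks while N grows. Both changes push standard errors down, which is one way single imputation understates uncertainty.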
1.5 Imputation
               Full Data Set        Mean Imputation
Parameter      B         Sig.       B         Sig.
Intercept      .745      .000       1.044     .000
exper          .011      .000       .005      .031
educat         .092      .000       .075      .000
marr           .079      .062       .107      .016
union          .201      .000       .228      .000
sex            -.232     .000       -.218     .000
south          -.105     .015       -.126     .006
[race=1]       -.098     .097       -.114     .069
[race=2]       -.086     .333       -.135     .157
[race=3]       0(a)      .          0         .
Full Data Set: n = 534, Adjusted R2 = .300
Mean Imputation: n = 534, Adjusted R2 = .201
1.5 Imputation
               Full Data Set        Regression Imputation
Parameter      B         Sig.       B         Sig.
Intercept      .745      .000       .339      .010
exper          .011      .000       .012      .000
educat         .092      .000       .121      .000
marr           .079      .062       .082      .042
union          .201      .000       .205      .000
sex            -.232     .000       -.209     .000
south          -.105     .015       -.088     .032
[race=1]       -.098     .097       -.097     .084
[race=2]       -.086     .333       -.064     .448
[race=3]       0(a)      .          0         .
Full Data Set: n = 534, Adjusted R2 = .300
Regression Imputation: n = 534, Adjusted R2 = .363
1.5 Imputation
               Full Data Set        Regression Imputation w/ Random Error
Parameter      B         Sig.       B         Sig.
Intercept      .745      .000       .942      .000
exper          .011      .000       .007      .000
educat         .092      .000       .079      .000
marr           .079      .062       .100      .020
union          .201      .000       .221      .000
sex            -.232     .000       -.217     .000
south          -.105     .015       -.092     .039
[race=1]       -.098     .097       -.104     .084
[race=2]       -.086     .333       -.138     .130
[race=3]       0(a)      .          0         .
Full Data Set: n = 534, Adjusted R2 = .300
Regression Imputation w/ Random Error: n = 534, Adjusted R2 = .262
1.5 Imputation
Summary
1. Imputations can give unbiased estimates if they:
   - Condition on observed variables
   - Are multivariate, to preserve associations among variables
   - Generally have a random component
2. Single imputations:
   - give full power
   - underestimate standard errors
   - underestimate them more as % missingness increases
3. Good imputation takes a long time
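The difference between plain regression imputation and regression imputation with a random component can be sketched as follows. This is a simplified illustration under stated assumptions, not the workshop's SPSS procedure: it uses a single predictor rather than a multivariate model, the data are synthetic, and the names fit_line and regression_impute are hypothetical.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept, slope, and residual SD for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sigma = (sum(e * e for e in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, sigma

def regression_impute(x, y, stochastic=True, seed=0):
    """Fill missing y (None) from a regression on x fitted to the observed cases.

    With stochastic=True, a normal draw with the residual SD is added to each
    predicted value, so imputed points scatter around the line instead of on it.
    """
    rng = random.Random(seed)
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    intercept, slope, sigma = fit_line([a for a, _ in obs], [b for _, b in obs])
    filled = []
    for a, b in zip(x, y):
        if b is None:
            b = intercept + slope * a
            if stochastic:
                b += rng.gauss(0, sigma)
        filled.append(b)
    return filled

# Hypothetical synthetic data: y depends on x, and about 40% of y is missing.
data_rng = random.Random(42)
x = [data_rng.uniform(0, 40) for _ in range(200)]
y_full = [10 + 0.1 * a + data_rng.gauss(0, 2) for a in x]
y = [None if data_rng.random() < 0.4 else b for b in y_full]

det = regression_impute(x, y, stochastic=False)
sto = regression_impute(x, y, stochastic=True)
print("missing values filled:", sum(b is None for b in y))
```

Even with the random component, the imputed values are analyzed as if they had been observed, so a single stochastic imputation still understates standard errors.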