MD 01 The Problem slides handout 2 per page

Missing Data Workshop: Module 1 Introduction and Overview

Missing Data Workshop: Effectively Dealing with Missing Data Without Biasing your Results

Karen Grace-Martin

Workshop Outline

Module 1: Introduction and Overview
1. What is Missing Data?
2. Missing Data Mechanisms
3. The Four Main Approaches
4. Complete Case Analysis
5. Imputation

Copyright 2014 The Analysis Factor http://TheAnalysisFactor.com

Workshop Outline

Module 2: Multiple Imputation
1. What it is
2. When to use it
3. How to do it: Continuous Variables
   The Four Steps
   Demonstrations

Workshop Outline

Module 3: Multiple Imputation: Special Cases
1. Clustered Data
2. Scale Data
3. Categorical Data
4. Missing Y

Workshop Outline

Module 4: Maximum Likelihood
1. What is Maximum Likelihood
2. EM Algorithm
3. Multilevel Models
4. Full Information Maximum Likelihood
5. Demonstrations

Non-Ignorable Missing Data

Workshop Outline

Module 5: Missing Data Diagnosis
1. Decision Factors in Choosing an Approach
2. Missing Data Diagnosis: The Steps
3. Demonstrations

Conclusions

1.1 What is Missing Data?

Missingness should hide a meaningful value

Examples
  Don’t know for questions about income (√)
  Don’t know for vote in election (?)
  In a longitudinal study:
    Loss to follow-up (√)
    Death (X)

1.1 What is Missing Data?

Unit Non-response
  People do not send back a survey
Item Non-response
  Leave answers blank
  Refuse to answer
  Don’t Know
Occasion Non-response
  Attrition
  Missed Sessions

1.1 What is Missing Data?

Related Concepts
- Grouped, aggregated data
- Rounded, range data
- Censored data
- Truncated data
- Sample Selected data
- Latent Variables

1.1 What is Missing Data?

Missing Data Structure

[Figure: variables V1 and V2, with Observed and Missing regions marked]

1.2 Missing Data Mechanisms

Determinants of Wages from the 1985 Current Population Survey
  n = 534 for full data set
  Variables include wages and worker characteristics

1.2 Missing Data Mechanisms

Definition: By what kind of process are values missing?

1.2 Missing Data Mechanisms

Missing Completely at Random (MCAR)
- No systematic relationship to Yobs or Ymiss
Missing at Random (MAR)
- Systematic relationship to Yobs
Non-ignorable (NI)
- Systematic relationship to Ymiss
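To make the three definitions concrete, here is a minimal simulation sketch. It is not part of the workshop materials. The variables x and y, the logistic missingness models, and every coefficient are illustrative assumptions: x plays the role of an always-observed variable (Yobs), and y is the variable that goes missing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: x is always observed, y is subject to missingness.
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each mask is True where y would be missing.
mcar = rng.random(n) < 0.3                      # unrelated to x and y
mar = rng.random(n) < logistic(-1.0 + 1.5 * x)  # depends only on observed x
ni = rng.random(n) < logistic(-1.0 + 1.5 * y)   # depends on the value of y itself

full_mean = y.mean()
obs_means = {
    name: y[~mask].mean()
    for name, mask in [("MCAR", mcar), ("MAR", mar), ("NI", ni)]
}
# Slope of y on x using only the cases where y is observed under MAR.
mar_slope = np.polyfit(x[~mar], y[~mar], 1)[0]

print(f"full-sample mean of y: {full_mean:.3f}")
for name, m in obs_means.items():
    print(f"{name:>4}: mean of observed y = {m:.3f}")
print(f"MAR slope of y on x among observed cases: {mar_slope:.3f} (true value 0.6)")
```

Under these assumptions, the mean of the observed y values stays close to the full-sample mean only for MCAR. Under MAR the observed mean is shifted, but the regression of y on x among observed cases still recovers the slope, because the missingness depends only on x.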

1.2 Missing Data Mechanisms

Experience and Education are MAR
  20% of union members are missing on experience
  50% of union non-members are missing on experience
  50% of women are missing on education
  30% of men are missing on education
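One way such a MAR pattern could be generated is sketched below. The four missingness rates come from the slide; everything else is an assumption made for illustration: the function name make_mar_masks, the simulated union and female indicators and their proportions, and the choice to draw the two missingness indicators independently of each other.

```python
import numpy as np

def make_mar_masks(union, female, rng):
    """Return boolean masks (True = missing) for experience and education.

    Missingness depends only on the fully observed union and sex indicators,
    which is what makes the pattern MAR.
    """
    union = np.asarray(union, dtype=bool)
    female = np.asarray(female, dtype=bool)
    p_exper = np.where(union, 0.20, 0.50)   # rates from the slide
    p_educ = np.where(female, 0.50, 0.30)   # rates from the slide
    exper_missing = rng.random(union.size) < p_exper
    educ_missing = rng.random(female.size) < p_educ
    return exper_missing, educ_missing

# Hypothetical sample; the group proportions here are NOT taken from the CPS data.
rng = np.random.default_rng(42)
n = 200_000
union = rng.random(n) < 0.2
female = rng.random(n) < 0.5

exper_missing, educ_missing = make_mar_masks(union, female, rng)
print(f"experience missing, union members:     {exper_missing[union].mean():.3f}")
print(f"experience missing, union non-members: {exper_missing[~union].mean():.3f}")
print(f"education missing, women:              {educ_missing[female].mean():.3f}")
print(f"education missing, men:                {educ_missing[~female].mean():.3f}")
```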

1.3 Approaches to Missing Data

Goals for any approach:
  Give unbiased parameter estimates
  Have adequate power
  Give accurate standard errors and p-values

1.3 Approaches to Missing Data

1. Complete Case Analysis
2. Imputation
3. Multiple Imputation
4. Full Information Maximum Likelihood

1.3 Approaches to Missing Data

Approach can meet goals depending on:
1. Missing Data Mechanism
2. Percentage of Missing Data
3. Distribution of Missing Data
4. Statistical Analysis to be Done

1.4 Complete Case Analysis

Analyze only those cases that have no missing data
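As a minimal sketch of what this means in practice (often called listwise deletion), the snippet below keeps only the rows with no missing values. The small data matrix is made up for illustration, with NaN standing in for a missing value.

```python
import numpy as np

# Hypothetical data: rows are cases, columns are V1 and V2; NaN marks a missing value.
data = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.nan],
    [5.0, 6.0],
    [np.nan, np.nan],
])

# A case is complete when none of its values are missing.
complete = ~np.isnan(data).any(axis=1)
complete_cases = data[complete]

print(complete_cases)
print(int(complete.sum()), "of", len(data), "cases retained")
```

Only the first and fourth rows survive here, so three of the five cases are discarded even though two of them have one observed value.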

[Figure: two panels, each showing variables V1 and V2]

1.4 Complete Case Analysis

Advantages:
  Easy and Simple
  Fine for small amounts of missing data
Disadvantages:
  Loss of information
  Increased standard errors
  Unbiased only when data are MCAR

1.4 Complete Case Analysis

Descriptive Statistics

Variable                                      N     Minimum  Maximum  Mean    Std. Deviation
educat (Years of Education, Full)             534   2        18       13.02   2.615
educat_mar (Years of Education, MAR)          308   2        18       12.97   2.763
exper (Number of Years of Work Experience)    534   0        55       17.82   12.380
exper_mar (Years of Experience)               283   0        54       17.30   12.320
Valid N (listwise)                            164

1.4 Complete Case Analysis

Correlations

educat = Years of Education, Full
exper = Number of Years of Work Experience
lnwage = Natural log of Wages

                              educat     exper      lnwage
educat  Pearson Correlation   1          -.353**    .380**
        Sig. (2-tailed)                  .000       .000
        N                     534        534        534
exper   Pearson Correlation   -.353**    1          .108*
        Sig. (2-tailed)       .000                  .013
        N                     534        534        534
lnwage  Pearson Correlation   .380**     .108*      1
        Sig. (2-tailed)       .000       .013
        N                     534        534        534

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Correlations

educat_mar = Years of Education, MAR
exper_mar = Years of Experience
lnwage = Natural log of Wages

                                  educat_mar  exper_mar  lnwage
educat_mar  Pearson Correlation   1           -.432**    .377**
            Sig. (2-tailed)                   .000       .000
            N                     308         164        308
exper_mar   Pearson Correlation   -.432**     1          .076
            Sig. (2-tailed)       .000                   .201
            N                     164         283        283
lnwage      Pearson Correlation   .377**      .076       1
            Sig. (2-tailed)       .000        .201
            N                     308         283        534

**. Correlation is significant at the 0.01 level (2-tailed).
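The N entries in a correlation table like this differ from cell to cell because each correlation uses every case observed on that particular pair of variables (pairwise deletion), while "Valid N (listwise)" counts cases observed on all variables. The sketch below shows how both counts can be computed; the five-row data matrix is made up for illustration and does not reproduce the workshop data.

```python
import numpy as np

# Hypothetical data: columns stand in for educat_mar, exper_mar, lnwage; NaN = missing.
data = np.array([
    [16.0, 10.0, 2.1],
    [np.nan, 5.0, 1.8],
    [12.0, np.nan, 1.5],
    [np.nan, np.nan, 2.4],
    [14.0, 20.0, 2.0],
])

observed = (~np.isnan(data)).astype(int)

# Entry (i, j) counts the cases observed on both variable i and variable j.
pairwise_n = observed.T @ observed

# Cases observed on every variable.
listwise_n = int(observed.all(axis=1).sum())

print(pairwise_n)
print("Valid N (listwise):", listwise_n)
```

The diagonal of pairwise_n gives the per-variable N, and the off-diagonal entries can never exceed the smaller of the two diagonal values they connect.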

1.4 Complete Case Analysis

             Full Data Set       Complete Case Analysis
Parameter    B        Sig.       B        Sig.
Intercept    .745     .000       .716     .001
exper        .011     .000       .012     .000
educat       .092     .000       .089     .000
marr         .079     .062       .131     .062
union        .201     .000       .151     .046
sex          -.232    .000       -.290    .000
south        -.105    .015       -.041    .552
[race=1]     -.098    .097       -.180    .062
[race=2]     -.086    .333       -.096    .560
[race=3]     0(a)     .          0(a)     .

Full Data Set: n = 534, Adjusted R2 = .300
Complete Case Analysis: n = 164, Adjusted R2 = .352

1.5 Imputation

Definition: replace missing values with an estimate, then analyze the full data set as if the imputed values were actual observed values.

[Figure: two panels, each showing variables V1 and V2; legend: Imputed, Missing, Observed]

1.5 Imputation

Types of Imputation:
  Mean Substitution
  Cold deck
  Hot deck
  Interpolation and extrapolation
  Regression
  Stochastic regression
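Three of these methods (mean substitution, regression, and stochastic regression) can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the procedure used for the workshop's SPSS output: the variables x and y, the 40% MCAR missingness, and the single-predictor linear model are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: x fully observed, y missing for about 40% of cases (MCAR here).
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)
missing = rng.random(n) < 0.4
obs = ~missing

# 1. Mean substitution: every missing y gets the observed mean.
y_mean = y.copy()
y_mean[missing] = y[obs].mean()

# 2. Regression imputation: predict y from x, fitting on the observed cases.
slope, intercept = np.polyfit(x[obs], y[obs], 1)
pred = intercept + slope * x
y_reg = y.copy()
y_reg[missing] = pred[missing]

# 3. Stochastic regression: add a random residual to each prediction.
resid_sd = np.std(y[obs] - pred[obs], ddof=2)
y_sreg = y.copy()
y_sreg[missing] = pred[missing] + rng.normal(scale=resid_sd, size=missing.sum())

for name, v in [("true", y), ("mean", y_mean),
                ("regression", y_reg), ("stochastic", y_sreg)]:
    r = np.corrcoef(x, v)[0, 1]
    print(f"{name:>10}: sd = {v.std(ddof=1):.3f}, corr with x = {r:.3f}")
```

In this setup, mean substitution shrinks the spread of y and weakens its correlation with x, deterministic regression imputation shrinks the spread less but inflates the correlation, and adding a random residual brings both closer to the values in the complete data.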

1.5 Imputation

Advantages:
  Preserves original sample size
  Preserves observed data
  Handles missing data once
Disadvantages:
  Can create bias
  Often does not preserve associations among variables
  Requires specifying correct imputation model
  Understates uncertainty

1.5 Imputation

Regression Imputation

1.5 Imputation

Case Summaries(a)

educat_mar = Years of Education, MAR
educat_mean = Years of Education, Mean Imputed
educat_ri = Years of Education, Regression Imputed
educat_rir = Years of Education, Regression Imputed with random error

Case      educat_mar   educat_mean   educat_ri   educat_rir
1         16           16.00         16          16
2         14           14.00         14          14
3         13           13.00         13          13
4         17           17.00         17          17
5         .            12.97         13          16
6         14           14.00         14          14
7         .            12.97         12          15
8         16           16.00         16          16
9         15           15.00         15          15
10        .            12.97         12          14
11        .            12.97         22          22
12        12           12.00         12          12
13        .            12.97         14          13
14        12           12.00         12          12
15        .            12.97         13          14
Total N   9            15            15          15

a. Limited to first 15 cases.

1.5 Imputation

             Full Data Set       Mean Imputation
Parameter    B        Sig.       B        Sig.
Intercept    .745     .000       1.044    .000
exper        .011     .000       .005     .031
educat       .092     .000       .075     .000
marr         .079     .062       .107     .016
union        .201     .000       .228     .000
sex          -.232    .000       -.218    .000
south        -.105    .015       -.126    .006
[race=1]     -.098    .097       -.114    .069
[race=2]     -.086    .333       -.135    .157
[race=3]     0(a)     .          0        .

Full Data Set: n = 534, Adjusted R2 = .300
Mean Imputation: n = 534, Adjusted R2 = .201

1.5 Imputation

             Full Data Set       Regression Imputation
Parameter    B        Sig.       B        Sig.
Intercept    .745     .000       .339     .010
exper        .011     .000       .012     .000
educat       .092     .000       .121     .000
marr         .079     .062       .082     .042
union        .201     .000       .205     .000
sex          -.232    .000       -.209    .000
south        -.105    .015       -.088    .032
[race=1]     -.098    .097       -.097    .084
[race=2]     -.086    .333       -.064    .448
[race=3]     0(a)     .          0        .

Full Data Set: n = 534, Adjusted R2 = .300
Regression Imputation: n = 534, Adjusted R2 = .363

1.5 Imputation

             Full Data Set       Regression Imputation w/ Random Error
Parameter    B        Sig.       B        Sig.
Intercept    .745     .000       .942     .000
exper        .011     .000       .007     .000
educat       .092     .000       .079     .000
marr         .079     .062       .100     .020
union        .201     .000       .221     .000
sex          -.232    .000       -.217    .000
south        -.105    .015       -.092    .039
[race=1]     -.098    .097       -.104    .084
[race=2]     -.086    .333       -.138    .130
[race=3]     0(a)     .          0        .

Full Data Set: n = 534, Adjusted R2 = .300
Regression Imputation w/ Random Error: n = 534, Adjusted R2 = .262

1.5 Imputation

Summary
1. Imputations can give unbiased estimates if they:
   - Condition on observed variables
   - Are multivariate, to preserve associations among variables
   - Generally have a random component
2. Single imputations:
   - Give full power
   - Underestimate standard errors
   - Underestimate standard errors more as the % missingness increases
3. Good imputation takes a long time
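The point about standard errors can be checked with a small sketch. It uses mean imputation as the simplest single-imputation case and compares the standard error of the mean computed after imputation (treating filled-in values as real) with the standard error from the observed cases alone. The function name, the simulated variable, and the missingness fractions are assumptions chosen for illustration.

```python
import numpy as np

def se_ratio_after_mean_imputation(y_obs, n_total):
    """SE of the mean after mean imputation, divided by the observed-case SE."""
    y_obs = np.asarray(y_obs, dtype=float)
    n_obs = y_obs.size
    # Fill the missing cases with the observed mean, then treat them as real data.
    filled = np.concatenate([y_obs, np.full(n_total - n_obs, y_obs.mean())])
    se_obs = y_obs.std(ddof=1) / np.sqrt(n_obs)
    se_filled = filled.std(ddof=1) / np.sqrt(n_total)
    return se_filled / se_obs

rng = np.random.default_rng(2)
n_total = 500
y = rng.normal(loc=13.0, scale=2.6, size=n_total)  # hypothetical complete variable

for frac_missing in (0.1, 0.3, 0.5):
    n_obs = int(round(n_total * (1 - frac_missing)))
    ratio = se_ratio_after_mean_imputation(y[:n_obs], n_total)
    print(f"{frac_missing:.0%} missing: reported SE is {ratio:.2f} times the observed-case SE")
```

Because mean imputation leaves the sum of squared deviations unchanged while inflating n, the ratio works out to sqrt(n_obs(n_obs - 1) / (n(n - 1))), roughly the fraction of cases observed. The loop prints ratios of about 0.90, 0.70, and 0.50, so the understatement grows with the amount of missing data.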
