Predicting voter turnout from survey data

DataCamp

Supervised Learning in R: Case Studies

Julia Silge

Data Scientist at Stack Overflow

Views of the Electorate Research Survey (VOTER)

Democracy Fund Voter Study Group
Politically diverse group of analysts and scholars in the United States
Data is freely available

Life in America today for people like you compared to fifty years ago is better? about the same? worse?
Was your vote primarily a vote in favor of your choice or was it mostly a vote against his/her opponent?
How important are the following issues to you?
    Crime
    Immigration
    The environment
    Gay rights

> voters
# A tibble: 6,692 x 43
   case_identifier turnout16_2016 RIGGED_SYSTEM_1_2016 RIGGED_SYSTEM_2_2016
 1             779 Voted                             3                    4
 2            2108 Voted                             2                    1
 3            2597 Voted                             2                    4
 4            4148 Voted                             1                    4
 5            4460 Voted                             3                    1
 6            5225 Voted                             3                    3
 7            5903 Voted                             3                    4
 8            6059 Voted                             2                    3
 9            8048 Voted                             4                    4
10           13112 Voted                             2                    3
# ... with 6,682 more rows, and 39 more variables: RIGGED_SYSTEM_3_2016,
#   RIGGED_SYSTEM_4_2016, RIGGED_SYSTEM_5_2016,
#   RIGGED_SYSTEM_6_2016, track_2016, persfinretro_2016,
#   econtrend_2016, Americatrend_2016, futuretrend_2016,
#   wealth_2016, values_culture_2016, US_respect_2016,
#   trustgovt_2016, trust_people_2016, helpful_people_2016,
#   fair_people_2016, imiss_a_2016, imiss_b_2016,
#   imiss_c_2016, imiss_d_2016, imiss_e_2016,
#   imiss_f_2016, imiss_g_2016, imiss_h_2016,
#   imiss_i_2016, imiss_j_2016, imiss_k_2016,

Interpreting integer survey responses

AMERICA IS A FAIR SOCIETY WHERE EVERYONE HAS THE OPPORTUNITY TO GET AHEAD

Response             Code
Strongly agree       1
Agree                2
Disagree             3
Strongly disagree    4

Learn more about the data yourself!
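
One way to make the integer codes readable during exploration is to map them back to their labels. The following is a minimal sketch, assuming the four-level coding shown above; the `responses` vector is hypothetical example data, not values from the VOTER survey.

```r
# Hypothetical integer-coded answers (illustrative only, not VOTER data).
responses <- c(1, 4, 2, 2, 3, 1)

# Code-to-label mapping taken from the response table above.
agreement_levels <- c("Strongly agree", "Agree", "Disagree", "Strongly disagree")

# factor() pairs each integer in `levels` with the label at the same position.
responses_labelled <- factor(responses, levels = 1:4, labels = agreement_levels)

# Count how often each answer occurs.
table(responses_labelled)
```

The models later in this case study work on the integer codes directly, so a recoding like this would mainly help with tables and plots.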

Predicting voter turnout

> voters %>%
+     count(turnout16_2016)
# A tibble: 2 x 2
  turnout16_2016     n
1 Did not vote     264
2 Voted           6428
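
The counts above show a strong class imbalance. As a rough sketch, the class shares and the accuracy of a hypothetical baseline that always predicts the majority class can be computed from those two counts alone:

```r
# Counts copied from the count() output above.
turnout_counts <- c("Did not vote" = 264, "Voted" = 6428)

# Share of respondents in each class.
turnout_share <- turnout_counts / sum(turnout_counts)
round(turnout_share, 4)

# Accuracy of a hypothetical baseline that always predicts the majority class.
baseline_accuracy <- max(turnout_share)
baseline_accuracy
```

Any model for this data has to be judged against that baseline, which is why accuracy alone says little here.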

Let's get started!

VOTE 2016

Exploratory data analysis

                Elections don't matter    Gay rights are very important    Crime is very important
Did not vote    55.3%                     17.0%                            66.3%
Voted           34.1%                     25.3%                            57.6%

Fitting a simple model

> simple_glm <- glm(turnout16_2016 ~ ., family = "binomial",
+                   data = select(voters, -case_identifier))
>
> summary(simple_glm)

Call:
glm(formula = turnout16_2016 ~ ., family = "binomial",
    data = select(voters, -case_identifier))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2373   0.1651   0.2214   0.3004   1.7708

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)           2.457036   0.732721   3.353 0.000799 ***
RIGGED_SYSTEM_1_2016  0.236284   0.085081   2.777 0.005484 **
RIGGED_SYSTEM_2_2016  0.064749   0.089208   0.726 0.467946
RIGGED_SYSTEM_3_2016  0.049357   0.107352   0.460 0.645680
RIGGED_SYSTEM_4_2016 -0.074694   0.087583  -0.853 0.393749
RIGGED_SYSTEM_5_2016  0.190252   0.096454   1.972 0.048556 *
RIGGED_SYSTEM_6_2016 -0.005881   0.101381  -0.058 0.953740
track_2016            0.241075   0.121467   1.985 0.047178 *
persfinretro_2016    -0.040229   0.106714  -0.377 0.706191
econtrend_2016       -0.295370   0.087224  -3.386 0.000708 ***

> library(broom)
>
> simple_glm %>%
+     tidy() %>%
+     filter(p.value < 0.05) %>%
+     arrange(desc(estimate))
                   term    estimate  std.error statistic      p.value
1           (Intercept)  2.45703562 0.73272138  3.353301 7.985370e-04
2          imiss_a_2016  0.39712084 0.13898678  2.857256 4.273207e-03
3          imiss_l_2016  0.27468893 0.10678119  2.572447 1.009825e-02
4          imiss_q_2016  0.24456695 0.11909335  2.053573 4.001699e-02
5            track_2016  0.24107452 0.12146679  1.984695 4.717843e-02
6  RIGGED_SYSTEM_1_2016  0.23628350 0.08508091  2.777162 5.483579e-03
7      futuretrend_2016  0.21056782 0.07120079  2.957380 3.102651e-03
8  RIGGED_SYSTEM_5_2016  0.19025188 0.09645384  1.972466 4.855648e-02
9           wealth_2016 -0.06940523 0.02634395 -2.634580 8.424157e-03
10         imiss_k_2016 -0.18103020 0.08272555 -2.188323 2.864611e-02
11       econtrend_2016 -0.29536980 0.08722417 -3.386330 7.083422e-04
12         imiss_f_2016 -0.32328040 0.10543220 -3.066240 2.167694e-03
13         imiss_g_2016 -0.33203385 0.07867346 -4.220405 2.438640e-05
14         imiss_n_2016 -0.44161183 0.09003981 -4.904628 9.360434e-07
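
The estimates in the `tidy()` output are on the log-odds scale, which is hard to read directly. A small sketch of the usual conversion to odds ratios follows, using two estimates copied from the output above. Which class the model treats as the "success" outcome depends on the factor level order of `turnout16_2016`; that it is "Voted" is an assumption here, suggested by the positive intercept.

```r
# Two estimates copied from the tidy() output above (log-odds scale).
estimates <- c(imiss_a_2016 = 0.39712084, imiss_n_2016 = -0.44161183)

# Exponentiating a logistic regression coefficient gives an odds ratio:
# the multiplicative change in the odds of the modeled outcome for a
# one-unit increase in the predictor, holding the other predictors fixed.
odds_ratios <- exp(estimates)
round(odds_ratios, 3)
```

An odds ratio above 1 means higher codes on that question go with higher odds of the modeled outcome, and a value below 1 means lower odds. Because a higher code often means "less important" or "disagree", the direction needs to be read against each question's own coding.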

Let's build some models!

Cross-validation

Partitioning your data into subsets and using one subset for validation; each subset takes a turn as the validation set

method = "cv" (k-fold cross-validation)
method = "repeatedcv" (k-fold cross-validation repeated several times)
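
To make the partitioning idea concrete, here is a minimal base-R sketch of k-fold cross-validation for a logistic regression. It is an illustration under stated assumptions, not the course's code: the data frame `sim` is simulated, and the names `fold_id` and `fold_accuracy` are hypothetical. The `method = "cv"` and `method = "repeatedcv"` values above look like options to caret's `trainControl()`, which would handle this bookkeeping automatically.

```r
set.seed(1234)

# Hypothetical simulated data standing in for the survey (not the VOTER data).
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 + sim$x1 - 0.5 * sim$x2))

# Assign every row to one of k folds at random.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
fold_accuracy <- sapply(seq_len(k), function(i) {
  train_rows <- sim[fold_id != i, ]
  valid_rows <- sim[fold_id == i, ]
  fit <- glm(y ~ x1 + x2, family = "binomial", data = train_rows)
  predicted <- as.integer(predict(fit, newdata = valid_rows, type = "response") > 0.5)
  mean(predicted == valid_rows$y)
})

fold_accuracy        # one accuracy estimate per fold
mean(fold_accuracy)  # cross-validated accuracy
```

Repeated cross-validation would rerun the same loop with fresh random fold assignments and average over all repeats, which is where the extra computing time comes from.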

Repeated cross-validation can take a long time
Parallel processing can be worth it

Let's practice!

Comparing model performance

Confusion matrix

> confusionMatrix(predict(fit_glm, training),
+                 training$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote          149  1633
  Voted                  63  3510

               Accuracy : 0.6833
                 95% CI : (0.6706, 0.6957)
    No Information Rate : 0.9604
    P-Value [Acc > NIR] : 1

                  Kappa : 0.0847
 Mcnemar's Test P-Value :

> confusionMatrix(predict(fit_rf, training),
+                 training$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote          212     5
  Voted                   0  5138

               Accuracy : 0.9991
                 95% CI : (0.9978, 0.9997)
    No Information Rate : 0.9604
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.9879
 Mcnemar's Test P-Value : 0.07364

            Sensitivity : 1.00000
            Specificity : 0.99903
         Pos Pred Value : 0.97696
         Neg Pred Value : 1.00000
             Prevalence : 0.03959
         Detection Rate : 0.03959
   Detection Prevalence : 0.04052

Confusion matrix for the testing data

> confusionMatrix(predict(fit_glm, testing),
+                 testing$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote           37   428
  Voted                  15   857

               Accuracy : 0.6687
                 95% CI : (0.6427, 0.6939)
    No Information Rate : 0.9611
    P-Value [Acc > NIR] : 1

                  Kappa : 0.0787
 Mcnemar's Test P-Value :

> confusionMatrix(predict(fit_rf, testing),
+                 testing$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote            0    14
  Voted                  52  1271

               Accuracy : 0.9506
                 95% CI : (0.9376, 0.9616)
    No Information Rate : 0.9611
    P-Value [Acc > NIR] : 0.9767

                  Kappa : -0.0168
 Mcnemar's Test P-Value : 5.254e-06

            Sensitivity : 0.00000
            Specificity : 0.98911
         Pos Pred Value : 0.00000
         Neg Pred Value : 0.96070
             Prevalence : 0.03889
         Detection Rate : 0.00000
   Detection Prevalence : 0.01047

Comparing model performance

> library(yardstick)
>
> sens(testing_results, truth = turnout16_2016, estimate = `Logistic regression`)
[1] 0.7115385
>
> spec(testing_results, truth = turnout16_2016, estimate = `Logistic regression`)
[1] 0.6669261
>
> sens(testing_results, truth = turnout16_2016, estimate = `Random forest`)
[1] 0
>
> spec(testing_results, truth = turnout16_2016, estimate = `Random forest`)
[1] 0.9891051
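
The sensitivity and specificity values above can be checked by hand from the testing-set confusion matrices. The sketch below uses only base R and the counts reported earlier, with "Did not vote" treated as the positive class as in the `confusionMatrix()` output; the helper names `make_cm`, `sens_from_cm`, and `spec_from_cm` are hypothetical.

```r
# Build a 2 x 2 confusion matrix: rows are predictions, columns are the
# reference (true) classes. "Did not vote" is the positive class.
make_cm <- function(tp, fp, fn, tn) {
  matrix(c(tp, fn, fp, tn), nrow = 2,
         dimnames = list(Prediction = c("Did not vote", "Voted"),
                         Reference  = c("Did not vote", "Voted")))
}

# Testing-set counts copied from the confusion matrices above.
cm_glm <- make_cm(tp = 37, fp = 428, fn = 15, tn = 857)
cm_rf  <- make_cm(tp = 0,  fp = 14,  fn = 52, tn = 1271)

# Sensitivity: share of true non-voters that the model identifies.
sens_from_cm <- function(cm) cm[1, 1] / sum(cm[, 1])
# Specificity: share of true voters that the model identifies.
spec_from_cm <- function(cm) cm[2, 2] / sum(cm[, 2])

c(glm_sens = sens_from_cm(cm_glm), glm_spec = spec_from_cm(cm_glm),
  rf_sens  = sens_from_cm(cm_rf),  rf_spec  = spec_from_cm(cm_rf))
```

The random forest's high testing accuracy comes almost entirely from predicting "Voted"; it identifies none of the 52 non-voters, while the logistic regression identifies 37 of them at the cost of many false alarms.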

Let's finish this case study!