DataCamp
Supervised Learning in R: Case Studies
Predicting voter turnout from survey data
Julia Silge
Data Scientist at Stack Overflow
Views of the Electorate Research Survey (VOTER)

Democracy Fund Voter Study Group
Politically diverse group of analysts and scholars in the United States
Data is freely available
Views of the Electorate Research Survey (VOTER)

Life in America today for people like you compared to fifty years ago is better? about the same? worse?
Was your vote primarily a vote in favor of your choice or was it mostly a vote against his/her opponent?
How important are the following issues to you?
    Crime
    Immigration
    The environment
    Gay rights
Views of the Electorate Research Survey (VOTER)

> voters
# A tibble: 6,692 x 43
   case_identifier turnout16_2016 RIGGED_SYSTEM_1_2016 RIGGED_SYSTEM_2_2016
 1             779 Voted                             3                    4
 2            2108 Voted                             2                    1
 3            2597 Voted                             2                    4
 4            4148 Voted                             1                    4
 5            4460 Voted                             3                    1
 6            5225 Voted                             3                    3
 7            5903 Voted                             3                    4
 8            6059 Voted                             2                    3
 9            8048 Voted                             4                    4
10           13112 Voted                             2                    3
# ... with 6,682 more rows, and 39 more variables: RIGGED_SYSTEM_3_2016,
#   RIGGED_SYSTEM_4_2016, RIGGED_SYSTEM_5_2016,
#   RIGGED_SYSTEM_6_2016, track_2016, persfinretro_2016,
#   econtrend_2016, Americatrend_2016, futuretrend_2016,
#   wealth_2016, values_culture_2016, US_respect_2016,
#   trustgovt_2016, trust_people_2016, helpful_people_2016,
#   fair_people_2016, imiss_a_2016, imiss_b_2016,
#   imiss_c_2016, imiss_d_2016, imiss_e_2016,
#   imiss_f_2016, imiss_g_2016, imiss_h_2016,
#   imiss_i_2016, imiss_j_2016, imiss_k_2016,
Interpreting integer survey responses

AMERICA IS A FAIR SOCIETY WHERE EVERYONE HAS THE OPPORTUNITY TO GET AHEAD

Response             Code
Strongly agree       1
Agree                2
Disagree             3
Strongly disagree    4
Learn more about the data yourself!
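The integer codes in the table above can be turned back into labelled categories when exploring the data. A minimal base R sketch follows; the `responses` vector is made up for illustration and is not taken from the survey.

```r
# Hypothetical integer responses (illustrative values, not from the VOTER data)
responses <- c(3L, 2L, 2L, 1L, 4L)

# Code-to-label mapping from the response table
agreement_levels <- c("Strongly agree", "Agree", "Disagree", "Strongly disagree")

# factor() pairs each integer code (levels) with its text label (labels)
labelled <- factor(responses, levels = 1:4, labels = agreement_levels)

print(labelled)
print(table(labelled))
```

Keeping the original integers for modeling and using the labelled version only for summaries and plots is one reasonable way to work with this kind of data.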
Predicting voter turnout

> voters %>%
+     count(turnout16_2016)
# A tibble: 2 x 2
  turnout16_2016     n
1 Did not vote     264
2 Voted           6428
Let's get started!
VOTE 2016
Exploratory data analysis

                Elections don't matter   Gay rights are very important   Crime is very important
Did not vote    55.3%                    17.0%                           66.3%
Voted           34.1%                    25.3%                           57.6%
Fitting a simple model

> simple_glm <- glm(turnout16_2016 ~ .,
+                   family = "binomial",
+                   data = select(voters, -case_identifier))
>
> summary(simple_glm)

Call:
glm(formula = turnout16_2016 ~ ., family = "binomial",
    data = select(voters, -case_identifier))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2373   0.1651   0.2214   0.3004   1.7708

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)           2.457036   0.732721   3.353 0.000799 ***
RIGGED_SYSTEM_1_2016  0.236284   0.085081   2.777 0.005484 **
RIGGED_SYSTEM_2_2016  0.064749   0.089208   0.726 0.467946
RIGGED_SYSTEM_3_2016  0.049357   0.107352   0.460 0.645680
RIGGED_SYSTEM_4_2016 -0.074694   0.087583  -0.853 0.393749
RIGGED_SYSTEM_5_2016  0.190252   0.096454   1.972 0.048556 *
RIGGED_SYSTEM_6_2016 -0.005881   0.101381  -0.058 0.953740
track_2016            0.241075   0.121467   1.985 0.047178 *
persfinretro_2016    -0.040229   0.106714  -0.377 0.706191
econtrend_2016       -0.295370   0.087224  -3.386 0.000708 ***
Fitting a simple model

> library(broom)
>
> simple_glm %>%
+     tidy() %>%
+     filter(p.value < 0.05) %>%
+     arrange(desc(estimate))
                   term    estimate  std.error statistic      p.value
1           (Intercept)  2.45703562 0.73272138  3.353301 7.985370e-04
2          imiss_a_2016  0.39712084 0.13898678  2.857256 4.273207e-03
3          imiss_l_2016  0.27468893 0.10678119  2.572447 1.009825e-02
4          imiss_q_2016  0.24456695 0.11909335  2.053573 4.001699e-02
5            track_2016  0.24107452 0.12146679  1.984695 4.717843e-02
6  RIGGED_SYSTEM_1_2016  0.23628350 0.08508091  2.777162 5.483579e-03
7      futuretrend_2016  0.21056782 0.07120079  2.957380 3.102651e-03
8  RIGGED_SYSTEM_5_2016  0.19025188 0.09645384  1.972466 4.855648e-02
9           wealth_2016 -0.06940523 0.02634395 -2.634580 8.424157e-03
10         imiss_k_2016 -0.18103020 0.08272555 -2.188323 2.864611e-02
11       econtrend_2016 -0.29536980 0.08722417 -3.386330 7.083422e-04
12         imiss_f_2016 -0.32328040 0.10543220 -3.066240 2.167694e-03
13         imiss_g_2016 -0.33203385 0.07867346 -4.220405 2.438640e-05
14         imiss_n_2016 -0.44161183 0.09003981 -4.904628 9.360434e-07
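The estimates in this output are on the log-odds scale, which can be hard to read directly. As a rough sketch, exponentiating a coefficient gives an odds ratio: the factor by which the odds of the modeled outcome change for a one-unit increase in that response code. The three estimates below are copied from the output; the reading of "modeled outcome" as `Voted` assumes it is the second level of `turnout16_2016`, which the slides do not state explicitly.

```r
# Estimates copied from the tidy() output (log-odds scale)
estimates <- c(
  imiss_a_2016   =  0.39712084,
  econtrend_2016 = -0.29536980,
  imiss_n_2016   = -0.44161183
)

# Exponentiating a log-odds coefficient gives an odds ratio:
# above 1 means higher odds per unit increase, below 1 means lower odds
odds_ratios <- exp(estimates)

print(round(odds_ratios, 3))
```

Because the predictors are integer response codes, "one unit" here means moving one step along the response scale, so the direction of each effect depends on how that particular question was coded.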
Let's build some models!
Cross-validation
Cross-validation

Partitioning your data into subsets and using one subset for validation
method = "cv"
method = "repeatedcv"
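To make the partitioning idea concrete, here is a minimal base R sketch of 10-fold cross-validation written out by hand. It uses simulated stand-in data (the `sim` data frame and its columns are hypothetical, not the VOTER survey) and a plain `glm()` fit. In caret, passing `method = "cv"` to `trainControl()` requests this kind of resampling, and `method = "repeatedcv"` repeats it with fresh random partitions.

```r
set.seed(1234)

# Simulated stand-in data (hypothetical; not the VOTER survey)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(sim$x1 + rnorm(n) > 0, "Voted", "Did not vote"))

# Randomly assign each row to one of k folds
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_accuracy <- vapply(seq_len(k), function(i) {
  train    <- sim[fold_id != i, ]
  held_out <- sim[fold_id == i, ]
  fit  <- glm(y ~ x1 + x2, family = "binomial", data = train)
  # glm() models the probability of the second factor level ("Voted")
  prob <- predict(fit, newdata = held_out, type = "response")
  pred <- ifelse(prob > 0.5, levels(sim$y)[2], levels(sim$y)[1])
  mean(pred == held_out$y)
}, numeric(1))

# The cross-validated estimate is the average over the held-out folds
print(mean(fold_accuracy))
```

Every observation is used for validation exactly once, so the averaged accuracy is less dependent on one lucky or unlucky split than a single holdout set would be.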
Cross-validation

Repeated cross-validation can take a long time
Parallel processing can be worth it
Let's practice!
Comparing model performance
Confusion matrix

> confusionMatrix(predict(fit_glm, training),
+                 training$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote          149  1633
  Voted                  63  3510

               Accuracy : 0.6833
                 95% CI : (0.6706, 0.6957)
    No Information Rate : 0.9604
    P-Value [Acc > NIR] : 1

                  Kappa : 0.0847
 Mcnemar's Test P-Value :

> confusionMatrix(predict(fit_rf, training),
+                 training$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote          212     5
  Voted                   0  5138

               Accuracy : 0.9991
                 95% CI : (0.9978, 0.9997)
    No Information Rate : 0.9604
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.9879
 Mcnemar's Test P-Value : 0.07364

            Sensitivity : 1.00000
            Specificity : 0.99903
         Pos Pred Value : 0.97696
         Neg Pred Value : 1.00000
             Prevalence : 0.03959
         Detection Rate : 0.03959
   Detection Prevalence : 0.04052
Confusion matrix for the testing data

> confusionMatrix(predict(fit_glm, testing),
+                 testing$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote           37   428
  Voted                  15   857

               Accuracy : 0.6687
                 95% CI : (0.6427, 0.6939)
    No Information Rate : 0.9611
    P-Value [Acc > NIR] : 1

                  Kappa : 0.0787
 Mcnemar's Test P-Value :

> confusionMatrix(predict(fit_rf, testing),
+                 testing$turnout16_2016)
Confusion Matrix and Statistics

              Reference
Prediction     Did not vote Voted
  Did not vote            0    14
  Voted                  52  1271

               Accuracy : 0.9506
                 95% CI : (0.9376, 0.9616)
    No Information Rate : 0.9611
    P-Value [Acc > NIR] : 0.9767

                  Kappa : -0.0168
 Mcnemar's Test P-Value : 5.254e-06

            Sensitivity : 0.00000
            Specificity : 0.98911
         Pos Pred Value : 0.00000
         Neg Pred Value : 0.96070
             Prevalence : 0.03889
         Detection Rate : 0.00000
   Detection Prevalence : 0.01047
Comparing model performance

> library(yardstick)
>
> sens(testing_results, truth = turnout16_2016, estimate = `Logistic regression`)
[1] 0.7115385
>
> spec(testing_results, truth = turnout16_2016, estimate = `Logistic regression`)
[1] 0.6669261
>
> sens(testing_results, truth = turnout16_2016, estimate = `Random forest`)
[1] 0
>
> spec(testing_results, truth = turnout16_2016, estimate = `Random forest`)
[1] 0.9891051
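These sensitivity and specificity values can be reproduced by hand from the testing-data confusion matrices, which helps show why the random forest's high accuracy is misleading here. The sketch below uses base R only; the counts are copied from the confusion matrices for the testing data, `class_metrics()` is a hypothetical helper written for this illustration, and "Did not vote" is treated as the positive class, which is consistent with the reported values.

```r
# Counts copied from the testing-data confusion matrices
# (rows = prediction, columns = reference; matrix() fills column by column)
labels <- c("Did not vote", "Voted")
cm_glm <- matrix(c(37, 15, 428, 857), nrow = 2,
                 dimnames = list(Prediction = labels, Reference = labels))
cm_rf  <- matrix(c(0, 52, 14, 1271), nrow = 2,
                 dimnames = list(Prediction = labels, Reference = labels))

# Hypothetical helper: metrics with "Did not vote" as the positive class
class_metrics <- function(cm, positive = "Did not vote") {
  negative <- setdiff(colnames(cm), positive)
  c(
    # Share of true positives that the model caught
    sensitivity = cm[positive, positive] / sum(cm[, positive]),
    # Share of true negatives that the model caught
    specificity = cm[negative, negative] / sum(cm[, negative]),
    # Share of all predictions that were correct
    accuracy    = sum(diag(cm)) / sum(cm)
  )
}

m_glm <- class_metrics(cm_glm)
m_rf  <- class_metrics(cm_rf)

print(round(rbind(`Logistic regression` = m_glm, `Random forest` = m_rf), 4))
```

The random forest never correctly identifies a non-voter in the testing data, so its accuracy comes almost entirely from predicting the majority class, while the logistic regression gives up accuracy in exchange for finding most of the non-voters.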
Let's finish this case study!