Stack Overflow Developer Survey
> stackoverflow
# A tibble: 6,991 x 22
   Respondent Country    Salary YearsCodedJob OpenSource Hobby CompanySizeNumb…
 1          3 United Ki… 113750         20.0  T          T     10000
 2         15 United Ki… 100000         20.0  F          T     5000
 3         18 United St… 130000         20.0  T          T     1000
 4         19 United St…  82500          3.00 F          T     10000
 5         26 United St… 175000         16.0  F          T     10000
 6         55 Germany     64516          4.00 F          F     1000
 7         62 India        6636          1.00 F          T     5000
 8         71 United St…  65000          1.00 F          T     20.0
 9         73 United St… 120000         20.0  T          T     100
10         77 United St…  96283         20.0  T          T     1000
# ... with 6,981 more rows, and 15 more variables: Remote,
#   CareerSatisfaction, `Data scientist`, `Database administrator`,
#   `Desktop applications developer`, `Developer with stats/math
#   background`, DevOps, `Embedded developer`, `Graphic designer`,
#   `Graphics programming`, `Machine learning specialist`, `Mobile
#   developer`, `Quality assurance engineer`, `Systems administrator`,
#   `Web developer`
DataCamp
Stack Overflow Developer Survey
Analyze the data yourself!
Supervised Learning in R: Case Studies
Class imbalance
> stackoverflow %>%
+     count(Remote)
# A tibble: 2 x 2
  Remote         n
1 Remote       718
2 Not remote  6273
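To put the counts above on a common scale, it can help to turn them into proportions. The following is a minimal sketch, assuming dplyr is installed; `remote_toy` is a hypothetical stand-in that reproduces only the `Remote` column with the counts shown on the slide, not the real survey data.

```r
library(dplyr)

# Hypothetical stand-in for the survey data: only the Remote column,
# built from the counts shown above (718 remote, 6273 not remote).
remote_toy <- data.frame(
  Remote = factor(
    rep(c("Remote", "Not remote"), times = c(718, 6273)),
    levels = c("Remote", "Not remote")
  )
)

# Count each class and add its share of all respondents.
remote_share <- remote_toy %>%
  count(Remote) %>%
  mutate(prop = n / sum(n))

remote_share
```

With these counts, remote respondents make up roughly 10% of the data, so a model that always predicts "Not remote" would already be right about 90% of the time.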
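Building a simple model
> simple_glm <- stackoverflow %>%
+     select(-Respondent) %>%
+     glm(Remote ~ .,
+         family = "binomial",
+         data = .)
>
> summary(simple_glm)
Remote ~ . : model Remote using all remaining columns as predictors
data = . : pass the piped data frame to glm() as its data argument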
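The `Remote ~ .` and `data = .` idioms can be tried on a small made-up data frame. This is a sketch under stated assumptions: `toy` and its simulated values are hypothetical and only borrow a few column names from the survey; dplyr is assumed to be installed for `select()` and `%>%`.

```r
library(dplyr)

set.seed(123)

# Hypothetical toy data; values are simulated, not taken from the survey.
toy <- data.frame(
  Respondent    = 1:200,
  Salary        = rnorm(200, mean = 90000, sd = 20000),
  YearsCodedJob = sample(0:20, 200, replace = TRUE)
)

# Simulated outcome. For a factor response, glm() treats the first level
# as "failure", so this model estimates the probability of "Not remote".
p_not_remote <- plogis(2 - 0.1 * toy$YearsCodedJob)
toy$Remote <- factor(
  ifelse(runif(200) < p_not_remote, "Not remote", "Remote"),
  levels = c("Remote", "Not remote")
)

toy_glm <- toy %>%
  select(-Respondent) %>%        # drop the ID column so it is not a predictor
  glm(Remote ~ .,                # "." in the formula: all remaining columns
      family = "binomial",
      data = .)                  # "." here: the data frame from the pipe

names(coef(toy_glm))
```

Dropping `Respondent` first matters because `Remote ~ .` would otherwise treat the respondent ID as a predictor.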
SUPERVISED LEARNING IN R: CASE STUDIES
Let's explore the data!
SUPERVISED LEARNING IN R: CASE STUDIES
Dealing with imbalanced data
Julia Silge
Data Scientist at Stack Overflow
Class imbalance
Class imbalance
  is a common problem!
  often negatively affects the performance of your model
Class imbalance
> stackoverflow %>%
+     count(Remote)
# A tibble: 2 x 2
  Remote         n
1 Remote       718
2 Not remote  6273
Upsampling or oversampling
Add more of the minority class so it has more effect on the predictive model
Randomly sample with replacement from the minority class until it is the same size as the majority class
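The resampling step described above can be sketched in base R. This is an illustration, not necessarily how the course implements it: `toy`, its 10/90 class split, and the variable names are hypothetical. Packages such as caret also ship a ready-made helper for this (`upSample()`).

```r
set.seed(1234)

# Hypothetical imbalanced toy data: 10 remote vs. 90 not remote.
toy <- data.frame(
  Remote        = factor(rep(c("Remote", "Not remote"), times = c(10, 90))),
  YearsCodedJob = sample(0:20, 100, replace = TRUE)
)

minority_rows <- which(toy$Remote == "Remote")
majority_rows <- which(toy$Remote == "Not remote")

# Sample minority row indices with replacement until the minority class
# is the same size as the majority class.
resampled_minority <- minority_rows[
  sample.int(length(minority_rows), size = length(majority_rows), replace = TRUE)
]

# Keep every majority row and use the resampled minority rows.
upsampled <- toy[c(majority_rows, resampled_minority), ]

table(upsampled$Remote)
```

After this step both classes have 90 rows. Because the draw is with replacement, the minority class consists of repeated copies of its original 10 rows, so no new information is created.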