Essential copying and pasting from Stack Overflow

DataCamp

Supervised Learning in R: Case Studies

Julia Silge

Data Scientist at Stack Overflow

Stack Overflow Developer Survey

> stackoverflow
# A tibble: 6,991 x 22
   Respondent Country    Salary YearsCodedJob OpenSource Hobby CompanySizeNumb…
 1          3 United Ki… 113750         20.0  T          T              10000
 2         15 United Ki… 100000         20.0  F          T               5000
 3         18 United St… 130000         20.0  T          T               1000
 4         19 United St…  82500          3.00 F          T              10000
 5         26 United St… 175000         16.0  F          T              10000
 6         55 Germany     64516          4.00 F          F               1000
 7         62 India        6636          1.00 F          T               5000
 8         71 United St…  65000          1.00 F          T                 20.0
 9         73 United St… 120000         20.0  T          T                100
10         77 United St…  96283         20.0  T          T               1000
# ... with 6,981 more rows, and 15 more variables: Remote,
#   CareerSatisfaction, `Data scientist`, `Database administrator`,
#   `Desktop applications developer`, `Developer with stats/math
#   background`, DevOps, `Embedded developer`, `Graphic designer`,
#   `Graphics programming`, `Machine learning specialist`,
#   `Mobile developer`, `Quality assurance engineer`,
#   `Systems administrator`, `Web developer`

Stack Overflow Developer Survey

Analyze the data yourself!

Class imbalance

> stackoverflow %>%
+     count(Remote)
# A tibble: 2 x 2
  Remote         n
1 Remote       718
2 Not remote  6273

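Building a simple model

> simple_glm <- stackoverflow %>%
+     select(-Respondent) %>%
+     glm(Remote ~ .,
+         family = "binomial",
+         data = .)
>
> summary(simple_glm)

Remote ~ . models Remote as a function of all the other columns.
data = . passes the data frame coming through the pipe to glm() as its data argument.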
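To make the formula shorthand concrete, here is a minimal sketch on a small made-up data frame. The `toy` data, its column names, and the simulated outcome are all hypothetical stand-ins, not the survey data. It uses base R only, so the identifier column is dropped by indexing rather than with `select()` and the pipe.

```r
# Hypothetical toy data standing in for the survey (assumption, not the real tibble).
set.seed(42)
toy <- data.frame(
  id     = 1:200,
  salary = rnorm(200, mean = 80, sd = 20),
  years  = runif(200, min = 0, max = 20)
)

# Simulate a binary outcome that depends weakly on years of experience.
p <- plogis(-2 + 0.1 * toy$years)
toy$remote <- factor(ifelse(runif(200) < p, "Remote", "Not remote"),
                     levels = c("Not remote", "Remote"))

# Drop the identifier first (as the slide does with Respondent),
# then let `remote ~ .` expand to every remaining column as a predictor.
toy_model <- glm(remote ~ .,
                 family = "binomial",
                 data = toy[, setdiff(names(toy), "id")])

# One coefficient for the intercept plus one per remaining predictor.
coef_names <- names(coef(toy_model))
print(coef_names)
```

Leaving the identifier in would make `~ .` treat it as a predictor, which is why it is removed before the model is fit.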



Let's explore the data!

Dealing with imbalanced data

Julia Silge

Data Scientist at Stack Overflow

Class imbalance

Class imbalance is a common problem, and it often negatively affects the performance of your model.

Class imbalance

> stackoverflow %>%
+     count(Remote)
# A tibble: 2 x 2
  Remote         n
1 Remote       718
2 Not remote  6273
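One way to see why these counts are a problem is to work out what a trivial model would score. The sketch below uses only the two counts from the output above; the variable names are illustrative.

```r
# Class counts taken from the count(Remote) output.
n_remote     <- 718
n_not_remote <- 6273
n_total      <- n_remote + n_not_remote

# Share of respondents in the minority class.
minority_share <- n_remote / n_total

# Accuracy of a trivial classifier that always predicts "Not remote".
baseline_accuracy <- n_not_remote / n_total

print(round(c(minority_share    = minority_share,
              baseline_accuracy = baseline_accuracy), 3))
```

Roughly one respondent in ten is remote, so a model can reach about 90% accuracy without ever identifying a remote worker. Accuracy alone is therefore a weak guide for this data.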

Upsampling or oversampling

Add more of the minority class so it has more effect on the predictive model.

Randomly sample with replacement from the minority class until it is the same size as the majority class.
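The procedure described here can be sketched in a few lines of base R. This is an illustration on made-up data (`train_toy` and its columns are hypothetical), not necessarily the code used in the course. Packages such as caret offer a ready-made helper, `upSample()`, for the same idea.

```r
set.seed(123)

# Hypothetical imbalanced training data (names are illustrative only).
train_toy <- data.frame(
  x      = rnorm(100),
  remote = factor(c(rep("Remote", 10), rep("Not remote", 90)))
)

# Split row indices by class and find the size of the largest class.
idx_by_class <- split(seq_len(nrow(train_toy)), train_toy$remote)
target_n     <- max(lengths(idx_by_class))

# Keep every original row, then draw extra rows with replacement
# until each class has target_n rows.
up_idx <- unlist(lapply(idx_by_class, function(idx) {
  extra <- target_n - length(idx)
  c(idx, idx[sample.int(length(idx), size = extra, replace = TRUE)])
}))

up_toy <- train_toy[up_idx, ]
print(table(up_toy$remote))
```

The majority class is left untouched because it needs no extra rows, while the minority rows are repeated until both classes are the same size.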

Implementing upsampling

> up_train <- ... %>%
+     as_tibble()

Implementing upsampling

> stack_glm <- ...

> ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
[1] 0.1740741
> npv(testing_results, truth = Remote, estimate = `Logistic regression`)
[1] 0.9428238
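For readers unfamiliar with the two metrics, the sketch below computes positive predictive value (PPV) and negative predictive value (NPV) by hand from a confusion matrix. The `truth` and `estimate` vectors are made up for illustration and are not the course's `testing_results`. It assumes "Remote" is the positive class.

```r
# Hypothetical true labels and predictions (not the course's testing_results).
truth <- factor(c("Remote", "Remote", "Not remote", "Not remote",
                  "Not remote", "Not remote", "Remote", "Not remote"),
                levels = c("Remote", "Not remote"))
estimate <- factor(c("Remote", "Not remote", "Remote", "Not remote",
                     "Not remote", "Not remote", "Remote", "Remote"),
                   levels = c("Remote", "Not remote"))

# Assumption: "Remote" is treated as the positive class.
positive <- "Remote"
tp <- sum(estimate == positive & truth == positive)  # true positives
fp <- sum(estimate == positive & truth != positive)  # false positives
tn <- sum(estimate != positive & truth != positive)  # true negatives
fn <- sum(estimate != positive & truth == positive)  # false negatives

# PPV: of the cases predicted positive, the fraction that really are positive.
ppv_manual <- tp / (tp + fp)
# NPV: of the cases predicted negative, the fraction that really are negative.
npv_manual <- tn / (tn + fn)

print(c(ppv = ppv_manual, npv = npv_manual))
```

Read this way, and assuming the same positive class, the slide's PPV of about 0.17 means only around 17% of respondents predicted to be remote actually are. The NPV of about 0.94 means predictions of "Not remote" are correct about 94% of the time.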

Let's practice!