County demographics
> life
# A tibble: 3,142 x 4
   state   county          expectancy income
 1 Alabama Autauga County      76.060  37773
 2 Alabama Baldwin County      77.630  40121
 3 Alabama Barbour County      74.675  31443
 4 Alabama Bibb County         74.155  29075
 5 Alabama Blount County       75.880  31663
 6 Alabama Bullock County      71.790  25929
 7 Alabama Butler County       73.730  33518
 8 Alabama Calhoun County      73.300  33418
 9 Alabama Chambers County     73.245  31282
10 Alabama Cherokee County     74.650  32645
# ... with 3,132 more rows
Exploratory Data Analysis
Center: mean
> x
 [1] 76 78 75 74 76 72 74 73 73 75 74
> sum(x)/11
[1] 74.54545
> mean(x)
[1] 74.54545
> sd(x_new)  # Was 1.69
[1] 6.987001
> var(x_new)  # Was 2.87
[1] 48.81818
> diff(range(x_new))  # Was 6
[1] 25
> IQR(x_new)  # Doesn't change
[1] 2
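The comparison above can be reproduced with a short sketch. The slides do not show how `x_new` was built; the code below assumes it is `x` with the value 78 replaced by 97, a substitution that is consistent with the printed results. The helper `spread()` is a hypothetical name introduced only for this illustration.

```r
# Original life expectancies from the "Center: mean" slide
x <- c(76, 78, 75, 74, 76, 72, 74, 73, 73, 75, 74)

# Assumption: x_new is x with one extreme value (78 replaced by 97)
x_new <- x
x_new[2] <- 97

# Hypothetical helper: collect the four measures of spread for a vector
spread <- function(v) {
  c(sd = sd(v), var = var(v), range = diff(range(v)), IQR = IQR(v))
}

# One row per vector, one column per measure
comparison <- rbind(x = spread(x), x_new = spread(x_new))
print(round(comparison, 2))
```

The standard deviation, variance, and range all react strongly to the single extreme value, while the IQR depends only on the middle half of the data and stays the same.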
Shape and transformations
Modality
● Unimodal
● Bimodal
● Multimodal
● Uniform
Skew
● Right-skewed
● Left-skewed
● Symmetric
Shape of income
> ggplot(life, aes(x = income, fill = west_coast)) +
    geom_density(alpha = .3)
> ggplot(life, aes(x = log(income), fill = west_coast)) +
    geom_density(alpha = .3)
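A minimal sketch of why the log transformation helps with a right-skewed variable. It uses simulated log-normal values rather than the `life` data, and the parameters and the helper `skew_gap()` are hypothetical choices made only for illustration. The gap between mean and median, scaled by the standard deviation, serves here as a rough indicator of skew.

```r
# Simulated right-skewed incomes (hypothetical data, not the `life` data set)
set.seed(1)
income <- rlnorm(3142, meanlog = 10.4, sdlog = 0.4)

# Hypothetical helper: a long right tail pulls the mean above the median,
# so this gap is positive for right-skewed data and near zero for symmetric data
skew_gap <- function(v) (mean(v) - median(v)) / sd(v)

gap_raw <- skew_gap(income)       # on the original scale
gap_log <- skew_gap(log(income))  # after the log transformation
print(c(raw = gap_raw, log = gap_log))
```

On the log scale the mean and median nearly coincide, which is the numerical counterpart of the more symmetric density in the second plot.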
Outliers
Characteristics of a distribution
● Center
● Variability
● Shape
● Outliers
Indicating outliers
> life <- life %>%
    mutate(is_outlier = income > 75000)
> life %>%
    filter(is_outlier) %>%
    arrange(desc(income))
# A tibble: 45 x 6
   state         county             expectancy
 1 Wyoming       Teton County           82.110
 2 New York      New York County        81.675
 3 Texas         Shackelford County     75.400
 4 Colorado      Pitkin County          82.990
 5 Nebraska      Wheeler County         79.180
 6 California    Marin County           83.230
 7 Nebraska      Kearney County         79.630
 8 Texas         McMullen County        77.320
 9 Massachusetts Nantucket County       80.325
10 Texas         Midland County         77.830
# ... with 35 more rows
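The same flag-then-filter pattern can be tried on a small, self-contained sample. This sketch uses the ten Alabama rows printed on the "County demographics" slide; the name `life_small` and the cutoff of 40000 are hypothetical, chosen only so that this tiny sample contains a flagged row (the slide itself uses 75000 on the full data).

```r
library(dplyr)

# First ten rows of `life`, copied from the "County demographics" slide
life_small <- tibble(
  county = c("Autauga County", "Baldwin County", "Barbour County",
             "Bibb County", "Blount County", "Bullock County",
             "Butler County", "Calhoun County", "Chambers County",
             "Cherokee County"),
  income = c(37773, 40121, 31443, 29075, 31663,
             25929, 33518, 33418, 31282, 32645)
)

# Hypothetical cutoff for this small sample; the slide uses 75000
cutoff <- 40000

# Step 1: store a logical flag alongside the data
life_small <- life_small %>%
  mutate(is_outlier = income > cutoff)

# Step 2: keep only flagged rows, highest income first
outliers <- life_small %>%
  filter(is_outlier) %>%
  arrange(desc(income))
print(outliers)
```

Storing the flag as a column, rather than filtering directly, keeps the full data set intact so the same column can later be used to include or exclude the flagged rows.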