DATA VISUALIZATION WITH GGPLOT2
Introduction
Data Visualization with ggplot2
Chapter 1
[Figures: density curve annotated "bimodal"; sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore); price by cut (Fair, Good, Very Good, Premium, Ideal)]

Chapter 2
[Figures: categories labeled A+, A−, B+, B−, AB+, AB−, O+, O−; price against carat; axes labeled Sand, Silt and Clay; "Residuals vs Fitted" plot for lm(Volume ~ Girth), Residuals against Fitted values]

Chapter 3
[Figure: labeled landmarks: Alexander Platz, Reichstag, Victory Column, Brandenburger Tor, Potsdamer Platz, Checkpoint Charlie]

Chapter 4
● Introduction to grid
● Manipulating graphical objects
● ggplot_build()
● gridExtra

Chapter 5
[Figures: temp against new_day in panels labeled PARIS, REYKJAVIK, NEW YORK and LONDON; legend: past record high, past record low, 95% CI range, Current year, New record high, New record low; a scatter of group2 against group1]
Box Plots

Statistical plots
● Academic audience
● 2 common types
  ● Box plots
  ● Density plots
● Case study: 2D box plots

Box plot
● John Tukey - Exploratory Data Analysis
● Visualizing the 5 number summary

[Figure sequence: a sample of points on a "values" axis, annotated one step at a time]
● mean and standard deviation: Not robust!
● 5-number summary:
  1. minimum
  2. Q1
  3. Q2 = median
  4. Q3
  5. maximum
● IQR = interquartile range
● Each of the four intervals between consecutive summary values is marked 25%
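As a minimal sketch of the summary listed above, base R can compute the five numbers directly with fivenum(). The sample below is simulated and purely hypothetical; it is not the data behind the slide figures. Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles returned by quantile() and used by IQR().

```r
# Hypothetical sample for illustration only (not the data from the slides)
set.seed(1)
values <- rnorm(30)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(values)
names(five) <- c("minimum", "Q1", "median", "Q3", "maximum")
print(five)

# Interquartile range; IQR() is based on quantile(), so it can differ
# slightly from the distance between the two hinges above
iqr <- IQR(values)

# Robustness check: add one extreme value and compare how far each
# measure of centre moves
with_outlier <- c(values, 100)
mean_shift   <- abs(mean(with_outlier) - mean(values))
median_shift <- abs(median(with_outlier) - median(values))
cat("shift in mean:", mean_shift, " shift in median:", median_shift, "\n")
```

The single extreme value should move the mean far more than the median, which is the sense in which the mean is "not robust".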

[Figure sequence: box plot built up over a second sample that includes extreme points; y-axis: values, from −2 to 6]
Density Plots

Density plot
● Distribution of univariate data
● Statistics: Probability Density Function
  ● Theoretical: based on formula
  ● Empirical: based on data
[Figure: theoretical density curves, f(x) against x; panels: Standard Normal Curve, t (8), F (2,18), chi−sq (1)]
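The theoretical/empirical distinction can be sketched in base R. The snippet below evaluates the four distributions named in the panels from their formulas, checks that two of them integrate to 1, and contrasts that with an empirical estimate from data. The grids and the simulated sample are assumptions made for illustration.

```r
# Theoretical PDFs: evaluated from a formula, no data involved
x     <- seq(-3, 3, length.out = 601)
x_pos <- seq(0.01, 4, length.out = 400)    # F and chi-squared live on x > 0

pdf_normal <- dnorm(x)                      # standard normal
pdf_t8     <- dt(x, df = 8)                 # t with 8 degrees of freedom
pdf_f      <- df(x_pos, df1 = 2, df2 = 18)  # F with (2, 18) degrees of freedom
pdf_chisq  <- dchisq(x_pos, df = 1)         # chi-squared with 1 degree of freedom

# A density integrates to 1 over its support
area_normal <- integrate(dnorm, -Inf, Inf)$value
area_t8     <- integrate(function(v) dt(v, df = 8), -Inf, Inf)$value
cat("area under normal:", area_normal, " area under t(8):", area_t8, "\n")

# Empirical counterpart: estimated from data (here a simulated sample)
set.seed(1)
empirical <- density(rnorm(200))
```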

Kernel Density Estimate (KDE)
A sum of 'bumps' placed at the observations.
The kernel function determines the shape of the bumps
while the window width, h, determines their width.
Source: Brian S. Everitt and Torsten Hothorn, A Handbook of Statistical Analyses Using R

Example
> x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
> x
[1] 0.0 1.0 1.1 1.5 1.9 2.8 2.9 3.5
[Figure: the observations marked along the x-axis, which runs from −2.0 to 5.5]

Bumps
[Figure: one bump drawn at each observation; x-axis: x, from −2.0 to 5.5; y-axis: values, from 0.0 to 0.4]

Sum of bumps
● Empirical Probability Density Function
● Many overlapping lines -> higher value -> higher density
● mode = value at which the probability density function has its maximum value
[Figure: the individual bumps and their sum; x-axis: x, from −2.0 to 5.5; y-axis: values, from 0.0 to 0.4]
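The "sum of bumps" idea can be written out by hand in a few lines. This is a sketch assuming Gaussian bumps, the eight observations from the Example slide, and a hypothetical helper named kde_by_hand(); it is not necessarily how the slide figures were produced. The two bandwidths used are R's default rule of thumb (about 0.69 for these data) and 0.4, so the printed peaks should land close to the values annotated on the Bandwidth slide.

```r
# Observations from the Example slide
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)

# Sum of Gaussian bumps: one bump per observation, each with width h.
# kde_by_hand() is a hypothetical helper written for this sketch.
kde_by_hand <- function(grid, obs, h) {
  bumps <- sapply(obs, function(xi) dnorm((grid - xi) / h) / (length(obs) * h))
  rowSums(bumps)   # rows = grid points, columns = observations
}

grid <- seq(-2, 5.5, by = 0.01)

h_default   <- bw.nrd0(x)                       # R's default rule of thumb
est_default <- kde_by_hand(grid, x, h_default)
est_narrow  <- kde_by_hand(grid, x, 0.4)

# Mode: the grid value where the estimated density is highest
mode_default <- grid[which.max(est_default)]

cat("default bandwidth:", round(h_default, 2), "\n")
cat("peak density (default bandwidth):", round(max(est_default), 3), "\n")
cat("peak density (h = 0.4):", round(max(est_narrow), 3), "\n")
cat("mode (default bandwidth):", mode_default, "\n")
```

A smaller h gives narrower, taller bumps, so the summed curve is more peaked; a larger h smooths the curve and lowers the peak.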

Bandwidth - h
Remember: Density plots are representations of the underlying distribution!
[Figures: density estimate of x at two bandwidths; bw = 0.69 (peak at 0.279) and bw = 0.4 (peak at 0.355); x-axis from −2.0 to 5.5; y-axis: values]

geom_density()
● Intermediate steps (bw = 0.4): plot extends beyond limits of data
● geom_density() (bw = 0.4, restricted to range): area ≠ 1
● happens for every bandwidth!
[Figures: the bw = 0.4 curve over the full x-axis (y-axis: values) and the same curve cut to the range of the data (y-axis: density)]
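The "area ≠ 1" point can be checked numerically with base R's density(). This is a sketch assuming, as the caption "restricted to range" suggests, that the restricted curve is simply the same estimate evaluated between the smallest and largest observation; trapz() is a hypothetical helper for the trapezoidal rule.

```r
# Observations from the Example slide
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)

# Trapezoidal rule for the area under a curve given as (x, y) points
trapz <- function(px, py) sum(diff(px) * (head(py, -1) + tail(py, -1)) / 2)

# Default evaluation range: extends 3 bandwidths beyond the data on each side
full <- density(x, bw = 0.4)

# Same estimate, evaluated only between the smallest and largest observation
restricted <- density(x, bw = 0.4, from = min(x), to = max(x))

area_full       <- trapz(full$x, full$y)
area_restricted <- trapz(restricted$x, restricted$y)

cat("area, full curve:      ", round(area_full, 3), "\n")
cat("area, restricted curve:", round(area_restricted, 3), "\n")
```

The bumps centred on the extreme observations have half of their mass outside the data range, so cutting the curve at the limits of the data always removes some area, whatever the bandwidth.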
Multiple Groups/Variables

Groups
Levels within a factor variable
> head(mammals)
         vore sleep_total
1   Carnivore        12.1
2    Omnivore        17.0
3   Herbivore        14.4
4    Omnivore        14.9
5   Herbivore         4.0
6   Herbivore        14.4
> levels(mammals$vore)
[1] "Carnivore"   "Herbivore"   "Insectivore" "Omnivore"

Jittered points
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_point(position = position_jitter(0.2))
[Figure: jittered points of sleep_total for each vore level (Carnivore, Herbivore, Insectivore, Omnivore)]

Box plot
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
5 observations - meaningless!
[Figure: box plots of sleep_total for each vore level (Carnivore, Herbivore, Insectivore, Omnivore)]

Box plot (2)
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_boxplot(varwidth = TRUE)
[Figure: box plots of sleep_total for each vore level, drawn with variable box widths]

Density plots
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35)
abundant, but only 5 observations!
[Figure: overlapping density curves of sleep_total, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore); y-axis: density]
> # Add weights
> library(dplyr)
> mammals <- mammals %>%
+   group_by(vore) %>%
+   mutate(n = n() / nrow(mammals))

Weighted
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35)
[Figure: weighted density curves of sleep_total, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore); y-axis: density]
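The motivation for weighting, that a small group should not look as prominent as a large one, can be illustrated in base R. The sketch below does not reproduce ggplot2's weight aesthetic; it rescales each group's density by that group's share of the rows, so the area under each curve reflects group size. The built-in mtcars data (mpg grouped by cyl) stands in for mammals, and trapz() is a hypothetical helper.

```r
# Built-in data used purely for illustration (not the slides' mammals data)
groups <- split(mtcars$mpg, mtcars$cyl)
share  <- sapply(groups, length) / nrow(mtcars)   # proportion of rows per group

# Trapezoidal rule for the area under a curve given as (x, y) points
trapz <- function(px, py) sum(diff(px) * (head(py, -1) + tail(py, -1)) / 2)

scaled <- lapply(names(groups), function(g) {
  d <- density(groups[[g]])   # each curve integrates to about 1 on its own
  d$y <- d$y * share[[g]]     # rescale so the area is about the group's share
  d
})
names(scaled) <- names(groups)

areas <- sapply(scaled, function(d) trapz(d$x, d$y))
print(round(rbind(share = share, area = areas), 3))
```

After rescaling, the areas sum to roughly 1 across groups, so a group with few observations contributes a correspondingly small curve.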

Violin plot
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_violin()
[Figure: violin plots of sleep_total for each vore level (Carnivore, Herbivore, Insectivore, Omnivore)]

Weighted
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin(aes(weight = n), col = NA)
[Figure: weighted violin plots of sleep_total for each vore level, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore)]

Compare separate variables
> dim(faithful)
[1] 272   2
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

First look
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()
[Figure: scatter plot of eruptions (2 to 5) against waiting (50 to 90)]

2D density plot
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_density_2d()
[Figure: 2D density plot of eruptions (2 to 5) against waiting (50 to 90)]

2D density plot
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "tile", aes(fill = ..density..), contour = FALSE)
[Figure: tiles filled by density (legend from 0.005 to 0.025); x-axis: waiting, y-axis: eruptions]

Viridis
library(viridis)
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "tile", aes(fill = ..density..), contour = FALSE) +
  scale_fill_viridis()
[Figure: tiles filled by density on the viridis colour scale (legend from 0.005 to 0.025); x-axis: waiting, y-axis: eruptions]

Grid of circles
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "point", aes(size = ..density..), n = 20, contour = FALSE) +
  scale_size(range = c(0, 9))
[Figure: regular grid of points sized by density (legend from 0.005 to 0.020); x-axis: waiting, y-axis: eruptions]
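The grid behind the tile and circle plots can be inspected without any plotting. This is a sketch using MASS::kde2d(), the 2D kernel density estimator that the ggplot2 documentation names for stat_density_2d(); the bandwidth and limits here are kde2d()'s defaults, which are assumed rather than confirmed to match the slide figures exactly.

```r
# MASS ships with standard R installations as a recommended package.
# 2D kernel density estimate on a 20 x 20 grid over the range of the data
est <- MASS::kde2d(faithful$waiting, faithful$eruptions, n = 20)

str(est)   # list with grid coordinates x, y and a matrix z of densities

# Long format, one row per grid cell, e.g. for drawing tiles or sized points.
# z[i, j] is the density at (x[i], y[j]), so x varies fastest in as.vector(z).
grid_df <- data.frame(
  waiting   = rep(est$x, times = length(est$y)),
  eruptions = rep(est$y, each  = length(est$x)),
  density   = as.vector(est$z)
)

# Grid cell with the highest estimated density
peak <- grid_df[which.max(grid_df$density), ]
print(peak)
```

With n = 20 there are 400 grid cells, each of which becomes one tile or one circle in the plots above.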