DATA VISUALIZATION WITH GGPLOT2


Introduction

Chapter 1
[Figure panels: density curve of a bimodal variable; price by cut (Fair, Good, Very Good, Premium, Ideal); sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Chapter 2
[Figure panels: chart labelled A+, A−, B+, B−, AB+, AB−, O+, O−; price against carat; plot with axes Sand, Silt and Clay; Residuals vs Fitted for lm(Volume ~ Girth).]

Chapter 3
[Figure with labelled locations: Alexander Platz, Reichstag, Victory Column, Brandenburger Tor, Potsdamer Platz, Checkpoint Charlie.]

Chapter 4
● Introduction to grid
● Manipulating graphical objects
● ggplot_build()
● gridExtra

Chapter 5
[Figure panels: temp against new_day for PARIS, REYKJAVIK, NEW YORK and LONDON; group2 against group1. Legend: past record high, past record low, 95% CI range, Current year, New record high, New record low.]

Let’s practice!

Box Plots

Statistical plots
● Academic audience
● 2 common types
  ● Box plots
  ● Density plots
● Case study: 2D box plots

Box plot
● John Tukey - Exploratory Data Analysis
● Visualizing the 5 number summary

[Figure sequence: dot plot of values (axis from −2 to 2), annotated step by step.]
● mean and standard deviation: Not robust!
● 1: minimum
● 2: Q1
● 3: Q2 = median
● 4: Q3
● 5: maximum
● IQR = interquartile range
● Each of the four intervals between consecutive summary values holds 25% of the observations

5-number summary
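The 5-number summary and its robustness can be illustrated with a minimal base R sketch. The sample `values` below is hypothetical (it is not the data behind the slides), and `quantile()` is used with its default quartile definition; other definitions, such as the hinges returned by `fivenum()`, can differ slightly.

```r
# Hypothetical sample (not from the slides); the last value is an extreme observation.
values <- c(-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.4, 8.0)

# 5-number summary: minimum, Q1, Q2 (median), Q3, maximum.
five <- quantile(values, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five) <- c("minimum", "Q1", "Q2", "Q3", "maximum")

# Interquartile range: distance between Q1 and Q3.
iqr <- unname(five["Q3"] - five["Q1"])

# Robustness check: drop the extreme value and see how far each statistic moves.
trimmed <- values[-length(values)]
shift_mean <- mean(values) - mean(trimmed)
shift_median <- median(values) - median(trimmed)

print(five)
print(c(IQR = iqr, shift_mean = shift_mean, shift_median = shift_median))
```

In this sample the single extreme value moves the mean much further than the median, which is the sense in which the mean and standard deviation are "not robust".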

[Figure sequence: dot plot of a second set of values, axis from −2 to 6.]

Let’s practice!

Density Plots

Density plot
● Distribution of univariate data
● Statistics: Probability Density Function
  ● Theoretical: based on formula
  ● Empirical: based on data
[Figure panels, f(x) against x: Standard Normal Curve; t (8); F (2,18); chi−sq (1).]

Kernel Density Estimate (KDE)
A sum of 'bumps' placed at the observations. The kernel function determines the shape of the bumps while the window width, h, determines their width.

Source: Brian S. Everitt and Torsten Hothorn, A Handbook of Statistical Analyses Using R

Example
> x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
> x
[1] 0.0 1.0 1.1 1.5 1.9 2.8 2.9 3.5
[Figure: the observations marked along the x axis, from −2.0 to 5.5.]

Bumps
[Figure: one bump per observation; values against x, from −2.0 to 5.5.]

Sum of bumps
● Empirical Probability Density Function
● mode = value at which probability density function has its maximum value
● Many overlapping lines -> higher value -> higher density
[Figure: the individual bumps and their sum; values against x, from −2.0 to 5.5.]
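The "sum of bumps" construction can be sketched in base R using the eight observations from the Example slide. The Gaussian kernel, the bandwidth `h = 0.4`, and the evaluation grid are assumptions made for illustration; the names `grid`, `bumps` and `kde` are not from the slides.

```r
# Observations from the Example slide.
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)

h <- 0.4                                  # window width (assumed for this sketch)
grid <- seq(-2, 5.5, length.out = 512)    # points at which the estimate is evaluated

# One Gaussian bump per observation, each scaled by 1/n so the bumps sum to a density.
# Result: a 512 x 8 matrix, one column per observation.
bumps <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = h) / length(x))

# The kernel density estimate is the sum of the bumps at every grid point.
kde <- rowSums(bumps)

# The mode is the grid value where the estimate is highest.
mode_x <- grid[which.max(kde)]

# Cross-check against R's built-in estimator on the same grid.
ref <- density(x, bw = h, kernel = "gaussian", from = -2, to = 5.5, n = 512)
max_diff <- max(abs(kde - ref$y))
print(c(mode = mode_x, max_diff = max_diff))
```

The small remaining difference from `density()` is expected, since `density()` computes its result with a binned approximation rather than summing the bumps exactly.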

Bandwidth - h
Remember: Density plots are representations of the underlying distribution!
[Figure panels, values against x: bw = 0.69 (annotated value 0.279); bw = 0.4 (annotated value 0.355).]
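The effect of the bandwidth can be explored with a short base R sketch on the observations from the Example slide. It assumes that the 0.69 shown on the slide corresponds to the default rule-of-thumb bandwidth of `density()`, computed by `bw.nrd0()`; the slides do not say how that value was chosen.

```r
# Observations from the Example slide.
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)

# Default rule-of-thumb bandwidth used by density() when bw is not given.
bw_default <- bw.nrd0(x)

# Two estimates of the same data: default bandwidth and a narrower one.
d_default <- density(x)
d_narrow <- density(x, bw = 0.4)

# A narrower bandwidth gives narrower, taller bumps and hence a higher peak.
peaks <- c(default = max(d_default$y), narrow = max(d_narrow$y))
print(round(bw_default, 2))
print(peaks)
```

Both curves describe the same eight observations; only the amount of smoothing differs.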

geom_density()
● Intermediate steps: plot extends beyond limits of data
● bw = 0.4, restricted to range: area ≠ 1
● happens for every bandwidth!
[Figure panels: values against x for bw = 0.4, from −2.0 to 5.5; density against x restricted to the range of the data.]
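Why the area is no longer 1 once the curve is cut at the limits of the data can be checked numerically. The sketch below uses base R's `density()` rather than ggplot2, and the helper `trapz()` is a hypothetical name introduced here for a simple trapezoid-rule integral.

```r
# Observations from the Example slide.
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)

# Trapezoid-rule area under a curve given as (gx, gy) points (hypothetical helper).
trapz <- function(gx, gy) sum(diff(gx) * (head(gy, -1) + tail(gy, -1)) / 2)

# By default density() extends 3 bandwidths beyond the extremes of the data (cut = 3).
d_full <- density(x, bw = 0.4)

# Restricting the estimate to the range of the data cuts off the outer tails.
d_range <- density(x, bw = 0.4, from = min(x), to = max(x))

area_full <- trapz(d_full$x, d_full$y)
area_range <- trapz(d_range$x, d_range$y)
print(c(full = area_full, restricted = area_range))
```

The bumps centred on the smallest and largest observations each lose about half of their mass when the curve is cut at the data limits, so the restricted area falls below 1 whatever bandwidth is chosen.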

Let’s practice!

Multiple Groups/Variables

Groups
Levels within a factor variable
> head(mammals)
       vore sleep_total
1 Carnivore        12.1
2  Omnivore        17.0
3 Herbivore        14.4
4  Omnivore        14.9
5 Herbivore         4.0
6 Herbivore        14.4
> levels(mammals$vore)
[1] "Carnivore"   "Herbivore"   "Insectivore" "Omnivore"

Jittered points
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_point(position = position_jitter(0.2))
[Figure: jittered points of sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Box plot
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
5 observations - meaningless!
[Figure: box plots of sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Box plot (2)
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_boxplot(varwidth = TRUE)
[Figure: variable-width box plots of sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Density plots
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35)
abundant, but only 5 observations!
> # Add weights
> mammals <- mammals %>% group_by(vore) %>% mutate(n = n() / nrow(mammals))
[Figure: overlapping density curves of sleep_total, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Weighted
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35)
[Figure: weighted density curves of sleep_total, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore).]
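The weight column used above is each row's group size divided by the total number of rows. A base R sketch of the same calculation follows, on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not the `mammals` data); `ave()` stands in for the `group_by()` and `mutate()` pipeline.

```r
# Hypothetical toy data (not the mammals data from the slides).
toy <- data.frame(
  vore = c("Carnivore", "Carnivore", "Herbivore", "Herbivore", "Herbivore", "Omnivore"),
  sleep_total = c(12.1, 10.4, 14.4, 4.0, 3.1, 17.0)
)

# Group size for every row, computed per level of vore.
group_size <- ave(seq_len(nrow(toy)), toy$vore, FUN = length)

# Weight: the share of all rows that belongs to the row's group.
toy$n <- group_size / nrow(toy)

print(toy)
```

Every row in a group carries the same weight, so groups with few observations contribute less to a weighted density than abundant ones.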

Violin plot
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_violin()
[Figure: violin plots of sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Weighted
ggplot(mammals, aes(x = vore,
                    y = sleep_total,
                    fill = vore)) +
  geom_violin(aes(weight = n), col = NA)
[Figure: weighted violin plots of sleep_total by vore, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore).]

Compare separate variables
> dim(faithful)
[1] 272   2
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

First look
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()
[Figure: scatter plot of eruptions against waiting.]

2D density plot
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_density_2d()
[Figure: 2D density contours of eruptions against waiting.]

2D density plot
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "tile",
                  aes(fill = ..density..), contour = FALSE)
[Figure: tiles of eruptions against waiting, filled by density (legend from 0.005 to 0.025).]
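The density values mapped to fill here come from a 2D kernel density estimate evaluated on a regular grid. A minimal sketch of that calculation uses `MASS::kde2d()` on the built-in `faithful` data; to my understanding this is the estimator ggplot2 relies on, but treat that as an assumption. The grid size `n = 20` and the names `est`, `grid_df` and `peak` are choices made for this illustration.

```r
library(MASS)  # provides kde2d(); a recommended package in standard R installations

# 2D kernel density estimate on a 20 x 20 grid spanning the data.
est <- kde2d(faithful$waiting, faithful$eruptions, n = 20)

# est$x and est$y are the grid coordinates; est$z[i, j] is the density at (x[i], y[j]).
# Reshape to one row per grid cell, the form a tile or point layer would draw.
grid_df <- data.frame(
  expand.grid(waiting = est$x, eruptions = est$y),
  density = as.vector(est$z)
)

# Grid cell with the highest estimated density.
peak <- grid_df[which.max(grid_df$density), ]
print(dim(est$z))
print(peak)
```

Each row of `grid_df` corresponds to one tile (or one circle, when the density is mapped to size instead of fill).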

Viridis
library(viridis)
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "tile", aes(fill = ..density..), contour = FALSE) +
  scale_fill_viridis()
[Figure: tiles of eruptions against waiting, filled by density on the viridis scale (legend from 0.005 to 0.025).]

Grid of circles
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "point", aes(size = ..density..), n = 20, contour = FALSE) +
  scale_size(range = c(0, 9))
[Figure: grid of circles over eruptions against waiting, circle size mapped to density (legend from 0.005 to 0.020).]

Let’s practice!