Stat405 Displaying distributions Hadley Wickham Wednesday, June 13, 12
1. The diamonds data
2. Histograms and bar charts
3. Scatterplots for big data
Diamonds
Diamonds data
~54,000 round diamonds from http://www.diamondse.info/
Carat, colour, clarity, cut
Total depth, table, depth, width, height
Price
[Figure: diagram of a round diamond labelling x, z and the table width]
depth = z / diameter
table = table width / x * 100
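As a rough check of the depth formula above, the following sketch recomputes depth from the measurements in the data. It assumes, which the slide does not state, that the diameter can be approximated by the mean of x and y and that depth is stored as a percentage. The names diameter and depth_est are made up for illustration.

```r
library(ggplot2)  # provides the diamonds data

# Assumption: diameter is roughly the mean of x and y,
# and the depth column is a percentage
diameter <- (diamonds$x + diamonds$y) / 2
depth_est <- 100 * diamonds$z / diameter

# Compare the estimate with the recorded depth; rows with zero
# measurements give NaN or Inf and are worth inspecting separately
summary(depth_est - diamonds$depth)
```

Large differences in the summary would point either to data errors or to the assumption about the diameter being too crude.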
Recall
Write down five ways to inspect the diamonds dataset. You have one minute!
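For reference, here is one possible set of five, using only base R functions applied to the diamonds data that ships with ggplot2. It is a sketch of common choices, not the list the slide had in mind.

```r
library(ggplot2)  # provides the diamonds data

head(diamonds)     # first few rows
str(diamonds)      # compact structure: column types and example values
summary(diamonds)  # per-column summaries
dim(diamonds)      # number of rows and columns
names(diamonds)    # column names
```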
Histograms & bar charts
Histograms and bar charts
Used to display the distribution of a variable
Categorical variable → bar chart
Continuous variable → histogram
library(ggplot2)  # provides qplot() and the diamonds data

# With only one variable, qplot guesses that
# you want a bar chart or histogram
qplot(cut, data = diamonds)
qplot(carat, data = diamonds)

# Change binwidth:
qplot(carat, data = diamonds, binwidth = 1)
qplot(carat, data = diamonds, binwidth = 0.1)
qplot(carat, data = diamonds, binwidth = 0.01)
resolution(diamonds$carat)

last_plot() + xlim(0, 3)
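The call to resolution() above is a quick way to find the finest binwidth worth trying: it reports the smallest non-zero gap between recorded values. The sketch below stores that value and estimates how many bins each binwidth tried above would produce. The names res, binwidths and n_bins are illustrative.

```r
library(ggplot2)

# Smallest non-zero gap between recorded carat values
res <- resolution(diamonds$carat)
res

# Approximate number of bins for each binwidth tried above
binwidths <- c(1, 0.1, 0.01)
n_bins <- ceiling(diff(range(diamonds$carat)) / binwidths)
setNames(n_bins, binwidths)
```

A binwidth below the resolution cannot reveal more detail, because no two distinct values fall closer together than that.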
Always experiment with the bin width!
qplot(table, data = diamonds, binwidth = 1)

# To zoom in on a plot region use xlim() and ylim()
qplot(table, data = diamonds, binwidth = 1) +
  xlim(50, 70)
qplot(table, data = diamonds, binwidth = 0.1) +
  xlim(50, 70)
qplot(table, data = diamonds, binwidth = 0.1) +
  xlim(50, 70) + ylim(0, 50)

# Note that this type of zooming discards data
# outside of the plot regions. See
# ?coord_cartesian for an alternative
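To make the difference concrete, the sketch below builds the same histogram twice, once with xlim() and once with coord_cartesian(), and totals the bin counts that ggplot2 computed for each. The names p_clip and p_zoom are illustrative; layer_data() is the ggplot2 helper that returns the computed data for a layer.

```r
library(ggplot2)

# xlim() sets scale limits: values outside 50-70 are dropped before binning
p_clip <- qplot(table, data = diamonds, binwidth = 1) + xlim(50, 70)

# coord_cartesian() only changes the visible window; all data are binned
p_zoom <- qplot(table, data = diamonds, binwidth = 1) +
  coord_cartesian(xlim = c(50, 70))

# Total count across bins for each plot
# (layer_data() may warn about the rows removed by xlim())
sum(layer_data(p_clip)$count)
sum(layer_data(p_zoom)$count)
```

The zoomed version should account for every row of diamonds, while the clipped version should fall short by the number of diamonds with a table value outside the limits.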
Additional variables
As with scatterplots, you can use aesthetics or faceting. Using aesthetics creates pretty, but ineffective, plots. The following examples show the difference when investigating the relationship between cut and depth.
[Figure: histogram of depth; x-axis: depth, y-axis: count]
qplot(depth, data = diamonds, binwidth = 0.2)
[Figure: stacked histogram of depth, filled by cut (Fair, Good, Very Good, Premium, Ideal); x-axis: depth, y-axis: count]
qplot(depth, data = diamonds, binwidth = 0.2, fill = cut) + xlim(55, 70)
Fill is the aesthetic for fill colour
[Figure: histograms of depth, one panel per cut (Fair, Good, Very Good, Premium, Ideal); x-axis: depth, y-axis: count]
qplot(depth, data = diamonds, binwidth = 0.2) + xlim(55, 70) + facet_wrap(~ cut)
Your turn
Explore the distribution of price. What is a good binwidth to use? (Hint: How many bins will a binwidth of 1 give you?)
Practice zooming in on regions of interest.
How does price vary with colour, cut, or clarity?
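One way to approach the hint is to divide the width of the price range by a few candidate binwidths. The sketch below does that; the candidate values and the names price_span, candidates and bins_needed are illustrative choices, not part of the exercise.

```r
library(ggplot2)  # provides the diamonds data

# Width of the price range in dollars
price_span <- diff(range(diamonds$price))

# Rough number of bins implied by a few candidate binwidths
candidates <- c(1, 10, 100, 500)
bins_needed <- ceiling(price_span / candidates)
data.frame(binwidth = candidates, bins = bins_needed)
```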
[Figure: histograms of price, one panel per cut (Fair, Good, Very Good, Premium, Ideal); x-axis: price, y-axis: count]
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
What makes it difficult to compare the shape of the distributions? Brainstorm for 1 minute.
Problems
Each histogram is far away from the others, but we know stacking is hard to read → use another way of displaying densities
Varying relative abundance makes comparisons difficult → rescale to ensure constant area
# Large distances make comparisons hard
qplot(price, data = diamonds, binwidth = 500) +
  facet_wrap(~ cut)

# Stacked heights hard to compare
qplot(price, data = diamonds, binwidth = 500, fill = cut)

# Much better - but still have differing relative abundance
qplot(price, data = diamonds, binwidth = 500,
  geom = "freqpoly", colour = cut)

# Instead of displaying count on y-axis, display density
# .. indicates that variable isn't in original data
qplot(price, ..density.., data = diamonds, binwidth = 500,
  geom = "freqpoly", colour = cut)

# To use with histogram, you need to be explicit
qplot(price, ..density.., data = diamonds, binwidth = 500,
  geom = "histogram") + facet_wrap(~ cut)
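The density shown on the y-axis is each bin's count divided by the group total and by the bin width, which is why every curve then encloses the same area. The sketch below checks this using ggplot() and after_stat(density). It assumes a recent ggplot2 (3.3.0 or later), where after_stat() is offered as the replacement for the ..density.. notation. The names p_density, bins and area_by_group are illustrative.

```r
library(ggplot2)

binwidth <- 500

# Same plot as the density frequency polygon above, written with ggplot()
p_density <- ggplot(diamonds, aes(price, after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = binwidth)

# density = count / (group total * binwidth), so within each cut
# the densities multiplied by the bin width should sum to 1
bins <- layer_data(p_density)
area_by_group <- tapply(bins$density, bins$group, sum) * binwidth
area_by_group
```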
Scatterplots for big data
What’s the problem with this plot? In pairs, brainstorm solutions for 2 minutes.
Idea              ggplot
Small points      shape = I(".")
Transparency      alpha = I(1/50)
Jittering         geom = "jitter"
Smooth curve      geom = "smooth"
2d bins           geom = "bin2d" or geom = "hex"
Density contours  geom = "density2d"
Boxplots          geom = "boxplot" + group = ...
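As a brief illustration of two rows from the table, the sketch below applies small points and 2d bins to carat and price. The choice of variables and the names p_small and p_bin2d are illustrative.

```r
library(ggplot2)

# Small points: draw each diamond as a single dot
p_small <- qplot(carat, price, data = diamonds, shape = I("."))

# 2d bins: colour each rectangle by the number of diamonds in it
p_bin2d <- qplot(carat, price, data = diamonds, geom = "bin2d")

# geom = "hex" works the same way but needs the hexbin package installed
p_small
p_bin2d
```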
# There are two ways to add additional geoms
# 1) A vector of geom names:
qplot(price, carat, data = diamonds,
  geom = c("point", "smooth"))

# 2) Add on extra geoms
qplot(price, carat, data = diamonds) + geom_smooth()

# This is how you get help about a specific geom:
# ?geom_smooth
# To set aesthetics to a particular value, you need
# to wrap that value in I()
qplot(price, carat, data = diamonds, colour = "blue")
qplot(price, carat, data = diamonds, colour = I("blue"))

# Practical application: varying alpha
qplot(carat, price, data = diamonds, alpha = I(1/10))
qplot(carat, price, data = diamonds, alpha = I(1/50))
qplot(carat, price, data = diamonds, alpha = I(1/100))
qplot(carat, price, data = diamonds, alpha = I(1/250))
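The distinction behind I() is between mapping an aesthetic to a variable and setting it to a fixed value. The sketch below shows the same distinction in ggplot() syntax, where values inside aes() are mapped and values outside it are set. This is an equivalent formulation rather than the slide's own code, and the names p_mapped and p_set are illustrative.

```r
library(ggplot2)

# Mapping: "blue" is treated as a one-level variable, so the colour
# scale picks the colour and a legend appears
p_mapped <- ggplot(diamonds, aes(carat, price, colour = "blue")) +
  geom_point()

# Setting: the points are drawn in blue and there is no legend
p_set <- ggplot(diamonds, aes(carat, price)) +
  geom_point(colour = "blue")
```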
qplot(table, price, data = diamonds)
qplot(table, price, data = diamonds, geom = "boxplot")

# Need to specify grouping variable: what determines
# which observations go into each boxplot
# (round_any() comes from the plyr package)
qplot(table, price, data = diamonds, geom = "boxplot",
  group = round_any(table, 1))
qplot(table, price, data = diamonds, geom = "boxplot",
  group = round_any(table, 1)) + xlim(50, 70)
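The group argument in the boxplot code above rounds table to the nearest whole number, so each distinct rounded value gets its own boxplot. The sketch below reproduces that rounding in base R to show how the groups are formed. round_to is a hypothetical stand-in written for this example, not a plyr function, and the accuracy of 2 in the last plot is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical base-R stand-in for plyr's round_any(x, accuracy)
round_to <- function(x, accuracy) round(x / accuracy) * accuracy

# Each distinct rounded value of table defines one boxplot
table_group <- round_to(diamonds$table, 1)
head(sort(table(table_group), decreasing = TRUE))

# Wider groups give fewer, better-populated boxplots
p_box <- qplot(table, price, data = diamonds, geom = "boxplot",
               group = round_to(table, 2)) + xlim(50, 70)
p_box
```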
Your turn
Explore the relationship between carat, price and cut using these techniques. (i.e. make this plot more informative: qplot(carat, price, data = diamonds, colour = cut))
Which did you find most useful?