Hadley Wickham

Stat405 Displaying distributions Hadley Wickham Wednesday, June 13, 12

1. The diamonds data
2. Histograms and bar charts
3. Scatterplots for big data


Diamonds

Diamonds data
~54,000 round diamonds from http://www.diamondse.info/
Carat, colour, clarity, cut
Total depth, table, depth, width, height
Price


[Figure: diamond dimensions, labelled x, z and table width]
depth = z / diameter
table = table width / x * 100
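As a quick check of the depth formula, the sketch below recomputes depth from the raw dimensions. It assumes that "diameter" means the mean of x and y and that depth is stored as a percentage, which is how the ggplot2 help page for diamonds describes the variable. The name computed_depth is made up for this example.

```r
library(ggplot2)  # provides the diamonds data

# Assumption: diameter = mean of x and y, depth expressed as a percentage
computed_depth <- with(diamonds, 100 * 2 * z / (x + y))

# Differences from the recorded depth; a few rows have zero dimensions
# and give NaN or other odd values here
summary(computed_depth - diamonds$depth)
```

If the assumption holds, most differences should be close to zero, with the odd rows standing out as data errors worth inspecting.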

Recall Write down five ways to inspect the diamonds dataset. You have one minute!


Histograms & bar charts

Histograms and bar charts
Used to display the distribution of a variable
Categorical variable → bar chart
Continuous variable → histogram


# With only one variable, qplot guesses that
# you want a bar chart or histogram
qplot(cut, data = diamonds)
qplot(carat, data = diamonds)

# Change binwidth:
qplot(carat, data = diamonds, binwidth = 1)
qplot(carat, data = diamonds, binwidth = 0.1)
qplot(carat, data = diamonds, binwidth = 0.01)
resolution(diamonds$carat)

last_plot() + xlim(0, 3)


Always experiment with the bin width!
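One way to judge a bin width before plotting is to estimate how many bins it produces. The sketch below is only a rough guide: the exact count depends on where ggplot2 places the bin boundaries, and the variable names are made up for this example.

```r
library(ggplot2)  # provides the diamonds data and resolution()

carat_range <- range(diamonds$carat)
binwidths <- c(1, 0.1, 0.01)

# Approximate number of bins needed to span the data at each bin width
approx_bins <- ceiling(diff(carat_range) / binwidths)
data.frame(binwidth = binwidths, approx_bins = approx_bins)

# Smallest gap between distinct values: bins narrower than this add nothing
resolution(diamonds$carat)
```

Too few bins hide structure, while too many turn the plot into noise. Comparing the counts against resolution() gives a sensible range to experiment in.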

qplot(table, data = diamonds, binwidth = 1)

# To zoom in on a plot region use xlim() and ylim()
qplot(table, data = diamonds, binwidth = 1) + xlim(50, 70)
qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70)
qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) + ylim(0, 50)

# Note that this type of zooming discards data
# outside of the plot regions. See
# ?coord_cartesian() for an alternative
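To make the difference between the two kinds of zooming concrete, here is a minimal sketch. It uses the long-form ggplot() syntax rather than qplot(), and the object names are made up for this example.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(table)) + geom_histogram(binwidth = 0.1)

# Scale limits: values outside the limits are discarded (with a warning)
p_scale <- base + xlim(50, 70) + ylim(0, 50)

# Coordinate limits: all data are binned, only the visible window changes
p_coord <- base + coord_cartesian(xlim = c(50, 70), ylim = c(0, 50))
```

Printing p_scale and p_coord side by side shows the effect: with ylim(), bars taller than the limit disappear, whereas with coord_cartesian() they are simply cut off at the top of the panel.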


Additional variables
As with scatterplots, you can use aesthetics or faceting.
Using aesthetics creates pretty, but ineffective, plots.
The following examples show the difference when investigating the relationship between cut and depth.


[Figure: histogram of depth; x-axis depth, y-axis count]
qplot(depth, data = diamonds, binwidth = 0.2)

[Figure: stacked histogram of depth, filled by cut (Fair, Good, Very Good, Premium, Ideal); x-axis depth, y-axis count]
Fill is the aesthetic for fill colour
qplot(depth, data = diamonds, binwidth = 0.2, fill = cut) + xlim(55, 70)

[Figure: histograms of depth, one panel per cut (Fair, Good, Very Good, Premium, Ideal); x-axis depth, y-axis count]
qplot(depth, data = diamonds, binwidth = 0.2) + xlim(55, 70) + facet_wrap(~ cut)

Your turn
Explore the distribution of price. What is a good binwidth to use? (Hint: How many bins will a binwidth of 1 give you?)
Practice zooming in on regions of interest.
How does price vary with colour, cut, or clarity?


[Figure: histograms of price, one panel per cut (Fair, Good, Very Good, Premium, Ideal); x-axis price, y-axis count]
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

What makes it difficult to compare the shape of the distributions? Brainstorm for 1 minute.

Problems
Each histogram is far away from the others, but we know stacking is hard to read → use another way of displaying densities
Varying relative abundance makes comparisons difficult → rescale to ensure constant area
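"Rescale to ensure constant area" means dividing each count by (group size × bin width), so that every group's bars cover a total area of 1. A minimal base-R sketch of that arithmetic, with made-up variable names and bin boundaries starting at 0 (an assumption; ggplot2 may place its bins differently):

```r
library(ggplot2)  # provides the diamonds data

binwidth <- 500
breaks <- seq(0, max(diamonds$price) + binwidth, by = binwidth)

# Per-cut density: count / (group size * bin width)
dens_by_cut <- lapply(split(diamonds$price, diamonds$cut), function(p) {
  counts <- table(cut(p, breaks, include.lowest = TRUE))
  as.numeric(counts) / (length(p) * binwidth)
})

# Total area under each group's bars
sapply(dens_by_cut, function(d) sum(d * binwidth))
```

Because every group now has the same total area, differences between the curves reflect shape rather than how many diamonds of each cut were sampled.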


# Large distances make comparisons hard
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

# Stacked heights hard to compare
qplot(price, data = diamonds, binwidth = 500, fill = cut)

# Much better - but still have differing relative abundance
qplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# Instead of displaying count on y-axis, display density
# .. indicates that variable isn't in original data
qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# To use with histogram, you need to be explicit
qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)
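The ..density.. notation belongs to the ggplot2 version used in these slides. More recent releases (3.3.0 onwards, as far as I know) write the same idea as after_stat(density), and qplot() has since been deprecated. A sketch of the long-form equivalents, assuming a recent ggplot2 and using made-up object names:

```r
library(ggplot2)

# Frequency polygons of density, one line per cut
p_freqpoly <- ggplot(diamonds, aes(price, after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 500)

# Density histograms, one panel per cut
p_hist <- ggplot(diamonds, aes(price, after_stat(density))) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~ cut)
```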

Scatterplots for big data

What’s the problem with this plot?

In pairs, brainstorm solutions for 2 minutes.

Idea               ggplot
Small points       shape = I(".")
Transparency       alpha = I(1/50)
Jittering          geom = "jitter"
Smooth curve       geom = "smooth"
2d bins            geom = "bin2d" or geom = "hex"
Density contours   geom = "density2d"
Boxplots           geom = "boxplot" + group = ...
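A few of the ideas in the table, sketched in long-form ggplot() syntax on the carat and price variables. The object names and the choice of bins = 50 are assumptions for illustration, not values from the slides.

```r
library(ggplot2)

p_base <- ggplot(diamonds, aes(carat, price))

# Small points: each observation drawn as a single pixel
p_small <- p_base + geom_point(shape = ".")

# Transparency: roughly 50 overlapping points make a fully opaque spot
p_alpha <- p_base + geom_point(alpha = 1/50)

# 2d bins: colour each rectangle by the number of diamonds in it
p_bin2d <- p_base + geom_bin2d(bins = 50)
```

geom = "hex" works in the same way as bin2d but, as far as I know, needs the hexbin package to be installed.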


# There are two ways to add additional geoms
# 1) A vector of geom names:
qplot(price, carat, data = diamonds, geom = c("point", "smooth"))

# 2) Add on extra geoms
qplot(price, carat, data = diamonds) + geom_smooth()

# This is how you get help about a specific geom:
# ?geom_smooth


# To set aesthetics to a particular value, you need
# to wrap that value in I()
qplot(price, carat, data = diamonds, colour = "blue")
qplot(price, carat, data = diamonds, colour = I("blue"))

# Practical application: varying alpha
qplot(carat, price, data = diamonds, alpha = I(1/10))
qplot(carat, price, data = diamonds, alpha = I(1/50))
qplot(carat, price, data = diamonds, alpha = I(1/100))
qplot(carat, price, data = diamonds, alpha = I(1/250))
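The same distinction between mapping and setting exists in long-form ggplot() syntax, where it depends on whether the value sits inside or outside aes(). A minimal sketch with made-up object names:

```r
library(ggplot2)

# Mapped: "blue" is treated as a one-level variable, so the colour scale
# picks the colour and a legend appears
p_mapped <- ggplot(diamonds, aes(price, carat, colour = "blue")) +
  geom_point()

# Set: the value is passed outside aes(), so the points really are blue
p_set <- ggplot(diamonds, aes(price, carat)) +
  geom_point(colour = "blue")
```

In qplot(), every argument is treated as a mapping, which is why I() is needed to mark a value as a literal.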

qplot(table, price, data = diamonds)
qplot(table, price, data = diamonds, geom = "boxplot")

# Need to specify grouping variable: what determines
# which observations go into each boxplot
# (round_any() is from the plyr package)
qplot(table, price, data = diamonds, geom = "boxplot", group = round_any(table, 1))
qplot(table, price, data = diamonds, geom = "boxplot", group = round_any(table, 1)) + xlim(50, 70)
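To see what the grouping variable does, it helps to look at the rounding on its own. The helper round_to() below is a hypothetical base-R stand-in for round_any() with its default rounding function. The plot uses long-form ggplot() syntax.

```r
library(ggplot2)

# Hypothetical stand-in for round_any(x, accuracy): round to the nearest
# multiple of accuracy
round_to <- function(x, accuracy) round(x / accuracy) * accuracy

round_to(c(56.4, 56.6, 61.2), 1)  # nearest whole number
round_to(c(56.4, 56.6, 61.2), 5)  # nearest multiple of 5

# One boxplot per whole-number value of table
p_box <- ggplot(diamonds, aes(table, price, group = round_to(table, 1))) +
  geom_boxplot() +
  xlim(50, 70)
```

A larger accuracy gives fewer, wider groups, so it plays the same role for boxplots that binwidth plays for histograms.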


Your turn
Explore the relationship between carat, price and cut using these techniques, i.e. make this plot more informative:
qplot(carat, price, data = diamonds, colour = cut)

Which did you find most useful?
