Assignment #2: Statistical Inference Course

B Mahoney, May 18, 2016

Some Statistical Inference Analyses - Coursera Data Science Specialization

This file covers the second of two separate data investigations required under the Project Assignment for the course Statistical Inference, offered through Coursera in May 2016.

Part 2 - Tooth Growth Data Analysis

The objective of this exploration is to "...analyze the ToothGrowth data in the R datasets package."

Exploratory Data Analysis and Basic Summary

The data set is titled "The Effect of Vitamin C on Tooth Growth in Guinea Pigs" and contains data measuring "...the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid)." Below is a chart showing the growth by dose, split by type of supplement (a variation on a UsingR Help example).

The raw data suggest that higher doses produce greater tooth growth for a given supplement type and that, at least at lower doses, the orange juice supplement (OJ) produces greater tooth growth than the alternative supplement, ascorbic acid (VC). In addition, tooth growth does not appear to be proportional to dose: the slopes of the lines between 0.5 and 1.0 are steeper than the slopes from 1.0 to 2.0.

How do the means of the test groups compare? Here is the table of sample mean and variance of tooth growth length by dose, for OJ and VC separately.

Dose   Supp   Mean    Var
----   ----   -----   -----
0.5    OJ     13.23   19.89
1.0    OJ     22.70   15.30
2.0    OJ     26.06    7.05
0.5    VC      7.98    7.54
1.0    VC     16.77    6.33
2.0    VC     26.14   23.02

The sample means and the chart of the raw data both suggest that at the 0.5 and 1.0 doses OJ produces greater tooth growth than VC. At the highest dose tested, however, the two supplements appear indistinguishable from each other.
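The observation that growth is not proportional to dose can be checked directly from the group means. The following is a minimal sketch, not part of the original analysis; the names means, slope_low and slope_high are chosen here for illustration. It computes the change in mean length per unit of dose over each of the two dose intervals.

```r
library(datasets)

# Mean tooth length for each supplement (rows) and dose (columns)
means <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)

# Change in mean length per unit of dose over each interval
slope_low  <- (means[, "1"] - means[, "0.5"]) / (1.0 - 0.5)
slope_high <- (means[, "2"] - means[, "1"])   / (2.0 - 1.0)

# One row per supplement, one column per dose interval
print(round(cbind(slope_low, slope_high), 2))
```

If the slope over the lower interval exceeds the slope over the upper interval for both supplements, that is consistent with the flattening visible in the chart.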

Confidence Intervals and Hypothesis Tests

In all of the tests we assume that the samples are iid normal, and rely on the Student's t distribution to construct confidence intervals and hypothesis tests.

1. Examine the initial conclusion that OJ (VC) produces superior tooth growth at the 1.0 dose compared to the 0.5 dose, using confidence intervals on the difference of the means between the two dosage levels. The same hypotheses were tested for OJ and VC separately.

H0: The mean tooth growth for the OJ (VC) supplement at dosage level 1.0 is the same as at dosage level 0.5 (the difference between the two means is zero).
H1: The mean tooth growth at dosage level 1.0 is greater than at dosage level 0.5 (the difference between the two means is greater than zero).

Recall that we have calculated the mean length by dose and supplement type. For the OJ supplement, the means for the 0.5 and 1.0 doses are 13.23 and 22.70 respectively. (For the VC supplement the means are 7.98 and 16.77.) The formula for the t confidence interval is $\bar{Y} - \bar{X} \pm t_{df} \times SE$, where $\bar{Y} - \bar{X}$ in this case is the mean at the 1.0 dose, 22.70, minus the mean at the 0.5 dose, 13.23, equal to 9.47 (for VC the difference is 8.79). The sample standard deviations of the two OJ groups are 3.911 (dose 1.0) and 4.46 (dose 0.5). We assumed that the variances differ between the two samples and called t.test with the option var.equal = FALSE to calculate the confidence interval. The intervals and p-values reported here are the two-sided results that t.test returns by default.

With 95% confidence, the difference in means for OJ falls in the interval [5.524, 13.416]. Since the interval does not contain zero, and in fact is positive throughout, we reject the null hypothesis of no difference and conclude that the higher dosage produces greater growth for OJ.

For VC, with 95% confidence the difference falls in the interval [6.314, 11.266]. Since the interval does not contain zero, and in fact is positive throughout, we reject the null hypothesis that the two means are equal and conclude that the higher dosage (Y in the t.test call) produces a higher mean tooth growth than the lower dosage for the VC supplement.

Both tests yielded very small p-values ($8.7849191 \times 10^{-5}$ for OJ and $6.8110177 \times 10^{-7}$ for VC), reflecting the large differences in means and a correspondingly high level of significance.
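To make the interval formula concrete, the sketch below builds the OJ interval by hand and compares it with the output of t.test. It is an illustration under the unequal-variance (Welch) assumption stated above, not the report's own code. The variable names (oj_lo, oj_hi, v_lo, v_hi, ci_manual, ci_ttest) are introduced here, and SE is taken to be the standard error of the difference in means.

```r
library(datasets)

# Tooth lengths for the OJ supplement at the two doses being compared
oj_lo <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 0.5]
oj_hi <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 1.0]

# Difference in sample means (higher dose minus lower dose)
diff_means <- mean(oj_hi) - mean(oj_lo)

# Variance of each sample mean, then the standard error of the difference
v_hi <- var(oj_hi) / length(oj_hi)
v_lo <- var(oj_lo) / length(oj_lo)
se <- sqrt(v_hi + v_lo)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v_hi + v_lo)^2 /
  (v_hi^2 / (length(oj_hi) - 1) + v_lo^2 / (length(oj_lo) - 1))

# Two-sided 95% interval: difference +/- t quantile * SE
ci_manual <- diff_means + c(-1, 1) * qt(0.975, df) * se

# The same interval as computed by t.test with var.equal = FALSE
ci_ttest <- as.numeric(t.test(oj_hi, oj_lo, var.equal = FALSE)$conf.int)

print(round(ci_manual, 3))
print(round(ci_ttest, 3))
```

The two printed intervals should agree with each other and with the OJ interval quoted above. Replacing "OJ" with "VC" in the two subsetting lines gives the corresponding VC calculation.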

2. Examine the conclusion that OJ produces superior tooth growth compared to VC at the 1.0 dose level, using confidence intervals on the difference of the means between the two supplement types.

H0: The mean tooth growth for the OJ supplement at dosage level 1.0 is the same as for the VC supplement at the same dose (the difference between the two means is zero).
H1: The mean tooth growth for OJ at dosage level 1.0 is greater than for VC (the difference between the two means is greater than zero).

For the 1.0 dosage level, the means for OJ and VC are 22.70 and 16.77 respectively. In the t confidence interval $\bar{Y} - \bar{X} \pm t_{df} \times SE$, $\bar{Y} - \bar{X}$ in this case is 22.70 minus 16.77, equal to 5.93. The sample standard deviations of the two groups are 3.911 and 2.515. Again, we assumed that the variances differ between the two samples and called t.test with the option var.equal = FALSE to calculate the confidence interval. With 95% confidence (two-sided), the difference in means falls in the interval [2.802, 9.058]. Since the interval does not contain zero, and in fact is positive throughout, we reject the null hypothesis of no difference and conclude that the OJ supplement produces greater growth at dose level 1.0. This test yielded a small p-value (approximately 0.001), indicating a high observed level of significance.

3. Examine whether OJ and VC at the highest dose level, 2.0, produce the same tooth growth, using confidence intervals on the difference of the means between the two supplement types.

H0: The mean tooth growth for the OJ supplement at dosage level 2.0 is the same as for the VC supplement at the same dose (the difference between the two means is zero).
H1: The two means are different (the difference between the means is not zero).

For the 2.0 dosage level, the means for OJ and VC are 26.06 and 26.14 respectively. In the t confidence interval $\bar{Y} - \bar{X} \pm t_{df} \times SE$, $\bar{Y} - \bar{X}$ in this case is 26.06 minus 26.14, equal to -0.08. The sample standard deviations of the two groups are 2.655 and 4.798. Again, we assumed that the variances differ between the two samples and called t.test with the option var.equal = FALSE to calculate the confidence interval. Here we applied a two-sided test with 95% confidence. The difference in means falls in the interval [-3.798, 3.638]. Since the interval contains zero, we fail to reject the null hypothesis that the means are equal, and conclude that OJ and VC do not produce a statistically significant difference in mean tooth growth at this dosage. This test yielded a large p-value (0.9638516), indicating no statistically significant difference.
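For the supplement comparisons above, it may help to see how the choice of alternative changes what t.test reports. The sketch below is an illustration rather than the report's own code, and the names d1, oj1, vc1, two_sided and one_sided are introduced here. It runs the dose-1.0 comparison with the default two-sided alternative and with alternative = "greater", which is the argument t.test uses for a one-sided test.

```r
library(datasets)

# OJ and VC tooth lengths at the 1.0 dose
d1 <- subset(ToothGrowth, dose == 1.0)
oj1 <- d1$len[d1$supp == "OJ"]
vc1 <- d1$len[d1$supp == "VC"]

# Default call: two-sided Welch test with a two-sided 95% interval
two_sided <- t.test(oj1, vc1, var.equal = FALSE)

# One-sided version: H1 is that the OJ mean exceeds the VC mean
one_sided <- t.test(oj1, vc1, alternative = "greater", var.equal = FALSE)

print(two_sided$conf.int)   # finite lower and upper bounds
print(one_sided$conf.int)   # lower bound only; the upper bound is Inf
print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
```

A bounded interval such as [2.802, 9.058] is what the two-sided call returns. The one-sided call instead reports a lower confidence bound and, because the observed difference is positive, a p-value half the size of the two-sided one.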

Conclusion

The data suggest that greater tooth growth would be expected for both OJ and VC at a dose level of 1.0 versus 0.5. At the 1.0 dose level, OJ performs significantly better than VC, based on the samples available. At the 2.0 dose level, however, OJ and VC perform similarly; that is, OJ performs better only at lower doses. A t test comparing all data for OJ to all data for VC gives the 95% confidence interval [-0.171, 7.571]. Since this interval includes zero, we do not reject the null hypothesis that the two supplements perform the same overall.
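The overall comparison can also be written with the formula interface of t.test, which avoids building the two supplement vectors by hand. This is a sketch of an equivalent call, not the code used for the report; the name overall is introduced here.

```r
library(datasets)

# Welch two-sample t-test of length by supplement, pooling all three doses.
# With the formula interface the groups follow the factor level order (OJ, VC),
# so the interval is for mean(OJ) - mean(VC).
overall <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)

print(round(as.numeric(overall$conf.int), 3))
print(overall$p.value)
```

The printed interval should match the one quoted in the conclusion. Because this comparison pools all doses, it does not reflect the dose-specific differences found at 0.5 and 1.0.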

Appendix

The code below reads in the ToothGrowth dataset, then creates a chart showing the growth by dose, split by type of supplement (a variation on a UsingR Help example).

    library(UsingR)
    library(datasets)
    dat=ToothGrowth
    coplot(len~dose | supp,data=dat,panel=panel.smooth,
           xlab="Dose by type of supplement", ylab="Tooth Growth length")

What follows is the code needed to construct confidence intervals using the t distribution.

    #Load the knitr package to support the construction of a "kable" formatted table
    library(knitr)

    #Construct table showing statistics organized on the dimensions being analyzed
    #First, calculate the mean length by dose and supplement
    tgmns=aggregate(len ~ dose + supp,data=dat, mean)
    #Calculate the variances on the same dimensions
    tgvars=aggregate(len ~ dose + supp,data=dat, var)
    #Combine into a single table for display
    tgstats=cbind(tgmns,tgvars[,3])
    #Organize the table for viewing in a Word document (later will convert to pdf for upload)
    kable(tgstats,format="pandoc",col.names=c("Dose","Supp","Mean","Var"),
          align=c('c','c','c','c'),digits=2)

    #Create the vectors that we will compare in the hypotheses selected for investigation
    #Organize data into supplement vectors to facilitate t.test steps
    oj=dat[dat$supp=="OJ",]
    vc=dat[dat$supp=="VC",]

    #Note: t.test has no lower.tail argument, so it is omitted in the calls below;
    #the default alternative is two-sided, which gives the intervals reported above

    #Compare OJ at doses 0.5 and 1.0, and VC at the same doses
    hyp1=t.test(oj[oj$dose==1.0,1],oj[oj$dose==0.5,1],paired=FALSE,var.equal=FALSE)
    p1=hyp1$p.value
    hyp1a=t.test(vc[vc$dose==1.0,1],vc[vc$dose==0.5,1],paired=FALSE,var.equal=FALSE)
    p1a=hyp1a$p.value

    #Compare OJ at doses 1.0 and 2.0
    hyp2=t.test(oj[oj$dose==1,1],oj[oj$dose==2,1],paired=FALSE,var.equal=FALSE)
    p2=hyp2$p.value

    #Compare OJ and VC at dose = 1
    hyp3=t.test(oj[oj$dose==1,1],vc[vc$dose==1,1],paired=FALSE,var.equal=FALSE)
    p3=hyp3$p.value

    #Compare OJ and VC at dose = 2
    hyp4=t.test(oj[oj$dose==2,1],vc[vc$dose==2,1],paired=FALSE,var.equal=FALSE)
    p4=hyp4$p.value

    #Compare OJ and VC across doses
    hypa=t.test(oj[,1],vc[,1],paired=FALSE,var.equal=FALSE)
    pa=hypa$p.value