STAT 130 Introduction and Explanations 17. Flint, MI, a city of about 100,000 households, has experienced extremely high levels of lead in its water. The following plot summarizes data on lead levels (measured in parts per billion (ppb), and rounded to the nearest whole number) in the water of a random sample of 71 homes in Flint, taken in early 2015. Important note: so the plot would fit on the page, one home with a lead level of 104 ppb has not been plotted.
NOTE: For these 71 homes, the mean is 7.31 ppb and the standard deviation is 14.35 ppb.
The website fivethirtyeight.com has a good article that discusses the situation in Flint and some of the statistical analysis involved: http://fivethirtyeight.com/features/what-went-wrong-in-flint-water-crisis-michigan/

(a) (4 points.) Use the sample data to compute a 95% confidence interval for an appropriate population mean.

Background: The problem concerns homes in Flint, MI. The main variable of interest is the lead level in a home’s water, a quantitative variable measured in parts per billion (ppb). This part asks about the population mean, that is, the mean (or average) lead level in the water of homes in Flint, MI. To estimate this population mean, we use data on the sample of 71 homes for which the lead levels were actually measured. Each dot in the plot represents a home in the sample. The mean lead level for these 71 homes was 7.31 ppb. Provided that the homes constitute a representative sample, 7.31 ppb should be relatively close to the true population mean. However, the sample average will vary from sample to sample. To account for this variability, or “sampling error”, we provide not just a single-number estimate of the population mean, but rather a range of values of the form 7.31 ppb plus/minus a margin of error.

The margin of error depends on 3 components:
- The level of confidence: how confident are we that the resulting estimate is correct? To be more confident we require a larger range of values, that is, a larger margin of error. 95% confidence translates to the 2 in the formula below.
- The sample size: a larger sample size yields a smaller margin of error.
- The degree of variability of the lead levels themselves, as estimated with the sample standard deviation of 14.35 ppb. (The more variable the lead levels, the more the average lead level would vary from sample to sample, so the larger the margin of error required.)

Solution: The statistical formula for the confidence interval is

$7.31 \pm 2 \cdot \frac{14.35}{\sqrt{71}} \Rightarrow 7.31 \pm 3.41 \Rightarrow 3.9 \text{ to } 10.7$
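The arithmetic in part (a) can be checked with a few lines of Python. This is only a sketch and not part of the original solution: the function name `mean_ci` is made up for illustration, and the multiplier of 2 is the course's rounded value for 95% confidence, taken from the formula above.

```python
import math

def mean_ci(sample_mean, sample_sd, n, multiplier=2.0):
    """Return (margin, lower, upper) for mean +/- multiplier * SD / sqrt(n).

    The default multiplier of 2 is the course's rounded 95% value (an assumption
    carried over from the solution, not an exact t or z quantile).
    """
    margin = multiplier * sample_sd / math.sqrt(n)
    return margin, sample_mean - margin, sample_mean + margin

# Summary statistics quoted in the problem: mean 7.31 ppb, SD 14.35 ppb, n = 71.
margin, lower, upper = mean_ci(7.31, 14.35, 71)
print(f"margin of error = {margin:.2f} ppb")
print(f"95% CI: {lower:.1f} to {upper:.1f} ppb")
```

Keeping the multiplier as a parameter makes it easy to see how a different confidence level would widen or narrow the interval.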
(b) (4 points.) Write a clearly worded sentence containing the result of your confidence interval from part (a) in context. Solution: We estimate with 95% confidence that the mean lead level in the water of homes in Flint, MI is between 3.9 and 10.7 ppb.
- The conclusion should be about the population, i.e. homes in Flint in general, and not just about the sample.
- The conclusion is about the mean (average) lead level.
- It’s 95% confidence, not 95% of homes.
- Context is important: homes in Flint, lead level, 3.9 to 10.7 ppb; NOT just “population mean is between 3.9 and 10.7”.
(c) (5 points.) Does the statistic upon which the confidence interval from parts (a-b) is based provide a reasonable measure of typical lead levels in the sample of homes? YES / NO
If yes, explain why. If no, explain why not and provide the value of a more reasonable measure. Solution: No. The sample mean is 7.31 ppb. From the plot, we see that almost 80% of the homes (56/71) have lead levels lower than 7.31 ppb, so 7.31 ppb would not be considered typical. While most homes have relatively low lead levels, the handful of homes with very high lead levels (and the home with 104 ppb in particular) pull the mean up. A more reasonable measure of “typical” lead levels is the sample median, the lead level that half of the homes fall below and half fall above. The median lead level is 3 ppb, which seems like a more reasonable measure of what is “typical”. (Since there are 71 values, the median is the 36th value after the 71 values have been arranged from smallest to largest. You can find the median by simply counting the dots on the plot.)
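To see why a few extreme values move the mean but barely touch the median, here is a small illustration. The values in `toy_levels` are invented for this sketch and are NOT the Flint measurements (the raw data are not reproduced in this document). Only the position rule, the (n + 1)/2-th sorted value for an odd sample size, comes from the solution above. The helper name `median_position` is hypothetical.

```python
import statistics

def median_position(n):
    """1-based position of the median in a sorted list of odd length n."""
    if n % 2 == 0:
        raise ValueError("this sketch only handles odd n")
    return (n + 1) // 2

# Position of the median among the 71 sampled homes.
print(median_position(71))

# Hypothetical toy data (not the Flint sample): mostly low values plus one extreme value.
toy_levels = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 104]
toy_mean = statistics.mean(toy_levels)
toy_median = statistics.median(toy_levels)
print(f"toy mean = {toy_mean:.2f}, toy median = {toy_median}")

# Dropping the extreme value changes the mean a lot and the median only slightly.
trimmed = toy_levels[:-1]
print(f"without the extreme value: mean = {statistics.mean(trimmed):.2f}, "
      f"median = {statistics.median(trimmed)}")
```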
(d) (4 points.) The city actually collected data on 71 homes, but they excluded data on 2 homes: one with a lead level of 20 ppb, and one with a lead level of 104 ppb. Compute what the sample mean would be after excluding these two homes. Show your work in a calculation that does not involve adding a bunch of numbers. Solution: Since the values being removed are above the mean, their removal will cause the mean to decrease. The mean is the sum of all the values divided by the number of values, so the sum of all the values is 7.31 × 71 ≈ 519. Remove the two values of 20 and 104; now there are 69 values left whose sum is 519 − 104 − 20 = 395. Therefore the new mean is 395/69 ≈ 5.72 ppb.
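The same shortcut, recovering the total from the mean and then subtracting the removed values, can be sketched in Python. The function name `mean_after_removal` is a hypothetical helper, not something from the course materials.

```python
def mean_after_removal(old_mean, old_n, removed_values):
    """Mean of the remaining values, using only the old mean and old count."""
    old_total = old_mean * old_n              # sum of all values = mean * n
    new_total = old_total - sum(removed_values)
    new_n = old_n - len(removed_values)
    return new_total / new_n

# Remove the 20 ppb and 104 ppb homes from the sample of 71.
new_mean = mean_after_removal(7.31, 71, [20, 104])
print(f"new mean = {new_mean:.2f} ppb")
```

Because 7.31 is itself a rounded value, the recovered total (about 519) is approximate, but the effect on the final answer is negligible here.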
(e) (3 points.) Again, suppose the two homes were excluded as in the previous part, and the 95% confidence interval from parts (a-b) were recomputed. How would the revised CI compare to the CI from (a-b)?
Center of revised CI is: larger / smaller / same
Margin of error of revised CI is: larger / smaller / same
Level of confidence of revised CI is: larger / smaller / same
Solution:
Center: smaller. The center of the CI is the sample mean, which has gone down.
Margin of error: smaller. The outlier of 104 ppb affects both the mean and the standard deviation. Once this value is excluded the sample standard deviation decreases, and so does the margin of error. (Removing the two values also decreases the sample size, which would increase the margin of error, but this change is small relative to the change in sample SD.)
Level of confidence: same. It is 95% in either case, specified independently of the data.
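The solution does not compute the revised standard deviation, but it can be approximated from the summary statistics alone, which gives a rough check on the "margin of error: smaller" answer. The sketch below assumes the reported SD uses the usual n − 1 divisor and relies on the identity sum(x²) = (n − 1)·SD² + n·mean². The function name `summary_after_removal` is hypothetical, and because the inputs 7.31 and 14.35 are rounded, the outputs are approximate.

```python
import math

def summary_after_removal(old_mean, old_sd, old_n, removed_values):
    """Approximate mean and sample SD after removing known values, from summary statistics only."""
    # Recover the totals: sum(x) = n * mean, sum(x^2) = (n - 1) * sd^2 + n * mean^2.
    total = old_n * old_mean
    total_sq = (old_n - 1) * old_sd ** 2 + old_n * old_mean ** 2
    # Subtract the contributions of the removed values.
    new_n = old_n - len(removed_values)
    new_total = total - sum(removed_values)
    new_total_sq = total_sq - sum(v ** 2 for v in removed_values)
    new_mean = new_total / new_n
    new_var = (new_total_sq - new_n * new_mean ** 2) / (new_n - 1)
    return new_mean, math.sqrt(new_var), new_n

new_mean, new_sd, new_n = summary_after_removal(7.31, 14.35, 71, [20, 104])
old_margin = 2 * 14.35 / math.sqrt(71)
new_margin = 2 * new_sd / math.sqrt(new_n)
print(f"revised mean = {new_mean:.2f} ppb, revised SD = {new_sd:.2f} ppb")
print(f"margin of error: {old_margin:.2f} ppb before, {new_margin:.2f} ppb after")
```

Under these assumptions the SD falls to roughly 8.3 ppb and the margin of error to roughly 2.0 ppb, consistent with the solution's reasoning that the drop in SD outweighs the slightly smaller sample size.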
(f) (4 points.) Again, suppose the two homes were excluded as in the previous part. Find the proportion of homes in the sample with lead levels above 15 ppb. Solution: Lead level itself is a quantitative variable. However, in this and the following parts we treat it as categorical: is the lead level in the home above 15 ppb or not? In the sample of 71 homes, there are 8 homes with lead level above 15 ppb (count the dots on the plot and remember that the 104 ppb home is not plotted). After excluding the two homes, both of which were above 15 ppb, we have 6 out of 69 homes with level above 15 ppb, for a proportion of 6/69 = 0.087, a.k.a. 8.7%. (Part of this problem was to see if students could switch between treating lead level as quantitative and then treating it as categorical. Students were also required to work with the two versions of the data set, with or without the 2 observations.)
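The bookkeeping in part (f) can be written out as a short Python sketch. The function name `proportion_after_exclusion` is hypothetical. The counts (8 of 71 above 15 ppb) and the excluded values (20 and 104 ppb) are taken from the problem.

```python
def proportion_after_exclusion(count_above, n, excluded_values, threshold=15):
    """Proportion above the threshold after dropping the excluded homes."""
    # Only excluded homes that were themselves above the threshold reduce the numerator.
    excluded_above = sum(1 for v in excluded_values if v > threshold)
    return (count_above - excluded_above) / (n - len(excluded_values))

full_sample = proportion_after_exclusion(8, 71, [])
reduced_sample = proportion_after_exclusion(8, 71, [20, 104])
print(f"all 71 homes: {full_sample:.3f}")
print(f"after excluding 2 homes: {reduced_sample:.3f}")
```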
(g) (3 points.) Compute an appropriate 95% confidence interval corresponding to part (f). Solution: Similar to part (a), but now we are estimating a population proportion, not a mean: what is the proportion of homes in Flint with lead level above 15 ppb? The formula is similar to (a), with the sample mean replaced by the sample proportion and the sample SD replaced by 0.5.

$0.087 \pm 2 \cdot \frac{0.5}{\sqrt{69}} \Rightarrow 0.087 \pm 0.120 \Rightarrow 0 \text{ to } 0.207$

Students should recognize here or in the next part that a proportion can’t be negative, which is why the lower bound is 0 instead of 0.087 − 0.120. (Note: instead of 0.5, it is more common to use $\sqrt{0.087(1 - 0.087)}$, but the 0.5 is a simplification that provides a reasonable, somewhat conservative, approximation in most situations. In this course students learned to use 0.5 when dealing with margins of error for proportions and the sample SD when dealing with means.)
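As a check on part (g), the sketch below computes the interval with the course's conservative 0.5 and, for comparison, with the more common plug-in value mentioned in the note. The function name `proportion_ci` and its `conservative` flag are made up for this illustration. The multiplier of 2 and the truncation of the lower bound at 0 follow the solution.

```python
import math

def proportion_ci(p_hat, n, multiplier=2.0, conservative=True):
    """Return (margin, lower, upper) for a proportion, truncated to the range [0, 1]."""
    # Conservative version uses 0.5; the plug-in version uses sqrt(p_hat * (1 - p_hat)).
    spread = 0.5 if conservative else math.sqrt(p_hat * (1 - p_hat))
    margin = multiplier * spread / math.sqrt(n)
    lower = max(0.0, p_hat - margin)
    upper = min(1.0, p_hat + margin)
    return margin, lower, upper

p_hat = 6 / 69
cons_margin, cons_lower, cons_upper = proportion_ci(p_hat, 69)
plug_margin, plug_lower, plug_upper = proportion_ci(p_hat, 69, conservative=False)
print(f"conservative: +/- {cons_margin:.3f}, {cons_lower:.3f} to {cons_upper:.3f}")
print(f"plug-in:      +/- {plug_margin:.3f}, {plug_lower:.3f} to {plug_upper:.3f}")
```

The plug-in interval is narrower, but in this sketch both intervals still contain 0.10, so the choice between them does not change the conclusion about the 10% action level.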
(h) (4 points.) Write a clearly worded sentence containing the result of your confidence interval from part (g) in context. Solution: We estimate with 95% confidence that the proportion of homes in Flint, MI with lead level in water above 15 ppb is between 0 and 0.207. OR: we estimate with 95% confidence that between 0 and 20.7 percent of homes in Flint, MI have lead level in water above 15 ppb.
(See comments in (b); similar considerations apply here.)
(i) (3 points.) Under U.S. federal law, if at least 10% of homes in a city have lead levels over 15 ppb, the city has to warn its residents and start taking steps to fix the problem.
i. In what sense was the city’s exclusion of the two homes “significant”? Explain.
Solution: Considering all the data, 11.3% (8/71) of homes in the sample have levels above 15 ppb, while after excluding the two homes the percentage drops to 8.7% (6/69), below the 10% threshold. Therefore, because the city excluded the two homes, they did not warn residents or take steps to fix the problem. (Yes, this really happened.) (We talk about “statistically significant”, and how that term has a very specific meaning. Statistical significance is not the issue here. One point of this part is that students should not forget the common use of “significant”, and how simple descriptive statistics can often be used to judge practical importance.)
ii. Explain to a city official who has no background in statistics why you should use the answer to parts (g-h) and not just the answer to part (f) when determining if the 15 ppb threshold has been crossed. You should explain the most relevant statistical concepts in words that someone with no background in statistics can understand. Solution: About 9% of the homes in the sample were above 15 ppb. However, what we are really interested in is the percent of all homes in Flint that have levels above 15 ppb, which is probably not equal to 9%. The percent of homes above 15 ppb would vary from sample to sample. In order to account for this “sampling error”, we provide not just a single number, but rather a range of values of the form 9% plus/minus a margin of error. Since the sample size is relatively small, the margin of error is relatively large, about 12 percentage points. After accounting for the margin of error, we estimate that as many as roughly 21% of homes in Flint could have lead levels above 15 ppb. While only 9% of homes in the sample are above 15 ppb, based on the sample data it remains plausible that more than 10% of all homes in Flint are above 15 ppb. (But again, they shouldn’t have excluded the 2 values, in which case the sample itself would have had more than 10% of homes above 15 ppb.)
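One way to make "the percent would vary from sample to sample" concrete is to ask how often a sample of 69 homes would show fewer than 10% above 15 ppb even if the citywide figure were over the limit. The sketch below is purely illustrative: the citywide proportion of 12% is a hypothetical value chosen for the example, not an estimate from the data, and the calculation treats homes as independent draws (a binomial model), which is an assumption. The function name `prob_sample_below_threshold` is also made up.

```python
import math

def prob_sample_below_threshold(n, true_p, threshold=0.10):
    """P(sample proportion < threshold) under a binomial model with n homes."""
    total = 0.0
    for k in range(n + 1):
        if k / n < threshold:
            # Binomial probability of exactly k homes above 15 ppb.
            total += math.comb(n, k) * true_p ** k * (1 - true_p) ** (n - k)
    return total

# Hypothetical scenario: 12% of all homes are truly above 15 ppb.
prob = prob_sample_below_threshold(69, 0.12)
print(f"chance the sample shows under 10%: {prob:.2f}")
```

Under this made-up scenario, a sizable share of samples would fall below the 10% line purely by chance, which is why the range from parts (g-h), and not the single sample percentage, is the right thing to compare against the threshold.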