ST2080 Ali
5.1 Sampling Distributions for Means

Everything we have learned so far in this course has set us up for understanding that:
- Characteristics in a population have a distribution.
- Randomly selecting units from the population helps us learn about that distribution.
- Sample statistics can be used to estimate population-level parameters.
- A sample statistic has its own probability distribution, because different samples of the same size can give rise to different values of the statistic.

Let's study the last two points through the following example.

Eg. Suppose there are 8 students in the population, and we are interested in estimating the average spare cash that they have on them on a given day. Although unknown to us, assume the amount of spare cash (in dollars) each has on the given day is as follows:

Student   $x_i$
1         4
2         2
3         1
4         4
5         7
6         7
7         8
8         7
The true average amount of spare cash for this population is $\mu = (4 + 2 + 1 + \cdots + 7)/8 = 5$ dollars.

This value is unknown to us in practice. However, once units are selected for the sample, we can record the desired characteristic, $x_i$, from each sample unit and use the sample average (a statistic) to estimate the population-level mean (a parameter). For a SRS of size 4 out of a population of size 8, there are $\binom{8}{4} = 70$ possible samples, and the sample average spare cash may be the same for some of these samples (analogous to how there are 36 possible outcomes from tossing two dice, some of which give the same sum of the top faces). So $\bar{X}$ is itself a random variable, whose value depends on which of the 70 possible samples is actually drawn from the population.
Sample   Students in Sample   Spare Cash (data, $x_i$)   $\bar{x}$
1        1, 2, 3, 4           4, 2, 1, 4                 2.75
2        1, 2, 3, 5           4, 2, 1, 7                 3.50
3        1, 2, 3, 6           4, 2, 1, 7                 3.50
⋮        ⋮                    ⋮                          ⋮
70       5, 6, 7, 8           7, 7, 8, 7                 7.25
Q. What is the sampling distribution of $\bar{x}$? i.e. What are the possible values of $\bar{x}$, and with what probability would we observe those values?

$\bar{x}$   Freq   $P(\bar{x})$
2.75        1      1/70
3.50        6      6/70
3.75        2      2/70
4.00        3      3/70
4.25        7      7/70
4.50        4      4/70
4.75        6      6/70
5.00        12     12/70
5.25        6      6/70
5.50        4      4/70
5.75        7      7/70
6.00        3      3/70
6.25        2      2/70
6.50        6      6/70
7.25        1      1/70
total:      70
- symmetric
- unimodal
- $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
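
The sampling distribution tabulated above can be reproduced by brute force. What follows is a minimal Python sketch, not part of the course material: the variable names are illustrative, and exact fractions are used only to avoid rounding when grouping equal sample means.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import sqrt

# Spare cash for students 1..8, as given in the example.
cash = [4, 2, 1, 4, 7, 7, 8, 7]
n = 4  # sample size

# Every SRS of size n is an unordered choice of n distinct students.
samples = list(combinations(range(len(cash)), n))

# Exact sample mean for each possible sample.
means = [Fraction(sum(cash[i] for i in s), n) for s in samples]

# Frequency of each distinct value of the sample mean.
freq = Counter(means)

print(f"number of samples: {len(samples)}")
for xbar in sorted(freq):
    print(f"{float(xbar):5.2f}  {freq[xbar]:2d}  {freq[xbar]}/{len(samples)}")

# Mean and variance of the sampling distribution (over all 70 samples).
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

# Population mean and standard deviation (dividing by N: this is the whole population).
mu = Fraction(sum(cash), len(cash))
sigma = sqrt(sum((x - mu) ** 2 for x in cash) / len(cash))

print(f"mean of x-bar: {float(mean_of_means)}")
print(f"sd of x-bar:   {sqrt(var_of_means):.4f}")
print(f"sigma/sqrt(n): {sigma / sqrt(n):.4f}")
```

The mean of the 70 sample means equals the population mean of 5. Because this population has only 8 units and the sampling is without replacement, the standard deviation of the 70 sample means comes out smaller than $\sigma/\sqrt{n}$; the $\sigma/\sqrt{n}$ formula applies when the population is very large relative to the sample.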
Ex. 5.2 (pg 302 of IPS). A SRS of size n = 36 is taken from a population with mean 240 and standard deviation 18. Find the mean and standard deviation of the sample mean.

$X_1, \ldots, X_{36}$ independent with $\mu = 240$, $\sigma = 18$:

$\mu_{\bar{x}} = \mu = 240$, $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 18/\sqrt{36} = 18/6 = 3$
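
The same arithmetic can be checked in a few lines of Python. The helper name `sample_mean_mean_sd` is hypothetical, introduced here only for illustration, and it assumes independent observations from a population that is large relative to n.

```python
from math import sqrt


def sample_mean_mean_sd(mu, sigma, n):
    """Return (mean, sd) of the sample mean for an SRS of size n.

    Assumes independent observations from a population with mean mu
    and standard deviation sigma (population large relative to n).
    """
    return mu, sigma / sqrt(n)


# Values from Ex. 5.2: mu = 240, sigma = 18, n = 36.
mean_xbar, sd_xbar = sample_mean_mean_sd(mu=240, sigma=18, n=36)
print(mean_xbar, sd_xbar)  # 240 3.0
```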
Shape: if the population distribution of $X$ is $N(\mu, \sigma)$, then the sampling distribution of the sample mean, $\bar{X}$, is also normal, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

$X_1, \ldots, X_n \sim \text{i.i.d. } N(\mu, \sigma) \;\Rightarrow\; \bar{X} \sim N(\mu, \sigma/\sqrt{n})$
It turns out that even if the population distribution is not normal, the sampling distribution of the sample mean is approximately normal when n is sufficiently large.

The Central Limit Theorem
For a large sample size n, taken from a very large population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample mean is approximately normal:

$\bar{X} \;\dot{\sim}\; N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$

Rule of thumb: $n \ge 30$ for the CLT to "kick in".
So in practice, we do not explicitly enumerate all possible samples of size n. However, to understand the shape of the sampling distribution we could:
- take many samples, m, each of size n
- estimate the population mean from each sample (i.e. calculate $\bar{x}$)
- draw a histogram of the m estimates (a histogram of the m values of $\bar{x}$)

The Central Limit Theorem tells us that the histogram has the following properties:
- unimodal and symmetric
- centred on $\mu$, with standard deviation $\sigma/\sqrt{n}$
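
The three steps above can be sketched as a small simulation. This is only an illustration under assumptions that are not in the notes at this point: an exponential population with mean 25 (so that $\sigma = 25$ as well), n = 50, m = 10,000, an arbitrary seed, and arbitrary histogram bin edges.

```python
import random
import statistics
from math import sqrt

random.seed(2080)  # arbitrary seed so the run is repeatable

mu = 25.0     # assumed population mean (exponential population)
sigma = 25.0  # for an exponential distribution, sigma equals mu
n = 50        # size of each sample
m = 10_000    # number of samples

# Steps 1 and 2: draw m samples of size n and record each sample mean.
xbars = [
    statistics.fmean([random.expovariate(1 / mu) for _ in range(n)])
    for _ in range(m)
]

# Step 3: a crude text histogram of the m sample means.
lo, hi, bins = 10.0, 40.0, 15
width = (hi - lo) / bins
counts = [0] * bins
for x in xbars:
    if lo <= x < hi:
        counts[int((x - lo) / width)] += 1
for k, c in enumerate(counts):
    print(f"{lo + k * width:5.1f} | {'#' * (c // 50)}")

# Compare the simulated centre and spread with what the CLT predicts.
print(f"mean of x-bars: {statistics.fmean(xbars):.2f}  (mu = {mu})")
print(f"sd of x-bars:   {statistics.pstdev(xbars):.2f}  "
      f"(sigma/sqrt(n) = {sigma / sqrt(n):.2f})")
```

Even though the exponential population is strongly right-skewed, the histogram of sample means should look roughly bell-shaped, with centre and spread close to $\mu$ and $\sigma/\sqrt{n}$.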
Ex. 5.7 (pg 305 of IPS). Arrival times between text messages on a cell phone follow an exponential distribution with $\mu = \sigma = 25$ minutes. You estimate the mean arrival time from the sample mean arrival time of your next 50 messages. Approximately, what is the probability that the sample mean exceeds 21 minutes?

$n = 50$; $X_1, \ldots, X_{50}$ with $\mu = 25$, $\sigma = 25$; $P(\bar{X} > 21) = ?$

By the CLT, $\bar{X} \;\dot{\sim}\; N(25, 25/\sqrt{50})$, so

$P(\bar{X} > 21) = P\left(\frac{\bar{X} - 25}{25/\sqrt{50}} > \frac{21 - 25}{25/\sqrt{50}}\right)$ = P(Z > -1.13) = 1 - P(Z