sampling distribution model for a proportion

QUICK REVIEW: RANDOMNESS AND PROBABILITY (PART IV) - LAW OF LARGE NUMBERS (LLN) - the more times we try sth, the closer the results get to the theoretical probability - there is no LAW OF AVERAGES

BASIC RULES OF PROBABILITY HOW TO FIND PROBABILITY OF: "this event OR this event occurs"? - add probab's and subtract probab. that both occur

HOW TO FIND PROBABILITY THAT: "event A and event B" occurred, given they are independent. - multiply probab's

CONDITIONAL PROBABILITY: how probable is one event to happen, knowing that another event happened - ie. P(event A | event B) => probability of event A, given event B

DISJOINT EVENTS => "mutually exclusive" - cannot both occur at same time

IF two events are INDEP, then occurence of one does not change probab. of the other occurring.

PROBABILITY MODEL FOR RANDOM VARIABLE - describes theoretical distrib. of outcomes

- expected value = mean of random var. - E(X) - add variances for sums or diff's of INDEP. random var's - IF distrib. of quant. var. is unimodal & symmetric - THEN can use Normal model to estimate probab's

- use: a) GEOMETRIC model to est. probab. of getting first success after certain # of INDEPENDENT trials b) BINOMIAL model to est. probab. of getting certain #successes in finite # of INDEPENDENT trials c) POISSON model to est. probab. of #occurrences of relatively rare phenomenon - approximation to BINOMIAL model
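As a quick illustration of the three models above, a minimal sketch assuming Python with scipy is available (the parameters p = 0.3, n = 10, and the Poisson mean of 2 are made-up illustrative values):

```python
# Minimal sketch of the three probability models above; parameters are illustrative only
from scipy import stats

p = 0.3                                # assumed probability of success on one independent trial
print(stats.geom.pmf(4, p))            # GEOMETRIC: P(first success occurs on trial 4) = (1-p)^3 * p
print(stats.binom.pmf(2, 10, p))       # BINOMIAL: P(exactly 2 successes in n = 10 independent trials)
print(stats.poisson.pmf(1, 2))         # POISSON: P(exactly 1 occurrence) when the mean rate is 2
```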

============================

BEGINNING OF PART V OF THE BOOK: FROM THE DATA AT HAND TO THE WORLD AT LARGE

CHAPTER 18 SAMPLING DISTRIBUTION MODELS WHERE ARE WE GOING?

- will find out how much proportions from random samples will vary - allows us to start generalizing from retrieved samples to popn

MAIN TXT (example) - WHO: Canadian adults - WHAT: when to bring troops home - WHEN: Apr 2007 - WHERE: Canada - WHY: public attitudes - to what extent should we expect to see prop's (ex. of ppl favouring early withdrawal of troops) vary?

The following polls were both taken within a similar time frame - ex. an Angus Reid poll randomly selected 1k Canadians and found 52% in favour of troops being withdrawn early from the war - p-hat = 0.52

- ex. a Strategic Counsel poll randomly selected the same amt of Canadians and found 64% in favour - p-hat = 0.64 - both of these were properly selected random samples, but their prop's are strikingly diff. [2] - why do sample proportions vary from one sample to another? - b/c samples are made up of diff. ppl

THE CENTRAL LIMIT THEOREM FOR SAMPLE PROPORTIONS

[1] (ex) Strategic Counsel poll - imagine the results from all the random samples of size 1,000 that the poll-administrators did not take - what would the histogram of the sample prop's from each of those samples collectively look like?

[2] (ex) Strategic Counsel poll - the centre of this histogram is the true proportion in the popn - denoted as p ----

NOTATION ALERT - p = true (gen. unk) prop. of successes in the popn - p-hat = sample prop. - estimate of p derived from the data - q-hat = sample prop. of failures - q = true (gen. unk) prop. of failures in the popn --- suppose that p = 0.60 - 60% of ALL Canadian adults supported early withdrawal.

[3]

Shape of histogram? (ex) Strategic Counsel poll

- simulate as many of the unpicked size-1000 indep. random samples as we like, but keep p the same - the sample proportions are denoted p-hat - the histogram below shows the sample prop's supporting early withdrawal for 2k indep. samples of 1k adults when the true prop. is p = 0.60 (a code sketch of this simulation appears below)

[4] - despite the true val. being 0.60, we do not get this same prop. from every randomly simulated sample that was drawn - each p-hat comes from a diff. simulated sample, and each sample has diff. ppl

FIG18.1 histogram: simulation of all prop's from all possible samples = sampling distribution of the proportions

[5] SAMPLING DISTRIBUTION OF THE PROPORTIONS - unimodal, symmetric - centred at p (TRUE PROP.) - Normal model applies to this
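A minimal sketch of the simulation just described (p = 0.60, n = 1000, 2,000 samples), assuming Python with numpy; the random seed is arbitrary:

```python
# Simulate the sampling distribution of p-hat: 2,000 samples of n = 1000 with true p = 0.60
import numpy as np

rng = np.random.default_rng(seed=1)                  # arbitrary seed for reproducibility
p, n, num_samples = 0.60, 1000, 2000

p_hats = rng.binomial(n, p, size=num_samples) / n    # each entry is one sample's p-hat

print(p_hats.mean())   # close to p = 0.60 (the centre of the sampling distribution)
print(p_hats.std())    # close to sqrt(p*q/n), about 0.0155
```

A histogram of p_hats is the unimodal, symmetric pile-up around p described above.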

p475 [1] SAMPLING DISTRIBUTION MODEL

- shows how the sample prop. varies from sample to sample - enables us to quantify this variation & discuss the likelihood of observing a sample prop. in any particular interval of the distrib.

[2] - sampling distribution model, recall, is described by a Normal model - recall: Normal model req: N(μ, σ) - standard deviation (σ) - mean (μ) - in this case, b/c centre of sampling distribution at p , μ = p

[3] HOW TO GET SD? - typically, the mean does not give any info about the SD (ex) a batch of bikes has a mean wheel radius of 50 cm. Now, what is the standard deviation? We do not know; in most situations, you cannot get σ just from knowing μ.

[4] HOW TO GET SD THEN? - when working with proportions, can use a special fact: - given we know p, we can get SD(p-hat) => can use the mean (which is the true prop.) to get the SD

[5] - recall: if we get SRS's of "n" individuals, the prop. retrieved will vary from one sample to another

- given that n is sufficiently large, a probability model can be used to model the distribution of these sample prop's

- it's a Normal model centred at p, with SD(p-hat) = \sqrt{(pq)/(n)} - for prop's found for numerous random samples of fixed size "n" from a popn w/ success probability p

- NOTE: if n is small, then use Binomial model

(ex) Strategic Counsel poll - p = 0.60 - we are supposing that the true prop. of adults who favour early withdrawal is 60%; we do NOT know if this is absolutely true - using this, can get SD(p-hat) = \sqrt{(0.60)(0.40)/1000} ≈ 0.0155, and then formulate the Normal model

Description: p = 0.60, Normal model for the histogram of sample prop's of adults favouring early withdrawal (n = 1000)

p476 [1] Normal model for sampling distribution - can use 68-95-99.7 Rule on this, just like for any other normal model (ex) Strategic Counsel poll

- 68% of the distribution's val's are w/in 1 SD of the mean, 95% are w/in 2 SD's, and 99.7% are w/in 3 SD's

Let us apply the 68-95-99.7 RULE TO THE ABOVE DATA: - given: - p = 0.60 - SD(p-hat) = 0.0155 or 1.55% => expect 19/20 such polls (95%) to give results w/in 3.1% of the true % (2*1.55%), and nearly all (~99.7%) to be w/in 4.65% (3*1.55%) of the true %.

- this can help us determine whether certain %'s seem implausible to have arisen from sample-to-sample variation alone - ex. if we assume that the true % is 60%, then a result of about 64% (roughly 2.6 SD above the mean) is possible but unlikely - but 52% (the result the other poll got) seems essentially impossible, b/c it lies more than 5 SD's from the centre (see the check below)
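A quick numeric check of those two poll results under the assumed N(0.60, 0.0155) model, a minimal sketch in Python:

```python
# How many SDs from the assumed centre (p = 0.60, n = 1000) are the two observed polls?
from math import sqrt

p, n = 0.60, 1000
sd = sqrt(p * (1 - p) / n)                    # about 0.0155

for p_hat in (0.64, 0.52):
    z = (p_hat - p) / sd
    print(f"p-hat = {p_hat}: {z:+.1f} SDs from the centre")
# 0.64 is about +2.6 SDs (unlikely but possible); 0.52 is about -5.2 SDs
# (essentially impossible if the true proportion really were 0.60)
```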

[2] SAMPLING VARIABILITY (aka SAMPLING ERROR) - v.natural, & expected variability of sample prop's compared to true popn prop from one random sample to another

p477 HOW GOOD IS THE NORMAL MODEL? [1] Claim: - if a) repeated random samples of the same size n are drawn from some popn and b) the prop. p-hat is measured for each sample

- then the collection of these prop's piles up around the underlying popn prop. p, and this distribution of sample prop's can be modelled well by the Normal model

[2] Requirement: - sample size must be big enough

Downfall: - claim is only approximately true, like all models, but it becomes better rep. of distribution of sample prop's if sample size gets larger (ie. more accurate for larger 'n') - having larger samples makes distribution get closer to resembling that of Normal model

ASSUMPTIONS AND CONDITIONS [1] - 2 assumptions needed when using sampling distribution model for sample prop's 1. INDEPENDENCE ASSUMPTION = sampled val's must be indep. random draws from popn of interest (one indep. of another ) 2. SAMPLE SIZE ASSUMPTION = n must be large enough

[2] - we rely on the assumptions b/c they are hard, or typically impossible, to check directly - as a result, we use conditions to - provide info about the assumptions - check whether the assumptions are reasonable

- these are the conditions we CHECK before using the Normal model to model the distribution of sample prop's

1. RANDOMIZATION CONDITION dep. on where data came from: a) expt? - then subj's must have been randomly assigned to treatments prior to data collection

b) survey? - the sample should be an SRS of the popn - other sampling methods often give underestimates of the true SD (if you use the formulae from this chapter)

what if it was not a proper random sample? check: - that sampling method not biased & - data can still be reasonably thought of as being rep'ive of popn of interest

p478 (con.)

2. 10% CONDITION - n mustn't be larger than 10% of popn

(ex) national polls - b/c total popn is v.large, sample is small fraction of popn

3. SUCCESS/FAILURE CONDITION - sample size must be sufficiently large s.t. we can expect at minimum: - 10 successes (np) - 10 failures (nq) - given this is met, we have enough data to draw sound conclusions (ex) Strategic Counsel poll - p = 0.60 - then, expect 1000(0.60) = 600 successes and - expect 1000(0.40) = 400 failures - both are over 10 => can certainly expect sufficient successes and failures (see the condition-check sketch after this list)

Don't the two conditions: SUCCESS/FAILURE, 10% seem to conflict with each other? 10% - wants sample size to be not larger than 10% of popn - would need to make adjustments to data analysis methods if sample was greater than 10% (not going to look at here) SUCCESS/FAILURE - req. sample size to be large enough to have at least 10 successes, 10 failures expected

- but amt of data this condition demands dep. on p - the smaller the p, the more data that is needed ------NOTE ABOUT "success", "failure" - success (p) - failure (q) - are outcomes with assoc. probability - success and failure are completely arbitrary labels (ex) if disease occurs with probability p, this means that there is this much % chance of getting it, while 1 - p of not getting it - doesn't mean: "success" at getting sick -------
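A minimal sketch of checking the three conditions for a proportion, as referenced in the list above; the population figure N (roughly 25 million Canadian adults) is an assumed round number for illustration:

```python
# Check the 10% condition and the Success/Failure condition for a proportion
def check_conditions(n, p, N):
    """Return whether the 10% and Success/Failure conditions hold."""
    return {
        "10% condition (n <= 0.10 * N)": n <= 0.10 * N,
        "at least 10 expected successes (n*p >= 10)": n * p >= 10,
        "at least 10 expected failures (n*q >= 10)": n * (1 - p) >= 10,
    }

# Strategic Counsel poll: n = 1000, supposed p = 0.60, N assumed ~25 million adults
for name, ok in check_conditions(n=1000, p=0.60, N=25_000_000).items():
    print(name, "->", ok)
```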

SAMPLING DISTRIBUTION MODEL FOR A PROPORTION [1] SAMPLING DISTRIBUTION MODEL FOR PROPORTION - why are we modelling the sampling distribution in the first place? - it gives us insight into how much the sample prop. varies from one sample to another - the model is an attempt to display the distribution of ALL random samples

[2] - SAMPLING DISTRIBUTION MODEL FOR PROPORTION - how do we know the model really works? - the model can be theoretically justified using mathematics - the larger the sample size (ie. as n gets larger), the better the model works (a result proven by Laplace)

fundamental insight here: - can think of prop's from random samples as random quantities & then say sth specific about distribution (sampling distribution of prop's) they output

[3] We are now looking at proportions in another, v.impt way: - proportion = random quantity that has a (probability) distribution, and this distribution is modelled by the "sampling distribution model" for the proportion (as seen earlier) => the prop. is no longer looked at as just sth we compute for one set of data.

TERMINOLOGY ALERT
- sample distribution = display of the actual data collected from ONE sample
- sampling distribution = display of a summary statistic (ex. p-hat) for many diff. samples
=> "sampling distribution" is not the same as "sample distribution".

p479

In other words, IF - sampled val's are INDEPENDENT - n is sufficiently large THEN - the sampling distrib. of p-hat is modelled by a Normal model with mean p and SD(p-hat) = \sqrt{pq/n}

=> all we need to know is a) the proportion (p) and b) the sample size n ..in order to know how variable a sample prop. is

[1] SAMPLING DISTRIBUTION MODEL

- quantifies variability - tells us how unusual (if at all) any sample prop. is - allows us to make informed decisions wrt how far estimates from samples are from the true prop. - informs us about the amt of variation to be expected when a random sample is retrieved (ex) spinning a coin 100 times to decide if it is fair - 52 heads retrieved (fair) - 90 heads retrieved (not fair) - 64 heads? Do not know if it is fair or not; this is the case where we need the sampling distribution model to decide (see the sketch below) [2] SAMPLING DISTRIBUTIONS - practically speaking, these are only simulated - nevertheless, we can come up with statements about the popn from them - the sample proportion is viewed as a random variable from now on, instead of as some fixed quantity calculated from the data - the one we retrieve is one of many that might have been retrieved had a diff. random sample been chosen

- by imagining what might occur if one was to draw, from the same popn, numerous samples, can learn quite a lot about how statistics retrieved from one particular sample is related to its corresponding popn parameters
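A minimal sketch of the coin-spinning example above (n = 100 spins, fair coin so p = 0.5):

```python
# z-scores of 52, 64, and 90 heads out of 100 spins under the fair-coin model
from math import sqrt

p, n = 0.5, 100
sd = sqrt(p * (1 - p) / n)                  # SD(p-hat) = 0.05 for a fair coin

for heads in (52, 64, 90):
    z = (heads / n - p) / sd
    print(f"{heads} heads: z = {z:+.1f}")
# 52 heads (+0.4 SD) looks fair, 90 heads (+8.0 SD) clearly does not,
# and 64 heads (+2.8 SD) is the borderline case the sampling distribution model helps us judge
```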

p480 [1] SAMPLING DISTRIBUTION MODEL - allows us to measure how close our retrieved statistic val's are to their corresponding parameters

[2] SAMPLING DISTRIBUTION MODEL - can be retrieved for any statistic that we get from sampled data - but normal model cannot always be used to describe the statistic retrieved from sample - statistic in this context: this is referring to the type of val. plotted on x-axis

- normal model worked well for prop's, but not for every statistic

[3] SAMPLING DISTRIBUTIONS - WHY SPEND THE TIME TO MAKE THEM? - they help us understand how the statistic that the sampling distribution is based on would have been distributed if all possible samples were drawn, without having to actually do it in reality - this distribution is used to generalize from the data to the world at large

p482 WHAT ABOUT QUANTITATIVE DATA? - as you incr. the sample size (ex. # of dice), each sample avg has a greater likelihood of being close to the popn mean - by the Law of Large Numbers

(ex) the more dice we average, the closer the distribution gets to the popn mean (centre) - we toss the dice 10,000 times, and then plot the distribution (see the sketch below)

- sample size here is the # of dice being avged per roll

- 3-dice avg aka triangular distribution 1- can see that the shape is continually tightening around 3.5 (popn mean) 2- shape of distribution becomes bell-shaped & approaches the shape of the Normal model as sample size incr's
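A minimal sketch of that dice simulation, assuming Python with numpy (the seed is arbitrary):

```python
# Average the faces of 1, 3, and 10 dice; repeat each experiment 10,000 times
import numpy as np

rng = np.random.default_rng(seed=1)
num_tosses = 10_000

for num_dice in (1, 3, 10):
    rolls = rng.integers(1, 7, size=(num_tosses, num_dice))   # faces 1..6
    averages = rolls.mean(axis=1)
    print(f"{num_dice} dice: mean {averages.mean():.2f}, SD {averages.std():.2f}")
# the centre stays near 3.5 (the popn mean), the SD shrinks roughly like sigma/sqrt(n),
# and a histogram of the averages looks more and more like a Normal model
```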

CENTRAL LIMIT THEOREM: THE FUNDAMENTAL THEOREM OF STATISTICS [1]

- for making sampling distributions of means, there are nearly no conditions at all - contrast: sampling distribution of prop: few conditions - what was done with dice simulation above can occur for means of repeated samples for nearly any situation [2]

Laplace's CENTRAL LIMIT THEOREM - as the sample size grows (as n gets larger), the sampling distribution of any mean (doesn't even have to be a mean of dice rolls) becomes v.close to Normal - req. for data val's: >- independent >- collected w/ randomization - the shape of the popn distribution does not matter

[3] UNIQUE FEATURE OF SAMPLING DISTRIBUTION OF MEANS - as sample size gets larger, means of many random samples approach the shape of Normal model - true regardless of shape of popn distribution => even if sample came from skewed or bimodal popn, CLT says that means of many random samples put together tend to adhere to Normal model, as sample size gets larger - but you can get a normal model with a smaller sample size if the popn distribution is already normal, or nearly normal

p485

popn distribution is already normal?

- if the popn distribution was Normal to begin with, then the data val's themselves are Normal - if we take samples of size 1, the "means" are really just the individual observations - so even at n = 1 the sampling distribution of the mean already resembles a Normal model; no large sample is needed

popn distribution is signif. skewed, or bimodal? - the CLT will still work, but it will take a much larger sample size for the Normal model to work well

[2]

CENTRAL LIMIT THEOREM = Mean of random sample is a random var. whose sampling distribution can be approximated by Normal model if sample size is large enough. >- as the sample size becomes larger, the approx. becomes better

ASSUMPTIONS AND CONDITIONS

-> INDEPENDENCE ASSUMPTION - sampled val's must be independent random selections from the popn of interest, o/w the concept of a sampling distribution is senseless
associated condition: -> RANDOMIZATION CONDITION - must randomly sample the data

-> SAMPLE SIZE ASSUMPTION - sample size must be large enough
associated conditions:
-> 10% CONDITION - if the sample is drawn w/out replacement, then the sample size (n) mustn't be more than 10% of the popn - typically we do have sampling w/out replacement
-> LARGE ENOUGH SAMPLE CONDITION - the CLT does not tell us what sample size we should use; rather it depends on the popn - if the popn depicts a unimodal and symmetric distribution, then a fairly small sample (5 or 10) is OK - if the popn depicts a strongly skewed distribution, then a quite large sample (50, 100 or larger) is needed - goal is to think about sample size in the context of what you know about the popn - this will help decide if this condition is met (see the sketch below)
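A minimal sketch of the Large Enough Sample idea: sample means from a strongly right-skewed popn (an exponential popn is assumed here purely for illustration) at two of the sample sizes mentioned above:

```python
# Compare how Normal the sample means look for n = 5 vs n = 50 from a skewed popn
import numpy as np

rng = np.random.default_rng(seed=1)
num_samples = 10_000

for n in (5, 50):
    means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n = {n}: skewness of the sample means is about {skew:.2f}")
# at n = 5 the sampling distribution is still noticeably right-skewed;
# at n = 50 it is much closer to symmetric, so a Normal model fits far better
```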

p486 BUT WHICH NORMAL? [1] - the CLT states that the sampling distribution of any mean/prop. is approx. Normal - any Normal model is specified by a mean & SD

WHAT IS THE NORMAL MODEL FOR SAMPLE MEANS?

parameter: μ - is the popn mean - what normal model of sampling distribution of means is centred at

[2]

parameter: σ - recall: means vary less than individual data val's => distribution of means have smaller SD's than individual data val's

(ex) - one person (one individual data val.) out of 100 students being over 2m tall vs. - the mean height of the 100 students taking the course being over 2m?

=> the first case may occur, but the second will essentially NEVER happen, b/c means do not show as much variation as individual data val's

[3] - SD of y-bar decr's as the sample grows - but it only goes down by the sqrt of the sample size:

SD(y-bar) = σ/\sqrt{n} ...the SD parameter of the Normal model for the sampling distribution of the mean, where σ is the standard dev. of the popn

(ex) to decr. the SD by 2x, must incr. the sample size by 4x (to halve the SD, have to quadruple the sample size)

----SAMPLING DISTRIBUTION MODEL FOR A MEAN - when an SRS is taken from any big popn w/ mean μ and SD σ, the sample mean (y-bar) has a sampling distribution w/ the same mean μ but SD given by σ/\sqrt{n}

- regardless of what popn the random sample originates from, the shape of the sampling distribution is approx. Normal given that the sample size is sufficiently large - the larger the sample size used, the more closely the Normal model approx's the sampling distribution for the mean

----------When to use which sampling distribution model?

SAMPLING DISTRIBUTION FOR PROPORTION
- use for categorical data
- calc. sample prop. (p-hat)
- sampling distribution of the prop. (the random var. in this context) is a Normal model w/ mean at the true prop. (p) & SD(p-hat) = \sqrt{pq/n}
- (this model is used in C19, C20, C21, C22)

SAMPLING DISTRIBUTION FOR MEAN
- use for quantitative data
- calc. sample mean (y-bar)
- sampling distribution of the mean (the random var. in this context) is a Normal model w/ mean at the true mean (μ) & SD(y-bar) = σ/\sqrt{n}
- (this model is used in C23, C24, C25)

COMMONALITY - both have \sqrt{n} term in denominator for SD - larger the sample, the lesser that statistic (whether prop. or mean) will vary

DIFFERENCE - numerator

Note - less concern about shape of distribution of statistic when sample size is signif. large.

(View the "FirstClass" black workbook for the SBS, and FOR EXAMPLE section; p487-489)

ABOUT VARIATION - means show less variation than individual data val's

(ex) - if one test is given to many lecture sections of same course and we get class avg ~80%, then some students may have scored 95 or beyond, or may have failed, b/c individual scores vary considerably

- avg's (mean in this context) are more consistent for larger sample sizes

=> (ex) if we had sample size "n" = 4, then 1/\sqrt{4} = 1/2 => SD of mean of random sample has 1/2 the SD of an individual data val. - then, to make SD be halved again, need sample of 16, and to make it halved again, need sample of 64

p490 (con.) [2] BIGGER 'n' - larger sample size => SD of the sampling distribution is much lower (=> variability is "really" under control), so the sample mean tells us more about the unk. popn mean

- downfall: time and cost are higher for surveying larger samples - while the data is being gathered, the popn itself may change - ex. a news story may change people's opinions => practical lim's exist for most sample sizes

- the \sqrt{n} term lim's how much we can make a sample tell us about the popn - ex. of the LAW OF DIMINISHING RETURNS - this will be elaborated upon later (see the sketch below)
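A minimal sketch of this law of diminishing returns, with an arbitrary illustrative popn SD of σ = 10:

```python
# SD(y-bar) = sigma / sqrt(n): each halving of the SD costs a quadrupling of n
from math import sqrt

sigma = 10.0                     # assumed popn SD, for illustration only
for n in (4, 16, 64, 256, 1024):
    print(f"n = {n:4d}: SD(y-bar) = {sigma / sqrt(n):.3f}")
```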

BILLION DOLLAR MISUNDERSTANDING? [1] - lesson to be learned: sample means from smaller samples vary more than sample means from larger samples

- they thought that schools with fewer students do better in academic performance than larger schools, not realizing that small schools show greater variation in their avg performance than do larger schools (which have a larger sample size)

THE REAL WORLD AND THE MODEL WORLD [1]

real world
- random samples of data are drawn
- distribution of a sample >- could be displayed with: i) histogram (if quantitative data) ii) bar chart/table (if categorical data)

model world
- describes how the sample means & prop's we observe in reality might behave if we were able to see the results of every possible random sample that could be drawn
- sampling distribution of the statistic (ex. prop. or mean) >- could be described by modelling with a Normal model, based on the CLT

[2] CLT - it is pertaining to sample means & sample proportions of many diff. random samples retrieved from same popn - it IS NOT pertaining to distribution of data from one sample

Implication => the CLT is not saying that the data are Normally distributed once the sample is sufficiently large - rather, as the sample gets larger, expect the data to take on the shape of the popn distribution, which is not necess. Normal (ex) we can collect a sample of CEO salaries for the next several yrs, but the histogram will remain skewed to the right, never Normal (the CLT says that the distribution of sample means or prop's (retrieved from taking data sets from each possible random sample) becomes Normally distributed as n gets larger)

Requirement of CLT - CLT req. that sample is sufficiently large (n is large enough) when popn shape is NOT - unimodal - symmetric

THE CLT IN NATURE [?] [1] GENERALIZATIONS OF CLT

CLT and dependence? - if individual random var's - show some dependence (but not considerably large) OR - do not all originate from the same distribution (ex. the distribution of height gain from genetic influences is not the same as the distribution of height gain from envirnmtl influences), given that each has only a small effect on the total

- then, normality will still arise if enough var's are avged (or summed up)

- (ex) could be that heights in a genetically homogenous popn ended up showing nearly Normal distribution b/c final result came from summing, or avging many small genetic & envirnmtl influences

Implication - many other var's are formed by multiplying many small influences together, instead of adding them - can apply a non-linear transformation to a non-Normal variable to potentially make it Normal (ex. taking the log) (see the sketch below)
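A minimal sketch of that idea, assuming made-up multiplicative influences (uniform factors near 1): the product is right-skewed, but its log, being a sum of many small terms, is nearly Normal:

```python
# Product of many small positive influences vs. the log of that product
import numpy as np

rng = np.random.default_rng(seed=1)
factors = rng.uniform(0.9, 1.1, size=(10_000, 100))   # 100 small multiplicative influences

product = factors.prod(axis=1)         # right-skewed (roughly lognormal)
log_product = np.log(product)          # a sum of 100 small terms => approx. Normal (CLT)

for name, x in (("product", product), ("log(product)", log_product)):
    skew = ((x - x.mean()) ** 3).mean() / x.std() ** 3
    print(f"{name}: skewness is about {skew:.2f}")
```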

SAMPLING DISTRIBUTION MODELS Summary of Sampling Distributions (Pt1) - the statistic (ex. a proportion) itself is a random quantity - this is at the "core" of the idea of sampling distributions - cannot know in advance what val. it takes on b/c it originates from a random sample - one statistic val. is one instance of sth that occurred for one particular random sample => a diff. random sample could give a diff. val. for that statistic

- the sampling distribution is gen'ed by this sample-to-sample variability - it shows us the distribution of possible val's that the statistic may have had

Summary of Sampling Distributions (Pt2) - can simulate this distribution by taking lots of samples - the CLT says that it is possible to model the sampling distribution for a mean or for a prop. with a Normal model

Summary of Sampling Distributions (Pt3) -> 2 basic truths about them: 1. Sampling distributions originate b/c samples vary - each random sample contains diff. individuals => diff. val. of statistic will be retrieved

2. The CLT informs us ahead of time, when working with sampling distributions of means and prop's, that the end result of simulating the sampling distribution will be a nearly Normal model, given we have an approp. sample size.

- process of forming the sampling distribution model:

p493 THE VIEW FROM HERE:

[1] - information collected from a survey or the results of an expt is only a small peek at the popn of interest - this "peek" is prone to error = variability that intrinsically arises from sample to sample (survey) and group to group (expt) - aka sampling error - unavoidable, but predictable and quantifiable - the CLT quantifies sampling error

[2]

WAS THE DISCUSSION OF THIS CHAPTER "REALISTIC"? - no, purely theoretical - we imagined that we know the popn parameters, and used this to describe the behav. of statistics observed in numerous samples

- this is not how real world works 1- we do not know popn parameters 2- have data usually only from single study or survey (ex) we can poll 1000 voters, but our main interest is knowing what ALL the voters of Ontario will choose

- so now, we are putting together a) the ability to design studies and b) the ability to summarize the data of studies w/ what the CLT says about sampling variability

..in order to use the sample statistics we observe to make inferences about the popn parameters of interest - recall: we do not know these popn parameters

- in C19-C22: inferences about prop's - in C23-C25: inferences about means

FINITE POPULATION CORRECTION

(NOT RESPONSIBLE FOR ON FINAL EXAM - FINITE POPULATION CORRECTION) [1] - when an SRS is drawn from a finite popn (size N), the actual SD of the sample prop. & sample mean is given by the formulae below:
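The formulae themselves are missing from these notes; the standard finite population correction, which is consistent with the properties listed next (close to 1 when N is much greater than n, and 0 when n = N), multiplies the usual SDs by \sqrt{(N-n)/(N-1)}:

SD(p-hat) = \sqrt{pq/n} * \sqrt{(N-n)/(N-1)}
SD(y-bar) = (σ/\sqrt{n}) * \sqrt{(N-n)/(N-1)}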

things to consider - the correction factor is close to 1 given that N is much greater than n

- what if N = n? - ie. we sampled the ENTIRE popn - then SD = 0 - b/c the sample mean (or prop.) equals the true mean (or prop.), so there is no variability at all

[2]

Independence assumption - fails when sampling is without replacement and the sampled fraction is not negligible - ex. sampling five cards from a deck of 52 cards - P(King of spades on the 1st draw) = 1/52, but P(King of hearts on the 2nd draw, given the first card is gone) = 1/51 => every subseq. probability changes b/c we cannot pick the same card we just picked out of the deck, so the total count shrinks

- but if instead we drew from 10 decks shuffled together, the P's are nearly indep. (ie. they barely change) from one successive card to the next

[3]

10% Condition - what happens when the 10% condition is violated, and we fail to correct for it using the finite popn correction? - conseq's: - the variability of the sample statistic is overestimated - results will be more precise or accurate than our calculations suggest >- these consequential errors are collectively known as conservative error

WHAT CAN GO WRONG? DON'T CONFUSE THE SAMPLING DISTRIBUTION WITH THE DISTRIBUTION OF THE SAMPLE

distribution of the sample - pertains to getting data, and possibly calculating summary statistics for distribution of val's coming from ONE sample.

sampling distribution - the imaginary collection of val's that the statistic MIGHT have taken on for all possible random samples - in practice, nearly all possible val's, retrieved via simulation

- use the sampling distribution model to make statements wrt how the statistic that we formed this distribution for varies

BEWARE OF OBSERVATIONS THAT ARE NOT INDEPENDENT - CLT dep. on INDEPENDENCE ASSUMPTION - if this is violated, then statements that are being made about mean will be wrong

- independence is not sth that can be checked from data - have to instead think about how data were gathered

- what ensures independence is: - good sampling practice (if getting data from samples) or - well-designed randomized expts (if getting data from expts)

WATCH OUT FOR SMALL SAMPLES FROM SKEWED POPULATIONS - the CLT assures that if n is big enough, the sampling distribution model is nearly Normal - if the popn itself is nearly Normal, then even a small n can work (ex. n = 10, in our elevator SBS ex.) (ex)

what if the popn distribution is signif. skewed to begin with? - n must be large before the Normal model will work approp. - there is no good rule of thumb for how large n must be (when working with the sampling distribution of means; when working with proportions, we do have a rule: the SUCCESS/FAILURE CONDITION) - this works for prop's b/c the SD of a prop. is related to its mean - advice: always plot the data to check, b/c the required n dep's on how skewed the data distribution is

CONNECTIONS - we get distribution of statistic val's b/c diff. random samples, or diff. randomized expts gen. diff. statistic val's

- can use the NORMAL model to model the sampling distributions of the mean & prop.

- simulation (back from C11) was used to understand sampling distributions - by gen'ing numerous random samples to see how distribution will turn out to be

WHAT HAVE WE LEARNED? - recall: statistics is about variation - no sample completely describes popn

- sample prop's & means vary from one sample to another = sampling error (aka sampling variability)

- sampling variability is predictable by mapping it out on sampling distribution for corresponding statistic

- CLT - describes behav. of sample prop's (shape, centre, spread) given that certain assumptions & conditions apply - ie. sample must be indep, random, sufficiently large s.t. we expect min. 10 successes & 10 failures

- if the above is met, then: - the sampling distribution (imagined histogram from getting prop's from all possible samples) is shaped like a Normal model - mean of the sampling model = true prop. in the popn - SD of the sample prop. is given by \sqrt{pq/n}

- CLT also applies for sample means as well - req: sample must be indep, random, and n must be larger if data came from popn that is NOT roughly - unimodal - symmetric

- if the above apply, then: >- whatever the shape of the original popn distribution, the shape of the distribution of the means of all possible samples is described by a Normal model, given that n is sufficiently large

>- centre of the sampling model = true mean of the popn from which we got the sample

>- SD of the sample means is given by σ/\sqrt{n}

TERMINOLOGY SAMPLING DISTRIBUTION MODEL - depicts behav. of statistic over ALL possible samples with same size n - recall: Diff. random samples give diff. val's for a given statistic

SAMPLING VARIABILITY (aka sampling error) - expected variability from one random sample to another

SAMPLING DISTRIBUTION MODEL for a proportion if - the assumptions for indep. & random sampling apply - we expect at least 10 successes & 10 failures (the dictator of whether or not n is sufficiently large; SUCCESS/FAILURE CONDITION) then - the sampling distribution of the prop. is modelled by a Normal model w/ mean equal to p, SD = \sqrt{pq/n}

SAMPLING DISTRIBUTION MODEL for a mean if - assumptions for indep & random sampling apply

- n is sufficiently large then - the sampling distribution of the mean is modelled by a Normal model w/ mean = μ (popn mean), SD equal to σ/\sqrt{n} (the SD of the popn divided by \sqrt{n})

CENTRAL LIMIT THEOREM - states that the sampling distribution model of the sample mean (& prop.) from a random sample is approx. Normal for large n, regardless of the popn distribution, as long as the data val's retrieved are indep.