STAB22 LEC05 (Covers chapter 6)
----[COURSE ANNOUNCEMENTS]----
Note about end-of-chapter exercises
- after being asked via e-mail to post suggested end-of-chapter problems, he is now doing so. Check out his homepage at: http://www.utsc.utoronto.ca/~butler/b22/
- 2 places to get answers for these problems:
  - back of book (Appendix A)
  - MyStatLab solutions manual: http://media.pearsoncmg.com/intl/pec/mylab/XL/2012/deveaux_1ce/ssm/deveaux_1ce_ssm.html
Mid-term
- June 11 is the tentative date for the midterm
------------------------------------------------------------------
----[CHAPTER 6]----
[45] Recall: 2 exam scores
- How can we fairly compare the exam scores to see which one is the better performance?
[46] What we use to fairly compare values with different means, SD's, units etc. are z-scores:
Z-score
- z = (x - mean)/SD
(ex)
- x is something that you measure (ex. score in exam)
- Suppose x has:
  - mean = 10
  - SD = 3
- Then, x - 10 has:
  - mean = 10 - 10 = 0
  - but SD = 3 (unchanged)
  => if we shift the graph over by subtracting the mean, the mean becomes 0, but the spread does not change under addition/subtraction
- So, z = (x - 10)/3 will have:
  - mean = 0/3 = 0
  - SD = 3/3 = 1
- You end up with these same values for mean and SD, regardless of the data, whenever you compute z-scores:
  - mean = 0
  - SD = 1
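A quick numerical check of the "mean 0, SD 1" claim, as a sketch only. The data values are made up for illustration (they are not from the lecture), and the population SD (`pstdev`) is an assumption of this sketch; the result is the same with the sample SD, as long as the same kind of SD is used before and after.

```python
from statistics import mean, pstdev

# Hypothetical measurements (not from the lecture).
data = [4, 7, 10, 13, 16]

m = mean(data)      # centre of the original data
s = pstdev(data)    # spread of the original data (population SD)

# Standardize: subtract the mean, then divide by the SD.
z_scores = [(x - m) / s for x in data]

z_mean = mean(z_scores)
z_sd = pstdev(z_scores)
print(z_mean, z_sd)
```

Swapping in any other list of numbers (with at least two different values) should give a mean of about 0 and an SD of about 1, up to floating-point rounding.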
(ex) Let x be some value of interest, with:
- mean = -5
- SD = 10
Then, x - (-5) has:
- mean = -5 + 5 = 0
- SD = 10
Then, (x - (-5))/10 has:
- mean = 0/10 = 0
- SD = 10/10 = 1
- and this expression is the z-score (ie. z = (x - (-5))/10)
So why are we getting z-scores?
- provides a common basis for comparison
  - ex. gives a means of comparing exam scores to see which one is better, by putting them on a common scale
  - whenever you take away the mean, the centre shifts over and becomes 0; then, when you divide by the SD, the SD becomes 1
  => can now compare fairly b/c centre and spread are the same, even though the original data had different means & SD's
- STANDARDIZATION - calculating a z-score
[48] RETURNING TO THE EXAM SCORES: Which is better?
Recall:
Exam 1
- score = 67
- mean = 50
- SD = 10
Then, z = (67 - 50)/10 = 1.70
Exam 2
- score = 62
- mean = 40
- SD = 12
Then, z = (62 - 40)/12 = 1.83 (approx.)
Implication
- the mark of 62 is the slightly better performance, relative to mean and SD (z = 1.83 vs. z = 1.70)
- now we have a method to compare them on common ground
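The exam comparison can be reproduced in a few lines. This is a sketch; the helper name `z_score` is made up for these notes and is not from the lecture or textbook.

```python
def z_score(value, mean, sd):
    """Standardize a value: how many SDs it lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Numbers from the two exams in the notes.
z_exam1 = z_score(67, mean=50, sd=10)
z_exam2 = z_score(62, mean=40, sd=12)

print(round(z_exam1, 2), round(z_exam2, 2))

# The larger z-score is the better performance relative to its own class.
better = "Exam 2" if z_exam2 > z_exam1 else "Exam 1"
print(better)
```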
[49] DENSITY CURVES & NORMAL MODEL
- Need a mathematical model to describe what is occurring if we want to determine how big a typical z-score might be
(ex) [figure: bars of data with a red normal curve drawn over them]
- normal model for this curve
  - works for data that is roughly symmetric, with no outliers
  - is a pretty close match: the curve goes roughly through the middle of the tops of the bars, so it is close, but not exact
  - red curve = example of a normal distribution model
    - ie. a workable approximation of real data
[50] MEAN AND STANDARD DEVIATION ON NORMAL DISTRIBUTION
This is an example normal distribution
- mean & median = 10, which is at the peak
- SD
  - going 1 SD out from the mean takes you to where the function stops curving down and starts curving out
  - ie. an inflection point
- So, suppose the inflection points are at 7 and 13
  - Then, SD = 13 - 10 = 3
  - only need to use one inflection point to get the SD; if we used the other, we would get 7 - 10 = -3, which has the same magnitude
  - what matters is the magnitude, NOT the sign
  => the distance from the mean to an inflection point is the SD of the normal distribution
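The inflection-point claim can be checked numerically. This sketch uses the standard formula for the normal density curve (not given in the lecture) and estimates how the curve bends just inside and just outside mean + SD = 13; the function names are made up for illustration.

```python
import math

def normal_density(x, mean, sd):
    """Height of the normal curve at x (standard normal density formula)."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def curvature(x, mean, sd, h=1e-3):
    """Central-difference estimate of the second derivative of the density."""
    f = normal_density
    return (f(x + h, mean, sd) - 2 * f(x, mean, sd) + f(x - h, mean, sd)) / h ** 2

mean, sd = 10, 3
inside = curvature(12.9, mean, sd)    # just inside mean + SD
outside = curvature(13.1, mean, sd)   # just outside mean + SD

# Negative = curving down, positive = curving out; the sign flips at 13.
print(inside < 0, outside > 0)
```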
Characteristics of normal distribution
- tall peak in middle
- falls the same way on both sides
- no outliers
- symmetric
  - b/c it's symmetric and has no outliers, the best measures of centre and spread to use are mean and SD, respectively
- the mean is at the peak of the normal distribution
  - mean and median are the same in this distribution
- curve never reaches 0 (ie. never touches the horizontal axis)
- a normal curve is fully determined by its mean and SD: the curve shown is the only normal curve that has mean = 10, SD = 3
[51] Z-values AND TABLE Z
Table Z
- lists the z-value digits down the side and across the top; the cells of the table are the proportions that correspond to each z-value
- see pg 1047-1048 of TXT
Procedure
- get z-score
- look up z-score in table; the cell corresponding to it gives you "prop. less" (meaning: the proportion of values that lie below the z-value we are looking at)
- note that they used grams for the x-axis, but the shape of the distribution is exactly the same if we convert to z-scores and use those for the x-axis instead
- there is no simple formula for the proportions under a normal curve
  - instead, have to use the table to turn z-scores into proportions
------------
- Note - he will provide us with this table on the exam, should we need it
------------
[52] ROMA TOMATOES
Information
- if you plot their weights, they reveal a normal distribution shape
- mean = 74g, SD = 2.5g
======
Question
- what prop. (proportion, or percentage) of these tomatoes will weigh less than 70g?
Solution
- z = (70 - 74)/2.5 = -1.60
- Now, let's look in Table Z for the cell that corresponds to z-value -1.60
  - we get 0.0548
  => to get the percentage, multiply by 100: 0.0548 x 100 = 5.48%
- therefore, about 5.5% of the tomatoes will weigh less than 70g.
-----
Note - the prop. in the table cell for a z-score always gives us "less"
-----
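For checking Table Z look-ups away from the textbook, the "prop. less" for a z-score can be computed with the error function. This is a sketch, not the lecture's method (the lecture uses the printed table); the function name `prop_less` is made up for these notes.

```python
import math

def prop_less(z):
    """Proportion of a standard normal distribution below z (what Table Z lists)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean, sd = 74, 2.5        # Roma tomato weights, in grams
z = (70 - mean) / sd      # standardize 70g
p = prop_less(z)

print(round(z, 2), round(p, 4))
```

Rounded to 4 decimal places, the result agrees with the table cell for z = -1.60.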
[53] ROMA TOMATOES (Q2)
Information
- if you plot their weights, they reveal a normal distribution shape
- mean = 74g, SD = 2.5g
======
Question
- what prop. of these tomatoes will weigh MORE than 80g?
Solution
- z = (80 - 74)/2.5 = 2.40
- the cell corresponding to this z-value says 0.9918 less
  => to get "more", subtract from 1: 1 - 0.9918 = 0.0082
  => to get the percentage, multiply by 100: 0.0082 x 100 = 0.82%
About "more"
- to find how likely a value is to be greater than something on a normal distribution: work out the z-score, look up the prop. in the table, and subtract it from 1
- if you draw a picture, the part that we want is tiny (the picture lets us check which part we really want)
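The "more than" recipe, sketched with Python's standard-library `statistics.NormalDist` in place of the printed table (an assumption of this sketch; the lecture uses Table Z):

```python
from statistics import NormalDist

# Normal model for Roma tomato weights (grams), from the notes.
tomatoes = NormalDist(mu=74, sigma=2.5)

prop_less_80 = tomatoes.cdf(80)    # what the table gives: proportion below 80g
prop_more_80 = 1 - prop_less_80    # "more" = 1 - "less"

print(round(prop_less_80, 4), round(prop_more_80, 4))
```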
[54] ROMA TOMATOES (Q3)
Information
- if you plot their weights, they reveal a normal distribution shape
- mean = 74g, SD = 2.5g
======
Question
- what prop. of these tomatoes will weigh between 70 and 80g?
Solution
Method 2
Find the z-scores for the values 70 and 80, then get the prop. "less" that corresponds to each.
a) For 70: z = (70 - 74)/2.5 = -1.60
- corresponding prop. (look up via Table Z) = 0.0548 less
b) For 80: z = (80 - 74)/2.5 = 2.40
- corresponding prop. (look up via Table Z) = 0.9918 less
- So, the answer is: 0.9918 - 0.0548 = 0.9370
  => to get the percentage, multiply by 100: 0.9370 x 100 = 93.7%
- Therefore, the percentage of tomatoes that weigh between 70 and 80g is about 94%.
Method 1
- For 70, we get prop. = 0.0548 less
- For 80, we get prop. = 0.9918 less, but 0.0082 more
So, what is in between is 1 - 0.0548 - 0.0082 = 0.9370 (same answer as Method 2)
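Both methods for "between" can be compared side by side. As before, this is a sketch that uses `statistics.NormalDist` as a stand-in for Table Z; the variable names are made up for these notes.

```python
from statistics import NormalDist

tomatoes = NormalDist(mu=74, sigma=2.5)   # weights in grams

less_70 = tomatoes.cdf(70)        # proportion below 70g
less_80 = tomatoes.cdf(80)        # proportion below 80g
more_80 = 1 - less_80             # proportion above 80g

method_2 = less_80 - less_70      # subtract the two "less" proportions
method_1 = 1 - less_70 - more_80  # remove both tails from the whole

print(round(method_1, 4), round(method_2, 4))
```

The two methods are the same calculation rearranged, so they agree up to floating-point rounding.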
-----Still confused? See the lecture at this time interval: 24:50 – 25:20 --------
[55] WHAT IF "z" HAS 2 DECIMAL PLACES?
(ie.) what if we get z = 1.36?
- The column of z-values down the side accounts for the first 2 digits (1.3), and the row across the top accounts for the third digit (0.06)
- using these, we find the corresponding cell (row 1.3, column 0.06)
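A small sketch of how a 2-decimal z-score splits into the row and column labels of Table Z. The helper `split_for_table` is hypothetical, and the cell value is computed from the error function rather than read off the printed table.

```python
import math

def split_for_table(z):
    """Split a z-score (2 decimal places) into Table Z's row and column labels."""
    hundredths = round(abs(z) * 100)    # e.g. 1.36 -> 136
    row = (hundredths // 10) / 10       # first two digits, e.g. 1.3
    col = (hundredths % 10) / 100       # third digit, e.g. 0.06
    sign = -1 if z < 0 else 1
    return sign * row, col

row, col = split_for_table(1.36)

# Cell value for z = 1.36, computed instead of looked up.
cell = 0.5 * (1 + math.erf(1.36 / math.sqrt(2)))

print(row, col, round(cell, 4))
```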
[56] GETTING VALUES FROM PROPORTIONS
- Suppose we want to find the value of interest, and we are given:
  - z
  - mean
  - SD
- Then, recall that the formula for finding z is: z = (val. - mean)/SD
- Rearranging this equation (multiply both sides by SD, then add the mean), we get: val. = mean + z x SD
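The rearrangement can be sanity-checked by going from a value to z and back again. A sketch with made-up helper names, using the mean = 10, SD = 3 example from earlier in these notes:

```python
def z_from_value(value, mean, sd):
    """Forward direction: z = (val. - mean) / SD."""
    return (value - mean) / sd

def value_from_z(z, mean, sd):
    """Rearranged direction: val. = mean + z * SD."""
    return mean + z * sd

mean, sd = 10, 3
one_sd_above = value_from_z(1, mean, sd)          # 1 SD above the mean
round_trip = z_from_value(one_sd_above, mean, sd)  # should recover z = 1

print(one_sd_above, round_trip)
```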
----Note - he does not go into 68-95-99.7% Rule until next lecture -----
[57] At-term newborn babies
Example
Information
- the weights of the babies can be modelled by a normal distribution
- mean = 3500g
- SD = 500g
Question 1:
- a baby is defined as being "high birth weight (hbw)" if it's in the top 2% of birth weights. What weight would make a baby have hbw?
Solution:
- 2% more = 98% less
  - b/c if there is 2% above, then there must be 98% below
- find the closest z-score corresponding to 0.98
  - z = 2.05
- Now, solve for "val.":
  (val. - 3500)/500 = 2.05
  => val. - 3500 = 2.05 x 500 = 1025
  val. = 3500 + 1025 = 4525g
Question 2:
- a baby is defined as being "low birth weight (lbw)" if it's in the bottom 0.1% of birth weights. What weight would make a baby have lbw?
Solution
- 0.1% less = 0.0010 less (converting to decimal format)
- find the closest z-score corresponding to 0.0010
  - z = -3.09
- Now, solve for "val.":
  (val. - 3500)/500 = -3.09
  => val. - 3500 = -3.09 x 500 = -1545
  val. = 3500 - 1545 = 1955g
* (KNOW METHODS above FOR EXAM)
Implication
- if we start with a % or prop. and want to get a value, then we have to use the table backwards
  => can use the table either way around
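Using the table "backwards" corresponds to an inverse look-up. The sketch below uses `statistics.NormalDist.inv_cdf` as a stand-in for searching Table Z (an assumption; the lecture uses the printed table). Because it does not round z to 2 decimal places, the weights differ by a gram or two from the hand calculations of 4525g and 1955g. The helper name is made up.

```python
from statistics import NormalDist

def value_for_prop_less(prop_less, mean, sd):
    """Proportion below -> z-score -> value on the original scale."""
    z = NormalDist().inv_cdf(prop_less)   # unrounded z; Table Z gives the closest 2-decimal z
    return z, mean + z * sd

# Birth weights: mean = 3500g, SD = 500g (from the notes).
z_hbw, hbw = value_for_prop_less(0.98, mean=3500, sd=500)    # top 2% = 98% less
z_lbw, lbw = value_for_prop_less(0.001, mean=3500, sd=500)   # bottom 0.1%

print(round(z_hbw, 2), round(hbw))
print(round(z_lbw, 2), round(lbw))
```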
[58] HOW DO WE SEE WHETHER NORMAL DISTRIBUTION IS GOOD FIT TO DATA?
- when we are doing our calculations with z-scores, we assume that the values we are working with come from a distribution that is Normal
  - ie. a symmetric, unimodal curve
- but what if that assumption ISN'T ok?
(ex) potassium data
- right-skewed distribution
- if we apply a normal distribution to this, it will not work
- outliers at the top of the boxplot indicate that the distribution is right-skewed
- the more specialized display for checking whether a normal distribution is a good fit for data is the NORMAL PROBABILITY PLOT (QQ plot)
  - by plotting the data values on this, we can determine whether we should believe the results we get from normal distribution calculations
[59] EXAMPLES OF SCATTERPLOTS
[figure: normal probability plot of the potassium data]
^ THIS IS NOT "OK".
An OK normal distribution is one where:
- dots roughly follow, or are on, the black line
  - if the data are exactly normal, then the dots will follow the line exactly
In this example
- this is not normal b/c the points curve away from the line
  - indication of right-skewness
  - high values are spread out
  - many low values are clumped together, which shows that most of the distribution sits at the low end
- we are plotting normal quantiles (z-values expected for the data, x-axis) against the corresponding data values for potassium (y-axis), and seeing if we get roughly a straight line
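A sketch of how the points for a normal probability plot can be built by hand. The data are made-up right-skewed values (not the lecture's potassium data), and the plotting position `(i + 0.5) / n` is one common convention assumed here; statistical software may use a slightly different one.

```python
from statistics import NormalDist

def normal_quantile_pairs(data):
    """Pair each sorted data value with the z-value expected at its rank."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n-1.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))

# Hypothetical right-skewed values.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 15]
pairs = normal_quantile_pairs(skewed)

# x = normal quantile, y = data value; plot these to get the QQ plot.
for q, value in pairs:
    print(f"{q:6.2f}  {value}")
```

In the printed pairs, the data values climb slowly at first and then jump at the top end, which is the curving-away-from-the-line pattern described above for right-skewed data.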
[60] EXAMPLE OF SCATTERPLOTS
In this example
- not 100% straight, but there aren't any obvious outliers or signs of curvature
- applying a normal distribution and its calculations to this data is OK
General Ruling
- if
  - no curving
  - pts close to line
  - no obvious outliers
  - no spreading out too much
- then
  - pretty normal
[61] EXAMPLES OF SCATTERPLOT
Cereal calorie data
- why are there horizontal lines of dots going across?
  - b/c calories are only measured to the nearest 10
  => many calorie counts are exactly the same in value, and differ from each other only in their normal quantiles
- outliers in this case are those points that are too low or too high
[62] EXAMPLES OF SCATTERPLOT
Cereal sugars
- similar to calories, there are some horizontal lines, indicating many identical values that differ only in their normal quantiles
- otherwise, very close to line
- normal distribution may work here, but it is not the best fit
- not significantly distorted by too-high or too-low points, b/c those outer points are not too far away from the rest
=========================