Section 1.3: The Normal Distributions

Stat M11/Econ M40 (Spring 2002) Lecture Note Introduction to Statistical Methods for Business and Economics

Instructor: Hongquan Xu

This section begins with a recap of the strategy for exploring data:

1. Plot your data: make an appropriate graph.
2. Look at the overall pattern and any deviations from that pattern.
3. If appropriate, compute numerical summaries (center/spread).
4. Approximate the histogram shape with a smooth curve, which will serve as a model for the distribution of the variable.

We will focus on item 4 above in this last section of Chapter 1.

Idea of a Model:

Here is a histogram based on n = 30 observations. The vertical axis is count.

a. How many observations were between 3 and 5?

b. What proportion of the observations were between 3 and 5?

c. Suppose we divide the vertical count axis values by 30. Then the height of the bars would = ______ BUT THE PICTURE IS THE SAME (just a rescaling). In fact, I could divide by any value and the overall picture would be the same.

Note: Since each class has equal width, it is the height, rather than the area, of the bars that allows us to visually compare the various classes.

d. What is the area of a rectangle in general?

e. Suppose we divide the left axis by 60. Then what is the AREA of the bar for the class 3 to 5?

So the AREA of a bar = ______ and the formal term for the vertical axis label would be ______.

This is the “histogram” that we will smooth out; the resulting curve will be our model.

f. What is the total area of all of the bars?

We will require the curve to have a total area under it of 1; such a curve is called a DENSITY CURVE!
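The rescaling in parts c through f can be sketched in a few lines of Python. The class edges and counts below are hypothetical (they are not read from the histogram in these notes); they are chosen only so that n = 30 and the class width is 2, which makes n × width = 60 as in part e.

```python
# Hypothetical class edges and counts (illustrative only, not the
# histogram from the notes): 5 equal-width classes, n = 30 observations.
class_edges = [1, 3, 5, 7, 9, 11]
counts = [4, 9, 10, 5, 2]

n = sum(counts)                           # total number of observations
width = class_edges[1] - class_edges[0]   # common class width

# Height on the density scale = count / (n * width),
# so that area = height * width = proportion of observations in the class.
heights = [c / (n * width) for c in counts]
areas = [h * width for h in heights]

for lo, hi, h, a in zip(class_edges, class_edges[1:], heights, areas):
    print(f"class {lo}-{hi}: height = {h:.4f}, area = {a:.4f}")
print("total area =", round(sum(areas), 10))
```

With this scaling each bar's area equals the proportion of observations in its class, so the areas must add up to 1 whatever the counts are.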

Density Curve

Definition: A curve (or function) f(x) is called a Density Curve if:

1. It lies on or above the horizontal axis.
2. The total area under the curve is equal to 1.

KEY IDEA: AREA under a density curve over a range of values corresponds (approximately) to the PROPORTION of observations with values in that range. It is an approximation, not exact, but a nice, simpler summary of the distribution.

Example: Consider the following curve:

Function Form: f(x) = 1/20 for 30 ≤ x ≤ 50

Picture Form:

Q: Is this a density curve?

Q: If yes, find the proportion of items with a response less than 35.
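One way to check answers to these two questions is to approximate the areas numerically. The sketch below assumes the curve is 0 outside the interval 30 to 50 (the notes give only the formula on that interval) and uses a simple midpoint sum; the helper name `area` is ours.

```python
def f(x):
    """Density from the example: 1/20 on [30, 50]; assumed 0 elsewhere."""
    return 1 / 20 if 30 <= x <= 50 else 0.0

def area(lo, hi, steps=10_000):
    """Midpoint Riemann sum approximating the area under f between lo and hi."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

total = area(30, 50)      # total area under the curve
below_35 = area(30, 35)   # area to the left of 35

print(round(total, 6))
print(round(below_35, 6))
```

Because the curve is flat, both areas are rectangles and can be confirmed by hand as width × height: 20 × 1/20 for the whole curve and 5 × 1/20 for the part below 35.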


Measuring center and spread for a density curve

In Section 1.2, we learned how to measure center and spread for a data set. Here we define how to measure center and spread for a density curve.

Definitions

• The median of a density curve is the equal-areas point, the point that divides the area under the curve in half.
• The mean of a density curve is the balance point, at which the curve would balance if made of solid material.

Median vs Mean

[Figure: graphs (a) and (b)]

Rules

• The median and mean are ______ for a symmetric density curve.
• The mean of a skewed curve is pulled ______ from the median in the direction of the long tail.
• The mean is ______ than the median if the density curve is skewed to the left.
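The definitions of median (equal-areas point) and mean (balance point) can be tried out numerically. The sketch below uses a hypothetical right-skewed density, f(x) = 2(1 − x) on 0 ≤ x ≤ 1, which does not appear in these notes; it is picked only because its long tail is easy to see.

```python
def f(x):
    """Hypothetical right-skewed density on [0, 1] (not from the notes)."""
    return 2 * (1 - x)

steps = 100_000
dx = 1 / steps
mids = [(i + 0.5) * dx for i in range(steps)]   # midpoints of thin slices

total_area = sum(f(x) for x in mids) * dx       # should be 1 for a density
mean = sum(x * f(x) for x in mids) * dx         # balance point

# Median: accumulate area from the left until half of it is reached.
cumulative = 0.0
median = None
for x in mids:
    cumulative += f(x) * dx
    if cumulative >= 0.5:
        median = x
        break

print(f"total area = {total_area:.4f}")
print(f"mean = {mean:.4f}, median = {median:.4f}")
```

For this curve the mean lands to the right of the median, that is, on the side of the long right tail. Mirroring the curve so that it is skewed to the left reverses the order.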

The density curve will serve as a model for the distribution of our variable: a model for the whole population (even though it may have been derived from information from a sample). As a model for the population (the book says “idealized distribution”), we will use different symbols for representing the mean and standard deviation.

Notations

                      Data (Sample)    Density Curve or Model (Population)
mean                  x̄                µ
standard deviation    s                σ

There are many density curves that can be used as models. We will see more in chapter 4, but here we focus on an important family of densities called the NORMAL DISTRIBUTIONS.

Normal Distributions

The normal distributions are the most important family of distributions, for two reasons: many variables have approximately this shape, and many statistics that we use in our inference methods are based on the sample mean X̄, and sums or averages generally have (approximately) a normal distribution.

A Normal Curve: Symmetric, unimodal, bell-shaped, centered at the mean µ, with its spread determined by the standard deviation σ. In fact, the change-of-curvature points on each side of the mean mark the values that are one standard deviation away from the mean.

Notation: The statement that the variable X is normally distributed with mean µ and standard deviation σ is denoted by ______

Let’s sketch two normal curves: N(50, 10) and N(80, 7)

Note: We could add the vertical axis, which would be labelled "density", but we will not compute the height of the curve. We only need the area under the density curve, and we have a table that provides many areas for us.
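Although the notes do not compute heights, a short numerical check can support the sketch of N(50, 10) and N(80, 7). The snippet below uses `statistics.NormalDist` from the Python standard library (Python 3.8+). The helper `curvature` is our own name for a central second difference, used to look for the change of curvature near one standard deviation from the mean.

```python
from statistics import NormalDist

def curvature(dist, x, h=0.01):
    """Central second difference of the density:
    negative where the curve bends downward, positive where it bends upward."""
    return (dist.pdf(x - h) - 2 * dist.pdf(x) + dist.pdf(x + h)) / h**2

summary = {}
for mu, sigma in [(50, 10), (80, 7)]:
    dist = NormalDist(mu, sigma)
    summary[(mu, sigma)] = {
        "peak_height": dist.pdf(mu),                          # height at the mean
        "just_inside": curvature(dist, mu + 0.9 * sigma),     # slightly closer than 1 sd
        "just_outside": curvature(dist, mu + 1.1 * sigma),    # slightly farther than 1 sd
    }

for params, info in summary.items():
    print(params, info)
```

The curvature is negative just inside µ + σ and positive just outside it for both curves, and the curve with the smaller σ has the taller peak, which is useful to keep in mind when drawing the two curves on a common axis.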


Recall: Density functions can be used to model the distribution for a quantitative variable for a population. There are many types of density functions. The main feature is that the area under a density function over a range approximates the proportion of items in the population that have values in that range. We began discussing the family of normal distributions, each indexed by the population mean µ and the population standard deviation σ, and denoted by N(µ, σ).

Next we practice finding areas (and thus proportions) under the Standard Normal Distribution.

The Standard Normal Distribution

• is the normal distribution N(0, 1) with mean 0 and standard deviation 1.

Table A, inside the front cover and in the back of the book, gives areas under the standard normal curve to the left of the points z.

Examples:


The variable Z is often used to denote the variable following the N(0, 1) distribution.

Example

1. What proportion of Z values are within 1 standard deviation of the mean? Proportion(−1 ≤ Z ≤ 1)?

2. What proportion of Z values are within 2 standard deviations of the mean? Proportion(−2 ≤ Z ≤ 2)?

3. What proportion of Z values are within 3 standard deviations of the mean? Proportion(−3 ≤ Z ≤ 3)?

4. Proportion (−1.5 < Z < 2.8)

5. Proportion (Z > 4.8)

6. What is the 90th percentile of the N (0, 1) distribution?
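Table A lookups for the six examples above can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+), whose `cdf(z)` method gives the area to the left of z, the same quantity that Table A tabulates. The variable names are ours.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)   # the standard normal distribution N(0, 1)

def proportion_between(a, b):
    """Area under the standard normal curve between a and b."""
    return Z.cdf(b) - Z.cdf(a)

within_1 = proportion_between(-1, 1)      # Example 1
within_2 = proportion_between(-2, 2)      # Example 2
within_3 = proportion_between(-3, 3)      # Example 3
ex4 = proportion_between(-1.5, 2.8)       # Example 4
ex5 = 1 - Z.cdf(4.8)                      # Example 5: area to the right of 4.8
p90 = Z.inv_cdf(0.90)                     # Example 6: z with 90% of the area to its left

for label, value in [("within 1 sd", within_1), ("within 2 sd", within_2),
                     ("within 3 sd", within_3), ("-1.5 < Z < 2.8", ex4),
                     ("Z > 4.8", ex5), ("90th percentile", p90)]:
    print(f"{label}: {value:.6f}")
```

Printed tables usually round to four decimals and may not extend as far as z = 4.8, so small differences from a table lookup are to be expected.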


The first three results form what is called

The 68-95-99.7 Rule for any Normal Distribution

• About 68% of the observations fall within 1 standard deviation of the mean
• About 95% of the observations fall within 2 standard deviations of the mean
• About 99.7% (nearly all) of the observations fall within 3 standard deviations of the mean

A useful result: it can provide a good frame of reference without detailed calculation.

Example: Suppose the average test score is 70 with a standard deviation of 5 and you scored 85. If the distribution of scores is approximately normal, then how well did you do?
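The test-score example can be worked as a quick sketch. The code first counts how many standard deviations the score lies above the mean, then compares the rough answer suggested by the 68-95-99.7 rule with the value from `statistics.NormalDist` (Python 3.8+). Splitting the leftover 0.3% equally between the two tails relies on the symmetry of the normal curve; the variable names are ours.

```python
from statistics import NormalDist

mean_score, sd_score, my_score = 70, 5, 85

# Number of standard deviations the score lies above the mean.
z = (my_score - mean_score) / sd_score

# Rule-of-thumb estimate: about 99.7% of scores are within 3 standard
# deviations, and by symmetry the remaining 0.3% is split between the tails.
approx_above = (1 - 0.997) / 2

# Value under the assumed N(70, 5) model.
model_above = 1 - NormalDist(mean_score, sd_score).cdf(my_score)

print(f"z = {z}")
print(f"rule-of-thumb proportion above: {approx_above:.4f}")
print(f"model proportion above:         {model_above:.4f}")
```

Both numbers tell the same story: under a normal model, only a very small fraction of scores would be higher than 85.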

We need one more idea to help us be able to find areas and thus proportions for any general N (µ, σ) distribution – the idea of standardization.

If the variable X has the N(µ, σ) distribution, then the standardized variable

Z = (X − µ) / σ

will have the N(0, 1) distribution. The variable Z is sometimes referred to as the Z-score or standard score. Let’s see how this standardization idea works through our next example.

Notation: We use BIG Z or X to denote a variable and little z or x to denote a number.
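As a small illustration of standardization, the sketch below takes the N(50, 10) curve sketched earlier and a hypothetical value x = 65 (our choice, not from the notes). It computes the area to the left of x in two ways: by standardizing and using N(0, 1), and directly from N(50, 10). The function name `z_score` is ours.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

mu, sigma = 50, 10
x = 65                                   # hypothetical value, for illustration only

z = z_score(x, mu, sigma)
via_standard = NormalDist(0, 1).cdf(z)   # area to the left of z under N(0, 1)
direct = NormalDist(mu, sigma).cdf(x)    # area to the left of x under N(50, 10)

print(f"z = {z}")
print(f"via standardization: {via_standard:.4f}")
print(f"direct:              {direct:.4f}")
```

The two areas agree, which is why a single table for N(0, 1) is enough to find proportions for any N(µ, σ).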


Example: Costs of Textbooks

The students at a local university spend a lot of money each term on textbooks. Suppose that the amount of money spent on textbooks for a term follows a normal distribution with a mean of $160 and a standard deviation of $20. Sketch this distribution below.

a. The local bookstore will offer a t-shirt to any student who spends more than $200 on textbooks. What proportion of students are eligible for the t-shirt?

b. Justin spent $170 on textbooks. What percentile does this value of $170 correspond to?

c. Complete the sentence: (show all work) Based on the above model, 20% of the students spend less than $______ on textbooks.
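Hand calculations for parts a through c can be checked with a sketch like the one below, which follows the standardize-then-look-up approach using `statistics.NormalDist` (Python 3.8+) in place of Table A. The variable names are ours; part c works backwards from an area to a z value and then to a dollar amount.

```python
from statistics import NormalDist

mu, sigma = 160, 20          # model from the example: N(160, 20)
Z = NormalDist(0, 1)         # standard normal, playing the role of Table A

# a. Proportion spending more than $200: standardize, then take the right tail.
z_a = (200 - mu) / sigma
prop_tshirt = 1 - Z.cdf(z_a)

# b. Percentile for $170: the area to the left of its z-score.
z_b = (170 - mu) / sigma
percentile_170 = Z.cdf(z_b)

# c. Value with 20% of students below it: find z with area 0.20 to its left,
#    then undo the standardization: x = mu + z * sigma.
z_c = Z.inv_cdf(0.20)
cutoff = mu + z_c * sigma

print(f"a. z = {z_a}, proportion above $200 = {prop_tshirt:.4f}")
print(f"b. z = {z_b}, area to the left of $170 = {percentile_170:.4f}")
print(f"c. z = {z_c:.4f}, cutoff = ${cutoff:.2f}")
```

A table lookup for part c will typically use a rounded z value, so a hand answer may differ from the computed cutoff by a few cents.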

Later in this class we will see that the assumption of a normal distribution is needed in order to perform certain inference procedures. If we do not have much data, this assumption will be important to check. So we have the following question.

How to judge if data follow approximately a normal model?

1. First look at a histogram or stemplot: check for non-normal features such as gaps, outliers, and strong skewness. If the data are roughly symmetric, unimodal, and bell-shaped, then we can turn to a tool that is more sensitive for assessing normality.
2. Normal Quantile Plot (aka Q-Q Plot or Normal Probability Plot)

Big Idea: Plot percentiles of a standard normal against the corresponding percentiles of the data. If the observations follow a normal distribution, the resulting plot should be ______. Deviations from this would indicate possible departures from a normal distribution:

• Outliers appear as points that are far away from the overall pattern of the plot.
• In a right-skewed distribution, the largest observations fall distinctly above a line drawn through the main body of points.
• In a left-skewed distribution, the ______ observations fall distinctly ______ the line.
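The coordinates behind a normal quantile plot can be computed by hand or with a short script. The sketch below uses a small hypothetical data set (not from the notes) and pairs each sorted observation with a standard normal percentile. The plotting positions (i − 0.5)/n are one common convention and an assumption here; statistical software may use slightly different positions.

```python
from statistics import NormalDist

# Hypothetical observations, for illustration only.
data = [4.1, 5.0, 5.6, 6.2, 4.7, 5.3, 5.9, 4.4, 5.1, 6.8]

n = len(data)
sorted_data = sorted(data)

# Assumed plotting positions: the i-th smallest observation is treated as
# sitting at the (i - 0.5)/n quantile of the data.
positions = [(i - 0.5) / n for i in range(1, n + 1)]

# Matching percentiles of the standard normal distribution.
z_quantiles = [NormalDist(0, 1).inv_cdf(p) for p in positions]

print("   z      x")
for z, x in zip(z_quantiles, sorted_data):
    print(f"{z:7.3f}  {x:5.1f}")
```

Plotting x against z (for example on graph paper or in any plotting tool) gives the normal quantile plot, which is then read using the bullet points above.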

NOTE: Real data almost always show some departure from the "THEORETICAL" normal model; don’t overreact to minor wiggles in the plot.

Examples

Fig. 1.30.

Fig. 1.32.

Fig. 1.33
