MATH 2233 Lab 4: CORRELATION AND REGRESSION

Introduction:

Statistical investigations frequently concern the relationship between two variables, X and Y. For example, we might be interested in the relationship between eyesight and age in humans, between the length and weight of snakes, or between plant growth and fertilizer level. Such investigations require that, for each observation to be measured, we obtain a value for both the X and Y variables. If X and Y are the only two variables measured, the resulting data set is termed a bivariate data set. To study the relationship between X and Y, one can use a scatter diagram in which one of the variables is simply plotted against the other. If the relationship appears to be linear, we can measure the strength of this linear relationship using PEARSON’S (SAMPLE) CORRELATION COEFFICIENT, r. Assuming that a sample of size n has been taken and the X and Y variables have been measured for each observation, i.e., we have a set of observations (xi, yi), i = 1, 2, ..., n, then Pearson’s Sample Correlation Coefficient, r, is defined as:
Pearson’s (Sample) Correlation Coefficient:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$
The value of r is always between –1 and +1 and the general rule of thumb for interpreting it is:
Range of values for r        Strength and direction of linear component of the relationship between X and Y
-1 ≤ r ≤ -0.9                STRONG NEGATIVE
-0.9 < r ≤ -0.5              MODERATE NEGATIVE
-0.5 < r < 0                 WEAK NEGATIVE
r = 0                        NONE
0 < r < 0.5                  WEAK POSITIVE
0.5 ≤ r < 0.9                MODERATE POSITIVE
0.9 ≤ r ≤ 1                  STRONG POSITIVE
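As a quick check on the formula above, the following sketch computes r both directly from the definition and with R’s built-in cor() function; the numbers are made up purely for illustration and are not from the lab data set.

    # Illustrative numbers only -- not from the animals data set
    x <- c(1.2, 2.5, 3.1, 4.8, 6.0)
    y <- c(2.0, 3.9, 4.2, 7.1, 8.5)

    # r from the definition: sum of cross-products over the square root
    # of the product of the two sums of squares
    r_manual <- sum((x - mean(x)) * (y - mean(y))) /
      sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

    # r from the built-in function
    r_builtin <- cor(x, y)

    c(manual = r_manual, builtin = r_builtin)   # the two values agree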
From a scatter diagram of Y versus X, we observe that typically it is impossible to draw a straight line which will pass through all of the points. We can try, however, to draw a line which “fits” the points as closely as possible.
The “least squares” line is defined as the line such that the sum of the squared vertical distances from the points to the line is minimized.
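To see what “minimized” means here, the sketch below (again on made-up numbers) compares the sum of squared vertical distances for the line fitted by lm() with that of another plausible line, namely the one joining the first and last points; the least squares line always gives the smaller sum.

    # Illustrative numbers only
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.7)

    fit <- lm(y ~ x)                      # least squares line
    sse_ls <- sum(resid(fit)^2)           # sum of squared vertical distances

    # An alternative line drawn through the first and last points
    slope_alt <- (y[5] - y[1]) / (x[5] - x[1])
    intercept_alt <- y[1] - slope_alt * x[1]
    sse_alt <- sum((y - (intercept_alt + slope_alt * x))^2)

    c(least_squares = sse_ls, alternative = sse_alt)   # sse_ls is the smaller of the two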
The variable Y is known as the response variable, while the variable X is called the predictor variable. The least squares line may be used to predict a value for Y for a given value of X. This can be done as long as the given value of X is within the range of the x-values in the observed data. Outside this range the least squares line may not accurately reflect the relationship between X and Y, so its use to predict Y is highly questionable. That being said, if the given value of X is not too far from this range, the prediction may still have some merit. The Coefficient of Determination, R², measures the degree to which the least squares line fits the data; it is simply the square of Pearson’s Sample Correlation Coefficient, r.

Range of values for R²       Fit
0.8 ≤ R² ≤ 1.0               STRONG
0.25 ≤ R² < 0.8              MODERATE
0.0 ≤ R² < 0.25              WEAK
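As a small check of the claim that R² is just r squared, the sketch below (made-up numbers again) compares the R² reported by a fitted regression with the square of cor():

    # Illustrative numbers only
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.7)

    r   <- cor(x, y)
    fit <- lm(y ~ x)

    c(r_squared = r^2, from_regression = summary(fit)$r.squared)   # identical values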
Note that R² is always between 0 and 1. In the context of a linear relationship between two variables, an observation is an outlier if it deviates substantially from the linear pattern defined by the other observations in the data set. Such observations can have a substantial effect on Pearson’s Correlation Coefficient; their presence also influences the slope and y-intercept of the least squares line.
In this lab, we will study the relationship between two variables in a particular data set. We will use the measures and techniques discussed above with the help of R. A study was conducted to investigate whether there is a relationship between brain size and body weight in mammals. Below is a data set, available on Acorn, called animals.csv, which presents the brain weight (in grams) and the body weight (in kilograms) of 21 mammals [1].
[1] Rousseeuw, P.J. and Leroy, A.M. (1987), Robust Regression and Outlier Detection, Wiley, New York.
Species            Body, kg    Brain, g
Mountain Beaver    1.35        8.1
Cow                465         423
Gray Wolf          36.33       119.5
Guinea Pig         1.04        5.5
Donkey             187.1       419
Horse              521         655
Potar Monkey       10          115
Cat                3.3         25.6
Giraffe            529         680
Gorilla            207         406
Rhesus Monkey      6.8         179
Kangaroo           35          56
Hamster            0.12        1
Mouse              0.023       0.4
Rabbit             2.5         12.1
Sheep              55.5        175
Jaguar             100         157
Chimpanzee         52.16       440
Rat                0.28        1.9
Mole               0.122       3
Pig                192         180
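Before starting Part 1, the data need to be read into R. A minimal sketch is given below; the file name animals.csv comes from the description above, but the working directory and the column names (Species, Body, Brain) are assumptions you should check against the actual file.

    # Read the data set (adjust the path to wherever you saved animals.csv)
    animals <- read.csv("animals.csv")

    str(animals)       # confirm the variable names and types
    summary(animals)   # quick descriptive statistics for each column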
Part 1: Investigating the Relationship Between Two Variables

In what follows, we refer to Body Weight as X and Brain Weight as Y.

1. [ 2 marks ] Use R to calculate descriptive statistics for X and Y.

2. [ 2 marks ] Generate a scatterplot of Y versus X. In R, select Graph > Scatterplot. Choose the correct variables for X and Y and select the option Least Squares Line. Don’t copy the plot into your answers yet!
a. Does there appear to be a linear component to the relationship between X and Y?
b. Comment on the direction and strength of this relationship based simply on the plot.

3. [ 3 marks ] After you have created the scatterplot, add a vertical reference line at the value of x̄ and a horizontal reference line at ȳ. To do this, type the following command in the script window, replacing *** in the command below with the actual numerical values of x̄ and ȳ that you calculated in question 1. The command to use is:

abline(v = ***, h = ***)

Copy the graph into your answers. Notice that the point (x̄, ȳ) is on the least squares line. Into which two quadrants (defined by the x̄ and ȳ lines) do most of the points fall?

In order to quantify the linear relationship between the variables, we calculate Pearson’s Correlation Coefficient using the following R Commander menu command:
Statistics > Summaries > Correlation Test

Pick the two variables and select Pearson product-moment. The Alternative Hypothesis can be left as Two-sided.

4. [ 5 marks ]
a. [2] Calculate Pearson’s Correlation Coefficient between Body Weight and Brain Weight.
b. [2] Interpret the correlation that you obtained, using the guide given earlier in this lab assignment.
c. [1] Is this correlation consistent with the conclusions you drew from the scatterplot in question 2b)?
5. [ 2 marks ] Is correlation the same as causation? For example, does the correlation that you obtained above imply that an increase in body weight will result in an increase in brain weight? Explain.
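If you prefer typing commands in the script window to using the menus, the sketch below reproduces the main steps of Part 1. It assumes the data frame animals and the column names Body and Brain from the loading sketch above; adjust them if your data set uses different names.

    # Descriptive statistics for X (Body) and Y (Brain) -- question 1
    summary(animals$Body)
    summary(animals$Brain)

    # Scatterplot with the least squares line and reference lines -- questions 2 and 3
    plot(Brain ~ Body, data = animals,
         xlab = "Body weight (kg)", ylab = "Brain weight (g)")
    abline(lm(Brain ~ Body, data = animals))                   # least squares line
    abline(v = mean(animals$Body), h = mean(animals$Brain))    # x-bar and y-bar lines

    # Pearson's correlation with a two-sided test -- question 4
    cor.test(animals$Body, animals$Brain, method = "pearson")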
Part 2: Regression – The Least Squares Line

Although we plotted the least squares line in Part 1, we don’t yet have the equation of the line. In this section, we calculate the equation and R².

6. [ 4 marks ] Treating Brain Weight as the response variable and Body Weight as the predictor variable, perform a regression of Brain Weight versus Body Weight. Use the R Commander menu command Statistics > Fit Models > Linear Regression…. Select the variables in the correct order. In the output, look for the Estimate column to obtain the values of the coefficients; these are the values of b0 and b1. Record the values of b0 (the y-intercept), b1 (the slope), and the adjusted R².

7. [ 2 marks ] State the equation of the least squares line (in terms of body weight and brain weight, not x and y).

8. [ 2 marks ] Comment on the fit of this line to the data by interpreting the coefficient of determination, using the interpretation of R² as the “proportion of the total variation explained by the regression line”.

9. [ 3 marks ]
a. [1] Calculating by hand, predict what the brain weight would be when the body weight is 550 kg.
b. [2] Should caution be used when interpreting this prediction? Explain.
10. [ 1 mark ] Observations whose vertical distance from the line is very large may be considered to be unusual. Identify the animal whose values of Brain and Body weight represent an outlier. Give the values of Brain Weight and Body Weight for this observation.
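A corresponding script-window sketch for Part 2, with the same assumptions about the animals data frame and its column names; the hand calculation in question 9a) should match the predict() value up to rounding.

    # Regression of Brain (response) on Body (predictor) -- question 6
    fit <- lm(Brain ~ Body, data = animals)
    summary(fit)      # the Estimate column gives b0 and b1; R-squared is also reported

    coef(fit)         # b0 (intercept) and b1 (slope) for writing the equation -- question 7

    # Predicted brain weight at a body weight of 550 kg -- question 9a
    predict(fit, newdata = data.frame(Body = 550))

    # Large residuals point to observations far from the line -- question 10
    residuals(fit)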
Part 3: The Effect of Outliers on Correlation and Regression

11. [ 2 marks ] You have already identified the unusual observation in the previous question. Use the Edit Data button to delete the row containing the unusual observation. (Mac users may have to open the data set in Excel or OpenOffice and remove the row.) Recreate the scatterplot (you don’t need to copy it into your answers). How do you think removing the outlier will affect the value of r?

12. [ 2 marks ]
a. [1] Calculate Pearson’s Correlation Coefficient between Body Weight and Brain Weight excluding the unusual value.
b. [1] How does the value obtained in a) compare with that obtained in question 4a)?

13. [ 8 marks ] Removal of unusual observations also has an effect on regression. This is not surprising, since the coefficient of determination used to measure the fit of the least squares line is just the value of r squared.
a. [2] Generate the fitted line plot with the unusual observation removed (like you did in question 3).
b. [3] Give the equation of the least squares line (in terms of body weight and brain weight; directions were outlined in question 6).
c. [1] Use R² to comment on the fit of this line to the data.
d. [2] Compare the slopes of the two least squares lines you have generated. Which is steeper? Explain how the presence of this outlier affects the slope in this way.
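One script-window way to redo the analysis without the unusual observation is sketched below; instead of naming the animal (that would give away question 10), the row with the largest vertical distance from the original least squares line is dropped. The animals data frame and the Body/Brain column names are again assumptions.

    fit_all <- lm(Brain ~ Body, data = animals)

    # Drop the single row with the largest absolute residual
    animals_clean <- animals[-which.max(abs(resid(fit_all))), ]

    # Correlation and regression without that observation -- questions 12 and 13
    cor(animals_clean$Body, animals_clean$Brain)

    fit_clean <- lm(Brain ~ Body, data = animals_clean)
    summary(fit_clean)

    # Fitted line plot with the unusual observation removed -- question 13a
    plot(Brain ~ Body, data = animals_clean,
         xlab = "Body weight (kg)", ylab = "Brain weight (g)")
    abline(fit_clean)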