Regression Models Course Project
George Lwevoola
July 21, 2016

Executive Summary

This project examines the use of exploratory tools to kick-start the process of identifying relevant variables to include in a model, given an outcome and a number of possible regressors (explanatory variables). The "strength" of each explanatory variable's effect on the outcome is tested progressively until the most appropriate set of variables is selected. Residual plots and diagnostics are used to establish the "goodness" of fit of the selected model. An attempt is also made to quantify the uncertainty through the use of inference tools.

1. Introduction
We are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome, and in particular in the following two questions:

i. "Is an automatic or manual transmission better for MPG"
ii. "Quantify the MPG difference between automatic and manual transmissions"
This assignment seeks to establish whether an automatic or a manual transmission offers more miles per gallon, and then to quantify the MPG difference between automatic and manual vehicles. We start by assuming that all the variables have an effect on mpg and try to determine the extent of this effect.

2. Exploratory Data Analysis
Visualizing the data with exploratory graphs helps us understand the data better and unearth patterns that may be crucial in developing the regression model. General observations can be made when mpg is plotted against each of the other variables in turn. These plots can be seen in the appendix at the end of this report. When we examine the unadjusted estimate from regressing mpg (the outcome) on transmission type, we see the results below:

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04
Note that the t-test of the null hypothesis that the coefficient of the transmission type variable equals 0, against the alternative hypothesis that it is not equal to 0, is significant at the 5% level, since the p-value of 0.000285 is less than 0.05.
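The unadjusted fit reported above can be reproduced with a short sketch. It assumes the data are R's built-in mtcars data set, which the variable names and estimates suggest but the report does not state; in mtcars, am = 0 is automatic and am = 1 is manual. The object names fit_unadj, coefs and group_means are illustrative only.

```r
# Assumption: the data are R's built-in mtcars (am: 0 = automatic, 1 = manual)
data(mtcars)

# Unadjusted model: mpg regressed on transmission type only
fit_unadj <- lm(mpg ~ factor(am), data = mtcars)
coefs <- summary(fit_unadj)$coef
print(coefs)

# With a single two-level factor, the intercept is the mean of the
# reference group (am = 0) and the slope is the difference in group means
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)
print(unname(group_means["1"] - group_means["0"]))
```

Under that assumption, the factor(am)1 estimate is simply the difference between the mean mpg of the two transmission groups, before any other variable is taken into account.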
3. Linear Model Selection
Using the analysis of variance function anova, we try to select the best-suited model from a sequence of nested models, as follows:

## Analysis of Variance Table
##
## Model  1: mpg ~ factor(am) - 1
## Model  2: mpg ~ cyl + factor(am) - 1
## Model  3: mpg ~ disp + cyl + factor(am) - 1
## Model  4: mpg ~ hp + disp + cyl + factor(am) - 1
## Model  5: mpg ~ drat + hp + disp + cyl + factor(am) - 1
## Model  6: mpg ~ wt + drat + hp + disp + cyl + factor(am) - 1
## Model  7: mpg ~ qsec + wt + drat + hp + disp + cyl + factor(am) - 1
## Model  8: mpg ~ vs + qsec + wt + drat + hp + disp + cyl + factor(am) - 1
## Model  9: mpg ~ gear + vs + qsec + wt + drat + hp + disp + cyl + factor(am) - 1
## Model 10: mpg ~ carb + gear + vs + qsec + wt + drat + hp + disp + cyl + factor(am) - 1
##    Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1      30 720.90
## 2      29 271.36  1    449.53 64.0039 8.231e-08 ***
## 3      28 252.08  1     19.28  2.7452   0.11241
## 4      27 216.37  1     35.71  5.0849   0.03493 *
## 5      26 214.50  1      1.87  0.2663   0.61121
## 6      25 162.43  1     52.06  7.4127   0.01275 *
## 7      24 149.09  1     13.34  1.8999   0.18260
## 8      23 148.87  1      0.22  0.0309   0.86214
## 9      22 147.90  1      0.97  0.1384   0.71365
## 10     21 147.49  1      0.41  0.0579   0.81218
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
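A sequence of nested models like the one in the table can be built and compared with the sketch below. It assumes the built-in mtcars data set and adds the regressors in the order shown in the table; the names added, fits and nested are illustrative and are not taken from the original report.

```r
# Assumption: the data are R's built-in mtcars data set
data(mtcars)

# Regressors in the order they enter Models 2 to 10
added <- c("cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "gear", "carb")

# Model k + 1 contains the first k added regressors plus factor(am);
# intercept = FALSE corresponds to the "- 1" in the formulas
fits <- lapply(0:length(added), function(k) {
  rhs <- c(rev(added[seq_len(k)]), "factor(am)")
  lm(reformulate(rhs, response = "mpg", intercept = FALSE), data = mtcars)
})

# Sequential F-tests: each row compares a model with the one before it
nested <- do.call(anova, fits)
print(nested)
```

Each row tests only the term added at that step, and anova estimates the noise variance for every F-statistic from the largest model in the list (Model 10 here), so the p-values depend on the order in which the regressors are added.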
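Model 6 appears to offer the best fit for predicting mpg: it is the largest model in the sequence whose newly added term (wt) is significant at the 5% level (p = 0.01275), and none of the terms added in Models 7 to 10 is significant. Each p-value in the table indicates whether the regressor added at that step is necessary, given the variables already included. Below we look at Model 6 more closely and interpret its coefficients.

Coefficients:

summary(fit6)$coef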
The first coefficient shows that every 1000 lb increase in weight leads to a decline in mpg of 3.27, holding all the other variables constant. The second coefficient indicates an increase in mpg of 0.485 for every unit increase in rear axle ratio, holding all the other variables constant. The third coefficient indicates a decrease in mpg of 0.02 for every unit increase in horsepower, holding all the other variables constant. The fourth coefficient indicates an increase in mpg of 0.01 for every cubic inch increase in displacement, holding all the other variables constant. The fifth coefficient indicates a decrease in mpg of 1.033 for every additional cylinder, holding all the other variables constant. The sixth and seventh coefficients are the separate intercepts for automatic and manual transmissions (the model has no common intercept); their difference, 37.4 - 36.0 = 1.4, is the decline in mpg when comparing an automatic with a manual transmission, holding all the other variables constant.

A residual plot for this model is displayed below:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit6)            # residual and diagnostic plots for Model 6
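Because the formula for Model 6 ends in - 1, the two transmission coefficients are group intercepts rather than a direct contrast. The sketch below, assuming the built-in mtcars data set, refits the same model with a common intercept so that the transmission difference appears as a single coefficient with its own standard error and p-value. The names fit6_int, diff_am and am_row are illustrative.

```r
# Assumption: the data are R's built-in mtcars data set
data(mtcars)

# Model 6 as specified in the report (no common intercept)
fit6 <- lm(mpg ~ wt + drat + hp + disp + cyl + factor(am) - 1, data = mtcars)
b <- coef(fit6)

# Difference between the manual (am = 1) and automatic (am = 0) intercepts
diff_am <- unname(b["factor(am)1"] - b["factor(am)0"])
print(diff_am)

# Same model with a common intercept: factor(am)1 is now the adjusted
# manual-minus-automatic difference, reported with its standard error
fit6_int <- lm(mpg ~ wt + drat + hp + disp + cyl + factor(am), data = mtcars)
am_row <- summary(fit6_int)$coef["factor(am)1", ]
print(am_row)
```

The two parameterizations give identical fitted values; the second only makes the uncertainty of the adjusted transmission difference directly readable from the coefficient table.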
The residual plots above suggest that the model fits the data well. In addition, the confidence intervals for our slope coefficients are given below.
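Confidence intervals for the coefficients of Model 6 can be obtained with confint. The sketch below assumes the built-in mtcars data set and a 95% level, which the report does not state explicitly; it also recomputes the intervals by hand as estimate plus or minus a t quantile times the standard error. The names ci and manual are illustrative.

```r
# Assumption: the data are R's built-in mtcars data set, 95% level
data(mtcars)
fit6 <- lm(mpg ~ wt + drat + hp + disp + cyl + factor(am) - 1, data = mtcars)

# Built-in t-based confidence intervals for every coefficient
ci <- confint(fit6, level = 0.95)
print(ci)

# The same intervals by hand: estimate +/- t quantile * standard error
est <- summary(fit6)$coef[, "Estimate"]
se <- summary(fit6)$coef[, "Std. Error"]
tq <- qt(0.975, df = df.residual(fit6))
manual <- cbind(est - tq * se, est + tq * se)
print(manual)
```

An interval that contains 0 indicates that the corresponding coefficient is not distinguishable from 0 at the chosen level, given the other variables in the model.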
As indicated by the exploratory graphs, the selected linear model, the residual plots and the confidence intervals, we can generally deduce that manual transmissions offer better miles per gallon (mpg) than automatic transmissions, with an adjusted difference of about 1.4 mpg in the selected model once the other known variables are taken into account.