Regression Models Project 1
Hui-yu Yang
July 20, 2015

Executive Summary

This report analyzes the effect of transmission type (automatic or manual) on miles per gallon (MPG). The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Methods of analysis include exploratory analysis, unadjusted and adjusted regression models, and model diagnostics. All of the exploratory plots can be found in the appendix. The analysis shows that two candidate models perform very similarly, and the more parsimonious one was selected as the final model. The final model explains about 83% of the variability of MPG around its mean after adjusting for the number of covariates in the model. The report shows that manual transmission is better for MPG: on average it yields 2.94 more miles per gallon than automatic transmission, adjusting for weight and 1/4 mile time.

Exploratory Data Analysis

All of the variables are stored as continuous (numeric) variables in the dataset, so we should convert the categorical variables cyl, vs, am, gear, and carb into factors for the analysis. The histograms of the continuous variables did not show any skewness (Figure 1).

The boxplots of the categorical variables exhibit some interesting trends (Figure 2). Marginally, manual transmission is associated with higher MPG. MPG by number of cylinders follows the same trend for both transmission types, but MPG is more variable for manual transmission. The number of forward gears can only be compared at 4 forward gears, because only automatic cars have 3 forward gears and only manual cars have 5. Among cars with 4 forward gears, manual transmission is associated with higher MPG. MPG by number of carburetors follows a similar trend for both transmission types, but MPG for manual transmission is higher for all carburetor counts.
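The factor conversion and the marginal comparisons described above can be sketched in a few lines of base R. This is only an illustrative sketch, not the code used for the report: the object names `cars`, `cat_vars`, `mpg_by_am`, and `gear_by_am` and the labels "Automatic" and "Manual" are assumptions introduced here.

```r
# mtcars ships with base R (datasets package), so nothing needs downloading.
data(mtcars)

# Work on a copy and convert the five categorical columns to factors.
cars <- mtcars
cat_vars <- c("cyl", "vs", "am", "gear", "carb")
cars[cat_vars] <- lapply(cars[cat_vars], factor)

# Readable labels for transmission (am: 0 = automatic, 1 = manual).
levels(cars$am) <- c("Automatic", "Manual")

# Marginal (unadjusted) mean MPG for each transmission type.
mpg_by_am <- tapply(cars$mpg, cars$am, mean)
print(round(mpg_by_am, 2))

# Cross-tabulate transmission by number of forward gears to see
# which gear levels are shared by both transmission types.
gear_by_am <- table(cars$am, cars$gear)
print(gear_by_am)
```

A call such as `boxplot(mpg ~ am, data = cars)` would then draw the marginal boxplot; the cross-tabulation makes it easy to see why only the 4-gear cars can be compared across transmissions.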
MPG by V/S follows the same trend for both transmission types.

Let's further investigate the influence of other variables on the relationship between MPG and transmission. Based on the scatterplots, we observed a trend for number of cylinders, displacement, gross horsepower, and V/S (Figure 3).

Now let's examine the correlations among the variables to avoid multicollinearity. Based on the correlation plot, weight, displacement, and number of cylinders are all negatively correlated with transmission (-0.69, -0.59, and -0.52, respectively; Figure 4). Looking further within those three variables, displacement is highly correlated with number of cylinders (0.9) and weight (0.89). For further exploration, variance inflation factors were calculated for the dataset. An automated stepwise procedure was used to exclude highly collinear variables, and it excluded displacement and number of cylinders. We should keep that in mind in the modeling phase.

Model Selection

Based on our exploratory analysis, we know there are potential confounders that we need to account for. Therefore, an unadjusted model cannot answer our question of interest. The final model was built by forward selection: starting from the unadjusted model, a candidate variable was added and the two nested models were compared with a likelihood ratio test, using the analysis of variance (ANOVA) output in R. If the model fit improved, we kept the new variable and added another variable on top of it. To validate the model selection, backward selection was applied to the full model containing all of the variables. These two methods give the same final model, which is

$$Y = \beta_0 + \beta_1 \times \text{Manual} + \beta_2 \times (\text{Number of cylinders}) + \beta_3 \times \text{Weight} + \beta_4 \times (\text{Gross horsepower}).$$

However, we should remember from our exploratory analysis that displacement and number of cylinders are highly correlated. Therefore, let's repeat the model selection with each of them removed in turn. Without the displacement variable, the resulting model stays the same. If we remove the number of cylinders before model selection, the resulting model contains 1/4 mile time and weight instead. We should therefore compare these two resulting models. They differ very little in terms of adjusted R-squared (0.8401 vs 0.8336). However, we usually want a parsimonious model, and the variance inflation factors warned us to be careful with displacement and number of cylinders, so I will choose the second model as our final model.
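The comparison of the two candidate models can be sketched as follows. This is a hedged illustration under stated assumptions rather than the report's own code: the object names (`fit_cyl`, `fit_qsec`, `fit_unadj`, `fit_wt`) are invented here, and treating the number of cylinders as a factor in the first model is an assumption based on the factor conversion described in the exploratory analysis.

```r
data(mtcars)

# Candidate 1: transmission + cylinders + weight + gross horsepower.
# Assumption: cyl enters as a factor, as in the exploratory analysis.
fit_cyl <- lm(mpg ~ factor(am) + factor(cyl) + wt + hp, data = mtcars)

# Candidate 2 (the chosen model): transmission + weight + 1/4 mile time.
fit_qsec <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Adjusted R-squared of both candidates, side by side.
adj_r2 <- c(
  cylinders_model = summary(fit_cyl)$adj.r.squared,
  qsec_model      = summary(fit_qsec)$adj.r.squared
)
print(round(adj_r2, 4))

# Number of estimated coefficients, as a rough measure of parsimony.
n_coef <- c(
  cylinders_model = length(coef(fit_cyl)),
  qsec_model      = length(coef(fit_qsec))
)
print(n_coef)

# One forward-selection step: compare the unadjusted model with the
# same model plus weight, using R's ANOVA table for nested models.
fit_unadj <- lm(mpg ~ factor(am), data = mtcars)
fit_wt    <- update(fit_unadj, . ~ . + wt)
print(anova(fit_unadj, fit_wt))
```

The two candidates are not nested in each other, which is why the sketch compares them by adjusted R-squared and coefficient count rather than with `anova()`.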

    Call:
    lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.4811 -1.5555 -0.7257  1.4110  4.6610

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   9.6178     6.9596   1.382 0.177915
    factor(am)1   2.9358     1.4109   2.081 0.046716 *
    wt           -3.9165     0.7112  -5.507 6.95e-06 ***
    qsec          1.2259     0.2887   4.247 0.000216 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.459 on 28 degrees of freedom
    Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336
    F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The equation with parameter estimates is

$$Y = 9.62 + 2.94 \times \text{Manual} - 3.92 \times \text{Weight} + 1.23 \times (\text{1/4 mile time}).$$

Model Diagnostics

The final model explains about 83% of the variability of MPG around its mean after adjusting for the number of covariates in the model. In the Residuals vs Fitted plot, the points are randomly distributed around the horizontal line where the residual is 0 (Figure 5). Therefore, we can say the assumption of homoskedasticity holds. However, the red fitted line has been pulled by three potential outliers, and these potential outliers keep appearing in the other diagnostic plots as well. Overall, the assumptions seem to be met, but exploring the effect of those potential outliers would be an interesting future step.

Results

Based on the selected model, MPG for manual transmission is on average 2.94 miles per gallon higher than for automatic transmission, holding weight and 1/4 mile time constant (p = 0.047). Therefore, a manual transmission is better for MPG.
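As a rough way to reproduce the estimate and the diagnostics discussed above, the sketch below refits the final model, extracts the transmission coefficient with a 95% confidence interval, and ranks observations by Cook's distance. The object names (`final_fit`, `am_est`, `am_ci`, `cooks`) are assumptions made for this illustration, and Cook's distance is used here as one possible way to follow up on the potential outliers; the report itself does not state how they would be explored.

```r
data(mtcars)

# Refit the final model reported above.
final_fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Estimated manual-vs-automatic difference in MPG and its 95% CI.
am_est <- coef(final_fit)[["factor(am)1"]]
am_ci  <- confint(final_fit, "factor(am)1", level = 0.95)
print(round(am_est, 2))
print(round(am_ci, 2))

# Rank cars by Cook's distance to see which observations are most
# influential on the fitted coefficients.
cooks <- sort(cooks.distance(final_fit), decreasing = TRUE)
print(round(head(cooks, 3), 3))

# Draw the four standard diagnostic plots when run interactively.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(final_fit)
  par(op)
}
```

Refitting the model with the most influential cars removed and checking whether the transmission coefficient changes noticeably would be one simple sensitivity check.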
Appendix

Fig 1. Histogram of continuous variables
[Figure: five density histograms titled Histogram of Displacement, Histogram of Gross horsepower, Histogram of Rear axle ratio, Histogram of Weight, and Histogram of 1/4 Mile time. X-axes: Displacement (cu.in.), Gross horsepower, Rear axle ratio, Weight (lb/1000), 1/4 Mile time. Y-axis: Density.]

Fig 2. Boxplot of categorical variables
[Figure: five boxplots of mpg (y-axis) against factor(am) (x-axis; 0 = automatic, 1 = manual), titled Boxplot of MPG by transmission, Boxplot of MPG by transmission (color by cyl), Boxplot of MPG by transmission (color by gear), Boxplot of MPG by transmission (color by carb), and Boxplot of MPG by transmission (color by V/S).]

Fig 3. Scatterplot of MPG by transmission (colored by variables of interest)
[Figure: nine scatterplots of mpg (y-axis) against Transmission (x-axis; 0 or 1), colored respectively by disp, factor(cyl), hp, drat, wt, qsec, factor(vs), factor(gear), and factor(carb).]

Fig 4. Correlation plot of all variables
[Figure: lower-triangular correlation plot with a color scale from -1 to 1. The plotted correlations are listed below.]

            gear     am   drat    mpg     vs   qsec     wt   disp    cyl     hp
    am      0.79
    drat    0.70   0.71
    mpg     0.48   0.60   0.68
    vs      0.21   0.17   0.44   0.66
    qsec   -0.21  -0.23   0.09   0.42   0.74
    wt     -0.58  -0.69  -0.71  -0.87  -0.55  -0.17
    disp   -0.56  -0.59  -0.71  -0.85  -0.71  -0.43   0.89
    cyl    -0.49  -0.52  -0.70  -0.85  -0.81  -0.59   0.78   0.90
    hp     -0.13  -0.24  -0.45  -0.78  -0.72  -0.71   0.66   0.79   0.83
    carb    0.27   0.06  -0.09  -0.55  -0.57  -0.66   0.43   0.39   0.53   0.75
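
The correlations in Fig 4, and the variance inflation factors mentioned in the exploratory analysis, can be approximated with base R alone. This is a sketch under stated assumptions, not the automated stepwise procedure used in the report: it keeps every variable numeric, and it computes each VIF directly from its definition, 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The names `corr`, `predictors`, and `vif` are introduced here for illustration.

```r
data(mtcars)

# Pairwise Pearson correlations, rounded to two decimals as in Fig 4.
corr <- round(cor(mtcars), 2)
print(corr[c("disp", "cyl", "wt"), c("disp", "cyl", "wt", "am")])

# Variance inflation factor for each predictor of mpg:
# regress the predictor on all other predictors, then 1 / (1 - R^2).
predictors <- setdiff(names(mtcars), "mpg")
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  fit <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})
print(round(sort(vif, decreasing = TRUE), 2))
```

Predictors with the largest values are the ones most nearly explained by the remaining predictors, which is the sense in which displacement and number of cylinders were flagged.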

Fig 5. Diagnostic plots for the final model
[Figure: four diagnostic panels titled Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Axes: Fitted values, Theoretical Quantiles, Standardized residuals. Labeled points include Chrysler Imperial, Fiat 128, Toyota Corolla, and Merc 230.]