Regression Models Project 1

Hui-yu Yang
July 20, 2015

Executive Summary

This report analyzes the effect of transmission type (automatic or manual) on miles per gallon (MPG). The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. The methods of analysis include exploratory analysis, unadjusted and adjusted regression models, and model diagnostics. All of the exploratory plots can be found in the appendix. The analysis shows that two candidate models fit almost equally well, and the more parsimonious one was selected as the final model. The final model has an adjusted R-squared of about 0.83, that is, it explains about 83% of the variability of MPG around its mean after adjusting for the number of covariates in the model. The report shows that manual transmission is better for MPG: on average, MPG is 2.94 miles per gallon higher than for automatic transmission, adjusting for weight and 1/4 mile time.

Exploratory Data Analysis

All of the variables are stored as numeric in the dataset, so we should convert the categorical variables cyl, vs, am, gear and carb into factors during our analysis. The histograms of the continuous variables did not show any skewness (Figure 1). The boxplots of the categorical variables exhibit some interesting trends (Figure 2). Marginally, manual transmission is associated with higher MPG. MPG follows the same trend across the number of cylinders for both transmission types, but MPG is more variable for manual transmission. The number of forward gears can only be compared at 4 forward gears, because only automatic cars have 3 forward gears and only manual cars have 5. Comparing cars with 4 forward gears across both transmissions, manual transmission is associated with higher MPG. MPG follows a similar trend across the number of carburetors for both transmission types, but MPG for manual transmission is higher for all carburetor counts.
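The factor conversion and the marginal comparison described above could be sketched roughly as follows. This assumes the built-in mtcars data frame that ships with R; the names cars, cat_vars, mpg_by_am and gear_by_am, and the level labels, are illustrative and do not come from the original report.

```r
# Work on a copy so the built-in mtcars data frame is left untouched.
# (The name `cars` is an illustrative choice, not from the report.)
cars <- mtcars

# Convert the categorical variables to factors.
# In mtcars, am is coded 0 = automatic, 1 = manual.
cat_vars <- c("cyl", "vs", "am", "gear", "carb")
cars[cat_vars] <- lapply(cars[cat_vars], factor)
levels(cars$am) <- c("Automatic", "Manual")

# Marginal (unadjusted) mean MPG by transmission type.
mpg_by_am <- tapply(cars$mpg, cars$am, mean)
print(round(mpg_by_am, 2))

# Cross-tabulate forward gears by transmission to see which gear counts
# occur for both transmission types.
gear_by_am <- table(gear = cars$gear, transmission = cars$am)
print(gear_by_am)
```

The cross-tabulation is a quick way to check which gear counts can be compared across transmissions before reading the boxplots.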
MPG for V/S follows the same trend between the two transmission types.

Let's further investigate the influence of other variables on the relationship between MPG and transmission. Based on the scatterplots, we observe a trend for number of cylinders, displacement, gross horsepower, and V/S (Figure 3). Now let's examine the correlations between the variables to guard against multicollinearity. Based on the correlation plot, weight, displacement, and number of cylinders are negatively correlated with transmission (−0.69, −0.59, and −0.52, respectively; Figure 4). Among those three variables, displacement is highly correlated with number of cylinders (0.90) and weight (0.89). For further exploration, variance inflation factors were calculated for the dataset. An automated stepwise procedure was used to exclude the highly correlated variables, and it excluded displacement and number of cylinders. We should keep that in mind in the modeling phase.

Model Selection

Based on our exploratory analysis, we know there are potential confounders that we need to account for. Therefore, an unadjusted model cannot answer our question of interest. The final model was selected using likelihood ratio tests between nested models, starting with the unadjusted model versus the same model with one candidate covariate added, based on the analysis of variance (ANOVA) output in R. If the model fit is better, we keep the new variable and add another variable on top of that. To validate the model selection, backward selection was applied to the full model containing all of the variables. The two methods give the same final model:

Y = β0 + β1 × Manual + β2 × (Number of cylinders) + β3 × Weight + β4 × (Gross horsepower)

However, recall from the exploratory analysis that displacement and number of cylinders are highly correlated. Therefore, let's repeat the model selection with each of them removed in turn. Without the displacement variable, the resulting model stays the same. If we remove the number of cylinders before model selection, the resulting model contains 1/4 mile time and weight (in addition to transmission) instead. Comparing these two resulting models, they differ little in adjusted R-squared (0.8401 vs 0.8336). However, we usually prefer a parsimonious model, and the variance inflation factors warned us to be careful with displacement and number of cylinders, so I choose the second model as our final model.
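A minimal sketch of this selection workflow follows, under several assumptions: it uses the built-in mtcars data, starts the forward comparison with weight as the first candidate covariate (the report does not say which variable was tried first), and runs backward selection with step(), which selects by AIC (the report does not state its criterion). All object names are illustrative.

```r
# Assumption: built-in mtcars data, categorical variables converted to factors.
cars <- mtcars
cat_vars <- c("cyl", "vs", "am", "gear", "carb")
cars[cat_vars] <- lapply(cars[cat_vars], factor)

# Unadjusted model, then the same model with one candidate covariate added.
fit_unadj <- lm(mpg ~ am, data = cars)
fit_wt    <- lm(mpg ~ am + wt, data = cars)

# Nested-model comparison; for lm fits, anova() reports an F test by default.
nested_cmp <- anova(fit_unadj, fit_wt)
print(nested_cmp)

# Backward selection from the full model. step() uses AIC, which is an
# assumption here and may differ from the procedure used in the report.
fit_back <- step(lm(mpg ~ ., data = cars), direction = "backward", trace = 0)
print(formula(fit_back))

# The two candidate final models discussed in the text.
fit_cyl  <- lm(mpg ~ am + cyl + wt + hp, data = cars)
fit_qsec <- lm(mpg ~ am + wt + qsec, data = cars)

adj_r2 <- c(cyl_model  = summary(fit_cyl)$adj.r.squared,
            qsec_model = summary(fit_qsec)$adj.r.squared)
print(round(adj_r2, 4))
```

Comparing the two adjusted R-squared values side by side makes the trade-off explicit: the second model gives up very little fit while using fewer coefficients.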

    ## 
    ## Call:
    ## lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   9.6178     6.9596   1.382 0.177915    
    ## factor(am)1   2.9358     1.4109   2.081 0.046716 *  
    ## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
    ## qsec          1.2259     0.2887   4.247 0.000216 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.459 on 28 degrees of freedom
    ## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
    ## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
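The adjusted R-squared in this output can be recovered from the multiple R-squared with the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below plugs in the printed values and cross-checks against a fresh fit on the built-in mtcars data; the variable names are illustrative.

```r
# Values read from the summary output of the final model.
r2 <- 0.8497   # Multiple R-squared
n  <- 32       # number of cars in mtcars
p  <- 3        # predictors: factor(am)1, wt, qsec (28 residual df = n - p - 1)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))

# Cross-check against a fresh fit of the same formula.
fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)
adj_r2_fit <- summary(fit)$adj.r.squared
print(round(adj_r2_fit, 4))
```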

The equation with parameter estimates is

Y = 9.62 + 2.94 × Manual − 3.92 × Weight + 1.23 × (1/4 Mile Time),

where Manual is 1 for a manual transmission and 0 for an automatic one.

Model Diagnostics

The final model explains about 83% of the variability of MPG around its mean while adjusting for the number of covariates in the model. Based on the Residuals vs Fitted plot, the points are randomly distributed around the horizontal line where the residual is 0 (Figure 5). Therefore, the assumption of homoskedasticity appears to hold. However, the red smoothed line has been pulled by three potential outliers (Chrysler Imperial, Fiat 128, and Toyota Corolla). These potential outliers keep appearing in the other diagnostic plots as well. Overall, the assumptions seem to be met, but exploring the effect of those potential outliers would be an interesting future step.

Results

Based on the selected model, MPG for manual transmission is on average 2.94 miles per gallon higher than for automatic transmission, holding weight and 1/4 mile time constant (p = 0.047). Therefore, a manual transmission is better for MPG.
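As a hedged sketch of the follow-up suggested in the diagnostics, the code below refits the final model on the built-in mtcars data, computes a confidence interval for the transmission effect, redraws the four standard diagnostic plots, and refits without the three cars flagged in the plots. Dropping those cars is only an illustrative sensitivity check and was not part of the original analysis; all object names are assumptions.

```r
# Final model from the report, fitted on the built-in mtcars data.
fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# 95% confidence interval for the manual-vs-automatic difference in MPG.
ci_am <- confint(fit)["factor(am)1", ]
print(round(ci_am, 2))

# The four standard lm diagnostic plots, arranged in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Rank cars by Cook's distance to see which points influence the fit most.
cooks <- sort(cooks.distance(fit), decreasing = TRUE)
print(round(head(cooks, 3), 3))

# Illustrative sensitivity check (not in the report): refit without the
# three cars labeled in the diagnostic plots and compare coefficients.
flagged  <- c("Chrysler Imperial", "Fiat 128", "Toyota Corolla")
keep     <- !(rownames(mtcars) %in% flagged)
fit_drop <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars[keep, ])
print(round(cbind(all_cars = coef(fit), without_flagged = coef(fit_drop)), 2))
```

If the transmission coefficient changes substantially between the two fits, the conclusion depends on those few cars and would deserve a more cautious statement.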

Appendix

Fig 1. Histogram of continuous variables
[Figure not reproduced. Panels: Histogram of Displacement (x-axis: Displacement (cu.in.)); Histogram of Gross horsepower (x-axis: Gross horsepower); Histogram of Rear axle ratio (x-axis: Rear axle ratio); Histogram of Weight (x-axis: Weight (lb/1000)); Histogram of 1/4 Mile time (x-axis: 1/4 Mile time). The y-axis of each panel is Density.]

Fig 2. Boxplot of categorical variables
[Figure not reproduced. Panels: Boxplot of MPG by transmission; Boxplot of MPG by transmission (color by cyl); Boxplot of MPG by transmission (color by gear); Boxplot of MPG by transmission (color by carb); Boxplot of MPG by transmission (color by V/S). In each panel the x-axis is factor(am) (0, 1) and the y-axis is mpg. Legends: factor(cyl) 4, 6, 8; factor(gear) 3, 4, 5; factor(carb) 1, 2, 3, 4, 6, 8; factor(vs) 0, 1.]

Fig 3. Scatterplot of MPG by transmission (colored by variables of interest)
[Figure not reproduced. Nine panels, each plotting mpg (y-axis) against Transmission (x-axis: 0, 1), colored by one of: factor(cyl), disp, hp, drat, wt, qsec, factor(vs), factor(gear), factor(carb).]

Fig 4. Correlation Plot of all variables
[Figure not reproduced. Pairwise correlations shown in the plot (lower triangle; color scale from −1 to 1):]

       gear     am   drat    mpg     vs   qsec     wt   disp    cyl     hp
am     0.79
drat   0.70   0.71
mpg    0.48   0.60   0.68
vs     0.21   0.17   0.44   0.66
qsec  −0.21  −0.23   0.09   0.42   0.74
wt    −0.58  −0.69  −0.71  −0.87  −0.55  −0.17
disp  −0.56  −0.59  −0.71  −0.85  −0.71  −0.43   0.89
cyl   −0.49  −0.52  −0.70  −0.85  −0.81  −0.59   0.78   0.90
hp    −0.13  −0.24  −0.45  −0.78  −0.72  −0.71   0.66   0.79   0.83
carb   0.27   0.06  −0.09  −0.55  −0.57  −0.66   0.43   0.39   0.53   0.75

Fig 5. Diagnostic Plots for the final model
[Figure not reproduced. Panels: Residuals vs Fitted (x-axis: Fitted values); Normal Q-Q (x-axis: Theoretical Quantiles; y-axis: Standardized residuals); Scale-Location (x-axis: Fitted values); Residuals vs Leverage (y-axis: Standardized residuals, with Cook's distance contours). Labeled points: Chrysler Imperial, Fiat 128, and Toyota Corolla; Merc 230 is also labeled in the Residuals vs Leverage panel.]