Business Analytics Multivariate Linear Regression (Using Ms-Excel & “R”)
© Pristine
© Pristine – www.edupristine.com
Multivariate Linear Regression- Fixing Heteroskedasticity Univariate scenario: • Find the “Standard Deviation” of “response” variable for the different levels of “independent” variable • Divide the independent values of the response variable by the respective “standard deviation” • The scaled values become the new “response variable”
• E.g. if variable is “Fuel Type” – If Fuel Type is “D”, divide “Capped Losses” by SquareRoot(33862) = 184 – If Fuel Type is “P”, divide “Capped Losses” by SquareRoot(16400) = 128
Multivariate scenario: • Create all possible unique combinations of independent variables • For each of the combinations, find “Standard Deviations” • Divide the independent values of the response variable by the respective “standard deviation” • Too cumbersome to do manually using MS Excel. Also the process is iterative. • More convenient to do using Statistical packages like R.
Course approach • First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables • Then do manual adjustment and re-fit regression using MS Excel. This will be just for demonstration. As manual adjustment is always questionable. • Demonstrate linear regression using R © Pristine
1
Linear Regression- Preparing MS Excel 1
2
3
4
5
© Pristine
2
Linear Regression- Using MS Excel (Demo.) 1
3
4
2
5
© Pristine
3
Multivariate Linear Regression- Variable Selection Variable selection to be done on the basis of • Multicollinearity (correlation between independent variables) • Banding of variables e.g. whether to use “Age” or “Age Band” (also called custom bands) • Statistical significance of variables tested after performing above two steps
List of independent variables: 1.
Age
2.
Age Band
3.
Years of Driving Experience
4.
Number of Vehicles
5.
Gender
6.
Married
7.
Vehicle Age
8.
Vehicle Age Band
9.
Fuel Type
© Pristine
4
Multivariate Linear Regression- Variable Selection (Multicollinearity) “Age” and “Years of Driving Experience” are highly correlated (Correlation Coefficient = 0.9972). We can use either of the variables in regression Q: Which one to use and which one to reject? Sol: Fit two separate models using either of the variable one at a time. Check for goodness of fit (R 2 in this case). The variable producing higher R2 gets accepted.
Regression Statistics (Age) Multiple R 0.475766 R Square 0.226354 Adjusted R Square 0.226303 Standard Error 201.2306
Multiple R
0.475273
R Square
0.225885
Adjusted R Square
0.225834
Observations
Standard Error
201.2916
15290
Regression Statistics(Yrs Driving Experience)
Observations
15290
R2 for Age > R2 for Years of Driving Experience Reject Years of Driving Experience © Pristine
5
Multivariate Linear Regression- Custom Bands Investigate whether to use “Age” or “Age band” Fit regression independently using “Age” and “Age Band” Before fitting regression, “Age Band” needs to be converted to numerical form from categorical. Replace “Age Band” values with “Average Age” for the particular band. Age Band
Sum of Age
# Policies
Average Age
16-25
93,770.0
4,563.0
20.6
26-59
270,793.0
6,384.0
42.4
60+
282,636.0
4,343.0
65.1
Regressions results using “Age” and “Average Age” Regression Statistics (Age) Multiple R 0.475766 R Square 0.226354 Adjusted R Square 0.226303 Standard Error 201.2306 Observations 15290
Regression Statistics (Average Age) Multiple R 0.509969 R Square 0.260068 Adjusted R Square 0.26002 Standard Error 196.7971 Observations 15290
R2 for Average Age > R2 for Age Select “Average Age” © Pristine
6
Multivariate Linear Regression- Custom Bands Investigate whether to use “Vehicle Age” or “Vehicle Age band” Fit regression independently using “Vehicle Age” and “Vehicle Age Band” Before fitting regression, “Vehicle Age Band” needs to be converted to numerical form from categorical. Replace “Vehicle Age Band” values with “Vehicle Average Age” for the particular band. Vehicle Age Band 0-5 6-10 11+
Sum of Vehicle Age 9,229 44,298 78,819
# Policies Average Vehicle Age 3,688 2.50 5,523 8.02 6,079 12.97
Regressions results using “Vehicle Age” and “Average Vehicle Age” Regression Statistics (Vehicle Age) Multiple R 0.289431325 R Square 0.083770492 Adjusted R Square 0.083710561 Standard Error 218.9903277 Observations 15290
Regression Statistics (Average Vehicle Age) Multiple R 0.303099405 R Square 0.09186925 Adjusted R Square 0.091809848 Standard Error 218.0203272 Observations 15290
R2 for Average Vehicle Age > R2 for Vehicle Age Select “Average Vehicle Age” © Pristine
7
Multivariate Linear Regression- Variable Selection List of shortlisted variables:
1.
Age Band in the form of “Average Age” of the band (selected out of “Age” and “Age Band”). Also got selected over “Years of Driving Experience”.
2.
Number of Vehicles
3.
Gender
4.
Married
5.
Vehicle Age Band in the form of “Average Vehicle Age” of the band (selected out of “Vehicle Age” and “Vehicle Age Band”).
6.
Fuel Type
We will run regression in “multivariate” fashion and then select final list of variables by taking into consideration “statistical significance”.
© Pristine
8
Multivariate Linear Regression- Categorical variable conversion Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)
1.
Gender (F = 0 and M = 1)
2.
Married (Married = 0 and Single = 1)
3.
Fuel Type (P = 0, D = 1)
Snapshot of the final data on which we will run the multivariate regression
© Pristine
9
Multivariate Linear Regression- Output SUMMARY OUTPUT Regression Statistics Multiple R
0.865972274
R Square
0.749907979
Adjusted R Square
0.749809794
Standard Error
114.4310136
Observations
15290
ANOVA df
SS
Regression
6
Residual
15283
Total
15289 Coefficients
Intercept
MS
F
Significance F
600073213.5 100012202.3 7637.751088
0
200122584.4 13094.45688
800195798 Standard Error
t Stat
P-value
Lower 95%
Upper 95%
Lower 95.0%
Upper 95.0%
624.56529
5.29192
118.02233
0.00000
614.19249
634.93809
614.19249
634.93809
-5.55974
0.06546
-84.93889
0.00000
-5.68804
-5.43144
-5.68804
-5.43144
0.17875
0.97039
0.18420
0.85386
-1.72333
2.08082
-1.72333
2.08082
Gender Dummy
50.88326
1.89081
26.91084
0.00000
47.17705
54.58947
47.17705
54.58947
Married Dummy
78.39837
1.92148
40.80106
0.00000
74.63204
82.16469
74.63204
82.16469
Avg Vehicle Age
-15.14220
0.26734
-56.63987
0.00000
-15.66623
-14.61818
-15.66623
-14.61818
Fuel Type Dummy
267.93559
2.74845
97.48614
0.00000
262.54830
273.32287
262.54830
273.32287
Avg Age Number of Vehicles
© Pristine
10
Multivariate Linear Regression- Output 1
Independent Vars
Coefficients(b)
t Stat (b/σ)
P-value (t-dist table)
Lower 95% (b-1.96*σ)
Upper 95% (b+1.96*σ)
Lower 95% (b-1.96*σ)
Upper 95% (b+1.96*σ)
Intercept
a
624.565
5.292
118.022
0.000
614.192
634.938
614.192
634.938
X1
Avg Age
b1
-5.560
0.065
-84.939
0.000
-5.688
-5.431
-5.688
-5.431
X2
Number of Vehicles
b2
0.179
0.970
0.184
0.854
-1.723
2.081
-1.723
2.081
X3
Gender Dummy
b3
50.883
1.891
26.911
0.000
47.177
54.589
47.177
54.589
X4
Married Dummy
b4
78.398
1.921
40.801
0.000
74.632
82.165
74.632
82.165
X5
Avg Vehicle Age
b5
-15.142
0.267
-56.640
0.000
-15.666
-14.618
-15.666
-14.618
X6
Fuel Type Dummy
b6
267.936
2.748
97.486
0.000
262.548
273.323
262.548
273.323
Insignificant
ANOVA
2
3
Standard Error (σ)
df
SS
MS (SS/df)
F (MSReg/MSRes)
Significance F (from F dist table)
7637.75
0
Regression {∑ (ypredictedl- ymean)2}
p
6
600073213.5
100012202.3
Residual {∑(yactual - ypredicted)2}
n-p-1
15283
200122584.4
13094.457
Total {∑(yactual - ymean)2}
n-1
15289
800195798
Regression Statistics
© Pristine
Multiple R
SquareRoot(R Square)
0.8659723
R Square
SS Regression/SS Total
0.7499080
Adjusted R Square
R2 - (1 - R2)*{p/(n-p-1)}
0.7498098
Standard Error
SquareRoot{SS Residual/(n-p-1)}
114.4310136
Observations
n
15290
11
Multivariate Linear Regression- Output (Significance Test) 1
Independent Vars
Coefficients(b)
Standard Error (σ)
t Stat (b/σ)
P-value (t-dist table)
Lower 95% (b-1.96*σ)
Upper 95% (b+1.96*σ)
Lower 95% (b-1.96*σ)
Upper 95% (b+1.96*σ)
Intercept
a
624.565
5.292
118.022
0.000
614.192
634.938
614.192
634.938
X1
Avg Age
b1
-5.560
0.065
-84.939
0.000
-5.688
-5.431
-5.688
-5.431
X2
Number of Vehicles
b2
0.179
0.970
0.184
0.854
-1.723
2.081
-1.723
2.081
X3
Gender Dummy
b3
50.883
1.891
26.911
0.000
47.177
54.589
47.177
54.589
X4
Married Dummy
b4
78.398
1.921
40.801
0.000
74.632
82.165
74.632
82.165
X5
Avg Vehicle Age
b5
-15.142
0.267
-56.640
0.000
-15.666
-14.618
-15.666
-14.618
X6
Fuel Type Dummy
b6
267.936
2.748
97.486
0.000
262.548
273.323
262.548
273.323
Significance test of coefficients based on Normal distribution H0: b is no different that 0 (i.e. 0 is the coefficient when the variable is not included in regression) H1: b is different than 0 Test statistic, Z = (b-0)/σ (at 95% two tailed confidence interval, Z = 1.96) Confidence interval = (b – 1.96 * σ, b + 1.96 * σ) For variable to be significant, the interval must not contain “0”. Example1: Avg Age. Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431) No zero in the interval. Hence significant. Example2: Number of Vehicles Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080) Zero is present in the interval. Hence insignificant. © Pristine
12
Multivariate Linear Regression- Output (Significance Test) 1
Independent Vars
Coefficients(b)
Standard Error (σ)
t Stat (b/σ)
P-value (t-dist table)
Lower 95% (b-1.96*σ)
Upper 95% (b+1.96*σ)
Lower 95% (b-1.96*σ)
Upper 95% (b+1.96*σ)
Intercept
a
624.565
5.292
118.022
0.000
614.192
634.938
614.192
634.938
X1
Avg Age
b1
-5.560
0.065
-84.939
0.000
-5.688
-5.431
-5.688
-5.431
X2
Number of Vehicles
b2
0.179
0.970
0.184
0.854
-1.723
2.081
-1.723
2.081
X3
Gender Dummy
b3
50.883
1.891
26.911
0.000
47.177
54.589
47.177
54.589
X4
Married Dummy
b4
78.398
1.921
40.801
0.000
74.632
82.165
74.632
82.165
X5
Avg Vehicle Age
b5
-15.142
0.267
-56.640
0.000
-15.666
-14.618
-15.666
-14.618
X6
Fuel Type Dummy
b6
267.936
2.748
97.486
0.000
262.548
273.323
262.548
273.323
Significance test of coefficients based on t distribution. • b/StdErr(b) ~ tn-2 H0: b is no different that 0 (i.e. 0 is the coefficient when the variable is not included in regression) H1: b is different than 0 At 95% two tailed confidence interval and df greater that 120, t = 1.96) Confidence interval = (b – 1.96 * σ, b + 1.96 * σ) For variable to be significant, the interval must not contain “0”. Example1: Avg Age. Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431) No zero in the interval. Hence significant. Example2: Number of Vehicles Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080) Zero is present in the interval. Hence insignificant. © Pristine
13
Multivariate Linear Regression- Output at 95% Confidence Interval SUMMARY OUTPUT Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations
Excluding "Num Including "Num Vehicles" Vehicles" 0.865971953 0.749907424 0.749825608 114.4273971 15290
0.865972274 0.749907979 0.749809794 114.4310136 15290
Adjusted R-square improved
ANOVA df Regression Residual Total
5
SS
MS
600072769.2 120014553.8
F
Significance F
9165.874
0
15284 15289
200123028.7 800195798
13093.6292
Coefficients
Standard Error
t Stat
P-value
Lower 95%
Upper 95%
Intercept Avg Age Gender Dummy Married Dummy Avg Vehicle Age
625.005 -5.560 50.883 78.402 -15.142
4.723 0.065 1.891 1.921 0.267
132.333 -84.942 26.912 40.806 -56.641
0.00 0.00 0.00 0.00 0.00
615.7474 -5.6879 47.1768 74.6356 -15.6660
634.2625 -5.4314 54.5890 82.1677 -14.6180
615.7474 -5.6879 47.1768 74.6356 -15.6660
634.2625 -5.4314 54.5890 82.1677 -14.6180
Fuel Type Dummy
267.935
2.748
97.489
0.00
262.5480
273.3223
262.5480
273.3223
© Pristine
Lower 95.0% Upper 95.0%
14
Multivariate Linear Regression- Regression Equation Predicted Losses = 625.004932715948 – 5.5596551344537 * Avg Age + 50.8828923910091 * Gender Dummy + 78.4016899779131 * Married Dummy -15.1420259903571 * Avg Vehicle Age + 267.935139741526 * Fuel Type Dummy Interpretation: Coefficients Intercept Avg Age Gender Dummy Married Dummy Avg Vehicle Age Fuel Type Dummy
625.005 -5.560 50.883 78.402 -15.142 267.935
Sign of Coefficient
-ve +ve +ve -ve +ve
Inference Higher is the age, lower is the loss Average Loss for Males is higher than Females Average Loss for Single is higher than Married Older is the vehicle, lower are the losses Losses are higher for Fuel type D
Illustration of using the equation given in MS Excel
© Pristine
15
Multivariate Linear Regression- Residual Plot Residual plot: • Residuals calculated as “Actual Capped Losses” – “Predicted Capped Losses” • Residuals should have a uniform distribution else there’s some bias in the model • Except for a few observations (circled in red), residuals are uniformly distributed
Capped Losses- Residual 1200 1000 800
600 400 200 0 0
2000
4000
6000
8000
10000
12000
14000
-200 -400
© Pristine
16
Multivariate Linear Regression- Gains Chart and Gini Gains chart is used to represent the effectiveness of a model prediction which is quantified by means of Gini Coefficient Methodology illustrated using MS Excel Equal Obs Bin 0 1 2 3 4 5 6 7 8 9 10
# Policies
Predicted Loss
0 1528 1529 1529 1529 1529 1529 1529 1529 1529 1530
0 1,167,070 1,046,034 757,330 589,366 531,160 485,428 432,934 385,595 308,050 193,465
Actual Loss 0 1,230,474 991,944 746,854 552,534 553,919 477,284 385,411 423,814 310,846 223,351
Cumulative Actual Loss 0 1,230,474 2,222,418 2,969,272 3,521,806 4,075,725 4,553,009 4,938,420 5,362,234 5,673,081 5,896,432
Random 0 10% 10% 10% 10% 10% 10% 10% 10% 10% 10%
Cumulative % Obs 0 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
% Cumulative Actual Loss 0 20.87% 37.69% 50.36% 59.73% 69.12% 77.22% 83.75% 90.94% 96.21% 100.00%
Actual vs Predicted Losses 3500
2500
Actual Loss
2000
600,000
1500
400,000
1000
200,000
500
-
0 0
© Pristine
2
4
6 8 Bins of Equal # Policies
10
12
# Policies
Losses
3000
Predicted Loss
800,000
0.27177
100% %Cumulative Actual Loss
# Policies
1,000,000
Gini Coeff
Gains Chart
1,400,000 1,200,000
Area Under Gains Curve 0 0.0104 0.0293 0.0440 0.0550 0.0644 0.0732 0.0805 0.0873 0.0936 0.0981
80% 60%
Cumulative % Obs 40% % Cumulative Actual Loss 20% 0% 0
2
4
6
8
10
Bins of Equal # Policies
17
Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.) 1
Create unique combinations of the variables - Avg Age, Gender Dummy, Married Dummy, Avg Vehicle Age and Fuel Type Dummy
Find “Standard Deviation” of capped Losses for the segments. Detailed methodology explained in MS Excel.
2
Calculate “Standardized Capped Losses” as “Capped Losses / Segment Std Dev”. This becomes the new response variable.
3
Manually doing this kind of exercise can be flawed as some the segments could be sparsely populated. This demo. Is just to explain the underlying technique/methodology. Statistical packages like SAS, R have in-built capability to take care of this. © Pristine
18
Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.) SUMMARY OUTPUT Regression Statistics Multiple R 0.359167467 R Square Adjusted R Square Standard Error Observations
0.129001269 0.128716331 4.77078689 15290
Insignificant which is questionable as “D” and “P” have significantly differe mean losses
ANOVA
df Regression Residual Total
Intercept Avg Age Gender Dummy Married Dummy Avg Vehicle Age Fuel Type Dummy © Pristine
5 15284 15289
SS MS F Significance F 51522.10 10304.42 452.73 0 347870.07 22.76 399392.17
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0% 12.476 0.197 63.374 0.000 12.091 12.862 12.091 12.862 -0.086 0.003 -31.554 0.000 -0.091 -0.081 -0.091 -0.081 0.213 0.079 2.702 0.007 0.058 0.368 0.058 0.368 -0.204 0.080 -2.552 0.011 -0.361 -0.047 -0.361 -0.047 -0.376 0.011 -33.770 0.000 -0.398 -0.354 -0.398 -0.354 0.136 0.115 1.188 0.235 -0.088 0.361 -0.088 0.361 19
Multivariate Linear Regression- Using R Step1: Download and install R software from http://www.r-project.org/ Step2: Convert the data to R readable format e.g. *.csv. • D:\Linear Reg using R\Linear_Reg_Sample_Data.csv
Writing R code for • Reading the data • Fitting the Linear Regression
> LinRegData FitLinReg
© Pristine
20
Multivariate Linear Regression- Using R Output
© Pristine
21
Thank you!
Pristine 702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA www.edupristine.com Ph. +91 22 3215 6191
© Pristine
© Pristine – www.edupristine.com