Business Analytics Multivariate Linear Regression (Using Ms-Excel ...

Report 5 Downloads 62 Views
Business Analytics Multivariate Linear Regression (Using Ms-Excel & “R”)

© Pristine

© Pristine – www.edupristine.com

Multivariate Linear Regression- Fixing Heteroskedasticity  Univariate scenario: • Find the “Standard Deviation” of “response” variable for the different levels of “independent” variable • Divide the independent values of the response variable by the respective “standard deviation” • The scaled values become the new “response variable”

• E.g. if variable is “Fuel Type” – If Fuel Type is “D”, divide “Capped Losses” by SquareRoot(33862) = 184 – If Fuel Type is “P”, divide “Capped Losses” by SquareRoot(16400) = 128

 Multivariate scenario: • Create all possible unique combinations of independent variables • For each of the combinations, find “Standard Deviations” • Divide the independent values of the response variable by the respective “standard deviation” • Too cumbersome to do manually using MS Excel. Also the process is iterative. • More convenient to do using Statistical packages like R.

 Course approach • First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables • Then do manual adjustment and re-fit regression using MS Excel. This will be just for demonstration. As manual adjustment is always questionable. • Demonstrate linear regression using R © Pristine

1

Linear Regression- Preparing MS Excel 1

2

3

4

5

© Pristine

2

Linear Regression- Using MS Excel (Demo.) 1

3

4

2

5

© Pristine

3

Multivariate Linear Regression- Variable Selection  Variable selection to be done on the basis of • Multicollinearity (correlation between independent variables) • Banding of variables e.g. whether to use “Age” or “Age Band” (also called custom bands) • Statistical significance of variables tested after performing above two steps

 List of independent variables: 1.

Age

2.

Age Band

3.

Years of Driving Experience

4.

Number of Vehicles

5.

Gender

6.

Married

7.

Vehicle Age

8.

Vehicle Age Band

9.

Fuel Type

© Pristine

4

Multivariate Linear Regression- Variable Selection (Multicollinearity)  “Age” and “Years of Driving Experience” are highly correlated (Correlation Coefficient = 0.9972). We can use either of the variables in regression Q: Which one to use and which one to reject? Sol: Fit two separate models using either of the variable one at a time. Check for goodness of fit (R 2 in this case). The variable producing higher R2 gets accepted.

Regression Statistics (Age) Multiple R 0.475766 R Square 0.226354 Adjusted R Square 0.226303 Standard Error 201.2306

Multiple R

0.475273

R Square

0.225885

Adjusted R Square

0.225834

Observations

Standard Error

201.2916

15290

Regression Statistics(Yrs Driving Experience)

Observations

15290

 R2 for Age > R2 for Years of Driving Experience  Reject Years of Driving Experience © Pristine

5

Multivariate Linear Regression- Custom Bands  Investigate whether to use “Age” or “Age band”  Fit regression independently using “Age” and “Age Band”  Before fitting regression, “Age Band” needs to be converted to numerical form from categorical. Replace “Age Band” values with “Average Age” for the particular band. Age Band

Sum of Age

# Policies

Average Age

16-25

93,770.0

4,563.0

20.6

26-59

270,793.0

6,384.0

42.4

60+

282,636.0

4,343.0

65.1

 Regressions results using “Age” and “Average Age” Regression Statistics (Age) Multiple R 0.475766 R Square 0.226354 Adjusted R Square 0.226303 Standard Error 201.2306 Observations 15290

Regression Statistics (Average Age) Multiple R 0.509969 R Square 0.260068 Adjusted R Square 0.26002 Standard Error 196.7971 Observations 15290

 R2 for Average Age > R2 for Age  Select “Average Age” © Pristine

6

Multivariate Linear Regression- Custom Bands Investigate whether to use “Vehicle Age” or “Vehicle Age band”  Fit regression independently using “Vehicle Age” and “Vehicle Age Band”  Before fitting regression, “Vehicle Age Band” needs to be converted to numerical form from categorical. Replace “Vehicle Age Band” values with “Vehicle Average Age” for the particular band. Vehicle Age Band 0-5 6-10 11+

Sum of Vehicle Age 9,229 44,298 78,819

# Policies Average Vehicle Age 3,688 2.50 5,523 8.02 6,079 12.97

 Regressions results using “Vehicle Age” and “Average Vehicle Age” Regression Statistics (Vehicle Age) Multiple R 0.289431325 R Square 0.083770492 Adjusted R Square 0.083710561 Standard Error 218.9903277 Observations 15290

Regression Statistics (Average Vehicle Age) Multiple R 0.303099405 R Square 0.09186925 Adjusted R Square 0.091809848 Standard Error 218.0203272 Observations 15290

 R2 for Average Vehicle Age > R2 for Vehicle Age  Select “Average Vehicle Age” © Pristine

7

Multivariate Linear Regression- Variable Selection  List of shortlisted variables:



1.

Age Band in the form of “Average Age” of the band (selected out of “Age” and “Age Band”). Also got selected over “Years of Driving Experience”.

2.

Number of Vehicles

3.

Gender

4.

Married

5.

Vehicle Age Band in the form of “Average Vehicle Age” of the band (selected out of “Vehicle Age” and “Vehicle Age Band”).

6.

Fuel Type

We will run regression in “multivariate” fashion and then select final list of variables by taking into consideration “statistical significance”.

© Pristine

8

Multivariate Linear Regression- Categorical variable conversion  Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)



1.

Gender (F = 0 and M = 1)

2.

Married (Married = 0 and Single = 1)

3.

Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression

© Pristine

9

Multivariate Linear Regression- Output SUMMARY OUTPUT Regression Statistics Multiple R

0.865972274

R Square

0.749907979

Adjusted R Square

0.749809794

Standard Error

114.4310136

Observations

15290

ANOVA df

SS

Regression

6

Residual

15283

Total

15289 Coefficients

Intercept

MS

F

Significance F

600073213.5 100012202.3 7637.751088

0

200122584.4 13094.45688

800195798 Standard Error

t Stat

P-value

Lower 95%

Upper 95%

Lower 95.0%

Upper 95.0%

624.56529

5.29192

118.02233

0.00000

614.19249

634.93809

614.19249

634.93809

-5.55974

0.06546

-84.93889

0.00000

-5.68804

-5.43144

-5.68804

-5.43144

0.17875

0.97039

0.18420

0.85386

-1.72333

2.08082

-1.72333

2.08082

Gender Dummy

50.88326

1.89081

26.91084

0.00000

47.17705

54.58947

47.17705

54.58947

Married Dummy

78.39837

1.92148

40.80106

0.00000

74.63204

82.16469

74.63204

82.16469

Avg Vehicle Age

-15.14220

0.26734

-56.63987

0.00000

-15.66623

-14.61818

-15.66623

-14.61818

Fuel Type Dummy

267.93559

2.74845

97.48614

0.00000

262.54830

273.32287

262.54830

273.32287

Avg Age Number of Vehicles

© Pristine

10

Multivariate Linear Regression- Output 1

Independent Vars

Coefficients(b)

t Stat (b/σ)

P-value (t-dist table)

Lower 95% (b-1.96*σ)

Upper 95% (b+1.96*σ)

Lower 95% (b-1.96*σ)

Upper 95% (b+1.96*σ)

Intercept

a

624.565

5.292

118.022

0.000

614.192

634.938

614.192

634.938

X1

Avg Age

b1

-5.560

0.065

-84.939

0.000

-5.688

-5.431

-5.688

-5.431

X2

Number of Vehicles

b2

0.179

0.970

0.184

0.854

-1.723

2.081

-1.723

2.081

X3

Gender Dummy

b3

50.883

1.891

26.911

0.000

47.177

54.589

47.177

54.589

X4

Married Dummy

b4

78.398

1.921

40.801

0.000

74.632

82.165

74.632

82.165

X5

Avg Vehicle Age

b5

-15.142

0.267

-56.640

0.000

-15.666

-14.618

-15.666

-14.618

X6

Fuel Type Dummy

b6

267.936

2.748

97.486

0.000

262.548

273.323

262.548

273.323

Insignificant

ANOVA

2

3

Standard Error (σ)

df

SS

MS (SS/df)

F (MSReg/MSRes)

Significance F (from F dist table)

7637.75

0

Regression {∑ (ypredictedl- ymean)2}

p

6

600073213.5

100012202.3

Residual {∑(yactual - ypredicted)2}

n-p-1

15283

200122584.4

13094.457

Total {∑(yactual - ymean)2}

n-1

15289

800195798

Regression Statistics

© Pristine

Multiple R

SquareRoot(R Square)

0.8659723

R Square

SS Regression/SS Total

0.7499080

Adjusted R Square

R2 - (1 - R2)*{p/(n-p-1)}

0.7498098

Standard Error

SquareRoot{SS Residual/(n-p-1)}

114.4310136

Observations

n

15290

11

Multivariate Linear Regression- Output (Significance Test) 1

Independent Vars

Coefficients(b)

Standard Error (σ)

t Stat (b/σ)

P-value (t-dist table)

Lower 95% (b-1.96*σ)

Upper 95% (b+1.96*σ)

Lower 95% (b-1.96*σ)

Upper 95% (b+1.96*σ)

Intercept

a

624.565

5.292

118.022

0.000

614.192

634.938

614.192

634.938

X1

Avg Age

b1

-5.560

0.065

-84.939

0.000

-5.688

-5.431

-5.688

-5.431

X2

Number of Vehicles

b2

0.179

0.970

0.184

0.854

-1.723

2.081

-1.723

2.081

X3

Gender Dummy

b3

50.883

1.891

26.911

0.000

47.177

54.589

47.177

54.589

X4

Married Dummy

b4

78.398

1.921

40.801

0.000

74.632

82.165

74.632

82.165

X5

Avg Vehicle Age

b5

-15.142

0.267

-56.640

0.000

-15.666

-14.618

-15.666

-14.618

X6

Fuel Type Dummy

b6

267.936

2.748

97.486

0.000

262.548

273.323

262.548

273.323

 Significance test of coefficients based on Normal distribution H0: b is no different that 0 (i.e. 0 is the coefficient when the variable is not included in regression) H1: b is different than 0 Test statistic, Z = (b-0)/σ (at 95% two tailed confidence interval, Z = 1.96) Confidence interval = (b – 1.96 * σ, b + 1.96 * σ) For variable to be significant, the interval must not contain “0”. Example1: Avg Age. Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431) No zero in the interval. Hence significant. Example2: Number of Vehicles Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080) Zero is present in the interval. Hence insignificant. © Pristine

12

Multivariate Linear Regression- Output (Significance Test) 1

Independent Vars

Coefficients(b)

Standard Error (σ)

t Stat (b/σ)

P-value (t-dist table)

Lower 95% (b-1.96*σ)

Upper 95% (b+1.96*σ)

Lower 95% (b-1.96*σ)

Upper 95% (b+1.96*σ)

Intercept

a

624.565

5.292

118.022

0.000

614.192

634.938

614.192

634.938

X1

Avg Age

b1

-5.560

0.065

-84.939

0.000

-5.688

-5.431

-5.688

-5.431

X2

Number of Vehicles

b2

0.179

0.970

0.184

0.854

-1.723

2.081

-1.723

2.081

X3

Gender Dummy

b3

50.883

1.891

26.911

0.000

47.177

54.589

47.177

54.589

X4

Married Dummy

b4

78.398

1.921

40.801

0.000

74.632

82.165

74.632

82.165

X5

Avg Vehicle Age

b5

-15.142

0.267

-56.640

0.000

-15.666

-14.618

-15.666

-14.618

X6

Fuel Type Dummy

b6

267.936

2.748

97.486

0.000

262.548

273.323

262.548

273.323

 Significance test of coefficients based on t distribution. • b/StdErr(b) ~ tn-2 H0: b is no different that 0 (i.e. 0 is the coefficient when the variable is not included in regression) H1: b is different than 0 At 95% two tailed confidence interval and df greater that 120, t = 1.96) Confidence interval = (b – 1.96 * σ, b + 1.96 * σ) For variable to be significant, the interval must not contain “0”. Example1: Avg Age. Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431) No zero in the interval. Hence significant. Example2: Number of Vehicles Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080) Zero is present in the interval. Hence insignificant. © Pristine

13

Multivariate Linear Regression- Output at 95% Confidence Interval SUMMARY OUTPUT Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations

Excluding "Num Including "Num Vehicles" Vehicles" 0.865971953 0.749907424 0.749825608 114.4273971 15290

0.865972274 0.749907979 0.749809794 114.4310136 15290

Adjusted R-square improved

ANOVA df Regression Residual Total

5

SS

MS

600072769.2 120014553.8

F

Significance F

9165.874

0

15284 15289

200123028.7 800195798

13093.6292

Coefficients

Standard Error

t Stat

P-value

Lower 95%

Upper 95%

Intercept Avg Age Gender Dummy Married Dummy Avg Vehicle Age

625.005 -5.560 50.883 78.402 -15.142

4.723 0.065 1.891 1.921 0.267

132.333 -84.942 26.912 40.806 -56.641

0.00 0.00 0.00 0.00 0.00

615.7474 -5.6879 47.1768 74.6356 -15.6660

634.2625 -5.4314 54.5890 82.1677 -14.6180

615.7474 -5.6879 47.1768 74.6356 -15.6660

634.2625 -5.4314 54.5890 82.1677 -14.6180

Fuel Type Dummy

267.935

2.748

97.489

0.00

262.5480

273.3223

262.5480

273.3223

© Pristine

Lower 95.0% Upper 95.0%

14

Multivariate Linear Regression- Regression Equation  Predicted Losses = 625.004932715948 – 5.5596551344537 * Avg Age + 50.8828923910091 * Gender Dummy + 78.4016899779131 * Married Dummy -15.1420259903571 * Avg Vehicle Age + 267.935139741526 * Fuel Type Dummy  Interpretation: Coefficients Intercept Avg Age Gender Dummy Married Dummy Avg Vehicle Age Fuel Type Dummy

625.005 -5.560 50.883 78.402 -15.142 267.935

Sign of Coefficient

-ve +ve +ve -ve +ve

Inference Higher is the age, lower is the loss Average Loss for Males is higher than Females Average Loss for Single is higher than Married Older is the vehicle, lower are the losses Losses are higher for Fuel type D

 Illustration of using the equation given in MS Excel

© Pristine

15

Multivariate Linear Regression- Residual Plot  Residual plot: • Residuals calculated as “Actual Capped Losses” – “Predicted Capped Losses” • Residuals should have a uniform distribution else there’s some bias in the model • Except for a few observations (circled in red), residuals are uniformly distributed

Capped Losses- Residual 1200 1000 800

600 400 200 0 0

2000

4000

6000

8000

10000

12000

14000

-200 -400

© Pristine

16

Multivariate Linear Regression- Gains Chart and Gini  Gains chart is used to represent the effectiveness of a model prediction which is quantified by means of Gini Coefficient  Methodology illustrated using MS Excel Equal Obs Bin 0 1 2 3 4 5 6 7 8 9 10

# Policies

Predicted Loss

0 1528 1529 1529 1529 1529 1529 1529 1529 1529 1530

0 1,167,070 1,046,034 757,330 589,366 531,160 485,428 432,934 385,595 308,050 193,465

Actual Loss 0 1,230,474 991,944 746,854 552,534 553,919 477,284 385,411 423,814 310,846 223,351

Cumulative Actual Loss 0 1,230,474 2,222,418 2,969,272 3,521,806 4,075,725 4,553,009 4,938,420 5,362,234 5,673,081 5,896,432

Random 0 10% 10% 10% 10% 10% 10% 10% 10% 10% 10%

Cumulative % Obs 0 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

% Cumulative Actual Loss 0 20.87% 37.69% 50.36% 59.73% 69.12% 77.22% 83.75% 90.94% 96.21% 100.00%

Actual vs Predicted Losses 3500

2500

Actual Loss

2000

600,000

1500

400,000

1000

200,000

500

-

0 0

© Pristine

2

4

6 8 Bins of Equal # Policies

10

12

# Policies

Losses

3000

Predicted Loss

800,000

0.27177

100% %Cumulative Actual Loss

# Policies

1,000,000

Gini Coeff

Gains Chart

1,400,000 1,200,000

Area Under Gains Curve 0 0.0104 0.0293 0.0440 0.0550 0.0644 0.0732 0.0805 0.0873 0.0936 0.0981

80% 60%

Cumulative % Obs 40% % Cumulative Actual Loss 20% 0% 0

2

4

6

8

10

Bins of Equal # Policies

17

Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.) 1

Create unique combinations of the variables - Avg Age, Gender Dummy, Married Dummy, Avg Vehicle Age and Fuel Type Dummy

Find “Standard Deviation” of capped Losses for the segments. Detailed methodology explained in MS Excel.

2

Calculate “Standardized Capped Losses” as “Capped Losses / Segment Std Dev”. This becomes the new response variable.

3

 Manually doing this kind of exercise can be flawed as some the segments could be sparsely populated.  This demo. Is just to explain the underlying technique/methodology.  Statistical packages like SAS, R have in-built capability to take care of this. © Pristine

18

Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.) SUMMARY OUTPUT Regression Statistics Multiple R 0.359167467 R Square Adjusted R Square Standard Error Observations

0.129001269 0.128716331 4.77078689 15290

Insignificant which is questionable as “D” and “P” have significantly differe mean losses

ANOVA

df Regression Residual Total

Intercept Avg Age Gender Dummy Married Dummy Avg Vehicle Age Fuel Type Dummy © Pristine

5 15284 15289

SS MS F Significance F 51522.10 10304.42 452.73 0 347870.07 22.76 399392.17

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0% 12.476 0.197 63.374 0.000 12.091 12.862 12.091 12.862 -0.086 0.003 -31.554 0.000 -0.091 -0.081 -0.091 -0.081 0.213 0.079 2.702 0.007 0.058 0.368 0.058 0.368 -0.204 0.080 -2.552 0.011 -0.361 -0.047 -0.361 -0.047 -0.376 0.011 -33.770 0.000 -0.398 -0.354 -0.398 -0.354 0.136 0.115 1.188 0.235 -0.088 0.361 -0.088 0.361 19

Multivariate Linear Regression- Using R  Step1: Download and install R software from http://www.r-project.org/  Step2: Convert the data to R readable format e.g. *.csv. • D:\Linear Reg using R\Linear_Reg_Sample_Data.csv

 Writing R code for • Reading the data • Fitting the Linear Regression

> LinRegData FitLinReg

© Pristine

20

Multivariate Linear Regression- Using R  Output

© Pristine

21

Thank you!

Pristine 702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA www.edupristine.com Ph. +91 22 3215 6191

© Pristine

© Pristine – www.edupristine.com

Recommend Documents