Business Analytics Predictive Modeling using Linear Regression
© Pristine
© Pristine – www.edupristine.com
Agenda
• Introduction
• Data
• Basic Statistics
• Predictive modeling using Linear Regression
4. Correlation and Regression
I. Covariance and Correlation coefficient
II. Regression
4a. Correlation
I. Covariance and Correlation coefficient
i. Definition
ii. Sample and population correlation
iii. Illustrative example
iv. Statistical significance test for sample correlation coefficient
4a. Covariance and Correlation Coefficient
Covariance is a statistical measure of the degree to which two variables move together. The sample covariance is calculated as

$$\mathrm{cov}_{xy} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Correlation coefficient
• It is a measure of the strength of the linear relationship between two variables
• The correlation coefficient is given by:

$$\rho_{xy} = \frac{\mathrm{cov}_{xy}}{\sigma_x \sigma_y}$$

• Population correlation is denoted by ρ (rho)
• Sample correlation is denoted by r. It is an estimate of ρ in the same way that
– S² (sample variance) is an estimate of σ² (population variance) and
– X̄ (sample mean) is an estimate of μ (population mean)
• Features of ρ and r
– Unit free and ranges between -1 and 1
– The closer to -1, the stronger the negative linear relationship
– The closer to 1, the stronger the positive linear relationship
– The closer to 0, the weaker the linear relationship
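The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the toy data are hypothetical, and the sample standard deviations are taken as the square root of the sample covariance of a variable with itself (n - 1 divisor). It checks the listed features of r: an exact increasing line gives r close to +1, an exact decreasing line gives r close to -1, and rescaling the units changes the covariance but not r.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance using the n - 1 divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample correlation r = cov_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # sample std dev of x
    s_y = math.sqrt(sample_covariance(ys, ys))  # sample std dev of y
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical toy data (an assumption for illustration, not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2 * v + 1 for v in x]        # exact increasing linear relationship
y_down = [-3 * v for v in x]         # exact decreasing linear relationship
x_rescaled = [100 * v for v in x]    # same x measured in different units

r_up = sample_correlation(x, y_up)
r_down = sample_correlation(x, y_down)
r_rescaled = sample_correlation(x_rescaled, y_up)
cov_up = sample_covariance(x, y_up)
cov_rescaled = sample_covariance(x_rescaled, y_up)

print("r (increasing line):", r_up)
print("r (decreasing line):", r_down)
print("r after rescaling x:", r_rescaled)
print("covariance before and after rescaling:", cov_up, cov_rescaled)
```

Covariance depends on the units of X and Y, which is why the unit-free correlation coefficient is usually the more convenient summary of linear association.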
4a. Example: Covariance and Correlation of the S&P 500 and NASDAQ Returns given a sample

              Closing Index Value
Date          S&P 500      NASDAQ
12/2/2011     1,244.28     2,626.93
12/5/2011     1,257.08     2,655.76
12/7/2011     1,261.01     2,649.21
12/8/2011     1,234.35     2,596.38
12/9/2011     1,255.19     2,646.85
12/12/2011    1,236.47     2,612.26
4a. Solution: Covariance and Correlation of the S&P 500 and NASDAQ Returns given a sample

              Closing Index Value      Returns               Deviation
Date          S&P 500      NASDAQ      S&P 500    NASDAQ     S&P 500    NASDAQ
                                       Xi         Yi         Xi - X̄     Yi - Ȳ     (Xi - X̄)*(Yi - Ȳ)
12/2/2011     1,244.28     2,626.93
12/5/2011     1,257.08     2,655.76    1.03%      1.10%      1.14%      1.20%      0.0137%
12/7/2011     1,261.01     2,649.21    0.31%      -0.25%     0.43%      -0.15%     -0.0006%
12/8/2011     1,234.35     2,596.38    -2.11%     -1.99%     -2.00%     -1.89%     0.0378%
12/9/2011     1,255.19     2,646.85    1.69%      1.94%      1.80%      2.05%      0.0369%
12/12/2011    1,236.47     2,612.26    -1.49%     -1.31%     -1.38%     -1.21%     0.0166%
Mean                                   -0.12%     -0.10%
                                       (X̄)        (Ȳ)
Total                                                                              0.1044%

                      S&P 500 (sx)    NASDAQ (sy)
Standard Deviation    0.01630504      0.01633798
Covariance            0.000261013
Correlation           0.979811179
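The worked solution can be reproduced from the closing index values. The sketch below assumes simple period-over-period returns (current close divided by previous close, minus 1), which is consistent with the percentages shown in the table; the variable and function names are illustrative and not from the slides. The results should agree, up to rounding, with the covariance (0.000261013) and correlation (0.979811179) reported in the solution.

```python
import math

# Closing index values from the example table (12/2/2011 to 12/12/2011).
sp500 = [1244.28, 1257.08, 1261.01, 1234.35, 1255.19, 1236.47]
nasdaq = [2626.93, 2655.76, 2649.21, 2596.38, 2646.85, 2612.26]


def simple_returns(prices):
    """Simple returns: close_t / close_(t-1) - 1 (assumed return definition)."""
    return [curr / prev - 1 for prev, curr in zip(prices, prices[1:])]


x = simple_returns(sp500)    # Xi: S&P 500 returns
y = simple_returns(nasdaq)   # Yi: NASDAQ returns
n = len(x)                   # 5 returns from 6 closing values

x_bar = sum(x) / n
y_bar = sum(y) / n

# Products of deviations, one per row of the solution table.
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]

cov_xy = sum(products) / (n - 1)
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = cov_xy / (s_x * s_y)

print(f"mean returns: X = {x_bar:.4%}, Y = {y_bar:.4%}")
print(f"sum of products = {sum(products):.6%}")
print(f"s_x = {s_x:.8f}, s_y = {s_y:.8f}")
print(f"covariance = {cov_xy:.9f}")
print(f"correlation = {r:.9f}")
```

With only five return observations the estimate of r is very noisy, so the high value here is best read as an illustration of the arithmetic rather than as evidence about the long-run relationship between the two indices.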
4a. Examples of Approximate r Values
[Figure: five scatter plots of y against x, illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
4b. Case: Multivariate Linear Regression (Revisited)
Adam, an analytics consultant, works with First Auto Insurance company. His manager gave him data containing the "Loss" amount and policy-related information and asked him to "identify" and "quantify" the factors responsible for losses in a multivariate fashion. Adam has no experience running a multivariate regression. Suppose he approaches you and requests your help to complete the assignment. Let's help Adam carry out the multivariate regression.
4a. Testing the significance of the correlation coefficient
Test whether the population correlation between two variables is equal to zero
• Null hypothesis, H0: ρ = 0
Assuming that the two populations are normally distributed, we can use a t-test to determine whether the null hypothesis should be rejected. The test statistic is computed using the sample correlation, r, with n - 2 degrees of freedom (df)
• $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
The calculated test statistic is compared with the critical t-value for the appropriate degrees of freedom and level of significance
Reject H0 if t > tcritical or t