Data for this session is available in "Data Regression I & II"

Regression Analysis I & II
Quantitative Methods for Business
Skander Esseghaier
In this session, you will learn:
- How to read and interpret a regression output
    - How much does the regression explain? (R-Square)
    - Is the regression significant? (F-Test)
    - Is any coefficient significant? (T-Test)
- How to deal with some key issues you may be confronted with when using regression models
    - Multicollinearity
    - Categorical independent variables (dummy variable regression)
- How to identify the most important variables in a regression analysis
Meddicorp
- Meddicorp sells medical supplies to hospitals, clinics and doctors' offices
- The company currently markets in 3 regions of the US: South, West and Midwest
- Meddicorp management is concerned with the effectiveness of a new bonus program for its sales force
- Management wants to know if there is a relationship between sales and bonuses in 1999
What Does the Data Say?

SALESPERSON   SALES $   BONUS $
     1           964       231
     2           893       236
     3          1057       272
     4          1183       291
     5          1420       282
     6          1548       321
     7          1580       294
     8          1072       306
     9          1078       238
    10          1123       271
    11          1305       333
    12          1552       262
    13          1040       236
    14          1045       250
    15          1102       233
    16          1225       272
    17          1508       267
    18          1564       277
    19          1635       312
    20          1159       293
    21          1203       268
    22          1294       310
    23          1468       291
    24          1584       289
    25          1125       273
Measures of Association
Scatter plot: it describes the relationship between two variables graphically.
[Figure: scatter plot "Relationship between Sales Performance and Bonus Received"; x-axis: Sales in $ (800 to 1700), y-axis: Bonus in $ (200 to 340)]

How Strong is the Association?
[Figure: the same scatter plot, "Relationship between Sales Performance and Bonus Received"]
Measures of Association
(SALES and BONUS data for the 25 salespeople, as in the table above)
[Figure: scatter plot; x-axis: Sales in $ (800 to 1700), y-axis: Advertising (300 to 700)]
The two variables must have the same number of observations (paired variables).
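Measures of Association
Covariance and Correlation: measures that summarize the strength of the linear relationship between the two variables numerically.

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{s_X \, s_Y}$$

Excel: COVAR(data1,data2) for covariance (note that COVAR divides by n rather than n - 1; COVARIANCE.S matches the formula above)
Excel: CORREL(data1,data2) for correlation
The two variables, say X and Y, must have the same number of observations (paired variables).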
Measures of Association
The advantage that the correlation has over the covariance is that the correlation is always between -1 and +1.
- Corr(X,Y) = -1: perfect negative linear relationship
- Corr(X,Y) = +1: perfect positive linear relationship
- Corr(X,Y) = 0: no linear relationship
All other values of correlation are judged in relation to these three values.
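The covariance and correlation formulas can be checked directly on the Meddicorp data. The sketch below is a minimal pure-Python version (the helper names `sample_cov` and `corr` are illustrative, not from the slides) applied to the SALES and BONUS columns from the data slide. The correlation should land close to the 0.565 reported in the correlation table later in the deck; small differences can arise because the slide data are rounded to whole dollars.

```python
import math

# SALES and BONUS (in $) for the 25 salespeople, copied from the data slide.
sales = [964, 893, 1057, 1183, 1420, 1548, 1580, 1072, 1078, 1123, 1305, 1552, 1040,
         1045, 1102, 1225, 1508, 1564, 1635, 1159, 1203, 1294, 1468, 1584, 1125]
bonus = [231, 236, 272, 291, 282, 321, 294, 306, 238, 271, 333, 262, 236,
         250, 233, 272, 267, 277, 312, 293, 268, 310, 291, 289, 273]


def mean(values):
    return sum(values) / len(values)


def sample_cov(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    if len(x) != len(y):
        raise ValueError("paired variables must have the same number of observations")
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def corr(x, y):
    """Pearson correlation: covariance scaled by both sample standard deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))


cov_sb = sample_cov(sales, bonus)
r_sb = corr(sales, bonus)
print(f"Cov(SALES, BONUS)  = {cov_sb:.1f}")
print(f"Corr(SALES, BONUS) = {r_sb:.3f}")
```

The covariance depends on the units of both variables (dollars times dollars), while the correlation is unit-free, which is why only the latter can be judged against the -1, 0, +1 benchmarks.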
Measures of Association
[Figures: four example scatter plots, labelled with their covariance and correlation]
- Covar(X,Y) = 11.88, Corr(X,Y) = 0.91
- Covar(X,Y) = -943.163, Corr(X,Y) = -0.82
- Covar(X,Y) = 85.43, Corr(X,Y) = 0.20
- Covar(X,Y) = -0.00058, Corr(X,Y) = -0.0038
The Basic Idea Behind Regression
Suppose you believe Salary is related to Education, Experience & Gender.
- Measure of association
    - Corr(Salary, Education)
    - Corr(Salary, Experience)
- Hypothesis testing: gender discrimination?
    - Average Salary for Males versus Average Salary for Females
The Basic Idea Behind Regression
If we believe Salary is related to Education, Experience and Gender, can we come up with some weighted sum of Education, Experience and Gender that will help us predict Salary with as little error as possible?

    Salary = b0 + b1*education + b2*experience + b3*gender

- Goal of regression: come up with the weights b0, b1, b2, b3 that will predict Salary with the least possible error using the variables Education, Experience and Gender
- Gender discrimination context: suppose b3 is significantly different from zero; then we can say more conclusively that there is possible gender discrimination, because we have separated out the effects of education and experience
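To make "weights with the least possible error" concrete, here is a minimal least-squares sketch in pure Python that solves the normal equations. It is an illustration only: the eight salary records are hypothetical, generated without noise from assumed weights (b0 = 20, b1 = 3, b2 = 1.5, b3 = -4, salary in $000), so the fit should recover exactly those weights. The function names `solve` and `ols` are this sketch's own.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares weights b0, b1, ... for y = b0 + b1*x1 + ... (normal equations)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical records: (education in years, experience in years, gender dummy 0/1).
people = [(12, 2, 0), (16, 5, 1), (14, 10, 0), (18, 3, 1),
          (12, 15, 1), (16, 8, 0), (20, 12, 0), (14, 1, 1)]
# Hypothetical salaries (in $000) built from assumed weights, with no noise.
salary = [20 + 3 * edu + 1.5 * exp - 4 * g for edu, exp, g in people]

b0, b1, b2, b3 = ols(people, salary)
print(f"Salary = {b0:.2f} + {b1:.2f}*education + {b2:.2f}*experience + {b3:.2f}*gender")
```

With real data the fit would not be exact, and whether b3 differs from zero would be judged with its T-stat rather than by eye.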
Uses of Regression
- Describing and understanding relationships
    - Sales = 20,000 - 3.2 Price + 0.1 Advertising
- Forecasting and predicting a new observation
    - What will sales be if I change advertising or price?
- Adjusting the independent variables
    - How much should I change advertising or price?
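As a small illustration of the forecasting use, the example equation can be evaluated for different scenarios. The price and advertising values below are hypothetical, chosen only to show how the coefficients translate into predicted changes.

```python
def predicted_sales(price, advertising):
    """Evaluate the example equation Sales = 20,000 - 3.2 Price + 0.1 Advertising."""
    return 20_000 - 3.2 * price + 0.1 * advertising


# Hypothetical scenarios (illustrative values only).
base = predicted_sales(price=100, advertising=1_000)
more_adv = predicted_sales(price=100, advertising=2_000)
higher_price = predicted_sales(price=110, advertising=1_000)

print(f"Base scenario:          {base:.1f}")
print(f"Advertising +1,000:     change of {more_adv - base:+.1f}")
print(f"Price +10:              change of {higher_price - base:+.1f}")
```

Each coefficient is the predicted change in sales per unit change in that variable, holding the other variable fixed.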
Back to Meddicorp
- Meddicorp management is concerned with the effectiveness of a new bonus program for its sales force
- There seems to be a relationship between sales and bonuses in 1999
- But sales should be adjusted for other factors
    - For example, if Meddicorp advertised heavily, then that would increase sales
    - So sales should be corrected for advertising expenses to get the true effect of sales force performance
Meddicorp Regression

Regression Analysis: SALES versus ADV, BONUS

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS

Predictor       Coef   SE Coef       T       P
Constant      -516.2     191.1   -2.70   0.013
ADV           2.4715    0.2762    8.95   0.000
BONUS         1.8587    0.7184    2.59   0.017
(T = T-stats; P = p-value for 2-tailed test)

R-Sq = 85.3%   R-Sq(adj) = 84.0%   (R-Square)

Analysis of Variance (F-Test)
Source            DF        SS       MS       F       P
Regression         2   1066070   533035   64.01   0.000
Residual Error    22    183212     8328
Total             24   1249282
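The summary statistics in this output can be reproduced from the sums of squares, coefficients and standard errors it reports. The sketch below uses the standard definitions (R-Sq = SS regression / SS total, F = MS regression / MS residual, T = coefficient / standard error); the variable names are this sketch's own.

```python
# Figures copied from the ADV + BONUS regression output.
n_obs, n_predictors = 25, 2
ss_regression, ss_residual, ss_total = 1066070, 183212, 1249282
coefs = {  # predictor: (coefficient, standard error)
    "Constant": (-516.2, 191.1),
    "ADV": (2.4715, 0.2762),
    "BONUS": (1.8587, 0.7184),
}

df_regression = n_predictors
df_residual = n_obs - n_predictors - 1
df_total = n_obs - 1

# R-Sq: share of the total variation in SALES explained by the regression.
r_sq = ss_regression / ss_total
# Adjusted R-Sq: same idea, but using mean squares to penalize extra predictors.
r_sq_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)
# F-stat: explained variation per predictor relative to unexplained variation per df.
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
# T-stat: each coefficient measured in units of its own standard error.
t_stats = {name: coef / se for name, (coef, se) in coefs.items()}

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, F = {f_stat:.2f}")
for name, t in t_stats.items():
    print(f"{name:<9} T = {t:6.2f}")
```

Small differences in the last digit relative to the printed output are possible because the output itself is rounded.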
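What is the R-Square Here?
[Figure: three scatter plots of Y against X; panel labels: R-Sq = 1, R-Sq < 1, R-Sq << 1]

And Here?
[Figure: three scatter plots of Y against X; panel labels: R-Sq = 1, R-Sq < 1, R-Sq << 1]

What About Here?
[Figure: three plots of Y against X; panel labels: R-Sq = 1, R-Sq = 1, R-Square is meaningless]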
What if R-Square = 0?
Does it imply that there is no relationship between the variables?
No Linear Relationship
[Figure: plot of Y against X]
There is a relationship between the variables, but not a linear one: R-Sq = 0

Types of Non-Linear Relationships
[Figure: two plots of Y against X showing non-linear patterns]

No Relationship Between the Variables
[Figure: plot of Y against X]
No relationship implies R-Sq = 0 (but R-Sq = 0 implies no linear relationship only!)
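A tiny numeric example shows why R-Sq = 0 rules out only a linear relationship. In the sketch below (illustrative data, not from the slides), Y is determined exactly by X through Y = X squared, yet a straight-line fit explains none of the variation because the pattern is symmetric around the mean of X.

```python
def r_squared_simple(x, y):
    """R-Sq of a straight-line fit of y on x, i.e. the squared Pearson correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [xi ** 2 for xi in x]     # exact U-shaped relationship
y_line = [2 * xi + 1 for xi in x]   # exact straight-line relationship

print("R-Sq for Y = X^2:    ", r_squared_simple(x, y_curve))
print("R-Sq for Y = 2X + 1: ", r_squared_simple(x, y_line))
```

Plotting the data before trusting a low R-Sq is therefore worthwhile: a curved pattern may call for a transformed variable rather than for dropping the predictor.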
Meddicorp: More Variables
Suppose Meddicorp believes that sales of a representative could be a function of:
- its market share; and
- the sales of the largest competitor in its territory
What would be your research hypothesis for these variables?
Meddicorp: Interpret Output

Regression Analysis: SALES versus ADV, BONUS, MKTSHR, COMPET

The regression equation is
SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.68 MKTSHR - 0.119 COMPET

Predictor       Coef   SE Coef       T       P
Constant      -594.3     260.5   -2.28   0.034
ADV           2.5113    0.3153    7.96   0.000
BONUS         1.9074    0.7450    2.56   0.019
MKTSHR         2.678     4.661    0.57   0.572   <- not significant
COMPET       -0.1192    0.3743   -0.32   0.753   <- not significant

R-Sq = 85.8%   R-Sq(adj) = 82.9%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         4   1071426   267857   30.12   0.000
Residual Error    20    177856     8893
Total             24   1249282
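Comparing this output with the earlier ADV + BONUS regression shows a typical pattern: R-Sq creeps up when variables are added, while adjusted R-Sq can fall if the additions explain little. The sketch below recomputes both from the reported residual sums of squares and flags coefficients whose T-stat is below 2 in absolute value (the rule of thumb from this deck). Names such as `adjusted_r_sq` are this sketch's own.

```python
def adjusted_r_sq(ss_residual, ss_total, n_obs, n_predictors):
    """Adjusted R-Sq: penalizes R-Sq for each extra predictor via degrees of freedom."""
    return 1 - (ss_residual / (n_obs - n_predictors - 1)) / (ss_total / (n_obs - 1))


SS_TOTAL, N_OBS = 1249282, 25

# Residual sums of squares copied from the two regression outputs.
models = {
    "ADV, BONUS": {"ss_residual": 183212, "n_predictors": 2},
    "ADV, BONUS, MKTSHR, COMPET": {"ss_residual": 177856, "n_predictors": 4},
}

summary = {}
for name, m in models.items():
    r_sq = 1 - m["ss_residual"] / SS_TOTAL
    adj = adjusted_r_sq(m["ss_residual"], SS_TOTAL, N_OBS, m["n_predictors"])
    summary[name] = (r_sq, adj)
    print(f"{name:<28} R-Sq = {r_sq:.1%}  R-Sq(adj) = {adj:.1%}")

# Coefficients and standard errors from the four-predictor output.
coefs_four = {
    "Constant": (-594.3, 260.5),
    "ADV": (2.5113, 0.3153),
    "BONUS": (1.9074, 0.7450),
    "MKTSHR": (2.678, 4.661),
    "COMPET": (-0.1192, 0.3743),
}
not_significant = [name for name, (coef, se) in coefs_four.items() if abs(coef / se) < 2]
print("Rule-of-thumb non-significant predictors:", not_significant)
```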
Meddicorp: In Search of More Predictors
Meddicorp wonders if sales could be better explained by including not only the current bonus, but also last year's bonus of the salesperson.
Including Last Year's Bonus to Explain Sales

Regression Analysis: SALES versus ADV, BONUS, LAST YR BONUS

The regression equation is
SALES = - 511 + 2.45 ADV + 1.60 BONUS + 0.28 LAST YR BONUS

Predictor       Coef   SE Coef       T       P
Constant      -511.0     196.3   -2.60   0.017
ADV           2.4528    0.2915    8.41   0.000
BONUS          1.598     1.257    1.27   0.217   <- no longer significant
LAST YR        0.279     1.089    0.26   0.801

R-Sq = 85.4%   R-Sq(adj) = 83.3%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         3   1066639   355546   40.88   0.000
Residual Error    21    182643     8697
Total             24   1249282
Multicollinearity
Caused by high correlation between the RHS (right-hand-side) variables

Correlations: SALES, ADV, BONUS, LAST YR BONUS

             SALES     ADV   BONUS
ADV          0.899
             0.000
BONUS        0.565   0.415
             0.003   0.039
LAST YR      0.587   0.472   0.847   <- too high
             0.002   0.017   0.000

Cell contents: Pearson correlation
               P-value
Detecting and Correcting Multicollinearity
- Detecting multicollinearity
    - Look at correlations
    - Large F-stat, but poor T-stats
- Correcting multicollinearity
    - Keep only one of the highly correlated variables

More on Detecting Multicollinearity
- Look at Variance Inflation Factors (VIF)
    - Set this option in Minitab when estimating the regression
    - Look at the largest VIF
    - VIF = 1 indicates no multicollinearity
    - VIF > 5 is considered very bad
Correcting for Multicollinearity

Regression Analysis: SALES versus ADV, BONUS, LAST YR BONUS

The regression equation is
SALES = - 511 + 2.45 ADV + 1.60 BONUS + 0.28 LAST YR BONUS

Predictor       Coef   SE Coef       T       P   VIF
Constant      -511.0     196.3   -2.60   0.017
ADV           2.4528    0.2915    8.41   0.000   1.3
BONUS          1.598     1.257    1.27   0.217   3.5
LAST YR        0.279     1.089    0.26   0.801   3.8

S = 93.26   R-Sq = 85.4%   R-Sq(adj) = 83.3%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         3   1066639   355546   40.88   0.000
Residual Error    21    182643     8697
Total             24   1249282

Source     DF    Seq SS
ADV         1   1010324
BONUS       1     55746
LAST YR     1       569
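The VIF column can be approximated from the pairwise correlations among the predictors alone. A standard result is that the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix (equivalently 1 / (1 - R-Sq) from regressing that predictor on the others). The sketch below applies this to the three correlations reported in the correlation table (ADV-BONUS 0.415, ADV-LAST YR 0.472, BONUS-LAST YR 0.847); the `invert` helper is this sketch's own, and results are approximate because the correlations are rounded to three decimals.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


names = ["ADV", "BONUS", "LAST YR"]
# Predictor correlation matrix, filled in from the correlation table.
corr_matrix = [
    [1.000, 0.415, 0.472],
    [0.415, 1.000, 0.847],
    [0.472, 0.847, 1.000],
]

inverse = invert(corr_matrix)
vif = {name: inverse[i][i] for i, name in enumerate(names)}
for name, value in vif.items():
    print(f"VIF({name}) = {value:.2f}")
```

The two bonus variables inflate each other's variance, which is why their standard errors grow and their T-stats shrink when both are included.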
Modeling Categorical Variables
- Meddicorp believes that sales are often a function of a region's attractiveness, and the bonus analysis should correct for that
    - Is bonus related to sales beyond regional attractiveness?
- How to include regions in the regression?
    - South, West, Midwest
    - Create dummy variables
Including Dummies

Regression Analysis: SALES versus ADV, BONUS, South, West, Midwest

* Midwest is highly correlated with other X variables
* Midwest has been removed from the equation

The regression equation is
SALES = 441 + 1.36 ADV + 0.970 BONUS - 259 South - 211 West

Predictor       Coef   SE Coef       T       P
Constant       440.9     206.7    2.13   0.045
ADV           1.3608    0.2623    5.19   0.000
BONUS         0.9703    0.4815    2.02   0.058
South        -259.43     48.47   -5.35   0.000
West         -210.82     37.49   -5.62   0.000

R-Sq = 94.7%   R-Sq(adj) = 93.6%

Can use only n-1 dummy variables if there are n categories; if you use all n, it creates multicollinearity.
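A minimal sketch of the n-1 dummy coding, assuming Midwest is the omitted baseline as in the output above. The `region_dummies` and `predicted_sales` helpers are illustrative names, and the ADV and BONUS inputs are hypothetical values used only to show how the dummy coefficients shift the prediction.

```python
REGIONS = ["South", "West", "Midwest"]
BASELINE = "Midwest"  # the category left out of the regression


def region_dummies(region):
    """Return the n-1 dummies (South, West); Midwest is encoded as all zeros."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {name: int(region == name) for name in REGIONS if name != BASELINE}


def predicted_sales(adv, bonus, region):
    """Evaluate the fitted equation with coefficients copied from the output."""
    d = region_dummies(region)
    return (440.9 + 1.3608 * adv + 0.9703 * bonus
            - 259.43 * d["South"] - 210.82 * d["West"])


# Hypothetical inputs: same advertising and bonus in every region.
for region in REGIONS:
    print(f"{region:<8} dummies={region_dummies(region)} "
          f"predicted SALES={predicted_sales(500, 280, region):.1f}")
```

Each dummy coefficient is read relative to the baseline: holding ADV and BONUS fixed, predicted sales in the South are 259.43 lower than in the Midwest, and in the West 210.82 lower.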
Assessing the Importance of the Variables
- Standardized coefficient for Xi when Y is regressed on Xi
    - standardized coeff = unstandardized coeff * (std dev of Xi / std dev of Y)
    - Standardized coefficients are also called beta coefficients
- Compute in Excel if you like
    - Compute the standard deviation for regular variables using STDEV(.)
    - Compute the standard deviation for dummy variables using STDEVA(.)
- Warning: compute betas only for significant coefficients
Most Important Variables for Meddicorp

Regression Analysis: SALES versus ADV, BONUS, South, West, Midwest

The regression equation is
SALES = 441 + 1.36 ADV + 0.970 BONUS - 259 South - 211 West

Predictor       Coef   SE Coef       T       P   Beta Coef
Constant       440.9     206.7    2.13   0.045
ADV           1.3608    0.2623    5.19   0.000       0.442
BONUS         0.9703    0.4815    2.02   0.058       0.121
South        -259.43     48.47   -5.35   0.000      -0.521
West         -210.82     37.49   -5.62   0.000      -0.439

R-Sq = 94.7%   R-Sq(adj) = 93.6%
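The beta coefficient for BONUS can be reproduced from the SALES and BONUS columns of the data slide, using beta = coefficient * (std dev of X / std dev of Y). The sketch below uses `statistics.stdev`, which is the sample standard deviation (n - 1 in the denominator), the same convention as Excel's STDEV. Only BONUS is checked here because the ADV and region columns are not listed in the deck; the result should be close to the 0.121 in the table, up to rounding of the data.

```python
import statistics

# SALES and BONUS (in $) for the 25 salespeople, copied from the data slide.
sales = [964, 893, 1057, 1183, 1420, 1548, 1580, 1072, 1078, 1123, 1305, 1552, 1040,
         1045, 1102, 1225, 1508, 1564, 1635, 1159, 1203, 1294, 1468, 1584, 1125]
bonus = [231, 236, 272, 291, 282, 321, 294, 306, 238, 271, 333, 262, 236,
         250, 233, 272, 267, 277, 312, 293, 268, 310, 291, 289, 273]


def beta_coefficient(coef, x, y):
    """Standardized coefficient: unstandardized coef scaled by sd(x) / sd(y)."""
    return coef * statistics.stdev(x) / statistics.stdev(y)


# 0.9703 is the BONUS coefficient from the regression with regional dummies.
beta_bonus = beta_coefficient(0.9703, bonus, sales)
print(f"sd(BONUS) = {statistics.stdev(bonus):.2f}, sd(SALES) = {statistics.stdev(sales):.2f}")
print(f"Beta coefficient for BONUS = {beta_bonus:.3f}")
```

Because betas are unit-free, their absolute values can be compared across predictors: in the table, South (-0.521) and ADV (0.442) matter more than BONUS (0.121).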
Takeaways
- T-stats greater than 2 in absolute value and p-values < 0.05 indicate that a variable in a regression is significant at the 5% level
    - Significance implies that the variable can explain variation in the dependent variable
- With a large number of independent variables, don't put them all into the regression at one shot
    - If there are high correlations among some variables, pick one of these for the regression to avoid multicollinearity
- If there are n categories in a categorical variable, use n-1 dummy variables
- Compute standardized (beta) coefficients to check the importance of variables