The Multivariate Regression Model

Example 1: Determinants of College GPA

Sample of 141 freshmen
Collect data on college GPA (4.0 scale)
Look at the importance of the ACT
Consider the following model:

$$CGPA_i = \beta_0 + \beta_1 ACT_i + \varepsilon_i$$

ACT

4 tests: English, math, reading, science reasoning
Composite scores range from 1 to 36
Average score in 2000 was 21
Movement from 21 to 22 represents 7 percentage points in the distribution (56th to 63rd percentile)

[Figure: Scatter Plot: ACT Score and College GPA. Horizontal axis: ACT; vertical axis: College GPA]

[Figure: Scatter Plot: ACT Score and College GPA. Horizontal axis: ACT; vertical axis: College GPA]

. * run regression with one variable
. reg college_gpa act

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =    6.21
       Model |   .8295588     1    .8295588            Prob > F      =  0.0139
    Residual | 18.5765406   139  .133644177            R-squared     =  0.0427
-------------+------------------------------           Adj R-squared =  0.0359
       Total | 19.4060994   140  .138614996            Root MSE      =  .36557

------------------------------------------------------------------------------
 college_gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         act |    .027064   .0108628     2.49   0.014     .0055862    .0485417
       _cons |   2.402979   .2642027     9.10   0.000     1.880604    2.925355
------------------------------------------------------------------------------

Interpret the result:

Is this an accurate estimate of $\partial(CGPA)/\partial(ACT)$?

ACT is but one measure of ability
Noisy measure at best
Are there other measures available?
Consider another model (think of this as the true model):

$$CGPA_i = \beta_0 + \beta_1 ACT_i + \beta_2 HSGPA_i + \varepsilon_i$$

[Figure: Scatter Plot: HS GPA and College GPA. Horizontal axis: HS GPA; vertical axis: College GPA]
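As an aid to interpreting the one-variable output, the slope that `reg` reports is the sample covariance of the two variables divided by the sample variance of the regressor, and the intercept makes the line pass through the two means. The following is a minimal Python sketch of that calculation, not the Stata session from the slides. The function name `simple_ols` and the five data points are made up for illustration and are not the course data.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares, both in deviations from means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical ACT scores and college GPAs for five students (illustration only).
act = [18, 21, 24, 27, 30]
gpa = [2.6, 3.0, 2.9, 3.3, 3.2]

b0, b1 = simple_ols(act, gpa)
print(f"intercept = {b0:.3f}, slope = {b1:.4f}")
```

The slope is read the same way as the `act` coefficient in the output: the predicted change in college GPA for a one-point increase in the ACT score.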

. * get correlations between key variables
. corr college_gpa act hs_gpa
(obs=141)

             | colleg~a      act   hs_gpa
-------------+---------------------------
 college_gpa |   1.0000
         act |   0.2068   1.0000
      hs_gpa |   0.4146   0.3458   1.0000

. * run synthetic regression of hs_gpa on act
. reg hs_gpa act

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =   18.88
       Model |     1.7136     1      1.7136            Prob > F      =  0.0000
    Residual |    12.6158   139       .0908            R-squared     =  0.1196
-------------+------------------------------           Adj R-squared =  0.1132
       Total |    14.3294   140       .1024            Root MSE      =  .30127

------------------------------------------------------------------------------
      hs_gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         act |   .0388968     .00895     4.35   0.000     .0211972    .0565964
       _cons |   2.462537      .2177    11.31   0.000       2.0321      2.8930
------------------------------------------------------------------------------

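Let $x_1$ be ACT and $x_2$ be HS GPA. If the model is estimated without $x_2$ (the false model), then

$$E[\hat{\beta}_1] = \beta_1 + \beta_2 \hat{\delta}_1 \qquad (7)$$

where $\hat{\delta}_1$ is the slope from the regression of $x_2$ on $x_1$:

$$\hat{x}_{2i} = \hat{\delta}_0 + \hat{\delta}_1 x_{1i}, \qquad \hat{\delta}_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}$$

We anticipate that $\beta_2 > 0$, and we have shown that $\hat{\delta}_1 > 0$, so $E[\hat{\beta}_1] > \beta_1$.

On average, the value we estimate in the false model will be greater than the one in the true model.

. * run multivariate regression
. reg college_gpa act hs_gpa

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  2,   138) =   14.78
       Model | 3.42365506     2  1.71182753            Prob > F      =  0.0000
    Residual | 15.9824444   138  .115814814            R-squared     =  0.1764
-------------+------------------------------           Adj R-squared =  0.1645
       Total | 19.4060994   140  .138614996            Root MSE      =  .34032

------------------------------------------------------------------------------
 college_gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         act |    .009426   .0107772     0.87   0.383    -.0118838    .0307358
      hs_gpa |   .4534559   .0958129     4.73   0.000     .2640047    .6429071
       _cons |   1.286328   .3408221     3.77   0.000     .6124192    1.960237
------------------------------------------------------------------------------

The coefficient on ACT in the false model was 0.0271.
The coefficient in the true model is 0.0094.
The coefficient falls by 65%.

Example 2: Class Size and Performance

Data from 420 schools in CA
Outcome is the average score on the state test for reading and math in 6th grade
Average scores around 650 for the state
Key covariate: student/teacher ratio

$$SCORE_i = \beta_0 + \beta_1 STR_i + \varepsilon_i$$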

[Figure: Scatter Plot: Student Teacher Ratio vs. Average Test Scores. Horizontal axis: Student-Teacher ratio; vertical axis: Average Test Score]

. * run regression with one variable
. reg average_score student_teacher

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  1,   418) =   22.58
       Model | 7794.11004     1  7794.11004            Prob > F      =  0.0000
    Residual | 144315.484   418  345.252353            R-squared     =  0.0512
-------------+------------------------------           Adj R-squared =  0.0490
       Total | 152109.594   419  363.030056            Root MSE      =  18.581

------------------------------------------------------------------------------
average_sc~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
student_te~r |  -2.279808   .4798256    -4.75   0.000     -3.22298   -1.336637
       _cons |    698.933   9.467491    73.82   0.000     680.3231    717.5428
------------------------------------------------------------------------------

A one-student increase in class size will reduce average scores by 2.28 points.

Increasing class size by 5 will reduce average scores by 5(2.28) = 11.4, which is 11.4/654 = 0.017, or 1.7%.

Omitted variables

Class size is but one covariate we could add
Consider others that might be correlated with X that are omitted from the model
Example: % ESL
These students tend to score lower on tests
If they are also more or less likely to be in more crowded schools, then the results could be biased

$$SCORE_i = \beta_0 + \beta_1 STR_i + \beta_2 ESL_i + \varepsilon_i$$

Think of this as the true model, with $x_1$ = STR and $x_2$ = ESL. If ESL is left out,

$$E[\hat{\beta}_1] = \beta_1 + \beta_2 \hat{\delta}_1 \qquad (7)$$

$$\hat{x}_{2i} = \hat{\delta}_0 + \hat{\delta}_1 x_{1i}, \qquad \hat{\delta}_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}$$

[Figure: Scatter Plot: % ESL vs. Average Test Scores. Horizontal axis: % ESL; vertical axis: Average Test Score]
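The expected value of the coefficient from the model that leaves out $x_2$ can be worked out in three steps. The derivation below is a sketch under the usual assumption that the error in the true model has mean zero given the regressors, $E[\varepsilon_i \mid x_1, x_2] = 0$. The symbols $\hat{\beta}_1$ (slope from the regression of $y$ on $x_1$ alone) and $\hat{\delta}_1$ (slope from the regression of $x_2$ on $x_1$) are labels chosen here for illustration.

```latex
% Step 1: slope from the regression of y on x_1 alone
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, y_i}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}

% Step 2: substitute the true model y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i
\hat{\beta}_1 = \beta_1
  + \beta_2 \underbrace{\frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}}_{\hat{\delta}_1}
  + \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, \varepsilon_i}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}

% Step 3: take expectations conditional on the regressors; the last term has mean zero
E[\hat{\beta}_1] = \beta_1 + \beta_2 \hat{\delta}_1
```

The $\beta_0$ term drops out in step 2 because deviations from the mean sum to zero. The bias $\beta_2 \hat{\delta}_1$ is zero only if the omitted variable has no effect ($\beta_2 = 0$) or is uncorrelated in the sample with the included one ($\hat{\delta}_1 = 0$).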

[Figure: Scatter Plot: Student Teacher Ratio vs. % ESL. Horizontal axis: Student-Teacher ratio; vertical axis: % ESL]

             | averag~e studen~r  esl_pct
-------------+---------------------------
average_sc~e |   1.0000
student_te~r |  -0.2264   1.0000
     esl_pct |  -0.6441   0.1876   1.0000

We have shown that $\hat{\delta}_1 > 0$, and we anticipate that $\beta_2 < 0$, so $E[\hat{\beta}_1] < \beta_1$.

On average, the value we estimate in the false model will be smaller (more negative) than the one in the true model.
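The sign argument rests on a relation that holds exactly in any sample: the one-variable slope equals the two-variable slope on the same regressor, plus the coefficient on the second regressor times the slope from regressing the second regressor on the first. The Python sketch below checks this on simulated data. The data-generating numbers (686, -1.1, -0.65, 1.8 and the standard deviations) and the helper names are assumptions picked only to resemble the class-size example; this is not the California data.

```python
import random


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def slopes_two_regressors(x1, x2, y):
    """Return (b1, b2) from regressing y on x1 and x2 (with an intercept)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    # Solve the two normal equations for the slopes.
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Simulated (hypothetical) schools: x1 plays the role of STR, x2 of % ESL.
rng = random.Random(0)
n = 420
x1 = [rng.gauss(20, 2) for _ in range(n)]
x2 = [15 + 1.8 * (a - 20) + rng.gauss(0, 18) for a in x1]
y = [686 - 1.1 * a - 0.65 * b + rng.gauss(0, 14) for a, b in zip(x1, x2)]

short_b1 = slope(x1, y)                        # model that omits x2
long_b1, long_b2 = slopes_two_regressors(x1, x2, y)
delta1 = slope(x1, x2)                         # regression of x2 on x1

gap = short_b1 - (long_b1 + long_b2 * delta1)
print(f"short = {short_b1:.4f}, long = {long_b1:.4f}, "
      f"b2*delta1 = {long_b2 * delta1:.4f}, gap = {gap:.2e}")
```

The same check works with the college GPA estimates: 0.0094 + 0.4535 × 0.0389 ≈ 0.0271, which is the ACT coefficient from the one-variable regression.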

Think of the prediction this way:

$$E[\hat{\beta}_1] = \beta_1 + \beta_2 \hat{\delta}_1$$

with signs (-) + (-)(+)

In the single-variable model, the student/teacher ratio is picking up two effects:
Larger class sizes reduce performance
ESL students are more likely to be in more crowded schools, and they tend to have lower scores
Therefore, the model without ESL will estimate a negative number that is too large in magnitude

. * run multivariate regression
. reg average_score student_teacher esl_pct

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  2,   417) =  155.01
       Model | 64864.3011     2  32432.1506            Prob > F      =  0.0000
    Residual | 87245.2925   417  209.221325            R-squared     =  0.4264
-------------+------------------------------           Adj R-squared =  0.4237
       Total | 152109.594   419  363.030056            Root MSE      =  14.464

------------------------------------------------------------------------------
average_sc~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
student_te~r |  -1.101296   .3802783    -2.90   0.004    -1.848797   -.3537945
     esl_pct |  -.6497768   .0393425   -16.52   0.000    -.7271112   -.5724423
       _cons |   686.0322   7.411312    92.57   0.000     671.4641    700.6004
------------------------------------------------------------------------------

A 5-student increase in class size reduces test scores by 5(1.1) = 5.5, which is 5.5/654 = 0.008, or 0.8%: about half the estimated impact from before.

A one percentage point increase in % ESL in a school will reduce average scores by 0.65 points.

. * demonstrate the partialing out
. * nature of mv regressions
. * run a regression of STR on ESL
. * output the residuals
. reg student_teacher esl

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  1,   418) =   15.25
       Model |    52.7997     1     52.7997            Prob > F      =  0.0001
    Residual | 1446.78109   418  3.46119878            R-squared     =  0.0352
-------------+------------------------------           Adj R-squared =  0.0329
       Total |  1499.5808   419    3.578952            Root MSE      =  1.8604

------------------------------------------------------------------------------
student_te~r |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     esl_pct |    .019413   .0049704     3.91   0.000     .0096429    .0291831
       _cons |   19.33432   .1199307   161.21   0.000     19.09858    19.57006
------------------------------------------------------------------------------

. * output residuals
. predict res_str, residual

. * run a regression of test scores
. * on the student_teacher residuals
. reg average_score res_str

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  1,   418) =    4.88
       Model |    1754.73     1     1754.73            Prob > F      =  0.0278
    Residual |  150354.86   418   359.70065            R-squared     =  0.0115
-------------+------------------------------           Adj R-squared =  0.0092
       Total | 152109.594   419  363.030056            Root MSE      =  18.966

------------------------------------------------------------------------------
average_sc~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     res_str |  -1.101296   .4986194    -2.21   0.028      -2.0814      -.1212
       _cons |   654.1565   .9254351   706.86   0.000     652.3375    655.9756
------------------------------------------------------------------------------

The coefficient on res_str, -1.101296, is exactly the same as the coefficient on the student/teacher ratio in the multivariate regression.
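The partialing-out steps can be reproduced outside Stata. The Python sketch below regresses the first regressor on the second, keeps the residuals, regresses the outcome on those residuals, and compares the slope with the one from the two-regressor model. The eight data points and the helper names (`ols_fit`, `first_slope_two_regressors`) are hypothetical and chosen only for illustration; they are not the California data.

```python
def ols_fit(x, y):
    """Return (intercept, slope) from a least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


def first_slope_two_regressors(x1, x2, y):
    """Coefficient on x1 from regressing y on x1 and x2 (with an intercept)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


# Hypothetical data for eight schools (illustration only).
str_ratio = [17.0, 18.5, 19.0, 20.0, 20.5, 21.0, 22.5, 24.0]
esl_pct = [5.0, 30.0, 0.0, 12.0, 45.0, 8.0, 20.0, 60.0]
score = [690.0, 655.0, 680.0, 662.0, 630.0, 664.0, 648.0, 610.0]

# Step 1: regress STR on ESL and keep the residuals (the part of STR
# that is not explained by ESL).
c0, c1 = ols_fit(esl_pct, str_ratio)
res_str = [s - (c0 + c1 * e) for s, e in zip(str_ratio, esl_pct)]

# Step 2: regress test scores on those residuals.
_, partial_b1 = ols_fit(res_str, score)

# Compare with the STR coefficient from the two-regressor model.
mv_b1 = first_slope_two_regressors(str_ratio, esl_pct, score)
print(f"partialed-out slope = {partial_b1:.6f}, multivariate slope = {mv_b1:.6f}")
```

The two slopes agree up to floating-point error. The standard errors do not: the second-step regression ignores the variation in scores explained by ESL, which is why the output above reports a larger standard error for `res_str` than the multivariate regression does for the student/teacher ratio.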