The Multivariate Regression Model

Similar documents
Notes on PS2

ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics May 2011

Soci Statistics for Sociologists

* STATA.OUTPUT -- Chapter 5

Lecture 2a: Model building I

17.871: PS3 Key. Part I

SOCY7706: Longitudinal Data Analysis Instructor: Natasha Sarkisian Two Wave Panel Data Analysis

Week 11: Collinearity

PSC 508. Jim Battista. Dummies. Univ. at Buffalo, SUNY. Jim Battista PSC 508

Eco311, Final Exam, Fall 2017 Prof. Bill Even. Your Name (Please print) Directions. Each question is worth 4 points unless indicated otherwise.

ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics May 2014

Compartmental Pharmacokinetic Analysis. Dr Julie Simpson

Midterm Exam. Friday the 29th of October, 2010

Bios 312 Midterm: Appendix of Results March 1, Race of mother: Coded as 0==black, 1==Asian, 2==White. . table race white

Applied Econometrics

You can find the consultant s raw data here:

CHECKING INFLUENCE DIAGNOSTICS IN THE OCCUPATIONAL PRESTIGE DATA

Example Analysis with STATA

Example Analysis with STATA

Week 10: Heteroskedasticity

Florida. Difference-in-Difference Models 8/23/2016

Application: Effects of Job Training Program (Data are the Dehejia and Wahba (1999) version of Lalonde (1986).)

Interpreting and Visualizing Regression models with Stata Margins and Marginsplot. Boriana Pratt May 2017

ROBUST ESTIMATION OF STANDARD ERRORS

The Effect of Occupational Danger on Individuals Wage Rates. A fundamental problem confronting the implementation of many healthcare

PREDICTIVE MODEL OF TOTAL INCOME FROM SALARIES/WAGES IN THE CONTEXT OF PASAY CITY

İnsan Tunalı November 29, 2018 Econ 511: Econometrics I. ANSWERS TO ASSIGNMENT 10: Part II STATA Supplement

The study obtains the following results: Homework #2 Basics of Logistic Regression Page 1. . version 13.1

Group Comparisons: Using What If Scenarios to Decompose Differences Across Groups

Trunkierte Regression: simulierte Daten

log: F:\stata_parthenope_01.smcl opened on: 17 Mar 2012, 18:21:56

. *increase the memory or there will problems. set memory 40m (40960k)

Unit 6: Simple Linear Regression Lecture 2: Outliers and inference

for var trstprl trstlgl trstplc trstplt trstep: reg X trust10 stfeco yrbrn hinctnt edulvl pltcare polint wrkprty

(LDA lecture 4/15/08: Transition model for binary data. -- TL)

Survey commands in STATA

Sociology 7704: Regression Models for Categorical Data Instructor: Natasha Sarkisian. Preliminary Data Screening

A.O. Baranov, V.N. Pavlov, Yu.M. Slepenkova

ECON Introductory Econometrics Seminar 9

Foley Retreat Research Methods Workshop: Introduction to Hierarchical Modeling

Milk Data Analysis. 1. Objective: analyzing protein milk data using STATA.

Number of obs = R-squared = Root MSE = Adj R-squared =

This example demonstrates the use of the Stata 11.1 sgmediation command with survey correction and a subpopulation indicator.

SUGGESTED SOLUTIONS Winter Problem Set #1: The results are attached below.

Interactions made easy

Chapter 5 Regression

Biostatistics 208. Lecture 1: Overview & Linear Regression Intro.

Checking the model. Linearity. Normality. Constant variance. Influential points. Covariate overlap

(February draft)

Guideline on evaluating the impact of policies -Quantitative approach-

Why Are Electricity Prices in RTOs Increasingly Expensive?

Exploring Functional Forms: NBA Shots. NBA Shots 2011: Success v. Distance. . bcuse nbashots11

Categorical Data Analysis

Computer Handout Two

The Multivariate Dustbin

Unit 2 Regression and Correlation 2 of 2 - Practice Problems SOLUTIONS Stata Users

COMPARING MODEL ESTIMATES: THE LINEAR PROBABILITY MODEL AND LOGISTIC REGRESSION

ECON Introductory Econometrics Seminar 6

The Servitization of Manufacturing: A Longitudinal Study of Global Trends. Professor Andy Neely Director, Cambridge Service Alliance

Unit 5 Logistic Regression Homework #7 Practice Problems. SOLUTIONS Stata version

Introduction of STATA

Chapter 2. Linear model. Put some concreteness on problem. The Bivariate Regression Model. Sample of n observations, labeled as i=1,2,..

Examples of Using Stata v11.0 with JRR replicate weights Provided in the NHANES data set

Problem Points Score USE YOUR TIME WISELY SHOW YOUR WORK TO RECEIVE PARTIAL CREDIT

Multilevel/ Mixed Effects Models: A Brief Overview

Stata Program Notes Biostatistics: A Guide to Design, Analysis, and Discovery Second Edition Chapter 12: Analysis of Variance

Stata v 12 Illustration. One Way Analysis of Variance

Final Exam Spring Bread-and-Butter Edition

Table. XTMIXED Procedure in STATA with Output Systolic Blood Pressure, use "k:mydirectory,

SHIPPING COST, SPOT RATE AND EBAY AUCTIONS: A STRUCTURED EQUATION MODELING APPROACH

Analyzing CHIS Data Using Stata

BUS105 Statistics. Tutor Marked Assignment. Total Marks: 45; Weightage: 15%

Tabulate and plot measures of association after restricted cubic spline models

Do not turn over until you are told to do so by the Invigilator.

Elementary tests. proc ttest; title3 'Two-sample t-test: Does consumption depend on Damper Type?'; class damper; var dampin dampout diff ;

BIOSTATS 640 Spring 2017 Stata v14 Unit 2: Regression & Correlation. Stata version 14

Regression Analysis I & II

(R) / / / / / / / / / / / / Statistics/Data Analysis

Timing Production Runs

Topics in Biostatistics Categorical Data Analysis and Logistic Regression, part 2. B. Rosner, 5/09/17

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, Last revised March 28, 2015

Non-linear Pricing of Paid Content Products

Experiment Outcome &Literature Review. Presented by Fang Liyu

This is a quick-and-dirty example for some syntax and output from pscore and psmatch2.

STAT 350 (Spring 2016) Homework 12 Online 1

Outliers Richard Williams, University of Notre Dame, Last revised April 7, 2016

Probability Of Booking

Psych 5741/5751: Data Analysis University of Boulder Gary McClelland & Charles Judd

The SAS System 1. RM-ANOVA analysis of sheep data assuming circularity 2

Multiple Imputation and Multiple Regression with SAS and IBM SPSS

Longitudinal Data Analysis, p.12

Logistic Regression, Part III: Hypothesis Testing, Comparisons to OLS

ADVANCED ECONOMETRICS I

The Servitization of Manufacturing: A Longitudinal Study of Global Trends. Professor Andy Neely Director, Cambridge Service Alliance

EXPERIMENTAL INVESTIGATIONS ON FRICTION WELDING PROCESS FOR DISSIMILAR MATERIALS USING DESIGN OF EXPERIMENTS

PubHlth 640 Intermediate Biostatistics Unit 2 Regression and Correlation

All analysis examples presented can be done in Stata 10.1 and are included in this chapter s output.

Homework 1: Who s Your Daddy? Is He Rich Like Me?

What Factors Influence Seat Belt Usage Rates in the United States?: A Meta-analysis

EFA in a CFA Framework

Transcription:

The Multivariate Regression Model Example Determinants of College GPA Sample of 4 Freshman Collect data on College GPA (4.0 scale) Look at importance of ACT Consider the following model CGPA ACT i 0 i i ACT 4 tests English/math/reading/science reasoning Composite scores from -36 Average score in 000 was Movement from to represents 7 percentage points in the distribution (56 th to 63th percentile) 3 College GPA Scatter Plot: ACT Score and College GPA 4.0 3.5 3.0.5.0 5 7 9 3 5 7 9 3 33 35 ACT 4

College GPA 4.0 3.5 3.0.5 Scatter Plot: ACT Score and College GPA. *run regression with one variable. reg college_gpa act Source SS df MS Number of obs = 4 -------------+------------------------------ F(, 39) = 6. Model.895588.895588 Prob > F = 0.039 Residual 8.5765406 39.3364477 R-squared = 0.047 -------------+------------------------------ Adj R-squared = 0.0359 Total 9.4060994 40.3864996 Root MSE =.36557 college_gpa Coef. Std. Err. t P> t [95% Conf. Interval] act.07064.00868.49 0.04.005586.048547 _cons.40979.6407 9.0 0.000.880604.95355 Interpret the result:.0 5 7 9 3 5 7 9 3 33 35 ACT 5 6 Is this an accurate estimate of (CGPA)/ (ACT)? ACT is but one measure of ability Noisy measure at best Are there other measures available? Consider another model (Think of this as the true model) CGPA ACT HSGPA i 0 i i i College GPA 4.0 3.5 3.0.5 Scatter Plot: HS GPA and College GPA 7.0.5 3 3.5 4 HS GPA 8

Scatter Plot: HS GPA and College GPA Scatter Plot: HS GPA and College GPA 4.0 35.0 3.5 30.0 College GPA 3.0 College GPA 5.0.5 0.0.0.5 3 3.5 4 HS GPA 5.0.5 3 3.5 4 HS GPA 9 0 *run synthetic regression of hs_gpa on act reg hs_gpa act. * get correlations between key variables. corr college_gpa act hs_gpa (obs=4) colleg~a act hs_gpa -------------+--------------------------- college_gpa.0000 act 0.068.0000 hs_gpa 0.446 0.3458.0000 Source SS df MS Number of obs = 4 ------------+------------------------------ F(, 39) = 8.88 Model.7356.7356 Prob > F = 0.0000 Residual.65835 39.09076403 R-squared = 0.96 ------------+------------------------------ Adj R-squared = 0.3 Total 4.3936 40.03558 Root MSE =.307 ----------------------------------------------------------------------------- hs_gpa Coef. Std. Err. t P> t [95% Conf. Interval] ------------+---------------------------------------------------------------- act.0388968.00895 4.35 0.000.097.0565964 _cons.46537.7773.3 0.000.0305.8930 ----------------------------------------------------------------------------- 3

, x (7) E[ ] ˆ x i 0 i i ˆ n i ( x x )( x x ) i i n i ( x x ) i we anticipate that 0 and we have shown that ˆ 0 E[ ˆ ] then E[ ] On average, the value we estimated in the False model will be greater than the one in the true model 3 4. * run multivariate regression. reg college_gpa act hs_gpa Source SS df MS Number of obs = 4 -------------+------------------------------ F(, 38) = 4.78 Model 3.4365506.78753 Prob > F = 0.0000 Residual 5.984444 38.58484 R-squared = 0.764 -------------+------------------------------ Adj R-squared = 0.645 Total 9.4060994 40.3864996 Root MSE =.3403 college_gpa Coef. Std. Err. t P> t [95% Conf. Interval] act.00946.00777 0.87 0.383 -.08838.0307358 hs_gpa.4534559.09589 4.73 0.000.640047.64907 _cons.8638.3408 3.77 0.000.649.96037 The coefficient on ACT in the false model was 0.039 The coefficient in the True model is 0.009 the coefficient falls by 77% Example : Class Size and Performance Data from 40 schools in CA Outcome is average on state test for reading and math in 6 th grade Average scores around 650 for state Key covariate: student/teacher ratio SCORE STR i 0 i i 5 6 4

Scatter Plot: Student Teacher Ratio vs. Average Test Scores Scatter Plot: Student Teacher Ratio vs. Average Test Scores 70 70 700 700 Average Test Score 680 660 640 Average Test Score 680 660 640 60 60 600 0 4 6 8 0 4 6 8 600 0 4 6 8 0 4 6 8 Student-Teacher ratio Student-Teacher ratio 7 8. * run regression with one variable. reg average_score student_teacher Source SS df MS Number of obs = 40 -------------+------------------------------ F(, 48) =.58 Model 7794.004 7794.004 Prob > F = 0.0000 Residual 4435.484 48 345.5353 R-squared = 0.05 -------------+------------------------------ Adj R-squared = 0.0490 Total 509.594 49 363.030056 Root MSE = 8.58 average_sc~e Coef. Std. Err. t P> t [95% Conf. Interval] student_te~r -.79808.479856-4.75 0.000-3.98 -.336637 _cons 698.933 9.46749 73.8 0.000 680.33 77.548 A one student increase in class size will reduce average Scores by.8 points Omitted variables Class size is but one covariate we could add Consider others that might be correlated with X that are omitted from model Example: % ESL These students tend to score lower on tests If they are also more or less likely to be in more crowded schools, then results could be biased Increasing a class size by 5 will reduce average scores By 5(.8)=.4 which is.4/654=.07 or by.7% 9 0 5

SCORE STR ESL i 0 i i i Think of this as the true model, E[ ] ˆ x x i 0 i i (7) ˆ n i ( x x )( x x ) i i n i ( x x ) i Scatter Plot: % ESL vs. Average Test Scores Scatter Plot: % ESL vs. Average Test Scores 70 70 700 700 Average Test Score 680 660 640 Average Test Score 680 660 640 60 60 600 0 0 0 30 40 50 60 70 80 90 600 0 0 0 30 40 50 60 70 80 90 % ESL % ESL 3 4 6

Scatter Plot: Student Teacher Ratio vs. % ESL Scatter Plot: Student Teacher Ratio vs. % ESL % ESL 90 80 70 60 50 40 30 0 0 0 0 4 6 8 0 4 6 8 Student-Teacher ratio 5 % ESL 90 80 70 60 50 40 30 0 0 0 0 4 6 8 0 4 6 8 Student-Teacher ratio 6 averag~e studen~r esl_pct -------------+--------------------------- average_sc~e.0000 student_te~r -0.64.0000 esl_pct -0.644 0.876.0000 and we have shown that ˆ 0 E[ ] ˆ then E[ ] we anticipate that 0 On average, the value we estimated in the False model will be smaller than the one in the true model 7 8 7

Think of the prediction this way ˆ E[ ] ˆ β 0 ( ) ( )( ) 9 In the single variable model, the Student/teacher ratio is picking up two effects Larger class sizes reduce performance ESL students are more likely to be in more crowded schools, and they that tend to have lower scores Therefore, the model without ESL will estimate a too large of a negative number 30. * run multivariate regression. reg average_score student_teacher esl_pct Source SS df MS Number of obs = 40 -------------+------------------------------ F(, 47) = 55.0 Model 64864.30 343.506 Prob > F = 0.0000 Residual 8745.95 47 09.35 R-squared = 0.464 -------------+------------------------------ Adj R-squared = 0.437 Total 509.594 49 363.030056 Root MSE = 4.464 average_sc~e Coef. Std. Err. t P> t [95% Conf. Interval] student_te~r -.096.380783 -.90 0.004 -.848797 -.3537945 esl_pct -.6497768.039345-6.5 0.000 -.77 -.57443 _cons 686.03 7.43 9.57 0.000 67.464 700.6004 5 student increase in class size reduces test scores by 5(.) = 5.5 which is 5.5/654= 0.008 or.8% -- half the Estimate impact as before A one percentage point increase in % ESL in school Will reduce average scores by.64 points 3. * demonstrate the partialing out. * nature of mv regressions.. * run a regression of STR on ESL. * output the residuals. reg student_teacher esl Source SS df MS Number of obs = 40 -------------+------------------------------ F(, 48) = 5.5 Model 5.79978 5.79978 Prob > F = 0.000 Residual 446.7809 48 3.469878 R-squared = 0.035 -------------+------------------------------ Adj R-squared = 0.039 Total 499.5808 49 3.5789584 Root MSE =.8604 student_te~r Coef. Std. Err. t P> t [95% Conf. Interval] esl_pct.0943.0049704 3.9 0.000.009649.0983 _cons 9.3343.99307 6. 0.000 9.09858 9.57006.. * output residuals. predict res_str, residual 3 8

. * run a regression of test scores. * on the student_teacher residuals. reg average_score res_str Source SS df MS Number of obs = 4 -------------+------------------------------ F(, 48) = 4. Model 754.739 754.739 Prob > F = 0.0 Residual 50354.86 48 359.70065 R-squared = 0.0 -------------+------------------------------ Adj R-squared = 0.00 Total 509.594 49 363.030056 Root MSE = 8.9 ---------------------------------------------------------------------------- average_sc~e Coef. Std. Err. t P> t [95% Conf. Interva -------------+-------------------------------------------------------------- res_str -.096.498694 -. 0.08 -.084 -.8 _cons 654.565.95435 706.86 0.000 65.3375 655.97 ---------------------------------------------------------------------------- Exact same number as before 33 9