Regression diagnostics

Biometry 755, Spring 2009

Introduction

Every statistical method is developed based on assumptions. The validity of results derived from a given method depends on how well the model assumptions are met. Many statistical procedures are robust, which means that only extreme violations of the assumptions impair the ability to draw valid conclusions. Linear regression falls in the category of robust statistical methods. However, this does not relieve the investigator of the burden of verifying that the model assumptions are met, or at least not grossly violated. In addition, it is always important to demonstrate how well the model fits the observed data, and this is assessed in part with the techniques we'll learn in this lecture.

Different types of residuals

Recall that the residuals in regression are defined as y_i - ŷ_i, where y_i is the observed response for the ith observation, and ŷ_i is the fitted response at x_i. There are other types of residuals that will be useful in our discussion of regression diagnostics. We define them below.

Raw residuals: r_i = y_i - ŷ_i

Standardized residuals: z_i = r_i / s, where s is the estimated error standard deviation (i.e. s = σ̂ = √MSE).

Studentized residuals: r*_i = z_i / √(1 - h_i), where h_i is called the leverage. (More later about the interpretation of h_i.)

Jackknife residuals: r_(-i) = r*_i · (s / s_(-i)), where s_(-i) is the estimated error standard deviation computed with the ith observation deleted.
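As a convenience (this block is an addition to the original slides, using only standard PROC REG OUTPUT keywords), all of these residual types and the leverages can be saved to a dataset in one pass. SAS's STUDENT= keyword corresponds to the studentized residuals defined above and RSTUDENT= to the jackknife residuals; the simple standardized residual z_i, which SAS does not output directly, can be formed in a DATA step from the raw residual and the Root MSE.

proc reg data = one noprint;
   model infrisk = los cult beds;
   /* R= raw residuals, STUDENT= studentized residuals,
      RSTUDENT= jackknife residuals, H= leverages h_i */
   output out = resids r = raw student = student
                       rstudent = jackknife h = leverage;
quit;

proc print data = resids(obs = 10);
   var infrisk raw student jackknife leverage;
run;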

Which residual to use?

The standardized, studentized and jackknife residuals are all scale independent and are therefore preferred to raw residuals. Of these, the jackknife residuals are the most sensitive for detecting outliers and are superior in terms of revealing other problems with the data. For that reason, most diagnostics rely on jackknife residuals. Whenever we have a choice in the residual analysis, we will select jackknife residuals.

Analysis of residuals - Normality

Recall that an assumption of linear regression is that the error terms are normally distributed, that is, ε ~ Normal(0, σ²). To assess this assumption, we will use the residuals to look at:

- histograms
- normal quantile-quantile (QQ) plots
- the Wilk-Shapiro (Shapiro-Wilk) test

Histogram and QQ plot of residuals in SAS

ods rtf style = Analysis;
ods graphics on;
ods select ResidualHistogram;
ods select QQPlot;
proc reg data = one plots(unpack);
   model infrisk = los cult beds;
quit;
ods graphics off;
ods rtf close;

Histogram of residuals in SAS

[Figure: histogram of the residuals]

What is a normal QQ plot?

Let q be a number between 0 and 1. The qth quantile of a distribution is that point, x, at which q × 100 percent of the data lie below x and (1 - q) × 100 percent of the data lie above x. Specially named quantiles include quartiles, deciles, etc. The quantiles of the standard normal distribution are well known. Here are a few with which you should be familiar.

   q       Quantile
   0.025   -1.96
   0.05    -1.645
   0.5      0
   0.95     1.645
   0.975    1.96

What is a normal QQ plot? (cont.)

If data come from a normal distribution, then the quantiles of their standardized values should be approximately equivalent to the known quantiles of the standard normal distribution. A normal QQ plot graphs the quantiles of the data against the known quantiles of the standard normal distribution. Since we expect the quantiles to be roughly equivalent, the QQ plot should follow the 45° reference line.
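For reference (an added illustration, not part of the original slides), a QQ plot of the jackknife residuals can also be produced directly with PROC UNIVARIATE's QQPLOT statement; the dataset and variable names follow the examples used throughout this lecture.

proc reg data = one noprint;
   model infrisk = los cult beds;
   output out = fitdata rstudent = jackknife;
quit;

proc univariate data = fitdata noprint;
   var jackknife;
   /* QQ plot against the normal distribution, with a reference line
      drawn through the estimated mean and standard deviation */
   qqplot jackknife / normal(mu = est sigma = est);
run;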

Normal QQ plot of residuals in SAS

[Figure: normal QQ plot of the residuals]

What a normal QQ plot shouldn't look like...

[Figure: a QQ plot with systematic departures from the 45° reference line]

The Wilk-Shapiro test

H_0: The data are normally distributed
H_A: The data are not normally distributed

proc reg data = one noprint;
   model infrisk = los cult beds;
   output out = fitdata rstudent = jackknife;
quit;
proc univariate data = fitdata normal;
   var jackknife;
run;

The Wilk-Shapiro test (cont.)

Tests for Normality

Test            Statistic        p Value
Shapiro-Wilk    W  0.994445      Pr < W  0.9347

We fail to reject the null hypothesis: there is insufficient evidence to conclude that the model errors are not normally distributed.

Identifying departures from homoscedasticity

We identify departures from homoscedasticity by plotting the residuals versus the predicted values.

ods html style = Journal;
ods graphics on;
ods select ResidualByPredicted;
proc reg data = one plots(unpack);
   model infrisk = los cult beds;
quit;
ods graphics off;
ods html close;

Departures from homoscedasticity (cont.)

[Figure: residuals versus predicted values]

The megaphone plot... (i.e. a violation)

[Figure: a residual plot whose spread fans out as the predicted values increase]

Outliers

Outliers are observations that are extreme in the sense that they are noticeably different from the other data points. What causes outliers?

1. Data entry errors (the biggest culprit!)
2. Data that do not represent a homogeneous set to which a single model applies. Rather, the data are a heterogeneous set of two or more types, of which one is more frequent than the others.
3. Error distributions with thick tails, in which extreme observations occur with greater frequency. (What does that mean?)

Outliers (cont.)

[Figure: "t dist has thicker tails than Normal" - density curves of the standard normal and the t distribution with 2 df]

Outliers (cont.)

[Figure: histograms and normal QQ plots of a sample from a Normal(0,1) distribution and a sample from a t distribution with 2 df]

Outliers (cont.)

Although linear regression is robust to departures from normality, this is not the case when the error distribution has thick tails. Ironically, sampling distributions that look quite different from a normal distribution cause little trouble, while thick-tailed distributions undermine the inference based on F tests.

Outlier detection - visual means

1. Simple scatterplots of the data (useful primarily for SLR)
2. Plots of the residuals versus the fitted values
3. Plots of the residuals versus each predictor

ods html style = analysis;
ods graphics on;
ods select ResidualByPredicted;
ods select ResidualPanel1;
proc reg data = one plots(unpack);
   model infrisk = los cult beds;
quit;
ods graphics off;
ods html close;

Outlier detection - Residuals vs. predicted values

[Figure: residuals plotted against the predicted values]

Outlier detection - Residuals vs. predictors

[Figure: residuals plotted against each predictor]

Outlier detection - numerical means

Some rules of thumb for jackknife residuals:

- Jackknife residuals with a magnitude less than 2 (i.e. between -2 and +2) are not unusual.
- Jackknife residuals with a magnitude greater than 2 deserve a look.
- Jackknife residuals with a magnitude greater than 4 are highly suspect.

Outlier detection in SAS

proc reg data = one;
   model infrisk = los cult beds;
   output out = fitdata rstudent = jackknife;
quit;
proc print data = fitdata;
   where abs(jackknife) > 2;
run;

Obs  INFRISK    LOS  CULT  BEDS  jackknife
  8      5.4  11.18  60.5   640   -2.66653
 35      6.3   9.74  11.4   221    2.29488
 53      7.6  11.41  16.6   535    2.60976
 63      5.4   7.93   7.5    68    2.20954
 96      2.5   8.54  27.0    98   -2.15224

Leverage and influence

The leverage of a data point refers to how extreme it is relative to x̄. Observations far away from x̄ are said to have high leverage. The influence of a data point refers to its impact on the fitted regression line. If an observation pulls the regression line away from the fit that would have resulted had that point not been present, then that observation is deemed influential.

[Figure: four panels illustrating the combinations of low/high leverage and low/high influence; dotted line = fit with the point, solid line = fit without the point]
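The panels above are easy to reproduce in spirit with a small simulation (a hypothetical sketch, added here and not part of the original slides): generate well-behaved data, append a single extreme point, and compare the fits with and without it. The /INFLUENCE option prints the hat diagonals and related statistics discussed on the next slides.

data demo;
   call streaminit(755);
   do i = 1 to 20;
      x = rand('uniform') * 4;              /* predictor values in [0, 4]   */
      y = 1 + x + rand('normal', 0, 0.5);   /* true line with mild noise    */
      output;
   end;
   x = 10; y = 2; output;                   /* one high-leverage, high-influence point */
run;

proc reg data = demo;
   model y = x / influence;                 /* prints h_i, RSTUDENT, DFFITS, etc. */
quit;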

Measuring leverage

For each data point, there is a corresponding quantity known as the hat diagonal, which measures the standardized distance of the ith observation from the center of the predictors. In symbols, the hat diagonal is written h_i. (We saw this term in the definition of the studentized residuals.) As a general rule of thumb, any value of h_i greater than 2(k+1)/n is cause for concern, where k is the number of predictors in the model and n is the number of data points.

Leverage and residual plots in SAS

ods html style = analysis;
ods graphics on;
ods select RStudentByLeverage;
proc reg data = one plots(unpack);
   model infrisk = los cult beds;
quit;
ods graphics off;
ods html close;

Leverage and residual plots in SAS

[Figure: RStudent (jackknife) residuals plotted against leverage]

Measuring influence

For each data point, there is a corresponding quantity known as Cook's D_i ("D" for "distance"), a standardized distance measuring how far β̂ moves when the ith observation is removed. As a general rule of thumb, any observation with a value of Cook's D_i greater than 4/n, where n is the number of observations, deserves a closer look.

Graphing of Cook's D in SAS

ods html style = analysis;
ods graphics on;
ods select CooksD;
proc reg data = one plots(unpack);
   model infrisk = los cult beds;
quit;
ods graphics off;
ods html close;

Graphing of Cook's D in SAS (cont.)

[Figure: Cook's D plotted for each observation]

Leverage/influence - numerical assessment

proc reg data = one;
   model infrisk = los cult beds;
   output out = fitdata cookd = cooksd h = hat;
quit;

* (2*4)/113 = 0.071;
proc print data = fitdata;
   where hat ge (2*4)/113;
run;

* 4/113 = 0.035;
proc print data = fitdata;
   where cooksd ge 4/113;
run;

Leverage/influence - numerical assessment (cont.)

Obs  INFRISK    LOS  CULT  BEDS   cooksd      hat  jackknife
  8      5.4  11.18  60.5   640  0.45427  0.21252   -2.66653
 11      4.9  11.07  28.5   768  0.03058  0.08368   -1.15910
 13      7.7  12.78  46.0   322  0.02080  0.09230    0.90371
 20      4.1   9.35  15.9   833  0.02697  0.11055   -0.93115
 46      4.6  10.16   8.4   831  0.00055  0.10609   -0.13510
 47      6.5  19.56  17.2   306  0.01081  0.30916   -0.30957
 54      7.8  12.07  52.4   157  0.04106  0.13462    1.02773
 78      4.9  10.23   9.9   752  0.00066  0.07977    0.17330
110      5.8   9.50  42.0    98  0.00083  0.08236    0.19197
112      5.9  17.94  26.4   835  0.20239  0.19506   -1.84793

All the graphics in one panel

ods html style = analysis;
ods graphics on;
proc reg data = one plots;
   model infrisk = los cult beds;
quit;
ods graphics off;
ods html close;

All the graphics in one panel (cont.)

[Figure: the default PROC REG diagnostics panel]

Collinearity

Collinearity is a problem that exists when some (or all) of the independent variables are strongly linearly associated with one another. If collinearity exists in your data, the following problems result:

- The estimated regression coefficients can be highly inaccurate.
- The standard errors of the coefficients can be highly inflated.
- The p-values, and all subsequent inference, can be wrong.

Symptoms of collinearity

- Large changes in coefficient estimates and/or in their standard errors when independent variables are added or deleted.
- Large standard errors.
- Non-significant results for independent variables that should be significant.
- Wrong signs on slope estimates.
- Overall test significant, but partial tests insignificant.
- Strong correlations between independent variables.
- Large variance inflation factors (VIFs).

Variance inflation factor

For each independent variable X_j in a model, the variance inflation factor is calculated as

VIF_j = 1 / (1 - R²_j)

where R²_j is the usual R² obtained by regressing X_j on all the other X's, and represents the proportion of variation in X_j that is explained by the remaining independent variables. Therefore, 1 - R²_j is a measure of the variability in X_j that isn't explained by the other independent variables.

Variance inflation factor (cont.)

When there is strong collinearity, R²_j will be large, 1 - R²_j will be small, and so VIF_j will be large. As a general rule of thumb, strong collinearity is present when VIF_j > 10.
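To see the formula in action (an added check, not part of the original slides; the auxiliary dataset names are invented for the example), one VIF can be computed by hand: regress the predictor on the other predictors, take that R², and invert 1 - R². For CENSUS in the three-predictor model analyzed on the following slides, this should agree, up to rounding, with the VIF that the /VIF option reports.

/* Step 1: regress CENSUS on the other predictors. The EDF option adds
   the model R-square (_RSQ_) to the OUTEST= dataset. */
proc reg data = one outest = aux edf noprint;
   model census = los beds;
quit;

/* Step 2: apply VIF_j = 1 / (1 - R-square). */
data vifcheck;
   set aux;
   vif_census = 1 / (1 - _RSQ_);
run;

proc print data = vifcheck;
   var _RSQ_ vif_census;
run;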

Collinearity example

Consider the MLR of INFRISK on LOS, CENSUS and BEDS.

ods html;
ods graphics on;
proc corr data = one plots = matrix;
   var los census beds;
run;
ods graphics off;
ods html close;

proc reg data = one;
   model infrisk = los/vif;
   model infrisk = los census/vif;
   model infrisk = los census beds/vif;
quit;

Collinearity: PROC CORR output

[Figure: scatterplot matrix of LOS, CENSUS and BEDS]

Collinearity: PROC CORR output (cont.)

Pearson Correlation Coefficients, N = 113
Prob > |r| under H0: Rho = 0

                             LOS      CENSUS        BEDS
  LOS                    1.00000     0.47389     0.40927
  LENGTH OF STAY                      <.0001      <.0001

  CENSUS                 0.47389     1.00000     0.98100
  AVG DAILY CENSUS        <.0001                  <.0001

  BEDS                   0.40927     0.98100     1.00000
  NUMBER OF BEDS          <.0001      <.0001

Collinearity: PROC REG output

INFRISK REGRESSED ON LOS

Root MSE          1.13929    R-Square  0.2846
Dependent Mean    4.35487    Adj R-Sq  0.2781
Coeff Var        26.16119

Parameter Estimates

                   Parameter   Standard                        Variance
Variable      DF    Estimate      Error  t Value  Pr > |t|    Inflation
Intercept      1     0.74430    0.55386     1.34    0.1817            0
LOS            1     0.37422    0.05632     6.64    <.0001      1.00000

Collinearity: PROC REG output (cont.)

INFRISK REGRESSED ON LOS AND CENSUS

Root MSE          1.12726    R-Square  0.3059
Dependent Mean    4.35487    Adj R-Sq  0.2933
Coeff Var        25.88504

Parameter Estimates

                   Parameter    Standard                        Variance
Variable      DF    Estimate       Error  t Value  Pr > |t|    Inflation
Intercept      1     0.99950     0.56531     1.77    0.0798            0
LOS            1     0.31908     0.06328     5.04    <.0001      1.28960
CENSUS         1     0.00145  0.00078668     1.84    0.0687      1.28960

Collinearity: PROC REG output (cont.)

INFRISK REGRESSED ON LOS, CENSUS AND BEDS

Root MSE          1.12953    R-Square  0.3094
Dependent Mean    4.35487    Adj R-Sq  0.2904
Coeff Var        25.93727

Parameter Estimates

                   Parameter   Standard                        Variance
Variable      DF    Estimate      Error  t Value  Pr > |t|    Inflation
Intercept      1     0.82296    0.61382     1.34    0.1828            0
LOS            1     0.33538    0.06706     5.00    <.0001      1.44245
CENSUS         1    -0.00142    0.00392    -0.36    0.7177     31.90045
BEDS           1     0.00225    0.00302     0.75    0.4569     29.71362

Once CENSUS and BEDS are both in the model, the classic symptoms appear: their VIFs explode (31.9 and 29.7), the sign on CENSUS flips, and neither coefficient is significant, reflecting their near-perfect correlation (r = 0.98100).