Unit 6: Simple Linear Regression Lecture 2: Outliers and inference

Similar documents
Problem Points Score USE YOUR TIME WISELY SHOW YOUR WORK TO RECEIVE PARTIAL CREDIT

Chapter 5 Regression

STATISTICS PART Instructor: Dr. Samir Safi Name:

Midterm Exam. Friday the 29th of October, 2010

STAT 350 (Spring 2016) Homework 12 Online 1

BUS105 Statistics. Tutor Marked Assignment. Total Marks: 45; Weightage: 15%

Regression diagnostics

4.3 Nonparametric Tests cont...

AP Statistics Scope & Sequence

Lecture (chapter 7): Estimation procedures

The Multivariate Regression Model

Business Statistics (BK/IBA) Tutorial 4 Exercises

R-SQUARED RESID. MEAN SQUARE (MSE) 1.885E+07 ADJUSTED R-SQUARED STANDARD ERROR OF ESTIMATE

Continuous Improvement Toolkit

Timing Production Runs

5 CHAPTER: DATA COLLECTION AND ANALYSIS

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Semester 2, 2015/2016

Foley Retreat Research Methods Workshop: Introduction to Hierarchical Modeling

EXPERIMENTAL INVESTIGATIONS ON FRICTION WELDING PROCESS FOR DISSIMILAR MATERIALS USING DESIGN OF EXPERIMENTS

= = Intro to Statistics for the Social Sciences. Name: Lab Session: Spring, 2015, Dr. Suzanne Delaney

Europ. Pharm., 5th Ed. (2005), Ch. 5.3, A completely randomised (0,4,4,4)-design - An in-vitro assay of influenza vaccines

Notes on PS2

EFFICACY OF ROBUST REGRESSION APPLIED TO FRACTIONAL FACTORIAL TREATMENT STRUCTURES MICHAEL MCCANTS

Two Way ANOVA. Turkheimer PSYC 771. Page 1 Two-Way ANOVA

Week 10: Heteroskedasticity

Chapter 7-3. For one of the simulated sets of data (from last time): n = 1000, ˆp Estimating a population mean, known Requires

rat cortex data: all 5 experiments Friday, June 15, :04:07 AM 1

Sec Two-Way Analysis of Variance. Bluman, Chapter 12 1

Quantitative Subject. Quantitative Methods. Calculus in Economics. AGEC485 January 23 nd, 2008

Lecture 2a: Model building I

Examples of Statistical Methods at CMMI Levels 4 and 5

Lab 1: A review of linear models

Econometric Forecasting in a Lost Profits Case

= = Name: Lab Session: CID Number: The database can be found on our class website: Donald s used car data

ST7002 Optional Regression Project. Postgraduate Diploma in Statistics. Trinity College Dublin. Sarah Mechan. FAO: Prof.

Correlation and Simple. Linear Regression. Scenario. Defining Correlation

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Survey commands in STATA

Final Exam Spring Bread-and-Butter Edition

PSC 508. Jim Battista. Dummies. Univ. at Buffalo, SUNY. Jim Battista PSC 508

ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics May 2014

JMP TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

SAARC Training Workshop Program Identification, Comparison and Scenario Based Application of Power Demand/ Load Forecasting Tools

Week 11: Collinearity

CHAPTER 5 RESULTS AND ANALYSIS

What is the general form of a regression equation? What is the difference between y and ŷ?

Estimation of haplotypes

FROM: Dan Rubado, Evaluation Project Manager, Energy Trust of Oregon; Phil Degens, Evaluation Manager, Energy Trust of Oregon

SOCY7706: Longitudinal Data Analysis Instructor: Natasha Sarkisian Two Wave Panel Data Analysis

The study obtains the following results: Homework #2 Basics of Logistic Regression Page 1. . version 13.1

SPSS Guide Page 1 of 13

Application: Effects of Job Training Program (Data are the Dehejia and Wahba (1999) version of Lalonde (1986).)

Multiple Imputation and Multiple Regression with SAS and IBM SPSS

Regression Analysis I & II

Data from a dataset of air pollution in US cities. Seven variables were recorded for 41 cities:

Example Analysis with STATA

Soci Statistics for Sociologists

AcaStat How To Guide. AcaStat. Software. Copyright 2016, AcaStat Software. All rights Reserved.

Experimental Design Day 2

Example Analysis with STATA

AP Statistics - Chapter 3 notes

ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics May 2011

energy usage summary (both house designs) Friday, June 15, :51:26 PM 1

MULTILOG Example #1. SUDAAN Statements and Results Illustrated. Input Data Set(s): DARE.SSD. Example. Solution

Simple Linear Regression

CHAPTER 6 ASDA ANALYSIS EXAMPLES REPLICATION SAS V9.2

Table. XTMIXED Procedure in STATA with Output Systolic Blood Pressure, use "k:mydirectory,

Center for Demography and Ecology

Sociology 7704: Regression Models for Categorical Data Instructor: Natasha Sarkisian. Preliminary Data Screening

Chapter 4: Foundations for inference. OpenIntro Statistics, 2nd Edition

Statistics: Data Analysis and Presentation. Fr Clinic II

Estimation and Confidence Intervals

Statistics 201 Summary of Tools and Techniques

STAT 2300: Unit 1 Learning Objectives Spring 2019

y x where x age and y height r = 0.994

Module 7: Multilevel Models for Binary Responses. Practical. Introduction to the Bangladesh Demographic and Health Survey 2004 Dataset.

Psy 420 Midterm 2 Part 2 (Version A) In lab (50 points total)

Why do we need statistics to study genetics and evolution?

. *increase the memory or there will problems. set memory 40m (40960k)

Estimation and Confidence Intervals

Loblolly Example: Correlation. This note explores estimating correlation of a sample taken from a bivariate normal distribution.

Estimation and Confidence Intervals

This chapter will present the research result based on the analysis performed on the

Econometric Analysis Dr. Sobel

Statistical Modelling for Social Scientists. Manchester University. January 20, 21 and 24, Modelling categorical variables using logit models

Multiple Regression. Dr. Tom Pierce Department of Psychology Radford University

Using Excel s Analysis ToolPak Add-In

A SIMULATION STUDY OF THE ROBUSTNESS OF THE LEAST MEDIAN OF SQUARES ESTIMATOR OF SLOPE IN A REGRESSION THROUGH THE ORIGIN MODEL

Categorical Predictors, Building Regression Models

Multiple Choice (#1-9). Circle the letter corresponding to the best answer.

AN EMPIRICAL STUDY OF QUALITY OF PAINTS: A CASE STUDY OF IMPACT OF ASIAN PAINTS ON CUSTOMER SATISFACTION IN THE CITY OF JODHPUR

WINDOWS, MINITAB, AND INFERENCE

More Multiple Regression

Bios 312 Midterm: Appendix of Results March 1, Race of mother: Coded as 0==black, 1==Asian, 2==White. . table race white

Statistical Modelling for Business and Management. J.E. Cairnes School of Business & Economics National University of Ireland Galway.

Biostatistics 208. Lecture 1: Overview & Linear Regression Intro.

December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 1

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, Last revised March 28, 2015

CONTRIBUTORY AND INFLUENCING FACTORS ON THE LABOUR WELFARE PRACTICES IN SELECT COMPANIES IN TIRUNELVELI DISTRICT AN ANALYSIS

Transcription:

Unit 6: Simple Linear Regression Lecture 2: Outliers and inference Statistics 101 Thomas Leininger June 18, 2013

Types of outliers in linear regression Types of outliers How do(es) the outlier(s) influence the least squares line? To answer this question think of where the regression line would be with and without the outlier(s). 20 10 0 5 0 5 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 2 / 26

Types of outliers in linear regression Types of outliers How do(es) the outlier(s) influence the least squares line? 0 5 10 2 0 2 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 3 / 26

Types of outliers in linear regression Some terminology Outliers are points that fall away from the cloud of points. Outliers that fall horizontally away from the center of the cloud are called leverage points. High leverage points that actually influence the slope of the regression line are called influential points. In order to determine if a point is influential, visualize the regression line with and without the point. Does the slope of the line change considerably? If so, then the point is influential. If not, then it s not. Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 4 / 26

Types of outliers in linear regression Influential points Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1. 6.0 w/ outliers w/o outliers log(light intensity) 5.5 5.0 4.5 4.0 3.6 3.8 4.0 4.2 4.4 4.6 log(temp) Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 5 / 26

Types of outliers in linear regression Types of outliers Question Which of the below best describes the outlier? (a) influential (b) leverage (c) none of the above (d) there are no outliers 20 0 20 40 2 0 2 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 6 / 26

Types of outliers in linear regression Types of outliers Does this outlier influence the slope of the regression line? 5 0 5 10 15 5 0 5 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 7 / 26

Types of outliers in linear regression Recap Question True or False? 1 Influential points always change the intercept of the regression line. 2 Influential points always reduce R 2. 3 It is much more likely for a low leverage point to be influential, than a high leverage point. 4 When the data set includes an influential point, the relationship between the explanatory variable and the response variable is always nonlinear. Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 8 / 26

Types of outliers in linear regression Recap (cont.) R = 0.08, R 2 = 0.0064 2 0 2 4 6 8 10 2 0 2 R = 0.79, R 2 = 0.6241 0 5 10 2 0 2 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 9 / 26

Understanding regression output from software Nature or nurture? In 1966 Cyril Burt published a paper called The genetic determination of differences in intelligence: A study of monozygotic twins reared apart? The data consist of IQ scores for [an assumed random sample of] 27 identical twins, one raised by foster parents, the other by the biological parents. foster IQ 130 120 110 100 90 80 70 R = 0.882 70 80 90 100 110 120 130 biological IQ Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 10 / 26

Question True or False? Inference for linear regression Understanding regression output from software Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 9.20760 9.29990 0.990 0.332 bioiq 0.90144 0.09633 9.358 1.2e-09 Residual standard error: 7.729 on 25 degrees of freedom Multiple R-squared: 0.7779, Adjusted R-squared: 0.769 F-statistic: 87.56 on 1 and 25 DF, p-value: 1.204e-09 1 For each 10 point increase in the biological twin s IQ, we would expect the foster twin s IQ to increase on average by 9 points. 2 Roughly 78% of the foster twins IQs can be accurately predicted by the model. 3 Foster twins with IQs higher than average IQs have biological twins with higher than average IQs as well. Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 11 / 26

HT for the slope Testing for the slope Question Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of IQ of the foster twin. What are the appropriate hypotheses? (a) H 0 : b 0 = 0; H A : b 0 0 (b) H 0 : β 1 = 0; H A : β 1 0 (c) H 0 : b 1 = 0; H A : b 1 0 (d) H 0 : β 0 = 0; H A : β 0 0 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 12 / 26

HT for the slope Testing for the slope (cont.) Estimate Std. Error t value Pr(> t ) (Intercept) 9.2076 9.2999 0.99 0.3316 bioiq 0.9014 0.0963 9.36 0.0000 We always use a t-test in inference for regression. Remember: Test statistic, T = point estimate null value SE Point estimate = b 1 is the observed slope. SE b1 is the standard error associated with the slope. Degrees of freedom associated with the slope is df = n 2, where n is the sample size. Remember: We lose 1 degree of freedom for each parameter we estimate, and in simple linear regression we estimate 2 parameters, β 0 and β 1. Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 13 / 26

HT for the slope Testing for the slope (cont.) Estimate Std. Error t value Pr(> t ) (Intercept) 9.2076 9.2999 0.99 0.3316 bioiq 0.9014 0.0963 9.36 0.0000 T = df = p value = Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 14 / 26

HT for the slope % College graduate vs. % Hispanic in LA What can you say about the relationship between of % college graduate and % Hispanic in a sample of 100 zip code areas in LA? Education: College graduate 1.0 Race/Ethnicity: Hispanic 1.0 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 Freeways No data 0.0 Freeways No data 0.0 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 15 / 26

HT for the slope % College educated vs. % Hispanic in LA - another look What can you say about the relationship between of % college graduate and % Hispanic in a sample of 100 zip code areas in LA? 100% % College graduate 75% 50% 25% 0% 0% 25% 50% 75% 100% % Hispanic Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 16 / 26

HT for the slope % College educated vs. % Hispanic in LA - linear model Question What is the interpretation of the slope in context of the problem? Estimate Std. Error t value Pr(> t ) (Intercept) 0.7290 0.0308 23.68 0.0000 %Hispanic -0.7527 0.0501-15.01 0.0000 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 17 / 26

HT for the slope % College educated vs. % Hispanic in LA - linear model Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA? Estimate Std. Error t value Pr(> t ) (Intercept) 0.7290 0.0308 23.68 0.0000 hispanic -0.7527 0.0501-15.01 0.0000 How reliable is this p-value if these zip code areas are not randomly selected? Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 18 / 26

CI for the slope Confidence interval for the slope Question Remember that a confidence interval is calculated as point estimate±me and the degrees of freedom associated with the slope in a simple linear regression is n 2. Which of the below is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins. Estimate Std. Error t value Pr(> t ) (Intercept) 9.2076 9.2999 0.99 0.3316 bioiq 0.9014 0.0963 9.36 0.0000 (a) 9.2076 ± 1.65 9.2999 (b) 0.9014 ± 2.06 0.0963 (c) 0.9014 ± 1.96 0.0963 (d) 9.2076 ± 1.96 0.0963 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 19 / 26

CI for the slope Recap Inference for the slope for a SLR model (only one explanatory variable): Hypothesis test: T = b 1 null value SE b1 df = n 2 Confidence interval: b 1 ± t df=n 2 SE b 1 The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable. The regression output gives b 1, SE b1, and two-tailed p-value for the t-test for the slope where the null value is 0. We rarely do inference on the intercept, so we ll be focusing on the estimates and inference for the slope. Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 20 / 26

CI for the slope Caution Always be aware of the type of data you re working with: random sample, non-random sample, or population. Statistical inference, and the resulting p-values, are meaningless when you already have population data. If you have a sample that is non-random (biased), the results will be unreliable. The ultimate goal is to have independent observations and you know how to check for those by now. Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 21 / 26

An alternative statistic Variability partitioning We considered the t-test as a way to evaluate the strength of evidence for a hypothesis test for the slope of relationship between x and y. However, we can also consider the variability in y explained by x, compared to the unexplained variability. Partitioning the variability in y to explained and unexplained variability requires analysis of variance (ANOVA). Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 22 / 26

An alternative statistic ANOVA output - Sum of squares Df Sum Sq Mean Sq F value Pr(>F) bioiq 1 5231.13 5231.13 87.56 0.0000 Residuals 25 1493.53 59.74 Total 26 6724.66 Sum of squares: SS Tot = (y ȳ) 2 = 6724.66 (total variability in y) SS Err = (y ŷ) 2 = e 2 i = 1493.53 (unexplained variability in residuals) SS Reg = (ŷ ȳ) 2 = SS Tot SS Err (explained variability in y) = 6724.66 1493.53 = 5231.13 Degrees of freedom: df Tot = n 1 = 27 1 = 26 df Reg = 1 (only β 1 counts here) df Res = df Tot df Reg = 26 1 = 25 Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 23 / 26

An alternative statistic ANOVA output - F-test Df Sum Sq Mean Sq F value Pr(>F) bioiq 1 5231.13 5231.13 87.56 0.0000 Residuals 25 1493.53 59.74 Total 26 6724.66 Mean squares: MS Reg = SS Reg = 5231.13 = 5231.13 df Reg 1 MS Err = SS Err = 1493.53 = 59.74 df Err 25 F-statistic: F (1,25) = MS Reg MS Err = 87.56 (ratio of explained to unexplained variability) The null hypothesis is β 1 = 0 and the alternative is β 1 0. With a large F-statistic, and a small p-value, we Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 24 / 26

An alternative statistic Revisit R 2 Remember, R 2 is the proportion of variability in y explained by the model: A large R 2 suggests a linear relationship between x and y exists. There are actually two ways to calculate R 2 : (1) From the definition: proportion of explained to total variability (2) Using correlation: square of the correlation coefficient Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 25 / 26

Inference for linear regression An alternative statistic ANOVA output - R 2 calculation foster IQ 130 120 110 100 90 80 70 R = 0.882 70 80 90 100 110 120 130 biological IQ Df Sum Sq Mean Sq F value Pr(>F) bioiq 1 5231.13 5231.13 87.56 0.0000 Residuals 25 1493.53 59.74 Total 26 6724.66 R 2 = explained variability total variability = SS Reg = 5231.13 0.78 (1) SS Tot 6724.66 R 2 = square of correlation coefficient = 0.881 2 0.78 (2) Statistics 101 (Thomas Leininger) U6 - L2: Outliers and inference June 18, 2013 26 / 26