Soci Statistics for Sociologists

Similar documents
Introduction of STATA

. *increase the memory or there will problems. set memory 40m (40960k)

The Multivariate Regression Model

Week 10: Heteroskedasticity

Compartmental Pharmacokinetic Analysis. Dr Julie Simpson

Stata v 12 Illustration. One Way Analysis of Variance

AcaStat How To Guide. AcaStat. Software. Copyright 2016, AcaStat Software. All rights Reserved.

Interpreting and Visualizing Regression models with Stata Margins and Marginsplot. Boriana Pratt May 2017

ROBUST ESTIMATION OF STANDARD ERRORS

Survey commands in STATA

Examples of Using Stata v11.0 with JRR replicate weights Provided in the NHANES data set

Regression Analysis I & II

Timing Production Runs

BIOSTATS 640 Spring 2017 Stata v14 Unit 2: Regression & Correlation. Stata version 14

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, Last revised March 28, 2015

F u = t n+1, t f = 1994, 2005

This example demonstrates the use of the Stata 11.1 sgmediation command with survey correction and a subpopulation indicator.

1. Contingency Table (Cross Tabulation Table)

Regression diagnostics

Outliers Richard Williams, University of Notre Dame, Last revised April 7, 2016

Spreadsheets in Education (ejsie)

ECON Introductory Econometrics Seminar 6

Unit 6: Simple Linear Regression Lecture 2: Outliers and inference

for var trstprl trstlgl trstplc trstplt trstep: reg X trust10 stfeco yrbrn hinctnt edulvl pltcare polint wrkprty

The Dummy s Guide to Data Analysis Using SPSS

CHAPTER 10 REGRESSION AND CORRELATION

CHAPTER 8 T Tests. A number of t tests are available, including: The One-Sample T Test The Paired-Samples Test The Independent-Samples T Test

DIGITAL VERSION. Microsoft EXCEL Level 2 TRAINER APPROVED

This is a quick-and-dirty example for some syntax and output from pscore and psmatch2.

Multiple Imputation and Multiple Regression with SAS and IBM SPSS

The Multivariate Dustbin

Quantitative Analysis Using Statistics for Forecasting and Validity Testing. Course #6300/QAS6300 Course Material

Quantification of Harm -advanced techniques- Mihail Busu, PhD Romanian Competition Council

Final Exam Spring Bread-and-Butter Edition

Homework I: Stata Guide

Introduction to Statistics. Measures of Central Tendency and Dispersion

Psych 5741/5751: Data Analysis University of Boulder Gary McClelland & Charles Judd

Appendix C: Lab Guide for Stata

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction

Center for Demography and Ecology

Simple Linear Regression

CHAPTER Activity Cost Behavior

Probability Of Booking

Measurement and sampling

CHAPTER. Activity Cost Behavior

Understanding and accounting for product

Stata version 12. Lab Session 2 April 2013

PARENT REPORT. Tables and Graphs Report for WNV and WIAT-II. Copyright 2008 by Pearson Education, Inc. or its affiliate(s). All rights reserved.

ENGG1811: Data Analysis using Excel 1

APPLICATION OF SEASONAL ADJUSTMENT FACTORS TO SUBSEQUENT YEAR DATA. Corresponding Author

Questionnaire. (3) (3) Bachelor s degree (3) Clerk (3) Third. (6) Other (specify) (6) Other (specify)

TEACHER NOTES MATH NSPIRED

Quadratic Regressions Group Acitivity 2 Business Project Week #4

EXPERIMENTAL INVESTIGATIONS ON FRICTION WELDING PROCESS FOR DISSIMILAR MATERIALS USING DESIGN OF EXPERIMENTS

ANALYSING QUANTITATIVE DATA

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Getting Started with OptQuest

Logistic Regression, Part III: Hypothesis Testing, Comparisons to OLS

The relationship between innovation and economic growth in emerging economies

Nested or Hierarchical Structure School 1 School 2 School 3 School 4 Neighborhood1 xxx xx. students nested within schools within neighborhoods

Calculating the Standard Error of Measurement

78 Part II Σigma Freud and Descriptive Statistics

Why Learn Statistics?

Overview. Presenter: Bill Cheney. Audience: Clinical Laboratory Professionals. Field Guide To Statistics for Blood Bankers

Building Better Models with. Pro. Business Analytics Using SAS Enterprise Guide and. SAS Enterprise Miner. Jim Grayson Sam Gardner Mia L.

Make the Jump from Business User to Data Analyst in SAS Visual Analytics

SECTION 11 ACUTE TOXICITY DATA ANALYSIS

Winsor Approach in Regression Analysis. with Outlier

49. MASS BALANCING AND DATA RECONCILIATION EXAMPLES

Chapter 3. Table of Contents. Introduction. Empirical Methods for Demand Analysis

Optimal Productivity And Design Of Production Quantity Of A Manufacturing Industry

Telecommunications Churn Analysis Using Cox Regression

How to Get More Value from Your Survey Data

LIR 832: MINITAB WORKSHOP

FROM: Dan Rubado, Evaluation Project Manager, Energy Trust of Oregon; Phil Degens, Evaluation Manager, Energy Trust of Oregon

A is used to answer questions about the quantity of what is being measured. A quantitative variable is comprised of numeric values.

Performance and regression analysis of thermoelectric generator

Departamento de Ciências Atmosféricas Universidade de São Paulo Phone: Fax:

Hierarchical Linear Modeling: A Primer 1 (Measures Within People) R. C. Gardner Department of Psychology

Post-Estimation Commands for MLogit Richard Williams, University of Notre Dame, Last revised February 13, 2017

Grades in a Diploma Plus School

PROJECT PROPOSAL: OPTIMIZATION OF A TUNGSTEN CVD PROCESS

IBM SPSS Forecasting 19

CE 115 Introduction to Civil Engineering Graphics and Data Presentation Application in CE Materials

BIO 226: Applied Longitudinal Analysis. Homework 2 Solutions Due Thursday, February 21, 2013 [100 points]

Quantitative Methods. Presenting Data in Tables and Charts. Basic Business Statistics, 10e 2006 Prentice-Hall, Inc. Chap 2-1

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization

Introduction to Statistics. Measures of Central Tendency

Scoring Assistant SAMPLE REPORT

How to view Results with Scaffold. Proteomics Shared Resource

Analysis of Factors Affecting Resignations of University Employees

Carlo Fabrizio D'Amico

Basic Statistics, Sampling Error, and Confidence Intervals

Research Article Minimum Porosity Formation in Pressure Die Casting by Taguchi Method

CHAPTER V RESULT AND ANALYSIS

Weka Evaluation: Assessing the performance

HTS Report. d2-r. Test of Attention Revised. Technical Report. Another Sample ID Date 14/04/2016. Hogrefe Verlag, Göttingen

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Supplementary information for A comparative study of carbon-platinum hybrid nanostructure architecture for amperometric biosensing

SHIPPING COST, SPOT RATE AND EBAY AUCTIONS: A STRUCTURED EQUATION MODELING APPROACH

Transcription:

University of North Carolina Chapel Hill Soci708-001 Statistics for Sociologists Fall 2009 Professor François Nielsen Stata Commands for Module 11 Multiple Regression For further information on any command in this handout, simply type help followed by the name of the command in Stata. See also the Stata and SAS Guide pdf (click on Documents in side bar; guide is linked under Software Documentation). 1 Statistical Functions in Stata The following statistical functions in Stata are useful for regression work. The regression printout itself usually comprises all necessary statistics. 1.1 Normal Distribution Functions The function normal(z) returns P(Z z), the area under the standard normal curve to the left of z. (Compare with Table A.). display normal(1.207).88628393 The function invnormal(p) returns z such that P(Z z) = p, i.e. z such that the area under the standard normal curve to the left of z is p. (Compare with Table A and Table D, bottom row.). display invnormal(0.975) 1.959964 1.2 Student t Distribution Functions The function ttail(df, t) returns P(T > t), the area under the Student s t distribution with df degrees of freedom to the right of t. (Compare with Table D.). display ttail(7, 1.960).04540985 The function invttail(df, p) returns t such that P(T > t) = p, i.e. t such that the area under Student s t distribution with df degrees of freedom to the right of t is p. (Compare with Table D.). display invttail(7, 0.025) 2.3646243 1

1.3 F Distribution Functions The function Ftail(n1, n2, f) returns P(F > f ), the area under the F distribution with n1 and n2 degrees of freedom to the right of f. (Compare with Table E.). display Ftail(1, 14, 21.55).00038068 The function invftail(n1, n2, p) returns f such that P(F > f ) = p, i.e. f such that the area under the F distribution with n1 and n2 degrees of freedom to the right of f is p. (Compare with Table E.). display invftail(1, 14,.00038068) 21.549979 2 Descriptive Statistics, Correlations, and Scatterplot Matrix I am using as an example the CSDATA from IPS6e (see Appendix D-2 for description). The units are 224 Computer Science majors at a large university. To enter the data in Stata I retrieved the csdata.xls file in the CD-ROM, selected the data and copied them to the clipboard (Ctrl-C). Then in Stata I opened the Data Editor (Data -> Data Editor) and pasted the data (Ctrl-V). Then I closed the Data Editor by clicking on. (You can save the data as a *.dta file if desired with File -> Save...) Then I listed the first 5 cases.. list in 1/5 +--------------------------------------------------+ obs gpa hsm hss hse satm satv sex -------------------------------------------------- 1. 1 3.32 10 10 10 670 600 1 2. 2 2.26 6 8 5 700 640 1 3. 3 2.35 8 6 8 640 530 1 4. 4 2.08 9 10 7 670 600 1 5. 5 3.38 8 9 8 540 580 1 +--------------------------------------------------+ The response variable of interest is grade point average after three semesters (gpa). The explanatory variables are high school grades in mathematics (hsm), science (hss) and English or language arts (hse); SAT score in math (satm) and verbal (satv); and sex (1=male, 2=female). First I produced descriptive statistics for all the variables I intend to put in the regression with the command su (for summarize).. su hsm hss hse satm satv gpa Variable Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- hsm 224 8.321429 1.638737 2 10 hss 224 8.089286 1.699663 3 10 hse 224 8.09375 1.507874 3 10 satm 224 595.2857 86.40144 300 800 satv 224 504.5491 92.61046 285 760 -------------+-------------------------------------------------------- gpa 224 2.635223.7793949.12 4 2

Then I produced a matrix of correlation coefficients. Note that I put the response variable gpa last, to have the corresponding correlations all on the same row, same as in the scatterplot matrix to be produced next.. corr hsm hss hse satm satv gpa (obs=224) hsm hss hse satm satv gpa -------------+------------------------------------------------------ hsm 1.0000 hss 0.5757 1.0000 hse 0.4469 0.5794 1.0000 satm 0.4535 0.2405 0.1083 1.0000 satv 0.2211 0.2617 0.2437 0.4639 1.0000 gpa 0.4365 0.3294 0.2890 0.2517 0.1145 1.0000 Then I produced a scatterplot matrix with the following command. Note that I place the response variable gpa last in the list so all the scatterplots will be on the last row, with gpa on the vertical axis and the explanatory variable on the horizontal axis of each panel. I used the option half to show only the lower half of the matrix, reducing the complexity of the graph and making it more easily comparable to the correlation matrix, and option xsize(4) ysize(4) to make the graph 4 in square. The graph is shown in Figure 1.. graph matrix hsm hss hse satm satv gpa, half xsize(4) ysize(4) 3 Multiple Regression To do the multiple regression use the reg command with the list of variables, starting with the response variable gpa.. reg gpa hsm hss hse satm satv Source SS df MS Number of obs = 224 -------------+------------------------------ F( 5, 218) = 11.69 Model 28.6436439 5 5.72872878 Prob > F = 0.0000 Residual 106.819145 218.489996078 R-squared = 0.2115 -------------+------------------------------ Adj R-squared = 0.1934 Total 135.462789 223.607456452 Root MSE =.7 ------------------------------------------------------------------------------ gpa Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- hsm.1459611.039261 3.72 0.000.0685814.2233407 hss.0359053.0377984 0.95 0.343 -.0385918.1104024 hse.0552926.0395687 1.40 0.164 -.0226936.1332787 satm.0009436.0006857 1.38 0.170 -.0004078.002295 satv -.0004078.0005919-0.69 0.492 -.0015744.0007587 _cons.3267187.3999964 0.82 0.415 -.4616365 1.115074 ------------------------------------------------------------------------------ By default the reg command calculates the 95% CI for each coefficient. You can change the confidence level with the level option, e.g. level(99) for a 99% CI. The full command would then be (output not shown): 3

Figure 1: Scatterplot matrix of variables for the gpa regression (CSDATA). Produced with Stata command graph matrix hsm hss hse satm satv gpa, half xsize(4) ysize(4) 4

. reg gpa hsm hss hse satm satv, level(99) To obtain standardized coefficients (in place of the confidence intervals shown by default) use the option beta.. reg gpa hsm hss hse satm satv, beta Source SS df MS Number of obs = 224 -------------+------------------------------ F( 5, 218) = 11.69 Model 28.6436439 5 5.72872878 Prob > F = 0.0000 Residual 106.819145 218.489996078 R-squared = 0.2115 -------------+------------------------------ Adj R-squared = 0.1934 Total 135.462789 223.607456452 Root MSE =.7 ------------------------------------------------------------------------------ gpa Coef. Std. Err. t P> t Beta -------------+---------------------------------------------------------------- hsm.1459611.039261 3.72 0.000.3068942 hss.0359053.0377984 0.95 0.343.0783004 hse.0552926.0395687 1.40 0.164.106973 satm.0009436.0006857 1.38 0.170.1046039 satv -.0004078.0005919-0.69 0.492 -.0484621 _cons.3267187.3999964 0.82 0.415. ------------------------------------------------------------------------------ Using the command vif after running a regression will calculate the variance inflation factors (VIF). These are measures of collinearity, the degree to which each explanatory variable is associated with all the other explanatory variables. A VIF above 10 is considered bothersome, but there is no VIF above 10 in this particular example. (We are not going to use vif in this class.). vif Variable VIF 1/VIF -------------+---------------------- hsm 1.88 0.530820 hss 1.88 0.532372 hse 1.62 0.617241 satm 1.60 0.626083 satv 1.37 0.731277 -------------+---------------------- Mean VIF 1.67 4 Analysis of Residuals To calculate the predicted values of gpa and the gpa residuals I can use the command predict, with the option xb and residuals, respectively, assigning variable names of my choice. Then to check the distribution of residuals I draw a histogram of the residuals (shown in Figure 2) and a normal quantile plot of the residuals (shown in Figure 3). The only fancy options I use is xsize(3.5) ysize(3.5) with the normal quantile plot, to make the plot square. Together with the straight line that Stata draws automatically the square format shows deviations of the plot from linearity better than a rectangular plot (compare with IPS6e Figure 11.5 p.620). We can see the left-skew in the distribution. 5

Figure 2: Distribution of residuals for the gpa regression (CSDATA). Produced by Stata command histogram gparesid, kdensity. predict gpapredict, xb. predict gparesid, residuals. histogram gparesid, kdensity (bin=14, start=-2.0649281, width=.26931122). qnorm gparesid, xsize(3.5) ysize(3.5) Now I want to create a residual plot (plot of residuals against predicted values). I use the following command. The plot is shown in Figure 4.. twoway (scatter gparesid gpapredict), yline(0) I am interested in the six cases I see on the graph that have low predicted gpa. To identify them I use the following command (make sure you understand why). What is the story of these students with low predicted gpa? You may want to refer to the descriptive statistics calculated earlier.. list if gpapredict<1.9 +-------------------------------------------------------------------------+ obs gpa hsm hss hse satm satv sex gpapre~t gparesid ------------------------------------------------------------------------- 8. 8 2 3 7 6 460 530 1 1.565587.434413 84. 84 3 4 3 4 620 560 1 1.596081 1.403919 107. 107 2.82 4 5 7 400 470 1 1.662885 1.157115 127. 127.12 4 6 6 630 490 1 1.852368-1.732368 182. 182 1.6 4 7 7 460 460 2 1.79539 -.1953901 ------------------------------------------------------------------------- 183. 183 2 2 4 6 300 290 2 1.258819.7411809 +-------------------------------------------------------------------------+ Now I want to calculate the standard error for the estimated mean response (SEˆµ in IPS6e; standard error of prediction for Stata), and the standard error of prediction for a future observation (SEŷ in IPS6e; standard error of forecast 6

Figure 3: Normal quantile plot of residuals for the gpa regression (CSDATA). Produced by Stata command qnorm gparesid, xsize(3.5) ysize(3.5) Figure 4: Residual plot for the gpa regression (CSDATA). Produced by Stata command twoway (scatter gparesid gpapredict), yline(0) 7

for Stata; see IPS6e p.594 for formulas). I check the values for the first 5 observations. Note that the SE of forecast is always larger than the SE for the mean response, as the SE of forecast contains individual variation in the response variable in addition to uncertainty about the mean response.. predict gpasepred, stdp. predict gpaseforecast, stdf. list gpapredict gpasepred gpaseforecast in 1/5 +--------------------------------+ gpapre~t gpasep~d gpasef~t -------------------------------- 1. 3.085806.0871256.7053984 2. 2.165682.1740038.7212998 3. 2.539919.0932556.7061818 4. 2.773967.1156441.7094855 5. 2.532883.0898973.7057461 +--------------------------------+ 8