Unit 2 Regression and Correlation 2 of 2 - Practice Problems SOLUTIONS Stata Users


Data Set for this Assignment

Download from the course website:
Stata Users: framingham_1000.dta

Source: Levy (1999) National Heart Lung and Blood Institute. Center for Bio-Medical Communication. Framingham Heart Study

Description: Cardiovascular disease (CVD) is the leading cause of death and serious illness in the United States. In 1948, the Framingham Heart Study, under the direction of the National Heart Institute (now known as the National Heart, Lung, and Blood Institute or NHLBI), was initiated. The objective of the Framingham Heart Study was to identify the common factors or characteristics that contribute to CVD by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke. Here we use a subset of the data comprised of information on 9 variables in a subset of n=1000.

Variable   Label                            Codings
sbp        Systolic Blood Pressure (mm Hg)
ln_sbp     Natural logarithm of sbp         ln_sbp=ln(sbp)
age        Age, years
bmi        Body Mass Index (kg/m²)
ln_bmi     Natural logarithm of bmi         ln_bmi=ln(bmi)
sex        Gender                           1=male 2=female
female     Female Indicator                 0=male 1=female
scl        Serum Cholesterol (mg/100 ml)
ln_scl     Natural logarithm of scl         ln_scl=ln(scl)

Multiple Regression Variables:
Outcome: Y = ln_sbp
Predictor Variables: ln_bmi, ln_scl, age, sex

Research Question: From among these 4 candidate predictors, what are the important risk factors and what is the nature of their association with Y=ln_sbp?

sol_regression(2 of 2) stata.docx Page 1 of 8

1. From the course website, download to your desktop the data set framingham_1000.dta. Launch Stata and open framingham_1000.dta.

. * From top menu bar, at left: FILE > OPEN

Tip! At the start of your Stata session, do the following preliminaries:
a) In the command line, turn off automatic pausing of output. Type: set more off
b) Start a log of your session. Take care to select the format .log. From the menu bar at top: FILE > LOG > BEGIN

2. Using an appropriate command, check your data set. How many observations does it have? What are the variable names? What are the variable types? Etc.

. * Describe/check data set
. describe

Contains data from /Users/cbigelow/Desktop/framingham_1000.dta
  obs: 1,000
 vars: 9                    3 Feb :34
 size: 36,

variable name   storage type   display format   value label   variable label
sex             float          %9.0g            sex           Sex
sbp             float          %9.0g                          Systolic Blood Pressure
scl             float          %9.0g                          Serum Cholesterol
age             float          %9.0g                          Age in Years
bmi             float          %9.0g                          Body Mass Index
id              float          %9.0g                          Subject id
ln_bmi          float          %9.0g                          Natural logarithm (bmi)
ln_sbp          float          %9.0g                          Natural logarithm (sbp)
ln_scl          float          %9.0g                          Natural logarithm (scl)

Sorted by: sex

. codebook, compact

Variable   Obs   Unique   Mean   Min   Max   Label
sex                                          Sex
sbp                                          Systolic Blood Pressure
scl                                          Serum Cholesterol
age                                          Age in Years
bmi                                          Body Mass Index
id                                           Subject id
ln_bmi                                       Natural logarithm (bmi)
ln_sbp                                       Natural logarithm (sbp)
ln_scl                                       Natural logarithm (scl)

3. Using an appropriate command, produce numerical descriptives for the following 3 continuous variables: ln_sbp, age, ln_bmi.

. tabstat ln_sbp age ln_bmi, statistics(n mean sd min q max) columns(statistics)

variable   N   mean   sd   min   p25   p50   p75   max
ln_sbp
age
ln_bmi

4. Using an appropriate command, produce frequency/relative frequency tables for the discrete variable sex.

. * Frequency/Relative frequencies. Preliminary (ONE TIME): install the command fre
. ssc install fre
checking fre consistency and verifying not already installed...

. fre sex

sex -- Sex
               Freq.   Percent   Valid   Cum
Valid  1 Men
       Women
       Total

5. Using an appropriate command, test the null hypothesis that the distribution of Y = ln_sbp is normal. In 1 sentence, interpret.

. * Test normality of distribution of Y=ln_sbp (NULL: normality)
. sfrancia ln_sbp

Shapiro-Francia W' test for normal data
Variable   Obs   W'   V'   z   Prob>z
ln_sbp     1,

. swilk ln_sbp

Shapiro-Wilk W test for normal data
Variable   Obs   W   V   z   Prob>z
ln_sbp     1,

Assumption of the null hypothesis of normality has led to an extremely statistically unlikely result (p=.00001). The null hypothesis of normality is rejected. Dear class, we're going to proceed anyway.

6. Using an appropriate command, create a 0/1 indicator of female gender. Name this variable female.

. * Create 0/1 indicator of female gender. Name this variable female. Check.
. generate female=sex
. recode female (1=0) (2=1)
(female: 1000 changes made)

. tab2 sex female

-> tabulation of sex by female

           female
Sex        0      1      Total
Men
Women
Total

7. Using an appropriate command, create a new variable that is the interaction of age and female. Name this variable age_female.

. * Create interaction of age and female. Name this variable age_female
. generate age_female=age*female

Note: Stata will not return any output at this point, unless you've made a mistake. Forge on.

8. Fit a multiple linear regression model of Y=ln_sbp that has the following 4 predictors: female, age, ln_bmi, age_female. Using the appropriate command(s), display the coefficients table and the analysis of variance.

. * Fit Y=ln_sbp to Predictors: female, age, ln_bmi, age_female. Display betas, anova.
. regress ln_sbp female age ln_bmi age_female

Source     | SS   df   MS          Number of obs =
Model      |                       F(4, 993)     =
Residual   |                       Prob > F      =
Total      |                       R-squared     =
                                   Adj R-squared =
                                   Root MSE      =

ln_sbp     | Coef.   Std. Err.   t   P>|t|   [95% Conf. Interval]
female     |
age        |
ln_bmi     |
age_female |
_cons      |
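To make the role of age_female in the step 8 model concrete, the fitted model can be written out and then evaluated separately for men and women. The coefficient symbols β0 through β4 are notation assumed here for illustration; they do not appear in the handout.

```latex
\begin{aligned}
\ln(\text{sbp}) &= \beta_0 + \beta_1\,\text{female} + \beta_2\,\text{age}
                 + \beta_3\,\ln(\text{bmi}) + \beta_4\,(\text{age}\times\text{female}) + \varepsilon \\[4pt]
\text{Men (female}=0\text{):}\quad
E[\ln(\text{sbp})] &= \beta_0 + \beta_2\,\text{age} + \beta_3\,\ln(\text{bmi}) \\[4pt]
\text{Women (female}=1\text{):}\quad
E[\ln(\text{sbp})] &= (\beta_0+\beta_1) + (\beta_2+\beta_4)\,\text{age} + \beta_3\,\ln(\text{bmi})
\end{aligned}
```

Under this parameterization, β1 shifts the intercept for women and β4 is the difference between the female and male slopes on age, so setting both to zero gives men and women the same regression line.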

9. Using an appropriate command, perform a 2-degree of freedom partial F test of the null hypothesis that, controlling for age and ln_bmi, the added predictors female and age_female are not important. In 1 sentence, interpret.

. * 2 df partial F test of adjusted significance of female and age_female (NULL: Zero)
. testparm female age_female

 ( 1)  female = 0
 ( 2)  age_female = 0

       F(  2,   993) =
            Prob > F =

Assumption of the null hypothesis that the coefficients of female and age_female are both zero, adjusted for age and ln_bmi, has led to an extremely statistically unlikely result (p <.00001). The null hypothesis is rejected. So we will keep these predictors in the model.

ASIDE: In case you are interested.

. * Illustration: 2 df LR test of adjusted significance of female and age_female
. * ---- Fit reduced model using option QUIETLY
. quietly: regress ln_sbp age ln_bmi
. estimates store reduced
. * Fit full model using option QUIETLY
. quietly: regress ln_sbp female age ln_bmi age_female
. estimates store full
. * Likelihood ratio test of adjusted significance of female and age_female
. lrtest reduced full

Likelihood-ratio test                          LR chi2(2)  =
(Assumption: reduced nested in full)           Prob > chi2 =

The conclusion of this chi square test is the same as that of the partial F test.

10. Preliminary to regression diagnostics, fit the model with predictors age, female, ln_bmi, age_female. Obtain selected post-estimation quantities.

. * ---- Preliminary to Diagnostics on Full model - Fit the full model
. regress ln_sbp female age ln_bmi age_female

Source     | SS   df   MS          Number of obs =
Model      |                       F(4, 993)     =
Residual   |                       Prob > F      =
Total      |                       R-squared     =
                                   Adj R-squared =
                                   Root MSE      =

ln_sbp     | Coef.   Std. Err.   t   P>|t|   [95% Conf. Interval]
female     |
age        |
ln_bmi     |
age_female |
_cons      |
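The partial F statistic reported by testparm in step 9 can also be reproduced by hand from the residual sums of squares of the full and reduced models. The following is a minimal sketch, assuming the variables female and age_female from steps 6 and 7 already exist; the scalar names (rss_full, df_full, rss_red, df_red, df_num, F_partial) are hypothetical and chosen only for this illustration.

```stata
* Fit the full model; keep its residual sum of squares and residual df
quietly regress ln_sbp female age ln_bmi age_female
scalar rss_full = e(rss)
scalar df_full  = e(df_r)

* Fit the reduced model on the same observations used by the full model
quietly regress ln_sbp age ln_bmi if e(sample)
scalar rss_red = e(rss)
scalar df_red  = e(df_r)

* Partial F = [(RSS_reduced - RSS_full)/(df_reduced - df_full)] / [RSS_full/df_full]
scalar df_num    = df_red - df_full
scalar F_partial = ((rss_red - rss_full)/df_num) / (rss_full/df_full)

* Report the statistic and its upper-tail p-value from the F distribution
display "F(" df_num ", " df_full ") = " F_partial
display "Prob > F = " Ftail(df_num, df_full, F_partial)
```

If the sketch is right, the displayed F and p-value should agree with the testparm output, with numerator df 2 and denominator df 993.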

. * -- save predicted values to yfitted using ONLY the observations used in model estimation
. predict yfitted if e(sample)==1, xb
(2 missing values generated)

. * save residuals to residuals using ONLY the observations used in model estimation
. predict residuals if e(sample)==1, residuals
(2 missing values generated)

. * save leverage values to leverages using ONLY the observations used in model estimation
. predict leverages if e(sample)==1, leverage
(2 missing values generated)

11. Check the model. Assess normality of the residuals using a hypothesis test, histogram, and qq plot.

. * Assess normality of residuals using test, histogram, and qq plot
. sfrancia residuals

Shapiro-Francia W' test for normal data
Variable    Obs   W'   V'   z   Prob>z
residuals

Assumption of the null hypothesis of normality has led to an extremely statistically unlikely result (p <.00001). The null hypothesis of normality is rejected by this test. But we will proceed anyway.

. histogram residuals, normal title("Distribution of Residuals") name(histogram, replace)
(bin=29, start= , width= )

. qnorm residuals, title("Normal QQ Plot of Residuals") name(qqplot, replace)
. graph combine histogram qqplot

And, actually, these graphs look pretty good. So perhaps the statistical significance of the Shapiro-Francia and Shapiro-Wilk tests reflects the very large sample size.

12. Check the model. Plot residuals v fitted and residuals v explanatory.

. * Plots of residual v fitted and residual v explanatory
. tabstat residuals, statistics(min max)

variable    min   max
residuals

. graph twoway (scatter residuals yfitted, symbol(d) msize(small)), subtitle("Residuals v Fitted") yline(0) ylabel(-.5(.25)1.00) name(p1, replace)
. graph twoway (scatter residuals age, symbol(d) msize(small)), subtitle("Residuals v Age") yline(0) ylabel(-.5(.25)1.00) name(p2, replace)
. graph twoway (scatter residuals ln_bmi, symbol(d) msize(small)), subtitle("Residuals v ln(bmi)") yline(0) ylabel(-.5(.25)1.00) name(p3, replace)
. graph combine p1 p2 p3, row(1)
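As a possible shortcut for step 12, Stata's built-in post-estimation commands rvfplot (residual versus fitted) and rvpplot (residual versus predictor) draw similar plots directly from the most recent regress fit, without first saving residuals and fitted values. This is a sketch of an alternative, not the handout's method; the graph names rvf, rvp_age, and rvp_bmi are hypothetical.

```stata
* Refit the full model so the post-estimation plots refer to it
quietly regress ln_sbp female age ln_bmi age_female

* Residuals versus fitted values, with a reference line at zero
rvfplot, yline(0) name(rvf, replace)

* Residuals versus each continuous predictor
rvpplot age, yline(0) name(rvp_age, replace)
rvpplot ln_bmi, yline(0) name(rvp_bmi, replace)

* Show the three panels side by side
graph combine rvf rvp_age rvp_bmi, rows(1)
```

The explicit graph twoway commands remain the better choice when the axis labels and marker styles need to match across panels.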

13. Check the model. Plot leverages v residuals and leverages v normed residuals squared.

. * Plots of leverage and normed residuals squared
. graph twoway (scatter leverages residuals, symbol(d)), subtitle("Leverages v Residuals") name(p4, replace)
. lvr2plot, subtitle("Leverages v Normed Residuals Squared") name(p5, replace)
. graph combine p4 p5, row(1)