COMPARING MODEL ESTIMATES: THE LINEAR PROBABILITY MODEL AND LOGISTIC REGRESSION

PLS 802
Spring 2018
Professor Jacoby

This handout shows the log of a Stata session that compares alternative estimates of a single model. The model predicts voting behavior in the 2004 presidential election, using the linear probability model and logistic regression. The dependent variable is a dichotomy, coded zero for Kerry voters and one for Bush voters.

There are six independent variables. Party identification and ideology are coded so that larger values indicate more Republican identifications and conservative affiliations, respectively. Issue attitudes and candidate trait assessments are coded so that larger values indicate more conservative issue stands and more positive views of Bush's personality, relative to Kerry's personality. Iraq war assessments are coded so that larger values indicate a more pessimistic view of the American war effort. Sociotropic economic judgments are coded so that larger values indicate more negative evaluations of the American economy.

The models shown below are adapted from analyses presented in:

Jacoby, William G. (2009) Ideology and Vote Choice in the 2004 Election. Electoral Studies 28.

(FIRST FEW LINES OMITTED)

. #delimit ;
delimiter now ;

. set more off;

> * Retrieve data on presidential voting in 2004
> * in Stata dataset, "vote04.dta". Describe
> * contents of dataset and obtain
> * summary statistics

. use vote04;

. describe;

Contains data from vote04.dta
  obs:         1,212
 vars:             8                          1 Apr :34
 size:        36,360
----------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------
caseid          int     %8.0g
libconr         byte    %8.0g                 Lib-con ideology
partyid         byte    %8.0g                 Party ID
vote            byte    %8.0g                 Presidential vote in 2004
traits          double  %12.0g                Cand trait assessment
iraq            byte    %8.0g                 Iraq war attitude
issues          double  %12.0g                Summary issue attitudes
ecjudge         double  %12.0g                Sociotropic economic beliefs
----------------------------------------------------------------------
Sorted by:

. summarize vote libconr partyid traits
>      iraq issues ecjudge;

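    Variable |   Obs     Mean     Std. Dev.     Min     Max
-------------+----------------------------------------------
        vote |
     libconr |
     partyid |
      traits |
        iraq |
      issues |
     ecjudge |
[numeric entries not preserved in this transcription]

> * Estimate linear probability model of
> * 2004 presidential vote choice, using
> * robust standard errors.

. regress vote libconr partyid traits
>      iraq issues ecjudge, vce(robust);

Linear regression                      Number of obs =    651
                                       F(6, 644)     =
                                       Prob > F      =
                                       R-squared     =
                                       Root MSE      =

-------------------------------------------------------------------
             |            Robust
        vote |   Coef.   Std. Err.    t    P>|t|   [95% Conf. Interval]
-------------+-----------------------------------------------------
     libconr |
     partyid |
      traits |
        iraq |
      issues |
     ecjudge |
       _cons |
-------------------------------------------------------------------
[numeric entries not preserved in this transcription]

> * Retrieve predicted values, create
> * dichotomous version, and compare
> * predicted versus actual votes

. predict yhat;
(option xb assumed; fitted values)
(307 missing values generated)

. summarize yhat;

    Variable |   Obs     Mean     Std. Dev.     Min     Max
-------------+----------------------------------------------
        yhat |
[numeric entries not preserved in this transcription]

. generate dichot_yhat = 0;

. replace dichot_yhat = 0 if yhat <= 0.5;
(0 real changes made)

. replace dichot_yhat = 1 if yhat > 0.5;
(740 real changes made)

. tabulate vote dichot_yhat,
>      cell chi2;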

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

Presidential |
vote in 2004 |     dichot_yhat
             |      0       1 |   Total
-------------+----------------+--------
           0 |
           1 |
-------------+----------------+--------
       Total |
[cell entries not preserved in this transcription]

          Pearson chi2(1) =            Pr =

. correlate dichot_yhat vote;
(obs=811)

             | dichot~t     vote
-------------+------------------
 dichot_yhat |
        vote |
[numeric entries not preserved in this transcription]

. display ^2;
[base value and result not preserved in this transcription]

> * Plot vote (jittered) versus predicted
> * values from the LPM, showing OLS line

. twoway (scatter vote yhat,
>      scheme(s1color)
>      jitter(3)
>      msymbol(oh)
>      mcolor(black)
>      ysize(4.5)
>      xsize(4.5)
>      xaxis(1 2) yaxis (1 2)
>      xtitle("predicted values from LPM", axis(1))
>      ytitle("vote for George W. Bush", axis(1))
>      ylabel(, axis(1) nogrid)
>      ylabel(, axis(2) nolabel)
>      xlabel(, axis(2) nolabel)
>      legend(off)
>      name(linfit, replace)
>      )
>      (lfit vote yhat)
>      ;

. graph export "linfit.pdf", replace;
(file linfit.pdf written in PDF format)

> * Standardize independent variables and
> * re-estimate the linear prob model

. egen libconr2 = std(libconr);
(292 missing values generated)

. egen partyid2 = std(partyid);
(17 missing values generated)

. egen traits2 = std(traits);
(20 missing values generated)

. egen iraq2 = std(iraq);
(6 missing values generated)

. egen issues2 = std(issues);
(6 missing values generated)

. egen ecjudge2 = std(ecjudge);
(2 missing values generated)

. regress vote libconr2 partyid2 traits2
>      iraq2 issues2 ecjudge2, vce(robust);

Linear regression                      Number of obs =    651
                                       F(6, 644)     =
                                       Prob > F      =
                                       R-squared     =
                                       Root MSE      =

-------------------------------------------------------------------
             |            Robust
        vote |   Coef.   Std. Err.    t    P>|t|   [95% Conf. Interval]
-------------+-----------------------------------------------------
    libconr2 |
    partyid2 |
     traits2 |
       iraq2 |
     issues2 |
    ecjudge2 |
       _cons |
-------------------------------------------------------------------
[numeric entries not preserved in this transcription]

> * Estimate logistic regression model for probability of
> * Bush vote, using same indep vars as in LPM of vote

. logit vote libconr partyid traits
>      iraq issues ecjudge;

Iteration 0:   log likelihood =
Iteration 1:   log likelihood =
Iteration 2:   log likelihood =
Iteration 3:   log likelihood =
Iteration 4:   log likelihood =
Iteration 5:   log likelihood =

Logistic regression                    Number of obs =    651
                                       LR chi2(6)    =
                                       Prob > chi2   =
Log likelihood =                       Pseudo R2     =

-------------------------------------------------------------------
        vote |   Coef.   Std. Err.    z    P>|z|   [95% Conf. Interval]
-------------+-----------------------------------------------------
     libconr |
     partyid |
      traits |
        iraq |
      issues |
     ecjudge |
       _cons |
-------------------------------------------------------------------
[numeric entries not preserved in this transcription]

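> * Re-estimate model to display odds
> * ratios instead of coefficients

. logistic vote libconr partyid traits
>      iraq issues ecjudge;

Logistic regression                    Number of obs =    651
                                       LR chi2(6)    =
                                       Prob > chi2   =
Log likelihood =                       Pseudo R2     =

-------------------------------------------------------------------
        vote | Odds Ratio  Std. Err.   z    P>|z|   [95% Conf. Interval]
-------------+-----------------------------------------------------
     libconr |
     partyid |
      traits |
        iraq |
      issues |
     ecjudge |
       _cons |
-------------------------------------------------------------------
[numeric entries not preserved in this transcription]

> * Retrieve predicted probabilities of
> * Bush vote, dichotomize, and compare
> * to actual Bush votes

. predict probbush, pr;
(307 missing values generated)

. summarize probbush;

    Variable |   Obs     Mean     Std. Dev.     Min     Max
-------------+----------------------------------------------
    probbush |
[numeric entries not preserved in this transcription]

. generate dichot_bush = 0;

. replace dichot_bush = 0 if probbush <= 0.5;
(0 real changes made)

. replace dichot_bush = 1 if probbush > 0.5;
(762 real changes made)

. tabulate vote dichot_bush,
>      cell chi2 ;

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+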

Presidential |
vote in 2004 |     dichot_bush
             |      0       1 |   Total
-------------+----------------+--------
           0 |
           1 |
-------------+----------------+--------
       Total |
[cell entries not preserved in this transcription]

          Pearson chi2(1) =            Pr =

. correlate dichot_bush vote;
(obs=811)

             | dichot~h     vote
-------------+------------------
 dichot_bush |
        vote |
[numeric entries not preserved in this transcription]

. display ^2;
[base value and result not preserved in this transcription]

> * Obtain linear predictor from logistic
> * regression, and plot probabilities of
> * Bush vote against linear predictions

. predict linpred, xb;
(307 missing values generated)

. twoway (scatter vote linpred,
>      scheme(s1color)
>      jitter(3)
>      msymbol(oh)
>      mcolor(black)
>      ysize(4.5)
>      xsize(4.5)
>      xaxis(1 2) yaxis (1 2)
>      xtitle("log(p/(1-p))", axis(1))
>      ytitle("vote for George W. Bush", axis(1))
>      ylabel(, axis(1) nogrid)
>      ylabel(, axis(2) nolabel)
>      xlabel(, axis(2) nolabel)
>      legend(off)
>      name(logprob, replace)
>      )
>      (function y = exp(x) / (1 + exp(x)),
>      range(linpred)
>      )
>      ;

. graph export "logprob.pdf", replace;
(file logprob.pdf written in PDF format)

> * Carry out logistic regression with
> * standardized independent variables

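. logistic vote libconr2 partyid2 traits2
>      iraq2 issues2 ecjudge2;

Logistic regression                    Number of obs =    651
                                       LR chi2(6)    =
                                       Prob > chi2   =
Log likelihood =                       Pseudo R2     =

-------------------------------------------------------------------
        vote | Odds Ratio  Std. Err.   z    P>|z|   [95% Conf. Interval]
-------------+-----------------------------------------------------
    libconr2 |
    partyid2 |
     traits2 |
       iraq2 |
     issues2 |
    ecjudge2 |
       _cons |
-------------------------------------------------------------------
[numeric entries not preserved in this transcription]

> * Compare vote predictions from
> * LPM and logistic regression

. tabulate dichot_yhat dichot_bush,
>      chi2 cell;

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

             |     dichot_bush
 dichot_yhat |      0       1 |   Total
-------------+----------------+--------
           0 |
           1 |
-------------+----------------+--------
       Total |
[cell entries not preserved in this transcription]

          Pearson chi2(1) =  1.1e+03   Pr =

. log close;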
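
A caution about the classification steps in the log, with a possible alternative. Stata treats a missing numeric value as larger than any nonmissing number, so a condition such as yhat > 0.5 is also true when yhat is missing. The log is consistent with this: both models are estimated on 651 cases, yet the comparisons with vote report obs=811, so the 160 voters who were dropped from estimation because of missing predictors appear to be counted as predicted Bush votes. The sketch below is one way to confine the comparison to the estimation sample. It is an illustration, not part of the original session: the names insample, dichot_yhat2, and dichot_bush2 are hypothetical, and it assumes Stata's default line delimiter and that a model fit to the same 651 cases is the most recent estimation command.

```stata
* Sketch only; insample, dichot_yhat2 and dichot_bush2 are hypothetical names.
* Assumes the default line delimiter (not "#delimit ;") and that the model
* of interest is the most recent estimation command, so e(sample) is valid.

* Flag the cases actually used in estimation.
generate byte insample = e(sample)

* Classify only those cases; all others stay missing instead of becoming 1.
generate byte dichot_yhat2 = (yhat > 0.5) if insample
generate byte dichot_bush2 = (probbush > 0.5) if insample

* Repeat the comparisons on the estimation sample only.
tabulate vote dichot_yhat2, cell chi2
tabulate vote dichot_bush2, cell chi2
tabulate dichot_yhat2 dichot_bush2, cell chi2
```

Because tabulate drops observations with missing values by default, each table would then be based on the 651 estimation cases.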

Figure 1: Scatterplot of Bush vote versus predicted values from the LPM. Data points are jittered and the bivariate OLS line is shown.
[Plot not reproduced. Horizontal axis: Predicted values from LPM. Vertical axis: Vote for George W. Bush. Additional label: Fitted values.]

Figure 2: Scatterplot of Bush vote versus linear prediction from logistic regression model. Points are jittered and the predicted probability of a Bush vote is plotted across the range of the linear prediction.
[Plot not reproduced. Horizontal axis: log(p/(1-p)). Vertical axis: Vote for George W. Bush.]
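
The curve in Figure 2 is the inverse of the logit transformation, the same expression passed to the function plot in the log. As a brief reminder of how the quantities in the session relate, the block below writes the relationships out. The notation is introduced here and is not in the handout: \eta stands for the linear predictor (linpred), p for the probability of a Bush vote (probbush), and x_1, ..., x_6 for the six independent variables.

```latex
% Logit link and its inverse (the curve drawn in Figure 2)
\[
  \eta \;=\; \log\!\left(\frac{p}{1-p}\right)
       \;=\; \beta_0 + \sum_{j=1}^{6} \beta_j x_j ,
  \qquad
  p \;=\; \frac{e^{\eta}}{1 + e^{\eta}} .
\]

% Odds ratios reported by -logistic- are exponentiated -logit- coefficients
\[
  \mathrm{OR}_j \;=\; e^{\beta_j} .
\]

% Linear probability model, for comparison
\[
  \Pr(\text{vote} = 1 \mid x) \;=\; \alpha_0 + \sum_{j=1}^{6} \alpha_j x_j .
\]
```

The logistic form keeps p strictly between 0 and 1 for any value of the linear predictor, whereas the fitted values from the linear probability model are not constrained to that interval.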