Applying Regression Analysis


1 Applying Regression Analysis Jean-Philippe Gauvin Université de Montréal January

2 Goals for Today: What is regression? How do we do it? First hour, OLS: bivariate regression, multiple regression, interactions, regression diagnostics. Second hour, MLE/GLMs: logit, ordered logit, multinomial logit. All material is on my website:

3 Introducing Regression. Part I, The World of OLS: Bivariate Regression; Multiple Regression; Predicted Values; Categorical Predictors; Interactions; Regression Diagnostics. Part II, The World of MLE: Introducing MLE; Logit Models; Ordered Logit Models; Multinomial Logit Models.

4 What is a regression? In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or predictors).

5 What is a regression? In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or predictors). Ordinary least squares (OLS) regression is a family of regression models that fit a line by minimizing the residual sum of squares. It can only be used when dealing with a continuous dependent variable.

6 The Scatterplot. Figure: Effect of Age on Social Ideology (scatterplot; y-axis: social left-right ideology, x-axis: age).

7 The Basic Linear Model (OLS). Figure: Effect of Age on Social Ideology (scatterplot with fitted line; y-axis: social left-right ideology, x-axis: age).

8 Why a Line? It's accurate: it gives an idea of the general pattern of the data. It's easy to estimate: the computational cost is low, since a single formula can be used. It's easy to describe: it summarizes the relationship between variables with only one number, the slope. In other words, it's a very good summary of "what happens".

9 The Components of OLS. The model equation can be expressed as:
Y_i = α + β X_i + ε_i
where Y is the dependent variable, α is the intercept, β is the slope, X is the predictor, and ε is the error term.

10 The Components of OLS. The OLS equation can then be re-expressed as:
Ŷ_i = α + β X_i
Where is the error term? OLS aims to keep the residual ε as small as possible. In other words, Y_i = Ŷ_i + ε_i, so ε_i = Y_i − Ŷ_i. The OLS line is the one that minimizes the residuals.

11 What the Software Does. If OLS is the line that minimizes the residual sum of squares (RSS), how does it do that? How do we get a slope (β) and an intercept (α)?
RSS = Σ ε_i² = Σ (Y_i − Ŷ_i)²
β̂ = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)²
α̂ = Ȳ − β̂ X̄

12 Estimating the Regression Line

. egen Xbar = mean(X)
. egen Ybar = mean(Y)
. egen sumXY = sum((X-Xbar)*(Y-Ybar))
. egen sumX2 = sum((X-Xbar)^2)
. disp "Beta is " sumXY/sumX2
. disp "Alpha is " Ybar - (sumXY/sumX2)*Xbar

(displayed values lost in transcription)

13 Estimating the Regression Line

. egen Xbar = mean(X)
. egen Ybar = mean(Y)
. egen sumXY = sum((X-Xbar)*(Y-Ybar))
. egen sumX2 = sum((X-Xbar)^2)
. disp "Beta is " sumXY/sumX2
. disp "Alpha is " Ybar - (sumXY/sumX2)*Xbar

Or let Stata do it!

. reg Y X

[output: coefficient table for X and _cons with Coef., Std. Err., t, P>|t| and 95% CI; numeric values lost in transcription]

α is 10 and β is .56, which means Ŷ_i = 10 + .56(X_i).
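For comparison, the same by-hand computation can be done in R. This is a minimal sketch with simulated toy data (the vectors X and Y, and the true values 10 and .56 used in the simulation, are illustrative assumptions, not the course dataset):

# simulate toy data (illustrative only)
set.seed(42)
X <- rnorm(100)
Y <- 10 + 0.56 * X + rnorm(100)
# by-hand OLS, mirroring the egen steps above
xbar <- mean(X); ybar <- mean(Y)
beta  <- sum((X - xbar) * (Y - ybar)) / sum((X - xbar)^2)
alpha <- ybar - beta * xbar
c(alpha = alpha, beta = beta)
coef(lm(Y ~ X))  # should match, up to sampling noise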

14 The Basic Linear Model (OLS). Figure: Effect of Age on Social Ideology (fitted line with the intercept alpha and slope beta annotated; y-axis: social left-right ideology, x-axis: age).

15 The Assumptions of OLS. Sorry: the goal is not to get stars. Instead, we want the Best Linear Unbiased Estimators (BLUE), which rest on five assumptions:
1. The relationship is linear.
2. The errors are normally distributed.
3. The errors have constant variance (no heteroskedasticity, no autocorrelation).
4. X is fixed on repeated sampling (no selection bias).
5. No exact linear relationships between independent variables, and more observations than independent variables.

16 Other Considerations: Model Fit. R² is the most commonly used measure of fit. It is expressed as the ratio of the explained variance to the total variance; in other words, R² = s²_Ŷ / s²_Y. But what counts as a good fit in the social sciences? And what about other measures (AIC, BIC, etc.)?
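As a quick sketch of that definition, R² can be computed by hand in R and checked against summary(); this uses R's built-in cars dataset purely for illustration:

# R-squared as explained variance over total variance
m  <- lm(dist ~ speed, data = cars)   # built-in example data
y  <- model.response(model.frame(m))  # response on the estimation sample
r2 <- var(fitted(m)) / var(y)
c(by_hand = r2, from_summary = summary(m)$r.squared)  # identical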

17 Example 1: The Bivariate Regression

Stata: scatter harper leftright

R: plot(harper ~ leftright, main="Example 1", xlab="Ideology", ylab="Feelings for Harper")

Figure: scatterplot (y-axis: feelings about Stephen Harper, 0 to 100; x-axis: "Left/right: Where would you place yourself on the scale below?").

18 Example 1: The Bivariate Regression

Stata: twoway (scatter harper leftright, jitter(20)) (lfit harper leftright)

R: plot(jitter(harper, 10) ~ jitter(leftright, 4), main="Example 1", xlab="Ideology", ylab="Feelings for Harper")
   abline(lm(harper ~ leftright), col="red")

Figure: jittered scatterplot with fitted-values line (y-axis: feelings about Stephen Harper; x-axis: left/right self-placement).

19 Example 1: Stata

. reg harper leftright

[output: ANOVA table (Source, SS, df, MS for Model, Residual, Total) and fit statistics (Number of obs, F(1, 1465), Prob > F, R-squared, Adj R-squared, Root MSE); coefficient table for leftright and _cons with Coef., Std. Err., t, P>|t| and 95% CI; numeric values lost in transcription]

20 Example 1: Stata

. reg harper leftright

[same output as on the previous slide; numeric values lost in transcription]

F̂H = α̂ + β̂(LR), with the estimates from the coefficient table.

21 Example 1: R

> m1 <- lm(harper ~ leftright, data = data)
> summary(m1)

Call: lm(formula = harper ~ leftright)

[output: residual quartiles and coefficient table, with (Intercept) ** and leftright < 2e-16 ***; residual standard error on 1465 degrees of freedom; 2841 observations deleted due to missingness; Multiple R-squared: 0.219; F-statistic on 1 and 1465 DF, p-value < 2.2e-16; remaining numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

22 Multiple Regression. In a bivariate model, OLS fits a line in 2D space. In a multiple regression, it fits one slope per covariate in N-dimensional space. You could still draw a plane in 3D, but what about 4, 5 or 20 dimensions? We can nonetheless extend the bivariate logic to multiple predictors: each coefficient is the slope along its own dimension, holding the other predictors constant.

23 The Multiple Regression Equation. When extending the bivariate model equation to multiple predictors, we get:
Y_i = β_0 + β_1 X_1 + β_2 X_2 + … + β_n X_n + ε_i

24 The Multiple Regression Equation. When extending the bivariate model equation to multiple predictors, we get:
Y_i = β_0 + β_1 X_1 + β_2 X_2 + … + β_n X_n + ε_i
Ex. 2: Does age also have an effect on feelings toward Harper? The OLS equation should read:
FH = β_0 + β_1 LR + β_2 Age + ε

25 Example 2. The Multiple Regression. Figure: predictions from the multiple regression over ideology and age, with slope annotations (B = 0.12; the second B value was lost in transcription).

26 Example 2. Stata

. reg harper leftright age

[output: ANOVA table and fit statistics (Number of obs, F(2, 1437), Prob > F, R-squared, Adj R-squared, Root MSE); coefficient table for leftright, age and _cons with Coef., Std. Err., t, P>|t| and 95% CI; numeric values lost in transcription]

27 Example 2. R

> m2 <- lm(harper ~ leftright + age, data=data)
> summary(m2)

Call: lm(formula = harper ~ leftright + age, data = data)

[output: coefficient table for (Intercept), leftright (<2e-16 ***) and age; residual standard error on 1437 degrees of freedom; 2868 observations deleted due to missingness; F-statistic on 2 and 1437 DF, p-value < 2.2e-16; remaining numeric values lost in transcription]

28 Predicting Ŷ? What if we want to predict a value on this new slope? Can we just plug values into the equation? What if we want to predict Ŷ when leftright is 1 and age is 50? Is it that easy?

29 Predicting Ŷ? What if we want to predict a value on this new slope? Can we just plug values into the equation? What if we want to predict Ŷ when leftright is 1 and age is 50? Is it that easy?
F̂H = β̂_0 + β̂_1(1) + .04(50) = … = 12.48

30 Predicting Ŷ? What if we want to predict a value on this new slope? Can we just plug values into the equation? What if we want to predict Ŷ when leftright is 1 and age is 50?
F̂H = β̂_0 + β̂_1(1) + .04(50) = … = 12.48
In Stata, we can easily check it after running the regression:
. quietly: reg harper leftright age
. disp _b[_cons] + _b[leftright]*1 + _b[age]*50
(displayed result lost in transcription)

31 Other Ways of Predicting Ŷ in Stata

* Predicting yhat for leftright=1 and age=50
. predict yhat
(option xb assumed; fitted values)
(2849 missing values generated)
. sum yhat if leftright==1 & age==50  /* lucky, we have one obs = 12.56 */

[output: summary of yhat with Obs, Mean, Std. Dev., Min, Max; values lost in transcription]

* Or simply use the margins command
. margins, at(leftright=1 age=50)

at: leftright = 1, age = 50
[output: Delta-method Margin with Std. Err., t, P>|t| and 95% CI; values lost in transcription]

32 Predicting Ŷ in R. In R, you can easily specify your scenario values through a new data frame.

> m3 <- lm(harper ~ leftright + age, data=data)
> newdat <- data.frame(
+   leftright = 1,
+   age = 50)
> predict(m3, newdat, interval="confidence")

[output: fit, lwr and upr; values lost in transcription]

Note that for a binary predictor, we could store something like male="male" in newdat, depending on the label (if the dummy is a factor). If you wanted multiple values, you could predict using a vector like leftright=c(1,2,3,4,5).
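The Stata disp check on the previous slide also has a direct R analogue using the stored coefficients; a small sketch, assuming the m3 object fitted above:

# by-hand prediction: intercept + b1*leftright + b2*age
b <- coef(m3)
unname(b["(Intercept)"] + b["leftright"] * 1 + b["age"] * 50)
# equivalently, as a dot product with the design row (1, leftright, age)
sum(b * c(1, 1, 50))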

33 Using Categorical Predictors. When adding variables, it is possible to add categorical ones. This is done by adding binary (dummy) variables. One common example is gender:

. reg harper leftright male

[output: coefficient table for leftright, male and _cons with Coef., Std. Err., t, P>|t| and 95% CI; numeric values lost in transcription]

But what does the coefficient mean? Being male, rather than female, decreases feelings for Harper by 0.62 (not statistically significant).

34 Adding a Binary Variable. Figure: two parallel fitted lines for Y = a + b1*LR + b2*Gender + e (female intercept = b0; male intercept = b0 + b2; the vertical gap between the lines is b2; y-axis: linear prediction, x-axis: left/right self-placement).

35 Ex 4. Categorical Variables in Stata. In Stata, you can use tab varname, gen(newvar) to automatically create dummies, simply use the i. prefix, or specify the baseline with b1., b2., b3., etc.

. reg harper i.votechoice  /* Liberal is baseline */
[output: coefficients for votechoice levels Tories, NDP, BQ, Greens and _cons; values lost in transcription]

. reg harper b2.votechoice  /* Conservative as baseline */
[output: coefficients for votechoice levels Libs, NDP, BQ, Greens and _cons; values lost in transcription]

36 Ex 4. Categorical Variables in R. As with Stata, R automatically recognizes categorical variables (factors).

> m4 <- lm(harper ~ votechoice, data=data)
> summary(m4)

Call: lm(formula = harper ~ votechoice, data = data)

[output: coefficient table with (Intercept) <2e-16 ***, votechoiceTories <2e-16 ***, votechoiceNDP, votechoiceBQ *, votechoiceGreens; numeric values lost in transcription]

37 Ex 4. Categorical Variables in R. In R, you change the baseline by changing the order of the factor levels.

levels(data$votechoice)  # current order
data$votechoice2 <- factor(data$votechoice, c("Tories", "Libs", "NDP", "BQ", "Greens"))
levels(data$votechoice2)
m4.2 <- lm(harper ~ votechoice2, data=data)
summary(m4.2)

You can also test whether a variable is a factor with is.factor(male). If it isn't, you can specify it with male.f <- factor(male, labels = c("male", "female")).
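A shortcut worth knowing: relevel() changes only the reference category, without retyping every level. A minimal sketch, assuming votechoice is already a factor (the name votechoice3 is just an illustrative choice to avoid clashing with the object above):

# set Tories as the baseline directly
data$votechoice3 <- relevel(data$votechoice, ref = "Tories")
m4.3 <- lm(harper ~ votechoice3, data = data)
summary(m4.3)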

38 Interactions. We might have reason to believe that the effect of a variable varies across the range of another variable. This is called an interaction.

Figure: two panels of predicted writing scores over social studies scores, by gender. Left: Y = a + b1*socst + b2*gender + e (parallel lines for male and female). Right: Y = a + b1*socst + b2*gender + b3*socst*gender + e (the slopes differ by gender).

39 Example 5. Stata. In Stata, the # sign identifies the interaction. Keep in mind that, by default, Stata treats interacted variables as factors; continuous terms need the c. prefix in order to work properly.

. regress write i.female##c.socst

[output: coefficient table for female, socst, female#c.socst and _cons with Coef., Std. Err., t and P>|t|; values lost in transcription]

40 Example 5. R. In R, variables are already defined as factors or not, so interactions are handled automatically.

> m5 <- lm(write ~ female*socst, data=data2)
> summary(m5)

Call: lm(formula = write ~ female * socst, data = data2)

[output: coefficient table with (Intercept) ***, femalefemale **, socst < 2e-16 ***, femalefemale:socst *; residual standard error on 196 degrees of freedom; F-statistic on 3 and 196 DF, p-value < 2.2e-16; remaining numeric values lost in transcription]
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

41 Regression Diagnostics. Remember the main assumptions:
1. The relationship is linear.
2. The expected value of the error term is zero.
3. The errors have identical, normal distributions (no heteroskedasticity, no autocorrelation).
To get unbiased and efficient estimators, we must make sure we don't violate these assumptions. We'll focus on the distribution of errors and on outliers; a quick visual check is sketched below.
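A fast way to eyeball all of this in R is the default plot method for lm objects; a sketch, assuming a model like those fitted in the earlier examples:

m6 <- lm(harper ~ leftright + age + male, data = data)  # model assumed from earlier examples
par(mfrow = c(2, 2))  # 2x2 grid: residuals vs fitted, normal Q-Q,
plot(m6)              # scale-location, residuals vs leverage
par(mfrow = c(1, 1))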

42 Normally distributed and constant variance

43 Example 6. Heteroskedasticity in Stata. You can test for heteroskedasticity with estat hettest or imtest, white, or simply look at the residuals.

Figure: residuals versus fitted values.

reg harper leftright age male
estat hettest
imtest, white
rvfplot, yline(0, lcolor(red))
reg harper leftright age male, vce(robust)

44 Example 6. Heteroskedasticity in R. In R, you can test for heteroskedasticity with the non-constant variance test, ncvTest(), in the car package. The plot can be obtained with spreadLevelPlot(m6) (where m6 is the object containing your model). However, getting robust standard errors like Stata's is a bit more involved: you'll need the sandwich and lmtest packages.

> library(sandwich)
> library(lmtest)
> coeftest(m6, vcov = vcovHC(m6, "HC1"))

t test of coefficients:
[output: Estimate, Std. Error, t value and Pr(>|t|) for (Intercept), leftright (<2e-16 ***), age and malemale; numeric values lost in transcription]

45 Example 6. Normality of Residuals in Stata. You can plot the residuals against a theoretical normal distribution.

Figure: P-P plot of residuals (y-axis: Normal F[(r-m)/s]; x-axis: Empirical P[i] = i/(n+1)).

quietly: reg harper leftright age male
predict r, residual
pnorm r

46 Example 6. Normality of Residuals in R. The same thing can be done in R with qqPlot(m6) from the car package.

Figure: Q-Q plot (y-axis: studentized residuals of m6; x-axis: t quantiles).

47 Example 6. Outlier Plot in Stata. You can identify outliers by graphing a leverage plot with lvr2plot.

Figure: leverage versus normalized residual squared.

48 Example 6. Outlier Plot in R. You can do the same thing in R with influencePlot(m6, id.method="identify") from the car package.

Figure: influence plot (studentized residuals versus hat values; circle size is proportional to Cook's distance).

49 Introducing Regression. Part I, The World of OLS: Bivariate Regression; Multiple Regression; Predicted Values; Categorical Predictors; Interactions; Regression Diagnostics. Part II, The World of MLE: Introducing MLE; Logit Models; Ordered Logit Models; Multinomial Logit Models.

50 Categorical Outcomes: The Problem with OLS. OLS is pretty useful. However, it fails at explaining non-continuous dependent variables. Imagine a simple model where Y_i is binary:
Y_i = β_0 + β_1 X_1 + β_2 X_2 + u_i
OLS estimates won't be BLUE in some respects:
1. OLS gives an unbounded linear prediction, while Y_i can only be 0 or 1.
2. OLS assumes u_i to be normally distributed. But since u_i = Y_i − Ŷ_i, the residuals can only take the values 1 − Ŷ_i or −Ŷ_i.
3. OLS assumes constant variance of u_i (homoskedasticity), while the variance of a binary choice will not permit that.

51 Constant Variance Is Impossible. Figure: residuals versus fitted values for the binary outcome; the residuals line up on two parallel bands, one for each outcome value.

52 Maximum Likelihood Estimation. The world of MLE is the world of frequentist probability. Formally, a probability is given by:
Pr(Y | M) = Pr(Data | Model)
Ideally, we would compute the inverse probability Pr(Model | Data), but this is impossible.

53 Maximum Likelihood Estimation. The world of MLE is the world of frequentist probability. Formally, a probability is given by:
Pr(Y | M) = Pr(Data | Model)
Ideally, we would compute the inverse probability Pr(Model | Data), but this is impossible. Luckily, the likelihood function helps us a lot:
f(Y_1, Y_2, …, Y_n | θ) = ∏_{i=1}^{N} f(Y_i | θ) = L(θ | Y)
p(Y | θ) = L(θ | Y)
In other words, there is a fixed value of θ; we maximize the likelihood to estimate θ, and we make assumptions to generate uncertainty about the estimate.

54 Log Likelihood. We usually work with the log likelihood function:
ln L(θ | Y) = Σ_{i=1}^{N} ln f(Y_i | θ)
The software then maximizes this function by taking derivatives at multiple iterations. When the derivative is 0, it has found the maximum and has therefore estimated the "most likely" θ parameters. ML estimation can be extended to a variety of functions, including logistic regression.
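A toy illustration of what the software does, maximizing a Bernoulli log likelihood numerically in R (the simulated y and the use of optimize() are assumptions for the sketch, not the course code):

# MLE of a single Bernoulli probability p
set.seed(1)
y <- rbinom(200, size = 1, prob = 0.3)  # simulated 0/1 outcomes
negll <- function(p) -sum(dbinom(y, size = 1, prob = p, log = TRUE))
fit <- optimize(negll, interval = c(1e-6, 1 - 1e-6))
c(mle = fit$minimum, sample_mean = mean(y))  # here the MLE is the sample mean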

55 Logistic Regression for Binary Outcomes.
P_i = Pr(Y_i = 1) = exp(Σ_k β_k X_ik) / (1 + exp(Σ_k β_k X_ik))
Figure: two panels over feelings about Stephen Harper (really dislike to really like). Left: linear prediction of Pr(party2), unbounded. Right: logistic regression prediction, bounded between 0 and 1.
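The logistic (inverse-logit) transform is what bounds the right panel between 0 and 1; a two-line sketch using R's built-in plogis():

xb <- seq(-4, 4, by = 0.1)        # a hypothetical linear predictor
plot(xb, plogis(xb), type = "l",  # plogis(xb) = exp(xb) / (1 + exp(xb))
     xlab = "linear predictor", ylab = "Pr(Y = 1)")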

56 Example 7. Logit in Stata. Logistic regression models the log odds of the outcome and is estimated by maximizing the log likelihood. Coefficients are thus hard to interpret.

. logit party2 harper leftright i.male age

[iteration log: log likelihood at iterations 0 through 5; values lost in transcription]
Logistic regression; Number of obs = 1021; LR chi2(4), Prob > chi2, log likelihood and Pseudo R2 values lost in transcription
[output: coefficient table for harper, leftright, male (Male), age and _cons with Coef., Std. Err., z, P>|z| and 95% CI; values lost]

57 Example 7. Logit in Stata. Odds ratios can also be used.

. logit party2 harper leftright i.male age, or  // to get odds ratios

[iteration log and header as on the previous slide; Number of obs = 1021]
[output: odds-ratio table for harper, leftright, male (Male), age and _cons with Std. Err., z, P>|z| and 95% CI; values lost in transcription]

58 Example 7. Logit in R

> m7 <- glm(party2 ~ harper + leftright + male + age, data=data, family="binomial")
> summary(m7)

Call: glm(formula = party2 ~ harper + leftright + male + age, family = "binomial", data = data)

[output: coefficient table with (Intercept) <2e-16 ***, harper <2e-16 ***, leftright <2e-16 ***, malemale, age; numeric values lost in transcription]
(Dispersion parameter for binomial family taken to be 1)
Null deviance: on 1020 degrees of freedom
Residual deviance: on 1016 degrees of freedom (3287 observations deleted due to missingness)
AIC: (value lost)
Number of Fisher Scoring iterations: 6

59 Example 7. Logit in R. Odds ratios can be extracted easily with exp(coef(m7)). To get CIs, simply state:

> exp(cbind(OR = coef(m7), confint(m7)))
Waiting for profiling to be done...

[output: OR, 2.5 % and 97.5 % for (Intercept), harper, leftright, malemale and age; values lost in transcription]

60 Example 7. Predicted Probabilities in Stata. In GLM terms, marginal effects are easier to understand. However, only the ME for a discrete variable behaves as expected; the ME for a continuous predictor gives the instantaneous rate of change at the mean, which can vary across values.

. margins, dydx(*) atmeans  // to get marginal effects

Conditional marginal effects; Number of obs = 1021; Model VCE: OIM
Expression: Pr(party2), predict()
dy/dx w.r.t.: harper leftright 1.male age
at: harper, leftright, 0.male, 1.male and age held at their means
[output: Delta-method dy/dx with Std. Err., z, P>|z| and 95% CI for harper, leftright, male and age; values lost in transcription]
Note: dy/dx for factor levels is the discrete change from the base level.

61 Example 7. Predicted Probabilities in Stata. You should then plot the predicted probabilities over a continuous predictor, either with margins or mcp, which will give you this graph:

Figure: Pr(party2), predict(), over harper (probability scale 0 to 1).

margins, at(harper=(1(1)100))
marginsplot
* or simply
mcp harper
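A rough R equivalent of the marginsplot above, assuming the m7 logit from Example 7 and scenario values that are illustrative assumptions (leftright = 5, male = "male", age = 50):

newdat <- data.frame(harper = 1:100, leftright = 5, male = "male", age = 50)
phat <- predict(m7, newdat, type = "response")  # predicted probabilities
plot(newdat$harper, phat, type = "l",
     xlab = "Feelings about Stephen Harper", ylab = "Pr(party2 = 1)")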

62 Example 7. Marginal Effects in R. You can get marginal effects with the mfx package.

install.packages("mfx")
library(mfx)
logitmfx(party2 ~ harper + leftright + male + age, data=data, atmean=T)

Call: logitmfx(formula = party2 ~ harper + leftright + male + age, data = data, atmean = T)

Marginal Effects:
[output: df/dx with Std. Err., z and P>|z| for harper (< 2.2e-16 ***), leftright (***), malemale and age; values lost in transcription]
df/dx is for discrete change for the following variables: [1] "malemale"

63 Ordered Logit. Sometimes we have categorical variables that are still ordered. For example, a statement reads "We have gone too far pushing bilingualism in this country" and is coded:
1 for strongly disagree
2 for disagree
3 for agree
4 for strongly agree
Be careful! The assumption behind ordered logit is that the effect of each predictor is the same across every cut point of the outcome (the parallel regression, or proportional odds, assumption).

64 Example 8. Ordered Logit in Stata. Coefficients (here odds ratios) give the change in the odds of moving up one outcome category.

. ologit toobilingual i.votechoice harper leftright male age, or

Ordered logistic regression; Number of obs = 970; LR chi2(8), Prob > chi2, log likelihood and Pseudo R2 values lost in transcription
[output: odds-ratio table for votechoice (Tories, NDP, BQ, Greens), harper, leftright, male and age, plus cut points /cut1, /cut2, /cut3; values lost]

65 Example 8. Ordered Logit in Stata. Be careful: predicted probabilities now need to be computed for all outcomes.

Figure: four panels of predictive margins with 95% CIs, Pr(toobilingual==1) through Pr(toobilingual==4), each plotted over left/right self-placement.

margins, at(leftright=(0(1)10)) predict(outcome(1))
margins, at(leftright=(0(1)10)) predict(outcome(2))
margins, at(leftright=(0(1)10)) predict(outcome(3))
margins, at(leftright=(0(1)10)) predict(outcome(4))

66 Example 8. Ordered Logit in R. Ordered logit can be run with the MASS package.

install.packages("MASS")
require(MASS)
data$toobil.f <- factor(data$toobilingual)
is.factor(data$toobil.f)
m8 <- polr(toobil.f ~ votechoice + harper + leftright + male + age, data = data, Hess=TRUE)
summary(m8)

Call: polr(formula = toobil.f ~ votechoice + harper + leftright + male + age, data = data, Hess = TRUE)

[output: coefficients (Value, Std. Error, t value) for votechoiceTories, votechoiceNDP, votechoiceBQ, votechoiceGreens, harper, leftright, malemale and age; intercepts (cut points); residual deviance and AIC; values lost in transcription]
(3338 observations deleted due to missingness)
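For polr objects, predicted probabilities come out for all four categories at once; a sketch, assuming the m8 model above and scenario values that are illustrative assumptions:

newdat <- data.frame(votechoice = "Libs", harper = 50,
                     leftright = 0:10, male = "male", age = 45)
round(predict(m8, newdat, type = "probs"), 3)  # one row per leftright value,
                                               # one column per outcome category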

67 Multinomial Logit Models. Finally, what if a categorical variable is unordered? In political science, such situations often arise when studying vote choice, where the dependent variable might be coded like this:
1 for Liberals
2 for Tories
3 for NDP
4 for BQ
5 for Greens
Multinomial logit estimates the probability of choosing one outcome over another. Keep in mind that there are assumptions linked to this, one of which is the Independence of Irrelevant Alternatives (IIA). This needs to be tested for.

68 Example 9. Multinomial Logit in Stata

mlogit votechoice harper leftright male age, rr

Multinomial logistic regression; Number of obs = 1021; LR chi2(16), Prob > chi2, log likelihood and Pseudo R2 values lost in transcription
[output: relative risk ratio (RRR) table with harper, leftright, male, age and _cons for each equation: Libs, Tories (base outcome), NDP, BQ and Greens; the Greens block is truncated and numeric values lost in transcription]

69 Example 9. Multinomial Logit in Stata. Change the base outcome with mlogit votechoice harper leftright male age, b(1). You can also test for IIA with the spost package:

net install spost9_ado.pkg
quietly: mlogit votechoice harper leftright male age, b(1)
. mlogtest, iia

**** Hausman tests of IIA assumption (N=1021)
Ho: Odds(Outcome-J vs Outcome-K) are independent of other alternatives.
[output: chi2, df, P>chi2 and "evidence for Ho" for omitted categories Tories, NDP, BQ and Greens; values lost in transcription]
Note: If chi2<0, the estimated model does not meet asymptotic assumptions of the test.

70 Example 9. Multinomial Logit in R. Multinomial logits can be run with the nnet package.

install.packages("nnet")
library(nnet)
m10 <- multinom(votechoice ~ harper + leftright + age + male, data = data)
summary(m10)

Call: multinom(formula = votechoice ~ harper + leftright + age + male, data = data)

[output: coefficients and standard errors ((Intercept), harper, leftright, age, malemale) for the Tories, NDP, BQ and Greens equations; residual deviance and AIC; values lost in transcription]

# choose baseline
data$vote2 <- relevel(data$votechoice, ref = "Tories")
m10.2 <- multinom(vote2 ~ harper + leftright + age + male, data = data)
summary(m10.2)
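As with polr, predict() on a multinom object returns the full set of choice probabilities; a sketch, assuming the m10 model above and scenario values that are illustrative assumptions:

newdat <- data.frame(harper = c(10, 50, 90), leftright = 5, age = 45, male = "male")
round(predict(m10, newdat, type = "probs"), 3)  # one row per scenario, one column per party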

71 Thank You! Any questions? Don t hesitate to write if you do. jean-philippe.gauvin@umontreal.ca