Statistical Modelling for Business and Management. J.E. Cairnes School of Business & Economics National University of Ireland Galway.


Statistical Modelling for Business and Management
J.E. Cairnes School of Business & Economics, National University of Ireland Galway
June 28-30, 2010

Graeme Hutcheson, University of Manchester
Luiz Moutinho, University of Glasgow

Modelling categorical variables using logit models
Software commands and outputs

The lecture notes, exercises and data sets associated with this course are available for download from:
For full details on R and Rcmdr see...

Loading the Infection.csv data set into R

Rcmdr: commands
  Data → Import data → from text file, clipboard or URL...

  Read Text Data from File, Clipboard or URL
    Enter name for data set:  input name (Infection) or leave as default (Dataset)
    Field Separator:          select Commas
  Open
    File name:                select the Infection.csv data file...

Rcmdr: Messages
  NOTE: The dataset Infection has 49 rows and 3 columns... [see below]

The data file includes information on Severity and Outcome (survived and died). There is also a variable called Outcome.graph, which is included only so that a scatterplot can be drawn for demonstration purposes. This variable should not be included in any analysis or model.
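The import dialog above corresponds roughly to a single read.table() call in the R console. The sketch below is only an illustration: because Infection.csv is not reproduced here, it first writes a tiny made-up file (three invented rows, with Outcome.graph assumed to be coded 0/1) to a temporary location and then reads it back with a comma separator and a header row.

```r
# Hypothetical stand-in for Infection.csv: three invented rows, written to a
# temporary file so that the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Severity,Outcome,Outcome.graph",
             "10,survived,0",
             "20,survived,0",
             "30,died,1"),
           csv_path)

# Comma-separated text with a header row; text columns are read as factors.
Infection <- read.table(csv_path, header = TRUE, sep = ",",
                        stringsAsFactors = TRUE)

dim(Infection)   # number of rows and columns
str(Infection)   # variable names and types
```

With the real file, csv_path would simply be the path to Infection.csv, and dim() should then agree with the row and column counts reported in the Rcmdr message.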

Graphing the data - an OLS regression model

Plotting a scatterplot in R
Rcmdr: commands
  Graphs → Scatterplot...

  Scatterplot
    Y variable (pick one):        select: Outcome.graph
    X variable (pick one):        select: Severity
    Options: Marginal boxplots:   deselect
    Options: Least-squares line:  select
    Options: Smooth Line:         deselect

Rcmdr: Output
  Not a particularly good model...

It is obvious from this plot that an OLS regression line of best fit (shown as a dashed line) does not represent these data well. An OLS regression model of the binary response variable is, therefore, not appropriate.

Graphing the data - a local regression model

Fitting a non-linear line of best fit using the smooth line plotting function gives a better model of the data. This line seems to provide a more accurate representation of the relationship between Outcome and Severity.

Plotting a scatterplot in R
Rcmdr: commands
  Graphs → Scatterplot...

  Scatterplot
    Y variable (pick one):        select: Outcome.graph
    X variable (pick one):        select: Severity
    Options: Marginal boxplots:   deselect
    Options: Least-squares line:  deselect
    Options: Smooth Line:         select
    Options: Span for smooth:     select 30

Rcmdr: Output
  A better model...
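The two lines discussed above (the least-squares line and the smooth line) can also be drawn from the R console. The following is a minimal sketch under stated assumptions: toy is an invented data set (not the Infection data) with the outcome coded 1 = died and 0 = survived, and base R's lowess() stands in for the smoother that Rcmdr uses. The value f = 0.5 is an arbitrary choice, not a translation of the span setting of 30.

```r
# Hypothetical toy data (not the Infection data): 1 = died, 0 = survived
toy <- data.frame(
  Severity      = seq(5, 80, by = 5),
  Outcome.graph = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1)
)

plot(toy$Severity, toy$Outcome.graph, xlab = "Severity", ylab = "Outcome")

# OLS line of best fit (dashed)
ols <- lm(Outcome.graph ~ Severity, data = toy)
abline(ols, lty = 2)

# Local-regression smooth line (solid); iter = 0 turns off the robustness steps
smooth <- lowess(toy$Severity, toy$Outcome.graph, f = 0.5, iter = 0)
lines(smooth)

# A straight line is unbounded, so OLS predictions can leave the 0-1 range
ols_pred <- predict(ols, newdata = data.frame(Severity = c(0, 120)))
ols_pred
```

For these toy values the OLS predictions at Severity = 0 and Severity = 120 fall below 0 and above 1 respectively, which is one concrete way of seeing why a straight line is a poor model for a binary response.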

Running a logistic regression model

Computing a logistic regression model in R
Rcmdr: commands
  Statistics → Fit models → Generalized linear model...

  Generalized Linear Model
    Variables (double click to formula):  click on: Outcome and Severity
    Model Formula:                        this should read: Outcome ~ Severity
    Family (double click to select):      click on: binomial
    Link function:                        should read: logit

Rcmdr: output
  > GLM.5 <- glm(Outcome ~ Severity, family=binomial(logit), data=Infection)

  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)                                       ***
  Severity                                          ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Null deviance: 67.745 on 48 degrees of freedom
  Residual deviance: 45.994 on 47 degrees of freedom
  AIC:

The parameters above refer to a model of the logit of the response variable...

  logit(Outcome) = α + β Severity

The parameter for Severity therefore indicates the effect that a unit change in Severity has on the log-odds (the logit) of death. This value can easily be transformed into an odds (giving a simpler interpretation) using the R console...
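The same model can be fitted directly in the R console. The sketch below uses an invented toy data set rather than the Infection data, so its estimates have nothing to do with the output above; the names toy and GLM.toy are hypothetical. It is meant only to show the call and where the quantities discussed in these notes live in the fitted object.

```r
# Hypothetical toy data (not the Infection data)
sev  <- seq(5, 80, by = 5)
died <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1)
toy  <- data.frame(
  Severity = sev,
  Outcome  = factor(ifelse(died == 1, "died", "survived"),
                    levels = c("survived", "died"))
)

# With a factor response, glm() models the probability of NOT being in the
# first level, i.e. the probability of "died" here.
GLM.toy <- glm(Outcome ~ Severity, family = binomial(logit), data = toy)

summary(GLM.toy)        # coefficient table, null and residual deviance, AIC
coef(GLM.toy)           # alpha (Intercept) and beta (Severity), on the logit scale
GLM.toy$null.deviance   # deviance of the intercept-only model
deviance(GLM.toy)       # residual deviance of the fitted model
```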

Transforming logits into odds

R console: commands
  # to change a logit score of 0.059 into an odds, use the exp() function
  exp(0.059)
  # the exp() function is simply the inverse of the log() function...
  log( )

R console: output
  > exp(0.059)
  [1] 1.060775
  > log( )
  [1]

Rather than describing the relationship between Outcome and Severity in logits (for each unit increase in Severity, the log-odds of dying increase by 0.059), it is clearer to describe it in terms of odds. When this is applied to the above model, the relationship between Outcome and Severity is described as...

  For each unit increase in Severity, the odds of dying increase by 6% (1.06:1).
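As a small worked example of the transformation, the sketch below takes the Severity coefficient of 0.059 quoted in these notes and converts it to an odds ratio and a percentage change. The variable names b and odds_ratio are illustrative only.

```r
b <- 0.059                 # Severity coefficient on the logit (log-odds) scale

odds_ratio <- exp(b)       # multiplicative change in the odds per unit of Severity
odds_ratio                 # about 1.06

log(odds_ratio)            # log() undoes exp(), returning the logit value

(odds_ratio - 1) * 100     # percentage change in the odds per unit: about 6%

# Effects multiply on the odds scale: a 10-unit increase in Severity
# multiplies the odds by exp(10 * b), not by 10 * exp(b).
exp(10 * b)
```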

Model-fit

Although some information about model-fit was provided in the output to the logistic regression model (the Null and Residual deviances), it is useful to present this information in an ANOVA table...

An ANOVA table for a logistic regression model...
Compute a logistic regression model of Outcome (see above).

Rcmdr: commands
  Models → Hypothesis tests → ANOVA table...

  ANOVA table
    Partial, obeying marginality ("Type II"):  tick box

Rcmdr: output
  > Anova(GLM.5, type="II", test="LR")
  Anova Table (Type II tests)

  Response: Outcome
           LR Chisq Df Pr(>Chisq)
  Severity   21.751  1       e-06 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The LR Chisq statistic of 21.751 shows the amount by which the deviance in the prediction of Outcome decreases when Severity is added to the model. It is simply the difference between the Null deviance (67.745) and the Residual deviance (45.994) already given in the logistic regression output.
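The likelihood-ratio test can be checked by hand from the two deviances quoted in these notes. This is a sketch of the arithmetic rather than the internals of Anova(); the variable names are illustrative.

```r
null_dev  <- 67.745   # Null deviance, on 48 degrees of freedom
resid_dev <- 45.994   # Residual deviance, on 47 degrees of freedom

lr_chisq <- null_dev - resid_dev   # drop in deviance when Severity is added
df       <- 48 - 47                # one extra parameter in the model

# Upper-tail probability of a chi-square distribution with df degrees of freedom
p_value <- pchisq(lr_chisq, df = df, lower.tail = FALSE)

lr_chisq
p_value
```

For a fitted glm object, anova(model, test = "Chisq") in base R reports the same kind of deviance comparison, term by term.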

Nagelkerke's pseudo R-square

For logit models a pseudo R-square measure can be computed. This is not available in Rcmdr (yet), but can be easily computed using the Design library (this is included on the CD). Use the commands shown below in the R console to compute the statistics.

Nagelkerke's pseudo R-square
dataset: load dataset (Infection.csv) and name it Infection

R console: commands
  # load the Design library
  library(Design)
  # run the logistic regression model using the lrm() function.
  lrm(Outcome ~ Severity, data=Infection)

R console: output
  > lrm(Outcome ~ Severity, data=Infection)

  Logistic Regression Model

  lrm(formula = Outcome ~ Severity, data = Infection)

  Frequencies of Responses
  survived     died

  Obs  Max Deriv  Model L.R.  d.f.  P  C  Dxy  Gamma  Tau-a     R2  Brier
   49         1e                                             0.479

             Coef  S.E.  Wald Z  P
  Intercept                      e-04
  Severity                       e-04

Nagelkerke's pseudo R-square is given in the output under the R2 column (0.479).
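Nagelkerke's measure can also be reproduced without any extra library from the figures already quoted in these notes (49 observations, Null deviance 67.745, Residual deviance 45.994). The sketch assumes ungrouped binary data, for which the deviance equals -2 times the log-likelihood; the variable names are illustrative.

```r
n         <- 49       # number of observations
null_dev  <- 67.745   # -2 * log-likelihood of the intercept-only model
resid_dev <- 45.994   # -2 * log-likelihood of the model with Severity

# Cox and Snell R-square: 1 - (L0 / L1)^(2 / n)
cox_snell <- 1 - exp((resid_dev - null_dev) / n)

# Its maximum attainable value: 1 - L0^(2 / n)
max_cox_snell <- 1 - exp(-null_dev / n)

# Nagelkerke rescales Cox and Snell so that the maximum is 1
nagelkerke <- cox_snell / max_cox_snell
round(nagelkerke, 3)   # 0.479
```

The result agrees with the value in the R2 column of the lrm() output.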

Predictions for the cases in the data set

Predictions can be made from logistic regression models in much the same way as they are from OLS regression models, using the technique below...

Predictions for a logistic regression model...
Compute a logistic regression model of Outcome (see above).

Rcmdr: commands
  Models → Add observation statistics to data...

  Add Observation Statistics to Data
    Fitted values:  tick box

Rcmdr: output
  The fitted values (predictions) are added to the data set. To see these, simply view the data set:

This just gives point estimates for values already in the data set. For example, when Severity = 9.3, the probability of being in the survived category is 0.984, and when Severity = 50.9, the probability of being in the survived category is

In order to get estimates for other values of Severity, and confidence intervals, the following commands can be used...
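In the R console, the fitted values that Rcmdr adds to the data set are available through fitted(). The sketch below uses an invented toy data set (not the Infection data); toy, GLM.toy and the column name fitted.GLM.toy are hypothetical.

```r
# Hypothetical toy data (not the Infection data)
sev  <- seq(5, 80, by = 5)
died <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1)
toy  <- data.frame(
  Severity = sev,
  Outcome  = factor(ifelse(died == 1, "died", "survived"),
                    levels = c("survived", "died"))
)
GLM.toy <- glm(Outcome ~ Severity, family = binomial(logit), data = toy)

# Fitted probabilities for the cases in the data set. These are probabilities
# of the second factor level ("died"), because glm() treats the first level
# as the reference category.
toy$fitted.GLM.toy <- fitted(GLM.toy)
head(toy)

# The probability of the other category is the complement
toy$p.survived <- 1 - toy$fitted.GLM.toy
```

Which category the fitted value refers to depends on the order of the factor levels, so it is worth checking levels() before reading the column as "probability of survival" or "probability of death".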

Predictions and confidence intervals for defined cases

There are many libraries that deal with confidence intervals for logistic regression models. I use the Design library, but you may wish to investigate other libraries (try a search on CRAN).

Predictions from a logistic regression model
Compute a logistic regression model of Outcome (see above) and name it GLM.1.

R console: commands
  # Load the Design library
  library(Design)

  # Make a data frame (datpred) for the data you wish to predict.
  # (A value for Outcome is included even though it does not take part in the
  # calculation.)
  datpred <- data.frame(Outcome="survived", Severity=80)

  # Use the predict() function to get the standard errors for the logit model.
  # Save as predicted.logit
  predicted.logit <- predict(GLM.1, type = "link", newdata = datpred, se.fit = TRUE)

  # Get the estimates and 95% confidence intervals for the logit model from
  # predicted.logit (estimate -/+ 1.96 standard errors).
  # The CIs are estimated for the linear (logit) model.
  CI.lower  <- predicted.logit$fit - 1.96 * predicted.logit$se.fit
  CI.middle <- predicted.logit$fit
  CI.upper  <- predicted.logit$fit + 1.96 * predicted.logit$se.fit

  # The estimates computed so far are on the logit scale...
  # To print out the estimates on a probability scale use the plogis() function...
  plogis(CI.lower)
  plogis(CI.middle)
  plogis(CI.upper)

R console: output
  Predicted probability and confidence intervals for survival when Severity = 80
  > plogis(CI.lower)
  > plogis(CI.middle)
  > plogis(CI.upper)
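The reason for building the interval on the logit scale and only then converting to probabilities can be seen by comparing it with an interval built directly on the probability scale. The sketch below does both for an invented toy data set (not the Infection data); all names are hypothetical, and qnorm(0.975) is used in place of the rounded 1.96.

```r
# Hypothetical toy data (not the Infection data)
sev  <- seq(5, 80, by = 5)
died <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1)
toy  <- data.frame(
  Severity = sev,
  Outcome  = factor(ifelse(died == 1, "died", "survived"),
                    levels = c("survived", "died"))
)
GLM.toy <- glm(Outcome ~ Severity, family = binomial(logit), data = toy)

new_case <- data.frame(Severity = 80)
z     <- qnorm(0.975)                               # about 1.96
steps <- c(lower = -z, estimate = 0, upper = z)     # multipliers of the SE

# Interval built on the logit scale, then back-transformed with plogis()
link     <- predict(GLM.toy, newdata = new_case, type = "link", se.fit = TRUE)
ci_logit <- plogis(unname(link$fit) + steps * unname(link$se.fit))

# Naive interval built directly on the probability scale, for comparison
resp     <- predict(GLM.toy, newdata = new_case, type = "response", se.fit = TRUE)
ci_naive <- unname(resp$fit) + steps * unname(resp$se.fit)

ci_logit
ci_naive
```

Both intervals share the same point estimate, but only the back-transformed one is guaranteed to stay between 0 and 1; the naive version can run past those limits when the estimate is close to either end of the range.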

Plotting predictions from the logistic regression model

It is simple to plot the predicted regression model over the original data using the following R console commands. It is possible to do this in Rcmdr, but it is simpler and much quicker to cut and paste the following commands into the R console.

Plotting predictions from a logistic regression model
Rcmdr: commands

So that the graphic corresponds to those previously shown (i.e., the category "died" is above "survived") we need to reorder the categories in the Outcome variable:

  Data → Manage variables in active data set → Reorder factor levels...

  Reorder Factor Levels
    Factor (pick one):       click on Outcome
  Reorder Levels
    Old Levels / New order:  input a new order (e.g. survived=1, died=2)

Then, run the logistic regression model to get the predicted values...

  Statistics → Fit models → Generalized linear model...

  Generalized Linear Model
    Variables (double click to formula):  click on: Outcome and Severity
    Model Formula:                        this should read: Outcome ~ Severity
    Family (double click to select):      click on: binomial
    Link function:                        should read: logit

Now, save the predicted values...

  Models → Add observation statistics to data...

  Add Observation Statistics to Data
    Fitted values:  tick box

  Graphs → Scatterplot...

  Scatterplot
    Y variable (pick one):        select: fitted.GLM.1
    X variable (pick one):        select: Severity
    Options: Marginal boxplots:   deselect
    Options: Least-squares line:  deselect
    Options: Smooth Line:         select
    Options: Span for smooth:     select 0
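The reordering step has a direct console equivalent. The sketch below uses a short invented vector, not the Infection data, to show how the level order changes and why it matters for the model.

```r
# Hypothetical outcome vector
outcome <- factor(c("survived", "died", "died", "survived"))
levels(outcome)               # alphabetical by default: "died" then "survived"

# Put "survived" first and "died" second
outcome2 <- factor(outcome, levels = c("survived", "died"))
levels(outcome2)
as.integer(outcome2)          # internal codes: survived = 1, died = 2

# With a factor response, glm() models the probability of not being in the
# FIRST level, so after reordering the fitted values are probabilities of "died".
```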

The predicted model of Outcome...

The model produced by the logistic regression technique shows the predicted values for Outcome given different values of Severity. The model of probability is not linear (although the logit predictions are...) and is quite similar to the local-regression model shown above.

It is quite easy to show this graphic as an overlay plot (over the original data) and to also include the confidence intervals (look on the web for clues how to do this within R - just plot the confidence intervals computed using the procedures discussed above). The basic R console code is shown below:

Overlay plots for a logistic regression model
Compute a logistic regression model of Outcome (see above) - name it GLM.1.

R console: commands
  # obtain the predictions from the model (on the logit scale)
  predicted.logit <- predict(GLM.1, type = "link", newdata = Infection, se.fit = TRUE)

  # Save the predicted values to the data set
  # lower 95% confidence limits...
  Infection$CI.lower <- plogis(predicted.logit$fit - 1.96 * predicted.logit$se.fit)
  # fitted values...
  Infection$fitted <- plogis(predicted.logit$fit)
  # upper 95% confidence limits...
  Infection$CI.higher <- plogis(predicted.logit$fit + 1.96 * predicted.logit$se.fit)

  # Plot the graphs...
  # the raw data...
  plot(Infection$Severity, Infection$Outcome.graph, ylab="outcome", xlab="severity")

  # overlay the lower confidence limits (line)
  par(new=TRUE, col="blue")   # draw the next plot over the existing one
  plot(Infection$Severity, Infection$CI.lower, axes=FALSE, type="l", ylab="", xlab="")

  # overlay the fitted values (points)
  par(new=TRUE, col="red")    # draw the next plot over the existing one
  plot(Infection$Severity, Infection$fitted, axes=FALSE, type="p", ylab="", xlab="")

  # overlay the upper confidence limits (line)
  par(new=TRUE, col="blue")   # draw the next plot over the existing one
  plot(Infection$Severity, Infection$CI.higher, axes=FALSE, type="l", ylab="", xlab="")

R console: output
  [overlay plot: raw data with fitted values and lower and upper confidence limits]
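An alternative way of building the same kind of overlay is to draw the raw data once and then add the curves with lines(), which reuses the axes of the first plot so that every layer shares one y scale. The following is a sketch under stated assumptions: toy is an invented data set (not the Infection data) with the outcome coded 0/1, and the predictions are evaluated on an evenly spaced grid of Severity values so that the curves are drawn left to right.

```r
# Hypothetical toy data (not the Infection data): 1 = died, 0 = survived
toy <- data.frame(
  Severity = seq(5, 80, by = 5),
  died     = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1)
)
GLM.toy <- glm(died ~ Severity, family = binomial(logit), data = toy)

# Predictions on a grid of Severity values, on the logit scale
grid <- data.frame(Severity = seq(0, 100, by = 1))
pred <- predict(GLM.toy, newdata = grid, type = "link", se.fit = TRUE)
z    <- qnorm(0.975)   # about 1.96 for 95% limits

# Back-transform the estimate and both limits to the probability scale
grid$fitted    <- plogis(pred$fit)
grid$CI.lower  <- plogis(pred$fit - z * pred$se.fit)
grid$CI.higher <- plogis(pred$fit + z * pred$se.fit)

# Raw data first, then the curves on the same axes
plot(toy$Severity, toy$died, xlim = c(0, 100), ylim = c(0, 1),
     xlab = "severity", ylab = "outcome")
lines(grid$Severity, grid$fitted,    col = "red")
lines(grid$Severity, grid$CI.lower,  col = "blue", lty = 2)
lines(grid$Severity, grid$CI.higher, col = "blue", lty = 2)
```

Because the limits are computed on the logit scale and then passed through plogis(), the band always stays between 0 and 1 and always contains the fitted curve.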