Applied Logistic Regression


Third Edition

DAVID W. HOSMER, JR.
Professor of Biostatistics (Emeritus)
Division of Biostatistics and Epidemiology
Department of Public Health
School of Public Health and Health Sciences
University of Massachusetts
Amherst, Massachusetts

STANLEY LEMESHOW
Dean, College of Public Health
Professor of Biostatistics
College of Public Health
The Ohio State University
Columbus, Ohio

RODNEY X. STURDIVANT
Colonel, U.S. Army Academy and Associate Professor
Department of Mathematical Sciences
United States Military Academy
West Point, New York

Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) , fax (978) , or on the web at Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) , fax (201) , or online at

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) , outside the United States at (317) or fax (317)

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at

Library of Congress Cataloging-in-Publication Data Is Available

Hosmer, David W.
Applied Logistic Regression / David W. Hosmer, Jr., Stanley Lemeshow, Rodney X. Sturdivant. - 3rd ed.
Includes bibliographic references and index.
ISBN (cloth)

Printed in the United States of America

To our wives, Trina, Elaine, and Mandy, and our sons, daughters, and grandchildren

Contents

Preface to the Third Edition

1 Introduction to the Logistic Regression Model
    Introduction
    Fitting the Logistic Regression Model
    Testing for the Significance of the Coefficients
    Confidence Interval Estimation
    Other Estimation Methods
    Data Sets Used in Examples and Exercises
    The ICU Study
    The Low Birth Weight Study
    The Global Longitudinal Study of Osteoporosis in Women
    The Adolescent Placement Study
    The Burn Injury Study
    The Myopia Study
    The NHANES Study
    The Polypharmacy Study
    Exercises

2 The Multiple Logistic Regression Model
    Introduction
    The Multiple Logistic Regression Model
    Fitting the Multiple Logistic Regression Model
    Testing for the Significance of the Model
    Confidence Interval Estimation
    Other Estimation Methods
    Exercises

3 Interpretation of the Fitted Logistic Regression Model
    Introduction
    Dichotomous Independent Variable
    Polychotomous Independent Variable
    Continuous Independent Variable
    Multivariable Models
    Presentation and Interpretation of the Fitted Values
    A Comparison of Logistic Regression and Stratified Analysis for 2 × 2 Tables
    Exercises

4 Model-Building Strategies and Methods for Logistic Regression
    Introduction
    Purposeful Selection of Covariates
    Methods to Examine the Scale of a Continuous Covariate in the Logit
    Examples of Purposeful Selection
    Other Methods for Selecting Covariates
    Stepwise Selection of Covariates
    Best Subsets Logistic Regression
    Selecting Covariates and Checking their Scale Using Multivariable Fractional Polynomials
    Numerical Problems
    Exercises

5 Assessing the Fit of the Model
    Introduction
    Summary Measures of Goodness of Fit
    Pearson Chi-Square Statistic, Deviance, and Sum-of-Squares
    The Hosmer–Lemeshow Tests
    Classification Tables
    Area Under the Receiver Operating Characteristic Curve
    Other Summary Measures
    Logistic Regression Diagnostics
    Assessment of Fit via External Validation
    Interpretation and Presentation of the Results from a Fitted Logistic Regression Model
    Exercises

6 Application of Logistic Regression with Different Sampling Models
    Introduction
    Cohort Studies
    Case-Control Studies
    Fitting Logistic Regression Models to Data from Complex Sample Surveys
    Exercises

7 Logistic Regression for Matched Case-Control Studies
    Introduction
    Methods For Assessment of Fit in a 1–M Matched Study
    An Example Using the Logistic Regression Model in a 1–1 Matched Study
    An Example Using the Logistic Regression Model in a 1–M Matched Study
    Exercises

8 Logistic Regression Models for Multinomial and Ordinal Outcomes
    The Multinomial Logistic Regression Model
    Introduction to the Model and Estimation of Model Parameters
    Interpreting and Assessing the Significance of the Estimated Coefficients
    Model-Building Strategies for Multinomial Logistic Regression
    Assessment of Fit and Diagnostic Statistics for the Multinomial Logistic Regression Model
    Ordinal Logistic Regression Models
    Introduction to the Models, Methods for Fitting, and Interpretation of Model Parameters
    Model Building Strategies for Ordinal Logistic Regression Models
    Exercises

9 Logistic Regression Models for the Analysis of Correlated Data
    Introduction
    Logistic Regression Models for the Analysis of Correlated Data
    Estimation Methods for Correlated Data Logistic Regression Models
    Interpretation of Coefficients from Logistic Regression Models for the Analysis of Correlated Data
    Population Average Model
    Cluster-Specific Model
    Alternative Estimation Methods for the Cluster-Specific Model
    Comparison of Population Average and Cluster-Specific Model
    An Example of Logistic Regression Modeling with Correlated Data
    Choice of Model for Correlated Data Analysis
    Population Average Model
    Cluster-Specific Model
    Additional Points to Consider when Fitting Logistic Regression Models to Correlated Data
    Assessment of Model Fit
    Assessment of Population Average Model Fit
    Assessment of Cluster-Specific Model Fit
    Conclusions
    Exercises

10 Special Topics
    Introduction
    Application of Propensity Score Methods in Logistic Regression Modeling
    Exact Methods for Logistic Regression Models
    Missing Data
    Sample Size Issues when Fitting Logistic Regression Models
    Bayesian Methods for Logistic Regression
    The Bayesian Logistic Regression Model
    MCMC Simulation
    An Example of a Bayesian Analysis and Its Interpretation
    Other Link Functions for Binary Regression Models
    Mediation
    Distinguishing Mediators from Confounders
    Implications for the Interpretation of an Adjusted Logistic Regression Coefficient
    Why Adjust for a Mediator?
    Using Logistic Regression to Assess Mediation: Assumptions
    More About Statistical Interaction
    Additive versus Multiplicative Scale–Risk Difference versus Odds Ratios
    Estimating and Testing Additive Interaction
    Exercises

References

Index

Preface to the Third Edition

This third edition of Applied Logistic Regression comes 12 years after the 2000 publication of the second edition. During this interval there has been considerable effort researching statistical aspects of the logistic regression model particularly when the outcomes are correlated. At the same time, capabilities of computer software packages to fit models grew impressively to the point where they now provide access to nearly every aspect of model development a researcher might need. As is well-recognized in the statistical community, the inherent danger of this easy-to-use software is that investigators have at their disposal powerful computational tools, about which they may have only limited understanding. It is our hope that this third edition will help bridge the gap between the outstanding theoretical developments and the need to apply these methods to diverse fields of inquiry.

As was the case in the first two editions, the primary objective of the third edition is to provide an introduction to the underlying theory of the logistic regression model, with a major focus on the application, using real data sets, of the available methods to explore the relationship between a categorical outcome variable and a set of covariates. The materials in this book have evolved over the past 12 years as a result of our teaching and consulting experiences. We have used this book to teach parts of graduate level survey courses, quarter- or semester-long courses, as well as focused short courses to working professionals. We assume that students have a solid foundation in linear regression methodology and contingency table analysis. The positive feedback we have received from students or professionals taking courses using this book or using it for self-learning or reference, provides us with some assurance that the approach we used in the first two editions worked reasonably well; therefore, we have followed that approach in this new edition.

The approach we take is to develop the logistic regression model from a regression analysis point of view. This is accomplished by approaching logistic regression in a manner analogous to what would be considered good statistical practice for linear regression. This differs from the approach used by other authors who have begun their discussion from a contingency table point of view. While the contingency table approach may facilitate the interpretation of the results, we believe that it obscures the regression aspects of the analysis. Thus, discussion of the interpretation of the model is deferred until the regression approach to the analysis is firmly established.

To a large extent, there are no major differences between the many software packages that include logistic regression modeling. When a particular approach is available in a limited number of packages, it will be noted in this text. In general, analyses in this book have been performed using STATA [Stata Corp. (2011)]. This easy-to-use package combines excellent graphics and analysis routines; is fast; is compatible across Macintosh, Windows and UNIX platforms; and interacts well with Microsoft Word. Other major statistical packages employed at various points during the preparation of this text include SAS [SAS Institute Inc. (2009)], OpenBUGS [Lunn et al. (2009)] and R [R Development Core Team (2010)]. For all intents and purposes the results produced were the same regardless of which package we used. Reported numeric results have been rounded from figures obtained from computer output and thus may differ slightly from those that would be obtained in a replication of our analyses or from calculations based on the reported results. When features or capabilities of the programs differed in an important way, we noted them by the names given rather than by their bibliographic citation.

We feel that this new edition benefits greatly from the addition of a number of key topics. These include the following:

1. An expanded presentation of numerous new techniques for model-building, including methods for determining the scale of continuous covariates and assessing model performance.
2. An expanded presentation of regression modeling of complex sample survey data.
3. An expanded development of the use of logistic regression modeling in matched studies, as well as with multinomial and ordinal scaled responses.
4. A new chapter dealing with models and methods for correlated categorical response data.
5. A new chapter developing a number of important applications either missing or expanded from the previous editions. These include propensity score methods, exact methods for logistic regression, sample size issues, Bayesian logistic regression, and other link functions for binary outcome regression models. This chapter concludes with sections dealing with the epidemiologic concepts of mediation and additive interaction.

As was the case for the second edition, all of the data sets used in the text are available at a web site at John Wiley & Sons, Inc. In addition, the data may also be found, by permission of John Wiley & Sons Inc., in the archive of statistical data sets maintained at the University of Massachusetts at in the logistic regression section.

We would like to express our sincere thanks and appreciation to our colleagues, students, and staff at all of the institutions we have been fortunate to have been affiliated with since the first edition was conceived more than 25 years ago. This includes not only our primary university affiliations but also the locations where we spent extended sabbatical leaves and special research assignments. For this edition we would like to offer special thanks to Sharon Schwartz and Melanie Wall from Columbia University who took the lead in writing the two final sections of the book dealing with mediation and additive interaction. We benefited greatly from their expertise in applying these methods in epidemiologic settings. We greatly appreciate the efforts of Danielle Sullivan, a PhD candidate in biostatistics at Ohio State, for assisting in the preparation of the index for this book. Colleagues in the Division of Biostatistics and the Division of Epidemiology at Ohio State were helpful in their review of selected sections of the book. These include Bo Lu for his insights on propensity score methods and David Murray, Sigrún Alba Jóhannesdóttir, and Morten Schmidt for their thoughts concerning the sections on mediation analysis and additive interaction.

Data sets form the basis for the way we present our materials and these are often hard to come by. We are very grateful to Karla Zadnik, Donald O. Mutti, Loraine T. Sinnott, and Lisa A. Jones-Jordan from The Ohio State University College of Optometry as well as to the Collaborative Longitudinal Evaluation of Ethnicity and Refractive Error (CLEERE) Study Group for making the myopia data available to us. We would also like to acknowledge Cynthia A. Fontanella from the College of Social Work at Ohio State for making both the Adolescent Placement and the Polypharmacy data sets available to us. A special thank you to Gary Phillips from the Center for Biostatistics at OSU for helping us identify these valuable data sets (that he was the first one to analyze) as well as for his assistance with some programming issues with Stata. We thank Gordon Fitzgerald of the Center for Outcomes Research (COR) at the University of Massachusetts / Worcester for his help in obtaining the small subset of data used in this text from the Global Longitudinal Study of Osteoporosis in Women (GLOW) Study's main data set. In addition, we thank him for his many helpful comments on the use of propensity scores in logistic regression modeling. We thank Turner Osler for providing us with the small subset of data obtained from a large data set he abstracted from the National Burn Repository 2007 Report, that we used for the burn injury analyses.

In many instances the data sets we used were modified from the original data sets in ways to allow us to illustrate important modeling techniques. As such, we issue a general disclaimer here, and do so again throughout the text, that results presented in this text do not apply to the original data.

Before we began this revision, numerous individuals reviewed our proposal anonymously and made many helpful suggestions. They confirmed that what we planned to include in this book would be of use to them in their research and teaching. We thank these individuals and, for the most part, addressed their comments. Many of these reviewers suggested that we include computer code to run logistic regression in a variety of packages, especially R. We decided not to do this for two reasons: we are not statistical computing specialists and did not want to have to spend time responding to queries on our code. Also, capabilities of computer packages change rapidly and we realized that whatever we decided to include here would likely be out of date before the book was even published. We refer readers interested in code specific to various packages to a web site maintained by Academic Technology Services (ATS) at UCLA where they use a variety of statistical packages to replicate the analyses for the examples in the second edition of this text as well as numerous other statistical texts. The link to this web site is

Finally, we would like to thank Steve Quigley, Susanne Steitz-Filler, Sari Friedman and the production staff at John Wiley & Sons Inc. for their help in bringing this project to completion.

Stowe, Vermont
Columbus, Ohio
West Point, New York
January 2013

David W. Hosmer, Jr.
Stanley Lemeshow
Rodney X. Sturdivant

The views expressed in this book are those of the author and do not reflect the official policy or position of the Department of the Army, Department of Defense, or the U.S. Government.