Dealing with Missing Data: Strategies for Beginners to Data Analysis

Rachel Margolis, PhD
Assistant Professor, Department of Sociology
Center for Population, Aging, and Health
University of Western Ontario

What exactly do you mean by missing data?
- In a typical data set, information is missing for some variables for some cases.
- E.g., there is usually a sizable amount of missing data for income.
- In Stata, missing values are shown as . or .m
- Sometimes missing values are coded as numbers such as 99 or 999.
- Need to check the codebook!
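As a minimal sketch of the recoding step, the Python snippet below turns numeric missing-value codes into an explicit missing marker before any analysis. The codes 99 and 999, the variable names, and the data are illustrative assumptions; the real codes must come from the codebook.

```python
# Hypothetical sentinel codes; the real ones must come from the codebook.
MISSING_CODES = {99, 999}


def recode_missing(values, missing_codes=MISSING_CODES):
    """Return a copy of `values` with sentinel codes replaced by None."""
    return [None if v in missing_codes else v for v in values]


income_raw = [52, 99, 41, 999, 67]  # made-up data
income = recode_missing(income_raw)

# Count how many values are now flagged as missing.
n_missing = sum(v is None for v in income)
print(income, n_missing)
```

Leaving codes like 99 in place would let them enter means and regressions as if they were real values, which is why this step comes first.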

Why Data could be Missing
1- Outright refusal to answer
2- In self-administered surveys, people often overlook or forget to answer some questions
3- Even trained interviewers occasionally may neglect to ask some questions
4- Respondents say they do not know the answer
5- Respondents may not have the information available to them at the time of the survey
6- Question is inapplicable, e.g., quality of marriage for unmarried respondents

Why Data could be Missing
7- In longitudinal studies, people who were interviewed in one wave may move or die before the next wave, so they appear in one wave but not the next
8- Some records could be lost
9- Data could not be read by the person entering them into the database

My Perspective on Missing Data
- This is not my area of research.
- I am a user of secondary survey data and have dealt with missing data in my research.
- Instructor for applied regression course (Soc 9007) where many students deal with data analysis for the first time.
- This talk is not geared for experienced users.
- Drawing on Allison (2002) Missing Data and other

Why is missing data a problem in social science and health research?
1) Nearly all standard statistical methods presume that every case has information on all variables included in the analysis.
2) Multivariate analysis of large surveys: even if only a small percentage of data is missing on each variable, a large number of cases may be missing data on at least one of them.
3) Analysis of small data sets (clinical data, cross-national data, quantitative analysis of qualitative data): every case is important.
4) Analysis of variables involving sensitive topics.
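Point 2 can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each variable is missing independently with the same probability; real surveys rarely behave this neatly, so treat the result as an order of magnitude rather than a prediction.

```python
def share_complete(p_missing, n_vars):
    """Share of cases with no missing values, assuming independent missingness."""
    return (1 - p_missing) ** n_vars


# Ten variables, each with 5% missing (hypothetical numbers).
share = share_complete(0.05, 10)
share_incomplete = 1 - share
print(round(share, 3), round(share_incomplete, 3))
```

Under these assumptions, only about 60% of cases are complete on all ten variables, so roughly 40% would be dropped by an analysis that requires complete cases.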

Why is missing data a problem in social science and health research?
5) If missing cases are deleted:
- Reduced sample size and lower statistical power (larger standard errors, so it is harder to detect significant relationships)
- Biased estimates (sample selection) because the analytic sample is not representative of the whole sample
6) If we impute missing data:
- Risk of biased estimates from inadequate imputations
- Biased standard errors and significance tests from overfitting
7) Publication of research: journal editors and reviewers are increasingly strict about how you deal with missing data

Step 1: Assess WHY data are missing
- Go to the codebook for your data.
- Was the question not asked of all respondents? (It could have been inapplicable, or only asked of a subset to save cost.)
- How are missing values coded? (There may be subcategories.)
- Talk with others who work with the same data.

Step 2: Analyze missing data as the dependent variable
The next step in dealing with missing data is to empirically understand the nature of the missing data pattern.
- Create a dummy variable for whether each observation is missing on that variable: cases with missing values are coded 1, cases without missing values are coded 0.
- Estimate a logistic regression or similar model with this dummy as the outcome and the other variables as predictors.
- This will give you hints as to which characteristics are associated with missingness.
- This is similar to the analysis of attrition in longitudinal data.
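A minimal sketch of the first part of Step 2, in Python: build the missingness dummy and compare missingness rates across groups. The group comparison is a simpler descriptive stand-in for the logistic regression described above, not a replacement for it. The records, variable names, and values are made up for illustration.

```python
from collections import defaultdict

# Made-up records; None marks a missing income value.
records = [
    {"depressed": 1, "income": None},
    {"depressed": 1, "income": 30},
    {"depressed": 1, "income": None},
    {"depressed": 0, "income": 55},
    {"depressed": 0, "income": 62},
    {"depressed": 0, "income": None},
    {"depressed": 0, "income": 48},
    {"depressed": 0, "income": 70},
]

# Dummy coded 1 if income is missing, 0 otherwise.
for r in records:
    r["income_missing"] = 1 if r["income"] is None else 0


def missing_rate_by(records, group_key, dummy_key="income_missing"):
    """Share of cases with the dummy equal to 1, within each group."""
    totals, missing = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        missing[r[group_key]] += r[dummy_key]
    return {g: missing[g] / totals[g] for g in totals}


rates = missing_rate_by(records, "depressed")
print(rates)
```

A large gap between groups is a hint that missingness is related to that characteristic. The same dummy would serve as the outcome in a logistic regression in whatever statistical package you use.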

Why analyze missing data as dependent variable?
- Provides the researcher with a substantive understanding of the missing data pattern
- Can help with selecting the best technique to address the missing data problem
- Can help with using the technique: creating weights, creating imputed data
- Depending on space, can be a part of your story: a missing data analysis section in a master's thesis, dissertation, or book; an appendix or footnote in a journal article

Step 3: Determine the nature of missing data
a) Missing completely at random (MCAR)
b) Missing at random (MAR)
c) Missing not at random (MNAR)

Step 3: Determine the nature of missing data
a) Missing completely at random (MCAR)
- Missingness is unrelated to any variable in the analysis (including the variable with missing data itself).
- Example: 1% of records were lost when they fell into the mud. Or one computer out of 10 holding the data broke down, and the failure was not related to which data it was holding.
- Analysis remains unbiased. We lose power, but estimated parameters are not biased.
- Most missing data techniques will work well.

Step 3: Determine the nature of missing data
b) Missing at random (MAR)
- Data are MAR if missingness on x does not depend on the value of x after controlling for other variables.
- Missingness may still be correlated with other variables that are included in the analysis.
- Example: Depressed people might be less likely to report their income (whether income is reported is associated with depression). Depressed people might also have lower income in general. If we ignored the missing data, the observed income distribution would be shifted upward.
- If, within depressed patients, the probability of reporting income is unrelated to income level, then the data are MAR, not MCAR.
- MAR can produce bias, but we have ways of dealing with it.

Step 3: Determine the nature of missing data
c) Missing not at random (MNAR)
- Missingness depends on the value of the variable that is missing, even after controlling for other variables.
- Example: If we are studying mental health and the depressed are less likely to report their mental health, then data are MNAR.
- The mean mental health level will be biased relative to what we would estimate with the complete data.
- We need to write a model that accounts for the missing data process.
- Bias could be large or small depending on your data.
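The difference between the three mechanisms can be sketched with a small simulation. Everything numeric here is an assumption chosen for illustration: the group means of 40 and 60, the missingness probabilities, and the MNAR rule that incomes above 50 are reported less often. The simulation only shows the direction of the bias in the observed mean, not how large it would be in real data.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
N = 20000

# Simulated people: depressed respondents have lower income on average (assumed).
people = []
for _ in range(N):
    depressed = rng.random() < 0.5
    income = rng.gauss(40 if depressed else 60, 10)
    people.append((depressed, income))


def observed_incomes(people, p_missing):
    """Keep a case unless a random draw falls below its missingness probability."""
    return [(d, y) for d, y in people if rng.random() >= p_missing(d, y)]


# MCAR: same probability for everyone.
mcar = observed_incomes(people, lambda d, y: 0.3)
# MAR: depends on depression (observed), not on income within each group.
mar = observed_incomes(people, lambda d, y: 0.6 if d else 0.1)
# MNAR: depends on the income value itself (hypothetical rule).
mnar = observed_incomes(people, lambda d, y: 0.6 if y > 50 else 0.1)

true_mean = mean(y for _, y in people)
mcar_mean = mean(y for _, y in mcar)
mar_mean = mean(y for _, y in mar)
mnar_mean = mean(y for _, y in mnar)

# Under MAR, the mean within the depressed group stays close to the truth.
true_dep = mean(y for d, y in people if d)
mar_dep = mean(y for d, y in mar if d)

print(round(true_mean, 1), round(mcar_mean, 1), round(mar_mean, 1), round(mnar_mean, 1))
```

In this setup the MCAR mean stays close to the full-sample mean, the MAR mean is shifted upward because low-income depressed respondents are underrepresented, and the MNAR mean is shifted downward. Under MAR the within-group mean remains close to the truth, which is why conditioning on depression can repair the estimate.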

Step 4: Choose a technique
1) Listwise deletion
2) Simple techniques to avoid: pairwise deletion, hot deck imputation, mean substitution
3) Single imputation/regression substitution
4) Extra dummy variable for missing data
5) Multiple imputation
6) Maximum likelihood estimation

1- Listwise deletion
Method: Delete any case that has missing data on any of the variables of interest.
Advantages:
- Simple: default option in many statistical programs
- Acceptable with a small amount of missing data; one rule of thumb is less than 5% of the full sample
Disadvantages:
- Can quickly reduce sample size and statistical power where many variables have missing data
- Undetected selection bias
- Biased when data are not MCAR
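Listwise deletion is simple enough to sketch in a few lines of Python. The rows and variable names below are made up, and missing values are assumed to be coded as None. The share of dropped cases is worth checking against the 5% rule of thumb mentioned above.

```python
def listwise_delete(rows, variables):
    """Keep only rows with no missing (None) value on any analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


rows = [
    {"age": 34, "income": 52, "educ": 16},
    {"age": 51, "income": None, "educ": 12},
    {"age": 29, "income": 41, "educ": None},
    {"age": 45, "income": 67, "educ": 14},
]

complete = listwise_delete(rows, ["age", "income", "educ"])
share_dropped = 1 - len(complete) / len(rows)
print(len(complete), share_dropped)
```

Here two of four cases are lost even though each variable is missing for at most one case, which shows how quickly the analytic sample can shrink.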

2- Simple techniques to avoid
a) Pairwise deletion: For each pair of variables, uses every case with data on both, rather than deleting the whole observation.
b) Hot deck imputation: Substitutes the value from a randomly selected similar unit for the missing value.
c) Mean substitution: Substitutes the mean value for the missing data.
Advantages:
- All available data are used
Disadvantages:
- May over- or underestimate coefficients
- Overfits data: artificially increases model fit by assuming that similar units are identical. Standard errors are too low.
- Hot deck: hard to justify the method for selecting similar units
- Mean substitution: hard to justify that missing values would equal the mean. Doesn't take into account how the missing cases are different.
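One reason to avoid mean substitution can be seen in a tiny made-up example: filling gaps with the mean adds cases that sit exactly at the center of the distribution, so the variance shrinks. The values below are illustrative only.

```python
from statistics import mean, variance

values = [2, 4, None, 8, None, 6]  # made-up data; None marks missing

observed = [v for v in values if v is not None]
fill = mean(observed)

# Mean substitution: every missing value gets the observed mean.
filled = [fill if v is None else v for v in values]

var_observed = variance(observed)  # sample variance of the 4 observed values
var_filled = variance(filled)      # sample variance after filling 2 gaps
print(fill, var_observed, var_filled)
```

The filled series has a smaller variance than the observed values alone, and statistics computed from it will look more precise than the data justify.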

3- Single imputation/regression substitution
Method: Use linear regression to predict what the missing value should be on the basis of other variables that are present. Then substitute the predicted value for the missing value.
Advantages:
- More logical than other methods
- Full sample size preserved
Disadvantages:
- Overfits data: artificially increases model fit by assuming that similar units are identical. Standard errors are too low.
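A minimal sketch of regression substitution with a single predictor, using only the Python standard library. The variables (education predicting income) and all values are made up, and a real application would typically use several predictors in a statistical package.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


educ = [10, 12, 14, 16, 18]      # made-up predictor, fully observed
income = [30, 38, None, 50, 60]  # made-up outcome with one gap

# Fit the regression on complete cases only.
pairs = [(x, y) for x, y in zip(educ, income) if y is not None]
a, b = fit_line([x for x, _ in pairs], [y for _, y in pairs])

# Substitute the predicted value wherever income is missing.
imputed = [a + b * x if y is None else y for x, y in zip(educ, income)]
print(round(a, 2), round(b, 2), imputed)
```

The imputed value lies exactly on the fitted line, with no residual scatter. That is the overfitting problem listed above: the filled-in case agrees with the model perfectly, so fit looks better and standard errors look smaller than they should.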

4- Extra dummy variable for missing data
Method: Add an extra dummy variable (coded 1 for the missing values and 0 otherwise) to a series of dummy variables.
Example: Education: university degree (ref), high school graduate (dummy), less than high school (dummy), unknown education (dummy)
Advantages:
- Full sample size preserved
- Association between DV and missing data dummy is estimated

4- Extra dummy variable for missing data
Disadvantages:
- Heterogeneity: the missing data dummy possibly combines very different cases together
- Requires many extra dummy variables if you have missing data on multiple variables
- Requires use of categorical/dummy variables, not continuous variables
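The education example can be sketched as a small coding function. The category labels and the `educ_` prefix are hypothetical names chosen for illustration; university degree is the reference category, so it gets no dummy of its own.

```python
# Non-reference categories; "university" is the reference (all dummies 0).
CATEGORIES = ["high_school", "less_than_high_school", "unknown"]


def education_dummies(value):
    """Return one 0/1 dummy per non-reference category; None maps to 'unknown'."""
    label = "unknown" if value is None else value
    return {f"educ_{c}": int(label == c) for c in CATEGORIES}


education = ["university", "high_school", None, "less_than_high_school"]
coded = [education_dummies(e) for e in education]

for original, row in zip(education, coded):
    print(original, row)
```

Every case keeps a valid set of predictors, so none is dropped, and the coefficient on the unknown dummy estimates how cases with missing education differ from the reference group on the dependent variable.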

5- Multiple Imputation
Method: Replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiply imputed data sets are then each analyzed using standard procedures for complete data, and the results from these analyses are combined.
Advantages:
- Logical, full sample preserved
- By including random error, the imputed values are noisier, so the method doesn't overfit as much as other methods
Disadvantages:
- Not necessarily available for all kinds of models
- Not appropriate for missing data on a key independent variable or the DV
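The combining step is usually done with what are commonly called Rubin's rules: average the estimates across imputed data sets, and add the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. The sketch below shows only this pooling step. The three estimates and their squared standard errors are made-up numbers standing in for the results of analyzing three imputed data sets, and in practice the software listed later in this talk does this for you.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors across imputations."""
    m = len(estimates)
    q_bar = mean(estimates)         # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)


# Hypothetical results for one coefficient from m = 3 imputed data sets.
estimates = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.03]

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(round(pooled_estimate, 3), round(pooled_se, 3))
```

The pooled standard error is larger than any of the three individual standard errors because it also reflects how much the estimate moves from one imputed data set to the next. That extra term is what single imputation leaves out.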

6- Maximum Likelihood Estimation
Method: EM algorithms estimate model coefficients and standard errors in the presence of missing data.
Advantages:
- Doesn't impute missing data
- Best-fitting parameters are selected via iterations that maximize the probability of observing the data that were collected
Disadvantages:
- Requires more statistical knowledge
- Might require the use of different statistical programs. More common in SEM programs.

Summary
- First, you must understand why you have missing data and examine the patterns.
- Then you can choose a technique to deal with missing data. You may choose more than one.
- No matter how you deal with missing data, you should run your analysis various ways:
  - With and without missing values included
  - Using different methods, to test whether the choice changes the results
- Think about the direction in which missing values bias the results.
- Before you start using one of these techniques, invest in understanding what assumptions you are making and how to do it with the software that you use.

General approaches
- The information presented here focused on general approaches for basic statistical analysis: OLS regression, logistic regression, ANOVA.
- See the literature on disciplinary and model-specific techniques and norms (multilevel models, structural equation modeling, factor analysis, panel data, etc.).
Statistical software:
- SPSS: limited in basic version. Some expensive upgrades available.
- Stata: multiple imputation with mi, ice, micombine
- SAS: PROC MI, PROC MIANALYZE
- R: many options

References
Allison, P. (2002). Missing Data. Sage. (A "little green book".)
Johnson & Young (2011). Toward best practices in analyzing datasets with missing data: Comparisons and recommendations. Journal of Marriage and Family 73: 926-945.
Acock (2005). Working with missing values. Journal of Marriage and Family 67: 1012-1028.
Raghunathan (2004). What to do with missing data? Some options for analysis of incomplete data. Annual Review of Public Health 25: 99-117.
Little and Rubin (1989). The analysis of social science data with missing values. Sociological Methods and Research.