Dealing with Missing Data: Strategies for Beginners to Data Analysis
Rachel Margolis, PhD
Assistant Professor, Department of Sociology
Center for Population, Aging, and Health
University of Western Ontario
What exactly do you mean by missing data?
- In a typical data set, information is missing for some variables for some cases.
- E.g. there is usually a sizable amount of missing data for income.
- In Stata, missing values are shown as . or .m
- Sometimes missing values are coded as a number such as 99 or 999. Need to check the codebook!
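As a minimal sketch of the codebook check, the snippet below recodes numeric missing codes to an explicit missing marker before any analysis. The codes 99 and 999 come from the slide; the income values and the function name `recode_missing` are made up for illustration, and your own codebook decides which codes apply.

```python
# Assumption for illustration: the codebook says 99 and 999 mean "missing".
MISSING_CODES = {99, 999}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Return a copy of values with codebook missing codes replaced by None."""
    return [None if v in missing_codes else v for v in values]

# Made-up income values; two of them are missing codes, not real incomes.
income = [42000, 999, 58000, 99, 31000]
clean = recode_missing(income)
n_missing = sum(v is None for v in clean)

print(clean)
print(f"{n_missing} of {len(clean)} values are missing")
```

If the codes were left in place, a mean or regression would silently treat 99 and 999 as real incomes.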
Why Data could be Missing
1- Outright refusal to answer
2- In self-administered surveys, people often overlook or forget to answer some questions
3- Even trained interviewers occasionally may neglect to ask some questions
4- Respondents say they do not know the answer
5- Respondents may not have the information available to them at the time of the survey
6- Question is inapplicable, e.g. quality of marriage asked of unmarried respondents
Why Data could be Missing
7- In longitudinal studies, people who were interviewed in one wave may move or die before the next wave, e.g. present in one wave, but not the next
8- Some records could be lost
9- Data could not be read by the person entering them into the database
My Perspective on Missing Data
- This is not my area of research.
- I am a user of secondary survey data and have dealt with missing data in my research.
- Instructor for applied regression course (Soc 9007), where many students deal with data analysis for the first time.
- This talk is not geared for experienced users.
- Drawing on Allison (2002), Missing Data, and other sources
Why is missing data a problem in social science and health research?
1) Nearly all standard statistical methods presume that every case has information on all variables to be included in the analysis.
2) Multivariate analysis of large surveys: even if only a small percentage of data is missing on each variable, you may have a large number of cases with missing data on at least one of these variables.
3) Analysis of small data sets (clinical data, cross-national data, quantitative analysis of qualitative data): every case is important.
4) Analysis of variables involving sensitive topics
Why is missing data a problem in social science and health research?
5) If missing cases are deleted:
- Reduces sample size and lowers statistical power (larger standard errors, so it is harder to detect significant relationships)
- Biased estimates (sample selection) because the analytic sample is not representative of the whole sample
6) If we impute missing data:
- Risk of biased estimates: inadequate imputations
- Biased standard errors and significance tests: overfitting
7) Publication of research: journal editors and reviewers are increasingly strict about how you deal with missing data
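Points 2 and 5 can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each variable is missing independently with the same probability, and uses the usual approximation that a standard error is proportional to 1/sqrt(n). The function names are hypothetical.

```python
import math

def share_complete(p_missing, n_vars):
    """Share of cases with no missing values, assuming each of n_vars variables
    is missing independently with probability p_missing (illustrative assumption)."""
    return (1.0 - p_missing) ** n_vars

def se_inflation(share_kept):
    """Factor by which a standard error grows when only share_kept of the cases
    remain, using the approximation that SE is proportional to 1/sqrt(n)."""
    return 1.0 / math.sqrt(share_kept)

for k in (1, 5, 10, 20):
    kept = share_complete(0.05, k)
    print(f"{k:2d} variables: {kept:.1%} complete cases, SE x {se_inflation(kept):.2f}")
```

With 5% missing on each of 10 variables, only about 60% of cases are complete under these assumptions. Real missingness is rarely independent across variables, so the actual loss may be smaller or larger.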
Step 1: Assess WHY data are missing
- Go to the codebook for your data.
- Was the question not asked of all respondents? (It could have been inapplicable, or only asked of a subset to save cost.)
- How are missing values coded? (There may be subcategories.)
- Talk with others who work with the same data.
Step 2: Analyze missing data as the dependent variable
The next step in dealing with missing data is to empirically understand the nature of the missing data pattern.
- Create a dummy variable for whether observations are missing on that variable or not: cases with missing values are coded as 1, cases without missing values are coded as 0.
- Estimate a logistic regression or similar model with this dummy as the outcome and the other variables as predictors.
- This will give you hints as to which characteristics are associated with missingness.
This is similar to the analysis of attrition in longitudinal data.
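The two steps above can be sketched as follows. This is a toy example, not a recommended workflow: the data are made up (250 depressed respondents, 100 of whom did not report income, and 250 non-depressed respondents, 25 of whom did not), and the hand-written `fit_logit` is a bare gradient-ascent fit with one predictor that stands in for the logistic regression command of your statistical package.

```python
import math

def fit_logit(x, y, lr=1.0, n_iter=2000):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent
    on the mean log-likelihood. Returns (intercept, slope)."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Made-up data: depression status and income, with None marking missing income.
depressed = [1] * 250 + [0] * 250
income = [None] * 100 + [30000] * 150 + [None] * 25 + [40000] * 225

# Step 1: dummy coded 1 if income is missing, 0 otherwise.
income_missing = [1 if v is None else 0 for v in income]

# Step 2: model the missingness dummy as the outcome.
b0, b1 = fit_logit(depressed, income_missing)
print(f"log-odds coefficient for depressed: {b1:.3f}")
print(f"odds ratio for depressed: {math.exp(b1):.2f}")
```

In this toy data the odds of missing income are about six times higher for depressed respondents, which would be a hint that the data are not MCAR.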
Why analyze missing data as dependent variable?
- Provides the researcher with a substantive understanding of the missing data pattern
- Can help with selecting the best technique to address the missing data problem
- Can help with using the technique: creating weights or imputations
- Depending on space, can be a part of your story: missing data analysis as a section of a masters thesis, dissertation, or book, or as an appendix or footnote in a journal article
Step 3: Determine the nature of missing data
a) Missing completely at random (MCAR)
b) Missing at random (MAR)
c) Missing not at random (MNAR)
Step 3: Determine the nature of missing data
a) Missing completely at random (MCAR)
- Missingness is unrelated to any variable in the analysis (including the variable with missing data itself).
- Example: 1% of records fell into the mud and were lost. Or one of the 10 computers holding the data broke down, for reasons unrelated to which data it was holding.
- Analysis remains unbiased. We lose power, but estimated parameters are not biased.
- Most missing data techniques will work well.
Step 3: Determine the nature of missing data
b) Missing at random (MAR)
- Data are MAR if missingness does not depend on the value of x itself after controlling for another variable.
- Missingness may be correlated with other variables that are included in the analysis.
- Example: Depressed people might be less likely to report their income (reporting of income is associated with depression). Depressed people might also have lower income in general. If we ignore the missing data, the observed income distribution is shifted upward.
- If, among depressed respondents, the probability of reporting income is unrelated to income level, then the data are MAR, not MCAR.
- Ignoring MAR data can produce bias, but we have ways of dealing with it.
Step 3: Determine the nature of missing data
c) Missing not at random (MNAR)
- Missingness depends on the unobserved value itself, even after controlling for other variables.
- Example: If we are studying mental health and the depressed are less likely to report their mental health, then data are MNAR.
- The estimated mean mental health level will be biased relative to what we would get with the complete data.
- We need to write a model that accounts for the missing data process.
- Bias could be large or small depending on your data.
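The difference between MAR and MNAR can be seen in a small constructed example based on the depression and income illustration. Everything here is made up: two groups with three income levels each (in thousands), ten people per level, and two hypothetical rules for who reports income. Under the MAR rule, adjusting for depression recovers the true mean; under the MNAR rule it does not.

```python
from statistics import mean

# (depressed, income levels in $1000s); 10 people at each level. Made-up values.
GROUPS = ((1, (20, 30, 40)), (0, (40, 50, 60)))

def build(is_reported):
    """Return rows of (depressed, income, reported) under a reporting rule."""
    return [(dep, inc, is_reported(dep, inc, i))
            for dep, levels in GROUPS for inc in levels for i in range(10)]

def mar_rule(dep, inc, i):
    # Missingness depends only on depression:
    # 5 of 10 missing at every income level if depressed, otherwise 1 of 10.
    return i >= (5 if dep else 1)

def mnar_rule(dep, inc, i):
    # Missingness depends on income itself:
    # the lowest income level in each group is never reported.
    return inc > (20 if dep else 40)

def complete_case_mean(rows):
    """Mean income among those who reported it (missing cases ignored)."""
    return mean(inc for _, inc, rep in rows if rep)

def adjusted_mean(rows):
    """Weight each group's observed mean by the group's share of the full sample."""
    total = 0.0
    for group in (0, 1):
        members = [r for r in rows if r[0] == group]
        observed = [inc for _, inc, rep in members if rep]
        total += len(members) / len(rows) * mean(observed)
    return total

true_mean = mean(inc for _, inc, _ in build(lambda dep, inc, i: True))
for name, rule in (("MAR", mar_rule), ("MNAR", mnar_rule)):
    rows = build(rule)
    print(name, "true:", true_mean,
          "complete-case:", round(complete_case_mean(rows), 2),
          "adjusted for depression:", round(adjusted_mean(rows), 2))
```

The adjustment works under MAR only because, within each depression group, reporting is unrelated to income. That condition cannot be checked from the observed data alone, which is why Step 1 and Step 2 matter.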
Step 4: Choose a technique
1) Listwise deletion
2) Simple techniques to avoid: pairwise deletion, hot deck imputation, mean substitution
3) Single imputation/regression substitution
4) Dummy variable for missing data
5) Multiple imputation
6) Maximum likelihood estimation
1- Listwise deletion
Method: Delete any case which has missing data on any of the variables of interest.
Advantages:
- Simple: default option in many statistical programs
- Acceptable with a small amount of missing data; one rule of thumb is less than 5% of the full sample
Disadvantages:
- Can quickly reduce sample size and statistical power where many variables have missing data
- Undetected selection bias
- Biased when data are not MCAR
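A minimal sketch of listwise deletion, assuming records are stored as dictionaries with None marking a missing value. The records and variable names are made up for illustration.

```python
def listwise_delete(rows, variables):
    """Keep only rows with observed (non-None) values on every analysis variable."""
    return [row for row in rows if all(row[v] is not None for v in variables)]

# Made-up records for illustration; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "educ": 16},
    {"age": 51, "income": None, "educ": 12},
    {"age": 28, "income": 38000, "educ": None},
    {"age": 45, "income": 61000, "educ": 14},
]

analytic = listwise_delete(rows, ["age", "income", "educ"])
share_lost = 1 - len(analytic) / len(rows)
print(f"kept {len(analytic)} of {len(rows)} cases ({share_lost:.0%} lost)")
```

Reporting the share of cases lost makes it easy to compare against the 5% rule of thumb mentioned above.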
2- Simple techniques to avoid
a) Pairwise deletion: Drops a case only from the calculations that involve the variable it is missing, rather than deleting the whole observation.
b) Hot deck imputation: Substitutes the value from a randomly selected similar unit for the missing value.
c) Mean substitution: Substitutes the mean value for the missing data.
Advantages:
- All available data are used
Disadvantages:
- May over- or underestimate coefficients
- Overfits data: artificially increases model fit by assuming that similar units are identical. Standard errors are too low.
- Hot deck: Hard to justify the method for selecting similar units.
- Mean substitution: Hard to justify that missing values would be the mean. Doesn't take into account how the missing cases are different.
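To see why mean substitution understates uncertainty, the sketch below fills five missing values with the observed mean and compares the variance before and after. The numbers are made up; the point is only the direction of the change.

```python
from statistics import mean, variance

observed = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up observed values
n_missing = 5                                # cases with no reported value

# Mean substitution: every missing case gets the observed mean.
filled = observed + [mean(observed)] * n_missing

print("mean before/after:", mean(observed), mean(filled))
print("variance before/after:", variance(observed), round(variance(filled), 2))
```

The mean is unchanged, but the variance shrinks because the filled-in cases add no spread. Standard errors computed from the filled data are therefore too small.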
3- Single imputation/regression substitution
Method: Use linear regression to predict what the missing value should be on the basis of other variables that are present. Then substitute the predicted value for the missing value.
Advantages:
- More logical than other methods
- Full sample size preserved
Disadvantages:
- Overfits data: artificially increases model fit by assuming that similar units are identical. Standard errors are too low.
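A small sketch of regression substitution with one predictor, using made-up education and income values. The helper functions `ols` and `pearson_r` are written by hand here so the example is self-contained; in practice you would use your statistical package. The comparison of correlations illustrates the overfitting problem: imputed points fall exactly on the regression line.

```python
from statistics import mean

def ols(x, y):
    """Simple linear regression of y on x. Returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
             / sum((a - xbar) ** 2 for a in x))
    return ybar - slope * xbar, slope

def pearson_r(x, y):
    """Pearson correlation between x and y."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: years of education and income in $1000s (None = missing).
educ = [8.0, 9.0, 10.0, 12.0, 13.0, 14.0, 16.0, 17.0, 18.0]
income = [22.0, None, 25.0, 33.0, None, 34.0, 43.0, None, 45.0]

# Fit the regression on complete cases only.
obs = [(e, i) for e, i in zip(educ, income) if i is not None]
obs_educ = [e for e, _ in obs]
obs_income = [i for _, i in obs]
intercept, slope = ols(obs_educ, obs_income)

# Substitute the predicted value wherever income is missing.
imputed = [i if i is not None else intercept + slope * e
           for e, i in zip(educ, income)]

r_before = pearson_r(obs_educ, obs_income)
r_after = pearson_r(educ, imputed)
print(f"correlation, complete cases: {r_before:.4f}")
print(f"correlation, after substitution: {r_after:.4f}")
```

The correlation rises after substitution even though no new information was collected, which is the sense in which this method artificially increases model fit.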
4- Extra dummy variable for missing data
Method: Add an extra dummy variable (coded 1 for the missing values and 0 otherwise) to a series of dummy variables.
Example: Education: university degree (ref), high school graduate (dummy), less than high school (dummy), unknown education (dummy)
Advantages:
- Full sample size preserved
- Association between DV and missing data dummy is estimated
4- Extra dummy variable for missing data
Disadvantages:
- Heterogeneity: the missing data dummy possibly combines very different cases together
- Requires many extra dummy variables if you have missing data on multiple variables
- Requires use of categorical/dummy variables, not continuous variables
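The education example can be sketched as a small coding function. The category labels and the name `education_dummies` are hypothetical; the structure follows the slide, with university degree as the reference category and one extra dummy for unknown education.

```python
def education_dummies(level):
    """Dummy-code education with 'university' as the reference category.
    level is 'university', 'high_school', 'less_than_hs', or None when unknown."""
    return {
        "high_school": int(level == "high_school"),
        "less_than_hs": int(level == "less_than_hs"),
        "educ_unknown": int(level is None),   # extra dummy for missing data
    }

# Made-up respondents; None marks unknown education.
respondents = ["university", "high_school", None, "less_than_hs"]
design_rows = [education_dummies(level) for level in respondents]

for level, row in zip(respondents, design_rows):
    print(level, row)
```

Every respondent keeps a row in the analysis, and the coefficient on the unknown-education dummy shows how the missing cases differ from the reference group on the DV.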
5- Multiple Imputation
Method: Replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. These multiply imputed data sets are then analyzed using standard procedures for complete data, and the results from these analyses are combined.
Advantages:
- Logical, full sample preserved
- By including random error, imputed data are noisier than the observed data, and therefore don't overfit as much as other methods
Disadvantages:
- Not necessarily available for all kinds of models
- Not appropriate for missing data on the key independent variable or the DV
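The impute, analyze, and combine steps can be sketched in a few lines. This is a deliberately simplified illustration with made-up data: it imputes from a regression plus random noise and pools the estimated mean with Rubin's rules, but it does not draw the regression parameters themselves from a distribution, as the mi procedures in statistical packages typically do. The function names and the choice of five imputations are assumptions for the example.

```python
import random
from statistics import mean, variance

def ols(x, y):
    """Simple linear regression of y on x. Returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
             / sum((a - xbar) ** 2 for a in x))
    return ybar - slope * xbar, slope

def rubin_pool(estimates, variances):
    """Combine per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # variance across imputed data sets
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total

# Made-up data: years of education and income in $1000s (None = missing).
educ = [8.0, 9.0, 10.0, 12.0, 13.0, 14.0, 16.0, 17.0, 18.0]
income = [22.0, None, 25.0, 33.0, None, 34.0, 43.0, None, 45.0]

obs = [(e, i) for e, i in zip(educ, income) if i is not None]
b0, b1 = ols([e for e, _ in obs], [i for _, i in obs])
residuals = [i - (b0 + b1 * e) for e, i in obs]
resid_sd = (sum(r * r for r in residuals) / (len(obs) - 2)) ** 0.5

rng = random.Random(0)   # fixed seed so the sketch is reproducible
m = 5                    # number of imputed data sets (assumption)
estimates, variances = [], []
for _ in range(m):
    # Impute: regression prediction plus random error for each missing value.
    completed = [i if i is not None else b0 + b1 * e + rng.gauss(0, resid_sd)
                 for e, i in zip(educ, income)]
    # Analyze: estimate mean income and its sampling variance.
    estimates.append(mean(completed))
    variances.append(variance(completed) / len(completed))

# Combine.
pooled, within, between, total = rubin_pool(estimates, variances)
print(f"pooled mean income: {pooled:.2f}, pooled SE: {total ** 0.5:.2f}")
```

The between-imputation term is what separates this from single imputation: disagreement among the imputed data sets is added to the standard error instead of being ignored.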
6- Maximum Likelihood Estimation
Method: EM algorithms estimate model coefficients and standard errors using data with missing values.
Advantages:
- Does not impute missing data
- Best-fitting parameters are selected via iterations that maximize the probability of observing the data that were collected
Disadvantages:
- Requires more statistical knowledge
- Might require the use of different statistical programs; more common in SEM programs
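To give a feel for the iterations, here is a toy EM algorithm for a regression of y on a fully observed x, where some y values are missing. This is a sketch under simplifying assumptions (one predictor, normal errors, missingness in y only, made-up data), not how SEM software implements it, and it leaves out standard errors. The function name `em_regression` is hypothetical.

```python
def em_regression(x, y, n_iter=100):
    """EM for y = intercept + slope*x + normal error, with some y missing (None).
    Returns (intercept, slope, error_variance)."""
    n = len(x)
    y_obs = [v for v in y if v is not None]
    n_mis = n - len(y_obs)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)

    # Starting values: flat line at the observed mean of y.
    intercept = sum(y_obs) / len(y_obs)
    slope = 0.0
    s2 = sum((v - intercept) ** 2 for v in y_obs) / len(y_obs)

    for _ in range(n_iter):
        # E-step: expected y for missing cases under the current parameters.
        # Each missing case also carries s2 of extra expected squared error.
        ey = [yi if yi is not None else intercept + slope * xi
              for xi, yi in zip(x, y)]
        extra = n_mis * s2

        # M-step: re-estimate parameters from the expected sufficient statistics.
        ybar = sum(ey) / n
        slope = sum((xi - xbar) * (v - ybar) for xi, v in zip(x, ey)) / sxx
        intercept = ybar - slope * xbar
        s2 = (sum((v - intercept - slope * xi) ** 2
                  for xi, v in zip(x, ey)) + extra) / n
    return intercept, slope, s2

# Made-up data: years of education and income in $1000s (None = missing).
educ = [8.0, 9.0, 10.0, 12.0, 13.0, 14.0, 16.0, 17.0, 18.0]
income = [22.0, None, 25.0, 33.0, None, 34.0, 43.0, None, 45.0]

intercept, slope, s2 = em_regression(educ, income)
mean_income = intercept + slope * sum(educ) / len(educ)
print(f"slope: {slope:.3f}, intercept: {intercept:.3f}, error variance: {s2:.3f}")
print(f"ML estimate of mean income: {mean_income:.2f}")
```

No imputed data set is ever produced: the expected values in the E-step are only a device for updating the parameters, and the estimate of mean income uses the education values of all nine cases.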
Summary
- First, you must understand why you have missing data and examine the patterns.
- Then you can choose a technique to deal with missing data. You may choose more than one.
- No matter how you deal with missing data, you should run your analysis various ways:
  - With and without missing values included
  - Using different methods to test whether it changes results
  - Think about the direction in which missing values bias the results
- Before you start using one of these techniques, invest in understanding what assumptions you are making and how to do it with the software that you use.
General approaches
- The information presented here focused on general approaches for basic statistical analysis: OLS regression, logistic regression, ANOVA.
- See the literature on disciplinary and model-specific techniques and norms (multi-level models, structural equation modeling, factor analysis, panel data, etc.).
Statistical Software:
- SPSS: limited in basic version. Some expensive upgrades available.
- Stata: multiple imputation with mi, ice, micombine
- SAS: PROC MI, PROC MIANALYZE
- R: many options
References
- Allison, P. (2002). Missing Data. Sage. (A "little green book.")
- Johnson & Young. (2011). Toward best practices in analyzing datasets with missing data: Comparisons and recommendations. Journal of Marriage and Family 73: 926-945.
- Acock. (2005). Working with missing values. Journal of Marriage and Family 67: 1012-1028.
- Raghunathan. (2004). What to do with missing data? Some options for analysis of incomplete data. Annual Review of Public Health 25: 99-117.
- Little and Rubin. (1989). The analysis of social science data with missing values. Sociological Methods and Research.