

Introduction to Survey Data Analysis
Linda K. Owens, PhD
Assistant Director for Sampling & Analysis

General information
- Please hold questions until the end of the presentation
- Slides available at
- Please raise your hand so that I can see that you can hear me

Focus of the webinar
- Define survey data
- Data setup
- Data cleaning/missing data
- Weighting data
- In short, present overview of what you need to consider prior to analyzing data

What is survey data?
- Data gathered from a sample of individuals
- Sample is random (drawn using probabilistic methods)
- Each member of population must have known, nonzero chance of selection
- Chance of selection does not have to be equal for all members
- Issues/techniques covered in webinar assume random sampling

When analyzing survey data...
1. Understand & evaluate survey design
2. Screen the data
3. Adjust for sample design

1. Understand & evaluate survey
- Conductor of survey
- Sponsor of survey
- Measured variables
- Unit of analysis
- Mode of data collection
- Dates of data collection

1. Understand & evaluate survey (continued)
- Geographic coverage
- Respondent eligibility criteria
- Sample design
- Sample size & response rate

Levels of measurement
- Nominal
- Ordinal
- Interval
- Ratio

2. Data screening
ALWAYS examine raw frequency distributions for
a) out-of-range values (outliers)
b) missing values
c) check frequencies against codebook
d) for Web surveys, compare data collection output to data analysis output (e.g., Survey Monkey compared to SPSS)

Out-of-range values:
- Delete data
- Recode values

Missing data:
- can be unit or item missing
- can reduce effective sample size
- may introduce bias
- Remainder of seminar focuses on item missing, but unit missing is important and can require nonresponse analysis (possible topic for future webinar)

Reasons for missing data
- Refusals (question sensitivity)
- Don't know responses (cognitive problems, memory problems)
- Not applicable (skip patterns)
- Data processing errors
- Questionnaire programming errors
- Design factors (unit or item)
- Attrition in panel studies (unit)
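The screening steps above (raw frequencies, out-of-range values, missing values, comparison with the codebook) can be sketched in a few lines of Python. This is only an illustration: the codebook values, the missing-data codes 8 and 9, the responses, and the helper name `screen_item` are all hypothetical and are not taken from the webinar.

```python
from collections import Counter

# Hypothetical codebook entry for one 5-point item (illustrative values only).
VALID_CODES = {1, 2, 3, 4, 5}
MISSING_CODES = {8: "don't know", 9: "refused"}


def screen_item(responses, valid_codes, missing_codes):
    """Return raw frequencies, out-of-range values, and the missing count."""
    freq = Counter(responses)
    # Anything that is neither a valid code nor a declared missing code
    # is flagged as out of range.
    out_of_range = {
        value: count
        for value, count in freq.items()
        if value is not None
        and value not in valid_codes
        and value not in missing_codes
    }
    # Blanks (None) and declared missing codes both count as item-missing.
    n_missing = sum(
        count
        for value, count in freq.items()
        if value is None or value in missing_codes
    )
    return freq, out_of_range, n_missing


# Made-up raw responses; None stands for a blank answer.
responses = [1, 2, 2, 5, 9, 3, 7, None, 4, 8, 2, 55]
freq, out_of_range, n_missing = screen_item(responses, VALID_CODES, MISSING_CODES)

print("Raw frequencies:", dict(freq))
print("Out-of-range values:", out_of_range)
print("Item-missing responses:", n_missing)
```

Whether a flagged value is then deleted or recoded remains a judgment call for the analyst; the sketch only surfaces the cases.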

Effects of ignoring missing data
- Reduced sample size: loss of statistical power
- Data may no longer be representative: introduces bias
- Difficult to identify effects
- Power analysis

Assumptions on missing data
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Ignorable
- Nonignorable
- Type of missing data determines ignorable/nonignorable

Missing completely at random (MCAR)
- Being missing is independent of all variables.
- Cases with complete data are indistinguishable from cases with missing data.
- Missing cases are a random subsample of the original sample.

Missing at random (MAR)
- The probability of a variable being observed is independent of the true value of that variable, after controlling for one or more other variables.
- Example: Probability of missing income is unrelated to income within levels of education

Ignorable missing data
- The data are MAR.
- The missing data mechanism is unrelated to the parameters we want to estimate

Nonignorable missing data
- The pattern of data missingness is not MAR.
- Missingness is related to either the data collection process or to the questionnaire content
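The MAR example above (missing income unrelated to income within levels of education) can be illustrated with a small, entirely made-up data set. In this sketch high-education respondents skip the income item more often, but within each education level the observed incomes have the same mean as the full group. The records, the group labels, and the function name `education_adjusted_mean` are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical (education, income) records; None marks item-missing income.
# Half of the "high" group is missing, none of the "low" group is.
records = [
    ("low", 20), ("low", 30), ("low", 40),
    ("low", 20), ("low", 30), ("low", 40),
    ("high", 60), ("high", 70), ("high", 80),
    ("high", None), ("high", None), ("high", None),
]

# Complete-case estimate: ignores that missingness depends on education.
complete_case_mean = mean(inc for _, inc in records if inc is not None)


def education_adjusted_mean(records):
    """Average the observed group means, weighting each education level
    by its full size (respondents with and without income)."""
    groups = {}
    for educ, inc in records:
        groups.setdefault(educ, []).append(inc)
    total = 0.0
    for values in groups.values():
        observed = [v for v in values if v is not None]
        total += mean(observed) * len(values)
    return total / len(records)


adjusted_mean = education_adjusted_mean(records)

print("Complete-case mean:", round(complete_case_mean, 2))
print("Education-adjusted mean:", adjusted_mean)
```

By construction, the mean income of all twelve respondents would be 50 if every value had been observed. The complete-case mean falls below that because the high-education group is under-represented among complete cases, while conditioning on education recovers it. With real data the true values are unknown, so MAR remains an assumption rather than something this calculation can confirm.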

Methods of handling missing data (MCAR & MAR)
- Listwise (casewise) deletion: uses only complete cases (default for many software packages)
- Pairwise deletion: uses all available cases
- Dummy variable adjustment: missing value indicator method
- Mean substitution: substitute mean value computed from available cases (unconditional or conditional)
- Regression methods: predict value based on regression equation with other variables as predictors
- Hot deck: identify the case most similar to the case with a missing value and impute that case's value
- Maximum likelihood methods: use all available data to generate maximum likelihood-based statistics
- Multiple imputation: combines the methods of ML (maximum likelihood) to produce multiple data sets with imputed values for missing cases

Multiple imputation software
- SAS: user-written IVEware; experimental MI & MIANALYZE
- STATA: user-written ICE
- R: user-written libraries and functions
- SOLAS
- S-PLUS

Methods of handling missing data: Summary
- Listwise deletion assumes MCAR or MAR
- Dummy variable adjustment is biased; don't use
- Conditional mean substitution assumes MCAR within cells
- Hot deck & regression are an improvement on mean substitution, but...
- Multiple imputation is least biased but computationally intensive
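As a rough illustration of two of the simpler methods listed above, the sketch below applies listwise deletion and unconditional mean substitution to a handful of made-up values. The data are hypothetical, and the point is only to show the mechanics: mean substitution keeps every case and leaves the mean unchanged, but it shrinks the estimated variance because the filled-in values sit exactly at the mean.

```python
from statistics import mean, variance

# Made-up item responses; None marks an item-missing value.
values = [2, 4, 6, 8, None, None]

# Listwise deletion: keep complete cases only.
complete = [v for v in values if v is not None]

# Unconditional mean substitution: fill every gap with the observed mean.
fill = mean(complete)
imputed = [fill if v is None else v for v in values]

# Sample variances (n - 1 denominator) before and after substitution.
var_complete = variance(complete)
var_imputed = variance(imputed)

print("Cases kept by listwise deletion:", len(complete))
print("Mean after substitution:", mean(imputed))
print("Variance, complete cases:", round(var_complete, 3))
print("Variance, mean-substituted:", round(var_imputed, 3))
```

The smaller variance in the substituted data comes from treating filled-in values as if they had been collected, so standard errors computed from such data will tend to be too small.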

Missing data final point
- All methods of imputation underestimate true variance
  - artificial increase in sample size
  - values treated as though obtained by data collection
- Rao (1996). On variance estimation with imputed survey data. Journal of the American Statistical Association, 91.
- Fay (1996) in bibliography
- Good starting point is Paul Allison's chapter on missing data in Handbook of Survey Research (2010), 2nd Edition, Marsden & Wright (eds.); the 1st edition chapter differs markedly

Types of survey sample designs
- Simple random sampling
- Systematic sampling
- Complex sample designs
  - stratified designs
  - cluster designs
  - mixed mode designs

Why complex sample designs?
- Increased efficiency
- Decreased costs
- Statistical software packages that assume simple random sampling (SRS) typically underestimate the sampling variance.
- Not accounting for the impact of complex sample design can lead to a biased estimate of the sampling variance, which raises the risk of Type I error.

Sample weights
- Used to adjust for differing probabilities of selection.
- In theory, simple random samples are self-weighted.
- In practice, simple random samples are likely to also require adjustments for nonresponse.

Types of sample weights
- Selection weights: adjust for differences in probability of selection
- Poststratification weights: designed to bring the sample proportions in demographic subgroups into agreement with the population proportions in those subgroups
- Nonresponse weights: designed to inflate the weights of survey respondents to compensate for nonrespondents with similar characteristics
- Blow-up (expansion) weights: provide estimates for the total population of interest
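The selection and poststratification weights described above can be sketched as follows. This is a minimal illustration under stated assumptions: the selection probabilities, the subgroup labels, the population shares, and the function names `selection_weight` and `poststratify` are all hypothetical. The selection weight is taken as the inverse of the probability of selection, and the poststratification step rescales weights so that weighted subgroup shares match assumed population shares.

```python
def selection_weight(prob):
    """Selection weight: inverse of the probability of selection."""
    return 1.0 / prob


def poststratify(weights, groups, population_share):
    """Rescale weights so each group's weighted share matches its
    population share, leaving the overall weight total unchanged."""
    total = sum(weights)
    group_total = {}
    for w, g in zip(weights, groups):
        group_total[g] = group_total.get(g, 0.0) + w
    return [
        w * population_share[g] / (group_total[g] / total)
        for w, g in zip(weights, groups)
    ]


# Hypothetical respondents: selection probabilities and a demographic group.
probs = [0.1, 0.1, 0.2, 0.05]
groups = ["f", "f", "f", "m"]
population_share = {"f": 0.5, "m": 0.5}  # assumed known from an external source

base = [selection_weight(p) for p in probs]
final = poststratify(base, groups, population_share)

share_f = sum(w for w, g in zip(final, groups) if g == "f") / sum(final)
print("Selection weights:", base)
print("Poststratified weights:", [round(w, 3) for w in final])
print("Weighted share of group f:", round(share_f, 3))
```

Because the selection weights here are inverse probabilities, their sum also acts as a blow-up (expansion) estimate of the population size the respondents represent. A nonresponse adjustment would follow the same rescaling pattern within classes of similar respondents.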

Syntax examples of design-based analysis in STATA, SUDAAN, & SAS

STATA

    * Legacy syntax as shown on the slide (Stata 8 and earlier)
    svyset strata strata
    svyset psu psu
    svyset pweight finalwt
    svyreg fatintk age male black hispanic
    * Stata 9 and later express the same setup as:
    *   svyset psu [pweight=finalwt], strata(strata)
    *   svy: regress fatintk age male black hispanic

SUDAAN

    proc regress data=c:\nhanes.sav filetype=spss design=wr;
    nest strata psu;
    weight finalwt;
    subgroup sex race;
    levels 2 3;
    model fatintk = age sex race;

SAS

    proc surveyreg data=nhanes;
    strata strata;
    cluster psu;
    class sex race;
    model fatintk = age sex race;
    weight finalwt;
    run;

In summary, when analyzing survey data...
1. Understand & evaluate survey design.
2. Screen the data: deal with missing data & outliers.
3. If necessary, adjust for study design using weights and appropriate computer software.

Bibliography
- On Web site with slides
- Big names in area of missing data include the following:
  - Roderick Little
  - Donald B. Rubin
  - Paul Allison
- See also McKnight et al. Missing Data: A Gentle Introduction
- Large government data sets (NSFG, NHANES, NHIS) often include detailed methodological documentation on imputation and weight construction

Questions?
Type questions in chat box.

Thank You!
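To give a feel for what the design-based procedures in the syntax examples do with the strata, PSU, and weight variables, here is a small Python sketch for a weighted mean rather than a regression. It assumes a Taylor-linearization variance estimator with PSUs treated as sampled with replacement within strata, which is one common approach and is not necessarily what each package computes in every configuration. The data and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import variance


def weighted_mean(y, w):
    """Weighted mean of y using the final weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def design_based_variance(y, w, strata, psu):
    """Linearized variance of the weighted mean, treating PSUs as
    sampled with replacement within each stratum."""
    total_w = sum(w)
    ybar = weighted_mean(y, w)
    # Sum the linearized values within each PSU of each stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, c in zip(y, w, strata, psu):
        psu_totals[h][c] += wi * (yi - ybar) / total_w
    var = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(z) / n_h
        # Between-PSU variation within the stratum.
        var += n_h / (n_h - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return var


# Made-up data: two strata, two PSUs per stratum, equal weights.
y = [1, 3, 5, 7, 2, 2, 4, 4]
w = [1.0] * 8
strata = ["A"] * 4 + ["B"] * 4
psu = [1, 1, 2, 2, 3, 3, 4, 4]

design_var = design_based_variance(y, w, strata, psu)
srs_var = variance(y) / len(y)  # naive variance that assumes SRS

print("Weighted mean:", weighted_mean(y, w))
print("Design-based SE:", round(sqrt(design_var), 3))
print("Naive SRS SE:", round(sqrt(srs_var), 3))
```

In this toy example respondents within a PSU resemble each other, so the design-based standard error comes out larger than the naive one, which is the pattern the slides warn about when software assumes SRS.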