Multiple Imputation and Multiple Regression with SAS and IBM SPSS

See IntroQ Questionnaire for a description of the survey used to generate the data used here.

    *** Mult-Imput_M-Reg.sas ***;
    options pageno=min nodate formdlim='-';
    title 'Multiple Imputation of Missing Data then Multiple Regression.';
    run;

    /* Import the SPSS data file. */
    PROC IMPORT OUT=WORK.IntroQuest
        DATAFILE="C:\Users\Vati\Documents\StatData\IntroQ\IntroQ.sav"
        DBMS=SPSS REPLACE;
    RUN;

    /* Flag cases that are missing SATM (1 = missing, 0 = present). */
    Data Priapus;
        set IntroQuest;
        SATM_Miss = 0;
        If SATM = . then SATM_Miss = 1;
    run;

    /* Count observed and missing values on each variable. */
    proc means n nmiss; run;

    /* Correlate the missingness flag with the other variables. */
    proc corr nosimple;
        var SATM_Miss;
        with statoph gender ideal nucoph year;
    run;

The data are imported from an SPSS .sav file.

The MEANS Procedure

    Variable   Label        N   N Miss
    Gender     Gender     694        0
    Ideal      Ideal      689        5
    Eye        Eye        693        1
    Statoph    Statoph    685        9
    Nucoph     Nucoph     692        2
    SATM       SATM       547      147
    Year       Year       694        0

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

    Variable   r with SATM_Miss   Prob > |r|     N
    Statoph             0.08406       0.0278   685
    Gender             -0.05740       0.1309   694
    Ideal              -0.01715       0.6531   689
    Nucoph              0.00741       0.8458   692
    Year                0.08196       0.0309   694

Note that missingness on SATM is associated with statophobia and year.

    Proc MI seed=69301 out=midata;
        var statoph gender ideal nucoph SATM year;
    run;

Proc MI is used to create five imputations. No NIMPUTE= option is specified, so five is the number the procedure chose by default in this run.

Model Information

    Data Set                           WORK.INTROQUEST
    Method                             MCMC
    Multiple Imputation Chain          Single Chain
    Initial Estimates for MCMC         EM Posterior Mode
    Start                              Starting Value
    Prior                              Jeffreys
    Number of Imputations              5
    Number of Burn-in Iterations       200
    Number of Iterations               100
    Seed for random number generator   69301

Missing Data Patterns (X = observed, . = missing)

    Group   Statoph   Gender   Ideal   Nucoph   SATM   Year   Freq   Percent
    1       X         X        X       X        X      X       540     77.81
    2       X         X        X       X        .      X       139     20.03
    3       X         X        X       .        X      X         1      0.14
    4       X         X        .       X        X      X         2      0.29
    5       X         X        .       X        .      X         3      0.43
    6       .         X        X       X        X      X         3      0.43
    7       .         X        X       X        .      X         5      0.72
    8       .         X        X       .        X      X         1      0.14

Missing Data Patterns: Group Means

    Group   Statoph   Gender      Ideal       Nucoph      SATM         Year
    1       6.1712    1.26666     70.27925    58.05740    506.6685     1997.6666
    2       6.6726    1.20863     70.23741    59.18705    .            1999.1510
    3       6.0000    1.00000     73.00000    .           650.0000     2012.0000
    4       5.0000    1.50000     .           57.50000    440.0000     1993.0000
    5       5.3333    1.333333    .           50.000000   .            1999.333333
    6       .         1.000000    70.666667   58.333333   550.000000   2010.000000
    7       .         1.000000    67.200000   43.400000   .            2010.000000
    8       .         1.000000    75.000000   .           730.000000   1991.000000
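As a cross-check on the Missing Data Patterns output above, the sketch below recomputes the Percent column and the number of missing values per variable from the pattern frequencies alone. The frequencies are transcribed from the table; the pattern strings (X = observed, . = missing, in the order Statoph, Gender, Ideal, Nucoph, SATM, Year) and all Python names are my own shorthand, not anything produced by SAS.

```python
# Pattern frequencies transcribed from the Missing Data Patterns table.
# Each string lists Statoph, Gender, Ideal, Nucoph, SATM, Year in order;
# "X" means observed and "." means missing (shorthand assumed for this sketch).
VARIABLES = ["Statoph", "Gender", "Ideal", "Nucoph", "SATM", "Year"]
PATTERNS = [
    ("XXXXXX", 540),
    ("XXXX.X", 139),
    ("XXX.XX", 1),
    ("XX.XXX", 2),
    ("XX.X.X", 3),
    (".XXXXX", 3),
    (".XXX.X", 5),
    (".XX.XX", 1),
]

# Total number of cases is the sum of the pattern frequencies.
total = sum(freq for _, freq in PATTERNS)

# Percent column: each pattern's share of all cases, rounded to two decimals.
percents = [round(100 * freq / total, 2) for _, freq in PATTERNS]

# Missing values per variable: add the frequencies of every pattern in which
# that variable's position is marked ".".
n_missing = {
    var: sum(freq for pat, freq in PATTERNS if pat[i] == ".")
    for i, var in enumerate(VARIABLES)
}

print("total cases:", total)
print("percents:", percents)
print("missing per variable:", n_missing)
```

The per-variable counts obtained this way can be compared with the N Miss column from Proc Means; they should agree for the six variables that were given to Proc MI.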

The most common pattern (aside from complete data) is missingness only on SATM. We have means for each of the patterns. Those missing data on SATM do not appear to differ much from those with SATM data, although, as noted earlier, missingness on SATM is related to statophobia. Below we have Expectation Maximization estimates of means and covariances.

EM (Posterior Mode) Estimates

    _TYPE_   _NAME_      Statoph      Gender       Ideal       Nucoph          SATM          Year
    MEAN                6.259824    1.252161   70.255976    58.155101    507.402318   1998.110951
    COV      Statoph    5.252201   -0.141896    0.685998     0.893161    -72.908739     -3.509858
    COV      Gender    -0.141896    0.186693   -0.912737    -0.826166      2.167681     -0.041964
    COV      Ideal      0.685998   -0.912737   14.816058     8.297513    -23.940582     -1.323101
    COV      Nucoph     0.893161   -0.826166    8.297513   497.023714     77.921516      2.866324
    COV      SATM     -72.908739    2.167681  -23.940582    77.921516   9130.096517    284.126488
    COV      Year      -3.509858   -0.041964   -1.323101     2.866324    284.126488     79.070552

Variance Information

               ------------ Variance ------------              Relative Increase   Fraction Missing   Relative
    Variable     Between      Within       Total        DF     in Variance         Information        Efficiency
    Statoph     0.000237    0.007664    0.007949    549.04     0.037133            0.036421           0.992769
    Ideal       0.000194    0.021632    0.021865    670.7      0.010747            0.010688           0.997867
    Nucoph      0.004013    0.725252    0.730067    681.36     0.006639            0.006617           0.998678
    SATM        3.857993   13.586615   18.216206    55.285     0.340746            0.277121           0.947486

Snip, snip. I have culled the rest of the text output from Proc MI.

    /* Run the regression separately within each imputed data set and save
       the estimates and their covariance matrices. */
    Proc Reg data=midata outest=mrbyimput covout;
        Model Statoph = gender ideal nucoph SATM year / stb;
        By _Imputation_;
    run; quit;

    /* Pool the per-imputation estimates. */
    Proc MIAnalyze data=mrbyimput;
        modeleffects intercept gender ideal nucoph SATM year;
    run;

Here we used Proc Reg to conduct a multiple regression analysis on each of the five imputations.

Imputation Number=1

Analysis of Variance

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               5        578.89881     115.77976     25.76   <.0001
    Error             688       3092.80459       4.49536
    Corrected Total   693       3671.70340

    Root MSE          2.12023    R-Square   0.1577
    Dependent Mean    6.25911    Adj R-Sq   0.1515
    Coeff Var        33.87421

Parameter Estimates

    Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
    Intercept   Intercept    1             41.56462         19.15755      2.17     0.0304                 0
    Gender      Gender       1             -0.72482          0.22232     -3.26     0.0012          -0.13684
    Ideal       Ideal        1             -0.01048          0.02499     -0.42     0.6750          -0.01763
    Nucoph      Nucoph       1              0.00154          0.00362      0.43     0.6696           0.01503
    SATM        SATM         1             -0.00836       0.00089371     -9.36     <.0001          -0.34798
    Year        Year         1             -0.01477          0.00956     -1.54     0.1230          -0.05739

Imputation Number=2

Analysis of Variance

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               5        486.59526      97.31905     20.75   <.0001
    Error             688       3226.23255       4.68929
    Corrected Total   693       3712.82781

    Root MSE          2.16548    R-Square   0.1311
    Dependent Mean    6.28280    Adj R-Sq   0.1247
    Coeff Var        34.46674

Parameter Estimates

    Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
    Intercept   Intercept    1             43.57996         19.63473      2.22     0.0268                 0
    Gender      Gender       1             -0.77863          0.22730     -3.43     0.0006          -0.14618
    Ideal       Ideal        1             -0.02357          0.02550     -0.92     0.3556          -0.03946
    Nucoph      Nucoph       1              0.00237          0.00370      0.64     0.5228           0.02289
    SATM        SATM         1             -0.00735       0.00091452     -8.04     <.0001          -0.30560
    Year        Year         1             -0.01556          0.00981     -1.59     0.1132          -0.06011

Imputation Number=3

Analysis of Variance

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               5        542.46132     108.49226     23.71   <.0001
    Error             688       3148.45474       4.57624
    Corrected Total   693       3690.91606

    Root MSE          2.13922    R-Square   0.1470
    Dependent Mean    6.24411    Adj R-Sq   0.1408
    Coeff Var        34.25972

Parameter Estimates

    Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
    Intercept   Intercept    1             47.65661         19.41478      2.45     0.0143                 0
    Gender      Gender       1             -0.73550          0.22412     -3.28     0.0011          -0.13850
    Ideal       Ideal        1             -0.00947          0.02528     -0.37     0.7080          -0.01585
    Nucoph      Nucoph       1              0.00284          0.00365      0.78     0.4367           0.02759
    SATM        SATM         1             -0.00754       0.00086555     -8.71     <.0001          -0.32789
    Year        Year         1             -0.01809          0.00970     -1.87     0.0626          -0.07011

Imputation Number=4

Analysis of Variance

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               5        495.04444      99.00889     21.43   <.0001
    Error             688       3178.16052       4.61942
    Corrected Total   693       3673.20496

    Root MSE          2.14928    R-Square   0.1348
    Dependent Mean    6.26736    Adj R-Sq   0.1285
    Coeff Var        34.29329

Parameter Estimates

    Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
    Intercept   Intercept    1             47.04474         19.41199      2.42     0.0156                 0
    Gender      Gender       1             -0.77684          0.22556     -3.44     0.0006          -0.14663
    Ideal       Ideal        1             -0.02171          0.02532     -0.86     0.3916          -0.03654
    Nucoph      Nucoph       1              0.00208          0.00366      0.57     0.5699           0.02033
    SATM        SATM         1             -0.00742       0.00090292     -8.21     <.0001          -0.30996
    Year        Year         1             -0.01733          0.00969     -1.79     0.0741          -0.06732

Imputation Number=5

Analysis of Variance

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               5        479.36626      95.87325     20.60   <.0001
    Error             688       3201.88856       4.65391
    Corrected Total   693       3681.25482

    Root MSE          2.15729    R-Square   0.1302
    Dependent Mean    6.24894    Adj R-Sq   0.1239
    Coeff Var        34.52255

Parameter Estimates

    Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
    Intercept   Intercept    1             54.51860         19.46584      2.80     0.0052                 0
    Gender      Gender       1             -0.68471          0.22651     -3.02     0.0026          -0.12910
    Ideal       Ideal        1             -0.00908          0.02533     -0.36     0.7202          -0.01533
    Nucoph      Nucoph       1              0.00235          0.00368      0.64     0.5227           0.02289
    SATM        SATM         1             -0.00705       0.00089989     -7.84     <.0001          -0.29619
    Year        Year         1             -0.02170          0.00972     -2.23     0.0259          -0.08418

Proc MIAnalyze is used to pool the results from the multiple imputations. The variance of each parameter estimate is partitioned into that among imputations (A, labeled "Between" in the output) and that within imputations (W). The Relative Increase in Variance (r) is the increase in variance due to having missing data imputed (relative to the condition where no data are missing),

$$ r = \left(1 + \frac{1}{m}\right)\frac{A}{W}, $$

where m is the number of imputations. For SATM, for example, r = (1 + 1/5)(0.000000242)/(0.000000802), or about 0.362. A related statistic, Fraction of Missing Information, is an index of how much more precise the parameter estimate would have been if there had been no missing data. Power will, of course, be greater when the fraction of missing information and relative increase in variance are small. The greater the number of imputations, the less the error and the greater the power, ceteris paribus. Relative efficiency tells you how much precision (and thus power) you have for the number of imputations you have employed relative to what you would have if you used an infinitely large number of imputations.

The MIANALYZE Procedure

Variance Information

                --------------- Variance ---------------              Relative Increase   Fraction Missing   Relative
    Parameter       Between         Within          Total       DF    in Variance         Information        Efficiency
    intercept     24.530433     377.042475     406.478994   762.72    0.078072            0.074841           0.985253
    gender         0.001539       0.050701       0.052548     3237    0.036433            0.035748           0.992901
    ideal       0.000051078       0.000639       0.000701   522.62    0.095874            0.090958           0.982133
    nucoph      0.000000224    0.000013399    0.000013669    10309    0.020094            0.019888           0.996038
    SATM        0.000000242    0.000000802    0.000001092   56.584    0.362170            0.290519           0.945087
    year        0.000007301    0.000094011       0.000103   550.36    0.093198            0.088559           0.982596

Parameter Estimates

    Parameter    Estimate    95% Confidence Limits       DF      Minimum      Maximum       t   Pr > |t|
    intercept   46.872905     7.29463    86.45118    762.72    41.564615    54.518596    2.32     0.0203
    gender      -0.740101    -1.18956    -0.29064      3237    -0.778630    -0.684711   -3.23     0.0013
    ideal       -0.014862    -0.06686     0.03714    522.62    -0.023569    -0.009078   -0.56     0.5747
    nucoph       0.002236    -0.00501     0.00948     10309     0.001544     0.002838    0.60     0.5453
    SATM        -0.007544    -0.00964    -0.00545    56.584    -0.008363    -0.007052   -7.22     <.0001
    year        -0.017489    -0.03740     0.00242    550.36    -0.021695    -0.014770   -1.73     0.0851
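To make the pooling concrete, here is a minimal sketch that applies the usual combining rules (Rubin's rules, as I understand them) to the five SATM slopes and standard errors printed in the Proc Reg output above. The inputs are transcribed from that output; the function and key names are my own, and the formulas for the degrees of freedom and the fraction of missing information are the standard textbook versions, which I am assuming are what Proc MIAnalyze uses by default. Because the inputs are rounded as printed, the results match the SATM row of the MIANALYZE tables only to within rounding.

```python
import math

# SATM slope and standard error in each of the five imputed data sets,
# transcribed from the Proc Reg output (rounded as printed there).
estimates = [-0.00836, -0.00735, -0.00754, -0.00742, -0.00705]
std_errors = [0.00089371, 0.00091452, 0.00086555, 0.00090292, 0.00089989]


def pool(estimates, std_errors):
    """Combine m per-imputation estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                             # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m         # W: mean squared SE
    among = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # A: "Between"
    total = within + (1 + 1 / m) * among                   # total variance
    r = (1 + 1 / m) * among / within                       # relative increase
    df = (m - 1) * (1 + 1 / r) ** 2                        # degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)                     # fraction missing info
    rel_eff = 1 / (1 + fmi / m)                            # relative efficiency
    t = q_bar / math.sqrt(total)                           # test of H0: slope = 0
    return {
        "estimate": q_bar, "within": within, "among": among, "total": total,
        "r": r, "df": df, "fmi": fmi, "rel_eff": rel_eff, "t": t,
    }


pooled = pool(estimates, std_errors)
for name, value in pooled.items():
    print(f"{name:9s} {value:.9f}")
```

Note that the relative efficiency depends on m only through the term fmi / m, which is why adding imputations yields diminishing returns once m is several times larger than the fraction of missing information.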

Multiple Imputation with IBM SPSS

Analyze, Multiple Imputation, Impute Missing Data Values

    *Impute Missing Data Values.
    DATASET DECLARE IntroQ_Imputed.
    MULTIPLE IMPUTATION Statoph Gender Ideal Nucoph SATM Year
      /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
      /MISSINGSUMMARIES NONE
      /IMPUTATIONSUMMARIES MODELS
      /OUTFILE IMPUTATIONS=IntroQ_Imputed.

Multiple Imputation

[DataSet] C:\Users\Vati\Documents\StatData\IntroQ\IntroQ.sav

Imputation Specifications

    Imputation Method                                  Automatic
    Number of Imputations                              5
    Model for Scale Variables                          Linear Regression
    Interactions Included in Models                    (none)
    Maximum Percentage of Missing Values               100.0%
    Maximum Number of Parameters in Imputation Model   100

Imputation Results

    Imputation Method                                            Fully Conditional Specification
    Fully Conditional Specification Method Iterations            10
    Dependent Variables: Imputed                                 Statoph, Ideal, Nucoph, SATM
    Dependent Variables: Not Imputed (Too Many Missing Values)
    Dependent Variables: Not Imputed (No Missing Values)         Gender, Year
    Imputation Sequence                                          Gender, Year, Nucoph, Ideal, Statoph, SATM

Imputation Models

    Variable   Model Type          Model Effects                         Missing Values   Imputed Values
    Nucoph     Linear Regression   Gender, Year, Ideal, Statoph, SATM                 2               10
    Ideal      Linear Regression   Gender, Year, Nucoph, Statoph, SATM                5               25
    Statoph    Linear Regression   Gender, Year, Nucoph, Ideal, SATM                  9               45
    SATM       Linear Regression   Gender, Year, Nucoph, Ideal, Statoph             147              735

At this point SPSS has created a new data set containing the original data (imputation 0) and the imputed data (in this case, imputations 1 through 5). The cells with imputed scores are highlighted. All you need do now is run the desired analysis. If that analysis is supported, SPSS will automatically analyze the original data and each imputed set of data and give you convenient summaries of the results.

    DATASET ACTIVATE IntroQ_MultipleImputation.
    REGRESSION
      /MISSING LISTWISE
      /STATISTICS COEFF OUTS R ANOVA
      /CRITERIA=PIN(.05) POUT(.10)
      /NOORIGIN
      /DEPENDENT Statoph
      /METHOD=ENTER Gender Ideal Nucoph SATM Year.
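The stacked layout described above can be pictured with a small sketch. The rows below are made-up values for illustration only (they are not IntroQ data, and only two imputations are shown); the field name Imputation_ follows the SPSS output, while the other names are hypothetical. The idea is simply that index 0 holds the original data with its gaps, each later index holds one completed copy, the analysis is run once per index, and only the completed copies are pooled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stacked rows (made-up values, NOT the IntroQ data).
# Imputation_ 0 is the original data, where None marks a missing score;
# Imputation_ 1 and 2 are completed copies with that score filled in.
stacked = [
    {"Imputation_": 0, "case": 1, "SATM": 500},
    {"Imputation_": 0, "case": 2, "SATM": None},
    {"Imputation_": 0, "case": 3, "SATM": 540},
    {"Imputation_": 1, "case": 1, "SATM": 500},
    {"Imputation_": 1, "case": 2, "SATM": 490},
    {"Imputation_": 1, "case": 3, "SATM": 540},
    {"Imputation_": 2, "case": 1, "SATM": 500},
    {"Imputation_": 2, "case": 2, "SATM": 520},
    {"Imputation_": 2, "case": 3, "SATM": 540},
]


def split_by_imputation(rows):
    """Group the stacked rows by their imputation index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Imputation_"]].append(row)
    return dict(groups)


def mean_satm(rows):
    """Stand-in 'analysis': mean SATM, dropping cases with a missing score."""
    return mean(r["SATM"] for r in rows if r["SATM"] is not None)


groups = split_by_imputation(stacked)

# Run the analysis once per imputation index (0 = original data).
per_imputation = {k: mean_satm(v) for k, v in sorted(groups.items())}

# Pool only the completed data sets (index 1 and up), not the original data.
pooled_mean = mean(v for k, v in per_imputation.items() if k != 0)

print(per_imputation)
print("pooled:", pooled_mean)
```

A mean is used here only to keep the sketch short; in the handout the per-imputation analysis is the multiple regression.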

Model Summary

    Imputation_     Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
    Original data   1       .371a   .138       .129                2.1940
    1               1       .361b   .130       .124                2.1569
    2               1       .352b   .124       .117                2.1682
    3               1       .388b   .150       .144                2.1279
    4               1       .367b   .134       .128                2.1547
    5               1       .340b   .116       .109                2.1649

    a. Predictors: (Constant), Year, Nucoph, Ideal, SATM, Gender
    b. Predictors: (Constant), Year, Gender, Nucoph, SATM, Ideal

ANOVA(a)

    Imputation_     Model            Sum of Squares    df   Mean Square        F   Sig.
    Original data   1   Regression          410.002     5        82.000   17.036   .000b
                        Residual           2570.403   534         4.813
                        Total              2980.405   539
    1               1   Regression          480.049     5        96.010   20.637   .000c
                        Residual           3200.736   688         4.652
                        Total              3680.785   693
    2               1   Regression          456.176     5        91.235   19.406   .000c
                        Residual           3234.479   688         4.701
                        Total              3690.656   693
    3               1   Regression          551.694     5       110.339   24.368   .000c
                        Residual           3115.278   688         4.528
                        Total              3666.972   693
    4               1   Regression          495.625     5        99.125   21.351   .000c
                        Residual           3194.150   688         4.643
                        Total              3689.775   693
    5               1   Regression          421.753     5        84.351   17.997   .000c
                        Residual           3224.629   688         4.687
                        Total              3646.382   693

    a. Dependent Variable: Statoph
    b. Predictors: (Constant), Year, Nucoph, Ideal, SATM, Gender
    c. Predictors: (Constant), Year, Gender, Nucoph, SATM, Ideal
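The Model Summary and ANOVA output above are tied together by simple arithmetic, which the following sketch works through. The sums of squares and degrees of freedom are transcribed from the ANOVA table and the reported values from the Model Summary; the function and variable names are my own. It also recovers the number of cases behind each analysis (total df plus one), which shows how many cases listwise deletion removed from the original-data regression.

```python
# (regression SS, total SS, total df) transcribed from the SPSS ANOVA table.
ANOVA = {
    "Original data": (410.002, 2980.405, 539),
    "1": (480.049, 3680.785, 693),
    "2": (456.176, 3690.656, 693),
    "3": (551.694, 3666.972, 693),
    "4": (495.625, 3689.775, 693),
    "5": (421.753, 3646.382, 693),
}

# (R Square, Adjusted R Square) as printed in the Model Summary.
REPORTED = {
    "Original data": (0.138, 0.129),
    "1": (0.130, 0.124),
    "2": (0.124, 0.117),
    "3": (0.150, 0.144),
    "4": (0.134, 0.128),
    "5": (0.116, 0.109),
}

N_PREDICTORS = 5  # Gender, Ideal, Nucoph, SATM, Year


def fit_statistics(ss_regression, ss_total, df_total, k=N_PREDICTORS):
    """Return (R square, adjusted R square, number of cases)."""
    r2 = ss_regression / ss_total
    adj_r2 = 1 - (1 - r2) * df_total / (df_total - k)
    n_cases = df_total + 1
    return r2, adj_r2, n_cases


results = {label: fit_statistics(*row) for label, row in ANOVA.items()}
for label, (r2, adj_r2, n_cases) in results.items():
    print(f"{label:13s} n={n_cases}  R2={r2:.3f}  adj R2={adj_r2:.3f}")

# Cases lost to listwise deletion in the original-data analysis.
cases_dropped = results["1"][2] - results["Original data"][2]
print("cases dropped by listwise deletion:", cases_dropped)
```

The count for the original data equals the frequency of the complete-data pattern in the Proc MI output, as it should.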

Coefficients(a)

                                    Unstandardized Coefficients   Standardized Coefficients
    Imputation_     Model                  B     Std. Error       Beta                              t   Sig.
    Original data   1   (Constant)    45.632       22.219                                       2.054   .040
                        Gender         -.833         .270         -.157                        -3.083   .002
                        Ideal          -.019         .031         -.031                         -.616   .538
                        Nucoph          .003         .004          .032                          .801   .424
                        SATM           -.008         .001         -.308                        -7.183   .000
                        Year           -.017         .011         -.064                        -1.506   .133
    1               1   (Constant)    49.000       19.465                                       2.517   .012
                        Gender         -.827         .226         -.156                        -3.655   .000
                        Ideal          -.032         .026         -.054                        -1.265   .206
                        Nucoph          .002         .004          .022                          .606   .545
                        SATM           -.007         .001         -.302                        -7.951   .000
                        Year           -.018         .010         -.070                        -1.842   .066
    2               1   (Constant)    53.105       19.503                                       2.723   .007
                        Gender         -.827         .227         -.156                        -3.636   .000
                        Ideal          -.021         .026         -.035                         -.816   .415
                        Nucoph          .001         .004          .015                          .405   .686
                        SATM           -.007         .001         -.287                        -7.587   .000
                        Year           -.020         .010         -.079                        -2.102   .036
    3               1   (Constant)    45.250       19.175                                       2.360   .019
                        Gender         -.717         .223         -.136                        -3.219   .001
                        Ideal          -.013         .025         -.022                         -.521   .603
                        Nucoph          .002         .004          .016                          .465   .642
                        SATM           -.008         .001         -.335                        -8.988   .000
                        Year           -.017         .010         -.065                        -1.737   .083
    4               1   (Constant)    43.517       19.558                                       2.225   .026
                        Gender         -.810         .226         -.153                        -3.581   .000
                        Ideal          -.019         .025         -.032                         -.738   .460
                        Nucoph          .001         .004          .012                          .345   .730
                        SATM           -.007         .001         -.312                        -8.216   .000
                        Year           -.016         .010         -.061                        -1.598   .111
    5               1   (Constant)    50.098       19.720                                       2.540   .011
                        Gender         -.751         .226         -.142                        -3.325   .001
                        Ideal          -.014         .025         -.023                         -.533   .594
                        Nucoph          .002         .004          .016                          .452   .651
                        SATM           -.007         .001         -.273                        -7.073   .000
                        Year           -.019         .010         -.076                        -1.963   .050
    Pooled          1   (Constant)    48.194       19.934                                       2.418   .016
                        Gender         -.787         .232                                      -3.387   .001
                        Ideal          -.020         .027                                       -.735   .463
                        Nucoph          .002         .004                                        .452   .651
                        SATM           -.007         .001                                      -6.730   .000
                        Year           -.018         .010                                      -1.805   .071

Coefficients(a), pooled estimates (continued)

    Imputation_   Model                Fraction Missing Info.   Relative Increase Variance   Relative Efficiency
    Pooled        1   (Constant)       .045                     .047                         .991
                      Gender           .056                     .058                         .989
                      Ideal            .106                     .113                         .979
                      Nucoph           .011                     .011                         .998
                      SATM             .311                     .396                         .941
                      Year             .048                     .049                         .991

    a. Dependent Variable: Statoph

Karl L. Wuensch, October, 2017

Treatment of Missing Data: recommended reading, David Howell.