THE CONTINUING QUANDARY OF SURVEY DATA PART II: Comparison of SAS Procedures and SUDAAN Procedures


Katherine Baisden, SRI International, Menlo Park, California

ABSTRACT

Once upon a time, in the days of simple random sampling, analyzing survey data was straightforward. However, for reasons of efficiency and economy, we rarely use simple random sampling in survey research today. Most survey data are based on a stratified, clustered or otherwise complex sample design. Such designs affect the accuracy of variance estimates (standard errors) and test statistics (chi-squares, t-tests). Until recently, SAS programmers had to rely on other statistical software packages, such as SUDAAN, WesVar, and STATA, to produce accurate variance estimates and test statistics from complex sample designs. With the release of SAS V9.1, SAS has incorporated survey procedures (e.g., PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG, PROC SURVEYLOGISTIC) to address this issue. This paper examines four basic procedures used in the vast majority of survey research (means, frequencies, regressions and logistic regressions). It explores the differences between the four Proc Survey procedures in SAS and the corresponding SAS-callable SUDAAN v8.0 procedures. Using data examples, the paper highlights the differences in syntax and output. It discusses the available options, limitations and recent updates of each package.

INTRODUCTION

To maximize the effort of survey data collection and to minimize the cost, researchers continue to develop increasingly complex sample designs. These designs include stratification, clustering, unequal probabilities of selection, and a multitude of combinations of these techniques. Simple random sample designs are a rarity in this day and age of survey research. These complex designs affect the accuracy of variance estimates and test statistics.
The SAS programmer must expand beyond the traditional tools in his/her analytical handbag to deal with survey data today. Until recently, SAS programmers had to use additional software packages, such as SUDAAN, to produce correct variance estimates. Now, with SAS v9.1 (PROC SURVEYSELECT, PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG and PROC SURVEYLOGISTIC), some of the tools needed to deal with this type of survey data are available in SAS. This paper compares and contrasts SAS v9.1 and SAS-callable SUDAAN v8.0, focusing on syntax and output for four of the most common procedures used in analysis: the crosstabulation/frequency, means, regression and logistic regression procedures.

The comparison is demonstrated using data from a study of teachers in the state of California. Schools were classified on three criteria: the percentage of emergency credentialed teachers in the school (EMERG: le 10%, 11-19%, 20%+), the size of the district (DISTSIZE: less than 5000, ,000, 10,000+) and the type of school (SCHL_LVL: elementary, middle and high school). Weights were developed for the data based on these stratification variables. Teachers were then selected from each of the strata. For analysis purposes, our statistician has classified this as a stratified sample with replacement. The data examples highlight the syntax (not all options can be included) for each of the procedures in SAS-callable SUDAAN and SAS. The data are presented to illustrate syntax, not to report substantive findings.

GENERAL COMMENTS ABOUT SAS AND SAS-CALLABLE SUDAAN

Both the SAS and the SUDAAN procedures use the Taylor series linearization method to calculate variance estimates. However, SUDAAN also offers the option of using balanced repeated replicates (BRR) and jackknife weights. SUDAAN cannot calculate BRR or jackknife weights itself, but it can use them if they are provided on the data set.
The SAS-callable version of SUDAAN is designed for use within the framework of SAS. You can call SUDAAN from within any SAS program, and it reads the SAS data set format directly. Thus, much of the syntax of a procedure is very similar. However, there are some important differences to note. This paper will focus on PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG and PROC SURVEYLOGISTIC. Each procedure has many options and statistics available in each package; because of space limitations, this paper highlights only the most common.

DETERMINING THE IMPACT OF SAMPLE DESIGNS ON VARIANCE ESTIMATES

There is a way to measure the impact of complex sample designs (CSD) on variance estimates. A common measure is called the Design Effect (DE). The DE is a ratio. It takes the variance from the CSD and compares it to the variance that would have occurred under the assumption of simple random sampling (SRS). If the DE is close to 1.0, one can assume the variances would have come out the same whether the design was a CSD or an SRS. Most of the time, the DE for a CSD is greater than one. The larger the DE, the more correlated your respondents are within clusters, and the more the variances will be underestimated by packages that cannot go beyond the assumption of SRS.

    DE = variance under CSD / variance under SRS

The table below shows whether point and variance estimates are correct when weighted data from a complex sample design are analyzed with regular SAS procedures (unweighted or weighted), the SAS Proc Survey procedures, and the SUDAAN procedures.

                                        Point Estimates            Variances
                                        (Percents, Means, Etc.)    (Standard Errors, Variances)
    Unweighted Regular SAS Procedures   Incorrect                  Incorrect
    Weighted Regular SAS Procedures     Correct                    Incorrect
    SAS Proc Survey Procedures          Correct                    Correct
    SUDAAN Procedures                   Correct                    Correct

Unweighted regular SAS procedures produce incorrect point estimates and incorrect variance estimates. Weighted regular SAS procedures produce correct point estimates but incorrect variance estimates. The point estimates will be the same, within rounding, for weighted regular SAS procedures, SAS Proc Survey procedures and SUDAAN procedures. The variance estimates will be the same for the SAS Proc Survey procedures and the SUDAAN procedures, although slight differences between the two programs can appear because, at this time, they differ slightly in computation and in the handling of missing data. For example, different estimates and standard errors may be due to different tolerances for matrix inversion or a different number of iterations in the regression procedures.

Before beginning any analysis, you must determine what kind of sampling design the survey is based on. SUDAAN offers you a choice of sample designs, which you name with the DESIGN= option on each procedure statement.

SAS and SUDAAN offer the following procedures:

    SAS              SUDAAN      PURPOSE
    (none)           RECORDS     Print records from ASCII, SAS, SPSS and SUDAAN
    SURVEYFREQ       CROSSTAB    Produces weighted oneway and multiway frequencies
    (none)           RATIO       Produces ratio estimates and their standard errors for correlated data
    SURVEYMEANS      DESCRIPT    Produces means, medians and quantiles and their standard errors
    SURVEYREG        REGRESS     Fits linear models
    SURVEYLOGISTIC   RLOGIST     Fits logistic regression models
    (none)           MULTILOG    Logistic model with categorical dependent variables
    (none)           SURVIVAL    Fits the discrete proportional hazards model
    SURVEYSELECT     (none)      Helps you select a sample

The majority of my knowledge about these procedures comes from self-discovery and hands-on experience. Although both programs use very similar syntax, SUDAAN requires more detail. For example, SUDAAN version 8 requires that you specify the number of levels of every categorical variable in the syntax (using the LEVELS statement). However, with the release of SUDAAN version 9, Research Triangle Institute (RTI) will introduce the CLASS statement, which will eliminate the need to specify the number of levels for each categorical variable. The new CLASS statement can be used as a replacement for the SUBGROUP/LEVELS combination in all SUDAAN procedures. It also relaxes the restriction that the levels of a variable must be consecutive integers from 1 to m. A CLASS variable must be numeric, but can take on any values, including missing values.

In addition, SUDAAN will not accept 0,1 coding schemes for categorical variables; all values of a categorical variable must start at 1. The one exception is PROC RLOGIST. You do have the option to recode your variables on the fly within a SUDAAN procedure, but it is another step that must be taken for successful completion of a procedure.

Likewise, there is no default printing of output in SUDAAN. You must specify exactly which statistics you want printed and in what format. It is not as simple as requesting statistics with options on a SAS procedure statement. Unlike SAS, SUDAAN also does not show the variable names in the output unless they are included in the label of the variable.

At the present time, you cannot run SAS-callable SUDAAN v8.0 with SAS v9.0 or SAS v9.1; you must use SAS v8.2. SUDAAN v9.0 will be compatible with SAS v9.1.

SAS assumes that first-stage sampling is with replacement, although in reality the vast majority of the time it is not. This can result in an overestimate of the variance, but the overestimate is very small.

PROC SURVEYMEANS IN SAS

    PROC SURVEYMEANS;
        VAR T4B;
        STRATA EMERG DISTSIZE SCHL_LVL;
        WEIGHT WGTD;
        DOMAIN T40;
        TITLE "MEAN OF 4B IN SAS";
    RUN;

This analysis requests the overall mean of T4B (number of classes taught) and the mean number of classes taught for each gender (T40). The stratification variables are EMERG, DISTSIZE, and SCHL_LVL, as indicated on the STRATA statement. The DOMAIN statement requests the breakdown of T4B by gender. When no statistic keywords are specified, SAS provides the NOBS, MEAN, STDERR and CLM statistics by default. The LIST option on the STRATA statement provides basic information (N, number of missing, strata variable levels) about the respondents in each stratum (an example appears in Exhibit 3A, the SAS SURVEYREG example).

Calculated design effects are not available in this procedure. If you would like to calculate design effects, you will need to run the analysis once as a normal weighted SAS MEANS procedure and once as a PROC SURVEYMEANS. You would then take the results from the two procedures and apply the DEFF formula (DEFF = CSD variance / SRS variance). Output in Exhibit 1A.
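The two-run design effect calculation described above can be written out as a formula. This is a sketch in my own notation, not notation taken from either package: $\widehat{\mathrm{SE}}_{\mathrm{CSD}}$ stands for the standard error of the mean reported by PROC SURVEYMEANS, and $\widehat{\mathrm{SE}}_{\mathrm{SRS}}$ for the standard error of the same mean from the weighted PROC MEANS run that ignores the design.

```latex
% Design effect for a mean \bar{y}, computed from the two runs.
% Both procedures report standard errors, so square them to get variances.
\[
  \mathrm{DEFF}(\bar{y})
  \;=\;
  \frac{\widehat{\mathrm{Var}}_{\mathrm{CSD}}(\bar{y})}
       {\widehat{\mathrm{Var}}_{\mathrm{SRS}}(\bar{y})}
  \;=\;
  \left(
    \frac{\widehat{\mathrm{SE}}_{\mathrm{CSD}}(\bar{y})}
         {\widehat{\mathrm{SE}}_{\mathrm{SRS}}(\bar{y})}
  \right)^{2}
\]
```

Because the ratio is taken on variances, a design-based standard error that is 1.5 times the simple random sampling standard error corresponds to a design effect of 2.25, not 1.5.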

As in PROC MEANS, when computing statistics for an analysis variable, SAS omits observations with missing values for that variable. In addition, it is important to note that PROC SURVEYMEANS excludes an observation from the analysis if it has a missing or non-positive value for the weight. Observations are also excluded if they have missing values for the variables on the STRATA or CLUSTER statement, unless the MISSING option is used. When the MISSING option is used, missing values are treated as a valid category. As an experienced SAS programmer, you may want to sort the data set by T40 (gender) and use a BY statement. That method will produce a NOTE from SAS requesting a DOMAIN statement.

PROC DESCRIPT IN SUDAAN

(Overall Mean)

    PROC DESCRIPT DATA=ONE FILETYPE=SAS DESIGN=STRWR;
        NEST EMERG DISTSIZE SCHL_LVL;
        WEIGHT WGTD;
        VAR T4B;
        SETENV LABWIDTH=28 COLSPCE=1 COLWIDTH=10 DECWIDTH=4;
        PRINT NSUM="SAMPLE SIZE" WSUM="POPULATION SIZE" MEAN SEMEAN="S.E."
              DEFFMEAN="DESIGN EFFECT" /
              STYLE=NCHS NSUMFMT=F6.0 WSUMFMT=F10.0 DEFFMEANFMT=F6.2 SEMEANFMT=F7.4;
        RTITLE "MEAN OF T4B IN SUDAAN";
    RUN;

(Mean by Gender)

    PROC DESCRIPT DATA=ONE FILETYPE=SAS DESIGN=STRWR;
        NEST EMERG DISTSIZE SCHL_LVL;
        WEIGHT WGTD;
        VAR T4B;
        SUBGROUP T40;
        LEVELS 2;
        SETENV LABWIDTH=28 COLSPCE=1 COLWIDTH=10 DECWIDTH=4;
        PRINT NSUM="Sample Size" WSUM="Population size" MEAN SEMEAN="S.E."
              DEFFMEAN="Design Effect" /
              STYLE=NCHS NSUMFMT=F6.0 WSUMFMT=F10.0 DEFFMEANFMT=F6.2 SEMEANFMT=F10.4;
        RTITLE "Mean of T4B by T40 IN SUDAAN";
    RUN;

This analysis is the same request as the PROC SURVEYMEANS in the preceding section. The STRWR design was used to correspond with the SAS assumptions. In SUDAAN the name of the procedure is DESCRIPT. You must specify the file type and the design. The NEST statement is similar to the STRATA statement in SAS. The SUBGROUP statement corresponds to the DOMAIN statement in SAS; however, you must include a LEVELS statement indicating the number of levels of the variable. Design effects can be requested in SUDAAN, which is not true of PROC SURVEYMEANS. Output in Exhibit 1B.

The SETENV statement sets the output environment parameters, similar to the OPTIONS statement in SAS. The PRINT statement is where you list each statistic you want in the output, along with a label for it. The STYLE option selects a particular layout for the printed output; the NCHS style follows the standards of the National Center for Health Statistics. Before you are done, you must give a format for each statistic. If the format is not large enough, an ** will appear in the output, and you must then go back and change the format for that specific statistic. The RTITLE statement is equivalent to the TITLE statement in SAS. Unlike SAS, you have to execute PROC DESCRIPT twice to get both the overall mean of T4B and the separate means of T4B by gender (T40).

SUDAAN handles missing values very much like SAS. Observations that have missing values for the weight or for required sample design variables are excluded from the analysis. With the new CLASS statement you will have the option of including missing values in your analysis.

In both programs the point estimates and the standard errors are the same (within reasonable rounding error). SUDAAN does not provide an option to obtain standard deviations; it calculates only standard errors. SAS provides the flexibility of obtaining standard deviations.

PROC SURVEYFREQ IN SAS

    PROC SURVEYFREQ;
        STRATA EMERG DISTSIZE SCHL_LVL;
        TABLES T40*T6 / CHISQ WCHISQ ROW COL CHISQ1;
        WEIGHT WGTD;
        TITLE "CROSSTAB OF T40 BY T6 IN SAS";
    RUN;

This syntax is very similar to PROC FREQ in SAS. It is a crosstabulation of gender (T40) and T6 (Did respondent leave a teacher preparation program for employment?). The addition is a STRATA statement naming the stratification variables. When you request a chi-square analysis with this procedure, the CHISQ option gives you a Rao-Scott chi-square test, which applies a design effect correction to the Pearson chi-square and computes that correction from the proportion estimates rather than from null proportions. The CHISQ1 option gives you a modified Rao-Scott chi-square test, which bases the design effect correction on the null hypothesis proportions. WCHISQ is an option on the TABLES statement that gives you a Wald chi-square. By default the procedure prints frequencies, weighted frequencies, standard errors of the weighted frequencies, percentages and standard errors of the percentages. You must specifically request row and column percentages and their standard errors; they are not given by default. Theoretically, the point estimates will not differ significantly from the SUDAAN output. If there are differences, they can usually be accounted for by rounding. Output in Exhibit 2A.

PROC SURVEYFREQ excludes an observation from a crosstabulation table if that observation has a missing value for any of the table, weight or required sample design variables, unless you specify the MISSING option. When the procedure excludes observations with missing values from a table, it displays the total frequency of missing observations below that table. With the MISSING option, the procedure treats missing values as a valid category and includes them in the calculation of percentages and other statistics. Unlike PROC FREQ, you cannot specify a MISSPRINT option, which would show the number of missing values in each cell while still leaving them out of the percentages and other statistics.

PROC CROSSTAB IN SUDAAN

    PROC CROSSTAB DATA=ONE FILETYPE=SAS DESIGN=STRWR;
        NEST EMERG DISTSIZE SCHL_LVL;
        WEIGHT WGTD;
        SUBGROUP T40 T6;
        LEVELS 2 2;
        TABLES T40*T6;
        SETENV COLWIDTH=9 DECWIDTH=2 COLSPCE=2;
        PRINT NSUM WSUM COLPER ROWPER TOTPER /
              WSUMFMT=F9.0 NSUMFMT=F9.0 CMHTEST=ALL TESTS=ALL CMHFMT=F8.2
              CMHDFFMT=F8.0 CMHPVALFMT=F8.4 CHISQFMT=F11.2;
        RTITLE "CROSSTAB OF T40 BY T6 IN SUDAAN";
    RUN;

PROC CROSSTAB in SUDAAN follows the logic of the syntax presented for PROC DESCRIPT. You must supply the DESIGN= option and a NEST statement. Besides specifying the crosstabulation in the TABLES statement, you must have a SUBGROUP statement and a corresponding LEVELS statement. SUDAAN will produce several types of chi-square tests, including the Cochran-Mantel-Haenszel and the Pearson chi-square. Output in Exhibit 2B. The crosstabulation output prints the totals on the left, the reverse of traditional SAS output. One of the disadvantages of SUDAAN output is that it produces a separate page for every table and every test statistic you request. It is not environmentally friendly.
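Both SUDAAN examples so far list their categorical variables on SUBGROUP with integer codes that start at 1. When a variable arrives coded 0/1, one alternative to recoding inside the SUDAAN procedure is a short DATA step beforehand. The following is a minimal sketch under stated assumptions: CRED01 is a hypothetical 0/1 indicator and ONE_R a hypothetical output data set; neither comes from the study data. Only the input data set name ONE matches the examples above.

```sas
/* Sketch only: CRED01, CRED12 and ONE_R are hypothetical names.       */
/* Recode a 0/1 indicator to 1/2 so that it can be listed on a         */
/* SUDAAN v8 SUBGROUP statement with LEVELS 2.                         */
DATA ONE_R;
    SET ONE;
    /* Map 0 -> 1 and 1 -> 2. Any other value, including a missing     */
    /* value, leaves CRED12 missing rather than guessing a category.   */
    IF CRED01 IN (0, 1) THEN CRED12 = CRED01 + 1;
RUN;
```

The recoded variable would then be named on SUBGROUP, with a matching entry of 2 on LEVELS, and DATA=ONE_R would replace DATA=ONE on the procedure statement. Keeping the original 0/1 variable on the data set leaves it available for PROC RLOGIST, which accepts a 0/1 dependent variable.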

PROC SURVEYREG IN SAS

    PROC SURVEYREG;
        STRATA EMERG DISTSIZE SCHL_LVL / LIST;
        CLASS T40;
        MODEL T36=T40 T41 / ANOVA DEFF ADJRSQ SOLUTION;
        WEIGHT WGTD;
        TITLE "SURVEYREG OF T36(# YRS TEACHING)=T40 (GENDER)";
        TITLE "REGRESSION T36 (YRS TEACHING)=T41 (AGE)+ T40(GENDER) W/INTERCEPT IN SAS";
    RUN;

This procedure performs linear regression taking into account the survey design variables. The dependent variable must be continuous (or assumed to be so), and the independent variables can be either continuous or categorical. Any categorical variable on the MODEL statement must appear on the CLASS statement, and the CLASS statement must precede the MODEL statement. PROC SURVEYREG forms dummy indicator variables (coded 1 or 0) for categorical independent variables, with the highest coded value of the variable defined as the reference group. By specifying ANOVA, you get a traditional ANOVA table. You can specify DEFF to get the design effects, which are important in understanding how the stratification or clustering in the sample design affected your data. SAS produces an estimated regression coefficient table by default if there is no CLASS statement. If you have a CLASS statement in your code, you must add the SOLUTION option to the MODEL statement to produce this table. To match the parameters of the SUDAAN procedure, I ran the analysis with the intercept included. Output in Exhibit 3A.

If an observation has a missing value or a non-positive value for the WEIGHT variable, PROC SURVEYREG excludes that observation from the analysis. An observation is also excluded if it has a missing value for any STRATA variable, CLUSTER variable, dependent variable, or any variable used in the independent effects. The analysis includes all observations in the data set that have non-missing values for all of these design and analysis variables.
PROC SURVEYREG does not have the MISSING option that is available in PROC SURVEYMEANS, PROC SURVEYFREQ and PROC SURVEYLOGISTIC.

Regression and logistic regression procedures are exercises in developing a model that best explains or predicts your dependent variable. It is an intricate and iterative process that can be very time-consuming. Modeling a phenomenon requires in-depth knowledge of the subject matter. The choice of syntax and options will be determined by the research questions and substantive knowledge. The examples used in this paper do not reflect the complexity of this type of analysis.

PROC REGRESS IN SUDAAN

    PROC REGRESS DATA=ONEA FILETYPE=SAS DESIGN=STRWR;
        NEST EMERG DISTSIZE SCHL_LVL;
        WEIGHT WGTD;
        SUBGROUP T40;
        LEVELS 2;
        TEST SATADJCHI ADJWALDF;
        MODEL T36=T40 T41;
        SETENV COLWIDTH=8 DECWIDTH=3 LABWIDTH=44;
        PRINT BETA="BETA" SEBETA="STD ERR" DEFT="DEFF" T_BETA="T:BETA=0"
              P_BETA="P-VALUE" DF="DF" SATADJDF="ADJ DF" SATADCHI="CHI-SQ(SAT)"
              ADJWALDF="F-TEST(WALD)" SATADCHP="P-VALUE(SAT)"
              ADJWALDP="P-VALUE(WALD-F)" /
              RISK=ALL DFFMT=F6.2 SATADJDFFMT=F6.3 SATADCHIFMT=F7.2 ADJWALDFFMT=F7.2
              BETAFMT=F8.4 SEBETAFMT=F8.4 P_BETAFMT=F7.4 DEFTFMT=F6.2
              SATADCHPFMT=F7.4 ADJWALDPFMT=F7.4;
        RTITLE "SUDAAN REGRESSION PROCEDURES T36=T40 T41";
    RUN;

PROC REGRESS in SUDAAN follows the logic of the syntax presented for PROC DESCRIPT. You must supply the DESIGN= option and a NEST statement. Instead of using a CLASS statement (at this time), you must use a SUBGROUP statement and then indicate the number of levels of the categorical variable. As with all the other procedures in SUDAAN, you must list on the PRINT statement every statistic you want, along with its format. You cannot get the traditional ANOVA table that you are accustomed to in SAS output. Output in Exhibit 3B.

Comparable pieces of information have been highlighted in Exhibit 3A and Exhibit 3B. The R-square in the SAS PROC SURVEYREG output (Exhibit 3A) is an adjusted multiple R-square; it is found in the SUDAAN output (Exhibit 3B) labeled as Multiple R-Square for the dependent variable. The information in the Tests of Model Effects table in the SAS PROC SURVEYREG output can be found in the contrast table in the SUDAAN output. The beta coefficients and their standard errors are found in the Estimated Regression Coefficients table in the SAS PROC SURVEYREG output and in the independent variables and effects table in the SUDAAN output.

I have chosen to run PROC SURVEYREG and PROC SURVEYLOGISTIC with an intercept. The NOINT (no intercept) option in both of these procedures uses the uncorrected sum of squares, whereas a model with an intercept uses the corrected sum of squares.

Currently, the SUDAAN modeling procedures exclude from the analysis any record with a missing value for any of the model variables. With the new CLASS statement, records with missing values can be included in the analysis, provided the variable names are listed in the CLASS statement and INCLUDE=MISSING is used. The default is NOMISSING.
PROC SURVEYLOGISTIC IN SAS

    PROC SURVEYLOGISTIC;
        STRATUM EMERG DISTSIZE SCHL_LVL;
        MODEL T42AB(EVENT='1')=T41 T40 / STB RSQ;
        WEIGHT WGTD;
        TITLE "SURVEYLOGISTIC OF T42AB T41 T40 W/INTERCEPT";
    RUN;

This procedure performs a logistic regression taking into account the survey design variables. Logistic regression analysis is often used to investigate the relationship between a discrete response and a set of explanatory variables. The dependent variable can be binary (0,1) or ordinal (small, medium, large), and the independent variables can be either continuous or categorical. A vast majority of variables in survey research are limited to binary or ordinal responses. When you have a binary dependent variable, you can use the MODEL statement to choose which category is the event category. The RSQ option on the MODEL statement gives you a generalized R-square for the fitted model. Output in Exhibit 4A.

In PROC SURVEYLOGISTIC, any observation with missing values for the response, offset or explanatory variables, or for any required sample design variable, is excluded from the analysis. The estimated linear predictor, its standard error estimate, the fitted probabilities, and their confidence limits are not computed for any observation with missing offset or explanatory variable values. The MISSING option can be used in the same manner as with PROC SURVEYFREQ and PROC SURVEYMEANS.

PROC RLOGIST IN SUDAAN

    PROC RLOGIST DATA=ONEA FILETYPE=SAS DESIGN=STRWR;
        NEST EMERG DISTSIZE SCHL_LVL;
        WEIGHT WGTD;
        SUBGROUP T40;
        LEVELS 2;
        MODEL T42AB=T41 T40;
        TEST SATADJCHI WALDCHI;
        SETENV COLWIDTH=8 DECWIDTH=3 LABWIDTH=44;
        OUTPUT EXPECTED="EXPECTED" RESIDUAL="RESIDUAL" OBSERVED="OBSERVED"
               WEIGHT="WEIGHT" /
               FILENAME=FILETEST EXPECTEDFMT=F8.4 RESIDUALFMT=F8.4
               OBSERVEDFMT=F8.4 WEIGHTFMT=F8.4;
        PRINT BETA="BETA" SEBETA="S.E." DEFT="DESIGN EFFECT" T_BETA="T:BETA=0"
              P_BETA="P-VALUE" OR LOWOR UPOR DF="DF" SATADJDF="ADJ DF"
              WALDCHI=" CHI-SQ (WALD)" SATADCHI=" CHI-SQ (SAT.)"
              WALDCHP=" P-VALUE (WALD)" SATADCHP=" P-VALUE (SAT.)" /
              T_BETAFMT=F8.2 DEFTFMT=F6.2 SEBETAFMT=F8.6 ORFMT=F5.2 LOWORFMT=F6.2
              UPORFMT=F6.2 DFFMT=F7.0 SATADJDFFMT=F8.2 WALDCHIFMT=F8.2
              SATADCHIFMT=F8.2 STYLE=NCHS;
        RTITLE "MODEL T42AB(MA Y/N)=T41(AGE) T40(GENDER) IN SUDAAN";
    RUN;

Several procedure names in SAS-callable SUDAAN are very similar to SAS procedure names. To avoid confusing SAS, SUDAAN uses the naming convention of starting such procedure names with the letter R. The syntax of this procedure is comparable to that of PROC REGRESS. In all other SUDAAN procedures the binary coding of 0 and 1 is not accepted; in PROC RLOGIST, however, the dependent variable can be coded as 0 or 1. Output in Exhibit 4B.

Information about the response (dependent) variable is found in the response profile in the SAS PROC SURVEYLOGISTIC output (Exhibit 4A), and the same information is found in the Sample and Population Counts for Response Variable table in the SUDAAN PROC RLOGIST output (Exhibit 4B). Beta coefficients and standard errors are found in the Analysis of Maximum Likelihood Estimates table in Exhibit 4A and in the Independent Variables and Effects table in Exhibit 4B. Each package produces odds ratios.

LIMITATIONS OF EACH PACKAGE

One of the major limitations of SAS at this time is that the package does not offer the option of using balanced repeated replicates (BRR) or jackknife weights. Why is this so important? As a programmer/analyst, it is very common to inherit data sets or to be asked to perform secondary analysis of an existing data set.
In many cases we do not have access to the details of how the sampling design was formed. In SUDAAN especially, it is essential to be able to designate the sample design based on such information. With balanced repeated replicates or jackknife weights, the syntax does not require any information beyond the supplied weights. This makes the data much more usable for the analyst.

Although SUDAAN offers more options in terms of survey sampling designs and procedures, it is a cumbersome program to code. SUDAAN documentation is not the easiest to comprehend, especially if you are a novice. The upcoming release of SUDAAN version 9, with its CLASS statement, will resolve one of the major difficulties of working with SUDAAN and will add the ability to include missing values where that is appropriate. From an economic viewpoint, using SUDAAN is an additional expense in terms of licensing and training. SAS gives you ease of coding and more control over printed output; however, at this time it is very limited in what it offers an analyst in terms of designs and procedures.

CONCLUSION

Simple random sampling is like a rare gem in this day of social science research. We are dealing with increasingly complex sample designs. These designs require the sophistication of the SAS survey procedures or the SUDAAN procedures. One must balance variety of choice with ease of coding. At this time SUDAAN is the most desirable package to use because of the variety of sample designs it offers and the number of procedures available to analyze the data. However, it is cumbersome to program, creating a more labor-intensive task than its counterpart in SAS. The inclusion of the CLASS statement in version 9 of SUDAAN will resolve some of these issues. You will still have to deal with the specification of the print options in SUDAAN.

My conclusion is that SAS is moving in the right direction, and I hope to see it incorporate the power of SUDAAN in terms of choice and number of procedures in the future. The single most important survey procedure to be included in SAS version 9.1 is PROC SURVEYFREQ. Contingency analysis is a mainstay of any data analysis. With the incorporation of these survey-based procedures in SAS, we look forward to greater ease in coding when dealing with complex sample designs.

In this electronic age, we are faced with ever-growing mountains of data, and no single software package can meet all our needs to manage and analyze the data. It is very common to switch back and forth between EXCEL or ACCESS and SAS. On many occasions we are given data in a spreadsheet format, asked to analyze the data in SAS and then asked to give back the results in a spreadsheet format. As programmers and analysts we see ancillary programs like EXCEL as part of our tool bag. We should think of using SAS v9.1 and SUDAAN in the same way. In the interim, if one has a design that fits the parameters of the SAS design and statistical options offered now, welcome to automatic transmission.
If your sample cannot meet the parameters of what SAS offers now, then you must contend with the manual transmission mode of SUDAAN. In the interim, we will have to switch back and forth between the two packages depending on our individual needs.

ACKNOWLEDGEMENTS

I wish to thank the Center for Education Policy at SRI International for their support and the opportunity to learn and expand into complex sample design programming. Special thanks go to Andrea Lash for her mentoring and support. I also want to thank Hal Javitz for his technical assistance. Thanks to my fellow programmers, Peter Godard and Kathryn Valdes, for their comments. I owe a debt of gratitude to Betsy Davies-Mercier for editorial assistance. Overdue thanks to my husband, Rob Robbins, for enduring late nights and lonely meals.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

REFERENCES

An, Anthony and Donna Watts (1998), "New SAS Procedures for Analysis of Sample Survey Data," Proceedings of the Twenty-Third Annual SAS Users Group International Conference, SAS Institute.

Cassell, David L. and AnnMaria Rousey. "Complex Sampling Designs Meet the Flaming Turkey of Glory," Proceedings of the Twenty-Eighth Annual SAS Users Group International Conference. March Design Pathways and Spirit Lake Consulting, Seattle, WA.

Research Triangle Institute (2001). SUDAAN User's Manual, Release 8.0, Research Triangle Park, NC: Research Triangle Institute.

SAS Institute, Inc., SAS/STAT User's Guide, Version 8, Volumes 1,2,3, Cary, NC: SAS Institute Inc., PP.

SAS online help that comes with version 9.1.

CONTACT INFORMATION

Katherine Baisden
SRI International
333 Ravenswood Ave, BS381
Menlo Park, CA
Phone: (650)
Fax: (650)
katherine.baisden@sri.com

EXHIBIT 1A
SAS SURVEYMEANS Procedure

    Number of Strata                          27
    Number of Observations                    530
    Number of Observations Used               510
    Number of Obs with Nonpositive Weights    20
    Sum of Weights

Statistics

    Variable   N   Mean   Std Error of Mean   Lower 95% CL for Mean   Upper 95% CL for Mean
    T4b

Domain Analysis: T40

    T40          Variable   N   Mean   Std Error of Mean   Lower 95% CL for Mean   Upper 95% CL for Mean
    (1) FEMALE   T4b
    (2) MALE     T4b

EXHIBIT 1B
SUDAAN PROC DESCRIPT

(OVERALL MEAN)

    Number of observations read    : 510    Weighted count :
    Denominator degrees of freedom : 483

Variance Estimation Method: Taylor Series (STRWR)
Mean of T4b (# Classes Taught)
by: Variable, One

    Variable   Sample Size   Population size   Mean   S.E.   Design effect
    T4b

(MEAN BY GENDER)

    Number of observations read    : 510    Weighted count :
    Number of observations skipped : 20 (WEIGHT variable nonpositive)
    Denominator degrees of freedom :

Variance Estimation Method: Taylor Series (STRWR)
Mean of T4b by T40
by: Variable, T40:GENDER

    Variable / T40:GENDER                  Sample Size   Population size   Mean   S.E.   Design effect
    T4b:TOTAL NUMBER OF CLASSES TAUGHT
        Total
        (1) FEMALE
        (2) MALE

EXHIBIT 2A
SAS SURVEYFREQ Procedure

Data Summary

    Number of Strata                          27
    Number of Observations                    530
    Number of Observations Used               510
    Number of Obs with Nonpositive Weights    20
    Sum of Weights

    T40           T6        Freq   Wted Freq   Std Wted   %   SE %   Row %   SE Row   Col %   SE Col
    1 (Females)   1 (Yes)
                  (No)
                  Total
    (Males)       1 (Yes)
                  (No)
                  Total
    Total         1 (Yes)
                  (No)
                  Total

    Frequency Missing = 39

Rao-Scott Chi-Square Test

    Pearson Chi-Square
    Design Correction
    Rao-Scott Chi-Square
    DF            1
    Pr > ChiSq    <.0001
    F Value
    Num DF        1
    Den DF        444
    Pr > F        <.0001
    Sample Size = 471

Wald Chi-Square Test

    Chi-Square
    F Value
    Num DF        1
    Den DF        444
    Pr > F
    Sample Size = 471

Rao-Scott Modified Chi-Square Test

    Pearson Chi-Square
    Design Correction
    Rao-Scott Chi-Square
    DF            1
    Pr > ChiSq    <.0001
    F Value
    Num DF        1
    Den DF        444
    Pr > F        <.0001
    Sample Size =

EXHIBIT 2B
SUDAAN PROC CROSSTAB

Number of observations read : 510    Weighted count :
Number of observations skipped : 20 (WEIGHT variable nonpositive)
Denominator degrees of freedom : 483

Variance Estimation Method: Taylor Series (STRWR)
Crosstab of T40 (GENDER) by T6 (PREP PGM)
by: T40:GENDER, T6:LEAVE MA OR PREP PGM FOR FT PAID POSITION

                      T6:LEAVE MA OR PREP PGM FOR FT PAID POSITION
T40:GENDER            Total    1 (YES)    2 (NO)
Total
  Sample Size
  Weighted Size
  Col Percent
  Row Percent
  Tot Percent
  SE Row Percent
  SE Col Percent
  SE Tot Percent
(1) FEMALE
  Sample Size
  Weighted Size
  Col Percent
  Row Percent
  Tot Percent
  SE Row Percent
  SE Col Percent
  SE Tot Percent
(2) MALE
  Sample Size
  Weighted Size
  Col Percent
  Row Percent
  Tot Percent
  SE Row Percent
  SE Col Percent
  SE Tot Percent

Variance Estimation Method: Taylor Series (STRWR)
Chi Square Test of Independence for T40:GENDER and T6:LEAVE MA OR PREP PGM FOR FT PAID POSITION
Crosstab of T40 (GENDER) by T6 (PREP PGM)

ChiSq                          0.41
P-value ChiSq                  0.52
Degrees of Freedom ChiSq       1.00
LLChiSq                        0.41
P-value LLChiSq                0.52
Degrees of Freedom LLChiSq

Variance Estimation Method: Taylor Series (STRWR)
Cochran-Mantel-Haenszel Test of Association for T40:GENDER and T6:LEAVE MA OR PREP PGM FOR FT PAID POSITION
Crosstab of T40 (GENDER) by T6 (PREP PGM)

Cochran-Mantel-Haenszel Chi-Square    0.41
Degrees of Freedom CMH                1
P-value CMH Test

EXHIBIT 3A
SAS SURVEYREG Procedure

Regression Analysis for Dependent Variable T36

Data Summary
Number of Observations    500
Sum of Weights
Weighted Mean of T
Weighted Sum of T

Fit Statistics
R-square
Adjusted R-square
Root MSE
Denominator DF            473

Design Summary
Number of Strata          27

Stratum Information
Stratum Index    EMERG: EMERGENCY STATUS    DISTSIZE: DISTRICT SIZE    SCHL_LVL: SCHOOL LEVEL    N Obs

Class Level Information
Class Variable    Label         Levels    Values
T40               T40:GENDER

ANOVA for Dependent Variable T36
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                                                                <.0001
Error
Corrected Total

Tests of Model Effects
Effect       Num DF    F Value    Pr > F
Model                             <.0001
Intercept                         <.0001
T
T                                 <.0001
NOTE: The denominator degrees of freedom for the F tests is 473.

Estimated Regression Coefficients
Parameter    Estimate    Standard Error    t Value    Pr > |t|    Design Effect
Intercept                                             <
T
T
T                                                     <
NOTE: The denominator degrees of freedom for the t tests is 473.
Matrix X'WX is singular and a generalized inverse was used to solve the normal equations. Estimates are not unique.

EXHIBIT 3B
SUDAAN PROC REGRESS

Number of observations read : 510    Weighted count:
Number of observations skipped : 20 (WEIGHT variable nonpositive)
Observations used in the analysis : 500    Weighted count:
Denominator degrees of freedom : 483
Maximum number of estimable parameters for the model is 3
File ONEA contains 510 Clusters
500 clusters were used to fit the model
Maximum cluster size is 1 records
Minimum cluster size is 1 records

Weighted mean response is
Multiple R-Square for the dependent variable T36:

Variance Estimation Method: Taylor Series (STRWR)
SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Identity
Response variable T36: T36:NUMBER OF YEARS FULLTIME TEACHER
Sudaan Regression Procedures T36=T40 T41 T

Independent Variables and Effects    Beta    STD Err    DEFF    T:Beta=0    P-value
Intercept
T40:GENDER
  (1) FEMALE
  (2) MALE
T41:YEAR OF BIRTH

Variance Estimation Method: Taylor Series (STRWR)
SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Identity
Response variable T36: T36:NUMBER OF YEARS FULLTIME TEACHER
Sudaan Regression Procedures T36=T40 T41 T

Contrast                 DF    Adj DF    Chi-sq(sat)    F-test(WALD)    P-Value(SAT)    P-Value(Wald-F)
OVERALL MODEL
MODEL MINUS INTERCEPT
INTERCEPT
T
T

EXHIBIT 4A
SAS SURVEYLOGISTIC Procedure

Model Information
Data Set                       WORK.ONE
Response Variable              T42Ab       T42Ab:MASTER DEGREE Y/N
Number of Response Levels      2
Stratum Variables              EMERG       EMERG: EMERGENCY STATUS
                               DISTSIZE    DISTSIZE: DISTRICT SIZE
                               SCHL_LVL    SCHL_LVL: SCHOOL LEVEL
Number of Strata               27
Weight Variable                WGTD        WGTD: WEIGHT FOR RANDOM/TARGET ALL TEACHERS
Model                          Binary Logit
Optimization Technique         Fisher's Scoring
Variance Adjustment            Degrees of Freedom (DF)

Number of Observations Read    530
Number of Observations Used    289
Sum of Weights Read
Sum of Weights Used

Response Profile
Ordered Value    T42Ab    Total Frequency    Total Weight

Probability modeled is T42Ab=1.

NOTE: 231 observations were deleted due to missing values for the response or explanatory variables.
NOTE: 10 observations having nonpositive frequencies or weights were excluded since they do not contribute to the analysis.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC
SC
Log L

R-Square
Max-rescaled R-Square

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio                        <.0001
Score                                   <.0001
Wald

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq    Standardized Estimate
Intercept
T
T

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
T
T

Association of Predicted Probabilities and Observed Responses
Percent Concordant    49.0    Somers' D
Percent Discordant    30.9    Gamma
Percent Tied          20.1    Tau-a
Pairs                         c

Number of zero responses : 128
Number of non-zero responses : 161

EXHIBIT 4B
SUDAAN PROC RLOGIST

Independence parameters have converged in 5 iterations
Number of observations read : 510    Weighted count:
Number of observations skipped : 20 (WEIGHT variable nonpositive)
Observations used in the analysis : 289    Weighted count:
Denominator degrees of freedom : 483
Maximum number of estimable parameters for the model is 3
File ONEA contains 510 Clusters
289 clusters were used to fit the model
Maximum cluster size is 1 records
Minimum cluster size is 1 records

Sample and Population Counts for Response Variable T42AB
0: Sample Count 128    Population Count
 : Sample Count 161    Population Count

R-Square for dependent variable T42AB (Cox & Snell, 1989):

-2 * Normalized Log-Likelihood with Intercepts Only :
-2 * Normalized Log-Likelihood Full Model :
Approximate Chi-Square (-2 * Log-L Ratio) : 5.02
Degrees of Freedom : 2
Note: The approximate Chi-Square is not adjusted for clustering. Refer to hypothesis test table for adjusted test.

Variance Estimation Method: Taylor Series (STRWR)
SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Logit
Response variable T42AB: T42Ab:MASTER DEGREE Y/N
MODEL T42Ab(MA Y/N)=T41(AGE) T40(GENDER)

Independent Variables and Effects    BETA    S.E.    DESIGN EFFECT    T:BETA=0    P-VALUE
Intercept
T41:YEAR OF BIRTH
T40:GENDER
  (1) FEMALE
  (2) MALE

Variance Estimation Method: Taylor Series (STRWR)
SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Logit
Response variable T42AB: T42Ab:MASTER DEGREE Y/N
MODEL T42Ab(MA Y/N)=T41(AGE) T40(GENDER)

Contrast                 DF    ADJ DF    CHI-SQ (WALD)    CHI-SQ (SAT.)    P-VALUE (WALD)    P-VALUE (SAT.)
OVERALL MODEL
MODEL MINUS INTERCEPT
INTERCEPT
T
T

Variance Estimation Method: Taylor Series (STRWR)
SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Logit
Response variable T42AB: T42Ab:MASTER DEGREE Y/N
MODEL T42Ab(MA Y/N)=T41(AGE) T40(GENDER)

Independent Variables and Effects    Odds Ratio    Lower 95% Limit OR    Upper 95% Limit OR
Intercept
T41:YEAR OF BIRTH
T40:GENDER
  (1) FEMALE
  (2) MALE
