Department of Sociology, King's University College
Sociology 302b: Section 570/571
Research Methodology in Empirical Sociology
Winter 2006


Computer Assignment #3
DUE Wednesday, March 29th (in class)
Regression Analysis Assignment / Regression Procedures for Final Paper

At this point in the term, you should have a pretty good sense of which variables you will be working with in your final research paper. You should also have a reasonable sense of the principal hypotheses that you are putting to empirical test.

The purpose of the current assignment is threefold. First, it will provide you with a strategy for conducting the regression analysis for your final paper. Second, this assignment is designed so that you can begin to think explicitly about the issue of missing data (I recommend, for current purposes, that you use what is called PAIRWISE deletion of missing cases in your final regression). Third, it will provide you with a simple strategy for dealing with sample weights when conducting multivariate analyses. As you may recall, Assignment #1 was all about dealing with sampling weights when producing simple descriptive statistics and frequency distributions. This assignment does the same with multivariate analyses.

Part A: Model Building: Regression

Although various strategies are possible when building models (stepwise regression and forward selection, among other largely automated procedures), the best of all possible strategies is to let your theory guide you in introducing variables into multiple regression. This is what I am recommending for the current assignment and for your final paper. This is also why you have already conducted a literature review in putting together your research proposal, i.e. you are trying to situate your empirical analysis within a broader research literature.

Step 1.
If applicable, using SPSS, carefully select the subsample that you will be focusing on in your final paper (e.g. all children aged 4-11, or all adults aged 18+). As in previous assignments, use the SELECT CASES procedure in delineating the sample for your analysis. One might note that in putting together your research proposals, some of you ignored this issue, and then went on to select variables that applied only to specific cases within the larger samples available. If the variables you select are applicable only to certain subsamples within the larger GSS or NLSCY samples (e.g. an age-appropriate

behavioral scale, or a question that applies only to adults), then you must explicitly take this into consideration when you select your subsample for your multiple regression. Also note that it doesn't make sense to select subsamples arbitrarily (e.g. only cases within Ontario and Quebec) without providing some justification. If there is no strong theoretical reason to select a subsample, then why are you doing so? You need not hand anything in for Step 1; merely report your subsample (if applicable).

Step 2.

I had suggested that you come up with 2 primary hypotheses for your final paper (i.e. involving 2 separate independent variables). Run two separate simple regressions, one for each independent variable, and highlight with a yellow marker on your output: (i) the standardized slopes, (ii) the unstandardized slopes, (iii) whether or not your variables have a significant effect (at the .05 level), and (iv) the overall R² of each regression. Note that in simple regression, the standardized slope is one and the same as Pearson's r.

Note: some of you skipped this step, and merely specified a series of hypotheses or correlations in your research proposal (i.e. you did not specify two primary hypotheses). If you did not specify 2 primary hypotheses, which of the independent variables in your proposed research do you consider most important in explaining your dependent variable (or which 2 do you find most interesting on theoretical grounds)? Subsequently use these 2 independent variables in satisfying Step 2, as specified above.

Note: if one of your primary hypotheses from your research proposal involves an interaction effect, for Step 2 merely run 2 separate simple regressions, each involving one (and solely one) of the two variables that enter into the hypothesized interaction. You can test for the interaction effect in Step 3 (see below).
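As a rough sketch, the syntax for Steps 1 and 2 might look as follows (all variable names here, i.e. AGEVAR, DEPVAR, INDVAR1 and INDVAR2, are hypothetical placeholders; substitute the variables from your own proposal):

```spss
* Step 1: restrict the analysis to your subsample (e.g. adults aged 18+).
* This is the sort of syntax SELECT CASES pastes under "If condition is satisfied".
USE ALL.
COMPUTE filter_$ = (AGEVAR >= 18).
FILTER BY filter_$.
EXECUTE.

* Step 2: two separate simple regressions, one per primary independent variable.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT DEPVAR
  /METHOD=ENTER INDVAR1.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT DEPVAR
  /METHOD=ENTER INDVAR2.
```

The coefficients table in each output gives you the standardized and unstandardized slopes and their significance, and the model summary gives you R².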
If you are creating your own scale(s) in your final paper, you have the option of including it in this assignment (merely state briefly which items went into creating the scale, and report the corresponding Cronbach's alpha). In creating your scale, you have the option of following the steps as specified in Assignment #2. If what you are creating is more like an index than a scale, then don't worry about Cronbach's alpha.

Note: if your primary hypotheses involve any variable(s) measured at the nominal level, you should create dummy variables. If you forget how to do this, take another look at the tutorials in Sociology 300 or ask Bev. Recall the following syntax in creating dummy variables:

Dummy Variable Syntax

Assume that we have the following variable in our data set, with 10 categories:

Province of residence (cgehd03)
10 Nfld

11 PEI
12 NS
13 NB
24 Quebec
35 Ontario
46 Manitoba
47 Sask
48 Alberta
59 British Columbia

If we were to create a series of dummy variables, we could use the following syntax in our SPSS syntax file (in this example, I will be creating 5 different dummies meant to measure region of country; one could just as easily have used 10 dummies for province of residence). With some variables it is often worthwhile collapsing categories, so that you don't have too many dummy variables in your final model.

COMPUTE ATL=0.
IF (CGEHD03=10 OR CGEHD03=11 OR CGEHD03=12 OR CGEHD03=13) ATL=1.
COMPUTE QUEBEC=0.
IF (CGEHD03=24) QUEBEC=1.
COMPUTE ONTARIO=0.
IF (CGEHD03=35) ONTARIO=1.
COMPUTE PRAIRIES=0.
IF (CGEHD03=46 OR CGEHD03=47 OR CGEHD03=48) PRAIRIES=1.
COMPUTE BC=0.
IF (CGEHD03=59) BC=1.

In this example, you have created 5 dummy variables. Note that potentially 10 dummy variables could have been created, but I decided to use only 5 to simplify my analysis. In a subsequent regression, you must include only 4 of these dummy variables (the dummy variable that is not included acts as your reference category in the regression). For example, if you were to exclude Quebec, the subsequent slopes in the regression always make implicit reference to this reference category: the slope associated with BC would give you the effect of living in BC on your dependent variable relative to living in Quebec (which acts as your reference category).

Usually (but not always) it is also recommended to work with dummy variables when introducing crudely categorized ordinal variables into regression (for example, ordinal variables that have only three or four response categories). Although this is often recommended, to simplify your life a bit I give you the option of using dummy variables when working with such variables (it is your choice).

The only items to hand in with Step 2 are your syntax and output files.
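To make the reference-category logic concrete, a regression using the region dummies above might be sketched as follows (DEPVAR is a hypothetical placeholder for your own dependent variable):

```spss
* QUEBEC is deliberately left out of the variable list: it serves as the
* reference category, so each regional slope is read relative to Quebec.
REGRESSION
  /DEPENDENT DEPVAR
  /METHOD=ENTER ATL ONTARIO PRAIRIES BC.
```

If you instead wanted Ontario as the reference category, you would enter ATL, QUEBEC, PRAIRIES and BC, and omit ONTARIO.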

Step 3

Move beyond two separate simple regressions to include both independent variables simultaneously in a multiple regression. With the results from this regression, highlight with a yellow marker on your output (i) the standardized slopes, (ii) the unstandardized slopes, (iii) whether or not your variables have a significant effect (at the .05 level), and (iv) the overall R² of your regression. Note that in multiple regression, the standardized slope is not one and the same as Pearson's r.

If one of your primary hypotheses includes the possibility of an interaction effect, it is here that you will have to introduce an interaction term into your model (as discussed last term). This usually involves computing a new variable which is the product of the two independent variables hypothesized as interacting. For example, assume that you hypothesize an interaction between VAR1 and VAR2 in explaining your dependent variable VAR3. You first need to compute an interaction term, as:

COMPUTE NEWVAR = VAR1 * VAR2.

You can then include all three variables in your regression (i.e. VAR1, VAR2 and NEWVAR). If NEWVAR has a significant effect, then this is supportive of the hypothesis of interaction. In including this interaction term, it doesn't matter whether VAR1 and VAR2 continue to have a significant effect, but it is important that you continue to include them in the model along with the interaction term (i.e. the final regression must include VAR1, VAR2 and NEWVAR).

Note: if your interaction effect involves a variable that is operationalized via more than one dummy variable (e.g. province of residence, as discussed above), your analysis is complicated somewhat. For example, if you hypothesized that province of residence interacts with another variable (e.g. income) in explaining your dependent variable, you will have to create several interaction terms for your regression model.
In the above example, you would have to create 4 interaction terms to test for this interaction between income and region of residence in explaining your dependent variable:

COMPUTE NEWVAR1 = INCOME * ATL.
COMPUTE NEWVAR2 = INCOME * ONTARIO.
COMPUTE NEWVAR3 = INCOME * PRAIRIES.
COMPUTE NEWVAR4 = INCOME * BC.

You would then have to include all four interaction terms in your regression (NEWVAR1; NEWVAR2; NEWVAR3; NEWVAR4) along with the region-of-residence dummy variables (ATL; ONTARIO; PRAIRIES; BC) and the income variable (INCOME). The interpretation is analogous to what you have with one interaction term.

The only items to hand in are your syntax and output files.
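Putting Step 3 together, the regression with both independent variables and a single interaction term might be sketched as follows (VAR1, VAR2, VAR3 and NEWVAR follow the naming used above):

```spss
* Compute the interaction term as the product of the two predictors.
COMPUTE NEWVAR = VAR1 * VAR2.
EXECUTE.
* The model must retain VAR1 and VAR2 alongside the interaction term.
REGRESSION
  /DEPENDENT VAR3
  /METHOD=ENTER VAR1 VAR2 NEWVAR.
```

For the region-by-income case, you would instead enter INCOME, the four region dummies, and the four NEWVAR1-NEWVAR4 products on the /METHOD=ENTER line.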

Step 4

In your research proposal, I had suggested that you also consider potential control variables in your multivariate model (up to six). On this basis, I suggest that you have the option of simplifying things somewhat for your final paper (and for this assignment). Of the controls that you had initially suggested, it is acceptable to use a minimum of 3 control variables (and definitely no more than 6). For this assignment and for your final paper, this will make your analysis more manageable, and you can tighten up and edit your literature review on this basis.

Note: again you may very well have to create dummy variables (in fact, nearly all of you will need to create dummy variables for your final multiple regression). For example, assume that you want to include marital status as a control. It might be necessary in this context to create a series of dummy variables (e.g. MARRIED; SINGLE; DIVORCED/SEPARATED; COMMON LAW). One of these variables would subsequently not be included in the final regression model. When you consider the possibility of nominal variables, your final model may have quite a few independent variables and subsequently become rather complicated.

With the results from this regression, again highlight on your output (i) the standardized slopes, (ii) the unstandardized slopes, (iii) whether or not your variables have a significant effect (at the .05 level), and (iv) the overall R² of your regression. The crucial issue here is whether or not the initial relationships (relating to your primary hypotheses) remain when introducing these additional controls. In other words, when you simultaneously include your two principal independent variables and three (or possibly more) relevant control variables, do you still find support for your primary hypotheses?

Hand in your syntax and output files.
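A sketch of the Step 4 syntax, assuming two focal predictors, three marital-status dummies and one further control (all names here, i.e. DEPVAR, INDVAR1, INDVAR2, MARRIED, SINGLE, DIVSEP and INCOME, are hypothetical placeholders):

```spss
* Full model: the two primary predictors plus controls.
* MARRIED, SINGLE and DIVSEP are marital-status dummies;
* common law is left out as the reference category.
REGRESSION
  /DEPENDENT DEPVAR
  /METHOD=ENTER INDVAR1 INDVAR2 MARRIED SINGLE DIVSEP INCOME.
```

The crucial comparison is between the slopes for INDVAR1 and INDVAR2 here and the slopes you obtained in Steps 2 and 3, before the controls were entered.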
Step 5

After completing these regressions, I would like you to present the results from these 4 regressions in two separate tables (one table which includes standardized slopes, and another which includes unstandardized slopes). Each table will include the regression results obtained from Steps 2 through 4. In so doing, I suggest that you use the table format specified on page 500 of the Singleton and Straits textbook. While that example provides results from 2 models, I would like you to use the same format in presenting the results from 4 models. Excel is a convenient program for creating these tables (it is very easy to use). Alternatively, feel free to save time by using pencil and paper, neatly summarizing your results in table form (no marks deducted). I don't want you to waste a whole bunch of time screwing around with software in producing a table for this assignment or final paper.

Present the results from Steps 2, 3 and 4 as separate models, as in Table 15.4. In the first model, present the results from one of your independent variables (i.e. one independent

variable, unless of course you had to introduce dummy variables in measuring the effect of a given concept). In the second model, include the results from the second independent variable (again, either a single variable or a series of dummies, if appropriate). In the third model, include these independent variables together (and, if applicable, any relevant interaction term). In the fourth model, include all these variables, as well as all relevant controls. This is your full model. Models 1 through 3 are referred to as nested models, relative to this complete model with all relevant explanatory variables.

The dependent variable in the table on page 500 is standardized 12th grade math grades. Don't let the term standardized throw you off here, as this is merely the name of the variable. More specifically, I suggest that you create one table that includes unstandardized slopes, as in Table 15.4 (call it Table 1). In addition, I suggest that you create a second table that summarizes all the results using standardized slopes (call it Table 2). These standardized slopes are particularly important in determining which of your independent variables has the strongest effect on your dependent variable. (Table 2 does not include the intercept (y-intercept), as no intercept is reported with standardized coefficients.) Again, rather than including the results from two models, as in Table 15.4, include the results from all four models in the same table; in other words, Tables 1 and 2 should each contain a column for each of the four models.

In presenting your results, these tables should be very clearly labeled; for example:

Table 1. Twelfth Grade Math Grades Regressed on Selected Independent Variables, 1985 General Social Survey, Unstandardized Coefficients.

or

Table 2. Twelfth Grade Math Grades Regressed on Selected Independent Variables, 1985 General Social Survey, Standardized Coefficients.
As on page 500, it is recommended that you present R², a preferred indicator of overall model performance. Table 15.4 also specifies the sample size of each regression. To locate the sample size of your regression in your corresponding output files, the degrees of freedom (df) associated with your TSS (total sum of squares) on the ANOVA table give you a direct indication of the total number of cases involved in each regression (the total df equals N - 1, so add 1 to obtain your sample size). This is not necessarily the same number of cases across the different regressions, due to the issue of missing cases (this should be clear to you after completing Part B of the current assignment).

As output from Part A, all I am asking for is your syntax files and corresponding output files, organized as Steps 1 through 4. I am also asking for these two tables, without any interpretation. In Part C of this assignment, you will be replicating the above regressions (so no need to start over), yet this time properly weighting your analysis. The results from the weighted regressions in Part C can be used in your final paper.

PART B. Dealing with Missing Values

In our empirical research this year, we have yet to explicitly address the problem of missing values. In a database, missing values can surface as a direct result of a variety of factors, including "don't know", "refusal" or "non-applicable" responses in gathering information through a survey. Typically, non-applicable is not that great an issue, as we can avoid this type of missing value by properly defining our subsample for analysis. On the other hand, refusals and/or don't knows can potentially be problems. Characteristically, these missing values are assigned values like 9999 in sociological datasets.

The default procedure in regression when working with SPSS is to delete missing cases on a LISTWISE basis. Listwise deletion uses only cases from your data set that have valid (non-missing) values for all variables included in the regression. While this is the default procedure in regression, it is not always the preferred option. The problem with listwise deletion of missing cases is that you can potentially lose quite a few cases (and this can seriously reduce your sample size). Consider for example a regression with 5 variables, where there is a 10% non-response rate (or refusal rate) on each variable. Technically speaking, if there is absolutely no overlap in this non-response across variables, listwise deletion can exclude up to 50% of all cases from the regression.
If there is 100% overlap across all variables in these missing cases, then listwise deletion would exclude only 10% of all cases (and no more).

There are various alternatives to LISTWISE. Many researchers prefer what is referred to as PAIRWISE deletion of missing cases (this is what I propose for your regression, i.e. use PAIRWISE in Part C and in your final paper). This method excludes completely from the analysis only those cases with no valid response on any variable in your model. In essence, what this procedure does is use as many valid cases as possible in identifying the bivariate associations between the variables in the model, which are then used in the subsequent multiple regression. Using this procedure, you typically do not lose nearly as many cases (and for this reason, it is preferred over LISTWISE in many contexts). This is particularly true if there is no overlap in missing cases. Working with the above example, if there is absolutely no overlap in missing cases, 0% of cases will be excluded entirely. If there is 100% overlap in terms of missing cases, then pairwise deletion will remove only 10% of all cases.
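In SPSS, these choices are made on the /MISSING subcommand of REGRESSION (or under Options in the regression dialog box). A sketch with hypothetical variable names:

```spss
* Default behaviour: listwise deletion
* (cases missing on ANY listed variable are dropped entirely).
REGRESSION
  /MISSING LISTWISE
  /DEPENDENT DEPVAR
  /METHOD=ENTER INDVAR1 INDVAR2 CONTROL1.

* Recommended here: pairwise deletion.
REGRESSION
  /MISSING PAIRWISE
  /DEPENDENT DEPVAR
  /METHOD=ENTER INDVAR1 INDVAR2 CONTROL1.
```

(/MISSING MEANSUBSTITUTION gives the mean-substitution option discussed below.)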

As a third option in dealing with missing cases, it is possible to specify what is referred to as mean substitution (this is a form of imputation of missing values). With this procedure, any case with a missing value on a given variable is assigned a score equal to the mean of the variable of interest (using the REPLACE WITH MEAN option). This procedure includes all cases, regardless of how many missing values there are in your analysis. This procedure of imputing missing values has been criticized as somewhat crude in dealing with problem cases.

Note: if you have not correctly specified the subsample for your analysis, this latter procedure (REPLACE WITH MEAN) may impute the mean for cases that are not applicable. For example, if you were examining child outcomes (ages 4-11) and did not specify this as part of your sample selection (i.e. erroneously included families without children aged 4-11), then for all families without such a child, this procedure would impute a value on the child outcome of interest comparable to the mean for all children aged 4-11 (i.e. you don't want to do this!!).

For Part B, merely run 3 separate regressions with the full model that you ran in Part A (Model 4): with PAIRWISE deletion, LISTWISE deletion and the REPLACE WITH MEAN procedure, separately. Note that in regression analysis you can specify how to deal with missing cases under the Options feature of the regression procedure. In your corresponding output files, the degrees of freedom (df) associated with your TSS (total sum of squares) on each corresponding ANOVA table give you a direct

indication of the total number of cases involved in each regression (after dealing with the missing cases). With mean substitution, you should have the full sample size (i.e. after correctly identifying and selecting your subsample of interest), whereas LISTWISE or PAIRWISE deletion will likely lead to smaller samples. If a sample is much smaller, then you might have problems or biases introduced into your analysis.

Present the results of the 3 separate regressions in a separate table (use the unstandardized slopes and not the standardized slopes; call it Table 3). Is there any large difference in the sample sizes? If so, is there a big difference in your regression results? If so, you might briefly mention in the methods section of your final paper that missing cases can potentially lead to biases in your analysis. If not, that's a very good sign that your analysis is robust to missing values (and you can also briefly mention this). Regardless of your results, the most cautious procedure in multiple regression is usually to rely upon PAIRWISE deletion (I recommend this for your final paper and for Part C of the current assignment).

In completing Part B, merely produce the table and give a one-sentence response to each of the above questions. Also provide the syntax and output files corresponding to the 3 separate regressions.

PART C: Running Weighted Regression Models

Part of the purpose of the first assignment this term was to introduce you to the difference between weighted and unweighted estimates using sample data, and to introduce a simple strategy that you can use when dealing with sample weights. So far in your regression analysis, you have worked exclusively with unweighted data. For the current exercise, you will also conduct a multiple regression with weighted data. This adds a few complications to our analysis but is in fact an indispensable step in working with sample data. The results from Part C can be used in your final paper.
Remember: if your results are non-significant, don't worry; that's okay. Merely report it in your final paper.

Weighting:

To remind you: what do I mean by unweighted as opposed to weighted data? The principle behind estimation in a probability sample such as the NLSCY or the GSS is that each person in the sample "represents", besides himself or herself, several other persons not in the sample. For example, in a 2% simple random sample of the population, each person in the sample represents 50 persons in the population. The weighting phase is the step which calculates, for each record, what this number is (i.e., the number of individuals in the population represented by this record). The relevant weights appear on

the NLSCY microdata file (AWTCW01), as well as the 1991 GSS microdata file (FINALWT), the 1998 GSS microdata file (WGHTFIN) and the 1999 victimization GSS (WGHT_PER). These must be relied upon when deriving meaningful estimates from the survey. In a probability sample, the sample design itself determines the weights which must be used to produce unbiased estimates of the population. Each record (i.e. each case in your sample) must be weighted by the inverse of the probability of selecting the person to whom the record refers. In the example of a 2% simple random sample, this probability would be .02 for each person, and the records must be weighted by 1 / .02 = 50.

The importance of working with weights is obvious when you start to examine the complex character of many of Statistics Canada's sample designs. In other words, Statistics Canada works with very complex sampling designs whereby not all individuals in the targeted population have an equal likelihood of being selected. Therefore, in the absence of simple random sampling (whereby every case would have an equal weight), it is necessary to rely upon sample weights in obtaining descriptive statistics that are not biased by the sample design.

A problem when running the weight feature in SPSS is that it unfortunately leads to erroneous inferences with regard to statistical significance. The reason is that SPSS continues to treat the weighted counts as though they were your actual sample size. Remember that if you work with very large samples, then even small differences or small associations may be significant. A simple way to deal with this problem is as follows:

Step 1. FOR ALL THE REMAINING REGRESSIONS: USE PAIRWISE DELETION OF MISSING CASES (don't overlook this step!!)

Step 2. Obtain the mean of the sample weight corresponding to the data that you are using.
You can obtain the mean of AWTCW01, FINALWT, WGHTFIN or WGHT_PER through the use of the DESCRIPTIVES command in SPSS. Note: you can skip this step, as I have already calculated the means for these datasets. With this mean value, compute a new set of weights. The new weights should be equal to the initial weights divided by this mean score. The appropriate syntax in computing the new weight is as follows (replace [mean weight] with the relevant mean):

Syntax for new weights:

If working with the NLSCY:

COMPUTE NWEIGHT = AWTCW01 / [mean weight].

If working with the 1991 GSS:

COMPUTE NWEIGHT = FINALWT / [mean weight].

If working with the 1998 GSS:

COMPUTE NWEIGHT = WGHTFIN / [mean weight].

If working with the 1999 GSS Victimization Survey:

COMPUTE NWEIGHT = WGHT_PER / [mean weight].

Step 3. After computing this new weight, re-run all your regression models from Part A (4 models in total), weighting your analysis using this new weight, NWEIGHT. This is very easy to do. Merely take the old syntax files that you used in completing Part A, change the missing-value procedure to PAIRWISE, and add the weight procedure in each syntax file. In regression, it is possible to weight your analysis using the WLS>> option on the regression procedure. Merely paste this into your files.
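Pulling Part C together for the NLSCY, the syntax might be sketched as follows (a sketch only: replace [mean weight] with the actual mean of AWTCW01 from DESCRIPTIVES, and substitute your own variables for the hypothetical DEPVAR, INDVAR1 and INDVAR2; the WLS>> button pastes the /REGWGT subcommand shown here):

```spss
* Step 2: obtain the mean of the sample weight.
DESCRIPTIVES VARIABLES=AWTCW01.

* Rescale the weight so that it averages 1 across the sample.
COMPUTE NWEIGHT = AWTCW01 / [mean weight].
EXECUTE.

* Step 3: re-run the model, weighted, with pairwise deletion.
REGRESSION
  /MISSING PAIRWISE
  /DEPENDENT DEPVAR
  /METHOD=ENTER INDVAR1 INDVAR2
  /REGWGT=NWEIGHT.
```

Because NWEIGHT averages 1, the weighted case count matches the actual sample size, which is what makes the significance tests more meaningful.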

The results should be unbiased and allow for more meaningful significance tests. Create two new tables that are identical to Tables 1 and 2, yet this time with the weighted results. There may be some differences in terms of significance (and possibly strength of effects) relative to what you obtained with the unweighted results. The two new tables (Tables 4 and 5) should be properly titled, this time indicating that the regression results are weighted.

NOTE: in your final paper, this is the preferred procedure, i.e. using this weighted regression procedure with pairwise deletion of missing values. In your final paper, you should present the results obtained from your analysis, properly weighted, as specified above. In your final paper, you will have to make sense of these two tables. What do they tell us about the initial relationships as hypothesized? Are they significant? Do you find support for your initial hypotheses? Do the relationships remain, even after including all your relevant control variables?

NOTE: for the current assignment, there is no need for interpretation (merely put together the two tables: Tables 4 and 5). THAT'S IT!!! Interpret this in your final paper. Again, don't waste a whole bunch of time trying to put together the perfect table with your software if you are not proficient with it: merely provide neat hand-written tables.

Also, for future reference, it is useful to examine your data initially with unweighted counts. This provides you with some sense of potential problems with small numbers and missing cases. Yet if you require descriptive estimates of how the variables likely look in the population (for example, frequency distributions), then run all your procedures with the correct sample weights. If your sample design is not a simple random sample, then these results can be quite different from what was initially observed. This is what you did in Assignment 1 (earlier this term).
On the other hand, when you move into your final multivariate analysis, the only way to get meaningful significance tests is to revise the weight as specified above (this is not perfect, but it is certainly better than doing nothing). In using this method, the subsequent estimates should not be seriously biased, and the significance tests should be reasonably accurate. This is preferable to not weighting your analysis, or to merely doing the analysis with the initial weights. There are more complex procedures, beyond the scope of the current course, that are preferred to the procedure recommended above (save them for graduate school).