Practical Considerations in Raking Survey Data


Michael P. Battaglia 1, David Izrael 1, David C. Hoaglin 1, and Martin R. Frankel 1,2
(1) Abt Associates Inc., (2) Baruch College, CUNY
Contact Author: Michael P. Battaglia, Abt Associates Inc., 55 Wheeler Street, Cambridge, MA (v) , (f) ,

Abstract

A survey sample may cover segments of the target population in proportions that do not match the proportions of those segments in the population itself. The differences may arise from sampling fluctuations, nonresponse, or because the sample design was not able to cover the entire population. In such situations one can use raking to improve the relation between the sample and the population by adjusting the sampling weights of the cases in the sample so that the marginal totals of the adjusted weights on specified characteristics agree with the corresponding totals for the population. The raking procedure is described, and convergence issues and problems are discussed. The details of several practical aspects of raking are then given. The topics covered have not received much attention in the literature on raking. Specific aspects of raking are illustrated with graphical displays of output from a SAS macro that can be obtained for free from the authors.

Key Words: Control totals, convergence, raking margins, weights, nonresponse

1. Introduction

A survey sample may cover segments of the target population in proportions that do not match the proportions of those segments in the population itself. The differences may arise, for example, from sampling fluctuations, from nonresponse, or because the sample design was not able to cover the entire population. In such situations one can often improve the relation between the sample and the population by adjusting the sampling weights of the cases in the sample so that the marginal totals of the adjusted weights on specified characteristics agree with the corresponding totals for the population. This operation is known as raking ratio estimation (Kalton 1983), raking, or sample-balancing, and the population totals are usually referred to as control totals. Raking may reduce nonresponse and noncoverage biases, as well as sampling variability. The initial sampling weights in the raking process are often equal to the reciprocal of the probability of selection and may have undergone some adjustments for unit nonresponse and noncoverage. The weights from the raking process are used in estimation and analysis.

The adjustment to control totals is sometimes achieved by creating a cross-classification of the categorical control variables (e.g., age categories x gender x race x family-income categories) and then matching the total of the weights in each cell to the control total. This approach, however, can spread the sample thinly over a large number of cells. It also requires control totals for all cells of the cross-classification. Often this is not feasible (e.g., control totals may be available for age x gender x race but not when those cells are subdivided by family income). The use of marginal control totals for single variables (i.e., each margin involves only one control variable) often avoids many of these difficulties. In return, of course, the two-variable (and higher-order) weighted distributions of the sample are not required to mimic those of the population.

A somewhat different problem motivated the original development of sample-balancing (Deming 1943). The Census Bureau needed to produce tabulations for the joint distribution of two (or more) variables in the U.S. population, in situations where information on the joint distribution was available only from a sample. The marginal totals, however, were available for the full population, and so the sample counts in the cells of the cross-classification were adjusted to provide an estimated tabulation that had the correct marginal totals.

Raking (or sample-balancing) usually proceeds one variable at a time, applying a proportional adjustment to the weights of the cases that belong to the same category of the control variable. Software for sample-balancing has been available for many years, but not as part of SAS (except for the CLAMAR macro from France) or most other major software systems (WESVAR includes a raking algorithm). Older readers may be familiar with a FORTRAN program developed in the 1960s by MarketMath, Inc. Although that program executed rapidly, it had a variety of disadvantages. The user had to create an ASCII input data set, painstakingly prepare control statements (the original program was designed to read input from cards), and then process its ASCII output data set. It could rake on at most 12 variables. Also, it handled rounding in a way that could lose precision. Izrael et al. (2000) introduced a SAS macro for raking (sometimes referred to as the IHB raking macro) that combines simplicity and versatility. More recently, the IHB raking macro was enhanced to increase its utility and diagnostics (Izrael et al. 2004).

The raking algorithm and issues related to convergence are discussed next. Several practical raking applications are then covered.

2. Basic Algorithm

The procedure known as raking adjusts a set of data so that its marginal totals match specified control totals on a specified set of variables. The term raking suggests an analogy with the process of smoothing the soil in a garden plot by alternately working it back and forth with a rake in two perpendicular directions.

In a simple 2-variable example the marginal totals in various categories for the two variables are known from the entire population, but the joint distribution of the two variables is known only from a sample. In the cross-classification of the sample, arranged in rows and columns, one might begin with the rows, taking each row in turn and multiplying each entry in the row by the ratio of the population total to the weighted sample total for that category, so that the row totals of the adjusted data agree with the population totals for that variable. The weighted column totals of the adjusted data, however, may not yet agree with the population totals for the column variable. Thus the next step, taking each column in turn, multiplies each entry in the column by the ratio of the population total to the current total for that category. Now the weighted column totals of the adjusted data agree with the population totals for that variable, but the new weighted row totals may no longer match the corresponding population totals. The process continues, alternating between the rows and the columns, and agreement on both rows and columns is usually achieved after a few iterations. The result is a tabulation for the population that reflects the relation of the two variables in the sample.

The above sketch of the raking procedure focuses on the counts in the cells and on the margins of a two-variable cross-classification of the sample. In the applications that survey statisticians often encounter, involving data from complex surveys, it is more common to work with the survey weights of the n individual respondents. Thus, the basic raking algorithm is described in terms of those individual weights, $w_i$, $i = 1, 2, \ldots, n$. For an unweighted (i.e., equally weighted) sample, one can simply take the initial weights to be $w_i = 1$ for each $i$.

In a cross-classification that has J rows and K columns, denote the sum of the $w_i$ in cell $(j, k)$ by $w_{jk}$. To indicate further summation, replace a subscript by a + sign. Thus, the initial row totals and column totals of the sample weights are $w_{j+}$ and $w_{+k}$, respectively. Analogously, denote the corresponding population control totals by $T_{j+}$ and $T_{+k}$.

The iterative raking algorithm produces modified weights, whose sums are denoted by a suitably subscripted m with a parenthesized superscript for the number of the step. Thus, in the two-variable cross-classification $m_{jk}^{(1)}$ denotes the sum of the modified weights in cell $(j, k)$ at the end of Step 1. If one begins by matching the control totals for the rows, $T_{j+}$, the initial steps of the algorithm are

$$m_{jk}^{(0)} = w_{jk} \quad (j = 1, \ldots, J;\ k = 1, \ldots, K)$$

$$m_{jk}^{(1)} = m_{jk}^{(0)} \left( T_{j+} / m_{j+}^{(0)} \right) \quad \text{(for each } k \text{ within each } j\text{)}$$

$$m_{jk}^{(2)} = m_{jk}^{(1)} \left( T_{+k} / m_{+k}^{(1)} \right) \quad \text{(for each } j \text{ within each } k\text{)}$$

The adjustment factors, $T_{j+} / m_{j+}^{(0)}$ and $T_{+k} / m_{+k}^{(1)}$, are actually applied to the individual weights, which could be denoted by $m_i^{(2)}$, for example. In the iterative process an iteration rakes both rows and columns. Thus, for iteration $s$ ($s = 0, 1, \ldots$) one may write

$$m_{jk}^{(2s+1)} = m_{jk}^{(2s)} \left( T_{j+} / m_{j+}^{(2s)} \right)$$

$$m_{jk}^{(2s+2)} = m_{jk}^{(2s+1)} \left( T_{+k} / m_{+k}^{(2s+1)} \right)$$

Bishop et al. (1975) discuss the relationship between iterative proportional fitting and raking. They point out that raking was originally developed not for fitting an unsaturated model to a data set, but rather for combining information from two or more data sets. In the two-way table discussed above, one is in effect fitting a fully saturated log-linear model: the two-factor interaction present in the sample persists after raking, and the one-factor terms (reflected in the population control totals) are also fitted. Thus, in some ways raking can be thought of as fitting a main effects model, where the main effects correspond to the given margins.

Raking can also adjust a set of data to control totals on three or more variables. In such situations the control totals often involve single variables, but they may involve two or more variables. In one example, in raking on three variables one might have control totals $T_{a++}$, $T_{+b+}$, and $T_{++c}$. In another example, the control totals might be $T_{ab+}$ and $T_{++c}$ (a two-variable margin and a one-variable margin). In actually carrying out the raking for this second example, it suffices to treat the two-variable margin as the one-variable margin for a composite variable, whose values simply index the cells of the underlying two-variable margin.
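The iteration in Section 2 can be sketched in a few lines of Python that operate directly on the individual weights. This is a minimal illustration rather than the IHB macro: the function names, the tolerance of 1e-6, the default limit of 50 iterations, and the four-case data set are all assumptions made for the example.

```python
def category_totals(weights, cats):
    """Sum the weights within each category of one raking variable."""
    totals = {}
    for w, c in zip(weights, cats):
        totals[c] = totals.get(c, 0.0) + w
    return totals


def rake(weights, categories, controls, tol=1e-6, max_iter=50):
    """Rake individual weights to one-variable control totals.

    weights    : initial weights w_i, one per case
    categories : {variable: [category of each case]}
    controls   : {variable: {category: control total}}
    Returns (raked weights, iterations used, converged flag).
    Assumes every control category has positive weight in the sample.
    """
    m = list(weights)
    for iteration in range(1, max_iter + 1):
        for var, cats in categories.items():
            totals = category_totals(m, cats)
            # Multiply each case by T / m for its category on this variable.
            m = [w * controls[var][c] / totals[c] for w, c in zip(m, cats)]
        # Converged when every margin is within tol of its control total.
        gap = max(
            abs(category_totals(m, cats).get(c, 0.0) - target)
            for var, cats in categories.items()
            for c, target in controls[var].items()
        )
        if gap <= tol:
            return m, iteration, True
    return m, max_iter, False


# Illustrative data: four cases, one in each cell of a 2 x 2 table.
weights = [1.0, 2.0, 3.0, 4.0]
categories = {"row": ["a", "a", "b", "b"], "col": ["x", "y", "x", "y"]}
controls = {"row": {"a": 60.0, "b": 40.0}, "col": {"x": 50.0, "y": 50.0}}
raked, n_iter, converged = rake(weights, categories, controls)
print(converged, n_iter, [round(w, 3) for w in raked])
```

A two-variable margin fits the same function if the category labels of the composite variable are passed as tuples, for example `list(zip(a_values, b_values))`, with control totals keyed by the same tuples.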

Ideally, one should rake on variables that exhibit strong associations with the key survey outcome variables or that are strongly related to nonresponse or noncoverage. This strategy will reduce the mean squared error of the key outcome variables. In practice, other considerations may enter. A variable such as gender may not be related to key outcome variables or to nonresponse or noncoverage, but raking on it may be desirable to preserve the face validity of the sample.

3. Convergence

Convergence of the raking algorithm has received considerable attention in the statistical literature, especially in the context of iterative proportional fitting for log-linear models, where the number of variables is at least three and the process begins with a different set of initial values in the fitted table (often 1 in each cell). For raking survey data it is enough that the iterative raking algorithm (ordinarily) converges, as one would expect from the fact that (in a suitable scale) the fitted cell counts produced by the raking are the weighted-least-squares fit to the observed cell counts in the full cross-classification of the sample by all the raking variables (Deming 1943). As an extreme example, for the 2 x 2 table shown in Table 1, convergence is impossible. Convergence may require a large number of iterations. Oh and Scheuren (1978) note that the available convergence proofs make strong assumptions about the cell counts in the cross-classification of the raking variables: that no cells are empty or that some particular combination of nonempty cells is present. They recommend setting up the raking problem in a sensible manner to avoid: 1) imposing too many marginal constraints on the sample, 2) defining marginal categories that contain a small percentage of the sample, and 3) imposing contradictory constraints on the sample.

The authors' experience indicates that, in general, raking on a large number of variables slows the convergence process. However, other factors also affect convergence. One is the number of categories of the raking variables. Convergence will typically be slower for raking on 10 variables each with 5 categories than for 10 variables each with only 2 categories. A second factor is the number of sample cases in each category of the raking variables. Convergence may be slow if any categories contain fewer than 5% of the sample cases. A third factor is the size of the difference between each control total and the corresponding weighted sample total prior to raking. If some differences are large, the number of iterations will typically be higher. One can guard against the possibility of nonconvergence or slow convergence by setting an upper limit on the number of iterations (e.g., 50).

Brick et al. (2003) also discuss problems with convergence. They point out that a large number of iterations indicates a raking application that is not well-behaved and that problems may exist with the resulting weights: highly variable weights inflate sampling variances and produce unstable domain estimates. One example of a problem is the use of raking variables that have a strong association (correlation). In this situation the number of iterations may be large, and convergence will not occur if there are inconsistencies between the associations in the sample and the control totals (Table 1 shows such an example). The log-linear models literature on structural zeros in contingency tables is directly related to this issue. For example, if one rakes on Food Stamps eligibility and a poverty status variable, the cross-tabulation of these two variables in the sample will likely result in one or more cells that must be empty by definition.
One simple definition of convergence requires that each marginal total of the raked weights be within a specified tolerance of the corresponding control total. As noted above, in practice, when a number of raking variables are involved, one must check for the possibility that the iterations do not converge (e.g., because of sparseness or some other feature in the full cross-classification of the sample). As already noted, one can guard against this possibility by setting an upper limit on the number of iterations.

As elsewhere in data analysis, it is sensible to examine the sample (including its joint distribution with respect to all the raking variables) before doing any raking. For example, if the sample contains no cases in a category of one of the raking variables, it will be necessary to revise the set of categories and their control totals (say, by combining categories). The authors recommend, at a minimum, checking the unweighted percentage of sample cases and the percentage of control cases in each category of each raking variable. Small categories in the sample or in the control totals (say under 5%) are potential candidates for collapsing. This step will reduce the chance of creating very unequal weights in raking. Category collapsing always needs to be done carefully, and in some instances it may be important to retain a small category in the raking.

4. The IHB Raking Macro

The IHB SAS macro produces diagnostic output that contains the following information: number of iterations, name of variable currently being raked on, name of BY-variable if there is one, and marginal control total and calculated total weight for each level of the current raking variable, along with their difference and percentage difference. At termination, the macro gives the iteration number at which termination occurred and the reason, which is either that the tolerance has been met or that the process did not converge. The macro also writes diagnostics into the SAS LOG, from several of the checks that it makes.
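The pre-raking check on small categories and the per-category comparison of control totals with weighted totals can be sketched as follows. This is not the macro's code or its output format: the function names and data are hypothetical, and taking the percentage difference relative to the control total is an assumption.

```python
def margin_diagnostics(weights, cats, control):
    """Compare weighted totals with control totals for one raking variable.

    Returns one tuple per category:
    (category, control total, weighted total, difference, percent difference),
    where the percent difference is taken relative to the control total.
    """
    totals = {}
    for w, c in zip(weights, cats):
        totals[c] = totals.get(c, 0.0) + w
    rows = []
    for c, target in sorted(control.items()):
        diff = totals.get(c, 0.0) - target
        rows.append((c, target, totals.get(c, 0.0), diff, 100.0 * diff / target))
    return rows


def small_categories(cats, control, threshold=5.0):
    """Flag categories under `threshold` percent of the cases or of the controls."""
    cats = list(cats)
    grand_total = sum(control.values())
    flagged = []
    for c, target in sorted(control.items()):
        sample_pct = 100.0 * cats.count(c) / len(cats)   # unweighted share of cases
        control_pct = 100.0 * target / grand_total       # share of the control total
        if sample_pct < threshold or control_pct < threshold:
            flagged.append(c)
    return flagged


# Illustrative values only.
report = margin_diagnostics([10.0, 20.0, 30.0], ["m", "f", "f"], {"m": 20.0, "f": 40.0})
flags = small_categories(["a"] * 97 + ["b"] * 3, {"a": 900.0, "b": 100.0})
print(report)
print(flags)
```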

Table 2 illustrates the use of the macro with an example involving two raking variables (called VARIABLE1 and VARIABLE2 in Table 2) and a BY-variable, AREA, which has two levels. The marginal percentage and general control total for each level of the BY-variable are obtained outside the example, from PROC FREQ. Preliminary analyses of the data set showed that all categories of the raking variables represented in the marginal control data sets exist in the sample as well. Table 2 shows the unweighted distribution of each variable. The actual raking uses the weights of the individual cases. With the convergence tolerance set to 1, the raking converged after 3 iterations for Area 1, and also after 3 iterations for Area 2.

5. Sources of Control Totals

The discussion of control totals refers to actual totals as opposed to percents. Surveys that use demographic and socioeconomic variables for raking must locate a source for the population control totals. An example of a source of true population control totals is the 2000 U.S. Census short-form data. The U.S. Census long-form variables, the 2000 U.S. Census 5-Percent Public Use Microdata Sample (PUMS) files, the Current Population Survey (CPS), U.S. Census Bureau population projections, the National Health Interview Survey, and private-sector sources such as Claritas are better viewed as estimated control totals, because they are based either on large samples or on projection methodologies. Control totals obtained from a sample such as the CPS are subject to much smaller sampling variability and nonresponse bias, and may be subject to much lower noncoverage bias, than a survey sample. For state-specific control totals, say for persons aged 0-17 years, the CPS estimates will be subject to considerably larger sampling variability; thus they are useful for national control totals, but potentially less useful for stable state control totals. 
Combining two years of CPS data can reduce the sampling variability of the state control totals. For projection methods (e.g., age by sex by race mid-year population projections from the U.S. Census Bureau), the basic approach is to project information forward from 2000 for the non-censal years. Clearly, the farther one gets from 2000, the greater the likelihood that the projections will be off. This happened, for example, with the projection of the size of the Hispanic population for the years before the 2000 Census results came out. Eventually, the American Community Survey should provide a new source of information for non-censal years.

It is important to make sure that control totals from different sources all add to the same population total. If not, the raking will not converge. For example, for a survey in the middle of 2003, one would use Census Bureau age, sex, and race projections of the civilian noninstitutionalized population for July 2003, and obtain control totals by household income from the March 2003 CPS. In this situation one would most likely need to ratio-adjust the CPS income control totals so that they summed to the Census projection control totals for July 2003.

One must also consider how the variables are measured. A telephone survey may ask a single question to obtain household income. The source for the control totals, however, may have an income variable that is constructed from a series of questions about income from several sources (wages, cash-assistance programs, interest, dividends, etc.). One needs to consider carefully whether using income as a raking variable makes sense. If the sample is thought to substantially under-represent low-income persons, then raking on income may be preferred. If, on the other hand, there is concern that the survey is measuring income very differently from the source of the control totals, then consideration should be given to raking on a proxy variable such as educational attainment or even a dichotomous poverty-status variable.
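The ratio adjustment that makes control totals from a second source sum to the same population total amounts to one multiplication per category. The sketch below assumes a hypothetical function name and made-up income totals; it only illustrates the arithmetic.

```python
def ratio_adjust(totals, target_sum):
    """Scale a set of control totals so that they add up to target_sum."""
    factor = target_sum / sum(totals.values())
    return {category: value * factor for category, value in totals.items()}


# Made-up income control totals (in millions) that sum to 100,
# rescaled to a made-up population projection of 110 million.
income_totals = {"low": 30.0, "middle": 50.0, "high": 20.0}
adjusted = ratio_adjust(income_totals, 110.0)
print(adjusted)
```

The adjustment preserves the percentage distribution across categories and changes only the overall level.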

Control totals usually do not come with a missing category. The same variable in the survey may have a nontrivial percentage of cases that fall in a DK or Refused category. In this situation it may be possible to impute for item nonresponse in the survey before the raking takes place. When imputation is not feasible, the following procedure can be used to adjust the control totals. Run a weighted frequency distribution on the raking variable in order to determine the percentage of sample cases that have a missing value (e.g., 4.3%). Allocate 4.3% of the control total to a newly created missing category (e.g., 4.3% of 1,500,000 = 64,500). Reapportion the control totals in the other categories so that they add to the reduced control total (1,500,000 - 64,500 = 1,435,500). After raking, the weighted distribution of the sample will agree with the revised control totals and will reflect a 4.3% missing-data rate in weighted frequencies and tabulations.

6. Trade-offs Related to Number of Margins and Numbers of Categories

Some raking applications use margins for age, sex, and race, because it is relatively easy to obtain control totals for these variables. In other situations (especially in surveys with lower response or important noncoverage issues) one may need to rake on a considerably larger number of variables. This is feasible if control totals can be assembled. The authors have seen rakings that used well over ten variables. Raking on many variables will almost always require a large number of iterations. The authors have also seen rakings that used a smaller number of variables, but with fairly detailed categories. Again, a large number of iterations may be required. In both situations the cross-classification of the raking variables often yields an extremely large number of cells. For example, raking on 12 dichotomous variables yields 4,096 cells. Raking on five variables each containing six categories yields 7,776 cells. 
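The cell counts quoted here are products of the numbers of categories, and the same calculation shows how sparsely a sample occupies the cross-classification. In this small sketch the function name and the three-case sample are hypothetical.

```python
from math import prod


def cell_summary(levels, sample_cells):
    """Return (cells in the full cross-classification, occupied cells, empty cells).

    levels       : number of categories of each raking variable
    sample_cells : one tuple of category codes per sample case
    """
    n_cells = prod(levels)
    occupied = len(set(sample_cells))
    return n_cells, occupied, n_cells - occupied


print(prod([2] * 12))   # 12 dichotomous variables
print(prod([6] * 5))    # five variables with six categories each

# Hypothetical sample of three cases on three dichotomous variables.
summary = cell_summary([2, 2, 2], [(0, 1, 0), (0, 1, 0), (1, 1, 1)])
print(summary)
```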
Many of these cells will contain no cases in the sample. Such cells, by definition, remain empty after raking. However, the two-variable, three-variable, and higher-order interactions in the sample are maintained in the raking to the marginal control totals. The small cell sizes increase the chance that the raked weights will exhibit considerable variability, because those weights are maintaining sample interactions that are quite unstable. On top of the challenges of the numbers of variables and categories and the resulting number of underlying cells, large differences, before raking, between the weighted sample totals and the marginal control totals will generally increase the number of iterations.

These issues point to the need to closely examine: 1) the variables selected for raking, 2) the number and size of the categories of those raking variables, and 3) the magnitude of differences between the weighted sample totals and the control totals. Ideal variables for raking are those related to the key survey outcome variables and related to nonresponse and/or noncoverage. Variables that do not meet these conditions are candidates for exclusion from raking when a large number of variables are being considered. The categories of each candidate raking variable should be examined to see whether they contain a small proportion of the sample cases (say, under 5%) or whether the control total percentage is small (also, say, under 5%). Such small categories should be considered for collapsing. Sometimes the small categories of a nominal categorical variable can be collapsed into a larger residual category. For ordinal variables, collapsing with an adjacent category is often the best approach. If one or more weighted sample totals differ by a large amount from the corresponding control totals, one should first try to determine the source of the difference. Is it extreme differential nonresponse, or has the variable in the sample been measured in a very different manner than the corresponding variable used to form the control total? One should consider whether it is appropriate to use such a variable in raking.

7. Examining and Diagnosing Slow Convergence

Sometimes the raking process does not converge in a specified number of iterations. As an aid to diagnosing such situations and taking appropriate action, the enhanced IHB raking macro incorporates a module that, in case of non-convergence, uses the data to predict the number of iterations needed for convergence. The prediction is based on an empirical observation that the logarithm of the magnitude of the difference between an adjusted weighted total and its control total declines linearly with the number of iterations. In the authors' experience, this relation holds reasonably well when a slowly converging raking process approaches the specified number of iterations (50 in most applications). The enhanced macro extrapolates the last-iteration slope and estimates the iteration at which the slowest converging variable will cross a given tolerance threshold.

One usually considers a raking process to be converging slowly if either it does not converge in a specified number of iterations or convergence takes substantially more iterations than usual. In the authors' work, convergence usually takes place in 5 to 20 iterations. However, when the number of raking variables is large (say, more than 8) and some of the raking variables have numerous levels (the variable State with 51 categories, for instance), the process may take much longer to converge or may even not converge in an initially set number of iterations.

The statistician then has two options for proceeding with the raking. The first is to use the predicted number of iterations from the diagnostics to rerake the sample, trying to achieve complete convergence. This option is illustrated later. However, the predicted number of iterations may be impractically large. Then, as a second option, one may attempt to preprocess the sample data.
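The extrapolation idea behind the predicted number of iterations can be sketched as follows. This illustrates the stated log-linear relation and is not the macro's module: the function name, the use of only the last two iterations to estimate the slope, and the example gaps are assumptions.

```python
import math


def predict_iterations(abs_diffs, tol=1.0):
    """Extrapolate the iteration at which the gap should fall to `tol`.

    abs_diffs[s - 1] is the largest |weighted total - control total| for the
    slowest variable after iteration s. Returns None if the gap is not shrinking.
    """
    last_iteration = len(abs_diffs)
    y_prev = math.log10(abs_diffs[-2])
    y_last = math.log10(abs_diffs[-1])
    slope = y_last - y_prev            # change in log10(gap) per iteration
    if slope >= 0:
        return None                    # no finite prediction is possible
    extra = (math.log10(tol) - y_last) / slope
    return last_iteration + max(0, math.ceil(extra))


# Made-up gaps that halve at every iteration: 400, 200, 100 after iterations 1-3.
predicted = predict_iterations([400.0, 200.0, 100.0], tol=1.0)
print(predicted)
```

With the gap halving at each step, the made-up example needs seven more iterations to fall below a tolerance of 1, so the prediction is iteration 10.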

A common strategy collapses categories of slowly converging variables. If, for instance, State is such a variable (with a value for each U.S. state and D.C.), it could be collapsed into, say, Census Division (9 levels) or even Census Region (4 levels). Of course, the statistician may not always have flexibility in collapsing. He/she may be required to rake by the original variables, or the slow variables may already be dichotomous. But if there is some flexibility in the statistical weighting methods, the authors recommend trying collapsing to accelerate convergence.

How does one determine which raking variables are slow? The most effective way to examine a convergence process is to draw graphs. Figure 1 displays a plot of a slow raking process involving 12 variables; the x-axis is the iteration number, and the y-axis is log10 of the maximum (taken over all categories of a given raking variable) of the absolute value of the difference between the adjusted weighted total and the control total. The reference line indicates the tolerance level, in this example log10(1) = 0. One can easily construct this kind of graph using standard SAS/GRAPH facilities.

From the graph, one can easily single out the four slowest converging variables (their traces cluster distinctly higher): EEE, JJJ, GGG, and AAA. The variables GGG and AAA are dichotomous, so it is not possible to collapse them. To explore how categories of the variables EEE and JJJ (which are ordinal) converge and which of them might be collapsed, similar graphs show the individual categories of those two variables (Figure 2).

Besides visual exploration of convergence of slow categories, one should apply common sense when combining them. For ordinal variables, for instance, it would be logical to combine adjacent categories. Taking the meaning of values of EEE and JJJ into account, in addition to the graphs in Figure 2, the collapsing combined Categories 1 and 2, and Categories 4 and 5, for both variables (keeping Category 3 separate). Correspondingly, the respective marginal totals were combined, after which the raking was rerun and new convergence graphs were constructed for those two collapsed variables (Figure 3).

Because convergence of EEE and JJJ looked promising, a new overall convergence graph was constructed for all 12 raking variables (Figure 4). Comparing this graph with Figure 1, one can see that collapsing did play a dramatic role in speeding convergence. The raking process now converges in 17 iterations.

As already noted, the statistician may not always have the flexibility to collapse categories, or he/she may still want to achieve convergence without altering the raking variables, i.e., using as many iterations as required. But how many are required? The enhanced macro calculates a predicted number of iterations needed for full convergence. The graph in Figure 5 demonstrates a two-variable raking process that initially did not converge in the default 50 iterations (vertical reference line) and predicted 65 as the needed number. When rerun, the raking did converge at exactly the 65th iteration.

In a fairly rare situation, rerunning the raking with the predicted number of iterations could give non-convergence again, with a new and much larger number of predicted iterations. If this occurs, it makes sense to thoroughly examine sample and population data and make appropriate changes.

8. Inclusion of Two-Variable Raking Margins

Raking can be viewed as analogous to fitting a main-effects-only model. Because of sample size limitations and/or availability of only one-variable (factor or dimension) control totals, many raking applications follow this approach. In some situations it may be important to fit a two-variable interaction to the data. For example, suppose one is planning to rake on Variables A, B, C, and D. However, control totals for Variable C crossed with Variable D are available and exhibit a strong interaction (e.g., persons aged 0-17 years are more likely to be Hispanic than persons aged 65+ years). If the cell counts in the C x D margin of the sample are large enough to support fitting a C x D interaction, one would rake on three margins: A, B, and C x D. It is not necessary also to rake on separate margins for Variables C and D. If, however, the C x D raking margin involved collapsing, one could consider adding one-variable margins to the raking for Variables C and D without any collapsing of their categories.

9. Forming Control Totals for Quantity Variables

In a specialized raking situation one is planning on raking a sample of persons on some categorical variables (e.g., age, sex, and race), but the source of the control totals also has a quantity variable related to, say, the total number of glasses of milk consumed in a week. The survey has also measured this same quantity variable; but the survey response rate is, let us assume, only 50%. One may want to ensure that the weighted total number of glasses of milk consumed per week from the sample agrees closely with the control total. This can be accomplished by dividing the sample into groups; each group will have a mean number of glasses of milk consumed in a week and a sum of weights. In the raking process one can modify the sum of the weights in each group so that the sum of the weights times the mean, summed over all the groups, adds to the control value of total glasses of milk consumed in a week. In the simplest application one can divide the sample into two groups: below versus above the median number of glasses of milk consumed in a week based on the control total data. For each group one can use the control data to obtain the total number of glasses of milk consumed in a week. 
This two-category margin is then added to the raking. Convergence may not occur, making it necessary to shift the group boundary point away from the median in order to achieve convergence. Once convergence is achieved, the weighted total number of glasses of milk consumed in a week will be in close agreement with the control total value. This procedure may be extended to adjust not only the total over the entire sample but also the totals for various subpopulations.

10. Raking at the State Level in a Large National Survey

Some large surveys stratify by state and are designed to yield state estimates. The resulting total national sample is usually very large. The survey statisticians seek to provide national estimates as well as state estimates. Often one sets up raking control totals at the state level and carries out 51 individual rakings. Assume those rakings use Variables A, B, and C; but the number of categories of each variable is limited because of the state sample sizes. For example, one might collapse Variables A, B, and C differently by state. If Variable A were race/ethnicity, one might be able to use Hispanic as a separate race/ethnicity category in California, but not in Vermont because of the small sample size. After the 51 rakings one might compare the weighted distributions of Variables A, B, and C with national control totals and observe some differences that are caused by the state-level collapsing of categories. If having precise weighted distributions at the national level is important for analytic or face validity reasons, one can use the IHB raking macro in the following manner. Set up a single raking that includes margins for State x A, State x B, and State x C (i.e., combine the 51 individual state rakings into a single raking). Then add detailed national margins for Variables A, B, and C. Another, similar example would involve adding Variable D as a national raking margin because its control total is available only at the national level (e.g., household income). This strategy needs to be implemented carefully. Checks should be made for raking variables that contain small sample sizes. 
The coefficient of variation of the weights prior to raking and after raking should be examined in each state to check for large increases in the variability of the weights. Finally, the raking diagnostics discussed above should be used if convergence problems arise.

11. Maintaining Prior Nonresponse and Noncoverage Adjustments in the Final Weights

Frankel et al. (2003) have discussed methods based on data on interruptions in telephone service (of a week or longer in the past 12 months) to compensate for the exclusion of persons in nontelephone households in random-digit-dialing surveys. One typically adjusts the base sampling weights of persons with versus without an interruption in telephone service. The resulting interruption-based weight adjusts for the noncoverage of nontelephone households. If one then rakes the sample on age, sex, and race, the impact of the nontelephone adjustment may be diluted somewhat, even though the raking starts with the interruption-based weight. In that case it generally makes sense to create weighted control totals (using the interruption-based weight) from the sample for persons residing in households with versus without an interruption in telephone service. These weighted control totals should be ratio-adjusted so that they have the same sum as the age, sex, and race control totals. For example, if the age, sex, and race margins each sum to 180,000,000 persons, then the interruption margin needs to be adjusted so that it also sums to 180,000,000. The raking would use the four variables instead of just three and would ensure that the nontelephone adjustment is fully reflected in the final weights. This would be appropriate where the interruption-in-telephone-service category could be small (e.g., in states where telephone coverage is very high), but one still wants to maintain that small category in the raking.
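The ratio adjustment of the interruption margin can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the case weights and the function name `interruption_control_totals` are hypothetical, and only the target sum of 180,000,000 comes from the example in the text.

```python
# Hypothetical cases: (interruption-based weight, had an interruption in service).
cases = [
    (1500.0, False), (1800.0, False), (2400.0, False), (1300.0, False),
    (2000.0, False), (600.0, True), (400.0, True),
]

TARGET_SUM = 180_000_000  # sum of the age, sex, and race margins in the text's example


def interruption_control_totals(cases, target_sum):
    """Weighted sample totals by interruption status, ratio-adjusted to target_sum."""
    weighted = {}
    for weight, interrupted in cases:
        weighted[interrupted] = weighted.get(interrupted, 0.0) + weight
    # One common factor keeps the weighted split between the two categories
    # while forcing the margin to sum to the same total as the other margins.
    factor = target_sum / sum(weighted.values())
    return {status: total * factor for status, total in weighted.items()}


totals = interruption_control_totals(cases, TARGET_SUM)
print(totals)
```

The resulting two-category dictionary would serve as the control totals for the fourth raking margin, next to age, sex, and race.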

12. Raking Surveys that Screen for a Specific Target Population

A common survey model for obtaining interviews with a specific target population is to screen a sample of households for the presence of members of the target population. An example would be children with special health care needs. The screening interview collects a roster of children with, say, their age, sex, and race, and determines whether each child has special health care needs. If the household contains one child with special health care needs, a detailed interview is conducted for that child. If the household has two or more such children, one is selected at random for the detailed interview. Of course, the interview response rate will be less than 100%, because some parents will not agree to do the detailed interview.

Assume that the survey statisticians need to look at the prevalence of children with special health care needs, and they will also be analyzing the detailed interview data. In this situation one would calculate the usual base sampling weights, make adjustments for unit nonresponse, and possibly make a noncoverage adjustment if warranted. One first obtains control totals for age, sex, and race in the U.S. population aged 0-17 years. One then rakes the entire sample of children in the screened households to those control totals, because that sample is a sample of children aged 0-17 in the U.S. The resulting screener weights can then be used to estimate the prevalence of children with special health care needs in the U.S.

That screener weight would typically serve as the input weight in the calculation of weights for the children with completed detailed interviews. As part of that calculation process one also seeks to weight the detailed-interview sample by age, sex, and race. Of course, control totals are unlikely to be available for children with special health care needs. One can, however, use the screener weight and the sample of children with special health care needs identified in the screened households to form weighted control totals for age, sex, and race and then use those in raking the detailed-interview weights. This method ensures that the survey analysts do not ask why the age distribution of children with special health care needs from the screener sample does not agree exactly with the distribution in the detailed-interview data. Some caution needs to be exercised in using this approach when the screener shows evidence of false positives.

13. Raking to Control Totals Expressed as Percentages and Raking with No Input Weight

Frequently, the user working with a weighted or an unweighted sample needs to weight it to fit marginal population proportions. As an example (Table 3), the authors created an 11-case sample data set that contains two variables: VAR1, which takes values 1, 2, and 3 with frequencies 27.27%, 45.45%, and 27.27%, respectively; and VAR2, which takes values 1 and 2 with frequencies 45.45% and 54.55%, respectively. The objective was to weight this sample so that the distributions of VAR1 and VAR2 met the population distributions, (20%, 35%, 45%) and (60%, 40%), respectively, within a tolerance of 0.001%.

14. Weight Trimming and Raking

Weight trimming refers to truncation of high or extreme weight values in order to reduce their impact on the variance of the estimates, especially for subgroup estimates. One consequence of the truncation of high weight values is that the weights of the entire sample will no longer add to the population size. Although weight trimming is a separate topic from raking, the two are related in the sense that weight trimming typically takes place at the last step in the calculations, which is often raking. Many large surveys use weight trimming (Srinath 2003, Abt Associates memorandum). Its objective is to reduce the mean squared error (MSE) of the key outcome estimates. By trimming high weight values one generally lowers sampling variability but may incur some bias. The MSE will be lower if the reduction in variance exceeds the increase in squared bias arising from weight trimming.

There are no established rules for weight trimming; rather, most people use a general set of guidelines. Some common truncation points are: 1) the median weight plus five or six times the interquartile range (IQR) of the weights, 2) five times the mean weight, 3) the 95th percentile of the weights.

How can weight trimming be incorporated in raking? The IHB SAS macro can be used for weight trimming in the following steps (using as an example the median weight plus six times the IQR as the truncation point; see footnote 1):

1. Prior to raking i, where i references the number of times the raking is run, examine the distribution of the raking input weight and calculate the median weight plus six times the interquartile range (IQR) of the weights.
2. Truncate input weights that exceed the median weight plus six times the IQR plus one, setting them to the median weight plus six times the IQR (weights at or below the median weight plus six times the IQR plus one are not altered).
3. Using the truncated input weight, run the raking to obtain raking weight i.
4. Repeat Steps 1 to 3 (i.e., run the raking a second time, third time, etc.) until there are no weights that are above the median weight plus six times the IQR plus one.

Footnote 1: A somewhat more sophisticated, but computer-intensive, procedure is to apply bounds to the weights as the raking is taking place.

Although the cutoff value equals the median weight plus six times the IQR, only weights that exceed the median weight plus six times the IQR plus one are truncated to the median weight plus six times the IQR. The reason is that the raking may increase the weight values of the cases that have been truncated, and without the added constant this could cause the raking steps to repeat endlessly. The approach described above does not guarantee convergence (i.e., after running the raking several times there could still be weights above the median weight plus six times the IQR plus one), and one could consider adding a larger constant to increase the chances of convergence, but the authors have found in their applications that convergence is often achieved by adding a constant of one.

Table 4 shows an example of the use of weight trimming with raking. Before raking there are four cases with input weights that exceed the median weight plus six times the IQR plus one (the condition). The weights of those cases are truncated to the median weight plus six times the IQR (the cutoff), and the raking is run for the first time. After the first raking the condition is recomputed; only one case has a weight that exceeds it, and that weight is truncated to the recomputed cutoff. After the second raking no cases have a weight that exceeds the condition, and the process is stopped. The weights from the second raking add to the population size and meet the raking control totals.

15. Summary

The authors have sought to give some background on how raking works and to discuss the convergence process. They have also sought to give some warnings of conditions that need to be checked before and after raking. Brick et al. (2003) discuss other examples of issues that one should be aware of when using raking. The IHB SAS macro discussed in this paper is available for free from the first author.
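The four trimming steps of Section 14 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the IHB SAS macro: the `rake` helper is a plain iterative proportional fitting loop, the quartiles come from `statistics.quantiles` with its default method (the paper does not say which quartile definition is used), and the weights, margins, and control totals are hypothetical.

```python
from statistics import median, quantiles


def rake(weights, margins, controls, tol=1e-8, max_iter=200):
    """Plain raking: adjust each margin in turn until all totals are within tol."""
    w = list(weights)
    for _ in range(max_iter):
        worst = 0.0
        for name, cats in margins.items():
            totals = {}
            for wi, c in zip(w, cats):
                totals[c] = totals.get(c, 0.0) + wi
            worst = max(worst, max(abs(totals[c] - t) for c, t in controls[name].items()))
            w = [wi * controls[name][c] / totals[c] for wi, c in zip(w, cats)]
        if worst < tol:
            break
    return w


def trim_cutoff(w, k=6.0):
    """Median weight plus k times the IQR (quartile definition is an assumption)."""
    q1, _, q3 = quantiles(w, n=4)
    return median(w) + k * (q3 - q1)


def rake_with_trimming(weights, margins, controls, k=6.0, constant=1.0, max_rounds=20):
    """Return (weights, rounds); rounds is None if trimming never settles."""
    w = list(weights)
    for rounds in range(1, max_rounds + 1):
        cutoff = trim_cutoff(w, k)                                   # Step 1
        w = [cutoff if wi > cutoff + constant else wi for wi in w]   # Step 2
        w = rake(w, margins, controls)                               # Step 3
        if max(w) <= trim_cutoff(w, k) + constant:                   # Step 4 stopping check
            return w, rounds
    return w, None


# Hypothetical 8-case sample with one extreme input weight.
weights = [10.0, 12.0, 11.0, 9.0, 10.0, 13.0, 12.0, 80.0]
margins = {
    "sex": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "age": ["Y", "O", "Y", "O", "Y", "O", "Y", "O"],
}
controls = {"sex": {"M": 80.0, "F": 80.0}, "age": {"Y": 80.0, "O": 80.0}}

final, rounds = rake_with_trimming(weights, margins, controls)
print(rounds, [round(x, 2) for x in final])
```

Because the last operation in each round is a raking, the returned weights meet the control totals whenever the raking itself converges; `max_rounds` guards against the non-convergence that the text warns about.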

References

Bishop YMM, Fienberg SE, and Holland PW. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Brick JM, Montaquila J, and Roth S. (2003). Identifying Problems with Raking Estimators. Proceedings of the Annual Meeting of the American Statistical Association [CD-ROM], Alexandria, VA: American Statistical Association.
Deming WE. (1943). Statistical Adjustment of Data. New York: Wiley.
Frankel MR, Srinath KP, Hoaglin DC, Battaglia MP, Smith PJ, Wright RA, and Khare M. (2003). Adjustments for non-telephone bias in random-digit-dialling surveys. Statistics in Medicine, Volume 22.
Izrael D, Hoaglin DC, and Battaglia MP. (2000). A SAS Macro for Balancing a Weighted Sample. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.
Izrael D, Hoaglin DC, and Battaglia MP. (2004). To Rake or Not To Rake Is Not the Question Anymore with the Enhanced Raking Macro. May 2004 SUGI Conference, Montreal, Canada.
Kalton G. (1983). Compensating for Missing Survey Data. Survey Research Center, Institute for Social Research, University of Michigan.
Oh HL, and Scheuren F. (1978). Some Unresolved Application Issues in Raking Ratio Estimation. Proceedings of the Section on Survey Research Methods, Washington, DC: American Statistical Association.

Table 1. A 2 x 2 Table for Which Raking Cannot Produce Agreement with the Control Totals
[Table layout: Variable 1 and Variable 2, each with its control totals]
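To make the idea behind Table 1 concrete, the Python sketch below rakes a different, purely hypothetical 2 x 2 table (not the paper's Table 1). Its off-diagonal cells are zero, so the first row total must always equal the first column total, and control totals that disagree on that value can never both be met. The function name `rake_table` and all numbers are assumptions made for illustration.

```python
def rake_table(table, row_controls, col_controls, iterations=5):
    """Rake a two-way table; return the cells and the worst row error per iteration."""
    cells = [row[:] for row in table]
    history = []
    for _ in range(iterations):
        # Row step: scale each row to its control total.
        for i, control in enumerate(row_controls):
            total = sum(cells[i])
            cells[i] = [cell * control / total for cell in cells[i]]
        # Column step: scale each column to its control total.
        for j, control in enumerate(col_controls):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] = row[j] * control / total
        # After the column step, measure how far the row totals have drifted.
        history.append(max(abs(sum(row) - control)
                           for row, control in zip(cells, row_controls)))
    return cells, history


# Hypothetical table with structural zeros off the diagonal.
table = [[10.0, 0.0], [0.0, 10.0]]
row_controls = [50.0, 50.0]   # control totals for the row variable
col_controls = [30.0, 70.0]   # control totals for the column variable

cells, history = rake_table(table, row_controls, col_controls)
print(history)  # [20.0, 20.0, 20.0, 20.0, 20.0]
```

The row error never shrinks: each column step undoes the preceding row step, so the process cycles instead of converging.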

Table 2. Example of Raking Using the IHB SAS Macro
[Macro output for AREA 1 and AREA 2. For each area, iterations 1 to 3 each report VARIABLE1 and then VARIABLE2, with the columns: Calculated margin, Control Total, Difference, Calculated %, Control %, Difference in %. The output for each area ends with the message "Program for AREA n terminated at iteration 3 because all calculated margins differ from Control Totals by less than 1".]
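The kind of per-iteration output summarized in Table 2 can be imitated with a short Python sketch. It is not the IHB SAS macro and does not reproduce its numbers: the function `rake_with_log`, the six hypothetical cases, and the control totals are assumptions, and the stopping rule (all differences below 1 when each margin is examined) is one reading of the termination message.

```python
def rake_with_log(weights, margins, controls, tol=1.0, max_iter=50):
    """Rake case weights and log (iteration, variable, category, calculated, control, diff)."""
    w = list(weights)
    log = []
    for iteration in range(1, max_iter + 1):
        worst = 0.0
        for name, cats in margins.items():
            # Calculated margin before this variable is adjusted.
            totals = {}
            for wi, c in zip(w, cats):
                totals[c] = totals.get(c, 0.0) + wi
            for c, control in controls[name].items():
                diff = totals[c] - control
                log.append((iteration, name, c, totals[c], control, diff))
                worst = max(worst, abs(diff))
            # Adjust the weights so this margin matches its control totals.
            w = [wi * controls[name][c] / totals[c] for wi, c in zip(w, cats)]
        if worst < tol:
            return w, log, iteration
    return w, log, None


# Hypothetical input: six cases with equal weights and two raking variables.
weights = [100.0] * 6
margins = {
    "VARIABLE1": [1, 1, 1, 2, 2, 2],
    "VARIABLE2": [1, 2, 1, 2, 1, 2],
}
controls = {
    "VARIABLE1": {1: 250.0, 2: 350.0},
    "VARIABLE2": {1: 330.0, 2: 270.0},
}

final, log, iterations = rake_with_log(weights, margins, controls)
for it, name, cat, calc, control, diff in log:
    print(f"iteration {it} {name} category {cat}: "
          f"calculated {calc:.3f} control {control:.3f} difference {diff:.3f}")
```

Percentage columns like those in the macro output could be added by dividing each calculated margin and control total by the corresponding overall sum.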

Figure 1. Convergence of a Raking Process Involving 12 Variables
[Legend: variables AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH, III, JJJ, KKK, LLL]

Figure 2. Convergence of Variables EEE and JJJ before Collapsing
[Panels: Variable EEE; Variable JJJ. Legend: category]

Figure 3. Convergence of Variables EEE and JJJ after Collapsing
[Panels: Variable EEE; Variable JJJ. Legend: category. Note in each panel: Categories 2 and 4 collapsed into 1 and 5, respectively]

Figure 4. Convergence of All 12 Variables in the Raking Process after Collapsing Variables EEE and JJJ
[Legend: variables AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH, III, JJJ, KKK, LLL]

Figure 5. Prediction of the Number of Iterations Needed for Convergence
[Legend: variables AAA, BBB. Axis: predicted number of iterations for convergence]


More information

Chapter Standardization and Derivation of Scores

Chapter Standardization and Derivation of Scores 19 3 Chapter Standardization and Derivation of Scores This chapter presents the sampling and standardization procedures used to create the normative scores for the UNIT. The demographic characteristics

More information

System Dynamics Group Sloan School of Management Massachusetts Institute of Technology

System Dynamics Group Sloan School of Management Massachusetts Institute of Technology System Dynamics Group Sloan School of Management Massachusetts Institute of Technology Introduction to System Dynamics, 15.871 System Dynamics for Business Policy, 15.874 Professor John Sterman Professor

More information

Introduction to Business Research 3

Introduction to Business Research 3 Synopsis Introduction to Business Research 3 1. Orientation By the time the candidate has completed this module, he or she should understand: what has to be submitted for the viva voce examination; what

More information

SOFTWARE ENGINEERING

SOFTWARE ENGINEERING SOFTWARE ENGINEERING Project planning Once a project is found to be feasible, software project managers undertake project planning. Project planning is undertaken and completed even before any development

More information

Statistics Definitions ID1050 Quantitative & Qualitative Reasoning

Statistics Definitions ID1050 Quantitative & Qualitative Reasoning Statistics Definitions ID1050 Quantitative & Qualitative Reasoning Population vs. Sample We can use statistics when we wish to characterize some particular aspect of a group, merging each individual s

More information
