Analyses Based on Combining Similar Information from Multiple Surveys


Section on Survey Research Methods, JSM 2009

Georgia Roberts, David Binder
Statistics Canada, Ottawa, Ontario, Canada K1A 0T6

Abstract
Many researchers have access to different survey sources, each with similar variables. These researchers are often interested in the appropriateness of bringing together the data from the different sources for the purpose of data analysis, particularly when each source has a small sample size for the question being studied. We address a variety of topics that the researchers should be aware of, such as the comparability of the variables across surveys, and the suitability of positing a model for the variables in the different surveys. We discuss possible approaches for combining the information.

Key Words: Pooling, target population, design-based analysis, model-design-based framework.

1. Introduction
With the increasing availability of more than one survey containing the same or similar variables, more attention is being paid to whether and how to combine the data from the different surveys to improve estimates. It seems reasonable to think that one should usually be able to improve the estimate of a quantity of interest (with respect to either accuracy or precision) by combining the samples, provided that an appropriate approach is used to form the new estimate. However, which method is appropriate is not always clear.

There are several reasons why analysts would want to combine the data from two or more surveys. A major reason is that the sample sizes for the phenomenon under study are small in each of the data sources, either due to each survey having a small sample size or due to the domain of interest being rare in the population(s) targeted by each of the surveys. The combining of samples to increase the number of observations is used not only in the case of separate surveys, but also for combining rolling samples of the same survey and for combining data from overlapping panels in a repeated panel survey.
In all cases, it is expected that increasing the overall sample size should lead to reduced sampling errors.

Having small sample sizes in each of the data sources is not the only reason for wanting to combine the data from two or more surveys. Instead, an analyst may wish to bring together the data from periodic surveys on the same topic in order to estimate change. Or, in cases where there may be frame deficiencies, combining surveys with similar variables using multiple frame methods may be used to improve the coverage. As well as the coverage problem, Schenker and Raghunathan (2007) discuss other types of nonsampling errors for which combining information from multiple surveys could be beneficial.

Underlying all of the reasons given above for wanting to combine is a common problem: the data from any single survey are limited in some sense for addressing the analytic problem at hand. However, combining the data from more than one source raises a number of issues that need to be addressed before reasonable decisions may be made on whether and how estimation can be carried out using the different sources. The first of these is the comparability of the information obtained from the different surveys. Schenker et al. (2002) and Schenker and Raghunathan (2007) discuss a number of potential sources of incomparability that could affect whether variables recorded in different surveys are actually measuring the same quantities: differences in the types of respondents and/or the sources of the respondents' information, differences in the modes of interviewing, differences in the survey contexts, differences in the sample designs and differences in survey questions. However, an additional important comparability question relates to how the target populations of the data sources compare: whether they are similar for both target group and time, whether the target groups are similar but times differ (which is the most common case), or whether they differ substantively with respect to both target group and time.

In this paper we address a number of other topics that are important to making decisions on how estimation can be carried out using the different survey sources. In Section 2 there is a description of the three main types of quantities that analysts are interested in estimating from combined data sources of similar variables. Section 3 begins with definitions of the two usual approaches to estimation when combining data from multiple surveys and then provides descriptions of randomization frameworks that could assist an analyst in deciding which approach might be most suitable for which type of quantity of interest.
An illustration of combining the data from two Canadian health surveys is described in Section 4, with the paper finishing with a number of points of discussion in Section 5.

2. Three Types of Quantities of Interest
Before introducing the general randomization frameworks within which the properties of various estimators can be discussed, we present three general categories of what is frequently estimated from data arising from multiple surveys. In each case, we consider the unknown quantities that are being estimated and to which target population these quantities refer. It should be noted that the analyst's target population has two components: the target group (i.e., the attributes of the units being targeted) and the reference time(s) (for example, a single time point such as December 31, 2008 or a number of time periods such as both 2004 and 2005).

Footnote: The target group is defined as the set of units having the targeted attributes, say, females aged 5 to 34 living in California.

1) Simple Descriptive: We say that the quantities of interest are simple descriptive when they are characteristics of a single finite target population or when they are a fixed function of the characteristics of more than one finite population. Finite population characteristics are quantities such as means, proportions and totals. There would be a single finite target population, for example, if all surveys being combined covered the same target group at the same point in time. A single finite population would also be the case when the surveys being combined each target a different piece of the full target population and the quantities of interest refer to that full population.

If each survey refers to a different finite population, we may be interested in a simple or weighted average of the characteristics of the different populations. For example, if the prevalence of a disease is $P_1$ and $P_2$ in populations 1 and 2 respectively, our quantity of interest may be $(P_1 + P_2)/2$. In some cases, we may prefer a weighted average, such as weighting by population size, so that our quantity of interest is $(N_1 P_1 + N_2 P_2)/(N_1 + N_2)$, where $N_1$ and $N_2$ are the respective population sizes. Other weighted averages can also be considered. Another form of a descriptive characteristic is the difference between two population means, $\bar{Y}_1 - \bar{Y}_2$. Note that for any of these examples, the two populations could involve entirely different population groups (say, different age groups) or they could be the same population group at different points in time. It should also be noted that, in the case of simple descriptive quantities, the characteristics of interest are defined without a model justification.

2) Descriptive under an assumed relationship: Rather than being a simple descriptive quantity, it is not uncommon that the parameter of interest is based on an assumed relationship among the characteristics of the finite target populations of the different surveys. For example, we could suppose that the prevalence rate of a particular disease is the same for each population and it is this common rate that we wish to estimate. Another example would be the case where we want to estimate a quantity for a time point that is midway between two survey periods; assuming a linear trend over time, the quantity of interest would be a simple average of the individual population quantities.

3) Analytic quantities: When the quantities of interest are characteristics or relationships that hold beyond the specific finite populations surveyed (such as the parameters of a superpopulation), we say that these quantities are analytic.
Often parameters of a model are used to summarize such characteristics or relationships; for example, a logistic model might be used to describe a prevalence rate that is measured in each survey.

3. Approaches to Estimation
As we have described in Section 2, when combining similar information from multiple surveys, we need to first consider which population quantity is being estimated. Once this is established, the properties of estimators should be assessed in the context of the randomization framework for selecting the sample. We first describe the two usual approaches to estimation when combining data from multiple surveys.

The separate approach: In the separate approach to estimation, an estimate is obtained from each survey separately, and then the overall estimator is a function of the separate estimates. The most common method here is to take some linear combination of the separate estimates to form the overall estimator. The particular linear combination chosen can depend on whether the quantity of interest is descriptive or analytic. The linear combination can also depend on whether the separate survey estimates are independent, and whether one can achieve an adequate reduction in the variances of the overall estimate for the most important quantities of interest. (Note that in a multipurpose survey there are usually several quantities that the researcher wishes to estimate.)

As an example of the separate approach, suppose that $\hat{\theta}_1$ and $\hat{\theta}_2$ are unbiased estimates of the same unknown descriptive parameter $\theta$ from each of two surveys, and that $\hat{\theta}_1$ and $\hat{\theta}_2$ are independently distributed with known variances $\sigma_1^2$ and $\sigma_2^2$, respectively. The separate approach estimator $\hat{\theta}_c = \alpha \hat{\theta}_1 + (1-\alpha)\hat{\theta}_2$ will be unbiased for $\theta$ regardless of the value of the fixed composite weight $\alpha$ and will have minimum variance when $\alpha = \sigma_2^2/(\sigma_1^2 + \sigma_2^2)$. If $\sigma_i^2 = \sigma^2/n_i$, where $n_1$ and $n_2$ are the respective survey sample sizes, the minimum variance estimator is $\hat{\theta}_c = (n_1\hat{\theta}_1 + n_2\hat{\theta}_2)/(n_1 + n_2)$.

The pooled approach: On the other hand, in the pooled approach, the individual records from all the surveys are combined, the original weights may be modified, and estimation is based on the pooled sample using the new weights and using techniques appropriate to a single sample. Typically, for the observations in each individual survey, the modified weights are proportional to the original weights. The choice of rescaling factors can depend on criteria similar to those used for choosing a linear combination in the separate approach.

Now, to study the properties of these approaches, we need to establish which randomization process led to the observed data. This is not necessarily straightforward, especially when each survey is taken at a different time point.

3.1 Design-based Randomization
For estimating a descriptive quantity, it is common to assume that the underlying randomization framework is design-based. This means that statistical inferences (such as construction of confidence intervals and performing tests of hypotheses) are based only on the probabilities used to select the samples from the finite populations.

For the separate approach to estimation, in Figure 1, we illustrate the design-based framework when there are two finite populations (where these populations could overlap). We see that the samples are taken from each of the two populations, separate estimates are formed from each, and then an overall estimate, based on these separate estimates, is derived.

Figure 1: Descriptive estimation, separate approach
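The minimum-variance composite weight for the separate approach can be checked numerically. The following is a minimal sketch in Python, assuming two independent unbiased estimates with known variances; the function names and all input values (estimates 0.30 and 0.40, sample sizes 400 and 100, unit variance 0.25) are hypothetical and chosen only for illustration.

```python
def composite_weight(var1, var2):
    """Weight on estimate 1 that minimizes the variance of the linear
    combination of two independent unbiased estimates."""
    return var2 / (var1 + var2)


def separate_estimate(est1, est2, var1, var2):
    """Return (combined estimate, its variance, weight on estimate 1)."""
    alpha = composite_weight(var1, var2)
    combined = alpha * est1 + (1.0 - alpha) * est2
    combined_var = alpha ** 2 * var1 + (1.0 - alpha) ** 2 * var2
    return combined, combined_var, alpha


# Hypothetical inputs: variances of the form sigma2 / n_i.
n1, n2 = 400, 100
sigma2 = 0.25
est, var, alpha = separate_estimate(0.30, 0.40, sigma2 / n1, sigma2 / n2)

print("weight on survey 1:", alpha)
print("combined estimate: ", est)
print("combined variance: ", var)
```

With variances proportional to the reciprocal of the sample sizes, the weight reduces to the share of survey 1 in the total sample size, and the combined variance equals the unit variance divided by the total sample size.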

The pooled approach in the design-based framework is illustrated in Figure 2. In this approach, the samples from each of the surveys are combined into one large sample, possibly with some weight adjustments, and an overall estimate is obtained from the pooled data. Again, in a design-based framework, the only randomness is the sample selection process for each of the finite populations.

Figure 2: Descriptive estimation, pooled approach

In general, the pooled approach and the separate approach lead to different estimates. These estimates may not even have the same expected values. For example, if the prevalence of a disease in each of two populations is estimated by $\hat{P}_1$ and $\hat{P}_2$, using the samples from each, a separate approach might be to take the simple average of these estimates, which has an expected value of $(P_1 + P_2)/2$. On the other hand, if we take an analogous pooled approach, and rescale the weights of the observations from each sample by 1/2, the quantity being estimated would be $(N_1 P_1 + N_2 P_2)/(N_1 + N_2)$. Unless $N_1 = N_2$, the separate and pooled approaches are estimating different quantities. On the other hand, if, under a model, both $P_1$ and $P_2$ are measuring a common overall prevalence rate, then the separate and pooled approaches are estimating the same prevalence rate.

A variety of methods has been proposed for rescaling weights for use with combined surveys (see, for example, Korn and Graubard, 1999). One approach adopted by some (see, for example, Thomas (2007)) is to rescale the weights by the factor $n_i/D_i$ for the $i$th survey, where $n_i$ is the sample size and $D_i$ is some average design effect for the $i$th survey. This rescaling is motivated by the fact that for the separate approach this can yield minimum variance estimates when the individual survey estimates are unbiased. In more complex cases, such as the fitting of regression models, there are analogous differences between the separate and pooled approaches.
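The point that the simple-average separate estimate and the half-weight pooled estimate target different quantities can be seen on a toy data set. This is a minimal sketch in Python: the two tiny "surveys", their weights and their outcomes are entirely hypothetical, and the weights within each survey are assumed to sum to that survey's population size.

```python
def weighted_mean(records):
    """Weighted mean of a 0/1 outcome; each record is (weight, outcome)."""
    total_weight = sum(w for w, _ in records)
    return sum(w * y for w, y in records) / total_weight


# Hypothetical samples. Weights sum to N1 = 1000 and N2 = 200.
survey1 = [(250.0, 1), (250.0, 0), (250.0, 0), (250.0, 0)]
survey2 = [(50.0, 1), (50.0, 1), (50.0, 0), (50.0, 0)]

p1_hat = weighted_mean(survey1)
p2_hat = weighted_mean(survey2)

# Separate approach: simple average of the two survey estimates.
separate = (p1_hat + p2_hat) / 2.0

# Pooled approach: stack the records and rescale every weight by 1/2.
pooled = weighted_mean([(w / 2.0, y) for w, y in survey1 + survey2])

# Population-size-weighted average of the two survey estimates.
N1 = sum(w for w, _ in survey1)
N2 = sum(w for w, _ in survey2)
size_weighted = (N1 * p1_hat + N2 * p2_hat) / (N1 + N2)

print("separate:", separate)
print("pooled:  ", pooled)
print("size-weighted average:", size_weighted)
```

In this toy case the pooled value coincides with the population-size-weighted average rather than with the simple average, because a common rescaling factor cancels in the weighted mean.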

However, for the pooled approach, if the population sizes are very different, this may not be best, even when the design effects for all the quantities are equal within each survey. As an example, suppose that $\hat{P}_i$ is an unbiased estimate from the $i$th survey of a common overall prevalence rate $P$, with variance $D\,P(1-P)/n_i$. A pooled approach estimate using two samples and rescaling factor $n_i$ is

$$\hat{P}_c = \frac{n_1 N_1 \hat{P}_1 + n_2 N_2 \hat{P}_2}{n_1 N_1 + n_2 N_2}. \qquad (3.1)$$

On the other hand, if the separate approach estimator is defined as $\hat{P}_c^{(\alpha)} = \alpha \hat{P}_1 + (1-\alpha)\hat{P}_2$, the optimal value for the composite weight would be

$$\alpha_{opt} = \frac{n_1}{n_1 + n_2},$$

yielding the minimum-variance separate approach estimator

$$\hat{P}_c^{(\alpha_{opt})} = \frac{n_1 \hat{P}_1 + n_2 \hat{P}_2}{n_1 + n_2}. \qquad (3.2)$$

Therefore, if the population sizes are very different, the minimum-variance separate approach estimate given by (3.2) will not be close in value to the pooled estimate in (3.1). Also, as is typical when combining surveys, if the sample size for each survey is small, the estimates of the design effects may not be very accurate for either approach.

3.2 Model-design-based Randomization
Often the quantity of interest to a researcher can be formulated in terms of parameters of a model. For example, the probability of being diagnosed with a particular disease may be thought of as an outcome from a logistic regression model. In this case, a suitable randomization framework for statistical inference may be given by assuming that (i) the study variables in each finite population are realizations of random variables of a model, and (ii) a probability-based sample is selected from each resulting finite population. This is illustrated in Figure 3. Whereas our previous description of separate and pooled estimates was given in the context of estimating finite population quantities, these estimates (and others that are model-motivated) can be assessed under a model-design-based framework. As pointed out by Binder and Roberts (2003), when the sampling fractions are small, weighted estimates can be used to obtain model-design-based (approximately) unbiased estimates for the model parameters of interest.
However, when the sample size (or the number of PSUs in the case of a multi-stage survey) is not large, care may be required in making appropriate inferences.
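Returning to the design-based comparison in Section 3.1, the gap between the pooled estimator and the minimum-variance separate estimator when population sizes differ can be sketched numerically. The Python sketch below assumes, as a reading of that example, that each survey's weights sum to its population size, that the pooled rescaling factor is the sample size, and that both surveys share one design effect; the estimates, sample sizes and population sizes are hypothetical.

```python
def pooled_estimate(p_hat, n, N):
    """Pooled estimator when survey i's weights (summing to about N[i])
    are rescaled by the factor n[i]."""
    num = sum(ni * Ni * pi for pi, ni, Ni in zip(p_hat, n, N))
    den = sum(ni * Ni for ni, Ni in zip(n, N))
    return num / den


def min_variance_separate(p_hat, n):
    """Separate estimator with composite weights proportional to n[i]."""
    return sum(ni * pi for pi, ni in zip(p_hat, n)) / sum(n)


def relative_variance(coefs, n):
    """Variance of sum_i coefs[i] * P_hat_i in units of D * P * (1 - P),
    for independent estimates with variance D * P * (1 - P) / n[i]."""
    return sum(c * c / ni for c, ni in zip(coefs, n))


# Hypothetical inputs: equal sample sizes, very different population sizes.
p_hat = [0.10, 0.20]
n = [500, 500]
N = [1_000_000, 100_000]

pooled = pooled_estimate(p_hat, n, N)
separate = min_variance_separate(p_hat, n)

total = sum(ni * Ni for ni, Ni in zip(n, N))
pooled_coefs = [ni * Ni / total for ni, Ni in zip(n, N)]
separate_coefs = [ni / sum(n) for ni in n]
ratio = relative_variance(pooled_coefs, n) / relative_variance(separate_coefs, n)

print("pooled estimate:  ", pooled)
print("separate estimate:", separate)
print("variance ratio (pooled / separate):", ratio)
```

Under these assumptions the pooled estimate is pulled toward the survey of the larger population, and when both surveys estimate a common rate its variance exceeds that of the minimum-variance separate estimate.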

Figure 3: Analytic study, a model-design-based view

4. Use of Health Care for Non-heterosexual Males: An Example from the Canadian Community Health Survey
Suppose that an analyst is interested in studying whether gay and bisexual men differ in their use of health care. Two surveys are proposed as data sources for the analysis, the Canadian Community Health Surveys of 2003 and 2005 (see footnote 3), since the sample sizes of gay and bisexual men are relatively small in each survey. These are independent cross-sectional surveys of the non-institutional Canadian population aged 12 and over. Both surveys contain the same following question that would identify people aged 18 to 59 who self-report as being homosexual or bisexual: Do you consider yourself to be heterosexual (sexual relations with people of the opposite sex), homosexual, that is lesbian or gay (sexual relations with people of your own sex), or bisexual (sexual relations with people of both sexes)? As well, both surveys contain the same set of socio-demographic and health-related variables that the analyst would like to use in his study. The two surveys also seem comparable with respect to other aspects that could influence results, such as sample designs, survey questions and modes of interviewing.

Footnote 3: See Béland (2002) and the Statistics Canada website for more information about these surveys and also Tjepkema (2008), for a motivating study for this example.

Since the surveys occur just two years apart, the analyst initially expects that the characteristics of his target group should be very similar at the two time points, and that he will be able to make assumptions of equality of characteristics when estimating descriptive quantities. However, when he does some initial investigation of his two data sources, he finds that, while responding sample sizes overall and of males 18-59 are quite similar at the two time points, sample sizes of gay and bisexual men are up 5% and % in 2005, as compared to 2003. As well, while estimated population size of males 18-59 is fairly steady, estimates for gay and bisexual men are up approximately 0%. (See Table 1.) Furthermore, the estimated distributions of some demographic characteristics (see Table 2) appear to differ more than what might be expected if the target groups actually are the same. In particular, there are higher estimated percentages in the older age groups and in the married/common-law category for both gay and bisexual men in 2005 and the estimated regional concentrations differ between years. Because of these observations, the analyst should suspect differences in his target groups at the two times (see footnote 4), and thus be wary of making assumptions of equality over time periods when doing his estimations.

Since the objective of the analyst is to study the use of health care in his target group, consider, now, an investigation of whether gay men differ from bisexual men in their probability of not having a regular doctor. The estimated percentages without a regular doctor in 2003 and 2005 respectively were 4 and 0 for gay men and 33 and for bisexual men. The decision is made to take a pooling approach for model estimation. The analyst prepares a data file that includes the observations from both time points and a weight variable that consists of the unmodified weights of the original surveys. Also included on the file are the additional variables required for variance estimation, which will be straightforward since the two surveys are independent. If the analyst should then fit a logistic model to his data, including just a 0/1 time indicator and a 0/1 gay/bisexual indicator, he would obtain the results illustrated in Table 3. It appears as if the probability of not having a regular doctor does differ between the two time periods but no significant difference is found between gay and bisexual men. This significant time difference would have been missed if the analyst had pooled the data and ignored the sources in his analysis.
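The pooled model fit described above (stacked records, unmodified survey weights, a 0/1 time indicator and a 0/1 gay/bisexual indicator) can be sketched as a survey-weighted logistic regression. The Python sketch below uses only the standard library and solves the weighted score equations by Newton-Raphson. Everything in it is an assumption made for illustration: the cell counts and weights are synthetic and are not the survey data, the function names are hypothetical, and no design-based variance estimation or significance testing is attempted, so it does not reproduce Table 3.

```python
import math


def solve(A, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_weighted_logistic(X, y, w, n_iter=25):
    """Newton-Raphson solution of the weighted logistic score equations."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi, wi in zip(X, y, w):
            eta = sum(b * x for b, x in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            for a in range(k):
                grad[a] += wi * (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += wi * p * (1.0 - p) * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


# Synthetic cells: (time, bisexual, n without doctor, n with doctor, weight).
cells = [
    (0, 0, 30, 70, 100.0),
    (0, 1, 35, 65, 150.0),
    (1, 0, 20, 80, 110.0),
    (1, 1, 24, 76, 160.0),
]

# Expand the cells into one pooled file of individual records.
X, y, w = [], [], []
for time, bisexual, n_yes, n_no, weight in cells:
    for outcome, count in ((1, n_yes), (0, n_no)):
        for _ in range(count):
            X.append([1.0, float(time), float(bisexual)])
            y.append(outcome)
            w.append(weight)

beta = fit_weighted_logistic(X, y, w)
print("intercept, time, gay/bisexual coefficients:", beta)
```

Keeping the time indicator in the model is what lets a difference between the two survey years show up; dropping that column from `X` corresponds to pooling the data while ignoring the source.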
Table 1: Sample sizes and population estimates from the two surveys
Columns (for each survey): Sample Size, Population Estimate
Both sexes 34,07 3,947
Males ,99 9,4,400 38,936 9,507,300
Gay 490 8, ,600
Bisexual 35 54, ,500

Footnote 4: In fact, there was a series of changes in provincial and federal legislation over the 2003 to 2005 time period that gave same-sex unions legal recognition and a number of other rights. These events might have had an impact on who was willing to self-identify as gay or bisexual.

Table 2: Age, region and marital status breakdowns of target groups
Columns: Gay men, Bisexual men
Age
Toronto/Montreal/Vancouver 49 6
Other CMA
Not CMA
Married/Common law
Previously married
Single

Table 3: Estimated logistic model
Columns: Variable, Coefficient, p-value of t-test
Intercept
Gay/Bisexual

5. Discussion
Combining of similar data from more than one survey does seem to be a possibility in a number of different situations, but often it is not appropriate. However, when appropriate, care is required in determining an approach that is suitable and has the intended properties. A number of points to keep in mind are outlined briefly below.

As noted in Section 3.1, it may be possible to estimate the same quantity by separate and pooled approaches, but the estimates themselves may not be the same. Furthermore, the best composite weighting for the separate approach estimate is different from the best survey-weight adjustment for the pooled estimate (where best means minimum variance). However, you cannot actually calculate the minimum variance separate estimator since you do not know the variances required for that estimator and frequently you cannot estimate them well if you have only small sample sizes from each survey source. Regardless of the sample sizes, however, using estimated variances in the estimates affects their mean values and variances.

A suitable way to apply the separate approach for a vector of quantities of interest (such as a vector of model coefficients) does raise questions. What composite weighting makes sense? Should each component of the vector be weighted equally? More study is required here.

Using a pooled approach when fitting models can be justified under a model-design-based view. Possible differences between the finite target populations generated by the model can be included as variables in the model and tested for significance. However, there may be an issue as to which models are suitable to be used, particularly if dealing with small sample sizes.

Pooled samples do not necessarily need to have weights adjusted. That depends on the target population to which the combined estimate refers. It is possible, for example, that the target group for the analysis actually includes units from the two or more different time points from which the samples were taken.

Frequently, an analyst wishes to include a number of different analyses in his study. He thus needs to consider whether a multipurpose pooled file can actually estimate all quantities of interest by the desired approach(es). What is optimal for the estimator of one quantity may not be optimal for others.

Statistical tests about assumptions (such as equality of a characteristic in the finite populations from which the different samples are drawn) may have little power if sample sizes are small. Other sources of information could be more valuable in deciding whether assumptions seem reasonable.

It may not be straightforward to use software tools designed for a pooled approach to produce estimates for the separate approach.

Suitable variance estimation may be difficult for either the separate or the pooled approach, especially if samples are not independently selected.

References
Béland, Y. (2002), Canadian Community Health Survey - Methodological Overview, Health Reports, Statistics Canada, Catalogue X, Ottawa, 9-14.
Binder, D. A. and G. R. Roberts (2003), Design-Based and Model-Based Methods for Estimating Model Parameters, in: R.L. Chambers and C.J. Skinner eds., Analysis of Survey Data, Wiley, Chichester.
Korn, E. L. and B. I. Graubard (1999), Analysis of Health Surveys, Wiley, New York.
Schenker, N., Gentleman, J.F., Rose, D., Hing, E., and I.M. Shimizu (2002), Combining estimates from complementary surveys: a case study using prevalence estimates from national health surveys of households and nursing homes, Public Health Reports, 117.
Schenker, N. and T.E. Raghunathan (2007), Combining Information from Multiple Surveys to Enhance Estimation of Measures of Health, Statistics in Medicine, 26.
Thomas, S. (2007), Combining Cycles of the Canadian Community Health Survey, Proceedings of Statistics Canada Symposium 2006: Methodological Issues in Measuring Population Health, Statistics Canada, Catalogue 11-522-XIE, Ottawa.
Tjepkema, M. (2008), Health Care Use Among Gay, Lesbian and Bisexual Canadians, Health Reports, 19(1), Statistics Canada, Catalogue 82-003, Ottawa.