THE ROYAL STATISTICAL SOCIETY 2009 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 8 SURVEY SAMPLING AND ESTIMATION

Size: px
Start display at page:

Download "THE ROYAL STATISTICAL SOCIETY 2009 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 8 SURVEY SAMPLING AND ESTIMATION"

Transcription

1 THE ROYAL STATISTICAL SOCIETY 009 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 8 SURVEY SAMPLING AND ESTIMATION Te Society provides tese solutions to assist candidates preparing for te examinations in future years and for te information of any oter persons using te examinations. Te solutions sould NOT be seen as "model answers". Rater, tey ave been written out in considerable detail and are intended as learning aids. Users of te solutions sould always be aware tat in many cases tere are valid alternative metods. Also, in te many cases were discussion is called for, tere may be oter valid points tat could be made. Wile every care as been taken wit te preparation of tese solutions, te Society will not be responsible for any errors or omissions. Te Society will not enter into any correspondence in respect of tese solutions. Note. In accordance wit te convention used in te Society's examination papers, te notation log denotes logaritm to base e. Logaritms to any oter base are explicitly identified, e.g. log 10. RSS 009

2 Higer Certificate, Module 8, 009. Question 1 SE p (i) For te small projects, ( ) = = (ii) (a) Te formula used in part (i) is appropriate for sampling from a very large ("infinite") population. In practice, for sampling fractions less tan about 5% te effect of ignoring te finite population correction factor (1 f ) is negligible. But for larger sampling fractions, it sould be used. In tis example, te sampling fractions in te tree strata Large, Medium, Small are 50%, 0%, 0%. Tus ignoring te finite population correction factor would be far from negligible. As can be seen, te corrected standard errors are substantially lower tan te uncorrected ones. (b) For te large projects, te corrected standard error is = (c) Te approximate 95% confidence interval is given by ( ˆ) pˆ ± SE p = 0.4 ± ( 0.063) = 0.4 ± 0.164, c i.e. it is (0.74, 0.56). [Note: is used as an approximation to 1.96, te double-tailed 5% point of N(0, 1) could of course be used.] Solution continued on next page

3 (d) We first need te stratified sampling estimate of te overall proportion, wic is pˆ st 3 1 = N ˆ p N = = ((60 0.4) + (00 0.6) + ( ) ) = = Te (estimated) variance of tis is given by Var 1 = i N ( pˆ ) N Var ( p ˆ ) st {( 60 ) ( ) ( )} (0.063) 00 (0.0693) 400 (0.0458) 1 = = = , 660 so te (estimated) standard error is So te approximate 95% confidence interval is given by 0.64 ± ( ) = 0.64 ± , i.e. it is (0.571, 0.713). (iii) Stratification divides te population into groups wic may differ from one anoter but are omogeneous witin eac group. If it is tougt tat te population does, or migt, divide in tis way, stratification is likely to be sensible. It leads to better precision for overall population estimates, and also enables te groups to be studied separately. But a possible disadvantage is tat all strata need to be visited as part of te survey and tis may be expensive (for example if geograpical regions are te strata) and run up against constraints of total cost. In te present study, it migt peraps be, say, tat te larger projects are more difficult to budget for; so stratification would be useful bot for te overall examination of te projects and for gaining information on te tree strata (sizes). Tere is no obvious drawback in terms of te overall cost of studying all of te strata.

4 Higer Certificate, Module 8, 009. Question [Solution continues on next page] A target population is te collection of units/people/businesses/ospitals to wic we want te results of a survey to apply. In teory we must sample from tat population. In practice some parts of it may be very difficult to reac, and te study population tat we actually use leaves tose parts out. As examples, people wo work "unsocial ours" are commonly not available for interviews; and remotely situated agricultural areas are often extremely expensive, in terms of bot time and money, to reac, and tis can be a major problem for government departments undertaking agricultural surveys A sampling frame may be constructed from lists of names/organisations etc tat do include any elusive or difficult to reac ones tat tere may be. Eiter we use suc a frame and select a few extra units to replace members of te original sample tat turn out not to be able to be reaced, or we use additional sources of information to indicate in advance wic are te units tat would be difficult to reac and ten exclude tem from te frame. Te "extra units" approac can cope wit problems suc as "unsocial ours", wile a case suc as remote farms can be coped wit by identifying tem troug additional information (in tis case, geograpical information) Random sampling can be tedious to carry out, especially if refusal rates of selected members are ig and/or selected members are not easy to contact. If results are required quickly, for example for a routine requirement suc as an opinion poll or a sopping-centre survey of residents' reactions to a topic of local interest, a quota sample may be adequate. Wen surveys are done regularly, any apparent very sarp canges in opinion (e.g. support for political parties) can eiter be explained or cecked against similar surveys in oter places. Quota samples do aim to cover te wole range of social/age/occupation(/etc) categories wic cannot be guaranteed in simple random sampling even toug te members of te quota sample are not cosen randomly from a well-defined population. Interviewers in quota sampling sould be properly trained and teir results scrutinised regularly Any item wic can be measured (or, sometimes, estimated) and wic may elp to explain te quantity of interest, y, tat is to be analysed is potentially useful. For example, in a social survey taking place every year, te previous year's value x of y on te same unit is sometimes a good predictor of y. As anoter example, consumption of food items or use of power (gas, electricity) or leisure spending (sport, olidays) or simply current income can all be useful in ouseold surveys. It is often reasonable to suppose tat y, te quantity of interest, varies rougly in proportion to x, te ancillary variable. An exact relationsip of tis kind would give y i = Rx i for eac pair of observations (y i, x i ) in te bivariate population, were R is te

5 population ratio of te totals (Y and X) or of te means (Y and X ), i.e. R = Y / X = Y / X. R can be estimated by te sample ratio r =Σy/ Σ x = y/ x calculated using te observations in a random sample. Suppose now tat te object of te survey is to obtain an estimate of te population total for te y variable, i.e. of Y. Assuming tat te population total X for te ancillary variable is known, we can use te "ratio estimator" Y ˆ = rx = X. Σy/ Σ x. Similarly, if we want to estimate te population mean for y, i.e. Y, we can use Y = rx = X / x y. ratio ( ) [Notice tat tis illustrates an intuitive appeal of ratio estimators: if it appens tat te sample mean x is (say) smaller tan te known value X, and bearing in mind te approximate proportionality between y and x, te estimator gives an upward adjustment of y in estimating Y.] Te proportionality relationsip leading to use of ratio estimators is, in effect, linear regression troug te origin. It often appens tat tere is a rougly linear relationsip between y and x wic does not pass troug te origin. Similar arguments lead to regression estimators for te population total and mean of te y variable. A simple example often arises in agricultural surveys: it is often te case tat tere are establised well-known relationsips between a measurement tat can be taken early in te season and te amount of te crop tat can be expected at arvest (e.g. lengt of raspberry cane and te eventual crop) Particularly in medical work, but also elsewere (e.g. educational studies), it is sometimes necessary to follow te effect of a treatment or a drug on te same group of patients for an extended period. Tis is a longitudinal survey and enables us to study te dynamic development of members of te population. In a cross-sectional survey, a study is made using a single sample at a fixed time point (or peraps over a fixed and fairly sort time period) wit te aim of investigating te caracteristics of te population at tat particular time. Often cross-sectional surveys cover widely defined populations (men, women, different ages, different etnic backgrounds, etc), wereas longitudinal surveys may peraps follow a more narrowly defined population over an extended time period. ratio [Tere are many relevant points tat could be made in response to tis question, in addition to tose set out briefly above. Oter examples are also obviously possible. In te examination, credit was given for any relevant points and for good examples.]

6 Higer Certificate, Module 8, 009. Question 3 We ignore finite population corrections in tis question as te sampling fractions are small (see solution to question 1(ii)(a) above). We use N(0, 1) rater tan a t distribution in forming te confidence intervals as te sample sizes are reasonably large. For tis question, we use 1.96 rater tan te approximation of as used in question 1, but tere can be no great objection to use of. (i) (a) Te required 95% confidence interval is given by 40 ± ( / 50), i.e. it is (37., 4.8) (ours). (b) (c) Te alf-widt of tis interval is.77 (to decimal places) ours, but it was required to be ours. To acieve a alf-widt of ours, we require a sample of size n were n is given by / n. Tis gives n (5 1.96) = So we would need an acieved sample size of 97. For all te students, te sample mean is {( ) ( ) ( )} y = / 3400 = 43.8 ours. Te underlying variance is Var N = N ( y) Var ( y ) wic we estimate by N s Var ( y ) = N n = + + = and so SE( y ) = 1.04 (ours). So te required 95% confidence interval is given by 43.8 ± ( ), i.e. it is 41.8 to 45.9 ours. Solution continued on next page

7 (ii) Proportional allocation cooses te stratum sample sizes n in te same ratio as te stratum population sizes N. For a sample of total size 00 wit te population strata sizes in te ratio 8:0:6, te n will be [= (8/34) 00], and 35.9 respectively, so we take stratum sample sizes 47, 118, 35 respectively. Optimal allocation aims to minimise, for given total sample size n, te variance of an overall population estimate. For tis, n as to be proportional to N s. Te values of tis are 1000, 0000 and 6000 respectively. So for a sample of total size 00, te n will be [= (1000/38000) 00], and respectively, so we take stratum sample sizes 63, 105, 3 respectively. As stated, optimum allocation improves te precision of overall estimates of population parameters. Tis will be important ere, as te standard deviation for te Faculty of Tecnology is considerably more tan for te oter two faculties wic optimum allocation as allowed for by specifying a larger sample in tat faculty tan would ave been te case under proportional allocation.

8 Higer Certificate, Module 8, 009. Question 4 (i) (ii) Bias is very likely. Tere are many groups in te local population wo would be unlikely to look at a local government website often, or indeed would be unlikely to look at it at all. For instance, older people do not usually spend muc (or any) time at a computer, but tese are likely to use public transport and quite unlikely to cycle. It is also possible for groups, eiter in favour or against, to organise eiter a "yes" or a "no" vote in tese circumstances, and te total response is unlikely to be large enoug to witstand tis. In sort, very little reliance, if any at all, can be placed on te figure. First, it is necessary to decide on te population of interest. It migt be all te residents in te area; or only tose subgroups wo use te present cycle track; or existing users of public transport wo migt want an improvement on te present service; and tere are many oter possibilities. It would also be necessary to consider weter tere migt be different views according to wat time of day people wis to travel; weter a factor migt be ow far tey need to travel; and issues suc as weter cyclists migt use te new tram service instead of cycling to work. A sampling frame could probably be constructed from te electors lists (but some people may ave exercised teir option of not aving teir names on te version of te list tat is available to te general public), or, since it is an official body carrying out te survey, a list of ouseolds may be used. Any information known to te City Council about te caracteristics of te populations of various areas in te city sould be considered as bases for stratification for example, te age distributions (are te residents in an area predominantly retired, predominantly young wit families, predominantly middle-aged, etc), and various indicators of socio-economic status (e.g. are tey predominantly owner-occupiers). If any suc known caracteristics are not used in constructing te sampling frame, tey sould be verified by questions on te response form. Standard questions on age etc sould in any case be included; post-oc stratification may be found to be necessary wen te results are being studied. A questionnaire to be used by interviewers visiting people at ome migt be te best metod, if resources allow. Revisits would be necessary for tose wo could not be contacted on an interviewer's first visit; te budget for te survey must allow for tese. Refusals sould be minimised by careful training of te interviewers so tat tey can explain te project fully; for example, tey sould be able to answer questions suc as te anticipated frequency and ours of operation of te tram service and ow long it would take to reac key destinations. It must be recognised tat some respondents may say tat te cange will not affect tem, and tey sould not be pressed to say yes or no to te project. Careful attention to tese matters sould minimise bias. [As wit question, credit was given in te examination for all relevant points and justifications of viewpoint. But a quota sample does not seem a proper basis ere, and answers based on quota samples gained less credit.]