THE OPTIMISATION OF SAMPLING DESIGN


UNIVERSITY OF LATVIA
FACULTY OF PHYSICS AND MATHEMATICS

Mārtiņš Liberts

THE OPTIMISATION OF SAMPLING DESIGN

SUMMARY OF DOCTORAL THESIS

Submitted for the degree of Doctor of Mathematics
Subfield of Probability Theory and Mathematical Statistics

Riga, 2013

The doctoral thesis was carried out at the Chair of Mathematical Analysis, Department of Mathematics, Faculty of Physics and Mathematics, University of Latvia from 2007 to

This work has been supported by the European Social Fund within the project «Support for Doctoral Studies at University of Latvia 2».

The thesis contains an introduction, four chapters, the main results, acknowledgements, a reference list, and an appendix.

Form of the thesis: dissertation in mathematics, in the subfield of probability theory and mathematical statistics.

Supervisor: Dr. habil. math., professor Aleksandrs Šostaks.
Adviser: Dr. math. Jānis Lapiņš.

Reviewers:
1) Jānis Valeinis, Dr. math., docent, University of Latvia;
2) Jānis Vucāns, Dr. math., professor, Ventspils University College;
3) Imbi Traat, Dr. phys.-math., senior lecturer, University of Tartu.

The thesis will be defended at the public session of the Doctoral Committee of Mathematics, University of Latvia, on 10 May 2013, at the University of Latvia, Faculty of Physics and Mathematics, Zeļļu street 8, room 233, Riga.

The thesis is available at the Multi-branched Library: Computer Science, Law, and Theology of the Library of the University of Latvia, Raiņa blvd. 19, Riga.

This thesis is accepted for the commencement of the degree of Doctor of Mathematics on 19 March 2013 by the Doctoral Committee of Mathematics, University of Latvia.

Chairman of the Doctoral Committee /Andris Buiķis/
Secretary of the Doctoral Committee /Jānis Cepītis/

University of Latvia, 2013
ISBN
Mārtiņš Liberts, 2013

Abstract

The aim of sample surveys is to obtain sufficiently precise estimates of population parameters at low cost. The expected precision of estimates and the expected data collection cost are usually unknown, which makes the choice of sampling design a complicated task. Analytical methods often cannot be used because of the complexity of the sampling design or of the data collection process. The aim of this thesis is to develop a mathematical framework for comparing sampling designs of interest with respect to their expected precision of estimates and data collection cost. As a result, a framework is developed which employs artificial population data generation, survey sampling techniques, survey cost modelling, Monte Carlo simulation experiments and other techniques. The framework is applied to analyse the cost efficiency of the Labour Force Survey.

Key words: cost efficiency; simulation study; survey cost estimation; survey methodology; variance estimation.

Mathematics Subject Classification (2010): 62D05.

Contents

General Research Description
1 Summary of the Doctoral Thesis
    Population and Parameters of Interest
    Theoretical Model of Population
    Parameters of the Population
    Redesign of the Two-Stage Sampling Design
    Generation of Artificial Population Data
    Static Population
    Dynamic Population
    Development and Application of the Framework
    Choice of Sampling Designs
    Cost Function
    Survey Budget
    Parameters of the Alternative Sampling Designs
    Population Parameters and Variance Estimation
    The Most Cost Effective Sampling Design
    The Main Research Results
    Discussion and Interpretation
Conclusions and Proposals
Acknowledgements
References

General Research Description

Actuality of the Topic and Research Novelty

The inspiration for the thesis comes from practical necessity. National Statistical Institutes (NSIs) are the main providers of official statistics in most countries. A large proportion of the official statistics produced by NSIs is based on data collected via sample surveys, and the main customer of official statistics is the general public (in other words, taxpayers). These days, cost efficiency is an essential consideration in all government spending; the question is, are NSI sample surveys cost efficient?

There is no simple answer to this question. A sample survey can use one of many different sampling designs. The simplest sampling designs do not necessarily provide the lowest data collection cost. More complex sampling designs are considered in theory and applied in practice to obtain statistical information with acceptable precision at a lower cost.

In designing a sample survey, the following questions have to be answered:
- What is the expected precision of the estimates of population parameters?
- What is the expected data collection cost?
- Which sampling design should be chosen in order to minimise sampling errors under a fixed data collection cost?

These questions are commonly asked during the planning stage of a sample survey. In most cases they cannot be answered by analytical means, and NSIs usually rely on expert judgement to some extent.

The relation between the precision of estimates and survey cost has been discussed in the literature for at least 70 years, though the topic has not been comprehensively addressed. Different aspects of the relationship have been analysed and different goals of analysis have been set by authors, but a common foundation for the topic is lacking. Among the first papers devoted to the topic are those by Mahalanobis (1940) and Jessen (1942). The topic is extensively discussed by Hansen, Hurwitz, and Madow (1953) and Kish (1965). A significant book on the topic is by Groves (1989), who advocates simulation studies as the best-suited approach to sample design analysis because cost and precision functions are usually complex.

The study of survey field operations is a new topic in the scope of statistical research. Several research activities have been devoted to the topic only recently (Chen, 2008; Cox, 2012). Several events devoted to survey cost estimation and simulation models for survey fieldwork operations have recently been organised in the United States of America, for example the Survey Cost Workshop (2006, Washington, D.C.) and the Workshop on Microsimulation Models for Surveys (2011, Washington, D.C.).

Some general conclusions can be drawn from this literature review. In general, the total price of a survey in which data are collected directly from respondents is increasing. There are several reasons for the increase, but one significant reason is the decline in response rates. Today, either much more effort is needed to increase the cost efficiency of surveys, or a higher price must be paid to produce statistics of the same quality as in times when non-response was not such a big problem. However, given the current economic climate, in most cases it is simply not possible to spend more, since most government budgets for surveys are being reduced or, at best, kept the same as in the previous year. It is clear, therefore, that increased cost efficiency is crucial to maintain the production of high quality statistics under a decreasing or fixed budget. For most of the time since survey sampling emerged as a methodology, problems with non-response and budget constraints were not encountered as often as they are now. This is one of the main reasons why survey cost efficiency has not been a very important research topic until recently.

Another conclusion is that simulation experiments are receiving more and more attention as a tool for designing production systems for official statistics. The expansion of the method is possible because cheap computing power is now available: even a desktop or a laptop computer can be set up to run large-scale simulation experiments.

Aim and Tasks of the Research

The goal of this thesis is to develop a framework which can be used to compare arbitrary sampling designs by their cost efficiency. The framework should be used to analyse selected sampling designs and to determine the sampling design that leads to the highest overall precision of estimates under a fixed survey budget. The following tasks are set to achieve the goal:
1. to update the frame of primary sampling units used for the Latvian Labour Force Survey (LFS),
2. to develop a sampling design suitable for the LFS,
3. to create artificial population data representing the statistical characteristics of the target population of the LFS, so that they are usable for sample design simulation experiments,
4. to compare the sampling design of the LFS with alternative sampling designs with respect to cost efficiency using the developed framework,
5. to provide recommendations for the choice of the LFS sampling design with respect to cost efficiency.

Research Topics

Sampling design methodology is the topic of the first chapter. It includes the study of survey aims, research of population data, preparation of the population frame, development of a sampling design and practical implementation of the developed sampling design.

The generation of artificial population data is the research topic of the second chapter. It includes the development of a theoretical population model, the estimation of the population model parameters, the application of the model to the generation of artificial population data and the assessment of whether the artificial data are fit for purpose.

The precision of population parameter estimates from sample data, the estimation of survey cost and simulation experiments of sampling designs are the topics studied in the third chapter.

The practical application of the developed framework, the analysis of survey cost for several sampling designs and the choice of the most cost effective sampling design are the topics of the fourth chapter.

Description of the Methods

Different mathematical methods are used in the thesis. Survey sampling methodology is used for updating the population frame, developing a sampling design for the LFS, constructing estimators for population parameters and deriving formulas for the calculation of variance estimates. Random data imputation is used to match register data with sample survey data. A Markov chain model is used to generate dynamic population data. Survey sample data processing methods are used to estimate the transition matrices of the Markov chain models. The cost of survey fieldwork is modelled. The travelling distance of interviewers is computed by an algorithm for solving the travelling salesman problem. Monte Carlo simulation experiments are used extensively to estimate the expected fieldwork cost under different sample designs and to estimate the precision of parameter estimates in the case of the two-stage sampling design. Methods of mathematical statistics are used to process the results of the Monte Carlo simulation experiments. Hypothesis testing is used to compare the variance of estimates under different sampling designs.

Approbation of the Results

The results of the thesis are published in three scientific publications (Liberts, 2010a, 2010b, 2013a). The last of these has been submitted for publication in the journal Statistics in Transition new series.

The research results have been presented at seven scientific conferences. The results have been published as conference abstracts at four of them:
- The 66th Scientific Conference of the University of Latvia, Riga (Latvia), 2008,
- The 68th Scientific Conference of the University of Latvia, Riga (Latvia), 2010,
- The 8th Latvian Mathematical Conference, Valmiera (Latvia), 2010, abstract "Self-rotating sampling design",

- The 10th International Vilnius Conference on Probability Theory and Mathematical Statistics, Vilnius (Lithuania), 2010, abstract "Self-rotating sampling design",
- The Third Baltic-Nordic Conference in Survey Statistics (BaNoCoSS), Norrfällsviken (Sweden), 2011, abstract "Simulation study of sampling design in Labour Force Survey",
- The 24th Nordic Conference in Mathematical Statistics (Nordstat), Umeå (Sweden), 2012, abstract "Survey design analysis regarding cost efficiency",
- The 71st Scientific Conference of the University of Latvia, Riga (Latvia).

The results have also been presented at three international workshops, and have been published as workshop abstracts at two of them:
- Workshop on Labour Force Survey Methodology, Paris (France), 2010,
- BNU Workshop on Survey Sampling Theory and Methodology, Vilnius (Lithuania), 2010, abstract "Weighting and estimation in household surveys with rotating panel",
- Workshop of Baltic-Nordic-Ukrainian Network on Survey Statistics, Valmiera (Latvia), 2012, abstract "The simulation study of survey cost and precision".

The thesis results have also been presented at three university seminars:
- Weekly seminar, Institute of Mathematical Statistics, Tartu (Estonia), 2010,
- Joint statistical seminar at Umeå University, Umeå (Sweden), 2011,
- Scientific Workshop in Mathematical Statistics at the University of Latvia, Riga (Latvia).

The doctoral thesis consists of abstracts in English and Latvian, a nomenclature, an introduction, the list of publications, the list of presentations, four chapters, the main results, acknowledgements, the list of references, and an appendix. The first chapter of the thesis is devoted to the redesign of the two-stage sampling design, the second chapter to the generation of artificial population data, the third chapter to the development of the framework for sample survey cost efficiency analysis, and finally the fourth chapter is devoted to the application of the framework in the case

of LFS. The length of the thesis is 85 pages, the length of the appendix is 25 pages.

1 Summary of the Doctoral Thesis

1.1 Population and Parameters of Interest

The target population of the Latvian Labour Force Survey (LFS) consists of all residents of Latvia living in a private household during the reference week. The main population domain under study is the working-age population (individuals aged 15 to 74). The definitions of dwelling and household are taken from the Central Statistical Bureau of Latvia:

Dwelling: One or several residential rooms envisaged for permanent residence (residential house, apartment in multi-dwelling building, room in communal apartment, etc.). Usually a dwelling has a definite address.

Private household (hereafter called a household): Several persons living in one dwelling and sharing expenditures, or one person having separate housekeeping.

It is assumed in this research that no dwelling contains more than one household.

The target population parameters are dynamic over time. For example, an individual can gain or lose a job at any time. The target population of the LFS is observed on a weekly basis even though it changes with higher frequency. The choice of weekly observations is a good compromise between the precision of estimates and the practical realisation of the survey. A week is a period of seven days which starts on Monday and ends on Sunday. The survey data are collected about the situation on Sunday to decrease the impact of measurement errors on survey estimates: Sunday is the day of the week on which a change of economic activity status for any individual is less likely than on other days.

The LFS is organised in 33 participating countries (the 27 Member States of the European Union, Croatia, Iceland, Norway, Switzerland, the former Yugoslav Republic of Macedonia, and Turkey) using a mostly harmonised methodology (European Commission, 2012a, 2012b). The harmonised survey methodology makes it possible to produce employment statistics that are comparable between all 33 participating countries.

Under the harmonised LFS methodology, a quarter is a time period consisting of 13 consecutive weeks. It is assumed that there are 52 weeks in every year and that a year is divided into four quarters of 13 weeks each. The year to which a week belongs is determined by the year of the week's Thursday. For example, the first week of 2013 is the week starting on 31st December 2012, because the Thursday of this week (3rd January) is the first Thursday of 2013. The estimates of quarterly LFS parameters are studied in this thesis.

1.1.1 Theoretical Model of Population

An individual is denoted by $v_i$. The set of all individuals is denoted by $V$. The size of $V$ is $M$. The individuals of the set $V$ are sorted in a specified order and labelled by numbers from 1 to $M$:

$$V = \{v_1, v_2, \ldots, v_i, \ldots, v_M\}.$$

The composition of $V$ is assumed to be constant over time: the set $V$ consists of the same individuals at any time point. A value $y_i$ is assigned to each individual $v_i$ from $V$. The values $y_i$ change over time.

Assume the individuals of $V$ are observed on a weekly basis. The observation of individual $v_i$ in week $w$ is denoted by $u_{i,w}$. A value $y_{i,w}$ is assigned to each observation. The total of week $w$ is defined as

$$Y_w = \sum_{i=1}^{M} y_{i,w}.$$

The set of observations in week $w$ is denoted by $U_w$. The size of $U_w$ is equal to the size of the set $V$, that is, to $M$:

$$U_w = \{u_{1,w}, u_{2,w}, \ldots, u_{i,w}, \ldots, u_{M,w}\}.$$
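The calendar convention stated earlier in this section (weeks run from Monday to Sunday, a week belongs to the year of its Thursday, and a quarter consists of 13 consecutive weeks) can be illustrated with a short Python sketch. The function names `lfs_week_year` and `quarter_of_week` are hypothetical and are not taken from the thesis; the sketch simply follows the 52-week assumption made in the text and does not say how a 53rd calendar week would be handled.

```python
from datetime import date, timedelta


def lfs_week_year(day: date) -> int:
    """Year of the Monday-to-Sunday week containing `day`,
    defined as the year of that week's Thursday."""
    monday = day - timedelta(days=day.weekday())  # weekday(): Monday == 0
    thursday = monday + timedelta(days=3)
    return thursday.year


def quarter_of_week(week_number: int) -> int:
    """Quarter (1 to 4) of a week numbered 1 to 52, with 13 weeks per quarter."""
    if not 1 <= week_number <= 52:
        raise ValueError("the text assumes exactly 52 weeks per year")
    return (week_number - 1) // 13 + 1


# The week starting on Monday 31 December 2012 has its Thursday on 3 January 2013.
print(lfs_week_year(date(2012, 12, 31)))
print(quarter_of_week(13), quarter_of_week(14))
```

The Thursday rule is the same one used by ISO 8601 week numbering, so `date.isocalendar()` gives the same year for any given day.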

Assume the individuals of $V$ are observed for $W$ consecutive weeks. The union of weekly observations is denoted as a set $U$:

$$U = \bigcup_{w=1}^{W} U_w.$$

The size of $U$ is denoted as $N = MW$. The total of $W$ weeks is defined as

$$Y = \sum_{w=1}^{W} Y_w = \sum_{w=1}^{W} \sum_{i=1}^{M} y_{i,w}.$$

An example of $U$ is given in Table 1. The rows of the table represent the $M$ individuals, the columns represent the $W$ weeks, and the cells represent observations of individuals. The dimension of the table is $M \times W$.

Table 1: Example of the set U

 i    w = 1    w = 2    w = 3    w = 4    w = 5    ...    w = W
 1    u_1,1    u_1,2    u_1,3    u_1,4    u_1,5    ...    u_1,W
 2    u_2,1    u_2,2    u_2,3    u_2,4    u_2,5    ...    u_2,W
 3    u_3,1    u_3,2    u_3,3    u_3,4    u_3,5    ...    u_3,W
 ...
 M    u_M,1    u_M,2    u_M,3    u_M,4    u_M,5    ...    u_M,W

1.1.2 Parameters of the Population

Two types of population parameters are studied in the thesis: the quarterly average of weekly totals and the quarterly ratio of two totals. The quarterly average of weekly totals (over 13 weeks) is defined as

$$Y_q = \frac{1}{13} \sum_{w=1}^{13} Y_w = \frac{1}{13} \sum_{w=1}^{13} \sum_{i=1}^{M} y_{i,w} = \frac{1}{13} \sum_{k=1}^{N} y_k = \frac{1}{13} Y,$$

and the quarterly ratio of two totals is defined as

$$R_q = \frac{Y_q}{Z_q} = \frac{\sum_{w=1}^{13} Y_w}{\sum_{w=1}^{13} Z_w} = \frac{\sum_{w=1}^{13} \sum_{i=1}^{M} y_{i,w}}{\sum_{w=1}^{13} \sum_{i=1}^{M} z_{i,w}} = \frac{\sum_{k=1}^{N} y_k}{\sum_{k=1}^{N} z_k},$$

where $k$ runs over the $N$ observations of $U$ (here $W = 13$).

All elements of $U$ have to be observed to compute $Y_q$ or $R_q$, which is impractical. The alternative approach is to estimate $Y_q$ or $R_q$ using a probability

sample of observations. The estimators for $Y_q$ and $R_q$ can be constructed using the π estimator (Särndal, Swensson, & Wretman, 1992, p. 42, 176) as

$$\hat{Y}_q = \frac{1}{13} \sum_{(i,w) \in s} \frac{y_{i,w}}{\pi_{i,w}},$$

$$\hat{R}_q = \frac{\sum_{(i,w) \in s} y_{i,w} / \pi_{i,w}}{\sum_{(i,w) \in s} z_{i,w} / \pi_{i,w}},$$

where $s$ is a probability sample of observations from $U$, $s \subset U$, $y_{i,w}$ and $z_{i,w}$ are the values assigned to observation $u_{i,w}$, and $\pi_{i,w}$ is the probability that observation $u_{i,w}$ is included in the sample $s$.

For example, let $y_{i,w}$ be defined as a binary variable:

$$y_{i,w} = \begin{cases} 1, & \text{if individual } v_i \text{ is employed in week } w, \\ 0, & \text{if individual } v_i \text{ is not employed in week } w. \end{cases}$$

In this case $Y_w$ is the number of employed individuals in week $w$, and $Y_q$ is the average number of employed individuals over 13 weeks.

The theoretical population model is used in this thesis to describe the target population and the parameters of interest in the case of the LFS. The population model introduced above can be used to describe the population of households as well; in that case $v_i$ refers to a household and $V$ to the set of all households.

1.2 Redesign of the Two-Stage Sampling Design

The author of the thesis redesigned the sampling design used for the Latvian Labour Force Survey (LFS). The LFS is a survey conducted regularly by the Central Statistical Bureau of Latvia. The redesigned sampling design was introduced in practice in 2010 and is still in use.

The research started with an analysis of the organisation of the LFS. Based on the results of the analysis, it was decided that the population frame of the primary sampling units (PSUs) (areas) had to be updated. The following tasks were carried out to update the frame. A PSU code was assigned to all dwellings in the Statistical Household Register (there were dwellings without a PSU code before). The PSUs were redesigned if the

size of PSUs (measured by the number of dwellings) was not appropriate for sampling.

The redesign of the sampling design and the selection of a new PSU sample were started after the update of the PSU frame. The resulting sampling design is a two-stage probability sample design. Areas are the first-stage units. Areas are sampled by stratified systematic sampling with the sampling probabilities of areas proportional to the size of the areas (measured by the number of dwellings). The areas are stratified by the level of urbanisation. There are four strata: Riga (the capital of Latvia), other cities, towns, and rural areas. The sampling is done with a random starting point in each stratum. At the second sampling stage the sampling units are dwellings. An equal number of dwellings is sampled by simple random sampling from each sampled area in a stratum (the number of dwellings sampled per area can differ between strata). All individuals from a sampled dwelling are included in the sample. The sampling probabilities of all individuals are positive and equal within each stratum. It is possible to achieve asymptotically unbiased estimates of population parameters under the sampling design developed.

Some properties of the developed sampling design are given here. The design makes it possible to organise rotating panel surveys. This is an important requirement for the LFS: the Latvian LFS is a rotating panel survey with rotation scheme 2-(2)-2 (European Commission, 2012b, p. 7). The design provides easy management of the sampled PSUs in practice. The sample of areas can be generated for several years ahead, allowing timely planning of the workload of interviewers. It is not necessary to make any corrections to the sample of areas while it is in use. The sampling design is suitable for the application of variance estimation methods based on re-sampling, for example dependent random groups or the jackknife (Wolter, 2007).
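Why the inclusion probabilities come out equal within a stratum can be seen from a minimal Python sketch of the two stages for a single stratum. It is an illustration under stated assumptions, not the procedure used for the LFS: the area sizes are invented, the function names `systematic_pps` and `two_stage_sample` are hypothetical, and every area is assumed to be smaller than the sampling step so that no area is selected twice. With $n$ areas selected proportionally to size and $m$ dwellings taken from each, a dwelling in an area of size $N_a$ has inclusion probability $(n N_a / N)(m / N_a) = nm/N$, which does not depend on the area.

```python
import random


def systematic_pps(sizes, n, rng):
    """Select n units by systematic sampling with probability proportional
    to size and a random start. Assumes every size is below the step."""
    total = sum(sizes)
    step = total / n
    start = rng.uniform(0, step)
    selected, cum, idx = [], 0.0, 0
    for j in range(n):
        point = start + j * step
        # Move to the unit whose cumulated size interval contains the point.
        while idx < len(sizes) - 1 and cum + sizes[idx] <= point:
            cum += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


def two_stage_sample(area_sizes, n_areas, m_dwellings, rng):
    """Return (area, dwelling, inclusion probability) triples for one stratum."""
    total = sum(area_sizes)
    sample = []
    for a in systematic_pps(area_sizes, n_areas, rng):
        pi_area = n_areas * area_sizes[a] / total      # first stage, PPS
        pi_within = m_dwellings / area_sizes[a]        # second stage, SRS
        for d in rng.sample(range(area_sizes[a]), m_dwellings):
            sample.append((a, d, pi_area * pi_within))
    return sample


# Hypothetical stratum: 8 areas with 1000 dwellings in total.
area_sizes = [120, 80, 200, 150, 100, 90, 160, 100]
sample = two_stage_sample(area_sizes, n_areas=4, m_dwellings=10,
                          rng=random.Random(1))

# Every sampled dwelling has the same inclusion probability, 4 * 10 / 1000,
# so the pi estimator of the number of dwellings recovers the stratum size.
estimated_dwellings = sum(1 / pi for _, _, pi in sample)
print(len(sample), estimated_dwellings)
```

Because all dwellings in the stratum share one inclusion probability, the π estimator reduces to a simple expansion of the sample total within each stratum.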
A new sample of areas was selected after the redesign of the sampling design. The selected area sample can be used simultaneously for several sample surveys. The coordination of samples for three continuous surveys (LFS, Household Budget Survey (HBS), Survey of Domestic Travellers (SDT)) is incorporated in the design. The coordination of the three samples makes it possible to achieve a lower total fieldwork cost for the three surveys compared to

non-coordinated samples. The sample of areas is also usable for one-time surveys. For example, the selected area sample has been used in Latvia for the European Health and Social Integration Survey (EHSIS). The area sample can be used for household sampling (for example, LFS, HBS, SDT) or individual sampling (for example, EHSIS).

1.3 Generation of Artificial Population Data

Individual data representing the target population are necessary to carry out Monte Carlo simulation experiments with sampling designs. The next task of the thesis is to generate artificial population data. The artificial data have to be similar to the real Latvian population data of working-age individuals in their statistical properties at the macro level. The artificial population data are created as two data files. The first data file represents a static population (corresponding to the set of individuals $V$ at some specific time point). The second data file represents a dynamic population (corresponding to the set of observations $U$).

1.3.1 Static Population

Data from the Statistical Household Register (SHR) and the LFS are used to generate the static population. The SHR is a statistical register maintained by the Central Statistical Bureau of Latvia. The SHR data are used to create the population frame: the list of individuals with demographic and residence information attached. There are working-age individuals in the population frame. LFS sample data are used as the data representing the economic activity of individuals.

Data from both sources are matched by random imputation, where the recipients are units from the register data and the donors are units from the LFS data. Random imputation within classes (United Nations, 2010) is used as the imputation method. The imputation classes are created in both data sets using demographic and geographic information according to the same specification. Imputation in class $c$ is done independently from other classes.
Donor $d_k \in D_c$ is matched with recipient $r_i \in R_c$ with probability $1 / |D_c|$ if $|D_c| \geq 10$, where $D_c$ is the set of donors in class $c$, $R_c$ is the set of recipients in class $c$, and $|D_c|$ is the number of donors in class $c$. A donor $d_k \in D_c$ can be matched with several recipients from the set $R_c$. Imputation is not done in class $c$ if $0 \leq |D_c| < 10$. If recipient $r_i$ is matched with donor $d_k$, $r_i$ receives the economic activity information of $d_k$.

Imputation is done in seven steps, where the imputation units are households in the first five steps and individuals in the last two steps. Imputing whole households in the first five steps keeps the demographic and economic composition of households the same as observed in the survey data. The specification of the classes in the first five steps is hierarchical: it is the most detailed in the first step, and classes are merged at each succeeding step. Economic activity status is imputed for 82.2% of all individuals from the register data in the first five steps. The imputation cannot be done for all individuals in this manner because there are classes of households in the register data which have not been observed in the survey data or have been observed only in a few cases (fewer than 10).

Economic activity status is imputed for the remaining individuals in the last two steps with the same imputation technique, except that the imputation units are individuals and a different specification of classes is used. The specification of the classes here is based on the same auxiliary information as in the first five steps, though it is used at the individual level rather than at the household level. The specification of the classes is hierarchical here as well. Economic activity status can be imputed for all remaining individuals in the last two imputation steps.

Artificial population data created in this manner obviously cannot provide precise information at the level of individuals. However, random imputation within classes (geographic and demographic groups) yields artificial population data that are similar to the real population data with respect to population parameters at the macro level.

Static artificial population data have been generated representing individuals. The artificial population data provide a good representation of the real population data at the macro level at one specific time point.
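The matching rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of the thesis: units are individuals throughout (the thesis imputes whole households in the first five steps), only two hierarchical steps are shown instead of seven, and the field names `region`, `age` and `status` as well as the function name `impute_step` are hypothetical. The threshold of 10 donors and the equal donor probability $1/|D_c|$ with reuse of donors follow the text.

```python
import random
from collections import defaultdict

MIN_DONORS = 10  # classes with fewer donors are skipped, as in the text


def impute_step(recipients, donors, class_key, rng, min_donors=MIN_DONORS):
    """Impute 'status' for recipients that still lack it, within classes."""
    pools = defaultdict(list)
    for d in donors:
        pools[class_key(d)].append(d)
    for r in recipients:
        if r["status"] is not None:
            continue  # already imputed in an earlier, more detailed step
        pool = pools.get(class_key(r), [])
        if len(pool) >= min_donors:
            # Each donor is chosen with probability 1/|D_c|; donors are reusable.
            r["status"] = rng.choice(pool)["status"]


# Hypothetical donors (survey data): 12 in one class, 5 in another.
donors = [{"region": "Riga", "age": "15-24", "status": 1 + i % 3} for i in range(12)]
donors += [{"region": "Riga", "age": "25-34", "status": 1} for _ in range(5)]

# Hypothetical recipients (register data) with unknown status.
recipients = [
    {"region": "Riga", "age": "15-24", "status": None},
    {"region": "Riga", "age": "25-34", "status": None},
]

rng = random.Random(7)
# Step 1: detailed classes (region and age group).
impute_step(recipients, donors, lambda u: (u["region"], u["age"]), rng)
after_step_1 = [r["status"] for r in recipients]
# Step 2: merged classes (region only) for recipients still without a status.
impute_step(recipients, donors, lambda u: u["region"], rng)
print(after_step_1, [r["status"] for r in recipients])
```

The second recipient is left untouched in the first step because its class has only five donors, and receives a value once the classes are merged, which mirrors the hierarchical merging of classes described in the text.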

1.3.2 Dynamic Population

Dynamic population data are generated from the static population data. The dynamic population data represent a population with variables changing over time at weekly intervals. The dynamic population data are generated for the variable "Economic activity status", which is the main study variable in the LFS. For each individual there are three possible values of the variable: 1 (employed), 2 (unemployed), 3 (economically inactive).

The dynamic population data are generated under the assumption that changes of individual economic activity status can be described by a finite Markov chain (Carkova, 2001). A time-inhomogeneous Markov chain has to be chosen because of the seasonality in changes of economic activity. The state space of the Markov chain consists of three states representing the three possible values of individual economic activity. The steps of the Markov chain represent weeks. A different transition matrix is used for each seasonal quarter. The quarterly transition matrices are determined according to the changes of individual economic activity observed between consecutive quarters in the LFS. The one-step (weekly) transition matrices are computed under the assumption that the 13 weekly transition matrices within one quarter are all equal.

The initial state of the Markov chain for each individual is set equal to the economic activity status in the static population data. The state of the Markov chain for each individual at each step is generated using the state at the previous step and one of the four weekly transition matrices. The transition matrix to be used at each step is chosen according to the week number. The dynamic population data are generated for several weeks. The dynamic population data represent the changes in a population variable observed at intervals of one week.

1.4 Development and Application of the Framework

Assume an arbitrary population parameter $\theta$. A probability sample $s_p$ is drawn by a sampling design $p(s)$.
The parameter $\theta$ is estimated by an estimator $\hat{\theta}_p$. The variance of $\hat{\theta}_p$ is denoted by $\mathrm{Var}_p(\hat{\theta}_p)$. A cost function is denoted by $c(s_p)$. The fieldwork cost of a sample $s_p$ is computed by the cost function, $c_p = c(s_p)$. The result of the cost function is a random variable because $s_p$ is a random sample. The expectation of $c_p$ under a sampling design $p(s)$ is denoted by $E(c_p) = C_p$. Definition 1 is used to compare two sampling designs with respect to cost efficiency, where $\gamma$ is the available survey budget.

Definition 1. Sampling design $p(s)$ is more cost efficient than sampling design $q(s)$ for the estimation of population parameter $\theta$ with a survey budget $\gamma$ if

$$\mathrm{Var}_p(\hat{\theta}_p \mid C_p \approx \gamma) < \mathrm{Var}_q(\hat{\theta}_q \mid C_q \approx \gamma).$$

The parameter $\gamma$ in Definition 1 can be replaced by a vector $\boldsymbol{\gamma}$ denoting the budget allocation by operational domains. The expectation of the survey cost then has to be expressed as a vector $\mathbf{C}_p$, whose components describe the expected fieldwork cost in each operational domain. Specifying the budget as a vector is useful in practice if the allocation of the budget by operational domains is important. For example, a survey organiser may have a fixed interviewer network allocated by regions (domains). This implies fixed human resources, expressed as $\boldsymbol{\gamma}$, that are available to the survey organiser. Definition 1 will be used to compare different sampling designs by cost efficiency.

The application of the framework for sample design cost efficiency analysis consists of the following steps:
- selection of the sampling designs to be analysed,
- definition of a cost function $c(s)$,
- setting the total budget $\gamma$ or a budget allocation $\boldsymbol{\gamma}$,
- setting the sample design parameters of each chosen sample design so that the expected total cost or cost allocation of every design is approximately equal to $\gamma$ or $\boldsymbol{\gamma}$ respectively,
- selection of the population parameters for the analysis,
- calculation of the variance of the estimators of the selected parameters,
- determination of the most cost efficient sample design using Definition 1.
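The steps listed above can be illustrated with a toy Monte Carlo experiment in Python. Everything in it is hypothetical and chosen only to keep the sketch short: the clustered population, the two designs (a simple random sample of individuals and a sample of whole clusters), the unit costs, the budget of 300 and the 5% tolerance used to read "approximately equal to the budget". The helper names `monte_carlo` and `more_cost_efficient` are likewise assumptions and do not come from the thesis procedures.

```python
import random
import statistics


def monte_carlo(draw_sample, estimator, cost, iterations, rng):
    """Return (variance of the estimates, mean cost) over repeated samples."""
    estimates, costs = [], []
    for _ in range(iterations):
        sample = draw_sample(rng)
        estimates.append(estimator(sample))
        costs.append(cost(sample))
    return statistics.pvariance(estimates), statistics.fmean(costs)


def more_cost_efficient(result_p, result_q, budget, tolerance=0.05):
    """Definition 1: design p beats design q if its variance is smaller,
    given that both expected costs are approximately equal to the budget."""
    (var_p, cost_p), (var_q, cost_q) = result_p, result_q
    for c in (cost_p, cost_q):
        if abs(c - budget) > tolerance * budget:
            raise ValueError("expected cost is not close to the budget")
    return var_p < var_q


# Toy population: 100 clusters of 10 individuals with a binary study variable.
rng = random.Random(2013)
population = []
for _ in range(100):
    rate = rng.uniform(0.2, 0.8)  # cluster-specific rate of y = 1
    population.append([1 if rng.random() < rate else 0 for _ in range(10)])
individuals = [y for cluster in population for y in cluster]

BUDGET = 300.0

# Design p: 150 individuals by simple random sampling, cost 2 per individual.
result_p = monte_carlo(
    lambda r: r.sample(individuals, 150),
    lambda s: len(individuals) / len(s) * sum(s),
    lambda s: 2.0 * len(s),
    2000, rng)

# Design q: 20 whole clusters, cost 5 per cluster plus 1 per individual.
result_q = monte_carlo(
    lambda r: r.sample(population, 20),
    lambda s: len(population) / len(s) * sum(sum(c) for c in s),
    lambda s: sum(5.0 + 1.0 * len(c) for c in s),
    2000, rng)

print(result_p, result_q, more_cost_efficient(result_p, result_q, BUDGET))
```

Both designs are tuned to the same expected cost before their variances are compared, which is the point of fixing the budget in Definition 1. Which of the two wins depends on how strongly the study variable is clustered and on the relative unit costs.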
A set of procedures was created during the development of the cost efficiency analysis framework. The procedures allow sampling designs to be simulated by Monte Carlo experiments. They are written in R, a free environment for statistical computing (R Core Team, 2013). The computational complexity arises because the computations have to be carried out on large data sets. All procedures are available in the appendix of the thesis and online (Liberts, 2013b).

Choice of Sampling Designs

Two sampling designs, stratified simple random sampling of individuals and stratified simple random sampling of households, are chosen as alternatives to be compared with the two-stage sampling design with regard to cost efficiency. Both designs are modified to meet the requirements of the LFS: sampled units are distributed evenly over weeks, and a population unit cannot be sampled more than once during a quarter. Formulas for the variance of $\hat{Y}_q$ and the approximate variance of $\hat{R}_q$ under both alternative sampling designs are developed within the thesis.

The choice of the alternative sampling designs is justified as follows. They are among the simplest sampling designs, and their population parameter estimators and variance estimators can be expressed in analytical form. In addition, the cluster effect for these designs is noticeably lower than for multi-stage sampling designs. A lower cluster effect makes it possible to achieve population parameter estimates of equivalent precision with a smaller sample size than under multi-stage sampling designs. The chosen alternative sampling designs could therefore be more cost efficient than the currently used two-stage sampling design in the case of the LFS.

Cost Function

There are two components in the fieldwork cost function: travel cost and interview cost. The cost function is constructed under the following assumptions: all interviews are conducted face to face; interviewers travel by passenger car; all sampled units take part in the survey (the response rate is 100%); and an interviewer visits all respondents of a weekly sample in a single trip along the shortest path.

The travel cost is estimated by the function

$$c_1(s) = d K_f C_f k_d,$$

where $d$ is the length of the path travelled by the interviewers to visit all sampled units, $K_f$ is the average fuel consumption, $C_f$ is the average fuel price, and $k_d$ is an adjustment coefficient specified by a statistician.

There are $G$ interviewers available, and an interviewer is assigned to each unit in the population. The units sampled for week $w$ are split among the interviewers according to this predefined assignment. Geographical coordinates are known for the sampled units and for the interviewers' places of residence. The distance between any two points is computed as the Euclidean distance. The shortest path connecting the residence of interviewer $g$ and the sampled units assigned to interviewer $g$ is found by solving a travelling salesperson problem (TSP). The TSP is solved by the nearest insertion algorithm (Rosenkrantz, Stearns, & Lewis, 1977, p. 572). The total travel distance is computed as

$$d = \sum_{g=1}^{G} \sum_{w=1}^{W} d_{g,w},$$

where $W$ is the total number of weeks observed and $d_{g,w}$ is the length of the path found by solving the TSP for interviewer $g$ in week $w$. The constants $K_f$, $C_f$ and $k_d$ are set according to the information available.

The interview cost is computed by the function

$$c_2(s) = a C_a + b C_b,$$

where $a$ is the total number of individuals in the sample $s$, $b$ is the total number of households in the sample $s$, $C_a$ is the interview cost of an individual questionnaire, and $C_b$ is the interview cost of a household questionnaire.

Survey Budget

The survey budget is set equal to the fieldwork budget necessary to run the LFS for one quarter under the current two-stage sampling design. The budget is allocated across three operational domains: Riga, Cities, and Towns and rural areas.
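The fieldwork cost function on which the budget relies (travel cost plus interview cost) can be illustrated with a short sketch. It is written in Python for illustration only and is not the R code of the thesis. The function names, the data layout, the closed tour that returns to the interviewer's residence, and the numeric constants in the usage lines are all assumptions made for this example.

```python
import math


def nearest_insertion_tour(points):
    """Build a closed tour over `points` with the nearest insertion heuristic.

    points[0] is taken to be the interviewer's residence and the remaining
    entries the sampled units. Returns the visiting order as a list of indices.
    """
    if not points:
        return []

    def dist(i, j):
        return math.dist(points[i], points[j])  # Euclidean distance

    tour = [0]
    remaining = list(range(1, len(points)))
    while remaining:
        # Selection: the unvisited point nearest to any point already on the tour.
        k = min(remaining, key=lambda r: min(dist(r, t) for t in tour))
        # Insertion: place k between the two neighbours where the tour grows least.
        best_pos, best_increase = 1, math.inf
        for pos in range(len(tour)):
            i, j = tour[pos], tour[(pos + 1) % len(tour)]
            increase = dist(i, k) + dist(k, j) - dist(i, j)
            if increase < best_increase:
                best_pos, best_increase = pos + 1, increase
        tour.insert(best_pos, k)
        remaining.remove(k)
    return tour


def tour_length(points, tour):
    """Length of the closed tour, including the return to the starting point."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def total_distance(weekly_assignments):
    """d = sum over interviewers g and weeks w of d_{g,w}.

    `weekly_assignments` maps (g, w) to a list of coordinates whose first
    element is the residence of interviewer g (hypothetical layout).
    """
    return sum(
        tour_length(pts, nearest_insertion_tour(pts))
        for pts in weekly_assignments.values()
    )


def travel_cost(d, K_f, C_f, k_d):
    """c_1(s) = d * K_f * C_f * k_d."""
    return d * K_f * C_f * k_d


def interview_cost(a, b, C_a, C_b):
    """c_2(s) = a * C_a + b * C_b."""
    return a * C_a + b * C_b


# Usage with made-up numbers: one interviewer, one week, three sampled units.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
d = total_distance({(1, 1): square})
cost = travel_cost(d, K_f=0.08, C_f=1.5, k_d=1.2) + interview_cost(
    a=10, b=4, C_a=2.0, C_b=3.0
)
```

Nearest insertion is a heuristic, so the tour it returns approximates the shortest tour rather than guaranteeing it. The units of `d`, `K_f` and `C_f` have to be mutually consistent (for example kilometres, litres per kilometre, and price per litre).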
The total budget and its allocation by domains are estimated by Monte Carlo simulation experiments (6000 iterations). In each iteration, the length of the path travelled by the interviewers, the number of individuals in the sample, and the number of households in the sample are computed, each split by domain. The cost of the sample in each iteration is then computed from the simulation results and the cost function. Sufficiently precise estimates of the expected survey cost are obtained in this manner.

Parameters of the Alternative Sampling Designs

Three strata are defined for the alternative sampling designs according to the budget domains: Riga, Cities, and Towns and rural areas. The only design parameters of the alternative sampling designs are the three stratum sample sizes. The task is to determine the sample size in each stratum so that the expected cost in each stratum is approximately equal to the cost under the two-stage sampling design. The task is accomplished using Monte Carlo simulation experiments and linear regression modelling.

Population Parameters and Variance Estimation

Six population parameters are chosen for the cost efficiency analysis: the average weekly number of employed; the average weekly number of unemployed; the average weekly number of economically inactive individuals; the activity rate (the total number of employed and unemployed divided by the total number of working-age individuals); the employment rate (the total number of employed divided by the total number of working-age individuals); and the unemployment rate (the total number of unemployed divided by the total number of employed and unemployed).

The parameters are estimated for the whole population and also in breakdowns by the following domains: geographical domain (4): Riga (the capital city), cities (excluding Riga), towns, and rural areas; age group (2): individuals aged and years; geographical domain (4) × age group (2). With 15 estimation domains (the whole population plus 4 + 2 + 8 breakdown cells) and six parameters each, this gives 90 parameters (45 averages of weekly totals and 45 ratios of two totals) selected for the cost efficiency analysis.

The variance of the population parameter estimates is computed by analytical formulas in the case of the alternative designs. In the case of the two-stage sampling design it is computed by means of Monte Carlo simulation experiments.

The Most Cost Effective Sampling Design

The exact variance of the population parameter estimates is computed for the alternative sampling designs. For the two-stage sampling design only estimates of the variance are available, so the simulation error has to be considered. The variances of the two alternative sampling designs can be compared directly. Hypothesis testing is used when the variance estimates for the two-stage sampling design are compared with the variances for the alternative sampling designs.

1.5 The Main Research Results

The aim of this thesis was to develop a framework for the analysis of the cost efficiency of sampling designs. The aim has been achieved by accomplishing several tasks. The study started with the redesign of the two-stage sampling design used for the Latvian Labour Force Survey (LFS). It continued with the creation of artificial individual population data whose statistical properties are close to those of the real target population of the Latvian LFS at the macro level. The final part of the study was devoted to the development of a framework for the analysis of the cost efficiency of sampling designs. The developed framework was used to compare the LFS sampling design with the alternative sampling designs with regard to cost efficiency. Recommendations on the choice of sampling design for the LFS were prepared on the basis of the results of the cost efficiency analysis.

The following results have been achieved:

1. The sampling frame of the primary sampling units (areas) was updated. The updated sampling frame of areas is used for several sample surveys run by the Central Statistical Bureau of Latvia. The update significantly reduced the coverage errors of the frame.

2. The sampling design used for the LFS, the Household Budget Survey and the Survey of Domestic Travellers has been redesigned within the scope of this thesis. The new design has been successfully implemented and has been in use since

3. A methodology for generating artificial population data has been developed. The methodology allows the generation of artificial population data that are close to the real population data in dimension and statistical properties.

4. The artificial population data have been created using the developed methodology. Data from the Statistical Household Register and the LFS were used as input. Two data files were created: one corresponds to a static population fixed at one specific time point, and the other corresponds to a dynamic population changing over time. The static population data are similar to the LFS target population data at one specific time point. The changes of variables in the dynamic population are similar to the population changes observed in the LFS. The artificial population data are used extensively in the Monte Carlo simulation experiments carried out in the study.

5. A modified stratified simple random sampling design (mssrs), which ensures that the sample is allocated evenly by weeks, is introduced as an alternative to the currently used two-stage sampling design. The mssrs prohibits the sampling of an individual (or a household) more than once in a defined time period. The variance formula for the π-estimator of a population total and the approximate variance formula for the π-estimator of the ratio of two totals are derived.

6. A framework for the analysis of sampling designs with respect to cost efficiency has been developed. The framework is based on analytical methods and Monte Carlo simulation experiments. It allows the user to obtain information about the properties of a sampling design (for example, expected fieldwork cost and expected precision) in a relatively short time and at relatively low cost. This information is very valuable for survey planning and decision making. The advantage of the framework is that no extra data collection is required: it uses data already available to a statistical institute (administrative records, population census data or sample survey data).

7. A set of procedures is developed to support the implementation of the framework in practice. The procedures are developed in R, a free software environment for statistical computing and graphics. They allow Monte Carlo simulation experiments of sampling designs to be carried out. The procedures are modular, so the set can be extended with additional procedures. There are no limitations on the types of design that can be analysed by the procedures. The only requirement is that it must be possible to write the sampling process of the sampling design under analysis as an R function.

8. The cost efficiency of three sampling designs is estimated using the developed framework. The properties of the chosen sampling designs are explored, and recommendations regarding the appropriate sampling design for the LFS are given.

9. It is proven that the two-stage sampling design currently used for the LFS provides more precise population parameter estimates than the two other sampling designs under the condition of fixed fieldwork cost.

1.6 Discussion and Interpretation

The results of the design cost efficiency analysis are shown in tables available in the thesis. The two-stage sampling design provides the lowest expected variance for 77 of the 90 population parameters (at confidence level 0.99) when compared with the alternative sampling designs. The modified stratified simple random sampling design of individuals provides the lowest expected variance for three population parameters. The modified stratified simple random sampling design of households provides the lowest expected variance for ten population parameters.

It has to be noted that the cost efficiency analysis is done from a conservative position with respect to the two-stage sampling design. The p-value was in the interval (0.01, 0.10) in five of the ten cases in which the modified stratified simple random sampling design of households was chosen as the most cost efficient design. It cannot be concluded that the two-stage sampling design is less effective in these cases.
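The comparison of a simulated variance estimate with an exactly known variance can be illustrated with a one-sided test. This summary does not state which test the thesis uses, so the sketch below is only one possible choice: a chi-square test for a variance, assuming independent and approximately normal Monte Carlo replicates. The function names are hypothetical, the code is Python rather than the R procedures of the thesis, and the chi-square distribution function is replaced by the Wilson-Hilferty normal approximation so that only the standard library is needed.

```python
import math
from statistics import NormalDist, variance


def chi_square_cdf(x, k):
    """Wilson-Hilferty normal approximation to P(X <= x) for X ~ chi-square(k)."""
    if x <= 0:
        return 0.0
    z = ((x / k) ** (1 / 3) - (1 - 2 / (9 * k))) / math.sqrt(2 / (9 * k))
    return NormalDist().cdf(z)


def p_value_variance_lower(replicates, sigma0_sq):
    """One-sided p-value for H0: Var >= sigma0_sq against H1: Var < sigma0_sq.

    `replicates` are Monte Carlo estimates of one parameter under the simulated
    design; `sigma0_sq` is the exactly known variance under the design it is
    compared with. Assumes independent, approximately normal replicates.
    """
    r = len(replicates)
    s2 = variance(replicates)          # sample variance with divisor r - 1
    stat = (r - 1) * s2 / sigma0_sq    # chi-square(r - 1) when Var == sigma0_sq
    return chi_square_cdf(stat, r - 1)
```

A small p-value supports the claim that the simulated design has the lower variance. A p-value between the chosen significance level and a looser threshold, such as the interval (0.01, 0.10) mentioned above, leaves the comparison undecided.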

The two-stage sampling design has achieved the highest precision of estimates in most cases despite the conservative position taken with respect to it. It is therefore recommended to keep the current two-stage sampling design for the Latvian Labour Force Survey in order to achieve the highest overall precision under the current budget constraints. Switching to a simpler sampling design would have one of two negative effects: a loss of overall precision if the survey cost is kept at the current budget level, or an increase in survey cost if overall precision is kept at the currently achieved level.

2 Conclusions and Proposals

There are several practical gains from the results of this thesis.

The updated first-stage population frame makes more precise estimates possible not only for the Latvian Labour Force Survey but also for other sample surveys organised by the Central Statistical Bureau of Latvia, provided that the updated population frame is used as the sampling frame. The updated population frame can easily be used for several sample surveys, which saves resources and time during sample selection and survey fieldwork.

The developed mathematical framework can also be used for other surveys organised by the Central Statistical Bureau of Latvia. It can be used for the analysis of continuously organised surveys and for the planning of new sample surveys of households or individuals. The framework is flexible with regard to the sampling designs under study. It can be adjusted to analyse different practical situations; for example, survey cost modelling can be extended with survey-specific components. The framework can be used both by governmental institutes and by private companies working in the field of sample survey planning and organisation.

The research can be continued by extending the framework with non-response modelling. The set of the developed R procedures has to be extended with additional procedures. An additional procedure is necessary to simulate the non-response of sampled units. The cost function has to be adjusted to take into account the actions taken by interviewers in the case of non-response. A procedure for estimating the population parameters in the case of non-response is also necessary.

Acknowledgements

First and foremost I offer my gratitude to my thesis consultant Jānis Lapiņš for the guidance, advice, and encouragement he provided throughout my doctoral studies. I have learned a lot from him during the countless discussions we have had. I also express my greatest gratitude to my thesis supervisor Aleksandrs Šostaks for the support and advice I received during the studies.

I express my gratitude to my colleagues at the Central Statistical Bureau, especially to Aija Žīgure and Ieva Aināre, for the support I received during my doctoral studies. I am also thankful to Gunnar Kulldorff and Imbi Traat for organising my study visits to the University of Umeå and the University of Tartu.

I express my thanks to Rebecca Gillard for helping to improve the language of the thesis. Any remaining grammatical mistakes are my own.

I am very thankful to the Stack Exchange community for the knowledge I have gained by searching and browsing forums such as Stack Overflow, Cross Validated, TeX - LaTeX Stack Exchange, and English Language & Usage Stack Exchange. I am especially thankful for the prompt and suitable answers I have received to the questions I asked on these forums.

This work has been supported by the European Social Fund within the project «Support for Doctoral Studies at University of Latvia 2». I am also thankful to the Linda Peetre Memorial Fund for the scholarship I have received in

Finally, I offer my warmest gratitude to my family, especially to Inga, for the patience, understanding and support I have received.

References

Carkova, V. (2001). Markova ķēdes [Markov chains] (Study aid). Riga: University of Latvia.

Chen, B.-C. (2008). Stochastic simulation of field operations in surveys (Research Rep.). Washington: U.S. Census Bureau. Retrieved from

Cox, L. (2012). The case for simulation models of federal surveys. In Research conference papers of federal committee on statistical methodology research conference Washington. Retrieved from

European Commission. (2012a). Labour force survey in the EU, candidate and EFTA countries Main characteristics of national surveys, 2011 (Tech. Rep.). Luxembourg: Eurostat. Retrieved from

European Commission. (2012b). Quality report of the European Union Labour Force Survey 2010 (Tech. Rep.). Luxembourg: Eurostat. Retrieved from

Groves, R. M. (1989). Survey errors and survey costs. New Jersey: Wiley.

Hansen, M. H., Hurwitz, W. N., & Madow, W. G. (1953). Sample survey methods and theory (Vol. I). New York: Wiley.

Jessen, R. J. (1942). Statistical investigation of a sample survey for obtaining farm facts (Research Bulletin No. 304). Iowa State College of Agriculture and Mechanic Arts.

Kish, L. (1965). Survey sampling. New York: John Wiley & Sons.

Liberts, M. (2010a). The redesign of Latvian Labour Force Survey. In M. Carlson, H. Nyquist, & M. Villani (Eds.), Official statistics methodology and applications in honour of Daniel Thorburn (pp ). Stockholm, Sweden: Stockholm University. Retrieved from

Liberts, M. (2010b). The weighting in household sample surveys. In O. Krastiņš & I. Vanags (Eds.), The results of statistical scientific research 2010 (pp ). Riga: Central Statistical Bureau of Latvia. Retrieved from PhD/pub/10Papers/Liberts_2010_Weighting.pdf