Introduction to Sample Surveys

1 Introduction to Sample Surveys
Statistics 331
Kirk Wolter
September 26, 2016

2 Outline
A. What are sample surveys?
B. Main steps in a sample survey
C. Limitations/Errors in survey data

3 A. What are Sample Surveys?
Many applied problems involve numerical summaries of a population:
- Median test scores of high school students in Chicago
- Manufacturers' shipments in Illinois
- Children age months who have received 3+ polio shots
- Intention to vote for a certain political candidate
- Sales of Crest toothpaste in Jewel supermarkets
- Regression of employment status on age, race, sex, and education

4 Basic Objective of Surveys
Surveys:
- Observe a carefully selected, representative part of the population,
- Without disturbing it,
- Make inferences about the whole population

5 Sample Surveys vs. Designed Experiments
- Surveys are not experimental studies that deliberately perturb part of the population and study the effect
  - e.g., randomize patients to Treatment A, Treatment B, or Control
- Surveys avoid disturbing the population or the sampled part of the population

6 Sample Surveys vs. Purely Observational Studies
- Surveys are not purely observational studies that observe part of the population with no control over which part
  - e.g., results of a CNN call-in poll
- Surveys carefully control which part of the population you will study to ensure that the part is representative of the whole

7 B. Main Steps in a Sample Survey
1. Objectives of the survey
2. Population
3. Key parameters of the population
4. Sampling frame and sampling unit
5. Degree of accuracy, cost, and timing
6. Sampling design and implementation

8 B. Steps cont.
7. Data to be collected; what is to be measured
8. Data collection agents or interviewers
9. Methods of measurement
10. Data collection operations
11. Analysis of data
12. Delivery of data, analysis, and reports to client

9 Step 1. Objectives of the Survey
- Questions to be answered
- Purpose
- Specific vs. broad
- Single vs. multipurpose

10 Example
The Medicare Current Beneficiary Survey (MCBS) is a continuous, multipurpose survey of a nationally representative sample of the Medicare population, conducted by the Office of Enterprise Data and Analytics (OEDA) of the Centers for Medicare & Medicaid Services (CMS) through a contract with NORC at the University of Chicago. The central goals of the MCBS are to determine expenditures and sources of payment for all services used by Medicare beneficiaries, including co-payments, deductibles, and non-covered services; to ascertain all types of health insurance coverage and relate coverage to sources of payment; and to trace outcomes over time, such as changes in health status and spending down to Medicaid eligibility and the impacts of Medicare program changes on satisfaction with care and usual source of care.

11 2. Population
- High school students in Chicago
- Manufacturers in Illinois
- Children months in U.S.
- Registered voters
- Jewel supermarkets
- Adults in the U.S. labor force

12 3. Key Parameters
- Median test scores
- Manufacturers' shipments
- Proportion of children who have received 3+ polio shots
- Proportion of those who plan to vote who would vote for a particular candidate
- Total sales and market share of Crest toothpaste
- Regression coefficient

13 4. Sampling Frame and Sampling Unit
- Sampling frame is a list of the members (or units) of the population
- Sampling unit
  - High school students
  - Manufacturers
  - Supermarkets
  - Medicare beneficiaries
- If a list is not available, we change the sampling unit to something for which a list is available
  - Street address
  - Telephone number
  - City block

14 5. Degree of Accuracy, Cost, and Timing
- Accuracy: bias, variance (standard error, margin of error), mean square error (MSE)

15 5. Degree of Accuracy, Cost, and Timing cont.
- Cost
- Timing
- Reference period
- Field period
- One-time or repeated

16 6. Sampling Design and Implementation
- Probability sampling
- Nonprobability sampling
  - Convenience sampling
  - Purposive sampling
  - Self selection

17 A Randomization Example
- Very small country with N = 4 farms.
- Design a study to estimate total acres of corn (limited budget!)
- One possible probability sample: choose any pair of farms with equal probability

[Table of the four farms; columns: Label (known); Acreage, x (known); Corn Acres, y (unknown)]

18 Simple Random Sampling (y = (1, 3, 5, 15))

If sample selected is   then measured values are   and the estimated total corn acres is
Farm 1 & farm 2         1, 3                       2*1 + 2*3 = 8

19 SRS WOR, Cont. (y = (1, 3, 5, 15))

If sample selected is   then measured values are   and the estimated total corn acres is
1,2                     1, 3                       2*1 + 2*3 = 8
1,3                     1, 5                       2*1 + 2*5 = 12
1,4                     1, 15                      2*1 + 2*15 = 32
2,3                     3, 5                       2*3 + 2*5 = 16
2,4                     3, 15                      2*3 + 2*15 = 36
3,4                     5, 15                      2*5 + 2*15 = 40
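The table above can be reproduced by brute-force enumeration. The following is a minimal sketch, assuming the expansion weight N/n = 2 used on the slide; the variable names are illustrative only.

```python
from itertools import combinations

y = [1, 3, 5, 15]          # corn acres for farms 1..4 (from the slide)
N, n = len(y), 2           # population size and sample size

# Under SRS without replacement, every pair of farms is an equally likely sample.
samples = list(combinations(range(1, N + 1), n))

# Expansion estimator: each sampled farm represents N/n = 2 farms.
estimates = {s: (N / n) * sum(y[i - 1] for i in s) for s in samples}

for s, est in estimates.items():
    print(s, est)
```

Each printed row pairs a sample with its estimated total, in the same order as the table.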

20 Randomization Distribution

21 Expected Value and Variance
- Mean of the randomization distribution is the true population total (24):
  (8 + 12 + 32 + 16 + 36 + 40)/6 = 24
- Unbiased estimator
- Variance of the randomization distribution indicates deviations of possible estimates from the true population total:
  {(8 - 24)^2 + (12 - 24)^2 + ... + (40 - 24)^2}/6 = 928/6 ≈ 154.67
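Written out term by term for the six SRS estimates 8, 12, 32, 16, 36, 40, the arithmetic behind this slide is as follows (a worked check, not part of the original deck; $\hat{Y}$ is used here as a label for the estimated total):

```latex
\begin{align*}
E(\hat{Y}) &= \frac{8 + 12 + 32 + 16 + 36 + 40}{6} = \frac{144}{6} = 24, \\
\mathrm{Var}(\hat{Y}) &= \frac{(-16)^2 + (-12)^2 + 8^2 + (-8)^2 + 12^2 + 16^2}{6}
  = \frac{256 + 144 + 64 + 64 + 144 + 256}{6}
  = \frac{928}{6} \approx 154.67 .
\end{align*}
```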

22 Stratified Sampling
- Working model: big farm probably plants more corn
- Design can reflect this: always sample the big farm, plus one more sampled at random
- Estimated corn acreage total = big farm corn acreage + estimated corn acreage of small farms

23 Stratified Sampling, Cont. (y = (1, 3, 5, 15))

If sample selected is   then measured values are   and the estimated total is
1,4                     1, 15                      3*1 + 1*15 = 18
2,4                     3, 15                      3*3 + 1*15 = 24
3,4                     5, 15                      3*5 + 1*15 = 30

24 Randomization Distributions

25 Expected Value and Variance
- Expected value: (18 + 24 + 30)/3 = 24
- Estimator is unbiased
- Variance: {(18 - 24)^2 + (24 - 24)^2 + (30 - 24)^2}/3 = 24
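A short sketch of the stratified design, assuming (as on the slides) that farm 4 is the "big farm" taken with certainty and that one of the three small farms is drawn with probability 1/3. The names `big_farm` and `small_farms` are illustrative.

```python
y = {1: 1, 2: 3, 3: 5, 4: 15}   # corn acres by farm label (from the slide)
big_farm = 4                     # certainty stratum: always sampled, weight 1
small_farms = [1, 2, 3]          # one of these is drawn at random, weight 3

# One estimate per possible sample {small farm, big farm}, each with probability 1/3.
estimates = [len(small_farms) * y[f] + 1 * y[big_farm] for f in small_farms]

expected = sum(estimates) / len(estimates)
variance = sum((e - expected) ** 2 for e in estimates) / len(estimates)

print(estimates, expected, variance)
```

The sampled small farm carries weight 3 because it stands in for all three small farms, while the big farm represents only itself.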

26 Stratified Sampling Re-Revisited: Ignoring Weights

If sample selected is   then measured values are   and the estimated total is
1,4                     1, 15                      2*1 + 2*15 = 32
2,4                     3, 15                      2*3 + 2*15 = 36
3,4                     5, 15                      2*5 + 2*15 = 40

27 Randomization Distributions

28 Expected Value and Variance
- Expected value: (32 + 36 + 40)/3 = 36
- Estimator is biased, and bias = 36 - 24 = 12
- Variance: {(32 - 36)^2 + (36 - 36)^2 + (40 - 36)^2}/3 = 32/3 ≈ 10.67
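Combining the bias and the variance gives the mean square error of the unweighted estimator. This is a worked check using the decomposition MSE = variance + bias squared; $\hat{Y}_{\text{unw}}$ is a label introduced here for illustration.

```latex
\begin{align*}
\mathrm{Var}(\hat{Y}_{\text{unw}}) &= \frac{(-4)^2 + 0^2 + 4^2}{3} = \frac{32}{3} \approx 10.67, \\
\mathrm{MSE}(\hat{Y}_{\text{unw}}) &= \mathrm{Var}(\hat{Y}_{\text{unw}}) + \text{bias}^2
  = \frac{32}{3} + 12^2 = \frac{464}{3} \approx 154.67 .
\end{align*}
```

The variance is small, but the squared bias of 144 dominates the MSE.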

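29 Using Auxiliary Information (x = (4, 6, 6, 20), y = (1, 3, 5, 15))

If sample selected is   then calibration weights are   and the estimated total is
1,2                     36/(4+6) = 3.6                 3.6*1 + 3.6*3 = 14.4
1,3                     36/(4+6) = 3.6                 3.6*1 + 3.6*5 = 21.6
1,4                     36/(4+20) = 1.5                1.5*1 + 1.5*15 = 24.0
2,3                     36/(6+6) = 3.0                 3.0*3 + 3.0*5 = 24.0
2,4                     36/(6+20) = 1.38               1.38*3 + 1.38*15 ≈ 24.92
3,4                     36/(6+20) = 1.38               1.38*5 + 1.38*15 ≈ 27.69

(The last two totals use the unrounded weight 36/26.)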

30 Randomization Distributions

31 Expected Value and Variance
- Expected value: (14.4 + 21.6 + 24.0 + 24.0 + 24.92 + 27.69)/6 ≈ 22.77
- Estimator is not unbiased
- Variance: {(14.4 - 22.77)^2 + ... + (27.69 - 22.77)^2}/6 ≈ 17.22
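The calibration estimator can be checked by enumeration. This is a sketch under the assumption that each sample's weight is the known acreage total (36) divided by the sampled acreage, as in the calibration table; unrounded weights are used, so the last digits may differ slightly from hand calculations with 1.38.

```python
from itertools import combinations

x = [4, 6, 6, 20]    # total acreage, known for every farm (from the slide)
y = [1, 3, 5, 15]    # corn acres, observed only for sampled farms
X_total = sum(x)     # 36, the known population total of x

estimates = {}
for s in combinations(range(4), 2):
    w = X_total / sum(x[i] for i in s)            # calibration weight for this sample
    estimates[tuple(i + 1 for i in s)] = w * sum(y[i] for i in s)

p = 1 / len(estimates)                             # each pair is equally likely
expected = sum(p * e for e in estimates.values())
variance = sum(p * (e - expected) ** 2 for e in estimates.values())
bias = expected - sum(y)                           # true total is 24

print({s: round(e, 2) for s, e in estimates.items()})
print(round(expected, 2), round(bias, 2), round(variance, 2))
```

The estimator gives up unbiasedness in exchange for a much smaller variance, because acreage x is strongly related to corn acres y.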

32 Clustering the Farms
- Now suppose the first two farms are in one location and the other two farms are in a very different location, and that travel costs are substantial
- To save cost, select one location at random and interview the two farms at that location

33 Cluster Sampling, Cont. (y = (1, 3, 5, 15))

If sample selected is   then measured values are   and the estimated total is
1,2                     1, 3                       2*1 + 2*3 = 8
3,4                     5, 15                      2*5 + 2*15 = 40

34 Expected Value and Variance
- Expected value: (8 + 40)/2 = 24
- Estimator is unbiased
- Variance: {(8 - 24)^2 + (40 - 24)^2}/2 = 256

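35 Comparing the Sampling Strategies

Strategy                        Bias     Variance   MSE
Simple Random Sampling          0        154.67     154.67
Stratified                      0        24         24
Stratified (ignoring weights)   12       10.67      154.67
Calibration                     -1.23    17.22      18.73
Cluster Sampling                0        256        256

Values follow from the randomization distributions on slides 21, 25, 28, 31, and 34, with MSE = variance + bias^2.

Which sampling strategy would you choose?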
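The comparison can be computed in one pass. The sketch below assumes equally likely samples within each design, as in the slides; the helper `summarize` and the strategy labels are illustrative names, not part of the original deck.

```python
from itertools import combinations

x = {1: 4, 2: 6, 3: 6, 4: 20}    # acreage (known for all farms)
y = {1: 1, 2: 3, 3: 5, 4: 15}    # corn acres (observed in the sample only)
TRUE_TOTAL = sum(y.values())      # 24

def summarize(samples, estimator):
    """Bias, variance and MSE of an estimator over equally likely samples."""
    ests = [estimator(s) for s in samples]
    mean = sum(ests) / len(ests)
    var = sum((e - mean) ** 2 for e in ests) / len(ests)
    bias = mean - TRUE_TOTAL
    return bias, var, var + bias ** 2

pairs = list(combinations([1, 2, 3, 4], 2))   # SRS of 2 farms out of 4
with_big = [(1, 4), (2, 4), (3, 4)]           # big farm always included
clusters = [(1, 2), (3, 4)]                   # one location chosen at random

strategies = {
    "Simple random sampling": (pairs, lambda s: 2 * sum(y[i] for i in s)),
    "Stratified": (with_big, lambda s: 3 * y[s[0]] + y[s[1]]),
    "Stratified, ignoring weights": (with_big, lambda s: 2 * sum(y[i] for i in s)),
    "Calibration": (pairs, lambda s: sum(x.values()) / sum(x[i] for i in s)
                    * sum(y[i] for i in s)),
    "Cluster sampling": (clusters, lambda s: 2 * sum(y[i] for i in s)),
}

results = {name: summarize(*design) for name, design in strategies.items()}
for name, (bias, var, mse) in results.items():
    print(f"{name:30s} {bias:8.2f} {var:8.2f} {mse:8.2f}")
```

Treating a design as a list of samples plus an estimator function keeps the bias, variance, and MSE logic identical across strategies, so only the design changes from row to row.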

36 7. Data to be Collected
- Questionnaire
  - Content (items of information)
  - Question wording
  - Question sequencing and flow
  - Ease of administration
  - Other
- Tools for developing the questionnaire
- Testing the draft questionnaire

37 8. Data Collection Agents
- Field interviewers
- Telephone interviewers
- Self completion

38 9. Methods (Modes) of Measurement
- Face-to-face
- Telephone
- Mail
- Web
- Other
- Multi-mode

39 10. Data Collection Operations
- Dates and length of the field period
- Level of supervision given to the data collection agents
- Follow-up protocol (callbacks)
- Transmission, management, and storage of the data

40 11. Analysis of Data
- Convert the responses to machine readable form
- Build the analytical database
- Missing and faulty data
- Edit and imputation
- Survey weights
- Tabulations

41 11. Analysis of Data cont.
- Relationships between variables
- Cross-tabulations
- Correlation coefficients
- Regression coefficients
- Graphical display of the results
- Measures of precision
- Tests of significance
- Confidence intervals

42 12. Delivery of Data, Analysis, and Reports to Clients
- Electronic files
- CDs or other
- Hard copy

43 C. Limitations/Errors in Survey Data
Stages: Population, Frame, Sample, Respondents, Data, Products, Uses
Errors arising between successive stages:
- Coverage error
- Sampling error
- Nonresponse
- Measurement/processing error
- Processing/estimation error
- Lies, damned lies!
Error classes: Errors of Non-Observation; Errors of Observation
September 26, 2016

44 Foundations of Survey Statistics
A finite population $U$ is a population of known number $N$ of identifiable units $U_1, U_2, \ldots, U_N$. This definition excludes such populations as the fish in a lake or unmarked bolts in a barrel. The list of population units is called the sampling frame.
Stat 331, 9/23/

45 Foundations cont.
Attached to each unit in the population is the true value of a characteristic of interest, $Y_i$. In real applications, the characteristic is vector valued. Define $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_N)$.

46 Foundations cont.
The characteristics arise from questionnaire items or information on the sampling frame. Some characteristics may be derived from such information.

47 Foundations cont.
Any real-valued function $\theta(\mathbf{Y}) = \theta$ is called a parametric function, or simply a population parameter, or still more simply a parameter. For example,
- $Y = \sum_{j=1}^{N} Y_j$ (the population total)
- $\bar{Y} = Y/N$ (the population mean)
- $P = \bar{Y}$ (the population proportion, for a 0/1 characteristic)

48 Foundations cont.
- A sample, $s$, is a subset of $U$
  - Sufficient for most of our purposes
  - More general definitions are available
- $n_s$ is the sample size

49 Foundations cont.
We let $\mathcal{S}$ denote the collection of all possible samples from $U$, called the sample space. $S$ is the random variable taking on values $s$, for all $s$ in $\mathcal{S}$.

50 Foundations cont.
Let $P$ denote probability over the space $\mathcal{S}$, i.e.,
$P(S = s) \ge 0$ for all $s \in \mathcal{S}$, and $\sum_{s \in \mathcal{S}} P(S = s) = 1$.
We call the triple $(U, \mathcal{S}, P)$ a sampling design.

51 Foundations cont.
Notation
- $Y_i$: value of the $i$-th unit in the population
- $y_i$: value of the $i$-th unit in the sample
An estimator is a real-valued function, $\hat{\theta}(s, \mathbf{Y}) = \hat{\theta}(\mathbf{y}) = \hat{\theta}$, thought to be good for estimating some parameter $\theta(\mathbf{Y}) = \theta$.

52 Foundations cont.
- Expected value: $E(\hat{\theta}) = \sum_{s \in \mathcal{S}} \hat{\theta}(s, \mathbf{Y}) \, P(S = s)$
- Bias: $B(\hat{\theta}) = E(\hat{\theta}) - \theta$

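53 Foundations cont.
- Variance: $\mathrm{Var}(\hat{\theta}) = E\{\hat{\theta} - E(\hat{\theta})\}^2 = \sum_{s \in \mathcal{S}} \{\hat{\theta}(s, \mathbf{Y}) - E(\hat{\theta})\}^2 \, P(S = s) = E(\hat{\theta}^2) - E^2(\hat{\theta})$
- Mean square error: $\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + B^2(\hat{\theta})$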
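These definitions can be checked numerically on the cluster design from the farm example (two possible samples, each with probability 1/2, estimates 8 and 40, true total 24). This is a minimal sketch; the dictionary layout is an illustrative choice.

```python
# Cluster design from the farm example: P(S = s) for each possible sample s.
design = {(1, 2): 0.5, (3, 4): 0.5}
estimate = {(1, 2): 8.0, (3, 4): 40.0}   # estimator value for each sample
theta = 24.0                              # true population total

# Expected value and second moment over the sampling design.
E = sum(p * estimate[s] for s, p in design.items())
E_sq = sum(p * estimate[s] ** 2 for s, p in design.items())

# Variance by definition and by the shortcut E(theta_hat^2) - E^2(theta_hat).
var_definition = sum(p * (estimate[s] - E) ** 2 for s, p in design.items())
var_shortcut = E_sq - E ** 2

# MSE computed directly as the expected squared error around theta.
bias = E - theta
mse_direct = sum(p * (estimate[s] - theta) ** 2 for s, p in design.items())

print(E, bias, var_definition, var_shortcut, mse_direct)
```

Both variance expressions give 256, and the directly computed MSE equals the variance plus the squared bias.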

54 Foundations cont.
Consistency: $P(|\hat{\theta} - \theta| > \varepsilon) \to 0$ for $\varepsilon > 0$, as $n, N \to \infty$ and $n/N = f \in (0, 1)$.

55 Foundations cont.
A main goal in survey sampling is to employ an unbiased estimator; failing that, to employ a consistent estimator.
Inclusion probabilities
- $\pi_i = P(U_i \in S) = \sum_{s \ni i} P(S = s)$
- $\pi_{ij} = P(U_i, U_j \in S) = \sum_{s \ni i,j} P(S = s)$
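As an illustration of the inclusion probabilities, the sketch below computes them for simple random sampling of n = 2 units from N = 4, the design used in the farm example. Exact fractions are used so that the results are not affected by rounding; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

N, n = 4, 2
units = range(1, N + 1)
samples = list(combinations(units, n))
P = {s: Fraction(1, len(samples)) for s in samples}   # SRS: equal probabilities

# First-order inclusion probability: sum P(S = s) over samples containing unit i.
pi = {i: sum(p for s, p in P.items() if i in s) for i in units}

# Second-order inclusion probability: sum over samples containing both i and j.
pi2 = {(i, j): sum(p for s, p in P.items() if i in s and j in s)
       for i, j in combinations(units, 2)}

print(pi)    # every unit has inclusion probability n/N
print(pi2)   # every pair appears in exactly one of the six samples
```

Here each unit has inclusion probability 1/2, so the weight N/n = 2 used in the farm example is the reciprocal of the inclusion probability, and the first-order probabilities sum to the sample size n.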