Practical Sampling for Impact Evaluations


1 L. Chioda (adapted by DIME, LCRCE) Practical Sampling for Impact Evaluations. Innovations & solutions in infrastructure, agriculture & environment. April 23-27, 2012, Naivasha, Kenya

2 Introduction
Now that you know how to build treatment and control groups in theory, how do you do it in practice?
- Which population or groups are we interested in, and where do we find them? (Selecting whom to interview)
- From that population, how many people/firms/units should be interviewed/observed? (Sample size)
Seemingly trivial, but the devil is in the details.
Example: Suppose we want to understand why fertilizer adoption rates are so low among maize farmers.
Duflo, Kremer and Robinson, 2008: How High Are Rates of Return to Fertilizer? Evidence from Field Experiments in Kenya

3 Introduction
Example (1): Whom to interview is informed by the research/policy question:
1. All farmers?
2. All small farmers?
3. All small farmers in a particular agro-ecological zone?
4. All small farmers in a particular agro-ecological zone in a particular region?
You need some information before sampling: a complete listing of all units of observation available for sampling in each area or group. This can be tricky for units like informal firms, but there are techniques to overcome it.

4 Introduction
How many? Sample size depends on a few ingredients.
Example (2), intuitively:
- One farmer receives fertilizer (treatment); a second farmer does not (control). The two have been selected at random.
- The return to the new technology (impact) is given by the difference between the two crop yields.
Why does sample size matter? If the sample is too small, you may draw conclusions that are not robust: What if the farmer receiving fertilizer by chance lived on more fertile land? Or, on the contrary, what if the one not receiving fertilizer was by chance more zealous and/or had access to better irrigation?

5 Introduction
If samples are too small, why not assign the entire population (individuals; farmers) either to the treatment or to the control group?
- Ideal world: without budget or time constraints, interviewing everyone would be a good solution.
- In practice, interviews are costly and time consuming, so this is not feasible (e.g. a census every 10 years vs. more frequent household surveys).
In sum:
- Whom to interview is ultimately determined by our research/policy questions.
- Sample size matters and determines the credibility of results: it allows us to say with some confidence whether the average outcome in the treatment group is higher/lower than that in the comparison group.

6 Introduction
What will we be doing with the rest of the time?
1. What do we mean by confidence? (minimizing statistical error)
2. Ingredients to determine sample size: detectable effect size; probabilities of avoiding mistakes in inference (type I & type II errors); variance of outcome(s); units (firms, banks) per treated/control area
3. Multiple treatments
4. Group-disaggregated results
5. Take-up
6. Data quality

7 Calculating the Sample Size
We understand confidence to mean "with some degree of certainty" or "with little error". We are in luck! This time, the statistical jargon and plain English point to the same notion; the statistical sense only entails formalizing what is meant by "error". The statistical derivation of the sample size (per group, to detect a difference δ in means when the outcome has variance σ²) yields an ugly formula:

N = 2σ² (z_{1−α/2} + z_{1−β})² / δ²

Would you like me to derive this formula?
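The sample-size formula on this slide can be sketched in a few lines of code. A minimal version, assuming a two-sided test on a difference in means at significance level α with the stated power (all numbers below are illustrative, not from the study):

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, delta, alpha=0.05, power=0.8):
    # N = 2*sigma^2 * (z_{1-alpha/2} + z_{power})^2 / delta^2, per group
    z = NormalDist().inv_cdf
    return ceil(2 * sigma**2 * (z(1 - alpha / 2) + z(power))**2 / delta**2)

# Hypothetical numbers: detect a 5-unit yield gain when the outcome's
# standard deviation is 50
print(sample_size(sigma=50, delta=5))    # -> 1570 per group
# Halving the detectable effect quadruples the required sample
print(sample_size(sigma=50, delta=2.5))  # -> 6280 per group
```

Note how the three ingredients discussed next all appear in the formula: the detectable effect δ, the error probabilities (through the z-values), and the outcome variance σ².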

8 Calculating the Sample Size
Hopefully you answered "no" to my previous question (otherwise, early coffee break). An intuitive approach will focus on:
1. Detectable effect size
2. Errors in inference: type II (and type I) errors
3. Variance of outcome(s)
Question: How do these 3 ingredients affect the credibility of your results, and therefore your choice of sample size?

9 Calculating the Sample Size
Think of the sample size as the accuracy of a measuring device: the more observations you have, the more precise your measuring device is, and the more confident you are about the conclusions of your evaluation.
Example: guess the sentence below knowing only 2 letters. The number of revealed letters is analogous to the number of observations, where each letter, say, costs US$100,000. You have US$2M with which to uncover up to 20 letters (all of them). If you guess wrong, you lose all of your investment.

10 Calculating the Sample Size
Let's increase the number of observations (in this case, letters). This is so much easier: you feel more confident about guessing. Common sense: the more complicated the sentence, the more letters you would need. Below, we discuss the sense in which impacts can be complicated to detect and would require larger samples.

11 1st ingredient: Smallest Effect Size
We do not know in advance the effect of our policy; we want to design a precise way of measuring it. But precision is not cheap: we need a cost-benefit analysis to decide.
1st ingredient: the smallest program effect size that you wish to detect, i.e. the smallest effect for which we would be able to conclude that it is statistically different from zero ("detect" is used in a statistical sense).
Example: What if the use of fertilizer increases yields, and thus revenues, by 5%, but costs (purchases, extra man-hours, knowledge, etc.) grow by 4.5%? What if the aggregate benefits (yields) are lower than the cost of the IE?

12 1st ingredient: Smallest Effect Size
Cost-benefit analysis guides us in determining the smallest detectable effect:
- that could be useful for policy
- that could justify the cost of an impact evaluation, etc.
The smaller the (EXPECTED) differences between treatment & control, the more precise the instrument has to be to detect them, and the larger the sample needs to be.

13 1st ingredient: Smallest Effect Size
The larger the sample, the more precise the measuring device, and the easier it is to detect smaller effects. Increasing sample size means increasing the precision of our measuring device.

14 Type II Error (false negative)
Why is it important to be able to measure differences with precision?
Example (1): Fertilizer adoption. Crop yields in the treatment group are very similar (≈) to crop yields in the control group, i.e. treatment and control outcomes are not statistically different. Then we could conclude that our program has no effect for 2 reasons:
1. Because our instrument is not precise (bad inference)
2. Because the program indeed had no effect (good inference)
Unless we have enough observations, we would not be able to decide with confidence between possibilities 1 and 2.
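The trade-off can be sketched numerically: under a normal approximation, statistical power (the probability of avoiding a type II error) rises with the number of observations per arm. The effect size and standard deviation below are hypothetical illustrations, not taken from the study:

```python
from statistics import NormalDist

def power(n, delta, sigma, alpha=0.05):
    # Approximate power of a two-sided two-sample test of means:
    # probability of detecting a true difference `delta` with n units per arm.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * (2 / n) ** 0.5          # std. error of the difference in means
    return 1 - NormalDist().cdf(z_crit - delta / se)

# Hypothetical numbers: a 5-unit yield gain, outcome std. dev. of 50
for n in (100, 500, 1600):
    print(n, round(power(n, delta=5, sigma=50), 2))
```

With 100 farmers per arm the test would usually miss a true effect of this size; with 1,600 per arm it would usually find it.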

15 Type I Error (false positive)
Example (2): Fertilizer adoption is effective only when coupled with hybrid seeds. Farmers receiving fertilizer flip a coin to see who also gets hybrid seeds. By pure chance, treatment farmers tend to have more fertile soil, and crop yields in the treatment group are (statistically) larger than in the control group. We conclude that our program has an effect (despite there being none in truth); the difference is due only to the difference in soil fertility (bad inference).
Good news: the larger the sample size, the smaller we can make the probability of committing this type of error.

16 One more Ingredient: Variance of Outcomes (1)
How does the variance of the outcome affect our ability to detect an impact?
Example: of the two (circled) populations in the figure, which animals are bigger on average? How many observations from each circle would you need to decide?

17 One more Ingredient: Variance of Outcomes (2)
Example: on average, which group has the larger animals? The comparison is more complicated in this case, so you need more information (i.e. a larger sample); the answer may depend on which members of the blue & red groups you observe.

18 One more Ingredient: Variance of Outcomes (3)
Economic example: let's look at our farmers & fertilizer. Imagine that the use of fertilizer leads to an increase in crop yields (impact) from a 50% to a 60% return rate on average.
Case A: farmers are all very similar & the distribution of crop yields is very concentrated.
Case B: farmers and other inputs are very different & the distributions of crop yields are spread out (the distributions overlap more).
Which instance requires a more precise measuring device?

19 One more Ingredient: Variance of Outcomes (4)
In sum: more underlying variance (heterogeneity) makes it more difficult to detect a difference, so we need a larger sample size.
Tricky: how do we know about outcome heterogeneity before we decide our sample size and collect our data?
- Ideal: pre-existing data, but these are often non-existent
- Can use pre-existing data from a similar population (e.g. enterprise surveys, labor force surveys)
- Common sense

20 What else to consider when deciding sample size
Additional features of the design/data that may have implications for the determination of sample size:
1. Multiple treatment arms
2. Group-disaggregated results
3. Take-up
4. Data quality

21 1. Multiple treatments
From the fertilizer adoption example: fertilizer can be very profitable (when used correctly).
Results (Duflo, Kremer, Robinson 2008): seasonal rate of return for ½ teaspoon of fertilizer: 36%; annualized mean return of 69.5%. In practice, adoption is low. Why?
New IE: DKR (2011) consider:
- Treatment 1: 50% subsidy on fertilizer
- Treatment 2: fertilizer + smaller discount (SAFI program)
- Treatment 3: SAFI + reminder close to time of use
- Treatment 4: SAFI + free delivery
Intuition: the more comparisons (treatments), the larger the sample size needed to be confident.

22 1. Multiple treatments
Comparing multiple treatment groups requires very large samples; it is analogous to having multiple impact evaluations bundled into one. The more comparisons you make, the more observations you need, especially if the various treatments are very similar: the differences between treatment groups can then be expected to be smaller.
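One simple way to quantify why more arms require more data is a Bonferroni correction, which splits the significance level across all pairwise comparisons. This is a sketch, not necessarily the adjustment used in the DKR design:

```python
from math import ceil, comb
from statistics import NormalDist

def n_pairwise(sigma, delta, arms, alpha=0.05, power=0.8):
    # With several arms, a Bonferroni correction divides alpha across all
    # pairwise comparisons, which pushes up the per-arm sample size.
    tests = comb(arms, 2)                 # number of pairwise comparisons
    z = NormalDist().inv_cdf
    a = alpha / tests                     # Bonferroni-adjusted significance level
    return ceil(2 * sigma**2 * (z(1 - a / 2) + z(power))**2 / delta**2)

# Hypothetical sigma and delta; 5 arms matches one control plus the
# four treatments listed on the previous slide
print(n_pairwise(50, 5, arms=2))   # one comparison
print(n_pairwise(50, 5, arms=5))   # ten pairwise comparisons
```

Going from 2 arms to 5 arms raises the required per-arm sample by roughly 70% here, on top of having more arms to fill.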

23 Group-disaggregated results: why do we need strata?
Are effects different for men and women? For different sectors? Different regions? If genders/sectors are expected to react in similar ways, then estimating differences in treatment impact also requires very large samples.
To ensure balance across treatment and comparison groups, it is good to divide the sample into strata (aka groups) before assigning treatment.
Strata: sub-populations (sub-groups or sub-sets). Common strata: geography, gender, sector, baseline values of the outcome variable. Treatment assignment (or sampling) occurs within these groups (i.e. randomize within strata).

24 What can go wrong if you do not use strata?
Example: you randomize without stratification. Now you ask: what is the impact in a particular region? (The slide's figure shows treatment and control units assigned at random across regions A, B, and C.) Can you assess with confidence the impact of fertilizer within regions?

25 Why do we need strata?
To answer, consider a few regions:
- Region A: we have almost no farmers in the control group
- Region B: very few observations; can you be confident?
- Region C: no observations at all


28 Why do we need strata?
How do we prevent these imbalances and restore confidence in estimates within strata?
Example: you have 6 regions. Instead of sampling 2,400 farmers at random (regardless of their region of origin), within each region you draw a sample of 400: 200 treatment & 200 control. I.e., random assignment to treatment within geographical units; within each unit, half will be treatment and half will be control. Similar logic applies for gender, industry, firm size, etc.
Which strata? Your research & policy questions should guide you.
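The assignment rule described above can be sketched as follows (a minimal illustration; the region and farmer names are made up):

```python
import random

def stratified_assignment(units_by_region, seed=0):
    # Within each stratum (region), randomly assign exactly half of the
    # units to treatment ("T") and half to control ("C").
    rng = random.Random(seed)
    assignment = {}
    for region, units in units_by_region.items():
        treated = set(rng.sample(units, len(units) // 2))
        assignment[region] = {u: ("T" if u in treated else "C") for u in units}
    return assignment

# Hypothetical example: 6 regions with 400 farmers each, as on the slide
regions = {f"region_{i}": [f"farmer_{i}_{j}" for j in range(400)]
           for i in range(6)}
plan = stratified_assignment(regions)
for region, arms in plan.items():
    n_treat = sum(1 for a in arms.values() if a == "T")
    print(region, n_treat, len(arms) - n_treat)   # 200 / 200 in every region
```

Because the coin flips happen within each region, the 200/200 split is guaranteed everywhere, unlike the unstratified draw on the previous slides.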

29 Why do we need strata?
What about now? The treatment and control farmers look balanced across regions. Much better!

30 Take-up
Example: we can only offer a subsidy for fertilizer; we cannot force farmers to adopt it. We offer the subsidy to 5,000 farmers, and only 50 participate (and sometimes not at random). In practice, because of the low take-up rate, we end up with a less precise measuring device: we won't be able to detect differences with precision, and can only find an effect if it is really large.
In sum: a low take-up rate lowers the precision of our comparisons and effectively decreases the sample size.
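A rough way to see the cost of low take-up: in an intention-to-treat comparison, a take-up rate t dilutes the detectable effect to t·δ, so the required sample grows by about 1/t². A sketch with hypothetical numbers:

```python
from math import ceil
from statistics import NormalDist

def n_with_takeup(sigma, delta, takeup, alpha=0.05, power=0.8):
    # Partial take-up `takeup` dilutes the average treatment-group effect
    # to takeup*delta, inflating the required sample by 1/takeup^2.
    z = NormalDist().inv_cdf
    diluted = takeup * delta
    return ceil(2 * sigma**2 * (z(1 - alpha / 2) + z(power))**2 / diluted**2)

print(n_with_takeup(sigma=50, delta=10, takeup=1.0))   # -> 393 per group
print(n_with_takeup(sigma=50, delta=10, takeup=0.5))   # -> 1570: roughly 4x
```

Halving take-up quadruples the required sample, which is why take-up assumptions deserve as much attention as the effect size itself.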

31 Data Quality
Poor data quality effectively increases the required sample size:
- Missing observations: quality of data collection, attrition, migration
- High measurement error: answers are not always precise (e.g. when reporting income, use of fertilizer, or revenues; recollection bias, misreporting on purpose, framing, pleasing the interviewer)
Poor data quality can be partly addressed by having a field coordinator on the ground monitoring data collection.

32 In Conclusion
Whom to interview is ultimately determined by our research/policy questions.
How many? The larger the sample size will have to be:
- the more (statistical) confidence/precision we want
- the smaller the effects we want to detect
- the more underlying heterogeneity (variance) there is
- the more complicated the design (multiple treatments, strata)
- the lower the take-up
- the lower the data quality

33 Power Calculation in Practice: an Example
Calculations can be made in many statistical packages, e.g. Stata or Optimal Design. The Optimal Design software is freely downloadable from the University of Michigan website.

34 Power Calculation in Practice: an Example
Example: an experiment in Ghana designed to increase the profits of microenterprise firms. Baseline profits are 50 cedi per month. Profits data are typically noisy, so a coefficient of variation > 1 is common.
Example Stata code to detect a 10% increase in profits:
sampsi 50 55, p(0.8) pre(1) post(1) r1(0.5) sd1(50) sd2(50)
Having both a baseline and an endline decreases the required sample size (pre and post).
Results:
- 10% increase (from 50 to 55): 1,178 firms in each group
- 20% increase (from 50 to 60): 295 firms in each group
- 50% increase (from 50 to 75): 48 firms in each group (but this effect size is not realistic)
What if take-up is only 50%? Offer business training that increases profits by 20%, but only half the firms take it up. The mean for the treated group = 0.5*50 + 0.5*60 = 55, which is equivalent to detecting a 10% increase with 100% take-up: we need 1,178 firms in each group instead of 295.
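The slide's sampsi results can be reproduced with the standard two-sample formula plus an ANCOVA-style baseline-adjustment factor (1 − r²), where r is the baseline/endline correlation (the r1(0.5) in the Stata call). Treating the adjustment this way is an assumption about what sampsi does internally, but it matches the numbers on the slide:

```python
from math import ceil
from statistics import NormalDist

def n_ancova(mean0, mean1, sd, r, alpha=0.05, power=0.8):
    # Two-sample size formula with an ANCOVA-style baseline adjustment:
    # N = 2*sd^2*(1 - r^2)*(z_{1-alpha/2} + z_{power})^2 / delta^2 per group
    z = NormalDist().inv_cdf
    delta = mean1 - mean0
    n = 2 * sd**2 * (1 - r**2) * (z(1 - alpha / 2) + z(power))**2 / delta**2
    return ceil(n)

print(n_ancova(50, 55, sd=50, r=0.5))   # -> 1178 firms per group (10% increase)
print(n_ancova(50, 60, sd=50, r=0.5))   # -> 295  (20% increase)
print(n_ancova(50, 75, sd=50, r=0.5))   # -> 48   (50% increase)
```

Setting r=0 recovers the no-baseline case, which shows directly how much the baseline survey buys: a 25% reduction in the required sample at r=0.5.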