The Evidence for Rank-Dependent Expected Utility: A Case of Over-fitting Laboratory Data

by Dale O. Stahl
Malcolm Forsman Centennial Professor
Department of Economics
University of Texas at Austin
stahl@eco.utexas.edu

November 2016

ABSTRACT

A re-examination of experiment data reveals evidence of substantial noise and over-fitting of that noise by Rank-Dependent Expected Utility (RDEU) models. Further, we find that the simple expected utility (EU) model predicts out-of-sample better than all the enhanced RDEU models. The central lesson is that it is vitally important to guard against over-fitting laboratory data. Finally, we argue that, due to the need for adequate monetary incentives and the sample-size limitations of laboratory data, there is a limit to how confident we can be in our theories of lottery choices.

J.E.L. Classifications: C52, C53, C91, D81

1. Introduction.

It is widely thought that there is conclusive evidence that the behavior of subjects in laboratory experiments violates Expected Utility (EU) theory. Indeed, this conclusion has spurred a voluminous literature on non-EU theories (Machina, 2008). However, this conclusion may be premature. Early evidence was provided by Allais-type choice data, but the data was for hypothetical choices. Additional evidence comes from preference reversal experiments (e.g. Grether and Plott, 1979) in which subjects choose one lottery over another but assign a higher willingness-to-sell price to the latter. Instead of violating EU, this phenomenon could be consistent with EU and path-dependent preferences (e.g. because the default endowment in the choice task is different from the endowment in the selling task).[1] Furthermore, focusing on tests of Allais-type choices runs the risk of over-fitting this class of choices, resulting in a revised theory that performs less well than EU on larger domains. Finally, since humans are not infallible, when analyzing experiment data one must disentangle mistakes from systematic violations of the theory being tested.[2]

An alternative to EU is Rank-Dependent Expected Utility (RDEU), first introduced by Quiggin (1982, 1993) and popularized as Cumulative Prospect Theory (Tversky and Kahneman, 1992). Since RDEU nests EU, for clarity in this paper RDEU will refer to specifications that are not equivalent to EU. Laboratory experiments on lottery choices provide data for estimating these models. Since RDEU models nest EU models, they necessarily fit laboratory data on lottery choices better than EU models. But is this better fit an artifact of the increased complexity of RDEU, which renders it susceptible to over-fitting (i.e. fitting noise), especially given the limited sample size of the data? Since tests based on asymptotic distributions are not valid for small sample sizes, it is vitally important to address the possibility of over-fitting laboratory data.

A standard statistical method to assess over-fitting is to divide the data into an estimation subset and a prediction subset. One estimates the model parameters on the estimation subset, and then tests whether this fitted model is the data generating process for the prediction subset.

[1] Other explanations include framing effects: i.e. selling conjures strategic bargaining behavior in which it can be optimal to overstate one's willingness-to-sell price.
[2] Harrison and Rutström (2008) provide a thorough review of the literature.

A second method is to estimate two models (say RDEU and EU) on the estimation subset, and then compute the likelihood of the prediction subset conditional on the fitted models. If the more complex model performs worse than the simpler model on this prediction task, then it is quite possible that the complex model over-fit the estimation data; in other words, the data generating process was in fact the simpler model plus noise, but the complex model was able to fit some of the noise in the estimation subset, which led to poor prediction performance out-of-sample.

This paper uses data from two comprehensive experiments in which the subjects faced a variety of choice tasks, each involving two lotteries: Hey and Orme (1994) and Harrison and Rutström (2009). The number of tasks is large enough to allow for partitioning into two similar subsets of tasks to be used for over-fitting tests. The original papers did not perform any over-fitting tests. Our test results indicate that the RDEU models are prone to over-fitting the data. First, we can reject the hypothesis that the fitted models on the estimation subset of the data are the data generating processes for the prediction subset. Second, we find that the simple EU model predicts better than all the enhanced RDEU models.

The paper is organized as follows. Section 2 describes the HO and HR data sets. Section 3 specifies the encompassing RDEU models. Section 4 presents the over-fitting tests. Section 5 concludes with a discussion.

2. The Data

Hey and Orme (1994; hereafter HO) is one of the first papers to confront a variety of decision theories with experimental data from a large number (100) of choice tasks.[3]

[3] These 100 tasks were presented to the same subjects again one week later. We do not consider that data here because the hypothesis that the model parameters that best fit the first 100 choices are the same as those that best fit the second 100 choices is rejected. Possible explanations for this finding are (i) that learning took place between the sessions, (ii) preferences changed due to a change in external (and unobserved) circumstances, and (iii) the subjects did not have stable preferences. Therefore, we focus our attention on the first 100 choice tasks.

Each task was a choice between two lotteries with three prizes drawn from the set {0, 10, 20, 30}. After all choices were completed, one task was randomly selected and the lottery the subject chose was carried out to determine the monetary payoff. On average, the difference in the expected monetary value of the two lotteries was about 5% of the top prize of 30 (roughly 1.50), so the expected monetary incentive for each choice task was about 1.50/100, roughly $0.02. To each decision theory, the authors appended a probit-like stochastic error specification, and computed maximum likelihood estimates of the model and error parameters for each of 80 subjects. They conclude: "Our study indicates that behavior can be reasonably well modelled (to what might be termed a 'reasonable approximation') as EU plus noise. Perhaps we should now spend some time on thinking about the noise, rather than about even more alternatives to EU."[4,5] Our results reinforce their conclusion.

Harrison and Rutström (2009; hereafter HR) is the most recent large-scale study of binary lottery choices. Each task was a choice between two lotteries with three possible prizes drawn from the set {$0, $5, $10, $15}.[6] There were 30 distinct tasks, each presented twice (in shuffled order) to 63 subjects. After all choices were made, three tasks were randomly selected and the lotteries that had been chosen were carried out to determine the monetary payoffs. On average, the difference in the expected monetary value of the two lotteries was about 5% of $15 = $0.75, so the expected monetary incentive for each choice task was $0.75/20 = $0.0375. The authors' purpose was not to exhaustively test alternative decision theories, but instead to investigate how well a mixture model of behavior could fit the data. Their mixture model consisted of EU and Separable Prospect Theory. They conclude that a 50:50 mixture is superior to either component by itself.

The low expected payoffs in these two experiments raise suspicions that the economic incentives may not have been sufficient to elicit careful effort.[7] Since each distinct task in the HR experiment was presented twice, we can match up the responses and ask how many choices differ between the first and second presentation of the task. The number of inconsistent choices ranges from 0 to 14 (out of 30), with an average of 26%. In other words, 74% of the choices are the same across matched tasks. This sounds good until we compare it to the null hypothesis that all choices are uniformly random. Under that null, given the first 30 choices, we expect 15 of the second choices to be different and 15 to be the same, and observing 10 or more differences has a probability of about 0.95, so we could not reject the null at the 5% significance level for such subjects. In fact, 22 (34.9%) of the subjects have 10 or more differences, and 8 subjects have 12 or more differences. Therefore, the HR data set could contain a substantial amount of decision noise. Over-fitting is an increasing danger when there is substantial noise in the data.

[4] Loomes and Sugden (1998) is a similar study to Hey and Orme (1994), except that their analysis of the data is based on non-parametric tests involving the number of reversals and violations of dominance.
[5] Wilcox (2008, 2011) uses the entire HO data set to carefully study alternative stochastic specifications and his contextual utility model. Briefly, contextual utility essentially rescales the payoffs for each choice task to a [0, 1] scale based on the minimum and maximum payoff in that choice task. He estimates a random-parameter econometric model and finds that contextual utility fits and forecasts best. We take a complementary Bayesian approach to characterizing the heterogeneity across subjects, and use a different out-of-sample forecasting test.
[6] Another experiment was conducted on the loss domain.
[7] The problem of flat payoff functions was forcefully raised by Harrison (1989). Another criticism of the random lottery incentive mechanism used in these experiments is the implicit assumption of the Compound Independence Axiom (Harrison and Swarthout, 2014).
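As an illustrative cross-check of this calculation (ours, not the authors' code), the binomial probability under the uniform-random null can be computed directly; the per-subject disagreement counts below are hypothetical examples, not the HR data.

```python
# Illustrative check of the uniform-random-choice null for the HR consistency comparison.
# Under the null, the number of disagreements between the two presentations of the
# 30 distinct tasks is Binomial(30, 0.5).
from scipy.stats import binom

n_tasks = 30
p_null = 0.5

# Probability of observing 10 or more disagreements under the null:
# large (close to 1), so many disagreements are entirely unsurprising under pure noise.
p_ge_10 = 1.0 - binom.cdf(9, n_tasks, p_null)
print(f"P(disagreements >= 10 | random choice) = {p_ge_10:.3f}")

# Hypothetical per-subject disagreement counts (the real counts range from 0 to 14).
example_counts = [3, 7, 10, 12, 14, 5]
share_like_noise = sum(c >= 10 for c in example_counts) / len(example_counts)
print(f"Share of subjects with >= 10 disagreements: {share_like_noise:.2f}")
```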

3. The Rank-Dependent Utility Model.

A convenient encompassing model is Rank-Dependent Expected Utility (RDEU) [Quiggin (1982, 1993)], which nests EU.[8] RDEU allows subjects to modify the rank-ordered cumulative distribution function of lotteries as follows. Let Y ≡ {y0, y1, y2, y3} denote the set of potential outcomes of a lottery, where the outcomes are listed in rank order from worst to best. Given a rank-ordered cumulative distribution F for a lottery on Y, let Fj denote the cumulative probability up to and including yj. It is assumed that the subject transforms Fj by applying an increasing function H(Fj) with H(0) = 0 and H(1) = 1. From this transformation, the individual derives modified probabilities of each outcome:

    h0 = H(F0),  h1 = H(F1) − H(F0),  h2 = H(F2) − H(F1),  and  h3 = 1 − H(F2).      (1)

[8] This model is the same as the Cumulative Prospect Theory (Tversky and Kahneman, 1992) model restricted to non-negative monetary outcomes.

Common parametric specifications of the transformation function are

    H(Fj) ≡ (Fj)^β / [(Fj)^β + (1 − Fj)^β],                        (2a)
    H(Fj) ≡ (Fj)^β / [(Fj)^β + (1 − Fj)^β]^(1/β),                  (2b)
    H(Fj) ≡ α(Fj)^β / [α(Fj)^β + (1 − Fj)^β],   α > 0,             (2c)

where β > 0. Arguing from symmetry that H(0.5) should equal 0.5, Quiggin (1982) recommended eq. (2a). Tversky and Kahneman (1992) suggested eq. (2b) because it allows the interior fixed point to differ from 0.5. Lattimore et al. (1992) suggested eq. (2c), which allows a greater range of shapes and fixed points.[9] For ease of reference, RDEU0 will refer to the model with eq. (2a), RDEU1 to the model with eq. (2b), and RDEU2 to the model with eq. (2c).

Given value function v(yj) for potential outcome yj, the rank-dependent expected utility is

    U(F) ≡ Σj v(yj) hj(F).                                          (3)

To confront the RDEU model with binary choice data (F^A vs. F^B), we assume a logistic choice function:

    Prob(F^A) = exp{γU(F^A)} / [exp{γU(F^A)} + exp{γU(F^B)}],       (4)

where γ ≥ 0 is the precision parameter. Without loss of generality, we can assign a value of 0 to the worst outcome and a value of 1 to the best outcome.[10] Accordingly, we specify v0 ≡ v(y0) = 0 and v3 ≡ v(y3) = 1.

[9] Prelec (1998) provides an axiomatic foundation for an alternative two-parameter transformation with a fixed point near 1/3; however, because we found that the Prelec specification fit the data much worse than any of the other specifications, we do not pursue it in this paper.
[10] Since we estimate one precision parameter for all choice tasks, this scale specification is not simply the assumption of affine invariance; it is also an assumption about the magnitude of noise implicit in the logistic function relative to the payoffs. Wilcox (2008) argues for a re-scaling for each choice task. While we agree that re-scaling may be needed for diverse choice tasks, we feel that in the context of the HO and HR tasks, since all four payoffs were encountered many times in succession, a re-scaling for the entire set is more appropriate. To test our intuition, we estimated the Wilcox-type EUT model for the HO data (which he used), and we found it fit slightly worse than an EUT model with one precision parameter. This different finding may be due to our using only the first 100 tasks of HO and estimating individual parameters rather than a random coefficient specification.
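To make eqs. (1)-(4) concrete, the following sketch (ours, not the authors' estimation code) evaluates the RDEU choice probability for a single binary task using the eq. (2c) transformation; the parameter values are arbitrary illustrations, not estimates.

```python
import numpy as np

def H(F, beta, alpha=1.0):
    """Probability transformation of eq. (2c); alpha = 1 gives the eq. (2a) form."""
    F = np.asarray(F, dtype=float)
    return alpha * F**beta / (alpha * F**beta + (1.0 - F)**beta)

def rdeu(probs, values, beta, alpha=1.0):
    """Rank-dependent expected utility, eqs. (1) and (3).
    probs: probabilities of the rank-ordered outcomes y0 (worst) ... y3 (best).
    values: v(y0), ..., v(y3) with the normalization v(y0) = 0 and v(y3) = 1."""
    F = np.cumsum(probs)                        # rank-ordered cumulative distribution
    HF = H(F, beta, alpha)
    HF[-1] = 1.0                                # H(1) = 1 by assumption
    h = np.diff(np.concatenate(([0.0], HF)))    # modified probabilities, eq. (1)
    return float(np.dot(values, h))             # eq. (3)

def choice_prob(probs_A, probs_B, values, gamma, beta, alpha=1.0):
    """Logistic probability of choosing lottery A over B, eq. (4)."""
    uA = rdeu(probs_A, values, beta, alpha)
    uB = rdeu(probs_B, values, beta, alpha)
    return 1.0 / (1.0 + np.exp(-gamma * (uA - uB)))

# Illustrative (not estimated) parameter values.
values = [0.0, 0.35, 0.70, 1.0]    # v0, v1, v2, v3
print(choice_prob([0.2, 0.3, 0.3, 0.2], [0.0, 0.5, 0.5, 0.0],
                  values, gamma=20.0, beta=0.8, alpha=1.2))
```

Setting β = 1 (and α = 1) makes H the identity, so the same routine reproduces the nested EU model.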

This leaves two free utility parameters: v1 ≡ v(y1) and v2 ≡ v(y2). Hence, the empirical EU model entails three parameters (γ, v1, v2), the RDEU0 and RDEU1 models entail four parameters (γ, v1, v2, β), and the RDEU2 model entails five parameters (γ, v1, v2, β, α). It is obvious that RDEU2 nests RDEU0 (when α = 1), and that RDEU0 and RDEU1 nest EU (when β = 1).

Table 1 shows the mean and standard deviation of the estimated parameters for each individual subject of the HO data. Instead of reporting the values for the precision parameter γ, it is more meaningful to report a behavioral measure:

    p(γ) ≡ 1/[1 + exp(−0.05γ)],                                     (5)

which is the probability that an option with 5% more value will be chosen; conversely, 1 − p(γ) is the probability of a mistake.

Table 1: HO Data (standard deviations in parentheses)
                 EU          RDEU0       RDEU1       RDEU2
p(γ)             (0.0989)    (0.0943)    (0.0966)    (0.1134)
v1               (0.1747)    (0.1764)    (0.1782)    (0.1950)
v2               (0.0960)    (0.1210)    (0.1039)    (0.1273)
β                n/a         (0.2234)    (0.736)     (0.2643)
α                n/a         n/a         n/a         (0.596)
LL
2(LL − LL_EU)

The next-to-last row shows the maximized log-likelihoods (LL) summed over all 80 subjects. The last row gives twice the difference between the LL of the RDEUn model and the EU model. Twice the difference between EU and RDEU0 (439.50) is distributed Chi-square with 80 d.f., and has a p-value of 0. Therefore, the superior fit of the RDEU models relative to EU is clearly statistically significant. Note that the estimates for p(γ), v1 and v2 are fairly tight and stable across all models. However, the β and α parameter estimates vary widely across subjects, raising concerns about possible over-fitting.

Table 2 shows the mean and standard deviation of the estimated parameters for each individual subject of the HR data, as well as the sum of the maximized log-likelihoods over all 63 subjects.

Table 2: HR Data (standard deviations in parentheses)
                 EU          RDEU0       RDEU1       RDEU2
p(γ)             (0.1160)    (0.1085)    (0.1138)    (0.1212)
v1               (0.2160)    (0.1990)    (0.1782)    (0.2510)
v2               (0.1432)    (0.1142)    (0.1039)    (0.1689)
β                n/a         (1.371)     (0.736)     (1.643)
α                n/a         n/a         n/a         (2.709)
LL
2(LL − LL_EU)

Twice the difference between EU and RDEU0 (330.28) is distributed Chi-square with 63 d.f., and has a p-value of 0. Therefore, the superior fit of the RDEU models relative to EU is clearly statistically significant. Note that the estimates for p(γ), v1 and v2 are fairly tight and stable across all models. However, the β parameter estimates vary widely across subjects and across RDEU models, and the α parameter estimate for the RDEU2 model varies widely across subjects, raising concerns about possible over-fitting.

While the RDEU models fit the data better, since the incentive for choosing the lottery with the higher expected monetary value was less than $0.05 per task, it could be that the β (and α) parameters of the RDEU models are merely fitting behavioral noise.[11]

[11] Perhaps related to this conjecture is the finding of Harrison and Swarthout (2014) that the random lottery incentive mechanism appears to matter when estimating RDEU but not when estimating EUT.

4. Over-fitting Tests.

The estimation and prediction subsets of the data were selected as follows. For the HR data, since the experiment entails exactly two presentations of 30 distinct tasks, it is natural to use the first presentation as the subset on which to estimate the parameters, and then to forecast the behavior for the second presentation. For the HO data, all 100 tasks were unique, so for the over-fitting exercise we simply split the data into the first 50 and second 50 tasks. We estimated the RDEU and EU models on the first 50, and used those parameter estimates to forecast the choices for the second 50.

We performed four tests for over-fitting. The first test is for parameter stability: are the maximum likelihood parameter estimates for the prediction dataset the same as the estimates for the estimation dataset? We can reject this hypothesis at all levels of confidence. The second test is for comparative prediction performance across the RDEU models and the EU model. Using the maximum likelihood parameter estimates of each model for the estimation dataset, we compute the log-likelihood of the prediction dataset. We find that while the better fit of the RDEU models on the estimation dataset is statistically significant, their prediction performance is worse than that of the EU model. The third test uses simulated data. Using the maximum likelihood parameter estimates from the estimation dataset, we simulate choices for each subject to form a pseudo dataset. Then, we re-estimate the parameters of the models on the estimation subset of the simulated data, and ask how that estimated model performs on the prediction subset of the simulated data. We find that the EU model performs at least as well as the RDEU models. The fourth test asks whether the non-parametric utility specification, rather than the probability transformation, is responsible for the over-fitting (subsection d).

a. Parameter Stability

The first test of over-fitting is to estimate the parameters on the estimation subset of the data, and then test whether this fitted model is the data generating process for the prediction subset. Specifically, we use the maximum likelihood method to estimate the parameters on the estimation subset.

Using that fitted model for the estimation subset, we compute the log-likelihood of the prediction subset of the data: call it LL_e. Next, we estimate the parameters on the prediction subset, and compute the maximized log-likelihood of that subset: call it LL_p. To compare, we compute LL_p − LL_e. Twice this difference is asymptotically distributed Chi-square with degrees of freedom equal to the number of model parameters times the number of subjects.

Table 3: LL_p − LL_e on Prediction Subset (columns: EU, RDEU0, RDEU1, RDEU2; rows for each of the HO and HR data sets: LL_p − LL_e and % failure)

The first noteworthy result is that LL_e was −∞ for several cases; the number in parentheses after a −∞ entry is the number of subjects for which the difference was −∞. Essentially, the fitted model put probability smaller than the floating-point precision of the computer on some choice which the subject actually chose, making LL_e not a number, which we interpret as −∞. For the other cases (with finite LL_e), it is clear that we can strongly reject the hypothesis that the fitted models on the estimation data subset are the data generating processes for the prediction subsets.[12]

Rather than conducting the test on the sum of all the individual log-likelihood differences (as was reported above), an alternative is to compute the test for each individual subject. Doing so, and accounting for the degrees of freedom, the percentage of individual subjects who fail the Chi-square test is given in the rows of Table 3 labeled "% failure". These percentages are unacceptably high, and are a strong indication that the parameters are not stable across the tasks.

[12] The p-values are essentially 0.
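A minimal sketch of this parameter-stability check, assuming generic placeholder routines fit_mle(data) and nll(params, data) for whichever model is being tested (both hypothetical stand-ins, not the estimation code actually used), might look as follows.

```python
from scipy.stats import chi2

def stability_test(fit_mle, nll, est_data, pred_data, n_params):
    """Compare LL_e (prediction subset evaluated at estimation-subset MLEs)
    with LL_p (prediction subset evaluated at its own MLEs)."""
    params_est = fit_mle(est_data)            # MLE on the estimation subset
    params_pred = fit_mle(pred_data)          # MLE on the prediction subset
    ll_e = -nll(params_est, pred_data)        # out-of-sample log-likelihood
    ll_p = -nll(params_pred, pred_data)       # in-sample log-likelihood
    stat = 2.0 * (ll_p - ll_e)                # asymptotically Chi-square
    p_value = chi2.sf(stat, df=n_params)
    return ll_e, ll_p, p_value

def failure_rate(fit_mle, nll, subjects, n_params):
    """Percentage of subjects failing the per-subject Chi-square test at 5%.
    subjects: hypothetical list of (estimation, prediction) data pairs."""
    fails = [stability_test(fit_mle, nll, est, pred, n_params)[2] < 0.05
             for est, pred in subjects]
    return 100.0 * sum(fails) / len(fails)
```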

b. Prediction Tests

The second test is to compare the prediction performance of the models. For each model we estimated the parameters for each individual subject on the estimation subset and used the fitted model to compute the log-likelihood of the prediction subset. Table 4 presents the maximized log-likelihoods (LL) summed over all subjects for the estimation subset.

Table 4: LL for Estimation Subset (columns: EU, RDEU0, RDEU1, RDEU2; rows: HO Data, HR Data)

Clearly, the EU model fits worst and the RDEU2 model fits best on the estimation subsets. The rankings are the same as in Tables 1 and 2, which used the entire data sets. On an individual basis, at the 5% significance level, we can reject EU in favor of RDEU2 for 38 out of 80 of the HO subjects and for 32 out of 63 of the HR subjects.

Table 5 presents the maximized log-likelihoods (LL) summed over all subjects for the prediction subsets. Remarkably, by the aggregated LL criterion, the EU model predicts better than all the RDEU models, and the RDEU2 model predicts worst.

Table 5: LL for Prediction Subset
                  EU      RDEU0     RDEU1     RDEU2
HO Data: LL
   % < EU         —       45.40%    48.75%    63.75%
HR Data: LL
   % < EU         —       55.56%    53.97%    58.73%

The infinities appear exactly for those cases in which −∞ appears in Table 3. To avoid inferring too much from these extreme outcomes, we consider an alternative measure of prediction performance: whether the predicted log-likelihood for an individual subject using the EU model is greater than the predicted log-likelihood using the RDEUn model. The "% < EU" rows of Table 5 give the percentage of subjects for whom the RDEUn model predicted worse than the EU model. These percentages are unacceptably high, and are a strong indication that the RDEU models have over-fit the data.
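The "% < EU" tallies can be computed along these lines, assuming a hypothetical dictionary ll_pred that maps each model name to its per-subject out-of-sample log-likelihoods (aligned by subject); this is an illustrative sketch, not the code behind Table 5.

```python
def pct_worse_than_eu(ll_pred, models=("RDEU0", "RDEU1", "RDEU2")):
    """Percentage of subjects for whom a given RDEU variant has a lower
    predicted (out-of-sample) log-likelihood than EU."""
    n = len(ll_pred["EU"])
    return {m: 100.0 * sum(ll_pred[m][i] < ll_pred["EU"][i] for i in range(n)) / n
            for m in models}
```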

c. Predicted LL Using Simulated Data

For each individual subject, we take the parameter estimates for the EU model obtained from the estimation subset of the real data, and using the fitted EU model we generate S pseudo datasets of the estimation and prediction subsets; call these x_s for s = 1, ..., S. We then compute the log-likelihood of each prediction subset for each model: LL_RDEUn(x_s) and LL_EU(x_s), s = 1, ..., S. Since the RDEUn model nests the EU model, under the null hypothesis of no over-fitting by RDEUn, when EU is the data generating process we would expect roughly half of the LL_RDEUn(x_s) − LL_EU(x_s) to be positive and half negative; if a substantial proportion are negative, then over-fitting is a likely explanation.

We performed these simulations with S = 199. For the HO data, since there were 80 subjects, there are 80 estimates of each model's parameters, so we generated 1520 pseudo datasets. For the HR data, since there were 63 subjects, there are 63 estimates of each model's parameters, so we generated 1197 pseudo datasets. Table 6 presents the results of these simulations. The DGP column gives the data generating process for the simulated x_s, either EU or RDEUn, where n = 0, 1, 2 for the third through fifth columns. The EU rows give the results for the simulations just described when the EU model is the data generating process; the percentage of the simulated datasets for which LL_RDEUn(x_s) − LL_EU(x_s) was negative is given in the other columns of those rows.

Table 6: Percentage of Simulated LL_RDEUn − LL_EU < 0
        DGP       RDEU0    RDEU1    RDEU2
HO      EU        67.1%    67.2%    96.8%
        RDEUn     38.2%    33.2%    30.6%
HR      EU        68.9%    69.2%    80.2%
        RDEUn     41.1%    41.6%    39.9%

Clearly, when the EU model is the true data generating process, the majority of the simulations have LL_RDEUn(x_s) − LL_EU(x_s) < 0, strongly suggesting that the poor performance with the real data, as revealed in Table 5, is due to over-fitting by the RDEUn models.
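A schematic version of this simulation exercise for one subject, with hypothetical helpers simulate_choices, fit_model, and loglik standing in for the actual estimation machinery, is sketched below; it illustrates the procedure, not the code used to produce Table 6.

```python
import numpy as np

def overfit_simulation(params_eu, tasks_est, tasks_pred,
                       simulate_choices, fit_model, loglik, S=199, seed=0):
    """For one subject: generate S pseudo datasets from the fitted EU model,
    re-estimate EU and RDEU on the pseudo estimation subset, and record how
    often RDEU predicts the pseudo prediction subset worse than EU."""
    rng = np.random.default_rng(seed)
    worse = 0
    for _ in range(S):
        est_choices = simulate_choices("EU", params_eu, tasks_est, rng)
        pred_choices = simulate_choices("EU", params_eu, tasks_pred, rng)
        p_eu = fit_model("EU", tasks_est, est_choices)      # re-estimate on pseudo data
        p_rdeu = fit_model("RDEU", tasks_est, est_choices)
        ll_eu = loglik("EU", p_eu, tasks_pred, pred_choices)
        ll_rdeu = loglik("RDEU", p_rdeu, tasks_pred, pred_choices)
        worse += (ll_rdeu - ll_eu) < 0
    return 100.0 * worse / S    # percentage entering the EU rows of Table 6
```

The RDEUn-as-DGP rows of Table 6 follow from the same loop with the fitted RDEUn model used as the simulator.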

Similarly, for each individual subject, we take the parameter estimates for the RDEUn model obtained from the estimation subset of the real data, and using the fitted RDEUn model we generate S pseudo datasets of the estimation and prediction subsets. We then compute the log-likelihood of each prediction subset for each model: LL_RDEUn(x_s) and LL_EU(x_s), s = 1, ..., S. Since the EU model is a restriction of the RDEUn models, under the null hypothesis of no over-fitting by RDEUn when RDEUn is the data generating process, we would expect LL_RDEUn(x_s) − LL_EU(x_s) to be positive for most of the simulations; if a substantial proportion are negative, then over-fitting is a likely explanation. As we can see from the RDEUn rows of Table 6, when the RDEUn model is the true data generating process, a substantial proportion of the simulations still have LL_RDEUn(x_s) − LL_EU(x_s) < 0, again suggesting that the RDEUn models are prone to over-fitting. It is a reasonable conjecture that the extra parameters of the RDEUn models, which affect the shape of the probability transformation function H(F), are fitting behavioral noise in the estimation dataset.

d. Other Potential Sources of Mis-specification.

In addition to the extra parameters of the RDEU model, one can ask whether the non-parametric specification of the utility function is also a source of over-fitting. A commonly used parametric specification is the constant relative risk aversion (CRRA) form:

    v(yi) = (yi)^r,                                                 (6)

where r ≥ 0. For the HO and HR experiments and our normalization, v0 = 0, v1 = (1/3)^r, v2 = (2/3)^r and v3 = 1. Let EU_C denote the simple EU model with CRRA utility, and let RDEUn_C denote the RDEUn model with CRRA utility. Table 8 presents the maximized log-likelihoods on the estimation subsets. Not surprisingly, we see that the RDEUn_C model fits better than the EU_C model. Comparing Tables 4 and 8, we see that the CRRA versions do not fit as well as the non-parametric versions.

Table 8: LL for Estimation Subsets (columns: EU_C, RDEU0_C, RDEU1_C, RDEU2_C; rows: HO Data, HR Data)
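Under this normalization, the CRRA form pins down the two interior utilities from the single parameter r; a tiny illustration (the value of r is arbitrary, not an estimate):

```python
def crra_values(r):
    """Normalized CRRA utilities v(y) = y**r over the four equally spaced prizes,
    with v0 = 0 and v3 = 1 as in eq. (6)."""
    return [0.0, (1/3)**r, (2/3)**r, 1.0]

print(crra_values(0.5))   # e.g. r = 0.5 gives v1 ~ 0.577 and v2 ~ 0.816
```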

Table 9 presents the log-likelihoods on the prediction subsets using the fitted model from the estimation subset (analogous to Table 5). Clearly, the RDEUn_C model does not predict as well as the EU_C model. Therefore, the poor prediction performance of RDEUn revealed in Table 5 is not due solely to the non-parametric specification of utility.

Table 9: LL for Prediction Subsets (columns: EU_C, RDEU0_C, RDEU1_C, RDEU2_C; rows: HO Data, HR Data)

5. Conclusions and Discussion.

Our findings strongly suggest that, with the sample sizes available from laboratory experiments, the enhanced RDEU models are prone to over-fit the data. We have also ruled out the possibility that the poor prediction performance of the RDEU models is due solely to the non-parametric utility specification. If we value prediction performance, our analysis suggests that one should use the simple EU model rather than any enhanced RDEU model.

The central lesson is that it is vitally important to guard against over-fitting laboratory data. The standard method is to use a portion of the data for estimation and the remainder of the data for testing prediction performance. Of course, if we had a very large number of observations for each individual, over-fitting would cease to be a problem. Unfortunately, there are natural limits to the sample size from laboratory experiments. The two data sets used in this paper are large in comparison to most experiment data. Obtaining hundreds of observations for each individual invites behavioral noise via fatigue and boredom. Collecting observations from the same individual over several sessions opens the door to learning and to changes in preferences due to changing individual circumstances.

Given these limitations of laboratory data, what are the prospects for future testing of EU theory? Our theories assume that our subjects have a definite way to compare lotteries, such as expected value.

However, without a calculator, they cannot accurately compute expected values, and they may very well fail to see a sensible and easy way to measure the tradeoff (e.g. the ratio of the change in the probability of y3 to the change in the probability of y0). Consequently, the choice is often made in a state of confusion, driven by emotions of fear and anxiety. Perhaps the subjects seize upon some simplistic rule, such as "choose the option with the highest probability of the middle prize" or "avoid the option with the highest probability of the lowest prize", for lack of a better, readily available coherent rule. Moreover, they may randomly use different ad hoc rules for different choice tasks. But then the observed choices do not reveal the subjects' true preferences, where the latter presuppose fixed rational preferences and an understanding of the options sufficiently rich to make comparisons between the options straightforward. Therefore, perhaps we should investigate alternative ways of presenting the options that make the pertinent differences more transparent.

While a more informative/transparent presentation may reduce confusion, there will remain the problem that providing monetary incentives for the choice tasks and identifying indifference curves are conflicting objectives. The latter requires lotteries for which the subject is nearly indifferent, but consequently the monetary incentives for such choice tasks are very small. Alternative sets of lotteries could tighten the behaviorally distinguishable regions, but there is a limit to such tightening due to the conflicting requirements of near-indifferent lotteries and adequate monetary incentives. As we show next, this limitation is not restricted to the multiple-choice design of the HO and HR experiments.

A technique often used to assess risk preferences is a variation of the BDM (Becker-DeGroot-Marschak) method. The subject is presented with a safe lottery, say (0, 1, 0) on {$0, $20, $40}, and a risky lottery (1 − p, 0, p) on {$0, $20, $40}. The subject is asked to state a cutoff value for p. Afterwards, a random number, x, is drawn from a uniform distribution on [0, 1]. If x ≤ p, the subject gets the safe lottery, and if x > p, the subject gets the risky lottery but with p = x. W.l.o.g. taking U($40) = 1, U($20) = u and U($0) = 0, the expected utility of stating cutoff p is

    EU(p|u) = ∫₀^p u dx + ∫_p^1 x dx = up + (1 − p²)/2 = −0.5(u − p)² + 0.5(1 + u²).

The unique optimal choice is p = u. Thus, it would appear that this one-choice task could reveal u. However, since EU(p|u) is quadratic in (u − p), the cost of small errors is very small. A logit choice model will put non-negligible probability on a wide range of values of p around u. Indeed, if the logit precision is less than 200, the 90% confidence interval for u will be at least (u − 0.15, u + 0.15), which is unacceptably large.
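Treating the logit density over the stated cutoff p, proportional to exp{γ·EU(p|u)}, as Gaussian in p with standard deviation 1/√γ (a simplifying approximation that follows from the quadratic form of EU(p|u) and ignores the truncation of p to [0, 1]), the implied interval width can be sketched as follows; the exact figures depend on how the precision is normalized.

```python
import math

def bdm_interval_halfwidth(gamma, z=1.645):
    """Approximate 90% interval half-width for the stated BDM cutoff p around u,
    using the Gaussian approximation with standard deviation 1/sqrt(gamma)."""
    return z / math.sqrt(gamma)

for gamma in (50, 100, 200):
    print(gamma, round(bdm_interval_halfwidth(gamma), 3))  # half-width shrinks only as 1/sqrt(gamma)
```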

Empirically, we rarely find precisions exceeding this level.[13] Therefore, the BDM method is not clearly better than a multi-task method for which the expected consequences are directly proportional to (u − p).[14]

As another alternative, consider a sequential bisection process in which the subject is presented with a sequence of 6 choice tasks similar to BDM, but the specific sequence depends on the subject's prior choices. Unfortunately, paying for only 1 out of 6 choices effectively reduces the precision by 1/6, in which case the standard deviation is as high as 0.11, making the 90% confidence interval no better than in the single BDM method.[15] Hence, it appears that we are encountering a Heisenberg-like uncertainty principle: there is a limit to the information we can obtain about indifference curves in lottery space. If so, there is a corresponding limit to how confident we can be in our theories about lottery choices.

[13] Assuming a re-scaling of the payoffs so that the utility of the best is 1 and the utility of the worst is 0.
[14] Harrison (1992) and Harrison and Rutström (2008) make a similar argument.
[15] In addition, there is the problem of strategic manipulation of the stepwise process by the subject. Harrison and Rutström (2008) make a similar critique.

References

Grether, D. and Plott, C. (1979). Economic Theory of Choice and Preference Reversal Phenomena, American Economic Review, 69.

Harrison, G. W. (1989). Theory and Misbehavior of First-Price Auctions, American Economic Review, 79.

Harrison, G. W. (1992). Theory and Misbehavior of First-Price Auctions: Reply, American Economic Review, 82.

Harrison, G. W. and Rutström, E. (2008). Risk Aversion in the Laboratory, in J. C. Cox and G. W. Harrison, eds., Research in Experimental Economics, 12.

Harrison, G. W. and Rutström, E. (2009). Expected Utility and Prospect Theory: One Wedding and a Decent Funeral, Experimental Economics, 12.

Harrison, G. W. and Swarthout, J. T. (2014). Experimental Payment Protocols and the Bipolar Behaviorist, Theory and Decision, 77.

Hey, J. and Orme, C. (1994). Investigating Generalizations of Expected Utility Theory Using Experimental Data, Econometrica, 62.

Lattimore, P., Baker, J., and Witt, A. (1992). The Influence of Probability on Risky Choice: A Parametric Examination, Journal of Economic Behavior and Organization, 17.

Loomes, G. and Sugden, R. (1998). Testing Alternative Stochastic Specifications for Risky Choice, Economica, 65.

Machina, M. (2008). Non-expected Utility Theory, in The New Palgrave Dictionary of Economics (2nd edition), eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Prelec, D. (1998). The Probability Weighting Function, Econometrica, 66.

Quiggin, J. (1982). A Theory of Anticipated Utility, Journal of Economic Behavior and Organization, 3.

Quiggin, J. (1993). Generalized Expected Utility Theory: The Rank-Dependent Model, Kluwer Academic Publishers.

Tversky, A. and Kahneman, D. (1992). Cumulative Prospect Theory: An Analysis of Decision Under Uncertainty, Journal of Risk and Uncertainty, 5.

Wilcox, N. (2008). Stochastic Models for Binary Discrete Choice Under Risk: A Critical Primer and Econometric Comparison, in J. C. Cox and G. W. Harrison, eds., Research in Experimental Economics, 12.

Wilcox, N. (2011). Stochastically More Risk Averse: A Contextual Theory of Stochastic Discrete Choice Under Risk, Journal of Econometrics, 162.