Preference Reversals: The Impact of Truth-Revealing Incentives


By Joyce E. Berg, Department of Accounting, Henry B. Tippie College of Business, University of Iowa, Iowa City, Iowa; John W. Dickhaut, Department of Accounting, Carlson School of Management, University of Minnesota, Minneapolis, MN; and Thomas A. Rietz, Department of Finance, Henry B. Tippie College of Business, University of Iowa, Iowa City, Iowa

July 2002

* We thank several anonymous referees, Colin Camerer, Jim Cox, Philip Dybvig, Mike Ferguson, William Goldstein, Glenn Harrison, Charles Holt, Joel Horowitz, Jack Hughes, Charles Plott, Paul Schoemaker, Amos Tversky, Nathaniel Wilcox, and workshop participants at the University of Chicago and Cornell University for thought-provoking comments and conversation in the development of this paper.

Abstract

We examine the impact of truth-revealing incentives in preference reversal experiments. These experiments are first classified by incentive type: no monetary payments, monetary payments that are not a priori truth-revealing, and monetary payments that are truth-revealing. We find that the behavioral model that best explains preference reversal data is contingent on the nature of the incentives used in the experiment. In experiments with no monetary payments, random behavior or a model based on anchor and adjust biases best explains behavior. However, in experiments with truth-revealing incentives, an expected utility model with errors best explains behavior. In fact, we show that no model can possibly explain the data significantly better in experiments with truth-revealing incentives. Previous research failed to detect this pattern because the analyses emphasized reversal rates alone, which we show can be a poor indicator of incentive effects in these experiments.

How well does expected utility theory, with an additional assumption that subjects make random errors, perform in comparison with non-expected utility theories of choice? 1 This question has been the subject of much theoretical interest, continued laboratory investigation and intense computational exercises. We examine this question in the context of preference reversal and show that the answer is contingent on how experimenters administer payoffs. In particular, we ask how the type of monetary incentives (none, non-truth-revealing, or truth-revealing) affects the performance of various behavioral models of choice in preference reversal experiments. We find previously unrecognized regularities across data from numerous studies. Subjects appear to shift between different behavioral models depending on the nature of incentives. The data suggest that this shift is predictable.

Our conclusion that there are clear incentive effects seems to conflict with previous research documenting that monetary incentives do not appear to decrease reversal rates. Summarizing the evidence from three papers on preference reversals, Camerer and Hogarth (1999) conclude that there are no effects in two of three cases. In the third, they conclude that incentives actually resulted in less rational behavior according to a reversal rate metric. We do not argue with these conclusions. However, we show that when subjects have consistent preferences but make errors, changes in reversal rates can be poor measures of incentive effects. Why? Because more rational behavior, in the sense of reduced errors, can actually increase reversal rates. Our analysis explains why this is the case and uses a metric that takes this into account.

To understand our methods and conclusions, consider the idea of a preference reversal. In a typical preference reversal experiment, each subject ranks pairs of gambles according to two tasks: 1) a paired choice task and 2) a pricing task. A preference reversal occurs when, for

1 We are not the first to ask this question. See, for example, Hey and Orme.

identical pairs of gambles, a subject's decisions in the two tasks imply different preference orderings. The literature shows that incentives seldom affect the percentage of observed reversals (the "reversal rate") in preference reversal experiments. Here, we ask whether preference reversal rates are the appropriate metric for identifying the role of incentives in preference reversal experiments.

The key to observing incentive effects across these experiments is twofold: First, one must note that "preference reversals" and "errors" are not the same. Second, one must modify the lens through which one observes behavior accordingly. Notice that in a preference reversal experiment, subjects can err in reporting preferences in either one or both tasks. Cases in which otherwise rational subjects err in only one task are classified as preference reversals. However, cases in which such subjects err in both tasks are classified as non-reversals. In effect, two wrongs (in the error dimension) make a right (in the measured preference reversal dimension). Taking these observations into account, we propose an alternative measure based on the idea that incentives should reduce task error rates. Using this measure, we show that incentives actually do create predictable and observable differences in behavior across preference reversal experiments.

To obtain this result, we classify experiments according to three incentives levels: (1) no-incentives experiments, in which participants are not paid as a result of their choices, (2) indeterminate-incentives experiments, in which participants are paid, but the payment structure does not necessarily lead to truthful revelation as an optimal strategy under expected utility maximization, and (3) clear-incentives experiments, in which participants are paid using truth-revealing mechanisms.
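The "two wrongs make a right" point is easy to see numerically. Below is a minimal sketch (with hypothetical parameter values) using the two-error-rate parameterization discussed later in the paper: under it, the reversal rate reduces to r + s - 2rs regardless of the fraction q preferring the P-bet, so cutting one error rate can raise measured reversals when the other error rate exceeds one half.

```python
def cell_probs(q, r, s):
    """Predicted cell probabilities under a two-error-rate parameterization.

    q: fraction of subjects preferring the P-bet,
    r: choice-task error rate, s: pricing-task error rate.
    """
    a = q * (1 - r) * (1 - s) + (1 - q) * r * s      # consistent responses
    b = q * (1 - r) * s + (1 - q) * r * (1 - s)      # "predicted" reversals
    c = q * r * (1 - s) + (1 - q) * (1 - r) * s      # unpredicted reversals
    d = q * r * s + (1 - q) * (1 - r) * (1 - s)      # consistent responses
    return a, b, c, d

def reversal_rate(q, r, s):
    a, b, c, d = cell_probs(q, r, s)
    return (b + c) / (a + b + c + d)  # simplifies to r + s - 2*r*s

# Hypothetical rates: with a high pricing error rate (s = 0.9), *reducing*
# the choice error rate from 0.6 to 0.2 raises the reversal rate:
print(reversal_rate(0.7, 0.6, 0.9))  # ≈ 0.42
print(reversal_rate(0.7, 0.2, 0.9))  # ≈ 0.74
```

Since d(r + s - 2rs)/dr = 1 - 2s, the reversal rate rises as the choice error rate falls whenever s > 0.5; fewer errors in one task can therefore mean more measured reversals, which is the sense in which reversal rates are a poor proxy for error rates.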
To perform our analysis, we use maximum likelihood estimation to distinguish between parameterizations of several behavioral models of choice, including Lichtenstein and Slovic's (1971) two-error-rate model, which corresponds to a model of expected utility maximizing subjects who make errors. First, we develop the idea of a best-fit model that fits the data as well as any potential model could, in the sense of globally maximizing the likelihood across all

potential models of behavior for preference reversal experiments. In terms of likelihoods, the best-fit model is equivalent to any model that can be parameterized to explain any pattern of data. Second, we show how three behavioral models previously developed to explain behavior in choice experiments can be parameterized as nested models. Then, we show that one of them (the least restricted, three-error-rate model based on the idea that subjects anchor and adjust as hypothesized in Lichtenstein and Slovic, 1971, and Goldstein and Einhorn, 1987) can never be rejected because it makes no meaningful, testable restrictions on the observed patterns of behavior. Next, we show that the expected-utility-based, two-error-rate model (the next least restricted model) is generally strongly rejected in preference reversal experiments with no payments. In those experiments, the best-fit model (or any equivalent behavioral model) does a significantly better job of fitting the data. In stark contrast, we show that the two-error-rate model is never rejected when incentives are truth revealing. In fact, by comparing it to the best-fit model, we show that generally no other model can possibly fit the data significantly better than the two-error-rate model. Finally, in treatments with truth-revealing incentives, we show that the two-error-rate model fits the data significantly better than an even more restrictive model representing task dependent preferences. The stark contrasts across incentive treatments show the clear effects of incentives in preference reversal experiments. Our results strongly suggest that when incentives are introduced, subjects switch from one mode of behavior (random behavior or an anchor-and-adjust-based three-error-rate model) to another (expected utility maximization with errors).

I. Preference Reversal Tasks

In a typical preference reversal experiment, subjects evaluate pairs of gambles with approximately equal expected value.
One gamble (termed "the P-bet") has a high probability of winning a small amount of money. The other gamble (termed "the $-bet") has a low probability of winning a large amount of money. In general, while the expected values of each are

comparable, the variance of the $-bet is higher. For example, Grether and Plott (1979) contains the following gamble set: a gamble with a 32/36 chance of winning $4 and a 4/36 chance of losing $0.50 (the P-bet), and a gamble with a 4/36 chance of winning $40 and a 32/36 chance of losing $1.00 (the $-bet). This is a typical preference reversal experiment pair. While the gambles have approximately the same expected value ($3.50 versus $3.56), one has significantly lower variance than the other ($2.00 versus $166.02).

The subject participates in two tasks in a preference reversal experiment. Each task is designed to elicit the subject's preferences in the gamble pair. In the paired choice task, the subject is presented with the two gambles and asked which one is preferred. In the pricing task, the subject is shown each gamble independently and asked to state a minimum selling price for the gamble. The preference ordering implied by the choice task is compared to the ordering implied by the pricing task to determine whether the subject's orderings are consistent. A preference reversal occurs when the choice task and the pricing task imply different orderings for the two gambles. 2

The data from preference reversal experiments are generally presented as aggregate frequencies in four cells (when indifference is not allowed as a response) or nine cells (when indifference is allowed). Figure 1 shows a typical data set from Grether and Plott (1979), experiment 1b. We will focus our analysis on Cell a through Cell d. 3 Cells a and d represent consistent choice and pricing decisions. Cells b and c represent reversals: the least risky bet chosen but priced lower, and the most risky bet chosen but priced lower, respectively. If we use

2 More details of the experiments analyzed here are found in the appendix.

3 We do so for three reasons. First, some experimenters did not allow indifferent responses. Second, it is not always clear exactly what corresponds to a reversal when a subject gives an indifferent response in one of the tasks. For example, a subject may rationally state the same price while choosing one gamble over the other if the difference in preference is not sufficiently high to be observed by the price increments allowed. Third, the models we test will only make predictions across Cells a through d, so relative frequencies in these cells are all that we will use for testing.
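The expected values and variances quoted for the Grether and Plott pair can be verified directly; a quick sketch using exact rational arithmetic:

```python
from fractions import Fraction

def ev_and_var(outcomes):
    """Expected value and variance of a discrete gamble,
    given (probability, payoff) pairs."""
    ev = sum(p * x for p, x in outcomes)
    var = sum(p * (x - ev) ** 2 for p, x in outcomes)
    return float(ev), float(var)

# P-bet: 32/36 chance of winning $4, 4/36 chance of losing $0.50
p_bet = [(Fraction(32, 36), Fraction(4)), (Fraction(4, 36), Fraction(-1, 2))]
# $-bet: 4/36 chance of winning $40, 32/36 chance of losing $1.00
dollar_bet = [(Fraction(4, 36), Fraction(40)), (Fraction(32, 36), Fraction(-1))]

print(ev_and_var(p_bet))       # (3.5, 2.0)
print(ev_and_var(dollar_bet))  # ≈ (3.56, 166.02)
```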

the variables a through d to represent the frequencies in Cells a through d, the reversal rate is (b+c)/(a+b+c+d). 4

Many researchers have studied preference reversals under various incentive schemes. They include: Lichtenstein and Slovic (1971, 1973); Grether and Plott (1979); Pommerehne, Schneider, and Zweifel (1982); Reilly (1982); Berg, Dickhaut, and O'Brien (1985); Goldstein and Einhorn (1987); Tversky, Slovic, and Kahneman (1990); and Selten, Sadrieh and Abbink (1999), among others. They generally conclude that the rate of preference reversal is insensitive to incentives.

In contrast to studying preference reversals directly, we study the task error rates that lead to reversals. We define such error rates as follows. Suppose subjects have underlying preference orderings over gambles, but that they may misreport their preference ordering in the paired choice task, the pricing task, or both. A choice task error occurs whenever the subject chooses the less preferred bet in a paired-choice task. A pricing task error occurs whenever the subject prices the less preferred bet higher in a pricing task. Such errors can lead to apparently inconsistent preferences as revealed by the two tasks.

Previous researchers have generally ignored an important point: preference reversals and task error rates are not the same. In fact, a subject who misreports preferences 100% of the time will have no preference reversals because they always seem to prefer the same (albeit less preferred) gamble. If an error is made in only one task, this will be captured as a preference reversal. However, if an error is made in both tasks, behavior will be classified as "correct." Figure 2 shows the preference reversal rate as a function of the two possible task error rates according to Lichtenstein and Slovic's (1971) two-error-rate model (which we will discuss in more detail later). Notice that, over large regions, a reduction in one error rate will lead to an increase in the total number of reversals observed. Intuitively, this results from

4 We note that Cell b represents "predicted" reversals, so named after Lichtenstein and Slovic's (1971) observation that they are more common, and Cell c represents unpredicted (less common) reversals.

decreasing the number of consistent choice and pricing decisions that result from misstating preference in both tasks. While preference reversals are the generally accepted metric in the previous literature, we believe incentives should affect error rates (by making errors more costly) rather than preference reversal rates themselves. This idea motivated Siegel (1961) and lies behind both Smith's (1982) precepts of non-satiation, saliency and payoff dominance and Camerer and Hogarth's (1999) capital-labor-production theory for behavior in experiments. To shift the focus from preference reversals to task error rates, we focus on models of task behavior. Specifically, we focus on the two-error-rate model developed by Lichtenstein and Slovic (1971) and test it relative to other behavioral models of choice. We discuss these models and their implications for cell frequencies a through d following our discussion of maximum likelihood estimation for preference reversal data.

II. Maximum Likelihood Estimation

We estimate models of behavior by finding estimated parameters for the underlying model that maximize the likelihood of the observed data (maximum likelihood estimation). We will focus on Cells a, b, c and d in Figure 1 for several reasons. First, not all experiments allowed indifferent responses. Second, the models generally predict continuous preferences over the gambles, so the chances of being truly indifferent are vanishingly small according to them. So, we will drop indifference data and use the remaining response frequencies in Cells a, b, c and d to do estimation. The cell frequencies represent a multinomial distribution and the joint log likelihood based on predicted cell frequencies is:

L = ln(n!) - ln(A!) - ln(B!) - ln(C!) - ln(D!) + A ln[m_a(θ)] + B ln[m_b(θ)] + C ln[m_c(θ)] + D ln[m_d(θ)]

where n is the number of observations in the data; A, B, C and D are the total numbers of observations in Cells a, b, c and d; and m_a(θ), m_b(θ), m_c(θ) and m_d(θ) are the underlying model's

predictions of frequencies for a, b, c and d based on the model's underlying parameters, θ. Estimates and their variances are found using standard maximum likelihood techniques (see Judge, Hill, Griffiths, Lutkepohl and Lee, 1982). The value of the likelihood function given the estimated parameters indicates the model's ability to explain the data. We will use likelihood ratio tests to test restrictions on parameters imposed by the models or by hypotheses.

Given this form of the likelihood function, if the predicted cell frequencies could be set freely, then the maximum likelihood estimator matches the observed cell frequencies to the predictions. That is, a = m_a(θ̂), b = m_b(θ̂), c = m_c(θ̂), and d = m_d(θ̂). Since this is actually the global maximum of the likelihood function, such a model gives the best possible fit across all models. Any model that can always match the observed frequencies always results in a best fit. We call such a model a best-fit model. Such models can never be rejected. Any two such models are observationally equivalent. Below, we show that two models developed to explain preference reversal behavior (random behavior and anchor-and-adjust based preference-dependent-error-rate models) are both best-fit models and can never be rejected nor distinguished from each other.

One can perform meaningful tests on models that restrict the possible values of m_a(θ), m_b(θ), m_c(θ) and m_d(θ). Consider a likelihood ratio test between an unrestricted (best-fit) model and a restricted behavioral model. A significant statistic shows that the restricted model cannot explain the observed data. An insignificant statistic shows that the data are consistent with the restricted model. Further, no other model can explain the observed data significantly better. Finally, if the restrictions are not binding for a given data set (i.e., the model can be parameterized to match cell frequencies), then the behavioral model corresponds to the best-fit model for that dataset.
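The log likelihood above is straightforward to evaluate; here is a small sketch (the cell counts are hypothetical) that also illustrates the best-fit idea: no predicted probability vector can beat the one that matches the observed cell frequencies.

```python
import math

def log_likelihood(counts, probs):
    """Multinomial log likelihood of cell counts given the model's
    predicted cell probabilities m_a(θ), ..., m_d(θ)."""
    n = sum(counts)
    ll = math.lgamma(n + 1)              # ln(n!)
    for k, m in zip(counts, probs):
        ll -= math.lgamma(k + 1)         # - ln(A!), etc.
        ll += k * math.log(m)            # + A ln[m_a(θ)], etc.
    return ll

counts = [50, 30, 10, 10]                # hypothetical counts for Cells a-d
n = sum(counts)
best_fit = [k / n for k in counts]       # saturated model: match observed frequencies

# Any other prediction yields a strictly lower log likelihood:
print(log_likelihood(counts, best_fit) >
      log_likelihood(counts, [0.4, 0.3, 0.2, 0.1]))  # True
```

The strict inequality is the content of the "best-fit model" definition: the observed frequencies are the unique global maximizer of the multinomial likelihood, so a model that can always match them can never be rejected.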

III. Models of Behavior

Now we examine four models of individual behavior previously developed to explain preference reversal data and show how they can be parameterized in a nested manner. The ability to parameterize the models in this manner is important for several reasons. As parameterized, each model makes specific predictions about the pattern of observations (i.e., the frequencies a through d) in preference reversal experiments. From these predictions, we can determine the likelihood of seeing each dataset under each estimated model and compare the likelihoods across models. Because the models form a nested set, we can test the significance of differences using likelihood ratio tests. Based on these tests, we can form contingency tables based on what model(s) fit the data best for the nature of incentives used. Chi-squared tests on these tables show the clear effects of truth-revealing incentives.

The first model, random behavior, is based on the observation that when choices and payoffs are independent, any choice is an expected utility maximizing choice. Thus, subject choices may follow mixed (random) strategies as Siegel (1961) suggested. The second model, preference-dependent error rates in the pricing task, comes from "anchor and adjust" models such as Goldstein and Einhorn's (1987). We will refer to this as the three-error-rate model. The third model is Lichtenstein and Slovic's (1971) two-error-rate model that corresponds to expected utility maximizing subjects who make errors. The error rates depend on the task, but not on underlying preferences. The fourth model, task dependent preferences, comes from arguments such as Tversky and Thaler's (1990) that subjects have no underlying preference relation. In such cases, procedural invariance fails and preferences themselves depend on the task. We show that the first two models (random behavior and three-error-rate) are observationally equivalent and that they nest the second two.

A. Random Behavior

Random behavior is modeled with four parameters, each representing the frequency with which subjects choose one of the four price/choice combinations. We use r_a, r_b, r_c and r_d to represent these frequencies. Random behavior is not necessarily irrational behavior in an economic sense for experiments without payoffs. Because rewards are independent of decisions, any choice/price combination is optimal and subjects may freely mix between them. Thus, any pattern of responses is consistent with rational utility maximizing behavior as well as other models of payoff dependent behavior. The maximum likelihood estimators are r̂_a = a, r̂_b = b, r̂_c = c and r̂_d = d. Because a, b, c and d are frequencies, the estimates all lie in the range [0,1] and, hence, are valid probabilities. As a result, the model can be parameterized to fit any observed set of data. Thus, by its definition, random behavior is a (global) best-fit model for any data set.

B. Nested Behavioral Model I: Preference Dependent Error Rates (Three-Error-Rate) Model

Consider a model in which subjects do not behave randomly, but have preferences over the gambles. For example, they may be expected utility maximizers. Let q denote the fraction of subjects who prefer the P-bet. Suppose further that there are errors in the communication of preferences. That is, subjects may report incorrectly a preference for the P-bet when they prefer the $-bet and vice versa. Let r denote the error rate in the choice task. Now suppose that the pricing task error rates depend on preferences. Denote the error rates by s_P if the subject prefers the P-bet and s_$ if the subject prefers the $-bet. This parameterization allows for an anchor and adjust model in which subjects tend to overprice $-bets such as in Lichtenstein and Slovic (1971) and Goldstein and Einhorn (1987), leading to different pricing task error rates that depend on preferences.

Under this model, data will conform to the pattern shown in Figure 3. It can explain the systematic preference reversals (P-bet chosen in the paired-choice task but $-bet priced higher) termed "predicted" by Lichtenstein and Slovic (1971) by incorporating a higher error rate for pricing P-bets (s_P) than for pricing $-bets (s_$). It will be the most general behavioral model we discuss. The other two are restricted versions of this model.

The problem with the three-error-rate model is that it can explain any pattern of behavior and, thus, is observationally equivalent to random behavior. This is true because for any observed a, b, c and d, one can find valid parameters for the model such that a = m_a(θ̂), b = m_b(θ̂), c = m_c(θ̂) and d = m_d(θ̂). It is easily done. First, let r̂ = 0. Then, to match frequencies, the other estimates solve a = q̂(1 - ŝ_P), b = q̂ŝ_P and c = (1 - q̂)ŝ_$. It is easy to verify that q̂ = a + b, ŝ_P = b/(a + b) and ŝ_$ = c/(c + d) solve the system and represent valid fractions and probabilities. Hence, the three-error-rate model is also a (global) best-fit model and is observationally equivalent to random behavior in the preference reversal context.

C. Nested Behavioral Model II: Expected Utility Maximization with Errors (Two-Error-Rate Model)

Expected utility maximization would lead subjects to have invariant preferences over the gambles. Camerer and Hogarth (1999) argue that even if this were so, the labor cost of optimization and accurate reporting may result in errors. The error rates should depend on the difficulty of optimization relative to the payoffs for doing so. To see how such a model works, consider a parameterization of a model in which subjects have a preference over gambles. Let q denote the fraction of subjects who prefer the P-bet. Suppose further that there are errors in the communication of preferences. That is, subjects may errantly report a preference for the P-bet when they prefer the $-bet and vice versa. Let r denote the error rate in the choice task. Finally, suppose that there is a (single) pricing task error rate that depends only on the

complexity of the pricing task and not on preferences. This is consistent with Camerer and Hogarth's (1999) production theory. Let s denote the pricing task error rate. This model is the same as Lichtenstein and Slovic's (1971) two-error-rate model and gives the predicted pattern of responses shown in Figure 4.

The two-error-rate model is clearly a restricted version of the three-error-rate model above with s_P = s_$ = s. This restriction is meaningful. In particular, matching the frequencies and solving for q, r and s shows that, when a solution exists, the following relationships maximize the likelihood function:

q̂(1 - q̂) = (ad - bc)/[(a + d) - (b + c)], (1)

r̂ = (a + b - q̂)/(1 - 2q̂) (2)

and

ŝ = (a + c - q̂)/(1 - 2q̂). (3)

In this case, the two-error-rate model explains the data as well as the three-error-rate model and, in fact, explains it as well as any possible model. 5 However, this implies restrictions on the parameter estimates. In particular, it will not fit the data as well as the three-error-rate model when (ad - bc)/(a - b - c + d) < 0 or (ad - bc)/(a - b - c + d) > 0.25 because equation (1) cannot be solved. Further, equations (2) and (3) may not give valid error rates in the [0,1] range even when (1) can be solved. These restrictions are stronger than one might think. Out of all possible cell frequencies that can be explained by the three-error-rate model, only about one quarter can be

5 Because of the quadratic form, there are actually two equivalent sets of parameters that satisfy these equations because q and 1-q are interchangeable. The resulting estimates of r and s are each one minus the original estimate. We do not take a stand on which set of estimates is true because it is irrelevant to the likelihood function (both sets give the same likelihood) and, hence, to the likelihood ratio tests discussed below. We let the data choose which set we display in the tables by looking at the consistent choice Cells a and d. If a > d, we display the solution with q > 0.5 and if a < d, we display the other solution.

explained as well by the two-error-rate model. 6 In the rest of the cases, one or more restrictions keep the two-error-rate model from being the best-fit model. Even when equations (1)-(3) cannot be solved to make the two-error-rate model the best-fit model, estimates can be found and likelihood ratios can test the significance of the restrictions.

We can place further restrictions on this model. First, restricting r = s shows whether the two tasks differ in estimated error rates. Second, as we show next, the restriction q = 0 is equivalent to another behavioral model: task dependent preferences.

D. Nested Behavioral Model III: Task Dependent Preferences

Under task dependent preferences, subjects are allowed to prefer one gamble in the choice task but a different one in the pricing task. As a result, their preferences violate procedural invariance and there is no underlying preference ordering for gambles. Instead, the preference orderings implied by responses are task specific--"they ain't nothing till I [the subject] call them" (Tversky and Thaler, 1990, page 210). In such cases, the probability that a subject will prefer the P-bet in the paired choice task is unrelated to the probability that the subject will prefer the P-bet in the pricing task (call these probabilities r and s). Our task dependent preferences model comes from a feature of the compatibility hypothesis explanation of preference reversal developed by Tversky, Slovic, and Kahneman (1990): the failure of procedural invariance. Task dependent preferences would predict the pattern of observations shown in Figure 5. That is, m_a(r,s) = rs, m_b(r,s) = r(1-s), m_c(r,s) = (1-r)s and m_d(r,s) = (1-r)(1-s). This makes it obvious that task dependent preferences is simply the two-error-rate model with q restricted to zero. A likelihood ratio test based on this restriction can differentiate between the two-error-rate model and task dependent preferences.
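Equations (1)-(3) from Section III.C have a direct closed-form implementation. The sketch below (with hypothetical parameter values, and cell probabilities built from the two-error-rate pattern of Figure 4) also enforces the feasibility restrictions discussed in the text:

```python
import math

def two_error_rate_fit(a, b, c, d):
    """Closed-form estimates (q, r, s) from equations (1)-(3), given
    observed cell frequencies a, b, c, d that sum to one.  Returns None
    when the restrictions in the text bind and no valid solution exists."""
    denom = (a + d) - (b + c)
    if denom == 0:
        return None
    ratio = (a * d - b * c) / denom          # q(1-q), equation (1)
    if ratio < 0 or ratio > 0.25:
        return None                          # equation (1) cannot be solved
    root = math.sqrt(0.25 - ratio)
    if root < 1e-12:
        return None                          # q = 0.5: r and s not identified
    # Display convention from footnote 5: show q > 0.5 when a > d.
    q = 0.5 + root if a > d else 0.5 - root
    r = (a + b - q) / (1 - 2 * q)            # equation (2)
    s = (a + c - q) / (1 - 2 * q)            # equation (3)
    if not (0 <= r <= 1 and 0 <= s <= 1):
        return None                          # invalid error rates
    return q, r, s

# Hypothetical check: build cell frequencies from known (q, r, s) using the
# two-error-rate cell pattern, then recover the parameters.
q0, r0, s0 = 0.8, 0.1, 0.3
a = q0 * (1 - r0) * (1 - s0) + (1 - q0) * r0 * s0
b = q0 * (1 - r0) * s0 + (1 - q0) * r0 * (1 - s0)
c = q0 * r0 * (1 - s0) + (1 - q0) * (1 - r0) * s0
d = q0 * r0 * s0 + (1 - q0) * (1 - r0) * (1 - s0)
print(two_error_rate_fit(a, b, c, d))   # ≈ (0.8, 0.1, 0.3)
```

Returning None whenever a restriction binds mirrors the point in the text: unlike the three-error-rate model, the two-error-rate model cannot always match the observed frequencies, which is what makes it testable.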
6 We arrive at this number through 50,000 simulations, drawing cell frequencies randomly from the feasible set.

Substituting the predicted frequencies into the likelihood function and differentiating shows that r̂ = a + b and ŝ = a + c. The predicted cell frequencies will only match the actual frequencies if m_a(r̂, ŝ) = (a + b)(a + c) = a, m_b(r̂, ŝ) = (a + b)(1 - a - c) = b, m_c(r̂, ŝ) = (1 - a - b)(a + c) = c and m_d(r̂, ŝ) = (1 - a - b)(1 - a - c) = d. One can easily verify that these relationships hold only if (a+b)/(c+d) = a/c = b/d and, equivalently, if (a+c)/(b+d) = a/b = c/d. This is exactly the relationship tested by a simple χ²-test for independence between the rows and columns. Here it corresponds to the likelihood ratio test between the task dependent preferences model and the best-fit model, which is equivalent to the three-error-rate model. Thus, estimating the task-dependent-preferences model is simple and the test between it and the three-error-rate model is a simple χ²-test on the cell frequencies given by the data.

IV. Data

In this study, we apply a new metric to data generated in previously published preference reversal experiments. However, since we want to focus on incentives issues alone, we restrict the data we use according to the following criteria. First, the data must be published and the published paper must present the cell frequencies from Figure 1 (which we need to estimate and differentiate the underlying models). Second, we focus specifically on experiments that use individual choice tasks and individual selling price elicitation tasks for gambles involving either hypothetical or real monetary rewards. This excludes some data where the experimental design used other controversial and heavily debated aspects. 7 Such factors are important and deserve study, but confound the focus of our study: pure incentive effects.

7 Examples include: (1) context rich, multidimensional environments (e.g., Tversky, Sattath and Slovic, 1988), (2) changes in value elicitation response modes (e.g., Cox and Epstein, 1989, Bostic, Herrnstein and Luce, 1990, Casey, 1991, and Bohm, 1994), (3) arbitraging irrational choices (e.g., Berg, Dickhaut and O'Brien, 1985, arbitrage treatments) and (4) risk preference induction (e.g., Selten, Sadrieh and Abbink, 1999, induced risk preference experiments, and Berg, Dickhaut and Rietz, 2002), among others.
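The χ²-test for independence described at the end of Section III.D (the likelihood ratio test of the task-dependent-preferences model against the best-fit model) can be sketched as follows, with hypothetical cell counts:

```python
import math

def lr_independence_test(A, B, C, D):
    """Likelihood ratio statistic for the task-dependent-preferences model
    (rows and columns independent) against the best-fit model that matches
    observed frequencies exactly.  Asymptotically chi-squared, 1 df."""
    n = A + B + C + D
    r_hat = (A + B) / n                  # P-bet preferred in the choice task
    s_hat = (A + C) / n                  # P-bet priced higher in the pricing task
    # Task-dependent-preferences predictions: products of the marginals.
    expected = [r_hat * s_hat, r_hat * (1 - s_hat),
                (1 - r_hat) * s_hat, (1 - r_hat) * (1 - s_hat)]
    stat = 0.0
    for k, m in zip([A, B, C, D], expected):
        if k > 0:
            stat += 2 * k * math.log((k / n) / m)   # 2*(lnL_bestfit - lnL_restricted)
    return stat

# Hypothetical counts with an excess of Cell-b ("predicted") reversals:
print(lr_independence_test(60, 45, 5, 40))  # far above the 1-df critical value 3.84
```

When the counts satisfy a/c = b/d exactly (rows and columns independent), the statistic is zero and the restricted model fits as well as the best-fit model, matching the condition derived in the text.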

Even with these restrictions, our data set contains a large variety of incentives designs run by experimenters with a variety of backgrounds and intentions. That our results hold across these heterogeneities attests to their robustness. An overview of the experimental designs is contained in our Appendix; design details are in the original papers.

To perform our analysis, we first classify each experiment according to its incentives. Details supporting the classification of individual experiments are contained in the appendix. Then, we analyze the effects of incentives on the performance of the nested models in explaining the observed data. Note that this allows us to focus on individual task error rates rather than the more aggregate (and deceptive) measure, the preference reversal rate. Table I shows the data used in this analysis. It includes each paper, experiment number, frequencies according to Figure 1, reversal rate and incentives classification.

We classify incentives according to three criteria. "No-incentives experiments" use hypothetical gambles. We analyze four such experiments, all of which use flat participation fees, but have no performance-based rewards in their experimental design. Since no differential reward is given for responding truthfully, any response is an expected utility maximizing choice. Falling in this category are Lichtenstein and Slovic (1971) experiments 1 and 2, Goldstein and Einhorn (1987) experiment 2a, and Grether and Plott (1979) experiment 1a. We label the data sets L&S1, L&S2, G&E2a, and G&P1a, respectively.

"Indeterminate-incentives experiments" use gambles that have real monetary payoffs, but the designs do not necessarily induce truthful revelation for utility maximizing subjects. The reward structures may give utility maximizing subjects some preference over gambles. However, the reward structures could lead to systematic misreporting of preferences for some utility functions and accurate reporting for others. In such cases, truthful reporting is not an optimal strategy for all expected utility maximizing subjects. Thus, these experiments fail to fit our definition of clear incentives. We classify and analyze them separately to provide the cleanest test of incentives effects. These experiments include Lichtenstein and Slovic (1971)

experiment 3, two Lichtenstein and Slovic (1973) experiments conducted in Las Vegas, and four experiments from Pommerehne, Schneider and Zweifel (1982). We label the data sets L&S3, L&SLV+ (this session used positive expected value gambles), L&SLV- (this session used negative expected value gambles), PSZ1.1, PSZ1.2, PSZ2.1, and PSZ2.2, respectively.

"Clear-incentives experiments" include clear incentives for truthfully reporting preferences. These experiments all use a paired choice task and a procedure that should elicit truthful revelation of prices (generally the Becker, DeGroot, and Marschak, 1964, pricing procedure). We report on twelve experiments in this category: three from Grether and Plott (1979), two from Reilly (1982), four from Berg, Dickhaut, and O'Brien (1985), the Reilly replication in Chu and Chu (1990), and two from Selten, Sadrieh and Abbink (1999). These are denoted G&P1b, G&P2SP, G&P2DE, R1.1, R1.2, BDO1.1, BDO1.2, BDO2.1, BDO2.2, C&C, SSA1, and SSA2, respectively.

V. Results

Table II contains our main results. For each data set, we first show the log likelihood of the best-fit model. Next, we show the parameter estimates and log likelihood of the two-error-rate model. Then, we show the parameter estimates and log likelihood of the task dependent preferences model. Finally, we show a set of likelihood ratio test statistics. Our results are based on these statistics. Note that occasionally a likelihood ratio cell is blank. This occurs whenever the log likelihoods are equal for the two models, indicating that the two models explain the data equally well.

Result 1: Best-fit models (random behavior or, identically, task dependent error rates) generally explain the data significantly better than other models when there are no incentives.

This result is based on the first and second likelihood ratio tests in Table II. The first test determines whether the best-fit model explains the data significantly better than the two-error-rate model. It does in three out of four cases for no-incentives experiments. The second test determines whether the best-fit model explains the data significantly better than the task dependent preferences model. In the same three cases, it does.

Result 2: Results are mixed under indeterminate incentives.

Based on the first likelihood ratio test, the best-fit model explains the data significantly better than the two-error-rate model in only two of seven cases (28.57%). In two other cases (28.57%), the fits are identical (resulting in identical likelihoods and no meaningful likelihood ratio test). In the rest of the cases (42.86%), the differences are insignificant. According to the second likelihood ratio test, in six of seven cases (85.71%), the best-fit model explains the data significantly better than the task-dependent-preferences model. In the other case, the difference is insignificant.

Result 3: When clear incentives exist, no model can do a better job of fitting the existing data than the two-error-rate model.

Again, this result is based on the likelihood ratio tests in Table II. The first shows that in eight of twelve cases, there is no difference in the power of the two-error-rate model and the best-fit model (resulting in identical likelihoods and no meaningful likelihood ratio test). This means that, in these cases, no model can fit the data better than the two-error-rate model. In the remaining four cases, the differences are insignificant. This means that, even in these cases, no model can fit the data significantly better than the two-error-rate model.

Result 4: The two-error-rate specification is usually necessary for explaining the data.

First, consider the third set of likelihood ratio tests in Table II, which test the significance of restricting r = s in the two-error-rate model. In all but the Berg, Dickhaut and O'Brien (1985) sessions, this is a significant restriction. This implies that error rates that differ across tasks are
necessary for explaining the data.⁸ This is consistent with Camerer and Hogarth's (1999) labor theory if the two tasks have different degrees of difficulty relative to the payoffs to error reduction, because subjects should equalize the costs of laboring to reduce mistakes with the benefits of reduction. Second, consider the fourth set of likelihood ratio tests in Table II. In all but four cases, the two-error-rate model fits the data significantly better than the task-dependent-preferences model. Only one of these four is in the clear-incentives sessions. Thus, the two-error-rate model fits better than task dependent preferences, which has no errors, but allows for inconsistent preference orderings.

Result 5: The model that best explains subject behavior changes as the type of incentives changes, shifting from essentially random behavior to a model of expected utility maximization with errors.

This result simply states the joint significance of Results 1 through 3. To test the joint significance, count the number of times that the best-fit model does significantly better than the two-error-rate model in explaining the data under each treatment versus the times it does not: 3 times versus 1 time under no incentives, 2 times versus 5 times under indeterminate incentives, and 0 times versus 12 times under clear incentives. Treating each experiment as an observation, these frequencies form the basis for χ²-tests of independence between the treatment and the likelihood that the two-error-rate model explains the data as well as the best-fit model. Using all three incentive categories, the χ²(2) statistic is and, using no incentives versus clear incentives, the χ²(1) statistic is Both are significant at the 95% level of confidence, indicating that the best fitting behavioral model changes as the incentives change.

⁸ Error rates greater than zero are obviously necessary for explaining the existence of any reversals under this set of nested models.
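The χ²-tests of independence in Result 5 can be recomputed from the stated counts alone. The sketch below is ours, not the authors' code; it uses no continuity correction, so the 2×2 statistic may differ from a Yates-corrected value, and the recomputed statistics are illustrative rather than the paper's own (unstated) figures.

```python
# Chi-square tests of independence, recomputed from the counts in Result 5.
# Rows: incentive category; columns: (best-fit model significantly better,
# two-error-rate model fits as well as the best-fit model).

def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# No incentives: 3 vs 1; indeterminate: 2 vs 5; clear: 0 vs 12.
three_way = [[3, 1], [2, 5], [0, 12]]
stat3, dof3 = chi2_independence(three_way)

# No incentives versus clear incentives only.
two_way = [[3, 1], [0, 12]]
stat2, dof2 = chi2_independence(two_way)

# 5% critical values: chi2(2) = 5.99, chi2(1) = 3.84.
print(round(stat3, 2), dof3)  # 10.19 2
print(round(stat2, 2), dof2)  # 11.08 1
```

Both statistics exceed their 5% critical values, consistent with the paper's claim that the best-fitting behavioral model depends on the incentive treatment.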

Additional evidence comes from simulation data. If the cell frequencies in Figure 1 were drawn randomly from a uniform distribution over the feasible set, the three-error-rate model is the best-fit model for all cases, but the two-error-rate model is the best fit for only 25% of the cases. Label these cases successes for the two-error-rate model. This forms the basis for a binomial test between random cell frequencies and those where the two-error-rate model is successful in best-fitting the data. The predicted number of successes under no incentives is 0.25 × 4 = 1. Thus, the observed and predicted success rates do not differ significantly. With indeterminate incentives, the predicted success rate is 0.25 × 7 = 1.75 and the actual is 2. Again, they do not differ significantly. However, with clear incentives, the predicted success rate is 0.25 × 12 = 3 and the actual is 8. The p-value of the two-sided test statistic is Thus, without clear incentives, the data could be explained as easily by completely random behavior as by the two-error-rate model. However, with clear incentives, the data are significantly more consistent with the two-error-rate model.

VI. Conclusions

Our analysis leads to the unmistakable conclusion that incentives affect behavior in preference reversal experiments. In fact, the results clearly suggest that subjects follow different models of choice depending on whether they receive truth-inducing incentive payments or not. Why are the response patterns so different under clear, truth-inducing incentives? Obviously, different incentive mechanisms result in structural differences in the choice environment. Economists may simply argue that clear incentives make the decisions real, resulting in more rational choices. That is, incentives work as advertised. But, as Camerer and Hogarth (1999) suggest, we believe that the fundamental issues are complex and require a much more thorough understanding.
We suggest that a reasonable starting point lies in the common observation from psychology, neurology, and economics: structural differences affect the way stimuli are perceived and processed and how decisions are made. While, as economists, we are particularly interested in the relationships between incentives and optimizing behavior, this is just one of numerous interactions between the environment and response worthy of study.

References

Becker, G, M DeGroot, and J Marschak, 1964, "Measuring Utility by a Single-Response Sequential Method," Behavioral Science.

Berg, JE, JW Dickhaut, and JR O'Brien, 1985, "Preference Reversal and Arbitrage," in V. Smith, ed., Research in Experimental Economics, 3, JAI Press, Greenwich.

Berg, JE, JW Dickhaut, and TA Rietz, 2002, "Diminishing Preference Reversals by Inducing Risk Preferences," Working paper, University of Iowa.

Bohm, P, 1994, "Time Preference and Preference Reversal among Experienced Subjects: The Effects of Real Payments," The Economic Journal, 104.

Bostic, R, RJ Herrnstein, and RD Luce, 1990, "The Effect on the Preference-Reversal Phenomenon of Using Choice Indifference," Journal of Economic Behavior and Organization, 13.

Camerer, CF, and RM Hogarth, 1999, "The Effects of Financial Incentives in Experiments: A Review and Capital-Labor-Production Framework," Journal of Risk and Uncertainty, 19.

Casey, JT, 1991, "Reversal of the Preference Reversal Phenomenon," Organizational Behavior and Human Decision Processes, 48.

Cox, JC, and S Epstein, 1989, "Preference Reversals without the Independence Axiom," The American Economic Review, 79, 3.

Chu, YP, and RL Chu, 1990, "The Subsidence of Preference Reversals in Simplified and Marketlike Experimental Settings: A Note," The American Economic Review, 80.

Goldstein, WM, and HJ Einhorn, 1987, "Expression Theory and the Preference Reversal Phenomenon," Psychological Review, 94, 2.

Grether, DM, and CR Plott, 1979, "Economic Theory of Choice and the Preference Reversal Phenomenon," American Economic Review, 69, 4.

Hey, JD, and C Orme, 1994, "Investigating Generalizations of Expected Utility Theory Using Experimental Data," Econometrica, 62, 6.

Judge, GG, RC Hill, W Griffiths, H Lütkepohl, and TC Lee, 1982, Introduction to the Theory and Practice of Econometrics, John Wiley & Sons, New York.

Lichtenstein, S, and P Slovic, 1971, "Reversals of Preference Between Bids and Choices in Gambling Decisions," Journal of Experimental Psychology, 89, 1.

Lichtenstein, S, and P Slovic, 1973, "Response-Induced Reversals of Preference in Gambling: An Extended Replication in Las Vegas," Journal of Experimental Psychology, 101, 1.

Pommerehne, WW, F Schneider, and P Zweifel, 1982, "Economic Theory of Choice and the Preference Reversal Phenomenon: A Reexamination," American Economic Review, 72, 3.

Reilly, RJ, 1982, "Preference Reversal: Further Evidence and Some Suggested Modifications in Experimental Design," American Economic Review, 72, 3.

Selten, R, A Sadrieh, and K Abbink, 1999, "Money Does Not Induce Risk Neutral Behavior, but Binary Lotteries Do Even Worse," Theory and Decision, 46.

Siegel, S, 1961, "Decision Making and Learning Under Varying Conditions of Reinforcement," Annals of the New York Academy of Sciences, 89.

Smith, VL, 1982, "Microeconomic Systems as an Experimental Science," American Economic Review, 72, 5.

Tversky, A, S Sattath, and P Slovic, 1988, "Contingent Weighting in Judgment and Choice," Psychological Review, 95.

Tversky, A, P Slovic, and D Kahneman, 1990, "The Causes of Preference Reversal," American Economic Review, 80, 1.

Tversky, A, and RH Thaler, 1990, "Anomalies: Preference Reversals," Journal of Economic Perspectives, 4, 2.

Appendix: Categorization of Preference Reversal Experiments by Incentive Types

We classify data according to three criteria. "No-incentives experiments" use hypothetical gambles. "Indeterminate-incentives experiments" use gambles that have real monetary payoffs, but the designs do not necessarily induce truthful revelation for utility maximizing subjects. "Clear-incentives experiments" include clear incentives for truthfully reporting preferences.

A.I. No-Incentives Experiments

These experiments do not include a performance-based reward in the experimental design. Since no differential reward is given for responding truthfully, any response is an expected utility maximizing choice. Falling in this category are: Lichtenstein and Slovic (1971) experiments 1 and 2, Goldstein and Einhorn (1987) experiment 2a, and Grether and Plott (1979) experiment 1a. The design for these experiments is simple. Each subject arrives and participates in a number of paired choice tasks and a number of pricing tasks. Subjects are paid for participation, independent of their actions. Each experiment yields one data set for our analysis. We label the data sets L&S1, L&S2, G&E2a, and G&P1a, respectively.

A.II. Indeterminate-Incentives Experiments

Because of design choices in these experiments, one cannot unequivocally assert that incentives exist for an expected utility maximizing subject to truthfully report preferences. The reward structures may give utility maximizing subjects some preference over gambles. However, the reward structures could lead to systematic misreporting of preferences for some utility functions and accurate reporting for others. In such cases, truthful reporting is not an optimal strategy for all expected utility maximizing subjects. Thus, these experiments fail to fit our definition of clear incentives. We classify and analyze them separately to provide the cleanest test of incentives effects.
These experiments include Lichtenstein and Slovic (1971) experiment 3, two Lichtenstein and Slovic (1973) experiments conducted in Las Vegas, and four experiments from Pommerehne, Schneider and Zweifel (1982).

Lichtenstein and Slovic (1971) Experiment 3 gives one data set, which we label L&S3. They use the following design:

- Subject chooses between a pair of gambles (6 pairs)
- Subject states a selling price for a gamble (12 gambles)
- Subject rewarded for each choice made

Payoffs were stated in points, with points converted to dollars using a "conversion curve." The curve guaranteed a minimum dollar payoff of $0.80 (even with negative points) and a maximum payoff of $8.00. Before making any decisions, subjects were informed of the reward process and the minimum and maximum amounts to win, but not the actual conversion curve. All decisions were played at the end of the experiment. For each paired choice task, the subject played the gamble indicated as preferred and received points according to the outcome. If the decision involved a pricing decision, then a "counter offer" was randomly selected from a bet-specific distribution. These distributions were not disclosed at the beginning of the experiment.

If the "counteroffer" was greater than the price that the subject stated, then the subject received points in the amount of the counteroffer. If the counteroffer was less than the number the subject stated, then the subject played the bet and received points according to its outcome. Without prior specification of the conversion curve, subjects cannot determine which bet is the expected utility maximizing choice in the paired choice task, or what price response is the expected utility maximizing response in the pricing task. Even if subjects assume a linear increasing conversion curve, wealth effects could result in economically rational reversals of preference. Such rational reversals violate the assumption of a stable underlying preference in the two-error-rate analysis. Since the tasks are taken in a particular order, wealth effects could easily make subject choices appear task dependent, when in fact these differences are due to wealth effects.

Lichtenstein and Slovic (1973) report two studies conducted at the Four Queens Hotel and Casino in Las Vegas. Subjects in the experiments purchased chips with their own money. They chose to play with either $0.05, $0.10, or $0.25 chips for the duration of the session. Subjects made decisions about 10 pairs of positive expected value bets and 10 pairs of negative expected value bets. We group the data into two data sets based on this distinction: L&SLV+ denotes the set of positive expected value bets and L&SLV- denotes the set of negative expected value bets. The experiment proceeded as follows:

Stage 1 (20 Paired Choices):
- Subject chooses one bet from each of two pairs
- Chosen bets are played

Stage 2 (40 Selling Price Decisions):
- Subject reports selling price for a bet
- Counteroffer generated and bet played according to the counteroffer

Subjects could drop out at any time. Data are reported for 53 completed sessions (from 44 different subjects).
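The counteroffer procedure above is a Becker-DeGroot-Marschak-style mechanism. As a minimal sketch of why such a mechanism should elicit truthful valuations — under assumptions of our own (a risk-neutral subject, a known uniform counteroffer distribution, and an illustrative bet), none of which describe the actual experiments — the stated price that maximizes expected payoff equals the bet's true value to the subject:

```python
# Sketch of a BDM-style counteroffer mechanism (illustrative assumptions only):
# risk-neutral subject, counteroffer C ~ Uniform(0, 20), and a hypothetical
# bet paying $10 with probability 0.7 and $1 with probability 0.3.

BET_EV = 0.7 * 10 + 0.3 * 1  # expected value of the bet: 7.3
HI = 20.0                    # upper bound of the uniform counteroffer

def expected_payoff(stated_price):
    """Expected payoff from stating price p:
    - if C > p, the subject receives C (mean of the upper tail of C);
    - if C <= p, the subject plays the bet (worth BET_EV in expectation)."""
    p = stated_price
    tail_mass = (HI - p) / HI   # P(C > p)
    mean_tail = (HI + p) / 2    # E[C | C > p] for a uniform distribution
    return tail_mass * mean_tail + (p / HI) * BET_EV

# Grid search over stated prices: the maximizer is the bet's true value.
grid = [i / 10 for i in range(int(HI * 10) + 1)]
best = max(grid, key=expected_payoff)
print(best)  # 7.3 -- truthful revelation is optimal under these assumptions
```

Because any misstatement only changes the outcome in cases where it forces the worse of the two consequences, deviating from the true valuation cannot raise expected payoff. Note that this logic relies on the subject knowing the payoff conversion, which is exactly what the L&S3 design obscures.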
In these experiments, the conversion of points to dollars is known prior to making any choices. However, subjects play bets based on their decisions immediately following each decision. This could result in wealth effects and economically rational reversals. Such economically rational reversals violate the two-error-rate model's assumption of a constant underlying preference. Thus, the model again makes no clear prediction. Again, the sequencing of tasks can make wealth effects appear as task dependent preferences.

The Pommerehne, Schneider, and Zweifel (1982) experiments give four data sets, which we label PSZ1.1, PSZ1.2, PSZ2.1, and PSZ2.2. In these experiments, subjects are rewarded a pro rata share of a fixed reward. Each subject's reward depends on his own decisions as well as the decisions of others in the experiment. There is no dominant strategy in such a reward scheme, so subjects have no clear incentive to truthfully report their preferences. For example, consider a two-person experiment where each person makes one paired choice decision and then, based on the outcomes of their decisions, each person receives a pro rata share of $ . Let the paired choice be between two gambles, (0.70, $10; 0.30, $1) and (0.30, $17.33; 0.70, $3), where the four-tuple is read as (the probability of winning the high prize, the point amount of the high prize; the probability of winning the low prize, the point amount of the low prize). A risk averse player with U(x) = √x does not have a dominant strategy: he prefers bet 1 if the other player chooses bet 1 and prefers bet 2 if the other player chooses bet 2. Thus, the payoff