Do Students Behave Rationally in Multiple Choice Tests? Evidence from a Field Experiment

經濟與管理論叢 (Journal of Economics and Management), 2013, Vol. 9, No. 2

María Paz Espinosa*
Departamento de Fundamentos del Análisis Económico II, University of the Basque Country (UPV/EHU), Spain

Javier Gardeazabal
Departamento de Fundamentos del Análisis Económico II, University of the Basque Country (UPV/EHU), Spain

A disadvantage of multiple choice tests is that students have incentives to guess. To discourage guessing, it is common to use scoring rules that either penalize wrong answers or reward omissions. In psychometrics, penalty and reward scoring rules are considered equivalent. However, experimental evidence indicates that students behave differently under penalty or reward scoring rules. These differences have been attributed to the different framing (penalty versus reward). In this paper, we model students' behavior in multiple choice tests as a choice among lotteries. We show that strategic equivalence among penalty and reward scoring rules holds only under risk neutrality. Therefore, risk aversion could be an alternative explanation for the previously found differences in students' behavior when confronted with penalty and reward scoring rules. We suggest the use of a modified penalty scoring rule which is equivalent to the reward rule for whatever risk attitudes students might have. To disentangle the effects of framing and risk aversion on students' behavior we design a field experiment with three treatments, each with a different scoring rule. Two of these scoring rules are equivalent but have different framing, while the third is not equivalent but has the same framing as one of the other two. The

* Correspondence to: University of the Basque Country, UPV/EHU. BRiDGE, Departamento de Fundamentos del Análisis Económico II. Avenida Lehendakari Aguirre 83, Bilbao. Spain. Email: mariapaz.espinosa@ehu.es; javier.gardeazabal@ehu.es. Financial support from the Spanish Ministry of Science and Innovation MICINN (ECO and ECO ) and from the Basque Government (IT ) is gratefully acknowledged.

experimental results indicate that differences in students' behavior are due to risk aversion and not to different framing.

Keywords: scoring rules, risk aversion, field experiment

JEL classification: C93, D03, D81

1 Introduction

Multiple-choice tests are widely used as an evaluation tool;¹ their main advantages over constructed-response tests are that they guarantee wider content sampling and preclude measurement errors introduced by the grader. The main drawback is that multiple-choice tests may encourage guessing, which adds an error term to test scores and lowers test reliability in measuring students' knowledge.² This is the case when the test score is the number of right answers, hereafter $S_n$. When students are evaluated with the number-right scoring rule, they will of course answer all questions whether they know the answer or not. Thus, the score includes an error component coming from those questions in which a student gets the right answer by chance. To mitigate this problem, examiners quite often use a formula scoring rule that penalizes wrong answers and is intended to reduce guessing behavior. Although rarely used, an alternative way of discouraging guessing is to reward omitted questions.

In the psychometric literature, scoring rules incorporating a correction for guessing in the form of a penalty for wrong answers ($S_p$) and a reward for omitted questions ($S_r$) were considered equivalent, since one is in fact an affine transformation of the other. However, empirical evidence indicated that students behaved differently under the two scoring rules. Bereby-Meyer et al. (2002) confronted students with scoring rules that penalized for wrong answers and rewarded for omissions. Their experimental evidence shows that students omitted more items when they were penalized for incorrect answers than when rewarded for omissions. These differences in students' behavior were thought to be associated with framing

¹ See Siegfried (1996) and Bredon (2003) to grasp the importance of the use of multiple choice tests in Economics.
² See Walstad and Becker (1994), Heck and Stout (1998), Becker and Johnston (1999), Chan and Kennedy (2002), for a comparison of essay and multiple-choice tests in Economics.

and attributed to psychological factors.³ An explanation of this non-equivalence result was advanced by Budescu and Bar-Hillel (1993), appealing to the different considerations that opportunity costs (failure to win) and out-of-pocket costs (paying a penalty) have for individuals. According to this view, examinees should guess more when they are rewarded for omissions than when penalized for wrong answers, given that it is easier to forgo a gain than to incur a loss. More recently, this idea has been formalized using Prospect Theory and the experimental results interpreted as evidence in favor of the theory (Bereby-Meyer et al., 2003).⁴ Other theories of student behavior in multiple choice tests include Bernardo (1998), who assumes that students maximize the score or minimize the probability of failing the exam, and Burgos (2004), who uses Prospect Theory to postulate a utility function that assigns different values to losses and gains.

The psychometrics literature neglects the possibility that risk aversion could be the reason why students behave differently when confronted with scoring rules that penalize for incorrect answers or reward for omissions. For instance, Bereby-Meyer et al. (2002) confront students with scoring rules that penalized for incorrect answers and rewarded omissions, in both cases with positive expected score, and claim that guessing was clearly always the optimal strategy (p. 323). These authors fail to consider the possibility that a risk averse student could omit an item with positive expected reward. Similarly, in the experimental study of Bereby-Meyer et al. (2003) with the penalty and reward scoring rules, the authors claim that answering all items was the dominant strategy for both rules (p. 207), again neglecting the possibility that risk aversion might induce omissions. The implicit assumption in most of the psychometrics literature is that test takers are expected score maximizers.
Our contribution in this paper is to separate the effect of students' risk preferences from the effect of framing on the number of omitted questions in multiple choice tests. Students' behavior in multiple choice tests is modeled as a choice over lotteries where risk considerations play an important role. We contribute three theoretical results which will be useful to disentangle the effects of framing and risk aversion. First, we show that when examinees are risk-averse the two scoring rules

³ See Traub et al. (1969), Traub and Hambleton (1972) and Waters and Waters (1971).
⁴ See Kahneman and Tversky (1979) on Prospect Theory.

imply a different trade-off when deciding whether to answer a question or not. Therefore, expected utility maximizers may behave differently under $S_p$ than under $S_r$, even though the two rules are linearly related. Second, we demonstrate that the two scoring rules are equivalent under risk neutrality. Third, we also show that the penalty rule can be modified so that it becomes strategically equivalent to the reward rule even under risk aversion.

These results are relevant for the design of experiments in psychometrics seeking to determine the effects of the scoring rule on the test's validity and reliability and, in particular, the impact of psychological factors. Previous experiments confront subjects with scoring rules that are strategically equivalent only for risk neutral subjects. Therefore, the observed differences in behavior could be attributed to the framing of the scoring rule or to risk attitudes. Our experimental design allows us to distinguish between the two factors by confronting students with rules with different framing that are strategically equivalent for all types of risk attitudes.

To determine whether in this context risk aversion may be significant enough to give rise to the differences in observed behavior, we designed an experiment using several scoring rules. In a regular undergraduate Macroeconomics course, it was announced that exams would be graded with different scoring rules for different groups of students, so that all students knew well in advance the exact rule they would face in the exam.⁵ The results of our field experiment indicate that under equivalent scoring rules, there are no significant differences in the number of omissions, even though one rule is framed as a penalty for wrong answers and the other as a reward for omissions, while differences in behavior are observed when the rules are not equivalent. Therefore, the results are consistent with rational student behavior.
This evidence suggests that individuals were not affected by the different framing of the scoring rules; when the scoring rules were strategically equivalent, subjects adopted similar decisions and psychological factors did not seem to play a role in students' behavior. Note that, in this field experiment, subjects' decisions determined their grade on the course, so there was a strong incentive to take the right decisions. In summary, the evidence is consistent with expected utility maximization and supports the hypothesis that differences in behavior are due to risk aversion

⁵ The experimental design described in Section 4 guarantees equal treatment for all students.

rather than psychological factors.⁶ We also find that gender and students' knowledge are important determinants of the number of omissions.

We do not address the issue of the optimality of the grading procedures, and the theory and empirical evidence in this paper are in no way intended to justify or recommend the use of any particular scoring rule.⁷ Our work is mainly a contribution to the study of risk attitudes and rationality within a particular context. Nevertheless, a better understanding of the incentives behind these rules could be a useful first step for studying the optimal way of designing multiple-choice tests.

The rest of the paper is organized as follows. Section 2 lays out the preliminaries. Section 3 establishes the theoretical results. Section 4 describes the design of the experiment. In Section 5 we report the results of the field experiment and perform the statistical analysis. Section 6 concludes.

2 Preliminaries

Let $N$ be the number of items in an exam and $M$ the number of alternatives, one correct and $M - 1$ incorrect. A student is defined by her level of knowledge and a function, $u(s)$, representing her valuation of the score, $s$, obtained in the exam. We assume this valuation is such that $u'(s) \geq 0$.⁸ We do not restrict the second derivative of the utility function, so students could be risk averse, risk neutral or risk loving. Note also that the utility function is independent of the scoring rule. Students may have different preferences and different levels of knowledge.

The simplest scoring rule is number right, denoted $S_n$, where the score is simply the number of right answers $r$: $S_n = r$. Some scoring rules incorporate a correction for guessing. Typically, there is a penalty of $1/(M-1)$ points for each incorrect answer. This scoring rule yields a final score:

⁶ There are numerous laboratory experiments documenting different types of deviations from rationality, but field experiments are scarce. See Bertrand et al. (2005), Haan et al. (2002), Haigh and List (2005) and List and Millimet (2005) for notable exceptions.
⁷ See Espinosa and Gardeazabal (2010).
⁸ This assumption does not exclude pass-fail exams in which $u = 0$ until the pass score is reached.

$$S_p = r - \frac{w}{M-1},$$

where $r$ and $w$ are the number of right and wrong answers, respectively. An alternative rule for discouraging guessing is to give $1/M$ points for each omitted question. This scoring method yields a final score:

$$S_r = r + \frac{o}{M},$$

where $o$ is the number of questions omitted. There are three important features of these scoring rules:

(i) First, the reward for omissions and the penalty for wrong answers are intended to induce the same behavior in students: to discourage guessing when the student does not know the answer. However, $S_p$ is framed in terms of losses for wrong answers while $S_r$ is framed in terms of gains.

(ii) Second, $S_r$ and $S_p$ are linearly related as:

$$S_r = \frac{N}{M} + \frac{M-1}{M}\, S_p. \tag{1}$$

(iii) Third, both scoring rules use values of the penalty for wrong answers and reward for omissions such that the expected value of randomly answering an item equals the value of omitting. Consider an examinee who has no clue about the answer to an item and selects an answer randomly. Thus, the probability of a right answer is $1/M$ and the probability of failing the item is $(M-1)/M$. Under $S_p$, the expected value from answering is:

$$\frac{1}{M}\cdot 1 - \frac{M-1}{M}\cdot\frac{1}{M-1} = 0,$$

which is equal to the gain from omitting. Under $S_r$ the expected value from answering the item is:

$$\frac{1}{M}\cdot 1 + \frac{M-1}{M}\cdot 0 = \frac{1}{M},$$

which is equal to the gain from omitting, $1/M$.

We believe that features (ii) and (iii) of the scoring rules might be the reason why both rules have been considered equivalent in the psychometric literature.

A contribution of this paper is the modeling of multiple choice tests as lotteries. Item $i$ can be viewed as a gamble in which a student has probability $q_i$ of getting

the right answer and the probability of failing the item is $1 - q_i$. Of course, these probabilities depend on the knowledge of the student and the difficulty of the item. Assume the student answers only item $i$, leaving $N - 1$ items unanswered. Denote by $s(r, w, o)$ the score obtained from $r$ rights, $w$ wrongs and $o$ omissions. Let $\{i\}$ denote the lottery induced by answering only item $i$, i.e. obtaining a score $s(1, 0, N-1)$ with probability $q_i$ and $s(0, 1, N-1)$ with probability $1 - q_i$. Note that the scoring rule affects the values of the score, $s(1, 0, N-1)$ and $s(0, 1, N-1)$, but the probability $q_i$ is rule-independent. Let $U(\{i\})$ denote the utility derived from this lottery. If the student evaluates lotteries according to Expected Utility Theory, then:

$$U(\{i\}) = q_i\, u(s(1, 0, N-1)) + (1 - q_i)\, u(s(0, 1, N-1)),$$

where $u(\cdot)$ is the student's valuation of the score. Let $\{i, j\}$ denote the compound lottery induced by answering items $i$ and $j$, leaving $N - 2$ items unanswered. The payoffs and probabilities of this lottery are given in Table 1, where $q_i$ and $q_j$ are the probabilities of getting the right answer to items $i$ and $j$, respectively. The utility derived from this lottery is:

$$U(\{i, j\}) = q_i q_j\, u(s(2, 0, N-2)) + \big(q_i (1 - q_j) + (1 - q_i) q_j\big)\, u(s(1, 1, N-2)) + (1 - q_i)(1 - q_j)\, u(s(0, 2, N-2)).$$

Table 1: Scores and Probabilities

Scores              Probabilities
$s(2, 0, N-2)$      $q_i q_j$
$s(1, 1, N-2)$      $q_i (1 - q_j) + (1 - q_i) q_j$
$s(0, 2, N-2)$      $(1 - q_i)(1 - q_j)$

In an exam with $N$ items any subset of items is a compound lottery. Let $L(N)$ be the set of all compound lotteries in an exam with $N$ items, including a degenerate lottery, denoted by $\{0\}$, which corresponds to omitting all items. For example, in an exam with two items, the set of all compound lotteries is:

$$L(N) = \{\{0\}, \{1\}, \{2\}, \{1, 2\}\},$$

that is, the degenerate lottery which corresponds to omitting all items, the lottery

consisting of answering only the first item, the lottery corresponding to answering only the second item and the lottery where the student answers both items.

A perfectly rational student would choose the best compound lottery in $L(N)$. Formally, she would maximize the expected utility over the set of all possible compound lotteries:

$$\max_{\ell \in L(N)} U(\ell). \tag{2}$$

In our model a rational test taker is expected to answer items to maximize expected utility. This calculation is difficult to perform for various reasons. First, the analytic solution to the problem is not straightforward and, second, it requires estimates of the probabilities of answering items correctly. Of course, subjects taking multiple-choice tests are in no way assumed to literally perform such calculations in real exams. Our model of rational behavior should be understood as a theory and not a rule of behavior or a description of cognitive decision processes (e.g., McKenzie, 2003).

3 Theoretical Results

In this section we analyze whether penalizing for wrong answers and rewarding omissions induce the same behavior in examinees. Penalizing for wrong answers ($S_p$) and rewarding omissions ($S_r$) have been wrongly considered equivalent, probably because one is just an affine transformation of the other (see equation (1)). Therefore, authors have focused on the different framing, namely losses ($S_p$) and gains ($S_r$), e.g. Bereby-Meyer et al. (2003). In order to be precise about the equivalence between scoring rules we introduce the following definition.

Definition. Two scoring rules are strategically equivalent if they always induce the same behavior in a rational exam taker (an expected utility maximizer).

This section presents three results. The first is that penalizing wrong answers and rewarding omissions, as defined in the previous section, are not in general strategically equivalent.
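The maximization in problem (2) can be sketched by brute force for small exams. The following is a minimal illustration, not the authors' procedure: the function names, the square-root utility and the numbers in the usage lines are assumptions chosen only to show how the same student can pick different lotteries under the penalty and reward rules of Section 2.

```python
from itertools import combinations, product
from math import sqrt

def score_penalty(r, w, o, M):
    """Penalty rule from Section 2: each wrong answer costs 1/(M-1)."""
    return r - w / (M - 1)

def score_reward(r, w, o, M):
    """Reward rule from Section 2: each omission earns 1/M."""
    return r + o / M

def expected_utility(answered, q, M, score, u):
    """Expected utility of answering the items in `answered`, omitting the rest."""
    N = len(q)
    o = N - len(answered)
    total = 0.0
    # Enumerate every right/wrong pattern over the answered items.
    for outcome in product([True, False], repeat=len(answered)):
        prob = 1.0
        for item, right in zip(answered, outcome):
            prob *= q[item] if right else 1 - q[item]
        r = sum(outcome)
        w = len(answered) - r
        total += prob * u(score(r, w, o, M))
    return total

def best_lottery(q, M, score, u):
    """Return the subset of item indices that maximizes expected utility.

    The empty tuple is the degenerate lottery (omit everything).
    Ties are broken in favor of the subset listed first (fewer answers).
    """
    N = len(q)
    lotteries = [c for k in range(N + 1) for c in combinations(range(N), k)]
    return max(lotteries, key=lambda a: expected_utility(a, q, M, score, u))

# Illustrative (assumed) inputs: one item, two alternatives, q = 0.6.
concave = lambda s: sqrt(1 + s)   # hypothetical risk-averse valuation
linear = lambda s: s              # risk-neutral valuation
choice_penalty = best_lottery([0.6], 2, score_penalty, concave)
choice_reward = best_lottery([0.6], 2, score_reward, concave)
```

With these assumed inputs the concave student omits under the penalty rule and answers under the reward rule, while a risk-neutral student answers under both. The enumeration grows as $2^N$ subsets, so it is only practical for very short exams.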

Proposition 1. For risk-averse exam takers, $S_p$ and $S_r$ are not strategically equivalent.

To show this, it is sufficient to find an example where a risk averse exam taker would behave differently under $S_p$ than under $S_r$. Consider an exam with one item, $N = 1$, two alternatives, $M = 2$, a student with a concave valuation such as $u(s) = \sqrt{a + s}$, with $a \geq N$, and probability $q$ of getting the right answer. In this case, a student is faced with a set of two lotteries, $\{\{0\}, \{1\}\}$, that is, omitting the item and answering the item. Under $S_p$ the expected utility from answering is lower than that from omitting if

$$q\sqrt{a + 1} + (1 - q)\sqrt{a - 1} < \sqrt{a}.$$

However, under $S_r$ the examinee obtains a higher expected utility from answering if

$$q\sqrt{a + 1} + (1 - q)\sqrt{a} > \sqrt{a + \tfrac{1}{2}}.$$

It is easy to verify that for a student with $a = 1$ and a probability of answering correctly of $q = 0.6$, both inequalities hold and therefore the student would omit under $S_p$ and answer under $S_r$. This shows that the two scoring rules are not in general strategically equivalent.

Our second result states that for a particular type of risk attitude, the two scoring rules become strategically equivalent.

Proposition 2. For a risk neutral examinee, $S_p$ and $S_r$ are strategically equivalent.

To show this, it is necessary to prove that a risk neutral student would always make the same decision under either of the scoring rules. Under $S_p$, a risk neutral student would choose lottery $\ell$ whenever its expected payoff is at least as high as that of any other lottery, that is:

$$\sum_k s^p_{k\ell}\, q_{k\ell} \geq \sum_k s^p_{k\ell'}\, q_{k\ell'}, \tag{3}$$

for all $\ell' \in L(N)$, where $s^p_{k\ell}$ are the scores under $S_p$ of all possible outcomes $k$ in lottery $\ell$ and $q_{k\ell}$ are the associated probabilities of those outcomes, so that $\sum_k q_{k\ell} = 1$. Multiplying by $(M-1)/M$ and adding $N/M$ to both sides of (3) we get:

$$\sum_k q_{k\ell}\left(\frac{N}{M} + \frac{M-1}{M}\, s^p_{k\ell}\right) \geq \sum_k q_{k\ell'}\left(\frac{N}{M} + \frac{M-1}{M}\, s^p_{k\ell'}\right),$$

for all $\ell' \in L(N)$. Using equation (1), the previous equation can be written as:

$$\sum_k s^r_{k\ell}\, q_{k\ell} \geq \sum_k s^r_{k\ell'}\, q_{k\ell'},$$

for all $\ell' \in L(N)$. In words, the student chooses lottery $\ell$ in the set $L(N)$ under $S_p$ if and only if she also chooses it under $S_r$. This completes the proof of equivalence.

Scoring rules $S_p$ and $S_r$ are equivalent for risk neutral students. However, experiments seem to indicate that students do not always behave identically under the two scoring rules. Our point is that risk preferences may have been dismissed as other psychological factors in the experiments designed to evaluate $S_p$ and $S_r$. In order to measure the effects of framing, it is necessary to use scoring rules that are strategically equivalent for all types of risk attitudes. For that purpose, we propose a modified scoring rule with penalty, denoted $S_p^*$:

$$S_p^* = \frac{M-1}{M}\, r - \frac{1}{M}\, w + \frac{N}{M}.$$

Notice that the modified penalty scoring rule can be written as

$$S_p^* = \frac{S_p + pN}{1 + p},$$

where $p = 1/(M-1)$ is the penalty, so it is an affine transformation of the standard penalty rule. Intuitively, the modified penalty rule gives a start-up score of $N/M$ to all students and then subtracts $1/M$ for each wrong answer. This is in fact the same as rewarding for omissions, as the following proposition demonstrates.

Proposition 3. For each and every type of risk preferences, $S_p^*$ and $S_r$ are strategically equivalent.

For $S_p^*$ and $S_r$ to be strategically equivalent the student should answer the same set of items under the two scoring rules, i.e. the solution to problem (2) should be the same. To prove this, we simply have to show that the set $L(N)$ is identical under $S_p^*$ and $S_r$. We do this in two steps. First, note that the probabilities of the different events are independent of the scoring rule. Second, since the number of items in the exam is equal to the right answers plus wrong answers plus omissions, we have that:

$$S_p^* = \frac{M-1}{M}\, r - \frac{1}{M}\, w + \frac{N}{M} = \frac{M-1}{M}\, r - \frac{1}{M}(N - r - o) + \frac{N}{M} = r + \frac{o}{M} = S_r,$$

so payoffs are also identical under both scoring rules. This completes the proof of equivalence.

Corollary 1. If scoring rules $S_i$ and $S_j$ are strategically equivalent, so are $\lambda S_i$ and $\lambda S_j$ for $\lambda > 0$.

This follows from the fact that if the two scoring rules are strategically equivalent, they must yield the same payoffs and therefore, after multiplying their scores by a positive constant, the payoffs of the two scoring rules would also be identical.

The strategic equivalence of scoring rules $S_p^*$ and $S_r$ allows us to isolate the effect of psychological factors from that of risk preferences, since they induce the same behavior in rational students. If students do not behave identically in the experiment under $S_p^*$ and $S_r$, then differences in behavior are due to framing.

4 Experimental Design and Procedures

The objective of our experiment is threefold. First, we test whether students behave differently with the standard scoring rules, so that our results are comparable to previous findings. Second, we compare the results when students face penalty and reward scoring rules that are strategically equivalent for all risk attitudes, to determine whether there are any differences which could be attributed to framing. Third, we also try to determine which variables influence students' decisions to omit items.

We conducted a field experiment by grading students with different scoring rules in the exams of a regular course. The payoff in terms of grade is particularly appropriate for two reasons. First, grades generate stronger incentives than small amounts of money used in other contexts. Second, attitudes towards risk concerning grades may be different from behavior towards risk when money is involved.

The experiment was conducted at the University of the Basque Country, Spain. The salient features of the experiment are shown in Table 2. Subjects were second year undergraduate students pursuing a bachelor's degree in Economics, enrolled in Intermediate Macroeconomics in the Spring of [year]. The students' performance on the course was evaluated using five multiple-choice tests; each of the 5 exams was

worth 20 points, summing to a total of 100 points. Each exam covered the material in one chapter, except the first exam, which covered the first two (shorter) chapters. Each exam had ten items and each item had four possible answers, one correct and three incorrect.

Table 2: Description of Sessions/Exams

Session/Exam   Date       Treatments (Group)                      Participants
1              March 9    $S_p$ (W), $S_p^*$ (B), $S_r$ (Y)
2              March 21   $S_p$ (B), $S_p^*$ (Y), $S_r$ (W)
3              April 13   $S_p$ (Y), $S_p^*$ (W), $S_r$ (B)
4              May 4      $S_n$ (B, W, Y)
5              May 20     $S_n$ (B, W, Y)                         148

W: white, B: blue, Y: yellow; $S_p$: penalty, $S_p^*$: modified penalty, $S_r$: reward, $S_n$: number right.

The experiment had three treatments: penalty for incorrect answers, $S_p$, modified penalty for incorrect answers, $S_p^*$, and reward for omissions, $S_r$. Given the parameters of the exams, $N = 10$ and $M = 4$, the scoring rules are as follows:

$$S_p = r - \tfrac{1}{3} w, \qquad S_p^* = 2.5 + \tfrac{3}{4} r - \tfrac{1}{4} w \qquad \text{and} \qquad S_r = r + \tfrac{1}{4} o.$$

For the sum of scores obtained in all exams to add up to 100, all test scores are multiplied by 2,⁹ so that the scoring rules presented to students were:

$$S_p = 2r - \tfrac{2}{3} w, \qquad S_p^* = 5 + \tfrac{3}{2} r - \tfrac{1}{2} w \qquad \text{and} \qquad S_r = 2r + \tfrac{1}{2} o.$$

The rules were also presented to the subjects in a table; for example, students in the reward treatment were told that they would be graded according to the following table (see the experimental instructions in the appendix):

Scoring Rule
Right    +2
Wrong     0
Omit     +0.5

These rules were posted on the course website along with other useful information so that students were well aware of the scoring methods.

The experimental design guaranteed equal treatment for students. The expected

⁹ Notice that, by Corollary 1, this rescaling of scoring rules does not affect the strategic equivalence between pairs of scoring rules.

score is lower with scoring rule $S_p$ than with rules $S_p^*$ and $S_r$. Therefore, splitting subjects into three treatments (one for each rule) would favor students in treatments $S_p^*$ and $S_r$. To treat all students equally, each examinee was evaluated once with each of $S_p$, $S_p^*$ and $S_r$ in the first three exams.

At the beginning of the course, students were randomly assigned to three groups (Blue, Yellow and White) with 62, 62 and 61 students, respectively. Students were told to which group they had been assigned and that each group was going to be assessed with a different scoring rule in each exam according to the design in Table 2. Therefore, group indicates a particular order in the administration of treatments. There are six possible permutations of treatments but only three groups. Had we split students into six groups, group size would have been cut in half. Instead, we split students into three groups. A shortcoming of this experimental design is that the particular order of scoring rules might influence the results. That is why we control for groups in the regressions reported below.

The design of the experiment also takes into account that after the first three exams, students with low accumulated score have less incentive to omit, as a large number of omissions does not guarantee a passing grade (55%). For this reason, the fourth and fifth exams were graded with $S_n$ (number right).

At the beginning of the course 185 students were enrolled. The number of students taking exams decreased during the course: 177 students took the first exam, 169 the second and 162 the third. The analysis was restricted to the 160 students who took the first three exams. Even though we designed the experiment with equal initial group sizes, due to dropouts the number of students in each group varies from 49 students in the Blue group to 57 students in the Yellow group. However, according to the data shown in Table 3, groups are probabilistically equivalent. After each exam, students were told their scores.
Table 3: Group Characteristics

Columns: Group; Number of Subjects; Males; Females; Knowledge*; Proportion of Exams with No Omissions.
Rows: Blue; Yellow; White. (Cell values are missing from this copy.)

*Knowledge = average grade in a previous Macroeconomics course.
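The treatment rules can be checked numerically at the experiment's parameters ($N = 10$, $M = 4$, scores multiplied by 2). The sketch below is an illustration under those stated parameters only: the function names are hypothetical, and the rules are written from the general definitions of Sections 2 and 3 rather than copied from the instructions given to students. Exact fractions are used so that equality tests are not affected by rounding.

```python
from fractions import Fraction

# Parameters stated in the text: ten items, four alternatives, scores doubled.
N, M, SCALE = 10, 4, 2

def penalty(r, w, o):
    """Standard penalty rule: 1/(M-1) points lost per wrong answer, rescaled."""
    return SCALE * (r - Fraction(w, M - 1))

def modified_penalty(r, w, o):
    """Modified penalty rule: start-up score N/M, (M-1)/M per right, -1/M per wrong."""
    return SCALE * (Fraction(N, M) + Fraction(M - 1, M) * r - Fraction(w, M))

def reward(r, w, o):
    """Reward rule: 1/M points gained per omitted item, rescaled."""
    return SCALE * (r + Fraction(o, M))

def all_outcomes():
    """Every (rights, wrongs, omissions) triple with r + w + o = N."""
    return [(r, w, N - r - w) for r in range(N + 1) for w in range(N - r + 1)]

def rules_agree(rule_a, rule_b):
    """True when two rules assign the same score to every possible outcome."""
    return all(rule_a(*x) == rule_b(*x) for x in all_outcomes())

same_payoffs = rules_agree(modified_penalty, reward)   # Proposition 3 identity
standard_differs = not rules_agree(penalty, reward)    # standard penalty is not identical
```

Identical payoffs for every outcome is the condition used in the proof of Proposition 3; the standard penalty rule fails it even though it satisfies the affine relation in equation (1).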

Appendix A contains the instructions given in all exams, which included a set of general instructions and a treatment-specific instruction regarding the scoring rule.

In the educational measurement literature it is common practice to give students advice regarding omissions. When using penalty scoring, examiners are generally recommended to advise students not to omit if they can rule out one or more answers as incorrect. The idea is that students should answer when they have partial knowledge and the expected value of answering is positive. However, a risk-averse student may optimally decide not to respond to an item with positive expected value. Unless all students were risk neutral, no good general advice could be provided, as the students' optimal behavior depends on their degree of risk aversion. Therefore, we did not give students any advice.

Under number-right scoring, rational students ought to respond to all items, and this is what happened in the last two exams, in which no student omitted a single item. Thus, the analysis is restricted to the first three exams. Table 4 reports basic descriptive statistics for the first three exams. The average number of omissions varies between exams and scoring rules. Some students answered all questions, no matter what scoring rule they faced. On the other hand, no student omitted all questions in any exam.

Table 4: Omissions. Descriptive Statistics

Columns: Treatment; Observations; Mean; Std. Dev.; Min.; Max.
Rows, for each of the First, Second and Third Exams: Penalty $S_p$; Modified Penalty $S_p^*$; Reward $S_r$. (Cell values are missing from this copy.)

5 Experimental Results

As a preliminary way of assessing the effect of scoring rules on omissions, Figure 1 plots the histogram of omissions under the three rules. Omissions seem to be fairly similar across scoring rules, but the shape of the histogram for the penalty scoring rule differs from the other two. The histograms for the reward and modified penalty rules are fairly similar, which suggests that there is no difference between the strategically equivalent scoring rules (reward and modified penalty) despite their different framing. The penalty rule, whose histogram differs in shape, is not strategically equivalent to either of them.

Figure 1: Histograms for Omissions under the Three Scoring Rules

Table 5 reports the Mann-Whitney test statistics for the null hypothesis that the samples of omissions from the first exam come from the same population. According to these results we can reject equality of distributions between penalty

and reward and we cannot reject the hypothesis of equality of distributions between reward and modified penalty. The results are consistent with rational risk averse subjects. Rejection of the null of equality of distributions between penalty and reward scoring rules, which are not strategically equivalent, is consistent with the behavior of risk-averse expected utility maximizers (see Proposition 1). Furthermore, the fact that we cannot reject the null of equality between modified penalty and reward is also consistent with Proposition 3. These two scoring rules are strategically equivalent and therefore should induce the same behavior in rational students regardless of their attitudes toward risk. It is worth noting that in the comparison between modified penalty and reward, framing differs (losses in the first scoring rule and gains in the second). Nevertheless, subjects did not show any significant reaction to the different framing.

Table 5: Omissions and Scoring Rules. Mann-Whitney Test, First Exam

Penalty vs. Reward             (0.0453)
Reward vs. Modified Penalty    (0.2572)

p-values in parentheses. (The test statistics are missing from this copy.)

The evidence reported in Table 5 makes use of data only from the first exam. The reason is that only before the first exam did students have an accumulated score of zero, while after the first exam their accumulated scores from previous exams were different, reflecting differences in knowledge, luck and the differential effect of the scoring rules. From the second exam on, students with different scores might have behaved differently towards omission even when faced with the same scoring rule. Therefore, the results of the first exam are the only ones not affected by the scores in past exams. In other words, after the first exam, the different treatment groups are not probabilistically equivalent because their accumulated scores in previous exams could be different. Therefore, the Mann-Whitney test used in Table 5 cannot identify the treatment effects of scoring rules in the second and third exams.

However, we can still carry out inference using the results of the second and third exams under the assumption of conditional mean independence, that is, conditional on a set of covariates, treatments and potential omissions are mean independent.
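The Mann-Whitney comparison of omissions between two treatment groups can be sketched in plain Python. This is a minimal illustration, not the authors' computation: it uses midranks for ties and a two-sided normal approximation with a tie correction and no continuity correction, and the sample values in the usage lines are made up, not the experiment's data.

```python
from math import sqrt, erfc

def midranks(values):
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney(x, y):
    """Return (U_x, U_y, z, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    n = n1 + n2
    # Tie correction: sum of t^3 - t over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        raise ValueError("all observations are tied; the test is undefined")
    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, u2, z, p_value

# Made-up omission counts for two hypothetical treatment groups.
group_a = [0, 1, 2]
group_b = [3, 4, 5]
u_a, u_b, z_ab, p_ab = mann_whitney(group_a, group_b)
```

Omission counts are small integers, so ties are frequent, which is why the sketch includes the tie correction. For small samples an exact test would be preferable to the normal approximation.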

In addition to the treatment effects of the scoring rules, the number of omissions could be affected by a number of covariates:

1. The accumulated score in previous tests might influence students' behavior. After the first exam the grades are revealed and this may affect the way in which the compound lotteries in the second and third exams are evaluated. A student with a high accumulated score might decide to omit more (or fewer) items than a student with a low accumulated score. To investigate this possibility we include in the regression the accumulated score, which equals zero in the first exam, the score obtained in the first exam in the second exam, and the sum of the scores in the first and second exams in the third one.

2. Knowledge, and in particular knowledge of Macroeconomics, should determine the behavior of students. All else being equal, a student with greater knowledge should omit fewer items than a less knowledgeable student. In the regressions reported below we include a proxy for knowledge of the subject: the grade obtained in a previous Macroeconomics course either in the previous semester (Fall 2004) or in a previous year. [10] We also use as an alternative measure of knowledge the grade obtained in the last two exams (scored with number right), knowledge2.

3. The difficulty of the exam should definitely influence the number of omissions. For a given set of students, a more difficult exam ought to be reflected in a higher number of omissions. Even though we tried to write exams of ex-ante similar difficulty, it could be the case that exams had different degrees of difficulty. To account for this possibility in the regression, we include a set of dummy variables indicating the exam (first, second or third) which capture unobserved characteristics of exams, constant for all individuals.

4. Some studies have shown a link between risk attitudes and gender (see for example Byrnes et al., 1999, Cadsby and Maynes, 2005 and Scotchmer 2008). Since scoring rules S_P and S_R are not equivalent for risk-averse subjects, gender might affect the number of omissions. [11] To account for these differences we include a gender dummy variable.

5. Students are distributed into five sections with three different instructors. Different teaching expertise could induce differences in the performance of students. To control for any such differences we include dummy variables for sections.

6. The order in which students face the scoring rules is determined by the group (Blue, Yellow and White). To control for order effects and unobservable differences in group characteristics we include a set of group dummies.

[10] Three students did not have a grade in a previous Macroeconomics course, which further restricted our sample to 157 students.
[11] Furthermore, it has been documented that instructions concerning guessing behavior may affect gender-related differences in multiple-choice test scores (see Prieto and Delgado, 1999).

Next we apply formal statistical procedures to take into account the effect of these factors on students' behavior. By controlling for these factors we can pool the second and third exam data together with the first exam and obtain more reliable statistical results. We do this in two steps. First, we analyze the effect of scoring rules (treatments) and other covariates on the decision whether or not to omit. We define a binary variable taking value 1 if the individual has omitted at least one item and 0 if none has been omitted, and use logistic regression. Second, we analyze the effect of scoring rules and other covariates on the number of omissions and use count data regression. In both cases we use the same reference group, a female student from section 5, in the blue group, graded with the reward rule S_R in the third exam. Thus, coefficient estimates are to be interpreted as differences with respect to this reference point.

Table 6 presents the results of a logistic regression. Columns (1) and (2) report the results using alternative measures of knowledge. Columns (3) and (4) include exam-specific dummies and group-specific dummies. The penalty scoring rule always has a positive and significant coefficient estimate. The modified penalty scoring rule also has a positive coefficient estimate across specifications, but it is never significantly different from the reference point (reward scoring rule). These results indicate that for the decision whether to omit or not, subjects behave consistently with our theoretical results for risk averse individuals. Proposition 1 shows that the penalty and reward rules are not strategically equivalent for risk averse individuals, and the empirical evidence indicates that subjects behave differently when confronted with these scoring rules. These differences in behavior could be due to risk aversion, but also to a different framing. However, Proposition 3 shows that, despite the different framing, modified penalty and reward are strategically equivalent for all types of risk attitudes. The empirical evidence indicates that when confronted with these strategically equivalent rules, subjects do not behave in a significantly different manner, despite their different framing.

Table 6: Logistic Regression. Dependent Variable: Indicator (1) Omits (0) No Omits

Columns: (1) (2) (3) (4)
Penalty  0.650** 0.741*** 0.684** 0.694***  (0.266) (0.274) (0.268) (0.264)
Modified Penalty  (0.254) (0.262) (0.255) (0.254)
Male  *** *** *** ***  (0.218) (0.227) (0.218) (0.217)
Accumulated Score  0.116*** 0.105*** ***  (0.032) (0.032) (0.081) (0.032)
Accumulated Score Squared  *** *** ***  (0.001) (0.001) (0.002) (0.001)
Knowledge  (0.399)
Knowledge Square  (0.029)
Section  (0.303) (0.321) (0.308) (0.304)
Section  *** 0.739** 1.066*** 1.037***  (0.350) (0.366) (0.348) (0.345)
Section  (0.321) (0.350) (0.318) (0.315)
Section  (0.494) (0.536) (0.480) (0.482)
Knowledge  (0.124)
Knowledge 2 Squared  (0.002)
Exam  (0.706)
Exam  (0.883)
White  (0.261)
Yellow  0.447*  (0.264)
Constant  (1.406) (1.600) (0.328) (0.344)
Log-Likelihood
Observations

Standard errors in parentheses. One, two and three asterisks represent ten, five and one percent significance level.

The logistic regressions reported in Table 6 include other covariates. Gender has a significant effect on the decision whether to omit: males omit less than females.
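As a reading aid for Table 6, the following minimal Python sketch shows how a logit coefficient on a treatment dummy can be interpreted. It assumes a model with only an intercept and one dummy, in which case the maximum likelihood coefficient equals the log odds ratio between the treated group and the reference group. The counts are hypothetical and are not taken from the experiment, and the function names are illustrative. The last line converts the Penalty coefficient reported in column (1) of Table 6 into an odds ratio.

```python
import math


def log_odds(n_omitters, n_students):
    """Log odds of omitting at least one item within a group."""
    p = n_omitters / n_students
    return math.log(p / (1.0 - p))


def dummy_coefficient(treated, reference):
    """Logit coefficient on a single treatment dummy.

    Valid for an intercept-plus-dummy model only. Each argument is a
    pair (n_omitters, n_students).
    """
    return log_odds(*treated) - log_odds(*reference)


# Hypothetical counts, for illustration only (not data from the paper).
reward_group = (40, 100)   # reference group
penalty_group = (56, 100)  # treated group
beta = dummy_coefficient(penalty_group, reward_group)

# Penalty coefficient from column (1) of Table 6, read as an odds ratio.
odds_ratio_col1 = math.exp(0.650)
```

Under these assumptions a positive coefficient means higher odds of omitting at least one item than in the reference group. The value exp(0.650) is roughly 1.9, to be read as holding the other covariates fixed.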

The effect of the accumulated score in previous exams has an inverted U shape. The probability of omitting at least one item is first increasing and, beyond a point, decreasing in the accumulated score in previous exams. Knowledge does not appear to have a significant effect no matter how we measure it. The dummy Section 2 is significant, which might capture differences in instructors' teaching expertise. The exam-specific dummies, which capture unobserved exam-specific characteristics such as difficulty, are not statistically significant. Finally, the dummy corresponding to the Yellow group is marginally significant at the 10 percent level. Recall that the color group indicates a particular order in the administration of the treatments and, as argued above, a particular order in the administration of the treatments might affect subjects' behavior.

Next we analyze the influence of the scoring rules on the number of omissions. The dependent variable, the number of items omitted by a student in a particular exam, is a count variable. Therefore, we use Poisson regression for inference. Table 7 reports Poisson regression estimates where, in addition to the regressors used in Table 6, we have included interaction terms between the scoring rules and the other covariates. The reason for including these interactions is to correct for the lack of randomization after the first exam. According to the treatment effects literature, under the assumption of conditional mean independence, regressions of the outcome variable on the treatment dummy are enlarged with a set of covariates and the interaction of the treatment dummy and the covariates. Column (1) includes all the covariates and interaction terms. Column (2) excludes the interactions between scoring rules and sections and Column (3) excludes the exam dummies. In all cases the penalty scoring rule has a negative and significant coefficient estimate, whereas the modified penalty has a positive coefficient but it is never significant.

In accordance with the results reported in Table 5, subjects confronted with the penalty scoring rule omitted fewer items than those confronted with the reward scoring rule, while those confronted with the modified penalty scoring rule did not omit significantly more than those confronted with the reward scoring rule. Again, subjects behave differently when facing scoring rules that are not strategically equivalent. This difference in behavior could be explained by risk aversion or framing. However, subjects behaved similarly when facing strategically equivalent scoring rules, hence ruling out framing as an explanation of the differences in behavior.

In Table 6 the sign of the penalty coefficient is positive, so that subjects are more likely to be omitters under penalty than under reward (see also the histogram in Figure 1). In Table 7, the coefficient of penalty is negative, which indicates that this scoring rule lowers the number of omissions (compared to reward). Thus, under penalty more subjects omit, but they omit fewer items than under reward.

Table 7: Poisson Regressions. Dependent Variable: Number of Omissions

Columns: (1) (2) (3) (4) Male (5) Female
Penalty  ** *** *** ***  (0.586) (0.555) (0.519) (0.939) (0.687)
Modified Penalty  (0.509) (0.430) (0.412) (0.833) (0.441)
Male  *** *** ***  (0.153) (0.155) (0.155)
Accumulated Score  0.055* 0.056** 0.074*** 0.100*** 0.055***  (0.029) (0.028) (0.017) (0.032) (0.019)
Accumulated Score Squared  *** *** *** *** **  (0.001) (0.001) (0.001) (0.001) (0.001)
Knowledge  **  (0.075) (0.078) (0.077) (0.379) (0.078)
Knowledge Squared  *** *** *** **  (0.006) (0.007) (0.006) (0.026) (0.008)
Exam  (0.333) (0.327)
Exam  (0.222) (0.215)
Section  (0.222) (0.123) (0.123) (0.197) (0.146)
Section  ** 0.232* 0.244** *  (0.214) (0.124) (0.123) (0.219) (0.137)
Section  *  (0.230) (0.142) (0.143) (0.215) (0.186)
Section  *  (0.313) (0.155) (0.159) (0.339) (0.180)
White  * *  (0.292) (0.281) (0.222) (0.376) (0.260)
Yellow  *** ***  (0.290) (0.285) (0.224) (0.411) (0.234)
Penalty * Male  0.369* 0.348* 0.348*  (0.199) (0.201) (0.201)
Modified Penalty * Male  (0.219) (0.220) (0.221)
Penalty * Accumulated Score  (0.024) (0.024) (0.023) (0.039) (0.029)
Modified Penalty * Accumulated Score  ** **  (0.026) (0.025) (0.021) (0.038) (0.023)
Penalty * Knowledge  ** 0.140** ***  (0.067) (0.064) (0.064) (0.117) (0.075)
Modified Penalty * Knowledge  * 0.143** **  (0.074) (0.071) (0.070) (0.129) (0.076)
Penalty * Exam  (0.561) (0.552) (0.481) (0.763) (0.618)
Penalty * Exam  *** **  (0.511) (0.501) (0.318) (0.539) (0.373)
Penalty * Section  (0.297)
Penalty * Section  (0.280)
Penalty * Section  *  (0.331)
Penalty * Section  (0.358)
Modified Penalty * Section  (0.312)
Modified Penalty * Section  (0.325)
Modified Penalty * Section  (0.342)
Modified Penalty * Section  (0.423)
Constant  *  (0.428) (0.426) (0.351) (1.387) (0.353)
Log-Likelihood
Observations

Robust standard errors in parentheses. One, two and three asterisks represent ten, five and one percent significance level.
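The specification behind Table 7, treatment dummies enlarged with covariates and treatment-by-covariate interactions, can be sketched as follows. This is an illustrative Python sketch and not the authors' estimation code: the function names and the covariate names are hypothetical, and the coefficients passed to expected_omissions are assumed to come from a fitted Poisson model with the usual log link.

```python
import math


def design_row(penalty, modified_penalty, covariates):
    """Build one regressor row: constant, treatment dummies, covariates,
    and every treatment-by-covariate interaction.

    penalty, modified_penalty: 0/1 dummies (reward is the reference rule).
    covariates: dict mapping a covariate name to its value.
    Returns (names, values) as two parallel lists.
    """
    names = ["const", "penalty", "modified_penalty"]
    values = [1.0, float(penalty), float(modified_penalty)]
    for name, x in covariates.items():
        names.append(name)
        values.append(float(x))
    treatments = (("penalty", penalty), ("modified_penalty", modified_penalty))
    for treatment, dummy in treatments:
        for name, x in covariates.items():
            names.append(f"{treatment}*{name}")
            values.append(float(dummy) * float(x))
    return names, values


def expected_omissions(values, coefficients):
    """Poisson conditional mean with a log link: exp(x'b)."""
    return math.exp(sum(b * v for b, v in zip(coefficients, values)))


# Hypothetical student: male, accumulated score 12.5, graded with penalty.
names, values = design_row(1, 0, {"male": 1, "accumulated_score": 12.5})
```

Because the interactions are zero for the reference rule, the coefficients on the plain covariates describe students graded with the reward rule, and each interaction coefficient shifts that effect for the corresponding treatment.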

Table 7 also indicates that males tend to omit significantly less than females. To further investigate the effect of gender on omissions, Columns (4) and (5) report Poisson regressions for males and females respectively. In both cases the modified penalty is not significantly different from the reference point (reward scoring rule), indicating that neither males nor females are affected by the framing of the scoring rules. However, the penalty scoring rule dummy is significant for females (Column (5)) but not for males (Column (4)). A possible explanation of this result is that males behave as risk neutral individuals, who see penalty and reward as strategically equivalent, while females behave as risk averse individuals who do not consider the penalty and reward scoring rules as strategically equivalent.

Table 7 also reports coefficient estimates for the other covariates. The effect of the accumulated score in previous exams on omissions has an inverted U shape. Knowledge is a significant determinant of the number of omissions. Some of the dummies for sections and color groups are also significant determinants of the number of omissions. Finally, some interaction terms are significant. However, these interaction terms do not have a clear interpretation, as they are included in the regression to account for the lack of randomization after the first exam.

To sum up, the experimental results indicate that the behavior of examinees in multiple-choice tests is consistent with rationality and risk aversion and does not seem to be affected by framing. To interpret this result it is important to note, first, that the results come from a field experiment and subjects were participating in real exams so that their incentives to behave rationally were very high. Second, the scoring rules were known well in advance, so the subjects had enough time to think about what would be optimal for them to do under any of the rules.
If the stakes had not been so high or the rules had been announced by surprise just before the test, it is likely that framing would have played a more important role. Concerning risk attitudes, our results confirm that individuals do not exhibit risk neutrality on average since behavior under penalty is significantly different from behavior under modified penalty. However, when we look at the results splitting the sample into males and females, males tend to behave as risk neutral individuals while females act as risk averse persons.
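The role of risk attitudes can be illustrated with a single item treated in isolation. The Python sketch below is a simplification under stated assumptions and not the model of the paper, which works with compound lotteries over the whole exam. It assumes a CARA utility function u(x) = (1 - exp(-a x)) / a, with u(x) = x when a = 0, and computes the probability of answering correctly at which a student is indifferent between answering and omitting. The per-item payoffs are those of the penalty and reward rules in Appendix A, with the penalty for a wrong answer taken as exactly 2/3. All names are illustrative.

```python
import math

# Per-item payoffs under each rule (penalty for a wrong answer taken as 2/3).
PENALTY = {"right": 2.0, "wrong": -2.0 / 3.0, "omit": 0.0}
REWARD = {"right": 2.0, "wrong": 0.0, "omit": 0.5}


def utility(x, a):
    """CARA utility with risk aversion a >= 0; a == 0 is risk neutrality."""
    if a == 0:
        return x
    return (1.0 - math.exp(-a * x)) / a


def indifference_probability(rule, a):
    """Probability of a right answer at which answering and omitting
    yield the same expected utility: p*u(right) + (1-p)*u(wrong) = u(omit)."""
    u_right = utility(rule["right"], a)
    u_wrong = utility(rule["wrong"], a)
    u_omit = utility(rule["omit"], a)
    return (u_omit - u_wrong) / (u_right - u_wrong)


neutral = (indifference_probability(PENALTY, 0), indifference_probability(REWARD, 0))
averse = (indifference_probability(PENALTY, 1.0), indifference_probability(REWARD, 1.0))
```

With a = 0 both thresholds equal 1/4, the chance of a blind guess among four options, so the two rules give the same guessing incentives under risk neutrality. For a > 0 the two thresholds differ, which is the sense in which the rules are not strategically equivalent for risk averse students in this simplified setting.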

6 Concluding Remarks

In this paper we show that scoring rules that penalize for wrong answers and those that reward for omissions are not strategically equivalent for risk averse individuals, although in the psychometrics literature they have been considered equivalent, under the implicit assumption that individuals are risk neutral. Our main research question is whether subjects behave rationally or are affected by framing, for any type of risk attitudes. We designed the experiment to be able to test rationality, on the one hand, and the presence of risk aversion, on the other. We propose a modification of the penalty rule that makes the two scoring rules (penalty and reward) strategically equivalent for any type of risk attitude. By confronting students with equivalent rules with different framing (modified penalty and reward) and non-equivalent scoring rules with the same framing (penalty and modified penalty), it is possible to distinguish the effect of risk aversion from that of psychological factors related to the different framing of the rules (gains versus losses).

Our field experiment shows significant differences in students' behavior when they are evaluated with penalty for wrong answers and reward for omissions. In addition, we find no significant differences between modified penalty and reward scoring rules. This evidence is consistent with expected utility maximizing behavior of risk averse students. In addition to the scoring rule, other significant determinants of the number of omissions are the accumulated score in previous exams, knowledge and gender.

Our results may be of interest to examiners and theorists. First, this study has shown that risk aversion is an important factor in real exams. A useful implication of this result for examiners is that any scoring rule that penalizes for wrong answers or rewards for omissions does introduce a bias against risk averse students. As long as there is a link between risk attitudes and social group (gender, etc.) this issue may have some practical relevance.
However, the solution is not necessarily the elimination of penalties or rewards since that would increase the random component of grades in multiple-choice tests. Second, even though the optimality of scoring rules is beyond the scope of this paper, a better understanding of scoring rules, the incentives that they provide and students' reactions to penalties and rewards is likely to be relevant for the optimal way of designing multiple-choice tests. Our main empirical result is that risk aversion appears as an important factor in real exams, and therefore this variable should not be ignored in any study of the optimality of scoring rules.

Appendix A Instructions

The original instructions were given in Spanish and Basque: what follows is a translation. All treatments included the following general instructions.

Read all instructions carefully. You are not allowed to talk during the exam. If you have a question, raise your hand. Write down your name and ID number on the answer sheet and on this exam. At the end of the exam you must hand in both this exam and your answer sheet. This exam has 10 items. Each item has four possible answers and only one is correct. You have 30 minutes.

In addition to these general instructions, each treatment had one more instruction regarding the scoring rule used for that treatment.

TREATMENT S_P

Your score will be given by the following formula:

score = 2 x (rights - wrongs/3) = (2 x rights) - (0.66 x wrongs).

That is, your score depends on the number of rights, wrongs and omits. Each item will be graded according to the following table:

Scoring Rule
Right  +2
Wrong  -0.66
Omit   0

TREATMENT S_P*

Your score will be given by the following formula:

score = 5 + 1.5 x (rights - wrongs/3) = 5 + (1.5 x rights) - (0.5 x wrongs).

That is, your score depends on the number of rights, wrongs and omits. Each item will be graded according to the following table:

Scoring Rule
Right  +1.5
Wrong  -0.5
Omit   0

TREATMENT S_R

Your score will be given by the following formula:

score = 2 x (rights + omits/4) = (2 x rights) + (0.5 x omits).

That is, your score depends on the number of rights, wrongs and omits. Each item will be graded according to the following table:

Scoring Rule
Right  +2
Wrong  0
Omit   +0.5
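The three scoring formulas above can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, and exact fractions are used, so the penalty per wrong answer is 2/3 rather than the rounded 0.66. It enumerates every possible combination of rights, wrongs and omits in a 10-item exam and tests whether two rules always give the same score.

```python
from fractions import Fraction

N_ITEMS = 10  # each exam in the experiment had 10 items


def penalty(rights, wrongs, omits):
    """Penalty rule: 2 x (rights - wrongs/3)."""
    return 2 * rights - Fraction(2, 3) * wrongs


def modified_penalty(rights, wrongs, omits):
    """Modified penalty rule: 5 + (1.5 x rights) - (0.5 x wrongs)."""
    return 5 + Fraction(3, 2) * rights - Fraction(1, 2) * wrongs


def reward(rights, wrongs, omits):
    """Reward rule: (2 x rights) + (0.5 x omits)."""
    return 2 * rights + Fraction(1, 2) * omits


def all_outcomes(n_items=N_ITEMS):
    """Yield every (rights, wrongs, omits) triple summing to n_items."""
    for rights in range(n_items + 1):
        for wrongs in range(n_items + 1 - rights):
            yield rights, wrongs, n_items - rights - wrongs


modified_equals_reward = all(
    modified_penalty(*o) == reward(*o) for o in all_outcomes()
)
penalty_equals_reward = all(penalty(*o) == reward(*o) for o in all_outcomes())
```

In a 10-item exam the constant 5 in the modified penalty formula amounts to 0.5 points per item, which is why the first check holds for every outcome while the second does not.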

References

Becker, W. E. and C. Johnston, (1999), The Relationship between Multiple Choice and Essay Response Questions in Assessing Economics Understanding, The Economic Record, 75,
Bereby-Meyer, Y., J. Meyer, and O. M. Flascher, (2002), Prospect Theory Analysis of Guessing in Multiple Choice Tests, Journal of Behavioral Decision Making, 15,
Bereby-Meyer, Y., J. Meyer, and D. V. Budescu, (2003), Decision Making under Internal Uncertainty: The Case of Multiple Choice Tests with Different Scoring Rules, Acta Psychologica, 112,
Bernardo, José M., (1998), A Decision Analysis Approach to Multiple Choice Examinations, In: Girón, F. J. (ed.), Applied Decision Analysis, Boston, Kluwer,
Bertrand, M., D. S. Karlan, S. Mullainathan, E. Shafir, and J. Zinman, (2005), What's Psychology Worth? A Field Experiment in the Consumer Credit Market, Working Papers 918, Economic Growth Center, Yale University.
Bredon, G., (2003), Take-Home Tests in Economics, Economic Analysis and Policy, Queensland University of Technology, School of Economics and Finance, 33,
Budescu, D. and M. Bar-Hillel, (1993), To Guess or Not to Guess: A Decision-Theoretic View of Formula Scoring, Journal of Educational Measurement, 30,
Burgos, A., (2004), Guessing and Gambling, Economics Bulletin, 4,
Byrnes, J. P., D. C. Miller, and W. D. Schafer, (1999), Gender Differences in Risk Taking: A Meta-Analysis, Psychological Bulletin, 125,
Cadsby, C. B. and E. Maynes, (2005), Gender, Risk Aversion, and the Drawing Power of Equilibrium in an Experimental Corporate Takeover Game, Journal of Economic Behavior and Organization, 56,
Chan, N. and P. Kennedy, (2002), Are Multiple-Choice Exams Easier for Economics Students? A Comparison of Multiple-Choice and 'Equivalent' Constructed-Response Exam Questions, Southern Economic Journal, 68, 957-