Practical Peer Prediction for Peer Assessment. Victor Shnayder and David C. Parkes, Harvard University.


1 Practical Peer Prediction for Peer Assessment. Victor Shnayder and David C. Parkes, Harvard University. Oct 31, 2016, HCOMP '16.

2 Motivation: MOOCs

3 MOOCs: learn to code, write, design, speak... online for free.

4 But how to scale feedback? ...while we wait for AI

5 Crowdsource!

6 [Crowdsourced peer feedback on a submission: "Great!" "Great!" "Argument?" "Terrible"]

7 But how to scale good feedback?

8 Peer prediction!

9 [Evaluate the feedback itself: "Great!" good! "Great!" good! "Argument?" good! "Terrible" bad!]

10 Other aspects and approaches: rubric design, training students, building community, staff review, ...

11 Peer prediction beyond MOOCs: gathering location-specific info, image and video labeling, search result evaluation, academic peer review, participatory sensing.

12 Task model: Agent 1 observes a signal i ∈ {1, ..., n} about the task.

13 Task model: Agent 1's signal i is drawn with prior probability P(i).

14 Task model: Agent 1 observes signal i and Agent 2 observes signal j, with joint probability P(i,j).

15 Task model: agents observe signals i and j, but people can misreport!

16 Task model: the goal is to design scores that encourage effort and truthful reports.
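
A minimal sketch of this task model in code, under assumed notation: two agents observe correlated signals about the same task, drawn from a joint distribution P(i,j). The distribution values and names below are illustrative, not from the talk.

```python
import random

# Illustrative joint distribution: P[i][j] is the probability that Agent 1
# observes signal i and Agent 2 observes signal j on the same task.
# The marginal over i is Agent 1's prior P(i).
P = {
    1: {1: 0.45, 2: 0.10},
    2: {1: 0.10, 2: 0.35},
}

def sample_signals(joint):
    """Draw a pair of signals (i, j) from the joint distribution."""
    pairs = [(i, j) for i in joint for j in joint[i]]
    weights = [joint[i][j] for (i, j) in pairs]
    return random.choices(pairs, weights=weights)[0]

i, j = sample_signals(P)  # agents may then report these signals truthfully, or misreport
```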

17 Output agreement (von Ahn & Dabbish '04): if Agent 1 and Agent 2 agree on the task, both get a score.

18 Output agreement: if they don't agree, no reward.

19 Output agreement: honest reporting is a correlated equilibrium if my signal predicts yours.

20 Output agreement: to manipulate, all agents always report the same thing.
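
A minimal sketch of output agreement scoring, assuming the standard form of the mechanism: score 1 for matching reports, 0 otherwise.

```python
def output_agreement_score(report_1, report_2):
    """Output agreement (von Ahn & Dabbish '04): reward exact matches.

    Honest reporting is an equilibrium when my signal predicts yours,
    but the uninformative strategy "everyone always reports the same value"
    also earns the maximum score, which is the weakness the later
    mechanisms address.
    """
    return 1 if report_1 == report_2 else 0
```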

21 Ensuring truthful reporting is best [Kamble et al. '15, Radanovic et al. '16]: when agents agree, each scores 1/(scaling factor), with the scaling factor learned from reports on many similar tasks. Truthfulness is an equilibrium and guarantees the highest payoff.
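
A simplified sketch of the scaling idea on this slide: agreement on a rare report is worth more than agreement on a common one, with the scaling factor estimated from reports on many similar tasks. Using the empirical report frequency as the scaling factor is an assumption for illustration; the actual mechanisms of Kamble et al. and Radanovic et al. use their own learned scalings.

```python
from collections import Counter

def scaled_agreement_score(report_a, report_b, reports_on_similar_tasks):
    """Scaled agreement: reward matches by 1/(scaling factor).

    Here the scaling factor is the empirical frequency of the matched report
    across many similar tasks (an illustrative choice, not the exact rule
    from the cited papers).
    """
    if report_a != report_b:
        return 0.0
    freq = Counter(reports_on_similar_tasks)
    scaling = freq[report_a] / len(reports_on_similar_tasks)
    return 1.0 / scaling if scaling > 0 else 0.0
```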

22 Multi-task approach [Dasgupta-Ghosh '13, Shnayder et al. '16]: Agent 1 reports on several tasks (Task 1, Task 2).

23 Multi-task approach [Dasgupta-Ghosh '13, Shnayder et al. '16]: Agent 1 reports on Tasks 1 and 2, Agent 2 reports on Tasks 2 and 3; Task 2 is shared.

24 Key idea: on the shared task (Task 2), the two agents' reports are likely to match.

25 Key idea: on non-shared tasks (Task 1 vs. Task 3), their reports are less likely to match.

26 Key idea: reward matching on shared tasks; punish matching on non-shared tasks.

27 Correlated Agreement (Shnayder et al. '16): 1. Split tasks into shared and non-shared. 2. Score = (agreement on shared tasks) − (agreement on non-shared tasks). Reports "agree" even when they aren't equal, as long as they are positively correlated. Constant or random reporting still has expected score 0.
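
A minimal sketch of the CA score for one pair of agents, simplified to a single shared and a single non-shared task and to exact-match agreement. In the full mechanism of Shnayder et al., "agreement" also covers distinct but positively correlated report pairs, via a learned correlation structure; the names here are illustrative.

```python
def ca_score(shared_a, shared_b, nonshared_a, nonshared_b, agree=None):
    """Correlated Agreement, simplified to one shared and one non-shared pairing.

    Score = agreement on the shared task minus agreement on non-shared tasks,
    so constant or random reporting has expected score 0.
    """
    if agree is None:
        agree = lambda x, y: x == y  # full CA uses a learned correlation structure
    return int(agree(shared_a, shared_b)) - int(agree(nonshared_a, nonshared_b))

# Example: agents match on the shared task but not on the non-shared pairing.
print(ca_score(shared_a=1, shared_b=1, nonshared_a=1, nonshared_b=2))  # -> 1
```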

28 Story so far: peer prediction to encourage effort; can make the truthful equilibrium the best equilibrium.

29 Story so far: peer prediction to encourage effort; can make the truthful equilibrium the best equilibrium, in terms of expected value.

30 Is E[truthful] ≥ E[other] enough?

31 Going beyond expected value.

32 I. Fairness is important: "You both did an equally good job. You get 50 points, she gets ..."

33 II. Score isn't all that matters: "You deserve 75 points. We'll probably give you 100, but maybe 0. Sounds good?"

34 III. Truthful score > random score isn't enough: "To cover the cost of effort, just multiply all scores by ..."

35 Approach: compare four mechanisms: Output Agreement (OA), the simplest; Kamble; RPTS; and Correlated Agreement (CA), the best theoretically (its truthful equilibrium is the best equilibrium).

36 Approach: simulate on a data set of 3,000,000 edX peer assessments.

37 Approach: simulate on a data set of 3,000,000 edX peer assessments. ~2000 world models, each based on ~1500 assessments.

38 Approach: simulate on a data set of 3,000,000 edX peer assessments. No incentives for effort in the deployed system.
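
One plausible way to build a "world model" from peer-assessment data, sketched here as an empirical joint distribution over pairs of scores given to the same submission. This construction is an assumption for illustration, not the exact estimation procedure from the paper.

```python
from collections import Counter

def estimate_world_model(assessment_pairs):
    """Estimate a joint signal distribution P(i, j) from pairs of peer scores
    on the same submission (roughly ~1500 assessments per model in the talk's
    setup; the estimator here is illustrative)."""
    counts = Counter()
    for i, j in assessment_pairs:
        counts[(i, j)] += 1
        counts[(j, i)] += 1  # symmetrize: the two peers are interchangeable
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}
```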

39 Fairness: coefficient of variation of scores (coefficient of variation = std dev / mean).

40 Fairness (coefficient of variation): CA has lower score variability because it rewards non-exact agreement.

41 Risk aversion: coefficient of variation of scores.

42 Risk aversion (coefficient of variation): CA has lower score variability because it rewards non-exact agreement.
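
The fairness and risk metric used on these slides, coefficient of variation = std dev / mean, computed over simulated scores; a small sketch with illustrative function names.

```python
import statistics

def coefficient_of_variation(scores):
    """Coefficient of variation = std dev / mean. Lower means scores vary less
    across equally good work, i.e. fairer and less risky for students."""
    mean = statistics.mean(scores)
    return statistics.pstdev(scores) / mean if mean != 0 else float("inf")
```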

43 Value of effort: histogram of scores (count); expected score of random reporting is ...

44 Value of effort: average scores across ~2000 models.

45 Value of effort: low benefit of truthful over random reporting.

46 Value of effort: scores with perfect agreement between peers.

47 Lessons: Fairness, risk aversion, and score differences matter! Low levels of agreement between peers are a problem; recall there were no effort incentives in the deployed system. To increase agreement between reports: try peer prediction and PropeRBoost [R+F, HCOMP '16], plus better rubrics, instructions, and training.

48 Is peer prediction practical? CA and other recent mechanisms are promising for real deployments. Which is best? (See the information-theoretic approach of [Kong-Schoenebeck '16].) In education, I would deploy peer prediction as ungraded feedback first.

49 Thank you! Victor Shnayder and David C. Parkes, Harvard University. Code and data: ...

50 Extra slides

51 Collusion on unintended signals

52 Collusion on unintended signals