Practical Peer Prediction for Peer Assessment Victor Shnayder and David C. Parkes Harvard University
Slide 1: Practical Peer Prediction for Peer Assessment. Victor Shnayder and David C. Parkes, Harvard University. Oct 31, 2016, HCOMP '16.
Slide 2: Motivation: MOOCs.
Slide 3: MOOCs: learn to code, write, design, speak... online for free.
Slide 4: But how to scale feedback? ...while we wait for AI.
Slide 5: Crowdsource!
Slide 6: Example peer feedback on a submission: "Great!", "Great!", "Argument?", "Terrible".
Slide 7: But how to scale good feedback?
Slide 8: Peer prediction!
Slide 9: Peers evaluate each other's feedback: "Great!" rated "good!", "Terrible" rated "bad!", and so on.
Slide 10: Other aspects and approaches: rubric design, training students, building community, staff review...
Slide 11: Peer prediction beyond MOOCs: gathering location-specific info, image and video labeling, search result evaluation, academic peer review, participatory sensing.
Slide 12: Task model: a task sends signal i to Agent 1.
Slide 13: Task model: signal i is drawn with prior probability P(i).
Slide 14: Task model: Agents 1 and 2 observe signals i and j on the same task, with joint probability P(i, j).
Slide 15: Task model: people can misreport!
Slide 16: Task model: the goal is to design scores that encourage effort and truthful reports.
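The task model on slides 12-16 can be sketched in code. The particular 3-signal joint distribution below is an illustrative assumption of this sketch, not data from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P(i, j) over three signal values,
# chosen for illustration only. Rows index Agent 1's signal i,
# columns Agent 2's signal j; the diagonal carries most of the mass,
# so the two agents' signals on a shared task are positively correlated.
P = np.array([
    [0.30, 0.05, 0.05],
    [0.05, 0.25, 0.05],
    [0.05, 0.05, 0.15],
])

# The prior P(i) for a single agent is the marginal of the joint.
prior = P.sum(axis=1)

def sample_signal_pair(P, rng):
    """Draw one (i, j) signal pair for two agents on a shared task."""
    flat = rng.choice(P.size, p=P.ravel())
    return divmod(flat, P.shape[1])

i, j = sample_signal_pair(P, rng)
```

A mechanism designer never sees the true signals i and j, only the agents' reports, which is why misreporting (slide 15) is the central problem.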
Slide 17: Output agreement (von Ahn & Dabbish '04): if Agents 1 and 2 agree on a task, they score.
Slide 18: Output agreement: if they don't agree, no score.
Slide 19: Output agreement: honest reporting is a correlated equilibrium if my signal predicts yours.
Slide 20: Output agreement: to manipulate, all agents always report the same thing.
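The output-agreement rule on slides 17-20 reduces to a one-line scoring function; the unit payment here is illustrative:

```python
def output_agreement_score(report_1, report_2):
    """Output agreement (von Ahn & Dabbish '04): pay a fixed amount
    (1 here, illustrative) when two reports on the same task match,
    and nothing otherwise."""
    return 1.0 if report_1 == report_2 else 0.0

# The manipulation from slide 20: if every agent always reports the
# same fixed value, every pair agrees and everyone scores full marks
# without looking at the task at all.
collusion_scores = [output_agreement_score("A", "A") for _ in range(5)]
```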
Slide 21: Ensuring truthful reporting is best [Kamble et al. '15, Radanovic et al. '16]: on agreement, score 1/(scaling factor), where the scaling factor is learned from reports on many similar tasks. Truthfulness is an equilibrium and guarantees the highest payoff.
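A minimal sketch in the spirit of these scaled-agreement mechanisms, not their exact scoring rules: agreement on a rare signal pays more than agreement on a common one, which removes the incentive to collude on a single popular report. The frequency table stands in for the scaling factor learned from many similar tasks:

```python
def scaled_agreement_score(r1, r2, empirical_freq):
    """Scaled agreement, sketched: pay 0 on disagreement, and
    1 / empirical_freq[s] on agreement on signal s, so agreeing on a
    rare signal is worth more. `empirical_freq` is an assumed stand-in
    for the scaling factor learned from reports on similar tasks."""
    if r1 != r2:
        return 0.0
    return 1.0 / empirical_freq[r1]

# With 80% "correct" reports and 20% "flawed" reports, agreement on
# the rare signal pays 5.0 while agreement on the common one pays 1.25.
freq = {"correct": 0.8, "flawed": 0.2}
```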
Slide 22: Multi-task approach [Dasgupta-Ghosh '13, Shnayder et al. '16]: Agent 1 reports on Tasks 1 and 2.
Slide 23: Multi-task approach: Agent 1 reports on Tasks 1 and 2; Agent 2 reports on Tasks 2 and 3, so Task 2 is shared.
Slide 24: Key idea: on the shared task, the two agents' reports are likely to match.
Slide 25: Key idea: across non-shared tasks, the reports are less likely to match.
Slide 26: Key idea: reward matching on shared tasks; punish matching on non-shared tasks.
Slide 27: Correlated Agreement (Shnayder et al. '16): 1. Split tasks into shared and non-shared. 2. Score = (agree on shared) - (agree on non-shared). "Agree" also counts reports that aren't equal but are positively correlated. Constant or random reporting still has expected score 0.
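The CA score can be sketched as follows, again assuming a small illustrative joint distribution rather than the talk's learned models. "Agreement" is defined by the sign of the delta matrix Delta(i, j) = P(i, j) - P(i)P(j), which marks report pairs that are positively correlated even when not equal:

```python
import numpy as np

# Illustrative diagonal-heavy joint distribution over three signals
# (an assumption of this sketch, not the talk's data).
P = np.array([
    [0.30, 0.05, 0.05],
    [0.05, 0.25, 0.05],
    [0.05, 0.05, 0.15],
])

def ca_score_matrix(P):
    """S[i, j] = 1 where Delta(i, j) = P(i, j) - P(i)P(j) > 0,
    i.e. where signals i and j are positively correlated."""
    delta = P - np.outer(P.sum(axis=1), P.sum(axis=0))
    return (delta > 0).astype(float)

S = ca_score_matrix(P)  # with this P, S is the 3x3 identity matrix

def ca_score(shared_pair, nonshared_pair, S):
    """CA payment for one agent: S on the shared-task report pair
    minus S on a non-shared report pair. Constant or random reporting
    has expected score 0 by construction."""
    return S[shared_pair] - S[nonshared_pair]
```

For example, matching on the shared task while differing across non-shared tasks earns +1, while matching everywhere (the collusion strategy against output agreement) earns 0 on net.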
Slide 28: Story so far: peer prediction can encourage effort and make the truthful equilibrium the best equilibrium.
Slide 29: Story so far: ...the best equilibrium in terms of expected value.
Slide 30: Is E[truthful] > E[other] enough?
Slide 31: Going beyond expected value.
Slide 32: I. Fairness is important. "You both did an equally good job. You get 50 points, she gets ..."
Slide 33: II. Score isn't all that matters. "You deserve 75 points. We'll probably give you 100, but maybe 0. Sounds good?"
Slide 34: III. Truthful score > random score isn't enough. "To cover the cost of effort, just multiply all scores by ..."
Slide 35: Approach: compare four mechanisms: Output Agreement (OA), the simplest; Kamble et al.'s RPTS; and Correlated Agreement (CA), the best theoretically (its truthful equilibrium is the best equilibrium).
Slide 36: Approach: simulate on a data set of 3,000,000 edX peer assessments.
Slide 37: Approach: about 2,000 world models, each based on about 1,500 assessments.
Slide 38: Approach: note the deployed system had no incentives for effort.
Slide 39: Fairness: measured by the coefficient of variation of scores (coefficient of variation = std dev / mean).
Slide 40: Fairness: CA has lower score variability because it rewards non-exact agreement.
Slide 41: Risk aversion: also measured by the coefficient of variation of scores.
Slide 42: Risk aversion: CA has lower score variability because it rewards non-exact agreement.
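The fairness metric from slide 39 is a one-liner; the two score lists below are hypothetical, just to show how a wider spread around the same mean raises the coefficient of variation:

```python
import numpy as np

def coefficient_of_variation(scores):
    """Coefficient of variation = standard deviation / mean.
    Lower values mean equally good work earns more similar scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.std() / scores.mean()

# Both hypothetical score lists have mean 10, but the second spreads
# the same mean over a wider range, so its CoV is higher (less fair,
# and worse for a risk-averse student).
even = [9, 10, 11, 10]
spread = [0, 20, 5, 15]
```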
Slide 43: Value of effort: histogram of scores; the expected score of random reporting is 0.
Slide 44: Value of effort: average scores across ~2,000 models.
Slide 45: Value of effort: low benefit of truthful over random reporting.
Slide 46: Value of effort: scores assuming perfect agreement between peers.
Slide 47: Lessons: fairness, risk aversion, and score differences matter! Low levels of agreement between peers are a problem (recall: no effort incentives in the deployed system). To increase agreement between reports: try peer prediction and PropeRBoost [R+F, HCOMP '16]; also better rubrics, instructions, and training.
Slide 48: Is peer prediction practical? CA and other recent mechanisms are promising for real deployments. Which is best? (See the information-theoretic approach of [Kong-Schoenebeck '16].) In education, I would deploy peer prediction as ungraded feedback first.
Slide 49: Thank you! Victor Shnayder and David C. Parkes, Harvard University. Code and data:
Slide 50: Extra slides.
Slides 51-52: Collusion on unintended signals.