
1 Crowdsourcing
Markus Rokicki, L3S Research Center

3 Human Computation
The Automaton Chess Player, or Mechanical Turk. [1]
Crowdsourcing facilitates human computation through the web.
[1] Source:

4 Example: Product Categorization
Figure: Product categorization task on Amazon Mechanical Turk

5 Example: Stereotype vs. Gender Appropriate [3]
[3] Tolga Bolukbasi et al. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Advances in Neural Information Processing Systems. 2016.

6 Example: Damage Assessment during Disasters [4]
[4] Luke Barrington et al. Crowdsourcing earthquake damage assessment using remote sensing imagery. In: Annals of Geophysics 54.6 (2012).

7 Crowdsourcing Definition
Crowdsourcing, according to a literature survey [5]:
- A participative online activity: an individual, institution, non-profit, or company proposes the voluntary undertaking of a task in a flexible open call to a group of individuals of varying knowledge, heterogeneity, and number.
- Users receive satisfaction of a given type of need: economic, social recognition, self-esteem, learning, ...
- The exchange always entails mutual benefit.
[5] Enrique Estellés-Arolas and Fernando González-Ladrón-de Guevara. Towards an integrated crowdsourcing definition. In: Journal of Information Science 38.2 (2012).

8 Crowdsourcing Platforms
Some crowdsourcing platforms:
- Amazon Mechanical Turk: paid microtasks
- CrowdFlower: paid microtasks
- Topcoder: programming competitions
- Threadless: propose and vote on t-shirt designs
- Kickstarter: fund products

9 Paid Microtask Crowdsourcing
Paid crowdsourcing scheme.
Figure source:

12 Task Design
Depending on the application:
- (How) should you break up the task?
- More or fewer answer options? Free-form answers?
- How should the problem be presented?
Example: highlighting keywords in search results [12].
[12] Omar Alonso and Ricardo Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. In: European Conference on Information Retrieval. Springer, 2011.
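The design questions above can be made concrete as a small task template. A minimal Python sketch; the field names and the example task are illustrative assumptions, not material from the lecture:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MicrotaskTemplate:
    """Hypothetical description of one microtask unit (all names are illustrative)."""
    instruction: str                      # how the problem is presented to the worker
    item: str                             # the data item to judge, e.g. a search result snippet
    options: Optional[List[str]] = None   # closed answer options, or None for free-form input
    highlight_keywords: List[str] = field(default_factory=list)  # e.g. query terms to highlight

relevance_task = MicrotaskTemplate(
    instruction="Is this search result relevant to the query 'jaguar speed'?",
    item="The jaguar can reach speeds of up to 80 km/h over short distances.",
    options=["Relevant", "Somewhat relevant", "Not relevant"],
    highlight_keywords=["jaguar", "speed"],
)
```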

13 Task Design Issues: Cognitive Biases [15]
Anchoring effects: humans start with a first approximation (anchor) and then make adjustments to that number based on additional information. [13]
Observed in crowdsourcing experiments [14]:
- Group A, Q1: More or less than 65 African countries in the UN?
- Group B, Q1: More or less than 12 African countries in the UN?
- Q2 (both groups): How many countries are there in Africa?
- Group A mean: 42.6; Group B mean: 18.5
Also: order effects and many more.
[13] Daniel Kahneman and Amos Tversky. Subjective probability: A judgment of representativeness. In: Cognitive Psychology 3.3 (1972).
[14] Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. Running experiments on Amazon Mechanical Turk. In: Judgment and Decision Making 5.5 (2010).
[15] Adapted from

14 Quality: Worker Access and Qualification
Crowdsourcing platforms offer means to restrict access to tasks based on:
- Trust levels, based on overall past accuracy
- Geography
- Language skills: certified through language tests, or classified by the platform based on geolocation, browser data, and user history
- Qualification for specific kinds of tasks, gained through qualification/training tasks
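As an illustration of such restrictions (independent of any specific platform's API), eligibility can be expressed as a simple predicate over a worker profile. All field names and thresholds below are assumptions:

```python
def is_eligible(worker, min_trust=0.9, allowed_countries=None, required_languages=None):
    """Illustrative eligibility check; `worker` is a dict with hypothetical keys."""
    if worker.get("past_accuracy", 0.0) < min_trust:
        return False
    if allowed_countries and worker.get("country") not in allowed_countries:
        return False
    if required_languages and not set(required_languages) <= set(worker.get("languages", [])):
        return False
    return True

worker = {"past_accuracy": 0.95, "country": "DE", "languages": ["de", "en"]}
print(is_eligible(worker, allowed_countries={"DE", "AT"}, required_languages=["en"]))  # True
```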

15 Qualification Tasks
Qualification tests also provide feedback about the judgments expected from the worker and thus influence quality [17].
Figure (right): accuracy depending on class imbalance in the training questions. Classes: Matching, Not Matching.
[17] John Le et al. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In: SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation. 2010.

16 Measuring Worker Accuracy: Gold Standard
- We would like to know how reliable answers are for the task.
- Requesters may want to reject the work of unreliable workers, in particular spammers who do not try to solve the task.
- Gold standard / honey-pot tasks:
  - Add tasks for which the ground truth is already known at random positions.
  - Compare worker input with the ground truth.
  - Feedback on correctness is given to the workers.
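A minimal sketch of the honey-pot idea: shuffle gold tasks into a batch at random positions and score each worker against the known answers. Function and variable names are illustrative assumptions:

```python
import random

def build_batch(regular_tasks, gold_tasks):
    """Mix gold tasks into the regular tasks at random positions."""
    batch = list(regular_tasks) + list(gold_tasks)
    random.shuffle(batch)
    return batch

def gold_accuracy(answers, gold):
    """answers: {worker_id: {task_id: label}}, gold: {task_id: true_label}."""
    acc = {}
    for worker, labels in answers.items():
        judged = [t for t in labels if t in gold]
        if judged:
            correct = sum(labels[t] == gold[t] for t in judged)
            acc[worker] = correct / len(judged)
    return acc

# Workers whose gold accuracy is too low can be reviewed or rejected.
gold = {"g1": "cat", "g2": "dog"}
answers = {"w1": {"g1": "cat", "g2": "dog", "t5": "cat"},
           "w2": {"g1": "dog", "g2": "dog", "t5": "dog"}}
print(gold_accuracy(answers, gold))  # {'w1': 1.0, 'w2': 0.5}
```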

17 Using Gold Standard [18]
Challenges:
- Need enough gold standard tasks, or limit the number of tasks per worker.
- Workers who recognize the gold standard will tend to spam.
- Data composition: class balance; educational examples addressing likely errors.
- Gold tasks need to be indistinguishable from regular tasks, but also unambiguous.
Standard workflow: iterate ground truth creation.
- Start with a small hand-crafted ground truth.
- Use annotations to create additional gold standard data.
[18] David Oleson et al. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. In: Human Computation (2011).
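One hedged sketch of the iteration step: items on which enough workers agree can be promoted to new gold. The vote and agreement thresholds are assumptions, not values from the cited paper:

```python
from collections import Counter

def promote_to_gold(annotations, min_votes=5, min_agreement=1.0):
    """annotations: {task_id: [labels from different workers]}.
    Returns new gold labels for items with strong consensus."""
    new_gold = {}
    for task, labels in annotations.items():
        if len(labels) < min_votes:
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            new_gold[task] = label
    return new_gold

annotations = {"t1": ["cat"] * 5, "t2": ["cat", "cat", "dog", "cat", "cat"]}
print(promote_to_gold(annotations))  # {'t1': 'cat'}
```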

18 Redundancy: Wisdom of Crowds
So far: ensuring individual annotation quality.
However, even best-effort human annotations are imperfect most of the time.

19 Redundancy: Wisdom of Crowds
Source:

21 Redundant Annotations and Aggregation of Results
Redundancy: each task is annotated by multiple workers.
- Quality can be estimated based on inter-annotator agreement (e.g. Fleiss' kappa).
- If a certain level of agreement is not reached, obtain more annotations.
- Aggregate answers by majority vote (in the categorical case).
- Improves accuracy if workers are better than random.
- Introduces additional cost.
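A minimal sketch of majority-vote aggregation over redundant categorical labels, with a simple per-item agreement score that can trigger collecting more annotations. The agreement threshold is an assumption:

```python
from collections import Counter

def aggregate(annotations, min_agreement=0.6):
    """annotations: {task_id: [labels]}. Returns {task_id: (label, agreement, needs_more)}."""
    results = {}
    for task, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        results[task] = (label, agreement, agreement < min_agreement)
    return results

print(aggregate({"t1": ["cat", "cat", "dog"], "t2": ["dog", "cat"]}))
# t1 -> ('cat', 0.67, False); t2 is a tie, broken arbitrarily and flagged for more annotations
```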

22 Majority Vote Accuracy [20]
Majority vote accuracy depending on redundancy and individual accuracy, assuming equal accuracy p for all workers.
Figure source:
[20] Ludmila I. Kuncheva et al. Limits on the majority vote accuracy in classifier fusion. In: Pattern Analysis & Applications 6.1 (2003).
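Under the slide's equal-accuracy assumption, and additionally assuming independent workers, the majority vote accuracy is a binomial tail sum. A small sketch (odd n only, to avoid ties):

```python
from math import comb

def majority_accuracy(p, n):
    """P(majority of n independent annotators is correct), each correct with probability p."""
    assert n % 2 == 1, "use an odd number of annotators to avoid ties"
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 3, 5, 7):
    print(n, round(majority_accuracy(0.7, n), 3))
# accuracy grows with redundancy as long as p > 0.5 (and shrinks if p < 0.5)
```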

23 Estimating Worker Reliability [21]
In reality, worker accuracy varies. Solution: estimate worker reliability and take it into account when computing the labels.
Sketch of the approach:
- Treat each worker as a classifier, characterized by the probability of correctly classifying each class.
- Estimate iteratively using expectation maximization:
  - E-step: estimate the hidden true labels of the data given the current estimated worker reliability.
  - M-step: estimate worker reliability given the current estimated labels.
[21] Alexander Philip Dawid and Allan M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. In: Applied Statistics (1979).
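An illustrative NumPy sketch of a Dawid–Skene style EM loop. The initialization from per-task vote shares and the smoothing constant are implementation choices of this sketch, not details from the paper:

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels: list of (task, worker, label) with integer ids starting at 0.
    Returns (posterior over true labels per task, per-worker confusion matrices)."""
    n_tasks = max(t for t, _, _ in labels) + 1
    n_workers = max(w for _, w, _ in labels) + 1

    # Initialize posteriors from per-task vote shares (a soft majority vote).
    T = np.zeros((n_tasks, n_classes))
    for t, _, l in labels:
        T[t, l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and worker confusion matrices from current posteriors.
        prior = T.sum(axis=0) / n_tasks
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)  # small smoothing
        for t, w, l in labels:
            conf[w, :, l] += T[t]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over true labels given priors and confusion matrices.
        logT = np.tile(np.log(prior), (n_tasks, 1))
        for t, w, l in labels:
            logT[t] += np.log(conf[w, :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T, conf

# Toy example: workers 0 and 1 agree, worker 2 answers adversarially on a binary task.
labels = [(0, 0, 1), (0, 1, 1), (0, 2, 0),
          (1, 0, 0), (1, 1, 0), (1, 2, 1)]
posteriors, confusion = dawid_skene(labels, n_classes=2)
print(posteriors.argmax(axis=1))  # estimated true labels per task
```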

24 Incentives: Influence of Payment 1 [22]
Setting: reorder a list of images taken from traffic cameras chronologically.
611 participants sorted 36,000 image sets of varying size, for varying payments.
[22] Winter Mason and Duncan J. Watts. Financial incentives and the performance of crowds. In: ACM SIGKDD Explorations Newsletter 11.2 (2010).

25 Findings
Figure: accuracy (left) and number of completed tasks (right) in relation to payment.

26 Findings
Figure: post-hoc survey of perceived value.

27 Influence of Payment 2 [23]
Setting: find planets orbiting distant stars.
[23] Andrew Mao et al. Volunteering versus work for pay: Incentives and tradeoffs in crowdsourcing. In: First AAAI Conference on Human Computation and Crowdsourcing.

28 Findings
Experiments with 356 workers annotating light curves. Simulated transits were added to real light curves (with noisy brightness values).
Figure: accuracy of results depending on task difficulty.

29 Additional Incentives: Competitions and Teamwork [24]
Rewards:
- Rank-based rewards: non-linear distribution among teams.
- Individual share proportional to contribution.
Communication: team chats.
[24] Markus Rokicki, Sergej Zerr, and Stefan Siersdorfer. Groupsourcing: Team competition designs for crowdsourcing. In: Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
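As an illustration of such a scheme (not the exact design from the cited paper), a prize pool can be split non-linearly across team ranks and then proportionally to individual contributions within each team. All constants are assumptions:

```python
def split_rewards(teams, pool=100.0, decay=0.5):
    """teams: list of {member: contribution} dicts, ordered by final rank (best first).
    Team shares decay geometrically with rank; members split their team's share
    in proportion to their contribution. Constants are illustrative only."""
    weights = [decay**rank for rank in range(len(teams))]
    total_w = sum(weights)
    payouts = {}
    for rank, team in enumerate(teams):
        team_share = pool * weights[rank] / total_w
        total_contrib = sum(team.values()) or 1.0
        for member, contrib in team.items():
            payouts[member] = round(team_share * contrib / total_contrib, 2)
    return payouts

teams = [{"alice": 30, "bob": 10}, {"carol": 25}]
print(split_rewards(teams))  # {'alice': 50.0, 'bob': 16.67, 'carol': 33.33}
```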

31 No Payment: Games With A Purpose [25]
Figure: The ESP Game.
[25] Luis Von Ahn and Laura Dabbish. Labeling images with a computer game. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2004.

32 Mechanical Turk Workers [26]
Surveys of workers on Mechanical Turk.
[26] Joel Ross et al. Who are the crowdworkers? Shifting demographics in Mechanical Turk. In: CHI '10 Extended Abstracts on Human Factors in Computing Systems. ACM, 2010.

33 Selected Papers
Dynamics and task types on MTurk:
- Djellel Eddine Difallah et al. The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. In: Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
Postprocessing results for quality:
- Vikas C. Raykar et al. Learning from crowds. In: Journal of Machine Learning Research 11.Apr (2010).
Influence of compensation and payment on quality:
- Gabriella Kazai. In search of quality in crowdsourcing for search engine evaluation. In: European Conference on Information Retrieval. Springer, 2011.
Collaborative workflows:
- Michael S. Bernstein et al. Soylent: a word processor with a crowd inside. In: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 2010.