The Wisdom Of the Crowds: Enhanced Reputation-Based Filtering

Jason Feriante
CS 761, Spring 2015
Department of Computer Science, University of Wisconsin-Madison

Abstract

Crowdsourcing services such as Amazon Mechanical Turk (AMT) provide a high-volume, low-cost source of labor. This can provide a relatively cheap source of data for machine learning tasks such as computer vision, where current systems like Omron require millions of images for training [3]. Inferring correct labels from noisy crowdsourced data can be difficult; challenges include high variance in worker quality and a lack of gold-standard labels. This paper explores methods of inferring correct labels in the presence of random, deterministic, and sophisticated adversarial workers, without the use of expert labels or a training set. The primary contribution of this paper is a reputation-reward model that enhances the original reputation-based worker filtering algorithm [1] by boosting the voting weight of a fixed percentage of the most trusted workers.

Introduction

Crowdsourcing has become increasingly popular with the rise of service providers such as Amazon Mechanical Turk (AMT), Crowdflower, and Clickworker. Human intelligence tasks (HITs), such as annotating tens of thousands of images, can now be completed quickly and on a massive scale for a very low price compared to hiring experts [2]. Unfortunately, one has little control over the often transient and anonymous workers performing HITs, which results in significant variance in label quality between workers. The focus of this paper is to improve the accuracy of aggregated crowdsourced labels for binary tasks in the face of noisy data and sophisticated adversarial workers. The 2014 NIPS paper "Reputation-based Worker Filtering in Crowdsourcing" [1] approaches this problem by penalizing a fixed percentage of the workers who tend to disagree most. However, only disputed tasks are considered (all unanimously agreed-upon tasks are ignored), and no rewards are given for agreement. The primary contribution of this paper is the reputation-reward model, an incremental improvement over the previous algorithm, which does not consider undisputed tasks or reward workers for agreement. The new model rewards a fixed percentage of the workers who tend to agree most. By trusting the workers who agree most (in addition to applying the penalties of the original model), one can boost the accuracy of aggregated crowdsourced labels. The structure of the paper is as follows: first, the motivation for enhanced reputation-based filtering is discussed; second, related work is reviewed; third, the reputation-reward model is explained; fourth, theoretical results are discussed; and fifth, experimental results are presented.

Motivation

Recent studies have shown that typical machine learning approaches (e.g., gold-standard labels and majority voting) may fall short in their handling of random or deterministic worker labeling strategies. In other words, recent evidence [4] demonstrates that some of the most common label aggregation approaches are not as effective as previously thought when handling randomness in worker labels. Most current models also do not account for sophisticated, colluding adversarial workers. Strong evidence exists that many social sites such as Digg have been manipulated by Sybil attacks (vote pollution with fake identities) [5].
In fact, some studies have shown that malicious crowdsourcing systems not only exist but are growing in number and revenue. A recent UC Santa Barbara paper coined the term "crowdturfing" [6] to describe this growing online threat. Thus, there is a need for new, improved models that can produce accurate aggregated results in settings that include deterministic, random, and increasingly sophisticated adversarial workers.

Related Work

This paper extends the model of "Reputation-based Worker Filtering in Crowdsourcing" [1]. Reputation-based worker filtering [1] is an improved method that handles more than just random labeling strategies: instead of assuming that all workers vote according to some unknown i.i.d. distribution, it uses a richer set of assumptions about worker behavior. There are two types of workers in this model: honest and adversarial. Adversarial workers fall into three categories: (1) random, (2) deterministic, and (3) sophisticated. For example, deterministic workers always give the same vote ($-1$ or $+1$) for a binary task with label set $\{-1, +1\}$. Adversarial workers are computationally unbounded and may also use sophisticated colluding strategies (such as vote pollution). After all labels are collected, the conflict set is generated; it contains only disputed tasks, i.e., tasks on which the workers did not unanimously agree. Two penalty models are proposed: a "soft penalty" and a "hard penalty". In both systems, a fixed percentage of the workers who disagreed the most are assigned weighted penalties: more disagreement

on a task results in a higher penalty. In the soft model, penalties are distributed evenly. In the hard penalty model, an optimal semi-matching algorithm [7] is used in conjunction with the bipartite worker-task assignment graph to "load balance" the penalties: for each disputed task in the conflict set, a single worker who voted $+1$ and a single worker who voted $-1$ are assigned penalties. The load balancing helps distribute penalties across more workers instead of concentrating them on just a few, and it results in larger penalties for potential worst-case adversaries (the kind of sophisticated adversaries who may be colluding to influence a few specific items). Both soft and hard models assign a reputation to each worker, and the workers with the lowest reputations are removed from the dataset entirely. The reputation-filter algorithm uses only the labels provided by the workers (there is no test set and there are no gold labels) to pre-filter results before applying another machine learning approach. The reputation-based filter was shown to be effective for a number of methods, including Karger-Oh-Shah (KOS) [8], KOS+ [8], Majority Vote (MV), and Expectation Maximization (EM) [9]. Even though the resulting training sets were smaller, performance improved and results were more accurate because the adversarial labels had been removed [1].

The Reputation-Reward System

The reputation-reward model is built on top of the existing reputation-based worker filtering algorithm [1] (see Related Work above). The improvement adds a step to the overall process that takes worker agreement into consideration: step 2 below awards additional voting weight to the workers who tend to agree most. The steps in the reputation-reward process are:

1. Hard and soft penalties are calculated, and the resulting 10% to 20% of the most adversarial workers are removed.
2. Reputation rewards are calculated from the original data and applied to the remaining workers. (This is the new step.)
3. A label aggregation method is then applied to the remaining data (e.g., MV or EM).

Mathematical Notation

The notation for the crowdsourced labeling process is as follows. Each worker provides labels for a subset of binary tasks whose true label is in $\{-1, +1\}$. Tasks $T$ are assigned to workers $W$, and both are associated with a bipartite graph $G$ called the worker-task assignment graph. Let $w_i(t_j)$ denote the label provided by worker $i$ for task $j$; an edge from $w_i$ to $t_j$ indicates that the worker provided a label for that task. If the worker did not label the task, there is no edge and $w_i(t_j) = 0$; otherwise the label is in $\{-1, +1\}$. Let $d_j^+$ denote the number of workers who labeled task $t_j$ as $+1$, and $d_j^-$ the number who labeled it as $-1$. The matrix of labels assigned by workers is denoted $L \in \{0, +1, -1\}^{|W| \times |T|}$; thus the label created by worker $w_i$ for task $t_j$ is $L_{ij} = w_i(t_j)$. The goal is to use the information in $L$ to produce a vector $\hat{y}$ of estimated true labels, one per task.

Workers

Workers are either honest ($H$) or adversarial ($A$), where $W = H \cup A$ and $H \cap A = \emptyset$. Honest workers are assumed to generate labels according to an i.i.d. random model: for task $t_j$, an honest worker $w_i$ provides the true label $y_j$ with probability $p_i$ and the false label $-y_j$ with probability $1 - p_i$. The standard way [1] to represent the reliability of an honest worker is $u_i = 2p_i - 1$, where $u_i \in [-1, +1]$. It is also assumed that the average reliability of the honest population is positive, i.e., $\frac{1}{n}\sum_{i=1}^{n} u_i > 0$, and that all honest workers complete their tasks independently.
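To make the honest-worker model and the vote counts concrete, here is a minimal Python sketch (assuming NumPy); all variable names and constants are illustrative choices, not part of the original model:

```python
import numpy as np

rng = np.random.default_rng(0)

n_workers, n_tasks = 50, 30
y = rng.choice([-1, 1], size=n_tasks)        # hidden true labels y_j
p = rng.uniform(0.55, 0.95, size=n_workers)  # honest accuracy p_i per worker

# Label matrix L in {0, +1, -1}^(|W| x |T|); 0 means "no edge" in G,
# i.e. worker i did not label task j.
L = np.zeros((n_workers, n_tasks), dtype=int)
for i in range(n_workers):
    tasks = rng.choice(n_tasks, size=12, replace=False)  # subset of tasks
    correct = rng.random(tasks.size) < p[i]              # true label w.p. p_i
    L[i, tasks] = np.where(correct, y[tasks], -y[tasks])

d_plus = (L == 1).sum(axis=0)    # d_j^+ : number of +1 votes per task
d_minus = (L == -1).sum(axis=0)  # d_j^- : number of -1 votes per task
u = 2 * p - 1                    # reliability u_i = 2 p_i - 1, in [-1, 1]
assert u.mean() > 0              # the honest-population assumption
```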
In contrast, adversarial workers $A$ are free to adopt random, deterministic, or numerous alternative strategies that cannot be represented by any probabilistic model or process. These workers therefore cannot be assumed to match any i.i.d. random model, which violates the fundamental assumption required to apply machine learning or other statistical methods. The solution is to remove such workers from the data entirely (step 1 above), trading a reduction in training data for more accurate remaining data. During step 1, the "conflict set" $T_{cs}$ [1] is generated; it includes only tasks with disagreement. Soft and hard reputation penalties are calculated from this data, and a fixed percentage of adversarial workers (e.g., between 10% and 20%) is removed. In step 2, the new reputation-reward model is applied.
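As a point of reference, here is a minimal sketch of the conflict-set construction from step 1, assuming the NumPy label matrix L from the sketch above; the function name is illustrative. A task is disputed exactly when it received at least one $+1$ vote and at least one $-1$ vote:

```python
import numpy as np

def conflict_set(L: np.ndarray) -> np.ndarray:
    """Indices of disputed tasks: at least one +1 and at least one -1
    vote (entries equal to 0 mean the worker did not label the task)."""
    d_plus = (L == 1).sum(axis=0)
    d_minus = (L == -1).sum(axis=0)
    return np.flatnonzero((d_plus > 0) & (d_minus > 0))

# T_cs = conflict_set(L)   # unanimous tasks are excluded from penalties
```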

Improving the Reputation Filtering Model

Jagabathula et al. [1] consider only tasks on which there is at least one conflict; the reputation-based worker filter therefore cannot leverage any information from tasks with unanimous agreement. The fundamental idea here is that more weight should be given to the votes of the most accurate workers. This helps offset some of the noise and randomness caused by less accurate workers who were not already filtered out of the data in step 1. One might argue that adversaries could exploit such a system. While it is true that in some systems adversaries have knowledge of the votes of other workers, quite often this is not the case. And even when adversarial workers do know the voting behavior of other workers, they are still trying to subvert the process, which will likely create at least some conflicts for them and thereby increase the likelihood that they are removed from the dataset entirely. In other words, the previous model of Jagabathula et al. [1] has already demonstrated an ability to filter out these subversive workers (as long as colluding adversaries do not form a majority). Given a large number of workers, some fraction of them will likely have a high level of accuracy. The goal is to boost the voting weight of this small subset of high-reputation workers to achieve more accurate results with the overall model. In many ways this model is similar to the original algorithm; however, it differs fundamentally in that it utilizes data the original model did not consider.

The Reputation-Reward Model

Using the matrix of labels $L \in \{0, +1, -1\}^{|W| \times |T|}$ built during step 1, workers are rewarded based on the level of agreement on the tasks they were involved in. Let $r_j^+$ denote the number of workers who labeled task $t_j$ as $+1$, and $r_j^-$ the number who labeled it as $-1$. In the reputation-reward model, a reward of $1/r_j^+$ is given to every worker who labels $t_j$ as $+1$, and a reward of $1/r_j^-$ to every worker who labels it as $-1$. The rewards are aggregated across all tasks and normalized by taking the average. The end result is that the highest scores generally go to the most honest and accurate workers; the model assumes that workers who tend to agree with others more often are more trustworthy and more accurate. The algorithm could otherwise reward dishonest workers who simply participated in more tasks than others; for this reason, scores are normalized by the number of tasks each worker was involved in. As mentioned above, the most adversarial workers are removed from the dataset in step 1. This bounds the damage that can be caused by adversarial workers, with guarantees similar to those of the original reputation-based filter; refer to Jagabathula et al. [1] for detailed theorems and proofs. Assuming the worker accuracy distribution is normal, it can be inferred that a small percentage of the workers are likely to be more trustworthy. Furthermore, since there is no other data on which to base the decision, one must assume these workers will tend to vote more accurately than the rest. Weight is added to a fixed percentage of workers (e.g., 10% to 20%) in proportion to how often they were agreeable, and these workers then receive corresponding additional voting weight. Boosting the weight of workers who tend to agree most has little or no impact when a task is already mostly unanimous; however, it may improve the accuracy on other disputed tasks. Note also that no weight will (or can) be applied to the 15% to 20% of most adversarial workers who were removed from the dataset.

The Reputation-Reward Algorithm

1. Input: workers $W$, the set of all disputed tasks $T_{dts}$, and $L \in \{0, +1, -1\}^{|W| \times |T|}$.
2. For each task $t_j \in T$, assign a trust reward $e_{ij}$ to each worker $w_i \in W$: if $L_{ij} = +1$, then $e_{ij} = 1/r_j^+$; if $L_{ij} = -1$, then $e_{ij} = 1/r_j^-$.
3. Output: increased voting weight (reputation reward) for each worker $w_i$:
   $$\mathrm{reward}(w_i) = \frac{\sum_{t_j \in T_i \cap T_{dts}} e_{ij}}{|T_i \cap T_{dts}|}$$
4. The voting weights are applied in the label aggregation method.
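Below is a minimal NumPy sketch of the reward computation (steps 2 and 3) and the subsequent weight boost. Two simplifications are assumptions of this sketch, not of the paper: rewards are averaged over all tasks a worker labeled rather than over $T_i \cap T_{dts}$, and the boost factor of 2.0 is an arbitrary illustrative value, since no magnitude is fixed here.

```python
import numpy as np

def reputation_rewards(L: np.ndarray) -> np.ndarray:
    """A +1 vote on task j earns 1/r_j^+; a -1 vote earns 1/r_j^-.
    Scores are averaged over the tasks each worker labeled."""
    r_plus = (L == 1).sum(axis=0)          # r_j^+ per task
    r_minus = (L == -1).sum(axis=0)        # r_j^- per task
    e = np.zeros(L.shape)                  # trust rewards e_ij
    pos, neg = (L == 1), (L == -1)
    e[pos] = 1.0 / np.broadcast_to(r_plus, L.shape)[pos]
    e[neg] = 1.0 / np.broadcast_to(r_minus, L.shape)[neg]
    n_labeled = (L != 0).sum(axis=1)       # tasks per worker (normalizer)
    return e.sum(axis=1) / np.maximum(n_labeled, 1)

def boost_weights(rewards: np.ndarray, top_frac: float = 0.15,
                  boost: float = 2.0) -> np.ndarray:
    """Unit voting weight for everyone; extra weight for the
    `top_frac` most agreeable workers."""
    w = np.ones_like(rewards)
    k = max(1, int(top_frac * len(rewards)))
    w[np.argsort(rewards)[-k:]] = boost
    return w
```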
Theoretical Results

The result of the reputation-reward algorithm depends on two key factors: the accuracy of the workers and the true label of each respective task. Let the worker-task assignment graph be $G$, and let $r$ be the number of tasks that each of the $l$ workers completes.

Theorem: Assume there are no adversarial workers in the dataset and that the worker-task assignment graph $G$ is $(l, r)$-regular. Then, for any two workers $w_i$ and $w_{i'}$ completing the same number of tasks $r$, the worker with the higher reliability is assigned the greater expected reward by the reputation-reward algorithm.

The proof of this theorem is analogous to the proof of the soft-penalty theorem in the supplementary material of Jagabathula et al. [1]. Their theorem shows that penalties decrease as the reliability of a worker increases; conversely, the theorem above states that a worker's reputation reward increases with a corresponding increase in the worker's reliability. Rather than repeating their work with minor modifications, the reader is referred to the original proofs.

Experiments

In this section, the performance of the reputation-reward model is evaluated with two of the most popular label aggregation algorithms: majority vote and expectation maximization.

Majority Vote (MV)

Majority Vote (MV) is one of the most common ways to aggregate worker labels. Let $M$ be a $|T| \times 2$ matrix whose rows index tasks and whose columns index the two possible binary outcomes $\{-1, +1\}$. The votes (which may have been weighted by previous steps) for each outcome of each task are summed, and the outcome receiving the most votes wins. Although majority vote is a simplistic algorithm, its results are still relatively accurate compared to far more sophisticated methods. One fundamental flaw of the MV algorithm is that, used as a stand-alone solution, it is extremely vulnerable to sophisticated adversarial attacks.
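Here is a minimal sketch of (weighted) majority vote over the label matrix, assuming NumPy; passing the boosted weights from the earlier sketch yields the reputation-reward variant, while a vector of ones gives plain MV. Breaking ties toward $+1$ is an arbitrary choice of this sketch.

```python
import numpy as np

def weighted_majority_vote(L: np.ndarray, w: np.ndarray) -> np.ndarray:
    """M is |T| x 2: column 0 collects weighted -1 votes, column 1
    collects weighted +1 votes; the heavier column wins each task."""
    M = np.zeros((L.shape[1], 2))
    M[:, 0] = ((L == -1) * w[:, None]).sum(axis=0)
    M[:, 1] = ((L == 1) * w[:, None]).sum(axis=0)
    return np.where(M[:, 1] >= M[:, 0], 1, -1)

# Example: y_hat = weighted_majority_vote(L, boost_weights(reputation_rewards(L)))
```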

Expectation Maximization (EM)

In the context of crowdsourced label aggregation, Expectation Maximization (EM) iterates between estimating the expected labels, the difficulty of each task, and each worker's expertise to achieve a maximum-likelihood fit of these parameters. The approach applied in the experiments is GLAD (Generative model of Labels, Abilities, and Difficulties) [9]. A brief overview of the GLAD EM algorithm follows. Let $Z_j \in \{-1, +1\}$ represent the true label of task $t_j$, which is inferred from the votes of the workers $w_i$. The difficulty of a task is modeled by $1/\beta_j \in [0, \infty)$, where $1/\beta_j = \infty$ means the task is extremely difficult and $1/\beta_j = 0$ means the task is very easy ($1/\beta_j$ cannot be negative). The expertise (ability) of each worker is modeled by $\alpha_i \in (-\infty, +\infty)$, where $\alpha_i = -\infty$ means the worker always votes incorrectly and $\alpha_i = +\infty$ means the worker always votes correctly. Let $L_{ij}$ represent the label given by worker $w_i$ for task $t_j$. Under this model, the probability of a correct label is

$$p(L_{ij} = Z_j \mid \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}}$$

so the log-odds of obtaining a correct label is a function of both the difficulty of the task and the skill of the worker:

$$\alpha_i \beta_j = \log \frac{p(L_{ij} = Z_j)}{1 - p(L_{ij} = Z_j)}$$

E step: let $T_j$ denote the set of labels given for task $t_j$, indexed only by the subset of workers who labeled that particular task. The posterior probabilities of $Z_j \in \{-1, +1\}$ are calculated given the $\alpha, \beta$ values from the previous M step. M step: the joint log-likelihood of the observed and latent variables $L, Z$ is maximized with respect to the parameters $\alpha, \beta$, using the posterior probabilities of $Z$ computed in the last E step. The rest of the algorithm follows the traditional EM approach: E and M steps are iterated until a preset number of cycles is completed or convergence is reached. One problem with EM is the lack of theoretical guarantees; as a result, its output is sometimes less accurate than a simple majority vote. On the other hand, one benefit of EM is that it is more robust to noisy data than a simplistic approach like MV.
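The two equations above can be captured in a small sketch, assuming NumPy; the function names are illustrative and the full E/M iteration is omitted:

```python
import numpy as np

def p_correct(alpha_i: float, beta_j: float) -> float:
    """GLAD: probability that worker i labels task j correctly, a
    logistic function of ability alpha_i and inverse difficulty beta_j."""
    return 1.0 / (1.0 + np.exp(-alpha_i * beta_j))

def log_odds(alpha_i: float, beta_j: float) -> float:
    """log[p / (1 - p)] collapses to alpha_i * beta_j under the model."""
    return alpha_i * beta_j

# A skilled worker (alpha = 3) on an easy task (beta = 2):
# p_correct(3, 2) -> ~0.9975; a random guesser (alpha = 0) -> 0.5
```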
Crowdsourcing Datasets

The performance of the reputation-based filter [1] is evaluated with and without the reputation-reward model, using two public crowdsourced datasets: a publicly available subset of the ClueWeb09 dataset [10], and a publicly available version of the TREC Crowdsourcing Track Task 2 dataset [11]. Both datasets contain crowdsourced binary document-relevance labels, and each worker typically participates in between 10 and 15 tasks.

Results

MV and EM were first run as stand-alone methods. They were then run again after filtering the data with the hard penalty from reputation-based worker filtering [1], both with and without the new reputation-reward algorithm described above. In summary, the experimental configurations were:

1. Base methods only (Base)
2. Base methods + hard penalty (Hard-P)
3. Base + Hard-P + reputation-reward (R-Reward)

The soft penalty was not used in any experiment because it was less effective than the hard penalty in the study by Jagabathula et al. [1]. For both datasets, the hard penalty algorithm was used to remove 15% to 20% of the most adversarial workers; the reputation-reward algorithm was then used to increase the voting weights of the 10% to 20% most agreeable workers. In each case, the reported results represent the average score for each algorithm after tuning the parameters for optimal results. Although some marginal improvement was achieved with the reputation-reward algorithm, it was not enough to support any strong conclusions about the algorithm's effectiveness. Most likely, changes are needed to improve the algorithm, such as adjusting the voting weights or the percentage of workers receiving a weight boost.

Conclusion

More analysis and experiments are needed to further improve the reputation-reward algorithm presented in this paper. For example, running the algorithm on additional crowdsourced datasets would help validate the approach. Although some marginal performance improvements were observed, there were not enough experiments on enough datasets to draw statistically significant conclusions. Perhaps a derivative of the optimal semi-matching algorithm could be applied to "load balance" the worker voting weights in an uneven way, similar to its use in the "hard penalty" algorithm, which is by far the best practical contribution of Jagabathula et al. [1]. However, no clearly justifiable way to apply such a scheme to boosting voting weights (while distributing the reputation rewards) was readily apparent. The fundamental idea behind the reputation-reward algorithm remains sound: use the data ignored by the previous work (i.e., all tasks that were unanimously agreed upon). The approach provided in this paper is just one of many possible ways to utilize that data to achieve more accurate overall results.

References

[1] Jagabathula, Srikanth, Lakshminarayanan Subramanian, and Ashwin Venkataraman. "Reputation-based worker filtering in crowdsourcing." Advances in Neural Information Processing Systems, 2014.
[2] Howe, Jeff. "The rise of crowdsourcing." Wired Magazine 14.6 (2006): 1-4.
[3] Kumar, Neeraj, et al. "Attribute and simile classifiers for face verification." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.
[4] Vuurens, Jeroen, Arjen P. de Vries, and Carsten Eickhoff. "How much spam can you take? An analysis of crowdsourcing results to increase accuracy." Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR'11), 2011.
[5] Tran, Dinh Nguyen, et al. "Sybil-resilient online content voting." NSDI, Vol. 9, 2009.
[6] Wang, Gang, et al. "Serf and turf: crowdturfing for fun and profit." Proceedings of the 21st International Conference on World Wide Web. ACM, 2012.
[7] Harvey, Nicholas J. A., et al. "Semi-matchings for bipartite graphs and load balancing." Algorithms and Data Structures. Springer Berlin Heidelberg, 2003.
[8] Karger, David R., Sewoong Oh, and Devavrat Shah. "Iterative learning for reliable crowdsourcing systems." Advances in Neural Information Processing Systems, 2011.
[9] Whitehill, Jacob, et al. "Whose vote should count more: Optimal integration of labels from labelers of unknown expertise." Advances in Neural Information Processing Systems, 2009.
[10] "The ClueWeb09 Dataset." Web. Accessed 10 May 2015.
[11] "2011 NIST TREC Crowdsourcing Track." Web. Accessed 10 May 2015.