Master Thesis

Evaluation of Crowdsourcing Translation Processes

Supervisor: Professor Toru Ishida
Department of Social Informatics
Graduate School of Informatics
Kyoto University

Jun MATSUNO

April 2010 admission, March 2012 completion

Abstract

Evaluation of Crowdsourcing Translation Processes

Jun MATSUNO

The need for translation has grown with globalization. Companies need their internal information translated when they transfer operations overseas: in Japan the strong yen is accelerating the overseas transfer of small businesses, and in China rapid economic growth is accelerating the overseas transfer of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the usual choice when a translation is requested: they deliver very high quality, but at high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can participate, crowdsourcing translation can realize low-cost and rapid translation; for the same reason, however, the quality of the translations is not guaranteed. Quality can be expected to improve if multiple translators address a translation cooperatively. The need for such cooperative crowdsourcing translation lies in translation jobs that must cost less than a professional translation while still having their quality guaranteed. However, it is not known how multiple translators should cooperate. To propose an evaluation method for crowdsourcing translation processes, this study addressed the following two challenges.

1. Building an experiment environment for evaluating crowdsourcing translation processes
The experiment environment is built on a crowdsourcing market in which cooperative processes by multiple users can be formed and translation tasks can be processed through those processes. No existing crowdsourcing market meets these requirements, so the environment has to be built by extending the functions of an existing market.

2. Evaluating crowdsourcing translation processes
The best translation is not always obtained by the process in which multiple translators improve a translation one after another. For example, with three translators, a better translation may be produced when two translators first translate independently and one translator then translates based on their two results. Because evaluating all possible translation processes is unrealistic, translation processes have to be evaluated efficiently.

We built the experiment environment using Amazon Mechanical Turk as the crowdsourcing market and evaluated crowdsourcing translation processes in it. Our contributions to the above challenges are as follows.

1. An experiment environment built on Amazon Mechanical Turk
We built the experiment environment for crowdsourcing translation on Amazon Mechanical Turk. A wide variety of tasks, including translation tasks, can be requested there, and APIs and tools for forming processing workflows on Amazon Mechanical Turk have been released, so it is a suitable crowdsourcing market on which to build the environment.

2. Efficient evaluation of crowdsourcing translation processes using Amazon Mechanical Turk
In the environment we built, we evaluated the parallel and iterative processes, the fundamental cooperative processes proposed by a previous study, by translating 20 Chinese sentences into English. In addition, we approximately evaluated processes combining the parallel and iterative processes, using the evaluation results of the two basic processes. With three translators, the evaluation values of the parallel process, the iterative process, and the combined process were 4.22, 4.15, and 4.27, respectively; the combined process was the best. By evaluating translation processes approximately, we realized an evaluation of crowdsourcing translation processes that saves time and cost.


Evaluation of Crowdsourcing Translation Processes

Contents

Chapter 1  Introduction  1
Chapter 2  Amazon Mechanical Turk  4
  2.1  Requester  5
  2.2  Worker  8
Chapter 3  Related Work  10
  3.1  Translation Using Amazon Mechanical Turk  10
  3.2  Increasing Task Quality in Amazon Mechanical Turk  11
  3.3  Relation between Task Results and Rewards  13
  3.4  Task Processing with Cooperation  16
Chapter 4  Crowdsourcing Translation  20
  4.1  Increasing Demand for Crowdsourcing Translation  20
  4.2  Example of a Crowdsourcing Translation Service  21
  4.3  Increasing Translation Quality through Cooperative Processes  23
Chapter 5  Establishment of the Experiment Environment  24
  5.1  Processing Tasks through Cooperative Processes  24
  5.2  Requesting Translation Tasks  25
  5.3  Requesting Vote Tasks  25
  5.4  Screening of Workers  27
Chapter 6  Experiment and Evaluation  30
  6.1  Experiment  30
  6.2  Evaluation  32
  6.3  Discussion  35
  6.4  Lessons Learned  42
Chapter 7  Conclusion  43
Acknowledgements  45

References 46

Chapter 1 Introduction

The need for translation has grown with globalization. Companies need their internal information translated when they transfer operations overseas: in Japan the strong yen is accelerating the overseas transfer of small businesses, and in China rapid economic growth is accelerating the overseas transfer of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the usual choice when a translation is requested: they deliver very high quality, but at high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can participate, crowdsourcing translation can realize low-cost and rapid translation; for the same reason, however, the quality of the translations is not guaranteed. Quality can be expected to improve if multiple translators address a translation cooperatively. The need for such cooperative crowdsourcing translation lies in translation jobs that must cost less than a professional translation while still having their quality guaranteed. However, it is not known how multiple translators should cooperate. To propose an evaluation method for crowdsourcing translation processes, this study addressed the following two challenges.

1. Building an experiment environment for evaluating crowdsourcing translation processes
The experiment environment is built on a crowdsourcing market in which cooperative processes by multiple users can be formed and translation tasks can be processed through those processes. No existing crowdsourcing market meets these requirements, so the environment has to be built by extending the functions of an existing market.

2. Evaluating crowdsourcing translation processes
The best translation is not always obtained by the process in which multiple translators improve a translation one after another. For example, with three translators, a better translation may be produced when two translators first translate independently and one translator then translates based on their two results. Because evaluating all possible translation processes is unrealistic, translation processes have to be evaluated efficiently.

We built the experiment environment for evaluating crowdsourcing translation processes using Amazon Mechanical Turk as the crowdsourcing market, and we evaluated crowdsourcing translation processes in it. A wide variety of tasks, including translation tasks, can be requested on Amazon Mechanical Turk, and APIs and tools for forming processing workflows have been released, so it is a suitable crowdsourcing market on which to build the environment.

Processing translation tasks on Amazon Mechanical Turk has already been proposed [1, 2]. Those studies showed that Amazon Mechanical Turk is usable for translation tasks, but in them translators created or improved translations independently. In our study, translation tasks are processed cooperatively by multiple translators on Amazon Mechanical Turk. Compared with the previous studies, we build a better experiment environment for translation tasks by screening translators; the screening is necessary because many users on Amazon Mechanical Turk do not process tasks seriously [3]. As the processes for translation by multiple translators we used the parallel and iterative processes [4]: the iterative process is suitable for tasks that improve as more time is spent on them, and the parallel process is suitable for tasks whose purpose is to obtain a unique idea. We evaluated the results of applying these processes to translation tasks, and in addition we approximately evaluated processes combining the parallel and iterative processes.

This paper is organized as follows. Chapter 2 explains Amazon Mechanical Turk, which is used for the evaluation of crowdsourcing processes. Chapter 3 introduces previous studies relevant to this research. Chapter 4 describes an actual crowdsourcing translation service on the Web and crowdsourcing translation using cooperative processes. Chapter 5 describes the establishment of the experiment environment for evaluating crowdsourcing translation processes on Amazon Mechanical Turk. Chapter 6 reports the experiment and its evaluation, and Chapter 7 presents the conclusion.

Chapter 2 Amazon Mechanical Turk

Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) is one of Amazon's web services. It allows the processing of tasks that can easily be solved by humans, and a large number of micro-tasks requiring human intelligence are generally requested there; the tasks are solved by many users around the world. On Amazon Mechanical Turk a task is called a Human Intelligence Task (HIT), and the users who request and process tasks are called Requesters and Workers, respectively. Figure 2.1 shows the screen for browsing HITs on Amazon Mechanical Turk.

Figure 2.1: Screen for browsing HITs in Amazon Mechanical Turk

We use the first HIT in Figure 2.1 to explain what a HIT looks like. "Verify Businesses Websites" is the title of the HIT, and "Requester: Dolores Labs" is the name of the user requesting it. A requester sets the HIT Expiration Date, Reward, Time Allotted, and HITs Available: Time Allotted is the time allowed for processing one task, and HITs Available is the number of HITs that the browsing worker can process. Description, Keywords, and Qualification Required give the explanation of the HIT, the keywords related to it, and the qualification required to process it; they appear when the title of the HIT is clicked.

2.1 Requester

Requesters are limited to users who can register an address in the United States. A requester can request a wide variety of HITs and can freely set a HIT's information. Rewards vary, but a typical reward per HIT is between $0.01 and $0.10.

Amazon provides three mechanisms for guaranteeing the results of HITs on Amazon Mechanical Turk. The first is having multiple workers process the same HIT, which lets a requester select the better result. The second is letting a requester set the qualification required to process a HIT; requesters often use the acceptance rate of the HITs a worker has processed or the worker's country of residence as qualifications, and a requester can also grant a qualification to workers who pass a test the requester created. The third is letting a requester reject the result of a HIT and withhold the reward; in that case the requester must tell the worker the legitimate reason for the rejection.

There are two ways to request HITs. The first is the GUI provided by Amazon. Requesting HITs and acquiring results through the GUI involves three steps: designing, publishing, and managing HITs.

Designing HITs. A variety of templates is provided, so HITs can be designed from a template. Figure 2.2 shows an example of the design screen; in this example a HIT for translation from Chinese to English is being designed. HITs have to be designed in HTML, and information such as the title and reward is also entered at this step.

Figure 2.2: Example of the screen for the design of a HIT

Publishing HITs. In this step the designed HITs are submitted to Amazon Mechanical Turk. If a HIT needs images, they are uploaded, and a preview of the HIT can be checked. Finally, the HITs are published on Amazon Mechanical Turk once the cost of publishing them can be paid; the cost consists of the rewards paid to workers and the commission paid to Amazon, which is 10% of the reward.

Managing HITs. Published HITs are managed in this step: it is possible to check how far HITs have been processed and to inspect the results of processed HITs. Figure 2.3 shows the management screen. On this screen the results of processed HITs can be approved or rejected after inspection, and a CSV file containing the results can be downloaded.

Figure 2.3: Screen for the management of HITs

The second way to request HITs is the Requester API provided by Amazon, which can be used from a variety of languages such as Java, PHP, and Ruby. Tools for Amazon Mechanical Turk have also been developed and published recently, and they make it possible to request HITs more flexibly: workers normally process HITs independently, but with such tools a HIT that uses the results of HITs processed by other workers can be requested automatically. With the API, HITs can be requested without the tedious GUI operations, and the results of processed HITs can be written to the program console or to files. HITs requested through the API are also reflected in the HIT management screen.
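As a point of reference only, the second method can be sketched with the present-day boto3 Python SDK; the thesis itself used the 2012-era Requester API, which differed, and the title, reward, and question body below are illustrative values, not the HITs of this study.

```python
# Minimal sketch: publishing one translation HIT through the MTurk Requester
# API (modern boto3 SDK, sandbox endpoint; illustrative values throughout).
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A HIT is described by question XML; HTMLQuestion wraps an HTML form.
question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html><body>
      <form action="https://www.mturk.com/mturk/externalSubmit">
        <p>Translate the following Chinese sentence into English:</p>
        <p>对不起，我们这里没有这个人。</p>
        <textarea name="translation"></textarea>
        <input type="hidden" name="assignmentId" value="">
        <input type="submit">
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

response = mturk.create_hit(
    Title="Translate one sentence from Chinese to English",
    Description="Translate the given sentence. Machine translation is not allowed.",
    Keywords="translation, Chinese, English",
    Reward="0.20",                       # dollars, passed as a string
    MaxAssignments=3,                    # mechanism 1: several workers per HIT
    AssignmentDurationInSeconds=60 * 60,
    LifetimeInSeconds=7 * 24 * 60 * 60,
    Question=question_xml,
)
print(response["HIT"]["HITId"])
```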

2.2 Worker

Anyone around the world can be a worker. A worker earns rewards by processing tasks, and selects HITs based on their content and reward. Amazon announced that, as of spring 2007, there were more than 100,000 workers from more than 100 countries on Amazon Mechanical Turk. According to a statistical analysis [5], 76% of workers were American and 8% Indian as of March 2008, but 56% were American and 36% Indian as of November 2009; even now, most workers are probably American or Indian. American workers do not feel strongly tied to Amazon Mechanical Turk, whereas Indian workers do: many American workers may treat processing HITs as a hobby, while many Indian workers see it as a path to a better living. Hence some Indian workers can be expected to process tasks carelessly in order to earn rewards efficiently; in fact, when we requested translation tasks on Amazon Mechanical Turk, we confirmed that many Indian workers submitted machine translations and sloppy translation results. Translation tasks carry a higher reward than usual because they are technical, yet they are processed less appropriately by workers than other, simpler tasks. Screening workers is therefore necessary to acquire good translation results when translation tasks are requested on Amazon Mechanical Turk; without screening, a requester receives low-quality translations and the management of the HITs becomes burdensome.

Figure 2.4 shows the screen of a HIT as workers actually process it. This HIT asks workers to enter information about restaurants found on a designated web site. Workers see the content of a HIT by clicking the "View a HIT in this group" link on the HIT-browsing screen, and they start processing the HIT by pressing the "Accept HIT" button. A worker who does not want to process the HIT presses "Skip HIT" and is shown the content of another HIT. If the requester approves the result of a HIT, the worker receives its reward; if the requester rejects it, the worker receives no reward and the worker's performance record declines. A worker with a low performance record cannot process the HITs that requesters want to reserve for reliable workers.

Figure 2.4: Screen of a HIT

Chapter 3 Related Work

3.1 Translation Using Amazon Mechanical Turk

Amazon Mechanical Turk is used in various fields of study. Its most popular use is requesting large numbers of micro-HITs, which cuts a great deal of cost and time. Studies using it in this way include annotation of image data [6], evaluation of visual design [7], and collection of audio data [8]; they showed that HITs can be completed at low cost with the quality of the processed HITs guaranteed.

Not everyone can translate, because translation is a technical process, so translation is generally not well suited to HITs requested on Amazon Mechanical Turk. In fact, few translation HITs are requested there, but a few studies of translation using Amazon Mechanical Turk exist. In one study, translations of 50 English sentences into French, German, Spanish, Chinese, and Urdu were requested on Amazon Mechanical Turk [1]. The Multiple-Translation Chinese Corpus in the LDC Catalog (http://www.ldc.upenn.edu/catalog/) and the NIST MT Eval 2008 Urdu-English Test Set were used as the source sentences; these sets are commonly used to test the performance of machine translation. The HIT screen carried a notice that machine translation must not be used, but many workers ignored it and submitted machine-translated output to the requester. The machine translations were removed by requesting an additional review task on Amazon Mechanical Turk; as a result, at least 30% of the translations turned out to be machine-translated. The translations were evaluated with BLEU, a method for the automatic evaluation of machine translation. Figure 3.1 shows the evaluation results of the workers' translations: in all languages, the evaluation value of the workers' translations was lower than that of professional translators but significantly exceeded that of machine translation. The evaluation would improve by removing the machine translations from the workers' translations, and it is expected to improve further if workers are screened in advance. As for rewards, the reward for translating one sentence was $0.10, and the reward for checking whether a translation was machine-translated was $0.06. As for completion time, the HITs took less than 4 hours, 20 hours, 22.5 hours, 2 days, and 4 days for Spanish, French, German, Chinese, and Urdu, respectively.

Figure 3.1: Evaluation results of translations by workers (blue: professional translator; green: worker; orange: machine translation)
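For reference (the thesis does not give the definition), BLEU combines modified n-gram precisions $p_n$ with a brevity penalty; the standard definition, with candidate length $c$, effective reference length $r$, and weights $w_n$ (typically $N = 4$ and $w_n = 1/4$), is

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),\qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r\\ e^{\,1 - r/c} & \text{if } c \le r. \end{cases}$$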

3.2 Increasing Task Quality in Amazon Mechanical Turk

Many workers on Amazon Mechanical Turk do not process HITs seriously, and the quality of HITs decreases as a result. One approach to avoiding this is screening workers, and there is a study of approaches to screening workers on Amazon Mechanical Turk [3]. In that study workers were asked demographic questions (age, sex, and occupation) as a screen, and in addition two questions about an e-mail message. Workers who answered the demographic and comprehension questions correctly earned the right to process HITs paying a very high reward. The experiment showed the following:

- the proportion of workers who answered the demographic questions and the questions about the e-mail correctly was 61%;
- women answered the questions more accurately than men;
- the older a worker was, the more accurately the worker answered the questions.

As a result, asking workers demographic and simple comprehension questions was very effective for screening. Screening based on processing time was also considered, but no significant difference was found between high-quality and low-quality results.

Another approach to increasing the quality of results is improving the HIT design. One study showed that result quality increased when a HIT was improved [9]. The study used a HIT in which workers evaluated Wikipedia articles. Before the design was improved, workers were asked to rate the articles on a seven-point Likert scale and to state how the articles should be improved; after the design was improved, workers were additionally asked simple questions about the articles, which made them understand the articles more deeply. Table 3.1 shows the results.

Table 3.1: Increase in the quality of HITs by improving the HIT design

                                                   Without improved design   With improved design
  Proportion of invalid results                           48.6%                     2.5%
  Processing time                                          1:30                     4:06
  Proportion of results processed within 1 minute         30.5%                     6.5%

Invalid results are answers that do not state how the article should be improved. The proportion of results processed within one minute was measured because, before the design was improved, many HITs processed within one minute were invalid. Table 3.1 indicates that the quality of results increased by improving the HIT design.

3.3 Relation between Task Results and Rewards

Economists and psychologists have long been interested in a question raised by crowdsourcing models such as Amazon Mechanical Turk: what effect does the reward have on the quality of results? Traditional economic theory held that the higher the reward, the better the work. However, many studies showed that giving a high reward decreased the internal motivation of enjoying the task and thereby lowered quality, contrary to the traditional theory. On Amazon Mechanical Turk many HITs are processed cheaply and quickly by many workers, but how correctly they are processed depends on how well the requester can raise the workers' motivation; the reward is one of the workers' external motivations.

Two experiments have been conducted to survey the relation between the performance and the reward of HITs [10]. In these experiments the quality (accuracy) and quantity (number) of results were measured quantitatively. The HITs used did not depend on workers' ability: the harder workers worked, the more results they could produce. The first experiment used a HIT in which workers rearranged pictures taken at two-second intervals by a traffic camera into chronological order. The basic reward was $0.10, and an additional reward was paid according to the workers' effort. When a worker accepted the HIT and provided personal information, the basic reward was paid. After that, the level (easy: 2 images; medium: 3 images; hard: 4 images) and reward (low: $0.01; medium: $0.05; high: $0.10 per HIT) were selected randomly, and the worker processed the main HITs. The experiment continued until the worker stopped processing HITs or all HITs were processed.

Figure 3.2 shows the relation between the reward and the number of processed HITs, and Figure 3.3 shows the relation between the reward and the accuracy of processed HITs. Figure 3.2 indicated two things: first, the higher the reward, the more HITs were processed, regardless of their difficulty; second, more workers paid $0.10 processed all HITs than workers paid $0.01, while many workers paid $0.01 were among those processing fewer than 10 HITs. This result is consistent with the general economic theory that the higher the reward, the more work is done. Figure 3.3 indicated that there was no relation between the accuracy of processed HITs and the reward.

Figure 3.2: Relation between the reward and the number of processed HITs

Figure 3.3: Relation between the reward and the accuracy of processed HITs

Figure 3.4: Relation between the real reward and the reward expected by workers

The different effects the reward had on quality and quantity can be attributed to the anchoring effect: the phenomenon in which a strong impression made by the first piece of information or number (the anchor) affects subsequent decision-making. Figure 3.4 indicates that the expected reward was higher than the real reward; therefore the difference in rewards had no effect on the accuracy of the HITs workers processed. However, workers thought that the higher the reward, the more valuable the HIT was, and so the higher the reward, the more HITs were processed.

The second experiment used a HIT in which workers found words hidden in a puzzle, without knowing how many words were hidden. The quantity and quality of processed HITs were measured by the number of completed puzzles and the number of words found, respectively. There were two payment methods: a quota scheme paying a reward each time one puzzle was correctly completed, and a piece-rate scheme paying a reward each time one word was found. Four reward levels, including non-payment, were considered in each scheme, giving seven experimental conditions in all. The number of words found, the number of hidden words, and the reward paid were presented to workers.

Whether the scheme was the quota scheme or the piece-rate scheme, more tasks were processed and more words were found when a reward was paid than when it was not. The significant difference between the first and second experiments was that in the second there was no relation between the level of the reward and the quantity of processed tasks. This was because there was a strong relation between the enjoyment of the HITs and the quantity processed; for example, one worker under the non-payment condition spent five hours completing all the HITs and found all but two of the words. As in the first experiment, there was also no relation between the level of the reward and the quality of the HITs. If all words in a puzzle were found, the reward per word was smaller under the quota scheme than under the piece-rate scheme, yet the quality of the processed HITs was higher under the quota scheme than under the piece-rate scheme, and higher still under non-payment than under the quota scheme. Under the quota scheme no reward was paid unless all words were found, so workers tried to find more words per puzzle than under the piece-rate scheme, which raised quality. Workers were also asked the value of a puzzle and of a word; the results showed that when no financial reward was paid, workers made an effort motivated by rewards other than money. Hence, even when there is a great difference between the expected and the real reward, the accuracy of processed HITs is high under non-payment. When a reward was paid, on the other hand, workers decided whether to process HITs by comparing the real reward with their expected reward, and the quantity of processed HITs did not increase simply because the reward increased. These results indicate the following:

- the quantity of processed HITs is larger when a reward is paid than when it is not;
- the way a reward is paid has a larger effect on the quantity and quality of processed HITs than the amount of the reward.

3.4 Task Processing with Cooperation

Some tasks should be processed more efficiently by the cooperation of multiple workers; the creation of Wikipedia articles is one example. Wikipedia is a collaboration-type crowdsourcing site that promotes collaboration on the Web and has users create content cooperatively; even users without an account can edit articles. The relation between an increase in the number of editors of an article and the article's quality has been surveyed [11]. Editing Wikipedia articles is a task with high interdependency among users, and the cost of cooperation can be high: fixing grammar or spelling errors in an article requires little cooperation, but changing an article's structure or resolving differences over its content requires much cooperation, because a unified opinion has to be reached. The study formulated the hypothesis that an increase in the number of editors benefits not the tasks needing little cooperation but the tasks needing much cooperation, and its verification showed that the hypothesis was correct.

On Amazon Mechanical Turk, an experiment verifying the effectiveness of cooperative processes has been conducted [4]. It used the parallel and iterative processes shown in Figures 3.5 and 3.6, respectively. In both processes, voters participate in vote tasks that decide the better result by majority vote. In the parallel process, the same HIT is processed by several workers and the best result is then decided; in the iterative process, each worker can see the result of the previous worker. HITs for writing image descriptions and for suggesting new company names were used to survey which kinds of HITs each process suits. The numbers of workers and votes were 6 and 5, respectively, and the number of voters per vote was also 5.

Figure 3.5: Parallel process (the number of workers is n)

Figure 3.6: Iterative process (the number of workers is n)

In the image-description HIT, $0.02 was paid to the worker writing the description and $0.01 to a voter. Workers entered a sentence describing the presented image, and voters scored two candidate descriptions of the presented image with values from 1 to 10. Descriptions of 30 images were acquired using the parallel and iterative processes. Figure 3.7 shows the relation between the number of workers and the evaluation values for the two processes: the iterative process was more useful for writing image descriptions than the parallel process, because the longer a description was, the better its evaluation value.

In the company-name HIT, $0.02 was paid to the worker suggesting a new name and $0.01 to a voter. Voters scored two candidate names with values from 1 to 10, and new names for 30 companies were acquired using the two processes. The best evaluation values were 7.3 with the iterative process and 8.3 with the parallel process; it was difficult to reach a high value with the iterative process because the next worker was influenced by the previous worker's idea. Although the best value was obtained with the parallel process, the average values were 6.4 with the iterative process and 6.2 with the parallel process. Figure 3.8 shows the relation between the number of workers and the evaluation values for the iterative process: the more workers there were, the better the evaluation value.

These experiments showed a trade-off between the average quality and the best quality of processed tasks. The iterative process raised the average quality but reduced the variety of results, which is very important for obtaining the best result; this was because the next worker often used the previous worker's result as a reference.

Figure 3.7: Relation between the number of workers and the evaluation values (blue: iterative process; red: parallel process)

Figure 3.8: Relation between the number of workers and the evaluation values (blue: evaluation values by the iterative process; red: average evaluation value by the parallel process)

Chapter 4 Crowdsourcing Translation

4.1 Increasing Demand for Crowdsourcing Translation

The number of professional translators is limited, and requesting translations from them takes much time and cost. Crowdsourcing translation can realize low-cost and quick translation because non-professional translators can also participate. Crowdsourcing translation is mainly used for the globalization and overseas transfer of companies, and its demand is expected to grow further given Japan's foreign direct investment in recent years. Japan's foreign direct investment is direct investment in foreign companies by Japanese companies; the more the amount of foreign direct investment increases, the higher the possibility of overseas transfer. Table 4.1 shows the amount of Japan's foreign direct investment in ASEAN (2008-2010) published by the Bank of Japan, and Table 4.2 compares the amounts in 2010 and 2011, also published by the Bank of Japan. Tables 4.1 and 4.2 indicate that the overseas transfer of Japanese companies to ASEAN has increased recently, mainly because the yen is appreciating. Media reported in 2011 that local governments were supporting the overseas transfer of smaller businesses; for example, the Ota district of Tokyo provides consultation for overseas development and supports the translation of foreign documents. The corporate demand for crowdsourcing translation is increasing outside Japan as well: Table 4.3 shows the amount of China's outward foreign direct investment (2008-2010) published by the Japan External Trade Organization (JETRO), indicating that more Chinese companies are transferring operations overseas.

Crowdsourcing translation can be used not only by companies but also by individuals, who use it to publish and obtain information on the Web. Examples on the publishing side are translations of a home page, a blog, or the explanation of a created application; an example on the obtaining side is the translation of news articles from foreign countries. These translations do not have to be perfect; there is no problem as long as their meaning is correct.

The time and cost required for these translations are also expected to be low. The translations described here are therefore well suited as jobs for crowdsourcing translation.

Table 4.1: Amount of Japan's foreign direct investment in ASEAN in recent years

  Year                                 2008    2009    2010
  Amount of investment (billion yen)   6,518   6,587   7,711

Table 4.2: Comparison of the amounts of Japan's foreign direct investment in ASEAN in 2010 and 2011

  Quarter                                        Q1 2010   Q1 2011   Q2 2010   Q2 2011
  Amount of investment in ASEAN (billion yen)      666      1,016     1,867     2,768

Table 4.3: Amount of China's outward foreign direct investment in recent years

  Year                                    2008     2009     2010
  Amount of investment (million dollars)   41,859   47,800   59,000

4.2 Example of a Crowdsourcing Translation Service

mygengo (http://ja.mygengo.com/) is an example of crowdsourcing translation. mygengo is a service created in Japan whose purpose is to support globalization in business; its customers include major Japanese companies, and it is one of the proven crowdsourcing translation services. The flow of using mygengo is as follows.

1. A translation job is ordered through the mygengo web site or API.
2. Translators registered at mygengo start to process the job.
3. The requester and the translators can exchange comments during the translation, and necessary modifications are made at no charge after the requester checks the translation.

4. Delivery is notified by e-mail; if the API is used, the translation is sent to the requester automatically.

Translators who pass a qualification test can register at mygengo. Translations are classified into the standard and pro levels, and the requester can select the level. As of spring 2011, more than 1,600 translators were registered at mygengo, and the languages handled were Japanese, English, Chinese, French, German, Italian, and Spanish. Figure 4.1 shows the translation-request screen of mygengo. Requesting translations from professional translators generally takes much effort; at mygengo, anyone can request a translation very easily. Actual examples of translations requested at mygengo include the explanation of an application, a company's press release, and a company's manual.

Figure 4.1: Screen of the translation request in mygengo

4.3 Increasing Translation Quality through Cooperative Processes

In crowdsourcing translation, one translator generally takes responsibility for one translation job, so the quality of a translation is not sufficiently guaranteed. The quality can be expected to increase if multiple translators process the translation cooperatively. In that case, the work is divided appropriately among the translators, and the reward is divided among them according to each translator's contribution. How to measure a translator's contribution and pay accordingly is very important, but we do not consider it here; in this study we focus on the best way to increase translation quality with multiple translators.

mygengo is a controlled crowdsourcing service built to process translations appropriately. Amazon Mechanical Turk is also a crowdsourcing service, but it is not controlled with translation in mind: highly technical tasks such as translation are not well suited to its HITs, because workers and requesters can process and request HITs very freely. However, no other crowdsourcing service allows translation tasks to be requested and processed through cooperative processes, so we decided to use Amazon Mechanical Turk.

Chapter 5 Establishment of the Experiment Environment

In this study we built an experiment environment for translation from Chinese into English.

5.1 Processing Tasks through Cooperative Processes

We formed crowdsourcing translation processes using the parallel and iterative processes shown in Figures 3.5 and 3.6, and we used TurKit [12] to realize them on Amazon Mechanical Turk. TurKit is a tool for executing HITs processed iteratively on Amazon Mechanical Turk; HITs can be processed according to a process described as a JavaScript program. Figure 5.1 shows the execution screen of TurKit. The process and the content of the HIT are entered in the area marked 1 in Figure 5.1, the results of HITs processed by workers and voters are output in the area marked 2, and links to the pages of the HITs requested on Amazon Mechanical Turk are displayed in the area marked 3; areas 2 and 3 are updated as the process progresses.

Figure 5.1: Execution screen of TurKit
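The actual TurKit scripts are not reproduced here, so the following Python sketch only illustrates the control flow of the two processes as Figures 3.5 and 3.6 describe them. The helpers request_translation_hit and request_vote_hit are hypothetical stand-ins for the HIT-posting machinery, and the pairwise-vote structure is an assumption based on Chapter 6's description of obtaining the best of three translations through two votes.

```python
# Sketch of the two cooperative processes (hypothetical helpers throughout).
from collections import Counter


def request_translation_hit(source, draft=None):
    """Hypothetical helper: post one translation HIT (an improvement HIT when
    a previous draft is supplied) and block until a worker submits a result."""
    raise NotImplementedError  # stands in for the real HIT-posting code


def request_vote_hit(source, a, b):
    """Hypothetical helper: post one vote HIT asking a voter to pick the
    better of two translations of `source`; returns the chosen translation."""
    raise NotImplementedError


def majority_vote(source, a, b, n_voters=3):
    # One vote task = n_voters independent voters; the majority decides.
    votes = Counter(request_vote_hit(source, a, b) for _ in range(n_voters))
    return votes.most_common(1)[0][0]


def parallel_process(source, n_translators=3):
    # All translators translate independently; vote tasks pick the best.
    # With 3 translators this takes two votes, matching Chapter 6.
    candidates = [request_translation_hit(source) for _ in range(n_translators)]
    best = candidates[0]
    for challenger in candidates[1:]:
        best = majority_vote(source, best, challenger)
    return best


def iterative_process(source, n_translators=3):
    # Each translator improves the previous draft; after each improvement a
    # vote task keeps the better of the old and new drafts.
    draft = request_translation_hit(source)
    for _ in range(n_translators - 1):
        improved = request_translation_hit(source, draft=draft)
        draft = majority_vote(source, draft, improved)
    return draft
```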

5.2 Requesting Translation Tasks

There are two kinds of translation tasks in this experiment:

(a) translating a source sentence into the target language;
(b) improving the translated sentence based on the source sentence.

Task (a) is processed by the workers in the parallel process and by the first worker in the iterative process; task (b) is processed by the workers other than the first in the iterative process. Figures 5.2 and 5.3 show the request screens of tasks (a) and (b) for translation from Chinese to English. Both screens carry a notice saying that the result of a HIT is rejected if the worker uses machine translation or the quality of the translation is very low. In task (b), the translation entered by the previous worker is provided, and the worker may create the new translation by modifying it; the notice in task (b) says that the worker can use the provided translation but may start over if the worker does not want to use it. The reward and processing time of task (a) are $0.20 and 60 minutes, and those of task (b) are $0.10 and 30 minutes. The processing time is long because we want workers to create better translations; for example, we hope that workers look up technical words in dictionaries on the Web.

5.3 Requesting Vote Tasks

Figure 5.4 shows the request screen of the vote task. In this task, workers compare two English translations of a Chinese sentence and select the better one; workers can see the Chinese sentence while processing the vote task. The reward and processing time of the vote task are $0.03 and 10 minutes. The processing time is long because we want workers to make a thoughtful choice. The vote task is important because the translators' effort is wasted if voting does not function correctly.

Figure 5.2: Request screen of the translation task from Chinese to English

Figure 5.3: Request screen of the improvement of the translation from Chinese to English

Figure 5.4: Request screen of the vote task

5.4 Screening of Workers

Previous studies [3, 9] indicate that many workers on Amazon Mechanical Turk do not process HITs seriously. In our case, a translation task needing technical ability has to be requested, and it is difficult to determine whether a worker processed a vote task seriously; screening workers is therefore necessary in this experiment.

There are two methods for screening workers on Amazon Mechanical Turk. The first is to create and publish a formal Qualification test and have workers solve it. A formal Qualification test cannot be created and published through the GUI; a program written in a language such as Java, Ruby, or Perl has to be executed. The correct answers are prepared in advance, and a Qualification can be granted to a worker automatically by comparing the worker's answers with the correct ones. The second method is to request an ordinary HIT as a Qualification test. In this case, the title or description of the HIT must state that it is a Qualification test. Such a test can be created and published through the GUI, although this method may not be official, and whether a worker's answer is correct can be checked manually.

We adopted the second method because translations should be checked manually in a Qualification test. Figure 5.5 shows the request screen of the Qualification test for Chinese-English translation.

Figure 5.5: Request screen of the Qualification test for the Chinese-English translation

The guideline of the Qualification test says that a worker who translates the Chinese sentence into English correctly obtains the Qualification for many other translation tasks, and that machine translation must not be used. After solving the translation test, the worker has to answer his or her mother tongue, country of origin, and country of residence; we used this questionnaire to screen workers easily. In the Qualification test, the source sentence is 对不起, 我们这里没有这个人, and an example of a correct translation is "I'm sorry, but we don't have such a person here." We used a very simple test because many workers do not take a test that is difficult. The number of workers passing a formal Qualification test published on Amazon Mechanical Turk can be seen: the number of workers who had passed a Chinese-English translation Qualification test published by another requester was around 30, which is very low, and that translation test was more difficult than ours. The reward and processing time of our Qualification test are $0.01 and 5 minutes. We could have set the reward to $0, but we considered that many workers would not gather if the reward were $0. The processing time is short because the test is so easy that workers do not need dictionaries.
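For illustration only, the first, formal kind of Qualification test, with an auto-graded answer key, can be published through the present-day boto3 API as sketched below; the thesis used the second, manual method, and the XML bodies here are placeholders, not the actual test.

```python
# Sketch: a formal, auto-graded Qualification test via boto3 (illustrative;
# this study instead requested a normal HIT and graded answers manually).
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_form_xml = "..."  # QuestionForm XML with the test question (omitted)
answer_key_xml = "..."     # AnswerKey XML with the correct answer (omitted)

qual = mturk.create_qualification_type(
    Name="Chinese-English translation screening",
    Description="Translate one short Chinese sentence into English.",
    QualificationTypeStatus="Active",
    Test=question_form_xml,
    AnswerKey=answer_key_xml,      # enables automatic grading and granting
    TestDurationInSeconds=5 * 60,
)

# Later, translation HITs can be restricted to workers holding the
# qualification by passing this requirement to create_hit.
requirement = {
    "QualificationTypeId": qual["QualificationType"]["QualificationTypeId"],
    "Comparator": "Exists",
}
```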

Chapter 6 Experiment and Evaluation

6.1 Experiment

We conducted a translation experiment using the parallel and iterative processes. The experiment was to clarify the relation between the number of translators and the quality of translations in the parallel and iterative processes, and also in the translation processes combining the two.

We published the Qualification test for Chinese-English translation on Amazon Mechanical Turk to gather workers for the experiment. One week after the test was published, 30 workers had passed it. Because this number was low, we set the number of translators, and the number of voters per vote, to 3. The parallel and iterative processes used in our experiment are shown in Figures 6.1 and 6.2, respectively. The experiment was launched one week after the Qualification test was published; the test remained published while the experiment was conducted, and workers who passed it were granted the Qualification at any time. The purpose, procedure, and hypotheses of the experiment are as follows.

Purpose of the experiment
Evaluation of the crowdsourcing translation processes combining the parallel and iterative processes (with 3 translators).

Procedure of the experiment
Experiments 1 and 2 are conducted in order.

Experiment 1: the relation between the number of translators and the quality of translations in the parallel process. We translate Chinese sentences into English using the parallel process on Amazon Mechanical Turk. For each Chinese sentence we acquire three translations, which are regarded as the translations by translators 1, 2, and 3 in order of acquisition, and we obtain the best translation through two votes.

Experiment 2: the relation between the number of translators and the quality of translations in the iterative process.

Chinese sentences are translated into English using the iterative process, in which the translations by translator 1 in Experiment 1 are reused as the translations by translator 1.

Figure 6.1: Parallel process (the number of translators is 3)

Figure 6.2: Iterative process (the number of translators is 3)

Hypotheses of the experiment

Hypothesis 1: a better translation is acquired by the iterative process than by the parallel process.
Reason: the quality of a translation is considered to increase through iterative modifications (improvements of grammar and spelling errors).

Hypothesis 2: a better translation is acquired by the process combining the parallel and iterative processes (Figure 6.3) than by the iterative process.
Reason: the first translation given for improvement is expected to strongly influence the translators who improve it, so the quality of the final translation can be increased by acquiring the first translations from multiple translators.

The source sentences and rewards are as follows.

Source sentences
Five articles each are randomly selected from the categories of sports, society, economic