
School of Computing
FACULTY OF ENGINEERING

Scheduling policies in Hadoop

Alexandar Dimitrov

Submitted in accordance with the requirements for the degree of Information Technology

Session 2014/2015

The candidate confirms that the following have been submitted:

Items | Format | Recipient(s) and Date
Deliverable 1 | Report | SSO (13/05/2015)
Deliverable 2 | Software code | Supervisor (13/05/2015)

Type of Project: Empirical Study

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)

© 2015 The University of Leeds and Alexandar Ventzislavov Dimitrov

Summary

In recent years, we have witnessed a rapid increase in the number of Internet users and connected devices, as well as the growth of the Internet of Things (connected watches, power sockets, door locks, coffee machines, etc.). As a result, vast quantities of data are generated (the so-called Big Data), such as user data (images, articles, video files, music, etc.), sensor data and log files (records of how users use services). Collecting and analysing Big Data and providing insights to customers is a growing business. Due to the high costs of implementing their own IT infrastructures, more and more companies are turning to Cloud providers; consequently, Cloud computing is gaining popularity at a rapid pace. Competition among the ever-increasing number of firms offering Cloud services (whether Platform as a Service, Software as a Service or Infrastructure as a Service) makes gaining a competitive advantage more difficult. One way to achieve an advantage is to process data in the optimal way, by using the most efficient scheduling algorithm.

The aim of this project is to introduce the reader to Hadoop scheduling policies and to benchmark the default MapReduce scheduler (First-In-First-Out), the Fair scheduler and the Capacity scheduler. Benchmarking is done using a variety of parameters (e.g. data size, cluster size) to determine whether the number of machines and the volume of data affect performance. The code to be benchmarked has been written in Java and executed on the Cloud testbed of the University of Leeds.

Acknowledgements

First and foremost, I would like to thank my project supervisor, Dr Karim Djemame, for his guidance, patience and feedback, without which the project would have taken a completely different course. In addition, I would like to thank my project assessor, Dr Natasha Shakhlevich, for her support and feedback throughout the project. For the support of the new Cloud testbed, I would like to thank Dr Django Armstrong. Lastly, I would like to thank my friends and family for their never-ending support.


Contents

Summary
Acknowledgements
Contents

Chapter 1 Introduction
  1.1 Overview
  1.2 Aim
  1.3 Objectives
  1.4 Approach
    1.4.1 Background research
    1.4.2 Experiment Design
    1.4.3 Experiment Implementation
    1.4.4 Evaluation
  1.5 Schedule
    1.5.1 Project tasks
    1.5.2 Milestones
    1.5.3 Timeline
  1.6 Deliverables
  1.7 Conclusion

Chapter 2 Background research
  2.1 Introduction
  2.2 Big Data
    2.2.1 Challenges
  2.3 Distributed Systems
  2.4 Cloud Computing
  2.5 Virtualisation
  2.6 Cloud Computing Deployment Methods
  2.7 Cloud Computing Service Models
    2.7.1 Software as a Service
    2.7.2 Platform as a Service
    2.7.3 Infrastructure as a Service
  2.8 Hadoop
  2.9 MapReduce
    2.9.1 Map
    2.9.2 Reduce
  2.10 Hadoop MapReduce Architecture
  2.11 Hadoop Distributed File System
  2.12 Scheduling
    2.12.1 Scheduling in Hadoop
    2.12.2 First-In-First-Out Scheduler
    2.12.3 Fair Scheduler
    2.12.4 Capacity Scheduler
  2.13 Similar Work
  2.14 Conclusion

Chapter 3 Experiment Design
  3.1 Experiment Variables
    3.1.1 Scheduling Algorithms
    3.1.2 Data Size
    3.1.3 Number of Machines
    3.1.4 Number of Virtual Machines
    3.1.5 Number of Jobs
  3.2 Measurements and metrics
    3.2.1 Scalability
    3.2.2 Runtime
    3.2.3 Resource Utilisation
  3.3 Hypothesis
    Which scheduler performs best?
    Which scheduler scales best?
    Does the size of the data affect the results?
  3.4 Conclusion

Chapter 4 Experiment Implementation
  4.1 Experiment Environment
    Host Parameters
    Cluster Specifications
  4.2 Choice of Data
    Data Homogeneity
    Type of Data
    Size of data
  4.3 Program Code
  4.4 Conclusion

Chapter 5 Technical Evaluation
  5.1 Single Job
    Single Host Cluster
    Multiple Host Cluster
  5.2 Five Jobs
    Single Host Cluster
    Multiple Host Cluster
  5.3 Hypothesis Evaluation
    Which scheduler performs best?
    Which scheduler scales best?
    Does the size of the data affect the results?
  5.4 Conclusion

Chapter 6 Project Evaluation
  6.1 Methodology Evaluation
  6.2 Final Timeline
  6.3 Achievement of Aim and Objectives
  6.4 Similar Work
  6.5 Contribution to the Research Field
  6.6 Future Work
    Cost
    Energy Efficiency
    Various Configurations
    Hardware Utilization
    Fine-Tuning of Schedulers
    Experiments with jobs with different sized datasets

Chapter 7 Project Conclusion

List of References

Personal Reflection

Appendix A External Materials
  A.1 Data
  A.2 Java Code

Appendix B Ethical Issues Addressed

Appendix C Data from experimentation
  C.1 Single Job Data
    Single host
    Multiple host
  C.2 Five Job Data
    Single host
    Multiple host

Appendix D Java Code
  D.1 Java Code

Chapter 1 Introduction

1.1 Overview

In the digital age it is very easy to collect a lot of data. Data may be generated by users, by sensors or by computers in general. Companies gather it, store it on clusters and later process it. Every year, the size of the datasets with which they work increases exponentially, and it is now normal for companies to process datasets in the magnitude of petabytes. Some examples of daily data generation by companies include [67]:

- Google: 20 petabytes
- Facebook: 2.5 petabytes
- Ebay: 6.5 petabytes
- CERN: 440 terabytes

With the increase of users and connected devices, this rate of daily generation will not slow down; eventually, the datasets will be in the order of exabytes. This increase makes traditional serial data management systems an unfeasible choice, as such a system locks for long periods of time. Data of this scale is generally referred to as Big Data [2][3][8][20][27]. Typically, Big Data is simultaneously accessed by multiple users. This type of interaction requires a specific type of system to store and process it: a distributed system [29]. The importance of distributed computing is growing continuously, as more and more industries require the processing of Big Data sets. A crucial consideration for companies today is having the ability to process large amounts of data quickly and to provide valuable insights for their customers [2][8][10][14][32]. These insights are typically referred to as Business Intelligence [72][73][74]. Examples include [67]:

- What are the behavioural patterns of customers?
- Are the company's information systems functioning in an economically efficient way?
- Are there times when particular applications are popular? When and why?

In order to gain these insights from Big Data using a distributed system, numerous tools have been developed. This report will focus on Hadoop [4][16][25][68], as it is the de facto industry standard [17][18][22][33][35] for storing and processing Big Data. In particular, it will investigate the MapReduce [22][47] architecture within the Hadoop framework. Naturally, when multiple jobs have to be executed, there is a need for scheduling [5][24][51][64][66]. The specific focus of this report is to inform the reader of the impact of scheduling on the performance of a Hadoop cluster. This will be achieved by designing and executing experiments on the University of Leeds Cloud testbed. In the following subsections, the approach, methodology and objectives will be covered in more detail.

1.2 Aim

The overall aim of this project is to provide an informative comparison of Hadoop scheduling policies by benchmarking the default First-In-First-Out (FIFO) scheduler, the Fair scheduler [6] and the Capacity scheduler [7]. The Fair and Capacity schedulers were developed by external companies, as FIFO was not suitable for all scenarios [5]. In order to better understand the significance of Hadoop, the report will introduce the broader landscape, starting with the challenges of distributed systems. It will examine the Cloud paradigm and put Hadoop into context. More specifically, the MapReduce programming model, which is an essential part of Hadoop, will be discussed. Finally, the report will focus on the scheduling of MapReduce jobs, the role it plays, and the impact of scheduling on performance.

Following the introduction, experiments will be designed and executed on the University of Leeds Cloud testbed. The experimentation will investigate the effects of cluster elasticity, variance of data volume and network delay, in order to gain an understanding of how to make data processing more efficient. This report will give the reader an informed and objective evaluation of the performance of the scheduling algorithms and compare the findings collected from the experiments to other work in this field. The evaluation chapter will include an analysis of the results and of the factors which might affect performance.

1.3 Objectives

Upon completion of the project, the following objectives will be achieved:

- Familiarising the reader with the background area and focusing on Hadoop.
- Designing a series of experiments on the Cloud testbed to evaluate the performance of Hadoop scheduling policies.
- Developing and running the tests; recording the results.
- Characterising Hadoop scheduling policies under various scenarios.
- Evaluating and contrasting the results with those from other papers.
- Reflecting on the methodology used for completion of the project.
- Identifying areas for future research.

1.4 Approach

This project is an Empirical Study. It will be broken down into four parts to enable a structured approach.

1. The first phase will be background research. It will consist of research into the area of Cloud Computing [1][11][13][31], in order to gain a deep understanding of the context, which will lead to a good definition of the problem and of the scope of the project. In the background research chapter, similar research in the area will be introduced.

2. The second part will focus on the design of the experiments and the reasoning why the experiments are appropriate to the research. This phase will consist of the choice of the metrics and measurements which will be used for assessing each of the schedulers, as well as an explanation of which parameters will change during the experimentation. Finally, a hypothesis will be written.

3. The third part of the project will be the actual implementation of the experiments. This part will introduce the experiment environment. In addition, the choice of data and the choice of the program which will run on the cluster will be discussed. For each of the implementation choices, a reasoning will be provided. For the development of the experiments, an agile approach [76] will be used. This will allow flexible implementation of the experiments via prototyping. Prototypes will be designed, polished and executed on the Cloud testbed.

4. Finally, in the fourth part, the project will be evaluated. This part will be divided into two chapters. The first one will be a technical evaluation, in which the findings will be analysed and each of the scheduling algorithms will be compared to the others using the metrics defined in the experiment design chapter. At the end of the technical evaluation chapter, the findings will be compared and contrasted with the hypothesis. The second part will be a project evaluation, which will reflect on the methodology, organisation and achievements of the project. Finally, the findings from the report will be compared to similar studies in the field, and areas for future research will be suggested.

The methodology for completing each of the phases will be Agile as well. Work will be completed in weekly sprints, to ensure that each point of the project schedule is completed in an adequate and timely manner. Weekly meetings with the project supervisor will be used as a source of feedback and an opportunity to make adjustments to the future direction of the work. The meetings will consist of a reflection on the work carried out in the previous week; a review of the topic of discussion for the meeting; and an agreement regarding the objectives for the following week. The chapters of the report are covered in the following subsections.

1.4.1 Background research

The background research chapter will introduce the reader to the context of the project. It will briefly describe the main underlying concepts and the environment of the subject matter. The purpose of the chapter is to establish a solid foundation for the project, from which the paper will funnel down to the focus of the research.

Firstly, the background research will introduce the concept of Big Data. It will cover what it is, its importance, and how it is gathered and processed. It will also touch on the challenges Big Data introduces, as well as the tools and techniques that are used for its analysis. Secondly, it will explain the fundamentals of distributed computing and those of Cloud Computing. Virtualisation will be briefly discussed, as it plays a key role in the success of Cloud Computing. Different Cloud deployment methods will be described, and the three types of Cloud service models will be examined and compared as well. Thirdly, the background research will introduce the Hadoop project: specifically, why and how it was conceived, and its development to this day. The mission and architecture of Hadoop will be examined. The MapReduce programming model will be introduced, as well as the MapReduce architecture within Hadoop. The Hadoop Distributed File System (HDFS) [50][62][71] will be discussed, as it is one of the key components of Hadoop. In addition, the background research will underline the importance of scheduling in general, and of scheduling within the context of Hadoop. Three of the scheduling algorithms will be introduced, and the importance of each one will be highlighted. Finally, similar research in the area will be introduced and compared to this project.

1.4.2 Experiment Design

This chapter will give insight into the reasoning behind the design choices for the experiments used for evaluation. It will contain a description of the experiment variables, as well as the metrics and measurements based on which the schedulers will be compared. Finally, a hypothesis will be constructed, which will state what the experimentation might show.

1.4.3 Experiment Implementation

This chapter will cover the implementation path of the experiments. Firstly, it will describe the parameters of the experimentation environment. Secondly, it will introduce the choice and format of the data used for the experimentation. Finally, it will provide details about the language used for developing the program used for the experiments.

The experiments will be executed and tested, in order to provide results which will be used for the technical evaluation of the project.

1.4.4 Evaluation

This part of the report will be divided into two chapters.

Technical Evaluation

The first chapter will be a technical evaluation. This chapter will lay out the findings from the experimentation and will provide the reader with an analysis of the performance of the algorithms in each of the scenarios. At the end of the chapter, the findings from the experiments will be compared to the hypothesis. Any mismatches and discrepancies will be examined.

Project Evaluation

The second will be a project evaluation. This chapter will look back at the choice of methodology, organisation, timeline and achievements of the project, and will state whether the research was successful and the objectives were met. Finally, the findings from the report will be compared to similar studies in the field, and areas for future research will be suggested.

1.5 Schedule

Due to the size of the project, a detailed schedule is needed to ensure the timely delivery of the report and to facilitate progress monitoring. While the milestones are generic, there are specific tasks needed to achieve them.

1.5.1 Project tasks

1. Investigate research landscape

Before starting the actual design, the research area needs to be explored. This includes both an introduction to the broader landscape and an investigation of the existing research papers focused on the scheduling aspects of Hadoop.

2. Determine experiments

After the research has been conducted, the exact metrics and measurements for comparing the scheduling algorithms have to be determined. The experiments have to be relevant to the project and well reasoned. They need to fit into the research landscape and explore areas which have not been explored before. In addition, due to the time constraints, the scope has to be picked carefully while meeting the objectives.

3. Design experiments

Before starting the actual testing on the Cloud testbed, the variables, metrics and measurements of the experiments have to be determined. The design reasoning will be based on the background research and thoroughly explained. The experiments will be designed so that they explore areas similar to other research in the area. Following the design of the experiments, a hypothesis will be written.

4. Familiarise with the Cloud testbed

A thorough understanding of the testbed functionality will be achieved before the actual tests are executed. This includes investigating the OpenNebula [75] software, understanding how Virtual Machines [80] are configured and instantiated, as well as understanding how to configure a Hadoop cluster.

5. Acquire data

To ensure that the findings are practical and meaningful, the experiments will be executed with real-world data. Based on the design rationale, a free-to-use dataset of appropriate size will be investigated.

6. Execute experiments

After the experiments have been designed, the testbed has been configured, the Hadoop cluster has been started, the data has been put into the HDFS, and the software for the testing has been written, the actual testing will commence. Using the Agile development methodology, experiments will be implemented on the cluster. The experiments will be governed by the experiment design chapter. During the prototyping [77][78], adjustments will be made until the findings from the tests are useful, reliable, replicable and meaningful.

7. Evaluate findings and project

The findings from the experimentation will be investigated, and the questions from the hypothesis will be answered. Upon analysing the results, the best scheduling algorithm for each situation will be determined. Finally, a reflection on and evaluation of the project will be constructed, which will assess the process and its success.

1.5.2 Milestones

Throughout the project, a number of milestones must be met.

1. Delivery of Scoping and Planning Document

This will act as an agreement between the project supervisor, the assessor and the student regarding the aim, objectives, scope and schedule of the project. It is critical, as it will define the foundation of the project and will help the student understand the requirements.

2. Mid-project report

Midway through the project, a two-page summary of progress and work-to-do will be composed. In addition, a presentation outlining the key points of the project will be delivered. By this point, the background research section must be finished. In addition, some experiments must be designed and a rationale provided. Ideally, the presentation will include test results. Finally, future plans will be described in detail.

3. Progress meeting

Near the end of the project, a draft of the report will be delivered to the supervisor. The purpose of the progress meeting is to ensure that the writing style and structure match the expected ones. By this point, the experimentation should be completed.

4. Final report submission

This marks the final stage of the project. At this point, the experimentation and evaluation phases must be completed, and the report should be finished and submitted.

1.5.3 Timeline

Before the start of the project, a timeline was defined for the Scoping and Planning document. It required clear definitions of weekly deliverables. Its purpose was to act as a guideline for the work and to provide a clear idea of the timeframe and how it can be utilised efficiently. This subchapter includes the initial timeline; the finalised one will be included in the project evaluation chapter. The initial timeline allocated the following tasks and deliverables across the weeks of the project:

- Plan of the project
- Lay out structure of the report
- Literature review; introduce the focus of the report
- Experiment Design: first draft of experiments; final draft of experiments
- Progress meeting presentation
- Implementation/Testing; record findings in report
- Evaluation; record findings in report
- Finalise the report

Figure 1.1: Initial project timeline

1.6 Deliverables

In order for the project to be considered a success, the following should be delivered:

1. Literature review: a concise report which will introduce the area of distributed computing and the role of Hadoop within it, as well as similar research in the area.
2. Design documentation: this report will consist of the scope of the experiments and the metrics that will be used to compare and contrast the algorithms.
3. Test observations: this deliverable will be documentation of the findings obtained after running the different tests.
4. Evaluation report: this report will include an analysis of the work done for the project. It will state whether all the objectives of the project were met and whether all the deliverables were delivered. In addition, it will compare the findings from the experiments with findings from other relevant papers.

1.7 Conclusion

This chapter has familiarised the reader with the aim of this project: to provide an informative and comprehensive performance comparison between three MapReduce scheduling algorithms for the Hadoop framework. This will be achieved by determining meaningful and relevant metrics, designing and executing various experiments, and evaluating the findings from the experiments. The workflow is guided by the project schedule and governed by the deliverables and milestones. The next chapter will introduce the context of the research. It will familiarise the reader with the concept of Big Data, the current Cloud Computing trends and the Hadoop project. Finally, it will provide a description of each of the MapReduce schedulers which will be benchmarked in the experimentation phase of the project.

Chapter 2 Background research

2.1 Introduction

This chapter will familiarise the reader with the technological foundation of the research. It will introduce the importance of Big Data and how it is processed with distributed systems and within the Cloud. It will cover the Hadoop framework, as well as the MapReduce paradigm. It will focus on three of Hadoop's scheduling algorithms: First-In-First-Out, Fair scheduling and Capacity scheduling. Finally, similar work will be discussed, which will be compared and contrasted with the findings of the author in later chapters.

2.2 Big Data

The characteristics of Big Data usually include an unstructured format and gathering from various collection points, which results in data sets that are so large that traditional data management software is incapable of processing them [2][3]. Big Data may originate from user-generated data, for example from search queries or website navigation paths, or from sensors, such as thermometers or cameras which monitor traffic and pedestrian activity [8].

Figure 2.1: Big Data Challenges [67]

Some real-world examples of Big Data, and of access to Big Data, include:

- Facebook: 600 TB of daily input data, 300 PB of Hive data (2014) [33]
- Google: around 40,000 search queries every second, or 3.5 billion searches per day (2012) [34]
- Ebay: 75 billion database calls a day [35]

The goal of data processing is to construct information from data and to gain knowledge from information. The main objective is to synthesise the knowledge into wisdom which will play a key strategic role for the company [61]. The same applies to Big Data. A study from Google has shown that processing more data leads to better results compared to improving algorithms [28]. In the digital age, the difficulty is not gathering data but extracting meaningful insights from it. An IBM survey has discovered that more than half of the business leaders today are missing key insights needed to do their jobs [2]. Figure 2.1 displays various types of Big Data, as well as the challenges it brings. The next subchapter will focus on the challenges.

2.2.1 Challenges

The nature of Big Data brings a number of challenges which make it unfeasible to manage, analyse, summarise and extract knowledge in a timely and scalable fashion with traditional analysis tools and approaches.

Size

By definition, Big Data has a vast scale. Traditional database management systems typically process datasets serially. When the data is of great size, the database system will lock for too long; it would be unfeasible to process datasets in the magnitude of petabytes, or even terabytes, this way. This introduces the need for parallel data processing tools.

Another challenge which arises from big datasets is storage. Over the years, the capacity of hard drives and the speed of solid-state drives have been increasing; however, other limitations, such as input and output technologies, have not developed at the same pace, and they make storing large datasets on a single machine unfeasible: processing time becomes highly dependent on the speed of a single input/output port. Storing the data in a distributed manner also has potential drawbacks. For example, increasing the number of machines storing the data adds network delay. The distribution of the Big Data has to be carefully managed in order to maximise the benefits of the distributed system. Another issue to consider is that multiple machines lead to multiple points of failure; therefore, data duplication must be ensured.

Data Heterogeneity

It is possible that the data originates from multiple sources, which in turn means that there could be a format mismatch. As multiple data standards and formats exist on the Internet, processing the data may be challenging. One problem is the encoding used to represent the characters of different websites and text files. For example, Google scans through massive numbers of websites every day, many of which use various encodings. The Big Data processing system needs to convert all text to a unified encoding to ensure that, when returning results to a query, nothing is missed out. Another issue comes with gathering data from different sources: a common template for the format needs to be designed so that corruption does not occur (as seen in Figure 2.1, Veracity). For example, multiple sources may have different fields in their messages or datasets; the system has to be adapted to populate the empty fields. This way, the resulting dataset is clean and consistent. Investigating the challenges of data heterogeneity is out of the scope of this project.

Analysis

While fine-tuning the algorithms might not improve insights as much as increasing the volume of data to be analysed [28], it is still important to understand what insights need to be gained from the dataset [2]. For example, web server log files may be parsed in order to monitor website navigation. This is useful if a company wants to understand customer behaviour, buying and navigation patterns, and to use that knowledge for predictions. This is how Amazon generates the list of items similar to the ones in the browsing history, or items that combine well with the currently selected one [32]. In order to make the analysis of Big Data easier, some companies use visualisation techniques, such as creating graphs, trees or sorted views of the data. This aids humans in understanding the data and spotting trends [27].

After the data has been gathered from the various sources and the analysts have identified how to start gaining insights from it, the next step is to process the data. One possible technology to use is a distributed system. This will be examined in the next section.

2.3 Distributed Systems

A distributed system is a collection of autonomous computers connected by a network. They share their resources and activities using a distribution middleware layer. The distribution middleware adds a level of abstraction between the governance of resources and activities and the end users: from the perspective of the users, they are communicating with a single system [29].

The key properties of a distributed system are [29]:

- Elasticity: the system scales seamlessly in order to support more users and larger datasets.
- Concurrency: multiple users are able to simultaneously access the shared resources (databases, files, variables, etc.).
- Fault tolerance: the failure of one machine does not lead to the failure of the whole system. This is achieved by redundant connections and data replication.
- Transparency: from the perspective of the end user, the system appears as a single machine. He/she should not be able to distinguish whether he/she is communicating with one machine or a thousand machines.
- Resource sharing: processing resources, data storage and task schedulers are distributed between multiple hosts, so that parallel processing is possible.
- Heterogeneity [65]: the distribution is established at a higher level, which implies that a distributed system may consist of machines with various hardware configurations.

The type of distributed system which is the focus of this report is the phenomenon called Cloud Computing.

2.4 Cloud Computing

Cloud Computing is the evolution of Grid computing, providing computing resources on demand. The term Cloud Computing lacks a specific definition [1]; however, the definitions share a number of key properties. A Cloud should offer properties of Utility computing, such as [1]:

- Transparency: the end user is not aware of the specific details of the technologies on top of which the Cloud runs.
- Pay-per-usage: similar to the water provided by water suppliers, computing power is provided on demand. Cloud providers allow users to choose how much processing or storage they need, and to pay only for as much time as they use.
- Elasticity: users are able to increase the amount of storage or processing whenever they like, without much effort on their side. That way, they are able to scale their software however they please, based on consumer demand. For example, a company may rent more resources during peak periods, then scale down again during times with less demand.

Essentially, this means that as long as customers can afford it, they can have access to supercomputer-level processing. This is achieved by virtualisation of the resources [1].

This is an amazing development, as it lowers the deployment barrier of an IT system. The traditional approach, which was widely used before Cloud Computing became popular, was to purchase technical equipment and hire staff to configure and maintain it. Cloud Computing is particularly popular amongst small and medium enterprises, which host their infrastructure on the Cloud and scale it as they grow. Many mobile applications have a Cloud backend [37].

Cloud Service Providers (CSPs) are companies which maintain Cloud Computing implementations and rent them to customers. The staff at the CSP are responsible for the configuration and operation of the Cloud. This allows customers to have access to as many resources as they need with minimal technical background. CSPs report profits from their Cloud services reaching more than $6 billion [70]. Based on the needs of the clients, different layers of abstraction can be used when renting resources. These layers will be discussed later in the chapter. First, a technology key to the operation of the CSPs will be examined.

2.5 Virtualisation

Virtualisation is one of the key technologies that make Cloud Computing possible. It adds an abstraction layer on top of the physical layer, thus enabling Cloud Service Providers to instantiate and manage multiple virtual machines (VMs) on a single physical host. A VM is stored as an image, which is stored as a file, meaning that it can be replicated and backed up very easily. The instantiated VMs can run a number of operating systems. In addition, each of the virtual machines is independent from the others, which enables individual virtual machines to be managed separately. For example, the RAM of one machine can be increased without affecting the others; the same applies to CPU allocation, disk space and network bandwidth. This flexibility and elasticity are key to ensuring that Cloud Computing is financially feasible. For example, Amazon may instantiate 50 VMs on a single rack for 50 small enterprises, instead of renting out 50 racks.

2.6 Cloud Computing Deployment Methods

Based on the needs of the company, Cloud services can be utilised in a number of ways. The National Institute of Standards and Technology defines four Cloud Computing deployment models [12].

Public Cloud

A public Cloud is located externally from the client. Examples of providers include Microsoft Azure [19], Amazon AWS [36], Google AppEngine [37] and Rackspace CloudServers [38][21]. The infrastructure of the public Cloud is designed for open use by the general public. This type of Cloud service is most suitable for clients who would opt for the pay-per-usage model, look for on-demand scalability, or prefer to have the peace of mind provided by the IT experts on the Cloud provider's side. Another benefit of renting an external Cloud, compared to building an internal one, is cost. In the case of a startup, for example, the company might not have a sufficient budget for an IT infrastructure deployment. It has to be taken into consideration, however, that not all companies want to leave their data in the hands of strangers.

Private Cloud

The private Cloud is implemented in-house. A company might prefer this type of Cloud if its needs are unmet by the existing Cloud providers. For example, a scientific research team might need to store and process data that is considered exotic by the existing providers and solutions. Alternatively, the company might have specific data security policies which prohibit sensitive data from leaving the company network; an example of this kind of data is patient information from health institutions [12]. When using a private Cloud, the internal IT team has complete control over the resources and IT architecture. This way, they can perfectly tailor the service to the company's needs.

Community Cloud

This Cloud infrastructure is designed by a particular community with shared interests and security concerns. It allows the community to build a Cloud and share its resources for the benefit of all the organisations involved. Not only will the companies share the resources and benefits of the Cloud, but they will also divide the implementation costs. A Cloud may also be shared between multiple companies in the interest of sustainability and energy efficiency [31]. A community Cloud will provide the companies with a tailored and secure, yet cost-efficient, Cloud implementation. For example, a community Cloud may be shared between a number of local universities, as long as they have similar infrastructure and data protection policies.

Hybrid Cloud

The use of two different infrastructures within a company is defined as a hybrid Cloud. One example is using a private Cloud for sensitive information (such as employee data, health records, etc.) and a public Cloud for storage of larger, less sensitive files which can be transferred externally. This example assumes that no traffic is flowing between the two Clouds, which might not always be the case. Another example of a hybrid Cloud implementation is using a public Cloud as an extension of a private Cloud: when the capacity of the private Cloud is exhausted, computation continues on the public Cloud. This example assumes that transferring data over the Internet does not violate any policies. This type of implementation requires policies governing the function of each of the Cloud infrastructures, in order to avoid confusion and exposure of sensitive data [12].

2.7 Cloud Computing Service Models

The Cloud infrastructure consists of three service layers. The Cloud Service Provider exposes the client to different levels of abstraction and conceals the lower-level configuration from the end users, depending on the customers' needs. The client should select the service model which best suits their needs, so that the control is not more explicit than needed [12].

Figure 2.2: Abstraction layers for Cloud Computing Service Models

2.7.1 Software as a Service

This is the most limited type of provision in terms of flexibility. Software as a Service (SaaS) is intended to act as local software. Examples include word-editing programs such as Google Docs and iWork for iCloud. SaaS is not as powerful or as flexible as the services listed in the following subsections, and is appropriate for specific, simple tasks [39]. These services are typically accessed by thin clients such as a web browser; the actual computation and storage are done remotely on the servers of the CSP. SaaS is becoming increasingly popular for a variety of software packages, such as business applications, accounting systems, Enterprise Resource Planning and even Cloud management systems [40].

SaaS brings benefits to customers as well as to the hosting company. Examples of customer benefits include, but are not limited to: no need to download updates, as the software is updated as soon as changes are pushed to the servers; and use of the software wherever the customer is, as long as there is an Internet connection. A benefit to the CSP is that it holds all the data, which enables the provider to operate a subscription-based service, meaning that the users need to pay a fee in order to access the service and their data.

2.7.2 Platform as a Service

Platform as a Service (PaaS) is chosen when more flexibility is needed; it is not intended for one simple application. PaaS provides the consumer with the opportunity to deploy an application onto a Cloud infrastructure using programming languages, libraries or services [12][13]. There is an abstraction layer over the specifics of the machines which are used for execution. That way, developers need not worry about the specific virtualisation technique used by the CSP. They write programs and tools according to the specific restrictions of the platform, and benefit from built-in application scalability. Examples of such service provision are Google App Engine [37], IBM Bluemix [41], Microsoft Azure Web Sites [19], Amazon Web Services Elastic Beanstalk [42] and Cloudera [43][44].

PaaS enables developers to focus on the programming. The service providers handle the server configuration, network and load balancing, databases, operating systems, etc. [45]. PaaS may include platforms for application design, development, testing and deployment, as well as tools for collaboration, web service integration, security, storage, etc. For example, Google App Engine provides support for some of the most popular programming languages, such as Python, PHP, Java and Go (the language designed by Google specifically for development for the Cloud [69]), and tools such as Eclipse, Git and a local testing environment [37][46]. Google App Engine also provides management of virtual machines (still in beta) as a means of more extensive flexibility [37]. A detail that might be considered a disadvantage is that the control of the developer over the development environment is reduced, and differences between the platforms of various providers may result in lock-in to a certain platform.

2.7.3 Infrastructure as a Service

Infrastructure as a Service (IaaS) [12] is provided by Infrastructure Providers (IPs). This type of service typically enables clients to manage low-level resources, such as memory and processing capacity. The allocation of these resources is achieved via virtualisation, which allows providers to split up, assign and dynamically resize the resources in order to build ad-hoc systems according to the clients' requirements. An abstraction layer is introduced on top of the virtualisation, so the client does not need to be concerned with the specifics of the virtualisation technology used by the CSP. When a client requests a virtual machine, a Virtual Infrastructure Manager allocates the resources on one of the machines within the mainframe of the CSP. The infrastructure providers deploy the software stacks that accommodate the services on the infrastructure. The level of control varies between companies; some allow the creation of virtual networks of virtual machines. Companies usually provide various purpose-specific virtual machine templates, or allow users to develop and upload customised images. Some notable IPs include Google Compute Engine [88], IBM SmartCloud Enterprise [85], Windows Azure [84], Amazon Web Services [83], HP Converged Infrastructure [86] and Rackspace Open Cloud [87]. This is the most flexible and low-level service provision, allowing for the greatest customisation.

2.8 Hadoop

Apache Hadoop is a software framework, developed at Yahoo, designed for distributed storage and computation of large data sets, potentially on thousands of hosts. Companies such as Facebook [16][17][18], Spotify [9], Twitter [14] and Yahoo [15] use Hadoop for a variety of data and log analysis tasks. For example, Spotify uses Hadoop to generate radio stations, or to make song recommendations [10].

Figure 2.3: Hadoop architecture [68]

The philosophy behind Hadoop is that it is better to move the computation to the data rather than vice versa, as moving large datasets requires more resources [89]. In this report, three of the major Hadoop components will be examined, namely MapReduce, HDFS and the scheduling module.

2.9 MapReduce

MapReduce is a programming model designed for processing and outputting large datasets on a distributed cluster. As the name suggests, MapReduce consists of two procedures: map and reduce. The programmer specifies a map function, which takes key/value pairs as input and generates a set of intermediate key/value pairs. Afterwards, a reduce function is specified, which merges all intermediate values associated with the same intermediate key [49]. MapReduce acts as a high-level abstraction layer, hiding the details of how the distribution of the program is handled and letting the programmer focus on writing the MapReduce program. MapReduce was originally inspired by map/reduce in functional programming languages, such as LISP, but is not equivalent to it. It was developed by Google [47].

2.9.1 Map

The map function maps input key/value pairs to a set of intermediate key/value pairs:

Map(key1, value1) → list(key2, value2)

The function is run in parallel for each pair in the input dataset and returns a set of key/value pairs as output. Further on, all the pairs with the same key are taken from all the lists and merged, creating a list for each key, so that the output of the map phase is (key2, list(value2)). The functions have been adapted from [48].

<row Id="..." PostId="..." ... UserId="..." /> → Mapper → (key1, value1), (key1, value2), (key2, value3), (key1, value4)

Figure 2.4: Mapper Function
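To make the map stage concrete, the following is a minimal word-count mapper in the Hadoop Java API. This is an illustrative sketch, not the project's actual benchmark code from Appendix D; the class and variable names are hypothetical:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (word, 1) pair for every token in an input line:
// (key1, value1) = (byte offset, line of text) -> list(key2, value2) = list((word, 1))
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one intermediate (key2, value2) pair
        }
    }
}
```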

2.9.2 Reduce

Likewise, a reduce function is executed in parallel for each (key, list) group:

Reduce(key2, list(value2)) → list(value3)

Each call of the reduce function returns either a single value (value3) or no value, although returning more than one value is also allowed [48]. After all the reduce calls have finished, the results are collected and aggregated, possibly producing a smaller list of values [49].

(key1, value1), (key1, value2), (key2, value3), (key1, value4) → Reducer → (key1, (value1, value2, value4)), (key2, (value3))

Figure 2.5: Reducer Function
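A matching reducer, again as an illustrative sketch rather than the project's actual code, sums the list of intermediate values collected for each key:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (key2, list(value2)) and emits a single (key2, value3) pair per key;
// here value3 is the sum of the counts produced by the mapper.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```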

2.10 Hadoop MapReduce Architecture

Figure 2.6: Comparison between MapReduce version 1.0 and 2.0 [90]

In version 0.23 of Hadoop, the MapReduce architecture was drastically changed to MapReduce 2.0, also called Yet Another Resource Negotiator (YARN) [93]. YARN acts as an abstraction layer between the file system and the data processing engines, such as MapReduce, Apache Tez [91] and Apache Spark [92]. This allows different jobs to use different applications. YARN consists of a global Resource Manager (RM), an Application Master (AM) and a Node Manager (NM). The RM governs the resources across all applications. It runs two services: a scheduler, which is responsible for the resource allocation of applications, and an Application Manager, which accepts, executes and monitors job submissions. The AM is the job container, which is monitored by the Application Manager. The NM is responsible for the AM container and monitors the CPU, disk, RAM and network usage. The NM reports the monitoring information to the Resource Manager and the scheduler, so that scheduling decisions and resource allocation can be made. The YARN platform allows greater flexibility of the data processing engines. In addition, MapReduce 2.0 makes the scheduler a pluggable module, which allows solutions best suited to user needs to be developed. One result of this opportunity for customisation is improved cluster utilisation. The next subchapter will introduce HDFS, followed by the scheduling aspects of Hadoop, which are the focus of this report.

2.11 Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) was inspired by the Google File System [50]. It stores the input and output used by Hadoop, and is designed to solve the problem of storing large datasets by breaking the data up into 64 MB blocks (by default [62]). The architecture of HDFS consists of:

- A single NameNode, referred to as the master node, which governs the file system and the scheduling and distribution of jobs. This node keeps a record of the metadata of the cluster: it notes which slave stores each block of data, where the replicated blocks are located, as well as information about the data splitting process, in case the data has to be reconstructed in later stages of processing.
- One or more DataNodes, referred to as slaves, which are usually deployed on every node of the cluster and manage its storage. The slaves handle the data allocation and replication.

Replication plays two roles within HDFS:

- It ensures fault tolerance: even if the DataNode containing one copy of a file fails, another DataNode is able to provide a replica of it. In addition, HDFS has a feature called rack awareness, which distributes the replicas so that a data block remains accessible even if a whole rack fails.
- It makes simultaneous access possible. If one user is parsing a file and another user requests the same file, the second user is able to process a replica of it.

The distribution of the data between the hosts within a cluster is key to the scalability of the system, as storing all the data on one machine is unrealistic.
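As an illustration of the two HDFS parameters just discussed, block size and replication are set in hdfs-site.xml. This is a hedged sketch: the exact property names vary slightly between Hadoop versions (e.g. dfs.block.size in 1.x versus dfs.blocksize in 2.x):

```xml
<!-- hdfs-site.xml (sketch): property names as in Hadoop 2.x -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value> <!-- 64 MB, the default block size mentioned above -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- two copies of each block; the value used in Chapter 3 -->
  </property>
</configuration>
```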

2.12 Scheduling

Scheduling is a management technique which is used to govern the order of events. Scheduling can be influenced by priority, throughput and turnaround-time requirements. Examples of scheduling include train timetables, event agendas (e.g. concerts and music festivals), CPU scheduling and office hours. These tasks are performed by a scheduler: either a human or, in the case of this report, computer software. The focus of this project is the scheduling of MapReduce jobs within Hadoop. The scheduler governs how the idle system accepts and executes jobs. In addition, it handles jobs which are submitted to the cluster while a job execution is taking place. The goal of the scheduler is to optimise certain parameters, for example minimising turnaround, waiting and job completion time, or maximising throughput and resource utilisation. The next subsection will examine scheduling in Hadoop in more detail.

2.12.1 Scheduling in Hadoop

The initial purpose of Hadoop was to run large batches of jobs, such as log mining and web indexing. Users submitted jobs into a queue and the cluster executed them in consecutive order. However, as Hadoop became more popular and people started storing different types of data, they began to use Hadoop not only for lengthy jobs, but also for shorter queries. Eventually, MapReduce clusters were shared by many users. The sharing brought a large set of benefits; for example, it enabled a company to store all the data in one place, without having to build a separate Hadoop cluster for each user group. However, it also introduced a challenge. As the number of users accessing the cluster increased, the number of jobs increased as well. It became apparent that the default First-In-First-Out scheduler was not a feasible solution in all practical scenarios [5]. This problem was raised to the developers and eventually the scheduler became a pluggable module. Soon after, big companies started developing their own versions of schedulers, best suited to their requirements and company objectives.

When designing a scheduling policy, certain goals, priorities and objectives are pursued. Examples of the goals include: simply complying with the Service Level Agreements; maximising a parameter (throughput, resource utilisation, data locality, fairness, node heterogeneity); or minimising a parameter (latency, turnaround time, job waiting time, running time, energy consumption, etc.). As there are various goals which may be pursued, and a broad range of variables available in a MapReduce scenario, designing a scheduler which will meet all of the goals is an NP-hard problem. Often compromises have to be made in order to meet the goals which have a higher priority (e.g. the Service Level Agreements).
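Because the scheduler is a pluggable module, switching between the policies benchmarked in this report is a configuration change rather than a code change. As a sketch (the property and class names differ between MapReduce 1.x and YARN, and the values shown are the standard Hadoop class names):

```xml
<!-- MapReduce 1.x: mapred-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <!-- default: org.apache.hadoop.mapred.JobQueueTaskScheduler (FIFO);
       alternatives: org.apache.hadoop.mapred.FairScheduler,
                     org.apache.hadoop.mapred.CapacityTaskScheduler -->
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- YARN (MapReduce 2.0): yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```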

This report will focus on three of the schedulers: First-In-First-Out, Capacity and Fair. The following subsections will introduce each of them.

2.12.2 First-In-First-Out Scheduler

First-In-First-Out is the default Hadoop scheduler. It is also referred to as JobQueueTaskScheduler in the configuration files. As the name suggests, the first job that is submitted to the queue is the first one to be executed. FIFO does not take into consideration the size or priority of the jobs. While it is efficient and easy to implement, the nature of the scheduling might lead to a poor resource utilisation rate. For example, if a big job is submitted, followed by a small one, the second one will be delayed. Typically, FIFO is best suited for small clusters [64]. Additionally, this scheduler is a poor choice for multi-user environments, due to the lack of parallel execution of jobs [51].

2.12.3 Fair Scheduler

When Facebook started using Hadoop, the initial purpose of the cluster was to process the large amounts of content and log data which were generated daily. However, the number of people accessing the HDFS increased, resulting in a longer queue, which led to delays. At first, Facebook considered building a separate cluster for the developers, but quickly gave up the idea after estimating the implementation and maintenance costs of a new cluster. In the end, the IT team realised that a custom scheduler would be a better option and developed the Fair scheduler.

The concept of fairness is that, on average, all jobs get a proportionally split share of resources: each job gets 1/N-th (where N is the number of jobs) of the available capacity (refer to figure 2.7). The scheduler is built with support for a multi-user environment. The Fair scheduler has three main concepts:

- When a user submits a job, it is assigned to a named pool. Pools are determined by configurable attributes, such as user group, user name or a specific pool tag.
- Each pool can be configured to provide a minimum resource capacity via a configuration file. This means that each pool is allocated a minimum number of map and reduce slots. When a job is submitted to the pool, it gets at least the minimum number of slots [5]. Excess capacity from the pools is divided evenly between them [24].
- The Fair scheduler also supports the assignment of weights for unequal sharing. Weights can be based on priority, on size or on pool, meaning that some jobs might take priority over others. Additionally, the administrator may limit the number of running jobs per user or per pool.
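The pool concepts above map directly onto the Fair scheduler's allocation file. A minimal sketch, assuming the MapReduce 1.x fair scheduler and a hypothetical pool named research:

```xml
<!-- fair-scheduler allocation file (sketch) -->
<allocations>
  <pool name="research">
    <minMaps>4</minMaps>            <!-- minimum map slots guaranteed to the pool -->
    <minReduces>2</minReduces>      <!-- minimum reduce slots guaranteed to the pool -->
    <maxRunningJobs>3</maxRunningJobs>
    <weight>2.0</weight>            <!-- unequal sharing: twice the share of a weight-1.0 pool -->
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault> <!-- per-user running-job limit -->
</allocations>
```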

For the purpose of the research, the default configuration of the Fair scheduler has been used: all the jobs have been assigned to the same pool. Exploration of different configuration options has been considered out of the scope of the project.

Figure 2.7: Resource Utilization in a Fair Scheduler

2.12.4 Capacity Scheduler

The Capacity scheduler was developed by Yahoo. It is designed for clusters which are shared between many users. It functions in a similar way to the Fair scheduler; however, it applies a different approach to sharing. Here, the administrator defines a number of named queues (as shown in figure 2.8), each of which is assigned a number of map and reduce slots. The capacity of a queue is the number of slots available to it when the queue contains jobs; if a queue is empty, its capacity is shared between the other queues. FIFO scheduling with priorities is used within each of the queues. In addition, a limit on the number of jobs a user may submit can be configured. In other words, the Capacity scheduler tries to simulate a separate FIFO cluster for each user and each organisation, rather than submitting all jobs to a shared queue. This scheduler can improve the utilisation of the resources by executing multiple jobs in parallel, as well as by providing support for multiple users [7][5]. The Capacity scheduler is useful in scenarios where different companies share the same cluster, for example within the context of a community Cloud or a public Cloud; it can then be configured so that each queue complies with the Service Level Agreements of each client.

For the purpose of the research, the default configuration of the Capacity scheduler has been used: all the jobs have been assigned to the same queue. Exploration of different configuration options has been considered out of the scope of the project.

Figure 2.8: Capacity Scheduler
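To illustrate the named queues, the following is a sketch of capacity-scheduler.xml in the YARN property style. The queue names engineering and analytics are hypothetical, and the experiments in this report keep the default single queue:

```xml
<!-- capacity-scheduler.xml (sketch): two organisations sharing one cluster -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>engineering,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>60</value> <!-- percentage of cluster resources guaranteed to this queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value> <!-- idle capacity is shared with the other queues -->
</property>
```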

2.13 Similar Work

Due to the popularity of Hadoop, many teams have researched different scheduling policies, especially early on, when the scheduler first became a pluggable module. As soon as programmers started developing new schedulers, multiple research projects were conducted in the area of scheduling [64]. This report will focus on two research papers which were carried out under similar circumstances: [51] and [56].

Criteria | [51] | [56] | This research
Scheduling Algorithms | Fair with Delay Scheduling, Native Fair, FIFO, Capacity, Hadoop On Demand (HOD) | FIFO, Fair, Capacity | FIFO, Capacity, Fair
Program | Various | Word Count, Merge Sort, URL Count, Inverted Index | Word Count
Data Size | Various | 250, 750 MB | 500, 1000, 2000 MB
Jobs submitted | - | - | 1, 5
Nodes | 5 | 3 | 1, 4, 8, 10

Figure 2.9: Comparison between the criteria of this research and similar work

2.14 Conclusion

In this chapter, the concepts and technologies which are key to the Hadoop framework have been discussed. The field of Cloud Computing has been introduced, and its significance and relation to Big Data have been addressed. Lastly, Hadoop has been covered, with a focus on MapReduce and its scheduling aspects. The background research laid the theoretical foundation based on which the findings in the next chapters will be analysed and discussed. The next chapter will cover the design of the actual experiments, as well as the reasoning behind the design choices.

Chapter 3 Experiment Design

The focus of this report is to design and implement experiments in order to objectively compare the performance of the three scheduling algorithms: FIFO, Fair and Capacity. This chapter will define the parameters which will be varied during the evaluation. It will disclose the measurements and metrics which will be used to compare the performance of the different scheduling algorithms. Finally, it will construct a hypothesis about the results of the experiments, based on the literature review and similar research. The two main considerations for the experiment design are the possibility of scaling the experiments to larger datasets, and the time constraints.

3.1 Experiment Variables

During the experimentation, various parameters will be altered in order to identify which ones have the biggest impact on performance. The study will explore whether increasing the number of virtual machines, the number of physical hosts or the size of the dataset affects the performance of the schedulers linearly, or whether anomalies emerge. In addition, two scenarios with a different number of jobs will be explored. By varying the parameters listed below, the research questions will be answered.

3.1.1 Scheduling Algorithms

The experiments will run the jobs using the three scheduling algorithms. Due to time constraints, varying the configurations of the schedulers is left for future exploration. The report will focus on the performance differences between the schedulers with dfs.replication [71] set to 2 and all other parameters left at default. Similar research in the area [51] has used a variety of programs, such as grep and sort; however, this report focuses on the pure performance of the scheduling algorithms within the same environment, in order to assess how they perform under the same circumstances. Therefore, other programs and tasks have been considered out of the scope of the project.

3.1.2 Data Size

The second factor which will be varied is data size. For the purpose of the experimentation, an 8 GB plain-text XML file was chosen, as it can be processed without additional software libraries. It will be obtained from the Stackexchange website [26] and consists of an archive of all the comments on the website. A more detailed reasoning will be included in 4.2.

3.1.3 Number of Machines

The third factor that will be examined is the number of physical hosts executing the jobs. Two scenarios will be monitored. In the first one, all the Virtual Machines will be instantiated on the same physical host. In the second one, each Virtual Machine will be on a separate physical host. This will reveal the impact of network overhead and how it scales. Theoretically, in the scenario where each VM is on a separate host, as the size of the cluster increases, the addition of hosts will result in network overhead per machine. This is a key factor to examine, as in a real-world scenario it is highly unlikely that all the virtual machines rented by a user will reside on a single physical host. Exploring this factor is important, as the research papers that are used as a reference point have not investigated it.

3.1.4 Number of Virtual Machines

The fourth factor that will be explored is the size of the cluster used to compute the MapReduce job/jobs. Experiments will run on clusters consisting of 1, 4, 8 and 10 virtual machines in order to observe how cluster size affects the performance. Varying the number of VMs will enable the researcher to assess the scalability of the schedulers. This is a key factor to examine, as the different schedulers have different functional overhead. Experimenting with various cluster sizes will reveal whether the performance difference scales linearly or anomalies emerge. Exploring this factor will contribute to the research landscape, as the papers that are used as a reference point experimented with a fixed-size cluster.

3.1.5 Number of Jobs

The fifth and final factor that will be explored is the number of jobs executed on the cluster. Due to time constraints, two scenarios will be explored. In the first scenario, only one job will be executed on the cluster. This will be used as a reference point for the second scenario, in which 5 jobs will be executed with a 60-second interval between submissions. A shell script will be developed so that the execution of the jobs is automated and all scheduling algorithms are tested under the same circumstances; a sketch follows below. Exploring this factor will reveal how the scheduling algorithms adapt to an increase in the number of jobs. This variable will again examine the role of the functional overhead of each scheduler on its performance.
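A minimal sketch of such a submission script is shown below. The jar, class and HDFS path names are illustrative (the actual program is described in Chapter 4), and the jar is assumed to have no main-class manifest, so the class name is passed explicitly:

    #!/bin/bash
    # Submit 5 identical MapReduce jobs, 60 seconds apart, without waiting
    # for earlier jobs to finish, so that the scheduler has to arbitrate.
    JAR=/home/hduser/wordcount.jar              # illustrative names
    INPUT=/user/hduser/input/comments-2G.xml
    for i in 1 2 3 4 5; do
        hadoop jar "$JAR" WordCount "$INPUT" "/user/hduser/output/job-$i" \
            > "job-$i.log" 2>&1 &               # run in the background, one log per job
        sleep 60                                # 60-second interval between submissions
    done
    wait                                        # block until all five jobs complete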

3.2 Measurements and Metrics

Below are the scenarios that will be explored during the experimentation phase of the project, and the metrics that will be recorded from the experiments to enable a quantitative, objective comparison between the algorithms in different scenarios [54].

3.2.1 Scalability

One of the most important features of a distributed system is scalability. This report will focus on scenarios assessing three main types of scalability: data, cluster and job scalability.

3.2.1.1 Data Scalability

The first type of scalability that will be assessed is data scalability. Experiments will test the performance of each of the algorithms in each of the scenarios stated in 3.1.2. The runtime for each scenario will be measured, and the speedup and efficiency will be calculated.

3.2.1.2 Cluster Scalability

The second type of scalability that will be investigated is cluster scalability. The program code will be executed on the four cluster sizes stated in 3.1.4. In addition, the two scenarios stated in 3.1.3 will be evaluated. The performance differences between the metrics defined in 3.2.2 and 3.2.3 will be compared in order to gain an understanding of the impact of cluster size on the performance of the schedulers.

3.2.1.3 Job Scalability

A shell script will be written to evaluate the effect of the job count on the performance of each algorithm. Experiments will run in the scenarios stated in 3.1.5. Recordings of the runtime, speedup and efficiency will be made to assess how each algorithm handles the increase in the number of jobs. The specifics of the code will be covered in Chapter 4.

3.2.2 Runtime

Runtime is the time it takes to finish the execution of a job. It is a widely used metric for comparing the performance of scheduling algorithms. The two papers that are used as reference points have also recorded the runtime of each of their scenarios. Measurements of the runtime of each test scenario will be used as a means to determine how the scheduling algorithms perform under the different conditions. Due to the time restrictions, all the experiments will monitor the runtime of a single program, which will be described in 4.3. The runtime will be measured for each of the scenarios introduced in 3.1 in order to provide a measurable way of comparing the performance of each scheduling algorithm.
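One simple way to record the wall-clock runtime is to timestamp the job from the submitting shell. The following is a sketch with illustrative jar, class and path names:

    #!/bin/bash
    # Measure the wall-clock runtime of a single MapReduce job in seconds.
    START=$(date +%s)
    hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output
    END=$(date +%s)
    echo "runtime: $((END - START)) seconds" >> runtimes.txt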

3.2.3 Resource Utilisation

Two essential measurements that have to be considered are speedup and efficiency. These measurements will give an idea of the resource utilisation of each scheduler. Speedup and efficiency will enable the researcher to assess how well the different algorithms scale as the number of virtual machines or the dataset size is increased. Graphs or tables for each of the measurements will be produced in the technical evaluation chapter [21]. The two measurements will be covered in more detail in the following subsections.

3.2.3.1 Speedup

Speedup is a metric that measures the performance improvement between serial and parallel execution. For example, the speedup between two virtual machines and four virtual machines is equal to the execution time on two virtual machines divided by the execution time on four virtual machines.

\[ S = \frac{T_{old}}{T_{new}} \]

Formula 3.1 Speedup

In this formula, S is the speedup achieved, T_old is the old runtime and T_new is the improved runtime [52][53]. The ideal speedup translates to halving the time it takes to execute a job when the number of CPUs used is doubled. This research will focus on the speedup of the runtime. The speedup will be calculated for each of the scenarios introduced in 3.1 in order to provide a measurable way of comparing the scalability of each scheduling algorithm.

3.2.3.2 Efficiency

Efficiency is a metric that derives from speedup. It typically falls within the boundaries of 0 to 100 (measured in percentages), revealing how well the available Virtual Machines are utilised. This metric will reveal how much time is spent synchronising the nodes. The formula that will be used for calculating efficiency within the scope of the project is:

\[ E = \frac{S_n}{VM_n} \times 100 \]

Formula 3.2 Efficiency

where E is efficiency, S_n is the speedup for n Virtual Machines and VM_n is the number of virtual machines [52][53].
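As a worked example with hypothetical numbers: if a job completes in 400 seconds on one virtual machine and in 120 seconds on four virtual machines, the two metrics evaluate to

\[ S_4 = \frac{400}{120} \approx 3.33, \qquad E = \frac{3.33}{4} \times 100 \approx 83\% \]

so roughly 17% of the available capacity is lost to synchronisation and other overheads.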

The closer E is to 100, the more efficient a system is. Ideally, E is equal to 100; however, due to various delays and overheads this is not achievable in practice. The efficiency will be calculated for each of the scenarios introduced in 3.1 in order to provide a measurable way of comparing the scalability of each scheduling algorithm.

3.3 Hypothesis

The focus of this project is to compare the performance of three scheduling algorithms. After the experimentation phase, three core research questions need to be answered. This subchapter will hypothesise the answer to each question, based on the background research and findings from other papers.

3.3.1 Which scheduler performs best?

The first research question that needs to be answered by the experimentation is which scheduler completes the job/jobs quickest. In order to answer this question, all the experiments will be executed under the same circumstances. Each of the experimentation scenarios will be executed three times, and the average execution time across the three runs, as well as the standard deviation, will be included in the report (see the sketch after this section). The two studies that are used as reference points, [51] and [56], show contradictory results when testing the schedulers. [56] suggests that Fair will complete the jobs 27% faster than FIFO and that Capacity will complete the jobs 31% faster than FIFO. However, [51] demonstrates that the Fair and Capacity schedulers should perform only around 7% quicker in terms of total running time. As the Capacity scheduler is designed for clusters larger than the ones used in the project, it is likely that it will not perform best, due to its functional overhead. As each queue is designed for a separate company, the allocation of capacity might cause unnecessary delays when a new job is submitted to the pool. For the experimentation, each of the users will submit the job/jobs to a common pool named hduser. The Capacity scheduler supports simultaneous execution; therefore, it is likely that it will outperform First-In-First-Out scheduling. However, each of the queues within the Capacity scheduler utilises FIFO rather than Fair; therefore, Capacity is expected to perform worse than the Fair scheduler. In contrast, FIFO lacks any prioritisation or sorting: the first job submitted to the queue is the first one to be executed when a slot becomes available. This might lead to delays, as the resources are not distributed equally when a new job is submitted. Finally, the Fair scheduler is designed so that each of the jobs submitted to the queue, on average, gets an equal share of the available resources. It is likely that the fair sharing of resources and the support of simultaneous execution will cause the Fair scheduler to outperform the other two schedulers. This will indicate the importance of fair allocation of resources.
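The three repetitions and the summary statistics can be automated from the shell. This is a sketch; the job command, jar and path names are illustrative, and each run writes to its own HDFS output directory:

    #!/bin/bash
    # Run one scenario three times; report mean and standard deviation of
    # the three measured runtimes (in seconds).
    for run in 1 2 3; do
        START=$(date +%s)
        hadoop jar wordcount.jar WordCount /user/hduser/input "/user/hduser/out-$run"
        echo $(( $(date +%s) - START ))
    done | awk '{ s += $1; ss += $1 * $1; n++ }
                END { m = s / n; printf "mean %.1f s, stddev %.1f s\n", m, sqrt(ss / n - m * m) }'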

3.3.2 Which scheduler scales best?

An important aspect of distributed systems is scalability. As Hadoop is designed for execution on distributed systems, scheduler scalability is critical. In order to evaluate how well each of the schedulers scales, experiments will be executed under a variety of scenarios, described in 3.1. The goal is to provide clear evidence of the scalability of the schedulers, which will be compared based on the measurements and metrics covered in 3.2. The Capacity scheduler will utilise a separate FIFO queue for each user, which adds more overhead to the execution of jobs compared to the default FIFO algorithm. The Fair scheduler assigns a fair portion of the resources to the jobs being executed; therefore, as the size of the cluster increases and more nodes become available, the time taken to distribute the resources will result in communication overhead. As for the FIFO scheduler, which appends jobs to the queue and executes them in order of arrival, it is reasonable to assume that the overhead it adds will not change as the size of the cluster increases. Therefore, as the size of the cluster increases, the effect of the queuing will decrease. Based on this analysis, it is expected that the FIFO scheduler will show the best speedup, as it is the scheduler with the least overhead.

3.3.3 Does the size of the data affect the results?

Dataset size scalability is another measurement that is crucial to the project. Each of the experimental scenarios described in 3.1 will be investigated. Due to the way FIFO functions, it is expected that larger jobs will cause delays in the system. The Capacity scheduler might also operate more slowly than Fair, as each of the queues within Capacity uses FIFO. Another aspect that has a bigger impact on performance when tasks have to access files from disk is the file system. As all the experiments will store the input and output on the Hadoop Distributed File System, this will not be an issue. Similar research [51] suggests that Fair scheduling will handle the increase of the dataset best, due to its concurrency support and low functional overhead compared to Capacity.

Conclusion

This chapter introduced the experiment environment, along with the parameters that will be varied to provide different testing scenarios. The measurements and metrics that will enable the researcher to carry out a quantitative and objective comparison between the scheduling algorithms were defined. As a final point, a hypothesis was made regarding the results of the experiments, based on the theoretical knowledge gained from documentation, as well as on experimentation done under similar conditions: [51] and [56]. The next chapter will familiarise the reader with the details of the implementation of the experiments. The first subsection will introduce the experimental environment, as well as the choice of data and the program code that will be used to benchmark each of the scheduling algorithms. The chapter will also discuss the reasoning behind each of the decisions made.

Chapter 4 Experiment Implementation

This chapter will focus on the specifics of the experiments and will provide detailed reasoning for each of the decisions made.

4.1 Experiment Environment

The experiments described in the experiment design chapter were executed on the Cloud testbed of the University of Leeds (refer to figure 4.1). It is a private Cloud consisting of 14 physical hosts, managed with the OpenNebula [75] Cloud Computing platform, which allows quick and easy Virtual Machine deployment. This provides a flexible and adjustable environment for the experiments.

Figure 4.1: School of Computing Cloud Testbed Architecture [66]
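For illustration, on an OpenNebula testbed a cluster of worker VMs can be instantiated repeatedly from a saved VM template using the command-line tools. The following is a sketch, assuming a hypothetical template named hadoop-node:

    #!/bin/bash
    # Instantiate ten VMs from a (hypothetical) Hadoop node template and
    # list them to confirm they reach the RUNNING state.
    for i in $(seq 1 10); do
        onetemplate instantiate hadoop-node --name "hadoop-vm-$i"
    done
    onevm list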