Learning Based Admission Control and Task Assignment in MapReduce


Learning Based Admission Control and Task Assignment in MapReduce

Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering

by

Jaideep Datta Dhok

Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad, INDIA

June 2010

Copyright © Jaideep Dhok, 2010. All Rights Reserved.

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Learning Based Admission Control and Task Assignment in MapReduce" by Jaideep Dhok, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Dr. Vasudeva Varma

To my parents

Acknowledgments

I would like to thank my advisor Dr. Vasudeva Varma for his continuous and inspiring guidance during the course of my thesis, without which this thesis would not have been possible. His constant support motivated me to pursue research in MapReduce. His insight into machine learning and its applications also proved to be of tremendous help. I also owe gratitude to Dr. Prasad Pingali, whose technical guidance during the initial phases of my thesis was invaluable in shaping it. He introduced me to MapReduce, and suggested looking into its internals for research problems. I also thank my parents for their continuous encouragement during the course of my masters. Their insistence on pursuing a research degree kept me motivated during the research work. Finally, I thank all of my friends, Akshat, Gururaj, Nihar, Nitesh, Mihir, Rahul, Sandeep and Sarat, for their companionship, which made my stay at IIIT-H enjoyable and a lot of fun. I also thank the ever so helpful M. Babji for his tremendous enthusiasm and zeal for helping lab students.

Abstract

The MapReduce paradigm has become a popular way of expressing distributed data processing problems that need to deal with large amounts of data. MapReduce is used by a number of organizations worldwide for diverse tasks such as application log processing, user behavior analysis, processing scientific data, web crawling and indexing. Many organizations utilize cloud computing services to acquire the resources needed to process MapReduce jobs on demand. As a result of the popularity of using MapReduce in the cloud, cloud services such as Amazon Elastic MapReduce have also become available. The Apache Hadoop framework is the leading open source implementation of the MapReduce model. It also provides a distributed file system (HDFS, the Hadoop Distributed File System). Hadoop is the most popular MapReduce framework, and is being used by more than 75 organizations worldwide. It is designed to be scalable and robust, and runs on anything from small single-node clusters to very large clusters containing several thousand compute and storage nodes. Despite the popularity and stability of Hadoop, it presents many opportunities for researching new resource management algorithms. Admission Control, Task Assignment and Scheduling, Data Local Execution, Speculative Execution and Replica Placement are some of the key challenges involved in resource management in Hadoop. Resource management in Hadoop is complicated by the fact that the resources available are dynamic in nature because of frequent node failures.

In this thesis, we approach two of the above problems: Admission Control and Task Assignment. Admission Control is the problem of deciding whether a new job submission should be accepted for execution on a MapReduce cluster.

An admission controller, the module that handles admission control, should make sure that jobs accepted into the cluster do not overload resources. Another important requirement is that the accepted jobs should maximize the utility of the service provider, which in this case is the Hadoop cluster owner. We propose an admission control algorithm that selects incoming jobs based on the principle of the expected utility hypothesis. Based on this principle, the admission controller always chooses jobs that are expected to provide maximum utility after their successful completion. To predict whether or not a job will be successful, we observe the effects of past decisions made under similar situations and use this knowledge to predict the outcome of incoming jobs. For this purpose we use a Naive Bayes classifier that labels incoming jobs as potentially successful or potentially unsuccessful. Out of the jobs that are labeled potentially successful, we choose the job that has the maximum value of the utility function at the time of job submission. If none of the incoming jobs are labeled as potentially successful, we do not admit any job into the cluster. Utility functions are supplied by users and express the value earned by users after successful completion of jobs as a function of the time taken to complete their requests. We consider a job submission to be successful if it does not overload resources on the cluster. Once the effects of the admission are observed, we update the classifier so that this experience is used when making the next decision.

The next problem that we approach is Task Assignment. Each MapReduce job is subdivided into a number of map and reduce tasks. The tasks of a job are executed concurrently. Task assignment is the problem of deciding which task should be allocated to a node in the MapReduce cluster. The task assignment algorithm should consider the current state of resources at the concerned node as well as the resource requirements of the tasks in the scheduler's queue. Given this information, the task assignment algorithm should choose a task that maximizes resource utilization on the concerned node while still making sure that the node is not overloaded. We use an approach similar to that of the admission controller for task assignment as well. When choosing a task, we first classify the queued tasks into two categories, good and bad. Good tasks do not overload resources at a node during their execution. From the tasks that are labeled good, we choose the task that has the maximum utility, which in this case is the priority of a task.

Again, if none of the tasks are classified as good, we do not assign any task to the concerned node. Task priority is set by the cluster administrator, and can be used for policy enforcement. We thus decouple task assignment from policy enforcement. Once a task is assigned, we observe its effects, and if it results in overloading of resources at the concerned node, we conclude that the task assignment decision was incorrect and update the classifier accordingly. This makes sure that the chances of assignments that result in overloading are reduced over time. Overload rules are specified by the system administrator. We evaluate our implementation (the Learning Scheduler) on a number of different MapReduce jobs. The jobs are chosen to mimic the behavior of real life use cases for which MapReduce is usually employed. Our results show that the scheduler is able to learn the impact of jobs on node utilization rather quickly, i.e. during the first few runs of a modest sized job. The scheduler is also able to achieve a user specified level of utilization on the cluster nodes.

Contents

1 Introduction
   Big Data and MapReduce
      Big Data: The Data Explosion
      MapReduce: Large Scale Data Processing
   Resource Management in MapReduce
   Problem definition and scope
      Similarity and interdependency between the two problems
      Admission Control
      Task Assignment
      Guiding principles
   Organization of the thesis

2 Context: Grid Resource Management
   Brief Overview of Grid resource management
   Related Work: Admission Control
      Utility Functions
      Existing Admission Control Algorithms
   Related Work: Task Assignment
      Independent Task Scheduling
      Computational Intelligence based approaches
      Learning based approaches
      Existing Hadoop Schedulers
         Native Hadoop Scheduler
         LATE Scheduler
         FAIR Scheduler
         Capacity Scheduler
         Dynamic Priority Scheduler
   Summary

3 Learning Based Admission Control
   Service Oriented MapReduce
      The Model
      Usage examples
   Utility Functions: Expressing User Expectations
   Learning Based Admission Control
      Recap: Hadoop Architecture
      The Algorithm
   Evaluation and Results
      Simulation Settings
      Algorithm Correctness
      Comparison with Baseline Approaches
      Meeting Deadlines
      Performance with Load Cap
      Job Response Times
      Job Arrival Rates
      Effect of Utility Functions
   Summary

4 Task Assignment in MapReduce
   LSCHED: Learning Scheduler for Hadoop
      Feature Variables
      Utility Functions
      Avoiding resource starvation
      Benefit of doubt
      Using a Naive Bayes Classifier
      Separability of feature vector space and classifier convergence
   Evaluation and Results
      Implementation Details
      Evaluation Cluster Details
      Workload Description
   Results
      Demonstrating Learning Behavior
      Maintaining Desired Utilization
      Comparing Learning Rates
      Comparison with Hadoop Native Scheduler
   Summary

5 Conclusions and Future Work
   Similarity in the two approaches
   Admission Control
      Future directions
   Task Assignment
      Future Directions
   Summary

Bibliography

List of Figures

1.1 Centralized architecture of a MapReduce system. Data storage and computation are co-located on worker nodes.
A typical MapReduce work flow.
CPU usage patterns of a MapReduce application (WordCount). Mean and variance of the resource usage distributions become recognizable characteristics of a particular MapReduce job.
Hadoop MapReduce Architecture.
The architecture of MapReduce as a Service. Our model is based on the Hadoop open source MapReduce framework.
Utility Functions for different values of decay parameters.
Architecture of MapReduce in Hadoop.
Admission Controller Simulation Parameters.
Achieved and expected load ratio.
Comparison of Achieved Load Averages.
Performance while meeting user deadlines.
Achieved Load Average with load cap.
Comparing mean job runtimes.
Comparing runtime distribution.
Effect of Job Arrival Rate (λ) on Job Acceptance.
Effect of Utility Function on Job Acceptance.
Task assignment using pattern classification. Evaluation of the last decision, and classification for the current decision, are done asynchronously.
Hadoop settings used in evaluation.
Prominent Use Cases for Hadoop (percentages are approximate).

4.4 Resource usage of evaluation jobs as estimated on a scale of 10; a value of 1 indicates minimum usage.
Learning behavior of the scheduler for the WordCount job.
Learning behavior of the scheduler for the WordCount job.
Achieved utilization for different user requirements.
Classifier accuracy for URLGet and CPUActivity.
Comparison of task assignment by the Learning Scheduler and Hadoop's native scheduler.

List of Tables

Chapter 1

Introduction

Cloud computing has emerged as one of the most interesting new paradigms in the computing community in recent times. The ability of cloud services to provide an apparently unlimited supply of computing power on demand has caught the attention of industry as well as academia. A 2008 survey by the Internet Data Consortium found that 4% of all enterprises had already implemented some form of cloud computing, and this share is expected to double by 2012 [28]. In the last year alone, more than a dozen academic conferences with cloud computing as an important area of interest were organized. Cloud computing enables users to use computing as a utility, similar to other basic utilities such as electricity, which are accessed only on demand and for which users are charged only for the quantities they consume. Users do not have to own computing infrastructure in order to use it. Computing infrastructure, platforms and services are provided to users as services. These services are on demand, meaning that they are available to users at any time and at any location. Cloud services usually offer users easy to use interfaces to control computing infrastructure, obviating the need for investment in trained personnel and expensive equipment for management.

1.1 Big Data and MapReduce

1.1.1 Big Data: The Data Explosion

The growth in cloud computing is partly also fueled by the explosion in data. Scientific experiments such as the Large Hadron Collider generate terabytes of data every day [14] (the LHC experiments are speculated to generate about 27 TB of data per day). Scientists need to uncover results from this deluge of data. This explosion of data is being experienced by every sector of the computing industry today. Big Internet companies such as Google, Amazon, Yahoo! and Facebook have to deal with huge amounts of user generated data in the form of blog posts, photographs, status messages, and audio/video files. However, there is also a large quantity of data that is indirectly generated by web sites in the form of access log files, click-through events and the like. Analysis of this data can uncover useful patterns about user behavior. Most of this data is generated frequently, and the data sets are stored temporarily for a fixed period and then discarded after they have been processed. According to Google, there are about 281 Exabytes of data online on the web today, up from just 5 Exabytes a few years earlier. There has been a fifteen-fold increase in user generated content since 2005 [13]. People are sharing more and more information online, including details of their personal lives and their opinions on people and products. Companies are trying hard to make use of this data in order to infer user preferences, to generate better recommendations (for example, the Amazon and IMDB recommendation systems), and simply to get to know their users better. However, analyzing this flood of data has become one of the greatest challenges of recent times, with traditional approaches, including relational databases and data warehouses, failing to match the scale of the data.

1.1.2 MapReduce: Large Scale Data Processing

New tools and frameworks have been developed to process data on distributed data centers. MapReduce [38], the most prominent among such paradigms, has garnered much deserved community attention. It was initially designed by Google to process web scale data on failure prone commodity hardware.

MapReduce allowed programmers to focus on application logic and handled messy details such as failures, application deployment, task duplication and aggregation of results automatically. The model proved successful within Google, and today it has become the de facto method for modeling large scale data processing problems, with Hadoop [3], an open source implementation of the original model, being the most popular framework for developing MapReduce applications. MapReduce is employed in a range of problems, including web crawling and indexing, analysis of bioinformatics data, log processing, and image processing.

The two most important components of Hadoop are MapReduce, which deals with the computational operation to be applied to data, and the Hadoop Distributed File System (HDFS), which deals with reliable storage of the data. HDFS has been designed based on the Google File System [49], developed at Google for similar purposes. The MapReduce component is responsible for application execution, guaranteeing execution in case of machine failures, and resource management. Hadoop is designed to be scalable, and can run on small as well as very large installations [12]. Several programming frameworks, including Pig Latin [64], Hive [4] and Jaql [11], allow users to write applications in high level languages (loosely based on SQL syntax) which compile into MapReduce jobs, which are then executed on a Hadoop cluster.

Hadoop is increasingly being used for data processing in the cloud. As listed on the Hadoop PoweredBy page [10], it is frequently deployed on the Amazon EC2 [1] cloud for cost effective data processing. This has resulted in Amazon offering a special Elastic MapReduce service [2] which allows users to launch Hadoop clusters on demand. The popularity of Hadoop in the cloud is also demonstrated by the fact that several Linux distributions tuned especially for running Hadoop on public cloud offerings are available [7]. The scale of MapReduce applications and the diversity of use cases to which it has been applied make resource management in MapReduce a very interesting area of research.

1.2 Resource Management in MapReduce

To understand the essential problems in resource management in MapReduce, it is important to first understand the MapReduce model.

In MapReduce, a data processing problem is solved by expressing the solution in the form of two functions, map and reduce. The idea is borrowed from the higher order functions with similar names present in most functional programming languages such as Lisp and ML. The map function in MapReduce takes as input a set of key-value pairs [(k, v)] and a function f that performs a computation on a key-value pair. The function f operates on each of the pairs in the input and outputs a different set of key-value pairs. The output of the map function is then passed to the reduce function as input. The reduce function then applies an aggregate function r on its input, and stores its output to disk. The output of reduce is also in the form of key-value pairs. At the end of reduce, the output is sorted according to the values of the keys, and the function for comparing keys is usually supplied by the user.

Figure 1.1 Centralized architecture of a MapReduce system. Data storage and computation are co-located on worker nodes.

During the execution of a MapReduce job, the input is first divided into a set of input splits. The system then applies map functions to each of the splits in parallel. The system spawns one task for each input split, and the output of the task is stored on disk for transfer to the reduce tasks. The system starts reduce tasks once all the map tasks have been successfully completed. Task and node failures are dealt with by relaunching tasks. Data given as input to the tasks, and generated as output of the tasks, is stored in a distributed file system (HDFS, for instance) to make sure that the output of a task survives failures.

MapReduce is good at solving data parallel batch processing problems, and it has been optimized for these use cases. One of the essential criteria during its design has been to improve the rate at which data is processed, i.e. to maximize I/O throughput. It works best for I/O intensive applications.
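To make the model concrete, the following is a minimal word count sketch in plain Python. The names map_fn, reduce_fn and run_job are ours, and the in-memory grouping step only mimics the framework's shuffle and sort; this is an illustration of the programming model, not Hadoop's API.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document text; emit one (word, 1) pair per word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word; emit the total.
    yield (key, sum(values))

def run_job(inputs):
    # Stand-in for the framework: apply map to every input pair, group the
    # intermediate pairs by key (the "shuffle"), then apply reduce per key.
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    output = []
    for ik in sorted(groups):          # final output sorted by key
        output.extend(reduce_fn(ik, groups[ik]))
    return output

if __name__ == "__main__":
    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
    print(run_job(docs))   # e.g. [('and', 1), ('brown', 1), ..., ('the', 3)]
```

In a real deployment the grouping and sorting are performed by the framework across many machines, while the programmer supplies only the two functions.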

Figure 1.2 A typical MapReduce work flow.

The above paragraphs very briefly summarize the MapReduce model. For detailed information, we encourage the reader to see the original paper describing MapReduce [38]. The MapReduce system as implemented in Hadoop is discussed in further detail in Chapter 2.

Effective resource management is essential in MapReduce to get the most value out of available resources. A resource in our discussion means a computational resource, such as CPU, memory, disk space or network bandwidth. We identify the following important problems in resource management in MapReduce:

Job Scheduling: Job scheduling is the problem of deciding the sequence in which a set of jobs is to be executed on a MapReduce cluster. The order of jobs could be driven by user specific criteria, which usually include minimizing the response time of a job. Response time is the time from the submission of a job to its completion. Scheduling could also be affected by the system owner's goals; usually these include optimum resource utilization and servicing as many users simultaneously as possible.

Task Assignment: Whereas job scheduling deals with high level ordering of jobs, task assignment is about choosing which task should be assigned to a given worker node. The task assignment should improve utilization of resources on the worker node, and at the same time should prevent overload. Task assignment depends on the current state of resources on the worker node, as well as the predicted outcome of assigning a task to the concerned node.

Admission Control: To prevent overload on a MapReduce cluster, and to meet the QoS requirements of already running jobs, jobs submitted to a MapReduce cluster must be selectively accepted. Whether and which jobs are accepted for execution depends on the goals of the system owner. Some of these goals include preventing overload of computational resources on the MapReduce cluster, and maximizing the value earned after successful execution of a job in cases where users are charged for using the MapReduce cluster.

Speculative Execution: Speculative execution in the case of MapReduce is about improving the response time of a job by relaunching slow progressing tasks on nodes with better resource availability. In large and heterogeneous clusters speculative execution can significantly improve job response time [38]. There are two important problems involved in speculative execution: first, deciding which of the currently executing tasks are progressing slowly relative to other tasks, and second, deciding the node on which a slow task must be relaunched.

Data Local Execution: Data local execution, i.e. the idea of executing tasks as near as possible to their input data, is embraced heavily in MapReduce. The main reason is that it is much cheaper to transmit code to input data than to bring input data to code. Further, it reduces the latency issues which may arise if input data needs to be transferred across the network, thereby significantly reducing response time [67]. The important problems here are to discover the underlying network topology, and to choose task and replica placement so as to maximize data locality.

All of the above problems are complicated by the fact that a MapReduce cluster usually comprises commodity off-the-shelf hardware. Given the size of the clusters and the quality of the hardware, failures are frequent. Thus, any resource management component must take into consideration dynamism in the availability of resources, as well as in the demands of the applications. Catering to user requirements such as deadlines and guaranteed execution should also be considered while designing a solution for the above problems. In Chapter 2, we discuss in detail the approaches used by Hadoop and some recent research done to address these problems.

1.3 Problem definition and scope

In this thesis, we tackle two of the problems described in the previous section: Task Assignment and Admission Control. Both of these issues need to be solved in conjunction, as the scheduling policy has a significant impact on the decision making involved in admission control and vice versa. Both of these problems are also similar: in admission control one must decide which job to accept from a set of candidate jobs, and in task assignment one must choose a task to assign to a given node from a set of candidate tasks.

1.3.1 Similarity and interdependency between the two problems

On the surface, admission control and task assignment may appear to be different problems; however, they are very closely related, and we explain how in this section. The main goal of the admission control module in a resource management system is to guard the system against overloading by selectively accepting incoming jobs. The role of the scheduler, or the task allocator, on the other hand, is to make sure that the system resources are efficiently utilized. If the admission controller accepts too many jobs, the size of the scheduler's queue increases, thereby increasing the average response time of jobs, since the resources in the cluster are being multiplexed between jobs. On the other hand, if a scheduler does not utilize resources efficiently, the number of jobs waiting for resources still increases, giving a false impression to the admission controller that the system is overloaded. Both of these scenarios are undesirable, as they lead to suboptimal system utilization. Hence, it is important to tackle both problems together, and the approach we take tries to solve both of them using the same algorithm.

If we look closely, both the admission controller and the scheduler have to choose a particular decision from a set of plausible decisions. The choice made should maximize an objective function. This is the basic problem studied in decision theory, where an agent has to make a choice among a range of alternatives. Each of the alternatives has a payback (reward) and a chance (probability of success) associated with it.

Based on the principle of the Expected Utility Hypothesis, the agent chooses the outcome that maximizes the expected reward, i.e. the product of the reward and the probability of achieving that reward. The agent must consider a number of factors that may affect the outcome of the decision. This usually involves past knowledge, and external information about the events.

We implement a decision network using a simple binary classifier that classifies the plausible decisions into two sets: good and bad. Only the decisions labelled good are considered for further evaluation; the rest are discarded. Among the good decisions, the one that maximizes the expected reward, i.e. the one that maximizes the product of the objective function (called the utility function from here on) and the classifier score, is chosen.

In solving both problems we have used a Naive Bayesian Classifier. Ideally, one should use a Bayesian Network to correctly model the interdependencies between the factors that influence the decision. However, constructing a perfect Bayesian Network is a hard problem, and may require considerable expertise. Solving a generic Bayesian Network is also a hard problem. A Naive Bayesian Classifier sidesteps both of these problems by simply assuming that all factors are independent of each other. The classifier used in our systems is trained by an external evaluation engine which uses simple rules to evaluate the outcome of decisions, once the effects of a decision have been observed. Validating decisions after their effects are known is easy, and can be accomplished using simple rules. The results of the evaluation engine are fed back into the classifier, and the classifier learns from this data. Our system thus forms a closed control loop [6], which is an integral part of a distributed autonomic system. Next, we explain both of the problems in detail, and summarize how a classifier is used in each of them to achieve the desired goal. Both of the systems are covered in much more detail in the later chapters.
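The selection rule described above can be summarized by the short sketch below. The candidate objects, the classifier's prob_good method and the utility callback are hypothetical interfaces used only for illustration, not the implementation presented later in the thesis.

```python
def choose(candidates, classifier, utility, now):
    """Return the candidate (job or task) with the highest expected utility
    among those the classifier labels 'good', or None if none qualify."""
    best, best_score = None, float("-inf")
    for c in candidates:
        p_good = classifier.prob_good(c.features)   # estimated probability of success
        if p_good < 0.5:                            # labelled 'bad': discard outright
            continue
        score = p_good * utility(c, now)            # expected reward of this choice
        if score > best_score:
            best, best_score = c, score
    return best   # None means: admit no job / leave the node idle for now
```

The same routine serves both roles: for admission control the candidates are submitted jobs, while for task assignment they are the queued tasks and the utility is the task priority.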

1.3.2 Admission Control

Admission Control is the first problem that we attempt to solve. As discussed earlier, admission control is about deciding if and which jobs should be admitted for execution on a MapReduce cluster. An admission control algorithm receives as input a set of candidate jobs and is supposed to output the set of jobs which will be accepted for execution on the cluster. In choosing the set of jobs to admit, the algorithm must consider the utility that the cluster's owner will earn after successful completion of the jobs. The algorithm must also ensure that the newly admitted jobs do not overload resources in the cluster, and thus do not adversely affect already running jobs.

The utility earned after completing a job is usually specified by a utility function, which is provided by the user who is trying to submit the job. A utility function expresses the utility gained by the submitter of the job as a function of the time taken to complete the submitter's request. In the cloud computing scenario the utility is typically the amount of money the user is willing to pay after successful completion of his/her request. This also introduces a new problem of price setting, where users have to speculate on the correct value of the service, which in this case is a share of the resources in a MapReduce cluster. In this thesis we do not focus on price setting mechanisms; they are surveyed in more detail in other works [26, 27].

MapReduce is a practical platform for developing distributed applications, and thus cloud computing offerings such as Amazon Elastic MapReduce fall under the Platform as a Service (PaaS) paradigm. Although PaaS has its own advantages, MapReduce offered in the Software as a Service (SaaS) paradigm can prove useful to users as well as service providers. Users can reuse MapReduce components developed by others, and service providers can expose MapReduce jobs as pay-per-use services. Service providers rent computational resources from an infrastructure provider, and allow users to run ready to use services. An effective admission control mechanism is necessary in this setting in order to maximize utility from the perspective of service providers, and to ensure the quality of the services for users [26, 28]. Admission control has been essential in preventing overload of computational resources, thereby maintaining a guaranteed level of service.

The main contributions that we put forward related to admission control in MapReduce are:

- A method for modeling MapReduce jobs as ready to use services, thus effectively bringing MapReduce into the Software as a Service paradigm.

- An extension of the utility models proposed in related work in order to adapt them to MapReduce; and

- An admission control algorithm that uses a machine learning based approach for predicting job admission. The algorithm trains itself according to policy rules set by the service provider.

We assume that the service provider, or cluster owner, has complete knowledge about the types of jobs being executed on his/her cluster. Further, it is assumed that any new job submission request is a request to execute one of the jobs already known to the service provider. This is in contrast to other works which discuss arbitrary job execution on a rented cluster. Although arbitrary job execution is possible in theory, implementing it in practice is difficult given the uncertainty involved in the demands of a program, and other practical issues such as a priori resource provisioning and security requirements. The admission control algorithm we propose is explained in more detail in Chapter 3.

1.3.3 Task Assignment

Task assignment in MapReduce is an interesting problem, because efficient task assignment can significantly reduce runtime or improve hardware utilization, and both of these improvements can reduce costs. Recent works on resource management in MapReduce [79] have focused on improving the performance of Hadoop with respect to the user of the application. The schedulers developed so far for Hadoop implement different policies and focus on fair division of resources among users of a Hadoop cluster. However, they do not address the inflexibility inherent in Hadoop's task assignment, which can result in overloading or underutilization of resources.

Task assignment in Hadoop is worker driven. Each worker node sends a periodic message to the master describing the current state of its resources. The master node is supposed to choose a task from the tasks in the scheduler's queue and assign it to the worker for execution. The task assignment process has to consider the current state of resources on the worker, the predicted state of resources in the future, and the demands of the queued tasks while making an assignment decision. A task assignment algorithm should make sure that resources at the worker node in question are not overloaded.

Many organizations schedule periodic Hadoop jobs to pre-process raw information in the form of application logs, session details, and user activity in order to extract meaningful information from it [10]. The repetitive nature of these applications provides an interesting opportunity to use performance data from past runs of an application and integrate that data into resource management algorithms. In this thesis, we present a scheduler for Hadoop that is able to maintain a user specified level of utilization when presented with a workload of applications with diverse requirements. It thus allows the cluster administrator to focus on high level objectives such as maintaining a desired level of utilization. The scheduler frees the administrator from the responsibility of knowing the resource requirements of submitted jobs, although we still allow users to provide hints to the scheduler if information about the resource requirements of the jobs is indeed available. The scheduler learns the impact of different applications on system utilization rather quickly. The algorithm also allows the service provider to enforce various policies such as fairness, deadline or budget based allocations. Users can plug in their own policies in order to prioritize jobs. Thus, the algorithm decouples policy enforcement (scheduling priorities) from task assignment.

Our scheduler uses automatically supervised pattern classifiers to learn the impact of different MapReduce applications on system utilization. We use a classifier to predict the effect of queued tasks on node utilization. The classifier makes use of dynamic and static properties of the computational resources and labels each of the candidate tasks as good or bad. We then pick the task with maximum utility from the tasks that have been labeled good by the classifier. The utility of the tasks is provided by an administrator specified utility function. We record every decision thus made by the scheduler. A supervision engine judges the decisions made by the scheduler in retrospect, and validates the decisions after observing their effects on the state of computational resources in the cluster. The validated decisions are used to update the classifier, so that the experience gained from decisions validated so far can be used when making future task assignments. The scheduler design and implementation are covered in more detail in Chapter 4.
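A rough sketch of this supervision loop follows. The decision, monitoring and overload-rule objects are hypothetical stand-ins for the actual scheduler components, and the particular thresholds are assumptions rather than the rules used in the evaluation.

```python
def evaluate_decisions(pending_decisions, observed_usage, overload_rule, classifier):
    """Label each past assignment decision once its effect on the node has been
    observed, and feed the label back into the incrementally trained classifier."""
    for decision in pending_decisions:
        usage = observed_usage(decision.node)              # utilization after the task ran
        label = "bad" if overload_rule(usage) else "good"  # validate the earlier choice
        classifier.update(decision.features, label)        # reinforce or correct the model

# One possible administrator-supplied overload rule (purely illustrative):
# a node is overloaded if its load average exceeds twice its core count
# or more than 90% of its memory is in use.
def overload_rule(usage):
    return usage.load_avg > 2 * usage.cores or usage.mem_used_fraction > 0.9
```

Because the evaluation happens after the fact, it can use simple threshold rules even though predicting the same outcome ahead of time requires the learned classifier.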

1.3.4 Guiding principles

We followed these guiding principles while attempting to solve task assignment and admission control:

Adaptability - The designed algorithms must be adaptable to changes in the amounts of computational resources. They should also adapt to the dynamic changes in the states of the nodes constituting the grid/cluster.

Scalability - The designed algorithm must scale with the number of nodes in a MapReduce cluster. The algorithm must also scale with the number of users of the system.

Meeting user requirements - The algorithms should try to meet users' as well as system administrators' QoS requirements. For users this could mean prioritizing their jobs, whereas for administrators it could mean maximizing resource utilization.

Preventing overload - Overloading of resources should be avoided. Overloading can result in device failures. It can also reduce the job processing rate if the finite amount of available resources is shared among an increasingly large number of jobs. Finally, overload also results in more energy consumption, thereby increasing the operational cost for the system owner.

As discussed earlier in this section, we use utility functions to allow users and system administrators to control the algorithms according to their needs. For preventing overload and maximizing resource utilization, we use a learning based approach.

1.4 Organization of the thesis

The rest of the thesis is organized as follows. Chapter 2 begins by providing an overview of resource management in the context of grid computing systems. We then discuss the topics related to resource management in MapReduce in detail, and compare our admission control algorithm to existing algorithms applied in the context of utility computing.

The last part of that chapter discusses the available schedulers for Hadoop, and differentiates our scheduler from the existing alternatives.

Chapter 3 presents the Admission Control algorithm. We first propose a model for providing MapReduce jobs as services. We then explain the proposed algorithm for admission control, and conclude by presenting the performance of the algorithm under a number of scenarios.

Chapter 4 discusses the task assignment algorithm and the learning scheduler for Hadoop. First, we propose the learning algorithm that uses pattern classification for task assignment. We then present the scheduler, briefly discussing the implementation issues involved. We then evaluate the scheduler on a number of real life workloads and demonstrate that the scheduler behaves as expected.

Chapter 5 is the last chapter; it summarizes our work and establishes the key lessons learned from it. We touch upon future directions for research, and conclude this thesis.

Chapter 2

Context: Grid Resource Management

This chapter begins by giving an overview of grid resource management, and explains the admission control and task assignment problems in detail by discussing the related research done in both areas. After discussing prior work on generic task assignment for a computational cluster, we move on to Hadoop specific approaches, and discuss the available Hadoop schedulers and their task assignment policies in detail.

2.1 Brief Overview of Grid resource management

Many of the ideas embraced by the cloud computing community have their roots in grid computing research. Ideas for providing computing as a commodity, on demand provisioning of resources and services, as well as composing distributed applications from web based services have all been proposed in the grid community. Thus, in order to understand our work and differentiate it from proposed algorithms, it is necessary to understand grid resource management as well. In this section, we give a brief overview of grid resource management, and then focus on related research in admission control and task assignment.

Although various definitions of the Grid exist, we adopt the original definition proposed by Foster and Kesselman [44]. According to them, a grid is a very large scale distributed system formed of a network of heterogeneous computers spread across multiple organizations and administrative domains. A grid is usually very large and can comprise several thousand nodes.

Cloud computing could be considered a special case of grid computing. The focus of grids has been on collaboration and interoperability to address computational needs beyond the capacity of any single organization. In contrast, the main drive behind cloud computing has been the economic benefits experienced by both providers and consumers of computational resources.

Grids can be broadly classified as computational grids, data grids or service grids. It is very common for data and computation to be co-located, and in the case of MapReduce it is an essential requirement. A computational grid provides a large sum of computational resources aggregated from several individual nodes. The total capacity available is larger than that of any individual node in the system. The computational grid could carry out computation by distributing it across millions of machines, for example the SETI@HOME [18, 19] and Folding@Home [58] projects; or it could combine several high performance clusters or supercomputers for processing, e.g. the TeraGrid Project [29]. Problems that benefit from such a grid include grand challenges such as weather modeling, nuclear simulations and scientific experiments involved in high energy physics.

A data grid provides safe and efficient access to vast amounts of data. Although a computational grid also offers data storage, the distinguishing factors in the case of a data grid are the facilities for data access, which include file catalog management, data retrieval services, high speed data movement, and services that enable data mining and data warehousing. Finally, the service grid offers high level services which in turn utilize data or computational grids. Examples include collaborative services that allow users to form virtual work spaces by connecting users and applications in interactive environments, services that acquire resources on demand on behalf of users, and so on. Service grids also offer mechanisms for negotiating and enforcing QoS, resource brokering, resource reservation and resource monitoring.

A resource in a grid is any reusable entity that is employed to complete a user's request for computation. A resource provider is an entity that controls the resources, and a resource consumer is the entity that uses them. A resource management system is thus a system that manages resources pooled from a number of resource providers and allocates them to complete requests initiated by resource consumers.

A resource management system, or RMS, in a grid deals with managing the pool of resources available in the grid, i.e. it schedules the available processors, network bandwidth and disk storage. The RMS must be adaptable to the fluctuating amount of resources in the grid. It must be scalable as well as robust. Supporting quality of service (QoS) and meeting computational cost constraints are also some of the issues that need to be addressed by a resource management system. Further, the RMS must also consider fault tolerance and stability, as resources in the grid could become unavailable at any time [15]. The grid could be spread across different autonomous organizational domains. The policies adopted by each of the domains could be different, which makes resource management in grids more challenging than in other conventional distributed computing systems under the control of a single organization. Further, as the organizations could be geographically distributed across countries, local legal constraints further complicate the issue. Many of these issues are also prevalent in cloud computing.

In the abstract grid RMS architecture proposed in [15], we identify the following components that could be important in a MapReduce system.

Grid resource broker - A resource broker provides mechanisms for establishing the QoS desired by users. This includes establishing the service contract and determining the price of services and computational resources. The job of the broker is to select resources and services that best match the requirements specified by users and to delegate user requests to the chosen service providers [52].

Admission Controller - Within the realm of a single service provider, the admission controller checks whether incoming service requests should be given access to resources under the control of the service provider. Incoming requests might not be accepted if resources are already overloaded or if the service provider is already over-committed. Further, the admission controller must make sure that the requests admitted maximize the utility of the service provider [54].

Global Scheduler - The global scheduler manages resource allocation at an aggregate level. If the grid consists of multiple sites, the global scheduler might decide which site should be allocated for a given set of jobs in its queue.

The global scheduler does not have access to information about the resources of individual nodes in the local sites. It relies on the interfaces exposed by each of the local sites, which provide aggregate resource availability for that site, and bases its decisions on this aggregate information [74, 73, 75].

Local Scheduler - A local scheduler manages resources in a local cluster within a single administrative domain, for example a local university cluster. The local scheduler has detailed information about the state of resources on each of the individual nodes in the cluster. It assigns tasks to the individual nodes. Some of the requirements of a local scheduler include maximizing resource utilization, load balancing and enforcing local policy for job priorities. Examples include the Sun Grid Engine [48], Condor-G [47], PBS and Maui [24].

Grid Information Service (GIS) - A grid information service provides a directory for lookup of grid services and resources. Service providers publish the availability of services to a grid information service. Grid brokers access a GIS to identify potential services for delegating requests on behalf of users [35].

Grid Monitoring Service - A grid monitoring service maintains information about the state of resources in a grid or a local cluster. Monitoring services also have features for aggregating resource information, notification in case of a change of resource state and, optionally, prediction of the availability of resources. Information made available by a monitoring service is used by schedulers and admission controllers. Examples of monitoring systems are Ganglia [68] and the Network Weather Service [77].

As mentioned in Chapter 1, in this thesis we attempt to solve issues in admission control and local task scheduling for MapReduce clusters. In the next sections we discuss the related work on each of these issues in detail.

2.2 Related Work: Admission Control

As discussed in Chapter 1, admission control is about selectively admitting jobs for execution on a cluster.

Admission control has been researched well in the utility computing community, where the goal is to provide computation as a utility, and it forms an integral part of such systems. Most grid systems have found admission control useful in preventing overload of resources, and in maximizing utility for service providers and users alike.

2.2.1 Utility Functions

Utility functions are instruments used by the users of a system to express their expectations from a service. A utility function gives the value of the utility earned by the users of a system or service as a function of time. Typically the time argument passed to the utility function is the time taken to complete the user's request. Naturally, users want their requests to be serviced as soon as possible, meaning that the utility earned by users is high if their requests are served by their expected deadline and deteriorates afterwards. A utility function that decays with time has been proposed in earlier works: Risk Reward [50], Aggregate Utility [22], and Millennium [32]. They use a three stage utility model, where the utility remains constant until a deadline, and then degrades linearly until a second deadline. After the second deadline is crossed, it is assumed that the user is no longer interested in the outcome of his/her request, and hence will not pay for its completion. The service provider may be penalized if it fails to meet its deadline guarantees in such cases. A linear decay rate (the slope of the utility curve) is specified by the user to express dissatisfaction with the passage of time.

We argue that a linear decay in the utility rate assumes that the user's expectations also decrease linearly. On the contrary, the rate of decay in utility (du(t)/dt) actually increases with time, i.e. users become more and more disappointed with each time unit that passes after their expected deadline. This necessitates non-linear and more generic utility functions that allow users to accurately express their expectations from the offered services. The utility functions we use are described in detail in Chapter 3.
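As an illustration of such a non-linear form, the sketch below keeps the three stages (full value, decay, zero) but makes the decay rate grow with time. The exact family of functions used in this work is defined in Chapter 3, so the functional form and parameters here are only an assumption.

```python
def utility(t, u_max, d1, d2, alpha=2.0):
    """Three-stage utility: full value u_max up to the soft deadline d1, a
    non-linear decay between d1 and d2 whose rate grows as time passes, and
    zero value after the hard deadline d2 (illustrative form only)."""
    if t <= d1:
        return u_max
    if t >= d2:
        return 0.0
    frac = (t - d1) / (d2 - d1)            # fraction of the decay window consumed
    return u_max * (1.0 - frac ** alpha)   # |dU/dt| increases with t when alpha > 1
```

With alpha = 1 this reduces to the linear three-stage model of the earlier works; larger values of alpha express a user whose dissatisfaction accelerates as the hard deadline approaches.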

2.2.2 Existing Admission Control Algorithms

In the existing literature, admission control has usually been proposed as a submodule within the scheduler. In our approach we treat scheduling and admission control independently; furthermore, the admission controller does not demand a specific type of scheduler within the cluster, and vice versa. Despite the need for an effective admission control algorithm, it has received very little attention on its own. We discuss the important algorithms relevant to our approach in this section. Other admission control algorithms mostly deal with media services and telecommunication networks, and their approaches are not directly applicable in cloud computing.

Millennium

Chun and Culler [31] presented one of the earliest works on admission control in the utility computing space. They discussed admission control algorithms for fixed size clusters and jobs with fixed shapes. The roles of service provider and resource provider are expected to be played by the same entity. Their algorithm calculates the value of yield/RPT, where yield is the utility earned after completing a job and RPT denotes the remaining processing time of the job. Thus, they select jobs that maximize the utility earned per unit time. Jobs can be preempted to give way to more promising requests. The novelty in this work came from evaluating the system based on a user centric metric (utility earned) rather than system centric metrics such as system utilization. An important contribution of their study is to demonstrate the ability of market based approaches to deliver significantly more user satisfaction than the conventional approaches of the time.

Risk Reward

Work by Irwin, Grit and Chase [50] discusses scheduling of tasks in a market based task service where users submit task bids. Users are expected to submit the task, the expected time to complete the task, and a value function (the same as the utility functions described earlier) to the service provider. A three phase value function with a linear decay rate is used. The service provider then runs acceptance heuristics to decide whether or not to accept a task. Once a task is accepted, tasks in the queue are scheduled using a FirstReward heuristic. The acceptance heuristic calculates a slack value for each task and rejects tasks whose slack value falls below a user defined threshold.

Slack is the delay that can be tolerated for a given task before its expected utility falls below an acceptable level. Tasks with higher slack values are preferred, as they leave more opportunity to accept profitable tasks, i.e. tasks with higher value functions, in the future. Experiments performed by the authors suggest that admission control is critical in preventing overload, and also increases the yield (utility) earned by the service provider by reducing the number of tasks that end up resulting in penalties.

The authors expect users to submit the expected runtime of the task at the time of requesting the task service. Users may not always have an idea of what deadlines to expect from the service: first, because detailed information about the service provider's infrastructure may not be available (sometimes even to the service provider, especially in cloud computing), and second, because performance information for tasks might not be available if the tasks are assembled from task repositories on demand. Additionally, users might tend to overestimate the performance capacity of the service and demand impossible deadlines.

Positive Opportunity

Positive Opportunity [65] takes an exhaustive approach: it computes all possible schedules of the new job request together with the existing jobs, as well as the schedules of just the existing jobs. It selects the schedule that results in the most reward (utility) after completing all the tasks. If the new job is in the schedule with the maximum potential yield, it is accepted; otherwise it is rejected. In other words, a job is accepted only if it does not decrease the profit that is expected from executing the current set of jobs. Once a job is accepted, it is run to completion, as are all jobs, and scheduled according to an independent scheduling policy.
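The acceptance test can be summarized by the following sketch, where best_schedule_yield is an assumed helper that enumerates candidate schedules and returns the maximum total expected reward; it is not part of the cited work's published interface.

```python
def accept_positive_opportunity(new_job, accepted_jobs, best_schedule_yield):
    """Admit new_job only if the best schedule that includes it earns at least
    as much total utility as the best schedule of the existing jobs alone."""
    yield_without = best_schedule_yield(accepted_jobs)
    yield_with = best_schedule_yield(accepted_jobs + [new_job])
    return yield_with >= yield_without
```

The cost of the approach lies in the exhaustive schedule enumeration hidden inside such a helper, which grows quickly with the length of the queue.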

Aggregate Utility Functions

Young et al. [22] present aggregate utility functions that allow users to control the behavior of the admission controller service. The algorithm they present can capture the utility of an individual job as well as the aggregate utility of a batch of jobs. Two sets of utility functions are used: one for each individual job and another for the entire work flow of jobs. The authors approach the problem of admission control for arbitrary jobs in a computational cluster which the service provider is renting from a third party infrastructure provider. The admission control algorithm can make decisions for each individual job as well as for an entire contract covering a batch of jobs. The scheduling algorithm uses a FirstReward [65] heuristic and selects the job that is expected to provide the least declining contribution to profit-rate. The cluster executes only one job at a time. Profit-rate is the profit earned per unit time. For accepting contracts, a strategy is chosen from the following five:

1. Oblivious - Always accept contracts.

2. Average - Accept a contract if the sum of the average load during the contract and the current load average of the cluster is less than what can be sustained by the current resource availability.

3. Conservative - Instead of the average load, use the load that is 2 standard deviations above the mean.

4. Preempt-Conservative - Conservative, with the addition of preempting existing contracts if the current contracts would result in losses after accepting a new contract.

5. High Value Only - Accept a contract only if it offers a higher utility rate than a certain threshold set by the administrator.

Admission Control in Hadoop

The existing schedulers for Hadoop, FAIR [80], Capacity [5], and Dynamic Priority [8], offer very limited admission control facilities. For example, the FAIR scheduler has a feature to suspend jobs until sufficient free resources are available. Existing schedulers focus on implementing resource sharing policies within a MapReduce cluster owned internally by an organization. The dynamic priority scheduler uses market based approaches to control the resources shared by users in a cluster; however, it also does not address the problem of admission control. Our approach borrows from the previous work on admission control in utility computing and provides a model better suited to cloud computing, in a scenario where MapReduce is offered as a service.

In the current Hadoop implementation, admission control is left to the scheduler, and there is no independent module for it. A drawback of this approach is that the job data structures are already allocated in the master node's memory by the time the scheduler takes a decision. Further, the job construction phase is expensive, and if the master node decides to reject a submission, all the work done during job setup is wasted. It is necessary to do admission control before accepting a job. The job submission protocol in Hadoop could be extended to include a negotiation phase in which the admission controller participates.

2.3 Related Work: Task Assignment

Having surveyed the existing literature on admission control, we now turn to task assignment in compute clusters. Multiprocessor task assignment is known to be an NP-complete problem [33, 34]. As a result, task assignment in grids and distributed systems remains an overwhelmingly attempted yet unsolved problem. A vast number of approaches have been tried, and the application of heuristic based algorithms has been popular in this field. As every distributed system presents a new opportunity to try out domain specific heuristics, task assignment is still an interesting problem. However, some generic algorithms have been proposed to tackle task assignment for the typical cluster-of-computers scenario. Out of these we only consider heuristics used in independent task scheduling, as that is the case with the maximum overlap with task assignment in MapReduce. For a detailed survey of dependent task scheduling and grid scheduling in general, the reader is encouraged to read the excellent survey articles by Braun et al. [25], Buyya et al. [15] and Dong and Akl [40].

2.3.1 Independent Task Scheduling

In MapReduce, tasks of the same type (map or reduce) are assumed to be independent of each other. In other words, the output of a map task of a job does not depend on the output of any other map task of the same job. The same goes for all the reduce tasks of the job.

Also, at the time of task assignment, tasks of disparate jobs in the scheduler's queue are assumed to be independent. Task dependency goes against the philosophy of data parallel computation in MapReduce.

A typical strategy in independent task assignment is to allocate tasks considering the load on resources in order to achieve high system utilization. The heuristic based algorithms used for independent task scheduling can be broadly classified into two categories:

1. Heuristics with performance estimates - The heuristic used for task assignment assumes that basic performance information, such as mean CPU usage and the expected time to complete the task, is available.

2. Heuristics without performance estimates - Algorithms that do not use performance estimates.

Algorithms with performance estimates

OLB, or Opportunistic Load Balancing, assigns a task to the next available machine. While allocating, the algorithm does not consider the expected runtime of the task on the chosen machine. Although the algorithm is simple, it can result in poor runtimes [21, 45, 46].

The MET (Minimum Execution Time) [25] algorithm assigns tasks to the machines that are expected to minimize the execution time of the task. Thus, each task is assigned to its best machine. This can cause load imbalance among the resources, as machines with more computing capacity are more likely to be the best matches for most of the tasks, resulting in more load on the more powerful machines, while some machines with less computational capacity may be left underutilized.

The MCT (Minimum Completion Time) algorithm [21] assigns tasks to the machines that are expected to minimize the completion time of the tasks. The algorithm combines the heuristics used in OLB and MET, and tries to avoid the shortcomings of both.

The Min-Min algorithm computes the expected runtime of each task on every machine in the cluster. This is done for all the tasks in the scheduler's queue. From this set of runtimes, the minimum time for each task and the corresponding machine are selected.

times for each of the tasks is constructed. From this set, the task-machine mapping that has the minimum completion time is chosen, and the chosen task is assigned to the corresponding machine. The algorithm minimizes the completion time of each task, and then picks the task with the minimum time among all tasks, hence the name Min-Min [21, 45, 46].

Max-Min is very similar to Min-Min: it computes the set of minimum expected completion times for all tasks, but then selects the task with the maximum expected time from this set and assigns it to the corresponding machine. The chosen task is removed from the set of candidate tasks, and the process is repeated until all tasks have been allocated. The Max-Min algorithm tries to match tasks with longer expected times to their best machines. Thus, it tries to minimize the penalty incurred by running longer tasks concurrently with shorter tasks.

The heuristics in Min-Min and Max-Min are combined in Duplex, which runs both Min-Min and Max-Min and chooses the better of the two solutions. Thus, it tries to perform well in situations where either Min-Min or Max-Min performs well.

Suffrage and XSuffrage: The Suffrage algorithm [61] computes the difference between the best MCT and the second-best MCT for a task. This difference is called the suffrage value. This process is repeated for all the tasks, and from the set of suffrage values the task with maximum suffrage is chosen. The philosophy behind the Suffrage algorithm is that a task should suffer as little as possible from an incorrect assignment; the task that would suffer the most from an incorrect assignment is always chosen first. However, this algorithm may not work well when all tasks are expected to achieve almost identical runtimes. Further, the Suffrage algorithm does not consider data locality, or the distance between where the input data is present and where the task is actually executed. The XSuffrage algorithm [59] tries to address this by computing a cluster-level suffrage value for all the clusters in a grid. XSuffrage also works better than Suffrage in cases where accurate task performance information is not available.
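To make the Min-Min and Max-Min heuristics concrete, the sketch below (in Python) shows one straightforward way to implement them. The estimated-execution-time function ect(task, machine) and the ready-time table are assumed to be available from a performance model; this is an illustrative sketch, not code from any of the systems cited above.

    def min_min(tasks, machines, ect, ready):
        """Assign independent tasks with the Min-Min heuristic.

        ect(task, machine): estimated execution time of task on machine.
        ready[machine]: time at which the machine becomes free.
        Returns a dict mapping each task to the chosen machine.
        """
        unassigned = set(tasks)
        mapping = {}
        while unassigned:
            # For every unassigned task, find its minimum-completion-time machine.
            best = {}
            for t in unassigned:
                m = min(machines, key=lambda mc: ready[mc] + ect(t, mc))
                best[t] = (m, ready[m] + ect(t, m))
            # Min-Min picks the task whose minimum completion time is smallest;
            # Max-Min would instead pick the task whose minimum completion time is largest.
            task = min(best, key=lambda t: best[t][1])
            machine, finish = best[task]
            mapping[task] = machine
            ready[machine] = finish
            unassigned.remove(task)
        return mapping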

Algorithms without performance estimates

Subramani et al. [73] present a duplication-based approach in which a job is sent to K independent sites. Each of the sites then schedules the job locally and informs the global scheduler if it is able to start the job. The global scheduler, upon receiving a start message from one of the K sites, sends a cancel message to the remaining K − 1 sites to cancel the job. The reasoning behind duplicating tasks is better utilization of idle machines on large clusters, and a reduction in the expected makespan of a job.

Another algorithm, Work Queue Replication (WQR) [36], duplicates tasks that are already running on other processors by relaunching them on other machines in the cluster. The difference between the previous algorithm and WQR is that WQR actually duplicates ongoing work, whereas the previous algorithm cancels all redundant work. The WQR algorithm works well without using resource information, and also copes well with dynamic resource quality and variations in application performance. This is very similar to the idea of speculative execution in Hadoop.

Computational Intelligence based approaches

Nature-inspired heuristics such as Genetic Algorithms (GA), Simulated Annealing and Swarm Intelligence have also been used in task assignment problems. When applying such computational intelligence algorithms, the task assignment problem is modeled as an optimization problem, and a heuristic is applied to find a solution that maximizes (or minimizes) an objective function. Next, we discuss some of these heuristic algorithms and their application to distributed task assignment.

Genetic Algorithms

Genetic Algorithms fall under the broader class of evolutionary algorithms and are typically used to discover solutions in large search spaces. The general method used in a GA is as follows:

1. Population Generation: A population contains a set of chromosomes (potential solutions). The initial population can be generated using other heuristics, such as those described earlier in this chapter.
2. Chromosome evaluation: Each chromosome is evaluated based on the value of the objective function. For task assignment, an example of an objective function is the total makespan of all tasks.
3. Crossover and Mutation: During crossover, two chromosomes are selected and random substrings of the two are exchanged. In mutation, a single chromosome is selected and one of its (task, machine) mappings is reassigned.
4. Evaluation: The new chromosomes generated in the previous steps are evaluated again, and a random subset of the chromosomes is preserved for the next iteration.

The algorithm ends after a fixed number of evolutions, or when all the chromosomes converge to a single mapping. GAs have been a very popular heuristic and have been applied to a number of grid scheduling problems [17, 53, 72, 76].

Simulated Annealing

The next heuristic, Simulated Annealing (SA), is based on the restructuring of molecules observed during the controlled cooling of molten metals and alloys: the metal is first melted and then cooled slowly until it reaches a thermal equilibrium, which is an optimal state. When applied to task-machine mapping, we define a temperature value that can be calculated for each task-machine mapping; the temperature could be calculated using the objective function. If a new mapping results in a higher temperature, it is accepted with a certain probability, in order to escape from local minima. The initial mapping is generated from a uniform random distribution. The mapping is then mutated, similarly to a GA, and the temperature of the new mapping is evaluated. If the new temperature is lower, the new mapping replaces the old one. If it is worse, it is accepted with some probability: a uniform random number z is selected such that z ∈ [0, 1), and z is then compared to y, where

    y = 1 / (1 + e^((old temp − new temp) / new temp))
If z > y, the new mapping is accepted; otherwise it is rejected. This completes one iteration of the algorithm. The algorithm is stopped if the temperature does not reduce after a certain number of iterations, or if the temperature reduces to zero, indicating an optimal solution. Examples: [25, 78].

Genetic Simulated Annealing

GA can be combined with SA to yield a hybrid approach called Genetic Simulated Annealing [16]. GSA follows the procedure of GA as described earlier; however, while selecting a chromosome, the simpler decision process of SA is used.

Swarm Intelligence

Swarm intelligence is a term used for a set of meta-heuristic algorithms derived from the behavior of swarms of social animals in nature. Such algorithms include Ant Colony Optimization (ACO) [41], modeled after the foraging behavior of ants; Particle Swarm Optimization (PSO), modeled after a swarm of fish or hypothetical particles; and the Bee algorithm, modeled after the foraging behavior of honey bees.

Ant Colony Optimization (ACO), proposed by Dorigo [41], has been used in grid scheduling [42, 66]. In this approach, each mapping is represented by a path in a graph that starts from a start node and ends at the end node (the optimal solution). Each edge is traversed by ants (agents), which leave a trace (pheromone) of a certain chemical. This trace decays with time. While selecting the next edge, ants try to select the edge that has the maximum trace value, and after an ant crosses an edge its trace value is reinforced. The algorithm ends when all the ants have reached the end (optimal state), or when they converge at an intermediate state after a number of iterations.

Particle Swarm Optimization is modeled after the interactions in social organisms. Initially a starting population is generated randomly, and a social network among the solutions is defined in order to exchange information. Each individual in the population is a candidate solution. Each individual, or particle, evaluates its fitness function and exchanges this information with its neighbors. A particle also
remembers its old location. A new location is chosen if one of the neighbors has achieved a better location, and all neighbors of that particle move towards the better location. This step is continued until all particles converge to the same location, or after a certain number of iterations [69].

Other nature-based approaches include an optimization algorithm modeled after the search technique used by honey bees while searching for nectar [30].

Learning based approaches

Motivation for using Machine Learning

Figure 2.1 CPU usage patterns of a MapReduce application (WordCount). Mean and variance of the resource usage distributions become recognizable characteristics of a particular MapReduce job.

MapReduce applications have been successfully used in processing large amounts of data. Subtasks of the same type of a job apply the exact same computation to the input data. Tasks tend to be I/O bound, with resource usage a function of the size rather than the content of the input data. As a result, the resource usage patterns of a MapReduce job tend to be fairly predictable. For example, in Figure 2.1, we show the CPU time spent in user mode and kernel mode by Map tasks of a WordCount job. The figure shows the distribution of CPU usage for about 1700 Map tasks. As we can see from the figure, the resource usage of MapReduce applications follows recognizable patterns. Similar behavior is observed for other MapReduce applications and
resource types. This, and the fact that the number of tasks increases with the size of the input data, present a unique opportunity for using learning based approaches.

Stochastic Learning Automata

Stochastic Learning Automata (SLA) have been used in load balancing [20, 55, 56]. Learning automata learn through rewards and penalties, which are awarded after successful and unsuccessful decisions respectively. The SLA tries to learn the best possible action in a given automaton state. Examples of actions are task migration, task cancellation, task assignment, etc. Every action is associated with a probability; for example, the probability of allocating a task on an overloaded node is lower than the probability of allocation on an underloaded node. The goodness value of an action is a binary value indicating the success/failure outcome of the action. This value is transmitted to the scheduler by the host where the action has been performed. After the action's outcome is received, the probabilities are updated as follows:

1. Probabilities of successful actions are incremented with a reward update.
2. Probabilities of failed actions are penalized with a penalty update.

Initially, actions are chosen randomly. However, as time progresses and the consequences of a number of actions become known, only a few actions remain viable for each state.

Decision Trees

Another popular classifier, the C4.5 decision tree, has also been applied to process scheduling in Linux [63]. Every process is associated with a feature vector containing a number of static and dynamic features. Analysis is done to determine the best possible set of feature variables in order to minimize the turnaround time of a process. The algorithm is evaluated for the Linux scheduler, and a number of classifiers are compared in the evaluation; the C4.5 classifier is found to give the best results, and the k-nearest neighbor algorithm is also evaluated. The reduction in the turnaround time of processes comes from a reduction in the number of context switches.
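The reward and penalty updates used by the stochastic learning automata described earlier in this section are typically linear. The sketch below shows the standard linear reward-penalty update; the learning rates a and b are illustrative assumptions, not values taken from the cited work.

    def update_action_probs(probs, chosen, rewarded, a=0.1, b=0.05):
        """One linear reward-penalty update for a learning automaton.

        probs   : list of action probabilities for the current state, summing to 1
        chosen  : index of the action that was performed
        rewarded: True if the environment reported success, False otherwise
        """
        r = len(probs)
        for j in range(r):
            if rewarded:
                # Reward: shift probability mass towards the chosen action.
                probs[j] = probs[j] + a * (1.0 - probs[j]) if j == chosen else (1.0 - a) * probs[j]
            else:
                # Penalty: shift probability mass away from the chosen action.
                probs[j] = (1.0 - b) * probs[j] if j == chosen else b / (r - 1) + (1.0 - b) * probs[j]
        return probs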

Bayesian Learning

Bayesian Learning has been used effectively to deal with uncertainty in shared cluster environments [70, 71]. The authors use a Bayesian Decision Network (BDN) to handle the conditional dependence between the different factors involved in a load balancing problem. The Bayesian network consists of a number of nodes, typically one for each factor that might influence the eventual decision. Examples of such factors are machine resource information, machine load, job properties, etc. A set of decisions, or actions, is also predefined; examples of actions include task migration, task duplication, task allocation and task cancellation. While making a decision, the decision variables are first fed with the current state of the system, and based on the conditional probabilities, the expected utility earned after each action is calculated. Utility is measured based on the value of an objective function, a typical example of which is the load distribution amongst nodes. An action is chosen so that the objective function is minimized. The current state of resources is also coupled with the predicted state of resources while making a decision; linear models [39] are used to predict the next state of resources. After making the decision, the state of resources and the value of the objective function are observed, and the conditional probability tables are updated accordingly. The BDN learns from one sample at a time, thus making the learning process incremental. The main motive behind using a BDN is to deal with the uncertainty in node resource information, node availability and, finally, application behavior. Dynamic Bayesian Networks [23] have been used for load balancing as well.

The similarity between their approach and ours is the use of Bayesian inference. However, whereas the authors in [71] use a BDN, we use a Naive Bayes classifier, where all factors involved in making the decision are assumed to be conditionally independent of each other. Despite this assumption, Naive Bayes classifiers are known to work remarkably well [81], and as our results indicate, they can be applied effectively to task assignment as well. Compared to Bayesian networks, Naive Bayes classifiers are also much simpler to implement. Further details about using the classifier are available in Chapter 4.
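As a brief illustration of the Naive Bayes assumption mentioned above, the sketch below computes an (unnormalised) posterior for a class as the class prior multiplied by per-feature conditional probabilities. It is a simplified, discrete-feature illustration with Laplace smoothing, not the implementation used in this thesis.

    from collections import defaultdict

    class NaiveBayes:
        """Minimal Naive Bayes classifier over discrete feature values."""

        def __init__(self):
            self.class_counts = defaultdict(int)                          # N(c)
            self.feature_counts = defaultdict(lambda: defaultdict(int))   # N(feature i = v, c)
            self.total = 0

        def train(self, features, label):
            self.class_counts[label] += 1
            self.total += 1
            for i, v in enumerate(features):
                self.feature_counts[(i, v)][label] += 1

        def posterior(self, features, label):
            """Unnormalised P(label | features) = P(label) * prod_i P(f_i | label)."""
            if self.class_counts[label] == 0:
                return 0.0
            p = self.class_counts[label] / self.total
            for i, v in enumerate(features):
                count = self.feature_counts[(i, v)][label]
                # Laplace smoothing so unseen feature values do not zero the product.
                p *= (count + 1) / (self.class_counts[label] + 2)
            return p

        def classify(self, features, labels):
            return max(labels, key=lambda c: self.posterior(features, c))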

2.4 Existing Hadoop Schedulers

To better understand our approach and the limitations of current Hadoop schedulers, we now explain the key concepts involved in Hadoop scheduling.

Figure 2.2 Hadoop MapReduce Architecture

Hadoop MapReduce Architecture

Hadoop borrows much of its architecture from the original MapReduce system at Google [38]. Figure 2.2 depicts the architecture of Hadoop's MapReduce implementation. Although the architecture is centralized, Hadoop is known to scale well from small (single node) to very large (up to 4000 nodes) installations [12]. HDFS (the Hadoop Distributed File System) deals with storage and is based on the Google File System [49], while MapReduce deals with computation. Each MapReduce job is subdivided into a number of tasks for better granularity in task assignment. Individual tasks of a job are independent of each other,
and are executed in parallel. The number of Map tasks created for a job is usually proportional to the size of its input. For very large inputs (of the order of petabytes), several hundred thousand tasks could be created [37].

Native Hadoop Scheduler

Heartbeat Mechanism

Scheduling in Hadoop is centralized and worker-initiated. Scheduling decisions are taken by a master node, called the JobTracker, whereas the worker nodes, called TaskTrackers, are responsible for task execution. The JobTracker maintains a queue of currently running jobs, the states of the TaskTrackers in the cluster, and the list of tasks allocated to each TaskTracker. Every TaskTracker periodically reports its state to the JobTracker via a heartbeat mechanism. The contents of the heartbeat message are:

- A progress report of the tasks currently running on the sender TaskTracker.
- Lists of completed or failed tasks.
- The state of resources - virtual memory, disk space, etc.
- A boolean flag (acceptNewTasks) indicating whether the sender TaskTracker should be assigned additional tasks. This flag is set if the number of tasks running at the TaskTracker is less than the configured limit.

Task or worker failures are dealt with by relaunching tasks. The JobTracker keeps track of the heartbeats received from the workers and uses them in task assignment. If a heartbeat is not received from a TaskTracker for a specified time interval, that TaskTracker is assumed to be dead. The JobTracker then relaunches all tasks previously assigned to the dead TaskTracker that could not be completed. The heartbeat mechanism also provides a communication channel between the JobTracker and a TaskTracker: any task assignments are sent to the TaskTracker in the response to a heartbeat. The TaskTracker spawns each MapReduce task in a separate process, in order to isolate itself from faults due to user code in the tasks.
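The heartbeat report and its handling at the JobTracker can be pictured with a simple record, as sketched below. The field and function names are illustrative assumptions for this sketch and are not Hadoop's actual class or method names.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class TrackerHeartbeat:
        """Illustrative heartbeat payload sent by a TaskTracker to the JobTracker."""
        tracker_name: str
        task_progress: Dict[str, float] = field(default_factory=dict)   # task id -> fraction done
        completed_tasks: List[str] = field(default_factory=list)
        failed_tasks: List[str] = field(default_factory=list)
        free_virtual_memory: int = 0        # bytes
        free_disk_space: int = 0            # bytes
        accept_new_tasks: bool = False      # true when running tasks < configured limit

    def handle_heartbeat(hb: TrackerHeartbeat, scheduler) -> List[str]:
        """JobTracker side: record the report and, if the tracker has capacity,
        ask the scheduler for tasks to return in the heartbeat response."""
        scheduler.record_status(hb)
        return scheduler.assign_tasks(hb) if hb.accept_new_tasks else []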

Limiting task assignment

The administrator specifies the maximum number of Map and Reduce tasks that can simultaneously run on a TaskTracker (mapred.map.tasks.maximum and mapred.reduce.tasks.maximum in Hadoop's configuration files). If the number of tasks currently running on a TaskTracker is less than this limit, and if there is enough disk space available (a threshold that can also be configured in the same configuration file), the TaskTracker can accept new tasks. This limit has to be specified before starting a Hadoop cluster. This mechanism makes some assumptions which we find objectionable:

- In order to correctly set the limit, the administrator must have detailed knowledge about the resource usage characteristics of the MapReduce applications running on the cluster. Deciding the task limit is even more difficult in cloud computing environments such as Amazon EC2, where the resources could be virtual.
- All MapReduce applications have similar resource requirements.
- The limit on the maximum number of concurrent tasks correctly describes the capacity of a machine.

Clearly, these assumptions do not hold in real-world scenarios, given the range of applications for which Hadoop is becoming popular [10]. As these assumptions are built into Hadoop, all the schedulers currently available with Hadoop (the Hadoop default scheduler, the FAIR scheduler [80], the Capacity scheduler [5] and the Dynamic Priority scheduler [8]) suffer from this limitation.

Job Priorities

Hadoop has limited support for job priorities. Five job priorities are supported:

1. Very High
2. High
3. Normal
4. Low
5. Very Low

Job priority can be set at job launch time, or can be changed by the administrator while a job is running. Hadoop tries to schedule jobs according to an FCFS strategy, and picks the oldest job with the maximum priority for execution.

Data Local Execution

Data locality and speculative execution are two important features of Hadoop's scheduling. Data locality means executing tasks as close to their input data as possible. Hadoop tries to achieve rack-level data locality, in order to exploit the rack-based topology of typical data centers. A topology script gives the rack location and the distance between an input split's location and a node under consideration for task assignment. First, the default scheduler tries to assign a task to a node whose distance from the input data is zero; the distance between a node and its input is zero if both reside on the same machine. Next, nodes with distance one are considered; these are machines within the same server rack. If no such machines are found, then Hadoop assigns a non-data-local task, i.e., a task whose input data is not on any machine in the same rack as the node running the task.

Speculative Execution

Speculative execution tries to rebalance load on the worker nodes and to improve response time by relaunching slow tasks on different TaskTrackers with more resources. In this mechanism, the Hadoop scheduler relaunches tasks that are progressing slowly compared to other running tasks. Slow tasks are duplicated on machines with free slots. This achieves better utilization when a job is about to end, and reduces the job makespan by reducing the runtime of the slow tasks. It also counters the effects of overload caused by multiple task assignments on the better machines.
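A minimal sketch of the locality-aware selection described above, where distance 0 means the input split is on the same machine, 1 means the same rack, and larger values mean off-rack. The topology_distance helper and the input_locations attribute are assumptions made for illustration, not Hadoop's actual API.

    def pick_task(pending_tasks, node, topology_distance):
        """Return the pending task closest to its input data: node-local first,
        then rack-local, then non-data-local."""
        best_task, best_dist = None, None
        for task in pending_tasks:
            # A task's input split is usually replicated on several machines.
            dist = min(topology_distance(loc, node) for loc in task.input_locations)
            if best_dist is None or dist < best_dist:
                best_task, best_dist = task, dist
            if best_dist == 0:      # cannot do better than node-local
                break
        return best_task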

2.4.2 LATE Scheduler

The LATE scheduler [79] tries to improve the response time of Hadoop in multi-user environments by improving speculative execution. It relaunches the tasks expected to finish farthest into the future. To better accommodate different types of tasks, task progress is divided into zones, and a user-defined limit is used to control the number of speculative tasks assigned to one node. The LATE scheduler achieves better response times, especially in heterogeneous cloud environments. We would like to point out that speculative execution increases the likelihood of node overload.

2.4.3 FAIR Scheduler

The FAIR scheduler by Zaharia et al. [80] has been optimized for multi-user environments, where a single cluster is shared across a number of users. This kind of Hadoop usage is popular in companies that run a lot of data mining operations over user logs. The FAIR scheduler has been designed to reduce the makespan of short jobs, which are found to be frequent in such large environments [51]. The scheduler chooses jobs from a set of pools, and tries to assign jobs fairly across pools. Task preemption is also supported in order to achieve fairness. The authors use a version of max-min fairness with a minimum slot allocation guarantee: if jobs from a pool have been over-assigned, tasks of those jobs are killed in order to free slots for jobs with less than their guaranteed allocations.

The FAIR scheduler uses delayed allocation to improve data locality on large clusters. While allocating a task, unlike the native Hadoop scheduler, which uses a best-effort philosophy, the FAIR scheduler delays the allocation of a task in the hope that a node with better data locality might ask for that task. If the job at the head of the job queue does not have a data-local task, then subsequent jobs with data-local tasks are chosen for assignment. If the job at the head of the queue has not been getting a data-local task for a specific period of time, one of its tasks is forcefully allocated. The authors calculate the expected gain in a job's response time from delay scheduling to be:

    E(gain) = (1 − e^(−w/t)) (D − t)
where t is the average time between allocation requests (heartbeats) arriving from nodes, which are modeled as a Poisson process, D is the expected extension in the runtime of a non-data-local task compared to a data-local task, and w is the wait time for launching tasks locally.

2.4.4 Capacity Scheduler

The Capacity Scheduler [5] is another scheduler designed for the sharing of large clusters. The scheduler defines a concept of queues, which are created by system administrators for submitting jobs. Each queue is guaranteed a fraction of the capacity (number of task slots) of the entire cluster, and all jobs submitted to a given queue have access to the resources guaranteed to that queue. Within a single queue, jobs are selected based on priorities. Higher-priority jobs are selected before lower-priority jobs; however, lower-priority jobs are not preempted for higher-priority ones, which could result in priority inversion within a queue. If there are multiple queues and only a subset of them have running jobs, there is a facility to allocate more than the guaranteed capacity to the subset of active queues. If and when the inactive queues also receive job submissions, their lost capacity is reclaimed; to do this, tasks of jobs in queues with excess capacity are killed. The scheduler also supports resource-aware scheduling for memory-intensive jobs: a job can optionally indicate that it needs nodes with more memory, and tasks of such a job are only allocated to nodes that have more than the requested amount of free memory.

Whenever a request for a task comes from a TaskTracker, the scheduler chooses the queue that has the most free capacity, i.e. the queue whose ratio of the number of running tasks to the number of guaranteed task slots is lowest. Task quotas for users are also supported. Within a queue, jobs are usually chosen in FIFO order, but a job is selected only if its user's quota has not been reached; otherwise the next job in the queue is chosen.
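The queue selection rule described above (pick the queue using the smallest fraction of its guaranteed slots) is easy to sketch; the dictionary fields below are assumptions used only for illustration.

    def pick_queue(queues):
        """Choose the queue with the lowest ratio of running tasks to guaranteed slots."""
        def used_fraction(q):
            return q["running"] / q["guaranteed"]
        eligible = [q for q in queues if q["guaranteed"] > 0]
        return min(eligible, key=used_fraction) if eligible else None

    # Example: the analytics queue has used none of its guarantee, so it is chosen next.
    queues = [
        {"name": "analytics", "running": 0,  "guaranteed": 20},
        {"name": "reporting", "running": 15, "guaranteed": 30},
    ]
    print(pick_queue(queues)["name"])    # -> analytics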

2.4.5 Dynamic Priority Scheduler

The Dynamic Priority Scheduler [8], based on the Tycoon [57] system, uses a market-based approach for task assignment. Each user is given an initial amount of virtual currency. While submitting a job, users can declare a certain spending rate per unit time, which indicates the amount of money the user is willing to spend in a given time. The scheduler chooses the jobs that earn the maximum money for the scheduler, i.e. the jobs whose users are offering the maximum spending rate. Users can control their spending rate in order to change the priority of their jobs, hence the name dynamic priority scheduler. After a task allocation, an amount equivalent to the effective spending rate is deducted from the account of the user. If the account balance of a user reaches zero, no further tasks of that user are assigned. This shifts the onus of properly prioritizing jobs onto users instead of system administrators.

The FAIR scheduler [80], the Capacity scheduler [5], and the Dynamic Priority scheduler [8] try to achieve fairness, guaranteed capacity and adjustable priority based scheduling respectively. Hadoop on Demand [9] tries to use existing cluster managers for resource management in Hadoop. It should be noted that these schedulers concentrate either on policy enforcement (fairness, for example) or on improving the response time of jobs from the perspective of the users. Our work differs from these approaches in that we allow a service provider to plug in his/her own policy scheme, while maintaining a specified level of utilization. Also, as discussed in Section 2.4.1, all of these schedulers allocate tasks only if there are fewer than the maximum allowed number of tasks (a limit set by the administrator) running on a worker machine. Our scheduler, on the other hand, assigns tasks to a worker machine as long as an additional task is not likely to overload it.

2.5 Summary

In this chapter we have attempted to give the reader a brief overview of the vast, extremely interesting and still actively researched field of grid resource management. We introduced the concepts and terms involved in grid resource management and tried to present a thorough literature survey. We also discussed existing
schedulers for Hadoop and their pitfalls, which led to our work on admission control (Chapter 3) and task assignment (Chapter 4).

Chapter 3

Learning Based Admission Control

This chapter presents the learning based admission control algorithm, designed for MapReduce clusters. We begin by introducing the context: offering MapReduce jobs as on-demand services. After describing the model, we move on to explaining the role of utility functions and the mechanisms for establishing service contracts. Next, we present the admission control algorithm, and finally we evaluate it against baseline approaches by comparing performance on a number of important parameters.

3.1 Service Oriented MapReduce

Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) are the three key paradigms that enable cloud computing. In these models, software applications, software platforms and infrastructure are provided to users in the form of on-demand services, and users are charged according to the pay-per-use model. MapReduce, which has become a popular paradigm for large scale data processing in the cloud, is usually associated with the PaaS paradigm, where the service provider offers a ready-to-use MapReduce cluster on which users can run their jobs. An example of such a platform is the Amazon Elastic MapReduce service [2], where users can provision Hadoop clusters on the fly and perform data intensive computation by providing the implementation of the Map and Reduce components, processing data that is hosted in cloud based storage services.

Figure 3.1 The architecture of MapReduce as a Service. Our model is based on the Hadoop open source MapReduce framework.

The Model

We propose a model that brings MapReduce into the SaaS paradigm. In our model, service providers offer a set of MapReduce applications as web services. To carry out data intensive operations, users search a repository of registered services and select a service that performs the desired operation. The service repository thus takes the form of an online marketplace where users can choose from a range of MapReduce services. A market is beneficial to users, as the service providers are forced to provide better service at cheaper rates in order to overcome competition. Figure 3.1 shows the architecture. MapReduce in the SaaS paradigm has the following benefits from the users' perspective:

- Users can choose from a wide range of available applications to perform their computations, without having to invest in the development of such applications.
- Users do not have to deal with establishing and maintaining MapReduce clusters, thus saving operational cost.
- Users can combine MapReduce services to form a data processing pipeline, where each unit in the pipeline could be offered by a different service provider, thus allowing the user to form a mashup service.

After selecting a service, the users interact with it only through the web service interfaces exposed by the service provider. This helps in hiding implementation details from the users, which can be beneficial to the service provider in cases where exposing application implementation details is not preferred. Messages exchanged between the user and the services are based on familiar transport mechanisms, for example XML or JSON over HTTP. Users specify the input data location in the cloud, metadata describing the computation to be performed on the input data, and contractual service demands as parameters to the web service. Output data can either be obtained directly as a response from the service, or it can be stored in the cloud; the latter approach is useful if the user desires to perform further computation by means of another service instance.

After accepting a request, the service provider launches a MapReduce job corresponding to the request in her cluster, which is hosted entirely in the cloud. The cloud in this case can be a private cloud that emulates cloud computing on privately owned infrastructure, or it could be a public cloud computing offering such as Amazon EC2 [1], GoGrid, or the RackSpace Cloud. We assume that distinct requests are independent of each other, and thus can be completed in parallel. To increase revenue, the service provider processes multiple requests simultaneously by multiplexing job execution in the MapReduce cluster to achieve better resource utilization. The computational resources of the cluster are shared proportionately among the users; the proportion of resources allocated to a user's request depends upon the utility earned by the service provider after completing the user's request.

Usage examples

Having discussed the model, we now present a few use cases of MapReduce as a service:

Ad-hoc querying on large datasets - Consider a scenario where an online movie rental company offers a large anonymized data set of its users' order history (of the order of a few TBs) for analysis. The company can also offer several ready-to-use operations such as data selection, filtering, joins, pre-processing, etc. in the form of MapReduce jobs. Users can chain several such operations to extract useful informative
patterns from the orders data, such as the set of genres of orders placed by users in a particular age group. Users do not have to own the data set, and they can reuse components developed by the service provider to extract the desired information.

On demand crawling and indexing of web information sources - A service provider could allow users to submit a list of seed URLs to be crawled using a domain specific crawling algorithm developed by the service provider. The crawling engine would utilize MapReduce jobs to distribute the workload across multiple nodes. The service provider could also provide initial preprocessing utilities in the form of MapReduce jobs, such as jobs for extracting images, their alt-text and surrounding text from crawled web pages. Users pay the provider only for the resources consumed during their crawl, such as network bandwidth and the disk space consumed by the crawled data.

Document Format Conversion Service - In this example, users submit a list of documents stored in the cloud to the service provider, along with the desired output format for the documents. The service provider can then offer MapReduce jobs that convert the documents to the desired file format. Example applications include on-demand video and audio conversion, generating thumbnails from video files, etc.

3.1.3 Utility Functions: Expressing User Expectations

In the MapReduce as a service model, users only pay for the share of resources consumed for their computation. Besides the demand for correctness of the computation, the deadline for performing the computation is also an integral part of the user's expectation about the quality of service. Thus the price that the user is willing to pay for the resources and the deadline that the service provider agrees to constitute the service contract in this model. The user and the service provider must negotiate and mutually agree upon this contract. We do not address the problem of price determination in this chapter; auctioning mechanisms such as the Dutch auction or
the English auction could be used effectively for the purpose of judging the value of a service.

Users specify utility functions that indicate the price they are willing to pay as a function of the time taken to complete the service request. We extend the generic three-phase utility functions proposed in [22, 32, 50, 65]. In this framework, the users specify a soft deadline and a hard deadline. If the request completes before the soft deadline, the user pays the complete amount he/she agreed upon before submitting the request. After the soft deadline, the utility from the perspective of the user degrades until the hard deadline, after which the user is no longer interested in the outcome of the request and is unwilling to pay for completion of the service. The decay in the utility could be linear, or the rate of decay could vary with the time passed since the soft deadline. The following set of parameters captures the set of utility functions that exhibit this behavior. Formally, utility can be expressed as a function of time:

    U(t) = U_0                      if 0 < t ≤ T_1
    U(t) = U_0 − α (t − T_1)^β      if T_1 < t ≤ T_2
    U(t) = U_P                      if t > T_2

where t = 0 is the time when a service request is accepted. U_0 is the initial utility that the user is willing to pay if the request is completed before the soft deadline T_1, after which the utility decays until the hard deadline T_2. Users can control the values of the decay parameters α and β. Finally, U_P gives the utility that the user is willing to pay after the hard deadline. A negative value of U_P implies a penalty to be incurred by the service provider for failing to meet the hard deadline. If U_P is zero after T_2, it means that the user is no longer interested in the outcome of the service and will not pay any charges to the provider; the provider is then free to cancel the request. The values of the decay parameters α and β represent the user's interest in the outcome of the service request. A value of β = 1 gives a linear degradation in the utility if the job is not completed within the soft deadline. Similarly, a value of β = 0 indicates a sharp drop in the user's interest if the soft deadline is missed. Decay functions for various values of α and β are shown in Figure 3.2. The next section describes the need for admission control algorithms in the MapReduce as a Service model and presents our proposed algorithm.
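The three-phase utility function defined above translates directly into code. The sketch below takes the contract parameters as plain arguments; the numeric values in the example are purely illustrative.

    def utility(t, u0, t1, t2, alpha, beta, u_penalty):
        """Utility earned for completing a request at time t (t = 0 is the time
        the request was accepted)."""
        if t <= t1:              # before the soft deadline: full utility
            return u0
        if t <= t2:              # between soft and hard deadlines: decaying utility
            return u0 - alpha * (t - t1) ** beta
        return u_penalty         # after the hard deadline

    # Linear decay (beta = 1): full price up to 30 minutes, decay until 60 minutes,
    # nothing afterwards.
    print(utility(20, u0=100.0, t1=30, t2=60, alpha=2.0, beta=1.0, u_penalty=0.0))   # 100.0
    print(utility(45, u0=100.0, t1=30, t2=60, alpha=2.0, beta=1.0, u_penalty=0.0))   # 70.0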

Figure 3.2 Utility Functions for different values of decay parameters.

3.2 Learning Based Admission Control

We attempt to solve the problem of admission control for Hadoop, the leading open source framework for MapReduce. We briefly recall the architecture of Hadoop MapReduce, and then proceed to our algorithm. First, let us consider the need for an admission control algorithm. In our model, a service provider processes multiple requests simultaneously by multiplexing job execution in the cluster. Resources in the cluster are shared proportionately among the requests, and these proportions are decided by the utility that the service provider expects to earn after the successful completion of each request. As a result, it becomes necessary to accept incoming jobs judiciously, so that they do not affect the performance of already running jobs. Admission control also helps to prevent overloading of the resources in the cluster. As the cluster is hosted in the cloud, its resources could be scaled on demand using auto-scaling capabilities. However, even if an auto-scaling facility is available, admission control can still prove useful, because the rate of arrival of new requests could be much higher than the rate at which new nodes can be commissioned in the cluster.

3.2.1 Recap: Hadoop Architecture

Let us recapitulate the Hadoop architecture briefly in order to better understand our approach. Hadoop's MapReduce implementation borrows much of its architecture from the original MapReduce system at Google [38]. Figure 3.3 depicts the architecture of Hadoop's MapReduce implementation. Although the architecture is centralized, Hadoop is known to scale well from small (single node) to very large (up to 4000 nodes) installations [12]. Scheduling decisions are taken by a master node (the JobTracker), whereas the worker nodes (TaskTrackers) are responsible for task execution. The JobTracker keeps track of the heartbeat messages received periodically from the TaskTrackers and uses the information contained in them while assigning tasks to a TaskTracker. If a heartbeat is not received from a TaskTracker for a specified time interval, the TaskTracker is assumed to be dead; in such a case, the JobTracker re-launches all the incomplete tasks previously assigned to the dead TaskTracker. Task assignments are sent to the TaskTracker as a response to the heartbeat message. The TaskTracker spawns each MapReduce task in a separate process, in order to isolate itself from faults due to user code in the tasks. For a detailed description of the Hadoop architecture please see Chapter 2.

The Algorithm

The administrator specifies the maximum number of Map and Reduce task slots, which controls the number of simultaneously running tasks on a TaskTracker. Jobs compete for task slots in the cluster, and it is the responsibility of the scheduler to allocate slots properly, so that jobs do not suffer from starvation and receive their fair share of the resources in the cluster. The admission controller runs at the master (JobTracker) node in the MapReduce cluster. Although user requests for services can arrive asynchronously, the algorithm considers them for admission only at fixed points in time. The time interval between two such admission points is referred to as an admission interval. Job requests that arrive during an admission interval are maintained in a queue of candidate jobs. The algorithm takes this queue as input and admits at most one job for execution in the cluster. All other requests are rejected and are not considered
for further processing. The users are notified whether their requested services have been accepted or rejected.

Figure 3.3 Architecture of MapReduce in Hadoop.

Figure 3.4 summarizes the admission control block. To decide if, and which, request to accept, we use the Expected Utility Hypothesis from decision theory. This hypothesis states that, given a set of choices with varying payouts and likelihoods of those payouts, a rational agent always prefers the option that maximizes the agent's expected utility. Applying this principle to the problem of selecting a job to be admitted, the algorithm chooses the job that maximizes the expected utility from the perspective of the service provider. Formally,

    Selected job = argmax_j ( U_j · P(J = Success | E) )

where U_j is the utility of job j as calculated from the utility function agreed upon by the user and the service provider in their service contract. While making the comparison, we consider only the utility that will be earned if the job is completed before the soft deadline specified by the user. J = Success denotes the event that the job admission is successful according to the success criteria dictated by the
service provider.

Figure 3.4 Admission Controller.

The probability P(J = Success | E) is conditional on the current state of the resources in the cluster, E. The admission controller uses the knowledge accumulated from past admission control decisions to predict the outcome of admitting each candidate job. To achieve this, we compute the posterior probability P(J = Success | E) using Bayes' theorem:

    P(J = Success | E) = P(E | J = Success) · P(J = Success) / P(E)

The above equation forms the foundation of learning in our algorithm. The algorithm uses the results of decisions made in the past to make the current decision; this is achieved by keeping track of past decisions and their outcomes in the form of conditional probabilities. The denominator P(E) in the above equation is independent of the candidate jobs and can safely be ignored as a constant while comparing the candidate jobs. For
each job in the list, we estimate the probability of future success as well as future failure. A job is rejected if the likelihood of failure is greater than that of success; if all jobs are likely to fail, none of them is admitted. In other words, we classify the candidate jobs into potentially successful and potentially unsuccessful jobs, and then select the job that provides maximum utility from the set of potentially successful jobs. Figure 3.4 summarizes this process. We thus select the job that maximizes the following quantity:

    U_j · P(E | J = Success) · P(J = Success) / P(E)

The state of the environment E comprises a number of factors describing the state of the cluster resources, such as the cluster load, the number of pending tasks currently in the cluster, the rate at which tasks are being completed, etc. We also extend the state of resources by including in it the properties of the job request, such as the size of the request and the mean runtimes observed in the past for similar requests. The list of factors that we consider while making a decision is given below. All the factors are chosen based on their speculated effect on the result of an admission control decision.

- Used map slots - The ratio of the number of map tasks currently running to the maximum allowed number of concurrent tasks in the MapReduce cluster. This parameter quantifies the availability of resources. A value less than one indicates resource availability, and means that new requests are more likely to be completed, whereas a value greater than one indicates more contention among jobs for resources.
- Used reduce slots - The ratio of the number of reduce tasks currently running to the maximum allowed number of concurrent tasks in the MapReduce cluster.
- Pending maps - The number of map tasks currently waiting for slots to be allocated. This parameter quantifies the pending map workload.
- Pending reduces - The number of reduce tasks currently waiting for slots to be allocated.
- Finishing jobs - The number of jobs that are about to finish, i.e. that have very few pending tasks. If the value of this parameter is high, the newly accepted job is expected to have sufficient resources for its execution, as currently running jobs will finish soon and release resources. For our experiments we considered jobs with more than 85% of their tasks completed as finishing jobs.
- Map time average - A moving average of map task runtimes. This denotes the rate at which map tasks are being completed, and is also an indication of the nature of the jobs currently running in the cluster.
- Reduce time average - Same as above, but for reduce tasks.
- Load - The ratio of the number of tasks waiting to be assigned a slot to the maximum number of slots.
- Job maps - The number of map tasks in the candidate job. The value of this parameter depends on the size of the input data. This and the following parameters are job specific, and may be different for each candidate job.
- Job reduces - The number of reduce tasks in the candidate job.
- Mean map time - The mean map task runtime observed for this job in its past runs.
- Mean reduce time - Same as above, but for the reduce tasks of the job.

Given all these parameters, the quantity P(E | J = Success) becomes:

    P(E | J = Success) = P(e_1, e_2, ..., e_n | J = Success)

where e_1, e_2, ..., e_n are the factors constituting the state of the environment E. We assume that these factors are conditionally independent of each other (the Naive Bayes assumption). Thus,

    P(E | J = Success) = ∏_{j=1}^{n} P(e_j | J = Success)
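Putting the pieces together, one admission interval can be sketched as below. Here classifier.posterior(features, label) is assumed to return the (unnormalised) Naive Bayes posterior, soft_deadline_utility(job) the utility U_j earned if the job finishes before its soft deadline, and features(job, cluster_state) the factor values listed above; all three are assumptions of this sketch rather than the exact interfaces of our implementation.

    def admit(candidates, cluster_state, classifier, features, soft_deadline_utility):
        """Admit at most one job per admission interval: among candidates predicted
        to succeed, pick the one with maximum expected utility."""
        best_job, best_eu = None, 0.0
        for job in candidates:
            f = features(job, cluster_state)
            p_success = classifier.posterior(f, "success")
            p_failure = classifier.posterior(f, "failure")
            if p_failure >= p_success:       # predicted to fail: reject this candidate
                continue
            eu = soft_deadline_utility(job) * p_success
            if eu > best_eu:
                best_job, best_eu = job, eu
        return best_job                       # None means no candidate was admitted

Note that the unnormalised posteriors can be compared directly, since the common factor P(E) cancels out, exactly as argued above.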

Service providers predefine the criteria for success or failure of a job. For example, a service provider could specify that any new admission that results in overloading the resources of the cluster beyond a specified threshold will be considered a failure. These success and failure rules are used to validate a decision based on the effects of that decision; validation rules cannot be applied until data about the impact of a decision is available. The results of these validations are sent as feedback to the admission controller. Upon receiving the feedback, the algorithm updates its probabilities so that mistakes made by the algorithm, if any, are not repeated in the future. It is possible that an admission decision adversely affects the makespan of already running jobs; however, the decision will be considered invalid only if it does not meet the success or failure criteria set by the service provider. Service providers could define success-failure criteria that consider the effect on the makespan of other jobs as well. Our algorithm is greedy, as we choose the job that seems to provide maximum utility from the immediately available choices. It is also opportunistic, as we are willing to suffer some degradation in the performance of existing jobs if the newly admitted job can offer more utility than the utility gained from the jobs already executing.

3.3 Evaluation and Results

To verify the efficacy of our algorithm, we simulated the Hadoop MapReduce architecture and compared the behavior of our algorithm with the following baseline approaches:

- Myopic - The job with the maximum initial utility is accepted, without any other considerations.
- Random - A job is admitted randomly from the given set of candidate jobs. The set of candidate jobs is appended with a null value to simulate job rejection.

3.3.1 Simulation Settings

In our simulation model, the properties of a job are distributions specifying the runtimes of the map and reduce tasks of the job. To model the distribution of runtimes,
we extracted and studied traces of real-world MapReduce jobs run on actual Hadoop clusters. We observed that the map runtimes of a particular job follow a Normal distribution, with the mean and standard deviation being characteristics of the job. Similarly, for reduce tasks, the runtimes of the Sort, Shuffle and Reduce phases also followed Normal distributions. Based on these observations, a map task in our simulation occupies a slot for a random amount of time chosen from a Normal distribution that is a characteristic of the job. Similarly, each of the three phases of a reduce task in our simulation occupies a slot for a duration drawn from Normal distributions that are again properties of the job. Our simulation does not model task failures, as the utility is earned only after the successful completion of a job request; it is thus the responsibility of the service provider to make sure that all accepted jobs are executed successfully, irrespective of individual task failures. We only use information that can be obtained through the JobTracker in Hadoop, as the JobTracker provides a unified view of the MapReduce cluster; all the parameters listed in the previous section can be obtained directly from the JobTracker itself. Figure 3.5 lists the simulation parameters and the distributions used in generating simulation events. For comparing results across different runs, we keep the pseudo-random distribution parameters constant between runs. All values reported in the results are averaged over 10 independent runs, unless otherwise specified.

3.3.2 Algorithm Correctness

To verify whether the admission controller is able to accept or reject jobs so as to maintain the overload threshold specified by the service provider, we measured the actual load average observed in a simulation run and compared it against the desired load average set by the service provider. The plot below summarizes the results of these experiments. As we can see in the plot (Figure 3.6), the achieved load average value is fairly close to the desired load average value. Further, the error rate is independent of the desired load average value. The errors may arise as a result of the Naive Bayes assumption made while computing the posterior probabilities.

Parameter                      Description
Job arrival distribution       Exponential
Job arrival rate (λ)           5 minutes
JobTracker heartbeat interval  3 seconds
Admission interval             3 minutes
JobTracker map slots           50
JobTracker reduce slots        20
Job map size                   Uniform Random (51, 100)
Job reduce size                Job map size / 10
Simulation time                500 minutes
Decay parameters (α and β)     α = 1, β = 1
Soft deadline (T_1)            Time taken when all map tasks are executed in parallel + time taken when all reduce tasks are executed in parallel
Hard deadline (T_2)            Time taken when only one task of the job is executed at a time

Figure 3.5 Simulation Parameters

3.3.3 Comparison with Baseline Approaches

Next, we compare the performance of our learning based admission control algorithm with the two baselines specified at the beginning of this section. First, we compare the mean load averages observed with our algorithm, Myopic admission, and Random admission. For this set of experiments, we kept the overload threshold at 100%; in other words, our admission controller rejected all jobs that were predicted to push the cluster load over 1.0. Figure 3.7 shows the results. As can be clearly seen in Figure 3.7, our admission control algorithm is very effective in preventing overload. This establishes the correctness of our algorithm and supports our argument for the necessity of sophisticated admission control algorithms for MapReduce.

3.3.4 Meeting Deadlines

The next experiments in our evaluation verify the ability of our algorithm to meet user deadline guarantees.

Figure 3.6 Achieved and expected load ratio

Figure 3.7 Comparison of Achieved Load Averages (Random, Myopic, and our algorithm; our algorithm achieved a load average of 0.97)

For this set of experiments, the values of the decay parameters α and β were both set to 1, thereby making the decay rate linear. The soft deadline (T_1) in this case is the runtime of the job if all tasks of the job are executed simultaneously, and the hard deadline (T_2) is double the value of the soft deadline. To compare the algorithm with the baseline approaches, we calculate the percentage of jobs that complete before the soft deadline and the percentage of jobs that complete after it. We can see in Figure 3.8 that our algorithm is able to meet the user QoS requirements in most of the cases, whereas the baseline approaches cause job runtimes to exceed the soft deadlines in most of the cases.

Figure 3.8 Performance while meeting user deadlines

Figure 3.9 Achieved Load Average with load cap

3.3.5 Performance with Load Cap

Next, we examine how the algorithms fare with an additional load cap enforced at the time of job admission. In this setting, a job is admitted only if the current load of the cluster is below a certain threshold. The load cap is enforced in each of the algorithms, and their performance is presented in Figure 3.9. With an additional load cap we are adding a reactive constraint, since the current load on the cluster is a result of previous job submissions. However, a load cap ensures that a job is not added in case of overload even in the naive approaches.

As we can see in Figure 3.9, the MEU algorithm fares better even with a load cap. Compared to the results in Section 3.3.3, the other two approaches achieve significantly lower load averages. However, the performance of the MEU algorithm is almost the same despite the additional load cap, demonstrating the efficacy of a learning approach. With the load cap, the other two approaches still overload the cluster by more than 15%, whereas the load average of MEU (97%) is very close to the desired utilization.

3.3.6 Job Response Times

Figure 3.10 Comparing mean job runtimes

Next, we evaluate the runtimes achieved by our admission control algorithm against the baseline approaches. For this set of experiments, we kept the utility of the jobs linearly proportional to the job sizes. Figure 3.10 shows the mean runtimes achieved by the MEU, RAND and MYOPIC algorithms respectively. As we can see from the figure, our algorithm achieves lower runtimes compared to the other two approaches. This is very beneficial to end users, as their main motivation is to finish their jobs as soon as possible. The MEU algorithm achieves an improvement of 8.7% over random admission, and 7.8% over the myopic algorithm. It should be noted that the reduction in response time is achieved without having to overload the resources in the cluster.

Figure 3.11 Comparing runtime distribution

Another way to compare the overall runtimes achieved by the three algorithms is to study the distribution of job runtimes. Figure 3.11 shows the histograms for MEU, RAND and MYOPIC. As we can see in the distributions, the number of jobs with short runtimes (runtime ≤ 25 minutes) is much higher for MEU than for the other two approaches. Another interesting observation is that in the case of MEU very few (about 1%) jobs end up running for longer durations. This shows that our algorithm consistently achieves better response times. As discussed earlier, achieving better response times is crucial for user satisfaction.

3.3.7 Job Arrival Rates

Figure 3.12 Effect of Job Arrival Rate (λ) on Job Acceptance

It is important to study the behavior of the algorithm under different levels of demand for access to the cluster. For this purpose, we vary the job arrival rate, i.e. the λ parameter of the exponential distribution, increasing the value of λ from 0.5 to 15 in steps of 0.5. A small value of λ results in a higher job submission rate, and therefore many candidate job submissions. The main motivation behind this experiment is to see whether the algorithm can cope with heavy request traffic, and whether it accepts more jobs in case of low demand in order to meet the desired utilization.

To study this, we plot the percentage of jobs accepted for each value of λ in Figure 3.12. The results are mean values calculated over 10 independent runs of the simulation. As we can see in the figure, the job acceptance rate increases steadily with increasing inter-job arrival time. This shows that when there are fewer submissions, the algorithm is willing to accept more jobs to maintain the specified level of utilization. This is important from the perspective of service providers, as they want to serve multiple simultaneous requests in order to achieve better revenue from the service.

3.3.8 Effect of Utility Functions

Figure 3.13 Effect of Utility Function on Job Acceptance

In the final experiment, we study the effect of, and the need for, utility functions. As mentioned earlier in Section 3.1.3, utility functions are instruments for users and service providers to control job admission, and the proper choice of a utility function can significantly impact the admission decision. To study the effect, we compare the behavior of the algorithm in the following two cases:

1. Linear proportion: The job utility is linearly proportional to the job size. Formally, U(J) = s, where s is the size of the job, defined as the total number of map and reduce tasks in the job.
2. Exponential proportion: The job utility is exponentially proportional to the job size. Specifically, U(J) = a^s, where a is some integer constant; during our experiment we kept a = 2.

The resulting distributions of the job sizes accepted by the algorithm are shown in Figure 3.13. We compare job size distributions because, in the absence of a utility function, our algorithm always tends to accept smaller jobs, as they are the ones least likely to overload the cluster. Hence, observing job sizes can confirm whether the choice made by the algorithm is being affected by the utility function or not. As we can see in the figure, with linearly proportional utility the size distribution is slightly skewed towards smaller job sizes. However, with exponentially proportional utility the distribution is almost even, with larger jobs also being accepted. Note that, from the perspective of a service provider, smaller jobs are always preferable
Note that, from the perspective of a service provider, smaller jobs are always preferable, as they finish quickly and there is a higher likelihood of meeting their QoS requirements; in the case of larger jobs, other uncertain factors such as component failures and fluctuations in resource availability also play a more important role. Thus, the algorithm behaves according to the needs of the service provider. However, as shown in the figure, this behavior can be adjusted by a proper choice of utility function.

3.4 Summary

We presented a learning based admission control algorithm specifically targeted at MapReduce clusters. Although the concept we exploited has its roots in decision theory, the idea can also be applied to the more general case of web service admission control. Our results validated the use of online learning in making single-choice decisions. Furthermore, when compared to baseline approaches, the algorithm we proposed fared better against the expectations of both users and system administrators on a number of criteria. We use a similar approach, with some variations, to tackle task assignment in MapReduce as well, which is the focus of the next chapter.

Chapter 4

Task Assignment in MapReduce

This chapter presents the learning scheduler for Hadoop, developed as a part of this thesis. The scheduler source code is available for download from https://code.google.com/p/learnsched/. The scheduler uses the pluggable scheduling API introduced in Hadoop versions 0.19 and later. We begin by explaining the scheduling algorithm used and then move on to describing the implementation details of the scheduler. Finally, we present scheduler evaluations under a number of test cases that demonstrate the benefits of the learning based approach.

4.1 LSCHED: Learning Scheduler for Hadoop

Having seen the scheduling mechanism in Hadoop in Chapter 2, we explain our task assignment algorithm in this section. Our algorithm runs at the JobTracker. Whenever a heartbeat from a TaskTracker is received at the JobTracker, the scheduler chooses a task from the MapReduce job that is expected to provide the maximum utility after successful completion of the task. Figure 4.1 depicts the task assignment process.

First, we build a list of candidate jobs. For each job in the queue of the scheduler, one candidate instance for the Map part and one (or zero, if the job does not have a Reduce part) for the Reduce part is added to the list. This is done because the resource requirements of Map and Reduce tasks are usually different.

Figure 4.1 Task assignment using pattern classification. Evaluation of the last decision and classification for the current decision are done asynchronously.

We then classify the candidate jobs into two classes, good and bad, using a pattern classifier. Tasks of good jobs do not overload resources at the TaskTracker during their execution. Jobs labeled bad are not considered for task assignment. If the classifier labels all the jobs as bad, no task is assigned to the TaskTracker. If, after classification, there are multiple jobs belonging to the good class, then we choose the task of the job that maximizes the following quantity:

$E.U.(J) = U(J)\, P(\tau_J = \mathrm{good} \mid F_1, F_2, \ldots, F_n)$    (4.1)

where $E.U.(J)$ is the expected utility, $U(J)$ is the value of the utility function associated with the MapReduce job $J$, $\tau_J$ denotes a task of job $J$, and $P(\tau_J = \mathrm{good} \mid F_1, F_2, \ldots, F_n)$ denotes the probability that the task $\tau_J$ is good, conditioned on the feature variables $F_1, F_2, \ldots, F_n$.

Feature variables are described in more detail later in this section. Once a job is selected, we first try to schedule a task of that job whose input data are locally available on the TaskTracker. Otherwise, we choose a non-data-local task. This policy is the same as the one used by the default Hadoop scheduler. We assume that the cluster is dedicated to MapReduce processing, and that the JobTracker is aware of and responsible for every task execution in the cluster. Our scheduling algorithm is local, as we consider the state of only the concerned TaskTracker while making an assignment decision. The decision does not depend on the state of resources at other TaskTrackers.

We track the task assignment decisions. Once a task is assigned, we observe its effect from the information contained in the subsequent heartbeat from the same TaskTracker. If, based on this information, the TaskTracker is overloaded, we conclude that the last task assignment was incorrect. The pattern classifier is then updated (trained) to avoid such assignments in the future. If, however, the TaskTracker is not overloaded, then the task assignment decision is considered to be successful.

Users configure overload rules based on their requirements. For example, if most of the jobs submitted are known to be CPU intensive, then CPU utilization or load average could be used in deciding node overload. For jobs with heavy network activity, network usage can also be included in the overload rule. In a cloud computing environment, only those resources whose usage is billed could be considered in the overload rule. For example, where conserving bandwidth is important, an overload rule could declare a task allocation as incorrect if it results in more network usage than the limit set by the user. The overload rules supervise the classifiers, and since this process is automated, the learning in our algorithm is automatically supervised. The only requirement for an overload rule is that it can correctly identify a given state of a node as being overloaded or underloaded. It is important that the overload rule remains the same during the execution of the system, and the rule should be consistent for the classifiers to converge.
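The selection step described above can be sketched as follows. This is a minimal illustration only: the Candidate and GoodBadClassifier types are hypothetical stand-ins rather than the scheduler's actual classes, and the 0.5 threshold stands in for the comparison between the good and bad posterior probabilities.

import java.util.List;

// P(task of this candidate is good | feature vector), as estimated by the classifier.
interface GoodBadClassifier {
    double probabilityGood(double[] features);
}

class CandidateSelector {
    private final GoodBadClassifier classifier;

    CandidateSelector(GoodBadClassifier classifier) {
        this.classifier = classifier;
    }

    // One candidate per Map part or Reduce part of a queued job.
    static class Candidate {
        final String jobId;
        final double utility;      // U(J), supplied by the pluggable utility function
        final double[] features;   // job features plus node features of the TaskTracker

        Candidate(String jobId, double utility, double[] features) {
            this.jobId = jobId;
            this.utility = utility;
            this.features = features;
        }
    }

    /** Returns the chosen candidate, or null if every candidate is classified bad. */
    Candidate select(List<Candidate> candidates) {
        Candidate best = null;
        double bestExpectedUtility = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double pGood = classifier.probabilityGood(c.features);
            if (pGood < 0.5) {
                continue;                               // labeled bad, not considered
            }
            double expectedUtility = c.utility * pGood; // E.U.(J) from Equation 4.1
            if (expectedUtility > bestExpectedUtility) {
                bestExpectedUtility = expectedUtility;
                best = c;
            }
        }
        return best;   // null means no task is assigned in this heartbeat
    }
}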

4.1.1 Feature Variables

During classification, the pattern classifier takes into account a number of feature variables which might affect the classification decision. The features we use are described below.

Job Features: These features describe the resource usage patterns of a job. They could be calculated by analyzing past execution traces of the job, and we assume that there exists a system which can provide this information. In the absence of such a system, users can utilize these features to submit hints about job performance to the classifier. Once enough data about job performance is available, user hints could be mapped to resource usage information. The job features we consider are: mean CPU usage, mean network usage, mean disk I/O rate, and mean memory usage. The users estimate the usage of each resource on a scale of 1 to 10, where a value of 1 means minimum usage and 10 corresponds to maximum usage. For a given MapReduce job, the resource usage variables of the Map part and the Reduce part are considered separately.

Node Features (NF): Node features denote the state and quality of the computational resources of a node. Node Static Features change very rarely, or remain constant throughout the execution of the system. These include the number of processors, processor speed, total physical memory, total swap memory, number of disks, and the name and version of the operating system at the TaskTracker. Node Dynamic Features include properties that vary frequently with time, such as CPU load averages, CPU usage percentage, I/O read/write rates, network transmit/receive rates, the number of processes running at the TaskTracker, the amount of free memory, the amount of free swap memory, and the disk space left. Processor speed could be a dynamic feature on nodes whose CPUs support dynamic frequency and voltage scaling.
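Purely for illustration, the features listed above could be gathered into a single vector along the following lines. The field names are hypothetical; in the actual scheduler, node features arrive through the extended heartbeat and job features come from user-supplied hints.

// Illustrative container for the feature variables described above.
class FeatureVector {
    // Job features: user-estimated resource usage on a scale of 1 to 10.
    int jobCpu, jobNetwork, jobDiskIo, jobMemory;

    // Node static features (rarely change).
    int processors;
    long totalMemoryMb;

    // Node dynamic features (vary between heartbeats).
    double loadAverage;
    double cpuUtilization;
    long freeMemoryMb;
    int runningProcesses;

    /** Flattens the features into the array form consumed by the classifier. */
    double[] toArray() {
        return new double[] {
            jobCpu, jobNetwork, jobDiskIo, jobMemory,
            processors, totalMemoryMb,
            loadAverage, cpuUtilization, freeMemoryMb, runningProcesses
        };
    }
}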

4.1.2 Utility Functions

Utility functions are used for prioritizing jobs and for policy enforcement. An important role of the utility functions is to make sure that the scheduler does not always pick easy tasks. If the utility of all jobs is the same, the scheduler will always pick tasks that are more likely to be labeled good, which are usually the tasks that demand fewer resources. Thus, by appropriately adjusting job utilities, it can be ensured that every job gets a chance to be selected. It is possible that a certain job is always classified as bad regardless of the values of the feature vectors. This could happen if the resource requirements of the job are exceptionally high. However, this also indicates that the available resources are clearly inadequate to complete such a job without overloading.

Utility functions can also be used to enforce different scheduling policies. Examples of some such policies are given below, and one or more utility functions could be combined in order to enforce hybrid scheduling policies.

1. Map before Reduce: In MapReduce, it is necessary that all Map tasks of a job are finished before the Reduce operation begins. This can be implemented by keeping the utility of Reduce tasks zero until a sufficient number of Map tasks have completed.

2. First Come, First Serve (FCFS or FIFO): The FCFS policy can be implemented by keeping the utility of a job proportional to the age of the job. The age of a job is zero at submission time (a sketch of this policy as a pluggable utility function follows the list).

3. Budget Constrained: In this policy, tasks of a job are allocated as long as the user of the job has sufficient balance in his/her account. As soon as the balance reaches zero, the utility of the jobs of the said user becomes zero, and thus no further tasks of jobs from that user will be assigned to worker nodes.

4. Dedicated Capacity: In this policy a job is allowed guaranteed access to a fraction of the total resources in the cluster. Here, the utility could be inversely proportional to the deficit between the currently allocated fraction and the promised fraction. The utility of jobs allocated more than the promised fraction is set to zero to make sure that they are not considered during task assignment.

5. Revenue oriented utility: In this policy, the utility of a job is directly proportional to the amount the job's submitter is willing to pay for successful completion of the job. This makes sure that the algorithm always picks tasks of users who are offering more money for the service.
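As an example of how such a policy might be expressed, the sketch below implements the FCFS case from item 2 as a pluggable utility function. The UtilityFunction interface and JobView type are hypothetical stand-ins and not the scheduler's actual extension API.

// Hypothetical extension point for pluggable utility functions.
interface UtilityFunction {
    double utility(JobView job, long nowMillis);
}

// Minimal view of a queued job for this sketch.
class JobView {
    final String jobId;
    final long submissionTimeMillis;

    JobView(String jobId, long submissionTimeMillis) {
        this.jobId = jobId;
        this.submissionTimeMillis = submissionTimeMillis;
    }
}

// FCFS / FIFO: utility proportional to the age of the job, which is zero at
// submission time, so the oldest queued job always has the highest utility.
class FcfsUtility implements UtilityFunction {
    @Override
    public double utility(JobView job, long nowMillis) {
        return Math.max(0, nowMillis - job.submissionTimeMillis);
    }
}

The other policies fit the same shape; for instance, the budget-constrained policy would return zero as soon as the submitting user's balance is exhausted.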

4.1.3 Avoiding resource starvation

As described in the previous section, utility functions are also used to avoid resource starvation amongst jobs. Resource starvation is possible if a particular job needs to do a nontrivial amount of work. Jobs that are heavily CPU bound might overload a worker node even if, at the time of allocation, the concerned node has 100% free resources. As a result, tasks of such jobs are always labeled bad and are never allocated. To prevent this, we maintain an assignment count for each job. After every allocation, the assignment count of the job is increased. While choosing a job, we choose the job that has the least number of assignments. For this, the utility of the job is calculated as follows:

$U(J) = a^{K - J.priority} - J.assignments$

where $a$ is a positive integer such that $a > 1$ and $K$ is a large constant. We have kept $a = 2$ and $K = 64$. As the highest job priority in Hadoop corresponds to the lowest integer value, this makes sure that jobs with higher priorities always have higher utility values.

The assignment counts of jobs are reset periodically. For this, we calculate the maximum assignment value among the queued jobs and deduct it from all the jobs, so that jobs that did not get any assignments in this cycle have the maximum priority in the next cycle. If a job is consistently not being allocated, its assignment value will continue to decrease. If this value reaches a certain lower limit set by the administrator, tasks of that job are forcefully allocated by labeling them as good. This makes sure that such jobs also get a chance to execute their tasks in the cluster. After such an allocation, the assignment count of the job is again reset to zero, or it could be incremented by a value proportional to the job's priority, so that other jobs that do not overload resources get preference in the next cycle.
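The assignment-count bookkeeping described above could be implemented along the following lines, assuming the utility takes the reconstructed form U(J) = a^(K - priority) - assignments given earlier. The class and method names are illustrative only; BigDecimal is used so that the small assignment term is not swamped by the very large priority term.

import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Illustrative starvation-avoidance bookkeeping with a = 2 and K = 64.
// Lower priority integers mean higher priority, as in Hadoop.
class StarvationGuard {
    private static final int A = 2;
    private static final int K = 64;

    private final Map<String, Integer> assignments = new HashMap<>();

    BigDecimal utility(String jobId, int priority) {
        int assigned = assignments.getOrDefault(jobId, 0);
        // The priority term dominates, so higher-priority jobs always win;
        // among equal priorities, the job with the fewest assignments wins.
        return BigDecimal.valueOf(A).pow(K - priority)
                .subtract(BigDecimal.valueOf(assigned));
    }

    void recordAssignment(String jobId) {
        assignments.merge(jobId, 1, Integer::sum);
    }

    /** Periodic reset: subtract the maximum count so starved jobs gain ground. */
    void resetCycle() {
        int max = assignments.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        assignments.replaceAll((job, count) -> count - max);
    }
}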

Benefit of doubt

If in certain cases the likelihood of a job being classified as good or bad is almost equal, we label the task as good. For this, the log likelihoods of the posterior probabilities are calculated. If the bad probability is not larger than the good probability by an order of magnitude, the task is labeled good; otherwise it is labeled bad.

Next, we explain how the same algorithm can be implemented using two different pattern classifiers. In this chapter we consider only the Naive Bayes classifier and the Perceptron classifier [43]. Theoretically, any linear classifier could be used for classifying jobs. However, we discuss these two because of their ease of implementation and their ability to learn from one sample at a time (online learning). Online learning helps in keeping the memory used by the classifiers constant with respect to the number of feature vectors. This is essential in our case, since efficiency is an important goal for a scheduler implementation.

Using a Naive Bayes Classifier

If we apply Bayes' theorem to Equation 4.1 from the beginning of this section, we get

$E.U.(J) = U(J)\,\dfrac{P(F_1, F_2, \ldots, F_n \mid \tau_J = \mathrm{good})\, P(\tau_J = \mathrm{good})}{P(F_1, F_2, \ldots, F_n)}$

The denominator in the above equation can be treated as a constant, as its value is independent of the jobs, and thus its calculation can be skipped during the comparison. We calculate both $P(\tau_J = \mathrm{good} \mid F_1, F_2, \ldots, F_n)$ and $P(\tau_J = \mathrm{bad} \mid F_1, F_2, \ldots, F_n)$. A job is labeled good or bad depending on which of the two probabilities is higher. Under the Naive Bayes conditional independence assumption we get

$P(F_1, F_2, \ldots, F_n \mid \tau_J = \mathrm{good}) = \prod_{i=1}^{n} P(F_i \mid \tau_J = \mathrm{good})$

Thus, we compute the following quantity for all the jobs and select the job that maximizes it:

$E.U.'(J) = U(J)\, P(\tau_J = \mathrm{good}) \prod_{i=1}^{n} P(F_i \mid \tau_J = \mathrm{good})$

Once the effects of a task assignment are observed, the probabilities are updated accordingly, so that future decisions can benefit from the lessons learned from the effects of current decisions.
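In practice this product is best computed in log space, which also makes the benefit-of-doubt check straightforward. The sketch below assumes that an order of magnitude is taken to mean a factor of ten, that the per-feature conditional probabilities come from the histogram counts maintained by the scheduler, and that utilities are positive; the names used here are illustrative.

// Illustrative Naive Bayes scoring in log space.
class NaiveBayesScorer {

    /** log( P(class) * prod_i P(F_i | class) ). */
    static double logScore(double classPrior, double[] featureLikelihoods) {
        double score = Math.log(classPrior);
        for (double p : featureLikelihoods) {
            score += Math.log(p);
        }
        return score;
    }

    /**
     * Benefit of doubt: label the task good unless the bad score exceeds the
     * good score by a factor of ten (ln 10 in log space).
     */
    static boolean isGood(double logScoreGood, double logScoreBad) {
        return logScoreBad - logScoreGood <= Math.log(10);
    }

    /** Log of the quantity maximized across jobs: U(J) * P(good) * prod_i P(F_i | good). */
    static double logExpectedUtility(double utility, double priorGood,
                                     double[] likelihoodsGivenGood) {
        return Math.log(utility) + logScore(priorGood, likelihoodsGivenGood);
    }
}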

Here we assume that the probabilities of all feature variables are conditionally independent of each other. This may not always be true. However, we observed that this assumption yields a much simpler implementation. Despite the assumption, Naive Bayes classifiers are known to perform well, and our results show that the assumption does not have any drastic undesired effect on the overall performance of the scheduler.

Separability of feature vector space and classifier convergence

Naive Bayes classifiers assume that all feature variables are conditionally independent of each other and that their probabilities can be calculated independently. This assumption is almost always incorrect in practice. However, Naive Bayes classifiers have been known to outperform other popular classifiers, including decision trees and multilayer neural networks. Zhang [81] has discussed the unexpected efficiency of Naive Bayes classifiers in detail.

All the feature variables used in our classifier indicate either the availability or the usage of computational resources at a given node. Clearly, the greater the availability of a resource, the greater the likelihood of a task being completed without overloading that resource. For features which correspond to the usage of resources, such as the job features, the opposite is true: the higher the resource usage, the more likely it is that a task of that job will overload the node. Thus, we can say that for a given job, for every feature variable there exists a separating value on one side of which a task of the job is likely to overload the node, and vice versa. The vector corresponding to all such separating values gives the hyperplane which separates the feature vectors into the two classes, good and bad.

4.2 Evaluation and Results

We now briefly discuss the implementation, and then explain the evaluation methodology and the results of our experiments.

4.2.1 Implementation Details

We have implemented our algorithm for Hadoop. Our scheduler uses the pluggable scheduling API introduced in Hadoop 0.19, and customizes the assignTasks method of the org.apache.hadoop.mapred.TaskScheduler class. We used only the Naive Bayes classifier in our implementation, as it handles online learning (learning from one sample at a time) and categorical feature variables better than the Perceptron classifier. We used a simple histogram for counting probabilities of discrete features.

Node Features are obtained from the heartbeat message; we extended the heartbeat protocol used in Hadoop to include node resource properties. Job Features are passed via configuration parameters (learnsched.jobstat.map and learnsched.jobstat.reduce) while launching a job. In the absence of these parameters, the mode of the values observed for each resource is used as the respective job feature.

At any point of time, we maintain at most k decisions made by the classifier for each TaskTracker, where k is the number of tasks assigned in one heartbeat. During the evaluation we kept k = 1. Once the decisions are evaluated by the overload rule, we persist them to disk so that they can be used for re-learning, or when the desired utilization level is changed by the user. A decision made for the current heartbeat is evaluated in the next heartbeat. This allows us to control the memory used by decisions. We disregard the acceptNewTasks flag in the heartbeat message, and consider a node for task assignment in every heartbeat.

We allow users to implement their own utility functions by extending our API. Utility functions in the scheduler are pluggable and can be changed at runtime; we have implemented a constant utility function and a FIFO utility function. Users can also write their own overload rules by implementing the DecisionEvaluator interface.
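For example, a user launching a job could supply the job feature hints as follows. The parameter names are the ones given above; the value format shown (comma-separated CPU, network, disk and memory estimates on the 1 to 10 scale) is an assumption for illustration, not the scheduler's documented format.

import org.apache.hadoop.mapred.JobConf;

// Illustrative job submission that passes resource-usage hints to the scheduler.
public class SubmitWithHints {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SubmitWithHints.class);
        conf.setJobName("wordcount-with-hints");

        // Hints for the Map part: CPU-heavy, light on network, disk and memory.
        conf.set("learnsched.jobstat.map", "8,2,3,4");
        // Hints for the Reduce part: moderate CPU, heavier network due to the shuffle.
        conf.set("learnsched.jobstat.reduce", "4,6,3,4");

        // ... set input/output paths, mapper and reducer classes, then submit,
        // for example with JobClient.runJob(conf).
    }
}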

4.2.2 Evaluation

Cluster Details

We used a cluster of eight nodes to evaluate our algorithm. One of the nodes was designated as the master node, which ran the HDFS and MapReduce masters (NameNode and JobTracker). The remaining seven nodes were worker nodes. All of the nodes had 4 CPUs (Intel quad core, 2.4 GHz), a single hard disk of 250 GB capacity, and 4 GB of RAM. The nodes were interconnected by an unmanaged gigabit Ethernet switch. All of the nodes ran Ubuntu Linux 9.04 (server edition) and Sun Java, and we used Hadoop for this evaluation. The important Hadoop parameters and their values used in the experiments are described in Figure 4.2. For the rest of the parameters, we used Hadoop's default values.

Figure 4.2 Hadoop settings used in evaluation

  Hadoop Parameter        Value
  Replication             3
  HDFS Block size         64 MB
  Speculative Execution   Enabled
  Heartbeat interval      5 seconds

We used one-minute CPU load averages to decide the overloading of resources. Load averages summarize both CPU and I/O activity on a node. We calculated the ratio of the reported load average to the number of available processors on a node; a value of 1 for this ratio indicates 100% utilization. A node was considered to be overloaded if the ratio crossed a user-specified limit.
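The overload rule itself is simple enough to sketch directly. The check below uses standard Java management APIs to read the one-minute load average and the processor count on the local machine; in the actual scheduler these values are reported by the TaskTracker through the extended heartbeat rather than measured at the JobTracker.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// A node is overloaded when (one-minute load average / available processors)
// crosses a user-specified limit; a ratio of 1 corresponds to 100% utilization.
public class LoadAverageOverloadRule {
    private final double limit;

    public LoadAverageOverloadRule(double limit) {
        this.limit = limit;
    }

    public boolean isOverloaded() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double loadAverage = os.getSystemLoadAverage();   // one-minute load average
        int processors = os.getAvailableProcessors();
        if (loadAverage < 0) {
            return false;   // load average not available on this platform
        }
        return (loadAverage / processors) > limit;
    }

    public static void main(String[] args) {
        System.out.println("overloaded: " + new LoadAverageOverloadRule(1.0).isOverloaded());
    }
}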

Workload Description

We evaluated our scheduler using jobs that simulate real-life workloads. In addition to the WordCount and Grep jobs used by Zaharia et al. [79], we also simulate jobs that represent typical usage scenarios of Hadoop. We collected Hadoop usage information from the Hadoop PoweredBy page [10], which lists case studies of over 75 organizations. We categorized the usages into seven main categories: text indexing, log processing, web crawling, data mining, machine learning, reporting, and data storage and image processing. Figure 4.3 summarizes the frequency of these use cases; the percentages represented in the figure are approximate. From this information, we conclude that Hadoop is being used in a wide range of scenarios, naturally creating diversity in the resource requirements of MapReduce jobs.

Figure 4.3 Prominent Use Cases for Hadoop (percentages are approximate)

We came up with the following set of jobs to evaluate our scheduler. We describe their functioning below:

TextWriter: Writes randomly generated text to HDFS. The text is generated from a large collection of English words.

WordCount: Counts word frequencies in textual data.

WordCount with 10 ms delay: Exactly the same as WordCount, except that we add an additional sleep of 10 ms before processing every key-value pair.

URLGet: This job mimics the behavior of the web page fetching component of a web crawler. It downloads a text file from a local server. The local server delays its response for a random amount of time (normal distribution, µ = 1.5 s, σ = 0.5 s) to simulate Internet latency. The text files we generated had sizes following a normal distribution with a mean of 300 KB and a variance of 50 KB [60].
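As an illustration of the delayed variant, a mapper for the WordCount with 10 ms delay job might look like the sketch below, written against the old org.apache.hadoop.mapred API in use at the time. This is a reconstruction for illustration, not the actual job code used in the evaluation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// WordCount mapper with an artificial 10 ms delay per key-value pair.
public class DelayedWordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        try {
            Thread.sleep(10);   // the extra 10 ms of work per record
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}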
