EVALUATING TASK SCHEDULING IN HADOOP-BASED CLOUD SYSTEMS
SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU
UNIVERSITY OF CHINESE ACADEMY OF SCIENCES & RICE UNIVERSITY
2013-9-30
OUTLINE
- Background & Motivation
- Hadoop Task Scheduler
- Benchmark & Methodology
- Evaluation
- Conclusions & Future Work
PRIVATE CLOUD
- "The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises."
- Source: "The NIST Definition of Cloud Computing", National Institute of Standards and Technology. Retrieved 24 July 2011.
MOTIVATION
- A private cloud serves multiple users, with:
  - Different task priorities
  - Different task types
  - Different task data sizes
- Optimizing the performance of a private cloud is necessary and urgent
- A challenge for task scheduling!
HADOOP OVERVIEW
- Hadoop: an open-source software framework for processing a large volume of data on a cluster
HADOOP TASK SCHEDULER
- FIFO
- Naïve Fair Sharing
- Fair Sharing with Delay Scheduling
- Capacity Scheduling
- HOD
CLOUDRANK-D
- A benchmark presented by ICT of CAS
- A benchmark suite for private clouds
- Helps researchers simulate various multi-user applications in industrial scenarios
- Provides a set of 13 representative data analysis tools:
  - Basic operations
  - Data mining operations
  - Data warehouse operations
DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D
Applications: Sort; Word count; Grep; Naive Bayes; Support vector machine; K-means; Item based collaborative filtering; Frequent pattern growth; Hidden Markov model; Grep select; Ranking select; User visits aggregation; User visits-rankings join
Data sources: Automatically generated; News and Wikipedia; Scientist search; Sougou corpus; Ratings on movies; Retail market basket data; Click-stream data of an on-line news portal; Traffic accident data; Collection of web html document; Scientist search; Automatically generated table
WORKLOAD DESIGN
[Figure: pie chart of application types in private clouds: image processing, log processing, data mining, reporting, text indexing, web crawling, machine learning, data storage]
[Figure: pie chart of applications in CloudRank-D: basic operations; Naive Bayes, SVM, HMM, IBCF, FPG; Hive]
WORKLOAD DESIGN
Category                    Application                          Jobs (of 100)
Basic Operations            Sort                                 9
                            Word count                           11
                            Grep                                 11
Data Mining Operations      Naïve Bayes                          6
                            Support vector machine               6
                            K-means                              7
                            Item based collaborative filtering   3
                            Frequent pattern growth              7
                            Hidden Markov model                  6
Data Warehouse Operations   Grep select                          34 (all four combined)
                            Ranking select
                            User visits aggregation
                            User visits-rankings join
JOB SUBMITTING
- Input data sizes follow the distribution of input data sizes at Taobao
- Job inter-arrival times follow an exponential distribution with a mean of 14 seconds (Facebook)
- Jobs are submitted in random order
Input data size   Percentage
<25MB             40.57%
25MB-625MB        39.33%
1.2GB-5GB         12.03%
>5GB              8.07%
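The submission rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `make_submission_plan`, the seed, and the tuple layout are hypothetical, and the size classes are sampled independently per job, which the slide does not specify.

```python
import random

# Input-size classes and their shares, copied from the slide (percent of jobs).
SIZE_CLASSES = [
    ("<25MB", 40.57),
    ("25MB-625MB", 39.33),
    ("1.2GB-5GB", 12.03),
    (">5GB", 8.07),
]
MEAN_INTERARRIVAL_S = 14.0  # mean gap between submissions, from the slide


def make_submission_plan(job_names, seed=0):
    """Return (submit_time_s, job_name, size_class) tuples in submission order.

    Hypothetical helper: the shuffling, the seed, and the tuple layout are
    illustrative choices, not the authors' submission script.
    """
    rng = random.Random(seed)
    order = list(job_names)
    rng.shuffle(order)  # jobs are submitted in random order

    labels = [label for label, _ in SIZE_CLASSES]
    weights = [share for _, share in SIZE_CLASSES]

    plan = []
    clock = 0.0
    for name in order:
        # expovariate takes the rate (1 / mean), so the mean gap is 14 s.
        clock += rng.expovariate(1.0 / MEAN_INTERARRIVAL_S)
        size_class = rng.choices(labels, weights=weights, k=1)[0]
        plan.append((clock, name, size_class))
    return plan


if __name__ == "__main__":
    jobs = [f"job-{i:03d}" for i in range(100)]
    for submit_time, name, size_class in make_submission_plan(jobs)[:5]:
        print(f"{submit_time:8.1f}s  {name}  {size_class}")
```

Fixing the seed makes a run repeatable, so that every scheduler under test could be fed the same submission sequence.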
TESTBED
Hadoop cluster with 5 nodes (1 NameNode, 4 DataNodes)
CPU Type         Intel Xeon E5645
Intel CPU Core   6 cores @ 2.40 GHz
L1 D/I Cache     6 × 32 KB
L2 Cache         6 × 256 KB
L3 Cache         12 MB
Memory           16 GB
Disk             8 TB
OS               CentOS 5.5
Hadoop           1.0.2
Mahout           0.6
Hive             0.11
HADOOP CONFIGURATION
Parameter                                   Value  Description
mapred.tasktracker.map.tasks.maximum        12     The maximum number of map tasks that will be executed simultaneously by a task tracker.
mapred.tasktracker.reduce.tasks.maximum     12     The maximum number of reduce tasks that will be executed simultaneously by a task tracker.
mapred.map.tasks                            48     Maximum number of concurrently running map tasks.
mapred.reduce.tasks                         45     Maximum number of concurrently running reduce tasks.
dfs.replication                             2      The actual number of replications specified when the file is created.
mapreduce.tasktracker.outofband.heartbeat   TRUE   Enables the out-of-band heartbeat.
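As a rough illustration, the parameters above could be written out in Hadoop's `<configuration>`/`<property>` XML format with a short Python script. The helper `render_site_xml` is hypothetical, and splitting the values between mapred-site.xml and hdfs-site.xml follows the usual Hadoop 1.x layout, which is an assumption about the authors' setup rather than something the slide states.

```python
import xml.etree.ElementTree as ET

# Values copied from the configuration slide. Which file each one lives in
# is an assumption based on the usual Hadoop 1.x layout.
MAPRED_SITE = {
    "mapred.tasktracker.map.tasks.maximum": "12",
    "mapred.tasktracker.reduce.tasks.maximum": "12",
    "mapred.map.tasks": "48",
    "mapred.reduce.tasks": "45",
    # Written in lowercase; the slide shows TRUE.
    "mapreduce.tasktracker.outofband.heartbeat": "true",
}
HDFS_SITE = {"dfs.replication": "2"}


def render_site_xml(properties):
    """Render a dict as a Hadoop <configuration> document (hypothetical helper)."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(MAPRED_SITE))  # candidate mapred-site.xml content
    print(render_site_xml(HDFS_SITE))    # candidate hdfs-site.xml content
```

With 4 DataNodes and 12 map slots per task tracker, the cluster offers 4 × 12 = 48 map slots, which is consistent with the value of 48 given for mapred.map.tasks.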
HADOOP SCHEDULER EVALUATION
- Data Processed per Second
- Turnaround Time
- Running Time
- Waiting Time
- Throughput
DATA PROCESSED PER SECOND
[Figure: total running time (10^3 s) of running the full workload with each of the five schedulers: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]
[Figure: Data Processed per Second (DPS, MB/s) of running the full workload with each of the five schedulers: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]
TURNAROUND TIME
[Figure: average job turnaround time (10^3 s) of running the full workload with each of the five schedulers: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]
AVERAGE JOB RUNNING TIME & WAITING TIME
[Figure: average job running time (10^3 s) of running the full workload with each of the five schedulers: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]
[Figure: average job waiting time (s) of running the full workload with each of the five schedulers: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]
THROUGHPUT
[Figure: throughput (jobs processed per minute) of running the full workload with each of the five schedulers: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]
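The five metrics reported on the evaluation slides can be computed from per-job timestamps. The sketch below is a minimal illustration, not the authors' measurement code: the `JobRecord` fields and the `summarize` helper are hypothetical, and the formulas are assumptions (turnaround = finish minus submit, waiting = start minus submit, running = finish minus start, total running time = first submission to last completion, DPS = total input size divided by total running time).

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Per-job log entry; the field names are hypothetical."""
    submit_s: float   # time the job was submitted
    start_s: float    # time the job started running
    finish_s: float   # time the job completed
    input_mb: float   # input data size in megabytes


def summarize(jobs):
    """Compute the five metrics for one scheduler run (assumed definitions)."""
    n = len(jobs)
    # Total running time: first submission to last completion.
    total_s = max(j.finish_s for j in jobs) - min(j.submit_s for j in jobs)
    return {
        "total_running_time_s": total_s,
        "dps_mb_per_s": sum(j.input_mb for j in jobs) / total_s,
        "avg_turnaround_s": sum(j.finish_s - j.submit_s for j in jobs) / n,
        "avg_running_s": sum(j.finish_s - j.start_s for j in jobs) / n,
        "avg_waiting_s": sum(j.start_s - j.submit_s for j in jobs) / n,
        "throughput_jobs_per_min": n / (total_s / 60.0),
    }


if __name__ == "__main__":
    # Two made-up jobs, used only to exercise the function.
    demo = [JobRecord(0, 10, 70, 600), JobRecord(20, 30, 120, 600)]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.2f}")
```

Under these definitions a job's turnaround time is the sum of its waiting time and its running time, so a scheduler can shorten turnaround either by starting jobs sooner or by running them faster.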
EVALUATION RESULT ANALYSIS
- Fair sharing with delay scheduling is the most efficient scheduler
  - Some large jobs take longer to finish than usual
- Fair sharing with delay scheduling, naïve fair sharing, and capacity scheduling all perform better than the default FIFO scheduler
- The HOD scheduler did not perform well, affected by the extra cost of virtualization
CONCLUSIONS & FUTURE WORK
- Optimizing the performance of Hadoop clusters is necessary and significant
- The choice of task scheduler is critical for improving the system performance of a Hadoop cluster
- With fair sharing with delay scheduling, DPS improves by 20% over the FIFO scheduler
- The optimization and design of a scheduler need to take the characteristics of the workload into account
- In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop-based cloud systems
Q & A THANKS! E-MAIL: SOUNDER_LIU@163.COM, XUJG@UCAS.AC.CN