SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU & RICE UNIVERSITY

EVALUATING TASK SCHEDULING IN HADOOP-BASED CLOUD SYSTEMS
SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU
UNIVERSITY OF CHINESE ACADEMY OF SCIENCES & RICE UNIVERSITY
2013-9-30

OUTLINE
- Background & Motivation
- Hadoop Task Scheduler
- Benchmark & Methodology
- Evaluation
- Conclusions & Future Work

PRIVATE CLOUD
The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.
Source: "The NIST Definition of Cloud Computing", National Institute of Standards and Technology. Retrieved 24 July 2011.

MOTIVATION
A private cloud serves multiple users:
- Different task priorities
- Different task types
- Different task data sizes
Optimizing the performance of a private cloud is necessary and urgent.
A challenge for task scheduling!


HADOOP OVERVIEW
Hadoop: an open-source software framework for processing a large volume of data on a cluster.

HADOOP TASK SCHEDULER
- FIFO
- Naïve Fair Sharing
- Fair Sharing with Delay Scheduling
- Capacity Scheduling
- HOD
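The slides do not describe how the schedulers work internally. As a rough illustration of the idea commonly associated with delay scheduling, the Python sketch below lets a job be skipped a limited number of times while it waits for a node that holds its input data. It is a simplified sketch, not the Hadoop Fair Scheduler's code; the names `Job`, `assign_slot`, and `max_skips` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pending: list          # each pending task is the set of nodes holding its input
    running: int = 0       # number of tasks currently running
    skip_count: int = 0    # how many times this job was passed over


def assign_slot(node, jobs, max_skips):
    """Pick a task for one free slot on `node` (hypothetical helper).

    Jobs with fewer running tasks are considered first (fair sharing).
    A job without local data on `node` is skipped until it has been
    skipped `max_skips` times; after that it may run non-locally.
    """
    for job in sorted(jobs, key=lambda j: j.running):
        if not job.pending:
            continue
        local = next((t for t in job.pending if node in t), None)
        if local is not None:
            job.pending.remove(local)
            job.running += 1
            job.skip_count = 0
            return job.name, "local"
        if job.skip_count >= max_skips:
            job.pending.pop(0)
            job.running += 1
            return job.name, "non-local"
        job.skip_count += 1
    return None
```

With `max_skips = 0` the sketch degenerates to naïve fair sharing, where the job that is furthest below its share always gets the slot regardless of data locality.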


CLOUDRANK-D
- A benchmark presented by ICT of CAS
- A benchmark suite for private cloud
- Helps researchers simulate various multi-user applications in industrial scenarios
- Provides a set of 13 representative data analysis tools:
  - Basic operations
  - Data mining operations
  - Data warehouse operations

DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D
Application | Data sources
Sort, Word count, Grep | Automatically generated
Naive Bayes | News and Wikipedia
Support vector machine | Scientist search
K-means | Sougou corpus
Item based collaborative filtering | Ratings on movies
Frequent pattern growth | Retail market basket data; click-stream data of an on-line news portal; traffic accident data; collection of web HTML documents
Hidden Markov model | Scientist search
Grep select, Ranking select, User visits aggregation, User visits-rankings join | Automatically generated table


WORKLOAD DESIGN
[Figure: pie chart of applications in private clouds by percentage. Categories: image processing, log processing, data mining, reporting, text indexing, web crawling, machine learning, data storage. Slice values: 17%, 17%, 16%, 15%, 15%, 11%, 7%, 2%.]
[Figure: pie chart of applications in CloudRank-D. Data mining operations (Naive Bayes, SVM, HMM, IBCF, FPG, and others) 35%; Hive 34%; Basic Operations 31%.]

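WORKLOAD DESIGN
Category | Application | Jobs
Basic Operations | Sort | 9
Basic Operations | Word count | 11
Basic Operations | Grep | 11
Data Mining Operations | Naïve Bayes | 6
Data Mining Operations | Support vector machine | 6
Data Mining Operations | K-means | 7
Data Mining Operations | Item based collaborative filtering | 3
Data Mining Operations | Frequent pattern growth | 7
Data Mining Operations | Hidden Markov model | 6
Data Warehouse Operations | Grep select, Ranking select, User visits aggregation, User visits-rankings join | 34
Total | | 100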

JOB SUBMITTING
- Input data sizes follow the distribution of input data size in Taobao.
- Job submission intervals follow an exponential distribution with a mean of 14 seconds (Facebook).
- Jobs are submitted in a random order.
Input data size | Percentage
<25MB | 40.57%
25MB-625MB | 39.33%
1.2GB-5GB | 12.03%
>5GB | 8.07%
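The submission rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the seed, and the single `hive_query` label standing in for the 34 data warehouse jobs are hypothetical, and the slides do not say how the authors generated their schedule. The job counts and size percentages are taken from the workload and job-submission slides.

```python
import random

# Job counts per application from the workload design slide.
# "hive_query" is a placeholder label for the 34 data warehouse jobs,
# whose per-application split is not given.
JOB_COUNTS = {
    "sort": 9, "word_count": 11, "grep": 11,
    "naive_bayes": 6, "svm": 6, "k_means": 7,
    "ibcf": 3, "fpg": 7, "hmm": 6,
    "hive_query": 34,
}

# Input data size buckets and their share of jobs (percent).
SIZE_BUCKETS = [
    ("<25MB", 40.57),
    ("25MB-625MB", 39.33),
    ("1.2GB-5GB", 12.03),
    (">5GB", 8.07),
]

MEAN_INTERVAL_S = 14.0  # mean gap between submissions, in seconds


def build_job_list():
    """Expand the per-application counts into a flat list of job names."""
    return [name for name, count in JOB_COUNTS.items() for _ in range(count)]


def make_schedule(job_names, seed=0):
    """Return (submit_time_s, job_name, size_bucket) tuples in submission order."""
    rng = random.Random(seed)
    order = list(job_names)
    rng.shuffle(order)                      # random submission order
    labels = [label for label, _ in SIZE_BUCKETS]
    weights = [share for _, share in SIZE_BUCKETS]
    now = 0.0
    schedule = []
    for name in order:
        now += rng.expovariate(1.0 / MEAN_INTERVAL_S)   # exponential gap
        bucket = rng.choices(labels, weights=weights, k=1)[0]
        schedule.append((round(now, 1), name, bucket))
    return schedule
```

Calling `make_schedule(build_job_list())` yields 100 submissions; fixing the seed keeps the schedule identical across schedulers, so that each scheduler is compared on the same workload.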

TESTBED
Hadoop cluster with 5 nodes (1 NameNode, 4 DataNodes)
CPU Type | Intel Xeon E5645
CPU Core | 6 cores @ 2.40 GHz
L1 D/I Cache | 6 × 32 KB
L2 Cache | 6 × 256 KB
L3 Cache | 12 MB
Memory | 16 GB
Disk | 8 TB
OS | CentOS 5.5
Hadoop | 1.0.2
Mahout | 0.6
Hive | 0.11

HADOOP CONFIGURATION
Hadoop Parameter | Value | Description
mapred.tasktracker.map.tasks.maximum | 12 | The maximum number of map tasks that will be executed simultaneously by a task tracker.
mapred.tasktracker.reduce.tasks.maximum | 12 | The maximum number of reduce tasks that will be executed simultaneously by a task tracker.
mapred.map.tasks | 48 | Maximum number of concurrently running map tasks.
mapred.reduce.tasks | 45 | Maximum number of concurrently running reduce tasks.
dfs.replication | 2 | The number of replicas specified when a file is created.
mapreduce.tasktracker.outofband.heartbeat | true | Enables the out-of-band heartbeat.
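For readers who want to reproduce these settings, the Python sketch below renders them in the `<configuration>`/`<property>` layout that Hadoop site files use. The split into MapReduce and HDFS properties follows the usual Hadoop 1.x convention (`mapred-site.xml` and `hdfs-site.xml`) and is an assumption here, as is the helper name `render_site_xml`; the slides only list the parameter values.

```python
import xml.etree.ElementTree as ET

# Values copied from the configuration slide.
MAPRED_SITE = {
    "mapred.tasktracker.map.tasks.maximum": "12",
    "mapred.tasktracker.reduce.tasks.maximum": "12",
    "mapred.map.tasks": "48",
    "mapred.reduce.tasks": "45",
    "mapreduce.tasktracker.outofband.heartbeat": "true",
}
HDFS_SITE = {
    "dfs.replication": "2",
}


def render_site_xml(properties):
    """Render a name/value mapping as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(MAPRED_SITE))
    print(render_site_xml(HDFS_SITE))
```

Note that 48 equals 4 DataNodes times 12 map slots per task tracker, which matches the cluster described on the testbed slide.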

HADOOP SCHEDULER EVALUATION
- Data Processed per Second
- Turnaround time
- Running time
- Waiting time
- Throughput
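One way to compute these five metrics from per-job timestamps is sketched below in Python. The slides define DPS as megabytes processed per second and throughput as jobs processed per minute; the remaining definitions (turnaround = finish minus submit, waiting = start minus submit, running = finish minus start) are standard usage assumed here, and the `JobRecord` type and `evaluate` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    submit_s: float   # time the job was submitted
    start_s: float    # time the job started running
    finish_s: float   # time the job finished
    input_mb: float   # input data size in megabytes


def evaluate(jobs):
    """Summarize one scheduler run over the full workload."""
    n = len(jobs)
    # Total running time of the workload: first submission to last completion.
    total_s = max(j.finish_s for j in jobs) - min(j.submit_s for j in jobs)
    return {
        "total_running_time_s": total_s,
        "dps_mb_per_s": sum(j.input_mb for j in jobs) / total_s,
        "avg_turnaround_s": sum(j.finish_s - j.submit_s for j in jobs) / n,
        "avg_waiting_s": sum(j.start_s - j.submit_s for j in jobs) / n,
        "avg_running_s": sum(j.finish_s - j.start_s for j in jobs) / n,
        "throughput_jobs_per_min": n / (total_s / 60.0),
    }
```

Under these definitions the average turnaround time is the sum of the average waiting time and the average running time, which gives a quick consistency check on measured values.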

DATA PROCESSED PER SECOND
[Figure: the total running time (10^3 s) of running the full workload under each of the five schedulers (Fair with DS, Naïve Fair, Capacity, FIFO, HOD).]
[Figure: the Data Processed per Second (DPS, megabytes processed per second) of running the full workload under each of the five schedulers.]

TURNAROUND TIME
[Figure: the average job turnaround time (10^3 s) of running the full workload under each of the five schedulers (Fair with DS, Naïve Fair, Capacity, FIFO, HOD).]

AVERAGE JOB RUNNING TIME & WAITING TIME
[Figure: the average job running time (10^3 s) of running the full workload under each of the five schedulers (Fair with DS, Naïve Fair, Capacity, FIFO, HOD).]
[Figure: the average job waiting time (seconds) of running the full workload under each of the five schedulers.]

THROUGHPUT
[Figure: the throughput (number of jobs processed in one minute) of running the full workload under each of the five schedulers (Fair with DS, Naïve Fair, Capacity, FIFO, HOD).]

EVALUATION RESULT ANALYSIS
- The fair scheduler with delay scheduling is the most efficient scheduler.
  - Some large jobs take longer to finish than usual jobs.
- The fair scheduler with delay scheduling, the naïve fair scheduler, and the capacity scheduler all perform better than the default FIFO scheduler.
- The HOD scheduler did not perform well; it is affected by the extra cost of virtualization.

CONCLUSIONS & FUTURE WORK
- Optimizing the performance of Hadoop clusters is necessary and significant.
- The choice of task scheduler is critical for improving the system performance of a Hadoop cluster.
- With fair sharing with delay scheduling, DPS improves by 20% over the FIFO scheduler.
- Optimization and design of the scheduler need to take the characteristics of the workload into account.
- In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop-based cloud systems.

Q & A THANKS! E-MAIL: SOUNDER_LIU@163.COM, XUJG@UCAS.AC.CN