Omega: flexible, scalable schedulers for large compute clusters. Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes

Similar documents
Cluster management at Google

Resource Scheduling Architectural Evolution at Scale and Distributed Scheduler Load Simulator

Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis

On Cloud Computational Models and the Heterogeneity Challenge

Apache Mesos. Delivering mixed batch & real-time data infrastructure , Galway, Ireland

IBM Research Report. Megos: Enterprise Resource Management in Mesos Clusters

Data Management in the Cloud CISC 878. Patrick Martin Goodwin 630

Graph Optimization Algorithms for Sun Grid Engine. Lev Markov

Moreno Baricevic Stefano Cozzini. CNR-IOM DEMOCRITOS Trieste, ITALY. Resource Management

Reducing Execution Waste in Priority Scheduling: a Hybrid Approach

Hawk: Hybrid Datacenter Scheduling

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

Preemptive Resource Management for Dynamically Arriving Tasks in an Oversubscribed Heterogeneous Computing System

Performance-Oriented Software Architecture Engineering: an Experience Report

Resource Critical Path Approach to Project Schedule Management

Dominant Resource Fairness: Fair Allocation of Multiple Resource Types

Application Migration Patterns for the Service Oriented Cloud

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

Cost Optimization for Cloud-Based Engineering Simulation Using ANSYS Enterprise Cloud

Grid 2.0 : Entering the new age of Grid in Financial Services

IBM storage solutions: Evolving to an on demand operating environment

OPERATING SYSTEMS. Systems and Models. CS 3502 Spring Chapter 03

Lazy Evaluation of Transactions in Database Systems

Preemptive ReduceTask Scheduling for Fair and Fast Job Completion

Process Manufacturing

Building a Multi-Tenant Infrastructure for Diverse Application Workloads

Performance-Driven Task Co-Scheduling for MapReduce Environments

Bigger, Longer, Fewer: what do cluster jobs look like outside Google?

DOWNTIME IS NOT AN OPTION

Managing Customer Specific Projects Tomas Nyström

A simulated annealing approach to task allocation on FPGAs

Dynamic Fractional Resource Scheduling for HPC Workloads

InfoSphere DataStage Grid Solution

The Landscape of Enterprise Applications - a personal view

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Sensitivity and Risk Path Analysis John Owen, Vice President Barbecana, Inc.

Real World System Development. November 17, 2016 Jon Dehn

CMS readiness for multi-core workload scheduling

STAR-CCM+ Performance Benchmark. August 2010

ANSYS, Inc. March 12, ANSYS HPC Licensing Options - Release

AN ENERGY EFFICIENT SCHEME FOR CLOUD RESOURCE PROVISIONING USING TASK SCHEDULING STRATEGY BASED ON VACATION QUEUING THEORY

HPC ATM simulator for the performance assessment of TBO

Priority-enabled Scheduling for Resizable Parallel Applications

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

CPU Scheduling. Disclaimer: some slides are adopted from book authors and Dr. Kulkarni s slides with permission

HPC IN THE CLOUD Sean O Brien, Jeffrey Gumpf, Brian Kucic and Michael Senizaiz

NSF {Program (NSF ) first announced on August 20, 2004} Program Officers: Frederica Darema Helen Gill Brett Fleisch

Understanding Huawei Enterprise Storage Concepts

Platform Solutions for CAD/CAE 11/6/2008 1

MapReduce Scheduler Using Classifiers for Heterogeneous Workloads

rid Computing in he Industrial Sector

HLRS. The High Performance Computing. Center Stuttgart

Energy Efficient MapReduce Task Scheduling on YARN

Software-Defined Storage: A Buyer s Guide

Efficient Task Scheduling Over Cloud Computing with An Improved Firefly Algorithm

IBM z/tpf To support your business objectives. Optimize high-volume transaction processing on the mainframe.

BA /14/2010. PSU Problem Solving Process. The Last Step 5. The Last Step 8. Project Management Questions PLAN IMPLEMENT EVALUATE

Push vs. Pull: How Pull-driven Order Fulfillment Enables Greater Flexibility in Distribution Operations. Fortna

Data Center Operating System (DCOS) IBM Platform Solutions

Towers Watson Dr Andy Lingard

Towards Modelling-Based Self-adaptive Resource Allocation in Multi-tiers Cloud Systems

Mesos and Borg (Lecture 17, cs262a) Ion Stoica, UC Berkeley October 24, 2016

Digital Transformation & olvency II Simulations for L&G: Optimizing, Accelerating and Migrating to the Cloud

Cluster Workload Management

Got Hadoop? Whitepaper: Hadoop and EXASOL - a perfect combination for processing, storing and analyzing big data volumes

Understanding and Managing Uncertainty in Schedules

Job Scheduling in Cluster Computing: A Student Project

Extending on-premise HPC to the cloud

OPERATIONS RESEARCH Code: MB0048. Section-A

Order Streaming: A New Take On Fulfillment

Data-Powered Clouds: Challenges for Data Management, Cloud Computing, and Software

Andrew Burkimsher, Iain Bate, Leandro Soares Indrusiak. Department of Computer Science, University of York, York, YO10 5GH, UK

High-priority and high-response job scheduling algorithm

Oracle Financial Services Revenue Management and Billing V2.3 Performance Stress Test on Exalogic X3-2 & Exadata X3-2

Understanding Cloud. #IBMDurbanHackathon. Presented by: Britni Lonesome IBM Cloud Advisor

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake

Interactive Simulation

Enterprise-Scale MATLAB Applications

HTCaaS: Leveraging Distributed Supercomputing Infrastructures for Large- Scale Scientific Computing

MBH1123 Project Scope, Time and Cost Management Prepared by Dr Khairul Anuar. Lecture 8 Project Time Management - IV

Delivering High Performance for Financial Models and Risk Analytics

White paper A Reference Model for High Performance Data Analytics(HPDA) using an HPC infrastructure

Multi-Stage Resource-Aware Scheduling for Data Centers with Heterogenous Servers

Pricing for Utility-driven Resource Management and Allocation in Clusters

The State of Kubernetes 2018

CHAPTER 6 DYNAMIC SERVICE LEVEL AGREEMENT FOR GRID RESOURCE ALLOCATION

Cloudera, Inc. All rights reserved.

Integration of Titan supercomputer at OLCF with ATLAS Production System

Moab and TORQUE Achieve High Utilization of Flagship NERSC XT4 System

Enabling Resource Sharing between Transactional and Batch Workloads Using Dynamic Application Placement

Introduction to IBM Technical Computing Clouds IBM Redbooks Solution Guide

Configuration Management

Dell and JBoss just work Inventory Management Clustering System on JBoss Enterprise Middleware

Creating an Enterprise-class Hadoop Platform Joey Jablonski Practice Director, Analytic Services DataDirect Networks, Inc. (DDN)

St Louis CMG Boris Zibitsker, PhD

Verification and Validation

How Container Schedulers and Software-Defined Storage will Change the Cloud

Testing SLURM batch system for a grid farm: functionalities, scalability, performance and how it works with Cream-CE

NEXT GENERATION PREDICATIVE ANALYTICS USING HP DISTRIBUTED R

Address system-on-chip development challenges with enterprise verification management.

Transcription:

Omega: flexible, scalable schedulers for large compute clusters Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes

Cluster Scheduling Shared hardware resources in a cluster Run a mix of workloads on the same set of machines Problem: Allocation of resources to different tasks from a resource pool in a cluster Schedule tasks on machines based on available resources

Cluster Scheduling not a new problem In HPC community (Maui, Moab, Platform LSF) What s new? Google scale Need for flexibility (changing policies, constraints) Heterogeneity (hardware, workloads, )

Characterizing Google s Workloads Workload composed of jobs Each job composed of multiple tasks Workload split Long running service jobs (e.g., web services) Shorter batch jobs (e.g., MapReduce jobs)

Characterizing Google s Workloads J=Jobs, T=Tasks, C=CPU-core-seconds, R=RAM-GB-seconds Most jobs are batch, but most resources are consumed by service jobs.

Characterizing Google s Workloads Batch Service Service jobs run for much longer than batch jobs

Characterizing Google s Workloads Batch Service Service jobs have much fewer tasks than batch jobs

Goals A cluster scheduling architecture that ensures: High resource utilization (utilization) Conformity to user policies, and ability to add new policies (flexibility) Should scale to large clusters (scalability)

What about existing solutions? S1 S2 S3 Scheduler S1 S2 S3 Resource Mgr Monolithic - Hard to add new policies - Scalability bottleneck Static Partitioning - Poor utilization - Inflexible allocation Two-level - Locks resource during offer (pessimistic) - Limited state information

Omega s Approach S1 S2 S3 Shared State

Allocation? S1 S2 S3 Conflict

Allocation? Failed allocation S1 S2 S3

What s the trade-off? Two-level scheduling (e.g., Mesos) Limits parallelism (pessimistic locking) Schedulers have restricted visibility of resources Shared state with optimistic concurrency control Eliminates the two issues with two-level approach Cost: Wasted work when optimistic assumption fails

Simulation Study

Monolithic, single logic

Monolithic, fast-path

Mesos

What s Going On? S1 S2 S3 Resource Mgr Green receives offer of all available resources Blue s tasks finish Blue receives small offer Offer is insufficient for blue Blue receives small offer Offer is insufficient for blue [repeat] Green finishes scheduling Blue receives larger offer By now, blue has given up

Omega

Effect of Parallelism

Takeaway? Omega s shared state model Performs as well as a complex monolithic multipath scheduler Can overcome its scalability issues by using multiple schedulers

Effect of Conflicts?

Conflict Fraction

Scheduler Busyness Interference is high for real-world settings

Case Study: Specialized MapReduce Scheduler

MapReduce Scheduler Opportunistically add more mappers/reducers as long as benefits are obtained Max-parallelism approach More policies in paper

Benefits of opportunistic allocation

Conclusions Cluster scheduling architecture with Parallelism Shared State Optimistic Concurrency control Enables Scalable scheduling Flexibility in scheduling policies Visibility to complete cluster state Potentially more efficient scheduling

Questions Is it okay to ignore certain global policies like fairness? it helps that fairness is not a primary concern in our environment: we are driven more by the need to meet business requirements. An underlying assumption is that decision times for schedulers can grow quite high (due to complicated scheduling); is this valid?

Questions Is the design too Google-specific? E.g., OCC works well when contention is low. In this context, contention is high if: Resources are few (small clusters) Too many tasks in a given time (high arrival rate) Number of schedulers is large (high parallelism) When does high contention become a bottleneck? Perhaps not an issue for Google s clusters, but in general Google can afford to over-provision resources, is it possible in general?