Dynamic Fractional Resource Scheduling for HPC Workloads


Dynamic Fractional Resource Scheduling for HPC Workloads

Mark Stillwell (1), Frédéric Vivien (2), Henri Casanova (1)

(1) Department of Information and Computer Sciences, University of Hawai'i at Mānoa
(2) INRIA, France

The 24th IEEE International Parallel and Distributed Processing Symposium, April 19-23, 2010, Atlanta, USA

High Performance Computing

- Today, HPC usually means using clusters: homogeneous nodes connected via a high-speed network. Clusters are ubiquitous, but large ones are expensive.
- Users submit requests to run jobs:
  - A job is made up of nearly identical tasks; the number of tasks is generally specified by the user.
  - Tasks can block while communicating with each other.
  - Most systems put each task on a dedicated node.
  - Many jobs are serial; a few require all of the system's nodes.
- Jobs are temporary: the user wants a final result, and quick turnaround relative to runtime is desired.
- Jobs may have to wait until resources are available to start. The assignment of resources to jobs is called scheduling.

Current HPC Scheduling Approaches

- Batch scheduling, which no one likes:
  - Usually FCFS with backfilling.
  - Backfilling needs (unreliable) compute-time estimates.
  - Unbounded wait times.
  - Inefficient use of nodes/resources.
- Gang scheduling, which no one uses:
  - Globally coordinated time sharing.
  - Complicated and slow; memory pressure is a concern.
  - Large granularity limits improvement over batch scheduling.

Our Proposal

- Use virtual machine technology:
  - Multiple tasks on one node.
  - Sharing of fractional resources.
  - Similar to preemption.
  - Performance isolation.
- Define a run-time computable metric that captures notions of performance and fairness.
- Design heuristics that allocate resources to jobs while explicitly trying to achieve high ratings by our metric.

Requirements, Needs, and Yield

- Tasks have memory requirements and CPU needs; all tasks of a job have the same requirements and needs.
- For a task to be placed on a node, there must be available memory at least equal to its requirement.
- A task can be allocated less CPU than its need; the ratio of the allocation to the need is the yield.
- All tasks of a job must have the same yield, so we can also speak of the yield of a job.
- The yield of a job is the rate at which it progresses toward completion, relative to the rate if it were run on a dedicated system.
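
As a concrete illustration, the yield definition above reduces to a one-line computation (a hypothetical helper, not the authors' implementation):

```python
def task_yield(cpu_allocation: float, cpu_need: float) -> float:
    """Yield = fraction of its CPU need a task actually receives (capped at 1.0)."""
    return min(cpu_allocation / cpu_need, 1.0)

# A task needing a full CPU but allocated half of one progresses at half
# the rate it would on a dedicated system.
print(task_yield(0.5, 1.0))  # 0.5
```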

Stretch

- Our goal: minimize the maximum stretch (a.k.a. slowdown).
- Stretch: the time a job spends in the system divided by the time it would spend in a dedicated system [Bender et al., 1998].
- Popular for quantifying schedule quality post-mortem, but not generally used to make scheduling decisions: computing it at runtime requires (unreliable) user estimates.
- Minimizing average stretch is prone to starvation; minimizing maximum stretch captures notions of both performance and fairness.
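
A toy numeric example (hypothetical numbers, not from the paper) shows why the average stretch can hide starvation while the maximum stretch exposes it:

```python
# Each entry: (time the job spent in the system, time on a dedicated system).
jobs = [(10, 10), (12, 10), (11, 10),  # three short jobs, barely delayed
        (500, 10)]                     # one starved job

stretches = [in_system / dedicated for in_system, dedicated in jobs]

print(sum(stretches) / len(stretches))  # average ~= 13.3, looks tolerable
print(max(stretches))                   # maximum = 50.0, starvation is visible
```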

Approach

- Job arrival and completion times are not known in advance.
- We avoid the use of runtime estimates.
- Instead we focus on maximizing the minimum yield, which is similar to, but not the same as, minimizing the maximum stretch.

Task Placement Heuristics

We apply the task placement heuristics studied in our previous work [Stillwell et al., 2008, Stillwell et al., 2009]:

- Greedy task placement: incremental; puts each task on the node with the lowest computational load on which it can fit without violating memory constraints.
- MCB task placement: global; iteratively applies multi-capacity (vector) bin-packing heuristics during a binary search for the maximized minimum yield. Much better placement than greedy, but can cause lots of migration.

But what if the system is oversubscribed? We need a priority function to decide which jobs to run.
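
A minimal sketch of the greedy placement rule under an assumed data model (the actual system places VM-hosted tasks with vector resource needs; the dict-based nodes here are illustrative only):

```python
def greedy_place(task_mem, task_cpu, nodes):
    """Place a task on the least CPU-loaded node with enough free memory.

    nodes: list of dicts with 'free_mem' and 'cpu_load'. Returns the index
    of the chosen node, or None if the task fits nowhere (oversubscription).
    """
    candidates = [i for i, n in enumerate(nodes) if n["free_mem"] >= task_mem]
    if not candidates:
        return None  # a priority function must now pick a job to suspend
    best = min(candidates, key=lambda i: nodes[i]["cpu_load"])
    nodes[best]["free_mem"] -= task_mem
    nodes[best]["cpu_load"] += task_cpu
    return best

nodes = [{"free_mem": 2.0, "cpu_load": 0.9},
         {"free_mem": 4.0, "cpu_load": 0.3}]
print(greedy_place(1.0, 0.5, nodes))  # 1 (the less-loaded node)
```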

Priority Function?

Virtual time: the subjective time experienced by a job.

- First idea: 1 / (virtual time). Informed by ideas about fairness; led to good results, but is theoretically prone to starvation.
- Second idea: (flow time) / (virtual time). Addresses the starvation problem, but led to poor performance.
- Third idea: (flow time) / (virtual time)^2. Combines ideas #1 and #2: addresses starvation, and performs about the same as the first priority function.
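
Reading the three candidates as 1/(virtual time), (flow time)/(virtual time), and (flow time)/(virtual time)^2, they can be sketched as follows (an illustrative helper; the guard for zero virtual time is our addition, not from the slides):

```python
def priority(flow_time: float, virtual_time: float, idea: int = 3) -> float:
    """Higher value = stronger claim on resources."""
    if virtual_time <= 0:
        return float("inf")  # a job that has made no progress yet goes first
    if idea == 1:
        return 1.0 / virtual_time           # fair, but can starve jobs
    if idea == 2:
        return flow_time / virtual_time     # starvation-free, but slower
    return flow_time / virtual_time ** 2    # combines ideas #1 and #2

# Two jobs in the system for 100s: the one with only 10s of virtual
# progress outranks the one that has been running close to full speed.
print(priority(100.0, 10.0) > priority(100.0, 90.0))  # True
```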

Use of Priority

By greedy heuristics:
- GreedyP: greedily schedules tasks, suspending lower-priority tasks if necessary to run higher-priority tasks.
- GreedyPM: like GreedyP, but can also migrate tasks instead of suspending them.

By MCB:
- If no valid solution can be found for any yield value, remove the lowest-priority task and try again.

Resource Allocation

- Once tasks are placed on nodes, we iteratively maximize the minimum yield.
- Based on ideas about fairness from network resource allocation.
- Easy to compute, and slightly better than maximizing the average yield.
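
On a single node the maximized minimum yield has a simple closed form; the sketch below assumes CPU is the only constrained resource and stops at the equal-yield step (the full allocator then iterates, redistributing any leftover capacity across nodes):

```python
def allocate_node(needs, capacity=1.0):
    """Give every task placed on one node the same, maximal feasible yield.

    needs: CPU needs of the tasks on this node. Returns per-task CPU
    allocations; the common yield is capacity/sum(needs), capped at 1.0
    because no task can usefully receive more than its need.
    """
    y = min(1.0, capacity / sum(needs))
    return [y * need for need in needs]

# Three tasks whose needs total 2.0 on a capacity-1.0 node all get yield 0.5.
print(allocate_node([0.5, 0.5, 1.0]))  # [0.25, 0.25, 0.5]
```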

When to Apply Heuristics

We consider a number of different options:

- On job submission: heuristics can use greedy or bin-packing approaches.
- On job completion: as above; can help with throughput when there are lots of short-running jobs.
- Periodically: some heuristics periodically apply vector packing to improve overall job placement.

MCB-Stretch Algorithm

- Like MCB, but tries to minimize the maximum stretch.
- Requires knowledge of the time until the next rescheduling period; uses current and estimated future stretch.
- The second phase focuses on iteratively minimizing the maximum stretch.

Methodology

- Experiments were conducted using a discrete-event simulator on a mix of synthetic and real trace data.
- Experiments were run with and without migration penalties.
- Periodic approaches use a 600-second (10-minute) period.
- An absolute bound on the maximum stretch is computed for each instance; performance is compared by maximum-stretch degradation from this bound.

Max Stretch Degradation vs. Load, No Migration Cost

[Figure: maximum stretch degradation from bound (log scale) vs. system load (0.1 to 0.9), with a 0-second restart penalty. Schedulers compared: EASY, FCFS, Greedy*, GreedyP*, GreedyP/per, GreedyP*/per, Mcb8*/per, per, and stretch-per.]

Max Stretch Degradation vs. Load, 5-Minute Penalty

[Figure: maximum stretch degradation from bound (log scale) vs. system load (0.1 to 0.9), with a 300-second restart penalty. Schedulers compared: EASY, FCFS, Greedy*, GreedyP*, GreedyP/per, GreedyP*/per, Mcb8*/per, per, and stretch-per.]

Max Stretch Degradation vs. Load, 5-Minute Penalty

[Figure: maximum stretch degradation from bound (log scale) vs. system load (0.1 to 0.9), with a 300-second restart penalty, adding the variants GreedyP*/per/minvt:300 and Mcb8*/per/minvt:300 to the comparison.]

Bandwidth vs. Period

[Figure: average migration+preemption bandwidth (GB/s) vs. rescheduling period (2000 to 12000 seconds), for migr and nomigr variants with minvt:300 and minvt:600.]

Max Stretch Degradation vs. Period

[Figure: average maximum-stretch degradation from bound vs. rescheduling period (2000 to 12000 seconds), for migr and nomigr variants with minvt:300 and minvt:600.]

Conclusions

- DFRS approaches can significantly outperform traditional approaches.
- Aggressive repacking can lead to much better resource allocations, but also to heavy migration costs.
- A combination of opportunistic greedy scheduling with limited periodic repacking has the best average-case performance across all load levels.
- Bandwidth costs can be reduced by extending the period without much loss of performance.
- Greedy migration is not that useful.
- Attempting to maximize the minimum yield does about as well as trying to minimize the maximum stretch.

Summary

- We have proposed Dynamic Fractional Resource Scheduling (DFRS), a novel approach to job scheduling on clusters that uses modern virtual machine technology and seeks to optimize a runtime-computable, user-centric measure of performance: the minimum yield.
- Our approach avoids the use of unreliable runtime estimates.
- DFRS has the potential to deliver order-of-magnitude performance improvements over current technology.
- Overhead costs from migration are manageable.

References

Bender, M. A., Chakrabarti, S., and Muthukrishnan, S. (1998). Flow and Stretch Metrics for Scheduling Continuous Job Streams. In Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms, pages 270-279.

Stillwell, M., Schanzenbach, D., Vivien, F., and Casanova, H. (2008). Resource Allocation using Virtual Clusters. Technical Report ICS2008-09-01, Department of Information and Computer Sciences, University of Hawai'i at Mānoa.

Stillwell, M., Schanzenbach, D., Vivien, F., and Casanova, H. (2009). Resource Allocation using Virtual Clusters. In CCGrid 2009, pages 260-267. IEEE.

Stillwell, M., Vivien, F., and Casanova, H. (2010). Dynamic Fractional Resource Scheduling for HPC Workloads. In IPDPS 2010. To appear.