Cluster Workload Management


1 Cluster Workload Management
Goal: maximise the delivery of resources to jobs, given job requirements and local policy restrictions.
Three parties:
- Users: supply the job requirements
- Administrators: describe local use policies
- Workload management software: monitors the state of the cluster, schedules the jobs and tracks resource usage
Some or all of the following activities are performed:
- Queuing
- Scheduling
- Monitoring
- Resource management
- Accounting

2 Queuing
Job submission usually consists of two primary parts:
- Resource directives (e.g. the amount of memory, the number of CPUs needed)
- Job description (e.g. job name, the location of the required input files)
Once submitted, jobs are held in the queue until matching resources become available.

3 Scheduling
Determining at what time a job should be put into execution, and on what resources.
There are a variety of metrics to measure scheduling performance:
- System-oriented metrics (e.g. throughput, utilisation, average response time of all jobs)
- User-oriented metrics (e.g. response time of a job submitted by a particular user)
These metrics can conflict with each other, so a balance must be struck.
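The two families of metrics above can be computed directly from job timestamps. A minimal sketch (the job records and their values are hypothetical, not from the slides): throughput is a system-oriented metric, while response time is user-oriented.

```python
# Sketch: computing scheduling metrics for a set of completed jobs,
# each described by hypothetical (submit, start, finish) timestamps.
jobs = [
    {"submit": 0, "start": 0, "finish": 4},
    {"submit": 1, "start": 4, "finish": 6},
    {"submit": 2, "start": 6, "finish": 7},
]

makespan = max(j["finish"] for j in jobs) - min(j["submit"] for j in jobs)
throughput = len(jobs) / makespan  # system-oriented: jobs completed per time unit
avg_response = sum(j["finish"] - j["submit"] for j in jobs) / len(jobs)  # user-oriented

print(throughput, avg_response)
```

A scheduler that maximises throughput may do so by delaying some individual jobs, which is exactly the conflict noted above.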

4 Monitoring
Providing information to administrators, users and the scheduling system on the status of jobs and resources.
The method of collection may differ between workload management systems, but the general purposes are the same.

5 Resource management
Handling the details of:
- Starting a job under the identity of the user
- Stopping a job
- Cleaning up the mess left behind after the job either completes or is aborted
- Removing or adding resources
In a batch system, jobs are put into execution in such a way that users need not be present during execution.
In interactive systems, users have to be present to supply arguments or information during the execution of their jobs.

6 Accounting
Accounting for which users are using what resources, and for how long.
Collecting resource usage data (e.g. job owner, resources requested by the job, total amount of resources consumed by the job).
Accounting data can be used for:
- Producing system usage and user usage reports
- Tuning the scheduling policy
- Calculating future resource allocations
- Anticipating future resource requirements by users
- Determining areas of improvement within the cluster
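A usage report of the kind listed above is just an aggregation over accounting records. A minimal sketch with a hypothetical record format (real accounting logs differ per workload manager):

```python
# Sketch: aggregating per-user CPU-hours from hypothetical accounting
# records, as a basis for a user usage report.
from collections import defaultdict

records = [
    {"owner": "alice", "cpus": 4, "hours": 2.0},
    {"owner": "bob",   "cpus": 1, "hours": 8.0},
    {"owner": "alice", "cpus": 2, "hours": 1.5},
]

usage = defaultdict(float)
for r in records:
    usage[r["owner"]] += r["cpus"] * r["hours"]  # CPU-hours consumed

for owner, cpu_hours in sorted(usage.items()):
    print(f"{owner}: {cpu_hours} CPU-hours")
```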

7 PBS
PBS, the Portable Batch System, is a flexible workload management and job scheduling system, originally developed at NASA.
Different versions of PBS:
- OpenPBS
- PBSpro
- Torque (recommended)
Three key system daemons:
- pbs_server: runs on the head node; the centre of PBS
- pbs_mom: runs on the compute nodes; actually places jobs into execution
- pbs_sched: schedules jobs

8 PBS
PBS job submission script:
    #!/bin/sh
    #PBS -l walltime=1:00:00
    #PBS -l mem=400mb
    #PBS -l ncpus=4
    cd ${HOME}/PBS/test
    mpirun -np 4 myprogram
Submitting a job: % qsub myscriptfile
Inquiring the status of a job: % qstat
Deleting a job: % qdel

9 Maui
Developed by the Maui High Performance Computing Center and other partners.
A job scheduler that can interact with a number of different resource managers (e.g. PBS).
Maui is an external scheduler: it does not include a resource manager but rather extends the capabilities of existing resource managers.
- The underlying resource manager continues to maintain responsibility for managing nodes and tracking jobs.
- Maui uses the APIs of the resource managers (e.g. PBS) to obtain system information.
- Maui controls the decisions of when, where, and how jobs will run.

10 Scheduling Policies
The simplest policy: First-Come First-Served (FCFS).
- Jobs are initiated in the same order as they are submitted.
- Does not require prior knowledge about tasks (e.g. runtime).
- Problem: a job can block other jobs from starting, despite there being no performance benefit to either user.
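The blocking problem can be seen in a small simulation. This is a sketch (the cluster size and job list are invented for illustration): jobs request (nodes, runtime) and must start strictly in submission order.

```python
# Sketch of FCFS on a cluster with a fixed number of nodes: each job asks
# for (nodes, runtime); jobs start strictly in submission order.
def fcfs(jobs, total_nodes):
    running = []                      # (finish_time, nodes) of running jobs
    free, now, starts = total_nodes, 0, []
    for nodes, runtime in jobs:
        # the head-of-queue job waits until enough nodes are free
        while free < nodes:
            running.sort()
            finish, n = running.pop(0)
            now, free = finish, free + n
        starts.append(now)
        running.append((now + runtime, nodes))
        free -= nodes
    return starts

# 4-node cluster: the third job needs only 1 node and could run at t=0,
# but under FCFS it waits behind the 3-node job.
print(fcfs([(2, 10), (3, 5), (1, 1)], 4))  # start times [0, 10, 10]
```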

11 First-Come First-Served (figure: example FCFS schedule)

12 Backfilling
The problem with FCFS is that idle time (the sum of unused processing intervals) can be significant. One improvement is to backfill:
- Allow a job to start early if it does not delay the first job in the queue.

13 Backfilling (figure: example backfilled schedule)

14 Backfilling
Advantages:
- Utilisation is improved.
Disadvantages:
- Information about job execution times is required, and user estimates are usually inaccurate.
- It is a policy decision what to do if a job overruns; many administrators choose to terminate a job if it exceeds its allocated execution time, otherwise some users may deliberately underestimate the job length to get an earlier start time.
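The backfill rule ("do not delay the first job in the queue") can be sketched as a predicate. This is an EASY-backfill-style approximation with invented parameter names, not the exact test used by any particular scheduler: a candidate job may start now if it finishes (by its estimated runtime) before the head job's reserved start, or if it leaves enough free nodes for the head job.

```python
# Sketch of an EASY-style backfill test: may this candidate job start now
# without delaying the first (head) job in the queue?
def can_backfill(now, est_runtime, nodes, free_nodes,
                 head_reserved_start, head_nodes):
    if nodes > free_nodes:
        return False                            # not enough nodes free at all
    if now + est_runtime <= head_reserved_start:
        return True                             # done before the head job starts
    return free_nodes - nodes >= head_nodes     # still leaves room for the head job

# head job reserved to start at t=10 on 3 nodes; 2 nodes free now
print(can_backfill(0, 5, 1, 2, 10, 3))   # short job fits in the hole -> True
print(can_backfill(0, 20, 1, 2, 10, 3))  # would delay the head job -> False
```

Note that the test depends on `est_runtime`, the user-supplied estimate; this is exactly why inaccurate estimates (and the termination policy for overruns) matter.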

15 Backfilling (figure: a problem if the predicted runtime is wrong)

16 Scheduling Policies
Reservation: increasingly, user-based quality of service (QoS) is an important scheduling metric.
- In addition to normal scheduling, reservation services can be used to plan resource allocation.
- Users are able to set up a reserved block of processing capability that they can use at some point in the future.
- The task management system agrees to the reservation.
- Users are subsequently able to run jobs within their reservation.
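Before the task management system agrees to a reservation, it must check that capacity remains at every instant of the requested window. A minimal admission-check sketch (the interface and tuple format are hypothetical):

```python
# Sketch: admit an advance reservation (start, end, nodes) only if, at every
# instant of the window, enough nodes remain after the reservations already
# accepted. Reservations are hypothetical (start, end, nodes) tuples.
def admit(reservations, start, end, nodes, total_nodes):
    # usage is piecewise-constant, so checking at event points suffices
    points = {start, end} | {t for r in reservations for t in (r[0], r[1])}
    for t in sorted(points):
        if start <= t < end:
            used = sum(n for s, e, n in reservations if s <= t < e)
            if used + nodes > total_nodes:
                return False
    return True

existing = [(10, 20, 3)]              # 3 of 4 nodes reserved for t in [10, 20)
print(admit(existing, 0, 10, 2, 4))   # disjoint window: admitted
print(admit(existing, 15, 25, 2, 4))  # overlaps: only 1 node free there
```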

17 Designing parallel algorithms for clusters
The design stages of a parallel algorithm:
- Partitioning: decompose the problem.
- Communication: determine how the individual grains will communicate.
- Agglomeration: reduce communication costs.
- Mapping: map the tasks to individual PEs.

18 Partitioning
Data decomposition across PEs:
- The choice of data decomposition has an impact on performance.
- Different decompositions may create different communication patterns.
- Maximise the computation-to-communication ratio.
For the most efficient (least time) execution:
- Minimise communication.
- Distribute the workload evenly (load balance).

19 Partitioning the problem
Partition the problem as fine-grained as possible, exposing every opportunity for parallel execution.
- Try to partition both computation and data into disjoint sets.
- At a later stage, we will revisit this original partition and agglomerate the tasks that need intense communication.
- A good partition divides both the computation and the data into small pieces.

20 Partitioning approaches
Domain decomposition (data parallelism):
- First, determine the data associated with the problem.
- Then, determine a partition for the data, each partition with approximately the same size.
- Finally, associate computation with each partition of the data.
Functional decomposition (functional parallelism):
- First, decompose the computation associated with the problem.
- Then, work out the data that each partitioned computation needs to operate on.
Combination of the two:
- These two approaches can be applied to different components of a single problem to obtain efficient parallel algorithms.
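The two approaches can be contrasted on a toy problem. This is an illustrative sketch (the example data and stage names are invented): domain decomposition partitions the data and applies the same computation everywhere; functional decomposition partitions the computation into distinct stages.

```python
# Domain decomposition: partition the DATA into near-equal chunks,
# then apply the same computation to each chunk.
def domain_decompose(data, n_tasks):
    k, r = divmod(len(data), n_tasks)
    chunks, i = [], 0
    for t in range(n_tasks):
        size = k + (1 if t < r else 0)   # spread the remainder evenly
        chunks.append(data[i:i + size])
        i += size
    return chunks

print(domain_decompose(list(range(10)), 3))  # chunk sizes 4, 3, 3

# Functional decomposition: partition the COMPUTATION into distinct
# stages, then work out what data each stage needs to operate on.
stages = [
    ("filter",    lambda xs: [x for x in xs if x % 2 == 0]),
    ("transform", lambda xs: [x * x for x in xs]),
    ("reduce",    lambda xs: sum(xs)),
]
data = list(range(10))
for _, fn in stages:
    data = fn(data)
print(data)  # sum of squares of the even numbers 0..9
```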

21 Domain decomposition
The data to be decomposed may be:
- the input to the program,
- the output computed by the program, or
- intermediate values generated by the program.
Rule of thumb: focus on the data structure which is the largest or accessed most frequently.
The result is a number of tasks, each consisting of some data and a set of operations on that data. An operation in one task may require data from other tasks, thereby causing communication.

22 Illustration of domain decompositions
At the early stage, try to decompose the data as finely as possible, so a 3D decomposition is favoured.
The initial decomposition may need to be agglomerated (to a 2D or 1D decomposition), depending on the intensity of the communication.

23 Functional decomposition
Divide the computation into different tasks first. The data operated on by each task may be:
- disjoint (which is desired), or
- overlapping (if tasks overlap significantly, this indicates that domain decomposition should be used instead).
Functional decomposition can be used to expose the structure of the functional components of the program.

24 Granularity in partitioning
The grain size (or granularity) is particularly crucial: fine-grained vs. coarse-grained, that is, many small grains vs. a few large grains. How can the problem be decomposed into independent grains that can be processed in parallel?
- Fine-grained problems require systems that permit rapid synchronisation between grains: tight coupling using fast (expensive) interconnects.
- Coarse-grained problems tolerate loose coupling using slower (inexpensive) interconnects.
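The effect of grain size on the computation-to-communication ratio can be worked through for a 2-D block decomposition of a stencil computation (a standard back-of-envelope argument, sketched here with invented numbers): computation per task grows with the block area, communication with the block perimeter, so coarser grains communicate relatively less.

```python
# Sketch: for a square block of side s in a 2-D stencil computation,
# computation per step ~ area (s*s points updated) while communication
# per step ~ perimeter (4*s boundary points exchanged).
def comp_to_comm_ratio(block_side):
    computation = block_side ** 2
    communication = 4 * block_side
    return computation / communication

for side in (4, 16, 64):
    print(side, comp_to_comm_ratio(side))  # ratio grows linearly with side
```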

25 Communication
If the operations in one task require data from other tasks, communication has to occur.
Communication requirements are more difficult to determine in domain decomposition than in functional decomposition:
- In functional decomposition, the component tasks usually have clear interfaces, and the communication requirements are indicated by these interfaces.
- In domain decomposition, although the data are decomposed into disjoint sets, there may not be clean cuts between tasks.

26 Communication patterns
Local vs. global:
- In local communication, each task communicates with a small set of other tasks (its neighbours).
- In global communication, each task communicates with many tasks.
Structured vs. unstructured:
- In structured communication, a task and its neighbours form a regular structure (e.g. a tree or mesh).
- In unstructured communication, the tasks may form arbitrary graphs.
Static vs. dynamic:
- Static: the identity of the communicating tasks does not change over time.
- Dynamic: the identity of the communicating tasks is determined by data computed at runtime.
Synchronous vs. asynchronous:
- Synchronous: both sender and receiver are aware of the existence of each other.
- Asynchronous: senders don't know when receivers need the data.

27 Examples of local and global communication
Local communication: the black task only communicates with its four neighbouring tasks to update its state. The group of tasks from which task A needs data is termed task A's stencil.
Global communication: the central task communicates with all eight remaining tasks.
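Local stencil communication is often implemented as a "halo exchange": each task stores a copy of its neighbours' boundary values. A serial sketch of the idea in one dimension (the data layout is invented; a real implementation would use message passing, e.g. MPI):

```python
# Sketch (serial simulation) of LOCAL communication: each task holds a
# slice of a 1-D array and exchanges only its boundary ("halo") values
# with its left and right neighbours each step.
def halo_exchange(slices):
    halos = []
    for i, s in enumerate(slices):
        left = slices[i - 1][-1] if i > 0 else 0.0               # from left neighbour
        right = slices[i + 1][0] if i < len(slices) - 1 else 0.0 # from right neighbour
        halos.append((left, right))
    return halos

slices = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three tasks' local data
print(halo_exchange(slices))
```

Each task communicates a constant amount of data with at most two neighbours, regardless of how many tasks exist: the signature of local, structured, static communication.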

28 Structured and unstructured communication
(figures: structured communication; unstructured communication)

29 Agglomeration
In the first stage, the computation and data are partitioned without a particular target computer or its communications in mind. As a result:
- There may be intense communication between some tasks.
- The number of tasks may be much greater than the number of processors in the target computer system.
In this stage, we revisit the decisions made in the first two stages, aiming for a design that will execute efficiently on the target computer system, by:
- combining some tasks, and
- replicating data and/or computation.

30 Three goals in agglomeration
- Reducing communication costs
- Retaining flexibility in mapping decisions
- Reducing software engineering costs

31 Reducing communication costs
The 16 tasks in figure (a) are agglomerated into one task in figure (b).
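The saving can be quantified with a simple count. This sketch (the task counts are invented for illustration) counts messages per step for a 2-D stencil when a fine grid of tasks is agglomerated into fewer, coarser blocks:

```python
# Sketch: messages per step for a 2-D stencil laid out as an n x n grid of
# tasks. Each interior edge between two adjacent tasks carries one message
# in each direction per step.
def messages_per_step(tasks_per_side):
    edges = 2 * tasks_per_side * (tasks_per_side - 1)  # horizontal + vertical edges
    return 2 * edges                                   # one message each way

print(messages_per_step(8))  # fine-grained: 8x8 tasks -> 224 messages
print(messages_per_step(2))  # agglomerated: 2x2 tasks -> 8 messages
```

Agglomeration trades many small messages for a few larger ones, which usually wins because per-message startup latency dominates on cluster interconnects.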

32 Preserving flexibility
Create more tasks than processors (usually an order of magnitude more), because:
- the algorithm may be ported to larger parallel computers,
- it provides the opportunity to overlap computation and communication, and
- it provides greater scope for task mapping.

33 Reducing software engineering costs
- If parallelising existing code, balance the performance gain against the advantage of less coding work.
- If the program is used as a component of a larger application, consider whether the design has a consistent interface with those of the other constituent program components.

34 Mapping
Determining which processor each task is to execute on:
- Place tasks that are able to execute concurrently on different processors, so as to enhance concurrency.
- Place tasks that communicate frequently on the same processor, so as to increase locality.
The mapping problem is NP-complete: there is no known polynomial-time solution that achieves optimal performance in the general case, so heuristic approaches are used to achieve sub-optimal solutions.
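One such heuristic, when tasks are independent, is longest-processing-time-first (LPT): assign each task, heaviest first, to the currently least-loaded processor. A sketch with invented task costs (it ignores the communication-locality goal, which a real mapper would also weigh):

```python
# Sketch of a greedy mapping heuristic (LPT): assign tasks, heaviest
# first, to the currently least-loaded processor.
import heapq

def lpt_map(task_costs, n_procs):
    loads = [(0, p) for p in range(n_procs)]   # (load, processor) min-heap
    heapq.heapify(loads)
    mapping = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(loads)         # least-loaded processor
        mapping[task] = p
        heapq.heappush(loads, (load + cost, p))
    return mapping

# hypothetical task costs; two processors end up with loads 10 and 9
print(lpt_map({"a": 7, "b": 5, "c": 4, "d": 3}, 2))
```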

35 Load Balancing
Static load balancing:
- Maximum efficiency is obtained when there is an equal load on each PE.
- Problematic if there is heterogeneity in the hardware or differences in the workload.
- Useful to know: the speed of each processor and the amount of processing for each grain.
Dynamic load balancing:
- We cannot always know the processing speed or the amount of work a priori.
- This can be solved by dynamically adjusting the domain decomposition, periodically repartitioning the data between PEs.
- Repartitioning involves substantial overheads, so use it infrequently (a performance trade-off).
- How to repartition: move data between processors.
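The repartitioning step above can be sketched as gather-then-redistribute, with each PE receiving work in proportion to its measured speed (the speeds and work items here are invented; a real implementation would also try to minimise how much data actually moves):

```python
# Sketch of dynamic load balancing by periodic repartitioning: gather all
# work items, then redistribute them in proportion to measured PE speeds.
def repartition(partitions, speeds):
    items = [x for part in partitions for x in part]   # gather from all PEs
    total_speed = sum(speeds)
    new_parts, i = [], 0
    for k, s in enumerate(speeds):
        # faster PEs receive proportionally more items; last PE takes the rest
        share = round(len(items) * s / total_speed) if k < len(speeds) - 1 \
                else len(items) - i
        new_parts.append(items[i:i + share])
        i += share
    return new_parts

# PE 1 is measured to be twice as fast as PE 0, so after repartitioning it
# holds about twice as many items
print(repartition([[1, 2, 3, 4, 5], [6]], [1.0, 2.0]))
```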