Aspects of Fair Share on a heterogeneous cluster with SLURM

1 Aspects of Fair Share on a heterogeneous cluster with SLURM Karsten Balzer, Computing Center, CAU Kiel, balzer@rz.uni-kiel.de ZKI Tagung Heidelberg 2016 (13th Oct. 2016, 16:30-17:00)

2 Background I
CAU Kiel - we operate:
a) NEC HPC-System, consisting of
- NEC SX-ACE vector system: 256 nodes with 4 vector cores and 64 GB memory each; memory bandwidth: 256 GB/s; theo. peak performance: 65.6 TFlops
- NEC HPC-Linux-Cluster: 22 cores (SandyBridge/Haswell); GB; theo. peak performance: 47.2 TFlops
- sharing a global ScaTeFS file system: 1.5 PB; batch system: NQSII
b) separate Linux cluster: rzcluster
- 15 nodes (AMD, Intel), 32 GB ... 1 TB memory
- rather general purpose, but with exclusive islands; batch system: PBSPro
- this cluster is currently up for renewal

3 Background II
Nucleus for the cluster renewal:
c) caucluster: an additional small Linux cluster
- 7 nodes (Haswell): 4 cores each; 256 GB
- separate login and service nodes (8 cores, 64 GB)
- separate software installation and modules
- home and work directories shared
- batch system: SLURM, with basic fair share scheduling
- currently 5 user groups (i.e., 5 SLURM accounts) with defined fair share ratios

4 Agenda
A. Introduction: SLURM and fair share with SLURM; our intentions to deploy fair share
B. Towards a heterogeneous system: Challenges? From some test cases to useful accounting metrics
C. Conclusions: A brief summary; outlook

5 SLURM... A brief overview
SLURM: acronym for Simple Linux Utility for Resource Management
- Originally developed at Lawrence Livermore National Lab as a simple resource manager (started in 2002)
- Now maintained and supported by SchedMD (https://www.schedmd.com)
- Has evolved into a capable job scheduler (roughly 500,000 lines of C code)
- Portable, scalable and fault-tolerant
- Increasingly being used at academic research computing centers
+ Supports fair share, with a rather easy-to-use plugin
+ Fair share with generic groups (not based on Linux groups)
+ Very good documentation (for installation, administration and usage)
+ Nice tools and a thorough MySQL database for monitoring/accounting
+ Open-source
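As a quick illustration of the tooling: the current fair-share state of the whole account tree can be inspected at any time with the bundled sshare command,

    sshare --all --long

which lists, per account and user, the raw shares, the raw and effective usage, and the resulting fair-share priority factor.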

6 ... and fair share with SLURM
Priority plugins:
- Default is FIFO: scheduling jobs on a first in, first out basis
- Fair share: job priorities are adjusted according to short-term historical usage; helps to steer a system toward defined usage targets
Basic fair share with SLURM:
- priority/multifactor: the fair share plugin, but used without other priority factors (such as age, job size, partition, QOS, ...)
- sched/backfill: backfill plugin enabled
- shares: allocated resources (core-hours), decaying with time; share decay half life: 12 h
- fair share tree: consider a simple tree with 3 accounts below root, each with users; fair share targets S_i: 70%, 20% and 10% of resources, i.e. S_1 = 0.7, S_2 = 0.2, S_3 = 0.1
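To make this concrete, here is a minimal configuration sketch along these lines; the account names (acc1-acc3), the user alice and the weight value are hypothetical, and only the plugin choices, the 12 h half life and the 70/20/10 shares are taken from the slide:

    # slurm.conf (excerpt): basic fair share with the multifactor plugin
    SchedulerType=sched/backfill
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=12:00:00     # share decay half life: 12 h
    PriorityWeightFairshare=10000      # fair share as the only active factor
    PriorityWeightAge=0
    PriorityWeightJobSize=0
    PriorityWeightPartition=0
    PriorityWeightQOS=0

    # account tree via sacctmgr: three accounts with 70/20/10 shares
    sacctmgr add account acc1 fairshare=70
    sacctmgr add account acc2 fairshare=20
    sacctmgr add account acc3 fairshare=10
    sacctmgr add user alice account=acc1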

7 Examples I
A_1) Test cluster: just 2 compute nodes with 8 cores each and 64 GB
Job mix:
- submit 5 jobs per account
- random job properties: walltime: 1-1 min; number of nodes: 1-2; cores per node: 1-8; memory/core < 8 GB
Time evolution of shares: s_i(t_n) = γ·s_i(t_{n-1}) + ω_i(t_n), where ω_i(t_n) is the consumption of core-hours in the time interval Δt = t_n − t_{n-1} and γ is the decay factor for the interval Δt
Theoretical peak share: s_peak = N_cluster-cores·Δt/(1−γ)
Effective peak share: s_eff = s_peak·ρ
Cluster allocation: here ρ = 0.872
(figure: shares s_i(t) [core-h] over time, together with Σ_i s_i(t), s_peak and s_eff)
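A small numerical sketch of this recursion (my own illustration, not code from the talk; the 1 h update interval and the full-load consumption are assumptions):

    # Sketch: time evolution of fair-share usage with exponential decay.
    half_life_h = 12.0                       # share decay half life (slide 6)
    dt_h = 1.0                               # assumed update interval
    gamma = 0.5 ** (dt_h / half_life_h)      # decay factor per interval
    n_cores = 16                             # 2 nodes x 8 cores

    def update_share(s_prev, omega):
        """One step of s_i(t_n) = gamma * s_i(t_{n-1}) + omega_i(t_n)."""
        return gamma * s_prev + omega

    # theoretical peak share: all cores busy in every interval
    s_peak = n_cores * dt_h / (1.0 - gamma)

    s = 0.0
    for _ in range(200):                     # run toward the steady state
        s = update_share(s, n_cores * dt_h)  # full consumption each interval
    print(round(s, 1), round(s_peak, 1))     # both approach ~285.1 core-h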

8 Examples II
A_2) Job allocation map: available from SLURM's accounting database
Normalized shares: s_i^norm(t) = s_i(t)/s(t), with s(t) = Σ_i s_i(t)
Accounting: a_i = (1/(ρ·N_cluster-cores·T)) Σ_{j=1..N_jobs^(i)(T)} ω_j, and over the accounting period T the normalized shares approach the targets: s_i^norm → S_i (S_i: target share)
(figure: job allocation map of allocated cores and normalized shares s_i^norm over time)
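Both formulas are easy to evaluate; a sketch with made-up numbers (the shares and per-job consumptions are invented for illustration):

    # Sketch: normalized shares and the accounting ratio a_i.
    shares = {"acc1": 180.0, "acc2": 60.0, "acc3": 30.0}    # s_i(t) in core-h
    total = sum(shares.values())
    s_norm = {acc: s / total for acc, s in shares.items()}  # s_i(t) / s(t)

    rho, n_cores, T = 0.87, 16.0, 24.0       # allocation, cores, period [h]
    omega = {"acc1": [50.0, 90.0, 45.0],     # core-h of each account's jobs
             "acc2": [30.0, 25.0],
             "acc3": [28.0]}

    # a_i = (1 / (rho * N_cluster-cores * T)) * sum over omega_j
    a = {acc: sum(w) / (rho * n_cores * T) for acc, w in omega.items()}
    print(s_norm, a)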

9 Examples III
A_3) Job allocation map: identify jobs of different kinds: serial single-core jobs, parallel single-node jobs and parallel multi-node jobs (marked in the map by a fine pattern, a coarse pattern and full solid, respectively)
Backfill strategy: numbering the jobs according to their submission order reveals the backfill assistance, e.g.: 137, 172, 72, 14, 33
Reservations: resources are being reserved essentially for multi-node jobs; these reservations mainly determine the cluster allocation ρ
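For reference, the per-job records behind such a map can be pulled from the accounting database with sacct, e.g. (an illustrative invocation; the start date is made up):

    sacct --allusers --starttime=2016-10-01 \
          --format=JobID,Account,AllocCPUS,NNodes,Submit,Start,End --parsable2

Comparing the Submit and Start columns also exposes the reordering done by backfill.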

11 Examples IV
B) Test cluster: as in A) above
Job mix: submit 5 jobs/account
- account 1: walltime: 3 min; full node: 8 cores
- accounts 2, 3: walltime: 1-1 min; number of nodes: 1-2; cores per node: 1-8
Job allocation map: excellent time-local performance of the fair share algorithm
(figure: job allocation map of allocated cores over time; cluster allocation ρ given in the plot)
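A sketch of how such a randomized job mix might be generated (my own illustration, not the script used for the tests; the account names, the walltime range and the dummy payload are assumptions):

    # Sketch: submit a randomized job mix for each account via sbatch.
    import random
    import subprocess

    for account in ["acc1", "acc2", "acc3"]:     # hypothetical accounts
        for _ in range(5):                       # jobs per account
            nodes = random.randint(1, 2)
            cores = random.randint(1, 8)         # cores per node
            minutes = random.randint(1, 10)      # assumed walltime range
            subprocess.run(
                ["sbatch",
                 f"--account={account}",
                 f"--nodes={nodes}",
                 f"--ntasks-per-node={cores}",
                 f"--time={minutes}",            # walltime in minutes
                 "--mem-per-cpu=1G",             # keep memory/core small
                 "--wrap", f"sleep {60 * minutes}"],  # fills the walltime
                check=True)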

12 Our intentions to deploy fair share
- Make all compute resources available to all users
- Ensure fair wait times when the cluster is flooded by jobs
- Reach pre-defined usage targets, ultimately on a monthly basis

13 Fair share challenges?
It is not just about monitoring the overall cluster utilization ρ. Instead, there are additionally (many) usage targets, distributed over a well-branched tree!
Cluster heterogeneity:
- Might it impede the smooth operation of the fair share algorithm?
- Can we guarantee an adequate cluster allocation?
- Can we guarantee usage targets?

14 Test scenarios I
A) Test cluster: 2 nodes (64 GB) with 8 and 4 cores, resp.
Job mix: memory/core < 8 GB
- accounts 1, 3: walltime: 1-1 min; 1 node: 1-4 cores
- account 2: walltime: 1 min; full node: 8 cores
Job allocation map: targets are reached (S_1 = 0.7, S_2 = 0.2, S_3 = 0.1)
(figure: job allocation map, allocated cores over time)

15 Test scenarios II
B_1) Test cluster: 3 nodes with the following resources:
- nodes 1, 2: 8 cores (64 GB) each
- node 3: 40 cores (256 GB)
Job mix: random single-node jobs (number of cores ≤ 8)
Job allocation map: for single-node jobs: ρ = 0.927
(figure: job allocation map, allocated cores over time)

16 Test scenarios II
B_2) Test cluster: 3 nodes with the following resources:
- nodes 1, 2: 8 cores (64 GB) each
- node 3: 40 cores (256 GB)
Job mix:
- account 1: 1 min, 10 cores, > 128 GB
- accounts 2, 3: walltime: 1-1 min; 1 node, 1-8 cores per node; memory/core < 1 GB
Job allocation map (allocated cores over time)
Cluster allocation: ρ = 0.84; for single-node jobs: ρ = 0.927

17 Test scenarios II
B_2) Test cluster and job mix: as on the previous slide
Job allocation map: for single-node jobs: ρ = 0.927; memory is secondary in scheduling the red jobs
Cluster allocation: ρ = 0.84

18 Test scenarios II
B_2) Test cluster and job mix: as above
Job allocation map: for single-node jobs: ρ = 0.927
Shares miss the target shares (s_i ≠ S_i). With 56 cluster cores in total (8 + 8 + 40), account 1's jobs (> 128 GB) fit only on node 3 and, by memory, only one at a time, so account 1 holds just 10 cores; the remaining 46 cores are split 2:1 between accounts 2 and 3:
- s_1 = 10/56 ≈ 0.18 (far below 0.7)
- s_2 = (46·2/3)/56 ≈ 0.55 > 0.2
- s_3 = (46·1/3)/56 ≈ 0.27 > 0.1
- s_3/s_2 = 0.5, matching the target ratio S_3/S_2
Cluster allocation: ρ = 0.84
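The quoted bounds are plain core counting; a quick check (using the 56-core total and the 2:1 split of the remaining 46 cores reconstructed above):

    # Quick check of the steady-state shares in scenario B_2).
    total = 8 + 8 + 40          # 56 cluster cores
    acc1 = 10                   # one >128 GB job pinned to node 3
    rest = total - acc1         # 46 cores, split 2:1 between accounts 2, 3

    s1 = acc1 / total           # ~0.18, far below S_1 = 0.7
    s2 = rest * 2 / 3 / total   # ~0.55, above S_2 = 0.2
    s3 = rest * 1 / 3 / total   # ~0.27, above S_3 = 0.1
    print(round(s1, 2), round(s2, 2), round(s3, 2), s3 / s2)  # 0.18 0.55 0.27 0.5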

19 Test scenarios II
B_2) Test cluster, job mix and shares: as on the previous slide (for single-node jobs: ρ = 0.927; s_i ≠ S_i; cluster allocation: ρ = 0.84)
- The same would occur on a homogeneous cluster: (3·10)/(3·40) = 0.25, but the throughput of account 1 would be larger by a factor of 3 (smaller problematic time window)

20 Test scenarios III
C_1) Test cluster: 2 nodes with the following resources:
- node 1: 8 cores (32 GB)
- node 2: 8 cores (64 GB)
Job mix:
- account 1: 5/1 min, 1/2 cores, 3/4 GB
- accounts 2, 3: single-node jobs, not fully at random; memory/core: irrelevant
Results:
- cluster allocation: ρ = 0.31
- target shares are potentially reached by a decrease of backfill
- this behavior is reproducible
(figure: job allocation map and normalized shares s_i^norm over time)

21 Test scenarios III
C_2) Test cluster: 2 nodes with the following resources:
- node 1: 8 cores (64 GB)
- node 2: 8 cores (64 GB)
Job mix: same as in C_1)
Results:
- targets are reached
- alternating scheduling for account 1
- cluster allocation slightly better (though still low)
(figure: job allocation map, allocated cores over time)

22 Intermediate summary
- On a heterogeneous cluster (even at large ρ) it can be (much) more difficult to reach target shares
- Whether targets are achieved depends strongly on the job mix and on the job properties
- The targets actually obtained may reflect the heterogeneity of the cluster
- Delaying backfill can assist in reaching targets, but leads to a small ρ

23 Towards suitable metrics
What about user satisfaction?
(figure: accounting ratios a_i compared to the target shares S_i)

24 Towards suitable metrics
What about user satisfaction? (figure as on the previous slide)
Can we provide a simple metric which helps to explain those situations?

25 Suitable metrics I
Consider a situation where ρ ≪ 1:
- On which resources can a job j run?
- How many nodes, how many cores per node and how much memory per node are requested? → N_nodes,j, N_cores-per-node,j and N_gigabyte-per-node,j
- How many cluster cores are allocatable to process the job? → N_allocatable-cores,j = f(N_nodes,j, N_cores-per-node,j, N_gigabyte-per-node,j)
- How many cores are available in total on the cluster? → N_cluster-cores
Job-based cluster allocation: of interest, for job j, is the ratio γ_j = N_allocatable-cores,j / N_cluster-cores, with γ_j ≤ 1
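The slide leaves f unspecified; one plausible model (an assumption of mine, counting the cores of all nodes that can host the job's per-node piece) is:

    # Sketch: allocatable cores for a job on a small cluster model.
    nodes = [(8, 64), (8, 64), (40, 256)]      # (cores, memory GB) per node

    def allocatable_cores(n_nodes, cores_per_node, gb_per_node):
        """One choice of f(N_nodes, N_cores-per-node, N_gigabyte-per-node)."""
        fitting = [c for c, m in nodes
                   if c >= cores_per_node and m >= gb_per_node]
        return sum(fitting) if len(fitting) >= n_nodes else 0

    n_cluster_cores = sum(c for c, _ in nodes)              # 56
    # gamma_j for a 1-node, 10-core, >128 GB job (account 1 in B_2):
    print(allocatable_cores(1, 10, 129) / n_cluster_cores)  # 40/56 ~ 0.71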

26 Suitable metrics II
Definition A - job-based average: M_A = (1/N_jobs) Σ_{j=1..N_jobs} γ_j
Definition B - runtime-weighted average: M_B = (1/T) Σ_{j=1..N_jobs} γ_j·T_j, with T = Σ_{j=1..N_jobs} T_j
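Evaluated over a made-up job list (the γ_j values and runtimes are invented for illustration), the two definitions give:

    # Sketch: job-based vs. runtime-weighted average of gamma_j.
    jobs = [(1.0, 2.0), (1.0, 0.5), (40/56, 6.0), (16/56, 1.0)]  # (gamma_j, T_j)

    M_A = sum(g for g, _ in jobs) / len(jobs)    # Definition A
    T = sum(t for _, t in jobs)
    M_B = sum(g * t for g, t in jobs) / T        # Definition B
    print(round(M_A, 3), round(M_B, 3))          # 0.75 0.744

Long-running jobs with a small γ_j pull M_B below M_A.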

27 Suitable metrics III
Example: Test cluster: 2 nodes with the following resources:
- node 1: 8 cores (64 GB)
- node 2: 8 cores (32 GB)
Job mix:
- account 1: 1/3 min, 1/4 cores, 48/48 GB
- accounts 2, 3: single-node jobs, number of cores at random; memory/core: irrelevant
(figure: three panels of normalized shares s_i^norm against the targets S_1 = 0.7, S_2 = 0.2 and S_3 = 0.1, each compared with the metrics M_A and M_B)

28 Conclusions
- Fair share is (even in its simple form) much more complex than other scheduling mechanisms, in particular on a heterogeneous system, where different resources are not equally available (intrinsic unfairness)
- One needs to keep an eye on many different key quantities: ρ, s_i, S_i and the metrics M
- One may quickly disappoint users
- It is therefore indispensable to be able to explain how the applied fair share algorithm works and why lower targets are reached in specific cases

29 Outlook I
Realistic tests on the caucluster

30 Outlook II
2nd-level backfill
Basic idea: use situations where ρ < 1 to process low-priority jobs
- Server+client architecture, with the server responsible for maintaining the pool and handing out tasks to the client(s)
- Automatic scheduling of clients by a preemption rule
SLURM configuration via standard plugin:
- 2 partitions: fairq (higher Priority) + lowq (Priority=1)
- PreemptType=preempt/partition_prio
- PreemptMode=CANCEL
Software @ Kiel University:
- Optimization algorithms for chemical and materials science (RG Prof. Hartke) [1]
- Quantum Monte Carlo (RG Prof. Bonitz) [2]
Fair share aspect: a grace time may influence the fair share in the standard partition
[1] J.M. Dieterich and B. Hartke, An Error-safe, Portable, and Efficient Evolutionary Algorithms Implementation with High Scalability, submitted to J. Chem. Theory Comput. (2016)
[2] T. Dornheim, S. Groth, T. Schoof et al., Ab initio quantum Monte Carlo simulations of the uniform electron gas without fixed nodes, Phys. Rev. B 93 (2016)
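A hedged slurm.conf sketch of this two-partition setup (the node range and the concrete Priority/GraceTime values are assumptions; only the preemption plugin, the preempt mode and the partition names come from the slide):

    # slurm.conf (excerpt): 2nd-level backfill via partition preemption
    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL

    # fairq outranks lowq; lowq jobs are canceled when fairq needs the nodes
    PartitionName=fairq Nodes=node[01-07] Priority=10 Default=YES
    PartitionName=lowq  Nodes=node[01-07] Priority=1  GraceTime=60

The partition GraceTime gives a preempted lowq job a short window before it is canceled; as noted above, such a grace time can feed back into the fair share of the standard partition.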

31 Outlook III
2nd-level backfill example - 3 accounts:
- Blue and green: single-node jobs with max. 8 cores (fairq, ratio 2:1)
- Red: serial backfill jobs in the lowq (fixed walltime: 5 min)
(figure: job allocation maps a), b), c) of allocated cores over time)
