SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman


1 SLAQ: Quality-Driven Scheduling for Distributed Machine Learning Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman

2 AI is the new electricity. Applications: machine translation, recommendation systems, autonomous driving, object detection and recognition. Learning paradigms: supervised, unsupervised, transfer, reinforcement.

3 ML algorithms are approximate. ML model: a parametric transformation f_θ(x) → y.

4 ML algorithms are approximate. ML model: a parametric transformation f_θ(x) → y; it maps input variables x to output variables y and typically contains a set of parameters θ. Quality: how well the model maps inputs to the correct outputs. Loss function: the discrepancy between the model output and the ground truth.
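
To make the definitions concrete, here is a minimal sketch (not from the slides) using a linear model and squared loss; the names `model` and `loss` are illustrative:

```python
import numpy as np

def model(theta, x):
    # Parametric transformation f_theta: maps input x to a predicted output y.
    return x @ theta

def loss(theta, X, y):
    # Loss: average discrepancy between the model output and the ground truth.
    return float(np.mean((model(theta, X) - y) ** 2))
```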

5 Training ML models: an iterative process. [Diagram: a job dispatches tasks to workers; each worker holds a model replica and a data shard and sends model updates back.] Training algorithms iteratively minimize a loss function, e.g., stochastic gradient descent (SGD) or L-BFGS.
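
A minimal single-machine sketch of such an iterative loop, assuming mini-batch SGD on the squared loss above; in SLAQ's setting the gradient computation is distributed over model replicas and data shards, but the iterative structure is the same:

```python
import numpy as np

def sgd_train(X, y, num_iters=100, lr=0.01, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # Each iteration: sample a mini-batch (a "task" over a data shard),
        # compute the gradient of the squared loss, and update the model.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / batch_size * Xb.T @ (Xb @ theta - yb)
        theta -= lr * grad
    return theta
```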

6 Training ML models: an iterative process. [Plot: loss reduction (%) vs. cumulative time (%) for LogReg, SVM, LDA, MLPC.] Quality improvement is subject to diminishing returns: more than 80% of the work is done in 20% of the time.

7 Exploratory ML training: not a one-time effort. Collect data, extract features, train ML models; then adjust the feature space, tune hyperparameters, or restructure the model and retrain. Models are trained multiple times for exploratory purposes: early feedback directs the model search toward high-quality models.

8 How do we schedule multiple training jobs on a shared cluster? Key features of ML jobs: approximate, diminishing returns, exploratory process. [Diagram: jobs #1-#3 submitted to a shared scheduler.] Problem with resource-fairness scheduling: jobs in an early stage could benefit greatly from additional resources, while jobs that have almost converged make only marginal improvement.

9 SLAQ: quality-aware scheduling. Intuition: in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement. [Plot: accuracy and loss over time, quality-aware vs. fair-resource scheduling.]

10 Solution overview: (1) normalize quality metrics, (2) predict quality improvement, (3) quality-driven scheduling.

11 Normalizing quality metrics. Desired properties of a quality metric: applicable to all algorithms? comparable magnitudes? known range? predictable? Candidates: accuracy / F1 score / area under curve / confusion matrix / etc.; raw loss; normalized loss. [Table comparing these candidates against the four properties.]

12 Normalizing quality metrics. Normalize the change of loss values with respect to the largest change seen so far. Currently does not support some non-convex optimization algorithms. [Plot: normalized loss vs. iteration for KMeans, LogReg, SVMPoly, GBT, GBTReg, MLPC, LDA, LinReg.]
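
A small sketch of one way to compute such a normalization, assuming we simply track the largest per-iteration loss change seen so far (SLAQ's exact bookkeeping may differ):

```python
def normalized_loss_deltas(losses):
    # Normalize each per-iteration change in loss by the largest change
    # observed so far, so that progress is comparable across algorithms.
    normalized, max_delta = [], 0.0
    for prev, curr in zip(losses, losses[1:]):
        delta = abs(curr - prev)
        max_delta = max(max_delta, delta)
        normalized.append(delta / max_delta if max_delta > 0 else 0.0)
    return normalized
```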

13 Training iterations: loss prediction. Previous work uses offline profiling / analysis [Ernest NSDI 16] [CherryPick NSDI 17], but the overhead of frequent offline analysis is huge. Strawman: use the last loss value as the prediction for future loss. SLAQ: online prediction using weighted curve fitting. [Plot: prediction error (%) of the strawman vs. weighted curve fitting for LDA, GBT, LinReg, SVM, MLPC, LogReg, SVMPoly.]
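
A sketch of the weighted-curve-fitting idea, assuming a low-degree polynomial fit of loss vs. iteration with exponentially decaying weights on older iterations; the model family and weighting scheme SLAQ actually uses may differ:

```python
import numpy as np

def predict_next_loss(losses, decay=0.9, degree=2):
    # Weighted curve fit of loss vs. iteration: recent iterations get
    # exponentially larger weights; extrapolate one iteration ahead.
    if len(losses) <= degree:
        return losses[-1]                           # not enough history: fall back
    iters = np.arange(len(losses), dtype=float)
    weights = decay ** (len(losses) - 1 - iters)    # newest point has weight 1
    coeffs = np.polyfit(iters, losses, deg=degree, w=weights)
    return float(np.polyval(coeffs, len(losses)))
```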

14 Scheduling approximate ML training jobs. Predict how much quality can be improved when assigning X workers to a job; reassign workers to maximize the total quality improvement. [Diagram: the scheduler uses per-job predictions to drive resource allocation across jobs #1-#3.]
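
One way to realize this is a greedy loop that hands out workers one at a time to the job with the largest predicted marginal improvement. The sketch below assumes a hypothetical `predicted_improvement(job, workers)` estimate and is not SLAQ's exact algorithm:

```python
def allocate_workers(jobs, total_workers, predicted_improvement):
    # Greedily hand out workers one at a time to the job whose predicted
    # quality (normalized-loss) improvement from one more worker is largest.
    allocation = {job: 0 for job in jobs}
    for _ in range(total_workers):
        best = max(jobs, key=lambda j: predicted_improvement(j, allocation[j] + 1)
                                       - predicted_improvement(j, allocation[j]))
        allocation[best] += 1
    return allocation
```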

15 Experiment setup. Representative mix of training jobs (see table below); compared against a work-conserving fair scheduler.

Algorithm | Acronym | Type | Optimization Algorithm | Dataset
K-Means | K-Means | Clustering | Lloyd Algorithm | Synthetic
Logistic Regression | LogReg | Classification | Gradient Descent | Epsilon [33]
Support Vector Machine | SVM | Classification | Gradient Descent | Epsilon
SVM (polynomial kernel) | SVMPoly | Classification | Gradient Descent | MNIST [34]
Gradient Boosted Tree | GBT | Classification | Gradient Boosting | Epsilon
GBT Regression | GBTReg | Regression | Gradient Boosting | YearPredictionMSD [35]
Multi-Layer Perceptron Classifier | MLPC | Classification | L-BFGS | Epsilon
Latent Dirichlet Allocation | LDA | Clustering | EM / Online Algorithm | Associated Press Corpus [36]
Linear Regression | LinReg | Regression | L-BFGS | YearPredictionMSD

16 Evaluation: resource allocation across jobs. 160 training jobs submitted to the cluster following a Poisson distribution: 25% of jobs with high loss values, 25% with medium loss values, 50% with low loss values (almost converged). [Plot: share of cluster CPUs (%) over time (seconds), broken down into the top 25%, second 25%, and bottom 50% of jobs.]

17 Evaluation: cluster-wide quality and time. Quality: SLAQ's average loss is 73% lower than that of the fair scheduler. [Plot: loss vs. time (seconds), fair resource vs. SLAQ.] Time: SLAQ reduces the time to reach 90% (95%) loss reduction by 45% (30%). [Plot: time (seconds) vs. loss reduction (%), fair resource vs. SLAQ.]

18 SLAQ evaluation: scalability. SLAQ frequently reschedules and reconfigures jobs in reaction to changes in progress. Even with thousands of concurrent jobs, SLAQ makes rescheduling decisions in just a few seconds. [Plot: scheduling time (s) vs. number of jobs and number of workers.]

19 Conclusion. SLAQ leverages the approximate and iterative nature of the ML training process: predictions highly tailored to iterative job quality, and resources allocated to maximize quality improvement. SLAQ achieves better overall quality and shorter end-to-end training time. [Recap plots: loss-prediction error and cluster CPU share over time, from earlier slides.]

20 Training iterations: runtime prediction. Iteration runtime ≈ C × S / N, where C is the model complexity, S the data size, and N the number of workers. The model update (i.e., the size of Δ) is comparably much smaller. [Plot: iteration time (s) vs. number of cores.]
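
A tiny sketch of this runtime model with illustrative constants; `per_unit_cost` and `update_overhead` are assumptions, not numbers from the talk:

```python
def predict_iteration_time(c, s, n, per_unit_cost=1e-9, update_overhead=0.05):
    # Runtime ~ C * S / N: per-iteration compute scales with model complexity
    # and data size and is divided across N workers; shipping the (much
    # smaller) model update contributes a roughly fixed overhead.
    return per_unit_cost * c * s / n + update_overhead
```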