SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism


Feng Yan, University of Nevada, Reno, Reno, NV, USA; Olatunji Ruwase, Microsoft Research, Redmond, WA, USA; Yuxiong He, Microsoft Research, Redmond, WA, USA; Evgenia Smirni, College of William and Mary, Williamsburg, VA, USA

Abstract

Deep neural networks (DNNs) have enabled a variety of artificial intelligence applications. These applications are backed by large DNN models running in serving mode on a cloud computing infrastructure. Given the compute-intensive nature of large DNN models, a key challenge for DNN serving systems is to minimize the request response latencies. This paper characterizes the behavior of different parallelism techniques for supporting scalable and responsive serving systems for large DNNs. We identify and model two important properties of DNN workloads: homogeneous request service demand, and interference among requests running concurrently due to cache/memory contention. These properties motivate the design of SERF, a dynamic scheduling framework that is powered by an interference-aware queueing-based analytical model. We evaluate SERF in the context of an image classification service using several well-known benchmarks. The results demonstrate its accurate latency prediction and its ability to adapt to changing load conditions.

I. INTRODUCTION

Deep Neural Network (DNN) models have recently demonstrated state-of-the-art accuracy on important yet challenging artificial intelligence tasks, such as image recognition [1], [2], [3] and captioning [4], [5], video classification [6], [7] and captioning [8], speech recognition [9], [10], and text processing [11]. These advancements by DNNs have enabled a variety of new applications, including personal digital assistants [12], real-time natural language processing and translation [13], photo search [14] and captioning [15], drug discovery [16], and self-driving cars [17].

Many cloud service providers offer DNN services as part of their machine learning platforms, such as Microsoft AzureML [18] and Amazon machine learning systems [19], which provide library and runtime tools for application owners to develop and deploy their DNN applications conveniently. These platforms support the training and serving of various DNN applications. Figure 1 shows the user interface of Microsoft AzureML, including its click-to-deploy model workflow. The Experiments section in Figure 1 corresponds to the training phase, where application owners specify the neural network structure, algorithms, and data to train their DNN models. Once trained, these models can be deployed instantly on the cloud in a serving mode to process application inputs, such as images, voice commands, speech segments, and handwritten text; see the Web Services section in Figure 1. Our paper focuses on DNN serving systems, revealing the challenges and opportunities to support fast deployment of responsive and scalable DNN applications.

Fig. 1. Microsoft AzureML interface.

DNN serving platforms must satisfy the following two requirements. First, DNNs should offer short response times to user requests. Since DNN applications process a stream of user requests, a serving system must consistently offer fast responses to attract and retain application users. Slow responses directly degrade user experience. For example, image recognition applications [1], [2], [3] take photos or even real-time camera streams as input requests and send back classification results.
Since it is very similar to a traditional query service, users usually expect low-latency responses and may switch to another service provider if the perceived latency is high [20], [21]. Second, DNNs should support fast deployment of applications. Once DNN models are trained and ready for deployment, the serving system should make the application available to accept online user requests within a few minutes [22]. No one would use a platform that takes hours or even days to deploy or update their applications.

DNN models that achieve the best accuracy on the most challenging tasks (e.g., image, speech, etc.) are often very
large (containing billions of neural connections) and require significant compute cycles and memory bandwidth to serve each request [1], [3], [16]. Such large models may take seconds or even minutes to process each user request if executed in a sequential fashion on a single server. Parallel computation is a promising approach to improve the response time of DNN serving applications. In this paper, we consider three ways of exploiting hardware parallelism within and across machines for DNN serving systems. First, parallel hardware threads on a machine (e.g., chip-multiprocessor (CMP) cores) can be used for parallel processing of each request to reduce response time (intra-node parallelism). Second, the parallel hardware threads on a machine could alternatively be used for concurrent processing of different requests to reduce waiting time (service parallelism). Third, the model could be partitioned across multiple machines to leverage the aggregate compute cycles and memory bandwidth for faster processing of each request (inter-node parallelism).

Finding good parallelism configurations to minimize DNN serving latency is important but challenging. Applying parallelism degrees blindly could harm performance. For example, service parallelism may increase memory system contention to the point of prolonging request processing time; inter-node parallelism may prolong request processing if the cross-machine communication overhead exceeds the computation speedup. Figure 2 shows the latency of serving the ImageNet-22K workload [23] under different combinations of service and intra-node parallelism on an 8-core machine (refer to Section V-A for the detailed experimental setup). The left and right scatter plots represent low and high load conditions (request arrival rate), respectively. Each point represents one parallel configuration, and the size of the point indicates its latency value. The figure demonstrates that: (1) Many parallel configurations are possible, even with only 8 cores and without considering inter-node parallelism. (2) The latency difference between the best parallel configuration and the worst parallel configuration can be significant, i.e., by orders of magnitude. This gap grows further under higher loads. (3) The latency values and the best parallel configuration change as a function of the load.

Fig. 2. Latency under low load (left plot) and high load (right plot) using different configurations (inter-node parallelism is set to 1) for ImageNet-22K.

In order to find the best parallelism configuration among many candidates, we need to quantify and compare their latency impact. One could conduct exhaustive profiling of the performance of all combinations of parallelism configurations and expected load levels. This can be very expensive and may take hours or even days. Moreover, this profiling cost repeats when models are updated. Such a method is impractical as online services require fast deployment within a few minutes. An alternative solution is to use analytical modeling to predict request latencies under different configurations and load levels. However, the effectiveness of a parallelism technique for a DNN depends on many factors, such as neural network characteristics, hardware, and the combined impact of memory contention and communication overhead, which make accurate latency prediction difficult. Therefore, neither exhaustive profiling nor analytical modeling offers a practical solution for finding the best parallel configuration in a timely manner.
We present SERF (serving deep learning systems fast), a scheduling framework that employs a hybrid approach, combining lightweight profiling with queueing-based analytical modeling to quickly identify the best parallel configuration for any given load. SERF needs to answer two important questions: (i) what should be profiled for accuracy but can also be profiled quickly, and (ii) how to model and predict request latency? To answer these questions, we characterize and identify two distinctive properties of DNNs: (1) The DNN service time is difficult to model accurately, but can be measured efficiently. In particular, DNN requests are homogeneous, i.e., when running under the same degree of parallelism, the service time is deterministic. This property empowers lightweight profiling, i.e., only the average service time needs to be collected rather than more complex information such as distributions. (2) There is interference among concurrently running requests due to cache/memory contention: a request may take longer to execute in the presence of other concurrent requests even when these requests are using different cores.

In this paper, we develop an interference-aware queueing-based analytical model that takes as input the service time profiling information and accurately predicts request latency under different loads. SERF adopts this hybrid approach to identify the best configuration for any given load and deploys a dynamic scheduler that adapts to load changes online nearly instantly, achieving the benefits of both empirical and analytical methods.

We implement SERF in the context of an image classification service based on the image classification module of the Adam distributed deep learning framework [3]. We stress that SERF is not limited to the Adam architecture, but is also applicable to serving systems based on other DNN frameworks (e.g., Caffe [24], Theano [25], and Torch7 [26]), as similar parallelism decisions and configuration knobs are also available there. We conduct extensive experiments by running several state-of-the-art classification benchmarks, including ImageNet [23] and CIFAR [2]. We show that our prediction model achieves high accuracy: the average error is less than 4% compared to measurement results. SERF always correctly identifies the best parallel configurations under a variety of benchmarks and system loads. Moreover, compared to using static parallel configurations, SERF swiftly identifies and switches to the
best configuration, reducing request latency under various loads. Compared to exhaustive profiling, SERF adapts three orders of magnitude faster under dynamic and ever-changing environments, significantly reducing application deployment time.

II. BACKGROUND

DNNs consist of large numbers of neurons with multiple inputs and a single output called an activation. Neurons are connected hierarchically, layer by layer, with the activations of neurons in layer l-1 serving as inputs to neurons in layer l. This deep hierarchical structure enables DNNs to learn complex tasks, such as image recognition, speech recognition, and text processing.

A DNN service platform supports training and serving. DNN training is offline batch processing that uses learning algorithms, such as stochastic gradient descent (SGD) [27], and labeled training data to tune the neural network parameters for a specific task. DNN serving is instead interactive processing requiring a fast response per request, e.g., within milliseconds, even for challenging large-scale models like ImageNet-22K. It deploys the trained DNN models in serving mode to answer user requests; e.g., for a dog recognition application, a user request provides a dog image as input and receives the type of the dog as output. The response time of a request is the sum of its service time (execution time) and waiting time. An important and common performance metric for interactive workloads is the average request response time (average latency), which we adopt in our work.

In DNN serving, each user input, which we refer to as a request, is evaluated layer by layer in a feed-forward manner where the output of a layer l-1 becomes the input of layer l. More specifically, define $a_i$ as the activation (output) of neuron i in layer l. The value of $a_i$ is computed as a function of its J inputs from neurons in the preceding layer l-1 as follows:

$a_i = f\left(\left(\sum_{j=1}^{J} w_{ij}\, a_j\right) + b_i\right)$,   (1)

where $w_{ij}$ is the weight associated with the connection between neuron i at layer l and neuron j at layer l-1, and $b_i$ is the bias term associated with neuron i. The activation function f, associated with all neurons in the network, is a pre-defined non-linear function, typically a sigmoid or hyperbolic tangent. Therefore, for a given request, its main computation at each layer l is a matrix-vector multiplication of the weight matrix of the layer with the activation vector from layer l-1 (or the input vector if l = 0).

Inter-node, intra-node, and service-level parallelism are well supported among various DNN models and applications [1], [3], [28]. Inter-node parallelism partitions the neural network across multiple nodes/machines, with activations of neural connections that cross node/machine boundaries being exchanged as network messages. Intra-node parallelism uses multi-threading to parallelize the feed-forward evaluation of each input image using multiple cores. As the computation at each DNN layer is simply a matrix-vector multiplication, it can easily be parallelized using parallel libraries such as OpenMP [29] or TBB [30] by employing a parallel for loop. Service-level parallelism is essentially admission control that limits the maximum number of concurrently running requests. We define a parallelism configuration as a combination of the intra-node parallelism degree, inter-node parallelism degree, and maximum allowed service parallelism degree. Note that the service parallelism is defined as a maximum value instead of an exact value due to the random request arrival process; e.g., at certain moments, the system may have fewer requests than the defined service parallelism degree.
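To make the per-layer computation in Eq. 1 concrete, the sketch below evaluates a small fully connected stack with NumPy; the layer sizes and activation function are illustrative choices, not the paper's actual model or code, and the comment marks where intra-node parallelism would apply in a native implementation.

```python
import numpy as np

def dense_layer(W, b, a_prev, f=np.tanh):
    """Feed-forward evaluation of one layer (Eq. 1).

    W      : weight matrix of shape (neurons in layer l, neurons in layer l-1)
    b      : bias vector for layer l
    a_prev : activations from layer l-1 (or the input vector if l = 0)
    f      : non-linear activation function (e.g., sigmoid or tanh)
    """
    # The dominant cost is this matrix-vector multiplication; intra-node
    # parallelism corresponds to splitting the rows of W across cores
    # (e.g., an OpenMP/TBB parallel-for in a native implementation).
    return f(W @ a_prev + b)

def feed_forward(layers, x):
    """Evaluate one request layer by layer; `layers` is a list of (W, b)."""
    a = x
    for W, b in layers:
        a = dense_layer(W, b, a)
    return a

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two small illustrative layers: 4096 -> 1024 -> 10.
    layers = [(0.01 * rng.standard_normal((1024, 4096)), np.zeros(1024)),
              (0.01 * rng.standard_normal((10, 1024)), np.zeros(10))]
    print(feed_forward(layers, rng.standard_normal(4096)).shape)  # (10,)
```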
III. WORKLOAD CHARACTERIZATION

In this section, we present a comprehensive workload characterization that shows the opportunities and challenges of using the various parallelism techniques to reduce DNN serving latency, as well as their implications for the design of SERF. We make four key observations: (1) Parallelism impacts service time in complex ways, making it difficult to model service times without workload profiling. (2) DNN workloads have homogeneous requests, i.e., service times under the same parallelism degree exhibit little variance, which allows SERF to measure request service time at an affordable profiling cost. (3) DNN workloads exhibit interference among concurrently running requests, which motivates the new model and solution of SERF. (4) DNN workloads show load-dependent behavior, which indicates the importance of accurate latency estimation and parallel configuration adaptation according to system load.

We present workload characterization results of two well-known image classification benchmarks, CIFAR-10 [2] and ImageNet-22K [23], on servers using Intel Xeon E processors. Each processor has 8 cores, with private 32KB L1 and 256KB L2 caches, and a shared 20MB L3 cache. The detailed experimental setup for both workloads and hardware is provided in Section V.

A. Impact of parallelism on service time

Modeling the impact of parallelism on DNN serving without workload profiling is challenging because parallelism has complex effects on the computation and communication components of request service time, as shown in Figures 3 and 4. Figure 3 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores. This effect is due to the limited memory bandwidth: when the total memory bandwidth demand is close to or exceeds the available bandwidth, the bandwidth per core is reduced, decreasing the speedup. For inter-node parallelism, increasing the parallelism degree from 1 to 2 yields a 2X service time speedup because the computation time, which is dominant, is halved, while communication time grows marginally; increasing from 2 to 4 results in super-linear speedup due to caching effects, as the working set fits in the L3 cache; increasing from 4 to 8 results in a smaller speedup increase as communication starts to dominate service time. For service parallelism, parallelism degrees > 2 result in increased
service time due to memory interference among concurrently serviced requests. These results are indicative of the impact of the different parallelism techniques on service time. Speedups can vary a lot, depending on many factors, including DNN size, the ratio of computation to communication, cache size, and memory bandwidth.

Fig. 3. Service time comparison under different parallelism techniques using ImageNet-22K. Each plot reports the speedup when increasing the degree of only one parallelism (the other two parallelisms are fixed).

Fig. 4. Relationship between inter-node and intra-node parallelism using ImageNet-22K.

Figure 4 demonstrates the relationship between inter-node and intra-node parallelism: the results indicate that the degree of one parallelism technique can affect the behavior of another. More precisely, the intra-node parallelism speedup depends on the degree of inter-node parallelism: the speedup shrinks with larger inter-node parallelism. This is because communication time is increasingly the dominant portion of service time with larger degrees of inter-node parallelism, so the computation time improvements of intra-node parallelism become less important to overall service time. In summary, since parallelism efficiency depends on various factors relating to workload and hardware properties, and since one parallelism technique can affect the behavior of others, it is difficult to accurately model service time. SERF circumvents this by incorporating workload profiling to predict request service time.

B. Homogeneous requests

We observe that for a given parallelism degree tuple, defined as (service parallelism degree, inter-node parallelism degree, intra-node parallelism degree), the service times of DNN requests exhibit very little variance because the same amount of computation and communication is performed for each request. Thus, we refer to DNN requests as being homogeneous. (Note that a parallelism degree tuple is different from a parallelism configuration: in a parallelism degree tuple, each parallelism is set exactly to the degree value, while in a parallelism configuration, the max service parallelism is an admission policy that defines the maximum allowed degree of service parallelism.) Figure 5 shows two examples corresponding to two representative cases of parallelism degrees. The first example, shown in the left plot of Figure 5, uses the parallelism degree tuple (2, 1, 4), where the majority of requests are in the range of 330ms to 340ms and the SCV (squared coefficient of variation) is very small. The second example, shown in the right plot of Figure 5, is under parallelism degree tuple (4, 4, 2), where most requests are in the range of 130ms to 160ms. The slightly larger variance can be attributed to variations in the cross-machine communication delays caused by inter-node parallelism. The magnitude of these variations is consistent with what is normally expected in computer communication systems when running a request multiple times [31].

Fig. 5. CDH (Cumulative Data Histogram) of service times. The left plot is with parallelism degree tuple (2, 1, 4) and the right plot is with (4, 4, 2).

This unique property of homogeneous requests for DNN workloads empowers lightweight profiling: the cost of measuring the service time is low, i.e., for a given parallelism degree tuple, running one or a few input requests is sufficient. In comparison, many other online services have requests with heterogeneous demands [32], [33] and require executing many more input samples to collect service time distributions.
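As a minimal illustration of how request homogeneity can be checked during profiling, the snippet below computes the mean and SCV of a handful of service-time samples for one parallelism degree tuple; the sample values are hypothetical, not measurements from the paper.

```python
import statistics

def scv(samples):
    """Squared coefficient of variation: variance / mean^2."""
    mean = statistics.fmean(samples)
    return statistics.pvariance(samples, mu=mean) / (mean * mean)

# Hypothetical service-time samples (ms) for one parallelism degree tuple,
# e.g., (service=2, inter-node=1, intra-node=4); in practice these come
# from timing a few probe requests against the deployed model.
samples_ms = [333.1, 336.8, 331.9, 338.4, 334.6]
print(f"mean = {statistics.fmean(samples_ms):.1f} ms, SCV = {scv(samples_ms):.5f}")
```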
C. Interference among concurrent requests

For small DNNs like CIFAR-10 (the left plot of Figure 6), request service time remains almost constant when running requests concurrently under different service parallelism degrees, because there is little interference among requests due to cache/memory contention. The interference becomes more pronounced for large DNNs. The right plot in Figure 6 shows the request service time of ImageNet-22K when running different numbers of requests. It is clear that when running more than
2 requests concurrently, the interference becomes severe. To explain this performance interference, it is important to understand the working set of DNN serving, which comprises the activations and weights of the neural connections (the core operation is a matrix-vector multiplication of the weight matrix and the activation vector, see Eq. 1). Activations are derived from the request input, while weights represent the model parameters and are shared by all requests. When there are no more than 2 concurrent requests, the working sets of both fit into the L3 cache. If three or more requests run concurrently, the footprint of activations increases and the aggregate working set no longer fits in the L3 cache, resulting in more L3 cache misses and thus prolonging the request service time. This is also why large DNNs like ImageNet-22K are more likely to suffer interference than small ones such as CIFAR-10.

Fig. 6. Service time comparison with different numbers of concurrent requests.

Interference makes modeling the average service time and waiting time for a given parallelism configuration much more challenging. In particular, under the same parallelism configuration, the number of running requests can vary from 0 to the maximum service parallelism of the configuration. Therefore, the service time of a particular request depends on the number of concurrently running requests at the moment of its execution, and the average service time depends on the probability distribution of the concurrency levels. The waiting time estimation is even more complex. Existing queueing and scheduling models [34] are no longer applicable as they assume independence among requests: request service time remains constant regardless of the number of concurrent requests. This property of DNN serving motivates us to develop the new model and solution of SERF to accurately capture the waiting time and latency impact of interference.

D. Load-dependent Behavior

In serving systems, load (request arrival rate) changes dynamically over time. For a given parallel configuration, both request service time and waiting time can change under different loads. To illustrate the load-dependent behavior of different parallelism approaches, we use six distinct configurations and conduct experiments under different load levels, see Table I. The left plot in Figure 7 shows the service time of the six configurations under different loads, the middle plot in Figure 7 shows their waiting time, and the right plot in Figure 7 shows their latency. The results demonstrate that, for the same configuration, service time, waiting time, and latency can vary under different loads. Therefore, the ability to estimate the latency impact according to the load, and a scheduler that can change the parallel configuration based on load, are two necessary and important features.

TABLE I. Parallel configurations: each of Config 1-6 is specified by its service, inter-node, and intra-node parallelism degrees.

IV. SERF: A FRAMEWORK FOR DNN SERVING

In this section, we present the scheduling framework SERF. SERF applies a hybrid approach that integrates lightweight profiling and a queueing-based prediction model to find the best parallel configuration for any given load (request arrival rate) effectively and efficiently, achieving the benefits of both empirical and analytical methods. We first discuss the scheduling objective and give an overview of SERF (Section IV-A). Then we answer the two important questions raised in the Introduction: (1) What should be profiled for accuracy yet can be profiled quickly (Section IV-B)? (2) How to model the rest and predict request latency (Section IV-C)?
Finally, we discuss how to use the prediction results to dynamically change the parallelism configuration online with varying loads (Section IV-D).

A. Overview

Scheduling Objective. Common objectives for scheduling interactive serving systems are (1) to minimize response latency using a given amount of resources [33], [32] or (2) to minimize resource consumption while meeting a latency SLO [35], [36]. Our scheduling framework supports both. In the interest of space, we focus on the first objective of minimizing response latency. We choose to optimize average latency because DNN requests are homogeneous and have similar service times, so reducing the average latency also reduces the tail latency.

Framework Overview. Figure 8 presents an overview of SERF, which consists of three main modules: prediction model, profiler, and scheduler. The modules are connected by the configuration reference table, which maps different load levels (represented by request arrival rate) to their corresponding best parallel configurations. For example, at an arrival rate of 2 requests/second, the best configuration is one with a max service parallelism of 4, inter-node parallelism of 2, and intra-node parallelism of 4. The profiler takes the system information (e.g., the number of machines and cores, and the workload) as input, conducts lightweight profiling, and feeds the profiling results to the prediction model. The prediction model is the key component of the framework. It utilizes the profiling results to predict the latency of all combinations of parallelism under different load levels and populates the configuration reference table. This table only needs to be built once, provided that
DNN workload characteristics and system hardware remain the same. The scheduler uses the current system load as an index to search the configuration reference table, and finds and adapts to the best parallel configuration.

Fig. 7. Service time, waiting time, and latency under different loads using different configurations for ImageNet-22K.

Fig. 8. Overview of SERF.

B. Profiler

An easy but inefficient way to achieve the scheduling objective is via exhaustive profiling: execute all possible parallelism configurations for all possible loads and find the best parallel configuration for each load. The shortcoming of such exhaustive profiling is its high cost. Assuming that there are P different configurations and L load levels, one needs to conduct P × L profiling experiments. In addition, measuring average latency requires a relatively long time span (to measure enough samples) to achieve statistical stability due to the stochastic queueing behavior. Experimenting with lighter load levels requires even longer profiling time because the large idle intervals between requests increase the duration of the experiment. Let T be the average cost to achieve statistical stability in profiling, which makes the overall cost of exhaustive profiling P × L × T.

SERF conducts lightweight profiling by measuring the request service time for each parallelism degree tuple of (service parallelism degree, inter-node parallelism degree, intra-node parallelism degree). For example, with the tuple (2, 4, 3), we measure the request service time by running two requests concurrently, each request across 4 server nodes and with 3 cores on each server node. Let E denote the cost of profiling the request service time for a given parallelism degree tuple; the total profiling cost of SERF is then P × E, where P is the total number of parallelism degree combinations. The profiling of SERF has two key differences compared to exhaustive profiling, resulting in significantly lower profiling cost: (1) SERF measures the request service time instead of latency, and (2) SERF measures each parallelism degree tuple instead of each parallel configuration. The benefit of these profiling choices is two-fold: (1) the service time under a given parallelism degree tuple is independent of load, saving a multiplicative cost factor along the load dimension L; (2) as requests have deterministic service times under the same parallelism degree tuple and profiling the service time is independent of the queueing delays, a few profiling samples are sufficient, i.e., the value of E is small. In contrast, exhaustive profiling measures latency for each parallelism configuration, which requires running many samples to achieve statistical stability for queueing delays, i.e., T is much more costly than E, by up to 3 orders of magnitude. Therefore, SERF profiling is much more efficient than exhaustive profiling, and P × E ≪ P × L × T. We feed these profiling results to the prediction model of SERF to estimate the latency under different load levels, which is introduced next.
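The sketch below illustrates the lightweight profiling loop described above: it enumerates parallelism degree tuples and records the mean service rate of a few probe requests for each. The measure_service_time hook and the degree ranges are hypothetical placeholders for whatever knobs the serving system exposes; they are not SERF's actual interface.

```python
from itertools import product

def profile_service_rates(measure_service_time, max_service=8,
                          inter_degrees=(1, 2, 4), intra_degrees=(1, 2, 4, 8),
                          probes=5):
    """Return {(s, inter, intra): mean service rate} for each degree tuple.

    measure_service_time(s, inter, intra) is a hypothetical hook that runs
    one probe request while s requests execute concurrently, each spread
    across `inter` nodes with `intra` cores per node, and returns the
    observed service time in seconds.
    """
    rates = {}
    for s, inter, intra in product(range(1, max_service + 1),
                                   inter_degrees, intra_degrees):
        times = [measure_service_time(s, inter, intra) for _ in range(probes)]
        rates[(s, inter, intra)] = 1.0 / (sum(times) / len(times))
    return rates
```

Because requests are homogeneous, a small number of probes per tuple suffices, which is what keeps the per-tuple cost E small in the P × E total above.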
C. Queueing-based Prediction Model

We develop a queueing model that takes profiling results as input and predicts request latency under different loads and parallelism configurations. The key challenge and novelty of the model is its interference-awareness, effectively quantifying the latency impact of request interference due to cache and memory contention.

1) Problem Formulation: We define the problem as predicting DNN request latency for any given parallel configuration under any given load. We denote a parallelism configuration by (maximum service parallelism C_service, inter-node parallelism C_inter, and intra-node parallelism C_intra). The inputs of the model are: (1) Load in terms of arrival rate λ: here we assume Poisson arrivals over a short period, i.e., exponential inter-arrival times with mean rate λ, which is typical for online services [37], [38]. Such an assumption does not contradict the bursty and long-range-dependence characteristics reported in the literature [39]. SERF continuously monitors the incoming workload and periodically updates its observed load (arrival rate). (2) Profiling results: µ_i (i = 1...c) represents the average service rate when i requests are running concurrently, i.e., the average service rate of the parallelism degree tuple (i, C_inter, C_intra). The output of the model is the average latency of the parallelism configuration under any given load.

We model DNN serving as an interference-aware deterministic service process and formulate the problem as an M/D_interf/c queue. Here, M represents exponential inter-arrival times. D_interf represents two distinctive properties of the DNN workload: (1) deterministic service times, modeling homogeneous requests that exhibit little service time variance
for any given parallelism degree tuple (as shown in Section III-B), and (2) interference-awareness, modeling the interference among requests due to cache and memory contention (as shown in Section III-C). c stands for the maximum service parallelism, equal to C_service.

2) Technical Challenges and Key Ideas: The M/D_interf/c queue does not have a closed-form solution. In fact, even for the simpler problems, namely the interference-oblivious M/D/c queue that assumes deterministic service times without any interference among concurrently running requests, or the interference-aware M/M_interf/c queue that assumes exponentially distributed service times with interference among concurrently running requests, there is no closed-form solution. Intuitively, one may want to use the M/M_interf/c, M/D/c, or M/M/c queue to approximate the M/D_interf/c queue, but such approximations can be quite inaccurate. To illustrate why these simpler approaches cannot model the DNN workload, we implement these approximation methods and conduct experiments using ImageNet-22K. Figure 9 compares the latency of the best configurations under different loads between testbed measurements and the predictions of the M/M_interf/c, M/D/c, and M/M/c queues. The results clearly show that the predictions of these approaches are poor. This large discrepancy shows the importance of incorporating interference and deterministic service times into the SERF prediction model and solution.

Fig. 9. Latency comparison of best configurations between measurement and standard prediction under different loads.

Our solution is inspired by the Cosmetatos approximation [40], which estimates the M/D/c model using the M/M/c model with an adjustment and correction, where the M/M/c model is a standard multi-server queueing model with Poisson arrivals and exponential service times. We extend the approximation approach to the interference-aware case and solve the M/D_interf/c queue in two steps: (1) solve the M/M_interf/c queue, which has interference-aware exponential service times; (2) utilize the method proposed in the Cosmetatos approximation to adjust the results of the M/M_interf/c queue to approximate the M/D_interf/c queue. We estimate the waiting time and service time separately; latency is estimated as the sum of these two measures.

3) Solving the M/D_interf/c queue: Waiting time estimation. We follow the two steps described in Section IV-C2 to solve for the waiting time.

(1) Solving the M/M_interf/c queue. Recall that µ_i (i = 1...c) is provided by profiling and represents the average service rate when i requests are concurrently running, and let p_i denote the probability of i requests in the system. Let $\rho_i = \lambda/\mu_i$ and $\rho = \rho_c/c = \lambda/(c\,\mu_c)$. Based on the state transition diagram shown in Figure 10 and the global balance equations, we obtain:

$p_n = \begin{cases} \dfrac{\prod_{i=1}^{n}\rho_i}{n!}\,p_0, & 0 \le n \le c-1, \\[4pt] \dfrac{\prod_{i=1}^{c}\rho_i}{c!}\,\rho^{\,n-c}\,p_0, & n \ge c, \end{cases}$   (2)

where $p_n$ is the steady-state probability of state n, which represents n requests in the system (the sum of requests in the queue and in service), and $p_0$ represents the probability that the system is idle, i.e., no request is in the system.

Fig. 10. State transition diagram for the M/M_interf/c queue. Each state represents the number of requests in the node.

Since all probabilities sum to 1:

$\sum_{k=0}^{\infty} p_k = p_0\left(1 + \sum_{k=1}^{c-1}\frac{\prod_{i=1}^{k}\rho_i}{k!} + \frac{\prod_{i=1}^{c}\rho_i}{c!}\sum_{k=c}^{\infty}\rho^{\,k-c}\right) = p_0\left(1 + \sum_{k=1}^{c-1}\frac{\prod_{i=1}^{k}\rho_i}{k!} + \frac{\prod_{i=1}^{c}\rho_i}{c!\,(1-\rho)}\right) = 1.$   (3)

Let $H = 1 + \sum_{k=1}^{c-1}\frac{\prod_{i=1}^{k}\rho_i}{k!} + \frac{\prod_{i=1}^{c}\rho_i}{c!\,(1-\rho)}$; then:

$p_0 = H^{-1}.$   (4)

Let $L_q(\lambda)$ be the average number of requests waiting in the queue; by definition we have:

$L_q(\lambda) = \sum_{k=c}^{\infty}(k-c)\,p_k,$   (5)

and together with Eq. 2 we have:

$L_q(\lambda) = \frac{p_0\prod_{i=1}^{c}\rho_i}{c!}\sum_{k=c}^{\infty}(k-c)\,\rho^{\,k-c} = \frac{p_0\left(\prod_{i=1}^{c}\rho_i\right)\rho}{c!\,(1-\rho)^2}.$   (6)
Using Little's law [41], the waiting time in the queue can be computed as:

$W_q^{M/M_{interf}/c}(\lambda) = \frac{L_q(\lambda)}{\lambda} = \frac{p_0\left(\prod_{i=1}^{c}\rho_i\right)\rho}{\lambda\,c!\,(1-\rho)^2}.$   (7)
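As a quick sanity check of Eqs. 2-7 with hypothetical numbers (not measurements from the paper), take c = 2 with profiled rates µ_1 = 4 requests/s and µ_2 = 2.5 requests/s (interference stretches a request from 250 ms alone to 400 ms when two run together), and λ = 3 requests/s. Then ρ_1 = 0.75, ρ_2 = 1.2, ρ = ρ_2/2 = 0.6, H = 1 + 0.75 + (0.75 · 1.2)/(2!(1 − 0.6)) = 2.875, so p_0 ≈ 0.348, L_q = p_0(0.75 · 1.2)(0.6)/(2!(0.4)^2) ≈ 0.587 requests, and W_q^{M/M_interf/c} = L_q/λ ≈ 0.196 s.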

(2) Approximating M/D_interf/c using M/M_interf/c. The Cosmetatos approximation proposed in [40] states that the waiting time in the queue can be approximated as:

$W_q^{M/D/c} \approx \frac{1}{2}\left(1 + f(s)\,g(\rho)\right)W_q^{M/M/c},$   (8)

where

$f(s) = \frac{(s-1)\left(\sqrt{4+5s}-2\right)}{16\,s},$   (9)

$g(\rho) = \frac{1-\rho}{\rho},$   (10)

$\rho = \lambda/(c\,\mu)$, λ is the average arrival rate, and µ is the average service rate. This approximation can be adjusted for the interference-aware case: we use the M/M_interf/c queue with the same correction terms f(s) and g(ρ) as above to approximate the M/D_interf/c queue as follows (using Eq. 7 and Eq. 8):

$W_q^{M/D_{interf}/c}(\lambda) \approx \frac{1}{2}\left(1 + f(s)\,g(\rho)\right)\frac{p_0\left(\prod_{i=1}^{c}\rho_i\right)\rho}{\lambda\,c!\,(1-\rho)^2}.$   (11)

Service time estimation. Although the service time under the same parallelism degree tuple is deterministic and can be profiled, the service time under a given parallel configuration could change with load and needs to be predicted. This is because of the random request arrival process and interference: at different moments, the system may have a different number of concurrently running requests (ranging from 0 to the defined maximum service parallelism), which results in different interference and therefore different service times. We use the PASTA (Poisson Arrivals See Time Averages) property [42] to compute the average service time $S^{M/D_{interf}/c}(\lambda)$ under arrival rate λ as follows:

$S^{M/D_{interf}/c}(\lambda) = \frac{1}{\mu_1}p_0 + \frac{1}{\mu_2}p_1 + \frac{1}{\mu_3}p_2 + \dots + \frac{1}{\mu_c}p_{c-1} + \frac{1}{\mu_c}\sum_{i=c}^{\infty}p_i = \sum_{i=1}^{c}p_{i-1}\frac{1}{\mu_i} + \frac{p_0\prod_{i=1}^{c}\rho_i}{\mu_c\,c!\,(1-\rho)}.$   (12)

Latency estimation. The average latency $W^{M/D_{interf}/c}$ equals the average time spent waiting in the queue, $W_q^{M/D_{interf}/c}$, plus the average time spent in execution, $S^{M/D_{interf}/c}$:

$W^{M/D_{interf}/c}(\lambda) \approx \sum_{i=1}^{c}p_{i-1}\frac{1}{\mu_i} + \frac{p_0\prod_{i=1}^{c}\rho_i}{\mu_c\,c!\,(1-\rho)} + \frac{1}{2}\left(1 + f(s)\,g(\rho)\right)\frac{p_0\left(\prod_{i=1}^{c}\rho_i\right)\rho}{\lambda\,c!\,(1-\rho)^2}.$   (13)

In the above formula, recall that µ_i is an input from profiling, λ is affected by the inter-node parallelism C_inter as it defines how many machines serve each request, c equals the maximum allowed service parallelism C_service, and the intra-node parallelism is restricted by F/c, where F is the number of cores in a node. Therefore, for a given system and workload, latency can be computed under different combinations of service parallelism, inter-node parallelism, and intra-node parallelism. Eq. 13 is used to populate the configuration reference table that is the core of SERF.

The above solution is derived for a single serving unit (with C_inter machines). For a cluster, the cluster can be divided into serving units based on the inter-node parallelism C_inter; e.g., for a cluster with N machines, there are N/C_inter units and each unit has an arrival rate of λ = λ_all/(N/C_inter), where λ_all is the request arrival rate at the cluster.

D. Scheduler

The scheduler takes the current system load as input, searches the configuration reference table, and finds and adapts to the best parallelism configuration. To enable quick configuration switching, the entire DNN model is pre-installed on each server, and each input is sent to the servers along with a mapping of servers to input partitions. This informs each server of which partition of the DNN model to use to process the input, and which servers to communicate with for cross-machine neural connections.

To sum up, we exploit two distinctive properties of DNN workloads (homogeneous requests and interference among concurrently running requests) to develop SERF. SERF combines lightweight profiling with an interference-aware queueing model to predict DNN serving latency. It finds the best parallel configuration for any given load and then deploys a dynamic scheduler to adapt to varying loads online nearly instantly.
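To tie Eqs. 2-13 together, here is a self-contained sketch in Python of the latency prediction that populates the configuration reference table. The profiled rates and the candidate service-parallelism degrees are hypothetical; this is an illustration of the model above under those assumptions, not SERF's actual implementation.

```python
import math

def predict_latency(lam, mu, c):
    """Approximate average latency of the M/D_interf/c queue (Eq. 13).

    lam : arrival rate at one serving unit (requests/s)
    mu  : profiled service rates; mu[i-1] is the per-request rate when i
          requests run concurrently for this (inter-node, intra-node) pair
    c   : maximum service parallelism (C_service), c <= len(mu)
    """
    rho_i = [lam / mu[i] for i in range(c)]       # rho_i = lambda / mu_i
    rho = rho_i[c - 1] / c                        # rho = lambda / (c * mu_c)
    if rho >= 1.0:
        return float("inf")                       # unstable configuration

    def prod(xs):
        out = 1.0
        for x in xs:
            out *= x
        return out

    # Idle probability p_0 from the normalization condition (Eqs. 3-4).
    H = 1.0 + sum(prod(rho_i[:k]) / math.factorial(k) for k in range(1, c)) \
            + prod(rho_i) / (math.factorial(c) * (1.0 - rho))
    p0 = 1.0 / H

    # Steady-state probabilities p_0 .. p_{c-1} (Eq. 2).
    p = [p0] + [prod(rho_i[:n]) / math.factorial(n) * p0 for n in range(1, c)]

    # Waiting time: M/M_interf/c result (Eq. 7) with the Cosmetatos-style
    # correction terms f and g (Eqs. 8-11).
    wq_mm = p0 * prod(rho_i) * rho / (lam * math.factorial(c) * (1.0 - rho) ** 2)
    f_c = (c - 1) * (math.sqrt(4 + 5 * c) - 2) / (16 * c)
    g_rho = (1.0 - rho) / rho
    wq = 0.5 * (1.0 + f_c * g_rho) * wq_mm

    # Average service time via the PASTA property (Eq. 12).
    s = sum(p[i - 1] / mu[i - 1] for i in range(1, c + 1)) \
        + p0 * prod(rho_i) / (mu[c - 1] * math.factorial(c) * (1.0 - rho))

    return wq + s

if __name__ == "__main__":
    # Hypothetical profiled rates (requests/s) for one (inter, intra) pair:
    # interference lowers the per-request rate as concurrency grows.
    mu = [4.0, 2.5, 1.6, 1.1]
    for lam in (0.5, 1.5, 3.0):
        best = min(range(1, len(mu) + 1),
                   key=lambda cand: predict_latency(lam, mu, cand))
        print(f"lambda={lam}: best C_service={best}, "
              f"latency={predict_latency(lam, mu, best):.3f} s")
```

The loop at the end scans candidate C_service values at a few arrival rates, mirroring how the configuration reference table maps each load level to its lowest-latency configuration.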
V. EXPERIMENTAL EVALUATION

We present experimental results demonstrating how SERF improves DNN serving performance with respect to minimizing response latency. Specifically, we evaluate the following properties of SERF: (i) accuracy of the latency prediction model, (ii) correct identification of the best parallel configuration under different loads, (iii) adaptability to load dynamism compared to a static configuration, and (iv) efficient best-configuration search compared to exhaustive profiling.

A. Experimental Setup

System Overview: We prototyped SERF based on the Adam distributed DNN system [3], which supports service parallelism through admission control, intra-node parallelism using OpenMP [29], and inter-node parallelism by partitioning the model across different machines. In order to quickly switch configurations, the entire set of model parameters is pre-installed on each server, and each input is augmented with a mapping of servers to input partitions. As most distributed DNN serving platforms support some or all of these parallelisms, SERF can be used in other systems as well.

Workload: We evaluate SERF using 3 popular image recognition tasks of varying complexity with Poisson request arrivals:

CIFAR-10 [2]: classifies 32x32 color images into 10 categories. The DNN is moderately sized, containing a few
million connections in 5 layers: 2 convolutional layers with pooling, 2 fully connected layers, and a 10-way output layer.

ImageNet-1K [23]: classifies 256x256 color images into 1,000 categories. The DNN is moderately large, containing about 60 million connections in 8 layers: 5 convolutional layers with pooling, 3 fully connected layers, and a 1,000-way output layer [2].

ImageNet-22K [23]: the largest ImageNet task, which classifies 256x256 color images into 22,000 categories. This DNN is extremely large, containing over 2 billion connections in 8 layers: 5 convolutional layers with pooling, 3 fully connected layers, and a 22,000-way output layer [3].

Hardware Environment: Experiments are run on a computing cluster of 20 identically configured commodity machines, communicating over Ethernet through a single 10Gbps (bidirectional) NIC. Each machine is dual-socket, with an Intel Xeon E processor of 8 cores running at 2.1GHz on each socket. Each machine has 64 GB of memory and a GFLOP/s SIMD FPU.

B. Accuracy of Latency Prediction Model

This section evaluates the accuracy of the latency prediction model, based on Eq. 13, by comparing predicted values to measured values. Figure 11 shows, for each workload, the average and distribution of prediction errors for all relevant prediction cases. A relevant prediction case is a combination of a parallel configuration that has performance impact for a workload and a load level. For example, CIFAR-10 has 20 parallel configurations because inter-node parallelism degrees > 1 do not make sense for its small size. The larger ImageNet-1K and ImageNet-22K have 40 and 80 parallel configurations because inter-node parallelism degrees of up to 2 and 4 are relevant, respectively. For each benchmark we consider 10 load levels evenly spread from low load to high load, so that there are 200, 400, and 800 relevant prediction cases for CIFAR-10, ImageNet-1K, and ImageNet-22K, respectively. The results show that the prediction is accurate and the errors are insignificant: the average error is 2-4%, the 90th percentile is < 10%, and the 95th percentile is < 12%.

Fig. 11. CDF (Cumulative Distribution Function) and DDH (Data Density Histogram) of prediction errors for different workloads.

C. Identifying Best Configurations

We use our prediction model to identify the best configuration under different load levels and then compare with the testbed measurement ground truth. The experimental results show that SERF always correctly identifies the best configuration. The top plot in Figure 12 depicts the best configuration and the corresponding latency for serving the CIFAR-10 workload under different load levels. It is clear that when the load increases, the latency of the best configuration also increases due to queueing effects. Moreover, we observe the following: When load is low, intra-node parallelism is useful since the service time is the dominant factor in latency; intra-node parallelism helps reduce service times and therefore achieves better overall latency. When load is high, since the interference caused by service parallelism is low for CIFAR-10, service parallelism is needed because the waiting time becomes the dominant factor, and service parallelism reduces waiting time more efficiently by allowing more requests to run in parallel.

Fig. 12. Best configurations and corresponding latency under different loads.

For large DNNs like ImageNet-22K, the observation is interesting and counter-intuitive, see the bottom plot in Figure 12. Different from CIFAR-10, even under high loads, the best parallel configuration still has a service parallelism of only 2.
Intuitively, when the load is high, admitting more requests into the system could yield better performance and the maximum
service parallelism should be best. This counter-intuitive result is a consequence of the high interference. When the interference among requests is high, service parallelism may significantly increase service time, which exceeds the waiting time benefit brought by allowing more requests to run in parallel, causing higher latency.

D. Benefits over Exhaustive Profiling

We evaluate SERF here against exhaustive profiling for identifying the best parallel configurations under different load levels. The experimental results verified that both SERF and exhaustive profiling always correctly identify the best configuration. However, the cost of SERF is significantly lower than that of exhaustive profiling. Assume that the system has 80 different parallel configurations and the performance reference table has 10 entries (e.g., 10 different load levels). SERF requires only 80 quick profiling experiments, while exhaustive profiling requires 800 expensive profiling experiments to build the performance reference table. The time for each profiling experiment and the total time to build the performance reference table are shown in Table II. Note that SERF requires much less time for each profiling experiment because it only samples the service time, and the service time is deterministic without load impact (i.e., it samples the service time of only 10 requests), while each exhaustive profiling experiment needs to measure the average latency, which requires many samples to achieve statistical stability (e.g., with fewer than 5000 sample requests, the latency measurements become very unstable). The results suggest that the time cost of SERF is more than 3 orders of magnitude lower than that of exhaustive profiling, and the time savings grow with the size of the DNN workload and the number of performance reference table entries. Even when compared to lightweight profiling, e.g., profiling only under high load, the cost of SERF is still more than 2 orders of magnitude lower.

TABLE II. Cost comparison between exhaustive profiling and SERF for different benchmarks (for each of ImageNet-22K, ImageNet-1K, and CIFAR-10 under exhaustive profiling and SERF: number of configurations, number of load levels, number of profiling experiments to run, time per profiling experiment in minutes, and total time in minutes).

E. Benefits over Static Configuration

Request arrival rate and system load change dynamically for online services [38]. In this section, we demonstrate how SERF outperforms fixed configurations by adapting to load changes. We use three baseline cases for comparison. Fixed-low is a best configuration under low load and Fixed-mod is a best configuration under moderate load. Fixed-other is another configuration that performs better than Fixed-low under moderate load and better than Fixed-mod under low load. We compare the performance of SERF and these baseline cases in a dynamic user environment with load changing from moderate to low and then back to moderate, see Figure 13. The y-axis is the latency measured in ms, and the x-axis represents the experiment's elapsed time. Fixed-low and Fixed-mod perform well under the loads that they are optimized for, but perform poorly when the load changes. Fixed-other achieves more stable performance, but is not best under any load level. SERF outperforms all these baseline scheduling methods and consistently adapts to the load change to achieve the lowest latency. This experiment validates the need for the adaptivity of SERF in a dynamic workload environment, where, for example, a best configuration for high loads could be sub-optimal for low loads.
In addition, the profiling cost of these fixed configurations is more than 2 orders of magnitude higher than SERF's: e.g., for ImageNet-22K, it takes nearly 2 days to identify the Fixed-low or Fixed-mod configuration by profiling, and it takes even longer for Fixed-other as profiling needs to be done for multiple loads. In comparison, SERF only takes a few minutes to identify the best parallel configurations under various loads.

Fig. 13. Latency comparison under a dynamic load environment for different scheduling approaches.

F. Discussion

Scalability. When SERF works in large systems, the number of profiling experiments scales linearly with the total number of parallelism combinations. Because each profiling experiment takes less than a few seconds, even for large systems running large and scalable applications with thousands of parallelism combinations, the profiling takes no more than a few hours. This profiling time can be further reduced to a few minutes if profiling experiments are conducted in parallel or at a coarser granularity. In addition, the computation of the queueing model is efficient, i.e., constant with respect to the cluster size. Therefore, SERF is scalable enough to schedule large systems.

Generalization to other workloads. SERF also has the potential to be used in other applications because SERF is developed based on two abstracted properties: homogeneous
request service demand and interference among requests running concurrently. Applications with similar properties can also benefit from our approach. We use web-search ranking [43] as an example for evaluation as it represents a typical supervised machine learning problem. We instrument the implementation in [43] to make it a parallel version that simulates a serving system. We run extensive experiments with various load levels using different parallelism configurations. We show the average and distribution of the prediction errors in Figure 14. The results show SERF is quite accurate: the average error is only 1.07%, the 90th percentile is < 3%, and the 95th percentile is < 5%. The experimental results also show that SERF always identifies the best configuration correctly. In the interest of space, we omit a detailed discussion here.

Fig. 14. CDF and DDH of prediction errors.

VI. RELATED WORK

DNN Serving. The state-of-the-art accuracy of DNNs on important artificial intelligence tasks, such as image recognition [1], [2], [3], speech recognition [9], [10], and text processing [11], has made the end-to-end latency of large-scale DNN serving systems an important research topic. Parallelism has been shown to be critical for good DNN performance at scale. Prior work [1], [3] has shown that parallel training on a cluster of commodity CPU machines achieves high throughput and thus can train big DNN models (billions of connections) in a reasonable amount of time (days instead of months). Although these training platforms focus on improving system throughput instead of request latency, the parallelism mechanisms proposed there translate directly to serving platforms as inter-node, intra-node, and service parallelism. Several recent works on DNN serving investigate hardware acceleration using GPUs [44], FPGAs [45], and ASICs [28]. They focus on mapping DNN computation to customized hardware, but parallelism has also been shown to be critical to offer low latency. All these prior studies develop DNN serving platforms that support all or a subset of the parallelism mechanisms exploited in our paper. However, none of them investigates scheduling frameworks that make parallelism configuration choices based on DNN characteristics, hardware characteristics, and system load, which is the focus of SERF. SERF is complementary to the above work and can be used as a scheduling framework for these serving platforms to identify the best parallelism configurations and maximize their parallelism benefits.

Interactive Serving. There is a host of research on parallelizing request processing to reduce response latency, and on request scheduling in a multiprocessor environment to reduce average latency. There has been a lot of work on measuring and mitigating interference among co-located workloads [46], [47]. The main theme is to predict performance interference among workloads and discover optimal workload co-locations to improve system utilization while meeting user performance goals. These studies treat each workload as a black box, and they do not consider solutions that involve modifying the workload (e.g., changing the parallelism degree). Adaptive parallelism for interactive server systems uses intra-node and service parallelism to reduce request latency. Raman et al. propose an API and runtime system for dynamic parallelism [33], where developers express parallelism options and goals, such as minimizing mean response time. Jeon et al. [32] propose a dynamic parallelization algorithm to decide the degree of request parallelism in order to reduce the average response time of Web search queries.
Both approaches assume independent service times among requests, and thus they do not consider interference among concurrently running requests, which is a key property of DNN workloads supported by SERF. Another line of work [48] proposes to use parallelism to reduce tail latency. DNN requests, however, are homogeneous with similar service times, making these techniques ineffective. Finding the best parallel configurations has also been studied for other applications and systems, such as databases, data analytics, and MapReduce [49], [50], [51]. However, none of these prior works leverages the distinctive properties of DNN workloads to exploit request homogeneity and interference awareness as SERF does.

Queueing Models. Here we outline some results related to the M/D/c queue abstraction used in our work. While the solution of the M/M/c system is exact [52], there are no exact solutions for M/D/c systems. We note the existence of the Allen-Cunneen approximation formula for GI/G/c [41] and Kimura's approximation [53], both of which can also apply to M/D/c since M is a special case of GI. Alternatively, an M/D/c system can be approximated using an n-stage Erlang distribution for the service process, essentially by approximating the system using an M/Ph/c queue. While the M/Ph/1 queue can be solved using the matrix-geometric method [54], the M/Ph/c queue suffers from the well-known problem of state space explosion. We direct the interested reader to [34] for an overview of various results on the M/D/c queue that have been developed since the early 1930s. However, none of the above approximation methods for M/D/c systems can be easily adapted to estimate the latency of M/D_interf/c systems. Here we extend the approximation by Cosmetatos to achieve this goal.

VII. CONCLUSIONS

We presented SERF, a scheduling framework for DNN serving systems, which combines lightweight profiling with an interference-aware queueing-based prediction model. SERF efficiently identifies the best parallel configurations to minimize average request latency, and it dynamically adapts to varying loads almost instantly.