ON THE PERFORMANCE IMPACT OF FAIR SHARE SCHEDULING


Ethan Bolker
BMC Software, Inc.
University of Massachusetts, Boston

Yiping Ding
BMC Software, Inc.

Fair share scheduling is a way to guarantee application performance by explicitly allocating shares of system resources among competing workloads. HP, IBM and Sun each offer a fair share scheduling package on their UNIX platforms. In this paper we construct a simple model of the semantics of CPU allocation for transaction workloads and report on some experiments that show that the model captures the behavior of each of the three implementations. Both the model and the supporting data illustrate some surprising conclusions that should help administrators to use these tools wisely.

1. Introduction

As midrange UNIX systems penetrate the server market they must provide packages that allow an administrator to specify operational or performance goals, rather than requiring them to tinker with priorities (UNIX nice values). It's then the software's responsibility to adjust tuning parameters so that those goals are met, if possible. Mainframe systems have been equipped with such tools for a long time. Recently HP, IBM and Sun have introduced similar UNIX offerings in this area. HP's Process Resource Manager for HP-UX (PRM), IBM's Workload Manager for AIX (WLM) and Sun's System Resource Manager for Solaris (SRM)[1] packages each allow the administrator to specify the share of some UNIX resource, like CPU or memory, that should be made available to particular groups of users or processes.

In this paper we explore the semantics of packages like these, focusing on the allocation of CPU shares. We will model the behavior of share allocation, make predictions based on our model, and see how the predictions match the outcome of some benchmarking experiments. Note that share allocations are not truly performance oriented.

[1] Whoever chose this name at Sun must have known that the IBM OS/390 resource allocation manager is also called SRM - an MVS legacy.
Administrators would prefer to specify response times or throughputs and have the system determine the shares and hence the dispatching priorities.[2] But specifying shares is a big step in the right direction. Predicting the result of setting shares is easier than predicting the result of setting priorities, once you understand some of the surprising subtleties of share semantics and their implications.

A useful side effect of the decision to employ one of these scheduling packages is the need to decide how to organize the work on your system in order to allocate resources. That very effort will help you characterize your workloads before you have even begun to tune for performance. All three packages allow you to group work by user.

[2] IBM's OS/390 Workload Manager comes closer to this goal. It's unfortunate that IBM chose the same name for the fair share scheduler it offers for AIX.

PRM and WLM allow you to group it by process/application as well. The packages themselves provide some reporting tools that allow you to track resource consumption for the workloads you specify.

2. CPU-bound Workloads

We start our study with CPU-bound workloads since they are the easiest to understand. Suppose the system supports two workloads, each of which is CPU bound. That is, each would use 100% of the processor's cycles if it were permitted to do so, but the system has been configured so that the workloads are allocated shares f1 and f2 of the processor. We represent shares as fractions (normalized), so that f1 + f2 = 1. Then each workload runs on its own virtual CPU with the appropriate fraction f of the total processing power. Thus work which would complete in s seconds on a dedicated processor will take (1/f)·s seconds instead. If we imagine that a workload consists of a stream of CPU-bound jobs with batch scheduling, so that one starts as soon as its predecessor completes, then we can restate this conclusion in terms of throughput: a throughput of t jobs per second on a dedicated machine becomes f·t jobs per second when the workload has fraction f of the processor. Sun and IBM verified this prediction to validate their implementations.[3]

3. Transaction Processing Workloads

In transaction processing environments a workload is rarely CPU bound. It is usually a stream of jobs with a known arrival rate λ (jobs per second), where each job needs s seconds of CPU service. Then the throughput is the arrival rate, as long as the utilization u = λ·s is less than 1 (100%). The important performance metric is the response time. With reasonable randomness assumptions for both arrival rates and service times (λ and s are both averages, after all) the average response time at a uniprocessor CPU will be s/(1 - u). So, for example, a 3-second job will take 3/(1 - 0.75) = 12 seconds on a CPU that is 75% busy.

[3] IBM's experiments are reported in the AIX Workload Manager Technical Reference [IBM00].

Suppose that the system supports two transaction processing workloads.
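These formulas are one-liners, easy to experiment with. Below is a minimal sketch (ours, not code from any of the three packages) of the uniprocessor response time formula, the share-scaled throughput rule for CPU-bound work, and the capped response time s/(f - u) derived in Section 3.1:

```python
def mm1_response_time(s, u):
    """Average response time s / (1 - u) at a uniprocessor that is u busy."""
    assert 0.0 <= u < 1.0, "the formula holds only below saturation"
    return s / (1.0 - u)

def scaled_throughput(t, f):
    """A CPU-bound workload's dedicated throughput t becomes f*t under share f."""
    return f * t

def capped_response_time(s, f, u):
    """With shares as caps the workload sees a virtual CPU of speed f,
    so R = (s/f) / (1 - u/f) = s / (f - u); requires u < f."""
    assert u < f, "with caps, the share must exceed the utilization"
    return s / (f - u)

# The example from the text: a 3-second job on a CPU that is 75% busy.
print(mm1_response_time(3.0, 0.75))  # 12.0
```

The caps example of Section 3.1 follows the same pattern: capped_response_time(1.0, 0.7, 0.2) is 2 seconds, versus mm1_response_time(1.0, 0.2) = 1.25 seconds on a wholly owned processor.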
Workload i has CPU share fi, arrival rate λi and service demand si, where i = 1, 2. What are the response times of the workloads, assuming that λ1·s1 + λ2·s2 < 1, so that all the work can get done? The answer depends on the semantics of share assignment. There are two possibilities: shares may be caps or guarantees.

3.1. Shares as Caps

When shares are caps, each workload owns its fraction of the CPU. If it needs that much processing power it will get it, but it will never get more, even if the rest of the processor is idle. These are the semantics of choice when your company has a large web server and is selling fractions of that server to customers who have hired you to host their sites. Each customer gets what he or she pays for, but no more.

This situation is easy to model. As before, each workload has a virtual CPU whose power is the appropriate fraction of the power of the real processor. Then a transaction workload allocated fraction f of the processor will need s/f seconds to do what would take s seconds on the full machine. The utilization of the workload's virtual machine will be (s/f)·λ = u/f, and transaction response time at the CPU will be (s/f)/(1 - (u/f)) = s/(f - u), provided u is less than f: when shares are caps, a workload's share must exceed its utilization if it is to get its work done.

This simple analysis helps us make our first counterintuitive assertion: share is not the same as utilization. A workload may be allocated 70% of the processor even though on the average it uses only 20%. In that case its response time will be s/(0.7 - 0.2) = 2s rather than the s/(1 - 0.2) = 1.25s it would enjoy if it owned the entire processor.

3.2. Shares as Guarantees

The situation is more complicated when shares are merely guarantees. Suppose there are two workloads, each of which has been allocated a share of the CPU. Then on the average they will

receive CPU time slices in the ratio f1 : f2 whenever both are on the run queue simultaneously. But when either workload is present alone it gets the full processing power. Thus shares serve to cap CPU usage only when there is contention from other workloads. This is how you might choose to divide cycles between production and development workloads sharing a CPU. Each would be guaranteed some cycles, but would be free to use more if they became available.

In this case utilizations may be larger than shares. If workload 1 needs 70% of the cycles while workload 2 needs 10%, then over time the CPU will be 80% busy. Both workloads can get their work done even if they are allocated shares f1 = 0.2 and f2 = 0.8, since most of the time when workload 1 needs the CPU workload 2 will be absent, so workload 1 can use more than its guaranteed 20%. Allocating a small share to a busy workload slows it down but does not prevent it from completing its work, as long as the system is not saturated.

But how much will it be slowed down? What will the response times be? To answer that question we propose a model for how the CPU might choose jobs from the run queue. Our model interprets shares as probabilities. We assume that the processor chooses to award the next available cycle to workload i with probability pi = fi, making that choice without looking first to see whether that workload is on the run queue at the moment. If it is not, the cycle goes to the other workload if it is ready. If neither workload is on the run queue, the cycle is wasted. Thus we can say that in this model workload 1 runs at high priority with probability p1 and at low priority with probability 1 - p1 = p2. With the usual assumptions about random arrival rates and service times, the workload response times can be computed with simple formulae:[4]

    workload 1 response time at high priority = s1 / (1 - u1),

[4] These formulae can be found in the queueing theory literature. Note that they apply to uniprocessors only. The corresponding formulae for multiprocessors are known and can be used in our model when required.
See [Klein75] and [BB83] for details.

    workload 1 response time at low priority = s1 / ((1 - u2)(1 - u1 - u2)),

where u1 and u2 are the utilizations of the two workloads. The response time of the transactions in each workload is the weighted average of the response time when it has high priority and that when it has low priority:

    workload 1 response time = p1·s1 / (1 - u1) + p2·s1 / ((1 - u2)(1 - u1 - u2)).

It is no surprise, but worth noting, that each workload's response time depends on both utilizations as well as on the relative shares. We will return to this point later.

4. Model Validation

How any particular operating system awards CPU time slices so that each workload gets its proper share is both subtle and system dependent. For example, SRM monitors recent activity in order to maintain the required shares over time. We do not take up those matters in this paper. We are interested in modeling the long-term average behavior of the system, not the means by which it maintains those averages.[5]

We tested our model in several benchmark experiments. A lightweight server daemon runs on the machine to be measured. At random times a client program running on another machine asks the daemon to execute a CPU-intensive job for a particular user for a random length of time. The daemon keeps track of the CPU resources that job consumes, and timestamps its start and end. A postprocessor computes utilizations and response times.[6]

[5] After developing our model we discovered a fair share scheduler that works by implementing the model directly: see [Wald95] and [WW94].

[6] Thanks to Aaron Ball and Tom Larard, who made the benchmark daemon as lightweight as it needed to be, to Philip Leung, our model-evaluator-in-chief, and to Kamlesh Mungekar and

In the first set of experiments we created two transaction workloads, using a seeded random number generator so that we could see how the same job streams responded under different share allocations. Workload 1 kept the CPU about 24% busy, workload 2 about 47% busy. Figure 1 shows the average response time on our Sun system for workload 1 executing a transaction that consumes 1 second of CPU (on average), as a function of the relative share allocated to that workload.[7] The straight line shows the response times predicted by our model. Figure 2 shows both benchmark and predicted results for the average response times for workload 2 as a function of the share allocated to workload 1.

Figure 1. The benchmark and predicted average response times (Sun) of workload 1 as a function of its allocated share.

Figure 2. The benchmark and predicted average response times (Sun) of workload 2 as a function of the share allocated to workload 1.

Figure 3 shows the response times for both workloads and the average response time, weighted by workload utilization. In this case workload 2 has roughly twice the weight of workload 1. That average is nearly constant, independent of the relative share allocation, confirming the theoretical analysis that predicts response time conservation: a quantification of the fact that there is no such thing as a free lunch. One workload only benefits at the other's expense. The results for HP/PRM and IBM/WLM are similar; we show them in an appendix.

[6, continued] Fred Ziegler for system heroics beyond the call of duty.

[7] You cannot actually assign a workload a share of 0. We approximated this situation by making the relative share of that workload 0.001.
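The response time conservation observed in the benchmark is exact in the model. A quick check (our own sketch, using the Section 3.2 formulas with the benchmark utilizations and one second of service):

```python
def weighted_avg_response(f1, u1, u2):
    """Utilization-weighted average of the two model response times."""
    r1 = f1 / (1 - u1) + (1 - f1) / ((1 - u2) * (1 - u1 - u2))
    r2 = (1 - f1) / (1 - u2) + f1 / ((1 - u1) * (1 - u1 - u2))
    return (u1 * r1 + u2 * r2) / (u1 + u2)

# Benchmark utilizations: the weighted average barely moves as f1 varies.
u1, u2 = 0.24, 0.47
samples = [weighted_avg_response(f / 10.0, u1, u2) for f in range(1, 10)]
print(max(samples) - min(samples))  # ~0: response time is conserved
```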

Figure 3. The measured response times (Sun) of workload 1 and workload 2 as a function of the relative share allocated to workload 1, together with their sum, weighted by their utilizations.

5. Consequences

Now that we have a valid model we can explore some of the consequences of fair share CPU scheduling without having to run more benchmarks. This is exactly the kind of what-if analysis models make easy. Suppose that we imagine that the two workloads in our benchmark each grow at the same rate, so that the overall CPU utilization grows from 70% to 90%. What effect will share allocations have at these utilizations? Figure 4 tells us the answers. We have already seen the two lower lines in that figure: they come from the model corresponding to our benchmark experiment. The two upper lines represent our what-if analysis. At the higher utilization each workload's response time is more sensitive to the share allocation: the lines in the figure are steeper. And at high utilization it is even clearer that share allocations affect workload 1 more than workload 2. That is because workload 1 has just half the utilization of workload 2, so when workload 1 has a small share there is lots of workload 2 processing that usually runs at higher priority.

There is a moral to this story: if you give a heavy workload a large share (because its performance is important) you may seriously affect other lighter workloads. Conversely, giving a light workload a large share in order to improve its response time may not hurt the other workloads unacceptably much.

Figure 4. Response times for workloads 1 and 2 as a function of the share allocated to workload 1, at total utilizations of 70% and 90%.

The lines representing the workload response times at a total utilization of 90% cross at a response time of 10 seconds, corresponding to a share allocation of 0.6+ for workload 1 and 0.3+ for workload 2. Those share allocations produce the response times that would be observed if no fair share scheduling were in effect: 1/(1 - 0.9) = 10 seconds for each workload.
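That crossing is easy to locate numerically with the guarantee-mode formulas of Section 3.2 (service times of 1 second; the utilizations 0.3 and 0.6 are the 90% what-if values):

```python
def r1(f1, u1, u2):
    """Workload 1 response time (s = 1) under guaranteed shares f1, 1 - f1."""
    return f1 / (1 - u1) + (1 - f1) / ((1 - u2) * (1 - u1 - u2))

def r2(f1, u1, u2):
    """Workload 2 response time under the same allocation."""
    return (1 - f1) / (1 - u2) + f1 / ((1 - u1) * (1 - u1 - u2))

# Scan the shares to find where the two curves cross at u1 = 0.3, u2 = 0.6.
u1, u2 = 0.3, 0.6
f_cross = min((k / 1000.0 for k in range(1, 1000)),
              key=lambda f: abs(r1(f, u1, u2) - r2(f, u1, u2)))
print(f_cross, r1(f_cross, u1, u2))  # roughly 0.636 and 10.0 = 1/(1 - 0.9)
```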
That's an easy consequence of the algebra in our model, and somewhat surprising: to make fair share scheduling for two workloads at high utilization mimic ordinary first-come-first-served or round-robin scheduling, allocate each workload a relative share that is the utilization of the other workload.

6. Comparing Models

In Figure 5 we compare the response time predictions of three models for fair share scheduling of two workloads. The first is the model presented in [Gun99] and [Sun99b]: a batch processing environment in which each workload runs a single process sequentially submitting one-second compute-bound jobs (the formulae are in Section 2 above). The second two are transaction processing environments in which shares may be caps (Section 3.1) or guarantees (Section 3.2). In each of these cases workload utilizations are 30% and 60%.

Figure 5. Response times for workload 1 as a function of its allocated share under three modeling assumptions: compute-bound batch processing, and transaction processing with shares as caps and as guarantees.

The primary conclusion to draw from Figure 5 is that these models predict quite different behavior, so it is important to choose the right one. The batch processing model yields results that are independent of workload utilizations. When shares are caps, shares must exceed utilizations; the figure shows that workload 1's response time increases without bound as its share decreases toward its utilization of 0.3. When shares are guarantees there is no unbounded behavior.

7. Many Transaction Workloads

Our discussion so far has focused on the simple case in which there are just two workloads. Our model extends to handle more. When there are n transaction workloads with guaranteed shares we again model the system by assuming that each of the n! possible priority orders occurs with a probability determined by the shares. Then we compute each workload's response time as a weighted average of n! terms. Here is the average response time formula for workload 1 in the three-workload case, with average service time normalized to 1. The sum has six terms, one for each of the possible priority orderings of the workloads:

    workload 1 response time =
        p(1,2,3) / (1 - u1)
      + p(1,3,2) / (1 - u1)
      + p(2,1,3) / ((1 - u2)(1 - u1 - u2))
      + p(3,1,2) / ((1 - u3)(1 - u1 - u3))
      + p(2,3,1) / ((1 - u2)(1 - u2 - u3)(1 - u1 - u2 - u3))
      + p(3,2,1) / ((1 - u3)(1 - u2 - u3)(1 - u1 - u2 - u3)),

where p(i,j,k) is the probability that at any particular moment workload i has the top priority, workload j the next and workload k has the lowest priority. That probability is

    p(i,j,k) = (fi / (fi + fj + fk)) · (fj / (fj + fk)),

where fi, fj and fk are the shares of workloads i, j and k, respectively.

These formulae can be simplified algebraically. An optimizing compiler or a clever programmer would find many repeated sub-expressions to speed up the arithmetic. We have left them in this form so that you can see from the symmetry how they would generalize to more workloads. When there are more than three workloads the formulae are too unwieldy to compute with by hand. We wrote a program to do the work. Visit

to play with it.[8] Even that program is not useful for large numbers of workloads, since it runs in time O(n!). But you may not want to specify shares separately for many workloads, since the scheduling overhead increases with the complexity of the decisions the scheduler must make.[9]

We tested our model by running a series of benchmarks on our IBM system. In each experiment the same random sequence of jobs was generated. The three workloads had utilizations 0.4, 0.2 and 0.39; we varied the CPU share assignments from (near) 0.0 to (near) 1.0 in increments of 0.2, in all possible ways that sum to 1.0: 21 experiments in all.[10] Figure 6 shows how workload 1's response time varied as the share settings were changed: it is larger toward the rear of the picture, where its share is smaller. The data for the other two workloads lead to similar pictures. Figure 7 shows how our response time predictions compared with the measured values for workload 2 in each of the 21 experiments.

Figure 6. Workload 1 response time as a function of share settings when there are three workloads.

Figure 8 shows that the response time conservation predicted by the theory is confirmed by the experiments.

[8] Chris Thornley wrote the applet.

[9] In our experiments with two or three workloads that overhead is low. We plan to study how it increases as the demands on the scheduler increase.

[10] Since the shares sum to 1.0 it suffices to vary the shares of any two workloads independently.

Figure 7. Workload 2 response time: measured vs. predicted values for each of the 21 experiments.
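The n-workload model is mechanical enough to sketch directly. The program below is our own illustration (not vendor code, and not the applet mentioned above): it enumerates the n! priority orders, weights each by its share-derived probability, and accumulates the prefix-product denominators of the formula in Section 7:

```python
from itertools import permutations

def response_times(shares, utils, svc=None):
    """Average response time per workload under the probabilistic
    priority model. Each priority order occurs with probability
    prod(share / sum of not-yet-chosen shares); within an order a
    workload's denominator is the product of (1 - cumulative
    utilization) over every priority level down to its own."""
    n = len(shares)
    svc = svc or [1.0] * n
    r = [0.0] * n
    for order in permutations(range(n)):
        p, remaining = 1.0, sum(shares)
        for w in order:                 # probability of this priority order
            p *= shares[w] / remaining
            remaining -= shares[w]
        cum, denom = 0.0, 1.0
        for w in order:                 # prefix-product denominators
            cum += utils[w]
            denom *= 1.0 - cum
            r[w] += p * svc[w] / denom
    return r

# Two equal shares at the benchmark utilizations of Section 4.
print(response_times([0.5, 0.5], [0.24, 0.47]))
```

For n = 2 this reduces to the closed-form weighted average of Section 3.2; its O(n·n!) running time is exactly why the text warns against specifying shares separately for very many workloads.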

Figure 8. Weighted average response time as a function of share settings when there are three workloads.

8. Hierarchical Allocations

Fair share schedulers often allow for more sophisticated assignments of shares by arranging workloads as the leaves of a tree in which each node has a share. For example, suppose there are two production workloads that serve your customers' computing needs and one development workload, and that you have assigned shares this way:

    Group           Share
    production      0.8
      customer 1    0.4
      customer 2    0.6
    development     0.2

This scheme divides the 80% of the CPU allocated to production between the two customers in the ratio 4:6. Here too the analysis for compute-bound work is straightforward. The shares multiply when you flatten the tree: customer 1 will use 0.8 × 0.4 = 32% of the CPU, customer 2 will use 48% and development 20%. The same fractions apply for transaction work if the shares are caps.

But when the shares are guarantees the answer is different. We model the hierarchy by assuming that 80% of the time production work is at high priority and development work at low, and that whenever production is awarded the CPU, customer 1 is at high priority 40% of the time. So the system operates in one of four possible states with these probabilities:

    Priority order      Probability
    (c1, c2, dev)       0.8 × 0.4 = 0.32
    (c2, c1, dev)       0.8 × 0.6 = 0.48
    (dev, c1, c2)       0.2 × 0.4 = 0.08
    (dev, c2, c1)       0.2 × 0.6 = 0.12

Then we compute the response times for each workload in each of the four states with standard formulas from queueing theory, and use the probabilities to construct the weighted average response time for each workload. The answer will depend on the utilizations as well as the shares allocated to the workloads.

Once you accept the fact that flattening the tree is wrong for transaction workloads you can begin to understand why. Think about what happens when just workloads customer 1 and development are competing for the processor. Because both customers are combined in a group, customer 1 is free to use what is guaranteed to customer 2.
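The four state probabilities above multiply down the share tree; a small sketch (the variable names are ours) reproduces them:

```python
# Share tree: production 0.8, split 0.4/0.6 between the two customers,
# and development 0.2. State probabilities multiply down the tree.
prod, dev = 0.8, 0.2
c1, c2 = 0.4, 0.6          # customer splits within production

states = {
    ("c1", "c2", "dev"): prod * c1,   # 0.32
    ("c2", "c1", "dev"): prod * c2,   # 0.48
    ("dev", "c1", "c2"): dev * c1,    # 0.08
    ("dev", "c2", "c1"): dev * c2,    # 0.12
}
assert abs(sum(states.values()) - 1.0) < 1e-9   # a complete set of states
for order, p in states.items():
    print(order, round(p, 2))
```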
With the specified share hierarchy, workload customer 1 will progress four times as fast as development. In the flattened tree it will progress only about one and a half times as fast (the ratio is 32/20). This illustrates a well-known business phenomenon: customers can get a better deal by pooling their requests.

We conducted an SRM benchmark study that vividly illustrates this phenomenon. The following table shows the share allocation hierarchy, the measured utilizations and response times, and the predicted response times for one second of CPU work.

    Group        Share    Util    Resp (meas)    Resp (model)
    group-1-2
      wkl 1
      wkl 2
    wkl 3

This hierarchy says essentially that group-1-2 always has priority over workload 3 (since the ratio of shares is 100:1), while within group-1-2 workload 1 has priority over workload 2. In the

flattened configuration the ratio of the shares would be 10000:1:100, reversing the priority orders of workloads 2 and 3. Were that the case we would expect to see response times of 1.27, 5.23 and 2.09 for workloads 1, 2 and 3 respectively, instead of the observed values that match the predictions from our model.

9. Summary

The ability to allocate resource shares to workloads is a powerful tool in the hands of administrators who need to guarantee predictable performance. To use such a tool wisely you should master a high-level view before you start to tinker. Our model and the benchmark studies that validate it help you understand that:

- Transaction workloads and CPU-bound workloads behave quite differently under fair share scheduling
- Shares are not utilizations
- The effects of shares depend on utilizations
- Guarantees need not be caps
- Important workloads should not unthinkingly be allocated large shares
- Response time is conserved
- Hierarchical allocation strategies may produce results that seem counterintuitive
- To a first approximation, the three schedulers we studied behave in quite similar ways, even though were you to study their documentation you would see quite different implementations. It follows that you do not need to understand the implementations to make good first approximations when using them.

References

[BB83] Buzen, J. P. and Bondi, A. R., "The Response Times of Priority Classes under Preemptive Resume in M/M/m Queues," Operations Research, Vol. 31, No. 3, May-June 1983.

[Gun99] N. Gunther, "Capacity Planning for Solaris SRM: All I Ever Want Is My Unfair Advantage (And Why You Can't Have It)," Proc. CMG Conf., December 1999.

[HP99] HP Process Resource Manager User's Guide, Hewlett-Packard, December 1999.

[IBM00] AIX V4.3.3 Workload Manager Technical Reference, IBM, February 2000 Update.

[KL88] J. Kay and P. Lauder, "A Fair Share Scheduler," Communications of the ACM, Vol. 31, No. 1, January 1988, pp. 44-55.

[Klein75] L. Kleinrock, Queueing Systems, Volume I: Theory, John Wiley & Sons, 1975.

[Sun98] Solaris Resource Manager 1.0 White Paper, Sun Microsystems, 1998.

[Sun99a] Solaris Resource Manager 1.1 Reference Manual, Sun Microsystems, August 1999.

[Sun99b] Modelling the Behavior of Solaris Resource Manager, Sun BluePrints OnLine, August 1999.

[Wald95] Carl A. Waldspurger, Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management, Ph.D. dissertation, Massachusetts Institute of Technology, September 1995. Also appears as Technical Report MIT/LCS/TR.

[WW94] Carl A. Waldspurger and William E. Weihl, "Lottery Scheduling: Flexible Proportional-Share Resource Management," Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI '94), Monterey, California, November 1994.

Appendix 1. Experimental Configurations

    Vendor   Hardware                    OS            Fair share scheduler
    HP       9000/839 Model K210        HP-UX 10.20   PRM C.01.07
    IBM      RS                          AIX           WLM
    Sun      SPARCstation 20 (sun4m)    Solaris 2.6   SRM

Each machine was configured as a uniprocessor.

Appendix 2. Fair Share Scheduler Features

                        SRM (Sun, Solaris)   PRM (HP, HP-UX)   WLM (IBM, AIX)
    Caps                No                   Yes               Yes
    Analysis            Yes                  Yes               Yes
    Setuid              Yes                  No                No
    Hierarchies         Yes                  No                Yes
    Group by Process    No                   Yes               Yes
    Group by User       Yes                  Yes               Yes

Appendix 3. HP and IBM Two-Workload Benchmark Results

Figure 9. The benchmark and predicted average response times (HP) of workload 1 as a function of its allocated share.

Figure 10. The benchmark and predicted average response times (HP) of workload 2 as a function of the share allocated to workload 1.

Figure 11. The measured response times (HP) of workload 1 and workload 2 as a function of the relative share allocated to workload 1, together with their sum, weighted by their utilizations.

Figure 12. The benchmark and predicted average response times (IBM) of workload 1 as a function of its allocated share.

Figure 13. The benchmark and predicted average response times (IBM) of workload 2 as a function of the share allocated to workload 1.

Figure 14. The measured response times (IBM) of workload 1 and workload 2 as a function of the relative share allocated to workload 1, together with their sum, weighted by their utilizations.