Efficient Resource Management using Advance Reservations for Heterogeneous Grids

Claris Castillo, George N. Rouskas, Khaled Harfoush
Department of Computer Science, North Carolina State University, Raleigh, NC 27695
Email: {ccastil,rouskas,kaarfou}@ncsu.edu

Abstract

Support for advance reservations of resources plays a key role in Grid resource management, as it enables the system to meet user expectations with respect to time requirements and temporal dependence of applications, increases the predictability of the system, and enables co-allocation of resources. Despite these attractive features, adoption of advance reservations is limited, mainly because the related algorithms are typically complex and fail to scale to large and loaded systems. In this work we consider two aspects of advance reservations. First, we investigate the impact of heterogeneity on Grid resource management when advance reservations are supported. Second, we employ techniques from computational geometry to develop an efficient heterogeneity-aware scheduling algorithm. Our main finding is that Grids may benefit from high levels of resource heterogeneity, independently of the total system capacity. Our results show that our algorithm performs well across several user and system performance metrics and overcomes the lack of scalability and adaptability of existing mechanisms.

I. INTRODUCTION

Owing to advances in technologies such as resource virtualization and network management, Grids have experienced enormous growth, not only in their adoption (they have become the de facto infrastructure for computing service provisioning in academia and corporate R&D environments) but also in their functionality, complexity, and size. This phenomenon has led to the emergence of a whole new range of applications capable of performing tasks of a complexity not envisioned before. For instance, several scientific workflow applications [24], [25], [27] involve the orchestration of multiple compute and data transfer stages.
These stages normally have strong dependencies on completion times; thus, the ability to co-schedule and synchronize resource usage is crucial. Furthermore, emerging classes of deadline-driven scientific applications, such as severe weather modeling [23], require simultaneous access to multiple resources and predictable completion times. In order to support such temporal dependencies and strict time constraints, Grid middleware needs to offer planning capabilities so that users can reserve resources in advance based on resource availability and meet the time requirements of their applications.

(This work was partially supported by NSF grants CAREER ANIR-0347226 and CNS-0434975, an Anita Borg Scholarship, and Cisco Systems Inc.)

However, most existing Grid resource schedulers [10], [26], [28], [29], [45], [46] were originally designed to work under best-effort policies. In response to the emerging need for more sophisticated resource management solutions, some Grid resource management software has evolved to accommodate advance reservations. Such software includes LSF, PBS-Pro, Maui, Catalina, EASY and COSY (for a comprehensive review of these schedulers refer to [10] and the references therein).

Advance reservations have been widely proposed for providing performance predictability, meeting resource requirements, and providing guaranteed quality of service to applications [1]–[6], [20], [21], [35], [36]–[38], [40]. GARA [34] is one of the seminal works on advance reservation and defines a basic architecture and a simple API for the manipulation of advance reservations of different resources. A theoretical proof that reservations can be used to improve the performance predictability of applications is presented in [38]. A comparison of provisioning models and best-effort mechanisms can be found in [32].
In [37] and [38], the performance and the predictability, respectively, of workflow applications under advance reservations are investigated, with the conclusion that it is beneficial for Grids to use advance reservations. A more general study on the usefulness of advance reservations is presented in [35]. In [5], [6] the authors investigated the negative impact advance reservations have on system and user performance. In [31], [43], [44] the authors show how laxity and fuzziness in the reservation requests may be exploited to address some of the drawbacks of advance reservations. Two of the most recent major works on advance reservations in Grids are [32] and [21]. In [32], the authors propose a multiobjective genetic algorithm formulation for selecting the set of resources to be provisioned that optimizes the application performance while minimizing the resource costs. In [21] a cost-aware resource model is presented in which the reservation for each application task is performed separately by negotiating with the resource provider. In [33] the authors present a broker service for Grid resources that takes into account the fact that deadline and budget are specified, and then optimizes the usage of resources only by considering the current state of the

resources but without any planning horizon.

The impact of resource heterogeneity has been investigated in contexts other than Grids. In [41] the authors exploit the heterogeneity found in HPC environments by dividing a task into subtasks and then mapping the latter to the resources that best meet their requirements. That work assumes offline scheduling and does not support advance reservations; our work deals with online scheduling and allows users to schedule jobs in advance. In [19] the authors proposed a general framework to quantify the worst-case effect of increasing heterogeneity in models of parallel systems with finite total capacity. An important contribution of that work was a model to characterize resource heterogeneity, which we adopt in this paper.

Overall, advance reservation of resources [1]–[6] has generated great interest in the Grid community as a mechanism that Grid providers may employ to offer planning capabilities to application users. Furthermore, it has been shown to increase the predictability of the system while maximizing the flexibility and adaptability of the system to cope with the dynamic behavior of Grid environments [35], [13], [38]. Despite the attractive features of advance reservations, there is great skepticism in the Grid community about their ability to meet their promise, mainly for three reasons. First, advance reservations have been shown to cause severe performance degradation [5], [6]. Second, typical advance reservation mechanisms lack flexibility, as they do not permit graceful degradation in application performance when resource management policies mandate changes in allocations [9]. Third, existing approaches suffer from poor scalability, as they are not effective in managing large sets of advance reservations or handling resource fragmentation. Also, most solutions lack sophistication and are not able to address user needs (e.g., for time guarantees) and system requirements (e.g., for high performance/throughput) in an integrated manner.
To overcome these challenges, algorithms for advance reservations need to be efficient so that they can adapt to dynamic changes in resource availability and user demand without hurting system and user performance. Moreover, they must take into account resource heterogeneity, since resources in Grid environments are typically highly heterogeneous. In previous work [40] we developed efficient algorithms for advance reservations of homogeneous resources. These algorithms are effective in meeting time requirements (e.g., deadlines), may be adapted to employ several optimization criteria for scheduling jobs, and their low running times make them practical for large Grid environments.

In this paper we address the issue of meeting application time requirements in Grid environments with resources of heterogeneous capabilities (e.g., as in the case of compute servers with varying processing power). We consider an environment where users submit jobs dynamically, and these jobs may start at a future time and must be completed within a certain deadline. We first investigate the impact of heterogeneity on the scheduling of resources, and conclude that scheduling algorithms need to be heterogeneity-aware to achieve appropriate system and user performance. Based on this observation, we then develop an efficient heterogeneity-aware scheduling algorithm for advance reservations in this context. We also describe how to apply techniques from computational geometry to develop data structures that allow the service provider to manage efficiently the set of advance reservations and handle effectively the resulting resource fragmentation.

The rest of the paper is organized as follows. In Section II we describe the online scheduling problem we study in this work. In Section III we make a case for heterogeneity-aware algorithms in Grids. By means of a simple experiment we show that resource heterogeneity may have a positive impact on performance if heterogeneity-aware algorithms are used.
In Section IV we present a novel transformation of the advance reservations problem that exploits techniques from computational geometry. Using insight from this transformation, we then develop a heterogeneity-aware algorithm in Section V, and provide details on its implementation and the associated data structures used to manage the fragmentation of resources. In Section VI we describe several directions for further improving the performance of the scheduling algorithm that are the subject of ongoing research within our group. In Section VII we investigate the performance of our algorithm through simulation, and we conclude the paper in Section VIII.

II. PROBLEM DESCRIPTION

Consider a scheduler $S$ for a Grid with $n$ servers which may be geographically distributed in a network. We consider a heterogeneous environment in that server $i$ has service rate $\mu_i$, where service rate refers to the amount of work a server can perform per unit of time. We also assume network delays are negligible. A user with job $j$ requiring service submits a request to the scheduler. The request is characterized by a three-parameter tuple $(r_j, l_j, d_j)$, where:

1) $r_j$ is the ready time of the job, i.e., the earliest time the job can be made available to the Grid for processing;
2) $l_j$ is the size of the job, i.e., the amount of work the job requires; and
3) $d_j$ ($\ge r_j + l_j$) is the deadline of the job, i.e., the latest time by which the job can be completed to provide any utility to the user.

The deadline is a measure of the quality of service required by the user. We assume that deadlines are hard, in that a user receives utility only if the job completes service by its deadline. Therefore, if $S$ determines that the deadline cannot be met, it drops the job and notifies its user accordingly. Note that this restriction may be relaxed with minimal modifications to our algorithm; in

Section VI we describe a set of mechanisms that may be used to re-negotiate and re-plan advance reservations in order to minimize the number of jobs that are dropped.

In our model, the availability of resources is represented by time intervals during which servers are idle. We refer to these intervals as idle periods in this paper. We say that an idle period is feasible for a given job $j$ if it can accommodate $j$ within its deadline $d_j$. The feasibility of an idle period $k$ for a given job $j$ is determined by both the service rate of the server associated with the idle period and its duration. Therefore, we characterize an idle period $k$ on a server $i$ with service rate $\mu_i$ by a three-parameter tuple $(st_k, et_k, c_k)$, where: $st_k$ is the starting time of the idle period; $et_k$ is the ending time of the idle period; and $c_k = \mu_i (et_k - st_k)$ is the nominal capacity of the idle period, i.e., the amount of work that server $i$ can perform during idle period $k$. Note that idle periods in slow (respectively, fast) servers may have a long (respectively, short) duration but small (respectively, large) nominal capacity. Moreover, the nominal capacity $c_k$ of an idle period $k$ represents the maximum job size that it can accommodate, assuming that the job is scheduled to start execution exactly at time $st_k$. As time progresses, the nominal capacity $c_k$ of the idle period decreases at a rate equal to its server's rate $\mu_i$. Consequently, if no job is allocated to the idle period by time $t = st_k$, then the maximum job size that it can accommodate decreases linearly at rate $\mu_i$. Therefore, the nominal capacity of idle periods belonging to fast (respectively, slow) servers expires at a faster (respectively, slower) rate.

We consider the online scheduling problem whereby users submit service requests to $S$ at random instants. We assume that $S$ maintains a schedule which records, for each server $i$, the time periods in the future during which the server is reserved for jobs that have already been accepted to the system.
In essence, this schedule represents the set of advance reservations that have been made, and it guarantees that server resources will be available to the accepted jobs at specific future times.

Figure 1(a) shows an example schedule for a 2-server system in which server 1 has rate $\mu_1 = 1$ and server 2 has rate $\mu_2 = 0.5$. The schedule is in the form of a timetable, and shows that at the current time (i.e., $t = 0$) there are four jobs scheduled for server 1: the job currently in service, which will end at time $t_1$; job A, which has reserved the server from time $t_4$ to time $t_5$; job B, which has reserved the server from time $t_6$ until time $t_7$; and job C, which is scheduled from time $t_{11}$ to time $t_{12}$. Similarly, there are two jobs scheduled for server 2. The figure also shows a new job $j$ requesting service. The job has ready time $r_j = t_3$ and deadline $d_j$. There are two representations of the new job. The representation at the top has a shorter duration and shows the new job as seen by server 1, while the one below has a longer duration (i.e., double that at the top) and shows the job as seen by server 2.

When a service request $(r_j, l_j, d_j)$ for a new job $j$ arrives, $S$ immediately runs an algorithm to determine whether it is feasible to schedule the job so as to meet its deadline. If so, then $S$ uses a set of criteria to select one of the (possibly multiple) servers that can handle this job, updates its schedule, and returns a reference to this server to the user; otherwise, the job is dropped. The scheduling decision impacts the performance perceived by users, as reflected by the fraction of jobs meeting (or missing) their deadlines and the response time of the jobs. It also impacts the overall system performance, as reflected by the system utilization, which is a measure of how well the overall service capacity of the system is used. The challenge, therefore, is to develop efficient online scheduling algorithms that minimize the fraction of dropped jobs while maximizing utilization.

A. Computational Heterogeneity

To incorporate computational heterogeneity into our framework we use the model introduced in [19]. In this model the authors use the majorization partial order to compare the imbalance, i.e., heterogeneity, of capacity distributions. The majorization partial order, $\succ$, is defined as follows. Given two nonnegative vectors corresponding to the service rates of two $n$-server systems, $C = (\mu_1, \mu_2, \mu_3, \ldots, \mu_n)$ and $C' = (\mu'_1, \mu'_2, \mu'_3, \ldots, \mu'_n)$, we have $C \succ C'$ when

$$\sum_{i=1}^{k} \mu_{[i]} \ge \sum_{i=1}^{k} \mu'_{[i]}, \quad k = 1, \ldots, n-1, \qquad \text{and} \qquad \sum_{i=1}^{n} \mu_i = \sum_{i=1}^{n} \mu'_i \tag{1}$$

where $\mu_{[i]}$ denotes the $i$-th largest component of $C$. We say that the computational capacity distribution $C_A$ of a system A is more heterogeneous than the computational capacity distribution $C_B$ of a system B whenever $C_A \succ C_B$.

We say that a Grid system is $(H, n)$-heterogeneous, $H \le n$, if the $n$ servers are partitioned in $H$ groups such that the servers in group $h$, $h = 1, \ldots, H$, have the same service rate $\mu_h$. Note that most existing Grids follow this model, as they consist of a collection of clusters of identical processors. Thus, an $(H, n)$-heterogeneous system has $n$ servers with $H$ different rates. For a given $(H, n)$-heterogeneous system we may generate a range of service rate distributions that are more or less heterogeneous according to the majorization partial order in expression (1). We let $L$ denote the levels of heterogeneity, i.e., the number of service rate distributions considered for an $(H, n)$-heterogeneous Grid, labeled in order of increasing heterogeneity:

$$(H,n)_L \succ \cdots \succ (H,n)_1 \succ (H,n)_0 \tag{2}$$

where we use $(H,n)_0$ to denote the completely homogeneous system, i.e., one in which all $n$ servers have the same rate $\mu$. We use this model in the experimental studies we report in Sections III and VII.
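The majorization comparison in expression (1) can be checked mechanically. The following is a minimal sketch, not code from the paper: the function name `majorizes`, the tolerance `tol`, and the three example rate vectors are all hypothetical, chosen only so that the vectors share the same total capacity.

```python
from typing import Sequence


def majorizes(c: Sequence[float], c_prime: Sequence[float], tol: float = 1e-9) -> bool:
    """Return True if rate vector c majorizes c_prime in the sense of expression (1)."""
    if len(c) != len(c_prime):
        raise ValueError("both systems must have the same number of servers")
    a = sorted(c, reverse=True)        # mu_[1] >= mu_[2] >= ...
    b = sorted(c_prime, reverse=True)
    # The total capacities must be equal.
    if abs(sum(a) - sum(b)) > tol:
        return False
    # Every partial sum of the k largest rates of c must dominate that of c_prime.
    partial_a = partial_b = 0.0
    for x, y in zip(a, b):
        partial_a += x
        partial_b += y
        if partial_a < partial_b - tol:
            return False
    return True


# Hypothetical 4-server systems with the same total capacity (4.0).
homogeneous = [1.0, 1.0, 1.0, 1.0]
mild = [1.5, 1.5, 0.5, 0.5]
strong = [2.0, 1.0, 0.5, 0.5]

print(majorizes(strong, mild), majorizes(mild, homogeneous), majorizes(homogeneous, mild))
```

Under these assumed vectors, `strong` is more heterogeneous than `mild`, which in turn is more heterogeneous than the homogeneous system, giving a chain of the kind shown in expression (2).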

Fig. 1. (a) Schedule of a 2-server system as a timetable, and (b) geometric representation of the idle periods and the new job (time on the horizontal axis, capacity on the vertical axis; idle periods x, y, and z have slope $-1$, idle period w has slope $-0.5$).

III. THE CASE FOR HETEROGENEITY-AWARE ALGORITHMS

To investigate the impact of heterogeneity on resource allocation mechanisms in Grid environments we perform two different experiments. We refer to these experiments as the heterogeneity-aware (HA) and heterogeneity-unaware (HU) experiments. In both experiments we consider the problem described in Section II and use the same scheduling algorithm and data structure, the only difference being that in one experiment we adapt the algorithm and data structure to accommodate heterogeneity. More specifically, we consider the well-known first-fit (FF) scheduling algorithm, and we use a linked-list data structure to store idle periods.

In the heterogeneity-unaware (HU) experiment, all idle periods over all servers are stored in a single linked list in ascending order of their starting times. To schedule a new job, the FF algorithm searches the linked list and returns the first feasible idle period for the job; we refer to this algorithm as FF-HU.

In the heterogeneity-aware (HA) experiment, the idle periods are stored in $H$ linked lists, where $H$ denotes the number of different rates in the system. Specifically, linked list $h$, $h = 1, \ldots, H$, stores the idle periods over all servers with rate $\mu_h$ in ascending order of their starting times. To schedule a new job, the FF algorithm considers the $H$ lists in some order, and searches the first linked list for a feasible idle period; if no such idle period is found, the algorithm continues to search the next list in the order, and so on.
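The two first-fit variants can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: Python lists stand in for the linked lists, the class and function names are hypothetical, the feasibility test is a direct reading of the Section II definition (the job must fit inside the idle period and finish by its deadline), and the heterogeneity-aware variant visits the lists in increasing order of rate, which is only one possible order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IdlePeriod:
    server: int
    rate: float   # service rate mu of the server
    st: float     # starting time of the idle period
    et: float     # ending time of the idle period


@dataclass
class Job:
    r: float      # ready time
    l: float      # size (amount of work)
    d: float      # deadline


def is_feasible(p: IdlePeriod, job: Job) -> bool:
    """The job starts as early as possible and must end within the period and the deadline."""
    start = max(p.st, job.r)
    finish = start + job.l / p.rate
    return finish <= min(p.et, job.d)


def ff_hu(periods: List[IdlePeriod], job: Job) -> Optional[IdlePeriod]:
    """First fit over a single list ordered by starting time (heterogeneity-unaware)."""
    for p in sorted(periods, key=lambda q: q.st):
        if is_feasible(p, job):
            return p
    return None


def ff_ha(periods: List[IdlePeriod], job: Job) -> Optional[IdlePeriod]:
    """First fit over one list per rate, visited in increasing rate (heterogeneity-aware)."""
    by_rate: Dict[float, List[IdlePeriod]] = {}
    for p in periods:
        by_rate.setdefault(p.rate, []).append(p)
    for rate in sorted(by_rate):
        for p in sorted(by_rate[rate], key=lambda q: q.st):
            if is_feasible(p, job):
                return p
    return None


# Hypothetical schedule: one idle period on a fast server, one on a slow server.
periods = [IdlePeriod(server=1, rate=1.0, st=1.0, et=4.0),
           IdlePeriod(server=2, rate=0.5, st=2.0, et=10.0)]
job = Job(r=0.0, l=2.0, d=9.0)
print(ff_hu(periods, job).server, ff_ha(periods, job).server)
```

With these assumed values both idle periods are feasible; FF-HU picks the one that starts first (server 1), while this FF-HA ordering picks the slow server (server 2) and leaves the fast server free for later jobs.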
The FF-HA algorithm terminates when the first feasible idle period is found, or when all the lists have been searched unsuccessfully. Clearly, the order in which the FF-HA algorithm considers the $H$ linked lists will have an impact on performance.

We used simulation to compare the performance of the FF-HU and FF-HA algorithms; the details of the simulation setup are described in Section VII. Following the model of Section II-A, we consider an $(H,n)$-heterogeneous Grid with $n = 120$ servers divided into $H = 3$ groups, with the servers in each group $h$, $h = 1, 2, 3$, having the same rate $\mu_h$. We created $L = 4$ $(H,n)$-heterogeneous systems by selecting the rate $\mu_h$ of each server group within each system so that $L = 4$ refers to the most heterogeneous system with respect to expression (1) and $L = 1$ to the least heterogeneous one.

Figure 2 plots the work loss rate and the utilization against system load. Each plot shows two sets of four curves, one set for the FF-HU algorithm and one for FF-HA; in this case, FF-HA considers the $H$ lists of idle periods in increasing value of the rate $\mu_h$ of the corresponding servers. Each curve within a set corresponds to one of the $L = 4$ levels of heterogeneity, i.e., one of the $(H, n)$-heterogeneous systems obtained as we described above. We have obtained results for other performance measures, e.g., waiting time, but do not include them here as they exhibit similar trends.

As we can see, for a given level of heterogeneity, the heterogeneity-aware algorithm (FF-HA) outperforms the heterogeneity-unaware one (FF-HU) across the spectrum of system loads. We also observe that the performance of each algorithm improves as the system becomes more heterogeneous, despite the fact that the total service rate is the same for all $L = 4$ heterogeneity levels. This phenomenon is due to the effect of statistical multiplexing, and is discussed in more depth in Section VII. These results, obtained with a basic scheduling algorithm and data structure, suggest that computational heterogeneity

may have a significant impact on both user and system performance metrics and should be taken into account when designing scheduling algorithms. Nonetheless, taking heterogeneity into account comes with a price, since it adds complexity to the problem and hence to the algorithms. For instance, although the worst-case running time of the FF-HA and FF-HU algorithms is the same (linear in the number of idle periods), the average running time of FF-HA can be significantly longer than that of FF-HU (since it may have to traverse several lists before it finds a feasible period that might be stored near the head of the single list maintained by FF-HU). The challenge, therefore, is to design scheduling algorithms that are both heterogeneity-aware and efficient; this is the subject of the next two sections.

Fig. 2. Comparison of heterogeneity-aware and heterogeneity-unaware algorithms: work loss rate and utilization versus system load for FF-HU and FF-HA at heterogeneity levels $L = 1, \ldots, 4$.

IV. A GEOMETRIC MODEL FOR ADVANCE RESERVATIONS

In this section we employ techniques from computational geometry to model the problem we introduced in Section II. We then use this model to develop an algorithm for advance reservation of resources, along with an associated data structure for storing and accessing efficiently the set of idle periods. Without loss of generality, in the following discussion we make the assumption that the service rate $\mu_i$ of each processor $i$ is such that $0 < \mu_i \le 1$. This assumption allows us to define the size $l_j$ of a job $j$ as the amount of time for this job to complete on a server of rate $\mu = 1$. Clearly, the duration of the job on a server of rate $\mu_i < 1$ is then equal to $l_j / \mu_i$.

A. Geometric Representation of Idle Periods and Jobs

We represent idle periods and jobs on the first quadrant of a Cartesian coordinate system in which the x axis represents time and the y axis represents nominal capacity. Figure 1(b) illustrates the geometric representation of the idle periods and new job of Figure 1(a).

A job $j$ characterized by the tuple $(r_j, l_j, d_j)$ is represented in this coordinate system as a line segment between two points $P = (r_j, l_j)$ and $P' = (d_j - l_j, l_j)$. Since, in Figure 1(a), the new job is defined by the tuple $(t_3, l_j, d_j)$, the two endpoints of the line segment representation of this job in Figure 1(b) are $P = (t_3, l_j)$ and $P' = (t_{10} = d_j - l_j, l_j)$. As defined, point $P$ represents the earliest possible starting time and required capacity for this job if it were scheduled on the fastest server, i.e., one with rate $\mu = 1$; similarly, point $P'$ corresponds to the latest possible starting time and required capacity for this job to be feasibly completed on the fastest server. Note that although we assume that servers may have different capacities, we use a single representation for each job $j$, namely the line segment with respect to the server of rate $\mu = 1$.

An idle period $k$ characterized by the tuple $(st_k, et_k, c_k)$ is also represented in the coordinate system as a line segment between two points, $k_1 = (st_k, c_k)$ and $k_2 = (et_k, 0)$. Recall that $c_k$ denotes the nominal capacity of idle period $k$. Therefore, point $k_1$ represents the point in time (i.e., starting time) at which the idle period has the largest nominal capacity, and point $k_2$ the point in time (i.e., ending time) at which the idle period has reached zero capacity. The slope of the line segment representing idle period $k$ is equal to $-\mu_i$, where $\mu_i$ is the rate of the server corresponding to this idle period; this representation clearly shows that the nominal capacity of the idle period decreases at rate $\mu_i$. Consider, for example, idle period x in Figure 1(a) with starting time $st_x = t_1$, ending time $et_x = t_4$, and nominal capacity $c_x$.
This idle period is represented in the plane by the line segment between the two points $x_1 = (st_x, c_x)$ and $x_2 = (et_x, 0)$. The slope of the line segment is $-1$, since the rate of server 1 is $\mu_1 = 1$. Idle periods y, z, and w are similarly represented by the line segments shown in Figure 1(b). Note also that the slope of the line segments corresponding to idle periods y and z is $-1$, while the one corresponding to w is $-0.5$, since the latter is on server 2 of rate $\mu_2 = 0.5$.

Feasibility Criteria. We may now use the above geometric representation to determine whether an idle period

is feasible for a new job. Consider an idle period $k$ with tuple $(st_k, et_k, c_k)$ represented by the line segment defined by points $k_1$ and $k_2$, as explained earlier, and a new job $j$ with tuple $(r_j, l_j, d_j)$ that is represented by a line segment between points $P$ and $P'$. Idle period $k$ is feasible for job $j$ if and only if both of the following conditions are satisfied.

1) Starting time feasibility. Let $i$ be the server corresponding to idle period $k$, and $\mu_i$ be its service rate. For the idle period $k$ to be feasible for the new job $j$, its starting time $st_k$ has to be sufficiently early for the server to be able to complete the job before its deadline, i.e.:

$$st_k \le d_j - \frac{l_j}{\mu_i} \tag{3}$$

Expression (3) is necessary but not sufficient for feasibility, since the idle period $k$ may end early, before job $j$ can complete on server $i$. Returning to Figure 1, we observe that idle period x satisfies the above condition with respect to the new job. However, the residual capacity of this idle period at the time the new job arrives is not sufficient to accommodate it.

2) Capacity feasibility. Assuming that the starting time feasibility is satisfied, an idle period $k$ is feasible for a new job $j$ if the line segment representing $k$ lies above, or intersects with, the line segment representing $j$. Equivalently, this condition is satisfied if the leftmost endpoint of the line segment representing the new job lies below the line segment representing the idle period. In Figure 1(b) we see that idle period y does not satisfy this condition, as its line segment lies below the line segment representing the new job; hence, y is not feasible for the new job.

In Figure 1(b), the two conditions are satisfied for both idle periods w and z with respect to the new job represented by the line segment between points $P$ and $P'$. Consequently, idle period w has enough capacity to accommodate the new job, as long as the latter starts before the time instant at which the corresponding lines intersect; similarly for idle period z.
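The two feasibility conditions can be written down as a small sketch. The names below are hypothetical and the numeric values are illustrative rather than taken from Figure 1. One interpretive assumption is made and flagged in the code: the capacity test is evaluated on the segment itself, i.e., at the earliest instant the job could actually start in the idle period, $\max(st_k, r_j)$, rather than on the segment extended to the left of $st_k$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdlePeriod:
    st: float   # starting time st_k
    et: float   # ending time et_k
    mu: float   # service rate of the server, 0 < mu <= 1

    @property
    def capacity(self) -> float:
        """Nominal capacity c_k = mu * (et_k - st_k)."""
        return self.mu * (self.et - self.st)


@dataclass(frozen=True)
class Job:
    r: float    # ready time r_j
    l: float    # size l_j (duration on a server of rate 1)
    d: float    # deadline d_j


def starting_time_feasible(k: IdlePeriod, j: Job) -> bool:
    """Condition 1, expression (3): st_k <= d_j - l_j / mu_i."""
    return k.st <= j.d - j.l / k.mu


def capacity_feasible(k: IdlePeriod, j: Job) -> bool:
    """Condition 2 (assumed reading): residual capacity at max(st_k, r_j) covers l_j."""
    start = max(k.st, j.r)
    return k.mu * (k.et - start) >= j.l


def feasible(k: IdlePeriod, j: Job) -> bool:
    return starting_time_feasible(k, j) and capacity_feasible(k, j)


job = Job(r=3.0, l=2.0, d=12.0)
early_but_short = IdlePeriod(st=1.0, et=4.0, mu=1.0)   # starts early, little capacity left at r_j
slow_but_long = IdlePeriod(st=2.0, et=8.0, mu=0.5)     # slow server, enough residual capacity
too_late = IdlePeriod(st=11.0, et=20.0, mu=1.0)        # starts after the latest start time

for k in (early_but_short, slow_but_long, too_late):
    print(starting_time_feasible(k, job), capacity_feasible(k, job), feasible(k, job))
```

The first idle period passes the starting time test but fails on capacity, the second passes both, and the third has ample capacity but fails the starting time test, which is why both conditions are needed.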
Our objective is to develop techniques to identify efficiently feasible idle periods for each arriving job request, without having to examine all idle periods. As we have shown in [40], we can efficiently find idle periods that meet the starting time feasibility criterion by organizing the idle periods in an appropriate balanced tree structure that can be searched in logarithmic time. However, identifying idle periods that meet the capacity requirement, e.g., determining line segments lying above point $P$ in Figure 1(b), requires that each idle period be examined separately. This is due to the fact that, to perform this test, the equation representing each line segment needs to be evaluated for the coordinates of the given point. Next, we employ techniques from computational geometry to obtain an equivalent representation of idle periods and new jobs that allows us to develop an elegant solution to the problem of testing for the capacity feasibility criterion.

B. Duality Transform and Duality Plane

Geometric duality [17] refers to the direct mapping between a point $p$ (respectively, line $l$) and a line $p^*$ (respectively, point $l^*$). The duality transform maps objects from the primal plane to the dual plane. We now describe a simple duality transform we use in the remainder of this paper.

Let $p := (p_x, p_y)$ be a point in the plane. The dual of $p$, denoted $p^*$, is the line defined as

$$p^* := (y = p_x x - p_y) \tag{4}$$

where $p_x$ and $p_y$ are $p$'s x and y coordinates, respectively. The dual $l^*$ of a line $l := (y = mx + b)$ is the point $p$ such that $p^* = l$, that is,

$$l^* := (m, -b) \tag{5}$$

where $m$ and $b$ are the slope and y-intercept of line $l$, respectively. One major advantage of this particular duality transform is that it is order preserving, that is, point $p$ lies above line $l$ if and only if point $l^*$ lies above line $p^*$ [17].

Let us now return to our original problem and the geometric representation of idle periods and jobs shown in Figure 1(b).
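The transform in expressions (4) and (5) and its order-preserving property can be illustrated with a short sketch. The representation of a line as a `(slope, intercept)` pair and the function names are assumptions made for this illustration only.

```python
from typing import Tuple

Point = Tuple[float, float]   # (x, y)
Line = Tuple[float, float]    # (m, b) for the line y = m*x + b


def dual_of_point(p: Point) -> Line:
    """Expression (4): the point (px, py) maps to the line y = px*x - py."""
    px, py = p
    return (px, -py)


def dual_of_line(l: Line) -> Point:
    """Expression (5): the line y = m*x + b maps to the point (m, -b)."""
    m, b = l
    return (m, -b)


def point_above_line(p: Point, l: Line) -> bool:
    """True if point p lies strictly above line l."""
    m, b = l
    return p[1] > m * p[0] + b


line = (-1.0, 4.0)            # y = -x + 4
high_point = (2.0, 5.0)       # above the line (the line is at y = 2 when x = 2)
low_point = (1.0, 1.0)        # below the line (the line is at y = 3 when x = 1)

for p in (high_point, low_point):
    primal = point_above_line(p, line)
    dual = point_above_line(dual_of_line(line), dual_of_point(p))
    print(primal, dual)       # the two answers agree in each case
```

In both cases the above/below relation between the point and the line in the primal plane matches the relation between the dual point and the dual line, which is the property the capacity test relies on.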
We transform this primal plane to the dual plane by mapping the line $l_k$ corresponding to an idle period $k$ to a point $l_k^*$, and the point $P$, corresponding to the earliest time new job $j$ can start execution, to a line $P^*$. Using basic geometry principles we find that for any idle period $k$, the value of $b$ in expression (5) is $\mu_i\, et_k$. Since the slope $m$ of idle period $k$ is $-\mu_i$, where $\mu_i$ is the rate of the corresponding server, expression (5) can be written as:

$$l_k^* := (-\mu_i,\; -\mu_i\, et_k). \tag{6}$$

To find $P^*$, we substitute $p_x$ and $p_y$ in expression (4) with $r_j$ and $l_j$, respectively:

$$P^* := (y = r_j x - l_j). \tag{7}$$

Figure 3(b) shows the dual plane corresponding to the primal plane in Figure 3(a); the latter figure is identical to Figure 1(b), and is repeated here for convenience. As we can see, the idle periods are now mapped to points in the dual plane. Specifically, all idle periods on server 1 of rate $\mu_1 = 1$ are now points with x-coordinate equal to $-\mu_1 = -1$; similarly, the idle period on server 2 of rate $\mu_2 = 0.5$ has x-coordinate equal to $-\mu_2 = -0.5$. Point $P$, on the other hand, which represents the earliest time the new job can start execution, is represented on the dual plane as a line.

Consider now the capacity feasibility criterion we defined above. In the primal plane of Figure 3(a), it is

clear that idle period x is not feasible for the new job, as point $P$ lies above the line segment representing x. Due to the order preservation of the duality transform, in Figure 3(b) we see that the point corresponding to idle period x also lies above the line representing point $P$. Similarly, idle periods z and w are feasible for the new job, and their corresponding points in the dual plane lie below the line representing point $P$. Therefore, checking for capacity feasibility in the dual plane requires checking whether the points representing idle periods lie below the line representing the new job. This test can be performed efficiently by organizing the idle periods (points) lying on the vertical line $x = -\mu_h$ (i.e., those corresponding to servers with rate $\mu_h$) in a search tree structure, and searching for those with a y-coordinate less than that of the point at which the line representing the new job intersects the line $x = -\mu_h$; this search structure is described in the next section.

Fig. 3. (a) Primal plane and (b) dual plane representations of the idle periods and new job of Figure 1(a).

The observant reader will have noticed that, in the dual plane of Figure 3(b), the point representing idle period y lies below the line representing point $P$; however, a look at the primal plane of Figure 3(a) indicates that idle period y is not feasible. Note that extending the line segment representing idle period y in the primal plane would result in a line lying above point $P$, hence the dual plane representation is consistent in this regard. The issue here is that idle period y starts too late to be feasible; therefore, it will not pass the starting time feasibility criterion above.
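The dual-plane capacity test and the caveat about late-starting idle periods can be sketched as follows. The function names and numeric values are hypothetical; the sketch only mirrors the two tests as described in the text, using the dual point $(-\mu,\, -\mu\, et_k)$ of an idle period and the dual line $y = r_j x - l_j$ of point $P$.

```python
from typing import Tuple


def dual_point(mu: float, et: float) -> Tuple[float, float]:
    """Dual of the idle period's line: the point (-mu, -mu * et)."""
    return (-mu, -mu * et)


def job_line_height(r: float, l: float, mu: float) -> float:
    """Height of the dual line y = r*x - l of point P where it crosses x = -mu."""
    return r * (-mu) - l


def capacity_feasible_dual(mu: float, et: float, r: float, l: float) -> bool:
    """The idle period's dual point must lie on or below the job's dual line."""
    return dual_point(mu, et)[1] <= job_line_height(r, l, mu)


def starting_time_feasible(st: float, mu: float, l: float, d: float) -> bool:
    """Expression (3): st <= d - l / mu."""
    return st <= d - l / mu


r, l, d = 3.0, 2.0, 8.0   # hypothetical job (ready time, size, deadline)

# (mu, st, et) for three hypothetical idle periods
short_period = (1.0, 1.0, 4.0)    # ends too soon after the ready time
slow_period = (0.5, 2.0, 8.0)     # slow server with enough capacity
late_period = (1.0, 7.0, 11.0)    # plenty of capacity but starts too late

for mu, st, et in (short_period, slow_period, late_period):
    print(capacity_feasible_dual(mu, et, r, l), starting_time_feasible(st, mu, l, d))
```

The late-starting period passes the dual-plane test, much like idle period y in the discussion above, and is rejected only by the starting time criterion.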
Consequently, both the starting time and capacity feasibility criteria must be checked to ensure that an idle period is feasible.

V. ALGORITHM AND DATA STRUCTURE DESCRIPTION

We now introduce an efficient algorithm for finding a feasible idle period for a new job in an $(H, n)$-heterogeneous system with advance reservations. The algorithm is derived from the heterogeneity-aware FF-HA algorithm we described in Section III, and we will refer to it as FF-HA+. The FF-HA+ algorithm differs from FF-HA in that it maintains $H$ balanced trees, rather than $H$ linked lists, such that balanced tree $T_h$, $h = 1, \ldots, H$, stores information about the idle periods over all servers with rate $\mu_h$. Similar to FF-HA, when a new job arrives, FF-HA+ searches the balanced tree structures in ascending order of server rate, and returns as soon as it finds a feasible idle period.

A. Balanced Tree Structure

The FF-HA+ algorithm maintains $H$ 2-dimensional binary search trees to organize the idle periods in an $(H,n)$-heterogeneous system, one such tree $T_h$, $h = 1, \ldots, H$, for each distinct server rate value $\mu_h$. Whenever the algorithm needs to search the idle periods available in servers associated with rate $\mu_h$, the associated tree $T_h$ is searched. We will refer to the first and second dimension trees of $T_h$ as $T_h^{\text{primal}}$ and $T_h^{\text{dual}}$. As their names indicate, they organize the idle periods according to their parameterizations on the primal and dual planes, respectively. More specifically, tree $T_h^{\text{primal}}$ is used to select idle periods that meet the starting time feasibility criterion, and tree $T_h^{\text{dual}}$ is used to select among these idle periods the ones that meet the capacity feasibility criterion.

Let us now describe the 2-dimensional tree T_h in more detail. In tree T_h^primal, the actual idle periods are in the leaf nodes, arranged in ascending order of their starting time. A leaf node corresponding to idle period k stores the following information: the starting time of k; the ending time of k; and auxiliary data, such as the identity of the corresponding server. Internal tree nodes store information regarding the idle periods in their subtree. This information is used to navigate the tree and locate idle periods appropriate for the new job. The information at an internal node v consists of: the median starting time of the idle periods stored in the subtree of T_h^primal rooted at v; and a pointer to the secondary priority search tree T_v^dual containing these idle periods.

Tree T_v^dual stores the idle periods sorted in descending order of the y-coordinate of their dual representation, that is, of the corresponding point in the dual plane. Each intermediate node v in T_v^dual stores the following information: the median y-coordinate of the dual representation of the idle periods stored in the subtree rooted at v; and a pointer to the idle period in v's subtree with the maximum nominal capacity.

B. Searching the Balanced Tree Structure

Consider a request to schedule a new job j with parameters (r_j, l_j, d_j). The FF-HA+ algorithm searches the H balanced trees as we explained earlier, and returns the first feasible idle period found. We now describe how the search of balanced tree T_h is performed; this process is identical for all trees T_h, h = 1, ..., H. Specifically, the search proceeds in two steps:

1) In the first step, the algorithm traverses the tree T_h^primal and marks the intermediate nodes v whose subtrees contain idle periods that meet the starting time feasibility criterion.
2) In the second step, the algorithm searches the secondary trees T_v^dual at each intermediate node v marked during the first step, to locate the subset of idle periods that meet the capacity feasibility criterion.

Step 1: Search in T_h^primal.
In this step, the algorithm identifies idle periods that meet the starting time feasibility criterion expressed in (3). To this end, we employ a standard search algorithm which starts at the root node and compares the quantity on the right-hand side of (3) to the median starting time stored at each internal node v. If the median starting time is smaller, then all the idle periods stored in v's left subtree meet the first feasibility criterion; the algorithm marks the left subtree and proceeds to search the right subtree. If the median starting time of the tree rooted at v is larger, then we can safely conclude that all the idle periods in the right subtree are infeasible and proceed recursively to search the left subtree of v. The algorithm returns the set of marked intermediate nodes as soon as it reaches a leaf, and proceeds to Step 2 described below. If no intermediate node is marked, the FF-HA+ strategy continues to search in the 2-dimensional tree T_{h+1} corresponding to the next larger value of server rate.

Step 2: Search in T_v^dual. In this step, the algorithm searches the idle periods meeting the starting time feasibility criterion, to identify the ones that also satisfy the capacity feasibility criterion. To this end, the algorithm searches each of the subtrees rooted at the intermediate nodes marked in Step 1 and returns as soon as it finds one feasible idle period (if one exists). We will refer to T_v^dual as the secondary tree, i.e., the dual tree, associated with marked node v. The algorithm starts at the root of T_v^dual and compares the median y-coordinate stored at each internal node u to the y-coordinate of the point in the dual plane at which the line corresponding to the new job intersects the vertical line x = µ_h (refer also to Figure 3(b)). If the latter value is smaller, then it can be concluded that all the idle periods in the left subtree are above the line, and hence are infeasible; the algorithm then recursively searches u's right subtree.
If the former value is smaller, then all the idle periods in the right subtree of u are feasible, and there may also exist feasible idle periods in its left subtree. In this case, the algorithm accesses the idle period with the maximum capacity in the right subtree by following the pointer stored at node u. If this idle period is feasible, the algorithm returns it and assigns it to the new job. Otherwise, the search continues recursively with the left subtree of u. If the algorithm reaches a leaf, then no feasible idle period exists in the given subtree and the algorithm continues searching the next tree marked in Step 1.

Running time complexity. In the worst case, the search algorithm marks an intermediate node at each level of the tree T_h^primal in Step 1. Given that it has to perform a standard search for each of the corresponding secondary trees, the overall complexity is O(log^2 V_h) for 2-dimensional tree T_h, where V_h is the number of idle periods in the tree. Since the algorithm may have to search all H trees, the worst case complexity for FF-HA+ is O(H log^2 V), where V = max_h {V_h}. As a comparison, the running time of FF-HA is O(HV), i.e., linear in the number of idle periods, since it has to traverse H linked-list structures. Since H is typically a small constant, whereas the number V of idle periods can be quite large (especially for large systems with thousands of servers and for long time horizons for advance reservations), FF-HA+ is significantly more scalable than FF-HA.
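The two-step search can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact structure described above: the secondary tree with maximum-capacity pointers is replaced by a sorted array searched with `bisect`, and the sketch returns the feasible idle period whose dual point is closest to the job's line, which is an illustrative choice. The thresholds `t_max` (the right-hand side of the starting time criterion) and `y_query` (the y-coordinate where the job's dual line meets x = µ_h) are taken as inputs, and the names `IdlePeriod`, `Node`, `mark`, `search_tree`, and `ff_ha_plus` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class IdlePeriod:
    start: float    # starting time: key of the primary (primal) tree
    dual_y: float   # y-coordinate of the period's point in the dual plane
    server: str     # auxiliary data


class Node:
    """Primary-tree node over start times, with a secondary index on dual_y."""

    def __init__(self, periods: List[IdlePeriod]):
        # `periods` must already be sorted by start time.
        self.by_dual = sorted(periods, key=lambda p: p.dual_y)
        self.dual_keys = [p.dual_y for p in self.by_dual]
        self.leaf = periods[0] if len(periods) == 1 else None
        if self.leaf is None:
            mid = len(periods) // 2
            self.split = periods[mid - 1].start  # largest start on the left
            self.left = Node(periods[:mid])
            self.right = Node(periods[mid:])


def build(periods: List[IdlePeriod]) -> Optional[Node]:
    return Node(sorted(periods, key=lambda p: p.start)) if periods else None


def mark(root: Optional[Node], t_max: float) -> List[Node]:
    """Step 1: nodes whose idle periods all start no later than t_max."""
    marked, v = [], root
    while v is not None and v.leaf is None:
        if v.split <= t_max:
            marked.append(v.left)  # the whole left subtree qualifies
            v = v.right
        else:
            v = v.left             # the whole right subtree starts too late
    if v is not None and v.leaf.start <= t_max:
        marked.append(v)
    return marked


def search_tree(root: Optional[Node], t_max: float,
                y_query: float) -> Optional[IdlePeriod]:
    """Step 2: among marked nodes, find a period on or below the job's line."""
    for v in mark(root, t_max):
        i = bisect_right(v.dual_keys, y_query)
        if i > 0:
            return v.by_dual[i - 1]  # feasible period closest to the line
    return None


def ff_ha_plus(trees: List[Tuple[float, Optional[Node]]],
               query: Callable[[float], Tuple[float, float]]):
    """Try the trees in ascending order of server rate; stop at the first hit."""
    for mu, root in sorted(trees, key=lambda t: t[0]):
        hit = search_tree(root, *query(mu))
        if hit is not None:
            return mu, hit
    return None


# Hypothetical data: two server rates and the same thresholds for both.
slow = build([IdlePeriod(1, 5, "s1"), IdlePeriod(2, 3, "s2")])
fast = build([IdlePeriod(1, 5, "f1"), IdlePeriod(2, 1, "f2"),
              IdlePeriod(3, 0, "f3"), IdlePeriod(4, -2, "f4")])
result = ff_ha_plus([(1.0, fast), (0.5, slow)], lambda mu: (2.5, 2.0))
print(result)
```

Step 1 marks at most one node per level and each marked node costs one binary search, so the sketch keeps the O(log^2 V) search bound per tree. For a rough sense of scale, with H = 3 and, say, V = 10^6 idle periods, H log^2 V is about 1,200 (taking logarithms base 2), whereas HV is 3 x 10^6.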

VI. ADAPTABILITY: RE-PLANNING CAPACITY AND MAXIMIZING UTILIZATION

As we mentioned earlier in Section I, one of the major concerns regarding the deployment of advance reservation mechanisms has to do with their lack of flexibility, which does not permit graceful degradation in application performance when resource management policies mandate changes in allocations. In this section we describe two mechanisms that make it possible to exploit the efficiency of FF-HA+ in order to relax the hard deadline assumption and accommodate changes in resource availability; the implementation of these mechanisms is the subject of ongoing work within our group.

Replanning Capacity. In our work so far we have assumed that deadlines are hard, i.e., jobs are dropped if they cannot be allocated within their deadline. It is possible to make the algorithm more flexible and increase the overall ability of the system to meet application QoS requirements by introducing a negotiation process. This process is invoked whenever the scheduler fails to allocate a job, and attempts to reschedule existing reservations in order to allocate new incoming jobs whenever possible without affecting the QoS of previously scheduled jobs. This negotiation process may utilize a set of data structures and algorithms similar to the ones we described in the previous section to organize, search, and modify existing reservations.

Our algorithm can also be adapted to handle changes in job demands efficiently. Consider, for instance, a job currently running on a server, and assume that it needs to execute for a longer period of time than the one it originally reserved (i.e., the original estimate of its running time was incorrect). In current systems, such jobs are either terminated or preempted and given low priority for scheduling.
Given the low running time complexity of our search algorithm, there are several options for handling such situations: one can either invoke the negotiation process to reschedule the job that has reserved the server following the current job, or one can checkpoint the job, invoke the scheduling algorithm to find the next available feasible idle period for it, and then migrate the job to complete execution on another server.

Opportunistic Scheduling. To enable users and Grid administrators to exploit the variations of resource conditions to improve both application and system performance, the FF-HA+ algorithm may be extended to implement opportunistic scheduling. More specifically, new jobs that have no deadline requirements may use resources as they become available, and they may be preempted to accommodate new jobs with deadlines. Such an approach will increase utilization by filling idle periods that might not be used otherwise, and will increase the flexibility of the system.

VII. PERFORMANCE EVALUATION

In this section we present simulation results to demonstrate the performance of the FF-HA+ scheduling algorithm. We used the method of batch means to estimate the performance parameters we consider (and which we discuss shortly), with each batch consisting of thirty simulation runs and each run lasting until 10^6 jobs have been submitted to the Grid scheduler. We have also obtained 95% confidence intervals for all the results, which are shown in the figures.

In our simulation, we assume that job requests arrive following a uniform distribution in the range from one minute to 14 days [21]. The duration of each reservation request is randomly selected so that 80% of the incoming jobs are smaller than 4 hours, and 20% are between 4 and 36 hours; the mean job size is 5.6 hours. These values were chosen based on the experience with running real Grid workflow applications as described in [21], [22].
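As a quick consistency check on the job-size mix above, the following minimal sketch assumes that durations are uniformly distributed within each of the two bands. The text fixes only the 80%/20% split and the band limits, so the within-band distribution, the function names, and the seed are assumptions. Under that assumption the analytic mean is 0.8 x 2 + 0.2 x 20 = 5.6 hours, which agrees with the stated mean job size.

```python
import random


def sample_duration_hours(rng: random.Random) -> float:
    """Draw one job duration: 80% below 4 h, 20% between 4 h and 36 h.

    Uniform sampling inside each band is an assumption of this sketch.
    """
    if rng.random() < 0.8:
        return rng.uniform(0.0, 4.0)
    return rng.uniform(4.0, 36.0)


def analytic_mean_hours() -> float:
    """Mean of the two-band mixture under the uniform-within-band assumption."""
    return 0.8 * (0.0 + 4.0) / 2 + 0.2 * (4.0 + 36.0) / 2


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_duration_hours(rng) for _ in range(200_000)]
empirical_mean = sum(samples) / len(samples)
short_fraction = sum(1 for d in samples if d < 4.0) / len(samples)

print(f"analytic mean:  {analytic_mean_hours():.2f} h")
print(f"empirical mean: {empirical_mean:.2f} h")
print(f"fraction < 4 h: {short_fraction:.3f}")
```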
We let the deadline d_j of job j be uniformly distributed in the interval (r_j, r_j + q), where q corresponds to the tightness of the deadline; we assume q = 20 hours unless stated otherwise. We consider an (H, n)-heterogeneous system with n = 120 servers and H = 3 distinct service rates. We generated and studied L = 4 computational rate distributions such that L = 1 refers to the least heterogeneous system and L = 4 refers to the most heterogeneous one.

We use four performance metrics in our study. The work loss rate is the fraction of work that is dropped because the deadline of the corresponding jobs cannot be met. The system utilization is the fraction of time the n servers are busy serving jobs. The waiting time is the mean amount of time that a job has to wait beyond its ready time until it starts execution; note that dropped jobs do not contribute to the average waiting time. Finally, the algorithm running time captures the efficiency of the search algorithm in scheduling incoming jobs. To compute the running time we record the CPU time for each simulation corresponding to 10^6 jobs. Work loss rate and waiting time are measures of the QoS perceived by the user, system utilization is a measure of system performance, and running time determines the scalability of the system.

In our first experiment, we compare the FF-HA+ algorithm to the baseline algorithm FF-HU we described in Section III. Recall that the FF-HU strategy organizes idle periods in a single linked list sorted in ascending order of their starting time; the algorithm traverses the list and returns the first feasible idle period for a new job, i.e., the one with the earliest starting time. Note that idle periods with early starting times are at risk of expiring unused if new jobs are not assigned to them. Therefore, this choice of a feasible idle period is expected to lead to low loss, since assigning a new job to the earliest possible feasible period allows idle periods starting later to be used for future job requests.
On the other hand, the running time of the algorithm increases quickly with the size of the Grid system and the time horizon for making reservations. The FF-HA+ algorithm organizes the idle periods in balanced tree structures, hence it scales well to large Grid systems. However, it does not necessarily return the feasible idle period with the earliest starting time, hence we expect that its work loss rate will be higher than that of FF-HU. But we emphasize that FF-HA+ will always find a feasible idle period for a new job if one exists.

Fig. 4. Comparison of FF-HA+ and FF-HU: (a) work loss rate against load, (b) running time (in milliseconds) against load.

Fig. 5. Comparison of FF-HA+ and FF-HU: (a) utilization against load, (b) waiting time against load.

Figure 4 confirms the above observations. The figure plots the work loss rate and running time of the FF-HU and FF-HA+ algorithms against the system load. As we can see in Figure 4(a), the loss rate increases with the system load for both algorithms. The two strategies exhibit similar loss rates at low loads (when there are sufficient resources to schedule almost all jobs) and high loads (when the issue is the lack of resources, not the particular strategy used). However, the FF-HA+ strategy exhibits a higher loss rate at medium loads, as we expected. A careful examination of our experiments shows that FF-HU incurs less resource fragmentation than FF-HA+. This result is due to the fact that FF-HA+ returns the feasible idle period of maximum capacity among those in its subtree; while this choice was made to speed up the operation of the algorithm, the side effect is higher fragmentation. On the other hand, the running time of FF-HU is significantly higher than that of FF-HA+, especially at medium to high loads; again, this result is consistent with our discussion above.
The system utilization curves in Figure 5(a) suggest that FF-HA+ makes better use of the resources available in the system, i.e., the servers are busy performing work for a longer fraction of time than under FF-HU. However, since the loss rate for FF-HA+ is slightly higher, this result implies that FF-HA+ allocates more jobs to slow processors than FF-HU. A more careful examination of our results reveals that, under FF-HA+, processors with high service rate exhibit higher fragmentation; since the capacity of processors with high service rate expires faster as time progresses, fragmentation of capacity on high-rate servers has a more detrimental effect on system performance, as exhibited by the higher loss rate of FF-HA+.

Figure 5(b) plots the average time that jobs have to wait beyond their ready time. We observe that jobs have to wait significantly longer under FF-HU compared to FF-HA+. In other words, although FF-HU schedules a larger fraction of jobs than FF-HA+, the start time of these jobs is pushed back, resulting in longer waiting times.

Fig. 6. The impact of heterogeneity (curves for L = 1, 2, 3, 4): (a) work loss rate against load, (b) utilization against load.

Finally, Figure 6 investigates the impact of different levels of heterogeneity on performance. Figure 6(a) plots the work loss rate against the load for the four levels of heterogeneity L = 1, ..., 4, where larger values of L imply higher heterogeneity; Figure 6(b) is similar but plots system utilization against load. We can see that as resources become more heterogeneous, the loss rate and system utilization both improve, in many cases significantly so. This behavior follows from the fact that, to increase resource heterogeneity in a given system while keeping the total service rate constant, as required by expression (2), the service rate of a few fast processors must increase further. In other words, a larger fraction of the total service rate is concentrated on fewer resources. Consequently, making the system more heterogeneous introduces a higher degree of statistical multiplexing, whereby fewer high-capacity servers are responsible for serving a larger number of customers. The results in Figure 6 are then consistent with the well-known fact from queueing theory that statistical multiplexing improves system performance.

VIII. CONCLUDING REMARKS

We have considered the problem of advance reservations for jobs with deadlines in a Grid system with heterogeneous resources. We have developed a geometric representation of idle periods and jobs that provides new insight and allows for efficient organization of the reservations. We have developed a scheduling algorithm with good performance that can scale to large Grid systems and long time horizons. We have also shown that resource heterogeneity may have a positive impact on performance if taken into account in the design of scheduling algorithms.

REFERENCES

[1] E. Elmroth and J. Tordsson.
A grid resource broker supporting advance reservations and benchmark-based resource selection. Lecture Notes in Computer Science, volume 3732, pages 1077-1085. Springer-Verlag, 2005.
[2] I. Foster and C. Kesselman, editors. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
[3] K. Krauter, R. Buyya, and M. Maheswaran. A taxonomy and survey of grid resource management systems for distributed computing. Software: Practice and Experience, 32(2):135-164, February 2002.
[4] R. Min and M. Maheswaran. Scheduling Advance Reservations with Priorities in Grid Computing Systems. In Proceedings of PDCS '01, pages 172-176, 2001.
[5] W. Smith, I. Foster, and V. Taylor. Scheduling with advanced reservations. In Proceedings of IPDPS '00, pages 127-132, 2000.
[6] A. Sulistio and R. Buyya. A grid simulation infrastructure supporting advance reservation. In Proceedings of PDCS '04, pages 1-7, Nov. 2004.
[7] H. Rasheed, M. Dikaiakos, and S. Haridi. Quantification of Grid Resource Heterogeneity Effects on Performance. Technical Report, January 2006.
[8] G. Dasgupta, K. Dasgupta, A. Purohit, and B. Viswanathan. QoS-GRAF: A Framework for QoS based Grid Resource Allocation with Failure provisioning. Proceedings of the 14th IEEE International Workshop on QoS (IWQoS '06), pages 281-283, June 19-21, New Haven, CT, USA.
[9] I. Foster and A. Roy. Quality of Service Architecture that Combines Resource Reservation and Application Adaptation. Proceedings of the 8th International Workshop on Quality of Service (IWQoS 2000), pages 181-188, June 5-7, 2000.
[10] J. MacLaren. Advance Reservations: State of the Art. http://www.fz-juelich.de/zam/rd/coop/ggf/graap/graap-wg.html.
[11] A. Andrieux, K. Czajkowski, A. Dan, K. Keahey, H. Ludwig, J. Pruyne, J. Rofrano, S. Tuecke, and M. Xu. Web Services Agreement Specification (WS-Agreement). Global Grid Forum, 2004.
[12] A. Leff, J.T. Rayfield, and D.M. Dias. Service-Level Agreements and Commercial Grids.
IEEE Internet Computing, volume 7, number 4, pages 44-50, July 2003.
[13] H. Li and L. Wolters. An Investigation of Grid Performance Predictions Through Statistical Learning. 1st Workshop on Tackling Computer System Problems with Machine Learning Techniques (SysML), in conjunction with ACM Sigmetrics, Saint-Malo, France, 2006.
[14] L. Yang, J.M. Schopf, and I. Foster. Conservative Scheduling: Using Predicted Variance to Improve Scheduling Decisions in Dynamic Environments. Proceedings of the 15th ACM/IEEE Conference on Supercomputing (SC '03), Phoenix, Arizona, 2003.
[15] I. Foster. What is the Grid? A Three Point Checklist. www-fp.mcs.anl.gov/~foster/articles/whatisthegrid.pdf, July 20, 2002.
[16] L. Jin, V. Machiraju, and A. Sahai. Analysis on Service Level Agreement of Web Services. HP Labs Technical Report HPL-2002-180, June 21, 2002.
[17] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, second edition, 2000.