Job co-allocation strategies for multiple high performance computing clusters


Cluster Comput (2009) 12:323–340

Job co-allocation strategies for multiple high performance computing clusters

Jinhui Qin · Michael A. Bauer

Received: 28 April 2008 / Accepted: 10 March 2009 / Published online: 28 March 2009
© Springer Science+Business Media, LLC 2009

Abstract To more effectively use a network of high performance computing clusters, allocating multi-process jobs across multiple connected clusters becomes an attractive possibility. This allocation process entails dividing the processes of a job among several clusters, which we refer to as co-allocation. Co-allocation offers the possibility of more efficient use of computer resources, reduced turn-around time, and computations using numbers of processes larger than the number of processors on any single cluster. Realizing these possibilities, however, ultimately depends on the inter-cluster communication cost. In this paper, we introduce a scalable co-allocation strategy called the Maximum Bandwidth Adjacent cluster Set (MBAS) strategy. The strategy makes use of two thresholds to control allocation: one to limit the usable bandwidth on inter-cluster communication links and another to control how jobs are split. A simulator that can simulate the dynamic behavior of jobs running across multiple clusters was developed and used to examine the performance of the MBAS co-allocation strategy. Our results indicate that by adjusting the thresholds for link-level control and chunk-size control in splitting jobs, the MBAS co-allocation strategy can significantly improve both user satisfaction and system utilization.

Keywords Resource management · Job co-allocation · Job scheduling · HPC clusters · Performance evaluation

J. Qin · M.A. Bauer
Department of Computer Science, The University of Western Ontario, London, Ontario, N6A 5B7, Canada
J. Qin: jhqin@csd.uwo.ca, qin.jinhui@gmail.com
M.A. Bauer: bauer@csd.uwo.ca

1 Introduction

Over the last decade, cluster computing [1, 2] has become an important direction in parallel computing for solving larger and more complex problems in various areas. There are generally two broad classes of clusters [3]: high-throughput computing clusters, which connect many computer nodes using low-end interconnects, and high-performance computing (HPC) clusters, which connect more powerful computer nodes using faster interconnects. The goal of high-throughput computing is to maximize throughput by improving load balancing among compute nodes in the cluster. For high-performance computing, an additional consideration is to minimize communication overhead by mapping applications to available compute nodes. High-throughput computing clusters are normally used for executing loosely coupled parallel or distributed applications, because such applications do not have high communication requirements among individual processes during execution. HPC clusters are more suitable for tightly coupled parallel applications, which might have substantial communication and synchronization requirements. This research focuses mainly on HPC clusters, though the approaches presented are also applicable to high-throughput clusters.

To satisfy the ever-increasing computing demands of many research institutes and commercial organizations, HPC clusters continue to become more and more popular.
They have dominated the Top 500 Supercomputer List [4]. The growth in HPC clusters has been driven by the emergence of relatively inexpensive and powerful commodity processors and by advances in networking technologies such as Gigabit Ethernet [5], Myrinet [6], and Quadrics [7]. In 1991, a 10-Gflops supercomputer was a Cray that cost approximately $40,000,000. Today, that same computing

power can be achieved by combining four 64-bit computers at a rough cost of about $4,000, bringing the hardware acquisition cost of supercomputing down to the personal desktop level [8]. Moreover, the scalability and flexibility of commodity clusters, networking, and storage make HPC clusters much more available and accessible to researchers and other general users.

In many cases, the capacity of an individual HPC cluster is planned to support an individual organization's peak demand, which normally occurs infrequently. To improve resource utilization across the organization, sharing HPC clusters across an enterprise-wide set of users and projects has become a promising trend. As an example, SHARCNET [9] is a multi-institutional HPC network distributed across 16 academic institutions in Southern Ontario and structured as a cluster of clusters. That means that, instead of smaller groups of users with exclusive access to a local cluster, larger groups of users can share the multiple HPC clusters.

One approach to sharing is to simply provide users with access to all these clusters, essentially letting them have access through shared login identifiers and passwords. A more refined approach to sharing clusters is to provide means for the processes that are part of a single job to be distributed across several clusters. Ideally, this should be done automatically. This sharing can, potentially, lead to lower turn-around times and higher resource utilization, and it can make larger jobs possible, i.e., jobs comprised of more processes than the number of processors in any single cluster. Such a computing system with multiple clusters crossing an enterprise, or even crossing multiple organizations, is also referred to as a computing grid [10].

To accomplish the sharing of clusters, however, several significant challenges must be overcome; one of these is resource management for such computing grids. Typical HPC clusters make use of Resource Management Systems (RMS) to manage the shared resources, user requests, and the allocation of user requests to available resources as efficiently as possible. An RMS for HPC clusters normally comprises a resource manager and a job scheduler. The job scheduler communicates with the resource manager to obtain information about job queues, loads on compute nodes, and resource availability, in order to make scheduling decisions based on its scheduling strategies.

Currently, there are two typical approaches for the RMS in HPC clusters, namely, a queuing approach and a planning approach [11]. In the queuing approach, which is more commonly used, users submit their jobs to one of the queues according to the job characteristics. Different queues may have different limits on various aspects of jobs, such as the number of requested resources, estimated job duration, whether jobs are parallel or serial, job priorities, and so on. Jobs in a queue are scheduled according to a certain policy, e.g., FCFS (First Come, First Serve). Different queues may also have different priorities. If sufficient resources are not available to start any of the jobs in the queues, the system waits until enough resources become available. In contrast, with the planning approach, the RMS plans the start times for each job. If a job finishes before the user-estimated time, a re-planning process is used to reschedule jobs to improve system utilization. In comparing the two approaches, Hovestadt et al.
[11] argued that the queuing approach is easier to maintain, and it does seem to be more popular for managing a single HPC cluster. However, newer features, such as advance reservation and guaranteed quality of service, may be easier to implement using a planning approach, though at more cost in the re-planning process.

Regardless of the approach, however, the problem of how to efficiently plan or schedule jobs across multiple clusters must consider the network communication cost and the communication workload between clusters. Even when computation resources are available at different clusters, the network communication cost between clusters can be a significant factor in overall job performance. The performance of a job comprised of multiple parallel processes across multiple clusters can be affected by various factors, including the maximum capacity along a communication path, the current workload on the path, and the communication patterns between the processes of the job. Therefore, an advanced job scheduler or job allocation algorithm for HPC clusters must be concerned with the communication characteristics of a job, along with other aspects of job performance, user satisfaction, and system resource utilization.

Previous research has typically focused on how to efficiently allocate jobs inside a single cluster, using either the queuing approach or the planning approach. In considering the allocation of jobs across clusters, there are two general approaches: first, a job can be allocated to a single cluster within the grid, with no job allowed to be split or run simultaneously across multiple clusters; second, a job can be split into sets of processes and those processes allocated to multiple clusters, i.e., co-allocated. The first approach is essentially an extension of the current single-cluster approach, with some general scheduler deciding which cluster would be best. This approach limits the size of jobs to those that fit within the largest cluster. It also means that some jobs, which could be split, might have to wait a long time to be run on a particular cluster. In contrast, the second approach could enable some large jobs to be partitioned into smaller sets of processes and then mapped onto multiple clusters. Such an approach would likely introduce additional inter-communication costs among processes on different clusters. Therefore, in making such an allocation decision, the major issue becomes how to co-allocate jobs across clusters when considering the impact of communication.

This research aims at developing advanced scheduling strategies and job allocation algorithms which can accommodate communication factors. In our earlier research, we investigated the communication behavior between processes of jobs and its influence on the performance of those jobs running across multiple clusters [12]. We looked specifically at the effect of the workload on the communication links between clusters, the maximum communication capacity of inter-cluster links, and different types of job communication patterns [12]. Based on these experimental observations, a scalable job co-allocation strategy for multi-cluster grids, the Maximum Bandwidth Adjacent cluster Set (MBAS) co-allocation strategy, was developed. The MBAS co-allocation strategy aims to minimize the inter-cluster communication cost by allocating jobs so as to get the maximum available bandwidth for the possible inter-cluster communication of the allocated jobs. The strategy makes use of two threshold values to control co-allocation: one to control the level of acceptable link saturation and another to control how a job is split.

To evaluate job scheduling and allocation algorithms, real-world experiments are extremely useful; however, they are time-intensive, uncontrolled, and often not repeatable. The disruption to other users in real-world experiments prevents back-to-back runs of alternative designs or algorithms. Therefore, a simulator is commonly used as a means to evaluate algorithms and strategies in this type of research [13]. It can be used to compare different algorithms and, moreover, enables experiments that are configurable and repeatable. To simulate jobs across multiple HPC clusters, some researchers, e.g. [14–18], have used static slowdown ratios to represent the impact of communication cost and to study the influence on the job execution time if the job was split across clusters. However, it is difficult to estimate such a slowdown ratio in advance, and, even if possible, such a ratio may not be uniform across the execution of a job or on multiple clusters with different communication links. Moreover, the slowdown ratio may actually change dynamically based on job communication patterns, the workload on the network links, and the maximum bandwidth of the network links. Based on the observations from our initial experiments, a simulator for multiple HPC clusters was developed [19]. In order to avoid the drawbacks of using a static communication model, this simulator simulates the dynamic behavior of jobs running across clusters.

The results of the simulation studies, as reported in this paper, demonstrate that by adjusting the two threshold values during job co-allocation, the MBAS strategy can reduce turnaround time and increase overall system utilization at the same time. Moreover, the MBAS strategy can be used on large multi-cluster grids with complex topologies. This is a problem for other strategies, such as the Minimal Cluster Set (MCS) strategy [18, 20].

In this paper we describe the MBAS co-allocation strategy for multiple HPC clusters. We present simulation experiments that illustrate the utility of the strategy. The remainder of the paper is structured as follows: Sect. 2 provides an overview of related work. The MBAS co-allocation strategy is presented in Sect. 3. Section 4 describes an initial performance study of the MBAS co-allocation strategy along with a comparison to another co-allocation approach, namely, the MCS strategy.
Additional performance studies using the MBAS co-allocation strategy are provided in Sect. 5. The summary and conclusions are given in Sect. 6.

2 Related work

Over the last decade, there has been much research in the area of job scheduling for parallel systems, HPC clusters, and grid computing [1, 2, 10, 21]. Some of this research has tended to focus on developing scheduling strategies at the application level, while other work has paid more attention to the system level.

For scheduling at the application level, applications are generally described as a group of dependent, or independent, processes or subtasks. The parallel scheduling problem then becomes one of scheduling these subtasks of an application across a parallel machine in order to meet various objective functions. Since the general problem of optimally mapping tasks to machines in heterogeneous computing systems is a well-known NP-complete problem [22], developing heuristics for scheduling tasks onto heterogeneous computing systems has received considerable attention in the literature in recent years. For example, Braun et al. [23] provided an overview of 11 heuristics based on the same set of assumptions. They define an application as a metatask which consists of a collection of independent tasks with no intertask data dependencies. The mapping of the metatask was performed statically, i.e., offline, or in a predictive manner. The goal of the mapping was to minimize the total execution time of such a metatask on a given heterogeneous platform. In contrast, other researchers realized that many Grid applications in areas such as bioinformatics and astronomy require workflow processing in which tasks are executed based on their control or data dependencies. These workflow applications were then modeled as directed acyclic graphs. As a result, a number of grid workflow management systems with scheduling algorithms have been developed, e.g., [24–27]. Their common objective was to minimize the execution time of workflow applications on Grids, or to maximize the steady-state throughput. These application-level scheduling algorithms are closely coupled with application internal structures. The scheduling systems are often based on user-provided estimates of task completion time. The scheduling and mapping of those

tasks mainly concentrate on computational resources. The resources, in general, are dedicated to the applications, or are reserved from so-called utility grid environments used solely for those applications.

In contrast to application scheduling, more recent research has focused on resource management at the system level for non-dedicated shared HPC networks. Our research falls into this category. In particular, we focus on resource management for jobs submitted by different users in a shared multi-cluster environment. Typical HPC clusters make use of an RMS to manage the shared resources, user requests, and the allocation of user requests to available resources as efficiently as possible. In [11], Hovestadt et al. classified the existing resource management systems for HPC systems into a queuing approach and a planning approach. Regardless of the approach, the problem of how to efficiently plan or schedule jobs across clusters becomes critical.

In considering how to allocate or map a specific user request, or a set of requests from multiple users, onto the shared resources in order to reduce the average response time and improve resource utilization, some researchers [28, 29] have developed techniques to pack/schedule jobs as tightly as possible. The common goal of this approach was to reduce the idle time of resources due to fragmentation, where fragmentation refers to a small number of idle resources available in a cluster. This research focused only on jobs that were to be allocated to a single cluster. More constraints must be considered in reducing fragmentation in a grid environment, e.g., the communication cost if a job is to be mapped across multiple clusters.

Recently, some researchers have begun to realize the importance of these issues. To improve system utilization and reduce turnaround time, jobs are allowed to be distributed globally in HPC cluster networks, not just restricted to a single cluster within the network. This includes co-allocating jobs across site boundaries, especially in the case of jobs with a large number of processor requests. To study the performance of jobs running across multiple clusters, Banen et al. [14] presented a measurement study of the total runtime and communication time of some specific applications on both single clusters and multi-cluster systems. A slowdown ratio was then computed for jobs split across clusters. Using a simulator, they then tested several multi-cluster scheduling policies. Huang et al. [15] and Ernemann et al. [16] did similar research, but rather than using a measured slowdown ratio, they tested different allocation policies using a sequence of jobs with different assumed slowdown ratios. Bucur et al. [17] and Li [18] used a static communication speed ratio, i.e., a static ratio between intra-cluster and inter-cluster communication speeds, in order to adjust a job's performance when running across clusters. The common conclusion from this research was that simply allowing co-allocation without any restrictions was not a very good idea. To achieve better system performance, different allocation policies were suggested, depending, for example, on whether the slowdown ratios or the communication speed ratio were under or above a predefined value. The use of a slowdown ratio or a static speed ratio to represent the influence on the performance of jobs running across multiple clusters avoids some of the real challenges.
However, it is very difficult to estimate the slowdown ratio of a job in advance. Moreover, the slowdown ratio or the speed ratio can change dynamically based on run-time circumstances. Instead of using a static communication model, Jones et al. [20] took a more dynamic, bandwidth-centric view of job communication in order to analyze performance characteristics. The co-allocation strategy and the analysis in their work were limited to a star-shaped multi-cluster topology. In [20], they assumed an ideal central switch which connected all the clusters to one another through a single dedicated link. This topology has very limited scalability, as does the co-allocation strategy they proposed. Their work is also limited to jobs with all-all global communication patterns, that is, where there is communication among each (or nearly each) pair of processes. Scheduling or mapping a job across clusters has thus been addressed to some extent, but more work is needed, e.g., to include the consideration of different communication patterns, the varying capacity of the network links between clusters, and the workload on network links.

3 MBAS co-allocation strategy

Before introducing the MBAS co-allocation strategy, we first provide a brief discussion of the co-allocation problem in multiple clusters. Since our primary focus in this work is on inter-cluster communications, we have assumed that all clusters are homogeneous, i.e., all processors have the same computing power, and that no more than one process is allowed to be allocated to a processor within any cluster. Thus, the effect of workload on each processor and the impact of a job's computational part can essentially be ignored.

3.1 The co-allocation problem in multiple clusters

In doing co-allocation, a job comprised of multiple interacting processes is split into groups of processes, and those groups of processes are then allocated to multiple clusters. The inter-process communications of a co-allocated job might introduce some inter-cluster communications. This inter-cluster communication could impact the performance of the co-allocated job. From our earlier experiments [19], a job

that is running across multiple clusters experiences a slowdown in message transmission when the transmission workloads on the related inter-cluster links exceed their maximum link capacities. This then causes delays in the job completion time. In order to co-allocate a job across multiple clusters in such a way that completely prevents the slowdown associated with over-saturated network links, the scheduler must have access to the job's communication characteristics, including the pattern of communication and the bandwidth requirements for the job's inter-process communications. When combined with the network information, i.e., the available bandwidths of inter-cluster links and the available processors on each of the clusters, this co-allocation problem can be recognized as a constraint satisfaction problem (CSP). A CSP involves a set of variables and a set of constraints; the problem is to find an assignment of values to some or all of the variables such that no constraint is violated. A CSP can be solved using the Generate-and-Test paradigm, in which all possible combinations of variable assignments are generated one by one and each is tested to see if it satisfies all the constraints. The first combination that satisfies all the constraints is the solution. However, this method is intolerably slow, since the total number of possible combinations is the size of the Cartesian product of all the variable domains. Many efforts have been made to improve the efficiency of finding a solution to a CSP, e.g., [30, 31]. To address the slowdown associated with over-saturated inter-cluster links during job co-allocation, Jones et al. [20] employed a branch-and-bound brute-force technique to find a solution for a particular job's co-allocation. Even though this approach can guarantee that a link will never become over-saturated due to job co-allocations, the calculations are still significant, and they also require an accurate communication characterization of each job, which, in reality, is not likely obtainable.

The MBAS co-allocation strategy tries to co-allocate a job to a set of clusters such that the allocated bandwidths on the related links for the potential inter-cluster communication of the job are maximized. Moreover, instead of trying to guarantee that no links will be over-saturated during job co-allocation, MBAS makes sure that as soon as a link has been detected to be over-saturated, that link will not be involved in job co-allocation until some workload has been released and the overload situation has changed.

3.2 MBAS co-allocation strategy

When users submit jobs to a cluster, all jobs are initially kept in a queue in FCFS order. We use a two-step strategy to schedule them. In the first step, every job in the queue is examined to see if it can be allocated (in FCFS order) to any single cluster within the collection of clusters. All the jobs which cannot be so allocated are considered in the second step, i.e., considered for co-allocation across multiple clusters.

Before we introduce the MBAS job co-allocation strategy, we first define several terms and introduce some notation used in the description of MBAS:

- We denote the available clusters as C_1, C_2, ..., C_n.
- We let A_i be the number of available processors on cluster C_i.
- We denote the direct link between cluster C_i and cluster C_j as E_ij, if it exists.
- We let N = (C_1, C_2, ..., C_n) be the set of available clusters; this is input to the algorithm. We assume that N has been sorted in descending order of A_i, where 1 ≤ i ≤ n.
- λ is a threshold used to control the saturation level of inter-cluster links, with 0 ≤ λ ≤ 1. We define a link as not saturated if the ratio of the workload on the link to the maximum capacity of the link does not exceed the given threshold value λ.
- For any cluster C_i, we define the Non-Saturated Adjacent Cluster Set (NSACS) of the cluster C_i to be L_i = {C_1, C_2, ..., C_m}, where cluster C_j belongs to L_i if A_j > 0 and the link E_ij is not saturated.
- S_i is the total number of available processors of all clusters in L_i, i.e., the NSACS of cluster C_i, plus the available processors of cluster C_i itself.
- A job allocation is defined as (J_s^1, J_s^2, ..., J_s^m), where J_s^i is the number of processors allocated to job J_s on cluster C_i, and J_s^1 + J_s^2 + ... + J_s^m = J_s.
- δ is a threshold used to control job splitting, with 0 ≤ δ ≤ 1. It ensures that during co-allocation of a job J_s, at least one set of processes of the job, of size δJ_s, is kept on a single cluster without being split further. That is, at least one of J_s^1, J_s^2, ..., J_s^m for job J_s is of size at least δJ_s.
- The Unit Process Bandwidth UPBW_ij of cluster C_i to cluster C_j is defined as BW_ij / A_i, where BW_ij is the current available bandwidth of the link E_ij.

The detailed algorithm of MBAS is presented in Fig. 1. Input to the MBAS strategy is a job, with information about the number of processes it requires, and the set of available clusters. Phase 1 of MBAS attempts to reduce the inter-cluster communication cost in job co-allocation by allocating a job J_s to the NSACS of a cluster C. If multiple candidates are available, the cluster C selected is the one with the most available processors among all candidates. In doing the processor mapping (Phase 2), MBAS fills the selected cluster C first; the rest of the clusters in the NSACS are then filled in descending order of their Unit Process Bandwidth (UPBW) to the cluster C.
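As a concrete illustration of these definitions, the saturation test and the UPBW computation can be written down directly. The following Python sketch is ours, not the authors' code, and the function names are assumed for illustration:

    # Minimal sketch of the link-saturation test and the UPBW computation,
    # following the definitions above (names are illustrative only).

    def is_saturated(workload, max_capacity, lam):
        # A link is saturated if workload/capacity exceeds the threshold lambda.
        return workload / max_capacity > lam

    def upbw(available_bw_mbps, available_procs):
        # Unit Process Bandwidth: available link bandwidth per available processor.
        return available_bw_mbps / available_procs

    # Example: a 1000 Mbps link carrying 100 Mbps (10% load), and a cluster
    # with 6 free processors behind it.
    print(is_saturated(100, 1000, 0.5))   # False: 10% load <= 50% threshold
    print(upbw(1000 - 100, 6))            # 150.0 Mbps per process

With λ = 0.5, a link at 10% load remains usable, and a cluster with 6 free processors behind a 900 Mbps residual link offers a UPBW of 150 Mbps; the same numbers reappear in the worked example of Sect. 3.3.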

Input: A job of size J_s; the cluster set N = (C_1, C_2, ..., C_n) in descending order of A_i; the two threshold values λ and δ; and the workload and the maximum capacity of each direct link E_ij.

Output: An allocation (J_s^1, J_s^2, ..., J_s^m) for the job J_s such that the inter-cluster communication bandwidth for the job execution is maximized.

Algorithm MBAS:
// Phase 1: determine the NSACS
foundNSACS ← False
for each C_i in N with A_i ≥ δJ_s do {
    // find L_i and calculate the S_i of C_i
    for each C_j that has a direct link E_ij to C_i and A_j > 0 do {
        if workload(E_ij) / maxCapacity(E_ij) ≤ λ then
            add C_j to L_i
    }
    if L_i is not empty then {
        S_i ← A_i + Σ_{C_j ∈ L_i} A_j
        if S_i ≥ J_s then {
            foundNSACS ← True
            break loop
        }
    }
}
// Phase 2: determine the allocation of the job to the adjacent cluster set of C_i
if foundNSACS then {
    sort the C_j in L_i in descending order of their UPBW to C_i
    let M = (C_1, C_2, ..., C_m), where C_1 = C_i and C_2, ..., C_m are copied from the sorted list L_i
    remainedSize ← J_s
    for (k = 1; k ≤ m and remainedSize > 0; k++) do {
        J_s^k ← min(A_k, remainedSize)
        remainedSize ← remainedSize − J_s^k
    }
    remove job J_s from the queue
    return (J_s^1, J_s^2, ..., J_s^m)
} else {
    keep job J_s in the queue
    return NULL
}

Fig. 1 The MBAS algorithm

MBAS allocates the fewest of the job's processes to the cluster in the NSACS that has the smallest UPBW to cluster C. In other words, it attempts to use the clusters connected to cluster C whose links are least utilized, in order to maximize the bandwidth available for the job's potential inter-cluster communications. Note that MBAS does not make use of any specific information about the

inter-process communication among the processes of a job. Moreover, since cluster C has the most available processors among all candidates, the size of the largest chunk of the job after splitting is maximized. Therefore, MBAS attempts to minimize the performance impact caused by the inter-cluster communication due to the job co-allocation. The examples in the next subsection illustrate how MBAS works in more detail.

3.3 Examples of using MBAS

Fig. 2 An example of a multi-cluster network

Figure 2 shows an example of a multi-cluster network which consists of 7 clusters. The number of currently available processors is specified for each cluster C_i. The current workload on each inter-cluster link L_i has also been specified and is represented as a percentage of the link's maximum bandwidth. All the links in the example are assumed to have the same maximum bandwidth of 1 Gbps. The adjacent cluster set of cluster C_1 is represented as C_1{C_2, C_5}, the adjacent cluster set of cluster C_2 as C_2{C_1, C_3, C_4}, and so on.

Now consider allocating a job of size 20, i.e., requiring 20 processors, using the MBAS co-allocation strategy. Assume that we have set λ = 1.0. In the algorithm in Fig. 1, a threshold of λ = 1.0 means that a link is excluded from this job's co-allocation analysis only if its workload is at 100% of its maximum capacity. In this example, all the links in Fig. 2 carry workloads of less than 100% of capacity, so all can be considered during co-allocation. Assume that the other threshold, δ, is set to 0, which indicates that there is no constraint on job splitting. For this example, then, there are three candidate cluster sets possible for this size-20 job: namely, C_1{C_2, C_5}, C_2{C_1, C_3, C_4}, and C_4{C_2, C_6, C_7}. MBAS selects the NSACS of the cluster C that has the most available processors among all the candidates. Since cluster C_2 has the most available processors, the set C_2{C_1, C_3, C_4} is selected for this job's co-allocation when λ = 1.0 and δ = 0.

If λ = 0.5, a link should not be involved in co-allocation if its workload exceeds 50% of its maximum capacity, e.g., the links L_2, L_4, and L_5 in Fig. 2. If δ = 0.5, then one of the clusters should be able to hold at least 50% of this job's required processors, i.e., for this example, a single cluster should be able to accommodate at least 10 processes of this job. In Fig. 2 only cluster C_2 satisfies this constraint. Therefore, only one candidate cluster set would be available for this job, namely C_2{C_1, C_4}, when λ = 0.5 and δ = 0.5. Cluster C_3 is not involved in this cluster set because link L_2 is considered over-saturated when λ = 0.5. If we set a more restrictive constraint on job splitting, for example δ = 0.8, then a single cluster would have to be able to hold at least 80% of the job's required processors, i.e., 16 processors. In Fig. 2 there is no single cluster satisfying this constraint. Therefore, this job would stay in the job queue and wait until more resources became available to satisfy all the constraints.

When the NSACS has been selected for a given job, MBAS then does the processor mapping, taking into consideration the available bandwidth on the links. In this example, MBAS determined that this size-20 job was to be allocated to the cluster set C_2{C_1, C_3, C_4}. According to Phase 2 of the MBAS algorithm (Fig.
1), cluster C_2 is filled first and the remaining clusters are filled in descending order of their unit process bandwidth (UPBW) to cluster C_2. The UPBW of cluster C_1 to cluster C_2, i.e., UPBW_12, according to the definition given before, is calculated by dividing the available bandwidth of link L_1 by the number of available processors on cluster C_1. Since the workload on link L_1 is 10%, the available bandwidth of L_1 is 900 Mbps; therefore, UPBW_12 = 900/6 = 150 Mbps. In a similar way, we can calculate UPBW_32 = 50 Mbps and UPBW_42 = 87.5 Mbps. Therefore, the rest of the clusters should be filled in the order C_1, then C_4, and then C_3. The final mapping for this size-20 job is then {C_2 = 10, C_1 = 6, C_4 = 4}. This example illustrates how MBAS attempts to use the clusters connected to cluster C_2 whose links are least utilized, in order to maximize the bandwidth available for this job's potential inter-cluster communications.

Although a job is co-allocated based on the NSACS determined by MBAS, there is no guarantee that links will not become over-saturated during this job's execution. We are not assuming that a job's communication requirements can be completely known in advance. Rather, all we can do is make sure that links are not over-saturated during a single job's allocation.

It is important to note that the two threshold values, λ and δ, play important roles in determining the overall performance of MBAS co-allocation. The next section presents simulation studies of how the two thresholds interact and what their impact is on overall performance. We will also compare these results with those of another co-allocation strategy.
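To make the two phases concrete, the following Python sketch (ours, not the authors' implementation) replays this example. Only the selected candidate set C_2{C_1, C_3, C_4} is modeled, and the processor counts for C_3 and C_4 and the workloads on links L_2 and L_3 are assumptions: they are chosen to be consistent with the UPBW values derived above (50 and 87.5 Mbps), since the exact figure values are not recoverable from the text alone:

    # Sketch of MBAS (Fig. 1) on the Sect. 3.3 example. The values for C3/C4
    # and their link loads are assumed, consistent with UPBW_32 = 50 and
    # UPBW_42 = 87.5 as computed in the text.

    avail = {"C1": 6, "C2": 10, "C3": 8, "C4": 8}   # available processors A_i
    links = {                                        # (fractional load, capacity Mbps)
        ("C1", "C2"): (0.10, 1000),   # L1: 10% loaded (given in the text)
        ("C2", "C3"): (0.60, 1000),   # assumed load: gives UPBW_32 = 400/8 = 50
        ("C2", "C4"): (0.30, 1000),   # assumed load: gives UPBW_42 = 700/8 = 87.5
    }

    def link(ci, cj):
        return links.get((ci, cj)) or links.get((cj, ci))

    def mbas(job_size, lam, delta):
        # Phase 1: scan clusters in descending order of available processors.
        for ci in sorted(avail, key=avail.get, reverse=True):
            if avail[ci] < delta * job_size:         # the largest chunk must fit here
                continue
            nsacs = [cj for cj in avail
                     if cj != ci and link(ci, cj) and avail[cj] > 0
                     and link(ci, cj)[0] <= lam]     # keep non-saturated links only
            if avail[ci] + sum(avail[c] for c in nsacs) < job_size:
                continue                             # S_i too small; try the next cluster
            # Phase 2: fill ci first, then NSACS members by descending UPBW.
            def upbw(cj):
                load, cap = link(ci, cj)
                return cap * (1 - load) / avail[cj]
            alloc, remaining = {}, job_size
            for c in [ci] + sorted(nsacs, key=upbw, reverse=True):
                take = min(avail[c], remaining)
                if take:
                    alloc[c] = take
                remaining -= take
                if remaining == 0:
                    return alloc
        return None                                  # job stays in the queue

    print(mbas(20, lam=1.0, delta=0))   # {'C2': 10, 'C1': 6, 'C4': 4}

Running the sketch reproduces the mapping derived above, {C_2 = 10, C_1 = 6, C_4 = 4}.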

4 Performance of MBAS

In this section, we present an evaluation of the MBAS job co-allocation strategy. We compare the results of using the MBAS co-allocation strategy with those of another co-allocation strategy, referred to as the Minimal Cluster Set (MCS) approach, which we describe next.

4.1 Minimal Cluster Set (MCS) strategy

MCS has been commonly used by other researchers in this area, e.g., [18, 20]. During job co-allocation using the MCS approach, the available clusters are sorted in order of their number of available processors. A job to be co-allocated is first allocated to the cluster with the maximum number of available processors. If all the required processors can be found there, the allocation is complete. If not all processes of the job can be allocated, the allocation of the job's remaining processes proceeds in order from there. Thus, MCS tries to minimize the splitting of a given job in co-allocation. A major deficiency of this approach is that there is no consideration of the bandwidth of the links involved in the possible inter-cluster communication, which may have a serious negative performance impact. For example, a job with large communication requirements could end up being allocated on two clusters where the inter-cluster link has very limited available bandwidth. The research in [20] did explore the performance impact of keeping a large chunk of a parallel job on a single cluster without splitting it further. This is the same idea as the job splitting control we have adopted in the MBAS co-allocation strategy. However, there was no consideration of the bandwidth levels of inter-cluster links during job co-allocation in previous research. For comparison purposes, in our experiments we have added the two threshold controls to the MCS co-allocation strategy as well: one threshold to control the saturation level of inter-cluster links and the other to control job splitting during co-allocation.

4.2 A reference model for comparison

To evaluate and compare the performance of the two co-allocation approaches, a reference model called Single Cluster Only (SCO) was used. In SCO, all jobs are only allowed to be allocated to a single cluster. Jobs that could not be allocated to any single cluster would have to wait until some running jobs finished and enough processors on a single cluster became available. The performance results of either of the two co-allocation approaches are then represented as percentage changes, i.e., increases or decreases, with respect to the performance of the SCO reference model on the same job set. As a result of this approach, we do not consider jobs that require more processors than those available in any single cluster; this is for future work.
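Before turning to the metrics, a minimal sketch (ours) of the MCS fill order of Sect. 4.1 makes the contrast with MBAS clear. Applied to the same illustrative cluster state as the MBAS sketch in Sect. 3.3, MCS looks only at processor counts and so fills C_3 ahead of C_1 and C_4, regardless of link bandwidth:

    # Sketch of MCS: greedily fill clusters in descending order of available
    # processors, ignoring link bandwidth entirely (our illustration).

    def mcs(avail, job_size):
        alloc, remaining = {}, job_size
        for c in sorted(avail, key=avail.get, reverse=True):
            take = min(avail[c], remaining)
            if take:
                alloc[c] = take
            remaining -= take
            if remaining == 0:
                return alloc
        return None    # not enough processors in total

    print(mcs({"C1": 6, "C2": 10, "C3": 8, "C4": 8}, 20))
    # -> {'C2': 10, 'C3': 8, 'C4': 2}, versus MBAS's {'C2': 10, 'C1': 6, 'C4': 4}

Here MCS happens to split the job across the same number of clusters, but it pays no attention to the heavily loaded C_2 to C_3 link that MBAS deliberately avoided.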
4.3 Performance evaluation metrics

In considering the metrics to be measured in our simulation study, and to avoid smaller jobs having a large impact on raw metrics such as the average response time, we computed the Average Weighted Response Time (AWRT) [32], which weights a job's raw response time by the job's resource consumption. In addition, the average System Utilization (SU) was measured. Since our experiments were performed on a given set of jobs, we also make use of the Makespan (MS) of a set of jobs, i.e., the completion time required to process all the jobs in the set. This is a useful measure for comparing the effectiveness of the different allocation strategies. Each metric was calculated as follows:

    AWRT = \sum_{i=0}^{n} (CompletionTime_i - SubmitTime_i) \cdot Weight_i

    Weight_i = \frac{Size_i \cdot ExecutionTime_i}{\sum_{i=0}^{n} (Size_i \cdot ExecutionTime_i)}

    MS = \max\{ CompletionTime_i : i \in [0, ..., n] \} - SubmitTime_0

    SU = \frac{\sum_{i=0}^{n} (CompletionTime_i - StartTime_i) \cdot Size_i}{totalClusterSize \cdot MS} \times 100\%
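Transcribed directly into Python, the three metrics can be computed from a completed job trace as follows (our sketch; the field names are assumed, and jobs are taken to be ordered by submission time so that SubmitTime_0 is the first entry):

    # Sketch of the AWRT, MS and SU computations defined above.

    def metrics(jobs, total_cluster_size):
        # jobs: list of dicts ordered by submit time, each with submit/start/
        # completion times (s), size (processors) and exec_time (s).
        denom = sum(j["size"] * j["exec_time"] for j in jobs)
        awrt = sum((j["completion"] - j["submit"]) * j["size"] * j["exec_time"] / denom
                   for j in jobs)
        ms = max(j["completion"] for j in jobs) - jobs[0]["submit"]
        su = 100 * sum((j["completion"] - j["start"]) * j["size"] for j in jobs) \
             / (total_cluster_size * ms)
        return awrt, ms, su

    jobs = [
        {"submit": 0,   "start": 0,   "completion": 600,  "size": 10, "exec_time": 600},
        {"submit": 120, "start": 600, "completion": 1200, "size": 20, "exec_time": 600},
    ]
    print(metrics(jobs, total_cluster_size=160))   # (920.0, 1200, 9.375)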
4.4 Experimental setup

In order to focus on comparing the efficiency of the two co-allocation strategies, we considered an environment in which there was a single job scheduling server that was globally aware of all available resources of the entire system at any time. Jobs are scheduled in two steps, as mentioned before. In the first step, every job in the queue is examined to see if it can be allocated to any single cluster within the collection of clusters, in first come, first serve (FCFS) order. Then, all the jobs which cannot be allocated to a single cluster are considered in the second step, i.e., co-allocation across multiple clusters using either of the two approaches.

The network topology for this set of experiments consisted of only three clusters with one inter-cluster link for each pair of clusters. This was chosen so that the comparison of the MBAS strategy to the MCS strategy would be fairer. Choosing such a small network makes it possible to avoid issues of routing when applying MCS to a larger network topology. Since MCS might allocate a job to two clusters which could be far away from
each other in a large network, multiple paths might exist for connecting the two clusters, and these would have to be determined by the MCS strategy, something that would have to be added. In contrast, MBAS only considers co-allocating a job to an adjacent cluster set, which has the advantage of only considering co-allocation to clusters which have direct communication (this communication could be across a rather complex network, but does not involve other clusters).

Fig. 3 System configurations of the experiments

To compare MBAS and MCS we look at the results of three experiments:

Baseline Experiment 1: We first compare MBAS and MCS in a very homogeneous environment. We consider a multi-cluster system consisting of three interconnected clusters (C_1, C_2, C_3) with one link between each pair of clusters. The number of processors for each of the three clusters was set to be identical, namely n_1 = n_2 = n_3 = 70. The links between the clusters were also set to the same maximum bandwidth: B_12 = 100 Mbps, B_13 = 100 Mbps, and B_23 = 100 Mbps (see Fig. 3(a)).

Baseline Experiment 2: The system configuration is homogeneous exactly as in the previous experiment, but with higher-bandwidth links: B_12 = 1000 Mbps, B_13 = 1000 Mbps, and B_23 = 1000 Mbps (see Fig. 3(b)).

Experiment 3: This experiment comparing MBAS and MCS uses the same topology as the previous experiments (all clusters connected) but with varying numbers of processors and links of different bandwidths. The numbers of processors for the three clusters were set to n_1 = 70, n_2 = 50, n_3 = 40, and the links between the clusters had different maximum bandwidths: B_12 = 1000 Mbps, B_13 = 100 Mbps, and B_23 = 10 Mbps (see Fig. 3(c)).

The simulator developed in [19] was used in the experiments. A synthetic workload was used to compare the two approaches. The job sets were generated randomly, with job inter-arrival times Poisson distributed with an average of 120 seconds. Job sets consisted of approximately 300 jobs. Jobs' initial estimated execution times were uniformly distributed between 300 and 3000 seconds, and jobs' required processors were randomly set between 10 and 70. The average job size across a job set was about 25% of the total number of processors of the entire multi-cluster system. The generated job sizes were relatively large with respect to the entire system. This was done intentionally, since the job co-allocation strategies are intended to accommodate more large jobs; this provides a better comparison of the strategies.

For the jobs in the test job set, in order to capture both a very common job communication pattern and a more intensive global communication pattern, the jobs' inter-process communication patterns were randomly set as either a master-slave pattern or an all-all pattern. For the purposes of the study reported in this paper, these two patterns have been used. It should be noted, however, that the MBAS co-allocation strategy is not limited to these two patterns. It can accommodate any communication pattern among the processes of a job, since it does not use this information for allocation. The patterns were selected to see the effects of the different strategies on different types of jobs with significantly different communication patterns.
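The synthetic workload just described is easy to reproduce; the following generator is our sketch of it (the authors' generator is not given in the paper), with all distribution parameters taken from the text:

    # Sketch of the synthetic workload: ~300 jobs, Poisson arrivals with a mean
    # inter-arrival time of 120 s, execution times uniform in [300, 3000] s,
    # sizes uniform in [10, 70], and a random communication pattern per job.

    import random

    def make_job_set(n_jobs=300, seed=0):
        rng = random.Random(seed)
        jobs, t = [], 0.0
        for _ in range(n_jobs):
            t += rng.expovariate(1 / 120)   # exponential gaps -> Poisson arrivals
            jobs.append({
                "submit": t,
                "exec_time": rng.uniform(300, 3000),
                "size": rng.randint(10, 70),
                "pattern": rng.choice(["master-slave", "all-all"]),
            })
        return jobs

    print(len(make_job_set()), make_job_set()[0]["pattern"])

Each test case in the paper re-runs such a generator with six different random seeds and averages the results.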
For the master-slave pattern, inter-process communication happens only between the master process and each of the slave processes, while for the all-all pattern, inter-process communication exists between each pair of processes in the job. These inter-process communications require a certain amount of bandwidth.
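The two patterns imply very different numbers of communicating pairs, which is the root of the differences observed later between master-slave and all-all jobs. A quick sketch (ours) of the pair counts for a job of p processes:

    # Number of communicating process pairs per pattern (our illustration).

    def n_pairs(pattern, p):
        if pattern == "master-slave":
            return p - 1               # the master talks to each slave
        if pattern == "all-all":
            return p * (p - 1) // 2    # every pair of processes communicates
        raise ValueError(pattern)

    print(n_pairs("master-slave", 20))   # 19
    print(n_pairs("all-all", 20))        # 190

So for a size-20 job, an all-all pattern carries ten times as many communicating pairs as a master-slave pattern, and a correspondingly larger aggregate bandwidth demand on any inter-cluster link it crosses.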
During the experiments, the required inter-process communication bandwidth
per pair of processes (BWPP) was set the same for all jobs in one test run. For each test case, six runs were conducted, each using a different random seed to generate the experimental job set. The average results and their standard deviations are presented in the graphs of Figs. 4 to 6. Within the graphs, the height of each bar represents the average value, and the line within each bar, with crossing lines at its top and bottom, represents the range of one standard deviation from the mean, i.e., ±σ.

4.5 Experimental results

Fig. 4 Results of Baseline Experiment 1

The results of the simulation for the three sets of experiments are presented in Figs. 4, 5, and 6. These experiments used the same controlled link saturation level, λ = 1.0, but two different levels of δ to control job splitting, δ = 0 and δ = 0.8. The selection of δ = 0.8 was based on Jones' work in [20]; the co-allocation experiments in [20] found that the overall performance was more stable with such a threshold setting. In order to display more detail of the results on a smaller scale, especially for the positive values, some bars showing negative values were cut off in the charts of these figures; e.g., in Fig. 4 the bars showing results for AWRT and MS at BWPP = 0.8 or 1 Mbps have been truncated. Negative values mean that the performance of the respective co-allocation experiment was much worse than that of the SCO model, in which no co-allocation was involved.

From the results of Baseline Experiment 1 (Fig. 4) and Baseline Experiment 2 (Fig. 5) we can make several observations. First, consider the situation when the maximum bandwidths of the inter-cluster links are relatively small, e.g., 100 Mbps as in Experiment 1 (Fig. 4). In this case, the co-allocation strategies provide some performance improvement (AWRT, MS and SU) only when jobs have small inter-process communication requirements, e.g., a BWPP of less than 0.1 Mbps. Basically, there is little communication to impact overall performance, and co-allocation provides improved use of computational resources. When the BWPP increases to 0.3 Mbps or 0.5 Mbps, both co-allocation strategies show some improvement in AWRT and MS when δ = 0.8. At higher values of BWPP (0.8 Mbps or 1.0 Mbps), the overall performance (as measured by AWRT and MS) is seriously degraded when using either co-allocation approach. Basically, when bandwidth is limited, performance suffers for co-allocated jobs. This is not that surprising, and the simulation reflects it.

Fig. 5 Results of Baseline Experiment 2

Fig. 6 Results of Experiment 3

Second, consider the case when the maximum bandwidths of the inter-cluster links are relatively large, e.g., 1000 Mbps in Experiment 2 (Fig. 5). Using either of the two co-allocation approaches, the overall performance, as measured by AWRT and MS, is improved (relative to the SCO reference model) at all values of BWPP. Generally, for both MCS and MBAS, setting δ = 0 resulted in slightly better overall performance than δ = 0.8. In comparing MCS to MBAS, there is little overall difference with the same values of λ and δ, with perhaps MBAS having slightly better performance.

The two sets of baseline experiments tested the application of co-allocation strategies in very homogeneous environments, i.e., where cluster sizes are the same and links between clusters have the same maximum capacities. Overall, the two co-allocation approaches, MCS and MBAS, showed very similar performance in these two sets of experiments. By way of further comparison, we used the same job sets as in the previous experiments on a system consisting of clusters of various sizes with different capacities of inter-cluster links. Experiment 3 provided a test of such a case. The results of Experiment 3 are depicted in Fig. 6. We can observe the following:

First, using either of the two co-allocation approaches resulted in a significant improvement in system utilization (SU), about 33%–57% when compared to the SCO reference model. Second, for small values of BWPP, co-allocation resulted in significant improvements in AWRT and MS. When BWPP ≤ 0.08 Mbps, the MCS and MBAS approaches performed similarly, i.e., AWRT and MS were both reduced by about 20%–40%. This was to be expected, since jobs with small bandwidth requirements for their inter-process communication have less chance of overloading the communication links during execution of the co-allocated processes. However, when BWPP was at or over 0.3 Mbps, the MCS approach with both threshold settings resulted in a degradation in performance, i.e., the changes in AWRT and MS were negative. That means that doing co-allocation using MCS in these situations resulted in performance worse than that of the SCO model. When job splitting was constrained during MBAS co-allocation, i.e., setting δ = 0.8 instead of δ = 0, MBAS achieved performance improvements in reducing AWRT and MS and increasing SU at the same time, even for such large inter-process communication requirements. This was because a large portion of the processes of a job, i.e., 80% of the job when δ = 0.8, was kept on a single cluster during co-allocation, which reduced the possible inter-cluster communications. Moreover, MBAS allocated the rest of the job's processes to clusters with larger UPBW in the selected NSACS to maximize the bandwidth for the potential inter-cluster communications (see Sect. 3.2), while MCS did not.

We also observed that when the experimental results demonstrated relatively better performance, their standard deviations were relatively small and consistent. However, when the performance was seriously degraded (i.e., worse than SCO), the respective results, generally those with relatively large BWPP values, had relatively larger standard deviations.

The above three sets of experiments provide support for the benefit of applying co-allocation strategies in multi-cluster systems.
When comparing the performance of the MBAS approach with the MCS approach, we can conclude that the two performed very similarly when the inter-cluster links all had very large maximum capacities and jobs had relatively small inter-process communication requirements. However, when the system consists of clusters of various sizes and inter-cluster links with different capacities, in particular where some links may have very limited capacities, MBAS, with the control thresholds adjusted appropriately, appears to have advantages over MCS. Because MCS does not consider the bandwidth available for the possible inter-cluster communication, a job having large communication requirements could end up being allocated to clusters with very limited available bandwidth between them. In the experiments, we only compared the two approaches on a small network of systems to ensure that both approaches would be applicable, even though MBAS can be applied to larger systems.

5 Further performance analysis of MBAS

In this section we further analyze the performance of MBAS. First, we look at the impact of communication patterns among the processes within a job. Second, we look at alternative job allocation policies using MBAS.

5.1 Impact of communication patterns

In this set of experiments, two sets of jobs were considered, each with 300 jobs. The distributions of job sizes and job execution times were the same as before; however, one set had jobs with only master-slave communication patterns and the other set had only all-all communication patterns. The configuration of the system is the same as the one used in Experiment 3 (Fig. 3(c)). Three pairs of threshold settings were tested: (λ = 1.0, δ = 0.0), (λ = 1.0, δ = 0.8), and (λ = 0.5, δ = 0.8). The first two settings are as in the previous experiments, and the third reflects a link saturation level of 50%. As before, six runs were performed for each experiment, each with a different random seed for generating the job set. The average results and the standard deviations are presented in Figs. 7 and 8. From the results we observed that:

Fig. 7 Experiments on master-slave jobs

Fig. 8 Experiments on all-all jobs

Master-slave jobs benefit significantly from MBAS job co-allocation: e.g., AWRT was reduced by about 15%–20%, MS was reduced by about 25%–35%, and SU was increased by around 40% (Fig. 7). The performance of master-slave jobs was less sensitive to changes in the two threshold values and to increases in the bandwidth requirement, BWPP, when compared to all-all jobs (Fig. 8); moreover, the standard deviations were small and very consistent across the experiments. For all-all jobs (Fig. 8), SU was increased in most cases. However, job performance (AWRT or MS) was seriously degraded; moreover, the standard deviations over multiple runs for those experiments were relatively large, especially as BWPP increased.

These experiments showed that different communication patterns within a job have different impacts on the overall performance of MBAS. They also showed that in some circumstances there were improvements in system utilization, makespan, and average weighted response time. In practice, communication patterns among jobs will vary. If information about the communication pattern of a job were known in advance, such as a hint from the user, co-allocation might be done more effectively. We consider this next.

5.2 Three scheduling policies

As mentioned before, there are two iteration steps in doing job scheduling and allocation. The first step is allocation to a single cluster, and the second step is MBAS co-allocation. We now look at three different scheduling policies for job sets having, in particular, mixed communication patterns, where we know whether a job has a master-slave communication pattern or an all-all communication pattern, or can roughly class it as one or the other. The policies, sketched in code below, are:

A) Job scheduling and allocation is done exactly as before, i.e., jobs with different communication patterns are treated the same in both iteration steps.

B) During the first step, for single-cluster allocation, the scheduler gives all-all jobs a higher priority than master-slave jobs; in contrast, during the second iteration, for job co-allocation, the scheduler gives master-slave jobs a higher priority than all-all jobs.

C) The first step is the same as in policy B; however, during the second iteration, for job co-allocation, the scheduler considers master-slave jobs only, i.e., all-all jobs are not considered for co-allocation.
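A minimal sketch (ours) of the two-step scheduling loop under these three policies follows; the single_cluster_allocate and co_allocate callbacks stand for the first-step allocator and the MBAS second step, each assumed to return True when it places a job:

    # Sketch of scheduling policies A, B and C for mixed job sets.

    def schedule(queue, policy, single_cluster_allocate, co_allocate):
        # Step 1: single-cluster allocation; policies B and C promote all-all jobs.
        step1 = (sorted(queue, key=lambda j: j["pattern"] != "all-all")
                 if policy in ("B", "C") else list(queue))
        leftovers = [j for j in step1 if not single_cluster_allocate(j)]
        # Step 2: co-allocation; B promotes master-slave jobs, C admits only them.
        if policy == "B":
            step2 = sorted(leftovers, key=lambda j: j["pattern"] != "master-slave")
        elif policy == "C":
            step2 = [j for j in leftovers if j["pattern"] == "master-slave"]
        else:
            step2 = leftovers
        waiting = [j for j in step2 if not co_allocate(j)]
        if policy == "C":   # all-all leftovers are never co-allocated; they wait
            waiting += [j for j in leftovers if j["pattern"] == "all-all"]
        return waiting

Note that sorted() is stable, so jobs of equal priority keep their FCFS order, matching the queue discipline described earlier.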

Fig. 9 Mixed jobs using policy A

Fig. 10 Mixed jobs using policy B

The average results of six runs of experiments comparing the performance of MBAS under the above three scheduling policies are presented in Figs. 9, 10, and 11. The job set used in these experiments was a mix of equal numbers of the two types of jobs, i.e., master-slave jobs and all-all jobs. From these experiments we observe the following:

In using policy A (Fig. 9) for MBAS job co-allocation, setting δ = 0.8 instead of δ = 0 shows an improvement in SU, and some improvement in AWRT and MS when BWPP has small values.

In comparing policy B (Fig. 10) to policy A (Fig. 9), the improvements in job performance and system utilization were more consistent, especially when setting δ = 0.8 instead of δ = 0. However, as a job's BWPP increased, the improvement in MS was not as consistent, i.e., when BWPP ≥ 1.4 Mbps; moreover, the standard deviations were large for those experiments that had the worst performance.

With policy C (Fig. 11) it was clear that job performance and system utilization were both improved when compared to policies A and B. Moreover, a less strict setting of the two threshold values, namely (λ = 1.0, δ = 0), achieved the best improvement among the three threshold settings in most cases. Further, the standard deviations were relatively small and very consistent, which means that the experimental results were relatively stable. Since policy C only allows master-slave jobs to be considered for job co-allocation, and master-slave jobs have relatively less intensive communications, the thresholds controlling the link saturation level and job splitting during co-allocation can be set with less restriction to get better results in both job performance and system utilization.

To further examine the benefit of using policy C, we also applied it to job sets containing jobs with different ratios of the two types of communication patterns. Figure 12 presents the results of the experiments on four job sets with the threshold setting (λ = 1.0, δ = 0). The four job sets had 10%, 30%, 50%, and 100% master-slave jobs, respectively. The results show that with policy C, performance and system utilization improved on all four job sets, albeit at different levels. Note that even with only 10% of the jobs classed as master-slave, both the average weighted response time and the makespan were reduced by 3%–10% and system utilization improved by 5%–15% on average. Moreover, for a job set with more master-slave jobs, the level of improvement was greater.

Fig. 11 Mixed jobs using policy C

Fig. 12 Experiments on four mixed job sets using policy C with (λ = 1.0, δ = 0)

6 Summary and conclusions

HPC clusters have become more and more popular in parallel computing for solving larger and more complex problems in various areas. To satisfy ever-increasing computing demands and to improve resource utilization across organizations, sharing HPC clusters has become a promising trend. In accomplishing the sharing of clusters, resource management for such multi-cluster grids is one of the significant challenges.

This paper introduced the Maximum Bandwidth Adjacent cluster Set (MBAS) job co-allocation strategy, which allows jobs to be split and run simultaneously across multiple clusters. In order to avoid the slowdown caused by inter-cluster communications, previous research assumed that the bandwidth requirements for a job's inter-process communications were provided in advance, which is likely difficult to do in practice. Instead of assuming this, the MBAS job co-allocation strategy attempts to reduce the impact of inter-cluster communication by allocating a job so as to get the maximum available bandwidth for the possible inter-cluster communication of the job.

The performance study demonstrates that co-allocation can be positive, improving system utilization and in many cases improving performance, as measured, say, by average weighted response time. More specifically, adjusting the two threshold values during MBAS co-allocation, to control link saturation and job splitting, can reduce turnaround times and increase overall system utilization at the same time. In comparing the MBAS approach with the MCS approach, the experiments show that MBAS has some advantages over MCS, especially when the system has some links with very limited available capacities. This is because MCS does not consider the bandwidth available for the possible inter-cluster communication when doing job allocation. The experiments only compared the two approaches on a small network to ensure that both approaches would be applicable; MCS would need to consider routing in a larger system and would likely become more complex. Since MBAS only considers co-allocation of a job to an adjacent cluster set, it can be utilized in much larger systems. Clearly, the performance of MBAS in larger grids, and the inclusion of jobs requiring more processors than those in any single cluster, are additional areas of research.

The study has shown that the selection of the two thresholds has a significant impact on the effectiveness of co-allocation, especially when jobs have different communication patterns. In reality, information about the communication pattern of a job is likely obtainable, e.g., as an educated guess by the user as to whether a job's communication is primarily master-slave or all-all. It is also true that many users tend to submit the same type of job many times. Either users can provide the information about their jobs' communication patterns, or the system can reasonably learn such information from historical data collected from previous jobs. The experiments have shown that, using a scheduling policy sensitive to the types of communication patterns of jobs, co-allocation can be done more effectively, i.e., both job performance and system utilization can be improved, even for jobs having large communication requirements. However, the situation is more complicated in reality, as the mix of communication patterns can vary.
Further work is needed to understand how MBAS performs with different threshold settings and under different scheduling policies. We see further studies proceeding in several directions:

- There is a need to understand threshold settings and strategies in more detail, e.g., enabling the system to adjust the threshold values based on information learned from historical data.
- In conjunction with the above, there is a need to investigate better characterizations of job communication patterns. Our research has looked primarily at master-slave and all-all communication patterns; since MBAS allocates jobs based on the adjacent cluster set and available bandwidth, it is not limited to these two patterns. Further work is necessary to understand the performance of MBAS when jobs involve other types of communication patterns, such as a 2D-mesh nearest-neighbor pattern. One could also look at these patterns with more refined granularity, e.g., a job that is 80% master-slave and 20% all-all (see the sketch following this list). It would then be important to understand the relationship between such refined characterizations of communication and the threshold settings for MBAS.
- With the thresholds in MBAS, it also becomes possible to consider more adaptive strategies for co-allocation. For example, one could consider an adaptive threshold control system that dynamically adjusts the thresholds based on, say, the system state, job sizes and their communication patterns.
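As an illustration of the refined granularity mentioned above, one could, for instance, weight a simple per-pattern traffic estimate by the pattern mix. The counting below (process pairs that span clusters) is an illustrative assumption for a sketch, not a measured model from the paper, and the function name and parameters are hypothetical.

    def intercluster_traffic(chunks, ms_fraction, msg_rate=1.0):
        # chunks: processes placed on each cluster; chunks[0] is assumed to
        #         hold the master for the master-slave component of the job.
        # ms_fraction: fraction of traffic that is master-slave (e.g. 0.8);
        #              the remainder is treated as all-all.
        n = sum(chunks)
        # Master-slave: only slaves outside the master's cluster cross a link.
        ms = (n - chunks[0]) * msg_rate
        # All-all: every ordered process pair split across clusters crosses a link.
        aa = (n * (n - 1) - sum(k * (k - 1) for k in chunks)) * msg_rate
        return ms_fraction * ms + (1.0 - ms_fraction) * aa

For example, a 64-process job split 32+32 that is 80% master-slave and 20% all-all yields an estimate of 0.8 × 32 + 0.2 × 2048 ≈ 435 crossing messages, versus 2048 if treated as pure all-all, which suggests that pattern-aware threshold settings could permit splits that a pessimistic all-all assumption would forbid.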

Jinhui Qin received her B.Eng. from Chengdu University of Science and Technology and both her M.Sc. and Ph.D. in Computer Science from the University of Western Ontario. Her research interests are in distributed and high-performance computing systems, focusing on resource management, performance analysis and parallel application development. She is currently working on the Science Studio project, one of the Canadian Network-Enabled Platform (NEP) projects funded by CANARIE. The project is creating a distributed experiment management system that enables researchers to control and observe, from their desktops, all aspects of experiments that must be carried out at specialized laboratories throughout Canada.

Michael A. Bauer is a Professor of Computer Science at the University of Western Ontario. He served two terms as Chair of the Department and was subsequently the Associate Vice-President Information Technology. He was the founding Principal Investigator for SHARCNET, a multi-university high performance computing grid, and is currently its Associate Director. His Ph.D. in Computer Science is from the University of Toronto. His research interests include distributed computing, network management, and high performance computing networks. He has published over 200 refereed articles, has served on the organizing and program committees of numerous conferences, and has refereed for a variety of international journals. He is a member of the IEEE and the Association for Computing Machinery (ACM) and has served on various committees of both organizations.