QoS-based Scheduling for Task Management in Grid Computing

Xiaohong Huang 1, Maode Ma 2, Yan Ma 1

Abstract--Due to the heterogeneity, complexity, and autonomy of widespread Grid resources, dynamic scheduling with QoS concerns in Grid computing is a challenge. This paper addresses the dynamic scheduling problem of parallel jobs with QoS demands in a Grid environment. We outline a framework for the solution, termed the QoS-based Scheduling Framework (QSF). QSF is based on a general adaptive scheduling heuristic that provides a fast and efficient approach to load balancing inside the system while taking QoS into account. It adapts to changes in the number and the quality of the available resources. In addition, a novel scheduling algorithm, Minimum Scheduling Hole First (MSHF), is proposed to find a suitable match between the eligible jobs and the available hosts. Extensive experimental studies have been conducted to verify the effectiveness and the performance of the scheduling mechanisms. The experimental results show that integrating the proposed MSHF scheduling algorithm into the QSF scheduler produces a significant performance gain.

Keywords--Grid Computing, Scheduling, Quality of Service (QoS)

I. INTRODUCTION

The current computational power demands and constraints of organizations have led to a new type of collaborative computing environment called Grid computing [1]. Grid computing is a new type of parallel and distributed computing. Currently, considerable research effort in Grid computing focuses on issues such as security, resource scheduling, and complex execution frameworks. However, many problems remain unsolved, in particular those related to the management of the processing load inside the Grid system. The job scheduler is an important component, responsible for load balancing in a distributed computing system.
This work was supported by the China Next Generation Internet (CNGI) project Large scale high performance grid application IPv6 based (CNGI-04-15-7A) and the Co-construction project of Beijing Committee of Education (SYS100130422).
Xiaohong Huang is with the School of Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing, China (email: huangxh@buptnet.edu.cn).
Maode Ma is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (email: Maode_Ma@pmail.ntu.edu.sg).
Yan Ma is with the School of Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing, China (email: mayan@bupt.edu.cn).

So far, many scheduling schemes have been proposed [2-5]. However, most of them address only the problem of how to develop a schedule for processing jobs on a set of heterogeneous hosts that minimizes the time required to execute the given jobs. Only recently has consideration been given to a fundamental problem in service-oriented architectures, namely the provision of QoS services. Providing QoS services is one of the primary goals of the Grid approach. Without QoS concerns, considerable performance degradation and lower efficiency of Grid computing could be incurred when a large number of jobs are issued to share Grid resources. One important aspect of jobs with high QoS requirements is that the execution of a job must be finished within a given amount of time. This time period is referred to as the job deadline. In a real-time Grid system, the principal task of the scheduler is to maximize the number of jobs that are completed before their respective deadlines. This paper proposes a semi-static, adaptive framework, termed the QoS-based Scheduling Framework (QSF). QSF has the following distinctive features.
1) QSF is a QoS-based scheduler, which considers applications and their QoS constraints to obtain a suitable match between applications and resources;
2) Communication overhead is considered as an important parameter in the scheduling, which makes QSF a good candidate for QoS services in a Grid environment, especially when the communication overhead is significant.

This paper also provides insight into the scheduling algorithms used with the schedulers. In order to map jobs to hosts, several static algorithms, namely the trivial and best-fit algorithms, have been proposed in [4] and [5]. However, those solutions neglect communication overhead, which may cause a large scheduling gap. The scheduling gap, or hole, is defined in this paper as the difference between the moment when a job starts execution on a host and the moment when the host becomes idle. In order to minimize scheduling holes, a novel scheduling algorithm, Minimum Scheduling Hole First (MSHF), is proposed to work with the proposed scheduler, QSF.

The organization of this paper is as follows. Section II presents the system model under study. Section III describes the proposed scheduler and the scheduling algorithms, together with their advantages. Section IV presents and discusses experimental results on the performance of the solution. Finally, Section V concludes the paper.

II. SYSTEM MODEL

The Grid computing system considered in this study is composed of two sets of hosts. Each set of hosts has different QoS characteristics. Moreover, the Grid environment is heterogeneous, and each host has different processing power. The Grid can be regarded as having one agent, i.e., the scheduler, with the other hosts acting as workstations. The agent in each set manages the execution of parallel jobs within the group. The agent decides how to distribute the jobs and assigns them to the hosts in proportion to the power of each host.

The application model, which describes the relationship among the jobs, exhibits precedence constraints among the parallel jobs. Some jobs cannot be executed until other jobs have been completed. Each job is ineligible until all of its precedence constraints have been satisfied. The application and Grid models in this paper can be implemented with most of the well-known Grid middleware, such as Globus [6], Condor [7] and Legion [8]. The assumptions of the models are as follows.

1) A set of N jobs $\{J_1, J_2, \ldots, J_N\}$ is to be scheduled.
2) The proportion of jobs with high QoS requirements is denoted by P. Therefore, the number of jobs with high QoS requirements is $P \cdot N$, while the number of jobs with low QoS requirements is $(1-P) \cdot N$.
3) For each job $J_i$ with high QoS requirements, the deadline associated with it is denoted by $D_i$.
4) Two sets of hosts are defined. One consists of high QoS resources and the other of low QoS resources. For each set, there are M hosts $\{H_1, H_2, \ldots, H_M\}$ on which the jobs will be executed.
5) An array etc is a system parameter that describes the estimated time for each job to be completed on each host. The array element $etc_{i,j}$ indicates the estimated time to complete the ith job on the jth host.
6) The communication overhead for transferring a job from host i to host j is denoted by $com_{i,j}$.
7) The expected completion time of job i on host j is defined as $CT_{i,j}$.
8) The precedence constraints among the jobs are specified by the connectivity, which is the probability that a job has a data dependency on a previous job.
9) Only one job can be executed on a given host at a time.

III. PROPOSED ALGORITHMS

The QSF framework is designed to handle the scheduling of jobs with precedence constraints for execution in a Grid system with QoS concerns.
It schedules jobs with high QoS requirements on resources providing high quality of service, avoiding the case in which jobs requesting high QoS wait for jobs with low QoS requirements. The QSF scheme works in four steps. First, the scheduling problem is formalized with precedence constraints. Second, the jobs ineligible for execution are filtered out of the scheduling, and the eligible jobs are divided into two sets, one with high QoS requirements and the other with low QoS requirements. Third, the eligible jobs, including those waiting to be executed and those already in execution, are scheduled according to the scheduling algorithms. Last, upon detection of a rescheduling event, the algorithm initiates a new formalization of the scheduling problem and repeats the scheduling process with new parameters. A rescheduling event is defined as the completion of one scheduled job. The QSF framework aims to select the optimal match between the pool of eligible jobs and the pool of available hosts. The mapping of jobs to hosts is obtained by QSF with a simple $O(M \cdot N)$ algorithm. In this paper, we propose two scheduling algorithms to be used with the QSF framework. One is the Extension of Best-fit Algorithm (EBFA) and the other is Minimum Scheduling Hole First (MSHF). In the design of both scheduling algorithms, communication overhead is considered as an important factor in the decision making.

A. QSF-EBFA Algorithm

EBFA evolved from the proposal in [5], which is known as the best-fit algorithm. The best-fit algorithm has been shown to be efficient for scheduling jobs in a Grid system when communication overhead is not considered. Our proposed EBFA algorithm extends the best-fit algorithm by taking the communication overhead into account. As defined above, the communication overhead for transferring a data structure from host i to host j, required when two jobs A and B have a precedence constraint, is denoted by $com_{i,j}$.
If both jobs are executed on the same host, the communication overhead is zero, which means that no time is required to transfer the data structure. Otherwise, the communication overhead is equal to $com_{i,j}$. We present the EBFA algorithm in Fig. 1. It schedules each eligible job in turn on the host where it can be completed the earliest, taking into account the communication overhead.

for each eligible job $J_i$ with high QoS requirements
    for each host $H_m$ satisfying the high QoS requirements
        if job i can be executed without waiting for data
            $CT_{i,m} = HAT_m + etc_{i,m}$, where $HAT_m$ indicates the host available time of host m
        else (job i needs data produced by job j, which is executed on host n)
            if m = n, $CT_{i,m} = HAT_m + etc_{i,m}$
            otherwise, $CT_{i,m} = HAT_m + etc_{i,m} + com_{n,m}$
    select the host that gives the job the minimum completion time, denoted by $m_s$
for each eligible job $J_i$ with low QoS requirements
    for each host $H_m$ satisfying the high or low QoS requirements
        if job i can be executed without waiting for data
            $CT_{i,m} = HAT_m + etc_{i,m}$
        else (job i needs data produced by job j, which is executed on host n)
            if m = n, $CT_{i,m} = HAT_m + etc_{i,m}$
            otherwise, $CT_{i,m} = HAT_m + etc_{i,m} + com_{n,m}$
    select the host that gives the job the minimum completion time, denoted by $m_s$

Fig. 1. Overview of QSF-EBFA algorithm

B. QSF-MSHF Algorithm
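The two host-selection rules used with QSF differ only in how they treat a host that would leave a scheduling hole. The Python sketch below illustrates this for a single job. It is a minimal sketch, not the authors' implementation: the names `Candidate`, `evaluate_hosts`, `select_host_ebfa` and `select_host_mshf` are hypothetical, and all numeric values are made up for illustration. The MSHF figure does not say what happens when every host would produce a hole, so falling back to the smallest hole in that case is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    """Outcome of placing one job on one host."""
    host: int
    completion_time: float  # CT_{i,m}
    scheduling_hole: float  # SH_{i,m}


def evaluate_hosts(
    etc_row: Sequence[float],          # etc_{i,m} for the job on each host m
    hat: Sequence[float],              # HAT_m, host available time of each host
    com: Sequence[Sequence[float]],    # com[n][m], overhead from host n to host m
    pred_host: Optional[int] = None,   # host n of the predecessor job, if any
) -> List[Candidate]:
    """Compute CT and SH of one job on every host."""
    candidates = []
    for m in range(len(hat)):
        if pred_host is None or pred_host == m:
            hole = 0.0                  # no data transfer is needed
        else:
            hole = com[pred_host][m]    # host m idles while the data arrives
        candidates.append(Candidate(m, hat[m] + etc_row[m] + hole, hole))
    return candidates


def select_host_ebfa(candidates: List[Candidate]) -> int:
    """EBFA rule: host with the minimum completion time."""
    return min(candidates, key=lambda c: c.completion_time).host


def select_host_mshf(candidates: List[Candidate]) -> int:
    """MSHF rule: among hosts with SH = 0, the minimum completion time.

    Assumption: if no host has SH = 0, the smallest hole is preferred.
    """
    best = min(candidates, key=lambda c: (c.scheduling_hole, c.completion_time))
    return best.host


# Hypothetical example: three hosts, predecessor job ran on host 0.
hat = [40.0, 20.0, 30.0]
etc_row = [50.0, 40.0, 60.0]
com = [[0.0, 15.0, 25.0], [15.0, 0.0, 10.0], [25.0, 10.0, 0.0]]
candidates = evaluate_hosts(etc_row, hat, com, pred_host=0)
print("EBFA picks host", select_host_ebfa(candidates))
print("MSHF picks host", select_host_mshf(candidates))
```

In this example EBFA picks host 1, which has the earliest completion time (75) but a hole of 15, while MSHF picks host 0, the host that ran the predecessor job. For a job with no predecessor both rules pick the same host.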

for each eligible job $J_i$ with high QoS requirements
    for each host $H_m$ satisfying the high QoS requirements
        if job i can be executed without waiting for data
            $CT_{i,m} = HAT_m + etc_{i,m}$ and $SH_{i,m} = 0$, where SH denotes the scheduling hole
        else (job i needs data produced by job j, which is executed on host n)
            if m = n, $CT_{i,m} = HAT_m + etc_{i,m}$ and $SH_{i,m} = 0$
            otherwise, $CT_{i,m} = HAT_m + etc_{i,m} + com_{n,m}$ and $SH_{i,m} = com_{n,m}$; $H_m$ is marked as an ineligible host for $J_i$
    among the eligible hosts, i.e., those that will not produce a scheduling hole ($SH_{i,m} = 0$), select the one that gives the job the minimum completion time, denoted by $m_s$
for each eligible job $J_i$ with low QoS requirements
    for each host $H_m$ satisfying the high or low QoS requirements
        if job i can be executed without waiting for data
            $CT_{i,m} = HAT_m + etc_{i,m}$ and $SH_{i,m} = 0$
        else (job i needs the data structure X produced by job j, which is executed on host n)
            if m = n, $CT_{i,m} = HAT_m + etc_{i,m}$ and $SH_{i,m} = 0$
            otherwise, $CT_{i,m} = HAT_m + etc_{i,m} + com_{n,m}$ and $SH_{i,m} = com_{n,m}$; $H_m$ is marked as an ineligible host for $J_i$
    among the eligible hosts, i.e., those that will not produce a scheduling hole ($SH_{i,m} = 0$), select the one that gives the job the minimum completion time, denoted by $m_s$

Fig. 2. Overview of QSF-MSHF algorithm

In this subsection, we propose a new scheduling algorithm, MSHF, which can largely reduce the time wasted in scheduling. The logic behind MSHF is to schedule jobs with precedence constraints on the same host. Combined with the MSHF algorithm, the QSF scheduler can achieve a greater improvement than with the EBFA algorithm. The motivation of this proposal is to improve the performance of the QSF-EBFA algorithm. The fundamental problem of the EBFA algorithm is that it may cause a large scheduling hole on the selected host. The scheduling hole is defined as the time difference between the moment when the job starts execution and the moment when the host becomes available.
For instance, if host i is selected to execute a job m, the job may have to start later because the data from job n is not yet available, even though host i may already have been available for a while by that time. The scheduling hole is an idle period of the host, which wastes computing resources. The large scheduling hole in EBFA is due to the way it selects hosts: the selection has no means of reducing the communication overhead, which can degrade performance. The MSHF algorithm remedies this performance loss. The algorithm is described in Fig. 2. The complexity of the MSHF algorithm can be evaluated from its operations. It has one job selection procedure and one host selection procedure. The job selection procedure selects the eligible job to be scheduled, and the host selection procedure selects a suitable host to execute the job. The host selection procedure simply loops over the M hosts, so its complexity is O(M) in the worst case. Similarly, the complexity of selecting one job is O(M) in the worst case. To schedule all the jobs, the algorithm takes $O(M \cdot N)$.

IV. PERFORMANCE ANALYSIS

We evaluate the performance of the QSF-EBFA and QSF-MSHF algorithms by comparing them with the GSTR algorithm. The performance is measured by the makespan and the tardy rate. The objectives of the simulations are twofold. First, we demonstrate the superior performance of the QSF framework by showing that the makespan and tardy rate can be reduced significantly by considering QoS factors. Second, we show that the proposed scheduling algorithms, when combined with the QSF framework, are able to reduce the communication overhead greatly, thereby achieving better performance.

A. Simulation Design

The simulation experiments are based on a set of MATLAB functions developed to model the general behavior of the Grid system and application models.
In order to make the Grid environment heterogeneous, two sets of hosts are designed, where each host has a different service time. One set of hosts consists of high QoS resources with 100 Mb/s bandwidth and the other set consists of low QoS resources with 10 Mb/s bandwidth. The execution time of each job on each host is assumed to be a random number drawn from a Gaussian distribution with mean L and variance $L \cdot H$, where L denotes the average execution time of each job and H denotes the degree of heterogeneity of the Grid environment. In order to simulate errors in the estimated execution time, three parameters, E, $E_1$ and $E_2$, are defined. For each host, $E_1$ is a random number drawn from a Gaussian distribution with mean 0 and variance E. $E_2$ is a random number drawn from a Gaussian distribution with mean $E_1$ and variance $|E_1|$. For each job's estimated execution time on this host, an error of $E_2$ is introduced. Jobs are randomly divided among three dependent batches.
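The sampling described above can be sketched in Python as follows. This is an illustrative sketch, not the MATLAB code used in the study: the function names are hypothetical, $E_2$ is assumed here to be drawn once per job and host, and the text does not say how the error is applied to the estimate, so the sketch stops at sampling. Note that the text specifies variances, whereas `random.gauss` expects a standard deviation, hence the square roots.

```python
import math
import random
from typing import List


def sample_execution_times(
    n_jobs: int, n_hosts: int, L: float, H: float, rng: random.Random
) -> List[List[float]]:
    """Execution time of job i on host m: Gaussian, mean L, variance L*H."""
    sigma = math.sqrt(L * H)  # gauss() takes a standard deviation
    return [[rng.gauss(L, sigma) for _ in range(n_hosts)] for _ in range(n_jobs)]


def sample_estimation_errors(
    n_jobs: int, n_hosts: int, E: float, rng: random.Random
) -> List[List[float]]:
    """Estimation error E_2 for every job on every host.

    Per host: E_1 ~ Gaussian(mean 0, variance E).
    Per job on that host (assumption): E_2 ~ Gaussian(mean E_1, variance |E_1|).
    """
    errors = [[0.0] * n_hosts for _ in range(n_jobs)]
    for m in range(n_hosts):
        e1 = rng.gauss(0.0, math.sqrt(E))
        sigma2 = math.sqrt(abs(e1))
        for i in range(n_jobs):
            errors[i][m] = rng.gauss(e1, sigma2)
    return errors


rng = random.Random(2024)  # fixed seed so that runs are repeatable
times = sample_execution_times(n_jobs=600, n_hosts=5, L=300.0, H=0.5, rng=rng)
errors = sample_estimation_errors(n_jobs=600, n_hosts=5, E=0.2, rng=rng)
print(len(times), "jobs x", len(times[0]), "hosts")
```

Because a Gaussian has unbounded support, a sampled execution time can in principle be negative. A full simulator would need to decide how to handle that case, which the text does not discuss.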

for M ranging from 5 to 25 in steps of 10
    for L from 60s to 300s in steps of 120s
        for H from 0.2 to 0.5 in steps of 0.15
            for N from 600 to 1800 in steps of 300
                build up different scenarios;
                simulate different algorithms on different scenarios;

Fig. 3. Four loops to build up simulation scenarios

The communication time between two hosts in the low QoS set and the communication time between two hosts in the high QoS set are each assumed to be a random number drawn from a Gaussian distribution, with mean $L \cdot R$ and $(1/10) \cdot L \cdot R$ and variance $L \cdot R \cdot H$ and $(1/10) \cdot L \cdot R \cdot H$, respectively, where R denotes the ratio of the communication overhead in the set of low QoS resources to the execution time of jobs. For jobs with high QoS requirements, the time constraint is expressed as a deadline. The deadline is assumed to be a random variable following a Gaussian distribution with mean $R_1 \cdot L$ and variance $R_1 \cdot L \cdot H$, where $R_1$ denotes the ratio of the average deadline to the execution time of jobs. In order to evaluate the performance with respect to the number of hosts (M), the average execution time of each job (L), the degree of heterogeneity of the Grid environment (H) and the number of jobs to be scheduled (N), four loops are used to generate different values of these four parameters and thus build up different scenarios, as shown in Fig. 3.

B. Simulation Results

We produce two groups of simulation results. The first group shows the performance comparison in terms of makespan. Fig. 4 shows the makespan for the different scenarios. In this experiment, R is set to 0.5, the proportion of jobs with high QoS requirements P is set to 50% and the error is set to 0.2. The scenarios are generated by the loops shown in Fig. 3. From the loop ranges, we obtain $N_M \times N_L \times N_H \times N_N = 3 \times 3 \times 3 \times 5 = 135$ different scenarios in the experiment.
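The scenario count follows directly from the four loops of Fig. 3 and can be checked with a short Python sketch. The variable names and the dictionary layout are illustrative choices, and numbering the scenarios from 1 in loop order is an assumption about how the scenario index is formed.

```python
from itertools import product

# Parameter values generated by the four loops of Fig. 3.
M_VALUES = [5, 15, 25]                     # number of hosts
L_VALUES = [60, 180, 300]                  # average execution time (s)
H_VALUES = [0.2, 0.35, 0.5]                # degree of heterogeneity
N_VALUES = [600, 900, 1200, 1500, 1800]    # number of jobs

# product() varies the last argument fastest, like nested loops with M outermost.
scenarios = [
    {"id": k, "M": M, "L": L, "H": H, "N": N}
    for k, (M, L, H, N) in enumerate(
        product(M_VALUES, L_VALUES, H_VALUES, N_VALUES), start=1
    )
]

print(len(scenarios))   # 3 * 3 * 3 * 5 scenarios
print(scenarios[0])     # first scenario: smallest value of every parameter
print(scenarios[45])    # scenario 46: first scenario with M = 15
```

With M as the outermost loop, scenarios 1 to 45 share M = 5, scenarios 46 to 90 share M = 15, and scenarios 91 to 135 share M = 25.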
From Fig. 4, we can see that the results are divided into three regions (1~45, 46~90, 91~135). According to the loops in Fig. 3, these three regions result from the loop over M, the number of hosts. The makespan decreases when M increases. Each region can in turn be divided into three sub-regions, which result from the loop over L, the mean execution time. When L increases, the execution time, and hence the makespan, increases. Each sub-region contains three groups of results, generated by the loop over H, the degree of heterogeneity. Similarly, within each group of results, the makespan increases with N, the number of jobs. From Fig. 4, we can conclude that, supported by the QSF framework, the QSF-EBFA and QSF-MSHF algorithms significantly outperform the GSTR algorithm. This is because QSF is able to schedule the jobs with high QoS requirements first, which avoids the case in which jobs with low QoS requirements occupy the high QoS resources while jobs with high QoS requirements have to wait even if the low QoS resources are idle. On the other hand, Fig. 4 shows that the performance of the MSHF algorithm is better than that of the EBFA algorithm when combined with the QSF framework. This is expected, since MSHF is able to reduce the scheduling holes.

Fig. 4. Makespan versus scenarios (x-axis: Scenarios; y-axis: Time (s))

Fig. 5. Makespan versus proportion of the jobs with high QoS requirements (x-axis: Proportion of the tasks with high QoS requirements (%); y-axis: Time (s))

Fig. 5 illustrates the makespan versus the proportion of jobs with high QoS requirements. For example, a proportion of 70% means that 70% of the jobs request high QoS resources, while 30% of the jobs have no QoS requirements.
In this experiment, M is set to 15, L is set to 300s, R is set to 0.5, the error is set to 0.2 and N is set to 1800. As shown in Fig. 5, the QSF framework works better than the GSTR scheme in terms of makespan, especially when half of the jobs have high QoS requirements. When the proportion is increased to 1, all the jobs request high QoS resources. In this case, QSF and GSTR achieve the same performance because no low QoS jobs compete for the resources. Among the three algorithms, the QSF-MSHF algorithm achieves the best performance. This is because the MSHF scheme is able to reduce the scheduling holes, which increases the utilization of the computing resources.

Fig. 6 shows the makespan versus R, i.e., the ratio of the communication overhead in the low QoS set of resources to the execution time of jobs. In this experiment, M is set to 5, L is set to 300s, N is set to 1800, H is set to 0.5, the error is set to 0.2, and the proportion of jobs with high QoS requirements is set to 50%. It is clear that the MSHF algorithm achieves a significant improvement in terms of makespan, particularly when the ratio is large. This is because the cost of data communication in a Grid system can significantly affect the scheduling and execution of jobs, especially when the communication overhead is comparable to the execution time. The MSHF algorithm aims to reduce the scheduling holes on the hosts. Moreover, the algorithm gives stable results when the ratio is increased. This is expected, since the communication overhead is reduced in the MSHF algorithm, so that it does not affect the performance of the scheduling. Similarly, it is shown that the QSF scheme works better than the GSTR scheme.

Fig. 6. Makespan versus ratio of communication overhead to execution time (x-axis: Ratio of communication overhead to execution time; y-axis: Time (s))

Fig. 7. Tardy rate versus scenarios (x-axis: Scenarios; y-axis: Tardy rate (%))

Fig. 8. Tardy rate versus proportion of the jobs with high QoS requirements (x-axis: Proportion of the jobs with high QoS requirements (%); y-axis: Tardy rate (%))

The second group of simulation results shows the performance comparison in terms of tardy rate. Fig. 7 presents the tardy rate for the different scenarios.
In this set of experiments, we assume that R is 0.5, $R_1$ is 10, the proportion of jobs with high QoS requirements P is 50% and the error is set to 0.2. The scenarios are built up by the loops shown in Fig. 3. As shown in the figure, there are 135 different scenarios. As in Fig. 4, the results are divided into three regions (1~45, 46~90, 91~135), which are generated by the loop over M. The regions are further divided into three sub-regions, which are generated by the loop over L. There are three groups of results in each sub-region, which are generated by the loop over H. Within each group of results, the tardy rate increases, which is due to the loop over N. As Fig. 7 shows, the QSF-EBFA and QSF-MSHF algorithms outperform the GSTR algorithm significantly. This is because the QSF framework is able to schedule jobs by taking QoS factors into account, whereas the GSTR scheme schedules jobs with high or low QoS requirements with the same priority. Therefore, under GSTR the jobs with high QoS requirements may be blocked by those with low QoS requirements, leading to a high tardy rate.

Fig. 9. Tardy rate versus ratio of average deadline to execution time (x-axis: Ratio of average deadline to execution time (%); y-axis: Tardy rate (%))

Fig. 8 compares the tardy rate for varying proportions of jobs with high QoS requirements. In this experiment, M is set to 15, L is set to 300s, R is set to 0.5, $R_1$ is 40, the error is set to 0.2, D is set to 3000s and N is set to 1800. As shown in the figure, the tardy rate of all three algorithms increases with the proportion of jobs with high QoS requirements. It is clear that QSF outperforms GSTR in terms of tardy rate. This is because the QSF algorithms can arrange
the jobs with high QoS requirements to be scheduled faster than the jobs with low QoS requirements, resulting in a low tardy rate. Also, the QSF-MSHF algorithm is better than QSF-EBFA. This is because the MSHF algorithm is able to reduce the scheduling holes, leading to better performance.

Fig. 9 shows the relationship between the tardy rate and the average deadline. We assume that M is equal to 5, L is set to 300s, R is set to 0.5, the error is set to 0.2, P is set to 50% and N is set to 600. We can clearly see that the tardy rate of all three algorithms decreases when the average deadline increases. As the figure shows, the QSF scheme with its associated algorithms achieves a significant improvement in terms of tardy rate. This is because the QSF scheme treats the QoS factors as important factors in the scheduling.

V. CONCLUSION

In this paper, we introduce a novel scheduler, QSF, which is designed to handle jobs with high QoS requirements as well as those with low QoS requirements in Grid computing environments. With the QSF scheme, the makespan and tardy rate can be reduced significantly. In addition, a novel scheduling algorithm, MSHF, is proposed for use with the QSF scheme. The MSHF algorithm reduces the scheduling holes by taking the communication overhead into account. A large scheduling hole lowers performance by increasing the makespan. Numerical studies have been conducted by extensive simulations. As the results show, the QSF framework can reduce the makespan and tardy rate significantly. In addition, the performance is evaluated with respect to the proportion of high QoS jobs, the ratio of communication overhead to execution time, and the ratio of average deadline to execution time. The studies have shown that QSF-MSHF scheduling achieves the best performance because it is able to take QoS factors into account and to reduce the scheduling holes at the same time.

REFERENCES

[1] I. Foster, The Grid: A New Infrastructure for 21st Century Science, Physics Today, Vol. 55, No. 2, pp. 42-47, 2002.
[2] D. F. Baca, Allocation modules to processors in a distributed system, IEEE Transactions on Software Engineering, Vol. 15, No. 11, pp. 1427-1436, 1989.
[3] C. A. Bohn and G. B. Lamont, Load balancing for heterogeneous clusters of PCs, Future Generation Computer Systems, Vol. 18, pp. 389-400, 2002.
[4] B. R. Carter, D. W. Watson, F. R. F. Freund, K. Elaine, M. Francesca and H. J. Siegel, Generational scheduling for dynamic job management in heterogeneous computing systems, Journal of Information Sciences, Vol. 106, pp. 219-236, 1998.
[5] O. Lucchese, F.; Huerta Yero, EJ; Sambatti, FS, An adaptive scheduler for Grids, Journal of Grid Computing, Vol. 4, No. 1, pp. 1-17, 2006.
[6] I. Foster, C. Kesselman, J. Nick and S. Tuecke, Grid services for distributed system integration, Computer, Vol. 35, No. 6, 2002.
[7] J. H. Epenema, M. Livny, R. Van Dantzing, X. Evers and J. Pruyne, A worldwide flock of condors: Load sharing among workstations clusters, Future Generation Computer Systems, Vol. 12, pp. 53-65, 1996.
[8] A. S. Grimshaw and W. A. Wulf, Legion: A view from 50,000 feet, in Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, Los Alamitos, California, IEEE Computer Society Press, August 1996.