CHAPTER 6 DYNAMIC SERVICE LEVEL AGREEMENT FOR GRID RESOURCE ALLOCATION

6.1 INTRODUCTION

In a dynamic and heterogeneous Grid environment, providing guaranteed quality of service for a user's job is fundamentally important. The quality of service delivered to the client depends crucially on selecting an appropriate subset of the available resources. To accomplish this goal, resource selection cannot be a simple matchmaking mechanism. Instead, resources that satisfy the application requirements of trust, computation capacity and network capacity are selected from the available grid resources, and dynamic service level agreements are established with the selected resources to provide guaranteed service to the client. The dynamic service level agreement allows the resource capacity to be extended at run time according to the status of job execution, and it ensures successful completion of jobs by avoiding the run-time failures and unnecessary job migrations common in certain Grid environments. Grid resource owners retain complete control over the resources they share on the Grid, so the policy restrictions imposed by the provider must be considered during resource selection. Service Level Agreements are established between resource providers and clients only after the client accepts the resource access and usage policies.

6.2 RESOURCE SELECTION STRATEGIES IN COMPUTATIONAL GRID

In a computational grid, the resource selection and allocation mechanisms are incorporated at the meta-scheduler level to coordinate the sharing of resources among multiple clients. Whether a single resource or multiple resources are selected for an application depends on the computation capacity available at the time of job submission. Existing resource selection mechanisms merely match the job requirements to the available resources and do not consider the access restriction policies of the resources. Therefore, an integrated resource selection framework is developed that selects appropriate resources with regard to security, requirement satisfiability, resource availability, load and access policy restrictions.

Computational grids enable sharing of resources and deliver various qualities of service to meet complex user demands. The QoS requirements of a client can be functional requirements, such as task and data dependencies, and/or non-functional requirements, such as time deadline, budget restrictions and guaranteed bandwidth for data transfer (Brandic et al 2008). The grid supports execution of an application on multiple resources in parallel to reduce the completion time of the job. Hence, the resource selection strategy has to decide on the optimal number of resources required for job execution, as over-provisioning increases the complexity of execution and reduces resource utilization. The selected resources are evaluated against the security policies of the user and the resource provider, the cost constraints and the time deadline specified by the user. Jobs are submitted to these resources after an SLA has been negotiated and established between the client and the resource providers. Hence, the problem of resource selection is studied in three phases, which are discussed in detail in the following sections.
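The multi-criteria selection described in this section can be illustrated with a minimal sketch. It assumes that trust is a score in [0, 1], that capacity is expressed in MIPS, and that the site access policy is a set of permitted client domains. The class names, field names and the function `select_candidates` are hypothetical and chosen only for illustration; they are not part of the framework described here.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # All fields are illustrative assumptions, not the chapter's data model.
    name: str
    trust: float                 # trust score, assumed to lie in [0, 1]
    free_mips: float             # computation capacity currently available
    bandwidth_mbps: float        # network capacity currently available
    allowed_domains: frozenset   # assumed form of the site access policy


@dataclass
class JobRequest:
    client_domain: str
    min_trust: float
    required_mips: float
    required_bandwidth_mbps: float


def select_candidates(job, resources):
    """Keep only resources that pass the trust, capacity, network and policy checks."""
    return [
        r for r in resources
        if r.trust >= job.min_trust
        and r.free_mips >= job.required_mips
        and r.bandwidth_mbps >= job.required_bandwidth_mbps
        and job.client_domain in r.allowed_domains
    ]
```

A resource that meets every capacity requirement but whose policy excludes the client's domain is dropped by the last condition, which is the case plain matchmaking would miss.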

6.2.1 Quality of Service in Resource Selection

In a grid, the client submits a job request along with its QoS requirements. The qualities of service are, in general, the non-functional requirements that specify availability, response time, reliability, security and network latency. Provisioning QoS in grid computing is essential if grids are to be employed in commercial domains as well as in scientific research. The Virtual Organization (VO) is highly dynamic, and the participating resources can join or leave the VO at any time. Therefore, users need to know that they are interacting with a legitimate resource and that their data and computations are protected against malicious activities (Humphrey and Thompson 2002). The required level of security is provided by integrating trust into resource selection, which assures basic entry-level security for the client jobs. Trust relationships are established between the client and the resource provider by considering the domain information, the type of activity performed on the resource and the number of jobs successfully executed.

The resource access policies are defined for the security of the resource provider. A resource that satisfies all the requirements of the job but does not grant access is not available for job execution. Hence, site-level policies are considered in the selection of resources. Run-time failures, reallocation and job migrations are avoided by evaluating the availability of computation and network resources at the time of job submission. Resource selection based on user QoS provides best-effort service for job completion. To satisfy the user, however, it is necessary to provide guaranteed service by establishing an SLA between the participating resource providers and clients.

6.2.2 Service Level Agreements in Resource Selection

Service Level Agreements guarantee the service quality delivered to the user as committed by the resource provider. The SLA specifies the

Service Level Objectives (SLO) that must be met by the resource provider. In grid environments, SLA establishment has to be supported at the meta-scheduler, which enforces restrictions on resource usage and penalizes the user or the resource provider for violations of the agreement (Balakrishnan et al 2008). SLA management should be a dynamic process in which agreements are negotiated based on the requirements of the client and the capability of the resources under the present operating conditions. The phases of SLA management are SLA negotiation, SLA creation, SLA monitoring and SLA violation handling. In the negotiation phase, the QoS properties of the resources are described, which enables the meta-scheduler to select suitable resources according to the QoS requirements of the client. An agreement stating the service terms and guarantee terms is created between the client and the resource provider, and the tasks are then submitted. SLA monitoring is performed by a third party, the agent, through audit request and response messages. Any violation of the service and/or guarantee terms is recorded, and penalties are imposed on the entity that violates the SLA. The SLA is an integral part of resource allocation in a grid, as the resource providers and users belong to different domains bound by different access and usage policies.

6.2.3 Policy Based Resource Selection

The site-level resource access policies and resource usage policies are stored in the policy repository of every VO. In a computational grid, the resources of a site are shared by local jobs and by remote jobs submitted by grid users. If the client and the resource provider are from different administrative domains, the client has to satisfy the site-level policies of the resource. At the meta-scheduler level, the resource selection agent considers the site-level policies for resource allocation. The resources that match the user requirements of trust, computation capacity and network capacity are

checked against the site-level and usage policies stated by the resource providers. The local resource policies are handled by the local resource management systems. The access policy expresses the authentication and authorization policies of the resource, whereas the operational policies specify the amount of resources contributed to the grid. These policies do not reveal any private information about the access control policies, thus ensuring privacy in heterogeneous environments. The resource selection strategy presented in this work integrates trust, user QoS parameters and site-level policies; it thereby establishes a dynamic SLA for guaranteed service and eliminates the run-time failures that occur due to incompatible user-resource pairs.

6.3 DYNAMIC SERVICE LEVEL AGREEMENTS

The critical issue in grid environments is to select the optimum number of appropriate resources and to bind these resources to the client according to the SLA. Existing work assumes that the parties know the SLA negotiation protocols and the SLA templates before entering the negotiation. This assumption does not fit a grid environment, where clients and resource providers meet each other dynamically and on demand. Hence, there is a need to establish the SLA by specifying the service terms and guarantee terms for every client job.

6.3.1 Establishing Dynamic SLA

In a dynamic service environment, SLA management should be a dynamic process composed of SLA negotiation, definition, auditing, notification of violations and reactive actions if non-compliance is detected (Barbosa et al 2006). The SLA provides the user with the agreed resource capabilities, and it should focus on guaranteed service, which depends mainly on the successful completion of the submitted job.
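The auditing step of this process can be sketched as a comparison of measured values against the guarantee terms of an agreement. This is a minimal illustration under stated assumptions: each guarantee term is represented as an upper or lower bound, and a term that is not reported is treated as violated. The function name `audit` and the term names in the usage note are hypothetical.

```python
def audit(guarantees, measured):
    """Return the names of guarantee terms that the measured values violate.

    guarantees: dict mapping a term name to (kind, bound), where kind is
                "max" (measured value must not exceed bound) or
                "min" (measured value must not fall below bound).
    measured:   dict mapping a term name to the observed value.
    """
    violations = []
    for term, (kind, bound) in guarantees.items():
        value = measured.get(term)
        if value is None:
            # Assumption: an unreported term counts as a violation.
            violations.append(term)
        elif kind == "max" and value > bound:
            violations.append(term)
        elif kind == "min" and value < bound:
            violations.append(term)
    return violations
```

For instance, with a guarantee of `{"response_time_s": ("max", 2.0)}` and a measurement of 2.5 seconds, the term is reported as violated, which would trigger the notification and reactive actions mentioned above.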

In this work, the idea of a Dynamic SLA (DSLA) is presented, in which clients can extend resource usage over a time period based on the forecasted load of the resource; this improves the success rate of the submitted jobs. To establish a DSLA, templates are exchanged, service negotiations are performed and, finally, agreements are established. Through the dynamic SLA, the resources can be used over an extended period within the same agreement.

6.3.2 Template Creation

Every client application is different, and the job requirements vary accordingly. The application may be either compute intensive or data intensive. The completion time of the job is decided by the amount of resources allotted and the time period over which these resources are allotted for execution of the tasks. If the computation power required by an application cannot be provided by a single resource, the job is divided into multiple tasks according to the capacity of the available resources and assigned to multiple resources. Every job or task set (TS) is allowed to execute on a resource after the SLO have been negotiated and contracts have been signed with the resources that commit to provide service satisfying the SLO.

To minimize the time involved in service negotiations, the proposed work creates a template to fetch the required information for the job. The templates are exchanged between the potential resource providers and the resource selection agent, and SLO are negotiated only with those resources that agree to provide information according to the template structure. The template is not a static structure; the template elements are decided for every application by the resource selection agent based on the application requirements. Compute-intensive jobs are executed in a computational grid, and hence the templates are created to suit these jobs. The basic structure of

a template contains elements relating to domain identification, site-level security policy, trust level, resource usage, response time, support for extension and negotiation protocol. The templates are exchanged, and negotiations are initiated with the resources that accept the template; resources that are not willing to provide the required information are excluded. Negotiating the SLA only with these selected resources improves the service quality delivered to the client and significantly reduces the time spent on negotiations between the entities.

6.3.3 Extending Resource Usage

A service level agreement that supports extended usage based on the present job execution conditions is called a Dynamic SLA. In a DSLA, the SLA expiration time is not specified, because the agreement provides a mechanism to extend the usage of the resource on which the job is currently executing. The extension of resource usage is made possible by forecasting the load on the resource for the future nth time period. The forecasted load determines the available computation power of the resource for the next time interval, allowing the resource to provide its capabilities beyond the time deadline specified in the SLA. The resources that agree to deliver extended service are determined during the initial phase of template exchange. The SLA expires and terminates when the execution completes successfully, thereby enhancing the reliability of the allotted resource. The only limitation of the DSLA is that extension is not possible if the resources have been reserved in advance for the specified time duration.

The proposed DSLA method is more flexible in that it avoids renegotiation, which compromises the stated requirements and negotiates new requirements with the current resource. If renegotiation is not successful, new resources have to be selected and a new SLA has to be agreed

upon for job execution. These approaches waste considerable time in relocating jobs and lead to performance degradation of grid jobs. The Dynamic SLA offers a promising solution to these problems and provides guaranteed service without unnecessary job migrations or run-time failures due to forced terminations.

6.4 GUARANTEED SERVICE RESOURCE SELECTION STRATEGY

In next-generation computing systems, user applications increasingly rely on grids for their execution environments rather than on local resources administered by their own organizations. Several challenging issues have to be addressed beyond the basic resource management functions of locating a resource and assigning jobs. These include cross-domain trust, management of diverse resource policies, concurrent allocation of multiple resources, prediction of workload and commitments to provide the required quality of service.

The computational grid architecture is presented as a three-tier architecture. The middle layer is the resource selection and allocation layer, which provides controlled access to the appropriate resources for execution of the submitted application. The Guaranteed Service Resource Selection (GSRS) strategy provides a unified framework for selecting appropriate resources by integrating the quantitative trust of a resource provider with its computation and network capabilities, enabling on-demand access through workload prediction. Dynamic service level agreements are established with the selected resources, which guarantees the required quality of service for grid jobs. The resource selection framework based on the guaranteed service resource selection strategy is shown in Figure 6.1.

Figure 6.1 Architecture for guaranteed service resource selection strategy (components: client applications in the Virtual Organization, Computational Resource Management, Trust Model, Network Resource Management, SLA Management, Job Execution Management, Resource Pool and Trust Information Base)

6.4.1 Quantitative Execution Trust and User Quality of Service

Trust has been recognized as an important factor in the selection of appropriate resources in a grid. Trust is evaluated as a combination of qualitative trust and quantitative trust. The quantitative execution trust of a resource is evaluated from the components of the Subjective Trust and Objective Trust of the resource. The qualitative trust is expressed as a function of user satisfaction and is evaluated from the actual experiences of the user in transactions with the selected resource. The Service Satisfaction ($SS_R$) for the allotted resource is calculated as given in Equation (6.1),

$$SS_R = STJ \times JS \times SSLO_R \qquad (6.1)$$

where $STJ$ represents the Status of completion of the Job, $JS$ the Job Size and $SSLO_R$ the Satisfied Service Level Objectives of the resource R. The Satisfied Service Level Objectives are determined as the ratio of the number of attained SLO to the number of committed SLO. The Service Satisfaction is the direct feedback about the allocated resource. Hence, the qualitative trust value is used as a factor for trust update in the Direct Trust Table of the resource and in the Overall Trust Repository (OTR) of the VO.

User Quality of Service

The satisfaction level of the user depends on the level of QoS achieved for the client application. For every job, the user specifies the minimum requirements of computation power, network speed and storage space, together with QoS parameters such as time deadline, response time and cost. Therefore, the resources that satisfy both the application requirements and the user QoS parameters are selected for allocation to user jobs.
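Equation (6.1) can be evaluated with a short sketch. The encodings are assumptions made only for this illustration: STJ is taken as 1.0 for a completed job and 0.0 otherwise, and JS is taken as a job size normalized to [0, 1]. The chapter does not state these scales, and the function names are hypothetical.

```python
def satisfied_slo_ratio(attained_slo, committed_slo):
    """SSLO_R: ratio of attained SLO to committed SLO."""
    if committed_slo <= 0:
        raise ValueError("at least one SLO must be committed")
    return attained_slo / committed_slo


def service_satisfaction(stj, js, attained_slo, committed_slo):
    """SS_R = STJ * JS * SSLO_R, following Equation (6.1).

    stj: job completion status (assumed 1.0 = completed, 0.0 = not completed)
    js:  job size (assumed normalized to [0, 1])
    """
    return stj * js * satisfied_slo_ratio(attained_slo, committed_slo)
```

Under these assumptions, a completed job of normalized size 0.5 that attains 3 of 4 committed SLO yields a Service Satisfaction of 0.375, while any job that does not complete yields 0 regardless of the other factors.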

To avoid unnecessary job migration due to insufficient resource capability, the proposed model determines the total computation capability that can be offered by the resource. The Available Computation Power (ACP) of a resource is calculated as the difference between the Overall Compute Power of the resource and the Occupied Compute Power of the resource at the instant of submission. The second QoS parameter important in resource selection is the Expected Completion Time of the job, which is calculated prior to job submission. Selecting resources based on QoS parameters submits the job to the resource that best fits the application requirements and achieves a high level of user satisfaction.

6.4.2 Optimal Resource Selection with Network Resource Management

In a computational grid, a client application is allowed to execute in parallel on multiple remote resources if it requires extensive computation power. A known strategy for efficient execution of a large application is to partition the application into multiple independent tasks and schedule those tasks over a set of available processors (Saleh et al 2012). It is important to choose the optimum number of resources for a job, as too many resources increase the complexity of job scheduling and the communication time for staging the necessary input data on these resources. A client application is modeled as a set of tasks, and the tasks are grouped according to the Resource Capacity for the specified time deadline ($RC_{TD}$). Every task set $k$ has $m$ independent tasks that are grouped according to the RCP of the tasks and the Resource Capacity, as given in Equation (4.1),

$$TS_k = \sum_{j=1}^{m} RCP_j \qquad (4.1)$$

where $\sum_{j=1}^{m} RCP_j \le RC_{TD}$ and $m \le n$.
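The ACP calculation and the grouping constraint of Equation (4.1) can be sketched as follows. The grouping routine is a simple sequential (next-fit) packing chosen for illustration; the chapter does not specify the grouping algorithm, and a single $RC_{TD}$ value is assumed for every task set. The function names are hypothetical.

```python
def available_compute_power(overall_power, occupied_power):
    """ACP = Overall Compute Power - Occupied Compute Power at submission time."""
    return overall_power - occupied_power


def group_tasks(task_rcps, rc_td):
    """Group tasks, in order, into task sets whose summed RCP stays within rc_td.

    task_rcps: the RCP value of each task, in submission order
    rc_td:     Resource Capacity for the specified time deadline
    """
    task_sets, current, load = [], [], 0.0
    for rcp in task_rcps:
        if rcp > rc_td:
            raise ValueError("a single task exceeds the resource capacity")
        if load + rcp > rc_td:
            # Current task set is full: close it and start a new one.
            task_sets.append(current)
            current, load = [], 0.0
        current.append(rcp)
        load += rcp
    if current:
        task_sets.append(current)
    return task_sets
```

With task RCP values of 40, 30, 50, 20 and 60 and a capacity of 100, this sketch produces three task sets, each of which satisfies the constraint that the summed RCP does not exceed $RC_{TD}$.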

The optimum number of suitable resources is selected, and the task sets are dispatched to and scheduled on these resources. This method increases the acceptance rate of the jobs submitted to the grid and also improves resource utilization.

The distributed resources of a grid are connected through a WAN. The job completion time depends not only on the computation capacity but also on the network bandwidth at the time of job submission. The Network Weather Service (Wolski 2003) forecasts the network load and provides an approximate measurement of the network load for future job scheduling. In practice, the network bandwidth is shared by multiple entities, the speed of the link is relatively stable only for a short period of time, and forecasts do not provide accurate network information (Marchal et al 2006). To utilize the bandwidth of the connected resource efficiently, a TCP striping mechanism is employed, and an optimal number of stripes is opened in parallel between the communicating entities. The input data is sliced into equal-sized blocks and transferred to the selected resources. To ensure minimum communication time and to avoid delay in staging the necessary input data for job execution, the data is transferred in parallel along multiple stripes, the number of which is determined by the Optimal Stripe Size ($OPT_S$), defined in terms of RLC and ALC in Equation (4.13).

The network resource management component manages the network resources according to the present environmental conditions of the grid and utilizes the available bandwidth to its maximum potential in high-latency networks.
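The slicing of input data into blocks and their distribution across parallel stripes can be sketched as below. The stripe count is taken as an input rather than computed from Equation (4.13), blocks are assigned round-robin, and the final block may be shorter than the others when the data length is not a multiple of the block size. Both the round-robin assignment and the function names are assumptions made for illustration.

```python
def slice_into_blocks(data, block_size):
    """Split the input data into consecutive blocks of block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_to_stripes(blocks, num_stripes):
    """Distribute blocks round-robin over num_stripes parallel stripes."""
    if num_stripes <= 0:
        raise ValueError("num_stripes must be positive")
    stripes = [[] for _ in range(num_stripes)]
    for index, block in enumerate(blocks):
        stripes[index % num_stripes].append(block)
    return stripes
```

In a real transfer each stripe would be carried over its own TCP connection, and the receiver would reorder the blocks by index; that part is omitted here.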

6.4.3 Load Forecast and Managing SLA

Resource management in a computational grid combines computational resource management and network resource management. Computational resource management considers the uncertainty of the load conditions that prevail in the grid due to local job execution and burst load from remote nodes. The pattern of load on resources varies from time to time, and hence predicting the workload characteristics is of prime concern in the resource selection process. Jobs executed on multiple resources require the job start time to be synchronized on all resources so that the job is completed within the time limit. The current information on Resource Capacity ($RC_{TD}$) makes it possible to predict the workload on a resource at the nth time instant, and the job is submitted at that time to synchronize the start time of the tasks. The Available Computation Power at the forecasted nth time interval ($ACP_{FT}$) is evaluated as given in Equation (5.7),

$$ACP_{FT} = RC_{TD} - FL_{n-1} \qquad (5.7)$$

where $FL_{n-1}$ is the forecasted load at the (n-1)th time instant. The load is predicted based on the statistical properties of the load on the resource. The state of the resource changes dynamically, and hence short-term forecasts offer good prediction accuracy and are useful for making appropriate scheduling decisions.

In the Guaranteed Service Resource Selection framework, workload prediction plays a key role in two phases. In the pre-execution phase, selecting appropriate resources through the workload prediction strategy improves the acceptance rate and success rate of jobs, as job failures due to uncertain load and insufficient computation power are avoided. In the second

phase, the execution stage, the jobs allocated to the resources are monitored for the committed SLO and the status of job completion. A job currently in execution that is not completed within the agreed time is allowed to extend its execution on the same resource; the availability of the resource for the future time period is determined using the workload prediction strategy.

The Dynamic Service Level Agreements presented in this work have the capability to extend the usage of computational and network resources to complete the execution of the submitted job. The resources are requested to extend the duration of service, and the extension is possible if computation power is available, as determined by the predicted load on the resource. The limitation of the DSLA is that extended usage is not possible if the resources have been reserved in advance for another user's job. The DSLA achieves high reliability for the users of the grid, as jobs are completed on the allotted resource without abrupt termination or job migration.

6.5 PERFORMANCE EVALUATION

The performance of the GSRS approach is evaluated based on the following performance metrics.

Resource Utilization Rate: the percentage of the utilized resource power to the available resource power for the resources present in the grid.

Job Completion Time: the total time taken by the resources for successful completion of the job, including the job execution time, communication time and SLA creation time.

Job Success Rate: the ratio of the number of jobs successfully completed to the total number of jobs submitted to the grid.

6.5.1 Guaranteed Service Resource Selection Strategy

The performance of the proposed Guaranteed Service Resource Selection (GSRS) strategy is analyzed and discussed in this section. The simulation is based on the grid simulation toolkit GridSim Toolkit 4.0. The capabilities of GridSim have been expanded to support agreements during resource allocation in order to provide guaranteed service to client applications. Each agreement has an associated Agreement ID that is used to keep track of its SLO and violations. For simulation purposes, five heterogeneous resources were considered, differing in characteristics such as the number of Processing Elements (PE) in a machine, the MIPS rating of a PE, the type of operating system, the site security policy and the cost. The simulation is carried out for multiple client applications that submit jobs of different types and sizes. Compute-intensive jobs are considered, and a job can be divided into multiple tasks of varying size according to the resource capacity and resource availability. The performance metrics are evaluated for the applications submitted to the grid. The details of the resources used in the simulation are shown in Table 3.12.

The client requests resources and submits the job for execution, stating the requirements of security, time limit and cost limit. The GSRS strategy selects single or multiple resources from the Virtual Organization that satisfy the trust, site security policy, available resource capability and time deadline requirements. In the present work, the widely employed GridWay meta-scheduler has been simulated, and the performance of the GSRS strategy is compared with that of the GridWay meta-scheduler.
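The three performance metrics defined at the start of Section 6.5 can be written as simple functions. This is a minimal sketch; the function and parameter names are hypothetical, and the success rate is expressed as a percentage to match the units used in the result tables.

```python
def resource_utilization_rate(utilized_power, available_power):
    """Utilized resource power as a percentage of available resource power."""
    return 100.0 * utilized_power / available_power


def job_completion_time(execution_s, communication_s, sla_creation_s):
    """Total completion time: execution + communication + SLA creation (seconds)."""
    return execution_s + communication_s + sla_creation_s


def job_success_rate(jobs_completed, jobs_submitted):
    """Successfully completed jobs as a percentage of submitted jobs."""
    return 100.0 * jobs_completed / jobs_submitted
```

As an illustrative calculation with counts that are not taken from the simulation, 232 successful jobs out of 250 submitted correspond to a success rate of 92.8%.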

The length of the submitted jobs varies from 5000 MI to 500000 MI, and the size of the input data to be transferred varies from 1 MB to 5 MB. The time deadline specified for the client application is 12 seconds. The client applications are submitted to appropriate resources, and the GSRS strategy performs divisible job scheduling and executes the task sets in parallel to achieve optimal schedules. In the existing approaches, the communication time increases linearly with the number of resources employed for job execution. This drawback is overcome in the GSRS strategy, where the communication time is reduced because the data is transferred in multiple stripes to and from the remote resources using TCP striping. As an optimal number of resources is chosen, the complexity and time overhead for the grid jobs are minimized.

Establishing an SLA between resource providers and clients provides reliable service to requesting clients without much time overhead. The SLA creation time for different client jobs is given in Table 6.1 and Figure 6.2.

Table 6.1 SLA creation time

| No. of Job Requests | SLA Creation Time (s), SLA Template | SLA Creation Time (s), GSRS-Template |
|---|---|---|
| 20 | 0.207 | 0.172 |
| 40 | 0.25 | 0.21 |
| 60 | 0.265 | 0.22 |
| 80 | 0.3 | 0.23 |
| 100 | 0.431 | 0.28 |
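The relative SLA creation time can be read off Table 6.1 with a few lines of arithmetic. The sketch below expresses the GSRS-Template time as a percentage of the SLA Template time for each row; the variable and function names are hypothetical, and the values are copied from the table.

```python
# Values copied from Table 6.1 (seconds).
job_requests = [20, 40, 60, 80, 100]
sla_template = [0.207, 0.25, 0.265, 0.3, 0.431]
gsrs_template = [0.172, 0.21, 0.22, 0.23, 0.28]


def relative_time_percent(baseline, proposed):
    """Proposed time as a percentage of the baseline time, row by row."""
    return [100.0 * p / b for b, p in zip(baseline, proposed)]


ratios = relative_time_percent(sla_template, gsrs_template)
for n, r in zip(job_requests, ratios):
    print(f"{n:>3} requests: {r:.1f}% of the SLA Template time")
```

The ratio is roughly 83 to 84% for 20 to 60 requests, falls to about 77% at 80 requests and reaches about 65% at 100 requests, so the relative saving grows with the number of job requests.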

Figure 6.2 SLA creation time (x-axis: Number of Job Requests; y-axis: SLA Creation Time (s); series: SLA Template, GSRS-Template)

A client job executed on a larger number of resources requires an SLA to be created with every one of those resources. In the GSRS strategy, however, an optimal number of resources is selected based on resource availability and predicted load, and SLAs are established only with these selected resource providers. Therefore, the SLA creation time is reduced to about 65% of the GridWay value (0.28 s against 0.431 s for 100 job requests in Table 6.1), which in turn reduces the completion time of jobs. The total completion time of jobs executed on the grid is shown in Table 6.2.

Table 6.2 Job completion time

| No. of Jobs | Job Completion Time (s), GridWay | Job Completion Time (s), GSRS |
|---|---|---|
| 50 | 15.15 | 11.28 |
| 100 | 16.05 | 12.86 |
| 150 | 18.3 | 13.5 |
| 200 | 20.2 | 15.97 |

The total completion time of jobs executed on the grid resources using the GSRS strategy and GridWay is depicted in Figure 6.3.

Figure 6.3 Job completion time (x-axis: Number of Allotted Jobs; y-axis: Job Completion Time (s); series: GSRS, GridWay)

The user specified a time limit of 12 seconds, and the proposed GSRS strategy successfully completes the job within a maximum time of 15 seconds, as it allows dynamic extension of resource usage and avoids run-time failures. The SLA creation time, communication time and execution time are reduced to about 40% in the proposed method compared to GridWay. The reduced completion time is due to the load prediction strategy, which considers local and remote jobs and predicts the available computation power accurately. The dynamic SLA avoids renegotiation and the creation of new SLAs with other resources for the jobs currently in execution. The jobs are completed with a small increase over the time deadline given by the user. This increased completion time does not affect the performance of either the resource or the

job. Instead, it improves the reliability of the resources, as abrupt job terminations are avoided. The jobs divided into task sets are executed in parallel, and the execution time of the completed jobs is close to the specified time deadline. It is clear from Figure 6.3 that the maximum extension of resource usage availed is only about 10-15% of the stated time limit. The success rates obtained with the GSRS and GridWay strategies for the submitted grid jobs are shown in Table 6.3.

The proposed GSRS strategy is very reliable because it places more focus on the selection of resources for the client application. The selection is based on multiple criteria such as trust level, site security policy, current resource availability and workload prediction. The selected resources are allocated to the client after the DSLA has been established, which avoids forced termination due to insufficient resource capabilities.

Table 6.3 Job success rate

| No. of Jobs | Job Success Rate (%), GSRS | Job Success Rate (%), GridWay |
|---|---|---|
| 50 | 100 | 100 |
| 100 | 98 | 86.8 |
| 150 | 96.66 | 83.33 |
| 200 | 95 | 81 |
| 250 | 92.8 | 78.8 |

The job success rate for the proposed GSRS and the existing GridWay approach is shown in Figure 6.4.

Figure 6.4 Job success rate (x-axis: Number of Jobs; y-axis: Job Success Rate (%); series: GSRS, GridWay)

It is evident that the proposed GSRS strategy performs well compared to GridWay and achieves about 20% higher success rate than GridWay. The GSRS algorithm is evaluated by executing multiple applications, and the achieved success rate is more than 92%. A job is put in a pending state if suitable resources are not found for its execution, but the probability of this state is less than 5%. The job acceptance rate is about 95% for the proposed algorithm.

The utilization rate of the resources that participate in the grid, for the proposed approaches UQS, ORS-G, CLP-DESA and GSRS and for the existing GridWay, is shown in Table 6.4.

Table 6.4 Resource utilization rate

| Resource ID | GridWay (%) | UQS (%) | ORS-G (%) | CLP-DESA (%) | GSRS (%) |
|---|---|---|---|---|---|
| R1 | 90 | 88 | 96 | 93 | 100 |
| R2 | 78 | 73 | 89 | 82 | 94 |
| R3 | 70 | 66 | 76 | 70 | 92 |
| R4 | 84 | 80 | 93 | 86 | 97 |
| R5 | 81 | 76 | 86 | 84 | 100 |

The utilization rate of computational grid resources for the UQS, ORS-G, CLP-DESA and GSRS strategies is depicted in Figure 6.5.

Figure 6.5 Resource utilization (x-axis: Resource ID; y-axis: Utilization Rate (%); series: UQS, ORS-G, CLP-DESA, GridWay, GSRS)

The proposed method allocates task sets to resources based on their present availability and forecasted load. Because the forecasted load allows a resource to be used for an extended time period if it has not been reserved, the idle time of resources is reduced, leading to increased utilization. The GSRS approach achieves resource utilization greater than 90% for all resources that participate in the grid. The utilization of R1 and R5 is very high because they have the highest capacity, with more fast CPUs, higher trust values and higher bandwidth than R2, R3 and R4. The resource R1, if available, has the top priority, and client jobs are assigned to it. Jobs are allotted according to the capacity of the resources that best match the client requirements, and they are executed on resources that satisfy the user's quality of service requirements and maximize the utilization of grid resources.

6.6 SUMMARY

A computational grid that is to be employed in commercial domains should provision suitable resources with performance guarantees to execute critical jobs submitted by different clients that demand different performance levels. An integrated resource selection framework is presented that combines multiple components for incorporating security, user QoS, TCP striping, load forecasting and dynamic service level agreements. The components of the proposed Guaranteed Service Resource Selection (GSRS) framework are evaluated using different mechanisms and integrated to provide a resource selection service. This unified framework is implemented at the meta-scheduler of the virtual organization to select and allocate appropriate resources for the requesting grid users. The DSLA provides the flexibility to extend the usage of allotted resources even after the time duration given in the agreement. Grid jobs achieve a very high success rate, as the dynamic SLA allows the jobs to complete

the execution by extending the usage of allotted resources, avoiding job migrations and forced job terminations. The main focus of this work is on the selection of appropriate, trustworthy resources, because if suitable resources are selected in the initial phases of job submission, the chances of malicious attacks, run-time failures, job terminations and job migrations are greatly reduced. There is a significant improvement in the performance of the whole Grid system, as the utilization of the Grid resources is maximized and the level of satisfaction attained by the users is also high. Hence, the computational Grid is viable for use in commercial and business domains.