GRID RESOURCE AVAILABILITY PREDICTION-BASED SCHEDULING AND TASK REPLICATION


GRID RESOURCE AVAILABILITY PREDICTION-BASED SCHEDULING AND TASK REPLICATION

BY BRENT ROOD

BS, State University of New York at Binghamton, 2005
MS, State University of New York at Binghamton, 2007

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2011

© Copyright by Brent Rood 2011. All Rights Reserved.

Accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2011

May 13, 2011

Dr. Michael J. Lewis, Chair and Faculty Advisor
Department of Computer Science, Binghamton University

Dr. Madhusudhan Govindaraju, Member
Department of Computer Science, Binghamton University

Dr. Kenneth Chiu, Member
Department of Computer Science, Binghamton University

Dr. Yu Chen, Outside Examiner
Department of Electrical and Computer Engineering, Binghamton University

Abstract

The frequent and volatile unavailability of volunteer-based Grid computing resources challenges Grid schedulers to make effective job placements. The manner in which host resources become unavailable has different effects on different jobs, depending on their runtime and their ability to be checkpointed or replicated. A multi-state availability model can help improve scheduling performance by capturing the various ways a resource may be available or unavailable to the Grid. This dissertation uses a multi-state model and analyzes a machine availability trace in terms of that model. Several prediction techniques then forecast resource transitions into the model's states. This study analyzes the accuracy of the proposed predictors, which outperform existing approaches. Later chapters propose and study several classes of schedulers that utilize the predictions, along with a method for combining scheduling factors. Scheduling results characterize the inherent tradeoff between job makespan and the number of evictions due to resource unavailability, and demonstrate how prediction-based schedulers can navigate this tradeoff under various scenarios. Schedulers can use prediction-based job replication techniques to replicate those jobs that are most likely to fail. The proposed prediction-based replication strategies outperform others, as measured by improved makespan and fewer redundant operations. Multi-state availability predictors can provide information that allows distributed schedulers to be more efficient, as measured by a new efficiency metric, than schedulers that blindly replicate all jobs or some static percentage of jobs. PredSim, a simulation-based framework, facilitates the study of these topics.

Dedication

I dedicate this dissertation to all the people in this world who have cared for me and taken me into their hearts. To my mom. For watching me grow and helping me when I fell down. For giving me this incredible feeling that she will always be there for me. I hope she knows what that means to me, and that I'll always be there for her. To Marisa. Whose gentleness and smile helped me through the hardest of times. Her friendship and understanding taught me what it really means to love. I will forever carry our moments with me in my heart. To Holly. For showing me the beauty of a sunset and for always lending an ear when the world seemed so big and the road seemed so long. For teaching me how the love of a friend is the brightest light in this darkness. And when you have reached the mountain top, then you shall begin to climb. And when the earth shall claim your limbs, then shall you truly dance. - Khalil Gibran

Acknowledgements

It is no exaggeration to say that this work would not exist if it were not for the guidance and insight of my advisor, Michael J. Lewis. I would like to acknowledge him for inspiring me to reach and strive for novelty, for showing me the thrill of publication and the tempering beauty of failure, and for pushing me to better myself and, in a very profound way, changing the course of my life.

Contents

List of Tables
List of Figures

1 Introduction: Characterizing Availability; Predicting Resource Unavailability; Prediction-based Scheduling; Prediction-based Replication; Load-adaptive Replication; Multi-functional Device Characterization and Scheduling; Simulation Tool; Summary

2 Related Work: Resource Characterization; MFD Characterization and Burstiness; Prediction; Scheduling; Replication

3 Availability Model and Analysis: Availability Model; Condor; Unavailability Types; Grid Application Diversity; Availability Analysis; Trace Methodology; Condor Pool Characteristics; Classification; Implications

4 Multi-State Availability Prediction: Prediction Methodology; Predictor Accuracy Comparison; Prediction Method Analysis; Analysis of Related Work

5 Prediction-Based Scheduling: Scheduling Methodology; Reliability Performance Relationship; Prediction Based Scheduling Techniques; Scoring Technique Performance Comparison; Prediction Duration; Multi-State Prediction Based Scheduling; Predictor Quality vs. Scheduling Result; Scheduling Performance Comparison

6 Prediction-Based Replication: Replication Experimental Setup; Static Replication Techniques; Prediction-based Replication; Replication Score Threshold; Load Adaptive Motivation; Load Adaptive Techniques

7 Load Adaptive Replication: Load Adaptive Experimental Setup; Prediction-based Replication; Load Adaptive Replication; Replication Strategy List; Replication Index Function; SR Parameter Analysis; Sliding Replication Performance; Prediction-based Replication Comparison; SR Under Various Workloads

8 Multi-Functional Device Utilization: Multi-Functional Devices; Unconventional Computing Environment Challenges; Characterizing State Transitions; Trace Analysis; Overall State Behavior; Transitional Behavior; Durational Behavior; A Metric for Burstiness; Burstiness Test; Degree of Burstiness; Burstiness Characterization; Burstiness-Aware Scheduling Strategies; Experiment Setup; Empirical Evaluation

9 Simulator Tool: Simulation Background; Simulation Based Research; Distributed System Simulators; PredSim Capabilities and Tools; Capabilities; Tools; PredSim System Design; Input Components; User Specified Components; PredSim Components; PredSim Performance

10 Conclusions: Summary; Future Work

Bibliography

List of Tables

3.1 Machine Classification Information
Machine Classification Scheduling Results
Prediction Interface
Scheduling Interface
Replication Interface
Job Queuing Interface

List of Figures

1.1 Thesis flow chart
Multi-state Availability Model: Each resource resides in and transitions between four availability states, depending on the local use, reliability, and owner-controlled sharing policy of the machine
Machine State over Time
Total Time Spent in Each Availability State
Number of Transitions versus Hour of the Day
Number of Transitions versus Day of the Week
State Duration versus Hour of the Day
State Duration versus Day of the Week
Machine Count versus State Duration
Machine Count versus Average Number of Transitions Per Day
Failure Rate vs. Job Duration
Transitional Predictor Accuracy
Weighting Technique Comparison
Prediction Accuracy by Duration
As schedulers consider performance more than reliability (by increasing Tradeoff Weight W), makespan decreases, but the number of evictions increases
The resource scoring decision's effectiveness depends on the length of jobs. For longer jobs, schedulers should emphasize speed over predicted reliability
Scoring decision effectiveness depends on job checkpointability. Schedulers that treat checkpointable jobs differently than non-checkpointable jobs (S7 and S9) suffer more evictions, but mostly for checkpointable jobs; therefore, makespan remains low
Schedulers can improve results by considering predictions for intervals longer than the actual job lengths. This is especially true for job lengths between approximately 48 and 72 hours
TRF and Ren MTTF show similar trends for both makespan and evictions. For any particular makespan, TRF causes fewer evictions, and for any number of evictions, TRF achieves improved makespan
TRF causes fewer evictions than Ren (coupled with the PPS scheduler) for all Prediction Length Weights. TRF also produces a shorter average job makespan than all Ren predictors, except the one day predictor (which produces 45% more evictions)
TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for jobs up to 1 hours long
TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for loads up to 4, jobs
Checkpointability Based Replication: Makespan (left) and extra operations (right) for a variety of replication strategies, two of which are based on checkpointability of jobs, across four different load levels
Replication Score Threshold's effect on replication effectiveness - Low Load
Replication Score Threshold's effect on replication effectiveness - Medium High Load
6.4 Replication strategy performance across a variety of system loads
Load adaptive replication techniques
Replication Score Threshold's effect on replication effectiveness across various loads
Load adaptive replication with random replication probability
Sliding Replication's load adaptive approach
Sliding Replication list comparison
Sliding Replication function comparison
Sliding Replication index analysis
Static vs. adaptive replication performance comparison
Sliding Replication under the Grid5000 workload
Sliding Replication under the NorduGrid workload
MFD count per floor at a large pharmaceutical company
MFD count per floor at a division of a large computer company
Trace data from four MFDs in a large corporation. The devices are Available over 98% of the time, and requests for local use are bursty and intermittent
MFD Availability Model: MFDs may be completely Available for Grid jobs (when they are idle), Partially Available for Grid jobs (when they are copying, faxing, or scanning), Busy or unavailable (when they are printing), or Down
MFD state transitions by day of week
MFD state transitions by hour of day
MFD state duration by day of week
MFD state duration by hour of day
Number of occurrences of each state duration
Burstiness versus Day of Week and Hour of Day
Comparison of MFD schedulers as Grid job length increases
Determination of Grid job lengths with 95% confidence intervals for given failure rates
Comparison of MFD schedulers as Grid job load increases
PredSim System Architecture
PredSim execution time versus the number of jobs executed
PredSim execution time versus the number of schedulers investigated

Chapter 1 Introduction

The functionality, composition, utilization, and size of large scale distributed systems continue to evolve. The largest Grids and test beds under centralized administrative control, including TeraGrid [74], EGEE [21], Open Science Grid (OSG) [26], and PlanetLab [49], vary considerably in terms of the number of sites and the extent of the resources contained by each site. Various peer-to-peer (P2P) [6] and Public Resource Computing (PRC) [4] systems allow users to connect and donate their individual machines, and to administer their systems themselves. Cloud-based systems, consisting of sets of distributed resources for use or rental by remote users, are also coming into the mainstream. Amazon EC2 [77] is a prominent cloud-based solution in which users can rent computational hardware. Eucalyptus is a private cloud computing implementation designed to operate on computer clusters behind web-based portals [48]. Hybrids of these various Grid models, cloud-based systems, and middleware will continue to emerge as well [1] [22]. Future Grids will contain dedicated high performance clusters, individual less-powerful machines, and even a variety of alternative devices such as PDAs, sensors, and instruments. They will run Grid middleware, such as Condor [42] and Globus [24], that enables a wide range of policies for contributing resources to the Grid. This continued increase in functional heterogeneity will make Grid scheduling (mapping jobs onto available constituent Grid resources) even more challenging, partly because different resources will exhibit varying unavailability characteristics. For example, laptops may typically be turned on and off more frequently, and may join and leave the network more often. Individual workstation and desktop owners may turn their machines off at night, or shut down and restart to install new software more often than a server's system administrator might. Even the same kinds of resources will exhibit different availability characteristics. CS department research clusters may be readily available to the Grid, whereas a small cluster from a physics department may not allow remote Grid jobs to execute unless the cluster is otherwise idle. Site autonomy has long been recognized to be an important Grid attribute [39]. In the same vein, even owners of individual resources will exhibit

different usage patterns and implement different policies for how available they make their machines. If Grid schedulers do not account for these important differences, they will make poor mapping decisions that undermine their effectiveness. Schedulers that know when, why, and how resources become unavailable can be much more effective, especially if this knowledge is coupled with information about job characteristics. For example, long-running jobs that do not implement checkpointing may require highly available host machines. Checkpointable jobs that require heavyweight checkpoints may prefer resources that are reclaimed by users, rather than those that unexpectedly become unavailable, thereby making an on-demand checkpoint possible. This is less important for jobs that produce lightweight checkpoints that can be generated on shorter notice. Easily replicable processes without side effects might perform nearly as well on less powerful and more intermittently available resources, leaving the more available machines for jobs that cannot deal as well with machine unavailability (at least under moderate to high contention for Grid resources). This thesis describes building Grid schedulers that use the unavailability and performance characteristics of target resources when making scheduling decisions, a goal that requires progress in several directions. The thesis is organized as follows. Chapter 3 describes a multi-state availability model and examines a University of Notre Dame trace in terms of that model to establish the volatility and to characterize the availability of the underlying resources. Chapter 4 examines several prediction techniques to forecast the future states of a resource's availability. Chapter 5 investigates using different approaches to schedule jobs with different characteristics based on those predictions, by attempting to choose resources that are least likely to become unavailable. Chapter 6 then explores the efficacy of using the availability predictors in a new way: they are tested to determine how helpful they can be in deciding which jobs to replicate, with the goal of replicating those jobs most likely to experience resource unavailability. Chapter 7 investigates how replication techniques can adapt to changing system load, showing how load can affect replication policy selection. Chapter 8 investigates predicting and scheduling in the unconventional environment of multi-functional devices located in industry settings. Lastly, Chapter 9 describes the simulator tool that produced the results in this work.

1.1 Characterizing Availability

To build Grid schedulers that consider the unavailability characteristics of target resources, resources must be traced to uncover why and how they become unavailable. As described in Chapter 2, most availability traces focus primarily on when resources become unavailable, and detail the patterns

based purely on availability and unavailability. This study identifies different types of unavailability, and analyzes the patterns of that unavailability. It is also beneficial to classify resources in a way that availability predictors can exploit, and to design and build such predictors. Knowing that Grid resources become unavailable differently, and even predicting when and how they do, is important only to the extent that schedulers can exploit this information. Therefore, schedulers must be built that use both availability predictions (which are based on unavailability characteristics) and application characteristics. Section 3.1 identifies four resource availability states, dividing unavailable resources into separate categories based on why they are unavailable. This allows resources that have abruptly become unreachable to be differentiated from resources whose user has reclaimed them. Events that trigger resources to transition between availability states are described. Applications are then classified by characteristics that could help schedulers, especially in terms of how they tolerate machine unavailability. Section 3.2 describes traces that expose unavailability characteristics more explicitly than current traces. The results presented identify the availability states that resources inhabit, as well as the transitions that resources make between these states. Section 3.2 reports sample data from four months of traces from the University of Notre Dame's Condor pool. To summarize, the resource characterization contributions include a new way of analyzing trace data, which better exposes resource unavailability characteristics, sample data from one such Condor-based trace, and a new classification approach that enables effective categorization of resources into availability states. These results make a strong case for studying how a Grid's constituent resources become unavailable, for using that information to predict how individual resources may behave in the future, and for building that information into future Grid schedulers, especially those that will operate in increasingly eclectic and functionally heterogeneous Grids.

1.2 Predicting Resource Unavailability

An availability predictor, which forecasts the availability of candidate resources for the period of time that an application is likely to run on them, can help schedulers make better application placement decisions. Clearly, selecting resources that will remain available for the duration of application

runtimes would improve the performance of both individual applications and of the Grid as a whole. Resource categorization work demonstrates the potential for using past history to organize resources into categories that define their availability behavior, and then using schedulers that take advantage of this resource categorization to make better scheduling decisions [59]. However, due to the low granularity of the classes, the static nature of each machine's class, and the fact that the classes do not recognize patterns in a machine's availability, on-demand availability prediction proved to be an improvement over the classification approach. Importantly, this work uses a multi-state availability model, which captures not just when and whether a resource becomes unavailable, but how it does so. This distinction allows schedulers to make decisions that improve overall performance, when they are configured to also consider application characteristics. For example, if some particular resource is expected to become unavailable because its owner reclaims the machine for local use (as opposed to the machine becoming unreachable unexpectedly), then that machine may be an attractive candidate for scheduling an application that can take an on-demand checkpoint. Chapter 4 describes investigations on predicting the availability of resources. Several kinds of predictors are described, which use different criteria to predict resource availability. The best performing predictor considers how resources transition between the proposed model's unavailability states, and outperforms other approaches in the literature by 4.6% in prediction accuracy. As shown in later chapters, simulated scheduling results indicate that this accuracy difference significantly decreases both makespan and operations lost due to eviction. To summarize, the resource availability prediction contributions include introducing new multi-state availability prediction algorithms that effectively forecast future resource availability behavior for the proposed multi-state availability model, evaluating the most effective configurations of the predictors in this set, and producing significant prediction accuracy increases across a variety of prediction lengths compared to existing prediction approaches. These results demonstrate the feasibility of predicting the future availability of Grid resources, which can in turn be used by schedulers to make better job placement decisions.

1.3 Prediction-based Scheduling

Schedulers that effectively predict the availability of constituent resources, and use these predictions to make scheduling decisions, can improve application performance. Ignoring resource availability

characteristics can lead to longer application makespans due to wasted operations [19]. Even more directly, it can adversely affect application reliability by favoring faster but less available resources that cannot complete jobs before becoming unavailable or being reclaimed from the Grid by their owners. Unfortunately, performance and reliability vary inversely [19]; favoring one necessarily undermines the other. For this reason, Grid schedulers must consider both reliability and performance in making scheduling decisions. This requirement is true for current Grids, and will become more important as resource characteristics become increasingly diverse. Resource availability predictors can be used to determine resource reliability, and this in turn can be combined with measures of resource performance when selecting resources. Since performance depends on competing resource load during the lifetime of the application, schedulers must make use of load monitors and predictors [8], which can be centralized, or can be distributed using various dissemination approaches [2]. In scheduling for performance and reliability, it is critical to investigate the effect on both application makespan and the number of wasted operations due to application evictions, which can occur when a resource executing an application becomes unavailable. Evictions and wasted operations are important because of their direct cost within a Grid economy, or simply because they essentially deny the use of the resource by another local or Grid application. The proposed approach to Grid scheduling involves analyzing resource availability history and predicting future resource (un)availability, monitoring and considering current load, storing static resource capability information, and considering all of these factors when placing applications. For scheduling, the best availability predictor (Chapter 4) is used to investigate the effects of weighing resource speed, load, and reliability in a variety of ways, to decrease makespan and increase application reliability. Chapter 5 also investigates different approaches to scheduling applications with varying characteristics. In particular, results are presented for scheduling checkpointable jobs to consider speed and load more heavily than reliability. This is done because their eviction involves fewer wasted operations when compared with non-checkpointable jobs, due to their ability to save state and resume execution elsewhere. The simulation-based performance study presented here begins by establishing the inherent performance/reliability scheduling tradeoff for a real world environment. The performance of several different schedulers that consider both reliability and performance is then characterized. Subsequent sections develop and explore the idea of varying the requested prediction duration by demonstrating its effect on application execution performance.
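To make the weighing of resource speed, load, and predicted reliability concrete, the following is a minimal sketch of one possible scoring rule; the attribute names (clock_rate, predicted_load, operations), the predictor method, and the normalization are illustrative assumptions rather than the schedulers actually evaluated in Chapter 5.

```python
def score_resource(res, job, predictor, w, max_speed):
    """Blend performance and predicted reliability with a tradeoff weight w."""
    # Effective speed discounts the resource's competing local load.
    eff_speed = max(res.clock_rate * (1.0 - res.predicted_load), 1e-9)
    runtime = job.operations / eff_speed                       # expected runtime here
    reliability = predictor.prob_available_for(res, runtime)   # 0.0 .. 1.0 (assumed method)
    perf = eff_speed / max_speed                               # normalized 0.0 .. 1.0
    # w = 1 emphasizes performance (shorter makespan, more evictions);
    # w = 0 emphasizes predicted reliability (fewer evictions).
    return w * perf + (1.0 - w) * reliability

def pick_resource(resources, job, predictor, w):
    max_speed = max(r.clock_rate * (1.0 - r.predicted_load) for r in resources)
    return max(resources, key=lambda r: score_resource(r, job, predictor, w, max_speed))
```

Sweeping w from 0 to 1 traces out the makespan-versus-evictions tradeoff described above.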

The study focuses on scheduling for the two competing metrics of performance and reliability by characterizing the effects of (i) treating checkpointable jobs differently from non-checkpointable jobs, and (ii) varying the checkpointability, length, and number of jobs. These approaches are compared with the only other multi-state availability predictor and scheduler, designed by Ren et al. [55]. A later section characterizes the relative performance and configurability of the two competing approaches. Results show that Ren's scheduler causes up to 51% more evicted jobs while simultaneously increasing average job makespan by 18% when compared with the best performing scheduler proposed here. To summarize, the prediction-based scheduling contributions include establishing the performance/reliability scheduling tradeoff with real world traces, introducing several schedulers that consider both reliability and performance, varying prediction duration and showing its effect on job execution performance, characterizing the effect of scheduling checkpointable and non-checkpointable jobs differently, and evaluating the proposed schedulers, across a variety of job loads, lengths, and checkpointabilities, against related work scheduling approaches. The scheduling results presented here demonstrate the feasibility of using resource availability predictions to aid schedulers in choosing reliable resources and to decrease average job makespan.

1.4 Prediction-based Replication

Chapter 6 investigates the efficacy of using the availability predictors in a completely new way, namely to decide which jobs to replicate. The basis for this idea is that checkpointing and replication are two of the primary tools for dealing with resource unavailability. The system cannot generally add checkpointability; the application programmer must do that. So whereas schedulers can (and should) take advantage of knowing which jobs are checkpointable (Chapter 5 explores this idea), they cannot proactively add reliability by increasing the checkpointability of the job mix. The system can, however, replicate some jobs in an attempt to deal with possible resource unavailability. Replicating a job can benefit performance in one of two ways. First, jobs are scheduled onto resources using imperfect ranking metrics that may or may not reflect how fast a job will run on a machine. Therefore, by starting a job to run simultaneously on more than one resource, the application makespan is determined by the earliest completion time among all replicas. As this can depend on unpredictable load and usage patterns, the replica can potentially improve the makespan of the job. Second, replicated job executions can also help deal with unavailability; when one

resource becomes unavailable, the adverse effect on performance of the jobs it runs can be reduced if replicas complete without experiencing resource unavailability. Replication does not come without a cost, however. Within a Grid economy, it is likely that applications will need to pay for Grid use on a per job, per process, or per operation basis. Therefore, replication can cost extra, assuming redundant jobs are counted separately (which seems likely). Replication can also have an adverse indirect effect, as some of the highest ranked resources could be used for executing replicas, leaving only worse resources (by whatever metric the scheduler uses to rank them) for subsequent jobs. For this reason, Chapter 6 tests the hypothesis that availability predictors can help select the right jobs to replicate. This approach improves overall average job makespan, reduces the redundant operations needed for the same improved makespan, or both. Chapter 6 explores several replication strategies, including strategies that make copies of all jobs, some fixed percentage of jobs, long jobs, jobs that are not checkpointable, and several combinations. These strategies are tested under a variety of loads using two availability traces [29][59]. Chapter 6 also explores the effectiveness of replicating only the jobs that are mapped onto resources that fall below some reliability threshold. That is, the predictor is asked to forecast the likelihood that a resource will complete a particular job without becoming unavailable. If this predicted probability value is too low, the job is replicated. If the resource is deemed reliable enough, the job is not replicated. To summarize, the prediction-based replication contributions include describing a new metric, replication efficiency, for analyzing replication effectiveness, and analyzing a replication strategy that uses availability predictions to determine when to replicate jobs. The results investigate the replication strategy's effect on both makespan and replication efficiency, describe how system load affects replication effectiveness, and propose three different static load adaptive replication techniques, which consider the system load and the desired metric to decide which replication strategy to use. Prediction-based replication results show that using resource availability predictions to replicate those jobs that are most likely to experience unavailability can increase job performance and efficiency.
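The threshold test described above can be sketched in a few lines; the predictor method name and the example threshold value are assumptions for illustration, not the exact interface or value used in Chapter 6.

```python
def should_replicate(job, resource, predictor, threshold=0.75):
    """Replicate only jobs placed on resources predicted to be unreliable."""
    # Probability (0.0 .. 1.0) that `resource` stays available long enough for
    # `job` to finish; `prob_available_for` is an assumed predictor method.
    completion_prob = predictor.prob_available_for(resource, job.expected_runtime)
    return completion_prob < threshold   # low confidence -> create a replica
```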

1.5 Load-adaptive Replication

Choosing whether to replicate a job is further complicated by the current load on the system. As system load increases, selective replication strategies that make fewer replicas produce larger makespan improvements than more aggressive strategies [62]. The static load adaptive strategy introduced in Chapter 6 consists of a set of four replication techniques ranging from least selective to most selective, with the current load on the system determining which replication technique to use. The four replication techniques are chosen statically based on performance data from a single synthetic workload. This technique does not perform as well when faced with different real world workloads. Chapter 7 proposes a replication strategy that adapts to changing system load and improves on the previously proposed technique. Section 7.3 introduces a new load adaptive strategy, Sliding Replication (SR). SR first determines the current load on the system and uses this value as input to a function that outputs an index into an ordered list of replication strategies. The selected replication strategy determines whether the job should be replicated. As load changes, the replication strategy changes. This approach attempts to dynamically discover the most appropriate replication technique given the current load. Real world resource availability and workload traces are used to test SR's performance against previously proposed load adaptive and non-adaptive replication techniques. The results show that SR produces comparable average job makespan improvements while decreasing the number of wasted operations, as measured by replication efficiency (see Section 6.3). SR improves efficiency by an average of 347% when compared with a non-load adaptive approach and by 7% when compared with a less sophisticated load adaptive approach. To summarize, the load-adaptive replication contributions include showing how the changing system load of real-world job traces affects replication policy performance, proposing a load-adaptive replication approach, Sliding Replication, which measures system load and selects the most appropriate replication policy for that load, investigating how varying the parameters of Sliding Replication affects its performance, and testing SR's performance against a related work replication approach in two real-world resource trace environments and with two real-world job traces. Load adaptive replication results demonstrate how varying system load affects replication policy performance. Load-adaptive replication can adapt to changing load by selecting the replication policy most appropriate for the current system load.
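As a rough illustration of Sliding Replication's mechanism, the sketch below maps the measured system load to an index into an ordered list of replication policies; the linear index function, the load normalization, and the example policy names are assumptions for illustration, and Chapter 7 evaluates the actual policy lists and index functions.

```python
def sliding_replication_policy(current_load, max_load, policies):
    """Pick a replication policy from an ordered list based on system load.

    `policies` is ordered from least selective (used at low load) to most
    selective (used at high load); the linear mapping below is only one
    possible index function.
    """
    load_fraction = min(current_load / max_load, 1.0) if max_load else 1.0
    index = int(load_fraction * (len(policies) - 1))
    return policies[index]

# Illustrative usage (policy names are hypothetical):
# policy = sliding_replication_policy(jobs_queued, queue_capacity,
#              ["replicate_all", "replicate_non_checkpointable",
#               "replicate_long_jobs", "replicate_none"])
```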

1.6 Multi-functional Device Characterization and Scheduling

The traces and workloads for the results summarized in Sections 1.1 through 1.5 have all been derived from conventional computing environments built from general-purpose workstations and clusters. To explore the applicability of the proposed approaches in other environments, unconventional computing environments are leveraged, which here are defined as collections of devices that do not include general purpose workstations or machines. In particular, the following facts are exploited:

1. Special purpose instruments and devices are increasingly being outfitted with processors, memory, and peripherals that make them capable of supporting high performance computing, not just their intended purpose.

2. Large corporations often deploy dozens, hundreds, and even over a thousand such devices within a single floor, building, or campus, respectively. Moreover, these devices are connected to one another with high speed networks.

3. The fast processors and large amounts of memory are often required to adequately process peak loads, not for sustained long running applications. Combined with the inherent characteristics of the local job request mix, this leaves the devices idle for a significant percentage (95% to 98%) of the time.

These characteristics provide the potential to federate and harvest the unused cycles of special purpose devices for compute jobs that require high performance platforms. Chapter 8 investigates the viability of using Multi-Function Devices (MFDs) as a Grid for running computationally intensive tasks under the challenging constraints described above. To summarize, the MFD characterization and scheduling contributions include developing a multi-state availability model and burstiness metric to capture the dynamic and preemptive nature of jobs encountered on MFDs, gathering an extensive MFD availability trace from a representative environment and analyzing it to determine the states in the proposed model, proposing an online and computationally efficient test for burstiness using the trace, and

showing, through trace-based simulations, that MFDs can be used as computational resources and demonstrating the scheduling improvements that can be achieved by considering resource burstiness in making job placements. This MFD characterization work demonstrates the viability of using idle MFDs as computation resources. In particular, statistical confidence intervals are computed on the lengths of Grid jobs that would suffer a selected preemption rate. This assists in selecting Grid jobs appropriate for the bursty environment. On average across all loads depicted, the proposed MFD scheduling technique comes within 7.6% of a pseudo-optimal scheduler's makespan improvement and produces an average 18.3% makespan improvement over a random scheduler.

1.7 Simulation Tool

All the previous research topics require a platform on which to examine their behavior. Current simulation-based approaches are adequate for some of the experiments that do not use multi-state prediction, but no existing system had the capabilities to test multi-state prediction based scheduling. PredSim is a new experimental platform capable of handling resources that exhibit different types of unavailability. PredSim consists of a variety of input, user specified, and core components. These components provide a framework for testing prediction, scheduling, and replication approaches. The key contributions of PredSim include providing functionality for multi-state based resource availability prediction, scheduling, task replication, and job execution management in a purpose-built environment, accepting and combining workload and resource availability traces in multiple formats and supplying facilities for synthetically creating both jobs and resources, providing a Predictor Accuracy mode designed to isolate and determine a prediction algorithm's accuracy across a variety of circumstances, and including a library of predictors, schedulers, replication policies, and job queuing strategies with the flexibility to add further modules. PredSim has been a helpful tool in researching fault tolerant distributed computing approaches.
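A minimal sketch of what such pluggable interfaces could look like follows; the class and method names are illustrative assumptions rather than PredSim's actual API, which Chapter 9 describes.

```python
from abc import ABC, abstractmethod

class Predictor(ABC):
    @abstractmethod
    def predict(self, resource, start_time, duration):
        """Return the probability the resource stays Available for `duration`."""

class Scheduler(ABC):
    @abstractmethod
    def select_resource(self, job, resources, predictor):
        """Return the resource on which to place `job`, or None to queue it."""

class ReplicationPolicy(ABC):
    @abstractmethod
    def should_replicate(self, job, resource, predictor, system_load):
        """Return True if a replica of `job` should be created."""
```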

Figure 1.1: Thesis flow chart

1.8 Summary

This thesis explores the idea of using resource reliability information to improve distributed system performance. Figure 1.1 shows the logical flow of this work. Resource traces provide availability behavior characterization and demonstrate underlying resource volatility. This characterization information motivates prediction approaches that forecast resource availability. Schedulers use the prediction information, alongside performance information, to select resources for application execution. Resource availability predictions also determine which jobs are most suitable for replication, and these replication approaches can incorporate current system load in deciding which jobs to replicate. The results presented in this thesis demonstrate that schedulers that judiciously use resource reliability information provided by availability predictors can improve system performance by selecting resources that are least likely to become unavailable during application execution, by matching an application's characteristics to the availability behavior of resources, and by efficiently choosing those jobs that can most benefit from replication.
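The flow in Figure 1.1 can be summarized as a single loop that predicts, places, and then selectively replicates; this sketch reuses the illustrative interfaces shown earlier, and methods such as start, clone, and current_load are hypothetical rather than part of any real simulator component.

```python
def run_round(jobs, resources, predictor, scheduler, repl_policy, load_monitor):
    """One pass of the thesis workflow: predict, place, then replicate."""
    for job in jobs:
        host = scheduler.select_resource(job, resources, predictor)
        if host is None:
            continue                        # no acceptable resource; job waits
        host.start(job)                     # hypothetical resource method
        if repl_policy.should_replicate(job, host, predictor,
                                        load_monitor.current_load()):
            replica_host = scheduler.select_resource(job, resources, predictor)
            if replica_host is not None and replica_host is not host:
                replica_host.start(job.clone())   # hypothetical copy of the job
```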

Chapter 2 Related Work

Related work resides in five broad categories, namely (i) resource tracing and characterization, (ii) burstiness and MFD scheduling, (iii) prediction of resource state (availability and load), (iv) Grid scheduling, especially approaches that consider reliability and availability, and (v) job replication. This chapter is organized accordingly.

2.1 Resource Characterization

Condor [42] is the best known cycle stealing system. Its ability to utilize idle resources with minimal technical and administrative overhead has enabled its long-term popularity and success. Resource management systems like Condor form the basis for the resource characterization work in Chapter 3. Several papers describe traces of workstation pools. Acharya et al. [2] take three 14-day traces at three university campuses, and analyze them for availability over time, and for the fraction of time a cluster of k workstations is free. Bolosky et al. [9] trace more than 51,000 machines at Microsoft, and report the uptime distribution, the available machine count over time, and temporal correlations among uptime, CPU load per time of the week, and lifetime. Arpaci et al. [8] describe job submission by time of day and length of job, report machine availability according to time of day, and analyze the cost of process migration. Other traces address host availability and CPU load in Grid environments. Kondo et al. [34][36] analyze Grid traces from an application perspective, by running fixed-length CPU-bound tasks. The authors analyze availability and unavailability durations during business and non-business hours, in terms of time and number of operations [34]. They also show the cumulative distribution of availability intervals versus time and number of operations [36]. The authors derive expected task failure rates but state that the traces are unable to uncover the specific causes of unavailability (e.g. user activity vs. host failure). The Condor traces used in this work borrow from this general

approach. Ryu and Hollingsworth [64] describe fine-grained cycle stealing (FGCS) by examining a Network Of Workstations (NOW) trace [8] and a Condor trace for non-idle time and its causes (local CPU load exceeding a threshold, local user presence, and the time the Grid must wait after a user has left before utilizing the resource). Anderson and Fedak [5] analyze BOINC hosts for computational power, disk space, network throughput, and number of hosts available over time (in terms of churn, lifetime, and arrival history). They also briefly study on-fraction, connected-fraction, active-fraction, and CPU efficiency. Finally, Chun and Vahdat [16] report PlanetLab all-pairs ping data, detailing resource availability distribution and mean time to failure (MTTF). Unlike the work described above, the traces investigated in this work are designed to uncover the causes of unavailability. The trace data is analyzed and used to classify machines into the proposed availability states. Individual resources are analyzed for how they enter the various states over time (by both day and hour). These less extensive traces are intended primarily to motivate the approach to availability awareness and prediction in Grid scheduling proposed in this work.

2.1.1 MFD Characterization and Burstiness

As mentioned above, Acharya et al. [2], Bolosky et al. [9], and Arpaci et al. [8] analyze availability of workstations over time in various academic and industry settings. But for those classes of resources, the dynamics exhibited by intermittent job arrivals (burstiness) and preemption are either absent or not explicitly analyzed; analyzing them is unique to this thesis. Kondo et al. [34][36], Ryu and Hollingsworth [64], and Anderson and Fedak [5] examine resource behavior but state that the traces are unable to uncover the specific causes of unavailability (e.g. user activity vs. host failure). Unlike the work described above, the traces gathered for this work are designed not only to uncover the causes of unavailability, but also to examine them in the environment of MFDs. Burstiness and preemptive job arrivals in workstation pools have received little attention in the literature, possibly because such resources are seldom suitable for distributed computing, especially parallel computing. In the literature, the volatility of any type of computational resource is mitigated by modeling the future availability of resources with predictors (and using such predictors for inline scheduling as described below). The approach proposed in Chapter 8 characterizes the magnitude of the disturbances that intermittent preemptive job arrivals cause for Grid jobs and normalizes them across a pool of resources.

2.2 Prediction

Prediction algorithms have been used in many aspects of computing, including the prediction of future CPU load and resource availability. The most notable prediction system for Grids and networked computers is the Network Weather Service (NWS) [8]. NWS uses a large set of linear models for host load prediction, and combines them in a mixture-of-experts approach that chooses the best performing predictor. The RPS toolkit [18] uses a set of linear host load predictors. One such common linear technique is called BM(p), or Sliding Window, which records the states occupied in the last N intervals and averages those values. Computer and system hardware failure has been extensively studied in previous work [66][69][68]. Schroeder et al. analyze nine years of failure data from the Los Alamos National Laboratory, including more than 23,000 machine failures. The authors study the root causes of failure and the mean time to failure of the resources [68]. Sahoo et al. investigate system failures from a network of 400 heterogeneous servers for patterns and distribution information [66]. Gibson et al. analyze disk replacement data from high-performance computing sites and Internet sites for mean time to failure distributions [69]. It is not the purpose of this work to specifically target hardware failures. Instead, this approach analyzes and predicts the reasons machines become unavailable to the Grid, including resource failure, user presence, and high local CPU load. Unavailability due to resource failure, while not uncommon, is less frequent than other forms of unavailability. For this reason, the prediction techniques proposed here forecast all types of unavailability, not just hardware failure unavailability. Importantly, this work attempts to predict the future occurrence of individual instances of resource unavailability rather than fitting failure data to probability distributions. In availability prediction, Ren et al. [55] [54] [56] use empirical host CPU utilization and resource contention traces to develop the only other multi-state model, prediction technique, and multi-state prediction based scheduler for resource availability. Their multi-state availability model includes five states, three of which are based on the CPU load level (which resides in one of three zones); the two other states indicate memory thrashing and resource unavailability. This model neither captures user presence on a machine, nor allows for differentiation between graceful and ungraceful process eviction. For prediction, the authors count the transitions from available to each state during the previous N days to produce a Markov chain for state transition predictions. These transition counts determine the probability of transitioning to each state from the available state. The Transitional N-Day Equal Weight (TDE) predictor (Chapter 4) uses a similar approach, but weighs each day's

transitions separately by first computing the probabilities for each of the N days and then combining them. This is in contrast to the Ren predictor, which sums all the transitions through the N days and then computes the probability according to their algorithm. Experiments show that the TDE method more effectively combines many days without allowing a single day's behavior to dominate the average. Also, Ren only counts the number of transitions to each state from available (i.e. three counters), uses these counts to determine the probability of exiting available for each state, then computes the probability of staying available as one minus the sum of the three exit state probabilities. In contrast, the TDE approach counts the number of times a resource completes a period of time equal to the prediction length during resource history analysis. This additional state counter is summed with the non-available state counters, and then each state count is divided by the total count to produce the four probabilities. This technique allows the predictor to more accurately determine how often a resource completes the time interval required by the application. To further differentiate the approaches proposed in this work, the most successful scheduling predictor examines the most recent N hours of behavior instead of the time interval in question on the previous N days. The proposed predictors also employ several transition weighting schemes for further prediction improvement and enhanced scheduling results. These differences significantly influence scheduling results. Mickens and Noble [45] [44] [46] use variations and combinations of saturating counters and linear predictors, including a hybrid approach similar to NWS's mixture-of-experts, to predict the likelihood of a host being available for various look-ahead periods. A Saturating Counter predictor increases a resource-specific counter during periods of availability, and decreases it during unavailability. A History Counter predictor gives each resource 2^N saturating counters, one for each of the possible availability histories dictated by the last N availability sampling periods. Predictions are made by consulting the applicable counter value associated with the availability exhibited in the last N sampling periods. Pietrobon and Orlando [5] use regression analysis of past job executions to predict whether a job will succeed. Nurmi et al. [47] model machine availability using Condor traces and an Internet host availability dataset, attempt to fit Weibull, hyper-exponential, and Pareto models to the availability duration data, and evaluate them in terms of goodness-of-fit tests. They then provide confidence interval predictions for availability durations based on model-fitting [16]. Similarly, Kang and Grimshaw filter periodic unavailabilities out of resource availability traces and then apply statistical models to the remaining availability data [3]. Finally, several machine learning techniques use categorical time-series data to predict rare target events by mining event sets that frequently precede them [76] [78] [65].
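Returning to the transition-counting idea behind the TDE predictor contrasted with Ren's approach above, the following is a minimal equal-weight sketch; the history format and state names are illustrative assumptions, and the predictors in Chapter 4 add the weighting schemes and history-selection variants just described.

```python
from collections import Counter

UNAVAILABLE = ("user_present", "cpu_threshold_exceeded", "down")

def tde_predict(day_histories, pred_length):
    """Per-day transition counting with equal-weight combination.

    `day_histories` holds one event list per analyzed day; each event is a
    (state, duration) pair. For each day, count transitions out of Available
    into each unavailable state, plus one "survival" for every completed
    interval of length `pred_length`; convert the counts to per-day
    probabilities, then average the days so no single day dominates.
    """
    per_day = []
    for events in day_histories:
        counts = Counter()
        for state, duration in events:
            if state in UNAVAILABLE:
                counts[state] += 1
            elif state == "available":
                counts["available"] += int(duration // pred_length)
        total = sum(counts.values())
        if total:
            per_day.append({s: counts[s] / total for s in UNAVAILABLE + ("available",)})
    if not per_day:
        return {}
    return {s: sum(day[s] for day in per_day) / len(per_day)
            for s in UNAVAILABLE + ("available",)}
```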

Chapter 4's prediction approach differs from the current work by developing novel state transition counting schemes, by proposing and investigating numerous methods for determining which parts of a resource's historical data to analyze, and by using its predictions to forecast a unique set of availability states. Several transition weighting techniques are developed and applied to these algorithms to further increase prediction accuracy. Multi-state availability prediction is used to forecast the occurrence of the model's states, and shows that the improved accuracy provided by the proposed predictors produces significantly improved scheduling results.

2.3 Scheduling

Most Grid scheduling research attempts to decrease job makespan or to increase throughput. For example, Kondo et al. [32] explore scheduling in a volunteer computing environment. Braun et al. [11] explore eleven heuristics for mapping tasks onto a heterogeneous system, including min-min, max-min, genetic algorithms, and opportunistic load balancing. Cardinale and Casanova [13] use queue length feedback to schedule divisible-load jobs with minimal job turnaround time. GrADS [17] uses performance predictions from NWS, along with a resource speed metric, to help reduce execution time. Fewer projects focus on scheduling for reliability. Kartik and Murphy [31] calculate the optimal set of processor assignments based on expected node failure rates, to maximize the chance of task completion. Qin et al. [51] investigate a greedy approach for scheduling task graphs onto a heterogeneous system to reduce reliability cost and to maximize the chance of completion without failure. Similarly, Srinivasan and Jha [72] use a greedy approach to maximize reliability when scheduling task graphs onto a distributed system. Unfortunately, scheduling only for reliability undermines makespan, and scheduling only on the fastest or least loaded machines can be detrimental due to the performance ramifications of resource unavailability. Dogan and Ozguner [19] develop a greedy scheduling approach that ranks resources in terms of execution speed and failure rate, weighing performance and reliability in different ways. Their work does not use availability predictions, and assumes that each synthetic resource follows a Poisson failure probability with no load variation, and that machines which become unavailable never restart. All of these assumptions are removed in the work described in this thesis, which uses availability and load traces from real resources. Amin et al. [3] use an objective function to maximize reliability while still meeting a real-time

deadline. They search a scheduling table for a set of homogeneous non-dedicated processors to execute tandem real-time tasks. These authors also assume a constant failure rate per processor. Some distributed scheduling techniques use availability prediction to allocate tasks. Kondo et al. [33] examine behavior on the previous weekday to improve the chances of picking a host that will remain available long enough to complete a task's operations. Ren et al. [55] also examine scheduling jobs with their Ren N-day predictor. The Ren MTTF scheduler first calculates each resource's mean time to failure (MTTF_i) by summing the probabilities of remaining available through time t, as t goes from 0 to infinity. It then calculates resource i's effective task length (ETL_i), the task's length TL divided by the resource's effective speed: ETL_i = TL / (CR_i * (1 - L_i)), where CR_i is resource i's clock rate and L_i is its predicted average CPU load. The algorithm then selects the resource with the smallest ETL from among the resources whose MTTF values are larger than their ETL. If no such resources exist, it selects the resource with the minimum job completion time, considering resource availability. The scheduling approach in Chapter 5 is most similar to Ren et al.'s, and differs in the following ways: (i) it considers how and why a resource may become unavailable, and attempts to exploit the varying consequences of different kinds of unavailability, (ii) it schedules checkpointable and non-checkpointable jobs differently, to improve overall performance, and (iii) it explicitly analyzes and schedules for the tradeoff between performance and reliability.

2.4 Replication

Replication and checkpoint-restart are widely studied techniques for improving fault tolerance and performance. Data replication makes and distributes copies of files in distributed file sharing systems or data Grids [53][37]. These techniques strive to give users and jobs more efficient access to data by moving it closer, and to mitigate the effects of resource unavailability. Some work considers replica location when scheduling tasks. Santos-Neto et al. [67] schedule data-intensive jobs, and introduce Storage Affinity, a heuristic scheduling algorithm that exploits data reuse patterns to account for data transfer costs and ultimately reduce job makespan. Task replication makes copies of jobs, again for both fault tolerance and performance. Li et al. [4] strive to increase throughput and decrease Grid job execution time by determining the optimal number of task replicas for a simulated and dynamic resource environment. Their analytical model determines the minimum number of replicas needed to achieve a certain task completion probability at a specified time. They compare dynamic rescheduling with replication, and extend

the replication technique to include an N-out-of-M scheduling strategy for Monte Carlo applications. Similarly, Litke et al. [41] present a task replication scheme for a mobile Grid environment. They model resources according to a Weibull reliability function, and estimate the number of task replicas needed for certain levels of fault tolerance. The authors use a knapsack formulation for scheduling, to maximize system utilization and profit, and evaluate their approach through simulation. Silva et al. [7] investigate scheduling independent tasks in a heterogeneous computational Grid environment, without host speed, load, and job size information; the authors use replication to cope with dynamic resource unavailability. Workqueue with Replication (WQR) first schedules all incoming jobs, then uses the remaining resources for replicas. The authors use simulation to compare WQR, with various maximum numbers of replicas (1x, 2x, 3x, etc.), to Dynamic FPLTF [43] and Sufferage [15]. Anglano et al. [7] later extend this work with a technique called WQR Fault Tolerant (WQR-FT), which adds checkpointing to the algorithm. In WQR a failed task is abandoned and never restarted, whereas WQR-FT adds automatic task restart to keep the number of replicas of each task above a certain threshold. Tasks may use periodic checkpoints upon restart. Fujimoto et al. [25] develop RR, a dynamic scheduling algorithm for independent coarse-grained tasks; RR defines a ring of tasks that is scanned in round-robin order to place new tasks and replicas. The authors compare their technique with five others, concluding that RR performed next to the best without knowledge of resource speed or load, even when compared with techniques that utilize such information. Others investigate the relationship between checkpoint-restart and replication. Weissman [79] develops performance models for quantitatively comparing the separate use of the two techniques in a Grid environment. Similarly, Ramakrishnan et al. [52] compare checkpoint-restart and task replication by first analytically determining the costs of each strategy, and then provide a framework that enables plug-and-play of resource behavior to study the effects of each fault tolerant technique under various parameters. Importantly, none of the related work uses on-demand individual resource availability prediction or application characteristics, such as length and checkpointability, to determine if a task replica should be created. In this work, a mix of checkpointable and non-checkpointable jobs is used, and the simulations are driven by real world resource traces that capture resource unavailability. This thesis introduces a new metric for studying the efficiency of replication strategies. Proposed replication techniques are then shown to improve upon results obtained by existing techniques.

Chapter 3 Availability Model and Analysis

Resources in non-dedicated Grids oscillate between being available and unavailable to the Grid. When and how they do so depends on the availability characteristics of the machines, the policies of resource owners, the scheduling policies and mechanisms of the Grid middleware, and the characteristics of the Grid's offered job load. Section 3.1 identifies four availability states and Section 3.2 analyzes a trace to uncover how resources behave according to that model.

3.1 Availability Model

This section identifies four availability states, and several job characteristics that could influence jobs' ability to tolerate resource faults. This discussion focuses on an analysis of Condor [42], but the results translate to any system with Condor's basic properties of (i) non-dedicated distributed resource sharing, and (ii) a mechanism that allows resource owners to dictate when and how their machines are used by the Grid. Section 3.1.1 first discusses the Condor system as a motivation for the availability model, Section 3.1.2 then presents the availability model, and Section 3.1.3 discusses how different types of Grid applications respond to the types of availability described in the model.

3.1.1 Condor

Condor [42] harnesses idle resources from clusters, organizations, and even multi-institutional Grid environments (via flocking and compatibility with Globus [24]) by integrating resource management, monitoring, scheduling, and job queuing components. Condor can automatically create process checkpoints for migration. Condor manages non-dedicated resources, and allows individual owners to set their own policies for how and when they are used, as described below. Default policies dictate the behavior of resources in the absence of customized user policies, and attempt to minimize Condor's disturbance of local users and processes.

Figure 3.1: Multi-state Availability Model: Each resource resides in and transitions between four availability states, depending on the local use, reliability, and owner-controlled sharing policy of the machine.
By default, Condor starts jobs only on resources that have been idle for 15 minutes, that are not running another Condor job, and whose local load is less than 3%. Running jobs remain subject to Condor's policies. If the keyboard is touched or if CPU load from local processes exceeds 5% for 2 minutes, Condor halts the process but leaves it in memory, suspended (if its image size is less than 1 MB). Condor resumes suspended jobs after 5 minutes of idle time, and when local CPU load falls below 3%. If a job is suspended for longer than 1 minutes or if its image exceeds 1 MB, Condor gives it 1 minutes to gracefully vacate, and then terminates it. Condor may also evict a job for a higher-priority job, or if Condor itself is shut down. Condor defaults dictate that jobs whose checkpoints exceed 6 MB checkpoint every 6 hours; those with larger images checkpoint every 12 hours. By default, Condor delivers checkpoints back to the machine that submits the job.
3.1.2 Unavailability Types
Condor's mechanism suggests a model that encompasses the following four availability states, depicted in Figure 3.1:
Available: An Available machine is currently running with network connectivity, more than 15 minutes of idle time, and a local CPU load of less than the CPU threshold. It may or may not be running a Condor job.

User Present: A resource transitions to this state if the keyboard or mouse is touched, indicating that the machine has a local user.
CPU Threshold Exceeded: A machine enters this state if the local CPU load increases above some owner-defined threshold, due to new or currently running jobs.
Down: Finally, if a machine fails or becomes unreachable, it directly transitions to Down.
These states differentiate the types of unavailability. In the context of this work, resource unavailability refers to the User Present, CPU Threshold Exceeded and Down states. If a job has been suspended for 15 minutes or the machine is shut down, this is called a graceful transition to unavailable; a transition directly to Down is ungraceful. This model is motivated by Condor's mechanism, but can reflect the policies that resource owners apply. For example, if an owner allows Condor jobs even when the user is present, the machine never enters the User Present state. Increasing the local CPU threshold decreases the time spent in the CPU Threshold Exceeded state, assuming similar usage patterns. The model can also reflect the resource's job rank and suspension policy by showing when jobs are evicted directly without first being suspended.
3.1.3 Grid Application Diversity
Grid jobs vary in their ability to tolerate faults. A checkpointable job need not be restarted from the beginning if its host resource transitions gracefully to unavailable. Only Condor jobs that run in the standard universe [75] are checkpointable. This requires re-linking with Condor's condor_compile tool, which does not allow jobs with multiple processes, interprocess communication, extensive network communication, file locks, multiple kernel-level threads, files open for both reading and writing, or Java or PVM applications [75]. These restrictions demonstrate that only some Grid jobs will be checkpointable. Another important factor is job runtime; Grid jobs may complete in a few seconds, or may require many hours or even days [8]. Longer jobs will experience more faults, increasing the importance of their varied ability to deal with them. Grid resources will have different characteristics in terms of how long they stay in each availability state, how often they transition between the states, and which states they transition to. Different jobs will behave differently on different resources. If a standard universe job is suspended and then eventually gracefully evicted, it could checkpoint and resume on another machine. An ungraceful transition requires using the most recent periodic checkpoint. A job that is not checkpointable must restart from the beginning, even when gracefully evicted.
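To make the distinction concrete, the following minimal Python sketch models the four states and classifies a departure from Available as graceful or ungraceful. It is an illustration only, not part of Condor; the state names and the 15-minute suspension threshold are taken from the description above.

```python
from enum import Enum

class State(Enum):
    AVAILABLE = "Available"
    USER_PRESENT = "User Present"
    CPU_EXCEEDED = "CPU Threshold Exceeded"
    DOWN = "Down"

def is_graceful(next_state: State, suspended_minutes: float) -> bool:
    """Classify a departure from Available.

    A direct transition to Down is ungraceful; a transition to User Present
    or CPU Threshold Exceeded is graceful once the job has been suspended
    long enough (15 minutes here) to checkpoint and vacate.
    """
    if next_state == State.DOWN:
        return False
    return suspended_minutes >= 15

print(is_graceful(State.DOWN, suspended_minutes=0))           # False: ungraceful eviction
print(is_graceful(State.USER_PRESENT, suspended_minutes=20))  # True: graceful eviction
```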

The point is that job characteristics, including checkpointability and expected runtime, can influence the effectiveness of scheduling those jobs on resources that behave differently according to their transitions between the availability states identified in Section 3.1.2.
3.2 Availability Analysis
The characterization and analysis of Grid resources can be used to provide insight into resource behavior for multiple purposes. The data can be used by researchers to better understand resource behavior so that more realistic availability models can be built for simulation-based experimentation. More importantly for this work, the gathered traces can be directly used to drive simulations to test various prediction, scheduling and replication strategies. Furthermore, analysis of the availability patterns can identify trends and behaviors that can be exploited to create more accurate availability predictors and scheduling strategies.
3.2.1 Trace Methodology
For this study, a four-month Condor resource pool trace at the University of Notre Dame was accessed, organized, and analyzed. The trace consists of time-stamped CPU load (as a percentage) and idle time (in seconds). Condor records these measurements approximately every 16 minutes, and makes them available via the condor_status command. Idle times of zero imply user presence, and the absence of data indicates that the machine was down. The data recorded by Condor precludes determining whether the machine was intentionally shut down or failed unexpectedly, because both appear the same in the data. Since the Down state is relatively uncommon, this study conservatively assumes that all transitions to the Down state are ungraceful. Also, a machine is considered to be in the user state only after the user has been present for 5 minutes; this policy filters out short unavailability intervals that lead to application suspension, not graceful eviction. A machine is considered to be in the CPU Threshold Exceeded state if its local (i.e., non-Condor) load is above 5%. Otherwise, a machine that is online with no user present and CPU load below 5% is considered Available. This includes machines currently running Condor jobs, which are clearly available for use by the Grid. On SMP machines, this study follows Condor's approach of treating each processor as a separate resource. The trace data was processed and analyzed to categorize resources according to the states proposed in the multi-state model. Again, the goal is to identify trends that enable multi-state availability prediction and scheduling.
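As an illustration of this methodology, a minimal Python sketch of the state-inference step follows. The record layout (a dict with 'cpu_load' and 'idle_seconds', or None when no data was recorded) and the threshold parameter are assumptions made for the example; the five-minute user-presence filter described above would be applied in a later pass over the inferred states.

```python
def infer_state(sample, cpu_threshold=0.5):
    """Map one trace measurement to an availability state.

    `sample` is assumed to be {'cpu_load': fraction of local (non-Condor)
    load, 'idle_seconds': keyboard/mouse idle time}; None means the
    machine reported no data for that measurement interval.
    """
    if sample is None:
        return "Down"                      # absence of data: machine unreachable
    if sample["idle_seconds"] == 0:
        return "User Present"              # zero idle time implies a local user
    if sample["cpu_load"] > cpu_threshold:
        return "CPU Threshold Exceeded"    # local load above the owner's threshold
    return "Available"

print(infer_state({"cpu_load": 0.02, "idle_seconds": 3600}))  # Available
print(infer_state(None))                                      # Down
```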

Figure 3.2: Machine State over Time
3.2.2 Condor Pool Characteristics
First, the pool of machines is examined as a whole. This enables conclusions about the machines' aggregate behavior.
Pool State over Time
Figure 3.2 depicts the number of machines in each availability state over time. Gaps in the data indicate brief intervals between the four months of data. The data shows a diurnal pattern; availability peaks at night and recedes during the day, but rarely falls below 3 available machines. This indicates workday usage patterns and a policy of leaving machines turned on overnight, as exhibited by the diurnal User Present pattern. The number of machines that occupy the CPU Threshold Exceeded state is less reflective of daily patterns. Local load appears to be less predictable than user presence, and the number of Down machines also does not exhibit obvious patterns. Figure 3.3 depicts the overall percentage of time spent by the collective pool of resources in each availability state. The Available state encompasses 62.9% of the total time, followed by Down (29.7%), CPU Threshold Exceeded (3.7%) and User Present (3.6%).

Figure 3.3: Total Time Spent in Each Availability State
Figure 3.4: Number of Transitions versus Hour of the Day
Daily and Hourly Transitions and Durations
This section examines how often resources transition between states, and how long they reside in those states. The data is reported as a function of both the day of the week and the hour of the day. State transitions can affect the execution of an application because they can trigger application suspension, lead to checkpointing and migration, or even require restart. Figure 3.4 shows the number of transitions to each state by hour of day. Again, the numbers of transitions to both the Available and User Present states show clear diurnal patterns. Users return to their machines most often at around 3 PM, and most seldom at 5 AM. Machines also frequently become available at midnight and 1 AM. Transitions to the CPU Threshold Exceeded and Down states also seem to exhibit similar (but slightly less regular) patterns.

Figure 3.5: Number of Transitions versus Day of the Week
Figure 3.5 reports the daily pattern of machine transitions among states. Transitions to Available, User Present and CPU Threshold Exceeded fall from highs at the beginning of the week to lows on the weekend. Down transitions show less regular patterns. Figure 3.6 investigates state duration and its patterns. It plots the average duration that resources remain in each state, beginning at each hour of the day. Expectedly, the average duration of availability is at its lowest in the middle of the day, since this is when users are present. Interestingly, CPU Threshold Exceeded and Down durations both reach their minimum at mid-day, meaning that whereas machines are not typically completely free, they also are not running CPU-intensive non-Condor jobs. Down duration is expectedly higher at night, when users abandon their machines until morning. The duration of the CPU Threshold Exceeded state is at its lowest during the day, most likely due to short bursts of daytime activity and longer, non-Condor jobs being run overnight. Finally, Figure 3.7 examines the weekly behavior of the duration that resources reside in each state. The average length of time spent in the Available state increases throughout the week (with the exception of Thursday), while User Present duration decreases during the week. CPU Threshold Exceeded duration seems to be less consistent. The Down state duration exhibits fairly consistent behavior, with the exception of Thursday.

Figure 3.6: State Duration versus Hour of the Day
Figure 3.7: State Duration versus Day of the Week

Figure 3.8: Machine Count versus State Duration
3.2.3 Individual Machine Characteristics
This section examines the characteristics of individual resources in the pool. Figure 3.8 and Figure 3.9 plot cumulative machine counts such that an (x, y) point in the graph indicates that y machines exhibit x or fewer (i) average hours per day in a state (Figure 3.8) or (ii) transitions per day to a state (Figure 3.9). The first aspect explored is how long, on average, each node remains in each state. Figure 3.8 shows the distribution according to the average time spent in each of the four states. For Availability, the data suggests three distinct categories: 144 resources average 9.4 hours or less, 328 average between 7.5 and 195 hours, and the top 31 resources average more than 195 hours. For User Present, 173 resources average less than 3 minutes of user presence; of those, 14 have no user activity. The remaining 33 resources have longer (over 3 minutes) average durations of user presence. For CPU Threshold Exceeded, 49 have no local load occurrences, 4 resources have short bursts of 6 minutes or less, 353 resources have longer average usage durations (but less than 6 hours), and the remaining 61 resources remain in their high-usage state for over 6 hours on average. Average time in the Down state shows that 288 resources have durations less than 25 hours; 96 never transition to the Down state. The top 38 resources have Down durations exceeding 1,7 hours (mostly offline).

Figure 3.9: Machine Count versus Average Number of Transitions Per Day
Figure 3.9 examines the average number of resource transitions to each state per resource per day. 78.7% of resources transition to Available fewer than twice per day. The top 2% transition over 4.5 times on average. As reported earlier, 14 machines (2.7%) have no user activity. On the other hand, 399 machines (79.3%) infrequently experience local users becoming present (fewer than 2 transitions to User Present per day). Users frequently come and go on the remaining 14 resources (2.6%), with some of these experiencing up to 7 transitions to User Present per day on average. CPU Threshold Exceeded transitions are bimodal, with 468 machines (93%) having 1 or fewer transitions to CPU Threshold Exceeded per day (49 of those have none). The remaining machines average more than 1 transition and peak at 3.7 transitions each day. Finally, the Down state is relatively uncommon, with 96 resources (19%) never entering it, 341 (67.7%) transitioning to Down less than once per day on average, and only the top 5 machines (1%) averaging two or more transitions.
Machine Classification
This section classifies resources by considering multiple state characteristics simultaneously. This approach organizes resources based on which types of applications they would be most useful in executing. In particular, the average availability duration dictates whether the application is likely to complete; how a machine transitions to unavailable (gracefully or ungracefully) determines whether the job can checkpoint before migrating.

Table 3.1: Machine Classification Information
Machines are classified based on average availability duration and the number of graceful and ungraceful transitions to unavailable per day. Table 3.1 reports the number of machines that fit these categories. The thresholds for the Availability categories are 65.5 (High), 39.7 (Medium High), and 6.5 (Medium Low) hours. More than 1.6 graceful transitions to unavailable per day is considered High, and two or more ungraceful transitions during the four months is considered High. Roughly 36% of machines have Medium High or High availability durations with both a low number of graceful and ungraceful transitions, making them highly available and unlikely to experience any kind of job eviction. In contrast, about 24% of resources exhibit low availability and a high number of graceful transitions; about 75% of these also have high ungraceful transitions. The low availability group would be most appropriate for shorter-running checkpointable applications. The last significant category (approximately 8.7%) has Medium Low availability and few graceful and ungraceful transitions. This group lends itself to shorter-running applications with or without the ability to checkpoint.
Classification Implications
This section demonstrates the implications of Table 3.1's resource classification distribution. To test how applications would succeed or fail (a failed application is defined as one that experiences resource unavailability before it completes execution) on resources in the different classes, a series of 1 jobs is simulated, with job runtimes equally distributed between approximately 5 minutes and 12 hours (on a machine running at 25 MIPS). The simulation used each resource's local load at each point in the trace, along with the MIPS score that Condor calculates. Each job was begun at a random available point in each resource's trace and the execution was then simulated. This was done 1 times for each job duration on each resource. Results are included for the resources classified as (i) low availability with high graceful and

ungraceful transitions (18% of the machines), (ii) medium-low availability with low graceful transitions, but high ungraceful transitions (8.7%), (iii) medium-high availability with few graceful and ungraceful transitions (19.3%), and (iv) high availability, also with few graceful and ungraceful transitions (16.7%). These classes represent a broad spectrum of the classification scheme and 62.7% of the total resources. They appear in bold in Table 3.1, and are labeled (i) LA-HG-HUG, (ii) MLA-LG-HUG, (iii) MHA-LG-LUG, and (iv) HA-LG-LUG in Figure 3.10.
Figure 3.10: Failure Rate vs. Job Duration
Figure 3.10 shows that graceful failure rates climb more rapidly for less available resources. The low availability resources also have a much higher ungraceful failure rate than the high and medium-high availability resources, which have a very low ungraceful failure rate. These results are important because of the diversity of Grid applications, as described in Section 3.1.3. Checkpointable applications can tolerate graceful evictions by checkpointing on demand and resuming elsewhere, without duplicating work. However, these same applications are more sensitive to ungraceful evictions; the amount of lost work due to an ungraceful transition to the Down state depends on the checkpoint period. On the other hand, applications that do not checkpoint are equally vulnerable to graceful or ungraceful transitions; both require a restart. Therefore, resources in different classes should host different types of applications. Intuitively, to make the best matches under high demand for resources, a scheduler should match the job duration with the expected availability duration of the machine; machines that are typically available for longer durations should be reserved, in general, for longer-running jobs. Non-checkpointable jobs should run on resources that are not likely to transition ungracefully to Down. Finally, machines with high rates of ungraceful transition to Down should perhaps be reserved for highly replicable jobs.
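A small Python sketch of this classification rule follows. It is illustrative only; the threshold defaults are the values quoted above (and may be tuned), and the class labels match those used in Figure 3.10.

```python
def classify(avg_avail_hours, graceful_per_day, ungraceful_total,
             avail_cuts=(65.5, 39.7, 6.5), graceful_cut=1.6, ungraceful_cut=2):
    """Label a machine with an availability / graceful / ungraceful class."""
    if avg_avail_hours > avail_cuts[0]:
        avail = "HA"        # high availability duration
    elif avg_avail_hours > avail_cuts[1]:
        avail = "MHA"       # medium-high
    elif avg_avail_hours > avail_cuts[2]:
        avail = "MLA"       # medium-low
    else:
        avail = "LA"        # low
    graceful = "HG" if graceful_per_day > graceful_cut else "LG"
    ungraceful = "HUG" if ungraceful_total >= ungraceful_cut else "LUG"
    return f"{avail}-{graceful}-{ungraceful}"

print(classify(80.0, 0.3, 0))   # "HA-LG-LUG": suited to long, non-checkpointable jobs
print(classify(4.0, 2.5, 6))    # "LA-HG-HUG": suited to short, checkpointable jobs
```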

Table 3.2: Machine Classification Scheduling Results
A second simulation demonstrates the importance of the classes. 1 jobs of random length were created (equally distributed between 5 minutes and 12 hours), half of which were checkpointable with a period of six hours. Two schedulers are tested: one that maps jobs to resources randomly, and one that prioritizes classes according to Table 3.1. The simulator iterates through the trace, injecting an identical random job set and scheduling it with each scheduler. Job execution considers each resource's dynamic local load and MIPS score. When a resource that is executing a job becomes unavailable in the trace, the job is rescheduled. If the job is checkpointable and the transition is graceful, the job takes a fresh checkpoint and migrates. An ungraceful transition to unavailability requires restarting from the last successful periodic checkpoint. Non-checkpointable jobs must always restart from their beginning. The classifier used months one and two to classify the machines; the schedulers ran in months three and four. Each subsequent month was simulated three times and the average values were taken for that month. Table 3.2 reports the sum of those averages across both months. Jobs experience more evictions, and therefore take an average of 31.6% longer, when the scheduler ignores these unavailability characteristics.

Chapter 4 Multi-State Availability Prediction
Predicting the future availability of a resource can be extremely useful for many purposes. First, resource volatility can have a negative impact on applications executing on those resources. If the resource on which an application is executing becomes unavailable, the application will have to restart from the beginning of its execution on another resource. This wastes valuable cycles and increases application makespan. Prediction can allow the scheduler to choose resources that are least likely to become unavailable and avoid these application restarts. Second, replication can also be used to mitigate task failure by creating a copy of an application. Since the system can only support a finite number of replicas, it is useful to predict which applications are most likely to experience resource unavailability and replicate those jobs. This can increase the efficiency of the system. Third, when storing data on distributed systems, resource unavailability can be predicted to help ensure that data remains accessible despite that unavailability. This work studies the use of predictions for both scheduling and replication in later chapters. This chapter first proposes prediction strategies in Section 4.1, then analyzes their accuracy in various configurations in Section 4.2.1, and finally compares these predictors' accuracy with related-work predictors in Section 4.2.2.
4.1 Prediction Methodology
A multi-state prediction algorithm takes as input a length of time (e.g., estimated application execution time), and uses a resource's availability history to predict the probabilities of that resource next exiting the Available state into each of the non-available states, and of remaining available throughout the interval; these probabilities sum to 100%. The proposed availability predictor outputs four probabilities, one each for entering Figure 3.1's User Present, CPU Threshold Exceeded and Down states next, and one for the probability of completing the interval without leaving the Available state (Completion).

It is often useful to think of the highest computed probability for a given prediction as its surety: the higher the largest transition probability, the more sure the predictor is of its forecast. Since a prediction algorithm examines a resource's historical availability data, the two important aspects that differentiate predictors are (i) which part of the resource's history the predictor analyzes and (ii) how the predictor analyzes that section of availability history. The approaches presented here are classified along these two dimensions; by combining when and how historical data is analyzed, a range of availability prediction strategies is created. In determining when the predictor should analyze, this study takes two different approaches. Previous work [54] advocates examining a resource's behavior on previous days during the same interval. Therefore, the first approach investigated examines a resource's past N days of availability behavior during the interval being predicted. This approach is called N-Day or Day. Another approach examines a resource's most recent N hours of activity immediately preceding the prediction (N-Recent). The proposed predictors analyze the segments of resource availability history in two ways. The first makes predictions based on the resource's state durational behavior, by calculating the percentage of time spent in each state during the analysis period. The rationale is that the state that is occupied the majority of the time could be the most likely next state. The second and more successful approach considers a resource's transitional behavior, by counting the number of transitions from Available to each of the other states. Also, for every interval of time a resource was available, the predictor counts how many times the resource could have completed the requested job duration (e.g., a ten-hour availability interval and a two-hour job means five completions). The probability for each exit state (as well as for the completion state) is calculated by dividing that state's count by the total count across all states. For both the durational and transitional approaches, when combining different sections of availability history (e.g., when combining the N days' probabilities together), the probabilities for each state are averaged. The Transitional schemes can weight all transitions equally (the Equal weighting scheme); weighting transitions according to when the transition occurred is also investigated. The predictors can weight transitions according to the time of day (Time), in response to previous observations that this correlates with future behavior (Chapter 3) [59]. This means that for the Time weighting scheme, the closer a transition time is to the time of day at which the prediction occurs, the higher the weight of that transition. In other words, if a prediction is made at 3 PM, then transitions that occurred closest to 3 PM, today and on other days, have the largest weights.

Lastly, Freshness weighting gives a higher weight to transitions that occurred closer to the time of the prediction and, likewise, a smaller weight to events that occurred further in the past. Related-work predictors, described in Section 2.2, are also evaluated for comparison. Since linear regression [18] is such a prevalent prediction method in many disciplines, this study includes both Sliding Window Single and Sliding Window Multi. Because states are categorical, they must be converted to numerical values; the following conventions are used. In the single-state version of sliding window (SW Single), the Available state is assigned the value 1 and any other state 0. In the multi-state version (SW Multi), the Available state is assigned the value 1, User Present 2, CPU Threshold Exceeded 3 and Down 5. Counter-based predictors such as those used for resource availability prediction [45] are evaluated, including Saturating Counter and History Counter. Just as in [45], a 2-bit saturating counter is updated every hour. The Completion predictor, which always predicts that the machine will complete the interval in the Available state, is used as a baseline comparison. Finally, Ren's N-Day multi-state availability predictor (called simply Ren) is implemented for comparison (see Section 2.2 for a full description of the Ren predictor) [55].
4.2 Predictor Accuracy Comparison
This section evaluates the predictors' accuracy, focusing on predictors that analyze a resource's history with the transitional method and considering both the N-Day and the N-Recent-hours approaches for determining which part of a resource's history to examine. Results for the durational method are excluded due to its inferior performance. This study uses six months of availability data from the 63-node Condor pool at the University of Notre Dame as its data set. This includes two additional months of data not reported in Chapter 3. This trace was taken during the months of January through June of 27 and is analyzed in Section 3.2. Again, measurements for CPU load, resource availability and user presence were taken every 15 minutes for each resource, and the states of availability are inferred according to the proposed multi-state model. Each predictor performed an identical set of 5, predictions on a random resource at a random time during the six-month trace. Predictor accuracy is defined as the ratio of correct predictions to the total number of predictions. A correct prediction is one for which the machine is predicted to exit to a certain non-available state and it does, or for which the machine is predicted to remain available throughout the interval, and it does.
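For illustration, the following Python sketch implements the equal-weighted transitional calculation described in Section 4.1 under a simplified history representation (a list of (state, duration-in-hours) intervals). It is an example of the idea rather than the exact implementation used in this work.

```python
def transitional_prediction(history, job_hours):
    """Equal-weighted transitional predictor over one analysis window."""
    counts = {"User Present": 0, "CPU Threshold Exceeded": 0, "Down": 0, "COMPLETE": 0}
    prev = None
    for state, hours in history:
        if prev == "Available" and state != "Available":
            counts[state] += 1                              # one exit from Available
        if state == "Available":
            counts["COMPLETE"] += int(hours // job_hours)   # 10h interval, 2h job -> 5
        prev = state
    total = sum(counts.values())
    if total == 0:
        return {k: 1 / len(counts) for k in counts}         # no evidence: uniform probabilities
    return {k: v / total for k, v in counts.items()}        # probabilities sum to 100%

history = [("Available", 10), ("User Present", 1), ("Available", 4), ("Down", 2)]
print(transitional_prediction(history, job_hours=2))
# {'User Present': 0.11..., 'CPU Threshold Exceeded': 0.0, 'Down': 0.11..., 'COMPLETE': 0.77...}
```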

Figure 4.1: Transitional Predictor Accuracy
4.2.1 Prediction Method Analysis
Figure 4.1 depicts the accuracy of the Transitional N-Day with Equal transition weights (TDE) predictor and the Ren N-Day predictor, in relation to the number of days each analyzes, for predictions with lengths uniformly distributed between five minutes and the number of hours indicated in the legend. Recall that TDE examines the resource's state transition behavior (Transitional) during the requested prediction interval starting at the same time during each of the previous N days (N-Day). It then calculates the state exit probabilities based on the number of transitions from Available to each of the states separately for each day, combining the resulting probabilities across days. For predictions of lengths between five minutes and 12, 25, 6 and 12 hours, the graph shows that TDE exhibits an increase in accuracy over Ren when analyzing any number of days. For example, for predictions between 5 minutes and 25 hours, TDE's accuracy increases through 16 days, at which point it peaks at 78.3%. In this case, Ren peaks at 73.7% and then falls abruptly when considering more than one day. Notice that Ren's predictor decreases in accuracy when analyzing more days, whereas TDE can better incorporate new information into its prediction. For all prediction lengths, TDE initially increases in accuracy, then levels off when acquiring more information (considering more days), whereas Ren becomes less accurate the more days it considers.

Figure 4.2: Weighting Technique Comparison
Overall, TDE is 4.6% more accurate than the Ren predictor. Figure 4.2 examines the Transitional Recent-hours (TR) predictor, configured to analyze the most recent N hours of availability with various transition weighting schemes (Equal, Freshness and Time of day), for predictions between five minutes and 25 hours (other lengths are not examined due to space constraints). Again, this predictor examines the resource's state transition behavior (Transitional) but, in contrast to TDE, does so for the last N hours before the time that the prediction is made (Recent). For all weighting schemes, as the predictor considers more hours, accuracy increases dramatically at first; the increase then slows, reaches a maximum, and slowly declines. The Freshness (TRF) weighting scheme provides the best performance, reaching the highest accuracy of 77.3% when examining the past 48 hours of behavior. TRF weights each transition t according to the following formula: W(t) = M^(T_t / L), where M is the recentness weight set by the user (the default value is five), T_t is the length of time that elapsed from the beginning of the analysis interval to the transition t, and L is the total analysis interval length.

Figure 4.3: Prediction Accuracy by Duration
In the next section, the analysis focuses on the TDE and TRF predictors due to their high accuracy (78.3% and 77.3%, respectively).
4.2.2 Analysis of Related Work
This section compares the accuracy of the best-performing predictors, TDE and TRF, to several existing approaches from the literature, namely the Saturating and History Counter predictors [45] [44] [46], the Multi-State and Single-State Sliding Window predictors [18], the Ren predictor [55] [54] [56], and the Completion predictor. Figure 4.3 depicts predictor accuracy versus prediction length, for predictions up to 12 hours. The Counter-based predictors, Completion predictor, and Sliding Window predictors perform similarly to one another, compete well with TRF, TDE and Ren for predictions up to 2 hours, and then sharply decline, never leveling off. In contrast, the TRF, TDE, and Ren predictors decrease initially and then level off; the rate of decrease for Ren is somewhat larger as the prediction length increases past 6 hours. The Ren predictor initially has lower accuracy, with accuracy decreasing even faster than the Completion predictor for prediction durations shorter than 19 hours.

During these short intervals, the Ren predictor is up to 5.6% less accurate than the TDE predictor, and 5.2% less accurate than TRF. Figure 4.3 demonstrates that for predictions shorter than 19 hours, TDE is the most accurate, accounting for its 1% increase in accuracy over TRF in Section 4.2 for predictions between five minutes and 25 hours. However, TRF becomes the most accurate predictor for predictions longer than 42 hours, reaching an accuracy increase of 9.8% over Ren and 3.1% over TDE for predictions of 12 hours. Chapter 5's scheduling results demonstrate that this large difference in accuracy for longer predictions is critical. Chapter 5 focuses its prediction-based scheduling analysis on TRF because of the improved schedules it produces. Predictors that perform better for long-term predictions lead to the best scheduling results when used with the schedulers proposed in Chapter 5, even when scheduling shorter jobs. As the results demonstrate in Section 5.4.2, this explains why using the TRF predictor decreases makespan and the number of job evictions when compared to using the Ren predictor, even when using an identical prediction-based scheduler.

Chapter 5 Prediction-Based Scheduling
Resource selection (choosing the resource on which to execute a given task) is a complex scheduling problem. Resource selection can depend on the length of the job, its ability to take a checkpoint, the load on the system, the availability of resources, the resource's reliability, and possible time constraints on application completion [61][58]. Additionally, the performance metric can vary from reduced job makespan or increased job reliability to increased overall system efficiency and throughput. These factors lead to a complex scheduling environment involving many different tradeoffs. The resource selection problem is further complicated by resource volatility. Unexpected resource unavailability can have a large impact on scheduling performance. Schedulers that are aware of this resource volatility can make more informed decisions on job placement, avoiding less reliable resources. This chapter focuses on dealing with resource volatility through informed job placement, to reduce job evictions by avoiding resource unavailability, which in turn can decrease average job makespan. To that end, this chapter investigates scheduling jobs with the aid of resource availability predictions such as those introduced in Chapter 4. This chapter is organized as follows. First, Section 5.1 explains the simulator setup and scheduling methodology. Section 5.2 demonstrates the inherent tradeoff between scheduling for reliability and scheduling for performance, and Section 5.3 investigates various prediction-based scheduling heuristics and their performance across various job lengths and checkpointabilities. Lastly, Section 5.4 compares the prediction-based scheduler's results to various prediction-based and non-prediction-based scheduling approaches across job lengths.
5.1 Scheduling Methodology
This work investigates various scheduling techniques through simulation-based experiments that vary the number of jobs, their characteristics, and the schedulers' scoring techniques. The simulations in this chapter are driven by the six-month trace of the Notre Dame Condor pool (approximately 6 nodes).

A certain number of jobs are simulated executing on these machines, each machine utilizing its own recorded availability and load measurements. The simulation-based experiments presented here create and insert each application at a random time during the six-month simulation, such that application injection is uniformly distributed across the six months. The simulation assigns each application a duration (i.e., a number of operations needed for completion); application duration is uniformly distributed between two bounds, such as five minutes and 25 hours (the duration is an estimate based on the execution of the application on a resource with an average MIPS speed and no load). This uniform distribution of job insertion times and durations allows the experiments to test the quality of the predictor and scheduler combination with equal weight for all prediction durations and prediction start times, without emphasizing a particular duration or start time. In all simulations in this chapter, 25% of the applications are checkpointable; that is, they can take an on-demand checkpoint in five minutes if the user returns or if the local CPU load goes above 5%, leading to eviction and rescheduling. The remaining 75% of jobs are non-checkpointable and must start over upon eviction. Each machine's MIPS score (as defined by Condor's benchmarking tool), and the load and availability state information contained in the trace, influence the simulated running of the application. During each simulator tick, the simulator calculates the number of operations completed by each working resource, and updates the records for each executing application. If a resource executing an application leaves the Available state (as per the trace), effectively evicting the application, the executing application joins the back of the application queue for rescheduling. The number of operations that remain to be completed for this evicted application depends on the checkpointability of the application and the type of eviction (Section 3.1.3). All waiting applications (scheduled or rescheduled) remain queued until they reach the head of the queue, at which point they are rescheduled. During each simulator tick, the scheduler places applications off the head of the queue until the next application cannot be scheduled (e.g., because no resources are available). To facilitate resource-score-based scheduling (prediction-based scheduling as well as other scheduling heuristics), a scheduling algorithm called the Prediction Product Score (PPS) Scheduler is defined. The PPS Scheduler scores each available resource and maps the job at the head of the queue onto the resource with the highest score (ties are broken arbitrarily). The resource scoring policy (how each resource is scored) defines that scheduler's placement policy. Various resource scoring approaches are examined throughout the experiments presented here. Scheduling quality is analyzed according to average job makespan and evictions.

Makespan is the time from submission to completion; average makespan is calculated across all jobs in the simulation. For evictions, the simulator counts the total number of times that jobs need to be rescheduled because a resource running a job transitions from available to one of the unavailable states.
5.2 Reliability-Performance Relationship
This section establishes the inherent tradeoff between scheduling for reliability and scheduling for performance in a real-world environment. In the context of this work, resource reliability refers to the ability of a resource to consistently complete applications without experiencing unavailability, and the term is used informally; it is not used formally as in other publications [51]. Schedulers that favor performance consider the static capability of target machines, along with current and predicted load conditions, to decrease makespan. Other schedulers may instead consider the past reliability and behavior of machines to predict future availability and to increase the number of jobs that complete successfully without interruption due to machines going down or owners reclaiming resources. Schedulers may consider both factors, but cannot in general optimize simultaneously for both performance-based metrics like makespan and reliability-based metrics like the number of evictions or the number of operations that must be re-executed due to eviction (i.e., operations lost) [19]. In fact, the interplay between these two metrics is often quite complex. Choosing a reliable resource can decrease job evictions and, by virtue of that, decrease job makespan (increasing performance). Along the same lines, choosing a fast resource can lead to a shorter job execution time, reducing the likelihood of experiencing a job eviction (increasing reliability). In the real world, there is a large variety in the reliability and performance of resources, and often a scheduler cannot find a resource that offers both high performance and high reliability. In these cases, a scheduler may have to choose between a slow reliable resource, a fast unreliable resource, or anything in between. Section 5.3 explores heuristics for dealing with this scheduling problem; this section demonstrates the tradeoff itself. To investigate the reliability-performance tradeoff, 6 simulated applications are executed on these machines according to the methodology outlined in Section 5.1. Application durations are uniformly distributed between five minutes and 25 hours. The PPS scheduler chooses from among resources for application execution (Section 5.1) and scores each available resource according to the following expression (the resource with the highest score is chosen for application execution):
RS_i = (1 - W) * P_i[COMPLETE] + W * (MIPS_i / MIPS_max) * (1 - L_i)

Figure 5.1: As schedulers consider performance more than reliability (by increasing Tradeoff Weight W), makespan decreases, but the number of evictions increases.
P_i[COMPLETE] is resource i's predicted probability of completing the job interval without failure, according to the TRF predictor (Section 4.2), MIPS_i is the resource's processor speed, MIPS_max is the highest processor speed of all resources (for normalization), and L_i is the resource's current processor load. Figure 5.1 depicts the effect of varying the Tradeoff Weight, and hence the relative influence of reliability or performance in scheduling. As performance is considered more prominently, makespan decreases but the number of evictions increases. In the middle of the plot, a tradeoff weight of 0.5 does achieve a makespan within 6.7% of the lowest makespan on the curve, while simultaneously coming within 18.1% of the fewest number of evictions. Nevertheless, the makespan slope is uniformly negative, and the evictions slope is uniformly positive.
5.3 Prediction-Based Scheduling Techniques
This section explores in more detail a wider range of resource ranking strategies that consider resource performance and reliability. Section 5.3.1 studies the effect of considering various combinations of CPU speed, load, and reliability; the section also introduces the idea of scheduling checkpointable jobs differently from non-checkpointable jobs. The experiments then describe how the best-performing scheduler from Section 5.3.1 behaves as the length of the interval for which the predictor forecasts is varied (Section 5.3.2).

5.3.1 Scoring Technique Performance Comparison
In this section, the PPS scheduler is configured with a range of resource scoring approaches, which utilize some or all of the following factors.
CPU Speed (MIPS): the resource's Condor MIPS rating.
Current Load (L): the machine's utilization at scheduling time.
Completion Probability (P[COMPLETE]): the predicted probability of completing the projected job execution time without becoming unavailable. This is one minus the sum of the probabilities of the three unavailability states.
Ungraceful Eviction Probability (P_i[UNGRACE]): the predicted probability of exiting job execution directly to Down (with no chance for a checkpoint).
Checkpointability: whether the job could take an on-demand checkpoint before being evicted from a machine (CKPT) or not (NON-CKPT).
Considering whether or not the scoring system incorporates each of the first three criteria defines eight different kinds of scoring, ranging from considering none of the criteria (Random scheduling) to including them all. Then, if checkpointable jobs are scheduled differently from non-checkpointable jobs, many more possible approaches emerge. There is also the matter of how to incorporate each factor into the score. Based on intuition (some combinations make more sense than others) and background experimental results, the PPS scheduler (Section 5.1) is configured with the following resource scoring approaches; each expression gives the formula for computing resource i's score, RS_i:
S0: MIPS_i * (1 - L_i)
S1: P_i[COMPLETE]
S2: MIPS_i * P_i[COMPLETE]
S3: MIPS_i * (1 - L_i) * P_i[COMPLETE]
S7: for CKPT jobs, MIPS_i * (1 - L_i) * (1 - P_i[UNGRACE]); for NON-CKPT jobs, MIPS_i * (1 - L_i) * P_i[COMPLETE]
S9: for CKPT jobs, MIPS_i * (1 - L_i); for NON-CKPT jobs, MIPS_i * (1 - L_i) * P_i[COMPLETE]
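As a concrete illustration of these scoring policies and of the PPS selection step, a Python sketch follows. The data structures (resources and jobs as dicts) and the predict callback (returning per-state probabilities such as those produced by the TRF predictor) are assumptions made for the example, not the implementation used in this work.

```python
def score_s3(r, p_complete):
    # Speed, current load, and predicted reliability all factor into the score.
    return r["mips"] * (1 - r["load"]) * p_complete

def score_s7(r, p_complete, p_ungrace, checkpointable):
    # Checkpointable jobs only need to avoid ungraceful (direct-to-Down) exits;
    # non-checkpointable jobs are scored on the full completion probability.
    if checkpointable:
        return r["mips"] * (1 - r["load"]) * (1 - p_ungrace)
    return r["mips"] * (1 - r["load"]) * p_complete

def pps_select(available_resources, predict, job):
    """Score every available resource and return the highest scorer (ties arbitrary)."""
    best, best_score = None, float("-inf")
    for r in available_resources:
        probs = predict(r, job["hours"])   # e.g. a multi-state predictor's output
        s = score_s7(r, probs["COMPLETE"], probs["Down"], job["checkpointable"])
        if s > best_score:
            best, best_score = r, s
    return best
```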

Figure 5.2: The resource scoring decision's effectiveness depends on the length of jobs. For longer jobs, schedulers should emphasize speed over predicted reliability.
S0, S1, S2, and S3 schedule checkpointable jobs the same way they schedule non-checkpointable jobs, with S0 considering speed, S1 considering reliability, and S2 and S3 considering both. However, the fact that checkpointable jobs react differently to various types of resource unavailability suggests that scheduling them differently could improve overall Grid performance; S7 and S9 are considered for this reason. This section executes simulations for the same environment and setup as in Section 5.1, but varies the maximum job length from 12 to 12 hours. P_i[COMPLETE] (an availability prediction) is made by TRF (Section 4.2) and used as input to the resource scoring decisions that utilize it. The results obtained and presented in the remainder of this work are based on simulating each scenario once. However, the consistency and repeatability of the results have been verified through repetition. Figure 5.2 plots average makespan vs. maximum job length, and evictions vs. maximum job length, for all of the proposed resource scoring decisions as a percentage difference versus S0. Percentages are used for comparison throughout this chapter. The percentage difference of one scheduler versus a chosen baseline scheduler is plotted for each particular metric (e.g., operations lost or evictions). Percentages are used to emphasize the differences in scheduling quality and to facilitate visual comparison of performance, given the large range of y values. The text includes the magnitude of the results (e.g., the actual number of evictions) in several places. S1, which considers only reliability, performs significantly worse than all the others in terms of makespan for all maximum job lengths, and also in terms of the number of evictions for longer jobs. This demonstrates that schedulers must consider resource performance when scheduling. The figure also shows that makespans are relatively similar for the other scheduling approaches, but that S3 does the best job of avoiding evictions. For example, for jobs up to 84 hours, S3 obtains 5417 evictions, whereas S1, S7 and S9 obtain approximately 645 evictions (19% more).

Figure 5.3: Scoring decision effectiveness depends on job checkpointability. Schedulers that treat checkpointable jobs differently than non-checkpointable jobs (S7 and S9) suffer more evictions, but mostly for checkpointable jobs; therefore, makespan remains low.
For a shorter job length of 6 hours, S3 causes 86 evictions and S7 and S9 cause about 17 evictions, increasing the disparity to 24%. These results demonstrate that schedulers must consider both reliability and performance while scheduling, to simultaneously produce fewer evictions and shorter job makespan. Figure 5.1's results are for 25% checkpointable jobs. When that percentage varies, the value of considering reliability in the scoring function changes, as does the value of scoring checkpointable and non-checkpointable jobs differently. Figure 5.3 plots makespan, evictions, operations lost, and percentage of total operations lost, as the percentage difference compared with S0. As the percentage of checkpointable jobs increases, the makespan difference increases more quickly for S1 and S3, compared to schedulers S7 and S9, which schedule non-checkpointable jobs on more reliable resources. Figure 5.3 shows that the difference in the number of evictions does increase more for S7 and S9; these evictions, however, are increasingly for checkpointable jobs, so the effect on makespan is small. When a small percentage of jobs are checkpointable, schedulers do well to consider reliability (like S3). The hybrid approaches achieve a balance between reliability and performance at each extreme. The remainder of this chapter considers only the case where 25% of jobs are checkpointable, and therefore primarily uses S3, which performs similarly to S7 and S9 for this checkpointability mix.

Figure 5.4: Schedulers can improve results by considering predictions for intervals longer than the actual job lengths. This is especially true for job lengths between approximately 48 and 72 hours.
5.3.2 Prediction Duration
In the tests described so far, the scheduler asks the predictor about behavior over a prediction interval that is intended to match application runtime. A prediction for the next N hours, however, may not necessarily reflect the best information for scheduling an N-hour job. This section investigates the effect of scheduling an N-hour job using predictions made for the next (M x N) hours, where M is the interval multiplier. M is set at 0.25, 0.5, 1, 2, and 3, with 0.25 plotted as the baseline, along with two hybrid multipliers. The Comparison (Comp or TRF-Comp) scheduler uses M=3 for jobs less than 28 hours, and M=0.25 for longer jobs. The Performance (Perf or TRF-Perf) scheduler ignores reliability for jobs less than 28 hours (instead selecting the fastest, least loaded resources); for longer jobs, it uses M=0.25. Figure 5.4 shows that larger multipliers perform better for jobs up to 28 hours in duration. However, for longer jobs, making predictions for intervals longer than the application runtime increases the number of evictions and the average job makespan. In fact, the longer the job, the smaller the optimal multiplier. This observation motivated the design of the hybrid scheduler, which does well in keeping evictions low through 8-hour jobs, and makespan low, especially for longer jobs. For example, for short jobs of length 6 hours, M=0.25 produces 1173 evictions, M=1 produces 878 evictions and M=3 produces 754 evictions. For 84-hour jobs, M=0.25 has 8,118 evictions, M=1 has 84,662 and M=3 has 87,433. These figures support the conclusion that the longer the job, the shorter the optimal prediction length multiplier should be.
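A hedged sketch of how such a hybrid multiplier might be expressed follows; the 28-hour cutoff and the multiplier values are simply the ones quoted above, and the function names are illustrative rather than part of any existing implementation.

```python
def comp_multiplier(job_hours, short_job_cutoff=28):
    """TRF-Comp: a long prediction window for short jobs, a short one for long jobs."""
    return 3.0 if job_hours < short_job_cutoff else 0.25

def prediction_interval(job_hours):
    # The predictor is asked about the next M * N hours for an N-hour job.
    return comp_multiplier(job_hours) * job_hours

print(prediction_interval(6))    # 18.0: a 6-hour job uses an 18-hour prediction window
print(prediction_interval(84))   # 21.0: an 84-hour job uses a 21-hour window
```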

Figure 5.5: TRF and Ren MTTF show similar trends for both makespan and evictions. For any particular makespan, TRF causes fewer evictions, and for any number of evictions, TRF achieves improved makespan.
5.4 Multi-State Prediction-Based Scheduling
This section compares the proposed predictor and scheduler combination (the PPS scheduler utilizing the TRF predictor, PPS-TRF) with Ren's multi-state prediction-based scheduler (described in Section 2.3) [56], in terms of the ability to trade off reliability and performance. To isolate the effect of the predictor from that of the scheduler, PPS-TRF is also compared with Ren's predictor coupled with the PPS scheduler.
5.4.1 Multi-State Prediction-Based Scheduling
This section explores the Transitional Recent-hours Freshness-weighted (TRF) predictor using the S3 prediction-based scheduler (Section 5.3.1) and Ren's MTTF scheduler (see Section 2.3 for a full description of Ren MTTF). The number of days that Ren's predictor uses is varied, while simultaneously varying TRF's interval multiplier (Section 5.3.2) for comparison. Both of these parameters allow each scheduler to trade off reliability and performance. Figure 5.5 shows the average makespan and number of evictions obtained by TRF and Ren MTTF as the interval multiplier varies for TRF and the number of days analyzed varies for Ren MTTF.

Figure 5.6: TRF causes fewer evictions than Ren (coupled with the PPS scheduler) for all Prediction Length Weights. TRF also produces a shorter average job makespan than all Ren predictors, except the one-day predictor (which produces 45% more evictions).
Varying each parameter allows that scheduler to trade off performance (lower makespan) for reliability (fewer evictions). For any point selected on one of Ren MTTF's curves, however, TRF does better in terms of the other metric. For example, for approximately 15 evictions, TRF has an average makespan that is 27% lower, and for a makespan of 13 hours, TRF has 52% fewer evictions.
5.4.2 Predictor Quality vs. Scheduling Result
To better understand how predictors affect scheduling, PPS is tested with both the TRF predictor and the Ren N-Day predictor [55][54] for a variety of interval multipliers (Prediction Length Weights). 6 jobs are simulated, ranging from 5 minutes to 25 hours in duration, 25% of which are checkpointable. The Ren predictor's N value is varied, as are the prediction length weights (i.e., multipliers, as outlined in Section 5.3.2). The weights are varied between 1 and 7, and the experiments follow the simulation setup explained in Section 5.1. The PPS scheduler with the S3 resource scoring approach is used for resource selection (Section 5.3.1). Figure 5.6 illustrates the percentage difference in makespan versus TRF. For all multiplication factors, TRF produces the fewest job evictions. Both predictors produce the fewest evictions with the 7x multiplier; Ren 2-Day produces 1,34 evictions and TRF produces 1,13 (21% fewer). For the 1x multiplier, Ren 12-Day (1,534 evictions) comes closest to TRF's 1,42 evictions (within 8%). For makespan, Ren 1-Day beats TRF by at most 18% lower makespan. However, Ren 1-Day sacrifices reliability and produces about 45% more evictions. Thus, TRF reduces the number of evicted jobs and can obtain the highest job reliability. Ren 1-Day coupled with the PPS scheduler can obtain a shorter makespan than TRF, but only with a large loss in job reliability.

61 .3 Makespan vs. Job Length 1.5 Evictions vs. Job Length Makespan Evictions 1.5 TRF Comp. TRF Perf. Ren MTTF 1 Ren MTTF 4 Ren MTTF 8 Ren MTTF 16 History Counter Sliding Window Job Length (Hours) Makespan Makespan vs. Job Length Evictions 5 1 Job Length (Hours) Evictions vs. Job Length TRF Comp. TRF Perf. Random CPU Speed Pseudo optimal S Job Length (Hours) Job Length (Hours) Figure 5.7: TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for jobs up to 1 hours long. shorter makespan than TRF, but only with a large loss in job reliability Scheduling Performance Comparison This section further compares the best performing scheduling approaches, TRF-Comp and TRF- Perf (using the S3 resource scoring approach), to other scheduling methods. To more thoroughly understand the characteristics of these schedulers in a variety of conditions, tests are performed with a diverse set of job lengths. Results are reported for simulating 6, jobs, 25% of which are checkpointable, over the six month Notre Dame trace. The jobs range from five minutes to the job length indicated on the x-axis. Figure 5.7 s top two graphs compare the TRF schedulers to Ren-MTTF (with a variety of days analyzed) and to the History and Sliding Window prediction-based schedulers. The graphs plot the percentage difference in both makespan and the number of evictions, vs. TRF-Comp. The History and Sliding Window predictors utilize the Comp with the S3 resource scoring scheduler as well. TRF-Comp maintains comparable makespan as job length increases, peaking at roughly 11% higher than Sliding Window, History, TRF-Perf and Ren MTTF-1 for jobs of up to 4 hours in length. For this same length, TRF-Comp achieves 6% fewer evictions than the next most reliable scheduler, Ren MTTF-1. For all job lengths up to 4 hours, TRF-Comp achieves at least 15% fewer 49

62 Makespan vs. Job Load Evictions vs. Job Load.2 TRF Comp. TRF Perf. Ren MTTF 1 Makespan.1 Evictions 1.5 Ren MTTF 4 Ren MTTF 8 Ren MTTF 16 History Counter Sliding Window Number of Jobs (x 1 4 ) Number of Jobs (x 1 4 ) Makespan vs. Job Load Evictions vs. Job Load 3 4 TRF Comp. TRF Perf. Makespan 2 1 Evictions Random CPU Speed Pseudo optimal S Number of Jobs (x 1 4 ) Number of Jobs (x 1 4 ) Figure 5.8: TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for loads up to 4, jobs evictions when compared with the most reliable scheduler, Ren MTTF-16; average job makespan simultaneously decreases by 2% (27.3 hours versus 32.9 hours). TRF-Comp also decreases the number of evictions by at least 57% compared with all other schedulers, for jobs up to 6 hours long (355 evictions versus 557). TRF-Perf comes within 1% of the shortest makespan (Sliding Window) for shorter lengths, and achieves the shortest makespan for jobs of 8+ hours. The bottom two graphs in Figure 5.7 compare TRF-Comp and TRF-Perf with non-predictionbased scheduling approaches including a Pseudo Optimal scheduler which selects the available resource that will execute the application in the smallest execution time, without failure, based on omnipotent future knowledge of resource availability. When all machines would become unavailable before completing the application, the Pseudo Optimal Scheduler chooses the fastest available resource in terms of MIPS speed. For average job makespan, TRF-Comp follows the non-optimal schedulers and produces the fewest evictions for all job lengths, by at least 11%. Figure 5.8 investigates the effect of varying the load on the system (the number of jobs). The job length is fixed to be uniformly distributed between five minutes and 25 hours and vary the number of jobs injected into the system over the six month trace from 1 to 5,. Again, the top two graphs compare the TRF schedulers to other prediction-based schedulers including Ren-MTTF. The figured demonstrates that TRF-Comp. produces fewer evictions across all loads except in the high 5

63 load case of 5, jobs. TRF-Perf. ties the lowest average job makespan across all loads, improving on TRF-Comp. by approximately 1%. The bottom two graphs compare the two TRF schedulers with traditional scheduling approaches. TRF-Comp. produces the lowest number of evictions across all loads with the exception of the Pseudo-optimal scheduler. Lastly, TRF and the other traditional schedulers produce comparable makespan improvements across all loads. 51
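To make the Pseudo Optimal selection rule described above concrete, the following sketch shows one way it could be implemented in a trace-driven simulator. The helper would_complete_without_failure, the job and resource fields, and the runtime estimate are illustrative assumptions, not PredSim's actual interfaces.

def pseudo_optimal_choice(job, available_resources, trace):
    """Pick the resource that finishes the job soonest without failure, using
    omniscient knowledge of future availability from the trace. Falls back to
    the fastest available resource (MIPS) if no resource can finish the job
    before becoming unavailable."""
    best, best_finish = None, float("inf")
    for r in available_resources:
        # simulated execution time on this resource, discounted by its current load
        runtime = job.operations / (r.mips * (1.0 - r.cpu_load))
        if trace.would_complete_without_failure(r, job.submit_time, runtime):
            finish = job.submit_time + runtime
            if finish < best_finish:
                best, best_finish = r, finish
    if best is None:
        # every candidate would fail the job: choose the fastest machine instead
        best = max(available_resources, key=lambda r: r.mips)
    return best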

Chapter 6

Prediction-Based Replication

Prediction-based resource selection is an important tool for dealing with resource unavailability. By choosing the best resources for executing each job, the scheduler can greatly reduce the number of job evictions and their effect. However, encountering job evictions due to unforeseen resource volatility is virtually inevitable. Job replication can reduce the detrimental effect of resource unavailability and benefit performance in two ways. First, replication can mitigate the effects of job eviction by creating copies of jobs, so that even when one copy of a job is evicted, one of its replicas may still complete without encountering resource unavailability. Since an evicted job must find another resource on which to execute, possibly restarting its execution from the beginning (for non-checkpointable jobs), replicas that do not experience resource unavailability can reduce the job's makespan. Second, when jobs are scheduled using imperfect resource ranking heuristics, future resource performance cannot always be anticipated. For example, the local load on a resource may suddenly fluctuate, causing the job currently executing on that resource to take longer to complete than anticipated at the time of scheduling. Replication allows a job to execute on multiple resources, so that a resource which completes a replica before the original copy could complete provides a shorter makespan and an earlier completion time. In these two ways, replication is a scheduling tool that not only deals with unpredictable resource performance but also helps handle volatile resource availability.

One of the main difficulties with replication is its cost, which takes two main forms in this context. First, because there are a limited number of total resources in any distributed system, creating replicas of jobs reduces the available resource pool, and the replicas may take desirable resources away from waiting non-replicated jobs. This can lead to situations in which the desirable resources (fast, available, etc.) in a system are taken by a few jobs and their replicas, leaving only less desirable resources for future incoming jobs. Second, in situations such as Grid economies, creating replicas can have a real world cost where a user may have to pay per job, per process, or per cycle. These two costs mean that users must be judicious in selecting which jobs to replicate.

This chapter therefore sets out to test the hypothesis that the TRF availability predictor (Chapter 4) can help select the right jobs to replicate, and can improve overall average job makespan, reduce the redundant operations needed for the same improved makespan, or both. The replication techniques are tested over a range of system loads and job lengths while measuring average job makespan and the number of extra operations performed (overhead). This chapter tests two classes of replication techniques:

Static techniques (Section 6.2) replicate based on the characteristics of the job being scheduled

Prediction-Based techniques (Section 6.3) consider forecasts about the future availability and load of resources

This chapter first investigates static replication techniques, which replicate based on the characteristics of the job being scheduled. These techniques are then improved by incorporating availability predictions made for the resources chosen to execute a job, in order to make better job replication decisions. Each replication technique is analyzed in terms of a new metric, efficiency. Prediction accuracy is analyzed for its effect on the quality of the replication decisions. Lastly, system load's effect on the performance of the proposed replication techniques is investigated, and three static load adaptive replication techniques are proposed that consider overall system load and the desired metric in determining which replication strategy to utilize (Chapter 7 goes on to introduce an improved load adaptive replication approach).

All replication experiments in this chapter use the PPS scheduler described in Section 5.1, augmented to support replication. Upon placing each job, the scheduler uses the replication policy to determine how many replicas to make. In these replication experiments, the scheduler makes 0 or 1 replicas. When a task or its replica completes, the system terminates other copies, freeing up the resources executing them. Schedulers whose replication policies require an availability prediction use the TRF predictor, unless specified otherwise. Scheduling quality is analyzed according to average job makespan, extra operations, and replication efficiency, which is defined and discussed later. Task lengths are defined in terms of the number of operations needed to complete them; extra operations refers to the number of additional operations that the system performs for a job, including all lost operations due to eviction, and any operations performed by replicas (or initially scheduled jobs) that do not ultimately finish because some other copy of that job finished first.
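The extra operations accounting above can be sketched as follows; the class and field names are illustrative assumptions rather than PredSim's actual bookkeeping code.

class JobCopy:
    """One running copy of a task: the original placement or its single replica."""
    def __init__(self, total_ops):
        self.total_ops = total_ops   # operations needed to finish the task
        self.done_ops = 0            # operations completed so far on the current resource
        self.lost_ops = 0            # operations discarded by evictions of this copy

def extra_operations(copies, winner):
    """Operations charged against a job beyond the winning copy's useful work:
    everything lost to evictions plus all work done by copies that never finished."""
    extra = 0
    for c in copies:
        extra += c.lost_ops
        if c is not winner:
            extra += c.done_ops      # partial work of a losing copy is wasted
    return extra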

6.1 Replication Experimental Setup

Replication results are based on two traces: the six month Notre Dame Condor availability trace of 6 nodes analyzed in Section 3.2 [59], and a trace from the SETI@home desktop Grid system [29]. Again, for the Condor simulations (as explained in Section 5.1), the MIPS score, the load on each processor, and the (un)availability states (included or implied by the trace data) all influence the simulated running of the applications and their replicas. Resources are considered available if they are running, connected, have no user present, and have a local CPU load below 3%, as these are the default settings for the Condor system [75]. A resource may only be assigned one task for execution at a time. Similarly, for the SETI simulations the double precision floating point speed and the (un)availability states (implied by the trace data) influence the execution of the applications. In the SETI trace, a machine is considered available if it is eligible for executing a Grid application, in accordance with the resource owner's settings, which may dictate that a machine is only available if the CPU is idle and the user is not present. Since the SETI trace contains 226,28 resources over 1.5 years, 6 resources are randomly chosen and the simulation is executed only on the first 6 months of the trace, to compare directly with the results obtained from the Condor trace simulations.

Applications are simulated by inserting them at random times throughout the six months, with durations uniformly distributed between five minutes and 25 hours (this follows the same simulation/scheduling methodology as defined in Section 5.1); the duration is determined by the runtime on an unloaded resource with average speed. A uniform distribution demonstrates how each replication technique performs for a wide range of job lengths, without emphasizing a particular set of lengths. 25% of the applications are checkpointable, and can take an on-demand checkpoint in five minutes if the user reclaims a resource or if the local CPU load exceeds 5%. If the resource directly transitions to an unavailable state (for example, by becoming unreachable), no checkpoint can be taken and the job must restart from its last periodic checkpoint; periodic checkpoints are taken every 6 hours (Condor's default). Non-checkpointable applications must restart from the beginning of their execution when evicted, regardless of the eviction type. In both cases, an evicted job is added to the job queue and immediately rescheduled on an available resource.

The PPS scheduler, as defined in Section 5.1, is used to schedule jobs. Again, jobs are scheduled each round (once every 3 minutes) by removing and scheduling the job at the head of the queue until a job can no longer be scheduled. For each job, each available resource is scored and the resource with the highest score executes the job. Resource i's score RS(i) is computed as (scoring approach S3 as defined in Section 5.3.1):

RS(i) = m_i (1 - l_i)

where m_i is i's MIPS score, and l_i is the CPU load currently on resource i.

When a job is first scheduled, the scheduler determines whether a single replica of the task should be created, based on the current replication policy. (Although results have been obtained for an increased number of replicas per job, this study is restricted to one replica per job due to space constraints.) If the scheduler chooses to create a replica, an identical copy of the job is then scheduled. This replica is not considered for further replication. When any replica of a task completes, all other replicas of the task are located and terminated, freeing up the resources executing those replicas.

Some replication policies use an availability prediction. For this, the TRF predictor, as described in Section 4.2.1 [6], is used to forecast resource availability. TRF takes as input the expected duration of the job and the name of a resource, and returns a percentage representing the predicted probability that the resource will complete the expected duration without becoming unavailable.

In these experiments, the total number of jobs in the system is varied using 1K, 14K, 27K, and 4K total jobs over the 6 month traces, in four separate sets of simulations. This translates to .1, .13, .26, and .39 Jobs per Resource per Day, respectively. These same values are used for the rest of the tests described in this chapter, and are referred to as the Low, Medium Low, Medium High, and High load cases.

6.2 Static Replication Techniques

This section explores the effect that replicating jobs based on checkpointability can have on both extra operations and on job makespan. The total number of jobs varies from the Low to the High load case as defined in Section 6.1. Figure 6.1 includes the following replication policies:

1x: Replicates each job exactly once. If either the main job or the replica runs on a resource that becomes unavailable, it is rescheduled but never re-replicated. Thus, two versions of each job are always running.

Non-Ckpt: Replicates only jobs that are not checkpointable.

Ckpt: Replicates only jobs that are checkpointable. (This policy is not based on intuition, but instead serves as a useful comparison for the Non-Ckpt policy.)

5% Probability: Replicates half of the jobs, at random.

No-Rep: Does not make any replicas.

Figure 6.1: Checkpointability Based Replication: Makespan (left) and extra operations (right) for a variety of replication strategies, two of which are based on checkpointability of jobs, across four different load levels.

Figure 6.1 illustrates that under low loads, increasing replicas achieves more makespan improvement, a benefit that falls off for higher loads because replication forces subsequent jobs to use even less desirable resources. Replicating only non-checkpointable jobs (Non-Ckpt) improves makespan for all but the highest load test, and at significantly less cost than 1x replication. Moreover, replicating the half of the jobs that are non-checkpointable is better than replicating half of the jobs at random, which indicates that using checkpointability in replication policies has benefit, as expected.

6.3 Prediction-based Replication

As shown in the previous section, replicating non-checkpointable jobs yields larger rewards than replicating checkpointable jobs, due to the inherent fault tolerance of checkpointable jobs. Replicating longer jobs also yields higher improvements, although those results are not included here. These replication strategies can be improved by identifying and replicating the jobs that are most likely to experience resource unavailability. For this work, a resource is selected based on its projected near-future performance (speed and load) as described in Section 6.1, and jobs are replicated based on the probability of the job completing on that resource. Jobs that are more likely to experience resource unavailability are replicated to decrease job makespan.

Since replication also carries a cost, it is important to consider makespan improvement and overhead within the same metric. This metric quantifies how much makespan improvement each replication policy achieves per replica created. Replication Efficiency is defined as:

Replication Efficiency = Makespan Improvement / Replicas Per Job

where Makespan Improvement is the percentage makespan improvement over not creating any replicas, and Replicas Per Job is the average number of replicas created per job. As defined, increasing makespan improvement with the same number of replicas increases efficiency, and increasing replicas to get the same makespan improvement decreases efficiency. Another way of considering efficiency is to ask how much makespan improvement can be achieved with a certain number of replicas. The best replication strategies make replicas of the right jobs, and achieve more improvement for the same cost (number of replicas).

Replication efficiency is important in any environment that applies a cost to inserting and executing jobs on a system. One user may wish to complete a job as soon as possible (reduce makespan) regardless of cost, whereas a frugal user may only wish to replicate if the replica created is likely to provide significant makespan improvement; the frugal user may choose an extremely efficient replication policy. Efficiency can also be useful to administrators in choosing which replication policy is most suited to the goals of their system. Efficient replication strategies reduce the overhead and increase the throughput of the overall system by reducing the number of wasted operations performed by replicas that do not complete or that are evicted. This section analyzes methods for using the availability predictions to make better replication decisions in terms of both makespan improvement and efficiency.

6.3.1 Replication Score Threshold

This section first investigates how the prediction threshold at which the system makes a replica affects performance. The Replication Score Threshold (RST) is the predicted completion probability, as calculated by TRF, below which the system makes a replica, and above which it does not. As demonstrated in this section, RST can affect both makespan improvement and replication efficiency. For example, a system with an RST of 1 replicates all jobs scheduled on resources with a predicted completion probability below 1%. (Such a system does not replicate every job, but only jobs whose predicted probability of completion falls below the threshold; if a resource has not yet become unavailable within the period of time being traced, the predictor may output a 1% probability of completion.) Additionally, the RST-L-NC strategy chooses jobs to replicate based on their predicted completion probability (RST), their Length (longer jobs are replicated), and whether they are checkpointable (only Non-Checkpointable jobs are replicated). A job length of greater than ten hours was chosen based on empirical analysis of replication approaches with a varying length parameter. Non-fixed length approaches have not been investigated and are left to future work. These results are compared with a policy (1x) that replicates each job, and a policy that replicates on available resources only (WQR-FT, described in Section 2.4) [7]. To summarize:

1x: Creates a single replica of each job

RST: Given an availability prediction from the TRF predictor, replicates all jobs whose predicted completion probability falls below the configured RST value

RST-L-NC: Replicates all non-checkpointable (NC) jobs that are longer than ten hours (L) and whose predicted completion probability, given an availability prediction from the TRF predictor, falls below the configured RST value

WQR-FT: First schedules all incoming jobs, then uses the remaining resources for replicas by creating up to one replica of each job, until either all jobs have a single replica executing or there are no longer any available resources

Figure 6.2: Replication Score Threshold's effect on replication effectiveness - Low Load

Figure 6.2 plots makespan and efficiency versus RST in the Low load case. The top two graphs present the results from executing the simulation on the Condor trace, and the bottom two graphs present the results from the SETI trace.

As RST increases, more jobs are replicated as more predicted completion probabilities fall below the RST. In the low load case, this increases makespan improvement for both RST and RST-L-NC. The effect is more prominent in the Condor simulations, but still evident in the SETI simulations. 1x and WQR-FT achieve the highest makespan improvement in the Condor simulations but are matched by RST in SETI. The efficiency graphs exhibit the opposite trend. For all RST values, both RST and RST-L-NC produce higher efficiencies than WQR-FT and 1x. In fact, RST-L-NC produces a 39.5 (Condor) and a 47.5 (SETI) replication efficiency versus WQR-FT's efficiency of 21 (for both SETI and Condor). RST-L-NC's efficiency outperforms RST's due to the incorporation of job characteristics in choosing to replicate longer, non-checkpointable jobs. The more intelligently and selectively the scheduler chooses which jobs to replicate, the higher the achieved efficiency. Clearly, considering both predicted completion probability and an application's characteristics yields the most efficient strategy. For example, in SETI, RST matches the makespan improvement of WQR-FT but creates 35% fewer replicas, while RST-L-NC comes within 6% of the makespan improvement of WQR-FT while creating 67% fewer replicas. Similarly, in Condor, RST-L-NC comes within 6% of the makespan improvement of WQR-FT but creates 76.8% fewer replicas.

6.3.2 Load Adaptive Motivation

This section examines how increasing load on the system influences the replication policies' effectiveness. Figure 6.3 plots makespan and efficiency versus RST in the Medium High load case. Again, the top two graphs present the results from the Condor simulations and the bottom two graphs present the results from the SETI simulations.

Figure 6.3: Replication Score Threshold's effect on replication effectiveness - Medium High Load

Under Medium High load, increasing RST actually decreases makespan improvement; creating additional replicas in higher load situations denies other jobs access to attractive resources. RST7 (RST-i replicates all jobs whose predicted probability of completion is below i%), for example, produces the largest makespan improvement for Condor, and RST4 produces the largest makespan improvement for SETI. RST and RST-L-NC produce much higher efficiencies than either 1x or WQR-FT across all RST values in both the Condor and SETI simulations. At this load level, RST-L-NC produces a 3 (Condor) and a 2 (SETI) replication efficiency versus WQR-FT's 7 (Condor) and 8.5 (SETI), respectively. In SETI, RST-L-NC comes within .5% of the makespan improvement of WQR-FT while creating 58% as many replicas, and RST improves on the makespan by 2.5% while creating 13.7% fewer replicas. For Condor, RST-L-NC improves on the makespan of WQR-FT by 1.5% while creating 72.5% fewer replicas. Again, RST-L-NC produces higher replication efficiency than RST due to its higher replica creation selectivity.

Figure 6.3 demonstrates that under a higher load, selective replication strategies produce larger makespan improvements. Even under low load (Figure 6.2), these more selective and efficient replication techniques result in more improvement per replica.

Figure 6.4: Replication strategy performance across a variety of system loads

Figure 6.4 further investigates the effect that varying the load in the system has on the performance achieved by the replication techniques proposed in Section 6.3.1. It plots the RST replication technique with RST values of both 1% and 4% and compares them with not replicating (No Replication) and replicating a random 2% of jobs (2% Replication), only for the Condor case due to space constraints (although the result holds for the SETI case). As load increases, the makespan improvement and efficiency of all replication strategies initially increase slightly but then fall as the Med-High and High load levels are reached. Intuitively, this is because under lower loads replicas do not interfere by taking quality resources away from other jobs; as load increases, replicas can cause new jobs to wait. Another important result comes from comparing the proposed techniques to blind replication. For example, the RST4-L-NC technique creates almost exactly the same number of replicas as randomly replicating 2% of jobs across the load levels. Given the same number of replicas, the RST technique achieves a higher makespan improvement and efficiency across all load levels; it is choosing the right jobs to replicate.
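The RST-style decision discussed above, together with the replication efficiency metric, could be expressed as in the following sketch. The predict_completion callback, the job fields, and the threshold and length constants are stand-ins for the TRF predictor interface and the parameters quoted in the text, not PredSim's actual code.

def should_replicate_rst_l_nc(job, resource, predict_completion,
                              rst=40.0, min_hours=10.0):
    """RST-L-NC: replicate only long, non-checkpointable jobs whose predicted
    completion probability on the chosen resource falls below the threshold."""
    prob = predict_completion(resource, job.expected_hours)  # TRF-style forecast, 0-100
    return (not job.checkpointable
            and job.expected_hours > min_hours
            and prob < rst)

def replication_efficiency(makespan_improvement_pct, replicas, jobs):
    """Percentage makespan improvement (over no replication) per replica per job."""
    replicas_per_job = replicas / jobs
    return makespan_improvement_pct / replicas_per_job if replicas_per_job else 0.0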

Figure 6.4 also shows that the replication strategy that provides the highest achievable makespan or efficiency varies based on the load of the system. For example, depending on the load level, either RST4-L-NC or RST1-L-NC produces the largest makespan improvement.

6.3.3 Load Adaptive Techniques

Since load varies unpredictably, the replication policy should adapt dynamically. And because different users may target different metrics, the replication policy must also vary by job. This suggests three load adaptive replication techniques that respond to varying load levels by choosing the replication strategy that best achieves three different desired metrics. A load measurement system counts the number of jobs submitted to the system in the past day. This value then determines the current load level and helps select a replication strategy. Low load is defined as fewer than 4 jobs per day, Med-Low load as fewer than 115 jobs per day, Med-High load as fewer than 2 jobs per day, and High load as 2 or more jobs a day. The RST-based strategies consider different combinations of checkpointability and job length depending on the current system load and the desired metric (a sketch of this strategy selection appears at the end of this section):

Performance: Reduce average job makespan across loads
Low load: 4x - Create 4 replicas of each job
Med-Low load: RST1 - Replicate a job if the completion probability is less than 1%
Med-High load: RST4-NC - Replicate a job if the completion probability is less than 4% and the job is non-checkpointable
High load: No replication

Efficiency: Increase replication efficiency across loads
Low, Med-Low, and Med-High load: RST4-L-NC - Replicate a job if the completion probability is less than 4%, the job is non-checkpointable, and the length is over 1 hours
High load: No replication

Compromise: Compromise between average job makespan and efficiency across loads
Low load: RST1 - Replicate a job if the completion probability is less than 1%
Med-Low load: RST1 - Replicate a job if the completion probability is less than 1%
Med-High load: RST4-NC - Replicate a job if the completion probability is less than 4% and the job is non-checkpointable
High load: No replication

Figure 6.5: Load adaptive replication techniques

Figure 6.5 explores the results of the three proposed load adaptive replication policies across varying load levels and compares them to the closest related-work replication strategy, WQR-FT. Again, the top two graphs represent the results from the Condor simulations and the bottom two illustrate the SETI results. For both cases, all of the load adaptive techniques produce the largest (or extremely close to the largest) achievable value for their chosen metric at each given load level, with the exception of the high load case for SETI. The Efficiency technique produces the largest efficiency across all load levels (except in the high load case in SETI) while simultaneously producing the smallest makespan improvement. The Performance technique produces the opposite trend. The Compromise technique produces an intermediate makespan improvement and efficiency across all loads. The results demonstrate that these replication techniques can determine the replication strategy best suited to achieving the desired metric at various loads.

Compared to WQR-FT, all three load adaptive techniques produce higher replication efficiencies across all loads, except for the Performance technique in the low load case and the high load case in SETI. The Efficiency technique produces an average replication efficiency increase of 1.8 across all loads, with a maximum increase of 22, compared with WQR-FT. Similarly, the Efficiency technique produces at most an increase of 25 in efficiency for the Condor simulations and an average increase of 15 across all loads compared with WQR-FT. The Performance technique beats the makespan improvement achieved by WQR-FT by an average of 1.27% across all loads in the Condor simulations, while simultaneously achieving a higher efficiency in all but the low load case. The Performance technique also improves on the makespan improvement of WQR-FT in all but the high load case in SETI. The Compromise technique improves on or matches the efficiency and makespan achieved by WQR-FT across all loads, with the exception of the high load case in the SETI simulations.

Chapter 7

Load Adaptive Replication

The previous chapter investigated using availability predictors to decide which jobs to replicate. The goal of this prediction-based replication approach is to replicate those jobs which are most likely to experience resource unavailability; these jobs benefit the most from the fault tolerance that replication provides and produce the largest benefit to the system. As seen in Chapter 6, it is especially important to choose the right jobs to replicate because of the costs associated with replication. As seen in Section 6.3.2, the load on the system has a large effect on the performance of a replication strategy. In particular, as load increases, jobs compete for fewer and fewer resources. Given this lack of available resources under higher loads, replication strategies must become even more selective about which jobs are replicated. This is in contrast to lower load situations, which can benefit from more aggressive replication policies.

Section 6.3.3 proposed three static load adaptive replication policies. These policies measure the load on the system over the past 24 hours and choose the replication policy appropriate for that load out of a set of four replication policies. The major drawback of this approach is that the replication policies are chosen statically, ahead of time, based on testing a range of policies with a given resource availability trace and workload; the best performing policy is then chosen for each load level range. Unfortunately, this means the load-adaptive approach is tailored to that specific availability trace and workload, since the four strategies are statically chosen. As shown in this chapter, this strategy does not handle new workloads or new resource environments well. An additional issue is that having only four load levels yields a relatively coarse granularity of load adaptation, which cannot account for the wide range of loads within each level. This chapter proposes a new load adaptive approach which addresses these issues and delivers improved performance.

The chapter is organized as follows. Section 7.2 investigates the effect that varying system load has on the performance achieved by the prediction-based replication techniques proposed in the previous chapter and motivates the need for load-adaptive replication. Section 7.3 introduces a load adaptive replication approach, Sliding Replication (SR), and goes on to test various configurations for SR in Sections 7.3.1 through 7.3.3. Lastly, Section 7.4 tests SR against the load adaptive approach proposed in Section 6.3.3 and then against WQR-FT using two real world load traces.

7.1 Load Adaptive Experimental Setup

The same simulation methodology described in Section 5.1 is used for studying these load adaptive replication policies. This methodology dictates how resources are considered available based on the availability trace. Section 6.1's experimental setup is used, including the Prediction Product Score (PPS) scheduler with the following resource ranking function:

RS(i) = m_i (1 - l_i)

where m_i is i's MIPS score, and l_i is the CPU load currently on resource i. When a job is first scheduled, the scheduler determines whether a single replica of the task should be created, based on the current replication policy. The Condor and SETI resource availability traces are used to dictate when and how resources become unavailable.

These experiments use real world workload traces from the Grid Workload Archive (GWA) [27], an online repository of 11 distributed system workload traces including TeraGrid, NorduGrid, LCG, SHARCnet, DAS-2, AuverGrid, and Grid5000, to drive the simulations. The Grid5000 and NorduGrid traces are used in the load-adaptive replication simulations. Grid5000 consists of between 1, and 25, system cores with over 1 million jobs. NorduGrid consists of over 25, system cores with between 5, and 1 million total job submissions. NorduGrid therefore represents a lower overall load level.

7.2 Prediction Based Replication

Section 6.3.1 describes a replication strategy that utilizes predictions to choose which jobs to replicate [62]: those that are most likely to experience resource unavailability (RST). Again, if a job's likelihood of failure exceeds some threshold, it is replicated. The Replication Score Threshold (RST) is the predicted completion probability, as calculated by the TRF predictor (Section 4.2.1), below which the system makes a replica, and above which it does not. As shown in the last chapter, RST affects both makespan improvement and replication efficiency.
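To make the ranking function concrete, the following sketch scores each available resource with RS(i) = m_i (1 - l_i) and selects the highest scorer; the resource attribute names are illustrative assumptions rather than PredSim's actual interfaces.

def resource_score(resource):
    """PPS ranking from the text: MIPS speed discounted by current CPU load."""
    return resource.mips * (1.0 - resource.cpu_load)

def select_resource(available_resources):
    """Return the available resource with the highest score, or None if the pool is empty."""
    return max(available_resources, key=resource_score, default=None)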

This section motivates the need for load aware replication strategies and investigates the effect of varying RST. A pool of synthetic jobs is simulated by inserting each job at a random time throughout the six month Condor trace, with application durations uniformly distributed between five minutes and 24 hours, using the simulation methodology described in Section 5.1. The number of jobs is varied according to the Low (1K jobs) and Med-High (27K jobs) loads described in Section 6.1 (Med-Low and High load scenarios are not included in this study for clarity). The goal is to determine the effect that system load and RST have on replication performance.

Figure 7.1: Replication Score Threshold's effect on replication effectiveness across various loads

Figure 7.1 plots makespan and efficiency versus RST for Low and Med-High load. Under Low load, as RST increases, more jobs are replicated because more predicted completion probabilities fall below the RST; this increases makespan improvement. The efficiency graphs exhibit the opposite trend: as RST increases, extra job replicas contribute slightly less to makespan improvement, leading to lower efficiency. Under Medium High load, increasing RST actually decreases makespan improvement; creating additional replicas in higher load situations denies other jobs access to attractive resources. RST7 (RST-i replicates all jobs whose predicted probability of completion is below i%), for example, produces the largest makespan improvement for Condor. Efficiency decreases as RST increases, but with a sharper slope than in the Low load case, indicating that the additional replicas decrease efficiency more under the higher load.

Figure 7.1 illustrates that the replication strategy that creates the largest makespan improvement varies with the load on the system. As load increases, excess replicas steal quality resources from non-replicated jobs; schedulers should become more selective about which jobs to replicate. Figure 7.1 also shows that varying RST can be used as a tool to control the selectivity of the replication policy: higher RST values translate to less selectivity and lower RST values mean more selectivity (fewer jobs being replicated). This idea is used later in developing a load adaptive replication approach.

7.3 Load Adaptive Replication

Sliding Replication (SR) is a load-adaptive replication approach that uses the current load on the system to influence its replication strategy choice. One simple load adaptive replication approach is to directly translate load to a probability of replication: a function takes system load as input and produces output that dictates the probability of replication. The Div-.7 load adaptive replication approach calculates the probability of replication (RP) with the following equation:

RP = (1 - (JDR / 0.7)) × 100

where JDR is the number of jobs per day per available resource. JDR is calculated by examining the most recent 24 hours of behavior. The static value 0.7 represents a typical upper bound on the JDR measurement for a high load scenario. Div-.7 is compared with Thresh-.3 and Thresh-.4, which replicate all jobs if the JDR measurement is below .3 or .4, respectively.

Figure 7.2: Load adaptive replication with random replication probability

Figure 7.2 analyzes the performance of these three approaches in terms of makespan improvement and efficiency, using the Condor resource availability trace paired with the Grid5000 job workload trace. Importantly, the Grid5000 workload does vary its load over the traced period in terms of how many jobs are injected into the system, but this does not affect the overall load level that the trace dictates on the system. To simulate a range from lower overall system loads to higher system loads, the simulation chooses a different percentage of jobs from the trace to execute, according to the desired load level; K% means that the simulator keeps a random K% of jobs from the trace. In Figure 7.2, Thresh-.4 produces the highest makespan improvement while also producing the highest efficiency for the two higher load scenarios. Thresh-.3 produces the highest efficiency for the two lower load scenarios.

However, these load adaptive replication strategies only choose how many jobs to replicate given the current load level. These results can be further improved if the replication strategy also chooses the best jobs to replicate. For example, for a given load, replicating the jobs that will fail results in higher gains than replicating jobs that would run to completion without replication. This is the motivation behind the Sliding Replication (SR) approach. SR not only attempts to regulate how many replicas are produced given the current load level, but, given that number of replicas, it also seeks to choose the proper subset of jobs to replicate, targeting those jobs that are most likely to fail.

SR employs prediction based replication strategies that replicate all jobs whose predicted likelihood of failure exceeds a certain threshold. Since a single failure threshold will replicate a similar percentage of jobs across most loads, and since load-adaptive replication approaches seek to vary the percentage of jobs replicated with system load, SR first uses load to determine which strategy to use to decide whether to replicate. This choice of strategy influences both the number of replicas and which jobs get replicated.

Figure 7.3: Sliding Replication's load adaptive approach

Figure 7.3 illustrates the Sliding Replication strategy's methodology. Sliding Replication uses an ordered list of replication strategies. The current load is passed to an Indexing Function to determine an index into that Ordered Replication Strategy List. The index corresponds to a particular replication strategy, which is then used to determine whether the job is replicated.

7.3.1 Replication Strategy List

This section investigates various ordered lists of replication strategies. As shown in Figure 7.1, higher system loads require more selective replication strategies. The first intuitive solution is an ordered list of replication strategies called 21, where the strategies are ordered from the most aggressive (always replicate) to the most selective (only make a replica of the job if its completion probability is below 4%, it is non-checkpointable, and its duration is over 1 hours). This list is drawn from the replication strategies previously described in Chapter 6 [62]. The indexed list is as follows:

Index 0: 1x: Creates a single replica of the job
Index 1-7: RST: Given an availability prediction from the TRF predictor, replicates all jobs whose predicted completion probability falls below 1% (index 1) to 4% (index 7)
Index 8-14: RST-NC: Replicates all non-checkpointable (NC) jobs whose completion probability (as predicted by TRF) falls below 1% (index 8) to 4% (index 14)
Index 15-21: RST-L-NC: Replicates all non-checkpointable (NC) jobs that are longer than ten hours (L) and whose predicted completion probability (from the TRF predictor) falls below 1% (index 15) to 4% (index 21)
Index > 21: 0x: Never replicates

21 is compared with three other ordered lists:

21_rev: Uses the same list as 21, except in reverse order (most selective to most aggressive)
7: Uses the same list as 21, except only the first 7 indexes; any index larger than 7 indicates the job will not be replicated
21_rand: Has 21 indexes but uses probabilistic replication. The probability that each replication strategy in the 21 list has of creating a replica is determined, and then, based on the index chosen, the strategy randomly creates replicas with a probability equal to that of the corresponding strategy in the 21 list. This isolates choosing the right jobs to replicate from merely choosing the right number of jobs to replicate at each load.

Figure 7.4: Sliding Replication list comparison

Figure 7.4 demonstrates the performance of all four ordered replication strategy lists for average job makespan and number of evictions, using the Condor availability trace and 1% of the jobs from the Grid5000 workload. The 21 list strategy produces the fewest job evictions and the lowest average job makespan. Beating 21_rev demonstrates that making fewer replicas under higher loads is better, and beating 21_rand demonstrates that the list of policies is making replicas of the right jobs.

7.3.2 Replication Index Function

This section investigates varying the function used by SR to alter the conditions under which indices are chosen. SR can use both linear and exponential functions when calculating these indices. The exponential function (Exp-X) calculates an index into the ordered list of strategies as follows:

Index = (JDR × 10)^X

X determines the strategy's sensitivity to system load. Smaller values of X result in smaller indices and therefore more replicas. The linear function (Lin-X) is:

Index = (JDR × 10) × X

Again, lower X values result in lower indices and more replicas.
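A compact sketch of the Sliding Replication machinery described above follows: the measured JDR is converted to an index by the chosen function, and the index selects a strategy from the ordered 21 list. The constant inside the index functions and all helper names are assumptions made for illustration, not PredSim's actual code.

import math

def exp_index(jdr, x):
    """Exponential indexing (Exp-X): a higher load (JDR) or a larger X pushes the
    index toward the more selective end of the ordered strategy list."""
    return int(math.floor((jdr * 10) ** x))   # multiplier is an assumed reading of the equation

def lin_index(jdr, x):
    """Linear indexing (Lin-X)."""
    return int(math.floor((jdr * 10) * x))

def strategy_for_index(index):
    """Map an index into the 21 list to a strategy family; the exact completion
    probability threshold at each position is given by the list in the text."""
    if index <= 0:
        return "1x"
    if index <= 7:
        return "RST (position %d of 7)" % index
    if index <= 14:
        return "RST-NC (position %d of 7)" % (index - 7)
    if index <= 21:
        return "RST-L-NC (position %d of 7)" % (index - 14)
    return "NoRep"

def sliding_replication_strategy(jdr, x, indexer=exp_index):
    """Measure load (JDR), compute an index, and return the strategy SR would apply."""
    return strategy_for_index(indexer(jdr, x))

Under this assumed reading, for example, a JDR of 0.39 with Exp-2.2 yields an index of about 20 and one of the most selective RST-L-NC entries, while the same load with Exp-1.1 yields an index of about 4 and a far more permissive RST entry.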

Figure 7.5: Sliding Replication function comparison

Figure 7.5 compares linear indexing (Lin-X), in the top two graphs, with exponential indexing (Exp-X), in the bottom two, for the Condor resource availability trace paired with the Grid5000 job workload trace. The percentage of jobs injected from the Grid5000 workload is varied from 25% to 1%, and the X parameter is also varied in both equations. For both functions, larger values of X produce higher replication efficiencies, especially at higher loads. The exponential expression produces similar but slightly higher replication efficiency when compared with the linear function.

7.3.3 SR Parameter Analysis

This section analyzes how the X parameter in the exponential index expression affects which index, and therefore which replication policies, are chosen. The simulation executes the Condor trace coupled with the Grid5000 workload. The X parameter is varied among 1.1, 1.6, and 2.2 to see which replication policies (indices) are chosen throughout the simulation. Figure 7.6 plots the frequency with which each value of the X parameter selects each index and how many replicas are created at each index. Values over 21 result in no replicas and are not shown. The only unexpected result is index 8 (RST-4) creating a larger number of replicas than the surrounding strategies.

Figure 7.6: Sliding Replication index analysis

7.4 Sliding Replication Performance

This section compares Sliding Replication with other replication strategies. Section 7.4.1 compares SR against the prediction-based load-adaptive replication approach Compromise-Rep, proposed in Section 6.3.3. Section 7.4.2 then compares SR with WQR-FT using two real world workload traces.

7.4.1 Prediction-based Replication Comparison

This section compares the SR strategy with the previously proposed replication strategy Compromise-Rep, as defined in Section 6.3.3 [62]. Recall that Compromise-Rep measures the number of jobs submitted in the last day; if the number of jobs is less than 115, it replicates a job if its completion probability is lower than 1%. If the number of jobs is greater than or equal to 115 but less than 2, it creates a replica if the job's completion probability is less than 4% and the job is non-checkpointable. With any more than 2 jobs per day, Compromise-Rep will not create a replica. To compare these two strategies, the Condor availability trace coupled with the Grid5000 workload is again used. Again, to simulate a range from lower overall system loads to higher system loads, different percentages of jobs are used from the Grid5000 trace in relation to the desired load level; K% means that a random K% of jobs are kept from the trace.

Figure 7.7 shows the results of the comparison between the two load adaptive replication strategies across a variety of loads. The left subgraph depicts makespan improvement over creating no replicas. For makespan improvement, SR (Exp-2.2) outperforms Compromise-Rep under low loads; under other loads the two strategies perform similarly. For replication efficiency, SR significantly outperforms Compromise-Rep for all loads.

Figure 7.7: Static vs. adaptive replication performance comparison

7.4.2 SR Under Various Workloads

This section analyzes SR for a variety of workloads and availability traces across various exponential X values and compares it to the WQR-FT replication strategy (defined in Section 2.4) [7]. SR is tested using three values of X, namely 1.1, 1.6, and 2.2.

Figure 7.8 depicts results from the Grid5000 workload under both the Condor (top two graphs) and SETI (bottom two) availability traces. For Condor, SR produces a makespan improvement comparable to WQR-FT's (2.75% on average across the loads) while producing a 225% increase in replication efficiency on average, indicating that SR creates significantly fewer replicas to produce a similar makespan improvement. WQR-FT creates a replica whenever any resource is available, an aggressive but potentially inefficient strategy that improves makespan unless replicas leave significantly fewer quality resources for future jobs. For SETI, SR produces slightly over 1% makespan improvement over WQR-FT for all load levels while simultaneously increasing replication efficiency by an average of 547%. Increasing the value of X increases the replication efficiency in both the Condor and SETI simulations while slightly decreasing makespan improvement across all load levels.

Figure 7.9 depicts results from the NorduGrid workload. Again, the top two graphs represent results from the Condor trace and the bottom two from SETI. SR produces a higher makespan improvement and replication efficiency in all but the SETI lower load case.

Figure 7.8: Sliding Replication under the Grid5000 workload

Figure 7.9: Sliding Replication under the NorduGrid workload

Chapter 8

Multi-Functional Device Utilization

Conventional workstations and clusters are increasingly supporting distributed computing, including applications that require distributed file systems [9][1] and parallel processing [64][34][36]. For computationally intensive workloads, schedulers can characterize and predict workstation availability, and use that information to effectively exploit pools of distributed idle resources, as done in previous chapters [64][34][54][58]. In fact, many projects have successfully harvested idle cycles from workstations and general purpose compute clusters.

This chapter describes efforts to leverage unconventional computing environments, defined here as collections of devices that do not comprise general purpose workstations or machines. In particular, the aim is to leverage special purpose instruments and devices that are often deployed in great numbers by corporations and academic institutions. These devices have increasingly fast processors and large amounts of memory, making them capable of supporting high performance computing. They are often connected by high speed networks, making them capable of receiving and sending large data intensive jobs. Importantly, and as seen in Section 8.4, these devices often have significant amounts of idle time (95% to 98% idle). These characteristics provide the potential to federate and harvest the unused cycles of special purpose devices for compute jobs that require high performance platforms [57].

8.1 Multi-Functional Devices

One concrete example of an unconventional environment containing highly capable special purpose devices is a pharmaceutical company and its multi-function document processing devices (MFDs). MFDs, workgroup printers, and coupled servers are common in businesses, corporations, and academic institutions. All three types of institutions fulfill printing and document creation needs with fleets of such devices, which provide ever increasing speed, functionality, and print quality.

Figure 8.1: MFD count per floor at a large pharmaceutical company

To support this additional functionality (including e-mail, scanning, faxing, copying, document character recognition, concurrency and workflow management, and more), the devices now contain processing power, RAM, secondary storage, and even GPUs, making them as computationally capable as general purpose servers. A single MFD, for example, includes at least a 1 GHz processor, 2 GB of memory, an 8 GB HDD, Gigabit Ethernet, other processing cores and daughter boards that function as co-processors, phone and fax capability, and mechanical components. Such a device would currently cost approximately $25,. This configuration compares favorably to a typical high performance workstation or even one or more cluster nodes (the components of typical Grid and cloud computing platforms).

Large organizations deploy a significant number of such devices, in part because, to be useful, they must be located in physical proximity to users. Figure 8.1 shows the number of MFDs on each floor of a large pharmaceutical company, a customer of Xerox Corporation. Each bar represents the number of floors of a building that contain the number of MFDs indicated on the x-axis. The company has a total of 1655 MFDs, and over 65% of the floors in its buildings have five or more such MFDs. As a second example, MFDs from a business division of a large company from the computer industry (managed by Xerox Corporation) are depicted in Figure 8.2. This organization consists of 14 floors split across several buildings (a typical building's floor is 4, square feet). The organization has a total of 8 MFDs, 54% of the floors have five or more MFDs, and 67% of the buildings have 13 or more MFDs.

Figure 8.2: MFD count per floor at a division of a large computer company.

8.2 Unconventional Computing Environment Challenges

Unfortunately, simply recognizing a new hardware platform and plugging in existing cycle harvesting solutions will not work. Unconventional computing environments may differ significantly from the environments that are normally used to support Grid jobs, and these differences require new approaches and solutions. In particular, the differences include the characteristics of the local jobs that the environments support, the expectations that local users have for their availability and performance, the effect on Grid jobs when the resources are reclaimed for local use, and the kinds of Grid jobs the devices can support well (when they are not processing local jobs). These differences are addressed below.

Local jobs: Desktop and workstation availability can track human work patterns; stretches of work keep the machines busy, with short periods corresponding to breaks. In contrast, shared printing resources (workgroup printers and MFDs) and print servers can be available for a relatively high percentage of time, but may be frequently interrupted to process short jobs. Moreover, these interruptions tend to arrive in bursts, as shown later in the chapter. Figure 8.3 depicts the burstiness of job arrivals at a sample MFD, by tracking the state of the device through one day. When the resource is printing, it is shown in the figure as Busy; when it is processing e-mail, faxing, or scanning a document, it is shown as Partially Available; and when it is idle, it is shown as Available. Figure 8.3 shows that MFDs in large organizations can be available for Grid use 95% or more of the time. But device usage patterns exhibit volatility; short high priority jobs arrive intermittently.
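As an illustration of how a monitoring agent might map raw device activity onto the three states used in Figure 8.3, consider the sketch below; the activity labels and the enum are illustrative assumptions, not an interface from the dissertation's MFD model.

from enum import Enum

class MFDState(Enum):
    BUSY = "Busy"                                  # printing: the device is fully claimed locally
    PARTIALLY_AVAILABLE = "Partially Available"    # e-mail, fax, or scan work in progress
    AVAILABLE = "Available"                        # idle: eligible to host a Grid job

def classify(activity):
    """Map an observed device activity string to one of the three availability states."""
    if activity == "printing":
        return MFDState.BUSY
    if activity in ("email", "fax", "scan"):
        return MFDState.PARTIALLY_AVAILABLE
    return MFDState.AVAILABLE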

[Figure 8.3: Trace data from four MFDs in a large corporation. The devices are Available over 98% of the time, and requests for local use are bursty and intermittent.]

Local user expectations, and the effect on Grid jobs: Workstations and cluster nodes can often suspend Grid or cloud jobs when a local job preempts them, keeping the Grid jobs in memory in case the resource soon becomes idle again. Further, in conventional environments, the Grid job may be able to take an on-demand checkpoint, or even migrate to another server. In the MFD environment, by contrast, local users typically demand fast turnaround time for their jobs, and local jobs typically require the full capabilities of the MFD. Therefore, when a user requests service from an MFD, any Grid job must immediately be halted, without the possibility of checkpointing or migrating.

Types of jobs: Partly because of these local user expectations, MFDs are better able to support shorter Grid jobs (e.g., tens of minutes) that can be squeezed into relatively short periods of availability. These relatively short Grid jobs may be related to printing, raster (image) processing, or other services such as optical character recognition (OCR). These jobs are processing intensive, but they can be divided amongst the devices on a floor or in a building. The jobs naturally lend themselves to being processed in a page-parallel fashion (MFD jobs typically arrive as a number of separate pages). For example, a Grid job that is squeezed into some period of availability might represent several pages of a long OCR job from a nearby MFD. For the OCR example, a job that may have taken five to ten minutes might then require under two minutes if cycles from five nearby MFDs could be harvested.
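To make the page-parallel split concrete, here is a minimal sketch of how such a job might be divided and its completion time estimated. The function, the per-page cost, and the device count are illustrative assumptions, not part of any MFD software:

import math

def parallel_ocr_estimate(num_pages, seconds_per_page, idle_devices):
    # Split a page-parallel OCR job evenly across the idle MFDs and estimate
    # its completion time, assuming each page costs roughly the same and each
    # device works on its share without interruption.
    pages_per_device = math.ceil(num_pages / idle_devices)
    return pages_per_device, pages_per_device * seconds_per_page

# A 100-page job at ~6 seconds per page: ~10 minutes on one MFD,
# but roughly 2 minutes when five nearby idle MFDs share the pages.
print(parallel_ocr_estimate(100, 6, 1))   # -> (100, 600)
print(parallel_ocr_estimate(100, 6, 5))   # -> (20, 120)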

A doctor's office is another example of an environment with non-traditional HPC requirements. Several federated MFDs could effectively support a processing-intensive job such as OCR and archiving of patient medical records. During every visit, patients fill in and sign privacy notices, prior medical history, and other identifying information. For these to be automatically correlated to electronic medical records, OCR is necessary. At the same time, doctors' offices print and fax frequently, making the job traffic to MFDs highly intermittent and varied. MFDs are better suited than clusters to scanning, handling paper forms, image-related workflows, and secure prescription printing at the doctor's office. However, as outlined above, they benefit from acting in concert for processing-intensive OCR jobs whenever a subset of devices detects a period of availability. Traditional compute clusters are not only overkill for this domain, but they also do not handle paper workflows such as form scanning and prescription printing.

This chapter characterizes this domain and proposes a burstiness-aware scheduling technique designed to schedule Grid or cloud jobs to make better use of cycles that are left free in this environment. The devices also occasionally need bursts of computational power, perhaps from one another, to perform image processing, including raster processing, image corrections and enhancements, and other matrix operations that could be parallelized using known techniques [23] [1]. Sharing the burden of demanding jobs would allow fewer devices to meet the peak needs of bursty job arrivals. Using known programming models [1] also enables leveraging the devices for general-purpose distributed computations.

This chapter investigates the viability of using MFDs as a Grid for running computationally intensive tasks under the challenging constraints described above. Section 8.3 first defines an MFD multi-state availability model similar to the multi-state availability model presented in Section 3.1. The key difference is that this new model is geared towards the availability states of unconventional computational resources such as MFDs, whereas the previous model represents the states of traditional general-purpose computational resources (e.g., workstations). Next, Section 8.4 gathers an extensive MFD availability trace from a representative environment and analyzes it to determine how MFDs in the real world behave in terms of the states in the proposed model. Section 8.5 proposes an online and computationally efficient test for MFD burstiness using the trace; the proposed burstiness metric captures the dynamic and preemptive nature of the jobs encountered on the devices. Section 8.6 then shows how MFDs can be used as computational resources and the scheduling improvements that can be achieved by considering resource burstiness when making Grid job placements. Section 8.6.2 also computes statistical confidence intervals on the lengths of Grid jobs that would suffer a selected preemption rate; this assists in selecting Grid jobs appropriate for the bursty environment. On average across all job lengths tested, the proposed scheduling technique comes within 7.6% of a pseudo-optimal scheduler's makespan improvement and produces an average of 18.3% makespan improvement over a random scheduler.

8.3 Characterizing State Transitions

This section develops an availability model with which to categorize the types of availability and unavailability that an unconventional computational resource such as an MFD experiences. In particular, this section is interested in how the job patterns of a core set of native MFD jobs (core or local jobs) affect the availability of CPU cycles for a second, elastic set of Grid jobs, which are assumed to run at a lower priority. Devices oscillate between being available and unavailable for Grid jobs, depending on the core job load they experience from local users. MFDs can occupy four availability states based on the processing of their core job load. As shown below, these states can have a significant impact on Grid jobs.

[Figure 8.4: MFD Availability Model. MFDs may be completely Available for Grid jobs (when they are idle), Partially Available for Grid jobs (when they are emailing, faxing, or scanning), Busy or unavailable (when they are printing), or Down.]

Figure 8.4 identifies the four availability states suggested by MFD activity. An Available machine is currently running with network connectivity, and is not performing any other document-related task such as printing, scanning, faxing, emailing, character recognition, or image enhancement. It may or may not be running a Grid application in this case. A device may transition to the Partially Available state if a local user begins to use it for scanning, faxing, or email. These tasks require some but not all of the device's resources. When an MFD is printing a document, it typically requires the full capability of the machine's memory and processing power, and so a separate Busy state is designated for MFDs that are currently processing print jobs.

Finally, if an MFD fails or becomes unreachable, it is in the Down state. These states differentiate the types of availability and unavailability a resource will experience.

When scheduling jobs onto a Grid of MFDs, it is best to avoid devices that are likely to become Busy before the application has finished executing. If the Busy state is entered, the Grid job must be preempted and loses all progress. This assumption reflects the expectations of typical MFD users, who are highly concerned with local job turnaround time. Entering the Partially Available (scan/fax/email) state also affects Grid jobs, which are left with fewer cycles. However, because scanning, faxing, and emailing require significantly fewer computational resources than printing, Grid jobs need not be immediately halted; they may continue to execute with a significantly smaller percentage of the total CPU capacity while the user's scan, fax, or email job completes. Given the different types of unavailability a device may experience, applications that attempt to utilize MFDs as computational resources should avoid the devices most likely to enter the Busy state, because of the large number of cycles lost when an application must restart on a different resource. Those applications should also attempt to avoid an MFD likely to enter the Partially Available state because of the resulting slowdown, although the consequences of entering this state are much less severe than those of entering the Busy state. A scheduler that is aware of each MFD's behavior and likelihood of entering these states can make better Grid job placement decisions by attempting to avoid the Busy state altogether and by limiting the number of times applications are slowed down by MFDs that enter the Partially Available state. Section 8.5 introduces a metric called burstiness, which allows schedulers to exploit the fact that transitions into the Busy and Partially Available states are typically clustered together in time.
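The four-state model and its consequences for a running Grid job can be summarized in code. The following is a minimal illustrative sketch, not part of PredSim or any MFD firmware; the grid_job methods are hypothetical stand-ins for whatever preemption and throttling mechanism a real deployment would use, and the 0.5 CPU share mirrors the load assumption used in the simulations of Section 8.6.1:

from enum import Enum

class MFDState(Enum):
    AVAILABLE = 1            # idle: full CPU capacity is free for a Grid job
    PARTIALLY_AVAILABLE = 2  # scanning/faxing/emailing: Grid job runs, but slowed
    BUSY = 3                 # printing: any Grid job must be preempted immediately
    DOWN = 4                 # failed or unreachable

def on_state_change(new_state, grid_job):
    # Illustrative scheduler-side reaction to an MFD state transition.
    if new_state in (MFDState.BUSY, MFDState.DOWN):
        grid_job.preempt()           # restart elsewhere; all progress is lost
    elif new_state is MFDState.PARTIALLY_AVAILABLE:
        grid_job.set_cpu_share(0.5)  # keep executing at a reduced CPU share
    else:
        grid_job.set_cpu_share(1.0)  # full speed again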

8.4 Trace Analysis

A 153-day MFD job trace was gathered and analyzed from four MFDs located at a large enterprise company between mid-2008 and mid-2009. The trace consists of a series of time-stamped MFD jobs. Each job includes the job type (print, scan, fax, or email), job name, number of pages, submit time, completion time, and the user submitting the job. This data directly implies the Busy and Partially Available states. The trace only identifies when jobs are submitted and completed; it therefore precludes determining exactly when the Down state was entered. As the results show, MFDs are rarely Down, and these particular MFDs were more than 99% available by service agreement; MFDs were considered Available if they were not otherwise being used. This section analyzes this trace data to characterize MFDs according to the states of the proposed model, with the goal of identifying trends that could enhance the scheduling of Grid jobs.

8.4.1 Overall State Behavior

To determine how much of an MFD's time is available to Grid jobs, Figure 8.3 analyzes the overall amount of time spent in each state. Figure 8.3 depicts the overall percentage of time spent by the four MFDs in each availability state and an example of that availability. Figure 8.3a shows that availability is the dominant state, with 98.2% of the MFDs' overall time spent in the Available state (time during which an MFD is available and not printing); just 1.8% of the time is spent in the Busy and Partially Available states. With regard to how the non-available time is divided, shown in the right sub-graph, 85.7% of non-availability is Busy, leaving 14.3% in Partially Available. Figure 8.3 also shows an example of the typical behavior of an MFD during mid-day on a weekday: it is mostly available but has sporadic bursts of unavailability. Thus, MFDs are highly available resources with a large potential for use by Grid jobs.

Although there are more than 1,200 instances of availability of more than one hour, two-thirds of the availability periods during daytime hours are 20 minutes or less. For daytime Grid jobs, it can therefore be difficult to find an uninterrupted availability period. Even though there are many available time periods during the day, scheduling greedily to fill most of an availability period might lead to frequent application failures and restarts. As a result, the failure rate gets compounded; it includes the failure rate for restarted jobs as well. Therefore, Section 8.6.2 conducts an experiment to determine the size of Grid jobs that the MFD pool can handle well, that is, for which the pool can keep Grid job failure rates between 5% and 10%.
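To make the compounding concrete, here is a brief worked derivation under the simplifying assumption (not stated in the trace analysis itself) that each attempt of a Grid job fails independently with the same probability f and that a failed attempt restarts from scratch:

\[
E[\text{attempts}] \;=\; \sum_{k=1}^{\infty} k\, f^{k-1}(1-f) \;=\; \frac{1}{1-f},
\qquad
E[\text{wasted attempts}] \;=\; \frac{1}{1-f} - 1 \;=\; \frac{f}{1-f}.
\]

For f = 0.5 (plausible when a job nearly fills a short daytime availability window), a job is expected to run twice before completing; keeping f between 5% and 10%, as targeted above, limits the expected overhead to roughly 5-11% additional attempts.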

8.4.2 Transitional Behavior

This section analyzes the behavior of the MFDs in terms of when and what types of transitions they make. State transitions can affect the execution of an application because they can trigger reduced application performance (the Partially Available state) or even require application restart (the Busy state).

[Figure 8.5: MFD state transitions by day of week]
[Figure 8.6: MFD state transitions by hour of day]

Figure 8.5 shows the number of transitions to each state by day of the week. Transitions to the Available, Busy, and Partially Available states are highest on Monday, taper off slightly by the end of the week, and are significantly reduced on weekends. Figure 8.6 shows the number of transitions to each state by hour of day. The numbers of transitions to each of the three states show clear diurnal patterns, and are similar to one another. Transitions peak at around 10am and 2pm, with a trough between the two peaks corresponding to the lunch break. Transitions are much lower overnight. Thus, MFD utilization follows strong weekly and diurnal patterns. These patterns lend themselves to increased predictability and scheduling potential.

8.4.3 Durational Behavior

This section analyzes the behavior of the MFDs in terms of how long they remain in each state relative to the day of the week and the hour of the day. Longer periods of availability would allow longer Grid jobs to complete without interruption.

[Figure 8.7: MFD state duration by day of week]
[Figure 8.8: MFD state duration by hour of day]
[Figure 8.9: Number of occurrences of each state duration]

Figure 8.7 depicts the average duration of each state according to the day of the week. The average duration of an Available period is quite low during the weekdays, at approximately 1.4 hours, but becomes much larger during the weekend, reaching a peak of 20 hours on Saturday. Figure 8.8 investigates the average duration of each state according to the hour of the day. As expected, the average duration of the Available state is at its lowest in the middle of the day, since this is when users are most likely to utilize the MFDs. The duration of the Busy state shows no such trends. These results indicate that longer jobs are best scheduled later in the day, when availability periods are typically much longer. Figure 8.9 shows the distribution of the durations spent in each state. It plots the number of occurrences of each duration versus the length of that duration. The most significant result shown in Figure 8.9 concerns the duration of the Available state: the vast majority of availability periods are extremely short, lasting 20 minutes or less, with an extremely long tail indicating decreasing occurrences of longer available durations.

8.5 A Metric for Burstiness

This section proposes a burstiness metric and an algorithm with which to determine the burstiness of a given resource. Burstiness captures temporary surges in activity that occur sporadically in the MFD domain. Activities tend to occur in bursts in the sense that MFD use is clustered in time and lasts for short durations. This makes burstiness difficult to detect and counter effectively. For example, frequent bursts certainly impede the completion of any long Grid job and also make scheduling short jobs difficult, because shorter jobs are purged or restarted frequently, causing resource thrashing and wasted network and/or processor cycles. Therefore, to perform high-performance distributed Grid computing on devices where Grid jobs are preempted by bursty, higher-priority local jobs, burstiness can be used as a form of availability prediction. The proposed burstiness metric enables the system to schedule jobs when the burstiness of a resource is low, thereby greatly lowering the likelihood of job eviction.

8.5.1 Burstiness Test

The following notation is used. At any given time t_now, let T refer to a recently elapsed time window over which burstiness is to be estimated (T is set to 2 hours). Let µ and σ refer to the mean and standard deviation, respectively, of the time between jobs (that is, the interarrival time) in the period T.

To quantify burstiness at time t_now, the algorithm looks back over the period T and estimates whether the most recent block of time within T (of length µ, the average interarrival time between jobs in T) (a) has experienced more transitions to the device's Busy state than other blocks of time in T, and (b) has a spread that is significantly different from that of the other blocks in T. These comparisons detect whether the most recent block has had an abnormally high job arrival rate, with spreads just small enough to frequently disrupt Grid tasks. More formally, the steps are as follows:

1. Starting from t_now, go back and divide the period T into contiguous blocks of time, each of length µ, the average interarrival time. Call each time block b_p, where p is the index of the block counting backwards from t_now.

2. Let the number of job arrivals in block b_p be n_p.

3. Compute the mean and standard deviation of the number of job arrivals per block across all blocks, and refer to them as µ_B and σ_B.

4. For a given block p, the spread in jobs is given by the standard deviation σ_p of the time between jobs within that block.

5. Comparison 1: Determine whether n_p is within one σ_B of µ_B (meaning there are relatively more transitions into the unavailable states in this time block).

6. Comparison 2: Determine whether σ_p < σ (meaning the jobs are more clustered and less spread out than normally observed within the period T).

8.5.2 Degree of Burstiness

If the answers to both Comparison 1 and Comparison 2 are Yes, then the block b_p is considered bursty (its degree of burstiness is quantified by Equation 8.3 below); otherwise, the block b_p does not exhibit burstiness. Two dimensionless quantities characterize burstiness quantitatively. The strength S of the burst is defined as

    S_p = n_p / µ_B    (8.1)

where a quantity greater than one denotes that time block p is experiencing a burst. The spread ratio D is defined as

    D_p = σ / σ_p    (8.2)

where a quantity greater than one denotes that time block p is experiencing a tighter cluster than the whole time-history window. The burstiness of a block p is then defined as

    B_p = S_p + D_p    (8.3)

Other forms of this metric could weight the two quantities in Equation 8.3 differently from one another. One drawback of this metric is that bursty events may spill over into adjacent time blocks {b_p, b_p-1} and/or {b_p, b_p+1}. This could be addressed in future work by establishing time block combination strategies.
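A minimal sketch of the burstiness computation described by the steps and Equations 8.1 through 8.3 above, assuming the job history for a device is available as a list of arrival timestamps in seconds; variable names are illustrative, and degenerate cases (for example, an empty most-recent block) are handled crudely:

import statistics

def burstiness(arrivals, t_now, window=2 * 3600):
    # Compute B_p = S_p + D_p (Equations 8.1-8.3) for the most recent block
    # within the history window T; 'arrivals' holds job arrival times (seconds).
    recent = sorted(t for t in arrivals if t_now - window <= t <= t_now)
    if len(recent) < 3:
        return 0.0                          # too little history to detect a burst

    # Step 1: the mean interarrival time mu defines the block length.
    gaps = [b - a for a, b in zip(recent, recent[1:])]
    mu = statistics.mean(gaps)
    sigma = statistics.stdev(gaps)          # spread of interarrival times over all of T
    if mu <= 0:
        return 0.0

    # Divide T into contiguous blocks of length mu, counting back from t_now.
    blocks, end = [], t_now
    while end > t_now - window:
        start = end - mu
        blocks.append([t for t in recent if start < t <= end])
        end = start

    counts = [len(b) for b in blocks]
    mu_b = statistics.mean(counts)          # mean arrivals per block (step 3)

    latest = blocks[0]                      # most recent block b_p
    n_p = len(latest)
    s_p = n_p / mu_b if mu_b > 0 else 0.0   # strength S_p, Equation 8.1

    latest_gaps = [b - a for a, b in zip(latest, latest[1:])]
    sigma_p = statistics.stdev(latest_gaps) if len(latest_gaps) > 1 else 0.0
    d_p = sigma / sigma_p if sigma_p > 0 else 0.0   # spread ratio D_p, Equation 8.2

    return s_p + d_p                        # burstiness B_p, Equation 8.3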

8.5.3 Burstiness Characterization

To quantify the burstiness of the devices, burstiness values were computed at 10^7 sample points (slightly less than once every 2 seconds over the trace). Figure 8.10 plots the calculated burstiness values versus both the time of day and the day of the week for the four MFDs.

[Figure 8.10: Burstiness versus Day of Week and Hour of Day]

The left subgraph illustrates that device burstiness peaks at about mid-day and falls off dramatically early and late in the day. The right subgraph indicates that burstiness reaches a peak on Monday, after which it slowly tapers off until it falls dramatically on the weekend. These results confirm the intuition that MFDs experience maximum burstiness on weekdays around mid-day. As shown in the next section, the burstiness of a resource closely follows its transitional behavior. Hence it is a good indicator of volatility and can be extremely useful in scheduling to avoid preemptions brought on by MFD transitions to unavailable states (e.g., printing).

8.6 Burstiness-Aware Scheduling Strategies

This section discusses the viability of scheduling jobs on MFDs given the volatility they exhibit, and proposes a new scheduling technique that leverages the burstiness metric proposed in Section 8.5 to make better job placement decisions. These placement decisions reduce the number of evictions/preemptions experienced by jobs and thereby reduce the average job makespan, increasing the overall efficiency of the system. When a preemption is encountered, the executing job halts and restarts execution elsewhere (thereby increasing its makespan).

8.6.1 Experiment Setup

To investigate the scheduling techniques proposed in this section, the following subsections simulate Grid jobs executing on the MFDs; each MFD exhibits its own recorded availability in accordance with the gathered (core job) workload trace. The simulations in this chapter insert Grid jobs (the aforementioned elastic class) at random times in the trace such that, for the overall simulation, job injection is uniformly distributed over the considered period of 153 days.

However, jobs are only injected between 9am and 5pm on weekdays, to reflect the types of jobs that are expected to be executed on such a system, as discussed earlier. The simulation assigns each job a duration (i.e., a number of operations needed for completion), where the duration is an estimate based on executing the job on an idle device with average CPU speed (device attributes are not unlike server attributes and are based on published design/model specifications). Similarly to the simulation setup in Section 5.1, the CPU speed, the load on each MFD, and the (un)availability states (determined by the trace) influence the running of the jobs. MFDs are considered available if they are not executing a print job (i.e., are not in the Busy state). An MFD may only be assigned one task for execution at a time. During each simulator tick, the simulator calculates the number of operations completed by each working MFD and updates the records for each executing job. If a device executing a job enters the Busy state due to the arrival of a core job, the Grid job is preempted and put back in the queue for rescheduling. MFDs are assumed to have no load on their CPU unless the Partially Available state is entered. While in this state, an executing job is not preempted, but the local activity imposes a load of 0.5 on the processor. Note, however, that this state can model any portion of availability. In this case, jobs executing on an MFD in the Partially Available state will effectively take twice as long as they would if they were run on an MFD in the Available state.
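The per-tick accounting just described can be sketched as follows. This is a simplified illustration rather than PredSim code; the attribute names (running_job, ops_per_second, total_ops) are assumptions:

def simulator_tick(mfds, queue, tick_seconds):
    # One simulated time step: advance every running Grid job, slow it down in
    # the Partially Available state, and preempt it (losing all progress) if
    # the MFD has entered the Busy state.
    for mfd in mfds:
        job = mfd.running_job
        if job is None:
            continue
        if mfd.state == "BUSY":
            job.completed_ops = 0             # progress is lost on preemption
            mfd.running_job = None
            queue.append(job)                 # back in the queue for rescheduling
        else:
            share = 0.5 if mfd.state == "PARTIALLY_AVAILABLE" else 1.0
            job.completed_ops += mfd.ops_per_second * share * tick_seconds
            if job.completed_ops >= job.total_ops:
                mfd.running_job = None        # the Grid job has finished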

Section 8.6.2 tests the following scheduling strategies:

Random: Randomly selects an MFD for execution.

Upper Bound: Selects the available MFD that will execute the job in the smallest execution time, without preemption, based on global knowledge of MFD availability. When all MFDs would become unavailable before completing the job, the scheduler chooses the fastest available MFD. (This scheduler is used for comparison purposes only; global knowledge will not usually be available.)

Burst: For each job, each MFD's burstiness is calculated at the time the job is being scheduled. The MFD with the smallest burstiness value is greedily selected for job execution.

Burst+Speed: For each job, each MFD's burstiness B_i is calculated at the time the job is being scheduled. The algorithm greedily chooses the MFD with the highest score RS_i, where the score of MFD i is calculated according to the following expression:

    RS_i = (1 / B_i) * (CPU_i / CPU_max)    (8.4)

where B_i is MFD i's burstiness, CPU_i is its CPU speed, and CPU_max is the highest CPU speed of all resources (for normalization).

Random is a scheduler with no information and represents the least sophisticated scheduling approach (a lower bound). Upper Bound is a scheduler with global knowledge and represents an upper bound on possible scheduling results (maximum makespan improvement with the fewest number of preemptions). Both Burst and Burst+Speed are greedy schedulers: Burst takes only MFD volatility into account, whereas Burst+Speed considers both MFD volatility and speed.
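For concreteness, a minimal sketch of the greedy Burst+Speed placement rule defined by Equation 8.4, reusing the burstiness() function sketched in Section 8.5; the MFD attribute names are illustrative and not part of PredSim:

def burst_speed_select(mfds, t_now):
    # Greedy Burst+Speed placement: pick the available MFD with the highest
    # RS_i = (1 / B_i) * (CPU_i / CPU_max), per Equation 8.4.
    candidates = [m for m in mfds if m.state == "AVAILABLE" and m.running_job is None]
    if not candidates:
        return None
    cpu_max = max(m.cpu_speed for m in candidates)

    def score(m):
        b_i = max(burstiness(m.arrival_history, t_now), 1e-6)  # guard against B_i = 0
        return (1.0 / b_i) * (m.cpu_speed / cpu_max)

    return max(candidates, key=score)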

8.6.2 Empirical Evaluation

This section investigates the performance of the proposed scheduling strategies with two types of experiments.

Impact of (Grid) Job Length

[Figure 8.11: Comparison of MFD schedulers as Grid job length increases]

Figure 8.11 illustrates the performance of the scheduling strategies as the average job length varies. Job length is the time required to complete the job on a machine with an average dedicated CPU. 1,000 jobs are injected randomly and uniformly over the trace while varying the average length of the jobs. For each simulation, the four schedulers are given an identical set of jobs to schedule. In the trace with the shortest jobs, job lengths vary from five to 30 minutes (average length of 17.5 minutes). In the trace with the longest jobs, job lengths vary from five minutes to six hours (average length of 3.4 hours). Intermediate traces varied job lengths between five minutes and 76, 147, 218, and 289 minutes, respectively.

Figure 8.11's left subgraph plots the percentage makespan improvement over the random scheduler versus the average job length. Figure 8.11 demonstrates that simply utilizing the burstiness metric (Burst) to consider the volatility of resources and avoid preemptions greatly reduces the average job makespan: by 24% for the shortest average job length and by 3.6% as the average job length increases to 3 hours. Considering volatility and resource speed when scheduling with Burst+Speed further reduces the average job makespan: by 36.6% for the shortest average job length and by 7.5% as the average job length increases to three hours. Burst+Speed comes within an average of 7.6% of Upper Bound's makespan improvement across all job lengths; it comes within 13.7% for shorter jobs, gradually improving to within 4.3% for longer jobs. Burst+Speed produces an average makespan improvement of 18.3% over the random scheduler. This indicates that the burstiness metric closely captures the true volatility of the devices and allows schedulers to choose MFDs that are far less likely to experience print preemptions, greatly increasing the efficiency of the system by lowering average job makespan. The overall trend for all schedulers is that as the average job length increases, the possible makespan improvement over random scheduling lessens. This is because longer jobs are simply more likely to fail and, given the volatility of the devices in general, even an ideal scheduler such as Upper Bound cannot avoid preemptions for these longer jobs.

Figure 8.11's right subgraph plots the number of preemptions (jobs interrupted by the MFD encountering a print job and being forced off the MFD) versus the average job length. As average job length increases, all scheduling strategies naturally encounter more job preemptions. Note the direct correlation between preemptions encountered and average job makespan: fewer preemptions reduce the average job makespan. Interestingly, Burst and Burst+Speed encounter a similar number of preemptions through the range of average job lengths, with Burst+Speed encountering slightly more. Incorporating CPU speed into the MFD selection criteria (in Burst+Speed) greatly increases the makespan improvement (from a 3.8% additional improvement for longer jobs to a 12% additional improvement for shorter jobs) over considering only volatility with Burst.

[Figure 8.12: Determination of Grid job lengths with 95% confidence intervals for given failure rates]

Figure 8.12's left subgraph shows the maximum job length such that a job will only be preempted k% of the time. Ten replications of the aforementioned experiments were executed, and the box-and-whisker plots in Figure 8.12a provide a 95% confidence interval. For example, using the Burst+Speed scheduler, the diagram indicates with 95% confidence that a job of 7.5 minutes will only fail around k = 10% of the time. Users could read such values directly off the graph to choose Grid job lengths that could be run on the system, given some tolerance for application failure and a particular statistical confidence interval. Figure 8.12's right subgraph shows how the average failure rate varies with job length, as computed from the ten experiment replications mentioned above.

Impact of (Grid) Job Load

[Figure 8.13: Comparison of MFD schedulers as Grid job load increases]

Figure 8.13 investigates the effect of varying the load. This test fixes the job length between five minutes and one hour (average Grid job length of 32.5 minutes) and varies the number of jobs injected over the course of the trace from 1,000 to 20,000. Again, Figure 8.13's left subgraph plots average job makespan improvement as the load increases, using the random scheduler as a baseline. Overall, the makespan improvement decreases as job load increases: as MFD contention increases, eventually all MFDs become utilized by all the schedulers, preventing a scheduler from choosing an MFD that will not experience preemption. Therefore, the schedulers converge on the same average job makespan. In the low-load situations, Burst produces a 19.2% makespan improvement while Burst+Speed produces a 31.4% makespan improvement compared to the Random scheduler, coming within 13.3% of Upper Bound's 44.7% makespan improvement. As load increases, the makespan improvement gap between these techniques lessens, but Burst consistently produces a makespan improvement over Random, and Burst+Speed in turn consistently produces a makespan improvement over Burst. On average across all loads depicted, Burst+Speed comes within 4.3% of the Upper Bound's makespan improvement and produces an average of 7.11% makespan improvement over the random scheduler, with the gap lessening as load increases.

Figure 8.13's right subgraph plots the number of preemptions produced by each scheduling strategy versus the job load on the system. As job load increases, all scheduling strategies produce more preemptions, eventually converging on roughly the same number of preemptions in the extremely high load case. The largest gaps come at lower loads, with the gap narrowing as load increases. Again, the figure demonstrates that the number of preemptions produced by a scheduling strategy directly influences the average job makespan. In all job load cases, Random produces the largest number of preemptions, followed by Burst+Speed, then Burst, and finally Upper Bound. These scheduling results demonstrate that the Burst+Speed scheduling mechanism greatly reduces the number of preemptions experienced by jobs compared with the random scheduler, and thereby decreases the average job makespan for a variety of job lengths and job loads.

Chapter 9

Simulator Tool

This chapter discusses the motivation and implementation of the PredSim simulation framework. PredSim was used to obtain all the results presented throughout this work. It is a discrete-event, multi-purpose simulator with capabilities ranging from analyzing resource availability traces to testing job scheduling and availability prediction algorithms. PredSim's primary purpose is to allow users to easily add or remove components, such as predictors and schedulers, for easy comparison. PredSim also enables researchers to use real-world resource availability and workload traces to drive their simulations, thereby better approximating the actual behavior of a real system. In this way, PredSim allows for fast prototyping and development of reliability-aware and fault-tolerant prediction and scheduling techniques, driven by real-world trace scenarios.

9.1 Simulation Background

This section presents background information on simulation-based research. Section 9.1.1 discusses the state of the art and its fundamental limitations, and Section 9.1.2 discusses related distributed system simulators.

9.1.1 Simulation Based Research

Simulation-based results are used in a variety of fields, such as construction, manufacturing, industrial processes, environmental science, and computer science, to drive effective resource management, production, and research. Simulators are particularly useful to researchers for the following reasons.

Scalability: Simulators, when implemented well, can scale to incorporate many more resources and jobs than may be feasible or practical to achieve in the real world. The number of jobs injected into the system can exceed observed system load, to test the system under more stressful, rarely seen circumstances, which may become more common in the future.

Ease of Deployment: The overhead of creating, distributing, and setting up a real-world system can be prohibitively costly and time consuming. Simulations have the advantage of an extremely low cost of deployment. Furthermore, testing unproven ideas in a real-world system can cause an unnecessary slowdown or loss of service to the users of the system.

Repeatability and Consistency: As opposed to real-world situations, in simulation-based testing a given scenario can be run repeatedly until an optimal or close-to-optimal solution can be found. This testing and retesting allows techniques to be fine-tuned and, furthermore, provides the demonstrable repeatability essential to the scientific process. It also allows researchers to test competing ideas in a consistent environment with common scenarios to determine the most effective solution(s).

Speed of Results: Many testing scenarios can take months or years to run to completion. Testing these scenarios in the real world means that results come extremely slowly. Furthermore, it means that the turnaround time for testing, modifying a technique, and retesting becomes extremely tedious. Simulation can allow these long scenarios to be tested in minutes or hours, giving the researcher extremely timely feedback with which to further their work. This allows techniques to be developed several orders of magnitude faster.

Accessibility: Simulators can easily be made available to the research community and distributed to interested parties. This is especially important given that it is difficult to have ready access to an actual Grid system. This ease of access and distribution can allow more collaboration between researchers on a problem.

Although simulators have many benefits, they still have limitations and weaknesses compared with other approaches such as real-world testing.

Compatibility: Simulators are often written for the express purposes devised by their authors. In particular, their exact designs will often be unique, and results from one simulator may not be directly comparable with results from another simulator.

Bugs: Human error in the coding can cause a simulator to produce inaccurate results.

Artificiality: The scenarios examined in simulation can be far-fetched and not representative of any foreseeable real situation. This may lead to results being misinterpreted.

Applicability: Even the most thorough and exhaustive simulation cannot possibly represent the complexities of the real world. By their very nature, simulations are simplifications of real-world situations that not only make many assumptions about how the virtual world will operate but also make simplifying assumptions to facilitate their development.

In the end, nothing can completely take the place of real-world testing, but simulation-based research has proved to be an invaluable tool.

9.1.2 Distributed System Simulators

Computer Science as a field has developed a whole range of simulation-based tools for use by researchers in all of its various subfields. For example, SimOS can simulate an entire computer, including the operating system and application programs, to study aspects of computer architecture and operating systems [63]. TOSSIM is a simulation-based framework in which sensor devices executing the TinyOS software system can be virtually deployed to investigate ad-hoc routing protocols and environmental data gathering techniques [38]. The range and functionality of these simulators is quite extensive, so this section focuses on related Grid-based simulation work.

Distributed and Grid systems have evolved to become powerful platforms for solving large-scale problems in science and other related fields. As these systems become larger and more complex, the management of the resources they comprise becomes more important. Coupled with the difficulty of obtaining access to a large-scale Grid testbed, it becomes imperative to provide researchers with tools for devising and analyzing algorithms before they are deployed in real-world environments. With this aim, many distributed and Grid system simulators have been developed [14] [12] [73] [71]. This section summarizes the most relevant related work.

GridSim, the most closely related simulator, is a Java-based discrete-event Grid simulation toolkit developed at the University of Melbourne to enable the study of distributed resource monitoring and application execution [12]. GridSim was specifically designed to incorporate unavailability of Grid resources during runtime, and provides the functionality to utilize workload traces taken from real-world environments to create a more realistic Grid simulation. GridSim also enables the user to specify the Grid network topology, complete with background traffic. GridSim is built on top of the SimJava discrete-event infrastructure and models resources and jobs executing on the discrete events defined by the underlying SimJava layer. GridSim incorporates application modeling and configuration, resource configuration, job management, and resource brokers and schedulers, allowing the system to be highly configurable and enabling simulations that reflect their real-world counterparts.

SimGrid is another event-driven simulation toolkit designed to provide a testbed for the evaluation of distributed system scheduling algorithms [14].

It enables the researcher to study application behavior in a heterogeneous distributed environment, with the key aim of rapid prototyping and evaluation of different scheduling approaches. It incorporates the ability to model resource availability behavior based on synthetic resource and network behavior models, fine-tuned with parameter constants or by inputting resource availability traces. Furthermore, SimGrid supports task dependencies in the form of Directed Acyclic Graph (DAG) jobs. DAGSim is built on top of this infrastructure and is used to evaluate scheduling algorithms for DAG-structured applications [28].

Bricks is a Java-based discrete-event simulation system geared towards providing components for scheduling and monitoring applications through the use of resource and network monitoring tools [73]. It consists of several components, including a ServerMonitor, NetworkMonitor, Scheduling Unit, and Network Predictor. These components work in tandem to monitor resource and network activity and then actively make application scheduling decisions.

The MicroGrid toolkit provides a platform on which to run Globus applications, allowing the evaluation of middleware and applications in a Grid-based environment [71]. MicroGrid creates a virtual Grid infrastructure and supports scalable clustered resources with configurable Grid resource performance attributes. The key functionality provided is to enable users to accurately simulate the execution of Globus applications by presenting the applications with an identical Application Program Interface (API), such that they behave accurately in the simulation. In this way, MicroGrid allows applications that have been designed to use the Globus Grid middleware infrastructure to be tested before deployment.

The most fundamental difference between the previous approaches and PredSim is that PredSim supports a multi-state resource availability model. As noted in Section 3.1, resources may become unavailable in a multitude of different ways, and because of the different effects that these unavailabilities can have on executing applications, it is important that a simulator incorporate these types of unavailability. Alongside this functionality, PredSim supports simulating checkpointable applications, including periodic and on-demand checkpoints along with checkpoint migration. Also, PredSim is purpose-built around supporting availability prediction-based scheduling approaches, and as such has built-in services for monitoring and cataloging resource history to enable prediction. Additionally, PredSim has a built-in mode for testing a predictor's accuracy, which can be used to isolate the accuracy of that predictor under a given scenario. PredSim also has built-in mechanisms for creating and managing task replicas to test prediction-based and non-prediction-based replication techniques.

9.2 PredSim Capabilities and Tools

This section summarizes the primary capabilities and features of PredSim and discusses its included tools.

9.2.1 Capabilities

PredSim provides the following core functionalities:

- Compatibility with a multi-state availability model
- Task scheduling and prediction accuracy modes
- Easy addition of user-specified components such as schedulers and predictors
- Compatibility with FTA and Condor resource availability trace formats
- Support for the GWA workload format
- Fully configurable synthetic resource availability and task creation options
- Statistical and resource information services for analysis
- Automatic execution of ranges of scenarios (load, resource characteristics, etc.)

In these ways, PredSim creates a functional testbed on which to compare prediction, prediction-based scheduling, and replication techniques.

9.2.2 Tools

In addition to the functionalities provided by the core PredSim modules, PredSim also includes several tools for further analysis and ease of use.

Multi-exec Tool

It is often important to test a given technique in multiple environments under a series of conditions. The core modules of PredSim allow the simulator to automatically execute a range of conditions for a series of schedulers/predictors, such as varying load, checkpointability, or replicatability, within a given trace. However, it is also important to test techniques under different resource availability sets and workloads. Multi-exec allows the user to set PredSim to test a given technique or set of techniques using different resource availability and workload traces; within each trace, the conditions for that simulation can then be varied.

Additionally, each run can be repeated a configurable number of times in order, for example, to develop confidence intervals for the results.

Combine Tool

Given a series of simulation runs, the Combine Tool allows the output files to be combined into a single set of output files. This can be useful for aggregating results and understanding the overall behavior of techniques in a variety of circumstances and environments. Also, results can sometimes be inconsistent between runs; executing many copies of a run and then combining the results can be helpful in determining the underlying performance trends.

Availability Analysis Tool

Section 3.2 investigated the availability behavior of the Notre Dame Condor pool. That data was produced by the Availability Analysis Tool, which takes a resource availability trace as input and outputs a whole range of availability analysis data. It incorporates the ability to analyze both single-state and multi-state availability traces. The analysis data it produces includes resource behavior versus day of the week and time of day. It also analyzes aggregate resource behavior, such as time spent in and transitions to each state over a trace. This tool can be useful for describing and characterizing the multi-state availability behavior of a set of traced resources.

9.3 PredSim System Design

At its core, PredSim was designed to allow and account for the different types of availability and unavailability that a resource may exhibit. Section 3.1 discussed these different states of availability, including the user becoming present on the machine, the local load going over a certain threshold, and the machine becoming unavailable unexpectedly (ungraceful failure). PredSim was designed to incorporate these different types of unavailability while having the functionality to simulate a variety of application types. PredSim supports both application checkpointing and migration, including periodic checkpointing and on-demand checkpointing. PredSim also has built-in services for supporting resource availability prediction and prediction-based scheduling and task replication algorithms.

Figure 9.1 shows the architectural components of PredSim grouped into three categories: Input Components, User Specified Components, and PredSim Components. The following sections discuss the design of these three component groups.

[Figure 9.1: PredSim System Architecture]

9.3.1 Input Components

In order to test prediction algorithms and scheduling techniques, both resource availability information and task information must be available to the simulator. PredSim has multiple methods for accepting and/or generating this information.

Resource Input

As noted earlier, PredSim is designed from the ground up to support a multi-state resource availability model. As such, its primary resource input format is that of Condor-formatted resource availability traces. By default, the Condor distributed system logs resource availability information, including user presence information, local CPU load measurements, and overall resource connectivity/availability. This information can be extracted from the system via the condor_stats command. An analysis tool included with PredSim then analyzes the Condor-formatted trace information and converts it into a format readable by PredSim. The PredSim format consists of newline-separated Unix epoch, availability state, and CPU load items. The availability state is an integer from 1 to 5 indicating the availability state according to the multi-state model. In this way, the input format retains all the information about resource availability, user presence, and CPU load.
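A minimal sketch of reading such a trace, under the assumption that each line holds one whitespace-separated (epoch, state, load) record; the exact delimiter, field layout, and any header lines of the real PredSim format are not specified here, and the file name is hypothetical:

from collections import Counter

def read_predsim_trace(path):
    # Yield (epoch, state, load) tuples from a PredSim-format availability trace.
    with open(path) as trace:
        for line in trace:
            fields = line.split()
            if len(fields) < 3:
                continue                      # skip blank or malformed lines
            epoch, state, load = fields[:3]
            yield int(epoch), int(state), float(load)

# Example: count how many records fall in each of the five availability states.
state_counts = Counter(state for _, state, _ in read_predsim_trace("resource01.trace"))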