ThroughputScheduler: Learning to Schedule on Heterogeneous Hadoop Clusters


Shekhar Gupta, Christian Fritz, Bob Price, Roger Hoover, and Johan de Kleer
Palo Alto Research Center, Palo Alto, CA, USA
{sgupta, cfritz, bprice, rhoover, dekleer}@parc.com

Cees Witteveen
Delft University of Technology, The Netherlands
c.witteveen@tudelft.nl

Abstract

Hadoop is the de-facto standard for big data analytics applications. Presently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capability of the nodes. We propose ThroughputScheduler, which reduces the overall job completion time on a cluster of heterogeneous nodes by actively scheduling tasks on nodes based on optimally matching job requirements to node capabilities. Node capabilities are learned by running probe jobs on the cluster. ThroughputScheduler uses a Bayesian, active learning scheme to learn the resource requirements of jobs on-the-fly. An empirical evaluation on a set of sample problems demonstrates that ThroughputScheduler can reduce total job completion time by almost 20% compared to the Hadoop FairScheduler and 40% compared to the FIFOScheduler. ThroughputScheduler also reduces average mapping time by 33% compared to either of these schedulers.

1 Introduction

Map-Reduce frameworks, such as Hadoop, are the technology of choice for implementing many big data applications. However, Hadoop and other frameworks typically assume a homogeneous cluster of server nodes and assign tasks to nodes regardless of their capabilities, while in practice, data centers may contain a heterogeneous mix of servers. When the jobs executing on the cluster also have heterogeneous resource requirements, which is typical, it is possible to significantly increase processing throughput by actively matching jobs to server capabilities [2, 4, 6].

In this paper, we present the ThroughputScheduler, which actively exploits the heterogeneity of a cluster to reduce the overall execution time of a collection of concurrently executing jobs with distinct resource requirements. This is accomplished without any additional input from the user or the cluster administrator. Optimal task allocation requires knowledge about both the resource requirements of jobs and the resource capabilities of servers, e.g., their relative CPU and disk I/O speeds. The ThroughputScheduler derives server capabilities by running probe jobs on the cluster nodes. These capabilities drift very slowly in practice and can be evaluated at infrequent intervals, e.g., at cluster set-up. In contrast, each new job has a-priori unknown resource requirements. We therefore present a learning scheme to learn job resource requirements on-the-fly.

The practicality of our solution relies on the structure of jobs in Hadoop. These jobs are subdivided into tasks, often numbering in the thousands, which are executed in parallel on different nodes. Mapping tasks belonging to different jobs can have very different resource requirements, while mapping tasks belonging to the same job are very similar. This is true for the large majority of practical mapping tasks, as Hadoop divides the data to be processed into evenly sized blocks. For a given job, we can therefore use online learning to learn a model of its resource requirements from a small number of mapping tasks in an explore phase, and then exploit this model to optimize the allocation of the remaining tasks. As we will show, this can result in a significant increase in throughput and never reduces throughput compared to Hadoop's baseline schedulers (FIFO and FairScheduler).
We focus on minimizing the overall time to completion of mapping tasks, which is typically the primary driver of overall job completion time. The next section reviews scheduling in Hadoop, followed by a discussion of related work. We then define a model of task completion time based on server capabilities and task requirements. We derive a Bayesian experimental design for learning the parameters of this model online, and present a real-time heuristic algorithm to optimally schedule tasks onto available cluster nodes using this model. Finally, we show empirically that ThroughputScheduler can reduce overall job execution time by up to 40% on a heterogeneous Hadoop cluster.

2 Hadoop Scheduler

In this section we briefly review the scheduler of Hadoop YARN [1]. YARN has a central entity called the resource manager. The resource manager has two primary modules: Scheduler and ApplicationManager. For every incoming job the ApplicationManager starts an ApplicationMaster on one of the slave nodes. The ApplicationMaster makes resource requests to the resource manager and is also responsible for monitoring the status of the job. Jobs are divided into tasks, and for every task the scheduler assigns a container upon the request of the corresponding ApplicationMaster. A container specifies the node to run the task on and a fixed amount of resources (memory and CPU cores). YARN supports allocating containers based on the resources available on the nodes (as of now, just based on memory), but it has no mechanism to determine the actual resource requirements of a job.

To coordinate the allocation of resources for concurrent jobs, Hadoop provides three different schedulers: FIFO-, Fair-, and CapacityScheduler. FairScheduler is the most popular of these because it enables fairness among concurrently executing jobs by giving them equal resources. All of Hadoop's schedulers are unaware of the actual resource profiles of jobs and of the capabilities of nodes in the cluster, and therefore often allocate resources sub-optimally.

3 Related Work

Recently, researchers have realized that the assumption of a homogeneous cluster is no longer true in many scenarios and have started to develop approaches that improve Hadoop's performance on heterogeneous clusters.

Speculative execution, a feature of Hadoop where a task that takes longer to finish than expected gets re-executed preemptively on a second node on the assumption that the first may fail, can lead to degraded performance on heterogeneous clusters. This is because the scheduler's model of how long a task should take does not take the heterogeneous resources into account, leading to many instances of unnecessary speculative execution for tasks executing on slower nodes. The LATE Scheduler [9] improves speculative execution for heterogeneous clusters, speculatively executing only tasks that will indeed finish late, using the concept of straggler tasks [3]. However, the approach assumes that the hardware capabilities and the task resource profiles are already known rather than being discovered automatically.

The Context Aware Scheduler for Hadoop (CASH) [5] assigns tasks to the nodes that are most capable of satisfying the tasks' resource requirements. Similar to our approach, CASH learns resource capabilities and resource requirements to enable efficient scheduling. However, unlike our online learning, CASH learns capabilities and requirements in offline mode. The performance of CASH was evaluated on a Hadoop simulator rather than a real cluster. Tian et al. propose a dynamic scheduler which learns job resource profiles on the fly [8]. Their scheduler only considers the heterogeneity of the workload and assumes a homogeneous cluster when assigning tasks to nodes. An architecture of a resource-aware cloud-driver for heterogeneous Hadoop clusters has been proposed to improve performance and increase fairness [7]. The cloud-driver tries to improve performance by providing more efficient fairness among jobs in terms of resource allocation. Unlike our approach, the cloud-driver assumes that cluster capabilities are already known and that it has abstract knowledge of job resource requirements.

4 Approach

In this section we describe the design of a scheduler that optimizes the assignment of tasks to servers. To do this, we need the task requirements and server capabilities.
Unfortunately, these requirements and capabilities are not directly observable, as there is no simple way of translating server hardware specifications and task program code into resource parameters. We take a learning-based approach which starts with an explore phase, where parameters are learned, followed by an exploit phase, in which the parameters are used to allocate tasks to servers. To learn these parameters by observation, we propose a task execution model that links observed execution times of map tasks to the unobservable parameters. We assume that map tasks belonging to the same job have very similar resource requirements. In the remainder of this section, we introduce the task model and then describe the explore and exploit phases.

4.1 Task Model

The task performance model predicts the execution time of a task on a server given the task's resource requirements and the capabilities of the server node. We model a task as a set of resource-specific operation types, such as reading data from HDFS, performing computation, or transferring data over the network. The task resource requirements are represented by a vector $\theta = [\theta_1, \theta_2, \ldots, \theta_N]$, where each component represents the total requirement for an operation type (e.g., number of instructions to process, bytes of I/O to read). The capabilities of the server are described by a corresponding vector $\kappa = [\kappa_1, \kappa_2, \ldots, \kappa_N]$, whose components represent rates for processing the respective operation types (e.g., FLOPS, or I/O per second).

In theory, some of these operations could take place simultaneously. For instance, some computation can occur while waiting for disk I/O. In practice this does not have a large impact on the Hadoop tasks we studied. We therefore assume that the requirements for each operation type are processed independently. The time required to process a resource requirement is the total magnitude of the requirement divided by the processing rate. The total time $T_j$ to process all resource requirements on server $j$ is the sum of the times for each operation type:

$$T_j = \sum_i \frac{\theta_i}{\kappa_i^j} + \Omega_j \qquad (1)$$

where $\Omega_j$ is the overhead to start the task on the server. We assume that every job imposes the same amount of overhead on a given machine. In this paper, we consider a two-dimensional model in which $\kappa = [\kappa_c, \kappa_d]$ represents computation and disk I/O server capabilities and $\theta = [\theta_c, \theta_d]$ represents the corresponding task requirements. Hence, the task duration model reduces to:

$$T_j = \frac{\theta_c}{\kappa_c^j} + \frac{\theta_d}{\kappa_d^j} + \Omega_j. \qquad (2)$$

The parameters $\kappa_c$ and $\kappa_d$ abstractly capture many complex low-level hardware dependencies. For example, $\kappa_c$ internally accounts for the kind of operations that need to be performed (floating-point, integer, or memory operations). Similarly, $\kappa_d$ depends on disk speed, seek time, etc. In practice, it is very difficult to build a task model as a function of these low-level parameters. To keep the model simple and easy to understand, we use these abstract parameters.
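As a concrete illustration of Equation 2 (the numbers here are ours, chosen for exposition, and are not measurements from the paper's cluster), consider a task with requirements $\theta_c = 100$ and $\theta_d = 50$ and two nodes with equal overhead $\Omega = 3$: node A with $\kappa_c = 2, \kappa_d = 1$ and node B with $\kappa_c = 1, \kappa_d = 2$. The model predicts

$$T_A = \frac{100}{2} + \frac{50}{1} + 3 = 103, \qquad T_B = \frac{100}{1} + \frac{50}{2} + 3 = 128,$$

so the CPU-heavy task is better placed on node A, freeing node B for disk-heavy work. This asymmetry is exactly what the exploit phase in Section 4.3 acts on.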
4.2 Explore

We learn server resource capabilities and task resource requirements separately. First we learn server capabilities offline. Then, using these capabilities, we actively learn the resource requirements of jobs online.

4.2.1 Learning Node Capabilities

We assume that server capabilities $\kappa$ and overheads $\Omega_j$ do not change frequently and can be estimated offline. The server parameters are estimated by executing probe jobs. Since the time we measure is the only dimension with fixed units, the values of the parameters are underdetermined. We resolve the unidentifiability of the system by choosing a unit map task to define a baseline. The unit map task has an empty map function and does not read or write from/to HDFS. The computation ($\theta_c$) and disk task requirements ($\theta_d$) are both zero; therefore Equation 2 allows us to estimate $\Omega$. Multiple executions are averaged to create an accurate point estimate. Note that $\Omega$ includes some computation and disk I/O that occur during start-up.

One could imagine attempting to isolate the remaining parameters in the same fashion; however, it is difficult to construct a job with zero computation or zero disk I/O. Instead we construct jobs with two different levels of resource usage defined by a fixed ratio $\eta$. Let us assume we aim to determine $\kappa_c$. First we run a job $J_c^1 = \langle \theta_c, \varepsilon \rangle$ with fixed disk requirement $\varepsilon$ ($J_c^1$ might be a job which simply reads an input file and processes the text in the file). We compute the average execution time of this job on each server node. According to our task model, the average mapping time on every machine $i$ is given by

$$T_1^i = \frac{\theta_c}{\kappa_c^i} + \frac{\varepsilon}{\kappa_d^i} + \Omega_i \qquad (3)$$

Next we run a job $J_c^\eta$ which reads the same input but whose processing is multiplied by $\eta$ compared to $J_c^1$. The resource requirements of $J_c^\eta$ are therefore $J_c^\eta = \langle \eta\theta_c, \varepsilon \rangle$, and the average mapping time on every node is

$$T_\eta^i = \frac{\eta\theta_c}{\kappa_c^i} + \frac{\varepsilon}{\kappa_d^i} + \Omega_i \qquad (4)$$

We solve for $\varepsilon/\kappa_d^i$ in Equations 3 and 4, set the results equal, and solve for $\kappa_c^i$ to get:

$$\kappa_c^i = \frac{\theta_c(\eta - 1)}{T_\eta^i - T_1^i} \qquad (5)$$

This equation gives us $\kappa_c^i$ only up to a ratio. To make it absolute, we arbitrarily choose one node as the reference node: we set $\kappa_c^1 = 1$ and $\kappa_d^1 = 1$ and then solve Equation 5 for $\theta_c$. Once we have the task requirement $\theta_c$ in terms of the base units of server one, we can use this job requirement to solve for the server capabilities of all the other nodes. Similarly, we estimate $\kappa_d$.

Normally in Hadoop, the output of map tasks goes to multiple reducers and may be replicated on several servers. This would have the effect of introducing network communication costs into the system. To avoid that while learning node capabilities, we set the number of reducers to zero and the replication factor to one.

Table 1 gives an example of computed server capability parameters for a five-node cluster of heterogeneous machines. The algorithm correctly discovers that there are two classes of machines.
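To make the probe-job arithmetic concrete, here is a minimal sketch of Equations 2 through 5 in Python. It is our illustration, not the authors' implementation; the function name and input format are ours. It assumes averaged map times for the unit job and the two probe jobs on every node, with node 1 (index 0) as the reference node:

```python
def estimate_capabilities(t_unit, t_1, t_eta, eta):
    """Recover per-node overheads and CPU capabilities from probe timings.

    t_unit[i] : mean map time on node i of the unit job (empty map function,
                no HDFS reads/writes), so Eq. 2 gives Omega_i directly
    t_1[i]    : mean map time on node i of probe job J_c^1 = <theta_c, eps>
    t_eta[i]  : mean map time on node i of J_c^eta = <eta * theta_c, eps>
    eta       : computation ratio between the two probe jobs (eta > 1)
    Node 1 (index 0) is the reference node with kappa_c = kappa_d = 1.
    """
    # The unit job has theta_c = theta_d = 0, so its time is pure overhead.
    omega = list(t_unit)

    # On the reference node, Eq. 5 reads 1 = theta_c*(eta-1)/(T_eta - T_1),
    # which fixes theta_c in the reference node's units.
    theta_c = (t_eta[0] - t_1[0]) / (eta - 1.0)

    # Eq. 5 for every node, now in absolute (reference) units.
    kappa_c = [theta_c * (eta - 1.0) / (t_eta[i] - t_1[i])
               for i in range(len(t_unit))]
    return omega, kappa_c, theta_c
```

An analogous pair of probe jobs that scales disk I/O by $\eta$ instead of computation would recover $\kappa_d$ the same way, yielding the kind of per-node estimates reported in Table 1.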

Table 1: Recorded Node Capabilities and Overhead ($\kappa_c$, $\kappa_d$, and $\Omega$ for Nodes 1 through 5).

4.2.2 Learning Job Resource Profile

In this phase the resource requirements of tasks are learned in an online manner, without interrupting production use of the cluster. To enable online learning, we collect task completion time samples from actual production jobs. With every new time sample we update our belief about the resource profile $[\theta_c, \theta_d]$ of the job.

We assume that the observed execution time $T_j$ is normally distributed around the value predicted by the task duration model given by Eq. 2. Given a distribution over resource parameters $[\theta_c, \theta_d]$, the remaining uncertainty due to changing conditions on the server (i.e., the observation noise) is given by a standard deviation $\sigma_j$:

$$T_j \sim \mathcal{N}\!\left(\frac{\theta_c}{\kappa_c^j} + \frac{\theta_d}{\kappa_d^j} + \Omega_j,\; \sigma_j\right)$$

Starting with prior beliefs about task requirements $p(\theta_c, \theta_d)$ and the execution-model-based likelihood function $p(T_j \mid \theta_c, \theta_d, \kappa_c^j, \kappa_d^j, \sigma_j)$, Bayes' rule allows us to compute a joint posterior belief over $[\theta_c, \theta_d]$:

$$p(\theta_c, \theta_d \mid T_j, \kappa_c^j, \kappa_d^j, \sigma_j) = \alpha\, p(T_j \mid \theta_c, \theta_d, \kappa_c^j, \kappa_d^j, \sigma_j)\, p(\theta_c, \theta_d) \qquad (6)$$

For our two-dimensional CPU and disk usage example, the likelihood has the form below (empirically, we observed variation of approximately $\pm 3$, indicating a standard deviation of 1; therefore $\sigma_j = 1$):

$$p(T_j \mid \theta_c, \theta_d; \kappa_c, \kappa_d) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(T_j - \frac{\theta_c}{\kappa_c^j} - \frac{\theta_d}{\kappa_d^j} - \Omega_j\right)^2}{2}\right)$$

Note that the execution time is normally distributed around a line defined by the server capabilities $[\kappa_c, \kappa_d]$. The joint distribution of the likelihood is not a bivariate normal, but a univariate Gaussian tube around a line. This makes sense, as a given execution time could be due to a slow CPU and a fast disk, or a fast CPU and a slow disk.

When a job is first submitted, we assume that the resource requirements of its tasks are completely unknown. Assuming an uninformative prior, the posterior distribution after the first observation is simply proportional to the likelihood:

$$p(\theta_c, \theta_d \mid T_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{\left(T_j - \frac{\theta_c}{\kappa_c^j} - \frac{\theta_d}{\kappa_d^j} - \Omega_j\right)^2}{2\sigma_j^2}\right)$$

For the second and subsequent updates we have a definite prior distribution and likelihood function; these are multiplied to obtain the density of the next posterior. Let the first experiment be on machine $j$ with capabilities $\kappa^j$ and observed time $T_j$, and the second on machine $k$ with capabilities $\kappa^k$ and observed time $T_k$. The resulting posterior distribution is

$$p(\theta_c, \theta_d \mid T_j, T_k) = \frac{1}{2\pi} \exp\left(-\frac{\left(T_j - \frac{\theta_c}{\kappa_c^j} - \frac{\theta_d}{\kappa_d^j} - \Omega_j\right)^2 + \left(T_k - \frac{\theta_c}{\kappa_c^k} - \frac{\theta_d}{\kappa_d^k} - \Omega_k\right)^2}{2}\right) \qquad (7)$$

We omit the derivation for space, but we do give the update rules here. With every time sample we can recover the mean $\mu_{\theta_c, \theta_d}$ and covariance matrix $\Sigma_{\theta_c, \theta_d}$ by using the properties of the bivariate Gaussian distribution. Expanding the exponent of Equation 7 and collecting the $\theta_c$ and $\theta_d$ terms gives us a conic section in standard form:

$$a_{20}\theta_c^2 + a_{10}\theta_c + a_{11}\theta_c\theta_d + a_{01}\theta_d + a_{02}\theta_d^2 + a_{00} = 0 \qquad (8)$$

There is a transformation that maps between the coefficients of a conic in standard form and the parameters of a Gaussian distribution. The mean and covariance of the distribution with the same elliptical form are given by:

$$\begin{bmatrix} \mu_{\theta_c} \\ \mu_{\theta_d} \end{bmatrix} = \begin{bmatrix} (a_{11}a_{01} - 2a_{02}a_{10}) / (4a_{20}a_{02} - a_{11}^2) \\ (a_{11}a_{10} - 2a_{20}a_{01}) / (4a_{20}a_{02} - a_{11}^2) \end{bmatrix} \qquad (9)$$

$$\Sigma_{\theta_c\theta_d}^{-1} = \begin{bmatrix} 2a_{20} & a_{11} \\ a_{11} & 2a_{02} \end{bmatrix} \qquad (10)$$

For every new time sample we compute the coefficients $a_{nm}$ of Equation 8. These coefficients determine the updated values of $\mu_{\theta_c}$, $\mu_{\theta_d}$, and $\Sigma_{\theta_c\theta_d}$. Because we recover both the mean and the covariance of the task requirements, we can quantify our degree of uncertainty about them, and hence decide whether to keep exploring or to start exploiting this knowledge for optimized task scheduling. In this paper we sample tasks until the determinant of the covariance matrix $\Sigma_{\theta_c\theta_d}$ falls below a fixed threshold.
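These update rules lend themselves to a compact incremental implementation. The following sketch is our own illustration, not the paper's code (the class and method names are invented); it accumulates the conic coefficients of Equation 8 from each observed task time and reads off the posterior mean and covariance via Equations 9 and 10:

```python
import numpy as np

class JobProfileLearner:
    """Tracks a Gaussian belief over (theta_c, theta_d) for one job."""

    def __init__(self, sigma=1.0):
        # Conic coefficients of Eq. 8, accumulated over observations.
        self.a20 = self.a11 = self.a02 = self.a10 = self.a01 = 0.0
        self.sigma2 = sigma ** 2

    def observe(self, t, kappa_c, kappa_d, omega):
        """Add one task time t measured on a node (kappa_c, kappa_d, omega)."""
        u, v = 1.0 / kappa_c, 1.0 / kappa_d
        r = t - omega
        w = 1.0 / (2.0 * self.sigma2)
        # Expanding w*(r - u*theta_c - v*theta_d)^2 and collecting terms:
        self.a20 += w * u * u
        self.a11 += 2.0 * w * u * v
        self.a02 += w * v * v
        self.a10 -= 2.0 * w * r * u
        self.a01 -= 2.0 * w * r * v

    def posterior(self):
        """Return ((mu_c, mu_d), covariance), or None while degenerate."""
        det = 4.0 * self.a20 * self.a02 - self.a11 ** 2
        if det <= 0.0:  # belief is still a "tube" around a line, not an ellipse
            return None
        mu_c = (self.a11 * self.a01 - 2.0 * self.a02 * self.a10) / det  # Eq. 9
        mu_d = (self.a11 * self.a10 - 2.0 * self.a20 * self.a01) / det
        prec = np.array([[2.0 * self.a20, self.a11],
                         [self.a11, 2.0 * self.a02]])                   # Eq. 10
        return (mu_c, mu_d), np.linalg.inv(prec)
```

A scheduler built on this could call observe() after every sampled map task of a job and switch that job from explore to exploit once the determinant of the returned covariance drops below the chosen threshold. Note that a single observation leaves the precision matrix singular, matching the "Gaussian tube" picture above: at least two samples on nodes with different capability ratios are needed before the belief becomes a proper ellipse.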
Table 2 summarizes the resource requirements learned by the online inference mechanism for some of the Hadoop example jobs.

Table 2: Job resource profile measurements ($\mu_{\theta_c}$, $\mu_{\theta_d}$, $\Sigma_{\theta_c\theta_d}$) with the number of tasks executed, for Pi, RandomWriter, Grep, WordCount (two input sizes), and $J_{IO}$.

When we compare the Pi job, which calculates digits of Pi, to RandomWriter, which writes bulk data, we see that the algorithm correctly recovers the fact that Pi is compute-intensive (large $\mu_{\theta_c}$) whereas RandomWriter is disk-intensive (large $\mu_{\theta_d}$). Other Hadoop jobs show intermediate resource profiles, as expected. The $J_{IO}$ job will be described further in the experimental section. The "# of Tasks" column gives the number of tasks executed to reach the desired confidence.

4.3 Exploit

Once the resource profile of a job is learned to sufficient accuracy, we switch from explore to exploit. The native Hadoop scheduler sorts task/machine pairs according to whether they are local (the data for the task is available on the machine), on the same rack, or remote. We introduce a routine based on our task requirement estimates, called SelectBestJob, to break ties within each of these tiers, as shown in Algorithm 4.1: if we have two local jobs, we run the one most compatible with the machine first.

Algorithm 4.1: THROUGHPUTSCHEDULER(Cluster, Request)
  for each Node N in Cluster:
    LocalJobs     <- N.GETJOBSLOCAL(Request)
    RackJobs      <- N.GETJOBSRACK(Request)
    OffSwitchJobs <- N.GETJOBSOFFSWITCH(Request)
    if LocalJobs != NULL:
      J <- SELECTBESTJOB(LocalJobs, N)
      ASSIGNTASKFORJOB(N, J)
    else if RackJobs != NULL:
      J <- SELECTBESTJOB(RackJobs, N)
      ASSIGNTASKFORJOB(N, J)
    else:
      J <- SELECTBESTJOB(OffSwitchJobs, N)
      ASSIGNTASKFORJOB(N, J)

Algorithm 4.2: SELECTBESTJOB(N, ListOfJobs)
  return argmin over J in ListOfJobs of
    norm(theta_c^J)/norm(kappa_c^N) + norm(theta_d^J)/norm(kappa_d^N)

SelectBestJob, shown in Algorithm 4.2, selects the job $J$ that minimizes a score for task completion on node $N$. However, rather than using the absolute values of $\theta_c$, $\theta_d$, $\kappa_c$, and $\kappa_d$, we use normalized values of these parameters to define the score. While the absolute values represent the expected time to completion, which can be measured in seconds, job selection based on these numbers would always favor short tasks over longer ones and fast machines over slower ones. This would not achieve the desired matching of job requirements to server capabilities. For example, consider Nodes 1 and 3 in Table 1. Node 3 is almost 7.5 times faster than Node 1 in terms of CPU, but only 2.5 times faster in terms of disk. Hence, intuitively, disk-intense jobs are better scheduled on Node 1, since the relatively higher CPU performance of Node 3 is better used for CPU-intense jobs (if there are any). To account for this relativity of optimal resource matching, we normalize both jobs and machines to make their total requirements and capabilities sum to one for each resource $x$ (here $x \in \{c, d\}$):

$$\mathrm{norm}(\theta_x^i) = \frac{\mu_{\theta_x^i}}{\sum_j \mu_{\theta_x^j}}, \qquad \mathrm{norm}(\kappa_x^i) = \frac{\kappa_x^i}{\sum_{j=1}^{5} \kappa_x^j}$$
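The tie-breaking rule of Algorithm 4.2 is easy to express in code. The sketch below is our illustration (the data structures and names are ours, not from the paper's implementation); jobs carry the learned means from the explore phase and nodes carry the probed capabilities:

```python
def select_best_job(candidates, node, all_jobs, cluster):
    """Algorithm 4.2: pick the candidate job best matched to `node`.

    candidates : jobs eligible at this locality tier (local/rack/off-switch)
    node       : {"kappa_c": ..., "kappa_d": ...} from the probe phase
    all_jobs   : all concurrently running jobs, for normalizing requirements
    cluster    : all nodes, for normalizing capabilities
    """
    # Normalize so requirements and capabilities each sum to 1 per resource.
    sum_mu_c = sum(j["mu_c"] for j in all_jobs)
    sum_mu_d = sum(j["mu_d"] for j in all_jobs)
    sum_kc = sum(n["kappa_c"] for n in cluster)
    sum_kd = sum(n["kappa_d"] for n in cluster)

    def score(job):
        # Normalized requirement over normalized capability, per resource;
        # a smaller score means the job is relatively better matched here.
        return ((job["mu_c"] / sum_mu_c) / (node["kappa_c"] / sum_kc) +
                (job["mu_d"] / sum_mu_d) / (node["kappa_d"] / sum_kd))

    return min(candidates, key=score)
```

Algorithm 4.1 then simply calls this routine on the local-job list first, falling back to rack-local and then off-switch jobs, so data locality still takes precedence over profile matching.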

5 Experimental Results

To evaluate the performance of ThroughputScheduler we conducted experiments on a five-node Hadoop cluster at PARC (see Table 1).

5.1 Evaluation on Heterogeneous Jobs

We evaluate the performance of our scheduler on jobs with different resource requirements. Since the Hadoop benchmarks do not contain highly I/O-intensive jobs (cf. Table 2), we constructed our own I/O-intensive Map-Reduce job, $J_{IO}$. $J_{IO}$ reads 1.5 GB from HDFS and writes files totaling 15 GB back to HDFS. This resembles the resource requirements of many extract-transform-load (ETL) applications used in big data settings to preprocess data using Map-Reduce and write it into HBase, MongoDB, or another disk-backed database. We learned $J_{IO}$'s resource profile using the job learner described in the Explore section; the learned resource requirements of $J_{IO}$ are listed in Table 2.

To evaluate ThroughputScheduler on drastically heterogeneous job profiles, we ran $J_{IO}$ along with the Hadoop benchmark Pi, which is CPU-intense. We compare the performance of ThroughputScheduler with the FIFO- and FairScheduler (for a single user, CapacityScheduler is no different from FIFO).

5.1.1 Job Completion Time

We first compare the performance of the proposed scheduler in terms of overall job completion time. In the case of multiple jobs, the overall job completion time is defined as the completion time of the job finishing last.
In this experiment we study the effect of heterogeneity between the jobs' resource requirements, which we quantify as the ratio of disk I/O to CPU requirement of a job: $h = \theta_d / \theta_c$. In order to vary this quantity, we vary the I/O load of $J_{IO}$ further by varying the replication factor of the cluster: the higher the replication factor, the higher the I/O load of a job. This impacts disk-I/O-intense jobs more than others.

The results (Figure 1) show that ThroughputScheduler performs better than the FIFO- and FairScheduler in all cases. The relative performance advantage of our scheduler grows as the heterogeneity of the two jobs increases, as simulated by an increased replication factor: up to 40% compared to FIFO, and 20% compared to Fair. Note that both the Fair- and the ThroughputScheduler benefit from higher replication, as they can better take advantage of data locality. The improvements of ThroughputScheduler beyond FairScheduler are purely due to our improved matching of jobs to computational resources.

Figure 1: Overall job completion time in minutes (Y axis) on heterogeneous nodes at PARC for different relative values of $h = \theta_d / \theta_c$. Disk load $\theta_d$ is increased by increasing the replication number.

To better understand the source of this speed-up, we considered the average mapping time for each job (throughput). Table 3 summarizes these results and provides the explanation for the speed-up: our scheduler improves the throughput of Pi by 33%, while maintaining the throughput of $J_{IO}$, compared to the other schedulers. Since Pi has very many mapping tasks, these savings pay off for the overall time to completion.

Job     FIFO          Fair    Throughput
Pi      9 sec         9 sec   6 sec
J_IO    2 min 15 sec  2 min   2 min 10 sec

Table 3: Comparison of Average Mapping Time

5.2 Performance on Benchmark Jobs

To estimate the performance of ThroughputScheduler on realistic workloads, we also experimented with the existing Hadoop example jobs. We ran the combinations of concurrent jobs shown in Table 4.

Comb 1    Grep (15 GB) + Pi (1500 samples)
Comb 2    WordCount (15 GB) + Pi (1500 samples)
Comb 3    WordCount (15 GB) + Grep (15 GB)

Table 4: Job Combinations

The performance comparison in terms of job completion time is presented in Figure 2. For these workloads ThroughputScheduler performs better than either of the other two schedulers in all cases. For Comb 2 the job completion time is reduced by 30% compared to FIFO. For Comb 3, all three schedulers perform similarly, because both jobs are CPU-intensive (cf. Table 2).

Figure 2: Job completion time in minutes (Y axis) of combinations of Hadoop example jobs.

5.3 Performance on a Homogeneous Cluster

We ran additional experiments on a set of homogeneous cluster nodes, to ensure that such a setup would not cause ThroughputScheduler to produce inferior performance. These results are shown in Table 5.

Job Combination                   FIFO    Fair    Throughput
Pi (1500 samples), WC (15 GB)     440s    319s    310s
Pi (1500 samples), Grep (15 GB)   210s    224s    214s
WC (15 GB), Grep (15 GB)          225s    262s    214s

Table 5: Completion time of job combinations on a homogeneous cluster.

6 Conclusion

ThroughputScheduler represents a unique method for scheduling jobs on heterogeneous Hadoop clusters using active learning. The framework learns both server capabilities and job task parameters autonomously. The resulting model can be used to optimize the allocation of tasks to servers and thereby reduce overall execution time (and power consumption). Initial results confirm that ThroughputScheduler performs better than the default Hadoop schedulers on heterogeneous clusters, and does not negatively impact performance even on homogeneous clusters. While our demonstration uses the Hadoop system, the approach implemented by ThroughputScheduler is applicable to other distributed computing frameworks as well.

References

[1] Apache Hadoop NextGen MapReduce (YARN). hadoop-yarn/hadoop-yarn-site/yarn.html.

[2] BALAKRISHNAN, S., RAJWAR, R., UPTON, M., AND LAI, K. The impact of performance asymmetry in emerging multicore architectures. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (2005).

[3] BORTNIKOV, E., FRANK, A., HILLEL, E., AND RAO, S. Predicting execution bottlenecks in map-reduce clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing (Berkeley, CA, USA, 2012), HotCloud'12, USENIX Association.

[4] GHIASI, S., KELLER, T., AND RAWSON, F. Scheduling for heterogeneous processors in server systems. In Proceedings of the 2nd Conference on Computing Frontiers (New York, NY, USA, 2005), CF'05, ACM.

[5] KUMAR, K. A., KONISHETTY, V. K., VORUGANTI, K., AND RAO, G. V. P. CASH: context aware scheduler for Hadoop. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (New York, NY, USA, 2012), ICACCI'12, ACM.

[6] KUMAR, R., TULLSEN, D. M., JOUPPI, N. P., AND RANGANATHAN, P. Heterogeneous chip multiprocessors. Computer 38, 11 (Nov. 2005).

[7] LEE, G., CHUN, B.-G., AND KATZ, H. Heterogeneity-aware resource allocation and scheduling in the cloud. In Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (Berkeley, CA, USA, 2011), HotCloud'11, USENIX Association.

[8] TIAN, C., ZHOU, H., HE, Y., AND ZHA, L. A dynamic MapReduce scheduler for heterogeneous workloads. In Grid and Cooperative Computing, GCC'09, Eighth International Conference on (2009).

[9] ZAHARIA, M., KONWINSKI, A., JOSEPH, A. D., KATZ, R., AND STOICA, I. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI'08, USENIX Association.