Extending Stride Scheduler with System Credit for Implicit Coscheduling


Lei Chen Ling

Abstract: This paper describes an extension to stride scheduling with system credit (SSC), a proportional-share resource management algorithm used in implicit coscheduling. SSC is an operating system local scheduler that runs on each node in a NOW (network of workstations). SSC ensures the coordination of parallel jobs across the cluster while achieving fairness for all participating workloads. However, it does not work well for heterogeneous workloads. To provide better response time for interactive jobs and more reliable fairness, we propose two improvements: the compensate stride and boost pass policies. We describe our approach, evaluate it through simulation experiments, and show that with the extended version of SSC, implicit coscheduling achieves better performance and efficiency.

1 Introduction

Networks of Workstations (NOW) are expected to succeed as general-purpose compute servers for the execution of parallel and sequential applications. To fully exploit the computing ability of a cluster, effective scheduling techniques are crucial and have been widely studied. For parallel applications, satisfactory performance can be achieved only if the communicating processes on different workstations are scheduled simultaneously (coscheduling) [3]. Traditional coscheduling strategies use explicit information and have many deficiencies: high context-switch overhead and poor scalability. Most importantly, local workloads such as interactive jobs suffer greatly when communicating jobs try to achieve and maintain coordination during coscheduling.

Recently, a class of coscheduling strategies has been proposed that tries to ensure the best performance for all types of applications. Among them, implicit coscheduling [1] makes scheduling decisions based on local knowledge and implicit information. Each communicating process infers the scheduling state of the parallel job across the cluster and decides independently whether or not to inform the local operating system scheduler, while each local scheduler remains free to make its own scheduling decisions. In a word, implicit coscheduling relies on the local scheduler and uses the communication behavior of parallel processes to achieve both the coordination of parallel jobs and good performance for other workloads.

There are two key mechanisms in implicit coscheduling. One is conditional two-phase waiting: in the first phase, the process spins for some time; if the expected event does not occur, the process voluntarily yields the CPU and goes to sleep. The spin time may be dynamically prolonged if some event occurs during the base spin phase. The other is a preemptive and fair local scheduler, which ensures efficiency and fairness. Each local scheduler is autonomous and is free to schedule heterogeneous workloads along with the communicating jobs.
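To make the first mechanism concrete, the following is a minimal sketch of conditional two-phase waiting in Python. This is our own illustration, not the implicit coscheduling implementation: the callables `event_arrived`, `saw_activity`, and `yield_and_sleep`, and the timing constants, are assumptions.

```python
import time

BASE_SPIN = 0.0005    # base spin time in seconds (illustrative value)
EXTEND_SPIN = 0.0002  # extra spin granted when related activity is observed

def two_phase_wait(event_arrived, saw_activity, yield_and_sleep):
    """Phase 1: spin, waiting for the expected event, conditionally
    extending the spin time when related activity is observed.
    Phase 2: give up and voluntarily relinquish the CPU."""
    deadline = time.monotonic() + BASE_SPIN
    while time.monotonic() < deadline:
        if event_arrived():
            return True               # event occurred while spinning
        if saw_activity():
            deadline += EXTEND_SPIN   # dynamically prolong the spin phase
    yield_and_sleep()                 # yield the CPU and block
    return False
```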

The local scheduler should satisfy three fundamental requirements. First, it must be able to perform selective preemption. For example, when a message arrives for a communicating process, preemption may be needed so that the process can achieve coordination with its peers on other workstations; likewise, for an interactive job that needs very low response latency, preemption is usually necessary. Second, the scheduler must export a fair cost-model, ensuring that communicating processes running at different rates receive approximately their proportional share of the CPU regardless of their computation characteristics (spin-wait, context-switch, compute, communicate, or block). The cost-model should also ensure that the desired coordination can be obtained at only a tolerable cost to the dispatch latency and fairness of local sequential jobs. Third, most schedulers are based on a periodic time-slice, but the time-slices are commonly not aligned or coordinated across workstations, so it is important that the time-slice be long relative to the local context-switch cost.

The current local scheduler used in implicit coscheduling is an extended version of stride scheduling [2] named stride scheduling with system credit (SSC). Stride scheduling is a proportional-share mechanism that allocates resources to clients in proportion to the number of tickets they hold. SSC extends this mechanism with a compensation policy based on system credit, guaranteeing that jobs that voluntarily relinquish the CPU still receive their proportional share of the resource over a certain period of time. The system grants exhaustible tickets (tickets with an expiration time) to processes according to how long they sleep, so a woken-up process receives a compensating share in the future. But SSC still has drawbacks: it can hardly meet the different needs of different kinds of jobs; e.g., interactive jobs need good response time, while scientific computation jobs care more about execution time.

In this paper, two main approaches to improving SSC are described: one is compensate stride, the other is boost pass. With these improvements, stride scheduling achieves better performance (e.g., less slowdown for parallel jobs and lower response latency for interactive jobs), while the fairness of the local scheduler is also strengthened.

The rest of this paper is organized as follows. In Section 2, we cover previous research on stride scheduling with system credit. The drawbacks of SSC and our approaches to improving it are described in Section 3. We present our methodology and simulation environment in Section 4 and continue with Section 5, where the experimental results are discussed and evaluated. Section 6 summarizes related work, while Section 7 discusses several issues that we leave for future work. Finally, we draw conclusions in Section 8.

2 Background

In this section, we describe traditional stride scheduling and the earlier extension with system credit. Proportional-share scheduling algorithms such as Waldspurger's lottery scheduling and stride scheduling [2] have been introduced to strive for instantaneous fairness. In lottery scheduling, each process holds a number of tickets. The scheduler makes its scheduling decision by picking a ticket from the runnable processes at random and choosing the process holding that ticket to run. If a client holds t tickets in a system with T total tickets, it receives t/T of the resource over a period of run time. Lottery scheduling suffers from short-term variability due to its probabilistic nature. Stride scheduling follows on from lottery scheduling, but it is a deterministic version.

2.1 Stride scheduling

As in lottery scheduling, in stride scheduling each client holds tickets, and the number of tickets determines the proportion of the resource the client receives. A client's stride is a time interval inversely proportional to its number of tickets; for example, a client with twice the tickets of another has half the stride. The more tickets a client holds, the smaller its stride, and the higher its effective priority: the stride determines how frequently the client is scheduled. Each client also has a pass, which is incremented by the client's stride each time it is scheduled. The basic algorithm is: in each quantum, the client with the minimum pass is selected, and its pass is updated by adding its stride. Thus, clients can simply be allocated tickets to share the resource proportionally.

[Figure 1: Stride scheduling. Three jobs (X, Y, and Z) with different ticket allocations are running on a single workstation. At each time-slice, the job with the minimum pass is scheduled (indicated by a circle) and its pass is increased by its stride. A global stride and a global pass are also tracked: the global stride is inversely proportional to the sum of all the tickets in the system, and the global pass is increased by the global stride at each time-slice.]

But the scheduler must deal with more than fairly scheduling CPU-bound jobs: in implicit coscheduling, the local scheduler is required to provide low slowdown to parallel jobs, good response time to interactive jobs, and throughput to I/O-intensive jobs. Furthermore, some jobs often voluntarily relinquish their time-slice to sleep, waking up when a communication message or computation request arrives. Basic stride scheduling treats all workloads as CPU-bound jobs, so these processes are given no compensation for relinquishing the CPU; their proportional share cannot be guaranteed and their response time cannot be improved in this situation. Thus stride scheduling must be revised to provide some compensation.
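The core loop of basic stride scheduling can be captured in a few lines. Below is a minimal, self-contained sketch in Python (our own illustration, not the SimSched code; the constant `STRIDE1` is the conventional large stride numerator).

```python
STRIDE1 = 1 << 20  # large constant; a client's stride is STRIDE1 / tickets

class Client:
    def __init__(self, name, tickets):
        self.name = name
        self.tickets = tickets
        self.stride = STRIDE1 // tickets  # inversely proportional to tickets
        self.passv = 0                    # 'pass' is a Python keyword

def run(clients, quanta):
    """Basic stride scheduling: each quantum, dispatch the client with
    the minimum pass and advance its pass by its stride."""
    schedule = []
    for _ in range(quanta):
        current = min(clients, key=lambda c: c.passv)
        current.passv += current.stride
        schedule.append(current.name)
    return schedule

# Three CPU-bound jobs with a 3:2:1 ticket allocation share one CPU;
# X is dispatched roughly three times as often as Z over any long window.
print(run([Client("X", 3), Client("Y", 2), Client("Z", 1)], 12))
```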

2.2 System credit

Jobs such as communicating jobs and interactive jobs should receive additional tickets for the time they are asleep. The system credit policy, built on exhaustible tickets [5] (tickets with an expiration time), is one feasible compensation method. After a client sleeps and awakens, the scheduler calculates the exhaustible tickets from the moment the client relinquished the processor and the length of its sleep. The exhaustible tickets ensure that the process's proportional share is achieved by the time the tickets expire. For a client with t tickets in a system with T total tickets that has slept for an interval S, with C the lifetime of the exhaustible tickets, the number of exhaustible tickets is calculated as follows:

    t_e = \frac{StT}{CT - (S + C)t}

which, in the special case C = S, simplifies to

    t_e = \frac{tT}{T - 2t}

Two additional cases are considered in the implementation of SSC. First, if the process relinquishes the CPU while it still holds exhaustible tickets (a purely interactive job usually cannot use up its time-slice, because it completes its computation in a very short time), later calculations of exhaustible tickets should combine the unconsumed set with the newly generated set. Second, the exhaustible tickets are calculated independently, without accounting for possible future increases in the total tickets T, so the process may not be fully compensated; SSC assumes that few clients simultaneously leave and join the system.
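As a worked example of this credit computation, here is a small Python function implementing the formula above (our own sketch, based on our reconstruction of the formula; SSC's actual implementation details may differ):

```python
def exhaustible_tickets(t, T, S, C):
    """Exhaustible tickets granted to a client holding t of T total
    tickets after sleeping for time S, consumed over lifetime C.
    The value is chosen so that, by the time the tickets expire, the
    client has still received its proportional share t/T of the CPU
    over the whole window S + C."""
    denominator = C * T - (S + C) * t
    if denominator <= 0:
        raise ValueError("lifetime C is too short to repay the sleep time")
    return S * t * T / denominator

# A client with 100 of 1000 tickets sleeps for 50 time units and is
# repaid over an equal lifetime (C = S): t_e = tT/(T - 2t) = 125.
print(exhaustible_tickets(t=100, T=1000, S=50, C=50))  # -> 125.0
```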

3 Approaches

Although SSC works well in an environment where several parallel jobs compete for the CPU, problems arise when SSC is applied to an environment with mixed workloads. In such a case, jobs have different needs. Some jobs need as much CPU time as possible and do not care when they are scheduled, e.g., some scientific computations. Some jobs want to be scheduled as soon as they wake up, e.g., some interactive jobs. Parallel jobs have special needs of their own: when a message arrives from a remote process, the parallel job benefits from being scheduled immediately, and while a parallel process is spinning, avoiding preemption helps it keep coordination with remote processes. In the remainder of this section, we explain some drawbacks of SSC and propose our approaches to solving them.

3.1 Compensate Stride

In SSC, the scheduler compensates woken-up processes by assigning them extra tickets. The compensate tickets make the process's stride smaller, so when the process is scheduled to run after waking up, its pass increases at a slower rate and it receives a larger CPU share. The problem here is that the woken-up process is not really compensated until it is scheduled. SSC does not regard response time as a kind of system resource in the way CPU share is, so it does not compensate for response latency at all.

Suppose a process does a lot of computation work. Its pass increases as the process uses the CPU; then the process goes to sleep, and the difference between its pass and the global pass is recorded. After a period of time, it is woken up by some event, for example a keystroke. It is now moved from the blocked queue to the ready queue with its compensate tickets assigned. Its new pass, which is the sum of the current global pass and the difference recorded before sleeping, may be much higher than the current global pass, so the woken-up process inevitably suffers high response latency.

To address this problem, we introduce the compensate stride and compensate pass. The idea behind the compensate stride is that after a client sleeps and awakens, the scheduler not only grants the client exhaustible tickets to ensure its proportional share over some long interval, but also assigns the client a positive compensate stride. While the woken process waits in the ready queue, its compensate pass is decreased by its compensate stride at every time interval. When the scheduler decides which process should run next, it compares the sum of the original pass and the compensate pass. By this means, the response latency of the waking process is reduced.

However, the effect of the compensate stride should be limited to manipulating short-term dispatch priority; it should not change the long-term proportional share of the client. So when the woken process is selected, we must remove the effect of the compensate stride. One simple approach is simply to set the compensate pass and compensate stride to zero, but this seems too ad hoc and introduces instability into the system. In our design, when the process is running, we set its compensate stride to a negative value, so that its compensate pass increases at every interval and the effect of the positive compensate stride accumulated during the waiting phase is gradually canceled. The whole scheme of the compensate stride approach is summarized in Table 1 and illustrated in Figure 2:

    Event                        Action
    Job enters the system        Set compensate pass to zero
    Process awakens              Assign compensate tickets; assign positive compensate stride
    Process is set to run        Assign negative compensate stride
    Process yields/is preempted  Assign positive compensate stride
    Process sleeps               Set both compensate stride and compensate pass to zero
    Tickets expire               Set both compensate stride and compensate pass to zero

Table 1. The actions taken at different events.
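The event handlers in Table 1 can be sketched directly. The following Python fragment is our own illustration of the bookkeeping; names such as `CompensatedClient` are hypothetical, and the computation of the stride values themselves is deferred to formulas (a) and (b) below.

```python
class CompensatedClient:
    def __init__(self):
        self.passv = 0        # ordinary stride-scheduling pass
        self.comp_pass = 0    # short-term compensation, per Table 1
        self.comp_stride = 0  # >0 while waiting after wakeup, <0 while running

    def on_enter(self):                 # job enters the system
        self.comp_pass = 0

    def on_awaken(self, s_positive):    # process awakens
        # exhaustible (compensate) tickets are also granted here (omitted)
        self.comp_stride = s_positive

    def on_dispatch(self, s_negative):  # process is set to run
        self.comp_stride = -s_negative  # undoes the waiting-phase boost over time

    def on_yield(self, s_positive):     # process yields or is preempted
        self.comp_stride = s_positive

    def on_sleep_or_expire(self):       # process sleeps / tickets expire
        self.comp_stride = 0
        self.comp_pass = 0

    def tick(self):                     # called every time interval
        self.comp_pass -= self.comp_stride

def pick_next(ready):
    # The process with the minimal (original pass + compensate pass) runs next.
    return min(ready, key=lambda c: c.passv + c.comp_pass)
```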

[Figure 2: The effect of the compensate pass. The figure shows how a process's pass changes over time in the different phases. Note that a positive compensate stride decreases the compensate pass and a negative compensate stride increases it. The process with the minimal value of (original pass + compensate pass) is selected to run in the next time interval.]

The values of the compensate stride (positive and negative) greatly affect the system's behavior. We use the following formulas (a) and (b) to calculate them:

    S_{positive} = S_{base} \cdot \frac{T_{compensate}}{T_{global}}    (a)

    S_{negative} = S_{positive} \cdot \frac{T_{global} - T_{current}}{T_{current}}    (b)

where S_{positive} is the positive compensate stride; S_{base} is the base stride, calculated directly from the process's base tickets; T_{compensate} is the compensate tickets; T_{global} is the sum of all active tickets in the system; and T_{current} is the process's current tickets, i.e., the sum of its base tickets and compensate tickets.

Formula (a) is chosen to guarantee that the waiting process accumulates the same compensation in its pass as it would have received had it been scheduled. During the computation phase, the negative compensate stride gradually removes the effect of the positive compensate stride, and when the process reaches its proportional share, the negative compensate stride has returned the compensate pass exactly to zero.
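As a concrete reading of formulas (a) and (b), the stride values can be computed as below (our own sketch, using the same symbols; `STRIDE1` is the stride constant from the earlier fragment):

```python
STRIDE1 = 1 << 20

def compensate_strides(base_tickets, compensate_tickets, global_tickets):
    """Compute the positive and negative compensate strides from
    formulas (a) and (b)."""
    s_base = STRIDE1 / base_tickets                # base stride from base tickets
    t_current = base_tickets + compensate_tickets  # current = base + compensate
    s_positive = s_base * compensate_tickets / global_tickets           # (a)
    s_negative = s_positive * (global_tickets - t_current) / t_current  # (b)
    return s_positive, s_negative

# A process with 100 base tickets and 25 compensate tickets in a system
# holding 1000 active tickets in total:
print(compensate_strides(100, 25, 1000))
```

Intuitively, the compensate pass falls by S_{positive} per interval while the process waits and rises by S_{negative} per interval while it runs; the two rates cancel exactly when the process receives its proportional fraction T_{current}/T_{global} of the intervals.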

3.2 Boost pass

Parallel jobs fit well into a two-phase model. During the computation phase, a parallel job is preemption-insensitive: if the workload is distributed evenly over all the workstations, the slowdown is not increased much by an occasional preemption. During the communication phase, however, parallel jobs are usually preemption-sensitive, and a single preemption may greatly reduce the performance of the whole job. If a parallel process is preempted while it is spinning, it usually misses the chance to intercept the message. One extreme example is the root process in a barrier operation: if this process is preempted, all the processes of that job have to block, and performance degrades greatly.

Our approach is quite simple. When a process is spinning, it is granted a negative boost pass, and when it stops spinning, its boost pass is revoked. The boost pass temporarily increases the priority of the spinning process. Figure 3 shows how the boost pass works.

[Figure 3: The effect of the boost pass. The figure shows how the boost pass is granted and revoked. Note that the boost pass is always non-positive.]

One argument against the boost pass is that, in order to obtain more CPU share, a process might cheat the scheduler by claiming to be spinning while it is actually computing, hurting the scheduler's fairness. A closer look reveals that this argument does not hold. We observe from Figure 3 that after the computation phase, the pass of the process returns to the same value whether or not it received a boost pass. In general, applying a boost pass has the same effect as borrowing share from the future: since the pass is cumulative, the process pays back the extra share it used once the boost pass is revoked. One might still argue that a process could acquire a large boost pass and never have it revoked, thereby obtaining more than its proportional CPU share. But since the value of the boost pass is determined by the scheduler, not by the process, and the value is small compared to the total pass accumulated over the process's lifetime, we can safely neglect this case.

Just like the value of the compensate stride, the value of the boost pass matters to system behavior. We have not yet studied this topic thoroughly. Currently we use the maximal compensate pass among all runnable processes as the magnitude of the boost pass, so that at least the spinning process is not immediately preempted by other, unboosted processes.

3.3 Integration

The compensate stride and boost pass approaches are orthogonal to each other; the only relation between them is that the magnitude of the boost pass equals the maximum compensate pass. They can therefore be easily integrated: at each time interval, the process with the minimum sum of original pass, compensate pass, and boost pass is selected to run next.
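Integrating both mechanisms changes only the selection key. The sketch below extends the `CompensatedClient` fields from the earlier fragment (our own illustration; we read "maximal compensate pass" as the largest compensation in magnitude, i.e., the most negative compensate pass, since compensate pass falls while a process waits):

```python
def on_spin_start(client, runnable):
    # Grant a non-positive boost pass matching the most negative compensate
    # pass among the runnable processes, so that no unboosted process can
    # immediately preempt the spinning one.
    client.boost_pass = min(0, min((c.comp_pass for c in runnable), default=0))

def on_spin_stop(client):
    client.boost_pass = 0  # revoke the boost as soon as spinning ends

def pick_next_integrated(runnable):
    # Compensate pass and boost pass combine additively with the original pass
    # (boost_pass is assumed to default to 0 on every client).
    return min(runnable, key=lambda c: c.passv + c.comp_pass + c.boost_pass)
```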

4 Simulation Environment

We continue to use the LogP model [13] and the event-driven, process-level simulator, SimSched, both of which were developed and used in the implicit coscheduling research. The LogP parameters are described in Table 2.

    Variable            Description                                       Value
    L (latency)         The delay incurred in communicating a short      10µs
                        message from the source node to the destination
    o (overhead)        The length of time that a processor is engaged   0
                        in the transmission or reception of a message
    g (gap)             The minimum interval between consecutive         0
                        message transmissions or receptions
    P (processors)      Number of processes in the system                32
    W (context switch)  The context-switch time in a processor           100µs

Table 2: Parameters of the LogP model and the values set in our experiments.

First, we add a new job model for interactive jobs to SimSched. An interactive job is similar to an I/O-intensive job, but the existing I/O-bound job model cannot simulate a typical interactive job. During the lifetime of an interactive job, in certain periods of time (interactive phases), the job generates a bunch of requests (e.g., typing several words in a short time generates a bunch of keyboard events). For each request, it computes for a very short time, then yields the CPU and goes to sleep. The parameters defining an interactive job in our model are listed below (see the configuration sketch at the end of this section).

Non-interactive interval: the time between two interactive phases; between two bunches of requests, an interactive job may simply sleep or act as a CPU-bound job.

Internal count: how many requests are in a single bunch.

Internal interval: the time between two consecutive requests.

Computation tick: how long the job computes for each request; this value is much less than the internal interval.

Term count: how many bunches occur during the lifetime of the interactive job.

Regular flag: a multimedia job behaves like an interactive job, but some of the parameters above are fixed for a multimedia job while they are random for an interactive job; the regular flag indicates whether the parameters are fixed.

In our simulation experiments, we consider a typical interactive job to be one that, in the interactive phase, repeatedly computes for approximately 5ms and then sleeps for 100ms (imitating an editor such as emacs), and that sleeps for 5s between two interactive phases.

In previous implicit coscheduling papers, the application workloads consisted entirely of parallel applications. That was the first step in handling general-purpose workloads and proved very successful. We want to go farther: we use CPU-bound, I/O-bound, and interactive jobs, respectively, as background competition jobs instead of pure parallel jobs.
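The interactive-job parameters above can be captured in a small configuration record. This is our own illustrative sketch, not part of SimSched; the field names mirror the list above, the 5ms/100ms/5s values are the "typical interactive job" from our experiments, and the bunch size and lifetime are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InteractiveJob:
    noninteractive_interval_ms: float  # time between two interactive phases
    internal_count: int                # requests per bunch
    internal_interval_ms: float        # time between consecutive requests
    computation_tick_ms: float         # compute time per request (<< internal interval)
    term_count: int                    # bunches over the job's lifetime
    regular: bool                      # fixed parameters (multimedia) vs random (interactive)

# Typical interactive job: compute ~5ms, sleep 100ms within a phase,
# and sleep 5s between phases.
emacs_like = InteractiveJob(
    noninteractive_interval_ms=5000,
    internal_count=50,                 # hypothetical bunch size
    internal_interval_ms=100,
    computation_tick_ms=5,
    term_count=10,                     # hypothetical lifetime
    regular=False,
)
```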

In these more diverse circumstances, the performance of the original SSC and the improved SSC is tested and compared.

5 Evaluation

In this section, the results of the simulation are studied. We first demonstrate the reduction in response latency for interactive jobs, then show that the scheduler gives parallel jobs a fairer CPU share. Finally, we investigate the sensitivity of the scheduler to load imbalance.

5.1 Response latency of interactive jobs

Figure 4 shows the response latency of interactive jobs when competing with two parallel jobs at different communication granularities. The parallel jobs use the NEWS communication pattern. The ticket ratio between the parallel jobs and the interactive job is 5:5:2, so that the tickets reasonably reflect the CPU needs of these jobs. From Figure 4, we can see that under the original SSC, the interactive jobs suffer high response latency when the communication granularity is coarse; in this case the parallel jobs act like CPU-bound jobs. As we pointed out before, the original SSC cannot fully compensate a process that is asleep most of the time. In the improved version, the interactive job gets much lower response latency because it receives an extra compensate stride.

[Figure 4: Response latency (y-axis) of interactive jobs competing against two parallel jobs, plotted against communication granularity, i.e., time between barriers (x-axis), for ssc(original) and ssc(improved). The ticket allocation is parallel job 1 : parallel job 2 : interactive job = 500:500:200. In most cases, the response latency of the interactive job is too low to be visible.]

5.2 Performance of parallel jobs

Figure 5 shows the slowdown of a parallel job competing with another parallel job and an interactive job, demonstrating the effect of the boost pass. When the communication granularity is fine, the new version of SSC speeds up the parallel job a great deal.

As the granularity becomes coarser, the behavior of the parallel job becomes more similar to that of a CPU-bound job, so the two versions of the scheduler produce similar results.

Comparing Figure 4 and Figure 5 reveals an interesting result. In general, since the computation ability of the CPU is constant, the performance of parallel jobs and the response latency of interactive jobs should be a tradeoff. But Figures 4 and 5 show that the improved version of SSC reduces the response time of the interactive job and at the same time speeds up the parallel job. Our explanation is that, unlike the original, the improved version is aware of the nature of each process, so it can reduce unnecessary contention between the two parallel jobs: the parallel job that is spinning gets priority over the parallel job that is computing.

[Figure 5: Slowdown of a parallel job (y-axis) competing with another parallel job and an interactive job, plotted against communication granularity, i.e., time between barriers (x-axis), for ssc(original) and ssc(improved). All parameters are the same as in Figure 4. The slowdown is calculated by dividing the completion time of the parallel job in the competing environment by 2.4 times its finish time when running alone. The anomalous speedup of the original SSC at c=1000µs is due to the correlation between the timeout of the two-phase wait and load imbalance.]

5.3 Sensitivity to load imbalance

Figure 6 shows the slowdown of a parallel job competing with another parallel job and a CPU-bound job. As the load imbalance increases, performance decreases for both versions, but the improved version always performs better than the original.

[Figure 6: Slowdown of a parallel job (y-axis) competing with another parallel job and a CPU-bound job, plotted against load imbalance in µs (x-axis), for ssc(original) and ssc(improved). The ticket allocation is parallel job 1 : parallel job 2 : CPU-bound job = 500:500:500. The communication granularity is 5ms. The slowdown is calculated by dividing the finish time of the parallel job in the competing environment by three times its finish time when running alone.]

6 Related Work

The virtual-time approach has proved powerful for scheduling multimedia (soft real-time) jobs, and virtual-time scheduling shares quite a lot of similarity with stride scheduling. In BVT (Borrowed-Virtual-Time) scheduling [8], the scheduler dispatches the runnable thread with the earliest effective virtual time (EVT). A latency-sensitive thread is allowed to warp back in virtual time to make itself appear earlier and thereby gain dispatch preference; this corresponds to the boost pass in our approach. Each thread also has a warp time limit Li and an unwarp time requirement Ui: thread i is allowed to run warped for at most Li, and if it attempts to warp again after having previously warped within Ui, the scheduler runs it unwarped until at least time Ui has passed. However, as we explained in Section 3.2, there is no need to impose such restrictions on parallel processes, since most parallel jobs are computation-oriented and cannot benefit from boosting their pass frequently.

In deadline-based scheduling [11], a thread declares its future CPU needs to the system. A periodic thread may express a sequence of similar reservations as a single period length and a per-period need. The system either accepts the request, in which case the thread is guaranteed to be dispatched according to its pre-declared need, or rejects it, in which case the thread receives no preferential dispatch. This kind of scheduling can dispatch processes according to their needs and balance different dispatch requests, but the complex scheduling model imposes extra overhead on both the application developer and the scheduler itself. One difficulty in applying the deadline-based approach to our design is that neither interactive jobs nor parallel jobs can predict precisely when they will next need to be scheduled.

BERT [10] begins with fair sharing but is sometimes willing to violate the rigid line of a "share". Rather than always providing a process with a fixed share, BERT dynamically adjusts the service abstraction provided to real-time processes in response to user input and individual task requirements, using stealing within the context of a fair-sharing algorithm to give real-time tasks the extra cycles they need to meet specific deadlines.

7 Future Work

We describe a number of issues raised by this paper that we leave for future work.

System overhead is a key factor in the success of a scheduler. In this paper, we used a simulator to evaluate our design, and the system overheads were not studied. In general, the complexity of our scheduler is O(n): at every time quantum, all waiting processes need to be updated to have their compensate pass changed, and these updates add overhead. One alternative is to not update the compensate pass at every time interval, but only when it is needed, i.e., when some waiting process's pass drops below that of the running process; a sketch of this lazy scheme follows.
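One possible realization of this lazy update is to derive each waiting process's compensate pass on demand from the time of its last update, rather than touching every process each quantum. A rough sketch (our own, under the assumption that the compensate pass changes linearly between events; the `last_update` field is hypothetical):

```python
def effective_pass(client, now):
    """Compute (original pass + compensate pass) lazily: instead of
    adjusting comp_pass every quantum, derive it from the time elapsed
    since the last event that changed the compensate stride."""
    elapsed = now - client.last_update  # quanta since the stride last changed
    comp_pass = client.comp_pass - client.comp_stride * elapsed
    return client.passv + comp_pass

def pick_next_lazy(ready, running, now):
    best = min(ready, key=lambda c: effective_pass(c, now))
    # Preempt only when some waiting process's effective pass has
    # dropped below that of the currently running process.
    if running is None or effective_pass(best, now) < effective_pass(running, now):
        return best
    return running
```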
In the current version of our implementation, the boost pass for a spinning process is set to the maximum compensate pass of all the runnable processes, in the hope that no other process can preempt it. How the selection of the boost pass value affects the scheduler's behavior has not been studied.

Generally, if the value of the boost pass is too small, the parallel process does not benefit much from it. If the value is too large, there are two serious outcomes: first, an interactive job may suffer response latency because its compensate stride is too small compared with the boost pass of the parallel process; second, too large a boost pass makes the parallel job borrow too much CPU share from the future, so when the parallel process stops spinning, it is not scheduled for a long time and the load imbalance of the parallel job increases. To develop a more effective local scheduling algorithm, it is important to have a precise model describing the relation between the boost pass and system behavior.

For further understanding and evaluation of our design, we will implement our scheduler in a real system and run benchmark workloads on it. We expect to encounter more problems and to reveal more of the intrinsic nature of the stride scheduler.

8 Conclusion

This work extends the stride scheduler with system credit to provide responsiveness for interactive jobs and to improve the effectiveness of implicit coscheduling. We grant a compensate stride to woken-up processes, giving an interactive process a reasonable chance to preempt the currently running process. We also reduce the pass of a parallel job's process during its spin-wait phase, so the performance of the parallel job benefits from a higher degree of coordination. These techniques are applied without squandering the proportional-share resource management semantics, and the two designs are combined to let the local scheduler allocate CPU resources.

Our measurements show that our design achieves the goals we expected. The scheduling latency of interactive processes is reduced and parallel processes obtain a fairer CPU share. The response latency of the interactive job is reduced from 5ms to almost zero, and the parallel job gains a speedup when competing with a CPU-bound job. The fairness of the local scheduler is improved and the two-phase wait algorithm becomes more effective.

We also found the power of the concept of the "pass". Compared with the "priority" of a traditional time-sharing scheduler, the pass is easier to manipulate and configure: it can describe a job's short-term and long-term needs and make tradeoffs between them, and it reduces the ideal system to a form that can be tracked, transforming the mathematical description of a complex system into an algorithm. Overall, we consider compensate stride and boost pass an important step toward a fully general, efficient, and fair local scheduler that can simultaneously execute jobs with different requirements, behaviors, and failure modes. Equipped with implicit coscheduling and such a fair local scheduler, a NOW becomes a powerful and versatile platform for applications with heterogeneous purposes.

References

[1] A. C. Arpaci-Dusseau. Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems. PhD thesis, University of California at Berkeley, December 1998.

[2] C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management. PhD thesis, Massachusetts Institute of Technology, 1995.

[3] R. H. Arpaci, A. C. Dusseau, A. M. Vahdat, L. T. Liu, T. E. Anderson, and D. A. Patterson. The interaction of parallel and sequential workloads on a network of workstations. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1995.

[4] Andrea C. Arpaci-Dusseau and David E. Culler. Extending Proportional-Share Scheduling to a Network of Workstations. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'97), June 1997.

[5] Carl A. Waldspurger and William E. Weihl. An object-oriented framework for modular resource management. In Proceedings of the 5th International Workshop on Object Orientation in Operating Systems, Seattle, WA, USA, October 1996. IEEE.

[6] Anoop Gupta, Andrew Tucker, and Shigeru Urushibara. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1991.

[7] P. Sobalvarro, S. Pakin, A. Chien, and W. Weihl. Dynamic coscheduling on workstation clusters. In Proceedings of the International Parallel Processing Symposium (IPPS '98), March-April 1998.

[8] K. J. Duda and D. R. Cheriton. Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, December 1999.

[9] D. Petrou, J. Milford, and G. Gibson. Implementing Lottery Scheduling: Matching the Specializations in Traditional Schedulers. In Proceedings of the USENIX 1999 Annual Technical Conference, June 1999.

[10] A. Bavier and L. L. Peterson. BERT: A Scheduler for Best Effort and Real-time Tasks. Technical Report, Department of Computer Science, Princeton University.

[11] J. Nieh and M. S. Lam. The Design, Implementation and Evaluation of SMART: A Scheduler for Multimedia Applications. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP'97), Saint-Malo, France, December 1997.

[12] Cosimo Anglano. A Comparative Evaluation of Implicit Coscheduling Strategies for Networks of Workstations. In Proceedings of the Ninth International Symposium on High Performance Distributed Computing (HPDC 9), Pittsburgh, PA, August 2000. IEEE Press.

[13] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model for parallel computation. In Proceedings of the 5th Symposium on Parallel Algorithms and Architectures, 1993.