Resource management in IaaS cloud. Fabien Hermenier
|
|
- Darren Harrington
- 5 years ago
- Views:
Transcription
1 Resource management in IaaS cloud Fabien Hermenier
2 where to place the VMs? how much to allocate
3 1What can a resource manager do for you 3
4 2 Models of VM scheduler 4
5 3Implementing VM schedulers 5
6 1What can a resource manager do for you 6
7 central piece of code provide the awaited Quality of Service address the management objectives
8 the RM should solve that VM6? N1 VM2 VM1 N2 VM3 VM4 VM5 N3 N4
9 the RM should prevent that VM2 VM1 VM3 N1 N2 VM6 VM4 VM5 N3 N4
10 Service Level Agreement a contract where a service is formally defined performance indicator MTTR MTBF availability (MTBF/MTTR + MTBF) pricing penalties
11 (KPI)? uptime as the only Key Performance Indicator
12 Cloud Performance SLAs What customers want guaranteed capacities (CPU, ory, bw) guaranteed latencies/throughput (network, load balancer, disk) What providers give -
13
14 the objective provider side min(x) or max(x)
15 atomic objectives min(penalties) Ownership) min(total Cost min(unbalance)
16 composite objectives using weights min(αx + β y) How to estimate coefficients? useful to model sth. you don t understand? min(α TCO + β VIOLATIONS) as a common quantifier: max(revenues)
17 ? Should a client ask for an objective
18 ? Should a client ask for an objective no if the provider cannot prove it is achieved one can check the latency is below a threshold proving a latency is minimal
19 SLOs and objectives are evolving ynamic Power Management (DRS 3.1) 2009? VM-VM affinity (DRS) 2010? Dedicated instances (EC2) mar apr VM-host affinity (DRS 4.1) there is no holy grail MaxVMsPerServer (DRS 5.1) The constraint needed in 2012 sep. 2012?? 2013
20 2 Models of VM schedulers 20
21 VM queue VM scheduler decisions actuators cloud model monitoring data
22 scheduling models static dynamic static dynamic schedulers resource allocation
23 static schedulers VM1 VM2 VM3 N1 N2 VM6 VM4 VM5 N3 N4
24 static schedulers VM2 VM6 VM3 VM1 N1 N2 VM4 VM5 N3 N4
25 static schedulers VM2 VM6 VM3 VM1 N1 N2 VM7 VM4 VM5 N3 N4
26 Combinatorial problem! past decisions impact the future ones
27 static schedulers VM1 buffering might help N1 VM2 N2 VM3 VM6 VM5 VM4 N3 N4
28 static schedulers VM1 buffering might help N1 VM2 N2 VM3 VM6 VM5 VM4 N3 N4
29 static schedulers VM1 buffering might help N1 VM2 N2 VM3 VM7 VM6 VM4 VM5 N3 N4
30 static schedulers VM1 VM7 buffering might help N1 VM2 N2 VM3 VM6 VM4 VM5 N3 N4
31 static schedulers VM1 VM7 buffering might help N1 VM2 N2 VM3 VM6 VM5 VM4 /!\ incoming rate starvation N3 N4
32 static schedulers manipulate the VM queue only react on VM arrival
33 pros static schedulers easy to model simple standard actions manage a few elements
34 static schedulers cons hard to perform fine grain optimisation
35 dynamic schedulers VM1 consider every VMs VM2 VM3 N1 N2 VM6 VM4 VM5 N3 N4
36 dynamic schedulers consider every VMs VM2 VM6 VM3 VM1 N1 N2 VM4 VM5 N3 N4
37 dynamic schedulers consider every VMs VM2 VM6 VM3 VM1 N1 N2 VM7 VM4 VM5 N3 N4
38 dynamic schedulers VM1 consider every VMs VM2 VM3 N1 N2 VM7 VM6 VM4 VM5 N3 N4
39 dynamic schedulers VM1 VM7 consider every VMs VM2 VM3 N1 N2 VM6 VM4 VM5 N3 N4
40 dynamic schedulers manipulate the VM queue reconfigure the schedule with migrations
41 dynamic schedulers a migration takes time
42 dynamic schedulers a migration impacts performance
43 dynamic schedulers dependency management VM1 VM2 VM4? VM2 VM6 N1 N2 N1 N2 VM6 VM4 VM1 N3 VM3 N4 VM5 N3 VM3 N4 VM5
44 dynamic schedulers dependency management VM4 VM1 VM2 N1 N2 VM6 N3 VM3 N4 VM5
45 dynamic schedulers dependency management VM1 VM2 N1 N2 VM4 VM6 N3 VM3 N4 VM5
46 dynamic schedulers dependency management VM6 VM1 VM2 N1 N2 VM4 N3 VM3 N4 VM5
47 dynamic schedulers dependency management VM6 VM2 N1 N2 VM4 VM1 N3 VM3 N4 VM5
48 dynamic schedulers cyclic dependencies anti-affinity(vm3,vm4) min(#onlinenodes) anti-affinity(vm3,vm4) min(#onlinenodes)? N1 N1 VM4 VM1 VM5 VM1 VM3 VM5 VM3 VM4 N3 N4 N3 N4
49 dynamic schedulers cyclic dependencies a pivot to break the cycle N1 fix or prevent the situation? VM4 VM1 VM3 VM5 N3 N4
50 dynamic schedulers cyclic dependencies N1 VM4 a pivot to break the cycle fix or prevent the situation? VM1 VM3 VM5 N3 N4
51 dynamic schedulers cyclic dependencies N1 VM4 a pivot to break the cycle fix or prevent the situation? VM5 VM1 VM3 N3 N4
52 dynamic schedulers cyclic dependencies a pivot to break the cycle N1 fix or prevent the situation? VM5 VM1 VM3 VM4 N3 N4
53 dynamic schedulers quality at a price VM4 VM1 VM2 N1 N2 VM6 N3 VM3 N4 VM5 min(#onlinenodes) = 3
54 dynamic schedulers quality at a price VM1 VM2 N1 N2 VM4 VM6 N3 VM3 N4 VM5 min(#onlinenodes) = 3
55 dynamic schedulers quality at a price VM6 VM1 VM2 N1 N2 VM4 N3 VM3 N4 VM5 min(#onlinenodes) = 3
56 dynamic schedulers quality at a price VM6 VM2 N1 N2 VM4 VM1 N3 VM3 N4 VM5 min(#onlinenodes) = 3
57 dynamic schedulers quality at a price VM6 sol #1: 1m,1m,2m VM2 N1 N2 VM4 VM1 N3 VM3 N4 VM5 min(#onlinenodes) = 3
58 dynamic schedulers quality at a price VM4 sol #1: 1m,1m,2m VM1 VM2 N1 N2 VM6 N3 VM3 N4 VM5 min(#onlinenodes) = 3
59 dynamic schedulers quality at a price VM4 sol #1: 1m,1m,2m VM1 VM2 N1 N2 VM5 VM6 VM3 N3 N4 min(#onlinenodes) = 3
60 dynamic schedulers quality at a price VM4 VM6 sol #1: 1m,1m,2m VM1 VM2 N1 N2 VM5 VM3 N3 N4 min(#onlinenodes) = 3
61 dynamic schedulers quality at a price VM4 VM6 sol #1: 1m,1m,2m N1 VM1 N2 VM2 sol #2: 1m,1m 1m VM5 VM3 N3 N4 min(#onlinenodes) = 3
62 dynamic schedulers quality at a price VM4 VM6 sol #1: 1m,1m,2m N1 VM1 N2 VM2 sol #2: 1m,1m 1m VM5 lower MTTR (faster) VM3 N3 N4 min(#onlinenodes) = 3
63 dynamic schedulers quality at a price the objective should reflect reconfiguration costs min(mttr), min(#migrations),
64 dynamic schedulers pros continuous optimisation through reconfiguration
65 dynamic schedulers cons harder to model harder to scale technically expensive costly reconfiguration
66 ? dynamic scheduling for the win in theory yes but that s theory benefits depends on the workload the objective/ SLAs the infrastructure dynamic scheduling is rare in public or large clouds
67 scheduling models static dynamic schedulers static dynamic resource allocation
68 static resource allocation VM3, ram, i/o, bandwidth allocated once for all allocation!= utilisation VM2 N1 VM1 p no sharing conservative allocation
69 static resource allocation sharing (overbooking) VM3 performance loss with concurrent accesses VM2 VM1 acceptable if stated in the SLA N1 p
70 dynamic resource allocation to fix violations VM3 VM2, ram, i/o, bandwidth allocated can be revised VM1 N1 p
71 dynamic resource allocation to fix violations VM2, ram, i/o, bandwidth allocated can be revised VM1 N1 p
72 dynamic resource allocation to fix violations VM2, ram, i/o, bandwidth allocated can be revised VM1 N1 p
73 dynamic resource allocation to support vertical elasticity VM3 VM2 VM1 N1 p
74 dynamic resource allocation to support vertical elasticity VM3 VM2 VM1 N1 p
75 dynamic resource allocation to support vertical elasticity VM2 VM1 N1 p
76 dynamic resource allocation to support vertical elasticity VM2 VM2 VM1 N1 p
77 ? dynamic resource allocation for the win common on CPU + overbooking (inc. hosting capacity) exceptional for ory (huge performance loss) benefits depends on the workload the objective/ SLAs the infrastructure
78 RECAP 50
79 The VM scheduler makes cloud benefits real 51
80 no holy grail 52
81 think about what is costly 53
82 static scheduling for a peaceful life 54
83 dynamic scheduling to cease the day 55
84 with great power comes great responsibility 56
85 3 Coding a VM scheduler
86
87 VM scheduling is hard static or dynamic scheduler allocation? does the workload/sla/objective requires migration? maximum duration to schedule?
88 VM scheduling is NP-Hard issues with large infrastructures or hard problems fast adhoc heuristics despite corner cases some use biased complete approaches (linear programming, constraint programming) like him
89 vector packing problem N1 VM3 VM1 items with a finite volume to place inside finite bins a generalisation of the bin packing problem N2 VM4 VM2 the basic to model the infra. 1 dimension = 1 resource
90 ? Which resource can be modeled as a packing dimension
91 ? Which resource can be modeled as a packing dimension CPU, ory, disk IO, licences cardinality, network boundaries, end to end network
92 how to support migrations temporary, resources are used on the source and the destination nodes
93 how to support migrations a simple way VM4 n-phases vector packing N1 VM1 N2 VM2 VM duplication between 2 phases to simulate the migration VM6 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
94 how to support migrations a simple way VM4 n-phases vector packing N1 VM1 N2 VM2 VM duplication between 2 phases to simulate the migration VM4 VM6 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
95 how to support migrations a simple way n-phases vector packing N1 VM1 N2 VM2 VM duplication between 2 phases to simulate the migration VM4 VM6 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
96 how to support migrations a simple way VM6 n-phases vector packing N1 VM1 N2 VM2 VM duplication between 2 phases to simulate the migration VM4 VM6 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
97 how to support migrations a simple way VM6 n-phases vector packing N1 VM1 N2 VM2 VM duplication between 2 phases to simulate the migration VM4 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
98 how to support migrations a simple way VM6 n-phases vector packing N1 VM1 N2 VM2 VM duplication between 2 phases to simulate the migration VM4 VM1 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
99 how to support migrations a simple way VM6 n-phases vector packing N1 N2 VM2 VM duplication between 2 phases to simulate the migration VM4 VM1 easy to implement N3 VM3 N4 VM5 hard to manipulate to compute long term previsions
100 how to support migrations the alternative way N1 N3 VM1 VM7 VM3 N2 N4 VM2 VM5 VM6 VM4 N5 offline(n2) + no CPU sharing 1-phase + compute dependencies step by steps /!\ ok to implement cycles #migrations
101 how to support migrations the alternative way VM6 VM7 VM4 VM1 VM5 VM3 VM2 N1 N4 N3 N2 N5
102 how to support migrations the alternative way VM5 VM6 VM7 VM1 VM3 N1 N4 N3 N2 VM4 VM2 1) migrate VM2, migrate VM4, migrate VM5 N5
103 how to support migrations the alternative way VM5 VM6 VM1 VM7 VM3 N1 N4 N3 N2 VM4 N5 VM2 1) migrate VM2, migrate VM4, migrate VM5 2) shutdown(n2), migrate VM7
104 coarse grain staging delay actions mig(vm2) mig(vm4) mig(vm5) off(n2) mig(vm7) stage1 stage2 time
105 Resource-Constrained Project Scheduling Problem the clean way to model space and time N1 N2 N3 N4 N5 VM1 VM2 VM4 VM7 VM3 VM6 VM off VM5 VM1 VM3 VM6 VM7 VM2 VM4 time
106 Resource-Constrained Project Scheduling Problem the clean way to model space and time N1 N2 N3 N4 N5 VM1 VM2 VM4 VM7 VM3 VM6 VM off VM5 VM1 VM3 VM6 VM7 VM2 VM4 time
107 Resource-Constrained Project Scheduling Problem the clean way to model space and time N1 N2 N3 N4 N5 VM1 VM2 VM4 VM7 VM3 VM6 VM off VM5 VM1 VM3 VM6 VM7 VM2 VM4 time
108 Resource-Constrained Project Scheduling Problem the clean way to model space and time N1 N2 N3 N4 N5 VM1 VM2 VM4 VM7 VM3 VM6 VM off VM5 VM1 VM3 VM6 VM7 VM2 VM4 time
109 Resource-Constrained Project Scheduling Problem the clean way to model space and time N1 N2 N3 N4 N5 VM1 VM2 VM4 VM7 VM3 VM6 VM off VM5 VM1 VM3 VM6 VM7 VM2 VM4 time
110 Resource-Constrained Project Scheduling Problem the clean way to model space and time N1 N2 N3 N4 N5 VM1 VM2 VM4 VM7 VM3 VM6 VM off VM5 VM1 VM3 VM6 VM7 VM2 VM4 time
111 Resource-Constrained Project Scheduling Problem the pure way to support migrations 1 resource per (node x dimension), bounded capacity tasks to model the VM lifecycle. height to model a consumption width to model a duration at any moment, the cumulative task consumption on a resource cannot exceed its capacity comfortable to express continuous optimisation very hard to implement (properly)
112 From a theoretical schedule to a practical one duration may be longer convert to an event based schedule 0:3 - migrate VM4 0:3 - migrate VM5 0:4 - migrate VM2 3:8 - migrate VM7 4:8 - shutdown(n2) - : migrate VM4 - : migrate VM5 - : migrate VM2!migrate(VM2) &!migrate(vm4): shutdown(n2)!migrate(vm5): migrate VM7
113 Back to vector packing based approaches
114 Fit Fit Decrease (FFD) basic VM scheduling sort VMs in desc order for each VM pick the first suitable node
115 Fit Fit Decrease (FFD) basic VM scheduling sort VMs in desc order for each VM pick the first suitable node difficult VMs first!
116 Fit Fit Decrease (FFD) basic VM scheduling sort VMs in desc order for each VM pick the first suitable node difficult VMs first! enough free resources
117 Fit Fit Decrease (FFD) V4 V3 V2 V1 example N1 N2 N3 N4
118 Fit Fit Decrease (FFD) V4 V1 V3 V2 example N1 N2 N3 N4
119 Fit Fit Decrease (FFD) V4 V1 V2 V3 example N1 N2 N3 N4
120 Fit Fit Decrease (FFD) V1 V2 V3 V4 example N1 N2 N3 N4
121 Fit Fit Decrease (FFD) example V1 V2 V3 V4 N1 N2 N3 N4
122 multi-dimension sorting? V4 V3 V1 V2 easy, 1 dimension is varying V4 V1 V3 easy, uniform variation V4 V1 V3?
123 multi-dimension sorting? V4 V3 V1 V2 easy, 1 dimension is varying V4 V1 V3 easy, uniform variation V4 V1 V3 sort by dimension aggregate dimensions find the most critical one
124 Balancing VMs
125 Why?
126 to reduce loss in terms of failures to reduce hotspots to absorb load spikes Why?
127 balancing in theory min(stddev([load(n ),, load(n )])) min( 1 i max([load(n ),, load(n )]), min([load(n ),, load(n )]), ) OR 1 i 1 i
128 Worst Fit Decrease (WFD) balancing, a practice sort VMs in desc order for each VM pick the suitable node with the highest remaining space
129 balancing over? multiple dimensions
130 balancing over? multiple dimensions dimension normalisation worst dimension dimension aggregation current state vs. next state
131 What about the SLOs? ynamic Power Management (DRS 3.1) 2009? VM-VM affinity (DRS) 2010? Dedicated instances (EC2) mar apr VM-host affinity (DRS 4.1) MaxVMsPerServer (DRS 5.1) The constraint needed in 2012 sep. 2012?? 2013
132 SLOs inject additional rules placement constraint (anti)affinity rules fault tolerance, hw. compatibility, security VM-VM, VM-PM, relative, absolute temporal constraint precedences, parralelism, sequence control, performance control between actions counting constraint performance, licensing restriction per node, group of nodes, resource
133 discrete constraints >>spread(vm[1,2]) ban(vm1, N1) ban(vm2, N2) N1 N2 N3 VM1 VM2 simple spatial problem continuous constraints N1 N2 VM1 spread(vm[1,2]) ban(vm1, N1) ban(vm2, N2) N3 VM2 harder scheduling problem (think about actions interleaving)
134 hard constraints spread(vm[1..50]) must be satisfied all or nothing approach not always meaningful mostlyspread(vm[1..50], 4, 6) soft constraints satisfiable or not internal or external penalty model harder to implement/scale hard to standardise?
135 energy efficient VM scheduler
136 1.5% USA of the 2005 budget 3 % in 2010? (Environmental Protection Agency, 2007)
137
138 1Consume less 2003
139 VM # Consumption (Watt) Node consumption statistics (Cluster edel, 128GB HDD) 110 W 175 W servers are not energy efficient
140 how to reduce consumption with identical nodes?
141 The principles place VMs on the minimum number of nodes turn off idle nodes rince/repeat for the dynamic version
142 Best Fit Decrease (BFD) a simple heuristic to pack VMs sort VMs in desc order for each VM pick the suitable node with the least remaining space efficiency decreases when nb. of dimensions increase
143 2-pass dynamic VM packing to manage the running VMs 1) address performance issues on saturated nodes, BFD to move away VMs on non-saturated nodes 2) address energy efficiency issues on low-loaded nodes, BFD to move away VMs on heavily loaded nodes many variations using threshold based systems, load predictions
144 100 Domain % CPU Node % CPU Node Domain % CPU Node % CPU Node Domain % CPU Node % CPU Node Domain % CPU Node % CPU Node Time (in sec.) Time (in sec.) boot time Domain-4 Domain-3 Domain-2 Domain-1 No packing 2-passes packing
145 100 Domain % CPU Node % CPU Node Domain % CPU Node % CPU Node Domain % CPU Node % CPU Node worthy? Domain % CPU Node % CPU Node Time (in sec.) Time (in sec.) boot time Domain-4 Domain-3 Domain-2 Domain-1 No packing 2-passes packing
146 baseline IVS 2-pass VM packing Energy (Watts) Time (min) Native execution IVS execution prototype interesting gains, the node boot time was the bottleneck simplistic model (identical nodes)
147 Consequences of having heterogeneous nodes?
148 Consequences of having heterogeneous nodes? different performance different Power Usage Effectiveness how to estimate the benefits of a migration?
149 vector packing problem +performance model +power model +cost model + a specialised model
150 power models estimate the energy consumption model the static and the dynamic energy profile of the components usually linear equations Wh(node) = α Hw +β
151 VM performance models a neutral performance unit mips, ECU, GCU, a model to map VM performance with their host perf(vm, N) = α % +β n
152 hw. particularities multi-core CPUs, DDR3 ory, spinning HD, PUE / CUE, boot/shutdown time workload particularities a fine-grain power model VM template migration duration migration payback time
153 - 27% -7% -16% -47% coarse to fine grain optimisation
154 2 Consume better 2012 from a spatial to a temporal point of view
155 DC4Cities let existing and new data centres become energy adaptive
156 grid renewable part renewable power (%) /01 17/01 18/01 19/01 20/01 Date sunroof PV production Power (W) /01 17/01 18/01 19/01 20/01 Date renewable energies are intermittent
157 so are elastic and batch-oriented applications 6 to 20 moonshot cartridges Request/s :00 05:00 11:00 17:00 23:00 Time SLO Penalty (euros) Performance degradation (Req/s) penalty model
158 align workload to renewable energies availability DC4Cities
159 shape EASCs for sustainable profitability energy providers smart city authority forecasts sustainable objective Carver application descriptor elected working modes energy adaptive applications Easc Easc web service... video transcoder pick WMs such as min(penalty(slo) + penalty(sma) + price(e))
160 Energy Adaptive Software Components attached to an application exhibit - working modes - SLO (cumulative or instant) - actuators (see UCC 15 paper)
161 An automaton for each EASC model the behavior, + penalty functions for the SMA, the SLA
162 application baseline (satisfy perf ) Watts /01 17/01 18/01 19/01 20/01 Date E learning G indexing Website carver Watts application E learning G indexing Website 0 16/01 17/01 18/01 19/01 20/01 Date green (max renewable) Watts application E learning G indexing Website 0 16/01 17/01 18/01 19/01 20/01 Date
163 Resulting running costs 17/01/15 18/01/15 19/01/15 20/01/15 Running cost (euros) expense energy SLO SMA 0 perf green Carver perf green Carver perf green Carver perf green Carver
164 RECAP 114
165 no holy grail 115
166 master the problem 116