Leistungsanalyse von Rechnersystemen (Performance Analysis of Computer Systems)


1 Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen: Capacity Planning
Zellescher Weg 12, Raum WIL A113
Matthias Müller (matthias.mueller@tu-dresden.de)

2 Capacity Planning

3 Two quotes
"Do not plan a bridge capacity by counting the number of people who swim across the river today." (heard at a presentation, according to Raj Jain)
"Prediction is very difficult, especially about the future." (Niels Bohr)

4 Terms
Capacity planning: ensuring that adequate computer resources will be available to meet future workload demands. (Alternative: just buy tons of equipment.)
Capacity management: ensuring that the currently available computer resources are used to provide the highest performance. Alternatives:
- Adjust usage
- Rearrange the configuration
- Change system parameters (performance tuning)

5 Steps in capacity planning (one-man show)
Instrument the system → monitor usage → characterize the workload → forecast the workload → feed a system model → check: cost and performance OK? If no, change system parameters and iterate; if yes, done.
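A minimal sketch of this loop in Python, with a deliberately naive forecast; every number here (growth rate, utilization target, costs) is an illustrative assumption, not a value from the lecture:

    # Toy version of the capacity-planning loop: grow capacity until the
    # forecast workload fits at an acceptable utilization, within budget.

    def forecast(load, growth_per_year, years):
        """Naive exponential workload forecast (e.g. jobs/s)."""
        return load * (1 + growth_per_year) ** years

    def plan(capacity, unit_cost, budget, load, growth, years, max_util=0.7):
        """Add capacity units until the forecast load stays below the
        utilization knee; fail if that exceeds the budget."""
        future = forecast(load, growth, years)
        while future / capacity > max_util:      # performance not OK
            capacity += 1                        # "change system parameters"
            if capacity * unit_cost > budget:    # cost not OK
                raise RuntimeError("budget exceeded: revisit forecast or budget")
        return capacity

    # 10 units today, 40 %/yr growth, 3-year horizon -> 16 units needed.
    print(plan(capacity=10, unit_cost=5000, budget=200000,
               load=4.0, growth=0.4, years=3))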

6 Steps in capacity planning (procurement process)
Customer side: instrument the system, monitor usage, characterize the workload, forecast the workload.
Vendor side: build a system model and change system parameters until cost and performance are OK, then make an offer.
Customer side: evaluate the offer(s).

7 Problems in capacity planning
1. Different capacity planning tools use different terminology.
2. There is no standard definition of capacity: maximum throughput (jobs per second, transactions per second) or maximum number of users meeting a specified performance.
3. There are different capacities (nominal, usable, knee).
4. There is no standard workload unit.
5. Forecasting future applications is difficult.
6. There is no uniformity among systems from different vendors; the same workload takes different amounts of resources on different systems.
7. Model input parameters cannot always be measured (e.g. "think time").
8. Validating model projections is difficult:
   1. baseline validation (reproduce the measurement),
   2. projection validation (verify that your model is predictive).
9. Distributed environments are too complex to model.
10. Performance is only a small part of the price/performance game (TCO is complex).

8 Contributions to TCO (total cost of ownership)
- Cost of hardware
- Cost of software
- Installation
- Maintenance
- Personnel (sysadmins, support staff)
- Floor space (building infrastructure)
- Power
- Climate (temperature, humidity)
- Insurance

9 Common benchmarking mistakes
1. Only average behavior represented in the workload (see the sketch after this list)
2. Skewness of device demands ignored
3. Load level controlled inappropriately
4. Caching effects ignored
5. Buffer sizes not appropriate / buffer effects not understood
6. Inaccuracies due to sampling ignored
7. Monitor overhead ignored
8. Measurements not validated
9. Not ensuring the same initial conditions
10. Not measuring transient performance
11. Using device utilization as a performance metric inappropriately
12. Collecting too much data without sufficient analysis
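A small illustration of mistakes 1 and 6, with made-up response-time samples (not data from the lecture): the mean alone hides a skewed run, and the standard error quantifies the sampling inaccuracy that is so often ignored:

    import statistics

    # Made-up response times (ms) from two runs of the same benchmark.
    run_a = [10, 11, 10, 12, 11, 95, 10, 11, 12, 10]   # one outlier skews run A
    run_b = [14, 15, 14, 16, 15, 14, 16, 15, 14, 15]

    for name, run in (("A", run_a), ("B", run_b)):
        mean = statistics.mean(run)
        worst = max(run)                                 # tail behavior the mean hides
        sem = statistics.stdev(run) / len(run) ** 0.5    # standard error of the mean
        print(f"run {name}: mean={mean:.1f} ms  worst={worst} ms  +/-{sem:.1f} ms")

Run A's mean (19.2 ms, +/-8.4 ms) is dominated by a single 95 ms outlier, while run B (14.8 ms, +/-0.2 ms) is higher in the typical case but far more predictable; reporting only averages would invert that conclusion.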

10 Benchmarking games
- Different configurations may be used to run a workload
- Compilers may be wired to optimize the workload
- Test specifications might be biased towards one machine
- A synchronized job sequence might be used (e.g. a smart mixture of I/O-bound and CPU-bound jobs)
- The workload might be random
- Benchmarks might be too small
- Benchmarks might measure the benchmarker rather than the machine

11 Vendor problems in the procurement process of HPC systems
- The competition is unknown
- Time difference between offer and delivery: new clock frequencies of CPUs, a new generation of CPUs, a new system generation
- Capability prediction often requires a difficult scaling analysis; scaling of an application is more complicated than Amdahl's law
- Large size difference between the benchmarking system and the real system
- Caching effects are difficult to understand, especially in combination with new CPU generations
- If the performance prediction is too conservative, you don't win the RFP
- If the performance prediction is too aggressive, you might have to pay penalties

12 Customer problems in the procurement process of HPC systems
- It is difficult to predict future workloads
- It is difficult to create unbiased workloads that are representative of your user base, since your users are using your current system
- It is difficult to be unbiased without opening the possibility of measuring the benchmarking team rather than the computer (source code modifications)
- If the procurement is too small, you have to do the one-man-show procurement style

13 Poor man's capacity planning technology
- At many sites the future workload is so unknown that more sophisticated prediction techniques may not be of great help
- Simple rule of thumb: a factor of x every y years
- Often your budget is fixed, and what you get is determined by the market

14 Improvement over "factor x every y years": separate factors x_i
Moore's Law growth rates (from the classic processor-memory gap chart):
- CPU (µproc): 60%/yr (2X/1.5 yr)
- DRAM: 9%/yr (2X/10 yrs)
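These per-component rates translate into multipliers by compound growth; a quick check in Python (the rates are the slide's rounded figures; note that 9 %/yr by itself corresponds to doubling in roughly 8 years):

    import math

    def factor(rate_per_year, years):
        """Total improvement after `years` at a compound annual growth rate."""
        return (1 + rate_per_year) ** years

    def doubling_time(rate_per_year):
        """Years until a component doubles at the given annual rate."""
        return math.log(2) / math.log(1 + rate_per_year)

    # CPU: 60 %/yr -> x4.1 over a 3-year procurement cycle, doubling every ~1.5 yr
    print(f"CPU : x{factor(0.60, 3):.1f} in 3 yrs, doubles every {doubling_time(0.60):.1f} yr")
    # DRAM: 9 %/yr -> x1.3 over 3 years, doubling every ~8 yr
    print(f"DRAM: x{factor(0.09, 3):.1f} in 3 yrs, doubles every {doubling_time(0.09):.1f} yr")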

15 Examples from real life

16-18 Prediction of Backup Volume [three chart slides]

19 Comparison SX-6 versus SX-8

                              SX-6         SX-8
CPU
  Clock cycle                 1 GHz        2 GHz
  Peak vector performance     8 GF         16 GF
  Peak scalar performance     1 GF         2 GF
  Memory                      8 GB         16 GB
  Memory bandwidth            32 GB/s      64 GB/s
  LSI process                 0.15 µm      90 nm
Node
  No. of CPUs                 8            8
  Peak vector performance     64 GF        128 GF
  Memory                      64 GB        128 GB
  Memory bandwidth (aggr.)    256 GB/s     512 GB/s
  Inter-node bandwidth        8 GB/s x 2   16 GB/s x 2
  I/O bandwidth               8 GB/s       12.8 GB/s

20 Some performance expectations

              SX-6+         SX-8          Factor
Frequency     563 MHz       1 GHz         1.78
Memory BW     36 GB/s       64 GB/s       1.78
Memory lat.   ?             ?             ~1
IXS BW        8 GB/s        16 GB/s       2
IXS latency   6.9 µs        5.9 µs        1.17
SQRT          300 MFlops    1500 MFlops   5
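Turning such component factors into an application-level expectation is essentially an Amdahl-style calculation; a minimal sketch, assuming made-up time fractions for a hypothetical code (the fractions are illustrations, not measurements from the lecture):

    def expected_speedup(fractions, factors):
        """Amdahl-style estimate: fractions of runtime bound by each component,
        each component sped up by its factor. Fractions must sum to 1."""
        assert abs(sum(fractions.values()) - 1.0) < 1e-9
        return 1.0 / sum(f / factors[part] for part, f in fractions.items())

    # Hypothetical profile on the SX-6+: 50 % memory-bandwidth bound,
    # 30 % compute bound (scales with frequency), 20 % network (IXS) bound.
    fractions = {"memory_bw": 0.5, "compute": 0.3, "network": 0.2}
    factors   = {"memory_bw": 1.78, "compute": 1.78, "network": 2.0}
    print(f"expected speedup: {expected_speedup(fractions, factors):.2f}")

For this made-up profile the estimate is about 1.82, below the 2x one might naively read off the table: the components that improve only by 1.78 dominate.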

21 Performance values from Phase I

                            Promised    Delivered
Memory BW / CPU             50 GB/s     63 GB/s
Memory BW / node            320 GB/s    360 GB/s
Bisection                   230 GB/s    560 GB/s
Inter-node MPI latency      8 µs        4.61 µs
Inter-node MPI bandwidth    12 GB/s     14.2 GB/s
Fenfloss                    45 GF/s     45.55 GF/s
Uranus                      46 GF/s     49.17 GF/s
N3D                         50 GF/s     52.6 GF/s

22 Prediction from 16 to 72 nodes on the next-generation system

23 SPEC OMP single-node results

24 HPCC 4-node results

25 Power Estimation

26 Power Estimation
Management question: What is the power consumption of a 15 million euro system in 2011/2012?

27 TOP500 for extrapolation of performance

28 Power consumption in kW of existing systems at rank 50 of the TOP500
Lists: Nov-06, Jun-07, Nov-07, Jun-08, Nov-08, Jun-09

Value from TOP…                      …,03    283.34  375.55  503.90
Value from TOP…                      …,11    183.45  237.04  365.04
Value from TOP50                     111.56  175.86  247.62  345.41
Value from TOP50 without Cell, BG    163.56  253.45  349.61  598.40
Deimos                               250

Power consumption increases markedly over time, with a strong dependence on which systems are considered. Worldwide there are major efforts to limit the growth of energy consumption, but only limited success is to be expected in the next three years. (Daniel Hackenberg)

29 First approach: TOP500-based extrapolation
Extrapolation from available (limited) data points: ~2.5 MW for two rank-50 systems in 2012.
Problems:
- Do only energy-efficient data centers submit their power measurements?
- Blue Gene and Cell systems have a significant influence on the average
(Daniel Hackenberg)

30 Summary of the estimation
Original question: power consumption of a 15 million euro system in the future?
Assumptions:
- The money spent on the TOP500 systems is constant
- Exponential growth of performance will continue
- No major technology breakthrough in power efficiency in the next 3 years
Approach (see the sketch below):
- Modified question: what is the power consumption of the system ranked at position 50 in the TOP500 list?
- Extrapolate performance with exponential growth
- Estimate power efficiency based on a subset of the TOP500 list
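A minimal sketch of such an extrapolation in Python: fit exponential growth to rank-50 power readings by least squares on the logarithm, then project forward. The observation values are placeholders loosely modeled on the TOP50 row of slide 28, not official TOP500 data:

    import math

    # Placeholder (year, kW) observations for the rank-50 system.
    observations = [(2006.9, 112), (2007.4, 176), (2007.9, 248), (2008.4, 345)]

    # Least-squares fit of log(power) = a + b*year, i.e. exponential growth.
    xs = [t for t, _ in observations]
    ys = [math.log(p) for _, p in observations]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar

    print(f"annual growth factor: {math.exp(b):.2f}x")
    print(f"projected rank-50 power in 2012: {math.exp(a + b * 2012):.0f} kW")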

31 Linpack

32 The Linpack Benchmark is a measure of a computer's floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit; in fact, there are three benchmarks included in the Linpack Benchmark report.
LINPACK Benchmark:
- Dense linear system solved with LU factorization using partial pivoting
- Operation count: 2/3 n³ + O(n²) (see the helper below)
- Benchmark measure: MFlop/s
- The original benchmark measures the execution rate of a Fortran program on a 100x100 matrix
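The reported rate follows directly from that operation count; a small Python helper, assuming the commonly used 2n² value for the O(n²) term:

    def linpack_mflops(n, seconds):
        """MFlop/s for solving a dense n x n system with LU factorization
        and partial pivoting, using the common count 2/3*n^3 + 2*n^2."""
        flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
        return flops / seconds / 1e6

    # Example: the original 100x100 case solved in one millisecond
    # would be rated at about 687 MFlop/s.
    print(f"{linpack_mflops(100, 1e-3):.0f} MFlop/s")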

33 Linpack Efficiency vs. Problem Size

34 Linpack Efficiency vs. Size (E. Strohmaier, ISC 09)

35 Linpack Efficiency vs. Network (E. Strohmaier, ISC 09)

36 Remarks about the performance of AMD CPUs
AMD Athlon X2 240: 2.8 GHz x 2 cores x 4 DP flops/cycle = 22.4 GFlops
AMD Phenom X2: 3.0 GHz x 2 cores x 4 DP flops/cycle = 24 GFlops
AMD Phenom X4: 2.8 GHz x 4 cores x 4 DP flops/cycle = 44.8 GFlops
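These are instances of the usual peak-performance formula; a tiny helper that reproduces the numbers above (the 4 DP flops/cycle is taken from the slide, not looked up per model):

    def peak_gflops(ghz, cores, flops_per_cycle=4):
        """Theoretical peak = clock (GHz) x cores x DP flops per cycle."""
        return ghz * cores * flops_per_cycle

    print(peak_gflops(2.8, 2))  # Athlon X2 240 -> 22.4
    print(peak_gflops(3.0, 2))  # Phenom X2     -> 24.0
    print(peak_gflops(2.8, 4))  # Phenom X4     -> 44.8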

37 Thank you!