Windows Server Capacity Management 101


What is Capacity Management?

The ITIL definition of Capacity Management is: "Capacity Management is responsible for ensuring that adequate capacity is available at all times to meet the agreed needs of the business in a cost-effective manner." (ITIL Capacity Management, 2016)

The important words in this quote are "adequate", "at all times", "needs" and "cost-effective". Capacity Management is simply making sure you have the capacity the business needs at all times, without it costing a fortune. Doing it cost-effectively is the important part, because it is very easy to do Capacity Management with an unlimited budget. For those without one, budget planning, monitoring and reporting need to be done to ensure that the processes are as efficient and cost-effective as possible.

Now that we have a good idea of what Capacity Management is, let's look at the next question: why is so little of the resource in Windows environments actually used?

Historically, Windows systems were not as reliable as they are today. Windows Server 2003, for instance, had a reliability rating of 99.854%. That might sound high, but it means that over the course of a year the server was down for 12.8 hours (a quick check of that arithmetic appears at the end of this section), and that is not acceptable when dealing with critical systems. This has led to:

- Windows systems averaging ~5% CPU busy.
- Over-specification of hardware to avoid performance problems.

Windows reliability has improved over time, but the people managing those systems are still reluctant to really push the servers because of previous bad experiences:

- MINE! A reluctance to share resources.
- Each Windows machine running only one function, e.g. firewall, file server, database.
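The downtime figure above follows directly from the availability percentage. A minimal check, in Python purely for illustration:

```python
# Annual downtime implied by an availability (reliability) percentage.
availability = 0.99854                      # 99.854% availability
hours_per_year = 24 * 365                   # 8,760 hours in a year
downtime_hours = (1 - availability) * hours_per_year
print(f"~{downtime_hours:.1f} hours of downtime per year")  # ~12.8 hours
```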

Has the move to virtualization technology corrected this?

Virtualization allows multiple Windows systems to run as guests on one physical machine under a hypervisor. In theory this should make the physical system utilize more of its resources, but the problem lies in the mentality of the people running the systems, not in the technology. Packing guests together means physical machine utilization should be higher, but a number of problems persist:

- "MINE!" is still prevalent. Staff used to a one-server-per-service environment recreate the same situation with virtual machines, except that it is now even easier to have multiple machines, and often they are not even very busy.
- It is easy to suffer virtualization sprawl, where the number of virtual machines (VMs) on a network reaches a point at which the administrator can no longer manage them effectively. Sprawl often happens for good reasons, such as building redundancy into the system so that if one VM is taken down, another can be brought up with virtually no downtime.
- Some organizations rely on high availability / dynamic resource sharing. Careful planning is needed to make sure that the components of a service do not end up together on the same physical machine, because if that machine fails it takes down the whole service.

Capture/monitor appropriate metrics

The first step to properly capacity managing a Windows environment is to plan how you are going to do it. Start by collecting performance data on the Windows environment you want to manage and, if the guests run under a hypervisor such as VMware or Hyper-V, on the host machine as well, then use this data to create charts and trends. It is important to:

- Capture the right metrics.
- Pick the right capture interval.
- Select when to capture data.
- Remember that some metrics do not give the complete picture.

To capacity manage properly you need data, but it must be the right data, so plan which metrics to capture and at what interval. Fifteen-minute intervals are a good starting point, but the correct interval length depends heavily on the type of workload and on how you are reporting. It is also important to consider when you capture data: capturing 24 hours a day will make daily averages much lower than reporting only on your peak hours. The same applies to days: if you do most of your work Monday to Friday, then Saturday and Sunday will pull the averages down. Bear in mind, too, that the more frequently you collect data, the more data you produce; it may sound obvious, but capturing at 2-minute intervals 24/7 generates a large volume of data very quickly.

Once you are collecting data at the right time and interval, you can start reporting and trending it. Not all metrics give a true picture of the hardware; CPU reported busy is one example. A virtualized Windows system will report higher utilization than is real, because it is not aware that VMware is swapping the logical CPU in and out; it simply reports that it was busy the whole time.
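As a concrete starting point, here is a minimal collection sketch. It assumes the third-party psutil library (my choice for illustration; the source document does not name a collector) and simply appends CPU, memory and disk samples to a CSV for later charting and trending:

```python
# Minimal metric collector sketch: one row per interval, appended to CSV.
import csv
import time
from datetime import datetime

import psutil  # assumption: third-party library, pip install psutil

INTERVAL_SECONDS = 15 * 60  # the 15-minute starting interval suggested above

with open("capacity_samples.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        writer.writerow([
            datetime.now().isoformat(timespec="seconds"),
            psutil.cpu_percent(interval=1),     # % CPU busy over a 1 s sample
            psutil.virtual_memory().percent,    # % memory used
            psutil.disk_usage("C:\\").percent,  # % disk occupancy (Windows path)
        ])
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```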

Trending

The purpose of trending is to predict what will happen from what has happened; the accuracy of a trend relies on the assumption that what is happening now will carry on happening into the future.

Importance of trends - a trend gives you warning if your demand is going to outstrip your supply, and gives you a chance to act (a minimal trend-projection sketch appears at the end of this section).

How long to trend forward - as with most things, there is no one size fits all. When deciding on the length of a trend, consider factors such as how long it takes you to buy and install new hardware. There is no point in trending disk space forward a week if it takes you two months to get additional space. A good trend length is how long it takes you to procure, physically install and configure new hardware; if that takes three months, then that is how far forward your trends should look.

Trending is good at predicting when something will hit a threshold, but not at telling you what will happen when it does. This is where modeling comes in.

Importance of modeling - it allows you to see how a system will react under different workloads. If the business has an event coming up that will put its servers under higher than normal load, you want to be able to reassure people that the system can handle it.

What modeling shows - modeling shows how your components will perform under different workloads, and which component will fail and when. Modeling is frequently used for "what if" scenarios, such as: "What if my workload increases 30%? Will my system handle the extra load or will it fail? If it fails, where?" Knowing this lets you be proactive instead of reactive.
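To make the trending idea concrete, here is a minimal projection sketch (my own illustration, not the tool described in the source): fit a straight line to historical utilization samples and work out when the line crosses a threshold.

```python
# Linear trend projection: when does utilization hit a threshold?
import numpy as np

days = np.arange(30)                                       # day index of samples
cpu_busy = 40 + 0.5 * days + np.random.normal(0, 2, 30)    # illustrative data

slope, intercept = np.polyfit(days, cpu_busy, 1)           # fit y = slope*x + b
threshold = 70.0                                           # e.g. warning level %

if slope > 0:
    days_to_threshold = (threshold - intercept) / slope
    print(f"Trend reaches {threshold}% around day {days_to_threshold:.0f}")
else:
    print("Utilization is flat or falling; no crossing predicted")
```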

Balance service against cost

The better the service, the higher the cost. When balancing the cost of a service, it is important to know the impact of spending too little or too much.

Align your IT spend to your business needs - it is not about spending more and more, it is about spending smarter: understanding what the business needs and how to meet those needs cost-effectively. It is important to know the wider business; if it is expanding by 50%, you need to know what is required to meet the new demand. Without forward-looking activities, you could be in for any number of unpleasant surprises, such as:

- Performance crises.
- Unnecessary hardware expenditure.
- User dissatisfaction.

Understanding how to meet demand - Capacity Management is responsible for ensuring adequate capacity is available at all times to meet the requirements of the business. It is directly related to the business requirements, and is not simply about the performance of the system's components, individually or collectively.

Some best practice recommendations

Now that we have covered what is needed to properly manage a Windows environment, here are some best practice recommendations. There are three main components to monitor in your Windows systems:

- CPU - physical utilization.
- Memory - usage.
- Disk - occupancy and performance.

These are all components that, if they fill up or are over-utilized, will severely affect performance.

CPU - What to monitor

The first component to look at is CPU. When monitoring CPU you need to understand the difference between logical CPU and physical CPU: if your system is virtualized, you are seeing logical CPU, because the Windows environment knows nothing about the physical CPU it is being hosted on.

If physical: CPU total utilization of the machine. A physical system is much simpler, as you are directly monitoring the physical components.

If virtualized: CPU usage by the guest system. You also need to know the physical CPU usage, which sits under the hypervisor. If you only look at CPU busy and it says 80%, that could be 80% of the 5% that VMware has allocated to the guest.

You also need to look at process-level CPU busy. On a virtualized system this gives a view of relative usage of the physical CPU busy from the host, and it shows how much CPU time each process is using, which is useful for seeing where all your CPU time is going (a sketch follows this section).

How busy can I run it? How hard you can work a CPU depends heavily on the type of CPU and the type of work it is doing; there is no one-size-fits-all number.

Newer = more capable. Newer CPUs have larger on-chip cache memory, allowing more instructions to be kept nearer the cores. Cache memory is quicker to access than main memory, but there is much less of it: megabytes of very fast cache, gigabytes of fast RAM, and terabytes of slow disk.
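Here is a minimal per-process sketch, again assuming the psutil library (an assumption, not a tool named in the source): list the ten processes that used the most CPU over a one-second window.

```python
# Per-process CPU usage: prime counters, wait, then read the deltas.
import time

import psutil  # assumption: third-party library

procs = list(psutil.process_iter(["name"]))
for p in procs:
    try:
        p.cpu_percent(interval=None)  # first call always returns 0.0; primes it
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(1)  # measurement window

samples = []
for p in procs:
    try:
        samples.append((p.cpu_percent(interval=None), p.info["name"]))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

for busy, name in sorted(samples, reverse=True)[:10]:
    print(f"{busy:5.1f}%  {name}")
```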

It is also not just about clock speed: a 3.4 GHz Pentium 4 is nowhere near as capable (as in "can get through work") as a brand new 2 GHz Xeon processor, because the newer CPU can do in one clock cycle what an older CPU needs many cycles to do.

More cores = can be pushed harder. It is all about THROUGHPUT, not just speed. The more cores a CPU has, the harder you can run it without performance problems, although this also depends on the type of work and on whether hyper-threading is in use. The relative response times below make the point (see the queueing sketch after this list):

- 1 core: at 50% busy, work takes twice as long as at 0% busy; at 80% busy, 5 times as long; at 90% busy, 10 times as long. With one core it does not take much load to slow throughput down.
- 2 cores: at 50% busy, 1.3 times as long; at 80%, 2.7 times; at 90%, 5.2 times.
- 16 cores: at 50%, 1x; at 80%, 1.02x; at 90%, 1.22x. For a 16-core CPU there is little difference between running at 80% and at 50%.

Keep in mind that every configuration still maxes out at 100%; more cores simply flatten the curve, so it takes longer to hit the knee.

Benefits of multiple cores - as the chart illustrates, adding more and more cores makes the response-time curve flatter and flatter.
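Those multipliers are consistent with a standard M/M/c queueing model (my assumption; the source does not name its model). A minimal sketch that reproduces them using the Erlang C formula:

```python
# Response-time stretch factor for c cores at a given utilization (M/M/c).
from math import factorial

def stretch_factor(cores: int, utilization: float) -> float:
    """Response time relative to an unloaded system, via Erlang C."""
    a = cores * utilization                          # offered load in Erlangs
    erlang_b = (a ** cores / factorial(cores)) / sum(
        a ** k / factorial(k) for k in range(cores + 1)
    )
    # Convert Erlang B to Erlang C (probability that a job must queue).
    erlang_c = erlang_b / (1 - utilization * (1 - erlang_b))
    return 1 + erlang_c / (cores * (1 - utilization))

for cores in (1, 2, 16):
    row = "  ".join(f"{stretch_factor(cores, u):.2f}x" for u in (0.5, 0.8, 0.9))
    print(f"{cores:>2} cores @ 50/80/90% busy: {row}")
# 1 core  -> 2.00x  5.00x  10.00x
# 2 cores -> 1.33x  2.78x   5.26x
# 16 cores-> 1.00x  1.02x   1.22x
```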

How to monitor and manage CPU

Hyper-threading splits a single CPU core into two logical processors, each of which can execute a separate piece of work; one thread is the dominant thread, while the other runs when the first is stalled. There is a trade-off, because it takes time for the CPU to switch between threads. Some work fits hyper-threading well, such as multiple threads of lightweight work; heavier work that needs the whole power of a core to get through can actually run slower. So hyper-threading is not always beneficial: depending on the workload, it is sometimes better not to split cores into multiple threads, as the jumping between threads can lower throughput.

Thresholds - there is no one size fits all, but a good rule of thumb is 70% for a warning and 85% for an alarm; these can and should be tweaked once you have a better idea of the performance thresholds for your CPU. It is also good to have thresholds for under-utilization, perhaps at 20% and 10%, so you know which machines could be pushed harder (a simple check is sketched below).

Trends - remember that the longer the trend, the less reliable it is. A good rule of thumb is 3 months, which gives a reasonably reliable trend and still leaves you time in which to make a hardware change.
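Putting the rules of thumb above into code, here is a minimal threshold-check sketch (the band values come from the text; the function itself is my own illustration):

```python
# Classify a CPU busy sample against the warning/alarm/under-use bands.
def classify_cpu(busy_percent: float) -> str:
    if busy_percent >= 85:
        return "ALARM"                 # sustained overload likely
    if busy_percent >= 70:
        return "WARNING"               # approaching the knee of the curve
    if busy_percent <= 10:
        return "VERY UNDER-UTILIZED"   # strong consolidation candidate
    if busy_percent <= 20:
        return "UNDER-UTILIZED"        # could be pushed harder
    return "OK"

for sample in (5, 15, 40, 75, 92):
    print(f"{sample:3d}% -> {classify_cpu(sample)}")
```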

Reports - CPU Total Utilization Estd% (report example). The example chart shows estimated CPU core busy over a month for my computer, with a trend going forward one month; you can see quickly that the trend line is going down. This kind of chart is very simple to create with a Capacity Management tool like athene.

Memory - What to monitor

Memory utilization of the whole system - if need be, look at process working-set sizes to see who the culprit is; this shows which process is using the most memory and is a good way to detect memory leaks. A good rule of thumb is to keep at least 10% of memory free, to prevent the excess paging that massively hurts performance.

Page file usage % - if this is high, you are regularly running out of memory and Windows is having to fall back on the page file.

Memory leaks - when an application dynamically allocates memory and does not free it when it has finished, that application has a memory leak. The memory is no longer being used by the application, but it cannot be used by the system or any other program either. Memory leaks add up over time and, if they are not cleaned up, the system eventually runs out of memory.

How to monitor

Thresholds - a good place to start is an 80% warning and a 90% alarm; if you see performance issues before hitting a threshold, adjust it. If a threshold is constantly breached, reset the value or look for memory leaks (a simple leak heuristic is sketched below).
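Here is a minimal leak-detection sketch (my own heuristic, not a method named in the source): flag a process whose working set grows steadily and never releases memory, which is the classic leak signature described above.

```python
# Heuristic leak check: steady upward slope in a process's working set.
import numpy as np

# Hourly working-set samples for one process, in MB (illustrative data).
working_set_mb = np.array([410, 418, 425, 431, 440, 447, 455, 463, 470, 478])

hours = np.arange(len(working_set_mb))
slope_mb_per_hour, _ = np.polyfit(hours, working_set_mb, 1)

# A consistent climb with no drops mirrors the chart described above:
# usage creeps up until a restart resets it.
if slope_mb_per_hour > 5 and np.all(np.diff(working_set_mb) >= 0):
    print(f"Possible leak: +{slope_mb_per_hour:.1f} MB/hour, never releasing")
```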

Memory Utilization - report example. The chart is a good example of a memory leak: memory utilization slowly creeps up, drops when I restart the machine, and then starts to creep up again.

Some best practice recommendations for monitoring and managing memory

How full can I run memory? A good rule of thumb is 90%. What happens when I start to run out of memory?

- Page faults increase.
- Soft faults occur first (things are moved around in memory).
- Then hard faults (pages are written to disk).
- Reads from and writes to the page file increase (hard faults).
- Reads from image files can increase.
- Eventually the system stops responding, or just stops.

Some best practice recommendations on how to monitor and manage disk

What to monitor - there are two main aspects of disk: occupancy and performance.

- Occupancy - use Free Space Ratio %, which shows how much space is left on the disk.
- Performance - measured by the average response time of reads and the average response time of writes.

How to monitor

Thresholds - the right threshold for disk occupancy depends on how quickly you can obtain additional disk space and how quickly the disk is filling up, but a good rule of thumb is a 70% warning and an 80% alarm.

Trends - very important for disk occupancy, as they can show very far in advance when you are going to run out of disk space.

Reports - automate your reports.

Free Space Ratio % - example chart. This is a good example of a disk slowly filling up; I could trend on this and easily get the date when I am going to run out of disk space (a minimal projection is sketched below). Having this information is very important in ensuring there is no downtime for any of my important applications.
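A minimal run-out projection sketch (my own illustration; the dates and samples are hypothetical): fit a trend to daily free-space samples and estimate the date the disk fills.

```python
# Project the date a disk runs out of space from free-space samples.
from datetime import date, timedelta

import numpy as np

start = date(2024, 1, 1)  # hypothetical date of the first sample
free_space_pct = np.array([42.0, 41.4, 40.9, 40.1, 39.6, 38.8, 38.3, 37.5])

days = np.arange(len(free_space_pct))
slope, intercept = np.polyfit(days, free_space_pct, 1)  # % free per day

if slope < 0:
    days_until_empty = -intercept / slope  # where the line hits 0% free
    print("Disk full around", start + timedelta(days=round(days_until_empty)))
else:
    print("Free space is stable or growing")
```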

Summary

- Capacity Management is about ensuring there is enough IT resource at all times.
- Windows systems are under-utilized because of mistrust of their reliability.
- Virtualization has helped make Windows systems more utilized, but has not completely solved the problem.
- It is important to balance the cost of the service against the benefit.
- When managing Windows systems, look at CPU, memory and disk.

Metron

Metron, Metron-Athene and the Metron logo, as well as athene and other names of products referred to herein, are trademarks or registered trademarks of Metron Technology Limited. Other products and company names mentioned herein may be trademarks of their respective owners. Any rights not expressly granted herein are reserved.

www.metron-athene.com