Converged Operations Management for Virtualization and the Cloud

Size: px

Start display at page:

Download "Converged Operations Management for Virtualization and the Cloud"

Letitia George
5 years ago
Views:

1 Converged Operations Management for Virtualization and the Cloud Patrick Mulvanny, Ph.D TECHNICAL WHITE PAPER

2 Table of Contents Executive Summary Introduction Operations Management Embedded, Integrated Management... 4 Converged Management... 5 Cloud Economics Operational Issues Operational and Business Impacts Converged Management: Use Cases and Processes... 5 Remediating a Stressed Workload... 6 Adding an Application Correcting a Misconfigured Workload... 7 Common Language and Source of Truth...7 Converged Management Tools: VMware vcenter Operations Visibility Intelligent Automation Proactive Management VMware vcenter Operations in Action... 9 Benefits with Current Virtual Infrastructure Toward the Management End State Conclusions technical white paper / 2

3 Executive Summary In the cloud era, intelligent virtual infrastructure provides the foundation for faster, more agile and more efficient delivery of IT services and business applications. But managing these new environments using tools designed for physical environments creates operational and business problems and blocks the path to flexible, serviceoriented cloud infrastructure. This paper describes a new management paradigm that integrates and embeds the disciplines of performance, capacity and configuration management into virtualization-aware methodologies. It works across technology silos and meets the unique challenges of dynamic virtual and cloud environments including those that operations management teams face most often: Performance issues can arise quickly in shared infrastructure from any component of the physical or virtual layer. Resolution requires fast insight into capacity constraints and configuration problems. Workload mobility and rapid changes in demand require real-time capacity management with proactive tools that anticipate and resolve emerging constraints. Configuration and compliance issues can ripple quickly through virtual environments, requiring end-to-end visibility and situational awareness to isolate and resolve them. Converged solutions replace static thresholds and false alarms with dynamic, virtualization-aware metrics. They align efforts across operations management and business teams to keep virtual environments effective and efficient. And they transform the economics of infrastructure, applications and cloud services. VMware vcenter Operations exemplifies the new cloud management paradigm. It offers end-to-end visibility, proactive management and automation for fast, effective troubleshooting, root-cause isolation and remediation of even complex scenarios. Virtual environments continue to grow, converge, evolve toward self-service and merge with public clouds. vcenter Operations offers innovations in zero-touch automation, policy-driven service assurance and support for applications interoperability across cloud infrastructure anticipating and overcoming the challenges of the next stage of the virtualization journey. technical white paper / 3

4 Introduction Virtual and cloud infrastructure differ from the stratified, siloed architectures that preceded them in important ways. Interdependent and dynamic, they exceed the capabilities and outstrip the speed of management processes and tools designed to manage physical IT assets. Virtual environments need a new approach to operations management. Imagine if you could: Instantly sort through potential CPU, memory, storage and network causes of a performance Issue ignoring false alarms from high normal states to isolate and remediate root causes from a single centralized console. Anticipate the capacity required by organic growth and new applications, identifying future constraints and evaluating alternative remedies in real time. Evaluate configuration history to pinpoint the change that triggered a performance or compliance issue and then roll back the change, notify the affected parties and remediate the problem without time-consuming consultation across teams of specialists. Capabilities like these demand tools and processes that converge the disciplines of performance, capacity and configuration management into a consolidated, virtualization-aware management solution available now in the VMware vcenter Operations management suite. This paper: Details the need for a new paradigm for operations management. Outlines the capabilities of a solution that converges performance, capacity and configuration management. Points the way toward an end state capable of managing highly dynamic, service-oriented private/public cloud infrastructure. Outlines the benefits of the new approach for both businesses and IT staff. Operations Management Virtual machines outnumbered physical machines for the first time in 2009, and virtual-machine growth accelerated again to an annualized rate of 28 percent in Most enterprises now have established virtualization strategies, virtual first policies and clarity about their positions on the virtualization journey. Virtualization transforms stratified, siloed architectures of physical compute, network and storage assets into logical servers, clusters and datacenters that share those physical assets. Managing virtual infrastructure requires a corresponding transformation of operations management processes and the tools that support them. In virtual environments, workloads no longer operate independently, so boundaries between operations and infrastructure management are less distinct. In addition, dynamic processes unique to virtualization rebalancing and reconfiguration, autoscaling, and workload movement across pools of resources occur much more quickly than in physical environments. Virtual and cloud environments remap the way infrastructure is organized requiring a new, converged approach to the classic disciplines of operations management. Embedded, Integrated Management In virtual and cloud environments, interdependent workloads contend for shared resources. Performance issues can originate from any component, virtual machine, physical host or cluster. What s more, workloads rate of change has outstripped the capabilities of traditional management approaches. These new environments require a new paradigm in which management capabilities are tightly integrated with and deeply embedded into virtualization and cloud platforms themselves. 1 Brian Atkinson. VMworld 2010 Day 2 Brian Atkinson s Blog. (Palo Alto, California: VMware Communities, August, 2010). technical white paper / 4

5 Converged Management Workload interdependence in virtual and cloud environments stymies the ability of specialized teams to address performance, capacity and configuration issues in isolation. Performance issues can arise from capacity constraints or configuration errors far from the stressed workload. Virtualized and cloud environments require integration of management disciplines and metrics for a holistic view of the health of virtual and physical infrastructures, and the tools to maintain it. Cloud Economics IT is constantly asked to do more with less. This pressure is among the primary drivers of virtualization and the move to the cloud. But without aggregated, business-level metrics for service levels and costs, IT has no comprehensive picture of costs and can t document its value to business customers. Virtualization and the cloud require consolidated, end-to-end visibility into IT capital, operating and service costs so that business decisions for example, whether to deliver a service through a public or private cloud can be made quickly and with complete transparency. Operational Issues Dynamic virtual and cloud environments also raise practical day-to-day challenges that apply across all management disciplines: Static thresholds can t adapt to rapid change in a dynamic environments in which workloads and infrastructure are interdependent. Any serious problem will generate a storm of alerts as workloads cross static thresholds, even though most might have a single root cause. False alarms from high normal conditions or too-conservative thresholds encourage resource overcommitment, undermining virtualization s cost-effectiveness. Virtualization awareness is essential for quick problem resolution. Without knowing which hosts, clusters, networks and storage resources are shared by stressed workloads, administrators must pick through alerts one at a time guided by rules of thumb, tribal knowledge and guesswork. Event sequences matter, too. In virtualized environments, ill-advised configuration changes can cause resource contention far from the problem s source. Operations management tools need situational awareness to isolate and resolve problems quickly. Operational and Business Impacts Organizations that try to manage cloud and virtual infrastructure using practices and tools that evolved for physical infrastructure will be unable to manage their environments effectively. And they will be incapable of delivering the cost efficiency, quality of service and business agility their business customers expect from virtualized and cloud environments: At the operational/administrator level, management becomes slow and costly because people and processes can t work at the speed and volume of events. At the IT management/director level, legacy management interferes with resource optimization needed to meet budget and service-level requirements. At the business/cio level, operations management issues show up as failures to maintain promised service levels and deliver economies expected from virtualization. Converged Management: Use Cases and Processes Because they reorganize and abstract physical assets into logical pools, virtual and cloud environments demand a new approach to operations management. Converged management processes and the tools that support them respect the hierarchical relationships, workload interdependence, resource sharing and rapid cascades of events that define these dynamic new environments. technical white paper / 5

6 A few use cases from the daily routine of IT operations illustrate the requirements. Remediating a Stressed Workload Managing for high resource utilization inevitably means dealing with an occasional stressed workload that is, one operating at or near its resource limits. Consider the case of an application on a shared host, accidentally scheduled to run database backups at noon instead of midnight. In a tightly managed environment, contention among virtual machines for host network I/O and among hosts for shared network and storage resources affects performance across the datacenter. Management objectives are straightforward: Troubleshoot to identify the resource under stress, which could be host CPU, memory, disk I/O or network I/O. Determine whether the stress reflects temporary, tolerable high normal behavior or an actionable issue. Analyze to isolate the problem s root cause. It could be a guest workload, peer-level hosts contending for resources, network or storage constraints outside the host or recent configuration changes that drove up consumption or set resource shares too low. Remediate the problem by offloading virtual machines to less-stressed hosts, allocating new capacity, unwinding suspect configuration changes or as in this case simply asking the application owner to cancel the current backup run and change the schedule. Siloed, virtualization-blind processes and tools can t accomplish any of these steps in real time. Troubleshooting is bogged down in alerts (most of them false alarms that are due to context-blind static alert thresholds), affected workloads unrelated to the root cause or high normal states on completely unaffected workloads. Analysis suffers from fragmentation, requiring cross-silo bridge calls to rule out or confirm capacity constraints or configuration changes. Likewise, remediation depends on skills and tools that are isolated in technology silos across the organization. Virtual and cloud environments require converged processes and tools that put troubleshooting, analysis and remediation in full context including hierarchical and peer relationships as well as recent history. Such tools help administrators look for stressed resources down application stacks, within hosts or across datacenters. They can then examine history to distinguish high normal demands from real problems, rule out root causes from configuration changes and identify periodic problems such as the database backup schedule in this use case at a glance. Adding an Application Virtual and cloud environments promise the flexibility to adapt to business changes quickly. As a practical matter, this means adding new applications without the extended purchasing, setup, configuration and deployment cycles needed in physical environments. Consider a new multilayer e-commerce portal moving from development into production. If the additional workload and its likely growth strain current capacity, service levels will drop throughout the infrastructure. Here, the management objectives are to: Anticipate capacity requirements, forecasting when resource constraints will begin affecting performance, given current rates of consumption and usage cycles and assuming no configuration changes. Isolate future capacity constraints, to identify which resource is on track to create the bottleneck. Evaluate alternative remedies subtracting virtual machines, hosts or clusters; reconfiguring the number, CPU count, and utilization or memory of virtual resources; or adding new physical resources. Legacy capacity-management processes and tools are too slow to address rapid capacity swings in virtual and cloud environments. And their lack of real-time visibility and proactive analytics leaves them blind to present and future constraints. The operations team addresses the inevitable crises using institutional knowledge and rules of thumb or, more likely, overprovisioning to postpone them. Business users see only the delays and costs that these workarounds cause, and they look to third-party providers and the public cloud for more cost-effective service delivery. technical white paper / 6

7 Virtual and cloud environments need converged processes and tools to help anticipate emerging capacity constraints and then work proactively to avoid them. Critical capabilities include: Virtualization awareness to pinpoint the level and stack in which a constraint will be felt. Visibility across technology silos to isolate which resource is on track to create the bottleneck. Forward-looking analytics with what-if modeling to evaluate the effectiveness of alternative remedies. If consumption of host CPU cycles by the new e-commerce portal will accelerate the point at which consumption exceeds capacity, several remediation alternatives are available beyond just adding physical hosts. IT might be able to add virtual machines to the cluster or based on application priority and consumption history reconfigure other workloads on the host to consume fewer CPU cycles. In this case, it might be easiest to move the e-commerce workload to an underutilized host first ensuring that it is compliant with Payment Card Industry (PCI) standards. Correcting a Misconfigured Workload The final use case tracks identification and remediation of a sudden, dramatic decline in performance. The performance drop results from an administrator s configuration of a new workload to give it unlimited CPU cycles. In a very short time, all the other workloads sharing the host cluster begin showing signs of stress. Management objectives are similar to those of addressing a stressed workload: Troubleshoot to identify the problem s origin here, an event coincident with or immediately preceding the performance crisis. Analyze to isolate the root cause most likely a configuration change that created incompatibilities or contention states with peer elements. Remediate the problem by rolling back the configuration change and notifying the person who made it, relocating or reconfiguring competing workloads, or if no root-cause event can be found consulting with the application owner. In virtual and cloud environments, the performance alerts issued by legacy management tools actually misdirect troubleshooting efforts. After all, the workload consuming all those CPU cycles is running well below its own (infinite) static thresholds. Both root-cause analysis and remediation are delayed by the lack of visibility across technology silos and by the need to coordinate across specialties to isolate and fix a configuration problem with immediate performance impacts. Integrating performance metrics with infrastructure topology and event timelines resolves the problem. A timeline combining configuration events and performance metrics delivers at-a-glance analysis of common errors and a head start in analyzing even subtle effects of configuration changes. Automation accelerates remediation by rolling back configuration changes, and (after identifying potential compliance conflicts) moving workloads to properly configured alternative hosts. Common Language and Source of Truth Converged management processes and tools help operations teams keep virtual and cloud environments productive and efficient, and they solve problems quickly. Just as important, converged performance analytics abstract technology-silo jargon into a common high-level language. This helps the team work and communicate across technical specialties, aligned to the way converged environments work. These analytics are analogous to the concept of physical health. The brain receives multiple direct and indirect inputs from the body s systems: muscle length and joint position from the musculoskeletal system, pulse from the cardiovascular system, respiration rate from the pulmonary system and so on. Critically, the brain also understands context: It responds differently to rapid pulse, gasping for air and sweating when we are running a marathon (push!) than when we are relaxing on the couch (call an ambulance!). Like the brain, converged management processes and tools analyze millions of measurements and understand context: infrastructure topology, performance norms, trends, cycles and recent events. Context gives operations management the ability to distinguish high normal states or low-urgency issues (relax!) from urgent exceptions (act!). Converged performance analytics give operations management teams a common language and a single, objective measure of performance a single source of truth that applies across infrastructure resources, groups, layers or stacks. technical white paper / 7

8 Converged Management Tools: VMware vcenter Operations The VMware vcenter Operations suite is the state of the art in automated operations management for virtual and cloud infrastructure. Designed for dynamic virtual environments and deeply integrated with VMware vsphere, the suite offers visibility, intelligent automation and proactive management. Visibility vcenter Operations offers visualization of the entire virtual infrastructure and applications, with views optimized for rapid resolution of a wide variety of problems: Cross-team views that integrate performance, capacity and configuration-management data to isolate, analyze and remediate problems in real time without delays and errors from cross-silo communications. Abstraction of data into health, workload and capacity-risk measurements actionable intelligence that helps the operations team identify emerging performance problems undistracted by static alerts and false alarms. Best-practices templates for security, hardening and regulatory compliance that give real-time visibility into complex states without cumbersome checklist- and spreadsheet-level management processes. Intelligent Automation vcenter Operations automates routine but difficult and time-consuming management tasks. It uses analysis and workflow based on a deep understanding of virtual architectures and processes, for example: Correlation of information and events across management silos for quick resolution of issues whose causes and effects span traditional management disciplines. Remediation to roll back configuration changes that create performance problems or exceptions to security policies or regulatory requirements. Event-triggered workflows to execute low-level management tasks in response to recurring events, with no administrative intervention. Proactive Management vcenter Operations anticipates emerging problems to give the operations team time to identify, isolate, analyze and remediate them before they affect end users: Self-learning analytics that track performance against a data-rich dynamic normal, for early warning of issues without time wasted on false alarms. Continuous forecasting and what-if modeling of capacity to head off capacity shortfalls and support evaluation of alternative remediation steps. Policy control over all elements of datacenter infrastructure virtual and physical for compliance with IT, industry and regulatory standards. vcenter Operations combines the focus and context that operations management teams need to do their jobs effectively, aggregating detail so staff can focus on what matters without the distraction of false positives or the false confidence of guesswork. It creates a virtual team of stakeholders business and infrastructure as well as operations management who share a common language and set of high-level metrics and can solve problems at any level. technical white paper / 8

VMware vcenter Operations in Action The deep integration of vcenter Operations with VMware vsphere provides infrastructure context connecting elements logically according to dependencies, peer

The solution learns normal conditions and cycles, and gauges departures from them to track trends and anticipate performance degradation, capacity shortfalls, and conflicting or noncompliant

9 VMware vcenter Operations in Action The deep integration of vcenter Operations with VMware vsphere provides infrastructure context connecting elements logically according to dependencies, peer relationships and interactions in specific virtual environments. The solution learns normal conditions and cycles, and gauges departures from them to track trends and anticipate performance degradation, capacity shortfalls, and conflicting or noncompliant configurations. Figure 1. vcenter Operations Performance Management Dashboard Figure 1 shows a dashboard with multiple performance metrics for different workloads and resources, presented side-by-side for instant comparison. Full temporal, relationship and event context is only a click away, as shown in Figure 2. Figure 2. Workload Troubleshooting and Analysis Screen technical white paper / 9

Support for troubleshooting and root-cause analysis includes: Dynamic thresholds for key metrics, placing performance changes in context to distinguish high-normal states from actionable issues such

10 Support for troubleshooting and root-cause analysis includes: Dynamic thresholds for key metrics, placing performance changes in context to distinguish high-normal states from actionable issues such as spikes in I/O latency. Virtualization-aware displays showing peer, parent and child relationships for quick identification of problems with shared resources. Event displays tracking configuration changes that precede or coincide with workload issues, for quick analysis of root causes. Figure 3. Manager-Level Cross-Silo Heat Map An important feature of vcenter Operations is that information is available in business context. Figure 3 shows a heat map of performance metrics consolidated across application resource silos. It identifies business impacts at all levels of aggregation from infrastructure to virtual datacenters for efficient use by executives and managers as well as the operations team. technical white paper / 10

Figure 4. Datacenter Capacity vcenter Operations uses forward-looking analytics to assess current and future performance impacts from capacity issues.

11 Figure 4. Datacenter Capacity vcenter Operations uses forward-looking analytics to assess current and future performance impacts from capacity issues. Figure 4 gives a consolidated view of currently allocated capacity and consumption trends, forecasting a breach when consumption will exceed capacity and cause resource constraints. Drill-down analytics enable isolation of the resource that will cause the constraint. Proactive what-if analytics evaluate the effects of administrative actions. For example, they can project how the capacity consumed by adding virtual machines would move the breach point closer to the present. The operations team can then test alternatives that free additional capacity: adding or redeploying resources, or changing configurations to postpone the constraint. This approach respects relationships among infrastructure elements in virtual and cloud environments. From a business point of view, this approach to capacity management is a major step toward IT as a service, delivered on demand. IT can quickly project the effects on performance of business decisions, both within a business-application stack and throughout the infrastructure. technical white paper / 11

12 Figure 5. Events and Health Timeline Figure 5 shows the timeline of configuration events immediately preceding a period of widespread contention for CPU resources even though the most recently added workload shows no signs of stress. Instead of picking through individual CPU alerts to isolate the problem, the administrator can review configuration changes that preceded the crisis, and then drill down to find the most likely cause here, the administrator s own allocation of unlimited CPU capacity to the new workload. These capabilities can give business users more confidence that the workload requirements and service levels of their business-critical applications will be met, without physical infrastructure s capital and operational expense, or widespread overprovisioning. Benefits with Current Virtual Infrastructure vcenter Operations offers virtualization-aware, context-sensitive, proactive analytics and automated tools needed to manage today s dynamic virtual and cloud environments. Organizations that make the transition to this new class of management tools can expect improvements in: Efficiency, because routine tasks are completed and issues resolved in less time. Span of control, because administrative staff accomplish more within the constraints of currently available resources. Cost-effectiveness, from improved staff and resource utilization. Quality of service, because routine issues are resolved quickly with automated assistance. Availability, because issues are resolved on the fly without planned downtime and before they raise risks of service interruption. Compliance, from event-based actions and workflows that prevent or quickly resolve issues that might otherwise compromise policy or regulatory compliance. technical white paper / 12

13 In addition to these benefits for the business as a whole, management staff see improvements in their ability to manage their time and professional growth. Firefighting an endless list of routine issues is replaced by proactive, high-level administration with automated support. Toward the Management End State As virtualization evolves, so will management processes and tools. We anticipate that virtual and cloud environments will: Grow in scale and complexity, outstripping the capabilities of manual processes even those augmented by automated troubleshooting, analysis and remediation tools. Continue to converge, adding security, compliance and business-process layers to virtual infrastructure and applications. Evolve toward self-service, so business owners can select services with capabilities, performance, compliance and cost that match their requirements. Incorporate public cloud services such as infrastructure and software as services, when they suit business objectives without surrendering visibility or control. At VMware, we are already preparing vcenter Operations for this new environment. Our vision for the management end state includes: Zero-touch automation with embedded problem-solving and management expertise at each layer to optimize virtual and cloud environments. Policy-driven service assurance that ensures services performance, compliance and security, for self-service with complete control. Management and cloud interoperability, for freedom of choice among service providers using an open, standards-based approach that allows applications to move across private and public clouds. In this end-state environment, management disciplines will be embedded directly in the virtual or cloud platform itself, based on learned behaviors and adapting as conditions change. When analytics anticipate performance or capacity issues, automated workflows will add CPUs, move workloads or roll back configuration changes based on policies with no manual intervention. Business owners can select and launch service packages, complete with charge-back pricing plans scaled and configured to their requirements. Third-party clouds will integrate seamlessly into IT infrastructure, with assured visibility, compliance, application mobility, transparent pricing, and better service to business customers and end users. Conclusions Virtualization has transformed rigid, capacity-bound physical infrastructure into flexible, mobile, scalable, virtual elements. In its first stage, physical CPU, memory, network and storage components were aggregated into virtual machines. In the second, physical appliances needed to maintain service levels high availability, disaster recovery, security and more were integrated into the virtual fabric. Now, tools for managing virtual infrastructure are themselves converging into the fabric they support. Without such converged tools and processes, organizations will struggle to deliver operational and business efficiencies promised by today s virtual and cloud environments, and might never make the transition to virtualization s third stage large-scale, self-service, public-private infrastructure. VMware vcenter Operations supports these new converged management processes with end-to-end visibility, virtualization awareness, powerful self-learning analytics and automated tools that meet the most demanding operational and business requirements of today s environments. It aggregates millions of static measurements into a high-level language and consistent set of health, risk and efficiency metrics that help operations communicate effectively across technology silos and work more productively. VMware is anticipating the next stage of the virtualization journey with advanced, analytics-driven automation for zero-touch management of large-scale infrastructure; self-service IT with built-in security, compliance and business processes; and seamless management across hybrid public-private cloud infrastructure. technical white paper / 13

14 VMware, Inc Hillview Avenue Palo Alto CA USA Tel Fax Copyright 2011 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies. Item No: VMW-TECH-WP-vCENTER-OPERATIONS-VISION-USLET-101