Establishing Self-Driving Infrastructure Operations


WHITE PAPER, AUGUST 2018

AIOps Essentials

Establishing Self-Driving Infrastructure Operations

Harnessing AI-Driven Operational Intelligence to Maximize Service Levels and Operational Efficiency

WHITE PAPER ESTABLISHING SELF-DRIVING INFRASTRUCTURES ca.com

Table of Contents

Executive Summary
The Challenge: The High Stakes and Big Obstacles for IT Operations
The Requirements
The Solution: CA Digital Operational Intelligence
CA Digital Operational Intelligence Architecture
Intelligence Layer Maximizes the Power of Unified Visibility
Conclusion

Executive Summary

To compete and win in the application economy, it is vital to deliver optimized service levels at all times. At the same time, IT environments continue to get more dynamic and complex, which makes spotting and fixing performance issues increasingly difficult.

With CA Digital Operational Intelligence, IT teams can correlate data from across their environments in order to gain timely, actionable insights. Featuring capabilities for automation and artificial intelligence for IT operations (AIOps), this solution enables IT teams to maximize not only service levels but also staff productivity and resource utilization. For the business, this translates to better user experiences, higher efficiency and the agility needed to thrive amidst constant change in a software-driven world.

The Challenge: The High Stakes and Big Obstacles for IT Operations

Experience is everything

Today, the digital experience means everything. Regardless of your industry, your business is now in the software business. In the application economy, the way your customers and users interact with applications and the quality of the digital experience delivered play an increasingly pivotal role in business success. For the IT operations teams responsible for supporting these digital experiences, the stakes continue to grow:

Downtime is costly. For each hour of downtime, organizations lose between $140,000 and $2.5 million. [1] Plus, downtime happens a lot. In fact, on average, enterprises lose almost $22 million a year due to downtime. [2]

Slow is the new downtime. Services don't need to be down to cost your business. For an increasingly demanding and impatient user population, if performance is slow, the service may as well be down, and they will go elsewhere. Now, 53 percent of mobile site viewers will abandon a page if it takes longer than three seconds to load. [3]

Optimized experiences and service levels: Critical, yet elusive

While optimizing service levels is critical, it seems to get more challenging every day. Most enterprise-class applications rely on a plethora of technologies that run on different cloud and on-premises platforms and involve multiple virtual and software-defined components. To run these modern hybrid environments, IT operations teams have continued to add monitoring tools.

[1] David Gewirtz, ZDNet, The astonishing hidden and personal costs of IT downtime (and how predictive analytics might help), May 30, 2017
[2] Vincent Bier, Overages and Outages? Solving the Problem of Unplanned Downtime, May
[3] Google Data, Global, anonymized Google Analytics data from a sample of Web sites opted into sharing benchmark data, March 2016

How have teams been faring with this complex set of tools and hybrid environments? Not well. Rather than spotting troubling trends and addressing them before there's an issue, operations teams are in firefighting mode. According to recent IDG research, 34 percent of issues are first spotted by users. [4] When it comes to resolving issues, mean time to repair is 4.5 hours on average. [5]

The problem is that, to manage their complex, hybrid, interrelated and highly dynamic environments, operations teams have to get much more efficient and agile. The volume and variety of data that needs to be managed, correlated and analyzed continues to grow dramatically, which makes this efficiency and agility imperative ever more critical. At the same time, their disjointed tool sets conspire against those goals. Teams struggle with hundreds of thousands of alerts that feature a high rate of inaccuracy and redundancy. Too many processes are labor intensive and require advanced expertise, which means that, on top of escalating downtime costs, organizations contend with escalating operational costs.

In the wake of initiatives like multi-cloud deployments, microservices development and Internet of Things (IoT) implementations, teams continue to see explosive growth in the operational data being generated. Ultimately, internal team members simply can't keep pace with the volume, variety or velocity of operational data.

The Requirements

To address the challenges posed by managing modern IT operations, fundamentally new approaches are required. Simply put, internal teams can't continue to manually track and support modern environments; today's environments are too dynamic and generate too much information. To succeed, these teams need the following capabilities:

Unified, machine-learning-powered intelligence. Now, IT operations teams need unified intelligence that bridges silos from across the IT landscape. Further, this unified intelligence needs to be augmented by machine learning to ensure these massive data stores can be processed and parsed effectively, so teams can fully capitalize on the intelligence they hold. It is only through machine learning that organizations will be able to keep pace with the volume, variety and velocity of data generated in modern environments.

Real-time and historical data. To optimize their dynamic environments, staff need a platform that enables long-term historical data to be retained and analyzed, and at the same time offers near-real-time views of current environment status.

Pre-packaged capabilities that provide a faster ROI. Establishing these advanced analytics requires aggregation and normalization of various sources of information as well as diverse types of data, including infrastructure monitoring events, logs, time-series data and topological information. Building this type of analytics capability can represent a major undertaking for internal teams, imposing massive up-front cost and long-term commitment, which can pose significant risk to the business.

[4] IDG, IDG Quick Pulse: State of IT Operations and Analytics, February 2018
[5] Ibid.

FIGURE A. CA Digital Operational Intelligence features an open architecture that yields maximum flexibility and scalability.

The Solution: CA Digital Operational Intelligence

Now, organizations can harness an analytics platform that addresses the demands of modern IT operations management without incurring the massive costs and risk associated with building this capability internally. CA Digital Operational Intelligence delivers capabilities for artificial intelligence-powered IT operations management, providing cross-domain contextual insights that enable IT teams to optimize service quality, capacity management and user experience. By delivering these insights along with capabilities for integration with automation platforms, the solution enables IT teams to establish the foundation for self-driving infrastructures. The solution consists of two core components:

Open, scalable data store. CA Digital Operational Intelligence delivers comprehensive insights by ingesting and aggregating data from across the enterprise, including metric, topology, text and log data. The solution automatically ingests data from across your environment, including your CA performance monitoring tools and any third-party sources. It then stores this data in an open, highly scalable data store, also known as a data lake. The solution can ingest structured data, such as metrics and alarms, as well as unstructured data, including logs and traces. Once ingested, the data gets normalized so it can be managed and analyzed uniformly, regardless of the source. The solution aggregates data and retains it over the long term so it can be leveraged to do extensive historical and trend analysis.

Intelligence layer. The solution offers machine learning-driven analytics and pre-packaged cross-domain visualization and correlation capabilities. With these capabilities, the platform enables IT teams to get the insights they need to maximize operational efficiency and consistently deliver a quality user experience.

CA Digital Operational Intelligence Architecture

CA Digital Operational Intelligence is built on a powerful analytics engine that leverages open technologies, such as ElasticSearch, Apache Kafka and Apache Spark. The solution's innovative architecture scales with your needs and allows your team to easily integrate with third-party business or IT data sources to further enrich the data set.
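To make the normalization step of the data store more concrete, the following minimal Python sketch maps metric, alarm and log payloads from different sources onto one common record shape. The `Record` fields, the `normalize` function and the raw payload keys (`metric`, `value`, `severity`, `line`, `host`, `hostname`, `ts`) are hypothetical names chosen for illustration; they are not the product's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """Hypothetical common shape for every ingested item."""
    source: str          # which tool sent the data
    kind: str            # "metric", "alarm" or "log"
    host: str            # lower-cased host name
    timestamp: datetime  # always UTC
    body: dict           # kind-specific payload


def normalize(source: str, raw: dict) -> Record:
    """Map a source-specific payload onto the common Record shape."""
    if "metric" in raw and "value" in raw:
        # Structured data: a numeric measurement
        kind = "metric"
        body = {"name": raw["metric"], "value": float(raw["value"])}
    elif "severity" in raw:
        # Structured data: an alarm with a severity level
        kind = "alarm"
        body = {"severity": raw["severity"].lower(),
                "message": raw.get("message", "")}
    else:
        # Unstructured data: keep the raw log line as text
        kind = "log"
        body = {"text": raw.get("line", "")}
    # Different tools name the host field differently (assumed here)
    host = (raw.get("host") or raw.get("hostname") or "unknown").lower()
    # Assumes every payload carries an epoch-seconds "ts" field
    ts = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    return Record(source, kind, host, ts, body)


metric = normalize("apm", {"metric": "cpu", "value": "91.5",
                           "host": "Web01", "ts": 0})
log = normalize("syslog", {"line": "disk full", "hostname": "DB02", "ts": 60})
print(metric.kind, metric.host, log.kind, log.host)
```

Once every item shares one shape, downstream analytics can filter and join on `host` and `timestamp` without caring which tool produced the data.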

FIGURE B. The solution has an analytics engine that leverages open technologies, such as ElasticSearch, Apache Kafka and Apache Spark.

Lambda architecture delivers massive scale and instant visibility

CA Digital Operational Intelligence is built on a Lambda architecture that is proven to handle massive quantities of data by leveraging both batch- and stream-processing methods. Through batch processing, the architecture balances latency, throughput and fault tolerance to provide comprehensive and accurate views of massive data volumes. At the same time, it employs stream processing to provide near-real-time views of online data.

RESTful APIs enable extensive integration

Through the solution's support for RESTful APIs, organizations can leverage data from a broad range of sources, including IT and business systems. For example, if your organization is using sentiment analysis tools to track how customers feel and talk about your solutions online, you can leverage that information along with infrastructure monitoring. Through this combined intelligence, you can more directly track how outages and performance issues affect customer sentiment. In addition, through this support for RESTful APIs, organizations can ensure access to data is aligned with existing authentication and authorization policies.

Intelligence Layer Maximizes the Power of Unified Visibility

As outlined earlier, CA Digital Operational Intelligence offers an extensive set of capabilities for ingesting, aggregating and normalizing data from across the organization. Through its advanced intelligence layer, the solution enables operations teams to extract maximum insights from the data being gathered. The following sections offer an overview of the key capabilities the advanced intelligence layer provides.
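As a brief aside on the Lambda architecture described above, the toy Python sketch below shows the general pattern: an append-only master log, a batch view that is periodically recomputed from that log, a speed view that covers only events since the last batch run, and a query that merges the two. The class and method names are hypothetical, the example runs in a single process, and it does not represent how the product wires together Kafka, Spark or ElasticSearch.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts alarms per host."""

    def __init__(self):
        self.master = []             # append-only log of every event
        self.batch_view = Counter()  # complete but stale (batch layer)
        self.speed_view = Counter()  # recent events only (speed layer)

    def ingest(self, host):
        # Every event lands in the master log and the speed layer
        self.master.append(host)
        self.speed_view[host] += 1

    def run_batch(self):
        # Recompute the batch view from the full log, then drop the
        # speed-layer state that the new batch view now covers
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, host):
        # Serving layer: merge the batch and speed views
        return self.batch_view[host] + self.speed_view[host]


counts = LambdaCounter()
for host in ["web01", "web01", "db02"]:
    counts.ingest(host)
counts.run_batch()       # batch view now covers three events
counts.ingest("web01")   # arrives after the batch run
print(counts.query("web01"))
```

The design choice this illustrates is the trade-off the text names: the batch pass gives an accurate view of everything retained, while the speed layer fills the gap between batch runs so queries stay current.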
Unified visualization and correlation boost operational efficiency

Given the disparate, disjointed sets of tools that have traditionally been employed, IT operations teams have been forced to contend with labor-intensive reporting efforts. Too often, these teams have had to manually generate reports from across tiers, or build SQL queries in order to gather the data required.

With CA Digital Operational Intelligence, IT operations teams can eliminate the manual effort associated with building and maintaining scripts and compiling reports from disparate repositories. The solution features a powerful, intuitive interface that was the result of extensive collaboration between expert practitioners and user experience designers. The solution eliminates the underlying complexities associated with assimilating data from different technologies and domains so organizations can benefit from truly unified visibility.

FIGURE C. CA Digital Operational Intelligence can aggregate and correlate structured and unstructured data.

FIGURE D. CA Digital Operational Intelligence offers dashboards that deliver intuitive insights.

The solution also features an intuitive, drag-and-drop metric palette that makes it easy to correlate metrics from across multiple domains and data types in order to identify the cause of issues faster. The solution features dashboards that provide at-a-glance visibility into multi-tier environments. It also offers pre-packaged reports in such areas as inventory, service level agreement (SLA) status and most frequent alarms. Rather than laboring with SQL queries, administrators can use an intuitive, drag-and-drop interface to create and modify reports, and then easily publish and share results with teams across the organization.

Service analytics automate and speed root-cause analysis

Within many IT organizations, when a business service starts to exhibit performance issues or downtime, operators struggle to determine why. While a single issue may be the culprit, large numbers of redundant or false alerts may be generated, making it difficult for administrators to filter through the noise and identify the issue that needs to be addressed. At the same time, when operators see that a particular device or system is experiencing issues, it may be difficult to determine whether and how the issue is affecting business services.

FIGURE E. With the solution, users can receive alarms that are timely, accurate and targeted.

CA Digital Operational Intelligence delivers the timely, targeted insights operators need to automate and speed root-cause analysis. The solution features a topology analytics service that automatically discovers and maps key IT assets and stores topology information in a graph database. This service consumes data and correlates intelligence from multiple architectural layers. The service can integrate logical application topology information from CA Application Performance Management, network and infrastructure information from CA Unified Infrastructure Management and network monitoring intelligence from additional CA solutions. The topology analytics service incorporates incoming data feeds and uses a proprietary algorithm to identify the root cause of issues.

The solution provides a holistic, intuitive view of all your business services and associated key performance indicators, including availability, alerts, revenue impact, usage and more. IT operations executives can easily identify services that are at risk. They can then quickly drill down for more information and, given the intelligent algorithms employed, even view the potential underlying root cause. In addition, the solution provides a bottom-up view that enables teams to start at the alarm level and see whether and how issues are affecting business services.

With the solution's automated root-cause analysis and service analytics, multiple cross-domain IT operations teams don't have to waste hours on triage when issues arise. Instead, targeted root-cause information can be forwarded directly to the appropriate team for remediation.

Algorithmic noise reduction delivers smarter, more actionable alarms

As outlined above, the complex, disjointed nature of having multiple monitoring tools is creating a number of challenges. One of the chief problem areas is the nature of the alarms being generated. In many organizations, operators need to sift through massive volumes of alarms, many of which are either incorrect or redundant. For example, according to one IDG report, 31 percent of alarms generated in enterprises are false. Because disparate tools lack unified intelligence, one system issue can cause alarms to be generated by many interrelated systems, resulting in so-called alarm blizzards.

The upshot of all this alarm noise is that staff productivity suffers, given all the time that gets wasted sifting through useless alerts. Further, the more noise being generated, the less likely it is that operators will be able to determine which alarms are truly meaningful and need to be acted upon. Ultimately, the more alarm noise, the greater the likelihood that critical issues will be missed and service levels will suffer.

CA Digital Operational Intelligence reduces alarm noise by employing a number of advanced capabilities, including machine-learning-based algorithms. As alarms get ingested into the system's data lake, the solution applies techniques such as text mining to de-duplicate alarms. The solution then orders, clusters and correlates related alarms, so patterns can be identified. For example, within the span of a few minutes, if the same network connection problem happens multiple times, and these problems are followed by network storage access issues, the network alarms are de-duplicated and then sequenced with the storage access problem.
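The network and storage example above can be sketched in a few lines of Python. This is a deliberately simple stand-in, assuming that two alarms are duplicates when they come from the same source and their messages match after digits are masked, and that duplicates must arrive within a fixed time window. The `Alarm` class, the `fingerprint` and `deduplicate` functions and the 300-second window are hypothetical; the product's text-mining and clustering algorithms are not described in enough detail here to reproduce.

```python
import re
from dataclasses import dataclass


@dataclass
class Alarm:
    ts: float      # arrival time in seconds
    source: str    # emitting tool or domain
    message: str   # free-text alarm message


def fingerprint(alarm):
    """Crude text normalization: lower-case, mask digits, squeeze spaces."""
    text = re.sub(r"\d+", "#", alarm.message.lower())
    return alarm.source, re.sub(r"\s+", " ", text).strip()


def deduplicate(alarms, window=300.0):
    """Collapse repeats of the same fingerprint that arrive within
    `window` seconds of the first one; return (alarm, count) pairs
    ordered by first occurrence."""
    groups = []        # [first_alarm, count], in time order
    latest = {}        # fingerprint -> index of its newest group
    for alarm in sorted(alarms, key=lambda a: a.ts):
        key = fingerprint(alarm)
        idx = latest.get(key)
        if idx is not None and alarm.ts - groups[idx][0].ts <= window:
            groups[idx][1] += 1
        else:
            latest[key] = len(groups)
            groups.append([alarm, 1])
    return [(first, count) for first, count in groups]


alarms = [
    Alarm(0, "network", "Connection lost to switch 12"),
    Alarm(30, "network", "Connection lost to switch 12"),
    Alarm(60, "network", "connection lost to  switch 12"),
    Alarm(90, "storage", "Storage access timeout on volume 3"),
]
for first, count in deduplicate(alarms):
    print(first.source, count, first.message)
```

In this run the three network alarms collapse into one entry with a count of three, followed in sequence by the single storage alarm, which mirrors the ordering the text describes.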

FIGURE F. CA Digital Operational Intelligence employs a number of advanced capabilities, including machine learning-based algorithms, to reduce alarm noise.

FIGURE G. Featuring pre-packaged algorithms that can detect meaningful deviations, the solution helps streamline and refine threshold management.

In another case, a router problem could result in alarms being generated by a number of virtual machines and by applications resident on those virtual machines. CA Digital Operational Intelligence correlates these alarms and creates a single service alarm that details the specific issue IT operations staff need to handle. In addition, the solution details which systems and applications are affected, so staff can more intelligently prioritize their efforts. By leveraging these intelligent alarm capabilities, IT teams can realize improved staff productivity and faster resolution times.

Anomaly detection makes it easy to spot real issues

Within many organizations, the process of setting, managing and optimizing alarm thresholds is both time consuming and ineffective. Too often, staff have to spend too much time analyzing metrics and manually configuring thresholds. Further, these thresholds need to be monitored and adapted over time, which can make it difficult for organizations to keep them aligned with ongoing changes in usage, traffic and trends. As a result of these manual efforts, organizations either experience too many false positives or run the risk of detecting problems too late.

CA Digital Operational Intelligence offers a better way to reduce noise and manage thresholds. Instead of requiring users to manually manage thresholds, the solution leverages pre-packaged algorithms that can detect meaningful deviations. Further, unlike competitive offerings that require users to choose specific algorithms, CA Digital Operational Intelligence automatically selects and applies algorithms as appropriate. The solution can leverage a number of algorithms, including Kernel Density Estimation (KDE) and Weighted Exponentially Moving Average (WMA^2).
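To illustrate what detecting deviations without a hand-set threshold can look like, here is a minimal Python sketch that uses a standard exponentially weighted moving average and variance as an adaptive baseline. It is an assumption-laden example: the textbook EWMA used here may differ from the WMA^2 variant named above, and the function name `ewma_anomalies` and the parameters `alpha`, `k` and `warmup` are hypothetical choices, not product settings.

```python
import math


def ewma_anomalies(values, alpha=0.3, k=3.0, warmup=5):
    """Return indexes of points more than k standard deviations away
    from an exponentially weighted moving baseline."""
    mean = values[0]   # running baseline
    var = 0.0          # running exponentially weighted variance
    flagged = []
    for i, x in enumerate(values[1:], start=1):
        std = math.sqrt(var)
        if i >= warmup and std > 0 and abs(x - mean) > k * std:
            flagged.append(i)
            continue   # do not let the outlier shift the baseline
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged


# Hypothetical CPU utilization samples with one obvious spike
cpu = [50, 51, 49, 50, 52, 48, 50, 51, 95, 50, 49]
print(ewma_anomalies(cpu))
```

Because the baseline and its spread are learned from the data, the same function adapts to a metric that normally sits near 50 or near 5,000 without anyone configuring a fixed limit.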

FIGURE H. CA Digital Operational Intelligence helps administrators better understand, predict and optimize resource utilization.

The solution can leverage historical data to identify seasonal trends. The solution can also use Western Electric rules to predict anomalies based on the intelligence being captured. By leveraging CA Digital Operational Intelligence, IT teams can eliminate the effort and guesswork associated with manually managing thresholds. Further, they gain the intelligence needed to more effectively identify, predict and ultimately prevent issues.

Predictive capacity analytics enable enhanced utilization

With traditional, piecemeal monitoring tools and approaches, organizations lack the insights needed to effectively predict capacity utilization, which makes it difficult to prevent capacity-related bottlenecks and optimize resource investments and provisioning. Executives can't intelligently do what-if planning to accurately estimate capacity requirements for planned initiatives. Further, these challenges grow exponentially as environments continue to get more complex, dynamic and hybrid in nature.

With CA Digital Operational Intelligence, operators and executives can better understand, predict and optimize resource utilization. The solution applies regression models against collected intelligence. It can support executives' what-if analysis based on detailed workload and usage data. The solution features pre-packaged capabilities for predicting capacity requirements. With the solution, IT teams can predict capacity bottlenecks, so they can take steps to resolve them before services are disrupted. Through this intelligence, teams can better manage costs by identifying underused capacity. Teams can also better predict near- and long-term trends so they can anticipate and meet evolving technical and business requirements.
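The regression-based capacity prediction described above can be illustrated with a small Python sketch. It fits an ordinary least-squares line to daily utilization samples and estimates how many days remain before a threshold is crossed, with an optional growth multiplier for simple what-if questions. The function names, the `growth_factor` parameter and the sample data are hypothetical, and a straight-line fit is only one of many regression models such a platform might apply.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(threshold, days, utilization, growth_factor=1.0):
    """Days after the last sample until the fitted trend reaches
    `threshold`. `growth_factor` scales the fitted growth rate for
    what-if analysis. Returns None if utilization is not growing."""
    slope, intercept = fit_line(days, utilization)
    current = intercept + slope * days[-1]   # fitted value today
    if current >= threshold:
        return 0.0
    rate = slope * growth_factor
    if rate <= 0:
        return None
    return (threshold - current) / rate


days = [0, 1, 2, 3, 4]
disk_pct = [50, 52, 54, 56, 58]   # hypothetical daily disk usage (%)
print(days_until(80, days, disk_pct))                     # current trend
print(days_until(80, days, disk_pct, growth_factor=2.0))  # what if growth doubles?
```

Comparing the two results shows how a planned initiative that doubles the growth rate would halve the time left before the bottleneck, which is the kind of what-if estimate the text refers to.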

Conclusion

In the application economy, IT teams simply can't continue to rely on disjointed point monitoring tools and all the time-consuming effort they require. Now, your organization has a better alternative. With CA Digital Operational Intelligence, your IT teams can leverage the powerful insights they need to optimize performance and availability of your complex, dynamic IT environments. The solution powers self-driving infrastructure operations, delivering insights to automation tools and enabling automated remediation. With the solution's AI-powered insights, cross-domain correlation and automated root-cause analysis, your organization can more consistently deliver optimized user experiences, while maximizing operational efficiency.

For more information, please visit ca.com/doi

Connect with CA Technologies

CA Technologies (NASDAQ: CA) creates software that fuels transformation for companies and enables them to seize the opportunities of the application economy. Software is at the heart of every business, in every industry. From planning to development to management and security, CA is working with companies worldwide to change the way we live, transact and communicate across mobile, private and public cloud, distributed and mainframe environments. Learn more at ca.com.

Copyright 2018 CA. All rights reserved. All trademarks referenced herein belong to their respective companies. This document does not contain any warranties and is provided for informational purposes only. Any functionality descriptions may be unique to the customers depicted herein and actual product performance may vary. CS _0818