Let's Get Real About Self-Driven IT Ops
Jim Kokoszynski, VP Software Engineering, CA Technologies

1 Let's Get Real About Self-Driven IT Ops
Jim Kokoszynski, VP Software Engineering, CA Technologies
June 2018

2 Mainframe is Mission Essential in the Modern Software Factory
1.3B CICS transactions processed every second [1]
78% of clients growing MIPS [2]
[...]% availability for billions of transactions daily [3]
CONNECTED MAINFRAME: $198.5M additional revenue yearly [4]
Sources: [1] IBM Estimates on Real Client Usage; [2] Arcati 2017 Mainframe Yearbook; [3] Business Finance Magazine, Mainframe 101 for C Level Executives, Robert Frances Group Study; [4] Connected Mainframe for Digital Transformation, IDC

3 The Modern Software Factory
How do we secure data and protect privacy across enterprises and blockchains?
How do we free up budget to self-fund growth in MIPS?
How can I learn from machine data to create self-healing systems that provide availability and redundancy?
How do we increase innovation velocity to deliver new services to customers?
How do we automate & digitize processes for maximum efficiency?
(Diagram labels: Economics, Concept, Product)

4 Nobody wants to be this guy!
IT failure cost British Airways over $102 million [4]
Enterprises are losing on average $21.8 million per year in downtime [1]
Delta's computer outage cost $150 million [3]
Average cost per hour of downtime is $140,000 to $2.5 million [2]
Brand Impact. Lost Trust. Lost Revenues. Upset Customers.
Sources: [1] "Overages and Outages? Solving the Problem of Unplanned Downtime" by Vincent Bier, May 2017; [2] "Astonishing Hidden Costs of IT Downtime" by David Gewirtz, May 2017; [3] "Delta's Computer Outage To Cost Them $150 Million", Sept 7, 2016; [4] "British Airways CEO puts cost of recent IT outage at 80 million pounds" by Robert Hetz, Jun. 15,
COPYRIGHT 2018 CA. ALL RIGHTS RESERVED

5 IT Ops Challenges
TOO MUCH: Too much data and complexity for humans alone to analyze efficiently
TOO LONG: Takes too long to access, analyze and derive insights from mainframe systems data
TOO LATE: Notified of potential problems too late to avoid business-impacting incidents

6 Proactive Issue Resolution Is Critical, But It's Hard
Too Much: 31% false alarms
Too Long: 4.5 hours average MTTR
Too Late: 34% of issues identified by users
Source: recent IDG research

7 What if You Could...
REDUCE MANUAL LABOR (>40%): Enable generalists to triage issues and engage only the experts needed, and increase automation
PREDICT ISSUES (+2H earlier): Take action earlier with embedded intelligence that dynamically alerts to abnormal patterns of operation
DIAGNOSE PROBLEMS (+5X faster): Pinpoint root cause faster with multi-source data feeds and advanced machine learning algorithms

8 The Six Stages of Automation
SOCIETY OF AUTOMOTIVE ENGINEERS (SAE) AUTOMATION LEVELS
No Automation: A human controls all critical driving functions.
Driver Assistance: Vehicle is controlled by the driver, but some driving-assist features are included in the vehicle design.
Partial Automation: Car can perform one or more tasks at the same time, including steering and accelerating, but still requires the driver to remain alert and in control.
Conditional Automation: Car drives itself under certain conditions but requires a human to intervene upon request, with sufficient time to respond.
High Automation: Vehicle is capable of performing all driving functions under certain conditions. Driver may have the option to control the vehicle.
Full Automation: The Holy Grail. Car drives itself from departure to destination; human is out of the loop. Car is as good as or better than a human.

9 Evolving to a Self-Driven Mainframe Data Center
MAINFRAME OPERATIONS, ENTERPRISE SUPPORT
MTTR & Firefighting; Optimized Performance & Efficiency
Generalist; Experts
VISUAL ANALYTICS, ANOMALY DETECTION, PATTERN DISCOVERY, PRESCRIPTIVE GUIDANCE, AUGMENTED INTELLIGENT AUTOMATION
Data and Event Processing; Machine Learning Algorithms; Automation

10 Evolving to a Self-Driven Mainframe Data Center
MAINFRAME OPERATIONS, ENTERPRISE SUPPORT
MTTR & Firefighting; Optimized Performance & Efficiency
Generalist; Experts
VISUAL ANALYTICS: Modern U/X, Self-service views, Historical analysis
ANOMALY DETECTION: Dynamic alerting, Proactive response
PATTERN DISCOVERY: Alert clustering, Data correlation, Predict business service disruption, Topology discovery
PRESCRIPTIVE GUIDANCE: Guided resolution, Assisted triage, Problem prevention
AUGMENTED INTELLIGENT AUTOMATION: Automated response (going beyond prescriptive)
Data, Sentiment Processing; Machine Learning Algorithms; Automation

11 Augmented Human Intelligence and Automation
Applications, Databases, Systems, Networks, Storage, Blockchain*
Anomaly Detection: Western Electric Rules, Kernel Density Estimation, Exponential Moving Average
Pattern Detection: Multi-Variate Clustering, Causality Model
Remediation: Predictive Insights, Recommendation Engine, User Sentiment Analysis
IT Systems & Processes: Automate simple tasks
Application Tuning: Tune and optimize app code
Economics: Automate to prevent unplanned capacity spikes
Data, Sentiment Processing; Machine Learning Algorithms; Automation
* Planned
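
Kernel density estimation, one of the anomaly-detection techniques named on this slide, can be illustrated with a minimal sketch that scores how likely a new metric value is given its history. This is not CA's implementation: the function names, the Gaussian kernel, the bandwidth, the 5% cutoff, and the sample data are all assumptions chosen for illustration.

```python
import math

def kde_density(history, x, bandwidth=1.0):
    """Gaussian kernel density estimate at x, built from historical samples."""
    norm = 1.0 / (len(history) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - h) / bandwidth) ** 2) for h in history
    )

def is_unlikely(history, x, bandwidth=1.0, cutoff=0.05):
    """Flag x when its density is below `cutoff` times the densest historical point.

    The 5% cutoff is an illustrative assumption, not a documented default.
    """
    peak = max(kde_density(history, h, bandwidth) for h in history)
    return kde_density(history, x, bandwidth) < cutoff * peak

history = [48, 52, 50, 49, 51, 50, 47, 53]  # hypothetical CPU % samples
print(is_unlikely(history, 50))  # a typical value
print(is_unlikely(history, 70))  # far outside anything seen before
```

A value near the bulk of the history scores as likely, while a value far from every historical sample has a density close to zero and is flagged. In practice the bandwidth and cutoff would need tuning per metric.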

12 Move From Reactive to Proactive: Significantly Reduce MTTR and False Alarms
REACTIVE Monitoring: Drowning in data, sea of red, reactive fire fights
PREDICTIVE Machine-based Alerts: Adaptive alerting, fewer false positives, faster MTTR and RCA

13 Operational Intelligence: Drive Superior Experience and Operational Efficiencies from Cloud to Mainframe
Applications, Databases, Systems, Networks, Storage, Cloud, Hybrid Infrastructures
Operational Intelligence: DATA-DRIVEN MACHINE LEARNING
Easily Predict Issues Earlier with Smarter Alarms
Reduce False Alerts through Algorithmic Noise Reduction
Drive Faster Root Cause Analysis with Service Analytics
Proactively Optimize Resources with Predictive Capacity Insights
Boost Operational Efficiency with Unified Visualization & Correlation

14 Easily Visualize Data Relationships: Flexible Multi-Source Data Feeds
VISUAL ANALYTICS
Applications, Databases, Systems, Networks, Storage, Blockchain*
CA Products, SMF Data, Open API

15 Proactively Detect Performance Anomalies
Utilize historical data
Define bands of Likely and Unlikely (Most Likely, Less Likely, UNLIKELY)
Map real-time metric streams
Multi-point alerts generated using industry-standard Western Electric rules
Make static thresholds optional!
(Chart: Typical Volatility over Time, with an ANOMALY marked)
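
The multi-point alerting described on this slide can be sketched as follows. This is a minimal illustration of the four Western Electric rules as they are commonly stated, assuming the bands are derived from the mean and sample standard deviation of historical data; the function names and sample values are hypothetical, and nothing here is taken from the CA product.

```python
from statistics import mean, stdev

# (rule name, window length, points required, sigma limit)
RULES = [
    ("rule1", 1, 1, 3.0),  # 1 point beyond 3 sigma
    ("rule2", 3, 2, 2.0),  # 2 of 3 consecutive points beyond 2 sigma, same side
    ("rule3", 5, 4, 1.0),  # 4 of 5 consecutive points beyond 1 sigma, same side
    ("rule4", 8, 8, 0.0),  # 8 consecutive points on one side of the mean
]

def same_side_count(window, limit):
    """Largest number of points beyond the limit on a single side of the mean."""
    above = sum(1 for z in window if z > limit)
    below = sum(1 for z in window if z < -limit)
    return max(above, below)

def western_electric_alerts(history, stream):
    """Return (index, rule) pairs for each point in `stream` that trips a rule."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        raise ValueError("history has no variation; bands are undefined")
    scores = [(x - mu) / sigma for x in stream]
    alerts = []
    for i in range(len(scores)):
        for name, span, needed, limit in RULES:
            if i + 1 >= span:  # enough points to fill this rule's window
                window = scores[i + 1 - span : i + 1]
                if same_side_count(window, limit) >= needed:
                    alerts.append((i, name))
    return alerts

history = [48, 52, 50, 49, 51, 50, 47, 53]  # hypothetical baseline samples
print(western_electric_alerts(history, [50, 51, 57]))
```

Because the bands come from the metric's own history, no static threshold has to be configured; a sustained small shift (rule 4) raises an alert even though no single point looks extreme.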

16 Adaptive Alerting: Prevent Failures and Avoid Problems Before They Happen
(Chart: CPU Utilization; APP CHANGE IN ENVIRONMENT; ANALYTICS-BASED ALERT; THRESHOLD-BASED ALERT)
Analytics-based alerts detect signal from noise
Subtler than human-observed
Subtler than static thresholds, which may be the last defense
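
The contrast between an analytics-based alert and a threshold-based alert can be illustrated with a small sketch. It assumes an exponential moving average baseline with an exponentially weighted deviation band; the smoothing factor, band width, warm-up length, function names, and CPU samples are hypothetical choices, not values from the presentation.

```python
def ema_alerts(values, alpha=0.3, band=3.0, warmup=5):
    """Indexes where a value departs from its EMA baseline by more than
    `band` times the exponentially weighted mean absolute deviation."""
    ema = float(values[0])
    dev = 0.0
    alerts = []
    for i, x in enumerate(values[1:], start=1):
        err = abs(x - ema)
        # Compare against the baseline learned *before* this point arrived.
        if i >= warmup and dev > 0 and err > band * dev:
            alerts.append(i)
        ema = alpha * x + (1 - alpha) * ema
        dev = alpha * err + (1 - alpha) * dev
    return alerts

def static_alerts(values, threshold):
    """Indexes where a value exceeds a fixed threshold."""
    return [i for i, x in enumerate(values) if x > threshold]

# Hypothetical CPU utilization (%): steady near 30, then ramping after a change.
cpu = [30, 31, 29, 30, 31, 30, 29, 31, 45, 60, 75, 92]
adaptive = ema_alerts(cpu)
static = static_alerts(cpu, threshold=90)
print(adaptive[0], static[0])
```

On this sample the adaptive detector first fires at index 8, when utilization jumps to 45%, while the static 90% threshold fires only at index 11. The adaptive band also widens as the ramp continues, which is why later points on the ramp are not all flagged.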

17 Alert Clustering: Automated Correlation Delivers Issue Intelligence
PATTERN DISCOVERY
1 Gather Alerts from Multiple Sources
2 Automate Clustering of Alerts into Issues
3 Provide Incident Prediction and Insight
Clustered Issue 7: Network problem, 60 active alerts within 30 minutes
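
The gather-and-cluster steps on this slide can be sketched in a simplified form. The example below groups alerts purely by time, assuming that alerts arriving within 30 minutes of the first alert in a group belong to the same issue. The `Alert` type, its fields, and the sources are hypothetical, and a production system would presumably also correlate on topology and metric data rather than time alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    source: str      # hypothetical: the monitoring tool that raised the alert
    message: str
    time: datetime

def cluster_alerts(alerts, window=timedelta(minutes=30)):
    """Group alerts into issues: an alert joins the current cluster when it
    falls within `window` of that cluster's first alert."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.time):
        if clusters and alert.time - clusters[-1][0].time <= window:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters

day = datetime(2018, 6, 1)
alerts = [
    Alert("network", "packet loss", day.replace(hour=10, minute=0)),
    Alert("storage", "slow I/O", day.replace(hour=10, minute=10)),
    Alert("database", "lock wait", day.replace(hour=10, minute=25)),
    Alert("systems", "CPU spike", day.replace(hour=11, minute=30)),
]
issues = cluster_alerts(alerts)
print([len(issue) for issue in issues])
```

Each resulting cluster can then be reported as a single issue with its alert count and time span, instead of as dozens of separate alerts.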

18 Next Step: Apply Sentiment Analysis. Refine Predictions with Augmented Intelligence
AUGMENTED INTELLIGENT AUTOMATION
1 Gather Alerts from Multiple Sources
2 Automate Clustering of Alerts into Issues
3 Provide Incident Predictions and Insights
4 Apply Sentiment Analysis: tribal knowledge captured, refines automatic recommendation
Clustered Issue: Network problem, 60 active alerts within 30 minutes

19 Intelligent Automation: Reach Past IT with Process and Release Automation
AUGMENTED INTELLIGENT AUTOMATION
1 Trending Out of Norm Flags Dynamic Alert
2 Trigger an Automated Event
3 Identify Where in Process it Failed and Fix

20 Insight Streaming: Dramatically Drive Down Cost of Analyzing Machine Data
CA Mainframe Operational Intelligence: Anomaly Intelligence, Issue Intelligence
Filter data
Remove noise
Send only relevant data
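
The filter, remove-noise, and send-only-relevant-data idea can be sketched as a simple stream filter. This assumes each record carries a metric name and a value, and that a per-metric baseline is already known; the record layout, the `relevant_records` name, the tolerance, and the metric names are all hypothetical.

```python
def relevant_records(records, baseline, tolerance):
    """Yield only records worth forwarding downstream.

    A record is kept when its metric has no known baseline, or when its
    value deviates from the baseline by more than `tolerance`.
    """
    for rec in records:
        expected = baseline.get(rec["metric"])
        if expected is None or abs(rec["value"] - expected) > tolerance:
            yield rec

baseline = {"cpu_pct": 30.0, "io_wait_ms": 5.0}  # hypothetical learned norms
records = [
    {"metric": "cpu_pct", "value": 31.0},     # normal: dropped
    {"metric": "cpu_pct", "value": 62.0},     # abnormal: forwarded
    {"metric": "io_wait_ms", "value": 5.5},   # normal: dropped
    {"metric": "queue_depth", "value": 12},   # no baseline yet: forwarded
]
forwarded = list(relevant_records(records, baseline, tolerance=5.0))
print(len(forwarded), "of", len(records), "records forwarded")
```

Filtering at the source means downstream analytics tools ingest only the deviating records, which is where the cost reduction claimed on the slide would come from.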

21 CHALLENGE: Global Insurance Company
Exploding mainframe demands and complexity, combined with shrinking staff resources, drove the need for a solution to offload and automate low-level staff work.
SOLUTION: CA Mainframe Operational Intelligence, CA Application Performance Management
>40% Reduction in Manual Effort
"CA's approach is way ahead of other intelligence engines which aren't real time... enables me to avoid having experienced employees wasting their time...instead they are available to apply their expertise to next gen solutions"
Protect experts and reduce time spent on low-level management tasks
Improved uptime through automated remediation
Reduced mean time to repair
The company in this case study has policies against publicly endorsing vendors and prefers to remain anonymous.

22 CHALLENGES: Large Financial Services Company
Major system incident resulted in an 8-hour outage
Average cost of a critical application failure per hour is $500,000 to $1 million [1]
SOLUTION: Intelligent operations and automation: CA Mainframe Operational Intelligence, CA SYSVIEW Performance Management, CA OPS/MVS Event Management and Automation
Detect Problems Hours Earlier
"Issue easily found within minutes. Anomaly detection would have alerted of the issue 2 hours before we even knew we had a problem, 5 hours before we took action."
Prevent costly downtime with earlier warning of critical system issues
Quickly and easily pinpoint root cause of a problem
(Chart: z/OS System CPU1)
Source: [1] "The real cost of downtime", Alan Shimel, February 11,

23 5X Improvement in Resolution Time
CHALLENGE: Being notified after a customer is impacted is too late. Static thresholds yield too many false positives and cause a lot of unnecessary churn to investigate.
SOLUTION: Intelligent operations and automation: CA Mainframe Operational Intelligence, CA SYSVIEW Performance Management, CA Vantage Storage Resource Manager, CA OPS/MVS Event Management and Automation
"Performance insights in near real time are spectacular. We reduced time to analyze cause of issue from hours to minutes. Test is one thing, being able to see in production provides the real insights."
Increased IT efficiency with faster problem triage and remediation
Improved customer experience through problem prevention
Reduced mean time to resolution
The company in this case study has policies against publicly endorsing vendors and prefers to remain anonymous.

24 Summary: Benefits of Intelligent Operations and Automation
Reduce False Alerts Through Algorithmic Noise Reduction
Easily Predict Issues Earlier With Smarter Alarms
Drive Faster Root Cause Analysis With Service Analytics
Proactively Optimize Resources With Predictive Capacity Insights
Boost Operational Efficiency With Unified Visualization & Correlation

25 Try out CA Mainframe Operational Intelligence!

26 Contact us to learn more about the data science algorithms behind the curtains
VISIT: ca.com/intelligent-mainframe
We'll run YOUR data through the Intelligence Center, and you'll see why customers are so excited about this machine learning solution!

27 Thank You. Mainframe.ai
