
1 Let's Go Splunking NLIT Summit 2018 Thursday, May 24, :15 AM - 11:00 AM, Room 104B Robert H Murray, Lead Database Services - Oracle robert.murray@inl.gov (208)

2 Data analytics for IT Operations Using machine data and deep learning, an AIOps engineer can build anomaly detection models that detect the likelihood of an imminent production issue and drive root-cause analysis and correction long before the event ever happens. This is mission critical because it ensures that IT Operations pipelines keep flowing continuously.
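
A minimal sketch of what such a model can look like at its simplest, assuming a rolling z-score over a single metric stream; the window size, threshold, and sample series below are illustrative assumptions, not details from the talk:

```python
# Rolling z-score anomaly flagging over a single metric stream. The window
# size, threshold, and sample series are illustrative assumptions.
from collections import deque
from statistics import mean, stdev

def anomaly_flags(samples, window=60, threshold=3.0):
    """Yield (value, is_anomaly) pairs using a rolling z-score."""
    history = deque(maxlen=window)
    for value in samples:
        if len(history) >= 2 and stdev(history) > 0:
            z = abs(value - mean(history)) / stdev(history)
            yield value, z >= threshold
        else:
            yield value, False
        history.append(value)

# Example: a steady CPU series with one spike that should be flagged.
series = [50.0, 51.0, 49.0] * 20 + [98.0] + [50.0] * 10
print([v for v, hit in anomaly_flags(series) if hit])  # -> [98.0]
```

In production this per-metric scoring would be one signal among many feeding a larger model, but it shows the core idea: flag the point that breaks the learned pattern before it becomes the outage.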

3 Overview Spelunking is the recreational pastime of exploring wild (generally non-commercial) cave systems. Splunking (not to be confused with other definitions) is my term for INL/Information Management's pastime of exploring wild (commercial) systems capable of discovering, acquiring, and organizing vast amounts of machine data, then using Splunk's artificial intelligence for IT operations (AIOps) to deliver Key Performance Indicators (KPIs) in a format well adapted to human cognition. More information:

4 Greenfield Challenges Data and Analytics Architecture Building a sophisticated architecture for your data and analytics requires stakeholder commitment, a significant investment, and, most importantly, a clear vision of the desired end state. The main challenges:
- Data acquisition: Loading large datasets is a challenge, especially when combined with on-line filtering and data reduction.
- Information extraction and cleaning: Frequently, the information collected is not in a format ready for analysis (see the sketch below).
- Data integration, aggregation, and representation: To be effective, large-scale analysis often requires collecting heterogeneous data from multiple sources.
- Modeling and analysis: Machine data is often noisy, dynamic, heterogeneous, inter-related, and untrustworthy.
- Interpretation: Analytical pipelines involve multiple complex steps with inherent assumptions.
- Visualization and collaboration: For interpretation to fully reach its potential, results need to be presented in a visual form suited to human cognition.
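
As a concrete illustration of the extraction-and-cleaning challenge, here is a minimal sketch that normalizes raw log lines into structured records; the log format, regex, and field names are illustrative assumptions:

```python
# Machine data rarely arrives analysis-ready: normalize each raw line into a
# common record, and drop lines that are pure noise.
import re
from datetime import datetime, timezone

SYSLOG = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<msg>.*)$"
)

def extract(line):
    """Return a cleaned record dict, or None if the line is unusable noise."""
    m = SYSLOG.match(line.strip())
    if not m:
        return None  # filter noise instead of letting it pollute the analysis
    return {
        "timestamp": datetime.fromisoformat(m["ts"]).replace(tzinfo=timezone.utc),
        "host": m["host"].lower(),              # normalize hostnames for joins
        "message": " ".join(m["msg"].split()),  # collapse stray whitespace
    }

lines = [
    "2018-05-24T09:15:00 dbprod01 ORA-00060: deadlock   detected",
    "garbage line with no structure",
]
records = [r for r in map(extract, lines) if r]
print(records[0]["host"], records[0]["message"])
```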

5 Why splunk>? The fastest way to aggregate, analyze, and get answers from your machine data. SPLUNK IT SERVICE INTELLIGENCE (ITSI) An out-of-the-box monitoring and analytics solution that gives you visibility across IT and business services and enables you to use AI to go from reactive to predictive IT.

6 Hadoop vs Splunk

7 Hadoop vs Splunk

8 Hadoop vs Splunk

9 Hadoop vs Splunk

10 Hadoop vs Splunk

11 Hadoop vs Splunk

12 Hadoop vs Splunk

13 Hadoop (Build) vs Splunk (Buy) Disclaimer: This is not a Splunk sales pitch! So let's keep moving.

14 Machine Data Sources Machine data is the digital exhaust created by the systems, technologies, and infrastructure powering modern businesses, and it can be used to address big data, IT operations, security, and analytics use cases. The insights gained from machine data can support any number of use cases across an organization and can also be enriched with data from other sources. The enterprise machine data fabric shares and provides access to machine data across the organization to facilitate these insights. Source:

15 Machine Data Sources Tools for Enterprise Infrastructure and Operations Image sources:

16 Machine Data Sources Single system of record on a single platform The CMDB automatically integrates with all applications and features built on the Now Platform, making it rich in functionality and value. IT can use the CMDB with Discovery, Service Mapping, and other applications to gain an end-to-end, service-aware view of CI lifecycles. Tools for Enterprise Information Management Governance Content sources:

17 Machine Data Sources Tools for Enterprise Application Monitoring Content sources:

18 Machine Data Sources Tools for Oracle Enterprise Database Administrators The Enterprise Manager Management Repository views provide access to target, metric, and monitoring information stored in the Management Repository. views.htm#emvws32040 The data in these views can be consumed by Splunk via Splunk's DBConnect solution for working with databases.
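
For illustration, here is a hedged sketch of the kind of query DBConnect would run on a schedule against the repository, shown in Python with the python-oracledb driver. The connection details are placeholders, and the MGMT$METRIC_CURRENT view and its columns should be verified against the Enterprise Manager documentation for your release:

```python
# Poll current metric values from the EM repository; DBConnect does the same
# kind of scheduled query and forwards each row to Splunk as an event.
import oracledb

SQL = """
SELECT target_name, metric_name, metric_column, collection_timestamp, value
  FROM mgmt$metric_current
 WHERE metric_name = 'Load'
"""

# Placeholder credentials and DSN: substitute your read-only repository account.
with oracledb.connect(user="ro_monitor", password="***", dsn="emrep:1521/emrep") as conn:
    with conn.cursor() as cur:
        for target, metric, column, ts, value in cur.execute(SQL):
            print(f"{ts} target={target} {metric}.{column}={value}")
```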

19 Machine Data Sources Getting Data into Splunk

20 Machine Data Sources Splunk DBConnect

21 Machine Data Overview In-house Watchdog
- Based on a hub-and-spoke model
- The same shell script runs on all servers at the same time
- Performs standard data collection
- Raw data sections: Server, Middleware, Database, Patching, Backups, Accounts
- Records contain a date-time stamp
- One logfile per server is sent to Splunk, then overwritten (see the sketch below)
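
A minimal sketch of that collection pattern, written in Python in place of the actual shell script; the section names come from the slide, while the collector stub and log format are illustrative assumptions:

```python
# Every run rebuilds one logfile per server with a timestamped record per
# section; the file is overwritten, not appended, so Splunk always ingests
# the latest snapshot.
import socket
from datetime import datetime, timezone

SECTIONS = ["Server", "Middleware", "Database", "Patching", "Backups", "Accounts"]

def collect(section):
    """Stand-in for the real per-section collection commands."""
    return f"{section.lower()}_status=OK"

def run():
    host = socket.gethostname()
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    with open(f"watchdog_{host}.log", "w") as log:  # "w" = overwrite each run
        for section in SECTIONS:
            log.write(f"{stamp} host={host} section={section} {collect(section)}\n")

run()
```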

22 Machine Data Sources In-house Watchdog Sample

23 What is a Key Performance Indicator (KPI)? Key Performance Indicators (KPIs) are the critical indicators of progress toward an intended result. KPIs provide a focus for strategic and operational improvement, create an analytical basis for decision making, and help focus attention on what matters most. More information:

24 KPI Reporting n Dimensions, Color, and Human Visual Cognition Source:

25 KPI Reporting / Reactive Source:

26 KPI Reporting Spreadsheets Multiple Copies of Old Data Source:

27 KPI Reporting Building Splunk Dashboards and Reports Source:

28 KPI Reporting Splunk Charts and Graphs Source:

29 KPI Reporting Splunk Glass Tables and ITSI Source:

30 KPI Reporting Splunk - Using Color as a Dimension Source:

31 KPI Reporting 2-D vs 3-D Color Topo Maps Source:

32 KPI Reporting 2-D Heat Maps Source:
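
For illustration, a minimal matplotlib sketch of such a 2-D heat map, with time on the x-axis, KPIs on the y-axis, and color as the severity dimension; the data is synthetic:

```python
# Synthetic 24-hour severity data for four KPIs, rendered as a 2-D heat map
# with a red-yellow-green palette (reversed so red = bad). All values assumed.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
kpis = ["cpu", "io_wait", "sessions", "redo_rate"]
data = 0.6 * rng.random((len(kpis), 24))   # one severity value per KPI per hour
data[1, 14:17] = 0.95                      # inject an afternoon hot spot

fig, ax = plt.subplots()
im = ax.imshow(data, cmap="RdYlGn_r", aspect="auto", vmin=0.0, vmax=1.0)
ax.set_yticks(range(len(kpis)))
ax.set_yticklabels(kpis)
ax.set_xlabel("hour of day")
fig.colorbar(im, ax=ax, label="severity")
fig.savefig("kpi_heatmap.png")
```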

33 KPI Reporting 3-D Heat Maps Source:

34 KPI Reporting 3-D Heat Map / Glass Table Answers the question "Is bicycling in New York City dangerous?" Source:

35 KPI Reporting 3-D x 2-D Heat Maps Source:

36 Correlating Disparity What if we could use colorized heat map layers to correlate time (x), KPIs (y), and domains (z)? What if we could strip out the greens and the yellows and only focus on the RED? Could that somehow let us identify correlations across seemingly disparate KPIs? Source:
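
A minimal sketch of that idea, assuming each domain contributes a KPI-by-hour severity matrix; the red threshold and the synthetic data are illustrative assumptions:

```python
# Keep only the "red" cells of each domain's KPI-by-hour severity matrix,
# then intersect: hours that are red in every domain are correlation candidates.
import numpy as np

rng = np.random.default_rng(1)
hours = 24
domains = {name: 0.7 * rng.random((4, hours))        # 4 KPIs per domain
           for name in ("network", "database", "application")}
for layer in domains.values():
    layer[:, 9] = 0.9                                # plant a shared 09:00 incident

RED = 0.8                                            # strip the greens and yellows
red_hours = {name: set(np.where((layer >= RED).any(axis=0))[0])
             for name, layer in domains.items()}

shared = sorted(int(h) for h in set.intersection(*red_hours.values()))
print("hours red in every domain:", shared)          # -> [9]
```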

37 Correlating Disparity

38 Correlating Disparity

39 Correlating Disparity

40 Correlating Disparity

41 Correlating Disparity

42 Correlating Disparity

43 Correlating Disparity

44 Correlating Disparity (repeat) What if we could use colorized heat map layers to correlate time (x), KPIs (y), and domains (z)? What if we could strip out the greens and the yellows and only focus on the RED? Could that somehow let us identify correlations across seemingly disparate KPIs? Source:

45 Anomaly Detection Find the needle in the haystack before it becomes the root cause... and fix it. Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These non-conforming patterns are referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains. Anomaly detection and automation are the building blocks of automated response in artificial intelligence for IT operations (AIOps). Anomaly detection matters because anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. Text source: Image source:
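
As one generic illustration (not the specific model used at INL), a minimal sketch of multivariate anomaly detection using scikit-learn's IsolationForest over two assumed KPI features:

```python
# Fit an isolation forest on "normal" operating samples, then flag the needle.
# The features, cluster parameters, and contamination rate are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Normal operations: (cpu%, active_sessions) clustered around (40, 120).
normal = rng.normal(loc=[40.0, 120.0], scale=[5.0, 15.0], size=(500, 2))
# The "needle": a sample far outside the normal operating cluster.
needle = np.array([[45.0, 400.0]])

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)
print(model.predict(needle))   # -> [-1], i.e. flagged as an anomaly
```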

46 Garbage In, Garbage Out (GIGO) How do we pick the best data with which to train our systems, and how do we use our data to predict how well our systems will detect anomalies once we deploy them in a new data environment? Deep learning uses a massive number of unseen complex features to predict results, which enables models to fit beautifully to datasets. But it also means that if the training and testing data are even slightly biased with respect to the real-world test-case data, some of those unseen complex features will end up damaging accuracy instead of bolstering it. Even with great labels and a lot of data, if the data we use to train our deep learning models doesn't mimic the data they will eventually be tested on in deployment, our models are liable to fail. Unfortunately, it's impossible to train on future data, and often we don't have access to data that even mimics past deployment data. But it is quite possible to simulate the errors we expect upon deployment by analyzing the sensitivity of our models to differences between training and testing data. By doing this, we can better develop training datasets and model configurations that are most likely to perform reliably well in deployment. More information: Purportedly-Great-ML-Models-Can-Be-Screwed-Up-By-Bad-Data-wp.pdf
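
A minimal sketch of that sensitivity analysis, assuming a synthetic two-class task where the test-time feature distribution is shifted progressively away from the training distribution:

```python
# Train on one distribution, then measure how accuracy degrades as the test
# distribution drifts. The task and drift amounts are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

def make_data(n, shift=0.0):
    """Two-class problem; `shift` moves the feature distribution at test time."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0 + shift, scale=1.0, size=(n, 2))
    return X, y

X_train, y_train = make_data(2000)
model = LogisticRegression().fit(X_train, y_train)

for shift in (0.0, 0.5, 1.0, 2.0):
    X_test, y_test = make_data(2000, shift=shift)
    print(f"shift={shift}: accuracy={model.score(X_test, y_test):.2f}")
```

Accuracy falls as the shift grows, which is exactly the failure mode the slide warns about: a model that looks great on held-out data from the training distribution can quietly degrade once deployment data drifts.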

47 Mobile Challenges: Security, Performance, User Experience Source:

48 Q&A Contact Information Robert H Murray, Lead Database Services - Oracle robert.murray@inl.gov (208)