Introduction to Streaming Analytics

Guido Schmutz (@gschmutz)
Trivadis locations: Basel, Bern, Brugg, Düsseldorf, Frankfurt a.M., Freiburg i.Br., Geneva, Hamburg, Copenhagen, Lausanne, Munich, Stuttgart, Vienna, Zurich

Guido Schmutz
- Working for Trivadis for more than 19 years
- Oracle ACE Director for Fusion Middleware and SOA
- Co-author of several books
- Consultant, Trainer, Software Architect for Java, SOA & Big Data / Fast Data
- Member of Trivadis Architecture Board
- Technology Manager @ Trivadis
- More than 25 years of software development experience
- Contact: guido.schmutz@trivadis.com
- Blog: http://guidoschmutz.wordpress.com
- Slideshare: http://www.slideshare.net/gschmutz
- Twitter: gschmutz

Our company
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:
OPERATION: Trivadis Services takes over the interacting operation of your IT systems.

With over 600 specialists and IT experts in your region
- 14 Trivadis branches and more than 600 employees
- 200 Service Level Agreements
- Over 4,000 training participants
- Research and development budget: CHF 5.0 million
- Financially self-supporting and sustainably profitable
- Experience from more than 1,900 projects per year at over 800 customers

Technology on its own won't help you. You need to know how to use it properly.

Agenda
1. Introduction / Motivation
2. Concepts (1-11)
3. Stream Processing Platforms

Introduction / Motivation

Traditional Data Processing - Challenges
- Introduces too much decision latency
- Responses are delivered after the fact
- Maximum value of the identified situation is lost
- Decisions are made on old and stale data
- Data at rest

Customer 360 View: Flow with Big Data Hub
[Diagram. Data sources: Billing & Ordering, CRM / Profile, Marketing Campaigns, Location, Social, Click stream, Mobile Apps, Weather Data, Call Center, Sensor Data. Data flow: Event Hub into a Hadoop cluster (Distributed Filesystem, Parallel Processing, NoSQL, SQL, Search, Machine Learning, Graph Algorithms, Natural Language Processing). Other elements: Enterprise Data Warehouse, BI Tools, Search, Online & Mobile Apps.]

The New Era: Streaming Data Analytics / Fast Data
- Events are analyzed and processed in real time as they arrive
- Decisions are timely, contextual and based on fresh data
- Decision latency is eliminated
- Data in motion

Customer Event Hub: taking Velocity into account
[Diagram. Data sources: Billing & Ordering, CRM / Profile, Marketing Campaigns, Location, Social, Click stream, Mobile Apps, Weather Data, Call Center, Sensor Data. Batch path: File Import / SQL Import and Event Hub into a Hadoop cluster (Distributed Filesystem, Parallel Processing, NoSQL, SQL, Search, Batch Analytics). Streaming path: Event Hub into Streaming Analytics / Stream Analytics with NoSQL, Reference / Models and Dashboard. Other elements: Enterprise Data Warehouse, BI Tools, Search, Online & Mobile Apps.]

When to Stream / When not?
- Real-Time: constant low latency, milliseconds and under
- Near-Real-Time: low milliseconds to seconds, delay in case of failures
- Batch: 10s of seconds or more, re-run in case of failures

No free lunch
- Real-Time: constant low latency, milliseconds and under
- Near-Real-Time: low milliseconds to seconds, delay in case of failures
- Batch: 10s of seconds or more, re-run in case of failures
Towards Real-Time: difficult architectures, lower latency
Towards Batch: easier architectures, higher latency

Real Time Analytics Use Cases
- Algorithmic Trading
- Online Fraud Detection
- Geo Fencing
- Proximity/Location Tracking
- Intrusion detection systems
- Traffic Management
- Recommendations
- Churn detection
- Internet of Things (IoT) / Intelligence Sensors
- Social Media/Data Analytics
- Gaming Data Feed

Concepts (1-11)

1) Categories of Stream/Event Processing
- Simple Event Processing (SEP)
- Event Stream Processing (ESP)

1) Categories of Stream/Event Processing
- Complex Event Processing (CEP)

2) Streaming Model - Continuous Streaming
[Diagram: several event sources feed an ingestion layer; each individual event is passed through a chain of processing steps (P).]
- Events are processed as they arrive
- Low latency
- Less throughput
- Fault tolerance is expensive

2) Streaming Model - Micro-Batch
[Diagram: several event sources feed an ingestion layer; events are grouped into small batches before they reach the processing steps (P).]
- Splits the incoming stream into small batches
- Could provide high(er) throughput
- Fault tolerance is easier to achieve
- Higher latency
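
The difference between the two streaming models can be sketched in plain Python. This is only an illustration, not the API of any platform discussed here: `micro_batches` and `batch_size` are hypothetical names, and a real engine would typically cut batches by time interval rather than by event count.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def continuous(events: Iterable[int]) -> Iterator[int]:
    """Continuous model: every event is handed on individually, as it arrives."""
    for event in events:
        yield event


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch model: split the incoming stream into small batches.

    batch_size is a hypothetical, count-based stand-in for a batch interval.
    """
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch  # downstream work is triggered once per batch


if __name__ == "__main__":
    print(list(continuous(range(7))))
    print(list(micro_batches(range(7), 3)))
```

The batch variant pays for its simpler recovery unit (a whole batch can be re-run) with latency: the first event of a batch waits until the batch is complete.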

3) Delivery Guarantees
- At most once (fire-and-forget) [0-1]: the message is sent, but the sender doesn't care if it's received or lost. Imposes no additional overhead to ensure message delivery, such as requiring acknowledgments from consumers. Hence, it is the easiest and most performant behavior to support.
- At least once [1+]: retransmission of a message will occur until an acknowledgment is received. Since a delayed acknowledgment from the receiver could be in flight when the sender retransmits the message, the message may be received one or more times.
- Exactly once [1]: ensures that a message is received once and only once, and is never lost and never repeated. The system must implement whatever mechanisms are required to ensure that a message is received and processed just once.
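
A minimal Python sketch of why at-least-once delivery produces duplicates, and how a receiver that remembers message ids can still process each message only once. All names (`Receiver`, `send_at_least_once`, `lost_acks`) are hypothetical, and the model is simplified: only acknowledgments get lost, never the message itself.

```python
from typing import List, Set


class Receiver:
    """Records every delivery; optionally drops duplicates by message id."""

    def __init__(self, deduplicate: bool) -> None:
        self.deduplicate = deduplicate
        self.seen: Set[int] = set()
        self.processed: List[str] = []

    def deliver(self, msg_id: int, payload: str) -> None:
        if self.deduplicate and msg_id in self.seen:
            return  # duplicate caused by a retransmission: ignore it
        self.seen.add(msg_id)
        self.processed.append(payload)


def send_at_least_once(receiver: Receiver, msg_id: int, payload: str,
                       lost_acks: int) -> int:
    """Retransmit until an acknowledgment gets through.

    lost_acks simulates how many acknowledgments are lost in flight.
    Returns the number of transmissions that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        receiver.deliver(msg_id, payload)
        if attempts > lost_acks:  # this acknowledgment reaches the sender
            return attempts


if __name__ == "__main__":
    plain = Receiver(deduplicate=False)
    send_at_least_once(plain, 1, "payment", lost_acks=2)
    print(plain.processed)   # the same payload, delivered three times

    dedup = Receiver(deduplicate=True)
    send_at_least_once(dedup, 1, "payment", lost_acks=2)
    print(dedup.processed)   # processed once despite three transmissions
```

Deduplication by id is only one possible mechanism; the slide leaves open how a platform achieves exactly-once behavior.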

4) API
- Compositional / Programmatic: highly customizable operators based on basic building blocks; manual topology definition and optimization
- Declarative: high-level, fluent API; higher-order functions as operators (filter, mapWithState, ...); logical plan optimization
- Statistical: data scientist friendly; dynamic typing
- SQL: query language allowing the use of a stream in the FROM clause; extensions supporting windowing, pattern matching, spatial, ... operators

5) Windowing
- Due to the size and never-ending nature of a stream, it is not feasible to keep the entire stream of data in memory
- Computations over multiple events can be done using windows of data
- Sliding Window: uses eviction and trigger policies that are based on time (window length and sliding interval length)
- Tumbling Window: the eviction policy is always based on the window being full, and the trigger policy is based on either the count of items in the window or time
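
The two window types can be sketched in plain Python. This is an illustrative sketch under assumptions, not a platform API: the function names are hypothetical, events are `(timestamp, value)` pairs sorted by timestamp, time starts at 0, and the sliding variant works on a finished list for brevity where a real engine would evaluate incrementally.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, value)


def tumbling_count_windows(events: Iterable[float], size: int) -> Iterator[List[float]]:
    """Tumbling window: trigger when the window is full, then evict everything."""
    window: List[float] = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield window
            window = []  # windows never overlap
    # a trailing, not yet full window is not emitted


def sliding_time_windows(events: List[Event], length: int,
                         slide: int) -> Iterator[Tuple[int, List[float]]]:
    """Sliding window: every `slide` seconds, emit the values of the last
    `length` seconds as (window_end, values). Windows overlap if slide < length.
    """
    if not events:
        return
    last_ts = events[-1][0]
    end = slide
    while end - slide <= last_ts:
        values = [v for ts, v in events if end - length <= ts < end]
        yield end, values
        end += slide


if __name__ == "__main__":
    print(list(tumbling_count_windows([1, 2, 3, 4, 5], size=2)))
    readings = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
    for window_end, values in sliding_time_windows(readings, length=2, slide=1):
        print(window_end, values)
```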

6) Pattern Matching
Streaming data often contains interesting patterns that only emerge as new streaming data arrives, e.g.
- Absence Pattern: event A not followed by event B within a time window
- Sequence Pattern: event A followed by event B followed by event C
- Increasing Pattern: upward trend of the value of a certain attribute
- Decreasing Pattern: downward trend of the value of a certain attribute
Pattern operators allow developers to define complex relationships between streaming events
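
As an illustration of the absence pattern, the following Python sketch reports every event A that is not followed by an event B for the same key within a time window. It is a hand-written sketch, not the pattern operator of any platform: the function name, the event layout `(timestamp, key, type)` and the example event types are assumptions, and a repeated A for the same key simply restarts the window.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (timestamp, key, event type)


def absence_pattern(events: Iterable[Event], first: str, then: str,
                    within: int) -> List[Tuple[str, int]]:
    """Return (key, timestamp of `first`) for each `first` event that was not
    followed by a `then` event for the same key within `within` time units.

    Time only advances when a new event arrives, so windows still open at
    the end of the input are not reported.
    """
    pending: Dict[str, int] = {}  # key -> timestamp of the unanswered `first`
    alerts: List[Tuple[str, int]] = []
    for ts, key, kind in events:
        # Expire every pending entry whose window has elapsed by now.
        for pending_key, started in list(pending.items()):
            if ts - started > within:
                alerts.append((pending_key, started))
                del pending[pending_key]
        if kind == first:
            pending[key] = ts
        elif kind == then:
            pending.pop(key, None)  # answered in time (or nothing pending)
    return alerts


if __name__ == "__main__":
    stream = [
        (0, "o1", "ordered"),
        (1, "o2", "ordered"),
        (3, "o1", "paid"),
        (12, "o3", "ordered"),
    ]
    print(absence_pattern(stream, first="ordered", then="paid", within=10))
```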

7) Back Pressure
- Backpressure refers to the situation where a system is receiving data at a higher rate than it can process
- Backpressure, if not dealt with correctly, can lead to exhaustion of resources or even, in the worst case, data loss
- A slow consumer should slow down the producer
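
One simple way to let a slow consumer slow down the producer is a bounded buffer between the two. The sketch below uses Python's standard `queue.Queue`, whose `put()` blocks while the queue is full; `run_pipeline`, the buffer size and the simulated processing delay are assumptions for illustration only.

```python
import queue
import threading
import time
from typing import List, Tuple


def run_pipeline(n_events: int, buffer_size: int) -> Tuple[List[int], int]:
    """Run a fast producer against a slow consumer through a bounded buffer.

    Returns the consumed events and the largest buffer depth observed.
    """
    buffer: "queue.Queue" = queue.Queue(maxsize=buffer_size)
    consumed: List[int] = []
    max_depth = 0

    def producer() -> None:
        for i in range(n_events):
            buffer.put(i)  # blocks while the buffer is full: the back pressure
        buffer.put(None)   # sentinel: no more events

    def consumer() -> None:
        nonlocal max_depth
        while True:
            max_depth = max(max_depth, buffer.qsize())
            item = buffer.get()
            if item is None:
                break
            time.sleep(0.001)  # simulate slow processing
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return consumed, max_depth


if __name__ == "__main__":
    events, depth = run_pipeline(n_events=50, buffer_size=5)
    print(len(events), "events consumed, buffer never deeper than", depth)
```

Because the buffer is bounded, memory use stays constant and no event is dropped; the cost is that the producer runs only as fast as the consumer.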

Other Concepts
7) Scalability: scale-up and/or scale-out
8) Resource Management: proprietary server, YARN, Mesos, ...
7) Auto-Scaling: does the platform support scale-up and scale-down without stopping?
8) In-Flight Modifications: are modifications of a running topology supported without redeployment?
9) Contributors: how many active contributors?
10) Language Options: what languages can be used?
11) Self-Service: business user friendliness

Stream Processing Platforms

Stream Processing & Analytics
[Diagram: platforms arranged by category (Simple Event Processing, Event Stream Processing, Complex Event Processing) and by license (Open Source, Closed Source).]

Oracle Stream Analytics
What it does
- Compelling, friendly and visually stunning real-time streaming analytics user experience for business users to dynamically create and implement Instant Insight solutions
Key Features
- Analyze simulated or live data feeds to determine event patterns, correlation, aggregation & filtering
- Pattern library for industry-specific solutions
- Streams, References, Maps & Explorations
Benefits
- Accelerated delivery time
- Hides all challenges & complexities of the underlying real-time event-driven infrastructure

Oracle Stream Analytics
- 2007: BEA WebLogic Event Server, Oracle CQL
- 2008: Oracle Complex Event Processing (OCEP)
- 2012: Oracle Event Processing (OEP)
- 2013-2015: Oracle Stream Explorer (SX), Oracle Event Processing for Java Embedded, Oracle IoT Cloud Service, Oracle Edge Analytics (OEA)
- 2016: Oracle Stream Analytics (OSA)

Oracle Stream Analytics
[Diagram: from a "sea of data" at the computing edge (fog, devices / gateways) to "Action!" in the enterprise.
Edge Analytics (OEA): filtering, correlation, aggregation, pattern matching. Result: macro-events that are high-value, actionable and in-context.
Capability list 1: High Volume Continuous Streaming, Extreme Low Latency, Disparate Sources, Temporal Processing, Pattern Matching, Machine Learning.
Capability list 2: High Volume Continuous Streaming, Sub-Millisecond Latency, Disparate Sources, Time-Window Processing, Pattern Matching, High Availability / Scalability, Coherence Integration, Geospatial / Geofencing, Big Data Integration, Business Event Visualization.
Other labels: Services, Stream Analytics.]

Oracle Stream Analytics

Oracle Stream Analytics
Category: CEP
Scalability: Scale-up
Streaming Model: Continuous Streaming
Delivery Guarantees: At Least Once
Resource Management: OEP server, Spark Streaming
Auto-Scaling: no
API: Declarative/SQL
Windowing: Yes
Pattern detection: Yes
Back Pressure: No
In-Flight modifications: no
Contributors: n.a.
Language Options: Java, CQL
Self-service: Yes

Apache Spark
- Apache Spark is a fast and general engine for large-scale data processing
- The hot trend in Big Data!
- Originally developed in 2009 in UC Berkeley's AMPLab
- Based on the 2007 Microsoft Dryad paper
- Written in Scala; supports Java, Python, SQL and R
- Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
- Open sourced in 2010; since 2014 part of the Apache Software Foundation

Apache Spark Stack
- Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib / SparkR (Machine Learning), GraphX (Graph Processing)
- Core Runtime: Spark Core API and Execution Model
- Cluster Resource Managers: Spark Standalone, Mesos, YARN
- Data Stores: HDFS, Elasticsearch, NoSQL, S3

Apache Spark - Batch vs. Real-Time Processing Petabytes of Data Gigabytes Per Second

Apache Spark - Resilient Distributed Dataset (RDD)
[Diagram: an RDD is created from an input source (file, database, stream, collection) and holds the data; calling .count() on it returns 100 in the example.]

Apache Spark - Workflow
[Diagram:
- Input: HDFS file, read with sc.hadoopFile() into a HadoopRDD
- Transformations (lazy): flatMap() and map(), each producing a MappedRDD; reduceByKey(), producing a ShuffledRDD
- Action (executes the transformations): saveAsTextFile(), writing the text file output
- The DAG Scheduler on the master groups the operations into stages (flatMap() + map() in one stage, reduceByKey() in a separate stage) that run on partitions P0, P1, P3]
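
The lazy behavior of transformations can be imitated with plain Python generators. This is an analogy only, not Spark code: the function names merely mirror the RDD operations in the workflow above, and nothing here is distributed or partitioned.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

evaluated: List[str] = []  # records which lines were actually read


def flat_map_words(lines: Iterable[str]) -> Iterator[str]:
    """Lazy, like flatMap(): one line in, many words out."""
    for line in lines:
        evaluated.append(line)
        yield from line.split()


def map_to_pairs(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Lazy, like map(): one word in, one (word, 1) pair out."""
    for word in words:
        yield word, 1


def reduce_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the values per key. This function consumes the generators and
    therefore triggers all the work defined before it."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    pipeline = map_to_pairs(flat_map_words(["a b a", "b c"]))
    print(evaluated)                # still empty: nothing has run yet
    print(reduce_by_key(pipeline))  # now the whole pipeline executes
```

In this sketch `reduce_by_key` plays the role of the action because it drains the generators. In Spark itself, reduceByKey() is still a lazy transformation, and only an action such as saveAsTextFile() starts the job.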

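Apache Spark Streaming
[Diagram: the input stream is cut into batches of X seconds, forming a DStream; a transformation such as .countByValue(), .reduceByKey(), .join or .map turns one DStream into another DStream.]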

Apache Spark Streaming - DStream Transformation Lineage
[Diagram: time increases from time 1 to time n. The input stream of events forms a DStream that holds one RDD per time step (RDD @time 1 ... RDD @time n), each containing message 1 ... message n. Applying map() yields a MappedDStream whose RDDs contain f(message 1) ... f(message n). Actions such as saveAsHadoopFiles() trigger Spark jobs that write result 1 ... result n for each time step.]
Adapted from Chris Fregly: http://slidesha.re/11pp7fv

Apache Spark Streaming

Spark Streaming
Category: ESP
Scalability: Scale-out
Streaming Model: Continuous Streaming
Delivery Guarantees: Exactly Once
Resource Management: YARN, Mesos, Standalone
Auto-Scaling: yes
API: Declarative/SQL
Windowing: Yes
Pattern detection: No
Back Pressure: Yes
In-Flight modifications: no
Contributors: > 280 contributors
Language Options: Java, Scala, Python
Self-service: No

Apache Storm
- A platform for doing analysis on streams of data as they come in, so you can react to data as it happens
- Highly distributed real-time computation system
- Provides general primitives to do real-time computation
- Simplifies working with queues & workers
- Scalable and fault-tolerant
- Originated at BackType, acquired by Twitter in 2011
- Open sourced late 2011
- Part of Apache since September 2013

Apache Storm
- Tuple: immutable set of key/value pairs
- Stream: an unbounded sequence of tuples that can be processed in parallel by Storm
- Topology: wires data and functions via a DAG (directed acyclic graph); executes on many machines, similar to an MR job in Hadoop
- Spout: source of data streams (tuples); can be run in reliable and unreliable mode
- Bolt: consumes 1+ streams and produces new streams; complex operations often require multiple steps and thus multiple bolts
[Diagram: two spouts emit streams A and B as tuples (T). One bolt subscribes to A and emits C; one bolt subscribes to A and emits D; one bolt subscribes to A & B and emits nothing; one bolt subscribes to C & D and emits nothing.]
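
How spouts and bolts are wired into a topology can be sketched with Python generators. This is a conceptual sketch only and does not use Storm's API; the sensor readings, the threshold and all names are made up for illustration, and the "topology" is a simple chain rather than a general DAG.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """Tuple: an immutable set of key/value pairs."""
    sensor: str
    value: float


def sensor_spout() -> Iterator[Reading]:
    """Spout: source of the stream (a fixed list here instead of a live feed)."""
    yield from [Reading("s1", 20.5), Reading("s2", 99.0), Reading("s1", 101.2)]


def threshold_bolt(stream: Iterable[Reading], limit: float) -> Iterator[Reading]:
    """Bolt: consumes one stream and emits a new stream of readings above limit."""
    for reading in stream:
        if reading.value > limit:
            yield reading


def count_bolt(stream: Iterable[Reading]) -> Dict[str, int]:
    """Terminal bolt: consumes a stream, emits nothing, keeps a count per sensor."""
    counts: Dict[str, int] = {}
    for reading in stream:
        counts[reading.sensor] = counts.get(reading.sensor, 0) + 1
    return counts


def run_topology(limit: float = 90.0) -> Dict[str, int]:
    """Topology: wires the spout and the bolts together."""
    return count_bolt(threshold_bolt(sensor_spout(), limit))


if __name__ == "__main__":
    print(run_topology())
```

Splitting the work into a filtering bolt and a counting bolt mirrors the point made above: complex operations are expressed as several simple steps.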

Apache Storm

Apache Storm
Category: ESP
Scalability: Scale-out
Streaming Model: Event-Streaming
Delivery Guarantees: At most once / At least once
Resource Management: Storm Cluster, YARN
Auto-Scaling: no
API: Compositional
Windowing: No
Pattern detection: No
Back Pressure: Yes
In-Flight modifications: partial
Contributors: > 100 contributors
Language Options: Java, Clojure, Scala, ...
Self-service: No

Summary

Comparison

                         Oracle Stream Analytics       Spark Streaming            Apache Storm
Category                 CEP                           ESP                        ESP
Streaming Model          Continuous Streaming          Continuous Streaming       Event-Streaming
Delivery Guarantees      At Least Once                 Exactly Once               At most once / At least once
API                      Declarative/SQL               Declarative/SQL            Compositional
Windowing                Yes                           Yes                        No
Pattern detection        Yes                           No                         No
Back Pressure            No                            Yes                        Yes
Scalability              Scale-up                      Scale-out                  Scale-out
Resource Management      OEP server, Spark Streaming   YARN, Mesos, Standalone    Storm Cluster, YARN
Auto-Scaling             no                            yes                        no
In-Flight modifications  no                            no                         partial
Contributors             n.a.                          > 280 contributors         > 100 contributors
Language Options         Java, CQL                     Java, Scala, Python        Java, Clojure, Scala, ...
Self-service             Yes                           No                         No

Summary
- More and more use cases (such as IoT) make Streaming Analytics necessary
- Treat events as events! Infrastructures for handling lots of events are available!
- Platforms such as Oracle Stream Analytics enable the business to work directly on streaming data (empower the business analyst) => user experience of an Excel sheet on streaming data
- Platforms such as Apache Storm and Apache Spark Streaming provide a highly scalable and fault-tolerant infrastructure for streaming analytics => Oracle Stream Analytics can use Spark Streaming as the runtime infrastructure
- Platforms such as Kafka provide a high-volume event broker infrastructure, a.k.a. Event Hub

Trivadis @ DOAG 2016
Booth: 3rd floor, next to the escalator
Know-how, T-shirts, contest and Trivadis Power to go
We look forward to your visit, because with Trivadis you always win!