Introduction to Stream Processing


1 Introduction to Stream Processing. Guido Schmutz, DOAG Big Data 2018. guidoschmutz.wordpress.com. Trivadis locations: BASEL, BERN, BRUGG, DÜSSELDORF, FRANKFURT A.M., FREIBURG I.BR., GENF, HAMBURG, KOPENHAGEN, LAUSANNE, MÜNCHEN, STUTTGART, WIEN, ZÜRICH

2 Guido Schmutz. Working at Trivadis for more than 21 years. Oracle ACE Director for Fusion Middleware and SOA. Consultant, trainer, and software architect for Java, Oracle, SOA and Big Data / Fast Data. Head of Trivadis Architecture Board. More than 30 years of software development experience. Contact: guido.schmutz@trivadis.com Blog: Slideshare: Twitter: gschmutz

3 Agenda 1. Motivation for Stream Processing? 2. Capabilities for Stream Processing 3. Implementing Stream Processing Solutions 4. Summary

4 Motivation for Stream Processing?

5 Why not use Traditional BI Infrastructures? [Architecture diagram: bulk sources are loaded via high-latency ETL / stored procedures and file extracts into an Enterprise Data Warehouse, which serves BI tools, search/explore, and enterprise apps through APIs]

6 Big Data solves Volume and Variety, not Velocity. [Architecture diagram: bulk sources are moved by file import / SQL import into a Hadoop cluster Big Data platform with raw and refined storage and parallel processing; results flow to the Enterprise Data Warehouse, BI tools, SQL, and search/explore for enterprise apps, still with high latency]

7 Big Data solves Volume and Variety, not Velocity. [Architecture diagram: the same Big Data platform, now also fed by event sources such as mobile apps, IoT data, location, social, and telemetry, still with high latency]

8 Big Data solves Volume and Variety, not Velocity. [Architecture diagram: event sources reach the platform via event hubs; parallel processing now includes machine learning, graph algorithms, and natural language processing, but the path to results is still high latency]

9 Data Value Chain
- Milliseconds: place trade, serve ad, enrich, approve transaction
- Hundredths of seconds to second(s): calculate risk, retrieve clicks, leaderboard, aggregate, show orders, count
- Minutes: backtest algo, BI, daily reports
- Hours: algo discovery, log analysis, fraud pattern match

10 "Data at Rest" vs. "Data in Motion". Data at Rest: Store -> Analyze -> Act. Data in Motion: Analyze -> Act -> Store.

11 Typical Stream Processing Use Cases
- Notifications and Alerting: a notification or alert is triggered if some sort of event, or series of events, occurs
- Real-Time Reporting: run real-time dashboards that employees/customers can look at
- Incremental ETL: still ETL, but in a streaming, continuous mode rather than in batch
- Update data to serve in real time: compute data that is served interactively by other applications
- Real-Time Decision Making: analyze new inputs and respond to them automatically using business logic, e.g. fraud detection
- Online Machine Learning: train a model on a combination of historical and streaming data and use it for real-time decision making

12 When to / When not? Real-Time: constant low milliseconds and under. Near-Real-Time: low milliseconds to seconds, delay in case of failures. Batch: 10s of seconds or more, re-run in case of failures. Source: adapted from Cloudera

13 "No free lunch". Real-Time (constant low milliseconds and under) and Near-Real-Time (low milliseconds to seconds, delay in case of failures) mean "difficult" architectures with lower latency; Batch (10s of seconds or more, re-run in case of failures) means "easier" architectures with higher latency.

14 Stream Processing Architecture solves Velocity. [Architecture diagram: event sources (mobile apps, IoT data, location, social, telemetry) feed event hubs into a stream analytics platform that delivers low(est)-latency results, dashboards, search, and reference/models, but keeps no history; bulk sources still feed the Enterprise Data Warehouse and BI tools via file extracts]

15 Big Data for all historical data analysis. [Architecture diagram: the event hub also feeds a data flow into the Hadoop cluster Big Data platform (raw and refined storage, parallel processing), so the same events are available for historical analysis alongside the stream analytics platform]

16 Integrate existing systems through CDC. Capture changes directly on the database of a traditional silo-based system (data store, logic, user interface) with a CDC connector. Change Data Capture (CDC): think of it like a global database trigger. It transforms existing systems into event producers that publish to the event hub for consuming systems.

17 Integrate existing systems with lower latency through CDC. [Architecture diagram: CDC feeds changes from existing systems into the event hub, from where a data flow delivers them to both the Big Data platform and the stream analytics platform with lower latency than bulk extracts]

18 New systems participate in an event-oriented fashion. [Architecture diagram: a microservice platform with stateful microservices and stream processors produces and consumes events via the event hub, alongside the Big Data platform, the stream analytics platform, and enterprise apps exposing APIs]

19 Edge computing allows processing close to the data sources. [Architecture diagram: edge nodes with their own processor, storage, and rules handle events near mobile apps and IoT devices and forward them to the central event hub, which feeds the stream analytics, Big Data, and microservice platforms]

20 Unified Architecture for Modern Data Analytics Solutions. [Architecture diagram: the complete picture combining bulk sources, the Enterprise Data Warehouse, the Hadoop cluster Big Data platform, event sources, edge nodes, event hubs, the stream analytics platform, and a microservice platform, all exposing results through BI tools, search/explore, dashboards, and APIs]

21 Two Types of Stream Processing (from Gartner)
- Stream Data Integration: primarily focuses on the ingestion and processing of data sources, targeting real-time extract-transform-load (ETL) and data integration use cases; filters and enriches the data and optionally calculates time-windowed aggregations before storing the results in a database or file system
- Stream Analytics: targets analytics use cases, calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events); complex events may signify threats or opportunities that require a response from the business through real-time dashboards, alerts or decision automation

22 Important Capabilities for Stream Processing

23 Streaming Model: Event-at-a-time vs. Micro-Batch
- Event-at-a-time processing: events are processed as they arrive; low latency; fault tolerance is expensive
- Micro-batch processing: splits the incoming stream into small batches; fault tolerance is easier; better throughput
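To make the contrast concrete, here is a minimal Python sketch of the two models; it is illustrative only and not tied to any particular engine (the function names are invented for this example):

```python
def event_at_a_time(stream, handle):
    """Process each event the moment it arrives (lowest latency)."""
    for event in stream:
        handle(event)

def micro_batches(stream, batch_size):
    """Group incoming events into small batches (easier fault tolerance,
    better throughput, at the cost of some latency)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = [1, 2, 3, 4, 5]
print(list(micro_batches(events, 2)))  # [[1, 2], [3, 4], [5]]
```

A micro-batch engine can checkpoint once per batch instead of once per event, which is why fault tolerance becomes cheaper as batches grow.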

24 Fault Tolerance: Delivery Guarantees
- [ 0 | 1 ] At most once (fire-and-forget): the message is sent, but the sender doesn't care if it's received or lost. Imposes no additional overhead to ensure message delivery, such as requiring acknowledgments from consumers. Hence, it is the easiest and most performant behavior to support.
- [ 1+ ] At least once: retransmission of a message will occur until an acknowledgment is received. Since a delayed acknowledgment from the receiver could be in flight when the sender retransmits the message, the message may be received one or more times.
- [ 1 ] Exactly once: ensures that a message is received once and only once, and is never lost and never repeated. The system must implement whatever mechanisms are required to ensure that a message is received and processed just once.
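The at-least-once case can be simulated in a few lines: the sender retransmits until an acknowledgment arrives, and a lost ack produces a duplicate delivery. This is a toy simulation (the `ack_loss_rate` parameter and randomness are assumptions of the sketch, not any real protocol):

```python
import random

random.seed(7)  # make the simulation deterministic

def deliver_at_least_once(message, receive, ack_loss_rate=0.5):
    """Retransmit until an ack comes back; every lost ack causes a duplicate."""
    while True:
        receive(message)                     # message arrives at the consumer
        if random.random() > ack_loss_rate:  # the ack made it back to the sender
            return

received = []
deliver_at_least_once({"id": 1, "payload": "order"}, received.append)

# The consumer may see the message more than once; deduplicating by a
# message id on the consumer side is one common way to approximate
# exactly-once semantics on top of at-least-once delivery.
unique_ids = {m["id"] for m in received}
print(len(received) >= 1, len(unique_ids))  # True 1
```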

25 API
- Compositional / Programmatic: highly customizable operators based on basic building blocks; manual topology definition and optimization
- Declarative: high-level, fluent API; higher-order functions as operators (filter, mapWithState, ...); logical plan optimization
- SQL: query language allowing a stream to be used in the FROM clause, e.g. SELECT * FROM <stream> WHERE f1 = 10; extensions supporting windowing, pattern matching, spatial operators, ...
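As a sketch of what the declarative, fluent style looks like, here is a toy `Stream` class in Python (the class and its methods are invented for illustration; real fluent APIs such as those in streaming frameworks additionally optimize the logical plan before executing it):

```python
class Stream:
    """Toy fluent streaming API: describe *what* to compute, not *how*."""
    def __init__(self, events):
        self._events = iter(events)
    def filter(self, pred):
        return Stream(e for e in self._events if pred(e))
    def map(self, fn):
        return Stream(fn(e) for e in self._events)
    def to_list(self):
        return list(self._events)

# Equivalent to the SQL example: SELECT * FROM <stream> WHERE f1 = 10
result = (Stream([{"f1": 10, "v": 1}, {"f1": 3, "v": 2}, {"f1": 10, "v": 5}])
          .filter(lambda e: e["f1"] == 10)
          .map(lambda e: e["v"])
          .to_list())
print(result)  # [1, 5]
```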

26 Windowing. Computations over events are done using windows of data. Due to the size and never-ending nature of a stream, it is not feasible to keep the entire stream of data in memory. A window of data represents a certain amount of data on which we can perform computations. Windows give us the power to keep a working memory and look back at recent data efficiently.

27 Windowing
- Fixed Window (aka Tumbling Window): eviction policy always based on the window being full, and trigger policy based on either the count of items in the window or time
- Sliding Window (aka Hopping Window): uses eviction and trigger policies that are based on time: window length and sliding interval length
- Session Window: composed of sequences of temporally related events terminated by a gap of inactivity greater than some timeout
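A fixed (tumbling) window can be sketched by bucketing each event's timestamp into non-overlapping intervals and aggregating per bucket; this minimal Python example counts events per (window, key) and is an illustration, not a real windowing operator:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Assign each (timestamp, key) event to a fixed, non-overlapping
    window and count events per (window_start, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "a"), (3, "b"), (4, "a"), (6, "a"), (9, "b")]
print(tumbling_window_counts(events, 5))
# {(0, 'a'): 2, (0, 'b'): 1, (5, 'a'): 1, (5, 'b'): 1}
```

A sliding window would differ only in that one event can fall into several overlapping buckets.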

28 Joining. Challenges of joining streams:
- data streams need to be aligned as they come, because they have different timestamps
- since streams are never-ending, the joins must be limited, otherwise the join will never end
- the join needs to produce results continuously, as there is no end to the data
Join types: stream-to-stream join (one-window or two-window join) and stream-to-static (table) join.
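The simplest of these is the stream-to-static (lookup/enrichment) join: each streaming event is enriched from a bounded reference table, so no windowing is needed. A minimal sketch (the `customers` table and field names are made up for the example):

```python
customers = {42: "Alice", 7: "Bob"}  # static reference table (assumed data)

def enrich(order_stream, table):
    """Stream-to-static join: look up each event's key in the table and
    emit the enriched event continuously."""
    for order in order_stream:
        name = table.get(order["customer_id"], "unknown")
        yield {**order, "customer_name": name}

orders = [{"id": 1, "customer_id": 42}, {"id": 2, "customer_id": 99}]
for enriched in enrich(orders, customers):
    print(enriched)
```

A stream-to-stream join would instead buffer events from both sides within a window and match them on key and time, which is why it needs the window limits described above.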

29 Pattern Detection. Streaming data often contains interesting patterns that only emerge as new streaming data arrives, e.g.
- Absence Pattern: event A not followed by event B within a time window
- Sequence Pattern: event A followed by event B followed by event C
- Increasing Pattern: upward trend of the value of a certain attribute
- Decreasing Pattern: downward trend of the value of a certain attribute
Pattern operators allow developers to define complex relationships between streaming events.
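A two-event sequence pattern ("A followed by B within t time units") can be sketched in plain Python; the event names and the matching strategy here are assumptions of the example, not any engine's semantics:

```python
def detect_sequence(events, first, second, within):
    """Detect occurrences of `first` followed by `second` within
    `within` time units. Events are (timestamp, name) pairs."""
    matches = []
    pending = []  # timestamps of `first` events not yet matched
    for ts, name in sorted(events):
        if name == first:
            pending.append(ts)
        elif name == second:
            hits = [t0 for t0 in pending if ts - t0 <= within]
            matches.extend((t0, ts) for t0 in hits)
            pending = [t0 for t0 in pending if t0 not in hits]
    return matches

events = [(1, "login"), (2, "noise"), (4, "purchase"),
          (10, "login"), (30, "purchase")]
print(detect_sequence(events, "login", "purchase", within=5))  # [(1, 4)]
```

The absence pattern is the mirror image: raise an alert for each pending `first` whose window expires without a matching `second`.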

30 State Management. Necessary if a stream processing use case depends on previously seen data or on external data. Windowing, joining and pattern detection use state management behind the scenes; state management services can also be made available for custom state handling logic. State needs to be managed as close to the stream processor as possible. Options, ordered from low to high operational complexity and features: in-memory; local, embedded store; replicated, distributed store. Key question: how does it handle failures, i.e. if a machine crashes and some or all of the state is lost?
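The simplest option, in-memory keyed state, can be sketched as a per-key running aggregate; this toy class only illustrates the idea, while real engines additionally checkpoint or replicate the state so it survives the failures mentioned above:

```python
class KeyedCounterState:
    """Sketch of a local, in-memory state store holding one running
    aggregate per key (illustrative only; no fault tolerance)."""
    def __init__(self):
        self._state = {}  # key -> running total

    def update(self, key, value):
        """Fold a new event into the state for `key` and return the total."""
        self._state[key] = self._state.get(key, 0) + value
        return self._state[key]

    def get(self, key):
        return self._state.get(key, 0)

state = KeyedCounterState()
for key, amount in [("a", 10), ("b", 5), ("a", 7)]:
    state.update(key, amount)
print(state.get("a"), state.get("b"))  # 17 5
```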

31 Queryable State (aka Interactive Queries). Exposes the state managed by the stream analytics solution to the outside world. Allows an application to query the managed state through a query API, e.g. to visualize it in dashboards, search/explore tools, or online and mobile apps. For some scenarios, queryable state can eliminate the need for an external database to keep results.

32 Event Time vs. Ingestion Time vs. Processing Time. Event time: the time at which events actually occurred. Ingestion time / processing time: the time at which events are ingested / observed in the system. Not all use cases care about event time, but many do. Examples: characterizing user behavior over time, most billing applications, anomaly detection.
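The difference matters for windowing: a late-arriving event lands in a different window depending on which notion of time you use. A minimal sketch with made-up timestamps (the events and the 10-unit window size are assumptions of the example):

```python
# Each event carries (event_time, arrival_time). The event with
# event_time=9 is late: it occurred first but arrives last.
events = [(1, 100), (12, 101), (9, 102)]

def window_of(ts, size=10):
    """Start of the fixed window that contains timestamp `ts`."""
    return (ts // size) * size

by_event_time = {}
by_processing_time = {}
for event_time, arrival_time in events:
    w = window_of(event_time)
    by_event_time[w] = by_event_time.get(w, 0) + 1
    w = window_of(arrival_time)
    by_processing_time[w] = by_processing_time.get(w, 0) + 1

# Event-time windowing puts the late event where it belongs;
# processing-time windowing lumps everything into the arrival window.
print(by_event_time)       # {0: 2, 10: 1}
print(by_processing_time)  # {100: 3}
```

This is why billing or user-behavior analyses, which care about when things actually happened, need event-time semantics.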

33 Capabilities: Stream Data Integration vs. Stream Analytics
Capability | Stream Data Integration | Stream Analytics
Support for various data sources | yes | (yes)
Streaming ETL (transformation, routing, validation) | yes | (yes)
Format translation | yes | (yes)
Micro-batching | yes | (yes)
Auto scaling | yes | -
Backpressure support | yes | -
Processing time vs. event time | - | yes
Stream-to-static joins (lookup/enrichment) | yes | yes
Stream-to-stream joins | - | yes
Windowing | - | yes
State management | - | yes
Queryable state (aka interactive queries) | - | yes
Streaming SQL | (yes) | yes
Pattern matching / detection | - | yes

34 Implementing Stream Processing Solutions

35 Stream Processing & Analytics Ecosystem. [Diagram: landscape of open-source and closed-source products across the categories Stream Analytics, Stream Data Integration, Edge, and Event Hub. Source: adapted from Tibco]

36 Summary

37 Summary. Stream processing is the solution for low latency. Event Hub, Stream Data Integration and Stream Analytics are the main building blocks in your architecture. Kafka is currently the de-facto standard for the Event Hub. Various options exist for Stream Data Integration and Stream Analytics. SQL is becoming a valid option for implementing Stream Analytics. There is still room for improvement (SQL, pattern detection, streaming machine learning).

38 Technology on its own won't help you. You need to know how to use it properly.