DOAG Big Data Days 2018 DWH Modernization


1 DOAG Big Data Days 2018 - DWH Modernization. Do I need a data lake? If yes, why? Jan Ott

2 Jan Ott
- Principal Consultant BI at Trivadis; working at Trivadis for 20 years
- Consultant, trainer and software architect for BI: DWH & Big Data
- Speaker at conferences
- More than 20 years of software development experience
- Contact:

3 Agenda
1. Initial situation at the customer
2. DWH - Big Data Architecture
3. Licenses & Knowledge
4. Summary - Do I need a data lake?

4 Initial Situation

5 Current and desired status
Current:
- 1 x load per week: full load
- 1 x load per day: CRM delta load
- Loading window getting too small...
Desired:
- 1 x per day delta load (see the sketch below)
- Streaming of some data
- Platform for the analytics team
- Methods to add publicly available data...
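To make the desired delta load concrete, here is a minimal Python sketch of high-watermark incremental extraction. Everything in it is an assumption for illustration: the crm_orders and etl_watermark tables, the modified_at column, and SQLite as a stand-in database; the talk names no implementation.

```python
# Minimal sketch of a high-watermark delta load (all names invented).
# Instead of reloading the full table, only rows changed since the
# last successful run are extracted.
import sqlite3  # stand-in for any DB-API database driver

def setup_demo(conn):
    # Invented demo schema: a source table plus an ETL control table.
    conn.executescript("""
        CREATE TABLE crm_orders (id INTEGER, customer_id INTEGER,
                                 amount REAL, modified_at TEXT);
        CREATE TABLE etl_watermark (table_name TEXT, last_modified TEXT);
        INSERT INTO etl_watermark VALUES ('crm_orders', '2018-01-01T00:00:00');
        INSERT INTO crm_orders VALUES (1, 10, 99.0, '2018-03-01T08:00:00');
        INSERT INTO crm_orders VALUES (2, 11, 42.5, '2018-03-02T09:30:00');
    """)

def delta_extract(conn):
    # Read the high watermark: the newest modification timestamp loaded so far.
    (watermark,) = conn.execute(
        "SELECT last_modified FROM etl_watermark WHERE table_name = 'crm_orders'"
    ).fetchone()
    # Pull only rows modified after the watermark (delta, not full load).
    rows = conn.execute(
        "SELECT id, customer_id, amount, modified_at FROM crm_orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (watermark,),
    ).fetchall()
    if rows:
        # Advance the watermark so the next run skips these rows.
        conn.execute(
            "UPDATE etl_watermark SET last_modified = ? WHERE table_name = 'crm_orders'",
            (rows[-1][3],),
        )
        conn.commit()
    return rows

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    setup_demo(conn)
    print(delta_extract(conn))  # first run: both demo rows
    print(delta_extract(conn))  # second run: [] - watermark has advanced
```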

6 Data Warehouse Architecture (diagram): Sources → Staging Area → Cleansing Area → Core → Data Marts → BI Platform, with ETL and metadata spanning all layers.

7 The Big Shift in Analytical Data Management
Traditional BI/DWH requirements:
- Stable and consolidated data
- DWH as single point of truth
- Business-driven analytical schema
- Assured data quality and data history preservation
- Governed and secure data to meet compliance
Emerging analytical requirements (Volume, Velocity, Variety):
- Agile to support new business demands
- Support of self-service features
- Right-time (near real-time, not batch)
- Scales to support more data, new sources and broader use cases
- Simplified modelling, quality and development
It's much more an enrichment than a substitution of the requirements!

8 Data Lake = DWH + possibilities - complexity? (diagram: ERP, CRM, event and Internet sources feeding the lake)

9 DWH - Big Data Architecture

10 Reference Architecture (diagram): Analytical Platform with automation through a template generator; metadata drives data lineage; artefacts and tracing info are generated.

11 Reference Architecture (same diagram, annotated with step 0)

12 How to do Big Data?

13 Big Data Ecosystem - many choices.

14 Reference Architecture (same diagram, annotated with steps 0 and 1 and a CONNECT link; * the DB is a logical standby)

15 Key Success Factors for a Big Data Project
1. Support from a business sponsor
2. Start with the outcome/answer first
3. Involve real users and create effective use cases
4. Define quick wins and phasing
5. Sufficient data sources
6. Choose an open technology platform
7. Identify SLAs for service operation
8. Project review

16 Big Data is still work in progress
- Choosing the right architecture is key for any (big data) project
- Big Data is still quite a young field, so there are no standard architectures that have been in use for years
- In the past few years, a few architectures have evolved and have been discussed online
- Know the use cases before choosing your architecture
- Having one or a few reference architectures can help in choosing the right components

17 StreamSets Data Collector
- Founded by ex-Cloudera and ex-Informatica employees
- Continuous, open-source, intent-driven big data ingest
- Visible, record-oriented approach fixes the combinatorial explosion
- Batch or stream processing
- Runs standalone, on a Spark cluster or on a MapReduce cluster
- IDE for pipeline development by 'civilians'
- Relatively new - first public release September 2015
- So far, the vast majority of commits are from StreamSets staff
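StreamSets pipelines are assembled in its web UI rather than written as code, but the record-oriented model (origin → processor → destination) is easy to sketch. The following Python sketch is only an analogy; the function names and the orders.csv file are invented and are not the StreamSets API.

```python
# Rough sketch of record-oriented ingest (origin -> processor -> destination),
# the model StreamSets Data Collector pipelines follow. All names here are
# illustrative, not the StreamSets API.
import csv
import json
import sys
from typing import Iterator

def origin(path: str) -> Iterator[dict]:
    # Origin: read source records one by one (batch here; a streaming
    # origin would tail a queue or socket instead of a file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def processor(records: Iterator[dict]) -> Iterator[dict]:
    # Processor: per-record transformation, e.g. type conversion.
    for rec in records:
        rec["amount"] = float(rec["amount"])  # assumes an 'amount' column
        yield rec

def destination(records: Iterator[dict]) -> None:
    # Destination: write each record, here as JSON lines to stdout.
    for rec in records:
        json.dump(rec, sys.stdout)
        sys.stdout.write("\n")

if __name__ == "__main__":
    destination(processor(origin("orders.csv")))
```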

18 Apache Avro
- Row-based data serialization system
- Uses JSON-based schemas
- Uses RPC calls to send data
- Schemas are sent during data exchange
- Integrated with many languages
- Fast binary data format, or encode with JSON

Example schema:
    {
      "namespace": "trimazon.schema.customer",
      "type": "record",
      "name": "customer",
      "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": "string"}
      ]
    }
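As an illustration of Avro's compact binary encoding, the following sketch serializes and deserializes one record against the schema above using the fastavro library; the library choice, the sample record, and the "email" field name are assumptions, not from the talk.

```python
# Minimal sketch: binary-encode one record against the Avro schema above,
# then decode it again. Uses the fastavro library (not mandated by the talk).
from io import BytesIO
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema = parse_schema({
    "namespace": "trimazon.schema.customer",
    "type": "record",
    "name": "customer",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": "string"},  # field name assumed
    ],
})

# Invented sample record.
record = {"firstname": "Jan", "lastname": "Ott", "age": 42,
          "email": "jan@example.com"}

buf = BytesIO()
schemaless_writer(buf, schema, record)    # compact binary, no embedded schema
print(f"encoded size: {buf.tell()} bytes")

buf.seek(0)
decoded = schemaless_reader(buf, schema)  # reader needs the writer's schema
assert decoded == record
```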

19 Next Generation Data Warehousing 19

20 DWH Challenges & Key Issues - Data Warehouse Automation
- Drive development performance, ensure standardization: automation of development tasks and generator-based standardization
- Close the gap between requirements, development and governance: closed-loop design & development process in one application
- Manage the change (lifecycle management): extensive version management for documentation and impact analysis
- Agility (agile data warehousing): automation enables short release cycles and sandboxing approaches
- Achieve flexibility (support for individual architecture options): a configurable generator can support real-world DWH architectures

21 Drive development performance - ensure standardization
- A model plus metadefinitions feed a generator that takes over the huge amount of recurring, monotonous development tasks
- Standards and best practices are generated consistently as database objects, mappings and data flows across the layers Source → Staging → Cleansing → DWH Core → Data Mart
- Result: substantial time and cost savings, standardization, reduced testing effort (a toy sketch of the idea follows below)
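As a toy illustration of generator-based automation (not biGENIUS or any specific tool; the metadata dictionary and naming conventions are invented), a short Python script can render standardized staging DDL from table metadata:

```python
# Toy sketch of generator-based DWH automation: render staging DDL from
# metadata. The metadata dict and naming conventions are invented for
# illustration; real generators work from much richer models.
TABLES = {
    "customer": [("customer_id", "INTEGER"), ("firstname", "VARCHAR(100)"),
                 ("lastname", "VARCHAR(100)"), ("email", "VARCHAR(255)")],
    "orders":   [("order_id", "INTEGER"), ("customer_id", "INTEGER"),
                 ("amount", "DECIMAL(12,2)")],
}

def staging_ddl(table: str, columns: list) -> str:
    # One fixed template applied to every source table = standardization.
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE stg_{table} (\n"
        f"  {cols},\n"
        "  load_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n"  # uniform audit column
        ");"
    )

if __name__ == "__main__":
    for table, columns in TABLES.items():
        print(staging_ddl(table, columns), "\n")
```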

22 Licenses & Knowledge

23 Licenses / Distributors
- Cloudera: Hadoop, HBase, Hive, Impala, YARN, ...
- Databricks: Spark, SparkR, Spark SQL, ...
- Confluent: Kafka, Kafka Connect, Kafka Streams, Schema Registry, KSQL, ...
- StreamSets
- Trivadis: biGENIUS
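Since Kafka covers the streaming requirement from the wish list, here is a minimal producer sketch using Confluent's confluent-kafka Python client; the client choice, broker address, topic name and event payload are invented for illustration.

```python
# Minimal sketch of streaming one event into Kafka with the confluent-kafka
# client (one option among several; the talk only names Confluent's stack).
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # invented broker

def delivery_report(err, msg):
    # Called once per message to confirm (or report a failed) delivery.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} [{msg.partition()}]")

event = {"customer_id": 42, "action": "page_view"}  # invented payload
producer.produce("crm-events", value=json.dumps(event).encode("utf-8"),
                 callback=delivery_report)
producer.flush()  # block until all queued messages are delivered
```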

24 Knowledge - Myths
- 'One does it all.'
- 'IT is no longer required.'
- 'The data lab / data scientist solves it all.'

25 Summary - Do I need a Data Lake?

26 Summary
Pro:
- Streaming
- Platform for data analysis
- Flexibility: different data formats, add new data quickly
- Basis to build on
- Ready for the future
- More data available: more years, higher granularity
Contra:
- Cost
- Complexity
- New knowledge required

27 Questions & Answers - Jan Ott