Architecture Optimization for the new Data Warehouse. Cloudera, Inc. All rights reserved.

1 Architecture Optimization for the new Data Warehouse. Guido Oswald

2 Use Cases
Self-Service BI: Bringing your tools and teams closer to the data with better access to all relevant data.
Advanced Analytics: Moving beyond traditional BI to predicting future outcomes to better guide your business.
Data Warehouse Optimization: Augment your current EDW with a scalable, performant analytic database.

3 Cloudera is building the next generation analytic database in order to deliver on the promise of advanced analytics and self-service BI.

4 The Traditional Analytic Database

5 Challenges with Traditional Solutions
[Diagram: stages Data Sources, Store & Process, Access Data; labels include Structured Data, Unstructured Data, Ingest, ETL, Storage #1, 2, N, EDW, Archive, Search, Serve, Optimize, Implement]
1) Limited Tooling

6 Challenges with Traditional Solutions
[Diagram: stages Data Sources, Store & Process, Access Data; labels include Structured Data, Unstructured Data, Ingest, ETL, Storage #1, 2, N, EDW, Archive, Search, Serve, Optimize, Implement]
1) Limited Tooling
2) Constrained Performance

7 Challenges with Traditional Solutions
[Diagram: stages Data Sources, Store & Process, Access Data; labels include Structured Data, Unstructured Data, Ingest, ETL, Storage #1, 2, N, EDW, Archive, Search, Serve, Optimize, Implement]
1) Limited Tooling
2) Constrained Performance
3) Bad Data Access

8 What is the impact to the business?
Limited Engine Access: Analytics remain the domain of the few with the skills to perform advanced analysis. Single approach to performing analysis. Business users must rely on overtaxed data science resources.
Poor Performance: Analytic outputs not available in time for action. Delays in query resolution. Poor dashboard and embedded application performance.
Limited Data: Only Structured Data. Limited to DW capacity. Limited data results in bad model building. Increased agility (schema on write).
Increased Cost: Data Warehouse space at a premium. Ensuring performance demands specialized architecture. Redundant environments based on tooling needs.

9 The Rise of Hadoop
[Chart series: Hadoop, Data Warehouse]
Source: trends.google.com

10 The Modern Analytic Database

11 Cloudera Enterprise, A New Way Forward

12 The New Way Forward
[Diagram: stages Data Sources, Store & Process, Access Data; labels include Unstructured Data, Load, ETL, ELT, EDH, EDW, Archive, Search, Machine Learning, Serve, Optimize, Implement]
1) Advanced Toolset

13 The New Way Forward
[Diagram: stages Data Sources, Store & Process, Access Data; labels include Unstructured Data, Load, ETL, ELT, EDH, EDW, Archive, Search, Machine Learning, Serve, Optimize, Implement]
1) Advanced Toolset
2) Actionable Performance

14 The New Way Forward
[Diagram: stages Data Sources, Store & Process, Access Data; labels include Unstructured Data, Load, ETL, ELT, EDH, EDW, Archive, Search, Machine Learning, Serve, Optimize, Implement]
1) Advanced Toolset
2) Actionable Performance
3) Secure Access

15 What is the benefit to the business?
Expanded Engine Access: Users have a choice of a broad ecosystem of access engines. All integrated into a single platform. Allowing users to pick the best tool for the analysis.
High Performance: Actionable analytic outputs. Reduced query time. Performance to enable real-time applications.
More Data: The best full view of data to build models. Incorporate new data sets without disrupting your environment. Collect any type of data without capacity concerns.
Decreased Cost: Single platform that supports all users. Added security to enable widespread access. Engineered to run on standard hardware.

16 EDW Optimization

17 Data Warehouse Optimization
Users are Augmenting their Data Warehouse Strategy: Offload ETL. Create Active Archive. Migrate Complex Queries.
[Diagram labels: Archive; Enterprise Data Warehouse; Load; Data Store; ETL to Hadoop]
Source: Omneo Case Study

18 Optimize Your Architecture
EDH Workloads: Every Stage of Analytics. Active Archive. Immediate Self Service. Data Exploration. ETL/ELT Batch Processing.
EDW Workloads: Reports. Transactional Data. High Concurrency Apps.
[Diagram labels: Data Sources; Ingest; Enterprise Data Hub; Analytic MPP DBMS; Search Engine; Machine Learning; Batch Processing; Online NoSQL DBMS; Stream Processing; Enterprise Data Warehouse; ETL; Reports; Transactions; Concurrency; Offload; Model Building; BI/Visualizations; Point Solutions; Custom Applications]

19 Cloudera Provides Traditional EDW Features

Feature                 Apache Hadoop   Cloudera       Traditional EDW
ANSI SQL                -               X (Impala)     X
Columnar encoding       -               X (Parquet)    X
JDBC & ODBC             -               X              X
Cost based optimizer    -               X (Impala)     X
High availability       X               X              X
Granular security       -               X (Sentry)     X
Backup                  -               X (BDR)        X
Workload management     X               X              X
Caching                 X               X              X
Disaster recovery       -               X (BDR)        X
Scheduling              X               X              X
Performance monitoring  -               X (Manager)    X

20 Cloudera Enterprise Analytic Database Components

21 One Platform, Many Workloads
PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce); STREAM (Spark); SQL (Impala); SEARCH (Solr); SDK (Partners)
UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN); SECURITY (Sentry, RecordService)
STORE: FILESYSTEM (HDFS); RELATIONAL (Kudu); NoSQL (HBase)
INTEGRATE: BATCH (Sqoop); REAL-TIME (Kafka, Flume)
OPERATIONS: Cloudera Manager, Cloudera Director
DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
Interactive SQL, Search, and 3rd Party Integration. Leading analytic performance. End-to-end analytic workflows. Access more data. Enable Self Service.

22 The right SQL engine for the use case

23 Impala: The Analytic Database for Hadoop

24 Cloudera Search: Search in an EDH
Accessible: Interactive full-text and faceted navigation. Real-time exploration of all your data. Multi-audience friendly.
Flexible: Batch, real-time, and on-demand indexing (and re-indexing). Multi-datatype, multi-format support. Natively integrates with other frameworks and workloads. Rich API and ecosystem.
Open: Industry standard search engine. Mature code base, vibrant community. 100% Apache, 100% Solr.

25 Cloudera Navigator Optimizer
Unlock Your Best Hadoop Strategy, Instantly. Active Data Optimization for Hadoop to save you time and money.
Instant workload insights. Intelligent optimization guidance. Reduce Hadoop workload development effort.

26 RecordService: Unified Access Control Enforcement
[Platform diagram as on slide 21, with SECURITY: Sentry, RecordService under UNIFIED SERVICES]
New high performance security layer that centrally enforces access control policies across Hadoop.
Complements Apache Sentry's unified policy definition.
Row- and column-based security.
Dynamic data masking.
Apache-licensed open source.
Beta now available.
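
To make "row- and column-based security" and "dynamic data masking" concrete, here is a minimal Python sketch of the idea. It is not the RecordService or Sentry API; the function names `enforce_policy` and `mask_keep_last4`, and the way the policy is passed in, are hypothetical and used only for illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def mask_keep_last4(value: object) -> str:
    """Hypothetical masking rule: hide everything except the last four characters."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def enforce_policy(
    rows: Iterable[Row],
    visible_columns: List[str],
    row_filter: Callable[[Row], bool],
    masks: Dict[str, Callable[[object], str]],
) -> List[Row]:
    """Apply one access policy in a single place, before any engine sees the data.

    - row_filter drops rows the caller may not see (row-based security)
    - visible_columns projects away hidden columns (column-based security)
    - masks rewrites sensitive values on read (dynamic data masking)
    """
    result = []
    for row in rows:
        if not row_filter(row):
            continue
        projected = {}
        for column in visible_columns:
            value = row[column]
            if column in masks:
                value = masks[column](value)
            projected[column] = value
        result.append(projected)
    return result
```

The point of the sketch is the placement: every reader goes through one enforcement function rather than each engine re-implementing its own filtering, which is the "centrally enforces" claim on the slide.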

27 Cloudera and Bimodal IT

28 Cloud - Bimodal IT
Questions for BI: Data residency and sovereignty. Data gravity. Complex networking. Containers.
Mode 1, Systems of Record (Traditional Mode): Waterfall development. Known vendors. Strong governance. Minimized risk. Technology teams.
Mode 2, Systems of Engagement (Nonlinear Mode): When speed or innovation is needed, or there is a high degree of uncertainty. Agile dev. Small/innovative partners. Lightweight "Just good enough" governance. Managed risk. Multidisciplinary teams.
Stuck in the middle: "Fit for no one".
"The reality is that you do have to operate at two speeds, and some of that you do by creating dedicated teams for each. Focusing on the big systems, making them run smooth, while at the same time having disrupters to innovate, together with marketing and the customer, exploiting digital." Willem Eelman, global CIO, Unilever
Mythbuster: Nonlinear need not be limited to where speed is needed, for experiments, or for non-mission-critical initiatives.
Gartner, Inc. and/or its affiliates. All rights reserved.

29 Full Fidelity Business Analytics Customer 360 Personalization Systems of Record Advanced Analytics Systems of Engagement

30 Why Customers Want Hadoop in the Cloud
Strategic Infrastructure Decision: Organizational mandate to use cloud going forward. Desire for increased IT agility/reduced time to service.
Data Gravity: Lower TCO, Faster Time to Value. Data is already available in a cloud environment. Workloads generating data are already moved to the cloud.
Tactical Temporary Relief: Ad-hoc/non-continuous processing use cases. End user self-service.

31 The Cloudera Advantage on Cloud
Unified Architecture: Single data, metadata, security, governance, management, across entire ecosystem. Benefit: Simplified security, easier compatibility with new components, reduced ecosystem fragmentation, lower TCO.
Platform Expertise: Expert support and roadmap influence with the platform experts and committers; extensive support infrastructure for enterprise grade predictive and proactive support; several production customers running on EC2. Benefit: Faster issue resolution, alignment to the future of Hadoop.
Continuous Innovation: Better user experience (Impala, Search, Spark); better enterprise experience (Sentry, Manager, Navigator); ISV certifications; Cloud-specific experiences coming soon. Benefit: Innovate without compromise across environments.
Portable Architecture for the Hybrid Cloud: Deploy the same stack on-premises or on your choice of Cloud infrastructure; tools enable easy integration. Benefit: Flexible deployment options; minimize vendor lock-in.

32 Common Architectural Patterns in the Cloud
ETL/MODELING (Spark, MapReduce): Short-running clusters. Elastic workload. No local storage necessary.
BI/ANALYTICS (Impala, Solr): Long-running clusters. Sized to demand. Some local storage.
APP DELIVERY (HBase): Fixed clusters. Periodic sync. All local storage.
[Diagram labels: Source Data; Seed Data; Backup/DR; Object Storage]
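
The three patterns on this slide can be read as a lookup from workload type to cluster shape. The Python sketch below records them as data, purely as an illustration: the names `ClusterProfile`, `PATTERNS`, and `pattern_for_engine` and the field names are assumptions, not part of any Cloudera tool.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ClusterProfile:
    """One row of the slide: which engines run, and how the cluster is shaped."""
    engines: Tuple[str, ...]
    lifetime: str       # how long the cluster lives
    scaling_note: str   # second bullet on the slide, kept verbatim
    local_storage: str  # how much data sits on cluster-local disks


# Values transcribed from the slide; the dictionary keys are hypothetical labels.
PATTERNS: Dict[str, ClusterProfile] = {
    "etl_modeling": ClusterProfile(("Spark", "MapReduce"), "short-running", "elastic workload", "none"),
    "bi_analytics": ClusterProfile(("Impala", "Solr"), "long-running", "sized to demand", "some"),
    "app_delivery": ClusterProfile(("HBase",), "fixed", "periodic sync", "all"),
}


def pattern_for_engine(engine: str) -> str:
    """Return the slide's pattern name for a given engine, or raise KeyError."""
    for name, profile in PATTERNS.items():
        if engine in profile.engines:
            return name
    raise KeyError(engine)
```

For example, `pattern_for_engine("Impala")` returns `"bi_analytics"`, the long-running pattern with some local storage, while a Spark ETL job maps to a short-running cluster that needs no local storage.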

33 Cloudera Director
Deploy and manage enterprise-grade Hadoop in the cloud.
Trusted for Production: Supports large scale customer deployments. Integrated part of Cloudera Enterprise.
Flexible Deployment: Out-of-the-box integrations with AWS and GCP. Open Plugin Framework for preferred environment.
Simple Administration: Dynamic cluster lifecycle management. Multi-cluster, multi-environment. Custom use case support through API.

34 The Only Complete Hadoop Management Suite
Deliver optimum system utilization and meet SLA commitments.
Cloudera Manager: Focus on the solution, not the cluster, with the only complete, zero-downtime administration tool for Apache Hadoop.
Unique Capabilities: Unified configuration, management and monitoring across all services. Online installation and upgrades. Direct connection to Cloudera Support. 3rd Party Extensibility.

35 The Only Hadoop Data Governance Solution
Enable compliance and maximize analyst productivity.
Cloudera Navigator: Minimize risk and maintain compliance with the only native end-to-end data governance solution for Apache Hadoop.
Unique Capabilities: Auditing. Lineage. Metadata Tagging and Discovery. Lifecycle Management.

36 The Only Comprehensively Secure Hadoop Platform
Meet compliance requirements and reduce risk exposure from storing sensitive data.
1. Perimeter: Standards-based Authentication
2. Access: Unified Role-based Authorization
3. Visibility: Auditing & Governance
4. Data: Encryption & Key Management
[Diagram labels: Process; Discover; Model; Serve; Security and Administration; Unlimited Storage]
Cloudera is the leader in Hadoop security.
Unique Capabilities: Comprehensive and Unified. Secure at the core. No Performance Impact. Jointly engineered with Intel. Compliance-Ready. Only distribution to pass PCI audit.

37 Customer Success

38 Omneo drives $15-25M in annual savings by identifying and addressing supply chain issues in near real time.

39 National children's hospital improves pain management in premature babies and reduces asthma-related ER visits.

40 Magnify delivers a self-serve, 360-degree customer view to Fortune 100 clients like Chrysler, DuPont & Ford.

41 Cloudera is the Fast, Easy, & Secure Analytic Database
The leading big data platform from the leaders in enterprise Hadoop.
Fast: Cloudera Impala provides industry-leading analytic performance.
Easy: Broadest ecosystem of access engines and third party tooling to ensure the best tool for the job.
Secure: Enable self-service BI without risking security or improper access.

42 What Next?
Download the Analytic Database Solution Brief. Watch The Future of Impala Video. See Why Companies Choose Impala.

43 Thank you