Cloudera, Inc. All rights reserved.


2 Data Analytics 2018
CDSW: Teamwork and Governance in Data Science Development
Thomas Friebel, Partner Sales Engineer

3 We believe data can make what is impossible today, possible tomorrow.

4 Cloudera at-a-glance
First to market, open source innovation, open partner network, best of breed solutions, customer success, large enterprises fueling growth
Founded 2008: big data innovators from Google, Yahoo and Oracle
Clouderans: global team doing business in 28 countries
Partners: vast ecosystem of solution & service providers
48% customer growth: last 4 years, Global 8000 customers
140%+ net expansion: expansion driven by data and new use cases

5 Adoption driven by large enterprises
Customers across all verticals
~500 Global 8000 customers
[Figure: share of top global companies, and countries with government customers, by vertical: Banking, Telco, Public, Healthcare, Technology; ratios as transcribed: 7/10, 9/, /10, 8/10]

6 Portfolio of Joint Product Collaboration, powered by Cloudera
Big Data Appliance (On-Premises): Customer Data Center, Purchased, Customer Managed
Big Data Cloud Machine (Customer): Customer Data Center, Subscription, Oracle Managed
Big Data Cloud Service (Public Cloud): Oracle Cloud, Subscription, Oracle Managed

7 Cloudera Enterprise
The modern platform for machine learning and analytics, optimized for the cloud
Core services: Data Science, Analytic Database, Operational Database, Data Engineering
Extensible services
Security, Governance, Workload Management, Ingest & Replication, Data Catalog
Storage services: Amazon S3, Microsoft ADLS, HDFS, Kudu

8 We are in the age of machine learning
Data: Data has never been more plentiful
Analytics: Open source data science and machine learning libraries are rapidly evolving
Deployment: Flexible commodity storage and compute make scalable production machine learning affordable

9 But there are practical challenges
Data, Analytics, Deployment
Data needs to move across multiple different systems
Teams have different, conflicting requests for languages & libraries
Most data science is done at small scale, individually, and is difficult to replicate
Very few models reach production

10 Our goal: Open data science at enterprise scale
Data Scientist, Data Engineer:
  Help more data scientists use the power of Cloudera
  Use a powerful, familiar environment with direct access to Cloudera data and compute
Enterprise Architect, Hadoop Admin:
  Make it easy and secure to add new users and use cases
  Offer secure self-service analytics and a faster path to production on common, affordable infrastructure

11 Balancing the needs of data scientists and IT
Data scientists: explore, experiment, collaborate
IT: drive adoption, maintain compliance

12 Support the complete data science workflow
From data to exploration to action
Stages: Data Engineering, Data Science, Deployment
Dev: Collaboration, Version Control
Ops: Deployment, Scheduling, Orchestration
Activities: Acquisition; Processing; Data Governance; Curation; Visualization and Analysis; Data Wrangling; Model Training & Testing; Reports, Dashboards; Online Scoring; Batch Scoring; Serving
Shared: Data, Operations, Governance, Security, Metadata

13 Cloudera Data Science Workbench
Data Science at Scale
Runs on, and is certified on, BDA
A powerful combination, but:
  Data scientists want a notebook-like interface
  Security often interferes with productivity
  Dependencies are very complicated
  Collaboration is difficult
The CDSW interface brings data scientists to the data:
  Web-based notebook interface
  R, Python or Scala
  One-time Kerberos authentication
  Isolated, individual environments allow self-service
  Visualization, team-based sharing
  Access to governed and secured data

14 Demo

15 Integration with Oracle Big Data Appliance
Technical requirements:
  Available physical nodes for the CDSW application (dedicated edge nodes required)
  CDSW 1.2.x supports Oracle Linux 7.3
  Either use free nodes in the BDA, order additional BDA nodes, or add non-BDA edge nodes
Licensing requirements:
  Edge nodes need to be licensed for Cloudera Enterprise (covered by BDA or ordered directly from Cloudera)
  An additional user-based CDSW license is required, ordered directly from Cloudera (available as a 10-user pack with a 1-year subscription)

16 A modern data science architecture
Built on Docker and Kubernetes
Runs on dedicated gateway nodes
User sessions run in isolated engine containers which:
  Host Kerberos-authenticated Python/R/Scala runtimes
  Interact with Spark via YARN client mode (driver runs in the container, workers on CDH)
Single-cluster only (for now)
[Diagram: CDSW master and engines on gateway nodes; EDH with Cloudera Manager, Hive, HDFS, ... on BDA nodes]
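As a rough illustration of the "YARN client mode" item on slide 16, the following Python sketch shows the Spark settings that keep the driver in the local process while executors run on the cluster. It assumes PySpark is installed in the engine and that the cluster configuration is already available to it; the helper names `yarn_client_conf` and `start_session` are hypothetical and not part of CDSW.

```python
def yarn_client_conf(app_name):
    """Return Spark properties for a driver that stays in the local process.

    "spark.master" and "spark.submit.deployMode" are standard Spark
    property names; the app name is whatever the caller chooses.
    """
    return {
        "spark.app.name": app_name,
        "spark.master": "yarn",               # resource manager: YARN
        "spark.submit.deployMode": "client",  # driver runs where this code runs
    }


def start_session(app_name):
    """Create a SparkSession from the properties above (assumes PySpark)."""
    # Imported lazily so yarn_client_conf() works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in yarn_client_conf(app_name).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a real deployment these properties may already be set by the platform defaults, in which case a plain `SparkSession.builder.getOrCreate()` would be enough.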

17 Accelerated deep learning on-demand with GPUs
"Our data scientists want GPUs, but we can't find a way to deliver multi-tenancy. If they go to the cloud on their own, it's expensive and we lose governance."
Multi-tenant GPU support on-premises or cloud
Extend existing CDSW benefits to GPU-optimized deep learning tools
Schedule & share GPU resources
Train on GPUs, deploy on CPUs
Works on-premises or cloud
[Diagram: Data Science Workbench on BDA; CPU + GPU: single-node training; CPU: distributed training, scoring]

18 More flexible automation with the Jobs API
Orchestrate jobs from third-party workflow tools
Parameterization via job environment variables
View outputs in CDSW or receive a notification
curl -XPOST --user $USERNAME:$PASSWORD -H "Content-type: application/json" -d '{"environment": {"FISCAL_QUARTER": "Q3"}}'
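The curl call on slide 18 could be reproduced from a workflow tool roughly as follows. This is a minimal sketch using only the Python standard library: the transcript does not show the target URL, so the function takes it as a parameter, and the function names `build_job_start_request` and `fiscal_quarter` are hypothetical. The request is only built, not sent.

```python
import base64
import json
import os
import urllib.request


def build_job_start_request(start_url, username, password, environment):
    """Build (but do not send) a POST request mirroring the slide's curl call.

    start_url is an assumption supplied by the caller; the slide does not
    show the endpoint.
    """
    # Same JSON shape as the -d payload on the slide.
    body = json.dumps({"environment": environment}).encode("utf-8")
    # Equivalent of curl's --user USERNAME:PASSWORD (HTTP Basic auth).
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        start_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def fiscal_quarter(default="Q1"):
    """Inside the job script: read the parameter passed via the environment.

    The default value is an illustrative assumption.
    """
    return os.environ.get("FISCAL_QUARTER", default)
```

A caller would pass the prepared request to `urllib.request.urlopen` to actually trigger the job, while the job script itself calls `fiscal_quarter()` to pick up the value set by the orchestrator.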

19 Thank you
Thomas Friebel