Taking Advantage of Cloud Elasticity and Flexibility
  Fred Koopmans, Sr. Director of Product Management
Public cloud adoption is surging
Cloudera customers are leading the way
Hadoop was born for the cloud
  Speed
  Convenience
  Scale
  Self-Service
  TCO
But the cloud comes with its own set of challenges
  Performance
  Bill Shock
  Application Portability
  Security
  Data Governance
  Data Sovereignty
  Hybrid Cloud
  Lock-in
A stepwise approach
  Lift and shift the platform
  Optimize each application individually
  Reconstruct an Enterprise Data Hub
Lift and shift the platform
Openness is even more important in the cloud
  Open Environment
    Run the same platform in different clouds or on bare metal, so customers can move as needed without migration or retraining
  Open Ecosystem
    450+ certified ISVs assure backward compatibility across releases, so customers can leverage their pre-existing investments
  Open Source
    Avoid vendor lock-in, and leverage components supported by the committers who drive the community roadmap
In on-prem environments, many applications typically share a single, multi-tenant cluster
  HDFS
The cloud creates more & smaller clusters, specialized for each application
  S3
  Azure Data Lake
  Google Storage*
Where to store the data?
  Object Storage generally best choice
    Performance often good enough
    Generally cheaper per TB than DAS
    Scales independently from compute
  Not a drop-in replacement for HDFS
    Different data consistency models
    Different directory structure support
  Not all Object Stores created equal
    Different access control models
    Different maturity levels
  Not yet universally supported by CDH
    Mostly finished for S3
    Just getting started for ADLS
    Not yet started for GCS
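One way to see why an object store is not a drop-in replacement for HDFS: HDFS can rename a directory as a single metadata operation, while an object store has only flat keys, so a "directory rename" becomes one copy and one delete per object. The sketch below is a toy in-memory model, not the S3 or ADLS API; `ToyObjectStore` and its methods are hypothetical names used only for illustration.

```python
from typing import Dict, List


class ToyObjectStore:
    """Flat key-to-bytes map; '/' in a key is just a character, not a directory."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.copy_calls = 0  # counts per-object copies made by rename_prefix

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def list_prefix(self, prefix: str) -> List[str]:
        # "Listing a directory" is really a scan for keys sharing a prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src: str, dst: str) -> None:
        # There is no directory node to relink, so each object is copied
        # to its new key and the old key is deleted afterwards.
        for key in self.list_prefix(src):
            self.objects[dst + key[len(src):]] = self.objects[key]
            self.copy_calls += 1
            del self.objects[key]


store = ToyObjectStore()
for i in range(3):
    store.put(f"warehouse/tmp/part-{i}", b"rows")

# A job that commits its output by renaming tmp/ to final/ pays one copy per file.
store.rename_prefix("warehouse/tmp/", "warehouse/final/")
print(store.list_prefix("warehouse/"), store.copy_calls)
```

The cost of the rename grows with the number of objects, and a reader can observe a half-moved prefix while the loop runs, which is the kind of gap that connector work on renames and consistency has to address.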
Object Storage support is rapidly reaching maturity
  Separation from HDFS
    S3A connector
    ADLS connector
  Filling the gaps
    Performance
    Consistency
    Renames
  Cloudera Functional Equivalence
    Security
    Governance
    Backup & Recovery
  Support as of C5.11
    Component        S3    ADLS
    MapReduce        Y     Y
    Hive             Y     Y
    Hive on Spark    Y     -
    Spark            Y     Y
    HBase            -
    Impala           Y     -
    Hue              Y     -
  Cross-Cluster Sharing
    Permissions
    Catalogue
    Lineage
How to provision and manage cloud infrastructure cost effectively?
  Provisioning requirements
    Spin clusters up & down quickly
    Grow & shrink clusters dynamically
    Select right instance types for each service
    Leverage demand-based pricing whenever possible
  Management requirements
    Fully automated and parallelized installation and configuration
    Manage all aspects of cluster security automatically
    Retain diagnostic and log information after cluster is gone
    Support transient and long-lived clusters
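The trade-off behind spinning clusters up and down versus keeping them running can be sketched as simple arithmetic. All rates below are hypothetical placeholders, not real cloud prices, and `cheaper_option` is an illustrative helper rather than part of any Cloudera tool.

```python
HOURS_PER_MONTH = 730    # average hours in a month
ON_DEMAND_RATE = 0.40    # hypothetical $/node-hour, billed only while running
RESERVED_RATE = 0.25     # hypothetical $/node-hour, billed for every hour


def monthly_cost(hours_billed: float, nodes: int, hourly_rate: float) -> float:
    """Cost of a cluster of `nodes` billed for `hours_billed` hours."""
    return hours_billed * nodes * hourly_rate


def cheaper_option(job_hours_per_month: float, nodes: int):
    """Compare a transient on-demand cluster with an always-on reserved one."""
    transient = monthly_cost(job_hours_per_month, nodes, ON_DEMAND_RATE)
    persistent = monthly_cost(HOURS_PER_MONTH, nodes, RESERVED_RATE)
    choice = "transient" if transient < persistent else "persistent"
    return choice, transient, persistent


# A 10-node cluster that is busy 100 hours a month versus 600 hours a month.
light_choice, light_transient, light_persistent = cheaper_option(100, 10)
heavy_choice, heavy_transient, heavy_persistent = cheaper_option(600, 10)
print(light_choice, heavy_choice)
```

With these placeholder rates the break-even point is about 456 busy hours per month; below that, paying only for hours used wins, which is why fast spin-up and spin-down sits at the top of the provisioning requirements.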
Cloudera Director automates cluster lifecycle management
  Easy
    Single pane of glass for all cloud infrastructure
    Create templates to run applications in a preoptimized manner
  Flexible
    Multi-cloud: AWS, Azure, GCP
    Hourly pricing with auto billing & metering
    Spot instance/block support
  Enterprise-grade
    Integration across Cloudera Enterprise
    Management of CDH deployments at scale
    Deeply integrated with Cloudera Manager
Cloudera Manager automates cluster operations
  Object Store
  Easy administration
    Spot instance resiliency
    Automated security credential handling
  Transient cluster operations
    Optimized cluster provisioning
    Automatic collection of diagnostics and logs
  Long-lived cluster operations
    Downtime-less upgrade, patch, restart, and reconfiguration
    Monitoring, alerting, health checking, reporting, etc.
Optimize each application independently
Really, four discrete applications on one unified platform
  Data Engineering
    Modern data processing (ETL) at scale
  Data Science
    Exploratory data science and machine learning for the enterprise
  Analytic Database
    Explore, analyze, and understand all your data
  Operational Database
    Data-driven applications to deliver real-time insights
  Multi-Storage, Multi-Environment
Needs of each application can vary greatly
  Data Science & Engineering
    Access Patterns
      Batch
      Can be transient or persistent
    Performance Needs
      Relatively insensitive to latency and data locality
    Security
      Security often not required for many use cases
  Analytic Database
    Access Patterns
      Batch or interactive
      Can be transient or persistent
    Performance Needs
      Relatively insensitive to latency and data locality
    Security
      Fine-grained security often required
  Operational Database
    Access Patterns
      Real-time
      Typically persistent
    Performance Needs
      Typically quite sensitive to latency and data locality
    Security
      Fine-grained security often required
Data Science & Engineering in the cloud
  Three architectural patterns to optimize price, convenience, performance
  Transient Batch (most flexible), the default choice
    Spin up clusters as needed
    On-demand/spot instances
    Usage-based pricing
    Sized for workload
    Cluster per tenant/user
  Persistent Batch (most control)
    Persistent cluster(s) for frequent ETL
    Reserved instances
    Node-based pricing
    Grow/shrink
    Cluster per tenant group
  Persistent Batch on HDFS (fastest)
    Top performance for frequent ETL
    Reserved instances
    Node-based pricing
    Grow/shrink
    Shared across tenant groups
  [Diagram: batch clusters and persistent clusters; labels include HDFS and Object Storage]
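The choice among the three batch patterns can be read as a small decision rule. The sketch below is one interpretation of the slide, not a rule stated by Cloudera; the `BatchWorkload` fields and the `choose_batch_pattern` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchWorkload:
    runs_frequently: bool        # recurring ETL rather than occasional jobs
    needs_top_performance: bool  # speed matters more than flexibility


def choose_batch_pattern(workload: BatchWorkload) -> str:
    """Map workload traits to one of the three patterns named on the slide."""
    if not workload.runs_frequently:
        # Default choice: pay per use, size the cluster for the job.
        return "Transient Batch"
    if workload.needs_top_performance:
        # Frequent ETL where speed justifies a shared cluster with HDFS.
        return "Persistent Batch on HDFS"
    # Frequent ETL on reserved instances, still reading from object storage.
    return "Persistent Batch"


nightly_report = BatchWorkload(runs_frequently=False, needs_top_performance=False)
hourly_etl = BatchWorkload(runs_frequently=True, needs_top_performance=False)
latency_bound_etl = BatchWorkload(runs_frequently=True, needs_top_performance=True)

for w in (nightly_report, hourly_etl, latency_bound_etl):
    print(choose_batch_pattern(w))
```

Starting from the transient pattern and moving to a persistent one only when frequency or performance demands it matches the slide's labeling of Transient Batch as the default.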
Analytic DB in the cloud
  ETL
    Refer to Data Science & Engineering guidelines
    Reduce Operating Costs
    Only pay for what you need, when you need it
    Transient clusters
    Object storage centric
    Cloud-native deployment
  BI/Analytics
    Presents new set of choices
    New Insights, New Revenue
    Explore and analyze all data, wherever it lives
    Long-running clusters
    Object storage or local storage
    Lift-and-shift deployment
BI/Analytics in the cloud
  Three architectural patterns to optimize price, convenience, performance
  Transient BI (infrequent usage), the default choice
    Spin up clusters when needed
    On-demand instances
    Usage-based pricing
    Grow/shrink
    Cluster per tenant or user
  Persistent BI (regular usage)
    Persistent clusters for BI any time
    Reserved instances
    Node-based pricing
    Grow/shrink
    Cluster per tenant group
  Persistent BI with Local Storage (fastest)
    Max speed for more regular workloads
    Reserved instances
    Node-based pricing
    Less frequent grow/shrink
    Shared cluster for shared local data
  [Diagram: transient clusters and persistent clusters; labels include HDFS and/or Kudu and Object Storage]
Operational DB in the cloud
  Not as well suited for cloud, but targeted benefits are possible
  Cost Goals
    Low-cost backup and disaster recovery
    Elastic growth for tightly provisioned workloads makes expansion easy, and enables a lower-cost steady state
  Convenience Goals
    Development and testing environments easy to deploy and decommission
    Fast and easy provisioning of additional clusters helps projects move quickly
Reconstruct an Enterprise Data Hub
Many problems are a combination of SQL & predictive, batch & online
  [Diagram: Traditional Architecture. Components: Data Sources, Operational Data Stores, Enterprise Data Warehouse, Archive, Storage #1, Storage #2, Modeling, HPC Grid, BI System, Reporting, Applications. Data feeds: Portfolio Contracts; Portfolio Risks; Market, Counterparty, Ratings; Payments; Collections; Charges; Financial Ledger; P&L; Unstructured. Steps: Ingest, ETL, ELT, Process, Load, Serve.]
Reimagining the Enterprise Data Hub in the cloud
  Developer Workbench
  SQL Workbench
  Partner Ecosystem
  Common: Operations, Governance, Security, Schema, Catalog
  Object Store
Thank you
  Fred Koopmans