Redefine Big Data: EMC Data Lake in Action
Andrea Prosperi, Systems Engineer


Agenda
- Data Analytics Today
- Big Data
- Hadoop & HDFS
- Different types of analytics
- Data lakes
- EMC Solutions for Data Lakes

The world before big data
- Data warehousing.
- Research and the definition of dimensions and facts started in the 1960s.
- Things really got going in the 1980s.

So what changed?
Big data rocked up to the party.

Traditional solutions struggled
- Too much data
- No real-time analysis
- No data exploration
- More expensive hardware to go faster and deeper
- Overnight batch not good enough
- Not just structured data in a star schema

Thankfully we had Google
- Cue Doug Cutting's son and his elephant, Hadoop
- The computation tier uses a framework called MapReduce
- Storage is provided via a distributed filesystem called HDFS
- Hadoop runs on commodity hardware
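To make the MapReduce idea on this slide concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the example. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration, and the loop stands in for work that a real cluster would spread across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into one result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster each document (or block) would be mapped on a different
    # node; this illustrative version simply loops over them.
    emitted = []
    for doc in documents:
        emitted.extend(map_phase(doc))
    grouped = shuffle_phase(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    docs = ["big data big lake", "data lake in action"]
    print(word_count(docs))
```

The point of the split is that map and reduce calls are independent of each other per record and per key, which is what lets a framework run them in parallel on commodity hardware.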

All analytics aren't equal
Descriptive, Predictive and Prescriptive. There is also Diagnostic.
Chart axes: Competitive Advantage, Degree of Complexity
Prescriptive
- How can we achieve the best outcome including the effects of variability?
- How can we achieve the best outcome?
Predictive
- What will happen next if?
- What if these trends continue?
- What could happen?
Descriptive
- What actions are needed?
- What exactly is the problem?
- How many, how often, where?
- What happened?
Source: Based on "Competing on Analytics," Davenport and Harris
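The three levels can be illustrated on a toy series of sales figures. This is a sketch under stated assumptions, not a method from the deck: the data, the linear-trend forecast, and the `actions` table of (uplift multiplier, cost) pairs are all invented for illustration.

```python
def describe(sales):
    """Descriptive: summarise what happened."""
    return {"total": sum(sales), "mean": sum(sales) / len(sales)}


def fit_trend(sales):
    """Least-squares line through (0, sales[0]), (1, sales[1]), ...

    Returns (slope, intercept).
    """
    n = len(sales)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sales) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict_next(sales):
    """Predictive: what if these trends continue for one more period?"""
    slope, intercept = fit_trend(sales)
    return slope * len(sales) + intercept


def prescribe(sales, actions):
    """Prescriptive: pick the action with the best forecast outcome.

    `actions` maps a name to a hypothetical (uplift multiplier, cost) pair.
    """
    forecast = predict_next(sales)
    return max(actions, key=lambda name: forecast * actions[name][0] - actions[name][1])


if __name__ == "__main__":
    sales = [100, 110, 120, 130]
    actions = {"do_nothing": (1.0, 0), "discount": (1.2, 20), "advertise": (1.1, 5)}
    print(describe(sales))
    print(predict_next(sales))
    print(prescribe(sales, actions))
```

Each step builds on the one before it: the prescriptive choice needs a forecast, and the forecast needs the historical record that descriptive analytics summarises.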

Descriptive Analytics, Prescriptive Analytics, Predictive Analytics

Data lakes
Today, think of it in terms of co-existence with the Enterprise DWH. Both environments are valid.
Diagram (two parallel paths):
- Semi-structured & unstructured data -> Hadoop-based data lake -> Analyze & Report -> Client/Portal, Devices
- Structured data (CRM, ERP, OLTP DB) -> Data Transformation (ETL/ELT) -> Enterprise DWH -> Analyze & Report -> Client/Portal, Device
- Data Security, Backup

What is a Data Lake?
"If you think of a datamart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state."
James Dixon, coiner of the term "Data Lake"

Pragmatic approach to Data Lake
- Identify domain
- Be pragmatic / start small
- Build lake infrastructure
- Fill lake
- Build fishing poles: exploration, extract value, then expand

Data Lake Interaction
3 main levels of interaction:
- Real Time: for fast analysis and correlation
- Interactive: for transactional processing
- Batch: for large dataset analysis

Lake Infrastructure
EMC Solutions for Data Lake Infrastructure
Diagram labels: ViPR Controller, EMC Big Data Storage, DSSD, Isilon, VNX, ViPR Services, Commodity, ECS
Interaction tiers: Real-Time, Interactive, Batch

Build Lake Infrastructure: Use General Purpose Arrays/Commodity Disks As Data Lake Store
Diagram labels: ViPR Data Services, 3rd Party, VNX, Commodity
- Be fast
- Reuse your current infrastructure to build an HDFS repository
- Reduce risk
- Reduce CAPEX investment required to perform analytics
- Maintain data protection and compliance at array level
- Reduce cost and complexity of dedicated clusters
- Reduce need for new vendor nodes and storage capacity

Build Lake Infrastructure: Object, File And HDFS Operations On The Same Data
Diagram labels: Object, Object & HDFS, HDFS, Virtual Array, Commodity
- ViPR Object & ViPR HDFS access on the same data
- S3, Swift, Atmos API via the Object head
- File protocols in development
- Use your preferred Hadoop distribution

Build Lake Infrastructure: Use Specialized Arrays As Data Lake Store
ECS Appliance
- Hyper-scale: ECS supports unlimited applications and users on a single, scale-out architecture
- Start at 360 TB and scale to multiple petabytes or even exabytes
- 3rd platform applications
- Pre-engineered and pre-built
- Commodity hardware
- Structured and unstructured content

Build Lake Infrastructure: Use Specialized Arrays As Data Lake Store
Accelerate the benefits of Hadoop for the enterprise
- Proven Hadoop solution, faster implementation
- Greater interoperability with enterprise applications and Hadoop analytics through multi-protocol parallel access from any client
Enterprise data protection
- Fast snapshots, backup, and recovery
- Simple, reliable data replication for disaster recovery
Ultimate flexibility
- Scale compute and storage resources separately
- Supports physical and virtualized server environments

Lake Software
EMC/Pivotal Solutions for Data Lake Software
Diagram labels: Greenplum DB, GemFire XD, HAWQ, Pivotal HD, Unlimited
Interaction tiers: Real-Time, Interactive, Batch

Pivotal HD Architecture - Apache
Apache components:
- Resource Management & Workflow: Yarn, Zookeeper
- HBase
- Pig, Hive, Mahout
- MapReduce
- Sqoop
- Flume
- HDFS

HAWQ - Full ANSI SQL Engine on Hadoop
HAWQ Advanced Database Services:
- Xtension Framework
- ANSI SQL + Analytics
- MADlib Algorithms
- Catalog Services
- Dynamic Pipelining
- Query Optimizer
Other Pivotal components:
- Spring
- Command Center: Configure, Deploy, Monitor, Manage
- Hadoop Virtualization Extension
- Unified Storage Service
- Data Loader
Apache components:
- Resource Management & Workflow: Yarn, Zookeeper
- HBase
- Pig, Hive, Mahout
- MapReduce
- Sqoop
- Flume
- HDFS

GemFire - Real-Time Data Service
GemFire XD Real-Time Database Services:
- Distributed In-memory Store
- ANSI SQL + In-Memory
- Query Transactions Ingestion Processing
- Hadoop Driver Parallel with Compaction
HAWQ Advanced Database Services:
- Xtension Framework
- ANSI SQL + Analytics
- MADlib Algorithms
- Catalog Services
- Dynamic Pipelining
- Query Optimizer
Other Pivotal components:
- Spring
- Command Center: Configure, Deploy, Monitor, Manage
- Hadoop Virtualization Extension
- Unified Storage Service
- Data Loader
Apache components:
- Resource Management & Workflow: Yarn, Zookeeper
- HBase
- Pig, Hive, Mahout
- MapReduce
- Sqoop
- Flume
- HDFS

A Reference Architecture
Standardized, on-demand services are layered around shared data repositories & processing capabilities to form the data lake.
Ingest and data capture
- Scheduled, batch data ingest to capture bulk data sources.
- Micro-batch ingest capturing small quantities of data.
- Low-latency and real-time ingest of data.
- Real-time routing of data to complex event processing and persistent storage.
Data sources
- Existing structured data.
- Unstructured or semi-structured data sources.
- Machine-generated data such as logs and sensor data.
- External data sources.
Applications and integration
- Cloud Foundry on vSphere.
- Build interactive, data-driven applications using modern frameworks and approaches.
Data analytics
- In-memory performance (GemFire).
- MPP processing (Pivotal HD).
- High-performance SQL access to HDFS data (HAWQ).
Shared storage and re-use
- Isilon and ViPR provide shared access to new and existing data sources through HDFS.
- Minimize data copies.
- Smart de-dupe for Hadoop.
- Kerberos authentication.
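The micro-batch ingest mode listed in the reference architecture can be sketched in a few lines. This is an illustrative assumption about how such an ingest loop might look, not EMC or Pivotal code: `micro_batches` and `ingest` are hypothetical names, and `sink` is a plain Python list standing in for whatever writer would persist each batch to the lake.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of up to `batch_size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(records, batch_size, sink):
    """Append each micro-batch to `sink` and return the number of batches.

    `sink` is a hypothetical stand-in for a real storage writer.
    """
    count = 0
    for batch in micro_batches(records, batch_size):
        sink.append(batch)
        count += 1
    return count


if __name__ == "__main__":
    sink = []
    print(ingest(range(7), 3, sink))
    print(sink)
```

Because `micro_batches` consumes any iterable lazily, the same loop works for a finite file or an open-ended stream; only the batch size changes the trade-off between latency and per-write overhead.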

What about services?
Data Science + Data Engineering