Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi, Systems Engineer

Transcription:

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi, Systems Engineer

Agenda: Data Analytics Today; Big data; Hadoop & HDFS; Different types of analytics; Data lakes; EMC Solutions for Data Lakes

The world before big data: data warehousing. Research and the definition of dimensions and facts started in the 1960s. Things really got going in the 1980s.

So what changed? Big data rocked up to the party.

Traditional solutions struggled: too much data; no real-time analysis; no data exploration; more expensive hardware to go faster and deeper; overnight batch not good enough; not just structured data in a star schema.

Thankfully we had Google. Cue Doug Cutting's son and his elephant, Hadoop. The computation tier uses a framework called MapReduce. Storage is provided via a distributed filesystem called HDFS. Hadoop runs on commodity hardware.
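The MapReduce model named on this slide can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the function names and the sample input are assumptions made for the example, and on a real cluster the input lines would come from blocks of a file stored in HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input standing in for a file in HDFS.
    print(word_count(["big data big lake", "data lake"]))
```

Because each call to map_phase looks at one line only and each call to reduce_phase looks at one key only, both phases can be spread across many commodity machines, which is the property the slide relies on.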

All analytics aren't equal: Descriptive, Predictive and Prescriptive. There is also Diagnostic.
Chart axes: Competitive Advantage and Degree of Complexity.
Descriptive: What happened? How many, how often, where? What exactly is the problem? What actions are needed?
Predictive: What could happen? What if these trends continue? What will happen next if?
Prescriptive: How can we achieve the best outcome? How can we achieve the best outcome including the effects of variability?
Source: Based on "Competing on Analytics," Davenport and Harris
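To make the three levels concrete, here is a minimal sketch that answers one question from each on the same small series. The sales figures, the stock threshold and the function names are all hypothetical, and the straight-line forecast is only the simplest possible stand-in for a predictive model.

```python
from statistics import mean
from typing import Dict, List


def describe(sales: List[float]) -> Dict[str, float]:
    """Descriptive: what happened?"""
    return {"total": sum(sales), "average": mean(sales)}


def predict_next(sales: List[float]) -> float:
    """Predictive: what if these trends continue? Fit a least-squares line, step one period ahead."""
    n = len(sales)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(sales)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast: float, stock: float) -> str:
    """Prescriptive: what action is needed? A toy rule that compares forecast demand with stock."""
    return "reorder" if forecast > stock else "hold"


if __name__ == "__main__":
    history = [10, 12, 14, 16]  # hypothetical weekly sales
    print(describe(history))
    forecast = predict_next(history)
    print(forecast)
    print(prescribe(forecast, stock=15))
```

Each step consumes the output of the one before it, which is one way to read the chart: the further up the complexity axis, the more of the earlier levels an answer depends on.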

Descriptive Analytics, Prescriptive Analytics, Predictive Analytics

Data lakes: today, think of it in terms of co-existence with the Enterprise DWH. Both environments are valid.
Diagram, data lake path: Semi-structured & Unstructured Data; Hadoop Based Data Lake; Analyze & Report; Client/Portal Devices.
Diagram, warehouse path: Structured Data (CRM, ERP, OLTP DB); Data Transformation (ETL/ELT); Enterprise DWH; Analyze & Report; Client/Portal Device.
Diagram, shared: Data Security, Backup.

What is a Data Lake? "If you think of a datamart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state." James Dixon, coiner of the term "data lake"
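One common way to read the "natural state" half of the metaphor is schema-on-read: records are stored as they arrive and a schema is applied only when someone queries them. The slide does not use that term, so treat the sketch below as an interpretation. The sample records, field names and function name are all hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Raw events kept exactly as they arrived, including one that is not valid JSON.
RAW_LAKE: List[str] = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "7", "extra": {"campaign": "x"}}',
    "not even json",
]


def read_with_schema(
    raw_lines: Iterable[str], schema: Dict[str, Callable[[Any], Any]]
) -> Iterator[Dict[str, Any]]:
    """Apply field names and type casts at read time; skip records that do not fit."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            # The raw line stays in the lake; it is just invisible to this schema.
            continue


if __name__ == "__main__":
    for row in read_with_schema(RAW_LAKE, {"user": str, "amount": float}):
        print(row)
```

A data mart would reject or reshape the second and third records at load time. Here they stay in storage, and a different schema (for example one that reads "extra") can be applied later without reloading anything.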

Pragmatic approach to Data Lake: identify the domain; be pragmatic and start small; build the lake infrastructure; fill the lake; build fishing poles (exploration), extract value, then expand.

Data Lake Interaction. 3 main levels of interaction: Real Time, for fast analysis and correlation; Interactive, for transactional processing; Batch, for large dataset analysis.

Lake Infrastructure: EMC Solutions for Data Lake Infrastructure. Tiers: REAL-TIME, INTERACTIVE, BATCH. Components: ViPR Controller; EMC Big Data Storage; DSSD; Isilon; VNX; ViPR Services; Commodity; ECS.

Build Lake Infrastructure: Use General Purpose Arrays/Commodity Disks As Data Lake Store.
ViPR Data Services over: 3rd Party; VNX; Commodity.
Be fast; reuse your current infrastructure to build an HDFS repository; reduce risk; reduce the CAPEX investment required to perform analytics; maintain data protection and compliance at array level; reduce the cost and complexity of dedicated clusters; reduce the need for new vendor nodes and storage capacity.

Build Lake Infrastructure: Object, File And HDFS Operations On The Same Data.
Diagram labels: Object; Object & HDFS; HDFS; Virtual Array; Commodity.
ViPR Object and ViPR HDFS access on the same data; S3, Swift and Atmos APIs via the Object head; file protocols in development; use your preferred Hadoop distribution.

Build Lake Infrastructure: Use Specialized Arrays As Data Lake Store. ECS Appliance.
Hyper-scale: ECS supports unlimited applications and users on a single, scale-out architecture; start at 360 TB and scale to multiple petabytes or even exabytes.
3rd platform applications; pre-engineered and pre-built commodity hardware; structured and unstructured content.

Build Lake Infrastructure: Use Specialized Arrays As Data Lake Store.
Accelerate the benefits of Hadoop for the enterprise: proven Hadoop solution, faster implementation; greater interoperability with enterprise applications and Hadoop analytics through multi-protocol parallel access from any client.
Enterprise data protection: fast snapshots, backup, and recovery; simple, reliable data replication for disaster recovery.
Ultimate flexibility: scale compute and storage resources separately; supports physical and virtualized server environments.

Lake Software: EMC/Pivotal Solutions for Data Lake Software. Tiers: REAL-TIME, INTERACTIVE, BATCH. Components: Greenplum DB; GemFire XD; HAWQ; Unlimited Pivotal HD.

Pivotal HD Architecture - Apache. Components: Resource Management & Workflow (Yarn, Zookeeper); HBase; HDFS; Pig, Hive, Mahout; MapReduce; Sqoop; Flume.

HAWQ - Full ANSI SQL Engine on Hadoop.
HAWQ Advanced Database Services: Xtension Framework; ANSI SQL + Analytics; MADlib Algorithms; Catalog Services; Dynamic Pipelining; Query Optimizer.
Platform components: Resource Management & Workflow (Yarn, Zookeeper); HBase; Spring; Pig, Hive, Mahout; MapReduce; Command Center (Configure, Deploy, Monitor, Manage); Hadoop Virtualization Extension; HDFS; Unified Storage Service; Sqoop; Data Loader; Flume.
Legend: Apache; Pivotal.
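The point of an ANSI SQL engine over Hadoop data is that analysts keep writing ordinary SQL. The sketch below shows the kind of aggregate query meant, but it does not talk to HAWQ: Python's built-in sqlite3 module is used purely as a stand-in so the example runs anywhere, and the table, columns and rows are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def revenue_by_channel(rows: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Load (channel, amount) rows into a temporary table and aggregate them with plain SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not HAWQ
    try:
        conn.execute("CREATE TABLE sales (channel TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales (channel, amount) VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT channel, SUM(amount) AS revenue "
            "FROM sales GROUP BY channel ORDER BY channel"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("web", 10.0), ("store", 5.0), ("web", 2.5)]  # hypothetical rows
    print(revenue_by_channel(sample))
```

The SELECT statement uses only standard GROUP BY and ORDER BY clauses, so the same query text is the sort of thing the slide has in mind when the data sits in HDFS rather than in a local table.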

GemFire - Real-Time Data Service.
HAWQ Advanced Database Services: Xtension Framework; ANSI SQL + Analytics; MADlib Algorithms; Catalog Services; Dynamic Pipelining; Query Optimizer.
GemFire XD Real-Time Database Services: Distributed In-memory Store; ANSI SQL + In-Memory Query; Transactions; Ingestion Processing; Hadoop Driver Parallel with Compaction.
Platform components: Resource Management & Workflow (Yarn, Zookeeper); HBase; Spring; Pig, Hive, Mahout; MapReduce; Command Center (Configure, Deploy, Monitor, Manage); Hadoop Virtualization Extension; HDFS; Unified Storage Service; Sqoop; Data Loader; Flume.
Legend: Apache; Pivotal.

A Reference Architecture. Standardized, on-demand services are layered around shared data repositories and processing capabilities to form the data lake.
Ingest and data capture: scheduled, batch data ingest to capture bulk data sources; micro-batch ingest capturing small quantities of data; low-latency and real-time ingest of data; real-time routing of data to complex event processing and persistent storage.
Data sources: existing structured data; unstructured or semi-structured data sources; machine-generated data such as logs and sensor data; external data sources.
Applications and integration: Cloud Foundry on vSphere; build interactive, data-driven applications using modern frameworks and approaches.
Data analytics: in-memory performance (GemFire); MPP processing (Pivotal HD); high-performance SQL access to HDFS data (HAWQ).
Shared storage and re-use: Isilon and ViPR provide shared access to new and existing data sources through HDFS; minimize data copies; Smart De-dupe for Hadoop; Kerberos authentication.
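The three ingest styles listed above differ mainly in how many events are grouped before they are handed on. A minimal sketch of micro-batching, assuming a generic iterable of events; the function name and the batch sizes are illustrative and not taken from any EMC or Pivotal product:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of events into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left when the stream ends.
        yield batch


if __name__ == "__main__":
    sensor_readings = range(5)  # hypothetical machine-generated events
    for group in micro_batches(sensor_readings, batch_size=2):
        print(group)
```

With batch_size set to 1 every event is forwarded as it arrives, which approximates the low-latency case. A very large batch_size behaves like a scheduled bulk load. Micro-batch ingest sits between the two.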

What about services? Data Science + Data Engineering