5th Annual. Cloudera, Inc. All rights reserved.

Similar documents
Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

Cloudera Hadoop & Industrie 4.0 wohin mit dem Datenstrom?

Modernizing Your Data Warehouse with Azure

Architecture Optimization for the new Data Warehouse. Cloudera, Inc. All rights reserved.

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

MapR: Solution for Customer Production Success

Sr. Sergio Rodríguez de Guzmán CTO PUE

Common Customer Use Cases in FSI

Intro to Big Data and Hadoop

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

Why Big Data Matters? Speaker: Paras Doshi

INDUSTRY BRIEF THE ENTERPRISE DATA HUB IN FINANCIAL SERVICES: THREE CUSTOMER CASE STUDIES

Realising Value from Data

Operational Hadoop and the Lambda Architecture for Streaming Data

Big data is hard. Top 3 Challenges To Adopting Big Data

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

How In-Memory Computing can Maximize the Performance of Modern Payments


Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Managing explosion of data. Cloudera, Inc. All rights reserved.

Data Analytics. Nagesh Madhwal Client Solutions Director, Consulting, Southeast Asia, Dell EMC

Taking Advantage of Cloud Elasticity and Flexibility

Hortonworks Connected Data Platforms

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Microsoft Azure Essentials

Datametica DAMA. The Modern Data Platform Enterprise Data Hub Implementations. What is happening with Hadoop Why is workload moving to Cloud

Transforming Analytics with Cloudera Data Science WorkBench

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Hadoop and Analytics at CERN IT CERN IT-DB

Big Data Hadoop Administrator.

MapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia

Insights-Driven Operations with SAP HANA and Cloudera Enterprise

Confidential

Analytics in Action transforming the way we use and consume information

Bringing the Power of SAS to Hadoop Title

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Hadoop in the Cloud. Ryan Lippert, Cloudera Product Cloudera, Inc. All rights reserved.

Oracle Big Data Cloud Service

Investor Presentation. Fourth Quarter 2015

Investor Presentation. Second Quarter 2016

Contents at a Glance COPYRIGHTED MATERIAL. Introduction... 1 Part I: Getting Started with Big Data... 7

Spark, Hadoop, and Friends

Optimal Infrastructure for Big Data

Leveraging Predictive Tools to Decrease Resolution Time

Insights to HDInsight

THE CIO GUIDE TO BIG DATA ARCHIVING. How to pick the right product?

Business is being transformed by three trends

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake

MapR Pentaho Business Solutions

Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect

Preface About the Book

Big Data Introduction

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

Spotlight Sessions. Nik Rouda. Director of Product Marketing Cloudera, Inc. All rights reserved. 1

H2O Powers Intelligent Product Recommendation Engine at Transamerica. Case Study

Copyright - Diyotta, Inc. - All Rights Reserved. Page 2

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Jason Virtue Business Intelligence Technical Professional

By: Shrikant Gawande (Cloudera Certified )

Building Your Big Data Team

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN

Make Business Intelligence Work on Big Data

Analytics Platform System

Angat Pinoy. Angat Negosyo. Angat Pilipinas.

Datametica. The Modern Data Platform Enterprise Data Hub Implementations. Why is workload moving to Cloud

Session 30 Powerful Ways to Use Hadoop in your Healthcare Big Data Strategy

Microsoft Big Data. Solution Brief

Teradata IntelliSphere

WELCOME TO. Cloud Data Services: The Art of the Possible

Big and Fast Data: The Path To New Business Value

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper

LEVERAGING DATA ANALYTICS TO GAIN COMPETITIVE ADVANTAGE IN YOUR INDUSTRY

DataAdapt Active Insight

GPU ACCELERATED BIG DATA ARCHITECTURE

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

Azure ML Data Camp. Ivan Kosyakov MTC Architect, Ph.D. Microsoft Technology Centers Microsoft Technology Centers. Experience the Microsoft Cloud

Big and Fast Data: The Path To New Business Value

E-guide Hadoop Big Data Platforms Buyer s Guide part 3

AMD and Cloudera : Big Data Analytics for On-Premise, Cloud and Hybrid Deployments

Bringing Big Data to Life: Overcoming The Challenges of Legacy Data in Hadoop

ETL challenges on IOT projects. Pedro Martins Head of Implementation

Modern Analytics Architecture

Cloudera Data Science and Machine Learning. Robin Harrison, Account Executive David Kemp, Systems Engineer. Cloudera, Inc. All rights reserved.

Cognitive Data Warehouse and Analytics

Trifacta Data Wrangling for Hadoop: Accelerating Business Adoption While Ensuring Security & Governance

Big Data The Big Story

The Importance of good data management and Power BI

Hortonworks Powering the Future of Data

BIG DATA AND HADOOP DEVELOPER

New Big Data Solutions and Opportunities for DB Workloads

Azure Data Analytics & Machine Learning Seminar. Daire Cunningham: BI Practice Area Manager

TechValidate Survey Report. Converged Data Platform Key to Competitive Advantage

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses

DLT AnalyticsStack. Powering big data, analytics and data science strategies for government agencies

Hadoop Stories. Tim Marston. Director, Regional Alliances Page 1. Hortonworks Inc All Rights Reserved

SAP Cloud Platform Big Data Services EXTERNAL. SAP Cloud Platform Big Data Services From Data to Insight

TechArch Day Digital Decoupling. Oscar Renalias. Accenture

Transcription:

5th Annual 1

The Essentials of Apache Hadoop The What, Why and How to Meet Agency Objectives Sarah Sproehnle, Vice President, Customer Success 2

Introduction 3

What is Apache Hadoop? Hadoop is a software framework for storing, processing, and analyzing big data Distributed Scalable Fault-tolerant Open source 4

A Large (and Growing) Ecosystem Impala 5

About Cloudera The leader in Apache Hadoop-based software and services Founded in 2008 by leading experts on Hadoop Over 1000 employees Global operations spanning over 20 countries Provides support, consulting, training, and certification for Hadoop users Employs committers to virtually every significant Hadoop-related project Many authors of industry standard books on Apache Hadoop projects Tom White, Lars George, Kathleen Ting, etc. 6

CDH CDH (Cloudera s Distribution, including Apache Hadoop) 100% open source, enterprise-ready distribution of Hadoop and related projects The most complete, tested, and widely-deployed distribution of Hadoop Integrates all the key Hadoop ecosystem projects 7

Vendor Integration System Integration Software and OEM Data Systems Platform & Cloud 8

How it Works 9

Traditional Large-Scale Computation Traditionally, computation has been processor-bound Relatively small amounts of data Lots of complex processing The early solution: bigger computers Faster processor, more memory But even this couldn t keep up 10

Distributed Systems The better solution: more computers Distributed systems use multiple machines for a single job In pioneer days they used oxen for heavy pulling, and when one ox couldn t budge a log, we didn t try to grow a larger ox. We shouldn t be trying for bigger computers, but for more systems of computers. Grace Hopper 11

Distributed Systems: The Data Bottleneck (1) Traditionally, data is stored in a central location Data is copied to processors at runtime Fine for limited amounts of data 12

Distributed Systems: The Data Bottleneck (2) Modern systems have much more data terabytes+ a day petabytes+ total We need a new approach 13

The Origins of Hadoop Hadoop is based on work done at Google in the late 1990s/early 2000s Google s problem: Indexing the entire web requires massive amounts of storage A new approach was required to process such large amounts of data Google s solution: GFS, the Google File System - described in a paper released in 2003 Distributed MapReduce - described in a paper released in 2004 Doug Cutting and others read these papers and implemented a similar, open source solution This is what would become Hadoop 14

What is Hadoop? Hadoop is a distributed data storage and processing platform Stores massive amounts of data in a very resilient way Distributes the processing to where the data is stored Tools built around Hadoop (the Hadoop ecosystem ) can be configured/extended to handle many different tasks Extract Transform Load (ETL) BI environment Predictive analytics Statistical analysis Machine learning 15

Hadoop is Scalable Adding nodes (machines) adds capacity proportionally Increasing load results in a graceful decline in performance Not failure of the system Capacity Number of Nodes 16

Hadoop is Fault Tolerant Node failure is inevitable What happens? System continues to function Master re-assigns work to a different node Data replication means there is no loss of data Nodes which recover rejoin the cluster automatically 17

The Hadoop Ecosystem (1) Many tools have been developed around Core Hadoop Known as the Hadoop ecosystem Designed to make Hadoop easier to use, or to extend its functionality All are open source The ecosystem is growing all the time 18

The Hadoop Ecosystem (2) Examples of Hadoop ecosystem projects (all included in CDH): Project Spark HBase Hive Impala Parquet Sqoop Flume, Kafka Solr Hue Sentry What does it do? In-memory and streaming processing framework NoSQL database built on HDFS SQL processing engine designed for batch workloads SQL query engine designed for BI workloads Very efficient columnar data storage format Data movement to/from RDBMSs Streaming data ingestion Powerful text search functionality Web-based user interface for Hadoop Authorization tool, providing security for Hadoop 19

Hadoop in the Real World 20

Five Really Popular Use Cases ETL Processing Business Intelligence Predictive Analytics Enterprise Data Hub Low-cost Storage of Large Data Volumes 21

# 1 Traditional ETL Processing ETL: Extract, Transform, Load Data Stream ETL Grid Data Warehouse BI Tools Challenges: Too much data Networked Storage Transactional Systems Takes too long Too costly Tape Storage Tape Storage 22

# 1 ETL Processing with Hadoop Data Stream Enterprise Data Warehouse Networked Storage ETL Analytics Real-time Hadoop cluster is used for ETL Often now ELT: Extract, Load, then Transform Structured and unstructured data is moved into the cluster Once processed, data can be analyzed in Hadoop or moved to the EDW 23

# 1 Vendor Integration - ETL For more information visit: http://www.cloudera.com/partners/partnerslisting.html 24

# 2 Traditional Business Intelligence Data Stream ETL Grid Data Warehouse BI Tools BI traditionally takes place at the data warehouse layer Networked Storage Transactional Systems Problem: EDW can t keep up with growing data volumes Performance declines Increasing capacity can be very expensive Tape Storage Tape Storage Archived data is not available for analysis 25

# 2 Business Intelligence with Hadoop Data Stream Enterprise Data Warehouse Networked Storage ETL Analytics Real-time BI tools can use Hadoop for much of their work Analyze all the data BI Tools Use the EDW for the tasks for which it is best suited 26

# 2 Vendor Integration - BI For more information visit: http://www.cloudera.com/partners/partnerslisting.html 27

# 3 Predictive Analytics Predictive Analytics (Eckerson Group definition) The use of statistical or machine learning models to discover patterns and relationships in data that can help business people predict future behavior or activity The Hadoop platform can run analytic workloads on large volumes of diverse data Statistical models can be created and run inside the Hadoop environment Entire data sets can be used to create models There is no need to sample data Hadoop provides an environment that makes self-service analytics possible No need for ETL developers to stage data for data scientists 28

# 3 - Predictive Analytics: Cerner Corporation Cerner Corporation Healthcare IT space Solutions and Services - Used by 54,000 medical facilities around the world The problem Healthcare data is fragmented and lives in silos The data was used for historical reporting The solution Build a comprehensive view of population health using a single platform Use predictive analytics to Improve patient outcomes Increase efficiency / Reduce costs 29

# 3 Vendor Integration Predictive Analytics For more information visit: http://www.cloudera.com/partners/partnerslisting.html 31

# 4 The Need for the Enterprise Data Hub Thousands of Employees & Lots of Inaccessible Information Heterogeneous Legacy IT Infrastructure EDWs Marts Servers Document Stores Storage Search Data Archives Silos of Multi- Structured Data Difficult to Integrate ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources 32

# 4 The Enterprise Data Hub: One Unified System Information & data accessible by all for insight using leading tools and apps Enterprise Data Hub Unified Data Management Infrastructure EDH EDWs Marts Servers Documents Storage Search Archives Ingest All Data Any Type Any Scale From Any Source ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources 33

# 5 - Low-cost Data Storage (1) Hadoop combines industry standard hardware and a fault tolerant architecture This combination provides a very cost effective data storage platform The data stored on Hadoop is protected from loss by HDFS Data replication ensures that no data is lost The self-healing nature of Hadoop ensures that data is available when you need it Hadoop enables users to store data which was previously discarded due to the cost of saving it Transactional Social media Sensor Click stream 34

# 5 - Low-cost Data Storage (2) The low cost of HDFS storage enables the following use-cases: Enterprise Data Hub (EDH) Active data archive Staging area for data warehouses Staging area for analytics store Sandbox for data discovery Sandbox for analytics 35

In Summary, Why Do You Need Hadoop? More data is coming Internet of things Sensor data Streaming More data means bigger questions, better answers Hadoop easily scales to store and handle all of your data Hadoop is cost-effective Provides a significant cost-per-terabyte saving over traditional, legacy systems Hadoop integrates with your existing datacenter components Answer questions that you previously could not ask 36

Thank You 37