Hadoop Integration Deep Dive

Similar documents
A Unified Data Platform for Big Data & Cognitive

BIG DATA AND HADOOP DEVELOPER

IBM Spectrum Scale. Recent Updates and Outlook. Meet the Devs Oxford Feb 24, 2016 Ulf Troppens

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

Hortonworks HDP with IBM Spectrum Scale

IBM Spectrum Scale. Advanced storage management of unstructured data for cloud, big data, analytics, objects and more. Highlights

Exploring Big Data and Data Analytics with Hadoop and IDOL. Brochure. You are experiencing transformational changes in the computing arena.

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

MapR: Solution for Customer Production Success

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

BIG DATA ANALYTICS WITH HADOOP. 40 Hour Course

E-guide Hadoop Big Data Platforms Buyer s Guide part 3

Hadoop Course Content

Contents at a Glance COPYRIGHTED MATERIAL. Introduction... 1 Part I: Getting Started with Big Data... 7

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

Hadoop Administration Course Content

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Cognitive Solutions in the Context of IBM Systems Cognitive Analytics / Integration Scenarios / Use Cases

Transforming Analytics with Cloudera Data Science WorkBench

Big Data & Hadoop Advance

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

Analytics Platform System

IBM Spectrum Scale. Computing Insight UK Patrick Byrne Cameron McAllister Dec 2015

IBM Big Data Summit 2012

Modernizing Your Data Warehouse with Azure

Intro to Big Data and Hadoop

Business is being transformed by three trends

Deloitte School of Analytics. Demystifying Data Science: Leveraging this phenomenon to drive your organisation forward

Hadoop Roadmap 2012 A Hortonworks perspective

Microsoft Azure Essentials

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

Hortonworks Data Platform

2017 IBM Elastic Storage Server - Update

Spark, Hadoop, and Friends

Hadoop on Shared, Software-defined Storage

BigInsights on Cloud. Mike Nobles Executive, BigInsights Solution Specialist WW Technical Sales, Cloud Data Services

APAC Big Data & Cloud Summit 2013

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

SAS and Hadoop Technology: Overview

20775A: Performing Data Engineering on Microsoft HD Insight

IBM High Performance Services for Hadoop

5th Annual. Cloudera, Inc. All rights reserved.

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Cloud Based Analytics for SAP

Modern Data Architecture with Apache Hadoop

20775 Performing Data Engineering on Microsoft HD Insight

Rapid Start with Big Data Appliance X6-2 Technical & Operational Overview

20775: Performing Data Engineering on Microsoft HD Insight

Architecture Optimization for the new Data Warehouse. Cloudera, Inc. All rights reserved.

Accelerating Computing - Enhance Big Data & in Memory Analytics. Michael Viray Product Manager, Power Systems

IBM Spectrum Scale. What s new in Mathias Dietz Spectrum Scale RAS Architect

1. Intoduction to Hadoop

Hortonworks Data Platform for Enterprise Data Lakes delivers robust, big data analytics that accelerate decision making and innovation

By: Shrikant Gawande (Cloudera Certified )

Simplifying Hadoop. Sponsored by. July >> Computing View Point

20775A: Performing Data Engineering on Microsoft HD Insight

Analyze Big Data Faster and Store it Cheaper. Dominick Huang CenterPoint Energy Russell Hull - SAP

Five Questions to Ask Before Choosing a Hadoop Distribution

Optimal Infrastructure for Big Data

BIG DATA TRANSFORMS BUSINESS. The EMC Big Data Solution

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN

Big Business Value from Big Data and Hadoop

IBM Storage Reference Architecture for AI applied to Autonomous Driving (AD)

The Intersection of Big Data and DB2

IBM Db2 Warehouse. Hybrid data warehousing using a software-defined environment in a private cloud. The evolution of the data warehouse

Cloudy Jigsaw Puzzles

Louis Bodine IBM STG WW BAO Tiger Team Leader

StackIQ Enterprise Data Reference Architecture

Big Data in Cloud. 堵俊平 Apache Hadoop Committer Staff Engineer, VMware

Big data is hard. Top 3 Challenges To Adopting Big Data

Big Data The Big Story

HADOOP ADMINISTRATION

Bringing the Power of SAS to Hadoop Title

Trifacta Data Wrangling for Hadoop: Accelerating Business Adoption While Ensuring Security & Governance

Insights to HDInsight

Adobe Deploys Hadoop as a Service on VMware vsphere

SAP Cloud Platform Big Data Services EXTERNAL. SAP Cloud Platform Big Data Services From Data to Insight

Multi-cloud with Transparent Cloud Tiering

DataAdapt Active Insight

Setup HSP Ambari Cluster

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Enabling Hybrid Cloud Storage for IBM Spectrum Scale Using Transparent Cloud Tiering

COPYRIGHTED MATERIAL. 1Big Data and the Hadoop Ecosystem

Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Sr. Sergio Rodríguez de Guzmán CTO PUE

Cask Data Application Platform (CDAP)

Apache Mesos. Delivering mixed batch & real-time data infrastructure , Galway, Ireland

GET MORE VALUE OUT OF BIG DATA

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake

New Approach for scheduling tasks and/or jobs in Big Data Cluster

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses

Two offerings which interoperate really well

The IBM Reference Architecture for Healthcare and Life Sciences

Apache Hadoop in the Datacenter and Cloud

Big Data Hadoop Administrator.

Big Data Hadoop Administrator Training

Microsoft Big Data. Solution Brief

Transcription:

Hadoop Integration Deep Dive Piyush Chaudhary Spectrum Scale BD&A Architect 1

Agenda Analytics Market overview Spectrum Scale Analytics strategy Spectrum Scale Hadoop Integration A tale of two connectors Old GPFS Hadoop connector New Spectrum Scale HDFS Transparency connector 2

IDC s Business Analytics Broader Workload Perspective Conclusions Analytics market opportunity for Spectrum Scale is broader than just the Hadoop Big Data Market IDC definition beginning to account for new workloads NEED to consider a wide range of Analytics workloads in both Shared nothing and shared storage environments 3

Structured and Unstructured Data Market Conclusion Spectrum Scale addresses the sweet spot of the growth segment for Enterprise Storage Systems. 4

People and devices are creating oceans of data Data scientists and LoBs want to fish this ocean for insights The ocean of data must be efficiently stored, managed, protected, and exploited by the right applications at the right time 5 5 5

Spectrum Scale Vision Client workstations Users and applications Compute Farm Single name space POSIX NFS SMB/CIFS Hadoop/ HDFS OpenStack Cinder Manila Swift Glance Site B Site A IBM Spectrum Scale Automated data placement and data migration Site C Flash Disk Tape Storage Rich Servers 6 6

Multi-structured Data Mashups provide the Greatest Enterprise Value Systems of Record Structured data from operational systems 20% of all data generated Systems of Insight Diverse data types that combine structured and unstructured data for business insight Systems of Engagement Data that connects companies with their customers, partners and employees 80% of all data generated Data Warehouses Structured Data Small Data Transaction data Clearly formatted Quantitative ERP Data Objective Electronic Health Records Logical Puzzle Mainframe Data Repeatable OLTP System Data linear Advanced Analytics Context Accumulation Enterprise Integration Hadoop and Streams Audio Documents Images RFID Unstructured Data Big Data Language based Qualitative Subjective Emails Intuitive Sensors Mystery Social Data Exploratory Video dynamic Web Logs Traditional Sources New Data Sources 7

A Tale of Two Connectors GPFS Hadoop Connector Spectrum Scale HDFS Transparency Connector Henceforth known as the old connector Emulates a Hadoop compatible filesystem i.e. replaces HDFS Stateless Free download link Supports Spectrum Scale 4.1.1 and 4.2 Currently supported with IOP 4.0 and 4.1 Integrated with Ambari (IOP 4.1) Also supported with Open Source Apache Hadoop Henceforth known as the new connector Integrates with HDFS reuses HDFS client and implements NameNode and DataNode RPCs Stateless Free download link Supports Spectrum Scale 4.1.1 and 4.2 Planned support for IOP 4.1 and 4.2 (6 weeks after GA) Ambari integration being developed for both IOP 4.1 and 4.2 (coming soon) 8

Old GPFS Hadoop Connector Approach How can we be sure we re compatible? Hadoop File System API intended to be open. public abstract class org.apache.hadoop.fs.filesystem Source: hadoop.apache.org All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem object. Latest File System APIs are described here: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/filesystem.html 9

Old GPFS Hadoop Connector Approach All based on org.apache.hadoop.fs.filesystem API Spectrum Scale (GPFS) is no different Source: https://wiki.apache.org/hadoop/hcfs 10 10

Old GPFS Hadoop Connector Approach Applications communicate with Hadoop using FileSystem API. Therefore, transparency is preserved. Hadoop Application API level Hadoop Application Hadoop FileSystem API HDFS Hadoop level Hadoop FileSystem API GPFS Hadoop Connector Ext4 Disk Disk Disk Kernel level file system Spectrum Scale FPO Disk Disk Disk All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem object. Source: hadoop.apache.org 11 11

New Spectrum Scale HDFS Transparency Design Applications hdfs://hostnamex:portnumber Hadoop client Hadoop client Hadoop client Higher-level languages: Hive, BigSQL JAQL, Pig Hadoop FileSystem API Hadoop FileSystem API Hadoop FileSystem API HDFS Client HDFS Client HDFS Client Map/Reduce API Hadoop FS APIs HDFS RPC over network HDFS Client HDFS RPC Spectrum Scale Connector Server Connector GPFS Connector Service Connector on libgpfs,posix API GPFS node GPFS Connector Service Connector on libgpfs,posix API GPFS node Spectrum Scale Power Linux Power Linux Supported Hadoop versions: 2.7.1 Commodity hardware Shared storage 12 12

New Spectrum Scale HDFS Transparency Design Each node is installed with connector datanode server Only one node is installed with connector namenode server Connector namenode server can be configured with HA similar to HDFS First released in November 2015 DFSClient DFSClient DFSClient hdfs://namenode:<portnumber> HDFS RPC over network Connector Namenode Service Connector Datanode Service Connector Datanode Service GPFS/FPO cluster Hadoop cluster GPFS FPO GPFS FPO GPFS FPO 13 13

New Spectrum Scale HDFS Transparency Design Shared storage support Connector servers are installed over limited nodes (ex. GPFS NSD servers) GPFS client is not needed on the Hadoop computing nodes First released in January 2016 14 14

New Spectrum Scale HDFS Transparency Design Key Advantages Support workloads that have hard coded HDFS dependencies Simpler integration for currently compatible workloads & components Leverage HDFS Client cache for better performance No need to install Spectrum Scale clients on all nodes Full Kerberos support for Hadoop ecosystem Recently released HDFS + Spectrum Scale Federation Federate multiple Spectrum Scale clusters Isolate multiple Hadoop clusters on the same filesystem (restrict to sub-directory) Coming Soon BigInsights 4.2 support (additional components) 15 15

Current Ambari Integration New BigInsights 4.1.SpectrumScale stack Inherits from BigInsights 4.1 stack Removes HDFS, add Spectrum Scale, change all dependencies Can install IOP + Spectrum Scale (either new GPFS filesystem or integrate with existing filesystem) Value Add integration Basic Spectrum Scale monitoring (AMS) Support separate connector control Support GPFS and connector upgrades Collect GPFS snap Change GPFS parameters Add new nodes Remove nodes Provide quick link to Spectrum Scale GUI for full management and monitoring 16 16

Current Ambari Integration 17 17

Ambari Integration with HDFS Transparency Biggest change is that there is no new stack Spectrum Scale is added as a new service after full IOP install with HDFS (use dummy directory / mount point for HDFS) Spectrum Scale service integrates with HDFS Will support un-integrate capability Flip back and forth between HDFS & GPFS Will not move data back and forth between HDFS & GPFS Will simplify future upgrades 18 18

IBM Systems Spectrum Scale Federation Use Case 1 Hadoop Platform MapReduce Zookeeper Hive HCatalog Pig YARN Spark HBase Flume Sqoop viewfs://<clustername> GPFS Hadoop Connector Solr/Lucene Make 2+ GPFS file systems appear as one uniform file system for application Allow customer to leverage existing GPFS file system and new GPFS file system for data analysis /gpfs/fs1 GPFS Disk Disk Disk /gpfs/fs2 GPFS Disk 19

IBM Systems Spectrum Scale Federation Use Case 2 Hadoop Platform MapReduce YARN Zookeeper Hive HCatalog Spark HBase Flume Sqoop viewfs://<clustername> Pig Solr/Lucene Allow customers to run applications over GPFS & HDFS seamlessly Doesn t require customers to move data from HDFS to GPFS GPFS Hadoop Connector /gpfs/fs1 GPFS hdfs://<host:port/> HDFS Disk Disk Disk Disk 20

IBM Systems Spectrum Scale Workload Isolation Hadoop Platform Hadoop Platform MapReduce Zookeeper Hive HCatalog Pig MapReduce Zookeeper Hive HCatalog Pig YARN Spark HBase Flume Sqoop Solr/Lucene YARN Spark HBase Flume Sqoop Solr/Lucene viewfs://<clustername1> viewfs://<clustername1> GPFS Hadoop Connector /gpfs/fs GPFS Disk Disk Disk Disk Disk Disk Disk Support multiple Hadoop clusters over the same GPFS filesystem for different applications, groups or LOBs 21

IBM Systems Spectrum Scale FPO QOS Support Node failure is often-seen events for commodity hardware Restripe for data protection is painful for customer with large data QOS for FPO - Reduce the restripe impact in FPO because of node failure 22

IBM Systems Revamped Spectrum Scale Wiki Links Spectrum Scale wiki Analytics Spectrum Scale wiki - FPO Spectrum Scale wiki Hadoop Spectrum Scale wiki HDFS Transparency connector 23