ETL on Hadoop: What is Required


ETL on Hadoop: What is Required
Keith Kohl, Director, Product Management
October 2012
Copyright 2012, Syncsort Incorporated

Agenda
- Who is Syncsort
- Extract, Transform, Load (ETL): overview and conventional challenges
- Use case
- Challenges on Hadoop
- Requirements on Hadoop
- DMExpress Hadoop: overview and architecture
  - Capture and extract data into/out of Hadoop
  - Pluggable sort
  - Improve user productivity
  - Seamlessly improve Hadoop MapReduce performance
- Use case using the plug-in
- Summary

What's Needed to Drive Success for Hadoop
- Enterprise tooling to become a complete data platform:
  - Open deployment & provisioning
  - Higher-quality data loading
  - Monitoring and management
  - APIs for easy integration
- Ecosystem needs support & development:
  - Existing infrastructure vendors need to continue to integrate
  - Apps need to continue to be developed on this infrastructure
  - Well-defined use cases and solution architectures need to be promoted
- Market needs to rally around core Apache Hadoop:
  - To avoid splintering/market distraction
  - To accelerate adoption
Source: Hortonworks, 2012

About Syncsort
- Multinational software company: founded 1968, operating today in North America, Europe and Asia
- 40+ years of performance innovation; 25+ issued & pending patents
- Large global customer base: leading provider of Data Integration and Data Protection solutions to businesses and governments worldwide; 15,000+ deployments in 68 countries
- Syncsort Data Integration offerings:
  - DMExpress: family of high-performance, purpose-built Data Integration solutions for integrating, optimizing and migrating Big Data
  - MFX: high-performance sort solutions for z/OS and SAS mainframe environments
- Industries served: data services, finance, insurance & healthcare, public sector, telecommunications, retail

What is ETL?
[Diagram: sources such as Oracle, files, XML, ERP and mainframe feed an ETL layer that loads a data warehouse, data marts and real-time data marts; both Extract-Transform-Load and Extract-Load-Transform orderings are shown.]
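The extract-transform-load pattern above can be sketched in a few lines. This is a minimal illustrative sketch, not how any of the products in this deck work internally; the source format and field names are invented.

```python
# Minimal sketch of the Extract -> Transform -> Load pattern.
# Data layout and field names are invented for illustration.

RAW_ROWS = [
    "1001,acme corp,2500.00",
    "1002,  Globex ,1200.50",
    "1003,initech,",          # missing amount -> rejected in transform
]

def extract(rows):
    """Parse raw CSV-style lines into dicts."""
    for line in rows:
        cust_id, name, amount = line.split(",")
        yield {"id": cust_id, "name": name, "amount": amount}

def transform(records):
    """Cleanse names, convert types, filter bad records."""
    for rec in records:
        if not rec["amount"].strip():
            continue                      # drop rows with no amount
        yield {
            "id": int(rec["id"]),
            "name": rec["name"].strip().title(),
            "amount": float(rec["amount"]),
        }

def load(records, target):
    """Append clean records to the target (here, an in-memory list)."""
    target.extend(records)
    return target

warehouse = load(transform(extract(RAW_ROWS)), [])
print(warehouse)
```

The same three stages appear in every diagram in this deck; only the engine executing them (database, ETL server, or Hadoop cluster) changes.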

Big Data Is Pushing the Limits of Conventional Data Integration
Big Data:
- VOLUME: unprecedented creation of data
- VELOCITY: shrinking batch windows
- VARIETY: unstructured data, mobile devices, cloud
Conventional DI solutions:
- UNDERPERFORMING: never designed or optimized for today's increasing performance requirements
- INEFFICIENT: require massive investments in hardware, storage and database capacity
- COMPLICATED: heavy architecture and disparate tools make it difficult to develop, tune, and maintain
The result: exponential hardware and storage costs, more IT effort to develop and tune, and longer development cycles.

ETL Use Case Using Hadoop
[Diagram: operational systems (DBMS, files, XML, ERP, mainframe) import into Hadoop; Hadoop exports to the data warehouse / analytical database feeding BI applications.]
- Hadoop is replacing existing or conventional ETL processes where the ETL layer and/or data warehouse cannot handle the data volumes or processing
- Hadoop is a huge sink of cheap storage and processing
- Hadoop is used as an ETL layer to process very large volumes of data before loading them into the data warehouse:
  - Web log processing
  - Filtering/cleansing
  - Aggregations: aggregates are built in Hadoop and exported
http://www.slideshare.net/brocknoland/common-and-unique-use-cases-for-apache-hadoop
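The web-log example above can be sketched as a tiny filter-then-aggregate pipeline. This is a conceptual sketch only (the log format is invented); on a real cluster each stage would run distributed across data nodes.

```python
# Sketch of the "Hadoop as ETL layer" use case: filter/cleanse raw
# web logs, then build the small aggregate that would be exported
# to the warehouse. Log format is invented for illustration.

from collections import Counter

RAW_LOG = [
    "2012-10-01 /home 200",
    "2012-10-01 /home 200",
    "corrupted line",
    "2012-10-01 /products 404",
    "2012-10-02 /home 200",
]

def cleanse(lines):
    """Keep only well-formed, successful (HTTP 200) requests."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[2] == "200":
            yield parts[1]          # the page path

def aggregate(pages):
    """Hits per page: the compact result exported to the warehouse."""
    return Counter(pages)

hits = aggregate(cleanse(RAW_LOG))
print(hits)
```

The point of the pattern is data reduction: the raw logs stay in cheap Hadoop storage while only the much smaller aggregate is loaded into the warehouse.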

ETL on Hadoop Challenges: Load and Extract into Hadoop
- Structured data stores (RDBMS, enterprise data warehouses, NoSQL):
  - Sqoop: a tool to automate data transfer between structured data stores and Hadoop
  - Command-line interface, no GUI
  - JDBC-based connectivity

    sqoop import --connect jdbc:mysql://localhost/acmedb \
      --table ORDERS --username test --password ****

- Files:
  - Hadoop shell commands put and get
  - Command-line interface, no GUI
  - Serial interface to Hadoop

    hadoop dfs -put localfile /user/hadoop/hadoopfile

BUT:
- Requires manual scripting/coding
- Relatively slow
- No pre-processing capabilities: without ETL (cleanse, reformat, filter), it's garbage in, garbage out
- How do you prepare data for optimum performance in Hadoop?

ETL on Hadoop Challenges: MapReduce Development & Maintenance
- Lots of manual coding: Java, Pig
- Limited skills available; heavy learning curve
- No GUI
- Limited metadata capabilities
- No mainframe connectivity

ETL on Hadoop Challenges: Performance Tuning
- Developers and operations need to be aware of Hadoop configurations: memory and output size (especially mapper output)
- Hundreds of configuration tuning parameters: Hadoop configuration, JVM configuration
- Understanding disk I/O speeds, memory allocations, cache and spill-over
- Scheduling (i.e., simultaneous tasks)

ETL Requirements on Hadoop
- Comprehensive transformation capabilities: aggregations, cleansing, filtering, reformatting, lookups, data-type conversions, date/time and string manipulations, arithmetic, and so on
- Connectivity:
  - Native mainframe data access and conversion capabilities
  - Heterogeneous database management system (DBMS) access on the cluster for reading into Hadoop, loading warehouses from Hadoop, and sourcing data for lookups and other transformations
- Improved user productivity:
  - Graphical development environment that minimizes the need for Java coding and Pig script development
  - Built-in metadata capabilities that enable greater transparency into impact analysis, data lineage, and execution flow, and facilitate re-use
- Optimized MapReduce processing, seamlessly: reduces tuning for developers and operations
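Most of the transformation capabilities in the first bullet can be shown in one small pipeline. This sketch is purely illustrative; the record layout and lookup table are invented, and a real ETL engine would apply these operations in parallel across the cluster.

```python
# Illustrative sketch of the listed transformation capabilities:
# type conversion, filtering, cleansing, lookup, date handling,
# and aggregation. Record layout and lookup table are invented.

from collections import defaultdict
from datetime import datetime

COUNTRY_LOOKUP = {"us": "United States", "de": "Germany"}

raw = [
    {"country": "US", "sale": "19.99", "ts": "2012-10-01"},
    {"country": "de", "sale": "5.00",  "ts": "2012-10-01"},
    {"country": "US", "sale": "-1",    "ts": "2012-10-02"},  # filtered out
]

totals = defaultdict(float)
for rec in raw:
    amount = float(rec["sale"])                            # type conversion
    if amount <= 0:                                        # filtering
        continue
    country = COUNTRY_LOOKUP[rec["country"].lower()]       # cleanse + lookup
    day = datetime.strptime(rec["ts"], "%Y-%m-%d").date()  # date handling
    totals[(country, day)] += amount                       # aggregation

print(dict(totals))
```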


DMExpress Hadoop Overview
Install in minutes. Deploy in weeks. Never tune again.
- User interface: task editor, job editor, template-driven design
- Metadata: shared file-based metadata repository, impact analysis, data lineage, metadata interchange, global search
- High-performance connectivity: files/XML, mainframe, appliances, real-time, cloud, Hadoop, SDK
- DMExpress server engine: high-performance transformations and functions, automatic continuous optimization, small-footprint ETL engine, self-tuning optimizer, native direct I/O access

Loading and Extracting Hadoop (HDFS)
DMExpress delivers high-performance connectivity and pre-processing capabilities to optimize Hadoop environments.
- Extract: connect to virtually any source (files, XML, appliances, mainframe, RDBMS, cloud)
- Preprocess & compress: sort, aggregate, join, partition and compress; pre-process data to cleanse, validate, and partition for better and faster Hadoop processing and significant storage savings
- Load: load data to HDFS data nodes up to 6x faster
[Chart: elapsed load time (min), HDFS put vs. DMExpress]

Connecting to HDFS

HDFS Connectivity
- DMExpress on the ETL server reads the input and writes to HDFS
- Partition the output for parallel loading
- Makes full use of network bandwidth with reduced elapsed time
- Hadoop/DMExpress can process wildcard input files from HDFS

DMExpress Accelerates HDFS Loading
- HDFS load using DMExpress vs. plain HDFS load; 20 partitions; uncompressed input file sizes from 100 GB to 2,100 GB
- Cluster: 10+1+1 nodes; Hadoop distribution CDH4; HDFS block size 256 MB
- Hardware (per node): Red Hat EL 5.8, 2x Intel Xeon X5670, 94 GB memory
6x faster!

Syncsort MapReduce Contribution to Apache Hadoop
- Allow an external sort to be plugged in
- Seamlessly accelerate MapReduce performance: replace the map output sorter and the reduce input sorter
- Improve developer productivity: develop MapReduce jobs via the DMExpress GUI (aggregations, cleansing/filtering, reformatting, etc.)
Native sort:
- Not modular
- Limited capabilities
- Difficult to fine-tune & configure (requires coding & compilation)
Contribution:
- Modular, extensible
- Configurable through the use of external sorters on MapReduce nodes
https://issues.apache.org/jira/browse/MAPREDUCE-2454

Apache Hadoop Pluggable Sort
The main MapReduce code changes enable sorting to be done by external implementations:
- MapTask.java: modified the class MapOutputCollector (which does the sorting on the map side); the intermediate output of the external sorter complies with the Hadoop framework's sorter output format
- ReduceTask.java: modified so that when an external sorter is plugged in, the shuffled data is sent to the external sorter
- The Hadoop framework code for shuffling is not modified
Replaces up to 4 phases in the MapReduce flow:
- Mapper output sorter (seamless)
- Partitioner (optional)
- Combiner (optional)
- Reduce input sorter/merger (seamless)
https://issues.apache.org/jira/browse/MAPREDUCE-2454
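The idea behind the pluggable sort can be modeled in a few lines: the framework holds the collection and shuffle logic fixed, while the sort step is a swappable component that must honor the same contract (output sorted by key). This is a toy Python model, not Hadoop's actual Java API; the class and function names are illustrative only.

```python
# Toy model of the pluggable-sort idea: the collector's sort step is
# a swappable component. Names are illustrative, not Hadoop's API.

import heapq

def default_sorter(pairs):
    """Stand-in for the framework's built-in map-output sorter."""
    return sorted(pairs)

def external_sorter(pairs):
    """Stand-in for a plugged-in external sorter (heap-based here)."""
    heap = list(pairs)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

class MapOutputCollector:
    """Collects map output; the sort implementation is pluggable."""
    def __init__(self, sorter=default_sorter):
        self.sorter = sorter
        self.buffer = []

    def collect(self, key, value):
        self.buffer.append((key, value))

    def flush(self):
        # Contract: output must be sorted by key no matter which
        # sorter implementation is plugged in.
        return self.sorter(self.buffer)

pairs = [("b", 2), ("a", 1), ("c", 3), ("a", 4)]
results = []
for sorter in (default_sorter, external_sorter):
    collector = MapOutputCollector(sorter)
    for k, v in pairs:
        collector.collect(k, v)
    results.append(collector.flush())
print(results[0] == results[1])   # both sorters satisfy the contract
```

Because the contract is identical, downstream shuffle and reduce code is untouched when a faster sorter is swapped in, which is exactly the "seamless" claim on the previous slide.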

Improve User Productivity
Use DMExpress to accelerate development and maintenance of MapReduce jobs.
MapReduce development:
- Lots of manual coding: MapReduce, Pig, Java
- Limited skills supply
- Heavy learning curve
DMExpress Hadoop Edition:
- No coding required; full ETL capabilities
- Leverages the same skills most IT organizations already have
- New resources can be trained in just 3 days

Extracting from Hadoop to Load the Data Warehouse
- Use a reducer to write/load to the warehouse: load data from the Hadoop cluster directly to the analytic database/warehouse (MPP)
- No need to land data in HDFS and then extract
- Use native load utilities from Hadoop: Teradata TTU/TPT, GP, Vertica, Netezza, Oracle, etc.
- Note: control the number of reducers

Improve MapReduce Performance
- DMExpress Hadoop runs natively on each data node in the cluster (DMExpress is installed on each data node); same benefits as high-performance ETL
- DMExpress Hadoop does not generate MR code (i.e., Java, Pig, Python)
- Issues with code generation:
  - Can't change the code without breaking the link
  - Requires re-compilation with every change
  - May still require MR skills
  - Always issues with the efficiency of generated code

DMExpress Hadoop TeraSort Benchmark
- TeraSort benchmark with GZIP compression; uncompressed input file sizes from 100 GB to 4 TB
- Cluster: 10+1+1 nodes; Hadoop distribution CDH3; HDFS block size 256 MB
- Hardware (per node): Red Hat EL 5.8, 2x Intel Xeon X5670, 94 GB memory
[Chart: elapsed time (min) vs. file size (GB), native sort vs. DMExpress]
Over 2x faster!

DMExpress Hadoop ETL Benchmark
- ETL benchmark: filter & aggregation over TPC-H data; uncompressed input file sizes from 100 GB to 2.4 TB
- Cluster: 10+1+1 nodes; Hadoop distribution CDH3u2; HDFS block size 256 MB
- Hardware (per node): Red Hat EL 5.8, 2x Intel Xeon X5670, 94 GB memory
[Chart: TPC-H aggregation elapsed time (sec) vs. file size (GB) for Java, Pig and DMExpress]
Almost 2x faster than Java; over 2x faster than Pig.

Example of MR Using the External Sort Plug-in
This is not a customer reference, purely a use case of the external plug-in and DMExpress.
- Goal: aggregate tagged tracks
- Published listening data from last.fm via JSON: per track listened to, a timestamp, a list of similar tracks (a big list), a list of tags, artist, track title, etc.
- Tags: Romantic, Ambient, Love, Pop, 90s, Bicycle, Hard workout, etc.
- Wrote a mapper in Java to parse the JSON and pivot the data: 1 record per tag with track info (multiple tags per track)
- Used the map sort step, via the plug-in, to call DMExpress: aggregate by tag
- Reduce sort step: aggregation to pull all of the combined map outputs together
- Still have a reduce step, which could call Java to do something (e.g., convert to XML)
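The parse-pivot-aggregate flow described above can be sketched compactly. This is a Python illustration of the job's logic only (the original mapper was Java and the aggregation ran in DMExpress via the plug-in); the sample records are invented, with fields following the slide's description.

```python
# Sketch of the job's logic: parse JSON listening records, pivot to
# one record per tag, then aggregate by tag. Sample data is invented.

import json
from collections import Counter

records = [
    '{"track": "Song A", "artist": "X", "tags": ["Pop", "90s"]}',
    '{"track": "Song B", "artist": "Y", "tags": ["Pop", "Love"]}',
    '{"track": "Song C", "artist": "X", "tags": ["Ambient"]}',
]

def mapper(line):
    """Parse JSON and pivot: emit one (tag, track) pair per tag."""
    rec = json.loads(line)
    for tag in rec["tags"]:
        yield tag, rec["track"]

# Aggregate by tag -- the role played by the sort + reduce steps.
counts = Counter(tag for line in records for tag, _ in mapper(line))
print(counts.most_common())
```

The pivot is the key step: a track with three tags becomes three records, so the subsequent sort groups everything for one tag together and the per-tag aggregate falls out naturally.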

Faster Connectivity. Optimized Processing. Easier Development.
Connect:
- Load & extract from HDFS; partition output for parallel loading
- 16:1 compression (gzip); optimum network I/O utilization
- Process wildcard input files from HDFS
- 6x faster!
Optimize:
- Easily implement & process optimized MapReduce
- No coding; leverage existing skills
- Increased throughput by more than 2x
- No code generation, no compiling
[Chart: TPC-H aggregation elapsed time (sec) vs. file size (GB) for Java, Pig and DMExpress; 2x faster!]

Thank you!

DMExpress Hadoop Edition
Accelerate development. Optimize performance. Reduce TCO.
- Fast: reduce elapsed processing time for existing Hadoop deployments
- Efficient: reduce resource utilization (CPU, memory, I/O); enhance scalability
- Simple: no tuning, no coding, no MapReduce scripting; seamless plug-in
- Cost-effective: lower hardware costs; fewer IT staff hours wasted on constant coding & tuning; increased IT staff productivity