Hadoop Roadmap 2012: A Hortonworks perspective


Hadoop Roadmap 2012: A Hortonworks perspective
Eric Baldeschwieler, CTO, Hortonworks
Twitter: @jeric14, @hortonworks
February 2012. Page 1

About Eric Baldeschwieler
- Co-Founder and CTO of Hortonworks
- Prior to Hortonworks: VP Hadoop Software Engineering for Yahoo!
- Technology leader for Inktomi's web service engine (acquired by Yahoo! in 2003)
- Developed software for video games, video post-production systems, and 3D modeling systems
- Master's degree in Computer Science, University of California, Berkeley
- Bachelor's degree in Mathematics and Computer Science, Carnegie Mellon University
Follow me on Twitter: @jeric14. Page 2

Agenda
- Hortonworks Data Platform
- Community roadmap process
- Hortonworks development process & support model
- 2012 roadmap highlights
- Observed trends and anticipated investment areas
Page 3

Apache Hadoop & Hortonworks
http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop
"Yahoo! embraced Apache Hadoop, an open source platform, to crunch epic amounts of data using an army of dirt-cheap servers."
2006: Hadoop at Yahoo! - 40K+ servers, 170 PB storage, 5M+ monthly jobs, 1000+ active users
2011: Yahoo! spun off 22+ engineers into Hortonworks, a company focused on enabling Apache Hadoop to be the next-generation data platform
Page 4

Balancing Innovation & Stability
Apache: Be aggressive - ship early and often
- Projects need to keep innovating and visibly improve
- Aim for big improvements
- Make early, buggy releases
Hortonworks: Be predictable - ship when stable
- We need to ship stable, working releases
- Make packaged binary releases available
- We need to do regular sustaining engineering releases
HDP quarterly release trains sweep in stable Apache projects. This lets HDP stay reasonably current and predictable while minimizing the risk of thrashing that coordinating a large number of Apache projects can cause.
Page 5

The Hortonworks Development Process
The Hortonworks Data Platform is 100% open source.
- We plan our engineering in the open
- We take feedback and use it to refine our plans
- We develop enhancements to Hadoop in collaboration with the Apache community, via the Apache processes & infrastructure
- Others contribute their own improvements and refine our work
- We iterate, and we decide as a community when a release is ready
Page 6

This presentation is part of that process
- Apache Hadoop and related projects are owned by the Apache foundation and worked on by a host of volunteers
- This presentation outlines what Hortonworks engineers currently plan to contribute to Apache projects
- We share our plans regularly with Apache contributors, the wider Apache Hadoop user community, and our business partners & customers
- We listen and we refine the plan: others share their plans & help us refine our thinking, our partners express their needs & requirements, and the plan evolves
Page 7

Hortonworks Data Platform (HDP): Fully Supported, Integrated Platform
Challenge: Integrate, manage, and support changes across a wide range of open source projects that power the Hadoop platform (Hadoop Core, Pig, ZooKeeper, Hive, HCatalog, HBase), each with its own release schedules, versions, & dependencies. Time intensive, complex, expensive.
Solution: Hortonworks Data Platform
- Integrated, certified platform distributions
- 100% open source
- Extensive QA process
- Industry-leading support with clear service levels for updates and patches
- Continuity via multi-year Support and Maintenance Policy
Page 8

Support & Distribution Model
Hortonworks Data Platform: fully supported, integrated, tested, maintained; 100% Apache license or compatible (BSD, MIT/X11, NCSA, W3C Software License, X.Net)
Universe (open source ecosystem): validated & interoperable with HDP; technical guidance support, working with the OSS projects; 100% OSI-compliant licenses; optionally installed
Multiverse (commercial ecosystem): validated & interoperable with HDP; technical guidance support, working with TSANet; 3rd-party vendor licenses and support options; optionally installed
Model and terminology conceptually similar to Ubuntu's model: http://www.ubuntu.com/project/about-ubuntu/licensing
Page 9

Hortonworks Data Platform (HDP): Key Components of the Standard Open Source Hadoop Stack
Hortonworks Data Platform:
- HDFS (Hadoop Distributed File System)
- MapReduce (distributed programming framework)
- Pig (data flow), Hive (SQL)
- HCatalog (table & schema management)
- HBase (columnar NoSQL store)
- ZooKeeper (cluster coordination)
- Ambari, with Ganglia and Nagios
Universe:
- Oozie (workflow scheduling)
- Sqoop & other ingest / ETL tools
- Mahout & other libraries
Page 10

Hadoop Now, Next, and Beyond
The Apache community, including Hortonworks, is investing to improve Hadoop: make Hadoop an open, extensible, and enterprise-viable platform, and enable more applications to run on Apache Hadoop.
- Hadoop.Now (Hadoop 1.0) - HDP1: most stable Hadoop ever; HBase, security, WebHDFS
- Hadoop.Next (Hadoop 0.23) - HDP2: HA, next-gen MapReduce, extension & integration APIs, HCatalog data APIs
- Hadoop.Beyond: future investments
Page 11

Hortonworks Data Platform Timeline (2012, by quarter)
- Hortonworks Data Platform 1: 1.0 previews leading to the 1.0 release, followed by a 1.1 preview, 1.1, 1.2, and 1.3
- Hortonworks Data Platform 2: 2.0 preview, then 2.0 and 2.1
- 36-month support policy, from GA date
Page 12

Hortonworks Data Platform 1: Consumable Hadoop.Now Platform
Based on Hadoop 1.0 (a.k.a. 0.20.205) - the most stable release of Hadoop, ever!
Differentiators:
- Code straight from Apache; the Apache release process restarted after a hiatus since 2010!
- First Apache line supporting security, HBase, WebHDFS, and many stability fixes
- Common table and schema management via HCatalog (MapReduce, Pig, and Hive)
- Capacity scheduler (very stable, high-RAM job support, multi-tenant protections)
Components:
- Standard Hadoop stack: HDFS + MapReduce, Hive, Pig, HBase, HCatalog, ZooKeeper
- Universe items: Sqoop, Oozie, Mahout
- Packaging (.tar, RPM, DEB, single-box VM, single-box AMI, ...)
- Monitoring via Nagios and Ganglia
Page 13
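Of the HDP1 differentiators above, WebHDFS is the most directly visible to client code: it exposes HDFS over plain HTTP from the NameNode's web port. Here is a minimal sketch of listing a directory over WebHDFS; the host, port, path, and user name are placeholders, and a Kerberos-secured cluster would need additional authentication steps.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        // LISTSTATUS returns a JSON description of the directory's children.
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/user/demo?op=LISTSTATUS&user.name=demo");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Raw JSON: {"FileStatuses":{"FileStatus":[...]}}
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The same REST endpoints cover operations such as OPEN, CREATE, and MKDIRS, which is what lets tools without a Hadoop client library read and write HDFS.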

Continuing investment in the 1.0 line
- Quarterly updates will include bug fixes and enhancements to all components, from Apache
- Hortonworks plans to invest substantially in:
- Ambari and other management components: the Hadoop community has traditionally not provided good open source tooling in this area, and we are committed to closing this gap
- HCatalog: when fully realized, we think HCatalog will have a huge impact, vastly simplifying the management of data on Hadoop
Page 14

Management in HDP 1.0 and beyond
Now: focus on monitoring capability (most often used)
- Package Nagios & Ganglia, the most common open source Hadoop monitoring tools
- Web dashboard for a unified Hadoop-specific view
- System health: Net-SNMP
- Hadoop metrics: Ganglia plug-in, JMX, SNMP MIB (to add)
- Alerting: Nagios
- Collect & display: Nagios, Ganglia, RRD
Next: deployment and management
- Provisioning & configuration with Puppet (Chef + others to follow)
- VM and cloud packaging
- LDAP, Active Directory integration
- User metering and management API/GUI
Page 15
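To make the "Hadoop metrics: JMX" bullet concrete, here is a minimal sketch of a remote JMX probe. It assumes the daemon's JVM was started with remote JMX enabled on a chosen port (8004 is purely illustrative) and simply lists the MBeans it finds; the MBean domain and names Hadoop registers vary by release ("hadoop" vs. "Hadoop", and the attribute set), so treat the output as exploratory.

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NameNodeJmxProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and port; the daemon must be started with
        // -Dcom.sun.management.jmxremote.port=8004 (plus auth/SSL settings).
        String host = args.length > 0 ? args[0] : "namenode.example.com";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // List every registered MBean; Hadoop daemons publish theirs under
            // a "hadoop"/"Hadoop" domain depending on the version.
            Set<ObjectName> names = mbs.queryNames(null, null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}
```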

HCatalog 0.3 and beyond: Common Table, Schema, and Metadata Management
In HDP 1.0:
- Enables Hive, Pig, and MapReduce to use the same tables and metadata; generalizes Hive's table/metadata system
- Pig/Hive/MapReduce use the same HCatalog file storage methods and common code for their I/O
- Manages data format and schema changes: allows columns to be appended to tables in new partitions, and allows storage format changes
- CLI to create tables with DDL, change the default format, add columns, change data location, compact data, and register data as a table
- Templeton APIs layer on top
Longer-term:
- A simple, uniform API that supports a rich set of data management tools
- Transparent migration of data between systems & formats
- Document models and other non-relational data handled gracefully
- Easy lookup, addition & modification of objects / records
Page 16
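For a sense of what "MapReduce uses the same tables" looks like in practice, here is a hedged driver-and-mapper sketch that reads an HCatalog-managed table from a MapReduce job and dumps each record as text. It assumes the HCatalog 0.x package layout (org.apache.hcatalog.*) and a hypothetical table default.page_views; the setInput signature and package names moved in later HCatalog/Hive releases, so check the version you actually run.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatReadJob {

    // HCatalog hands the mapper schema-aware HCatRecord values regardless of
    // the table's on-disk format; this mapper just prints each record.
    public static class PrintMapper
            extends Mapper<WritableComparable<?>, HCatRecord, NullWritable, Text> {
        @Override
        protected void map(WritableComparable<?> key, HCatRecord value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hcat-read-example");
        job.setJarByClass(HCatReadJob.class);

        // Point the job at a table by name; HCatalog resolves the storage
        // format, location, and schema from the shared metastore, so the job
        // needs no knowledge of file paths or SerDes.
        HCatInputFormat.setInput(job,
                InputJobInfo.create("default", "page_views", null));
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(PrintMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("/tmp/hcat-read-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A Pig or Hive query pointed at the same table name would see exactly the same schema and partitions, which is the point of pushing table management into HCatalog rather than into each job.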

Related Hortonworks Webinars
- HCatalog, Table Management for Hadoop: Wednesday, February 22, 10:00am Pacific / 1:00pm Eastern - http://hortonworks.com/webinars/
- Other topics coming soon: Importing Data Into Hadoop; Monitoring and Managing Hadoop Clusters; HBase and Hive - http://hortonworks.com/webinars/
Page 17

Hortonworks Data Platform 2: Consumable Hadoop.Next Platform
Based on Hadoop 0.23.*, the next generation of Hadoop and the first release of a set of features under development since 2010.
Highlights:
- Next-generation MapReduce architecture: refactored for enhanced scalability and performance; decouples MapReduce from the resource-management architecture, which lets MapReduce evolve quickly and enables new application types (streaming, graph, MPI, bulk synchronous, etc.)
- HDFS Federation: improved scalability and isolation, plus an extension API that will allow new storage services to share HDFS storage
- HDFS NameNode High Availability: automatic failover, with multiple options for failover (shared disk, Linux HA, ZooKeeper)
- Includes the latest stable standard Hadoop components
Page 18
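As an illustration of what NameNode HA means for clients, here is a hedged sketch of the client-side failover configuration as it later settled in the Hadoop 2 line; the property names were still in flux at the time of this deck, and the nameservice ID and hosts below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A logical nameservice replaces a single NameNode host in URIs.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Proxy provider that retries against the standby when the active fails.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client sees one filesystem; failover is handled underneath.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
```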

Related Hortonworks Webinars
- What's In Store For Hadoop.Next: http://info.hortonworks.com/hadoopnextwebcast_landing.html
- Hadoop HDFS High Availability: HA NameNode: http://info.hortonworks.com/1-24-12hanamenodewebcast_landingpage.html
- Other topics coming soon: Extending Hadoop beyond MapReduce (Wednesday, March 7th); HDFS Federation - http://hortonworks.com/webinars/
Page 19

Observed Trends, Anticipated Investments Page 20

Trend: Agile Data, Hadoop as Data Hub
The old way:
- Operational systems keep only current records and short history
- Analytics systems keep only conformed / cleaned / digested data
- Unstructured data locked away in operational silos
- Archives offline
- Inflexible: new questions require system redesigns
The new trend:
- Keep all data in Hadoop for a long time (raw inputs & processed)
- Able to produce a new analytics view on demand
- Keep a new copy of data that was previously only in silos
- Can immediately do new reports and experiments in Hadoop
- New products / services can be added very quickly
- The agile outcome justifies new infrastructure
Page 21

Traditional Enterprise Data Architecture: Data Silos
(Diagram: serving applications such as web serving and NoSQL stores, RDBMS, EDW / data marts feeding traditional BI & analytics, and serving logs, social media, sensor data, text, and other unstructured systems, each in its own silo.)
Page 22

Agile Data Architecture with Hadoop: Connecting All of Your Big Data
(Diagram: the same serving applications, RDBMS, EDW / data marts, and BI & analytics systems, now connected through Hadoop for ETL, archiving all data, and custom analytics, fed by serving logs, social media, sensor data, text, and other unstructured systems.)
Page 23

Hadoop as data hub implies:
- ETL integration / data ingest: Hadoop should work well with industry-standard tools [Importing data webinar]
- Offsite backups / DR: HDFS snapshots, cloud backup, other tools
- Object / event-level storage APIs, open data APIs, non-relational model data: HCatalog (for all three) [HCatalog webinar]
- Efficient / low-cost storage: compression, RAID / Reed-Solomon
- No storage limits, no file limits; scale beyond 10,000 computers per cluster
Page 24

Trend: Data-centric Applications
- Limited runtime logic driven by huge lookup tables: data flows from the application into Hadoop for processing, then back from Hadoop into the application for serving
- Complex computations in Hadoop: machine learning and other expensive computation offline; personalization, classification, fraud, value analysis
- Application development requires data science: huge amounts of actually observed data are key to modern services, and Hadoop is used as the science platform
Page 25

Case Study: Yahoo! Homepage
(Diagram: user behavior feeds a science Hadoop cluster, where machine learning builds ever-better categorization models (weekly), and a production Hadoop cluster, which uses those categorization models to identify user interests and produce serving maps every five minutes; serving systems then build customized home pages with the latest data for engaged users, at thousands of requests per second.)
Copyright Yahoo 2011. Page 26

Data-centric applications imply:
- New programming frameworks [YARN webinar]: in-RAM models (MPI, Spark, Giraph, ...), services in Hadoop (HBase, stream processing, serving, ...)
- Huge lookup tables: HDFS enhancements for HBase; HBase improvements and integrations with other distributed stores; predictable latency; continuous availability
- Core investments [HA webinar]
- Data-driven applications in Hadoop: investments in Oozie, integration with the HCatalog data model
Page 27

Trend: Hadoop Goes to the Enterprise
- Hadoop used in production processes
- Hadoop used in multi-tenant environments
- Hadoop used in audited environments
- Hadoop integrated into many more environments
- Forwards and backwards compatibility a must
Page 28

Hadoop goes to the enterprise implies:
- Authentication, authorization & security: encryption, Active Directory integration, simplified security deployment
- Open administration: open provisioning, open monitoring [Monitoring and managing webinar]
- Cloud support: Amazon and Azure support, Whirr / jclouds, VM tuning and deployment support
- Engineered for compatibility: new Protocol Buffers-based protocols
Page 29

Hortonworks Focus and Value
Hortonworks is focused on:
- Technology leadership within the Apache Hadoop community
- Customer and partner enablement around the Hadoop platform
What value do we provide?
- 100% open source Hortonworks Data Platform
- Expert role-based training
- Full lifecycle support and services
Next webinar: HCatalog, Table Management for Hadoop - Wednesday, February 22, 10:00am Pacific / 1:00pm Eastern - http://hortonworks.com/webinars/
Page 30

Thank You! Questions?
Eric Baldeschwieler - @jeric14, @hortonworks
Page 31