Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May 2011


Top 5 Challenges for Hadoop MapReduce in the Enterprise
Whitepaper - May 2011
http://platform.com/mapreduce

Table of Contents

Introduction
  Current Market Conditions and Drivers
  Customer Problems
  Needs
  Current Solutions
Five Challenges for Hadoop MapReduce in the Enterprise
  1. Lack of Performance and Scalability
  2. Lack of Flexible and Reliable Resource Management
  3. Lack of Application Deployment Support
  4. Lack of Quality of Service
  5. Lack of Multiple Data Source Support
Use Case Example
  Example Scenarios
Conclusion

Introduction

Reporting and analysis drive businesses in making the best possible decisions. The source of all these decisions is data. There are two main types of data: structured and unstructured. Though IT has been able to deliver enterprise-class services for analysis and reporting on structured data (e.g., data warehouses), it has struggled to deliver the same level of service for capturing, managing and processing information from unstructured data. IT organizations need to adopt new ways to deliver enterprise-class services to extract and analyze unstructured data. Though new methods such as MapReduce can access, extract and organize data result sets, delivering them is becoming too expensive without enterprise-class delivery services.

To meet emerging business demands to extract knowledge from unstructured data, enterprises require an enterprise-class solution that can schedule and manage data analysis processes across an entire distributed file system with the robustness that enterprise IT requires. Platform Computing has managed enterprise-class distributed architectures for nearly two decades and is well suited to provide enterprise-class services across a distributed file system. Platform Computing's MapReduce distributed computing runtime engine meets this need.

Current Market Conditions and Drivers

According to the Market Strategy and BI Research group, data volumes are doubling every year:

- 42.6 percent of respondents are keeping more than three years of data for analytical purposes.
- New sources of data are emerging at huge volumes in different industries, such as utilities.
- 80 percent of data is unstructured and not effectively used in the organization.
- Most of the unstructured data collected is driven by business value rather than need from a pure analytics perspective.

The key is to turn this data into usable information. However, with the rapid growth in data volumes, even the fastest systems cannot keep pace.
For the analysis of large data sets (i.e., "Big Data"), the system architecture has to be revisited and designed to scale linearly as the volume of data grows. To meet these big data conditions, both computational and storage solutions have evolved:

- New programming frameworks have emerged to enable distributed computing on large data sets (e.g., MapReduce).
- New data storage techniques (e.g., file systems on commodity hardware, like the Hadoop Distributed File System, or HDFS) support structured and unstructured data.

Distributed storage systems made it more affordable to store large volumes of data using commodity disks. Big data distributed file systems used for storage support some enterprise-class capabilities such as data flexibility, adoption within the IT ecosystem, high scalability (up to petabytes), and reasonable cost.

NYSE is generating 1TB of data per day. Facebook is generating 20TB of data per day, compressed! CERN is generating 40TB of data per day.

Customer Problems

So, the customer problem is not with the distributed file system, but with the ability to access, extract and organize the data using MapReduce with enterprise-class services. The common implementation of MapReduce based on open source code is not inherently designed for enterprise-class deployments. Using MapReduce in an enterprise data center requires a highly scalable, highly available, and easily managed solution that supports multiple MapReduce applications. These key capabilities do not exist within current open source solutions.

Needs

There is a need for an enterprise-class MapReduce computational solution to support distributed processing of the MapReduce programming model. Enterprise-class MapReduce computational engines need to:

- Enable deployment and operation of the extraction and analysis programs across the enterprise.
- Manage and monitor large-scale environments.
- Include a workload management system to ensure quality of service and prioritization of applications based on business objectives.
- Service multiple MapReduce users and lines of business, as well as potentially other distributed processing needs.
- Provide flexibility to choose the right storage/file system, based on the specific application need.
- Deliver SLAs that IT can commit to its business users.

Current Solutions

There are three current approaches to performing MapReduce operations on large amounts of data:

Open Source Apache Hadoop Project

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data within the Hadoop Distributed File System (HDFS) [1]. Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is an Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo!, the largest contributor to the project, uses Hadoop extensively across its businesses.

The design of the system is in flux, as the initial distribution suffers from multiple issues, including a monolithic architecture of the core scheduling subsystem. Because Hadoop is an open source distribution, customers wishing to implement MapReduce programs on it must do so at their own risk by supporting the entire deployment themselves. The Hadoop implementation is also Java-centric and primarily works with the Hadoop file system (HDFS) [2]. There is an assumption that the customer has internal expertise on how to operate the code. The solution offers no serious manageability, high availability, or performance capability. It is designed to be used by IT departments that have an army of developers to help fix any issues they encounter. The source code is constantly evolving, and managing the infrastructure lifecycle is quite complex and may require interruption of the environment to perform system updates.
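For readers less familiar with the programming model that Hadoop implements, the following is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the example. It is an illustration only and is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real deployment would distribute each phase across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can in principle be spread over many machines, which is what makes the scheduling and resource management questions discussed in this paper matter.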
Commercial Open Source

Cloudera is one such commercial provider of a Hadoop stack software distribution, providing services in addition to add-on tools. Its distribution is based on open source code, which is still unproven as a large-scale, full-stack enterprise solution. There are many shortcomings in the open source distribution, including its workload management capabilities. Other commercial open source distributions are emerging, with IBM and EMC entering the marketplace. However, all of these offerings are based on open source code and inevitably inherit the strengths and weaknesses of that code base and architectural design. Therefore they cannot meet the enterprise-class requirements for big data problems already mentioned.

In-data Warehouse Analytics

Some data warehouse vendors have implemented the MapReduce programming model on top of their data warehouses. These include EMC/Greenplum and Aster Data. Though the tight integration of MapReduce with their data warehouse is an attractive and reliable solution for their customers, it only works with their own data warehouse. Many customers will find this solution unappealing due to the lack of choice.

[1] http://en.wikipedia.org/wiki/Hadoop
[2] Contribution plug-ins for GPFS and Ceph have been offered by the community.

Five Challenges for Hadoop MapReduce in the Enterprise

1. Lack of Performance and Scalability

Current implementations of the Hadoop MapReduce programming model do not provide a fast, scalable distributed resource infrastructure solution [3]. Organizations require a MapReduce distributed solution that can deliver a competitive advantage by solving a wide range of data-intensive analytic problems. They may also require the ability to harness resources from distributed clusters in remote data centers. A complete MapReduce implementation should help organizations run complex data simulations with sub-millisecond latency and throughput of thousands of tasks per second. Current open source implementations have job startup times measured in seconds, not milliseconds. Applications should be able to scale to tens of thousands of cores and thousands of concurrent clients and/or applications.

2. Lack of Flexible and Reliable Resource Management

Current implementations of the Hadoop MapReduce programming model are not able to react quickly to changes in application and/or user demand. MapReduce distributed processing requires a flexible amount of computing power to support applications even when data streams to the distributed resources in real time. Based on volume, the MapReduce distributed resources should be able to grow or shrink by reallocating up to thousands of CPUs per second to adjust to the current workload, in order to reduce cost while maximizing results. The resource manager in the current open source solution is susceptible to being a single point of failure, and tasks need to be resubmitted when this subsystem fails.
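The grow-or-shrink behavior described in the second challenge can be sketched as a simple sizing rule. This is a minimal illustration under stated assumptions, not how Hadoop or Platform MapReduce allocates resources: the function name target_slots, the idea of a fixed number of tasks per slot, and the floor and ceiling values are all hypothetical.

```python
import math


def target_slots(pending_tasks: int, tasks_per_slot: int,
                 min_slots: int, max_slots: int) -> int:
    """Return how many compute slots a workload should hold right now.

    Assumed policy (hypothetical): one slot can absorb `tasks_per_slot`
    pending tasks, and the pool never drops below `min_slots` or grows
    beyond `max_slots`.
    """
    if pending_tasks < 0:
        raise ValueError("pending_tasks must not be negative")
    if tasks_per_slot <= 0:
        raise ValueError("tasks_per_slot must be positive")
    if min_slots > max_slots:
        raise ValueError("min_slots must not exceed max_slots")
    # Slots needed to cover the backlog, rounded up.
    needed = math.ceil(pending_tasks / tasks_per_slot)
    # Clamp the request between the floor and the ceiling.
    return max(min_slots, min(max_slots, needed))


if __name__ == "__main__":
    # Re-evaluate the pool size as the backlog changes over time.
    for backlog in (0, 95, 5000):
        print(backlog, target_slots(backlog, tasks_per_slot=10,
                                    min_slots=2, max_slots=100))
```

A scheduler would re-run a rule like this whenever the backlog changes and release or acquire the difference, which is the kind of reaction to demand that the paragraph above says current implementations lack.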

3. Lack of Application Deployment Support

Current implementations of the Hadoop MapReduce programming model do not make it easy to manage multiple application integrations on production-scale distributed systems with automated application service deployment capability. An enterprise-class solution should have automated capabilities including application deployment, workload policies, tuning, and general monitoring and administration. This eliminates ongoing source code maintenance and simplifies IT operations.

4. Lack of Quality of Service

Current implementations of the Hadoop MapReduce programming model do not run at optimal capacity to take advantage of multi-core servers. Organizations with Big Data challenges are looking for a solution that can dynamically match resources to non-uniform MapReduce workloads in order to maximize their IT infrastructure. The improved resource utilization also leads to higher application performance and faster time to results, and thereby delivers a higher quality of service to an organization. MapReduce implementations often overlook infrastructure lifecycle management. Because of this, the entire systems infrastructure has to be brought down in order to perform routine maintenance such as patching or upgrades.

5. Lack of Multiple Data Source Support

Current implementations of the Hadoop MapReduce programming model only support one distributed file system for reading and writing data, the most common being HDFS. A complete implementation of the MapReduce programming model should be agile enough to provide simultaneous support for multiple distributed file systems. With the flexibility to read input from one file system and write to a different one, data processing and data storage become far more efficient, and additional data conversion steps are eliminated.

Platform Computing's MapReduce approach is to deliver enterprise-class distributed workload services for the MapReduce application programming model. It meets enterprise IT requirements when running MapReduce analytics, delivering availability, scalability, performance and manageability. The server is designed specifically to work with multiple data file systems, avoiding customer lock-in while offering a single MapReduce solution throughout the enterprise. As an analytic distributed platform, Platform MapReduce supports an open, compatible application architecture. It can support multiple programming languages and multiple data storage techniques, and has APIs consistent with the open source Hadoop projects. This makes it easy to integrate with third-party software when moving current applications to Platform MapReduce.

The Platform Computing MapReduce enhanced approach is built around the company's core technologies in Platform LSF and Platform Symphony. Its enterprise-class capabilities include the ability to scale to thousands of cores per MapReduce application, to perform at very high execution rates, and to offer IT manageability and monitoring while controlling workload policies for multiple lines of business users. It has built-in high availability services to ensure the necessary quality of service.

[3] Apache Hadoop is limited to 4,000 nodes and 40,000 concurrent tasks. It also has a single point of failure.

Use Case Example

Data will continue to accumulate within IT organizations. Over time, this data can become extremely large and complex. It comprises multiple formats, including documents, web feeds, system logs, online forums, SharePoint, sensor data, and images/video content. The ability to analyze and make use of this data can dramatically assist in running any business.

Example Scenarios

As a general purpose solution, for example, users may want to ask what-if questions of the data from a graphical user interface to determine customer buying patterns. Another example would be an application continuously performing queries to detect money laundering or credit card fraud by correlating the locations and timelines of financial transactions.

Solution

Platform MapReduce is a product designed to run MapReduce programs on a computational distributed system that provides enterprise-class services. As a computational distributed platform, it supports an open application architecture as well as multiple distributed file systems used by organizations today. Its enterprise-class capabilities include the ability to scale to thousands of cores per MapReduce application, to perform at very high execution rates, and to offer IT manageability and monitoring while controlling workload policies for multiple lines of business users. It also offers built-in high availability services to ensure the necessary quality of service.

Conclusion

Platform Computing's MapReduce approach provides development flexibility, operational maturity, better performance and higher scalability to meet the needs of the most complex environments. Platform Computing brings together two decades of distributed system management capabilities, providing a solution that allows linear scalability by balancing computation needs with the ever-growing volumes of data. Because the solution is designed to support multiple applications, organizations can dramatically increase their IT infrastructure utilization across all resources, resulting in a high return on investment. Unlike other less sophisticated solutions that lack multiple MapReduce application support and scalability, Platform MapReduce's distributed workload services are designed for high scalability, fast performance, and extreme application compatibility through its low-latency SOA architecture. MapReduce applications can now run with high reliability under powerful central management, thereby meeting IT's SLAs with both reliability and consistency.

Platform Computing is the leader in cluster, grid and cloud management software, serving more than 2,000 of the world's most demanding organizations for over 18 years. Our workload and resource management solutions deliver IT responsiveness and lower costs for enterprise and HPC applications. Platform has strategic relationships with Cray, Dell, HP, IBM, Intel, Microsoft, Red Hat, and SAS. Visit www.platform.com.

World Headquarters
Platform Computing Corporation
3760 14th Avenue
Markham, Ontario
Canada L3R 3T7
Tel: +1 905 948 8448
Fax: +1 905 948 9975
Toll-free Tel: 1 877 528 3676
info@platform.com

Sales - Headquarters
Toll-free Tel: 1 877 710 4477
Tel: +1 905 948 8448

North America
New York: +1 212 888 6270
San Jose: +1 408 392 4900

Europe
Bramley: +44 (0) 1256 883756
London: +44 (0) 20 3206 1470
Paris: +33 (0) 1 41 10 09 20
Düsseldorf: +49 2102 61039 0

Asia-Pacific
Beijing: +86 10 82276000
Xi'an: +86 029 87607400
Tokyo: +81(0)3 6302 2901
Singapore: 65 6307 6590

Copyright 2011 Platform Computing Corporation. The symbols ® and ™ designate trademarks of Platform Computing Corporation or identified third parties. All other logos and product names are the trademarks of their respective owners, errors and omissions excepted. Printed in Canada. Platform and Platform Computing refer to Platform Computing Corporation and each of its subsidiaries.