Top 5 Challenges for Hadoop MapReduce in the Enterprise
Whitepaper - May 2011
http://platform.com/mapreduce

Table of Contents

Introduction
  Current Market Conditions and Drivers
  Customer Problems
  Needs
  Current Solutions
Five Challenges for Hadoop MapReduce in the Enterprise
  1. Lack of Performance and Scalability
  2. Lack of Flexible and Reliable Resource Management
  3. Lack of Application Deployment Support
  4. Lack of Quality of Service
  5. Lack of Multiple Data Source Support
Use Case Example
  Example Scenarios
Conclusion

Introduction

Reporting and analysis drive businesses in making the best possible decisions. The source of all these decisions is data. There are two main types of data: structured and unstructured. Although IT has been able to deliver enterprise-class services for analysis and reporting on structured data (e.g., data warehouses), it has struggled to deliver the same level of service for capturing, managing and processing information from unstructured data. IT organizations need to adopt new ways to deliver enterprise-class services to extract and analyze unstructured data. New methods such as MapReduce can access, extract and organize data result sets, but delivering them is becoming too expensive without enterprise-class delivery services.

To meet emerging business demands to extract knowledge from unstructured data, enterprises require an enterprise-class solution that can schedule and manage data analysis processes across an entire distributed file system with the robustness that enterprise IT requires. Platform Computing has managed enterprise-class distributed architectures for nearly two decades and is well suited to provide enterprise-class services across a distributed file system. Platform Computing's MapReduce distributed computing runtime engine meets this need.

Current Market Conditions and Drivers

According to the Market Strategy and BI Research group, data volumes are doubling every year:
- 42.6 percent of respondents are keeping more than three years of data for analytical purposes.
- New sources of data are emerging at huge volumes in different industries, such as utilities.
- 80 percent of data is unstructured and not effectively used in the organization.
- Most of the unstructured data collected is driven by business value rather than by need from a pure analytics perspective.

The key is to turn this data into usable information. However, with the rapid growth in data volumes, even the fastest systems cannot keep pace.
For the analysis of large data sets (i.e., "Big Data"), the system architecture has to be revisited and designed to scale linearly as the volume of data grows. To meet these big data conditions, both computational and storage solutions have evolved:
- New programming frameworks have emerged to enable distributed computing on large data sets (e.g., MapReduce).
- New data storage techniques (e.g., file systems on commodity hardware, like the Hadoop Distributed File System, or HDFS) handle structured and unstructured data.

Distributed storage systems have made it more affordable to store large volumes of data using commodity disks. Big data distributed file systems used for storage support some enterprise-class capabilities such as data flexibility, adoption within the IT ecosystem, high scalability (up to petabytes), and reasonable cost. For a sense of scale, NYSE is generating 1TB of data per day, Facebook is generating 20TB of data per day (compressed), and CERN is generating 40TB of data per day.

Customer Problems

The customer problem, then, is not with the distributed file system, but with the ability to access, extract and organize the data using MapReduce with enterprise-class services. The common implementation of MapReduce based on open source code is not inherently designed for enterprise-class deployments. Using MapReduce in an enterprise data center requires a highly scalable, highly available, and easily managed solution that supports multiple MapReduce applications. These key capabilities do not exist within current open source solutions.

Needs

There is a need for an enterprise-class MapReduce computational solution to support distributed processing of the MapReduce programming model. Enterprise-class MapReduce computational engines need to:
- Enable deployment and operation of the extraction and analysis programs across the enterprise.
- Manage and monitor large-scale environments.
- Include a workload management system to ensure quality of service and prioritization of applications based on business objectives.
- Service multiple MapReduce users and lines of business, as well as potentially other distributed processing needs.
- Provide flexibility to choose the right storage/file system, based on the specific application need.
- Deliver SLAs that IT can commit to its business users.

Current Solutions

There are three current approaches to performing MapReduce operations on large amounts of data:

Open Source Apache Hadoop Project

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data within the Hadoop Distributed File System (HDFS) [1]. Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is an Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo!, the largest contributor to the project, uses Hadoop extensively across its businesses.

The design of the system is in flux, as the initial distribution suffers from multiple issues, including a monolithic architecture in the core scheduling subsystem. Because Hadoop is an open source distribution, customers wishing to run MapReduce programs on it must do so at their own risk, supporting the entire deployment themselves. The Hadoop implementation is also Java-centric and primarily works with the Hadoop file system (HDFS) [2]. It assumes that the customer has internal expertise in operating the code. The solution offers no serious manageability, high availability, or performance capability. It is designed to be used by IT departments that have an army of developers to help fix any issues they encounter. The source code is constantly evolving, and managing the infrastructure lifecycle is quite complex and may require interrupting the environment to perform system updates.
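For readers new to the programming model that Hadoop implements, the following is a minimal single-process sketch in Python of the map, shuffle, and reduce phases, using word counting as the example. It is an illustration only: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and are not part of the Hadoop API, and a real framework would distribute each phase across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical log lines standing in for unstructured input data.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(logs))
```

Because each call to map_phase depends only on its own input record, and each call to reduce_phase depends only on one key, both phases can in principle be spread over many machines, which is what makes the model attractive for large data sets.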
Commercial Open Source

Cloudera is one commercial provider of a Hadoop software stack distribution, offering services in addition to add-on tools. Its distribution is based on open source code, which is still unproven as a large-scale, full-stack enterprise solution. The open source distribution has many shortcomings, including its workload management capabilities. Other commercial open source distributions are emerging, with IBM and EMC entering the marketplace. However, all of these offerings are based on open source code and inevitably inherit the strengths and weaknesses of that code base and architectural design. They therefore cannot meet the enterprise-class requirements for big data problems described above.

In-data Warehouse Analytics

Some data warehouse vendors have implemented the MapReduce programming model on top of their data warehouses. These include EMC/Greenplum and Aster Data. Though the tight integration of MapReduce with their data warehouse is an attractive and reliable solution for their customers, it only works with their own data warehouse. Many customers will find this solution unappealing due to the lack of choice.

[1] http://en.wikipedia.org/wiki/Hadoop
[2] Contributed plug-ins for GPFS and Ceph have been offered by the community.

Five Challenges for Hadoop MapReduce in the Enterprise

1. Lack of Performance and Scalability

Current implementations of the MapReduce programming model do not provide a fast, scalable distributed resource infrastructure solution [3]. Organizations require a MapReduce distributed solution that can deliver a competitive advantage by solving a wide range of data-intensive analytic problems. They may also require the ability to harness resources from distributed clusters in remote data centers. A complete MapReduce implementation should help organizations run complex data simulations with sub-millisecond latency and a throughput of thousands of tasks per second.
Current open source implementations have job startup times measured in seconds, not milliseconds. Applications should be able to scale to tens of thousands of cores and thousands of concurrent clients and/or applications.

2. Lack of Flexible and Reliable Resource Management

Current implementations of the MapReduce programming model are not able to react quickly to changes in application and/or user demand. MapReduce distributed processing requires a flexible amount of computing power to support applications, even when data streams to the distributed resources in real time. Based on volume, the MapReduce distributed resources should be able to grow or shrink, reallocating up to thousands of CPUs per second to match the current workload, in order to reduce cost while maximizing results. The resource manager in the current open source solution is a potential single point of failure, and tasks must be resubmitted if this subsystem fails.
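To make the idea of growing and shrinking allocations concrete, here is a small Python sketch of demand-proportional slot reallocation. It is a hypothetical illustration, not the scheduling algorithm of Hadoop or of any Platform Computing product; the function name reallocate and the application names in the usage example are assumptions made for this sketch.

```python
def reallocate(total_slots, demand):
    """Split total_slots among applications in proportion to pending tasks.

    demand maps an application name to its number of pending tasks.
    No application receives more slots than it has pending tasks.
    """
    total_demand = sum(demand.values())
    if total_demand <= total_slots:
        # Enough capacity: every application gets exactly what it asks for.
        return dict(demand)
    # Start from the proportional share, rounded down.
    shares = {app: total_slots * n // total_demand for app, n in demand.items()}
    # Hand out the slots lost to rounding, largest unmet demand first.
    leftover = total_slots - sum(shares.values())
    for app in sorted(demand, key=lambda a: demand[a] - shares[a], reverse=True):
        if leftover == 0:
            break
        if shares[app] < demand[app]:
            shares[app] += 1
            leftover -= 1
    return shares


if __name__ == "__main__":
    # Hypothetical workload: a fraud-detection job and a reporting job.
    print(reallocate(10, {"fraud": 30, "reports": 10}))
    # Demand drops, so both allocations shrink and slots are freed.
    print(reallocate(10, {"fraud": 2, "reports": 3}))
```

Calling a function like this whenever demand changes lets allocations follow the workload. A production resource manager would also have to account for priorities, data locality, and the cost of preempting running tasks, none of which this sketch models.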

3. Lack of Application Deployment Support

Current implementations of the MapReduce programming model do not make it easy to manage multiple application integrations on production-scale distributed systems with automated application service deployment capability. An enterprise-class solution should have automated capabilities including application deployment, workload policies, tuning, and general monitoring and administration. This eliminates ongoing source code maintenance and simplifies IT operations.

4. Lack of Quality of Service

Current implementations of the MapReduce programming model do not run at optimal capacity to take advantage of multi-core servers. Organizations with Big Data challenges are looking for a solution that can dynamically match resources to non-uniform MapReduce workloads in order to get the most from their IT infrastructure. The improved resource utilization also leads to higher application performance and faster time to results, and thereby delivers a higher quality of service to the organization. MapReduce implementations often overlook infrastructure lifecycle management. As a result, the entire systems infrastructure has to be brought down to perform routine maintenance such as patching or upgrades.

5. Lack of Multiple Data Source Support

Current implementations of the MapReduce programming model support only one distributed file system for reading and writing data, the most common being HDFS. A complete implementation of the MapReduce programming model should be agile enough to support multiple distributed file systems simultaneously. With the flexibility to read input from one file system and write output to a different one, data processing and data storage become far more efficient, and additional data conversion steps are eliminated.

Platform Computing's MapReduce approach is to deliver enterprise-class distributed workload services for the MapReduce application programming model.
It meets enterprise IT requirements when running MapReduce analytics, delivering availability, scalability, performance and manageability. The server is designed to work specifically with multiple data file systems, avoiding customer lock-in while offering a single MapReduce solution throughout the enterprise. As an analytic distributed platform, Platform MapReduce supports an open, compatible application architecture. It can support multiple programming languages and multiple data storage techniques, and its APIs are consistent with those of the open source Hadoop projects. This makes it easy to integrate with third-party software when moving current applications to Platform MapReduce.

The Platform Computing MapReduce enhanced approach is built around the company's core technologies in Platform LSF and Platform Symphony. Its enterprise-class capabilities include the ability to scale to thousands of cores per MapReduce application, to perform at very high execution rates, and to offer IT manageability and monitoring while controlling workload policies for multiple lines of business users. It has built-in high availability services to ensure the necessary quality of service.

[3] Apache Hadoop is limited to 4,000 nodes and 40,000 concurrent tasks. It also has a single point of failure.

Use Case Example

Data will continue to accumulate within IT organizations. Over time, this data can become extremely large and complex. It comprises multiple formats, including documents, web feeds, system logs, online forums, SharePoint, sensor data, and images/video content. The ability to analyze and make use of this data can dramatically assist in running any business.

Example Scenarios

As a general purpose solution, for example, users may want to ask what-if questions from a graphical user interface against the data to determine customer buying patterns. Another example would be an application continuously performing queries to detect money laundering or credit card fraud by correlating the locations and buying timelines of financial transactions.

Solution

Platform MapReduce is a product designed to run MapReduce programs on a computational distributed system that provides enterprise-class services. As a computational distributed platform, it supports an open application architecture as well as multiple distributed file systems used by organizations today. Its enterprise-class capabilities include the ability to scale to thousands of cores per MapReduce application, to perform at very high execution rates, and to offer IT manageability and monitoring while controlling workload policies for multiple lines of business users. It also offers built-in high availability services to ensure the necessary quality of service.

Conclusion

Platform Computing's MapReduce approach provides development flexibility, operational maturity, better performance and higher scalability to meet the needs of the most complex environments. Platform Computing brings together two decades of distributed system management capabilities, providing a solution that allows linear scalability by balancing computation needs with the ever-growing volumes of data. Because the solution is designed to support multiple applications, organizations can dramatically increase their IT infrastructure utilization across all resources, resulting in a high return on investment. Unlike less sophisticated solutions that lack multiple MapReduce application support and scalability, Platform MapReduce's distributed workload services are designed for high scalability, fast performance, and extreme application compatibility through a low-latency service-oriented architecture (SOA). MapReduce applications can now run with high reliability under powerful central management, thereby meeting IT's SLAs with both reliability and consistency.

Platform Computing is the leader in cluster, grid and cloud management software, serving more than 2,000 of the world's most demanding organizations for over 18 years. Our workload and resource management solutions deliver IT responsiveness and lower costs for enterprise and HPC applications. Platform has strategic relationships with Cray, Dell, HP, IBM, Intel, Microsoft, Red Hat, and SAS. Visit www.platform.com.

World Headquarters
Platform Computing Corporation
3760 14th Avenue
Markham, Ontario
Canada L3R 3T7
Tel: +1 905 948 8448
Fax: +1 905 948 9975
Toll-free Tel: 1 877 528 3676
info@platform.com

Sales - Headquarters
Toll-free Tel: 1 877 710 4477
Tel: +1 905 948 8448

North America
New York: +1 212 888 6270
San Jose: +1 408 392 4900

Europe
Bramley: +44 (0) 1256 883756
London: +44 (0) 20 3206 1470
Paris: +33 (0) 1 41 10 09 20
Düsseldorf: +49 2102 61039 0

Asia-Pacific
Beijing: +86 10 82276000
Xi'an: +86 029 87607400
Tokyo: +81(0)3 6302 2901
Singapore: 65 6307 6590

Copyright 2011 Platform Computing Corporation. The symbols ® and ™ designate trademarks of Platform Computing Corporation or identified third parties. All other logos and product names are the trademarks of their respective owners, errors and omissions excepted. Printed in Canada. Platform and Platform Computing refer to Platform Computing Corporation and each of its subsidiaries.