Hadoop Big Data Platforms Buyer's Guide, Part 1

Your expert guide to Hadoop big data platforms for managing big data

David Loshin, Knowledge Integrity Inc.

Companies of all sizes can use Hadoop, as vendors sell packages that bundle Hadoop distributions with different levels of support, as well as enhanced commercial distributions.

Hadoop is an open source technology that today is the data management platform most commonly associated with big data applications. The distributed processing framework was created in 2006, primarily at Yahoo, based partly on ideas outlined by Google in a pair of technical papers; other Internet companies such as Facebook, LinkedIn and Twitter soon adopted the technology and began contributing to its development. In the past few years, Hadoop has evolved into a complex ecosystem of infrastructure components and related tools, which various vendors package together in commercial Hadoop distributions.

Running on clusters of commodity servers, Hadoop offers a high-performance, low-cost approach to establishing a big data management architecture for supporting advanced analytics initiatives. As awareness of its capabilities has increased, Hadoop's use has spread to other industries, for both reporting and analytical applications involving a mix of traditional structured data and newer forms of unstructured and semi-structured data.

This includes Web clickstream data, online ad information, social media data, healthcare claims records, and sensor data from manufacturing equipment and other devices on the Internet of Things.

What is Hadoop?

The Hadoop framework encompasses a large number of open source software components: a set of core modules for capturing, processing, managing and analyzing massive volumes of data, surrounded by a variety of supporting technologies. The core components include:

- The Hadoop Distributed File System (HDFS), which supports a conventional hierarchical directory and file system that distributes files across the storage nodes (i.e., DataNodes) in a Hadoop cluster.
- MapReduce, a programming model and execution framework for the parallel processing of batch applications.
- YARN (short for the good-humored Yet Another Resource Negotiator), which manages job scheduling and allocates cluster resources to running applications, arbitrating among them when there's contention for the available resources. It also tracks and monitors the progress of processing jobs.
- Hadoop Common, a set of libraries and utilities used by the different components.
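
To make the MapReduce programming model concrete, below is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. It illustrates how the map and reduce functions divide the work; it is not a tuned production job, and the input and output HDFS paths are assumed to be supplied on the command line.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output paths in HDFS, passed on the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map tasks run in parallel across the cluster's DataNodes while YARN schedules them and arbitrates resources, which is exactly the division of labor among the core components described above.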

In Hadoop clusters, those core pieces and other software modules are layered on top of a collection of computing and data storage hardware nodes. The nodes are connected via a high-speed internal network to form a high-performance parallel and distributed processing system.

As a collection of open source technologies, Hadoop isn't controlled by any single vendor; rather, its development is managed by the Apache Software Foundation. Apache offers Hadoop under a license that basically grants users a no-charge, royalty-free right to use the software. Developers can download it directly from the Apache website and build a Hadoop environment on their own.

However, Hadoop vendors provide prebuilt "community" versions with basic functionality that can also be downloaded at no charge and installed on a variety of hardware platforms. They also market commercial -- or enterprise -- Hadoop distributions that bundle the software with different levels of maintenance and support services. In some cases, vendors also offer performance and functionality enhancements over the base Apache technology -- for example, by providing additional software tools to ease cluster configuration and management, or data integration with external platforms.

These commercial offerings make Hadoop increasingly attainable for companies of all sizes. This is especially valuable when the commercial vendor's support services team can jump-start a company's design and development of its Hadoop infrastructure, as well as guide the selection of tools and the integration of advanced capabilities, to quickly deploy high-performance analytical solutions that meet emerging business needs.

The components of a typical Hadoop software stack

What do you actually get when you obtain a commercial version of Hadoop? In addition to the core components, typical Hadoop distributions will include -- but aren't limited to -- the following:

- Alternative data processing and application execution managers such as Tez or Spark, which can run on top of or alongside YARN to provide cluster management, cached data management and other means of improving processing performance.
- Apache HBase, a column-oriented database management system modeled after Google's BigTable project that runs on top of HDFS.
- SQL-on-Hadoop tools such as Hive, Impala, Stinger, Drill and Spark SQL, which provide varying degrees of compliance with the SQL standard for direct querying of data stored in HDFS (a short sketch of this approach appears at the end of this section).
- Development tools such as Pig that help developers build MapReduce programs.
- Configuration and management tools such as ZooKeeper or Ambari, which can be used for monitoring and administration.
- Analytics environments such as Mahout that supply analytical models for machine learning, data mining and predictive analytics.

Because the software is open source, you don't purchase a Hadoop distribution as a product, per se. Instead, the vendors sell annual support subscriptions with varying service-level agreements (SLAs). All of the vendors are active participants in the Apache Hadoop community, although each may promote its own add-on components that it has contributed to the community as part of its Hadoop distribution.
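
To illustrate the SQL-on-Hadoop idea from the component list above, here is a minimal sketch that uses Spark SQL's Java API to run standard SQL directly against files stored in HDFS. The path, table name and query are hypothetical examples, not part of any particular distribution; Hive or Impala would expose similar SQL access over the same data.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SqlOnHadoopExample {
        public static void main(String[] args) {
            // A Spark session that can run on YARN alongside other Hadoop workloads.
            SparkSession spark = SparkSession.builder()
                    .appName("sql-on-hadoop-sketch")
                    .getOrCreate();

            // Hypothetical clickstream data already sitting in HDFS as Parquet.
            Dataset<Row> clicks = spark.read()
                    .parquet("hdfs:///data/clickstream/2016/");
            clicks.createOrReplaceTempView("clicks");

            // Standard SQL issued directly against the files in HDFS.
            Dataset<Row> topPages = spark.sql(
                    "SELECT page, COUNT(*) AS views "
                  + "FROM clicks GROUP BY page ORDER BY views DESC LIMIT 10");
            topPages.show();

            spark.stop();
        }
    }

The design point the tools differ on is how much of the SQL standard they cover and whether queries run as batch jobs or through a long-lived query engine; that's worth probing when comparing distributions.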

Who manages the Hadoop big data management environment?

It's important to recognize that getting the desired performance out of a Hadoop system requires a coordinated team of skilled IT professionals who collaborate on architecture planning, design, development, testing, deployment, and ongoing operations and maintenance. Those IT teams will typically include:

- Requirements analysts to assess the system performance requirements based on the types of applications that will be run in the Hadoop environment.
- System architects to evaluate performance requirements and design hardware configurations.
- System engineers to install, configure and tune the Hadoop software stack.
- Application developers to design and implement applications.
- Data management professionals to do data integration, create data layouts and perform other management tasks.
- System managers to handle operational management and maintenance.
- Project managers to oversee the implementation of the various levels of the stack and application development work.
- A program manager to oversee the implementation of the Hadoop environment and the prioritization, development and deployment of applications.

The Hadoop software platform market

The evolution of Hadoop into a viable large-scale data management ecosystem has created a new software market that's transforming the business intelligence and analytics industry. This has expanded both the kinds of analytics applications that user organizations can run and the types of data that can be collected and analyzed as part of those applications.

The market includes three independent vendors that specialize in Hadoop -- Cloudera Inc., Hortonworks Inc. and MapR Technologies Inc. Other companies that offer Hadoop distributions or capabilities include Pivotal Software Inc., IBM, Amazon Web Services and Microsoft.

Evaluating vendors that provide Hadoop distributions requires understanding the similarities and differences between two aspects of the product offerings. First is the technology itself: What's included in the different distributions; what platforms are they supported on; and, most important, what specific components are championed by the individual vendors? Second is the service and support model: What types of support and SLAs are provided within each subscription level, and how much do different subscriptions cost?

Understanding how these aspects relate to your specific business requirements will highlight the characteristics that are important for a vendor relationship. The next article in this series will examine several business use cases for a Hadoop big data management platform so you can determine your organization's needs and requirements.

How a commercial Hadoop distribution can help you manage big data

David Loshin, Knowledge Integrity Inc.

To help you determine if a commercial Hadoop distribution could benefit your organization, consultant David Loshin examines big data use cases and applications that Hadoop can support.

Many companies are struggling to manage the massive amounts of data they collect. Whereas in the past they may have used a data warehouse platform, such conventional architectures can fall short in dealing with data that originates from numerous internal and external sources and often varies in structure and type of content. But new technologies have emerged to offer help -- most prominently Hadoop, a distributed processing framework designed to address the volume and complexity of big data environments involving a mix of structured, unstructured and semi-structured data.

Part of Hadoop's allure is that it consists of a variety of open source software components and associated tools for capturing, processing, managing and analyzing data. But, as addressed in a previous article in this series, to help users take advantage of the framework, many vendors offer commercial Hadoop distributions that provide performance and functionality enhancements over the base Apache open source technology and bundle the software with maintenance and support services.

As the next step, let's take a look at how a Hadoop distribution could benefit your organization.

Making a case for a Hadoop distribution

Hadoop runs on clusters of commodity servers and typically is used to support data analysis, not online transaction processing applications. Several increasingly common analytics use cases map nicely to its distributed data processing and parallel computation model. The list includes:

- Operational intelligence applications for capturing streaming data from transaction processing systems and organizational assets, monitoring performance levels, and applying predictive analytics for pre-emptive maintenance or process changes.
- Web analytics, which are intended to help companies understand the demographics and online activities of website visitors, review Web server logs to detect system performance problems, and identify ways to enhance digital marketing efforts.
- Security and risk management, such as running analytical models that compare transactional data to a knowledge base of fraudulent activity patterns, as well as continuous cybersecurity analysis for identifying emerging patterns of suspicious behavior.
- Marketing optimization, including recommendation engines that absorb huge amounts of Internet clickstream and online sales data and blend that information with customer profiles to provide real-time suggestions for product bundling and upselling.
- Internet of Things applications, such as analyzing data from things -- like manufacturing devices, pipelines and so-called smart buildings -- via sensors that continuously generate and broadcast information about their status and performance.
- Sentiment analysis and brand protection, which might involve capturing streaming social media data and analyzing the text to identify unsatisfied customers whose issues can be addressed quickly.
- Massive data ingestion for data collection, processing and integration scenarios, such as capturing satellite images and geospatial data.
- Data staging, in which Hadoop is used as an initial landing spot for data that is then integrated, cleansed and transformed into more structured formats in preparation for loading into a data warehouse or analytical database for analysis.
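
As a small illustration of the data staging use case above, the following sketch uses Hadoop's FileSystem API to land a raw file in HDFS, where downstream jobs can cleanse and transform it. The file and directory paths are hypothetical, and in practice dedicated ingestion tools such as Flume or Sqoop often handle this step.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StageRawData {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS and other cluster settings from the
            // standard *-site.xml files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths: land a raw extract in a staging directory,
            // where downstream jobs will cleanse and transform it.
            Path local = new Path("/tmp/claims_extract.csv");
            Path staged = new Path("/data/staging/claims/claims_extract.csv");

            fs.mkdirs(staged.getParent());
            fs.copyFromLocalFile(local, staged);
            System.out.println("Staged " + local + " to " + staged);
        }
    }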

Capabilities supporting the use cases

Applications supporting these usage scenarios can be built on top of Hadoop using some prototypical implementation methodologies, such as:

- Data lakes. Because Hadoop delivers linear scalability for processing and storage as new data nodes are incorporated into a cluster architecture, it provides a natural platform for capturing and managing voluminous files of raw data. This has motivated many users to implement Hadoop systems as a catchall platform for their data, creating a conceptual data lake.
- Data warehouse augmentation platform. Hadoop's distributed storage can also be used to expand the data that's accessible for analysis in a data warehouse environment. For example, a temperature-based scheme can be used to allocate data to different levels of the storage hierarchy, depending on its frequency of use. The most frequently accessed "hot" data is kept in the data warehouse, while less frequently used "cool" data is relegated to higher-latency storage such as the Hadoop Distributed File System. This approach relies on tightly coupled data warehouse integration with Hadoop.
- Large-scale batch computation engine. When configured with a combination of data and compute nodes, Hadoop becomes a massively parallel processing platform that's suited to batch applications for manipulating and analyzing data. One example would be data standardization and transformation jobs applied to data sets to prepare them for analysis. Algorithm-driven analytics applications such as data mining, machine learning, pattern analysis and predictive modeling are also good matches for Hadoop's batch capabilities, as they can be executed in parallel over massive distributed data files, with iterations of partial results accumulated until the program completes with a final set of results.
- Event stream analytics processing engine. A Hadoop environment can also be configured to process incoming data streams in real or near real time. As an example, a customer sentiment analysis application can have multiple communication agents running in parallel on a Hadoop cluster, each applying a set of stream processing rules to data feeds from social networks such as Twitter and Facebook. A minimal sketch of this pattern follows below.
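
As a sketch of the event stream analytics pattern just described, the following uses the Spark Streaming Java API, which can run on a YARN-managed Hadoop cluster, to apply one simple processing rule to an incoming feed. The socket source, brand name and keywords are hypothetical stand-ins; a real deployment would more likely consume social media data through a message broker such as Kafka.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SentimentStreamSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("sentiment-stream-sketch");
            // Process the incoming feed in 10-second micro-batches.
            JavaStreamingContext ssc =
                    new JavaStreamingContext(conf, Durations.seconds(10));

            // A socket source stands in here for a real social media feed.
            JavaDStream<String> posts = ssc.socketTextStream("localhost", 9999);

            // One simple stream processing rule: flag posts that mention
            // the (hypothetical) brand together with negative wording.
            JavaDStream<String> complaints = posts.filter(post -> {
                String text = post.toLowerCase();
                return text.contains("examplebrand")
                    && (text.contains("broken") || text.contains("refund"));
            });

            // Print a sample of flagged posts for each micro-batch.
            complaints.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }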

Advantages of adopting Hadoop: Is it right for you?

A low-cost, high-performance computing framework like Hadoop can address different IT and business motivations for scaling up processing power or expanding data management capabilities in an organization. Let's examine some characteristics of application requirements that suggest the need for a data management platform based on a Hadoop distribution:

- Ingestion and processing of large data sets, massive data volumes and streaming data. Examples include capturing Web server logs that contain information about billions of online events; indexing hundreds of millions of documents across different data sets; and continuously pulling in data streams such as social media channels, stock market data, news feeds and content published at expert communities.
- A need to eliminate performance impediments. Application performance is often throttled on traditional data warehouse systems as a result of data accessibility, latency and availability issues, or bandwidth limits relative to the amount of data that needs to be processed.
- The desire for linear scalability of performance. As data volumes grow and the number of users increases, having an environment in which performance scales linearly as more computing and storage resources are added can be crucial, especially when applications can benefit from parallel computing.
- A mixture of structured and unstructured data. The applications need to use data from different sources that vary in structure, and some -- or much -- of it is unstructured or semi-structured, for example, text or server log data.
- IT cost efficiencies. Rather than paying premium prices for high-end servers or specialty hardware appliances, the system architects believe that acceptable performance can be achieved using commodity components.

Considerations for integrating Hadoop into the enterprise

A positive value proposition for using Hadoop still must be balanced, though, against the feasibility of integrating the platform into the enterprise. Because many organizations have made significant investments in traditional data warehouse platforms, there may be some resistance to introducing a newer technology.

Before engaging a Hadoop distribution vendor, work to resolve any potential barriers to adoption and assess requirements for cluster sizing and configuration. For example, determine where a Hadoop cluster fits in your organization's data warehousing and analytics strategy -- whether it's intended to augment existing data warehouses or replace them. Also, identify integration and interoperability issues that need to be addressed, and review configuration alternatives, including whether it's better to implement the Hadoop ecosystem on premises or in a cloud-based or hosted environment. In addition, ensure that you have funding to hire people with the right skills or retrain existing employees; Hadoop application development differs greatly from conventional database development.

Answering these types of questions will help in determining the feasibility of a Hadoop deployment. The next step, which will be examined in the third article in this series, is to evaluate the features and functions you need in a commercial Hadoop distribution.

About the author

David Loshin, managing director at Decisionworx, is a recognized thought leader, speaker and expert consultant. He has written numerous books, including Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL and Graph. He can be reached through his website at www.decisionworx.com.

Email us at editor@searchbusinessanalytics.com and follow us on Twitter: @BizAnalyticsTT.