Pro Apache Hadoop. Second Edition. Sameer Wadkar Madhu Siddalingaiah


Pro Apache Hadoop
Copyright 2014 by Sameer Wadkar and Madhu Siddalingaiah

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk): 978-1-4302-4863-7
ISBN-13 (electronic): 978-1-4302-4864-4

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Publisher: Heinz Weinheimer
Lead Editor: Jonathan Gennick
Technical Reviewer: Vimlesh Om Mittal
Editorial Board: Steve Anglin, Mark Beckner, Ewan Buckingham, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Jonathan Hassell, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Jill Balzano
Copy Editor: Nancy Sixsmith
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Cover Designer: Anna Ishchenko

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this text is available to readers at www.apress.com. For detailed information about how to locate your book's source code, go to www.apress.com/source-code/.

This book is dedicated to the Open Source Community, whose contributions have made Software Development such a vibrant and exciting profession.
Sameer

To Sasha, Eashan, and Shilpa.
Madhu

Contents at a Glance
About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Motivation for Big Data
Chapter 2: Hadoop Concepts
Chapter 3: Getting Started with the Hadoop Framework
Chapter 4: Hadoop Administration
Chapter 5: Basics of MapReduce Development
Chapter 6: Advanced MapReduce Development
Chapter 7: Hadoop Input/Output
Chapter 8: Testing Hadoop Programs
Chapter 9: Monitoring Hadoop
Chapter 10: Data Warehousing Using Hadoop
Chapter 11: Data Processing Using Pig
Chapter 12: HCatalog and Hadoop in the Enterprise
Chapter 13: Log Analysis Using Hadoop
Chapter 14: Building Real-Time Systems Using HBase
Chapter 15: Data Science with Hadoop
Chapter 16: Hadoop in the Cloud
Chapter 17: Building a YARN Application
Appendix A: Installing Hadoop
Appendix B: Using Maven with Eclipse
Appendix C: Apache Ambari
Index

Contents
About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Motivation for Big Data
  What Is Big Data?
  Key Idea Behind Big Data Techniques
    Data Is Distributed Across Several Nodes
    Applications Are Moved to the Data
    Data Is Processed Local to a Node
    Sequential Reads Preferred Over Random Reads
    An Example
  Big Data Programming Models
    Massively Parallel Processing (MPP) Database Systems
    In-Memory Database Systems
    MapReduce Systems
    Bulk Synchronous Parallel (BSP) Systems
  Big Data and Transactional Systems
  How Much Can We Scale?
    A Compute-Intensive Example
    Amdahl's Law
  Business Use-Cases for Big Data
  Summary
Chapter 2: Hadoop Concepts
  Introducing Hadoop
  Introducing the MapReduce Model
  Components of Hadoop
    Hadoop Distributed File System (HDFS)
    Secondary NameNode
    TaskTracker
    JobTracker
  Hadoop 2.0
    Components of YARN
  HDFS High Availability
  Summary
Chapter 3: Getting Started with the Hadoop Framework
  Types of Installation
    Stand-Alone Mode
    Pseudo-Distributed Cluster
    Multinode Node Cluster Installation
    Preinstalled Using Amazon Elastic MapReduce
  Setting up a Development Environment with a Cloudera Virtual Machine
  Components of a MapReduce program
  Your First Hadoop Program
    Prerequisites to Run Programs in Local Mode
    WordCount Using the Old API
    Building the Application
    Running WordCount in Cluster Mode
    WordCount Using the New API
    Building the Application
    Running WordCount in Cluster Mode
  Third-Party Libraries in Hadoop Jobs
  Summary
Chapter 4: Hadoop Administration
  Hadoop Configuration Files
  Configuring Hadoop Daemons
  Precedence of Hadoop Configuration Files
  Diving into Hadoop Configuration Files
    core-site.xml
    hdfs-*.xml
    mapred-site.xml
    yarn-site.xml
    Memory Allocations in YARN
  Scheduler
    Capacity Scheduler
    Fair Scheduler
    Fair Scheduler Configuration
    yarn-site.xml Configurations
    Allocation File Format and Configurations
    Determine Dominant Resource Share in drf Policy
  Slaves File
  Rack Awareness
    Providing Hadoop with Network Topology
  Cluster Administration Utilities
    Check the HDFS
    Command-Line HDFS Administration
    Rebalancing HDFS Data
    Copying Large Amounts of Data from the HDFS
  Summary
Chapter 5: Basics of MapReduce Development
  Hadoop and Data Processing
  Reviewing the Airline Dataset
    Preparing the Development Environment
    Preparing the Hadoop System
  MapReduce Programming Patterns
    Map-Only Jobs (SELECT and WHERE Queries)
    Problem Definition: SELECT Clause
    Problem Definition: WHERE Clause
    Map and Reduce Jobs (Aggregation Queries)
    Problem Definition: GROUP BY and SUM Clauses
    Improving Aggregation Performance Using the Combiner
    Problem Definition: Optimized Aggregators
    Role of the Partitioner
    Problem Definition: Split Airline Data by Month
  Bringing it All Together
  Summary
Chapter 6: Advanced MapReduce Development
  MapReduce Programming Patterns
    Introduction to Hadoop I/O
    Problem Definition: Sorting
    Problem Definition: Analyzing Consecutive Records
    Problem Definition: Join Using MapReduce
    Problem Definition: Join Using Map-Only jobs
    Writing to Multiple Output Files in a Single MR Job
    Collecting Statistics Using Counters
  Summary
Chapter 7: Hadoop Input/Output
  Compression Schemes
    What Can Be Compressed?
    Compression Schemes
    Enabling Compression
  Inside the Hadoop I/O processes
    InputFormat
    OutputFormat
    Custom OutputFormat: Conversion from Text to XML
    Custom InputFormat: Consuming a Custom XML file
  Hadoop Files
    SequenceFile
    MapFiles
    Avro Files
  Summary
Chapter 8: Testing Hadoop Programs
  Revisiting the Word Counter
  Introducing MRUnit
    Installing MRUnit
    MRUnit Core Classes
    Writing an MRUnit Test Case
    Testing Counters
    Features of MRUnit
    Limitations of MRUnit
  Testing with LocalJobRunner
    Limitations of LocalJobRunner
  Testing with MiniMRCluster
    Setting up the Development Environment
    Example for MiniMRCluster
    Limitations of MiniMRCluster
  Testing MR Jobs with Access Network Resources
  Summary
Chapter 9: Monitoring Hadoop
  Writing Log Messages in Hadoop MapReduce Jobs
  Viewing Log Messages in Hadoop MapReduce Jobs
  User Log Management in Hadoop 2.x
    Log Storage in Hadoop 2.x
    Log Management Improvements
    Viewing Logs Using Web Based UI
    Command-Line Interface
    Log Retention
  Hadoop Cluster Performance Monitoring
  Using YARN REST APIs
  Managing the Hadoop Cluster Using Vendor Tools
    Ambari Architecture
  Summary
Chapter 10: Data Warehousing Using Hadoop
  Apache Hive
    Installing Hive
    Hive Architecture
    Metastore
    Compiler Basics
    Hive Concepts
    HiveQL Compiler Details
    Data Definition Language
    Data Manipulation Language
    External Interfaces
    Hive Scripts
    Performance
    MapReduce Integration
    Creating Partitions
    User-Defined Functions
  Impala
    Impala Architecture
    Impala Features
    Impala Limitations
  Shark
    Shark/Spark Architecture
  Summary
Chapter 11: Data Processing Using Pig
  An Introduction to Pig
  Running Pig
    Executing in the Grunt Shell
    Executing a Pig Script
    Embedded Java Program
  Pig Latin
    Comments in a Pig Script
    Execution of Pig Statements
    Pig Commands
  User-Defined Functions
    Eval Functions Invoked in the Mapper
    Eval Functions Invoked in the Reducer
    Writing and Using a Custom FilterFunc
  Comparison of PIG versus Hive
  Crunch API
    How Crunch Differs from Pig
    Sample Crunch Pipeline
  Summary
Chapter 12: HCatalog and Hadoop in the Enterprise
  HCatalog and Enterprise Data Warehouse Users
  HCatalog: A Brief Technical Background
    HCatalog Command-Line Interface
    WebHCat
    HCatalog Interface for MapReduce
    HCatalog Interface for Pig
    HCatalog Notification Interface
  Security and Authorization in HCatalog
  Bringing It All Together
  Summary
Chapter 13: Log Analysis Using Hadoop
  Log File Analysis Applications
    Web Analytics
    Security Compliance and Forensics
    Monitoring and Alerts
    Internet of Things
  Analysis Steps
    Load
    Refine
    Visualize
  Apache Flume
    Core Concepts
  Netflix Suro
  Cloud Solutions
  Summary
Chapter 14: Building Real-Time Systems Using HBase
  What Is HBase?
  Typical HBase Use-Case Scenarios
  HBase Data Model
    HBase Logical or Client-Side View
    Differences Between HBase and RDBMSs
    HBase Tables
    HBase Cells
    HBase Column Family
  HBase Commands and APIs
    Getting a Command List: help Command
    Creating a Table: create Command
    Adding Rows to a Table: put Command
    Retrieving Rows from the Table: get Command
    Reading Multiple Rows: scan Command
    Counting the Rows in the Table: count Command
    Deleting Rows: delete Command
    Truncating a Table: truncate Command
    Dropping a Table: drop Command
    Altering a Table: alter Command
  HBase Architecture
    HBase Components
    Compaction and Splits in HBase
    Compaction
  HBase Configuration: An Overview
    hbase-default.xml and hbase-site.xml
  HBase Application Design
    Tall vs. Wide vs. Narrow Table Design
    Row Key Design
  HBase Operations Using Java API
    HBase Treats Everything as Bytes
    Create an HBase Table
    Administrative Functions Using HBaseAdmin
    Accessing Data Using the Java API
  HBase MapReduce Integration
  A MapReduce Job to Read an HBase Table
  HBase and MapReduce Clusters
    Scenario I: Frequent MapReduce Jobs Against HBase Tables
    Scenario II: HBase and MapReduce have Independent SLAs
  Summary
Chapter 15: Data Science with Hadoop
  Hadoop Data Science Methods
  Apache Hama
    Bulk Synchronous Parallel Model
    Hama Hello World!
    Monte Carlo Methods
    K-Means Clustering
  Apache Spark
    Resilient Distributed Datasets (RDDs)
    Monte Carlo with Spark
    KMeans with Spark
  RHadoop
  Summary
Chapter 16: Hadoop in the Cloud
  Economics
    Self-Hosted Cluster
    Cloud-Hosted Cluster
    Elasticity
    On Demand
    Bid Pricing
    Hybrid Cloud
  Logistics
    Ingress/Egress
    Data Retention
  Security
  Cloud Usage Models
  Cloud Providers
    Amazon Web Services
    Google Cloud Platform
    Microsoft Azure
    Choosing a Cloud Vendor
  Case Study: Amazon Web Services
    Elastic MapReduce
    Elastic Compute Cloud
  Summary
Chapter 17: Building a YARN Application
  YARN: A General-Purpose Distributed System
  YARN: A Quick Review
  Creating a YARN Application
    POM Configuration
  DownloadService.java Class
  Client.java
    Steps to Launch the Application Master from the Client
  ApplicationMaster.java
    Communication Protocol between Application Master and Resource Manager: Application Master Protocol
    Node Manager Communication Protocol: Container Management Protocol
    Steps to Launch the Worker Tasks
  Executing the Application Master
    Launch the Application in Un-Managed Mode
    Launch the Application in Managed Mode
  Summary
Appendix A: Installing Hadoop
  Installing Hadoop 2.2.0 on Windows
    Preparing the Installation Environment
    Building Hadoop 2.2.0 for Windows
    Installing Hadoop 2.2.0 for Windows
    Configuring Hadoop 2.2.0
    Preparing the Hadoop Cluster
    Starting HDFS
    Starting MapReduce (YARN)
    Verifying that the Cluster Is Running
    Testing the Cluster
  Installing Hadoop 2.2.0 on Linux
Appendix B: Using Maven with Eclipse
  A Quick Introduction to Maven
    Creating a Maven Project
  Using Maven with Eclipse
    Installing the m2e Maven Eclipse Plug-in
    Creating a Maven Project from Eclipse
    Building a Maven Project from Eclipse
Appendix C: Apache Ambari
  Hadoop Components Supported by Apache Ambari
  Installing Apache Ambari
  Trying the Ambari Sandbox on Your OS
Index

About the Authors

Sameer Wadkar has more than 16 years of experience in software architecture and development. He has a bachelor's degree in electrical engineering and an MBA in finance from Mumbai University, a postgraduate diploma in software engineering from the National Center of Software Technology (now Center for Development of Advanced Computing), and a master's degree in applied and computational mathematics from Johns Hopkins University. He has implemented distributed systems and high-traffic web sites for a wide variety of clients ranging from federal agencies to investment banking companies. Sameer has been actively working on Hadoop/HBase implementations since 2011 and is also an open-source contributor. His GitHub page is https://github.com/sameerwadkar. Sameer's open-source contributions include a version of the popular text mining algorithm known as Latent Dirichlet Allocation, which scales to millions of documents on a single machine. He is an avid chess player and plays actively on www.chessclub.com.

Madhu Siddalingaiah is a technology consultant with 25 years of experience in a variety of business domains, including aerospace, health care, financial, energy, defense, and scientific research. Over the years, he has specialized in electrical engineering, Internet technologies, and Big Data. More recently, Madhu has delivered several high-profile Big Data systems and solutions. He earned his physics degree from the University of Maryland. Outside of his profession, Madhu is a private helicopter pilot, enjoys travel, and participates in a constantly growing list of hobbies.

About the Technical Reviewer

Vimlesh Om Mittal has more than 14 years of technology implementation experience and is a Cloudera Certified Developer for Apache Hadoop CDH4. He has a very broad technology background, and his experience includes projects involving modernizing business processes, custom application development, ETL development, digital application development, business intelligence, database design, and data conversions. Vimlesh's experience is a strong blend of architecting robust information management systems, solution architecture, and software development in several development frameworks such as Big Data using the Hadoop ecosystem; data ingestion/acquisition; ETL; reporting/analytics; .NET; J2EE; and scripting languages in different vendor technologies such as Cloudera, SAS, IBM, Oracle, Microsoft, iOS, and BEA. Vimlesh's focus also includes solutions for financial services data, mobile application design, and development. He is also well-versed in all aspects of the software development life cycle and has extensive experience in project management disciplines.

Acknowledgments

Hadoop has come a long way over the past decade with a large number of vendors and independent developers contributing to it. Hadoop in turn depends on a large number of open source libraries. We would like to acknowledge the effort of all the open source contributors who have contributed to Hadoop. We would also like to acknowledge the blog writers and professionals who contribute to answering Hadoop-related queries on forums. Their participation contributes to making a complex product such as Hadoop mainstream and easier to use.

We want to thank the Apress staff members who have applied their expertise to make this book into something readable. We also want to thank our family, friends, and colleagues for supporting us through the writing of this book.

Introduction

This book is designed to be a concise guide to using the Hadoop software. Although Hadoop has been around for more than half a decade, developing with it is still a very stressful yet very rewarding task. The documentation has come a long way since the early years, and Hadoop is growing rapidly as its adoption increases in the Enterprise. Hadoop 2.0 is based on the YARN framework, which is a significant rewrite of the underlying Hadoop platform.

It has been our goal to distill in this book the hard lessons learned while implementing Hadoop for clients. As authors, we like to delve deep into the Hadoop source code to understand why Hadoop does what it does and the motivations behind some of its design decisions. We have tried to share this insight with you. We hope that you will not only learn Hadoop in depth but also gain fresh insight into the Java language in the process.

This book is about Big Data in general and Hadoop in particular. It is not possible to understand Hadoop without appreciating the overall Big Data landscape. The book is written primarily from the point of view of a Hadoop developer and requires an intermediate-level ability to program using Java. It is designed for practicing Hadoop professionals. You will learn several practical tips on how to use the Hadoop software, gleaned from our own experience in implementing Hadoop-based systems.

This book provides step-by-step instructions and examples that will take you from just beginning to use Hadoop to running complex applications on large clusters of machines. Here's a brief rundown of the book's contents:

Chapter 1 introduces you to the motivations behind Big Data software, explaining various Big Data paradigms.
Chapter 2 is a high-level introduction to Hadoop 2.0 or YARN. It introduces the key concepts underlying the Hadoop platform.
Chapter 3 gets you started with Hadoop. In this chapter, you will write your first MapReduce program.
Chapter 4 introduces the key concepts behind the administration of the Hadoop platform.
Chapters 5, 6, and 7, which form the core of this book, do a deep dive into the MapReduce framework. You learn all about the internals of the MapReduce framework. We discuss the MapReduce framework in the context of the most ubiquitous of all languages, SQL. We emulate common SQL functions such as SELECT, WHERE, GROUP BY, and JOIN using MapReduce. One of the most popular applications for Hadoop is ETL offloading. These chapters enable you to appreciate how MapReduce can support common data-processing functions. We discuss not just the API but also the more complicated concepts and internal design of the MapReduce framework.
Chapter 8 describes the testing frameworks that support unit/integration testing of MapReduce programs.
Chapter 9 describes logging and monitoring of the Hadoop Framework.
Chapter 10 introduces the Hive framework, the data warehouse framework on top of MapReduce.

Chapter 11 introduces the Pig and Crunch frameworks. These frameworks enable users to create data-processing pipelines in Hadoop.
Chapter 12 describes the HCatalog framework, which enables Enterprise users to access data stored in the Hadoop file system using commonly known abstractions such as databases and tables.
Chapter 13 describes how Hadoop can be used for streaming log analysis.
Chapter 14 introduces you to HBase, the NoSQL database on top of Hadoop. You learn about use-cases that motivate the use of HBase.
Chapter 15 is a brief introduction to data science. It describes the main limitations of MapReduce that make it inadequate for data science applications. You are introduced to new frameworks such as Spark and Hama that were developed to circumvent MapReduce limitations.
Chapter 16 is a brief introduction to using Hadoop in the cloud. It enables you to work on a true production-grade Hadoop cluster from the comfort of your living room.
Chapter 17 is a whirlwind introduction to the key addition to Hadoop 2.0: the capability to develop your own distributed frameworks such as MapReduce on top of Hadoop. We describe how you can develop a simple distributed download service using Hadoop 2.0.