Big Data Made Easy. A Working Guide to the Complete Hadoop Toolset. Michael Frampton



Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

Copyright 2015 by Michael Frampton

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk): ISBN-13 (electronic):

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director: Welmoed Spahr
Acquisitions Editor: Jeff Olson
Developmental Editor: Linda Laflamme
Technical Reviewer: Andrzej Szymanski
Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, James DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Rita Fernando
Copy Editor: Carole Berglie
Compositor: SPi Global
Indexer: SPi Global

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , e-mail orders-ny@springer-sbm.com, or visit

Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at

Any source code or other supplementary materials referenced by the author in this text is available to readers at For detailed information about how to locate your book's source code, go to

This book is dedicated to my family: to my wife, my son, and my parents.


Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: The Problem with Data
Chapter 2: Storing and Configuring Data with Hadoop, YARN, and ZooKeeper
Chapter 3: Collecting Data with Nutch and Solr
Chapter 4: Processing Data with Map Reduce
Chapter 5: Scheduling and Workflow
Chapter 6: Moving Data
Chapter 7: Monitoring Data
Chapter 8: Cluster Management
Chapter 9: Analytics with Hadoop
Chapter 10: ETL with Hadoop
Chapter 11: Reporting with Hadoop
Index


Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: The Problem with Data
  A Definition of Big Data
  The Potentials and Difficulties of Big Data
    Requirements for a Big Data System
    How Hadoop Tools Can Help
    My Approach
  Overview of the Big Data System
    Big Data Flow and Storage
    Benefits of Big Data Systems
  What's in This Book
    Storage: Chapter
    Data Collection: Chapter
    Processing: Chapter
    Scheduling: Chapter
    Data Movement: Chapter
    Monitoring: Chapter
    Cluster Management: Chapter
    Analysis: Chapter
    ETL: Chapter
    Reports: Chapter
  Summary

Chapter 2: Storing and Configuring Data with Hadoop, YARN, and ZooKeeper
  An Overview of Hadoop
    The Hadoop V1 Architecture
    The Differences in Hadoop V
    The Hadoop Stack
    Environment Management
  Hadoop V1 Installation
    Hadoop Single-Node Installation
    Setting up the Cluster
    Running a Map Reduce Job
    Check Hadoop User Interfaces
  Hadoop V2 Installation
    ZooKeeper Installation
    Hadoop MRv2 and YARN
  Hadoop Commands
    Hadoop Shell Commands
    Hadoop User Commands
    Hadoop Administration Commands
  Summary

Chapter 3: Collecting Data with Nutch and Solr
  The Environment
    Stopping the Servers
    Changing the Environment Scripts
    Starting the Servers
  Architecture 1: Nutch 1.x
    Nutch Installation
    Solr Installation
    Running Nutch with Hadoop
  Architecture 2: Nutch 2.x
    Nutch and Solr Configuration
    HBase Installation
    Gora Configuration
    Running the Nutch Crawl
    Potential Errors
  A Brief Comparison
  Summary

Chapter 4: Processing Data with Map Reduce
  An Overview of the Word-Count Algorithm
  Map Reduce Native
    Java Word-Count Example
    Java Word-Count Example
    Comparing the Examples
  Map Reduce with Pig
    Installing Pig
    Running Pig
    Pig User-Defined Functions
  Map Reduce with Hive
    Installing Hive
    Hive Word-Count Example
  Map Reduce with Perl
  Summary

Chapter 5: Scheduling and Workflow
  An Overview of Scheduling
    The Capacity Scheduler
    The Fair Scheduler
  Scheduling in Hadoop V
    V1 Capacity Scheduler
    V1 Fair Scheduler
  Scheduling in Hadoop V
    V2 Capacity Scheduler
    V2 Fair Scheduler
  Using Oozie for Workflow
    Installing Oozie
    The Mechanics of the Oozie Workflow
    Creating an Oozie Workflow
    Running an Oozie Workflow
    Scheduling an Oozie Workflow
  Summary

Chapter 6: Moving Data
  Moving File System Data
    The Cat Command
    The CopyFromLocal Command
    The CopyToLocal Command
    The Cp Command
    The Get Command
    The Put Command
    The Mv Command
    The Tail Command
  Moving Data with Sqoop
    Check the Database
    Install Sqoop
    Use Sqoop to Import Data to HDFS
    Use Sqoop to Import Data to Hive
  Moving Data with Flume
    Install Flume
    A Simple Agent
    Running the Agent
  Moving Data with Storm
    Install ZeroMQ
    Install JZMQ
    Install Storm
    Start and Check Zookeeper
    Run Storm
    An Example of Storm Topology
  Summary

Chapter 7: Monitoring Data
  The Hue Browser
    Installing Hue
    Starting Hue
    Potential Errors
    Running Hue
  Ganglia
    Installing Ganglia
    Potential Errors
    The Ganglia Interface
  Nagios
    Installing Nagios
    Potential Errors
    The Nagios Interface
  Summary

Chapter 8: Cluster Management
  The Ambari Cluster Manager
    Ambari Installation
  The Cloudera Cluster Manager
    Installing Cloudera Cluster Manager
    Running Cloudera Cluster Manager
  Apache Bigtop
    Installing Bigtop
    Running Bigtop Smoke Tests
  Summary

Chapter 9: Analytics with Hadoop
  Cloudera Impala
    Installation of Impala
    Impala User Interfaces
    Uses of Impala
  Apache Hive
    Database Creation
    External Table Creation
    Hive UDFs
    Table Creation
    The SELECT Statement
    The WHERE Clause
    The Subquery
    Table Joins
    The INSERT Statement
    Organization of Table Data
  Apache Spark
    Installation of Spark
    Uses of Spark
    Spark SQL
  Summary

Chapter 10: ETL with Hadoop
  Pentaho Data Integrator
    Installing Pentaho
    Running the Data Integrator
    Creating ETL
    Potential Errors
  Talend Open Studio
    Installing Open Studio for Big Data
    Running Open Studio for Big Data
    Creating the ETL
    Potential Errors
  Summary

Chapter 11: Reporting with Hadoop
  Hunk
    Installing Hunk
    Running Hunk
    Creating Reports and Dashboards
    Potential Errors
  Talend Reports
    Installing Talend
    Running Talend
    Generating Reports
    Potential Errors
  Summary

Index


About the Author

Michael Frampton has been in the IT industry since 1990, working in a variety of roles (tester, developer, support, QA) and many sectors (telecoms, banking, energy, insurance). He has also worked for major corporations and banks as a contractor and a permanent member of staff, including Agilent, BT, IBM, HP, Reuters, and JPMorgan Chase. The owner of Semtech Solutions, an IT/Big Data consultancy, Mike Frampton currently lives by the beach in Paraparaumu, New Zealand, with his wife and son. Mike has a keen interest in new IT-based technologies and the way that technologies integrate. Being married to a Thai national, Mike divides his time between Paraparaumu or Wellington in New Zealand and their house in Roi Et, Thailand.


About the Technical Reviewer

Andrzej Szymanski started his IT career in 1992, in the data mining, warehousing, and customer profiling industry, the very origins of what is big data today. His main focus has been data processing and analysis, as well as development, systems, and database administration across all main platforms, such as IBM Mainframe, Unix, and Windows, and all leading DBMSs, such as Sybase, Oracle, MS SQL, and MySQL. Szymanski's big data and DevOps adventure began in News International, in January 2011, where he was a key player in creating a fully scalable and distributable big data ecosystem, with an aim of sharing it with subsidiaries of News Corporation. This involved R&D, solution architecture, creating ETL workflows for big data, Continuous Integration Zero Touch deployment mechanisms, and system administration and knowledge transfer to sister companies, to name but a few of the key areas. Szymanski was born in Poland, where he completed his primary and secondary education. He studied economics in Moscow, but his key passion has always been computers. He is currently based in Prague.


Acknowledgments

I would like to thank my wife and son for allowing me the time to write this book. Without your support, Teeruk, developing this book would not have been possible. I would also like to thank all those who gladly answered my technical questions about the software covered in this book. I extend my gratitude to the Apache and Lucene organizations, without whom open-source-based projects like this one would not be possible. Also, specific thanks go to Deborah Wiltshire (Cloudera); Diya Soubra (ARM); Mary Starr (Nagios); Michael Armbrust (Spark); Rebecca G. Shomair, Daniel Bechtel, and Michael Mrstik (Pentaho); and Chris Taylor and Mark Balkenende (Talend). Lastly, my thanks go to Andrzej Szymanski, who carried out a precise technical check, and to Rita Fernando, Jeff Olson, and Linda Laflamme for their editorial help.


Introduction

If you would like to learn about the big data Hadoop-based toolset, then Big Data Made Easy is for you. It provides a wide overview of Hadoop and the tools you can use with it. I have based the Hadoop examples in this book on CentOS, the popular and easily accessible Linux version; each of its practical examples takes a step-by-step approach to installation and execution. Whether you have a pressing need to learn about Hadoop or are just curious, Big Data Made Easy will provide a starting point and offer a gentle learning curve through the functional layers of Hadoop-based big data. Starting with a set of servers and with just CentOS installed, I lead you through the steps of downloading, installing, using, and error checking. The book covers the following topics:

Hadoop installation (V1 and V2)
Web-based data collection (Nutch, Solr, Gora, HBase)
Map Reduce programming (Java, Pig, Perl, Hive)
Scheduling (Fair and Capacity schedulers, Oozie)
Moving data (Hadoop commands, Sqoop, Flume, Storm)
Monitoring (Hue, Nagios, Ganglia)
Hadoop cluster management (Ambari, CDH)
Analysis with SQL (Impala, Hive, Spark)
ETL (Pentaho, Talend)
Reporting (Splunk, Talend)

As you reach the end of each topic, having completed each example installation, you will be increasing your depth of knowledge and building a Hadoop-based big data system. No matter what your role in the IT world, appreciation of the potential in Hadoop-based tools is best gained by working along with these examples. Having worked in development, support, and testing of systems based in data warehousing, I could see that many aspects of the data warehouse system translate well to big data systems. I have tried to keep this book practical and organized according to the topics listed above. It covers more than storage and processing; it also considers such topics as data collection and movement, scheduling and monitoring, analysis and management, and ETL and reporting.

This book is for anyone seeking a practical introduction to the world of Linux-based Hadoop big data tools. It does not assume knowledge of Hadoop, but it does require some knowledge of Linux and SQL. Each command is explained at the point where it is used.

Downloading the Code

The source code for this book is available in ZIP file format in the Downloads section of the Apress website.

Contacting the Author

I hope that you find this book useful and that you enjoy the Hadoop system as much as I have. I am always interested in new challenges and understanding how people are using the technologies covered in this book. Tell me about what you're doing! You can find me on LinkedIn. In addition, you can contact me via my website or by e-mail at mike_frampton@hotmail.com.