Preface About the Book

We are living at the dawn of what the World Economic Forum (WEF) termed the "Fourth Industrial Revolution" in 2016. The Fourth Industrial Revolution is marked by the emergence of "cyber-physical systems", in which software interfaces seamlessly over networks with physical systems, such as sensors, smartphones, vehicles, power grids or buildings, to create a new world of the Internet of Things (IoT). Data and information are the fuel of this new age, in which powerful analytics algorithms burn this fuel to generate decisions that are expected to create a smarter and more efficient world for all of us to live in. This new area of technology has been defined as Big Data Science and Analytics, and the industrial and academic communities are recognizing it as a competitive technology that can generate significant new wealth and opportunity.

Big data is defined as collections of datasets whose volume, velocity or variety is so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools. In recent years, there has been exponential growth in both the structured and unstructured data generated by information technology, industrial, healthcare, retail, web, and other systems. Big data science and analytics deals with the collection, storage, processing and analysis of massive-scale data on cloud-based computing systems. Industry surveys, by Gartner and e-skills, for instance, predict that there will be over 2 million job openings for engineers and scientists trained in the area of data science and analytics alone, and that the job market in this area is growing at a rate of 150 percent year over year.

There are very few books that can serve as a foundational textbook for colleges looking to create new educational programs in these areas of big data science and analytics.
Existing books are primarily focused on the business side of analytics, on describing vendor-specific offerings for certain types of analytics applications, or on the implementation of certain analytics algorithms in specialized languages, such as R. We have written this textbook, as part of our expanding "A Hands-On Approach"™ series, to meet this need at colleges and universities, and also for big data service providers who may be interested in offering a broader perspective of this emerging field to accompany their customer and developer training programs.

The typical reader is expected to have completed a couple of college-level courses in programming using traditional high-level languages, and is either a senior or a beginning graduate student in one of the science, technology, engineering or mathematics (STEM) fields. The reader is provided the necessary guidance and knowledge to develop working code for real-world big data applications. In our opinion, the concurrent development of practical applications alongside the traditional instructional material in the book further enhances the learning process. Furthermore, an accompanying website for this book contains additional support for instruction and learning (www.big-data-analytics-book.com).

The book is organized into three main parts, comprising a total of twelve chapters. Part I provides an introduction to big data, applications of big data, and big data science and analytics patterns and architectures. A novel data science and analytics application system design methodology is proposed, and its realization through the use of open-source big data frameworks is described. This methodology describes big data analytics applications as realizations of the proposed Alpha, Beta, Gamma and Delta models, which comprise tools and frameworks for collecting and ingesting data from various sources into the big data analytics infrastructure, distributed filesystems and non-relational (NoSQL) databases for data storage, processing frameworks for batch and real-time analytics, serving databases, and web and visualization frameworks. This new methodology forms the pedagogical foundation of this book.
Part II introduces the reader to various tools and frameworks for big data analytics, and the architectural and programming aspects of these frameworks as used in the proposed design methodology. We chose Python as the primary programming language for this book. Other languages may also be easily used within the Big Data stack described in this book. We describe tools and frameworks for data acquisition, including publish-subscribe messaging frameworks such as Apache Kafka and Amazon Kinesis, source-sink connectors such as Apache Flume, database connectors such as Apache Sqoop, messaging queues such as RabbitMQ, ZeroMQ, RestMQ and Amazon SQS, and custom REST-based and WebSocket-based connectors. The reader is introduced to the Hadoop Distributed File System (HDFS) and the HBase non-relational database. The batch analysis chapter provides an in-depth study of frameworks such as Hadoop-MapReduce, Pig, Oozie, Spark and Solr. The real-time analysis chapter focuses on the Apache Storm and Spark Streaming frameworks. In the chapter on interactive querying, we describe, with the help of examples, the use of frameworks and services such as Spark SQL, Hive, Amazon Redshift and Google BigQuery. The chapter on serving databases and web frameworks provides an introduction to popular relational and non-relational databases (such as MySQL, Amazon DynamoDB, Cassandra, and MongoDB) and the Django Python web framework.

Part III focuses on advanced topics in big data, including analytics algorithms and data visualization tools. The chapter on analytics algorithms introduces the reader to machine learning algorithms for clustering, classification, regression and recommendation systems, with examples using the Spark MLlib and H2O frameworks. The chapter on data visualization describes examples of creating various types of visualizations using frameworks such as Lightning, pygal and Seaborn.

Bahga & Madisetti, © 2016

Through generous use of hundreds of figures and tested code samples, we have attempted to provide a rigorous "no hype" guide to big data science and analytics. It is expected that diligent readers of this book can use the canonical realizations of the Alpha, Beta, Gamma and Delta models for analytics systems to develop their own big data applications. We adopted an informal approach to describing well-known concepts, primarily because these topics are covered well in existing textbooks; our focus instead is on getting the reader firmly on track to developing robust big data applications rather than on more theory.

While we frequently refer to offerings from commercial vendors, such as Amazon, Google and Microsoft, this book is not an endorsement of their products or services, nor is any portion of our work supported financially (or otherwise) by these vendors. All trademarks and products belong to their respective owners, and the underlying principles and approaches, we believe, are applicable to other vendors as well. The opinions in this book are those of the authors alone.

Please also refer to our books "Internet of Things: A Hands-On Approach™" and "Cloud Computing: A Hands-On Approach™", which provide additional and complementary information on these topics. We are grateful to the Association for Computing Machinery (ACM) for recognizing our book on cloud computing as a "Notable Book of 2014" as part of their annual literature survey, and also to the 50+ universities worldwide that have adopted these textbooks as part of their program offerings.

Proposed Course Outline

The book can serve as a textbook for senior-level and graduate-level courses in Big Data Analytics and Data Science offered in Computer Science, Mathematics and Business Schools.
[Figure: the Big Data Science & Analytics book positioned relative to three course areas. Business Analytics: Business school courses with a focus on applications of data analytics for businesses. Data Science: Mathematics and Computer Science courses with a focus on statistics, machine learning, decision methods and modeling. Big Data Analytics: Computer Science courses with a focus on big data tools and frameworks, programming models, data management and implementation aspects of big data applications.]

Big Data Science & Analytics: A Hands-On Approach

We propose the following outline for a 16-week senior-level or graduate-level course based on the book.

Week  Topics
1     Introduction to Big Data: types of analytics; Big Data characteristics; data analysis flow; big data examples, applications & case studies
2     Big Data stack setup and examples: HDP; Cloudera CDH; EMR; Azure HDInsight
3     MapReduce: programming model; examples; MapReduce patterns
4     Big Data architectures & patterns
5     NoSQL databases: key-value databases; document databases; column family databases; graph databases
6     Data acquisition: publish-subscribe messaging frameworks; Big Data collection systems; messaging queues; custom connectors; implementation examples
7     Big Data storage: HDFS; HBase
8     Batch data analysis: Hadoop & YARN; MapReduce & Pig; Spark core; batch data analysis examples & case studies
9     Real-time analysis: stream processing with Storm; in-memory processing with Spark Streaming; real-time analysis examples & case studies
10    Interactive querying: Hive; Spark SQL; interactive querying examples & case studies
11    Web frameworks & serving databases: Django - Python web framework; using different serving databases with Django; implementation examples
12    Big Data analytics algorithms: Spark MLlib; H2O; clustering algorithms
13    Big Data analytics algorithms: classification algorithms; regression algorithms
14    Big Data analytics algorithms: recommendation systems
15    Big Data analytics case studies with implementations
16    Data visualization: building visualizations with Lightning, pygal & Seaborn

Book Website

For more information on the book, copyrighted source code of all examples in the book, lab exercises, and instructor material, visit the book website: www.big-data-analytics-book.com