Spark, Hadoop, and Friends

Similar documents
5th Annual. Cloudera, Inc. All rights reserved.

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

MapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia

Bringing the Power of SAS to Hadoop Title

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN

Microsoft Azure Essentials

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

Operational Hadoop and the Lambda Architecture for Streaming Data

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Business is being transformed by three trends

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

A survey on Static and Dynamic Hadoop Schedulers

Big Data The Big Story

Azure ML Data Camp. Ivan Kosyakov MTC Architect, Ph.D. Microsoft Technology Centers Microsoft Technology Centers. Experience the Microsoft Cloud

Five Questions to Ask Before Choosing a Hadoop Distribution

Sunnie Chung. Cleveland State University

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Hadoop Integration Deep Dive

Microsoft Big Data. Solution Brief

MapR Pentaho Business Solutions

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

GE Intelligent Platforms. Proficy Historian HD

HADOOP ADMINISTRATION

Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks

Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect

Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved.

Creating an Enterprise-class Hadoop Platform Joey Jablonski Practice Director, Analytic Services DataDirect Networks, Inc. (DDN)

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE

Cloudera Hadoop & Industrie 4.0 wohin mit dem Datenstrom?

Insights to HDInsight

Session 30 Powerful Ways to Use Hadoop in your Healthcare Big Data Strategy

20775: Performing Data Engineering on Microsoft HD Insight

The Internet of Things Wind Turbine Predictive Analytics. Fluitec Wind s Tribo-Analytics System Predicting Time-to-Failure

Adobe Deploys Hadoop as a Service on VMware vsphere

Big Data Live selbst analysieren

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Why Big Data Matters? Speaker: Paras Doshi

European Journal of Science and Theology, October 2014, Vol.10, Suppl.1, BIG DATA ANALYSIS. Andrej Trnka *

1. Intoduction to Hadoop

Oracle Autonomous Data Warehouse Cloud

Data Analytics for Semiconductor Manufacturing The MathWorks, Inc. 1

Design of material management system of mining group based on Hadoop

Big Data in Cloud. 堵俊平 Apache Hadoop Committer Staff Engineer, VMware

Konica Minolta Business Innovation Center

Knowledge Discovery and Data Mining

Bringing Big Data to Life: Overcoming The Challenges of Legacy Data in Hadoop

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Berkeley Data Analytics Stack (BDAS) Overview

Saudi Journal of Engineering and Technology. DOI: /sjeat. ISSN (Print) Dubai, United Arab Emirates Website:

TechValidate Survey Report. Converged Data Platform Key to Competitive Advantage

Big Data. By Michael Covert. April 2012

Chapter 9. Business Intelligence Systems

Building a Data Lake with Spark and Cassandra Brendon Smith & Mayur Ladwa

RESOURCE MANAGEMENT IN CLUSTER COMPUTING PLATFORMS FOR LARGE SCALE DATA PROCESSING

Copyright - Diyotta, Inc. - All Rights Reserved. Page 2

White paper A Reference Model for High Performance Data Analytics(HPDA) using an HPC infrastructure

Common Customer Use Cases in FSI

LEVERAGING DATA ANALYTICS TO GAIN COMPETITIVE ADVANTAGE IN YOUR INDUSTRY

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

E-Guide THE EVOLUTION OF IOT ANALYTICS AND BIG DATA

Hybrid Data Management

Building Efficient Large-Scale Big Data Processing Platforms

Cloud Based Analytics for SAP

MapR: Solution for Customer Production Success

SAP Big Data. Markus Tempel SAP Big Data and Cloud Analytics Services

Big Data on Content Credibility of Social Networking Sites and Instant Messaging Applications

Building a Data Lake on AWS

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU & RICE UNIVERSITY

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses

Hortonworks Powering the Future of Data

ETL challenges on IOT projects. Pedro Martins Head of Implementation

GET MORE VALUE OUT OF BIG DATA

[Header]: Demystifying Oracle Bare Metal Cloud Services

More information for FREE VS ENTERPRISE LICENCE :

Datasheet FUJITSU Integrated System PRIMEFLEX for Hadoop

HP SummerSchool TechTalks Kenneth Donau Presale Technical Consulting, HP SW

How to Build Your Data Ecosystem with Tableau on AWS

Hadoop in Production. Charles Zedlewski, VP, Product

Apache Mesos. Delivering mixed batch & real-time data infrastructure , Galway, Ireland

The Intersection of Big Data and DB2

SAS and Hadoop Technology: Overview

Big Data Trends to Watch

I. INTRODUCTION. 2013) massive knowledge sources which is able to be used in official statistics are:

Deloitte School of Analytics. Demystifying Data Science: Leveraging this phenomenon to drive your organisation forward

DLT AnalyticsStack. Powering big data, analytics and data science strategies for government agencies

Oracle Autonomous Data Warehouse Cloud

BIG Data Analytics AWS Training

Cask Data Application Platform (CDAP) The Integrated Platform for Developers and Organizations to Build, Deploy, and Manage Data Applications

Data Analytics. Nagesh Madhwal Client Solutions Director, Consulting, Southeast Asia, Dell EMC

Harnessing the Power of Big Data to Transform Your Business Anjul Bhambhri VP, Big Data, Information Management, IBM

E-guide Hadoop Big Data Platforms Buyer s Guide part 3

The Rise of Engineering-Driven Analytics

IBM Big Data Summit 2012

Integrating MATLAB Analytics into Enterprise Applications

Building Your Big Data Team

Oracle Autonomous Data Warehouse Cloud

Transcription:

Spark, Hadoop, and Friends (and the Zeppelin Notebook) Douglas Eadline Jan 4, 2017 NJIT

Presenter Douglas Eadline deadline@basement-supercomputing.com @thedeadline HPC/Hadoop Consultant/Writer http://www.basement-supercomputing.com

Today s Topics 1. Hadoop Overview and Review 2. Spark (and Hadoop) 3. A Little Bit about Data Science 4. Getting Data into Hadoop (hands-on) (lunch) 5. Spark and The Zeppelin Notebook 6. Examples with Zeppelin (hands-on)

Hadoop Overview and Review Covered Hadoop in more detail last year (you should have the book) Today: Quick review in case you missed it Consider Hadoop 2 Quick Start Guide http://www.clustermonkey.net/hadoop2-quick-start-guide

Big Data & Hadoop Are The Next Big Thing!

Data Volume Let s Call It a Lake

Data Velocity Let s Call It a Waterfall

Data Variation Let s Call it Recreation

What is Big Data? (besides a marketing buzzword) The Internet created this situation Three V s (Volume, Velocity, Variability) Data that are large (lake), growing fast (waterfall), unstructured (recreation) - not all may apply. May not fit in a "relational model and method" Can Include: video, audio, photos, system/web logs, click trails, IoT, text messages/email/tweets, documents, books, research data, stock transactions, customer data, public records, human genome, and many others.

Why All the Fuss about Hadoop? We hear about Hadoop scale We hear about Hadoop storage systems We hear about Hadoop parallel processing something called "MapReduce" At Yahoo, 100,000 CPUs in more than 40,000 computers running Hadoop, 455 petabytes of storage (10E15 Bytes)

Is BIG DATA Really That big Two analytics clusters at Yahoo and Microsoft, median job input sizes are under 14GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB. ( Nobody ever got fired for using Hadoop on a cluster, HotCDP 2012)

What Is Really Going on Here? Hadoop is not necessarily about scale (although it helps to have a scalable solution that can handle large capacities and large capabilities) Big Data is not necessarily about volume (although volume and velocity are increasing, so is hardware) However, The "Hadoop Data Lake" represents a fundamentally new way to manage data.

Hadoop Data Lake Data Warehouse applies schema on write and has an Extract, Transform, and Load (ETL) step. Hadoop applies schema on read and the ETL step is part of processing. All input data are placed in the lake in raw form.

Defining Hadoop (Version 2) A PLATFORM for data lake analysis that supports software tools, libraries, and methodologies Core and most tools are open source (Apache License) Large unstructured data sets (Petabytes) Written in Java, but not exclusively a Java platform Primarily GNU/Linux, Windows versions available Scalable from single server to thousands of machines Runs on commodity hardware and the cloud Application level fault tolerance possible Multiple processing models and libraries (>MapReduce)

Hadoop Core Components HDFS Hadoop Distributed File System. Designed to be a fault tolerant streaming file system using multiple DataNodes. YARN Yet Another Resource Negotiator. Master scheduler and resource allocator for the entire Hadoop cluster using worker nodes with NodeManagers. MapReduce YARN Application. Provides MapReduce processing (Compatible with version 1 of Hadoop). Cluster servers (nodes) are usually both DataNodes and NodeManagers (move processing to data)

Hadoop V2 The Platform

Hadoop Distributed File System (HDFS) Master/Slave model NameNode - metadata server or "data traffic cop (kept in memory for performance) Single Namespace - managed by the NameNode DataNodes - where the data live (data are sliced and placed across the DataNodes allowing concurrent access) Secondary NameNode - checkpoints NameNode metadata to disk, but not failover node Built with simple servers Features: High Availability (multiple NameServers), Federation (spread namespace across NameServers), Snapshots (instant read-only point-in-time copies), NFSv3 Access, federation

How the User Sees HDFS HDFS is a separate file system from the host machine Data must be moved to (put) and from (get) HDFS (some tools do this automatically) Hadoop processing happens in HDFS

A Few Hadoop Use Cases NextBio is using Hadoop MapReduce and HBase to process multiterabyte data sets of human genome data. Processing wasn't feasible using traditional databases like MySQL. US Xpress, one of the largest trucking companies in the US, is using Hadoop to store sensor data (100s of data points and geo data) from thousands of trucks. The intelligence they mine out of this, saves them $6 million year in fuel cost alone. Hadoop allows storing enormous amount of sensor data and querying/joining this data with other data sets. A leading retail bank is using Hadoop to validate data accuracy and quality to comply with federal regulations like Dodd-Frank. The platform allows analyzing trillions of records which currently result in approximately one terabyte per month of reports. Other solutions were not able to provide a complete picture of all the data.