Intro to Big Data and Hadoop


Intro to Big Data and Hadoop. Portions copyright 2001 SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc., Cary, NC, USA. SAS Institute Inc. makes no warranties with respect to these materials and disclaims all liability therefor.

Copyright 2016, SAS Institute Inc. All rights reserved. What Is Big Data? Definitions for big data found on Wikipedia: "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools." "Big data is a term used to describe data that cannot be managed and analyzed using traditional infrastructure, architecture, and technologies."

Disk Size Prefixes The amount of data being stored and analyzed continues to increase! A good typing rate is about 10 KB/hour (0.01 MB/hr). An e-book is roughly 1 MB. A typical disk drive write rate is ~100-250 MB/sec (100 GB in about 20 minutes; 1 TB in about 3 hours; 1 PB in about 4 months). A good internet connection is similar.
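The transfer times above can be sanity-checked with simple arithmetic. The sketch below uses the slide's nominal 100 MB/sec write rate and decimal prefixes; the numbers are illustrative only:

```python
# Rough transfer times at a nominal disk-drive write rate of 100 MB/sec
# (illustrative, matching the slide's estimates; decimal prefixes assumed).
RATE_MB_PER_SEC = 100
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000
MB_PER_PB = 1_000_000_000

def transfer_seconds(size_mb, rate=RATE_MB_PER_SEC):
    """Seconds needed to write size_mb megabytes at the given rate."""
    return size_mb / rate

print(transfer_seconds(100 * MB_PER_GB) / 60)     # 100 GB: ~17 minutes
print(transfer_seconds(1 * MB_PER_TB) / 3600)     # 1 TB: ~2.8 hours
print(transfer_seconds(1 * MB_PER_PB) / 86400)    # 1 PB: ~116 days (~4 months)
```

The point of the exercise: a single disk (or network link) at this rate needs months to move a petabyte, which is why big data systems spread I/O across many machines.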

Attributes of Big Data The following three characteristics make data "big data": Volume (terabytes -> petabytes), Velocity (batch -> streaming), and Variety (structured -> unstructured).

Big Data Estimates for the Three Vs Volume: >3500 petabytes in North America; >3000 petabytes for the rest of the world. Variety: people (web, social media, e-commerce, music, videos, messaging) and machines (sensors, medical devices, GPS). Velocity: about 3 million emails per second; 200,000 Facebook logins every minute; millions of stock trades in seconds.

Big Data Variety Big data consists of structured and unstructured data. It is estimated that almost 90% of big data is unstructured. Structured data (~10%): databases, files with schemas. Unstructured data (~90%): videos, audio, images, text.

Factors Contributing to Big Data Growth Cloud, social media, and mobile.

Emergence of Big Data What happens in one minute on the Internet (across services such as LinkedIn, Flickr, Netflix, and Pandora): Facebook, over 6 million views; Google, 2 million searches; Twitter, 100,000 new tweets; YouTube, 30 hours of new video; email, 204 million emails; 600,000 GB of data transferred globally.

Big Data Sources Public, social media, enterprise master, sensor, and transaction data.

Big Data Sources: Health Care Public: CMS data, genome project, weather data. Social media: tweets on epidemics, Facebook comments on side effects, reviews of services. Enterprise master: cost of services, provider details, employees, facilities. Sensor: patient vitals monitoring, delivery monitoring, wearable data. Transaction: patients admitted, claims processed, patient satisfaction.

Big Data Sources: Retail Public: weather data, customer complaints, US census data. Social media: tweets on customer service, product recommendations on Facebook, product reviews. Enterprise master: product inventory, cost or price of products, employees, customer information. Sensor: traffic measurement, weblogs, Wi-Fi traffic in stores. Transaction: customer purchases, financial accounting, inventory management, supply chain.

Examples of Companies Using Big Data Yahoo: >4500 nodes; 60% of jobs run on Apache Pig; log processing and advertising analytics. Facebook: >1000 nodes; Hive-based data warehouse; reporting, analytics, and machine learning.

Enterprise Challenges Due to Big Data Big data users encounter many challenges: missing SLAs (service level agreements), latency, scalability, ETL (extract, transform, load) challenges, cost, and growing unstructured data.

Traditional Data Processing: Architecture In the traditional data processing model, data moves to the physical hardware that contains the application logic: the client fetches data from the database server and applies the application logic to it.

Traditional Data Processing: Big Data Impact In the traditional data processing model, with big data, it is possible that the application logic or the hardware it runs on (or both) cannot handle the volume. The server cannot handle the volume! The database cannot handle the volume!

Big Data Processing: Architecture With big data processing, the logic is taken to the data: the client submits application logic to a master node, which distributes it to the nodes that hold the data.

Traditional Data Management In traditional data management, the schema is applied on WRITE: raw data from source systems is parsed against a schema and loaded into a structured database (RDBMS).

Big Data Management In big data management, the schema is applied on READ: raw data from source systems is loaded into a distributed system as files with NO SCHEMA, and a schema is applied when the data is read.
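The contrast can be sketched in a few lines of Python (illustrative only; the record layout and field names are invented): schema-on-write parses and validates before storing, while schema-on-read stores raw bytes untouched and applies a schema only at query time.

```python
import csv
import io

raw = "101,Alice,2016-03-01\n102,Bob,2016-03-02\n"

# Schema on WRITE: parse into a fixed structure before loading (RDBMS style).
def load_structured(text):
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        rows.append({"id": int(rec[0]), "name": rec[1], "date": rec[2]})
    return rows

# Schema on READ: store the raw records untouched (in HDFS, just a file).
def store_raw(text):
    return text

# ...and apply a schema only when the data is actually queried.
def read_with_schema(stored, fields):
    return [dict(zip(fields, rec)) for rec in csv.reader(io.StringIO(stored))]

structured = load_structured(raw)                              # schema fixed at load time
parsed_later = read_with_schema(store_raw(raw), ["id", "name", "date"])
```

The design trade-off: schema-on-read defers parsing cost and lets different consumers impose different schemas on the same raw files, at the price of no up-front validation.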

Big Data Economics and Key Drivers Hardware cost reduction (falling storage, CPU, and bandwidth costs) and open-source technologies (Linux, Lucene, Hadoop).

Processing Models Two distinct data processing models can be followed: scale up by adding bigger, more powerful processing machines, or scale out by taking advantage of existing hardware already in inventory, or by purchasing more of the smaller, less expensive machines.

Scale Up versus Scale Out A decision to either scale up or scale out should be made, comparing cost, hardware, and other considerations. Cost: scale up is expensive; scale out is cheap. Hardware: scale up uses specialized hardware; scale out uses commodity hardware. Fault tolerance: low for scale up; high for scale out. Licensing: proprietary for scale up; open-source and proprietary options for scale out. Storage: terabytes for scale up; petabytes and more for scale out.

Big Data versus Traditional Technologies Traditional systems: rigid data models; weak fault-tolerance architecture; scalability constraints; expensive to scale; limitations for handling unstructured data; proprietary hardware and software. Big data technologies: schema free; strong fault-tolerance architecture; highly scalable; economical (1 TB ~ 5K); can handle unstructured data; commodity hardware with both open-source and proprietary software.

Summary of the Big Data Ecosystem Big data ecosystems have the following characteristics: Data volumes are exploding. Volume, variety, and velocity vary greatly for big data. Hadoop is an example of the scale-out model. Decreasing commodity hardware costs make Hadoop a leading platform for big data. Schema-less and unstructured data can be handled.

The Apache Hadoop Project

Hadoop Hadoop is a software framework with these capabilities: offers reliable, scalable, distributed computing; enables the distributed processing of large data sets across clusters of computers using simple programming models; is designed to scale up from single servers to thousands of machines, each offering local computation and storage. See http://hadoop.apache.org.

History of Hadoop Inventors: Doug Cutting and Mike Cafarella. The Apache Software Foundation is a non-profit. Hadoop is the name of Doug Cutting's child's yellow stuffed elephant. https://wiki.apache.org/hadoop/poweredby lists many companies that use Hadoop and what they use it for.

Major Releases: V1.x and V2.x V1.x (2005): Hadoop Common (libraries and utilities); Hadoop Distributed File System (HDFS), a distributed file system that stores data; Hadoop MapReduce, a programming model. V2.x (2013): Hadoop Common; Hadoop Distributed File System; Hadoop YARN (Yet Another Resource Negotiator), a resource management platform; Hadoop MapReduce.

Hadoop Name Node (NN) In the Hadoop ecosystem, the Name node: is the centerpiece of an HDFS file system; has RAM, CPU, and storage; is usually more powerful (CPU, RAM) than a Data node; stores HDFS file metadata to keep track of data in the HDFS directories across the Data nodes; does not store the data itself; can host a Job Tracker.

Hadoop Data Node (DN) In the Hadoop ecosystem, the Data node: has RAM, CPU, and storage; stores and processes data locally; can have data replicated to it (optional); hosts a Task Tracker. The Name node and Data nodes should always be configured on separate machines.

Data Splits and Replication A file is split into small chunks (64 MB, 128 MB, and so on) and copied across the cluster. Data can be replicated (optional). The recommended replication factor is 3 for production systems. (Diagram: the Name node tracks which chunks of files A and B, such as A2, A5, B1, B3, reside on which Data nodes.)
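Block-size arithmetic is easy to sketch; the helper below is illustrative (the file size, block size, and replication factor are example values, not taken from the slide):

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Return (number of blocks a file splits into,
    total MB stored on the cluster after replication)."""
    blocks = math.ceil(file_mb / block_mb)
    return blocks, file_mb * replication

# A 300 MB file with 128 MB blocks splits into 3 chunks; with the
# recommended replication factor of 3 it occupies ~900 MB cluster-wide.
blocks, stored = hdfs_footprint(300)
print(blocks, stored)  # 3 900
```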

Hadoop Processing Trackers Hadoop processing trackers include the following: the Job Tracker manages resources and the Task Trackers; each Task Tracker takes orders from, and reports status to, the Job Tracker.

Hadoop Processing: MapReduce (Diagram: a root/master node distributing work to worker nodes.)

MapReduce Processing MapReduce processes data as key-value pairs: key-value pairs in, key-value pairs out. Reference: http://code.google.com/edu/parallel/mapreduce-tutorial.html
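The canonical key-value example is word count. The sketch below is a minimal in-memory simulation of the three phases (plain Python, not real Hadoop code):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big hadoop"])))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In real Hadoop the map and reduce functions run in parallel across the Data nodes, and the framework performs the shuffle over the network; the key-value contract is the same.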

Pointer Following (or) Joining (Example from https://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf.) Input: a feature list of roads, intersections, and a town, where each road records the intersections it touches (for example, roads 1, 2, and 5 all reference intersection 3; roads 4, 5, and 7 reference intersection 6). Output: an intersection list in which each intersection gathers the roads that reference it (intersection 3 gathers roads 1, 2, and 5; intersection 6 gathers roads 4, 5, and 7).

Inner Join Pattern Input: the feature list. Map: apply map() to each feature, emitting key = intersection id, value = feature. Shuffle: sort the pairs by key. Reduce: apply reduce() to the list of pairs with the same key, gathering them into one aggregated feature. Output: the aggregated feature list, for example intersection 3 with roads 1, 2, and 5, and intersection 6 with roads 4, 5, and 7.
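The same shuffle machinery implements this inner-join pattern. Below is a hedged in-memory sketch: the feature tuples are invented stand-ins for the slide's road/intersection data, and the helpers are not real Hadoop APIs.

```python
from collections import defaultdict

# Toy feature list: (feature id, type, intersection ids the feature touches).
features = [
    (1, "road", [3]), (2, "road", [3]), (3, "intersection", []),
    (4, "road", [6]), (5, "road", [3, 6]), (6, "intersection", []),
    (7, "road", [6]),
]

def map_phase(features):
    """Map: key each feature by the intersection id(s) it belongs to."""
    for fid, ftype, intersections in features:
        if ftype == "intersection":
            yield (fid, (fid, ftype))          # an intersection keys itself
        else:
            for iid in intersections:          # a road keys every intersection it touches
                yield (iid, (fid, ftype))

def shuffle(pairs):
    """Shuffle: group pairs by key (intersection id)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: attach the sorted list of road ids to each intersection."""
    return {iid: sorted(fid for fid, ftype in members if ftype == "road")
            for iid, members in groups.items()}

joined = reduce_phase(shuffle(map_phase(features)))
print(joined)  # {3: [1, 2, 5], 6: [4, 5, 7]}
```

This reproduces the slide's output: intersection 3 gathers roads 1, 2, and 5, and intersection 6 gathers roads 4, 5, and 7.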

YARN: Yet Another Resource Negotiator In the Hadoop 2.x stack, HDFS sits at the bottom, YARN sits on top of HDFS, and applications such as Pig, Hive, HBase, MapReduce, and non-MapReduce frameworks run on top of YARN.

YARN (Hadoop 2.x): Features YARN features include the following: multi-tenancy; not only batch-oriented, it also supports real-time applications; improved cluster utilization; scalability; compatibility (backward compatible with Hadoop 1.x).

Other Members of the Ecosystem Apache Sqoop is a framework designed for efficiently transferring bulk data between Hadoop and structured, relational databases (for example, MySQL and Oracle). It uses MapReduce to import and export the data. Apache Pig is a high-level platform for creating programs in Pig Latin that run on Hadoop. Pig can execute its jobs in MapReduce or Spark. Apache Hive is a data warehouse built on top of Hadoop that provides data summarization, query, and analysis using an SQL-like interface. Apache Spark is an open-source cluster-computing framework. It typically uses the Hadoop file system, but it is far more flexible than MapReduce and more RAM-oriented (for better speed).