BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Similar documents
Analytics Platform System

BIG DATA AND HADOOP DEVELOPER

ADVANCED ANALYTICS & IOT ARCHITECTURES

20775A: Performing Data Engineering on Microsoft HD Insight

Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

Modernizing Your Data Warehouse with Azure

20775: Performing Data Engineering on Microsoft HD Insight

Big data is hard. Top 3 Challenges To Adopting Big Data

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

ARCHITECTURES ADVANCED ANALYTICS & IOT. Presented by: Orion Gebremedhin. Marc Lobree. Director of Technology, Data & Analytics

20775A: Performing Data Engineering on Microsoft HD Insight

Intro to Big Data and Hadoop

20775 Performing Data Engineering on Microsoft HD Insight

5th Annual. Cloudera, Inc. All rights reserved.

BIG DATA & ADVANCED ANALYTICS ROADSHOW

Insights to HDInsight

HDInsight - Hadoop for the Commoner Matt Stenzel Data Platform Technical Specialist

Spark and Hadoop Perfect Together

Databricks Cloud. A Primer


Transforming Analytics with Cloudera Data Science WorkBench

AZURE HDINSIGHT. Azure Machine Learning Track Marek Chmel

Sunnie Chung. Cleveland State University

Why Big Data Matters? Speaker: Paras Doshi

Alexander Klein. ETL meets Azure

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE

Big Data Introduction

Berkeley Data Analytics Stack (BDAS) Overview

IBM Analytics Unleash the power of data with Apache Spark

Apache Spark and R A (big data) love story?

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Bringing the Power of SAS to Hadoop Title

Business is being transformed by three trends

How In-Memory Computing can Maximize the Performance of Modern Payments

Azure Data Analytics & Machine Learning Seminar. Daire Cunningham: BI Practice Area Manager

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase

INTRODUCTION TO R FOR DATA SCIENCE WITH R FOR DATA SCIENCE DATA SCIENCE ESSENTIALS INTRODUCTION TO PYTHON FOR DATA SCIENCE. Azure Machine Learning

Microsoft Big Data. Solution Brief

Cloudera, Inc. All rights reserved.

Session 30 Powerful Ways to Use Hadoop in your Healthcare Big Data Strategy

Jason Virtue Business Intelligence Technical Professional

MapR: Solution for Customer Production Success

Preface About the Book

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

Common Customer Use Cases in FSI

SAP Predictive Analytics Suite

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Analytics in Action transforming the way we use and consume information

Cloud Based Analytics for SAP

Microsoft Developer Day

Evolution or Revolution: Top Ten Development Trends

Investor Presentation. Second Quarter 2016

Big Data & Hadoop Advance

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

red red red red red red red red red red red red red red red red red red red red CYS Rithu P Ravi CYS Saumya K

Data Lake Organization A Hadoop Eco-System. Jan Cordtz, Microsoft Denmark Cloud Solution Architect

HP SummerSchool TechTalks Kenneth Donau Presale Technical Consulting, HP SW

Got Hadoop? Whitepaper: Hadoop and EXASOL - a perfect combination for processing, storing and analyzing big data volumes

New Big Data Solutions and Opportunities for DB Workloads

COMP9321 Web Application Engineering

Azure ML Data Camp. Ivan Kosyakov MTC Architect, Ph.D. Microsoft Technology Centers Microsoft Technology Centers. Experience the Microsoft Cloud

A World of Data. Raghu Ramakrishnan. CTO for Data, Technical Fellow Microsoft

Enterprise Database Systems for Big Data, Big Data Processing, and Big Data Analytics. Sunnie Chung Cleveland State University 1

Realising Value from Data

Pre-Requisites A good understanding of Azure data services A basic knowledge of the Microsoft Windows operating system and its core functionality

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Big Data with Azure: where to begin?

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses

Introduction to Stream Processing

By: Shrikant Gawande (Cloudera Certified )

Designing Business Intelligence Solutions with Microsoft SQL Server 2014

Microsoft Azure Essentials

Enterprise Analytics Accelerating Your Path to Value with an Open Analytics Platform

BIG DATA and DATA SCIENCE

Digital Transformation 2.0

Designing Business Intelligence Solutions with Microsoft SQL Server 2014 Course Code: 20467D

What s New. Bernd Wiswedel KNIME KNIME AG. All Rights Reserved.

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

The Sysprog s Guide to the Customer Facing Mainframe: Cloud / Mobile / Social / Big Data

COPYRIGHTED MATERIAL. 1Big Data and the Hadoop Ecosystem

MapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Apache Hadoop in the Datacenter and Cloud

Spotlight Sessions. Nik Rouda. Director of Product Marketing Cloudera, Inc. All rights reserved. 1

Advanced Analytics in Azure

Making Analytics Viable in Enterprises: Potential routes for Industry 4.0

Investor Presentation. Fourth Quarter 2015

2013 PARTNER CONNECT

Taking Advantage of Cloud Elasticity and Flexibility

Advanced Analytics With Spark Patterns For Learning From Data At Scale

Course 20467C: Designing Self-Service Business Intelligence and Big Data Solutions

Cask Data Application Platform (CDAP) The Integrated Platform for Developers and Organizations to Build, Deploy, and Manage Data Applications

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN

R and Hadoop. Ram Venkat Dawn Analytics

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Cask Data Application Platform (CDAP)

Transcription:

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

TOPICS COVERED 1 2 Fundamentals of Big Data Platforms Major Big Data Tools

Scaling Up vs. Out SCALE UP (SMP) SCALE OUT (MPP) + (n) Upgrade components or buy bigger server each time Multiprocessor system where processors share resources : Operating System (OS) Memory I/O devices connected using a common bus Add nodes to the cluster Multiple processing nodes OS RAM Network

Innovation Timeline Doug Cutting & Mike Cafarella started working on Nutch Doug Cutting adds DFS & MapReduce support to Nutch NY Times converts 4TB of image archives over 100 EC2s Fastest sort of a TB, 3.5 mins over 910 nodes Fastest sort of a TB, 62 secs over 1,460 nodes Sorted a PB in 16.25 hours over 3,658 nodes 2002 2003 2004 2005 2006 2007 2008 2009 Google publishes GFS & MapReduce papers Yahoo! Hires Cutting, Hadoop spins out of Nutch Founded Doug Cutting joins Cloudera Facebooks launches Hive: SQL Support for Hadoop Hadoop Summit 2009, 750 attendees

THE FUNDAMENTALS OF HADOOP Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s Hadoop consists of: MapReduce Hadoop Distributed File System (HDFS)

WHAT S NEW

BASICS OF MPP 400 bills 1 bill/ sec = 400 Seconds

BASICS OF MPP 200 bills 1 bill/ sec = 200 Seconds 200 bills 1 bill/ sec = 200 Seconds Total = 200 Seconds

BASICS OF MPP 100 Bills 1 bill/ sec = 100 Seconds 100 Bills 1 bill/ sec = 100 Seconds 100 Bills 1 bill/ sec = 100 Seconds 100 Bills 1 bill/ sec = 100 Seconds Total = 100 Seconds

HDFS & MAPREDUCE The Main Node: runs the Job tracker and the name node controls the files. Each node runs two processes: Task Tracker and Data Node Map Reduce Job Tracker Task Tracker Task Tracker HDFS Cluster Name Node Data Node Data Node 1 N

BASICS OF MAPREDUCE The Main Node: runs the Job tracker and the name node controls the files. Each node runs two processes: Task Tracker and Data Node Query Result Data Nodes/Task Trackers Query Name Node/ Job Tracker

EXECUTION UNITS MAPREDUCE

SOME DISTRIBUTIONS OF APACHE HADOOP Apache Foundation

Sandbox Hortonworks

MAPREDUCE PIG & HIVE MAPREDUCE PIG HIVE Java Write many lines of code Mostly used by Yahoo Most used for data processing Shares some constructs w/ SQL Is more Verbose Needs a lot of training for users with limited procedural programming background Offers control over the flow of data Mostly used by Facebook for analytic purposes Used for analytics Relatively easier for developers w/ SQL experience Less control over optimization of data flows compared to Pig Not as efficient as MapReduce Higher productivity for data scientists and developers

THE EXPLOSION OF HADOOP

THE HISTORY OF SPARK MapReduce Top Level 2006 2010 2013 2004 2009 2011 2014 Spark Paper BSD Open Source Apache 17

SPARK SHARED LIBRARIES

SPARK THE UNIFIED PLATFORM FOR BIG DATA APIs for : Scala Java Python R Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph) Spark Core

SPARK BENEFITS Performance Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for batch and realtime data processing. Unified Engine Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing. A single application can combine all types of processing. Developer Productivity Easy-to-use APIs for processing large datasets. Includes 100+ operators for transforming. Ecosystem Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB. Runs on top of the Apache YARN resource manager.

ANALYTICS CORTANA

SQL Server Big Data Optimizations

SQL Server APS

SQL Server APS Growth Topology Scale Unit Base Unit Base Unit Extension

SQL Server Azure SQL DW

SQL Server Deployment options and hybrid solutions

SQL Server Connecting Islands of Data with PolyBase Select Result set Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL Microsoft Azure HDInsight Hortonworks for Windows and Linux Cloudera SQL Server Parallel Data Warehouse PolyBase Microsoft HDInsight Uses the power of MPP to enhance query execution performance Supports Windows Azure HDInsight to enable new hybrid cloud scenarios Provides the ability to query non- Microsoft Hadoop distributions, such as Hortonworks and Cloudera

USE CASE: SUPPLY CHAIN MANAGEMENT Use Case 3: Supply Chain Management

USE CASE: SMART GRID MANAGEMENT

USE CASES SMART GRID PREDICTIVE MAINTENANCE DEMAND FORECASTING GRID OPTIMIZATION THEFT PERVENTION DEMAND RESPONSE CUSTOMER PROFILING

USE CASES SMART GRID

USE CASE: REAL TIME TRAFFIC ANALYSIS

REAL TIME TRAFFIC ANALYSIS

USE CASES STREAM ANALYTICS Real-time fraud detection Connected cars Click-stream analysis Real-time financial portfolio alerts Smart grid, energy management CRM alerting sales to customer case Data and identity protection services Real-time sales tracking

ML PROBLEMS SOLVED BY AZURE ML Classification Regression Recommenders Anomaly Detection Clustering

INDUSTRY USE CASES MACHINE LEARNING

THE MACHINE LEARNING WORKFLOW Input data Data transformation Define model Split data Train model Score (prediction) Evaluate Model

AZURE DATA FACTORY

HD INSIGHT

BIG DATA & Advanced Analytics Roadshow Questions? Orion Gebremedhin Orion.Gebremedhin@Neudesic.com Twitter: @oriongm