Big Data with Azure: where to begin?

Big Data with Azure: where to begin? Concepts and best practices. October 15th 2016, Sofia. Satya SK Jayanty, Principal Architect & Managing Consultant, consulting@dbia.uk

Sponsors Gold sponsors: Silver sponsors: Bronze sponsors:

Speaking Engagements

Authored: http://tinyurl.com/sql2k8r2admincookbook http://tinyurl.com/sql2012instantcubesecurity http://www.manning.com/delaney/

Agenda... what agenda? No agenda! You like small data, big data, all data! That's why you are here today.

What differentiates today's thriving organizations? Data. Data in all forms & sizes is being generated faster than ever before. Capture & combine it for new insights & better, faster decisions.

Strategic opportunity with Big Data. Trends: Cloud, Mobile, Social, Big data. How do you use technology innovation to architect business innovation? Increased productivity, customer growth, real-time insights, embrace new models.

The Azure Platform Strategy. Public Cloud Platform: public, global, shared datacenters. SaaS (Software as a Service): O365, CRM, VSO etc. + 3rd-party SaaS solutions. Hybrid Operations: Microsoft Azure Stack & Cloud Platform System. Security & Management spans all of these.

Breaking points of traditional approach

What if you could handle big data? [Chart: data volume (megabytes, gigabytes, terabytes, petabytes) against data complexity (variety and velocity). Sources plotted: click stream, wikis/blogs, sensors, RFID, devices, social sentiment, audio/video, log files, spatial and GPS coordinates, data market feeds, eGov feeds, weather, text/image.]

Introducing Big Data
Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization. (Wikipedia)
Drivers: cheap storage, > 2 billion users, sensor networks, inexpensive computing.
Enormous amounts of data: online behavior of social networking users, samples of medical ailments, purchasing habits of grocery shoppers, crime statistics of cities, Internet of Things (IoT), 24/7 out-patient monitors, real-time telemetric devices.
90% of the data in the world has been created in the last 2 years.

5 Vs

Evolving Approaches to Analytics
ETL (Extract, Transform, Load): Original Data -> ETL Tool (SSIS, etc.) -> Transformed Data -> EDW (SQL Server, Teradata, etc.) -> Data Marts -> BI Tools
Data Lake(s): Ingest (EL) Original Data and streaming data -> Scale-out Storage & Compute (HDFS, Blob Storage, etc.) -> Transform & Load -> Dashboards, Apps
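The difference between the two flows above can be sketched in a few lines of Python. This is only an illustration under assumed data: the JSON click-stream records, the field names and the function names are all hypothetical and do not come from the slides.

```python
import json

# Hypothetical raw events, e.g. click-stream records arriving as JSON text.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340}',
    '{"user": "a", "page": "/cart", "ms": 95}',
]


def etl(raw_lines):
    """ETL: parse and reshape first, then load only the transformed rows."""
    warehouse = []
    for line in raw_lines:
        record = json.loads(line)                # extract
        row = (record["user"], record["page"])   # transform to a fixed schema
        warehouse.append(row)                    # load
    return warehouse


def ingest(raw_lines):
    """Data lake ingest (EL): store the original data untouched."""
    return list(raw_lines)


def transform_on_read(lake, field):
    """Transform later, choosing the shape per question."""
    return [json.loads(line)[field] for line in lake]


warehouse = etl(RAW_EVENTS)
lake = ingest(RAW_EVENTS)
latencies = transform_on_read(lake, "ms")
```

In this sketch the warehouse rows no longer carry the "ms" field, so a later question about latency can only be answered from the lake, which kept the original records.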

Introducing Apache Hadoop. Hadoop stores files in a distributed file system. Hadoop can store very large amounts of data.
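To make "stores files in a distributed file system" concrete, here is a rough Python sketch of how a file might be cut into blocks and each block copied to several nodes. It assumes the commonly used HDFS defaults of 128 MB blocks and 3 replicas, and it uses plain round-robin placement as a stand-in; real HDFS placement is rack-aware, and the node names and function name are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # commonly used HDFS default block size (assumption)
REPLICATION = 3       # commonly used HDFS default replication factor (assumption)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Round-robin is an illustrative stand-in, not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Each replica of a block goes to a different node.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


layout = place_blocks(file_size_mb=300,
                      nodes=["node1", "node2", "node3", "node4"])
```

A 300 MB file becomes three blocks here, and losing any single node still leaves two copies of every block.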

Introducing Hadoop: Comparison to Traditional RDBMS. [Table: Traditional RDBMS vs. Hadoop, compared on data size, access, updates, structure, integrity, scaling, and DBA ratio.]

Data variety

Data velocity

Hadoop is a platform with a portfolio of projects
Hadoop Common: utilities to support the other modules
HDFS (Hadoop Distributed File System): high-throughput distributed file system
YARN: job scheduling and cluster resource management (RM)
MapReduce: YARN-based system for parallel processing
Spark: compute engine
Pig: data-flow language & execution framework
Oozie: workflow scheduler
Ambari: provisioning, managing and monitoring clusters
Sqoop: bulk data transfer between Hadoop & relational databases
Batch-processing centric, using a MapReduce processing paradigm
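The MapReduce paradigm mentioned above can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); it is not Hadoop code, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data small data", "all data"])
```

On a cluster the map calls would run in parallel on the nodes holding each block of input, and the shuffle would move the pairs across the network to the reducers.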

Getting Started with HDInsight. Introducing Azure HDInsight: 100% Apache Hadoop, powered by the cloud, immersive insights.

HDInsight supports Hive Hadoop 2.0

HDInsight supports HBase. [Diagram components: Coordination, HMaster, Name Node, Job Tracker, and four each of Region Server, Data Node, and Task Tracker.]

HDInsight supports Mahout

HDInsight supports Storm

TCO, Deployment & Geo-Redundancy

Connect cloud Hadoop with on-premises

Scenarios for deploying Hadoop as hybrid

Bringing Hadoop to a billion people

Industry use cases of Hadoop: financial services; retail; telecom; manufacturing; healthcare; utilities, oil and gas; public sector.

Introducing the zoo: HDInsight/Hadoop ecosystem. Legend: Red = Core Hadoop; Blue = Data processing; Green = Packages; Purple = Microsoft integration points and value adds; Orange = Data movement. Core of the diagram: Distributed Processing (MapReduce), Distributed Storage (HDFS).

Programming HDInsight
Since HDInsight is a service-based implementation, you get immediate access to the tools you need to program against HDInsight/Hadoop.
Existing ecosystem: Hive, Pig, Sqoop, Mahout, Cascading, Scalding, Scoobi, Pegasus, etc.
.NET: C#, F# Map/Reduce, LINQ to Hive, .NET management clients, etc.
JavaScript: JavaScript Map/Reduce, browser-hosted console, Node.js management clients
DevOps/IT Pros: PowerShell, cross-platform CLI tools

Challenges with implementing Hadoop

Why Hadoop in the cloud?

The Microsoft data platform. Consumption: applications, reports, dashboards, natural language query, mobile. Services: orchestration, information management, complex event processing, modeling, machine learning. Data: relational, non-relational, NoSQL, streaming; internal & external.

Cortana Analytics Suite: transform data into intelligent action. Data -> Intelligence -> Action.

Azure Data Factory: a managed cloud service for building & operating data pipelines. Part of the Cortana Analytics Suite.

What about Non-Relational and NoSQL? Characteristics on the slide: fully featured RDBMS, rich query, transactional processing, managed as a service, elastic scale, schema-free data model, internet accessible (HTTP/REST), arbitrary data formats. There's a great David Chappell paper for getting up to speed on NoSQL: http://azure.microsoft.com/enus/documentation/articles/fundamentals-data-management-nosqlchappell/

PolyBase unites structured data, unstructured data, and business data for a better-together world of analytics.

PolyBase and queries: provides a scalable, T-SQL-compatible query processing framework for combining data from both universes. Access any data.

So what is PolyBase?
Answer: Component of the PDW Region in APS
Answer: Unique, innovative technology
Answer: Seamless integration
Answer: Highly parallelised distributed query engine accessing heterogeneous data via SQL

Agnostic architecture
PolyBase is agnostic = no vendor lock-in
PolyBase integrates with the cloud
PolyBase supports Hadoop on Linux & Windows
PolyBase supports HDInsight in APS & external Hadoop clusters

PolyBase builds the bridge
Just-in-time data integration: across relational and non-relational data; high-performance parallel architecture; fast, simple data loading.
Best of both worlds: uses computational power at source for both relational data & Hadoop; opportunity for new types of analysis.
Uses existing analytical skills: familiar SQL semantics & behaviour; query with familiar tools (SSDT); includes Power BI.
PolyBase = run-time integration

PolyBase. User perspective: External Table, External Data Source, External File Format. Systems perspective: PDW Engine, PDW Service, Bridge.
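From the user perspective, the three objects named above are usually defined with three T-SQL statements. The Python sketch below only assembles those statements as strings so the relationship between them is visible; it does not connect to any server. The object names, HDFS address, columns and path are hypothetical, and the exact options accepted vary by product and version.

```python
def polybase_ddl(source, location, file_format, table, columns, path):
    """Build the three T-SQL statements behind a PolyBase external table.

    All names and option values passed in are illustrative assumptions.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return [
        # 1. Where the external data lives (here: a Hadoop cluster).
        f"CREATE EXTERNAL DATA SOURCE {source} "
        f"WITH (TYPE = HADOOP, LOCATION = '{location}');",
        # 2. How the files are laid out (here: comma-delimited text).
        f"CREATE EXTERNAL FILE FORMAT {file_format} "
        "WITH (FORMAT_TYPE = DELIMITEDTEXT, "
        "FORMAT_OPTIONS (FIELD_TERMINATOR = ','));",
        # 3. A table definition that ties the two together.
        f"CREATE EXTERNAL TABLE {table} ({column_list}) "
        f"WITH (LOCATION = '{path}', DATA_SOURCE = {source}, "
        f"FILE_FORMAT = {file_format});",
    ]


statements = polybase_ddl(
    source="HadoopCluster",                    # hypothetical name
    location="hdfs://namenode.example:8020",   # hypothetical address
    file_format="CsvFormat",                   # hypothetical name
    table="dbo.ClickStream",                   # hypothetical name
    columns=[("user_id", "INT"), ("url", "VARCHAR(200)")],
    path="/data/clicks/",
)
```

Once such a table exists, it can be queried and joined with ordinary relational tables using familiar T-SQL, which is the run-time integration the slides describe.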

Mobile BI apps for SQL Server (Datazen). On-premises implementations are optimized for SQL Server. Rich, interactive data visualization on all major mobile platforms. View on any major mobile platform; access reports with online/offline support. Data visualization and publishing. Powerful insights.

What is R? Open source implementation. Tool for statistical computing. Extensible via packages. Talented community of contributors. High-accuracy ML classifiers. In-memory analytics. Big data analytics. Top tool for machine learning. Industry standard for computational mining. Amazing data-visualization capabilities.

Why is R famous? R plotting: box plot, bar plot, histogram, contour, dot plot, mosaic, scatter, Latticist. http://homes.cs.washington.edu/~jheer//files/zoo/?utm_source=twitterfeed&utm_medium=twitter

Revolution R Enterprise and SQL
Big data analytics platform: based on open source R; high-performance, scalable, full-featured.
Statistical and machine-learning algorithms are performant, scalable, and distributable.
Write once, deploy anywhere: scripts and models can be executed on a variety of platforms, including non-Microsoft ones (Hadoop, Teradata in-db).
Integration with the R ecosystem: analytic algorithms are accessed via R functions with similar syntax for R users; arbitrary R functions/packages can be used in conjunction.
Advanced analytics

SQL Server 2016 R integration scenario
Exploration: use RRE from an R IDE to analyze large datasets and build predictive and embedded models, with the compute happening on the SQL Server machine (SQL Server compute context).
Operationalization: a developer can operationalize an R script/model over SQL Server data by using T-SQL constructs; a DBA can manage resources and secure and govern R runtime execution in SQL Server.

R script library in Microsoft Azure Marketplace
Example solutions: fraud detection, sales forecasting, warehouse efficiency, predictive maintenance.
Extensibility: launch external process; R integration; new R scripts; Microsoft Azure Machine Learning Marketplace.
Data Scientist: interacts directly with data (analytic library).
Data Developer/DBA: manages data and analytics together (T-SQL interface, relational data).
Benefits: faster deployment of ML models; faster performance (moves compute close to the data); improved scalability.
Built into SQL Server. Advanced analytics.

Summary: R integration and advanced analytics
Capability: extensible in-database analytics, integrated with R, exposed through T-SQL; centralized enterprise library for analytic models.
Benefits: Data Scientists publish algorithms and interact directly with data; DBAs manage storage and analytics together; Business Analysts do analysis through T-SQL, tools, and vetted algorithms.
[Diagram: SQL Server analytics library (share and collaborate, manage and deploy); analytical engines (full R integration, fully extensible); data management layer (relational data, T-SQL interface, stream data in-memory).]
Advanced analytics

Standard approach to learn R. Self-training is the key. Math: statistics, calculus, probability. Machine learning algorithms. Open-source R packages. Industrial R with R: Hadoop, RRE. Applied R with Microsoft Azure ML, RevR.

Machine learning tools
Open source: R (considered best fit), Python, Monte Carlo Machine Learning Library, H2O, Weka, Octave-Forge, Apache Mahout
Commercial: Microsoft Azure Machine Learning, SAS Enterprise Miner, IBM SPSS Modeler, RapidMiner, MATLAB, Oracle Data Mining

Rich services. Heterogeneity. Integrate with on-premises. Lower your risk.

Scaling

Azure in hawk-eye mode
Layers: Platform Services, Infrastructure Services, Security & Management, Hybrid Operations.
Services shown: Portal, Cloud Services, Service Fabric, Web Apps, API Apps, SQL Database, Data Warehouse, DocumentDB, Azure AD, Health Monitoring, Azure Active Directory, Azure AD B2C, Batch, RemoteApp, Mobile Apps, Logic Apps, Redis Cache, Azure Search, Storage Tables, AD Privileged Identity Management, Domain Services, Multi-Factor Authentication, Automation, Storage Queues, BizTalk Services, API Management, Notification Hubs, Backup, Scheduler, Hybrid Connections, Service Bus, HDInsight, Machine Learning, Stream Analytics, Data Lake, Operational Analytics, Key Vault, Visual Studio, Azure SDK, Data Factory, Event Hubs, Data Catalog, Import/Export, Store/Marketplace, VM Image Gallery & VM Depot, Media Services, Content Delivery Network (CDN), VS Online, App Insights, IoT Hub, Mobile Engagement, Azure Site Recovery, StorSimple.

Azure IT Capabilities
Layers: Platform Services, Infrastructure Services, Security & Management, Hybrid Operations.
Capabilities shown: Service Creation & Configuration, User/Group Directory Store, Identity Sign-Up and Sign-In, Multi-Factor Authentication, Scheduled Service Management, Task Scheduler, Stateless Compute, Scheduled Compute Jobs, Simple Queuing, Hybrid Connections, Distributed Compute, Virtual App Streaming, B2B Integration, Pub/Sub Queuing, Web Apps Infrastructure, Mobile Backends, API Management, API App Infrastructure, Business Process Automation, Push Notifications, Big Data Analytics, Relational SQL Database, Distributed In-Memory Cache, Predictive Analytics, Data Warehouse, Search, Data Stream Analytics, Document Database Service, Simple Key/Value Store, Big Data Storage, Directory Health Monitoring, Privileged Identity Management, Domain Join & Policy Management, Server Data Backup, Operational Analytics, Encryption Key Store, Development Tools, Software Development Kits, Data Pipelines, Device Data Collection, Data Source Management, Bulk Data Import and Export, Software/Solution Marketplace, Pre-Built VM Images, Live & On-Demand Media Streaming, Content Delivery Network (CDN), Software Lifecycle Management, Application Instrumentation, IoT Device Management, Mobile Analytics, Disaster Recovery, Hybrid/Intelligent Data Backup.

Summary
Big Data refers to data sets so large and/or complex that they become awkward to work with in conventional ways.
Hadoop and HDInsight = Microsoft's answer to Big Data.
Hadoop can store petabytes of data reliably and execute huge distributed computations.
However, Big Data query results often involve significant latency.
Power BI includes authoring add-ins to query, analyze and visualize data sourced from Azure HDInsight.
Preload data in advance of business user queries.
Big Data is just another data source!

Resources
Microsoft Big Data web site: http://www.microsoft.com/en-us/server-cloud/solutions/big-data.aspx
Azure HDInsight web site: http://azure.microsoft.com/en-us/documentation/services/hdinsight/
Hortonworks tutorials: http://hortonworks.com/tutorials (numerous tutorials are available to learn about Big Data by using the Hortonworks Sandbox)
Follow me: @SQLMaster, www.sqlserver-qa.net
