Modernizing Your Data Warehouse with Azure

Modernizing Your Data Warehouse with Azure Big data. Small data. All data. Christian Coté

The traditional BI Environment

The traditional data warehouse
"Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
Gartner, The State of Data Warehousing in 2012

The traditional data warehouse
1. Increasing data volumes
2. Real-time data
3. New data sources and types
4. Cloud-born data

Life isn't about waiting for the storm to pass. It's about learning to dance in the rain.

The modern data warehouse

Microsoft's modern data warehouse
SQL Server 2014 PDW Microsoft Azure HDInsight Data Platform

Fully managed relational data warehouse-as-a-service
The first elastic cloud data warehouse with enterprise-grade capabilities
Support your smallest to largest data sets

In-memory performance
In-memory Columnstore for next-generation performance
(Figure: Columnstore index representation)
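To make the columnstore idea concrete, here is a minimal sketch of why a column-oriented layout helps analytic queries: an aggregate reads one column instead of every full row, and a column with repeated values compresses well. The sample records and the helper names `to_columns` and `run_length_encode` are hypothetical, and this is a toy illustration rather than how SQL Server stores a columnstore index internally.

```python
# Row-oriented layout: each record keeps all of its fields together.
# The sample data is made up for illustration.
rows = [
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "West", "amount": 75.5},
    {"order_id": 3, "region": "East", "amount": 42.0},
    {"order_id": 4, "region": "East", "amount": 10.0},
]


def to_columns(records):
    """Pivot row records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate only needs to scan the single column it uses.
total_amount = sum(columns["amount"])

# Sorted, low-cardinality columns collapse into a few runs.
region_runs = run_length_encode(sorted(columns["region"]))

print(total_amount)  # 247.5
print(region_runs)   # [['East', 3], ['West', 1]]
```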

Near real-time insights
Real-time with complex event processing
(Figure: event sources and event targets)
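One common building block of complex event processing is a windowed aggregate over an event stream. The sketch below groups events into fixed, non-overlapping (tumbling) windows and averages the values in each. The function name `tumbling_window_average`, the event tuples, and the 10-second window are assumptions made for illustration, not part of any Microsoft product API.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds):
    """Average event values per fixed, non-overlapping time window.

    events: iterable of (timestamp_seconds, value) pairs.
    Returns a dict mapping each window's start time to the average value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Hypothetical sensor readings: (seconds since start, temperature).
events = [(0, 20.0), (4, 22.0), (11, 30.0), (19, 34.0), (21, 18.0)]
averages = tumbling_window_average(events, 10)

print(averages)  # {0: 21.0, 10: 32.0, 20: 18.0}
```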

"Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time."
- Wikipedia

What is Big Data? Many Options Variability

What is Big Data? Volume, velocity, variety

Volume: gigabytes (10^9 bytes), terabytes (10^12), petabytes (10^15), exabytes (10^18)

Storage cost per GB:
1980    $190,000
1990    $9,000
2000    $15
2010    $0.07

Eras along the same timeline: ERP / CRM, Web 2.0, Internet of things

Data sources shown: Social Sentiment, Click Stream, Mobile, Advertising, Sensors / RFID / Devices, eCommerce, Collaboration, Payables, Contacts, Payroll, Deal Tracking, Inventory, Sales Pipeline, Digital Marketing, Search Marketing, Web Logs, Recommendations, Wikis / Blogs, Audio / Video, Log Files, Spatial & GPS Coordinates, Data Market Feeds, eGov Feeds, Weather, Text/Image
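The storage prices on this slide are easier to appreciate with a little arithmetic. The sketch below uses only the four per-GB figures quoted above and works out what one terabyte (taken here as 1,000 GB) would have cost in each year. Prices are held in cents to avoid floating-point rounding; the variable and function names are illustrative.

```python
# Cost per GB from the slide, expressed in cents to keep the arithmetic exact.
COST_PER_GB_CENTS = {
    1980: 19_000_000,  # $190,000
    1990: 900_000,     # $9,000
    2000: 1_500,       # $15
    2010: 7,           # $0.07
}

GB_PER_TB = 1_000  # decimal terabyte, matching the 10^12 scale on the slide


def terabyte_cost_dollars(year):
    """Cost in dollars of storing one terabyte at that year's per-GB price."""
    return COST_PER_GB_CENTS[year] * GB_PER_TB / 100


for year in sorted(COST_PER_GB_CENTS):
    print(year, f"${terabyte_cost_dollars(year):,.2f}")

# How many times cheaper a gigabyte was in 2010 than in 1980 (rounded down).
price_drop_factor = COST_PER_GB_CENTS[1980] // COST_PER_GB_CENTS[2010]
print(price_drop_factor)  # 2714285
```

By these figures a terabyte goes from $190,000,000 in 1980 to $70 in 2010.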

What is Big Data? Common Scenarios

Hadoop
Apache Hadoop is for big data
An open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Designed to scale up from single servers to thousands of machines, each offering local computation and storage
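The "simple programming model" Hadoop is best known for is MapReduce. As a rough, single-process sketch of that model (not Hadoop's Java API), the word count below runs the three conceptual stages in plain Python: map each input split to key/value pairs, shuffle the pairs by key, then reduce each key's values. The input lines and function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for an input split processed by a separate mapper.
splits = ["big data small data", "all data"]

mapped = [pair for line in splits for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(word_counts)  # {'big': 1, 'data': 3, 'small': 1, 'all': 1}
```

On a real cluster the map and reduce calls run on different machines, next to the data they process, which is what lets the same model scale from one server to thousands.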

Hadoop: traditional RDBMS vs. Hadoop
Compared on: data size, access, updates, structure, integrity, scaling, DBA ratio

HDFS
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
(Figure: HDFS, Database)
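HDFS gets its scalability and reliability by splitting files into large blocks and storing several copies of each block on different machines. The sketch below plans that layout for one file. The 128 MB block size and replication factor of 3 are common Hadoop defaults. The node names and the round-robin placement are simplifying assumptions, since real HDFS uses a rack-aware placement policy.

```python
import math


def plan_blocks(file_size_mb, block_size_mb=128, replication=3,
                nodes=("node1", "node2", "node3", "node4")):
    """Return {block_index: [nodes holding a replica]} for one file.

    Placement is plain round-robin for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(block_count):
        # Consecutive nodes, starting at a different node for each block,
        # so every replica of a block lands on a distinct machine.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


placement = plan_blocks(300)  # a hypothetical 300 MB file
for block, holders in placement.items():
    print(block, holders)
```

A 300 MB file becomes three blocks, and because every block lives on three nodes, losing any single server leaves at least two copies of each block available.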

How does it work?
(Figure: Runtime, Server, Server, Server, Server)

Architecture

Hadoop Ecosystem
Pipeline / workflow (Oozie)
Event pipeline (Flume)
Monitoring & deployment (System Center)
PowerShell
NoSQL database (HBase)
C#, F#, .NET
Scripting (Pig)
Graph (Pegasus)
Metadata (HCatalog)
Query (Hive)
Stats processing (RHadoop)
Distributed processing (MapReduce)
World's data (Azure Data Marketplace)
Distributed storage (HDFS)
Azure Storage Vault (ASV)
Machine learning (Mahout)
Query/scripting (Spark)
Active Directory (security)
Data integration (ODBC / Sqoop / REST)
Relational (SQL Server)
Event-driven processing
Business intelligence (Excel, Power View, SSAS)
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data movement; Green = Packages

What is Hive?
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
Provides an SQL-like language called HiveQL to query data
Integration between Hadoop and BI and visualization tools
http://hive.apache.org
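Because HiveQL is SQL-like, a typical summarization query looks the same as ordinary SQL. The sketch below runs such a query against an in-memory SQLite database purely as a local stand-in, so the example needs no cluster. The `sales` table, its columns, and the sample rows are hypothetical, and SQLite is not Hive: on a real cluster the same statement would be submitted to Hive and executed as distributed jobs.

```python
import sqlite3

# A summarization query written in the subset of SQL that HiveQL shares
# with SQLite. Table and column names are made up for illustration.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY region
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 75.5), ("East", 42.0)],
)

# Each result row is a (region, total) tuple.
summary = conn.execute(QUERY).fetchall()
conn.close()

print(summary)  # [('East', 162.0), ('West', 75.5)]
```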

What is Pig?
Write complex MapReduce jobs using a simple script language (Pig Latin)
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
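A Pig Latin script describes a data flow as a chain of named steps, typically load, filter, group, and aggregate. The sketch below mirrors such a flow in plain Python, with a roughly corresponding Pig Latin statement shown in a comment above each step. The `logs` relation, its fields, and the 100-byte threshold are hypothetical, and the Python runs in one process, whereas Pig would compile the script into MapReduce jobs.

```python
from itertools import groupby
from operator import itemgetter

# logs = LOAD 'logs' AS (user:chararray, bytes:int);
logs = [("ann", 500), ("bob", 50), ("ann", 150), ("bob", 700)]

# big = FILTER logs BY bytes > 100;
big = [record for record in logs if record[1] > 100]

# grouped = GROUP big BY user;
# groupby only merges adjacent items, so sort by the key first.
grouped = groupby(sorted(big, key=itemgetter(0)), key=itemgetter(0))

# totals = FOREACH grouped GENERATE group, SUM(big.bytes);
totals = [(user, sum(size for _, size in records)) for user, records in grouped]

print(totals)  # [('ann', 650), ('bob', 700)]
```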

Data Flow
Data, Hadoop, Analytics

Capabilities
Extract, Load, Transform
Distributed Compute
Predictive Analysis
Machine Learning
Graph Processing

IT infrastructure optimization, legal discovery, social network analysis, traffic flow optimization, web app optimization, churn analysis, natural resource exploration, weather forecasting, healthcare outcomes, fraud detection, life sciences research, advertising analysis, equipment monitoring, smart meter monitoring

Features and benefits
Analyze unstructured data in Excel
Combine different types of data with Power Query/Power BI
Analyze your data with Power Pivot and Power BI

Features and benefits
Build a cluster in minutes and tear it down when you're done
Optimize cluster size for time to insight or cost savings
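The trade-off between time to insight and cost can be sketched with a toy model. It assumes the job's work divides evenly across nodes, each cluster pays a fixed startup overhead, and billing is per node-hour for the whole time the cluster is up. Every number below (40 node-hours of work, 0.25 hours of startup, a rate of 1.0 per node-hour) is a made-up placeholder, not Azure pricing, and the function name `estimate` is illustrative.

```python
def estimate(nodes, work_node_hours=40.0, startup_hours=0.25,
             rate_per_node_hour=1.0):
    """Return (elapsed_hours, cost) for an on-demand cluster of a given size.

    Assumes perfectly linear scaling, which real jobs rarely achieve.
    """
    elapsed_hours = startup_hours + work_node_hours / nodes
    cost = nodes * rate_per_node_hour * elapsed_hours
    return elapsed_hours, cost


for nodes in (4, 8, 16):
    elapsed, cost = estimate(nodes)
    print(f"{nodes:>2} nodes: {elapsed:.2f} h elapsed, cost {cost:.2f}")
```

Under these assumptions, going from 4 to 16 nodes cuts the elapsed time from 10.25 hours to 2.75 hours while the cost rises only from 41 to 44, because the cluster is torn down as soon as the job finishes.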

Try HDInsight at www.windowsazure.com/bigdata
Try SQL Server for data warehousing in Microsoft Azure VMs at www.windowsazure.com
Try Hortonworks Data Platform for Windows at www.hortonworks.com
Try SQL Server 2017 at https://www.microsoft.com/en-us/sql-server

Resources
Apache Projects (list with links) http://bit.ly/mfplte
Microsoft Azure HDInsight http://bit.ly/1dnlax1
HDInsight Documentation & Tutorials http://bit.ly/lwryol
Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkcte
Cloudera VMs CDH 5.3.x http://bit.ly/1enwghh
Microsoft JDBC Driver 4.1 4.0 for SQL Server http://bit.ly/1kegj7o
Microsoft Hive ODBC Driver http://bit.ly/nfkhch
Getting Started with Big Data (MVA) http://bit.ly/1wu90xd
Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvx1
Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1jope5f