Modernizing Your Data Warehouse with Azure
Big data. Small data. All data.
Christian Coté
The traditional BI Environment
The traditional data warehouse
"Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
Gartner, The State of Data Warehousing in 2012
The traditional data warehouse
1. Increasing data volumes
2. Real-time data
3. New data sources and types
4. Cloud-born data
Life isn't about waiting for the storm to pass. It's about learning to dance in the rain.
The modern data warehouse
Microsoft's modern data warehouse
SQL Server 2014 PDW Microsoft Azure HDInsight Data Platform
Fully managed relational data warehouse-as-a-service
The first elastic cloud data warehouse with enterprise-grade capabilities
Support your smallest to largest data sets
In-memory performance
In-memory Columnstore for next-generation performance
Figure: Columnstore index representation
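As a rough illustration of why a column-oriented layout helps analytic queries, the sketch below stores the same small table row-wise and column-wise in plain Python. The table, column names, and values are hypothetical, and this is not how SQL Server's columnstore index is implemented. It only shows that an aggregate over one column reads a single list in the columnar layout, and that a column with few distinct values compresses well.

```python
# Hypothetical sales table, stored row by row (illustration only).
rows = [
    {"region": "East", "product": "A", "amount": 120.0},
    {"region": "West", "product": "B", "amount": 80.0},
    {"region": "East", "product": "B", "amount": 50.0},
    {"region": "West", "product": "A", "amount": 200.0},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)

# Row layout: every full row is visited to sum one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the "amount" list is read.
total_from_columns = sum(columns["amount"])

# A sorted column with few distinct values collapses to a few pairs.
encoded_region = run_length_encode(sorted(columns["region"]))
```

Both totals are the same; the difference is how much data has to be touched to compute them.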
Near real-time insights
Real-time with complex event processing
Diagram labels: Event Sources, Event Targets
"Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time." - Wikipedia
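Complex event processing engines commonly aggregate a stream of events over time windows before passing results to a target. The following is a minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python. The event shape, source names, and window length are hypothetical, and this does not represent the API of any Microsoft product.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, source) over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, source) pairs.
    """
    counts = defaultdict(int)
    for timestamp, source in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, source)] += 1
    return dict(counts)

# Hypothetical sensor events: (seconds since start, source name).
events = [(1, "sensor-a"), (4, "sensor-b"), (9, "sensor-a"),
          (12, "sensor-a"), (19, "sensor-b"), (21, "sensor-a")]

counts = tumbling_window_counts(events, window_seconds=10)
```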
What is Big Data? Many Options Variability
What is Big Data?
Axes: Volume, Velocity - Variety
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Eras: ERP / CRM, Web 2.0, Internet of things
Example data: Social Sentiment, Click Stream, Mobile Advertising, Sensors / RFID / Devices, ecommerce, Collaboration, Payables, Contacts, Payroll, Deal Tracking, Inventory, Sales Pipeline, Digital Marketing, Search Marketing, Web Logs, Recommendations, Wikis / Blogs, Audio / Video, Log Files, Spatial & GPS Coordinates, Data Market Feeds, eGov Feeds, Weather, Text / Image
Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
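To put the per-GB storage prices above in perspective, the short calculation below scales each figure from the slide to one terabyte and computes how far the price fell. Treating 1 TB as 1,000 GB is an assumption (decimal units); the function names are illustrative.

```python
# Per-GB storage prices quoted on the slide (US dollars).
COST_PER_GB = {1980: 190_000.0, 1990: 9_000.0, 2000: 15.0, 2010: 0.07}

GB_PER_TB = 1_000  # assumption: decimal units

def cost_per_tb(year):
    """Cost of one terabyte at the quoted per-GB price for `year`."""
    return COST_PER_GB[year] * GB_PER_TB

def price_drop_factor(start_year, end_year):
    """How many times cheaper a GB became between two quoted years."""
    return COST_PER_GB[start_year] / COST_PER_GB[end_year]

for year in sorted(COST_PER_GB):
    print(f"{year}: ${cost_per_tb(year):,.2f} per TB")
```

By these figures, a terabyte that would have cost about $190 million in 1980 cost about $70 in 2010.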
What is Big Data? Common Scenarios
Hadoop
Apache Hadoop is for big data.
It is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop
Comparison table: Traditional RDBMS vs. Hadoop
Criteria compared: Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio
HDFS
The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, reliable data storage and is designed to span large clusters of commodity servers.
Diagram labels: HDFS, Database
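HDFS achieves scale and reliability by storing a file as fixed-size blocks and keeping several copies of each block on different machines. The sketch below is a simplified model of that idea. The 128 MB block size and replication factor of 3 are common defaults used here as assumptions, the node names are hypothetical, and the round-robin placement is a deliberate simplification rather than HDFS's actual rack-aware policy.

```python
BLOCK_SIZE_MB = 128      # assumed block size (a common default)
REPLICATION_FACTOR = 3   # assumed number of copies per block

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be partially filled
    return sizes

def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Real HDFS placement is rack-aware; this is a simplification.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + i) % len(nodes)]
                            for i in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
blocks = split_into_blocks(300)               # a 300 MB file
placement = place_replicas(len(blocks), nodes)
```

Because every block lives on three different nodes in this model, losing any single node leaves at least two copies of each block available.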
How does it work?
Diagram labels: Runtime, Server, Server, Server, Server
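One way to picture a runtime spreading work across servers is the MapReduce model that Hadoop uses for distributed processing. The sketch below runs the three stages (map, shuffle, reduce) in a single Python process on a word count. The input splits are hypothetical, and a real cluster would run each map call on the server that holds the data.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits; on a cluster each would live on a different server.
splits = ["big data small data", "all data", "big clusters"]

mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

The map and reduce functions never share state, which is what lets the framework run many copies of them in parallel.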
Architecture
Hadoop Ecosystem
Pipeline / workflow (Oozie)
Event Pipeline (Flume)
Monitoring & Deployment (System Center)
PowerShell
NoSQL Database (HBase)
C#, F#, .NET
Scripting (Pig)
Graph (Pegasus)
Metadata (HCatalog)
Query (Hive)
Stats processing (RHadoop)
Distributed Processing (MapReduce)
World's Data (Azure Data Marketplace)
Distributed Storage (HDFS)
Azure Storage Vault (ASV)
Machine Learning (Mahout)
Query/Scripting (Spark)
Active Directory (Security)
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Event Driven Processing
Business Intelligence (Excel, Power View, SSAS)
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
What is Hive?
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
Provides an SQL-like language called HiveQL to query data
Integration between Hadoop and BI and visualization tools
http://hive.apache.org
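To give a feel for the kind of SQL-like summarization query Hive is meant for, the sketch below runs a GROUP BY aggregate using Python's built-in sqlite3 module as a stand-in, since running Hive itself requires a cluster. The table, columns, and data are hypothetical, and HiveQL differs from SQLite's dialect in its details; only the general shape of the query carries over.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("CA", "/home", 10), ("CA", "/cart", 4),
     ("US", "/home", 25), ("US", "/cart", 5)],
)

# Summarize views per country, the sort of query a BI tool might issue.
summary = conn.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY country"
).fetchall()
conn.close()
```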
What is Pig?
Write complex MapReduce jobs using a simple script language (Pig Latin)
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
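A Pig Latin script is written as a sequence of data-flow steps, each transforming the output of the previous one. The sketch below mirrors three such steps (filter, group, per-group summary) in plain Python to show the style. The records and threshold are hypothetical, and this is ordinary in-memory Python, not Pig Latin and not a MapReduce job.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical web-log records: (user, url, response_time_ms).
records = [
    ("alice", "/home", 120),
    ("bob", "/home", 340),
    ("alice", "/cart", 150),
    ("carol", "/home", 410),
    ("bob", "/cart", 60),
]

# Step 1, in the spirit of Pig's FILTER: keep only slow requests.
slow = [r for r in records if r[2] > 100]

# Step 2, in the spirit of GROUP ... BY: group slow requests by URL.
# groupby needs its input sorted on the grouping key.
slow_sorted = sorted(slow, key=itemgetter(1))
grouped = {url: list(group)
           for url, group in groupby(slow_sorted, key=itemgetter(1))}

# Step 3, in the spirit of FOREACH ... GENERATE: one summary row per group,
# holding (request count, average response time in ms).
summary = {url: (len(rows), sum(r[2] for r in rows) / len(rows))
           for url, rows in grouped.items()}
```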
Data Flow
Data -> Hadoop -> Analytics
Capabilities
Extract, Load, Transform
Distributed Compute
Predictive Analysis
Machine Learning
Graph Processing
IT infrastructure optimization
Legal discovery
Social network analysis
Traffic flow optimization
Web app optimization
Churn analysis
Natural resource exploration
Weather forecasting
Healthcare outcomes
Fraud detection
Life sciences research
Advertising analysis
Equipment monitoring
Smart meter monitoring
Features and benefits
Analyze unstructured data in Excel
Combine different types of data with Power Query/Power BI
Analyze your data with Power Pivot and Power BI
Features and benefits
Build a cluster in minutes and tear it down when you're done
Optimize cluster size for time to insight or cost savings
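The trade-off between time to insight and cost can be sketched with simple arithmetic. In the example below, the hourly node price, the job size, and the scaling model (perfect parallelism plus an optional per-node coordination overhead) are all hypothetical assumptions and do not reflect Azure pricing or measured HDInsight behavior.

```python
NODE_PRICE_PER_HOUR = 0.50   # hypothetical price, not an Azure rate
TOTAL_WORK_NODE_HOURS = 48   # hypothetical job size

def estimate(nodes, overhead_hours_per_node=0.0):
    """Return (elapsed_hours, total_cost) for a cluster of `nodes`.

    Assumes the job parallelizes perfectly, plus an optional
    coordination overhead that grows with cluster size.
    """
    elapsed = TOTAL_WORK_NODE_HOURS / nodes + overhead_hours_per_node * nodes
    cost = elapsed * nodes * NODE_PRICE_PER_HOUR
    return elapsed, cost

for nodes in (4, 8, 16):
    elapsed, cost = estimate(nodes)
    print(f"{nodes} nodes: {elapsed:.1f} h, ${cost:.2f}")
```

Under perfect scaling the cost stays flat while elapsed time drops, so a bigger cluster buys speed for free. Once overhead is added (for example `estimate(16, overhead_hours_per_node=0.1)`), larger clusters cost more, and that is where cluster size becomes a choice between speed and savings.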
Try HDInsight at www.windowsazure.com/bigdata
Try SQL Server for data warehousing in Microsoft Azure VMs at www.windowsazure.com
Try Hortonworks Data Platform for Windows at www.hortonworks.com
Try SQL Server 2017 at https://www.microsoft.com/en-us/sql-server
Resources
Apache Projects (list with links): http://bit.ly/mfplte
Microsoft Azure HDInsight: http://bit.ly/1dnlax1
HDInsight Documentation & Tutorials: http://bit.ly/lwryol
Hortonworks Sandbox 2.2 & Tutorials: http://bit.ly/1gkkcte
Cloudera VMs CDH 5.3.x: http://bit.ly/1enwghh
Microsoft JDBC Driver 4.1, 4.0 for SQL Server: http://bit.ly/1kegj7o
Microsoft Hive ODBC Driver: http://bit.ly/nfkhch
Getting Started with Big Data (MVA): http://bit.ly/1wu90xd
Big Data and Business Analytics Immersion v3.1 (MVA): http://bit.ly/1unvvx1
Introducing Microsoft Azure HDInsight (free e-book): http://bit.ly/1jope5f