Big Data with Azure: where to begin? Concepts and best practices. October 15th, 2016, Sofia. Satya SK Jayanty, Principal Architect & Managing Consultant, consulting@dbia.uk
Sponsors Gold sponsors: Silver sponsors: Bronze sponsors:
Speaking Engagements
Author: http://tinyurl.com/sql2k8r2admincookbook http://tinyurl.com/sql2012instantcubesecurity http://www.manning.com/delaney/
Agenda: what agenda? No agenda! You like small data, big data, all data! That's why you are here today.
What differentiates today's thriving organizations? Data. Data in all forms and sizes is being generated faster than ever before. Capture and combine it for new insights and better, faster decisions.
Strategic opportunity with Big Data. How do you use technology innovation (cloud, mobile, social, big data) to architect business innovation? Increased productivity, customer growth, real-time insights, embracing new models.
The Azure Platform Strategy. Security & Management. Public Cloud Platform: public, global, shared datacenters. SaaS (Software as a Service): O365, CRM, VSO, etc., plus 3rd-party SaaS solutions. Hybrid Operations: Microsoft Azure Stack & Cloud Platform System.
Breaking points of traditional approach
What if you could handle big data? Data volume: megabytes, gigabytes, terabytes, petabytes. Data complexity (variety and velocity): click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video, log files, spatial and GPS coordinates, data market feeds, eGov feeds, weather, text/image.
Introducing Big Data. "Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization." (Wikipedia) Cheap storage; inexpensive computing; sensor networks; > 2 billion users. Enormous amounts of data: online behavior of social networking users, samples of medical ailments, purchasing habits of grocery shoppers, crime statistics of cities, Internet of Things (IoT), 24/7 out-patient monitors, real-time telemetric devices. 90% of the data in the world has been created in the last 2 years.
5 Vs
Evolving Approaches to Analytics. Extract-Transform-Load: original data, ETL tool (SSIS, etc.), transformed data, EDW (SQL Server, Teradata, etc.), data marts, BI tools. Data Lake(s): ingest (EL) original data and streaming data into scale-out storage & compute (HDFS, Blob Storage, etc.), then transform & load for dashboards and apps.
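The difference between the two approaches on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the field names, and the helper functions (`transform`, `etl`, `el`, `query_lake`) are hypothetical and are not taken from the talk or from any Azure API. In the ETL path the data is shaped before it is loaded; in the data lake path the original lines are stored untouched and shaped only when a question is asked.

```python
import json

# Hypothetical raw click events, as they might arrive from a source system.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": "120"}',
    '{"user": "b", "page": "/cart", "ms": "340"}',
    '{"user": "a", "page": "/cart", "ms": "95"}',
]


def transform(line):
    """Parse one raw line and cast its fields to the target schema."""
    record = json.loads(line)
    return {"user": record["user"], "page": record["page"], "ms": int(record["ms"])}


def etl(raw_lines):
    """ETL: transform first, then load only the shaped rows into the warehouse."""
    return [transform(line) for line in raw_lines]


def el(raw_lines):
    """EL: load the original lines, untouched, into the lake."""
    return list(raw_lines)


def query_lake(lake, page):
    """Transform at read time, only for the question being asked."""
    return [row for row in map(transform, lake) if row["page"] == page]


warehouse = etl(RAW_EVENTS)   # shaped on the way in
lake = el(RAW_EVENTS)         # original data kept as-is

cart_from_warehouse = [row for row in warehouse if row["page"] == "/cart"]
cart_from_lake = query_lake(lake, "/cart")
```

Both paths answer the same question with the same rows; the lake simply defers the transformation and keeps the original data available for questions nobody has thought of yet.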
Introducing Apache Hadoop. Hadoop stores files in a distributed file system. Hadoop can store very large amounts of data.
Introducing Hadoop: comparison to traditional RDBMS. Traditional RDBMS vs. Hadoop, compared on: data size, access, updates, structure, integrity, scaling, DBA ratio.
Data variety
Data velocity
Hadoop is a platform with a portfolio of projects
Hadoop Common: utilities to support the other modules
HDFS (Hadoop Distributed File System): high throughput
YARN: job scheduling and cluster resource management
MapReduce: YARN-based system for parallel processing
Spark: compute engine
Pig: data-flow language and execution framework
Oozie: workflow scheduler
Ambari: provisioning, managing, and monitoring clusters
Sqoop: bulk data transfer between Hadoop and relational databases
Batch-processing centric, using the MapReduce processing paradigm
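The MapReduce paradigm mentioned above can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the three stages (map, shuffle, reduce); it is not Hadoop code, and the function names are hypothetical. On a real cluster the map and reduce calls would run in parallel on many nodes, with the framework handling the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big insights", "data in all forms"])
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across machines without the functions needing to know about each other.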
Getting Started with HDInsight. Introducing Azure HDInsight: 100% Apache Hadoop, powered by the cloud, immersive insights.
HDInsight supports Hive Hadoop 2.0
HDInsight supports HBase (architecture diagram: Coordination, HMaster, Name Node, and Job Tracker; four worker nodes, each with a Region Server, Data Node, and Task Tracker)
HDInsight supports Mahout
HDInsight supports Storm
TCO, Deployment & Geo-Redundancy $
Connect cloud Hadoop with on-premises
Scenarios for deploying Hadoop as hybrid
Bringing Hadoop to a billion people
Industry use cases of Hadoop: financial services; retail; telecom; manufacturing; healthcare; utilities, oil and gas; public sector.
Introducing the zoo: HDInsight/Hadoop ecosystem. Legend: Red = core Hadoop, i.e. distributed processing (MapReduce) and distributed storage (HDFS); Blue = data processing; Green = packages; Purple = Microsoft integration points and value adds; Orange = data movement.
Programming HDInsight. Since HDInsight is a service-based implementation, you get immediate access to the tools you need to program against HDInsight/Hadoop. Existing ecosystem: Hive, Pig, Sqoop, Mahout, Cascading, Scalding, Scoobi, Pegasus, etc. .NET: C#, F# Map/Reduce, LINQ to Hive, .NET management clients, etc. JavaScript: JavaScript Map/Reduce, browser-hosted console, Node.js management clients. DevOps/IT Pros: PowerShell, cross-platform CLI tools.
Challenges with implementing Hadoop
Why Hadoop in the cloud?
The Microsoft data platform. Applications, reports, dashboards, natural language query, mobile. Orchestration, information management, complex event processing, modeling, machine learning. Data: relational, non-relational, NoSQL, streaming; internal & external.
Cortana Analytics Suite Transform data into intelligent action DATA INTELLIGENCE ACTION
Azure Data Factory A managed cloud service for building & operating data pipelines Part of the Cortana Analytics Suite
What about Non-Relational and NoSQL? Fully featured RDBMS, rich query, transactional processing, managed as a service, elastic scale, schema-free data model, internet accessible (http/rest), arbitrary data formats. There's a great David Chappell paper for getting up to speed on NoSQL: http://azure.microsoft.com/en-us/documentation/articles/fundamentals-data-management-nosql-chappell/
PolyBase unites structured data, unstructured data, and business data for a better-together world of analytics
PolyBase and queries Provides a scalable, T-SQL-compatible query processing framework for combining data from both universes Access any data
So what is PolyBase? Answer: Component of the PDW Region in APS Answer: Unique Innovative Technology Answer: Seamless Integration Answer: Highly parallelised distributed query engine accessing heterogeneous data via SQL
Agnostic architecture. PolyBase is agnostic = no vendor lock-in. PolyBase integrates with the cloud. PolyBase supports Hadoop on Linux & Windows. PolyBase supports HDInsight in APS & external Hadoop clusters.
PolyBase builds the bridge: just-in-time data integration across relational and non-relational data. High performance: parallel architecture; fast, simple data loading. Best of both worlds: uses computational power at source for both relational data & Hadoop. Opportunity for new types of analysis. Uses existing analytical skills: familiar SQL semantics & behaviour. Query with familiar tools: SSDT; includes Power BI. PolyBase = run-time integration.
PolyBase. User perspective: External Table, External Data Source, External File Format. Systems perspective: PDW Engine, PDW Service, Bridge.
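From the user perspective, the three objects named on this slide depend on each other: an external table points at an external data source and an external file format. The Python sketch below merely renders the corresponding T-SQL statements as text to make that dependency visible. The statement shape follows the SQL Server 2016 PolyBase syntax as best recalled, and the APS/PDW dialect may differ, so check the product documentation before use. All object names, the HDFS address, the path, and the columns are hypothetical.

```python
def external_data_source(name, location):
    """Render the statement that tells PolyBase where the Hadoop cluster lives."""
    return (
        f"CREATE EXTERNAL DATA SOURCE {name}\n"
        f"WITH (TYPE = HADOOP, LOCATION = '{location}');"
    )


def external_file_format(name, field_terminator):
    """Render the statement that describes how the files are laid out."""
    return (
        f"CREATE EXTERNAL FILE FORMAT {name}\n"
        f"WITH (FORMAT_TYPE = DELIMITEDTEXT,\n"
        f"      FORMAT_OPTIONS (FIELD_TERMINATOR = '{field_terminator}'));"
    )


def external_table(name, columns, location, data_source, file_format):
    """Render the table definition that ties a schema to source and format."""
    column_list = ",\n    ".join(f"{col} {sql_type}" for col, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {name} (\n    {column_list}\n)\n"
        f"WITH (LOCATION = '{location}',\n"
        f"      DATA_SOURCE = {data_source},\n"
        f"      FILE_FORMAT = {file_format});"
    )


# Hypothetical names and locations, for illustration only.
statements = [
    external_data_source("MyHadoopCluster", "hdfs://10.0.0.1:8020"),
    external_file_format("CommaDelimited", ","),
    external_table(
        "dbo.ClickStream",
        [("url", "VARCHAR(200)"), ("event_date", "DATE"), ("user_ip", "VARCHAR(50)")],
        "/data/clickstream/",
        "MyHadoopCluster",
        "CommaDelimited",
    ),
]

script = "\n\n".join(statements)
print(script)
```

Once such objects exist, the external table can be queried and joined with ordinary T-SQL like any relational table, which is the "familiar SQL semantics" point made on the previous slide.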
Mobile BI apps for SQL Server (Datazen) On-premises implementations are optimized for SQL Server Rich, interactive data visualization on all major mobile platforms View on any major mobile platform Access reports with online/offline support Data visualization and publishing Powerful insights
What is R? Extensible via packages Talented community of contributors High accuracy ML classifiers In-memory analytics Open source implementation Big data analytics Top tool for machine learning OOL for statistical computing Industry standard for computational mining Amazing data-visualization capabilities
Why is R famous? R plotting: box plot, bar plot, histogram, contour, dot plot, mosaic, scatter, Latticist. http://homes.cs.washington.edu/~jheer//files/zoo/?utm_source=twitterfeed&utm_medium=twitter
Revolution R Enterprise and SQL. Big data analytics platform: based on open source R; high-performance, scalable, full-featured. Statistical and machine-learning algorithms are performant, scalable, and distributable. Write once, deploy anywhere: scripts and models can be executed on a variety of platforms, including non-Microsoft (Hadoop, Teradata in-db). Integration with the R ecosystem: analytic algorithms are accessed via R functions with similar syntax for R users; arbitrary R functions/packages can be used in conjunction. Advanced analytics
SQL Server 2016 R integration scenario. Exploration: use RRE from an R IDE to analyze large datasets and build predictive and embedded models, with the compute happening on the SQL Server machine (SQL Server compute context). Operationalization: developers can operationalize an R script/model over SQL Server data by using T-SQL constructs; DBAs can manage resources and secure and govern R runtime execution in SQL Server.
R script library in Microsoft Azure Marketplace. Example solutions: fraud detection, sales forecasting, warehouse efficiency, predictive maintenance. Extensibility: R integration (launch external process), new R scripts, Microsoft Azure Machine Learning Marketplace. Built into SQL Server: analytic library, T-SQL interface, relational data. Data Scientist: interacts directly with data. Data Developer/DBA: manages data and analytics together. Benefits: faster deployment of ML models, faster performance (moves compute close to the data), improved scalability.
Summary: R integration and advanced analytics. Capability: extensible in-database analytics, integrated with R, exposed through T-SQL; centralized enterprise library for analytic models. Benefits: Data Scientists publish algorithms and interact directly with data; DBAs manage storage and analytics together; Business Analysts run analysis through T-SQL, tools, and vetted algorithms. SQL Server: analytics library (share and collaborate, manage and deploy); analytical engines (full R integration, fully extensible); data management layer (relational data, T-SQL interface, stream data in-memory).
Standard approach to learn R. Self-training is the key. Math: statistics, calculus, probability. Machine learning algorithms. Open-source R packages. Industrial R with R: Hadoop, RRE. Applied R with Microsoft Azure ML, RevR.
Machine learning tools. Open source: R (considered best fit), Python, Monte Carlo Machine Learning Library, H2O, Weka, Octave-Forge. Commercial: Microsoft Azure Machine Learning, SAS Enterprise Miner, IBM SPSS Modeler, RapidMiner, Apache Mahout, MATLAB, Oracle Data Mining.
Rich Services Heterogeneity Integrate with on-premises Lower Your Risk
Scaling
Azure in hawk-eye mode Platform Services Security & Management Portal Cloud Services Service Fabric Web Apps API Apps SQL Database Data Warehouse DocumentDB Hybrid Operations Azure AD Health Monitoring Azure Active Directory Azure AD B2C Batch RemoteApp Mobile Apps Logic Apps Redis Cache Azure Search Storage Tables AD Privileged Identity Management Domain Services Multi-Factor Authentication Automation Storage Queues BizTalk Services API Management Notification Hubs Backup Scheduler Hybrid Connections Service Bus HDInsight Machine Learning Stream Analytics Data Lake Operational Analytics Key Vault Visual Studio Azure SDK Data Factory Event Hubs Data Catalog Import/Export Store/ Marketplace VM Image Gallery & VM Depot Media Services Content Delivery Network (CDN) VS Online App Insights Infrastructure Services IoT Hub Mobile Engagement Azure Site Recovery StorSimple
Azure IT Capabilities Platform Services Security & Management Service Creation & Configuration User/Group Directory Store Identity Sign-Up and sign-in Multi-Factor Authentication Scheduled Service Management Task Scheduler Stateless Compute Scheduled Compute Jobs Simple Queuing Hybrid Connections Distributed Compute Virtual App Streaming B2B Integration Pub/Sub Queuing Web Apps Infrastructure Mobile Backends API Management API App Infrastructure Business Process Automation Push Notifications Big Data Analytics Relational SQL Database Distributed In-Memory Cache Predictive Analytics Data Warehouse Search Data Stream Analytics Document Database Service Simple Key/Value Store Big Data Storage Hybrid Operations Directory Health Monitoring Privileged Identity Management Domain Join & Policy Management Server Data Backup Operational Analytics Encryption Key Store Development Tools Software Development Kits Data Pipelines Device Data Collection Data Source Management Bulk Data Import And Export Software/Solution Marketplace Pre-Build VM Images Live & OD Media Streaming Content Delivery Network (CDN) Software Lifecycle Management Application Instrumentation Infrastructure Services IoT Device Management Mobile Analytics Disaster Recovery Hybrid/Intelligent Data Backup
Summary. Big Data refers to data sets so large and/or complex that they become awkward to work with in conventional ways. Hadoop and HDInsight = Microsoft's answer to Big Data. Hadoop can store petabytes of data reliably and execute huge distributed computations. However, Big Data query results often involve significant latency. Power BI includes authoring add-ins to query, analyze, and visualize data sourced from Azure HDInsight. Preload data in advance of business user queries. Big Data is just another data source!
Resources. Microsoft Big Data web site: http://www.microsoft.com/en-us/server-cloud/solutions/big-data.aspx Azure HDInsight web site: http://azure.microsoft.com/en-us/documentation/services/hdinsight/ Hortonworks tutorials: http://hortonworks.com/tutorials (numerous tutorials are available to learn about Big Data by using the Hortonworks Sandbox). Follow me: @SQLMaster, www.sqlserver-qa.net