A Gentle Introduction to Data Analytics in SQL Server Presented by Stacia Varga Moderated by Angela Henry

1 A Gentle Introduction to Data Analytics in SQL Server 2016 Presented by Stacia Varga Moderated by Angela Henry

2 Thank You
  microsoft.com, hortonworks.com, aws.amazon.com, red-gate.com

3 JOIN PASS
  PASS is a not-for-profit organization which offers year-round learning opportunities to data professionals
  Membership is free, join today at
  Access to online training and content
  Join Local Chapters and Virtual Chapters
  Enjoy discounted event rates
  Get advance notice of member exclusives

4 Stacia Varga Bio
  Stacia Varga is a Microsoft Data Platform MVP and SSAS Maestro with a Bachelor's degree in Social Sciences. A consultant, educator, author, and principal of Data Inspirations, her career spans more than 25 years, with a focus on improving business practices through technology. Since 2000, Stacia has been providing consulting and education services for Microsoft's Business Intelligence technologies. During that time, she has also authored several books covering the Microsoft BI stack as Stacia
  linkedin.com/in/staciavarga
  blog.datainspirations.com

5 Overview
  Data Analytics vs Business Intelligence vs Data Science
  Using Data Analytics in the Microsoft Stack
  Exploring Analytical Visualizations

6 Data Analytics vs Business Intelligence vs Data Science
  Analytics has emerged as a catch-all term for a variety of different business intelligence (BI)- and application-related initiatives. For some, it is the process of analyzing information from a particular domain, such as website analytics. For others, it is applying the breadth of BI capabilities to a specific content area (for example, sales, service, supply chain and so on). In particular, BI vendors use the analytics moniker to differentiate their products from the competition. Increasingly, analytics is used to describe statistical and mathematical data analysis that clusters, segments, scores and predicts what scenarios are most likely to happen. Whatever the use cases, analytics has moved deeper into the business vernacular. Analytics has garnered a burgeoning interest from business and IT professionals looking to exploit huge mounds of internally generated and externally available data.

7 Data Analytics vs Business Intelligence vs Data Science
  [Business intelligence] is a set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery.

8 Data Analytics vs Business Intelligence vs Data Science
  Data science is the exploration and quantitative analysis of all available structured and unstructured data to develop understanding, extract knowledge, and formulate actionable results.
  Cynthia Rudin, MIT Sloan School of Management
  Stephen F Elston, Principal Consultant, Quantia Analytics, LLC
  Data Science & Machine Learning Essentials (courses.edx.org)

9 Comparing Data Analytics to Business Intelligence
  Data Analytics
    Population count
      How many by category?
      What percentage of the total is in this group?
    Distribution / Variation
      What is the center of variation (mean and median)?
      Is the distribution skewed?
      What is the variance?
      What is the standard deviation?
    Association
      Which variables occur frequently with others?
      Is the relationship linear? Negative?
      Is there a correlation?
  Business Intelligence
    Summarized data for specified period of time
      What were total sales for last 3 years?
      What was profit margin last year?
    Comparative data for multiple categories or time periods
      Did sales increase this year as compared to last year?
      What was profit margin by product category last year?
    Consolidated data from various source systems
      How do sales volumes for top 10 customers compare to call center volumes for the same customers?
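
A minimal sketch of the distribution and association questions on this slide. The deck itself works in T-SQL, R, M, and DAX; Python's standard library is used here only as a neutral, self-contained illustration. The sample values are invented, and comparing mean to median is just a rough skew hint, not a formal skewness statistic.

```python
import math
import statistics

# Hypothetical customer ages; not data from the presentation.
ages = [25, 31, 34, 38, 41, 45, 47, 52, 58, 79]

mean = statistics.mean(ages)          # center: arithmetic mean
median = statistics.median(ages)      # center: middle value
variance = statistics.variance(ages)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(ages)      # sample standard deviation

# Rough hint only: a mean above the median suggests a longer right tail.
skew_hint = "right-skewed" if mean > median else "left-skewed or symmetric"


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical paired measurements with a perfectly linear relationship.
r_positive = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_negative = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])

print(mean, median, round(std_dev, 2), skew_hint)
print(r_positive, r_negative)
```

A coefficient near +1 or -1 answers "Is the relationship linear? Negative?"; a value near 0 suggests no linear association.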

10 Comparing Data Analytics to Business Intelligence
  Data Analytics
    User formulates business objectives
    Patterns are derived from relationships in the data
    Inductive processing - bottom up
  BI
    User formulates business objectives
    User tests predetermined patterns or manually seeks undiscovered patterns
    Deductive processing - drill down

11 And Then There's Data Science
  How much? How many?
    How much will the temperature change next week?
    How many products will we sell next week?
  Which category?
    Is this a cat or a dog?
    What is the topic of this news article?
  Which group?
    Which shoppers buy the same types of things?
    Which viewers like the same movies?
  Is it weird?
    Is this instrument reading unusual?
    Is this combination of purchases different from what this customer has made in the past?
  Which action?
    Should I raise or lower the temperature?
    Where should I position this product in my store?

12 Using Analytics in the Microsoft Data Platform
  Excel / Power BI
  SQL Server R Services
  SQL Server Analysis Services - data mining
  Azure Machine Learning

13 Organizing Your Data for Analytics
  Categorical
    Gender
    Race
    Nationality
  Quantitative (or numerical)
    Age
    Weight
    Height
  Ordinal
    Education level
    Hour of day

14 Bike Buyers
  Data for predictive model
  Customer data plus purchase history

  CREATE VIEW [dbo].[vtargetmail]
  AS
  SELECT c.[customerkey], c.[geographykey], c.[customeralternatekey],
         c.[title], c.[firstname], c.[middlename], c.[lastname], c.[namestyle],
         c.[birthdate], c.[maritalstatus], c.[suffix], c.[gender], c.[emailaddress],
         c.[yearlyincome], c.[totalchildren], c.[numberchildrenathome],
         c.[englisheducation], c.[spanisheducation], c.[frencheducation],
         c.[englishoccupation], c.[spanishoccupation], c.[frenchoccupation],
         c.[houseownerflag], c.[numbercarsowned],
         c.[addressline1], c.[addressline2], c.[phone],
         c.[datefirstpurchase], c.[commutedistance],
         x.[region], x.[age],
         -- BikeBuyer: 1 if the customer bought at least one bike, otherwise 0
         CASE x.[bikes] WHEN 0 THEN 0 ELSE 1 END AS [BikeBuyer]
  FROM [dbo].[dimcustomer] c
  INNER JOIN (
      -- Count bike purchases per customer
      SELECT [CustomerKey], [Region], [Age],
             SUM(CASE [EnglishProductCategoryName] WHEN 'Bikes' THEN 1 ELSE 0 END) AS [Bikes]
      FROM [dbo].[vdmprep]
      GROUP BY [CustomerKey], [Region], [Age]
  ) AS [x]
      ON c.[customerkey] = x.[customerkey]

15 Bike Buyers
  Many columns included for use in Microsoft's data mining tools for convenient data exploration & reporting
  View definition: same [dbo].[vtargetmail] view as slide 14

16 Bike Buyers
  Categorical and ordinal values
  View definition: same [dbo].[vtargetmail] view as slide 14

17 Bike Buyers
  Quantitative values
  View definition: same [dbo].[vtargetmail] view as slide 14

18 Bike Buyers
  Value to Predict: [BikeBuyer], derived in the view as CASE x.[bikes] WHEN 0 THEN 0 ELSE 1 END
  View definition: same [dbo].[vtargetmail] view as slide 14
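
The way the view derives the value to predict can be sketched on a toy table. This is only an illustration: it runs the same SUM(CASE ...) pattern in an in-memory SQLite database through Python's sqlite3 module rather than in SQL Server, and the `purchases` table and its rows are hypothetical stand-ins for [dbo].[vdmprep].

```python
import sqlite3

# In-memory SQLite database; a stand-in for SQL Server, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (CustomerKey INTEGER, Category TEXT);
    INSERT INTO purchases VALUES
        (1, 'Bikes'), (1, 'Accessories'),
        (2, 'Clothing'),
        (3, 'Bikes'), (3, 'Bikes');
""")

# Count bike purchases per customer, then collapse the count to a 0/1 flag,
# mirroring the inner SUM(CASE ...) and outer CASE in the view.
rows = conn.execute("""
    SELECT CustomerKey,
           CASE SUM(CASE Category WHEN 'Bikes' THEN 1 ELSE 0 END)
               WHEN 0 THEN 0 ELSE 1
           END AS BikeBuyer
    FROM purchases
    GROUP BY CustomerKey
    ORDER BY CustomerKey
""").fetchall()

bike_buyer = dict(rows)
print(bike_buyer)
conn.close()
```

Customer 3 bought two bikes but still gets a flag of 1, which is why the outer CASE is needed on top of the count.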

19 Explore Your Data with Power BI

20 Explore Your Data with Power BI
  Get Data: SQL Server + database
  Use Import, not DirectQuery
  Enter formula into Advanced Editor

  let
      Source = Sql.Database("localhost", "adventureworksdw2014"),
      // Keep only the vtargetmail view from the navigation table
      #"Filtered Rows" = Table.SelectRows(Source, each [Name] = "vtargetmail"),
      // Profile every column of the view
      #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Profile", each Table.Profile([Data])),
      #"Expanded Profile" = Table.ExpandTableColumn(#"Added Custom", "Profile",
          {"Column", "Min", "Max", "Average", "StandardDeviation", "Count", "NullCount", "DistinctCount"},
          {"Column", "Min", "Max", "Average", "StandardDeviation", "Count", "NullCount", "DistinctCount"}),
      // Drop the navigation columns, leaving one profile row per source column
      #"Removed Columns" = Table.RemoveColumns(#"Expanded Profile", {"Data", "Schema", "Item", "Kind"})
  in
      #"Removed Columns"
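
To make the profile columns concrete, here is a rough Python sketch that computes the same named statistics for a single column. It is not a reimplementation of Table.Profile: the choices that Count includes nulls, that DistinctCount ignores them, and that the standard deviation is the sample form are assumptions of this sketch, and the `profile` helper and sample values are hypothetical.

```python
import statistics


def profile(values):
    """Summarize one column; None stands in for a null value."""
    present = [v for v in values if v is not None]
    return {
        "Min": min(present),
        "Max": max(present),
        "Average": statistics.mean(present),
        # Sample standard deviation; needs at least two non-null values.
        "StandardDeviation": statistics.stdev(present),
        "Count": len(values),                       # all rows, nulls included
        "NullCount": len(values) - len(present),
        "DistinctCount": len(set(present)),         # nulls not counted here
    }


# Hypothetical age column with one missing value.
ages = [30, 40, None, 50, 40]
age_profile = profile(ages)

for name, value in age_profile.items():
    print(name, value)
```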

21 Explore Your Data with R
  SQL Server
  Requires SQL Server 2016 with R Services

22 Explore Your Data with Analysis Services Data Mining

23 What Are You Looking For?
  Missing values relative to total rows
  Look at reasonableness of mean, min, and max
  Invalid values
  Outliers
  Data ranges that are too narrow or wide
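
A small sketch of three of these checks (missing values relative to total rows, invalid values, and outliers). The sample ages, the 0 to 120 validity range, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration; the slides do not prescribe a specific rule.

```python
import statistics

# Hypothetical age column; None marks a missing value.
ages = [34, 41, None, 29, 52, 47, 38, -1, 45, 131, None, 95, 40]

# Missing values relative to total rows.
present = [a for a in ages if a is not None]
missing_ratio = (len(ages) - len(present)) / len(ages)

# Invalid values: outside an assumed plausible range for a person's age.
invalid = [a for a in present if not 0 <= a <= 120]
valid = [a for a in present if 0 <= a <= 120]

# Outliers among the valid values, using the common 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(valid, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in valid if a < low or a > high]

print(f"missing: {missing_ratio:.1%}")
print("invalid:", invalid)
print("outliers:", outliers)
```

Separating invalid values from outliers matters: 131 is impossible under the assumed rule, while 95 is plausible but unusual for this sample.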

24 Exploring Visualizations for Analytics
  Understand your data before modeling
    Distribution
    Relationships
  Uncover data quality issues
    Ranges
    Missing values

25 Single Variables
  Unimodal
  Bimodal

26 Histogram in R
  SQL Server Data

  rxHistogram(~age, data = sqlbikebuyerds, title = "Age Histogram", numBreaks = 10)
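
What a histogram with a fixed number of breaks computes can be sketched as equal-width bin counting. This is a generic illustration with invented ages, not a description of how rxHistogram chooses its bins internally; the `histogram` helper is hypothetical.

```python
def histogram(values, num_breaks):
    """Count values in num_breaks equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_breaks
    counts = [0] * num_breaks
    if width == 0:
        # All values identical: everything falls in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), num_breaks - 1)
        counts[idx] += 1
    return counts


# Hypothetical ages; five bins of width 10 from 20 to 70.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 70]
counts = histogram(ages, 5)
print(counts)
```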

27 Histogram in Power BI (Model)

  =CALCULATE(
      VALUES('AgeBucket'[AgeBucket]),
      FILTER('AgeBucket', 'AgeBucket'[Start] <= [age] && 'AgeBucket'[End] >= [age])
  )
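
The formula looks up the bucket whose Start..End range contains the row's age. A Python sketch of that lookup follows; the bucket rows are hypothetical, since the deck does not show the contents of the AgeBucket table, and returning None when there is no single match is a simplification chosen for this sketch.

```python
# Hypothetical contents of an AgeBucket lookup table.
age_buckets = [
    {"AgeBucket": "0-29", "Start": 0, "End": 29},
    {"AgeBucket": "30-39", "Start": 30, "End": 39},
    {"AgeBucket": "40-49", "Start": 40, "End": 49},
]


def lookup_bucket(age):
    """Return the label of the one bucket whose range contains age."""
    matches = [b["AgeBucket"] for b in age_buckets
               if b["Start"] <= age <= b["End"]]
    # Gaps or overlapping ranges give zero or several matches; this
    # sketch reports that as None instead of guessing.
    return matches[0] if len(matches) == 1 else None


print(lookup_bucket(29), lookup_bucket(30), lookup_bucket(75))
```

Because both comparisons are inclusive, bucket ranges must not share endpoints, or an age on the boundary would match two rows.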

28 Histogram in Power BI (Get Data)
  Option 1: Discrete values
  Add reference query with Group By transform on field

29 Histogram in Power BI (Get Data)
  Option 2: Custom buckets
  Add reference query with new column
  Group By transform on new column

  if ([age] < 30) then "0-29"
  else if ([age] < 40) then "30-39"
  else if ([age] < 50) then "40-49"
  else if ([age] < 60) then "50-59"
  else if ([age] < 70) then "60-69"
  else if ([age] < 80) then "70-79"
  else "80+"
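
The two steps of Option 2 (add a bucket column, then group by it) can be sketched in Python as follows. The thresholds mirror the formula on this slide; the sample ages and the `age_bucket` helper name are hypothetical.

```python
from collections import Counter


def age_bucket(age):
    """Map an age to the same labels as the custom column formula."""
    if age < 30:
        return "0-29"
    if age < 40:
        return "30-39"
    if age < 50:
        return "40-49"
    if age < 60:
        return "50-59"
    if age < 70:
        return "60-69"
    if age < 80:
        return "70-79"
    return "80+"


# Hypothetical ages; Counter plays the role of the Group By transform.
ages = [24, 35, 38, 41, 59, 60, 83]
bucket_counts = Counter(age_bucket(a) for a in ages)
print(dict(bucket_counts))
```

Buckets with no rows (here "70-79") simply do not appear in the grouped result, so a chart built on it will skip that bar unless the empty bucket is added back.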

30 Visualization in Azure Machine Learning
  Most steps in an experiment allow visualization of current state of data

31 Discrete Data in Column or Bar Charts

32 Two Continuous Variables: Scatter plot

33 Clustered Categorical Variables

34 Side by Side Categorical Variables

35 Ratios of Categorical Variables

36 Categorical Variables as Multiples