SQLStarter Intro to Data Science. Dave

1 SQLStarter Dave

2 SQLStarter Dave Leininger

3 WHO IS FUSION ALLIANCE?

4 SQLStarter:
What is Data Science?
Why would I want to be a Data Scientist?
What are the tools and technologies?
Who is hiring?
Vetted resources

5 Data science is the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing. In its purest form, data science is the fourth paradigm of science, following experiment, theory, and computational sciences.

6 What is Data Science?
Domain expertise, research, statistics, analytic systems, algorithms, programming skills

7 What is a Data Scientist? Possesses a combination of skills to uncover insights in data and change the way an organization approaches challenges:
analytics, machine learning, data mining, statistical skills, algorithms, coding, computer science, applications, modelling, math, strong business acumen, effective communication
~ Ferris Jumah, Data Scientist, LinkedIn
SQLSaturday #530

8 What is Data Science? Skills noted in LinkedIn profiles of Data Scientists:
Data Analysis, Data Mining, Python, Machine Learning, SQL, R, Statistics, Algorithms, Java, Hadoop, C++, MATLAB, Statistical Modeling, Linux, Big Data, LaTeX, C, Computer Science, Programming, Analytics, Artificial Intelligence, Data Science, MySQL, Software Engineering, Databases, Predictive Modeling, Business Intelligence, Predictive Analytics, Software Development, MapReduce, JavaScript, Mathematical Modeling, Distributed Systems, Perl, Data Visualization, Natural Language Processing, SAS, Hive, Research, Text Mining, Optimization, Time Series Analysis, Pattern Recognition, Quantitative Analytics, SPSS, Data Warehousing, Information Retrieval, Simulations, Git

9 Why would I want to be a Data Scientist? A real data scientist knows how to apply mathematics, statistics, how to build and validate models using proper experimental designs. Having IT skills without statistics skills makes you a data scientist as much as it makes you a surgeon to know how to build a scalpel. ~ Lisa Winter, Senior Analyst at Willis Towers Watson

10 DATA

11 Why would I want to be a Data Scientist? Modern-day businesses track everything, from website visits and customer transactions right the way through to individual consumer reviews. We are living in a world of data overload. Hidden within this vast expanse of data are new revenue streams and business efficiencies.

12 Why would I want to be a Data Scientist? The average day of a data scientist involves extracting data from multiple sources, running it through an analytics platform and then creating visualizations of the data. They will then spend hours cleansing and analysing the data from multiple angles, looking for trends that highlight problems or opportunities. Any insight is communicated to business and IT leaders with recommendations to adapt existing business strategies.

13 Why would I want to be a Data Scientist? Most of the skills I use, as a data scientist, are different: domain expertise, business acumen, data intuition, use of vendor dashboards, finding the right data, making conclusions and applying results to my decision process to run my business. The systems that I develop (computational marketing, growth hacking) rely on a few principles: data-driven rather than model-driven, simplicity, robustness, scalability, efficiency, fast implementation. Some processes do not involve coding, but instead making tools communicate together (machine to machine, automation). ~ Vincent Granville, Pioneering Data Scientist

14 Why would I want to be a Data Scientist? Interview questions
Explain what regularization is and why it is useful.
Which data scientists do you admire most? Which startups?
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
Explain what precision and recall are. How do they relate to the ROC curve?
How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
What is root cause analysis?
Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
What is statistical power?
Explain what resampling methods are and why they are useful. Also explain their limitations.
Is it better to have too many false positives, or too many false negatives? Explain.
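For the precision and recall question, a minimal Python sketch may help anchor the definitions. The labels below are made up for illustration and `precision_recall` is a hypothetical helper, not a library function. Precision is the share of predicted positives that are truly positive; recall is the share of true positives that were found. Recall is the same quantity as the true positive rate plotted on the vertical axis of a ROC curve, while precision does not appear on a ROC curve at all.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only: 4 actual positives, 5 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Here 3 of the 5 predicted positives are correct (precision 0.60) and 3 of the 4 actual positives are found (recall 0.75), which also illustrates the false positive versus false negative trade-off raised in the last question.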

15 Why would I want to be a Data Scientist? More interview questions
What is selection bias, why is it important and how can you avoid it?
Give an example of how you would use experimental design to answer a question about user behavior.
What is the difference between "long" and "wide" format data?
What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
Explain Edward Tufte's concept of "chart junk."
How would you screen for outliers and what should you do if you find one?
How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
What is a recommendation engine? How does it work?
Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs).
How to efficiently represent 5 dimensions in a chart (or in a video)?
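The "long" versus "wide" question can be illustrated with a small sketch in plain Python. The rows and the helper names `long_to_wide` and `wide_to_long` are assumptions made for this example; in practice a data frame library would usually do the reshaping. Long format stores one observation per row (an identifier, a variable name, a value); wide format stores one row per identifier with one column per variable.

```python
from collections import defaultdict

# Long format: one row per (user, metric) observation. Illustrative data only.
long_rows = [
    {"user": "a", "metric": "visits", "value": 3},
    {"user": "a", "metric": "purchases", "value": 1},
    {"user": "b", "metric": "visits", "value": 5},
    {"user": "b", "metric": "purchases", "value": 2},
]


def long_to_wide(rows, key, name, value):
    """Pivot long rows into one row per key, with one column per variable name."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][row[name]] = row[value]
    return [{key: k, **cols} for k, cols in wide.items()]


def wide_to_long(rows, key, name, value):
    """Unpivot wide rows back into one row per (key, variable) pair."""
    return [
        {key: row[key], name: col, value: val}
        for row in rows
        for col, val in row.items()
        if col != key
    ]


wide_rows = long_to_wide(long_rows, "user", "metric", "value")
round_trip = wide_to_long(wide_rows, "user", "metric", "value")
print(wide_rows)
```

Wide format tends to suit reporting and many modeling inputs, while long format tends to suit grouping, filtering and plotting by variable; the round trip shows that both hold the same information.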

16 What are the tools and technologies? Obligatory Dilbert Reference

21 What are the tools and technologies? No longer can a motivated analyst pick up tools off the shelf and create an interactive data pipeline in an afternoon.

22 What are the tools and technologies? 5 stages of data science and machine learning projects
Define, Transform, Model, Deploy, Act
Organized project teams that collaborate through many iterations: engineers, business stakeholders, data scientists
Data, scripts, trained models

26 How do I earn the role? Integration with R
With SQL Server 2016, the R analytic framework is embedded natively in SQL Server. This removes the need to install a separate instance of the R framework, which reduces data movement between SQL Server and an external R server. R code can be written and executed in native SQL Server tools that are familiar to many SQL users.
Two options are provided for embedded R services. R Services In-Database requires a database engine instance of SQL Server 2016 and is available in Standard, Developer and Enterprise editions. The R Services Standalone edition does not require a SQL Server instance to be installed and provides R connectivity tools, but is only available with SQL Server Enterprise edition. With either option, the integration of R services within SQL Server 2016 means that R code, scripts and solutions can be designed, managed and executed within SQL Server.
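As a rough illustration of what "R code executed within SQL Server" looks like, R Services In-Database runs R through the `sp_execute_external_script` stored procedure. The Python sketch below only assembles the T-SQL text for such a call; it does not connect to a server. The helper `build_r_batch`, the table `dbo.Sales` and its `amount` and `region` columns are hypothetical names chosen for this example, and the call assumes the server has the external scripts option enabled.

```python
def build_r_batch(r_script: str, input_query: str) -> str:
    """Return T-SQL text that runs r_script in-database over the rows of input_query."""

    def quote(text: str) -> str:
        # T-SQL escapes a single quote inside a string literal by doubling it.
        return "N'" + text.replace("'", "''") + "'"

    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'R',\n"
        f"    @script = {quote(r_script)},\n"
        f"    @input_data_1 = {quote(input_query)};"
    )


# InputDataSet and OutputDataSet are the default data frame names SQL Server
# uses for the query result passed in and the data frame returned.
r_script = "OutputDataSet <- data.frame(mean_amount = mean(InputDataSet$amount))"

# dbo.Sales is a hypothetical table used only for illustration.
batch = build_r_batch(r_script, "SELECT amount FROM dbo.Sales WHERE region = 'East'")
print(batch)
```

The printed batch can be pasted into a query window. The point of the sketch is that the data stays in the database: the query result is handed to R as a data frame and the R result comes back as a result set.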

27 Who is hiring?
Manufacturing, retail, ecommerce, payment processing, logistics, healthcare
Employ statistical / econometric / data mining techniques to assess, monitor and forecast different sources of risk.
Develop optimization frameworks to support models related to risk allocation, pricing, capital strategy to improve and guide business decisions.
Support the deployment of analytical tools related to quantitative risk management.
Maintain and enhance previously developed models and tools.
Work with various data sources and platforms (PC, Mainframe, Unix/Linux, Teradata).
Execute both descriptive and inferential ad hoc requests in a timely manner.
Communicate and present models to business customers and executives.

28 What are some vetted resources?
Books:
Machine Learning: An Algorithmic Perspective
Pattern Discovery in Data Mining (Coursera)
Statistical Aspects of Data Mining (Google Tech Talks series on YouTube)
Doing Data Science
KDnuggets is a leading site on Business Analytics, Big Data, Data Mining, and Data Science, and is managed by Gregory Piatetsky-Shapiro, a leading expert in the field.

29 Vetted resources PLATFORM VENDORS
Context Relevant
DataRobot
Alpine Data
Continuum Analytics
Plotly
Arimo Data Intelligence Platform
Mode Analytics Platform
Dataiku Data Science Software
Nutonian AI-powered modeling engine
Domino: A Platform to Accelerate Data Science
Sense (now part of Cloudera)
Algorithmia Marketplace for Algorithms
Yhat: Making Data Science [App]licable

30 Vetted resources ONLINE TRAINING RESOURCES
Coursera
150+ data science courses from Johns Hopkins, Michigan, UCSD
Pluralsight
Understanding Machine Learning with Python
Beginning Data Visualization with R
Understanding Machine Learning with R
Exploratory Data Analysis with R
Data Science & Hadoop Workflows at Scale With Scalding
Lynda
Statistics with R
Data Science Tips
Udemy
R for Excel Users
R Statistics Essential Training
Up and Running with R
Introduction to Data Analysis with Python
Foundations of Business Analytics: Prescriptive Analytics
Dozens of data science courses, hundreds of lectures, thousands of reviews

31 Vetted resources MICROSOFT ONLINE TRAINING RESOURCES FOR DATA SCIENCE
Microsoft Professional Degree Data Science
Microsoft Virtual Academy
Data Science and Machine Learning Essentials
Building Recommendation Systems in Azure
Microsoft Azure Machine Learning
Data Science for Beginners

32 Credits Images from Freepik and other sources

33 SQLStarter:
What is a Data Scientist?
Why would I want to be a Data Scientist?
What are the tools and technologies?
Who is hiring?
Vetted resources