Data Science: The Big Picture @MatthewRenze #SQLServerUserGroupDubai
Job Postings for Data Scientists
Top-paying Tech Skills (table columns: Skill, 2016, Change) Source: Dice Salary Survey 2017
What is data science? Why is it important? Where is this all going?
What is data science?
Computer Science Data Science Math and Statistics Domain Knowledge
What Is a Data Scientist? Performs data science More than a scientist More than an analyst More than a developer
What skills are necessary?
Data Science Skills Programming Working with data Descriptive statistics Data visualization Statistical modeling Handling Big Data Machine learning Deploying to production
What tools are used?
Data Science Tools (bar chart: share of respondents, 0% to 70%, by tool: language, platform, analytics)
Tools shown: SQL Excel Python R MySQL Python tools ggplot SQL Server Tableau JavaScript Matplotlib Java PostgreSQL Oracle D3 Homegrown Hive Spark Cloudera Visual Basic MongoDB Hadoop SAS C++ PowerPivot Scala SQLite C Pig RedShift Weka Hbase (EMR) Perl SPSS Teradata
Source: O'Reilly 2015 Data Science Salary Survey
How is data science performed?
The Data Science Process (a cycle around the data): Find a question, Collect the data, Prepare the data, Create a model, Evaluate the model, Deploy the model
The Data Science Process (steps shown: Find a question, Explore the data, Prepare the data, Create a model, Evaluate the model, Deploy the model): Iterative process, Non-sequential, Early termination
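The process steps can be walked through on a toy problem. This is only a sketch: the data values, the straight-line model, and the names `fit_line` and `predict` are illustrative assumptions, not part of the original slides.

```python
# Toy walk-through of the process steps; all data below is made up for illustration.

# 1. Find a question: can x (say, ad spend) be used to predict y (say, sales)?

# 2. Collect the data: (x, y) pairs; None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]

# 3. Prepare the data: drop rows with missing values, then split train/test.
clean = [(x, y) for x, y in raw if y is not None]
train, test = clean[:4], clean[4:]

# 4. Create a model: fit y = slope * x + intercept by ordinary least squares.
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

# 5. Evaluate the model: mean absolute error on the held-out rows.
def predict(x):
    return slope * x + intercept

mae = sum(abs(predict(x) - y) for x, y in test) / len(test)

# 6. Deploy the model: here that only means exposing predict().
#    If the error is too high, loop back to an earlier step (the process is iterative).
print(f"slope={slope:.2f} intercept={intercept:.2f} mae={mae:.2f}")
```

In practice any step can send you back to an earlier one, or end the project early if the question turns out not to be answerable with the data at hand.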
Why is data science important?
Data Analytics, Internet of Things, Big Data, Machine Learning
Trends Past Present Future
Driven by economics Possible by technology
Cost Cost Value
Internet of Things Data Analytics Machine Learning Big Data
Data Analysis (The Past)
Collecting, analyzing, and communicating data was difficult, expensive, and slow.
Data Analytics (The Present)
Collecting, analyzing, and communicating data is easy, inexpensive, and fast.
Data-Driven Decision Making (The Future)
Retail Sales (dashboard panels: Total Products; Total Sales; Sales by Product Type; New Products This Year; Annual Sales Comparison by Month; Sales per Square Foot by Total Sales Variance and District)
Internet Sales (natural-language query): "Show me sales by gender and marital status." Result: displaying sum of sales by gender (Male, Female) and marital status (Married, Single), axis $0k to $15k
4% higher productivity 6% higher profits Source: https://hbr.org/2012/10/big-data-the-management-revolution
Empowering people to make better decisions is not the end goal; it's just the beginning.
Internet of Things Data Analytics Machine Learning Big Data
The Internet (The Past)
Cost Speed Bandwidth
The internet was expensive, slow, and not generating much data.
Internet of Things (The Present)
Cost Speed Bandwidth
Growth of the IoT (chart: billions of devices by year, 1990 to 2020; y-axis 0 to 50) Source: NCTA, 2014
50 billion IoT devices by 2020
The internet of things is cheap, fast, and generating tons of data.
Internet of Everything (The Future)
Cost Speed Bandwidth
An internet connection will likely be as common to devices as electricity.
We're building a peripheral nervous system for our planet... but it needs a brain.
FUN GAME 1 Is It IoT?
Is it IoT?
YES!
Is it IoT?
YES!
Is it IoT?
YES!
Is it IoT?
NO :(
Internet of Things Data Analytics Machine Learning Big Data
Data (The Past)
Data sets were small, slow, and had little diversity.
Big Data (The Present)
Global Data Growth (chart: data in zettabytes (ZB) by year, 2010 to 2020; y-axis 0 to 40) Source: UNECE Statistics Wikis
Doubling every two years
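The "doubling every two years" claim can be turned into simple arithmetic. A minimal sketch, assuming steady exponential growth; the function name `projected_volume` and the 1 ZB starting value are illustrative assumptions, not figures read off the chart.

```python
def projected_volume(initial_zb, years, doubling_period_years=2.0):
    """Volume after `years` if it doubles every `doubling_period_years`."""
    return initial_zb * 2 ** (years / doubling_period_years)

# Hypothetical starting point of 1 ZB, chosen only to keep the arithmetic easy to follow.
for years in (0, 2, 4, 10):
    print(years, projected_volume(1.0, years))
# prints:
# 0 1.0
# 2 2.0
# 4 4.0
# 10 32.0
```

Doubling every two years works out to roughly 41% growth per year, since the square root of 2 is about 1.414.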
Volume Velocity Big Data Variety
Volume
Velocity
Variety
INTEGRATION
Image labels: (Gender: Female, Age: 31, Emotion: Happy); (Gender: Male, Age: 5, Emotion: Happy); (Apple)
Today's data sets are bigger, faster, and more diverse.
Just Data Again (The Future)
Cost
We're creating new tools to automate feature extraction... but it requires machine learning.
FUN GAME 2 Am I Smarter Than Big Data?
What Do These Have In Common?
What Do These Three Things Predict?
Internet of Things Data Analytics Machine Learning Big Data
Artificial Intelligence (The Past)
Source: Evan Amos
AI Winter has ended and things are warming up again.
Machine Learning (The Present)
Artificial Intelligence Machine Learning Statistics
f(x): Data, Function, Prediction (example data: Cat, Dog; question: Is cat? prediction: Yes)
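The idea of learning a function f that maps data to a prediction can be sketched in a few lines. This example uses a nearest-centroid classifier, chosen here only because it is short; the features (weight, height), the training values, and the name `learn` are made-up assumptions, not the method shown in the talk.

```python
# Made-up training data: (weight_kg, height_cm) -> label. Illustrative only.
training = [
    ((4.0, 25.0), "cat"),
    ((5.0, 23.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((30.0, 60.0), "dog"),
]

def learn(examples):
    """Learn f from data: compute one average point (centroid) per label."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + v for t, v in zip(total, features)], count + 1)
    centroids = {label: [t / n for t in total] for label, (total, n) in sums.items()}

    def f(x):
        # Predict the label whose centroid is closest to x (squared distance).
        return min(
            centroids,
            key=lambda lbl: sum((a - b) ** 2 for a, b in zip(x, centroids[lbl])),
        )

    return f

f = learn(training)
print(f((4.5, 24.0)) == "cat")  # Is cat? prints True
```

The point is that `f` is not written by hand: it is produced from the examples, and different examples would produce a different function.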
The next generation of ML will be able to complete even more complex tasks.
Deep Learning (The Future)
Deep Learning Human Cat Dog Car
Deep Neural Network (layers: input, hidden 1, hidden 2, hidden 3, output; output labels: John, Jane, Miko, Lee)
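The layer structure on this slide (input, three hidden layers, output) can be sketched as a forward pass. This is a rough illustration only: the layer sizes, the ReLU and softmax choices, and the random untrained weights are all assumptions, so the printed probabilities are meaningless until such a network is trained.

```python
import math
import random

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of inputs plus bias, per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    """Keep positive values, replace negatives with zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    shifted = [math.exp(v - peak) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

def random_layer(rng, n_in, n_out):
    """Untrained layer: random weights, zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [8, 6, 6, 6, 4]  # input, hidden 1, hidden 2, hidden 3, output (arbitrary sizes)
layers = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    # Hidden layers apply ReLU; the output layer applies softmax.
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))

labels = ["John", "Jane", "Miko", "Lee"]
probs = forward([rng.random() for _ in range(8)])
print(dict(zip(labels, probs)))
```

Training would adjust the weights so that the highest probability lands on the correct name; that step is left out here.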
f(x)
AI Winter 2.0?
AI Winter 2.0? or Human Winter 1.0?
FUN GAME 3 Dog or Mop?
MOP!
DOG!
DOG!
MOP!
DOG! MOP!
Closing the Loop (The Future of Data Science)
Data Analytics, Internet of Things, Big Data, Machine Learning
Fully Autonomous Systems
Smart systems Cloud robotics Cyber-physical systems
Embedded intelligence will be woven into the fabric of our society.
The Big Data Universe (amount of data as of 2016, in petabytes): Human brain 2.5 PB; Spotify 10 PB; eBay 90 PB; Facebook 300 PB; Google 15,000 PB (estimated). Source: The Royal Society, 2016
Automation Framework (2x2 chart; axes: Complexity, Repetitiveness; quadrants: Routine & Complex, Non-routine & Complex, Routine & Simple, Non-routine & Simple) Source: Abhas Gupta - The Automation Framework
Automation Technology (same chart; axes: Complexity, Repetitiveness; labels: Deep Learning, High-level Programming, Conventional Machine Learning) Source: Abhas Gupta - The Automation Framework
Retail Salesperson (same chart; tasks: Fold Clothes, Greet Customers, Convert Customers, Sizing, Inventory Count, Inventory Pull, Cash Register) Source: Abhas Gupta - The Automation Framework
Medical Doctor (same chart; tasks: Treatment Plan, Diagnose Disease, Rare Disease, Build Trust, Routine Check-up, Input EMR, Write Prescription) Source: Abhas Gupta - The Automation Framework
Source: CGP Grey
Rise of the Robots (chart: industrial robots per 1,000 US workers by year, 1993 to 2014; y-axis 0 to 2) Source: International Federation of Robotics
Data science will amplify this trend
Which side of this new economy will your job be on: the side that's leading or the side being eliminated?
What will you choose?
Welcome Robot Overlords!!!
Where Do We Go Next?
Where to Go Next Pluralsight: https://www.pluralsight.com Coursera: https://www.coursera.org DataCamp: https://www.datacamp.com
Recommended Courses Data Science: The Big Picture Data Science with R Exploratory Data Analysis with R Data Visualization with R (3-part) https://www.pluralsight.com/authors/matthew-renze
www.matthewrenze.com
Feedback Very important to me! What did you like? What could I improve?
Conclusion
Internet of Things Data Analytics Machine Learning Big Data
Are you prepared? Is your organization? Is our world prepared?
Thank You! Matthew Renze Data Science Consultant Renze Consulting Twitter: @matthewrenze Email: info@matthewrenze.com Website: www.matthewrenze.com