DATA SCIENCE: HYPE AND REALITY
PATRICK HALL

About me
SAS Enterprise Miner, 2012
Cloudera Data Scientist, 2014

Statistician: Do you use Kolmogorov-Smirnov often?
Data Scientist: No, I mix my martinis with gin.

Statistician: So, you have no SQL experience?
Data Scientist: That's right, I have NoSQL experience.

Audience poll
Is data science a new field? (Slide labels: sources of data, technologies, networks, text.)
Is data science a true mathematical science?

Intro to data science

Historical roots
J. W. Tukey, The Future of Data Analysis, 1962
International Federation of Classification Societies, 1996
William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, 2001

Data science Venn diagram 1.0 Drew Conway, 2010 Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Data science Venn diagram 2.0 http://joelgrus.com/wp-content/uploads/2013/06/venndiagram2.png

Source: "Of the unicorn" by Special Collections, University of Houston Libraries - http://digital.lib.uh.edu/u?/p15195coll18,33. Licensed under CC0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/file:oftheunicorn.jpg#/media/file:oftheunicorn.jpg

Intro to machine learning

Data science Venn diagram 1.0 Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

A closer look at machine learning
SUPERVISED LEARNING (know Y): Regression, LASSO regression, Logistic regression, Ridge regression, Decision tree, Gradient boosting, Random forests, Neural networks, SVM, Naïve Bayes, Neighbors, Gaussian processes
UNSUPERVISED LEARNING (don't know Y): A priori rules, Clustering, k-means clustering, Mean shift clustering, Spectral clustering, Kernel density estimation, Nonnegative matrix factorization, PCA, Kernel PCA, Sparse PCA, Singular value decomposition, SOM
SEMI-SUPERVISED LEARNING (sometimes know Y): Prediction and classification*, Clustering*, EM, TSVM, Manifold regularization, Autoencoders, Multilayer perceptron, Restricted Boltzmann machines
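The "know Y" versus "don't know Y" split can be illustrated with a small sketch. The data, the function names, and the choice of a nearest-neighbor classifier and 1-D k-means are illustrative assumptions, not part of the original slides.

```python
# Made-up 1-D data: two groups of points around 1.0 and 5.0.
X = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
y = ["low", "low", "low", "high", "high", "high"]  # labels, i.e. "know Y"


def nearest_neighbor_predict(train_x, train_y, query):
    """Supervised: use the known labels to label a new point."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]


def k_means_1d(points, centers, n_iter=10):
    """Unsupervised: group the points without ever looking at labels."""
    centers = list(centers)
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers


predicted = nearest_neighbor_predict(X, y, 4.5)     # uses X and y
found_centers = k_means_1d(X, centers=[0.0, 6.0])   # uses X only
print(predicted, found_centers)
```

The supervised function needs both X and y, while the clustering function is handed X alone and has to discover the two groups by itself.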

Sacrificing interpretability for accuracy
Figure panels: hill and plateau sample data; traditional regression; decision tree; neural network

The shocking truth revealed! http://www.kdnuggets.com/2015/10/deep-learning-vapnik-einstein-devil-yandex-conference.html

Most time is spent cleaning and preprocessing the data!

Small data tools

Diagram: small data architectures
Hardware and software: multicore CPU, GPU, solid state drive (SSD), 64+ GB of RAM, scalable algorithms
Setup 1: data scientist, workstation, data
Setup 2: data scientist, software client, software server (MPI based), data

How do we turn our insights into a production system?

ESTIMATION VS. PREDICTION: DIFFERENT MINDSETS
Estimation: What happened? Why? Assumptions, parsimony, interpretation. Regression, discriminant analysis.
Prediction: What will happen? Predictive accuracy, machine learning, production deployment.
Model lifecycle: identify/formulate problem; data preparation/exploration; model building; deploy model; evaluate/monitor model.
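The two mindsets can be contrasted on the same fitted model. The sketch below is only an illustration: the data are made up, and `fit_line` and `mean_squared_error` are hypothetical helper names. Estimation reads the fitted parameters; prediction scores the model on data it has not seen.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between observed and predicted y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data, roughly y = 2x + 1 with a little noise.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
test_x = [7, 8]
test_y = [15.2, 16.8]

slope, intercept = fit_line(train_x, train_y)

# Estimation mindset: interpret the parameters ("each unit of x adds about 2 to y").
print("slope:", slope, "intercept:", intercept)

# Prediction mindset: judge the model by its error on held-out data.
holdout_mse = mean_squared_error(test_x, test_y, slope, intercept)
print("holdout MSE:", holdout_mse)
```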

The Analytics folks: I just built 850 new models. When can you put them into production?
The IT folks: !!!???!!!

Big data tools

Diagram: big data architecture
Distributed storage platform: Hadoop Distributed File System (HDFS), massively parallel processing (MPP) databases
Distributed analytics platform, disk-enabled: Hadoop MapReduce
Distributed analytics platform, in-memory: H2O.ai, SAS High-Performance Analytics, SAS LASR Analytic Server, Spark ML/MLlib
Data scientist, software client (MPI based), distributed data and software on multiple servers

Data growth
Chart: world's data in zettabytes, 1991 to 2016 (y-axis from 0 to 50). Source: Oracle, 2012

Data growth (1 zettabyte = 1 billion terabytes)

In 2008: a typical server hard drive was 500 GB with a transfer rate of 98 MB/sec. An entire disk could be transferred in 85 minutes.
In 2013: a typical server hard drive was 4 TB with a transfer rate of 150 MB/sec. An entire disk could be transferred in 440 minutes.
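The transfer times follow directly from capacity divided by transfer rate. A quick check, assuming decimal units (1 GB = 1000 MB, 1 TB = 1000 GB); the function name is made up for this sketch:

```python
def full_disk_read_minutes(capacity_gb, rate_mb_per_sec):
    """Minutes needed to stream an entire disk at a sustained transfer rate."""
    capacity_mb = capacity_gb * 1000  # assumption: decimal units
    return capacity_mb / rate_mb_per_sec / 60


minutes_2008 = full_disk_read_minutes(500, 98)     # 500 GB at 98 MB/sec
minutes_2013 = full_disk_read_minutes(4000, 150)   # 4 TB at 150 MB/sec

print(round(minutes_2008), round(minutes_2013))
```

This gives roughly 85 minutes for 2008 and roughly 444 minutes for 2013, which the slide rounds to 440. Capacity grew 8 times while the transfer rate grew only about 1.5 times, so reading a full disk takes about 5 times longer.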

Chart: average price of 1 MB of RAM, 2000 to 2010 (y-axis from $0.00 to $1.20)

Chart: CPU speed in MHz, 1978 to 2008 (y-axis from 0 to 4000)

Disk capacities are getting bigger, but disks are not spinning faster.
Processors are not running much faster, but they have more cores.
RAM is becoming affordable.

So
To handle all of this new data, we distribute it on clusters of computers.
Most modern analytical architectures take advantage of in-memory, distributed processing.
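The idea of distributing data and combining partial results can be sketched in a few lines. This is a single-process illustration, not a real cluster: the "nodes" are plain lists, and the names `partition`, `local_summary`, and `combine` are assumptions made for this example.

```python
def partition(data, n_nodes):
    """Split the data round-robin across n_nodes hypothetical workers."""
    return [data[i::n_nodes] for i in range(n_nodes)]


def local_summary(chunk):
    """Work done on one node, using only the chunk it holds in memory."""
    return sum(chunk), len(chunk)


def combine(summaries):
    """Merge the small per-node summaries into the global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


data = list(range(1, 101))
chunks = partition(data, n_nodes=4)
# On a cluster, each local_summary call would run in parallel on its own node.
summaries = [local_summary(chunk) for chunk in chunks]
distributed_mean = combine(summaries)
print(distributed_mean)
```

Only the small (sum, count) pairs travel between nodes; the raw data stay where they are stored.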

Hadoop and Spark
Bulk ETL, batch processing, deployment, online transactions, advanced analytics
MapReduce is a difficult framework for iterative, sophisticated algorithms.
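One way to see why iteration is awkward in MapReduce: an iterative algorithm turns into a chain of separate jobs, and each job makes a full pass over the data. The sketch below imitates that pattern in plain Python, using gradient descent on a one-parameter model. It does not use the Hadoop API; the function names and data are made up for illustration.

```python
def map_phase(records, w):
    """Map: emit each record's contribution to the squared-error gradient."""
    for x, y in records:
        yield 2 * (w * x - y) * x


def reduce_phase(contributions, n_records):
    """Reduce: sum the contributions into one average gradient."""
    return sum(contributions) / n_records


def fit_slope_as_jobs(records, n_jobs=30, learning_rate=0.05):
    """Fit y = w * x, where every iteration is one full map + reduce pass."""
    w = 0.0
    for _ in range(n_jobs):
        # In Hadoop MapReduce, each pass would be a separate job that
        # rereads its input from disk before it can start.
        gradient = reduce_phase(map_phase(records, w), len(records))
        w -= learning_rate * gradient
    return w


records = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # made-up data on the line y = 3x
slope = fit_slope_as_jobs(records)
print(slope)
```

Thirty iterations here mean thirty passes over the data. In-memory engines reduce that cost by keeping the data cached between iterations.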

https://github.com/szilard/benchm-ml

Hadoop corporate adoption remains low.
Death of RDBMS exaggerated.
Big data adoption will require time.

Parting shot

Use the scientific method. http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html

Where you can find me
Keep the Science in Data Science: http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html
An Introduction to Machine Learning: http://blogs.sas.com/content/sascom/2015/08/11/an-introduction-to-machine-learning/
SAS Data Mining Community: https://communities.sas.com/
Quora: www.quora.com
GitHub: github.com/jphall663, github.com/sassoftware
Twitter: @jpatrickhall