DATA SCIENCE: HYPE AND REALITY
PATRICK HALL
About me: SAS Enterprise Miner, 2012; Cloudera Data Scientist, 2014
Statistician: Do you use Kolmogorov-Smirnov often? Data Scientist: No, I mix my martinis with gin.
Statistician: So, you have no SQL experience? Data Scientist: That's right, I have NoSQL experience.
Audience poll
Is data science a new field? (Sources of data; technologies; networks; text)
Is data science a true mathematical science?
Intro to data science
Historical roots
J. W. Tukey, The Future of Data Analysis, 1962
International Federation of Classification Societies, 1996
William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, 2001
Data science Venn diagram 1.0 Drew Conway, 2010 Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data science Venn diagram 2.0 http://joelgrus.com/wp-content/uploads/2013/06/venndiagram2.png
Source: "Of the unicorn" by Special Collections, University of Houston Libraries - http://digital.lib.uh.edu/u?/p15195coll18,33. Licensed under CC0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/file:oftheunicorn.jpg#/media/file:oftheunicorn.jpg
Intro to machine learning
Data science Venn diagram 1.0 Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
A closer look at machine learning
SUPERVISED LEARNING (know Y): Regression, LASSO regression, Logistic regression, Ridge regression, Decision tree, Gradient boosting, Random forests, Neural networks, SVM, Naïve Bayes, Neighbors, Gaussian processes
UNSUPERVISED LEARNING (don't know Y): A priori rules, Clustering, k-means clustering, Mean shift clustering, Spectral clustering, Kernel density estimation, Nonnegative matrix factorization, PCA, Kernel PCA, Sparse PCA, Singular value decomposition, SOM
SEMI-SUPERVISED LEARNING (sometimes know Y): Prediction and classification*, Clustering*, EM, TSVM, Manifold regularization, Autoencoders, Multilayer perceptron, Restricted Boltzmann machines
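The difference between knowing Y and not knowing Y can be illustrated with a small sketch. It uses only the Python standard library and made-up toy data. The function names `fit_line` and `kmeans_1d` are hypothetical and do not come from the talk; they stand in for the regression and k-means entries listed above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = a + b*x, using known targets ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def kmeans_1d(xs, centers, n_iter=10):
    """Unsupervised: 1-D k-means; groups xs without any target values."""
    centers = list(centers)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Toy data, assumed for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                  # known Y, generated from y = 1 + 2x
intercept, slope = fit_line(xs, ys)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]    # no Y available
centers = kmeans_1d(points, centers=[0.0, 10.0])
```

The supervised function needs `ys` to fit anything at all, while the unsupervised one only ever sees the inputs and has to discover structure on its own.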
Sacrificing interpretability for accuracy
[Figure panels: Hill and plateau sample data; Traditional regression; Decision tree; Neural network]
The shocking truth revealed! http://www.kdnuggets.com/2015/10/deep-learning-vapnik-einstein-devil-yandex-conference.html
Most time is spent cleaning and preprocessing the data!
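To make the point concrete, here is a minimal sketch of a few routine cleaning steps: trimming whitespace, normalizing case, dropping repeated ids, and imputing missing values. The raw extract, the column names, and the `clean` function are all hypothetical, and median imputation is just one assumed choice among many.

```python
import csv
import io
from statistics import median

# Hypothetical raw extract with typical problems: stray whitespace,
# inconsistent case, missing or non-numeric values, and a repeated row.
RAW = """id,state,income
1, nc ,52000
2,NC,
3,va,61000
3,va,61000
4,Va ,abc
"""


def clean(raw_text):
    rows, seen = [], set()
    for rec in csv.DictReader(io.StringIO(raw_text)):
        if rec["id"] in seen:          # keep only the first row seen for each id
            continue
        seen.add(rec["id"])
        state = rec["state"].strip().upper()
        try:
            income = float(rec["income"])
        except ValueError:             # empty or non-numeric values count as missing
            income = None
        rows.append({"id": int(rec["id"]), "state": state, "income": income})
    # Impute missing incomes with the median of the observed values.
    observed = [r["income"] for r in rows if r["income"] is not None]
    fill = median(observed)
    for r in rows:
        if r["income"] is None:
            r["income"] = fill
    return rows


cleaned = clean(RAW)
```

Even this five-row example needs four separate fixes before any model could use it, which hints at why preparation tends to dominate project time.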
Small data tools
Workstation: multicore CPU, GPU, solid state drive (SSD), 64+ GB of RAM, scalable algorithms
[Diagram labels: Data scientist, Workstation, Data; Software server; Data scientist, Software client, MPI based, Data]
How do we turn our insights into a production system?
ESTIMATION VS. PREDICTION: DIFFERENT MINDSETS
Estimation: What happened? Why? Assumptions, parsimony, interpretation. Regression, discriminant analysis.
Prediction: What will happen? Predictive accuracy. Machine learning. Production deployment.
Model life cycle: Identify/Formulate Problem; Data Preparation/Exploration; Model Building; Deploy Model; Evaluate/Monitor Model
The Analytics folks: I just built 850 new models. When can you put them into production? The IT folks: !!!???!!!
Big data tools
Distributed storage platform: Hadoop Distributed File System (HDFS); massively parallel processing (MPP) databases
Distributed analytics platform
Disk-enabled: Hadoop MapReduce
In-memory: H2O.ai, SAS High-Performance Analytics, SAS LASR Analytic Server, Spark ML/MLlib
[Diagram labels: Data scientist, Software client, MPI based, Distributed data and software on multiple servers]
Data growth
[Chart: world's data in zettabytes, 1991 to 2016; vertical axis 0 to 50] SOURCE: Oracle 2012
Data growth (1 zettabyte = 1 billion terabytes)
In 2008: Typical server hard drive was 500 GB with a transfer rate of 98 MB/sec. An entire disk could be transferred in 85 minutes.
In 2013: Typical server hard drive was 4 TB with a transfer rate of 150 MB/sec. An entire disk could be transferred in 440 minutes.
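The transfer times follow directly from capacity divided by transfer rate. The short check below is a sketch that assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) and purely sequential reads; the function name is hypothetical.

```python
def full_disk_transfer_minutes(capacity_gb, rate_mb_per_sec):
    """Minutes to read a whole disk sequentially, assuming 1 GB = 1000 MB."""
    seconds = capacity_gb * 1000 / rate_mb_per_sec
    return seconds / 60


t_2008 = full_disk_transfer_minutes(500, 98)     # 500 GB at 98 MB/sec
t_2013 = full_disk_transfer_minutes(4000, 150)   # 4 TB at 150 MB/sec
slowdown = t_2013 / t_2008                       # how much longer a full read takes
```

Under these assumptions the 2008 figure comes out at about 85 minutes and the 2013 figure at about 444 minutes, consistent with the rounded 440 on the slide. Capacity grew eightfold while the transfer rate grew only about 1.5 times, so reading a full disk takes roughly five times longer.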
[Chart: average price of 1 MB of RAM, 2000 to 2010; vertical axis $0.00 to $1.20]
[Chart: CPU speed in MHz, 1978 to 2008; vertical axis 0 to 4000]
Disk capacities are getting bigger, but disks are not spinning faster. Processors are not running much faster, but they have more cores. RAM is becoming affordable.
So, to handle all of this new data, we distribute it on clusters of computers. Most modern analytical architectures take advantage of in-memory, distributed processing.
Hadoop and Spark
Bulk ETL; batch processing; deployment; online transactions; advanced analytics
MapReduce is a difficult framework for iterative, sophisticated algorithms.
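One way to see why iterative algorithms sit awkwardly on MapReduce is to write one as repeated map and reduce passes. The sketch below is plain Python, not Hadoop code: it fits y = w*x by gradient descent on toy data, with each gradient step expressed as one full map + reduce pass. The function names, learning rate, and data are assumptions for illustration. In disk-based MapReduce, each such pass would typically be a separate job that rereads its input, which is where in-memory engines gain their advantage.

```python
from functools import reduce


def map_gradient(record, w):
    """Map step: one record's contribution to the squared-error gradient."""
    x, y = record
    return (w * x - y) * x


def run_pass(data, w, lr):
    """One full map + reduce pass over the data, giving one gradient step."""
    grads = map(lambda rec: map_gradient(rec, w), data)
    total = reduce(lambda a, b: a + b, grads, 0.0)
    return w - lr * total / len(data)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy data generated from y = 2x
w, passes = 0.0, 0
# The stopping rule uses the known answer (w = 2) for demonstration only.
while abs(w - 2.0) > 1e-6:
    w = run_pass(data, w, lr=0.1)
    passes += 1
```

Even this one-parameter model needs many passes over the data to converge (`passes` holds the count), whereas a task such as bulk ETL usually needs only one.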
https://github.com/szilard/benchm-ml
Hadoop corporate adoption remains low. Death of RDBMS exaggerated. Big data adoption will require time.
Parting shot
Use the scientific method. http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html
Where you can find me
Keep the Science in Data Science: http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html
An Introduction to Machine Learning: http://blogs.sas.com/content/sascom/2015/08/11/an-introduction-to-machine-learning/
SAS Data Mining Community: https://communities.sas.com/
Quora: www.quora.com
GitHub: github.com/jphall663, github.com/sassoftware
Twitter: @jpatrickhall