DATA SCIENCE: HYPE AND REALITY PATRICK HALL


1 DATA SCIENCE: HYPE AND REALITY PATRICK HALL

2 About me: SAS Enterprise Miner, 2012; Cloudera Data Scientist, 2014

3 Statistician: Do you use Kolmogorov-Smirnov often? Data Scientist: No, I mix my martinis with gin.

4 Statistician: So, you have no SQL experience? Data Scientist: That's right, I have NoSQL experience.

5 Audience poll: Is data science a new field? (Sources of data, technologies, networks, text.) Is data science a true mathematical science?

6 Intro to data science

7 Historical roots: J. W. Tukey, "The Future of Data Analysis," 1962; International Federation of Classification Societies, 1996; William Cleveland, "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," 2001

8 Data science Venn diagram 1.0 Drew Conway, 2010 Source:

9 Data science Venn diagram 2.0

10 Source: "Of the unicorn" by Special Collections, University of Houston Libraries - Licensed under CC0 via Wikimedia Commons

11 Intro to machine learning

12 Data science Venn diagram 1.0 Source:

13 A closer look at machine learning
SUPERVISED LEARNING (know Y): Regression, LASSO regression, Logistic regression, Ridge regression, Decision tree, Gradient boosting, Random forests, Neural networks, SVM, Naïve Bayes, Neighbors, Gaussian processes
UNSUPERVISED LEARNING (don't know Y): A priori rules, Clustering, k-means clustering, Mean shift clustering, Spectral clustering, Kernel density estimation, Nonnegative matrix factorization, PCA, Kernel PCA, Sparse PCA, Singular value decomposition, SOM
SEMI-SUPERVISED LEARNING (sometimes know Y): Prediction and classification*, Clustering*, EM, TSVM, Manifold regularization, Autoencoders, Multilayer perceptron, Restricted Boltzmann machines

14 Sacrificing interpretability for accuracy [Figure: hill and plateau sample data; panels: traditional regression, decision tree, neural network]

15 The shocking truth revealed!

16 Most time is spent cleaning and preprocessing the data!

17 Small data tools

18 Multicore CPU, GPU, solid state drive (SSD), 64+ GB of RAM, scalable algorithms [Diagrams: data scientist, workstation, data; data scientist, software client, MPI-based software server, data]

19 How do we turn our insights into a production system?

20 ESTIMATION VS. PREDICTION: DIFFERENT MINDSETS. Estimation: What happened? Why? Assumptions, parsimony, interpretation; regression, discriminant analysis. Prediction: What will happen? Predictive accuracy; machine learning; production deployment. Process: identify/formulate problem, data preparation/exploration, model building, deploy model, evaluate/monitor model.

21 The analytics folks: I just built 850 new models. When can you put them into production? The IT folks: !!!???!!!

22 Big data tools

23 Distributed storage platform: Hadoop Distributed File System (HDFS); massively parallel processing (MPP) databases. Distributed analytics platform: disk-enabled: Hadoop MapReduce; in-memory: H2O.ai, SAS High-Performance Analytics, SAS LASR Analytic Server, Spark ML/MLlib. [Diagram: data scientist, software client, MPI-based distributed data and software on multiple servers]

24 Data growth [Chart: world's data in zettabytes, vertical axis from 0 to 50] SOURCE: Oracle 2012

25 Data growth (1 zettabyte = 1 billion terabytes)

26 In 2008: a typical server hard drive was 500 GB with a transfer rate of 98 MB/sec; an entire disk could be transferred in 85 minutes. In 2013: a typical server hard drive was 4 TB with a transfer rate of 150 MB/sec; an entire disk could be transferred in 440 minutes.
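
The transfer times on this slide follow from dividing disk capacity by transfer rate. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and a constant sequential transfer rate; the function name is hypothetical and not from the talk:

```python
def full_disk_transfer_minutes(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read an entire disk at a constant transfer rate."""
    capacity_mb = capacity_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    seconds = capacity_mb / rate_mb_per_s
    return seconds / 60


# Figures quoted on the slide.
minutes_2008 = full_disk_transfer_minutes(500, 98)    # 500 GB at 98 MB/sec
minutes_2013 = full_disk_transfer_minutes(4000, 150)  # 4 TB at 150 MB/sec

print(round(minutes_2008))  # 85
print(round(minutes_2013))  # 444, which the slide rounds to 440
```

Capacity grew eightfold between the two drives while the transfer rate grew only about 1.5 times, so reading a full disk takes roughly five times longer.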

27 [Chart: average price of 1 MB of RAM, vertical axis from $0.00 to $1.20]

28 [Chart: CPU speed in MHz, vertical axis up to 4000]

29 Disk capacities are getting bigger, but disks are not spinning faster. Processors are not running much faster, but they have more cores. RAM is becoming affordable.

30 So: To handle all of this new data, we distribute it on clusters of computers. Most modern analytical architectures take advantage of in-memory, distributed processing.
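
One way to picture distributing data across a cluster is a partition-then-combine pattern: each node summarizes its own chunk in memory, and only the small summaries are merged. The sketch below simulates this on a single machine in plain Python; the function names and the choice of a mean as the statistic are illustrative assumptions, not something the talk specifies.

```python
from functools import reduce


def partition(values, n_parts):
    """Split values into n_parts chunks, dealing them out round-robin."""
    return [values[i::n_parts] for i in range(n_parts)]


def local_summary(chunk):
    """Work done on one node: only a small (sum, count) pair leaves the node."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(values, n_nodes=4):
    chunks = partition(values, n_nodes)
    # On a real cluster these summaries would be computed in parallel;
    # here they run one after another to keep the sketch self-contained.
    summaries = [local_summary(chunk) for chunk in chunks]
    total, count = reduce(combine, summaries)
    return total / count


data = list(range(1, 101))
print(distributed_mean(data))  # 50.5
```

The result does not depend on how many nodes the data is spread over, which is what makes this kind of computation easy to scale out.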

31

32 Hadoop and Spark: bulk ETL, batch processing, deployment, online transactions, advanced analytics. MapReduce is a difficult framework for iterative, sophisticated algorithms.
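
To illustrate why iterative algorithms sit awkwardly in MapReduce, the sketch below writes one-dimensional k-means as repeated map and reduce passes in plain Python. It is a single-machine simulation under stated assumptions, not Hadoop code: the function names, sample points, and starting centroids are all hypothetical. Every loop iteration corresponds to one full map-and-reduce job, and in classic Hadoop MapReduce each such job would read its input from disk and write its output back.

```python
from collections import defaultdict


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) pairs."""
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        yield idx, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # A centroid with no assigned points keeps its previous position.
    return [sum(groups[i]) / len(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]


def kmeans_1d(points, centroids, max_passes=20):
    """Repeat map and reduce passes until the centroids stop moving."""
    passes = 0
    for _ in range(max_passes):
        passes += 1  # each pass stands in for one complete MapReduce job
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, passes


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, passes = kmeans_1d(points, [0.0, 5.0])
print(centroids, passes)  # [2.0, 11.0] 3
```

Even this six-point example needs three passes to converge; in-memory engines avoid repeating the disk round trip on every pass.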

33

34 Hadoop corporate adoption remains low. Death of RDBMS exaggerated. Big data adoption will require time.

35 Parting shot

36 Use the scientific method.

37 Where you can find me: "Keep the Science in Data Science"; "An Introduction to Machine Learning"; SAS Data Mining Community; Quora; GitHub; Twitter; github.com/sassoftware