Brian Macdonald, Big Data & Analytics Specialist, Oracle
Improving Predictive Model Development Time with R and Oracle Big Data Discovery
brian.macdonald@oracle.com
Copyright 2015, Oracle and/or its affiliates. All rights reserved.
Agenda
  Impact of more data on the Data Scientist
  A day in the life of a Data Scientist
  Oracle Big Data Discovery
  Oracle R technologies
What are some of the key challenges for Data Scientists?
(Or Analysts, Statisticians, whatever you want to call them.)
Data
  Lots of data, everywhere
  Desire to leverage this data for predictive analytics
  Data scientists hiding in their offices and pulling in data for themselves: "Just give me the data"
Working with the Business
  Separate communities, yet they have the same objective: improve the business
  They speak a different language
  How to share information?
  Usually not the math people. R?
Where is the data?
  Relational
  Hadoop
  The web
  Excel
  Files from data providers
  Other mysterious sources
How Oracle Brings It All Together: End-to-End Solutions
[Architecture diagram. Labels as extracted: Data, Fast Data, Apps; Streams, Events, Actions; Custom, Packaged; Data Management: Reservoir, Factory, Warehouse; Business Analytics, Data Lab; Reports, Data Sets, Discovery, Data Science, Visualization]
A Day in the Life of a Data Scientist
Let's focus on predictive analytics
  Data scientists have the tools
    SAS, SPSS, Matlab, R, Python, many others
    We will discuss R
  The business has the domain expertise
    BI tools, Excel
  Some crossover
  But generally they work independently
What if they could work together more seamlessly?
I will take the data scientist's perspective.
What does a data scientist do?
  Define problem
  Find data
  Exploratory Data Analysis
  Transform
  Modeling
  Share results
  Deploy
(Slide callout: This consumes most of the time)
Predictive Modeling with R
Pros
  Lots of sophisticated modeling capabilities
  Can transform and merge data
  Robust EDA capabilities
  Extensible
  Free
Cons
  It's a programming environment
  Need expertise
  Can be tedious for EDA and transformations
  Generally limited to the memory of a laptop
  Doesn't help with finding data
  Single-threaded
Where will the data scientist start to solve a problem?
  Manually ask or search around for data
  Generate statistics on the data: dim(orcl), head(orcl), summary(orcl), skewness(orcl)
  Generate some graphs to visualize: hist(orcl), plot(orcl), plot(lag_orcl)
  Start manipulating data: log_orcl <- log(orcl), lag_orcl <- diff(orcl)
  All to see if the data is worth using.
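The first-look commands on this slide can be strung together roughly as follows. This is a minimal sketch, not the demo script: `orcl` here is a simulated series standing in for the data used in the talk, and `skewness()` is defined by hand because base R does not ship one (packages such as e1071 or moments provide their own versions).

```r
# Hypothetical stand-in for the `orcl` series on the slide: a simulated
# daily price series, since the original data is not part of this deck.
set.seed(42)
orcl <- 40 * exp(cumsum(rnorm(250, mean = 0, sd = 0.01)))

# Sample skewness computed by hand (assumption: moment-based definition).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Core statistics
length(orcl)     # dim() returns NULL for a plain vector, so use length()
head(orcl)
summary(orcl)
skewness(orcl)

# Manipulations from the slide
log_orcl <- log(orcl)    # log transform
lag_orcl <- diff(orcl)   # first differences (named lag_orcl on the slide)

# Quick visual checks
hist(orcl)
plot(orcl, type = "l")
plot(lag_orcl, type = "l")
```

Every new data set needs the same handful of calls repeated, which is the tedium the following slides point at.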
R Demo
Powerful
But can this be automated and made easier?
What commands come next?
More code. Didn't I just do this for the other data sets?
Oracle Big Data Discovery
Reducing the time for data science projects
Search, Explore, Transform, Discover, Deploy
How can Big Data Discovery help the data scientist?
  Find data
  Show core statistics about all the data
  Automatic visualization
  Transform the data, e.g. bin, log transform, ratios, sentiment
  Place data in Hadoop to start modeling
This is what consumed lots of the data scientist's time
What BDD doesn't do (yet)
  Predictive
Transform
Business users can help with this!
  Scrub PII
  Missing data
  Entity normalizations
  Outlier detection
Benefits
  Share transformations in real time
  Can be done before the data scientist gets involved
Outlier elimination
Centering & scaling features
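For comparison, here is roughly what a few of these transformations look like when done by hand in R. It is a sketch under stated assumptions: the data frame, its column names, and its values are invented, and median imputation and the 1.5 * IQR rule are just common choices, not what BDD does internally.

```r
# Hypothetical data frame; column names and values are made up for illustration.
df <- data.frame(
  income = c(52000, 48000, NA, 61000, 950000, 45000, 58000),
  age    = c(34, 41, 29, NA, 52, 38, 45)
)

# Missing data: fill each numeric column with its median (one simple choice)
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
df[] <- lapply(df, impute_median)

# Outlier detection: flag values more than 1.5 * IQR outside the quartiles
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
outlier_rows <- is_outlier(df$income) | is_outlier(df$age)

# Outlier elimination: drop the flagged rows
clean <- df[!outlier_rows, ]

# Centering & scaling: each column to mean 0 and standard deviation 1
scaled <- as.data.frame(scale(clean))
```

Each step is short, but it has to be rewritten and rechecked for every data set, which is why sharing these transformations before the data scientist gets involved saves time.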
Big Data Discovery Demo
The Data Scientist needs more, though
How can I do modeling if the data is in Hadoop?
  Copy data to R? What if there is too much data?
  Use R to connect to Hadoop and run models on Hadoop
  Or use R to connect to Oracle DB and run models on Oracle
  Or use R connected to Oracle DB with data on Hadoop
Oracle's R Technologies
  Oracle R Distribution
  ROracle
    Software available to the R community for free
  Oracle R Advanced Analytics for Hadoop
    A component of the Oracle Big Data Connectors software suite
  Oracle R Enterprise
    A component of the Oracle Advanced Analytics option to Oracle Database
Oracle R Advanced Analytics for Hadoop: Integration
Using the Hadoop and Hive integration, plus R engine and open-source R packages
Hadoop cluster with Oracle R Advanced Analytics for Hadoop
  HQL: basic statistics, data prep, joins and view creation
  ORAAH distributed algorithms: MLP Neural Nets*, GLM*, LM, PCA, k-means, NMF, LMF
  Open-source R packages via Map-Reduce
  * Spark caching enabled
[Diagram labels: HQL, R; Oracle Database Server with Advanced Analytics option; R Client: R Analytics, Oracle R Advanced Analytics for Hadoop; SQL Client: SQL Developer, Other SQL Apps]
Copyright 2016 Oracle and/or its affiliates. All rights reserved.
Oracle R Advanced Analytics for Hadoop
Advanced Analytics algorithms in a Hadoop cluster: Map-Reduce and Spark based
Classification
  Generalized Linear Model Logistic Regression
Clustering
  Hierarchical k-means
Statistical Functions
  Correlation
  Covariance
  Cross Tabulation
  Summary statistics
Regression
  Linear Regression
  Multi-Layer Neural Networks
Attribute Importance
  Principal Components Analysis
Feature Extraction
  Nonnegative Matrix Factorization (NMF)
  Collaborative Filtering (LMF)
Oracle R Advanced Analytics for Hadoop Demo
OAA with Big Data SQL: Exadata + BDA
Using the in-database algorithms, plus R engine and open-source R packages if desired
[Diagram labels: Oracle Big Data Appliance; Oracle Exadata with Advanced Analytics option; R Client: R Analytics, Oracle R Enterprise; R; SQL Client: SQL Developer, Other SQL Apps; Big Data SQL]
Oracle Advanced Analytics: In-Database Machine Learning
Using the in-database algorithms, plus R engine and open-source R packages if desired
Oracle Database Server with Advanced Analytics option
  SQL: basic statistics and joins
  Data Mining, Predictive Analytics: 15 PL/SQL in-database algorithms
  ORE parallel algorithms: MLP Neural, Stepwise, LM, GLM, PCA
  Access to open-source R packages
[Diagram labels: R Client: R Analytics, Oracle R Enterprise; R; SQL Client: SQL Developer, Other SQL Apps]
Oracle Advanced Analytics
Predictive Analytics algorithms in-database
Classification
  Logistic Regression
  Decision Trees
  Random Forests
  Naïve Bayes
  Support Vector Machines
Clustering
  Hierarchical k-means
  Hierarchical O-Cluster
  Expectation-Maximization
Regression
  Linear Regression
  Support Vector Machines
  Multi-Layer Neural Networks
  Random Forests
Anomaly Detection
  One-Class SVM
Association Rules
  Apriori
Text Mining
  Tokenization
  Theme Extraction
Attribute Importance
  Minimum Description Length
  Principal Components Analysis
Feature Extraction
  Nonnegative Matrix Factorization (NMF)
  Singular Value Decomposition (SVD)
Now that the Model is Built and Scored, What's Next?
  Provide business users access to scored models
    Import scored data to BDD
    Share insights with business users
    Allow further discovery and analysis
  Operationalize the model
    Business users will provide guidance
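The hand-off from model to business user can be as simple as scoring the data and writing the result out as a flat file. The sketch below is illustrative only: the built-in `mtcars` data set and a plain linear model stand in for the real data and model, the output path is hypothetical, and the BDD import step itself is not shown.

```r
# Illustrative only: mtcars stands in for the modeling data, and a linear
# model stands in for whatever model was actually built.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Score the data and keep the prediction next to the original attributes
scored <- mtcars
scored$car <- rownames(mtcars)
scored$predicted_mpg <- predict(model, newdata = mtcars)
scored$residual <- scored$mpg - scored$predicted_mpg

# Write a flat file that a discovery tool could load
# (hypothetical location; the import into BDD is a separate step)
out_file <- file.path(tempdir(), "scored_data.csv")
write.csv(scored, out_file, row.names = FALSE)
```

Keeping the original attributes alongside the score lets business users slice the predictions by the dimensions they already know.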
Import Scored Data Demo
Besides providing technology, how can Oracle help?
Oracle Analytics Assessment Engagement Process
[Process diagram. Labels as extracted: Data Science, Business Process Improvement, Data Summit, Preparation and Discovery Workshop, Business Case, Analytics Sprints, Analytics Assessment Deliverable]
Notes on the diagram:
  Oracle-led presentation to key stakeholders. Output: agreement to do DSS and target use case
  Value-led discovery. Output: HYPOTHESIS backed by business case metrics, data sources, etc. needed for sprints
  Oracle specialist-led storyboard & demo development against HYPOTHESIS
  Business case redevelopment (if needed)
  Socialization of results widely across the business
Summary: Oracle Value Proposition for Data Scientists
  Increase productivity
  Expand capabilities
  Deploy effectively
HOW?
  Enabling technology
  Partnership model
To: Deliver business process improvement through math and data
OAA Links and Resources
Oracle Advanced Analytics Overview:
  OAA presentation: Big Data Analytics in Oracle Database 12c With Oracle Advanced Analytics & Big Data SQL
  Big Data Analytics with Oracle Advanced Analytics: Making Big Data and Analytics Simple (white paper on OTN)
  Oracle Internal OAA Product Management Wiki and Workspace
YouTube recorded OAA Presentations and Demos:
  Oracle Advanced Analytics and Data Mining at the YouTube Movies (6 + OAA live demos on ODM r 4.0 New Features, Retail, Fraud, Loyalty, Overview, etc.)
Getting Started:
  Link to Getting Started w/ ODM blog entry
  Link to new OAA/Oracle Data Mining 2-Day Instructor Led Oracle University course
  Link to OAA/Oracle Data Mining 4.0 Oracle by Examples (free) Tutorials on OTN
  Take a Free Test Drive of Oracle Advanced Analytics (Oracle Data Miner GUI) on the Amazon Cloud
  Link to OAA/Oracle R Enterprise (free) Tutorial Series on OTN
Additional Resources:
  Oracle Advanced Analytics Option on OTN page
  OAA/Oracle Data Mining on OTN page, ODM Documentation & ODM Blog
  OAA/Oracle R Enterprise page on OTN page, ORE Documentation & ORE Blog
  Oracle SQL based Basic Statistical functions on OTN
  BIWA Summit 16, Jan 26-28, 2016: Oracle Big Data & Analytics User Conference @ Oracle HQ Conference Center
Books on Oracle Advanced Analytics & Big Data
Books available on Amazon
  Predictive Analytics Using Oracle Data Miner: Develop for ODM in SQL & PL/SQL
  Using R to Unlock the Value of Big Data
  Oracle Big Data Handbook