Analytical Capability Security Compute Ease Data Scale Price Users
Traditional Statistics vs. Machine Learning In-Memory vs. Shared Infrastructure CRAN vs. Parallelization Desktop vs. Remote Explicit vs. Automatic Distribution Real-Time vs. MapReduce Locality vs. Movement Memory Limits
No Magic Bullet.
Our Vision: R becomes the de facto standard for enterprise predictive analytics.
Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges.
Open Source Commercial
Traditional Open Source R Beside Architecture
CRAN Algorithms; rodbc, rhdfs, rhbase, rhive

Replace Open Source R Beside Architecture with Revolution R Open
CRAN Algorithms; rodbc, rhdfs, rhbase, rhive
As with Open Source R: Still Free. Still Memory Based. Data Still Moves.
Improvements: Accelerates Math with Intel MKL. Improves R-based Packages.
Limitations: No Effect for Non-R Code.
Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html
Write R Code to Explicitly Parallelize; Deploy Across Several Systems
Packages: foreach & iterators; doParallel (PC, server); doMPI (cluster); RRE rxExec
Example Uses: Bootstrapping, Simulation, HPC
Can Include CRAN Algorithms Carefully; rodbc, rhdfs, rhbase, rhive
As with Previous: Still Free. Still Memory Based. Data Still Moves. Intel MKL with RRO.
Improvements: Parallelized Execution.
Limitations: Parallelization Difficulty. Data Movement. Platform Specific.
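As a language-neutral illustration of explicit parallelization for the bootstrapping use case named on this slide, here is a minimal sketch in Python rather than R. It is not the foreach or doParallel API; the function names, the seed scheme, and the worker count are assumptions made for the example. The programmer splits the resamples across workers by hand and gathers the results, which is the "parallelization difficulty" the slide lists as a limitation.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def bootstrap_means(data, n_resamples, seed):
    """Draw n_resamples bootstrap resamples of data and return each mean."""
    rng = random.Random(seed)  # private generator so tasks do not share state
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def parallel_bootstrap(data, n_resamples=400, n_workers=4, base_seed=2014):
    """Split the resamples evenly across workers, then gather the results.

    Assumes n_resamples is divisible by n_workers (any remainder is dropped).
    Threads keep the sketch simple; CPU-bound work in CPython would normally
    use ProcessPoolExecutor, which accepts the same map() call.
    """
    per_worker = n_resamples // n_workers
    seeds = [base_seed + i for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(bootstrap_means, repeat(data), repeat(per_worker), seeds)
        return [mean for part in parts for mean in part]
```

Calling parallel_bootstrap(sample) returns the list of resampled means, from which a standard error or percentile interval can be computed. Note that the whole data set is handed to every worker, so the data still moves and must still fit in memory.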
Execute R Code & CRAN Algorithms Inside Hadoop
Remote Desktop
Example Uses: Scoring, Transformation, Easily Parallelized Algorithms
R Code: rmapreduce (Hadoop Streaming), rhbase, rhdfs
Can Include CRAN Algorithms Carefully
As with Previous: Still Free. Optional Intel MKL in RRO.
Improvements: Runs R in MapReduce. No Data Movement.
Limitations: Manual Parallelization. Hadoop Specific.
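The scoring use case on this slide maps naturally onto a streaming-style map and reduce pair. The sketch below is a local Python stand-in, not rmapreduce or the Hadoop Streaming runtime: the record layout ("segment,x"), the model coefficients, and all function names are hypothetical and chosen only for illustration. The mapper scores each record independently, the sort plays the role of the shuffle, and the reducer averages the scores per key.

```python
from itertools import groupby

# Hypothetical linear model coefficients, invented for this illustration.
INTERCEPT, SLOPE = 0.5, 2.0


def mapper(lines):
    """Score each 'segment,x' record and emit 'segment<TAB>score'."""
    for line in lines:
        segment, x = line.strip().split(",")
        score = INTERCEPT + SLOPE * float(x)
        yield f"{segment}\t{score}"


def reducer(sorted_lines):
    """Average the scores per segment; input must already be sorted by key."""
    keyed = (line.split("\t") for line in sorted_lines)
    for segment, group in groupby(keyed, key=lambda kv: kv[0]):
        scores = [float(score) for _, score in group]
        yield segment, sum(scores) / len(scores)


def run_local(lines):
    """Stand-in for the framework: map, sort by key (the shuffle), reduce."""
    return dict(reducer(sorted(mapper(lines))))
```

For example, run_local(["a,1", "b,2", "a,3"]) scores three records and returns the mean score for segments "a" and "b". Each record is scored where it lives, which is why scoring and transformation are listed as easy cases; algorithms that need to share state between records require the manual parallelization noted under Limitations.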
Traditional Beside Architecture with Optimized Algorithms
Available for Windows, Linux
Revolution R Enterprise: ScaleR PEMA Algorithms plus All of CRAN (subject to memory limits); rodbc, rhdfs, rhbase, rhive
As with Previous: Includes Intel MKL in RRO.
Advantages:
Speed: PEMAs Parallelize Across Threads, Cores & Sockets
Scale: PEMAs Chunk, No Memory Limits
All of CRAN Available
Portability
Fully Supported
Limitations: Data Movement. Single Machine.
Revolution R Enterprise is the only big data big analytics platform based on open source R.
Data Step: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort, Merge, Split; Aggregate by category (means, sums)
Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher's Exact Test; Student's t-test
Sampling: Subsample (observations & variables); Random Sampling
Predictive Models: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Classification & Regression Trees; Predictions/scoring for models; Residuals for all models
Variable Selection: Stepwise Regression
Simulation: Simulation (e.g. Monte Carlo); Parallel Random Number Generation
Cluster Analysis: K-Means
Classification: Decision Trees; Decision Forests; Gradient Boosted Decision Trees
Combination: PEMA-R API; rxDataStep; rxExec
New in 7.3
Script Calls ScaleR Algorithm; Scripts Can Call CRAN Open Source Algorithms
Master Algorithm Process: Start & Manage Processing; Combine Individual Results
ScaleR PEMA: Load Block At A Time; Analyze Each Block
Unlimited Data Scale: Data Not Limited to Available Memory; Ingests Data One Chunk At A Time; Adjustable Memory Footprint
Performance: Multi-Thread Execution; Highly-Optimized Algorithms; Algorithm Math Fully Refactored for Parallelism
Delivered as ScaleR Library in Revolution R Enterprise
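The load-a-block, analyze-each-block, combine-results pattern on this slide can be illustrated with a small external-memory computation. The sketch below is in Python and is not ScaleR code; it uses a standard pairwise merge of count, mean, and sum of squared deviations (a well-known technique swapped in for illustration) so that mean and variance are obtained without ever holding more than one chunk in memory. All names and the chunk size are assumptions.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Partial:
    """Intermediate result for one block: count, mean, squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def read_chunks(values, chunk_size):
    """Yield successive blocks so only one block is held in memory at a time."""
    block = []
    for value in values:
        block.append(value)
        if len(block) == chunk_size:
            yield block
            block = []
    if block:
        yield block


def analyze_chunk(chunk):
    """Per-block step: summarize one block independently of all others."""
    n = len(chunk)
    if n == 0:
        return Partial()
    mean = sum(chunk) / n
    return Partial(n, mean, sum((x - mean) ** 2 for x in chunk))


def combine(a, b):
    """Master step: merge two intermediate results into one."""
    n = a.n + b.n
    if n == 0:
        return Partial()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


def chunked_mean_variance(values, chunk_size=1000):
    """Return (mean, sample variance) computed one chunk at a time."""
    partials = map(analyze_chunk, read_chunks(values, chunk_size))
    total = reduce(combine, partials, Partial())
    variance = total.m2 / (total.n - 1) if total.n > 1 else float("nan")
    return total.mean, variance
```

Because analyze_chunk depends only on its own block and combine is associative, the per-block work could be spread across threads or machines without changing the result, and chunk_size plays the role of the adjustable memory footprint.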
Fast Single-Server Alternative for Modest Data Scale
(opt.) Thin Client or Remote Desktop
Edge Node: ScaleR + CRAN Algorithms; rodbc, rhdfs, rhbase, rhive; Local File System
As with Previous: Single Machine Execution. PEMA Scale & Speed (Single Machine). Use ScaleR + CRAN. Accelerate R with Intel MKL.
Improvements: Easily Shared via No Data Movement Develop on Desktop Run on Edge Node
Limitations: Shorter Trip for Data
Fast Parallelized Analytics on Large Data Sets in Hadoop
Desktop & Server Tools and Applications; Web Services; DeployR
Remote Execution; jobtracker; ScaleR Algorithms
As with Previous: Speed and Scale of ScaleR PEMA Algorithms. Use CRAN Where Appropriate. Accelerate R Math with MKL. Custom Parallelized Algos.
Advantages: Parallel Computation. No Data Movement. ScaleR PEMA Parallelization. Can Parallelize CRAN Carefully. Portable Coding.
Limitations: Hadoop Workload Profiles.
Test Cluster - 9 Nodes
Cluster: 1 Edge Node (128GB, 24 cores); 2 Admin Nodes (128GB, 24 cores each); 9 Task Nodes (64GB, 24 cores each)

Task                                                                 Processing Time
Importing and Filtering Datasets from HDFS
  14 Million Observations                                            82 sec.
  227 Million Observations                                           310 sec.
Modeling and Estimation
  1.2 M Correlations                                                 2771 sec.
  Simple Linear Regression, 227 M Observations                       61 sec.
  Multiple Linear Regression, Three Variables, 227 M Observations    58 sec.
  Multiple Linear Regression, Four Variables, 227 M Observations     58 sec.
  Random Forest, 10 Predictor Variables, 227 M Observations,
    10 Trees with Max Depth of 10 Splits                             2 hr. 3 min.
Maximized Flexibility, Performance & Workload Handling
Thin Client Development; Remote Execution; ScaleR Algorithms
Desktop & Server Tools and Applications; Web Services; rstudio; DeployR
As with Previous: Speed and Scale of ScaleR PEMA Algorithms. Use CRAN Where Appropriate. Accelerate R Math with MKL. Custom Parallelized Algos.
Advantages: Flexibility for Blended Workloads. Little or No Data Movement. Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes.
Where is the bulk of your skills? SAS? R? Java? Python? SQL?
Where do you build models today?
Do you have the skills to parallelize algorithms?
Can models be built on a big shared server?
How will you run models?
Do you have the budget to purchase commercial solutions?
How will your needs change over time?
What is your future architecture plan?
How risk averse is your management team regarding new platforms and open source?
Revolution Analytics Products:
http://www.revolutionanalytics.com/products
http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
Whitepaper: Delivering Value from Big Data with Revolution R Enterprise and Hadoop
http://www.revolutionanalytics.com/whitepaper/delivering-value-big-datarevolution-r-enterprise-and-hadoop
Revolution Analytics on Social Media: http://blog.revolutionanalytics.com/, Twitter
Thank you. www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR