
Analytical Capability, Security, Compute, Ease, Data Scale, Price, Users

Traditional Statistics vs. Machine Learning
In-Memory vs. Shared Infrastructure
CRAN vs. Parallelization
Desktop vs. Remote
Explicit vs. Automatic Distribution
Real-Time vs. MapReduce
Locality vs. Movement
Memory Limits

No Magic Bullet.

Our Vision: R becomes the de facto standard for enterprise predictive analytics.
Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges.

Open Source Commercial

Traditional Open Source R, Beside Architecture: CRAN Algorithms; rodbc, rhdfs, rhbase, rhive

Replace Open Source R in the Beside Architecture with Revolution R Open
CRAN Algorithms; rodbc, rhdfs, rhbase, rhive
As with Open Source R: Still Free. Still Memory Based. Data Still Moves.
Improvements: Accelerates Math with Intel MKL; Improves R-based packages
Limitations: No Effect for non-R Code

Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html

Write R Code to Explicitly Parallelize
Deploy Across Several Systems
foreach & iterators; doParallel (PC, server); doMPI (cluster); RRE rxExec
Example Uses: Bootstrapping, Simulation, HPC
Can Include CRAN Algorithms, Carefully
rodbc, rhdfs, rhbase, rhive
As with Previous: Still Free. Still Memory Based. Data Still Moves. Intel MKL with RRO.
Improvements: Parallelized Execution
Limitations: Parallelization Difficulty; Data Movement; Platform Specific
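As a rough illustration of what "explicit" parallelization means for the bootstrapping use case on this slide, here is a minimal Python sketch. It is an analogue of the foreach-style pattern, not Revolution Analytics code: the function names, the sample data, and the choice of a thread pool are all assumptions made for the example.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def one_replicate(task):
    """Resample the data with replacement and return the replicate's mean."""
    data, seed = task
    rng = random.Random(seed)  # one reproducible random stream per task
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def bootstrap_means(data, n_reps, map_fn=map):
    """Run n_reps independent replicates through whichever map is supplied.

    The caller decides how the work is distributed: the built-in map runs
    serially, while an executor's map spreads the tasks across workers.
    """
    tasks = [(data, seed) for seed in range(n_reps)]
    return list(map_fn(one_replicate, tasks))


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval taken from the sorted replicate values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical sample data, used only to exercise the sketch.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]

serial = bootstrap_means(data, 200)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = bootstrap_means(data, 200, map_fn=pool.map)

interval = percentile_interval(parallel)
```

Because every task carries its own seed, the serial and parallel runs return identical replicates. The programmer still has to split the work, seed each task, and collect the results, which is the "parallelization difficulty" the slide lists. For CPU-bound work in CPython a process pool would normally replace the thread pool; the thread pool is used here only to keep the sketch short.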

Execute R Code & CRAN Algorithms Inside Hadoop
Remote Desktop
Example Uses: Scoring, Transformation, Easily Parallelized Algorithms
R Code: rmapreduce, Hadoop Streaming, rhbase, rhdfs
Can Include CRAN Algorithms, Carefully
As With Previous: Still Free. Optional Intel MKL in RRO.
Improvements: Runs R in MapReduce; No Data Movement
Limitations: Manual Parallelization; Hadoop Specific
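Scoring fits MapReduce well because each record can be scored without looking at any other record. The sketch below imitates the map, shuffle, and reduce phases in a single Python process to show the shape of such a job. It does not use Hadoop or any R package, and the coefficients, record layout, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical coefficients standing in for a model trained elsewhere.
COEFFICIENTS = {"intercept": 0.5, "usage": 0.25}


def mapper(record):
    """Score one record and emit a (key, value) pair; needs no other records."""
    region, usage = record
    score = COEFFICIENTS["intercept"] + COEFFICIENTS["usage"] * usage
    yield region, score


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Combine all scores for one key into a mean score."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input records: (region, usage).
records = [("east", 2.0), ("west", 4.0), ("east", 6.0), ("west", 8.0)]
mean_scores = run_job(records)
```

In a real cluster the mapper would run on the nodes that hold the data blocks, which is where the "no data movement" improvement comes from. Writing the mapper and reducer by hand is the "manual parallelization" limitation.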

Traditional Beside Architecture with Optimized Algorithms
Available for Windows, Linux
As With Previous: Includes Intel MKL in RRO
Revolution R Enterprise: ScaleR PEMA Algorithms plus All of CRAN (subject to memory limits)
rodbc, rhdfs, rhbase, rhive
Advantages:
Speed: PEMAs Parallelize Across Threads, Cores & Sockets
Scale: PEMAs Chunk - no Memory Limits
All of CRAN Available
Portability
Fully Supported
Limitations: Data Movement; Single Machine

Revolution R Enterprise is the only big data big analytics platform based on open source R

Data Step: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort, Merge, Split; Aggregate by category (means, sums)
Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher's Exact Test; Student's t-test
Sampling: Subsample (observations & variables); Random Sampling
Predictive Models: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Classification & Regression Trees; Predictions/scoring for models; Residuals for all models
Variable Selection: Stepwise Regression
Simulation: Simulation (e.g. Monte Carlo); Parallel Random Number Generation
Cluster Analysis: K-Means
Classification: Decision Trees; Decision Forests; Gradient Boosted Decision Trees
Combination: PEMA-R API; rxDataStep; rxExec
New in 7.3

Script Calls ScaleR Algorithm; Scripts can call CRAN Open Source Algorithms
Master Algorithm Process: Start & Manage Processing; Combine Individual Results
ScaleR PEMA: Analyze Each Block; Load Block At A Time
Unlimited Data Scale: Data Not Limited to Available Memory; Ingests Data One Chunk At A Time; Adjustable Memory Footprint
Performance: Multi-Thread Execution; Highly-Optimized Algorithms; Algorithm Math Fully Refactored for Parallelism
Delivered as ScaleR Library in Revolution R Enterprise
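The chunk-and-combine idea on this slide can be sketched for simple linear regression. Each block is reduced to a handful of sums, the master step merges those sums, and the fit is computed from the merged totals, so memory use depends on the chunk size rather than the data size. This is a Python illustration of the general technique under assumed names and synthetic data, not the ScaleR implementation.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Sufficient statistics for simple linear regression on (x, y) pairs."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        """Combine two partial results; the order of merging does not matter."""
        return Moments(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                       self.sxx + other.sxx, self.sxy + other.sxy)


def analyze_block(block):
    """Worker step: reduce one block of rows to its partial sums."""
    m = Moments()
    for x, y in block:
        m.n += 1
        m.sx += x
        m.sy += y
        m.sxx += x * x
        m.sxy += x * y
    return m


def read_chunks(rows, chunk_size):
    """Yield successive blocks so only one block is held in memory at a time."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == chunk_size:
            yield block
            block = []
    if block:
        yield block


def fit_line(rows, chunk_size):
    """Master step: analyze each block, combine the results, solve for the fit."""
    total = Moments()
    for block in read_chunks(rows, chunk_size):
        total = total.merge(analyze_block(block))
    slope = (total.n * total.sxy - total.sx * total.sy) / (
        total.n * total.sxx - total.sx ** 2)
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


# Synthetic rows generated lazily: y = 2x + 1 for x = 0..999.
rows = ((float(x), 2.0 * x + 1.0) for x in range(1000))
slope, intercept = fit_line(rows, chunk_size=64)
```

Because `analyze_block` calls are independent and `merge` is associative, the blocks could equally be processed by separate threads or nodes and combined afterwards. Changing `chunk_size` adjusts the memory footprint without changing the result.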

(opt.) Thin Client or Remote Desktop
Fast Single-Server Alternative for Modest Data Scale
ScaleR + CRAN Algorithms; Edge Node
rodbc, rhdfs, rhbase, rhive; Local File System
As With Previous: Single Machine Execution; PEMA Scale & Speed (Single Machine); Use ScaleR + CRAN; Accelerate R with Intel MKL
Improvements: Easily Shared via No Data Movement; Develop on Desktop, Run on Edge Node
Limitations: Shorter Trip for Data

Fast Parallelized Analytics on Large Data Sets In Hadoop
Desktop & Server Tools and Applications; Web Services; DeployR
Remote Execution; jobtracker; ScaleR Algorithms
As With Previous: Speed and Scale of ScaleR PEMA Algorithms; Use CRAN Where Appropriate; Accelerate R Math with MKL; Custom Parallelized Algorithms
Advantages: Parallel Computation; No Data Movement; ScaleR PEMA Parallelization; Can Parallelize CRAN Carefully; Portable Coding
Limitations: Hadoop Workload Profiles

Test Cluster - 9 Nodes
1 Edge Node: 128GB, 24 cores each
2 Admin Nodes: 128GB, 24 cores each
9 Task Nodes: 64GB, 24 cores each

Task: Processing Time
Importing and Filtering Datasets from HDFS
14 Million Observations: 82 sec.
227 Million Observations: 310 sec.
Modeling and Estimation
1.2 M Correlations: 2771 sec.
Simple Linear Regression, 227 M Observations: 61 sec.
Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.
Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.
Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
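The two import timings can be turned into throughput figures with a few lines of arithmetic. The snippet below uses only the numbers on the slide, treating "14 million" and "227 million" as exact counts, which is an assumption since the slide rounds them.

```python
# (label, observations, seconds) taken from the import rows of the slide.
import_runs = [
    ("14 million observations", 14_000_000, 82),
    ("227 million observations", 227_000_000, 310),
]


def rows_per_second(rows, seconds):
    """Average import throughput over the whole run."""
    return rows / seconds


rates = {label: rows_per_second(rows, secs) for label, rows, secs in import_runs}

small, large = import_runs
data_growth = large[1] / small[1]   # how much more data the larger run read
time_growth = large[2] / small[2]   # how much longer the larger run took
```

On these figures the smaller import averages roughly 171 thousand observations per second and the larger one roughly 732 thousand. The data grows about 16 times while the time grows less than 4 times, which suggests a fixed startup cost that matters more for the smaller job.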

Maximized Flexibility, Performance & Workload Handling
Thin Client Development; Remote Execution; ScaleR Algorithms
Desktop & Server Tools and Applications; Web Services; RStudio; DeployR
As With Previous: Speed and Scale of ScaleR PEMA Algorithms; Use CRAN Where Appropriate; Accelerate R Math with MKL; Custom Parallelized Algorithms
Advantages: Flexibility for Blended Workloads; Little or No Data Movement; Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes

Where are the bulk of your skills? SAS? R? Java? Python? SQL?
Where do you build models today?
Do you have the skills to parallelize algorithms?
Can models be built on a big shared server?
How will you run models?
Do you have the budget to purchase commercial solutions?
How will your needs change over time?
What is your future architecture plan?
How risk averse is your management team regarding new platforms and open source?

Revolution Analytics Products: http://www.revolutionanalytics.com/products
http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
Whitepaper: Delivering Value from Big Data with Revolution R Enterprise and Hadoop: http://www.revolutionanalytics.com/whitepaper/delivering-value-big-datarevolution-r-enterprise-and-hadoop
Revolution Analytics on Social Media: http://blog.revolutionanalytics.com/

Thank you. www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR