IBM SPSS & Apache Spark


Making Big Data analytics easier and more accessible
ramiro.rego@es.ibm.com @foreswearer
2016 IBM Corporation

Modeler and Spark: Integration
- Infrastructure overview
  - Spark, Hadoop & IBM
  - Example architecture
- New with SPSS Modeler 18 (March 15, 2016)
  - IBM new capabilities: SPSS and Spark out of the box and via Analytic Server
- Functionality demonstration of Spark
  - Spark basic model
  - Monitor Spark jobs
- Creating custom models with SPSS
- Conversation / Debate / Discussion
- Appendix
  - IBM development communities and online resources

Part One INFRASTRUCTURE OVERVIEW

Analytics at Scale: Performance Matters
[Diagram labels: Parallel; In-Database; IBM BigInsights for Apache Hadoop]
- Optimized for Big Data environments
- Reduce network traffic
- Improved processing speed
- Reduce data movement: SQL pushback
- Optimize performance: in-database adapters
- Increase analytic flexibility: in-database mining
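The "reduce data movement: SQL pushback" point can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in database; the `sales` table, its columns, and the function names are hypothetical and are not part of SPSS Modeler. The idea shown is only the general one: when the aggregation runs inside the database, fewer rows have to travel to the client.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical sales records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 100.0), ("north", 50.0), ("south", 70.0), ("south", 30.0)],
    )
    return conn


def totals_without_pushback(conn):
    """Pull every row to the client, then aggregate locally."""
    rows = conn.execute("SELECT region, revenue FROM sales").fetchall()
    totals = {}
    for region, revenue in rows:
        totals[region] = totals.get(region, 0.0) + revenue
    return totals, len(rows)


def totals_with_pushback(conn):
    """Let the database aggregate; only the summary rows travel."""
    rows = conn.execute(
        "SELECT region, SUM(revenue) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


conn = build_demo_db()
local_totals, rows_moved_local = totals_without_pushback(conn)
pushed_totals, rows_moved_pushed = totals_with_pushback(conn)
```

Both functions return the same totals; the second moves one row per region instead of one row per record, which is the effect the slide attributes to pushback.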

Distributed analysis on modern data sources
- Comprehensive statistics and data mining on Hadoop-based systems
- Data-focused architecture assures scalability and performance
- The processing happens where the data resides
- Exploits Spark to run analytics faster and make users more productive
- Provides an extensible framework for augmenting analytics
- Provides access to Python libraries and Spark MLlib algorithms
- Efficiently deploys R models into distributed systems
- Abstracts analysts from the complexities of distributed big data systems

Analytic Server and Apache Spark: Example (Meetup Big Data Developers - Madrid)
[Architecture diagram. Labels: IBM Client Applications (IBM SPSS Modeler, business applications); Analytical Servers (Modeler Server, IBM Analytic Server); Application Server (IBM C&DS); Batch; Realtime; Database; Hadoop cluster]
Spark is not necessarily invoked in a real-time scoring scenario where one record is scored at a time. Spark is invoked when MLlib algorithms are called or when the data file size exceeds ~128 MB.
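The invocation rule stated on this slide can be written down as a one-line predicate. This is only a sketch of the rule as worded on the slide, not Analytic Server's actual logic; the function name, the constant, and the reading of "~128" as 128 × 1024 × 1024 bytes are assumptions.

```python
# "~128" from the slide, read here as mebibytes; the exact cutoff is an assumption.
SPARK_SIZE_THRESHOLD_BYTES = 128 * 1024 * 1024


def spark_is_invoked(uses_mllib_algorithm, data_size_bytes):
    """Return True when the slide's rule says Spark would be invoked."""
    return uses_mllib_algorithm or data_size_bytes > SPARK_SIZE_THRESHOLD_BYTES


# Scoring a single small record with a native model: no Spark.
single_record = spark_is_invoked(False, 2 * 1024)
# Any MLlib algorithm triggers Spark regardless of data size.
mllib_small = spark_is_invoked(True, 2 * 1024)
# A large input file triggers Spark even without MLlib.
large_file = spark_is_invoked(False, 500 * 1024 * 1024)
```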

Part Two SPARK INTEGRATION IN SPSS

IBM SPSS Modeler
Discover key insights, patterns and trends in data to optimize decisions
- Predictive analytics workbench
- Easy to use / visual
- Comprehensive set of algorithms
- Structured and unstructured data
- Supports the data mining process
- Outstanding performance and scalability
- Reproducible process delivering high productivity, quick time-to-solution and high ROI
- Brings repeatability to ongoing decision making

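Leverage Spark without Programming
[Diagram labels: IBM SPSS Analytic Server; Derive; Task Threads; Block Manager]
A Derive node in Analytic Server stands in for hand-written Spark code such as the following Java fragment (statements only; the comma-separated "revenue,cost" line layout is an assumption):
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext spark = new JavaSparkContext();
    JavaRDD<String> file = spark.textFile("hdfs://...");
    // Derive one value per record: revenue divided by cost.
    JavaRDD<Double> derivedData = file.map(line -> {
        String[] fields = line.split(",");  // assumed layout: revenue,cost
        return Double.parseDouble(fields[0]) / Double.parseDouble(fields[1]);
    });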

Spark-Enabled Nodes in SPSS
- Model Building
- Record Operations
- Output
- Field Operations
- Graph
- Model Scoring

Spark MLlib Algorithms Accessible in Modeler
Model Type | Algorithm
Binary Classification | Linear SVM
Regression & Binary Classification | Gradient-Boosted Trees
Binary & Multiclass Classification* | Logistic Regression; Naïve Bayes
Regression & Binary, Multiclass Classification | Decision Trees; Random Forests
Regression | Linear Least Squares (Lasso, Ridge); Isotonic Regression
Recommender Engine | Collaborative Filtering (Alternating Least Squares)
Clustering | K-means; Gaussian Mixture; Power Iteration; Latent Dirichlet Allocation*
Dimension Reduction | Principal Component Analysis; Singular Value Decomposition*
Itemsets | Frequent Pattern Mining: FP-growth

Spark MLlib Algorithms Accessible in Modeler
There is some overlap between Spark MLlib algorithms and native SPSS Modeler nodes:
Spark MLlib | SPSS Modeler/AS Node
Linear SVM | LSVM
Logistic Regression | GLE
Random Forests | Random Trees
LASSO & Ridge Regression | GLE

Update, 15 March: Modeler 18 is available
In version 18, these algorithms are available in Modeler with any type of data; there is no need to connect Modeler to an Analytic Server. The algorithms include:
- Random Trees: a popular method in the data science community that builds C&R Tree models with bagging (sampling with replacement) and then considers only a sample of the variables at each split of the tree
- Tree-AS: based on CHAID
- GLE: incorporates a number of regression methods
- Linear-AS: performs linear regression
- Linear Support Vector Machines
- Two-Step-AS: clustering
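The two sampling ideas behind Random Trees, as described on this slide, can be sketched in a few lines. This is an illustration of the general random-forest sampling scheme only, not the Modeler node's implementation: the function names and variable names are hypothetical, no tree is actually grown, and drawing the candidate variables as distinct names is an assumption.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Bagging step: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_variables(variables, k, rng):
    """Split step: pick k distinct variables to evaluate at one split."""
    return rng.sample(variables, k)


def plan_trees(n_trees, n_rows, variables, k, seed=0):
    """Produce the row sample and root-split candidates for each tree."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_trees):
        plans.append({
            "rows": bootstrap_rows(n_rows, rng),
            "root_split_candidates": candidate_variables(variables, k, rng),
        })
    return plans


plans = plan_trees(
    n_trees=3,
    n_rows=10,
    variables=["age", "income", "tenure", "region"],
    k=2,
)
```

Each tree sees a different resampled set of records, and each split may only choose among a subset of variables, which is what makes the individual trees differ from one another.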


SPSS Desktop unites HDFS with Spark MLlib
- Even simple models are more efficient with Spark

SPSS Desktop unites HDFS with Spark MLlib
- Collaborative filtering recommender calls an MLlib algorithm
- Train, test and score the model
- Entire stream runs in Spark
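The collaborative filtering recommender mentioned here relies on alternating least squares (ALS). The idea can be sketched in plain Python with a single latent factor per user and per item. This is a toy illustration of the alternating scheme, not MLlib's implementation; the function name, the ratings, and the regularization value are all made up for the example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=30):
    """Fit ratings[(u, i)] ~= user_f[u] * item_f[i] by alternating least squares."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; each user factor has a closed-form solution.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Fix user factors; solve for each item factor the same way.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


# Hypothetical ratings keyed by (user, item); user 2 has not rated item 1.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)

# Predicted rating for the unseen (user 2, item 1) pair.
predicted = user_f[2] * item_f[1]
```

Because the toy ratings follow a simple multiplicative pattern, the prediction for the missing pair lands close to 6, the value that pattern implies.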

AS/Spark console monitors Spark job activity

Monitor Shows Mapping of Distributed Data

Part Three CUSTOM MODELS USING PYSPARK

Custom Dialog Builder adds Python for Spark support
- Data scientists can create new Modeler nodes (extensions)
- Exploit algorithms from MLlib and other PySpark processes
- Also access other common Python libraries*, e.g. NumPy, SciPy, scikit-learn, pandas

Custom Dialog Builder: Python for Spark
- Nodes can be shared with non-programmer data scientists to democratize access to Spark capabilities
- Spark becomes usable for non-programmers, with code abstracted behind a GUI

DISCUSSION

APPENDIX

IBM Developer Community
http://ibmpredictiveanalytics.github.io

IBM Developer Community
https://github.com/ibmpredictiveanalytics/mllib-cf
https://github.com/ibmpredictiveanalytics/mllib-pageran

IBM Data Scientist Workbench
Notebooks and other tools aid data scientists, with minimal time and easy setup, in:
- Experimentation
- Data visualization
- Desktop testing
https://datascientistworkbench.com/

IBM AnalyticsZone
Developer-focused environment for exploration and testing
- Downloads and trial versions
- Visualization tools
- Extensions
- Industry-specific examples
www.analyticszone.com