IBM SPSS & Apache Spark
Making Big Data analytics easier and more accessible
ramiro.rego@es.ibm.com @foreswearer
2016 IBM Corporation
Modeler and Spark: Integration
- Infrastructure overview
  - Spark, Hadoop & IBM
  - Example architecture
- New with SPSS Modeler 18 (March 15, 2016)
  - New IBM capabilities: SPSS and Spark out of the box and via Analytic Server
- Demonstration of Spark functionality
  - Spark basic model
  - Monitor Spark jobs
- Creating custom models with SPSS
- Conversation / Debate / Discussion
- Appendix
  - IBM development communities and online resources
Part One INFRASTRUCTURE OVERVIEW
Analytics at Scale: Performance Matters
[Diagram labels: Parallel, In-Database, IBM BigInsights for Apache Hadoop]
- Optimized for Big Data environments
- Reduce network traffic
- Improve processing speed
- Reduce data movement: SQL pushback
- Optimize performance: in-database adapters
- Increase analytic flexibility: in-database mining
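The idea behind SQL pushback (reducing data movement by letting the database do the work) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the way SPSS Modeler generates SQL; the table `sales`, its columns, and the function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory table (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 100.0), ("north", 50.0), ("south", 75.0)],
    )
    return conn


def totals_without_pushback(conn):
    """Pull every row to the client and aggregate there (more data movement)."""
    totals = {}
    for region, revenue in conn.execute("SELECT region, revenue FROM sales"):
        totals[region] = totals.get(region, 0.0) + revenue
    return totals


def totals_with_pushback(conn):
    """Push the aggregation into the database; only one row per region comes back."""
    query = "SELECT region, SUM(revenue) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())
```

Both functions return the same totals; the second transfers one row per region instead of one row per sale, which is the kind of saving the slide attributes to pushback.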
Distributed analysis on modern data sources
- Comprehensive statistics and data mining on Hadoop-based systems
- Data-focused architecture assures scalability and performance
- The processing happens where the data resides
- Exploits Spark to run analytics faster & make users more productive
- Provides an extensible framework for augmenting analytics
- Provides access to Python libraries & Spark MLlib algorithms
- Efficiently deploys R models into distributed systems
- Abstracts analysts from the complexities of distributed big data systems
Meetup Big Data Developers - Madrid
Analytic Server and Apache Spark. Example
[Architecture diagram labels: Client Applications, Business applications, Analytical Servers, IBM SPSS Modeler Server, IBM Analytic Server, Application Server, IBM Modeler C&DS (Batch, Realtime), Hadoop, Database cluster]
- Spark is not necessarily invoked in a real-time scoring scenario where one record is scored at a time.
- Spark is invoked when MLlib algorithms are called OR when the data file size exceeds ~128 MB.
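The invocation rule stated on this slide can be written as a small predicate. This is only a sketch of the rule as worded here, not Analytic Server's actual decision logic; the function name, its parameters, and the exact byte threshold used for "~128 MB" are assumptions.

```python
# Assumption: "~128 MB" is taken as exactly 128 * 1024 * 1024 bytes for illustration.
SIZE_THRESHOLD_BYTES = 128 * 1024 * 1024


def spark_is_invoked(calls_mllib, data_size_bytes, threshold=SIZE_THRESHOLD_BYTES):
    """Return True when the slide's rule says Spark is used.

    Spark is invoked when an MLlib algorithm is called OR when the
    data file is larger than the threshold.
    """
    return calls_mllib or data_size_bytes > threshold
```

Under this rule, scoring a single small record with a non-MLlib model gives `spark_is_invoked(False, 1_000) == False`, which matches the note about real-time scoring.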
Part Two SPARK INTEGRATION IN SPSS
IBM SPSS Modeler
Discover key insights, patterns & trends in data to optimize decisions
- Predictive Analytics workbench
- Easy to use / visual
- Comprehensive set of algorithms
- Structured & unstructured data
- Supports the data mining process
- Outstanding performance & scalability
- Reproducible process delivering high productivity, quick time-to-solution & high ROI
- Brings repeatability to ongoing decision making
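Leverage Spark without Programming
A Derive node in IBM SPSS Analytic Server, in place of hand-written Spark code such as:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    JavaSparkContext spark = new JavaSparkContext();
    JavaRDD<String> file = spark.textFile("hdfs://...");
    // Assumes comma-separated rows; REVENUE and COST are hypothetical column indexes.
    Function<String, Double> derive = new Function<String, Double>() {
        public Double call(String row) {
            String[] fields = row.split(",");
            return Double.parseDouble(fields[REVENUE]) / Double.parseDouble(fields[COST]);
        }
    };
    JavaRDD<Double> derivedData = file.map(derive);

[Diagram labels: Task, Threads, Block Manager]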
Spark-Enabled Nodes in SPSS
- Model Building
- Record Operations
- Output
- Field Operations
- Graph
- Model Scoring
Spark MLlib Algorithms Accessible in Modeler
Model Type | Algorithm
Binary Classification | Linear SVM
Regression & Binary Classification | Gradient-Boosted Trees
Binary & Multiclass Classification* | Logistic Regression, Naïve Bayes
Regression & Binary, Multiclass Classification | Decision Trees, Random Forests
Regression | Linear Least Squares (Lasso, Ridge), Isotonic Regression
Recommender Engine | Collaborative Filtering (Alternating Least Squares)
Clustering | K-means, Gaussian Mixture, Power Iteration, Latent Dirichlet Allocation*
Dimension Reduction | Principal Component Analysis, Singular Value Decomposition*
Itemsets | Frequent Pattern Mining: FP-growth
Spark MLlib Algorithms Accessible in Modeler
There is some overlap between Spark MLlib algorithms and native SPSS Modeler nodes:
Spark MLlib | SPSS Modeler/AS Node
Linear SVM | LSVM
Logistic Regression | GLE
Random Forests | Random Trees
LASSO & Ridge Regression | GLE
Update 15 March: Modeler 18 is available
In version 18, these algorithms are available in Modeler with any type of data, i.e. there is no need to connect Modeler to an Analytic Server. The algorithms include:
- Random Trees, a popular method in the data science community that builds C&R Tree models with bagging (sampling records with replacement) and then considers only a random sample of the variables at each split of the tree
- Tree-AS, which is based on CHAID
- GLE, which incorporates a number of regression methods
- Linear-AS, which performs linear regression
- Linear Support Vector Machines
- Two-Step-AS clustering
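The two sampling steps behind Random Trees can be sketched in a few lines. This is an illustration of the general random-forest idea, not IBM's implementation: it assumes, as in standard random forests, that records are bootstrapped with replacement and that the candidate variables for a split are drawn without repetition. The function names and the toy data are hypothetical.

```python
import random


def bootstrap_sample(records, rng):
    """Bagging step: draw len(records) records with replacement."""
    return [rng.choice(records) for _ in records]


def candidate_variables(variables, k, rng):
    """Per-split step: pick k distinct candidate variables at random."""
    return rng.sample(variables, k)


# Toy data, for illustration only.
rng = random.Random(0)
records = list(range(10))
variables = ["age", "income", "tenure", "region"]

sample = bootstrap_sample(records, rng)              # training data for one tree
candidates = candidate_variables(variables, 2, rng)  # variables tried at one split
```

Each tree in the ensemble would get its own bootstrap sample, and each split inside a tree would draw a fresh set of candidate variables.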
SPSS Desktop unites HDFS with Spark MLlib
- Even simple models are more efficient with Spark
SPSS Desktop unites HDFS with Spark MLlib
- Collaborative filtering recommender calls an MLlib algorithm
- Train, test & score the model
- The entire stream runs in Spark
AS/Spark console monitors Spark job activity
Monitor Shows Mapping of Distributed Data
Part Three CUSTOM MODELS USING PYSPARK
Custom Dialog Builder adds Python for Spark support
- Data Scientists can create new Modeler nodes (extensions)
- Exploit algorithms from MLlib and other PySpark processes
- Also access other common Python libraries*, e.g. NumPy, SciPy, scikit-learn, pandas
Custom Dialog Builder: Python for Spark
- Nodes can be shared with non-programmer Data Scientists to democratize access to Spark capabilities
- Spark becomes usable for non-programmers, with code abstracted behind a GUI
DISCUSSION
APPENDIX
IBM Developer Community
http://ibmpredictiveanalytics.github.io
IBM Developer Community
https://github.com/ibmpredictiveanalytics/mllib-cf
https://github.com/ibmpredictiveanalytics/mllib-pageran
IBM Data Scientist Workbench
Notebooks and other tools aid Data Scientists in:
- Experimentation
- Data Visualization
- Desktop testing
With minimal time and easy setup
https://datascientistworkbench.com/
IBM AnalyticsZone
Developer-focused environment for exploration and testing
- Downloads and trial versions
- Visualization tools
- Extensions
- Industry-specific examples
www.analyticszone.com