Data Analytics with MATLAB
Adam Filion, Application Engineer, MathWorks
© 2015 The MathWorks, Inc.
Case Study: Day-Ahead Load Forecasting
Goal: Implement a tool for easy and accurate computation of the day-ahead system load forecast
Requirements:
- Acquire and clean data from multiple sources
- Accurate predictive model
- Easily deploy to production environment
Challenges with Data Analytics
- Aggregating data from multiple sources
- Cleaning data
- Choosing a model
- Moving to production
NYISO Energy Load Data
Techniques to Handle Missing Data
List-wise deletion
- Unbiased estimates (if values are missing completely at random)
- Reduces sample size
Implementation options
- Built in to many functions
- Manual filtering
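The "manual filtering" option can be sketched in a few lines. This is an illustrative Python sketch using only the standard library, not the tooling shown in the talk; the `(hour, load_MW, temperature_F)` rows and the helper names `is_missing` and `listwise_delete` are made up for the example.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def listwise_delete(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows if not any(is_missing(v) for v in row)]

# Hypothetical (hour, load_MW, temperature_F) observations.
observations = [
    (0, 5200.0, 41.0),
    (1, float("nan"), 40.5),   # load reading missing
    (2, 5050.0, None),         # temperature reading missing
    (3, 4990.0, 39.8),
]

complete = listwise_delete(observations)
print(len(observations), "rows before,", len(complete), "rows after")
```

Every row with at least one missing field is dropped, which is why the sample size shrinks.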
Techniques to Handle Missing Data
Substitution: replace missing data points with a reasonable approximation
- Easy to model
- Too important to exclude
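One possible "reasonable approximation" is linear interpolation between the nearest known neighbours. The sketch below is illustrative Python with a hypothetical hourly load series; the slide does not say which substitution method was used, so treat the choice of interpolation as an assumption.

```python
import math

def fill_linear(values):
    """Fill interior NaN runs by linear interpolation between the nearest
    known neighbours. Leading and trailing NaNs are left untouched."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled

nan = float("nan")
# Hypothetical hourly load readings (MW) with gaps.
hourly_load = [5000.0, nan, 5200.0, nan, nan, nan, 5600.0, nan]
filled_load = fill_linear(hourly_load)
print(filled_load)
```

The trailing gap stays missing because there is no right-hand neighbour to interpolate toward; extrapolating it would be a separate modeling decision.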
Merge Different Sets of Data
Join along a common axis
Popular joins:
- Inner
- Full Outer
- Left Outer
- Right Outer
[Figure: diagrams of Inner Join, Full Outer Join, Left Outer Join]
Full Outer Join

First Data Set
Key (A)   B
1         1.1
4         1.4
7         1.7
9         1.9

Second Data Set
Key (X)   Y     Z
1         0.1   0.2
3         0.3   0.4
5         0.5   0.6
7         0.7   0.8

Joined Data Set
Key   B     Y     Z
1     1.1   0.1   0.2
3     NaN   0.3   0.4
4     1.4   NaN   NaN
5     NaN   0.5   0.6
7     1.7   0.7   0.8
9     1.9   NaN   NaN
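The joined table can be reproduced with a small sketch. This is illustrative Python using plain dictionaries keyed on the join key; `full_outer_join` is a hypothetical helper written for this example, not a library function.

```python
nan = float("nan")

def full_outer_join(left, right, left_width, right_width):
    """Join two {key: tuple_of_values} tables on their keys, padding the
    side that lacks a key with NaN."""
    joined = {}
    for key in sorted(set(left) | set(right)):
        left_vals = left.get(key, (nan,) * left_width)
        right_vals = right.get(key, (nan,) * right_width)
        joined[key] = left_vals + right_vals
    return joined

# Data from the slide: Key -> (B,) and Key -> (Y, Z).
first = {1: (1.1,), 4: (1.4,), 7: (1.7,), 9: (1.9,)}
second = {1: (0.1, 0.2), 3: (0.3, 0.4), 5: (0.5, 0.6), 7: (0.7, 0.8)}

joined = full_outer_join(first, second, left_width=1, right_width=2)
for key, (b, y, z) in joined.items():
    print(key, b, y, z)
```

Taking the union of the keys is what makes this a full outer join; using the intersection instead would give an inner join.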
Learn More: Big Data with MATLAB
www.mathworks.com/discovery/big-data-matlab.html
www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
Machine Learning Characteristics and Examples
Characteristics
- Lots of variables
- System too complex to know the governing equation (e.g., black-box modeling)
Examples
- Pattern recognition (speech, images)
- Financial algorithms (credit scoring, algo trading)
- Energy forecasting (load, price)
- Biology (tumor detection, drug discovery)
Overview: Machine Learning
Type of Learning -> Categories of Algorithms
- Supervised Learning: develop predictive model based on both input and output data
  - Classification
  - Regression
- Unsupervised Learning: group and interpret data based only on input data
  - Clustering
Supervised Learning
Regression
- Neural Networks
- Decision Trees
- Ensemble Methods
- Non-linear Regression (GLM, Logistic)
- Linear Regression
Classification
- Support Vector Machines
- Discriminant Analysis
- Naive Bayes
- Nearest Neighbor
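As a minimal illustration of supervised regression, the simplest entry in the list (linear regression) can be fitted by ordinary least squares from paired inputs and outputs. The sketch is illustrative Python; the temperature and load figures are invented for the example and are not NYISO data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: temperature (deg F) -> system load (MW).
temps = [60.0, 70.0, 80.0, 90.0]
loads = [5000.0, 5400.0, 5800.0, 6200.0]

slope, intercept = fit_line(temps, loads)
predicted = slope * 85.0 + intercept  # forecast for an unseen input
print(slope, intercept, predicted)
```

The model is learned from both inputs (`temps`) and outputs (`loads`), which is what distinguishes supervised from unsupervised learning.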
Unsupervised Learning
Clustering
- k-means, Fuzzy C-Means
- Hierarchical
- Neural Networks
- Gaussian Mixture
- Hidden Markov Model
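To make the clustering idea concrete, here is a minimal k-means sketch (Lloyd's algorithm) in illustrative Python. It assumes 2-D points and seeds the centroids with the first k points so the run is deterministic; practical implementations use better initialization. The data points are invented.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points, seeded with the first k points."""
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments

# Hypothetical 2-D feature points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

No output labels are supplied; the grouping comes from the input data alone.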
Learn More: Machine Learning with MATLAB
mathworks.com/machine-learning
Deployment Highlights
Deployment targets:
- Database Servers
- Desktop Applications
- Hadoop / Big Data
- Spreadsheets
- Application Servers
- Client Applications
- Web Applications
- Batch/Cron Jobs
Highlights:
- Share with others who may not have MATLAB
- Royalty-free deployment
- Encryption to protect your intellectual property
Deploying Applications with MATLAB
MATLAB Compiler targets:
- Standalone Application
- Excel Add-in
- Hadoop
MATLAB Compiler SDK targets:
- C/C++
- Java
- .NET
- MATLAB Production Server
Summary:
- MATLAB Compiler for sharing MATLAB programs without integration programming
- MATLAB Compiler SDK provides implementation and platform flexibility for software developers
- MATLAB Production Server provides the most efficient development path for secure and scalable web and enterprise applications
Sharing Standalone Applications
[Diagram, steps 1-3: Application Author (MATLAB, Toolboxes) -> MATLAB Compiler -> Standalone Application / Excel Add-in / Hadoop -> End User (MATLAB Runtime)]
Integrating MATLAB-based Components
[Diagram, steps 1-4: Application Author (MATLAB, Toolboxes) -> MATLAB Compiler SDK -> C/C++ / Java / .NET / MATLAB Production Server -> Software Developer (MATLAB Runtime)]
Application author and software developer might be the same person
MATLAB Production Server
- Directly deploy analytic programs into production
  - Centrally manage multiple programs & MCR versions
  - Automatically deploy updates without server restarts
- Scalable & reliable
  - Service large numbers of concurrent requests
  - Add capacity or redundancy with additional servers
- Use with web, database & application servers
  - Lightweight client library isolates processing
  - Access programs using native data types
  - Integrates with Java, .NET, C and Python
[Diagram: Web Server(s) (HTML, XML, JavaScript) in front of Production Server(s)]
Deployed Analytics
[Diagram labels: MATLAB Desktop (train in MATLAB); Weather Data; Energy Data; Predictive Models; CTF; Web Server / Web Service; Web Application Server (Apache Tomcat); Request Broker; MATLAB Production Server instances]
Learn More: Application Deployment with MATLAB
www.mathworks.com/solutions/desktop-web-deployment/
Data Analytics Products
Workflow stages:
- Access and Explore Data
- Preprocess Data
- Develop Predictive Models
- Integrate Analytics with Systems
Products:
- Parallel Computing Toolbox, MATLAB Distributed Computing Server
- MATLAB Production Server
- Database Toolbox
- Statistics and Machine Learning Toolbox
- MATLAB Compiler
- Data Acquisition Toolbox
- Curve Fitting Toolbox
- Neural Network Toolbox
- MATLAB Compiler SDK
- Mapping Toolbox
- Signal Processing Toolbox
- Computer Vision System Toolbox
- Image Acquisition Toolbox
- Image Processing Toolbox
- Econometrics Toolbox
- OPC Toolbox
Legend: used in today's demo; additional Data Analytics products
Key Takeaways
- Data preparation can be a big job; leverage built-in tools and spend more time on the analysis
- Rapidly iterate through different predictive models, and find the one that's best for your application
- Leverage parallel computing to scale up your analysis to large datasets
- Eliminate the need to recode by deploying your algorithms into production