CIS 602-01: Scalable Data Analysis Cloud Computing Dr. David Koop
Data Science Tasks TASKS (major involvement only) 49% CREATING VISUALIZATIONS 43% 43% FEATURE EXTRACTION 47% IDENTIFYING BUSINESS PROBLEMS TO BE SOLVED WITH ANALYTICS DEVELOPING PROTOTYPE MODELS 39% ORGANIZING AND GUIDING TEAM PROJECTS 36% IMPLEMENTING MODELS/ ALGORITHMS INTO PRODUCTION 32% COLLABORATING ON CODE PROJECTS (READING/EDITING OTHERS' CODE, USING GIT) 31% TEACHING/TRAINING OTHERS 30% PLANNING LARGE SOFTWARE PROJECTS OR DATA SYSTEMS 30% DEVELOPING DASHBOARDS 53% DATA CLEANING 58% COMMUNICATING FINDINGS TO BUSINESS DECISION-MAKERS 61% CONDUCTING DATA ANALYSIS TO ANSWER RESEARCH QUESTIONS 69% BASIC EXPLORATORY DATA ANALYSIS 5% 28% COMMUNICATING WITH PEOPLE OUTSIDE YOUR COMPANY 20% DEVELOPING DATA ANALYTICS SOFTWARE 19% 19% DEVELOPING PRODUCTS THAT DEPEND ON REAL-TIME DATA ANALYTICS USING DASHBOARDS AND SPREADSHEETS (MADE BY OTHERS) TO MAKE DECISIONS 29% ETL 24% SETTING UP / MAINTAINING DATA PLATFORMS DEVELOPING HARDWARE (OR WORKING ON SOFTWARE PROJECTS THAT REQUIRE EXPERT KNOWLEDGE OF HARDWARE) [O'Reilly] 2
Programming Languages PROGRAMMING LANGUAGES 9% 9% C++ 13% VISUAL BASIC / VBA 17% JAVASCRIPT 18% JAVA 24% BASH 54% PYTHON 57% R 70% SQL SHARE OF RESPONDENTS MATLAB 8% SCALA 8% C 8% C# 5% PERL 5% SAS 3% RUBY SALARY MEDIAN AND IQR (US DOLLARS) SQL R Python Bash Java JavaScript Visual Basic/VBA C++ Matlab Scala C C# Perl SAS Ruby Octave Go 2% OCTAVE 1% GO 0 50K 100K 150K 200K Range/Median Languages [O'Reilly] 3
Spreadsheet & Business Intelligence Tools SPREADSHEETS, BI, REPORTING 4% 5% SPOTFIRE ORACLE BI 6% COGNOS 6% BUSINESSOBJECTS 7% QLIKVIEW 8% POWER BI 10% POWERPIVOT SHARE OF RESPONDENTS 69% EXCEL 3% PENTAHO 3% ADOBE ANALYTICS 3% MICROSTRATEGY 2% ALTERYX 1% 1% DATAMEER JASPERSOFT SALARY MEDIAN AND IQR (US DOLLARS) Excel PowerPivot Power BI QlikView BusinessObjects Cognos Oracle BI Spotfire Pentaho Adobe Analytics Microstrategy Alteryx Jaspersoft Datameer 30K 60K 90K 120K 150K Range/Median Spreadsheets, BI, reporting [O'Reilly] 4
Trends in Data Analysis Tool Use [KDNuggets] 5
Python Tools matplotlib: - Python's "base" visualization library - Originally mimicked the functions in matlab - Integrated with pandas.plot calls - Jupyter Notebook: %matplotlib inline or notebook - seaborn adds extras on top of matplotlib scikit-learn - Machine learning library - Fit and predict using models - Classification and regression 6
Reading Response Due Thursday (10/12) before class Read Reiss et al. Short Summary + Critique Questions: - Do you agree with the conclusions? - What explanations could there be for the results? - Does the paper suggest improvements in cloud computing based on workload analysis? 7
Assignment 2 http://www.cis.umassd.edu/~dkoop/ cis602-2017fa/assignment2.html New York City Trees - 680,000+ trees - Use WebGL for visualization - Use a Python bridge (mapboxgl) - Use the fork! - Smaller data versions available - Keep using pandas - Label subproblems and answers - Due Thursday, October 19 8
Scaling Up PC [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 9
Scaling Up PC Server [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 9
Scaling Up PC Server Cluster [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 9
Scaling Up PC Server Cluster Data center [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 9
Scaling Up PC Server Cluster Data center Network of data centers [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 9
Scale Problems 1.Difficult to dimension - Load can vary considerably - Waste resources of lose customers 2.Expensive - Hardware costs - Personnel costs - Maintenance costs 3.Difficult to scale - scaling up (new machines, new buildings) - scaling down (energy, fixed costs) [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 10
Power Plant to Cloud Analogy Power source Power source Power source directly connected Network Meter Customer [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 11
Cloud Computing: State-of-the-art and Research Challenges Q. Zhang, L. Cheng, and R. Boutaba
Cloud Computing Features No up-front investment Lowering operating cost Highly scalable Easy access Reducing business risks and maintenance expenses 13
Everything as a Service Software as a service (SaaS) [Restaurant] Platform as a service (PaaS) [Take-out food] Infrastructure as a service (Iaas) [Grocery] User Application Middleware Hardware Cloud provider [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 14
Cloud Computing Architecture [Zhang et al., 2010] 15
Business Model IaaS and PaaS often part of the same organization Businesses leveraging subscription models to support cloud services [Zhang et al., 2010] 16
Types of Clouds Public: - Everyone can use - Lack fine-grained control over data, network, security settings Private (aka internal): - May be homegrown or external - Offer more control - Require up-front capital Hybrid: - Keep parts secure (e.g. accounting, finances) - Other pieces can leverage public cloud resources - Have to make determination of what goes where 17
Data Center Network Infrastructure [Zhang et al., 2010] 18
Virtualization Bob Virtual machine monitor Charlie Alice Physical machine Daniel Hypervisor controls access to resources Flexibility for provider Secure and isolated Performance may be hard to predict D. Koop, CIS 602-02, Fall 2015 [Haeberlen and Ives, 2015] 19
Cloud Comparison Cloud Provider Amazon EC2 Windows Azure Google App Engine Classes of Utility Computing Infrastructure service Platform service Platform service Target Applications General-purpose applications General-purpose Windows applications Traditional web applications with supported framework Computation OS Level on a Xen Virtual Machine Microsoft Common Language Runtime (CLR) VM; Predefined roles of app. instances Predefined web application frameworks Storage Elastic Block Store; Amazon Simple Storage Service (S3); Amazon SimpleDB Azure storage service and SQL Data Services BigTable and MegaStore Auto Scaling Automatically changing the number of instances based on parameters that users specify Automatic scaling based on application roles and a configuration file specified by users Automatic Scaling which is transparent to users 20
Cloud Challenges and Opportunities Availability: What happens to my business if there is an outage in the cloud? Data lock-in: How do I move my data from one cloud to another? Data confidentiality and auditability: How do I make sure that the cloud doesn't leak my confidential data? Data transfer bottlenecks: How do I copy large amounts of data from/to the cloud? Performance unpredictability: VMs sharing the same disk [Haeberlen and Ives, 2015] D. Koop, CIS 602-02, Fall 2015 21
Cloud Challenges and Opportunities Scalable storage: Cloud model (short-term usage, no up-front cost, infinite capacity on demand) does not fit persistent storage well Bugs in large distributed systems: Many errors cannot be reproduced in smaller configs Scaling quickly: Dealing with boot times and idle power Reputation fate sharing: One customer's bad behavior can affect the reputation of others using the same cloud (e.g. spam, FBI raids) Software licensing: Licenses tied to computers, how to scale? D. Koop, CIS 602-02, Fall 2015 22
Cloud Comparer 23
Cloud Adoption 24
Cloud Strategy 25
Cloud Tools 26
Cloud Challenges 27
Cloud Initiatives 28
Public Cloud Adoption 29
Public Cloud Change 30
Private Cloud Adoption 31
Wasted Cloud Spending 32
Amazon Web Services 101 I. Massingham
Reading Response Due Thursday (10/12) before class Read Reiss et al. Short Summary + Critique Questions: - Do you agree with the conclusions? - What explanations could there be for the results? - Does the paper suggest improvements in cloud computing based on workload analysis? 34