MacroBase: Prioritizing Attention in Fast Data. Peter Bailis DAWN Project, Stanford InfoLab

Similar documents
Data Informatics. Seon Ho Kim, Ph.D.

The Rise of Engineering-Driven Analytics. Richard Rovner VP Marketing

The Rise of Engineering-Driven Analytics

Managing explosion of data. Cloudera, Inc. All rights reserved.

ThingSpeak - IoT Platform with MATLAB Analytics

Tech Trends. Big Data, IOT, Security, Machine Learning, Search engines... Anant Asthana: github.com/anantasty

Confidential

MATLAB for Data Analytics The MathWorks, Inc. 1

COMP9321 Web Application Engineering

WIN BIG WITH GOOGLE CLOUD

Spotlight Sessions. Nik Rouda. Director of Product Marketing Cloudera, Inc. All rights reserved. 1

IoT Analytics. Martin Keseg Enterprise Account Manager 26/10/2017

Internet OF Things: Transformation In The Digital Era

Building a Robust Analytics Platform

SENPAI.

Enterprise-Scale MATLAB Applications

ARCHITECTURES ADVANCED ANALYTICS & IOT. Presented by: Orion Gebremedhin. Marc Lobree. Director of Technology, Data & Analytics

Big Data & Analytics for Wind O&M: Opportunities, Trends and Challenges in the Industrial Internet

Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget

AWS Smart Cities and

Data Science, realizing the Hype Cycle. Luigi Di Rito, Director Data Science Team, SAP Center of Excellence

IoT ANALYTICS IN THE ENTERPRISE WITH FUNL

ADVANCED ANALYTICS & IOT ARCHITECTURES

Oracle Fusion ERP Drives Scalable Engagement with Intelligent AI

Turn Data into Business Value

Digital Twin. David Kadleček, PhD., IBM Cognitive IoT

MapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia

SAP Predictive Analytics Suite

Integrating MATLAB Analytics into Enterprise Applications The MathWorks, Inc. 1

The Open IoT Stack: Architecture and Use Cases

Update on SAP Leonardo IoT. 8 th June 2017

What s new in MATLAB and Simulink

DOWNTIME IS NOT AN OPTION

Evaluation of Machine Learning Algorithms for Satellite Operations Support

Is Machine Learning the future of the Business Intelligence?

Intel Public Sector 3

Berkeley Data Analytics Stack (BDAS) Overview

NVIDIA AND SAP INDUSTRY CHALLENGES INTEGRATED SOLUTION

2015 The MathWorks, Inc. 1

Scaling up MATLAB Analytics with Kafka and Cloud Services

Deploying Info Clouds to Rapidly Deliver Actionable Information to Stakeholders. Brad Schmidt Business Development, Intergraph Canada

Safe Harbor Statement

AI IN THE ENTERPRISE: USING AI TO DELIVER VALUE FROM DATA YOU ALREADY HAVE

Intelligent Remote Services for Connected Cars. Tom Fleischmann July 27, 2017

The AI Car: Ramifications, Risks, & Opportunities

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Artificial Intelligence Definition, Opportunities and Challenges

What s New in MATLAB and Simulink

Advanced Analytics in Azure

Data. Does it Matter?

Enterprise Mobility Native Mobile Apps that Transform Business Processes and Boost Productivity

Novedades de las últimas versiones de MATLAB y Simulink

Uncovering the Hidden Truth In Log Data with vcenter Insight

TechValidate Survey Report. Converged Data Platform Key to Competitive Advantage

IBM Watson IoT Strategy

Cognitive IoT unlocking the data challenge

Ampliando MATLAB Analytics con Kafka y Servicios en la Nube

CASE STUDY Delivering Real Time Financial Transaction Monitoring

The Autonomous Data Platform A Data Driven Retail Customer Experience

Data Science is a Team Sport and an Iterative Process

IoT Standard Pack to Facilitate Visualization and Remote Monitoring of Industrial Devices and Equipment

PNDA.io: when big data and OSS collide

Mid-Atlantic CIO Forum

Cloudera Data Science and Machine Learning. Robin Harrison, Account Executive David Kemp, Systems Engineer. Cloudera, Inc. All rights reserved.

Spark and Hadoop Perfect Together

Amsterdam. (technical) Updates & demonstration. Robert Voermans Governance architect

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Bots, Outliers and Outages

Scaling up MATLAB Analytics with Kafka and Cloud Services

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

The Need For Speed: Fast Data Development Trends Insights from over 2,400 developers on the impact of Data in Motion in the real world

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

IOT Analytics and business assurance. Ericsson-wedo perspectives October 2017

Gartner's Top 10 Strategic Technology Trends for 2014

What s New in MATLAB and Simulink

Sharing and Deploying MATLAB Programs

Simplifying Data Engineering to Accelerate Innovation

5th Annual. Cloudera, Inc. All rights reserved.

Chris Nelson. Vice President Software Development. #PIWorld OSIsoft, LLC

MACHINE LEARNING IN THE OIL AND GAS INDUSTRY

Big Data Solution for Church & Dwight. Get the most out of your complex business data with Big Data and business intelligence

Insights-Driven Operations with SAP HANA and Cloudera Enterprise

INSIGHTS & BIG DATA. Data Science as a Service BIG DATA ANALYTICS

Microsoft Azure Essentials

Cask Data Application Platform (CDAP) Extensions

Developing a Strategy for Advancing Faster with Big Data Analytics

Active Analytics Overview

Architecting for Real- Time Big Data Analytics. Robert Winters

Envision technology conference

Smart Cities Brief. Amazon Web Services. July 2016

MapR Pentaho Business Solutions

What s New in MATLAB and Simulink

Microsoft reinvents sales processing and financial reporting with Azure

Using Lean Digital SM and Systems of Engagement TM to enable Intelligent Operations SM

VICE PRESIDENT, ARCHITECTURE GENERAL MANAGER, AI PRODUCTS GROUP - INTEL

Platform overview. Platform core User interfaces Health monitoring Report generator

Stateful Services on DC/OS. Santa Clara, California April 23th 25th, 2018

Predictive Maintenance with MATLAB and Simulink

Copyright 2012 EMC Corporation. All rights reserved.

Transcription:

MacroBase: Prioritizing Attention in Fast Data Peter Bailis DAWN Project, Stanford InfoLab

Stanford DAWN (Data Analytics for What s Next) Project Peter Bailis Streaming & Databases Chris Ré MacArthur Genius Databases + ML Kunle Olukotun Father of Multicore Domain Specific Languages Matei Zaharia Co-Creator of Spark and Mesos

It s the golden era of data * Incredible advances in image recognition, natural language processing, planning, info retrieval Society-scale impact: autonomous vehicles, personalized medicine, human trafficking No end in sight for advances in ML *for the best-funded, best-trained engineering teams

The DAWN Question What if anyone with domain expertise could build their own production-quality ML products? Without a PhD in machine learning Without being an expert in DB + systems Without understanding the latest hardware What s needed: end-to-end systems that cover the full user workflow

The DAWN Stack

Understanding Customer Interactions Given smart customer data streams (e.g., retail telemetry, purchasing decisions, customer service interactions), how can we fuse, filter, and aggregate data to deliver actionable customer, business insights? Our research: scalable ML-powered tools that prioritize analyst attention in large data streams

Customer Scale: Opportunity and Challenge Click impressions (10K-MMs / day) Retail interactions (10s-1000s video feeds) Purchasing history (years of data warehousing) External data sources (demographic, social media, 3 rd party) Opportunity: combining large, fast data sources boosts quality Challenge: efficient, cost- and time-effective analytics

After Big Data, Data Continues to Grow! Big Data systems (e.g., HDFS, S3, Kafka), cloud reduced storage costs Further, instrumentation of complex applications via sensors, processes, production telemetry has led to exploding data volumes e.g., today, Facebook, Twitter, LinkedIn collect 12M+ events/sec Percent (2016) 600 450 300 150 0 Projected Data Growth % (IDC) 2016 2017 2018 2019 2020 2021 Year

Challenge: Limited human attention Human attention is scarce! infeasible to manually inspect large volumes In practice: data only accessed for post-hoc root cause analyses top SV orgs say: < 6% data read MEMS & sensors: manufacturing, monitoring, IoT

Example: Cambridge Mobile Telematics Product: collect, analyze telemetry to improve driver behavior Question: do users enjoy the application on every platform?

over 24K different Android device types; 2x since 2013

Example: Cambridge Mobile Telematics Product: collect, analyze telemetry to improve driver behavior Question: is the application behaving well on every platform? Challenge: 24K different Android devices, 25 Major API releases spending even 1 second per combination requires 7 days IOS 9.0 beta 1 5 (but not 9.0.1) had a buggy BLE stack that prevented ios devices from connecting to devices.

Challenge: Limited human attention Human attention is scarce! infeasible to manually inspect large volumes In practice: data only accessed for post-hoc root cause analyses top SV orgs say: < 6% data read Needed: systems that automatically prioritize attention

MacroBase Architecture and Topics TRANSFORM CLASSIFY EXPLAIN Outliers {iphone6, Canada} {iphone6, USA} {iphone5, Canada} extract domain-specific signals e.g., identify data in tails find disproportionately correlated attributes Inliers {iphone6, USA} {iphone6, USA} {iphone5, USA} Key research question: how can building end-to-end systems improve scalability and result quality?

TRANSFORM CLASSIFY EXPLAIN MacroBase Architecture and Topics Key research Outliers {iphone6, Canada} {iphone6, USA} {iphone5, Canada} extract domain-specific signals e.g., identify data in tails find disproportionately correlated attributes Inliers {iphone6, USA} {iphone6, USA} {iphone5, USA} question: how can building end-to-end systems improve scalability and result quality? Note: 100x more inliers

MacroBase System: Prioritize Attention Execute cascades of operators that transform, filter, aggregate the stream dimensionality reduction, feature extraction, data fusion, streaming ETL segmentation, rule evaluation, supervised + unsupervised inference aggregation into interpretable outputs Must combine to reduce volume! Example Output: readings from Android Galaxy S5 devices running app version 52 are 30 times more likely than others to have abnormally high frequency [Bailis et al., CIDR 2017]

Early Production Usage automotive monitoring fleet QoS online services & datacenters (DevOps / monitoring) identifying slow containers, exception telemetry industrial manufacturing key sources of process variance in product mobile applications diagnosis of misconfigured platforms

Early Production Usage MacroBase discovered a rare issue with the CMT application and a device-specific battery problem. Consultation and investigation with the CMT team confirmed these issues as previously unknown [Bailis et al., SIGMOD 2017]

MacroBase Architecture and Topics extract domain-specific signals TRANSFORM CLASSIFY EXPLAIN Outliers {iphone6, Canada} {iphone6, USA} {iphone5, Canada} e.g., identify data in tails find disproportionately correlated attributes Inliers {iphone6, USA} {iphone6, USA} {iphone5, USA}

NoScope: 1000x Faster Video Extraction What if we want to extract higher-level features from complex streams, like video? Neural networks offer promise Problem: state-of-theart neural networks run ~30fps on $1K GPU [Kang et al., arxiv:1703.02529]

NoScope: 1000x Faster Video Extraction End result: Process up to 1000x more video streams for same processing cost Idea: use ideas from query optimization to speed NN evaluation 1. A difference detector to see if video has changed 2. Models that are specialized just in time to operate on a given feed 3. An optimizer to trade-off accuracy and speed [Kang et al., arxiv:1703.02529]

MacroBase and Customer Interactions Given smart customer data streams (e.g., retail telemetry, purchasing decisions, customer service interactions), how can we fuse, filter, and aggregate data to deliver actionable customer insights? Our research: scalable ML-powered systems that prioritize analyst attention in large data streams

The DAWN Stack

Conclusions Increasing data volumes demand new infrastructure for model training, fast inference, prioritizing human attention MacroBase: combine feature extraction, classification, explanation Major systems opportunity: efficient implementation at each stage, spanning query optimization, cardinality estimation, specialization Stanford DAWN Project: A new stack for next-gen analytics http://dawn.cs.stanford.edu/