Berkeley Data Analytics Stack (BDAS) Overview

Similar documents
Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Databricks Cloud. A Primer

5th Annual. Cloudera, Inc. All rights reserved.

Microsoft Azure Essentials

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase

BIG DATA AND HADOOP DEVELOPER

Stateful Services on DC/OS. Santa Clara, California April 23th 25th, 2018

Intro to Big Data and Hadoop

Building a Data Lake with Spark and Cassandra Brendon Smith & Mayur Ladwa

CS294: RISE Logistics, Overview, Trends

Cask Data Application Platform (CDAP)

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Transforming Analytics with Cloudera Data Science WorkBench

Apache Spark and R A (big data) love story?

Future Technologies. Mahidhar Tatineni User Services, SDSC Costa Rica Big Data School December 7, 2017

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

How In-Memory Computing can Maximize the Performance of Modern Payments

20775A: Performing Data Engineering on Microsoft HD Insight

20775 Performing Data Engineering on Microsoft HD Insight

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

20775A: Performing Data Engineering on Microsoft HD Insight

20775: Performing Data Engineering on Microsoft HD Insight

New Big Data Solutions and Opportunities for DB Workloads

COMP9321 Web Application Engineering

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden


DOWNTIME IS NOT AN OPTION

Spark and Hadoop Perfect Together

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

Big data is hard. Top 3 Challenges To Adopting Big Data

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Hortonworks Connected Data Platforms

DATA SCIENCE: HYPE AND REALITY PATRICK HALL

From Big Data to Fast Data. Sina Sheikholeslami

IBM Big Data Summit 2012

Big Data Cloud. Simple, Secure, Integrated and Performant Big Data Platform for the Cloud

Insights to HDInsight

Preface About the Book

Introduction to Streaming Analytics

Cloud Based Analytics for SAP

Hadoop Course Content

Common Customer Use Cases in FSI

Big Data Hadoop Administrator.

Advanced Analytics With Spark Patterns For Learning From Data At Scale

Hadoop in the Cloud. Ryan Lippert, Cloudera Product Cloudera, Inc. All rights reserved.

Engineering Unplugged: A Discussion With Pure Storage s Brian Gold on Big Data Analytics for Apache Spark

Spark, Hadoop, and Friends

Cloud-Scale Data Platform

Big Data The Big Story

Introduction to Stream Processing

Sr. Sergio Rodríguez de Guzmán CTO PUE

R and Hadoop. Ram Venkat Dawn Analytics

Architecture Overview for Data Analytics Deployments

MapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia

Adobe Deploys Hadoop as a Service on VMware vsphere

Cloudera, Inc. All rights reserved.

Got Hadoop? Whitepaper: Hadoop and EXASOL - a perfect combination for processing, storing and analyzing big data volumes

Cask Data Application Platform (CDAP) The Integrated Platform for Developers and Organizations to Build, Deploy, and Manage Data Applications

On Cloud Computational Models and the Heterogeneity Challenge

Welcome to. enterprise-class big data and financial a. Putting big data and advanced analytics to work in financial services.

Business is being transformed by three trends

Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved.

Cask Data Application Platform (CDAP) Extensions

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

SAP Machine Learning for Hadoop. Customer

White paper A Reference Model for High Performance Data Analytics(HPDA) using an HPC infrastructure

Cloudera Enterprise Data Hub Reference Architecture for Oracle Cloud Infrastructure Deployments O R A C L E W H I T E P A P E R J U N E

SAS & HADOOP ANALYTICS ON BIG DATA

HDInsight - Hadoop for the Commoner Matt Stenzel Data Platform Technical Specialist

Operational Hadoop and the Lambda Architecture for Streaming Data

Oracle's Cloud Strategie für den Geschäftserfolg Alles Neue von der OOW

Modernizing Your Data Warehouse with Azure

RAPIDS, FOSDEM 19. Dr. Christoph Angerer, Manager AI Developer Technologies, NVIDIA

IoT REALIZED CONNECTED CAR. Ian Huston Senior Data

Bringing the Power of SAS to Hadoop Title

Processing over a trillion events a day CASE STUDIES IN SCALING STREAM PROCESSING AT LINKEDIN

Saudi Journal of Engineering and Technology. DOI: /sjeat. ISSN (Print) Dubai, United Arab Emirates Website:

Simplifying Data Engineering to Accelerate Innovation

1% + 99% = AI Popularization

Forum on Analytics for Advanced Cancer Research Basel, 26 October 2016 Ted Slater Global Head, Healthcare & Life Sciences

Meetup DB2 LUW - Madrid. IBM dashdb. Raquel Cadierno Torre IBM 1 de Julio de IBM Corporation

Big Data and Apache Spark: A Review Abhishek Bhattacharya 1*, Shefali Bhatnagar 2

AZURE HDINSIGHT. Azure Machine Learning Track Marek Chmel

Reduce Money Laundering Risks with Rapid, Predictive Insights

High-Performance Computing (HPC) Up-close

StackIQ Enterprise Data Reference Architecture

INDUSTRY BRIEF THE ENTERPRISE DATA HUB IN FINANCIAL SERVICES: THREE CUSTOMER CASE STUDIES

Taking Advantage of Cloud Elasticity and Flexibility

Who is Databricks? Today, hundreds of organizations around the world use Databricks to build and power their production Spark applications.

Leveraging smart meter data for electric utilities:

Oracle Big Data Cloud Service

Leveraging smart meter data for electric utilities:

Hortonworks Data Platform

Transcription:

Berkeley Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY

What is Big used For? Reports, e.g., - Track business processes, transactions Diagnosis, e.g., - Why is user engagement dropping? - Why is the system slow? - Detect spam, worms, viruses, DDoS attacks Decisions, e.g., - Decide what feature to add - Decide what ad to show - Block worms, viruses, is only as useful as the decisions it enables

Processing Goals Low latency (interactive) queries on historical data: enable faster decisions - E.g., identify why a site is slow and fix it Low latency queries on live data (streaming): enable decisions on real-time data - E.g., detect & block worms in real-time (a worm may infect 1mil hosts in 1.3sec) Sophisticated data processing: enable better decisions - E.g., anomaly detection, trend analysis

Today s Open Analytics Stack..mostly focused on large on-disk datasets: great for batch but slow Application Processing Storage Infrastructure

Goals Batch One stack to rule them all! Interactive Streaming Easy to combine batch, streaming, and interactive computations Easy to develop sophisticated algorithms Compatible with existing open source ecosystem (Hadoop/HDFS)

Our Approach: Support Interactive and Streaming Comp. Aggressive use of memory Why? 1. Memory transfer rates >> disk or even SSDs - Gap is growing especially w.r.t. disk 10Gbps 128-512GB 40-60GB/s 2. Many datasets already fit into memory - The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory - E.g., 1TB = 1 billion records @ 1 KB each 3. Memory density (still) grows with Moore s law - RAM/SSD hybrid memories at horizon 16 cores 0.2-1GB/s (x10 disks) 10-30TB 1-4GB/s (x4 disks) 1-4TB High end datacenter node

Our Approach: Support Interactive and Streaming Comp. Increase parallelism Why? result - Reduce work per node à improve latency T Techniques: - Low latency parallel scheduler that achieve high locality - Optimized parallel communication patterns result (e.g., shuffle, broadcast) - Efficient recovery from failures and straggler mitigation T new (< T)

Our Approach: Support Interactive and Streaming Comp. Trade between result accuracy and response times Why? 128-512GB doubles every 18 months - In-memory processing does not guarantee interactive query processing - E.g., ~10 s sec just to scan 512 GB RAM! - Gap between memory capacity and transfer rate increasing 40-60GB/s Challenges: - accurately estimate error and running time for - arbitrary computations 16 cores doubles every 36 months

Our Approach Easy to combine batch, streaming, and interactive computations - Single execution model that supports all computation models Easy to develop sophisticated algorithms - Powerful Python and Scala shells - High level abstractions for graph based, and ML algorithms Compatible with existing open source ecosystem (Hadoop/HDFS) - Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume,..) - Support existing execution models (e.g., Hive, GraphLab)

Berkeley Analytics Stack (BDAS) Application New apps: AMP-Genomics, Carat, in-memory processing Processing trade between time, quality, and cost Storage Management Efficient data sharing across frameworks Resource Infrastructure Management Share infrastructure across frameworks (multi-programming for datacenters)

The Berkeley AMPLab lgorithms Launched January 2011: 6 Year Plan - 8 CS Faculty - ~40 students - 3 software engineers Organized for collaboration: achines eople

The Berkeley AMPLab Funding: - X, CISE Expedition Grant - Industrial, founding sponsors - 18 other sponsors, including Goal: Next Generation of Analytics Stack for Industry & Research: Berkeley Analytics Stack (BDAS) Release as Open Source

Berkeley Analytics Stack (BDAS) Application Processing Processing Processing Management Management Resource Management Resource Management Management Infrastructure Resource Management

Berkeley Analytics Stack (BDAS) Existing stack components. HIVE Pig Processing HBase Storm MPI Processing Hadoop Management HDFS Resource Management Resource

Mesos [Released, v0.9] Management platform that allows multiple framework to share cluster Compatible with existing open analytics stack Deployed in production at Twitter on 3,500+ servers HIVE Pig HBase Storm MPI Processing Hadoop HDFS Resource Mesos

[Release, v0.7] In-memory framework for interactive and iterative computations - Resilient Distributed set (RDD): fault-tolerance, in-memory storage abstraction Scala interface, Java and Python APIs HIVE Pig Storm MPI Processing Hadoop HDFS Resource Mesos

Community 3000 people attended online training in August 500+ meetup members 14 companies contributing spark- project.org

Streaming [Alpha Release] Large scale streaming computation Ensure exactly one semantics Integrated with à unifies batch, interactive, and streaming computations! Streaming HIVE Pig Storm MPI Processing Hadoop HDFS Resource Mesos

Shark [Release, v0.2] HIVE over : SQL-like interface (supports Hive 0.9) - up to 100x faster for in-memory data, and 5-10x for disk In tests on hundreds node cluster at Streaming Shark HIVE Pig Storm MPI Processing Hadoop HDFS Resource Mesos

& Shark available now on EMR!

Tachyon [Alpha Release, this Spring] High-throughput, fault-tolerant in-memory storage Interface compatible to HDFS Support for and Hadoop Streaming Shark HIVE Pig Storm MPI Processing Hadoop Tachyon HDFS Resource Mesos

BlinkDB [Alpha Release, this Spring] Large scale approximate query engine Allow users to specify error or time bounds Preliminary prototype starting being tested at Facebook BlinkDB Streaming Shark HIVE Pig Storm MPI Processing Hadoop Tachyon HDFS Resource Mesos

Graph [Alpha Release, this Spring] GraphLab API and Toolkits on top of Fault tolerance by leveraging BlinkDB Streaming Graph Shark HIVE Pig Storm MPI Processing Hadoop Tachyon HDFS Resource Mesos

MLbase [In development] Declarative approach to ML Develop scalable ML algorithms Make ML accessible to non-experts BlinkDB Streaming Graph MLbase Shark HIVE Pig Storm MPI Processing Hadoop Tachyon HDFS Resource Mesos

Compatible with Open Source Ecosystem Support existing interfaces whenever possible GraphLab API BlinkDB Hive Interface Streaming Graph MLbase Shark and Pig Shell HIVE Storm MPI Processing Hadoop Tachyon HDFS API HDFS Compatibility layer for Hadoop, Storm, MPI, etc to run over Mesos Resource Mesos

Compatible with Open Source Ecosystem Use existing interfaces whenever possible Accept inputs from Kafka, Flume, Twitter, Support Hive API TCP Sockets, BlinkDB Streaming Graph MLbase Shark HIVE Pig Storm MPI Processing Support HDFS API, Hadoop S3 API, and Hive metadata Tachyon HDFS Resource Mesos

Summary Holistic approach to address next generation of Big challenges! Support interactive and streaming computations - In-memory, fault-tolerant storage abstraction, low-latency scheduling,... Easy to combine batch, streaming, and interactive computations - execution engine supports all comp. models Batch Easy to develop sophisticated algorithms - Scala interface, APIs for Java, Python, Hive QL, - New frameworks targeted to graph based and ML algorithms Compatible with existing open source ecosystem Interactive Streaming Open source (Apache/BSD) and fully committed to release high quality software - Three-person software engineering team lead by Matt Massie (creator of Ganglia, 5 th Cloudera engineer)

What s Next? This tutorial: - Matei Zaharia: - Tathagata Das (TD): Streaming - Reynold Xin: Shark Afternoon tutorial: - Hands on with, Streaming, and Shark