Databricks Cloud. A Primer


Who is Databricks?

Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically simplify big data processing and free users to focus on turning their data into value. We do this through our product, Databricks Cloud, which is powered by Spark. For more information on Spark, download the Spark Primer.

[Figure: Data, Databricks Cloud, Value]

"The speed of Databricks Cloud and the power of Spark are unparalleled. Post implementation, we've been able to run complex monitoring over our entire dataset on an hourly basis in an automated manner. The value of that simple automation for my team alone is worth the investment. For the first time, we don't feel like we're three steps behind with a fast and comprehensive monitoring system."
Gloria Lau, VP of Data, Timeful

What is Databricks Cloud?

Databricks Cloud is a hosted end-to-end data platform powered by Spark. It enables organizations to seamlessly transition from data ingest through exploration and production. Four foundational components make up Databricks Cloud:

Managed Spark Clusters
Exploration and Visualization
Production Pipelines
Third-Party Apps

[Figure: The Foundational Components of Databricks Cloud]

Managed Spark Clusters

Fully managed Spark clusters in the cloud that help enterprises focus on their data and not operations.

Easily Provision Clusters: Launch, dynamically scale up or down, and terminate clusters with just a few clicks. We automate management so you can focus on your data.
Harness the Power of Spark: Configured and tuned by the people who built it.
Import Data Seamlessly: Import data from S3, your local machine, or a wide variety of data sources, including HDFS, RDBMS, Cassandra, and MongoDB.

Exploration and Visualization

An interactive workspace for exploration and visualization so users can learn, work, and collaborate in a single, easy-to-use environment.

Explore: Use interactive notebooks to write Spark commands in Python, Scala, or SQL and reuse your favorite Python, Java, or Scala libraries.
Collaborate: Work on the same notebook in real time or send it around for offline collaboration.
Visualize: Leverage a wide assortment of point-and-click visualizations, or use powerful scriptable options like matplotlib, ggplot, and D3.
Publish: Build rich dashboards that present key findings to share with your colleagues and customers.

Production Pipelines

A production pipeline scheduler that helps users get from prototype to production without re-engineering.

Schedule Production Workflows: Schedule any existing notebook or locally developed Spark code to run periodically using existing or newly provisioned clusters.
Implement Complete Pipelines: Build production pipelines that span data import and ETL, complex conditional processing, and data export.
Monitor Progress and Results: Set up custom alerts for job completion and failure, and easily view historical and in-progress results.

Third-Party Apps

A platform for powering Spark-based applications that helps users leverage a growing ecosystem of applications and reuse their favorite tools.
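The pipeline shape described under Production Pipelines (data import and ETL, a conditional step, data export, and alerts on completion or failure) can be sketched in plain Python. This is only an illustration of the flow and does not use the Databricks Cloud scheduler or any Spark API; every name in it (import_data, etl, run_pipeline, the alert callback) is hypothetical.

```python
from typing import Callable, Dict, List


def import_data() -> List[Dict[str, str]]:
    # Hypothetical stand-in for reading from a real source such as S3 or HDFS.
    return [
        {"user": "a", "amount": "10.5"},
        {"user": "b", "amount": "oops"},  # malformed row, dropped during ETL
        {"user": "c", "amount": "4.5"},
    ]


def etl(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    # Keep only rows whose amount parses as a number.
    clean = []
    for row in rows:
        try:
            clean.append({"user": row["user"], "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue
    return clean


def run_pipeline(
    source: Callable[[], List[Dict[str, str]]],
    sink: Callable[[Dict[str, object]], None],
    alert: Callable[[str], None],
    min_rows: int = 1,
) -> bool:
    """Run import, ETL, a conditional check, and export; alert on the outcome."""
    try:
        rows = etl(source())
        # Conditional processing: refuse to export if too little data survived.
        if len(rows) < min_rows:
            raise RuntimeError(
                f"only {len(rows)} valid rows, expected at least {min_rows}"
            )
        sink({"rows": len(rows), "total": sum(r["amount"] for r in rows)})
    except Exception as exc:
        alert(f"FAILED: {exc}")
        return False
    alert("SUCCEEDED")
    return True


exported: list = []
alerts: list = []
ok = run_pipeline(import_data, exported.append, alerts.append)
```

In a hosted scheduler, the source and sink would be real connectors and the alert callback would send an email or similar notification; here they are lists so the run can be inspected directly.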

What are some of the technical and operational bottlenecks faced by data scientists, data engineers, and analysts with their data pipeline?

Over the last few years, Spark has made great strides in helping enterprises overcome some of their big data processing challenges. However, many enterprises are still struggling to extract value from their data pipelines. Capturing value from big data requires capabilities beyond data processing, and enterprises are finding many challenges in their journey to operationalize their data pipeline:

1. Infrastructure issues require data teams to pre-provision, set up, and manage on-premises clusters, which is both costly and time consuming.
2. Once the infrastructure challenges have been addressed, data scientists and engineers still have to contend with siloed workspaces, where working with data, code, and visualization requires switching between different software, and sharing work amongst peers means manually copying data.
3. Sharing insights with non-engineering stakeholders and handing work off to the production team are additional hurdles.

Problem: the journey is complex and costly.

[Figure: Your Data Pipeline: the journey is complex and costly]
Get a cluster up and running: expensive to build and hard to manage
Import and explore data: disparate and difficult tools
Build a production pipeline: months of re-engineering to deploy

In all this, enterprises are required to cobble various components together, which is not only highly inefficient but also makes it difficult to track data lineage and usage patterns across the components of the stack. With this model, enterprises are not able to implement complete pipelines, which severely inhibits innovation and value creation.

Why Databricks Cloud?

Given the challenges faced by data professionals and enterprises in managing their data pipeline, we saw the need for a single platform that enables customers to easily deploy Spark as a service while providing a rich set of tools out of the box.

Key attributes:

Managed Spark Clusters in the Cloud
Notebook Environment
Production Pipeline Scheduler
3rd Party Applications

Our key differentiators are:

Unified Platform

With Databricks Cloud, enterprises are able to go from data ingestion through exploration and production on a single data platform. This significantly reduces the integration pains they currently face when cobbling together multiple tools and systems, and helps streamline entire pipeline deployments. With a unified platform, data professionals are able to reuse their code base by utilizing the same notebooks for exploration and production, resulting in tremendous time savings.

Zero Management

Databricks Cloud provides powerful cluster management capabilities that allow users to create new clusters in seconds, dynamically scale them up and down, and share them across users. This obviates the need to set up and maintain the clusters. As a result, organizations do not need dedicated DevOps teams; their data teams can provision Spark clusters on a self-service basis and import their data seamlessly. This allows them to focus on their core mission: understanding and gaining insights from their data, not managing day-to-day operations.

Real-Time

Databricks Cloud provides real-time capabilities in several dimensions.

1. The notebook feature allows users to perform interactive queries and visualize results in real time. This can dramatically increase their productivity when performing explorations and help them gain additional insights.
2. The interactive workspace feature enables real-time collaboration amongst multiple users. Team members can seamlessly share code, plots, and results, leveraging each other's work far more effectively.
3. The streaming feature provides low-latency and fault-tolerant processing of continuous data streams. This enables organizations to rapidly take action in response to live data.

Open Platform

Databricks Cloud is a platform for powering Spark-based applications and comes with a third-party API in addition to JDBC connectivity. Because each cluster comes with a JDBC server, users can plug their favorite BI tools directly into their Databricks Cloud clusters. This enables users to reuse their favorite tools, leverage our growing application ecosystem, and maximize their investments and knowledge base, leading to improved time to value and productivity.

How are enterprises typically using Databricks Cloud?

Enterprises deploy Databricks Cloud to achieve a wide variety of objectives, including:

Data integration and transformation

Databricks Cloud is powered by Spark and can ingest data from a diverse set of sources with built-in connectors and apply custom code to transform data into formats that are easier to process and query. The real-time interactive querying and data visualization capability of Databricks Cloud makes this typically slow process much faster.

Product prototyping and deployment

Databricks Cloud allows teams to efficiently explore very large data sets and experiment with new product ideas through the interactive workspace. Advanced analytics libraries such as MLlib also provide an easy way for teams to deploy sophisticated algorithms in Spark. Once a prototype has been built, one can seamlessly deploy it in production and at scale using the Jobs feature.

Internal or customer-facing business analytics

With Databricks Cloud, familiarity with SQL is sufficient to run real-time queries against large-scale data sets for analysis ranging from user behavior to customer funnel. Results and complex visualizations in Databricks Cloud can be exposed as customized dashboards for consumption with a few clicks.

Continuous monitoring

The high performance of Databricks Cloud and the Jobs feature enable automated and continuous monitoring of business-critical systems. Databricks Cloud can set up complex pipelines to compute quality metrics and send notifications if human intervention is required.
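The continuous monitoring use case (compute a quality metric on each run, notify someone if it needs attention) can be sketched as follows. This is a minimal plain-Python illustration, not the Jobs feature or any Databricks Cloud API; the metric (share of missing values), the threshold, and the names null_rate, check_quality, and notify are all assumptions chosen for the example.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def null_rate(rows: List[Row], field: str) -> float:
    # Fraction of rows where the field is missing or None.
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)


def check_quality(
    rows: List[Row],
    field: str,
    max_null_rate: float,
    notify: Callable[[str], None],
) -> bool:
    """Return True if the batch passes; otherwise call notify and return False."""
    rate = null_rate(rows, field)
    if rate > max_null_rate:
        notify(f"{field}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
        return False
    return True


# Hypothetical hourly batch: two of four rows lack an email address.
batch = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": None},
    {"id": "3", "email": "c@example.com"},
    {"id": "4", "email": None},
]
notifications: List[str] = []
passed = check_quality(batch, "email", 0.25, notifications.append)
```

A scheduled job would run a check like this on every new batch, with notify wired to whatever alerting channel the team uses.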

How will Databricks Cloud benefit data professionals and enterprises?

Databricks Cloud helps data professionals and enterprises focus on finding answers from their data, building data products, and ultimately capturing the value promised by big data.

The platform delivers the following key benefits to data professionals and enterprises:

Higher productivity
  Fast computation in-memory and on disk
  Real-time data exploration, visualization, collaboration
  Focus on data analysis, not infrastructure
  Provide non-engineers direct access to data
  Improved documentation of code and knowledge base

Faster deployment of data pipelines
  Instant Spark clusters
  Scale from small-scale exploration to large production deployments without re-engineering

Improved infrastructure cost efficiency
  Eliminate capital expenditure
  Eliminate infrastructure lifecycle maintenance costs
  Reduce DevOps overhead and associated costs
  Launch, scale, and terminate clusters to match data processing needs

Evaluate Databricks Cloud with a trial account now: databricks.com/registration

"The fact that explorations by our data science team now take less than an hour, rather than days, has fundamentally changed how we ask questions and visualize changes to the index."
Darian Shirazi, CEO, Radius Intelligence

databricks_primer_150417