Databricks Cloud: A Primer
Who is Databricks?

Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically simplify big data processing and free users to focus on turning their data into value. We do this through our product, Databricks Cloud, which is powered by Spark. For more information on Spark, download the Spark Primer.

[Figure: Data → Databricks Cloud → Value]

"The speed of Databricks Cloud and the power of Spark are unparalleled. Post implementation, we've been able to run complex monitoring over our entire dataset on an hourly basis in an automated manner. The value of that simple automation for my team alone is worth the investment. For the first time, we don't feel like we're three steps behind, with a fast and comprehensive monitoring system."
Gloria Lau, VP of Data, Timeful
What is Databricks Cloud?

Databricks Cloud is a hosted end-to-end data platform powered by Spark. It enables organizations to seamlessly transition from data ingest through exploration and production. Four foundational components comprise Databricks Cloud:

- Managed Spark Clusters
- Exploration and Visualization
- Production Pipelines
- Third-Party Apps

[Figure: The Foundational Components of Databricks Cloud]
Managed Spark Clusters

Fully managed Spark clusters in the cloud help enterprises focus on their data, not on operations.

- Easily Provision Clusters: Launch, dynamically scale up or down, and terminate clusters with just a few clicks. We automate management so you can focus on your data.
- Harness the Power of Spark: Clusters are configured and tuned by the people who built Spark.
- Import Data Seamlessly: Import data from S3, your local machine, or a wide variety of data sources, including HDFS, RDBMS, Cassandra, and MongoDB.

Exploration and Visualization

An interactive workspace for exploration and visualization lets users learn, work, and collaborate in a single, easy-to-use environment.

- Explore: Use interactive notebooks to write Spark commands in Python, Scala, or SQL, and reuse your favorite Python, Java, or Scala libraries.
- Collaborate: Work on the same notebook in real time, or send it around for offline collaboration.
- Visualize: Leverage a wide assortment of point-and-click visualizations, or use powerful scriptable options like matplotlib, ggplot, and D3.
- Publish: Build rich dashboards that present key findings to share with your colleagues and customers.
Production Pipelines

A production pipeline scheduler helps users get from prototype to production without re-engineering.

- Schedule Production Workflows: Schedule any existing notebook or locally developed Spark code to run periodically, using existing or newly provisioned clusters.
- Implement Complete Pipelines: Build production pipelines that span data import and ETL, complex conditional processing, and data export.
- Monitor Progress and Results: Set up custom alerts for job completion and failure, and easily view historical and in-progress results.

Third-Party Apps

A platform for powering Spark-based applications helps users leverage a growing ecosystem of applications and reuse their favorite tools.
What are some of the technical and operational bottlenecks data scientists, data engineers, and analysts face in their data pipelines?

Over the last few years, Spark has made great strides in helping enterprises overcome some of their big data processing challenges. However, many enterprises still struggle to extract value from their data pipelines. Capturing value from big data requires capabilities beyond data processing; enterprises are discovering many challenges on the journey to operationalizing their data pipelines:

1. Infrastructure issues require data teams to pre-provision, set up, and manage on-premises clusters, which is both costly and time consuming.
2. Once the infrastructure challenges have been addressed, data scientists and engineers still contend with siloed workspaces, where working with data, code, and visualizations requires switching between different software, and sharing work among peers means manually copying data.
3. Sharing insights with non-engineering stakeholders, and handing work off to the production team, remains difficult.
Problem: the journey is complex and costly.

- Get a cluster up and running: expensive to build and hard to manage.
- Import and explore data: disparate and difficult tools.
- Build a production pipeline: months of re-engineering to deploy.

[Figure: Your Data Pipeline: the journey is complex and costly]

In all this, enterprises must cobble various components together, which is not just highly inefficient but also makes it difficult to track data lineage and usage patterns across the various components within the stack. Under this model, enterprises cannot implement complete pipelines, which severely inhibits innovation and value creation.

Why Databricks Cloud?

Given the challenges data professionals and enterprises face in managing their data pipelines, we saw the need for a single platform that enables customers to easily deploy Spark as a service while providing a rich set of tools out of the box. Key attributes:

- Managed Spark Clusters in the Cloud
- Notebook Environment
- Production Pipeline Scheduler
- 3rd Party Applications
Our key differentiators are:

Unified Platform

With Databricks Cloud, enterprises can go from data ingestion through exploration and production on a single data platform. This significantly minimizes the integration pains they currently face when cobbling together multiple tools and systems, and helps streamline entire pipeline deployments. With a unified platform, data professionals can reuse their code base by utilizing the same notebooks for exploration and production, resulting in tremendous time savings.

Zero Management

Databricks Cloud provides powerful cluster management capabilities that allow users to create new clusters in seconds, dynamically scale them up and down, and share them across users. This obviates the need to set up and maintain clusters, so organizations do not need dedicated DevOps teams: their data teams can provision self-service Spark clusters and import their data seamlessly. This allows them to focus on their core mission, understanding and gaining insights from their data, rather than on day-to-day operations.

Real-Time

Databricks Cloud provides real-time capabilities in several dimensions:

1. The notebook feature allows users to perform interactive queries and visualize results in real time. This can dramatically increase productivity during exploration and yield additional insights.
2. The interactive workspace feature enables real-time collaboration among multiple users. Team members can seamlessly share code, plots, and results, leveraging each other's work far more effectively.
3. The streaming feature provides low-latency, fault-tolerant processing of continuous data streams. This enables organizations to rapidly take action in response to live data in real time.

Open Platform

Databricks Cloud is a platform for powering Spark-based applications and comes with a third-party API in addition to JDBC connectivity; because each cluster comes with a JDBC server, users can plug their favorite BI tools directly into their Databricks Cloud clusters. This enables users to reuse their favorite tools, leverage our growing application ecosystem, and maximize their investments and knowledge base, leading to improved time to value and productivity.
How are enterprises typically using Databricks Cloud?

Enterprises deploy Databricks Cloud to achieve a wide variety of objectives, including:

Data integration and transformation: Databricks Cloud is powered by Spark and can ingest data from a diverse set of sources with built-in connectors, then apply custom code to transform data into formats that are easier to process and query. The real-time interactive querying and data visualization capabilities of Databricks Cloud make this typically slow process much faster.

Product prototyping and deployment: Databricks Cloud allows teams to efficiently explore very large data sets and experiment with new product ideas through the interactive workspace. Advanced analytics libraries such as MLlib also provide an easy way for teams to deploy sophisticated algorithms in Spark. Once a prototype has been built, it can be seamlessly deployed in production, at scale, using the Jobs feature.

Internal or customer-facing business analytics: With Databricks Cloud, familiarity with SQL is sufficient to run real-time queries against large-scale data sets, for analyses ranging from user behavior to the customer funnel. Results and complex visualizations in Databricks Cloud can be easily exposed as customized dashboards for consumption with a few clicks.

Continuous monitoring: The high performance of Databricks Cloud and the Jobs feature enable automated and continuous monitoring of business-critical systems. Databricks Cloud can run complex pipelines that compute quality metrics and send notifications if human intervention is required.
How will Databricks Cloud benefit data professionals and enterprises?

Databricks Cloud helps data professionals and enterprises focus on finding answers in their data, building data products, and ultimately capturing the value promised by big data. Evaluate Databricks Cloud with a trial account now: databricks.com/registration

The platform delivers the following key benefits to data professionals and enterprises:

Higher productivity
- Fast computation in memory and on disk
- Real-time data exploration, visualization, and collaboration
- Focus on data analysis, not infrastructure
- Direct access to data for non-engineers
- Improved documentation of code and knowledge base

Faster deployment of data pipelines
- Instant Spark clusters
- Scale from small-scale exploration to large production deployments without re-engineering

Improved infrastructure cost efficiency
- Eliminate capital expenditure
- Eliminate infrastructure lifecycle maintenance costs
- Reduce DevOps overhead and associated costs
- Launch, scale, and terminate clusters to match data processing needs

"The fact that explorations by our data science team now take less than an hour, rather than days, has fundamentally changed how we ask questions and visualize changes to the index."
Darian Shirazi, CEO, Radius Intelligence