IBM Analytics Unleash the power of data with Apache Spark


Agility, speed and simplicity define the analytics operating system of the future

1 Use Spark to create value from data-driven insights
2 Lower the barrier for building data products and analytics applications
3 The art of the possible with Spark
4 Why IBM and Spark?

Use Spark to create value from data-driven insights

Data matters more than ever to business success. But value does not come from data alone. Rather, it comes from the insights enabled by data. No matter what your role is, or where you are in your data journey, you are looking for ways to drive innovation. Within today's massive volumes of data are endless opportunities to create new insights, uncover new patterns and make smarter, more-informed decisions. But getting to those insights has traditionally taken lots of skill and a great deal of time. According to estimates, data scientists spend 50 to 80 percent of their time wrangling data before they can explore and analyze it. [1] How can you create the valuable insights that are the currency for the new economy while controlling complexity?

More and more organizations are turning to Apache Spark. As data and analytics become more embedded into the fabric of business and society, Spark introduces a new approach to large-scale data processing.

Spark is an open source engine built specifically for data science. You can use it to better extract value from big data, conduct deeper analyses and deliver fast results, all while reducing the time and effort required for coding. Spark helps unify access to data across the organization with a set of core libraries that support a number of programming languages and streamline the ability to work with a multitude of different data sources (Figure 1).

[Figure 1 components]
Spark Core: General compute engine that handles distributed task dispatching, scheduling and basic I/O functions
Spark SQL: Executes SQL statements
Spark Streaming: Performs streaming analytics using microbatches
MLlib (machine learning): Common machine learning and statistical algorithms
GraphX (graph): Distributed graph processing framework

[Figure 1 data sources] Spark can support a large variety of data sources and formats, both on premises and in the cloud
IBM Cloud: IBM BigInsights (HDFS), IBM Cloudant, IBM dashDB, IBM SQL Database, IBM Cloud Object Storage, IBM Analytics for Apache Spark, IBM Data Science Experience
Other cloud: Amazon Web Services, MongoDB, Microsoft Azure, Rackspace, Cassandra, Redis
Cloud applications: Salesforce, NetSuite
On premises: IBM DB2, Oracle, SAP

Figure 1. Spark accommodates several analytical methods, allowing users to process data from many different sources.

Lower the barrier for building data products and analytics applications

Spark helps simplify algorithm development and accelerate analytics results. The engine is well-suited to iterative algorithms such as the ones used for machine learning. With Spark, these algorithms can run in memory up to 100 times faster than on Hadoop MapReduce. [2] If you have large amounts of data requiring low-latency processing that a typical MapReduce program cannot provide, Spark is an alternative.

Spark's tooling and runtime infrastructure allow developers and data scientists to rapidly refine their statistical algorithms through fast, iterative modeling on massively parallel server clusters, farms and cloud-based platforms. Drawing on Hadoop and other data lakes, Spark facilitates collaboration among data scientists, data-driven application developers and data engineers by offering an environment that encourages reuse and sharing of data, algorithms and other assets.

Spark offers a powerful engine and tools that significantly lower the barrier to entry for building data products and analytics workflows, allowing you to:

- Streamline development: With Spark, there's no need to learn new languages. Developers and data scientists can use their existing expertise with programming languages such as Scala, Python and SQL to speed time to market. Prepare for tomorrow by enabling more professionals to work with data science tools and by making data scientists more productive.
- Simplify data access: Spark removes data access complexity, providing seamless access to enterprise data with familiar tools. Built-in machine learning and graph algorithm libraries make it easy to enable interactive queries and deliver fast responses.
- Develop a wide range of algorithms: Spark lets you develop and deploy all workloads faster, including machine learning, iterative and batch.
- Accelerate analytics results: Spark uses an in-memory data processing approach to deliver results quickly.

Tapping Spark's unique advantages

Spark has quickly become popular among developers and data scientists as an essential platform for integrating big data into applications. Its notable advantages include:

- An in-memory architecture that generates up to 100x speed improvements compared to the Hadoop file-based architecture [3]
- Greater speed, which enables new use cases such as interactive or iterative analysis
- A simple programming model that requires less code
- Support for multiple programming languages
- A single modular platform that enables extension through libraries, not separate applications
- A large and growing community of contributors who continuously improve the full analytics stack and extend capabilities

The art of the possible with Spark

In the coming years, machine learning applications will lead to new breakthroughs that will amplify human abilities, assist us in making good choices, look out for our safety and help us navigate our world in interesting new ways. Here are some examples of how you can begin using Spark to build your own intelligent analytics applications.

Natural language processing

The most expressive and insightful interactions with your customers are likely captured in unstructured form: social media exchanges, phone conversations and so on. Natural language processing techniques, such as term frequency-inverse document frequency (TF-IDF) in Spark MLlib, can turn an unstructured body of text into information you can use to teach a machine learning algorithm. With Spark MLlib, you can bake natural language processing directly into applications so you can proactively manage customer interactions.
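To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. It illustrates the weighting scheme only and does not use the Spark MLlib API. The sample comments, the function name tf_idf and the smoothed formula log((N + 1) / (df + 1)) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (illustrative only)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)  # raw term frequency in this document
        weights.append({
            # Assumed smoothed IDF: terms found in every document score 0.
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical customer comments, invented for this example.
comments = [
    "delivery was late",
    "delivery was fast",
    "support was helpful",
]
scores = tf_idf(comments)
for comment, weight_by_term in zip(comments, scores):
    print(comment, "->", weight_by_term)
```

In this toy data set, "was" appears in every comment and therefore receives a weight of zero, while rarer words such as "late" receive the highest weights. Those weights are the numeric features a learning algorithm could then consume.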

Prescriptive analytics

Prescriptive analytics helps you predict not only that something will happen, but also why it will happen. For example, you can use machine learning to determine which attributes have the most predictive power in forecasting customer actions (that is, attribute importance). When you know why customers act the way they do, you can intervene in a personalized way through systems of engagement. In short, prescriptive analytics and machine learning enable you to offer a tailored next best action when it's needed most.

Power in numbers

As smart and as powerful as a single person may be, a group of specialists can more effectively work together to win a battle. Machine learning is no different. Spark MLlib supports machine learning techniques called ensembles. With ensembles, many different models collaborate to make better predictions. This technique is well-suited for the massively parallel horsepower of Spark.

Real-time machine learning

You can use Spark to develop and deploy applications that will learn in real time. Spark Streaming and MLlib work together to help your applications adapt on the fly. For example, the MLlib streaming K-means implementation can learn dynamically, which is useful when patterns in the data change over time. This method enables your applications to focus on what's important in the moment.
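The way a streaming K-means model can adapt as data changes can be sketched in plain Python. This is an illustration under stated assumptions, not the MLlib implementation: the function names, the decay parameter and the sample batches are all hypothetical. Each incoming batch pulls the cluster centers toward the new points, and the decay factor shrinks the influence of older data.

```python
def nearest(centers, point):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centers[i], point)),
    )


def update(centers, weights, batch, decay=0.5):
    """Fold one batch into the centers; decay < 1 forgets older data."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for point in batch:
        i = nearest(centers, point)
        counts[i] += 1
        for d in range(dims):
            sums[i][d] += point[d]

    new_centers, new_weights = [], []
    for i, center in enumerate(centers):
        old = weights[i] * decay          # discounted weight of past data
        total = old + counts[i]
        if total == 0:                    # nothing known: keep the center
            new_centers.append(list(center))
            new_weights.append(0.0)
            continue
        new_centers.append(
            [(center[d] * old + sums[i][d]) / total for d in range(dims)]
        )
        new_weights.append(total)
    return new_centers, new_weights


# Hypothetical microbatches; the second one drifts away from the origin.
centers = [[0.0, 0.0], [10.0, 10.0]]
weights = [1.0, 1.0]
batches = [
    [[1.0, 1.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]],
    [[4.0, 4.0], [5.0, 5.0], [10.0, 10.0], [11.0, 11.0]],
]
for batch in batches:
    centers, weights = update(centers, weights, batch)
    print(centers)
```

With a decay of 0.5, the first center moves noticeably toward the drifted points in the second batch. A decay of 1.0 would weight all past data equally, and a decay of 0.0 would use only the latest batch.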

Automating automation

Machine learning applications need automation and optimization. Automating machine learning is an area where Apache Spark really shines. For example, Spark lets you automatically determine the best way to train your learning algorithm, a method commonly known as hyperparameter tuning. The Spark community is leading the way in this space.

Common Spark use cases

- Interactive querying of very large data sets
- Running large data processing batch jobs
- Complex analytics and data mining across various types of data
- Building and deploying rich analytics models
- Implementing near-real-time stream event processing
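The hyperparameter tuning described under "Automating automation" can be illustrated with a small grid search in plain Python. This sketch does not use Spark's own tuning utilities. The one-variable ridge model, the data points and the candidate values are assumptions made up for the example. The idea is to train once per candidate setting, score each model on held-out data and keep the setting with the lowest error.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form slope of y = w * x with an L2 penalty of strength lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(xs, ys, slope):
    """Average squared prediction error of the model y = slope * x."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: noisy training points and a held-out validation set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.5, 4.5, 6.5, 9.0]
valid_x = [5.0, 6.0]
valid_y = [10.0, 12.0]

# Candidate values for the regularization strength (the hyperparameter).
grid = [0.0, 1.0, 3.0, 10.0]

# Train one model per candidate and score it on the validation set.
results = {}
for lam in grid:
    slope = fit_ridge_slope(train_x, train_y, lam)
    results[lam] = mean_squared_error(valid_x, valid_y, slope)

best_lambda = min(results, key=results.get)
print("validation error per setting:", results)
print("best setting:", best_lambda)
```

Because every candidate setting is trained and scored independently, this kind of search parallelizes naturally, which is one reason it suits a cluster engine.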

Why IBM and Spark?

Apache Spark is a critical technology for delivering the benefits of intelligence-based, in-time action. IBM believes the value of Spark lies in its ability to:

- Enable collaboration
- Support data access
- Broaden the application of analytics
- Deploy deep intelligence into every application, including the Internet of Things (IoT), web, mobile, social, business processes and more

IBM Analytics for Apache Spark

You can get the benefits of Apache Spark via the fully managed Spark service on IBM Bluemix. IBM Analytics for Apache Spark provides all of the capabilities and rich features found in an on-premises Spark deployment without the cost and complexity of managing infrastructure or complex components. The service is immediately ready for you to run analyses, skipping setup hurdles, hassles and wasted time.

IBM is committed to Apache Spark, making investments in design-led innovation and broad-scale education programs to promote open source innovation and accelerate the injection of intelligence into every application.

This investment extends the IBM history of proven leadership in open source projects, analytics and innovation, and it will continue to push the envelope of what is possible with Spark.

IBM has been a key member of the Apache Spark community since the beginning, and makes significant contributions to support the growth of the core Spark project and its community. The goals: help develop the skills of the next-generation data practitioner, and facilitate IBM solutions by incorporating Spark. For example, IBM Streams improves on Spark's built-in streaming capabilities by enabling streaming analytics for cybersecurity, financial trading and other applications that require extremely low latency.

The Spark Technology Center, located at the San Francisco campus of Galvanize, is dedicated to creating and encouraging open and free educational assets for enabling Spark adoption and sharing best practices, as well as evangelizing Spark technology and its business potential.

IBM seeks to help build a community, deliver best-of-breed training, create innovation labs and produce a reference architecture to speed time to value. To that end, IBM is committed not only to using Spark for what it offers today, but also to leading the charge to keep Spark on the cutting edge.

To learn more about the IBM point of view on Spark, visit: ibm.com/spark

Copyright IBM Corporation 2016

IBM Analytics
Route 100
Somers, NY

Produced in the United States of America
November 2016

IBM, the IBM logo, ibm.com, BigInsights, Bluemix, Cloudant, dashDB, and DB2 are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

Microsoft and Azure are trademarks of Microsoft Corporation in the United States, other countries, or both.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

[1] Lohr, Steve, "For Big-Data Scientists, Janitor Work Is Key Hurdle," New York Times, August 17, 2014, com/2014/08/18/technology/for-big-data-scientists-hurdleto-insights-is-janitor-work.html?_r=
Ibid.

Please Recycle

CDM12353-USEN-00