Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved. 1
SAP Big Data Webinar Series Big Data - Introduction to SAP Big Data Technologies Big Data - Streaming Analytics Big Data - Smarter Data Virtualization Big Data - Gain New Insight from Hadoop Big Data - Spatial Data Processing for Richer Insights Big Data - Text Analytics
Speaker Introduction Yuvaraj Athur Raghuvir, is a Senior Director in the SAP HANA Platform Solution Management at SAP. He leads the Big Data Analytics portfolio including SAP Real-time Data Platform. Yuvaraj has over 14 years of experience spanning Business Applications, Business Analytic Solutions, Architecture and Engineering. 2013 SAP AG or an SAP affiliate company. All rights reserved. 3
SAP Big Data Webinar Series Gain New insight from Hadoop Presented by: Yuvaraj Athur Raghuvir, SAP HANA Platform July 24 2013
SHOPPERS GET FASHION ADVICE THAT FITS THEIR STYLE
Big Data Economics Streaming Data incl. Sensors, Social and Mobile Predictive incl. data mining and machine learning Urgent Need Unstructured Analysis Incl. text, media, spatial etc. Scalable Storage 2013 SAP AG or an SAP affiliate company. All rights reserved. 6
Open Source Community Gift: Apache Hadoop Streaming Data Apache Hadoop incl. Sensors, Social and Mobile Predictive [Commons] incl. data mining and machine learning Urgent Need [Projects] Unstructured Analysis Incl. text, [Distributions] media, spatial etc. [Data Scientists] Scalable Storage 2013 SAP AG or an SAP affiliate company. All rights reserved. 7
Hadoop Possibilities Amazon has reported that it started 2 million Elastic MapReduce (EMR) clusters in a single year. Dynamic Community with over 2500+ Hadoop related projects in GitHub. Growing Popularity among Vendors Storage Compute Network High Performance Computing Vendors Pushing New Boundaries! Source: 1) Gartner Blog http://blogs.gartner.com/merv-adrian/2013/02/23/hadoop-2013-part-three-platforms/ retrieved on July 17 2013 2) Gartner Blog http://blogs.gartner.com/merv-adrian/2013/03/08/hadoop-2013-part-four-players/ retrieved on July 17 2013 3) GitHub query Hadoop retrived on July 17 2013 2013 SAP AG or an SAP affiliate company. All rights reserved. 8
Hadoop Community Complexity! 2500+ Community Projects!! Highly Dynamic & Evolving Ecosystem Projects & Alternatives Gartner Infographic Source: Gartner Blog http://blogs.gartner.com/mervadrian/2013/02/21/hadoop-2013-part-two-projects/ Source: GitHub query Hadoop retrived on July 17 2013 Source: GigaOm Infographic, Mar 5, retrieved on July 17 2013 retreived on July 17 2013 2013 SAP AG or an SAP affiliate company. All rights reserved. GigaOm Infographic 9
The Big Data Phenomenon Big Data is more than just Hadoop Business Trends Technology Trends Exploding data volumes Increasing data variety Accelerating data velocity Storage / Memory / CPU advances Hadoop & distributed MPP Data Mining/Predictive analysis In-memory computing Complex event processing Enterprise Big Data 2013 SAP AG or an SAP affiliate company. All rights reserved. 10
Big Data Challenges Business and technical needs Business Needs Quick insights from all business-relevant data Pick the right action among many choices 100101 011010 100101 Technical Challenges Cost of store vs. cost to process data considerations Pressure to gain insight quickly from data Action 1 Action 2 Action 3 All Relevant Big Data Quick Insight Diversity of data formats makes it difficult to analyze Right Action Need to have disparate technologies interoperate across enterprise Determine the right action among many valuable insights across variety, volume, velocity and technology is difficult 2013 SAP AG or an SAP affiliate company. All rights reserved. 11
How to Capitalize on the Big Data Opportunity and Address Big Data Technical Challenges? To deploy an integrated data processing framework Optimize data management in each phase of the information lifecycle process Regardless of data source, processing technologies, latency challenges, number of user demands To enable real-time, actionable insights in business process context Marry business process insights from structured data analysis with deep pattern, behavior analysis of unstructured data Enable decision making based on multi-factor considerations, not just instinct/experience To derive new value from INFORMATION Focus on deriving new value from data by enabling new business and technology use cases previously not feasible Augment existing business scenarios with new data insights to enable better decision 2013 SAP AG or an SAP affiliate company. All rights reserved. 12
Enterprise Scenarios with Hadoop Hadoop as a flexible data store Streaming Data Social Media Reference Data Transaction Data Enterprise Data SAP Business Suite Other SAP solutions Hadoop as a simple database Hadoop SAP Solutions Data warehouse/database (SAP HANA, SAP Sybase IQ/SAP Sybase ASE) SAP Data Services Computation Engine(s) Job Management Data storage (Hadoop Distributed File system) Hadoop as a processing engine Hadoop for data analytics Non-SAP solutions BI and analytics software from SAP In-memory Analytic engine and/or... Disk-based data ware-house (SAP Sybase IQ) Analytic engine 2013 SAP AG or an SAP affiliate company. All rights reserved. 13
Hadoop as a Simple Database SAP Business Suite SAP Solutions Focus: Storage and Retrieval of data from Hadoop typically using interfaces provided by HIVE or direct HDFS access Other SAP solutions Data warehouse/database (SAP HANA, SAP Sybase IQ/SAP Sybase ASE) Hadoop as a flexible data store Streaming Data Social Media Reference Data Transaction Data Enterprise Data Hadoop as a simple database Hadoop SAP Data Services Computation Engine(s) Job Management Data storage (Hadoop Distributed File system) Scenarios of Use include: Extract-Transform-Load from other systems to Hadoop. SAP Data Services provides ETL support from Hadoop to SAP HANA. Store & Retrieve Structured data based on projects like Hive. Depending on the scenario, scalable analytical systems like Sybase IQ can also be considered. Handle large documents as Blobs to do retrievals or analytics later. Use as a near-line store for offloading data that is considered cold or frozen Data lifecycle management is typically manual. 2013 SAP AG or an SAP affiliate company. All rights reserved. 14
Hadoop as a Processing Engine SAP Business Suite SAP Solutions Focus: Distributed Compute leveraging the Map-Reduce computation framework of Hadoop on large distributed data sets Other SAP solutions Data warehouse/database (SAP HANA, SAP Sybase IQ/SAP Sybase ASE) Hadoop as a flexible data store Streaming Data Social Media Reference Data Transaction Data Enterprise Data Hadoop SAP Data Services Computation Engine(s) Job Management Data storage (Hadoop Distributed File system) Hadoop as a processing engine Scenarios of Use include: Data Enrichment. Push down of Text Data Transforms from Data Services is an example Data Pattern Analysis. This is an emerging space across new data forms. Convergence between procedural data science and declarative access patterns are evolving. 2013 SAP AG or an SAP affiliate company. All rights reserved. 15
Hadoop for Data Analytics BI and analytics software from SAP Analytic engine In-memory and/or... Disk-based data ware-house (SAP Sybase IQ) Analytic engine Focus: A combination of storage and delegated analytics supporting two approaches: Two-Phase Analytics: Background processing engine refine and feeding data Federated Queries: Client side federation across data stores Hadoop as a flexible data store Streaming Data Social Media Reference Data Transaction Data Enterprise Data Hadoop for data analytics Hadoop Computation Engine(s) Job Management Data storage (Hadoop Distributed File system) Scenarios of Use include: Cross DM analytics. Practical only when performance from Hadoop is acceptable Stand-alone analytics. Emerging area to use Hadoop as the direct data store for analytics 2013 SAP AG or an SAP affiliate company. All rights reserved. 16
SAP HANA SP6 - smart data access capability Data virtualization for on-premise and hybrid cloud environments New HANA Tables Transactions + Analytics SAP HANA Virtual Tables Benefits Enables access to remote data access just like local table Provides SAP HANA to SAP HANA queries Smart query processing including query decomposition with predicate push-down, functional compensation Supports data location agnostic development No special syntax to access heterogeneous data sources Non-disruptive evolution Teradata IQ Heterogeneous data sources SAP HANA to Hadoop (Hive) Teradata SAP Sybase ASE SAP Sybase IQ Hadoop ASE SAP HANA 2013 SAP AG. All rights reserved. 17
Example Reference Architecture : Machine-to-Machine Infrastructure Run-Time Architecture Device End User Apps Web App Mobile App Dashboard Edge Cloud / Backend Device Management Stream M2M Services Data Acquisition & Processing Batch M2M Application Server Core Services Application and Analytical Services Industry-Specific Services SQLA / UltraLite Data Persistence Synchronization Real-Time Data Platform ESP Event Processing SQLA Data Synchronization HANA Hot Data Data Models Predictive Models ASE / IQ Warm and Cold Data Hadoop Big Data Sets 2013 SAP AG. All rights reserved. 18
Beyond Business Networks Internet of Things Meters, Drills MRI, PDAs Generators Turbines Healthcare Industrial Windmills Implants, Surgical Equipment UPS Batteries Pumps, Monitors Public Sector Fuel Cells, etc. Pumps, Valves, Vats, Conveyors, Pipelines Telemedicine, etc. Motors, Drives, Converting, Fabrication Annual smart meter shipments to surpass 140 million units worldwide by 2016, representing a CAGR of 32.9%.[3] Tanks, Fighter Jets Battlefield Comms Jeeps, Cars, Ambulances Breakdown, Lone Worker Homeland Security Tolls, etc. Environ. Monitoring, etc. Planes, Signage Assembly/Packaging, Vessels/Tanks, etc. Vehicles, Lights, Ships Meter Data Management will exceed $420 Million by 2020 with a CAGR of 16.8%[2] Cellular M2M revenue opportunity projected to reach $1.2 Trillion by 2020 Picture: Beecham Research 1. GSMA; 2.Pike Research; 3. IDC Energy Insights
DRIVERS DIVERTED BEFORE FATAL ACCIDENTS HAPPEN
SAP Big Data Webinar Series Thank You! Presented by: Yuvaraj Athur Raghuvir, SAP HANA Platform yuvaraj.athur.raghuvir@sap.com 2013 SAP AG or an SAP affiliate company. All rights reserved.
Hadoop Commons : Core / Common Components Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data. MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The Map function divides a query into multiple parts and processes data at the node level. The Reduce function aggregates the results of the Map function to determine the answer to the query. Hive: Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language caled HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as Microstrategy, Tableau, Revolutions Analytics, etc. Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL.) HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. EBay and Facebook use HBase heavily. Flume: Flume is a framework for populating Hadoop with data. Agents are populated throughout ones IT infrastructure inside web servers, application servers and mobile devices, for example to collect data and integrate it into Hadoop. Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages such as Map Reduce, Pig and Hive -- then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed. Source: http://wikibon.org/wiki/v/hbase,_sqoop,_flume_and_more:_apache_hadoop_defined retrieved on Jul 17 2013 2013 SAP AG or an SAP affiliate company. All rights reserved. 22
Hadoop Commons : Core / Common Components Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. It's development is being led by engineers from Hortonwroks, which include Ambari in its Hortonworks Data Platform. Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing removed procedure calls. Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model. Sqoop: Sqoop is a connectivity tool for moving data from non-hadoop data stores such as relational databases and data warehouses into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target. HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored. BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components with the goal improving the Hadoop platform as a whole. Source: http://wikibon.org/wiki/v/hbase,_sqoop,_flume_and_more:_apache_hadoop_defined retrieved on Jul 17 2013 2013 SAP AG or an SAP affiliate company. All rights reserved. 23