IBM Big Data Summit 2012 12.10.2012
InfoSphere BigInsights Introduction
Wilfried Hoge, Leading Technical Sales Professional
hoge@de.ibm.com
twitter.com/wilfriedhoge
IBM Big Data Strategy: Move the Analytics Closer to the Data
New analytic applications drive the requirements for a big data platform
- Integrate and manage the full variety, velocity and volume of data
- Apply advanced analytics to information in its native form
- Visualize all available data for ad hoc analysis
- Development environment for building new analytic applications
- Workload optimization and scheduling
- Security and governance
[Figure: IBM Big Data Platform. Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics. Platform components: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance.]
BigInsights: analytical platform for persistent Big Data
Based on open source and IBM technologies
Distinguishing characteristics
- Built-in analytics: enhances business knowledge
- Enterprise software integration: complements and extends existing capabilities
- Production-ready platform with tooling for analysts, developers, and administrators: speeds time-to-value and simplifies development and maintenance
IBM advantage
- Combination of software, hardware, services and advanced research
[Figure: IBM Big Data Platform diagram, repeated from the previous slide]
About the BigInsights Platform
Flexible, enterprise-class support for processing large volumes of data
- Based on Google's MapReduce technology
- Inspired by Apache Hadoop; compatible with its ecosystem and distribution
- Well-suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + disks = node
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  - Data formats
  - How data is loaded
  - How jobs are written
Hadoop Explained: MapReduce
Hadoop computation model
- Data stored in a distributed file system spanning many inexpensive computers
- Bring function to the data
- Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data

    // Mapper: emits (word, 1) for every token in the input value.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text val, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for one word (excerpt is cut off on the slide).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> val, Context context) {
            int sum = 0;
            for (IntWritable v : val) {
                sum += v.get();
                ...

[Figure: a MapReduce application distributes map tasks to the Hadoop data nodes of the cluster, shuffles the interim output, and returns a single result set]
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
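The three phases named on this slide can be traced on a single machine. The following is a minimal sketch in plain Java that uses no Hadoop classes: the class name MapReduceSketch, its method names, and the two sample input splits are assumptions made for illustration, and each list element stands in for a block of data that a real cluster would hold on a separate node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration of map, shuffle, and reduce; not the Hadoop API.
class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group the interim values by key across all map outputs.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order and returns a single result set.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutputs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical input splits.
        System.out.println(wordCount(List.of("big data big insights", "data moves to big data")));
        // prints {big=3, data=3, insights=1, moves=1, to=1}
    }
}
```

In a real cluster the map calls run in parallel on the nodes that store each split, and the shuffle moves data over the network; here both are ordinary loops, which is enough to show how the pairs flow between phases.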
BigInsights Value Beyond Open Source
Technical differentiators
- Built-in analytics
  - Text processing engine, annotators, Eclipse tooling
  - Statistical and predictive analysis
  - Interface to the R Project (statistical platform)
- Enterprise software integration (DBMS, warehouse)
- Spreadsheet-style analytical tool for analysts
- Ready-made business process accelerators
- Integrated installation of supported open source and IBM components
- Web Console for administration and application access
- Platform enrichment: additional security, performance features, ...
- Standard IBM licensing agreement and world-class support
Business benefits
- Quicker time-to-value due to IBM technology and support
- Reduced operational risk
- Enhanced business knowledge with flexible analytical platform
- Leverages and complements existing software assets
InfoSphere BigInsights: Embrace and Extend Hadoop
[Figure: BigInsights architecture diagram; components are marked as either IBM or Open Source. Labels recovered from the diagram:]
- Analytics: ML Analytics, Text Analytics
- Interface: BigSheets, Web console, Eclipse plug-ins
- Application: Pig, Hive, Jaql, MapReduce, AdaptiveMR, FLEX, BigIndex, Oozie, Lucene, Zookeeper, Avro, IBM LZO Compression
- Storage: HDFS, HBase, GPFS-SNC
- Data Sources / Connectors: Streams, Netezza, BoardReader, R, DataStage, DB2, CSV / XML / JSON, SPSS, Flume, JDBC, Web Crawler
Web console
- Monitor cluster health
- Add / remove nodes
- Start / stop services
- Inspect job status
- Inspect workflow status
- Deploy apps
- Launch apps / jobs
- Work with distributed file system
- Work with spreadsheet interface
- Support REST-based API
- ...
Eclipse plug-ins
- Text analytics
- MapReduce programming
- Jaql development
- Hive query development
Web Installation Tool
- Seamless process for single node and cluster environments
- Integrated installation of all selected components
- Post-install validation of IBM and open source components
No need to iteratively download, configure, and test multiple open source projects and their prerequisite software.
Web Console
- Manage BigInsights
  - Inspect system health
  - Add / drop nodes
  - Start / stop services
  - Run / monitor jobs (applications)
  - Explore / modify file system
- Launch applications
  - Spreadsheet-like analysis tool
  - Pre-built applications (IBM supplied or user developed)
- Publish applications
- Leverage community resources
Quick-start applications ("apps")
- Reusable software assets based on customer engagements
  - Useful as a starting point for various applications
  - Can be customized by BigInsights application developers as needed
  - Accessible through the Web console
- Available assets
  - Data export (to relational DBMS, files, HBase)
  - Data import (from relational DBMS, files)
  - Web crawler, Twitter crawler
  - Boardreader.com support (Web forum search engine)
  - Ad hoc queries for Jaql, Hive, Pig
  - TeraGen-TeraSort, WordCount sample applications
Running Applications from the Web Console
DEMO: Web console
BigSheets
BigSheets is a visual tool for data manipulation and prototyping
- Allows more users to do more work, more quickly
- Simply stated, growing an army of MapReduce developers is not cost effective
- In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights.
Sample uses
- Data exploration and visualization
- Visual job creation
BigSheets Spreadsheet-style Data Analysis and Discovery
BigSheets Visualization
DEMO: BigSheets
Text Analytics in BigInsights
- Text analytics: distill structured information from unstructured data
- Rich annotator library supports multiple languages
- Declarative Information Extraction (IE) system based on an algebraic framework
  - Richer, cleaner rule semantics
  - Better performance through optimization
- Developed at IBM Research since 2004
- Embedded in several IBM products
  - Lotus Notes
  - Cognos Consumer Insights
  - InfoSphere Streams
- Compose operators to build complex annotators
Text Analytics: highly accurate analysis of textual content
How it works
- Parses text and detects meaning with annotators
- Understands the context in which the text is analyzed
- Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Accuracy
- Highly accurate in deriving meaning from complex text
Performance
- AQL language optimized for MapReduce
[Figure: unstructured text (document, email, etc.) is turned into classification and insight. Sample text:]
"Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
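To make the idea of annotators more concrete, here is a minimal sketch that runs two simple annotators over the sample text and composes them into a third. It is written in plain Java with java.util.regex and is not AQL; the class AnnotatorSketch, the Span helper, the method names, the naive name pattern, and the 12-character gap threshold are all assumptions chosen for this illustration, and real annotators are far more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java stand-in for annotator composition; this is not AQL.
class AnnotatorSketch {

    // A typed span of the document text (hypothetical helper type).
    static final class Span {
        final String type;
        final String text;
        final int begin;
        final int end;

        Span(String type, String text, int begin, int end) {
            this.type = type;
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Basic operator: tag every match of a regular expression.
    static List<Span> extract(String type, String regex, String doc) {
        List<Span> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(doc);
        while (m.find()) {
            spans.add(new Span(type, m.group(), m.start(), m.end()));
        }
        return spans;
    }

    // Composite operator: keep (first, second) pairs where the second span
    // starts at most maxGap characters after the first span ends.
    static List<String> followedBy(List<Span> first, List<Span> second, int maxGap) {
        List<String> pairs = new ArrayList<>();
        for (Span a : first) {
            for (Span b : second) {
                int gap = b.begin - a.end;
                if (gap >= 0 && gap <= maxGap) {
                    pairs.add(a.text + ": " + b.text);
                }
            }
        }
        return pairs;
    }

    // Sample text from the slide.
    static final String DOC =
        "Football World Cup 2010, one team distinguished themselves well, "
        + "losing to the eventual champions 1-0 in the Final. Early in the second half, "
        + "Netherlands striker, Arjen Robben, had a breakaway, but the keeper for Spain, "
        + "Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.";

    // Team dictionary + naive name candidates, composed by proximity.
    static List<String> playersByTeam(String doc) {
        List<Span> teams = extract("Team", "\\b(Netherlands|Spain)\\b", doc);
        List<Span> names = extract("NameCandidate", "\\b[A-Z][a-z]+ [A-Z][a-z]+\\b", doc);
        return followedBy(teams, names, 12);
    }

    public static void main(String[] args) {
        System.out.println(playersByTeam(DOC));
        // prints [Netherlands: Arjen Robben, Spain: Iker Casillas]
    }
}
```

The naive name pattern also tags "Football World" and "Winger Andres" and misses "Andres Iniesta"; the proximity rule filters the false candidates here, which hints at why a library of tuned, pre-built annotators matters for accuracy.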
BigInsights Text Analytics Development AQL
Text Analytics Tooling AQL Editor Result Viewer Runtime Explain
DEMO: Text Analytics
Ways to get started with BigInsights
- In the Cloud: via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds. Pay only for the resources used.
- In the Virtual Classroom: free Hadoop Fundamentals training course at www.bigdatauniversity.com, e.g. BD105EN - Text Analytics Essentials
- On Your Cluster: download Basic Edition from ibm.com.
- In the Classroom: enroll in the InfoSphere BigInsights Essentials course.
Visit the BigInsights technical portal
Free links to papers, demos, a discussion forum, and more:
http://www.ibm.com/developerworks/wiki/biginsights/