IBM s InfoSphere BigInsights: Smart Analytics for Big Data

Size: px
Start display at page:

Download "IBM s InfoSphere BigInsights: Smart Analytics for Big Data"

Transcription

1 An IBM Proof of Technology IBM s InfoSphere BigInsights: Smart Analytics for Big Data Meridee Lowry < BigInsights & Streams Technical Specialist meridee@us.ibm.com 2013 IBM Corporation

2 IBM Disclaimer Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion IBM Corporation

3 Agenda The Big Data challenge: smarter analytics for a smarter planet IBM s approach The big picture Details on BigInsights How BigInsights fits in your software stack How IBM can help you get off to a quick start IBM Corporation

4 An IBM Proof of Technology The Big Data Challenge 2013 IBM Corporation

5 Information is at the Center of a New Wave of Opportunity 44x as much Data and Content Over Coming Decade zettabytes And Organizations Need Deeper Insights Business leaders frequently make decisions based on information they 1 in 3 don t trust, or don t have Business leaders say they don t have access to the information they 1 in 2 need to do their jobs ,000 petabytes 80% Of world s data is unstructured 83% of CIOs cited Business intelligence and analytics as part of their visionary plans to enhance competitiveness 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions IBM Corporation

6 What we hear from customers.... Lots of potentially valuable data is dormant or discarded due to size/performance considerations Large volume of unstructured or semistructured data is not worth integrating fully (e.g. Tweets, logs,...) Not clear what should be analyzed (exploratory, iterative) Information distributed across multiple systems and/or Internet Some information has a short useful lifespan Volumes can be extremely high Analysis needed in the context of existing information (not stand alone) IBM Corporation

7 Merging the Traditional and Big Data Approaches Traditional Approach Structured & Repeatable Analysis Big Data Approach Iterative & Exploratory Analysis Business Users Determine what question to ask IT Delivers a platform to enable creative discovery IT Structures the data to answer that question Business Explores what questions could be asked Monthly sales reports Profitability analysis Customer surveys Brand sentiment Product strategy Maximum asset utilization IBM Corporation

8 An IBM Proof of Technology IBM s approach 2013 IBM Corporation

9 IBM Big Data Platform Strategy Integrate and manage the full variety, velocity and volume of Big Data Apply advanced analytics to information in its native form Visualize all available data for adhoc analysis Development environment for building new analytic applications Support workload optimization and scheduling Provide for security and governance Integrate with enterprise software BI / Reporting Analytic Applications Exploration / Visualization Industry App IBM Big Data Platform Visualization & Discovery Hadoop System Predictive Analytics Application Development Accelerators Stream Computing Content.... BI / Analytics Reporting Systems Management Data Warehouse Information Integration & Governance IBM Corporation

10 BigInsights Brings Hadoop to the Enterprise BigInsights = analytical platform for persistent Big Data Based on open source & IBM technologies Managed like a start-up.... Emphasis on deep customer engagements, product plan flexibility Distinguishing characteristics Built-in analytics.... Enhances business knowledge Enterprise software integration.... Complements and extends existing capabilities Production-ready platform with tooling for analysts, developers, and administrators.... Speeds time-to-value; simplifies development and maintenance IBM advantage Combination of software, hardware, services and advanced research IBM Corporation

11 BigInsights: Value Beyond Open Source Open source components Enterprise Capabilities Visualization & Exploration Development Tools Advanced Engines Connectors Workload Optimization Administration & Security IBM-certified Apache Hadoop and related projects Key differentiators Built-in text analytics Enterprise software integration SQL support Spreadsheet-style analysis Integrated installation of supported open source and other components Web Console for admin and application access Platform enrichment: additional security, performance features, GPFS (alternative file system),... World-class support Full open source compatibility Business benefits Quicker time-to-value due to IBM technology and support Reduced operational risk Enhanced business knowledge with flexible analytical platform Leverages and complements existing software IBM Corporation

12 BigInsights and the data warehouse Big Data analytic applications Traditional analytic tools Data warehouse BigInsights Filter Transform Aggregate IBM Corporation

13 BigInsights and the data warehouse Traditional analytic tools Big Data analytic applications BigInsights Data Warehouse Query-ready repository for cold warehouse data IBM Corporation

14 A Closer Look at BigInsights IBM Corporation

15 About the BigInsights Platform Flexible, enterprise-class support for processing large volumes of data Based on Google s MapReduce technology Inspired by Apache Hadoop; compatible with its ecosystem and distribution Well-suited to batch-oriented, read-intensive applications Supports wide variety of data Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner CPU + disks = node Nodes can be combined into clusters New nodes can be added as needed without changing Data formats How data is loaded How jobs are written IBM Corporation

16 The MapReduce Programming Model "Map" step: Input split into pieces Worker nodes process individual pieces in parallel (under global control of the Job Tracker node) Each worker node stores its result in its local file system where a reducer is able to access it "Reduce" step: Data is aggregated ( reduced from the map steps) by worker nodes (under control of the Job Tracker) Multiple reduce tasks can parallelize the aggregation IBM Corporation

17 Logical MapReduce Example: Word Count map(string key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); reduce(string key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); Content of Input Documents Hello World Bye World Hello IBM Map 1 emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1> Map 2 emits: < Hello, 1> < IBM, 1> Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2> IBM Corporation

18 So What Does This Result In? Easy To Scale Fault Tolerant and Self-Healing Data Agnostic Extremely Flexible IBM Corporation

19 Integrated Web Console Manage BigInsights Inspect /monitor system health Add / drop nodes Start / stop services Run / monitor jobs (applications) Explore / modify file system Create custom dashboards... Launch applications Spreadsheet-like analysis tool Pre-built applications (IBM supplied or user developed) Publish applications Monitor cluster, applications, data, etc. Create / view event alerts IBM Corporation

20 Spreadsheet-style Analysis Web-based analysis and visualization Spreadsheet-like interface Define and manage long running data collection jobs Analyze content of the text on the pages that have been retrieved IBM Corporation

21 Big Data Application Ecosystem Data integration scenario: Pre-defined work flows simplify loading data from various sources Work flows can be configured, deployed, executed and scheduled Application scenarios (web log, , social media, ): Samples provide starting point, speed time to value Big Data Web Console Publish Development tooling: Text analytics MapReduce Query languages... App Development Eclipse App library MapReduce, Text Analytics Query App Development Code application program, and generate associated App Deploy Apps to Enterprise Manager IBM Corporation

22 Quick start sample applications 20+ software assets based on common customer needs Useful for starting point for various applications Accessible through Web console Available assets Data movement From relational DBMS, files, REST-based sources To relational DBMS, files Web crawler, social media data collectors, etc. Ad hoc query Monitoring Data sampling and subsetting TeraGen-TeraSort, WordCount sample applications IBM Corporation

23 Running Applications from the Web Console IBM Corporation

24 Build a Big Data Program MapReduce example Eclipse tools For Jaql, Hive, Pig Java MapReduce, BigSheets plug-ins, text analytics, etc IBM Corporation

25 Build a Big Data Program MapReduce example Eclipse tools For Jaql, Hive, Pig Java MapReduce, BigSheets plug-ins, text analytics, etc IBM Corporation

26 Quickly drag and drop to create new applications IBM Corporation

27 Visualize results through dashboards Built-in dashboards for monitoring system health, application status, distributed file system, etc. Easy to customize.... Add, group, or remove widgets for: BigSheets collections and charts Cluster/system Monitoring HDFS monitoring MapReduce metrics Third party Widgets or Open Social Gadgets can be added to a dashboard Create new, custom dashboards to suit your needs! IBM Corporation

28 Big SQL Apr IBM Corporation

29 Big SQL 3.0: Architected for performance Architected from the ground up for low latency and high throughput MapReduce replaced with a modern MPP architecture Compiler and runtime are native code (not java) Big SQL worker daemons live directly on cluster Continuously running (no startup latency) Processing happens locally at the data Message passing allows data to flow directly between nodes Operations occur in memory with the ability to spill to disk Supports aggregations and sorts larger than available RAM Hive Metastore Head Node Big SQL Head Node Task Data Big Tracker Task Node Data SQL Big Tracker Compute Task Node Node Data SQL Big Tracker Task Node Data SQL Big Compute Node Tracker Node SQL Compute Node Compute Node HDFS/GPFS IBM Corporation

30 BigInsights and Text Analytics Distills structured info from unstructured text Sentiment analysis Consumer behavior Illegal or suspicious activities Parses text and detects meaning with annotators Understands the context in which the text is analyzed Features pre-built extractors for names, addresses, phone numbers, etc. Built-in support for English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese Unstructured text (document, , etc) Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win. Classification and Insight IBM Corporation

31 An IBM Proof of Technology Big R End-to-end integration of R into IBM BigInsights R Clients 1. Explore, visualize, transform, and model big data using familiar R syntax and paradigm Pull data (summaries) to R client R Packages 2. Scale out R Partitioning of large data ( divide ) Parallel cluster execution of pushed down R code ( conquer ) All of this from within the R environment (Jaql, Map/Reduce are hidden from you Almost any R package can run in this environment Data Sources Scalable Statistic s Engine 3. Scalable machine learning A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R Or, push R functions right on the data R Packages Embedded R Execution 2013 IBM Corporation

32 Application Accelerators Quickly build, deploy custom applications in high-value areas IBM Accelerator for Social Data Analytics B2C businesses Sample applications: Customer acquisition / retention, Customer Segmentation or Micro Segmentation, Marketing Campaign Optimization, Lead generation, Brand Management or Surveillance Ships with BigInsights v2 and Streams v3 IBM Accelerator for Machine Data Analytics Cross-industry: manufacturing, oil & gas, energy and utility, healthcare, travel and transportation, CPG, Retail, etc. Operational efficiency monitoring, security incident investigation. proactive maintenance, troubleshooting, outage prevention, efficiency tracking, etc Ships with BigInsights v2 IBM Accelerator for Telco Event Data Analytics Telcos Campaign management, real-time promotion, fraud detection, service assurance and network monitoring, Ships with Streams v3, but works with BigInsights or PureSparta for Analytics (a.k.a. Netezza) IBM Corporation

33 Adaptive MapReduce (Platform Symphony) performance features Other Grid Server Each engine polls broker ~5 times per second (configurable) Client Broker Engines Send work when engine ready Network transport (client to broker) Wait for engine to poll broker Network transport (broker to engine) Compute Result Post result back to broker Serialize input data Broker Compute time De-serialize Input data Serialize result Time Platform Symphony SSM Compute time & logging Network transport De-serialize Compute result No wait time due to polling, faster serialization/de-serialization, More network efficient protocol Serialize Platform Symphony advantages: Efficient C language routines use CDR (common data representation) and IOCP rather than slow, heavy-weight XML data encoding) Network transit time is reduced by avoiding text based HTTP protocol and encoding data in more compact CDR binary format Processing time for all Symphony services is reduced by using a native HPC C/C++ implementation for system services rather than Java Serialize input Network transport (SSM to engine) Time Network transport (engine to SSM) Platform Symphony has a more efficient push model that avoids entirely the architectural problems with polling IBM Corporation

34 GPFS FPO File system alternative to HDFS. Optional. Key features No single point of failure Built-in High Availability POSIX compliance Enhanced Security with ACL support Support for Storage Pools SnapShot capability IBM Corporation

35 InfoSphere Data Click self-service data integration on-demand Broad connectivity Traditional and big data sources Simple end-to-end experience Web-based configuration IBM Corporation

36 BigInsights Connectivity to DBMS / Warehouse Netezza BigInsights IBM Corporation

37 About Big Data and BigInsights... Big Data is a strategic initiative for IBM Significant investments across software, hardware and services. InfoSphere BigInsights Enables firms to exploit growing variety, velocity, and volume of data Delivers diverse range of analytics Leverages and extends open source Provides enterprise-class features and supporting services Complement existing software investments and commercial offerings Available in basic (free) and enterprise editions IBM advantage Full solution spanning software, hardware & services Rapid technology advances through partnerships with IBM Research Global reach IBM Corporation

38 Get started with BigInsights In the Cloud Pay only for the resources used. On Your Cluster Download Quick Start Edition from ibm.com Tutorials and examples built in In the Classroom Via IBM Education Online at With the HadoopDev Community Links to demos, papers, downloads, etc. Online at IBM Corporation

39 IBM big data IBM big data IBM big data An IBM Proof of Technology IBM big data IBM big data THINK IBM big data IBM big data IBM big data IBM big data IBM big data 2013 IBM Corporation

40 An IBM Proof of Technology Supplemental 2013 IBM Corporation

41 Key resources and references IBM Internal Info Management Acceleration Zone for BigInsights and Big Data BigInsights internal forum (monitored by developers and other IBM experts) IM demo cloud (to provision your own BigInsights environment in the cloud) External InfoCenter (product documentation) BigInsights wiki. Contains links to videos, articles, online education, external forum, etc IBM Corporation

42 What is Hadoop? Apache Hadoop = free, open source framework for data-intensive applications Inspired by Google technologies (MapReduce, GFS) Well-suited to batch-oriented, read-intensive applications Originally built to address scalability problems of Nutch, an open source Web search technology Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner CPU + disks of commodity box = Hadoop node Boxes can be combined into clusters New nodes can be added as needed without changing Data formats How data is loaded How jobs are written IBM Corporation

43 Two Key Aspects of Hadoop MapReduce framework How Hadoop understands and assigns work to the nodes (machines) Hadoop Distributed File System = HDFS Where Hadoop stores data A file system that spans all the nodes in a Hadoop cluster It links together the file systems on many local nodes to make them into one big file system IBM Corporation

44 What is the Hadoop Distributed File System? HDFS stores data across multiple nodes HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS IBM Corporation

45 Streams and BigInsights - Integrated Analytics on Data in Motion & Data at Rest Visualization of realtime and historical insights Data Data ingest, preparation, online analysis, model validation InfoSphere Streams 1. Data Ingest 2. Bootstrap/Enrich Control flow 3. Adaptive Analytics Model InfoSphere BigInsights, Database & Warehouse Data Integration, data mining, machine learning, statistical modeling IBM Corporation

46 IBM Watson IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working IBM Corporation

47 Big Data and Watson Big Data technology is used to build Watson s knowledge base Watson technology offers great potential for advanced business analytics Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory. POS Data CRM Data Social Media Approx. 200M pages of text (To compete on Jeopardy!) InfoSphere BigInsights Distilled Insight - Spending habits - Social relationships - Buying trends Watson s Memory Advanced search and analysis IBM Corporation

48 BigInsights Secure Architecture Intranet Authentication store Web Console Private network BigInsights IBM Corporation

49 Security (cont d) User authentication approaches None Flat file LDAP Pluggable Authentication Modules (PAM) Authorization (role-based) System administrator Data administrator Application administrator User Credentials store Store potentially sensitive info: tokens, passwords, etc. Maintained in BigInsights distributed file system Public and private directories Read/write via command line utility IBM Corporation

50 Can you answer these questions? Is this an authorized job? Is there an unknown user running big data queries? Is someone leaving the company running lots of jobs? Is this query downloading large amounts of sensitive data? Is there a program algorithmically trying to access sensitive data? IBM Corporation

51 Guardium monitors data activities on BigInsights Relevant messages are copied and sent to collector Hive queries InfoSphere Guardium S-TAP Minimal impact to big data server resources or network Separation of duties audited data on secure appliance Heterogeneous support (IBM, Cloudera) MapReduce jobs InfoSphere Guardium collector HDFS and HBase commands Cluster Sensitive data alert! InfoSphere Guardium reporting and alerting IBM Corporation

52 About IBM s LZO-based compression Similar to GNU-based LZO compression, but no index needed Fixed-size compression blocks automatically created Original source: Compressed representation Fixed size IBM Corporation

53 Comparison of general compression technologies (conducted by third party) Size (Mbytes) Compression speed (sec) Compression memory used (MBytes) Decompression speed Decompression memory used (Mbytes) uncompressed 96 gzip bzip lzo lzm Approximate values from IBM Corporation

54 About DBMS connectivity and Jaql.... Jaql JDBC modules DB2 LUW and InfoSphere Warehouse Netezza Generic JDBC data source Parallelism Multiple JDBC connections per job (one connection per Map task) Native partitioning leveraged with Netezza Distribution key column look up for DB2 LUW with DPF (automatic) Writing data from BigInsights to RDBMS No ACID guarantee (no transactions in MapReduce) Failed MapReduce tasks automatically restart could lead to duplicate rows (repeated inserts) in DBMS Consider writing BigInsights data to DBMS temp table. If no Map restarts in the job, copy temp table contents to target DBMS table IBM Corporation

55 BigInsights Enterprise Edition Open Source IBM Optional IBM and partner offerings Infrastructure Integrated installer Text compression Analytics and discovery Text processing engine and library BigSheets Big R Enhanced security Solr Accelerator for social data analysis Accelerator for machine data analysis Big SQL Oozie Lucene Apps Web Crawler Boardreader Distrib file copy... Jaql HBase ZooKeeper DB export DB import Ad hoc query Pig Hive Machine learning Data processing MapReduce Administrative and development tools Web console Monitor cluster health, jobs, etc. Add / remove nodes Start / stop services Inspect job status Inspect workflow status Deploy applications Launch apps / jobs Work with distrib file system Work with spreadsheet interface Support REST-based API Create / view alerts... Adaptive MapReduce Flexible scheduler GPFS FPO HCatalog HDFS Eclipse tools Connectivity and Integration JDBC Sqoop DB2 Netezza Streams Text analytics MapReduce programming Jaql, Hive, Pig development BigSheets plug-in development Oozie workflow generation Flume Data Explorer Guardium DataStage Cognos BI IBM Corporation

56 Packaging for BigInsights NEW with BigInsights v2.1.1 (Announce Feb 4, 2014): Per node based pricing Perpetual pricing Standard Edition Enterprise Edition Big R (1Q14, March 2014) Machine Data Analytics Social Media Analytics Text Analytics BigInsights Quick Start Edition Free Download from ext. Web Site Non Warranted & Non-production use Quick Start Edition VM Quick Start Edition Cluster Install Enterprise Edition Developer Starter Kit Based on Enterprise Edition. - per Authorized User Same as Starter Kit in v2.0! Dev use only Enterprise Edition Developer Starter Kit Standard Edition Big Sheets Big SQL Dev Tools Management console Installer IBM certified base Hadoop Business Intelligence (Cognos bundle)* Real Time Streaming (Streams)* Search & Exploration (Data Explorer)* IIG Integrations (Data Click, March 2014) Workload optimization (Adaptive Map Reduce) GPFS BigSheets BigSQL (now and 2Q14) Dev Tools Management console Installer IBM certified base Hadoop Difference between standard & enterprise edition is enforced solely per license terms! 56 *bundled products 2014 / IBM limited Corporation use license

57 BigInsights and IBM Platform Symphony Overview Enterprise ready Hadoop stack Key Capabilities Productivity tools for analytics and data connectivity Text analytics, BigSheets, data-warehouse and database connectors High-performance, low-latency architecture Platform Symphony low latency scheduler Heterogeneous application platform Run Hadoop and non-hadoop applications on a shared cluster Benefits Lower cost via sharing and increased utilization Accelerated results through higher performance Higher productivity for administrators and developers IBM Corporation

58 Benchmarking IBM Platform Symphony for Big Data performance TeraSort SWIM 10 times fewer CPU cores 10x Less hardware for the fastest TeraSort score. 6 times faster Berkley SWIM is a workload benchmark developed at University of California at Berkley. Throughput 60 times faster Measure core scheduling efficiency of MapReduce workloads at Hadoop World 2011 Multi-tenant resource management IBM Corporation

59 Index/Explore BigInsights and Enterprise Data Discover the Data Provide discovery and navigation Connect securely to the applications that manage the data regardless of location Analyze Structured & Unstructured Data Reveal themes & visualize relationships Identify the value of the data Recognize users of the data Establish context of data usage Collaborate on the Data Augment the data with user knowledge Create personalized views of the data Identify ongoing user and system integration points IBM Corporation

60 Social Data Analytics Accelerator What does it do? Provides the ability to analyze large volumes of various types of social media data with real-time processing Social Data Analytics Why should you care? It enables clients to easily obtain insights necessary for: Effective/targeted Marketing Campaigns Timely product/marketing decisions Gaining competitive Intelligence Building customer retention and new customer acquisition programs Example Application : Movie Campaign Effectiveness Large Movie Studio wants to understand reaction of movie commercials around events (e.g., SuperBowl) Over 30 Million social media consumer profiles built and used in the analysis Real-time summary of insights correlated with the airing of the commercial IBM Corporation

61 Machine Data Analytics Accelerator What does it do? Provides the ability to ingest, parse and extract a wide variety of machine data Faceted search enables easy navigation and discovery Visualization enables easy analysis of the data Why should you care? Example Application: Facilities Management Machine Data Analytics It enables clients to gain insights, beyond what was traditionally possible, into operations, customer experience, transactions and behavior, processing machine data in minutes instead of days and weeks With these insights, clients can: Proactively plan to increase operational efficiency Troubleshoot problems and investigate security incidents Monitor end-to-end infrastructure to avoid service degradation or outages Use real time data from building devices such as meters, sensors and motion detectors to monitor and manage power usage IBM Corporation

62 IBM Information Software Management Vestas (European Energy Company) Business Challenge Analyze large volumes of public and private weather data for alternative energy business Existing high-performance computing hardware, limited staff Project objectives Leverage large volume (2+ PB) of weather data to optimize placement of turbines. Reduce modeling time from weeks to hours.. Optimize ongoing operations. The benefits Reliability, security, scalability, and integration needs fulfilled Standard enterprise software support Single-vendor solution for software, hardware, storage, support Solution Components: IBM InfoSphere BigInsights Enterprise Edition: - Scalability (data volumes) - Jaql (query support and extensibility) - IBM-provided file system (support existing hardware & apps) - Strong runtime performance IBM xseries hardware videos IBM Corporation

63 IBM Information Software Management Global Technology Firm Business challenge Analyze & correlate log records across to improve service Detect & predict failure patterns; initiate automated or manual preventive actions Project objectives Process variety of logs generated by multiple systems, devices in distinct formats (XML, text, ) Accommodate large data volumes growing at ~1 TB /day Parse logs, identify/extract entities of interest, index as needed, cluster data by sessions, detect & visualize patterns through GUI Report on Top X, Bottom X patterns; support exploratory queries The benefits IBM analytics and tooling simplify development and speed time-tovalue. You have done in 2 weeks what I have been trying to generalize for the past 6 months. -- Customer project leader Solution Components: IBM InfoSphere BigInsights Enterprise Edition including: Spreadsheet data discovery and visualization Text analytics runtime and tooling Flexible query support Scalability IBM InfoSphere Streams IBM Corporation

64 IBM Information Software Management Global Media Firm Business challenge Identify unauthorized content streaming (piracy) Quantify annual revenue loss, analyze trends Monitor social media sites (e.g., Twitter, Facebook) to identify dissemination of pirated content. Time sensitive! Project objectives: Analyze high variety of data. Volumes unclear. Start with social media data for 1 year. Use text analytics to Qualify & classify info of interest (complex, custom set of rules) Search for URLs with live streaming of target data, sentiment,. Future potential for video analysis Solution Components: IBM InfoSphere BigInsights Enterprise Edition including: Text analytics runtime and tooling Custom text annotators Flexible query support Scalability The benefits Improved understanding of business exposures through advanced analytics Improved decision-making process Scalable, flexible infrastructure for handling future analytic needs IBM Corporation

65 BigInsights v2.1 HA architecture Recommended three node HA setup, allows customer to tolerate two consecutive hardware failures HA Manager does live check and service dependency check on each HA node. HA Manager Master orchestras service life cycle and decides which service runs on a certain node BigInsights Management Console HA Node (JT active) HA Mgr (Master) JobTracker HA Node (NN Active) HA Manager NameNode Failover Optional Standby Node HA Manger Failover IBM Corporation

66 Additional performance enhancements Flexible job scheduler option Optimize response time for small jobs Available in addition to FAIR, FIFO scheduling Efficient processing of compressed text data Use multiple Map tasks (vs Hadoop default of 1) for processing compressed text files Enabled through BigInsights LZO-based compression technology Automatic with Jaql; programming option with Java MapReduce IBM Corporation

67 Big Data in the financial sector: adoption IBM Corporation

68 Big Data in the financial sector: data sources and analytics IBM Corporation

69 Example Analysis : Extraction from Twitter messages Extract intent, interests, life events and micro segmentation attributes Monetizable Intent Relocation Name, Birth Day Location I had an iphone, but it's (I've no idea where it's)!want a blackberry im moving to miami in 3 months. i look foward to the new good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Sylvia ;) I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others While accounting for less relevant messages Subtle Spam, Advertising I think deserves his 2 AMAZING songs in top ten!!! Buy them on itunes Looking to buy a phone? WiFi Cell Phones, Windows Mobile Sarcasm, Wishful Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol had so much fun today! I want to buy a million dollar house with a wrap around porch wading river on the long island sound, ha i wish! IBM Corporation

70 Enterprise class IBM Software BigInsights Editions prior to V2.1.2 IBM reference only PureData for Hadoop - Appliance simplicity Quick Start Edition Free. Non-production only Apache Hadoop Basic Edition Free download - Jaql - Integrated install Enterprise Edition Sold by # of terabytes managed - Accelerators - Performance Optimization - Dashboards - Pre-built applications - Text analytics - Spreadsheet-style tool - RDBMS, warehouse connectivity - Administrative tools, security -- Eclipse development tools -- Enterprise Integration -- Integrated web console.-monitoring and alerts -.Big SQL, Big R,... PureData for Hadoop brings BigInsights as an appliance form factor to the market Breadth of capabilities IBM Corporation