Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks


APPLICATION BRIEF

Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks

Electronic devices generate data every millisecond they are in operation. This data is vast and complex, and it contains a wealth of useful information. Network-based service providers such as utility companies, mobile providers and broadcast networks are capturing, storing and analyzing machine-generated data to help them measure and improve customer experience, provide proactive support, predict and prevent service outages, and drive impactful product roadmaps.

For utility companies, sensor data from smart meters provides environmental information and corresponding resource usage for every point on the grid. For mobile providers, call detail records (CDRs) contain the details of each call or event that passes through a switch. As networks and devices generate more data at shorter intervals, and as user bases continue to grow, harnessing the volume, variety and velocity of this data presents a challenge.

Traditional Ways of Analyzing Customer Usage and Experience Are Slow and Cumbersome

Network-based service providers have traditionally relied on two mechanisms to understand customer usage and identify areas for improvement:

1. Direct customer feedback. Usually gathered through routine surveys or support outreach, this data is valuable yet often skewed because participation is self-selected. Example: if a cable company receives several calls from clients enrolled in the family package who want to add a sports channel to their plan, the company might notice a trend and decide to offer both sports and family channels in a single, bundled package. But it may be missing other cross-sell opportunities among clients who haven't called customer support.

2. Data collected from machines, typically weekly or monthly. Utilities and mobile or broadcast network providers must measure network usage, primarily to ensure correct billing. Example: utility companies measure energy consumption by reading each customer's meter once a month. If an error occurred on the customer's device in week one of the billing cycle, it might not be identified for another three weeks, and the utility loses revenue because it cannot accurately bill for energy consumption during that month.

Application Summary
Improve customer experience and deliver better products to market by collecting and analyzing machine-generated data from customers' devices.

Relevant Industries
- Network-based service providers
- Utilities
- Mobile communications
- Broadcast and cable

Key Challenges
- Capturing and analyzing streaming, machine-generated data
- Storing exponentially growing data volumes is expensive
- The sampling and aggregation required for analysis limit the complete view of the data

Key Benefits of Apache Hadoop
- Schema on read enables real-time loads of unstructured data
- Commodity hardware delivers petabyte-scale storage at low cost
- Fast, flexible analysis of massive unstructured data enables deeper insights

From a technology perspective, network-based service providers have traditionally relied largely on online analytical processing (OLAP) over relational database management systems (RDBMS) to collect and analyze this information. Today, operators increasingly face demands to capture machine-generated information at higher granularity and greater frequency than before, driven by business needs to better understand system operation as well as customers' behaviors and actions. Herein lies the challenge: traditional data warehouse environments struggle to capture and analyze machine-generated data of such high volume, velocity and variety.

Common Challenges

The data generated by machines is vast, complex and comes in many varieties. As these data sets expand into the petabyte range, often with billions of files each containing millions of records, extracting valuable information using traditional tools becomes practically impossible. Challenges typically center on four areas:

1. Ingestion of large data volumes leads to input/output (I/O) bottlenecks that affect service level agreements (SLAs); the data generated outweighs the system's ability to accommodate it.
2. Storage and protection of Big Data is expensive.
3. Processing and analysis of large, complex data sets is difficult, forcing trade-offs between data set size and depth of analysis.
4. Integration of this data into business processes requires constant upkeep of data formats and schemas within the requirements of operational SLAs.

Leveraging Hadoop to Ingest, Store and Analyze Data

Ingest and store more data, more affordably

Apache Hadoop is a linearly scalable, grid-based data management platform designed to run on commodity hardware. Every node added to a Hadoop cluster increases storage capacity, network bandwidth and processing power. Huge volumes of data can be ingested and stored more quickly than with traditional storage area network (SAN) and network attached storage (NAS) solutions, which require data to be funneled through a single system head. Scaling into the petabyte range with Hadoop is also cost effective because the platform is open source and clusters can be built on low-cost servers. Hadoop's cost-efficient scalability is especially valuable to network-based service providers attempting to capture machine-generated data, which is especially voluminous and must be ingested in near real time.

Perform deeper, more flexible analytics on larger data sets

Hadoop was designed to manage massive quantities of complex data, eliminating trade-offs between the size of the data set and time-to-answer during analysis. Because Hadoop is built on a highly scalable and flexible file system, any type of data can be loaded without altering its format, preserving data integrity and delivering complete analytic flexibility. Hadoop implements a "schema on read" approach: the context for the data is set when the question is asked. Data no longer needs to be transferred via time-consuming extract, transform, and load (ETL) processes from storage over a network to the database or analytic platform where computation takes place. Any type of data can be mined in its original format and combined with other data types to paint a more comprehensive picture. And the time required to perform deep analytics and mass transformations on very large quantities of complex data is dramatically reduced, for example, from 6-10 hours down to 5-10 minutes. As a result, users can get more reliable results by running queries against comprehensive data sets, rather than relying on samples or aggregations of a limited window of historical data.
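To make the "schema on read" idea concrete, here is a minimal, hypothetical sketch: raw CDR lines land in HDFS unmodified, and a mapper applies the field layout only when a job reads them. The pipe-delimited layout (caller|callee|timestamp|durationSecs) and the class name are illustrative assumptions, not part of this brief.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CdrDurationMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Schema on read: the raw line is parsed here, at query time.
        // Loading the data required no ETL and no up-front schema.
        String[] fields = line.toString().split("\\|");
        if (fields.length < 4) {
          return; // skip malformed records instead of failing the load
        }
        String caller = fields[0];
        long durationSecs = Long.parseLong(fields[3].trim());
        // Emit total call seconds per caller; a reducer would sum these.
        context.write(new Text(caller), new LongWritable(durationSecs));
      }
    }

Because the parsing lives in the job rather than in the storage layer, a different question next week can reinterpret the very same files with a different schema.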
In Summary

Hadoop allows network-based service providers to capture, store and analyze all of their data, machine-generated and otherwise, for more flexible, insightful and ad hoc analysis at petabyte scale.

Success Story

The Customer: Opower, a customer engagement platform for the utility industry.

Challenge: Capture, store, manage and analyze ever-increasing utility data streams from large smart meter deployments, receiving terabytes of advanced metering infrastructure (AMI) data along with data generated by smart appliances, interactive user applications, sensors, and social media.

Solution: Deployed the CDH platform of Apache Hadoop, HBase, Hive, and Sqoop to store, query and transform all time series and social data.

Results: Analysts and product managers have gained a 360-degree view into customer energy usage patterns, facilitating proactive customer support. Utilities can now foster better end-user relationships by offering advice and feedback based on individual usage patterns.

Effectively Leveraging the Power of Apache Hadoop: Shuffle and Snappy

Shuffle 101

Many of Hadoop's core mechanics come into play in the Shuffle phase, the intermediary step between Map and Reduce, and it often raises many questions. Several parameters can be adjusted to help this step run more smoothly. When a MapReduce job runs, the Mapper generates key/value pairs, but what actually happens on disk? Here is a simplified explanation of what happens while the mapper is operating:

Key/value pairs are first written to an in-memory buffer whose size is set by io.sort.mb. This buffer governs the intermediate performance of a job and is the typical point where the largest delays in a MapReduce job are encountered. When the buffer fills past the io.sort.spill.percent threshold, a spill thread begins writing keys and values to disk. If the buffer fills up before the spill is complete, the mapper blocks until the spill completes; the spill is deemed complete when the buffer is completely flushed. The mapper then continues to fill the buffer until another spill begins, and this loop continues until the mapper has emitted all of its key/value pairs. Setting a larger value for io.sort.mb allows more key/value pairs to fit in memory, yielding fewer spills; raising io.sort.spill.percent gives the spill thread a higher tolerance, resulting in fewer blocked mappers.

How much room is required for the accounting information that tracks those records? That is controlled by io.sort.record.percent. The room required for accounting is a function of the number of records, not the record size, so a higher number of records needs a larger share of the buffer reserved for accounting in order to reduce spills. This formula helps determine the fraction to reserve: 16 / (16 + R), where R is the average record size in bytes. Example: if the average map output record is 16 bytes, the fraction is 16 / (16 + 16) = 0.50. MAPREDUCE-64 now enables a job to derive this value automatically, using all available memory (up to io.sort.mb) for either the data or the accounting information; the value can still be set manually per job.

What happens when your cluster must merge multiple spill files into a single output file for the reducer? One very important parameter when running larger MapReduce jobs, whether over large data sets or on a large cluster, is io.sort.factor, which controls how many spill files are merged at once. When the number of spills exceeds it, multiple merge passes are required, and each extra pass re-reads the intermediate data set and re-invokes the combiner if one is configured. Increasing io.sort.factor (depending on the size of the cluster and job) decreases the number of merges required to produce the reducer input, cutting the number of spills and the number of times the combiner is called, ideally leaving only one full pass through the data set. So io.sort.factor is very important! Note: io.sort.factor defaults to 10, which leads to too many spills and merges as an organization begins to scale. It can be increased to 100 or more on clusters of more than 50 nodes or when processing extremely large data sets.

When it comes to getting maximum performance out of MapReduce jobs that crunch extreme data sets or run on a large cluster, enabling the Shuffle phase to run more efficiently will vastly improve overall performance.
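As a concrete illustration, these shuffle parameters can be set per job. The sketch below uses the old mapred API and the Hadoop 0.20-era parameter names discussed above; the specific values are illustrative starting points for a large job on a sizable cluster, not recommendations from this article.

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
      // Returns a JobConf carrying the shuffle settings discussed above.
      // All values are examples, not universal recommendations.
      public static JobConf tunedConf() {
        JobConf conf = new JobConf();
        conf.setInt("io.sort.mb", 256);                 // larger map-side buffer: fewer spills
        conf.setFloat("io.sort.spill.percent", 0.90f);  // start spilling later, tolerate a fuller buffer
        conf.setFloat("io.sort.record.percent", 0.50f); // e.g. 16/(16+16) for 16-byte records
        conf.setInt("io.sort.factor", 100);             // merge more spill files per pass
        return conf;
      }
    }

The right numbers depend on record size, heap size and cluster scale, which is why measuring spill counts in the job counters before and after a change is worthwhile.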

Snappy Compression

The Hadoop community often counteracts slow disk performance by leveraging a compression library. Compression reduces the size of the data footprint on disk, attaining faster reads, but there is an I/O vs. CPU trade-off: compression consumes CPU cycles. While compression reduces the storage footprint, the key is being able to leverage compressed files in computations. The compression library traditionally deployed on Hadoop is LZO, but system administrators had to manage the compression libraries themselves and deal with the quirks that resulted from running LZO on Hadoop. Then came Snappy, an open-source compression library that compresses files at around 250 MB/s. Snappy puts compression speed at a premium, with a small sacrifice in compressed size compared to other compression algorithms.

Snappy was a major addition to Cloudera's Distribution for Apache Hadoop (CDH) because of its outstanding performance and its ability to work with distributed file systems. Any Hadoop user understands that additions to the Hadoop stack can complicate things; LZO, for example, can be very difficult to manage when added to a Hadoop stack. Snappy, by contrast, is extremely easy to use with Hadoop. Hadoop users now have the ability to leverage an extremely fast compression algorithm to store and process their data, as this comparison on a 15.2MB input file shows:

Compression | Original File Size | Compressed Output | Compression Pct. | Compression Speed | Decompression Speed
Snappy      | 15.2MB             | 9.09MB            | 59.8%            | 109.0 MB/s        | 354.1 MB/s
ZLIB        | 15.2MB             | 5.44MB            | 35.8%            | 12.5 MB/s         | 160.4 MB/s
LZO         | 15.2MB             | 8.27MB            | 54.4%            | 125.6 MB/s        | 273.1 MB/s
LibLZF      | 15.2MB             | 8.29MB            | 54.6%            | 139.0 MB/s        | 322.4 MB/s
QuickLZ     | 15.2MB             | 8.34MB            | 54.9%            | 92.9 MB/s         | 67.9 MB/s
FastLZ      | 15.2MB             | 8.54MB            | 56.2%            | 51.3 MB/s         | 106.9 MB/s

Full comparison of Snappy vs. LZO, ZLIB, QuickLZ, FastLZ and LibLZF.

To use Snappy with a specific Hadoop job, add the compression codec to the job's Configuration:

    ...
    Configuration conf = new Configuration();
    // Compress map output
    conf.set("mapred.compress.map.output", "true");
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");
    // Compress MapReduce job output
    conf.set("mapred.output.compress", "true");
    conf.set("mapred.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");
    ...

Turning a Hadoop cluster into a high-performance distributed computing machine becomes much easier with Snappy compression, and for those using CDH3u1 or higher, the installation is already done!
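One practical note, standard Hadoop usage rather than something prescribed here: a Snappy-compressed plain file is not splittable on its own, so job output is typically written into a container format such as SequenceFile so downstream jobs can still split it. A minimal sketch using the same old mapred API as above:

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class SnappySequenceFileOutput {
      // Configures a job to emit Snappy block-compressed SequenceFiles.
      public static void configure(JobConf conf) {
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        // BLOCK compression groups many records per compressed block,
        // which compresses better than per-record compression.
        SequenceFileOutputFormat.setOutputCompressionType(
            conf, SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
      }
    }
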

Ensure Network Performance With Apache Hadoop

Companies across the globe rely heavily on their networks to ensure business continuity, performance and expansion, and IT departments must keep those networks accessible and secure. Companies are using Hadoop to detect threats and fraudulent activity and to identify potential trouble areas. This brief highlights how real companies use Hadoop to ensure network performance is maintained at all times.

Analyzing Network Data to Predict Failure

Utilities run big, expensive and complicated systems to generate power. Monitoring the health of the entire grid requires capturing and analyzing data from every utility, and even every generator, in the grid. One power company built a Hadoop cluster to capture and store the data streaming off all of the sensors in its network. As a result, the company can see and react to long-term trends and emerging problems in the grid that are not apparent in the instantaneous performance of any particular generator. Combining all of that data into a single repository and analyzing it together helps IT organizations better understand their infrastructure and improve efficiencies across the network.

Threat Analysis

A global developer of software and services that protect against computer viruses has amassed an enormous library of malware indexed by virus signatures. The vendor uses MapReduce to compare instances of malware to one another and to build higher-level models of the threats that different pieces of malware pose. The ability to examine all of the data comprehensively allows the company to build more robust tools for detecting known and emerging threats.

Threat Analysis (cont.)

Online retailers are particularly vulnerable to fraud and theft. Many use web logs to monitor user behavior on their sites. By tracking that activity, tracking IP addresses, and using knowledge of the location of individual visitors, these sites are able to recognize and prevent fraudulent activity.

Hadoop is a powerful platform for dealing with fraudulent and criminal activity. It is flexible enough to store all of the data that matters: message content, relationships among people and computers, patterns of activity. It is powerful enough to run sophisticated detection and prevention algorithms and to build complex models from historical data against which real-time activity can be monitored.

Why Cloudera

Cloudera is the leading provider of Hadoop-based software and services. Our open source software offering, Cloudera's Distribution for Apache Hadoop (CDH), is the industry's most popular means of deploying Hadoop. CDH is a platform for data management: it combines the leading Hadoop software and related projects and provides them as an integrated whole with common packaging, patching and documentation. Cloudera's professional services team is experienced at delivering high-value services to thousands of users across hundreds of implementations in a range of industries. Our customers use Cloudera's products and services to store, manage and analyze data on large Hadoop implementations.

Increase Revenue With Apache Hadoop

Large and successful companies are using Hadoop to perform powerful analyses of the data they collect. With the ability to store any kind of data from any source, inexpensively and at very large scale, Hadoop makes it possible to conduct types of analysis that would be impossible or impractical with any other database or data warehouse. Hadoop lowers costs and extracts more value from data. Companies looking to increase revenue can use Hadoop to analyze a wide variety of data and address several business challenges, including determining how customers are lost, predicting customer preferences, targeting ads better, and creating better promotional offers. This brief highlights how real companies are using Hadoop to solve these challenges.

Customer Churn Analysis

A large mobile carrier needed to analyze multiple data sources to understand how and why customers decided to terminate their service contracts. What issues were important, and how could the provider improve satisfaction and retain customers? The company used Hadoop to combine traditional transactional and event data with social network data. By examining call logs to see who spoke with whom, creating a graph of that social network, and analyzing it, the company was able to show that if people in a customer's social network left, the customer was more likely to depart too. Combining data in this way gave the provider a much better measure of the risk that a customer would leave, and improved planning for new products and network investments aimed at customer satisfaction. (A minimal sketch of this call-graph idea appears at the end of this brief.)

Recommendation Engine

A leading online dating service has to measure compatibility between individual members so that it can suggest good matches for potential relationships. The company combined survey information with demographic and web activity data to build a comprehensive picture of its customers. The data included a mix of complex and structured information, and the analytical system had to evolve continually to provide better recommendations over time. Hadoop allowed the company to incorporate more data over time, improving the compatibility scores that customers see. Hadoop's built-in parallelism and incremental scalability mean the company can size its system to meet the needs of its customer base and grow easily as new customers join.

Recommendation Engine (cont.)

A large content publisher and aggregator uses Hadoop to determine the most relevant content for each visitor. Many online retailers, and even manufacturers, rely on Hadoop to store and digest user purchase behavior and to produce recommendations for products a visitor might buy. In each of these cases, Hadoop combines log, transaction and other data to produce the recommendations.

Ad Targeting

Leading advertising networks select the ads best suited to a particular visitor. Ad targeting systems must understand user preference and behavior, estimate how interested a given user will be in each candidate ad, and choose the one that maximizes revenue. Optimization requires examining both the relevance of a given advertisement to a particular user and the collection of bids from the different advertisers who want to reach that visitor. One advertising exchange uses Hadoop to collect and analyze the stream of user activity coming off its servers; business analysts at the exchange see reports on the performance of individual ads and adjust the system to improve relevance and increase revenues immediately. A second exchange builds sophisticated models of user behavior in order to choose the right ad for a given visitor in real time, and by steadily refining those models it delivers much better targeted advertisements.

Point of Sale Transaction Analysis

A large retailer performing point-of-sale (PoS) transaction analysis needed to combine large quantities of PoS data with new and interesting data sources to forecast demand and improve the return on its promotional campaigns. The retailer built analytic applications on Hive, the SQL system for Hadoop, to perform the same analysis it had done on its data warehouse system, but over much larger quantities of data and at much lower cost.

More broadly, Hadoop makes an exceptional staging area for an enterprise data warehouse. It provides a place to capture and store new data sets, or data sets that have not yet been placed in the enterprise data warehouse. Hadoop can store all types of data, and it makes it easy for analysts to pose questions, develop hypotheses and explore the data for meaningful relationships and value.
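Returning to the customer churn example above: this brief does not describe the carrier's actual implementation, but the core signal (how many of a subscriber's contacts have already churned) maps naturally onto MapReduce. The sketch below is hypothetical; the pipe-delimited call-record layout (caller|callee|...), the churned.ids configuration key, and the class names are illustrative assumptions.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ChurnedContacts {

      // Map: for each call record "caller|callee", emit (caller, 1) when the
      // callee is a known churned subscriber. The churned set is loaded once
      // per task; a real job might ship it via the distributed cache.
      public static class CallMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Set<String> churned = new HashSet<>();

        @Override
        protected void setup(Context context) {
          // Illustrative stand-in for loading churned subscriber IDs.
          for (String id : context.getConfiguration()
              .getStrings("churned.ids", new String[0])) {
            churned.add(id);
          }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\\|");
          if (fields.length >= 2 && churned.contains(fields[1])) {
            context.write(new Text(fields[0]), ONE);
          }
        }
      }

      // Reduce: sum per subscriber; a high count flags an at-risk customer.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text subscriber, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
          int total = 0;
          for (IntWritable c : counts) {
            total += c.get();
          }
          context.write(subscriber, new IntWritable(total));
        }
      }
    }

A follow-on job could join these counts with account and billing data to produce the kind of churn-risk score the carrier used for retention planning.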