Pro Apache Hadoop. Second Edition. Sameer Wadkar Madhu Siddalingaiah
Pro Apache Hadoop

Copyright 2014 by Sameer Wadkar and Madhu Siddalingaiah

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk):
ISBN-13 (electronic):

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Publisher: Heinz Weinheimer
Lead Editor: Jonathan Gennick
Technical Reviewer: Vimlesh Om Mittal
Editorial Board: Steve Anglin, Mark Beckner, Ewan Buckingham, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Jonathan Hassell, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Jill Balzano
Copy Editor: Nancy Sixsmith
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Cover Designer: Anna Ishchenko

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit

Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales eBook Licensing web page at

Any source code or other supplementary material referenced by the author in this text is available to readers at

For detailed information about how to locate your book's source code, go to
This book is dedicated to the Open Source Community whose contributions have made Software Development such a vibrant and exciting profession.
Sameer

To Sasha, Eashan, and Shilpa.
Madhu
Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Motivation for Big Data
Chapter 2: Hadoop Concepts
Chapter 3: Getting Started with the Hadoop Framework
Chapter 4: Hadoop Administration
Chapter 5: Basics of MapReduce Development
Chapter 6: Advanced MapReduce Development
Chapter 7: Hadoop Input/Output
Chapter 8: Testing Hadoop Programs
Chapter 9: Monitoring Hadoop
Chapter 10: Data Warehousing Using Hadoop
Chapter 11: Data Processing Using Pig
Chapter 12: HCatalog and Hadoop in the Enterprise
Chapter 13: Log Analysis Using Hadoop
Chapter 14: Building Real-Time Systems Using HBase
Chapter 15: Data Science with Hadoop
Chapter 16: Hadoop in the Cloud
Chapter 17: Building a YARN Application
Appendix A: Installing Hadoop
Appendix B: Using Maven with Eclipse
Appendix C: Apache Ambari
Index
Contents

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Motivation for Big Data
What Is Big Data?; Key Idea Behind Big Data Techniques; Data Is Distributed Across Several Nodes; Applications Are Moved to the Data; Data Is Processed Local to a Node; Sequential Reads Preferred Over Random Reads; An Example; Big Data Programming Models; Massively Parallel Processing (MPP) Database Systems; In-Memory Database Systems; MapReduce Systems; Bulk Synchronous Parallel (BSP) Systems; Big Data and Transactional Systems; How Much Can We Scale?; A Compute-Intensive Example; Amdahl's Law; Business Use-Cases for Big Data; Summary

Chapter 2: Hadoop Concepts
Introducing Hadoop; Introducing the MapReduce Model; Components of Hadoop; Hadoop Distributed File System (HDFS); Secondary NameNode; TaskTracker; JobTracker; Hadoop Components of YARN; HDFS High Availability; Summary

Chapter 3: Getting Started with the Hadoop Framework
Types of Installation; Stand-Alone Mode; Pseudo-Distributed Cluster; Multinode Node Cluster Installation; Preinstalled Using Amazon Elastic MapReduce; Setting up a Development Environment with a Cloudera Virtual Machine; Components of a MapReduce program; Your First Hadoop Program; Prerequisites to Run Programs in Local Mode; WordCount Using the Old API; Building the Application; Running WordCount in Cluster Mode; WordCount Using the New API; Building the Application; Running WordCount in Cluster Mode; Third-Party Libraries in Hadoop Jobs; Summary

Chapter 4: Hadoop Administration
Hadoop Configuration Files; Configuring Hadoop Daemons; Precedence of Hadoop Configuration Files; Diving into Hadoop Configuration Files; core-site.xml; hdfs-*.xml; mapred-site.xml; yarn-site.xml; Memory Allocations in YARN; Scheduler; Capacity Scheduler; Fair Scheduler; Fair Scheduler Configuration; yarn-site.xml Configurations; Allocation File Format and Configurations; Determine Dominant Resource Share in drf Policy; Slaves File; Rack Awareness; Providing Hadoop with Network Topology; Cluster Administration Utilities; Check the HDFS; Command-Line HDFS Administration; Rebalancing HDFS Data; Copying Large Amounts of Data from the HDFS; Summary

Chapter 5: Basics of MapReduce Development
Hadoop and Data Processing; Reviewing the Airline Dataset; Preparing the Development Environment; Preparing the Hadoop System; MapReduce Programming Patterns; Map-Only Jobs (SELECT and WHERE Queries); Problem Definition: SELECT Clause; Problem Definition: WHERE Clause; Map and Reduce Jobs (Aggregation Queries); Problem Definition: GROUP BY and SUM Clauses; Improving Aggregation Performance Using the Combiner; Problem Definition: Optimized Aggregators; Role of the Partitioner; Problem Definition: Split Airline Data by Month; Bringing it All Together; Summary

Chapter 6: Advanced MapReduce Development
MapReduce Programming Patterns; Introduction to Hadoop I/O; Problem Definition: Sorting; Problem Definition: Analyzing Consecutive Records; Problem Definition: Join Using MapReduce; Problem Definition: Join Using Map-Only jobs; Writing to Multiple Output Files in a Single MR Job; Collecting Statistics Using Counters; Summary

Chapter 7: Hadoop Input/Output
Compression Schemes; What Can Be Compressed?; Compression Schemes; Enabling Compression; Inside the Hadoop I/O processes; InputFormat; OutputFormat; Custom OutputFormat: Conversion from Text to XML; Custom InputFormat: Consuming a Custom XML file; Hadoop Files; SequenceFile; MapFiles; Avro Files; Summary

Chapter 8: Testing Hadoop Programs
Revisiting the Word Counter; Introducing MRUnit; Installing MRUnit; MRUnit Core Classes; Writing an MRUnit Test Case; Testing Counters; Features of MRUnit; Limitations of MRUnit; Testing with LocalJobRunner; Limitations of LocalJobRunner; Testing with MiniMRCluster; Setting up the Development Environment; Example for MiniMRCluster; Limitations of MiniMRCluster; Testing MR Jobs with Access Network Resources; Summary

Chapter 9: Monitoring Hadoop
Writing Log Messages in Hadoop MapReduce Jobs; Viewing Log Messages in Hadoop MapReduce Jobs; User Log Management in Hadoop 2.x; Log Storage in Hadoop 2.x; Log Management Improvements; Viewing Logs Using Web Based UI; Command-Line Interface; Log Retention; Hadoop Cluster Performance Monitoring; Using YARN REST APIs; Managing the Hadoop Cluster Using Vendor Tools; Ambari Architecture; Summary

Chapter 10: Data Warehousing Using Hadoop
Apache Hive; Installing Hive; Hive Architecture; Metastore; Compiler Basics; Hive Concepts; HiveQL Compiler Details; Data Definition Language; Data Manipulation Language; External Interfaces; Hive Scripts; Performance; MapReduce Integration; Creating Partitions; User-Defined Functions; Impala; Impala Architecture; Impala Features; Impala Limitations; Shark; Shark/Spark Architecture; Summary

Chapter 11: Data Processing Using Pig
An Introduction to Pig; Running Pig; Executing in the Grunt Shell; Executing a Pig Script; Embedded Java Program; Pig Latin; Comments in a Pig Script; Execution of Pig Statements; Pig Commands; User-Defined Functions; Eval Functions Invoked in the Mapper; Eval Functions Invoked in the Reducer; Writing and Using a Custom FilterFunc; Comparison of PIG versus Hive; Crunch API; How Crunch Differs from Pig; Sample Crunch Pipeline; Summary

Chapter 12: HCatalog and Hadoop in the Enterprise
HCatalog and Enterprise Data Warehouse Users; HCatalog: A Brief Technical Background; HCatalog Command-Line Interface; WebHCat; HCatalog Interface for MapReduce; HCatalog Interface for Pig; HCatalog Notification Interface; Security and Authorization in HCatalog; Bringing It All Together; Summary

Chapter 13: Log Analysis Using Hadoop
Log File Analysis Applications; Web Analytics; Security Compliance and Forensics; Monitoring and Alerts; Internet of Things; Analysis Steps; Load; Refine; Visualize; Apache Flume; Core Concepts; Netflix Suro; Cloud Solutions; Summary

Chapter 14: Building Real-Time Systems Using HBase
What Is HBase?; Typical HBase Use-Case Scenarios; HBase Data Model; HBase Logical or Client-Side View; Differences Between HBase and RDBMSs; HBase Tables; HBase Cells; HBase Column Family; HBase Commands and APIs; Getting a Command List: help Command; Creating a Table: create Command; Adding Rows to a Table: put Command; Retrieving Rows from the Table: get Command; Reading Multiple Rows: scan Command; Counting the Rows in the Table: count Command; Deleting Rows: delete Command; Truncating a Table: truncate Command; Dropping a Table: drop Command; Altering a Table: alter Command; HBase Architecture; HBase Components; Compaction and Splits in HBase; Compaction; HBase Configuration: An Overview; hbase-default.xml and hbase-site.xml; HBase Application Design; Tall vs. Wide vs. Narrow Table Design; Row Key Design; HBase Operations Using Java API; HBase Treats Everything as Bytes; Create an HBase Table; Administrative Functions Using HBaseAdmin; Accessing Data Using the Java API; HBase MapReduce Integration; A MapReduce Job to Read an HBase Table; HBase and MapReduce Clusters; Scenario I: Frequent MapReduce Jobs Against HBase Tables; Scenario II: HBase and MapReduce have Independent SLAs; Summary

Chapter 15: Data Science with Hadoop
Hadoop Data Science Methods; Apache Hama; Bulk Synchronous Parallel Model; Hama Hello World!; Monte Carlo Methods; K-Means Clustering; Apache Spark; Resilient Distributed Datasets (RDDs); Monte Carlo with Spark; KMeans with Spark; RHadoop; Summary

Chapter 16: Hadoop in the Cloud
Economics; Self-Hosted Cluster; Cloud-Hosted Cluster; Elasticity; On Demand; Bid Pricing; Hybrid Cloud; Logistics; Ingress/Egress; Data Retention; Security; Cloud Usage Models; Cloud Providers; Amazon Web Services; Google Cloud Platform; Microsoft Azure; Choosing a Cloud Vendor; Case Study: Amazon Web Services; Elastic MapReduce; Elastic Compute Cloud; Summary

Chapter 17: Building a YARN Application
YARN: A General-Purpose Distributed System; YARN: A Quick Review; Creating a YARN Application; POM Configuration; DownloadService.java Class; Client.java; Steps to Launch the Application Master from the Client; ApplicationMaster.java; Communication Protocol between Application Master and Resource Manager: Application Master Protocol; Node Manager Communication Protocol: Container Management Protocol; Steps to Launch the Worker Tasks; Executing the Application Master; Launch the Application in Un-Managed Mode; Launch the Application in Managed Mode; Summary

Appendix A: Installing Hadoop
Installing Hadoop on Windows; Preparing the Installation Environment; Building Hadoop for Windows; Installing Hadoop for Windows; Configuring Hadoop; Preparing the Hadoop Cluster; Starting HDFS; Starting MapReduce (YARN); Verifying that the Cluster Is Running; Testing the Cluster; Installing Hadoop on Linux

Appendix B: Using Maven with Eclipse
A Quick Introduction to Maven; Creating a Maven Project; Using Maven with Eclipse; Installing the m2e Maven Eclipse Plug-in; Creating a Maven Project from Eclipse; Building a Maven Project from Eclipse

Appendix C: Apache Ambari
Hadoop Components Supported by Apache Ambari; Installing Apache Ambari; Trying the Ambari Sandbox on Your OS

Index
About the Authors

Sameer Wadkar has more than 16 years of experience in software architecture and development. He has a bachelor's degree in electrical engineering and an MBA in finance from Mumbai University, a postgraduate diploma in software engineering from the National Center of Software Technology (now Center for Development of Advanced Computing), and a master's degree in applied and computational mathematics from Johns Hopkins University. He has implemented distributed systems and high-traffic web sites for a wide variety of clients ranging from federal agencies to investment banking companies. Sameer has been actively working on Hadoop/HBase implementations since 2011 and is also an open-source contributor with a GitHub page. His open-source contributions include a version of the popular text mining algorithm known as Latent Dirichlet Allocation, which scales to millions of documents on a single machine. He is an avid chess player.

Madhu Siddalingaiah is a technology consultant with 25 years of experience in a variety of business domains, including aerospace, health care, financial, energy, defense, and scientific research. Over the years, he has specialized in electrical engineering, Internet technologies, and Big Data. More recently, Madhu has delivered several high-profile Big Data systems and solutions. He earned his physics degree from the University of Maryland. Outside of his profession, Madhu is a private helicopter pilot, enjoys travel, and participates in a constantly growing list of hobbies.
About the Technical Reviewer

Vimlesh Om Mittal has more than 14 years of technology implementation experience and is a Cloudera Certified Developer for Apache Hadoop CDH4. He has a very broad technology background, and his experience includes projects involving modernizing business processes, custom application development, ETL development, digital application development, business intelligence, database design, and data conversions. Vimlesh's experience is a strong blend of architecting robust information management systems, solution architecture, and software development in several development frameworks such as Big Data using the Hadoop ecosystem; data ingestion/acquisition; ETL; reporting/analytics; .NET; J2EE; and scripting languages in different vendor technologies such as Cloudera, SAS, IBM, Oracle, Microsoft, iOS, and BEA. Vimlesh's focus also includes solutions for financial services data, mobile application design, and development. He is also well-versed in all aspects of the software development life cycle and has extensive experience in project management disciplines.
Acknowledgments

Hadoop has come a long way over the past decade with a large number of vendors and independent developers contributing to it. Hadoop in turn depends on a large number of open source libraries. We would like to acknowledge the effort of all the open source contributors who have contributed to Hadoop. We would also like to acknowledge the blog writers and professionals who contribute to answering Hadoop-related queries on forums. Their participation contributes to making a complex product such as Hadoop mainstream and easier to use.

We want to thank the Apress staff members who have applied their expertise to make this book into something readable. We also want to thank our family, friends, and colleagues for supporting us through the writing of this book.
Introduction

This book is designed to be a concise guide to using the Hadoop software. Despite being around for more than half a decade, Hadoop development is still a very stressful yet very rewarding task. The documentation has come a long way since the early years, and Hadoop is growing rapidly as its adoption is increasing in the Enterprise. Hadoop 2.0 is based on the YARN framework, which is a significant rewrite of the underlying Hadoop platform. It has been our goal to distill the hard lessons learned while implementing Hadoop for clients in this book. As authors, we like to delve deep into the Hadoop source code to understand why Hadoop does what it does and the motivations behind some of its design decisions. We have tried to share this insight with you. We hope that not only will you learn Hadoop in depth but also gain fresh insight into the Java language in the process.

This book is about Big Data in general and Hadoop in particular. It is not possible to understand Hadoop without appreciating the overall Big Data landscape. It is written primarily from the point of view of a Hadoop developer and requires an intermediate-level ability to program using Java. It is designed for practicing Hadoop professionals. You will learn several practical tips on how to use the Hadoop software gleaned from our own experience in implementing Hadoop-based systems.

This book provides step-by-step instructions and examples that will take you from just beginning to use Hadoop to running complex applications on large clusters of machines. Here's a brief rundown of the book's contents:

Chapter 1 introduces you to the motivations behind Big Data software, explaining various Big Data paradigms.

Chapter 2 is a high-level introduction to Hadoop 2.0 or YARN. It introduces the key concepts underlying the Hadoop platform.

Chapter 3 gets you started with Hadoop. In this chapter, you will write your first MapReduce program.

Chapter 4 introduces the key concepts behind the administration of the Hadoop platform.

Chapters 5, 6, and 7, which form the core of this book, do a deep dive into the MapReduce framework. You learn all about the internals of the MapReduce framework. We discuss the MapReduce framework in the context of the most ubiquitous of all languages, SQL. We emulate common SQL functions such as SELECT, WHERE, GROUP BY, and JOIN using MapReduce. One of the most popular applications for Hadoop is ETL offloading. These chapters enable you to appreciate how MapReduce can support common data-processing functions. We discuss not just the API but also the more complicated concepts and internal design of the MapReduce framework.

Chapter 8 describes the testing frameworks that support unit/integration testing of MapReduce programs.

Chapter 9 describes logging and monitoring of the Hadoop Framework.

Chapter 10 introduces the Hive framework, the data warehouse framework on top of MapReduce.

Chapter 11 introduces the Pig and Crunch frameworks. These frameworks enable users to create data-processing pipelines in Hadoop.

Chapter 12 describes the HCatalog framework, which enables Enterprise users to access data stored in the Hadoop file system using commonly known abstractions such as databases and tables.

Chapter 13 describes how Hadoop can be used for streaming log analysis.

Chapter 14 introduces you to HBase, the NoSQL database on top of Hadoop. You learn about use-cases that motivate the use of HBase.

Chapter 15 is a brief introduction to data science. It describes the main limitations of MapReduce that make it inadequate for data science applications. You are introduced to new frameworks such as Spark and Hama that were developed to circumvent MapReduce limitations.

Chapter 16 is a brief introduction to using Hadoop in the cloud. It enables you to work on a true production grade Hadoop cluster from the comfort of your living room.

Chapter 17 is a whirlwind introduction to the key addition to Hadoop 2.0: the capability to develop your own distributed frameworks such as MapReduce on top of Hadoop. We describe how you can develop a simple distributed download service using Hadoop 2.0.
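The rundown for Chapters 5, 6, and 7 mentions emulating SQL functions such as SELECT, WHERE, and GROUP BY using MapReduce. The following is a minimal in-memory sketch of that idea, written in plain Python rather than against the Hadoop API. The `run_job` helper, the `FLIGHTS` records, and their fields (carrier, month, delay) are hypothetical names chosen for illustration and do not come from the book.

```python
from collections import defaultdict

# Hypothetical records: (carrier, month, delay_minutes).
# These values are illustrative assumptions, not data from the book.
FLIGHTS = [
    ("AA", 1, 10),
    ("AA", 2, 25),
    ("UA", 1, 5),
    ("UA", 1, 40),
    ("DL", 2, 0),
]


def run_job(records, mapper, reducer=None):
    """Tiny in-memory stand-in for a MapReduce job (not the Hadoop API)."""
    # Map phase: each record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        # Map-only job: the mapper output is the final output.
        return mapped
    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: keys are visited in sorted order.
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def select_where_mapper(record):
    # Roughly: SELECT carrier, delay FROM flights WHERE month = 1
    carrier, month, delay = record
    if month == 1:
        yield carrier, delay


def group_by_mapper(record):
    # Emit the GROUP BY column as the key and the summed column as the value.
    carrier, _month, delay = record
    yield carrier, delay


def sum_reducer(key, values):
    # Roughly: SELECT carrier, SUM(delay) FROM flights GROUP BY carrier
    yield key, sum(values)


january = run_job(FLIGHTS, select_where_mapper)
totals = run_job(FLIGHTS, group_by_mapper, sum_reducer)
print(january)
print(totals)
```

In this sketch, projection and filtering need only a mapper, while aggregation relies on the shuffle step to bring together all values that share a key before the reducer runs.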
More informationSAS & HADOOP ANALYTICS ON BIG DATA
SAS & HADOOP ANALYTICS ON BIG DATA WHY HADOOP? OPEN SOURCE MASSIVE SCALE FAST PROCESSING COMMODITY COMPUTING DATA REDUNDANCY DISTRIBUTED WHY HADOOP? Hadoop will soon become a replacement complement to:
More informationKonica Minolta Business Innovation Center
Konica Minolta Business Innovation Center Advance Technology/Big Data Lab May 2016 2 2 3 4 4 Konica Minolta BIC Technology and Research Initiatives Data Science Program Technology Trials (Technology partner
More informationBIT300 Integration Technology ALE
BIT300 Integration Technology ALE. COURSE OUTLINE Course Version: 10 Course Duration: 3 Day(s) SAP Copyrights and Trademarks 2016 SAP SE or an SAP affiliate company. All rights reserved. No part of this
More informationAurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect
Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect 2005 Concert de Coldplay 2014 Concert de Coldplay 90% of the world s data has been created over the last two years alone 1 1. Source
More informationDocument Center and Document Management in S/4HANA Frank Spiegel, SAP October 2016
Document Center and Document Management in S/4HANA Frank Spiegel, SAP October 2016 Disclaimer The information in this presentation is confidential and proprietary to SAP and may not be disclosed without
More informationFamilienunternehmen und KMU
Familienunternehmen und KMU Edited by A. Hack, Berne A. Calabrò, Witten/Herdecke H. Frank, Vienna F. W. Kellermanns, Tennessee T. Zellweger, St. Gallen Both Family Firms and Small and Medium Sized Enterprises
More informationIBM WebSphere Service Registry and Repository, Version 6.0
Helping you get the most business value from your SOA IBM Repository, Version 6.0 Highlights Provide clear visibility into service Use other standard registries associations and relationships while and
More informationManagement for Professionals
Management for Professionals For further volumes: http://www.springer.com/series/10101 ThiS is a FM Blank Page Erik Jannesson Fredrik Nilsson Birger Rapp Editors Strategy, Control and Competitive Advantage
More informationCommon Customer Use Cases in FSI
Common Customer Use Cases in FSI 1 Marketing Optimization 2014 2014 MapR MapR Technologies Technologies 2 Fortune 100 Financial Services Company 104M CARD MEMBERS 3 Financial Services: Recommendation Engine
More informationMicroarrays in Diagnostics and Biomarker Development
Microarrays in Diagnostics and Biomarker Development . Editor Microarrays in Diagnostics and Biomarker Development Current and Future Applications Editor CoReBio PACA Luminy Science Park 13288 Marseille
More informationSAP Business Client 6.5
SAP Business Client 6.5 Product Management P&I Technology Core Platform SAP SE INTRODUCTION SINGLE POINT OF ENTRY to SAP business applications for desktop users SAP BUSINESS CLIENT HARMONIZED ACCESS to
More informationTERP10. SAP ERP Integration of Business Processes COURSE OUTLINE. Course Version: 16 Course Duration: 10 Day(s)
TERP10 SAP ERP Integration of Business Processes. COURSE OUTLINE Course Version: 16 Course Duration: 10 Day(s) SAP Copyrights and Trademarks 2015 SAP SE. All rights reserved. No part of this publication
More informationSAP HANA Cloud Connector Solution Brief
SAP HANA Cloud Connector Solution Brief Applies to: SAP HANA Cloud Connector, SAP HANA Cloud Platform Summary This document is a solution brief about the SAP HANA Cloud connector, the secure and reliable
More informationOracle Big Data Discovery The Visual Face of Big Data
Oracle Big Data Discovery The Visual Face of Big Data Today's Big Data challenge is not how to store it, but how to make sense of it. Oracle Big Data Discovery is a fundamentally new approach to making
More informationExercises in Environmental Physics
Exercises in Environmental Physics Valerio Faraoni Exercises in Environmental Physics Valerio Faraoni Physics Department Bishop s University Lennoxville, Quebec J1M 1Z7 Canada vfaraoni@cs-linux.ubishops.ca
More informationWhat s New for Oracle Big Data Cloud Service. Topics: Oracle Cloud. What's New for Oracle Big Data Cloud Service Version
Oracle Cloud What's New for Oracle Big Data Cloud Service Version 17.4.3 E79543-14 November 2017 What s New for Oracle Big Data Cloud Service This document describes what's new in Oracle Big Data Cloud
More informationC4C10. SAP Hybris Cloud for Customer Administration COURSE OUTLINE. Course Version: 20 Course Duration: 3 Day(s)
C4C10 SAP Hybris Cloud for Customer Administration. COURSE OUTLINE Course Version: 20 Course Duration: 3 Day(s) SAP Copyrights and Trademarks 2017 SAP SE or an SAP affiliate company. All rights reserved.
More informationUsing the Blaze Engine to Run Profiles and Scorecards
Using the Blaze Engine to Run Profiles and Scorecards 1993, 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationPro EDI in BizTalk Server 2006 R2
Pro EDI in BizTalk Server 2006 R2 Electronic Document Interchange Solutions Mark Beckner Pro EDI in BizTalk Server 2006 R2: Electronic Document Interchange Solutions Copyright 2007 by Mark Beckner All
More informationHortonworks Apache Hadoop subscriptions ( Subsciptions ) can be purchased directly through HP and together with HP Big Data software products.
HP and Hortonworks Data Platform Hortonworks Apache Hadoop subscriptions ( Subsciptions ) can be purchased directly through HP and together with HP Big Data software products. Hortonworks is a major contributor
More informationDigging into Hadoop-based Big Data Architectures
52 Digging into Hadoop-based Big Data Architectures Allae Erraissi 1, Abdessamad Belangour 2 and Abderrahim Tragha 3 1,2,3 Laboratory of Information Technology and Modeling LTIM, Hassan II University,
More informationTHR82. SAP SuccessFactors Performance and Goals Academy COURSE OUTLINE. Course Version: 71 Course Duration: 15 Day(s)
THR82 SAP SuccessFactors Performance and Goals Academy. COURSE OUTLINE Course Version: 71 Course Duration: 15 Day(s) SAP Copyrights and Trademarks 2017 SAP SE or an SAP affiliate company. All rights reserved.
More informationMapR: Solution for Customer Production Success
2015 MapR Technologies 2015 MapR Technologies 1 MapR: Solution for Customer Production Success Big Data High Growth 700+ Customers Cloud Leaders Riding the Wave with Hadoop The Big Data Platform of Choice
More informationSAP CENTRAL PROCESS SCHEDULING BY REDWOOD: FREQUENTLY ASKED QUESTIONS
SAP NetWeaver SAP CENTRAL PROCESS SCHEDULING BY REDWOOD: FREQUENTLY ASKED QUESTIONS Exploring the Central Process-Scheduling Software Developed by Redwood Software for SAP NetWeaver As IT landscapes become
More informationAnalyzing Data with Power BI
Course 20778A: Analyzing Data with Power BI Course Outline Module 1: Introduction to Self-Service BI Solutions Introduces business intelligence (BI) and how to self-serve with BI. Introduction to business
More informationOracle Service Cloud. New Feature Summary
Oracle Service Cloud New Feature Summary February 2017 TABLE OF CONTENTS REVISION HISTORY... 3 ORACLE SERVICE CLOUD FEBRUARY RELEASE OVERVIEW... 4 WEB CUSTOMER SERVICE... 4 Widget Inspector... 4 Community
More informationAccelerate Your Digital Transformation
SAP Value Assurance Accelerate Your Digital Transformation Quick-Start Transformation with SAP Value Assurance Service Packages 1 / 17 Table of Contents 2017 SAP SE or an SAP affiliate company. All rights
More informationRemote Support Platform for SAP Business One. June 2013 Partner External
Remote Support Platform for SAP Business One June 2013 Partner External Remote Support Platform Advantage for SAP Business One Run Better with RSP RSP has been engineered to mitigate implementation and
More informationMicrosoft Big Data. Solution Brief
Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,
More informationIBM Tivoli Monitoring
Monitor and manage critical resources and metrics across disparate platforms from a single console IBM Tivoli Monitoring Highlights Proactively monitor critical components Help reduce total IT operational
More informationAnswers Pdf For System Administrator
Customer Manager Interview Questions And Answers Pdf For System Administrator Top 50 Network Administrator Interview Questions everything by rote, but at least be able to have a resource you can get the
More informationStore Specific Consumer Prices
Store Specific Consumer Prices Copyright Copyright 2006 SAP AG. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission
More informationSAP Business One OnDemand. SAP Business One OnDemand Solution Overview
SAP Business One OnDemand SAP Business One OnDemand Solution Overview SAP Business One OnDemand Table of Contents 4 Executive Summary Introduction SAP Business One Today 8 A Technical Overview: SAP Business
More informationData Engineer. Purpose of the position. Organisational position / Virtual Team. Direct Reports: Date Created: July 2017
Data Engineer Business Unit: Strategy and Growth Reporting to: Data Engineering Direct Reports: None Date Created: July 2017 Purpose of the position The purpose of the Data Engineer role is to design,
More informationMapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia
MapR: Converged Data Pla3orm and Quick Start Solu;ons Robin Fong Regional Director South East Asia Who is MapR? MapR is the creator of the top ranked Hadoop NoSQL SQL-on-Hadoop Real Database time streaming
More informationTransform Application Performance Testing for a More Agile Enterprise
SAP Brief SAP Extensions SAP LoadRunner by Micro Focus Transform Application Performance Testing for a More Agile Enterprise SAP Brief Managing complex processes Technology innovation drives the global
More informationThe Alpine Data Platform
The Alpine Data Platform TABLE OF CONTENTS ABOUT ALPINE.... 2 ALPINE PRODUCT OVERVIEW... 3 PRODUCT ARCHITECTURE.... 5 SYSTEM REQUIREMENTS.... 6 ABOUT ALPINE DATA ADVANCED ANALYTICS FOR THE ENTERPRISE Alpine
More informationReal-Time Streaming: IMS to Apache Kafka and Hadoop
Real-Time Streaming: IMS to Apache Kafka and Hadoop - 2017 Scott Quillicy SQData Outline methods of streaming mainframe data to big data platforms Set throughput / latency expectations for popular big
More informationMachine-generated data: creating new opportunities for utilities, mobile and broadcast networks
APPLICATION BRIEF Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks Electronic devices generate data every millisecond they are in operation. This data is
More informationSapphireIMS 4.0 ITAM Suite Feature Specification
SapphireIMS 4.0 ITAM Suite Feature Specification Overview Organizations are realizing significant cost savings and improved planning capabilities through integration of the entire asset lifecycle. Strong
More informationAC210. New General Ledger Accounting (in SAP ERP) COURSE OUTLINE. Course Version: 10 Course Duration: 5 Day(s)
AC210 New General Ledger Accounting (in SAP ERP). COURSE OUTLINE Course Version: 10 Course Duration: 5 Day(s) SAP Copyrights and Trademarks 2013 SAP AG. All rights reserved. No part of this publication
More informationStuck with Power BI? Get Pyramid Starting at $0/month. Start Moving with the Analytics OS
Stuck with Power BI? Start Moving with the Analytics OS Get Pyramid 2018 Starting at $0/month Start Moving with Pyramid 2018 Break Away from Power BI Many organizations struggle to meet their analytic
More informationIntroducing Infor Xi/Ming.le for M3
Introducing Infor Xi/Ming.le for M3 Merit Consulting AS Sandnes/Norway karsten.hesselager@infor.com 1 2 Agenda Introducing Infor Xi Tech Stack Why have Infor developed Xi? What is included in Xi Demo of
More informationReal-time Streaming Insight & Time Series Data Analytic For Smart Retail
Real-time Streaming Insight & Time Series Data Analytic For Smart Retail Sudip Majumder Senior Director Development Industry IoT & Big Data 10/5/2016 Economic Characteristics of Data Data is the New Oil..then
More informationAC235. SAP Convergent Charging 4.1 COURSE OUTLINE. Course Version: 15 Course Duration: 5 Day(s)
AC235 SAP Convergent Charging 4.1. COURSE OUTLINE Course Version: 15 Course Duration: 5 Day(s) SAP Copyrights and Trademarks 2017 SAP SE or an SAP affiliate company. All rights reserved. No part of this
More informationCloud Based Analytics for SAP
Cloud Based Analytics for SAP Gary Patterson, Global Lead for Big Data About Virtustream A Dell Technologies Business 2,300+ employees 20+ data centers Major operations in 10 countries One of the fastest
More informationTurn Data into Business Value
Turn Data into Business Value Infinite Video Platform Analytics Layne Berg, Product Manager Steve Epstein, Distinguished Engineer June 21, 2017 Applying Big Data Analytics to Video Today, primarily descriptive
More information"Charting the Course... MOC A Retail n Brick and Mortar Stores: Development and Customization for Microsoft Dynamics AX 2012 R2.
Description Course Summary Microsoft Dynamic AX for Retail is an integrated solution that is designed for Microsoft Dynamics AX 2012 which can be used to manage a retail business from the head office to
More informationFrom Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN
From Information to Insight: The Big Value of Big Data Faire Ann Co Marketing Manager, Information Management Software, ASEAN The World is Changing and Becoming More INSTRUMENTED INTERCONNECTED INTELLIGENT
More informationDesign of material management system of mining group based on Hadoop
IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Design of material system of mining group based on Hadoop To cite this article: Zhiyuan Xia et al 2018 IOP Conf. Ser.: Earth Environ.
More informationBOCE10 SAP Crystal Reports for Enterprise: Fundamentals of Report Design
SAP Crystal Reports for Enterprise: Fundamentals of Report Design SAP BusinessObjects - Business Intelligence Course Version: 96 Revision A Course Duration: 2 Day(s) Publication Date: 14-01-2013 Publication
More informationSM72D. SAP Solution Manager 7.2 Delta Training COURSE OUTLINE. Course Version: 17 Course Duration: 3 Day(s)
SM72D SAP Solution Manager 7.2 Delta Training. COURSE OUTLINE Course Version: 17 Course Duration: 3 Day(s) SAP Copyrights and Trademarks 2017 SAP SE or an SAP affiliate company. All rights reserved. No
More informationSCM510 Inventory Management and Physical Inventory
SCM510 Inventory Management and Physical Inventory. COURSE OUTLINE Course Version: 10 Course Duration: 5 Day(s) SAP Copyrights and Trademarks 2013 SAP AG. All rights reserved. No part of this publication
More informationPrimavera Analytics and Primavera Data Warehouse Security Overview
Analytics and Primavera Data Warehouse Security Guide 15 R2 October 2015 Contents Primavera Analytics and Primavera Data Warehouse Security Overview... 5 Safe Deployment of Primavera Analytics and Primavera
More informationData Center Operating System (DCOS) IBM Platform Solutions
April 2015 Data Center Operating System (DCOS) IBM Platform Solutions Agenda Market Context DCOS Definitions IBM Platform Overview DCOS Adoption in IBM Spark on EGO EGO-Mesos Integration 2 Market Context
More information