I. INTRODUCTION. 2013) massive knowledge sources which is able to be used in official statistics are:

Size: px
Start display at page:

Download "I. INTRODUCTION. 2013) massive knowledge sources which is able to be used in official statistics are:"

Transcription

1 International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : Data Analysis Using R and Hadoop Amit Rajbanshi, Birendra Kumar Sah, C. K. Raina Department of Computer Science and Engineering, Adesh College of Engineering & Technology, Chandigarh, Kharar, Punjab, India ABSTRACT Analyzing and managing huge information may be very hard exploitation classical means like electronic data service management systems or desktop package package packages for statistics and image. Instead, huge information desires huge clusters with an entire heap or even thousands of computing nodes. Official statistics is progressively} considering huge information for clarification new statistics as a results of huge information sources would possibly manufacture additional relevant and timely statistics than ancient sources. one of the package package tools successfully and wide unfold used for storage and method of huge information sets on clusters of artefact hardware is Hadoop. Hadoop framework contains libraries, a distributed file-system (HDFS), and a resource-management platform and implements a version of the MapReduce programming model for big scale process. throughout this paper we've got an inclination to analyze the possibilities of integration Hadoop with R that would be a stylish package package used for applied mathematics computing and information image. we've got an inclination to gift three ways in which of integration them: R with Streaming, Rhipe and RHadoop which we have a tendency to emphasize the advantages and downsides of each answer. Keywords: R, Big Data, Hadoop, Rhipe, Rhadoop, Streaming I. INTRODUCTION The massive knowledge revolution can method the suggests that we have a tendency to tend to tend to know the encircling economic or social processes. we have a tendency to tend to ar able to not ignore the large volume of knowledge being created on a daily. The term big knowledge was Diamond State fi ned as info sets of skyrocketing volume, rate and selection (Mayer-Schönberger, 2012), (Beyer, 2011). massive knowledge sizes ar starting from form of tons of terabytes to several petabytes {of info of data of knowledge} throughout one information set. Such quantity of knowledge is tough to be managed and processed with classical management systems and statistics and image package package packages it wishes high computing power and big storage devices. Official statistics ought to harness the potential of huge knowledge to derive lots of relevant and timely statistics however this not a straightforward technique. the first step is to spot the sources of big knowledge potential to use in official statistics. keep with (HLG, 2013) massive knowledge sources which is able to be used in official statistics are: body data; industrial or transactional information, like online transactions victimisation credit cards; information provided by sensors (satellite imaging, climate sensors, etc.); information provided by pursuit devices (GPS, mobile devices, etc.); activity information (for example net searches); information provided by social media. Using huge information in official statistics raises many challenges (HLG, 2013). Among them we tend to ca mention: legislative problems, maintaining the privacy of the info, fi nancial issues relating to the value of sourcing information, information quality and quality of applied math strategies and technological challenges. At this point there square measure many international initiatives that associated} define an action arrange for victimisation huge information in CSEIT Received : 25 Nov 2017 Accepted : 28 Dec 2017 November-December-2017 [(2)6: ] 1093

2 offi cial statistics: Eurostat Task Force on huge information, UNECE s huge information HLG project. II. R AND HADOOP - SOFTWARE TOOLS FOR LARGE DATA SETS STATISTICAL ANALYSIS R may be a free software system package for statistics and knowledge visualisation. it's out there for UNIX, Windows and MacOS platforms and is that the results of the work of the many programmers from round the world. R contains facilities for knowledge handling, provides high performance procedures for matrix computations, an outsized assortment of tools for knowledge analysis, graphical functions for knowledge visualisation and a simple artificial language. R comes with regarding twenty five customary packages and lots of additional packages out there for transfer through the capacity measure family of websites ( ). R is employed as a process platform for normal statistics production in several official statistics agencies (Todorov, 2010), (Todorov, 2012). Besides official statistics, it's utilized in several different sectors like finance, retail, producing, educational analysis etc., creating it a well-liked tool among statisticians and researchers. Hadoop may be a free software system framework developed with the aim of distributed process of enormous knowledge sets exploitation clusters of trade goods hardware, implementing easy programming models (White, 2013). it's a middleware platform that manages a cluster of computers that was developed in Java and though Java is main artificial language for Hadoop different languages may be used to: R, Python or Ruby. Hadoop is out there at one among the most important users of Hadoop is Yahoo!. Yahoo! uses Hadoop for the Yahoo! Search netmap that is associate application that runs on a awfully giant cluster and produces knowledge utilized in Yahoo! Web search queries (Yahoo! Developer Network, 2014). Another Hadoop vital user is Facebook that operated a Hadoop cluster with over a hundred atomic number 82 of knowledge in 2012 (Ryan, 2012). The Hadoop framework includes: Hadoop Distributed classification system (HDFS) - a high performance distributed file system; Hadoop YARN that could be a framework for job programming and cluster resource management; Hadoop MapReduce a system for data processing of enormous knowledge sets that implements the MapReduce model of distributed programming (Dean, 2004). MapReduce could be a model for process giant sets of knowledge in-parallel on giant clusters computers. It splits the computer file in chucks that area unit processed in parallel by the map tasks. The results of the map tasks area unit sorted and forwarded as inputs to the scale back tasks that performs a outline operation. The framework that implements the MapReduce paradigm ought to marshal the distributed servers, run tasks in parallel, manage the info transfers between the nodes of the cluster, and supply fault tolerance. Hadoop MapReduce hides the correspondence from the software engineer, presenting him an easy model of computation. The main options of the Hadoop framework are often summarized as follows: High degree of scalability: new nodes are often superimposed to a Hadoop cluster as required while not dynamic knowledge formats, or application that runs on high of the FS; value effective: it permits for massively parallel computing victimisation trade goods hardware; Flexibility: Hadoop differs from RDBMS, having the ability to use any sort of knowledge, structured or not; Fault tolerance: if a node fails from totally different reasons, the system sends the work to a different location of the info and continues process. Hadoop has additionally a series of limitations which may be summarized as follows: HDFS is associate append-only fi lupus system, it doesn t enable update operations; MapReduce jobs run in batch mode. That s why Hadoop isn't suited to interactive applications; Hadoop can't be utilized in transactional applications. Data analysts UN agency work with Hadoop could have lots of R scripts/packages that they use for processing. mistreatment these scripts/packages with Hadoop commonly needs revising them in Java or different language that implements MapReduce. this can be cumbersome and will be a diffi cult and error prone task. What we'd like may be a thanks to connect 1094

3 Hadoop with R and use the computer code already written for R with the information keep in Hadoop (Holmes, 2012). one more reason for integration R with Hadoop for giant information sets analysis is that the method R works it processes the information loaded within the main memory. terribly massive information sets (TB or PB) can't be loaded within the RAM memory and for these information Hadoop integrated with R is one in every of the fi rst selection solutions. though there area unit several solutions for mistreatment R on a high performance computing setting ( snow, rmpi or rsge) of these solutions need that {the information the info the information} should be loaded in memory before the distribution to computing nodes and this can be straightforward unfeasible for terribly massive data sets. III. R AND HADOOP INTEGRATION We will gift 3 approaches to integrate R and Hadoop: R and Streaming, Rhipe and RHadoop. There also are different approaches to integrate R and Hadoop. as an example RODBC/RJDBC can be accustomed access knowledge from R however a survey on web shows that the foremost used approaches for linking R and Hadoop square measure Streaming, Rhipe (Cleveland, 2010) and RHadoop (Prajapati, 2013). The general structure of the analytics tools integrated with Hadoop may be viewed as a superimposed design. The fi rst layer is that the hardware layer it consists in a very cluster of (commodity) computers. The second layer is that the middleware layer Hadoop. It manages the distributions of the fi les by mistreatment HDFS and also the MapReduce jobs. Then it comes a layer that has associate degree interface for knowledge analysis. At this level we will have a tool like Pig that could be a high-level platform for making MapReduce programs employing a language referred to as Pig- Latin. we will even have Hive that could be a knowledge warehouse infrastructure developed by Apache and designed on prime of Hadoop. Hive provides facilities for running queries associate degreed knowledge analysis mistreatment an SQL-like language referred to as HiveQL and it additionally provides support for implementing MapReduce tasks. Besides these 2 tools we will implement at this level associate degree interface with different applied math computer code like R. we will use Rhipe or Rhadoop libraries that build associate degree interface between Hadoop and R, permitting users to access knowledge from the Hadoop fi autoimmune disorder system and write their own scripts for implementing Map and scale back jobs, or we will use Streaming that's a technology integrated in Hadoop. IV. R AND STREAMING Streaming could be a technology integrated within the Hadoop distribution that permits users to run Map/Reduce jobs with any script or viable that reads knowledge from commonplace input and writes the results to straightforward output because the clerk or reducer. this implies that will we will we are able to} use Streaming along side R scripts within the map and/or cut back part since R can read/write knowledge from/to commonplace input. during this approach there's no client-side integration with R as a result of theuser can use the Hadoop program line to launch the Streaming jobs with thearguments specifying the clerk and reducer R scripts. V. RHIPE Rhipe stands for R associated Hadoop Integrated Programming Environment and is an open supply project that gives a good integration between R and Hadoop. It permits the user to hold out information analysis of massive information directly in R, providing R users an equivalent facilities of Hadoop as Java developers have. The computer code package is freely offered for transfer at installation of the Rhipe is somehow a troublesome task. On every DataNode the user ought to install R, Protocol Buffers and Rhipe and this can be not a simple task: it needs that R ought to be designed as a shared library on every node, the Google Protocol Buffers to be designed and put in on every node and to put in the Rhipe itself. The Protocol Buffers square measure required for information publishing, increasing the ef fi ciency and providing ability with different languages. The Rhipe is associate R library that permits running a MapReduce job among R. The user ought to write speci fi c native R map and cut back functions and Rhipe can manage the rest: it'll transfer them and invoke them from map and cut back tasks. The map and cut back inputs square measure transferred employing a Protocol Buffer coding theme to a Rhipe C library that uses R to decision the map and cut back 1095

4 functions. the benefits of mistreatment Rhipe and not the parallel R packages consist in its integration with Hadoop that gives a knowledge distribution theme mistreatment Hadoop distributed fi lupus erythematosus system across a clusterof computers that tries to optimize the processor usage and provides fault tolerance. VI. RHADOOP RHadoop is associate degree open supply project developed by Revolution Analytics ( that has clientside integration of R and Hadoop. It permits running a MapReduce jobs inside R similar to Rhipe and consist during a assortment of 4 R packages: plyrmr : plyr-like processing for structured knowledge, providing common knowledge manipulation operations on terribly massive knowledge sets managed by Hadoop; rmr: a group of functions providing and integration of R and MapReduce model of computation; rdfs: associate degree interface between R and HDFS, providing file management operations inside R; rhbase: associate degree interface between R and HBase providing direction functions for HBase inside R; Setting up RHadoop isn't an advanced task though RHadoop has dependencies on alternative R packages. operating with RHadoop implies to put in R and RHadoop packages with dependencies on every knowledge node of the Hadoop cluster. RHadoop features a wrapper R script referred to as from Streaming that calls user outlined map and scale back R functions. RHadoop works equally to Rhipe permitting user to outline the map and scale back operation. VII. CONCLUSIONS Official statistics is a lot of considering massive knowledge for building new statistics as a result of its potential to provide more relevant and timely statistics than ancient knowledge sources. one amongst the computer code tools with success used for storage and process of huge knowledge sets on clusters of goods hardware is Hadoop. during this paper we have a tendency to conferred 3 ways of integration R and Hadoop for process massive scale knowledge sets: R and Streaming, Rhipe and RHadoop. we've to say that there are different ways that of integration them like ROBDC, RJBDC or Rhive however they need some limitations. every of the approaches conferred here has bene fi ts and limitations. whereas mistreatment R with Streaming raises no issues concerning installation, Rhipe and RHadoop needs some effort so as to line up the cluster. the mixing with R from the consumer aspect half is high for Rhipe and Rhadoop and is missing for R and Streaming. Rhipe and RHadoop permits users to First State fi ne and decision their own map and cut back functions among R whereas Streaming uses a instruction approach wherever the map and cut back functions ar passed as arguments. concerning the licensing theme, all 3 approaches need GPL-2 and GPL-3 for R and Apache a pair of.0 for Hadoop, Streaming, Rhipe and RHadoop. we've to say that there ar different alternatives for big scale knowledge analysis: Apache driver, Apache Hive, business versions of R provided by Revolution Analytics, go on framework or ORCH, associate Oracle connexion for R however Hadoop with R appears to be the foremost used approach. for straightforward Map- Reduce jobs the simple answer is Streaming however this answer is restricted to text solely input file fi les. For a lot of complicated jobs the answer ought to be Rhipe or RHadoop. VIII. REFERENCES [1]. Ahas, R., and Tiru, M., victimisation mobile positioning information for touristry statistics: Sampling and information management problems, NTTS - Conferences on New Techniques and Technologies for Statistics, Bruselles. [2]. Beyer, M., "Gartner Says determination 'Big Data' Challenge Involves over simply Managing Volumes of Data". Gartner, accessible at accessed on twenty fifth March [3]. Cleveland, William S., Guha, S., Computing atmosphere for the applied math analysis of huge and sophisticated information, degree treatise, Purdue University West Lafayette. [4]. Dean, J., and Ghemawat, S., "MapReduce: Simplifi erectile dysfunction processing on giant Clusters", accessible at 1096

5 h.google.com/ro//archive/mapreduce-osdi04.pdf, accessed on twenty fifth March [5]. High-Level cluster for the improvement of applied math Production and Services (HLG), (2013), What will "big data" mean for of fi cial statistics?, UNECE, accessible at age.action?pageid= , accessed on twenty fifth March [6]. Holmes, A, Hadoop in follow, Manning Publications, New Jersey. [7]. Mayer-Schönberger, V., and Cukier, K, "Big Data: A Revolution That Transforms however we have a tendency to Work, Live, and Think", Houghton Mif American state in Harcourt. [8]. Prajapati, V, huge information analysis with R and Hadoop, PaktPublishing. [9]. R Core Team, associate Introduction to R, accessible at accessed on twenty fifth March

Intro to Big Data and Hadoop

Intro to Big Data and Hadoop Intro to Big and Hadoop Portions copyright 2001 SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc., Cary, NC, USA. SAS Institute Inc. makes no warranties

More information

R and Hadoop. Ram Venkat Dawn Analytics

R and Hadoop. Ram Venkat Dawn Analytics R and Hadoop Ram Venkat Dawn Analytics What is Hadoop? Hadoop is an open source Apache software for running distributed applications on 'big data' It contains a distributed file system (HDFS) and a parallel

More information

Spark, Hadoop, and Friends

Spark, Hadoop, and Friends Spark, Hadoop, and Friends (and the Zeppelin Notebook) Douglas Eadline Jan 4, 2017 NJIT Presenter Douglas Eadline deadline@basement-supercomputing.com @thedeadline HPC/Hadoop Consultant/Writer http://www.basement-supercomputing.com

More information

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

E-guide Hadoop Big Data Platforms Buyer s Guide part 1 Hadoop Big Data Platforms Buyer s Guide part 1 Your expert guide to Hadoop big data platforms for managing big data David Loshin, Knowledge Integrity Inc. Companies of all sizes can use Hadoop, as vendors

More information

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA. ABOUT THIS TRAINING: The world of Hadoop and Big Data" can be intimidating - hundreds of different technologies with cryptic names form the Hadoop ecosystem. This comprehensive training has been designed

More information

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW TOPICS COVERED 1 2 Fundamentals of Big Data Platforms Major Big Data Tools Scaling Up vs. Out SCALE UP (SMP) SCALE OUT (MPP) + (n) Upgrade

More information

Modernizing Your Data Warehouse with Azure

Modernizing Your Data Warehouse with Azure Modernizing Your Data Warehouse with Azure Big data. Small data. All data. Christian Coté S P O N S O R S The traditional BI Environment The traditional data warehouse data warehousing has reached the

More information

BIG DATA AND HADOOP DEVELOPER

BIG DATA AND HADOOP DEVELOPER BIG DATA AND HADOOP DEVELOPER Approximate Duration - 60 Hrs Classes + 30 hrs Lab work + 20 hrs Assessment = 110 Hrs + 50 hrs Project Total duration of course = 160 hrs Lesson 00 - Course Introduction 0.1

More information

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics   Nov. Outline of Hadoop Background, Core Services, and Components David Schwab Synchronic Analytics https://synchronicanalytics.com Nov. 1, 2018 Hadoop s Purpose and Origin Hadoop s Architecture Minimum Configuration

More information

Hadoop Course Content

Hadoop Course Content Hadoop Course Content Hadoop Course Content Hadoop Overview, Architecture Considerations, Infrastructure, Platforms and Automation Use case walkthrough ETL Log Analytics Real Time Analytics Hbase for Developers

More information

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop Simplifying the Process of Uploading and Extracting Data from Apache Hadoop Rohit Bakhshi, Solution Architect, Hortonworks Jim Walker, Director Product Marketing, Talend Page 1 About Us Rohit Bakhshi Solution

More information

Bringing the Power of SAS to Hadoop Title

Bringing the Power of SAS to Hadoop Title WHITE PAPER Bringing the Power of SAS to Hadoop Title Combine SAS World-Class Analytics With Hadoop s Low-Cost, Distributed Data Storage to Uncover Hidden Opportunities ii Contents Introduction... 1 What

More information

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica Accelerating Your Big Data Analytics Jeff Healey, Director Product Marketing, HPE Vertica Recent Waves of Disruption IT Infrastructu re for Analytics Data Warehouse Modernization Big Data/ Hadoop Cloud

More information

5th Annual. Cloudera, Inc. All rights reserved.

5th Annual. Cloudera, Inc. All rights reserved. 5th Annual 1 The Essentials of Apache Hadoop The What, Why and How to Meet Agency Objectives Sarah Sproehnle, Vice President, Customer Success 2 Introduction 3 What is Apache Hadoop? Hadoop is a software

More information

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand Paper 2698-2018 Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand ABSTRACT Digital analytics is no longer just about tracking the number

More information

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11 Top 5 Challenges for Hadoop MapReduce in the Enterprise Whitepaper - May 2011 http://platform.com/mapreduce 2 5/9/11 Table of Contents Introduction... 2 Current Market Conditions and Drivers. Customer

More information

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK Are you drowning in Big Data? Do you lack access to your data? Are you having a hard time managing Big Data processing requirements?

More information

Microsoft Big Data. Solution Brief

Microsoft Big Data. Solution Brief Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,

More information

Got Hadoop? Whitepaper: Hadoop and EXASOL - a perfect combination for processing, storing and analyzing big data volumes

Got Hadoop? Whitepaper: Hadoop and EXASOL - a perfect combination for processing, storing and analyzing big data volumes Got Hadoop? Whitepaper: Hadoop and EXASOL - a perfect combination for processing, storing and analyzing big data volumes Contents Introduction...3 Hadoop s humble beginnings...4 The benefits of Hadoop...5

More information

Sunnie Chung. Cleveland State University

Sunnie Chung. Cleveland State University Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:

More information

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

ORACLE DATA INTEGRATOR ENTERPRISE EDITION ORACLE DATA INTEGRATOR ENTERPRISE EDITION Oracle Data Integrator Enterprise Edition delivers high-performance data movement and transformation among enterprise platforms with its open and integrated E-LT

More information

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation Roger Ding Cloudera February 3rd, 2018 1 Agenda Hadoop History Introduction to Apache Hadoop

More information

KnowledgeSTUDIO. Advanced Modeling for Better Decisions. Data Preparation, Data Profiling and Exploration

KnowledgeSTUDIO. Advanced Modeling for Better Decisions. Data Preparation, Data Profiling and Exploration KnowledgeSTUDIO Advanced Modeling for Better Decisions Companies that compete with analytics are looking for advanced analytical technologies that accelerate decision making and identify opportunities

More information

By: Shrikant Gawande (Cloudera Certified )

By: Shrikant Gawande (Cloudera Certified ) By: Shrikant Gawande (Cloudera Certified ) What is Big Data? For every 30 mins, a airline jet collects 10 terabytes of sensor data (flying time) NYSE generates about one terabyte of new trade data per

More information

Big Data & Hadoop Advance

Big Data & Hadoop Advance Course Durations: 30 Hours About Company: Course Mode: Online/Offline EduNextgen extended arm of Product Innovation Academy is a growing entity in education and career transformation, specializing in today

More information

Berkeley Data Analytics Stack (BDAS) Overview

Berkeley Data Analytics Stack (BDAS) Overview Berkeley Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY What is Big used For? Reports, e.g., - Track business processes, transactions Diagnosis, e.g., - Why is user engagement dropping?

More information

StackIQ Enterprise Data Reference Architecture

StackIQ Enterprise Data Reference Architecture WHITE PAPER StackIQ Enterprise Data Reference Architecture StackIQ and Hortonworks worked together to Bring You World-class Reference Configurations for Apache Hadoop Clusters. Abstract Contents The Need

More information

On Cloud Computational Models and the Heterogeneity Challenge

On Cloud Computational Models and the Heterogeneity Challenge On Cloud Computational Models and the Heterogeneity Challenge Raouf Boutaba D. Cheriton School of Computer Science University of Waterloo WCU IT Convergence Engineering Division POSTECH FOME, December

More information

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase BIG DATA COURSE Big Data Application Engineer/ Developer Specialization in Apache Spark, Kafka, Airflow, HBase In Exclusive Association with 21,347+ Participants 10,000+ Brands 1200+ Trainings 45+ Countries

More information

Analytics Platform System

Analytics Platform System Analytics Platform System Big data. Small data. All data. Audie Wright, DW & Big Data Specialist Audie.Wright@Microsoft.com Ofc 425-538-0044, Cell 303-324-2860 Sean Mikha, DW & Big Data Architect semikha@microsoft.com

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 1, 2017 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2457

More information

Insights to HDInsight

Insights to HDInsight Insights to HDInsight Why Hadoop in the Cloud? No hardware costs Unlimited Scale Pay for What You Need Deployed in minutes Azure HDInsight Big Data made easy Enterprise Ready Easier and more productive

More information

SAS and Hadoop Technology: Overview

SAS and Hadoop Technology: Overview SAS and Hadoop Technology: Overview SAS Documentation September 19, 2017 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS and Hadoop Technology: Overview.

More information

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop

More information

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN From Information to Insight: The Big Value of Big Data Faire Ann Co Marketing Manager, Information Management Software, ASEAN The World is Changing and Becoming More INSTRUMENTED INTERCONNECTED INTELLIGENT

More information

Analytical Capability Security Compute Ease Data Scale Price Users Traditional Statistics vs. Machine Learning In-Memory vs. Shared Infrastructure CRAN vs. Parallelization Desktop vs. Remote Explicit vs.

More information

red red red red red red red red red red red red red red red red red red red red CYS Rithu P Ravi CYS Saumya K

red red red red red red red red red red red red red red red red red red red red CYS Rithu P Ravi CYS Saumya K red red red red red red red red red red red red red red red red red red red red CYS14011 - Rithu P Ravi CYS14012 - Saumya K Why and What HADOOP?... Apache Hadoop is an open-source software framework A

More information

Cloud Survival Kit How to estimate and measure the Cloud

Cloud Survival Kit How to estimate and measure the Cloud Cloud Survival Kit How to estimate and measure the Cloud PSM August 2, 2012 David Seaver OPS Consulting davidseaver@opsconsulting.com 8/2/2012 1 Some Perspective The DoD Cloud Computing Strategy introduces

More information

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight. Course Content Course Description: The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight. At Course Completion: After competing this course,

More information

Adobe Deploys Hadoop as a Service on VMware vsphere

Adobe Deploys Hadoop as a Service on VMware vsphere Adobe Deploys Hadoop as a Service A TECHNICAL CASE STUDY APRIL 2015 Table of Contents A Technical Case Study.... 3 Background... 3 Why Virtualize Hadoop on vsphere?.... 3 The Adobe Marketing Cloud and

More information

20775A: Performing Data Engineering on Microsoft HD Insight

20775A: Performing Data Engineering on Microsoft HD Insight 20775A: Performing Data Engineering on Microsoft HD Insight Duration: 5 days; Instructor-led Implement Spark Streaming Using the DStream API. Develop Big Data Real-Time Processing Solutions with Apache

More information

Business is being transformed by three trends

Business is being transformed by three trends Business is being transformed by three trends Big Cloud Intelligence Stay ahead of the curve with Cortana Intelligence Suite Business apps People Custom apps Apps Sensors and devices Cortana Intelligence

More information

Oracle Big Data Discovery The Visual Face of Big Data

Oracle Big Data Discovery The Visual Face of Big Data Oracle Big Data Discovery The Visual Face of Big Data Today's Big Data challenge is not how to store it, but how to make sense of it. Oracle Big Data Discovery is a fundamentally new approach to making

More information

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform Analytics R&D and Product Management Document Version 1 WP-Urika-GX-Big-Data-Analytics-0217 www.cray.com

More information

Adopting Big Data Technologies in the Support of Official Statistical Production: Opportunities, Experiences and Lessons Learned

Adopting Big Data Technologies in the Support of Official Statistical Production: Opportunities, Experiences and Lessons Learned Adopting Big Data Technologies in the Support of Official Statistical Production: Opportunities, Experiences and Lessons Learned Antonino Virgillito Istat Istituto Nazionale di Statistica virgilli@istat.it

More information

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB Data Analytics and CERN IT Hadoop Service CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB 1 Data Analytics at Scale The Challenge When you cannot fit your workload in a desktop Data

More information

20775 Performing Data Engineering on Microsoft HD Insight

20775 Performing Data Engineering on Microsoft HD Insight Duración del curso: 5 Días Acerca de este curso The main purpose of the course is to give students the ability plan and implement big data workflows on HD. Perfil de público The primary audience for this

More information

Realising Value from Data

Realising Value from Data Realising Value from Data Togetherwith Open Source Drives Innovation & Adoption in Big Data BCS Open Source SIG London 1 May 2013 Timings 6:00-6:30pm. Register / Refreshments 6:30-8:00pm, Presentation

More information

A REVIEW ON HADOOP ARCHITECTURE FOR BIG DATA

A REVIEW ON HADOOP ARCHITECTURE FOR BIG DATA International Journal of Research in Engineering, Technology and Science, Volume VI, Special Issue, July 2016 www.ijrets.com, editor@ijrets.com, ISSN 2454-1915 A REVIEW ON HADOOP ARCHITECTURE FOR BIG DATA

More information

Microsoft Azure Essentials

Microsoft Azure Essentials Microsoft Azure Essentials Azure Essentials Track Summary Data Analytics Explore the Data Analytics services in Azure to help you analyze both structured and unstructured data. Azure can help with large,

More information

20775: Performing Data Engineering on Microsoft HD Insight

20775: Performing Data Engineering on Microsoft HD Insight Let s Reach For Excellence! TAN DUC INFORMATION TECHNOLOGY SCHOOL JSC Address: 103 Pasteur, Dist.1, HCMC Tel: 08 38245819; 38239761 Email: traincert@tdt-tanduc.com Website: www.tdt-tanduc.com; www.tanducits.com

More information

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper Sponsored by Successful Data Warehouse Approaches to Meet Today s Analytics Demands EXECUTIVE BRIEF In this Paper Organizations are adopting increasingly sophisticated analytics methods Analytics usage

More information

Hadoop and Analytics at CERN IT CERN IT-DB

Hadoop and Analytics at CERN IT CERN IT-DB Hadoop and Analytics at CERN IT CERN IT-DB 1 Hadoop Use cases Parallel processing of large amounts of data Perform analytics on a large scale Dealing with complex data: structured, semi-structured, unstructured

More information

Exploring Big Data and Data Analytics with Hadoop and IDOL. Brochure. You are experiencing transformational changes in the computing arena.

Exploring Big Data and Data Analytics with Hadoop and IDOL. Brochure. You are experiencing transformational changes in the computing arena. Brochure Software Education Exploring Big Data and Data Analytics with Hadoop and IDOL You are experiencing transformational changes in the computing arena. Brochure Exploring Big Data and Data Analytics

More information

DATA SCIENCE: HYPE AND REALITY PATRICK HALL

DATA SCIENCE: HYPE AND REALITY PATRICK HALL DATA SCIENCE: HYPE AND REALITY PATRICK HALL About me SAS Enterprise Miner, 2012 Cloudera Data Scientist, 2014 Do you use Kolmogorov Smirnov often? Statistician No, I mix my martinis with gin. Data Scientist

More information

BIG DATA and DATA SCIENCE

BIG DATA and DATA SCIENCE Integrated Program In BIG DATA and DATA SCIENCE CONTINUING STUDIES Table of Contents About the Course...03 Key Features of Integrated Program in Big Data and Data Science...04 Learning Path...05 Key Learning

More information

Analytics Landscape and Careers

Analytics Landscape and Careers Analytics Landscape and Careers Industry Landscape Big Data Everywhere! BIG DATA Data that is TOO LARGE & TOO COMPLEX for conventional data tools to capture, store and analyze. The 3V s of Big Data Shares

More information

Preface About the Book

Preface About the Book Preface About the Book We are living in the dawn of what has been termed as the "Fourth Industrial Revolution" by the World Economic Forum (WEF) in 2016. The Fourth Industrial Revolution is marked through

More information

DIVIDE & RECOMBINE (D&R), RHIPE, AND RIPOSTE

DIVIDE & RECOMBINE (D&R), RHIPE, AND RIPOSTE DIVIDE & RECOMBINE (D&R), RHIPE, AND RIPOSTE FOR LARGE COMPLEX DATA 1 Statistics, Purdue Saptarshi Guha, now at Mozilla Jin Xia, now at GE Research Ryan Hafen, now at PNNL Ashrith Barthur Bill Cleveland

More information

Zeal Education Society s Zeal Institute of Business Administration, Computer Application & Research

Zeal Education Society s Zeal Institute of Business Administration, Computer Application & Research Zeal Education Society s Zeal Institute of Business Administration, Computer Application & Research Sr. No. 39, Narhe, Pune -411041, Phone No.:020-67206031, Website: www.zibacar.in (Approved by A.I.C.T.E.,

More information

20775A: Performing Data Engineering on Microsoft HD Insight

20775A: Performing Data Engineering on Microsoft HD Insight 20775A: Performing Data Engineering on Microsoft HD Insight Course Details Course Code: Duration: Notes: 20775A 5 days This course syllabus should be used to determine whether the course is appropriate

More information

Design of material management system of mining group based on Hadoop

Design of material management system of mining group based on Hadoop IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Design of material system of mining group based on Hadoop To cite this article: Zhiyuan Xia et al 2018 IOP Conf. Ser.: Earth Environ.

More information

Analytics in Action transforming the way we use and consume information

Analytics in Action transforming the way we use and consume information Analytics in Action transforming the way we use and consume information Big Data Ecosystem The Data Traditional Data BIG DATA Repositories MPP Appliances Internet Hadoop Data Streaming Big Data Ecosystem

More information

Business Intelligence. By Prof.Sushila Aghav-Palwe

Business Intelligence. By Prof.Sushila Aghav-Palwe Business Intelligence By Prof.Sushila Aghav-Palwe Introduction Business intelligence may be defined as a set of mathematical models and analysis methodologies that exploit the available data to generate

More information

EXAMPLE SOLUTIONS Hadoop in Azure HBase as a columnar NoSQL transactional database running on Azure Blobs Storm as a streaming service for near real time processing Hadoop 2.4 support for 100x query gains

More information

Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden

Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden Leveraging Oracle Big Data Discovery to Master CERN s Data Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden Manuel Martin Marquez Intel IoT Ignition Lab Cloud and

More information

Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks

Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks APPLICATION BRIEF Machine-generated data: creating new opportunities for utilities, mobile and broadcast networks Electronic devices generate data every millisecond they are in operation. This data is

More information

Big Data. Fluch oder Segen? Dr. Rainer Schmidt Scientist AIT Austrian Institute of Technology GmbH

Big Data. Fluch oder Segen? Dr. Rainer Schmidt Scientist AIT Austrian Institute of Technology GmbH Big Data Fluch oder Segen? Dr. Rainer Schmidt Scientist AIT Austrian Institute of Technology GmbH Energiesysteme im Umbruch VI OVE Österreichischer Verband für Elektrotechnik 4.10.2017 Faster and Faster

More information

ETL on Hadoop What is Required

ETL on Hadoop What is Required ETL on Hadoop What is Required Keith Kohl Director, Product Management October 2012 Syncsort Copyright 2012, Syncsort Incorporated Agenda Who is Syncsort Extract, Transform, Load (ETL) Overview and conventional

More information

1. Intoduction to Hadoop

1. Intoduction to Hadoop 1. Intoduction to Hadoop Hadoop is a rapidly evolving ecosystem of components for implementing the Google MapReduce algorithms in a scalable fashion on commodity hardware. Hadoop enables users to store

More information

What s New. Bernd Wiswedel KNIME KNIME AG. All Rights Reserved.

What s New. Bernd Wiswedel KNIME KNIME AG. All Rights Reserved. What s New Bernd Wiswedel KNIME 2018 KNIME AG. All Rights Reserved. What this session is about Presenting (and demo ing) enhancements added in the last year By the team Questions? See us at the booth.

More information

Big Data and Challenges of the Century

Big Data and Challenges of the Century International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 3, Issue 4, 2016, PP 16-20 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) DOI: http://dx.doi.org/10.20431/2349-4859.0304003

More information

IBM Big Data Summit 2012

IBM Big Data Summit 2012 IBM Big Data Summit 2012 12.10.2012 InfoSphere BigInsights Introduction Wilfried Hoge Leading Technical Sales Professional hoge@de.ibm.com twitter.com/wilfriedhoge 12.10.1012 IBM Big Data Strategy: Move

More information

Six Critical Capabilities for a Big Data Analytics Platform

Six Critical Capabilities for a Big Data Analytics Platform White Paper Analytics & Big Data Six Critical Capabilities for a Big Data Analytics Platform Table of Contents page Executive Summary...1 Key Requirements for a Big Data Analytics Platform...1 Vertica:

More information

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer Redefine Big Data: EMC Data Lake in Action Andrea Prosperi Systems Engineer 1 Agenda Data Analytics Today Big data Hadoop & HDFS Different types of analytics Data lakes EMC Solutions for Data Lakes 2 The

More information

INDUSTRY BRIEF THE ENTERPRISE DATA HUB IN FINANCIAL SERVICES: THREE CUSTOMER CASE STUDIES

INDUSTRY BRIEF THE ENTERPRISE DATA HUB IN FINANCIAL SERVICES: THREE CUSTOMER CASE STUDIES INDUSTRY BRIEF THE ENTERPRISE DATA HUB IN FINANCIAL SERVICES: THREE CUSTOMER CASE STUDIES The Enterprise Data Hub in Financial Services: Three Customer Case Studies CLOUDERA INDUSTRY BRIEF 2 Table of Contents

More information

From Big Data to Fast Data. Sina Sheikholeslami

From Big Data to Fast Data. Sina Sheikholeslami From Big Data to Fast Data Sina Sheikholeslami s.sheikholeslami@digikala.com CEIT GradTalks, Tehran Polytechnic May 29 2017 Overview The War on Big Data Definition The Early Days State-of-the-art Big Data

More information

Research on Intelligent Management Unified Service Platform of Internet Plus Agriculture

Research on Intelligent Management Unified Service Platform of Internet Plus Agriculture 2018 International Conference on Big Data and Artificial Intelligence (ICBDAI 2018) Research on Intelligent Management Unified Service Platform of Internet Plus Agriculture Liu Xiaogang Zhejiang Institute

More information

Hadoop Integration Deep Dive

Hadoop Integration Deep Dive Hadoop Integration Deep Dive Piyush Chaudhary Spectrum Scale BD&A Architect 1 Agenda Analytics Market overview Spectrum Scale Analytics strategy Spectrum Scale Hadoop Integration A tale of two connectors

More information

Comparing Application Performance on HPC-based Hadoop Platforms with Local Storage and Dedicated Storage

Comparing Application Performance on HPC-based Hadoop Platforms with Local Storage and Dedicated Storage Comparing Application Performance on HPC-based Hadoop Platforms with Local Storage and Dedicated Storage Zhuozhao Li *, Haiying Shen *, Jeffrey Denton and Walter Ligon * Department of Computer Science,

More information

Operational Hadoop and the Lambda Architecture for Streaming Data

Operational Hadoop and the Lambda Architecture for Streaming Data Operational Hadoop and the Lambda Architecture for Streaming Data 2015 MapR Technologies 2015 MapR Technologies 1 Topics From Batch to Operational Workloads on Hadoop Streaming Data Environments The Lambda

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Who we are Experts At Your Service Over 50 specialists in IT infrastructure Certified, experienced, passionate Based In Switzerland 100% self-financed Swiss company Over CHF8 mio.

More information

Angat Pinoy. Angat Negosyo. Angat Pilipinas.

Angat Pinoy. Angat Negosyo. Angat Pilipinas. Angat Pinoy. Angat Negosyo. Angat Pilipinas. Four megatrends will dominate the next decade Mobility Social Cloud Big data 91% of organizations expect to spend on mobile devices in 2012 In 2012, mobile

More information

New Big Data Solutions and Opportunities for DB Workloads

New Big Data Solutions and Opportunities for DB Workloads New Big Data Solutions and Opportunities for DB Workloads Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook Luca Canali, IT-DB Hadoop and Spark Service WLCG, GDB meeting CERN, September

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 19 1 Acknowledgement The following discussion is based on the paper Mining Big Data: Current Status, and Forecast to the Future by Fan and Bifet and online presentation

More information

Investor Presentation. Fourth Quarter 2015

Investor Presentation. Fourth Quarter 2015 Investor Presentation Fourth Quarter 2015 Note to Investors Certain non-gaap financial information regarding operating results may be discussed during this presentation. Reconciliations of the differences

More information

Big Data. By Michael Covert. April 2012

Big Data. By Michael Covert. April 2012 Big By Michael Covert April 2012 April 18, 2012 Proprietary and Confidential 2 What is Big why are we discussing it? A brief history of High Performance Computing Parallel processing Algorithms The No

More information

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved. Apache Spark 2.0 GA The General Engine for Modern Analytic Use Cases 1 Apache Spark Drives Business Innovation Apache Spark is driving new business value that is being harnessed by technology forward organizations.

More information

1Week. Big Data & Hadoop. Why big data & Hadoop is important? National Winter Training program on

1Week. Big Data & Hadoop. Why big data & Hadoop is important? National Winter Training program on 1Week National Winter Training program on Big Data & Hadoop Why big data & Hadoop is important? Highlights of Big Data & Hadoop Implement a Hadoop Project Learn to write Complex MapReduce programs Perform

More information

HADOOP ADMINISTRATION

HADOOP ADMINISTRATION HADOOP ADMINISTRATION PROSPECTUS HADOOP ADMINISTRATION UNIVERSITY OF SKILLS ABOUT ISM UNIV UNIVERSITY OF SKILLS ISM UNIV is established in 1994, past 21 years this premier institution has trained over

More information

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS) ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS) Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how Dell EMC Elastic Cloud Storage (ECS ) can be used to streamline

More information

Tapping on the Power of Big Data to Improve the Operation of Power Systems

Tapping on the Power of Big Data to Improve the Operation of Power Systems Tapping on the Power of Big Data to Improve the Operation of Power Systems KA Folly University of Cape Town South Africa Place your company logo here NB: Your company s logo is only permitted on the first

More information

Why Big Data Matters? Speaker: Paras Doshi

Why Big Data Matters? Speaker: Paras Doshi Why Big Data Matters? Speaker: Paras Doshi If you re wondering about what is Big Data and why does it matter to you and your organization, then come to this talk and get introduced to Big Data and learn

More information

Big Data The Big Story

Big Data The Big Story Big Data The Big Story Jean-Pierre Dijcks Big Data Product Mangement 1 Agenda What is Big Data? Architecting Big Data Building Big Data Solutions Oracle Big Data Appliance and Big Data Connectors Customer

More information

Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved.

Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved. Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved. 1 SAP Big Data Webinar Series Big Data - Introduction to SAP Big Data Technologies Big Data - Streaming Analytics Big Data - Smarter

More information

The Intersection of Big Data and DB2

The Intersection of Big Data and DB2 The Intersection of Big Data and DB2 May 20, 2014 Mike McCarthy, IBM Big Data Channels Development mmccart1@us.ibm.com Agenda What is Big Data? Concepts Characteristics What is Hadoop Relational vs Hadoop

More information

Digital Transformation 2.0

Digital Transformation 2.0 Digital Transformation 2.0 Job roles and skills that every IT Services company must know We have been hearing for quite some time, that the world is going through digital transformation & HR department

More information

HADOOP DEVELOPMENT PROSPECTUS HADOOP DEVELOPMENT UNIVERSITY OF SKILLS

HADOOP DEVELOPMENT PROSPECTUS HADOOP DEVELOPMENT UNIVERSITY OF SKILLS HADOOP DEVELOPMENT PROSPECTUS HADOOP DEVELOPMENT UNIVERSITY OF SKILLS ABOUT ISM UNIV UNIVERSITY OF SKILLS ISM UNIV is established in 1994, past 21 years this premier institution has trained over 7000+

More information

WELCOME TO. Cloud Data Services: The Art of the Possible

WELCOME TO. Cloud Data Services: The Art of the Possible WELCOME TO Cloud Data Services: The Art of the Possible Goals for Today Share the cloud-based data management and analytics technologies that are enabling rapid development of new mobile applications Discuss

More information

Datametica. The Modern Data Platform Enterprise Data Hub Implementations. Why is workload moving to Cloud

Datametica. The Modern Data Platform Enterprise Data Hub Implementations. Why is workload moving to Cloud Datametica The Modern Data Platform Enterprise Data Hub Implementations Why is workload moving to Cloud 1 What we used do Enterprise Data Hub & Analytics What is Changing Why it is Changing Enterprise

More information