A REVIEW ON HADOOP ARCHITECTURE FOR BIG DATA
International Journal of Research in Engineering, Technology and Science, Volume VI, Special Issue, July ISSN

A REVIEW ON HADOOP ARCHITECTURE FOR BIG DATA

Shaik Aleem Ur Rehaman 1, Raman Preet Kaur 2, Tanveer Baig Z 1, Saqib Rashid 1, Zahid Nazir Moon 1
1 Dept. of Electronics and Communication, HKBK College of Engineering, Bangalore, India
2 Dept. of Computer Science, HKBK College of Engineering, Bangalore, India

ABSTRACT: This paper describes big data, its objectives, and how big data is processed. It discusses the benefits of the Hadoop architecture, which serves as a core platform for structuring the big data that results from massive data creation by every possible source. Hadoop uses a distributed computing system in which multiple servers built from relatively cheap hardware store large amounts of data. Creating value through big data is also discussed.

Keywords: Big Data, Processors, Huge Data Storage, Hadoop

[1] INTRODUCTION

According to McKinsey, big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze [11]. There is no explicit definition of how big a dataset should be in order to be considered big data. New technology has to be in place to manage this big data phenomenon. IDC defines big data technologies as a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis. According to O'Reilly, big data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or does not fit the structures of existing database architectures [1]. To gain value from these data, there must be an alternative way to process them.
Data volume is also growing exponentially due to the explosion of machine-generated data (data records, web-log files, sensor data) and growing human engagement within social networks. Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime, and so on [9].

Figure 1: A decade of digital universe growth: storage in exabytes.
[2] OBJECTIVES OF BIG DATA

Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings. Like traditional analytics, it can also support internal business decisions. The technologies and concepts behind big data allow organizations to achieve a variety of objectives, but most of the organizations we interviewed were focused on one or two. The chosen objectives have implications not only for the outcome and financial benefits of big data, but also for the process: who leads the initiative, where it fits within the organization, and how the project is managed.

A) Cost Reduction from Big Data Technologies

Some organizations pursuing big data believe strongly that MIPS and terabyte storage for structured data are now most cheaply delivered through big data technologies like Hadoop clusters. One company's cost comparison, for example, estimated that the cost of storing one terabyte for a year was $37,000 for a traditional relational database, $5,000 for a database appliance, and only $2,000 for a Hadoop cluster. Of course, these figures are not directly comparable, in that the more traditional technologies may be somewhat more reliable and easily managed. Data security approaches, for example, are not yet fully developed in the Hadoop cluster environment. Organizations that were focused on cost reduction made the decision to adopt big data tools primarily within the IT organization, on largely technical and economic criteria. IT groups may want to involve some of their users and sponsors in debating the data management advantages and disadvantages of this kind of storage, but that is probably the limit of the discussion needed. [3]

B) Time Reduction from Big Data

The second common objective of big data technologies and solutions is time reduction.
Macy's merchandise pricing optimization application provides a classic example of reducing the cycle time for complex and large-scale analytical calculations from hours or even days to minutes or seconds. The department store chain has been able to reduce the time to optimize pricing of its 73 million items for sale from over 27 hours to just over 1 hour. Described by some as big data analytics, this capability obviously makes it possible for Macy's to re-price items much more frequently to adapt to changing conditions in the retail marketplace. This big data analytics application takes data out of a Hadoop cluster and puts it into other parallel computing and in-memory software architectures. Macy's also says it achieved 70% hardware cost reductions. Kerem Tomak, VP of Analytics at Macys.com, is using similar approaches to time reduction for marketing offers to Macy's customers (see the Big Data at Macys.com case study). He notes that the company can run many more models with this time savings.

[3] BIG DATA PROCESSING

Big-data projects have a number of different layers of abstraction, from abstraction of the data through to running analytics against the abstracted data. The following figure shows the basic elements of analytical big data and their interrelationships. The higher-level components help
make big data projects easier and more dynamic. Hadoop is often at the center of big-data projects, but it is not a precondition.

Fig 2: Analysis of big data components

The components of analytical big data are given below:
- Hadoop packaging and support from organizations such as Cloudera, including MapReduce, essentially the compute layer of big data.
- A file system such as the Hadoop Distributed File System (HDFS), which manages the storage and retrieval of data and the metadata required for computation. Databases such as HBase can also be used.
- A higher-level language such as Pig (part of Hadoop), which can be used instead of Java to simplify the writing of computations.
- Hive, a data warehouse layer built on top of Hadoop.
- Cascading, a thin Java library that sits on top of Hadoop and allows suites of MapReduce jobs to be run and managed as a unit. It is widely used as a specialized tool.
- CR-X, a semi-automated modeling tool that allows models to be developed interactively at great speed and can help set up the database that will run the analytics.
- Greenplum or Netezza, specialized scale-out analytic databases that allow very fast loading and reloading of data for the analytic models.
- ISV big data analytical packages such as ClickFox and Merced, which run against the database to help address business issues.

[4] HADOOP ARCHITECTURE

A) Apache Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0. The Apache Hadoop framework is composed of the following modules:
- Hadoop Common: libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.
- Hadoop MapReduce: a programming model for large-scale data processing.

All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively. For end users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig and Apache Hive, among other related projects, expose higher-level user interfaces such as Pig Latin and a SQL variant, respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Apache Hadoop is a registered trademark of the Apache Software Foundation.

B) Architecture of Hadoop

Hadoop consists of the Hadoop Common package, which provides file-system and OS-level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java Archive (JAR) files and scripts needed to start Hadoop.
The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community. For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node resides. Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack or switch, reducing backbone traffic. HDFS uses this method when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.
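The rack-aware replica placement just described can be sketched in a few lines. This is a simplified illustration only: the rack inventory, the function name and the random node selection are hypothetical, not HDFS's actual placement code.

```python
import random

# Hypothetical rack inventory: rack name -> data nodes on that rack.
RACKS = {"rack1": ["node1", "node2", "node3"],
         "rack2": ["node4", "node5", "node6"]}

def place_replicas(racks, replication=3):
    """Pick target nodes for one block, spreading copies over two racks:
    one replica on a 'local' rack, the rest together on a remote rack,
    so a single rack power outage or switch failure leaves a copy alive."""
    local_rack = random.choice(list(racks))
    targets = [random.choice(racks[local_rack])]
    remote_rack = random.choice([r for r in racks if r != local_rack])
    # Distinct nodes on the remote rack for the remaining replicas.
    targets += random.sample(racks[remote_rack],
                             min(replication - 1, len(racks[remote_rack])))
    return targets
```

With the default replication value of 3, this yields the layout the text describes: two copies on one rack and one on another, so no single rack failure can destroy every replica.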
Fig 3: A multi-node Hadoop cluster

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, Name Node and Data Node. A slave or worker node acts as both a Data Node and Task Tracker, though it is possible to have data-only worker nodes and compute-only worker nodes; these are normally used only in nonstandard applications. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard start-up and shutdown scripts require Secure Shell to be set up between nodes in the cluster. In a larger cluster, HDFS is managed through a dedicated Name Node server that hosts the file-system index, and a secondary Name Node that can generate snapshots of the name node's memory structures, thus preventing file-system corruption and reducing loss of data. Similarly, a standalone Job Tracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate file system, the Name Node, secondary Name Node and Data Node architecture of HDFS is replaced by the file-system-specific equivalent.

C) File System

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Each Hadoop instance typically has a single name node, and a cluster of data nodes forms the HDFS cluster; the situation is typical because not every node needs to host a data node. Each data node serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to communicate with each other.
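As noted in section [4], Hadoop Streaming lets the "map" and "reduce" parts be written in any language as line-oriented programs that read stdin and write tab-separated records to stdout. A word-count pair in that style can be sketched and simulated locally; the function names and the local stand-in for the shuffle-and-sort phase are illustrative assumptions, not Hadoop code.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit one '<word>\t1' record per word,
    as it would be written to stdout under Hadoop Streaming."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(records):
    """Streaming-style reducer: records arrive sorted by key, so counts
    for the same word are adjacent and can be summed with groupby."""
    pairs = (r.split("\t") for r in records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def word_count(lines):
    # sorted() stands in for Hadoop's shuffle-and-sort between the phases.
    return list(reducer(sorted(mapper(lines))))
```

Running `word_count(["big data big hadoop", "data big"])` groups the mapper output by word and sums the counts, mirroring what the framework does across many machines.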
Fig 4: A simple big data technology environment

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (though some RAID configurations are still useful to increase I/O performance). With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The tradeoff of not having a fully POSIX-compliant file system is increased data throughput and support for non-POSIX operations such as append. The HDFS file system includes a so-called secondary name node, a name that misleadingly suggests that when the primary name node goes offline, the secondary name node takes over. In fact, the secondary name node regularly connects with the primary name node and builds snapshots of the primary name node's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary name node without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because the name node is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate name nodes.
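The checkpointing role of the secondary name node described above can be illustrated with a toy namespace model. The dictionary-based image, the operation tuples and the function names here are hypothetical simplifications for exposition, not HDFS's real data structures.

```python
def apply_edits(image, edits):
    """Replay an edit journal against a namespace image (path -> metadata)."""
    namespace = dict(image)
    for op, path, *payload in edits:
        if op == "create":
            namespace[path] = payload[0]
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

def checkpoint(image, edits):
    """Fold the accumulated edit log into the image, as the secondary name
    node does periodically, returning a fresh image and an emptied journal."""
    return apply_edits(image, edits), []
```

After a failure, the primary can then restart from the latest checkpointed image plus only the short journal accumulated since, instead of replaying every file-system action ever taken.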
[5] CREATING VALUE THROUGH BIG DATA

The McKinsey Global Institute conducted research on big data in which it pointed out five key areas where big data can create value: creating transparency, improving employee performance, segmenting populations to customize actions, improving decision making, and innovating new products, services and business models.

A) Creating transparency: If a company makes data available to authorized people in a timely manner, it creates transparency within the company. It is also important to make data available for inter-departmental use within the organization.

B) Employee performance improvement: Continuous improvement of employee performance is very important in an organization, and big data can be an important resource for improving it. In a data center, for example, each employee's detailed work history is recorded, so if an employee is not doing a task
properly, it is easy to analyze the work history and find a solution that will improve performance.

C) Segmenting populations to customize actions: In marketing, customer segmentation is very important, because through it a company can choose the right business strategies for each customer. Big data enables a firm to collect detailed information on customers and their buying patterns. If, through analysis, a company offers precisely targeted products and services, customers will be happier.

D) Improve decision making: Big data enables a firm to collect detailed information about customers and competitors, so by analyzing the whole data set a firm can make better decisions than one that analyzes only sample information.

E) Innovating new products, services and business models: Using big data, a firm can offer new products and services to existing and new customers, because existing customers can provide excellent suggestions for new products and services. In addition, their detailed customer history can be a good source of new business models.

[6] CONCLUSION AND FUTURE SCOPE

Data volume is growing exponentially due to the explosion of machine-generated data (data records, web-log files, sensor data) and growing human engagement within social networks. As data volumes grow, so does concern over data preservation, access, dissemination, and usability. Many agencies have taken initiatives to research areas such as automated analysis techniques, data mining, machine learning, privacy, and database interoperability, and these will help to identify how big data can enable science in new ways and at new levels.
The growth of data constitutes the big data phenomenon: a technological phenomenon brought about by the rapid rate of data growth and parallel advancements in technology, which have given rise to an ecosystem of software and hardware products that enable users to analyze this data to produce new and more granular levels of insight.

REFERENCES

[1] Murnane, L. G. (April 9, 2012). Big Data: The Future Is Now.
[2] Data, Data Everywhere. The Economist, 25 February.
[3] Big Data initiative to optimize geospatial intelligence. (April 10, 2012).
[4] Denne, S. (April 6, 2012). Big Data Success Stories: Opera Solutions. The Wall Street Journal.
[5] Improving Pharmaceutical Research with Netezza Powered Analytics. (March 15, 2012).
[6] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next
[7] Putting real-time data to work and providing a platform for technology development. (December 15, 2010).
[8] Smith, D. (2011). 5 real-world uses of big data.
[9] Snijders, C., Matzat, U., & Reips, U.-D. Big Data: Big Gaps of Knowledge in the Field of Internet Science. International Journal of Internet Science 7: 1-5.
[10] Big data brings big value. (February 29, 2012). ITWeb Data Management. Retrieved April 13, 2012.
AUTHORS' BRIEF INTRODUCTION:

1. Shaik Aleem Ur Rehaman is currently pursuing his BE in electronics and communication engineering at HKBK College of Engineering, Bangalore. He has presented over 30 papers at various national and international conferences. His areas of interest include VLSI, automation and robotics.
2. Raman Preet Kaur is currently pursuing her BE in computer science engineering at HKBK College of Engineering, Bangalore. Her areas of interest are computer networking and Java programming.
3. Tanveer Baig Z is currently working as an assistant professor in the Dept. of ECE at HKBK College of Engineering, Bangalore. His area of specialization is telecommunication.
4. Saqib Rashid is currently pursuing his BE in electronics and communication engineering at HKBK College of Engineering, Bangalore. His areas of interest are embedded systems and robotics.
5. Zahid Nazir Moon is currently pursuing his BE in electronics and communication engineering at HKBK College of Engineering, Bangalore. His area of interest is embedded systems and robotics.
Integrated Program In BIG DATA and DATA SCIENCE CONTINUING STUDIES Table of Contents About the Course...03 Key Features of Integrated Program in Big Data and Data Science...04 Learning Path...05 Key Learning
More informationBig Data & Hadoop Advance
Course Durations: 30 Hours About Company: Course Mode: Online/Offline EduNextgen extended arm of Product Innovation Academy is a growing entity in education and career transformation, specializing in today
More informationBig Data Job Descriptions. Software Engineer - Algorithms
Big Data Job Descriptions Software Engineer - Algorithms This position is responsible for meeting the big data needs of our various products and businesses. Specifically, this position is responsible for
More informationProduct Brief SysTrack VMP
Product Brief SysTrack VMP Benefits Optimize desktop and server virtualization and terminal server projects Anticipate and handle problems in the planning stage instead of postimplementation Use iteratively
More informationCask Data Application Platform (CDAP) The Integrated Platform for Developers and Organizations to Build, Deploy, and Manage Data Applications
Cask Data Application Platform (CDAP) The Integrated Platform for Developers and Organizations to Build, Deploy, and Manage Data Applications Copyright 2015 Cask Data, Inc. All Rights Reserved. February
More information20775: Performing Data Engineering on Microsoft HD Insight
Let s Reach For Excellence! TAN DUC INFORMATION TECHNOLOGY SCHOOL JSC Address: 103 Pasteur, Dist.1, HCMC Tel: 08 38245819; 38239761 Email: traincert@tdt-tanduc.com Website: www.tdt-tanduc.com; www.tanducits.com
More informationMeetup DB2 LUW - Madrid. IBM dashdb. Raquel Cadierno Torre IBM 1 de Julio de IBM Corporation
IBM dashdb Raquel Cadierno Torre IBM Analytics @IBMAnalytics rcadierno@es.ibm.com 1 de Julio de 2016 1 2016 IBM Corporation What is dashdb? http://www.ibm.com/analytics/us/en/technology/cloud-data-services/dashdb/
More informationJason Virtue Business Intelligence Technical Professional
Jason Virtue Business Intelligence Technical Professional jvirtue@microsoft.com Agenda Microsoft Azure Data Services Azure Cloud Services Azure Machine Learning Azure Service Bus Azure Stream Analytics
More informationArchitecture Overview for Data Analytics Deployments
Architecture Overview for Data Analytics Deployments Mahmoud Ghanem Sr. Systems Engineer GLOBAL SPONSORS Agenda The Big Picture Top Use Cases for Data Analytics Modern Architecture Concepts for Data Analytics
More informationMicrosoft Azure Essentials
Microsoft Azure Essentials Azure Essentials Track Summary Data Analytics Explore the Data Analytics services in Azure to help you analyze both structured and unstructured data. Azure can help with large,
More informationAdobe Deploys Hadoop as a Service on VMware vsphere
Adobe Deploys Hadoop as a Service A TECHNICAL CASE STUDY APRIL 2015 Table of Contents A Technical Case Study.... 3 Background... 3 Why Virtualize Hadoop on vsphere?.... 3 The Adobe Marketing Cloud and
More informationCloudera Hadoop & Industrie 4.0 wohin mit dem Datenstrom?
Cloudera Hadoop & Industrie 4.0 wohin mit dem Datenstrom? Bernard Doering Regional Sales Director, Central Europe 1 Cloudera Hadoop Scalable Flexible Open Cost- EffecLve 2 2014 Cloudera, Inc. All rights
More informationBIG DATA TRANSFORMS BUSINESS. Copyright 2012 EMC Corporation. All rights reserved.
BIG DATA TRANSFORMS BUSINESS 1 IN 2000 THE WORLD GENERATED TWO EXABYTES OF NEW INFORMATION Sources: How Much Information? Peter Lyman and Hal Varian, UC Berkeley,. 2011 IDC Digital Universe Study. 2 IN
More informationGET MORE VALUE OUT OF BIG DATA
GET MORE VALUE OUT OF BIG DATA Enterprise data is increasing at an alarming rate. An International Data Corporation (IDC) study estimates that data is growing at 50 percent a year and will grow by 50 times
More informationCOPYRIGHTED MATERIAL. 1Big Data and the Hadoop Ecosystem
1Big Data and the Hadoop Ecosystem WHAT S IN THIS CHAPTER? Understanding the challenges of Big Data Getting to know the Hadoop ecosystem Getting familiar with Hadoop distributions Using Hadoop-based enterprise
More informationMapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia
MapR: Converged Data Pla3orm and Quick Start Solu;ons Robin Fong Regional Director South East Asia Who is MapR? MapR is the creator of the top ranked Hadoop NoSQL SQL-on-Hadoop Real Database time streaming
More informationAzure ML Data Camp. Ivan Kosyakov MTC Architect, Ph.D. Microsoft Technology Centers Microsoft Technology Centers. Experience the Microsoft Cloud
Microsoft Technology Centers Microsoft Technology Centers Experience the Microsoft Cloud Experience the Microsoft Cloud ML Data Camp Ivan Kosyakov MTC Architect, Ph.D. Top Manager IT Analyst Big Data Strategic
More informationIBM ICE (Innovation Centre for Education) Welcome to: Unit 1 Overview of delivery models in Cloud Computing. Copyright IBM Corporation
Welcome to: Unit 1 Overview of delivery models in Cloud Computing 9.1 Unit Objectives After completing this unit, you should be able to: Understand cloud history and cloud computing Describe the anatomy
More informationCloud Integration and the Big Data Journey - Common Use-Case Patterns
Cloud Integration and the Big Data Journey - Common Use-Case Patterns A White Paper August, 2014 Corporate Technologies Business Intelligence Group OVERVIEW The advent of cloud and hybrid architectures
More informationHadoop in Production. Charles Zedlewski, VP, Product
Hadoop in Production Charles Zedlewski, VP, Product Cloudera In One Slide Hadoop meets enterprise Investors Product category Business model Jeff Hammerbacher Amr Awadallah Doug Cutting Mike Olson - CEO
More informationGE Intelligent Platforms. Proficy Historian HD
GE Intelligent Platforms Proficy Historian HD The Industrial Big Data Historian Industrial machines have always issued early warnings, but in an inconsistent way and in a language that people could not
More informationTechValidate Survey Report. Converged Data Platform Key to Competitive Advantage
TechValidate Survey Report Converged Data Platform Key to Competitive Advantage TechValidate Survey Report Converged Data Platform Key to Competitive Advantage Executive Summary What Industry Analysts
More informationLeveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden
Leveraging Oracle Big Data Discovery to Master CERN s Data Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden Manuel Martin Marquez Intel IoT Ignition Lab Cloud and
More informationGuide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake
White Paper Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake Motivation for Modernization It is now a well-documented realization among Fortune 500 companies
More informationE-guide Hadoop Big Data Platforms Buyer s Guide part 3
Big Data Platforms Buyer s Guide part 3 Your expert guide to big platforms enterprise MapReduce cloud-based Abie Reifer, DecisionWorx The Amazon Elastic MapReduce Web service offers a managed framework
More informationBeyond Ceilometer Metering and Billing
www.persistent.com Beyond Ceilometer Metering and Billing Cloud Analytics opportunity Usage Polling Are you running Ceilometer? Are you using only for metering? How are you archiving your Ceilometer Data?
More informationBig Data Meets High Performance Computing
White Paper Intel Enterprise Edition for Lustre* Software High Performance Data Division Big Data Meets High Performance Computing Intel Enterprise Edition for Lustre* software and Hadoop* combine to bring
More informationApache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.
Apache Spark 2.0 GA The General Engine for Modern Analytic Use Cases 1 Apache Spark Drives Business Innovation Apache Spark is driving new business value that is being harnessed by technology forward organizations.
More informationThe Alpine Data Platform
The Alpine Data Platform TABLE OF CONTENTS ABOUT ALPINE.... 2 ALPINE PRODUCT OVERVIEW... 3 PRODUCT ARCHITECTURE.... 5 SYSTEM REQUIREMENTS.... 6 ABOUT ALPINE DATA ADVANCED ANALYTICS FOR THE ENTERPRISE Alpine
More informationMachina Research White Paper for ABO DATA. Data aware platforms deliver a differentiated service in M2M, IoT and Big Data
Machina Research White Paper for ABO DATA Data aware platforms deliver a differentiated service in M2M, IoT and Big Data December 2013 Connections (billion) Introduction More and more businesses are making
More informationHadoopWeb: MapReduce Platform for Big Data Analysis
HadoopWeb: MapReduce Platform for Big Data Analysis Saloni Minocha 1, Jitender Kumar 2,s Hari Singh 3, Seema Bawa 4 1Student, Computer Science Department, N.C. College of Engineering, Israna, Panipat,
More informationRDMA Hadoop, Spark, and HBase middleware on the XSEDE Comet HPC resource.
RDMA Hadoop, Spark, and HBase middleware on the XSEDE Comet HPC resource. Mahidhar Tatineni, SDSC ECSS symposium December 19, 2017 Collaborative project with Dr D.K. Panda s Network Based Computing lab
More informationBuilding a Multi-Tenant Infrastructure for Diverse Application Workloads
Building a Multi-Tenant Infrastructure for Diverse Application Workloads Rick Janowski Marketing Manager IBM Platform Computing 1 The Why and What of Multi-Tenancy 2 Parallelizable problems demand fresh
More informationWorkloadWisdom Storage performance analytics for comprehensive workload insight
DATASHEET Storage performance analytics for comprehensive workload insight software is the industry s only automated workload acquisition, workload analysis, workload modeling, and workload performance
More informationDesign of material management system of mining group based on Hadoop
IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Design of material system of mining group based on Hadoop To cite this article: Zhiyuan Xia et al 2018 IOP Conf. Ser.: Earth Environ.
More informationEngineering Unplugged: A Discussion With Pure Storage s Brian Gold on Big Data Analytics for Apache Spark
Engineering Unplugged: A Discussion With Pure Storage s Brian Gold on Big Data Analytics for Apache Spark Q&A Apache Spark has become a vital technology for development teams looking to leverage an ultrafast
More informationWebFOCUS: Business Intelligence and Analytics Platform
WebFOCUS: Business Intelligence and Analytics Platform Strategic BI and Analytics for the Enterprise Features Extensive self-service for everyone Powerful browser-based authoring tool Create reusable analytical
More informationReal-time Streaming Insight & Time Series Data Analytic For Smart Retail
Real-time Streaming Insight & Time Series Data Analytic For Smart Retail Sudip Majumder Senior Director Development Industry IoT & Big Data 10/5/2016 Economic Characteristics of Data Data is the New Oil..then
More informationConsiderations and Best Practices for Migrating to an IP-based Access Control System
WHITE PAPER Considerations and Best Practices for Migrating to an IP-based Access Control System Innovative Solutions Executive Summary Migrating from an existing legacy Access Control System (ACS) to
More informationCloud Based Big Data Analytic: A Review
International Journal of Cloud-Computing and Super-Computing Vol. 3, No. 1, (2016), pp.7-12 http://dx.doi.org/10.21742/ijcs.2016.3.1.02 Cloud Based Big Data Analytic: A Review A.S. Manekar 1, and G. Pradeepini
More informationOracle Big Data Discovery Cloud Service
Oracle Big Data Discovery Cloud Service The Visual Face of Big Data in Oracle Cloud Oracle Big Data Discovery Cloud Service provides a set of end-to-end visual analytic capabilities that leverages the
More informationIn-Memory Analytics: Get Faster, Better Insights from Big Data
Discussion Summary In-Memory Analytics: Get Faster, Better Insights from Big Data January 2015 Interview Featuring: Tapan Patel, SAS Institute, Inc. Introduction A successful analytics program should translate
More informationIBM Big Data Summit 2012
IBM Big Data Summit 2012 12.10.2012 InfoSphere BigInsights Introduction Wilfried Hoge Leading Technical Sales Professional hoge@de.ibm.com twitter.com/wilfriedhoge 12.10.1012 IBM Big Data Strategy: Move
More informationUsing the Blaze Engine to Run Profiles and Scorecards
Using the Blaze Engine to Run Profiles and Scorecards 1993, 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationAddressing World-Scale Challenges. Computation as a powerful change agent in areas such as Energy, Environment, Healthcare, Education
Addressing World-Scale Challenges Computation as a powerful change agent in areas such as Energy, Environment, Healthcare, Education Collaboration and Community Massive amounts of data collected and aggregated
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing I am here to help buzzetti@us.ibm.com Historic Waves of Economic and Social Transformation Industrial Revolution Age of Steam and Railways Age of Steel and Electricity Age
More information