New Approach for Scheduling Tasks and/or Jobs in a Big Data Cluster

1 New Approach for Scheduling Tasks and/or Jobs in a Big Data Cluster. IT College, Chairperson of MS Dept.

2 Agenda
- Introduction
- What is Big Data?
- The 4 characteristics of Big Data (V4s)
- Different Categories of Data
- Unstructured data is exploding
- Apache Hadoop
- Hadoop Ecosystem
- Scheduler role in Big Data
- Our new approach for scheduling tasks and/or jobs in Big Data Clusters
- Conclusion

3 Introduction

4 What is Big Data? Definition: extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

5 The 4 characteristics of Big Data V4s (1/5)
- Volume: Scale of Data
- Velocity: Analysis of Streaming Data
- Variety: Different Forms of Data
- Veracity: Uncertainty of Data

7-10 The 4 characteristics of Big Data V4s (2/5) to (5/5) [image-only slides]

11 Different Categories of Data
- Structured Data: data from Enterprise Systems (ERP, CRM)
- Semi-structured Data: XML files, e-mail body
- Unstructured Data: audio, video, image files, archived documents

12 Unstructured data is exploding (1/2)
- IoT: number of wearable devices, number of wireless devices (RFID, WiFi, ...)
- Number of videos uploaded to social networks and to YouTube: 4 billion hours are watched on YouTube
- Number of pieces of content exchanged on social networks: 400 million tweets are sent per day

13 Unstructured data is exploding (2/2)
- 800% growth in data volume within the next 5 years
- The amount of unstructured data is growing 62% faster
- 80% of data will be unstructured by 2019 (source: Gartner)
Source: Gartner & IDC

14 Apache Hadoop
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
- The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Source: Apache Hadoop

15 Popular Hadoop Distributions

16 Popular Hadoop Programming Languages

17 Cloudera Hadoop Ecosystem Source: Cloudera

18 Scheduler Role in Big Data (1/2)

19 Scheduler Role in Big Data (2/2)
- In order to achieve greater performance, an efficient scheduler needs to be implemented.
- Scheduling is a technique for assigning jobs to available resources in a manner that minimizes starvation and maximizes resource utilization.
- The performance of a scheduling technique can be improved by applying constraints.
- Various scheduling algorithms have been proposed in the past few years for optimal utilization of cluster resources.

20 Our new approach for scheduling tasks and/or jobs in Big Data Clusters (1/2)
- It is based on the resources of the Data Nodes: CPU load, RAM load, I/O load, network load, and the type of the job (Spark, HBase, Impala, ...).
- The Job Scheduler combines the aforementioned resource loads according to this formula (a code sketch follows below):
  RL = (CPU_L)^α + (RAM_L)^β + (I/O_L)^γ + (Network_L)^δ
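A minimal sketch of this load computation, assuming the Greek letters act as exponents on load readings normalized to [0, 1] (the slide's layout also admits a weighted-sum reading); the function name and the normalization are illustrative, not taken from the slides:

```python
def resource_load(cpu_l: float, ram_l: float, io_l: float, net_l: float,
                  alpha: float, beta: float, gamma: float, delta: float) -> float:
    """Combine a Data Node's four load readings (assumed normalized to [0, 1])
    into a single score RL; a lower RL indicates a less busy node."""
    return cpu_l**alpha + ram_l**beta + io_l**gamma + net_l**delta

# Example: a node at 50% CPU, 40% RAM, 20% I/O, 10% network, under Spark weights.
print(resource_load(0.5, 0.4, 0.2, 0.1, alpha=1, beta=1, gamma=0.5, delta=0.5))
```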

21 Our new approach for scheduling tasks and/or jobs in Big Data Clusters (2/2)
For every Data Node in the Cluster (1 cluster)
    Get the resource load
    Switch (type of job) {
        Case Spark:     α = 1,   β = 1,   γ = 0.5, δ = 0.5
        Case HBase:     α = 0.7, β = 0.7, γ = 1,   δ = 0.7
        Case MapReduce: α = 0.5, β = 0.7, γ = 1,   δ = 1
    }
End For
Sort the array of resource loads
Assign Tasks and/or Jobs to the Data Nodes
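A runnable sketch of this loop, under the same exponent assumption as above; the DataNode record, the weight-table keys, and sorting in ascending order of RL (so the least loaded node is served first) are my additions, not details fixed by the slides:

```python
from dataclasses import dataclass

# Per-job-type exponents (alpha, beta, gamma, delta) copied from the slide.
WEIGHTS = {
    "spark":     (1.0, 1.0, 0.5, 0.5),
    "hbase":     (0.7, 0.7, 1.0, 0.7),
    "mapreduce": (0.5, 0.7, 1.0, 1.0),
}

@dataclass
class DataNode:
    name: str
    cpu_l: float  # load readings, assumed normalized to [0, 1]
    ram_l: float
    io_l: float
    net_l: float

def resource_load(node: DataNode, job_type: str) -> float:
    """RL = (CPU_L)^alpha + (RAM_L)^beta + (I/O_L)^gamma + (Network_L)^delta."""
    alpha, beta, gamma, delta = WEIGHTS[job_type]
    return (node.cpu_l ** alpha + node.ram_l ** beta
            + node.io_l ** gamma + node.net_l ** delta)

def rank_nodes(nodes: list[DataNode], job_type: str) -> list[DataNode]:
    """Sort Data Nodes by RL; tasks/jobs are assigned to the least loaded first."""
    return sorted(nodes, key=lambda n: resource_load(n, job_type))

# Usage with made-up load readings:
cluster = [DataNode("dn1", 0.9, 0.4, 0.2, 0.3),
           DataNode("dn2", 0.2, 0.3, 0.6, 0.1),
           DataNode("dn3", 0.5, 0.5, 0.5, 0.5)]
for node in rank_nodes(cluster, "spark"):
    print(node.name, round(resource_load(node, "spark"), 3))
```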

22 Conclusion
- I have presented a new approach for scheduling tasks and/or jobs in Big Data clusters based on the Data Nodes' resource load.
- In the future, I will introduce machine learning (Artificial Neural Networks) into the scheduler in order to assign tasks and/or jobs to Data Nodes more efficiently.