From Big Data to Fast Data. Sina Sheikholeslami


1 From Big Data to Fast Data Sina Sheikholeslami CEIT GradTalks, Tehran Polytechnic May

2 Overview: The War on Big Data Definition; The Early Days; State-of-the-art Big Data Processing Platforms; The Rise of Fast Data: Applications & Platforms; How to Get Involved

3 Part 1: The War on Big Data Definition

4 What is Big Data? "Big Data: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." - Dan Ariely

5 What is Big Data? (Cont'd) "Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions." - Oxford English Dictionary (Since 2013)

6 What is Big Data? (Cont'd) "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." - Gartner IT Glossary

7 What is Big Data? (Cont'd) "Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis." - U.S. National Institute of Standards & Technology

8 What is Big Data? (Cont'd) - UC Berkeley Datascience Survey, September

9 Part 2: The Early Days

10 The Google File System: At SOSP '03, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung published the paper on GFS. Google developed GFS to provide efficient, reliable access to data using large clusters of commodity hardware.

11 Bringing the Computation Near Data: MapReduce. Jeffrey Dean & Sanjay Ghemawat published the MapReduce paper at OSDI '04. It has been cited more than times since then.

12 Some say we can divide the human race in two: those who have never heard of the Word Count example, and those who... well, let's just say, don't like it.

13 The Word Count Example
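
The slide itself only shows a diagram, so here is a minimal sketch of what Word Count looks like when written in the map / shuffle / reduce style. It is plain single-process Python, not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the values collected for a single word."""
    return (word, sum(counts))


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one input split.
    docs = ["big data fast data", "fast data platforms"]
    print(word_count(docs))
```

In a real cluster the map and reduce calls would run in parallel on different machines, close to the data they read, and the shuffle would move data over the network; the sketch only shows the shape of the three phases.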

14 The Yellow Elephant: Based on the GFS & MapReduce papers, the guys at Yahoo! developed an open-source platform for distributed storage and processing of big datasets. Originally part of the Apache Nutch project, the first release of Apache Hadoop happened in January

15 The Hadoop Ecosystem: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

16 Data Got Bigger [Chart: Number of Internet Users (Millions), December 1995 to June 2009. Source: internetworldstats.com]

17 And Bigger "There were 5 exabytes of information created by the entire world between the dawn of civilization and Now that same amount is created every two days." - Eric Schmidt (then CEO of Google), at the Techonomy Conference in Lake Tahoe, California, August

18 Part 3: State-of-the-art Big Data Processing Platforms

19 A Classic Batch Processing Architecture (Source: Dean Wampler, Fast Data Architectures For Streaming Applications)

20 The Big Data Stack (Courtesy of Amir H. Payberah, Data Intensive Computing Platforms)

21 The Big Data Stack: Resource Management Layer (Courtesy of Amir H. Payberah, Data Intensive Computing Platforms)

22 The Big Data Stack: Storage Layer (Courtesy of Amir H. Payberah, Data Intensive Computing Platforms)

23 The Big Data Stack: Data Processing Layer (Courtesy of Amir H. Payberah, Data Intensive Computing Platforms)

24 Apache Spark: In-Memory Distributed Processing Platform; Similar Semantics for Batch & Stream Processing; Initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009; Became a top-level Apache Project in February; Forks, 1068 Contributors; Written primarily in Scala, more than 1M lines of code

25 Spark vs. Hadoop MapReduce (Courtesy of Amir H. Payberah, Data Intensive Computing Platforms)

26 Spark Stack

27 The Bigger Picture: BDAS, the Berkeley Data Analytics Stack

28 Apache Flink: Open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications; Data is processed one event at a time rather than as a series of batches; Originally named Stratosphere, started in 2010 with funding from DFG; Became a top-level Apache Project in December; Forks, 309 Contributors; Written primarily in Java, more than 1M lines of code
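
To make the difference between processing one event at a time and processing a series of batches concrete, here is a small sketch in plain Python. It does not use the Flink or Spark APIs; `process_event_at_a_time`, `process_micro_batches`, and the `batch_size` parameter are hypothetical names chosen for this illustration, and the "computation" is just a running sum.

```python
def process_event_at_a_time(events):
    """Update the running total and emit a result after every single event."""
    total = 0
    for value in events:
        total += value
        yield total


def process_micro_batches(events, batch_size):
    """Buffer events into fixed-size batches and emit one result per batch."""
    total = 0
    batch = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            batch = []
            yield total
    if batch:  # flush the final, partially filled batch
        total += sum(batch)
        yield total


if __name__ == "__main__":
    # Hypothetical stream of numeric events.
    stream = [1, 2, 3, 4, 5]
    print(list(process_event_at_a_time(stream)))   # one result per event
    print(list(process_micro_batches(stream, 2)))  # one result per batch
```

Both functions end at the same total, but the event-at-a-time version produces an updated result as soon as each event arrives, while the batched version only produces one once a batch fills up. That added wait is the latency trade-off the slide is pointing at.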

29 Flink Stack

30 The Next Big Thing:

31 Part 4: The Rise of Fast Data: Applications & Platforms

32 They Don't Wait For It

33 We Can't Wait For It

34 They Won't Wait For It

35 They Shouldn't Wait For It (Source: cabotsolutions.com)

36 My Boss Won't Wait For It

37 Fast Data: A Definition. "Fast data is the application of big data analytics to smaller data sets in near-real or real-time in order to solve a problem or create business value." - TechTarget

38 Looking Back at a Classic Batch Processing Architecture (Source: Dean Wampler, Fast Data Architectures For Streaming Applications)

39 Fast Data Processing Architecture (Source: Dean Wampler, Fast Data Architectures For Streaming Applications)

40 How to Get Involved

41 AWS Ecosystem

42 Hortonworks Data Platform

43 Open-source!

44 And to Wrap it Up: Big Data History & Platforms; Big Data vs. Fast Data; Fast Data Architectures & Platforms; Getting Involved

45 Attribution: Thanks to Alekksall, Ddraw, Ibrandify, Yurlick, and Makyzz of freepik.com, for the free pics! Thanks to the awesome people at The Apache Foundation. For Everything. Including the graphics.

46 And To You