From Big Data to Fast Data. Sina Sheikholeslami
1 From Big Data to Fast Data. Sina Sheikholeslami. CEIT GradTalks, Tehran Polytechnic, May
2 Overview: The War on Big Data Definition; The Early Days; State-of-the-art Big Data Processing Platforms; The Rise of Fast Data: Applications & Platforms; How to Get Involved
3 Part 1: The War on Big Data Definition
4 What is Big Data? "Big Data: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." - Dan Ariely
5 What is Big Data? (Cont'd) "Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions." - Oxford English Dictionary (Since 2013)
6 What is Big Data? (Cont'd) "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." - Gartner IT Glossary
7 What is Big Data? (Cont'd) "Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis." - U.S. National Institute of Standards & Technology
8 What is Big Data? (Cont'd) - UC Berkeley Datascience Survey, September
9 Part 2: The Early Days
10 The Google File System: At SOSP '03, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung published the paper on GFS. Google developed GFS to provide efficient, reliable access to data using large clusters of commodity hardware.
11 Bringing the Computation Near Data: MapReduce. Jeffrey Dean & Sanjay Ghemawat published the MapReduce paper at OSDI '04. It has been cited more than times since then.
12 Some say we can divide the human race in two: those who have never heard of the Word Count example, and those who, well, let's just say, don't like it.
13 The Word Count Example
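For readers who have not met it yet, the Word Count example can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps described in the MapReduce model, run in a single process; it does not use Hadoop or any other framework, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all input splits."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Two tiny "input splits" standing in for files on a distributed file system.
    splits = ["big data fast data", "fast data wins"]
    print(word_count(splits))
```

In a real deployment the map and reduce calls would run in parallel on the machines that hold the data, which is the "computation near data" idea; here everything runs locally to keep the sketch self-contained.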
14 The Yellow Elephant: Based on the GFS & MapReduce papers, the guys at Yahoo! developed an open-source platform for distributed storage and processing of big datasets. Part of the Apache Nutch project in its early days, the first release of Apache Hadoop happened in January
15 The Hadoop Ecosystem. Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
16 Data Got Bigger [Chart: Number of Internet Users (Millions), December 1995 to June 2009. Source: internetworldstats.com]
17 And Bigger There were 5 exabytes of information created by the entire world between the dawn of civilization and Now that same amount is created every two days. Eric Schmidt (then CEO of Google), at the Techonomy Conference in Lake Tahoe, California, August
18 Part 3: State-of-the-art Big Data Processing Platforms
19 A Classic Batch Processing Architecture. Dean Wampler, Fast Data Architectures For Streaming Applications
20 The Big Data Stack. Courtesy of Amir H. Payberah, Data Intensive Computing Platforms
21 The Big Data Stack: Resource Management Layer. Courtesy of Amir H. Payberah, Data Intensive Computing Platforms
22 The Big Data Stack: Storage Layer. Courtesy of Amir H. Payberah, Data Intensive Computing Platforms
23 The Big Data Stack: Data Processing Layer. Courtesy of Amir H. Payberah, Data Intensive Computing Platforms
24 Apache Spark: In-Memory Distributed Processing Platform. Similar Semantics for Batch & Stream Processing. Initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009. Became a top-level Apache Project in February. Forks, 1068 Contributors. Written primarily in Scala, more than 1M lines of code.
25 Spark vs. Hadoop MapReduce. Courtesy of Amir H. Payberah, Data Intensive Computing Platforms
26 Spark Stack
27 The Bigger Picture: BDAS, the Berkeley Data Analytics Stack
28 Apache Flink: An open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. Data is processed one event at a time rather than as a series of batches. Originally named Stratosphere, started in 2010 with funding from DFG. Became a top-level Apache Project in December. Forks, 309 Contributors. Written primarily in Java, more than 1M lines of code.
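The difference between event-at-a-time processing and processing a stream as a series of batches can be illustrated with a small sketch. The code below is plain Python and does not use the Flink or Spark APIs; the function names and the `batch_size` parameter are assumptions made for this illustration. Both functions compute a running total over the same stream, but they differ in when results become available.

```python
from typing import Iterable, Iterator, List


def per_event_totals(events: Iterable[int]) -> Iterator[int]:
    """Event-at-a-time: update state and emit a result for every event."""
    total = 0
    for value in events:
        total += value
        yield total


def micro_batch_totals(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Micro-batching: buffer events and emit one result per full batch."""
    total = 0
    batch: List[int] = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            batch.clear()
            yield total
    if batch:  # flush the final, possibly partial, batch
        total += sum(batch)
        yield total


if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5]
    print(list(per_event_totals(stream)))       # one result per event
    print(list(micro_batch_totals(stream, 2)))  # one result per batch of 2
```

The final totals agree; the per-event version simply produces an up-to-date answer after every input, while the batched version makes the consumer wait until a batch fills.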
29 Flink Stack
30 The Next Big Thing:
31 Part 4: The Rise of Fast Data: Applications & Platforms
32 They Don't Wait For It
33 We Can't Wait For It
34 They Won't Wait For It
35 They Shouldn't Wait For It (cabotsolutions.com)
36 My Boss Won't Wait For It
37 Fast Data: A Definition. "Fast data is the application of big data analytics to smaller data sets in near-real or real-time in order to solve a problem or create business value." - TechTarget
38 Looking Back at a Classic Batch Processing Architecture. Dean Wampler, Fast Data Architectures For Streaming Applications
39 Fast Data Processing Architecture. Dean Wampler, Fast Data Architectures For Streaming Applications
40 How to Get Involved
41 AWS Ecosystem
42 Hortonworks Data Platform
43 Open-source!
44 And to Wrap it Up: Big Data History & Platforms; Big Data vs. Fast Data; Fast Data Architectures & Platforms; Getting Involved
45 Attribution: Thanks to Alekksall, Ddraw, Ibrandify, Yurlick, and Makyzz of freepik.com, for the free pics! Thanks to the awesome people at The Apache Software Foundation. For Everything. Including the graphics.
46 And To You