Languages, Systems and Paradigms for Big Data. Dario COLAZZO

1 Languages, Systems and Paradigms for Big Data Dario COLAZZO

2 Plan: Introduction to Big Data; MapReduce; Hadoop and its ecosystem; RDD and Spark; Pig Latin & Hive

3 Assessment. Project: analytics via MapReduce + Spark, with an experimental evaluation in the report. Written exam. Final grade = 0.7*Ex + 0.3*Pr
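The grading formula above is a plain weighted average. A minimal sketch follows; the function name and the 0-20 marking scale are assumptions for illustration, since the slide states only the weights.

```python
def final_grade(exam: float, project: float) -> float:
    """Weighted average from the slide: 0.7 * Ex + 0.3 * Pr."""
    return 0.7 * exam + 0.3 * project


# Hypothetical marks on an assumed 0-20 scale (not from the slides).
example = final_grade(exam=10.0, project=20.0)
print(round(example, 2))
```

Because the exam carries 70% of the weight, a strong project alone cannot compensate for a weak exam.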

4 About me: PhD in Computer Science at Università di Pisa. Post-docs: Università di Venezia, Université Paris-Sud. MdC (associate professor) at Université Paris-Sud, LRI-INRIA Saclay. Professor at Univ. Paris-Dauphine since September 2013. Research and teaching topics: languages and systems for databases; focus on Web data (XML, RDF, JSON); expressive and efficient processing of Big Data. Publications:

5 Big data and Data science: an introduction

6 Contents: What? Where? Why? How?

7 Some numbers

8 An introductory video

9 How many, how much? How much data in the world? 44 zettabytes by …; … terabytes, …; … exabytes, 2006 (1 EB = 10^18 B); 4.5 zettabytes, 2012 (1 ZB = 10^21 B). How much is a zettabyte? 1,000,000,000,000,000,000,000 bytes: a stack of 1 TB hard disks that is 25,400 km high. How much data in a day? 7 TB, Twitter; 10 TB, Facebook. 90% of the world's data was generated over the last two years! "640K ought to be enough for anybody."
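The zettabyte figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and that the 25,400 km stack holds exactly one zettabyte; the variable names are illustrative.

```python
ZETTABYTE = 10**21  # bytes, decimal (SI) definition
TERABYTE = 10**12   # bytes

# Number of 1 TB disks needed to hold one zettabyte.
disks_per_zettabyte = ZETTABYTE // TERABYTE

stack_height_km = 25_400                       # figure quoted on the slide
stack_height_mm = stack_height_km * 1_000_000  # 1 km = 1,000,000 mm

# Thickness per disk implied by the slide's stack height.
implied_thickness_mm = stack_height_mm / disks_per_zettabyte

print(disks_per_zettabyte)
print(implied_thickness_mm)
```

Under these assumptions the stack contains one billion disks, each about 25.4 mm (one inch) thick, so the slide's figure is internally consistent.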

10 Data grows fast!

11 How fast?

12 Proliferation of data sources

13 Proliferation of devices

14 Users generate data

15 Internet of Things

16 Types of data

17 What is more important? The Big? The Data? Both? Neither: what organisations do with data. "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." (Cliff Stoll)

18 Not just a matter of volume: the four "Vs" of Big Data

19 Big Data: V^4 + Value. Volume: terabyte (10^12), petabyte (10^15), exabyte (10^18), zettabyte (10^21). Variety: structured, semi-structured, unstructured; text, image, audio, video, record. Velocity: periodic, near real time, real time. Veracity: quality of the data can vary greatly. Value: big data can generate huge competitive advantages.

20 The wide availability of data allows us to apply more sophisticated models and get much more accurate results than in the past! "It is a capital mistake to theorize before one has data." (Sherlock Holmes) "The bigger the data set you have, the more accurate the predictions about the future will be." (Anthony Goldbloom)

21 Bigger = Smarter? YES: algorithms work better; tolerate errors; discover the long tail and "corner cases". BUT: more heterogeneity; data grows very fast; still need humans to ask the right questions.

22 Why now? Because we have large amounts of data: data originates already in digital form; ~40% growth per year. Because we can: $500 for a drive that can store all the music of the world; 40 years of Moore's Law; large computational resources; 64% of organizations have invested in big data in …; … billion $ invested in big data in 2013. Because we reached a dead end with logic.
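To get a feel for what "~40% growth per year" means, the sketch below computes the implied doubling time and the growth factor over a decade. It assumes a constant compound rate, which the slide does not state explicitly; the function names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.40  # "~40% growth per year" from the slide


def doubling_time_years(rate: float) -> float:
    """Years needed for a quantity to double at a constant compound rate."""
    return math.log(2) / math.log(1 + rate)


def growth_factor(rate: float, years: int) -> float:
    """Multiplicative growth after the given number of years."""
    return (1 + rate) ** years


print(round(doubling_time_years(ANNUAL_GROWTH), 2))
print(round(growth_factor(ANNUAL_GROWTH, 10), 1))
```

At this rate the volume of data doubles roughly every two years and grows by a factor of almost 29 in ten years.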

23 Bigger = smarter, an example: Google Translate. You collect snippets of translations; you match sentences to snippets; you continuously debug your system. Why does it work? There are tons of snippets on the Web; the accuracy improves as the training set grows.

24 Other success stories

25 Other cases: crime prevention in Los Angeles; diagnosis and treatment of genetic diseases; investments in the financial sector; generation of personalized advertising; astronomical discoveries; ...

26 Use cases
Healthcare. Today's challenge: expensive office visits. New data: remote patient monitoring. What's possible: preventive care, reduced hospitalization.
Manufacturing. Today's challenge: in-person support. New data: product sensors. What's possible: automated diagnosis, support.
Location-based services. Today's challenge: based on position. New data: real-time location data. What's possible: geo-advertising, traffic, local search.
Public sector. Today's challenge: standardized services. New data: citizen surveys. What's possible: tailored services, cost reductions.
Retail. Today's challenge: one-size-fits-all marketing. New data: social media. What's possible: sentiment analysis, segmentation.

27 The Big Data Process: Acquisition, Extraction, Integration, Analysis, Interpretation, Decision. Goal: to make effective strategic decisions exploiting the availability of big data.

28 Big Data in action: Acquisition. Requires: selection, filtering, metadata generation, managing provenance.

29 Big Data in action: Extraction. Requires: transformation, normalization, cleaning, aggregation, error handling.

30 Big Data in action: Integration. Requires: standardization, conflict management, reconciliation, mapping definition.

31 Big Data in action: Analysis. Requires: exploration, data mining, machine learning, visualization.

32 Big Data in action: Interpretation. Requires: knowledge of the domain, knowledge of the provenance, identification of patterns of interest, flexibility of the process.

33 Big Data in action: Decision. Requires: managerial skills, continuous improvement of the process.

34 A simple example. Problem: the sale of lollipops is going down! Acquisition: sales by customer, region and time; surveys of users; social networks. Extraction: data loading from receipts; automatic reading of questionnaires; data extraction from Twitter. Integration: on the basis of user types. Analysis: lollipops are bought by people older than 25; lollipops are preferred by people younger than 10. Interpretation: moms believe lollipops = bad teeth; boys and girls believe that lollipops are for babies. Decision: we make lollipops without sugar; we ask dentists to advertise our lollipops; we make commercials targeted to boys and girls.
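The first stages of the lollipop example can be sketched as a small pipeline. This is a toy under stated assumptions: the receipt format, every record value, the function names and the "adult"/"young" labels are invented for illustration. Only the age threshold of 25 comes from the slide. Interpretation and decision stay with people and are not modelled.

```python
from collections import Counter

# Hypothetical toy records standing in for acquired data (receipts).
# Every value below is invented for illustration.
RAW_RECEIPTS = [
    "age=34;item=lollipop",
    "age=8;item=lollipop",
    "age=41;item=lollipop",
    "age=unknown;item=lollipop",  # unusable record, dropped during extraction
    "age=29;item=chocolate",
]


def extract(lines):
    """Extraction: parse raw lines into dicts, skipping records without a numeric age."""
    records = []
    for line in lines:
        fields = dict(part.split("=", 1) for part in line.split(";"))
        if fields.get("age", "").isdigit():
            records.append({"age": int(fields["age"]), "item": fields["item"]})
    return records


def integrate(records):
    """Integration: attach a user type to each record (threshold of 25 from the slide)."""
    return [{**r, "user_type": "adult" if r["age"] > 25 else "young"} for r in records]


def analyse(records):
    """Analysis: count lollipop purchases per user type."""
    return Counter(r["user_type"] for r in records if r["item"] == "lollipop")


counts = analyse(integrate(extract(RAW_RECEIPTS)))
print(counts)
```

Keeping each stage as a separate function mirrors the process diagram: each step can be improved or replaced without touching the others.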

35 Risks and challenges. Performance, performance, performance! Data grows faster than energy on chip. Efficiency; scalability; effectiveness; heterogeneity; flexibility; privacy; costs.

36 Effectiveness: a failure story. Google Flu Trends over-estimated the prevalence of flu for 100 of 108 weeks. "Google would not comment on this year's difficulties. But several researchers suggest that the problems may be due to widespread media coverage of this year's severe US flu season, including the declaration of a public-health emergency by New York state last month. The press reports may have triggered many flu-related searches by people who were not ill." (Nature)

37 Risks of bad interpretation

38 Privacy: unpleasant drawbacks. AOL search data leak (NYT, 8/9/2006). Anonymous Netflix vs IMDb database (Wired, 12/13/2007). "Why Johnny Can't Browse The Internet In Peace" (Forbes, 8/1/2012). "How Companies Learn Your Secrets" (NYT, 16/2/2012).

39 Performance

40 Distribution and parallelism. Distributed architecture: clusters of computers that work together toward a common goal. Scale out, not up! Fault tolerance: resource replication, eventual consistency. Distributed processing: shared-nothing model. New programming paradigms.
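As a rough illustration of the shared-nothing style, here is a single-process sketch of a MapReduce-style word count (MapReduce is the first paradigm in the course plan). The partitions, sample text and function names are assumptions. A real cluster would run each map call on a different node and move data over the network during the shuffle.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit (word, 1) pairs for one partition, using only that partition's data."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_partitions):
    """Shuffle: group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, split into two partitions as if stored on two nodes.
partitions = [
    ["big data grows fast", "data is not information"],
    ["scale out not up", "big clusters"],
]

word_counts = reduce_phase(shuffle(map_phase(p) for p in partitions))
print(word_counts)
```

Each map call touches only its own partition, which is what lets the work scale out: adding nodes adds partitions without any shared state between them.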

41 The Big Data landscape

42 The new software stack. New programming environments designed to get their parallelism not from a supercomputer, but from computing clusters. Bottom of the stack: distributed file system (DFS). We have a winner! "I think there is a world market for about five computers."

43 On top of Hadoop. Hundreds of different (high-level) programming solutions. Two main scenarios. Analytics (mainly batch): collecting, transforming, and modelling data with the goal of discovering useful information and supporting decision-making. Real-time processing: processing data and returning the results sufficiently quickly to affect the environment at that time (e-commerce, search engines, booking, ...).

44 The Big Data flow: real-time streams; real-time processing; batch processing; ETL; NoSQL (HBase, Cassandra, MongoDB); New SQL (Oracle, VoltDB, Teradata); analytics (Vertica, Cloudera, Greenplum); distributed file system (HDFS).

45 Techniques for Big Data analysis. Extract, transform, and load (ETL); data fusion and data integration; distributed file system; NoSQL database systems; cloud computing. Analytics. Data mining: association rule learning, classification, cluster analysis, regression. Machine learning: supervised learning, unsupervised learning. Crowdsourcing; ...

46 Goal of Analytics

47 Data scientist: a brand new profession. "Data Scientist: The Sexiest Job of the 21st Century" [Harvard Business Review 2013]. "Data scientist? A guide to 2015's hottest profession" [Mashable 2015].

48 Skills of data scientists

49 Conclusions. We live in the era of Big Data: wide availability in different areas; big opportunities to solve big problems; data can create value. The challenge is how to manage and use it: new technologies are needed; methodological aspects are important. A rapidly evolving area. Data scientists: currently the hottest profession in IT.

50 So, let us face big data projects

51 with a Bruce Willis attitude!

52 References. "Big Data: The next frontier for innovation, competition, and productivity". McKinsey & Company. "Challenges and Opportunities with Big Data". A community white paper developed by leading researchers across the United States. "Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics". Bill Franks, John Wiley & Sons, 2012.

53 Questions? (Photo credit: Jimmy Lin)

54 Thanks and credits for this introductory part: Jimmy Lin (University of Maryland), Riccardo Torlone (Università di Roma)