COMP9321 Web Application Engineering

1 COMP9321 Web Application Engineering, Semester 1, 2017. Dr. Amin Beheshti, Service Oriented Computing Group, CSE, UNSW Australia. Week 11 (Part II)

2 Big Data: Challenges and Opportunities

3 We are Generating Vast Amounts of Data!! Healthcare: remote patient monitoring. Manufacturing: product sensors. Retail: social media. Digitalization of Artefacts: books, music, videos, etc. Location-Based Services: real-time location data.

4 We are Generating Vast Amounts of Data!! Airbus A380: generates 10 TB every 30 min. Twitter: generates approximately 12 TB of data per day. Facebook: data grows by over 500 TB daily. New York Stock Exchange: 1 TB of data every day.

5 We are Generating Vast Amounts of Meta-data!! Provenance Data Versioning Privacy Security

6 We are Generating Vast Amounts of Meta-data!! Provenance Data Versioning Privacy Security. We are tracing everything: Who did what? When? Where? E.g. Twitter handles ~1.6 billion search queries per day.

8 We are Generating Vast Amounts of Meta-data!! Reading a book, e.g. Kindle tracks: what you are reading, when you are reading it, how often you read it, etc. Listening to music, e.g. mp3 player tracks: what you are listening to, when and how often, in what order, etc. Smart phones, e.g. iPhone tracks: our location, our speed, what apps we are using, who we are ringing, etc.

10 Big Data and Big Meta-Data: Big share, comment, review, crowdsource, etc.

11 So, What is Big Data? Big data refers to our ability to collect and analyse the ever-expanding amounts of data and meta-data that we are generating every second! Challenges: Capture, Storage, Search, Sharing, Transfer, Analysis, Visualization, etc.

14 What Makes it Big Data? Volume: the vast amounts of data generated every second. Velocity: the speed at which new data is generated and moves around. Variety: the increasingly different types of data. Veracity: the quality of data, e.g. the messiness of the data; noisy and inconsistent data needs detecting and correcting. Value: Statistical, Events, Correlation, Hypothetical.
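
The Veracity point (detecting and correcting noisy and inconsistent data) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function name clean_readings, the sample body-temperature readings, and the plausible range of 30 to 45 degrees Celsius are hypothetical and do not come from the lecture.

```python
def clean_readings(readings, low=30.0, high=45.0):
    """Split raw readings into cleaned values and rejected inputs.

    Correcting: strip whitespace and accept a decimal comma.
    Detecting: reject unparseable values and values outside [low, high].
    The range is an assumed plausibility check, not a rule from the slides.
    """
    cleaned, rejected = [], []
    for raw in readings:
        try:
            value = float(str(raw).strip().replace(",", "."))
        except ValueError:
            rejected.append(raw)      # noise: not a number at all
            continue
        if low <= value <= high:
            cleaned.append(value)
        else:
            rejected.append(raw)      # inconsistent: e.g. wrong unit
    return cleaned, rejected


# Hypothetical sensor feed mixing formats, units, and missing values.
raw_temperatures = ["36.6", " 37,1 ", "n/a", "98.6", 36.9, ""]
cleaned, rejected = clean_readings(raw_temperatures)
print(cleaned)
print(rejected)
```

Keeping the rejected inputs, rather than silently dropping them, leaves a trail that can be inspected later, which ties back to the meta-data theme of the earlier slides.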

15 Challenges: How to Store and Process? Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of storage and processing. On-hand database management tools? Traditional data processing applications?

16 Challenges: Big Data Storage. NoSQL databases: Employ less constrained consistency models. Simple retrieval and appending operations. Significant performance benefits. Examples: Key-value Store, Document Store, Graph Database.
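
The three NoSQL examples differ mainly in the shape of the data they store. The sketch below mimics each shape with plain Python structures; it is not the API of any real database, and all names (kv_store, documents, find, graph, friends_of_friends) and sample records are hypothetical.

```python
# Key-value store: an opaque value looked up by a single key.
kv_store = {}
kv_store["session:42"] = "user=alice;cart=3"

# Document store: nested, schema-less records; fields may differ per document.
documents = [
    {"_id": 1, "name": "alice", "tags": ["admin"], "address": {"city": "Sydney"}},
    {"_id": 2, "name": "bob"},  # no tags or address, and that is allowed
]


def find(docs, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Graph database: nodes plus relationships, here as adjacency sets.
graph = {
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": {"dave"},
    "dave": set(),
}


def friends_of_friends(g, start):
    """Nodes exactly two hops from start, excluding direct neighbours."""
    direct = g.get(start, set())
    two_hops = {fof for friend in direct for fof in g.get(friend, set())}
    return two_hops - direct - {start}


print(kv_store["session:42"])
print(find(documents, name="bob"))
print(friends_of_friends(graph, "alice"))
```

The two-hop query is the kind of relationship traversal that is awkward to express with joins in a relational schema but natural in a graph model.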

17 Challenges: Big Data Storage (Graphs are Everywhere). Social Network (User); Collaborative Filtering (Netflix, Movie); Probabilistic Analysis; Text Analysis (Docs, Wiki, Words).

21 Challenges: Big Data Processing. Apache Hadoop: Hadoop is an open source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers. Apache Hadoop solution: Distributed File System (HDFS), MapReduce, Pig, HCatalog. Who uses Hadoop? Amazon, Facebook, Google, IBM, New York Times, Yahoo!
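
The "simple programming model" behind MapReduce can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle, and reduce steps using word count as the example; it does not use Hadoop, and the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, each line (or file block) could be mapped on a
    # different machine; here everything runs sequentially.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "Data moves fast"])
print(counts)
```

Because map_phase looks at one line at a time and reduce_phase at one key at a time, both steps can be spread across many machines without coordination, which is what makes the model suitable for clusters.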

23 Challenges: Big Data Processing. Apache Spark: a fast and expressive cluster computing engine, compatible with Apache Hadoop. Efficient: in-memory storage. Usable: rich APIs in Java, Scala, Python. Resilient Distributed Dataset (RDD): Spark's data storage model.
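
The idea behind an RDD (partitioned data plus a recorded chain of transformations that is only evaluated on demand) can be imitated in a few lines of plain Python. The ToyRDD class below is a hypothetical teaching aid, not the Spark API; it runs in one process and only mirrors the names map, filter, and collect.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus a lineage of steps."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # list of lists: the source data
        self._lineage = lineage        # recorded transformations, not yet run

    def map(self, fn):
        # Transformations are lazy: only extend the lineage.
        return ToyRDD(self._partitions, self._lineage + (lambda it: map(fn, it),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (lambda it: filter(pred, it),))

    def _compute(self, partition):
        # Replay the lineage over one partition. A lost result could be
        # rebuilt by calling this again, which is the "resilient" part.
        it = iter(partition)
        for step in self._lineage:
            it = step(it)
        return list(it)

    def collect(self):
        # Action: triggers the actual computation, partition by partition.
        return [x for p in self._partitions for x in self._compute(p)]


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()
print(result)
```

Each transformation returns a new ToyRDD and leaves the original untouched, so numbers can still be collected or transformed differently afterwards.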

24 Challenges: Big Data Integration. Workflows, IT Systems, Web Services, People. Example Scenario: Business Processes (BPs), BPs Execution Log..

26 Challenges: Big Data Integration. The Big Data world is messy, schema-less and complex. Less than 10% of the Big Data world is genuinely relational. E.g. Linked Data

27 Challenges: Big Data Integration. Big Data-as-a-Service: effective processing of big data within acceptable processing time; easy access to the big data and the big data analysis results. API Engineering: ProgrammableWeb - APIs, Mashups and the Web as Platform; DataSift.open data sources

28 Challenges: Big data requires a broad set of skills. Math and Operations Research Expertise: develop analytic algorithms. Data Experts: data architecture, management, governance, policy. Decision Making, Executive and Management: apply information to solve business issues. Tool Developers: mask complexity and analytics to lower skills boundaries. Visualization Expertise: interpret data sets, determine correlations and present in meaningful ways. Industry Vertical Domain Expertise: develop hypotheses, identify relevant business issues, ask the right questions.

29 Challenges: Big Data Analytics. Analytics can be defined in many ways, but what matters is the purpose of analytics. Most definitions agree on the following: Analytics is used to gain insights from data in order to make better decisions, using mathematical or scientific methods. Data -> (Analyse) -> Insight -> (Decide) -> Action. Manage the Data, Understand the Data, Act on the Data.

33 Challenges: Big Data Analytics. Example: Beheshti et al., Scalable Graph-based OLAP Analytics over Process Execution Data, DAPD Journal (2015). Beheshti et al., A Framework and a Language for On-Line Analytical Processing on Graphs, WISE Conference (2012). OLAP is an approach to answering multi-dimensional analytical queries swiftly. Problem: extending existing OLAP techniques to the analysis of graphs is not straightforward; key business insights remain hidden in the interactions among objects. Solution: On-Line Analytical Processing on Graphs.
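
To make "insights hidden in the interactions among objects" concrete, the sketch below performs one OLAP-style roll-up over a tiny graph: person-to-person interactions are aggregated up to the department level. This is a generic toy illustration, not the framework or language from the cited papers; the names node_dept, interactions, and roll_up and all sample data are hypothetical.

```python
from collections import Counter

# Dimension table: each node (person) belongs to a department.
node_dept = {"ann": "sales", "bob": "sales", "cat": "support", "dan": "finance"}

# Fact data: directed interactions between people (e.g. task hand-overs).
interactions = [
    ("ann", "cat"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("ann", "bob"),
    ("bob", "cat"),
]


def roll_up(edges, dimension):
    """Aggregate edges to a coarser level: count interactions per
    (source group, target group) pair of the chosen dimension."""
    return Counter((dimension[src], dimension[dst]) for src, dst in edges)


summary = roll_up(interactions, node_dept)
for (src_dept, dst_dept), count in summary.most_common():
    print(src_dept, "->", dst_dept, count)
```

The aggregate is itself a smaller graph whose nodes are departments, so the same operation could be applied again with a coarser dimension such as business unit.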

35 Challenges: Big Data Analytics. Big Data Analytics benefits from: NLP; Machine Learning; pattern recognition, learning, extraction, classification, enrichment, linking, etc. Examples: Healthcare, Social Networks (e.g. Twitter), Education, Finance.

36 Challenges: Big Data Analytics. Big Data Analytics benefits from: NLP; Machine Learning; pattern recognition, learning, extraction, classification, enrichment, linking, etc. Beheshti et al., Big data and cross-document coreference resolution: Current state and future opportunities...

37 Big Data Leadership!! Industry has been in the lead: Google, Amazon, Yahoo!, etc. University researchers have been left behind!! due to lack of access to large-scale cluster computing facilities. Government agencies are making heavy investments. Investments in big-data computing will have extraordinary near-term and long-term benefits. Cloud computing must be considered a strategic resource.

38 Big Data: Opportunities. Varieties of Data: Text, Social Media, Networks, Multimedia, Machine Data, Sensors. Analytics: Organizing Big Data, Navigating through data, Summarizing Big Data, Process Data Analytics, Support decision-making. Integration: Integrating enterprise and public data, Linking data/context, Entity Extraction and Integration, Knowledge Graph. Big Data Performance: In memory, New Benchmarks and Architecture. User Experience: automation and intelligent guidance, Visualizing with Analytics, Interacting with Analytics, Storytelling.

40 Conclusion. Why is Big Data different from past Very Large Datasets? Meta-Data!! Having the ability to analyse Big Data is of limited value if users cannot understand the analysis. How can industry and academia collaborate towards solving Big Data challenges? What is big today may not be big tomorrow!

41 Thank you!