Transforming Big Data into Unlimited Knowledge

1 Transforming Big Data into Unlimited Knowledge Timos Sellis, RMIT University

2 Big Data: What is it? Most commonly accepted definition, by Gartner (the 3 Vs): "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

3 Big Data Characteristics based on the source: McKinsey Global Institute,

4 Big Data: notice. Among the few innovations where industry is far ahead of academia. Not a new wave: most of the problems have been in the focus of data management research for years. The main issue is to put all this together, using innovative technology, serving users' needs. Industry and academia need to work hand in hand.

5 A paradigm shift - Science. The 4th Paradigm of Science: Data-Intensive Scientific Discovery. From e-Science to d-Science.

6 Data Science at its best. Human Brain Project (EU): develop a platform to simulate the human brain! Bottleneck in spatial analysis: the 3D spatial range query. Scale: 86 billion neurons, 100 trillion synapses. [Figure: model size (gigabytes) versus simulation size (number of neurons).]
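To make the "3D spatial range query" concrete, here is a minimal sketch, assuming the query region is an axis-aligned box and that points are plain (x, y, z) tuples. The function name `range_query_3d` and the sample coordinates are hypothetical, not taken from the Human Brain Project. It uses a linear scan, which is exactly the kind of approach that turns into a bottleneck at the scale the slide mentions; a spatial index would normally take its place.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]


def range_query_3d(points: Iterable[Point], lo: Point, hi: Point) -> List[Point]:
    """Return the points inside the axis-aligned box [lo, hi], bounds inclusive.

    Linear scan: every point is tested against all three dimensions.
    """
    return [
        p for p in points
        if all(lo[d] <= p[d] <= hi[d] for d in range(3))
    ]


# Hypothetical sample positions, for illustration only.
neurons = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (5.0, 5.0, 5.0), (2.5, 2.5, 2.5)]
hits = range_query_3d(neurons, lo=(0.5, 0.5, 0.5), hi=(3.0, 3.0, 3.0))
print(hits)
```

The cost of this scan grows linearly with the number of stored points for every query, which illustrates why range queries dominate as models grow.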

7 A paradigm shift - Business. Vestas Wind Energy: Turbine Placement and Maintenance. Danish firm Vestas uses supercomputers and a big data modelling solution to pinpoint the optimal location for its wind turbines to maximize power generation and reduce energy cost. The solution incorporates data from global weather systems with data collected from its existing turbines. The wind library holds nearly 3 petabytes of data. Parameters include temperature, barometric pressure, humidity, precipitation, wind direction and velocity from the ground level up to 300 feet, and the company's recorded historical data. The company expects to analyze even more diverse and bigger weather data sets, reaching 20-plus petabytes over the next four years, as Vestas plans to add global deforestation metrics, satellite images, historical metrics, geospatial data and data on phases of the moon and tides.

8 Some issues for this talk. How are the volume, variety, velocity and veracity of data affecting knowledge discovery? Exploring and discovering information by sifting through large quantities of data. How to acquire, understand and correlate social media for decision making. Understanding the behaviour of customers based on geospatial and text data.

9 Knowledge Discovery. Pipeline: data, learning, model. Supervised learning: classification, regression, recommender. Unsupervised learning: clustering, dimensionality reduction, topic modeling.

10 On Model Power and Data. More data: improves generalization; facilitates more powerful models; is often cheap; increases I/O cost. More powerful models: improve model fidelity and generalization; require more training data; increase computational cost. Observation: today, we almost exclusively focus on simple models using lots of data. Conjecture: it is too hard to write complex, distributed learning algorithms. Knowledge discovery needs to change.

11 Influences of Big Data on KD Success. Feature engineering, parameter tuning, problem definition (domain expert, data scientist). Learning methods and theory: learning methods, efficient algorithms, learning theory (machine learning researcher). Parallel computing: programming models, distributed systems, resource management (software engineers).

12 Knowledge Discovery Workflow. Step I: Example Formation. Step II: Modeling. Step III: Evaluation (and eventually Deployment). Flow: example formation, examples, modeling, model, evaluation.
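The three workflow steps can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the pipeline from the talk: the record fields (`x`, `y`, `label`), the nearest-centroid model and the toy data are all hypothetical choices made to keep the example self-contained.

```python
from collections import defaultdict


def form_examples(raw_rows):
    """Step I (example formation): raw records -> (feature_vector, label) pairs."""
    return [((float(r["x"]), float(r["y"])), r["label"]) for r in raw_rows]


def train_centroids(examples):
    """Step II (modeling): fit a nearest-centroid model, one mean vector per label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {lab: (s[0] / counts[lab], s[1] / counts[lab]) for lab, s in sums.items()}


def predict(model, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    x, y = features
    return min(model, key=lambda lab: (model[lab][0] - x) ** 2 + (model[lab][1] - y) ** 2)


def evaluate(model, examples):
    """Step III (evaluation): fraction of held-out examples predicted correctly."""
    correct = sum(predict(model, feats) == label for feats, label in examples)
    return correct / len(examples)


# Hypothetical raw records, for illustration only.
raw = [
    {"x": "0.0", "y": "0.1", "label": "a"},
    {"x": "0.2", "y": "0.0", "label": "a"},
    {"x": "1.0", "y": "1.1", "label": "b"},
    {"x": "0.9", "y": "1.0", "label": "b"},
    {"x": "0.1", "y": "0.2", "label": "a"},
    {"x": "1.1", "y": "0.9", "label": "b"},
]
examples = form_examples(raw)
train, held_out = examples[:4], examples[4:]
model = train_centroids(train)
accuracy = evaluate(model, held_out)
print(accuracy)
```

Keeping the three steps as separate functions mirrors the workflow: each stage consumes the output of the previous one, so any stage can be swapped without touching the others.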

13 A Unifying Design. Layers: ML algorithm; logical query over training data; query processor; parallel dataflow engine (e.g. Hadoop).
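One way to read "ML algorithm as a logical query over training data" is that each training step is an aggregate computed per partition and then combined, which is the shape a parallel dataflow engine executes well. The sketch below assumes a one-parameter least-squares model y = w * x and in-memory lists standing in for partitions; the function names, learning rate and data are hypothetical, and nothing here uses Hadoop itself.

```python
from functools import reduce


def partial_gradient(partition, w):
    """'Map' side: gradient sum of squared error over one partition, plus row count."""
    grad = sum(2.0 * (w * x - y) * x for x, y in partition)
    return grad, len(partition)


def combine(a, b):
    """'Reduce' side: merge two (gradient_sum, count) partial aggregates."""
    return a[0] + b[0], a[1] + b[1]


def fit(partitions, steps=200, lr=0.05):
    """Gradient descent where every step is one aggregate query over all partitions."""
    w = 0.0
    for _ in range(steps):
        grad_sum, n = reduce(combine, (partial_gradient(p, w) for p in partitions))
        w -= lr * grad_sum / n
    return w


# Hypothetical training data split into two partitions; generated from y = 2 * x.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = fit(partitions)
print(round(w, 6))
```

Because `combine` is associative and commutative, the partial aggregates could be computed on separate workers in any order, which is the property a dataflow engine relies on.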

14 Another type of Big Data - Big Networks. The Web: Yahoo had a 1.4 billion node Web graph in 2002 (Kang, 2012). Sensor networks: nine billion devices connected to the Internet (wsnblog.com). Social networks: Facebook loads 60 terabytes of new data every day.

15 A World of Big Networks. Amazon review labelled network: about 5.8 million product reviews. The snapshot shows 1000 reviewers. Green nodes represent reviewers and blue nodes products. Reviews are links between reviewers and products.

16 FraudEagle: Opinion Fraud Detection A user-product labelled network: identifying fake reviews and fraudulent reviewers (reviews are edges linking reviewers and products).
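The user-product network described above can be represented as a list of signed edges. The sketch below is only a stand-in heuristic to show the data structure at work: it is not the FraudEagle algorithm, whose inference procedure is not described in these slides. It assumes each review carries a sign (+1 positive, -1 negative) and scores each reviewer by how often they disagree with the majority opinion on a product; all names and data are hypothetical.

```python
from collections import defaultdict

# Each review is an edge (reviewer, product, sign); +1 = positive, -1 = negative.
# Hypothetical data, for illustration only.
reviews = [
    ("u1", "p1", +1), ("u2", "p1", +1), ("u3", "p1", -1),
    ("u1", "p2", -1), ("u2", "p2", -1), ("u3", "p2", +1),
]


def product_consensus(edges):
    """Majority sign per product (ties count as positive)."""
    totals = defaultdict(int)
    for _, product, sign in edges:
        totals[product] += sign
    return {p: (1 if t >= 0 else -1) for p, t in totals.items()}


def disagreement_rate(edges):
    """Fraction of each reviewer's reviews that contradict the product consensus."""
    consensus = product_consensus(edges)
    disagree = defaultdict(int)
    count = defaultdict(int)
    for reviewer, product, sign in edges:
        count[reviewer] += 1
        if sign != consensus[product]:
            disagree[reviewer] += 1
    return {r: disagree[r] / count[r] for r in count}


rates = disagreement_rate(reviews)
print(rates)
```

A majority vote is easy to game when fraudulent reviewers outnumber honest ones on a product, which is one reason network-based methods reason about reviewers and products jointly rather than one product at a time.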

17 FraudEagle

18 FraudEagle on the Amazon Product Review Network. The network was generated by simulating reviewers with a probability of giving a fraudulent review. The figure shows the labelled network after running the FraudEagle program. The circular nodes are users and the square nodes are products. Green users are honest and red are fraudulent. Blue products are rated good, black products are rated bad.

19 Exploring Big Spatial Data - Geomarketing. Geomarketing: delivering relevant content for a given geographical context; analyzing consumers' behavior according to specific areas. Mapvertising: using maps or satellite pictures with relevant information, ads, promotions, etc.; showing sponsored results vs. normal search results.

20 Use Case: Austrian Bank (WIGeoGIS). To support the 143 Erste Bank branches in Austria with marketing campaigns as personally and flexibly as possible, and to reach the desired target audiences within the branch areas optimally, Erste Bank's branch marketing department has for years relied on a web-based branch and marketing information system from WIGeoGIS. The system identifies the target audience at the administrative district level, on the basis of socio-demographic figures on the one hand and purchasing power and company-specific data on the other. Administrative districts that exhibit high potential yet still-low performance can easily be selected via the tool in order to take them into particular consideration in spatial marketing campaigns. Next: extending it with social media spatio-textual information.
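Selecting districts with high potential yet low performance amounts to a two-threshold filter. The sketch below is a minimal illustration under assumptions: the `District` fields, the score scale and the thresholds are hypothetical and do not describe the WIGeoGIS system.

```python
from dataclasses import dataclass


@dataclass
class District:
    name: str
    potential: float    # hypothetical score, e.g. derived from purchasing power
    performance: float  # hypothetical score, e.g. derived from company-specific data


def select_targets(districts, min_potential, max_performance):
    """Keep districts with high potential and low performance, largest gap first."""
    picked = [
        d for d in districts
        if d.potential >= min_potential and d.performance <= max_performance
    ]
    return sorted(picked, key=lambda d: d.potential - d.performance, reverse=True)


# Hypothetical districts, for illustration only.
districts = [
    District("A", potential=0.9, performance=0.2),
    District("B", potential=0.8, performance=0.7),
    District("C", potential=0.4, performance=0.1),
    District("D", potential=0.7, performance=0.3),
]
targets = [d.name for d in select_targets(districts, min_potential=0.6, max_performance=0.4)]
print(targets)
```

Sorting by the gap between potential and performance puts the districts with the most unrealized potential at the top of the campaign list.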

21 More information is available: activity sequences; people's similar moving patterns.

22 Identification of network hubs: POIs that different users visit, depart from and arrive at; number of position samples; extended periods of time.
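As a rough sketch of the first criterion, a POI can be treated as a hub candidate when enough distinct users visit it. The visit log format, the `min_users` threshold and the function name are assumptions for illustration; the other criteria on the slide (position samples, time spent) are not modelled here.

```python
from collections import defaultdict


def find_hubs(visits, min_users):
    """Return POIs visited by at least `min_users` distinct users, sorted by name."""
    users_at = defaultdict(set)
    for user, poi in visits:
        users_at[poi].add(user)  # a set ignores repeat visits by the same user
    return sorted(poi for poi, users in users_at.items() if len(users) >= min_users)


# Hypothetical (user, POI) visit log, for illustration only.
visits = [
    ("u1", "station"), ("u2", "station"), ("u3", "station"),
    ("u1", "cafe"), ("u1", "cafe"),
    ("u2", "park"), ("u3", "park"),
]
hubs = find_hubs(visits, min_users=3)
print(hubs)
```

Counting distinct users rather than raw visits keeps a single frequent visitor (the cafe in this example) from turning a private spot into a hub.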

23 Geometric layer extraction: concrete type of movement; usage of frequently sampled trajectories; derive and link network nodes to understand activities.

24 Final remarks. Several companies struggle with making sense of and creating opportunities from their data (the data economy). Big Data is a key factor for enterprises. Big Data has a transformative potential in a wide range of areas, and interesting new issues are raised. Academia and industry should collaborate closely: there are lots of exciting opportunities.