Transforming Big Data into Unlimited Knowledge


1 Transforming Big Data into Unlimited Knowledge. Timos Sellis, RMIT University

2 Big Data: What is it? The most commonly accepted definition, by Gartner (the 3 Vs): "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

3 Big Data Characteristics based on the source: McKinsey Global Institute,

4 Big Data: Is it a new wave? Yes and no.
Yes, it is a different type of data wave: one needs to put together many sources of information coming through many different channels, throw away what is not important, work under time constraints, and serve analysts and end users.
No, most of these problems have been the focus of data management research for years.
The main issue is to put all this together, using innovative technology, to serve users' needs.

5 A paradigm shift - Science. The 4th Paradigm of Science: Data-Intensive Scientific Discovery. From eScience to dScience.

6 A paradigm shift - Business. Vestas Wind Energy: turbine placement and maintenance. Danish firm Vestas uses supercomputers and a big data modelling solution to pinpoint the optimal location for its wind turbines, maximizing power generation and reducing energy cost. The solution combines data from global weather systems with data collected from its existing turbines. The wind library holds nearly 3 petabytes of data. Parameters include temperature, barometric pressure, humidity, precipitation, and wind direction and velocity from ground level up to 300 feet, together with the company's recorded historical data. The company expects to analyze even more diverse and larger weather data sets, reaching 20-plus petabytes over the next four years, as Vestas plans to add global deforestation metrics, satellite images, historical metrics, geospatial data, and data on the phases of the moon and tides.

7 Is there power in the data? Human Brain Project (EU): develop a platform to simulate the human brain. 86 billion neurons, 100 trillion synapses. The 3D spatial range query is a bottleneck in spatial analysis. [Figure: model size (gigabytes) versus simulation size (number of neurons).]
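To make the term concrete, a 3D spatial range query returns every point that falls inside a given box. The sketch below is a naive linear scan, not the method used by the Human Brain Project; the names `range_query_3d` and `neuron_positions` and the sample coordinates are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (minimum corner, maximum corner)


def range_query_3d(points: Iterable[Point], box: Box) -> List[Point]:
    """Return the points inside the axis-aligned box (bounds inclusive)."""
    (x0, y0, z0), (x1, y1, z1) = box
    return [
        (x, y, z)
        for (x, y, z) in points
        if x0 <= x <= x1 and y0 <= y <= y1 and z0 <= z <= z1
    ]


# Hypothetical positions; a real model would hold billions of elements.
neuron_positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (5.0, 5.0, 5.0)]
hits = range_query_3d(neuron_positions, ((0.5, 0.5, 0.5), (4.0, 4.0, 4.0)))
```

A scan like this touches every point on every query, which suggests why range queries become a bottleneck as the model grows; spatial indexes are the usual way to avoid the full scan.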

8 Some issues for this talk. How are the volume, variety, velocity and veracity of data affecting knowledge discovery? Exploring and discovering information by sifting through large quantities of data. How to acquire, understand and correlate social media for decision making. Understanding the behaviour of customers based on geospatial and text data.

9 Machine Learning: Data, Learning, Model. Supervised: classification, regression, recommender. Unsupervised: clustering, dimensionality reduction, topic modeling.

10 On Model Power and Data.
More data: improves generalization; facilitates more powerful models; is often cheap; increases I/O cost.
More powerful models: improve model fidelity and generalization; require more training data; increase computational cost.
Observation: Today, we almost exclusively focus on simple models using lots of data.
Conjecture: It is too hard to write complex, distributed learning algorithms. Machine Learning needs to change.

11 Influences of Big Data on ML Success. [Figure labels: feature engineering, learning methods, parallel computing, parameter tuning, efficient algorithms, distributed systems, problem definition, learning methods and theory, programming models, learning theory, resource management. Roles: domain expert, data scientist, machine learning researcher, software engineers.]

12 Machine Learning Workflow. Step I: Example Formation (feature and label extraction). Step II: Modeling. Step III: Evaluation (and eventually deployment). [Diagram: Example Formation -> Examples -> Modeling -> Model -> Evaluation.]
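The three workflow steps can be sketched end to end on a toy problem. Everything below is an assumption for illustration: the records, the `SPAM_WORDS` vocabulary, the single feature, and the one-parameter threshold model stand in for whatever a real pipeline would use.

```python
# Hypothetical raw records: (message text, is_spam label).
raw = [
    ("win money now", True),
    ("meeting at noon", False),
    ("win a free prize now", True),
    ("lunch tomorrow", False),
    ("free money", True),
    ("project status meeting", False),
]

SPAM_WORDS = {"win", "free", "money", "prize"}  # assumed vocabulary


def form_example(text, label):
    """Step I: extract one feature (share of spam words) and the label."""
    words = text.split()
    feature = sum(w in SPAM_WORDS for w in words) / len(words)
    return feature, label


def fit_threshold(examples):
    """Step II: model = midpoint between the two class means."""
    pos = [f for f, y in examples if y]
    neg = [f for f, y in examples if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, examples):
    """Step III: fraction of examples the threshold classifies correctly."""
    correct = sum((f > threshold) == y for f, y in examples)
    return correct / len(examples)


examples = [form_example(text, label) for text, label in raw]
train, test = examples[:4], examples[4:]  # simple hold-out split
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
```

Keeping each step as a separate function mirrors the slide: example formation can be swapped or scaled independently of the modeling and evaluation code.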

13 A Unifying Design: ML algorithm; logical query over training data; query optimizer; parallel dataflow engine.

14 A World of Big Networks. Sensor networks: nine billion devices connected to the Internet (wsnblog.com). The Web: Yahoo had a 1.4 billion node Web graph in 2002 (Kang, 2012). Social networks: Facebook loads 60 terabytes of new data every day.

15 A World of Big Networks. Static vs. dynamic networks: dynamic networks are time series of static networks. Labelled vs. unlabelled networks: nodes in labelled networks have labels. The labels may differ, as in bipartite networks.

16 A World of Big Networks: the Enron dynamic network. 50,572 e-mails between 151 Enron employees, 11/5/99 to 21/6/02. The network shows the e-mails sent in each week. Blue edges represent the current week, and brown edges represent the previous week. Node size represents degree (number of e-mails sent).
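A dynamic network of this kind can be held as one static edge list per week, with the per-week degree computed from each snapshot. The sketch below uses made-up records, not the Enron data; `weekly_snapshots` and `sent_counts` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Hypothetical (week, sender, recipient) records.
emails = [
    (1, "ann", "bob"),
    (1, "ann", "cat"),
    (2, "bob", "cat"),
    (2, "ann", "bob"),
    (2, "ann", "bob"),
]


def weekly_snapshots(records):
    """Group edges by week: one static network (edge list) per time step."""
    snapshots = defaultdict(list)
    for week, sender, recipient in records:
        snapshots[week].append((sender, recipient))
    return dict(sorted(snapshots.items()))


def sent_counts(edges):
    """Degree as used for node size: e-mails each node sent in one snapshot."""
    return Counter(sender for sender, _ in edges)


snapshots = weekly_snapshots(emails)
degree_by_week = {week: sent_counts(edges) for week, edges in snapshots.items()}
```

Comparing `snapshots[w]` with `snapshots[w - 1]` gives the current-week and previous-week edge sets that the figure colours differently.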

17 A World of Big Networks: the Amazon review labelled network. A user and product bipartite network with about 5.8 million product reviews. The snapshot shows 1,000 reviewers. Green nodes represent reviewers and blue nodes products. Reviews are links between reviewers and products.

18 Network Anomalies
Structural anomalies: the structural properties differ significantly from the norm.
High number of triangles in the neighbourhood.
Star: low interactions between neighbours.
Clique: extremely high volume of interactions between neighbours.
Heavy vicinity: large number of links to neighbours.
Label anomalies: nodes whose labels are significantly different from the norm (by inference) are anomalies.
Opinion spam: a reviewer is fraudulent if he or she tries to promote a set of target bad products and/or damage the reputation of a set of good products.
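The star and clique patterns can be told apart by how many pairs of a node's neighbours are themselves linked. The sketch below computes that ratio (the standard local clustering coefficient) on two toy graphs; it is one possible measure, not necessarily the one behind the slide, and the graphs and the function name `neighbour_link_ratio` are assumptions.

```python
from itertools import combinations


def neighbour_link_ratio(adjacency, node):
    """Fraction of neighbour pairs of `node` that are themselves linked.

    `adjacency` maps each node to the set of its neighbours (undirected).
    Values near 0 suggest a star, values near 1 a clique.
    """
    pairs = list(combinations(sorted(adjacency[node]), 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if b in adjacency[a])
    return linked / len(pairs)


# Toy star: the hub's neighbours never interact with each other.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}

# Toy clique: every node is linked to every other node.
clique = {n: {"p", "q", "r", "s"} - {n} for n in "pqrs"}

star_ratio = neighbour_link_ratio(star, "hub")
clique_ratio = neighbour_link_ratio(clique, "p")
```

Each linked neighbour pair closes one triangle through the node, so the numerator is also the triangle count in the node's neighbourhood.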

19 FraudEagle: Opinion Fraud Detection. A user-product labelled network for identifying fake reviews and fraudulent reviewers (reviews are edges linking reviewers and products).

20 FraudEagle

21 FraudEagle on the Amazon Product Review Network. The network is generated by simulating reviewers with a probability of giving a fraudulent review. The figure shows the labelled network after running the FraudEagle program. The circular nodes are users and the square nodes are products. Green users are honest and red users are fraudulent. Blue products are rated good and black products are rated bad.
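One way such a simulated review network could be produced is sketched below. This is not the FraudEagle program and not the simulation used for the slide; it only assumes that an honest review agrees with a product's true quality and a fraudulent one inverts it. The function name, parameters, and product table are hypothetical.

```python
import random


def simulate_reviews(n_reviewers, products, n_reviews_each, fraud_prob, seed=0):
    """Return (reviewer, product, sign) edges; sign is +1 or -1.

    `products` maps product id -> True if the product is genuinely good.
    Each review is fraudulent with probability `fraud_prob` (assumed model).
    """
    rng = random.Random(seed)
    product_ids = sorted(products)
    edges = []
    for reviewer in range(n_reviewers):
        # Each reviewer rates a random subset of distinct products.
        for product in rng.sample(product_ids, n_reviews_each):
            honest_sign = 1 if products[product] else -1
            is_fraud = rng.random() < fraud_prob
            edges.append((reviewer, product, -honest_sign if is_fraud else honest_sign))
    return edges


products = {"p1": True, "p2": True, "p3": False}  # hypothetical ground truth
honest_edges = simulate_reviews(4, products, 2, fraud_prob=0.0)
all_fraud_edges = simulate_reviews(4, products, 2, fraud_prob=1.0)
```

The signed edges form the user-product bipartite network on which a detector would then be run and scored against the known fraud flags.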

22 Geomarketing.
Geomarketing: delivering relevant content for a given geographical context; analyzing consumers' behavior according to specific areas.
Mapvertising: using maps or satellite pictures with relevant information, ads, promotions, etc.; showing sponsored results vs. normal search results.

23 Use Case: Austrian Bank (WIGeoGIS). To support the 143 Erste Bank branches in Austria with marketing campaigns as personally and flexibly as possible, and to reach the desired target audiences within the branch areas optimally, Erste Bank's branch marketing department has for years relied on a web-based branch and marketing information system from WIGeoGIS. The system identifies the target audience at the administrative district level on the basis of socio-demographic figures on the one hand and purchasing power and company-specific data on the other. Administrative districts that exhibit high potential yet still-low performance can easily be selected via the tool in order to take them into particular consideration in spatial marketing campaigns. Next step: extending it with social media spatio-textual information.
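The district selection described above amounts to a filter on two scores. The sketch below is illustrative only and does not reflect the WIGeoGIS tool; the field names, the 0-to-1 scales, and both thresholds are assumptions.

```python
# Hypothetical district records with normalized scores in [0, 1].
districts = [
    {"name": "District A", "potential": 0.9, "performance": 0.2},
    {"name": "District B", "potential": 0.8, "performance": 0.7},
    {"name": "District C", "potential": 0.3, "performance": 0.1},
    {"name": "District D", "potential": 0.75, "performance": 0.35},
]


def campaign_targets(records, min_potential=0.7, max_performance=0.4):
    """Select districts with high potential but still-low performance."""
    return [
        d["name"]
        for d in records
        if d["potential"] >= min_potential and d["performance"] <= max_performance
    ]


targets = campaign_targets(districts)
```

In practice the potential score would be derived from the socio-demographic and purchasing-power data, and the performance score from company-specific data.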

24 More information is available: activity sequences; people's similar moving patterns.

25 Identification of network hubs. POIs that different users visit, depart from and arrive at. Number of position samples. Extended periods of time.
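A simple reading of "POIs that different users visit" is to count distinct users per point of interest and keep those above a cut-off. The sketch below follows that reading; the visit log, the `min_users` cut-off, and the function name `find_hubs` are assumptions, and it ignores the sample-count and dwell-time criteria also listed on the slide.

```python
from collections import defaultdict

# Hypothetical visit log: (user, point of interest).
visits = [
    ("u1", "station"),
    ("u2", "station"),
    ("u3", "station"),
    ("u1", "cafe"),
    ("u1", "cafe"),
    ("u2", "park"),
]


def find_hubs(log, min_users=3):
    """Return POIs visited by at least `min_users` distinct users."""
    users_by_poi = defaultdict(set)
    for user, poi in log:
        users_by_poi[poi].add(user)
    return sorted(poi for poi, users in users_by_poi.items() if len(users) >= min_users)


hubs = find_hubs(visits)
```

Counting distinct users rather than raw visits keeps one person's repeated trips to the same place from making it look like a hub.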

26 Geometric layer extraction. Concrete type of movement. Usage of frequently sampled trajectories. Derive and link network nodes to understand activities.

27 Fusion of network layers

28 Final remarks. Several companies struggle with making sense of and creating opportunities from their data (the data economy). Big Data is a key factor for enterprises. Big Data has a transformative potential in a wide range of areas, and interesting, new issues are raised.