Linked Big Data Graph Database and Analytics

1 E6893 Big Data Analytics Lecture 6: Linked Big Data Graph Database and Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science October 11, 2018 E6893 Big Data Analytics Lecture 6 CY Lin, 2018 Columbia University

2 Spark GraphX

3 Graph Analytics

4 Graph Definitions and Concepts
A graph: $G = (V, E)$
$V$ = vertices (nodes)
$E$ = edges (links)
The number of vertices is the order, $N_v = |V|$; the number of edges is the size, $N_e = |E|$.
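The definitions above can be sketched in a few lines of Python. This is only an illustration: the vertex labels and the edge list are made up, and edges are treated as directed (source, destination) pairs.

```python
# Hypothetical toy graph; labels and edges are illustrative only.
V = {"a", "b", "c", "d"}
E = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]  # directed (src, dst)

order = len(V)  # N_v = |V|
size = len(E)   # N_e = |E|

# Count incoming and outgoing edges per vertex.
in_degree = {v: 0 for v in V}
out_degree = {v: 0 for v in V}
for src, dst in E:
    out_degree[src] += 1
    in_degree[dst] += 1

print("order:", order, "size:", size)
print("c: in-degree", in_degree["c"], "out-degree", out_degree["c"])
```

Every edge contributes one out-degree and one in-degree, so both degree sums equal the size of the graph.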

5 Property Graph

6 GraphX Graph Operations: In-degree = 8, Out-degree = 8

7 Degree Distribution Example: Power-Law Network
A.-L. Barabási and E. Bonabeau, Scale-Free Networks, Scientific American 288: 50-59, 2003.
Poisson degree distribution with mean degree $m$: $p_k = e^{-m} \frac{m^k}{k!}$
Power law with exponential cutoff: $p_k = C\, k^{-\tau} e^{-k/\kappa}$ (Newman, Strogatz and Watts)
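To see how differently the two distributions behave in the tail, the formulas can be evaluated directly. This is a rough sketch: the parameter values (m = 4, tau = 2.5, kappa = 100) and the truncation of the normalizing sum at k_max are assumptions made for illustration, not values from the lecture.

```python
import math

def poisson_pk(k, m):
    """Poisson degree distribution p_k = e^{-m} m^k / k! with mean degree m."""
    return math.exp(-m) * m**k / math.factorial(k)

def cutoff_power_law_pk(k, tau, kappa, k_max=10_000):
    """Power law with exponential cutoff, p_k = C k^{-tau} e^{-k/kappa}, k >= 1.

    C is chosen so the probabilities sum to 1 over k = 1..k_max
    (truncating at k_max is an assumption of this sketch).
    """
    norm = sum(j**-tau * math.exp(-j / kappa) for j in range(1, k_max + 1))
    return k**-tau * math.exp(-k / kappa) / norm

# Probability of a hub with degree 50 under each model (hypothetical parameters).
p_poisson = poisson_pk(50, m=4.0)
p_power = cutoff_power_law_pk(50, tau=2.5, kappa=100.0)
print(p_poisson, p_power)
```

With these parameters the power-law model assigns a far larger probability to a degree-50 vertex than the Poisson model does, which is the heavy-tail behavior associated with scale-free networks.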

8 Another example of a complex network: Small-World Network
Six degrees of separation: by adding long-range links, a regular graph can be transformed into a small-world network, in which the average number of degrees of separation between two nodes becomes small. (From Watts and Strogatz.)
C: clustering coefficient; L: path length; (C(0), L(0)): (C, L) of a regular graph; (C(p), L(p)): (C, L) of a small-world graph with randomness p.

9 Some examples of Degree Distribution: (a) scientist collaboration: biologists (circle), physicists (square); (b) collaboration of movie actors; (d) network of directors of Fortune 1000 companies

10 Relationship Between Network Topology and Productivity
Network size is positively correlated with performance. Each person in your address book at work is associated with $948 in annual revenue.
1 direct contact in a person's network: $74.07 increase in monthly revenue, or $948 in annual revenue. Std. error = (26.38)***; *** significant at p <

11 Basic graph algorithms in GraphX

12 Centrality
"There is certainly no unanimity on exactly what centrality is or its conceptual foundations, and there is little agreement on the procedure of its measurement." (Freeman)
Degree (centrality)
Closeness (centrality)
Betweenness (centrality)
Eigenvector (centrality)

13 Closeness
Closeness measures how close a vertex is to the other vertices:
$c_{Cl}(v) = \frac{1}{\sum_{u \in V} \mathrm{dist}(v, u)}$
where $\mathrm{dist}(v, u)$ is the geodesic distance between vertices $v$ and $u$.
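A minimal sketch of this closeness formula, assuming an unweighted, undirected, connected graph stored as an adjacency dictionary (the representation and the example graph are assumptions of this sketch). Geodesic distances are obtained with breadth-first search.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def closeness(adj, v):
    """c_Cl(v) = 1 / sum of geodesic distances from v to all other vertices."""
    total = sum(bfs_distances(adj, v).values())
    return 1.0 / total if total > 0 else 0.0

# Hypothetical path graph a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(path, "b"), closeness(path, "a"))
```

The middle vertex is at distance 1 from both ends (sum 2), while an end vertex has distances 1 and 2 (sum 3), so the middle vertex scores higher.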

14 Betweenness ==> Bridges
Example: Healthcare experts in the world. Connections between different divisions.
Example: Healthcare experts in the U.S. Key social bridges.

15 Betweenness
Betweenness measures are aimed at summarizing the extent to which a vertex is located between other pairs of vertices. Freeman's definition:
$c_B(v) = \sum_{s \neq t \neq v \in V} \frac{\sigma(s, t \mid v)}{\sigma(s, t)}$
where $\sigma(s, t)$ is the number of shortest paths between $s$ and $t$, and $\sigma(s, t \mid v)$ is the number of those paths that pass through $v$.
Calculating all betweenness centralities requires calculating the lengths of shortest paths among all pairs of vertices, and computing the summation in the above definition for each vertex.
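The two steps named above (all-pairs shortest paths, then the summation per vertex) can be sketched as follows. This is a brute-force illustration for small unweighted, undirected graphs, not an efficient algorithm; the adjacency-dictionary representation, the example graph, and the choice to count each unordered pair {s, t} once are assumptions of this sketch.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Return (dist, sigma): hop distance and number of shortest paths from s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                sigma[u] = 0
                queue.append(u)
            if dist[u] == dist[v] + 1:
                sigma[u] += sigma[v]  # every shortest path to v extends to u
    return dist, sigma

def betweenness(adj):
    """Sum sigma(s,t|v) / sigma(s,t) over unordered pairs {s, t} not containing v."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # s and t are disconnected
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path iff the distances add up exactly.
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

# Hypothetical path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness(path)
print(scores)
```

On the path graph the two interior vertices each sit on the shortest paths of two pairs, while the end vertices sit on none.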

16 Eigenvector Centrality
Tries to capture status, prestige, or rank: the more central the neighbors of a vertex are, the more central the vertex itself is.
$c_{Ei}(v) = \alpha \sum_{\{u, v\} \in E} c_{Ei}(u)$
The vector $\mathbf{c}_{Ei} = (c_{Ei}(1), \ldots, c_{Ei}(N_v))^T$ is the solution of the eigenvalue problem:
$A\, \mathbf{c}_{Ei} = \alpha^{-1} \mathbf{c}_{Ei}$
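One common way to approximate the leading eigenvector is power iteration. The sketch below assumes a connected, undirected graph given as an adjacency dictionary, and it iterates with A + I rather than A: this shift keeps the same eigenvectors but avoids oscillation on bipartite graphs such as stars. The function name, the shift, and the example graph are choices made for this sketch, not something taken from the lecture.

```python
import math

def eigenvector_centrality(adj, iterations=200, tol=1e-12):
    """Power iteration on (A + I), normalized to unit Euclidean length."""
    c = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # One multiplication by (A + I): own score plus the neighbors' scores.
        new = {v: c[v] + sum(c[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(x * x for x in new.values()))
        new = {v: x / norm for v, x in new.items()}
        if max(abs(new[v] - c[v]) for v in adj) < tol:
            return new
        c = new
    return c

# Hypothetical star graph: hub "h" connected to three leaves.
star = {"h": ["x", "y", "z"], "x": ["h"], "y": ["h"], "z": ["h"]}
cent = eigenvector_centrality(star)
print(cent)
```

For this star, the largest eigenvalue of A is the square root of 3, so the hub's score converges to the square root of 3 times a leaf's score.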

17 PageRank Algorithm (Simplified)
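The slide content itself is not preserved here, so the following is only a sketch of the textbook simplified PageRank, in which each page splits its current rank evenly among the pages it links to and no damping factor is used. The example link structure is hypothetical, and the sketch assumes every page has at least one outgoing link.

```python
def simplified_pagerank(out_links, iterations=100):
    """Repeatedly redistribute rank: PR(u) = sum over v -> u of PR(v) / outdeg(v)."""
    nodes = list(out_links)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new = {v: 0.0 for v in nodes}
        for v, targets in out_links.items():
            share = rank[v] / len(targets)  # assumes targets is non-empty
            for u in targets:
                new[u] += share
        rank = new
    return rank

# Hypothetical web of three pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(links)
print(ranks)
```

Without damping, the iteration is only guaranteed to settle when the link graph is strongly connected and aperiodic, as it is in this small example.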

18 Economic Issues: Network Topology and Worker Productivity
Topological point of view: what type of network structure is beneficial?
Cohesive network: trust; absorptive capacity; precision, reliability.
Structurally diverse network: brokering position; access to many pools of diverse, novel information.
What type of network structure is most beneficial in an electronic network for consultants? Importance of direct contacts? Importance of indirect contacts? Constrained vs. unconstrained?

19 Network Topology Measures
Direct contacts: Size(7) = 4, Size(12) = 3. (+) No information distortion; (-) high maintenance cost. Network size → strong work performance (?)
Indirect contacts: Btw(7) = 33, Btw(12) = 6; 3steps(7) = 11, 3steps(12) = 8. (+) Access to diverse information; (-) information distortion. Betweenness centrality → strong work performance (?) 3-step reach → strong work performance (?)
Structural diversity: Div(7) = 0.53, Div(12) = 0.16. (+) Transfer complex knowledge; (-) access diverse knowledge. Diversity → strong work performance (?)

20 Enterprises become more successful by utilizing Social Network Analysis
Together with MIT, we studied 2,038 IBM Global Business Consultants for 2 years and found that:
After a consultant started using SmallBlue, his social network/capital grew and his monthly billable revenue for IBM increased, equivalent to $7,010 per year.
Joint analysis of social capital and economic capital: adding a person to one's personal network (i.e., someone with frequent communications) is associated with an increase of $948 in yearly revenue for IBM. (Selected by BusinessWeek Magazine as the Top Story of the Week, April 8, 2009.)
A 1% increase in social network diversity is associated with $239.5 in monthly revenue (i.e., a $2,874 revenue increase per year).
A 1% increase in social network diversity is associated with an increase of 11.8% in job retention (i.e., surviving layoff).
SmallBlue / Atlas was featured in 120+ news articles, including 4 times by BusinessWeek (Jan and May 2008, April and June 2009).
SmallBlue Team, 2010 IBM Corporation

21 Observations from Personal Social Networks vs. Revenue
Structurally diverse networks with an abundance of structural holes are associated with higher performance. Having diverse friends helps.
Betweenness is negatively correlated. Being a bridge between a lot of people is not helpful.
Network reach is highly correlated. The number of people reachable in 3 steps is positively correlated with higher performance.
Having too many strong links (the same set of people one communicates with frequently) is negatively correlated with performance. Perhaps frequent communication with the same person implies redundant information exchange. Future textual analysis can be done to confirm this.

22 Project Team Composition: Managers
The number of managers in a project exhibits an inverted-U-shaped curve.
1. Having managers in a project is correlated with team performance initially.
2. Too many managers in a project is negatively associated with team performance.
$\text{revenue} = \alpha + \beta_1\, \text{mgr} + \beta_2\, \text{mgr}^2 + \sum_k \gamma_k\, \text{otherfactor}_k + \varepsilon$
$\beta_1$ (# managers in project): *** (537.5)
$\beta_2$ ((# managers in project)^2): *** (215.3)
Figure: fitted revenue vs. managers (normalized); S = .027, S = -.056.
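An inverted-U relationship corresponds to a negative coefficient on the squared term, with the fitted curve peaking at mgr* = -beta1 / (2 * beta2). The sketch below illustrates only that arithmetic; the coefficient values are hypothetical placeholders, since the estimated coefficients are not legible in this transcript, and the other-factor terms are omitted.

```python
def fitted_revenue(mgr, alpha, beta1, beta2):
    """Quadratic part of the model: alpha + beta1 * mgr + beta2 * mgr^2."""
    return alpha + beta1 * mgr + beta2 * mgr ** 2

def peak_of_quadratic(beta1, beta2):
    """Number of managers at which the fitted curve is highest (needs beta2 < 0)."""
    if beta2 >= 0:
        raise ValueError("an inverted-U curve requires beta2 < 0")
    return -beta1 / (2.0 * beta2)

# Hypothetical coefficients, chosen only to illustrate the shape.
alpha, beta1, beta2 = 1000.0, 400.0, -50.0
best = peak_of_quadratic(beta1, beta2)
print(best, fitted_revenue(best, alpha, beta1, beta2))
```

Below the peak, adding a manager raises the fitted revenue; above it, each additional manager lowers it.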

23 Culture Factor in CMC-based Communications: preferences of CMC tools; patterns of growing social networks; sentiments in conversations

24 Preferences of CMC Tools: IM vs. Calendar; Meet vs. IM

25 Growing one's Social Networks

26 Sentiments in Conversation

27 Role Analysis: Role difference of normal behavior

28 Information Reuse Behavior (CHI '11)
Figure panels, each broken down by group (Research, HR, Sales, Product): percentage of slides with reused content; percentage of reused slides that were reused by the same author vs. by a different author; number of slide pairs with exact vs. partial text reuse; percentage of downloaded material being reused, from outside vs. inside the group.

29 Anomaly Detection Algorithms and Infrastructure
Thrust 1: Anomaly Detection Algorithms. New algorithms to detect abnormal humans (nodes) as well as abnormal contacts (edges) in social networks. Explore structural features and incorporate content (semantic) features.
Thrust 2: Anomaly Usability. Address the lack of ground truth by (1) incorporating interpretation-friendly properties (e.g., non-negativity, sparseness) into the current anomaly-detection matrix factorization; and (2) providing concise summarization to perform anomaly attribution.
Thrust 3: Infrastructure Support. General and scalable graph/network management system to process large
Figures: typical abnormal nodes and their local ego-net structures; the overall flowchart of the graph management system.

30 Use Case: Utilizing Social Network Analysis for Spam Detection
Normal: (1) clique-like, (2) two-way links. Spamming: near-star.
An analysis in a telecom area of 6 million users. In the experiment, Social Network Analysis achieved a recall of 89.97% and a precision of 88.17%, while the comparison system achieved 66.77% recall and 14.85% precision. SNA's precision/recall area is 8 times larger.
Figure: precision vs. recall for the existing antispam system, SpamWatcher, and the perfect result, ranging from "no one reported as spammer" to "all reported as spammer".
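The "8 times larger" figure can be checked from the reported numbers if the precision/recall area is read as the rectangle spanned by precision and recall, i.e. their product. That reading is an assumption of this sketch, as are the confusion-matrix counts used to illustrate the definitions.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts, only to illustrate the definitions.
example_precision, example_recall = precision_recall(tp=90, fp=10, fn=30)

# Reported results as (precision, recall).
sna = (0.8817, 0.8997)
comparison = (0.1485, 0.6677)

# Assumed meaning of "precision/recall area": precision times recall.
area_ratio = (sna[0] * sna[1]) / (comparison[0] * comparison[1])
print(round(area_ratio, 2))
```

Under that reading, the ratio of the two products comes out at almost exactly 8, which matches the claim.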

31 Anomaly Detection: information flow-based approach
Video demo: http://smallblue.

32 Network Science: Network and Graph Analysis
Example 1: Internet Map. Nodes: ISPs; Edges: Connection (33K nodes, 290K edges)
Example 2: Social Network. Nodes: People; Edges: Friendship (Facebook has 500M+ users)
Example 3: Web Graph. Nodes: Web pages; Edges: Hyperlinks (Yahoo Web: 1.4B nodes, 6.6B edges)
Multiple scales, multiple disciplines

33 Network Analysis Example: Centrality Ranking in Large Networks
Who are the most important actors? Three centralities:
Degree: number of neighbors
Closeness: average shortest-path length
Betweenness: number of times a node sits on a shortest path between other nodes
Complexity: $O(|E|)$, $O(|V|^3)$, $O(|V|^2 \log |V|)$
[15th-century Florentine families] |V| = 15, |E| = 19. Degree: easy; closeness: easy; betweenness: easy.
[Internet, Web] |V| = billions, |E| = billions. Degree: easy; closeness: hard; betweenness: hard. For 2 billion edges, standard closeness: 30,000 years.
Applications: measuring financial company value; network attack monitoring.

34 Graph Partitioning

35 Distributed Graph Computation in GraphX

36 Network Analysis: Effectiveness & Efficiency (GBase)
Example: we proposed two new centralities ("effective closeness" and "LineRank") and efficient large-scale algorithms for billion-scale graphs.
Scalability results: near-linear scalability.
Effective closeness vs. closeness: near-linear correlation (97.8%).
For 2 billion edges: standard closeness, 30,000 years; effective closeness, about 1 day. 1,000,000 times faster!
Kang, Tong, Sun, Lin, and Faloutsos, GBase: A Scalable and General Graph Management System, KDD 2011.
Analysis of real-world graphs.

37 RDF and SPARQL

38 RDF and SPARQL

39 Resource Description Framework (RDF)
A W3C standard since 1999. Data is expressed as triples.
Example: if a company has nine of part p1234 in stock, a simplified triple representing this might be {p1234 instock 9}: instance identifier, property name, property value.
In a proper RDF version of this triple, the representation is more formal. Triples require uniform resource identifiers (URIs).
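A small sketch of the same triple in Python, first in the simplified form and then with URIs. The namespace `http://example.org/inventory/` is a placeholder invented for this illustration, and the `Triple` type and `to_ntriples` helper are hypothetical names; the output line follows the N-Triples style of writing URIs in angle brackets and literals in quotes.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # instance identifier
    predicate: str  # property name
    obj: str        # property value

# Simplified triple from the example: {p1234 instock 9}.
simple = Triple("p1234", "instock", "9")

# More formal version: identifiers become URIs (placeholder namespace).
BASE = "http://example.org/inventory/"
formal = Triple(BASE + simple.subject, BASE + simple.predicate, simple.obj)

def to_ntriples(t):
    """Serialize one triple with a plain literal value, N-Triples style."""
    return f'<{t.subject}> <{t.predicate}> "{t.obj}" .'

print(to_ntriples(formal))
```

Using URIs for the subject and the property name is what lets independently produced datasets refer to the same part or the same property without ambiguity.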

40 An example complete description

41 Advantages of RDF
Virtually any RDF software can parse the lines shown above as a self-contained, working data file.
You can declare properties if you want. The RDF Schema standard lets you declare classes and relationships between properties and classes.
The flexibility that comes from the lack of dependence on schemas is the first key to RDF's value.
Splitting triples across several lines does not affect their collective meaning, which makes sharding of data collections easy.
Multiple datasets can be combined into a usable whole with simple concatenation.
Because the inventory dataset's property names are URIs, sharing a vocabulary makes datasets easy to aggregate.
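The concatenation and sharding points can be illustrated by treating a dataset as a set of triples. The sketch below uses a simplified one-triple-per-line text format rather than real RDF syntax, and the two warehouse datasets are made up.

```python
def parse_triples(text):
    """Parse whitespace-separated 'subject property value' lines into a set."""
    triples = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            triples.add(tuple(parts))
    return triples

# Two hypothetical datasets that share one fact.
warehouse_a = "p1234 instock 9\np1234 supplier acme\n"
warehouse_b = "p5678 instock 2\np1234 supplier acme\n"

# Combining datasets is plain concatenation of their lines.
merged = parse_triples(warehouse_a + warehouse_b)

# Reordering the lines (or splitting them across shards) changes nothing.
reordered = "\n".join(reversed((warehouse_a + warehouse_b).splitlines()))
same = parse_triples(reordered) == merged
print(len(merged), same)
```

Because each triple is a complete statement on its own, the merged set is just the union of the inputs, and the duplicated fact collapses into a single entry.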

42 SPARQL Query Language for RDF
The following SPARQL query asks for all property names and values associated with the fbd:s9483 resource:
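The query text itself is not preserved in this transcript. As a rough sketch of what such a query does, the Python below matches a triple pattern with a fixed subject and wildcard property and value; the SPARQL in the comment is a plausible reconstruction rather than the slide's exact text, and the data, including the `ex:` property names, is hypothetical.

```python
# Hypothetical data; the expansion of the fbd: prefix is not given in the source.
triples = [
    ("fbd:s9483", "ex:label", "widget"),
    ("fbd:s9483", "ex:instock", "9"),
    ("fbd:s0001", "ex:instock", "4"),
]

def match(data, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a query variable."""
    return [
        t for t in data
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Rough counterpart of (reconstructed, not the slide's text):
#   SELECT ?property ?value WHERE { fbd:s9483 ?property ?value }
result = [(p, o) for _, p, o in match(triples, s="fbd:s9483")]
print(result)
```

Only the two triples whose subject is fbd:s9483 are returned, each as a property-value pair.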

43 The SPARQL Query Result from the previous example

44 Another SPARQL Example: What is this query for? Data

45 Open Source Software: Apache Jena

46 Property Graphs

47 Reference

48 A usual example

49 Query Example I

50 Query Examples II & III: Computationally intensive

51 Graph Database Example

52 Execution Time in the example of finding extended friends (by Neo4j)

53 Modeling Order History as a Graph

54 A query language on Property Graph: Cypher

55 Cypher Example

56 Other Cypher Clauses

57 Property Graph Example: Shakespeare

58 Creating the Shakespeare Graph

59 Query on the Shakespeare Graph

60 Another Query on the Shakespeare Graph

61 Building Application Example: Collaborative Filtering

62 Chaining on the Query

63 Example Interaction Graph: What's this query for?

64 How to make graph database fast?

65 Use Relationships, not indexes, for fast traversal
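The idea of traversing relationships instead of consulting an index can be sketched with nodes that hold direct references to their neighbors, so each hop is a pointer dereference rather than a lookup in a global structure. This is only an illustration of the principle, not a description of how any particular graph database stores its data; the `Node` class, the `friends_within` helper, and the example people are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.friends = []  # direct references to neighboring Node objects

def connect(a, b):
    """Create an undirected friendship by storing references on both nodes."""
    a.friends.append(b)
    b.friends.append(a)

def friends_within(start, depth):
    """Names of everyone reachable from start in at most `depth` hops."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for friend in node.friends:  # follow references, no index lookup
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    seen.discard(start)
    return {n.name for n in seen}

# Hypothetical chain: alice - bob - carol - dave.
alice, bob, carol, dave = (Node(n) for n in ("alice", "bob", "carol", "dave"))
connect(alice, bob)
connect(bob, carol)
connect(carol, dave)
print(sorted(friends_within(alice, 2)))
```

The cost of each hop depends only on the number of relationships attached to the nodes actually visited, not on the total size of the graph.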

66 Storage Structure Example

67 Nodes and Relationships in the Object Cache

68 An Emerging Benchmark Test Set: data generator of full social media activity simulation of any number of users

69 Questions?