Inferring Social Ties across Heterogeneous Networks

1 Inferring Social Ties across Heterogeneous Networks CS 6001 Complex Network Structures HARISH ANANDAN

2 Introduction: Social ties are information-carrying connections between people. A tie can be strong, weak, or absent. How can it be determined? By asking individuals whom they consider a friend, whom they approach for professional help, or whom they communicate with on a regular basis.

3 Why and when to use Social Network Analysis (SNA): to understand how to improve the effectiveness of a network; to uncover patterns in relationships or interactions; to follow the paths along which information flows in social networks; to identify dysfunctional communities or networks; to promote social cohesion and growth in an online community.

4 Social Ties: Different types of social ties exert different influence between people. Examples: an enterprise network, a mobile communication network.

5 Social Ties & Community: The social tie is the basic unit from which the network structure is formed. Relationships between users can be either directed or undirected. Two aspects of social ties are considered: I. To what extent can the labels of social ties between people be inferred in social networks? II. How do reciprocal (two-way) relationships develop from parasocial (one-way) relationships, and how do these relationships further develop into triadic closure and communities?

6 So, what is the problem? (Motivation): Social networks are complex and consist of many overlapping parts, e.g., Facebook, Twitter, LinkedIn, YouTube. Survey of a European mobile phone network: only 16% of users have created contact groups on their mobile phones. Networks are very unbalanced.

7 Example of inferring social ties

8 Challenges: What are the fundamental factors that form the structure of different networks? How can we design a generalized framework to formalize the problem in a unified way? How can the model learning algorithm be scaled up to keep pace with the growth of large real networks?

9 Overview of this paper: The paper proposes a transfer-based factor graph (TranFG) model that incorporates social theories into a semi-supervised learning framework. The model is evaluated on five different networks: Epinions, Slashdot, Mobile, Coauthor, and Enron. It significantly improves performance (+15% in terms of F1-measure) for inferring social ties across different networks.

10 Background and Related work: Few efforts have been made so far. I. D. J. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher, and J. Kleinberg. Inferring social ties from geographic coincidences. PNAS, 107, Dec. II. C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo. Mining advisor-advisee relationships from research publication networks. In KDD 10, 2010. III. C. P. Diehl, G. Namata, and L. Getoor. Relationship identification for social network discovery. In AAAI, 2007. Each of these focuses only on mining particular types of relationships in a specific domain, which makes it difficult to extend to other domains.

11 Problem definition: Two social networks are considered: a source network and a target network. Let G = (V, E^L, E^U, X) be a partially labeled social network, where V is the set of vertices (users), E^L is the set of labeled relationships, E^U is the set of unlabeled relationships, and E^L ∪ E^U = E. X is an |E| x d attribute matrix in which each row corresponds to an edge and each column to an attribute; the element x_ij denotes the value of the j-th attribute of edge e_i. The label of edge e_i is denoted y_i ∈ Y, where Y is the space of possible labels (e.g., family, colleague, classmate).

12 Problem definition - Input: Two partially labeled networks G_S (source network) and G_T (target network) with |E^L_S| >> |E^L_T|, i.e., the number of labeled relationships in the source network is much larger than that in the target network. In real social networks, relationships can be: 1. directed or undirected; 2. static or dynamic.

13 Problem definition - Learning task: Given G_S and G_T, the goal is to learn a predictive function f: (G_T | G_S) → Y_T for inferring the type of relationships in the target network by leveraging the supervised information (labeled relationships) from the source network. Output assumption: each prediction carries a probability p(y_i | e_i), so the predictive function should return triples (e_i, y_i, p(y_i | e_i)).

14 Key issues: The source network and the target network may be very different, for example a coauthor network and an email network. The labels of relationships in the target network may differ from those in the source network. As both the source and the target networks are partially labeled, the learning framework should consider the unlabeled information as well as the labeled information.

15 Data collection: We consider five different types of networks. I. Epinions: a network of product reviewers connected by trust and distrust relationships. The data set consists of 131,828 nodes (users) and 841,372 edges, of which about 85.0% are trust links; 80,668 users received at least one trust or distrust edge. II. Slashdot: a network of friends on a site for sharing technology-related news. Users can tag each other as friends (like) or foes (dislike). The data set comprises 77,357 users and 516,575 edges, of which 76.7% are friend relationships.

16 Data collection [cont.]: III. Mobile: a network of mobile users. The data set comprises the logs of calls, Bluetooth scanning data, and cell tower IDs of 107 users over about ten months. IV. Coauthor: a network of authors, used to infer advisor-advisee relationships between coauthors. The data set comprises 815,946 authors and 2,792,833 coauthor relationships. V. Enron: an email communication network consisting of 136,329 emails between 151 Enron employees, with two types of relationships: manager-subordinate and colleague.

17 Observations: Domain-specific features change across multiple heterogeneous networks. The problem is therefore connected to several psychological theories, and the analysis focuses on network-based correlations via the following four statistics: I. Social balance: How is the social balance property satisfied and correlated in different networks? II. Structural hole: Would structural holes have a similar behavior pattern in different networks? III. Social status: How do different networks satisfy the properties of social status? IV. Two-step flow: How do different networks follow the two-step flow of information propagation?

18 Social balance: People in a social network tend to form a balanced network structure. Balance theory: a triad is balanced if either all three users are friends or only one pair of them are friends.
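
The balance rule above can be checked mechanically. The following is a minimal sketch, not the paper's code: it assumes each pair of users carries a sign (+1 for friends, -1 otherwise), and the function name `balanced_fraction` and the toy data are hypothetical.

```python
from itertools import combinations


def balanced_fraction(signs):
    """Fraction of fully connected triads that are balanced.

    `signs` maps frozenset({u, v}) to +1 (friends) or -1 (not friends).
    A triad counts as balanced when it has three or exactly one friend pair,
    following the rule stated on the slide.
    """
    nodes = sorted({n for pair in signs for n in pair})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        pairs = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if not all(p in signs for p in pairs):
            continue  # skip triples that do not form a full triangle
        total += 1
        positives = sum(1 for p in pairs if signs[p] > 0)
        if positives in (1, 3):
            balanced += 1
    return balanced / total if total else 0.0


# Hypothetical toy network with four users.
signs = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 1,
    frozenset(("a", "d")): -1,
    frozenset(("b", "d")): -1,
    frozenset(("c", "d")): 1,
}
fraction = balanced_fraction(signs)  # two of the four triads are balanced
```

Enumerating all node triples is cubic in the number of users, so this form only suits small examples.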

19 Balance theory: Probabilities of balanced triads in different networks, based on communication links and on friendships. Observation: based on communication links, different networks have very different balance probabilities.

20 Structural hole: A structural hole is a relationship of non-redundancy between two contacts. Goal: to test whether a user who spans a structural hole tends to have the same type of relationship with the other users. Algorithm: for each node, count the number of pairs of neighbors who are not directly connected; rank all users by this number; the top 1% of users with the highest numbers are viewed as structural holes in the network.
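
The counting-and-ranking procedure above can be sketched as follows. This is an illustrative version under stated assumptions: the graph is undirected and given as an adjacency dictionary, ties are broken by node name, and at least one node is always returned even when 1% of the users rounds down to zero. The function name `structural_hole_spanners` is hypothetical.

```python
import math
from itertools import combinations


def structural_hole_spanners(adj, top_fraction=0.01):
    """Rank nodes by the number of unconnected neighbor pairs.

    `adj` maps each node to the set of its neighbors (undirected graph).
    Returns the top `top_fraction` of nodes and the per-node counts.
    """
    scores = {}
    for node, neighbors in adj.items():
        scores[node] = sum(
            1
            for u, v in combinations(sorted(neighbors), 2)
            if v not in adj.get(u, set())
        )
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    # Assumption: keep at least one node on tiny graphs.
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[:k], scores


# Hypothetical toy graph: "hub" bridges four users, of whom only a and b know each other.
adj = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
}
spanners, scores = structural_hole_spanners(adj)
```

Here `hub` has six neighbor pairs, one of which (a, b) is connected, so its count is five and it is the only node selected.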

21 Structural hole [cont.]: Probabilities that two connected (or disconnected) users (A and B) have the same type of relationship with user C, conditioned on whether user C spans a structural hole. Users are more likely (on average 70% higher than chance) to have the same type of relationship with C if C spans a structural hole. Disconnected users are more likely than connected users to have the same type of relationship with a user classified as spanning a structural hole.

22 Social status: Based on a directed relational network. In a triad, take each negative edge, reverse its direction, and flip its sign to positive; the resulting triangle (with all positive edge signs) should then be acyclic. The analysis was conducted on the Coauthor and Enron networks: 99% of triads in the two networks satisfy social status theory.
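
The reverse-and-flip test above is small enough to write out. The sketch below assumes a triad is given as three signed, directed edges covering the three pairs of users (one edge per pair); the function name `satisfies_status` and the examples are hypothetical.

```python
def satisfies_status(triad_edges):
    """Check the social status condition for one triad.

    `triad_edges` is a list of three (source, target, sign) tuples, one per
    pair of users. Negative edges are reversed and treated as positive; the
    triad satisfies the theory when the resulting triangle has no cycle.
    """
    normalized = [(v, u) if sign < 0 else (u, v) for u, v, sign in triad_edges]
    out_degree = {}
    for u, _ in normalized:
        out_degree[u] = out_degree.get(u, 0) + 1
    nodes = {n for edge in normalized for n in edge}
    # A directed triangle on three nodes is a cycle exactly when every node
    # has out-degree 1; otherwise the out-degrees are 2, 1, 0 (acyclic).
    is_cycle = all(out_degree.get(n, 0) == 1 for n in nodes)
    return not is_cycle


transitive = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1)]
cyclic = [("a", "b", 1), ("b", "c", 1), ("c", "a", 1)]
repaired = [("a", "b", 1), ("b", "c", 1), ("c", "a", -1)]
results = [satisfies_status(t) for t in (transitive, cyclic, repaired)]
```

The third triad differs from the second only in the sign of the last edge: reversing that negative edge turns the cycle into a consistent status ordering.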

23 Two-step flow theory (opinion leader): Ideas (innovations) usually flow first to opinion leaders, and then from them to a wider population, e.g., in an enterprise network. Goal: to examine whether opinion leaders are more likely to have a higher social status (manager or advisor) than ordinary users. Users are categorized into opinion leaders and ordinary users by PageRank: the top 1% of users with the highest PageRank scores are selected as opinion leaders, and the rest are ordinary users. Then, examine the probability that two users (A and B) have a directed social relationship (from the higher-social-status user to the lower-social-status user).
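
The PageRank-based split described above might look like the sketch below. It is a plain power-iteration PageRank written for illustration, not the implementation used in the paper; the damping factor 0.85, the 50 iterations, the uniform handling of dangling nodes, and the rule of keeping at least one leader are all assumptions, as are the function names.

```python
import math


def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: node -> list of nodes it links to."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


def split_opinion_leaders(out_links, top_fraction=0.01):
    """Return (opinion_leaders, ordinary_users) ranked by PageRank score."""
    rank = pagerank(out_links)
    ordered = sorted(rank, key=lambda u: (-rank[u], u))
    k = max(1, math.ceil(top_fraction * len(ordered)))  # assumption: at least one leader
    return ordered[:k], ordered[k:]


# Hypothetical enterprise network: four employees write to "boss", who replies to e1.
out_links = {"e1": ["boss"], "e2": ["boss"], "e3": ["boss"], "e4": ["boss"], "boss": ["e1"]}
leaders, ordinary = split_opinion_leaders(out_links)
scores = pagerank(out_links)
```

With the leaders identified, the remaining step on the slide is to compare how often a directed relationship runs from a leader to an ordinary user against the reverse direction.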

24 Model framework: The proposed model is the Transfer-Based Factor Graph (TranFG). Predictive model: given a network G = (V, E^L, E^U, X), each edge (relationship) e_i has an attribute vector x_i and a label y_i, with X = {x_i} and Y = {y_i}. The formulation is the posterior P(Y | X, G), where G denotes the network information.

25 According to Bayes' rule, the posterior over labels factorizes into two terms: P(Y | G), the probability of the labels given the structure of the network, and P(X | Y), the probability of generating the attributes X associated with all edges given their labels Y.

26 Assumption: the generative probability of the attributes given the label of each edge is conditionally independent, so P(X | Y) factorizes over edges into terms P(x_i | y_i), where P(x_i | y_i) is the probability of generating attributes x_i given the label y_i. How can the probabilities be instantiated? Either through: 1. Bayesian theory; 2. Markov random fields (Hammersley-Clifford theorem).
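
The formulas on these two slides did not survive transcription. A reconstruction consistent with the terms named in the text (P(Y | G), P(X | Y), P(x_i | y_i)) would be the following; treat it as a sketch of the likely form rather than a verbatim copy of the slide.

```latex
P(Y \mid X, G) \;=\; \frac{P(X, G \mid Y)\, P(Y)}{P(X, G)}
\;\propto\; P(X \mid Y)\, P(Y \mid G)
\;=\; P(Y \mid G) \prod_{i} P(x_i \mid y_i)
```

The last equality is the conditional-independence assumption: each edge's attributes depend only on that edge's label.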

27 Hammersley-Clifford theorem: The two probabilities are defined in terms of the following quantities: Z_1 and Z_2 are normalization factors; g_j(x_ij, y_i) is the feature function for attribute x_ij associated with edge e_i; α_j is the weight of the j-th attribute; h_k(Y_c) is a correlation feature function over clique Y_c in the network; μ_k is the weight of the k-th correlation feature function.
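
The two definitions themselves are missing from the transcript. Using only the symbols listed on this slide, the usual log-linear (Markov random field) form would be as below; the index ranges (j over the d attributes, c over cliques, k over correlation features) are assumptions.

```latex
P(x_i \mid y_i) = \frac{1}{Z_1} \exp\Big\{ \sum_{j=1}^{d} \alpha_j \, g_j(x_{ij}, y_i) \Big\},
\qquad
P(Y \mid G) = \frac{1}{Z_2} \exp\Big\{ \sum_{c} \sum_{k} \mu_k \, h_k(Y_c) \Big\}
```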

28 The simplest clique is an edge. Thus a feature function h_k(y_i, y_j) can be defined as the correlation between two edges (e_i, e_j) if the two edges share a common end node. Triads can be considered as cliques too. So, given a network G with labeled information Y, learning the predictive function means estimating a parameter configuration Θ = {α, μ} that maximizes the log-likelihood function.
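
The log-likelihood expression is also missing here. Combining the factorization P(Y | X, G) ∝ P(Y | G) ∏ P(x_i | y_i) with exponential-form factors weighted by α_j and μ_k, it would plausibly read as follows, where log Z collects all normalization terms (an assumption about how the slide grouped them).

```latex
\mathcal{O}(\Theta) = \log P(Y \mid X, G)
= \sum_{i} \sum_{j=1}^{d} \alpha_j \, g_j(x_{ij}, y_i)
+ \sum_{c} \sum_{k} \mu_k \, h_k(Y_c)
- \log Z,
\qquad
\Theta^{*} = \arg\max_{\Theta} \mathcal{O}(\Theta)
```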

29 Learning across heterogeneous networks: How can the predictive model be learned with two heterogeneous networks? Idea: social theories are general over all the networks. Since the different networks all satisfy these social theories, the theories can carry knowledge across networks. The social theories are therefore incorporated into the predictive function through the log-likelihood objective function.
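
The combined objective is not reproduced in the transcript. One way to write it, using the parameter names α, β, μ that appear on the model-learning slide, is to give each network its own attribute weights and to share the correlation weights μ_k between them. The superscripts S and T and the per-network feature functions g^S_j and g^T_j are notation assumed here for illustration.

```latex
\mathcal{O}(\alpha, \beta, \mu)
= \sum_{i \in E_S} \sum_{j} \alpha_j \, g^{S}_j(x^{S}_{ij}, y^{S}_i)
+ \sum_{i \in E_T} \sum_{j} \beta_j \, g^{T}_j(x^{T}_{ij}, y^{T}_i)
+ \sum_{k} \mu_k \Big( \sum_{c \in G_S} h_k(Y^{S}_c) + \sum_{c \in G_T} h_k(Y^{T}_c) \Big)
- \log Z
```

Because μ_k multiplies social-theory features from both networks, labeled data in the source network shapes the same weights that are used for prediction in the target network.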

30 Model learning and inferring: How can the TranFG model be learned, and how can the unknown social relationships in the target network be inferred? Goal: to estimate the parameter configuration θ = ({α}, {β}, {μ}) that maximizes the objective function O(α, β, μ). A gradient descent method (or a Newton-Raphson method) is used to optimize the objective function. The gradient is derived for each μ_k; similar gradients can be derived for α_j and β_j.
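
The gradient formula is missing from the transcript. For a log-linear model with partially observed labels, the gradient with respect to a correlation weight takes the standard "clamped minus free" form sketched below; the learning rate η and the notation Y^L for the observed labels are assumptions, and the sums over cliques run over both networks.

```latex
\frac{\partial \mathcal{O}}{\partial \mu_k}
= \mathbb{E}_{P(Y \mid Y^{L}, X, G)}\Big[ \sum_{c} h_k(Y_c) \Big]
- \mathbb{E}_{P(Y \mid X, G)}\Big[ \sum_{c} h_k(Y_c) \Big],
\qquad
\mu_k \leftarrow \mu_k + \eta \, \frac{\partial \mathcal{O}}{\partial \mu_k}
```

Both expectations require marginal probabilities over the unknown labels, which in a graph with cycles are typically approximated rather than computed exactly.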

31 TranFG learning algorithm

32 Experimental setup: Five different kinds of networks: Epinions, Slashdot, Mobile, Coauthor, and Enron. Comparison methods: SVM, CRF, PFG, TranFG. All experiments use the same feature definitions for all methods. Evaluation measures: the approaches were run on different pairs of (source, target) networks and evaluated in terms of Precision, Recall, and F1-measure.
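
For reference, the three evaluation measures can be computed as in the sketch below. It assumes a single label is treated as the positive class; the function name and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical predictions on five edges.
truth = ["friend", "friend", "foe", "friend", "foe"]
predicted = ["friend", "foe", "friend", "friend", "foe"]
precision, recall, f1 = precision_recall_f1(truth, predicted, positive="friend")
```

In this toy case there are two true positives, one false positive, and one false negative, so all three measures equal 2/3.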

33 Results and performance analysis: Comparison of the four methods for inferring social relationships on four pairs of networks. Labeled information in the target network has been considered for transfer.

34 Factor contribution analysis: The graphs on this slide show how the different social theories help to infer social ties.

35 Performance of inferring friendships with and without the balance-based transfer, varying the percentage of labeled data in the target network

36 Conclusions: The problem of inferring social ties across heterogeneous networks has been defined, and a transfer-based factor graph (TranFG) model has been proposed. The model incorporates social theories to transfer supervised information from the source network to help infer social ties in the target network. The proposed model was compared with several alternative methods and its performance evaluated.

37 Critical thoughts: The proposed model considers only static relational network structures for inferring social ties. What if the network evolves dynamically over time? The network structures considered are not as big as Facebook, Twitter, or YouTube. How can the model be estimated on such large real networks? What is the complexity of the algorithm?

38 References: Jie Tang, Tiancheng Lou, and Jon Kleinberg. Inferring social ties across heterogeneous networks. In WSDM 12, 2012. D. J. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher, and J. Kleinberg. Inferring social ties from geographic coincidences. PNAS, 107, Dec. C. P. Diehl, G. Namata, and L. Getoor. Relationship identification for social network discovery. In AAAI, 2007. C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo. Mining advisor-advisee relationships from research publication networks. In KDD 10, 2010.

39 Thank you! Questions??