A Survey on Influence Analysis in Social Networks


Jessie Yin

1 Introduction

With the growing popularity of Web 2.0, recent years have witnessed widespread adoption of rich social media applications such as Flickr, Twitter, and Facebook. These applications provide a lightweight, easy form of communication that enables individuals and groups to share information, exchange ideas and express their opinions in fluid and less formal ways. The user-generated content published on social media sites contains rich contextual information, e.g., textual content, user network, contributed annotations and temporal evolution. It thus offers a window through which to understand online users' activities in the context of topic-related conversations and interactions in social media.

Our project aims to explore various aspects of social media content and develop new methods for analysing social influence in large-scale social networks. Understanding social influence can benefit a variety of applications such as viral marketing, enterprise management and government services. To help achieve the project goals, this report reviews previous work on influence analysis from both the social science and computer science perspectives, which helps to clarify the concept of influence in social networks and to develop feasible solutions for identifying influential people.

2 Definitions of Influence in Social Science

In social science and communication theory, there are a number of conflicting theories about how information and innovations spread in a community. In the 1940s and 1950s, Katz and Lazarsfeld [6] formulated a breakthrough theory of public opinion formation that sought to reconcile the role of media influence with the observation that, in a variety of decision-making scenarios ranging from political to personal, individuals may be influenced more by exposure to each other than to the media.

According to their theory, a minority of opinion leaders act as intermediaries between the mass media and the majority of society; they are loosely described as being informed, respected, and well-connected. Because information, and thereby influence, flows from the media through opinion leaders to their respective followers, this model is referred to as the two-step flow of communication. In the decades after the introduction of the two-step flow, the idea of opinion leaders, or influentials as they are also called [11], became a central theme in the literature on the diffusion of innovations [14], communications research [17] and marketing [3]. The theory also spread well beyond academia and has been adopted by many marketing businesses. By identifying and convincing a small number of influential individuals, a viral campaign can reach a wide audience at a small cost, driven by a large-scale chain reaction of influence.

In contrast, a more modern view of influence emphasizes the importance of information flow more than the role of influentials. Watts and Dodds [16] pointed out that traditional influence theory does not consider the role of ordinary users in the formation of information flow. To compare the roles of influentials and ordinary users, they developed a series of simulations in which information flows freely between users, and a user adopts an innovation when he is influenced by more than a threshold fraction of the sample population. The simulations showed that influentials initiated more frequent and larger cascades than average users but, contrary to what the traditional theory suggests, they were neither necessary nor sufficient for information diffusion. They therefore concluded that what differentiates successful from unsuccessful diffusion is largely related to the properties of the network as a whole, not the properties of a small number of special individuals.
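The threshold rule used in such simulations can be illustrated with a small sketch. It is not the simulation of Watts and Dodds [16]; it assumes an undirected toy network, synchronous update rounds, and a rule under which a user adopts once the adopting fraction of his neighbours reaches his threshold. The function name, the network and the threshold values are hypothetical.

```python
def threshold_cascade(neighbors, thresholds, seeds):
    """Return the set of adopters once no further user changes state.

    neighbors[u]  -- list of users that u observes (assumed symmetric here)
    thresholds[u] -- fraction of u's neighbours that must adopt before u does
    seeds         -- users who have adopted at the start
    """
    adopted = set(seeds)
    while True:
        # Synchronous round: decisions are based on the previous round only.
        newly_adopted = {
            node
            for node, nbrs in neighbors.items()
            if node not in adopted and nbrs
            and sum(n in adopted for n in nbrs) / len(nbrs) >= thresholds[node]
        }
        if not newly_adopted:
            return adopted
        adopted |= newly_adopted


# Hypothetical four-user network: a triangle (a, b, c) with d attached to c.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
thresholds = {user: 0.5 for user in neighbors}

from_a = threshold_cascade(neighbors, thresholds, {"a"})
from_d = threshold_cascade(neighbors, thresholds, {"d"})
```

In this toy network a seed inside the triangle reaches every user, while a seed at the periphery reaches nobody else, which mirrors the point that the outcome depends on the network structure as well as on the seed.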
The above competing influence theories remain hypotheses in sociology for several reasons. First, there is a lack of real-world data that could be used to test these theories empirically. Second, there is neither a standard definition of what influence means nor a well-accepted measure to quantify its strength. This has become a source of difficulty in understanding large-scale network data on social influence.

3 Influence Analysis in Computer Science

More recently, computer scientists have begun developing models for influence in social networks, motivated by applications such as viral marketing, the spread of online news and the growth of online communities. Domingos and Richardson [5, 13] were the first to consider the propagation of influence and the problem of identifying influential users from a data mining perspective. The problem is tackled by means of a probabilistic model which assumes that customers are directly influenced by their neighbours in the network, and heuristics are given to choose a set of customers that maximizes the expected profit of a marketing action.

3.1 Influence Maximization

One important research focus in social networks has been the influence maximization problem, which was first formulated as a discrete optimization problem by Kempe et al. [7]. The problem is defined as follows: given a network with influence estimates, select an initial set of k users such that they eventually influence the largest number of users in the social network. This problem is known to be NP-hard, and their work therefore focuses on providing provable approximation guarantees in two existing propagation models, namely the Linear Threshold Model and the Independent Cascade Model. The main limitation of this work is the efficiency of the greedy algorithm, which requires computing the influence spread of a given seed set. For this task they run Monte Carlo simulations of the propagation model sufficiently many times to obtain an accurate estimate, which incurs a very long computation time.

Following this direction, a recent line of research has aimed at improving the efficiency of the greedy algorithm for influence maximization [4, 10]. Leskovec et al. [10] studied the propagation problem from a different perspective, namely outbreak detection. Outbreak detection is modeled as the problem of selecting nodes in a network in order to detect the spread of a virus or information as quickly as possible. They present a general methodology for near-optimal sensor placement in these and related problems. By exploiting submodularity, they developed an efficient algorithm based on a lazy-forward optimization in selecting new seeds, which achieves near-optimal placements while being 700 times faster than the simple greedy algorithm.
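A minimal sketch of the simple greedy algorithm under the Independent Cascade Model is given below. It is an illustration under stated assumptions rather than the implementation of Kempe et al. [7]: the graph is a hypothetical adjacency map from each user to (neighbour, activation probability) pairs, the number of Monte Carlo runs is arbitrary, and all function names are invented for this example.

```python
import random


def simulate_ic(graph, seeds, rng):
    """One Independent Cascade run; returns the number of activated users."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active user gets one chance to activate each neighbour.
            for v, prob in graph.get(u, []):
                if v not in active and rng.random() < prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(graph, seeds, runs, rng):
    """Monte Carlo estimate of the expected spread of a seed set."""
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs


def greedy_seeds(graph, nodes, k, runs=200, seed=0):
    """Add, k times, the user whose inclusion gives the largest estimated spread."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for v in nodes:
            if v in chosen:
                continue
            spread = estimate_spread(graph, chosen + [v], runs, rng)
            if spread > best_spread:
                best_node, best_spread = v, spread
        chosen.append(best_node)
    return chosen


# Hypothetical network with deterministic edges (probability 1.0).
graph = {"a": [("b", 1.0), ("c", 1.0)], "d": [("e", 1.0)]}
nodes = ["a", "b", "c", "d", "e"]
picked = greedy_seeds(graph, nodes, k=2, runs=5)
```

The inner loop re-estimates the spread of every remaining candidate in every round, and each estimate itself requires many simulation runs. This repeated evaluation is the cost that the lazy-forward optimization reduces by exploiting submodularity.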
Despite its large improvement over the basic greedy algorithm, the lazy-forward method still faces serious scalability problems, as demonstrated in [4]. In that work, Chen et al. proposed a new heuristic algorithm that restricts computations to the local influence regions of nodes; the heuristic is several orders of magnitude faster than all existing greedy algorithms while matching their influence spread.

All the works discussed above assume a basic propagation model in which the influence weights on the edges are given as input. In our project we instead focus on how influence weights or scores can be produced by mining social networks or, in other terms, how influential people can be discovered.

3.2 Influence Analysis in Twitter

In recent years, Twitter has gained huge popularity as a microblogging service, where online users can interact with each other to share opinions and exchange ideas. It thus provides a good platform for studying the problem of identifying influential people in online communities. A popular metric of perceived influence on Twitter is the number of a user's followers. In general, the more followers a user has, the more impact he appears to make in the Twitter environment, because he appears more popular (that is, more users follow him). This makes sense if Twitter acts as a successful broadcast medium, where a user publishes a tweet and it is read by every follower. However, this view of Twitter as a broadcast medium ignores the potential for users to interact with the content on the platform.

Kwak et al. [8] ranked users by the number of followers and by PageRank applied to the network of followings and followers, and found the two rankings to be similar. They also ranked users by the number of retweets and found that the resulting ranking differs from the previous two, indicating a gap between the influence inferred from the number of followers and that inferred from the popularity of one's tweets. Cha et al. [2] also compared three different measures of influence: number of followers, number of retweets, and number of mentions. They found that while retweets and mentions correlate well with each other, the most followed users did not necessarily score highest on the other two measures. Based on this, they hypothesized that the number of followers may not be a good measure of influence in the Twitter context. The work of the Web Ecology Project [9] measured influence as the ratio of the attention a user received (including retweets, replies, and mentions) to the number of tweets he posted. These three metrics do not utilize the global link structure among users.
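The gap between these measures can be illustrated with a small sketch. The three user profiles and their counts are invented for illustration and are not taken from any of the cited studies, and the attention ratio shown is only one plausible reading of the ratio described for [9].

```python
def rank_by(users, key):
    """User names sorted by users[name][key], largest first (ties by name)."""
    return sorted(users, key=lambda name: (-users[name][key], name))


def attention_ratio(stats):
    """Attention received (retweets, replies, mentions) per tweet posted."""
    received = stats["retweets"] + stats["replies"] + stats["mentions"]
    return received / stats["tweets"] if stats["tweets"] else 0.0


# Hypothetical activity counts for three users.
users = {
    "celebrity": {"followers": 1_000_000, "retweets": 50, "replies": 10,
                  "mentions": 40, "tweets": 500},
    "expert": {"followers": 5_000, "retweets": 300, "replies": 80,
               "mentions": 120, "tweets": 100},
    "newcomer": {"followers": 200, "retweets": 5, "replies": 2,
                 "mentions": 3, "tweets": 50},
}

by_followers = rank_by(users, "followers")
by_retweets = rank_by(users, "retweets")
ratios = {name: attention_ratio(stats) for name, stats in users.items()}
```

In this toy data the most followed user leads the follower ranking but not the retweet ranking, and receives far less attention per tweet than the user with fewer followers.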
Recent attempts have been made to take the global link structure into account when measuring influence in the Twitter context. For example, TunkRank extends PageRank and calculates a user's influence recursively as:

\[ \mathrm{Influence}(X) = \sum_{Y \in \mathrm{Followers}(X)} \frac{1 + p \cdot \mathrm{Influence}(Y)}{|\mathrm{Friends}(Y)|} \tag{1} \]

Here $p$ is the constant probability that a user retweets a tweet, $\mathrm{Followers}(X)$ is the set of users who follow $X$, and $\mathrm{Friends}(Y)$ is the set of users whom $Y$ follows. TunkRank measures a user's influence strength as the expected number of users who will read a tweet that he posts. However, this metric assumes that the retweet probability is constant over all tweets and that it is known in advance.
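Equation (1) can be evaluated by repeated substitution, as in the following sketch. The fixed iteration count, the value of p and the toy follow graph are assumptions made for illustration, and the function is not the reference TunkRank implementation.

```python
def tunkrank(following, p=0.05, iterations=50):
    """Iterate Equation (1) from an all-zero start.

    following[y] -- set of users that y follows (y's friends)
    p            -- assumed constant retweet probability
    """
    users = set(following) | {x for friends in following.values() for x in friends}
    followers = {u: [] for u in users}
    for y, friends in following.items():
        for x in friends:
            followers[x].append(y)

    influence = {u: 0.0 for u in users}
    for _ in range(iterations):
        # Every follower y contributes (1 + p * Influence(y)) / |Friends(y)|.
        influence = {
            x: sum((1 + p * influence[y]) / len(following[y]) for y in followers[x])
            for x in users
        }
    return influence


# Hypothetical follow graph: b and c follow a, and d follows b.
scores = tunkrank({"b": {"a"}, "c": {"a"}, "d": {"b"}}, p=0.5)
```

In the example, b is read by its single follower d, so its score is 1. User a is read directly by b and c and, with probability p, reaches d through a retweet by b, giving 2 + p.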

Tang et al. [15] introduced the problem of topic-level social influence analysis. Given a social network and a topic distribution for each user, the problem is to find topic-specific subnetworks and topic-specific influence weights between members of the subnetworks. They propose a Topical Affinity Propagation (TAP) approach based on the theory of factor graphs, in which observation data is coupled with local attributes and connections. In their work, social influences are modeled as being associated with different topics, and the strength of influence is interpreted as the extent to which topical content is copied from the influencing nodes to the influenced nodes.

Another notable work is TwitterRank, proposed by Weng et al. [18], which computes the topical distribution of a user based on Latent Dirichlet Allocation (LDA) [1] and constructs a weighted user graph based on the following relationships, where the edge weight represents the topical similarity between two users. They run a variant of the PageRank algorithm over the directed weighted graph separately for each topic in order to find authorities on each topic. TwitterRank differs from PageRank in that the random surfer performs a topic-specific random walk, i.e., the transition probability from one user to another is topic-specific and is defined as follows:

\[ P_t(i,j) = \frac{|T_j|}{\sum_{a:\, s_i \text{ follows } s_a} |T_a|} \cdot \mathrm{sim}_t(i,j) \tag{2} \]

where $|T_j|$ is the number of tweets published by $s_j$, and the denominator sums up the number of tweets published by all of $s_i$'s friends. Here $\mathrm{sim}_t(i,j)$ denotes the similarity between users $s_i$ and $s_j$ on topic $t$, and it is defined as

\[ \mathrm{sim}_t(i,j) = 1 - |DT_{it} - DT_{jt}| \]

where the row-normalised matrix $DT$ is one of the outputs of running LDA, so that $DT_{it}$ represents the probability that user $s_i$ is interested in topic $t$. One basic assumption made by TwitterRank is that users with similar interests have a stronger influence on each other.
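The transition probability of Equation (2) can be computed directly from its definition, as sketched below. The data structures, the toy values, and the convention of returning zero when s_i does not follow s_j are assumptions made for this example and are not details given in the text above.

```python
def transition_probability(i, j, t, friends, tweet_counts, dt):
    """P_t(i, j) following Equation (2).

    friends[i]      -- set of users that i follows
    tweet_counts[u] -- number of tweets published by u
    dt[u][t]        -- probability that u is interested in topic t
    """
    followed = friends.get(i, set())
    if j not in followed:
        return 0.0  # assumed convention for pairs without a follow edge
    total = sum(tweet_counts[a] for a in followed)
    if total == 0:
        return 0.0
    similarity = 1.0 - abs(dt[i][t] - dt[j][t])
    return tweet_counts[j] / total * similarity


# Hypothetical data: u follows v and w; two topics.
friends = {"u": {"v", "w"}}
tweet_counts = {"u": 5, "v": 30, "w": 10}
dt = {"u": [0.8, 0.2], "v": [0.6, 0.4], "w": [0.1, 0.9]}

p_uv = transition_probability("u", "v", 0, friends, tweet_counts, dt)
p_uw = transition_probability("u", "w", 0, friends, tweet_counts, dt)
```

On topic 0, v publishes three quarters of the tweets that u sees and has a similar interest level (similarity 0.8), so the walk moves to v with probability 0.75 × 0.8 = 0.6, compared with 0.25 × 0.3 = 0.075 for w.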
However, TwitterRank is prone to skew by a few users whose scores are orders of magnitude larger than those of the majority of the graph (i.e., celebrities). Pal and Counts [12] proposed using a set of features to represent social media authors, including both nodal and topical metrics. A probabilistic clustering algorithm is run over this feature space, and a within-cluster ranking procedure is then applied to the selected clusters, which produces a final list of top authors for a given topic. This clustering-based method offers a potential advantage over network-based calculations in that it is less prone to skew by a few celebrities who are followed by many people and, more importantly, it is computationally feasible in near real time for capturing the rapidly changing dynamics of microblogs.

References

[1] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.
[2] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), Washington, DC, USA, May.
[3] K.K. Chan and S. Misra. Characteristics of the opinion leader: A new dimension. Journal of Advertising, 19:53-60.
[4] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the Sixteenth International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, USA, August.
[5] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 57-66, San Francisco, CA, USA, August.
[6] E. Katz and P. Lazarsfeld. Personal Influence: The Part Played by People in the Flow of Mass Communication. Free Press, New York.
[7] D. Kempe, J.M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, USA, August.
[8] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW), Raleigh, NC, USA, April.
[9] A. Leavitt, E. Burchard, D. Fisher, and S. Gilbert. The influentials: New approaches for analyzing influence on Twitter. A publication of the Web Ecology Project, September.
[10] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In Proceedings of the Thirteenth International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, USA, July.
[11] R.K. Merton. Patterns of influence: Local and cosmopolitan influentials. In Social Theory and Social Structure. Free Press, New York.
[12] A. Pal and S. Counts. Identifying topical authorities in microblogs. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pages 45-54, Hong Kong, February.
[13] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 61-70, Edmonton, Alberta, Canada, July.
[14] E.M. Rogers. Diffusion of Innovations. Free Press.
[15] J. Tang, J. Sun, C. Wang, and Z. Yang. Social influence analysis in large-scale networks. In Proceedings of the Fifteenth International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, June-July.
[16] D. Watts and P. Dodds. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4).
[17] G. Weimann. The Influentials: People Who Influence People. SUNY Press.
[18] J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: Finding topic-sensitive influential twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM), New York City, USA, February.