Survey on Twitter Analysis


Material for Rinko
Survey on Twitter Analysis
June 25th, 2010
Kitsuregawa, Toyoda Lab (D1) RenYong ID:

ABSTRACT

In this paper we introduce two aspects of research on Twitter: (1) the quantitative study of Twitter and (2) algorithms for finding influential twitterers. At the end of the paper, two applications based on Twitter are presented.

1. INTRODUCTION

An important characteristic of Twitter is its real-time nature: users can publish what they are doing or thinking right now. Unlike most online social networking sites, such as Facebook or MySpace, the relationship of following and being followed requires no reciprocation. A user can follow any other user, and the user being followed need not follow back. Being a follower on Twitter means that the user receives all the messages (called tweets) from those the user follows. The common practice of responding to a tweet has evolved into a well-defined markup culture: RT stands for retweet, @ followed by a user identifier addresses the user, and # followed by a word represents a hashtag. This well-defined markup vocabulary, combined with a strict limit of 140 characters per posting, encourages brevity of expression.

2. Characteristics Analysis

2.1 Following and Followers Distribution

Figure 1 #following/followers

A directed network is constructed from the following and followed relationships. Figure 1 displays the distribution of the number of followings as the solid line and that of followers as the dotted line. The y-axis represents the complementary cumulative distribution function (CCDF). The research first explains the distribution of the number of followings. The dashed line in Figure 1 up to x = 10 fits a power-law distribution with the exponent of [...]. Most real networks, including social networks, have a power-law exponent between 2 and 3. The data points beyond x = 10 represent users who have many more followers than the power-law distribution predicts.
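The CCDF plotted in Figure 1 can be computed directly from a list of per-user counts. The following is a minimal sketch, assuming the convention CCDF(x) = P(X >= x); the function name `ccdf` and the sample follower counts are made up for illustration and do not come from the surveyed data.

```python
from collections import Counter

def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for a list of counts."""
    n = len(values)
    counts = Counter(values)
    points = []
    remaining = n  # number of samples whose value is >= the current x
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points

# Hypothetical follower counts for ten users (illustration only).
followers = [0, 1, 1, 2, 2, 2, 5, 10, 10, 1000]
for x, p in ccdf(followers):
    print(x, p)
```

On a log-log plot, a power-law distribution appears as a straight line in such a CCDF, which is why deviations in the tail are easy to spot.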
Similar tail behavior in the degree distribution has been reported for Cyworld in [1] but not for other social networks. What Twitter and Cyworld have in common is that many celebrities are present and readily form online relations with their fans.

2.2 Reciprocity

Twitter shows a low level of reciprocity: 77.9% of user pairs with any link between them are connected one-way, and only 22.1% have a reciprocal relationship. The research calls the users who reciprocate a user's following the r-friends of that user. Previous studies have reported much higher reciprocity on other social networking services: 68% on Flickr [2] and 84% on Yahoo! 360 [3].

2.3 Degree of Separation

The concept of degrees of separation has become a key to understanding societal structure ever since Stanley Milgram's famous "six degrees of separation" experiment [4]. In his work he reports that any two people could be connected on average within six hops of each other. Research on the MSN messenger network reports that the median and the 90th percentile degrees of separation are 6 and 7.8, respectively [5]. To estimate the path-length distribution of Twitter, the same random sampling approach as in [1] is used. The median and the mode of the distribution are both 4, and the average path length is 4.12. The 90th percentile distance, known as the effective diameter, is 4.8. An average path length of 4.12 is quite short for a network of Twitter's size.

3. Trending the Trends

In how many topics does a user participate on average? Out of 41 million Twitter users, a large number of users (8,262,545) participated in trending topics, and about 15% of those users participated in more than 10 topics during four months.

3.1 Comparison with Trends in Other Media

To answer what topics are popular on Twitter, the research compares Twitter's trending topics with those in other media, namely Google Trend and CNN headlines. A search keyword and a trending topic are considered a match if the length of their longest common substring is more than 70% of either string. Only 126 (3.6%) out of 3,479 unique trending topics from Twitter exist among 4,597 unique hot keywords from Google. Most of them are real-world events, celebrities, and movies.

Figure 2 The age of the trending topics

The freshness of topics in Google Trend and in Twitter trending topics is also compared. Figure 2 plots how many topics are fresh, a day old, a week old, or longer. On average 95% of topics each day are new in Google, while only 72% of topics are new in Twitter. Interactions among users, e.g., retweet, reply, and mention, are prevalent on Twitter, unlike in Google search, and such interactions might be a factor that makes trending topics persist.

How close are trending topics to CNN Headline News in time and coverage? CNN Headline News items from the Twitter data collection period are collected and a preliminary analysis is conducted. For the subset of trending topics that could be matched against CNN Headline News, CNN was ahead in reporting more than half the time. However, some news broke on Twitter before CNN, and it is of a live broadcasting nature (e.g., sports matches and accidents).
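The matching rule used to compare trending topics with search keywords can be sketched as follows. This is an illustrative reading, not the researchers' code: it assumes that "more than 70% of either string" means the longest common substring exceeds 70% of the length of at least one of the two strings, and that comparison is case-insensitive. The function names are hypothetical.

```python
def longest_common_substring_length(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def is_match(keyword, topic, threshold=0.7):
    """True if the common substring covers more than `threshold` of either string."""
    a, b = keyword.lower(), topic.lower()
    if not a or not b:
        return False
    lcs = longest_common_substring_length(a, b)
    return lcs > threshold * len(a) or lcs > threshold * len(b)

print(is_match("michael jackson", "Michael Jackson dies"))
print(is_match("iphone", "obama"))
```

If the stricter reading is intended (the substring must cover 70% of both strings), replacing `or` with `and` in `is_match` gives that variant.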
3.2 User Participation in Trending Topics

Figure 3 Cumulative fraction

A trending topic neither lasts forever nor dies never to come back. Figure 3 plots the CDF of the active periods and shows that 73% of topics have a single active period. About 15% of topics have 2 active periods and 5% have 3. Very few have more than 3 active periods. Most of the active periods are a week or shorter. Figure 3 shows that 31% of periods are 1 day long, and only 7% of periods are longer than 10 days. There are, however, a few long-lasting topics that have been active for more than two months.

The research applies the classification methodology of [6] to the number of tweets and their times, and classifies trending topic periods into the following four categories: exogenous subcritical, exogenous critical, endogenous subcritical, and endogenous critical. Manual inspection of the topics that fall into the exogenous critical class reveals that they are mostly timely breaking news, which the research refers to as headline news.

Table 1 # of topics in each category

          Subcritical     Critical
Exo.      31.5% (1905)    54.3% (3290)
Endo.      6.9% (419)      7.3% (444)

The numbers and percentages of active periods in each class are shown in Table 1. The largest number falls into the exogenous critical class. This means Twitter users tend to talk about topics from headline news and respond to fresh news.

4. Impact of Retweet

On Twitter people acquire information not always directly from those they follow, but often via retweets. Assuming a tweet posted by a user is viewed and consumed by all of the user's followers, the research counts the number of additional recipients who are not immediate followers of the original tweet owner. Figure 4 displays its average and median per tweet against the number of followers of the original tweet user. The median lies almost always below the average, indicating that many tweets have a very large number of additional recipients. Up to about 1,000 followers, the average number of additional recipients is not affected by the number of followers of the tweet source. That is, no matter how many followers a user has, the tweet is likely to reach an audience of a certain size once it starts spreading via retweets. This illustrates the power of retweeting: the retweet mechanism gives every user the power to spread information broadly.

Figure 4 Average and median numbers of additional recipients

4.1 Retweet Tree

In order to answer how far and deep retweets travel on Twitter, the research builds an information diffusion tree for every tweet that is retweeted and calls it a retweet tree. All retweet trees are subgraphs of the Twitter network. The research illustrates all the retweet trees of the topic "air france flight" in Figure 5. In every connected component, different colors represent different tweets. The forest of retweet trees has a large number of one-hop or two-hop chains.

Figure 5 Retweet trees of "air france flight" tweets

Figure 6 plots the CCDFs of the retweet tree heights and of the number of users in a retweet tree. A height of 1 is the most common, accounting for 95.8% of trees, and no tree goes beyond 11 hops.

Figure 6 Height and participating users in retweet trees

The research finds interesting retweet patterns such as repetitive retweet and cross-retweet; the former is repeatedly retweeting the same tweet, and the latter is two users retweeting each other.

4.2 Temporal Analysis of Retweet

Figure 7 Time lag between a retweet and the original tweet

Figure 7 plots the time lag from a tweet to its retweet. Half of retweeting occurs within an hour, and 75% within a day. However, about 10% of retweets take place a month later.
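The height and size of each retweet tree discussed in Section 4.1 can be computed from parent pointers. The sketch below assumes a hypothetical input format in which every user in a tree maps to the user they retweeted from, and the original author maps to `None`; this format and the function name `tree_stats` are illustrative assumptions, not the representation used in the surveyed work.

```python
from collections import defaultdict

def tree_stats(parent):
    """Map each root to (height in hops, number of users) of its retweet tree."""
    children = defaultdict(list)
    roots = []
    for node, par in parent.items():
        if par is None:
            roots.append(node)
        else:
            children[par].append(node)

    stats = {}
    for root in roots:
        height, size = 0, 0
        frontier = [(root, 0)]  # (user, hops from the original author)
        while frontier:
            node, depth = frontier.pop()
            size += 1
            height = max(height, depth)
            for child in children[node]:
                frontier.append((child, depth + 1))
        stats[root] = (height, size)
    return stats

# Hypothetical forest: "a" is retweeted by "b" and "c"; "d" retweets via "b".
forest = {"a": None, "b": "a", "c": "a", "d": "b", "x": None, "y": "x"}
print(tree_stats(forest))
```

Here the tree rooted at "a" has height 2 and four users, while the tree rooted at "x" is a one-hop chain of the kind that dominates the forest in Figure 5.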

Figure 8 Time lag between a retweet and the original tweet

Figure 8 plots the time lag between two nodes on a retweet tree. As most retweet trees are one hop deep, the time lag on the first hop is spread out, with the median at just under 1 hour and the inter-quartile range extending from a few minutes to more than a day. What is interesting is that retweets two hops or more away from the source are much more responsive and basically occur back to back up to 5 hops away.

5. Influential Twitterers Finding Algorithms

5.1 Motivations

The benefit of finding influential twitterers is multifold. First, it potentially brings order to the real-time web in that it allows search results to be sorted by the authority/influence of the contributing twitterers, giving a timely update of the thoughts of influential twitterers. Second, according to [7], Twitter is also a marketing platform. Targeting influential users will increase the efficiency of a marketing campaign.

Currently, a twitterer's influence is often measured by her node in-degree in the network, i.e., the number of followers. However, as observed in previous social network analysis studies [8], in-degree does not accurately capture the notion of influence. PageRank improves over in-degree by considering the link structure of the whole network [8]. Nevertheless, PageRank ignores the interests of twitterers, which affect the way twitterers influence one another. Given this, an algorithm called TwitterRank is proposed. Its framework is shown in Figure 9. First, the topics that twitterers are interested in are distilled automatically by analyzing the content of their tweets. Based on the topics distilled, topic-specific relationship networks among twitterers are constructed. Finally, influence is measured taking both the topical similarity between twitterers and the link structure into account.

Figure 9 Framework of the Proposed Approach

5.2 Topic Distillation and Homophily among Twitterers

The goal of topic distillation is to automatically identify the topics that twitterers are interested in based on the tweets they published. For this purpose, the Latent Dirichlet Allocation (LDA) model [9] is applied, which is an unsupervised machine learning technique for identifying latent topic information in a large document collection. To distill the topics that twitterers are interested in using LDA, documents would naturally correspond to tweets. However, since the goal is to understand the topics that each twitterer is interested in rather than the topic that each single tweet is about, the research aggregates the tweets published by an individual twitterer into one big document. Thus, each document essentially corresponds to a twitterer. The result is represented in three matrices:

1. DT, a D × T matrix, where D is the number of twitterers and T is the number of topics. $DT_{it}$ contains the number of times a word in twitterer $s_i$'s tweets has been assigned to topic t.
2. WT, a W × T matrix, where W is the number of unique words used in the tweets and T is the number of topics. $WT_{wt}$ captures the number of times unique word w has been assigned to topic t.
3. Z, a 1 × N vector, where N is the total number of words in the tweets. $Z_i$ is the topic assignment for word $w_i$.

Among the three matrices in the result of topic distillation, matrix DT is of particular interest. It contains the number of times a word in a twitterer's tweets has been assigned to a particular topic. It can be row-normalized as $DT'$ such that $\|DT'_{i\cdot}\|_1 = 1$ for each row $DT'_{i\cdot}$. Each element $DT'_{it}$ captures the probability that twitterer $s_i$ is interested in topic t. Given this, the topical difference between twitterers can be measured as follows.

Definition 1. The topical difference between two twitterers $s_i$ and $s_j$ can be calculated as:

$$dist(i, j) = D_{JS}(i, j) \tag{1}$$

$D_{JS}(i, j)$ is the Jensen-Shannon Divergence between the two probability distributions $DT'_{i\cdot}$ and $DT'_{j\cdot}$, which is defined as:

$$D_{JS}(i, j) = \frac{1}{2}\left( D_{KL}(DT'_{i\cdot} \,\|\, M) + D_{KL}(DT'_{j\cdot} \,\|\, M) \right) \tag{2}$$

M is the average of the two probability distributions, i.e. $M = \frac{1}{2}(DT'_{i\cdot} + DT'_{j\cdot})$. $D_{KL}$ in Eq. (2) is the Kullback-Leibler Divergence, which defines the divergence from distribution Q to P as:

$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

According to this definition of topical difference, the research finds that
(1) twitterers with a following relationship are more similar than those without, and
(2) twitterers with a reciprocal following relationship are more similar than those without.
These two findings show that homophily does exist among twitterers.

5.3 Topic-specific TwitterRank

First of all, a directed graph D(V, E) is formed with the twitterers and the following relationships among them. V is the vertex set, which contains all the twitterers. E is the edge set. There is an edge between two twitterers if there is a following relationship between them, and the edge is directed from follower to friend. A random surfer model on graph D computes the TwitterRank as follows: the random surfer visits each twitterer with a certain probability by following the appropriate edge in D. TwitterRank differs from PageRank in that the random surfer performs a topic-specific random walk, i.e. the transition probability from one twitterer to another is topic-specific. By doing so, the research is essentially constructing a topic-specific relationship network among twitterers. The transition matrix for topic t, denoted as $P_t$, is defined as follows.

Definition 2. Given a topic t, each element of matrix $P_t$, i.e.
the transition probability of the random surfer from follower $s_i$ to friend $s_j$, is defined as:

$$P_t(i, j) = \frac{|T_j|}{\sum_{a:\, s_i \text{ follows } s_a} |T_a|} \times sim_t(i, j) \tag{3}$$

$|T_j|$ is the number of tweets published by $s_j$, and $\sum_{a:\, s_i \text{ follows } s_a} |T_a|$ sums up the number of tweets published by all of $s_i$'s friends. $sim_t(i, j)$ is the similarity between $s_i$ and $s_j$ in topic t, which is defined as:

$$sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}| \tag{4}$$

This definition captures two notions. First, assume twitterer $s_i$ follows a number of friends. Those friends publish different numbers of tweets, all of which will be directly visible to $s_i$. The more a friend $s_j$ publishes, the higher the portion of tweets read by $s_i$ that comes from $s_j$. Generally, this leads to a higher influence on $s_i$, which corresponds to a higher transition probability from $s_i$ to $s_j$. This intuition is captured in the first term on the RHS of Eq. (3). Figure 10 shows an example with three twitterers: one twitterer follows two friends, who publish 500 and 1000 tweets respectively. In this case, the influence of the second friend on the follower is twice that of the first, when the topical similarity among the three twitterers is not taken into account.

Figure 10 Example of Transition Probability Calculation

Second, $s_j$'s influence on $s_i$ is also related to the topical similarity between the two, as suggested by the homophily phenomenon discussed in Section 5.2. The row-normalized matrix $DT'$ is one of the results of the topic distillation. A row $DT'_{i\cdot}$ contains the probabilities of twitterer $s_i$'s interest in different topics. The similarity between $s_i$ and $s_j$ in topic t is evaluated from the difference between the probabilities that the two twitterers are interested in the same topic t, which is the second term on the RHS of Eq. (3). The more similar the two twitterers are, the higher the transition probability from $s_i$ to $s_j$.

It is possible that some twitterers follow one another in a looping manner without following other twitterers outside the loop. Such a loop will accumulate high influence without distributing it. To tackle this, a teleportation vector $E_t$ is also introduced, which captures the probability that the random surfer jumps to some twitterer instead of following the edges of the graph D. $E_t$ is defined as follows.

Definition 3. The teleportation vector of the random surfer in topic t is defined as:

$$E_t = DT''_{\cdot t} \tag{5}$$

$DT''_{\cdot t}$ is the t-th column of matrix $DT''$, which is the column-normalized form of matrix DT such that $\|DT''_{\cdot t}\|_1 = 1$. DT is one of the results obtained during the topic distillation.

With the transition probability matrix and the teleportation vector defined, the topic-specific TwitterRank can be calculated.

Definition 4. The topic-specific TwitterRank of the twitterers in topic t, denoted as $TR_t$, can be calculated iteratively by:

$$TR_t = \gamma P_t \times TR_t + (1 - \gamma) E_t \tag{6}$$

$P_t$ is the transition probability matrix defined in Eq. (3), and $E_t$ is the teleportation vector defined in Eq. (5). $\gamma$ is a parameter between 0 and 1 that controls the probability of teleportation. The lower $\gamma$ is, the higher the probability that the random surfer will teleport to twitterers according to $E_t$, and vice versa.

5.4 Aggregation of Topic-specific TwitterRank

The approach presented in the sections above generates a set of topic-specific TwitterRank vectors, which measure the twitterers' influence in individual topics.
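The topic-specific computation of Eqs. (3) to (6) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the update for a twitterer sums the rank of its followers weighted by the transition probability toward it (one reading of the product in Eq. (6)), the value gamma = 0.85 and the fixed iteration count are arbitrary choices, every twitterer is assumed to have at least one word assigned to some topic, and all function and variable names are hypothetical.

```python
def transition_probability(i, j, follows, tweet_counts, row, topic):
    """Eq. (3): share of i's friends' tweets written by j, times sim_t(i, j)."""
    total = sum(tweet_counts[a] for a in follows.get(i, []))
    if total == 0:
        return 0.0
    sim = 1.0 - abs(row[i][topic] - row[j][topic])  # Eq. (4)
    return tweet_counts[j] / total * sim

def twitter_rank(follows, tweet_counts, dt, topic, gamma=0.85, iterations=50):
    """Iterate Eq. (6) for one topic; dt maps twitterer -> per-topic word counts."""
    users = sorted(dt)
    # Row-normalised DT': each twitterer's distribution over topics.
    row = {u: [c / sum(dt[u]) for c in dt[u]] for u in users}
    # Column-normalised DT'' restricted to this topic: teleportation vector (Eq. 5).
    col_total = sum(dt[u][topic] for u in users)
    teleport = {u: dt[u][topic] / col_total for u in users}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {}
        for j in users:
            incoming = sum(
                transition_probability(i, j, follows, tweet_counts, row, topic) * rank[i]
                for i in users
                if j in follows.get(i, [])
            )
            new_rank[j] = gamma * incoming + (1 - gamma) * teleport[j]
        rank = new_rank
    return rank

# Setup mirroring the three-twitterer example: s1 follows s2 (500 tweets) and s3 (1000 tweets).
follows = {"s1": ["s2", "s3"]}
tweet_counts = {"s1": 100, "s2": 500, "s3": 1000}
dt = {"s1": [5, 5], "s2": [5, 5], "s3": [5, 5]}  # identical interests, so sim = 1
print(twitter_rank(follows, tweet_counts, dt, topic=0))
```

With identical topic distributions the similarity term equals 1, so the friend with 1000 tweets receives twice the transition probability of the friend with 500 tweets and ends up with the higher rank.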
An aggregation of TwitterRank can also be obtained to measure the twitterers' overall influence.

Definition 5. A twitterer's general influence can be measured as an aggregation of the topic-specific TwitterRank in different topics, which is calculated as:

$$TR = \sum_t r_t \cdot TR_t$$

$TR_t$ is the TwitterRank vector for topic t, while $r_t$ is the weight assigned to topic t and the associated $TR_t$.

5.5 Empirical Evaluation

The research evaluates the usefulness of TwitterRank in a twitterer recommendation task. Comparisons against related algorithms are also conducted. The related algorithms studied include:

In-degree (InD), which measures the influence of twitterers by the number of followers.
PageRank (PR), which measures influence with only the link structure of the network taken into account.
Topic-sensitive PageRank (TSPR), which makes use of a topic-biased teleportation vector.

The recommendation task is designed as follows:

1 Choose a set of following relationships according to one of several criteria.
2 For each following relationship in the set, do the following:
3   Take the two twitterers in the relationship as follower and friend.
4   Choose another 10 twitterers that the follower does not follow, and denote them as St.
5   Remove the existing relationship between the follower and the friend to generate a new network.
6   Apply the different algorithms to measure the influence of the twitterers in the new network.
7   Based on the influence, decide whether the friend is recommended to the follower.
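Steps 3 to 7 for a single following relationship can be sketched as below. This is an illustrative sketch only: it uses the In-degree baseline as the influence measure (any function returning a score per twitterer could be substituted), draws the candidates with a seeded random generator, and scores the recommendation by counting how many candidates outrank the removed friend. The function names and the small example network are hypothetical.

```python
import random

def in_degree(follows):
    """Influence score = number of followers (the In-degree baseline)."""
    scores = {}
    for follower, friends in follows.items():
        scores.setdefault(follower, 0)
        for friend in friends:
            scores[friend] = scores.get(friend, 0) + 1
    return scores

def evaluate_relationship(follows, follower, friend,
                          influence=in_degree, n_candidates=10, rng=None):
    """Return how many sampled candidates are ranked above the removed friend."""
    rng = rng or random.Random(0)
    users = set(follows) | {f for fs in follows.values() for f in fs}
    # Step 4: sample twitterers the follower does not follow.
    not_followed = sorted(users - set(follows.get(follower, [])) - {follower})
    candidates = rng.sample(not_followed, min(n_candidates, len(not_followed)))
    # Step 5: remove the follower -> friend edge to build the new network.
    reduced = {
        u: [f for f in fs if not (u == follower and f == friend)]
        for u, fs in follows.items()
    }
    # Step 6: measure influence on the new network.
    scores = influence(reduced)
    friend_score = scores.get(friend, 0)
    # Step 7: the fewer candidates outrank the friend, the better the recommendation.
    return sum(1 for c in candidates if scores.get(c, 0) > friend_score)

network = {"a": ["b"], "c": ["b", "d"], "e": ["d"], "f": ["d"]}
print(evaluate_relationship(network, follower="a", friend="b"))
```

In this toy network only "d" has more followers than "b" once the edge from "a" to "b" is removed, so exactly one candidate outranks the friend.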

L, the set of existing following relationships in Step 1 of the recommendation task, is considered the ground truth for evaluation: the recommendation is considered good if the removed friend is ranked higher than all the twitterers in St chosen in Step 4. Given this, the quality of the recommendation is measured as the number of twitterers in St who have a higher rank than the removed friend. More formally, it is defined as follows.

Definition 6. Assume l is a ranked list recommended by any of the algorithms, and s is a twitterer. Let l(s) be the rank of s in l (a higher rank corresponds to a lower-numbered position in l). The quality of the recommendation Q(l) is measured as

$$Q(l) = \left|\{\, s \mid s \in S_t \text{ and } l(s_{\mathrm{friend}}) > l(s) \,\}\right|$$

where $s_{\mathrm{friend}}$ is the friend removed in Step 5. The lower the value of Q(l) is, the higher the quality of the corresponding algorithm is.

Different sets L based on various criteria have been used to study the performance of the proposed TwitterRank as comprehensively as possible. Currently, there are four criteria in total on which L is generated:

(a) Two sets L are generated based on the number of followers that sf has: one contains relationships with a high follower count, the other relationships with a low follower count. The follower count is considered high if it is larger than FH, and low if it is smaller than FL. FH and FL are set as the 90th and 10th percentile of all the follower counts of the twitterers.

(b) Two sets L are generated based on the number of tweets that sf has. These two sets are generated with an approach similar to that in (a). The difference is that the thresholds for high tweet count and low tweet count, denoted as TH and TL, are set as the 90th and 10th percentile of all the tweet counts of the twitterers.

(c) Two sets L are generated based on the topical difference between the follower and the friend. These two sets are generated with an approach similar to that in (a) and (b).
The difference is that the thresholds for low topical difference and high topical difference, denoted as DL and DH, are set as the 10th and 90th percentile of the topical differences of all the existing following relationships.

(d) Two sets L are generated based on whether there is a reciprocal following relationship between the follower and the friend. No threshold is applied.

Eight sets of L are thus used in each individual round of evaluation. Five rounds of evaluation are conducted.

Figure 11 Comparison of Performance (measured by Q(l)) in the Recommendation Task

Figure 11 shows the average results of the four algorithms with different sets of L over all the evaluation rounds. It can be observed that all the algorithms perform better in scenarios where the set with low topical difference is used than in those where the set with high topical difference is used. This observation shows that there are twitterers who follow because of the topical similarity between them and their friends. This supports the phenomenon of homophily discussed before.

TR is outperformed by other algorithms in 3 of the 8 scenarios studied.

In scenarios where the set with high follower count is used, there is no obvious difference in the performance of the algorithms, yet InD achieves the best performance. This is probably because, in the dataset, twitterers' following behaviors have already been biased toward those with more followers, since InD is essentially the algorithm applied by Twitter to recommend friends.

In scenarios where the set with low tweet count is used, TR's performance is the worst of all. This is because the quality of the distilled topics is lower, since LDA-based topic distillation is less accurate when little content is available. Consequently, this impacts the performance of TR, which takes topical similarity into account when measuring the twitterers' influence.

In the third of these scenarios, TR outperforms all the other algorithms except InD. Together with the observation from one of the other scenarios, this shows that there still exist some twitterers who do not follow based on topical similarity, although homophily is observed.

TR performs the best in all the other scenarios, though the improvement is not significant in most cases. It is noted that in scenarios where the set with low follower count is used, TR outperforms the other algorithms significantly, especially InD and PR. This is because the friends in these following relationships have lower numbers of followers. Consequently, the corresponding followers have a lower chance of being biased by the recommendation made by Twitter, which is essentially made with InD. In such cases, the chance that the following relationship is formed due to topical similarity is higher. Therefore, TR outperforms InD and PR, which do not take topical similarity into account. Furthermore, TR outperforms TSPR. This is because TSPR uses an identical transition probability matrix when calculating the topic-specific ranks. By doing so, TSPR basically propagates a twitterer's influence in one topic to her friends in different topics with equal probabilities.

6. Applications

Twitter has been used in various areas. Target event detection and online word of mouth are two typical examples; both make use of the real-time nature of Twitter. More information can be found in the research on each application.

7. Summary

We introduced characteristics of Twitter: a non-power-law follower distribution, a short average path length, and low reciprocity, which all mark a deviation from known characteristics of human social networks. The impact of retweets and the news nature of trending topics were presented. Then the phenomenon of homophily was explained, and TwitterRank, an extension of the PageRank algorithm based on homophily, was introduced.

References

[1] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In Proc. of the 16th international conference on World Wide Web. ACM, 2007.
[2] M. Cha, A. Mislove, and K. P. Gummadi. A measurement-driven analysis of information propagation in the Flickr social network. In Proc. of the 18th international conference on World Wide Web. ACM, 2009.
[3] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proc. of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
[4] S. Milgram. The small world problem. Psychology Today, 2(1):60-67.
[5] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In Proc. of the 17th international conference on World Wide Web. ACM, 2008.
[6] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. Proc. of the National Academy of Sciences, 105(41).
[7] S. Milstein, A. Chowdhury, G. Hochmuth, B. Lorica, and R. Magoulas. Twitter and the micro-messaging revolution: Communication, connections, and immediacy 140 characters at a time. O'Reilly Report, November.
[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7).
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.