Link prediction in the Twitter mention network: impacts of local structure and similarity of interest


2016 IEEE 16th International Conference on Data Mining Workshops

Hadrien Hours, Eric Fleury and Márton Karsai
Univ Lyon, ENS de Lyon, Inria, CNRS, UCB Lyon 1, LIP UMR 5668, IXXI, F-69342, Lyon, France

Abstract

The creation of social ties is driven by several factors, which can arguably be related to individual preferences and to the common social environment of individuals. Homophily and triadic closure are claimed to be important mechanisms in initiating new social interactions and, in turn, in shaping the global social structure. They therefore offer some potential to predict the creation of social ties between disconnected people who share common friends or common subjects of interest. In this paper we analyze a large Twitter data corpus and quantify similarities between people by considering the set of their common friends and the set of their commonly shared hashtags in order to predict mention links among them. We show that these similarity measures are correlated among connected people and that combining contextual and local structural features provides better predictions than considering them separately. These results help us to better understand the evolution of egocentric and global social networks and support the design of better recommendation systems and resource allocation plans.

I. INTRODUCTION

Understanding the structure of social networks is a long-standing research endeavor, which has accelerated lately thanks to the availability of large digital maps of online social networks [1]. These advances allow us to learn not only about the structure but also about the evolution of social networks.
Along this line, we address here the problem of social tie prediction by using local structural information and contextual similarities between users of a large online social network.

Understanding how, when, and why people interact is a complex problem conventionally studied in the social sciences. Historically, these questions have been studied with expensive and error-prone surveys, which led to several fundamental findings but also suffered from major limitations due to limited samples and inaccurate reporting by the targeted population. With the advent of online social networks, there is now an unprecedented opportunity to observe a wide range of large-scale social networks at limited cost. This new source of information has led to important breakthroughs in the understanding of social interactions, making it possible to model the structure and dynamics of human interactions in a rich variety of environments.

Special interest has been devoted lately to the problem of predicting ties in online social networks. This is an important question (a) to better understand the evolution of egocentric networks, which in turn induce the emerging global network structure and drive its dynamics; (b) to better approximate the social structure from sampled data; and (c) to design better resource allocation plans and recommendation systems in large online social systems, to mention just a few examples.

The decision of two egos to build a new social tie is arguably driven by several cognitive processes. Among these, one important tie creation mechanism is triadic closure, which arguably induces the high clustering of social networks. This mechanism is responsible for the emergence of triangles in the structure through the creation of links between people sharing common friends [2]. At the same time, homophily [3] (i.e. the tendency of people of similar age, gender, interest, etc. to be connected) has been identified as another important tie creation mechanism. In this work our aim is to propose a methodology that combines these two mechanisms to improve the prediction of link creation in the online social network of Twitter.

Twitter is an online micro-blogging system that allows users to post short messages (up to 140 characters) that can be read by any other user in the system. In addition, Twitter offers several ways for users to interact with each other. Users can follow each other to receive regular updates of their tweets, and they may also explicitly mention other users in their tweets, which can be used to capture direct social interactions. Using direct mentions one can easily derive a mention network by drawing a directed link from a user u to each user v that u directly mentions. To approximate the underlying social structure even better, we can consider two people to be connected, in this case with an undirected link, only if they have mutually mentioned each other at least once. In contrast to the follower network, which reflects passive information exposure and less social involvement, the mutual mention network may better capture the underlying social structure between users [4]. We therefore use this network definition in the rest of this work.

The paper is organized as follows. In Section II we briefly review related work on link prediction in the Twitter social network. In Section III we present in detail the dataset used in this study and, in Section IV, we formally define the mutual Twitter mention network. The core of our work is presented in Section V, where we derive a set of parameters based on user similarity that prove to be useful to capture personal social ties. We complement these parameters with structural metrics that allow us to effectively predict the creation of links in the mention network. Our conclusions and directions for future work are presented in Section VI.

Please note that the code used for this study is public and can be found in our public GitHub repository (footnote 1).

II. RELATED WORK

Social network analysis is a very broad field of research, which has accelerated thanks to recent access to an unprecedented amount of information capturing the digital footprints of individuals' social behaviour. Among online social platforms, Twitter is one of the most studied systems, with an exponentially growing number of works that cannot be reviewed extensively here. In particular, studies of link creation [5], [6] and of communities [7] in Twitter have been among the most successful. Relying on the rich literature of network science [8], rules such as triadic closure [9], [10] were shown to be efficient in predicting link creation [5], [11]. In addition, there is a rich literature studying the evolution of social networks based on their structural properties. A quantitative comparison of approaches that use network structure for link prediction can be found in [12].

Another field of interest is related to information diffusion on social networks [5]. Based on the structure of the network, one might predict the diffusion of information through it [13]. The likelihood of a user propagating a piece of information he or she received can be assessed by taking into account their potential resilience and their familiarity with the information being propagated. By sharing a piece of information (a tweet, in the case of Twitter), a user can participate in its spread over the whole or part of the network. The study of information cascades makes it possible to model the propagation of information in a given network [11].

As compared to earlier achievements in the domain of link prediction [14], the novelty of our work is to apply homophily in terms of common interest, in addition to other structural properties, when predicting the creation of social ties.
We use both approaches to complement a model based on the structural properties of the network with metrics, inspired by information diffusion, that capture the similarity of interest of people who may connect to each other in the future.

III. TWITTER DATASET

The data used in our work was collected from the Twitter streaming API. While the use of the Twitter streaming API can introduce sample bias, it was also shown that this possible bias is less present when working with a bigger dataset [15]. This study uses one year of data, starting in June 2014. It contains a 15% random sample of all tweets posted in the time zones GMT and GMT+1. Within this geographical slice we collected 72 million tweets, of which 66 million were in French or English and were thus used in this study. Using this sample we considered users who:

- Used hashtags. A hashtag is a metadata tag beginning with the character #. Hashtags are used to label tweets, which makes it easier for other users to find messages with a specific theme or content.
- Used mentions. A mention is a means to directly refer to another user by placing the character @ before the profile name of the addressed user in a tweet.

Fig. 1. Number of tweets and mentions per month during the observation period of one year.

This filtering provided us with the users considered in this study. Note that the capture of tweets during the observation period was sometimes interrupted by a few short data streaming outages. This may influence our predictions (as discussed in Section V-D), but the effect is limited because the actual integration window is typically longer than an average outage period. To illustrate this effect, Fig. 1 shows the number of observed tweets together with the number of mentions observed in each month. A strong correlation appears between these two counts, which is expected since mentions are extracted from tweets.

IV. TWITTER MENTION NETWORK

As discussed earlier, one way of interacting directly on Twitter is via mentions. When one user, u, mentions another user, v, user v sees the tweet posted by user u directly in his or her notifications. In our work we take direct mentions as proxies of social interactions and use them to estimate social ties, defined between two users who mutually mentioned each other at least once during the observation period. Here we study the structural and dynamical properties of the mention network by taking into account all the mentions that appeared during the whole observation period of $\Delta = 1$ year. This analysis allows us to identify the time it takes for the network to reach a stationary state in which its overall structural properties no longer change qualitatively. This time frame gives us the observation time needed for link prediction in the later sections.

A. Definition

More formally, our mutual mention network is an undirected, unweighted graph $G = (V, E)$, where $V$ is the set of all users that used mentions and hashtags during the period, as explained above. We consider this network to evolve as a function of time $t$ as follows:

- At $t = 0$ we start with an empty graph containing all users ever observed during the full observation period $\Delta$.
- We add an edge $e_{uv}(t)$ to the graph at time $t$ if user $u$ mentioned user $v$ at time $t' < t$ and user $v$ mentioned back user $u$ at time $t$.
- At time $t = \Delta$ we obtain the year mention graph $G_\Delta = (V_\Delta, E_\Delta)$, which contains all links that were observed as reciprocal mentions during $\Delta$. Users with degree 0 in $G_\Delta$ are not considered in the rest of our study.

Fig. 2. Cumulative distribution function of the reciprocal mentioning time $\delta_{uv}$ during one year.

Fig. 3. Evolution of the cumulative degree distribution. The distribution is plotted on a semi-logarithmic scale without nodes with zero degree.

In our data, we observe 42 million direct mentions, of which 1 million were reciprocal. Consequently $G_\Delta$ has $|E_\Delta| = 1$ million links connecting $|V_\Delta| = 350{,}000$ users. Users who reciprocally mention each other express some mutual interest, which indicates some social commitment between these interacting peers. The large discrepancy between the number of direct and mutual mentions can be due to several reasons. Obvious reasons are that people mention each other several times, or simply do not reply even if they receive a direct mention. Another explanation is related to the finite size of the observation window: people may reply to direct mentions with a delay, so that the time of their response falls outside $\Delta$. To explore this case we measure the cumulative distribution function (CDF) of the time difference $\delta_{uv} = t - t'$ between the first mention of a user $u$ by user $v$ at time $t'$ and the reciprocal mention of this user $v$ by the user $u$ at time $t$. $F(\delta_{uv})$ is depicted in Fig. 2, with $\max(\delta_{uv}) = \Delta$. As commonly observed in social interactions, the distribution shows a heavy tail. More than 30% of the reciprocal mentions happen within a day, but 9 days are needed to observe 50% of the reciprocal mentions.
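The construction of the mutual mention network and the measurement of the reciprocal mentioning time can be illustrated with a minimal sketch. It is not the code of the study: the input format (tuples of time in days, mentioning user, mentioned user) and the function names `reciprocal_delays` and `empirical_cdf` are assumptions made for this example.

```python
from bisect import bisect_right


def reciprocal_delays(mentions):
    """Map each mutually mentioning pair to its reciprocal mentioning time.

    `mentions` is an iterable of (time, source, target) tuples (assumed format).
    The delay is the time between the first mention in one direction (t') and
    the first strictly later mention in the opposite direction (t).
    """
    first_seen = {}  # (source, target) -> time of the first mention
    delays = {}      # frozenset({u, v}) -> t - t'
    for time, source, target in sorted(mentions):
        if source == target:
            continue  # ignore self-mentions
        first_seen.setdefault((source, target), time)
        reverse = first_seen.get((target, source))
        pair = frozenset((source, target))
        if reverse is not None and reverse < time and pair not in delays:
            delays[pair] = time - reverse
    return delays


def empirical_cdf(values, x):
    """Fraction of the values that are smaller than or equal to x."""
    ordered = sorted(values)
    return bisect_right(ordered, x) / len(ordered) if ordered else 0.0


# Toy example: a and b reply within a day, a and c after ten days,
# and the mention from b to c is never reciprocated.
mentions = [(0, "a", "b"), (1, "b", "a"), (2, "a", "c"), (12, "c", "a"), (3, "b", "c")]
delays = reciprocal_delays(mentions)
edges = set(delays)  # undirected edges of the mutual mention network
share_within_one_day = empirical_cdf(delays.values(), 1)
```

The keys of `delays` are the undirected edges of the mutual mention graph, and evaluating `empirical_cdf` on a range of delays gives a curve of the kind shown in Fig. 2.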
To observe 90% of the reciprocal mentions, one has to wait 142 days. This observation suggests that 9 days are already enough to observe the majority of the mutual mention links, so a two-week period is a suitable choice for the prediction of social links.

B. Degree distribution

Another way to estimate the minimum suitable observation time is to detect the time after which the evolving network reaches a stationary state, in which its overall properties no longer change qualitatively. To identify this period we study the evolution of the cumulative degree distribution of the mutual mention network. We start with an empty graph at the beginning ($t = 0$) of our observation period (June 2014) and iteratively add edges between nodes that reciprocally mentioned each other (as explained above). To avoid the effect of zero-degree nodes at intermediate times $t < \Delta$, we exclude them when measuring the degree distribution at a given time $t$. Fig. 3 shows the evolution of the cumulative degree distribution. Degrees are heterogeneously distributed, giving rise to a skewed degree distribution with a fat tail. In addition, after four weeks the distribution reaches a stationary regime in which its functional form no longer changes; only its average is shifted by the iteratively added new links. This can be seen when comparing the CDF after four weeks (blue dotted line with square markers) to the CDF after five weeks (green dotted line with triangle markers) and after six weeks of observation (black dotted line with circle markers). The stationarity of the distribution is evidenced by the small change of the CDF after four weeks. Note that a similar analysis has been carried out on mobile communication networks, where the stationary phase appeared after three weeks [16].

V. PREDICTING MENTIONS

Having estimated the time frames that capture the dynamics of reciprocal mentions and of the reciprocal mention network, we can divide our observation period into training and testing periods that are used for link prediction. We do not apply a shifting training period of fixed length. Instead, we start the training period at the beginning of the observation and expand it, in order to study the impact of the accumulated information about the growing network and its users on triadic closure and homophily. This captures the realistic scenario in which one starts collecting data and has access to an ever-increasing data set as time goes by. Based on the accumulated information we define two metrics to capture the effects of homophily and triadic closure on the creation of edges in the evolving network.

A. Evolving network

To define the training and testing periods we take 45 periods $\{T_0, T_1, \ldots, T_t, \ldots, T_T\}$ with increasing lengths up to the full observation period $\Delta$. Consecutive periods differ by one week (i.e. $T_{t+1} = T_t + 1$ week). Each period $T_t$ is divided into a training period $P_t = [0, 4+t)$ of $4+t$ weeks and a testing period $M_t = [4+t, 4+t+2)$ of 2 weeks, as shown in Fig. 4.

Fig. 4. Division of our observation period into sequential overlapping sub-periods $T_t$, divided into training periods $P_t$ and testing periods $M_t$, on an example of 8 weeks, from Week 0 (W=0) to Week 8 (W=8).

Using these periods we define the evolving network as follows. For each period $T_t$:

- We build a mention network $G_{P_t}$ corresponding to the period $P_t$ using all reciprocal mentions that appeared up to week $4+t$.
- We build the mention network $G_{M_t}$ corresponding to the period $M_t$ using all reciprocal mentions that happen during the period $[4+t, 4+t+2)$.

The period $P_t$ is used as a training period to compute structural and individual properties and their impact on the new mentions created during the period $M_t$. The choice of the periods is dictated by the length of the periods $M_t$, which was chosen using our observations on the distribution of the reciprocal mention time and on the evolution of the degree distribution of the mention network, studied in Sections IV-A and IV-B.

B. User closeness inference

1) Common interest: One of the main motivations of our work is the hypothesis that two users who share common interests are more likely to have social interactions (homophily). Therefore, in this section, we quantify the closeness of different users based on their commonly shared hashtags. The question we try to answer is: how is the similarity of interest of two Twitter users correlated with the existence of a social tie between them? To quantify the common interest that two users may share, we measure hashtag similarity as the distance between the sets of hashtags that the users tweeted. As the set of hashtags used by a user can be very large but may contain several related hashtags (e.g. #weather and #meteo), we use categories of hashtags instead of the hashtags, or the text of the containing tweets themselves.
This is an appropriate choice as we want to concentrate on the common interest of users without being biased by the variance of linguistic variables or geographical language correlations [17]. To infer hashtag categories, we cluster hashtags based on the similarity of the text of the tweets they appear in. Clustering items using surrounding text is a common method in information retrieval and document clustering [18]. Our calculation is based on the TF-IDF method (Term Frequency, Inverse Document Frequency) [18, Chapter 1]. This method gives each word a weight that is the product of the frequency of the word in the document it appears in (TF) and of the inverse of the frequency of this word in the whole corpus of documents (IDF). The IDF is very efficient in discarding common (stop) words (such as "the") and in measuring similarity between documents based on the uncommon words they share. Finally, what we call the hashtag similarity, $S_{u,v}$, of two users $u$ and $v$ is the Jaccard similarity (the complement of the Jaccard distance) of their sets of categories. It is calculated for each period $T_t$ as follows:

- Extract user hashtags and the corresponding text for the period $P_t$;
- Re-scale each hashtag word with its TF-IDF;
- Cluster hashtags based on the distance of the text they appear with;
- Assign to each user his or her list of hashtag categories;
- Build a mention graph $G_{M_t}$ for period $M_t$;
- Measure hashtag similarity for connected users in $G_{M_t}$ using their hashtag category lists detected in $P_t$;
- Compare the obtained hashtag similarity values with an average similarity value calculated between the same number of randomly selected user pairs observed in $P_t$.

Several choices were made to cluster hashtags efficiently. First, we selected hashtags that appear in at least 10 different tweets. Second, we removed the top 1% of hashtags in terms of the number of tweets they appear in. This 1% is an empirical choice to ensure that hashtags such as #rt or #bbl, which carry little meaning for this study, are not taken into account.
Third, to infer hashtag categories, we use the DBSCAN clustering algorithm [19]. This algorithm defines core points, which represent the centers of the clusters, based on the density of their neighborhood. Then, given a radius and a density threshold, it gradually adds further points (edge points) to the cluster. This iterative process allows the clustering of points that take values in a multidimensional space and correspond to potentially sparse and noisy vectors. This bottom-up algorithm has two main advantages. First, it was shown to be efficient in clustering sparse multidimensional feature vectors. Second, the number of clusters is inferred from the data itself, as opposed to algorithms such as K-Means. On the other hand, a trade-off has to be considered when using DBSCAN: isolated points may be considered as noise and will not be part of any cluster. This results in an automatic cleaning of the input data but also in the potential discarding of important points.

2) Parameterization: One important limitation of the DBSCAN algorithm is its high sensitivity to its hyperparameters, which are i) the distance; ii) the minimum number of neighbors, i.e. the minimum number of points within a radius for a point to be considered a core point; and iii) the radius. In our study we opted for the cosine distance [18, Chapter 3], which led to more balanced clusters than the Jaccard distance [18, Chapter 3]. Please note that the cosine distance is used to measure distances between word TF-IDF vectors. The choice of the minimum number of neighbors and the choice of the radius are not independent. We tested different parameterizations on the data representing the whole year of observation. As a performance metric we use the ratio between the similarity of neighbors and the similarity of randomly chosen users, both measured with the Jaccard similarity (see Fig. 5). Here we measure the distance between sets of hashtag categories, for which the Jaccard similarity (and not the cosine distance) is better suited.

Fig. 5. Evolution of the ratio of neighbor similarity to random user similarity as a function of the DBSCAN hyperparameters (radius and minimum number of neighbors).

We observe that the best ratio is obtained for a minimum number of neighbors of 2 and a radius of 0.5. These are also the parameter values that lead to the most balanced clusters. Increasing the radius or the minimum number of neighbors creates a super cluster that attracts most of the hashtags. Since we intend to discriminate user interests based on their hashtag categories, we choose parameters that result in a balanced number of categories. Note that the average hashtag similarity between neighbors in the year mention network $G_T$ is 0.4, while its average value over the whole set of users is much lower. This observation already suggests that the common interest of users is considerably correlated with the existence of social ties. We verify this suggestion in the following section, where we study the impact of hashtag similarity on the creation of links in the evolving network.

3) Results: To assess the predictive potential of hashtag similarity, we now go back to the evolving network and apply the methodology presented in Section V-B. For the different periods $T_t$, we measure the average hashtag similarity in the training period $P_t$ of users for whom a link is created in the corresponding $M_t$, and compare it with the average hashtag similarity of the same number of randomly selected user pairs in $P_t$. The results are presented in Fig. 6a.
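The comparison between future neighbors and random pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: hashtag categories are assumed to be already inferred and stored as one set of category identifiers per user, and the names `jaccard_similarity`, `mean_similarity` and `random_pairs` are hypothetical.

```python
import random
from itertools import combinations


def jaccard_similarity(a, b):
    """Jaccard similarity of two sets (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(categories, pairs):
    """Average hashtag similarity over a list of user pairs."""
    scores = [jaccard_similarity(categories[u], categories[v]) for u, v in pairs]
    return sum(scores) / len(scores) if scores else 0.0


def random_pairs(users, count, seed=0):
    """Draw `count` distinct user pairs uniformly at random.

    Enumerating all pairs is quadratic in the number of users, which is
    acceptable for a sketch but not for a network of 350,000 users.
    """
    rng = random.Random(seed)
    candidates = list(combinations(sorted(users), 2))
    return rng.sample(candidates, min(count, len(candidates)))


# Toy data: category sets observed in the training period P_t (assumed input)
categories = {"u1": {1, 2}, "u2": {1, 2, 3}, "u3": {7}, "u4": {8, 9}}
new_links = [("u1", "u2")]  # links created in the testing period M_t

linked = mean_similarity(categories, new_links)
baseline = mean_similarity(categories, random_pairs(categories, len(new_links)))
```

The ratio of `linked` to `baseline`, computed for each period, corresponds to the kind of comparison reported in Fig. 6a. The random baseline depends on the seed, so in practice it would be averaged over several draws.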
We observe that, for all periods $T_t$, the hashtag similarity of users that will connect in the following two weeks is about one order of magnitude larger than the average similarity between randomly selected pairs of users in $P_t$. This is an important finding, which provides the basis of the predictive model developed in Section V-D.

C. Link overlap

In the previous section, we studied user closeness based on hashtag similarities. In this section, we turn to our other hypothesis and study the impact of local structural properties, namely user link overlap, on link creation. Link overlap is a metric commonly used in complex networks [20] to capture the local clustering coefficient of links.

Fig. 6. (a) Hashtag similarity of connected nodes (star symbols) as compared to random nodes (dot symbols), measured as the Jaccard similarity of the users' sets of categories inferred from clustering hashtags using the DBSCAN algorithm on the TF-IDF of hashtag tweet text. (b) Evolution of overlap for users connected in the mention network (stars) as compared to random pairs of users (circles). Horizontal axis: increasing periods.

The overlap of a link connecting nodes $u$ and $v$ is defined as [20]:

$$O_{uv} = \frac{n_{uv}}{(k_u - 1) + (k_v - 1) - n_{uv}}, \qquad (1)$$

where $n_{uv}$ is the number of common neighbors of nodes $u$ and $v$, and $k_u$ (resp. $k_v$) is the degree of node $u$ (resp. $v$). The average link overlap between nodes connected in the year mention network $G_T$ is 0.05, while its average over all pairs of nodes is $3 \times 10^{-5}$. We therefore observe a correlation between node overlap and the existence of mention links between users. Note that to compute the average overlap between arbitrary pairs of users we need to compute the link overlap of non-existing links. In the following, the link overlap has to be understood as the overlap that a given link has, or would have if it were present, in a given network. In the remainder of this section, we use the same methodology as the one explained in Section V-B for hashtag similarity.
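A minimal sketch of Eq. (1) is given below. One point is an assumption of this example rather than something stated in the text: for a non-existing link, the overlap is computed as if the link were present, which amounts to excluding $u$ and $v$ from each other's neighborhoods before counting. The names `build_adjacency` and `link_overlap` are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def link_overlap(adjacency, u, v):
    """Overlap of the (existing or hypothetical) link between u and v.

    Removing v from the neighbors of u (and u from those of v) gives
    k_u - 1 and k_v - 1 for an existing link, as in Eq. (1), and treats a
    non-existing link as if it were present (assumption of this sketch).
    """
    neighbors_u = adjacency.get(u, set()) - {v}
    neighbors_v = adjacency.get(v, set()) - {u}
    common = len(neighbors_u & neighbors_v)  # n_uv
    denominator = len(neighbors_u) + len(neighbors_v) - common
    return common / denominator if denominator else 0.0


# Toy training network G_Pt
adjacency = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("d", "e")])
existing = link_overlap(adjacency, "a", "b")    # link present in the network
candidate = link_overlap(adjacency, "b", "d")   # link absent, one common neighbor
unrelated = link_overlap(adjacency, "b", "e")   # link absent, no common neighbor
```

Pairs without common neighbors always obtain an overlap of 0, which is consistent with the very small average overlap reported for randomly selected pairs.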
Namely, we compute the link overlap of two nodes in period $P_t$ and observe its impact on the creation of a mention link in period $M_t$. We proceed as follows. For each period $T_t$:

- Extract the mention network $G_{P_t}$ corresponding to the training period $P_t$.
- Extract the mention network $G_{M_t}$ corresponding to the testing period $M_t$.
- For each edge $e_{u,v} \in G_{M_t} \setminus G_{P_t}$, compute the link overlap $O_{u,v}(G_{P_t})$ of the non-existing link between nodes $u$ and $v$ in $G_{P_t}$.
- Compute the average overlap $\langle O_{u,v}(G_{P_t}) \rangle$ over these edges $e_{u,v} \in G_{M_t}$.
- For the same number of randomly selected pairs of users observed in period $P_t$ that do not share a mention link yet, compute the average overlap.

Fig. 7. Precision (a) and recall (b) of the predictive model for different values of $\alpha$ and $\beta$.

The results presented in Fig. 6b show that the average link overlap for randomly selected non-connected user pairs is almost always 0 (about $10^{-5}$). At the same time, its average value for nodes that will be connected in the following two weeks is consistently about four orders of magnitude larger. This result is in line with previous works predicting link creation in social networks based on the structural property of triadic closure [9], [10].

D. Prediction

Having observed that both the link overlap and the hashtag similarity have a strong impact on the creation of links in the mention network, we use these two features to build a predictive model. In this model, each potential pair of nodes is assigned a score capturing the nodes' hashtag similarity and potential link overlap during the training period. We then use this score to assess the likelihood of a link being created in the mention network of the subsequent testing period.

1) Scoring: Here we introduce a combined score for possible connections between unconnected users in order to predict the links created among them. In other words, we want to see whether user similarity and/or link overlap measured between disconnected nodes correlate effectively with future link creation events. Note that this goal is far more ambitious than the one of the previous sections, where we verified that created links have a hashtag similarity and a link overlap larger than the overall average.
We define the score, $L_{u,v}$, of a potential link between two nodes, $u$ and $v$, as:

$$L_{u,v} = \alpha\, O_{u,v} + \beta\, S_{u,v}, \qquad (2)$$

where $S_{u,v}$ represents the hashtag similarity of the two users $u$ and $v$ (defined in Section V-B) and $O_{u,v}$ is the link overlap of their corresponding (non-existing) link in the training mention network (see Section V-C). We also impose that $\alpha + \beta = 1$.

Fig. 8. Precision (b) and recall (c) gain and loss of the predictive model for different values of $\alpha$ and $\beta$ as compared to using only a structure-based predictive model. Panel (a) shows the collected Twitter frequencies and cumulative counts.

2) Implementation: Because of the complexity and resource requirements of considering all possible links between all possible pairs of users, we make the following simplifications: i) we only consider links for pairs of users with a non-zero overlap; ii) we only consider users who have used at least one hashtag that belongs to one of the inferred hashtag categories. For a given training period $P_t$, we call the resulting set of links the set of potential candidate links. Simplification i) is motivated by computational optimization and is shown to have little impact on the conclusions drawn from this work, while ii) comes from an inherent limitation of the DBSCAN algorithm and is discussed in Section VI.

3) Results: Our final goal is to predict which potential candidate links extracted from a training period $P_t$ are actually going to be created in the corresponding testing period $M_t$. To do so, we select from the set of potential candidate links the ones with the highest scores, which we call candidate links. We then compare the candidate links with the links that were actually created, which constitute our ground truth. This predictive model includes three parameters.
These are the two parameters defined in Eq. 2, namely

- the overlap weight, $\alpha$;
- the user similarity weight, $\beta$;

and the threshold $\rho$ used to select candidate links from the potential candidate links. We study the performance of our model for different values of $\alpha$ and $\beta$ based on two complementary performance metrics, precision (the fraction of selected candidate links that are actually created) and recall (the fraction of created links that were among the selected candidates), defined as:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad (3)$$

where $TP$, $FP$ and $FN$ are the true positives, false positives and false negatives, respectively. As our goal is to assess the capacity of a model to predict link creation when including multiple mechanisms, we choose the value of $\rho$ that leads to the best precision of our prediction (defined in Eq. 3). Fig. 7a shows the best precision that can be achieved for the different values of $\alpha$ and $\beta$ when selecting the threshold $\rho$ that maximizes the precision of the prediction. By doing so, we penalize the recall, as one can see in Fig. 7b. This behavior is not surprising, as the computed hashtag similarity is based on the distance between user hashtag categories which, as discussed in previous sections, discards users who use very rare and isolated hashtags (in the space of tweet words). As a consequence, including hashtag similarity greatly improves the precision of our prediction by being more selective, but also decreases the number of selected candidates and thus the recall.

Interestingly, the model shows weak sensitivity to the variation of $\alpha$ and $\beta$. While we can clearly observe an improvement of the precision of the prediction in Fig. 7a when including hashtag similarity, the difference between $[\alpha = 0.5, \beta = 0.5]$ and $[\alpha = 0.8, \beta = 0.2]$ is very small. This is also true for $[\alpha = 0.2, \beta = 0.8]$ if we look at the global precision across the different periods. This result indicates the intrinsic impact of homophily on social tie creation, and it also mitigates the effect of the computational optimization made in our approach. The important finding behind these results is the improvement in precision that we obtain when including the common interest of users in the predictive model.
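The scoring of Eq. (2) and the evaluation of Eq. (3) can be sketched as follows. This is an illustration under assumptions, not the code of the study: the threshold $\rho$ is interpreted here as a cutoff on the score, the weight $\alpha$ is applied to the overlap as written in Eq. (2), and the names `link_score`, `select_candidates` and `precision_recall` are hypothetical.

```python
def link_score(overlap, similarity, alpha):
    """Score of Eq. (2) with beta = 1 - alpha."""
    return alpha * overlap + (1.0 - alpha) * similarity


def select_candidates(features, alpha, rho):
    """Keep the potential candidate links whose score is at least rho.

    `features` maps a user pair to (overlap, hashtag similarity), both
    measured in the training period (assumed input format).
    """
    return {
        pair
        for pair, (overlap, similarity) in features.items()
        if link_score(overlap, similarity, alpha) >= rho
    }


def precision_recall(predicted, created):
    """Precision and recall of Eq. (3) for sets of predicted and created links."""
    true_positives = len(predicted & created)
    false_positives = len(predicted - created)
    false_negatives = len(created - predicted)
    selected = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / selected if selected else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Toy example: four potential candidate links, two of which are created later
features = {
    ("a", "b"): (0.5, 0.8),
    ("a", "c"): (0.1, 0.9),
    ("b", "d"): (0.4, 0.0),
    ("c", "d"): (0.0, 0.1),
}
created = {("a", "b"), ("b", "d")}

loose = precision_recall(select_candidates(features, alpha=0.5, rho=0.4), created)
strict = precision_recall(select_candidates(features, alpha=0.5, rho=0.6), created)
```

In this toy example, raising the threshold removes the false positive while the recall stays unchanged. On real data a more selective threshold usually lowers the recall, which is the effect discussed above.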
Having observed the benefit of considering homophily in our prediction, we now try to optimize both the precision and the recall of our model, and compare its results to models considering either non-structural or structural properties only. To optimize both precision and recall, we select ρ by using a performance metric called the F-score (the harmonic mean of precision and recall):

\[ F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4} \]

By optimizing both the precision and the recall, we set a common base to compare a predictive model that takes homophily into account to a model that does not. The results are presented in Fig. 8a and Fig. 8b, which show the gain in terms of precision and the loss in terms of recall when including hashtag similarity in our model. Different observations can be made from these two figures. First, taking into account hashtag similarity greatly improves the precision of mention link prediction in the first half of the period of observation. Second, there are periods when its impact is negative. However, these poorly predicted periods correspond to earlier training periods during which little data was collected due to technical problems (see Fig. 4 and Fig. 8a upper panel for collected Twitter frequencies). While this is not related to our model accuracy but to data capture, it also shows the sensitivity of the DBSCAN algorithm, and by extension of our approach, to data shortage. Another important point to notice is that, as we increase the memory of our model, the mention network becomes more and more connected. As a consequence, the predictive model based on overlap becomes more accurate and the improvement induced by hashtag similarity becomes less relevant.

VI. CONCLUSIONS AND FUTURE WORK

In this section we first summarize the different results of this paper, before pointing out some possible limitations of our approach and proposing solutions to mitigate such limitations and further improve the precision of our model.

A. Conclusions

In this paper we studied the impact of homophily and triadic closure mechanisms on social link creation in the Twitter mention network. While many studies focus only on structural properties of social networks to predict their evolution, we show that the propensity of users belonging to the same community to share common interest can be used to effectively predict social interactions. Our work provides evidence that common interest shared by two users impacts their likelihood to interact. We further define similarity measures capturing effects of homophily, which in turn improve the precision of prediction of social link creation by up to 60% as compared to an approach based on triadic closure exclusively.

In our work we use link overlap to capture the effects of the triadic closure mechanism on the network evolution, and we complement it with user hashtag similarity to capture effects induced by homophily. One specificity of our approach comes from the use of hashtag categories inferred from clustering hashtags based on the text of the tweets they appear in. Inspired by techniques borrowed from the fields of Document Clustering and Information Retrieval, we show our methodology to be efficient in inferring meaningful hashtag categories. By discarding noise in the data and focusing on important words when measuring hashtag distances, this approach infers categories that perform well in capturing the shared interest of users.

The first step of our work is dedicated to verifying that link overlap and hashtag similarity are discriminating features for link prediction. We therefore compare the value of these two features between pairs of users that create a mention link with the average value calculated for randomly selected pairs. The results we obtain show that, on average, users that mention each other in the following two-week period show a link overlap four orders of magnitude, and a hashtag similarity one order of magnitude, higher than the random average.
Based on this result, we build a predictive model that computes a link creation likelihood score using hashtag similarity and link overlap. This model greatly improves the precision of Twitter reciprocal mention link prediction (up to 60%) to the detriment of the recall, whose loss can be bounded to 10%. We found that designing a combined likelihood score accounting for 20% of hashtag similarity and 80% of link overlap in its prediction leads to an average precision improvement of up to 50% while limiting the loss in terms of recall to less than 5%. While this method shows some limitations due to its use of the DBSCAN clustering algorithm to infer hashtag categories, it clearly underlines the importance of considering homophily when inferring future social interactions.

Finally, our study shows that the prediction of social links is mainly improved by homophily during the early times of observation of the social network. However, as the social network gets more connected, the impact of structural information, such as the fraction of common friends, becomes more important. Therefore, one additional advantage of applying contextual similarity measures to social link prediction is that they require less data to predict the creation of social links.

B. Limitations of the approach and future work

Our model relies on a simple linear combination of hashtag similarity and link overlap. The aim of this work is to show the impact of homophily on the prediction of social links. Since this is an exploratory first step in this direction, we applied a linear model for simplicity. However, there exist several machine learning approaches that one can use to build a more elaborate and accurate predictive model [14]. While such an approach is out of the scope of this paper, it should definitely be considered as a future direction. As mentioned previously, a limitation of our approach comes from the use of the DBSCAN clustering algorithm, which does not cope very well with the presence of hashtags that appear very seldom on Twitter.
Such discarded hashtags result in many users who are not assigned any hashtag category, and thus their social ties cannot be predicted with this method. A careful study of the DBSCAN algorithm parameterization may improve the quality of the hashtag categories and in turn the prediction accuracy; however, such an extensive study of the DBSCAN algorithm is beyond the scope of this work.

ACKNOWLEDGMENT

This work was partially funded by the SoSweet (ANR-15- CE ) and CODDDE (ANR-13-CORD ) ANR projects.

REFERENCES

[1] D. Lazer, A. Pentland et al., "Computational Social Science," Science, vol. 323, no. 5915, pp , Feb
[2] M. S. Granovetter, "The Strength of Weak Ties," American Journal of Sociology, vol. 78, no. 6, pp ,
[3] M. McPherson, L. S. Lovin, and J. M. Cook, "Birds of a Feather: Homophily in Social Networks," Annual Review of Sociology, vol. 27, no. 1, pp ,
[4] B. A. Huberman, D. M. Romero, and F. Wu, "Social networks that matter: Twitter under the microscope," CoRR, vol. abs/ ,
[5] L. Weng, J. Ratkiewicz et al., "The role of information diffusion in the evolution of social networks," in Proceedings of the 19th ACM SIGKDD, ser. KDD '13. ACM, 2013, pp
[6] S. A. Myers and J. Leskovec, "The bursty dynamics of the Twitter information network," in Proceedings of the 23rd International Conference on World Wide Web, ser. WWW '14. ACM, 2014, pp
[7] J. Leskovec, K. J. Lang, and M. Mahoney, "Empirical comparison of algorithms for network community detection," in Proceedings of the 19th International Conference on World Wide Web, ser. WWW '10. New York, NY, USA: ACM, 2010, pp
[8] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, pp ,
[9] D. M. Romero and J. M. Kleinberg, "The directed closure process in hybrid social-information networks, with an analysis of link formation on Twitter," in Proceedings of the Fourth International Conference on Weblogs and Social Media, May 23-26, 2010, ser. ICWSM 2010,
[10] J. Leskovec, L. Backstrom et al., "Microscopic evolution of social networks," in Proceedings of the 14th ACM SIGKDD, ser. KDD '08, 2008, pp
[11] D. Antoniades and C. Dovrolis, "Co-evolutionary dynamics in social networks: A case study of Twitter," CoRR, vol. abs/ ,
[12] D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks," in Proceedings of the Twelfth International Conference on Information and Knowledge Management, ser. CIKM '03. ACM, 2003, pp
[13] J. Cheng, L. Adamic et al., "Can cascades be predicted?" in Proceedings of the 23rd International Conference on World Wide Web, ser. WWW '14. New York, NY, USA: ACM, 2014, pp
[14] V. Srinivas and P. Mitra, Link Prediction in Social Networks: Role of Power Law Distribution, 1st ed. Springer Publishing Company, Incorporated,
[15] F. Morstatter, J. Pfeffer et al., "Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose," in ICWSM,
[16] G. Krings, M. Karsai et al., "Effects of time window size and placement on the structure of an aggregated communication network," EPJ Data Science, vol. 1, no. 1, May
[17] U. Pavalanathan and J. Eisenstein, "Audience-modulated variation in online social media," American Speech, vol. 90, no. 2, pp , May [Online]. Available:
[18] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. New York, NY, USA: Cambridge University Press,
[19] J. Sander, M. Ester et al., "Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications," Data Min. Knowl. Discov., vol. 2, no. 2, pp , Jun
[20] J.-P. Onnela, J. Saramäki et al., "Structure and tie strengths in mobile communication networks," Proceedings of the National Academy of Sciences, vol. 104, no. 18, pp ,