Derek Davis, Gerardo Figueroa, and Yi-Shin Chen


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 47, NO. 6, JUNE 2017

SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors

Derek Davis, Gerardo Figueroa, and Yi-Shin Chen

Abstract: Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and capture only the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data; hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework, SociRank, which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.

Index Terms: Information filtering, social computing, social network analysis, topic identification, topic ranking.

I. INTRODUCTION

The mining of valuable information from online sources has become a prominent research area in information technology in recent years.
Historically, knowledge that apprises the general public of daily events has been provided by mass media sources, specifically the news media. Many of these news media sources have either abandoned their hard-copy publications and moved to the World Wide Web, or now produce both hard-copy and Internet versions simultaneously. These news media sources are considered reliable because they are published by professional journalists, who are held accountable for their content. On the other hand, the Internet, being a free and open forum for information exchange, has recently seen a fascinating phenomenon known as social media. In social media, regular, nonjournalist users are able to publish unverified content and express their interest in certain events. Microblogs have become one of the most popular social media outlets. One microblogging service in particular, Twitter, is used by millions of people around the world, providing enormous amounts of user-generated data. One may assume that this source potentially contains information with equal or greater value than the news media, but one must also assume that because of the unverified nature of the source, much of this content is useless.

Manuscript received July 21, 2015; revised September 14, 2015; accepted October 17, 2015. Date of publication March 10, 2016; date of current version May 15, 2017. This work was supported by the Ministry of Science and Technology, China, under Grant MOST E and Grant MOST E. This paper was recommended by Associate Editor F. Wang. D. Davis and G. Figueroa are with the Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: ddavis@gmail.com; gerardo_ofc@yahoo.com). Y.-S. Chen is with the Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: yishin@gmail.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier /TSMC
© 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

For social media data to be of any use for topic identification, we must find a way to filter out uninformative content and capture only information which, based on its content similarity to the news media, may be considered useful or valuable. The news media presents professionally verified occurrences or events, while social media presents the interests of the audience in these areas, and may thus provide insight into their popularity. Social media services like Twitter can also provide additional or supporting information to a particular news media topic. In summary, truly valuable information may be thought of as the area in which these two media sources topically intersect. Unfortunately, even after the removal of unimportant content, there is still information overload in the remaining news-related data, which must be prioritized for consumption. To assist in the prioritization of news information, news must be ranked in order of estimated importance. The temporal prevalence of a particular topic in the news media indicates that it is widely covered by news media sources, making it an important factor when estimating topical relevance. This factor may be referred to as the MF of the topic. The temporal prevalence of the topic in social media, specifically in Twitter, indicates that users are interested in the topic and can provide a basis for the estimation of its popularity. This factor is regarded as the UA of the topic. Likewise, the number of users discussing a topic and the interaction between them also give insight into topical importance, referred to as the UI. By combining these three factors, we gain insight into topical importance and are then able to rank the news topics accordingly. Consolidated, filtered, and ranked news topics from both professional news providers and individuals have

several benefits. The most evident use is the potential to improve the quality and coverage of news recommender systems or Web feeds by adding user popularity feedback. Additionally, news topics that perhaps were not perceived as popular by the mass media could be uncovered from social media and given more coverage and priority. For instance, a particular story that has been discontinued by news providers could be given resurgence and continued if it is still a popular topic among social networks. This information, in turn, can be filtered to discover how particular topics are discussed in different geographic locations, which serves as feedback for businesses and governments. A straightforward approach for identifying topics from different social and news media sources is the application of topic modeling. Many methods have been proposed in this area, such as latent Dirichlet allocation (LDA) [1] and probabilistic latent semantic analysis (PLSA) [2], [3]. Topic modeling is, in essence, the discovery of topics in text corpora by clustering together frequently co-occurring words. This approach, however, misses out on the temporal component of prevalent topic detection; that is, it does not take into account how topics change with time. Furthermore, topic modeling and other topic detection techniques do not rank topics according to their popularity by taking into account their prevalence in both news media and social media. We propose an unsupervised system, SociRank, which effectively identifies news topics that are prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Even though this paper focuses on news topics, the framework can be easily adapted to a wide variety of fields, from science and technology to culture and sports.
To the best of our knowledge, no other work attempts to employ either the social media interests of users or their social relationships to aid in the ranking of topics. Moreover, SociRank comprises and integrates several techniques within an empirical framework, such as keyword extraction, measures of similarity, graph clustering, and social network analysis. The effectiveness of our system is validated by extensive controlled and uncontrolled experiments. To achieve its goal, SociRank uses keywords from news media sources (for a specified period of time) to identify the overlap with social media from that same period. We then build a graph whose nodes represent these keywords and whose edges depict their co-occurrences in social media. The graph is then clustered to clearly identify distinct topics. After obtaining well-separated topic clusters (TCs), the factors that signify their importance are calculated: MF, UA, and UI. Finally, the topics are ranked by an overall measure that combines these three factors. The remainder of this paper is organized as follows. Section II reviews previous work on topic identification and other research areas that are implemented in our method, including keyword extraction, co-occurrence similarity measures, graph clustering, topic ranking, and social network analysis. Section III describes the overall framework of SociRank and all of its stages. Section IV presents the setup and results of our experiments. Finally, we provide the conclusions and future work in Section V.

II. RELATED WORK

The main research areas applied in this paper include: topic identification, topic ranking, social network analysis, keyword extraction, co-occurrence similarity measures, and graph clustering. Extensive work has been conducted in most of these areas.

A. Topic Identification

Much research has been carried out in the field of topic identification, referred to more formally as topic modeling.
Two traditional methods for detecting topics are LDA [1] and PLSA [2], [3]. LDA is a generative probabilistic model that can be applied to different tasks, including topic identification. PLSA, similarly, is a statistical technique that can also be applied to topic modeling. In these approaches, however, temporal information is lost, which is paramount in identifying prevalent topics and is an important characteristic of social media data. Furthermore, LDA and PLSA only discover topics from text corpora; they do not rank them based on popularity or prevalence. Wartena and Brussee [4] implemented a method to detect topics by clustering keywords. Their method entails the clustering of keywords based on different similarity measures using the induced k-bisecting clustering algorithm [5]. Although they do not employ the use of graphs, they do observe that a distance measure based on the Jensen-Shannon divergence (or information radius [6]) of probability distributions performs well. More recently, research has been conducted on identifying topics and events from social media data, taking into account temporal information. Cataldi et al. [7] proposed a topic detection technique that retrieves real-time emerging topics from Twitter. Their method uses the set of terms from tweets and models their life cycle according to a novel aging theory. Additionally, they take into account social relationships, more specifically the authority of users in the network, to determine the importance of the topics. Zhao et al. [8] carried out similar work by developing a Twitter-LDA model designed to identify topics in tweets. Their work, however, only considers the personal interests of users, and not prevalent topics at a global scale. Another trending area of related research is the detection of bursty topics (i.e., topics or events that occur in short, sudden episodes). Diao et al. [9] proposed a method that uses a state machine to detect bursty topics in microblogs.
Their method also determines whether user posts are personal or refer to a particular trending topic. Yin et al. [10] also developed a model that detects topics from social media data, distinguishing between temporal and stable topics. These methods, however, only use data from microblogs and do not attempt to integrate them with real news. Additionally, the detected topics are not ranked by popularity or prevalence.
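The Jensen-Shannon divergence mentioned above, which Wartena and Brussee [4] found to perform well as a keyword distance, is straightforward to compute. Below is a minimal stdlib-only Python sketch; the function names are ours, not taken from the cited work, and distributions are assumed to be aligned lists of probabilities.

```python
import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # JS divergence: average KL of each distribution against their midpoint m.
    # Symmetric in p and q, and bounded in [0, 1] with log base 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give a divergence of 0, while distributions with disjoint support give the maximum value of 1.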

B. Topic Ranking

Another major concept incorporated into this paper is topic ranking. There are several means by which this task can be accomplished, traditionally done by estimating how frequently and recently a topic has been reported by mass media. Wang et al. [11] proposed a method that takes into account the users' interest in a topic by estimating the number of times they read stories related to that particular topic. They refer to this factor as the UA. They also used an aging theory developed by Chen et al. [12] to create, grow, and destroy a topic. The life cycles of the topics are tracked by using an energy function. The energy of a topic increases when it becomes popular, and it diminishes over time unless it remains popular. We employ variants of the concepts of MF and UA to meet our needs, as these concepts are both logical and effective. Other works have made use of Twitter to discover news-related content that might be considered important. Sankaranarayanan et al. [13] developed a system called TwitterStand, which identifies tweets that correspond to breaking news. They accomplish this by utilizing a clustering approach for tweet mining. Phelan et al. [14] developed a recommendation system that generates a ranked list of news stories. News stories are ranked based on the co-occurrence of popular terms within the user's RSS and Twitter feeds. Both of these systems aim to identify emerging topics, but give no insight into their popularity over time. Moreover, the work by Phelan et al. [14] only produces a personalized ranking (i.e., news articles tailored specifically to the content of a single user), rather than providing an overall ranking based on a sample of all users. Nevertheless, these works provide us with a basis for extending the premise of UA. Research has also been carried out in topic discovery and ranking from other domains.
Shubhankar et al. [15] developed an algorithm that detects and ranks topics in a corpus of research papers. They used closed frequent keyword-sets to form topics and a modification of the PageRank [16] algorithm to rank them. Their work, however, does not integrate or collaborate with other data sources, as accomplished by SociRank.

C. Social Network Analysis

In the case of UA, Wang et al. [11] estimated this factor by using anonymous website visitor data. Their method counts the number of times a site was visited during a particular period of time, which represents the UA of the topic to which the site is related. Our belief, on the other hand, is that, although website usage statistics provide initial proof of attention, additional data are needed to corroborate it. We employ the use of social media, specifically Twitter, as a means to estimate UA. When a user tweets about a particular topic, it signifies that the user is interested in the topic and that it has captured her attention more so than visiting a website related to it. In summary, visiting a website might be the initial stimulus, but taking the additional step of discussing a topic via social media signifies genuine attention.1 Additionally, we believe that the relationship between social media users who discuss the same topics also plays a key role in topic relevance. Kwan et al. [17] proposed a measure referred to as reciprocity, which attempts to detect the interaction between social media users and perceive their engagement in relation to a particular topic. Higher reciprocity means greater interaction between users, and thus topics with higher reciprocity should be considered more important because of their underlying community structure. We can inherently identify the power and influence of a well-structured community as opposed to a decentralized and unstructured one. Our method applies this logic to support the idea that higher reciprocity signifies greater importance.

D. Keyword Extraction

Concerning the field of keyword or informative term extraction, many unsupervised and supervised methods have been proposed. Unsupervised methods for keyword extraction rely solely on implicit information found in individual texts or in a text corpus. Supervised methods, on the other hand, make use of training datasets that have already been classified. Among the unsupervised methods, there are those that employ statistical measures of term informativeness or relevance, such as term specificity [18], TF-IDF [19], word frequency [20], n-grams [21], and word co-occurrence [22]. Other unsupervised approaches are graph-based, where a text is converted into a graph whose nodes represent text units (e.g., words, phrases, and sentences) and whose edges represent the relationships between these units. The graph is then recursively iterated and relevance scores are assigned to each node using different approaches. A popular example of a graph-based keyword extraction method is TextRank, proposed by Mihalcea and Tarau [23]; it utilizes the premise of the popular PageRank [16] algorithm. There has also been much work on keyword extraction using supervised and hybrid approaches. Two traditional supervised frameworks are KEA [24] and GenEx [25], which use machine learning algorithms for the effective extraction of keywords. Other innovative approaches for keyword extraction have been proposed in recent years, including the application of neural networks [26]-[28] and conditional random fields [29]. Hybrid methods (i.e., methods that make use of unsupervised and supervised components) have been proposed as well, such as HybridRank [30], which makes use of collaboration between the two approaches. Due to its simple implementation, we use TextRank [23] to extract keywords from the news media sources. Furthermore, TextRank does not require training or any document corpus for its operation.

E. Co-Occurrence Similarity

Matsuo and Ishizuka [22] suggested that the co-occurrence relationship of frequent word pairs from a single document

1 The works by Sankaranarayanan et al. [13] and Phelan et al. [14] support this premise.

may provide statistical information to aid in the identification of the document's keywords. They proposed that if the probability distribution of co-occurrence between a term x and all other terms in a document is biased toward a particular subset of frequent terms, then term x is likely to be a keyword. Even though our intention is not to employ co-occurrence for keyword extraction, this hypothesis emphasizes the importance of co-occurrence relationships. Chen et al. [31] proposed a novel co-occurrence similarity measure in which they measure the association of terms using snippets returned by Web searches. They refer to this measure as co-occurrence double checking (CODC). Bollegala et al. [32] proposed a method that uses page counts and text snippets from Web searches to measure the similarity between words or entities. They compared their method with CODC [31], as well as with variants of several other co-occurrence similarity measures, such as the overlap (Simpson) [33], Dice [34], point-wise mutual information (PMI) [35], Jaccard [36], and cosine similarities. Since establishing the importance of the word-pair co-occurrence distribution in the actual corpus of tweets is of more interest to us, we did not employ Bollegala's or Chen's semantic similarity methods. In this paper, we tested other similarity measures, and found that the Dice similarity measure provided the best results.

F. Graph Clustering

The main purpose of graph clustering in this paper is to identify and separate TCs, as done in Wartena and Brussee's work [4]. Iwasaka and Tanaka-Ishii [37] also proposed a method that clusters a co-occurrence graph based on a graph measure known as transitivity.
The basic idea of transitivity is that in a relationship between three elements, if the relationship holds between the first and second elements and between the second and third elements, it also holds between the first and third elements. They suggested that each output cluster is expected to have no ambiguity, and that this is only achieved when the edges of a graph (representing co-occurrence relations) are transitive. Matsuo et al. [38] employed a different approach to achieve the clustering of co-occurrence graphs. They used Newman clustering [39] to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness measure of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between different clusters must go along one of these edges. Consequently, the edges connecting different clusters will have high edge betweenness, and removing them iteratively will yield well-defined clusters. Newman realized, however, that the main disadvantage of the algorithm was its high computational demand, and thus proposed a new method to identify clusters based on modularity [40]. Modularity is a measure designed to estimate the strength of division of a network into clusters. Networks that possess a high modularity value have dense connections between the nodes within each cluster, but sparse connections between nodes in different clusters. Newman's new algorithm calculates modularity as it progresses, making it simple to find the optimal clustering structure. Given their effectiveness, the concepts of betweenness and transitivity are both applied in our graph clustering algorithm.

III. SOCIRANK FRAMEWORK

The goal of our method, SociRank, is to identify, consolidate, and rank the most prevalent topics discussed in both news media and social media during a specific period of time. The system framework is visualized in Fig. 1. To achieve its goal, the system must undergo four main stages.

1) Preprocessing: Key terms are extracted and filtered from news and social data corresponding to a particular period of time.
2) Key Term Graph Construction: A graph is constructed from the previously extracted key term set, whose vertices represent the key terms and whose edges represent the co-occurrence similarity between them. The graph, after processing and pruning, contains slightly joint clusters of topics popular in both news media and social media.
3) Graph Clustering: The graph is clustered in order to obtain well-defined and disjoint TCs.
4) Content Selection and Ranking: The TCs from the graph are selected and ranked using the three relevance factors (MF, UA, and UI).

Initially, news and tweet data are crawled from the Internet and stored in a database. News articles are obtained from specific news websites via their RSS feeds, and tweets are crawled from the Twitter public timeline [41]. A user then requests an output of the top k ranked news topics for a specified period of time between date d1 (start) and date d2 (end).

A. Preprocessing

In the preprocessing stage, the system first queries all news articles and tweets from the database that fall within dates d1 and d2. Additionally, two sets of terms are created: one for the news articles and one for the tweets, as explained below.

1) News Term Extraction: The set of terms from the news data source consists of keywords extracted from all the queried articles. Due to its simple implementation and effectiveness, we implement a variant of the popular TextRank algorithm [23] to extract the top k keywords from each news article.
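As a rough illustration of this step, the following is a bare-bones TextRank-style keyword extractor: it builds an undirected co-occurrence graph over words within a sliding window and then applies PageRank-style score propagation. This is a sketch under our own simplifications (no POS filtering, no multi-word keywords, plain text tokenization), not the authors' implementation.

```python
import re
from collections import defaultdict

def textrank_keywords(text, top_k=5, window=2, damping=0.85, iters=50):
    # Tokenize and drop very short words, mirroring the paper's length filter.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 3]
    # Undirected co-occurrence graph: link each word to neighbors in a window.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank-style iteration over the word graph (TextRank scoring).
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(graph[u]) for u in graph[w])
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Running it on a short news snippet returns the most central words first; in practice the extracted keywords would then be lemmatized as described next.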
The selected keywords are then lemmatized using the WordNet lemmatizer in order to consider different inflected forms of a word as a single item. After lemmatization, all unique terms are added to set N. It is worth pointing out that, since N is a set, it does not contain duplicate terms.

2) Tweet Term Extraction: For the tweets data source, the set of terms consists not of the tweets' keywords, but of all unique and relevant terms. First, the language of each queried tweet is identified, disregarding any tweet that is not in English. From the remaining tweets, all terms that appear in a stop word

2 Terms that appear in a predefined stop word list or that are less than three characters in length are not taken into consideration.

Fig. 1. SociRank framework.

list or that are less than three characters in length are eliminated. The part of speech (POS) of each term in the tweets is then identified using a POS tagger [42]. This POS tagger is especially useful because it can identify Twitter-specific POSs, such as hashtags, mentions, and emoticon symbols. Hashtags are of great interest to us because of their potential to hold the topical focus of a tweet. However, hashtags usually contain several words joined together, which must be segmented in order to be useful. To solve this problem, we make use of the Viterbi segmentation algorithm [43]. The segmented terms are then tagged as hashtags. To eliminate terms that are not relevant, only terms tagged as hashtag, noun, adjective, or verb are selected. The terms are then lemmatized and added to set T, which represents all unique terms that appear in tweets from dates d1 to d2.

B. Key Term Graph Construction

In this component, a graph G is constructed, whose clustered nodes represent the most prevalent news topics in both news and social media. The vertices in G are unique terms selected from N and T, and the edges are represented by a relationship between these terms. In the following sections, we define a method for selecting the terms and establish a relationship between them. After the terms and relationships are identified, the graph is pruned by filtering out unimportant vertices and edges.

1) Term Document Frequency: First, the document frequency of each term in N and T is calculated accordingly. In the case of term set N, the document frequency of each term n is equal to the number of news articles (from dates d1 to d2) in which n has been selected as a keyword; it is represented as df(n). The document frequency of each term t in set T is calculated in a similar fashion.
In this case, however, it is the number of tweets in which t appears; it is represented as df(t). For simplification purposes, we will henceforth refer to the document frequency as occurrence. Thus, df(n) is the occurrence of term n and df(t) is the occurrence of term t.

2) Relevant Key Term Identification: Let us recall that set N represents the keywords present in the news and set T represents all relevant terms present in the tweets (from dates d1 to d2). We are primarily interested in the important news-related terms, as these signal the presence of a news-related topic. Additionally, part of our objective is to extract the topics that are prevalent in both news and social media. To achieve this, a new set I is formed:

I = N ∩ T.  (1)

This intersection of N and T eliminates terms from T that are not relevant to the news and terms from N that are not mentioned in the social media. Set I, however, still contains many potentially unimportant terms. To solve this problem, terms in I are ranked based on their prevalence in both sources. In this case, prevalence is interpreted as the occurrence of a term, which in turn is the term's document frequency. The prevalence of a term is thus a combination of its occurrence in both N and T. The prevalence p of each term i in I is calculated such that half of its weight is based on the occurrence of the term in the news media, and the other half is based on its occurrence in social media:

∀i ∈ I : p(i) = ( df(n) · |T| / |N| + df(t) ) / ( 2 |T| )  (2)

where |T| is the total number of tweets selected between dates d1 and d2, and |N| is the total number of news articles selected in the same time period.
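The prevalence of Eq. (2) can be sketched directly in code. The container shapes here (one set of keywords per article, one set of terms per tweet) are our own assumption for illustration; note that the formula is equivalent to df(n)/(2|N|) + df(t)/(2|T|), i.e., half weight per source.

```python
def prevalence(term, news_keywords, tweet_terms):
    # news_keywords: one set of extracted keywords per news article (source of N)
    # tweet_terms:   one set of relevant terms per tweet (source of T)
    # Implements p(i) = (df(n) * |T| / |N| + df(t)) / (2 * |T|).
    df_n = sum(term in kws for kws in news_keywords)   # occurrence in news
    df_t = sum(term in terms for terms in tweet_terms) # occurrence in tweets
    n_news, n_tweets = len(news_keywords), len(tweet_terms)
    return (df_n * n_tweets / n_news + df_t) / (2 * n_tweets)
```

A term occurring in every article and every tweet gets prevalence 1; a term occurring in neither gets 0.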

The terms in set I are then ranked by their prevalence value, and only those in the top πth percentile are selected. Using a π value of 75 presented the best results in our experiments. We define the newly filtered set I_top using set-builder notation:

I_top = { i ∈ I : (|P_i| / |I|) · 100 > π }  (3)

P_i = { j ∈ I : p(j) < p(i) }  (4)

where |P_i| is the number of elements in subset P_i, which in turn represents the terms in I with a lower prevalence value than that of term i, and |I| is the total number of elements in set I. I_top now represents the subset of top key terms from date d1 to date d2, taking into account their prevalence in both news and social media.

3) Key Term Similarity Estimation: Next, we must identify a relationship between the previously selected key terms in order to add the graph edges. The relationship used is term co-occurrence in the tweet term set T. The intuition behind the co-occurrence is that terms that co-occur frequently are related to the same topic and may be used to summarize and represent it when grouped. We define co-occurrence as two terms occurring in the same tweet. If term i ∈ I_top and term j ∈ I_top both appear in the same tweet, their co-occurrence is set to 1. For each additional tweet in which i and j appear together, their co-occurrence is incremented by 1. I_top is iterated through and the co-occurrence for each term pair {i, j} is found, defined as co(i, j). The term-pair co-occurrence is then used to estimate the similarity between terms. As explained in our related work, several similarity measures were tested in our experiments, namely overlap [33], Dice [34], PMI [35], Jaccard [36], and cosine. Dice similarity, a variant of cosine similarity and Jaccard similarity, provided the best results in our experiments.
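The percentile-based selection of I_top described by Eqs. (3) and (4) can be sketched as a literal reading of the set-builder definition; the function name and dict representation are ours.

```python
def top_percentile(prevalence, pi=75):
    # prevalence: dict mapping term -> p(i).
    # Keep term i if the share of terms with strictly lower prevalence,
    # expressed as a percentage (|P_i| / |I| * 100), exceeds pi.
    n = len(prevalence)
    return {i for i, p in prevalence.items()
            if 100.0 * sum(q < p for q in prevalence.values()) / n > pi}
```

This quadratic scan is fine for illustration; sorting once by prevalence would give the same result in O(|I| log |I|).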
The Dice similarity between terms i and j is calculated as follows:

dice_qs(i, j) = 0 if co(i, j) < ϑ; otherwise 2 · co(i, j) / ( df_top(i) + df_top(j) )  (5)

where df_top(i) is the number of tweets that contain term i ∈ I_top, df_top(j) is the number of tweets that contain term j ∈ I_top, and co(i, j) is the number of tweets in which terms i and j co-occur. ϑ is a threshold used to discard quotients of similarity that fall below it. Given the scale of and noise in social media data, it is possible that a pair of terms co-occurs purely by chance. In order to reduce the adverse effects of these co-occurrences, the quotient of similarity (QS) of two terms is set to zero if their co-occurrence value is less than ϑ. In our experiments, we set ϑ to 5, though this can be adjusted as needed. For elaboration, we also show how the variant of the Jaccard similarity measure is computed (as exhibited in [32]):

jacc_qs(i, j) = 0 if co(i, j) < ϑ; otherwise co(i, j) / ( df_top(i) + df_top(j) − co(i, j) )  (6)

Finally, the variant of the cosine similarity measure described by Chen et al. [31] is defined by the following equation:

cosine_qs(i, j) = 0 if co(i, j) < ϑ; otherwise co(i, j) / sqrt( df_top(i) · df_top(j) )  (7)

Fig. 2. PDF of a typical set of QS values in Q_top.

All of the previously described similarity measures produce a value between 0 and 1. Additionally, all QS values less than 0.01 are disregarded in order to reduce the effects of co-occurrences that are considered insignificant. The vertices of the graph are now defined as the key terms that belong to set I_top, and the edges that connect them are defined by the co-occurrence of the terms in the tweet dataset. Using the terms' occurrence and co-occurrence values in the tweets, the relationship between vertices is further normalized by using a coefficient of similarity to represent an edge. We henceforth refer to the QS values of all term-pair combinations in I_top as set Q_top.
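The three measures transcribe directly into code (with ϑ written as `theta`); this is a sketch, with thresholding applied as described in the text (zero when the co-occurrence count falls below ϑ).

```python
import math

def dice_qs(co, df_i, df_j, theta=5):
    # Eq. (5): Dice similarity over tweet counts, zeroed below the threshold.
    return 0.0 if co < theta else 2 * co / (df_i + df_j)

def jaccard_qs(co, df_i, df_j, theta=5):
    # Eq. (6): Jaccard variant; the union size is df_i + df_j - co.
    return 0.0 if co < theta else co / (df_i + df_j - co)

def cosine_qs(co, df_i, df_j, theta=5):
    # Eq. (7): cosine variant over the geometric mean of the tweet counts.
    return 0.0 if co < theta else co / math.sqrt(df_i * df_j)
```

For example, two terms each appearing in 20 tweets and co-occurring in 10 get a Dice QS of 0.5, while a pair co-occurring only 3 times is zeroed out by the default ϑ = 5.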
4) Outlier Detection: Even though many potentially unimportant terms have been excluded thus far, there are still too many terms (vertices) and co-occurrences (edges) in the graph. We wish to capture only the most significant term co-occurrences, that is, those with sufficiently high QS values. To identify significant edges in the graph, irregular co-occurrence values (outliers) must be differentiated from regular ones. Fig. 2 illustrates the probability density function (PDF) of a typical set of QS values in Q_top. This particular distribution has 8444 values. It can be observed that most QS values lie near or below the mean, with those that are several standard deviations away from the mean being the most interesting ones. These values are considered outliers (i.e., they fall outside of the overall pattern of the rest of the data). We have tested several outlier detection methods and found that using the interquartile range (IQR) works well. The IQR of a given set of values is the difference between the third (Q_3) and first (Q_1) quartiles, computed as

IQR = Q_3 − Q_1    (8)

In other words, the IQR indicates how dispersed the middle 50% of the data are. As Hubert and Van der Veeken [44] described, the IQR can be used to aid in the detection of outliers by multiplying it by a coefficient of 1.5 and adding it to the third quartile or subtracting it from the first quartile.
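A minimal sketch of this upper-fence detection follows; the skew coefficient c and the choice c = 4 come from the discussion of (9) below, while the interpolated-quartile convention is our own assumption, as the paper does not specify one:

```python
def iqr_outliers(values, c=4.0):
    """Upper-tail IQR outliers in the spirit of Eq. (9). The coefficient c
    widens the usual 1.5*IQR fence to compensate for right skew; only the
    fence above Q3 is applied, since only outliers above the mean matter."""
    xs = sorted(values)

    def quartile(q):
        # Linear interpolation between order statistics (one common convention)
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    fence = q3 + 1.5 * c * (q3 - q1)
    return [x for x in values if x > fence]
```

A tightly clustered sample with one extreme value keeps only that value; an evenly spread sample yields no outliers at all.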

In data that is normally distributed, the IQR would simply be multiplied by 1.5 in the equation. However, as can be seen in Fig. 2, our data are right-skewed, and thus an additional factor c must be included in the calculation to compensate for it. Even though Hubert and Van der Veeken [44] provided a complex method to compute this coefficient, for our needs we simply test several constant values and choose the one that yields the best results. The set of outliers O is defined as

O = { x ∈ Q_top : x < Q_1 − 1.5c·IQR or x > Q_3 + 1.5c·IQR }    (9)

where c is the coefficient by which 1.5 is multiplied. In our experiments, setting c = 4 provided the best results. We are only interested in outliers above the mean, so only the second condition of the previous formula is applied. In the particular QS distribution of Q_top mentioned before, after applying the aforementioned outlier-detection method, only 301 values remain (of the original 8444). These 301 values represent the most significant term-pair combinations in Q_top.

We henceforth refer to these pair combinations as Q_sig; their individual terms are similarly referred to as I_sig. The vertices in graph G are now the terms from I_sig; its edges are the term pairs from Q_sig, with their QS values being the edge weights. Fig. 3 shows an example of two clearly identifiable clusters in G that represent unique topics. In some cases, as shown in Fig. 3, two or more TCs (subgraphs) appear to be joined because they share similar terms. In the figure, a topic related to the Amanda Knox murder trial is represented by the terms amanda, knox, murder, italy, conviction, extradition, meredith, and sollecito. Another topic, related to U.S. Governor Chris Christie's scandal, is represented by the terms chris, christie, govenor, lawyer, evidence, wildstien, lane, and authority. Although the term evidence is connected to the Chris Christie scandal, it is also connected to the term knox (and is also related to that topic). The term evidence, however, is connected to more vertices related to the Chris Christie scandal, and may be perceived as more important to this topic.

Fig. 3. Ambiguous TCs in graph G.

As seen in the previous example, some terms can belong to two or more topics, which leaves us with two options: 1) allowing terms to belong to multiple topics (overlapping clusters) or 2) allowing terms to belong to a single topic (nonoverlapping clusters). Since our goal is to clearly identify unambiguous TCs, we employ the second option. Furthermore, we hypothesize that for a specific moment in time, and given that the terms in I_sig are highly informational and specific, they will be the most representative of a single topic. We must therefore find a way to eliminate the edges that cause some topics to be ambiguous and increase the cluster quality of G.

C. Graph Clustering

Once graph G has been constructed and its most significant terms (vertices) and term-pair co-occurrence values (edges) have been selected, the next goal is to identify and separate well-defined TCs (subgraphs) in the graph. Before explaining the graph clustering algorithm, the concepts of betweenness and transitivity must first be understood.

1) Betweenness: Matsuo et al. [38] proposed an efficient approach to achieve the clustering of co-occurrence graphs. They use a graph clustering algorithm called Newman clustering [39] to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness value of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between the different clusters must go along these edges.
Consequently, the edges connecting the clusters will have high edge betweenness. Removing these edges iteratively should thus yield well-defined clusters. As outlined by Brandes [45], the betweenness measure of an edge e is calculated as follows:

betweenness(e) = Σ_{i,j ∈ V} σ(i, j | e) / σ(i, j)    (10)

where V is the set of vertices, σ(i, j) is the number of shortest paths between vertex i and vertex j, and σ(i, j | e) is the number of those paths that pass through edge e. By convention, 0/0 = 0.

2) Transitivity: Iwasaka and Tanaka-Ishii [37] developed a method that accomplishes the clustering of a co-occurrence graph based on a concept known as transitivity. Transitivity is a property of a relation among three elements such that if the relation holds between the first and second elements, and between the second and third elements, then it also holds between the first and third elements. The authors indicate that each output cluster is expected to have no ambiguity, and that this is only achieved when the graph edges representing co-occurrence relations are transitive. Thus, a graph with higher global transitivity is considered to have better cluster quality than one with lower global transitivity. The transitivity of a graph G is defined as

transitivity(G) = #triangles / #triads    (11)

where #triangles is the number of complete triangles (i.e., complete size-three subgraphs) in G and #triads is the number of triads (i.e., edge pairs connected to a shared vertex).

Algorithm 1 Improve the Cluster Quality of a Graph
1: Input: Graph G
2: Output: Cluster-quality-improved G
3: B = {} (empty set)
4: repeat
5:   for all (edge e in G) do
6:     Calculate betweenness(e) and append to B
7:   end for
8:   if first iteration of loop then
9:     b_avg = avg(B)
10:  end if
11:  b_max = max(B)
12:  trans_0 = transitivity(G) (previous transitivity)
13:  Remove edge with b_max from G
14:  trans_1 = transitivity(G) (posterior transitivity)
15:  Clear set B
16: until (trans_1 < trans_0 or b_max < b_avg)
17: Add edge with b_max back to G

3) Graph Clustering Algorithm: We apply the concepts of betweenness and transitivity in our graph clustering algorithm, which disambiguates potential topics. The process is outlined in Algorithm 1. First, the betweenness values of all edges in graph G are calculated in lines 5-7. Then, the initial average betweenness of graph G is calculated in line 9; we wish for all edges to approach this betweenness. To achieve this, edges with high betweenness values are iteratively removed in order to separate clusters in the graph (line 13). It is worth pointing out that set B, which keeps track of all betweenness values in the graph, is emptied at the end of each iteration. The edge-removing process is stopped when removing additional edges yields no benefit to the clustering quality of the graph. This occurs: 1) when the transitivity of G after removing an edge is lower than it was before or 2) when the maximum edge betweenness is less than the average, the latter indicating that the maximum betweenness is considered normal when compared to the average. Once the process has been stopped, the last-removed edge is added back to G.

Fig. 4 shows two clearly distinct and unambiguous TCs after the graph clustering algorithm has been run. It can be observed that the edge that connected the terms knox and evidence (in addition to other edges) has been removed. Although the accuracy of the algorithm is not 100% (i.e., a few topics may remain ambiguous), its performance is satisfactory for our needs.

Fig. 4. Unambiguous TCs in graph G after running the cluster-quality-improvement algorithm.

4) Time Complexity Analysis of Graph Clustering: In this section, we provide a brief time complexity analysis of the graph clustering algorithm described above. Our initial assumption is that, after performing the outlier detection step (Section III-B4) to obtain a reduced set of edges, graph G may be considered a sparse graph. In other words, the number of edges in G is approximately the same as the number of vertices, or |E| ≈ |V|. First, we analyze the time complexity for calculating the betweenness of all edges in G (10). Given that G is sparse, Johnson's algorithm [46] may be used for calculating the shortest paths of all vertex pairs. This algorithm runs in O(|V|² log |V| + |V||E|) time, performing faster than the Floyd-Warshall algorithm [47], which solves the problem in O(|V|³). The shortest paths of all vertex pairs are first computed and stored in a data structure in order to avoid calculating them for every edge (lines 5-7). Next, the average and maximum values of set B (lines 9 and 11, respectively) are calculated in O(|E|) time, since this set refers to the edges in G. The transitivity of G (11) is then calculated in O(|V|²) time, as this is the worst-case scenario for finding all triangles and triads in a graph. Finally, the large loop between lines 4 and 16 will iterate q times, where q is a number much smaller than the number of edges in G, or q ≪ |E|. This assumption comes from the fact that the number of edges in G is reduced by 1 after each iteration.
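The clustering loop of Algorithm 1 can be sketched in pure Python. This is an illustrative reimplementation, not the authors' code: the adjacency-set graph representation and the Brandes-style betweenness routine are our assumptions, and edge weights are ignored since shortest paths are counted by hop count:

```python
from collections import deque
from itertools import combinations

def edge_betweenness(adj):
    """Eq. (10) via Brandes-style accumulation on an unweighted graph.
    `adj` maps each vertex to the set of its neighbours; each unordered
    endpoint pair is counted from both directions, matching the sum
    over i, j in V."""
    bw = {tuple(sorted((v, w))): 0.0 for v in adj for w in adj[v]}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                                 # BFS from source s
            v = queue.popleft(); order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                    # back-propagate dependencies
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bw[tuple(sorted((v, w)))] += c
                delta[v] += c
    return bw

def transitivity(adj):
    """Eq. (11): #triangles / #triads (each triangle is a closed triad
    at three different centre vertices, hence the division by 3)."""
    triads = closed = 0
    for v in adj:
        for a, b in combinations(sorted(adj[v]), 2):
            triads += 1
            closed += b in adj[a]
    return (closed / 3) / triads if triads else 0.0

def improve_cluster_quality(adj):
    """Algorithm 1: repeatedly remove the max-betweenness edge; stop (and
    restore the last-removed edge) once transitivity drops or the maximum
    betweenness falls below the initial average."""
    b_avg = None
    while any(adj.values()):
        bw = edge_betweenness(adj)
        if b_avg is None:
            b_avg = sum(bw.values()) / len(bw)       # line 9: first-iteration average
        (u, v), b_max = max(bw.items(), key=lambda kv: kv[1])
        trans_0 = transitivity(adj)
        adj[u].discard(v); adj[v].discard(u)         # line 13: remove edge with b_max
        if transitivity(adj) < trans_0 or b_max < b_avg:
            adj[u].add(v); adj[v].add(u)             # line 17: add it back and stop
            break
    return adj
```

On a toy graph of two triangles joined by a bridge (say a-b-c and d-e-f with bridge c-d), the bridge carries the highest betweenness and is the only edge permanently removed, leaving two unambiguous clusters.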
Overall, we have the following time complexity for Algorithm 1:

O(q(|V|² log |V| + |V||E| + 2|E| + 2|V|²))    (12)

which can be reduced by selecting the largest factors, obtaining a final complexity of O(q(|V|² log |V| + |V||E|)).

D. Content Selection and Ranking

Now that the prevalent news TCs that fall within dates d_1 and d_2 have been identified, relevant content from the two media sources that is related to these topics must be selected and finally ranked. Related items from the news media will represent the MF of the topic. Similarly, related items from social media (Twitter) will represent the UA, more specifically, through the number of unique Twitter users related to the selected tweets. Selecting the appropriate items (i.e., tweets and news articles) related to the key terms of a topic is not an easy task, as many other items unrelated to the desired topic also contain similar key terms.

1) Node Weighting: The first step before selecting appropriate content is to weight the nodes of each topic accordingly. These weights can be used to estimate which terms are more important to the topic and provide insight into which of them must be present in an item for it to be considered topic-relevant. The weight of each node i in TC is calculated using (13), which utilizes the number of edges connected to a given node and their corresponding weights:

∀i ∈ TC: nw(i) = (|E_i| / |E_TC|)^λ · (Σ_{e∈E_i} w_e / Σ_{f∈E_TC} w_f)^(1−λ)    (13)

where E_i is the set of edges connected to node i, E_TC is the set of edges in TC, w is the weight of an edge e or f, and λ is a number between 0 and 1. In our experiments, λ is set to 0.5, as we wish to allot equal importance to the number of edges and their weights. The value of nw falls between 0 and 1, which provides a reasonable scale on which to compare node weights. We have compared the performance of our node-weighting method with that of PageRank [16], and our method provided equal or better results. For elaboration, we provide a modified version of the PageRank formula below, using our notation and undirected edges:

∀i ∈ TC: nw_pr(i) = (1 − d) + d · Σ_{j∈V_i} (w_ij / Σ_{k∈V_j} w_jk) · nw_pr(j)    (14)

where V_i and V_j are the sets of nodes that are connected to nodes i and j, respectively, and w_ij represents the weight of the edge between nodes i and j. The constant d represents the damping factor, which is usually set around 0.85. Now that the nodes of each TC are weighted according to their estimated importance to that cluster, the weights are used to select items from the two data sources. Given that the way in which items are selected from tweets and news is different, we describe separately how node weights are used to select those items in the following two sections.
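A sketch of (13), assuming edges are stored as a map from vertex pairs to their QS weights (the cluster below is invented for illustration):

```python
def node_weight(node, edges, lam=0.5):
    """Eq. (13): nw(i) = (|E_i| / |E_TC|)^lam * (sum of E_i weights /
    sum of all TC edge weights)^(1 - lam), with lam balancing the edge
    count against the edge weights."""
    incident = {e: w for e, w in edges.items() if node in e}
    count_part = len(incident) / len(edges)
    weight_part = sum(incident.values()) / sum(edges.values())
    return count_part ** lam * weight_part ** (1.0 - lam)

# Hypothetical three-term cluster with QS edge weights
tc_edges = {frozenset({"chris", "christie"}): 0.4,
            frozenset({"christie", "lane"}): 0.4,
            frozenset({"chris", "lane"}): 0.2}
```

With λ = 0.5 the result is the geometric mean of the two ratios, so a node touching 2 of 3 edges that carry 60% of the total weight scores √(2/3 × 0.6) ≈ 0.63.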
2) User Attention Estimation: To calculate the UA measure of a TC, the tweets related to that topic are first selected and then the number of unique users who created those tweets is counted. To ensure that the tweets are genuinely related to TC, the weight of each node in TC is utilized. A threshold τ is first selected, which specifies the summed node weights that are acceptable for tweet inclusion. For instance, say there is a TC with nodes i_1, i_2, i_3, and i_4, with respective weights 0.1, 0.2, 0.5, and 0.6. If τ is set to 0.4, then tweets containing only terms i_1 and i_2, or a combination of the two, would not be considered related to TC. This is because neither i_1's weight (0.1) nor i_2's weight (0.2) alone, nor the sum of their weights, is greater than or equal to τ. In the given case, only tweets containing terms i_3 and i_4, or a combination of the two, would be considered. Although it is acceptable to set τ to a fixed value, which presented good results in our experiments, we set τ dynamically. This makes our method more robust by not limiting τ to a particular data set. The value of τ is based on where the largest increase in the summed weights occurs when making node combinations. Node combinations are made only between nodes that share an edge. Thus, if there is an edge [i_1, i_2] and an edge [i_2, i_3], but edge [i_1, i_3] does not exist, then [i_1, i_3] is not considered a valid combination. Only [i_1, i_2], [i_2, i_3], or [i_1, i_2, i_3] would be considered valid combinations. With this in mind, we find the distribution of the summed weights of all valid combinations. An example of this distribution can be seen in Fig. 5.

Fig. 5. Boxplot of the summed node weights of valid combinations of a TC discovered on a particular date.

As can be seen in the box plot, most of the values fall below the median and even below the first quartile.
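The worked example above (weights 0.1, 0.2, 0.5, 0.6 and τ = 0.4) can be checked mechanically. This sketch is a simplification of the rule just described: it sums the weights of all TC terms found in a tweet, skipping the requirement that shared terms form an edge-connected valid combination:

```python
def tweet_matches_topic(tweet_terms, node_weights, tau):
    """A tweet counts toward a TC if the terms it shares with the cluster
    have summed node weight >= tau. Simplification: the paper additionally
    requires the shared terms to form a valid (edge-connected) combination,
    which is omitted here for brevity."""
    shared = set(tweet_terms) & set(node_weights)
    return sum(node_weights[t] for t in shared) >= tau

# Worked example from the text: four nodes and a fixed tau of 0.4
w = {"i1": 0.1, "i2": 0.2, "i3": 0.5, "i4": 0.6}
```

A tweet containing only i_1 and i_2 sums to 0.3 and is rejected, while any tweet containing i_3 or i_4 passes.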
This indicates that a few larger values influence the mean. Hence, we infer that the smaller values are less important and that the threshold τ should lie somewhere within the first quartile. We further deduce that the consecutive values in the first quartile with the largest difference between them provide a fair estimate of where the threshold should lie. τ is thus set to the lesser of the two values with the largest difference. All valid combinations are found for a given TC, and tweets that contain terms from combinations whose summed weights are greater than τ are selected from the data sets. The number of unique users who created these tweets is then counted, which directly represents the UA of TC. Because our main objective is ranking, the number of unique users related to TC must be compared to that of all other discovered TCs. This is achieved with the following equation:

∀TC ∈ G: UA(TC) = U_TC / Σ_{TC′∈G} U_TC′    (15)

where U_TC is the number of unique users related to TC and G is the entire graph. In summary, the UA of a TC in graph G is equal to the total number of unique users related to that TC divided by the sum of total unique users related to all TCs in G. This equation produces a value between 0 and 1.

3) Media Focus Estimation: To estimate the MF of a TC, the news articles that are related to TC are first selected. This presents a problem similar to the selection of tweets when calculating UA. The weighted nodes of TC are used to accurately select the articles that are genuinely related to its inherent topic. The only difference now is that instead of comparing node combinations with tweet content, they are compared to the top k keywords selected from each article. Comparing the term combinations with the actual content of the article is

unnecessary, as the reason for selecting the top k keywords in the first place was to gain insight into the most important terms of each article. Furthermore, the terms that make up the TCs are drawn from this set of keywords. In summary, we must now select articles whose top k keywords contain valid node combinations, where the sum of the node weights must be greater than or equal to τ. After selecting these articles, the MF value of each TC is calculated as

∀TC ∈ G: MF(TC) = A_TC / Σ_{TC′∈G} A_TC′    (16)

where A_TC is the number of news articles related to TC and G is the entire graph. The MF of a TC in graph G is therefore equal to the total number of news articles related to that TC (A_TC) divided by the sum of the total number of articles related to all TCs in G. This equation also produces a value between 0 and 1.

4) User Interaction Estimation: The next step is to measure the degree of interaction between the users responsible for the social media content creation related to a specific topic. The database is queried for followed relationships for each of these users, and a social network graph is constructed from these relationships. In Twitter, there are two types of relationships: 1) follow and 2) followed. In the first type, user u_1 can follow user u_2, in which case the direction of the relationship flows from u_1 to u_2. In the second case, user u_1 can be followed by user u_2; the direction of the relationship now flows from u_2 to u_1. In this paper, we are solely interested in the followed relationships of a particular user (i.e., the second case), as this represents the degree of interest other users have in the content created by the user at hand. Based on the followed relationships that are queried, a nondirected graph S_TC is constructed, whose set of nodes U_TC represents the users and whose set of edges R_TC represents the users' followed relationships.
The complex relationships of the users in S_TC are used to measure the level of interaction that potentially exists between them. Before calculating the level of UI, the concept of reciprocity is first presented.

a) Reciprocity: Kwan et al. [17] proposed a measure called reciprocity, which attempts to detect the interaction between social-media users and perceive user engagement in relation to a particular topic. Reciprocity is defined as the ratio of all edges in a social graph to the possible number of edges it can have. Reciprocity is a useful indicator of the degree of mutuality and reciprocal exchange in a network, which in turn relates to social cohesion. Higher reciprocity means greater interaction between users, and our intuition is that topics with higher reciprocity are more relevant, as this signifies a community structure. In other words, it is easier to identify the power and influence of a well-structured community than that of a decentralized and unstructured one. A variant of the reciprocity measure is employed in our method, and is defined as follows:

reciprocity(S_TC) = 2|R_TC| / (|U_TC| − 1)    (17)

where |U_TC| is the total number of nodes in the social graph, |R_TC| its total number of edges, and reciprocity(S_TC) is the reciprocity of social graph S_TC representing TC. Equation (17) is the simplified version of the formula: the original divides the total number of edges by all possible edges and then multiplies this figure by the total number of nodes for scaling purposes.

b) User interaction: The value of reciprocity(S_TC) is directly interpreted as the UI of the topic, which is calculated using the social-user relationships of the topic. Following our previous convention, the value of UI is normalized by using

∀TC ∈ G: UI(TC) = reciprocity(S_TC) / Σ_{TC′∈G} reciprocity(S_TC′)    (18)

5) Overall Ranking: Now that the MF, UA, and UI measures have been defined, their values are combined into a single metric r.
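Taken together, the three factors reduce to a few lines of code. This sketch assumes the raw counts are already gathered per TC: reciprocity per (17), the shared normalization form of (15), (16), and (18), and the geometric weighting of (19), whose definition follows:

```python
def reciprocity(num_users, num_edges):
    # Eq. (17): 2|R_TC| / (|U_TC| - 1)
    return 2.0 * num_edges / (num_users - 1)

def normalize(scores):
    # Shared form of Eqs. (15), (16), and (18): each TC's raw score is
    # divided by the sum over all TCs, yielding values in [0, 1]
    total = sum(scores.values())
    return {tc: s / total for tc, s in scores.items()}

def rank(mf, ua, ui, alpha=1/3, beta=1/3, gamma=1/3):
    # Eq. (19): r(TC) = MF^alpha * UA^beta * UI^gamma, alpha+beta+gamma = 1
    return {tc: mf[tc] ** alpha * ua[tc] ** beta * ui[tc] ** gamma for tc in mf}
```

With equal exponents, a topic weak in any single factor is pulled down in the final ordering, which is the intended effect of the geometric combination.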
This metric represents the overall ranking of a topic. Equation (19) is employed to calculate the overall ranking of a TC:

∀TC ∈ G: r(TC) = MF(TC)^α · UA(TC)^β · UI(TC)^γ    (19)

where α + β + γ = 1. In our experiments, all three exponents are set to the same value (1/3), though each factor in the equation can be weighted differently depending on the application at hand. The TCs are then sorted in descending order by their r value.

IV. EXPERIMENTS AND RESULTS

The testing dataset consists of tweets crawled from Twitter's public timeline and news articles crawled from popular news websites during the period between November 1, 2013 and February 28, 2014. The news websites crawled were cnn.com, bbc.com, cbsnews.com, reuters.com, abcnews.com, and usatoday.com. Over the specified period of time, a total of news articles and bilingual tweets were collected. After non-English tweets were discarded, tweets remained. The dataset was divided into two partitions.

1) Data from January and February 2014 were used as the testing dataset, on which experiments were performed for the overall method evaluation.
2) Data from November and December 2013 were used as the control dataset, where experiments were performed to establish adequate thresholds and select measures that presented the best results.

A. Method Evaluation

The evaluation of topic ranking is quite challenging, as the interpretation of the results is generally subjective. However, in an attempt to show that the ranked topics are indeed those that users would prefer, a method for ranking popular news topics must be established. To retrieve the most popular topics, Google's news aggregation service [48] was utilized. For each day from November 1, 2013 to February 28, 2014, the top 10 news stories displayed on this site were collected at the end of the day.

TABLE I SOME STATISTICS RELEVANT TO THE TESTING DATASET

Next, 20 master's and doctoral students were asked to view the titles of the top 10 news stories from each day and select the ones they considered relevant. Each participant was required to select a minimum of two articles per day and a maximum of all 10. The participants' results were then partitioned into 12 date ranges: 1) November 1, 2013–November 10, 2013; 2) November 11, 2013–November 20, 2013; 3) November 21, 2013–November 30, 2013; 4) December 1, 2013–December 10, 2013; 5) December 10, 2013–December 20, 2013; 6) December 20, 2013–December 31, 2013; 7) January 1, 2014–January 10, 2014; 8) January 11, 2014–January 20, 2014; 9) January 21, 2014–January 31, 2014; 10) February 1, 2014–February 10, 2014; 11) February 11, 2014–February 20, 2014; and 12) February 21, 2014–February 28, 2014. As explained earlier, the first six date ranges were utilized for the controlled experiments and the last six for the method evaluation. The stories for each range were manually grouped into topics, where each topic was given a score for its corresponding time range. The score for each topic is calculated by multiplying the number of times the participants voted for the stories in the topic by the total number of stories in it. For instance, if in the time range January 1, 2014–January 10, 2014 a topic contained five stories (which of course occurred on five different days within that time range), and the total votes for those stories was 15, then the score of that topic would be 75. The topics were finally ranked in descending order of their score, resulting in a ranked topic list for each time range. We will hereafter refer to this ranked list of topics as voted topics. For each of the aforementioned time ranges we set the start (d_1) and end (d_2) dates accordingly and executed SociRank.
Some statistics relevant to the news topics extracted for the testing dataset (i.e., from January and February 2014) are shown in Table I. The second column in the table indicates the total number of topics extracted per time period. The third, fourth, and fifth columns show the average number of tweets, news stories, and Twitter users associated with each topic, respectively.

1) Topic Selection Evaluation: First, we evaluate the topic-selection completeness of SociRank and the media-focus-only approach (hereafter referred to as MF).³ To represent the MF method, we simply manipulate (19) by setting β and γ to 0 and α to 1. To evaluate completeness, all of the topics (per time range) selected by SociRank and by the MF approach were compared with those in the voted topics.

³The results for UA and UI alone are not shown in our experiments, as these have no topic-specific significance without the MF component. In other words, without the element of MF, any identified topics would only reflect what is popular among users, without these necessarily being related to the media (news).

Fig. 6. Percentage of overlap between all voted topics and all topics selected by SociRank and MF.

Fig. 6 shows the percentage of topics selected by SociRank and by MF that overlap with the voted topics. It can be seen that SociRank clearly outperforms MF in terms of overlap with the voted topics (i.e., those topics that users selected as the most important). This indicates that SociRank is better at discovering prevalent news topics that users find interesting when compared to a method that only utilizes data from the news media.

2) Topic Ranking Evaluation: Next, we evaluate the ranking of topics using the SociRank and MF ranking formulas, selecting only the top k topics from each approach. Again, we compare the ranked topics for each time range with the voted topics.
We evaluated the top 10, 20, 30, and 40 topics for both methods and calculated the percentage of topics that overlapped with the top 10, 20, 30, and 40 voted topics, respectively. For instance, if we are comparing the top 10 topics using SociRank with the voted topics, and five topics overlap, then the overlap percentage would be 5/10. Fig. 7 shows the percentage of overlap (for each of the time ranges) between the top 10 voted topics and the top 10 topics using the SociRank and MF approaches. In the figure, the green line represents the MF overlap percentage plus 1 standard deviation. If the SociRank bar surpasses this line, it indicates that there is a large enough difference between the two overlap percentages for the improvement to be considered significant. However, as can be seen in the figure, this does not occur: there is no significant difference between the two methods in the top 10 ranked list. In the top 20, 30, and 40 ranked lists, however, the results favored SociRank, as can be seen in Figs. 8–10, respectively. In each of these lists, the overlap percentage of SociRank significantly surpasses that of the MF approach. Summarizing our findings, Fig. 11 represents the average overlap percentages of the top 10, 20, 30, and 40 ranked lists for both methods. The figure clearly illustrates that, with the exception of

the top 10 list, SociRank outperformed the MF method by a significant margin.

Fig. 7. Percentage of overlap between top 10 voted topics and top 10 topics selected by SociRank and MF.

Fig. 8. Percentage of overlap between top 20 voted topics and top 20 topics selected by SociRank and MF.

Fig. 9. Percentage of overlap between top 30 voted topics and top 30 topics selected by SociRank and MF.

Fig. 10. Percentage of overlap between top 40 voted topics and top 40 topics selected by SociRank and MF.

B. Controlled Experiments

In this section, we show experiments performed on some of the variables and components used in our method, namely, the co-occurrence measure, IQR coefficient, and node weighting. For the controlled experiments, the control dataset was divided into six partitions, with each partition representing ten days' worth of news and tweets. The time periods tested were those corresponding to November and December 2013.

1) Co-Occurrence Measure: In Section III-B3, it was explained that five similarity measures were tested: 1) Dice; 2) cosine; 3) Jaccard; 4) PMI; and 5) overlap. To determine which similarity measure performs best, all parameters in SociRank were controlled, with the exception of the similarity measure. In order to measure the effectiveness of each similarity measure, the number of correctly identified topics that each measure produced was counted. Fig. 12 shows the number of correctly identified topics for each similarity measure, grouped by time period. As can be seen in the figure, the Dice similarity measure produced the greatest number of correctly identified topics. The closest competitor to Dice was the cosine similarity measure, but even this measure fell noticeably short of producing similar numbers. The other measures performed poorly on our dataset, and in some instances were unable to identify any topics at all.
The Dice similarity measure was therefore selected as the one best suited to our needs. Depending on the dataset used, however, other similarity measures may potentially produce similar or better results than Dice.

2) IQR Coefficient: To estimate the most appropriate value for coefficient c in (9), several values were tested, starting from c = 1 and taking increments of 1. The previously listed

13 DAVIS et al.: SOCIRANK: IDENTIFYING AND RANKING PREVALENT NEWS TOPICS USING SOCIAL MEDIA FACTORS 991 Fig. 11. Average percentage of overlap between top k voted topics and top k topics selected by SociRank and MF. Fig. 13. Evaluation of different values for IQR coefficient c in the outlier detection formula (9). Fig. 12. Evaluation of different co-occurrence similarity measures used as edge weights in graph G (Section III-B3). time periods were utilized, and all other parameters besides c were controlled. For each value of c, the number of correctly identified topics was counted, and the value that yielded the highest number was considered most effective one. Fig. 13 shows the number of correctly identified topics for different values of c. It can be seen that c = 4 yielded the best results. When c was in the range between 1 and 4, the number of correctly identified topics increased with each increment of 1, but setting c = 5 decreased the count significantly. Thus, c = 4 was subsequently selected as the optimum value in our experiments. 3) Node Weighting: The node-weighting formula (13) presented in Section III-D1 takes into account both number of edges connected to a node and their weights. A more naive approach might be to utilize only the number of edges connected to a node as a representation of its weight. We compared both of these approaches to the PageRank weighting algorithm [16] to ascertain which approach provided the best results, as shown in Fig. 14. In the figure, weighting the nodes simply by the number of edges is referred to as edge, and weighting by both number of edges and their weights (13) is referred to as edge-weight. Recalling from Section III-D1, node weighting is used to correctly select related content from Twitter. We therefore use this same measure to evaluate the effectiveness of each nodeweighting approach. The bars in Fig. 
14 represent the number of correctly selected tweets by each node-weighting approach, and the line graph represents the precision of each approach. We calculate precision as

precision = R+ / (R+ + R-)    (20)

where R+ is the number of correct records retrieved and R- is the number of incorrect records retrieved. In our experiments, thirty random topics from the aforementioned time periods were selected (five from each period), and each node-weighting approach was used to retrieve tweets related to those topics. Only 10% of the tweets were used. As can be seen in Fig. 14, the precision with which correct tweets were selected was the same, or differed very little, in most cases, regardless of the approach. Although the differences in precision were negligible, the same cannot be said about the number of correct tweets selected: the edge-weight approach (13) yielded far more correctly selected tweets than the other two approaches. Since selecting a higher number of correctly related tweets at similar or greater precision is preferable, the nodes in the TCs are ultimately weighted using our approach.

C. Comparison of Ranked Topic Lists

Lastly, we provide an example of a ranked list of topics produced by the MF approach and one produced by SociRank.
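As a concrete illustration of the node-weighting comparison and the precision measure in (20) above, the following sketch scores the nodes of a toy keyword co-occurrence graph both by plain degree (the edge variant) and by a simple sum of incident edge weights. The weighted-sum scoring is used here only as a hypothetical stand-in for (13), whose exact form is given in Section III-D1; the graph, keyword names, and weights are illustrative, not taken from the dataset.

```python
from collections import defaultdict

# Toy keyword co-occurrence graph: (node_a, node_b, edge weight).
# Names and weights are illustrative only.
edges = [
    ("olympics", "sochi", 0.9),
    ("olympics", "medal", 0.7),
    ("sochi", "medal", 0.4),
    ("medal", "hockey", 0.2),
]

degree = defaultdict(int)      # "edge" variant: count of incident edges
strength = defaultdict(float)  # "edge-weight" stand-in: sum of incident weights
for a, b, w in edges:
    for node in (a, b):
        degree[node] += 1
        strength[node] += w

def precision(r_plus, r_minus):
    """Precision as defined in (20): R+ / (R+ + R-)."""
    return r_plus / (r_plus + r_minus)

print(max(degree, key=degree.get))      # "medal" has the most edges
print(max(strength, key=strength.get))  # "olympics" has the strongest ties
print(precision(8, 2))                  # 0.8
```

Note how the two variants can disagree: a node with several weak edges may outrank a node with a few strong edges under plain degree, which is precisely the distinction evaluated in Fig. 14.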

Fig. 14. Evaluation of different node-weighting approaches (Section III-D1).

TABLE II. Comparison of ranked topic lists produced by the MF and SociRank methods.

Table II shows the top 10 topics obtained by the MF and SociRank approaches for the February 1, 2014 to February 10, 2014 time period. In the table, cells with new topics are shaded in green and cells with topics that were removed from the MF list are shaded in red. As can be seen, using all three factors produced a list with three new topics that did not appear in the MF list. Additionally, many of the topics in the SociRank list moved up or down in position compared with the MF list. This example shows that SociRank produces a very different ranked list of news topics, which suggests that relying only on the high-frequency news topics provided by the media does not necessarily give insight into what users are interested in or consider important. Utilizing the UA and UI factors can also uncover popular topics that the mass media previously considered insignificant, such as the three topics shaded in green in Table II. This information can prompt media sources to give continuity or greater importance to certain news stories that were perhaps thought of as unpopular. Taken together, these results emphasize that MF alone is a poor estimator of what users find interesting or consider important, and should therefore not be used on its own. SociRank, on the other hand, proves more capable in this regard, and we conclude that the information it provides can be vital in commerce-oriented areas where user interest is paramount. V.
CONCLUSION

In this paper, we proposed an unsupervised method, SociRank, which identifies news topics prevalent in both social media and the news media, and then ranks them by taking into account their MF, UA, and UI as relevance factors. The temporal prevalence of a particular topic in the news media is considered its MF, which gives us insight into its mass media popularity. The temporal prevalence of the topic in social media, specifically Twitter, indicates user interest and is considered its UA. Finally, the interaction between the social media users who mention the topic indicates the strength of the community discussing it and is considered the UI. To the best of our knowledge, no other work has attempted to employ either the interests of social media users or their social relationships to aid in the ranking of topics. Consolidated, filtered, and ranked news topics from both professional news providers and individuals have several benefits. One of their main uses is increasing the quality and variety

of news recommender systems, as well as discovering hidden, popular topics. Our system can aid news providers by providing feedback on topics that have been discontinued by the mass media but are still being discussed by the general population. SociRank can also be extended and adapted to areas other than news, such as science, technology, sports, and other trends. We have performed extensive experiments to test the performance of SociRank, including controlled experiments for its different components. SociRank has been compared to media-focus-only ranking by utilizing results obtained from a manual voting method as the ground truth. In the voting method, 20 individuals were asked to rank topics from specified time periods based on their perceived importance. The evaluation provides evidence that our method is capable of effectively selecting prevalent news topics and ranking them based on the three previously mentioned measures of importance. Our results present a clear distinction between ranking topics by MF only and ranking them by also including UA and UI. This distinction underpins the importance of this paper and clearly demonstrates the shortcomings of relying solely on the mass media for topic ranking. As future work, we intend to evaluate and extend SociRank on different areas and datasets. Furthermore, we plan to include other forms of UA, such as search engine click-through rates, which can be integrated into our method to provide even more insight into the true interests of users. Additional experiments will also be performed on different stages of the methodology; for example, a fuzzy clustering approach could be employed to obtain overlapping TCs (Section III-C). Lastly, we intend to develop a personalized version of SociRank, in which topics are presented differently to each individual user.

REFERENCES

[1] D. M.
Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3.
[2] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. 15th Conf. Uncertainty Artif. Intell., 1999.
[3] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, Berkeley, CA, USA, 1999.
[4] C. Wartena and R. Brussee, "Topic detection by clustering keywords," in Proc. 19th Int. Workshop Database Expert Syst. Appl. (DEXA), Turin, Italy, 2008.
[5] F. Archetti, P. Campanelli, E. Fersini, and E. Messina, "A hierarchical document clustering environment based on the induced bisecting k-means," in Proc. 7th Int. Conf. Flexible Query Answering Syst., Milan, Italy, 2006.
[6] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press.
[7] M. Cataldi, L. Di Caro, and C. Schifanella, "Emerging topic detection on Twitter based on temporal and social terms evaluation," in Proc. 10th Int. Workshop Multimedia Data Min. (MDMKDD), Washington, DC, USA, 2010, Art. no. 4.
[8] W. X. Zhao et al., "Comparing Twitter and traditional media using topic models," in Advances in Information Retrieval. Heidelberg, Germany: Springer, 2011.
[9] Q. Diao, J. Jiang, F. Zhu, and E.-P. Lim, "Finding bursty topics from microblogs," in Proc. 50th Annu. Meeting Assoc. Comput. Linguist. (Long Papers).
[10] H. Yin, B. Cui, H. Lu, Y. Huang, and J. Yao, "A unified model for stable and temporal topic detection from social media data," in Proc. IEEE 29th Int. Conf. Data Eng. (ICDE), Brisbane, QLD, Australia, 2013.
[11] C. Wang, M. Zhang, L. Ru, and S. Ma, "Automatic online news topic ranking using media focus and user attention based on aging theory," in Proc. 17th Conf. Inf. Knowl. Manag., Napa County, CA, USA, 2008.
[12] C. C. Chen, Y.-T. Chen, Y. Sun, and M. C. Chen, "Life cycle modeling of news events using aging theory," in Machine Learning: ECML. Heidelberg, Germany: Springer, 2003.
[13] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling, "TwitterStand: News in tweets," in Proc. 17th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., Seattle, WA, USA, 2009.
[14] O. Phelan, K. McCarthy, and B. Smyth, "Using Twitter to recommend real-time topical news," in Proc. 3rd Conf. Recommender Syst., New York, NY, USA, 2009.
[15] K. Shubhankar, A. P. Singh, and V. Pudi, "An efficient algorithm for topic ranking and modeling topic evolution," in Database Expert Syst. Appl., Toulouse, France, 2011.
[16] S. Brin and L. Page, "Reprint of: The anatomy of a large-scale hypertextual web search engine," Comput. Netw., vol. 56, no. 18.
[17] E. Kwan, P.-L. Hsu, J.-H. Liang, and Y.-S. Chen, "Event identification for social streams using keyword-based evolving graph sequences," in Proc. IEEE/ACM Int. Conf. Adv. Soc. Netw. Anal. Min., Niagara Falls, ON, Canada, 2013.
[18] K. Kireyev, "Semantic-based estimation of term informativeness," in Proc. Human Lang. Technol. Annu. Conf. North Amer. Chapter Assoc. Comput. Linguist., 2009.
[19] G. Salton, C.-S. Yang, and C. T. Yu, "A theory of term importance in automatic text analysis," J. Amer. Soc. Inf. Sci., vol. 26, no. 1.
[20] H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information," IBM J. Res. Develop., vol. 1, no. 4.
[21] J. D. Cohen, "Highlights: Language- and domain-independent automatic indexing terms for abstracting," J. Amer. Soc. Inf. Sci., vol. 46, no. 3.
[22] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," Int. J. Artif. Intell. Tools, vol. 13, no. 1.
[23] R. Mihalcea and P. Tarau, "TextRank: Bringing order into texts," in Proc. EMNLP, Barcelona, Spain.
[24] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning, "KEA: Practical automatic keyphrase extraction," in Proc. 4th ACM Conf. Digit. Libr., Berkeley, CA, USA, 1999.
[25] P. D. Turney, "Learning algorithms for keyphrase extraction," Inf. Retrieval, vol. 2, no. 4.
[26] J. Wang, H. Peng, and J.-S. Hu, "Automatic keyphrases extraction from document using neural network," in Advances in Machine Learning and Cybernetics. Heidelberg, Germany: Springer, 2006.
[27] T. Jo, M. Lee, and T. M. Gatton, "Keyword extraction from documents using a neural network model," in Proc. Int. Conf. Hybrid Inf. Technol. (ICHIT).
[28] K. Sarkar, M. Nasipuri, and S. Ghose, "A new approach to keyphrase extraction using neural networks," Int. J. Comput. Sci. Issues, vol. 7, no. 3.
[29] C. Zhang, "Automatic keyword extraction from documents using conditional random fields," J. Comput. Inf. Syst., vol. 4, no. 3.
[30] G. Figueroa and Y.-S. Chen, "Collaborative ranking between supervised and unsupervised approaches for keyphrase extraction," in Proc. Conf. Comput. Linguist. Speech Process. (ROCLING), 2014.
[31] H.-H. Chen, M.-S. Lin, and Y.-C. Wei, "Novel association measures using Web search with double checking," in Proc. 21st Int. Conf. Comput. Linguist. and 44th Annu. Meeting Assoc. Comput. Linguist., 2006.
[32] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using Web search engines," in Proc. WWW, Banff, AB, Canada, 2007.
[33] D. Szymkiewicz, "Etude comparative de la distribution florale," Rev. Forest., vol. 1, 1926.

[34] L. R. Dice, "Measures of the amount of ecologic association between species," Ecology, vol. 26, no. 3.
[35] K. W. Church and P. Hanks, "Word association norms, mutual information, and lexicography," Comput. Linguist., vol. 16, no. 1.
[36] P. Jaccard, Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Lausanne, Switzerland: Impr. Corbaz.
[37] H. Iwasaka and K. Tanaka-Ishii, "Clustering co-occurrence graph based on transitivity," presented at the 5th Workshop Very Large Corpora, 1997.
[38] Y. Matsuo, T. Sakaki, K. Uchiyama, and M. Ishizuka, "Graph-based word clustering using a Web search engine," in Proc. Conf. Empir. Methods Nat. Lang. Process., 2006.
[39] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proc. Nat. Acad. Sci., vol. 99, no. 12.
[40] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, no. 6, 2004.
[41] Twitter. [Online].
[42] K. Gimpel et al., "Part-of-speech tagging for Twitter: Annotation, features, and experiments," in Proc. 49th Annu. Meeting Assoc. Comput. Linguist. Human Lang. Technol. (Short Papers), Portland, OR, USA, 2011.
[43] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inf. Theory, vol. IT-13, no. 2.
[44] M. Hubert and S. Van der Veeken, "Outlier detection for skewed data," J. Chemometr., vol. 22, nos. 3-4.
[45] U. Brandes, "On variants of shortest-path betweenness centrality and their generic computation," Soc. Netw., vol. 30, no. 2.
[46] D. B. Johnson, "Efficient algorithms for shortest paths in sparse networks," J. ACM, vol. 24, no. 1, pp. 1-13.
[47] R. W. Floyd, "Algorithm 97: Shortest path," Commun. ACM, vol. 5, no. 6, p. 345.
[48] Google News. [Online].

Derek Davis was born in Belize City, Belize. He received the B.Sc. degree in information technologies from the University of Belize, Belmopan, Belize, and the M.Sc. degree in information systems and applications from National Tsing Hua University, Hsinchu, Taiwan. He was a Systems Analyst/Database Administrator for a Belizean crude oil drilling company and is currently a Systems Analyst in the banking and finance industry. His research interests include keyword extraction, summarization, and natural language processing.

Gerardo Figueroa was born in Tegucigalpa, Honduras. He received the B.Sc. degree in computational systems engineering from Universidad Tecnologica Centroamericana, Tegucigalpa, and the M.Sc. degree in information systems and applications from National Tsing Hua University, Hsinchu, Taiwan, where he is currently pursuing the Ph.D. degree in information systems and applications. From 2011 to 2015, he was a Teaching Assistant with the Institute of Information Systems and Applications and the Department of Computer Science, National Tsing Hua University. His current research interests include keyword extraction and summarization, natural language processing, and machine learning.

Yi-Shin Chen received the B.A. and M.B.A. degrees in information management from National Central University, Taoyuan, Taiwan, in 1996 and 1997, respectively, and the Ph.D. degree in computer science from the University of Southern California, Los Angeles, CA, USA. She joined the Department of Computer Science, National Tsing Hua University, as an Assistant Professor. She is passionate about increasing the benefits to society through her research efforts. Her goal is to create a social media interface that explores and visualizes Web data easily. Her current research interests include Web intelligence and integration.


A Destination Prediction Method Based on Behavioral Pattern Analysis of Nonperiodic Position Logs A Destination Prediction Method Based on Behavioral Pattern Analysis of Nonperiodic Position Logs Fumitaka Nakahara * and Takahiro Murakami ** * Service Platforms Research Laboratories, NEC Corporation

More information

Online Student Guide Types of Control Charts

Online Student Guide Types of Control Charts Online Student Guide Types of Control Charts OpusWorks 2016, All Rights Reserved 1 Table of Contents LEARNING OBJECTIVES... 4 INTRODUCTION... 4 DETECTION VS. PREVENTION... 5 CONTROL CHART UTILIZATION...

More information

Limits of Software Reuse

Limits of Software Reuse Technical Note Issued: 07/2006 Limits of Software Reuse L. Holenderski Philips Research Eindhoven c Koninklijke Philips Electronics N.V. 2006 Authors address: L. Holenderski WDC3-044; leszek.holenderski@philips.com

More information

Cold-start Solution to Location-based Entity Shop. Recommender Systems Using Online Sales Records

Cold-start Solution to Location-based Entity Shop. Recommender Systems Using Online Sales Records Cold-start Solution to Location-based Entity Shop Recommender Systems Using Online Sales Records Yichen Yao 1, Zhongjie Li 2 1 Department of Engineering Mechanics, Tsinghua University, Beijing, China yaoyichen@aliyun.com

More information

How to Create a Dataset from Social Media: Theory and Demonstration

How to Create a Dataset from Social Media: Theory and Demonstration How to Create a Dataset from Social Media: Theory and Demonstration Richard N. Landers Old Dominion University @rnlanders rnlanders@odu.edu CARMA October 2017 Agenda/Learning Objectives 1. Foundational

More information

Determining the Factors that Drive Twitter Engagement-Rates

Determining the Factors that Drive Twitter Engagement-Rates Archives of Business Research Vol.5, No.2 Publication Date: February. 25, 2017 DOI: 10.14738/abr.52.2700 Semiz, G. & Berger, P. D. (2017). Determining the Factors that Drive Twitter Engagement-Rates. Archives

More information

AI that creates professional opportunities at scale

AI that creates professional opportunities at scale AI that creates professional opportunities at scale Deepak Agarwal AI in practice, AAAI, San Francisco, USA Feb 6 th, 2017 We are seeking new professional opportunities all the time To Advance our Careers

More information

Examining and Modeling Customer Service Centers with Impatient Customers

Examining and Modeling Customer Service Centers with Impatient Customers Examining and Modeling Customer Service Centers with Impatient Customers Jonathan Lee A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF APPLIED SCIENCE DEPARTMENT

More information

Using Decision Tree to predict repeat customers

Using Decision Tree to predict repeat customers Using Decision Tree to predict repeat customers Jia En Nicholette Li Jing Rong Lim Abstract We focus on using feature engineering and decision trees to perform classification and feature selection on the

More information

Influence of Time on User Profiling and Recommending Researchers in Social Media. Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp

Influence of Time on User Profiling and Recommending Researchers in Social Media. Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp Influence of Time on User Profiling and Recommending Researchers in Social Media Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp 1 Motivation Construct user s professional profiles from social media

More information

Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge

Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2007 Prentice Hall Chapter Objectives To explain how knowledge is discovered

More information

Finding Your Food Soulmate

Finding Your Food Soulmate CS 224W Information Networks Angela Sy, Lawrence Liu, Blanca Villanueva Dec. 9, 2014 Project Final Report Finding Your Food Soulmate 1. Abstract Individuals in social networks are often unaware of people

More information

Predicting Popularity of Messages in Twitter using a Feature-weighted Model

Predicting Popularity of Messages in Twitter using a Feature-weighted Model International Journal of Advanced Intelligence Volume 0, Number 0, pp.xxx-yyy, November, 20XX. c AIA International Advanced Information Institute Predicting Popularity of Messages in Twitter using a Feature-weighted

More information

Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation

Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation Tao Feng 1, Harry J.P. Timmermans 2 1 Urban Planning Group, Department of the Built Environment, Eindhoven University of Technology,

More information

A Case Study on Mining Social Media Data

A Case Study on Mining Social Media Data Chan, H. K. and Lacka, E. and Yee, R. W. Y. and Lim, M. K. (2014) A case study on mining social media data. In: IEEM 2014, 2014-12-09-2014-12-12, Sunway Resort Hotel & Spa., http://dx.doi.org/10.1109/ieem.2014.7058707

More information

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING Abbas Heiat, College of Business, Montana State University, Billings, MT 59102, aheiat@msubillings.edu ABSTRACT The purpose of this study is to investigate

More information

ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring

ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring 1 Kavyashree M Bandekar, 2 Maddala Tejasree, 3 Misba Sultana S N, 4 Nayana G K, 5 Harshavardhana Doddamani 1, 2, 3, 4 Engineering

More information

Assistant Professor Neha Pandya Department of Information Technology, Parul Institute Of Engineering & Technology Gujarat Technological University

Assistant Professor Neha Pandya Department of Information Technology, Parul Institute Of Engineering & Technology Gujarat Technological University Feature Level Text Categorization For Opinion Mining Gandhi Vaibhav C. Computer Engineering Parul Institute Of Engineering & Technology Gujarat Technological University Assistant Professor Neha Pandya

More information

BPIC 13: Mining an Incident Management Process

BPIC 13: Mining an Incident Management Process BPIC 13: Mining an Incident Management Process Emmy Dudok, Peter van den Brand Perceptive Software, 22701 West 68th Terrace, Shawnee, KS 66226, United States {Emmy.Dudok, Peter.Brand}@perceptivesoftware.com

More information

Creating a Transit Supply Index. Andrew Keller Regional Transportation Authority and University of Illinois at Chicago

Creating a Transit Supply Index. Andrew Keller Regional Transportation Authority and University of Illinois at Chicago Creating a Transit Supply Index Andrew Keller Regional Transportation Authority and University of Illinois at Chicago Presented at Transport Chicago Conference June 1, 2012 Introduction This master's project

More information

Quality Assessment Method for Software Development Process Document based on Software Document Characteristics Metric

Quality Assessment Method for Software Development Process Document based on Software Document Characteristics Metric Quality Assessment Method for Software Development Process Document based on Software Document Characteristics Metric Patra Thitisathienkul, Nakornthip Prompoon Department of Computer Engineering Chulalongkorn

More information

A Summer Internship Project Report. Sparse Aspect Rating Model for Sentiment Summarization

A Summer Internship Project Report. Sparse Aspect Rating Model for Sentiment Summarization A Summer Internship Project Report On Sparse Aspect Rating Model for Sentiment Summarization Carried out at the Institute for Development and Research in Banking Technology, Hyderabad Established by Reserve

More information

A Systematic Approach to Performance Evaluation

A Systematic Approach to Performance Evaluation A Systematic Approach to Performance evaluation is the process of determining how well an existing or future computer system meets a set of alternative performance objectives. Arbitrarily selecting performance

More information

Data Mining in CRM THE CRM STRATEGY

Data Mining in CRM THE CRM STRATEGY CHAPTER ONE Data Mining in CRM THE CRM STRATEGY Customers are the most important asset of an organization. There cannot be any business prospects without satisfied customers who remain loyal and develop

More information

Predicting ratings of peer-generated content with personalized metrics

Predicting ratings of peer-generated content with personalized metrics Predicting ratings of peer-generated content with personalized metrics Project report Tyler Casey tyler.casey09@gmail.com Marius Lazer mlazer@stanford.edu [Group #40] Ashish Mathew amathew9@stanford.edu

More information

INTERACTIVE ANNEALING FOR UNDERSTANDING INVISIBLE DARK EVENTS

INTERACTIVE ANNEALING FOR UNDERSTANDING INVISIBLE DARK EVENTS INTERACTIVE ANNEALING FOR UNDERSTANDING INVISIBLE DARK EVENTS Yoshiharu Maeno and Yukio Ohsawa Graduate School of Systems Management, University of Tsukuba School of Engineering, University of Tokyo maeno.yoshiharu@nifty.com

More information

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement Atef Chaudhury University of Waterloo Waterloo, ON N2L 3G1, Canada a29chaud@uwaterloo.ca Myunghwan Kim LinkedIn

More information

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong Machine learning models can be used to predict which recommended content users will click on a given website.

More information

Combinational Collaborative Filtering: An Approach For Personalised, Contextually Relevant Product Recommendation Baskets

Combinational Collaborative Filtering: An Approach For Personalised, Contextually Relevant Product Recommendation Baskets Combinational Collaborative Filtering: An Approach For Personalised, Contextually Relevant Product Recommendation Baskets Research Project - Jai Chopra (338852) Dr Wei Wang (Supervisor) Dr Yifang Sun (Assessor)

More information

The Heterogeneity Principle in Evaluation Measures for Automatic Summarization

The Heterogeneity Principle in Evaluation Measures for Automatic Summarization The Heterogeneity Principle in Evaluation Measures for Automatic Summarization Enrique Amigó Julio Gonzalo Felisa Verdejo UNED, Madrid {enrique,julio,felisa}@lsi.uned.es Abstract The development of summarization

More information

SAS Visual Analytics: Text Analytics Using Word Clouds

SAS Visual Analytics: Text Analytics Using Word Clouds ABSTRACT Paper 1687-2018 SAS Visual Analytics: Text Analytics Using Word Clouds Jenine Milum, Atlanta, GA, USA There exists a limit to employee skills and capacity to efficiently analyze volumes of textual

More information

The Science of Social Media. Kristina Lerman USC Information Sciences Institute

The Science of Social Media. Kristina Lerman USC Information Sciences Institute The Science of Social Media Kristina Lerman USC Information Sciences Institute ML meetup, July 2011 What is a science? Explain observed phenomena Make verifiable predictions Help engineer systems with

More information

Introduction to Software Metrics

Introduction to Software Metrics Introduction to Software Metrics Outline Today we begin looking at measurement of software quality using software metrics We ll look at: What are software quality metrics? Some basic measurement theory

More information

WE consider the general ranking problem, where a computer

WE consider the general ranking problem, where a computer 5140 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 11, NOVEMBER 2008 Statistical Analysis of Bayes Optimal Subset Ranking David Cossock and Tong Zhang Abstract The ranking problem has become increasingly

More information

User Profiling in an Ego Network: Co-profiling Attributes and Relationships

User Profiling in an Ego Network: Co-profiling Attributes and Relationships User Profiling in an Ego Network: Co-profiling Attributes and Relationships Rui Li, Chi Wang, Kevin Chen-Chuan Chang, {ruili1, chiwang1, kcchang}@illinois.edu Department of Computer Science, University

More information

Product Aspect Ranking Based on the Customers Reviews.

Product Aspect Ranking Based on the Customers Reviews. Product Aspect Ranking Based on the Customers Reviews. Kadam Snehal, Dhomse Kalyani, Bhumkar Smita, Janrao Swati, Ambhore Shakuntala BE Scholar, Department of Computer Engineering S.R.E.S. s College of

More information

CHAPTER 4 A FRAMEWORK FOR CUSTOMER LIFETIME VALUE USING DATA MINING TECHNIQUES

CHAPTER 4 A FRAMEWORK FOR CUSTOMER LIFETIME VALUE USING DATA MINING TECHNIQUES 49 CHAPTER 4 A FRAMEWORK FOR CUSTOMER LIFETIME VALUE USING DATA MINING TECHNIQUES 4.1 INTRODUCTION Different groups of customers prefer some special products. Customers type recognition is one of the main

More information

A Survey on Recommendation Techniques in E-Commerce

A Survey on Recommendation Techniques in E-Commerce A Survey on Recommendation Techniques in E-Commerce Namitha Ann Regi Post-Graduate Student Department of Computer Science and Engineering Karunya University, India P. Rebecca Sandra Assistant Professor

More information

SOFTWARE DEVELOPMENT PRODUCTIVITY FACTORS IN PC PLATFORM

SOFTWARE DEVELOPMENT PRODUCTIVITY FACTORS IN PC PLATFORM SOFTWARE DEVELOPMENT PRODUCTIVITY FACTORS IN PC PLATFORM Abbas Heiat, College of Business, Montana State University-Billings, Billings, MT 59101, 406-657-1627, aheiat@msubillings.edu ABSTRACT CRT and ANN

More information

Estimation of social network user s influence in a given area of expertise

Estimation of social network user s influence in a given area of expertise Journal of Physics: Conference Series PAPER OPEN ACCESS Estimation of social network user s influence in a given area of expertise To cite this article: E E Luneva et al 2017 J. Phys.: Conf. Ser. 803 012089

More information

The Interaction-Interaction Model for Disease Protein Discovery

The Interaction-Interaction Model for Disease Protein Discovery The Interaction-Interaction Model for Disease Protein Discovery Ken Cheng December 10, 2017 Abstract Network medicine, the field of using biological networks to develop insight into disease and medicine,

More information

TwitterRank: Finding Topicsensitive Influential Twitterers

TwitterRank: Finding Topicsensitive Influential Twitterers TwitterRank: Finding Topicsensitive Influential Twitterers Jianshu Weng, Ee-Peng Lim, Jing Jiang Singapore Management University Qi He Pennsylvania State University WSDM 2010 Feb 5, 2010 Outline Introduction

More information

WaterlooClarke: TREC 2015 Total Recall Track

WaterlooClarke: TREC 2015 Total Recall Track WaterlooClarke: TREC 2015 Total Recall Track Haotian Zhang, Wu Lin, Yipeng Wang, Charles L. A. Clarke and Mark D. Smucker Data System Group University of Waterloo TREC, 2015 Haotian Zhang, Wu Lin, Yipeng

More information

Advancing Information Management and Analysis with Entity Resolution. Whitepaper ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION

Advancing Information Management and Analysis with Entity Resolution. Whitepaper ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION Advancing Information Management and Analysis with Entity Resolution Whitepaper February 2016 novetta.com 2016, Novetta ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION Advancing Information

More information

TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011

TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011 TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011 1 Develop Models for Customers Likely to Churn Churn is a term used to indicate a customer leaving the service of one company in

More information

Strength in numbers? Modelling the impact of businesses on each other

Strength in numbers? Modelling the impact of businesses on each other Strength in numbers? Modelling the impact of businesses on each other Amir Abbas Sadeghian amirabs@stanford.edu Hakan Inan inanh@stanford.edu Andres Nötzli noetzli@stanford.edu. INTRODUCTION In many cities,

More information

Speech Analytics Transcription Accuracy

Speech Analytics Transcription Accuracy Speech Analytics Transcription Accuracy Understanding Verint s speech analytics transcription and categorization accuracy Verint.com Twitter.com/verint Facebook.com/verint Blog.verint.com Table of Contents

More information

Text Categorization. Hongning Wang

Text Categorization. Hongning Wang Text Categorization Hongning Wang CS@UVa Today s lecture Bayes decision theory Supervised text categorization General steps for text categorization Feature selection methods Evaluation metrics CS@UVa CS

More information

Text Categorization. Hongning Wang

Text Categorization. Hongning Wang Text Categorization Hongning Wang CS@UVa Today s lecture Bayes decision theory Supervised text categorization General steps for text categorization Feature selection methods Evaluation metrics CS@UVa CS

More information

STAT 2300: Unit 1 Learning Objectives Spring 2019

STAT 2300: Unit 1 Learning Objectives Spring 2019 STAT 2300: Unit 1 Learning Objectives Spring 2019 Unit tests are written to evaluate student comprehension, acquisition, and synthesis of these skills. The problems listed as Assigned MyStatLab Problems

More information