Derek Davis, Gerardo Figueroa, and Yi-Shin Chen


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 47, NO. 6, JUNE 2017

SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors

Derek Davis, Gerardo Figueroa, and Yi-Shin Chen

Abstract: Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and capture only the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data; hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework, SociRank, which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.

Index Terms: Information filtering, social computing, social network analysis, topic identification, topic ranking.

I. INTRODUCTION

The mining of valuable information from online sources has become a prominent research area in information technology in recent years.
Historically, knowledge that apprises the general public of daily events has been provided by mass media sources, specifically the news media. Many of these news media sources have either abandoned their hard-copy publications and moved to the World Wide Web, or now produce both hard-copy and Internet versions simultaneously. These news media sources are considered reliable because they are published by professional journalists, who are held accountable for their content. On the other hand, the Internet, being a free and open forum for information exchange, has recently seen a fascinating phenomenon known as social media. In social media, regular, nonjournalist users are able to publish unverified content and express their interest in certain events. Microblogs have become one of the most popular social media outlets. One microblogging service in particular, Twitter, is used by millions of people around the world, providing enormous amounts of user-generated data. One may assume that this source potentially contains information with equal or greater value than the news media, but one must also assume that because of the unverified nature of the source, much of this content is useless.

Manuscript received July 21, 2015; revised September 14, 2015; accepted October 17, 2015. Date of publication March 10, 2016; date of current version May 15, 2017. This work was supported by the Ministry of Science and Technology, China, under Grant MOST E and Grant MOST E. This paper was recommended by Associate Editor F. Wang. D. Davis and G. Figueroa are with the Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: ddavis@gmail.com; gerardo_ofc@yahoo.com). Y.-S. Chen is with the Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: yishin@gmail.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier /TSMC
© 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

For social media data to be of any use for topic identification, we must find a way to filter out uninformative content and capture only information which, based on its content similarity to the news media, may be considered useful or valuable. The news media presents professionally verified occurrences or events, while social media presents the interests of the audience in these areas, and may thus provide insight into their popularity. Social media services like Twitter can also provide additional or supporting information to a particular news media topic. In summary, truly valuable information may be thought of as the area in which these two media sources topically intersect. Unfortunately, even after the removal of unimportant content, there is still information overload in the remaining news-related data, which must be prioritized for consumption. To assist in the prioritization of news information, news must be ranked in order of estimated importance. The temporal prevalence of a particular topic in the news media indicates that it is widely covered by news media sources, making it an important factor when estimating topical relevance. This factor may be referred to as the MF of the topic. The temporal prevalence of the topic in social media, specifically in Twitter, indicates that users are interested in the topic and can provide a basis for the estimation of its popularity. This factor is regarded as the UA of the topic. Likewise, the number of users discussing a topic and the interaction between them also give insight into topical importance, referred to as the UI. By combining these three factors, we gain insight into topical importance and are then able to rank the news topics accordingly. Consolidated, filtered, and ranked news topics from both professional news providers and individuals have

several benefits. The most evident use is the potential to improve the quality and coverage of news recommender systems or Web feeds by adding user popularity feedback. Additionally, news topics that perhaps were not perceived as popular by the mass media could be uncovered from social media and given more coverage and priority. For instance, a particular story that has been discontinued by news providers could be given resurgence and continued if it is still a popular topic among social networks. This information, in turn, can be filtered to discover how particular topics are discussed in different geographic locations, which serves as feedback for businesses and governments. A straightforward approach for identifying topics from different social and news media sources is the application of topic modeling. Many methods have been proposed in this area, such as latent Dirichlet allocation (LDA) [1] and probabilistic latent semantic analysis (PLSA) [2], [3]. Topic modeling is, in essence, the discovery of topics in text corpora by clustering together frequently co-occurring words. This approach, however, misses out on the temporal component of prevalent topic detection; that is, it does not take into account how topics change with time. Furthermore, topic modeling and other topic detection techniques do not rank topics according to their popularity by taking into account their prevalence in both news media and social media. We propose an unsupervised system, SociRank, which effectively identifies news topics that are prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Even though this paper focuses on news topics, the framework can be easily adapted to a wide variety of fields, from science and technology to culture and sports.
To the best of our knowledge, no other work attempts to employ either the social media interests of users or their social relationships to aid in the ranking of topics. Moreover, SociRank comprises and integrates several techniques within an empirical framework, such as keyword extraction, measures of similarity, graph clustering, and social network analysis. The effectiveness of our system is validated by extensive controlled and uncontrolled experiments. To achieve its goal, SociRank uses keywords from news media sources (for a specified period of time) to identify the overlap with social media from that same period. We then build a graph whose nodes represent these keywords and whose edges depict their co-occurrences in social media. The graph is then clustered to clearly identify distinct topics. After obtaining well-separated topic clusters (TCs), the factors that signify their importance are calculated: MF, UA, and UI. Finally, the topics are ranked by an overall measure that combines these three factors. The remainder of this paper is organized as follows. Section II reviews previous work on topic identification and other research areas that are implemented in our method, including keyword extraction, co-occurrence similarity measures, graph clustering, topic ranking, and social network analysis. Section III describes the overall framework of SociRank and all of its stages. Section IV presents the setup and results of our experiments. Finally, we provide the conclusions and future work in Section V.

II. RELATED WORK

The main research areas applied in this paper include: topic identification, topic ranking, social network analysis, keyword extraction, co-occurrence similarity measures, and graph clustering. Extensive work has been conducted in most of these areas.

A. Topic Identification

Much research has been carried out in the field of topic identification, referred to more formally as topic modeling.
Two traditional methods for detecting topics are LDA [1] and PLSA [2], [3]. LDA is a generative probabilistic model that can be applied to different tasks, including topic identification. PLSA, similarly, is a statistical technique that can also be applied to topic modeling. In these approaches, however, temporal information is lost, which is paramount in identifying prevalent topics and is an important characteristic of social media data. Furthermore, LDA and PLSA only discover topics from text corpora; they do not rank them based on popularity or prevalence. Wartena and Brussee [4] implemented a method to detect topics by clustering keywords. Their method entails the clustering of keywords based on different similarity measures using the induced k-bisecting clustering algorithm [5]. Although they do not employ the use of graphs, they do observe that a distance measure based on the Jensen-Shannon divergence (or information radius [6]) of probability distributions performs well. More recently, research has been conducted on identifying topics and events from social media data, taking into account temporal information. Cataldi et al. [7] proposed a topic detection technique that retrieves real-time emerging topics from Twitter. Their method uses the set of terms from tweets and models their life cycle according to a novel aging theory. Additionally, they take into account social relationships, more specifically the authority of users in the network, to determine the importance of the topics. Zhao et al. [8] carried out similar work by developing a Twitter-LDA model designed to identify topics in tweets. Their work, however, only considers the personal interests of users, and not prevalent topics at a global scale. Another trending area of related research is the detection of bursty topics (i.e., topics or events that occur in short, sudden episodes). Diao et al. [9] proposed a method that uses a state machine to detect bursty topics in microblogs.
Their method also determines whether user posts are personal or refer to a particular trending topic. Yin et al. [10] also developed a model that detects topics from social media data, distinguishing between temporal and stable topics. These methods, however, only use data from microblogs and do not attempt to integrate them with real news. Additionally, the detected topics are not ranked by popularity or prevalence.
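The Jensen-Shannon divergence mentioned above, which Wartena and Brussee [4] found to perform well as a keyword distance, is straightforward to compute. Below is a minimal stdlib-only Python sketch; the function names are ours, not taken from the cited work, and distributions are assumed to be aligned lists of probabilities.

```python
import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # JS divergence: average KL of each distribution against their midpoint m.
    # Symmetric in p and q, and bounded in [0, 1] with log base 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give a divergence of 0, while distributions with disjoint support give the maximum value of 1.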

B. Topic Ranking

Another major concept incorporated into this paper is topic ranking. There are several means by which this task can be accomplished, traditionally done by estimating how frequently and recently a topic has been reported by mass media. Wang et al. [11] proposed a method that takes into account the users' interest in a topic by estimating the number of times they read stories related to that particular topic. They refer to this factor as the UA. They also used an aging theory developed by Chen et al. [12] to create, grow, and destroy a topic. The life cycles of the topics are tracked by using an energy function. The energy of a topic increases when it becomes popular, and it diminishes over time unless it remains popular. We employ variants of the concepts of MF and UA to meet our needs, as these concepts are both logical and effective. Other works have made use of Twitter to discover news-related content that might be considered important. Sankaranarayanan et al. [13] developed a system called TwitterStand, which identifies tweets that correspond to breaking news. They accomplish this by utilizing a clustering approach for tweet mining. Phelan et al. [14] developed a recommendation system that generates a ranked list of news stories. News stories are ranked based on the co-occurrence of popular terms within the user's RSS and Twitter feeds. Both of these systems aim to identify emerging topics, but give no insight into their popularity over time. Moreover, the work by Phelan et al. [14] only produces a personalized ranking (i.e., news articles tailored specifically to the content of a single user), rather than providing an overall ranking based on a sample of all users. Nevertheless, these works provide us with a basis for extending the premise of UA. Research has also been carried out in topic discovery and ranking from other domains.
Shubhankar et al. [15] developed an algorithm that detects and ranks topics in a corpus of research papers. They used closed frequent keyword-sets to form topics and a modification of the PageRank [16] algorithm to rank them. Their work, however, does not integrate or collaborate with other data sources, as accomplished by SociRank.

C. Social Network Analysis

In the case of UA, Wang et al. [11] estimated this factor by using anonymous website visitor data. Their method counts the number of times a site was visited during a particular period of time, which represents the UA of the topic to which the site is related. Our belief, on the other hand, is that, although website usage statistics provide initial proof of attention, additional data are needed to corroborate it. We employ the use of social media, specifically Twitter, as a means to estimate UA. When a user tweets about a particular topic, it signifies that the user is interested in the topic and that it has captured her attention more so than visiting a website related to it. In summary, visiting a website might be the initial stimulus, but taking the additional step of discussing a topic via social media signifies genuine attention.1 Additionally, we believe that the relationship between social media users who discuss the same topics also plays a key role in topic relevance. Kwan et al. [17] proposed a measure referred to as reciprocity, which attempts to detect the interaction between social media users and perceive their engagement in relation to a particular topic. Higher reciprocity means greater interaction between users, and thus topics with higher reciprocity should be considered more important because of their underlying community structure. We can inherently identify the power and influence of a well-structured community as opposed to a decentralized and unstructured one. Our method applies this logic to support the idea that higher reciprocity signifies greater importance.

D. Keyword Extraction

Concerning the field of keyword or informative term extraction, many unsupervised and supervised methods have been proposed. Unsupervised methods for keyword extraction rely solely on implicit information found in individual texts or in a text corpus. Supervised methods, on the other hand, make use of training datasets that have already been classified. Among the unsupervised methods, there are those that employ statistical measures of term informativeness or relevance, such as term specificity [18], TF-IDF [19], word frequency [20], n-grams [21], and word co-occurrence [22]. Other unsupervised approaches are graph-based, where a text is converted into a graph whose nodes represent text units (e.g., words, phrases, and sentences) and whose edges represent the relationships between these units. The graph is then recursively iterated and relevance scores are assigned to each node using different approaches. A popular example of a graph-based keyword extraction method is TextRank, proposed by Mihalcea and Tarau [23]; it utilizes the premise of the popular PageRank [16] algorithm. There has also been much work on keyword extraction using supervised and hybrid approaches. Two traditional supervised frameworks are KEA [24] and GenEx [25], which use machine learning algorithms for the effective extraction of keywords. Other innovative approaches for keyword extraction have been proposed in recent years, including the application of neural networks [26]-[28] and conditional random fields [29]. Hybrid methods (i.e., methods that make use of unsupervised and supervised components) have been proposed as well, such as HybridRank [30], which makes use of collaboration between the two approaches. Due to its simple implementation, we use TextRank [23] to extract keywords from the news media sources. Furthermore, TextRank does not require training or any document corpus for its operation.

E. Co-Occurrence Similarity

Matsuo and Ishizuka [22] suggested that the co-occurrence relationship of frequent word pairs from a single document

1 The works by Sankaranarayanan et al. [13] and Phelan et al. [14] support this premise.

may provide statistical information to aid in the identification of the document's keywords. They proposed that if the probability distribution of co-occurrence between a term x and all other terms in a document is biased toward a particular subset of frequent terms, then term x is likely to be a keyword. Even though our intention is not to employ co-occurrence for keyword extraction, this hypothesis emphasizes the importance of co-occurrence relationships. Chen et al. [31] proposed a novel co-occurrence similarity measure in which they measure the association of terms using snippets returned by Web searches. They refer to this measure as co-occurrence double checking (CODC). Bollegala et al. [32] proposed a method that uses page counts and text snippets from Web searches to measure the similarity between words or entities. They compared their method with CODC [31], as well as with variants of several other co-occurrence similarity measures, such as the overlap (Simpson) [33], Dice [34], point-wise mutual information (PMI) [35], Jaccard [36], and cosine similarities. Since establishing the importance of the word-pair co-occurrence distribution in the actual corpus of tweets is of more interest to us, we did not employ Bollegala's or Chen's semantic similarity methods. In this paper, we tested other similarity measures, and found that the Dice similarity measure provided the best results.

F. Graph Clustering

The main purpose of graph clustering in this paper is to identify and separate TCs, as done in Wartena and Brussee's work [4]. Iwasaka and Tanaka-Ishii [37] also proposed a method that clusters a co-occurrence graph based on a graph measure known as transitivity.
The basic idea of transitivity is that in a relationship between three elements, if the relationship holds between the first and second elements and between the second and third elements, it also holds between the first and third elements. They suggested that each output cluster is expected to have no ambiguity, and that this is only achieved when the edges of a graph (representing co-occurrence relations) are transitive. Matsuo et al. [38] employed a different approach to achieve the clustering of co-occurrence graphs. They used Newman clustering [39] to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness measure of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between different clusters must go along one of these edges. Consequently, the edges connecting different clusters will have high edge betweenness, and removing them iteratively will yield well-defined clusters. Newman realized, however, that the main disadvantage of the algorithm was its high computational demand, and thus proposed a new method to identify clusters based on modularity [40]. Modularity is a measure designed to estimate the strength of division of a network into clusters. Networks that possess a high modularity value have dense connections between the nodes within each cluster, but sparse connections between nodes in different clusters. Newman's new algorithm calculates modularity as it progresses, making it simple to find the optimal clustering structure. Given their effectiveness, the concepts of betweenness and transitivity are both applied in our graph clustering algorithm.

III. SOCIRANK FRAMEWORK

The goal of our method, SociRank, is to identify, consolidate, and rank the most prevalent topics discussed in both news media and social media during a specific period of time. The system framework is visualized in Fig. 1. To achieve its goal, the system must undergo four main stages.

1) Preprocessing: Key terms are extracted and filtered from news and social data corresponding to a particular period of time.
2) Key Term Graph Construction: A graph is constructed from the previously extracted key term set, whose vertices represent the key terms and whose edges represent the co-occurrence similarity between them. The graph, after processing and pruning, contains slightly joint clusters of topics popular in both news media and social media.
3) Graph Clustering: The graph is clustered in order to obtain well-defined and disjoint TCs.
4) Content Selection and Ranking: The TCs from the graph are selected and ranked using the three relevance factors (MF, UA, and UI).

Initially, news and tweet data are crawled from the Internet and stored in a database. News articles are obtained from specific news websites via their RSS feeds, and tweets are crawled from the Twitter public timeline [41]. A user then requests an output of the top k ranked news topics for a specified period of time between date d1 (start) and date d2 (end).

A. Preprocessing

In the preprocessing stage, the system first queries all news articles and tweets from the database that fall within dates d1 and d2. Additionally, two sets of terms are created: one for the news articles and one for the tweets, as explained below.

1) News Term Extraction: The set of terms from the news data source consists of keywords extracted from all the queried articles. Due to its simple implementation and effectiveness, we implement a variant of the popular TextRank algorithm [23] to extract the top k keywords from each news article.
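As a rough illustration of this step, the following is a bare-bones TextRank-style keyword extractor: it builds an undirected co-occurrence graph over words within a sliding window and then applies PageRank-style score propagation. This is a sketch under our own simplifications (no POS filtering, no multi-word keywords, plain text tokenization), not the authors' implementation.

```python
import re
from collections import defaultdict

def textrank_keywords(text, top_k=5, window=2, damping=0.85, iters=50):
    # Tokenize and drop very short words, mirroring the paper's length filter.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 3]
    # Undirected co-occurrence graph: link each word to neighbors in a window.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank-style iteration over the word graph (TextRank scoring).
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(graph[u]) for u in graph[w])
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Running it on a short news snippet returns the most central words first; in practice the extracted keywords would then be lemmatized as described next.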
The selected keywords are then lemmatized using the WordNet lemmatizer in order to consider different inflected forms of a word as a single item. After lemmatization, all unique terms are added to set N. It is worth pointing out that, since N is a set, it does not contain duplicate terms.

2) Tweet Term Extraction: For the tweets data source, the set of terms consists not of the tweets' keywords, but of all unique and relevant terms. First, the language of each queried tweet is identified, disregarding any tweet that is not in English. From the remaining tweets, all terms that appear in a stop word

2 Terms that appear in a predefined stop word list or that are less than three characters in length are not taken into consideration.

Fig. 1. SociRank framework.

list or that are less than three characters in length are eliminated. The part of speech (POS) of each term in the tweets is then identified using a POS tagger [42]. This POS tagger is especially useful because it can identify Twitter-specific POSs, such as hashtags, mentions, and emoticon symbols. Hashtags are of great interest to us because of their potential to hold the topical focus of a tweet. However, hashtags usually contain several words joined together, which must be segmented in order to be useful. To solve this problem, we make use of the Viterbi segmentation algorithm [43]. The segmented terms are then tagged as hashtags. To eliminate terms that are not relevant, only terms tagged as hashtag, noun, adjective, or verb are selected. The terms are then lemmatized and added to set T, which represents all unique terms that appear in tweets from dates d1 to d2.

B. Key Term Graph Construction

In this component, a graph G is constructed, whose clustered nodes represent the most prevalent news topics in both news and social media. The vertices in G are unique terms selected from N and T, and the edges are represented by a relationship between these terms. In the following sections, we define a method for selecting the terms and establish a relationship between them. After the terms and relationships are identified, the graph is pruned by filtering out unimportant vertices and edges.

1) Term Document Frequency: First, the document frequency of each term in N and T is calculated accordingly. In the case of term set N, the document frequency of each term n is equal to the number of news articles (from dates d1 to d2) in which n has been selected as a keyword; it is represented as df(n). The document frequency of each term t in set T is calculated in a similar fashion.
In this case, however, it is the number of tweets in which t appears; it is represented as df(t). For simplification purposes, we will henceforth refer to the document frequency as occurrence. Thus, df(n) is the occurrence of term n and df(t) is the occurrence of term t.

2) Relevant Key Term Identification: Let us recall that set N represents the keywords present in the news and set T represents all relevant terms present in the tweets (from dates d1 to d2). We are primarily interested in the important news-related terms, as these signal the presence of a news-related topic. Additionally, part of our objective is to extract the topics that are prevalent in both news and social media. To achieve this, a new set I is formed:

I = N ∩ T.  (1)

This intersection of N and T eliminates terms from T that are not relevant to the news and terms from N that are not mentioned in the social media. Set I, however, still contains many potentially unimportant terms. To solve this problem, terms in I are ranked based on their prevalence in both sources. In this case, prevalence is interpreted as the occurrence of a term, which in turn is the term's document frequency. The prevalence of a term is thus a combination of its occurrence in both N and T. The prevalence p of each term i in I is calculated such that half of its weight is based on the occurrence of the term in the news media, and the other half is based on its occurrence in social media:

∀i ∈ I : p(i) = ( df(n) · |T| / |N| + df(t) ) / ( 2 |T| )  (2)

where |T| is the total number of tweets selected between dates d1 and d2, and |N| is the total number of news articles selected in the same time period.
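The prevalence of Eq. (2) can be sketched directly in code. The container shapes here (one set of keywords per article, one set of terms per tweet) are our own assumption for illustration; note that the formula is equivalent to df(n)/(2|N|) + df(t)/(2|T|), i.e., half weight per source.

```python
def prevalence(term, news_keywords, tweet_terms):
    # news_keywords: one set of extracted keywords per news article (source of N)
    # tweet_terms:   one set of relevant terms per tweet (source of T)
    # Implements p(i) = (df(n) * |T| / |N| + df(t)) / (2 * |T|).
    df_n = sum(term in kws for kws in news_keywords)   # occurrence in news
    df_t = sum(term in terms for terms in tweet_terms) # occurrence in tweets
    n_news, n_tweets = len(news_keywords), len(tweet_terms)
    return (df_n * n_tweets / n_news + df_t) / (2 * n_tweets)
```

A term occurring in every article and every tweet gets prevalence 1; a term occurring in neither gets 0.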

The terms in set I are then ranked by their prevalence value, and only those in the top πth percentile are selected. Using a π value of 75 presented the best results in our experiments. We define the newly filtered set I_top using set-builder notation:

I_top = { i ∈ I : (|P_i| / |I|) · 100 > π }  (3)

P_i = { j ∈ I : p(j) < p(i) }  (4)

where |P_i| is the number of elements in subset P_i, which in turn represents the terms in I with a lower prevalence value than that of term i, and |I| is the total number of elements in set I. I_top now represents the subset of top key terms from date d1 to date d2, taking into account their prevalence in both news and social media.

3) Key Term Similarity Estimation: Next, we must identify a relationship between the previously selected key terms in order to add the graph edges. The relationship used is term co-occurrence in the tweet term set T. The intuition behind the co-occurrence is that terms that co-occur frequently are related to the same topic and may be used to summarize and represent it when grouped. We define co-occurrence as two terms occurring in the same tweet. If term i ∈ I_top and term j ∈ I_top both appear in the same tweet, their co-occurrence is set to 1. For each additional tweet in which i and j appear together, their co-occurrence is incremented by 1. I_top is iterated through and the co-occurrence for each term pair {i, j} is found, defined as co(i, j). The term-pair co-occurrence is then used to estimate the similarity between terms. As explained in our related work, several similarity measures were tested in our experiments, namely overlap [33], Dice [34], PMI [35], Jaccard [36], and cosine. Dice similarity, a variant of cosine similarity and Jaccard similarity, provided the best results in our experiments.
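The percentile-based selection of I_top described by Eqs. (3) and (4) can be sketched as a literal reading of the set-builder definition; the function name and dict representation are ours.

```python
def top_percentile(prevalence, pi=75):
    # prevalence: dict mapping term -> p(i).
    # Keep term i if the share of terms with strictly lower prevalence,
    # expressed as a percentage (|P_i| / |I| * 100), exceeds pi.
    n = len(prevalence)
    return {i for i, p in prevalence.items()
            if 100.0 * sum(q < p for q in prevalence.values()) / n > pi}
```

This quadratic scan is fine for illustration; sorting once by prevalence would give the same result in O(|I| log |I|).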
The Dice similarity between terms i and j is calculated as follows:

dice_qs(i, j) = 0 if co(i, j) < ϑ; otherwise 2 · co(i, j) / ( df_top(i) + df_top(j) )  (5)

where df_top(i) is the number of tweets that contain term i ∈ I_top, df_top(j) is the number of tweets that contain term j ∈ I_top, and co(i, j) is the number of tweets in which terms i and j co-occur. ϑ is a threshold used to discard quotients of similarity that fall below it. Given the scale of and noise in social media data, it is possible that a pair of terms co-occurs purely by chance. In order to reduce the adverse effects of these co-occurrences, the quotient of similarity (QS) of two terms is set to zero if their co-occurrence value is less than ϑ. In our experiments, we set ϑ to 5, though this can be adjusted as needed. For elaboration, we also show how the variant of the Jaccard similarity measure is computed (as exhibited in [32]):

jacc_qs(i, j) = 0 if co(i, j) < ϑ; otherwise co(i, j) / ( df_top(i) + df_top(j) − co(i, j) )  (6)

Finally, the variant of the cosine similarity measure described by Chen et al. [31] is defined by the following equation:

cosine_qs(i, j) = 0 if co(i, j) < ϑ; otherwise co(i, j) / sqrt( df_top(i) · df_top(j) )  (7)

Fig. 2. PDF of a typical set of QS values in Q_top.

All of the previously described similarity measures produce a value between 0 and 1. Additionally, all QS values less than 0.01 are disregarded in order to reduce the effects of co-occurrences that are considered insignificant. The vertices of the graph are now defined as the key terms that belong to set I_top, and the edges that connect them are defined by the co-occurrence of the terms in the tweet dataset. Using the terms' occurrence and co-occurrence values in the tweets, the relationship between vertices is further normalized by using a coefficient of similarity to represent an edge. We henceforth refer to the QS values of all term-pair combinations in I_top as set Q_top.
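The three measures transcribe directly into code (with ϑ written as `theta`); this is a sketch, with thresholding applied as described in the text (zero when the co-occurrence count falls below ϑ).

```python
import math

def dice_qs(co, df_i, df_j, theta=5):
    # Eq. (5): Dice similarity over tweet counts, zeroed below the threshold.
    return 0.0 if co < theta else 2 * co / (df_i + df_j)

def jaccard_qs(co, df_i, df_j, theta=5):
    # Eq. (6): Jaccard variant; the union size is df_i + df_j - co.
    return 0.0 if co < theta else co / (df_i + df_j - co)

def cosine_qs(co, df_i, df_j, theta=5):
    # Eq. (7): cosine variant over the geometric mean of the tweet counts.
    return 0.0 if co < theta else co / math.sqrt(df_i * df_j)
```

For example, two terms each appearing in 20 tweets and co-occurring in 10 get a Dice QS of 0.5, while a pair co-occurring only 3 times is zeroed out by the default ϑ = 5.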
4) Outlier Detection: Even though many potentially unimportant terms have been excluded thus far, there are still too many terms (vertices) and co-occurrences (edges) in the graph. We wish to capture only the most significant term co-occurrences, that is, those with sufficiently high QS values. To identify significant edges in the graph, irregular co-occurrence values (outliers) must be differentiated from regular ones. Fig. 2 illustrates the probability density function (PDF) of a typical set of QS values in Q_top. This particular distribution has 8444 values. It can be observed that most QS values lie near or below the mean, with those that are several standard deviations away from the mean being the most interesting ones. These values are considered outliers (i.e., they fall outside of the overall pattern of the rest of the data). We have tested several outlier detection methods and found that using the interquartile range (IQR) works well. The IQR of a given set of values is the difference between the third (Q_3) and first (Q_1) quartiles, computed as

IQR = Q_3 − Q_1    (8)

In other words, the IQR indicates how dispersed the middle 50% of the data are. As Hubert and Van der Veeken [44] described, the IQR can be used to aid in the detection of outliers by multiplying it by a coefficient of 1.5 and adding it to the third quartile or subtracting it from the first quartile.
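A minimal sketch of this upper-fence detection follows; the skew coefficient c and the choice c = 4 come from the discussion of (9) below, while the interpolated-quartile convention is our own assumption, as the paper does not specify one:

```python
def iqr_outliers(values, c=4.0):
    """Upper-tail IQR outliers in the spirit of Eq. (9). The coefficient c
    widens the usual 1.5*IQR fence to compensate for right skew; only the
    fence above Q3 is applied, since only outliers above the mean matter."""
    xs = sorted(values)

    def quartile(q):
        # Linear interpolation between order statistics (one common convention)
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    fence = q3 + 1.5 * c * (q3 - q1)
    return [x for x in values if x > fence]
```

A tightly clustered sample with one extreme value keeps only that value; an evenly spread sample yields no outliers at all.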

In data that is normally distributed, the IQR would simply be multiplied by 1.5 in the equation. However, as can be seen in Fig. 2, our data are right-skewed, and thus an additional factor c must be included in the calculation to compensate for it. Even though Hubert and Van der Veeken [44] provided a complex method to compute this coefficient, for our needs we simply test several constant values and choose the one that yields the best results. The set of outliers O is defined as

O = { x ∈ Q_top : x < Q_1 − 1.5c·IQR or x > Q_3 + 1.5c·IQR }    (9)

where c is the coefficient by which 1.5 is multiplied. In our experiments, setting c = 4 provided the best results. We are only interested in outliers above the mean, so only the second condition of the previous formula is applied. In the particular QS distribution of Q_top mentioned before, after applying the aforementioned outlier-detection method, only 301 values remain (of the original 8444). These 301 values represent the most significant term-pair combinations in Q_top.

We henceforth refer to these pair combinations as Q_sig; their individual terms are similarly referred to as I_sig. The vertices in graph G are now the terms from I_sig; its edges are the term pairs from Q_sig, with their QS values being the edge weights. Fig. 3 shows an example of two clearly identifiable clusters in G that represent unique topics. In some cases, as shown in Fig. 3, two or more TCs (subgraphs) appear to be joined because they share similar terms. In the figure, a topic related to the Amanda Knox murder trial is represented by the terms amanda, knox, murder, italy, conviction, extradition, meredith, and sollecito. Another topic, related to U.S. Governor Chris Christie's scandal, is represented by the terms chris, christie, govenor, lawyer, evidence, wildstien, lane, and authority. Although the term evidence is connected to the Chris Christie scandal, it is also connected to the term knox (and is also related to that topic). The term evidence, however, is connected to more vertices related to the Chris Christie scandal, and may be perceived as more important to this topic.

Fig. 3. Ambiguous TCs in graph G.

As seen in the previous example, some terms can belong to two or more topics, which leaves us with two options: 1) allowing terms to belong to multiple topics (overlapping clusters) or 2) allowing terms to belong to a single topic (nonoverlapping clusters). Since our goal is to clearly identify unambiguous TCs, we employ the second option. Furthermore, we hypothesize that for a specific moment in time, and given that the terms in I_sig are highly informational and specific, they will be the most representative of a single topic. We must therefore find a way to eliminate the edges that cause some topics to be ambiguous and increase the cluster quality of G.

C. Graph Clustering

Once graph G has been constructed and its most significant terms (vertices) and term-pair co-occurrence values (edges) have been selected, the next goal is to identify and separate well-defined TCs (subgraphs) in the graph. Before explaining the graph clustering algorithm, the concepts of betweenness and transitivity must first be understood.

1) Betweenness: Matsuo et al. [38] proposed an efficient approach to achieve the clustering of co-occurrence graphs. They use a graph clustering algorithm called Newman clustering [39] to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness value of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between the different clusters must go along these edges.
Consequently, the edges connecting the clusters will have high edge betweenness. Removing these edges iteratively should thus yield well-defined clusters. As outlined by Brandes [45], the betweenness measure of an edge e is calculated as follows:

betweenness(e) = Σ_{i,j ∈ V} σ(i, j | e) / σ(i, j)    (10)

where V is the set of vertices, σ(i, j) is the number of shortest paths between vertex i and vertex j, and σ(i, j | e) is the number of those paths that pass through edge e. By convention, 0/0 = 0.

2) Transitivity: Iwasaka and Tanaka-Ishii [37] developed a method that accomplishes the clustering of a co-occurrence graph based on a concept known as transitivity. Transitivity is a property of a relation among three elements such that if the relation holds between the first and second elements, and between the second and third elements, then it also holds between the first and third elements. The authors indicate that each output cluster is expected to have no ambiguity, and that this is only achieved when the graph edges representing co-occurrence relations are transitive. Thus, a graph with higher global transitivity is considered to have better cluster quality than one with lower global transitivity. The transitivity of a graph G is defined as

transitivity(G) = #triangles / #triads    (11)

where #triangles is the number of complete triangles (i.e., complete size-three subgraphs) in G and #triads is the number of triads (i.e., edge pairs connected to a shared vertex).

Algorithm 1 Improve the Cluster Quality of a Graph
1: Input: Graph G
2: Output: Cluster-quality-improved G
3: B = {} (empty set)
4: repeat
5:   for all (edge e in G) do
6:     Calculate betweenness(e) and append to B
7:   end for
8:   if first iteration of loop then
9:     b_avg = avg(B)
10:  end if
11:  b_max = max(B)
12:  trans_0 = transitivity(G) (previous transitivity)
13:  Remove edge with b_max from G
14:  trans_1 = transitivity(G) (posterior transitivity)
15:  Clear set B
16: until (trans_1 < trans_0 or b_max < b_avg)
17: Add edge with b_max back to G

3) Graph Clustering Algorithm: We apply the concepts of betweenness and transitivity in our graph clustering algorithm, which disambiguates potential topics. The process is outlined in Algorithm 1. First, the betweenness values of all edges in graph G are calculated in lines 5-7. Then, the initial average betweenness of graph G is calculated in line 9; we wish for all edges to approach this betweenness. To achieve this, edges with high betweenness values are iteratively removed in order to separate clusters in the graph (line 13). It is worth pointing out that set B, which keeps track of all betweenness values in the graph, is emptied at the end of each iteration. The edge-removing process is stopped when removing additional edges yields no benefit to the clustering quality of the graph. This occurs: 1) when the transitivity of G after removing an edge is lower than it was before or 2) when the maximum edge betweenness is less than the average, the latter indicating that the maximum betweenness is considered normal when compared to the average. Once the process has been stopped, the last-removed edge is added back to G.

Fig. 4 shows two clearly distinct and unambiguous TCs after the graph clustering algorithm has been run. It can be observed that the edge that connected the terms knox and evidence (in addition to other edges) has been removed. Although the accuracy of the algorithm is not 100% (i.e., a few topics may remain ambiguous), its performance is satisfactory for our needs.

Fig. 4. Unambiguous TCs in graph G after running the cluster-quality-improvement algorithm.

4) Time Complexity Analysis of Graph Clustering: In this section, we provide a brief time complexity analysis of the graph clustering algorithm described above. Our initial assumption is that, after performing the outlier detection step (Section III-B4) to obtain a reduced set of edges, graph G may be considered a sparse graph. In other words, the number of edges in G is approximately the same as the number of vertices, or |E| ≈ |V|. First, we analyze the time complexity for calculating the betweenness of all edges in G (10). Given that G is sparse, Johnson's algorithm [46] may be used for calculating the shortest paths of all vertex pairs. This algorithm runs in O(|V|² log |V| + |V||E|) time, performing faster than the Floyd-Warshall algorithm [47], which solves the problem in O(|V|³). The shortest paths of all vertex pairs are first computed and stored in a data structure in order to avoid calculating them for every edge (lines 5-7). Next, the average and maximum values of set B (lines 9 and 11, respectively) are calculated in O(|E|) time, since this set refers to the edges in G. The transitivity of G (11) is then calculated in O(|V|²) time, as this is the worst-case scenario for finding all triangles and triads in a graph. Finally, the large loop between lines 4 and 16 will iterate q times, where q is a number much smaller than the number of edges in G, or q ≪ |E|. This assumption comes from the fact that the number of edges in G is reduced by 1 after each iteration.
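The clustering loop of Algorithm 1 can be sketched in pure Python. This is an illustrative reimplementation, not the authors' code: the adjacency-set graph representation and the Brandes-style betweenness routine are our assumptions, and edge weights are ignored since shortest paths are counted by hop count:

```python
from collections import deque
from itertools import combinations

def edge_betweenness(adj):
    """Eq. (10) via Brandes-style accumulation on an unweighted graph.
    `adj` maps each vertex to the set of its neighbours; each unordered
    endpoint pair is counted from both directions, matching the sum
    over i, j in V."""
    bw = {tuple(sorted((v, w))): 0.0 for v in adj for w in adj[v]}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                                 # BFS from source s
            v = queue.popleft(); order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                    # back-propagate dependencies
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bw[tuple(sorted((v, w)))] += c
                delta[v] += c
    return bw

def transitivity(adj):
    """Eq. (11): #triangles / #triads (each triangle is a closed triad
    at three different centre vertices, hence the division by 3)."""
    triads = closed = 0
    for v in adj:
        for a, b in combinations(sorted(adj[v]), 2):
            triads += 1
            closed += b in adj[a]
    return (closed / 3) / triads if triads else 0.0

def improve_cluster_quality(adj):
    """Algorithm 1: repeatedly remove the max-betweenness edge; stop (and
    restore the last-removed edge) once transitivity drops or the maximum
    betweenness falls below the initial average."""
    b_avg = None
    while any(adj.values()):
        bw = edge_betweenness(adj)
        if b_avg is None:
            b_avg = sum(bw.values()) / len(bw)       # line 9: first-iteration average
        (u, v), b_max = max(bw.items(), key=lambda kv: kv[1])
        trans_0 = transitivity(adj)
        adj[u].discard(v); adj[v].discard(u)         # line 13: remove edge with b_max
        if transitivity(adj) < trans_0 or b_max < b_avg:
            adj[u].add(v); adj[v].add(u)             # line 17: add it back and stop
            break
    return adj
```

On a toy graph of two triangles joined by a bridge (say a-b-c and d-e-f with bridge c-d), the bridge carries the highest betweenness and is the only edge permanently removed, leaving two unambiguous clusters.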
Overall, we have the following time complexity for Algorithm 1:

O(q(|V|² log |V| + |V||E| + 2|E| + 2|V|²))    (12)

which can be reduced by selecting the largest factors, obtaining a final complexity of O(q(|V|² log |V| + |V||E|)).

D. Content Selection and Ranking

Now that the prevalent news TCs that fall within dates d_1 and d_2 have been identified, relevant content from the two media sources that is related to these topics must be selected and finally ranked. Related items from the news media will represent the MF of the topic. Similarly, related items from social media (Twitter) will represent the UA, more specifically, through the number of unique Twitter users related to the selected tweets. Selecting the appropriate items (i.e., tweets and news articles) related to the key terms of a topic is not an easy task, as many other items unrelated to the desired topic also contain similar key terms.

1) Node Weighting: The first step before selecting appropriate content is to weight the nodes of each topic accordingly. These weights can be used to estimate which terms are more important to the topic and provide insight into which of them must be present in an item for it to be considered topic-relevant. The weight of each node i in TC is calculated using (13), which utilizes the number of edges connected to a given node and their corresponding weights:

∀i ∈ TC: nw(i) = (|E_i| / |E_TC|)^λ · (Σ_{e∈E_i} w_e / Σ_{f∈E_TC} w_f)^(1−λ)    (13)

where E_i is the set of edges connected to node i, E_TC is the set of edges in TC, w is the weight of an edge e or f, and λ is a number between 0 and 1. In our experiments, λ is set to 0.5, as we wish to allot equal importance to the number of edges and their weights. The value of nw falls between 0 and 1, which provides a reasonable scale on which to compare node weights. We have compared the performance of our node-weighting method with that of PageRank [16], and our method provided equal or better results. For elaboration, we provide a modified version of the PageRank formula below, using our notation and undirected edges:

∀i ∈ TC: nw_pr(i) = (1 − d) + d · Σ_{j∈V_i} (w_ij / Σ_{k∈V_j} w_jk) · nw_pr(j)    (14)

where V_i and V_j are the sets of nodes that are connected to nodes i and j, respectively, and w_ij represents the weight of the edge between nodes i and j. The constant d represents the damping factor, which is usually set around 0.85. Now that the nodes of each TC are weighted according to their estimated importance to that cluster, the weights are used to select items from the two data sources. Given that the way in which items are selected from tweets and news is different, we describe separately how node weights are used to select those items in the following two sections.
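A sketch of (13), assuming edges are stored as a map from vertex pairs to their QS weights (the cluster below is invented for illustration):

```python
def node_weight(node, edges, lam=0.5):
    """Eq. (13): nw(i) = (|E_i| / |E_TC|)^lam * (sum of E_i weights /
    sum of all TC edge weights)^(1 - lam), with lam balancing the edge
    count against the edge weights."""
    incident = {e: w for e, w in edges.items() if node in e}
    count_part = len(incident) / len(edges)
    weight_part = sum(incident.values()) / sum(edges.values())
    return count_part ** lam * weight_part ** (1.0 - lam)

# Hypothetical three-term cluster with QS edge weights
tc_edges = {frozenset({"chris", "christie"}): 0.4,
            frozenset({"christie", "lane"}): 0.4,
            frozenset({"chris", "lane"}): 0.2}
```

With λ = 0.5 the result is the geometric mean of the two ratios, so a node touching 2 of 3 edges that carry 60% of the total weight scores √(2/3 × 0.6) ≈ 0.63.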
2) User Attention Estimation: To calculate the UA measure of a TC, the tweets related to that topic are first selected and then the number of unique users who created those tweets is counted. To ensure that the tweets are genuinely related to TC, the weight of each node in TC is utilized. A threshold τ is first selected, which specifies the summed node weights that are acceptable for tweet inclusion. For instance, say there is a TC with nodes i_1, i_2, i_3, and i_4, with respective weights 0.1, 0.2, 0.5, and 0.6. If τ is set to 0.4, then tweets containing only terms i_1 and i_2, or a combination of the two, would not be considered related to TC. This is because neither i_1's weight (0.1) nor i_2's weight (0.2) alone, nor the sum of their weights, is greater than or equal to τ. In the given case, only tweets containing terms i_3 and i_4, or a combination of the two, would be considered. Although it is acceptable to set τ to a fixed value, which presented good results in our experiments, we set τ dynamically. This makes our method more robust by not limiting τ to a particular data set. The value of τ is based on where the largest increase in the summed weights occurs when making node combinations. Node combinations are made only between nodes that share an edge. Thus, if there is an edge [i_1, i_2] and an edge [i_2, i_3], but edge [i_1, i_3] does not exist, then [i_1, i_3] is not considered a valid combination. Only [i_1, i_2], [i_2, i_3], or [i_1, i_2, i_3] would be considered valid combinations. With this in mind, we find the distribution of the summed weights of all valid combinations. An example of this distribution can be seen in Fig. 5.

Fig. 5. Boxplot of the summed node weights of valid combinations of a TC discovered on a particular date.

As can be seen in the box plot, most of the values fall below the median and even below the first quartile.
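The worked example above (weights 0.1, 0.2, 0.5, 0.6 and τ = 0.4) can be checked mechanically. This sketch is a simplification of the rule just described: it sums the weights of all TC terms found in a tweet, skipping the requirement that shared terms form an edge-connected valid combination:

```python
def tweet_matches_topic(tweet_terms, node_weights, tau):
    """A tweet counts toward a TC if the terms it shares with the cluster
    have summed node weight >= tau. Simplification: the paper additionally
    requires the shared terms to form a valid (edge-connected) combination,
    which is omitted here for brevity."""
    shared = set(tweet_terms) & set(node_weights)
    return sum(node_weights[t] for t in shared) >= tau

# Worked example from the text: four nodes and a fixed tau of 0.4
w = {"i1": 0.1, "i2": 0.2, "i3": 0.5, "i4": 0.6}
```

A tweet containing only i_1 and i_2 sums to 0.3 and is rejected, while any tweet containing i_3 or i_4 passes.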
This indicates that a few larger values influence the mean. Hence, we infer that the smaller values are less important and that the threshold τ should lie somewhere within the first quartile. We further deduce that the consecutive values in the first quartile with the largest difference between them provide a fair estimate of where the threshold should lie. τ is thus set to the lesser of the two values with the largest difference. All valid combinations are found for a given TC, and tweets that contain terms from combinations whose summed weights are greater than τ are selected from the data sets. The number of unique users who created these tweets is then counted, which directly represents the UA of TC. Because our main objective is ranking, the number of unique users related to TC must be compared to that of all other discovered TCs. This is achieved with the following equation:

∀TC ∈ G: UA(TC) = U_TC / Σ_{TC′∈G} U_TC′    (15)

where U_TC is the number of unique users related to TC and G is the entire graph. In summary, the UA of a TC in graph G is equal to the total number of unique users related to that TC divided by the sum of total unique users related to all TCs in G. This equation produces a value between 0 and 1.

3) Media Focus Estimation: To estimate the MF of a TC, the news articles that are related to TC are first selected. This presents a problem similar to the selection of tweets when calculating UA. The weighted nodes of TC are used to accurately select the articles that are genuinely related to its inherent topic. The only difference now is that instead of comparing node combinations with tweet content, they are compared to the top k keywords selected from each article. Comparing the term combinations with the actual content of the article is

unnecessary, as the reason for selecting the top k keywords in the first place was to gain insight into the most important terms of each article. Furthermore, the terms that make up the TCs are drawn from this set of keywords. In summary, we must now select articles whose top k keywords contain valid node combinations, where the sum of the node weights must be greater than or equal to τ. After selecting these articles, the MF value of each TC is calculated as

∀TC ∈ G: MF(TC) = A_TC / Σ_{TC′∈G} A_TC′    (16)

where A_TC is the number of news articles related to TC and G is the entire graph. The MF of a TC in graph G is therefore equal to the total number of news articles related to that TC (A_TC) divided by the sum of the total number of articles related to all TCs in G. This equation also produces a value between 0 and 1.

4) User Interaction Estimation: The next step is to measure the degree of interaction between the users responsible for the social media content creation related to a specific topic. The database is queried for followed relationships for each of these users, and a social network graph is constructed from these relationships. In Twitter, there are two types of relationships: 1) follow and 2) followed. In the first type, user u_1 can follow user u_2, in which case the direction of the relationship flows from u_1 to u_2. In the second case, user u_1 can be followed by user u_2; the direction of the relationship now flows from u_2 to u_1. In this paper, we are solely interested in the followed relationships of a particular user (i.e., the second case), as this represents the degree of interest other users have in the content created by the user at hand. Based on the followed relationships that are queried, a nondirected graph S_TC is constructed, whose set of nodes U_TC represents the users and whose set of edges R_TC represents the users' followed relationships.
The complex relationships of the users in S_TC are used to measure the level of interaction that potentially exists between them. Before calculating the level of UI, the concept of reciprocity is first presented.

a) Reciprocity: Kwan et al. [17] proposed a measure called reciprocity, which attempts to detect the interaction between social-media users and perceive user engagement in relation to a particular topic. Reciprocity is defined as the ratio of all edges in a social graph to the possible number of edges it can have. Reciprocity is a useful indicator of the degree of mutuality and reciprocal exchange in a network, which in turn relates to social cohesion. Higher reciprocity means greater interaction between users, and our intuition is that topics with higher reciprocity are more relevant, as this signifies a community structure. In other words, it is easier to identify the power and influence of a well-structured community than that of a decentralized and unstructured one. A variant of the reciprocity measure is employed in our method, and is defined as follows:

reciprocity(S_TC) = 2|R_TC| / (|U_TC| − 1)    (17)

where |U_TC| is the total number of nodes in the social graph, |R_TC| its total number of edges, and reciprocity(S_TC) is the reciprocity of social graph S_TC representing TC. Equation (17) is the simplified version of the formula: the original divides the total number of edges by all possible edges and then multiplies this figure by the total number of nodes for scaling purposes.

b) User interaction: The value of reciprocity(S_TC) is directly interpreted as the UI of the topic, which is calculated using the social-user relationships of the topic. Following our previous convention, the value of UI is normalized by using

∀TC ∈ G: UI(TC) = reciprocity(S_TC) / Σ_{TC′∈G} reciprocity(S_TC′)    (18)

5) Overall Ranking: Now that the MF, UA, and UI measures have been defined, their values are combined into a single metric r.
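Taken together, the three factors reduce to a few lines of code. This sketch assumes the raw counts are already gathered per TC: reciprocity per (17), the shared normalization form of (15), (16), and (18), and the geometric weighting of (19), whose definition follows:

```python
def reciprocity(num_users, num_edges):
    # Eq. (17): 2|R_TC| / (|U_TC| - 1)
    return 2.0 * num_edges / (num_users - 1)

def normalize(scores):
    # Shared form of Eqs. (15), (16), and (18): each TC's raw score is
    # divided by the sum over all TCs, yielding values in [0, 1]
    total = sum(scores.values())
    return {tc: s / total for tc, s in scores.items()}

def rank(mf, ua, ui, alpha=1/3, beta=1/3, gamma=1/3):
    # Eq. (19): r(TC) = MF^alpha * UA^beta * UI^gamma, alpha+beta+gamma = 1
    return {tc: mf[tc] ** alpha * ua[tc] ** beta * ui[tc] ** gamma for tc in mf}
```

With equal exponents, a topic weak in any single factor is pulled down in the final ordering, which is the intended effect of the geometric combination.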
This metric represents the overall ranking of a topic. Equation (19) is employed to calculate the overall ranking of a TC:

∀TC ∈ G: r(TC) = MF(TC)^α · UA(TC)^β · UI(TC)^γ    (19)

where α + β + γ = 1. In our experiments, all three exponents are set to the same value (1/3), though each factor in the equation can be weighted differently depending on the application at hand. The TCs are then sorted in descending order by their r value.

IV. EXPERIMENTS AND RESULTS

The testing dataset consists of tweets crawled from Twitter's public timeline and news articles crawled from popular news websites during the period between November 1, 2013 and February 28, 2014. The news websites crawled were cnn.com, bbc.com, cbsnews.com, reuters.com, abcnews.com, and usatoday.com. Over the specified period of time, a total of news articles and bilingual tweets were collected. After non-English tweets were discarded, tweets remained. The dataset was divided into two partitions.

1) Data from January and February 2014 were used as the testing dataset, on which experiments were performed for the overall method evaluation.
2) Data from November and December 2013 were used as the control dataset, where experiments were performed to establish adequate thresholds and select measures that presented the best results.

A. Method Evaluation

The evaluation of topic ranking is quite challenging, as the interpretation of the results is generally subjective. However, in an attempt to show that the ranked topics are indeed those that users would prefer, a method for ranking popular news topics must be established. To retrieve the most popular topics, Google's news aggregation service [48] was utilized. For each day from November 1, 2013 to February 28, 2014, the top 10 news stories displayed on this site were collected at the end of the day.

TABLE I SOME STATISTICS RELEVANT TO THE TESTING DATASET

Next, 20 master's and doctoral students were asked to view the titles of the top 10 news stories from each day and select the ones they considered relevant. Each participant was required to select a minimum of two articles per day and a maximum of all 10. The participants' results were then partitioned into 12 date ranges: 1) November 1, 2013–November 10, 2013; 2) November 11, 2013–November 20, 2013; 3) November 21, 2013–November 30, 2013; 4) December 1, 2013–December 10, 2013; 5) December 10, 2013–December 20, 2013; 6) December 20, 2013–December 31, 2013; 7) January 1, 2014–January 10, 2014; 8) January 11, 2014–January 20, 2014; 9) January 21, 2014–January 31, 2014; 10) February 1, 2014–February 10, 2014; 11) February 11, 2014–February 20, 2014; and 12) February 21, 2014–February 28, 2014. As explained earlier, the first six date ranges were utilized for the controlled experiments and the last six for the method evaluation. The stories for each range were manually grouped into topics, where each topic was given a score for its corresponding time range. The score for each topic is calculated by multiplying the number of times the participants voted for the stories in the topic by the total number of stories in it. For instance, if in the time range January 1, 2014–January 10, 2014 a topic contained five stories (which of course occurred on five different days within that time range), and the total votes for those stories was 15, then the score of that topic would be 75. The topics were finally ranked in descending order of their score, resulting in a ranked topic list for each time range. We will hereafter refer to this ranked list of topics as voted topics. For each of the aforementioned time ranges we set the start (d_1) and end (d_2) dates accordingly and executed SociRank.
Some statistics relevant to the news topics extracted for the testing dataset (i.e., from January and February 2014) are shown in Table I. The second column in the table indicates the total number of topics extracted per time period. The third, fourth, and fifth columns show the average number of tweets, news stories, and Twitter users associated with each topic, respectively.

1) Topic Selection Evaluation: First, we evaluate the topic-selection completeness of SociRank and the media-focus-only approach (hereafter referred to as MF).³ To represent the MF method, we simply manipulate (19) by setting β and γ to 0 and α to 1. To evaluate completeness, all of the topics (per time range) selected by SociRank and by the MF approach were compared with those in the voted topics.

³The results for UA and UI alone are not shown in our experiments, as these have no topic-specific significance without the MF component. In other words, without the element of MF, any identified topics would only reflect what is popular among users, without these necessarily being related to the media (news).

Fig. 6. Percentage of overlap between all voted topics and all topics selected by SociRank and MF.

Fig. 6 shows the percentage of topics selected by SociRank and by MF that overlap with the voted topics. It can be seen that SociRank clearly outperforms MF in terms of overlap with the voted topics (i.e., those topics that users selected as the most important). This indicates that SociRank is better at discovering prevalent news topics that users find interesting when compared to a method that only utilizes data from the news media.

2) Topic Ranking Evaluation: Next, we evaluate the ranking of topics using the SociRank and MF ranking formulas, selecting only the top k topics from each approach. Again, we compare the ranked topics for each time range with the voted topics.
We evaluated the top 10, 20, 30, and 40 topics for both methods and calculated the percentage of topics that overlapped with the top 10, 20, 30, and 40 voted topics, respectively. For instance, if we are comparing the top 10 topics using SociRank with the voted topics, and five topics overlap, then the overlap percentage would be 5/10. Fig. 7 shows the percentage of overlap (for each of the time ranges) between the top 10 voted topics and the top 10 topics using the SociRank and MF approaches. In the figure, the green line represents the MF overlap percentage plus 1 standard deviation. If the SociRank bar surpasses this line, it indicates that there is a large enough difference between the two overlap percentages for the improvement to be considered significant. However, as can be seen in the figure, this does not occur: there is no significant difference between the two methods in the top 10 ranked list. In the top 20, 30, and 40 ranked lists, however, the results favored SociRank, as can be seen in Figs. 8–10, respectively. In each of these lists, the overlap percentage of SociRank significantly surpasses that of the MF approach. Summarizing our findings, Fig. 11 represents the average overlap percentages of the top 10, 20, 30, and 40 ranked lists for both methods. The figure clearly illustrates that, with the exception of

the top 10 list, SociRank outperformed the MF method by a significant margin.

Fig. 7. Percentage of overlap between top 10 voted topics and top 10 topics selected by SociRank and MF.

Fig. 8. Percentage of overlap between top 20 voted topics and top 20 topics selected by SociRank and MF.

Fig. 9. Percentage of overlap between top 30 voted topics and top 30 topics selected by SociRank and MF.

Fig. 10. Percentage of overlap between top 40 voted topics and top 40 topics selected by SociRank and MF.

B. Controlled Experiments

In this section, we show experiments performed on some of the variables and components used in our method, namely, the co-occurrence measure, IQR coefficient, and node weighting. For the controlled experiments, the control dataset was divided into six partitions, with each partition representing ten days' worth of news and tweets. The time periods tested were those corresponding to November and December 2013.

1) Co-Occurrence Measure: In Section III-B3, it was explained that five similarity measures were tested: 1) Dice; 2) cosine; 3) Jaccard; 4) PMI; and 5) overlap. To determine which similarity measure performs best, all parameters in SociRank were controlled, with the exception of the similarity measure. In order to measure the effectiveness of each similarity measure, the number of correctly identified topics that each measure produced was counted. Fig. 12 shows the number of correctly identified topics for each similarity measure, grouped by time period. As can be seen in the figure, the Dice similarity measure produced the greatest number of correctly identified topics. The closest competitor to Dice was the cosine similarity measure, but even this measure fell noticeably short of producing similar numbers. The other measures performed poorly on our dataset, and in some instances were unable to identify any topics at all.
The Dice similarity measure was therefore selected as the one best suited to our needs. Depending on the dataset used, however, other similarity measures may potentially produce similar or better results than Dice.

2) IQR Coefficient: To estimate the most appropriate value for coefficient c in (9), several values were tested, starting from c = 1 and taking increments of 1. The previously listed

13 DAVIS et al.: SOCIRANK: IDENTIFYING AND RANKING PREVALENT NEWS TOPICS USING SOCIAL MEDIA FACTORS 991 Fig. 11. Average percentage of overlap between top k voted topics and top k topics selected by SociRank and MF. Fig. 13. Evaluation of different values for IQR coefficient c in the outlier detection formula (9). Fig. 12. Evaluation of different co-occurrence similarity measures used as edge weights in graph G (Section III-B3). time periods were utilized, and all other parameters besides c were controlled. For each value of c, the number of correctly identified topics was counted, and the value that yielded the highest number was considered most effective one. Fig. 13 shows the number of correctly identified topics for different values of c. It can be seen that c = 4 yielded the best results. When c was in the range between 1 and 4, the number of correctly identified topics increased with each increment of 1, but setting c = 5 decreased the count significantly. Thus, c = 4 was subsequently selected as the optimum value in our experiments. 3) Node Weighting: The node-weighting formula (13) presented in Section III-D1 takes into account both number of edges connected to a node and their weights. A more naive approach might be to utilize only the number of edges connected to a node as a representation of its weight. We compared both of these approaches to the PageRank weighting algorithm [16] to ascertain which approach provided the best results, as shown in Fig. 14. In the figure, weighting the nodes simply by the number of edges is referred to as edge, and weighting by both number of edges and their weights (13) is referred to as edge-weight. Recalling from Section III-D1, node weighting is used to correctly select related content from Twitter. We therefore use this same measure to evaluate the effectiveness of each nodeweighting approach. The bars in Fig. 
14 represent the number of correctly selected tweets by each node-weighting approach, and the line graph represents the precision of each approach. We calculate precision as

precision = R+ / (R+ + R-)    (20)

where R+ is the number of correct records retrieved and R- is the number of incorrect records retrieved. In our experiments, thirty random topics from the aforementioned time periods were selected (five from each period), and each node-weighting approach was used to retrieve tweets related to those topics. Only 10% of the tweets were used. As can be seen in Fig. 14, the precision with which correct tweets were selected was the same, or differed very little, in most cases, regardless of the approach. Although the differences in precision were negligible, the same cannot be said about the number of correct tweets selected: the edge-weight approach (13) yielded far more correctly selected tweets than the other two approaches. Since selecting a higher number of correctly related tweets at similar or greater precision is preferable, the nodes in the TCs are ultimately weighted using our approach.

C. Comparison of Ranked Topic Lists

Lastly, we provide an example of a ranked list of topics produced by the MF approach and one produced by SociRank.
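As a concrete illustration of the node-weighting comparison and the precision measure in (20) above, the following sketch scores the nodes of a toy keyword co-occurrence graph both by plain degree (the edge variant) and by a simple sum of incident edge weights. The weighted-sum scoring is used here only as a hypothetical stand-in for (13), whose exact form is given in Section III-D1; the graph, keyword names, and weights are illustrative, not taken from the dataset.

```python
from collections import defaultdict

# Toy keyword co-occurrence graph: (node_a, node_b, edge weight).
# Names and weights are illustrative only.
edges = [
    ("olympics", "sochi", 0.9),
    ("olympics", "medal", 0.7),
    ("sochi", "medal", 0.4),
    ("medal", "hockey", 0.2),
]

degree = defaultdict(int)      # "edge" variant: count of incident edges
strength = defaultdict(float)  # "edge-weight" stand-in: sum of incident weights
for a, b, w in edges:
    for node in (a, b):
        degree[node] += 1
        strength[node] += w

def precision(r_plus, r_minus):
    """Precision as defined in (20): R+ / (R+ + R-)."""
    return r_plus / (r_plus + r_minus)

print(max(degree, key=degree.get))      # "medal" has the most edges
print(max(strength, key=strength.get))  # "olympics" has the strongest ties
print(precision(8, 2))                  # 0.8
```

Note how the two variants can disagree: a node with several weak edges may outrank a node with a few strong edges under plain degree, which is precisely the distinction evaluated in Fig. 14.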

Fig. 14. Evaluation of different node-weighting approaches (Section III-D1).

TABLE II. Comparison of ranked topic lists produced by the MF and SociRank methods.

Table II shows the top 10 topics obtained by the MF and SociRank approaches for the February 1, 2014 to February 10, 2014 time period. In the table, cells with new topics are shaded in green and cells with topics that were removed from the MF list are shaded in red. As can be seen, using all three factors produced a list with three new topics that did not appear in the MF list. Additionally, many of the topics in the SociRank list moved up or down in position compared with the MF list. This example shows that SociRank produces a very different ranked list of news topics, which suggests that relying only on the high-frequency news topics provided by the media does not necessarily give insight into what users are interested in or consider important. Utilizing the UA and UI factors can also uncover popular topics that the mass media previously considered insignificant, such as the three topics shaded in green in Table II. This information can prompt media sources to give continuity or greater importance to certain news stories that were perhaps thought of as unpopular. Taken together, these results emphasize that MF alone is a poor estimator of what users find interesting or consider important, and should therefore not be used on its own. SociRank, on the other hand, proves more capable in this regard, and we conclude that the information it provides can be vital in commerce-oriented areas where user interest is paramount. V.
CONCLUSION

In this paper, we proposed an unsupervised method, SociRank, which identifies news topics prevalent in both social media and the news media, and then ranks them by taking into account their MF, UA, and UI as relevance factors. The temporal prevalence of a particular topic in the news media is considered its MF, which gives us insight into its mass media popularity. The temporal prevalence of the topic in social media, specifically Twitter, indicates user interest and is considered its UA. Finally, the interaction between the social media users who mention the topic indicates the strength of the community discussing it and is considered the UI. To the best of our knowledge, no other work has attempted to employ either the interests of social media users or their social relationships to aid in the ranking of topics. Consolidated, filtered, and ranked news topics from both professional news providers and individuals have several benefits. One of their main uses is increasing the quality and variety

of news recommender systems, as well as discovering hidden, popular topics. Our system can aid news providers by providing feedback on topics that have been discontinued by the mass media but are still being discussed by the general population. SociRank can also be extended and adapted to areas other than news, such as science, technology, sports, and other trends. We have performed extensive experiments to test the performance of SociRank, including controlled experiments for its different components. SociRank has been compared to media-focus-only ranking by utilizing results obtained from a manual voting method as the ground truth. In the voting method, 20 individuals were asked to rank topics from specified time periods based on their perceived importance. The evaluation provides evidence that our method is capable of effectively selecting prevalent news topics and ranking them based on the three previously mentioned measures of importance. Our results present a clear distinction between ranking topics by MF only and ranking them by also including UA and UI. This distinction underpins the importance of this paper and clearly demonstrates the shortcomings of relying solely on the mass media for topic ranking. As future work, we intend to evaluate and extend SociRank on different areas and datasets. Furthermore, we plan to include other forms of UA, such as search engine click-through rates, which can be integrated into our method to provide even more insight into the true interests of users. Additional experiments will also be performed on different stages of the methodology; for example, a fuzzy clustering approach could be employed to obtain overlapping TCs (Section III-C). Lastly, we intend to develop a personalized version of SociRank, in which topics are presented differently to each individual user.

REFERENCES

[1] D. M.
Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3.
[2] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. 15th Conf. Uncertainty Artif. Intell., 1999.
[3] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, Berkeley, CA, USA, 1999.
[4] C. Wartena and R. Brussee, "Topic detection by clustering keywords," in Proc. 19th Int. Workshop Database Expert Syst. Appl. (DEXA), Turin, Italy, 2008.
[5] F. Archetti, P. Campanelli, E. Fersini, and E. Messina, "A hierarchical document clustering environment based on the induced bisecting k-means," in Proc. 7th Int. Conf. Flexible Query Answering Syst., Milan, Italy, 2006.
[6] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press.
[7] M. Cataldi, L. Di Caro, and C. Schifanella, "Emerging topic detection on Twitter based on temporal and social terms evaluation," in Proc. 10th Int. Workshop Multimedia Data Min. (MDMKDD), Washington, DC, USA, 2010, Art. no. 4.
[8] W. X. Zhao et al., "Comparing Twitter and traditional media using topic models," in Advances in Information Retrieval. Heidelberg, Germany: Springer, 2011.
[9] Q. Diao, J. Jiang, F. Zhu, and E.-P. Lim, "Finding bursty topics from microblogs," in Proc. 50th Annu. Meeting Assoc. Comput. Linguist. (Long Papers).
[10] H. Yin, B. Cui, H. Lu, Y. Huang, and J. Yao, "A unified model for stable and temporal topic detection from social media data," in Proc. IEEE 29th Int. Conf. Data Eng. (ICDE), Brisbane, QLD, Australia, 2013.
[11] C. Wang, M. Zhang, L. Ru, and S. Ma, "Automatic online news topic ranking using media focus and user attention based on aging theory," in Proc. 17th Conf. Inf. Knowl. Manag., Napa County, CA, USA, 2008.
[12] C. C. Chen, Y.-T. Chen, Y. Sun, and M. C. Chen, "Life cycle modeling of news events using aging theory," in Machine Learning: ECML. Heidelberg, Germany: Springer, 2003.
[13] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling, "TwitterStand: News in tweets," in Proc. 17th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., Seattle, WA, USA, 2009.
[14] O. Phelan, K. McCarthy, and B. Smyth, "Using Twitter to recommend real-time topical news," in Proc. 3rd Conf. Recommender Syst., New York, NY, USA, 2009.
[15] K. Shubhankar, A. P. Singh, and V. Pudi, "An efficient algorithm for topic ranking and modeling topic evolution," in Database Expert Syst. Appl., Toulouse, France, 2011.
[16] S. Brin and L. Page, "Reprint of: The anatomy of a large-scale hypertextual web search engine," Comput. Netw., vol. 56, no. 18.
[17] E. Kwan, P.-L. Hsu, J.-H. Liang, and Y.-S. Chen, "Event identification for social streams using keyword-based evolving graph sequences," in Proc. IEEE/ACM Int. Conf. Adv. Soc. Netw. Anal. Min., Niagara Falls, ON, Canada, 2013.
[18] K. Kireyev, "Semantic-based estimation of term informativeness," in Proc. Human Lang. Technol. Annu. Conf. North Amer. Chapter Assoc. Comput. Linguist., 2009.
[19] G. Salton, C.-S. Yang, and C. T. Yu, "A theory of term importance in automatic text analysis," J. Amer. Soc. Inf. Sci., vol. 26, no. 1.
[20] H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information," IBM J. Res. Develop., vol. 1, no. 4.
[21] J. D. Cohen, "Highlights: Language- and domain-independent automatic indexing terms for abstracting," J. Amer. Soc. Inf. Sci., vol. 46, no. 3.
[22] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," Int. J. Artif. Intell. Tools, vol. 13, no. 1.
[23] R. Mihalcea and P. Tarau, "TextRank: Bringing order into texts," in Proc. EMNLP, Barcelona, Spain.
[24] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning, "KEA: Practical automatic keyphrase extraction," in Proc. 4th ACM Conf. Digit. Libr., Berkeley, CA, USA, 1999.
[25] P. D. Turney, "Learning algorithms for keyphrase extraction," Inf. Retrieval, vol. 2, no. 4.
[26] J. Wang, H. Peng, and J.-S. Hu, "Automatic keyphrases extraction from document using neural network," in Advances in Machine Learning and Cybernetics. Heidelberg, Germany: Springer, 2006.
[27] T. Jo, M. Lee, and T. M. Gatton, "Keyword extraction from documents using a neural network model," in Proc. Int. Conf. Hybrid Inf. Technol. (ICHIT).
[28] K. Sarkar, M. Nasipuri, and S. Ghose, "A new approach to keyphrase extraction using neural networks," Int. J. Comput. Sci. Issues, vol. 7, no. 3.
[29] C. Zhang, "Automatic keyword extraction from documents using conditional random fields," J. Comput. Inf. Syst., vol. 4, no. 3.
[30] G. Figueroa and Y.-S. Chen, "Collaborative ranking between supervised and unsupervised approaches for keyphrase extraction," in Proc. Conf. Comput. Linguist. Speech Process. (ROCLING), 2014.
[31] H.-H. Chen, M.-S. Lin, and Y.-C. Wei, "Novel association measures using Web search with double checking," in Proc. 21st Int. Conf. Comput. Linguist. and 44th Annu. Meeting Assoc. Comput. Linguist., 2006.
[32] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using Web search engines," in Proc. WWW, Banff, AB, Canada, 2007.
[33] D. Szymkiewicz, "Etude comparative de la distribution florale," Rev. Forest., vol. 1, 1926.

[34] L. R. Dice, "Measures of the amount of ecologic association between species," Ecology, vol. 26, no. 3.
[35] K. W. Church and P. Hanks, "Word association norms, mutual information, and lexicography," Comput. Linguist., vol. 16, no. 1.
[36] P. Jaccard, Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Lausanne, Switzerland: Impr. Corbaz.
[37] H. Iwasaka and K. Tanaka-Ishii, "Clustering co-occurrence graph based on transitivity," presented at the 5th Workshop Very Large Corpora, 1997.
[38] Y. Matsuo, T. Sakaki, K. Uchiyama, and M. Ishizuka, "Graph-based word clustering using a Web search engine," in Proc. Conf. Empir. Methods Nat. Lang. Process., 2006.
[39] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proc. Nat. Acad. Sci., vol. 99, no. 12.
[40] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, no. 6, 2004.
[41] Twitter. [Online].
[42] K. Gimpel et al., "Part-of-speech tagging for Twitter: Annotation, features, and experiments," in Proc. 49th Annu. Meeting Assoc. Comput. Linguist. Human Lang. Technol. (Short Papers), Portland, OR, USA, 2011.
[43] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inf. Theory, vol. IT-13, no. 2.
[44] M. Hubert and S. Van der Veeken, "Outlier detection for skewed data," J. Chemometr., vol. 22, nos. 3-4.
[45] U. Brandes, "On variants of shortest-path betweenness centrality and their generic computation," Soc. Netw., vol. 30, no. 2.
[46] D. B. Johnson, "Efficient algorithms for shortest paths in sparse networks," J. ACM, vol. 24, no. 1, pp. 1-13.
[47] R. W. Floyd, "Algorithm 97: Shortest path," Commun. ACM, vol. 5, no. 6, p. 345.
[48] Google News. [Online].

Derek Davis was born in Belize City, Belize. He received the B.Sc. degree in information technologies from the University of Belize, Belmopan, Belize, and the M.Sc. degree in information systems and applications from National Tsing Hua University, Hsinchu, Taiwan. He was a Systems Analyst/Database Administrator for a Belizean crude oil drilling company and is currently a Systems Analyst in the banking and finance industry. His research interests include keyword extraction, summarization, and natural language processing.

Gerardo Figueroa was born in Tegucigalpa, Honduras. He received the B.Sc. degree in computational systems engineering from Universidad Tecnologica Centroamericana, Tegucigalpa, and the M.Sc. degree in information systems and applications from National Tsing Hua University, Hsinchu, Taiwan, where he is currently pursuing the Ph.D. degree in information systems and applications. From 2011 to 2015, he was a Teaching Assistant with the Institute of Information Systems and Applications and the Department of Computer Science, National Tsing Hua University. His current research interests include keyword extraction and summarization, natural language processing, and machine learning.

Yi-Shin Chen received the B.A. and M.B.A. degrees in information management from National Central University, Taoyuan, Taiwan, in 1996 and 1997, respectively, and the Ph.D. degree in computer science from the University of Southern California, Los Angeles, CA, USA. She joined the Department of Computer Science, National Tsing Hua University, as an Assistant Professor. She is passionate about increasing the benefits to society through her research efforts. Her goal is to create a social media interface that explores and visualizes Web data easily. Her current research interests include Web intelligence and integration.


A Destination Prediction Method Based on Behavioral Pattern Analysis of Nonperiodic Position Logs A Destination Prediction Method Based on Behavioral Pattern Analysis of Nonperiodic Position Logs Fumitaka Nakahara * and Takahiro Murakami ** * Service Platforms Research Laboratories, NEC Corporation

More information

Online Student Guide Types of Control Charts

Online Student Guide Types of Control Charts Online Student Guide Types of Control Charts OpusWorks 2016, All Rights Reserved 1 Table of Contents LEARNING OBJECTIVES... 4 INTRODUCTION... 4 DETECTION VS. PREVENTION... 5 CONTROL CHART UTILIZATION...

More information

Limits of Software Reuse

Limits of Software Reuse Technical Note Issued: 07/2006 Limits of Software Reuse L. Holenderski Philips Research Eindhoven c Koninklijke Philips Electronics N.V. 2006 Authors address: L. Holenderski WDC3-044; leszek.holenderski@philips.com

More information

Cold-start Solution to Location-based Entity Shop. Recommender Systems Using Online Sales Records

Cold-start Solution to Location-based Entity Shop. Recommender Systems Using Online Sales Records Cold-start Solution to Location-based Entity Shop Recommender Systems Using Online Sales Records Yichen Yao 1, Zhongjie Li 2 1 Department of Engineering Mechanics, Tsinghua University, Beijing, China yaoyichen@aliyun.com

More information

How to Create a Dataset from Social Media: Theory and Demonstration

How to Create a Dataset from Social Media: Theory and Demonstration How to Create a Dataset from Social Media: Theory and Demonstration Richard N. Landers Old Dominion University @rnlanders rnlanders@odu.edu CARMA October 2017 Agenda/Learning Objectives 1. Foundational

More information

Determining the Factors that Drive Twitter Engagement-Rates

Determining the Factors that Drive Twitter Engagement-Rates Archives of Business Research Vol.5, No.2 Publication Date: February. 25, 2017 DOI: 10.14738/abr.52.2700 Semiz, G. & Berger, P. D. (2017). Determining the Factors that Drive Twitter Engagement-Rates. Archives

More information

AI that creates professional opportunities at scale

AI that creates professional opportunities at scale AI that creates professional opportunities at scale Deepak Agarwal AI in practice, AAAI, San Francisco, USA Feb 6 th, 2017 We are seeking new professional opportunities all the time To Advance our Careers

More information

Examining and Modeling Customer Service Centers with Impatient Customers

Examining and Modeling Customer Service Centers with Impatient Customers Examining and Modeling Customer Service Centers with Impatient Customers Jonathan Lee A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF APPLIED SCIENCE DEPARTMENT

More information

Using Decision Tree to predict repeat customers

Using Decision Tree to predict repeat customers Using Decision Tree to predict repeat customers Jia En Nicholette Li Jing Rong Lim Abstract We focus on using feature engineering and decision trees to perform classification and feature selection on the

More information

Influence of Time on User Profiling and Recommending Researchers in Social Media. Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp

Influence of Time on User Profiling and Recommending Researchers in Social Media. Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp Influence of Time on User Profiling and Recommending Researchers in Social Media Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp 1 Motivation Construct user s professional profiles from social media

More information

Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge

Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2007 Prentice Hall Chapter Objectives To explain how knowledge is discovered

More information

Finding Your Food Soulmate

Finding Your Food Soulmate CS 224W Information Networks Angela Sy, Lawrence Liu, Blanca Villanueva Dec. 9, 2014 Project Final Report Finding Your Food Soulmate 1. Abstract Individuals in social networks are often unaware of people

More information

Predicting Popularity of Messages in Twitter using a Feature-weighted Model

Predicting Popularity of Messages in Twitter using a Feature-weighted Model International Journal of Advanced Intelligence Volume 0, Number 0, pp.xxx-yyy, November, 20XX. c AIA International Advanced Information Institute Predicting Popularity of Messages in Twitter using a Feature-weighted

More information

Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation

Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation Superimposing Activity-Travel Sequence Conditions on GPS Data Imputation Tao Feng 1, Harry J.P. Timmermans 2 1 Urban Planning Group, Department of the Built Environment, Eindhoven University of Technology,

More information

A Case Study on Mining Social Media Data

A Case Study on Mining Social Media Data Chan, H. K. and Lacka, E. and Yee, R. W. Y. and Lim, M. K. (2014) A case study on mining social media data. In: IEEM 2014, 2014-12-09-2014-12-12, Sunway Resort Hotel & Spa., http://dx.doi.org/10.1109/ieem.2014.7058707

More information

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING Abbas Heiat, College of Business, Montana State University, Billings, MT 59102, aheiat@msubillings.edu ABSTRACT The purpose of this study is to investigate

More information

ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring

ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring 1 Kavyashree M Bandekar, 2 Maddala Tejasree, 3 Misba Sultana S N, 4 Nayana G K, 5 Harshavardhana Doddamani 1, 2, 3, 4 Engineering

More information

Assistant Professor Neha Pandya Department of Information Technology, Parul Institute Of Engineering & Technology Gujarat Technological University

Assistant Professor Neha Pandya Department of Information Technology, Parul Institute Of Engineering & Technology Gujarat Technological University Feature Level Text Categorization For Opinion Mining Gandhi Vaibhav C. Computer Engineering Parul Institute Of Engineering & Technology Gujarat Technological University Assistant Professor Neha Pandya

More information

BPIC 13: Mining an Incident Management Process

BPIC 13: Mining an Incident Management Process BPIC 13: Mining an Incident Management Process Emmy Dudok, Peter van den Brand Perceptive Software, 22701 West 68th Terrace, Shawnee, KS 66226, United States {Emmy.Dudok, Peter.Brand}@perceptivesoftware.com

More information

Creating a Transit Supply Index. Andrew Keller Regional Transportation Authority and University of Illinois at Chicago

Creating a Transit Supply Index. Andrew Keller Regional Transportation Authority and University of Illinois at Chicago Creating a Transit Supply Index Andrew Keller Regional Transportation Authority and University of Illinois at Chicago Presented at Transport Chicago Conference June 1, 2012 Introduction This master's project

More information

Quality Assessment Method for Software Development Process Document based on Software Document Characteristics Metric

Quality Assessment Method for Software Development Process Document based on Software Document Characteristics Metric Quality Assessment Method for Software Development Process Document based on Software Document Characteristics Metric Patra Thitisathienkul, Nakornthip Prompoon Department of Computer Engineering Chulalongkorn

More information

A Summer Internship Project Report. Sparse Aspect Rating Model for Sentiment Summarization

A Summer Internship Project Report. Sparse Aspect Rating Model for Sentiment Summarization A Summer Internship Project Report On Sparse Aspect Rating Model for Sentiment Summarization Carried out at the Institute for Development and Research in Banking Technology, Hyderabad Established by Reserve

More information

A Systematic Approach to Performance Evaluation

A Systematic Approach to Performance Evaluation A Systematic Approach to Performance evaluation is the process of determining how well an existing or future computer system meets a set of alternative performance objectives. Arbitrarily selecting performance

More information

Data Mining in CRM THE CRM STRATEGY

Data Mining in CRM THE CRM STRATEGY CHAPTER ONE Data Mining in CRM THE CRM STRATEGY Customers are the most important asset of an organization. There cannot be any business prospects without satisfied customers who remain loyal and develop

More information

Predicting ratings of peer-generated content with personalized metrics

Predicting ratings of peer-generated content with personalized metrics Predicting ratings of peer-generated content with personalized metrics Project report Tyler Casey tyler.casey09@gmail.com Marius Lazer mlazer@stanford.edu [Group #40] Ashish Mathew amathew9@stanford.edu

More information

INTERACTIVE ANNEALING FOR UNDERSTANDING INVISIBLE DARK EVENTS

INTERACTIVE ANNEALING FOR UNDERSTANDING INVISIBLE DARK EVENTS INTERACTIVE ANNEALING FOR UNDERSTANDING INVISIBLE DARK EVENTS Yoshiharu Maeno and Yukio Ohsawa Graduate School of Systems Management, University of Tsukuba School of Engineering, University of Tokyo maeno.yoshiharu@nifty.com

More information

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement Atef Chaudhury University of Waterloo Waterloo, ON N2L 3G1, Canada a29chaud@uwaterloo.ca Myunghwan Kim LinkedIn

More information

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong Machine learning models can be used to predict which recommended content users will click on a given website.

More information

Combinational Collaborative Filtering: An Approach For Personalised, Contextually Relevant Product Recommendation Baskets

Combinational Collaborative Filtering: An Approach For Personalised, Contextually Relevant Product Recommendation Baskets Combinational Collaborative Filtering: An Approach For Personalised, Contextually Relevant Product Recommendation Baskets Research Project - Jai Chopra (338852) Dr Wei Wang (Supervisor) Dr Yifang Sun (Assessor)

More information

The Heterogeneity Principle in Evaluation Measures for Automatic Summarization

The Heterogeneity Principle in Evaluation Measures for Automatic Summarization The Heterogeneity Principle in Evaluation Measures for Automatic Summarization Enrique Amigó Julio Gonzalo Felisa Verdejo UNED, Madrid {enrique,julio,felisa}@lsi.uned.es Abstract The development of summarization

More information

SAS Visual Analytics: Text Analytics Using Word Clouds

SAS Visual Analytics: Text Analytics Using Word Clouds ABSTRACT Paper 1687-2018 SAS Visual Analytics: Text Analytics Using Word Clouds Jenine Milum, Atlanta, GA, USA There exists a limit to employee skills and capacity to efficiently analyze volumes of textual

More information

The Science of Social Media. Kristina Lerman USC Information Sciences Institute

The Science of Social Media. Kristina Lerman USC Information Sciences Institute The Science of Social Media Kristina Lerman USC Information Sciences Institute ML meetup, July 2011 What is a science? Explain observed phenomena Make verifiable predictions Help engineer systems with

More information

Introduction to Software Metrics

Introduction to Software Metrics Introduction to Software Metrics Outline Today we begin looking at measurement of software quality using software metrics We ll look at: What are software quality metrics? Some basic measurement theory

More information

WE consider the general ranking problem, where a computer

WE consider the general ranking problem, where a computer 5140 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 11, NOVEMBER 2008 Statistical Analysis of Bayes Optimal Subset Ranking David Cossock and Tong Zhang Abstract The ranking problem has become increasingly

More information

User Profiling in an Ego Network: Co-profiling Attributes and Relationships

User Profiling in an Ego Network: Co-profiling Attributes and Relationships User Profiling in an Ego Network: Co-profiling Attributes and Relationships Rui Li, Chi Wang, Kevin Chen-Chuan Chang, {ruili1, chiwang1, kcchang}@illinois.edu Department of Computer Science, University

More information

Product Aspect Ranking Based on the Customers Reviews.

Product Aspect Ranking Based on the Customers Reviews. Product Aspect Ranking Based on the Customers Reviews. Kadam Snehal, Dhomse Kalyani, Bhumkar Smita, Janrao Swati, Ambhore Shakuntala BE Scholar, Department of Computer Engineering S.R.E.S. s College of

More information

CHAPTER 4 A FRAMEWORK FOR CUSTOMER LIFETIME VALUE USING DATA MINING TECHNIQUES

CHAPTER 4 A FRAMEWORK FOR CUSTOMER LIFETIME VALUE USING DATA MINING TECHNIQUES 49 CHAPTER 4 A FRAMEWORK FOR CUSTOMER LIFETIME VALUE USING DATA MINING TECHNIQUES 4.1 INTRODUCTION Different groups of customers prefer some special products. Customers type recognition is one of the main

More information

A Survey on Recommendation Techniques in E-Commerce

A Survey on Recommendation Techniques in E-Commerce A Survey on Recommendation Techniques in E-Commerce Namitha Ann Regi Post-Graduate Student Department of Computer Science and Engineering Karunya University, India P. Rebecca Sandra Assistant Professor

More information

SOFTWARE DEVELOPMENT PRODUCTIVITY FACTORS IN PC PLATFORM

SOFTWARE DEVELOPMENT PRODUCTIVITY FACTORS IN PC PLATFORM SOFTWARE DEVELOPMENT PRODUCTIVITY FACTORS IN PC PLATFORM Abbas Heiat, College of Business, Montana State University-Billings, Billings, MT 59101, 406-657-1627, aheiat@msubillings.edu ABSTRACT CRT and ANN

More information

Estimation of social network user s influence in a given area of expertise

Estimation of social network user s influence in a given area of expertise Journal of Physics: Conference Series PAPER OPEN ACCESS Estimation of social network user s influence in a given area of expertise To cite this article: E E Luneva et al 2017 J. Phys.: Conf. Ser. 803 012089

More information

The Interaction-Interaction Model for Disease Protein Discovery

The Interaction-Interaction Model for Disease Protein Discovery The Interaction-Interaction Model for Disease Protein Discovery Ken Cheng December 10, 2017 Abstract Network medicine, the field of using biological networks to develop insight into disease and medicine,

More information

TwitterRank: Finding Topicsensitive Influential Twitterers

TwitterRank: Finding Topicsensitive Influential Twitterers TwitterRank: Finding Topicsensitive Influential Twitterers Jianshu Weng, Ee-Peng Lim, Jing Jiang Singapore Management University Qi He Pennsylvania State University WSDM 2010 Feb 5, 2010 Outline Introduction

More information

WaterlooClarke: TREC 2015 Total Recall Track

WaterlooClarke: TREC 2015 Total Recall Track WaterlooClarke: TREC 2015 Total Recall Track Haotian Zhang, Wu Lin, Yipeng Wang, Charles L. A. Clarke and Mark D. Smucker Data System Group University of Waterloo TREC, 2015 Haotian Zhang, Wu Lin, Yipeng

More information

Advancing Information Management and Analysis with Entity Resolution. Whitepaper ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION

Advancing Information Management and Analysis with Entity Resolution. Whitepaper ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION Advancing Information Management and Analysis with Entity Resolution Whitepaper February 2016 novetta.com 2016, Novetta ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION Advancing Information

More information

TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011

TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011 TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011 1 Develop Models for Customers Likely to Churn Churn is a term used to indicate a customer leaving the service of one company in

More information

Strength in numbers? Modelling the impact of businesses on each other

Strength in numbers? Modelling the impact of businesses on each other Strength in numbers? Modelling the impact of businesses on each other Amir Abbas Sadeghian amirabs@stanford.edu Hakan Inan inanh@stanford.edu Andres Nötzli noetzli@stanford.edu. INTRODUCTION In many cities,

More information

Speech Analytics Transcription Accuracy

Speech Analytics Transcription Accuracy Speech Analytics Transcription Accuracy Understanding Verint s speech analytics transcription and categorization accuracy Verint.com Twitter.com/verint Facebook.com/verint Blog.verint.com Table of Contents

More information

Text Categorization. Hongning Wang

Text Categorization. Hongning Wang Text Categorization Hongning Wang CS@UVa Today s lecture Bayes decision theory Supervised text categorization General steps for text categorization Feature selection methods Evaluation metrics CS@UVa CS

More information

Text Categorization. Hongning Wang

Text Categorization. Hongning Wang Text Categorization Hongning Wang CS@UVa Today s lecture Bayes decision theory Supervised text categorization General steps for text categorization Feature selection methods Evaluation metrics CS@UVa CS

More information

STAT 2300: Unit 1 Learning Objectives Spring 2019

STAT 2300: Unit 1 Learning Objectives Spring 2019 STAT 2300: Unit 1 Learning Objectives Spring 2019 Unit tests are written to evaluate student comprehension, acquisition, and synthesis of these skills. The problems listed as Assigned MyStatLab Problems

More information