Discovery of Trending Topics in Microblog Streams Based on Contextual Search


Journal of Computational Information Systems 10: 2 (2014)

Wenjun ZHANG, Ning ZHENG, Yizhi REN, Jian XU, Haiping ZHANG, Ming XU
College of Computer, Hangzhou Dianzi University, Hangzhou, China

Abstract

With the rapid growth of microblogging, the need to discover trending topics in microblogs has become increasingly pressing. Although topic detection has long been an active research area, it is difficult to discover trending topics with high contextual meaning because microblog posts are small pieces of content, such as short sentences, individual images, or video links. This paper proposes a novel approach to detecting trending topics in microblog streams based on contextual search. First, burst keywords are detected in a microblog segment by calculating their frequencies and average growth rates; then related tweets are clustered into a document by a contextual search method; finally, the top-k keywords in the document are taken as the trending topic using a TF-IDF method. The experimental results show that the proposed approach can detect trending topics with high contextual meaning and accuracy.

Keywords: Microblogs; Trending Topics; Burst Keywords; Contextual Search

1 Introduction

Nowadays, microblogging services have become deeply embedded in people's lives. They spread news and opinions about real-world events quickly and broadly. Weibo, a form of microblogging service in China, has grown explosively in less than three years. According to CCNIC [1], the number of Weibo users reached 309 million in 2012 and is still growing at nearly 23.5% per year. Microblog users are now used to sharing personal matters and opinions about real-world news wherever they are. When news breaks in the real world, more and more people post tweets on the same topic. The number of related tweets grows rapidly, and a few burst keywords emerge on that topic.
At this time, a trending topic is forming in the microblog stream. Trend detection is of great value to news reporters and analysts, as a trend might point to a great story [2, 3]. Since a growing number of people want to know what others are talking about, microblogging services offer an opportunity to meet this demand. However, the number of tweets is so large that people cannot easily catch the trends.

Corresponding author. Email address: zhangwenjun32@163.com (Wenjun ZHANG). Copyright 2014 Binary Information Press. DOI: /jcis8787. January 15, 2014.

In this paper, a new approach is proposed to detect trending topics with high accuracy. First, it extracts burst keywords with a new scoring method. Then, it uses the burst keywords to search for related tweets and keeps them in one document per trending topic. Finally, it calculates the TF-IDF score of each word in the document and extracts the top-k highest-scoring words to represent the trending topic.

The rest of this paper is organized as follows: Section 2 introduces the related work; Section 3 describes the preprocessing; Section 4 describes the proposed approach in detail; Section 5 reports the experimental results; and Section 6 concludes the paper.

2 Related Work

Trend detection on microblog streams has recently become a hot research topic. Mathioudakis and Koudas [2] presented a system that performs trend detection over the Twitter stream. However, they gave only a brief introduction to the implementation, and the algorithm and experiments were not presented in detail. Phelan et al. [4] mined Twitter's trending topics to recommend real-time topical news, but they did not describe how trending topics are discovered in microblog streams.

Discovering burst features is the first step of trending topic detection. Kleinberg [5] proposed a formal approach to identifying bursts in a document stream based on an infinite-state automaton, and Zhao et al. [6] followed this method for scalable event detection in text streams. However, unlike traditional documents, microblog posts are short, flexible and fast-moving, so approaches designed for document streams do not carry over to microblog streams. Agarwal et al. [7] used dynamic graphs to discover dense clusters in real time, which finds clusters efficiently in highly dynamic environments. However, the method is not context-sensitive, and it does not make clear what is important about a trending topic. The work of Pervin et al. [8] is similar to ours.
It first identifies frequently occurring two-word clusters and then uses them to generate larger word clusters. It can detect contextual topics and is efficient enough to process high-rate, Twitter-scale streams. The approach is similar to the Apriori approach [9]. Because its pruning is coarse, its precision declines rapidly as the cluster size increases.

There is also much research on Weibo [10, 11]. Zhao et al. [12] presented a sentiment analysis system for Chinese tweets. It can monitor sentiment in real time, but not trending topics. Tu and Ding [13] detected hot topics based on incomplete clustering. Although most valueless tweets were filtered out by a Bayes classifier, the method did not consider the burstiness of topics during clustering. Xu et al. [14] combined the TF-IDF method with the growth rate of each word to detect hot topics, but the method was not context-sensitive.

The proposed approach uses burst keywords to gather related tweets and extracts subject keywords as a trending topic. Experiments show that the approach detects more contextual keywords than the traditional method while maintaining good accuracy.

3 Preprocessing

The volume of microblog data is so large that it needs to be divided into segments [15]. In this work, the tweets posted within one hour form a segment. A time sequence is defined as $\ldots, t_{n-1}, t_n, t_{n+1}, \ldots$, and the sequence of microblog segments is then defined as $\ldots, D_{n-1}, D_n, D_{n+1}, \ldots$. When a segment

is formed, the proposed approach is applied to detect the trending topics in that period. Segments are processed one by one as time passes.

To avoid the influence of spam messages, useless microblogs need to be filtered out [16]. Based on analysis and observation, microblogs that exhibit any of the following four characteristics are considered useless messages.

(1) Brackets. The angle bracket is used to highlight a topic. Microblogs containing it are often posted by newspapers or social media accounts. They are usually not what people are talking about, and most of them are advertisements.
(2) Hashtag. A hashtag is a word or a phrase prefixed with the symbol #. Microblogs including it are usually initiated by the microblog platform, and these messages are strongly shaped by human intervention.
(3) Mention. A mention is the symbol @ preceding a user name, used to reply to another user. Microblogs containing it often appear in conversational contexts. The content of these messages tends toward personal matters rather than trending topics.
(4) URL. A URL is a shared link in a microblog. Microblogs containing a link often give an objective introduction to the link. Spam messages often contain links, whereas the trending topics people talk about rarely contain a URL.

The remaining microblogs are split into tokens by word segmentation. Since nouns and verbs express most of the meaning of a tweet, all nouns and verbs are extracted as keywords. Some frequent words are unrelated to trending topics, for example, good night, sleep, moon and so on. These words are frequent within a short period, so they resemble the characteristics of a trending topic. Here, a method similar to that of Xu et al. [14] is used to filter out common words as well as stop words. It counts word frequencies over seven days and sorts the keywords by frequency.
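The four filtering rules above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the regular expressions, the bracket characters they cover, and the function names `is_useless` and `filter_segment` are assumptions, since the paper does not give exact patterns.

```python
import re

# Hypothetical patterns for the four "useless message" characteristics.
# The paper does not specify the exact character sets; these are assumptions.
BRACKET = re.compile(r"[<>《》【】]")   # (1) brackets used to highlight a topic
HASHTAG = re.compile(r"#[^#\s]+")      # (2) word or phrase prefixed with '#'
MENTION = re.compile(r"@\w+")          # (3) '@' preceding a user name
URL = re.compile(r"https?://\S+")      # (4) shared link


def is_useless(text: str) -> bool:
    """Return True if the microblog shows any of the four characteristics."""
    return any(p.search(text) for p in (BRACKET, HASHTAG, MENTION, URL))


def filter_segment(tweets):
    """Keep only the microblogs that pass all four rules."""
    return [t for t in tweets if not is_useless(t)]
```

For example, `filter_segment(["a #topic# b", "earthquake felt downtown"])` keeps only the second message. Word segmentation and the manual tagging of common words are outside the scope of this sketch.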
Top-ranked keywords that carry no topic information are tagged manually. The proposed approach removes the tagged words before detecting burst keywords in Section 4.1.

4 Proposed Approach

In this section, a new approach which detects trending topics based on contextual search is described. The approach consists of three main steps. First, burst keywords are extracted from a microblog segment. Then related tweets are clustered into a document by the contextual search method. Finally, the top-k subject keywords in the document are used to represent the trending topic based on the TF-IDF method.

4.1 Extracting burst keywords

Overall, tweets about a trending topic have two characteristics. One is that the keywords about the topic grow rapidly within a short period; the other is that the frequency of keywords related to the same topic is very high. In order to detect burst keywords in the current segment, it is necessary to compare

it to the keywords of past segments. The growth rate of each word at time $t_n$ is measured as follows:

$$\mathrm{AvgGrowth}(w, t_n) = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{F(w, t_n) - F(w, t_i)}{F(w, t_i) + 1}, \tag{1}$$

where $F(w, t_i)$ denotes the frequency of word $w$ in the $i$-th segment. The burst score of the word is computed by the following formula:

$$\mathrm{BurstScore}(w, t_n) = \mathrm{AvgGrowth}(w, t_n) \cdot \log(F(w, t_n)) \tag{2}$$

If the burst score of a word meets a predefined threshold, the word is kept as a burst keyword. The algorithm for extracting burst keywords is shown as Algorithm 1.

Algorithm 1 Extracting Burst Keywords
Input: Microblog stream segment set D
Output: Burst keywords set S
 1: for all D_i ∈ D do
 2:   Preprocess D_i with the method in Section 3
 3:   for all keyword w ∈ D_i do
 4:     Compute the frequency of w, F(w, t_i)
 5:     Compute AvgGrowth(w, t_i) with Formula (1)
 6:     Compute BurstScore(w, t_i) with Formula (2)
 7:     if BurstScore(w, t_i) ≥ threshold then
 8:       S_i ← S_i ∪ {w}
 9:     end if
10:   end for
11:   S ← S ∪ S_i
12: end for
13: return S

The purpose of line 2 is to preprocess the microblog segment. Lines 3-6 compute the burst score, and lines 7-8 extract the burst keywords.

4.2 Contextual search

Although keywords with a high burst score are related to a trending topic, only very few burst keywords remain if high accuracy is to be maintained. In this section, these few burst keywords are used to search the segment. The search results form a document on one topic, and more potential keywords are found in the document by the TF-IDF method. The process is described in Algorithm 2 and Algorithm 3. Algorithm 2 gives the overall process of finding word clusters. Line 2 initializes the flags of the burst keywords. For a burst keyword that has not yet been visited, it calls the DFSQuery algorithm to search for the related tweets (lines 4-5).
Line 6 writes the related tweets into a document, and line 7 generates a word cluster describing the trending topic with the TF-IDF method. The details of the search algorithm are given in Algorithm 3, and Section 4.3 presents the TF-IDF computation used in this work.
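The burst keyword extraction of Section 4.1 (Formulas (1) and (2) and the inner loop of Algorithm 1) can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function names and the dictionary-based frequency tables are hypothetical, the natural logarithm is assumed because the paper does not state the base, and a word with no history is given a growth of zero.

```python
import math


def avg_growth(history, current):
    """Formula (1): mean relative growth of the current frequency F(w, t_n)
    over the past frequencies F(w, t_1), ..., F(w, t_{n-1})."""
    if not history:
        return 0.0  # assumption: no past segments means no measurable growth
    return sum((current - past) / (past + 1) for past in history) / len(history)


def burst_score(history, current):
    """Formula (2): average growth weighted by the log of the frequency.
    The natural logarithm is an assumption."""
    if current <= 0:
        return 0.0
    return avg_growth(history, current) * math.log(current)


def burst_keywords(past_counts, current_counts, threshold):
    """Inner loop of Algorithm 1 for one segment.
    past_counts: list of {word: frequency} dicts for earlier segments.
    current_counts: {word: frequency} dict for the current segment."""
    result = set()
    for word, freq in current_counts.items():
        history = [segment.get(word, 0) for segment in past_counts]
        if burst_score(history, freq) >= threshold:
            result.add(word)
    return result
```

With past frequencies 0 and 1 and a current frequency of 9, `avg_growth` gives (9/1 + 8/2) / 2 = 6.5, so a word that suddenly appears scores far higher than a word that is frequent in every segment.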

Algorithm 2 Finding Word Clusters
Input: Burst keywords set S
Output: Word cluster set T
 1: for all S_i ∈ S do
 2:   Initialize words flag array visited ← true
 3:   for all keyword w ∈ S_i do
 4:     if visited[w] = true then
 5:       RS ← DFSQuery(w)
 6:       Write RS to doc
 7:       T_d ← GenerateSubjectKeywords(doc)
 8:       T ← T ∪ T_d
 9:     end if
10:   end for
11: end for
12: return T

Algorithm 3 is a depth-first search strategy. It first searches the microblog segment for the related tweets (line 2) and then checks whether the search results contain any other burst keywords. If so, it continues the search with those keywords by calling DFSQuery recursively (lines 4-10). The algorithm returns all the records related to the burst keywords. In both algorithms, a flag value of true marks a burst keyword that has not yet been searched; DFSQuery sets the flag to false in line 1.

Algorithm 3 DFSQuery
Input: Burst keyword w
Output: Search results set RS
 1: visited[w] ← false
 2: R ← search from D_i with w
 3: RS ← R
 4: for all R_i ∈ R do
 5:   for all w ∈ R_i do
 6:     if visited[w] = true then
 7:       RS ← RS ∪ DFSQuery(w)
 8:     end if
 9:   end for
10: end for
11: return RS

4.3 Generating subject keywords

The subject keywords of a trending topic are generated by an improved TF-IDF method. The TF-IDF method computes the weight of a word in a document, and the TF-IDF score of a word reflects how important the word is to the document. The short texts in microblogs are an obstacle to trending topic detection. If each tweet is treated as a document, as in the traditional TF-IDF method, the information conveyed by the keywords is not clear. To solve this problem, an improved TF-IDF method is proposed to generate more subject keywords.
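The contextual search of Algorithms 2 and 3, followed by the subject keyword scoring of this section, can be sketched in Python as below. This is a hedged illustration, not the authors' code. Assumptions: tweets are already tokenized into word lists, the set `unvisited` plays the role of the visited flags, the inverse document frequency term is read as log(1 + D / D(w)), a word that appears in no past document is treated as if D(w) = 1, and ties are broken alphabetically. All function names are hypothetical.

```python
import math
from collections import Counter


def dfs_query(word, segment, unvisited, results):
    """Algorithm 3: collect indices of tweets reachable from `word` through
    shared burst keywords. `unvisited` holds burst keywords not yet searched."""
    unvisited.discard(word)
    hits = [i for i, tokens in enumerate(segment) if word in tokens]
    results.update(hits)
    for i in hits:
        for other in segment[i]:
            if other in unvisited:
                dfs_query(other, segment, unvisited, results)
    return results


def subject_keywords(doc_tokens, past_docs, k):
    """Score each word as f(w) / f_sum * log(1 + D / D(w)) and return the top k.
    The form of the log argument and the max(df, 1) guard are assumptions."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    n_docs = len(past_docs)

    def score(w):
        df = sum(1 for d in past_docs if w in d)
        return counts[w] / total * math.log(1 + n_docs / max(df, 1))

    return sorted(counts, key=lambda w: (-score(w), w))[:k]


def find_word_clusters(segment, burst, past_docs, k):
    """Algorithm 2: one document and one word cluster per group of linked
    burst keywords. Iterating in sorted order is only for determinism."""
    unvisited = set(burst)
    clusters = []
    for word in sorted(burst):
        if word in unvisited:
            hits = dfs_query(word, segment, unvisited, set())
            doc = [tok for i in sorted(hits) for tok in segment[i]]
            clusters.append(subject_keywords(doc, past_docs, k))
    return clusters
```

In this sketch, two burst keywords that co-occur in at least one tweet end up in the same document, which is what lets the final cluster contain non-burst words from every tweet on the topic.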

Unlike the traditional TF-IDF method, the document in the proposed method is formed by the search algorithm rather than taken from an original document. The document consists of the tweets related to one topic. The formula is defined as follows:

$$TD(w) = \frac{f(w)}{f_{sum}} \cdot \log\left(1 + \frac{D}{D(w)}\right), \tag{3}$$

where $f(w)$ denotes the frequency of word $w$ in the document, and $f_{sum}$ is the sum of all word frequencies in the document. $D$ is the number of documents in past periods, and $D(w)$ is the number of documents in past periods that contain word $w$. Finally, the words are ranked by score, and the top-k highest-scoring words are extracted to represent the trending topic.

5 Experimental Results

To demonstrate the effectiveness of the proposed approach, evaluation experiments were carried out on the SINA Weibo platform. SINA has published its APIs since 2010, and public tweets can be obtained through the GetPublicTimeline interface. From March 1st, 2013 to March 7th, 2013, 26,291,715 microblogs were crawled in total. After filtering, 12,960,952 microblogs remained. Three annotators manually tagged each day's trending topics and produced a standard trending topic list for every day. In this work, the F-measure is used to evaluate performance, as defined in formula (6). In formula (4) and formula (5), $MF$ is the manual topic list, and $F$ is the set of trending topics detected by our approach.

$$Precision = \frac{|F \cap MF|}{|F|} \tag{4}$$

$$Recall = \frac{|F \cap MF|}{|MF|} \tag{5}$$

$$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{6}$$

5.1 Parameter setting

In this section, the best burst score threshold is selected for extracting burst keywords. In the experiment, the threshold was learned from the collected data. Performance was tested with the threshold varying from 0 to 100. As shown in Fig. 1, the F-measure peaks when the threshold equals 70. The accuracy of the burst keywords increases as the threshold grows from the start.
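The evaluation metrics of formulas (4) to (6) can be sketched in Python as follows. This is a minimal illustration; the function name `f_measure` and the convention of returning zero for empty sets are assumptions that the paper does not address.

```python
def f_measure(detected, manual):
    """Return (precision, recall, F-measure) for a detected set F and a
    manual list MF, following formulas (4), (5) and (6)."""
    detected, manual = set(detected), set(manual)
    overlap = len(detected & manual)
    # Assumption: an empty set yields 0.0 instead of a division error.
    precision = overlap / len(detected) if detected else 0.0
    recall = overlap / len(manual) if manual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For instance, with four detected topics of which two appear in a manual list of three, the precision is 0.5 and the recall is 2/3. Repeating this computation for each candidate threshold is one way to reproduce a curve like the one described for Fig. 1.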
As the threshold rises, more and more false burst keywords are removed. The increase in F-measure is primarily due to the decrease in false burst keywords. When the threshold exceeds 70, some true burst keywords are filtered out and the F-measure declines. Given this trend, performance continues to drop after the highest point. So the threshold is set to 70 for the further steps.

5.2 Comparison with TrendMiner

Our approach is compared with TrendMiner by Pervin et al. [8] on the same collected data. Fig. 2 shows the F-measure of TrendMiner and our approach. It shows that the F-measure of TrendMiner

declines sharply as the cluster size increases, starting from 1.0 for 2-word clusters and falling through 12-word clusters. However, the F-measure of our approach changes less than that of TrendMiner between 2-word clusters and 12-word clusters. Compared to TrendMiner, the average F-measure of our approach improves by about 13.2%. Because our approach gathers the related tweets into one document through contextual search, it still performs acceptably for larger word clusters.

Fig. 1: F-measure for varying threshold (x-axis: threshold; y-axis: F-measure)
Fig. 2: F-measure of TrendMiner and our approach (x-axis: cluster size in number of words; y-axis: F-measure)

Table 1 reports each day's top five manual trending topics for the first three days and compares them with the trending topics for 8-word clusters detected by our approach. The results show that our approach is able to identify most of the trending topics with high contextual content from microblog streams.

Table 1: Comparing with Manual Topics

NO. | Manual Trending Topics | Our Approach | Date
--- | --- | --- | ---
1 | Jumei Discount | [webpage, paralysis, server, client, breakdown, website, panic buying, spring] | 1 Mar
2 | CET Grade Query | [grade, admission ticket, score, English, query, examination, reading, score line] | 1 Mar
3 | Mekong Case Execution | [death penalty, execution, live, drunk king, injection, Mekong, mariner, offender] | 1 Mar
4 | Five Rules Publication | [house, loan, income tax, regulate, rules, transfer, sell, interest rate] | 1 Mar
5 | I'm a Singer | [singer, Goose, high pitch, Baoliang Sha, Winnie Hsin, Morin khuur, audience, grasslands] | 1 Mar
6 | I'm a Singer Replay | [Comprehension, singer, Winnie Hsin, Goose, Terry Lin, Qishan Huang, folk song, aunt Huang] | 2 Mar
7 | Housing Property Tax | [government, income tax, housing, departure, sell house, real estate, trade, new house] | 2 Mar
8 | Milk Powder Restriction | [milk powder, Hong Kong, citizen, imprison, white powder, breast milk, bring, departure] | 2 Mar
9 | Press Conference of CPPCC | [spokesman, CPPCC, XinHua Lv, Press Conference, reply, journalist, conference, translate] | 2 Mar
10 | Spain's National League | [goal, alternate, ball game, Bassar, pass, Messi, ball king, defense] | 2 Mar
11 | Spain's National League | [judgment, tricks, yellow card, red card, penalty, hat, Bassar, Arribas] | 3 Mar
12 | Conference of CPPCC | [NPC&CPPCC, CPPCC, open, conference, committee, TV, ceremony, rebroadcast] | 3 Mar
13 | The Legendary Swordsman | | 3 Mar
14 | You Think You Can Dance | [conquest, star-tv, dance, competitor, martial forest, dance, program, pavane] | 3 Mar
15 | Semi-final of CBA | [semi-final, GuangSha, JinYu, ShanDong, foreign aid, ZheJiang, Morris, men's basketball] | 3 Mar

6 Conclusion

This paper proposes a new approach to detecting trending topics in microblog streams. The proposed approach can detect trending topics with high contextual content, and the experiments show that it is effective. It consists of three main steps: burst keyword extraction, contextual search and TF-IDF computation.
Compared to existing work, the approach detects more keywords related to the trending topic while maintaining acceptable accuracy. In the future, the size of the microblog segment will be reduced so that trending topics can be discovered more quickly with this approach.

Acknowledgement

This work is supported by the Natural Science Foundation of China under Grant No and , the Zhejiang Province Natural Science Foundation of China under Grant No. Y and LY12F02006, the Zhejiang Province key industrial projects in the priority themes of China under Grant No. 2010C11050, and the science and technology search planned projects of Zhejiang Province (No. 2012C21040).

References

[1] CCNIC, For further detail, please see the website,
[2] Mathioudakis, M. and N. Koudas, TwitterMonitor: Trend Detection over the Twitter Stream, in: SIGMOD '10 Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, pp
[3] Ritter, A., Mausam and O. Etzioni, Open Domain Event Extraction from Twitter, KDD '12, 2012, pp
[4] Phelan, O., K. McCarthy and B. Smyth, Using Twitter to Recommend Real-Time Topical News, in: RecSys '09 Proceedings of the third ACM Conference on Recommender Systems, 2009, pp
[5] Kleinberg, J., Bursty and Hierarchical Structure in Streams, KDD '02, 2002, pp
[6] Zhao, W.X., et al., A Novel Burst-based Text Representation Model for Scalable Event Detection, in: ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012, pp
[7] Agarwal, M.K., K. Ramamritham and M. Bhide, Real Time Discovery of Dense Clusters in Highly Dynamic Graphs: Identifying Real World Events in Highly Dynamic Environments, VLDB '12, 2012, pp
[8] Pervin, N., et al., Scalable, and Context-Sensitive Detection of Trending Topics in Microblog Post Streams, ACM Transactions on Management Information Systems, 2013, Vol. 3(4).
[9] Agrawal, R. and R. Srikant, Fast Algorithms for Mining Association Rules, in: VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp
[10] Zhou, X. and F. Li, Mining Aspects and Opinions from Microblog Events, Journal of Computational Information Systems, 2013, Vol. 6(9), pp
[11] Fu, B. and T. Liu, Weakly-supervised Consumption Intent Detection in Microblogs, Journal of Computational Information Systems, 2013, Vol. 6(9), pp
[12] Zhao, J., et al., MoodLens: An Emoticon-Based Sentiment Analysis System for Chinese Tweets, KDD '12, 2012, pp
[13] Tu, H. and J. Ding, An Efficient Clustering Algorithm for Microblogging Hot Topic Detection, in: 2012 International Conference on Computer Science and Service System, 2012, pp
[14] Xu, W., et al., Detecting Hot Topics in Chinese Microblog Streams Based on Frequent Patterns Mining, in: Proceedings of the International Conference on Web Information Systems and Mining, 2012, pp
[15] Li, C., A. Sun and A. Datta, Twevent: Segment-based Event Detection from Tweets, CIKM '12, 2012, pp
[16] Yu, L., S. Asur and B.A. Huberman, What Trends in Chinese Social Media, in: The 5th SNA-KDD Workshop '11 (SNA-KDD '11), 2011.