Being Aware of the World: An Early-Warning Framework by Detecting Real-time Social-Media Streams

Chung-Hong Lee, Huan-Wen Tsai, Shun-Chieh Lin, Chi-Chun Hsia, Chih-Hung Wu, and Wei-Shiang Wen

Abstract
Social-media messages provide a valuable collection of information for real-time event detection and early warning. In this work, we present a message intensity tracking approach that compares the divergence of a word between its long-term and short-term distributions, and we propose a self-adaptive stream clustering method with a time-decay function that follows the event lifecycle and treats the stream as a series of waves. The results show that our approach can immediately evaluate significant emergent events, supporting real-time discovery of disaster information and early warning for crisis management.

Keywords: data mining, stream mining, social network

I. INTRODUCTION

Rising demand for mobile applications in particular has added a new requirement to information processing. With the rapid growth of microblogs and social networking services, both kinds of platform put sustained pressure on real-time mining research. Meanwhile, limited storage space, growing network traffic, and limited computation continue to challenge researchers in the information retrieval field. Social networks have become primary information sources that collect and propagate real-life information, and they have changed the way information is acquired. For such networks to serve as an early-warning tool, a key challenge in real-time clustering is the validation of consecutive message streams based on their temporal factors. The problem of clustering evolving streams has been extensively researched in recent years because of the large number of web applications [1, 4, 6, 8, 15]. In this paper, we study the real-time stream clustering problem for social networking applications.
Previous studies on clustering evolving data streams, such as [5, 7, 8, 13, 14], focus on concept drift and on detecting how events evolve. The basic idea of these studies is to accumulate the data stream to a specified length and then trace the clustered results across different periods. They describe a general process of reducing the system load by limiting the time frame or the data size. However, how to reflect the behavior of information cascading while ensuring performance in real-time stream mining remains a difficult challenge.

People share their interests through social networks such as Twitter, which aggregate an endless supply of real-time information. A real-time clustering framework that follows the nature of social behavior is necessary in order to extract hot events from evolving streams efficiently. Message streams on social networking platforms propagate following the general trend of Zipf's law [2, 11, 12] (or the Pareto distribution). For real-time stream clustering, the first task is to find the messages of greatest concern instantly, and the second is to cluster the consecutive messages with a self-adaptive approach. To achieve this goal, three questions must be considered when processing real-time streams:

1) How can the content of streams be extracted efficiently?
2) How should the length of the data frame be decided so that clustering yields a feasible number of results?
3) How can important events be identified from social networking streams?

Chung-Hong Lee is with the Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan (phone: ext.558; e-mail: leechung@mail.ee.kuas.edu.tw). Huan-Wen Tsai, Shun-Chieh Lin, and Chi-Chun Hsia are with the Industrial Technology Research Institute, Tainan, Taiwan (e-mail: {hwtsai;jason.lin;shiacj}@itri.org.tw). Chih-Hung Wu and Wei-Shang Wen are with the Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan (e-mail: williamwu.tw@gmail.com).
In this paper, we propose a bursty cluster detection method to filter real-time messages and introduce an innovative approach to mining online evolving streams. The main contributions of our work are:

1) We present a message intensity tracking approach that compares the divergence of a word between its long-term and short-term distributions. We also introduce an index, called the word bursting factor, which takes into account the temporal aspect of a word's evolution in a consecutive stream and allows the clustering process to refer to the intensity of a message at the same time.
2) We describe a self-adaptive stream clustering method with a time-decay function that follows the event lifecycle and treats the stream as a series of waves. This method dynamically adjusts the time window according to the density of the incoming stream and the duration of each cluster.

II. DYNAMIC TERM WEIGHTING PROCESS

To process texts in chronological order, a fundamental problem is how to find the significant features in the collected streams. Specifically, the trends of a concept are often not stable but change with time, which is known as concept drift [3]. Under such circumstances, the weighting of real-time messages must be updated constantly. Here we apply the term weighting scheme BursT, which was proposed in our previous work [9]. The experimental results there indicate that BursT performs better at weighting the words of messages than incremental TFIDF. The weight of the word $w$ at time $t$ in BursT is composed of two factors, BS (Burst Score) and TI (Term Importance):

$$\mathrm{BursT}_{w,t} = BS_{w,t} \times TI_{w,t} \tag{1}$$

We first define the burst score (BS) as follows:

$$BS_{w_r^q,t} = \max\left(\frac{ar_{w_r^q,t} - E(ar_{w_r^q,t})}{E(ar_{w_r^q,t})},\ 0\right) \tag{2}$$

$$ar_{w,t} = \frac{1}{1 + at_{w,t} - at_{w,t-1}} \tag{3}$$

where $t > 1$, $at_{w,t}$ is the arrival time of the word $w$ at the current time $t$, $at_{w,t-1}$ is the last arrival time of $w$, and $ar_{w,t=1} = 0$.

$$E(ar_{w_r^q,t}) = \mu_{w_r^q,t} = \mu_{w_r^q,t-1} + \frac{1}{n_{w_r^q,t}}\left(ar_{w_r^q,t} - \mu_{w_r^q,t-1}\right) \tag{4}$$

where $t > 1$, $E(ar_{w_r^q})$ is the expectation for the incoming word $w_r^q$, written as an update equation; $\mu_{w_r^q}$ is the arithmetic mean of the arrival rate, with $\mu_{w_r^q,t=1} = 0$; $ar_{w_r^q}$ is the new arrival rate of the incoming word; and $n_{w_r^q}$ denotes the number of occurrences of $w_r^q$. We therefore regard $ar_{w_r^q}$ as the current observation and compare it with the expected value $E(ar_{w_r^q})$ of the word $w_r^q$. Note that the difference between the observed and the expected arrival rate is not always positive. When the observation falls below the expectation, we define the word as a falling word at that time and set its BS factor to zero.

The second consideration in the BursT weighting scheme is the TI (Term Importance) factor, which is formulated from the proportion of the term in the online collection and its importance there. When mining hot events from messages, a word that occurs in more messages is more likely to belong to a trending topic.
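The burst score of equations (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `WordState` and `burst_score` are hypothetical names, the arrival rate is read as 1 / (1 + time since the last arrival), and the expectation is read as an incrementally updated arithmetic mean. Both readings are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordState:
    """Running statistics kept per word (hypothetical helper)."""
    last_arrival: Optional[float] = None  # previous arrival time of the word
    mean_rate: float = 0.0                # running mean of the arrival rate
    count: int = 0                        # number of arrivals seen so far


def burst_score(state: WordState, now: float) -> float:
    """Record an arrival at time `now` and return the burst score.

    Assumed reading of equations (2)-(4): the rate is 1 / (1 + gap), the
    expectation is the running arithmetic mean, and negative deviations
    (a "falling word") are clipped to zero.
    """
    state.count += 1
    if state.last_arrival is None:
        rate = 0.0  # the first arrival has rate 0 by definition
    else:
        rate = 1.0 / (1.0 + now - state.last_arrival)
    state.last_arrival = now

    # Incremental update of the arithmetic mean of the arrival rate.
    state.mean_rate += (rate - state.mean_rate) / state.count
    expected = state.mean_rate
    if expected == 0.0:
        return 0.0  # nothing to compare against yet
    return max((rate - expected) / expected, 0.0)


if __name__ == "__main__":
    state = WordState()
    for timestamp in (0.0, 9.0, 109.0):
        print(timestamp, burst_score(state, timestamp))
```

A word that reappears after a short gap scores above zero, while a word whose gap grows well beyond its history is clipped to zero, which matches the falling-word rule described above.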
The TI of the word $w$ at time $t$ is therefore formulated as equation (5):

$$TI_{w_r^q,t} = \frac{\ln \left| m_{t_g,t} \right|}{\ln \left| \{ w_r^q \mid m : w_r^q \in W_{[t_l,t]} \} \right|} \tag{5}$$

TI denotes the importance of a word in the online message collection. This factor can also be adjusted dynamically according to the word's frequency, which is discretized by entropy, within the time frame $t_l$ to $t$. Fig. 1 illustrates the dynamic term weighting process.

Fig. 1 Evolvement of word BS and dynamic term weighting

III. SELF-ADAPTIVE STREAM CLUSTERING

A. Self-adaptive Clustering Model

In the self-adaptive clustering method, the numbers of messages and words are kept within a finite scale, and each cluster is diminished by the time-decay function. As shown in Fig. 2, the model adjusts the frame size according to the dynamic status of the online process, which adds new messages from the online stream and deletes obsolete messages that belong to faded clusters. An event thread is a series of similar clusters sharing the same concept or idea. A dummy cluster is a spot that cannot be categorized into any existing event thread, or that acts as the seed of a new event cluster. A faded cluster is a dummy cluster or an event cluster that has been diminished by time.

Fig. 2 Self-adaptive clustering model (labels: time line; offline process with faded clusters; online process with self-adaptive time window, dummy clusters, and event threads; recent message stream)

It is worth mentioning that each cluster has its own decay time, controlled by the density of its messages and its survival time. For example, an event cluster should be identified quickly when a large number of messages are categorized into it within a very short period, but its lifetime may be extended if the cluster grows steadily.

In the self-adaptive clustering model, the online collection $L$ grows incrementally and also shrinks as the messages of faded clusters are removed, as in equation (6):

$$L = M_{[t_l,t]} + m_r - c_t \tag{6}$$

where $L$ denotes the number of messages in the online thread containing word $w$; the number increases when a new message $m_r$ is added and decreases when a faded cluster $c_t$ is removed.

B. Determining the Cluster Lifecycle with a Time-Decay Function

Many studies of evolving stream clustering use a fixed window to reduce system load and to avoid processing the entire corpus as traditional text mining does. With a fixed window, the stream can only be clustered once the consecutive data has accumulated to the window size. In fact, each independent event cluster has its own life cycle, and a feasible practice is to adjust the duration automatically. Because different users are interconnected and the propagation of a message is usually confined to a restricted number of users, messages spread quickly at first but with a heavy tail. This phenomenon has been observed in many studies, for example in the analysis of human dynamics, of blog posts exchanged in the blogosphere, and of short messages spreading through microblogs. These studies share a common finding: the propagation of information in Web 2.0 applications shows a heavy tail and follows a power law. Leskovec et al. [11] analyzed the temporal aspects of the blogosphere and found that blog posts do not have a bursty behavior; the popularity of posts drops with a power law rather than exponentially, as one might have expected. They also noted that the exponent of the power law is -1.5, agreeing very well with Barabasi's theory of heavy tails in human behavior [2]. Furthermore, Lerman and Ghosh [10] conducted an empirical analysis of user activity on Digg and Twitter, two of the most famous blog and microblog sites, and of how information spreads through these social networks. Their results also showed that user activity on both sites has a power-law distribution in both the pull and the push direction.

The self-adaptive clustering model developed in this work mainly combines clustering and term weighting techniques. Let $C = \{c_1, \dots, c_E\}$, $E \le E_{max}$, be the feature vectors of the event clusters, where $E_{max}$ is the maximum permissible number of online threads.
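To make the heavy-tail argument of this subsection concrete, the short sketch below compares a power-law drop with an exponential drop. It is only an illustration: the function names are hypothetical, the exponent 1.5 is the magnitude Leskovec et al. [11] report for blog posts, and the exponential rate of 1.0 is an arbitrary assumption chosen so that both curves equal 1 on day 1.

```python
import math


def power_law_share(day: float, exponent: float = 1.5) -> float:
    """Popularity on `day` relative to day 1 under a power-law drop."""
    return day ** -exponent


def exponential_share(day: float, rate: float = 1.0) -> float:
    """Popularity on `day` relative to day 1 under an exponential drop."""
    return math.exp(-rate * (day - 1))


if __name__ == "__main__":
    print("day  power-law  exponential")
    for day in (1, 2, 5, 10, 30):
        print(f"{day:>3}  {power_law_share(day):.6f}  {exponential_share(day):.6f}")
```

The two curves fall at a similar pace over the first days, but the power-law curve keeps a noticeable share of attention long after the exponential curve has vanished. A fixed window tuned to the early drop would therefore cut off the tail of an event, which is the motivation for adjusting each cluster's duration automatically.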
The concurrent event clusters together form the set $\bigcup_i C_i$ and are pairwise disjoint: $C_i \cap C_j = \emptyset$ for all $i \neq j$. Let $r$ index the latest processed message; $r$ is unbounded because real-time messages are posted continuously. Each time a new message $m_r$ is presented to the algorithm, the clusters are compared and updated, and the winner is the cluster that lies closest to $m_r$. The winner absorbs $m_r$ into its vector and is updated so that the real-time cluster evolves, and then all clusters are updated and move toward fading away.

Since instant messages are posted continuously, a better practice is to follow the cascading behavior of social information propagation and segment each event discussion thread as described above. In this study, we use a time-decay function, $d(\cdot)$, to determine the lifecycle of an event cluster and to erase slack messages. It is denoted as:

$$d(t_{c_i}, t^{e}_{c_i}, t_{m_r}, n) = \frac{t_{c_i} - t_{m_r}}{\eta\, n^{\alpha} + (t^{e}_{c_i} - t_{m_r})} \tag{7}$$

where $\alpha > 1$ is the power factor and $\eta = 1 + \ln n$ controls the slope and adapts our scheme to the number of real-time posts. The similarity between $m_r$ and $c_i$ is judged against a threshold $\theta$. The message $m_r$ is considered dissimilar to $c_i$ and cannot be joined to $c_i$ if:

$$sim(c_i, m_r)\, d(\cdot) < \theta \tag{8}$$

In this study, whether a cluster survives or is phased out is decided by checking its decay factor. As with the message intensity, we consider not only the evolvement of a cluster but also the number of messages in it. Even with a high similarity, a new message is not joined to a cluster that is about to fade away.

C. Clustering Real-time Message Stream

Because the collected words follow the general trend of Zipf's law [12] (or the Pareto distribution), an easy way to determine the uninformative terms in this heavy-tailed distribution is to follow its nature. As noted above, an instant message may lack useful features because a posted text contains too few words.
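The per-message update of Section III.B (pick the closest cluster, gate the match by the decay factor as in equation (8), then absorb the message or seed a new cluster) can be sketched as follows. This is a hedged illustration, not the authors' implementation: `Cluster`, `assign`, `cosine`, and `decay` are hypothetical names, cosine similarity over raw term counts is an assumed choice of similarity, the default values of `theta` and `alpha` are arbitrary, and `decay` is a stand-in power-law function rather than equation (7). The stand-in only keeps the two properties stated in the text: the factor shrinks as a cluster goes without updates, and the slope depends on $\eta = 1 + \ln n$.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def decay(elapsed: float, n: int, alpha: float = 1.5) -> float:
    """Stand-in power-law decay (NOT equation (7)): equals 1 at elapsed = 0
    and decays more slowly for clusters that hold more messages."""
    eta = 1.0 + math.log(n)
    return (1.0 + elapsed) ** (-alpha / eta)


class Cluster:
    """An online cluster: accumulated term vector, size, last update time."""

    def __init__(self, terms, now: float):
        self.vector = Counter(terms)
        self.n = 1
        self.last_update = now


def assign(clusters: list, terms, now: float, theta: float = 0.3) -> Cluster:
    """Join the message to the best decayed match, or seed a new cluster."""
    message = Counter(terms)
    best, best_score = None, 0.0
    for cluster in clusters:
        score = cosine(cluster.vector, message) * \
            decay(now - cluster.last_update, cluster.n)
        if score > best_score:
            best, best_score = cluster, score
    if best is None or best_score < theta:
        # Dissimilar to every live cluster: the message seeds a new one.
        best = Cluster(terms, now)
        clusters.append(best)
    else:
        # The winner absorbs the message into its vector.
        best.vector.update(message)
        best.n += 1
        best.last_update = now
    return best


if __name__ == "__main__":
    clusters = []
    assign(clusters, ["quake", "tokyo"], now=0.0)
    assign(clusters, ["quake", "tokyo", "alert"], now=0.5)
    assign(clusters, ["football", "final"], now=0.6)
    assign(clusters, ["quake", "tokyo"], now=1000.0)
    print(len(clusters), [c.n for c in clusters])
```

In the usage example the last message is textually close to the first cluster, yet it starts a new one because that cluster has gone without updates for a long time, mirroring the rule that a message is not joined to a cluster that is about to fade away.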
Messages propagate in a cascading manner on social networking websites [11]. We therefore compare each new message with every existing cluster by estimating their semantic relationship and then place it in the most similar cluster. After iterating, each identified cluster accumulates messages until the end of its lifecycle, and this process yields an incremental vector for each cluster. The kernel similarity function can be replaced on demand in this work. To evaluate our scheme quickly, we use the traditional cosine measure in this preliminary scheme, as shown in equation (9), to compare the term vectors of $c_i$ and $m_r$:

$$sim(c_i, m_r) = \max_{i=1,\dots,E} \cos(v_{c_i}, v_{m_r})\, c_i^{n} \tag{9}$$

where $v_{c_i}$ represents the term vector of $c_i$, $v_{m_r}$ represents the term vector of $m_r$, and $c_i^{n}$ represents the recent number of messages in $c_i$. The message $m_r$ is then absorbed into the winning event cluster as:

$$c_i = v_{c_i} \cup v_{m_r} \tag{10}$$

Clusters with more messages outlive the others, and we use $c_i^{n}$ as the factor that stretches the lifecycle over an extended period while such events continue to be discussed. Further, to calculate the weight of a word in the online thread, we apply BursT to evaluate the burst score of the word, as described in Section II:

$$weight(m_r, t_{m_r}, w_r^q) = \mathrm{BursT}_{w_r^q} \tag{11}$$

IV. REAL-TIME EVENT DETECTION

Social networking message streams provide a good opportunity for hot event detection because of their high convergence of spatial, temporal, and conceptual relationships, as mentioned above. The key to catching a hot event is to detect spreading messages quickly and efficiently. To reach this target, two major processes must be accomplished:

1) monitoring the temporal variability of a term and the intensity of online messages;
2) tracing the evolvement of an incident according to its density and life cycle.

As mentioned in Section III, each event has its own life cycle, and the distribution of messages within the event cluster fits the time-decay function. Fig. 3 shows the relationship between the message arrival rate and the event life cycle. In the beginning stage of a new incident, the number of posted messages soars sharply, but the messages then propagate with a long tail. In this study, the lifecycle of an event cluster is decided by its density and arrival rate. For example, C4 is extracted within a shorter period than C1 because of its high message density. This phenomenon also provides a good opportunity to detect a real-time event faster.

Fig. 3 Event life cycle and message arrival rate (axes: event life cycle, message arrival rate; clusters labeled C1 to C9)

Here, we first extract the event thread by tracing the evolvement of event clusters. Let $C^{y} = \{c_1^{y}, c_2^{y}, \dots, c_k^{y}\}$, $y \le E_{max}$, be the set of clusters of the collected event thread $y$, and

$$c_t \in \begin{cases} C^{y}_{i+1}, & \text{if } sim(c_i^{y}, c_t) \ge \theta_c \\ C^{y+1}, & \text{if } sim(c_i^{y}, c_t) < \theta_c \end{cases} \tag{12}$$

$$sim(c_i^{y}, c_t) = \max_{y=1,\dots,E} \cos(v_{c_i^{y}}, v_{c_t}) \tag{13}$$

where $\theta_c$ is the threshold that determines whether the extracted cluster $c_t$ belongs to an existing thread or to a new event. An event can be aggregated quickly by evaluating the similarity and sequence of its clusters. However, a bigger event thread causes a magnet effect that obscures nearby threads. Many studies use a fixed window to separate event threads, but, as mentioned, how to decide the length of the data frame so that clustering yields a feasible number of results is a critical issue worth discussing. This study first uses a self-adaptive window to extract event clusters and then calculates the message intensity within those clusters to estimate the evolving stage and detect potential incidents in real time.

V. EXPERIMENTAL RESULTS

In the experiment, a total of 3,440,898 Twitter posts were collected as our data source, dating from Jan 6, 2011 to Feb 2, 2012. The test samples were collected through the Twitter Stream API. After filtering out non-ASCII tweets, 67,62,636 tweets were used as our data source. We first analyzed the relationship between time and message density. Then, TFIDF weighting was applied as the baseline to test the stability and performance of our stream clustering and event extraction approaches. As observed in Fig. 4, messages were posted following the evolvement of the real event, which agrees with our clustering model in Sections III and IV.

Fig. 4 Number of messages vs. time

Fig. 5 compares the performance of our BS weighting function with that of TFIDF under the same clustering method; the BS weighting function is clearly more stable than TFIDF. The reason is that our BS weighting function evaluates the importance of an event according to its long-term evolvement, whereas TFIDF only considers the frequency of a word set. Moreover, the result also demonstrates that our approach can avoid the winner-take-all effect through density-based clustering.

Fig. 5 Comparison of performance in the real-time event extracting task (series: burst, tfidf; x-axis: related event no.; original event no. #3696)

VI. CONCLUSION AND FUTURE WORKS

The key to catching real-time events from social media is to detect spreading messages quickly and efficiently.
To perform event clustering and tracing, we established a dynamic weighting function to monitor the temporal variability of a term and the intensity of online messages, together with an online clustering algorithm to trace the evolvement of an incident according to its density and lifecycle. In the future, we plan to develop an online system platform that can immediately evaluate significant emergent events, in order to achieve real-time discovery of disaster information and early warning for crisis management.

REFERENCES

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany, 2003), VLDB Endowment, 2003.
[2] A. L. Barabasi, "The origin of bursts and heavy tails in human dynamics," Nature, vol. 435, no. 7039, 2005.
[3] A. Bulut and A. K. Singh, "A Unified Framework for Monitoring Data Streams in Real Time," in Proceedings of the 21st International Conference on Data Engineering (2005), IEEE Computer Society, 2005.
[4] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng, "Evolutionary spectral clustering by incorporating temporal smoothness," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, 2007), ACM, 2007.
[5] B. R. Dai, J. W. Huang, M. Y. Yeh, and M. S. Chen, "Adaptive Clustering for Multiple Evolving Streams," IEEE Trans. on Knowl. and Data Eng., vol. 18, 2006.
[6] C. Feng, E. Martin, Q. Weining, and Z. Aoying, "Density-based clustering over an evolving data stream with noise."
[7] V. Ganti, J. Gehrke, and R. Ramakrishnan, "DEMON: Mining and Monitoring Evolving Data," IEEE Trans. on Knowl. and Data Eng., vol. 13, 2001.
[8] H. L. Chen, M. S. Chen, and S. C. Lin, "Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data," IEEE Trans. on Knowl. and Data Eng., vol. 21, 2009.
[9] C. H. Lee, C. H. Wu, and T. F. Chien, "BursT: A Dynamic Term Weighting Scheme for Mining Microblogging Messages," in Advances in Neural Networks: ISNN 2011, D. Liu, H. Zhang, M. Polycarpou, C. Alippi, and H. He, Eds. Springer Berlin/Heidelberg, 2011.
[10] K. Lerman and R. Ghosh, "Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks," CoRR, 2010.
[11] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, "Cascading Behavior in Large Blog Graphs," in Proceedings of the SIAM International Conference on Data Mining (SDM).
[12] W. Li, "Random texts exhibit Zipf's-law-like word frequency distribution," IEEE Transactions on Information Theory, vol. 38, 1992.
[13] M. M. Gaber and S. Y. Philip, "Classification of Changes in Evolving Data Streams using Online Clustering Result Deviation."
[14] A. Zhou, F. Cao, W. Qian, and C. Jin, "Tracking clusters in evolving data streams over sliding windows," Knowl. Inf. Syst., vol. 15, no. 2, 2008.
[15] Y. Zhu and D. Shasha, "StatStream: statistical monitoring of thousands of data streams in real time," in Proceedings of the 28th International Conference on Very Large Data Bases, 2002.