Quantitative Prediction of Twitter Message Dissemination: A Machine Learning Approach


Pattern Recognition Lab
Department of Media and Knowledge Engineering
Faculty of Electronic and Engineering, Mathematics and Computer Science
Delft University of Technology
Intelligent Systems (INSY)

Thesis for the degree of Master of Science in Pattern Recognition Lab at Department of Media and Knowledge Engineering at Delft University of Technology, Delft, The Netherlands

All rights reserved. Copyright Media and Knowledge Engineering Department, Faculty of Electronic and Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands

Delft University of Technology
Department of Media and Knowledge Engineering

The undersigned hereby certify that they have read and recommend to the Faculty of Electronic and Engineering, Mathematics and Computer Science for acceptance a thesis entitled Quantitative Prediction of Twitter Message Dissemination: A Machine Learning Approach in partial fulfilment of the requirements for the degree of Master of Science.

Dated:
Supervisor: Dr. D.M.J. Tax
Readers: Prof.dr.ir. M.J.T. Reinders, Dr. E.A. Hendriks, Dr.ir. S.E. Verwer

Master Thesis
Quantitative Prediction of Twitter Message Dissemination: A Machine Learning Approach
Pattern Recognition Lab, Media and Knowledge Engineering, Delft University of Technology

Abstract

Predicting the popularity of content in social networks is important for several applications such as viral marketing, news propagation and personalization. In this work, we developed a statistical learning approach to predict the popularity of tweets in the Twitter social network. We extracted several user-based, tweet-based and network-based features from each tweet and adopted several classifiers to predict the popularity of tweets. We model this problem as a binary classification problem in which popular tweets form the positive class and non-popular tweets form the negative class. Popularity is defined by a threshold on the number of times a tweet is retweeted. We defined several popularity thresholds and examined the performance of different classifiers for different threshold values. Our experimental results show that there is no globally best classifier for the problem of popularity prediction in Twitter; depending on the dataset, the popularity threshold and our interest, however, we can adopt an optimal classifier with a proper set of features for this task.

Categories and Subject Descriptors: Social Network Analysis, Machine Learning
Key words: Twitter, Popularity Prediction, Social Networks, Classification, Feature Extraction, Microblogging

Table of Contents

1 Introduction
    Motivation
    Research Questions and Contributions to this Work
    Structure of this Thesis
2 Related Work
    Introduction
    Information Dissemination in Social Networks
    Predicting the Popularity of Content in Social Networks
    Popularity Prediction in Twitter
    Popularity Predictions and Recommendations
    Popularity Predictions and Influences
    Social Network Prediction Applications
    Summary
3 Predictive Model
    Introduction
    Prediction Challenges
    Model Architecture
    Classification Methods
    Linear and Quadratic Discriminant Classifiers
    Naive Bayes Classifier
    Distance-based Classifiers
    Support Vector Machine (SVM)
    Features
    Tweet Features
    User Features
    Network Features
    Combination of Features
    Summary
4 Experimental Results
    Introduction
    Dataset
    Twitter Structure
    Dataset Collection
    Relational Database Creation
    Specification of Datasets
    Splitting Methods of Datasets
    Implementation of Classifier
    Evaluation Metrics
    Classifier Parameters Setup
    Summary
5 Conclusions and Future Works
    Conclusions
    Future Work

Acknowledgement

First of all I would like to thank my family for their kind and great support during my master's study at TU Delft. I would also like to thank my student counselor, John Stas, for his great help and advice; my supervisor, Dr. D.M.J. Tax, for his kind support; Drs. D.E. Butterman-Dorey, for helping me to edit my thesis; and Dr. Christian Doerr for his technical advice.


Chapter 1 Introduction

Due to the great success of Online Social Networks (OSNs), a large number of people now use OSN services to collaborate, participate and interact with other users within their communities. Twitter, the largest microblogging online service, has gained significant attention and enormous popularity in the past few years. Users share and discuss everything in this social network. Microblogging is a content-oriented concept in which people can interact with others, both known and unknown. As of March 2013, Twitter has over 1 billion users and an average of 500 million tweets per day. On Twitter, users are restricted to writing messages of no more than 140 characters. These small messages create substantial information dissemination in the network and make Twitter a successful social network for content dissemination.

The dissemination of a tweet in the network depends on different factors. One of the factors that contributes considerably to the propagation of posts is the users themselves. Not all users can equally influence the propagation of tweets. Influential users are the users whose content propagates more successfully, and they are therefore important for analyzing and managing propagation in the Twitter social network.

In this project we have developed a learning-based approach to discover why and how some particular tweets become popular on Twitter. We shall further investigate this network to see to what extent we can predict the popularity of tweets.

1-1 Motivation

In Twitter more than 19% of the tweets are about organizations or product brands, less than 20% of which are shown to have significant sentiment (Yu and Kak, 2012). Predicting the tweets which are likely to stimulate users' interest can improve the sale and marketing of different products and brands. Online advertisements could use such predicted messages to efficiently target the locations of the network which are visited the most. Moreover, successful predictions can also increase user satisfaction by providing users with more attractive content. Media companies could learn how to effectively generate buzz for new films and shows. In political campaigning, groups could learn whom they should target in order to successfully spread their message.

Predicting the popularity of content in Twitter is also important for several other purposes such as viral marketing, popular news detection, personalized message recommendation and trend analysis. Users with many connections can suffer from information overload. It is important to filter the information flow for end users and to provide them with important tweets. Popularity prediction is also helpful in personalizing content and finding the right tweets for end users. On the other hand, understanding how and why a tweet becomes popular can help to gain better insight into how information is dispersed over the network. In the case of marketing, predicting popular tweets is useful for determining the trending topics and products.

In this work we developed an automatic learning-based approach to predict the popularity of content in Twitter. Automatic prediction with machine power has much lower costs compared to human-based work (Bothos et al., 2010).
Furthermore, automatic approaches can scale to very large datasets which would be impossible to manage with human-based classification.

The problem of popularity prediction in Twitter has been studied in some previous works. In most of the recent works the popularity of tweets is defined as the number of retweets, since retweeting is potentially the most effective way to disseminate messages due to its viral nature. Such popularity can also be measured by the number of replies made to each tweet as well as the number of times that a tweet is favored by other users. We shall, however, measure the popularity of tweets in terms of the retweet count for the following reasons:

1) People are more likely to retweet a tweet than to favor it. A tweet (e.g. bad news) can be retweeted many times without getting any favorites. We therefore think that the favorite count is not a good indication of the popularity of tweets.
2) The reply count is not a good measure of the popularity of tweets either, because not all tweets are conversational tweets. A tweet can trigger lots of replies while getting very few retweets and not spreading widely in the network.

Previous works on popularity prediction can be classified into approaches that rely on the content of the message (Hong et al., 2011b; Suh et al., 2010; Tsur and Rappoport, 2012a) and approaches that rely on the social characteristics of the network (Artzi et al., 2012). There are also some studies that try to predict the popularity of content by analyzing the influential users (DeRue and Ashford, 2010). Hong et al. (2011b) and Suh et al. (2010) approach the popularity prediction problem by extracting features from the content and metadata of the messages and trying to predict which messages will attract high numbers of retweets. In more recent work, Zhang et al. (2014) developed a similar approach that assigns different weight values to different features. Tsur and Rappoport (2012b) developed another feature-based approach which predicts the popularity of content in Twitter by means of linear regression. Artzi et al. (2012) developed a model that is able to predict the likelihood that a tweet will be retweeted. They used a combination of tweet features as well as features from the entire network for the prediction task.

In a very recent work, Zaman et al. (2013) developed a Bayesian network-based approach which measured the popularity of tweets based only on the retweet times and the network structure of retweets in a five-minute time window after tweeting. In our experiments, however, we will show that more than 40% of the retweets occur in the first five minutes after tweeting, which is already a significant number of retweets. In this work, we shall consider the retweet counts in different window sizes as different features and study the influence of each feature on the performance of the prediction task. Our approach also differs from previous studies in the sense that we perform a binary classification task to see whether a tweet becomes popular or not. Zaman et al. (2013), by contrast, tried to predict the retweeting activity over the course of time.
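The idea of using retweet counts in different window sizes as separate features can be sketched as follows. This is only an illustrative sketch, not the implementation used in this thesis: the function name, the choice of window sizes and the representation of times as seconds are all assumptions made for the example.

```python
from bisect import bisect_right


def windowed_retweet_counts(tweet_time, retweet_times, windows_minutes=(1, 5, 15, 60)):
    """Count the retweets that fall inside each time window after posting.

    tweet_time and retweet_times are timestamps in seconds (an assumption of
    this sketch). Returns a dict mapping window size in minutes to the number
    of retweets observed within that window.
    """
    # Offsets of each retweet relative to the original tweet, sorted ascending.
    offsets = sorted(t - tweet_time for t in retweet_times if t >= tweet_time)
    # bisect_right gives the number of offsets <= the window length.
    return {w: bisect_right(offsets, w * 60) for w in windows_minutes}


if __name__ == "__main__":
    # Hypothetical tweet posted at t = 0 with six retweets.
    features = windowed_retweet_counts(0, [30, 120, 290, 400, 3000, 4000])
    print(features)
```

Each window yields one feature, so the contribution of a one-minute count can be compared with that of a five-minute or one-hour count.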

The previous learning-based works are mainly based on trial and error: they train classifiers on a limited set of features in a specific dataset. None of the above learning-based studies analyzes the predictors to see why they can predict the popularity of content, nor can they prove that their approach is general enough to be applied to any dataset. In this work, however, we shall perform an in-depth analysis of the different features that we extract from tweets. We not only introduce some new features that improve the performance of the prediction, but we also analyze each single feature in depth, on different classifiers, to gain insight into the contributions and limitations of each feature. Furthermore, we ran our experiments under different conditions (such as extreme data imbalance, different time periods and different thresholds for the popularity definition) to generalize the model as much as possible.

1-2 Research Questions and Contributions to this Work

From a high-level point of view we are interested in seeing how information is disseminated in the Twitter social network. This task is mainly done by predicting which tweets actually become popular within the network. The popularity prediction problem is to some extent similar to the recommender systems problem (Yu and Kak, 2012). In the recommendation problem, the system tries to predict which content would be interesting for a user based on his past history. In contrast, our problem focuses on predicting popular content and information dissemination regardless of the interest of individual users. While the problem of popularity prediction is different from recommendation, it can also be useful for recommender systems in the sense that popular content can be recommended to users. There have also been studies which have shown that combining popularity-based recommender systems with personalized recommendations can provide the best recommendations (Jonnalagedda and Gauch, 2013).
In this work we developed different statistical learning-based classifiers that predict whether a tweet will be popular or not. We extracted different tweet-based and user-based features and studied the contributions and limitations of each single feature in detail. We conducted our experiments with a set of generative and discriminative classifiers and explored the advantages and limitations of each classifier. The experiments were carried out on different datasets under various conditions to make sure that the models are general enough.
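As a minimal sketch of the binary formulation, the snippet below turns retweet counts into popular and non-popular labels for a given popularity threshold. The function names and the inclusive comparison (a tweet with exactly `threshold` retweets counts as popular) are assumptions of this sketch; the thesis text does not fix that detail here.

```python
def label_popular(retweet_counts, threshold):
    """Label a tweet 1 (popular) if its retweet count reaches the threshold, else 0.

    The inclusive comparison (>=) is an assumption made for illustration.
    """
    return [1 if count >= threshold else 0 for count in retweet_counts]


def positive_rate(labels):
    """Fraction of tweets in the positive (popular) class."""
    return sum(labels) / len(labels) if labels else 0.0


if __name__ == "__main__":
    counts = [0, 3, 12, 150, 7, 1000]  # hypothetical retweet counts
    for threshold in (10, 100):
        labels = label_popular(counts, threshold)
        print(threshold, labels, positive_rate(labels))
```

Raising the threshold shrinks the positive class, which is one way the extreme class imbalance mentioned in this chapter can arise.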

More specifically, we are interested in the following research questions:

- Can we predict whether a tweet will be popular or not and, if so, to what extent can we predict the popularity of tweets?
- What kind of features can be extracted from a tweet to predict its popularity?
- What are the most informative features for popularity prediction?
- What are the contributions and limitations of each individual feature?
- Which machine-learning approaches can best model the popularity prediction problem?
- For what kinds of tweets is the learning-based approach successful, and for what kinds of tweets does the model fail to predict popularity?

1-3 Structure of this Thesis

This thesis is organized as follows. In Chapter 2 we discuss the work related to our research. The popularity prediction problem in Twitter and our approach are explained in Chapter 3. In Chapter 4 we explain, in some detail, the datasets and the experiments that we have conducted. Finally, we draw some conclusions and discuss possible future directions in Chapter 5.


Chapter 2 Related Work

2-1 Introduction

The problem of popularity prediction in social networks has been widely studied. This problem has been studied not only in conjunction with Twitter, but also in connection with other social networks. In this chapter we provide an overview of the existing approaches to popularity prediction in social networks, discuss the related work and elaborate on the advantages and limitations of existing methods.

2-2 Information Dissemination in Social Networks

A growing line of research addresses information dissemination through social networks. These studies propose that network cascades can play an important role as mediums for the dissemination of various kinds of information. They tend to be based on the idea that information is spread by various infection mechanisms (Granovetter, 1978; Kempe et al., 2003). In the same category, Kempe et al. (2003) studied a combinatorial optimization problem known as the influence maximization problem. The problem involves finding a small set of seed nodes in a social network to target for initial activation so that the largest expected spread of information can be achieved. However, the exact computation of information cascades is an NP-hard problem (Chen et al., 2010).

Information diffusion has been studied in several online social networks, such as Flickr (Cha et al., 2008) and Digg (Lerman and Galstyan, 2008). The information propagation problem has also been studied in the Twitter social network. In recent work, Galuba and Aberer (2010) characterize and model the propagation of URLs in Twitter. They exploit content popularity, user influence and the rate of propagation to model the propagation of URLs in the network on the basis of linear threshold models. Yang and Counts (2010) studied information diffusion on Twitter through the mention network. They built a novel model to capture three general properties of information diffusion: speed, scale and range. Romero et al. (2011) performed an experimental study on Twitter to explore how different types of information actually spread over the network.

2-3 Predicting the Popularity of Content in Social Networks

Due to the advent of Web 2.0, user-generated content has increased dramatically. There are various types of content that can be generated by users, such as comments and reviews on photos, movies and products. Most of these Web 2.0 services connect the user with other users through a social network, thus producing a social graph. For instance, in microblogging services such as Twitter this social graph is called a follower network. Any content generated by a user becomes visible to all of his or her followers, and each piece of content has the chance to be re-posted by these followers, who subsequently disperse the content over the social network. Re-posting, commonly known as retweeting, gives a post the chance to become popular. The problem of popularity prediction in social networks has been widely studied. In this section we explain this problem in different domains.
In a recent study, Szabo and Huberman (2010) used two content-sharing portals, YouTube and Digg, to demonstrate that by monitoring the early responses to stories they can predict the popularity of such stories with remarkable accuracy. In another study, Lerman and Galstyan (2008) examined the role of social networks in promoting content on Digg. They discovered that the patterns by which interest in a story spreads on the network are indicative of how popular the story will become.

In another domain, Leskovec et al. (2006) considered information cascades in the context of large person-to-person recommendation networks and studied the patterns of cascading that arise in large social networks. Watts and Dodds (2007) added other key factors that can determine influence: (i) the interpersonal relationships between ordinary users and (ii) the readiness of a society to adopt an innovation. This modern view of influence leads to many marketing strategies, such as collaborative filtering, which is a technique used by some recommender systems.

Popularity Prediction in Twitter

Due to the popularity of the Twitter microblogging service there have been many studies on Twitter. A great amount of work has been done to predict the popularity of tweets in this network. In this section we first examine why users retweet by reviewing the related literature, and then we briefly explain the related work on popularity prediction in the Twitter social network.

Understanding how users tweet and their motivations for tweeting is potentially important for predicting whether a tweet will be popular or not. In fact, discovering what content users choose to retweet can help to explain why a particular tweet becomes popular. The motivations for the act of retweeting are well explored in the study by Boyd et al. (2010). They highlighted the main reasons for retweeting as given by users. They introduced 10 different motivations for retweeting, such as commenting on tweets, propagating tweets to new audiences, informing specific persons or groups and saving tweets for future personal access. Although the focus of their study is not to predict the popularity of tweets, the underlying motivations for retweeting that they found can suggest which features to extract from tweets to predict their popularity. Another exploratory study was done by Suh et al. (2010) to find out the factors that lead to retweeting.
They extracted three latent factors from tweet features using Principal Component Analysis (PCA) and tried to associate them with real features. They then introduced a linear model to find the degree of popularity of retweets. They did not, however, motivate their choice of a linear model for the prediction task, nor did they discuss whether PCA is an effective approach for deriving the important factors of retweetability. Moreover, as they state themselves, they only performed their experiments on a limited set of data. They concluded that content-based features such as hashtags and URLs contribute greatly to retweetability.

This conclusion was challenged by a later study (Petrovic et al., 2011), which showed that content features are not informative enough to predict the popularity of tweets. Petrovic et al. (2011) performed experimental work to predict whether a tweet will be retweeted or not. They developed an online learning-based algorithm (Crammer et al., 2006) to make the prediction as quickly as possible. They trained a set of local models on different subsets of the data, generated according to the time of day, to better exploit the time information of tweets. As in the study of Suh et al. (2010), they did not motivate their choice of model, nor did they examine their model on different datasets. They did, however, compare the performance of their online-learning approach with human-based predictions and discovered that their method performs as well as human-based predictions.

Zaman et al. (2010) also performed a popularity prediction study, based on the collaborative filtering approach. Unlike other studies that use features directly extracted from tweets or users, they incorporate implicit positive and negative feedback into their model. If an active follower retweets a tweet, it is considered positive feedback; otherwise it is considered negative feedback. One drawback of this study, though, is that they train the models on at least one hour of data after a tweet has been published. Earlier studies, on the other hand, show that more than 90% of the retweets take place within the first hour after tweeting. Thus it is not practically worthwhile to train a model on such a long time interval.

Artzi et al. (2012) developed a discriminative model to predict the likelihood of a tweet being retweeted. They extracted several historical and lexical features from the text of the tweet and some social features from the user publishing the tweet.
They adopted two different classifiers, the Multiple Additive Regression-Tree (Wu et al., 2008) and the Maximum Entropy classifier, for their prediction task. Their study mainly focused on content features and thus, as a limitation, their work is only applicable to English tweets. Moreover, they did not exploit features, such as hashtags, which are potentially useful for the popularity prediction of tweets.

Zaman et al. (2013) developed a Bayesian approach to predict the number of retweets for a given tweet based on its early spreading pattern. They approached the popularity prediction problem by studying the pattern of spread of tweets. They found that the reaction times to tweets can be well estimated by a log-normal distribution. They introduced a Bayesian network to model the evolution of retweets over time. Unlike other studies, which are mainly feature-based learning models, this study does not extract any features directly from tweets or the user. In fact the prediction is based only on the early spreading patterns of tweets. They claim that their approach works well when at least 10% of the retweets of a tweet are observed. This is less interesting for us, firstly because it is not clear in advance when 10% of the retweets have been observed, and secondly because we are interested in predicting the popular tweets before they are published or very shortly after their publication.

In a more recent study, Zhang et al. (2014) proposed a feature-weighted model that predicts the popularity of tweets in terms of the number of potential retweets. Unlike other works, this work is a multi-class classification task in which a tweet is assigned to one of four possible classes. The classes are: 0: not retweeted, 1: retweeted less than 10 times, 2: retweeted less than 100 times and 3: retweeted more than 100 times. Their feature extraction model extracts a set of features from the tweet itself and from the user who published the tweet. Unlike the study of Artzi et al. (2012), their focus is more on the social features. They adopted a Support Vector Machine (SVM) classifier with a Radial Basis Function (RBF) kernel, which allows their model to create complex boundaries to distinguish the classes. Their weighting mechanism assigns a weight to each of the features, calculated on the basis of the Information Gain of each single feature. This mechanism assigns a higher weight value to features with more information gain, thus making them contribute more to the classification task. The weight values were obtained from an experimental evaluation on their dataset.
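To make the four-class scheme and the information gain criterion concrete, the sketch below maps a retweet count to a class and computes the information gain of a discrete feature with respect to the class labels. It is an illustration under stated assumptions and not the code of Zhang et al. (2014): the function names are hypothetical, a count of exactly 100 is placed in class 3 because the class definitions above leave that boundary open, and features are assumed to be discrete.

```python
import math
from collections import Counter


def retweet_class(n):
    """Map a retweet count to one of the four classes described above.

    Assumption: a count of exactly 100 is assigned to class 3, since the
    stated class boundaries do not cover that value explicitly.
    """
    if n == 0:
        return 0
    if n < 10:
        return 1
    if n < 100:
        return 2
    return 3


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Information gain of a discrete feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    labels = [retweet_class(n) for n in (0, 0, 250, 300)]
    has_url = [False, False, True, True]   # hypothetical feature, separates the classes
    is_reply = [False, True, False, True]  # hypothetical feature, uninformative here
    print(information_gain(has_url, labels), information_gain(is_reply, labels))
```

A feature that separates the classes well receives a high gain and would therefore receive a larger weight under such a scheme; how the gains are scaled into weights is not specified in the text above.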
Although their approach is reported to outperform the non-weighted approach, the authors have not provided enough evidence that this approach is also optimal for other datasets and under other settings.

A similar study was done by Hong et al. (2011a) to predict the popularity of tweets. They formulate this task as two different problems, a binary and a multi-class classification problem. The binary classification task sets out to predict whether a message will be retweeted or not, and the multi-class classification task tries to predict the volume of retweets of a tweet after it has been published. They adopted the TF-IDF mechanism to process the content features, but they did not clearly specify what kind of classifier they used or which features contribute most to the classification task. Their approach also suffers from a lack of generality since they only tested their method on a limited dataset.

Most of the related work on popularity prediction in Twitter was carried out on a limited dataset with a limited number of settings. To our knowledge there are no studies which examine the contribution of individual features to popularity prediction. In this work we study the contribution of each individual feature to predicting the popularity of tweets. We also introduce some new features that have not been used in previous works, and we perform a comprehensive set of experiments on different datasets to consolidate our findings. Our approach and the features are described in Chapters 3 and 4.

Popularity Predictions and Recommendations

The problems of popularity prediction and recommendation are similar in some aspects. Both problems try to identify influential content. While in popularity prediction the focus is more on the popularity of content, in recommender systems the focus is more on the user, the goal being to recommend to a user the items which satisfy him the most. Predicting the popularity of content can be quite useful in connection with making recommendations. Jonnalagedda and Gauch (2013) showed that a hybrid popularity-based and personalized recommender system can outperform purely personalized recommender systems. On the other hand, recommender systems can also be helpful for predicting popular content. Petrovic et al. (2011) and Zaman et al. (2010) proposed methods for predicting popular content based on recommender system algorithms. Their approach seeks to predict whether a tweet is retweeted by another user on the basis of a collaborative filtering algorithm.

Popularity Predictions and Influences

Because users play such an important part in making content popular, there has been quite some research into identifying influential users. By studying the behavior of influential users, we can further investigate their contribution to predictions concerning popular content. Solis (2012) discussed the importance of digital influence and the problems with measuring influence, and finally defined influence.
Some of the most frequent questions which attract the audience were: What is influence and what makes someone influential? Who is influential in social networks and why? How can I recognize influence or the capacity to influence? By better understanding how digital influence works, businesses can improve their understanding of the market and deploy social media to steer positive conversations. Solis (2012) believes that influence as a score is imprecise; recently, however, many studies have defined metrics as a score which is assigned to people based on what they do and say in social networks. We have reviewed some of these studies and noticed that most of them suffer from the same problem.

Previous studies made use of different terminology in their research into matters such as influence, popularity, important nodes and efficient seed sets. Below are six different titles that are frequently used in the related literature:

- Predicting the popularity of users on Twitter
- Predicting the popularity of tweets on Twitter
- Predicting the influential users on Twitter
- Predicting the influential tweets on Twitter
- Predicting the influence of users on Twitter
- Predicting the influence of tweets on Twitter

Each of these titles has a different meaning and different implications. The first, the popularity of users on Twitter, can translate into the number of followers of users. The second implies the number of retweets obtained by tweets. But in the last four titles, very general words such as influence and influential users appear, which cannot be translated into one single measurement. There are clear distinctions between popularity and influence; they are not the same concept.

In social network analysis (SNA), several metrics exist that indicate the social influence of users in networks. The top three common measures are presented by Freeman (1979). (i) Degree centrality indicates the number of direct ties of a node to other nodes. Iyengar et al. (2010) called well-connected users "hubs".
(ii) Closeness centrality: unlike degree centrality, which takes into account only immediate ties, closeness centrality considers all the connections and emphasizes the distance from each user to all the others in the network. (iii) Betweenness centrality quantifies the number of times a user acts as a bridge along the shortest path between two other users. In the same area, one study indicates that such bridges, which connect two otherwise unconnected parts of the network, are influential.
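The three centrality measures can be illustrated on a small undirected, unweighted graph. The sketch below is a plain-Python illustration rather than code from any of the cited studies: the graph, the function names and the normalization of closeness as (reachable nodes) / (sum of distances) are assumptions of this example, and the betweenness scores are left unnormalized.

```python
from collections import deque


def bfs(graph, source):
    """Return (dist, sigma): hop distances from source and shortest-path counts."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma


def degree_centrality(graph):
    """Number of direct ties of each node."""
    return {u: len(neighbours) for u, neighbours in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path distances to them."""
    result = {}
    for u in graph:
        dist, _ = bfs(graph, u)
        total = sum(dist.values())
        result[u] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """For each node, the fraction of shortest paths between other pairs that pass through it."""
    info = {u: bfs(graph, u) for u in graph}
    nodes = list(graph)
    score = {u: 0.0 for u in graph}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest s-t path iff the distances add up exactly.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


if __name__ == "__main__":
    # Hypothetical network: a - b - c, with d and e also attached to c.
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"], "d": ["c"], "e": ["c"]}
    print(degree_centrality(graph))
    print(closeness_centrality(graph))
    print(betweenness_centrality(graph))
```

In this toy graph node c is both the hub (highest degree) and the main bridge (highest betweenness), while b has a low degree but still lies on every shortest path from a to the rest of the network.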

28 14 Reated Work Hinz et a. (2011) put forward the notion of degree and betweenness centraity to find the best seed set of infuentia users dependent on socia inks. They found that hubs and bridges are more ikey to participate in successfu seeding strategy in vira marketing campaigns. Cha and Gummadi (2010) presented an empirica anaysis of the infuence of the twitter, they compared three different measures of infuence: in-degree which counted the number of foowers of a user, The more foowers a user has, the more infuentia the user is. Retweet, which invoves counting the number of retweets, beongs to the post of one person, so the more retweet post a user gets the more infuentia he is. Mentions counts the number of mentions containing a person s name, so the more repies a user receives, the more infuentia he is. Bakshy et a. (2011) narrowed down the definition of infuence to the abiity of the user to post URLs which diffuse through the Twitter foower graph. They studied ony the users who post URLs and caed them "seed" content. They quantified the infuence of a given post by the number of users who repost the URL. They fitted a mode which predicted infuence using individua attributes and past activity to examine the utiity of such a mode for targeting users. The size of the diffusion is more directy associated with diffusion and the dissemination of information. Li et a. (2013), ike in other works, defined infuence as a successfu diffusion of information and they correcty mention that information/infuence propagation and information/infuence diffusion and information cascades are the same concept and that is a concept that is used frequenty in their work. Cha and Gummadi (2010) compared three measures and discovered that the number of retweets and the number of mentions are correated whie the number of friends are not correated and so their hypothesis is that the number of foowers of users may not be a good infuence measure. Kwak et a. 
(2010) compared different influence measures in terms of both the retweets and the follower network. The authors ranked users by the number of followers and by PageRank and found the two rankings to be similar. They found a gap between the influence calculated on the follower network and that calculated on the retweet network, that is, between influence inferred from the number of followers and influence inferred from the popularity of the tweets. Weng et al. (2010) did not define influence very clearly; they mentioned that an influential twitterer is one with a certain authority within the social network. They implemented a topic-sensitive PageRank to overcome the problem of identifying the interests of twitterers, which affect the way twitterers influence one another. They thus took into account both link structure and topical similarity among twitterers.

In another study, Katz and Lazarsfeld (1955) introduced opinion leadership in a two-step flow theory, where opinion leaders receive information from society through the news media and pass it on to less informed people. Rogers (1995) relies on the two-step flow theory in developing his ideas on the influence of opinion leaders in the diffusion of innovations. Opinion leaders typically have greater exposure to the mass media, more social experience, more viewers and followers, and are more innovative. Wejnert (2002) discusses benefit versus cost: the successful adoption of an innovation is the benefit, and the direct or indirect cost paid to obtain that benefit is the cost of the innovation. An example would be the need to buy a new kind of fertilizer in order to use innovative seeds. Trusov et al. (2010) took a different approach: they measured influence based on network activity by studying log-in data in social networks, and showed how the posting behaviour of one user affects the members of that user's network, both those at the top level, who are connected by direct invitation, and those at other levels, who are friends of friends. They evaluated whether content from members at this top level changes the log-in frequency and length of stay on the site of other members, and concluded that such changes are evidence of the influence of the top level on the rest of the members. As we can see, influence is not a black and white concept. In OSNs, what influences people differs from person to person, and there is no single way of measuring influence.

2-4 Social Network Prediction Applications

Predictive models analyze past information to assess how likely it is that an event will occur in the future.
Although human experts could achieve greater accuracy, they are not scalable, they do not work properly in cases where events have a very low or very high probability, and they are definitely more expensive than a computer-based approach (Bothos et al., 2010). Different studies have focused on the applications of micro-blogging services in different fields. For instance, Bollen et al. (2010) and Sprenger and Welpe (2010) studied the applications of micro-blogging in the stock market. They investigated whether collective intelligence from micro-blogging information

can predict events in the stock market. Asur and Huberman (2010) built a linear regression model to predict box-office revenues before the release of movies using Twitter data. Jansen et al. (2009) believe that micro-blogging could be used as part of online word-of-mouth marketing and that services like Twitter could play an important role in marketing. Micro-blogging has become an important platform for information publishing and dissemination. In recent years, the adoption and use of micro-blogging in emergencies has received a great deal of attention. For example, Culotta (2010) studied the feasibility of detecting influenza outbreaks by analyzing micro-blogging data. A number of studies discussed the use of micro-blogging as a communication and information-sharing resource during various crises, involving for instance violence and natural disasters. Sakaki et al. (2010) used the real-time characteristics of Twitter, and people's actions and posts on Twitter during catastrophes, to investigate real-time interaction during events such as earthquakes, and they proposed an algorithm to monitor tweets and detect a target event.

2-5 Summary

In this chapter we discussed various studies related to our work. Due to the importance of predicting popular content in social networks, there have been quite a number of studies on how and why content becomes popular in social networks. We explored different studies which are based on information propagation and popularity prediction in different social networks. Twitter itself has also been the subject of much related work in this area. We discussed the advantages and limitations of the existing studies in this area and motivated our approach to popularity prediction in the Twitter social network.

Chapter 3 Predictive Model

3-1 Introduction

In this chapter we describe our proposed learning method for predicting the popularity of tweets in the Twitter social network. We model the problem of popularity prediction as a binary classification problem. To formulate the binary classification problem we need to define what exactly constitutes popular and unpopular content. As we mentioned in Chapter 1, there are different approaches to defining popularity. Our proposed definition of popularity is based on the works of Hong et al. (2011a) and Zhang et al. (2014), in which popularity is defined by the number of retweets that a tweet will get. In the present work we consider different thresholds to define the popular tweets and perform different experiments in various setups, in order also to obtain the best possible threshold for our classification problem. The rest of this chapter is organized as follows. We review the prediction challenges and explain why it is important to predict popular content in Twitter. The overall architecture of our proposed learning-based method is then described. The next section introduces the different types of classifiers that we used and discusses their usefulness for our classification task. In Section 3-5 we describe in detail the features that we extracted for our classification task and how we combine these features. In the next chapter we experimentally examine our proposed learning method.

3-2 Prediction Challenges

The ability to predict popular content in social networks is quite important for adapting and personalizing the huge amount of information available to users. Successful prediction can provide the most relevant content to users and improve the user's experience with social media (Yu and Kak, 2012). Furthermore, early prediction of viral information is quite useful for marketing, trend analysis and popular news detection. Automatic prediction of popular content is not, however, a simple problem. It is even more challenging for the Twitter social network due to the limit placed on the size of a tweet message. Moreover, the imbalanced nature of the data, i.e., the huge difference between the number of tweets which become popular and the number which do not, makes the problem even more challenging. In fact, we need to find the tweets which are likely to become popular in a large pool of tweets which are very unlikely to become popular. Finding features that are able to distinguish popular tweets from those which are not is therefore quite important. Another challenging aspect of the popularity prediction problem is predicting the popular tweets as soon as possible. In other words, we are in practice limited to using only the information which is available shortly after a tweet has been published. In the next section we explain how we approach this problem and motivate our decisions.

3-3 Model Architecture

Our proposed approach to popularity prediction is based on a feature-based classification model in which we extract a set of features from tweets and classify the tweets as popular or unpopular. We are in fact interested in seeing to what extent we can predict the popularity of tweets in the Twitter social network. In other words, our research interest reduces to a binary classification problem in which a tweet is assigned to a popular (positive) or unpopular (negative) class. As we mentioned in Chapter 1, there are several research questions that can be raised to tackle this problem.
In this section we introduce the overall architecture of our system, which comprises different components. Each component needs an in-depth analysis in order to derive the optimal model for this problem and to answer the research questions posed in Chapter 1. Figure 3-1 illustrates the overall architecture of our proposed model. The data collection method is illustrated on the left side of the figure. Detailed

information about the data collection is given in Chapter 4. The model then extracts several features from the tweets, and different machine learning approaches are used to train classifiers. The trained classifier is then used to predict whether a tweet will be popular or not.

[Figure 3-1: Schematic representation of the model architecture. Components shown: data collection (Twitter Stream API, conversion of JSON to a relational DB, extraction of frequent hashtags); data labeling (popular: retweet count >= k; non-popular: retweet count < k); feature vectors and selected feature vectors; train set and test set; machine learning methods; predictive model; expected label.]

The two important decisions for learning-based systems are the choice of classifier and the features that are extracted from the data. In the following two sections we describe our choices in more detail and motivate our approach. We introduce the different discriminative and generative classifiers which are potentially suitable for our classification problem and the features that will be extracted from the Twitter environment. In the next chapter we experimentally examine the suitability of the classifiers introduced in this chapter and find the optimal model for our problem, one which is both accurate and generalizable.

3-4 Classification Methods

The task of automatic classification of data can be carried out with the help of several different classifiers. The machine learning community has introduced several feature-based classifiers which are suitable for different applications (Theodoridis and Koutroumbas, 2008). Depending on the features and the nature of the data, several classifiers can be used for a prediction task. In statistical machine learning, a classifier can be either generative or discriminative. A generative classifier tries to estimate a probability distribution for each class of data and assigns an unknown sample to the class with the highest likelihood. Discriminative approaches, on the other hand, try to find a decision boundary which best discriminates the data points of different classes. Depending on the nature of the data, the features, and the desired performance and complexity, different models can be trained. In this section we describe the classifiers that we used and the reasons for using them. In the next chapter we experimentally show the performance of each method and introduce the optimal model for our problem.

Generative classifiers work by learning the class-conditional probability, that is, $f_c(x) = \Pr(X = x \mid C = c)$ for each class $c$. In this formula the feature vector is $X$ and the class label is $C$. The prior probability of class $c$ is denoted by $\pi_c$, with $\sum_{c=1}^{C} \pi_c = 1$. In order to estimate $f_c(x)$, we first make some assumptions about its form. We assume that $f_c(x)$ is normal (Gaussian) with mean $\mu_c$ and variance $\sigma_c^2$. In particular, the following estimates are used, where $n$ is the total number of training observations and $n_c$ is the number of training observations in the $c$-th class. $\hat{\mu}_c$ is simply the average of all the training observations from the $c$-th class, and $\hat{\sigma}^2$ can be seen as a weighted average of the sample variances of the $C$ classes.

$$\hat{\mu}_c = \frac{1}{n_c} \sum_{i:\, c_i = c} X_i \tag{3-1}$$

$$\hat{\sigma}^2 = \frac{1}{n - C} \sum_{c=1}^{C} \sum_{i:\, c_i = c} \left( X_i - \hat{\mu}_c \right)^2 \tag{3-2}$$

$\pi_c$ is usually estimated simply by the empirical frequency of the class in the training set (equation 3-3).

$$\hat{\pi}_c = \frac{\text{number of samples in class } c}{\text{total number of samples}} = \frac{n_c}{n} \tag{3-3}$$
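As an illustration of the estimates in equations 3-1 to 3-3, the following minimal sketch computes the class means, a pooled variance and the class priors for a single scalar feature. It assumes that the pooled variance divides by the number of observations minus the number of classes; the function name and the toy values are hypothetical and are not taken from the thesis data.

```python
from collections import defaultdict

def fit_gaussian_classes(xs, labels):
    """Estimate per-class means, a pooled variance and class priors."""
    groups = defaultdict(list)
    for x, c in zip(xs, labels):
        groups[c].append(x)
    n = len(xs)
    n_classes = len(groups)
    # Equation 3-1: average of the observations in each class.
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    # Equation 3-2 (assumed form): pooled within-class variance.
    squared = sum((x - means[c]) ** 2 for c, v in groups.items() for x in v)
    pooled_variance = squared / (n - n_classes)
    # Equation 3-3: empirical class frequencies.
    priors = {c: len(v) / n for c, v in groups.items()}
    return means, pooled_variance, priors

# Made-up feature values for three non-popular and two popular tweets.
xs = [1.0, 2.0, 3.0, 10.0, 12.0]
labels = ["neg", "neg", "neg", "pos", "pos"]
means, pooled_variance, priors = fit_gaussian_classes(xs, labels)
```

With these toy values the priors reflect the class imbalance directly (three fifths versus two fifths), which is why the prior term matters for skewed popularity data.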

The class-conditional probability $f_c(x)$ can then be defined as a Gaussian distribution as follows (equation 3-4):

$$f_c(x) = \frac{1}{(2\pi)^{p/2} \, |\Sigma_c|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) \right) \tag{3-4}$$

Here $p$ is the dimension and $\Sigma_c$ is the covariance matrix. The vector $X$ and the mean vector $\mu_c$ are both column vectors. For QDC this is the density of $X$ conditioned on the class $C = c$, denoted by $f_c(x)$. According to Bayes' rule, what we need to compute is the posterior probability, which is defined as follows:

$$\Pr(C = c \mid X = x) = \frac{f_c(x)\, \pi_c}{\sum_{i=1}^{C} f_i(x)\, \pi_i} \tag{3-5}$$

Given the posterior probabilities, the class label for a given sample can be decided with the following decision rule:

$$\text{class}(x) = \arg\max_{c} \Pr(C = c \mid X = x) \tag{3-6}$$

Linear and Quadratic Discriminant Classifiers

Linear and quadratic discriminant classifiers are two simple yet effective generative classifiers that have been widely used in different applications such as text classification (Aggarwal and Zhai, 2012) and face recognition (Lee et al., 2010). To our knowledge these classifiers have never been used for the task of popularity prediction in social networks. Due to their simplicity and low time complexity, we adopted different linear and quadratic classifiers (such as LDA and QDA) to examine their suitability for our prediction problem. The purpose of discriminant analysis is to assign an observation to one of several groups or classes. Linear discriminant analysis assumes that the measurements from each class are normally distributed and that the different classes have the same covariance matrix, $\Sigma$. Quadratic Discriminant Analysis, on the other hand, sets out to find the

quadratic combination of features and is more complex than linear discriminant analysis. Unlike LDA, QDC does not make the assumption that different classes have the same covariance matrix $\Sigma$. Instead, QDC assumes that each class $c$ has its own covariance matrix $\Sigma_c$. A major problem associated with LDA, and even more with QDA, is that a large number of parameters have to be estimated in the case of high-dimensional datasets [1]. However, most of the datasets in our problem are low-dimensional (around 20 features), which makes the use of these two classifiers less of a problem in terms of complexity.

[1] Dimensional-Data.pdf

Naive Bayes Classifier

A Naive Bayes classifier is a simple generative classifier based on the application of Bayes' theorem with the strong assumption that the features are independent. In other words, a Naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. Despite their naive design and apparently oversimplified assumptions, Naive Bayes classifiers have worked quite well in many complex real-world situations such as text classification (Frank and Bouckaert, 2006), spam detection (Freeman, 2013), sentiment classification (Narayanan et al., 2013) and opinion mining (Fouzia Sayeedunnissa et al., 2013). The Naive Bayes model works very well in problems in which the features are independent. In our tweet classification problem, as will be seen later in this chapter, most of the features are independent, so the Naive Bayes classifier is potentially a suitable classifier. The Bayes classifier calculates the probability of an object belonging to each of the classes. Let $c$ be a class label for a tweet (popular or non-popular) and let the tweet be represented by a feature vector $x = (x_1, \dots, x_f)$. From Bayes' rule we can calculate the class posterior probability as follows:

$$P(c \mid x_1, \dots, x_f) = \frac{P(c)\, P(x_1, \dots, x_f \mid c)}{P(x_1, \dots, x_f)} \tag{3-7}$$

Using the naive independence assumption we can write:

$$P(x_i \mid c, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_f) = P(x_i \mid c) \tag{3-8}$$

for all $i = 1, \dots, f$. Using this equation, the class posterior can be written as:

$$P(c \mid x_1, \dots, x_f) = \frac{P(c) \prod_{i=1}^{f} P(x_i \mid c)}{P(x_1, \dots, x_f)} \tag{3-9}$$

Since $P(x_1, \dots, x_f)$ is constant for all classes, we can use the following classification rule:

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{f} P(x_i \mid c) \tag{3-10}$$

The class with the highest posterior probability is chosen as the class label for a given sample.

Distance-based Classifiers

Due to their simplicity, we examined two distance-based classifiers, namely the K-Nearest Neighbour (K-NN) and Nearest Mean classifiers. K-NN is a popular and simple classifier which is potentially suitable for our problem. It is a type of instance-based learning, or lazy learning, in which the function is only approximated locally and all computation is deferred until classification. The K-NN algorithm is among the simplest of all machine learning algorithms. It is a discriminative classification algorithm that assigns a query sample to the class to which most of its k nearest neighbours belong. A Euclidean distance measure is used to find the k nearest neighbours of the query among a set of samples with known classes (Witten and Frank, 2005). A drawback of the basic "majority voting" classification occurs when the class distribution is skewed: the more frequent class tends to dominate the prediction for a new example, because its samples tend to be common among the k nearest neighbours due to their large number (Coomans and Massart, 1982).
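The majority-voting rule and its sensitivity to a skewed class distribution can be illustrated with a few lines of code. This is a minimal sketch with hypothetical two-dimensional feature vectors and made-up labels; it uses the Euclidean distance mentioned above and is not the implementation used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Label of the majority class among the k nearest training samples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up, imbalanced training set: five unpopular tweets, two popular ones.
train = [
    ((0, 0), "unpopular"), ((0, 1), "unpopular"), ((1, 0), "unpopular"),
    ((1, 1), "unpopular"), ((2, 2), "unpopular"),
    ((5, 5), "popular"), ((6, 5), "popular"),
]
query = (5, 4)  # lies next to the two popular samples

small_k = knn_predict(train, query, k=3)
large_k = knn_predict(train, query, k=7)
```

With k = 3 the two nearby popular samples win the vote, whereas with k = 7 every training sample votes and the majority class dominates, which is the drawback described above.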

The Nearest Mean classifier is a classification model that assigns to an observation the label of the class of training samples whose mean is closest to the observation. This classifier works in a similar way to the nearest neighbour classifier, but instead of storing each training sample, only the mean of each class is stored. Using the Euclidean distance, objects are assigned to the class with the nearest mean. Nearest mean classifiers are less sensitive to imbalanced data because the class means do not depend on the number of samples in each class.

Support Vector Machine (SVM)

SVM is a discriminative classifier which has been successfully applied to many problems such as text classification (Aggarwal and Zhai, 2012), image processing and face recognition (Heisele et al., 2001; Jafri and Arabnia, 2009), spam detection (Wang) and many more problems in social media. This classifier has also been adopted for the tweet popularity prediction problem (Zhang et al., 2014). However, as we mentioned in the previous chapter, that work did not employ some of the features, such as content features, that we introduce in this work. SVM tries to find a hyperplane which discriminates the classes; it is not necessary to estimate the class density $P(X \mid C)$ or the posterior probability $P(C \mid X)$. Suppose we are given a training set $(x_i, y_i)$, $i = 1, \dots, n$, in which $x_i = (x_{i1}, \dots, x_{ip})$ is a $p$-dimensional sample and $y_i \in \{+1, -1\}$ is the corresponding label. The task of a support vector classifier is to find a linear discriminant function $g(x) = w^T x + w_0$, where $w$ has to be chosen to satisfy $w^T x_i + w_0 \geq +1$ for all points in class $y_i = +1$. Similarly, it must satisfy $w^T x_i + w_0 \leq -1$ for those in class $y_i = -1$. Therefore we seek a solution for which the following condition holds:

$$y_i \left( w^T x_i + w_0 \right) \geq 1, \quad i = 1, \dots, n \tag{3-11}$$

The optimal linear function is obtained by solving the following quadratic programming problem:

$$\min \; \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left( y_i \left( w^T x_i + w_0 \right) - 1 \right) \tag{3-12}$$

where $\alpha_i \geq 0$, $i = 1, \dots, n$, are Lagrange multipliers. Minimizing the norm of $w$ maximizes the margin. At the optimum of this objective function, the partial derivatives with respect to $w$ and $w_0$ must be zero, which leads to the following solution:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{3-13}$$

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \tag{3-14}$$

This expression is known as the dual optimization problem, and it has to be maximized subject to the following constraints:

$$\alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{3-15}$$

This optimization problem is a constrained quadratic programming task because of the $\alpha_i \alpha_j$ term. To be able to separate the data linearly, the feature space typically has to be mapped to a higher-dimensional space. Functions that correspond to inner products in some space are known as kernel functions. The kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ takes two samples from the input space and maps them to a real number indicating their similarity. For all $x_i, x_j \in \mathcal{X}$, the kernel function satisfies

$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \tag{3-16}$$

where $\phi$ is an explicit mapping from the input space $\mathcal{X}$ to a dot-product feature space $\mathcal{H}$, $\langle a, b \rangle$ denotes the inner product of $a$ and $b$, and $\mathcal{H}$ is a Hilbert space [2]. This inner product can be replaced by another kernel function. There are plenty of kernel functions that we can use, each of which is equivalent to an inner product after some transformation. The following are the four most popular kernel functions:

[2] A Hilbert space is a complete linear space equipped with an inner product operation.

Table 3-1: SVM kernels

Name           Definition
Linear         $K(x_i, x_j) = x_i^T x_j$
Radial Basis   $K(x_i, x_j) = e^{-\gamma \| x_i - x_j \|^2}$, $\gamma > 0$
Polynomial     $K(x_i, x_j) = (\gamma x_i^T x_j + c_0)^d$, $\gamma > 0$
Sigmoid        $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + c_0)$

Here $d$ is the degree parameter of the kernel function, $\gamma$ is the gamma parameter of the kernel function, and $c_0$ is the coef0 parameter of the kernel function. The RBF, polynomial and sigmoid kernels are more flexible than the linear kernel, and all of them have additional parameters (such as $\gamma$) that must be set by the user. In cases where the data are non-separable, the training feature vectors fall into three categories: (I) vectors that fall outside the margin and are correctly classified; (II) vectors that fall inside the margin and are correctly classified; (III) vectors that are misclassified. These three categories can be dealt with under a single type of constraint:

$$y_i \left( w^T x_i + w_0 \right) \geq 1 - \xi_i \tag{3-17}$$

The first category of the data corresponds to $\xi_i = 0$, the second to $0 < \xi_i \leq 1$ and the third to $\xi_i > 1$. The goal is to make the margin as large as possible while at the same time making the number of points with $\xi_i > 0$ as small as possible. The optimization problem can therefore be written as follows:

$$\min \; \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left( y_i \left( w^T x_i + w_0 \right) - 1 \right) + C_{svm} \sum_{i=1}^{n} \xi_i \tag{3-18}$$

where $C_{svm}$ is a penalty parameter for the errors on the training set. In this work we use a package that implements support vector machines on top of the LIBSVM library (Dimitriadou et al., 2010).

3-5 Features

A critical factor when developing a prediction model is to represent samples with a good set of features. Good features should be informative and should

have discriminative power. That is, the features should be able to discriminate between the tweets that become popular and those which do not. In our proposed model we have extracted features from three sources of information: the tweet itself, the user who posts the tweet, and the follower network of connected users (i.e. followers and followees). The features can be either discrete, which means that they take a value from a defined set of values, or continuous. Most of the features that we extracted for this work are independent. The tweet features such as date and time, for example, do not depend on the user who publishes the tweet. Nevertheless, some features, such as follower count and friend count, are correlated with each other. Figure 3-2 shows the correlation between the features that we extracted in this work. In this section we describe the features that we extracted and the approaches adopted to combine them.

Tweet Features

Tweet features are the features that are extracted from the tweets themselves. Table 3-2 lists the features that we extracted from tweets, together with their description and their type (i.e. continuous or discrete).

Table 3-2: Features extracted from tweets

Feature   Type       Description
Date      discrete   Day of the week when the tweet was posted
Time      discrete   Hour of posting
URL       discrete   Whether the tweet contains a URL
Hashtag   discrete   Whether the tweet contains a hashtag

User Features

One of the most important factors that contributes to the popularity of any tweet is the user who posted it. Extracting features from the user can significantly help the classifier to predict the popular posts. We extracted several continuous features from users, all of which are listed in Table 3-3.
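The four tweet features of Table 3-2 could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name, the regular expressions used to detect URLs and hashtags, and the example tweet are hypothetical, and the posting time is assumed to be already parsed into a `datetime` object.

```python
import re
from datetime import datetime

# Assumed patterns: a URL starts with http:// or https://, a hashtag with '#'.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")

def tweet_features(text, posted_at):
    """Discrete tweet features: day of week, hour, URL flag, hashtag flag."""
    return {
        "date": posted_at.weekday(),  # 0 = Monday ... 6 = Sunday
        "time": posted_at.hour,       # hour of posting, 0 to 23
        "url": int(bool(URL_PATTERN.search(text))),
        "hashtag": int(bool(HASHTAG_PATTERN.search(text))),
    }

# Made-up example tweet.
features = tweet_features(
    "Results are in #election http://example.com/x",
    datetime(2012, 11, 6, 21, 15),
)
```

Encoding the URL and hashtag features as 0/1 flags keeps all four features discrete, in line with the types listed in the table.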

[Figure 3-2: The correlation between different features. Both axes list the features stdelapsetime, num_retweet, hour_perday_offset, sec_perday_offset, avgelapsetime, minelapsetime, parent_avg_friends_count, parent_avg_follower_count, parent_std_follower_count, parent_std_tweet_perday, parent_friends_count, parent_avg_tweet_perday, parent_tweet_perday and parent_follower_count.]
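A pairwise correlation matrix of the kind summarized in the figure can be computed as sketched below. The thesis does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here; the column values are made up for illustration and are not taken from the datasets.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlation of every pair of feature columns, keyed by name pair."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical feature columns for five tweets.
columns = {
    "parent_follower_count": [10, 200, 3000, 50, 800],
    "parent_friends_count": [20, 180, 2500, 70, 900],
    "hour_perday_offset": [3, 14, 9, 22, 9],
}
matrix = correlation_matrix(columns)
```

In this toy data the follower and friend counts are strongly correlated, mirroring the observation that these two user features are not independent.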

Table 3-3: Features extracted from users

Feature                 Type         Description
parent follower count   continuous   number of followers of the user
parent friends count    continuous   number of friends of the user
parent tweet perday     continuous   number of tweets per day posted by the user

Network Features

In addition to tweet and user features, we also extracted some additional features from the network of the user who posted the tweet (see Figure 3-3). In this schema, node one is the original tweeter, who has 13 followers, and node two has seven followers; these followers are called "followers of followers". These features help to better exploit the information in the user's network, which can potentially contribute to predicting the popularity of the tweets.

[Figure 3-3: Tweet publisher and its follower network. The diagram shows the tweet initiator (parent/tweeter), the followers and the followers of followers, annotated with the follower count, friend count and tweets per day of the parent and of the parent's followers.]

Avg/Std # of followers of followers

These two features are constructed on the basis of the network of the users. The network of the user who posts the tweet has an important role in the propagation and popularity of tweets, because tweets are mainly propagated through the network of users. In fact the tweets of the users who have

a larger network have a higher chance of being exposed and therefore retweeted. To build these features, we calculate the average and standard deviation of the number of followers of the user's followers, indicated by fof. More specifically, suppose that user $u$ has $n$ followers and $\mathrm{fo}_i(u)$ indicates the number of followers of the $i$-th follower of $u$. The average and standard deviation of the number of followers of followers are then defined as:

$$\mathrm{fof}(u) = \frac{\sum_{i=1}^{n} \mathrm{fo}_i(u)}{n} \tag{3-19}$$

$$\sigma_{\mathrm{fof}}(u) = \sqrt{\frac{\sum_{i=1}^{n} \left( \mathrm{fo}_i(u) - \mathrm{fof}(u) \right)^2}{n}} \tag{3-20}$$

Avg/Std # of friends of followers

These two features are constructed in a very similar way to the avg/std number of followers of followers. The only difference is that, instead of the followers of followers, the number of friends of each follower is taken into account, which is indicated by rof. More specifically, suppose that user $u$ has $n$ followers and $\mathrm{fr}_i(u)$ indicates the number of friends of the $i$-th follower of $u$. These two features are calculated using the following equations:

$$\mathrm{rof}(u) = \frac{\sum_{i=1}^{n} \mathrm{fr}_i(u)}{n} \tag{3-21}$$

$$\sigma_{\mathrm{rof}}(u) = \sqrt{\frac{\sum_{i=1}^{n} \left( \mathrm{fr}_i(u) - \mathrm{rof}(u) \right)^2}{n}} \tag{3-22}$$

Avg/Std Tweets per day of followers

As the name of these two features suggests, they are constructed by taking the average and standard deviation of the number of tweets per day of all the followers of the user. These two features are calculated using the following equations:

$$\mathrm{tof}(u) = \frac{\sum_{i=1}^{n} t_i(u)}{n} \tag{3-23}$$

$$\sigma_{\mathrm{tof}}(u) = \sqrt{\frac{\sum_{i=1}^{n} \left( t_i(u) - \mathrm{tof}(u) \right)^2}{n}} \tag{3-24}$$

where $t_i(u)$ indicates the tweets per day of the $i$-th follower of user $u$.
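The six network features defined above share the same form, so they can be computed with one helper. The sketch below assumes that each follower is available as a record with hypothetical keys `followers_count`, `friends_count` and `tweets_per_day`, divides by n as in the equations above, and returns zeros for a user without followers (an assumption that the text does not address).

```python
import math

def mean_and_std(values):
    """Average and population standard deviation (division by n)."""
    n = len(values)
    if n == 0:
        return 0.0, 0.0  # assumed convention for users without followers
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def network_features(followers):
    """Avg/std of followers, friends and tweets per day over a user's followers."""
    fof, sigma_fof = mean_and_std([f["followers_count"] for f in followers])
    rof, sigma_rof = mean_and_std([f["friends_count"] for f in followers])
    tof, sigma_tof = mean_and_std([f["tweets_per_day"] for f in followers])
    return {
        "fof": fof, "sigma_fof": sigma_fof,
        "rof": rof, "sigma_rof": sigma_rof,
        "tof": tof, "sigma_tof": sigma_tof,
    }

# Made-up follower records for one tweet publisher.
followers = [
    {"followers_count": c, "friends_count": 10, "tweets_per_day": t}
    for c, t in zip([2, 4, 4, 4, 5, 5, 7, 9], [1, 3, 1, 3, 1, 3, 1, 3])
]
features = network_features(followers)
```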

Table 3-4: Features extracted from the follower network

Feature              Type         Description
fof                  continuous   avg # of followers of followers of the user
$\sigma_{fof}$       continuous   std # of followers of followers of the user
rof                  continuous   avg # of friends of followers of the user
tof                  continuous   avg # of tweets per day of followers of the user
$\sigma_{tof}$       continuous   std # of tweets per day of followers of the user

Early Tweet Features

In addition to the typical features of tweets, we also extracted a set of features based on the early behaviour of tweets. To extract these features, we monitor the events that happen in the first 120 seconds after the tweet is published. Table 3-5 lists the features that we extracted from this elapsed time period.

Table 3-5: Features extracted from the first 120 seconds of retweets

Feature             Type         Description
Number of retweet   continuous   # of retweets after 120 sec of first retweet
AvgElapseTime       continuous   Average time of retweets in the first t min
StdElapseTime       continuous   Std time of retweets in the first t min

Combination of Features

We have extracted several features in our model. To obtain the optimal classifier it is important to combine the features effectively. In this work we have conducted a full factorial design so that the informativeness of each feature can be calculated. Hassan et al. (2006) have shown that factorial experimental design is a viable approach to feature selection. In statistics, a full factorial experiment is an experiment whose design consists of two or more factors, each with discrete possible values or levels, and whose experimental units take on all possible combinations of these levels across all such factors. Because each feature contributes differently to the classification task, we have carried out experiments based on the factorial design model to discover which features are the most informative. Furthermore, the full factorial design helps us to detect the useless features in our classification task, which leads to the design of a better model. In the next chapter we experimentally explain our feature selection method and the contribution made by each single feature to our predictive model.

3-6 Summary

In this chapter we explained our predictive model from the theoretical point of view. To build our predictive model, we examined different types of learning approaches and different features. We explained the classification models as well as the features, and we also explained how we obtained the features. In the next chapter we experimentally test our approach on different datasets and justify our model by comparing our results with some baselines and previous works.
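The full factorial design over features described in this chapter can be sketched as an enumeration of every on/off combination of the feature groups, followed by a main-effect estimate for each group. The group names and the function names are hypothetical, and `toy_score` is only a stand-in for training and evaluating a classifier: its numbers are made up and are not experimental results.

```python
from itertools import product

FACTORS = ["tweet", "user", "network", "early"]  # hypothetical feature groups

def full_factorial(factors):
    """Every on/off combination of the factors (2 ** len(factors) runs)."""
    return [dict(zip(factors, levels))
            for levels in product([False, True], repeat=len(factors))]

def main_effect(runs, scores, factor):
    """Mean score with the factor switched on minus mean score with it off."""
    on = [s for run, s in zip(runs, scores) if run[factor]]
    off = [s for run, s in zip(runs, scores) if not run[factor]]
    return sum(on) / len(on) - sum(off) / len(off)

def toy_score(run):
    """Placeholder for a real train/evaluate step; the gains are made up."""
    gains = {"tweet": 0.05, "user": 0.20, "network": 0.10, "early": 0.15}
    return 0.5 + sum(gain for name, gain in gains.items() if run[name])

runs = full_factorial(FACTORS)
scores = [toy_score(run) for run in runs]
effects = {name: main_effect(runs, scores, name) for name in FACTORS}
```

A group whose main effect is close to zero in such a design is a candidate for removal, which is how the design helps to detect useless features.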

Chapter 4 Experimental Results

4-1 Introduction

In this chapter we explain in detail how the datasets were collected and how the experiments were conducted. We collected four different datasets from Twitter and performed different experiments on them to see to what extent we can predict the popularity of tweets in the Twitter social network. We further explain how the data were collected and how they were split for training and testing. We then describe how we set up the classifiers that we considered for this problem and explain in detail how their parameters can be tuned effectively for our prediction task. This chapter is organized as follows. Section 4-2 describes the datasets that we used for this work and their collection method. In that section we also explain how we transferred the datasets into a relational database and introduce the splitting strategies that we considered. In Section 4-3 we first introduce the evaluation metrics that we used and motivate their choice. We then address the challenge of tuning the parameters of the classifiers and compare the performance of different classifiers with different configurations. The summary and concluding points are discussed in Section 4-4.

4-2 Dataset

In this section we describe in more detail which datasets are used and how we collected the data. Twitter is an information exchange network that produces 200 million tweets per day [1]. In this work we ran our experiments on a set of static datasets to see to what extent we can address the research questions posed in previous chapters. To be able to test our proposed methods, we created four different datasets using the Twitter streaming API [2]. It is important to note that the Twitter APIs are constantly changing, so developing for Twitter is not a one-off event. The four datasets differ in terms of the time when they were collected, their size and the topic of the tweets. Having four different datasets allows us to test our methods in different situations and to see how well they generalize. We use the streaming API because it is a free service and we do not need to collect 100% of the data. We created the four datasets on different topics. Three datasets were created from hot topics, each with a different size and collection time, and the fourth is a more general dataset which is not necessarily related to the hot topic of the day. We chose to have datasets from both hot and general topics to gauge the performance of our approach on different types of datasets. These datasets are listed in Table 4-1.

Table 4-1: Four datasets collected over different periods of time

Datasets                      Description                        Duration
Steve Jobs Death              Steve Jobs quit from being CEO     4 Days
The US Election               During US election campaign        16 Days
Foxnews Obama Assassination   Fox News Twitter account hacked    4 Days
:)                            All tweets contain :)              1 Day

Twitter structure

Twitter is a micro-blogging service that allows users to share information in the form of 140-character messages known as tweets. Users have two different networks. The first is the friend network (following), from which a user receives posts in his or her timeline, and which shows

49 4-2 Dataset 35 the numbers of users who are infuenced by Twitter. Secondy, foowers reationships which foow him/her from a directed foower network, a foowers wi receive posted message in their time-ines. Users categorized posts by topic by adding # hashtags these content categories hep users to search for a subject and this can occur anywhere in a Tweet at the beginning, midde,or at the end. When hashtag words become popuar they are then caed Trending topics. In order to send your message to a specific user, it is sufficient to mention his/her user name in that post and they wi then see the Tweet in their Mentions tab. [account] In this way, the originator of this tweet can add other users to the post. When the user opens his/her own permanent page, he can see a the posts he/she is mentioned in. We ca this post action as it concerns direct post. Another use worth mentioning invoves rebroadcasting of other persons posts or ( retweeting). Users can use the retweet button option avaiabe under the post or they can mention the at the beginning of post. Retweets are usefu because they aow one to track the fow of information on twitter. [additiona text] : [origina tweet] Every ink between a tweet and retweet can be imagined as a directed edge in a graph, if one connects these retweets together one obtains the retweet network. Twitter Message Structure At first we are going to describing variabes inside tweets. Each tweet has one main body containing singe fied attributes ike,(id, text, source, in repy to status id) and compex attributes ike (User, Entity, Geo, Pace) which contain more attributes inside them. Here we have shown the three most important distinct parts of each Tweet, Tweet, User, Entity in JSON ist 1 Tweet Body Fieds Each tweet contains severa fieds which we show in isting 1 where we are going to expain them in detai. However it is important to mention that the

50 36 Experimenta Resuts twitter JSON stream is not reiabe and as we mentioned earier the twitter API can be changed during the time. But we can sti count retweet count and extract some usefu information from the JSON stream. 1 "created_at":"fri Ju 04 12:37: ", 2 "tweetid": , 3 Iran s Message: We Can Make History", 4 "truncated":fase, 5 "in_repy_to_status_id":nu, 6 "in_repy_to_status_id_str":nu, 7 "in_repy_to_user_id":nu, 8 "in_repy_to_user_id_str":nu, 9 "in_repy_to_screen_name":nu, 10 "pace":nu, 11 "contributors":nu, 12 "retweet_count":274, 13 "favorite_count":0, 14 "ang":"en" 15 "user":{}, 16 "entities":{}, 17 "retweeted_status":{}, 18 "geo":nu, Listing 1: Tweet JSON Stream TweetId : Tweets are identified by ong unique integers which increase per tweet throughout the whoe twitter domain. retweet Status: Contains origina tweet information. It wi appear in the retweet body. create at: The date when the user became a members of Twitter. tweet created at: Time when a tweet/retweet/repy post occurred. parent Tweet Id : TweetID of post generator shown in retweet attributes. parent User Id : UserID of post generator shown in retweet attributes of the originator During the data processing we create our own attribute to faciitate the future work.

retweet time difference: Each tweet has a time stamp, so by subtracting the time stamp of the original tweet from that of the retweet we obtain the retweet time difference.

User Profile Fields

Each tweet JSON stream contains a user field which holds the user profile. These fields are shown in Listing 2.

    "user":
    {
        "userid": ,
        "name": "Jason H. Moore, Ph.D",
        "screen_name": "moorejh",
        "location": "Hanover, NH, USA",
        "description": "Third Century Professor, Bioinformatics, Complexity, BigData",
        "url":
        "entities":
        {
            "url": {},
            "description": {}
        },
        "followers_count": 6440,
        "friends_count": 1980,
        "listed_count": 534,
        "created_at": "Sat Jul 19 23:10: ",
        "favourites_count": 177,
        "utc_offset": ,
        "time_zone": "Eastern Time (US & Canada)",
        "statuses_count": 21275,
        "lang": "en"
    }

Listing 2: User fields in Tweet JSON stream

userid: Each Twitter user has a unique id.

friends count: Indicates the number of users the user follows (known as "followings").
followers count: Indicates the number of users that follow the user.
status count: Indicates the number of messages that the user has posted so far.
lang: Indicates the language the user has chosen for his/her messages.

Tweet Entity Fields

The tweet entity gives extra information about the tweet itself. It is shown in Listing 3.

    "entities":
    {
        "hashtags": [],
        "symbols": [],
        "urls": [],
        "user_mentions": []
    }

Listing 3: Entities fields in Tweet JSON stream

hashtags: List of hashtags used in the tweet text.
symbols: List of any extra symbols used in the tweet text.
urls: List of URLs mentioned in the tweet text.
user mentions: List of users mentioned in the tweet text.

In the next section we describe in detail how these four datasets were collected.
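The field descriptions above map naturally onto one flat record per tweet. The following Python sketch is only an illustration under stated assumptions, not the implementation used in this work: the function name `flatten_tweet`, the output keys and the sample values (including the full timestamps) are hypothetical, while the JSON keys follow Listings 1 to 3. It also assumes that a retweet carries the creation time of the original tweet inside `retweeted_status`, which is what makes the retweet time difference computable.

```python
import json
from datetime import datetime

# Layout assumed for the "created_at" field, e.g.
# "Fri Jul 04 12:37:00 +0000 2014" (the sample values below are made up).
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def parse_time(value):
    return datetime.strptime(value, TWITTER_TIME)


def flatten_tweet(tweet):
    """Turn one decoded tweet object into a flat record of simple fields."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    record = {
        "retweet_count": tweet.get("retweet_count", 0),
        "followers_count": user.get("followers_count", 0),
        "friends_count": user.get("friends_count", 0),
        "statuses_count": user.get("statuses_count", 0),
        "n_hashtags": len(entities.get("hashtags", [])),
        "n_urls": len(entities.get("urls", [])),
        "n_mentions": len(entities.get("user_mentions", [])),
        "retweet_time_difference": None,
    }
    original = tweet.get("retweeted_status")
    if original:  # only retweets carry the original tweet
        delta = parse_time(tweet["created_at"]) - parse_time(original["created_at"])
        record["retweet_time_difference"] = delta.total_seconds()
    return record


sample = json.loads("""
{
  "created_at": "Fri Jul 04 12:47:30 +0000 2014",
  "retweet_count": 274,
  "user": {"followers_count": 6440, "friends_count": 1980, "statuses_count": 21275},
  "entities": {"hashtags": [{"text": "history"}], "urls": [], "user_mentions": []},
  "retweeted_status": {"created_at": "Fri Jul 04 12:37:00 +0000 2014"}
}
""")
record = flatten_tweet(sample)
```

Here the time difference is expressed in seconds; any other unit would work equally well as long as it is used consistently across tweets.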

Dataset Collection

In this section we describe in more detail how we collected the four datasets with the Twitter APIs. There are three different ways to access Twitter data: (I) the Twitter Search API (REST API), (II) the Twitter Streaming API and (III) the Twitter Firehose.

The Streaming API gives access to tweets in near real-time. Users register a set of criteria (keywords, usernames, locations, named places, etc.) and, as tweets match the criteria, they are pushed directly to the user. The major drawback of the Streaming API is that it provides only a sample of the tweets that are occurring. The actual percentage of the total tweets that users receive varies greatly, depending on the criteria requested and the current traffic. Studies have estimated that users of the Streaming API can expect to receive anywhere from 1% to over 40% of the tweets in near real-time. This limitation can be overcome with the Twitter Firehose, which is not a free service but guarantees delivery of 100% of the tweets that match the criteria. Since we only have access to the free APIs, we used the Streaming API.

The Twitter Search API is built on the REST architecture. REST refers to a collection of network design principles that define resources and ways to address and access data. By allowing third-party developers partial access to its API, Twitter allows them to create programs that incorporate Twitter's services. The Search API returns the relevant results for ad-hoc user queries from a limited corpus of recent tweets. The REST API allows access to the nouns and verbs of Twitter, such as user profiles, timelines, tweets, tweet locations, lists, friends and followers.

All these access methods have an input and an output. The input is a set of criteria: keywords, time and hashtags in the case of the Streaming API, and user id, tweet id, location, etc. in the case of the Search API. The output of all these access methods is JSON or XML. JSON is a simple text format that is easy to read and write, and it is widely used for data interchange because it is easy for machines to parse and generate. To extract the data we need to access everything in the stream file, so we developed a Java program that reads and parses the JSON input stream and inserts the data into a relational database.
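The reading and parsing step can be sketched as follows. This is a Python illustration rather than the Java program mentioned above, and it rests on several assumptions: that the stream file holds one JSON object per line, that lines which fail to parse may simply be skipped, and that a tweet counts as a retweet when its `retweeted_status` field is non-empty. The function names and the sample lines are hypothetical.

```python
import io
import json


def read_stream(lines):
    """Yield decoded tweet objects, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # the stream is not guaranteed to be clean
        if isinstance(obj, dict) and "text" in obj:
            yield obj


def split_originals_and_retweets(tweets):
    """Separate original tweets from retweets."""
    originals, retweets = [], []
    for tweet in tweets:
        if tweet.get("retweeted_status"):
            retweets.append(tweet)
        else:
            originals.append(tweet)
    return originals, retweets


# Stand-in for a stream file; a real run would pass an open file object.
stream = io.StringIO(
    '{"tweetid": 1, "text": "original post"}\n'
    '\n'
    '{"tweetid": 2, "text": "RT original post", "retweeted_status": {"tweetid": 1}}\n'
    '{"tweetid": 3, "text": "cut off in the mid\n'
)
originals, retweets = split_originals_and_retweets(read_stream(stream))
```

Skipping unparsable lines keeps a long import running when a single record is truncated, at the cost of silently losing that record; counting the skipped lines would be a natural extension.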

Relational Database Creation

Now we have streams of tweets and retweets. We extracted informative fields from the tweet streams and generated a relational database. We define six database tables: Tweet, Retweet, User, User Follower, User Follower Network and Entity. The database schema is shown in Figure 4-1.

The first table is Tweet, in which we store the original tweets extracted from the stream. Original tweets are those written for the first time by a user; they do not have a Retweet Status section in the tweet stream. TweetID is the primary key of this table. The Retweet table, which looks similar to the previous one, stores the retweet fields. Each retweet is connected to its parent tweet by ParentTweetId, and TweetId is again the primary key. Next, we extract the user profiles into the User table; each user can have several tweets, which are connected to the Tweet or Retweet table by UserID. The User Follower Network table lists the followers of each user. We go one step further and retrieve the information of these followers to build another table, User Follower, which aggregates the information of all followers of a user. The last table, Entity, gathers the tweet entities (hashtags, URLs and user mentions) and is connected to the Tweet or Retweet table by TweetID.

Specification of Datasets

In this section we explain in more detail the specification of the four datasets and give basic statistics about them.

Table 4-2: Dataset overall statistical information (columns: Datasets, # Tweets, # Users, # Retweets, Duration in days; rows: Steve Jobs Death, US Election, Foxnews Obama Assassination, :))

More detailed statistics about the four datasets are listed in Tables 4-2 and 4-3. The following tables give an insight into how the data are distributed and how the datasets differ in terms of detailed statistics.
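The distribution statistics reported for the features in this section are skewness and kurtosis, before and after a log transformation. A minimal sketch of how such statistics can be computed is given below. It assumes population (biased) moment estimators, excess kurtosis (so that a Gaussian scores zero) and a log(1 + x) transformation; the exact estimators and transformation used in this work are not stated, and the sample values are made up.

```python
import math


def central_moment(values, order):
    """Mean of (x - mean) ** order over all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a Gaussian scores zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def log_transform(values):
    # log(1 + x) is an assumption; it keeps zero counts finite.
    return [math.log1p(v) for v in values]


followers = [1, 2, 3, 5, 8, 20, 400]  # made-up counts with a heavy right tail
logged = log_transform(followers)
print(round(skewness(followers), 2), round(skewness(logged), 2))
```

For count features with a few very large values, such as follower counts, the transformed values have a shorter right tail, which is the effect the log transformation is meant to achieve.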

Figure 4-1: Relational database schematic model

Tweet: TweetID (PK), UserID (FK), Created_at, Text
Retweet: TweetID (PK), Parent_TweetID (FK), UserID (FK), Created_at, Text
Further Tweet/Retweet columns: Retweet_count, source, place, Language
Entity: TweetID (FK), Hashtag, Url, User_Mentions
User: UserID (PK), Follower_count, Friend_count, Created_at, Status_count, time_zone, lang, screen_name, listed_count, favourites_count, location
User_Follower_Network: UserID (PK), FollowerID
User_Follower: UserID (PK), avg_follower_count, std_follower_count, avg_friends_count, avg_tweet_perday, std_tweet_perday, parent_follower_count, parent_friends_count

The tables are connected by one-to-many relationships.
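A reduced version of this schema can be sketched with an in-memory SQLite database. This is an illustration only: the choice of SQLite, the column types, the subset of tables and columns, and all inserted rows are assumptions, and only the table and key names are taken from Figure 4-1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (
    UserID INTEGER PRIMARY KEY,
    screen_name TEXT,
    Follower_count INTEGER,
    Friend_count INTEGER
);
CREATE TABLE Tweet (
    TweetID INTEGER PRIMARY KEY,
    UserID INTEGER REFERENCES User(UserID),
    Created_at TEXT,
    "Text" TEXT
);
CREATE TABLE Retweet (
    TweetID INTEGER PRIMARY KEY,
    Parent_TweetID INTEGER REFERENCES Tweet(TweetID),
    UserID INTEGER REFERENCES User(UserID),
    Created_at TEXT
);
""")

# Made-up rows: one original tweet by user 1, retweeted by users 2 and 3.
conn.executemany(
    "INSERT INTO User VALUES (?, ?, ?, ?)",
    [(1, "alice", 120, 80), (2, "bob", 15, 40), (3, "carol", 7, 9)],
)
conn.execute("INSERT INTO Tweet VALUES (10, 1, '2011-10-06 08:00:00', 'hello')")
conn.executemany(
    "INSERT INTO Retweet VALUES (?, ?, ?, ?)",
    [(11, 10, 2, "2011-10-06 08:05:00"), (12, 10, 3, "2011-10-06 09:30:00")],
)

# Count retweets per original tweet by following Parent_TweetID.
rows = conn.execute("""
    SELECT t.TweetID, COUNT(r.TweetID)
    FROM Tweet AS t LEFT JOIN Retweet AS r ON r.Parent_TweetID = t.TweetID
    GROUP BY t.TweetID
""").fetchall()
```

Keeping retweets in their own table, linked by Parent_TweetID, means that the retweet count of an original tweet can be recomputed from the collected data instead of relying only on the count reported in the stream.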

Figure 4-2: Skewness and kurtosis of different features (one skewness and one kurtosis column for each dataset: Fox news, :), Steve Jobs, America Election; one row per feature, e.g. #tweet/day, #Retweet, t_time)

Figures 4-2 and 4-3 list the skewness and kurtosis of the features in the four datasets before and after log transformation. Skewness quantifies how symmetrical a distribution is; a symmetrical distribution has a skewness of zero. An asymmetrical distribution with a long tail to the right (larger values) has a positive skewness, and one with a long tail to the left has a negative skewness. As a rule of thumb, if the skewness is greater than 1 (or less than -1), the skewness is substantial and the distribution is far from symmetrical. Kurtosis quantifies whether the shape of the data distribution matches the Gaussian distribution. A Gaussian distribution has a kurtosis of zero, a flatter distribution has a negative kurtosis and a distribution with a sharper peak has a positive kurtosis. We apply a log transformation to make the data less skewed and less sharply peaked. As can be seen in Figure 4-3, the skewness and kurtosis of the data decreased after the log transformation.

Figure 4-3: Skewness and kurtosis of Log-Transformation of different features (same layout as Figure 4-2)

Splitting Methods of Datasets

In order to test the performance of our classifiers, we need to define exactly how we split the dataset for training and testing, and to motivate our choices. Data splitting strategies are usually not well defined in previous studies. We define two different strategies in order to see whether chronological splitting makes any difference in performance, compared with other splitting methods, when predicting the popularity of tweets. The two splitting methods are as follows.

Chronological splitting: The idea of chronological splitting is to divide the training and test sets based on the time of the tweets. All tweets and retweets up to a certain point in time are considered the training set, and the tweets and retweets at later times are considered the test set. The motivation for splitting the dataset chronologically is that in a real popularity prediction scenario we do not know about future tweets, but only about tweets published up to the point of prediction. We created different splits of our datasets based on the number of days considered as training or test set. Figure 4-4 illustrates our chronological data splitting method. Later in this chapter we show the performance of the classifiers on the different splits we defined.

Figure 4-4: Chronological data set splitting (Days 1 to 6, divided into training and testing periods in several different ways)

Random splitting: Although the idea of chronological splitting seems logical, we also split our datasets randomly to see whether or not time-aware splitting has any influence on classification performance. Furthermore, random splitting allows us to perform cross-validation on the dataset to make sure the test results are stable across different splits. In random splitting, depending on whether we want to perform cross-validation or not, we split all tweets into different sets. For each tweet, its number of retweets is considered as the class label. Figure 4-5 illustrates the random splitting method for the train/test and cross-validation scenarios.

Figure 4-5: Random data set splitting
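The two splitting strategies can be sketched in a few lines of Python. The sketch below is illustrative only: the function names, the cutoff date, the number of folds, the popularity threshold and the rule "retweet count greater than or equal to the threshold means popular" are all assumptions, and the six tweets are made up.

```python
import random
from datetime import datetime


def chronological_split(tweets, cutoff):
    """Tweets posted before `cutoff` go to training, the rest to testing."""
    train = [t for t in tweets if t["created_at"] < cutoff]
    test = [t for t in tweets if t["created_at"] >= cutoff]
    return train, test


def random_folds(tweets, k, seed=0):
    """Shuffle once, then deal the tweets into k folds for cross-validation."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def label(tweet, threshold):
    """1 = popular, 0 = not popular (the threshold value is illustrative)."""
    return 1 if tweet["retweet_count"] >= threshold else 0


# One made-up tweet per day over six days.
tweets = [
    {"created_at": datetime(2011, 10, day), "retweet_count": count}
    for day, count in [(1, 0), (2, 12), (3, 3), (4, 150), (5, 1), (6, 40)]
]
train, test = chronological_split(tweets, cutoff=datetime(2011, 10, 5))
folds = random_folds(tweets, k=3)
labels = [label(t, threshold=10) for t in tweets]
```

With the chronological split, moving the cutoff by one day yields the different train/test configurations; with random folds, each fold serves once as the test set while the remaining folds form the training set.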
