Stream-based active learning for sentiment analysis in the financial domain

Jasmina Smailović (1,2), Miha Grčar (1,2), Nada Lavrač (1,2,3), Martin Žnidaršič (1,2)

(1) Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
(2) Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
(3) University of Nova Gorica, Vipavska 13, 5000 Nova Gorica, Slovenia

Abstract

The relationship between public sentiment and stock prices has been the focus of several studies. This paper analyzes whether the sentiment expressed in Twitter feeds, which discuss selected companies and their products, can indicate their stock price changes. To address this problem, an active learning approach was developed and applied to sentiment analysis of tweet streams in the stock market domain. The paper first presents a static Twitter data analysis problem, explored in order to determine the best Twitter-specific text preprocessing setting for training the Support Vector Machine (SVM) sentiment classifier. In the static setting, the Granger causality test shows that sentiments in stock-related tweets can be used as indicators of stock price movements a few days in advance, where improved results were achieved by adapting the SVM classifier to categorize Twitter posts into three sentiment categories of positive, negative and neutral (instead of positive and negative only). These findings were adopted in the development of a new stream-based active learning approach to sentiment analysis, applicable in incremental learning from continuously changing financial tweet streams. To this end, a series of experiments was conducted to determine the best querying strategy for active learning of the SVM classifier adapted to sentiment analysis of financial tweet streams. The experiments in analyzing stock market sentiments of a particular company show that changes in positive sentiment probability can be used as indicators of the changes in stock closing prices.
Keywords: predictive sentiment analysis, stream-based active learning, stock market, Twitter, positive sentiment probability, Granger causality

Corresponding author. Email address: jasmina.smailovic@ijs.si
Preprint submitted to Elsevier, May 5, 2014

1. Introduction

Predicting the value of stock market assets is a challenge investigated by numerous researchers. One of the reasons for addressing this challenge is the controversy surrounding the efficient market hypothesis [18], which claims that stocks are always traded at their fair value. According to this theory, investors cannot buy undervalued stocks or sell stocks at inflated prices, so it is impossible for traders to consistently outperform the average market returns. This hypothesis is based on the assumption that financial markets are informationally efficient (i.e., that stock prices always reflect all the relevant information at investment time). The unpredictable nature of stock market prices was first investigated by Regnault [53] and later by Bachelier [4]. Fama [18], who proposed the efficient market hypothesis, also claimed that stock price movement is unpredictable and that past price movements cannot be used to forecast future stock prices. However, as the efficient market hypothesis is controversial, researchers from various disciplines (including economists, statisticians, finance experts, and data miners) have been investigating the means to predict future stock market prices. The findings vary: from those claiming that stock market prices are not predictable to those presenting opposite conclusions [10, 35].

This paper addresses the described challenge in the context of the explosive growth of social media and user-generated content on the Internet. Through blogs, forums, and social networking media, more and more people share their opinions about individuals, companies, movements, or important events. Such opinions both express and evoke sentiments [51]. Recent research indicates that analysis of these online texts can be useful for trend prediction. For example, it was shown that the frequency of blog posts can be used to forecast spikes in online consumer purchasing [24]. Moreover, it was shown by Tong [74] that references to movies in newsgroups are correlated with their sales. Sentiment analysis of weblog data was successfully used to predict the financial success of movies [41]. Twitter posts were also shown to be useful for predicting box-office revenues of movies before their release [3].

Twitter is currently the most popular microblogging platform [47], allowing its users to send and read short messages of up to 140 characters in length, known as tweets, via SMS, the Twitter website, or a range of applications for mobile devices. Twitter gained global popularity very quickly, with over 500 million active users in 2012, writing over 340 million tweets daily [17, 42]. Twitter data (and data from other social network websites) are very interesting because of their large volume, popularity, and capability of near-real-time publishing of individuals' opinions and emotions about any subject. Given that this massive amount of user-generated content became abundant and easily accessible, many researchers became interested in the predictive power of microblogging messages, especially in the domain of stock market prediction, prediction of election results, or prediction of the financial success of movies or books. Many of these studies use sentiment analysis [38, 77] as a basis for prediction. The term sentiment, used in the context of automatic analysis of text and detection of predictive judgments from positively and negatively opinionated texts, first appeared in the papers by Das and Chen [15] and Tong [74], where the authors were interested in analyzing stock market sentiment.
Even though there are many studies on predicting the phenomenon of interest using sentiment analysis of online texts, there is still a need to develop methods and tools for adaptive dynamic sentiment analysis of microblogging posts, which would enable handling changes in such data streams. This field of research is still insufficiently explored and represents a challenge, which is addressed in this work through active learning [63]. This work contributes to sentiment analysis and to active learning research, and partly towards better understanding of phenomena in financial stock markets. While sentiment analysis is generally aimed at detecting the author's attitude, emotions or opinions expressed in the text, our study is concerned with the development of an approach to predictive sentiment analysis. With this term, we denote an approach in which sentiment analysis is used to predict a specific phenomenon or its changes, postulating that the proposed methodology for predictive sentiment analysis of streams of microblogging messages should be capable of predicting the financial phenomenon of interest. The indication that there may be a relationship between emotions and stock market prices relies on findings in psychological research which indicate that emotions are crucial to rational thinking and social behavior [14], and can influence the choice of actions. Given that the general mood of a society is propagated through social interactions, the collective social mood can be transferred through the investors to the stock market and, consequently, the sentiment can be reflected in stock price movements. As a result, the stock market itself can be considered as a measure of social mood [45]. It is, thus, reasonable to expect that the analysis of the public mood can be used to predict price movements in the stock market.
We hypothesize that this assumption may hold in situations when people actually express positive or negative opinions about some topic concerning the stock market, whereas in situations when people do not express opinions, but mostly neutral facts, we anticipate finding no correlations. In accordance with this hypothesis, we propose a mechanism for distinguishing opinionated (positive and negative) from non-opinionated (neutral) tweets in Twitter data streams.

In an effort to build an active learning approach to sentiment analysis, applicable in incremental learning from continuously changing financial tweet data streams, we first addressed a static Twitter data analysis problem, which was explored in order to determine the best Twitter-specific text preprocessing setting for training the Support Vector Machine (SVM) sentiment classifier. In the static setting, the Granger causality test showed that sentiment in stock-related tweets can be used as an indicator of stock price movements a few days in advance, where improved results were achieved by adapting the SVM classifier to categorize Twitter posts into three sentiment categories of positive, negative and neutral (instead of positive and negative only). These findings were successfully used in the development of a new stream-based active learning approach to sentiment analysis, applicable in incremental learning from continuously changing financial tweet data streams. Using stream data for sentiment analysis makes sense when the information about the changes in the sentiment is time-critical and a proper data flow is available, for example, in the analysis of streams of financial tweets in which people express their opinions about stocks in real time.

The main idea of active learning [58, 63, 67], adapted in this study for continuously updating the sentiment classifier from a tweet stream, is that the algorithm is allowed to select new examples to be labeled by the oracle (e.g., a human annotator) and added to the training set. It aims at maximizing the performance of the algorithm with as little human labeling effort as possible. The main challenge of active learning is the selection of the most suitable examples for labeling in order to achieve the highest prediction accuracy, while knowing that one cannot afford to label all the examples [88]. For example, query algorithms based on uncertainty sampling select for labeling the examples for which the current learner has the highest uncertainty [37, 64, 75]. Similarly, algorithms based on query-by-committee use disagreement among an ensemble of learners to select new examples for labeling [20, 50, 68]. The active learning approach proposed in this paper combines uncertainty and random sampling and was developed by adapting the initial static sentiment analysis approach to deal with changes over time in a tweet stream.

On the one hand, the use of active learning is a consequence of the scarcity of labeled tweets available for sentiment analysis, which prevents the use of conventional machine learning methods. Namely, it is very difficult and costly to obtain large hand-labeled datasets of tweets, especially if they are domain dependent. On the other hand, these datasets and the resulting models change with time and, consequently, soon become outdated. Thus, continuous learning that allows for adaptation to changes in the modeled environment is necessary to keep the models current. In summary, the main contribution of this paper is a new methodology for stream-based active learning for tweet sentiment analysis in finance, which can be used on continuously changing tweet streams.
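The combination of uncertainty and random sampling mentioned above can be illustrated with a small sketch. It is not the querying strategy evaluated in the paper; it only assumes that each tweet in a batch has a signed classifier score (for instance, a value proportional to its distance from the SVM hyperplane), that a fixed number of labels can be requested per batch, and that a hypothetical parameter `random_share` splits this budget between the two sampling modes.

```python
import random


def select_for_labeling(batch_scores, budget, random_share=0.5, seed=0):
    """Return indices of the tweets in one batch to send to the annotator.

    batch_scores : signed classifier scores, one per tweet (assumption:
                   values close to 0 mean the classifier is uncertain).
    budget       : number of labels that may be requested for this batch.
    random_share : hypothetical fraction of the budget spent on random picks.
    """
    rng = random.Random(seed)
    budget = min(budget, len(batch_scores))
    n_random = int(round(budget * random_share))
    n_uncertain = budget - n_random

    # Uncertainty sampling: the smallest absolute scores come first.
    by_uncertainty = sorted(range(len(batch_scores)),
                            key=lambda i: abs(batch_scores[i]))
    uncertain = by_uncertainty[:n_uncertain]

    # Random sampling: draw the rest of the budget from the remaining tweets.
    remaining = by_uncertainty[n_uncertain:]
    randomly_chosen = rng.sample(remaining, min(n_random, len(remaining)))

    return sorted(uncertain + randomly_chosen)


scores = [2.0, -0.1, 0.05, -3.0, 1.5, 0.4]
chosen = select_for_labeling(scores, budget=4)
```

With a budget of 4 and an even split, the two tweets with the smallest absolute scores (indices 1 and 2) are always selected, and two further tweets are drawn at random. The random part gives the learner a chance to see examples it is confidently wrong about, which pure uncertainty sampling would never query.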
A series of experiments was conducted to determine the best querying strategy for active learning of the SVM classifier, which was adapted to sentiment analysis of streams of financial tweets and applied to predictive stream mining in a financial stock market application. As a side effect, since there is no large labeled dataset of financial tweets publicly available, we have labeled and made publicly available a collection of financial tweets, making it the first large (in the sense of labeling effort) publicly available dataset of its kind. We used the dataset in the simulated active learning setting and in the evaluation of the results of tweet stream analysis. The paper is structured as follows. Section 2 presents a brief overview of related work. Section 3 discusses Twitter-specific text preprocessing options, and presents the developed SVM tweet sentiment classifier, learned from adequately preprocessed Twitter data. Section 4 presents the dataset of financial tweets, which were collected for the purpose of the study, as well as the method and technology developed for enabling financial market predictions from Twitter data. The approach uses positive sentiment probability as a new indicator for predictive sentiment analysis in finance, proposed in our previous work [71]. Furthermore, due to the fact that financial tweets do not necessarily express the sentiment, this section applies sentiment classification using the neutral zone, which allows classification of a tweet into the neutral category, thus improving the predictive power of the sentiment classifier compared to the SVM classifier categorizing Twitter posts into positive and negative sentiment categories only. Section 5 introduces incremental learning of the classifier on a stream of financial tweets. The general purpose classifier was incrementally updated in order to adapt to the changes in the data stream by using the active learning approach. 
The paper concludes with a summary of results and plans for further work in the final section.

2. Related work

In this section, we give an overview of related studies, which are focused on: (i) analyzing sentiment in Twitter data, (ii) sentiment analysis of social media as a predictor of the future stock market indicators, and (iii) active learning on data streams. Although these tasks have been well-studied separately, there is a lack of work which would combine them and propose a dynamic adaptive sentiment analysis methodology for microblogging stream posts, which would be able to handle changes in data streams. Our work addresses this issue.

2.1. Sentiment analysis and microblogging channels

In recent years, several studies have analyzed sentiments expressed in Twitter data in order to describe its content and study its relation to trends. O'Connor et al. [46] analyzed several surveys on consumer confidence and political opinion, and found a correlation with sentiments in Twitter messages. Furthermore, Thelwall et al. [73] analyzed 30 top events in Twitter over a one-month period and showed that popular events are associated with an increase in average negative sentiment strength. In [30], the authors addressed target-dependent sentiment classification and applied it to English tweets on popular topics. They incorporated target-dependent features and also took related tweets into consideration. Asur et al. [3] constructed a model based on tweet-rate about particular topics for predicting box-office revenues of movies before their release. They further showed how sentiment extracted from Twitter posts can improve their forecasting power. In the context of the 2009 German federal elections, Tumasjan et al. [76] showed that sentiment expressed in Twitter messages closely corresponds to the offline political landscape.

There has also been research exploring whether sentiment analysis of social media can be used to predict future stock market indicators. In [61], the authors analyzed sentiment in messages from the Yahoo! Finance website and demonstrated that sentiment and stock values are closely correlated. They also showed that one can use sentiment analysis to make predictions about stock behavior over a short-term period. In [47], the authors analyzed sentiments in postings from the stock microblogging channel Stocktwits.com over a period of three months and found that stock microblog sentiments may predict future stock price movements. Additionally, they found that pessimistic information has higher predictive value than optimistic information. Zhang et al. [87] measured positive and negative emotions in tweets and analyzed the correlation between these measures and stock market indices such as Dow Jones, S&P 500, NASDAQ, and VIX. The authors indicated that inspecting Twitter for any kind of emotional outburst gives a predictor of how the stock market will perform the following day. Bollen et al. [8] measured mood in tweets in terms of six dimensions (calm, alert, sure, vital, kind, and happy) and showed that changes in calmness can predict daily up and down changes in the closing values of the Dow Jones Industrial Average Index (DJIA). Furthermore, Chen and Lazer [12] confirmed the results of Bollen et al. [8] and showed that even with much simpler sentiment analysis methods, a correlation between Twitter sentiment data and stock market movements can be observed. Mittal and Goel [40] based their search for a correlation between public sentiment and the stock market on the approach of Bollen et al. [8].
Their results [40] partially agree with the results of Bollen et al. [8], but they indicate that not only the calm, but also the happy mood dimension has a good correlation with the DJIA values. The authors in [43] calculated daily sentiment of aggregated data from multiple sources (Twitter, 11 online message boards, and Yahoo! Finance news stream), where the data was concerned with stocks of the S&P 500 index during a six-month period. In their experiments, they showed that publicly available data in microblogs, forums, and news have predictive power for stock price changes on the following day. Sprenger et al. [72] analyzed about 250,000 stock-related tweets and found that the sentiment in tweets is associated with exceptional stock returns and that message volume predicts next-day trading volume. In addition, the authors showed that users who give above-average investment advice are retweeted more often and have more followers, which shows their influence in microblogging forums. Finally, Yu et al. [85] studied the effect of social and conventional media on firm stock market performance and found that social media has a stronger impact. Nevertheless, the authors found that social and conventional media together do have an effect on the stock market. They also found that the effect of social media varies depending on its type.

The above literature overview confirms that sentiment analysis of social media contains predictive information about future stock market indicators, which is also the topic of this paper. Close to our research is the work of Sprenger et al. [72], which aims at finding associations among various values describing tweets and stocks. Also, a similar idea exists in [43], but the authors were interested in aggregating data from multiple sources, whereas we are specifically interested in adjusting our approach to microblogging data.
In our previous studies, we used the volume and sentiments in stock-related tweets to identify important events, as a step towards the prediction of future movements of stock prices [70, 71]. This paper substantially extends our previous work.

2.2. Stream-based active learning

Active learning has been studied in three different scenarios: (i) membership query synthesis, (ii) pool-based sampling, and (iii) stream-based selective sampling [65]. In the membership query synthesis scenario, the learner may select new examples for labeling from the input space or it can generate new examples itself. In the pool-based scenario, the learner may request labels for any example from a large pool of historical data. Finally, in the stream-based active learning scenario, examples are made available constantly from a data stream and the learner has to decide in real time whether to request a label for a new example or not.

Active learning on data streams has been the subject of many studies. One of the simplest ways to select the examples to be labeled is based on maximizing the expected informativeness of labeled examples. For example, the learner may find the examples with the highest uncertainty to be the most informative and request them to be labeled. Zhu et al. [88] used uncertainty sampling to label instances within a batch of data from the data stream. Žliobaite et al. [90] proposed strategies that extend the fixed uncertainty strategy with dynamic allocation of labeling efforts over time and randomization of the search space. The latter approach was also used in some of our active learning strategies described in Section 5. These newly proposed active learning strategies explicitly handle concept drift and adapt the classifier to data distribution changes in data streams over time. In contrast to our approach, Žliobaite et al. [90] do not consider batches, but perform labeling decisions on every encountered data instance. Also, their labeling budget management is different, as they have a fixed overall budget and dynamically adapt the active learning rate according to the amount of budget left. This can be beneficial for flexible adaptation, but can disperse the labeling effort very unevenly. We opted for a fixed budget per batch, which enables the labeling effort to remain the same in each time period. This was perceived as a favorable approach from the user's point of view, as in our case the labeling cost is measured in human time, which is difficult to provide in unevenly dispersed bursts.

The decision on which instances are the most suitable for labeling can be made by a single evolving classifier [90] or by a classifier ensemble [81, 88, 89]. In classifier-ensemble-based active learning frameworks, a number of classifiers are trained from small portions of stream data. These classifiers construct an ensemble classifier for predictions [86], while our work is concerned with the development of a single evolving sentiment classifier for Twitter posts. Active learning on stream data for sentiment analysis of tweets in financial domains is still insufficiently explored and represents a significant challenge. Our preliminary work on this topic was presented in [56].
Bifet and Frank [7] discuss the challenges posed by Twitter data streams, focusing on classification problems, and consider these streams for sentiment analysis, but they do not use the active learning approach. On the other hand, Settles [66] has developed an active learning annotation tool, DUALIST; while he showed its potential by applying it to sentiment analysis of general tweets, his tool is not specifically adjusted to tweet analysis.

3. Defining the best parameter setting for tweet preprocessing

Preprocessing is a necessary data preparation step for supervised machine learning when training a sentiment classifier. We describe here the algorithm used in the development of the initial general tweet sentiment classifier, the dataset, different data preprocessing settings, and the experiments that led to the choice of the best tweet preprocessing setting. In this work, classification refers to the process of categorizing a new tweet into one of the two categories or classes: the positive or the negative sentiment of a tweet. The classifier is trained to classify new instances based on a set of class-labeled training instances (tweets), each described by a vector of features (terms, formed from one or several consecutive words) which have been pre-categorized manually or in some other presumably reliable way. Features are all the terms detected in the training dataset. The length of the feature vectors corresponds to the number of features. The approach to tweet preprocessing and classifier training was implemented using the LATINO (footnote 4) software library of text processing and data mining algorithms.

3.1. The algorithm used for sentiment classification

There are three common approaches to sentiment classification [49, 73]: (i) machine learning, (ii) lexicon-based methods, and (iii) linguistic analysis. Linguistic analysis tends to be too computationally demanding for use in a streaming near-real-time setting.
Lexicon-based methods are faster, but are unable to adapt to changes in the modeled environment. In the analysis of dynamic concepts, such as public sentiment, this is a serious drawback. Namely, certain terms, such as company names, countries or phrases, can shift with time from one sentiment class to the other. Therefore, we have decided to use a machine learning approach to learn a sentiment classifier from a set of class-labeled examples. An algorithm commonly used in document classification is the linear Support Vector Machine (SVM) [79, 80, 13]. The SVM algorithm has several advantages, which are important for learning a sentiment classifier from a large Twitter dataset. For example, it is fairly robust to overfitting and it can handle large feature spaces [11, 31, 60]. Based on a set of training examples, labeled as belonging to one of the two classes, an SVM algorithm represents the examples as points in the space and separates them by a hyperplane. The aim of the SVM is to place this hyperplane in such a way that examples of the two classes are divided by a gap that is as wide as possible. New examples are then mapped into that same space and classified based on the side of the hyperplane on which they reside. For training the tweet sentiment classifier, we used the SVMperf [32, 33, 34] implementation of the SVM algorithm.

Footnote 4: LATINO (Link Analysis and Text Mining Toolbox) is open-source (mostly under the LGPL license).

3.2. The data used for initial classifier training

Since there is no publicly available large hand-labeled data set for sentiment analysis of Twitter data, we have trained the general purpose tweet sentiment classifier on an available large collection of 1,600,000 (800,000 positive and 800,000 negative) tweets collected and prepared by Stanford University [22], where the tweets were labeled based on the presence of positive and negative emoticons. Therefore, the emoticons approximate the actual positive and negative sentiment labels. This approach was proposed by Read [52]. For example, if a tweet contains the :) emoticon, it is labeled as positive, and if it contains the :( emoticon, it is labeled as negative. Tweets containing both positive and negative emoticons were not taken into account. The full list of emoticons used for labeling can be found in Table 1. Inevitably, this simple approach causes partially correct or noisy labeling. However, in Appendix A, we illustrate that smiley-labeled tweets are still a reasonable approximation for manually-annotated positive/negative sentiments of tweets. In the dataset, the emoticons were already removed from the tweets in order for the classifier to learn from the other features that characterize them. Note that the tweets from this set do not focus on any particular domain.

3.3. Data preprocessing

The data preprocessing step is important in sentiment analysis and, with appropriate selection of preprocessing techniques, the classification accuracy can be improved [26]. We apply both Twitter-specific and standard preprocessing on the data set. The Twitter-specific preprocessing is necessary, since the Twitter community has created its own specific language to post messages.
Therefore, we first explored the unique properties of this language and experimented with the following options [2, 22] for Twitter-specific preprocessing to better define the feature space:

- Usernames: mentions of other users in a tweet (written in the form @username) were replaced by a single token named USERNAME.
- Usage of web links: web links pointing to different web pages were replaced by a single token named URL.
- Letter repetition: letters repeated more than twice in a row within a word were replaced by a single occurrence of the letter (e.g., the word loooooooove was replaced by love).
- Negations: we replaced negation words (not, isn't, aren't, wasn't, weren't, hasn't, haven't, hadn't, doesn't, don't, didn't) with a unique token NEGATION. Using this approach, we do not lose information about a negation, but treat all negation expressions in the same way.
- Exclamation and question marks: exclamation marks were replaced by a single token EXCLAMATION and question marks by a single token QUESTION.

In addition to Twitter-specific text preprocessing, other standard preprocessing steps were performed [19] to define the feature space for tweet feature vector construction. These include text tokenization (text splitting into individual words/terms), removal of stopwords (words carrying no relevant information, e.g., and, or, a, an, the, etc.), stemming (converting words into their root form), and N-gram construction (concatenating 1 to N stemmed words appearing consecutively in the text, where N=2). To reduce the feature space, we also added the condition that a given term has to appear at least twice in the entire corpus, either twice in a given tweet or in two different tweets. The resulting terms were used as features in the construction of feature vectors representing the documents (tweets).

Table 1: List of emoticons used for labeling the training set.

Positive emoticons    Negative emoticons
:)                    :(
:-)                   :-(
: )                   : (
:D
=)

In our experiments, we did not use a part of speech (POS) tagger, since it was indicated by Go et al. [22] and Pang et al. [48] that POS tags are not useful when using SVMs for sentiment analysis. Moreover, Kouloumpis et al. [36] showed that POS features may not be useful for sentiment analysis in the microblogging domain.

The standard approach to feature vector construction is based on the term frequency-inverse document frequency (TF-IDF) feature weighting scheme [31, 84]. In the term frequency (TF) scheme, a weight reflects how often a word is found in a document. In the TF-IDF scheme, a weight reflects how important a word is to a document in a document collection: it increases proportionally to the number of times a word appears in the document, but decreases with the number of documents in which the word occurs. We experimented with both schemes, TF-IDF- and TF-based, where for every document (tweet) TF weights were normalized to a range of [0,1]. As shown in Table 2, the TF-based approach proved to outperform the TF-IDF-based approach to tweet preprocessing, which is expected in a classification setting [39]. The significance of the finding is confirmed using Wilcoxon's significance test [16, 83], which concluded that using TF is statistically significantly better than TF-IDF.

3.4. Selecting the best preprocessing setting for classifier training

The experiments with different Twitter-specific preprocessing settings were performed to determine the best preprocessing options, which were used in addition to the standard text preprocessing steps. The best preprocessing setting for a classifier (footnote 5) was chosen according to the F-measure (also known as F-score or F1 score) [78], determined using the ten-fold cross-validation method (footnote 6).
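As an illustration of the evaluation measure named above, the following minimal sketch computes the F-measure of one class from lists of true and predicted labels. The F-measure is the harmonic mean of precision and recall; the function name and the label strings are hypothetical, and the cross-validation loop itself is omitted.

```python
def f_measure(true_labels, predicted_labels, positive="positive"):
    """F1 score of the class `positive`: 2 * P * R / (P + R)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correctly predicted positives: precision or recall is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = ["positive", "positive", "negative", "negative", "positive"]
preds = ["positive", "negative", "negative", "positive", "positive"]
score = f_measure(truth, preds)  # 2 true positives, 1 false positive, 1 false negative
```

In a ten-fold cross-validation, such a score would be computed once per held-out fold and the ten values averaged.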
The F-measure was used for comparison of different preprocessing settings since later, in the active learning experiments, we compare different querying strategies by the F-measure of positive tweets, as there is a high three-class imbalance in batches from the data stream. In order to be consistent and allow the reader to compare results in our paper, we used the F-measure in all our experiments. The experiments show that the best preprocessing setting is Setting 1, shown in the first row of Table 2. It is TF-based, uses N-grams of maximum size 2, keeps words which appear at least two times in the corpus, replaces links with the URL token, and replaces negations with the NEGATION token. This tweet preprocessing setting resulted in the construction of 1,288,681 features used for classifier training. Using the unpaired one-tailed homoscedastic t-test [54], we investigated whether the best preprocessing setting (Setting 1) is statistically significantly better than the other preprocessing settings. The results show that the best preprocessing setting is not significantly better than Settings 2 to 12, 14, and 16, but it is significantly better than the remaining preprocessing settings (Setting 13, Setting 15, and Settings 17 to 32). Since in these experiments the original dataset was pre-filtered and did not contain tweets with both positive and negative emoticons, the reported results may be somewhat overoptimistic (i.e., if the data were not pre-filtered and also contained tweets with mixed emoticons, the results in terms of the F-measure would probably be somewhat lower). Nevertheless, even if the reported results are overoptimistic, this property of the dataset does not affect the general conclusions concerning the choice of preprocessing settings, given that in all the settings the dataset was preprocessed in the same way.
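The ingredients of Setting 1 can be sketched in a few lines of Python. This is only an illustration and not the LATINO implementation: the regular expressions, the tiny stopword list, and the choice to normalize TF weights by the largest count in a tweet are assumptions, and stemming is left out to keep the sketch free of external libraries.

```python
import re
from collections import Counter

# Negation words listed in the text; the stopword list is a tiny illustrative subset.
NEGATIONS = {"not", "isn't", "aren't", "wasn't", "weren't", "hasn't",
             "haven't", "hadn't", "doesn't", "don't", "didn't"}
STOPWORDS = {"and", "or", "a", "an", "the"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # assumed URL pattern


def tokens_of(tweet):
    """Lowercase, replace links with URL and negation words with NEGATION."""
    text = tweet.lower().replace("\u2019", "'")
    text = URL_RE.sub(" URL ", text)
    tokens = []
    for tok in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        if tok in NEGATIONS:
            tokens.append("NEGATION")
        elif tok not in STOPWORDS:
            tokens.append(tok)
    return tokens


def ngrams(tokens, n_max=2):
    """All terms made of 1 to n_max consecutive tokens."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tf_vectors(tweets, min_count=2):
    """Sparse TF vectors, keeping terms seen at least min_count times in the corpus."""
    docs = [Counter(ngrams(tokens_of(t))) for t in tweets]
    corpus = Counter()
    for doc in docs:
        corpus.update(doc)
    vocabulary = {term for term, count in corpus.items() if count >= min_count}
    vectors = []
    for doc in docs:
        kept = {term: count for term, count in doc.items() if term in vocabulary}
        top = max(kept.values(), default=1)
        # Assumed normalization: divide by the largest count in the tweet.
        vectors.append({term: count / top for term, count in kept.items()})
    return vectors


tweets = ["I don't like $AAPL http://t.co/x",
          "Don't like it www.example.com",
          "great day"]
vectors = tf_vectors(tweets)
```

In this toy corpus only the terms NEGATION, like, URL and the bigram "NEGATION like" occur at least twice, so the third tweet ends up with an empty feature vector.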
From Table 2, it follows that replacing negation words is particularly beneficial, since almost all settings which perform this replacement are placed in the upper part of the table. Replacing exclamation and question marks with a token does not seem to be helpful, since the five top settings do not employ this replacement. Regarding replacing usernames and URLs with a token, and removing letter repetition, one cannot draw a general conclusion, since these preprocessing options are dispersed across the table. Nevertheless, the first setting employs replacing URLs with a token and we used it in the rest of our experiments. Interestingly, the setting which does not apply any of the preprocessing adjustments achieved the lowest performance, leading to the conclusion that, in general, it is beneficial to preprocess Twitter data.

In addition to the SVM algorithm, we also tested the k-nearest neighbor (KNN) and Naive Bayes classifiers on the same dataset. In this setting, the standard KNN algorithm proved to be too slow (in ten-fold cross-validation experiments for K=5 and K=10, the one-fold experiment took more than 24 hours on a standard desktop computer),

⁵ Based on our previous experience in [71], the parameters for the SVMperf learner were set to -c e

⁶ The F-measure is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. It is calculated as $F_1 = 2 \cdot \mathit{precision} \cdot \mathit{recall} / (\mathit{precision} + \mathit{recall})$. Precision is the fraction of all examples classified as positive which are correctly classified as positive, while recall is the fraction of all the positive examples that are correctly classified as positive.

and Naive Bayes had lower performance compared to the SVM (the ten-fold cross-validation achieved an F-measure of 0.73). We thus used the SVM classifier with preprocessing Setting 1 from Table 2 for the rest of the study and analyses.

Table 2: Classifier performance evaluation for various preprocessing settings.
Columns: ID; Usernames to a token; URLs to a token; Remove letter repetition; Negations to a token; Exclamation and question marks to tokens; Avg. F-measure ten-fold cross-val. ± std. dev. (TF); Avg. F-measure ten-fold cross-val. ± std. dev. (TF-IDF)
1 X X ± ± X X ± ± X X X ± ± X ± ± X X X ± ± X X X ± ± X X X ± ± X X X X ± ± X X X X X ± ± X X X ± ± X X X ± ± X X X X ± ± X X ± ± X X ± ± X X ± ± X X X X ± ± X X ± ± X X X ± ± X X X X ± ± X X ± ± X X ± ± X X X X ± ± X X X ± ± X X ± ± X ± ± X X ± ± X X X ± ± X ± ± X ± ± X ± ± X X X ± ± ± ±

4. Stock market analysis in a static predictive tweet analysis setting

Motivated by the earlier research and the observation that the stock market itself can be considered a measure of social mood [45], this section investigates whether sentiment analysis of Twitter posts can provide predictive information about the value of stock closing prices. We use a supervised machine learning approach to train a sentiment classifier, using an SVM algorithm. By applying the best setting for tweet preprocessing, as explained in Section 3.4, two sets of experiments were performed. In the first set of experiments, tweets were classified into two categories, positive or negative. In the second set of experiments, the SVM classification approach was advanced by taking into account the neutral zone, enabling us to identify neutral tweets (not clearly expressing positive or negative sentiments) as those positioned at a small distance from the SVM model hyperplane.

4.1. The data used in the stock market application

A tweet dataset and stock closing prices of several companies were collected for our experiments. On the one hand, we collected 152,570 tweets discussing relevant stock information concerning eight companies (Apple, Amazon, Baidu, Cisco, Google, Microsoft, Netflix, and RIM)⁷ in the nine-month time period from March 11 to December 9, 2011. On the other hand, we collected stock closing prices of these companies for the same time period.

The data source for collecting financial Twitter posts is the Twitter API⁸ (i.e., the Twitter Search API), which returns tweets that match a specified query. By informal Twitter conventions, the dollar-sign notation is used for discussing stock symbols. For example, the $BIDU tag indicates that the user discusses Baidu stocks. This convention was used for the retrieval of financial tweets⁹. The stock closing prices of the companies for each day were obtained from the Yahoo! Finance website.

The time of tweets in our dataset is given in UTC (Coordinated Universal Time), since the Twitter API stores and returns dates and times in UTC. Baidu, however, is included in the NASDAQ-100 index, and this stock exchange operates in the EST (Eastern Standard Time)/EDT (Eastern Daylight Time) timezone, which is four to five hours behind UTC. Compared to EST/EDT, there is therefore an additional shift of four to five hours, which increases the time lag between the tweets of a previous day and the stock market activity and closing prices of the current day.

In the entire study, we focused on the analysis of financial tweets about the Chinese web search engine provider Baidu¹⁰, in order to investigate relationships between the observed sentiments in the stock-related tweets and the corresponding stock price movements. The collection of Baidu tweets was manually labeled by the domain expert.
The data of this Chinese web search engine provider was chosen for hand-labeling since the set of tweets related to Baidu was of a manageable size given the resources available (we collected and labeled approximately 11,000 tweets, compared to, for example, approximately 40,000 tweets that we collected for the Apple company). Even this hand-labeling effort took us over three months to ensure good quality of the labeled data.

In tweet labeling, we were faced with the problem of choosing a labeling strategy. Having discussed this issue at length with stock market financial experts, we opted for manual labeling of the tweets from the point of view of a particular company and not mainly on the sentiment-carrying words used. The reason for this decision is that our long-term intention is to construct classifiers that distinguish between sentiments of tweets about different companies; hence, a company-focused view is a necessity. The labels were given to instances according to their financial sentiment; that is, their impact on the perception of the company, its products, or its stock. For example, the tweet "I just love shorting CompanyX. What a nice day of profits, first of many..." would be labeled as negative, since shorting means betting that the value of the stock will drop. Despite many positive sentiment words, such a tweet conveys a negative financial prospect for CompanyX.

Another issue was that the dataset contains many tweets that do not discuss Baidu stocks, although they do contain the $BIDU tag. These tweets may actually express an opinion about another company, such as the tweet "Apple is great $BIDU", which reflects a positive sentiment but does not discuss the Baidu company at all. Again, these kinds of tweets were labeled from the point of view of the Baidu company, and not mainly on the sentiment-carrying words used. Therefore, the mentioned tweet would be labeled as neutral.
Therefore, in Baidu sentiment labeling, the annotator was instructed to focus on the following question: "What would someone who knows what Baidu is, and what shares are in general, think of Baidu and its shares after seeing this tweet?", or in other words, "Is this tweet positive, negative, or neutral concerning Baidu and/or the price of its shares?" The resulting hand-labeled dataset consists of 11,389 Baidu financial tweets (4,861 positive, 1,856 negative, and 4,672 neutral tweets).¹¹ In this dataset, neutral tweets are those that contain no sentiment about Baidu, those that contain both positive and negative sentiments about Baidu, and those that do not discuss Baidu even if they are positively or negatively oriented (as discussed above).

⁷ Tweet IDs of our datasets are available on:

⁹ To deal with spam (writing nearly identical messages from different accounts), we employed the algorithm based on the work of Broder et al. [9] to discard tweets that were detected as near duplicates.

¹¹ The Baidu tweet IDs and manual labels are publicly available on: file BIDU.txt

4.2. Correlation between tweet sentiment and stock closing price

Given the time series of tweet sentiments and the time series of stock closing prices, the question addressed is whether one time series is useful in forecasting another. We applied a statistical test to determine whether sentiments expressed in tweets contain predictive information about the future values of stock closing prices. To this end, we performed a Granger causality analysis test, which is a statistical hypothesis test for discovering whether one time series is effective for forecasting another time series [23]. Since we have the tweets time series on the one hand and the stock closing price time series on the other hand, this test suits our needs to check whether there is a predictive relationship between sentiments in tweets and stock closing prices.

If time series X is said to Granger-cause Y, then the information in past values of X helps predict values of Y better than only the information in past values of Y. Therefore, the lagged values of X will have a statistically significant correlation with Y. Granger causality analysis is based on linear regression modeling of stochastic processes and it is usually done using a series of t-tests and F-tests on lagged values of X (combined also with lagged values of Y). The test expects that the time series data is covariance stationary and that it can be represented by a linear model. Complex implementations for nonlinear cases exist; nevertheless, they are often more challenging to apply in practice [62].

The output of the Granger causality test is the p-value, which takes values in the [0,1] interval. In statistical hypothesis testing, the p-value is a measure of how much evidence we have against the null hypothesis [57]. If the p-value is lower than the selected significance level, for example 5% (p < 0.05), the null hypothesis is rejected and the result is statistically significant.
On the other hand, a large p-value represents weak evidence against the null hypothesis; thus, the null hypothesis cannot be rejected. The Granger causality test that we used is based on Free Statistics Software [82].

To enable in-depth analysis, we calculated a sentiment indicator for predictive sentiment analysis in finance, named positive sentiment probability, $p_{sp}$, which was proposed in our previous work [71]. Positive sentiment probability is computed for a day $d$ of a time series by dividing the number of positive tweets $N_{pos}$ by the number of all tweets on that day $N_d$:

$p_{sp}(d) = \frac{N_{pos}(d)}{N_d(d)}$ (1)

This ratio is used to estimate the probability that the sentiment of a randomly selected tweet on a given day is positive.

To test whether one time series is useful in forecasting another, using the Granger causality test, we first calculated positive sentiment probability for each day when the stock market was open. We then calculated two ratios¹² to meet the Granger causality test condition that the time series data needs to be stationary:

Daily change of the positive sentiment probability $D_{sent}$: positive sentiment probability today − positive sentiment probability yesterday.

$D_{sent}(d) = p_{sp}(d) - p_{sp}(d-1)$ (2)

Daily return in stock closing price $D_{price}$: (closing price today − closing price yesterday)/closing price yesterday.¹³

$D_{price}(d) = \frac{price(d) - price(d-1)}{price(d-1)}$ (3)

We applied the Granger causality test to test the following null hypothesis: sentiment in tweets does not predict stock closing prices (rejecting it means that the sentiment in tweets Granger-causes the values of stock closing prices). We performed tests on the entire nine-month time period (from March 11 to December 9, 2011), as well as on individual three-month periods (corresponding approximately to March to May, June to August, and September to November). Results for Baidu are shown in the first column of Table 3.
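As an illustration of Equations (1)–(3), the following minimal Python sketch computes the three time series from daily counts and closing prices. The function names and the sample numbers are hypothetical, and the sketch assumes that every trading day has at least one tweet and a non-zero previous closing price.

```python
def positive_sentiment_probability(n_pos, n_all):
    """Equation (1): p_sp(d) = N_pos(d) / N_d(d) for each trading day d."""
    return [pos / total for pos, total in zip(n_pos, n_all)]


def daily_sentiment_change(p_sp):
    """Equation (2): D_sent(d) = p_sp(d) - p_sp(d-1)."""
    return [today - yesterday for yesterday, today in zip(p_sp, p_sp[1:])]


def daily_return(prices):
    """Equation (3): D_price(d) = (price(d) - price(d-1)) / price(d-1)."""
    return [(today - yesterday) / yesterday
            for yesterday, today in zip(prices, prices[1:])]


# Made-up counts and prices for three consecutive trading days.
p_sp = positive_sentiment_probability([30, 45, 20], [60, 60, 50])
d_sent = daily_sentiment_change(p_sp)
d_price = daily_return([100.0, 110.0, 99.0])
```

Both differenced series are one element shorter than their inputs, since the first day has no previous day to compare against.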
In Granger causality testing, we considered lagged values of time series for one, two, and three days.

¹² The ratios were defined in collaboration with the domain experts from the Stuttgart Stock Exchange (see Acknowledgments).

¹³ The same transformation of the price time series was used in [55].
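The regression comparison behind the Granger causality test can be sketched in plain Python as follows. This is a hedged illustration rather than the Free Statistics Software procedure used in our experiments: it fits a restricted model (lagged values of Y only) and an unrestricted model (lagged values of Y and X) by ordinary least squares and returns only the F statistic. Converting that statistic into a p-value requires the F distribution with (lag, degrees of freedom) parameters, which is left out here. All function names are hypothetical, and the example data is synthetic.

```python
import random


def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_sum_of_squares(rows, targets):
    """Fit ordinary least squares via the normal equations; return the RSS."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, targets))


def granger_f_statistic(x, y, lag):
    """F statistic for H0: lagged x adds nothing to lagged y when predicting y."""
    days = range(lag, len(y))
    targets = [y[t] for t in days]
    restricted = [[1.0] + [y[t - i] for i in range(1, lag + 1)] for t in days]
    unrestricted = [row + [x[t - i] for i in range(1, lag + 1)]
                    for row, t in zip(restricted, days)]
    rss_r = residual_sum_of_squares(restricted, targets)
    rss_u = residual_sum_of_squares(unrestricted, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Synthetic example: y follows x with a one-day delay plus small noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
forward = granger_f_statistic(x, y, lag=1)   # does x help predict y?
reverse = granger_f_statistic(y, x, lag=1)   # does y help predict x?
```

In the setting of this section, x would be the $D_{sent}$ series and y the $D_{price}$ series, with the lag set to one, two, or three days.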

Since in our experiments we compute the p-value repeatedly and make multiple comparisons of p-values for different experimental settings, we used the Bonferroni correction [1] to neutralize the problem of multiple comparisons. This correction is considered very conservative. It adjusts the critical p-value by dividing it by the number of comparisons being made. In our case, we divided the p-value of 0.1 by 4, as this is the number of time periods (the whole nine months and three three-month periods) which we consider to be a family of tests. We compare the p-values which came from the Granger causality test with 0.1/4=0.025 and reject the null hypothesis if the value is lower than 0.025. After applying the Bonferroni correction, the results of the Granger analysis indicated that in this particular setting there are no significant results.

Experiments in a three-class setting with the neutral zone

In the previous section, we classified financial tweets into one of the two categories, positive or negative, and therefore assumed that every tweet contains an opinion. This is, however, sometimes an unrealistic assumption, since a tweet can be objective and without any opinion about a given company (i.e., without expressed sentiment). Considering this, a tweet should also have the possibility of being classified as either neutral or weakly opinionated. In this section, we address a three-class problem of classifying tweets into the positive, negative, and neutral categories.

Our training data does not contain any neutral tweets for the classifier to learn from. Therefore, we define a tweet which is projected into an area close to the SVM model's hyperplane as neutral. We define this area as the neutral zone, which is parameterized by the value t, where t represents the positive and -t the negative border of the neutral zone.
If a tweet x is projected into this zone, that is, -t < d(x) < t, then rather than being assigned to one of the two sentiment classes, it is assumed to be neutral. Note that our neutral zone does not denote only the neutral tweets, such as tweets which would be labeled as neutral by a human annotator. Instead, the neutral zone also contains the tweets which are either positive or negative but close to the SVM hyperplane which separates the positives from the negatives. Thus, the neutral zone includes tweets containing mixed sentiments, weakly opinionated positive/negative tweets, as well as tweets containing terms which were not observed during the training phase (if human-annotated neutral tweets were available, they would have been included in the neutral zone as well). For a greater t (i.e., a greater size of the neutral zone), the classifier is more confident in its classification decision for positive and negative tweets. Our definition of the neutral zone is simple, but allows fast computation.

We repeated our experiments on classifying financial tweets, but now also took into account the neutral zone. Our aim was to investigate whether the introduction of the neutral zone would improve the predictive capabilities of tweets. Therefore, every tweet which mentioned the Baidu company was classified into one of the three categories: positive, negative, or neutral. Then, we applied the same processing of data as before (count the number of positive, negative, and neutral tweets, calculate positive sentiment probability, calculate daily changes of the positive sentiment probability and the daily return of the stock's closing price) and performed the Granger analysis test. We varied the t value from 0 to 1 (where t=0 corresponds to classification without the neutral zone) and again calculated the p-value for the separate day lags (1, 2, and 3). The results are shown in Table 3.
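The neutral-zone rule described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the signed distance d(x) of each tweet from the SVM hyperplane is already available; the function name, the tie-breaking of d(x) = 0 towards the positive class, and the sample distances are hypothetical.

```python
from collections import Counter


def classify_with_neutral_zone(distance, t):
    """Label a tweet from its signed SVM distance d(x) and neutral-zone size t."""
    if -t < distance < t:
        return "neutral"
    # Outside the zone, the sign of the distance decides the class.
    return "positive" if distance >= 0 else "negative"


# Made-up signed distances for four tweets posted on the same day.
distances = [1.2, 0.3, -0.1, -0.9]
with_zone = [classify_with_neutral_zone(d, t=0.5) for d in distances]
without_zone = [classify_with_neutral_zone(d, t=0.0) for d in distances]
daily_counts = Counter(with_zone)
```

With t=0 the condition -t < d(x) < t can never hold, so the rule reduces to the two-class setting, which matches the first column of Table 3.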
The first column, where the size of the neutral zone is 0, represents the classification without the neutral zone, where financial tweets were classified into one of the two categories, positive or negative. All the remaining columns contain p-values for various sizes of the neutral zone. In Appendix B, we also report the results of the Granger causality correlation between positive sentiment probability and closing stock price for the rest of the companies (Apple, Amazon, Cisco, Google, Microsoft, Netflix, and RIM) whose tweets we collected. The results show that for several other companies, the learned classifier has the potential to be useful for stock price prediction in terms of Granger causality.

Values which are lower than a p-value of 0.1, after applying the Bonferroni correction, are marked in bold in Table 3. The highest number of significant values was obtained with t values of 0.5 and 0.6 for the border distance of the neutral zone from the SVM hyperplane. Therefore, by introducing the neutral zone, we improved the predictive power of our classifier.

From Table 3 it follows that the June-August time period yielded the best results and the strongest relationships between sentiments in tweets and stock closing prices. Therefore, we investigated the Baidu data and public web media from this time period in more detail to find possible reasons for this. Figure 1 shows a screenshot from the Google Finance¹⁴ web page displaying stock price and news media coverage for Baidu in 2011. From the figure, it can be


observed that most of the key events in 2011 happened in the period from June to August. Note that this period is also characterized by the highest number of press releases¹⁵ for Baidu in 2011. We hypothesize that this resulted in higher media exposure and, consequently, enabled speculations about price movements in social media. However, further studies are required to confirm or reject this claim.

In addition, we explored whether there is evidence for the reversed causality (that the price movements may influence the public sentiment). The results show that, after adjusting the critical p-value by applying the Bonferroni correction, no significant results were left for the reverse direction.

Table 3: Statistical significance (p-values) of Granger causality correlation between positive sentiment probability and closing stock price for Baidu, while changing the size of the neutral zone (i.e., the t value from 0 to 1). Values which are lower than a p-value of 0.1, after applying the Bonferroni correction, are marked in bold.
Columns: Lag; Time period; Size of the neutral zone (t value). Rows: for each lag, the time periods 9 months, Mar.-May, June-Aug., and Sept.-Nov.

Figure 1: Screenshot from the Google Finance web page showing stock prices and key events. It can be observed that most of the key events in 2011 happened in the period from June to August. We hypothesize that this resulted in a higher media exposure and, consequently, enabled speculations about price movements in social media.
