Stream-based active learning for sentiment analysis in the financial domain


Jasmina Smailović 1,2, Miha Grčar 1,2, Nada Lavrač 1,2,3, Martin Žnidaršič 1,2

1 Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
3 University of Nova Gorica, Vipavska 13, 5000 Nova Gorica, Slovenia

Abstract

The relationship between public sentiment and stock prices has been the focus of several studies. This paper analyzes whether the sentiment expressed in Twitter feeds, which discuss selected companies and their products, can indicate their stock price changes. To address this problem, an active learning approach was developed and applied to sentiment analysis of tweet streams in the stock market domain. The paper first presents a static Twitter data analysis problem, explored in order to determine the best Twitter-specific text preprocessing setting for training the Support Vector Machine (SVM) sentiment classifier. In the static setting, the Granger causality test shows that sentiments in stock-related tweets can be used as indicators of stock price movements a few days in advance, where improved results were achieved by adapting the SVM classifier to categorize Twitter posts into three sentiment categories of positive, negative and neutral (instead of positive and negative only). These findings were adopted in the development of a new stream-based active learning approach to sentiment analysis, applicable in incremental learning from continuously changing financial tweet streams. To this end, a series of experiments was conducted to determine the best querying strategy for active learning of the SVM classifier adapted to sentiment analysis of financial tweet streams. The experiments in analyzing stock market sentiments of a particular company show that changes in positive sentiment probability can be used as indicators of the changes in stock closing prices.
Keywords: predictive sentiment analysis, stream-based active learning, stock market, Twitter, positive sentiment probability, Granger causality

Corresponding author. Email address: jasmina.smailovic@ijs.si
Preprint submitted to Elsevier, May 5, 2014

1. Introduction

Predicting the value of stock market assets is a challenge investigated by numerous researchers. One of the reasons for addressing this challenge is the controversy surrounding the efficient market hypothesis [18], which claims that stocks are always traded at their fair value. According to this theory, investors cannot buy undervalued stocks or sell stocks at inflated prices, so traders cannot consistently outperform the average market returns. The hypothesis rests on the assumption that financial markets are informationally efficient (i.e., that stock prices always reflect all the relevant information at investment time). The unpredictable nature of stock market prices was first investigated by Regnault [53] and later by Bachelier [4]. Fama [18], who proposed the efficient market hypothesis, also claimed that stock price movement is unpredictable and that past price movements cannot be used to forecast future stock prices. However, as the efficient market hypothesis is controversial, researchers from various disciplines (including economists, statisticians, finance experts, and data miners) have been investigating the means to predict future stock market prices. The findings vary: from those claiming that stock market prices are not predictable to those presenting opposite conclusions [10, 35].

This paper addresses the described challenge in the context of the explosive growth of social media and user-generated content on the Internet. Through blogs, forums, and social networking media, more and more people share their opinions about individuals, companies, movements, or important events. Such opinions both express and evoke sentiments [51]. Recent research indicates that analysis of these online texts can be useful for trend prediction. For example, it was shown that the frequency of blog posts can be used to forecast spikes in online consumer purchasing [24]. Moreover, it was shown by Tong [74] that references to movies in newsgroups are correlated with their sales. Sentiment analysis of weblog data was successfully used to predict the financial success of movies [41]. Twitter posts were also shown to be useful for predicting box-office revenues of movies before their release [3].

Twitter is currently the most popular microblogging platform [47], allowing its users to send and read short messages of up to 140 characters in length, known as tweets, via SMS, the Twitter website, or a range of applications for mobile devices. Twitter gained global popularity very quickly, with over 500 million active users in 2012, writing over 340 million tweets daily [17, 42]. Twitter data (and data from other social network websites) are very interesting because of their large volume, popularity, and capability of near-real-time publishing of individuals' opinions and emotions about any subject. Given that this massive amount of user-generated content became abundant and easily accessible, many researchers became interested in the predictive power of microblogging messages, especially in the domain of stock market prediction, prediction of election results, or prediction of the financial success of movies or books. Many of these studies use sentiment analysis [38, 77] as a basis for prediction. The term sentiment, used in the context of automatic analysis of text and detection of predictive judgments from positively and negatively opinionated texts, first appeared in the papers by Das and Chen [15] and Tong [74], where the authors were interested in analyzing stock market sentiment.
Even though there are many studies on predicting a phenomenon of interest using sentiment analysis of online texts, there is still a need to develop methods and tools for adaptive dynamic sentiment analysis of microblogging posts, which would enable handling changes in such data streams. This field of research is still insufficiently explored and represents a challenge, which is addressed in this work through active learning [63]. This work contributes to sentiment analysis and to active learning research, and partly towards better understanding of phenomena in financial stock markets. While sentiment analysis is generally aimed at detecting the author's attitude, emotions or opinions expressed in the text, our study is concerned with the development of an approach to predictive sentiment analysis. With this term, we denote an approach in which sentiment analysis is used to predict a specific phenomenon or its changes, postulating that the proposed methodology for predictive sentiment analysis of streams of microblogging messages should be capable of predicting the financial phenomenon of interest.

The indication that there may be a relationship between emotions and stock market prices relies on findings in psychological research which indicate that emotions are crucial to rational thinking and social behavior [14], and can influence the choice of actions. Given that the general mood of a society is propagated through social interactions, the collective social mood can be transferred through the investors to the stock market and, consequently, the sentiment can be reflected in stock price movements. As a result, the stock market itself can be considered as a measure of social mood [45]. It is, thus, reasonable to expect that the analysis of the public mood can be used to predict price movements in the stock market.
We hypothesize that this assumption may hold in situations when people actually express positive or negative opinions about some topic concerning the stock market, whereas in situations when people do not express opinions, but mostly neutral facts, we anticipate finding no correlations. In accordance with this hypothesis, we propose a mechanism for distinguishing opinionated (positive and negative) from non-opinionated (neutral) tweets in Twitter data streams.

In an effort to build an active learning approach to sentiment analysis, applicable in incremental learning from continuously changing financial tweet data streams, we first addressed a static Twitter data analysis problem, which was explored in order to determine the best Twitter-specific text preprocessing setting for training the Support Vector Machine (SVM) sentiment classifier. In the static setting, the Granger causality test showed that sentiment in stock-related tweets can be used as an indicator of stock price movements a few days in advance, where improved results were achieved by adapting the SVM classifier to categorize Twitter posts into three sentiment categories of positive, negative and neutral (instead of positive and negative only). These findings were successfully used in the development of a new stream-based active learning approach to sentiment analysis, applicable in incremental learning from continuously changing financial tweet data streams. Using stream data for sentiment analysis makes sense when the information about the changes in the sentiment is time-critical and a proper data flow is available, for example, in the analysis of streams of financial tweets in which people express their opinions about stocks in real time.

The main idea of active learning [58, 63, 67], adapted in this study for continuously updating the sentiment classifier from a tweet stream, is that the algorithm is allowed to select new examples to be labeled by the oracle (e.g., a human annotator) and added to the training set. It aims at maximizing the performance of the algorithm with as little human labeling effort as possible. The main challenge of active learning is the selection of the most suitable examples for labeling in order to achieve the highest prediction accuracy, while knowing that one cannot afford to label all the examples [88]. For example, query algorithms based on uncertainty sampling select for labeling the examples for which the current learner has the highest uncertainty [37, 64, 75]. Similarly, algorithms based on query-by-committee use disagreement among an ensemble of learners to select new examples for labeling [20, 50, 68]. The active learning approach proposed in this paper combines uncertainty and random sampling and was developed by adapting the initial static sentiment analysis approach to deal with changes over time in a tweet stream.

On the one hand, the use of active learning is a consequence of the scarcity of labeled tweets available for sentiment analysis, which prevents the use of conventional machine learning methods. Namely, it is very difficult and costly to obtain large hand-labeled datasets of tweets, especially if they are domain dependent. On the other hand, these datasets and the resulting models change with time and, consequently, soon become outdated. Thus, continuous learning that allows for adaptation to changes in the modeled environment is necessary to keep the models current. In summary, the main contribution of this paper is a new methodology for stream-based active learning for tweet sentiment analysis in finance, which can be used on continuously changing tweet streams.
A series of experiments was conducted to determine the best querying strategy for active learning of the SVM classifier, which was adapted to sentiment analysis of streams of financial tweets and applied to predictive stream mining in a financial stock market application. As a side effect, since there is no large labeled dataset of financial tweets publicly available, we have labeled and made publicly available a collection of financial tweets, making it the first large (in the sense of labeling effort) publicly available dataset of its kind. We used the dataset in the simulated active learning setting and in the evaluation of the results of tweet stream analysis. The paper is structured as follows. Section 2 presents a brief overview of related work. Section 3 discusses Twitter-specific text preprocessing options, and presents the developed SVM tweet sentiment classifier, learned from adequately preprocessed Twitter data. Section 4 presents the dataset of financial tweets, which were collected for the purpose of the study, as well as the method and technology developed for enabling financial market predictions from Twitter data. The approach uses positive sentiment probability as a new indicator for predictive sentiment analysis in finance, proposed in our previous work [71]. Furthermore, due to the fact that financial tweets do not necessarily express the sentiment, this section applies sentiment classification using the neutral zone, which allows classification of a tweet into the neutral category, thus improving the predictive power of the sentiment classifier compared to the SVM classifier categorizing Twitter posts into positive and negative sentiment categories only. Section 5 introduces incremental learning of the classifier on a stream of financial tweets. The general purpose classifier was incrementally updated in order to adapt to the changes in the data stream by using the active learning approach. 
The paper concludes with a summary of results and plans for further work.

2. Related work

In this section, we give an overview of related studies, which are focused on: (i) analyzing sentiment in Twitter data, (ii) sentiment analysis of social media as a predictor of the future stock market indicators, and (iii) active learning on data streams. Although these tasks have been well studied separately, there is a lack of work which would combine them and propose a dynamic adaptive sentiment analysis methodology for microblogging stream posts, which would be able to handle changes in data streams; our work addresses this issue.

2.1. Sentiment analysis and microblogging channels

In recent years, several studies have analyzed sentiments expressed in Twitter data in order to describe its content and study its relation to trends. O'Connor et al. [46] analyzed several surveys on consumer confidence and political opinion, and found a correlation with sentiments in Twitter messages. Furthermore, Thelwall et al. [73] analyzed 30 top events in Twitter over a one-month period and showed that popular events are associated with an increase in average negative sentiment strength. In [30], the authors addressed target-dependent sentiment classification and applied it to English tweets on popular topics. They incorporated target-dependent features and also took related tweets into consideration. Asur et al. [3] constructed a model based on the tweet rate about particular topics for predicting box-office revenues of movies before their release. They further showed how sentiment extracted from Twitter posts can improve their forecasting power. In the context of the 2009 German federal elections, Tumasjan et al. [76] showed that sentiment expressed in Twitter messages closely corresponds to the offline political landscape.

There has also been research exploring whether sentiment analysis of social media can be used to predict future stock market indicators. In [61], the authors analyzed sentiment in messages from the Yahoo! Finance website and demonstrated that sentiment and stock values are closely correlated. They also showed that one can use sentiment analysis to make predictions about stock behavior over a short-term period. In [47], the authors analyzed sentiments in postings from the stock microblogging channel Stocktwits.com over a period of three months and found that stock microblog sentiments may predict future stock price movements. Additionally, they found that pessimistic information has higher predictive value than optimistic information. Zhang et al. [87] measured positive and negative emotions in tweets and analyzed the correlation between these measures and stock market indices such as Dow Jones, S&P 500, NASDAQ, and VIX. The authors indicated that inspecting Twitter for any kind of emotional outburst gives a predictor of how the stock market will perform the following day. Bollen et al. [8] measured mood in tweets in terms of six dimensions (calm, alert, sure, vital, kind, and happy) and showed that changes in calmness can predict daily up and down changes in the closing values of the Dow Jones Industrial Average Index (DJIA). Furthermore, Chen and Lazer [12] confirmed the results of Bollen et al. [8] and showed that even with much simpler sentiment analysis methods, a correlation between Twitter sentiment data and stock market movements can be observed. Mittal and Goel [40] based their work for finding a correlation between public sentiment and the stock market on the approach of Bollen et al. [8].
Their results [40] are in some agreement with the results of Bollen et al. [8], but they indicate that not only the calm, but also the happy mood dimension has a good correlation with the DJIA values. The authors in [43] calculated daily sentiment of aggregated data from multiple sources (Twitter, 11 online message boards, and the Yahoo! Finance news stream), where the data was concerned with stocks of the S&P 500 index during a six-month period. In their experiments, they showed that publicly available data in microblogs, forums, and news have predictive power for stock price changes on the following day. Sprenger et al. [72] analyzed about 250,000 stock-related tweets and found that the sentiment in tweets is associated with exceptional stock returns and that message volume predicts next-day trading volume. In addition, the authors showed that users who give above-average investment advice are retweeted more often and have more followers, which shows their influence in microblogging forums. Finally, Yu et al. [85] studied the effect of social and conventional media on firm stock market performance and found that social media has a stronger impact. Nevertheless, the authors found that social and conventional media together do have an effect on the stock market. They also found that the effect of social media varies depending on its type.

The above literature overview confirms that sentiment analysis of social media can provide predictive information about future stock market indicators, which is also the topic of this paper. Close to our research is the work of Sprenger et al. [72], which aims at finding associations among various values describing tweets and stocks. A similar idea also exists in [43], but the authors were interested in aggregating data from multiple sources, whereas we are specifically interested in adjusting our approach to microblogging data.
In our previous studies, we used the volume and sentiments in stock-related tweets to identify important events, as a step towards the prediction of future movements of stock prices [70, 71]. This paper substantially extends our previous work.

Stream-based active learning

Active learning has been studied in three different scenarios: (i) membership query synthesis, (ii) pool-based sampling, and (iii) stream-based selective sampling [65]. In the membership query synthesis scenario, the learner may select new examples for labeling from the input space or it can generate new examples itself. In the pool-based scenario, the learner may request labels for any example from a large pool of historical data. Finally, in the stream-based active learning scenario, examples are made available constantly from a data stream and the learner has to decide in real time whether to request a label for a new example or not.

Active learning on data streams has been a subject of many studies. One of the simplest ways to select the examples to be labeled is based on maximizing the expected informativeness of labeled examples. For example, the learner may find the examples with the highest uncertainty to be the most informative and request them to be labeled. Zhu et al. [88] used uncertainty sampling to label instances within a batch of data from the data stream. Žliobaite et al. [90] proposed strategies that extend the fixed uncertainty strategy with dynamic allocation of labeling efforts over time and randomization of the search space. The latter approach was also used in some of our active learning strategies described in Section 5. These newly proposed active learning strategies explicitly handle concept drift and adapt the classifier to data distribution changes in data streams over time. In contrast to our approach, Žliobaite et al. [90] do not consider batches, but perform labeling decisions on every encountered data instance. Also, their labeling budget management is different, as they have a fixed overall budget and dynamically adapt the active learning rate according to the amount of budget left. This can be beneficial for flexible adaptation, but can disperse the labeling effort very unevenly. We opted for a fixed budget per batch, which enables the labeling effort to remain the same in each time period. This was perceived as a favorable approach from the user's point of view, as in our case the labeling cost is measured in human time, which is difficult to provide in unevenly dispersed bursts.

Deciding which instances are the most suitable for labeling can be made by a single evolving classifier [90] or by a classifier ensemble [81, 88, 89]. In classifier-ensemble-based active learning frameworks, a number of classifiers are trained from small portions of stream data. These classifiers construct an ensemble classifier for predictions [86], while our work is concerned with the development of a single evolving sentiment classifier for Twitter posts. Active learning on stream data for sentiment analysis of tweets in financial domains is still insufficiently explored and represents a significant challenge. Our preliminary work on this topic was presented in [56].
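The idea of a fixed labeling budget per batch, spent partly on the most uncertain examples and partly on randomly chosen ones, can be sketched as follows. This is only an illustration under stated assumptions, not the querying strategy evaluated in the paper (those strategies are described in Section 5): the function name `select_for_labeling`, the `uncertain_share` parameter, and the use of the absolute classifier score as the uncertainty measure are all hypothetical choices made for the example.

```python
import random


def select_for_labeling(scores, budget, uncertain_share=0.5, rng=None):
    """Pick indices of batch examples to send to the human annotator.

    scores          -- classifier decision values for one batch; a value close
                       to zero is treated as high uncertainty (assumption)
    budget          -- number of labels that can be requested for this batch
    uncertain_share -- fraction of the budget (0..1) spent on the most
                       uncertain examples; the rest is drawn at random
                       (hypothetical parameter)
    """
    rng = rng or random.Random()
    budget = min(budget, len(scores))
    n_uncertain = int(budget * uncertain_share)

    # Order the batch from most to least uncertain.
    by_uncertainty = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    chosen = by_uncertainty[:n_uncertain]

    # Spend the remaining budget on random examples from the rest of the batch.
    remaining = by_uncertainty[n_uncertain:]
    chosen += rng.sample(remaining, budget - n_uncertain)
    return sorted(chosen)


# Example batch of six decision values with a budget of four labels.
batch_scores = [2.0, -0.1, 0.05, -3.0, 1.0, 0.4]
to_label = select_for_labeling(batch_scores, budget=4, rng=random.Random(0))
```

Because the budget is applied per batch, the number of requested labels is the same in every time period, which mirrors the motivation given above. The random part of the selection lets the learner occasionally see examples it is currently confident about.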
Bifet and Frank [7] discuss the challenges posed by Twitter data streams, focusing on classification problems, and consider these streams for sentiment analysis, but they do not use the active learning approach. On the other hand, Settles [66] has developed an active learning annotation tool, DUALIST; while he showed its potential by applying it to sentiment analysis of general tweets, his tool is not specifically adjusted to tweet analysis.

3. Defining the best parameter setting for tweet preprocessing

Preprocessing is a necessary data preparation step in supervised machine learning when training a sentiment classifier. We describe here the algorithm used in the development of the initial general tweet sentiment classifier, the dataset, different data preprocessing settings, and the experiments that led to the choice of the best tweet preprocessing setting.

In this work, classification refers to the process of categorizing a new tweet into one of the two categories or classes: the positive or the negative sentiment of a tweet. The classifier is trained to classify new instances based on a set of class-labeled training instances (tweets), each described by a vector of features (terms, formed from one or several consecutive words), which have been pre-categorized manually or in some other presumably reliable way. Features are all the terms detected in the training dataset. The length of the feature vectors corresponds to the number of features. The approach to tweet preprocessing and classifier training was implemented using LATINO (Link Analysis and Text Mining Toolbox), an open-source software library of text processing and data mining algorithms (mostly under the LGPL license).

3.1. The algorithm used for sentiment classification

There are three common approaches to sentiment classification [49, 73]: (i) machine learning, (ii) lexicon-based methods, and (iii) linguistic analysis. Linguistic analysis tends to be too computationally demanding for use in a streaming near-real-time setting. Lexicon-based methods are faster, but are unable to adapt to changes in the modeled environment. In the analysis of dynamic concepts, such as public sentiment, this is a serious drawback. Namely, certain terms, such as company names, countries or phrases, can shift with time from one sentiment class to the other. Therefore, we have decided to use a machine learning approach to learn a sentiment classifier from a set of class-labeled examples.

An algorithm commonly used in document classification is the linear Support Vector Machine (SVM) [79, 80, 13]. The SVM algorithm has several advantages, which are important for learning a sentiment classifier from a large Twitter dataset. For example, it is fairly robust to overfitting and it can handle large feature spaces [11, 31, 60]. Based on a set of training examples, labeled as belonging to one of the two classes, an SVM algorithm represents the examples as points in the space and separates them by a hyperplane. The aim of the SVM is to place this hyperplane in such a way that examples of the two classes are divided by a gap that is as wide as possible. New examples are then mapped into that same space and classified based on the side of the hyperplane on which they reside. For training the tweet sentiment classifier, we used the SVMperf [32, 33, 34] implementation of the SVM algorithm.

3.2. The data used for initial classifier training

Since there is no publicly available large hand-labeled data set for sentiment analysis of Twitter data, we have trained the general purpose tweet sentiment classifier on an available large collection of 1,600,000 (800,000 positive and 800,000 negative) tweets collected and prepared by Stanford University [22], where the tweets were labeled based on the presence of positive and negative emoticons. Therefore, the emoticons approximate the actual positive and negative sentiment labels. This approach was proposed by Read [52]. For example, if a tweet contains the :) emoticon, it is labeled as positive, and if it contains the :( emoticon, it is labeled as negative. Tweets containing both positive and negative emoticons were not taken into account. The full list of emoticons used for labeling can be found in Table 1. Inevitably, this simple approach causes partially correct or noisy labeling. However, in Appendix A, we illustrate that smiley-labeled tweets are still a reasonable approximation for manually annotated positive/negative sentiments of tweets. In the dataset, the emoticons were already removed from the tweets in order for the classifier to learn from the other features that characterize them. Note that the tweets from this set do not focus on any particular domain.

3.3. Data preprocessing

The data preprocessing step is important in sentiment analysis and, with appropriate selection of preprocessing techniques, the classification accuracy can be improved [26]. We apply both Twitter-specific and standard preprocessing on the data set. The Twitter-specific preprocessing is necessary, since the Twitter community has created its own specific language to post messages.
Therefore, we first explored the unique properties of this language and experimented with the following options [2, 22] for Twitter-specific preprocessing to better define the feature space:

- Usernames: mentions of other users in a tweet, written in the @username form, were replaced by a single token named USERNAME.
- Usage of web links: web links pointing to different web pages were replaced by a single token named URL.
- Letter repetition: repetitive letters with more than two occurrences in a word were replaced by one occurrence of this letter (e.g., the word loooooooove was replaced by love).
- Negations: we replaced negation words (not, isn't, aren't, wasn't, weren't, hasn't, haven't, hadn't, doesn't, don't, didn't) with a unique token NEGATION. Using this approach, we do not lose information about a negation, but treat all negation expressions in the same way.
- Exclamation and question marks: exclamation marks were replaced by a single token EXCLAMATION and question marks by a single token QUESTION.

In addition to Twitter-specific text preprocessing, other standard preprocessing steps were performed [19] to define the feature space for tweet feature vector construction. These include text tokenization (text splitting into individual words/terms), removal of stopwords (words carrying no relevant information, e.g., and, or, a, an, the, etc.), stemming (converting words into their root form), and N-gram construction (concatenating 1 to N stemmed words appearing consecutively in the text, where N=2) for feature space reduction. We also added the condition that a given term has to appear at least twice in the entire corpus, either twice in a given tweet or in two different tweets. The resulting terms were used as features in the construction of feature vectors representing the documents (tweets). In our experiments, we did not use a part of speech (POS) tagger, since it was indicated by Go et al. [22] and Pang et al. [48] that POS tags are not useful when using SVMs for sentiment analysis. Moreover, Kouloumpis et al. [36] showed that POS features may not be useful for sentiment analysis in the microblogging domain.

Table 1: List of emoticons used for labeling the training set.

Positive emoticons    Negative emoticons
:)                    :(
:-)                   :-(
: )                   : (
:D
=)

The standard approach to feature vector construction is based on TF-IDF, the term frequency-inverse document frequency feature weighting scheme [31, 84]. In the TF (term frequency) scheme, a weight reflects how often a word is found in a document. In the TF-IDF scheme, a weight reflects how important a word is to a document in a document collection: it increases proportionally to the number of times a word appears in the document, but decreases with the number of documents in which the word occurs. We experimented with both schemes, TF-IDF- and TF-based, where for every document (tweet) TF weights were normalized to a range of [0,1]. As shown in Table 2, the TF-based approach outperformed the TF-IDF-based approach to tweet preprocessing, which is expected in a classification setting [39]. The significance of the finding is confirmed using Wilcoxon's significance test [16, 83], which concluded that using TF is statistically significantly better than TF-IDF.

3.4. Selecting the best preprocessing setting for classifier training

The experiments with different Twitter-specific preprocessing settings were performed to determine the best preprocessing options, which were used in addition to the standard text preprocessing steps. The best preprocessing setting for a classifier (see footnote 5) was chosen according to the F-measure (also known as F-score or F1 score) [78], determined using the ten-fold cross-validation method (see footnote 6).
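The Twitter-specific replacements from Section 3.3, together with N-gram construction and TF weighting, can be sketched as follows. This is a minimal illustration and not the LATINO implementation used in the study: the function names, the regular expressions, and the choice to normalize TF weights by the largest count in a tweet are assumptions made for the example, and stopword removal and stemming are omitted. The defaults enable only the URL and NEGATION replacements.

```python
import re
from collections import Counter

# Negation words listed in Section 3.3 (straight apostrophes assumed).
NEGATIONS = ["not", "isn't", "aren't", "wasn't", "weren't", "hasn't",
             "haven't", "hadn't", "doesn't", "don't", "didn't"]
NEGATION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in NEGATIONS) + r")(?!\w)",
    re.IGNORECASE)


def preprocess(tweet, urls=True, negations=True, usernames=False,
               repetition=False, punctuation=False):
    """Apply the selected Twitter-specific replacements to one tweet."""
    text = tweet
    if urls:
        text = re.sub(r"(?:https?://|www\.)\S+", " URL ", text)
    if usernames:
        text = re.sub(r"@\w+", " USERNAME ", text)
    if repetition:
        # A letter repeated three or more times collapses to one occurrence.
        text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    if negations:
        text = NEGATION_RE.sub(" NEGATION ", text)
    if punctuation:
        text = text.replace("!", " EXCLAMATION ").replace("?", " QUESTION ")
    return " ".join(text.split())


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tf_vector(terms):
    """Term counts divided by the largest count, so weights lie in (0, 1]."""
    counts = Counter(terms)
    if not counts:
        return {}
    top = max(counts.values())
    return {term: count / top for term, count in counts.items()}


tokens = preprocess("I don't like http://t.co/x").split()
features = tf_vector(ngrams(tokens))
```

The corpus-level condition that a term must appear at least twice would be applied after all tweets have been processed, for example by counting terms across the whole collection and dropping those seen only once.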
The F-measure was used for comparison of different preprocessing settings since later, in the active learning experiments, we calculate the F-measure of positive tweets to compare different querying strategies, as there is high three-class imbalance in batches from the data stream. In order to be consistent and allow the reader to compare results in our paper, we used the F-measure in all our experiments. The experiments show that the best preprocessing setting is Setting 1, shown in the first row of Table 2. It is TF-based, uses N-grams of maximum size 2 and words which appear at least two times in the corpus, replaces links with the URL token, and replaces negations with the NEGATION token. This tweet preprocessing setting resulted in the construction of 1,288,681 features used for classifier training. Using the unpaired one-tailed homoscedastic t-test [54], we investigated whether the best preprocessing setting (Setting 1) is statistically significantly better than the other preprocessing settings. The results show that the best preprocessing setting is not significantly better than Settings 2-12, 14, and 16, but it is significantly better than the remaining preprocessing settings (Setting 13, Setting 15, Settings 17-32).

Since in these experiments the original dataset was pre-filtered and did not contain tweets with both positive and negative emoticons, the reported results may be somewhat overoptimistic (i.e., if the data were not pre-filtered and also contained tweets with mixed emoticons, the results in terms of the F-measure would probably be somewhat lower). Nevertheless, even if the reported results are overoptimistic, this property of the dataset does not affect the general conclusions concerning the choice of preprocessing settings, given that in all the settings the dataset was preprocessed in the same way.
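The F-measure used to rank the settings is the harmonic mean of precision and recall for one class of interest. A minimal sketch of the computation is given below; the function name `f_measure` and the string labels are hypothetical and chosen only for the example.

```python
def f_measure(y_true, y_pred, positive="positive"):
    """F1 score of the `positive` class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so the F-measure is 0.5 as well.
score = f_measure(["positive", "positive", "negative", "negative"],
                  ["positive", "negative", "positive", "negative"])
```

Because the score is computed for a single class, it remains informative when the other classes dominate a batch, which is the reason given above for preferring it in the stream experiments.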
From Table 2, it follows that replacing negation words is particularly beneficial, since almost all settings which perform this replacement are placed in the upper part of the table. Replacing exclamation and question marks with a token does not seem to be helpful, since the five top settings do not employ this replacement. Regarding replacing usernames and URLs with a token, and removing letter repetition, one cannot draw a general conclusion, since these preprocessing options are dispersed across the table. Nevertheless, the first setting employs replacing URLs with a token and we used it in the rest of our experiments. Interestingly, the setting which does not apply any of the preprocessing adjustments achieved the lowest performance, leading to the conclusion that, in general, it is beneficial to preprocess Twitter data.

In addition to the SVM algorithm, we also tested the k-nearest neighbor (KNN) and Naive Bayes classifiers on the same dataset. In this setting, the standard KNN algorithm proved to be too slow (in ten-fold cross-validation experiments for K=5 and K=10, the one-fold experiment took more than 24 hours on a standard desktop computer),

5 Based on our previous experience in [71], the parameters for the SVM perf learner were set to -c e

6 The F-measure is a harmonic mean of precision and recall, and it reaches its best value at 1 and worst at 0. It is calculated as: $F_1 = 2 \cdot precision \cdot recall / (precision + recall)$. Precision is the fraction of all examples classified as positive which are correctly classified as positive, while recall is the fraction of all the positive examples that are correctly classified as positive.

Table 2: Classifier performance evaluation for various preprocessing settings. Columns: ID; Usernames to a token; URLs to a token; Remove letter repetition; Negations to a token; Exclamation and question marks to tokens; Avg. F-measure in ten-fold cross-validation ± std. dev. (TF); Avg. F-measure in ten-fold cross-validation ± std. dev. (TF-IDF).

and Naive Bayes had lower performance compared to the SVM (the ten-fold cross-validation achieved an F-measure of 0.73). We thus used the SVM classifier with preprocessing Setting 1 from Table 2 for the rest of the study and analyses.

4. Stock market analysis in a static predictive tweet analysis setting

Motivated by the earlier research and the observation that the stock market itself can be considered as a measure of social mood [45], this section investigates whether sentiment analysis on Twitter posts can provide predictive information about the value of stock closing prices. We use a supervised machine learning approach to train a sentiment classifier, using an SVM algorithm. By applying the best setting for tweet preprocessing, as explained in Section 3.4, two sets of experiments were performed. In the first set of experiments, tweets were classified into two categories, positive or negative. In the second set of experiments, the SVM classification approach was advanced by taking into account the neutral zone, enabling us to identify neutral tweets (not clearly expressing positive or negative sentiments) as those positioned a small distance from the SVM model hyperplane.

4.1. The data used in the stock market application

A tweet dataset and stock closing prices of several companies were collected for our experiments. On the one hand, we collected 152,570 tweets discussing relevant stock information concerning eight companies (Apple, Amazon, Baidu, Cisco, Google, Microsoft, Netflix, and RIM) 7 in the nine-month time period from March 11 to December 9, 2011. On the other hand, we collected stock closing prices of these companies for the same time period.

The data source for collecting financial Twitter posts is the Twitter API 8 (i.e., the Twitter Search API), which returns tweets that match a specified query. By informal Twitter conventions, the dollar-sign notation is used for discussing stock symbols. For example, the $BIDU tag indicates that the user discusses Baidu stocks. This convention was used for the retrieval of financial tweets 9. The stock closing prices of the companies for each day were obtained from the Yahoo! Finance website.

The time of tweets in our dataset is presented in UTC (Coordinated Universal Time), since the Twitter API stores and returns dates and times in UTC. On the other hand, Baidu is included in the NASDAQ-100 index, and this stock exchange works in the EST (Eastern Standard Time) / EDT (Eastern Daylight Time) timezone, which is four to five hours behind UTC. Therefore, compared to EST/EDT, there is an additional shift of four to five hours; thus, there is more of a time lag between the tweets of a previous day and the stock market activity and closing prices of the current day.

In the entire study, we focused on the analysis of financial tweets on the Chinese web search engine provider, Baidu 10, in order to investigate relationships between the observed sentiments in the stock-related tweets and the corresponding stock price movements. The collection of Baidu tweets was manually labeled by the domain expert.
The data of this Chinese web search engine provider was chosen for hand-labeling since the set of tweets related to Baidu was of a manageable size given the resources available (we collected and labeled approximately 11,000 tweets, compared to, for example, approximately 40,000 tweets that we collected for the Apple company). Even this hand-labeling effort took us over three months to ensure good quality of the labeled data.

In tweet labeling, we were faced with the problem of choosing a labeling strategy. Having discussed this issue at length with stock market financial experts, we opted for manual labeling of the tweets from the point of view of a particular company and not mainly on the sentiment-carrying words used. The reason for this decision is that our long-term intention is to construct classifiers that should distinguish between sentiments of tweets of different companies; hence, a company-focused view is a necessity. The labels were given to instances according to their financial sentiment; that is, their impact on the perception of the company, its products, or its stock. For example, the tweet "I just love shorting CompanyX. What a nice day of profits, first of many..." would be labeled as negative, since shorting means betting that the value of the stock will drop. Despite many positive sentiment words, such a tweet would be providing a message of a negative financial prospect for CompanyX.

Another issue was that in the dataset there are many tweets that do not discuss Baidu stocks, although they do contain the $BIDU tag. These tweets may actually express an opinion about another company, such as the tweet "Apple is great $BIDU", and reflect a positive tweet sentiment, but do not discuss the Baidu company at all. Again, these kinds of tweets were labeled from the point of view of the Baidu company, and not mainly on the sentiment-carrying words used. Therefore, the mentioned tweet would be labeled as neutral.
Therefore, in Baidu sentiment labeling, the annotator was instructed to focus on the following question: "What would someone who knows what Baidu is and shares in general, think of Baidu and its shares after he sees this tweet?", or in other words, "Is this tweet positive, negative, or neutral concerning Baidu and/or the price of its shares?" The resulting hand-labeled dataset consists of 11,389 Baidu financial tweets (4,861 positive, 1,856 negative, and 4,672 neutral tweets). 11 In this dataset, neutral tweets are those that contain no sentiment about Baidu, those that contain both positive and negative sentiments about Baidu, as well as those that do not discuss Baidu even if they are positively or negatively oriented (as discussed above).

7 Tweet IDs of our datasets are available on:

9 To deal with spam (writing nearly identical messages from different accounts), we employed the algorithm based on the work of Broder et al. [9] to discard tweets that were detected as near duplicates.

11 The Baidu tweet IDs and manual labels are publicly available on: file BIDU.txt

4.2. Correlation between tweet sentiment and stock closing price

Given the time series of tweet sentiments and the time series of stock closing prices, the question addressed is whether one time series is useful in forecasting another. We applied a statistical test to determine whether sentiments expressed in tweets contain predictive information about the future values of stock closing prices. To this end, we performed a Granger causality analysis test, which is a statistical hypothesis test for discovering whether one time series is effective for forecasting another time series [23]. Since we have the tweets time series on the one hand and the stock closing price time series on the other hand, this test suits our needs to check whether there is a predictive relationship between sentiments in tweets and stock closing prices.

If time series X is said to Granger-cause Y, then the information in past values of X helps predict values of Y better than only the information in past values of Y. Therefore, the lagged values of X will have a statistically significant correlation with Y. Granger causality analysis is based on linear regression modeling of stochastic processes and it is usually done using a series of t-tests and F-tests on lagged values of X (combined also with lagged values of Y). The test expects that the time series data is covariance stationary and that it can be represented by a linear model. Complex implementations for nonlinear cases exist; nevertheless, they are often more challenging to apply in practice [62].

The output of the Granger causality test is the p-value, which takes values in the [0,1] interval. In statistical hypothesis testing, the p-value is a measure of how much evidence we have against the null hypothesis [57]. If the p-value is lower than the selected significance level, for example 5% (p < 0.05), the null hypothesis is rejected and the result is statistically significant.
On the other hand, a large p-value represents weak evidence against the null hypothesis; thus, the null hypothesis cannot be rejected. The Granger causality test that we used is based on Free Statistics Software [82].

To enable in-depth analysis, we calculated a sentiment indicator for predictive sentiment analysis in finance, named positive sentiment probability, $p_{sp}$, which was proposed in our previous work [71]. Positive sentiment probability is computed for a day $d$ of a time series by dividing the number of positive tweets $N_{pos}$ by the number of all tweets on that day $N_d$:

$$p_{sp}(d) = \frac{N_{pos}(d)}{N_d(d)} \qquad (1)$$

This ratio is used to estimate the probability that the sentiment of a randomly selected tweet on a given day is positive.

To test whether one time series is useful in forecasting another, using the Granger causality test, we first calculated positive sentiment probability for each day when the stock market was open. We then calculated two ratios 12 to meet the Granger causality test condition that the time series data needs to be stationary:

Daily change of the positive sentiment probability $D_{sent}$: positive sentiment probability today minus positive sentiment probability yesterday.

$$D_{sent}(d) = p_{sp}(d) - p_{sp}(d-1) \qquad (2)$$

Daily return in stock closing price $D_{price}$: (closing price today minus closing price yesterday) / closing price yesterday. 13

$$D_{price}(d) = \frac{price(d) - price(d-1)}{price(d-1)} \qquad (3)$$

We applied the Granger causality test to test the following null hypothesis: sentiment in tweets does not predict stock closing prices (when rejected, meaning that the sentiment in tweets Granger-causes the values of stock closing prices). We performed tests on the entire nine-month time period (from March 11 to December 9, 2011), as well as on individual three-month periods (corresponding approximately to March to May, June to August, and September to November). Results for Baidu are shown in the first column of Table 3.
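The three quantities defined in Equations (1) to (3) are straightforward to compute from daily counts and closing prices. The following Python sketch shows one way to do it; the function names and the list-based input layout (one entry per trading day, in chronological order) are assumptions made for illustration.

```python
def positive_sentiment_probability(n_pos, n_all):
    """Equation (1): share of positive tweets among all tweets of a day."""
    return n_pos / n_all


def daily_sentiment_changes(psp):
    """Equation (2): difference between today's and yesterday's p_sp."""
    return [psp[d] - psp[d - 1] for d in range(1, len(psp))]


def daily_price_returns(prices):
    """Equation (3): relative change of the closing price from yesterday to today."""
    return [(prices[d] - prices[d - 1]) / prices[d - 1] for d in range(1, len(prices))]


# Example with made-up numbers: (positive tweets, all tweets) per trading day.
counts = [(30, 60), (45, 60), (15, 60)]
psp = [positive_sentiment_probability(p, n) for p, n in counts]
d_sent = daily_sentiment_changes(psp)
d_price = daily_price_returns([100.0, 110.0, 99.0])
```

Both derived series are one element shorter than their inputs, since the first day has no predecessor.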
In Granger causality testing, we considered lagged values of time series for one, two, and three days.

12 The ratios were defined in collaboration with the domain experts from the Stuttgart Stock Exchange (see Acknowledgments).

13 The same transformation of the price time series was used in [55].
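To make the idea behind the test concrete, the sketch below computes the F-statistic of a textbook linear Granger test in plain Python: a restricted regression of Y on its own lags is compared with an unrestricted regression that also includes lags of X. This is an illustrative formulation under the usual assumptions (stationary series, linear model, intercept included), not the implementation of the statistics software used in the study; all function names are hypothetical, and the p-value would still have to be obtained from the F distribution with (lag, n - 2*lag - 1) degrees of freedom.

```python
def solve_least_squares(X, y):
    """OLS coefficients via the normal equations and Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def residual_sum_of_squares(X, y):
    beta = solve_least_squares(X, y)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def granger_f_statistic(x, y, lag):
    """F-statistic for the null hypothesis 'x does not Granger-cause y'."""
    restricted, unrestricted, target = [], [], []
    for t in range(lag, len(y)):
        y_lags = [y[t - i] for i in range(1, lag + 1)]
        x_lags = [x[t - i] for i in range(1, lag + 1)]
        restricted.append([1.0] + y_lags)             # intercept + own lags
        unrestricted.append([1.0] + y_lags + x_lags)  # ... plus lags of x
        target.append(y[t])
    rss_r = residual_sum_of_squares(restricted, target)
    rss_u = residual_sum_of_squares(unrestricted, target)
    dof = len(target) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)
```

In the setting described here, `x` would be the series of daily sentiment changes and `y` the series of daily price returns, evaluated for lags of one, two, and three days.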

Since in our experiments we compute the p-value repetitively and do multiple comparisons of p-values for different experimental settings, we used the Bonferroni correction [1] to neutralize the problem of multiple comparisons. This correction is considered very conservative. It adjusts the critical p-value by dividing it by the number of comparisons being made. In our case, we divided the critical p-value of 0.1 by 4, as this is the number of time periods (the whole nine months and three three-month periods) which we consider to be a family of tests. We compare the p-values which came from the Granger causality test with 0.1/4 = 0.025 and reject the null hypothesis if the value is lower than 0.025. After applying the Bonferroni correction, the results of the Granger analysis indicated that in this particular setting there are no significant results.

4.3. Experiments in a three-class setting with the neutral zone

In the previous section, we classified financial tweets into one of the two categories, positive or negative, and therefore assumed that every tweet contains an opinion. This is, however, sometimes an unrealistic assumption, since a tweet can be objective and without any opinion about a given company (i.e., without expressed sentiment). Considering this, a tweet should also have the possibility of being classified as either neutral or weakly opinionated. In this section, we address a three-class problem of classifying tweets into the positive, negative, and neutral categories.

Our training data does not contain any neutral tweets for the classifier to learn from. Therefore, we define a tweet which is projected into an area close to the SVM model's hyperplane as neutral. We define this area as the neutral zone, which is parameterized by the value t, where t represents the positive and −t the negative border of the neutral zone.
If a tweet x is projected into this zone, that is, −t < d(x) < t, then rather than being assigned to one of the two sentiment classes, it is assumed to be neutral. Note that our neutral zone does not denote only the neutral tweets, such as tweets which would be labeled as neutral by a human annotator. Instead, the neutral zone also contains the tweets which are either positive or negative but close to the SVM hyperplane which separates the positives from the negatives. Thus, the neutral zone includes tweets containing mixed sentiments, weakly opinionated positive/negative tweets, as well as tweets containing terms which were not observed during the training phase (if human-annotated neutral tweets were available, they would have been included in the neutral zone as well). For a greater t (i.e., greater size of the neutral zone), the classifier is more confident in its classification decision for positive and negative tweets. Our definition of the neutral zone is simple, but allows fast computation.

We repeated our experiments on classifying financial tweets, but now also took into account the neutral zone. Our aim was to investigate whether the introduction of the neutral zone would improve the predictive capabilities of tweets. Therefore, every tweet which mentioned the Baidu company was classified into one of the three categories: positive, negative, or neutral. Then, we applied the same processing of data as before (count the number of positive, negative, and neutral tweets, calculate positive sentiment probability, calculate daily changes of the positive sentiment probability and the daily return of the stock closing price) and performed the Granger analysis test. We varied the t value from 0 to 1 (where t=0 corresponds to classification without the neutral zone) and again calculated the p-value for the separate day lags (1, 2, and 3). The results are shown in Table 3.
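The neutral-zone rule can be written as a three-way threshold on the SVM output. In the sketch below, `distance` stands for d(x), assumed here to be the signed value returned by the trained SVM for a tweet (positive on the positive side of the hyperplane); the function name and the label strings are hypothetical.

```python
from collections import Counter


def classify_with_neutral_zone(distance, t):
    """Map an SVM output d(x) to a class, treating -t < d(x) < t as neutral."""
    if distance >= t:
        return "positive"
    if distance <= -t:
        return "negative"
    return "neutral"


# Made-up SVM outputs for the tweets of one day, classified with t = 0.5.
distances = [0.9, 0.3, -0.1, -0.7, 0.6]
labels = [classify_with_neutral_zone(d, t=0.5) for d in distances]
daily_counts = Counter(labels)
```

With t = 0 the neutral zone is empty and the rule reduces to the two-class decision, which matches the first column of Table 3. The per-day counts are the input to the positive sentiment probability.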
The first column, where the size of the neutral zone is 0, represents the classification without the neutral zone, where financial tweets were classified into one of the two categories, positive or negative. All the remaining columns contain p-values for various sizes of the neutral zone. In Appendix B, we also report the results of the Granger causality correlation between positive sentiment probability and closing stock price for the rest of the companies (Apple, Amazon, Cisco, Google, Microsoft, Netflix, and RIM) whose tweets we collected. The results show that for several other companies, the learned classifier has the potential to be useful for stock price prediction in terms of Granger causality.

Values which are lower than a p-value of 0.1, after applying the Bonferroni correction, are marked in bold in Table 3. The highest number of significant values was obtained with t values of 0.5 and 0.6 for the border distance of the neutral zone from the SVM hyperplane. Therefore, by introducing the neutral zone, we improved the predictive power of our classifier.

From Table 3 it follows that the June-August time period yielded the best results, that is, the strongest relationships between sentiments in tweets and stock closing prices. Therefore, we investigated in more detail the Baidu data and public web media from this time period to find possible reasons for this. Figure 1 shows a screenshot from the Google Finance 14 web page displaying stock price and news media coverage for Baidu in 2011. From the figure, it can be

Table 3: Statistical significance (p-values) of Granger causality correlation between positive sentiment probability and closing stock price for Baidu, while changing the size of the neutral zone (i.e., the t value from 0 to 1). Values which are lower than a p-value of 0.1, after applying the Bonferroni correction, are marked in bold. Columns: size of the neutral zone (t value). Rows: one block per lag, each covering the time periods 9 months, Mar.-May, June-Aug., and Sept.-Nov.

observed that most of the key events in 2011 happened in the period from June to August. Note that this period is also characterized by the highest number of press releases 15 for Baidu in 2011. We hypothesize that this resulted in higher media exposure and, consequently, enabled speculations about price movements in social media. However, further studies are required to confirm or reject this claim.

In addition, we explored whether there is evidence for the reversed causality (that the price movements may influence the public sentiment). The results show that, after making adjustments to the critical p-value by applying the Bonferroni correction, no significant results were left for the reverse direction.

Figure 1: Screenshot from the Google Finance web page showing stock prices and key events. It can be observed that most of the key events in 2011 happened in the period from June to August. We hypothesize that this resulted in a higher media exposure and, consequently, enabled speculations about price movements in social media.


Case Study. How Are UBank Using Social Media? How Are UBank Using Social Media? Case Study Version 1.0 July 2011 About emarketingconnected emarketingconnected is a new type of company in the online marketing industry, acting independently to help

More information

Microsoft Dynamics GP. Manufacturing Core Functions

Microsoft Dynamics GP. Manufacturing Core Functions Microsoft Dynamics GP Manufacturing Core Functions Copyright Copyright 2010 Microsoft. All rights reserved. Limitation of liability This document is provided as-is. Information and views expressed in this

More information

Tweeting Questions in Academic Conferences: Seeking or Promoting Information?

Tweeting Questions in Academic Conferences: Seeking or Promoting Information? Tweeting Questions in Academic Conferences: Seeking or Promoting Information? Xidao Wen, University of Pittsburgh Yu-Ru Lin, University of Pittsburgh Abstract The fast growth of social media has reshaped

More information

SOCIAL MEDIA GLOSSARY General terms

SOCIAL MEDIA GLOSSARY General terms SOCIAL MEDIA GLOSSARY General terms API Application programming interface: Specifies how software components interact. On the web, APIs allow content to be embedded and shared between locations (it s a

More information

ASSIGNMENT SUBMISSION FORM

ASSIGNMENT SUBMISSION FORM ASSIGNMENT SUBMISSION FORM Treat this as the first page of your assignment Course Name: Assignment Title: Business Analytics using Data Mining Crowdanalytix - Predicting Churn/Non-Churn Status of a Consumer

More information

NEW TECHNOLOGIES TO OPTIMIZE PARKING AVAILABILITY, SAFETY AND REVENUE

NEW TECHNOLOGIES TO OPTIMIZE PARKING AVAILABILITY, SAFETY AND REVENUE NEW TECHNOLOGIES TO OPTIMIZE PARKING AVAILABILITY, SAFETY AND REVENUE AllTrafficSolutions.com 2 New Technologies to Optimize Parking Availability, Safety and Revenue CONTENTS Introduction Parking Technology:

More information

3 Ways to Improve Your Targeted Marketing with Analytics

3 Ways to Improve Your Targeted Marketing with Analytics 3 Ways to Improve Your Targeted Marketing with Analytics Introduction Targeted marketing is a simple concept, but a key element in a marketing strategy. The goal is to identify the potential customers

More information

Achieving customer intimacy with IBM SPSS products

Achieving customer intimacy with IBM SPSS products Achieving customer intimacy with IBM SPSS products Transformative technologies for the new era of customer interactions Highlights: Customer intimacy is an innovative strategy for helping organizations

More information

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT ANALYTICAL MODEL DEVELOPMENT AGENDA Enterprise Miner: Analytical Model Development The session looks at: - Supervised and Unsupervised Modelling - Classification

More information

Introduction to software testing and quality process

Introduction to software testing and quality process Introduction to software testing and quality process Automated testing and verification J.P. Galeotti - Alessandra Gorla Engineering processes Engineering disciplines pair construction activities activities

More information

Providing the right level of analytics self-service as a technology provider

Providing the right level of analytics self-service as a technology provider The Information Company White paper Providing the right level of analytics self-service as a technology provider Where are you in your level of maturity as a SaaS provider? Today s technology providers

More information

Predictive Planning for Supply Chain Management

Predictive Planning for Supply Chain Management Predictive Planning for Supply Chain Management David Pardoe and Peter Stone Department of Computer Sciences The University of Texas at Austin {dpardoe, pstone}@cs.utexas.edu Abstract Supply chains are

More information

Machine Learning as a Service

Machine Learning as a Service As we approach the pinnacle of the big data movement, businesses face increasing pressure to integrate data analytics into their regular decision-making processes, and to constantly iterate on the valuable

More information

NLP WHAT S HAPPENING TO NLP?

NLP WHAT S HAPPENING TO NLP? WHAT S HAPPENING TO NLP? A Guide to Synthesio s New & Improved Natural Language Processing Capabilities WHAT IS HAPPENING TO NLP? Synthesio s Natural Language Processing (NLP) model is about to undergo

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................3

More information

Using Decision Tree to predict repeat customers

Using Decision Tree to predict repeat customers Using Decision Tree to predict repeat customers Jia En Nicholette Li Jing Rong Lim Abstract We focus on using feature engineering and decision trees to perform classification and feature selection on the

More information

Insights from the Wikipedia Contest

Insights from the Wikipedia Contest Insights from the Wikipedia Contest Kalpit V Desai, Roopesh Ranjan Abstract The Wikimedia Foundation has recently observed that newly joining editors on Wikipedia are increasingly failing to integrate

More information

Multi-Touch Attribution

Multi-Touch Attribution Multi-Touch Attribution BY DIRK BEYER HEAD OF SCIENCE, MARKETING ANALYTICS NEUSTAR A Guide to Methods, Math and Meaning Introduction Marketers today use multiple marketing channels that generate impression-level

More information

The Evolution of Contestable Markets: A Computing Simulation

The Evolution of Contestable Markets: A Computing Simulation usiness, 2010, 2, 295-299 doi:10.4236/ib.2010.23037 Published Online September 2010 (http://www.scirp.org/journal/ib) The Evolution of Contestable Markets: A Computing Simulation Zhenguo Han 1, Hui Zhang

More information

Big Data The Big Story

Big Data The Big Story Big Data The Big Story Jean-Pierre Dijcks Big Data Product Mangement 1 Agenda What is Big Data? Architecting Big Data Building Big Data Solutions Oracle Big Data Appliance and Big Data Connectors Customer

More information

Research Methods in Human-Computer Interaction

Research Methods in Human-Computer Interaction Research Methods in Human-Computer Interaction Chapter 5- Surveys Introduction Surveys are a very commonly used research method Surveys are also often-maligned because they are not done in the proper manner

More information

Predicting user rating for Yelp businesses leveraging user similarity

Predicting user rating for Yelp businesses leveraging user similarity Predicting user rating for Yelp businesses leveraging user similarity Kritika Singh kritika@eng.ucsd.edu Abstract Users visit a Yelp business, such as a restaurant, based on its overall rating and often

More information

THE BUSINESS LEADER S GUIDE TO. Becoming a Social Business

THE BUSINESS LEADER S GUIDE TO. Becoming a Social Business THE BUSINESS LEADER S GUIDE TO Becoming a Social Business Introduction Customers expect personalized, one-to-one interactions whenever and wherever they interact with your brand and a growing number of

More information

HABIT 2: Know and Love Quality Score

HABIT 2: Know and Love Quality Score HABIT 2: Know and Love Quality Score IMPROVING QUALITY SCORE: THE VALUE OF BEING MORE RELEVANT RAISING QUALITY SCORE TO INCREASE EXPOSURE, LOWER COSTS, AND GENERATE MORE CONVERSIONS WHY SHOULD YOU CARE

More information

The Impact of Agile. Quantified.

The Impact of Agile. Quantified. The Impact of Agile. Quantified. Agile and lean are built on a foundation of continuous improvement: You need to inspect, learn from and adapt your performance to keep improving. Enhancing performance

More information

TRANSPORTATION ASSET MANAGEMENT GAP ANALYSIS TOOL

TRANSPORTATION ASSET MANAGEMENT GAP ANALYSIS TOOL Project No. 08-90 COPY NO. 1 TRANSPORTATION ASSET MANAGEMENT GAP ANALYSIS TOOL USER S GUIDE Prepared For: National Cooperative Highway Research Program Transportation Research Board of The National Academies

More information

Data Mining of the Concept «End of the World» in Twitter Microblogs

Data Mining of the Concept «End of the World» in Twitter Microblogs Summary Data Mining of the Concept «End of the World» in Twitter Microblogs Bohdan Pavlyshenko Ivan Franko Lviv National University,Ukraine, pavlsh@yahoo.com This paper describes the analysis of quantitative

More information

In mid-february President Bush unveiled his administration s climate-change policy

In mid-february President Bush unveiled his administration s climate-change policy Policy Brief Stanford Institute for Economic Policy Research U.S. Climate-Change Policy: The Bush Administration s Plan and Beyond Lawrence H. Goulder In mid-february President Bush unveiled his administration

More information

PRODUCT DESCRIPTIONS AND METRICS

PRODUCT DESCRIPTIONS AND METRICS PRODUCT DESCRIPTIONS AND METRICS Adobe PDM - Adobe Analytics (2015v1) The Products and Services described in this PDM are either On-demand Services or Managed Services (as outlined below) and are governed

More information

Why Search + Social = Success For Brands The Role Of Search And Social In The Customer Life Cycle

Why Search + Social = Success For Brands The Role Of Search And Social In The Customer Life Cycle A Forrester Consulting April 2016 Thought Leadership Paper Commissioned By Catalyst, Part of GroupM Connect Why Search + Social = Success For Brands The Role Of Search And Social In The Customer Life Cycle

More information

Communicate and Collaborate with Visual Studio Team System 2008

Communicate and Collaborate with Visual Studio Team System 2008 Communicate and Collaborate with Visual Studio Team System 2008 White Paper May 2008 For the latest information, please see www.microsoft.com/teamsystem This is a preliminary document and may be changed

More information

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Stamatina Thomaidou 1,2, Konstantinos Leymonis 1,2, Michalis Vazirgiannis 1,2,3 1 : Athens University of Economics and Business,

More information

Validation of RFI Reports V Nov 2017

Validation of RFI Reports V Nov 2017 Validation of RFI Reports V1.0-10 Nov 2017 1) Criteria Rationale: The RFI is a reporting framework at this first stage of development, there are no correct answers, no (widely accepted) best practices,

More information

Designing Of Effective Shopper Purchase Analysis Model Based On Chrip Likes

Designing Of Effective Shopper Purchase Analysis Model Based On Chrip Likes Designing Of Effective Shopper Purchase Analysis Model Based On Chrip Likes Palla Jyothsna #1, S.Igni Sabasti Prabu *2 # Department of Information Technology, Sathyabama University Chennai, Tamil Nadu,

More information

Twitter Sentiment Analysis on Demonetization tweets in India Using R language

Twitter Sentiment Analysis on Demonetization tweets in India Using R language Impact Factor Value: 4.029 ISSN: 2349-7084 International Journal of Computer Engineering In Research Trends Volume 4, Issue 6, June-2017, pp. 252-258 www.ijcert.org Twitter Sentiment Analysis on Demonetization

More information

A Parametric Bootstrapping Approach to Forecast Intermittent Demand

A Parametric Bootstrapping Approach to Forecast Intermittent Demand Proceedings of the 2008 Industrial Engineering Research Conference J. Fowler and S. Mason, eds. A Parametric Bootstrapping Approach to Forecast Intermittent Demand Vijith Varghese, Manuel Rossetti Department

More information

ADVANCED LEAD NURTURING

ADVANCED LEAD NURTURING Definitive Guide to Lead Nurturing Lead Advanced Lead Nurturing In Part One, we defined lead nurturing the process of building relationships with qualified prospects regardless of their timing to buy,

More information

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

E-guide Hadoop Big Data Platforms Buyer s Guide part 1 Hadoop Big Data Platforms Buyer s Guide part 1 Your expert guide to Hadoop big data platforms for managing big data David Loshin, Knowledge Integrity Inc. Companies of all sizes can use Hadoop, as vendors

More information

Trust-Networks in Recommender Systems

Trust-Networks in Recommender Systems San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2008 Trust-Networks in Recommender Systems Kristen Mori San Jose State University Follow this and additional

More information

Data Drives Social Performance

Data Drives Social Performance Data Drives Social Performance The Benchmark Study on Organic Publishing to Social Networks August 2014 52 Vanderbilt Ave, 12th Floor New York, NY 10017 212.883.9844 www.socialflow.com 2014 SocialFlow

More information

COMMUNICATIONS H TOOLKIT H NATIONAL VOTER REGISTRATION DAY. A Partner Communications Toolkit for Traditional and Social Media

COMMUNICATIONS H TOOLKIT H NATIONAL VOTER REGISTRATION DAY. A Partner Communications Toolkit for Traditional and Social Media NATIONAL VOTER REGISTRATION DAY COMMUNICATIONS H TOOLKIT H A Partner Communications Toolkit for Traditional and Social Media www.nationalvoterregistrationday.org Table of Contents Introduction 1 Key Messaging

More information

Ranking Potential Customers based on GroupEnsemble method

Ranking Potential Customers based on GroupEnsemble method Ranking Potential Customers based on GroupEnsemble method The ExceedTech Team South China University Of Technology 1. Background understanding Both of the products have been on the market for many years,

More information

Software for Typing MaxDiff Respondents Copyright Sawtooth Software, 2009 (3/16/09)

Software for Typing MaxDiff Respondents Copyright Sawtooth Software, 2009 (3/16/09) Software for Typing MaxDiff Respondents Copyright Sawtooth Software, 2009 (3/16/09) Background: Market Segmentation is a pervasive concept within market research, which involves partitioning customers

More information

AppExchange Packaging Guide

AppExchange Packaging Guide Salesforce.com: Salesforce Summer '09 AppExchange Packaging Guide Last updated: July 6, 2009 Copyright 2000-2009 salesforce.com, inc. All rights reserved. Salesforce.com is a registered trademark of salesforce.com,

More information

The Mathematical Truth about VDPs, Price to Market and Sales. Noah John Co-founder Autoscores Orlando, FL (330)

The Mathematical Truth about VDPs, Price to Market and Sales. Noah John Co-founder Autoscores Orlando, FL (330) The Mathematical Truth about VDPs, Price to Market and Sales Noah John Co-founder Autoscores Orlando, FL (330) 368-4846 noah@autoscores.com The views and opinions presented in this educational program

More information

Essential Twitter Techniques

Essential Twitter Techniques Essential Twitter Techniques Contents Intro... 3 Know The Difference Between Posting And Spamming... 3 Be Clear On What You Are Promoting... 4 Plan How To Communicate With Others To Convey Professionalism...

More information

Technical Report. Simple, proven approaches to text retrieval. S.E. Robertson, K. Spärck Jones. Number 356. December Computer Laboratory

Technical Report. Simple, proven approaches to text retrieval. S.E. Robertson, K. Spärck Jones. Number 356. December Computer Laboratory Technical Report UCAM-CL-TR-356 ISSN 1476-2986 Number 356 Computer Laboratory Simple, proven approaches to text retrieval S.E. Robertson, K. Spärck Jones December 1994 15 JJ Thomson Avenue Cambridge CB3

More information

The Power of Digital Printing

The Power of Digital Printing September 2002 The Power of Digital Printing Taking the next step to new opportunity and profitability Jim Hamilton Associate Director CAP Ventures, Inc. Page 3 Contents 3 Straight Talk About Digital Print

More information

Consumner Durable Spending

Consumner Durable Spending GEORGE KATONA* The University of Michigan A Communication: Consumner Durable Spending WE AT THE SURVEY RESEARCH CENTER were greatly pleased to find included in Brookings Papers on Economic Activity a study

More information

Modeling of competition in revenue management Petr Fiala 1

Modeling of competition in revenue management Petr Fiala 1 Modeling of competition in revenue management Petr Fiala 1 Abstract. Revenue management (RM) is the art and science of predicting consumer behavior and optimizing price and product availability to maximize

More information

THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS

THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS Anirvan Banerji New York 24th CIRET Conference Wellington, New Zealand March 17-20, 1999 Geoffrey H. Moore,

More information

NEWBIE GUIDE TO ADVERTISING. by ExoClick s Customer Service Team

NEWBIE GUIDE TO ADVERTISING. by ExoClick s Customer Service Team NEWBIE GUIDE TO ADVERTISING by ExoClick s Customer Service Team There are lots of reasons why you might like to start advertising with us: perhaps you want to promote your new cool site, or you are taking

More information

Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for

Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for human health in the past two centuries. Adding chlorine

More information

Mining the reviews of movie trailers on YouTube and comments on Yahoo Movies

Mining the reviews of movie trailers on YouTube and comments on Yahoo Movies Mining the reviews of movie trailers on YouTube and comments on Yahoo Movies Li-Chen Cheng* Chi Lun Huang Department of Computer Science and Information Management, Soochow University, Taipei, Taiwan,

More information

Chapter 5 Evaluating Classification & Predictive Performance

Chapter 5 Evaluating Classification & Predictive Performance Chapter 5 Evaluating Classification & Predictive Performance Data Mining for Business Intelligence Shmueli, Patel & Bruce Galit Shmueli and Peter Bruce 2010 Why Evaluate? Multiple methods are available

More information

BUS 168 Chapter 6 - E-commerce Marketing Concepts: Social, Mobile, Local

BUS 168 Chapter 6 - E-commerce Marketing Concepts: Social, Mobile, Local Consumers Online: The Internet Audience and Consumer Behavior Around 84% of U.S. adults use the Internet in 2015 Intensity and scope of use both increasing Some demographic groups have much higher percentages

More information