Predicting Volatility in Equity Markets Using Macroeconomic News


Stanford University, CS 229 Final Project
Authors: Carolyn Campbell, Rujia Jiang, William Wright
Supervisors: Professor Andrew Ng, (TA) Youssef Ahres
Dec 11, 2015

1. Introduction

On June 26, 2015, months of debt negotiation between the Greek government, headed by Prime Minister Alexis Tsipras, and its creditors, including the IMF and fellow Eurozone countries, broke off abruptly. Tsipras announced a snap referendum regarding the terms of the pending bailout. By the following morning, the S&P 500 was down significantly. As political and economic uncertainty grew in Europe, investors across the world were moving their funds away from risky assets that could be negatively affected by a Greek default, and towards safer assets. Days later, when the Greek situation had resolved, the process reversed as equities rallied. These market movements are not uncommon; just weeks later, a crisis erupted in the Chinese equity markets, and asset values responded to the uncertainty this provoked.

In this paper, we investigate how macroeconomic sentiment immediately impacts volatility in liquid markets, measuring market volatility through the VIX (the volatility index of the S&P 500). Using data pulled from Twitter, we study how breaking economic news affects the markets, and subsequently how to predict which news stories increase volatility. We consider a tweet significant, in that the news it presents contributes to volatility in the market, if within 30 minutes of the tweet being posted the volatility of the asset increases by one-fifth of a standard deviation.

By employing three machine learning techniques for classification, we aim to predict the volatility-inducing power of a single tweet and/or news story. Using a training set, we identify key words and their associated probability of increasing volatility in the markets using Naive Bayes, SVM, and logistic regression. The input of our algorithm is the word count of each bucket of tweets over a dictionary we created.
We then use the three prediction methods to output a predicted increase in the VIX, which is a binary variable (whether or not it increases by a fifth of a standard deviation). In order to create a viable trading strategy, we aim to predict moves with above 51% accuracy.

2. Past Work

There is a lot of research on predicting market price movements through technical analysis of historical and time-series data [5]. A large portion of the studies develop techniques to analyze market trends from texts such as financial market news [5] or social media posts [7]. Some tactics include tokenizing each article and matching the tokens against stop word lists [2], using a subset of article terms as features [4], and tagging named entities to group nouns into predefined categories [6]. The latter two techniques are more frequently used in Question Answering (QA) systems, where themes have been predetermined [6]. However, to analyze sentiment within markets without knowing the objectives, the lexicon-based approach is a more robust technique, although it is subject to the effect of the words surrounding a single word, and trends can sometimes be hard to detect using single-word weighting [2].

Once texts have been dissected into data sets, there are several machine learning approaches for predicting price movements in various financial products. One is Support Vector Regression, which performs linear regression in a high-dimensional feature space to predict stock prices [3]. Logistic regression and neural networks are also plausible for analyzing stock trends [1]. However, all these methods are sensitive to bias and noise, and study performance tends to improve as more noise is removed from the original data sets. In addition, the majority of past research focuses on price prediction for specific stocks, and very little of it focuses on the volatility of the overall markets.
We have taken inspiration from various articles and the techniques they use, such as SVM and logistic regression, but have elected to view this as a classification problem, since we are focusing not on single-name stocks but on market volatility.

3. Dataset and Features

We have pulled macroeconomic news from Twitter; news sources tweet a headline and a link to the accompanying news article, which allows us to access the key topics frequently and in a rather condensed format. In addition, hedge funds, banks, and analysts frequently tweet either articles or short views on the market. We selected 70 accounts that we deemed relevant and that tweeted with significant frequency. The Twitter API allows the most recent 3200 tweets per account to be exported from the website, which provides at least a month of tweets (and hence news) per account. Because we are investigating the immediate market reaction, and therefore using intraday data, this is more than sufficient. This amounted to more than 200,000 tweets. Bloomberg retains intraday data, at 30-minute intervals, for 6 months, which we have also pulled.

We bucketed tweets into thirty-minute intervals, so that instead of regarding an individual tweet as a single data point, we look at clusters of tweets. The aim of this strategy is to reduce the noise in the model and the number of false classifications. For instance, if CNN Breaking News is tweeting about Kanye West having a baby at the same time that oil prices begin to plummet, both would be associated with moves in volatility. The Kanye West tweet has no impact on moves in the S&P 500, but if it were treated as its own data point, the model might learn that "Kanye" was a key word. We consider only tweets between 9 am and 4 pm to align with the market intraday data.

We also constructed our own dictionary, based on our universe of tweets, rather than using a pre-built dictionary. Because of the hashtag feature on Twitter, we suspect that there are key words that affect market movements that are not real words (such as #Grexit) and are commonly used to unify a particular topic. Similarly, analysts often use the stock ticker in a tweet when discussing an update about a single-name stock (such as $AAPL). By constructing our own dictionary, we ensure that these trending topics are included, which can perhaps be even more significant than actual English words. We cleaned the tweets by eliminating all characters that were not letters, and eliminated URLs. The dictionary that we used contains just over 10,000 words, which serve as the features for the algorithms.
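The bucketing and dictionary construction described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names, the lowercasing step, and the regular expressions are assumptions, and tweets are taken to be `(timestamp, text)` pairs already expressed in market time.

```python
import re
from collections import Counter
from datetime import datetime

URL_RE = re.compile(r"https?://\S+")      # assumed URL pattern
NON_LETTER_RE = re.compile(r"[^a-z\s]")   # everything that is not a letter or whitespace


def clean(text):
    """Lowercase (an assumption), drop URLs, drop non-letter characters, split into tokens."""
    text = URL_RE.sub(" ", text.lower())
    return NON_LETTER_RE.sub(" ", text).split()


def bucket_start(ts, minutes=30):
    """Floor a timestamp to the start of its 30-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def build_count_vectors(tweets, open_hour=9, close_hour=16):
    """Group tweets into 30-minute buckets and count tokens per bucket.

    Returns the dictionary (sorted token list) and a mapping from bucket
    start time to a count vector aligned with that dictionary.
    """
    buckets = {}
    for ts, text in tweets:
        if not (open_hour <= ts.hour < close_hour):
            continue  # keep only tweets between 9 am and 4 pm
        buckets.setdefault(bucket_start(ts), Counter()).update(clean(text))
    dictionary = sorted({tok for counts in buckets.values() for tok in counts})
    vectors = {b: [counts[tok] for tok in dictionary] for b, counts in buckets.items()}
    return dictionary, vectors
```

Under these assumptions, a hashtag such as #Grexit survives cleaning as the token `grexit` and a ticker such as $AAPL as `aapl`, so both can enter the dictionary even though the symbols themselves are stripped.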
After processing the data, we reduced our 200,000 tweets to 988 data points. Each data point is a vector the length of the dictionary, holding the frequency of each token (feature) in that block of tweets. Because we are using time-series data, we cannot use k-fold cross-validation, so we use 70% of the data (692 points) for training and the remaining 296 points for testing.

Figure 1 shows a plot of the VIX since 2008 (daily data). While we used intraday data going back 2 months, the figure illustrates how events such as the Greek financial crisis, the Lehman Brothers default, the devaluation of the yuan, and other macro events affect volatility.

Figure 1.

4. Methods

4.1. Naive Bayes. The first classification model we consider is the Multi-factor Naive Bayes. The key assumption of this model is the conditional independence of the features. This amounts to the following assumption: for any features $x_i$ and $x_j$, given that $y = 1$, i.e. that the VIX has gone up substantially, knowledge of $x_j$ has no effect on our beliefs about $x_i$. Mathematically, we say:

$$P(x_i \mid y = 1) = P(x_i \mid y = 1, x_j)$$

This is a strong assumption to make about our Twitter data set. Naive Bayes is used as a preliminary classifier, but we explore two other methods that relax this assumption.
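Under this independence assumption, a Naive Bayes classifier over token counts reduces to estimating per-class token frequencies. The following is a minimal sketch rather than the authors' implementation: it assumes a multinomial event model with add-one (Laplace) smoothing, the function names are hypothetical, and predicting $y = 1$ whenever the log posterior ratio is positive is a threshold chosen here for illustration.

```python
import numpy as np


def fit_naive_bayes(X, y):
    """Estimate class prior and smoothed token probabilities.

    X: (m, N) matrix of token counts, y: (m,) array of 0/1 labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    N = X.shape[1]                      # number of tokens in the dictionary
    phi_y = y.mean()                    # fraction of positive training points
    # Laplace smoothing: add 1 to each token count and N to each class total.
    phi_1 = (X[y == 1].sum(axis=0) + 1) / (X[y == 1].sum() + N)
    phi_0 = (X[y == 0].sum(axis=0) + 1) / (X[y == 0].sum() + N)
    return phi_y, phi_0, phi_1


def predict_naive_bayes(X, phi_y, phi_0, phi_1):
    """Predict 1 where log P(y=1|x) - log P(y=0|x) is positive."""
    X = np.asarray(X, dtype=float)
    log_ratio = (X @ (np.log(phi_1) - np.log(phi_0))
                 + np.log(phi_y) - np.log(1 - phi_y))
    return (log_ratio > 0).astype(int)
```

Working in log space avoids numerical underflow when a bucket contains many tokens, which matters with a dictionary of roughly 10,000 words.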

We want to fit the following parameters for all $k$ features: $\phi_{k|y=1} = P(x_k \mid y = 1)$, $\phi_{k|y=0} = P(x_k \mid y = 0)$, and $\phi_y = P(y = 1)$. In order to fit this model, we maximize the likelihood function (equivalently, its logarithm) to find the estimates:

$$L(\phi_y, \phi_{k|y=0}, \phi_{k|y=1}) = \prod_{i=1}^{m} P(x^{(i)}, y^{(i)})$$

We additionally apply Laplace smoothing to avoid zero probability estimates when we arrive at a word that was not seen in the training data for a class. We have the following estimates:

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_{k,i} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + N}, \qquad \phi_{k|y=0} = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_{k,i} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i + N}, \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$

where $m$ is the number of training points, $n_{k,i}$ is the frequency of the $k$-th token in the $i$-th training point, $n_i$ is the total number of tokens in the $i$-th training point, and $N$ is the number of tokens in the dictionary. Having fit these parameters, we make a prediction on a new example with given features $x$ by calculating the ratio

$$\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}$$

4.2. Support Vector Machines (SVM). Subsequently, we employed the support vector machine method, which finds a separating hyperplane in a high-dimensional feature space that maximizes the margin. Given a training set $S = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ and a linear decision function $f(x) = \langle w, x \rangle + b$, we want to find the weight vector $w$ that solves the following problem:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^{(i)} \left( \langle w, x^{(i)} \rangle + b \right) \geq 1, \quad i = 1, \ldots, m$$

where $m$ is the number of training samples. We can solve the dual problem of the above optimization:

$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j K(x^{(i)}, x^{(j)}) \quad \text{subject to} \quad \alpha_i \geq 0, \ i = 1, \ldots, m, \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

where $K(x^{(i)}, x^{(j)})$ is a kernel function, which we have chosen to be Gaussian.

For our project, we use the LIBLINEAR package to perform the SVM. To use the built-in functions in LIBLINEAR, we constructed our training and test sets as sparse matrices of size $a \times b$, where $a$ = # of tweets and $b$ = # of tokens.
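One common way to store such a matrix is the LIBSVM-style sparse text format, which LIBLINEAR's command-line tools read: one example per line, written as a label followed by `index:value` pairs with 1-based, ascending indices. The sketch below is illustrative only; whether the authors used this file format or an in-memory sparse matrix is not stated, and `to_sparse_row` is a hypothetical helper name.

```python
def to_sparse_row(label, counts):
    """Format one example as '<label> <index>:<value> ...'.

    Only nonzero counts are written, with 1-based feature indices.
    """
    feats = " ".join(f"{j}:{c}" for j, c in enumerate(counts, start=1) if c != 0)
    return f"{label:+d} {feats}".rstrip()


# Example: a positive example whose 2nd token appears twice and 4th token once.
row = to_sparse_row(1, [0, 2, 0, 1])
```

With roughly 10,000 tokens and only a handful present in any block of tweets, writing only the nonzero entries keeps each row short.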
Each row of the matrix represents a tweet; the $j$-th column of the $i$-th row holds the number of times the $j$-th token appeared in tweet $i$. This is a sparse matrix, and therefore we only record non-zero values along with their index numbers. VIX going up is indicated as class +1, and VIX not trending as class -1.

4.3. PCA and Logistic Regression. Finally, we performed logistic regression. Logistic regression makes the fewest assumptions about the dataset, and since we cannot be sure that the conditional independence assumption of the Naive Bayes algorithm holds, logistic regression makes sense. Logistic regression is a model that measures the relationship between a binary categorical dependent variable, which we have taken to be whether or not the VIX has increased by a certain amount, and the independent variables (the features of our dictionary, from the tweets). It estimates the probabilities of the categorical variable using the logistic function. Our hypothesis has the form

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}},$$

the sigmoid function. We assume that $P(y = 1 \mid x, \theta) = h_\theta(x)$. Letting $n$ be the number of features in our dictionary, $\theta \in \mathbb{R}^{n \times 1}$ is the vector that we are trying to fit from our training sample, and $x \in \mathbb{R}^{n \times 1}$, such that $x_0 = 1$ and $x_i$ is the count of the $i$-th word in the data point.
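The hypothesis above, together with the Bernoulli log-likelihood it implies, can be written out as a short sketch. This is illustrative only: the function names are hypothetical, labels are assumed to be coded 0/1, and the first column of `X` is assumed to be the constant 1 corresponding to $x_0$.

```python
import numpy as np


def h(theta, x):
    """Sigmoid hypothesis: estimated P(y = 1 | x, theta)."""
    return 1.0 / (1.0 + np.exp(-(x @ theta)))


def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood of 0/1 labels y under the logistic model."""
    p = h(theta, X)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

With `theta` set to zero, every example gets probability 0.5, so the log-likelihood equals the number of examples times log 0.5; fitting moves `theta` to increase this value.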

Using log-likelihood methods, we should be able to fit the parameters $\theta$. However, we have 10,000 features and only 988 data points, which makes this infeasible. Therefore, we need to employ dimension-reduction techniques to make this a feasible strategy.

Principal component analysis (PCA) is a procedure that reduces the dimension of a set of observations (tweets, for us) by using orthogonal transformations to convert correlated covariates (the features) into a set of linearly uncorrelated variables, known as principal components. The first principal component accounts for as much variance in the data set as possible, and as we increase the number of principal components, we increase the amount of variance accounted for. Each principal component is an eigenvector of the covariance matrix of the data set, guaranteeing the orthogonality of the components. Effectively, we can use a number of principal components significantly smaller than the number of features.

We began with 10,000 features, performed principal component analysis, and selected the first 100 principal components, reducing our dimension by a factor of 100. Projecting our matrix of covariates onto the first 100 loadings (a 10,000 by 100 matrix), we create a new design matrix of size 100 x 988. Using this new design matrix, we perform a logistic regression with the binary VIX variable as our response. The function glmfit in Matlab takes the data as input and outputs the model with fitted parameter $\theta$. The predict function in Matlab takes the model created from the training set and tests it on the 30% testing data by predicting the probability of a VIX increase.

5. Results

We want to compare the three classification models to determine which one best fits our data. Therefore, we use hold-out cross validation with 70% of our data for training and the remaining 30% for testing, and compare the estimated generalization error/accuracy of each model.
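A hold-out split of this kind on time-ordered data can be sketched as follows, with the earliest 70% of data points used for training and the most recent 30% for testing. The helper name is hypothetical, and the input is assumed to be sorted oldest first.

```python
def chronological_split(points, train_frac=0.7):
    """Split time-ordered points into (train, test) without shuffling.

    points must already be sorted by time, oldest first.
    """
    n_train = round(train_frac * len(points))
    return points[:n_train], points[n_train:]


# With 988 data points this gives 692 training and 296 testing points.
train, test = chronological_split(list(range(988)))
```

Because nothing is shuffled, every training point precedes every testing point in time.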
Since the impact of certain words on the VIX changes over time, it does not make sense to train with data that occurs after the testing data. For this reason, we do not use k-fold cross validation, as we need our testing data to be the most recent 30% of the tweets.

The primary metrics that we are interested in for each model are accuracy, precision, and recall. Accuracy is the proportion of correctly classified data points out of the total number of data points. Precision, or positive predictive value, is the number of true positives over the total number of positives predicted, or the probability that a positively predicted data point is actually positive. Recall is the proportion of positive data points that are correctly classified. In other words, we have the following formulas:

$$\mathrm{Acc} = \frac{TP + TN}{M}, \qquad \mathrm{Prec} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where $TP$ stands for true positives, $TN$ for true negatives, $FP$ for false positives, $FN$ for false negatives, and $M$ for the number of data points. We first look at the confusion matrix to get an idea of how each model is performing.

Figure 2.

We see from figure 2 that both Naive Bayes and Logistic Regression do a good job predicting negative data points, but do not predict the positive data points very accurately. The SVM algorithm predicts more positive data points than Naive Bayes and Logistic Regression, but also has many more false positives, lowering its precision. These observations are made more precise in figure 3, which compares the models' main metrics to each other.
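The three formulas above can be computed directly from true and predicted labels. The sketch below is illustrative: the function name is hypothetical, labels are assumed to be coded 0/1, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for no predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for no actual positives
    return accuracy, precision, recall
```

A model that rarely predicts a positive can score well on accuracy while having low recall, which is why the three metrics are reported together.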

Figure 3.

From figure 3, we see that Logistic Regression has the best accuracy and precision at 67% and 41% respectively, with Naive Bayes trailing at 66% and 38%. Although SVM does much better than both Naive Bayes and Logistic Regression in recall, we still consider SVM to be the worst performing model for our purposes. This is because our trading strategy only enters into a position in the market if we predict a positive result from the data. Thus, the most important metric for us is precision, since we can see it as the probability that the position we enter into will be profitable. Therefore, since Logistic Regression has the highest accuracy and precision, we conclude that it is the best model for our problem.

The SVM algorithm performed much worse than the other two algorithms. We attribute this to the noise in our data and the small number of data points. Most of the Twitter accounts we used had many tweets that had no impact on the markets at all, but the algorithm still included them in the analysis. This, along with the small number of data points, led the SVM algorithm to choose a decision boundary that was not very accurate. The reason that Logistic Regression outperformed the Naive Bayes algorithm is that the Naive Bayes assumption of conditional independence of the features does not hold in our case. Also, due to the size of the dictionary, it is possible that both SVM and Naive Bayes were overfit. To mitigate this effect, we created our dictionary using a subset of the tweets before performing our analysis. Because PCA was performed before the Logistic Regression, that model depended on fewer features, lowering the chance that we overfit it.

6. Conclusion

Using three supervised learning techniques, we have developed a methodology for predicting volatility movements, with an accuracy between 57% and 67%.
The Logistic Regression model outperformed both Naive Bayes and SVM because it makes fewer assumptions and had a lower chance of being overfit. We believe that with more concentrated tweets, and by eliminating tweets that are not macroeconomic headlines or that are unrelated to S&P 500 movements, the accuracy and precision of the models would greatly increase. Also, by modifying and shortening the dictionary to include exclusively market-related words and tokens, we would lower our chance of overfitting and see an increase in accuracy. With more time, we would look into different ways to obtain cleaner data, such as pulling headlines from Bloomberg. We would also look into tweaks to our models to address some of the problems they were having, such as using a different kernel for SVM, or adding regularization to account for outliers.

References

[1] T. Ding, V. Fang, and D. Zuo, Stock market prediction based on time series data and market sentiment.
[2] T. L. Im, P. W. San, C. K. On, R. Alfred, and P. Anthony, Impact of financial news headline and content to market sentiment, International Journal of Machine Learning and Computing, 4 (2014).
[3] P. Meesad and R. I. Rasel, Predicting stock market price using support vector regression, in Informatics, Electronics & Vision (ICIEV), 2013 International Conference on, IEEE, 2013.
[4] D. Moldovan, M. Paşca, S. Harabagiu, and M. Surdeanu, Performance issues and error analysis in an open-domain question answering system, ACM Transactions on Information Systems (TOIS), 21 (2003).
[5] R. P. Schumaker and H. Chen, Textual analysis of stock market prediction using breaking financial news: The AZFin text system, ACM Transactions on Information Systems (TOIS), 27 (2009), p. 12.
[6] S. Sekine and C. Nobata, Definition, dictionaries and tagger for extended named entity hierarchy, in LREC, 2004.
[7] M. S. A. Wolfram, Modelling the stock market using Twitter, School of Informatics, (2010), p. 74.


More information

Predictive Analytics Cheat Sheet

Predictive Analytics Cheat Sheet Predictive Analytics The use of advanced technology to help legal teams separate datasets by relevancy or issue in order to prioritize documents for expedited review. Often referred to as Technology Assisted

More information

User Profiling in an Ego Network: Co-profiling Attributes and Relationships

User Profiling in an Ego Network: Co-profiling Attributes and Relationships User Profiling in an Ego Network: Co-profiling Attributes and Relationships Rui Li, Chi Wang, Kevin Chen-Chuan Chang, {ruili1, chiwang1, kcchang} Department of Computer Science, University

More information

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy AGENDA 1. Introduction 2. Use Cases 3. Popular Algorithms 4. Typical Approach 5. Case Study 2016 SAPIENT GLOBAL MARKETS

More information

