Predicting Volatility in Equity Markets Using Macroeconomic News
|
|
- Gavin Todd
- 5 years ago
- Views:
Transcription
1 Stanford University Predicting Volatility in Equity Markets Using Macroeconomic News CS 229 Final Project Authors: Carolyn Campbell Rujia Jiang William Wright Supervisors: Professor Andrew Ng (TA) Youssef Ahres Dec 11, 2015
2 1 1. Introduction On June 26, 2015, months of debt negotiation between the Greek government, headed by Prime Minister Alexis Tsipras, and its creditors, including the IMF and fellow Eurozone countries, broke off abruptly. Tsipras announced a snap referendum regarding the terms of the pending bailout. By the following morning, S&P500 was down significantly. As political and economic uncertainty grew in Europe, investors across the world were moving their funds away from risky assets that could be negatively affected by a Greek default, and towards safer assets. Days later, when the Greek situation had resolved, the process reversed as equities rallied. These market movements are not uncommon; just weeks later, crisis erupted in the Chinese equity markets, and asset values were responding to the uncertainty this provoked. In this paper, we investigate how macroeconomic sentiment immediately impacts the volatility in liquid markets, by measuring market volatility through the VIX (volatility index of the S&P 500). Using data pulled from Twitter, we are researching how breaking economic news affects the markets, and subsequently how to predict which news stories can increase volatility. We will consider a tweet significant, in that the news presented in the tweet contributes to volatility in the market, if within 30 minutes of the tweet being tweeted, the volatility of the asset increases by one-fifth of a standard deviation. By employing three machine learning techniques for classification, we aim to predict the volatility-inducing power of a single tweet and/or news story. Using a training set, we will identify key words and their associated probability of increasing volatility in the markets using Naive Bayes, SVM and logistic regression. The input of our algorithm will be the word count of each bucket of tweets in a dictionary we created. We then use the three prediction methods to output a predicted increase in the VIX, which will be a binary variable (does it increase by a fifth of a standard deviation or not). In order to create a viable trading strategy, we aim to predict moves with above 51% accuracy. 2. Past Work There is a lot of research on predicting market price movements by conducting technical analysis using historical and time-series data [5]. A large portion of the studies fall into developing techniques to analyze markets trends from texts such as financial market news [5] or social media posts [7]. Some tactics include tokenizing each article, and matching them with stop word lists [2], using a subset of article terms as features [4], and using tagging of named entities to group nouns into predefined categories [6]. The latter two techniques are more frequently used in Question Answering (QA) systems, where themes have been predetermined [6]. However, to analyze the sentiments within markets without knowing the objectives, the lexicon-based approach is a more robust technique, although it is subject to the effect of words around a single word, and trends can sometimes be hard to detected by using single-word weighting [2]. Once texts have been dissected into data sets, there are several machine learning approaches to make predictions in price movements in various financial products. One is Support Vector Regression, which performs linear regression in the high dimension feature space to predict stock prices [3]. Logistic Regression and Neural Networks are also plausible in analyzing stocks trends [1]. However, all these methods are sensitive to bias and noise, and the study performances tend to improve as more noises are removed from original data sets. In addition, the majority of past research focus on price prediction of specific stocks, and very few of them focus on the volatility of the overall markets. We have taken inspiration from various article and the techniques used, such as SVM and logistic regression, but have elected to view this is a classification problem since we are not focusing on single-name stocks but rather market volatility. 3. Dataset and Features We have pulled macroeconomic news from Twitter; news sources will tweet a headline and the link to the accompanying news article, which allows for us to access the key topics frequently and in a rather condensed format. In addition, hedge funds, banks, and analysts will frequently tweet either articles or short views on the market. We have selected 70 accounts that we deemed relevant, and tweeted with a
3 2 significant frequency. The Twitter API allows for the most recent 3200 tweets per account to be exported from the website, which provides at least a month of tweets (and hence, news) per account. Because we are investigating the immediate market reaction, and therefore using intraday data, this is more than sufficient. This amounted to more than 200,000 tweets. Bloomberg retains intraday data, at 30 minute intervals, for 6 months, which we have also pulled. We bucketed tweets into thirty minute intervals, so that instead of regarding an individual tweet as a single data point, we are looking at clusters of tweets. The aim of this strategy is to reduce the noise in the model and reduce the amounts of false classification. For instance, if CNN breaking news is tweeting about Kanye West having a baby at the same time that oil prices begin to plummet, both would be associated with moves in volatility. The Kanye West tweet has no impact on moves in the S&P 500, but if treated as its own data point, the model might believe that Kanye was a key word. We consider only tweets between 9 am and 4pm to align with the market intra-day data. We also constructed our own dictionary, based on our universe of tweets, rather than using a pre-implemented dictionary. Because of the hashtag feature on Twitter, we suspect that there are key words that will affect market movements that are not real words (such as #Grexit) and are commonly used to unify a particular topic. Similarly, analysts often will use the stock ticker in a tweet when discussing an update about a singlename stock (such as $AAPL). By constructing our own dictionary, we ensure that these trending topics are included, which can perhaps be even more significant than actual english words. We cleaned the tweets by eliminating all characters that were not letters, and eliminated URLs. The dictionary that we used contains just over 10,000 words, which serves as our features for the algorithms. After the processing of data, we reduced our 200,000 tweets to 988 data points. Each data point is a vector the length of the dictionary, corresponding to the frequency of each token (feature) in that block of tweets. Because we are using time-series data, we cannot use cross validation techniques, so we use 70% of the data (692 points) for training and 296 points for testing. Figure 1 shows a plot of the VIX since 2008 (daily data). While we used intra-day data going back 2 months, this illustrates how events, such as the Greek financial crisis, the Lehman Brothers default, the devaluation of the yuan, and other macro events effect the volatility. Figure Methods 4.1. Naive Bayes. The first classification model we consider is the Multi-factor Naive Bayes. The key assumption of this model is the conditional independence of the features. This amounts to the following assumption: for any features x i and x j, given that y = 1, i.e. that the VIX has gone up substantially, then the knowledge of x i has no effect on our beliefs about x j. Mathematically, we say: P(x i, x j y = 1) = P(x i y = 1, x j ) This is a strong assumption to make about our Twitter data set. The Naive Bayes is used as a preliminary classification, but we explore two other methods that relax this assumption.
4 3 We want to fit the following parameters: φ k y=1 = P(x k y = 1), φ k y=0 = P(x k y = 0) and φ y = P(y), for all k features. In order to fit this model, we maximize the log-likelihood function to find the estimates: m L(φ y, φ k y=0, φ k y=1 ) = P(x (i), y (i) ) We additionally apply Laplace smoothing to avoid issues when we arrive at a word that is unknown in our dictionary. We have the following estimates: m φ k y=1 = 1(y = 1)n m k,i + 1 m 1(y = 1)n i + N, φ k y=0 = 1(y = 0)n m k,i + 1 m 1(y = 0)n i + N, φ 1(y = 1) y = m where m is the number if training points, n k,i is the frequency of the k th token in the i th training point, n i is the total number of tokens in the i th training set, and N is the number of tokens. Having fit these parameters, we make a prediction on a new example with given features x by calculating the ratio P(y = 1 x) P(y = 0 x) 4.2. Support Vector Machines (SVM). Subsequently, we employed the support vector machine method, which performs linear regression in the high dimension feature to maximize the functional margin. Given a training set S = { (x 1, y 1 ),, (x n, y n ) }, for a linear regression model with the linear functionf(x) = w, x + b,we want to find the weight vector w that solves the following problem minimize w subject to 1 2 w y (i) ( w, x (i) + b) 1, i = 1,, m where m is the number of training sample. We can solve the dual problem of the above optimization as m m maximize W (α) = α i y (i) y (j) α i α j K(x (i), x (j) ) α subject to i,j=1 α i 0, i = 1,, n m α i y i = 0, i = 1,, n where K(x (i), x (j) ) is a kernel function, which we have chosen to be Gaussian, and m is the number of support vector. For our project, we use the LIBLINEAR package to perform the SVM. To use the built-in functions in the LIBLINEAR, we constructed our training sets and test sets into sparse matrices of size (a b), where a = # of tweets and b = # of tokens. Each row of the matrix represents a tweet; the j th column of the i th row represents the number of times the j th token appeared in tweet i. This is a sparse matrix, and therefore we only record non-zero values along with their index numbers. VIX going up is indicated as class +1, and VIX not trending as class PCA and Logistic Regression. Finally, we performed logistic regression. Logistic regression makes the least amount of assumptions on the dataset, and since we cannot be sure that the assumption of conditional independence holds true in the Naive Bayes algorithm, logistic regression makes sense. Logistic regression is a model that measures the relationship between a categorical binary dependent variable, which we have taken to be whether or not the VIX has increased by a certain amount, and the independent variables (the features of our dictionary, from the tweets). It estimates the probabilities of the categorical variable using the logistic function. Our hypothesis has the form h θ (x) = 1 1+e θt x, the sigmoid function. We assume that P(y = 1 x, θ) = h θ(x). Letting n be the number of features in our dictionary, θ R nx1 is the vector that we are trying to fit in from out training sample, and x R nx1, such that x 0 = 1 and x i is the count of the i th word in the data point.
5 4 Using log-likelihood methods, we should be able to fit the parameters θ. However, we have 10,000 features and only 988 data points, which makes this impossible. Therefore, we need to employ dimension-reduction techniques to make this a feasible strategy. Principal component analysis (PCA) is a procedure that reduces the dimension of a set of observations (tweets, for us) by using orthogonal transformations to convert correlated covariates (the features) into a set of linearly uncorrelated variables, know as principal components. The first principal component accounts for as much variance in the data set as possible, and as we increase the number of principal components, we increase the amount of variance accounted for. Each principal component is an eigenvector of the covariance matrix of the data set, guaranteeing the orthogonality of the components. Effectively, we can use a number of principal components significantly less than the number of features. We began with 10,000 features, perform principal component analysis, and selected the first 100 principal components to use, reducing our dimension by a factor of 100. By multiplying the first 100 loadings (a 10,000 by 100 matrix) by our matrix of covariates, we reduce create a new design matrix of size 100 x 988. Using this new design matrix, we perform a logistic regression using the binary VIX as our response. The function glmfit in Matlab inputs the data and outputs the model with fitted parameter θ. The predict function in Matlab, inputs the model created from the training set and tests the model using the 30% testing data by making predictions of the probability of a VIX increase. 5. Results We want to compare the three classification models to determine which one best fits our data. Therefore, we use hold-out cross validation with 70% of our data for training, and the remaining 30% for testing, and compare the estimated generalization error/accuracy of each model. Since the impact of certain words on the VIX changes in time, it does not make sense to train with data that occurs after the testing data. For this reason, we do not use k-fold cross validation, as we need our testing data to be the most recent 30% of the tweets. The primary metrics for each model that we are interested in are accuracy, precision, and recall. Accuracy is the proportion of correctly classified data points to total number of data points. Precision or positive predictive value is the number of true positives over the total number of positives predicted, or the probability that a positively predicted data point is actually positive. Recall is the proportion of positive data points that are correctly classified. In other words, we have the following formulas: Acc = (T P + T N)/M, P rec = T P/(T P + F P ), Recall = T P (T P + F N), where T P stands for true positive, T N true negative, F P false positive, F N false negative, and M number of data points. We first look at the confusion matrix to get an idea of how each model is performing. Figure 2. We see from figure 2 that both Naive Bayes and Logistic Regression do a good job predicting negative data points, but do not predict very accurately the positive data points. The SVM algorithm predicts more positive data points than Naive Bayes and Logistic Regression, but also has a lot more fale positives, lowering its precision. These observations are made more precise in figure 3, which compares each of the models main metrics to each other.
6 5 Figure 3. From figure 3, we see that Logistic Regression has the best accuracy and precision at 67% and 41% respectively, with Naive Bayes trailing with 66% and 38% for accuracy and precision. Although SVM does much better than both Naive Bayes and Logistic Regression in recall, we still consider SVM to be the worst performing model for our purposes. This is because our trading strategy will only enter into a position in the market if we predict a positive result from the data. Thus, the most important metric for us is precision, since we can see it as the probability that the position we enter into will be profitable or not. Therefore, since Logistic Regression has the highes accuracy and precision, we come to the conclusion that it is the best model for our problem. The SVM algorithm performed much worse than the other two algorithms. This was caused by the noise in our data, and small number of data points. Most of the twitter accounts we used had many tweets that had no impact on the markets at all, but the algorithm still included them in that analysis. This along with the small number of data points led to the SVM algorithm choosing a decision boundary that was not very accurate. The reason that Logistic Regression outperformed the Naive Bayes algorithm is that the Naive Bayes assumption of the conditional independence of the features definitely does not hold in our case. Also, due to the size of the dictionary, it is possible that both SVM and Naive Bayes were overfit. To mitigate this effect, we created our dictionary using a subset of the tweets, before performing our analysis. Due to the fact that PCA was performed before doing the Logistic Regression, it depended on fewer features, lowering the chance that we overfit the model. 6. Conclusion Using three supervised learning techniques, we have developed a methodology for predicting volatility movements, with an accuracy between 57-67%. The Logistic Regression model outperformed both Naive Bayes and SVM due to less assumptions being made, and a lower chance that the model was overfit. We believe that with more concentrated tweets, and eliminating tweets that are not macroeconomic headlines, or headlines unrelated to the S&P500 movements, the accuracy and precision of the models would greatly increase. Also, by modifying and shorteningthe dictionary to include exclusively marketrelated words and tokens, we would lower our chance of overfitting, and see an increase in accuracy. With more time, we would look into different ways to obtain our data to make it cleaner, such as pulling headlines from Bloomberg. We would also look into tweaks of our models, to address some of the problems they were having, such as using a different kernel for SVM, or adding regularization to account for outliers.
7 6 References [1] T. Ding, V. Fang, and D. Zuo, Stock market prediction based on time series data and market sentiment, [2] T. L. Im, P. W. San, C. K. On, R. Alfred, and P. Anthony, Impact of financial news headline and content to market sentiment, International Journal of Machine Learning and Computing, 4 (2014), p [3] P. Meesad and R. I. Rasel, Predicting stock market price using support vector regression, in Informatics, Electronics & Vision (ICIEV), 2013 International Conference on, IEEE, 2013, pp [4] D. Moldovan, M. Paşca, S. Harabagiu, and M. Surdeanu, Performance issues and error analysis in an open-domain question answering system, ACM Transactions on Information Systems (TOIS), 21 (2003), pp [5] R. P. Schumaker and H. Chen, Textual analysis of stock market prediction using breaking financial news: The azfin text system, ACM Transactions on Information Systems (TOIS), 27 (2009), p. 12. [6] S. Sekine and C. Nobata, Definition, dictionaries and tagger for extended named entity hierarchy., in LREC, 2004, pp [7] M. S. A. Wolfram, Modelling the stock market using twitter, School of Informatics, (2010), p. 74.
Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest
Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest 1. Introduction Reddit is a social media website where users submit content to a public forum, and other
More informationProgress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong
Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong Machine learning models can be used to predict which recommended content users will click on a given website.
More informationPredicting Corporate 8-K Content Using Machine Learning Techniques
Predicting Corporate 8-K Content Using Machine Learning Techniques Min Ji Lee Graduate School of Business Stanford University Stanford, California 94305 E-mail: minjilee@stanford.edu Hyungjun Lee Department
More informationUsing Twitter as a source of information for stock market prediction
Using Twitter as a source of information for stock market prediction Argimiro Arratia argimiro@lsi.upc.edu LSI, Univ. Politécnica de Catalunya, Barcelona (Joint work with Marta Arias and Ramón Xuriguera)
More informationTracking #metoo on Twitter to Predict Engagement in the Movement
Tracking #metoo on Twitter to Predict Engagement in the Movement Ana Tarano (atarano) and Dana Murphy (d km0713) Abstract: In the past few months, the social movement #metoo has garnered incredible social
More informationCryptocurrency Price Prediction Using News and Social Media Sentiment
Cryptocurrency Price Prediction Using News and Social Media Sentiment Connor Lamon, Eric Nielsen, Eric Redondo Abstract This project analyzes the ability of news and social media data to predict price
More informationCS229 Project Report Using Newspaper Sentiments to Predict Stock Movements Hao Yee Chan Anthony Chow
CS229 Project Report Using Newspaper Sentiments to Predict Stock Movements Hao Yee Chan Anthony Chow haoyeec@stanford.edu ac1408@stanford.edu Problem Statement It is often said that stock prices are determined
More informationHow hot will it get? Modeling scientific discourse about literature
How hot will it get? Modeling scientific discourse about literature Project Aims Natalie Telis, CS229 ntelis@stanford.edu Many metrics exist to provide heuristics for quality of scientific literature,
More informationPredicting Restaurants Rating And Popularity Based On Yelp Dataset
CS 229 MACHINE LEARNING FINAL PROJECT 1 Predicting Restaurants Rating And Popularity Based On Yelp Dataset Yiwen Guo, ICME, Anran Lu, ICME, and Zeyu Wang, Department of Economics, Stanford University Abstract
More informationPredicting Corporate Influence Cascades In Health Care Communities
Predicting Corporate Influence Cascades In Health Care Communities Shouzhong Shi, Chaudary Zeeshan Arif, Sarah Tran December 11, 2015 Part A Introduction The standard model of drug prescription choice
More informationWhen to Book: Predicting Flight Pricing
When to Book: Predicting Flight Pricing Qiqi Ren Stanford University qiqiren@stanford.edu Abstract When is the best time to purchase a flight? Flight prices fluctuate constantly, so purchasing at different
More informationAutomatic detection of plasmonic nanoparticles in tissue sections
Automatic detection of plasmonic nanoparticles in tissue sections Dor Shaviv and Orly Liba 1. Introduction Plasmonic nanoparticles, such as gold nanorods, are gaining popularity in both medical imaging
More informationPAST research has shown that real-time Twitter data can
Algorithmic Trading of Cryptocurrency Based on Twitter Sentiment Analysis Stuart Colianni, Stephanie Rosales, and Michael Signorotti ABSTRACT PAST research has shown that real-time Twitter data can be
More informationDetermining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for
Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for human health in the past two centuries. Adding chlorine
More informationBitcoin UTXO Lifespan Prediction
Bitcoin UTXO Lifespan Prediction Robert Konrad & Stephen Pinto December, 05 Background & Motivation The Bitcoin crypto currency [, ] is the most widely used and highly valued digital currency in existence.
More informationSentiment Analysis and Political Party Classification in 2016 U.S. President Debates in Twitter
Sentiment Analysis and Political Party Classification in 2016 U.S. President Debates in Twitter Tianyu Ding 1 and Junyi Deng 1 and Jingting Li 1 and Yu-Ru Lin 1 1 University of Pittsburgh, Pittsburgh PA
More informationModel Selection, Evaluation, Diagnosis
Model Selection, Evaluation, Diagnosis INFO-4604, Applied Machine Learning University of Colorado Boulder October 31 November 2, 2017 Prof. Michael Paul Today How do you estimate how well your classifier
More informationAirbnb Price Estimation. Hoormazd Rezaei SUNet ID: hoormazd. Project Category: General Machine Learning gitlab.com/hoorir/cs229-project.
Airbnb Price Estimation Liubov Nikolenko SUNet ID: liubov Hoormazd Rezaei SUNet ID: hoormazd Pouya Rezazadeh SUNet ID: pouyar Project Category: General Machine Learning gitlab.com/hoorir/cs229-project.git
More informationPredicting Popularity of Messages in Twitter using a Feature-weighted Model
International Journal of Advanced Intelligence Volume 0, Number 0, pp.xxx-yyy, November, 20XX. c AIA International Advanced Information Institute Predicting Popularity of Messages in Twitter using a Feature-weighted
More informationBig Data. Methodological issues in using Big Data for Official Statistics
Giulio Barcaroli Istat (barcarol@istat.it) Big Data Effective Processing and Analysis of Very Large and Unstructured data for Official Statistics. Methodological issues in using Big Data for Official Statistics
More informationConvex and Non-Convex Classification of S&P 500 Stocks
Georgia Institute of Technology 4133 Advanced Optimization Convex and Non-Convex Classification of S&P 500 Stocks Matt Faulkner Chris Fu James Moriarty Masud Parvez Mario Wijaya coached by Dr. Guanghui
More informationData Mining in Social Network. Presenter: Keren Ye
Data Mining in Social Network Presenter: Keren Ye References Kwak, Haewoon, et al. "What is Twitter, a social network or a news media?." Proceedings of the 19th international conference on World wide web.
More informationMining Psychology from English News and Applications on Finance
Mining Psychology from English News and Applications on Finance Guan-Cheng Li University of California, Berkeley Presentation, ERATO at Osaka Dec 17, 2010 Outline 1 Motivation? 2 What to mine? 3 Where
More informationInternational Journal of Scientific & Engineering Research, Volume 6, Issue 3, March ISSN Web and Text Mining Sentiment Analysis
International Journal of Scientific & Engineering Research, Volume 6, Issue 3, March-2015 672 Web and Text Mining Sentiment Analysis Ms. Anjana Agrawal Abstract This paper describes the key steps followed
More informationStrength in numbers? Modelling the impact of businesses on each other
Strength in numbers? Modelling the impact of businesses on each other Amir Abbas Sadeghian amirabs@stanford.edu Hakan Inan inanh@stanford.edu Andres Nötzli noetzli@stanford.edu. INTRODUCTION In many cities,
More informationPredicting Stock Prices through Textual Analysis of Web News
Predicting Stock Prices through Textual Analysis of Web News Daniel Gallegos, Alice Hau December 11, 2015 1 Introduction Investors have access to a wealth of information through a variety of news channels
More informationApplications of Machine Learning to Predict Yelp Ratings
Applications of Machine Learning to Predict Yelp Ratings Kyle Carbon Aeronautics and Astronautics kcarbon@stanford.edu Kacyn Fujii Electrical Engineering khfujii@stanford.edu Prasanth Veerina Computer
More informationGetting Started with OptQuest
Getting Started with OptQuest What OptQuest does Futura Apartments model example Portfolio Allocation model example Defining decision variables in Crystal Ball Running OptQuest Specifying decision variable
More informationPredicting the Odds of Getting Retweeted
Predicting the Odds of Getting Retweeted Arun Mahendra Stanford University arunmahe@stanford.edu 1. Introduction Millions of people tweet every day about almost any topic imaginable, but only a small percent
More informationUnravelling Airbnb Predicting Price for New Listing
Unravelling Airbnb Predicting Price for New Listing Paridhi Choudhary H John Heinz III College Carnegie Mellon University Pittsburgh, PA 15213 paridhic@andrew.cmu.edu Aniket Jain H John Heinz III College
More informationE-Commerce Sales Prediction Using Listing Keywords
E-Commerce Sales Prediction Using Listing Keywords Stephanie Chen (asksteph@stanford.edu) 1 Introduction Small online retailers usually set themselves apart from brick and mortar stores, traditional brand
More informationA Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang
A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang Abstract Our team intends to develop a recommendation system for job seekers based on the information of current
More informationCustomer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara
Customer Relationship Management in marketing programs: A machine learning approach for decision Fernanda Alcantara F.Alcantara@cs.ucl.ac.uk CRM Goal Support the decision taking Personalize the best individual
More informationApplication of Machine Learning to Financial Trading
Application of Machine Learning to Financial Trading January 2, 2015 Some slides borrowed from: Andrew Moore s lectures, Yaser Abu Mustafa s lectures About Us Our Goal : To use advanced mathematical and
More informationIdentifying Splice Sites Of Messenger RNA Using Support Vector Machines
Identifying Splice Sites Of Messenger RNA Using Support Vector Machines Paige Diamond, Zachary Elkins, Kayla Huff, Lauren Naylor, Sarah Schoeberle, Shannon White, Timothy Urness, Matthew Zwier Drake University
More informationRestaurant Recommendation for Facebook Users
Restaurant Recommendation for Facebook Users Qiaosha Han Vivian Lin Wenqing Dai Computer Science Computer Science Computer Science Stanford University Stanford University Stanford University qiaoshah@stanford.edu
More informationUsing Text Mining and Machine Learning to Predict the Impact of Quarterly Financial Results on Next Day Stock Performance.
Using Text Mining and Machine Learning to Predict the Impact of Quarterly Financial Results on Next Day Stock Performance Itamar Snir The Leonard N. Stern School of Business Glucksman Institute for Research
More informationPredicting and Explaining Price-Spikes in Real-Time Electricity Markets
Predicting and Explaining Price-Spikes in Real-Time Electricity Markets Christian Brown #1, Gregory Von Wald #2 # Energy Resources Engineering Department, Stanford University 367 Panama St, Stanford, CA
More informationDATA SETS. What were the data collection process. When, where, and how? PRE-PROCESSING. Natural language data pre-processing steps
01 DATA SETS What were the data collection process. When, where, and how? 02 PRE-PROCESSING Natural language data pre-processing steps 03 RULE-BASED CLASSIFICATIONS Without using machine learning, can
More informationClassifying Search Advertisers. By Lars Hirsch (Sunet ID : lrhirsch) Summary
Classifying Search Advertisers By Lars Hirsch (Sunet ID : lrhirsch) Summary Multinomial Event Model and Softmax Regression were applied to classify search marketing advertisers into industry verticals
More informationBUSINESS DATA MINING (IDS 572) Please include the names of all team-members in your write up and in the name of the file.
BUSINESS DATA MINING (IDS 572) HOMEWORK 4 DUE DATE: TUESDAY, APRIL 10 AT 3:20 PM Please provide succinct answers to the questions below. You should submit an electronic pdf or word file in blackboard.
More informationWho Matters? Finding Network Ties Using Earthquakes
Who Matters? Finding Network Ties Using Earthquakes Siddhartha Basu, David Daniels, Anthony Vashevko December 12, 2014 1 Introduction While social network data can provide a rich history of personal behavior
More informationPredictive Analytics Cheat Sheet
Predictive Analytics The use of advanced technology to help legal teams separate datasets by relevancy or issue in order to prioritize documents for expedited review. Often referred to as Technology Assisted
More informationUser Profiling in an Ego Network: Co-profiling Attributes and Relationships
User Profiling in an Ego Network: Co-profiling Attributes and Relationships Rui Li, Chi Wang, Kevin Chen-Chuan Chang, {ruili1, chiwang1, kcchang}@illinois.edu Department of Computer Science, University
More informationApplying Regression Techniques For Predictive Analytics Paviya George Chemparathy
Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy AGENDA 1. Introduction 2. Use Cases 3. Popular Algorithms 4. Typical Approach 5. Case Study 2016 SAPIENT GLOBAL MARKETS
More informationProfit Optimization ABSTRACT PROBLEM INTRODUCTION
Profit Optimization Quinn Burzynski, Lydia Frank, Zac Nordstrom, and Jake Wolfe Dr. Song Chen and Dr. Chad Vidden, UW-LaCrosse Mathematics Department ABSTRACT Each branch store of Fastenal is responsible
More informationChurn Prediction with Support Vector Machine
Churn Prediction with Support Vector Machine PIMPRAPAI THAINIAM EMIS8331 Data Mining »Customer Churn»Support Vector Machine»Research What are Customer Churn and Customer Retention?» Customer churn (customer
More informationBrian Macdonald Big Data & Analytics Specialist - Oracle
Brian Macdonald Big Data & Analytics Specialist - Oracle Improving Predictive Model Development Time with R and Oracle Big Data Discovery brian.macdonald@oracle.com Copyright 2015, Oracle and/or its affiliates.
More informationCSE 255 Lecture 3. Data Mining and Predictive Analytics. Supervised learning Classification
CSE 255 Lecture 3 Data Mining and Predictive Analytics Supervised learning Classification Last week Last week we started looking at supervised learning problems Last week We studied linear regression,
More informationMachine Learning Logistic Regression Hamid R. Rabiee Spring 2015
Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Probabilistic Classification Introduction to Logistic regression Binary logistic regression
More information2 Maria Carolina Monard and Gustavo E. A. P. A. Batista
Graphical Methods for Classifier Performance Evaluation Maria Carolina Monard and Gustavo E. A. P. A. Batista University of São Paulo USP Institute of Mathematics and Computer Science ICMC Department of
More informationIntro Logistic Regression Gradient Descent + SGD
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade March 29, 2016 1 Ad Placement
More informationEmpirical Exercise Handout
Empirical Exercise Handout Ec-970 International Corporate Governance Harvard University March 2, 2004 Due Date The new due date for empirical paper is Wednesday, March 24 at the beginning of class. Description
More informationFraud Detection for MCC Manipulation
2016 International Conference on Informatics, Management Engineering and Industrial Application (IMEIA 2016) ISBN: 978-1-60595-345-8 Fraud Detection for MCC Manipulation Hong-feng CHAI 1, Xin LIU 2, Yan-jun
More informationPredicting prokaryotic incubation times from genomic features Maeva Fincker - Final report
Predicting prokaryotic incubation times from genomic features Maeva Fincker - mfincker@stanford.edu Final report Introduction We have barely scratched the surface when it comes to microbial diversity.
More informationInferring Social Ties across Heterogeneous Networks
Inferring Social Ties across Heterogeneous Networks CS 6001 Complex Network Structures HARISH ANANDAN Introduction Social Ties Information carrying connections between people It can be: Strong, weak or
More informationData Science Training Course
About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over
More informationUSE CASES OF MACHINE LEARNING IN DEXFREIGHT S DECENTRALIZED LOGISTICS PLATFORM
USE CASES OF MACHINE LEARNING IN DEXFREIGHT S DECENTRALIZED LOGISTICS PLATFORM Abstract In this document, we describe in high-level use cases and strategies to implement machine learning algorithms to
More informationText Categorization. Hongning Wang
Text Categorization Hongning Wang CS@UVa Today s lecture Bayes decision theory Supervised text categorization General steps for text categorization Feature selection methods Evaluation metrics CS@UVa CS
More informationText Categorization. Hongning Wang
Text Categorization Hongning Wang CS@UVa Today s lecture Bayes decision theory Supervised text categorization General steps for text categorization Feature selection methods Evaluation metrics CS@UVa CS
More informationAccurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification Algorithms Jieming Wei Sharon Zhang Introduction Many organizations prospect for loyal supporters and donors by sending direct mail appeals. This is an
More informationPREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY
PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY GUY HASKIN FERNALD 1, DORNA KASHEF 2, NICHOLAS P. TATONETTI 1 Center for Biomedical Informatics Research 1, Department of Computer
More informationThe Impact of Clinical Drug Trials on Biotechnology Companies
The Impact of Clinical Drug Trials on Biotechnology Companies Cade Hulse Claremont, California Abstract The biotechnology (biotech) industry focuses on the development and production of innovative drugs
More informationSoftware Data Analytics. Nevena Lazarević
Software Data Analytics Nevena Lazarević 1 Selected Literature Perspectives on Data Science for Software Engineering, 1st Edition, Tim Menzies, Laurie Williams, Thomas Zimmermann The Art and Science of
More informationUsing Metrics Through the Technology Assisted Review (TAR) Process. A White Paper
Using Metrics Through the Technology Assisted Review (TAR) Process A White Paper 1 2 Table of Contents Introduction....4 Understanding the Data Set....4 Determine the Rate of Relevance....5 Create Training
More informationNew restaurants fail at a surprisingly
Predicting New Restaurant Success and Rating with Yelp Aileen Wang, William Zeng, Jessica Zhang Stanford University aileen15@stanford.edu, wizeng@stanford.edu, jzhang4@stanford.edu December 16, 2016 Abstract
More informationA Study of Financial Distress Prediction based on Discernibility Matrix and ANN Xin-Zhong BAO 1,a,*, Xiu-Zhuan MENG 1, Hong-Yu FU 1
International Conference on Management Science and Management Innovation (MSMI 2014) A Study of Financial Distress Prediction based on Discernibility Matrix and ANN Xin-Zhong BAO 1,a,*, Xiu-Zhuan MENG
More informationPrediction from Blog Data
Prediction from Blog Data Aditya Parameswaran Eldar Sadikov Petros Venetis 1. INTRODUCTION We have approximately one year s worth of blog posts from [1] with over 12 million web blogs tracked. On average
More informationPropensity of Contract Renewals
1 Propensity of Contract Renewals Himanshu Shekhar (hshekhar@stanford.edu) - Stanford University I. INTRODUCTION Multinational technology firm develops, manufactures, and sells networking hardware, telecommunications
More informationECPR Methods Summer School: Big Data Analysis in the Social Sciences. pablobarbera.com/ecpr-sc105
ECPR Methods Summer School: Big Data Analysis in the Social Sciences Pablo Barberá London School of Economics pablobarbera.com Course website: pablobarbera.com/ecpr-sc105 Supervised Machine Learning Supervised
More informationBioinformatics for Biologists
Bioinformatics for Biologists Functional Genomics: Microarray Data Analysis Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Outline Introduction Working with microarray data Normalization Analysis
More informationPOST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM)
OUTLINE FOR THE POST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM) Module Subject Topics Learning outcomes Delivered by Exploratory & Visualization Framework Exploratory Data Collection and
More informationThe SPSS Sample Problem To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS Professional Statistics 7.5, pa
The SPSS Sample Problem To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS Professional Statistics 7.5, pages 37-64. The description of the problem can be found
More informationDiscriminant models for high-throughput proteomics mass spectrometer data
Proteomics 2003, 3, 1699 1703 DOI 10.1002/pmic.200300518 1699 Short Communication Parul V. Purohit David M. Rocke Center for Image Processing and Integrated Computing, University of California, Davis,
More informationMachine Learning Based Prescriptive Analytics for Data Center Networks Hariharan Krishnaswamy DELL
Machine Learning Based Prescriptive Analytics for Data Center Networks Hariharan Krishnaswamy DELL Modern Data Center Characteristics Growth in scale and complexity Addition and removal of system components
More informationPrediction of Google Local Users Restaurant ratings
CSE 190 Assignment 2 Report Professor Julian McAuley Page 1 Nov 30, 2015 Prediction of Google Local Users Restaurant ratings Shunxin Lu Muyu Ma Ziran Zhang Xin Chen Abstract Since mobile devices and the
More informationBusiness Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee
Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 02 Data Mining Process Welcome to the lecture 2 of
More informationUsing AI to Make Predictions on Stock Market
Using AI to Make Predictions on Stock Market Alice Zheng Stanford University Stanford, CA 94305 alicezhy@stanford.edu Jack Jin Stanford University Stanford, CA 94305 jackjin@stanford.edu 1 Introduction
More informationMasters in Business Statistics (MBS) /2015. Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa. Web:
Masters in Business Statistics (MBS) - 2014/2015 Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa Web: www.mrt.ac.lk Course Coordinator: Prof. T S G Peiris Prof. in Applied
More informationText Mining. Theory and Applications Anurag Nagar
Text Mining Theory and Applications Anurag Nagar Topics Introduction What is Text Mining Features of Text Document Representation Vector Space Model Document Similarities Document Classification and Clustering
More informationPreface to the third edition Preface to the first edition Acknowledgments
Contents Foreword Preface to the third edition Preface to the first edition Acknowledgments Part I PRELIMINARIES XXI XXIII XXVII XXIX CHAPTER 1 Introduction 3 1.1 What Is Business Analytics?................
More informationAI is the New Electricity
AI is the New Electricity Kevin Hu, Shagun Goel {huke, gshagun}@stanford.edu Category: Physical Sciences 1 Introduction The three major fossil fuels petroleum, natural gas, and coal combined accounted
More informationCorrecting Sample Bias in Oversampled Logistic Modeling. Building Stable Models from Data with Very Low event Count
Correcting Sample Bias in Oversampled Logistic Modeling Building Stable Models from Data with Very Low event Count ABSTRACT In binary outcome regression models with very few bads or minority events, it
More informationDESIGN AND DEVELOPMENT OF SOFTWARE FAULT PREDICTION MODEL TO ENHANCE THE SOFTWARE QUALITY LEVEL
International Journal of Information Technology and Management Information Systems (IJITMIS) Volume 1, Issue 2007, Jan Dec 2007, pp. 01 06 Available online at http://www.iaeme.com/ijitmis/issues.asp?jtype=ijitmis&vtype=1&itype=2007
More informationFinal Report: Local Structure and Evolution for Cascade Prediction
Final Report: Local Structure and Evolution for Cascade Prediction Jake Lussier (lussier1@stanford.edu), Jacob Bank (jbank@stanford.edu) ABSTRACT Information cascades in large social networks are complex
More informationCan Cascades be Predicted?
Can Cascades be Predicted? Rediet Abebe and Thibaut Horel September 22, 2014 1 Introduction In this presentation, we discuss the paper Can Cascades be Predicted? by Cheng, Adamic, Dow, Kleinberg, and Leskovec,
More informationCold-start Solution to Location-based Entity Shop. Recommender Systems Using Online Sales Records
Cold-start Solution to Location-based Entity Shop Recommender Systems Using Online Sales Records Yichen Yao 1, Zhongjie Li 2 1 Department of Engineering Mechanics, Tsinghua University, Beijing, China yaoyichen@aliyun.com
More informationEngine remaining useful life prediction based on trajectory similarity
Engine remaining useful life prediction based on trajectory similarity Xin Liu 1, Yunxian Jia 2, Jie Zhou 3, Tianbin Liu 4 1, 2, 3 Department of Equipment Command and Management. Ordnance Engineering College,
More informationIMPROVING GROUNDWATER FLOW MODEL PREDICTION USING COMPLEMENTARY DATA-DRIVEN MODELS
XIX International Conference on Water Resources CMWR 2012 University of Illinois at Urbana-Champaign June 17-22,2012 IMPROVING GROUNDWATER FLOW MODEL PREDICTION USING COMPLEMENTARY DATA-DRIVEN MODELS Tianfang
More informationOn of the major merits of the Flag Model is its potential for representation. There are three approaches to such a task: a qualitative, a
Regime Analysis Regime Analysis is a discrete multi-assessment method suitable to assess projects as well as policies. The strength of the Regime Analysis is that it is able to cope with binary, ordinal,
More informationRank hotels on Expedia.com to maximize purchases
Rank hotels on Expedia.com to maximize purchases Nishith Khantal, Valentina Kroshilina, Deepak Maini December 14, 2013 1 Introduction For an online travel agency (OTA), matching users to hotel inventory
More informationCode Compulsory Module Credits Continuous Assignment
CURRICULUM AND SCHEME OF EVALUATION Compulsory Modules Evaluation (%) Code Compulsory Module Credits Continuous Assignment Final Exam MA 5210 Probability and Statistics 3 40±10 60 10 MA 5202 Statistical
More informationFinal Examination. Department of Computer Science and Engineering CSE 291 University of California, San Diego Spring Tuesday June 7, 2011
Department of Computer Science and Engineering CSE 291 University of California, San Diego Spring 2011 Your name: Final Examination Tuesday June 7, 2011 Instructions: Answer each question in the space
More informationAssignment 2 : Predicting Price Movement of the Futures Market
Assignment 2 : Predicting Price Movement of the Futures Market Yen-Chuan Liu yel1@ucsd.edu Jun 2, 215 1. Dataset Exploratory Analysis Data mining is a field that has just set out its foot and begun to
More informationUnlocking Unstructured Social Media Data in Marketing. William Rand Assistant Professor of Bussiness Management
Unlocking Unstructured Social Media Data in Marketing William Rand Assistant Professor of Bussiness Management In Collaboration with Kelly Hewett, Roland Rust, and Harald J. van Heerde Managers perspectives
More informationUsing Twitter to Predict Voting Behavior
Using Twitter to Predict Voting Behavior Mike Chrzanowski mc2711@stanford.edu Daniel Levick dlevick@stanford.edu December 14, 2012 Background: An increasing amount of research has emerged in the past few
More informationPredicting International Restaurant Success with Yelp
Predicting International Restaurant Success with Yelp Angela Kong 1, Vivian Nguyen 2, and Catherina Xu 3 Abstract In this project, we aim to identify the key features people in different countries look
More informationUsing decision tree classifier to predict income levels
MPRA Munich Personal RePEc Archive Using decision tree classifier to predict income levels Sisay Menji Bekena 30 July 2017 Online at https://mpra.ub.uni-muenchen.de/83406/ MPRA Paper No. 83406, posted
More informationData Mining Applications with R
Data Mining Applications with R Yanchang Zhao Senior Data Miner, RDataMining.com, Australia Associate Professor, Yonghua Cen Nanjing University of Science and Technology, China AMSTERDAM BOSTON HEIDELBERG
More informationA Comparative Study of Filter-based Feature Ranking Techniques
Western Kentucky University From the SelectedWorks of Dr. Huanjing Wang August, 2010 A Comparative Study of Filter-based Feature Ranking Techniques Huanjing Wang, Western Kentucky University Taghi M. Khoshgoftaar,
More information