Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest

1. Introduction

Reddit is a social media website where users submit content to a public forum, and other users can either upvote or downvote that content. The number of downvotes subtracted from the number of upvotes determines the score of a post, and posts with higher scores are featured more prominently on the website. Notably, a post's popularity is often driven by the discussion that occurs in its comment section. The goal of this project was to determine whether features of the first ten comments on a post can help predict the post's future score. We chose this task after observations from a previous CS229 project, Predicting Reddit Post Popularity (1).

2. Data Set

We collected a dataset of roughly 2000 training examples and 220 testing examples by scraping Reddit through its API. For each example (a Reddit post) we preserved the post's score and the text bodies of its earliest ten comments. Reddit fuzzes some of its data, so the oldest ten comments were not necessarily in order, nor were other popularity metrics necessarily accurate. The score is guaranteed to be accurate, so we chose it as our label value instead of something like vote counts or upvote ratios (both of which are fuzzed).

3. Task Definition

Initially we modeled the problem as a regression task: given our input features, predict a value for the post's future score. We attempted some linear regression models but concluded the task was infeasible because of the skewed distribution of scores: 54 percent of our posts never achieved a score greater than ten, and the few posts with very large scores skewed the regression heavily. Instead, we modeled the problem as a binary classification task, where a post is labeled 0 if its score does not surpass a chosen threshold and 1 otherwise.

4. Features

The input features are metrics computed solely from post comments. Since not all posts had at least ten comments, our first feature was the fraction of the first ten comments that existed. We also considered sentiment analysis of the comments as well as comment length. Because the comment ordering was fuzzed, we could not use metrics on individual comments as features; instead, we used metrics over the ten comments as a whole. Thus our features were the maximum, minimum, lower-half average, upper-half average, and overall average of both the comment sentiment scores and the comment lengths. We added one further feature, polarity, defined as the maximum sentiment score minus the minimum sentiment score. Sentiment analysis was done with the Stanford NLTK, which for each sentence in a comment body returns a rating of Very Negative, Negative, Neutral, Positive, or Very Positive. We assigned sentences the numerical scores -2, -1, 0, 1, and 2, respectively, and each comment received a sentiment score equal to the average of the scores of its sentences.
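To make the feature definitions above concrete, here is a minimal sketch (not the authors' code) of the sentiment scoring and aggregate features; the helper names are illustrative, and the per-sentence ratings are assumed to have already been obtained from the sentiment analyzer:

```python
import numpy as np

# Numerical scores for the five per-sentence ratings
RATING_SCORE = {"Very Negative": -2, "Negative": -1, "Neutral": 0,
                "Positive": 1, "Very Positive": 2}

def comment_sentiment(sentence_ratings):
    """A comment's score is the average of its sentences' scores."""
    return sum(RATING_SCORE[r] for r in sentence_ratings) / len(sentence_ratings)

def post_features(sentiments, lengths):
    """Aggregate features over a post's earliest (up to ten) comments.

    `sentiments` and `lengths` hold one value per existing comment;
    assumes the post has at least one comment.
    """
    feats = [len(sentiments) / 10.0]          # fraction of first ten comments present
    for values in (sentiments, lengths):
        v = np.sort(np.asarray(values, dtype=float))
        half = max(len(v) // 2, 1)
        feats += [v.max(), v.min(),           # maximum, minimum
                  v[:half].mean(),            # lower-half average
                  v[-half:].mean(),           # upper-half average
                  v.mean()]                   # overall average
    feats.append(max(sentiments) - min(sentiments))   # polarity
    return feats
```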

5. Evaluation Metric

We evaluated our algorithms using classification error. As described earlier, a post receives the label $y = 1$ if it surpassed the threshold and $y = 0$ otherwise. Over $m$ test examples, the classification error is

$$\text{error} = \frac{1}{m} \sum_{i=1}^{m} 1\{\hat{y}^{(i)} \neq y^{(i)}\}.$$

All of our algorithms were compared against a baseline that simply predicts 0 for every post; as mentioned earlier, the majority of posts never surpassed even our lowest threshold, a score of ten. In the rest of this report we also use the term accuracy, which is simply 1 minus the classification error. In the results section we describe how we compare algorithms across all thresholds.

6. Results

The figure to the right shows how test error changes for the various algorithms, compared against the baseline, as the threshold increases. We describe each algorithm in more detail below, but the overall trend is one of increasing test accuracy as the threshold increases. More sophisticated models such as the support vector machine (SVM) and logistic regression are stable across threshold changes, whereas simpler algorithms such as the perceptron and k-nearest neighbors, despite some high peaks, behave more erratically. In the analyses of the individual algorithms we look at the confusion matrices across all thresholds and then normalize, giving an average that we use as an overall performance grade. This aggregate does not capture variance, but we believe that accuracy aggregated across all thresholds offers the most insight into which algorithms were most successful.
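As a sketch of this evaluation procedure, the following hypothetical loop binarizes the score at each threshold, retrains a given classifier, and averages the test-set confusion matrices; the threshold values are illustrative, since the report only states that the lowest threshold was ten:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate_across_thresholds(model, X_train, s_train, X_test, s_test,
                               thresholds=(10, 20, 50, 100)):
    """Binarize the raw scores at each threshold, retrain, and average the
    test-set confusion matrices (threshold values here are illustrative)."""
    matrices = []
    for t in thresholds:
        y_train = (s_train > t).astype(int)      # 1 = surpassed the threshold
        y_test = (s_test > t).astype(int)
        y_pred = model.fit(X_train, y_train).predict(X_test)
        error = np.mean(y_pred != y_test)        # classification error
        baseline = np.mean(y_test != 0)          # error of predicting all zeros
        print(f"threshold {t}: accuracy {1 - error:.3f} vs baseline {1 - baseline:.3f}")
        matrices.append(confusion_matrix(y_test, y_pred, labels=[0, 1]))
    return np.mean(matrices, axis=0)             # aggregate confusion matrix
```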

6a. K-Neighbors

The k-nearest neighbors prediction is

$$\hat{y} = \operatorname{mode}\{y_1, \ldots, y_k\},$$

where $y_n$ is the label of one of the $k$ closest neighbors in the training set. We chose $k = 5$, the default in the scikit-learn library. K-neighbors is a very simple algorithm, but it is also computationally expensive. It obtained very high accuracy at low thresholds, but its success faded at higher thresholds. This suggests that many moderately successful posts are clustered together in feature space, while highly successful posts vary more in their features.

[Average confusion matrix entries across thresholds for K-Neighbors: 36.95, 21.79, 14.04, 149.21]

6b. Perceptron

The perceptron, and linear classifiers in general, work well when the data is linearly separable; it was immediately clear that our data did not fit this mold. The difficulty of separating the data with a hyperplane is reflected in the inconsistent accuracy of the algorithm, which even dips below the baseline at certain thresholds, making the algorithm infeasible for this data set. We used the standard perceptron update rule

$$\theta := \theta + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x^{(i)},$$

where the hypothesis thresholds the dot product of $\theta$ and $x$, i.e. $h_\theta(x) = 1\{\theta^\top x \ge 0\}$. We used the default alpha from the scikit-learn library.

[Average confusion matrix entries across thresholds for Perceptron: 30.5, 29.38, 28.25, 133.88]

6c. SVM

We ran an SVM with a linear kernel and L1 regularization. An SVM with a Gaussian kernel had extremely low training set error but higher test set error, so we switched to a linear kernel with L1 regularization in order to be less sensitive to the suspected outliers. The resulting increase in training set error and decrease in test set error supported the suspicion that the data contains many outliers. The corresponding optimization problem is

$$\min_{w,b}\ \|w\|_1 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y^{(i)}\left(w^\top x^{(i)} + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,$$

where $C$ is a chosen constant used to weight the error terms inside the summation and $w$ is the weight vector.

[Average confusion matrix entries across thresholds for SVM: 42.13, 15.58, 17.38, 143.93]
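For concreteness, here is one plausible scikit-learn configuration of the three classifiers above (k = 5 neighbors, a perceptron with default parameters, and a linear SVM with an L1 penalty). This is a sketch rather than the authors' code, and the data variables and evaluation helper are assumed from the earlier snippets:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

models = {
    # k = 5 is the scikit-learn default discussed in 6a
    "K-Neighbors": KNeighborsClassifier(n_neighbors=5),
    # standard perceptron update with the library's default alpha (6b)
    "Perceptron": Perceptron(),
    # linear kernel with an L1 penalty (6c); dual=False is required for L1
    "SVM": LinearSVC(penalty="l1", dual=False, C=1.0),
}

for name, model in models.items():
    # evaluate_across_thresholds is the evaluation sketch from Section 6
    avg_cm = evaluate_across_thresholds(model, X_train, s_train, X_test, s_test)
    print(name, avg_cm)
```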

6d. Logistic Regression

Logistic regression is a discriminative learning algorithm that fits the sigmoid function to the data; our implementation relied on the LIBLINEAR library (2). Its performance was well above the baseline for all thresholds we tested. Logistic regression uses maximum likelihood estimation to find the parameter vector $\theta$ that best fits the training data under the hypothesis

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}},$$

also known as the sigmoid function. Stochastic gradient ascent is used to maximize the likelihood, with update rule

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$

[Average confusion matrix entries across thresholds for Logistic Regression: 41.63, 17.45, 17.875, 145.04]
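Since the implementation relied on LIBLINEAR (2), one plausible way to reproduce it is through scikit-learn's LIBLINEAR-backed logistic regression; a minimal sketch under the same assumptions as the snippets above:

```python
from sklearn.linear_model import LogisticRegression

# solver="liblinear" delegates training to the LIBLINEAR library (2)
logreg = LogisticRegression(solver="liblinear", C=1.0)
logreg.fit(X_train, y_train)                      # y = 1 if the post beat the threshold

prob_above = logreg.predict_proba(X_test)[:, 1]   # sigmoid output h_theta(x)
y_pred = (prob_above >= 0.5).astype(int)          # classify at probability 0.5
```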

6e. Aggregate Training and Test Accuracy

Algorithm                                      Training Set Accuracy    Testing Set Accuracy
Baseline (guess all 0's; 0 = below threshold)  0.7320                   0.7320
Logistic Regression                            0.8863                   0.8408
Support Vector Machine                         0.9938                   0.8221
K-Nearest Neighbors (k = 5)                    0.8927                   0.8386
Perceptron                                     0.7674                   0.7404

7. Discussion

Based on our results, we conclude that there is some correlation between a post's initial commentary and its eventual score. Although the baseline (simply predicting "below threshold" for every post) already achieves better-than-random accuracy, for most thresholds every algorithm does even better. It is interesting that logistic regression and the SVM converge as the threshold increases, while the other algorithms bounce up and down with more variability. The results are satisfactory, but a method with high success rates at both low and high thresholds would be preferable; we are currently unsure how to achieve this. As expected, the commentary had a significant impact on the success of a post; however, significant error remains, since a post's success also depends on features of the post itself and some amount of randomness.

8. Conclusion

Reddit post popularity certainly has an element of randomness, but using characteristics of a post's early comments, its popularity with respect to a fixed threshold can be predicted with up to 89% accuracy. Of course, popularity also depends on the content of the post, which we ignored here; metrics on post characteristics might further improve our predictions. More data and more time to fine-tune our algorithms would likely have smoothed the error curves and inspired more confidence in our results. Nevertheless, it is remarkable to achieve such strong success in predicting post popularity simply from characteristics of a post's earliest comments.

9. Future Work

In the future we would like to perform a more extensive analysis of our features. An ablative analysis would help determine which features are most useful for the prediction task. It may also be worthwhile to devise additional features; for example, the presence of key words could enable something like a Naïve Bayes classifier.

References

1. Predicting Reddit Post Popularity. http://cs229.stanford.edu/proj2012/zamoshchinsegall-predictingredditpostpopularity.pdf
2. LIBLINEAR. http://www.csie.ntu.edu.tw/~cjlin/liblinear/