DATA SETS. What was the data collection process? When, where, and how? PRE-PROCESSING. Natural language data pre-processing steps


01 DATA SETS: What was the data collection process? When, where, and how?
02 PRE-PROCESSING: Natural language data pre-processing steps
03 RULE-BASED CLASSIFICATIONS: Without using machine learning, can we use some basic rules to classify sentiments?
04 MACHINE LEARNING BINARY CLASSIFICATIONS: Using machine learning algorithms, we attempt to classify positive and negative sentiments
05 MACHINE LEARNING MULTI-CLASS CLASSIFICATIONS: Using machine learning to predict positive, neutral and negative sentiments and determine the best model for data with unknown sentiments
06 RESULTS: The summary and statistics of the collected airline tweets
07 CLOSING THOUGHTS: Feedback and Q&A

Where, when and how?

@AmericanAir @delta @southwestair @united AA DELTA SOUTHWEST UNITED #americanairlines #deltaairlines #southwestairlines #unitedairlines #americanair #deltaair #southwestair #unitedair

80,121 Tweets (TWITTER API): Using Python and Twython to retrieve tweets through Twitter's API during a 7-day period.
14,640 Tweets (KAGGLE): Reformatted/cleaned tweets with graded sentiment of major airlines from Feb 2015.
4,000 Tweets (Newsroom): Commercial datasets provided by Newsroom with machine-graded tweets.

Natural language processes to prepare and transform the tweet content for various classification models and sentiment analysis

Major processing steps:
1. Tokenize all tweet contents
2. Perform regular expression
3. Remove all the English stop words
4. Vector matrix for machine learning (get it?)
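The tokenize, stop-word, and vector-matrix steps can be sketched in plain Python as below. This is only an illustration: the stop-word list is a tiny made-up sample, the function names are hypothetical, and the slides do not say which tokenizer or vectorizer the project actually used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the project's list is not shown).
STOP_WORDS = {"a", "an", "the", "is", "to", "my", "and", "was", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vector_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    token_lists = [remove_stop_words(tokenize(d)) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


tweets = ["The flight was late", "Great flight and great crew"]
vocabulary, matrix = build_vector_matrix(tweets)
```

Each row of the matrix is one tweet and each column is one vocabulary word, which is the shape a classifier expects as input.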

Lower Cases: tweet.lower()
    Turn all letters to lower cases. One option is to retain original upper cases.
Eliminate URLs: re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'url', tweet)
    Turn all https://... and www format to display just the word URL.
Replace @username: re.sub(r'@([^\s]+)', r'\1', tweet)
    Convert all @username to display just username without the @ sign.
Replace #hashtags: re.sub(r'#([^\s]+)', r'\1', tweet)
    Convert all #hashtagwords to just hashtagwords.
Replace &amp; Signs: re.sub(r'&amp;', r' and', tweet)
    Replace all &amp; symbols with the actual word "and", preceded by a space.
Remove Additional Spaces: re.sub('[\s]+', ' ', tweet)
    Remove all annoying white spaces.
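Chained together, the substitutions above might look like the following sketch. The function name clean_tweet is hypothetical. The sketch assumes the URL pattern is an alternation between the www and http forms, and that ampersands arrive HTML-escaped as &amp; (as they typically do from the API). It also pads "and" on both sides, a small deviation from the slide.

```python
import re


def clean_tweet(tweet):
    """Apply the slide's cleaning steps in order and return the cleaned text."""
    tweet = tweet.lower()                                                # lower cases
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'url', tweet)   # URLs -> 'url'
    tweet = re.sub(r'@([^\s]+)', r'\1', tweet)                           # @username -> username
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)                           # #hashtag -> hashtag
    tweet = re.sub(r'&amp;', ' and ', tweet)                             # escaped ampersand -> and
    tweet = re.sub(r'[\s]+', ' ', tweet)                                 # collapse whitespace
    return tweet.strip()


example = clean_tweet("Thanks @Delta &amp; #SouthwestAir!  See https://t.co/abc")
```

Lowercasing runs first so that the later patterns only need to handle lower-case input.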

RULE BASED: First, we will use two simple rule-based techniques to conduct our sentiment analysis and see what happens

(pos) x (pos) = (pos): "This movie is very good"
(pos) x (neg) = (neg): "This movie is not good at all"
(neg) x (neg) = (pos): "This movie was not bad"
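A minimal sketch of this sign-multiplication rule follows. The word lists are tiny made-up samples and the function name is hypothetical; the project's actual lexicon and negation handling are not shown in the slides.

```python
# Illustrative word lists only (assumptions, not the project's lexicon).
POSITIVE_WORDS = {"good", "great", "happy"}
NEGATIVE_WORDS = {"bad", "rude", "horrible"}
NEGATIONS = {"not", "never", "no"}


def rule_based_sentiment(text):
    """Count polarity words; a preceding negation flips the sign of the next one."""
    score = 0
    sign = 1  # becomes -1 after a negation word
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            sign = -sign
        elif word in POSITIVE_WORDS:
            score += sign   # (pos) x sign
            sign = 1
        elif word in NEGATIVE_WORDS:
            score -= sign   # (neg) x sign
            sign = 1
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neu"


results = [rule_based_sentiment(s) for s in (
    "This movie is very good",
    "This movie is not good at all",
    "This movie was not bad",
)]
```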

VADER is a lexicon and rule-based sentiment analysis tool that is especially attuned to sentiments expressed in social media.

Examples
"VADER is smart, handsome and funny." {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
"Today SUX!" {'neg': 0.779, 'neu': 0.221, 'pos': 0.0, 'compound': -0.5461}
"VADER is very smart, handsome, and funny!" {'neg': 0.0, 'neu': 0.293, 'pos': 0.707, 'compound': 0.8646}
"Today kinda sux! But I'll get by, lol :)" {'neg': 0.154, 'neu': 0.42, 'pos': 0.426, 'compound': 0.5974}
"VADER is VERY SMART, handsome, and FUNNY!!!" {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342}
"At least it isn't a horrible book." {'neg': 0.0, 'neu': 0.637, 'pos': 0.363, 'compound': 0.431}
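To turn a compound score into a sentiment label, a common convention suggested in the VADER documentation is a cutoff of +/-0.05. The sketch below applies that cutoff to three of the scores listed above. The slides do not say whether the project used this cutoff, so treat it as an assumption; the function name is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-way sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the examples on the slide.
slide_examples = {
    "VADER is smart, handsome and funny.": 0.8316,
    "Today SUX!": -0.5461,
    "At least it isn't a horrible book.": 0.431,
}
labels = {text: label_from_compound(score) for text, score in slide_examples.items()}
```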

Part I Binary Classifications

Data Prep: Remove All Neutral Sentiment
Random Forest: Perform Random Forest Model
Linear SVM: Perform Support Vector Classifier Hyperplane

TRAINING: 9,266 | TESTING: 2,275 | Training Accuracy: 99.87% | Testing Accuracy: 89.58%
TRAINING: 1,124 | TESTING: 290 | Training Accuracy: 99.91% | Testing Accuracy: 84.48%
TRAINING: 11,541 | TESTING: 1,843 | Training Accuracy: 99.88% | Testing Accuracy: 86.49%

TRAINING: 1,124 | TESTING: 290 | Training Accuracy: 99.56% | Testing Accuracy: 84.14%
TRAINING: 11,541 | TESTING: 1,843 | Training Accuracy: 99.12% | Testing Accuracy: 86.38%
TRAINING: 9,266 | TESTING: 2,275 | Training Accuracy: 99.18% | Testing Accuracy: 90.95%

MACHINE LEARNING Part II Multi-Class Classifications

TRAINING: 2,040 | TESTING: 360
TRAINING: 14,640 | TESTING: 2,400

Metric           | Kaggle->Tweet | Tweet->Tweet
CV Accuracy      | 67.80%        | 64.86%
Test Accuracy    | 60.88%        | 66.94%
[+1] Precision   | 33.85%        | 60.34%
[ne] Precision   | 1.43%         | 50.00%
[-1] Precision   | 65.25%        | 96.89%

While the test accuracy is slightly lower, using Kaggle data as the training set proved to be a stronger predictor based on the cross-validation accuracy results. In addition, it is superior in positive and neutral precision, not to mention its much higher training volume.

Metric           | Kaggle->Tweet | Tweet->Tweet
CV Accuracy      | 71.91%        | 66.39%
Test Accuracy    | 69.08%        | 69.72%
[+1] Precision   | 47.69%        | 60.34%
[ne] Precision   | 14.29%        | 49.10%
[-1] Precision   | 79.34%        | 93.33%

Again, while we see that the precision of the Twitter data is superior, the Kaggle data yields more balanced results for all three sentiments. Because the overall test accuracy is about the same, we chose to use Kaggle data as training due to its superior CV accuracy as well as its larger number of data points (~8X).
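For reference, the per-class precision and overall accuracy figures reported here can be computed as in the sketch below. The labels 1, 0 and -1 for positive, neutral and negative are an assumed encoding, and the sample predictions are made up for illustration; they are not the project's data.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_per_class(y_true, y_pred, labels=(1, 0, -1)):
    """For each label: of the tweets predicted as that label, the share that truly are."""
    result = {}
    for label in labels:
        truths_for_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        if not truths_for_predicted:
            result[label] = None  # label never predicted, precision undefined
            continue
        correct = sum(1 for t in truths_for_predicted if t == label)
        result[label] = correct / len(truths_for_predicted)
    return result


# Made-up sample: 1 = positive, 0 = neutral, -1 = negative.
y_true = [-1, -1, -1, 0, 0, 1, 1, -1]
y_pred = [-1, -1, 0, 0, -1, 1, -1, -1]

overall = accuracy(y_true, y_pred)
by_class = precision_per_class(y_true, y_pred)
```

A model can reach high negative precision while scoring poorly on the neutral class, which is why the slides report precision per sentiment alongside overall accuracy.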

Support Vector Classifier
Model Parameters: decision_fun='ovo', kernel='linear', cost=0.4, gamma=0
Use Entire Kaggle Dataset as Training Set (14,640 data points)
Apply Model to Tweets of Unknown Sentiment

Now that we have the model, let's take a look at the results of the collected tweets!

[Chart: tweets per airline. Legend: Original Tweets, Re-Tweets. Values as extracted:]
American: 10,516 and 18,762
Delta: 5,728 and 11,718
Southwest: 5,478 and 9,647
United: 5,705 and 12,567

[Chart: 52,694 Original Tweets by sentiment: Negative, Neutral, Positive]

[Charts: sentiment by airline (American, Delta, Southwest, United), shown as tweet counts and as percentage shares: Negative, Neutral, Positive]

Negative Sentiment: Cenk Uygur | Positive Sentiment: -

Negative Sentiment: Delta Assist | Positive Sentiment: -

Negative Sentiment: Dalton Rapattoni | Positive Sentiment: Dalton Rapattoni

Negative Sentiment: Muslim, Shame | Positive Sentiment: Happy Birthday, 90th

Sentiment | Original Tweets | Retweets
NEGATIVE  | 59.5%           | 49.2%
NEUTRAL   | 26.0%           | 31.7%
POSITIVE  | 14.5%           | 19.1%

If the sentiment is classified accurately, people are more likely to retweet a neutral or positive tweet than a negative tweet.

10,260 Retweets: RT @JensenAckles: @AmericanAir I know it's crazy #DFWsnow I'm just excited 2see my baby girl! Man freaking out next 2 me isn't helping
3,795 Retweets: RT @UMG: The perfect trio @theweeknd @ddlovato and Lucian Grainge #MusicIsUniversal #UMGShowcase w/ @AmericanAir & @Citi https://t.co/j
17,093 Retweets: RT @Mets: We're partnering with @Delta to send a lucky fan to games 1 & 2! RT for your chance to win #DeltaMetsSweeps https://t.co/cdelse84
3,535 Retweets: RT @JackAndJackReal: Thanks @Delta for adding our movie to all your flights! If any of you are traveling and bored make sure u peep it
3,362 Retweets: RT @SouthwestAir: Yo #Kimye, we're really happy for you, we'll let you finish, but South is one of the best names of all time.
1,952 Retweets: RT @DaltonRapattoni: Been home in Dallas for hours now but still in the airport. @SouthwestAir PLEASE find my bag. Please RT. #SWAFindDalto
3,329 Retweets: RT @antoniodelotero: this skank bitch was SO RUDE to me on my flight!!!! @united I asked for her name and she goes, "it's Britney bitch.
352 Retweets: RT @ggreenwald: Arab-American family of 5 going on vacation - 3 young kids in tow - removed from @United plane for "safety reasons" https:/

[Chart: tweet source (iPhone, Android, Web, Others) by sentiment: Negative, Neutral, Positive]

The majority of the tweets are from iPhone. Almost 20% of tweets are from Android devices. The rest are from the Web client and other applications. There is little difference between the sources of negative and positive tweets, but there seem to be more alternative sources for the neutral sentiment.

[Chart: attached media (Photo, Link, Video) by sentiment: Negative, Neutral, Positive]

ENDING THOUGHTS Discussing final thoughts and some future steps

COORDINATES: Current coordinates data isn't great for geo-location analysis
MULTIPLE VALIDATION: Each data point being validated by multiple people
OTHER MODELS: Naïve Bayes, Lemmatization
CLUSTERING: Unsupervised model to discover words association

https://github.com/michaelucsb/python_twitter-sentiment_ga

Project Description

The project aims to use rule-based classification and machine learning algorithms to predict the sentiment of Twitter feeds that mention four major airlines: American Airlines, United, Delta, and Southwest.

The project collected 80,121 tweets during a 7-day period via Twitter's API using Twython, a Python package; among these tweets, 2,400 randomly selected tweets were given a sentiment rating manually (test). Through Kaggle, the project also obtained a cleaned dataset containing 14,640 airline tweets with already rated sentiments (training).

To start off the project, the tweets were processed using tokenization, regular expressions, stop-word removal, and vectorization. With regular expressions, the project eliminated all URLs, @ signs, and # signs, and replaced the & sign with the word "and".

Prior to machine learning, the project conducted two rule-based sentiment classifications. One simply counts the number of positive and negative words, and the other uses the VADER lexicon, a pre-built sentiment analysis tool for Python. Both rule-based sentiment classifications achieved an accuracy of 47% to 55%.

Next, the project used two machine learning algorithms, Random Forest and Support Vector Classifier, to conduct binary classification on positive and negative sentiments by removing all known neutral tweets. These models achieved 85% to 90% accuracy on the test dataset.

Finally, the project used Ada Boost and Support Vector Classifier to conduct multi-class classifications by adding all neutral tweets back. Through grid search and cross-validation, the project achieved a 70% overall accuracy on prediction and an 80% precision on negative sentiment prediction on the test dataset.

Using the best model and parameters found, the model was applied to tweets with unknown sentiments. The summary statistics are included in the latter half of the presentation. The presented word clouds also captured interesting stories from the time the tweets were collected.