KNOWLEDGE DISCOVERY AND TWITTER SENTIMENT ANALYSIS: MINING PUBLIC OPINION AND STUDYING ITS CORRELATION WITH POPULARITY OF INDIAN MOVIES


International Journal of Management (IJM), Volume 6, Issue 1, January (2015), IAEME

Anuj Verma 1, Kunwar Abhay Pratap Singh 2, Kakali Kanjilal 3
1, 2 International Management Institute, New Delhi, India
3 Associate Professor, International Management Institute, New Delhi, India

ABSTRACT

Twitter is the world's most popular platform for self-expression. With more than 18 million registered users, India has the third highest number of accounts on Twitter. This gives marketers a tremendous opportunity to gauge public opinion and run precise, targeted campaigns. Movies in India are increasingly taking to digital platforms to create buzz before release. Sentiment Analysis, or Opinion Mining, can be used to gauge people's reactions for numerous purposes by quantifying the contextual polarity of tweets in real time. While the technique is a crucial element of the Natural Language Processing domain, its use to understand the mind of consumers in India has been limited. This study attempts to measure the impact of the mood of Twitter users, over the period leading up to the release of a movie, on the movie's subsequent performance on the opening weekend in India. Sentiment Analysis is performed on tweets fetched in real time to determine whether word of mouth disseminated in digital form affects the popularity, and hence the performance, of movies. The ultimate aim is to predict, using relevant metrics, whether a movie can succeed at the box office. Measuring and analyzing the mood of people can help companies gain invaluable insights into the dynamics of consumers' preferences instantaneously.
Keywords: Digital Marketing, Opinion Mining, Sentiment Analysis, Text Analytics, Twitter

1. INTRODUCTION

In India, a movie's fate is predominantly sealed on the opening weekend itself. As movies continue to break records and filmmakers realize the enormous potential of social media to create pre-release hype and digital buzz, the digital marketing budgets of movies are ballooning. Twitter, being simple and popular, can be used to observe the dispersion of opinions across its user base with an explanatory power that is absent in other media. Twitter has ensured that the consumer is no longer a passive component of the marketing exchange process. It has transformed the way online word of mouth is disseminated, and it potentially has the power to decide the fate of offerings like movies [1].

Text Mining uses techniques to gather consumer insights from unstructured text, such as Twitter data. The popularity of movies can be traced by applying Sentiment Analysis and Opinion Mining techniques to Twitter data, with promising results [2]. The main approaches to sentiment analysis are lexicon-based and text classification. The lexicon-based approach derives the orientation of a text from the semantic orientation of its words and phrases [3]. The classification approach builds classifiers from labeled instances of text and is essentially a machine-learning task [4].

An analytical model is designed using advanced sentiment analysis, which uses both opinion mining techniques and agglomerative techniques to train itself. The model becomes more efficient with every iteration. The system uses data that varies both in size and in nature of content, so that it can adapt itself to the subtle variations that can appear for a diverse subject. Movies are powerful agents for dividing the opinion of the masses. Using them as a base for opinion mining provides two advantages: the sheer volume of tweets that they potentially generate and the high variability in public opinion [5].

The purpose of the current undertaking is to understand the relationship between Twitter sentiment, related digital marketing campaigns and the popularity of movies in India and, ultimately, performance at the box office. The relative absence of literature on the matter is a significant gap. The study examines the effect of tweets, in both valence and volume, on the popularity of movies in India.
For the purpose of this research, five Indian movies released over a period of two months, October to December 2014, were analyzed: Happy New Year, Kill Dil, Ungli, Action Jackson and PK.

2. THE SENTIMENT ANALYSIS FRAMEWORK

Tweets are collected in real time from Twitter, stored, analyzed, and classified into sub-entities. The critical period, i.e., the period around release, is used to assess the online hype generated. Sentiment Analysis, along with the Agglomerative Hierarchical Clustering technique, is used for the process. The sentiment analysis framework is shown in Fig. 1.

Fig 1: Sentiment Analysis Framework

The following movies have been used for the purpose of research:

Happy New Year (Release Date: October 24, 2014)
Kill Dil (Release Date: November 14, 2014)
Ungli (Release Date: November 28, 2014)
Action Jackson (Release Date: December 04, 2014)
PK (Release Date: December 19, 2014)

2.1 DATA COLLECTION

The system fetches real-time tweets over an extended time period for a set of movies, with the observation period preceding the release of each movie. The following information is captured for each tweet: Tweet ID; Tweet Text; Screen Name; Timestamp; Geotag; Retweet Counter.

2.2 CRITERIA FOR ANALYSIS

Sentiment Analysis is used in conjunction with the other parameters listed below to predict, and validate any correlation with, the popularity or performance of Indian movies. The relevant parameters that have been identified are: hashtag analysis to quantify pre-release hype; pre-release Positive to Negative Tweet Ratio; budget for the movie; sentiment cluster. To validate the predictions generated by the system, the box office collection of each movie is recorded.

2.3 PROGRAMMING ENVIRONMENT

The analysis requires the construction of an online system that uses the Twitter Streaming API and the Twitter Search API to perform lexicon-based filtering and trend analysis using Agglomerative Hierarchical Clustering. The online system is built in R, a functional environment for statistical analysis of complex datasets. The data is stored in MongoDB, a cross-platform document database.

2.4 ANALYTICAL TECHNIQUES

The Twitter Streaming API can be used for low-latency access to tweet data. Predefined dictionaries can be used to perform lexicon-based filtering and classify tweets as positive, negative or neutral. Sentiment analysis is thus fundamental to calculating the ratio of positive tweets to negative tweets, informally called the Golden Ratio for Twitter.
This ratio is an effective way to measure public sentiment or mood for an appropriate sample.

Agglomerative Hierarchical Clustering proceeds through a series of binary mergers: it starts with each item as its own cluster, merges the two closest clusters, and then repeats the process on the clusters formed so far. The distance between clusters is measured using squared Euclidean distance. This technique helps to derive what people are talking about, not just in terms of frequency: the clusters also identify which words occur together. This knowledge can help to gauge popular opinion and trends, and the task is accomplished using the Twitter Search API.

Text is obtained from a pre-set corpus of movie tweets and processed using various dictionaries, lexicons and other linguistic resources, such as part-of-speech tagging, to tag and divide the text into nouns, verbs, adjectives and phrases. A set can be studied by comparing the tagged set with the defined lexicons. The processed text can then be added to the existing lexicons for future reference, enabling the system to learn as it grows and processes newer samples. In the last stage, sentiment scores are calculated and applied at the document, subject or entity level [6].
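The clustering step described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation. The sample tweets, the term list, the helper names and the choice of single linkage (the distance between two clusters is the smallest squared Euclidean distance between their members) are all assumptions made for illustration, since the paper does not state the linkage criterion.

```python
from itertools import combinations


def term_vectors(tweets, terms):
    """One binary vector per term: 1 if the term occurs in a tweet, else 0."""
    tokenized = [set(tweet.lower().split()) for tweet in tweets]
    return [[1 if term in tokens else 0 for tokens in tokenized] for term in terms]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_distance(c1, c2, vectors):
    """Single linkage (an assumption): smallest distance between members."""
    return min(squared_euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerate(vectors, n_clusters):
    """Start with one cluster per item and merge the closest pair repeatedly."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]  # binary merger of the closest pair
        del clusters[b]                          # a < b, so index a stays valid
    return clusters


# Hypothetical tweets and terms, invented for this example.
tweets = [
    "loved the songs and the dance",
    "songs and dance are great",
    "boring story weak script",
    "weak script and boring plot",
]
terms = ["songs", "dance", "boring", "script"]
groups = agglomerate(term_vectors(tweets, terms), n_clusters=2)
print([[terms[i] for i in group] for group in groups])
```

In this toy input, "songs" and "dance" always appear together, as do "boring" and "script", so the two pairs end up in separate clusters. A production system would more likely rely on a library routine and a richer term representation.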

3. EXPERIMENTAL DESIGN

3.1 LEXICON SET

The efficacy of the Sentiment Analysis algorithm depends on the quality of the pre-defined lexicon set and of the Training Set subsequently used to train the opinion mining system. The Large Movie Review Dataset v1.0 [7] contains 25,000 movie reviews as a Training Set and 25,000 movie reviews as a Testing Set. The dataset is used to train the proposed system; tweets are then collected for Indian movies during the pre-release period, and a prediction is generated. The dataset is available for free on the Stanford Artificial Intelligence Laboratory website. Raw text and preprocessed bags of words are also available for analysis. The details of the data collected are as follows:

Training Data: 25,000 Movie Reviews (Stanford Artificial Intelligence Lab)
Testing Data: Tweets collected during the respective pre-release period

Table 1: Tweet Collection for Indian Movies
Movie | Release Date | Observation Period | Tweets Collected | Sample Tweets | Training: Testing Ratio
Happy New Year | 24/10/14 | 15/10-23/ | | | :10
Kill Dil | 14/11/14 | 06/11-13/ | | | :10
Ungli | 28/11/14 | 19/11-27/ | | | :10
Action Jackson | 04/12/14 | 26/11-03/ | | | :10
PK | 19/12/14 | 10/12-18/ | | | :

3.2 TEXT SENTIMENT

Positive Tweets are those that have a positive contextual polarity. Negative Tweets are those that have a negative contextual polarity. Neutral Tweets are those that have no contextual polarity, that have a mixture of positive and negative contextual polarity, or that the system judges to be neither positive nor negative.

3.3 PT/NT RATIO

The PT/NT ratio is defined as the ratio of the number of positive tweets to the number of negative tweets [5]. A PT/NT ratio of more than five implies that a movie is expected to do well on the opening weekend. A PT/NT ratio in the range of two to five implies an average box office opening, while a PT/NT ratio of less than two implies that the movie is not expected to do well.
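As a rough illustration of how lexicon-based labels feed the PT/NT ratio, the following Python sketch classifies tweets by counting lexicon hits and then applies the thresholds stated above. The two small word sets, the sample tweets and the function names are hypothetical, and the simple word-count rule is an assumption standing in for the trained system the paper describes.

```python
import re

# Hypothetical mini-lexicons; a real system would use far larger dictionaries.
POSITIVE = {"good", "great", "love", "awesome", "superb"}
NEGATIVE = {"bad", "boring", "hate", "awful", "flop"}


def classify(tweet):
    """Label a tweet by comparing counts of positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no polarity, or an even mixture of both


def pt_nt_ratio(tweets):
    """Positive tweets divided by negative tweets; None if there are no negatives."""
    labels = [classify(tweet) for tweet in tweets]
    negatives = labels.count("negative")
    if negatives == 0:
        return None  # ratio is undefined without negative tweets
    return labels.count("positive") / negatives


def outlook(ratio):
    """Map a PT/NT ratio to the opening-weekend outlook stated in Section 3.3."""
    if ratio is None:
        return "undefined"
    if ratio > 5:
        return "expected to do well"
    if ratio >= 2:
        return "average opening"
    return "not expected to do well"


# Invented sample tweets, for illustration only.
sample = [
    "Great trailer, love the songs",
    "awesome cast",
    "superb music",
    "good fun",
    "love it",
    "great dance",
    "boring plot",
    "tickets booked for friday",
    "good songs but boring story",
]
ratio = pt_nt_ratio(sample)
print(ratio, outlook(ratio))
```

The last sample tweet contains one positive and one negative word and is therefore treated as neutral, matching the paper's definition of a mixed-polarity tweet. Whether the boundaries at exactly two and five belong to the middle band is an assumption, since the text does not say.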
In India, the major share of a movie's business is done on the opening weekend itself; hence pre-release hype and the corresponding PT/NT ratio are indicators of performance during that period.

3.4 PRE RELEASE HYPE

The trending of the movie hashtag is also an indicator of a movie's popularity and the pre-release hype it generates on Twitter. The popularity of the movie hashtag can be expressed on a scale of 100 relative to all hashtags trending on Twitter for a specific period.

A score of more than 45 is regarded as a high amount of hype. A score of 40 to 45 implies average hype, while a score of less than 40 implies low hype. Hashtagify is a free tool that provides an intuitive and visual way to check the trends for a particular hashtag over a period of eight weeks.

3.5 PREDICTION

For a movie to be successful on the opening weekend, it should be popular in terms of hype, accompanied by a strong PT/NT ratio. Various factors affect the pre-release popularity of the movie in terms of audience expectation, for instance the response to the trailer, the music of the movie, the fan following and past record of the director and principal actors, the number of releases during the week and the timing of the release. These can in turn affect the performance on the opening weekend. The performance categories are summarized in Table 2.

Table 2: Categorizing performance of a movie on the opening weekend
Revenue of the Movie on Opening Weekend (relative to Budget of the Movie) | Predicted Performance on Opening Weekend
> 0.80 * Budget | Super Hit
0.80 * Budget to 0.65 * Budget | Hit
0.65 * Budget to 0.50 * Budget | Average
0.30 * Budget to 0.50 * Budget | Below Average
< 0.30 * Budget | Flop

It must be noted that the ultimate performance of a movie can never be judged with 100 per cent accuracy. Outliers to the prediction are bound to be present. The defined metrics are intended to describe the usual trends. Combinations of the metrics defined above are given in Table 3.

Table 3: Prediction using defined metric
PT/NT Ratio Pre-Release | Hype Pre-Release | Predicted Performance on Opening Weekend
> 5 | > 45 | Super Hit
> 5 | 40-45 | Hit-Super Hit
> 5 | < 40 | Average
2-5 | > 45 | Hit
2-5 | 40-45 | Average
2-5 | < 40 | Flop-Below Average
< 2 | > 45 | Below Average
< 2 | 40-45 | Flop-Below Average
< 2 | < 40 | Flop

4. EXPERIMENTAL RESULTS

The tweets collected for the selected movies are fed into the Sentiment Analysis System. Each tweet is assigned a contextual polarity, which has a magnitude as well as a sign representing the opinion. The results are summarized in Tables 4 to 8.

Table 4: Data Analysis for Happy New Year (Pre Release Period)
Table 5: Data Analysis for Kill Dil (Pre Release Period)
Table 6: Data Analysis for Ungli (Pre Release Period)

Table 7: Data Analysis for Action Jackson (Pre Release Period)
Table 8: Data Analysis for PK (Pre Release Period)

The popularity of each hashtag is obtained using the Hashtagify tool for a period of eight weeks preceding the release of the movie. The time periods corresponding to the releases of the different movies are then aligned and the pre-release hype is plotted, as shown in Fig 2.

Fig 2: Hashtag Popularity across a period of 8 weeks before release. (Plot of popularity, out of 100, against week(s) before release, with the time periods made to overlap, for Happy New Year, Kill Dil, Ungli, Action Jackson and PK.)
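Matching a movie's measured hype and PT/NT ratio against the metrics of Table 3, and categorizing the actual opening weekend by the revenue bands of Table 2, can be sketched in Python as follows. The function and variable names are hypothetical. The 40 to 45 band and the treatment of values that fall exactly on a boundary are assumptions, because the transcription of the two tables is incomplete on those points.

```python
def ratio_band(pt_nt):
    """Band the PT/NT ratio: > 5 high, 2 to 5 average, < 2 low."""
    if pt_nt > 5:
        return "high"
    if pt_nt >= 2:
        return "average"
    return "low"


def hype_band(score):
    """Band the hashtag score out of 100: > 45 high, 40 to 45 average, < 40 low."""
    if score > 45:
        return "high"
    if score >= 40:
        return "average"
    return "low"


# (PT/NT band, hype band) -> predicted performance, following Table 3.
TABLE_3 = {
    ("high", "high"): "Super Hit",
    ("high", "average"): "Hit-Super Hit",
    ("high", "low"): "Average",
    ("average", "high"): "Hit",
    ("average", "average"): "Average",
    ("average", "low"): "Flop-Below Average",
    ("low", "high"): "Below Average",
    ("low", "average"): "Flop-Below Average",
    ("low", "low"): "Flop",
}


def predict(pt_nt, hype):
    """Look up the predicted opening-weekend performance."""
    return TABLE_3[(ratio_band(pt_nt), hype_band(hype))]


def actual_category(revenue, budget):
    """Categorize opening-weekend revenue relative to budget, following Table 2."""
    share = revenue / budget
    if share > 0.80:
        return "Super Hit"
    if share >= 0.65:
        return "Hit"
    if share >= 0.50:
        return "Average"
    if share >= 0.30:
        return "Below Average"
    return "Flop"


# Budget and opening-weekend collection in INR crores, as reported in Table 10.
print(actual_category(109, 125))   # Happy New Year
print(actual_category(20, 35))     # Kill Dil
print(actual_category(30.5, 60))   # Action Jackson
```

The hype and PT/NT inputs passed to `predict` would come from the measured values for each movie; none are shown here because those figures did not survive in the transcribed tables.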

Table 9 contains the predictions (performance on the opening weekend) generated by matching the results against the previously defined metrics.

5. RESULTS

Table 9: Prediction Generation
Movie | Hype (Week 0) | PT/NT ratio | Prediction
Happy New Year | | | Super hit
Kill Dil | | | Average
Ungli | | | Average
Action Jackson | | | Average
PK | | | Super hit

When the popularity of the movie hashtag, which is representative of online buzz and trends, is tracked over time (Fig. 2), the pre-release hype is seen to peak just before the release of the movie. To investigate the efficacy of the results of the Sentiment Analysis System, the box office performance of the movies on the opening weekend is recorded.

Table 10: Financial details (All Data from Box Office India)
Movie | Budget | Opening Weekend Collection | Revenue as % of Budget
Happy New Year | INR 125 crores | INR 109 crores | 87.20%
Kill Dil | INR 35 crores | INR 20 crores | 57.14%
Ungli | INR crores | INR 12.5 crores | %
Action Jackson | INR 60 crores | INR 30.5 crores | 50.83%
PK | INR 85 crores | INR 93.3 crores | %

The data from Table 9 and Table 10 are combined to validate the generated predictions. As can be seen in Table 11, all the generated predictions are correct.

Table 11: Validating Generated Prediction
Movie | Hype | P/N | Prediction | Revenue as % of Budget | Accuracy of Prediction
Happy New Year | | | Super hit | 87.20% | Correct
Kill Dil | | | Average | 57.14% | Correct
Ungli | | | Average | % | Correct
Action Jackson | | | Average | 50.83% | Correct
PK | | | Super hit | % | Correct

6. LIMITATIONS

Limitations of Twitter API: The Twitter API allows a maximum of 15 API requests per rate limit window, and search is limited to 180 queries per 15-minute window. This considerably reduces the operating speed of the algorithm.

Sampling Bias: The sample of tweets collected from Twitter could suffer from a sampling bias. Twitter allows access to only a fraction of the tweets stored in its database. It is probable that some samples are more representative of the population than others.

Noise, Promotion & Spam: Tweets, especially in the run-up to the release, can be repetitive and contain irrelevant communication.

Infringement of Privacy: Twitter discloses the User ID, Followers, Retweets and Geographical coordinates for a particular tweet. Disclosing sensitive information about a particular user could lead to infringement of privacy.

7. CONCLUSION

Twitter is the most widely used micro-blogging platform. As demonstrated by the study, it is of great utility for harnessing the gigantic pool of data and gaining insight into the mind of consumers. Sentiment Analysis is a powerful technique for achieving this. The technique is used to predict the box office performance of movies in India on the opening weekend, and it is observed that it can be successfully applied to correlate public opinion with ultimate performance at the box office. The study can be extended to investigate the application of the sentiment analysis framework to predicting elections and financial markets.

REFERENCES

1. Rui, H., Liu, Y., & Whinston, A. (2013). Whose and what chatter matters? The effect of tweets on movie sales. Decision Support Systems (Elsevier).
2. Jain, V. (2013). Prediction of Movie Success using Sentiment Analysis of Tweets. The International Journal of Soft Computing and Software Engineering.
3. Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of 40th Meeting of the Association for Computational Linguistics. Philadelphia: ACL.
4. Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Conference on Empirical Methods in NLP. Philadelphia: ACL.
5. Asur, S., & Huberman, B. (2010). Predicting the Future With Social Media. California: HP Labs.
6. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics.
7. Maas, A. (2011). Large Movie Review Dataset v1.0. Stanford University, Stanford, California.
8. Dr. Gagandeep K Nagra and Dr. R. Gopal, "The Effect of Digital Marketing Communication on Consumer Buying", International Journal of Management (IJM), Volume 5, Issue 3, 2014.
9. Dr. Jamshed Siddiqui, "An Overview of Opinion Mining Techniques", International Journal of Advanced Research in Engineering & Technology (IJARET), Volume 4, Issue 7, 2013.
10. Mitisha Vaidya and Priyank Thakkar, "Aspect Based Sentiment Analysis of Movie Reviews", International Journal of Advanced Research in Engineering & Technology (IJARET), Volume 5, Issue 6, 2014.