1. Objectives: 1.1 Specific objectives:


Introduction:

The present work was developed to take part in the challenge promoted by Rosette, exemplifying the combined use of the RapidMiner software and the Rosette extensions for RapidMiner. It is worth pointing out that none of the data or results presented here can be considered conclusive about whether the proposed scenario is good or bad. The goal of this work is only to promote the combined use of the two technologies mentioned above. The sample size was not chosen so that the results would be representative of the proposed scenario.

1. Objectives:

- To analyze the relationship between the current state of the economy and politics in Brazil and the international scenario, based on English-language tweets published on the social network Twitter containing the terms "Brazil" and "Michel Temer".
- The terms "Brazil" and "Michel Temer" were chosen because they are intrinsically linked to Brazil's political and economic scenario. There are no political, legal or personal motivations behind this choice.

1.1 Specific objectives:

1.1.1: Analyze the statistical correlation between the number of retweets and the sentiment perceived in each original tweet (positive, neutral or negative), evaluating which type of message spreads the most among the analyzed tweets, with the number of retweets taken as the measure of spread;
1.1.2: Categorize the tweets, search for the main entities in specific categories, and analyze the sentiment of each tweet within the relevant categories, in order to assess the political and economic scenario of Brazil, which is the focus of this study.

2. Description:
2.1 First stage:

At this stage, using RapidMiner, we collected tweets in English containing the term "Brazil" and tweets containing the term "Michel Temer", the current president of Brazil. Two RapidMiner operators were used: (i) Search Twitter, which connects RapidMiner to the Twitter API and retrieves the tweets matching the parameterized query, and (ii) Write Database, which writes those tweets into a local MySQL database. Writing the tweets into a database makes it possible to accumulate them over several days and thus build a larger dataset. Tweets for the two terms were accumulated from December 1st to December 17th, 2016, working within the Twitter API rule of not providing data that is more than a week old.

RapidMiner process: Search Twitter -> Write Database
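The accumulation step can be illustrated outside RapidMiner with a minimal Python sketch. It is not the process used in this work: it uses an in-memory SQLite database instead of MySQL, a single table for simplicity, and a hypothetical table layout (`id`, `search_term`, `text`, `retweet_count`). The only idea it shows is that repeated daily searches can be appended to one store while tweets that were already saved are skipped.

```python
import sqlite3


def store_tweets(conn, tweets):
    """Append a batch of tweets, skipping ids that are already stored.

    The table layout is a hypothetical stand-in for whatever the
    Write Database operator creates.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, search_term TEXT, "
        "text TEXT, retweet_count INTEGER)"
    )
    # INSERT OR IGNORE drops rows whose primary key already exists,
    # so overlapping search results do not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, search_term, text, retweet_count) "
        "VALUES (:id, :search_term, :text, :retweet_count)",
        tweets,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]


# Made-up example batches; tweet 2 is returned by both daily searches.
day_1 = [
    {"id": 1, "search_term": "Brazil", "text": "example tweet A", "retweet_count": 3},
    {"id": 2, "search_term": "Brazil", "text": "example tweet B", "retweet_count": 0},
]
day_2 = [
    {"id": 2, "search_term": "Brazil", "text": "example tweet B", "retweet_count": 0},
    {"id": 3, "search_term": "Brazil", "text": "example tweet C", "retweet_count": 7},
]

conn = sqlite3.connect(":memory:")
total_after_day_1 = store_tweets(conn, day_1)
total_after_day_2 = store_tweets(conn, day_2)
print(total_after_day_1, total_after_day_2)
```

Keying on the tweet id is an assumption made here for illustration; the original process may handle repeated tweets differently.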

2. Description:
2.2 Second stage:

At this stage, a RapidMiner process was assembled using the Read Database operator to access the database of stored tweets, followed by the Analyze Sentiment operator of the Rosette Text Toolkit extension, which calls the external Rosette API and classifies the sentiment perceived in each tweet as positive, neutral or negative. After the sentiment classification, the Map operator converts the pos, neu and neg outcomes into the numerical values 1, 0 and -1, respectively. This conversion is needed because the next operator, Correlation Matrix, requires numeric attributes. From the resulting correlation matrix we obtain the statistical correlation between the Sentiment column and the Retweet-Count column. This answers one of the questions defined as an objective of this study: does the way a message spreads through retweets depend on its positive, neutral or negative content?
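The mapping and correlation steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the RapidMiner process itself: the rows are made-up values, and the `pearson` helper assumes that the correlation of interest is the standard Pearson coefficient.

```python
from math import sqrt

# Same conversion as the Map operator: text labels to numbers.
SENTIMENT_TO_NUMBER = {"pos": 1, "neu": 0, "neg": -1}


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical rows: (sentiment label, retweet count).
rows = [("neg", 40), ("neg", 25), ("neu", 10), ("pos", 5), ("pos", 12), ("neu", 8)]

sentiment = [SENTIMENT_TO_NUMBER[label] for label, _ in rows]
retweets = [count for _, count in rows]

r = pearson(sentiment, retweets)
print(f"correlation between sentiment and retweets: {r:.2f}")
```

In this made-up sample the negative tweets have the most retweets, so the coefficient comes out negative, which is the situation the study is looking for.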

2. Description:
2.2 Second stage:

For clarity, the following definition of statistical correlation is taken from the description of the RapidMiner Correlation Matrix operator: "A correlation is a number between -1 and +1 that measures the degree of association between two attributes (call them X and Y). A positive value for the correlation implies a positive association. In this case large values of X tend to be associated with large values of Y and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association. In this case large values of X tend to be associated with small values of Y and vice versa."

RapidMiner process: Read Database -> Analyze Sentiment (Rosette Text Toolkit) -> Map -> Correlation Matrix
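The quoted definition does not give a formula. Assuming the coefficient in question is the standard Pearson correlation coefficient, for n tweets with sentiment values x_i and retweet counts y_i it can be written as:

```latex
r_{XY} =
\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
     {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
      \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
```

The numerator is positive when large values of X go together with large values of Y and negative in the inverse case, which matches the verbal definition above.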

2. Description:
On the correlation calculations

The relationship between the classification of news as positive, neutral or negative and readers' interest in those stories, especially in the context of politics, has long been a subject of debate. In Ref. [1], for example, Trussler and Soroka investigated, among other questions, how negative, positive and neutral news on the Internet attracts the interest of readers. Their results suggest that news with negative headlines is more likely to be selected for further reading, especially when the text has political content, an effect that becomes more evident when the topic of the news is political strategy. Our study shares similarities with [1], but in a social network setting: we analyze how the sentiment (positive/negative/neutral) of content published on Twitter affects the number of retweets and shares of that content. With the large database captured in the first stage, we also go further and calculate the correlation between sentiment and the number of retweets.

2. Description:
On the correlation calculations

Of course, larger databases, which would provide more accurate correlation estimates, can be obtained by adjusting the parameters of the RapidMiner tool or simply by collecting data for a longer period, so that more tweets are gathered.

Just as in [1], we label negative, neutral and positive entries with a sentiment parameter s = -1, 0 and 1, respectively. The specific numbers are arbitrary in the following sense: any rescaling of the form a*s + b with a > 0 (for example 0, 1, 2) leaves Pearson's correlation coefficient unchanged, as one can check from its mathematical expression. The requirement on s is that its value increases from negative, to neutral, to positive. In this way, a negative correlation means that lower (higher) values of s tend to go with higher (lower) numbers of retweets. In other words, converting s back to the sentiment classes, a negative correlation indicates a connection between negative (positive) comments and more (fewer) retweets.

[1] M. Trussler and S. Soroka, Consumer Demand for Cynical and Negative News Frames, The International Journal of Press/Politics 19, 360 (2014).
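How much the choice of numeric labels matters can be checked numerically. The sketch below uses made-up labels and retweet counts and a hand-written Pearson helper (an assumption about which coefficient is meant). It compares the reference encoding -1, 0, 1 with a shifted encoding, a scaled encoding, and an unevenly spaced encoding.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; assumes both sequences have non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical sample: sentiment label and retweet count per tweet.
labels = ["neg", "neg", "neu", "pos", "pos", "neu", "neg", "pos"]
retweets = [40, 25, 10, 5, 12, 8, 30, 2]


def correlation_for(encoding):
    """Correlation obtained when labels are converted with `encoding`."""
    s = [encoding[label] for label in labels]
    return pearson(s, retweets)


r_reference = correlation_for({"neg": -1, "neu": 0, "pos": 1})
r_shifted = correlation_for({"neg": 0, "neu": 1, "pos": 2})     # s + 1
r_scaled = correlation_for({"neg": -10, "neu": 0, "pos": 10})   # 10 * s
r_uneven = correlation_for({"neg": -1, "neu": 0, "pos": 5})     # not linear in s

print(r_reference, r_shifted, r_scaled, r_uneven)
```

The shifted and scaled encodings reproduce the reference value up to rounding. The unevenly spaced encoding keeps the negative sign in this sample but gives a different magnitude, so equally spaced values are the safe choice when the numbers are to be compared.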

2. Description:
On the correlation calculations

Therefore, the combination of the RapidMiner and Rosette tools allows us to revisit the old question of how the negative, neutral or positive character of a given piece of political news affects its impact among readers, now using data analysis tools that are easy to handle and avoid the need to run experiments that usually involve recruiting participants, computer-based surveys, etc.

2. Description:
2.3 Third stage:

In the third stage, a process was set up with the Read Database operator to access the database of tweets containing the term "Brazil". A SELECT statement was used inside the operator to combine all the tables generated from the Twitter API, one for each day of collection. After this operator, the following operators of the Rosette Text Toolkit were added: (i) Categorize, to find all the tweets generated between December 1st and 17th, 2016 that the operator classifies in the category Law, Gov't & Politics; (ii) Extract Entities, to check whether specific entities predominate in the tweets classified as Law, Gov't & Politics; and (iii) Analyze Sentiment, to check whether a sentiment classification predominates for the specific entities found in those tweets.

RapidMiner process: Read Database, Extract Entities, Categorize, Analyze Sentiment (the last three from the Rosette Text Toolkit)
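Once each tweet carries a category, a list of entities and a sentiment label, the third-stage analysis comes down to filtering and counting. The Python sketch below shows that logic on made-up records. The field names (`category`, `entities`, `sentiment`) and the "Sports" label are hypothetical, and the Rosette operators themselves are not called here.

```python
from collections import Counter, defaultdict

TARGET_CATEGORY = "Law, Gov't & Politics"

# Hypothetical, already-annotated tweets (stand-ins for the operator output).
annotated_tweets = [
    {"category": TARGET_CATEGORY, "entities": ["Senate"], "sentiment": "neg"},
    {"category": TARGET_CATEGORY, "entities": ["Senate", "Supreme Court"], "sentiment": "neu"},
    {"category": "Sports", "entities": ["Chapecoense"], "sentiment": "neg"},
    {"category": TARGET_CATEGORY, "entities": ["President"], "sentiment": "neg"},
]

# Step 1: keep only the tweets in the category of interest.
in_category = [t for t in annotated_tweets if t["category"] == TARGET_CATEGORY]

# Step 2: count how often each entity appears in those tweets.
entity_counts = Counter(e for t in in_category for e in t["entities"])

# Step 3: for each entity, count the sentiment labels of its tweets.
sentiment_by_entity = defaultdict(Counter)
for tweet in in_category:
    for entity in tweet["entities"]:
        sentiment_by_entity[entity][tweet["sentiment"]] += 1

share_in_category = len(in_category) / len(annotated_tweets)
print(f"share in category: {share_in_category:.0%}")
print(entity_counts.most_common())
print({entity: dict(counts) for entity, counts in sentiment_by_entity.items()})
```

The same three counts (share of the category, entity frequencies, sentiment per entity) are the quantities reported in the third-stage results.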

3. Results: second stage

In the first graph, we compare, for each search term, the statistical correlation between the number of retweets and the sentiment classification of the tweet text over the 17 days of collection. For the database of tweets containing the term "Brazil", the correlation is -0.16, a weak tendency for negative tweets to spread more. For the tweets containing the term "Michel Temer", the correlation is -0.25, also a weak tendency toward the spread of negative tweets, although close to the zone of moderate correlation (magnitude above 0.30), which is reached on some days, as shown in the following graphs.

Graph: Correlation result by search term
Michel Temer: -0.25
Brazil: -0.16

3. Results: second stage

For the term "Michel Temer", the correlation coefficients computed for each collection day are predominantly negative. The only non-negative value is 0.01, which is practically zero, so on no day did positive tweets lead to a noticeably greater number of retweets. On six of the seventeen days analyzed, the negative correlation even goes beyond the moderate-correlation threshold of -0.30.

Graph: Term "Michel Temer", correlation per day
-0.49, -0.27, -0.36, -0.36, -0.08, -0.18, -0.11, 0.01, -0.15, -0.24, -0.35, -0.29, -0.03, -0.26, -0.02, -0.37, -0.33
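The day count can be checked directly from the per-day values shown in the graph. The short Python sketch below assumes that the "margin" is the -0.30 boundary of the moderate-correlation zone mentioned for the overall result.

```python
# Per-day correlation for the term "Michel Temer", as listed in the graph.
michel_temer_daily = [
    -0.49, -0.27, -0.36, -0.36, -0.08, -0.18, -0.11, 0.01, -0.15,
    -0.24, -0.35, -0.29, -0.03, -0.26, -0.02, -0.37, -0.33,
]

# Assumed boundary of the moderate-correlation zone.
MODERATE_THRESHOLD = -0.30

days_beyond_threshold = [r for r in michel_temer_daily if r <= MODERATE_THRESHOLD]
highest_value = max(michel_temer_daily)

print(len(michel_temer_daily), "days analyzed")
print(len(days_beyond_threshold), "days at or beyond the threshold")
print("highest daily value:", highest_value)
```

With these values, six days fall at or beyond -0.30 and the highest daily value is 0.01.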

3. Results: second stage

For the term "Brazil", the correlation coefficient per day fluctuates more, ranging from a moderate correlation in favor of the spread of tweets classified as positive to a moderate correlation in favor of the spread of tweets classified as negative. Tweets containing the term "Brazil" cover diverse subjects, including politics. In addition, the collection period followed the accident involving the Chapecoense soccer team, which caused great commotion worldwide and therefore somewhat affects the statistical results shown here.

Graph: Term "Brazil", correlation per day
-0.16, 0.1, -0.12, -0.01, 0.46, -0.01, 0.1, -0.27, -0.42, -0.01, -0.07, -0.16, -0.17, 0, 0.05, -0.57, 0.36
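The "greater fluctuation" can be made concrete by comparing the spread of the two daily series. The sketch below uses the per-day values from the two graphs. Taking the range and the population standard deviation as measures of fluctuation is a choice made here for illustration, not something stated in the original analysis.

```python
from statistics import pstdev

# Per-day correlations as listed in the two graphs.
brazil_daily = [
    -0.16, 0.1, -0.12, -0.01, 0.46, -0.01, 0.1, -0.27, -0.42,
    -0.01, -0.07, -0.16, -0.17, 0, 0.05, -0.57, 0.36,
]
michel_temer_daily = [
    -0.49, -0.27, -0.36, -0.36, -0.08, -0.18, -0.11, 0.01, -0.15,
    -0.24, -0.35, -0.29, -0.03, -0.26, -0.02, -0.37, -0.33,
]


def spread(values):
    """Summarize how widely a series of daily correlations varies."""
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "pstdev": pstdev(values),
    }


brazil_spread = spread(brazil_daily)
temer_spread = spread(michel_temer_daily)

for name, summary in (("Brazil", brazil_spread), ("Michel Temer", temer_spread)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

The "Brazil" series spans both signs (from -0.57 to 0.46), while the "Michel Temer" series stays almost entirely below zero, which is the contrast described in the text.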

3. Results: second stage

Looking at all the tweets collected over the seventeen days for each term, we checked the percentage of tweets classified as negative in each database. Note the large share of negative tweets in the database for the term "Michel Temer". This result gives a more complete picture of the situation and is consistent with the tendency for tweets classified as negative that contain the term "Michel Temer" to be retweeted more.

Graph: Percentage of negative tweets per search term
Series: Michel Temer, Brazil
Values shown: 25%, 75%

3. Results: third stage

The focus of this study is Brazil's current political scenario. The graph below shows, out of all tweets containing the term "Brazil" collected during the first 17 days of December, the percentage categorized as Law, Gov't & Politics by the Categorize operator of the Rosette Text Toolkit.

Graph: Distribution by category
Series: Law, Gov't & Politics; Other categories
Values shown: 2%, 98%

3. Results: third stage

This graph shows how the tweets categorized as Law, Gov't & Politics are distributed across the sentiment classes. The Categorize and Analyze Sentiment operators were used, showing that in this category, tweets classified as negative are predominant.

Graph: Classifier, Law, Gov't & Politics tweets
Series: Pos, Neu, Neg
Values shown: 20%, 62%, 18%

3. Results: third stage

The graph below shows the combination of two operators of the Rosette Text Toolkit: after the tweets are categorized, the most relevant entities of a given category are extracted. Combining these operators gives a detailed understanding of the proposed scenario and can be a powerful tool for decision making in various fields, such as press, communication or institutional relations.

Graph: Key entities found in tweets categorized as Law, Gov't & Politics
Series: Senate, President, Supreme Court
Values shown: 38%, 24%, 24%

4. Conclusions

By combining RapidMiner with the Rosette Text Toolkit, we were able to build a powerful way to analyze text data, especially from social networks. Using both tools allows professionals from different areas who need to analyze this type of information to do so with little technical knowledge. Focusing on the results, the analyses made possible by the Rosette Text Toolkit let you examine social media data much more deeply than tools with standardized reports. It is possible to continuously create new metrics and to think about the process of knowledge discovery in databases. As an example, we can mention the search for patterns in tweets that are classified as negative, belong to a certain category, and are dominated by a specific entity. In terms of decision making, the combination of the two tools, with the methodologies developed in this work, allows real-time evaluation of people's reactions to a given political scenario. Through these analyses one can shape institutional campaigns, measure public interest or even tailor speeches and other communications to people's aspirations.

5. Final remarks

This work was developed by Delano Lima, who holds a degree in Advertising and a postgraduate degree in Marketing Management from UNIFOR - University of Fortaleza. The process of converting the sentiment classifications from text to numerical data was validated in collaboration with Prof. Andrey Chaves, from the Department of Physics of Universidade Federal do Ceará (UFC), who holds a PhD in Physics from UFC and the University of Antwerp, Belgium, and was a postdoctoral researcher at Columbia University, USA. For any questions about this work, please contact Mr. Delano Lima at delano@miningmetrics.net. For specific questions about the process of converting sentiment classifications from text to numbers, please contact Professor Andrey Chaves at andrey@fisica.ufc.br.
