T-PICE: Twitter Personality based Influential Communities Extraction System


2014 IEEE International Congress on Big Data

Eleanna Kafeza
Business School, Athens University of Economics and Business, Greece
kafeza@aueb.gr

Andreas Kanavos, Christos Makris, Pantelis Vikatos
Computer Engineering and Informatics Department, University of Patras, Greece
kanavos@ceid.upatras.gr, makri@ceid.upatras.gr, vikatos@ceid.upatras.gr

Abstract

The identification of influential users in social media communities has recently been of major concern, since these users can contribute to viral marketing campaigns. In our approach we extend the notion of influence from users to networks and consider personality as a key characteristic for identifying influential networks. We describe the Twitter Personality based Influential Communities Extraction (T-PICE) system, which creates the most influential communities in a Twitter network graph by considering users' personality. We expand existing approaches to personality extraction by using machine learning techniques to aggregate data that represent several aspects of user behavior. We use an existing modularity based community detection algorithm and extend it with a pre-processing step that eliminates graph edges based on users' personality. The effectiveness of our approach is demonstrated by sampling the Twitter graph and comparing the influence of the created communities with and without considering the personality factor. We define several metrics to measure the influence of communities. Our results show that the T-PICE system creates the most influential communities.

Keywords: classification; influential community detection; personality mining; social media analytics

I. INTRODUCTION

In social networking sites, only a fraction of users can influence other users.
Businesses try to identify influential users for propagating communication messages by looking, in most cases, at users' static profiles. In our approach, instead of identifying users, we identify communities that demonstrate high activity, and we generate the user profile based on user behavior. We look into complex relationships, considering users' personality in social media networks, so as to identify the communities that best conduct information. Our objective is to extract from social media data the appropriate features that represent these complex relationships, which stem from different data origins, and subsequently to use them to identify influential communities.

The importance of considering psychological mechanisms for understanding internet use has already been identified in the literature [16], which supports the view that user personality plays a dominant role in social media communication. In this work, we investigate the role that personality plays in information diffusion. A user's personality can be described by a combination of personality traits that express behavioral tendencies. There are five basic dimensions of personality that remain stable in individuals, forming the Big Five Model [18].

Our proposed Twitter Personality based Influential Communities Extraction methodology (T-PICE) results in the identification of networks that have the highest possible communication capability. It extracts a diversity of user information through Twitter and creates user profiles as tuples of the extracted aggregated information. Classification algorithms from the Weka toolkit are used to map these user profiles to personality traits. We train classifiers using feature vectors augmented with a predefined category for each personality trait; the produced models are tested for their performance in order to determine the best classification algorithm for each trait. Hence, each node of the Twitter graph is associated with a 5-tuple that represents the user's personality.
We propose the use of personality traits as an additional parameter for influential community detection. The T-PICE framework uses the method described in [2], which is based on modularity optimization, to identify communities within the Twitter network. We extend the approach of [2] by considering the personality relationship between nodes in a pre-processing step.

Our contributions are threefold: firstly, we extend the existing approaches for extracting personality from social media data based on user behavior; secondly, we identify the mining algorithms that best fit each personality trait; and finally, we extend community detection algorithms by adding a pre-processing step that accounts for users' personality. Furthermore, we propose a unified framework that combines personality mining and community detection to address the problem of identifying influential communities. Our results show that the T-PICE system creates the most influential communities.

The remainder of the paper is organized as follows. Section II overviews related work. The proposed system architecture is described in Section III. Sections IV and V present the modules and sub-modules of our model and the details of the system implementation, respectively. Section VI presents our experimental results, which we discuss in Section VII. Finally, in Section VIII, we present our concluding remarks, open problems and future work.

978-1-4799-5057-7/14 $31.00 © 2014 IEEE. DOI 10.1109/BigData.Congress.2014.38

II. RELATED WORK

The automatic extraction of a user's personality has gained the interest of scientists in recent years. Computational linguistics and data mining have been used for the automatic recognition of personality based on text. The most widely known model of personality traits is the Big Five [18]. According to the Big Five, human personality is described as a vector of five trait values, as shown in Table I. The combination of the Big Five personality dimensions explains the dynamics of a personality. For example, a person may be very talkative (high Extraversion), not very tolerant and sensitive (low Agreeableness), systematic and punctual (high Conscientiousness), easily anxious (high Neuroticism) and extremely curious (high Openness).

Table I
PERSONALITY TRAITS

Trait: Description
Agreeableness (A): This personality dimension includes attributes such as affability, tolerance, sensitivity, trust and kindness.
Conscientiousness (C): Common features of this dimension include organization, punctuality, achievement-orientation and dependability.
Extraversion (E): This trait includes characteristics such as being outgoing, talkative and sociable, and enjoying social situations.
Neuroticism (N): Individuals high in this trait tend to be anxious, irritable, temperamental and moody.
Openness (O): This trait features characteristics such as curiosity, originality, intellectuality, creativity and openness to new ideas.

In the existing literature, the problem of automatic recognition of personality traits has been addressed using computational linguistics and, in a limited manner, characteristics of the social network structure. In recent years, supervised learning approaches have been used for extracting personality types.
In [17], the authors first presented a detailed correlation analysis between the Big Five personality traits and the features contained in LIWC [20] and MRC [5]; they then classified the Big Five personality traits using regression and classification models. The authors in [12] tested linguistic features derived from LIWC for predicting personality in a large corpus of blogs, using Support Vector Machines (SVM) as the classification algorithm. In [21], the authors used a combination of decision trees with linear models at the leaves (the M5' algorithm) to categorize High and Low scores in the Big Five traits from Twitter profiles. Prediction of the personality trait scores of Facebook users is addressed in [9], using M5 trees based on linguistic characteristics and social network features. A study on automatically recognizing the Big Five personality traits from Facebook status messages is presented in [1]; it observes that the MNB (Multinomial Naive Bayes) sparse model performs better than SMO (Support Vector Machines using Sequential Minimal Optimization) and BLR (Bayesian Logistic Regression). Other efforts using unsupervised learning and statistical methods have been introduced in [3] and [4], using an annotated Twitter dataset and Facebook relationships respectively.

Furthermore, some studies address personality trait recognition with datasets that are not derived from social networks. These studies have introduced methods for recognizing a blogger's personality [19] or a speech based dialogue system that understands a user's personality [1]; datasets from different languages [3] are also present.

The above literature review indicates that there are many studies on automatic personality identification. However, the results of these studies are not directly comparable because of the different methods and datasets used. Our approach differs from the existing studies.
To be more specific, our personality mining methodology differs from [4], which used data from Facebook rather than Twitter and did not apply any data mining techniques; instead, it used features from the correlation analysis of [17]. Moreover, in [9], the authors apply data mining techniques to small texts such as the "about me" or "Blurb" texts in Facebook accounts. In [3], emotional stability is described without the Big Five personality traits, using an unsupervised learning method. A similar work in [21] uses only structural features, without the linguistic characteristics of the users' text. In [1], the authors use Facebook data and introduce a classification model using only the SMO classification algorithm.

Our approach integrates the methods of existing techniques by applying a variety of data mining techniques [11] that have not been used together in the existing literature; this allows an elaborate comparison that identifies the best approach for personality data mining. Furthermore, our model of user profile creation integrates existing approaches and uses both network structure and linguistic aspects; it expands the existing literature by creating a user profile that takes into account several features of network structure and social media metrics that have not been considered before. We sample Twitter and extract the corresponding Twitter network, which is separated into communities using a widely used community detection approach [2], [8]. We extend modularity based community detection by inserting a pre-processing step that eliminates graph edges based on users' personality. To our knowledge, this is the first time that such an extensive study has considered Twitter data for personality mining.

III. SYSTEM ARCHITECTURE

In T-PICE, users' personality is extracted based on a variety of elements: the linguistic presence, the user behavior within the network and the way communities are formed based on the network structure. Figure 1 represents a generic model depicting the system architecture for a personality mining system that identifies influential communities. The system is composed of the following modules:

- The social media crawler. The crawler is responsible for sampling and traversing the social media; it also collects information regarding the users' activity as well as their connections, based on a given topic.
- The user profile creation. The profile creation module takes the social media graph as input and creates a vector that represents the user profile. The profile combines a linguistic analysis of the users' tweets with network characteristics, i.e. attributes that represent the user's structural position within the social media graph, and with social media metrics. These metrics capture user behavior and include the number of tweets, retweets etc.
- The personality classification module takes the above user profile as input and determines the user's personality based on the Big Five theory. A personality test in the form of a questionnaire is used to train the classifier.
- The communities decomposition module takes into consideration the users' personality and extracts communities using different criteria.
- The influential communities identification module takes the communities as input and determines the influential ones.

IV. PERSONALITY MINING FOR THE IDENTIFICATION OF INFLUENTIAL COMMUNITIES

An influential community is a community that demonstrates a high level of activity, having many tweets or followers. We argue that personality plays an important role when determining influential communities; hence we augment existing approaches to community detection with personality detection. In the following, we present the modules and sub-modules of our model.

A. Social Media Crawler

The social media crawler traverses Twitter and creates a social media graph where nodes are users and edges represent the follow connection between two users. For our experiments, we use a topic-based sampling approach where tweets are collected via a keyword search query. The process creates a sample of the Twitter graph as follows: initially it retrieves the users who have posted a tweet within the given time period, together with their followers. Subsequently, it connects users that follow each other or that have a common follower, through that follower. The process for generating the social media graph is presented in Algorithm 1.

Algorithm 1 Generation of Social Media Graph
input: Query/Keyword #q
output: the sample graph Users; the list of followers of a user, Followers[]; the list of followers to be inserted into Users, Newnodes
identify the set of tweets for the given #q, $T = \{t_1, t_2, \ldots, t_i\}$
for every tweet $t_i \in T$:
    $u_i$ = user of tweet $t_i$
    Followers[$u_i$] = followers of $u_i$
for each $t_i \in T$ do
    Users = Users $\cup\ u_i$
end for
identify the set of followers of a user $u_k$, Followers[$u_k$] = $\{f_1, f_2, \ldots, f_j\}$
for each $u_k \in$ Users do
    for each $f_j \in$ Followers[$u_k$] do
        if $f_j \in$ Users then
            link $f_j$ with $u_k$
        else
            for each $u_l \in$ Users and $u_l \neq u_k$ do
                if $f_j \in$ Followers[$u_l$] then
                    Newnodes = Newnodes $\cup\ f_j$
                    link $f_j$ with $u_k$ and link $f_j$ with $u_l$
                end if
            end for
        end if
    end for
end for
Users = Users $\cup$ Newnodes

B. User Profile Creation

The user profile is determined by the user's behavior in social media. Several aspects describe this behavior, such as the use of words, emotions, frequency of communication, number of friends etc. Moreover, a user's social relationships play an important role in user profiling, and such relationships can be extracted from the social graph based on users' communication patterns.
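Algorithm 1 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Twitter4J implementation: the function name `build_social_graph`, its arguments and the toy user names are hypothetical, the follower lists are assumed to have been fetched already, and edges are stored as directed (follower, followed) pairs, which is one possible reading of "link".

```python
def build_social_graph(tweet_authors, followers_of):
    """Sketch of Algorithm 1 (hypothetical helper, not the paper's code).

    tweet_authors: iterable of users who posted a tweet matching the query.
    followers_of:  dict mapping each author to the set of their followers.
    Returns (nodes, edges); edges are (follower, followed) pairs.
    """
    users = set(tweet_authors)   # authors of the matching tweets
    new_nodes = set()            # followers shared by at least two authors
    edges = set()

    for u_k in users:
        for f_j in followers_of.get(u_k, ()):
            if f_j in users:
                # The follower is itself an author: link directly.
                edges.add((f_j, u_k))
            else:
                # Keep an outside follower only if it also follows
                # another author, so it connects two sampled users.
                for u_l in users:
                    if u_l != u_k and f_j in followers_of.get(u_l, ()):
                        new_nodes.add(f_j)
                        edges.add((f_j, u_k))
                        edges.add((f_j, u_l))

    return users | new_nodes, edges


# Toy data (made up for illustration).
authors = ["alice", "bob", "carol"]
followers = {
    "alice": {"bob", "dave"},
    "bob": {"dave", "erin"},
    "carol": set(),
}
nodes, edges = build_social_graph(authors, followers)
print(sorted(nodes))
print(sorted(edges))
```

In this toy input, "dave" follows two authors and is therefore added as a new node, while "erin" follows only one author and is left out of the sample.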
In our work, we extend existing approaches to predicting personality traits by sketching the user profile while processing heterogeneous information collected from different sources of social media data. We aggregate information collected based on:

- The linguistic and emotional content of the tweets.
- The user communication behavior.
- The network structure aspects of the user's presence in Twitter.

1) Linguistic and Emotional Analysis: The Linguistic Inquiry and Word Count (LIWC) software measures the cognitive and emotional properties of a person. It is a widely used linguistic analysis tool that parses users' text (tweets in our case) and assigns the words to psychologically meaningful categories. There are 80 such features, which cover the linguistic and psychological use of language as well as personal concerns. Hence, each Twitter user is represented as a vector of 80 values that characterize their linguistic and emotional behavior.

Definition: The User Linguistic profile is a tuple of 80 characteristics that represent the user's linguistic presence in Twitter, $l(c_1, \ldots, c_{80})$.

Figure 1. System Architecture

2) Social Media Analytics: Social media analytics can be used to monitor and capture a user's behavior. The followers of a user, the number of contributions to the social network and the frequency of contribution are some of the aspects that differentiate user behavior.

Definition: The User Social Media Analytics profile is a tuple $a(y_1, \ldots, y_6)$, where each value is extracted as a metric from the user's social media behavior. More precisely, in the case of Twitter, the Twitter analytics profile is a tuple $a(y_1, \ldots, y_6)$, where $y_1$ is the number of Followers, $y_2$ is the number of Direct Tweets, $y_3$ is the number of Retweets, $y_4$ is the number of Conversations, $y_5$ is the Frequency of the user's Tweets and $y_6$ is the number of Hashtag Keywords, as in [15]. These metrics describe the user's communication behavior in Twitter.

3) Network Information: Each user is represented as a node in the social graph. As such, the user has some structural network characteristics, which are associated with their behavior.

Definition: The User Network Structure profile is a tuple $n(z_1, z_2, z_3)$, where $z_1$ is the Egocentric Network Density, $z_2$ is the Betweenness Centrality and $z_3$ is the Closeness Centrality.

4) User Profile: A user profile is the union of the different user profiles, i.e. the linguistic, analytics and network profiles. By incorporating different aspects of user behavior, we construct a complete user profile that better captures user behavior.

Definition: User Profile $UP(x_1, \ldots, x_n) = l(c_1, \ldots, c_{80}) \cup a(y_1, \ldots, y_6) \cup n(z_1, z_2, z_3)$.

C. Personality Classification

We predict user personality from the UP vector, using machine learning techniques. A pre-defined label of High or Low for each personality trait is added to the UP vector, based on the score derived from the questionnaire, creating a separate dataset for each trait. Subsequently, the five datasets are used for training the classifiers. We employ a variety of classification algorithms to understand which method best suits each personality trait and to identify the best classifier for each trait. The performance is evaluated by the F-Measure metric. The models with the highest F-Measure value for each personality trait are used for the prediction of new test instances.

D. Communities Decomposition

In our approach, we aim to identify the most influential communities in the Twitter graph. There are several algorithms for community detection, among which modularity based community detection is considered one of the most popular. Existing approaches do not consider node features of the graph as a parameter for community detection. We base our community detection module on modularity detection and extend the approach presented in [2] by proposing a pre-processing step where graph edges are removed or kept according to the following alternatives:

1) Links between nodes with equal personality traits are removed (EL).

2) Links between nodes of different personality traits are removed (DL).
3) Based on [21], nodes that have the same values in agreeableness, extraversion and openness are kept, while the rest are removed (AEO).

After the pre-processing step, the modularity community detection algorithm [2] is used to cut the network into communities.

E. Influential Communities Identification

To identify influential communities, we use the following activity metrics: the number of Tweets, the number of Followers and the Borda Count of tweets and followers. These metrics capture the activity level within each community. Moreover, we define a new combinatorial metric by dividing the selected activity metric by the size of the community. This metric gives insight into the influence of each community by presenting the number of tweets per node or the number of followers per node. We rank communities for each approach (i.e. Blondel, EL, DL and AEO) and compare the results.

V. IMPLEMENTATION

We based our experiments on Twitter and used the Twitter API to collect tweets. We implemented the Twitter graph using Twitter4J (footnote 1) and colored the graph according to our methodology. We sampled the Twitter graph by implementing the process of Algorithm 1. We collected tweets published over a time interval of 21 days (06/01/2014-26/01/2014) using the keyword #SocialNetworks. Our Twitter graph consists of 693 nodes.

In order to construct the training set, we conducted a survey of 80 individuals. Each user replied to a questionnaire (footnote 2) that determines user personality, as described in [13], [14]. Then, we crawled Twitter to retrieve the relevant information for each of these users and constructed the UP vector. In our implementation, the UP vector consists of 80 linguistic metrics, 6 Twitter analytics metrics and 3 network information metrics, as presented in Table II.
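The edge pre-processing alternatives of Section IV-D (EL, DL and AEO) could be implemented along the following lines. This is a minimal sketch under stated assumptions: each node carries a 5-tuple of High/Low labels in the order (A, C, E, N, O); "equal personality traits" is read as all five labels being equal; and AEO is read as keeping only edges whose endpoints agree on A, E and O. The function names and toy data are hypothetical. The modularity-based detection of [2] would then run on the filtered edge list and is not shown here.

```python
# Positions of the traits inside the 5-tuple (A, C, E, N, O).
A, C, E, N, O = range(5)


def keep_edge(p_u, p_v, mode):
    """Decide whether an edge survives pre-processing (assumed semantics)."""
    if mode == "EL":
        # Remove links between nodes with equal personality tuples.
        return p_u != p_v
    if mode == "DL":
        # Remove links between nodes with different personality tuples.
        return p_u == p_v
    if mode == "AEO":
        # Keep links whose endpoints agree on A, E and O; remove the rest.
        return all(p_u[t] == p_v[t] for t in (A, E, O))
    raise ValueError(f"unknown mode: {mode}")


def preprocess_edges(edges, personality, mode):
    """Return the edges kept by the chosen alternative."""
    return [
        (u, v) for (u, v) in edges
        if keep_edge(personality[u], personality[v], mode)
    ]


# Toy data (made up): "H" = High, "L" = Low.
personality = {
    "u1": ("H", "L", "H", "L", "H"),
    "u2": ("H", "L", "H", "L", "H"),  # identical to u1
    "u3": ("H", "H", "H", "L", "H"),  # differs from u1 only in C
    "u4": ("L", "L", "L", "L", "L"),  # differs from u1 in A, E and O
}
edges = [("u1", "u2"), ("u1", "u3"), ("u1", "u4")]

for mode in ("EL", "DL", "AEO"):
    print(mode, preprocess_edges(edges, personality, mode))
```

Under these assumptions EL and DL are complementary: every edge is kept by exactly one of them, whereas AEO sits in between by ignoring differences in C and N.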
Table II
THE USER PROFILE FEATURE VECTOR

Features         #   Description
LIWC             80  4 general descriptor categories (total word count, words per sentence, percentage of words captured by the dictionary, and percentage of words longer than six letters), 22 standard linguistic dimensions (e.g., percentage of words in the text that are pronouns, articles, auxiliary verbs, etc.), 32 word categories tapping psychological constructs (e.g., affect, cognition, biological processes), 7 personal concern categories (e.g., work, home, leisure activities), 3 paralinguistic dimensions (assents, fillers, nonfluencies), and 12 punctuation categories (periods, commas, etc.)
Twitter Metrics  6   Followers, Tweets, Retweets, Conversations, Frequency, Hashtag Keywords
Network          3   Egocentric Network Density, Betweenness Centrality, Closeness Centrality

For each user of the dataset, and based on the answers to the personality questionnaire, we compute a score for each personality trait. This score is the mean value of the corresponding questions, as described in [13]. In order to train the classifier, we distinguish a High and a Low category for each trait based on a threshold. We determine the threshold for each trait based on previous research [1]. Table III presents the distribution of instances of the High and Low categories for each personality trait.

Table III
DISTRIBUTION OF LABELS

Trait                  High (%)  Low (%)
Agreeableness (A)      55        45
Conscientiousness (C)  45        55
Extraversion (E)       40        60
Neuroticism (N)        35        65
Openness (O)           75        25

Thus, five datasets are created, one for each personality trait. We separated each dataset into a training and a test set using two approaches: a) K-Fold Cross-Validation (K = 10) and b) Leave-One-Out Cross-Validation. The rationale for using both techniques is that, when splitting with 10-Fold Cross-Validation, important information can be removed from the training set.
However, the Leave-One-Out Cross-Validation technique evaluates the classification performance based on one sample. The classifiers were chosen from the bayes, functions, lazy, trees and rules categories of the Weka library (footnote 3).

Table IV shows the F-Measure results of 10-Fold Cross-Validation for each classifier and each trait. Based on these results, we select the best classifier for each trait, depicted in bold in the table. Similarly, Table V shows the results for Leave-One-Out Cross-Validation. For personality traits A, C and E, AdaBoost, BayesNet and JRip are selected as the best classifiers under both approaches. In the case of N, 10-Fold Cross-Validation selects Ridor, while under Leave-One-Out Cross-Validation IBK achieves the best performance. Because the F-Measure is substantially larger in 10-Fold Cross-Validation, we select Ridor as the best classifier. In the case of O, 10-Fold Cross-Validation selects JRip, while Leave-One-Out Cross-Validation selects J48 and PART. Again, because the F-Measure of 10-Fold Cross-Validation is substantially larger, we select JRip as the best classifier.

Table IV
10-FOLD CROSS-VALIDATION

Classifiers             A      C      E      N      O
AdaBoost                0.7    0.719  0.581  0.481  0.67
BayesNet                0.726  0.47   0.747  0.517  0.617
IBK                     0.476  0.671  0.517  0.469  0.587
J48                     0.6    0.7    0.76   0.359  0.52
JRip                    0.824  0.525  0.517  0.474  0.695
Multilayer Perceptron   0.473  0.504  0.333  0.408  0.679
Naive Bayes Classifier  0.476  0.678  0.46   0.407  0.605
PART                    0.626  0.702  0.669  0.282  0.541
Ridor                   0.624  0.52   0.467  0.606  0.585
RotationForest          0.523  0.543  0.594  0.43   0.658
SMO                     0.45   0.577  0.367  0.469  0.664

Table V
LEAVE-ONE-OUT CROSS-VALIDATION

Classifiers             A      C      E      N      O
AdaBoost                0.62   0.726  0.581  0.307  0.605
BayesNet                0.726  0.426  0.803  0.457  0.544
IBK                     0.426  0.671  0.452  0.506  0.587
J48                     0.65   0.579  0.758  0.469  0.65
JRip                    0.724  0.435  0.556  0.428  0.648
Multilayer Perceptron   0.423  0.504  0.273  0.43   0.561
Naive Bayes Classifier  0.476  0.645  0.43   0.344  0.61
PART                    0.65   0.726  0.664  0.452  0.65
Ridor                   0.65   0.47   0.452  0.343  0.64
RotationForest          0.597  0.629  0.493  0.407  0.601
SMO                     0.423  0.55   0.367  0.407  0.561

Figure 2. Comparison of Community Detection Algorithms based on the percentage of Followers of the top communities

Figure 3. Comparison of Community Detection Algorithms based on the percentage of Tweets of the top communities

The classification into the High and Low category for each personality trait creates a tuple of 5 labels for each Twitter user. In other words, there are $2^5 = 32$ different combinations that characterize people's personality, and these can be depicted as different colors of the graph's nodes.

Footnotes:
1. Twitter4J API: http://twitter4j.org/en/index.html
2. http://tinyurl.com/bigfiveinventory
3. Weka toolkit: http://www.cs.waikato.ac.nz/ml/weka/

VI. RESULTS

In Figures 2, 3 and 4, we present the performance of each of our algorithms in determining the influential communities. We rank the influence of a community using different metrics for different application scenarios. For example, we use the number of tweets within each community as the ranking metric for applications that require finding influential communities regarding a topic, a specific time period or an event.
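As an aside on the classifier selection of Section V, the rule applied to Neuroticism (N) can be sketched as follows. The F-Measure values are copied from the N columns of Tables IV and V; the function names are hypothetical, and the tie-breaking rule (when the two protocols disagree, prefer the classifier whose protocol yields the larger F-Measure) is our reading of the text rather than a rule the paper states in general form.

```python
# F-Measure for trait N, taken from Table IV (10-fold) and Table V (LOO).
F_10FOLD_N = {
    "AdaBoost": 0.481, "BayesNet": 0.517, "IBK": 0.469, "J48": 0.359,
    "JRip": 0.474, "Multilayer Perceptron": 0.408,
    "Naive Bayes Classifier": 0.407, "PART": 0.282, "Ridor": 0.606,
    "RotationForest": 0.43, "SMO": 0.469,
}
F_LOO_N = {
    "AdaBoost": 0.307, "BayesNet": 0.457, "IBK": 0.506, "J48": 0.469,
    "JRip": 0.428, "Multilayer Perceptron": 0.43,
    "Naive Bayes Classifier": 0.344, "PART": 0.452, "Ridor": 0.343,
    "RotationForest": 0.407, "SMO": 0.407,
}


def best_classifier(scores):
    """Name of the classifier with the highest F-Measure."""
    return max(scores, key=scores.get)


def select_classifier(cv10, loo):
    """Pick one classifier from two evaluation protocols (assumed rule)."""
    best_cv, best_loo = best_classifier(cv10), best_classifier(loo)
    if best_cv == best_loo:
        return best_cv
    # Protocols disagree: keep the winner with the larger F-Measure.
    return best_cv if cv10[best_cv] >= loo[best_loo] else best_loo


print(best_classifier(F_10FOLD_N))              # Ridor
print(best_classifier(F_LOO_N))                 # IBK
print(select_classifier(F_10FOLD_N, F_LOO_N))   # Ridor
```

For N this reproduces the choice reported in Section V: Ridor (0.606 under 10-fold) is preferred over IBK (0.506 under leave-one-out).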
For the top communities, we compute the percentage of tweets posted by their member nodes relative to the total number of tweets in the original crawled graph. For more generic applications that require an overall estimate of a community's influence, we determine influence based on the number of followers. When both tweets and followers are of interest, we use the Borda Count of tweets and followers to measure influence. The Borda Count is a single-winner election method in which voters rank options in order of preference. Each option receives 1 point for each last-place vote, 2 points for each next-to-last-place vote, and so on up to N points for each first-place vote, where N is the number of options. Since we are interested in identifying the more influential communities and not just the single most influential one, we use the sum of each metric over the first three communities.

Figure 4. Comparison of Community Detection Algorithms based on the percentage of Borda Count of the top communities

Figure 2 presents the percentage of followers

for the first three communities as well as the corresponding community sizes. We observe that our proposed methods (EL and DL) significantly increase the number of followers relative to community size in the first three communities, compared with the Blondel and AEO approaches. DL detects the communities with the best percentage of followers. In Figure 3, we use the percentage of tweets to measure the performance of all methods. DL performs best on this metric. Blondel and AEO give the same results, while EL gives the lowest percentage of tweets. Relative to community size, however, EL performs best, followed by DL and Blondel. In Figure 4, we evaluate all methods using the Borda Count of followers and tweets. In this case, EL clearly achieves the best performance, while the other three methods perform almost identically.

The metrics introduced so far for measuring the influence of a community do not take the size of the community into account. Hence, we introduce a metric normalized by size (see Table VI). This metric can be used in a variety of applications, especially when a cost is associated with the size of the communities. Advertising is one such application, where we look for the smallest communities with the largest impact. In all cases the AEO algorithm, which deletes edges whose endpoints differ in at least one of the Agreeableness, Extraversion or Openness traits, gives worse results than EL and DL. Moreover, we conducted a set of experiments on AEO variations in which the removed edges connect personalities that differ in three traits, and the results we obtained are similar to those of AEO. Keeping only links between users whose personalities do not differ much creates balanced personality graphs in which communities are not influential. This result is consistent with the size-normalized metrics (Table VI). EL achieves the best results across all metrics.
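The ranking metrics described above can be illustrated with a minimal sketch. It assumes that the two "voters" of the Borda Count are the ranking by tweets and the ranking by followers; the source does not spell out how the two are combined, so this is one plausible reading. The community ids, counts and sizes below are hypothetical and are not taken from our experiments.

```python
from typing import Dict, List


def borda_scores(rankings: List[List[str]]) -> Dict[str, int]:
    """Sum Borda points over several rankings.

    Each ranking lists options from most to least preferred. With N
    options, first place earns N points and last place earns 1 point.
    """
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] = scores.get(option, 0) + (n - position)
    return scores


def rank_by(metric: Dict[str, float]) -> List[str]:
    """Order community ids by a metric, largest value first."""
    return sorted(metric, key=lambda c: metric[c], reverse=True)


# Hypothetical per-community totals (illustrative only).
tweets = {"c1": 500, "c2": 300, "c3": 900, "c4": 100}
followers = {"c1": 7000, "c2": 4000, "c3": 2000, "c4": 1000}
sizes = {"c1": 50, "c2": 20, "c3": 60, "c4": 10}

# Assumption: one vote from the tweet ranking, one from the follower ranking.
borda = borda_scores([rank_by(tweets), rank_by(followers)])

# Keep the three most influential communities, then normalize by size.
top_three = rank_by(borda)[:3]
per_size = {c: borda[c] / sizes[c] for c in top_three}
```

In this toy input the size-normalized score favors the small community c2 even though c1 has the highest raw Borda Count, which is the kind of trade-off the normalized metric is meant to expose.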
The top communities extracted by the different community decomposition approaches differ in how diverse their members' personalities are, as shown in Table VII. For each algorithm, the table reports the average percentage of personality dissimilarity in the top communities. This metric is computed by counting the number of nodes with different personalities and dividing by the total number of nodes in the top communities. According to Table VII, EL exhibits the greatest diversity in personalities in the resulting influential communities, while DL results in less variation in personalities.

Table VI
NORMALIZED METRIC FOR RATING INFLUENTIAL COMMUNITIES

Communities Decomposition    Tweets / Size    Followers / Size    Borda Count / Size
Simple Blondel               1,704            2,201               1,150
EL                           2,065            2,794               1,467
DL                           1,651            2,584               1,054
AEO                          1,322            1,892               1,111

Table VII
AVERAGE OF DISSIMILARITY RATES OF USERS' PERSONALITY OF THE TOP COMMUNITIES

Communities Decomposition    Ranking Tweets    Ranking Followers    Ranking Borda Count
Simple Blondel               58,7%             66,3%                66,3%
EL                           69,3%             70,2%                67,8%
DL                           54,1%             61,1%                54,2%
AEO                          63,4%             60,2%                60,2%

VII. DISCUSSION

In our work we use [2] for community detection, an approach based on the modularity criterion. This is a popular technique for community detection. Modularity measures the density of links inside communities as compared to links between communities. When the links between similar personalities are discarded (EL) in the pre-processing step, modularity is determined only by the density of links between users that have different personalities. Hence, more heterogeneous communities are created, and these tend to be more influential. Similarly, when the links between different personalities are deleted (DL), influential homogeneous communities are created.
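The pre-processing step discussed above can be sketched as a simple edge filter. This is an illustrative sketch and not our implementation: the function name `filter_edges`, the `"H"`/`"L"` label encoding and the toy users are assumptions. EL drops links between users with equal personality tuples and DL drops links between users with different tuples; the modularity-based community detection of [2] would then run on the remaining edges and is not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: one High ("H") or Low ("L") label per trait A, C, E, N, O.
Personality = Tuple[str, str, str, str, str]
Edge = Tuple[str, str]


def filter_edges(
    edges: List[Edge],
    personality: Dict[str, Personality],
    mode: str,
) -> List[Edge]:
    """Return the edges that survive the personality-based pre-processing.

    mode "EL": remove links between nodes with equal personalities.
    mode "DL": remove links between nodes with different personalities.
    """
    if mode not in ("EL", "DL"):
        raise ValueError("mode must be 'EL' or 'DL'")
    kept: List[Edge] = []
    for u, v in edges:
        same = personality[u] == personality[v]
        if mode == "EL" and not same:
            kept.append((u, v))
        elif mode == "DL" and same:
            kept.append((u, v))
    return kept


# Toy graph (illustrative only): ann and bob share a personality tuple.
personality = {
    "ann": ("H", "L", "H", "L", "H"),
    "bob": ("H", "L", "H", "L", "H"),
    "cat": ("L", "L", "H", "L", "H"),
}
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat")]

el_edges = filter_edges(edges, personality, "EL")
dl_edges = filter_edges(edges, personality, "DL")
```

With this input, EL keeps only the two links that touch cat, and DL keeps only the ann-bob link, so the two variants hand complementary edge sets to the community detection step.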
At one extreme, all nodes of the Twitter graph correspond to individuals with the same mixture of Big Five traits. After the pre-processing step, the EL approach then yields a graph that contains only isolated nodes, so the influence of the communities is much lower because each top community consists of a single node. In this case, the influential network degenerates into influential users. The DL approach, on the other hand, keeps the graph as it is, and thus the performance of the influential communities is the same as in Blondel. At the other extreme, all nodes of the extracted graph have different personalities, which is possible only for graphs with at most 32 nodes. Here the EL approach keeps the graph as it is, and thus the influence of the communities is the same as in Blondel, while the DL approach creates a graph of isolated nodes and thus reduces the influence of the top communities.

Our results show that across all cases and metrics, EL or DL outperforms Blondel, creating the most influential communities, which exhibit either a heterogeneous or a homogeneous personality distribution.

VIII. CONCLUSION - FUTURE WORK

In this work, we looked into the problem of determining influential communities in Twitter. We propose the Influential Communities Extraction methodology (T-PICE), a unified framework that extracts users' personality based on several aspects of user behavior and colors the network graph using machine learning algorithms according to the

32 possible personality descriptions as defined by the Big Five personality model. Furthermore, we determine the best classification algorithm for each personality trait in order to improve the performance of our system. The influential communities are created with several variations of modularity-based community detection, in which personality is also considered in a pre-processing step. Finally, we compare the proposed variations with the original community detection algorithm using metrics that measure the activity level of the top three communities. The top communities detected by EL (where links between nodes with equal personality traits are removed) and DL (where links between nodes with different personality traits are removed) indicate that personality-heterogeneous as well as personality-homogeneous communities are the more influential ones in creating networks of higher information diffusion. The T-PICE system can serve marketing managers or advertisers as a tool for identifying the influential communities and thus promoting their products more effectively.

As future work, we are interested in examining the scalability problems that emerge with bigger graphs. In addition, we aim to run more experiments with several subjects and to identify, at a finer level of granularity, the parameters that influence the results of our algorithms. Finally, we will investigate the evolution of influential communities over time as well as the impact of other features on the influential community ranking.

REFERENCES

[1] F. Alam, E. A. Stepanov and G. Riccardi, Personality Traits Recognition on Social Network - Facebook, Computational Personality Recognition, 2013.
[2] V. D. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre, Fast Unfolding of Community Hierarchies in Large Networks, Journal of Statistical Mechanics: Theory and Experiment, P10008, 2008.
[3] F. Celli and L.
Rossi, The Role of Emotional Stability in Twitter Conversations, Semantic Analysis in Social Media, pp. 10-17, 2012.
[4] F. Celli and L. Polonio, Relationships between Personality and Interactions in Facebook, Social Networking: Recent Trends, Emerging Issues and Future Outlook, pp. 41-53, 2013.
[5] M. Coltheart, The MRC Psycholinguistic Database, Quarterly Journal of Experimental Psychology, Volume 33A, pp. 497-505, 1981.
[6] C. Dwork, R. Kumar, M. Naor and D. Sivakumar, Rank Aggregation Methods for the Web, World Wide Web Conference (WWW), pp. 613-622, 2001.
[7] M. Farah and D. Vanderpooten, An Outranking Approach for Rank Aggregation in Information Retrieval, Conference on Research and Development in Information Retrieval (SIGIR), pp. 591-598, 2007.
[8] S. Fortunato, Community Detection in Graphs, Physics Reports 486, pp. 75-174, 2010.
[9] J. Golbeck, C. Robles and K. Turner, Predicting Personality with Social Media, Human Factors in Computing Systems (CHI), pp. 253-262, 2011.
[10] L. R. Goldberg, The Development of Markers for the Big Five Factor Structure, Psychological Assessment, Volume 4, Issue 1, pp. 26-42, 1992.
[11] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems, 2011.
[12] F. Iacobelli, A. J. Gill, S. Nowson and J. Oberlander, Large Scale Personality Classification of Bloggers, Affective Computing and Intelligent Interaction (ACII), pp. 568-577, 2011.
[13] O. P. John, E. M. Donahue and R. L. Kentle, The Big Five Inventory - Versions 4a and 54, Berkeley: University of California, Institute of Personality and Social Research, 1991.
[14] O. P. John and S. Srivastava, The Big Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives, in Handbook of Personality: Theory and Research, 2nd ed., pp. 102-138, New York: The Guilford Press, 1999.
[15] E. Kafeza, A. Kanavos, C. Makris and D.
Chiu, Identifying Personality-based Communities in Social Networks, Legal and Social Aspects in Web Modeling (Keynote Speech in LSAWM), in conjunction with the International Conference on Conceptual Modeling (ER), 2013.
[16] R. N. Landers and J. W. Lounsbury, An Investigation of Big Five and Narrow Personality Traits in Relation to Internet Usage, Journal of Computers in Human Behavior, Volume 22, Issue 2, pp. 283-293, 2006.
[17] F. Mairesse, M. A. Walker, M. R. Mehl and R. K. Moore, Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text, Journal of Artificial Intelligence Research (JAIR), Volume 30, pp. 457-500, 2007.
[18] R. R. McCrae and O. P. John, An Introduction to the Five-Factor Model and Its Applications, Journal of Personality, Volume 60, Issue 2, pp. 175-215, 1992.
[19] H. Mohtasseb and A. Ahmed, Mining Online Diaries for Blogger Identification, Data Mining and Knowledge Engineering (ICDMKE), pp. 295-302, 2009.
[20] J. W. Pennebaker, M. E. Francis and R. J. Booth, Linguistic Inquiry and Word Count (LIWC): LIWC2001, New Jersey: Lawrence Erlbaum Associates, 2001.
[21] D. Quercia, M. Kosinski, D. Stillwell and J. Crowcroft, Our Twitter Profiles, Our Selves: Predicting Personality with Twitter, Social Computing (SocialCom)/Privacy, Security, Risk and Trust (PASSAT), pp. 180-185, 2011.