T-PICE: Twitter Personality based Influential Communities Extraction System


2014 IEEE International Congress on Big Data

Eleanna Kafeza
Business School, Athens University of Economics and Business, Greece

Andreas Kanavos, Christos Makris, Pantelis Vikatos
Computer Engineering and Informatics Department, University of Patras, Greece

Abstract

The identification of influential users in social media communities has recently become a major concern, since these users can contribute to viral marketing campaigns. In our approach we extend the notion of influence from users to networks and consider personality as a key characteristic for identifying influential networks. We describe the Twitter Personality based Influential Communities Extraction (T-PICE) system, which creates the best influential communities in a Twitter network graph by considering users' personality. We expand existing approaches to personality extraction by aggregating data that represent several aspects of user behavior using machine learning techniques. We use an existing modularity-based community detection algorithm and extend it with a pre-processing step that eliminates graph edges based on users' personality. The effectiveness of our approach is demonstrated by sampling the Twitter graph and comparing the influence of the communities created with and without the personality factor. We define several metrics to measure the influence of communities. Our results show that the T-PICE system creates the most influential communities.

Keywords: classification; influential community detection; personality mining; social media analytics

I. INTRODUCTION

In social networking sites, only a fraction of users can influence other users.
Businesses try to identify influential users for propagating communication messages, in most cases by looking at users' static profiles. In our approach, instead of identifying users, we identify communities that demonstrate high activity, and we generate the user profile based on user behavior. We look into complex relationships, considering users' personality in social media networks, so as to identify the communities that conduct information best. Our objective is to extract from social media data the features that represent these complex relationships, which stem from different data origins, and subsequently to use them to identify influential communities.

The importance of considering psychological mechanisms for understanding internet use has already been identified in the literature [16], which supports the view that user personality plays a dominant role in social media communication. In this work, we investigate the role that personality plays in information diffusion. A user's personality can be described by a combination of personality traits that express tendencies to behave in certain ways. There are five basic dimensions of personality that remain stable in individuals, forming the Big Five Model [18].

Our proposed Twitter Personality based Influential Communities Extraction methodology (T-PICE) identifies the networks that have the highest possible communication capability. It extracts a variety of user information from Twitter and creates user profiles as tuples of the extracted, aggregated information. Classification algorithms from the Weka toolkit are used to map these user profiles to personality traits. We train classifiers using feature vectors augmented with a predefined category for each personality trait; the produced models are tested for their performance in order to determine the best classification algorithm for each trait. Hence, each node of the Twitter graph is associated with a 5-tuple that represents the user's personality.
We propose the use of personality traits as an additional parameter for influential community detection. The T-PICE framework uses the method described in [2], which is based on modularity optimization, to identify communities within the Twitter network. We extend the approach of [2] by considering the personality relationship between nodes in a pre-processing step.

Our contributions are the following. Firstly, we extend the existing approaches for extracting personality from users' behavior in social media data. Secondly, we identify the mining algorithms that best fit each personality trait. Thirdly, we extend community detection algorithms by adding a pre-processing step that accounts for users' personality. Furthermore, we propose a unified framework that combines personality mining and community detection to address the problem of identifying influential communities. Our results show that the T-PICE system creates the most influential communities.

The remainder of the paper is organized as follows. Section II overviews related work. The proposed system architecture is described in Section III. Sections IV and V present the modules and sub-modules of our model and the details of the system implementation, respectively. Section VI presents our experimental results, which are discussed in Section VII. Finally, in Section VIII, we present our concluding remarks, open problems and future work.

II. RELATED WORK

The automatic extraction of a user's personality has attracted the interest of researchers in recent years. Computational linguistics and data mining have been used for the automatic recognition of personality based on text. The most widely known model of personality traits is the Big Five [18]. According to the Big Five, human personality is described as a vector of five trait values, as shown in Table I. The combination of the Big Five personality dimensions explains the dynamics of a personality. For example, a person may be very talkative (high Extraversion), not very tolerant and sensitive (low Agreeableness), systematic and punctual (high Conscientiousness), easily anxious (high Neuroticism) and extremely curious (high Openness).

Table I
PERSONALITY TRAITS

Trait | Description
Agreeableness (A) | This personality dimension includes attributes such as affability, tolerance, sensitivity, trust and kindness
Conscientiousness (C) | Common features of this dimension include organization, punctuality, achievement orientation and dependency
Extraversion (E) | This trait describes individuals who are outgoing, talkative and sociable, and who enjoy social situations
Neuroticism (N) | Individuals high in this trait tend to be anxious, irritable, temperamental and moody
Openness (O) | This trait features characteristics such as curiosity, originality, intellectuality, creativity and openness to new ideas

In the existing literature, the problem of automatic recognition of personality traits has been addressed using computational linguistics and, in a limited manner, characteristics of the social network structure. In recent years, supervised learning approaches have been used for extracting personality types.
In [17], the authors first presented a detailed correlation analysis between the Big Five personality traits and the features contained in LIWC [20] and MRC [5]; they then classified the Big Five personality traits using regression and classification models. The authors in [12] tested linguistic features derived from LIWC for predicting personality in a large corpus of blogs, using Support Vector Machines (SVM) as the classification algorithm. In [21], the authors used a combination of decision trees with linear models at the leaves (the M5' algorithm), categorizing High and Low scores in the Big Five traits from Twitter profiles. Prediction of the personality trait scores of Facebook users is addressed in [9], using M5 trees based on linguistic characteristics and social network features. A study on automatically recognizing Big Five personality traits from Facebook status messages is presented in [1], which observes that the sparse MNB (Multinomial Naive Bayes) model performs better than SMO (Support Vector Machines using Sequential Minimal Optimization) and BLR (Bayesian Logistic Regression). Other efforts using unsupervised learning and statistical methods have been introduced in [3] and [4], using an annotated Twitter dataset and Facebook relationships respectively.

Furthermore, some studies address personality trait recognition with datasets that are not derived from social networks. These studies introduce methods for recognizing a blogger's personality [19] or speech-based dialogue systems that recognize a user's personality [1]; datasets in different languages [3] are also covered.

The above literature review indicates that there are many studies on automatic personality identification. However, the results of these studies are not directly comparable because of the different methods and datasets used. Our approach differs from the existing studies.
More specifically, our proposed methodology for personality mining differs from [4], since the latter used data from Facebook rather than Twitter and did not apply any data mining techniques; instead, it used features from the correlation analysis of [17]. Moreover, in [9], the authors apply data mining techniques to small texts such as the "about me" or "Blurb" texts in Facebook accounts. In [3], emotional stability is described without the Big Five personality traits, using an unsupervised learning method. A similar work in [21] uses only structural features, without the linguistic characteristics of users' text. In [1], the authors use Facebook data and introduce a classification model using only the SMO classification algorithm.

Our approach integrates existing methods by applying a variety of data mining techniques [11] that have not been used together in the existing literature, which allows an elaborate comparison to identify the best approach for personality data mining. Furthermore, our model of user profile creation integrates existing approaches and uses both network structure and linguistic aspects; it expands the existing literature by creating a user profile that takes into account several features of network structure and social media metrics that have not been considered before. We sample Twitter and extract the corresponding Twitter network, which is separated into communities using a widely used community detection approach [2], [8]. We extend modularity-based community detection by inserting a pre-processing step that eliminates graph edges based on the personality of the users they connect. This is the first time that such an extensive study has considered Twitter data for personality mining.

III. SYSTEM ARCHITECTURE

In T-PICE, users' personality is extracted based on a variety of elements: the linguistic presence, the user behavior within the network and the way communities are formed based on the network structure. Figure 1 represents a generic model depicting the system architecture for a personality mining system that identifies influential communities. The system is composed of the following modules:

- The social media crawler. The crawler is responsible for sampling and traversing the social media; it also collects information regarding the users' activity as well as the connections based on a given topic.
- The user profile creation. The profile creation takes the social media graph as input and creates a vector that represents the user profile. The analysis is based on both the users' tweets and the network characteristics; here we extract attributes that represent the user's structural position within the social media graph as well as its metrics. These metrics capture user behavior, including the number of tweets, retweets, etc.
- The personality classification module takes the above user profile as input and determines the user's personality based on the Big Five theory. A personality test in the form of a questionnaire is used to train the classifier.
- The communities decomposition module takes into consideration the users' personality and extracts communities using different criteria.
- The influential communities identification module takes the communities as input and determines the influential ones.

IV. PERSONALITY MINING FOR THE IDENTIFICATION OF INFLUENTIAL COMMUNITIES

An influential community is a community that demonstrates a high level of activity, having many tweets or followers. We argue that personality plays an important role when determining influential communities; hence we augment existing approaches in community detection with personality detection. In this section, we present the modules and sub-modules of our model.

A. Social Media Crawler

The social media crawler traverses Twitter and creates a social media graph where nodes are users and edges represent the follow connection between two users. For our experiments, we use a topic-based sampling approach where tweets are collected via a keyword search query. The process creates a sample of the Twitter graph as follows: initially, it retrieves the users who have posted a tweet within the given time period, together with their followers. Subsequently, it connects users that follow each other, or that have a common follower, through that follower. The process for generating the Social Media Graph is presented in Algorithm 1.

Algorithm 1 Generation of Social Media Graph

    input: Query/Keyword #q
    output: The sample graph Users, the list of followers of a user Followers[], the list of followers to be inserted into Users Newnodes
    identify set of tweets for given #q, T = {t_1, t_2, ..., t_i}
    ∀ tweet t_i ∈ T:
        u_i = user of tweet t_i
        Followers[u_i] = followers of u_i
    for each t_i ∈ T do
        Users = Users ∪ u_i
    end for
    identify set of followers of a user u_k, Followers[u_k] = {f_1, f_2, ..., f_j}
    for each u_k ∈ Users do
        for each f_j ∈ Followers[u_k] do
            if f_j ∈ Users then
                link f_j with u_k
            else
                for each u_l ∈ Users and u_l ≠ u_k do
                    if f_j ∈ Followers[u_l] then
                        Newnodes = Newnodes ∪ f_j
                        link f_j with u_k and link f_j with u_l
                    end if
                end for
            end if
        end for
    end for
    Users = Users ∪ Newnodes

B. User Profile Creation

The user profile is determined by the user's behavior in social media. Several aspects describe this behavior, such as the use of words, emotions, frequency of communication, number of friends, etc. Moreover, a user's social relationships play an important role in user profiling, and such relationships can be extracted from the social graph based on users' communication patterns.
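The graph-generation procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it makes no Twitter API calls, and the inputs `tweet_authors` (the users returned by the keyword query) and `followers_of` (a mapping from a user to the set of their followers) are hypothetical stand-ins for the crawled data. Edges are stored as directed (follower, followed) pairs, which is also an assumption.

```python
def build_social_graph(tweet_authors, followers_of):
    """Sketch of Algorithm 1: return (users, edges) for the sampled graph.

    tweet_authors: iterable of users who posted a tweet matching the query.
    followers_of:  dict mapping a user to the set of that user's followers.
    Both inputs are hypothetical; a real crawler would fill them from Twitter.
    """
    users = set(tweet_authors)
    new_nodes = set()
    edges = set()  # directed pairs (follower, followed)

    for u_k in users:
        for f in followers_of.get(u_k, set()):
            if f in users:
                # The follower is itself a tweeting user: link directly.
                edges.add((f, u_k))
            else:
                # Keep an outside follower only if it is shared with
                # another tweeting user, and link it to both of them.
                for u_l in users:
                    if u_l != u_k and f in followers_of.get(u_l, set()):
                        new_nodes.add(f)
                        edges.add((f, u_k))
                        edges.add((f, u_l))

    return users | new_nodes, edges


# Toy data: "x" follows both "a" and "b"; "y" follows only "b".
followers_of = {"a": {"b", "x"}, "b": {"x", "y"}, "c": set()}
users, edges = build_social_graph(["a", "b", "c"], followers_of)
```

In this toy run the shared follower "x" joins the graph, while "y", who follows only one tweeting user, is left out.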
In our work, we extend existing approaches for predicting personality traits by sketching the user profile from heterogeneous information collected from different sources of social media data. We aggregate information based on:

- The linguistic and emotional content of the tweets.
- The user communication behavior.
- The network structure aspects of the user's presence in Twitter.

1) Linguistic and Emotional Analysis: The Linguistic Inquiry and Word Count (LIWC) software measures the cognitive and emotional properties of a person. It is a widely used linguistic analysis tool that parses users' text (tweets in our case) and assigns the words to psychologically meaningful categories. There are 80 such features, which cover linguistic and psychological use of language as well as personal concerns. Hence, each Twitter user is represented as a vector of 80 values that characterize their linguistic and emotional behavior.

Definition: The User Linguistic profile is a tuple of 80 characteristics that represent the user's linguistic presence in Twitter, l(c_1, ..., c_80).

Figure 1. System Architecture

2) Social Media Analytics: Social media analytics can be used to monitor and capture a user's behavior. The followers of a user, the number of contributions to the social network and the frequency of contribution are some of the aspects that differentiate user behavior.

Definition: The User Social Media Analytics profile is a tuple a(y_1, ..., y_6), where each value is extracted as a metric from the user's social media behavior. More precisely, in the case of Twitter, the Twitter analytics profile is a tuple a(y_1, ..., y_6), where y_1 is the number of Followers, y_2 is the number of Direct Tweets, y_3 is the number of Retweets, y_4 is the number of Conversations, y_5 is the Frequency of the user's Tweets and y_6 is the number of Hashtag Keywords, as in [15]. These metrics describe the user's communication behavior in Twitter.

3) Network Information: Each user is represented as a node in the social graph. As such, the user has some structural network characteristics, which are associated with their behavior.

Definition: The User Network Structure profile is a tuple n(z_1, z_2, z_3), where z_1 is the Egocentric Network Density, z_2 is the Betweenness Centrality and z_3 is the Closeness Centrality.

4) User Profile: A user profile is the union of the different user profiles, i.e. the linguistic, analytics and network profiles. By incorporating different aspects of user behavior, we construct a complete user profile that better captures user behavior.

Definition: User Profile UP(x_1, ..., x_n) = l(c_1, ..., c_80) ∪ a(y_1, ..., y_6) ∪ n(z_1, z_2, z_3).

C. Personality Classification

We predict user personality based on the UP vector, using machine learning techniques. A pre-defined label of High or Low for each personality trait is added to the UP vector based on the score derived from the questionnaire, creating a separate dataset for each trait. Subsequently, the five datasets are used for training the classifiers. We employ a variety of classification algorithms to gain a better understanding of which method best suits each personality trait and to identify the best classifier for each trait. The performance is evaluated by the F-Measure metric. The models with the highest F-Measure value for each personality trait are used for the prediction of new test instances.

D. Communities Decomposition

In our approach, we aim to identify the most influential communities in the Twitter graph. There are several algorithms for community detection, among which modularity-based community detection is considered one of the most popular. Existing approaches do not consider node features of the graph as a parameter for community detection. We base our community detection module on modularity optimization and extend the approach presented in [2] by proposing a pre-processing step where graph edges are removed or kept according to the following alternatives:

1) Links between nodes with equal personality traits are removed (EL).
2) Links between nodes of different personality traits are removed (DL).
3) Based on [21], links between nodes that have the same values in agreeableness, extraversion and openness are kept, while the rest are removed (AEO).

After the pre-processing step, the modularity community detection algorithm [2] is used to cut the network into communities.

E. Influential Communities Identification

To identify influential communities, we use the following activity metrics: the number of Tweets, the number of Followers and the Borda Count of tweets and followers. These metrics capture the activity level within each community. Moreover, we define a new combinatorial metric by dividing the selected activity metric by the size of the community. This metric gives insight into the influence of each community by presenting the number of tweets per node or the number of followers per node. We rank communities for each approach (i.e. Blondel, EL, DL and AEO) and compare the results.

V. IMPLEMENTATION

We based our experiments on Twitter and used the Twitter API to collect tweets. We implemented the Twitter graph using Twitter4J and colored our graph according to our methodology. We sampled the Twitter graph by implementing the process of Algorithm 1. We collected tweets published over a time interval of 21 days (06/01/ /01/2014) using the keyword #SocialNetworks. Our Twitter graph consists of 693 nodes.

In order to construct the training set, we conducted a survey of 80 individuals. Each user replied to a questionnaire that determines user personality, as described in [13], [14]. Then, we crawled Twitter to retrieve the relevant information for each of these users and constructed the UP vector. In our implementation, the UP vector consists of 80 linguistic metrics, 6 Twitter analytics metrics and 3 network information metrics, as presented in Table II.
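The three edge-elimination alternatives of Section IV-D (EL, DL and AEO) can be sketched as a simple filter over the edge list. This is an illustrative reading of the text, not the authors' code. It assumes that each user's personality is a 5-character string of High/Low labels in the order A, C, E, N, O, and that "equal personality" means all five labels match. The helper names `keep_edge` and `preprocess` are hypothetical. The filtered edge list would then be passed to a modularity-based community detection algorithm such as that of [2], which is not shown here.

```python
# Positions of the traits in the assumed label order (A, C, E, N, O).
TRAIT_INDEX = {"A": 0, "C": 1, "E": 2, "N": 3, "O": 4}


def keep_edge(p_u, p_v, mode):
    """Decide whether an edge between personalities p_u and p_v survives."""
    if mode == "EL":   # links between equal personalities are removed
        return p_u != p_v
    if mode == "DL":   # links between different personalities are removed
        return p_u == p_v
    if mode == "AEO":  # keep links whose endpoints agree on A, E and O
        return all(p_u[TRAIT_INDEX[t]] == p_v[TRAIT_INDEX[t]] for t in "AEO")
    raise ValueError(f"unknown pre-processing mode: {mode}")


def preprocess(edges, personality, mode):
    """Return the edges kept by the chosen pre-processing alternative."""
    return [(u, v) for u, v in edges
            if keep_edge(personality[u], personality[v], mode)]


# Toy example: H = High, L = Low for each of A, C, E, N, O.
personality = {"u1": "HHLLH", "u2": "HHLLH", "u3": "HLLHH", "u4": "LLLLL"}
edges = [("u1", "u2"), ("u1", "u3"), ("u3", "u4")]

el_edges = preprocess(edges, personality, "EL")
dl_edges = preprocess(edges, personality, "DL")
aeo_edges = preprocess(edges, personality, "AEO")
```

In the toy example, EL drops only the edge between the identical personalities of u1 and u2, DL keeps only that edge, and AEO keeps the two edges whose endpoints agree on agreeableness, extraversion and openness.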
For each user of the dataset, based on the answers to the personality questionnaire, we compute a score for each personality trait. This score is the mean value of the corresponding questions, as described in [13]. In order to train the classifier, we distinguish a High and a Low category for each trait based on a threshold. We determine the threshold for each trait based on previous research [1]. Table III presents the distribution of instances in the High and Low categories for each personality trait. Thus, five datasets are created, one for each personality trait.

We separated each dataset into a training and a test set using two approaches: a) K-Fold Cross-Validation (K = 10 folds) and b) Leave-One-Out Cross-Validation. The rationale for using both techniques is that, when splitting with 10-Fold Cross-Validation, important information can be removed from the training set, whereas the Leave-One-Out Cross-Validation technique evaluates the classification performance based on one sample.

Table II
THE USER PROFILE FEATURE VECTOR

Features | # | Description
LIWC | 80 | 4 general descriptor categories (total word count, words per sentence, percentage of words captured by the dictionary, and percentage of words longer than six letters), 22 standard linguistic dimensions (e.g., percentage of words in the text that are pronouns, articles, auxiliary verbs, etc.), 32 word categories tapping psychological constructs (e.g., affect, cognition, biological processes), 7 personal concern categories (e.g., work, home, leisure activities), 3 paralinguistic dimensions (assents, fillers, nonfluencies), and 12 punctuation categories (periods, commas, etc.)
Twitter Metrics | 6 | Followers, Tweets, Retweets, Conversations, Frequency, Hashtag Keywords
Network | 3 | Egocentric Network Density, Betweenness Centrality, Closeness Centrality

Table III
DISTRIBUTION OF LABELS

Trait | High (%) | Low (%)
Agreeableness (A) | … | …
Conscientiousness (C) | … | …
Extraversion (E) | … | …
Neuroticism (N) | … | …
Openness (O) | … | …
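The labeling and splitting steps described above can be illustrated with a short sketch. The text states that a trait score is the mean of the corresponding questionnaire answers and that a per-trait threshold separates High from Low. The threshold value of 3.5, the treatment of a score equal to the threshold as High, and the helper names below are all assumptions made for illustration. The sketch uses plain Python rather than the Weka toolkit used in the paper.

```python
from statistics import mean


def trait_score(answers):
    """Trait score as the mean of the answers to that trait's questions."""
    return mean(answers)


def label(score, threshold):
    """High/Low category; counting a tie as High is an assumption."""
    return "High" if score >= threshold else "Low"


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (train, test) pairs.

    With k == n this reduces to Leave-One-Out Cross-Validation.
    """
    splits = []
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        splits.append((train, test))
    return splits


# Hypothetical answers on a 1-5 scale and a hypothetical threshold of 3.5.
high_label = label(trait_score([4, 5, 3, 4]), 3.5)
low_label = label(trait_score([2, 3, 2]), 3.5)

# 80 surveyed users, as in the paper: 10-fold and leave-one-out splits.
ten_fold = k_fold_indices(80, 10)
leave_one_out = k_fold_indices(80, 80)
```

With 80 instances, each of the 10 folds holds out 8 users and trains on 72, while leave-one-out produces 80 splits that each hold out a single user.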
The classifiers were chosen from the bayes, functions, lazy, trees and rules categories of the Weka library. Table IV shows the F-Measure results of 10-Fold Cross-Validation for each classifier and each trait. Based on these results, we select the best classifier for each trait. Similarly, Table V shows the results for Leave-One-Out Cross-Validation. For personality traits A, C and E, AdaBoost, BayesNet and JRip are selected as the best classifiers under both approaches. In the case of N, 10-Fold Cross-Validation selects Ridor, while in Leave-One-Out Cross-Validation IBK achieves the best performance. Because the F-Measure is substantially larger in 10-Fold Cross-Validation, we select Ridor as the best classifier. In the case of O, 10-Fold Cross-Validation selects JRip, while in Leave-One-Out Cross-Validation J48 and PART are selected. Again, because the F-Measure of 10-Fold Cross-Validation is substantially larger, we select JRip as the best classifier.

Table IV
10-FOLD CROSS-VALIDATION

Classifiers | A | C | E | N | O
AdaBoost | 0.7 | 0.719 | 0.581 | 0.481 | 0.67
BayesNet | 0.726 | 0.47 | 0.747 | 0.517 | 0.617
IBK | 0.476 | 0.671 | 0.517 | 0.469 | 0.587
J48 | 0.6 | 0.7 | 0.76 | 0.359 | 0.…
JRip | … | … | … | … | …
Multilayer Perceptron | … | … | … | … | …
Naive Bayes Classifier | … | … | … | … | …
PART | … | … | … | … | …
Ridor | … | … | …467 | 0.606 | 0.585
RotationForest | 0.523 | 0.543 | 0.594 | 0.43 | 0.658
SMO | 0.45 | 0.577 | 0.367 | 0.469 | 0.664

Table V
LEAVE-ONE-OUT CROSS-VALIDATION

Classifiers | A | C | E | N | O
AdaBoost | … | … | … | … | …605
BayesNet | 0.726 | 0.426 | 0.803 | 0.457 | 0.544
IBK | 0.426 | 0.671 | 0.452 | 0.506 | 0.587
J48 | 0.65 | 0.579 | 0.758 | 0.469 | 0.65
JRip | 0.724 | 0.435 | 0.556 | 0.428 | 0.648
Multilayer Perceptron | 0.423 | 0.504 | 0.273 | 0.43 | 0.…
Naive Bayes Classifier | … | … | … | … | …
PART | … | … | … | … | …
Ridor | … | … | … | … | …
RotationForest | … | … | … | … | …
SMO | … | … | … | … | …

The classification into a High or Low category for each personality trait creates a tuple of 5 labels for each Twitter user. In other words, there are 2^5 = 32 different combinations that characterize people's personality, and these can be depicted as different colors on the graph's nodes.

VI. RESULTS

In Figures 2, 3 and 4, we present the performance of each of our algorithms in determining the influential communities.

Figure 2. Comparison of Community Detection Algorithms based on the percentage of Followers of the top communities

Figure 3. Comparison of Community Detection Algorithms based on the percentage of Tweets of the top communities

Figure 4. Comparison of Community Detection Algorithms based on the percentage of Borda Count of the top communities

We rank the influence of a community using different metrics for different application scenarios. For example, we use the number of tweets within each community as the ranking metric for applications that require finding influential communities regarding a topic, a specific time period or an event. For the top communities, we compute the percentage of tweets from nodes participating in them, relative to the total number of tweets in the original crawled graph. For applications that are more generic and require an overall estimation of the influence of a community, we determine influence based on the number of followers. In cases where both tweets and followers are of interest, we use the Borda Count of tweets and followers to measure influence. The Borda Count is a single-winner election method, in which voters rank options in order of preference.
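As an illustration of how a Borda Count could combine the two activity metrics, the sketch below treats the ranking by tweets and the ranking by followers as two "voters" over the communities and awards N points for a first place down to 1 point for a last place, following the scoring rule given in this section. The paper does not spell out exactly how the two rankings are combined, so this is one plausible reading. The community names, the numbers and the helper names are hypothetical, and ties are not handled.

```python
def borda_scores(rankings):
    """Sum Borda points over several rankings of the same options.

    Each ranking lists the options best-first; with N options, first place
    earns N points and last place earns 1 point.
    """
    n = len(rankings[0])
    scores = {option: 0 for option in rankings[0]}
    for ranking in rankings:
        for place, option in enumerate(ranking):
            scores[option] += n - place
    return scores


def rank_by(communities, metric):
    """Order community names by a metric, highest value first."""
    return sorted(communities, key=lambda c: communities[c][metric],
                  reverse=True)


# Hypothetical activity counts for three communities.
communities = {
    "c1": {"tweets": 250, "followers": 900},
    "c2": {"tweets": 200, "followers": 400},
    "c3": {"tweets": 50, "followers": 1500},
}

scores = borda_scores([rank_by(communities, "tweets"),
                       rank_by(communities, "followers")])
```

Here c1 ranks first on tweets and second on followers, so it collects 3 + 2 = 5 points and leads the combined ranking.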
Namely, each option gets 1 point for each last-place vote received, 2 points for each next-to-last-place vote, and so on, up to N points for each first-place vote (where N is the number of options).

Since we are interested in identifying the more influential communities and not just the first one, we use the sum of the metrics for the first three communities. Figure 2 presents the percentage of followers for the first three communities as well as the corresponding community sizes. We observe that our proposed methods (EL and DL) significantly increase the number of followers relative to community size in the first three communities, compared to the Blondel and AEO approaches. DL detects the communities with the best percentage of followers.

In Figure 3, we use the percentage of tweets to measure the performance of all methods. We observe that DL performs best regarding the percentage of tweets. Blondel and AEO have the same results, while EL gives the lowest percentage of tweets. When looking at tweets relative to community size, EL is best, followed by DL and Blondel.

In Figure 4, we evaluate all methods using the Borda Count of followers and tweets. In this case, EL achieves by far the best performance, while the other three methods perform almost identically.

The metrics introduced above for measuring the influence of a community do not take into consideration the size of the community. Hence, we introduce a normalized metric based on size (see Table VI). This metric can be used for a variety of applications, especially when cost is associated with the size of the communities. Advertising is one such application, where we look for the smallest communities with the largest impact.

In all cases, the AEO algorithm, which deletes edges whose endpoints differ in at least one of the Agreeableness, Extraversion or Openness traits, gives worse results than EL and DL. Moreover, we conducted a set of experiments with AEO variations where the removed edges connect personalities that differ in three traits, and the results we obtained are similar to those of AEO. Keeping links between personalities that do not differ much creates balanced personality graphs where communities are not influential. This result is consistent with the metric/size metric (Table VI). EL achieves the best results across all metrics.
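The two kinds of influence measurements used in this section, the share of an activity metric held by the top three communities and the size-normalized metric, can be sketched as follows. This is a hedged illustration: the paper does not state how the per-community values are aggregated into the single numbers of Table VI, so the sketch reports the ratio for each community separately. The data and the helper names `top_k_share` and `metric_per_node` are hypothetical.

```python
def top_k_share(communities, metric, total, k=3):
    """Fraction of the graph-wide total held by the k top communities."""
    top = sorted(communities, key=lambda c: c[metric], reverse=True)[:k]
    return sum(c[metric] for c in top) / total


def metric_per_node(community, metric):
    """Size-normalized influence: activity metric divided by community size."""
    return community[metric] / community["size"]


# Hypothetical communities from one decomposition of a crawled graph.
communities = [
    {"size": 10, "tweets": 50, "followers": 400},
    {"size": 20, "tweets": 30, "followers": 900},
    {"size": 5, "tweets": 15, "followers": 100},
    {"size": 8, "tweets": 5, "followers": 100},
]
total_tweets = 100  # tweets in the original crawled graph (assumed value)

share = top_k_share(communities, "tweets", total_tweets)
tweets_per_node = [metric_per_node(c, "tweets") for c in communities[:3]]
```

In this toy data the top three communities account for 95 of the 100 tweets, while the per-node view favors the first community (5 tweets per node) over the larger second one (1.5 tweets per node), which is the kind of distinction the normalized metric is meant to expose.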
The top communities extracted using the different community decomposition approaches differ in the distribution of dissimilar personalities, as shown in Table VII. For each algorithm, the table gives the average percentage of dissimilarity of personalities in the top communities. This metric is computed by counting the number of nodes with different personalities and dividing by the total number of nodes in the top communities. According to Table VII, EL exhibits the greatest diversity in personalities in the resulting influential communities, while DL results in less variation in personalities.

Table VI
NORMALIZED METRIC FOR RATING INFLUENTIAL COMMUNITIES

Communities Decomposition | Tweets / Size | Followers / Size | Borda Count / Size
Simple Blondel | 1,704 | 2,201 | 1,150
EL | 2,065 | 2,794 | 1,467
DL | 1,651 | 2,584 | 1,054
AEO | 1,322 | 1,892 | 1,111

Table VII
AVERAGE OF DISSIMILARITY RATES OF USERS PERSONALITY OF THE TOP COMMUNITIES

Communities Decomposition | Ranking Tweets | Ranking Followers | Ranking Borda Count
Simple Blondel | 58,7% | 66,3% | 66,3%
EL | 69,3% | 70,2% | 67,8%
DL | 54,1% | 61,1% | 54,2%
AEO | 63,4% | 60,2% | 60,2%

VII. DISCUSSION

In our work we use [2] for community detection, an approach based on the modularity criterion. This is a popular technique for community detection. Modularity measures the density of links inside communities compared to links between communities. When the links between similar personalities are discarded (EL) in the pre-processing step, modularity is determined based only on the density of links between users that have different personalities. Hence, more heterogeneous communities are created, and these tend to be more influential. Similarly, when the links between different personalities are deleted, influential homogeneous communities are created. Looking at the extreme cases we observe the following.
If all nodes of the Twitter graph correspond to individuals with the same personality mixture of Big Five traits, the EL approach produces, after the pre-processing step, a graph that contains only isolated nodes. The influence of the communities is then much lower, because each top community consists of a single node. In this case, the influential network degenerates into influential users. The DL approach, on the other hand, keeps the graph as it is, and thus the performance of the influential communities is the same as in Blondel.

If all nodes of the extracted graph have different personalities, which is possible only for graphs with at most 32 nodes, the EL approach keeps the graph as it is, and thus the influence of the communities is the same as in Blondel. The DL approach creates a graph with isolated nodes, and thus the influence of the top communities is reduced.

Our results show that, across all cases and metrics, EL or DL outperforms Blondel, creating the most influential communities, which exhibit either a heterogeneous or a homogeneous personality distribution.

VIII. CONCLUSION - FUTURE WORK

In this work, we looked into the problem of determining influential communities in Twitter. We propose the Twitter Personality based Influential Communities Extraction (T-PICE) methodology, a unified framework that extracts users' personality based on several aspects of user behavior and colors the network graph, using machine learning algorithms, according to the 32 possible personality descriptions defined by the Big Five personality model. Furthermore, we determine the best classification algorithm for each personality trait in order to improve the performance of our system. The influential communities are created by several variations of modularity based community detection, where personality is also considered in a pre-processing step. Finally, the proposed variations are compared with the initial community detection algorithm using metrics that capture the activity level of the top three communities.

The top communities detected by EL (where links between nodes with equal personality traits are removed) and DL (where links between nodes of different personality traits are removed) indicate that personality-heterogeneous as well as personality-homogeneous communities are the more influential ones in creating networks of higher information diffusion. The T-PICE system can serve as a tool for marketing managers or advertisers, helping them identify the influential community and thus better promote their products.

As future work, we are interested in examining the scalability problems that emerge when considering bigger graphs. In addition, we aim to run more experiments using several subjects and to identify, at a finer level of granularity, the parameters that influence the results of our algorithms. Finally, we will investigate the evolution of influential communities over time as well as the impact of other features on the influential community ranking.

REFERENCES

[1] F. Alam, E. A. Stepanov and G. Riccardi, Personality Traits Recognition on Social Network - Facebook, Computational Personality Recognition.
[2] V. D. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre, Fast Unfolding of Community Hierarchies in Large Networks, Journal of Statistical Mechanics: Theory and Experiment, P10008.
[3] F. Celli and L. Rossi, The Role of Emotional Stability in Twitter Conversations, Semantic Analysis in Social Media.
[4] F. Celli and L. Polonio, Relationships between Personality and Interactions in Facebook, Social Networking: Recent Trends, Emerging Issues and Future Outlook.
[5] M. Coltheart, The MRC Psycholinguistic Database, Quarterly Journal of Experimental Psychology, Volume 33A.
[6] C. Dwork, R. Kumar, M. Naor and D. Sivakumar, Rank Aggregation Methods for the Web, World Wide Web Conference (WWW).
[7] M. Farah and D. Vanderpooten, An Outranking Approach for Rank Aggregation in Information Retrieval, Conference on Research and Development in Information Retrieval (SIGIR).
[8] S. Fortunato, Community Detection in Graphs, Physics Reports 486.
[9] J. Golbeck, C. Robles and K. Turner, Predicting Personality with Social Media, Human Factors in Computing Systems (CHI).
[10] L. R. Goldberg, The Development of Markers for the Big Five Factor Structure, Psychological Assessment, Volume 4, Issue 1.
[11] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems.
[12] F. Iacobelli, A. J. Gill, S. Nowson and J. Oberlander, Large Scale Personality Classification of Bloggers, Affective Computing and Intelligent Interaction (ACII).
[13] O. P. John, E. M. Donahue and R. L. Kentle, The Big Five Inventory - Versions 4a and 54, Berkeley: University of California, Institute of Personality and Social Research.
[14] O. P. John and S. Srivastava, The Big Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives, in Handbook of Personality: Theory and Research, 2nd ed., New York: The Guilford Press.
[15] E. Kafeza, A. Kanavos, C. Makris and D. Chiu, Identifying Personality-based Communities in Social Networks, Legal and Social Aspects in Web Modeling (Keynote Speech in LSAWM), in conjunction with the International Conference on Conceptual Modeling (ER).
[16] R. N. Landers and J. W. Lounsbury, An Investigation of Big Five and Narrow Personality Traits in Relation to Internet Usage, Journal of Computers in Human Behavior, Volume 22, Issue 2.
[17] F. Mairesse, M. A. Walker, M. R. Mehl and R. K. Moore, Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text, Journal of Artificial Intelligence Research (JAIR), Volume 30.
[18] R. R. McCrae and O. P. John, An Introduction to the Five-Factor Model and Its Applications, Journal of Personality, Volume 60, Issue 2.
[19] H. Mohtasseb and A. Ahmed, Mining Online Diaries for Blogger Identification, Data Mining and Knowledge Engineering (ICDMKE).
[20] J. W. Pennebaker, M. E. Francis and R. J. Booth, Linguistic Inquiry and Word Count (LIWC): LIWC2001, New Jersey: Lawrence Erlbaum Associates.
[21] D. Quercia, M. Kosinski, D. Stillwell and J. Crowcroft, Our Twitter Profiles, Our Selves: Predicting Personality with Twitter, Social Computing (SocialCom)/Privacy, Security, Risk and Trust (PASSAT).
