Will My Followers Tweet? Predicting Twitter Engagement using Machine Learning. David Darmon, Jared Sylvester, Michelle Girvan and William Rand

Abstract: Managers who are first to understand how social media is evolving have an advantage over those who are slower to understand what their followers are doing. Despite the advantage such knowledge would bring, user predictability in social media is not well understood. We use two different machine learning methods to model the behavior of 15,000 users on the basis of their past behavior during a seven-week period. We demonstrate that the behavior of users on Twitter can be well modeled as processes with self-feedback. We also explore how different structural segments of Twitter users behave differently. These insights would enable differential targeting schemes that might increase customer engagement with disparate groups of Twitter followers.

Keywords: social media, Twitter, prediction, machine learning

1. Introduction

Since the earliest days of marketing research, marketers have known that what one consumer says to another has a larger impact on their decision to purchase a product, respond to an advertising slogan, or participate in a call to action than anything that an advertiser can do (Ryan & Gross, 1943). However, for most of the history of marketing it has been difficult, expensive and time-consuming to tap into word-of-mouth communications between consumers. The recent large growth in social media usage presents a unique opportunity for brand managers to examine these conversations. A manager who proactively monitors these conversations can identify key times to launch marketing campaigns, reach out to key influentials, or even launch proactive countermarketing campaigns.
In this paper, we begin research that moves in the direction of predictive social media analytics, i.e., tools that will not only describe current user behavior on social media, but also predict future user behavior. We begin this research project by using the past behavior of users to predict when they will produce content (i.e., tweet) on a major social media platform, namely Twitter. We start by describing the data that we use to examine this question. We then discuss two predictive analytic methods drawn from machine learning, computational mechanics and echo state networks, and explore initial results of attempting to predict when users will tweet by describing the behavioral models of individuals that these methods produce. We also examine the behavioral characteristics of structural segments of Twitter users, and end with a discussion of future directions.

2. Framework and Approaches

In order to predict individual behavior on social media, we adopt a computational agent perspective (DeDeo, 2012). The user receives inputs from their surroundings, combines those inputs in ways dependent on their own internal states, and produces an observed behavior or output. In the context of a microblogging platform such as Twitter, the inputs may be streams from other Twitter users, real-world events, etc., and the observed behavior may be a tweet, mention, or retweet. As a first approximation to the computation performed by a user, we might consider only the user's own past behavior as possible inputs to determine their future behavior. From this perspective, the behavior of the user can be modeled using only the time points when social interactions occurred (Perry & Wolfe, 2010). Such point process models, while very simple, have found success recently in describing social systems (Ver Steeg & Galstyan, 2012; Cho, Galstyan, Brantingham & Tita, 2013).

We propose extending this previous work by explicitly studying the predictive capability of the point process models. That is, given observed behavior for the user, we seek a model that not only captures the dynamics of the user, but also is useful for predicting the future behavior of the user, given their past behavior. The rationale behind this approach is that if we are able to construct models that both reproduce the observed behavior and successfully predict future behavior, then the models capture something about the computational aspects, in the sense outlined above, of the user.

We explore two machine learning frameworks that enable this modeling. The first is the causal state modeling approach, motivated by results from computational mechanics, which assumes that every individual can initially be modeled as a biased coin, and then adds structure as necessary to capture patterns in the data. It does this by expanding the number of states necessary to represent the underlying behavior of the agent. Causal state models have been used successfully in various fields (Haslinger, Klinkner & Shalizi, 2010; Cointet, Faure & Roth, 2007; Padró & Padró, 2005). The second approach we explore is echo state networks, which assume that agent behavior is the result of a complex set of internal states with intricate relationships to the output variables of interest, and then simplify the weights on the relationships between the internal states and the output variables over time (Jaeger, 2001; Ozturk, Xu & Príncipe, 2007). Echo state networks have proven useful in a number of different domains (Jaeger & Haas, 2004; Salmen & Ploger, 2005; Tong, Bickett, Christiansen & Cottrell, 2007).

3. The Data

The data consists of the statuses of 15,000 Twitter users over a 49-day period, of whom 12,043 were active during the time period of our study. In this paper, we discard the actual content of the tweets and instead examine just whether or not an individual tweets in a particular time interval. For most of this paper we consider time intervals of 10 minutes, though we have data at the resolution of 1 second. In addition, the users were filtered to include only the 3,000 most active users over the 49-day period. A base activity measure, which we call the tweet rate, was determined by the proportion of seconds in the 7 AM to 10 PM window in which the user tweeted. Among the top 3,000 users, tweet rates ranged from 0.38 down to 8.5 × 10^-5, and 90% of these users had a tweet rate below 0.05. After this filtering, our dataset consists of 3,000 binary time series of length 57,600 (the number of ten-minute intervals in our dataset).

4. Prediction Results

The 49 days of user activity were partitioned, chronologically, into a 45-day training set and a 4-day testing set. This partition was chosen to account for possible changes in user behavior over time, which would not be captured by using a shuffling of the days. Thus, for each user, the training set consists of 4,320 timepoints, and the testing set consists of 384 timepoints. In all cases, we predict tweet behavior on the out-of-sample test set, counting a prediction as correct when it matches the observed tweet/no-tweet outcome for a ten-minute interval. We compared the accuracy rates of the causal state model and echo state network to a baseline accuracy rate for each user. The baseline predictor is the majority vote of tweet vs. not-tweet behavior over the training days.
In the context of our data, for users that usually tweeted in the training set, the baseline predictor will always predict that the user tweets, and for users that usually did not tweet in the training set, the baseline predictor will always predict the user does not tweet. For any process with memory, as we would expect from most Twitter users, a predictor should be able to outperform this base rate.
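As an illustration of the evaluation protocol described above, the following minimal Python sketch coarsens a per-second activity series into ten-minute windows, splits a series chronologically, and scores the majority-vote baseline. The function names and the toy series are hypothetical, and the value of 96 ten-minute windows per day is an assumption inferred from the stated 4,320 training timepoints over 45 days; this is a sketch of the idea, not the authors' code.

```python
def coarsen(per_second, window=600):
    """Collapse a per-second 0/1 series into one 0/1 flag per window.

    A window is marked 1 if the user tweeted at least once inside it.
    """
    return [int(any(per_second[i:i + window]))
            for i in range(0, len(per_second), window)]


def chronological_split(series, bins_per_day, train_days):
    """Split a series into an early training part and a late testing part."""
    cut = bins_per_day * train_days
    return series[:cut], series[cut:]


def majority_baseline(train):
    """Return 1 if the user tweeted in more than half of the training windows, else 0.

    Ties are resolved as 'no tweet' (an arbitrary choice for this sketch).
    """
    return int(2 * sum(train) > len(train))


def accuracy(predictions, actual):
    """Fraction of windows where the prediction matches the observation."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)


# Toy example: a hypothetical user who tweets in 6 of 96 windows every day.
BINS_PER_DAY = 96  # assumption: 4,320 training timepoints / 45 days
toy_day = [0] * 90 + [1] * 6
series = toy_day * 49

train, test = chronological_split(series, BINS_PER_DAY, train_days=45)
guess = majority_baseline(train)
baseline_accuracy = accuracy([guess] * len(test), test)
print(len(train), len(test), guess, baseline_accuracy)
```

For a user who rarely tweets, the baseline always predicts "no tweet", so its accuracy equals the fraction of silent windows in the test set. Any model with memory has to beat that figure to be useful.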

The comparison between the baseline and the causal state model and echo state network predictors is shown in Figure 1. In both plots, each red point corresponds to the baseline accuracy rate on the testing set for a given user, and the blue point corresponds to the accuracy rate on the testing set using one of the models. Here, the tweet rate is computed in terms of the coarsened time series. That is, the tweet rate is the proportion of ten-minute windows over the 49-day period that contain tweets. Clearly, the model predictions show improvement over the baseline prediction, especially for those users with a tweet rate above 0.2. Overall, the causal state models and the echo state networks both showed improvement, and in some cases drastic improvement, over a baseline predictor. Moreover, for a large proportion of the users, the two methods gave similar predictive results. These predictive analytics can give managers the ability to predict when users are likely to tweet, and as a result they can take that information into account when scheduling the deployment of social media marketing content.

Figure 1: The improvement over the baseline accuracy rate for the causal state model and echo state network. In both plots, each red point corresponds to the baseline accuracy rate for a user, and the connected blue point is the accuracy rate using the causal state model or the echo state network.

Moreover, since different users may be predicted to tweet at different times, and since by inspecting a user's past timeline it is possible to infer what they might tweet about, brand managers can predict when a particular follower or influential who often tweets on a particular topic will tweet. This gives managers the ability to predict when certain topics will be tweeted about. In future work, we also hope to specifically predict when a user will tweet on a particular topic.
This involves moving beyond a binary alphabet to a larger alphabet where different topics are encoded as different symbols. If this proves successful, then managers would have a tool to predict not only when certain users will tweet, but what they will tweet about.

[Figure 1 panels: Accuracy Rate plotted against Tweet Rate, Baseline versus CSM and Baseline versus ESN.]

5. Behavioral Models and Network Segmentation

The causal state model creates behavioral models for individuals that can be interpreted as transitions for each individual user between states of behavior. These states can be interpreted as mapping to real-world states of behavior of the users. For instance, states may map to tweeting from work, tweeting from a mobile device, not tweeting while driving, etc. We can draw these models as state diagrams with transitions between the states being labeled by whether or not the user tweets during that transition. We illustrate four such state diagrams in Figure 2. Out of all the users, 58.8% had inferred causal state models similar to Figure 2(b), where a user has a tweeting state A and a non-tweeting state P.

Figure 2: Typical 1-, 2-, 3-, and 4-state causal state models (panels a to d). Of the 3,000 users, 383 (12.8%), 1,765 (58.8%), 132 (4.4%), and 100 (3.3%) had these numbers of states, respectively.

To investigate if different communities within the 15,000 users have stereotyped behavior, we applied a community detection algorithm to the network and then considered the statistics of various dynamical measures (statistical complexity, entropy rate, etc., described below) by community. To perform the community detection, we used the fast-greedy algorithm of Clauset, Newman, and Moore (2004). For any network, after partitioning the nodes into communities, we define the modularity of the network as the fraction of edges that lie within communities. We consider the normalized modularity, where we normalize by the expected number of edges in a randomized version of the observed network. This procedure gives a hierarchy of possible community structures, and of those structures we choose the one that maximizes the normalized modularity. A maximum always exists since the normalized modularity is 0 for a network with a single community and, for typical networks, grouping communities together beyond a certain iteration decreases the modularity. Using this procedure on the 15K network, 69 communities were detected at a maximum normalized modularity of 0.2435. Of these, the largest four accounted for 98% of the users. The largest community contained 7,520 users. The Twitter account of Om Malik, the account used as the seed in collecting the network, belonged to this community.
The remaining three largest communities contained 6,410, 400, and 409 users.

The causal state model inferred for each user has two complementary metrics associated with it: its statistical complexity and entropy rate. The statistical complexity, loosely, specifies the number of bits into the past of the process that we need to look to optimally predict its future. The entropy rate specifies the inherent unpredictability of the process, due to randomness in its dynamics. These can be computed directly from the inferred causal state model. We can investigate the distribution of these values across the clusters, as shown in Figure 3. We see that most users have statistical complexities between 0 and 1, which corresponds to a causal state model where the user alternates between tweeting and non-tweeting states. However, the statistical complexity distributions exhibit long tails across all of the communities, and these tails differ from cluster to cluster. Similarly, the entropy rates tend to cluster near 0.4, but the tails of the distributions differ by cluster. Thus, we see that the communities are heterogeneous in terms of the dynamics of the users contained within them.
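To make the two metrics concrete, the sketch below computes them for a small two-state model in the spirit of Figure 2(b). It uses the standard computational mechanics definitions: the statistical complexity is the Shannon entropy of the stationary distribution over causal states, and the entropy rate is the stationary average of the per-state emission entropies. The transition structure (tweeting leads to state A, not tweeting leads to state P), the function names, and the probabilities are assumptions chosen for illustration; they are not taken from the data or from the inferred models.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def stationary_distribution(transition, iterations=1000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(transition)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return pi


def two_state_model(p_tweet_active, p_tweet_passive):
    """Hypothetical two-state model: state 0 is A (active), state 1 is P (passive).

    Emitting a tweet (symbol 1) moves the user to A; emitting no tweet
    (symbol 0) moves the user to P. The two states are distinct causal
    states only if the two tweet probabilities differ.
    """
    emit = [p_tweet_active, p_tweet_passive]
    transition = [
        [p_tweet_active, 1.0 - p_tweet_active],
        [p_tweet_passive, 1.0 - p_tweet_passive],
    ]
    return emit, transition


def statistical_complexity(pi):
    """Entropy, in bits, of the stationary distribution over states."""
    return -sum(w * math.log2(w) for w in pi if w > 0.0)


def entropy_rate(pi, emit):
    """Average per-step uncertainty, in bits, of the next symbol."""
    return sum(w * binary_entropy(p) for w, p in zip(pi, emit))


# Illustrative probabilities only.
emit, transition = two_state_model(p_tweet_active=0.6, p_tweet_passive=0.2)
pi = stationary_distribution(transition)
c_mu = statistical_complexity(pi)
h_mu = entropy_rate(pi, emit)
print(f"P(A) = {pi[0]:.3f}, complexity = {c_mu:.3f} bits, entropy rate = {h_mu:.3f} bits")
```

A two-state model can carry at most one bit of statistical complexity, which is consistent with most users falling between 0 and 1. A one-state (Bernoulli) model has a stationary distribution concentrated on a single state and therefore a complexity of 0.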

Figure 3: The distribution of statistical complexities and entropy rates across the four largest clusters, with the remaining 65 clusters grouped as 'Rest'. The panels show Estimated Density against Inferred Statistical Complexity and against Inferred Entropy Rate. For the statistical complexities, users with statistical complexity equal to 0, corresponding to a Bernoulli process (a 'coin flip' process), are treated as point masses and not used to infer the continuous density.

Examining the behavioral complexity of users in different structural communities gives managers the ability to examine how different network segments behave. This gives insight into how they might engage with those different users. At the simplest level, since we do not know when users are actively engaged in Twitter except through their tweeting activity, this gives us the ability to predict when they are most likely to be on Twitter. At a richer level it also gives us the ability to predict when a user is most likely to engage on Twitter, meaning that the user may be induced to tweet about the company if provided with some usable content. Because these models are combined with network segmentation, the manager can target different segments with different content at different times, tailoring both the timing to each segment's particular preferences for being active on Twitter and the content to that segment's interests, in a form of social dayparting.

6. Conclusion and Future Work

In this paper, we have shown that by building representations of the latent states of user behavior we can start to predict their actions on social media. We have done this using two different approaches, which have different ways of capturing the complexity of user behavior. Ultimately, the two methods performed very similarly on a large proportion of the users. It should be noted that this was not expected.
The two methods differ drastically in their modeling paradigm, and the data was quite dynamic, providing plenty of opportunity for differentiation. Our best explanation is that in the end, most users exhibit only a few latent states of behavioral processing, and therefore any model which is able to capture these states will do well at capturing the behavior of users. We have also shown that different clusters of users based on network ties exhibit very different types of behavior, enabling segmentation of users and the possibility of targeting different social media content to different user groups.

One of the biggest weaknesses of the present approach is its failure to incorporate exogenous inputs to a user. That is, we have treated each user as an autonomous unit, and only focused on using their own past behavior to predict their future behavior. In a social context, such as Twitter, it makes more sense to incorporate network effects, and then examine how the behavior of friends and friends of friends directly impacts a user's behavior. For example, the behavior of many of the users, especially those users with a low tweet rate, may become predictable after incorporating the behavior of users in their following network.

We have seen that taking a predictive, model-based approach to exploring user behavior has allowed us to discover typical user profiles that have predictive power on a popular social media platform, that it is possible to segment these users based on network properties, and that these different segments exhibit different behaviors. Such predictions, which take into account social context, could be useful in any number of domains. Brand managers could use these models to understand who will respond to a message that is sent out to a group of users, and potentially even assist in the determination of whether or not a particular piece of content will go viral. Predicting user behavior on social media has the potential to be transformative in terms of both our understanding of human interactions with social media, and the ability of organizations to engage with their audience.

7. References

Y.-S. Cho, A. Galstyan, J. Brantingham, and G. Tita, Latent point process models for spatial-temporal networks, arXiv preprint arXiv:1302.2671, 2013.
A. Clauset, M. Newman, and C. Moore, Finding community structure in very large networks, Physical Review E, vol. 70, 2004.
J.-P. Cointet, E. Faure, and C. Roth, Intertemporal topic correlations in online media, in Proceedings of 1st International Conference on Weblogs & Social Media (ICWSM), 2007.
S. DeDeo, Evidence for non-finite-state computation in a human social system, arXiv preprint arXiv:1212.0018, 2012.
R. Haslinger, K. Klinkner, and C. Shalizi, The computational structure of spike trains, Neural Comp., vol. 22, no. 1, pp. 121-157, 2010.
H. Jaeger, The echo state approach to analysing and training recurrent neural networks, Fraunhofer Institute for Autonomous Intelligent Systems, Technical Report #148, 2001.
H. Jaeger and H. Haas, Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication, Science, vol. 304, no. 5667, pp. 78-80, 2004.
M. C. Ozturk, D. Xu, and J. C. Príncipe, Analysis and design of echo state networks, Neural Computation, vol. 19, no. 1, pp. 111-138, 2007.
M. Padró and L. Padró, A named entity recognition system based on a finite automata acquisition algorithm, Procesamiento del Lenguaje Natural, vol. 35, pp. 319-326, 2005.
P. O. Perry and P. J. Wolfe, Point process modeling for directed interaction networks, arXiv preprint arXiv:1011.1703, 2010.
B. Ryan and N. C. Gross, The diffusion of hybrid seed corn in two Iowa communities, Rural Sociology, vol. 8, no. 1, pp. 15-24, 1943.
M. Salmen and P. G. Ploger, Echo state networks used for motor control, in Proc. IEEE Conf. on Robotics and Automation (ICRA). IEEE, 2005, pp. 1953-1958.
M. H. Tong, A. D. Bickett, E. M. Christiansen, and G. W. Cottrell, Learning grammatical structure with echo state networks, Neural Networks, vol. 20, no. 3, pp. 424-432, 2007.
G. Ver Steeg and A. Galstyan, Information transfer in social media, in Proc. 21st Int'l World Wide Web Conf. ACM, 2012, pp. 509-518.