The Science of Social Media. Kristina Lerman USC Information Sciences Institute

Similar documents
A Meme is not a Virus:

The Paradoxes of Social Data:

On Burst Detection and Prediction in Retweeting Sequence

TwitterRank: Finding Topicsensitive Influential Twitterers

Measuring message propagation and social influence on Twitter.com

Targeting Individuals to Catalyze Collective Action in Social Networks

Using Link Analysis to Discover Interesting Messages Spread Across Twitter

Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach

Learning Influence from Heterogeneous Social Networks

Influence Maximization-based Event Organization on Social Networks

A Propagation-based Algorithm for Inferring Gene-Disease Associations

Information vs Interaction: An Alternative User Ranking Model for Social Networks

Will My Followers Tweet? Predicting Twitter Engagement using Machine Learning. David Darmon, Jared Sylvester, Michelle Girvan and William Rand

The Anatomy of Large Facebook Cascades

Advancing Influencer and Sentiment Analysis on Social Networks:

Reaction Paper Regarding the Flow of Influence and Social Meaning Across Social Media Networks

Predicting the Odds of Getting Retweeted

Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks

arxiv: v2 [cs.si] 15 Nov 2011

Leveraging the Social Breadcrumbs

Learning to Blend Vitality Rankings from Heterogeneous Social Networks

Genetic Algorithms in Matrix Representation and Its Application in Synthetic Data

But, there is a difference in the diffusion of Innovations/Information Infections People

Influence of Network Structure on Market Share in Complex Market Structures

Computational models of technology adoption at the workplace

Influencer Communities. Influencer Communities. Influencers are having many different conversations

PROXIMITY, INTERACTIONS, AND COMMUNITIES IN SOCIAL NETWORKS: PROPERTIES AND APPLICATIONS.

Friendship Paradox Redux: Your Friends Are More Interesting Than You

Advanced Topics in Computer Science 2. Networks, Crowds and Markets. Fall,

Topical Influence on Twitter: A Feature Construction Approach

IBM SPSS Modeler Personal

Identifying Decision Makers from Professional Social Networks

DETECTING COMMUNITIES BY SENTIMENT ANALYSIS

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement

Predicting Yelp Ratings From Business and User Characteristics

What is Evolutionary Computation? Genetic Algorithms. Components of Evolutionary Computing. The Argument. When changes occur...

Automatic Detection of Rumor on Social Network

Predicting user rating for Yelp businesses leveraging user similarity

Social Recommendation: A Review

You are Who You Know and How You Behave: Attribute Inference Attacks via Users Social Friends and Behaviors

Who Will Retweet This? Detecting Strangers from Twitter to Retweet Information

Using Predictive Analytics to Detect Contract Fraud, Waste, and Abuse Case Study from U.S. Postal Service OIG

Tweeting Questions in Academic Conferences: Seeking or Promoting Information?

Is SharePoint 2016 right for your organization?

Intermediate Google AdWords Instructors: Alice Kassinger

Simulation-Based Analysis and Optimisation of Planning Policies over the Product Life Cycle within the Entire Supply Chain

ERP Reference models and module implementation

Mining Misinformation in Social Media

Evaluating Twitter Influence Ranking with System Theory

Cascades: A View from Audience

Marketing & Big Data

Putting abortion opinions into place: a spatial-temporal analysis of Twitter data

Random Walks on the Reputation Graph

Ph.D. Defense: Resource Allocation Optimization in the Smart Grid and High-performance Computing Tim Hansen

Estimating the Impact of User Personality Traits on electronic Word-of-Mouth Text-mining Social Media Platforms

A Dynamic Trust Network for Autonomy-Oriented Partner Finding

Bidding for Sponsored Link Advertisements at Internet

BloomReach Commerce Search Datasheet

Identifying Peer Influence in Massive Online Social Networks: A Platform for Randomized Experimentation on Facebook

Berkeley Data Analytics Stack (BDAS) Overview

Viral products and ideas are intuitively understood to grow through a person-to-person diffusion process

Machine Learning Choices

e-trans Association Rules for e-banking Transactions

Seminar Data & Web Mining. Christian Groß Timo Philipp

Threshold model of diffusion: An agent based simulation and a social network approach

K.R.I.O.S. Dr. Vasilis Aggelis PIRAEUS BANK SA (WINBANK) Syggrou Ave., 87 ATHENS Tel.: Fax:

TRank: ranking Twitter users according to specific topics

Data Mining in Social Network. Presenter: Keren Ye

Assessing the Potential of Ride- Sharing Using Mobile and Social Data

Social Media Is More Than a Popularity Contest

Time Series Motif Discovery

Event-based Social Networks: Linking the Online and Offline Social Worlds

Association rules model of e-banking services

ENERGY SAVING INFORMATION CASCADES IN ONLINE SOCIAL NETWORKS: AN AGENT-BASED SIMULATION STUDY. Qi Wang John E. Taylor

WebShipCost - Quantifying Risk in Intermodal Transportation

Individual Behavior and Social Influence in Online Social Systems

Twitter in Academic Conferences: Usage, Networking and Participation over Time

How Online Content is Received by Users in Social Media: A Case Study on Facebook.com Posts

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 9, March 2014

Prediction of Personalized Rating by Combining Bandwagon Effect and Social Group Opinion: using Hadoop-Spark Framework

Predictive Analytics Using Support Vector Machine

Retweets but Not Just Retweets: Quantifying and Predicting Influence on Twitter

Netflix Optimization: A Confluence of Metrics, Algorithms, and Experimentation. CIKM 2013, UEO Workshop Caitlin Smallwood

Downloaded From: on 10/26/2016 Terms of Use:

Who s Here Today? B2B Social Media: Why?

Make the Jump from Business User to Data Analyst in SAS Visual Analytics

Bridge-Driven Search in Social Internetworking Scenarios

Data Mining of the Concept «End of the World» in Twitter Microblogs

Yelp Recommendation System Using Advanced Collaborative Filtering

Software Assurance Marketplace Use Case

When Politicians Tweet: A Study on the Members of the German Federal Diet

Analysis of User s Relation and Reading Activity in Weblogs

Agent-based model of information spread in social networks. Introduction

How to Set-Up a Basic Twitter Page

On the Measurement and Prediction of Web Content Utility: A Review

An Empirical Study of Valuation and User Behavior in Social Networking Services

Using Twitter to Predict Voting Behavior

Data Science Services from GE Digital. Unlock new business value from your industrial enterprise data

Workforce Evolution & Optimization Modeling and Optimization for smarter long-term workforce planning

Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling

Transcription:

The Science of Social Media Kristina Lerman USC Information Sciences Institute ML meetup, July 2011

What is a science? Explain observed phenomena Make verifiable predictions Help engineer systems with desired behavior predict Design cycle User interface Collective behavior design

Why now? Data driven network science Availability of large scale scale, time resolved data from social media sites allows us to ask new questions about social behavior How does social behavior arise from individual interactions? How far and how fast does information/behavior spread on networks? How do we measure network structure, find communities and influential people? How does the network structure affect information flow? What does the flow of information tell us about its quality?

Alternative approaches to the science of social media Data mining (CS) approach Apply statistical regression to identify correlated sets of features in large data sets Predict outcomes for new cases based on features Cannot identify causal mechanisms Experimental approach Controlled experiments to uncover causal mechanisms Not practical for researchers need access to large pool of study participants Empirical (physics based) approach Empirically grounded framework for discovering mechanistic models Leverage statistical physics methodology to study social behavior

Roadmap Use physics based approach to study social media 1. Mathematical analysis of social dynamics How do user actions (e.g., specified by the UI) give rise to collective behavior? Mathematical models relate microscopic actions to collective dynamics Explain and predict collective behavior 2. Dynamics of social contagion How far does information spread on a network? Empirical analysis and simulations to study how the microscopic mechanism affects macroscopic properties of spread Fundamental differences between disease and information spread 3. Beyond PageRank How to find important nodes in a network? The role of diffusion in network analysis

Social News: Digg Users submit or vote for (digg) news stories Online social networks Users follow friends to see Stories friends submit Stories friends vote for Shown in Friends Activity Trending stories Digg promotes most popular stories to its Popular page Data set* Votes on ~3K popular stories Follower graph w. ~280K users *http://www.isi.edu/~lerman/downloads/digg2009.html

Microblogging: Twitter Users tweet short text posts Retweet posts of others Tweets may contain URLs Online social networks Users follow friends to see Tweets by friends Retweets by friends Trending topics Twitter analyzes activity to identify popular trends Data set Tweets mention ~70K URLs Follower graph w. ~700K users

Mathematics of social dynamics: Modeling collective behavior in social media

User interface Collective behavior (diggs)

User as a stochastic Markov process browse front page navigate read interesting? 1 2 1 2 1 2 browse friends friends read friends interesting? 1 2 1 2 browse upcoming upcoming read upcoming interesting? 1 2 1 2

User as a stochastic Markov process browse front page navigate read interesting? submitter fans 1 2 1 2 1 2 other fans browse friends friends read friends interesting? 1 2 1 2 nonfans browse upcoming upcoming read upcoming interesting? 1 2 1 2

individual behavior n 2 collective behavior <n 2 > ØØØ Ø n 1 up n 4 Ø <n 1 > <n 4 > n 3 <n 3 > stochastic modeling in one slide <n k > time model solutions d n1 a n2 b n dt d n2 dt mathematical model 1

Model solutions give dynamics of collective behavior Fully calibrated model Interestingness (r S, r F, r N ) the only adjustable parameters. Max likelihood estimate (MLE) to find best r values that fit observed collective behavior Story 1 Non fans votes Other fans Submitter s fans time (hours)

Popular submitter advantage promoted not promoted promotion threshold [2006 data] Less interesting (lower r) stories submitted by popular users (many followers) will be promoted to the front pages

Predict popularity Estimate parameters (r S, r F, r N ) based on early voting history Then solve model for later times to predict future dynamics (popularity) Story 2 Non fans votes Other fans Submitter s fans prediction time t time (hours)

Confidence of prediction Likelihood surface indicates how well data constrains r values Compute 95% confidence bounds on model predictions

Confidence correlated with prediction error Predict at promotion time for ~90 stories Compute error 24hrs after promotion Error=predicted votes - actual votes

Dynamics of social contagion: What stops social epidemics?

Information spread on network A cascade is a sequence of activations generated by a contagion process, in which nodes cause their neighbors to be activated with some probability (transmissibility) Underlying network Two cascades spreading on the network* 1 2 3 7 4 6 5 *Nodes are labeled in the temporal order of activation

Cf Spread of disease An epidemic is a contagion process that spreads to a fraction of all the nodes 1 Fraction of nodes infected 0 Epidemic threshold predicted for many cascade models Transmissibility, λ

Size of cascades on social media Digg (~280K users) Twitter (~700K users) number of cascades number of cascades cascade size cascade size Most cascades reach fewer than 1% of all users!

Why are these cascades so small? Standard model of epidemic growth Most cascades fall in this range (Heterogenous mean field theory, SIR model, same degree distribution as Digg) λ*, Transmissibility Transmissibility of almost all Digg stories fall within width of this line?!

Maybe graph structure is responsible? Mean field prediction (same degree dist.) Simulated cascades on a random graph with same degree dist. epidemic threshold Simulated cascades on the observed Digg graph transmissibility clustering reduces epidemic threshold and cascade size, but not enough!

What about the spreading mechanism?? Infected Not infected Independent Cascade Model implicit in many epidemic models: a node with n infected friends has n chances to be infected

How important are repeat exposures? More than half exposed to a story more than once!

How do people respond to repeated exposure? Not much. We have similar results for Twitter Also noted by Romero, et al, WWW 2011 repeated exposure has little effect on probability to become infected

Big consequences for epidemic growth 1. Most people are exposed to a story more than once 2. Repeated exposures have little effect Growth of epidemics is severely curtailed (especially compared to Ind. Cascade Model)

Weak response to repeated exposure Take effect of repeat exposure into account: Epidemic threshold unchanged Actual Digg cascades Result of simulations λ*, Transmissibility λ* [Ver Steeg, Ghosh & Lerman, What stops social epidemics? in ICWSM 11]

Beyond PageRank: Dynamics and network structure

What are the important nodes in a network? Centrality metrics examine structure of the network to determine relative importance of nodes Degree Centrality Measures number of node s neighbors Betweenness centrality [Freeman, 1979] Fraction of all shortest paths passing through a given node PageRank [Brin et al., 1998] Probability random surfer ends up at a node Alpha centrality [Bonacich, 1987], Katz score [Katz, 1953] Number of paths of any length, exponentially attenuated by their length, from a given node to all nodes

The Billion Dollar Algorithm: PageRank (Brin et al 98) 1 4 5 2 3 r PR 1 ( t, i) (1 ) n j followeri ( ) r PR ( t 1, out d j j) After many iterations, converges to PageRank scores of nodes.

PageRank and the Random Surfer Consider a web surfer who clicks on links on web pages at random with no regard to content With probability, the Surfer follows a hyperlink from a given web page With probability (1 ), the Surfer jumps to a random web page 1 4 5 1 4 5 2 3 2 3 Node size = probability ~ importance After a long time the probability the Random Surfer visits a web page is given by that page's PageRank.

Random walk and conservative diffusion Random surfer executes a random walk (with restarts) on a web graph a stochastic process obeying the following rules With probability, walker jumps from a node to one of its neighbors With probability (1 ), walker jumps to any node Random walk and diffusion Random walk is a prototypical example of conservative diffusion process Other examples are money transfer, used goods circulation, etc. where some quantity ($, goods, probability) is conserved

Matrix formulation 0 1 0 0 0 1 4 5 2 3 Adjacency matrix of the network A = 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 Outdegree matrix D = 0 2 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 1

Matrix formulation of conservative diffusion Weight vector gives the probability of finding the walker anywhere on the graph at time t. It is changed by diffusion: C 1 C wt ( 1 ) s D Awt 1 where starting vector s specifies where a random jump ends up Steady state solution: as t w C 1 (1 ) s ( 1 ) s (1 ) D As... 1 ( I D A) same as PageRank when s=1/n

Social phenomena are non conservative Random walk is not a good model of social phenomena, such as epidemics, adoption of innovation, or information spread in social media broadcast based diffusion 4 5 4 5 1 1 2 3 2 3 Node size ~ importance Broadcast based diffusion is non conservative; i.e., amount of the diffusing quantity (disease, information) changes over time.

Matrix formulation of non conservative diffusion Weight gives the amount of quantity on the graph at time t. Weight evolves according to rules of non conservative diffusion w NC NC t s Aw t 1 Steady state solution: as w NC k A k s s(i A) 1 k 0 with starting vector s t Holds for < 1/ 1 Weight vector is not conserved: w NC NC t w t 1

Alpha Centrality [Bonacich, 87] Number of paths of any length, attenuated by their length with 4 5 1 2 3 r t Alpha ea ea 2... t ea t 1

Alpha Centrality [Bonacich, 87] Number of paths of any length, attenuated by their length with 4 5 1 2 3 r t Alpha ea ea 2... t ea t 1

Alpha Centrality [Bonacich, 87] Number of paths of any length, attenuated by their length with 4 5 1 t r Alpha ea (I A) 2 Holds for < 1/ 1 3

Alpha Centrality [Bonacich, 87] Number of paths of any length, attenuated by their length with 4 5 1 non-conservative diffusion 2 3 t r Alpha ea (I A) Holds for < 1/ 1 w NC s (I A)

Two classes of dynamic processes on networks Conservative diffusion Non conservative diffusion 1 4 5 1 4 5 2 3 2 3 Mathematical formulation of two types of diffusion Equivalence of steady state solutions and centrality metrics Unifies social network analysis and epidemic models Implications Location of the epidemic threshold Social Network Analysis should consider not only the network topology, but type of dynamics also I.e., which centrality metric? [Ghosh and Lerman, Predicting Influential Users in Online Social Networks. SNAKDD10]

Implications: epidemic threshold Non conservative diffusion has a threshold, given by 1/ 1, (largest eigenvalue of A) [Wang et al., 2003] For <1/ 1, process (epidemic) dies out For >1/ 1,process (epidemic) reaches many nodes c =0.006 c =0.009 transmissibility

Implications: Social Network Analysis Which centrality metric best predicts important nodes? PageRank Alpha Centrality 6 6 7 5 4 7 5 4 1 2 3 1 2 3 9 8 9 8 10 10

Implications: Social Network Analysis Which centrality metric best predicts important nodes? PageRank Alpha Centrality 6 6 7 5 1/3 4 7 5 1 1 1 4 9 8 1/3 1 2 3 1/2 1/2 9 8 1 1 2 3 1 1 10 10

Which metric is right? How can we evaluate centrality metrics? User activity in social media provides an independent measure of importance/influence Serves as ground truth for evaluating centrality metrics Evaluation methodology Define an empirical measure of influence (ground truth) Compare centrality metrics with the ground truth

Information flow on Digg Post Post submitter fan Follower vote fan fan Post Follower vote Information spread on Digg is non-conservative Non-conservative metric will best predict influential users

Empirical estimate of influence 1. Average follower votes Likelihood a follower votes for the story Influence of submitter Quality of the story Story quality Random variable Average out by aggregating fan votes over all stories submitted by the same submitter 289 users submitting at least 2 stories 2. Average Cascade size How far does the story spread into the network Ground truth(s): Rank users according to each estimate

Evaluation of importance prediction Correlation between the rankings produced by the empirical measures of influence and those predicted by Alpha-Centrality and PageRank (1) Avg. # of follower votes (2) Avg. cascade size Non conservative Alpha Centrality best predicts influence rankings

Conclusion Physics inspired analysis of large scale, time resolved data about social behavior online leads to new understanding of social dynamics and social networks Mathematical modeling links details of individual actions with collective social behavior Explain phenomenology of Digg Predict future popularity of new content Design better social systems Dynamics of information spread on networks What limits the size of information cascades? Fundamental differences between social and disease contagion Understanding network structure Topology alone does not explain structure, e.g., important nodes. Also need to consider the type of diffusion process Centrality metric based on epidemic models better predicts influential users of Digg.

Readings Mathematical modeling Using a Model of Social Dynamics to Predict Popularity of News. Lerman, K., and Hogg, T. 2010. In Proceedings of 19th International World Wide Web Conference (WWW). http://www.isi.edu/~lerman/papers/wfp0788 lerman.pdf Using Stochastic Models to Describe and Predict Social Dynamics of Web Users. Lerman, K., and Hogg", T. 2010. submitted to ACM Transactions on Intelligent Systems and Technology. http://www.isi.edu/~lerman/papers/prediction.pdf Dynamics of information spread What stops social epidemics?. Steeg, G. V.; Ghosh, R.; and Lerman, K. 2011. In Proceedings of 5th International Conference on Weblogs and Social Media. http://arxiv.org/abs/1102.1985 Information Contagion: an Empirical Study of Spread of News on Digg and Twitter Social Networks. Lerman, K., and Ghosh, R. 2010. In Proceedings of 4th International Conference on Weblogs and Social Media (ICWSM). http://arxiv.org/abs/1003.2664 Diffusion and network structure Non Conservative Diffusion and its Application to Social Network Analysis. Ghosh, R.; Lerman, K.; Surachawala, T.; Voevodski, K.; and Teng, S. 2011. http://arxiv.org/abs/1102.4639 Predicting Influential Users in Online Social Networks. Ghosh, R., and Lerman, K. 2010. In Proceedings of KDD workshop on Social Network Analysis (SNA KDD), July. A Parameterized Centrality Metric for Network Analysis. Ghosh, R., and Lerman, K. 2011. Physical Review, E 83(6):066118. http://arxiv.org/abs/1010.4247

Thanks Collaborators Rumi Ghosh (USC) Greg Ver Steeg (USC) Tad Hogg (IMM) Rob Zinkov (USC) Tawan Surachawala (USC) Funding agencies NSF AFOSR AFRL