Behavioral Data Mining. Lecture 22: Network Algorithms II Diffusion and Meme Tracking

1 Behavioral Data Mining Lecture 22: Network Algorithms II Diffusion and Meme Tracking

2 Once upon a time

3 Once upon a time Approx 1 MB Travels ~ 1 hour, B/W = 3k b/s

4 News Today, the total news available from reputable news sources and blogs is around 10 GB/day (from Spinn3r). Downloading all of it requires only a 1 Mb/s connection.
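The bandwidth figures on the two slides above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and transfers spread evenly over the stated time; the function name `average_bps` is made up for the illustration.

```python
def average_bps(num_bytes: float, seconds: float) -> float:
    """Average bit rate needed to move num_bytes within the given time."""
    return num_bytes * 8 / seconds


# "Once upon a time": about 1 MB delivered in about 1 hour.
old_bps = average_bps(1e6, 60 * 60)

# Today: about 10 GB of news per day.
news_bps = average_bps(10e9, 24 * 60 * 60)

print(f"then: {old_bps / 1e3:.1f} kb/s")
print(f"now:  {news_bps / 1e6:.2f} Mb/s")
```

Under these assumptions the first figure comes out a little over 2 kb/s, the same order of magnitude as the slide's 3k b/s, and the second a little under 1 Mb/s.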

5 Memes A meme is an idea, behavior, or style that spreads from person to person within a culture. The term was coined by Richard Dawkins in The Selfish Gene, by analogy with "gene". Examples: "Nine-nine-nine"; "Unleash the American people"; "Governor etch-a-sketch"; "President Obama wants everybody in America to go to college. What a snob!"; "Riding the Tsunami in online education"

6 Memes A meme is an idea, behavior, or style that spreads from person to person within a culture. The term was coined by Richard Dawkins in The Selfish Gene, by analogy with "gene". Examples: "Nine-nine-nine"; "We want to unleash the power of the American people"; "The etch-a-sketch candidate"; "President Obama wants everybody in America to go to college. What a snob!"; "It's a tsunami we want to get ahead of this wave"

7 Memetracker Memes are finer than topics, but longer than entities: a meme is neither an entity such as "President Obama" nor the full set of words associated with a topic such as the Iraq war. Memes are extracted from the quoted content of news stories and blogs. Memetracker identifies memes by custom clustering, and finds a source phrase to represent each cluster.

8 Meme Example (2008 content) Identify memes by custom clustering, and find a source phrase. Memes are finer than topics, but longer than entities.

9 Phrase Graph Nodes are phrases. Edges point to strictly longer phrases. Edges join very similar phrases (edit distance 1, or a long common phrase). Because every edge points to a strictly longer phrase, the resulting graph is a DAG.
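The edge rule on this slide can be sketched in code. This is an illustration under stated assumptions, not Memetracker's implementation: phrases are compared word by word, "edit distance" is taken to be the fewest word edits that turn the shorter phrase into some contiguous run of the longer one, and the `min_run=10` threshold for a "long common phrase" is an assumed value. All function names are hypothetical.

```python
from itertools import permutations


def directed_distance(p, q):
    """Fewest word edits turning p into some contiguous run of words in q."""
    prev = [0] * (len(q) + 1)  # the empty prefix of p matches anywhere in q
    for i, pw in enumerate(p, start=1):
        cur = [i]  # against an empty run of q, all i words are deleted
        for j, qw in enumerate(q, start=1):
            cur.append(min(prev[j] + 1,                 # delete pw
                           cur[j - 1] + 1,              # insert qw
                           prev[j - 1] + (pw != qw)))   # match or substitute
        prev = cur
    return min(prev)


def longest_common_run(p, q):
    """Length of the longest run of consecutive words shared by p and q."""
    best = 0
    prev = [0] * (len(q) + 1)
    for pw in p:
        cur = [0]
        for j, qw in enumerate(q, start=1):
            cur.append(prev[j - 1] + 1 if pw == qw else 0)
        best = max(best, max(cur))
        prev = cur
    return best


def build_phrase_graph(phrases, max_dist=1, min_run=10):
    """Return directed edges (shorter phrase, longer phrase)."""
    words = {s: s.lower().split() for s in phrases}
    edges = []
    for p, q in permutations(phrases, 2):
        if len(words[p]) >= len(words[q]):
            continue  # edges only point to strictly longer phrases
        if (directed_distance(words[p], words[q]) <= max_dist
                or longest_common_run(words[p], words[q]) >= min_run):
            edges.append((p, q))
    return edges


phrases = [
    "unleash the american people",
    "we want to unleash the power of the american people",
    "the etch a sketch candidate",
]
edges = build_phrase_graph(phrases)
print(edges)
```

The pairwise loop is quadratic in the number of phrases, which is fine for a toy example but would need indexing (for example on shared word sequences) at the scale of a real quote corpus.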

10 Phrase Graph The goal is to partition the phrase graph into meaningful clusters. A natural description of a cluster is its longest phrase, which contains all the other phrases. Several phrases in the example have more than one containing phrase, so weak edges have to be removed.

11 Phrase Graph Delete the lowest-weight edges. The weight of edge (p, q) increases with the frequency of the parent phrase q and decreases with the edit distance from p to q.

12 Phrase Graph After deletion, we have three components, which are the memes. Each has a unique root, which is a long phrase spanning the cluster content, and is a description of the meme.
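The deletion step can be sketched as follows. The slides do not give the weight formula or the exact deletion rule, so both are assumptions here: the weight is taken to be the parent's frequency divided by one plus the edit distance, and each phrase keeps only its heaviest outgoing edge, which guarantees that every remaining component has a single root. The names `edge_weight` and `extract_memes` are hypothetical.

```python
from collections import defaultdict


def edge_weight(dist, parent_freq):
    """Assumed form: grows with parent frequency, shrinks with edit distance."""
    return parent_freq / (1.0 + dist)


def extract_memes(edges):
    """edges: iterable of (child, parent, weight). Returns {root: members}."""
    best = {}  # child -> (parent, weight) of its heaviest outgoing edge
    for child, parent, w in edges:
        if child not in best or w > best[child][1]:
            best[child] = (parent, w)

    def root_of(node):
        # Terminates because the phrase graph is a DAG.
        while node in best:
            node = best[node][0]
        return node

    nodes = {n for child, parent, _ in edges for n in (child, parent)}
    clusters = defaultdict(set)
    for n in nodes:
        clusters[root_of(n)].add(n)
    return dict(clusters)


# Toy graph with placeholder phrase ids; "a" has two candidate parents.
toy_edges = [
    ("a", "R1", edge_weight(1, 10)),  # weight 5.0
    ("a", "R2", edge_weight(1, 4)),   # weight 2.0, dropped
    ("b", "R2", edge_weight(0, 4)),   # weight 4.0
    ("c", "b", edge_weight(1, 6)),    # weight 3.0
]
memes = extract_memes(toy_edges)
print(memes)
```

In the toy graph the weaker edge from "a" to "R2" is dropped, leaving two memes rooted at "R1" and "R2".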

13 Phrase Graph Strengths/weaknesses of this approach?

14 Power Laws

15 Dataset Three months of online mainstream and social media activity, from August 1 to October 31, at about 1 million documents per day. 90 million documents (blog posts and news articles) from 1.65 million different sites. The total dataset size is 390 GB. Covers all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums, and other media sites. 112 million quotes were extracted. (A similar dataset for early this year is available at )

16 Trending Memes

17 Dynamics Principles: Imitation: sources imitate what others are doing. Recency: more recent information is favored. Preferential attachment: when imitation is linear, it leads to power-law behavior.
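A toy simulation makes the interplay of imitation and recency concrete. This is a sketch under assumptions the slide does not state: one new thread appears per time step, each source then joins an existing thread with probability proportional to its current size (linear imitation) times an exponential recency decay, and the parameters `sources` and `half_life` are invented for the example.

```python
import random


def simulate_threads(steps=200, sources=20, half_life=5.0, seed=0):
    """Return the final number of stories in each thread."""
    rng = random.Random(seed)
    births, counts = [], []
    for t in range(steps):
        # A new thread enters with a single story.
        births.append(t)
        counts.append(1)
        # Attractiveness = size (imitation) x decay with age (recency).
        scores = [n * 0.5 ** ((t - b) / half_life)
                  for n, b in zip(counts, births)]
        picks = rng.choices(range(len(counts)), weights=scores, k=sources)
        for j in picks:
            counts[j] += 1
    return counts


sizes = simulate_threads()
print(sorted(sizes, reverse=True)[:10])
```

Sorting the thread sizes gives a quick look at how unevenly attention is spread; changing `half_life` shows how strongly recency limits the lifetime of a thread.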

18 Dynamics Thread volume over time:

19 Blogs vs News Media

20 Temporal Profile

21 Outline Memetracker Viral Marketing The biggest experiment ever

22 Viral Marketing Quick background on diffusion theory: [figure: adoption rate vs. time]

23 Viral Marketing Quick background on marketing:

24 Viral Marketing Data source: a marketing campaign that used referrals. The first buyer chooses friends to whom an email is sent. If the receiver makes a purchase through a web link in the email, the receiver gets a 10% discount and the buyer gets a 10% credit on the purchase price. The data comprises 15,646,121 recommendations among 3,943,084 users.

25 By product type

26 Recommendation components

27 Network structures

28 Recommendations The deeper the user is in the cascade, the more recommendations they make:

29 The Model The model implied by the data is $N_{t+1} = p_t N_t$, where $N_t$ is the number of recommendations at time $t$ and $p_t$ is a probability. Working through from this assumption, we obtain a log-normal distribution. When the variance is large, this distribution behaves as a power law over a range of magnitudes.
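The step from the multiplicative rule to the log-normal distribution can be sketched as follows. The derivation assumes that the $p_t$ are independent with finite variance, which the slide does not state; $\mu$ and $\sigma^2$ are symbols introduced here for the mean and variance of $\ln N_t$.

```latex
\begin{align*}
N_t &= N_0 \prod_{i=0}^{t-1} p_i
  \quad\Longrightarrow\quad
  \ln N_t = \ln N_0 + \sum_{i=0}^{t-1} \ln p_i, \\
P(N) &= \frac{1}{N\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(\ln N - \mu)^2}{2\sigma^2}\right), \\
\ln P(N) &= -\ln N - \frac{(\ln N - \mu)^2}{2\sigma^2}
  - \tfrac{1}{2}\ln\!\left(2\pi\sigma^2\right).
\end{align*}
```

The first line turns the product into a sum of logarithms, which by the central limit theorem is approximately normal, so $N_t$ is approximately log-normal with the density on the second line. The third line shows why a large variance looks like a power law: when $\sigma^2$ is large, the quadratic term stays small over a wide range of $\ln N$ around $\mu$, leaving $\ln P(N) \approx -\ln N + \text{const}$, that is, $P(N) \propto N^{-1}$ over that range.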

30 Peer Influence Note: recommendations after the first purchase are discarded.

31 Recommender Benefits Top row: number of purchases vs. number of outbound recommendations. Bottom row: probability of credit vs. number of outbound recommendations.

32 Outline Memetracker Viral Marketing The biggest experiment ever

33 54% of adult internet users use Facebook. 48% of their real-life friends are listed on Facebook.

34 54% of adult internet users use Facebook. 48% of their real-life friends are listed on Facebook. Conditions: Feed vs. No Feed

35 The experiment A random subset of user-URL pairs was selected for treatment. Over 7 weeks: 250 million subjects, 75 million URLs, 1.2 billion subject-URL pairs. Items were removed from the feed in the no-feed condition.
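Randomizing at the level of user-URL pairs, rather than users, can be sketched with a hash-based assignment. The slide does not say how the randomization was implemented or what fraction of pairs went to each condition, so the hashing scheme, the 50% rate, and the names `assign_condition`, `no_feed_rate`, and `salt` are all assumptions made for illustration.

```python
import hashlib


def assign_condition(user_id: str, url: str,
                     no_feed_rate: float = 0.5,
                     salt: str = "exp-1") -> str:
    """Deterministically map a (user, URL) pair to 'feed' or 'no_feed'."""
    key = f"{salt}|{user_id}|{url}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # First 8 bytes of the hash as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "no_feed" if u < no_feed_rate else "feed"


print(assign_condition("user42", "http://example.com/story"))
print(assign_condition("user42", "http://example.com/other"))
```

Because the unit is the pair, the same user can see one URL in the feed and have another withheld, and the assignment is reproducible without storing 1.2 billion coin flips.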

36 Peer influence Causal influence of number of peer shares:

37 Peer influence Strength of tie with the recommender:

38 Peer influence How feed/no feed affects peer influence

39 Peer influence How feed/no feed affects peer influence
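Comparing the two conditions comes down to comparing sharing rates. The sketch below shows one way to compute a risk ratio and a risk difference; the `Arm` class and `feed_effect` function are hypothetical, and the counts are made-up numbers chosen only to exercise the code, not results from the experiment.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    pairs: int   # subject-URL pairs in this condition
    shares: int  # pairs in which the subject went on to share the URL

    @property
    def rate(self) -> float:
        return self.shares / self.pairs


def feed_effect(feed: Arm, no_feed: Arm):
    """Return (risk ratio, risk difference) of sharing, feed vs. no feed."""
    return feed.rate / no_feed.rate, feed.rate - no_feed.rate


# Made-up counts for illustration only.
feed = Arm(pairs=100_000, shares=300)
no_feed = Arm(pairs=100_000, shares=50)
ratio, diff = feed_effect(feed, no_feed)
print(f"risk ratio {ratio:.1f}, risk difference {diff:.4f}")
```

Because assignment to the two conditions is random, a difference in these rates can be attributed to exposure in the feed rather than to similarity between friends.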

40 Strong vs. Weak Ties Strong ties are individually more influential, but weak ties, being far more numerous, are more influential in aggregate.

41 Summary Memetracker Viral Marketing The biggest experiment ever