Jure Leskovec Stanford University


1 Jure Leskovec, Stanford University
Includes joint work with Jon Kleinberg (Cornell), Lars Backstrom (Cornell), Andreas Krause (Caltech), Carlos Guestrin (CMU), Christos Faloutsos (CMU)

2 Intersection of news media, technology, and the political process
From its early stages, a tension between global effects from the mass media and local effects carried by social structure.
How does information transmitted by the media interact with the personal influence arising from social networks?

3 Internet, blogging, and social media:
Social media means the dichotomy between global and local influence is evaporating.
Speed of media reporting and discussion has intensified: very rapid progression of stories, with no pauses.
The "24-hour news cycle": difficult to define, but associated with technological acceleration and a challenge to healthy civic discourse [Kovach-Rosenstiel 99]

4 Question: Is the news cycle simply a metaphorical construct describing our perception of the news, or is it visible in data? And if it's visible, can we measure some of its basic properties?

5 Bloggers write posts and refer (link) to other posts, and the information propagates.

7 What basic units make up the news cycle?
Cascading hyperlinks to articles: too fine-grained [Adar et al 04, Gruhl et al 04, Kumar et al 03, Leskovec et al]
Topics as probabilistic term mixtures: too coarse-grained [Blei-Lafferty 06, Wang-McCallum 06, Wang et al 07]
Named entities: too coarse-grained (Obama, McCain, Microsoft, Paris, Apple)
Common sequences of words: too noisy ("I love you", "web 2.0", "Oh my God", "Made in China")

8 Need units that: correspond to aggregates of articles, vary over the order of days, and can be handled at terabyte scale.
Plan: identify text fragments, phrases, memes that travel relatively unchanged through many articles.
Idea: quoted phrases (".*"):
are integral parts of journalistic practices
tend to follow iterations of a story as it evolves
are attributed to individuals and have time and location
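As an illustration of the quoted-phrase idea on this slide, here is a minimal Python sketch of pulling quoted phrases out of article text and counting them. It is not the pipeline used in the talk: the regular expression, the lower-casing and whitespace normalization, the `min_words` threshold, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Matches text between straight or curly double quotes (assumed convention).
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')


def extract_quotes(text, min_words=4):
    """Return normalized quoted phrases that have at least min_words words."""
    quotes = []
    for match in QUOTE_RE.finditer(text):
        # Lower-case and collapse whitespace so small variations coincide.
        phrase = " ".join(match.group(1).lower().split())
        if len(phrase.split()) >= min_words:
            quotes.append(phrase)
    return quotes


def count_quotes(documents, min_words=4):
    """Count how many times each normalized quote appears across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_quotes(doc, min_words))
    return counts
```

Very short quotes such as "yes" are dropped by the `min_words` filter, which is one hypothetical way to avoid the "too noisy" problem of common word sequences.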

9 Data from Spinn3r on the 3 months leading up to the 2008 U.S. Presidential Election:
1 million news articles and blog posts per day
Essentially complete online media coverage: 20,000 sites that are part of Google News, 1.6 million blogs
From August 1 to October million documents from 1.65 million sites, 390 GB
We extract 112 million quotes (phrases)

10 "Our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough that he's palling around with terrorists who would target their own country."

11-13 Phrase graph (figure, built up over three slides):
Nodes are quotes
Edges are inclusion relations
Edges have weights
Example nodes: BCD, BDXCY, ABCD, ABCDEFGH, ABC, ABCEF, ABCEFG, CEF, CEFP, CEFPQR, UVCEXF
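The phrase graph on these slides (quotes as nodes, weighted inclusion edges) could be sketched roughly as below. This is only an illustration: inclusion is approximated here by a word-level subsequence test, and the weight formula (volume of the longer phrase divided by the number of extra words) is a placeholder assumption, since the slides do not give the actual definitions.

```python
def is_subsequence(short_tokens, long_tokens):
    """True if short_tokens appear in long_tokens in order (gaps allowed)."""
    remaining = iter(long_tokens)
    return all(token in remaining for token in short_tokens)


def build_phrase_graph(volumes):
    """Build weighted inclusion edges between phrases.

    volumes: dict mapping phrase -> number of occurrences.
    Returns a dict {(shorter, longer): weight}. Edges always point from a
    shorter phrase to a strictly longer one, so the graph is acyclic.
    """
    edges = {}
    phrases = sorted(volumes)
    for p in phrases:
        for q in phrases:
            tp, tq = p.split(), q.split()
            if len(tp) < len(tq) and is_subsequence(tp, tq):
                # Placeholder weight: favors frequent, closely matching phrases.
                edges[(p, q)] = volumes[q] / (len(tq) - len(tp))
    return edges
```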

14 Objective: in a directed acyclic graph (approx. quote inclusion), delete edges of minimum total weight such that each connected component has a single sink node.
(Figure: phrase graph with nodes BDXCYZ, BCD, ABCD, ABCDEFGH, ABC, ABCEF, ABCEFG, CEF, CEFP, UVCEXF, CEFPQR)

15 Observation: it is enough to know each node's parent.
Heuristic: proceed top down and assign each node to its strongest cluster.
(Figure: phrase graph with nodes BDXCYZ, BCD, ABCD, ABCDEFGH, ABC, ABCEF, ABCEFG, CEF, CEFP, UVCEXF, CEFPQR)
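One possible reading of the top-down heuristic is sketched below in Python. It assumes edges point from a shorter phrase (child) to a longer phrase (parent), treats every node without outgoing edges as a sink that founds its own cluster, and assigns each remaining node to the cluster that receives the largest total weight from its outgoing edges. The function name, the data layout, and the tie-breaking rule are assumptions, not details taken from the slides.

```python
from collections import defaultdict


def assign_clusters(nodes, edges):
    """Assign every node of a DAG to the cluster of exactly one sink.

    nodes: iterable of node ids.
    edges: dict {(child, parent): weight} forming a directed acyclic graph.
    Returns a dict node -> sink id of the chosen cluster.
    """
    out_edges = defaultdict(list)
    for (child, parent), weight in edges.items():
        out_edges[child].append((parent, weight))

    cluster = {}

    def resolve(node):
        if node in cluster:
            return cluster[node]
        if not out_edges[node]:
            cluster[node] = node  # a sink is the root of its own cluster
            return node
        # Parents are resolved first, so assignment proceeds from sinks down.
        score = defaultdict(float)
        for parent, weight in out_edges[node]:
            score[resolve(parent)] += weight
        # Ties are broken alphabetically to keep the result deterministic.
        best = max(sorted(score), key=lambda c: score[c])
        cluster[node] = best
        return best

    for node in nodes:
        resolve(node)
    return cluster
```

Keeping only the edges that lead into the chosen cluster is equivalent to deleting all others, which is why knowing a single cluster choice per node is enough to describe the result.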

16 Quoted text | Volume
the fundamentals of our economy are strong | 3654
the fundamentals of the economy are strong | 988
fundamentals of our economy are strong | 645
fundamentals of the economy are strong | 557
if john mccain hadn't said that the fundamentals of our economy are strong on the day of one of our nation's worst financial crises the claim that he invented the blackberry would have been the most preposterous thing said all week | 224
fundamentals of the economy | 172
the fundamentals of the economy are sound | 119
i promise you we will never put america in this position again we will clean up wall street | 83
the fundamentals of our economy are sound | 81
clean up wall street | 78
our economy i think still the fundamentals of our economy are strong | 75
fundamentals of the economy are sound | 72
the fundamentals of our economy are strong but these are very very difficult times and i promise you we will never put america in this position again | 68
the economy is in crisis | 66
these are very very difficult times | 63
the fundamentals of our economy are strong but these are very very difficult times | 62
do you still think the fundamentals of our economy are strong genius | 62
our economy i think still the fundamentals of our economy are strong but these are very very difficult times | 60
mccain's first response to this crisis was to say that the fundamentals of our economy are strong then he admitted it was a crisis and then he proposed a commission which is just washington-speak for i'll get back to you later | 55
i still believe the fundamentals of our economy are strong | 53
i think still the fundamentals of our economy are strong | 50

17 Can we extract any interesting temporal variations?
Overall volume is periodic and has no trends.
Bandwidth of the online media is constant.

18 Figure: volume over time (August to October) of the 50 quote clusters with the largest total volume.

20 What ingredients are essential to qualitatively reproduce the observed dynamics?
Temporal variation has potential connections with natural processes:
Species competing for resources in an ecosystem.
Biological systems synchronize to favor a small number of individuals [Lacker-Peskin 1981]
Model: N news sources, one new story per time step. Each source's choice of what to cover is controlled by:
Imitation: increasing in the number of sources covering the story
Recency: decreasing in the time since the story's appearance
Attractiveness: prefer more interesting stories
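The model on this slide can be explored with a small simulation. The Python sketch below is an assumption-laden illustration, not the model from the talk: the specific functional forms (imitation as `1 + count`, recency as `1 / (1 + age)`, attractiveness drawn uniformly from 0.5 to 1.5) and all names are hypothetical choices that merely respect the stated monotonicity.

```python
import random
from collections import Counter


def simulate_news_cycle(num_sources=20, num_steps=50, seed=0):
    """Simulate sources choosing which story to cover at each time step.

    Returns (count, volume): count[j] is the total coverage of story j and
    volume[t] is a Counter of story -> number of sources covering it at step t.
    """
    rng = random.Random(seed)
    birth, attract, count = [], [], []
    volume = []
    for t in range(num_steps):
        # One new story appears per time step.
        birth.append(t)
        attract.append(rng.uniform(0.5, 1.5))
        count.append(0)
        stories = list(range(len(birth)))
        # Weight grows with past coverage and attractiveness, decays with age.
        weights = [
            attract[j] * (1 + count[j]) / (1 + t - birth[j]) for j in stories
        ]
        # All sources choose at once, based on coverage before this step.
        picks = rng.choices(stories, weights=weights, k=num_sources)
        step = Counter(picks)
        for j, n in step.items():
            count[j] += n
        volume.append(step)
    return count, volume
```

Setting the imitation or recency term to a constant in `weights` is one way to compare the ingredients in isolation.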

21 Figure panels: only imitation; only recency/attractiveness; imitation and recency.

22 Can study the typical quote cluster volume curve.
Peak behaves like a delta function. Phrases are very short-lived.

23 Peak blog intensity comes about 2.5 hours after the news peak.
Using Google News we label:
Mainstream media: 20,000 sites (44% vol.)
Blogs (everything else): 1.6 million sites (56% vol.)

24 Can classify individual sources by their typical timing relative to the peak aggregate intensity.
(Figure labels: news media, professional blogs)

25 Can study the oscillation of attention between news media and blogs.

26 So, who is really influential here?
(Figure: an obscure technology story spreads from a small tech blog to Wired, BBC, New Scientist, Slashdot, CNN, New York Times)
The sooner we read the story, the more of its influence area we cover.

27 I have 10 minutes. Which blogs should I read to be most up to date?
Who are the most influential bloggers?

28 Want to read things before others do.
(Figure: one choice of blogs detects the blue and green stories soon but misses red; another detects all stories, but late.)

29 Given a budget (e.g., of 3 blogs), select blogs to cover the most of the blogosphere.
Bad news: solving this exactly is NP-hard.
Good news: Theorem: our algorithm can do it in linear time and with a factor 3 approximation.

30 Cost: the cost of monitoring is node dependent.
Reward: minimize the number of affected nodes. If A is the set of monitored nodes, let R(A) denote the number of nodes we save.
We also consider other rewards:
Minimize time to detection
Maximize number of detected outbreaks

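31 Given: graph $G(V, E)$, budget $M$, and data on how cascades $C_1, \dots, C_i, \dots, C_K$ spread over time.
Select a set of nodes $A$ maximizing the reward
$\max_{A \subseteq V} R(A) = \sum_i \mathrm{Prob}(i)\, R_i(A)$ subject to $\mathrm{cost}(A) \le M$,
where $R_i(A)$ is the reward for detecting cascade $i$.
Solving the problem exactly is NP-hard: Max-cover [Khuller et al. 99]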

32 Jure Leskovec, Joint Statistical Meeting, August
New monitored blog: B
Placement A = {B1, B2}: adding B helps a lot
Placement B = {B1, B2, B3, B4}: adding B helps very little
The gain of adding a node to a small set is larger than the gain of adding a node to a large set.
Submodularity: diminishing returns (think of it as "concavity")

33 We must show R is submodular:
$A \subseteq B \;\Rightarrow\; R(A \cup \{u\}) - R(A) \ge R(B \cup \{u\}) - R(B)$
Left side: gain of adding a node to a small set. Right side: gain of adding a node to a large set.
Natural example: sets $A_1, A_2, \dots, A_n$; $R(A)$ = size of the union of the $A_i$ (size of covered area)
If $R_1, \dots, R_K$ are submodular, then $\sum_i p_i R_i$ is submodular (for nonnegative weights $p_i$)
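The diminishing-returns inequality on this slide can be checked by brute force on small examples. The Python sketch below tests every pair A ⊆ B and every extra element u; the helper names and the toy sets in the usage lines are made up for illustration, and the exhaustive check is only feasible for a handful of elements.

```python
from itertools import combinations


def is_submodular(reward, elements):
    """Exhaustively test R(A + u) - R(A) >= R(B + u) - R(B) for all A <= B."""
    elements = sorted(elements)
    subsets = [
        frozenset(combo)
        for size in range(len(elements) + 1)
        for combo in combinations(elements, size)
    ]
    for big in subsets:
        for small in subsets:
            if not small <= big:
                continue
            for u in elements:
                if u in big:
                    continue
                gain_small = reward(small | {u}) - reward(small)
                gain_big = reward(big | {u}) - reward(big)
                if gain_small < gain_big:
                    return False
    return True


def make_coverage_reward(sets):
    """Reward = size of the union of the chosen sets (the slide's example)."""
    def reward(chosen):
        covered = set()
        for name in chosen:
            covered |= sets[name]
        return len(covered)
    return reward


# Toy data (hypothetical): four sets covering parts of a small area.
toy_sets = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6}, "d": {1, 2}}
coverage_is_submodular = is_submodular(make_coverage_reward(toy_sets), toy_sets)
```

A reward such as the squared size of the chosen set fails the same test, since each added element helps more as the set grows.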

34 Theorem: the reward function is submodular.
Consider cascade $i$:
$R_i(u_k)$ = set of nodes saved from $u_k$
$R_i(A) = \left| \bigcup_{u_k \in A} R_i(u_k) \right|$ (size of the union), so $R_i$ is submodular
Global optimization: $R(A) = \sum_i \mathrm{Prob}(i)\, R_i(A)$, so $R$ is submodular
(Figure: cascade $i$ with nodes $u_1$, $u_2$ and their saved regions $R_i(u_1)$, $R_i(u_2)$)

35 We develop the CELF (cost-effective lazy forward-selection) algorithm:
Two independent runs of a modified greedy:
Solution set $A'$: ignore cost, greedily optimize reward
Solution set $A''$: greedily optimize reward/cost ratio
Pick the best of the two: $\arg\max(R(A'), R(A''))$
Theorem: if R is submodular then CELF is near optimal: CELF achieves a ½(1-1/e) factor approximation
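The two-run structure described on this slide might look roughly like the following Python sketch. It only illustrates "run greedy twice, keep the better result" and omits the lazy evaluation; the function names, the budget handling, and the toy coverage data are assumptions for the example rather than details from the talk.

```python
def greedy_select(reward, cost, nodes, budget, use_ratio):
    """Greedy selection under a budget.

    use_ratio=False ranks candidates by marginal reward (cost is ignored for
    ranking); use_ratio=True ranks them by marginal reward per unit cost.
    """
    chosen, spent = [], 0.0
    remaining = set(nodes)
    while True:
        base = reward(chosen)
        best, best_score = None, 0.0
        for u in sorted(remaining):
            if spent + cost[u] > budget:
                continue  # cannot afford this node
            gain = reward(chosen + [u]) - base
            score = gain / cost[u] if use_ratio else gain
            if score > best_score:
                best, best_score = u, score
        if best is None:
            return chosen
        chosen.append(best)
        spent += cost[best]
        remaining.discard(best)


def best_of_two_runs(reward, cost, nodes, budget):
    """Run both greedy variants and return the solution with higher reward."""
    by_reward = greedy_select(reward, cost, nodes, budget, use_ratio=False)
    by_ratio = greedy_select(reward, cost, nodes, budget, use_ratio=True)
    return max(by_reward, by_ratio, key=reward)


# Toy example (hypothetical data): reward is the number of covered items.
covers = {"a": {1, 2, 3, 4}, "b": {1, 2}, "c": {3, 4}, "d": {5}}
costs = {"a": 4.0, "b": 1.0, "c": 1.0, "d": 1.0}


def coverage(chosen):
    return len(set().union(*(covers[u] for u in chosen)))


solution = best_of_two_runs(coverage, costs, covers, budget=4.0)
```

In this toy case the cost-blind run spends the whole budget on the expensive node, while the ratio run buys three cheap nodes that together cover more, which is the situation the second run is meant to catch.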

36 Repeatedly add the node with the highest marginal gain.
Slow! O(kN)
For the size of our problems, naïve CELF (greedy) is too slow.
(Figure: marginal reward of nodes a to e as the current solution grows: {}, {a}, {a, c})

37 Observation: submodularity guarantees that marginal rewards decrease with the solution size.
Idea: use the marginal reward from the previous step as an upper bound on the current marginal reward.
(Figure: marginal reward of nodes a to e)

38-40 CELF algorithm (figure built up over three slides):
Keep an ordered list of marginal rewards $r_i$ from the previous step.
Re-evaluate $r_i$ only for the top node.
(Figure: marginal reward of nodes a to e, with the list re-sorted after a re-evaluation)
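The lazy re-evaluation idea on these slides can be sketched with a priority queue. The Python version below is a simplified illustration that assumes unit costs and a submodular reward; the stamp-based freshness test, the function name, and the toy data are choices made for this example and are not taken from the slides.

```python
import heapq


def lazy_greedy(reward, nodes, k):
    """Select up to k nodes, re-evaluating only the top of the queue.

    Returns (chosen, evaluations), where evaluations counts how many
    marginal rewards were computed.
    """
    chosen = []
    base = reward(chosen)
    # Heap entries: (negated marginal reward, node, solution size when computed).
    heap = [(-(reward([u]) - base), u, 0) for u in nodes]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(chosen) < k:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The value is current, and every stale value below it is an
            # upper bound, so u really has the largest marginal reward.
            chosen.append(u)
            base = reward(chosen)
        else:
            # Stale value: recompute for the current solution and requeue.
            gain = reward(chosen + [u]) - base
            evaluations += 1
            heapq.heappush(heap, (-gain, u, len(chosen)))
    return chosen, evaluations


# Toy example (hypothetical data): reward is the number of covered items.
covers = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6}, "d": {1, 2}, "e": {5, 6, 7}}


def coverage(chosen):
    return len(set().union(*(covers[u] for u in chosen)))


picked, num_evaluations = lazy_greedy(coverage, sorted(covers), k=2)
```

On this toy input the second pick needs two re-evaluations instead of four, because the stale bounds for the remaining nodes already show they cannot win.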

41 Question: which blogs should one read to catch big stories?
Idea: each blog covers part of the blogosphere.
(Figure: each dot is a blog; proximity is based on the number of common cascades.)

42 Which blogs should one read?
(Figure: reward (higher is better) versus number of selected blogs (sensors), for CELF, In-links, Out-links, # posts (used by Technorati), and Random)
For more info see our website:

43 CELF runs 700x faster than the simple greedy algorithm.
(Figure: run time in seconds (lower is better) versus number of selected blogs (sensors), for exhaustive search, greedy, and CELF)

44 Table columns: k, Score, Blog, Posts, InLinks, OutLinks

45 [w/ Krause et al., J. of Water Resource Planning]
Given: a real city water distribution network and data on how contaminants spread over time.
Place sensors (to save lives).
Problem posed by the US Environmental Protection Agency.

46 [w/ Ostfeld et al., J. of Water Resource Planning]
Our approach performed best at the Battle of Water Sensor Networks competition.
(Figure: population saved (higher is better) versus number of placed sensors, for CELF, Degree, Random, Population, and Flow)
Author | Score
CMU (CELF) | 26
Sandia | 21
U Exeter | 20
Bentley Systems | 19
Technion (1) | 14
Bordeaux | 12
U Cyprus | 11
U Guelph | 7
U Michigan | 4
Michigan Tech U | 3
Malcolm | 2
Proteo | 2
Technion (2) | 1

47 A framework for tracking memes through the news, to quantify the dynamics of the news cycle.
Demo + Data:
Many further questions:
Which elements of the news cycle do we miss?
Can this analysis of memes help identify dynamics of polarization? (cf. [Adamic-Glance, 2005])
How are these memes actually spreading among people?

48 Predictive models of information diffusion:
When, where, and what post will create a cascade?
Where should one tap the network to get the effect they want?
Social media marketing
How do news and information spread?
New ranking and influence measures for blogs
Sentiment analysis from cascade structure

49 Why do certain versions of the meme persist?
(Figure: frequency versus time [hours])

50 What is a phylogenetic tree of information?
(Figure labels: frequency, baseline)

51 Why do certain quotes start off slowly? Which parts of the web do they propagate into?
"A changing environment will affect Alaska more than any other state because of our location"
"I'm not one though who would attribute it to being man-made"
