Automatically Mining Relations Between Heterogeneous Text Sources. CSE 291 Project by Siddharth Dinesh 13 March, 2018



What is the need?
- Studying reactions to news events on Twitter
- Identifying first-person accounts of events mentioned in news stories
- Summarizing fans' reactions to sporting events
- Showcasing contrarian viewpoints to an editorial article
- Understanding reactions to court judgements in news articles

Research Question: How do we compare text sources to identify relations between them?

Design Challenges
- Heterogeneous text sources
  - News article corpus: long, published, edited text
  - Tweet corpus: short, stream-of-thought text
- Differing rates of data production
  - Thousands of news articles
  - Millions of tweets
- Lack of clarity over the axes of similarity along which to mine relations
  - Semantic meaning
  - Time
  - Location
  - Subject

About the Data
- Newspaper articles and Sina Weibo posts in Chinese script
- One month of randomly sampled Weibo posts
  - Around 25,000 tweets collected per day
  - 20% of posts have hashtags
  - Each tweet has a mean of 42 words
- One month of newspaper articles from news agencies in China
  - 2,200 news articles collected per day
  - Each article has a mean of 606 words
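To get a feel for the scale these figures imply, the per-day numbers can be multiplied out. This is only a back-of-the-envelope sketch: the slides say "one month" without giving a day count, so `DAYS = 30` is an assumption, and the variable names are illustrative.

```python
# Rough corpus-size arithmetic from the per-day figures on the slide.
DAYS = 30  # assumption: "one month" is taken as 30 days

tweets_per_day = 25_000
articles_per_day = 2_200
mean_tweet_words = 42
mean_article_words = 606

total_tweets = tweets_per_day * DAYS
total_articles = articles_per_day * DAYS
tweet_words = total_tweets * mean_tweet_words
article_words = total_articles * mean_article_words

# A naive all-pairs comparison would score every tweet against every article.
candidate_pairs = total_tweets * total_articles

print(f"{total_tweets:,} tweets, about {tweet_words:,} words")
print(f"{total_articles:,} articles, about {article_words:,} words")
print(f"{candidate_pairs:,} tweet-article pairs for a naive join")
```

Under the 30-day assumption, the two corpora hold a comparable number of words, but the number of tweet-article pairs runs into the tens of billions, which is why scoring every pair is unattractive.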

Ideal System: Use a Similarity Join
- For 2 collections, a similarity join returns all pairs of similar objects according to a similarity function
- Similarity search: given a collection and a query object, return all objects in the collection similar to the query object
- Calculating similarity between objects is an expensive operation
  - Inverted indexes as a solution
- Similarity joins between text objects which have:
  - Unequal information
  - Different language models
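The idea of a similarity join backed by an inverted index can be sketched as follows. This is a minimal illustration, not the project's implementation: the Jaccard similarity function, the whitespace tokenizer (a stand-in for a real Chinese word segmenter), the threshold, and the toy English documents are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real word segmenter."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_inverted_index(docs):
    """Map each token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def similarity_join(left_docs, right_docs, threshold=0.3):
    """Return (left_id, right_id, score) for pairs scoring at or above threshold.

    Only pairs that share at least one token are scored. Pairs with no shared
    token have Jaccard similarity 0, so skipping them loses nothing as long as
    the threshold is positive.
    """
    index = build_inverted_index(right_docs)
    right_tokens = [tokenize(text) for text in right_docs]
    results = []
    for left_id, text in enumerate(left_docs):
        tokens = tokenize(text)
        candidates = set()
        for token in tokens:
            candidates |= index.get(token, set())
        for right_id in sorted(candidates):
            score = jaccard(tokens, right_tokens[right_id])
            if score >= threshold:
                results.append((left_id, right_id, score))
    return results


# Toy data, invented for illustration only.
news = ["court rules on property dispute", "team wins national football final"]
tweets = ["so happy our team wins the final", "court rules today", "new song out now"]
pairs = similarity_join(news, tweets, threshold=0.2)
for left_id, right_id, score in pairs:
    print(news[left_id], "<->", tweets[right_id], round(score, 2))
```

The index turns the join into a series of similarity searches: each news article looks up only the tweets that share a token with it, so the unrelated "new song out now" tweet is never scored.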

Qualitative Evaluation


Tweets to News Articles: Building the Features
[Figure: Hashtag A, Hashtag B, Hashtag C]


Tweets to News Articles
- Articles are mostly unrelated to tweets
  - Average relevancy = 2.36 on a scale of 10
- Failure to identify the most relevant terms in tweets
  - Failure of language modeling
  - Bag of words does not work well
- Excessive noise in tweets
  - Many entertainment-related hashtags: songs, artists, advertisements
- Challenge: picking newsworthy hashtags
  - Better performance with hashtags pre-selected from a seed set of political and economics hashtags
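One way to picture the seed-hashtag approach is to pool the tweets for each pre-selected hashtag into a single bag of words and rank articles against that bag. The sketch below assumes cosine similarity over raw word counts, a simple `#tag` pattern (Weibo's own hashtag syntax may differ), and invented English sample data; none of these details come from the slides.

```python
import math
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")  # assumption: hashtags look like "#tag"
WORD = re.compile(r"\w+")


def group_by_hashtag(tweets, seed_hashtags):
    """Pool the words of all tweets carrying each seed hashtag into one bag of words."""
    bags = defaultdict(Counter)
    for tweet in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        words = WORD.findall(tweet.lower())
        # Tweets whose hashtags are not in the seed set are ignored entirely.
        for tag in tags & seed_hashtags:
            bags[tag].update(words)
    return bags


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_articles(bag, articles):
    """Return article indices ordered from most to least similar to the bag."""
    scored = [
        (cosine(bag, Counter(WORD.findall(article.lower()))), i)
        for i, article in enumerate(articles)
    ]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy data, invented for illustration only.
tweets = [
    "interest rates rise again #economy",
    "bank lending slows as rates rise #economy",
    "love this new song #music",
]
articles = [
    "pop star releases new song",
    "central bank says interest rates will rise",
]
bags = group_by_hashtag(tweets, seed_hashtags={"economy"})
ranking = rank_articles(bags["economy"], articles)
print(ranking)
```

Restricting to a seed set drops the entertainment tweet before any features are built, so the noise never reaches the comparison step. The sketch keeps the plain bag-of-words representation that the slide reports as weak; it only illustrates the filtering.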


News Articles to Tweets
- Experiment 2: Latent Dirichlet Allocation (LDA) vs. Latent Semantic Indexing (LSI) as clustering mechanism for news articles
  - 2 samples each
- LDA clusters were not topically related, very loose
  - Average relevancy rating on 25 related tweets = [value missing]
  - Mean execution time = [value missing] sec, σ = [value missing]
- LSI clusters worked better
  - Topic clusters were tighter in general
  - Average relevancy = [value missing]
  - Mean execution time = [value missing] sec on Python implementation, σ = 9.89 s

Discussion
- (A)symmetry of similarity join
  - News → Tweet
  - Tweet → News
- Tradeoff between automated system and interactive system in terms of
  - Quality of results
  - Performance: memory and running time
- Accounting for semantic structure in text sources
  - Named entities
  - Locations
  - Semantic structure adds additional axes of comparison
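The asymmetry point can be made concrete with a best-match search run in both directions. The sketch below is an assumption-laden illustration: it uses Jaccard similarity over whitespace tokens and invented English documents, and it reads "similarity join" as "nearest item in the other collection", which is only one possible reading of the slide.

```python
def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two strings."""
    a, b = set(a.split()), set(b.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query, collection):
    """Index of the most similar item in collection (ties go to the lowest index)."""
    return max(
        range(len(collection)),
        key=lambda i: (jaccard(query, collection[i]), -i),
    )


# Toy data, invented for illustration only.
news = ["flood hits city bridge closed", "flood hits city"]
tweets = ["flood hits city", "traffic jam downtown"]

# News -> Tweet: for each article, the closest tweet.
forward = {n: best_match(article, tweets) for n, article in enumerate(news)}
# Tweet -> News: for each tweet, the closest article.
backward = {t: best_match(tweet, news) for t, tweet in enumerate(tweets)}

print("News -> Tweet:", forward)
print("Tweet -> News:", backward)
```

Here the first article's closest tweet is tweet 0, yet tweet 0's closest article is the second one. The similarity function itself is symmetric; the asymmetry comes from ranking within different collections, so the direction of the join changes which pairs are returned.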

Future Work
