Influence of Time on User Profiling and Recommending Researchers in Social Media. Chifumi Nishioka Gregor Große-Bölting Ansgar Scherp



2 Motivation
Construct users' professional profiles from social media items
- Researchers reveal their thoughts on Twitter [Letierce et al. 14]
- User profiling from Twitter is therefore useful
Challenge in extracting professional interests
- Professional profiles may contain a lot of noise [Abel et al. 11]
Challenge in extracting interests from social media items
- Items are short and sparse
- Methods that reveal implicit interests are required

3 Extracting Users' Professional Interests
In this work we construct users' professional profiles and investigate the influence of older data.
Use a domain-specific knowledge base
- Contains only entities in a specific domain: MeSH for medicine and ACM CCS for computer science
- Avoids adding noise to profiles
Incorporate spreading activation functions
- Extract entities that are not mentioned directly
Experiment with two different domains
- Computer science
- Medicine

4 Process of Constructing User Profiles
Texts
- social media items / publications
Entity Extraction
- string matching with entity labels
Scoring of Entities
- give a score to each entity in a text
- compare 10 different scoring functions
User's Professional Profile
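The entity extraction step (string matching against entity labels) could look roughly like the sketch below. The `LABELS` map and the function name `extract_entities` are hypothetical; a real label map would be built from ACM CCS or MeSH, and the slides do not say whether matching is case-insensitive or word-bounded, so both are assumptions here.

```python
import re
from collections import Counter

# Hypothetical label-to-entity map; a real one would be loaded from the knowledge base.
LABELS = {
    "world wide web": "World Wide Web",
    "web mining": "Web Mining",
    "social tagging": "Social Tagging",
}


def extract_entities(text, labels=LABELS):
    """Count occurrences of each entity label in a text (case-insensitive, assumed)."""
    counts = Counter()
    lowered = text.lower()
    for label, entity in labels.items():
        # Word boundaries keep a label from matching inside a longer token.
        pattern = r"\b" + re.escape(label) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[entity] += hits
    return counts


tweet = "Web mining meets social tagging on the World Wide Web. More web mining soon!"
entity_counts = extract_entities(tweet)
```

The resulting counts play the role of freq(c, d) in the scoring functions.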

5 Research Questions
(I) Effectiveness of different entity scoring functions
(II) Influence of older data
(III) Influence of the number of publications and social media items
(IV) Influence of using abstracts for profiling

6 Scoring Functions [1/5]
Frequency
$$\mathrm{score}_{freq}(c, d) = \mathrm{freq}(c, d)$$
- freq(c, d): the number of appearances of an entity c in d
Basic Spreading Activation [Kapanipathi et al. 14]
$$\mathrm{score}_{basic}(c, d) = \mathrm{freq}(c, d) + \lambda \sum_{c_i \in C_l(c)} \mathrm{score}_{basic}(c_i, d)$$
- C_l(c): the set of child concepts of a concept c
- λ: decay parameter, λ = 0.4
- e.g., score_basic(c_1, d) = 1.00, score_basic(c_2, d) = 0.40, score_basic(c_3, d) = 0.16
[Figure: example concept hierarchy with nodes c_1, c_2, c_3; concepts shown: Social Recommendation, Web Searching, World Wide Web, Social Tagging, Web Mining, Site Wrapping, Web Log Analysis]
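As a minimal sketch, the basic spreading activation recurrence can be evaluated recursively over the hierarchy. The function name `basic_activation`, the dictionary-based hierarchy, and the three-node chain are illustrative assumptions; the chain assumes c_1 is a child of c_2, c_2 is a child of c_3, and only c_1 appears in the text, which reproduces the example values 1.00, 0.40, and 0.16. The hierarchy is assumed to be acyclic.

```python
def basic_activation(freq, children, decay=0.4):
    """score(c) = freq(c) + decay * sum of score(child) over the children of c."""
    scores = {}

    def score(c):
        if c not in scores:
            kids = children.get(c, ())
            scores[c] = freq.get(c, 0) + decay * sum(score(k) for k in kids)
        return scores[c]

    # Score every node that occurs anywhere in the inputs.
    nodes = set(freq) | set(children)
    for kids in children.values():
        nodes.update(kids)
    for c in nodes:
        score(c)
    return scores


# Hypothetical chain: c3 is the parent of c2, c2 is the parent of c1.
children = {"c3": ["c2"], "c2": ["c1"]}
freq = {"c1": 1}  # only c1 is mentioned in the text
basic_scores = basic_activation(freq, children)
```

Entities that never appear in the text (c2 and c3 here) still receive a score, which is how implicit interests enter the profile.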

7 Scoring Functions [2/5]
Bell Spreading Activation [Kapanipathi et al. 14]
$$\mathrm{score}_{bell}(c, d) = \mathrm{freq}(c, d) + F(c) \sum_{c_i \in C_l(c)} \mathrm{score}_{bell}(c_i, d)$$
- F(c): reciprocal of the number of entities that are located one level below an entity c
Bell Logarithmic Spreading Activation [Kapanipathi et al. 14]
$$\mathrm{score}_{belllog}(c, d) = \mathrm{freq}(c, d) + FL(c) \sum_{c_i \in C_l(c)} \mathrm{score}_{belllog}(c_i, d)$$
- FL(c): reciprocal of the logarithm of the number of entities that are located one level below an entity c
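The two bell variants differ from basic spreading activation only in the weight applied to the children's scores. The sketch below rests on several assumptions that the slides do not settle: "entities one level below c" is read as all entities at depth(c) + 1 in the knowledge base (supplied as `level_size`), the logarithm is base 10, and every populated level has more than one entity so that the logarithm is nonzero. All names and numbers are hypothetical.

```python
import math


def spread(freq, children, depth, level_size, factor):
    """Generic spreading activation with a level-dependent weight."""
    scores = {}

    def score(c):
        if c not in scores:
            kids = children.get(c, ())
            # The weight depends on how many entities sit one level below c.
            weight = factor(level_size[depth[c] + 1]) if kids else 0.0
            scores[c] = freq.get(c, 0) + weight * sum(score(k) for k in kids)
        return scores[c]

    for c in depth:
        score(c)
    return scores


def bell(n):
    return 1.0 / n


def bell_log(n):
    # Base 10 is an assumption; requires n > 1.
    return 1.0 / math.log10(n)


# Hypothetical hierarchy: root -> {a, b}, a -> {x}.
children = {"root": ["a", "b"], "a": ["x"]}
depth = {"root": 0, "a": 1, "b": 1, "x": 2}
level_size = {1: 2, 2: 10}  # assumed number of entities per level in the knowledge base
freq = {"x": 2, "b": 1}

bell_scores = spread(freq, children, depth, level_size, bell)
belllog_scores = spread(freq, children, depth, level_size, bell_log)
```

Because the logarithm grows slowly, BellLog dampens activation from crowded levels far less than Bell does.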

8 Scoring Functions [3/5]
CF-IDF [Goossen et al. 11]
$$\mathrm{score}_{cfidf}(c, d) = \mathrm{cf}(c, d) \cdot \log \frac{|D|}{|\{d \in D : c \in d\}|}$$
- Extension of TF-IDF that replaces words with entities
- Lower scores for entities that appear in many documents
HCF-IDF (Hierarchical CF-IDF)
$$\mathrm{score}_{hcfidf}(c, d) = \mathrm{score}_{belllog}(c, d) \cdot \log \frac{|D|}{|\{d \in D : c \in d\}|}$$
- Extension of CF-IDF that reveals entities not mentioned explicitly
- Combines the strength of semantics with statistical strength
- |{d ∈ D : c ∈ d}|: the number of documents including an entity c after applying BellLog
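A small sketch of CF-IDF over a corpus of per-document entity scores follows. The function name `cf_idf`, the toy corpus, and the use of the natural logarithm are assumptions. Feeding BellLog scores instead of raw counts into the same function gives an HCF-IDF-style score, since the document frequency is then counted after spreading.

```python
import math


def cf_idf(doc_scores):
    """doc_scores: one {entity: score} dict per document; returns weighted dicts."""
    n_docs = len(doc_scores)
    # Document frequency: in how many documents does each entity have a positive score?
    df = {}
    for scores in doc_scores:
        for entity, value in scores.items():
            if value > 0:
                df[entity] = df.get(entity, 0) + 1
    return [
        {e: v * math.log(n_docs / df[e]) for e, v in scores.items() if v > 0}
        for scores in doc_scores
    ]


# Hypothetical corpus of two documents with raw entity counts.
corpus = [
    {"Web Mining": 2, "World Wide Web": 1},
    {"World Wide Web": 3},
]
weighted = cf_idf(corpus)
```

An entity that occurs in every document, such as World Wide Web in this toy corpus, receives a score of zero.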

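9 Scoring Functions [4/5]
BM25C
$$\mathrm{score}_{bm25c}(c, d) = \mathrm{IDF}(c, D) \cdot \frac{\mathrm{freq}(c, d) \cdot (k + 1)}{\mathrm{freq}(c, d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$$
$$\mathrm{IDF}(c, D) = \log \frac{|D| - |\{d \in D : c \in d\}|}{|\{d \in D : c \in d\}|}$$
- Extension of Okapi BM25 that replaces words with entities
- avgdl: average length of the documents in a corpus
BM25HC
$$\mathrm{score}_{bm25hc}(c, d) = \mathrm{IDF}(c, D) \cdot \frac{\mathrm{score}_{belllog}(c, d) \cdot (k + 1)}{\mathrm{score}_{belllog}(c, d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$$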
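The saturation part shared by BM25C and BM25HC can be sketched as a single function that takes the IDF value as an input. The name `bm25_entity_score` is hypothetical, and the defaults k = 1.2 and b = 0.75 are common BM25 choices, not values stated in the slides.

```python
def bm25_entity_score(weight, idf, doc_len, avgdl, k=1.2, b=0.75):
    """BM25-style score for one entity in one document.

    weight: freq(c, d) for BM25C, or the BellLog score for BM25HC.
    idf: precomputed IDF(c, D).
    k, b: assumed defaults, not taken from the slides.
    """
    if weight <= 0:
        return 0.0
    # Length normalisation: longer-than-average documents are penalised.
    norm = k * (1 - b + b * doc_len / avgdl)
    return idf * weight * (k + 1) / (weight + norm)


# For a document of average length and weight 1, the fraction equals 1,
# so the score reduces to the IDF itself.
single = bm25_entity_score(weight=1, idf=2.0, doc_len=10, avgdl=10)
# Very large weights saturate below idf * (k + 1).
saturated = bm25_entity_score(weight=1000, idf=2.0, doc_len=10, avgdl=10)
```

The saturation is the main difference from CF-IDF, where the score grows linearly with the entity frequency.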

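10 Scoring Functions [5/5]
Methods without using entities
TF-IDF
$$\mathrm{score}_{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{|D|}{|\{d \in D : w \in d\}|}$$
BM25
$$\mathrm{score}_{bm25}(w, d) = \mathrm{IDF}(w, D) \cdot \frac{\mathrm{freq}(w, d) \cdot (k + 1)}{\mathrm{freq}(w, d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$$
$$\mathrm{IDF}(w, D) = \log \frac{|D| - |\{d \in D : w \in d\}|}{|\{d \in D : w \in d\}|}$$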

11 Datasets for Experiments
                                   Computer Science    Medicine
social media                       Twitter             Twitter
# of users                         88                  64
average # of tweets                (± )                (± )
publication                        DBLP                PubMed
average # of titles per user       (± 13.45)           (± 65.95)
average # of abstracts per user    3.69 (± 5.12)       (± 60.23)
knowledge base                     ACM CCS             MeSH
# of entities                      2,299               27,300
# of labels                        9,                  ,368
- Twitter is chosen as the social media platform
- Twitter users are collected by searching hashtags
  - computer science: hashtags of A* conferences
  - medicine: hashtags of the top 5 journals by impact factor
- All tweets, publications, and entity labels are in English

12 User Profiling (Experiment 1)
How well does a user's social media profile reflect the user's professional interests as captured by the publication profile?
Procedure
- User's social media items -> scoring function -> user's social media profile
- User's publications -> scoring function -> user's publication profile
- Compare the two profiles by cosine similarity
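The comparison step can be sketched with profiles stored as sparse dictionaries from entity to score. The function name `cosine` and the two toy profiles are illustrative assumptions; returning 0.0 for an empty profile is also an assumed convention.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles ({entity: score})."""
    dot = sum(value * q.get(entity, 0.0) for entity, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumed convention for empty profiles
    return dot / (norm_p * norm_q)


# Hypothetical profiles of one user.
social_media_profile = {"Web Mining": 1.0, "Social Tagging": 1.0}
publication_profile = {"Web Mining": 3.0}
similarity = cosine(social_media_profile, publication_profile)
```

Cosine similarity ignores the overall magnitude of the scores, so profiles built from very different numbers of texts remain comparable.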

13 Results [1/2] (Experiment 1)
(I) Effectiveness of different entity scoring functions
- Basic Spreading Activation yields the largest similarity scores in both datasets
(II) Influence of older data
- We incrementally add publications from older years and measure similarity scores
[Figure panels: Computer Science, Medicine]

14 Results [2/2] (Experiment 1)
(III) Influence of the number of publications and social media items
- Kendall's rank correlation coefficient between the number of publications/tweets and similarity scores
- Moderate correlations
- Stronger correlations with the number of publications
(IV) Influence of using abstracts for profiling
- Slight improvements are observed when using abstracts
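For illustration, Kendall's rank correlation between per-user item counts and similarity scores can be computed by counting concordant and discordant pairs. The slides do not say which variant was used; this sketch implements tau-a, which does not correct for ties, and the data values are made up.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: publications per user and that user's similarity score.
n_publications = [5, 12, 30, 48]
similarity_scores = [0.10, 0.25, 0.20, 0.40]
tau = kendall_tau_a(n_publications, similarity_scores)
```

Of the six user pairs in this toy data, five are concordant and one is discordant, giving tau = 4/6.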

15 Recommending Researchers (Experiment 2)
How well can the entity scoring functions distinguish a user from other users?
Procedure
1. Compute cosine similarity scores between a user u's social media profile and each of all users' publication profiles
2. Rank the users' publication profiles by similarity score and identify the rank of u's own publication profile
3. Calculate MRR (Mean Reciprocal Rank)
$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}(u)}$$
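The three steps can be sketched as follows, assuming the similarity scores are already available as a nested dictionary `sim[u][v]` (the similarity between u's social media profile and v's publication profile). The function name and the toy scores are hypothetical, and ties are broken by dictionary order, which the slides do not specify.

```python
def mean_reciprocal_rank(sim):
    """sim[u][v]: similarity of u's social media profile to v's publication profile."""
    total = 0.0
    for user, row in sim.items():
        # Step 2: rank all publication profiles by similarity, best first.
        ranked = sorted(row, key=row.get, reverse=True)
        rank = ranked.index(user) + 1  # position of the user's own profile
        total += 1.0 / rank
    # Step 3: average the reciprocal ranks over all users.
    return total / len(sim)


# Hypothetical similarity scores for three users.
sim = {
    "u1": {"u1": 0.9, "u2": 0.1, "u3": 0.2},  # own profile ranked 1st
    "u2": {"u1": 0.5, "u2": 0.4, "u3": 0.1},  # own profile ranked 2nd
    "u3": {"u1": 0.3, "u2": 0.6, "u3": 0.2},  # own profile ranked 3rd
}
mrr = mean_reciprocal_rank(sim)
```

An MRR of 1 would mean every user's own publication profile is ranked first for that user's social media profile.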

16 Results [1/3] (Experiment 2)
(I) Effectiveness of different entity scoring functions
- BM25 and TF-IDF perform best for the computer science dataset
- BM25C and BM25HC perform best for the medicine dataset
- Possible reason for the result: quality of the knowledge bases
  - MeSH contains many more entities (27,300) than ACM CCS (2,299)
  - With MeSH, enough entities can be detected even in short texts

17 Results [2/3] (Experiment 2)
(II) Influence of older data
- In the computer science dataset, the recommendation performs best with publications published after 2004
- In the medicine dataset, the performance does not vary much when older publications are added
[Figure panels: Computer Science, Medicine]

18 Results [3/3] (Experiment 2)
(III) Influence of the number of publications and social media items
- Kendall's rank correlation coefficient between the number of publications/tweets and MRR
- Moderate correlations with the number of publications
- Almost no correlation with the number of tweets
- Some tweets are not about professional interests, and the share of tweets about professional interests depends on the user
  - 10% of tweets contain at least one entity in computer science
  - 48% of tweets contain at least one entity in medicine
(IV) Influence of using abstracts for profiling
- Improvements with Frequency, Basic, Bell, and BellLog

19 Conclusion
Influence of older data
- For computer science, adding data that is too old decreases the performance
- For medicine, the performance does not vary much when older data is added
Influence of scoring functions
- Basic Spreading Activation is the best for user profiling
- In recommendation, BM25 and TF-IDF perform best for computer science, while BM25C and BM25HC perform best for medicine
- Does the difference between the two domains come from the size of the knowledge bases?