Tag cloud generation for results of multiple keywords queries

1 Tag cloud generation for results of multiple keywords queries Martin Leginus, Peter Dolog and Ricardo Gomes Lage IWIS, Department of Computer Science, Aalborg University

2 What are tag clouds? A tag cloud is a visual retrieval interface depicting the most important terms of a dataset. Tag clouds are either built on top of the entire dataset or query based.

3 What are tag clouds? Tag clouds built on top of the entire dataset.

4 What are tag clouds? Query based tag clouds.

5 Motivation The work is motivated by personalization tasks, surveillance systems and information retrieval tasks defined with multiple keywords.

6 Techniques
Most Frequent Tags from Corpus (MFTC) and Most Frequent Tags from Query Result Set (POP): the most frequent topics within the system are propagated to the tag cloud. The tag cloud does not cover less frequently represented topics that could be relevant for the user.
Term frequency inverse document frequency selection (TFIDF): for each tag from the documents associated with the query keywords, tf-idf is computed. These values are aggregated and sorted in descending order. Semantic similarities between tags are not considered.
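The two query-based baselines can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy data, the function names, the "documents containing every query tag" query semantics and the exact tf-idf weighting (raw tag count times log(N / df), summed over the result documents) are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy data: document id -> tags assigned to it
# (a tag repeats when several users assigned it).
DOCS = {
    "d1": ["python", "python", "web"],
    "d2": ["python", "data"],
    "d3": ["web", "design"],
    "d4": ["data", "numpy", "python"],
    "d5": ["data", "web"],
}

def query_result_set(docs, query_tags):
    """Documents annotated with every query tag (assumed query semantics)."""
    return {d for d, tags in docs.items() if set(query_tags) <= set(tags)}

def pop(docs, query_tags, k):
    """POP: the k most frequent tags within the query result set."""
    counts = Counter()
    for d in query_result_set(docs, query_tags):
        counts.update(t for t in docs[d] if t not in query_tags)
    return [t for t, _ in counts.most_common(k)]

def tfidf(docs, query_tags, k):
    """TFIDF: aggregate tf * idf per tag over the result set, sort descending."""
    n_docs = len(docs)
    df = Counter(t for tags in docs.values() for t in set(tags))
    scores = Counter()
    for d in query_result_set(docs, query_tags):
        for t, tf in Counter(docs[d]).items():
            if t not in query_tags:
                scores[t] += tf * math.log(n_docs / df[t])
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]
```

On this toy data, `pop(DOCS, ["python"], 1)` selects the frequent tag `data`, while `tfidf(DOCS, ["python"], 1)` prefers the rarer tag `numpy`.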

7 Techniques Max coverage selection (COV): maximization of coverage and minimization of overlap between the tags of a tag cloud. Optimizing coverage might generate tag clouds whose terms have high coverage but are irrelevant for the specific user's information retrieval goal.
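The slide does not say how COV is optimized. One common heuristic for maximum coverage is a greedy selection, sketched below under that assumption; the function name and the toy data are hypothetical.

```python
def greedy_coverage(tag_docs, result_docs, k):
    """Greedy max-coverage heuristic: repeatedly pick the tag that covers
    the most result documents not yet covered by earlier picks."""
    uncovered = set(result_docs)
    candidates = {t: set(docs) & uncovered for t, docs in tag_docs.items()}
    chosen = []
    while uncovered and candidates and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates), key=lambda t: len(candidates[t] & uncovered))
        if not candidates[best] & uncovered:
            break  # no remaining tag adds coverage
        chosen.append(best)
        uncovered -= candidates.pop(best)
    return chosen

# Hypothetical toy data: tag -> documents it is assigned to.
TAG_DOCS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2}}
RESULT_DOCS = {1, 2, 3, 4, 5, 6}
```

Because only newly covered documents count, a tag that mostly overlaps with an already selected tag is skipped in favour of one that adds new documents, which is the coverage versus overlap trade-off named above.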

8 Graph based techniques
1. The tag space is transformed into a graph.
   1. Calculate the tag pair co-occurrence using Jaccard similarity for all tags.
   2. When the similarity of a tag pair is greater than a predefined threshold α, the two tags are considered similar.
   3. Each similar tag pair is transformed into two directed edges, t1 → t2 and t2 → t1.
2. Graph based methods for relevance estimation.
   The algorithms rank the importance of a tag t with respect to the query keywords $T_q$, written $I(t \mid T_q)$.
   The top-k most relevant tags are selected for the final tag cloud.

9 Graph based techniques 1. The tag space is transformed into a graph. Calculate the tag pair co-occurrence using Jaccard similarity for all tags. Example:
Samuel L. Jackson is assigned to Goodfellas (1990), Pulp Fiction (1994), Die Hard: With a Vengeance (1995) and Kill Bill: Vol. 2 (2004).
Tarantino is assigned to Reservoir Dogs (1992), Pulp Fiction (1994), Four Rooms (1995), Jackie Brown (1997), Kill Bill: Vol. 1 (2003) and Kill Bill: Vol. 2 (2004).
The two tags co-occur at Pulp Fiction and Kill Bill: Vol. 2, so JAC(Samuel L. Jackson; Tarantino) = 2 / 8 = 0.25 (2 shared movies out of 8 distinct movies).
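The graph construction described above can be sketched in a few lines. The tag assignments are copied from the example; the function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations

# Tag -> set of movies, taken from the example above.
TAG_ITEMS = {
    "Samuel L. Jackson": {
        "Goodfellas (1990)", "Pulp Fiction (1994)",
        "Die Hard: With a Vengeance (1995)", "Kill Bill: Vol. 2 (2004)",
    },
    "Tarantino": {
        "Reservoir Dogs (1992)", "Pulp Fiction (1994)", "Four Rooms (1995)",
        "Jackie Brown (1997)", "Kill Bill: Vol. 1 (2003)", "Kill Bill: Vol. 2 (2004)",
    },
}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_tag_graph(tag_items, alpha):
    """Adjacency dict: each tag pair with similarity > alpha yields
    two directed edges, t1 -> t2 and t2 -> t1."""
    graph = {t: set() for t in tag_items}
    for t1, t2 in combinations(tag_items, 2):
        if jaccard(tag_items[t1], tag_items[t2]) > alpha:
            graph[t1].add(t2)
            graph[t2].add(t1)
    return graph
```

Comparing all tag pairs is quadratic in the number of tags, so a real dataset would likely need an inverted index from items to tags to enumerate only co-occurring pairs.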

11 Graph based techniques
Graph based methods for relevance estimation:
Distance based approaches: computationally expensive.
Stochastic approaches: simulation of a random traversal of the graph.
In this work, we focus only on stochastic approaches.

12 Stochastic Graph based techniques The importance of nodes in the graph is measured through the simulation of a stochastic process, i.e., a random traversal of the graph. The transition probability from a node $u$ to a node $v$, for all nodes $v$ that have an incoming edge from $u$, is defined as $p(v \mid u) = 1 / d_{out}(u)$, where $d_{out}(u)$ is the number of outgoing edges of $u$.

13 Stochastic Graph based techniques [Figure: example graph with the nodes Bruce Willis, Reservoir Dogs, Unbreakable, Samuel L. Jackson, Quentin Tarantino, Pulp Fiction and Kill Bill: Vol. 2]

14 Stochastic Graph based techniques The random walk starts from the Pulp Fiction node, which offers 5 possible transitions. [Figure: the same example graph]

15 Stochastic Graph based techniques The walk jumped to the Bruce Willis tag, which offers only three possible transitions. [Figure: the same example graph]

16 Stochastic Graph based techniques The walk jumped to the Unbreakable tag, which offers only three possible transitions. [Figure: the same example graph]

17 Stochastic Graph based techniques After some time the random walk converges: if it runs longer, the share of time the token spends at each node no longer changes. [Figure: the same example graph with an additional node XY]

18 Stochastic Graph based techniques At each step of the random walk, it is possible to perform a random restart, which starts the random walk again from one of the root nodes (the query tags). [Figure: example graph with the nodes Bruce Willis, Reservoir Dogs, Unbreakable, XY, Samuel L. Jackson, Quentin Tarantino, Pulp Fiction and Kill Bill: Vol. 2]
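The random walk with restarts can be simulated directly. In the sketch below the node names come from the slides, but the edge list is hypothetical (the figure's edges are not recoverable); it was only chosen so that Pulp Fiction has 5 neighbours and Bruce Willis and Unbreakable have 3 each. The restart probability `beta` and the function names are likewise assumptions.

```python
import random
from collections import Counter

# Hypothetical undirected edges between the nodes shown on the slides.
EDGES = [
    ("Pulp Fiction", "Bruce Willis"), ("Pulp Fiction", "Samuel L. Jackson"),
    ("Pulp Fiction", "Quentin Tarantino"), ("Pulp Fiction", "Kill Bill vol. 2"),
    ("Pulp Fiction", "Reservoir dogs"), ("Bruce Willis", "Unbreakable"),
    ("Bruce Willis", "Samuel L. Jackson"), ("Unbreakable", "Samuel L. Jackson"),
    ("Unbreakable", "XY"), ("Quentin Tarantino", "Reservoir dogs"),
    ("Quentin Tarantino", "Kill Bill vol. 2"), ("Samuel L. Jackson", "Kill Bill vol. 2"),
]

def adjacency(edges):
    """Each undirected pair becomes two directed edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph

def random_walk_with_restart(graph, roots, beta, steps, seed=0):
    """Fraction of steps the token spends at each node."""
    rng = random.Random(seed)
    node = rng.choice(roots)
    visits = Counter()
    for _ in range(steps):
        if rng.random() < beta:
            node = rng.choice(roots)        # random restart at a query tag
        else:
            node = rng.choice(graph[node])  # uniform choice among outgoing edges
        visits[node] += 1
    return {n: c / steps for n, c in visits.items()}
```

For long walks the visit fractions stabilise, which is the convergence the slides describe; the restart keeps the walk concentrated around the query tags.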

19 PageRank with priors Relative importance with respect to the query tags is introduced through a vector of prior probabilities $p_{T_q} = \{p_1, \dots, p_{|T|}\}$. The random surfer is given a back probability $\beta$ ($0 \le \beta \le 1$) of jumping back to the query tags. The resulting ranks, biased towards $T_q$, define the importance after convergence, i.e., $I(t \mid T_q)$. The method requires setting several parameters, such as the back probability and the prior probabilities, with respect to a specific dataset.
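A minimal power-iteration sketch of PageRank with priors follows. It assumes the commonly used update rank(v) = (1 - beta) * sum over u of p(v | u) * rank(u) + beta * prior(v) with uniform transition probabilities, and it assumes every node has at least one outgoing edge. The function name, the tolerance and the toy graph are hypothetical.

```python
def pagerank_with_priors(graph, priors, beta, tol=1e-10, max_iter=1000):
    """graph: node -> list of successors (non-empty for every node).
    priors: node -> prior probability, summing to 1 over the query tags."""
    nodes = list(graph)
    rank = {n: priors.get(n, 0.0) for n in nodes}
    for _ in range(max_iter):
        # With probability beta the surfer jumps back according to the priors.
        nxt = {n: beta * priors.get(n, 0.0) for n in nodes}
        # Otherwise it follows one outgoing edge chosen uniformly at random.
        for u in nodes:
            share = (1.0 - beta) * rank[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank

# Hypothetical toy graph with bidirectional edges a-b, a-c, b-c, c-d.
TOY_GRAPH = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
```

Calling `pagerank_with_priors(TOY_GRAPH, {"a": 1.0}, beta=0.3)` ranks node `a` highest; the top-k tags by rank would then form the tag cloud.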

20 HITS with priors The same prior probabilities $p_{T_q} = \{p_1, \dots, p_{|T|}\}$ and a back probability $\beta$ are used.

21 K step Markov Chain This method differs in how the random surfer model is implemented. It is implemented with a path length limitation K, which determines how often the walk jumps back to the root nodes. Relative importance with respect to the root nodes is introduced through a vector of prior probabilities.
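The slide's formula is not recoverable, so the sketch below assumes one common formulation of the K-step Markov approach: start from the prior distribution over the root nodes, propagate it for K steps with uniform transition probabilities, and sum the distributions of steps 1 to K. Names and the toy graph are hypothetical.

```python
def k_step_markov(graph, priors, k):
    """graph: node -> list of successors (non-empty for every node).
    Returns the summed probability of being at each node during steps 1..k
    of a walk whose start node is drawn from the priors."""
    nodes = list(graph)
    dist = {n: priors.get(n, 0.0) for n in nodes}
    score = {n: 0.0 for n in nodes}
    for _ in range(k):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            for v in graph[u]:
                nxt[v] += dist[u] / len(graph[u])  # uniform transition
        dist = nxt
        for n in nodes:
            score[n] += dist[n]
    return score

# Hypothetical toy graph with bidirectional edges a-b, a-c, b-c, c-d.
TOY_GRAPH = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
```

A small K keeps the scores local to the neighbourhood of the query tags, while a larger K lets globally well-connected tags accumulate more score.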

22 Prior probabilities A uniform prior distribution results in the inclusion of irrelevant tags in the final tag cloud. Instead, the priors are set according to the relative popularity of the query tags.
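The two ways of setting the priors can be compared with a small sketch. It assumes that "relative popularity" means a prior proportional to how often each query tag was used; the function names and the counts are hypothetical.

```python
def uniform_priors(query_tags, all_tags):
    """Every query tag gets the same prior; all other tags get 0."""
    p = 1.0 / len(query_tags)
    return {t: (p if t in query_tags else 0.0) for t in all_tags}

def popularity_priors(tag_counts, query_tags, all_tags):
    """Prior of a query tag is its usage count relative to the
    total usage count of all query tags (assumed definition)."""
    total = sum(tag_counts[t] for t in query_tags)
    if total == 0:
        return uniform_priors(query_tags, all_tags)
    return {t: (tag_counts[t] / total if t in query_tags else 0.0) for t in all_tags}

# Hypothetical usage counts.
TAG_COUNTS = {"python": 900, "haskell": 100, "web": 500}
```

With these counts, a query for `python` and `haskell` gives `python` a prior of 0.9 and `haskell` a prior of 0.1, so restarts land mostly on the well-connected tag rather than on the rarely used one.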

24 Datasets Bibsonomy contains 206k items, 51k tags and 466k tagging posts. Movielens contains 16k tags, 7k movies and 95k tagging posts. Delicious contains 187k tags, 355k bookmarks and 2046k tagging posts.

25 Synthetic metrics Synthetic metrics express the quality of the tag selection process (Venetis 2011). Relevance of a tag cloud: expresses how relevant the tags in the tag cloud are to the query tags. We compute the average relevance of all tags in the tag cloud. The metric captures the extent to which the resources associated with the tag cloud tags overlap with the resources retrieved by the query tags. We do not consider Coverage, as this metric might be misleading: tags with high coverage can be irrelevant and not discriminative enough for retrieval tasks.

26 [Figure legend: a set of tags issued as a query; the documents retrieved by the keywords query; the documents associated with each of two candidate tags.] One tag is more relevant than the other: it can be perceived as a more specific subtopic of the documents returned by the query and is more discriminative for filtering purposes.
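The relevance metric can be sketched as follows. The exact formula is not recoverable from the slides, so the code assumes that the relevance of a single tag is the fraction of its documents that also appear in the query result, |D_t ∩ D_q| / |D_t|, and that the tag cloud relevance is the average over its tags. All names and the toy data are hypothetical.

```python
def tag_relevance(tag_docs, query_docs):
    """Share of the tag's documents that the query also retrieves."""
    return len(tag_docs & query_docs) / len(tag_docs) if tag_docs else 0.0

def cloud_relevance(tag_cloud, docs_by_tag, query_docs):
    """Average relevance of all tags in the tag cloud."""
    if not tag_cloud:
        return 0.0
    return sum(tag_relevance(docs_by_tag[t], query_docs) for t in tag_cloud) / len(tag_cloud)

# Hypothetical toy data: documents are plain integers.
QUERY_DOCS = {1, 2, 3, 4}
DOCS_BY_TAG = {
    "specific": {1, 2},             # entirely inside the query result
    "broad": {3, 4, 5, 6, 7, 8},    # mostly outside the query result
}
```

Under this assumed definition the tag `specific` scores 1.0 and `broad` scores 1/3, matching the intuition that a tag acting as a subtopic of the query result is more discriminative for filtering.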

27 Results Bibsonomy

28 Results Movielens

29 Results Delicious

30 Limitations The methods do not perform as well on datasets with a long-tail distribution of tags. This is caused by the way the tag space is transformed into a graph structure.

31 Conclusions The graph based methods perform best on the Movielens and Bibsonomy datasets. We proposed an extension of the setting of prior probabilities for the random walk based algorithms. The methods do not perform well on the Delicious dataset.

32 Future work Propose an enhanced graph creation. Enhance tag selection to generate more diverse and novel tag clouds. Extend the synthetic metrics so that they better capture the diversity and novelty of tag clouds.

33 Questions

35 Possible questions: Why is there a need to adjust the prior probabilities? When a rarely used tag is chosen as a query tag, it does not co-occur with many other tags. Therefore, few edges connect its graph node with other nodes. A random traversal of the graph initiated from the rarely used tag (node) might reach unimportant or irrelevant nodes (tags). Consequently, irrelevant tags are included in the tag cloud. We verified this assumption by a series of preliminary evaluations.

36 Possible questions: Why is Delicious different? There are many very frequent tags in the dataset, i.e., almost 20 tags that were assigned at least [...] times and almost 500 tags that were assigned by users at least 1000 times. On the other hand, there are tags used fewer than 10 times. The underlying co-occurrence graph links very frequent tags with very rarely used tags. As a result, more frequent tags are included in the tag clouds, which lowers relevance.
