Modeling and Predicting User Interests based on Taxonomy. Makoto Nakatsuji


Abstract

In this thesis, we analyze user interests based on a domain-specific taxonomy. We propose modeling user interests and measuring the similarity of users according to the taxonomy of the domain, and we apply our method to recommender systems. We propose identifying novel topics: topics that include new concepts that are likely to be interesting to the user even though those concepts are not present in the user profile. We try to expand user interests significantly by letting the user browse those topics.

Recommender systems are widely used by content providers to drive their commercial success. Many content providers adopt methods based on collaborative filtering (CF), a broad term for the process of recommending items to an active user (the user who receives the recommendation) based on the intuition that users who access the same items as the active user tend to have similar interests. Basic CF methods measure the similarity of users only from their co-rating behavior on items, and compute recommendations for the active user by analyzing the items possessed by the users most similar to that user. As a result, they are apt to recommend the types of items that the user has already accessed. For example, if the user rates a horror movie highly, typical CF methods recommend items that were made by the same director, performed by the same actors, or included in the same genre, horror. Those items are not truly novel, since they are often already known to the user or easily discovered by the user.

We also apply our method to knowledge management in a system development domain. We extract user knowledge of system development from the mail accumulated for a system development project. Developers responsible for handling modules or development procedures often collaborate with each other in the course of their work. Given the long development schedules common for complex projects, some turnover of personnel must be accepted. It is essential that new people be able to utilize the know-who and know-how information created by the original experts, as contained in the logs of message systems.

With respect to the above issues, the thesis studies the following topics.

1. Identifying novel topics based on user interests

We introduce a method that extracts user interests from users' blog entries and measures the similarity of users according to the taxonomy of items. We also introduce a new measure, the score of novelty, to understand how novel the recommended items are for the user based on the taxonomy. This metric is useful in two ways. First, it presents, in an easy-to-understand manner, the relationship between the user's present interests and the target item. That is, the user can understand why the presented items are different from those the user has accessed before. Second, the user can raise the novelty threshold if she wants items that are completely unknown to her.

We start with a proposal to build user interests according to a taxonomy of items. We consider that users who like items may also like the classes that include those items. Our method thus reflects the user's rating of an item in the rating of each class that includes that item. We then measure the similarity of users by using not only co-rating behavior on items but also co-rating behavior on classes in the taxonomy. As a result, we can accurately identify many items for the user by analyzing the items of users who share the same items and/or the same classes with the user.
Then, it creates a user graph whose nodes are users; weighted edges are set between users according to their similarity. It performs Random Walk with Restarts over the user graph and extracts user nodes that the walk passes frequently, even though the weights on the edges from the starting node to those nodes are not high. The users so extracted are likely to have items with high novelty for the starting node user. An offline evaluation conducted on several datasets finds that our method identifies more novel items with higher accuracy than previous methods.

We also perform an online experiment to analyze user reactions to the topics recommended based on our assessments. By analyzing the frequency of user access to the novel items output by our recommendation scheme over time, we confirmed the effectiveness of our novel topic recommendation. We found that the novel topics recommended by our technique were used for creating new communication links between users; this was confirmed by evaluating the frequency of comments between users who came to know each other through our online recommendations.

2. Cross-domain recommendations over domain-specific user graphs

Content providers want to make recommendations across multiple interrelated domains such as music and movies. However, existing collaborative filtering methods fail to accurately identify items that may be interesting to the user but that lie in domains the user has not accessed before. Our method is based on the observation that users who share similar items or who share social connections can provide recommendation chains (sequences of transitively associated edges) to items in other domains. It first builds domain-specific-user graphs (DSUGs) whose nodes, users, are linked by weighted edges that reflect the similarity of users. It then connects the DSUGs via the users who rated items in several domains, to create a cross-domain-user graph (CDUG). It performs Random Walk with Restarts on the CDUG to extract user nodes that are related to the starting user node on the CDUG even though they are not present in the DSUG of the starting user node. Then it incorporates the items possessed by those users into the recommendations for the starting node user. Furthermore, to extract many more user nodes, we employ our proposed taxonomy-based similarity measure, which states that users are similar if they share the same items and/or the same classes. Thus we can set many suitable routes from the starting user node to other user nodes in the CDUG. An evaluation of implicit user ratings of items in two interrelated domains, as extracted from a blog portal, indicates that our method identifies potentially interesting items in other domains with higher accuracy than is possible with existing CF methods.

3. Analyzing the developer's knowledge based on development-related taxonomies

Product developers frequently discuss topics related to their development project with others, but often use technical terms whose meanings are not clear to non-specialists. To provide non-experts with a precise and comprehensive understanding of the information discussed, the method proposed herein categorizes the messages using a taxonomy of the products developed and a taxonomy of the tasks relevant to those products. The instances in the taxonomy are products and/or tasks manually selected as relevant to the system development. We apply our previously proposed method, which extracts user interests from blogs, to extracting developers' knowledge from the mail messages accumulated in mailing lists. The problem is that there is no existing taxonomy related to system development; thus we semi-automatically enrich the taxonomy from the mail accumulated in the mailing list for the system development project. Using this expanded taxonomy, we can analyze user knowledge accurately. This provides a concrete application example that drives forward taxonomy-based knowledge management.

Thus, in this research we have proposed a taxonomy-based method of user knowledge analysis from the viewpoints of managing users' knowledge and providing novelty to the user. The method proposed in this thesis sets a research direction for taxonomy-based knowledge management of users and for novel item identification in future recommender systems.

Acknowledgments

I wish to express my sincere gratitude to my supervisor, Professor Toru Ishida at Kyoto University, for his continuous guidance, valuable discussion, and advice. I am grateful to my thesis committee at Kyoto University, Professor Katsumi Tanaka and Professor Toyoaki Nishida, for their valuable comments. Associate Professor Shigeo Matsubara and Assistant Professor Hiromitsu Hattori always gave me kind advice and helpful comments. The coordinators at the Ishida & Matsubara laboratory, Ms. Yoko Kubota and Ms. Terumi Kosugi, regularly helped me with my office tasks. I wish to thank all members of the Ishida & Matsubara laboratory for their support in numerous ways.

I appreciate the constructive discussions and kind advice from my colleagues, Dr. Yoshihiro Otsuka, Dr. Ko Fujimura, Mr. Tadasu Uchiyama, Mr. Makoto Yoshida, Mr. Akimichi Tanaka, Dr. Toshio Uchiyama, Mr. Tatsuyuki Kimura, Mr. Yu Miyoshi, and Mr. Yasuhiro Fujiwara. I wish to thank all members of NTT Cyber Solutions Laboratories and NTT Network Service Systems Laboratories for their support in numerous ways.

Finally, I want to express my gratitude to my family: my parents Hiroshi Nakatsuji and Mieko Nakatsuji, my elder brothers Satoru Nakatsuji and Susumu Nakatsuji, and my wife Kumiko Nakatsuji. They gave me sincere support and encouragement.

Contents

1 Introduction
  Objectives
  Research Issues
  Thesis Outline

2 Background
  Collaborative filtering
  Accurate Item Recommendation
  Novel Item Recommendation
  Cross-domain Recommendation
  Random Walk with Restarts

3 Identifying novel topics based on user interests
  Introduction
  Background and our purpose
  Approaches
  Impact of applications based on our method
  Related works
  Collaborative Filtering
  Interest ontology extraction
  Designing service-domain ontology
  Interest ontology generation algorithm
  Introducing interest weight to ontology
  Detecting novelty by similarity measurements
  3.5.1 Interest-weight-based similarity measurement
  Innovative blog-entry detection
  Offline Experimental results
  Datasets and methodology
  Measuring interest distributions of blog users
  Measuring performance of extracted interest ontology
  Comparing filtering algorithms
  Analyzing size of user-oriented community
  Measuring performance of detecting novel topics
  Online Experimental results
  Explaining our online experiment
  Evaluating recommendation results based on extracted interest ontology
  Evaluating detection of novel topics
  Evaluating activation of blog community
  Summary
  Discussion
  Summary

4 Analyzing accuracy and novelty of taxonomy-based recommendations
  Introduction
  Related Works
  Background
  Collaborative Filtering
  Random Walk with Restarts
  Proposed Method
  Problem Definition
  Modeling user interests
  Measuring user similarity
  Identifying highly novel items from the user graph
  Evaluation
  Datasets
  4.5.2 Methodology
  Compared methods
  Results on accuracy
  Results on item novelty
  Results when extending movie taxonomy
  Results when using restaurant dataset
  Summary

5 Cross-domain recommendations over domain specific user graphs
  Introduction
  Related Works
  Background
  Collaborative Filtering
  Random Walk with Restarts
  Method
  Creating a cross-domain-user graph (CDUG)
  Identifying items in other domains
  Modeling user interests
  Measuring similarity of users
  Evaluation
  Datasets
  Methodology
  Compared methods
  Results
  Summary

6 Analyzing the developer's knowledge based on development-related taxonomies
  Introduction
  Related Works
  Method
  The design of taxonomies
  Method
  6.3.3 Analyzing know-who using expanded taxonomy
  Analyzing semantic relationships between similar users
  Analyzing know-how using expanded taxonomies
  Evaluation
  Dataset and methodology
  Evaluating the characteristics of relevant concepts
  Evaluating the effectiveness of know-who analysis
  Evaluating the effect of know-how analysis
  Summary

7 Conclusion

List of Figures

3.1 Procedure for designing service-domain ontology
Procedure for generating interest ontologies
Hops in filtering algorithms
Applying interest weight to ontology
Measuring similarity based on degree of interest agreement
Definitions of terms and examples
Community creation service of recommending innovative blog entries
Experimental results of user distributions and ontology extraction
Experimental results of our ontology extraction and detection of novel topics
(a) number of users obtained by changing X. (b) number of users that have high interest weight after changing X
Snapshot of online experimental service DoblogMusic
User accesses to DoblogMusic
Increasing user-user communication through DoblogMusic
Taxonomy in movie domain
An explanation of score of novelty
Measuring similarity of user a and u
Example of extended taxonomy of MovieLens items
Example of creating the column-normalized adjacency matrix of the CDUG
5.2 Measuring similarity of users a and u
MAE when we set T D =
Image of taxonomy with relevant and unified concepts
Procedure of assigning s
Explanatory image of classifying phrases to concepts
Assigning taxonomy-based semantic tags to the relationships between users
Accuracy of extracting relevant concepts
Results of know-who analysis. (X axis indicates the number of users and Y axis indicates accuracy of the results.)
Results of know-how analysis. (MS means mail set.)

List of Tables

4.1 Definition of main symbols
MAE against movie dataset
MAE against non-Japanese music dataset
MAE against Japanese music dataset
Prediction coverage versus item novelty against non-Japanese music dataset
Prediction coverage versus item novelty against Japanese music dataset
MAE against movie dataset with extended taxonomy
MAE against restaurant dataset
Prediction coverage versus item novelty against restaurant dataset

Chapter 1 Introduction

1.1 Objectives

Recommender systems are widely used by content providers to drive their commercial success. Many content providers adopt methods based on collaborative filtering (CF), a broad term for the process of recommending items to an active user (the user who receives the recommendation) based on the intuition that users who access the same items as the active user tend to have similar interests. Basic CF methods measure the similarity of users only from their co-rating behavior on items, and compute recommendations for the active user by analyzing the items possessed by the users most similar to that user. As a result, they are apt to recommend the types of items that the user has already accessed. For example, if the user rates a horror movie highly, typical CF methods recommend items that were made by the same director, performed by the same actors, or included in the same genre, horror. Those items are not truly novel, since they are often already known to the user or easily discovered by the user.

One solution to this problem is to increase the diversity of the items in the recommendation list for the active user, so that the list includes items from several classes defined by a taxonomy. However, such schemes fail to consider the semantic relationships between a user and the items recommended to that user. These semantic relationships are necessary if the user is to accept recommended items, especially when the user has not thought of those items before.

Our purpose is to expand the user's interests significantly by identifying topics that do not lie in the classes the user has accessed before, and recommending them to the user. However, it is generally difficult to understand which types of classes the user is interested in and which types of items do not lie in the classes the user has accessed before.

In this thesis, we propose modeling user interests in detail following a taxonomy of items, and then measuring the similarity of users according to that taxonomy. By analyzing the relationships between the user's present knowledge and the recommended items, we can understand what types of items are recommended to the user and how far the recommended items are from the user's present interests. We also apply our taxonomy-based user knowledge modeling to a system development domain and analyze which types of development activities the user understands well.

1.2 Research Issues

In this thesis, we address the following research issues.

1. How to analyze the knowledge of users
2. How to identify items that are not known to the user but may be interesting to the user

To solve the former problem, we introduce a taxonomy-based approach. A taxonomy of items is designed by the service provider to enable customers to access their preferred items easily. Thus, we consider that users who like items may also like the classes that include those items. Our method therefore reflects the user's rating of an item in the rating of each class that includes that item. Next, it measures the similarity of users by using not only co-rating behavior on items but also co-rating behavior on classes in the taxonomy. As a result, we can accurately identify many items for the active user by analyzing the interests of users who share the same items and/or the same classes with the active user.

In the system development domain, we apply our taxonomy-based approach to extracting developers' knowledge from the mail messages accumulated in mailing lists. The problem is that there is no existing taxonomy related to system development; thus we semi-automatically enrich the taxonomy from the mail accumulated in the mailing list for the system development project. Using this expanded taxonomy, we can analyze user knowledge accurately.

To solve the latter problem, we define the score of novelty as the smallest number of hops, over the taxonomy, from a class the user has accessed before to the class that includes a candidate item. It indicates the relationship between the present interests of a user and the items recommended to that user, using a taxonomy of items defined by service designers. With this measure, the active user can understand what types of items are recommended and how far the recommended items are from his or her present interests. Here, we consider a concept to be a class defined in the taxonomy created in each service domain. When items are presented with supporting information such as their novelty, the user can more readily become interested in items not stored in his or her profile and so acquire new interests.

Furthermore, to identify items with higher novelty, we introduce a graph-based approach. We consider that users who are similar to the active user tend to share the same items in the same classes with that user, and so are not likely to provide items with high novelty for the user. We create a user graph whose nodes are users and whose weighted edges are set between users according to their similarity.
We perform Random Walk with Restarts (RWR) [60] on the user graph and try to extract user nodes at which the walk arrives frequently even though the weights on the edges from the starting node to those nodes are small; small weights mean that such users share fewer items and/or fewer classes with the active user. We then incorporate the items held by the users so discovered when computing the prediction values for the starting node user, in order to identify items with higher novelty.
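
The score of novelty defined in this section, the smallest number of hops over the taxonomy from a class the user has accessed to the class that contains a candidate item, can be illustrated with a breadth-first search. The following is only a minimal sketch under stated assumptions: the function name `score_of_novelty`, the toy movie taxonomy, and the class names are hypothetical, and the sketch assumes that parent-child links in the taxonomy may be traversed in both directions and that an item in an already accessed class scores 0.

```python
from collections import deque


def score_of_novelty(taxonomy_edges, accessed_classes, item_class):
    """Return the fewest taxonomy hops from any accessed class to item_class.

    Returns None when item_class cannot be reached from an accessed class.
    """
    # Assumption: parent-child links can be walked in both directions.
    neighbors = {}
    for parent, child in taxonomy_edges:
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)

    # Breadth-first search started from every accessed class at once.
    visited = set(accessed_classes)
    queue = deque((cls, 0) for cls in accessed_classes)
    while queue:
        current, hops = queue.popleft()
        if current == item_class:
            return hops
        for nxt in neighbors.get(current, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Hypothetical movie taxonomy, written as (parent class, child class) pairs.
TAXONOMY = [
    ("Movie", "Horror"),
    ("Movie", "Comedy"),
    ("Horror", "Slasher"),
    ("Horror", "Gothic"),
    ("Comedy", "Romantic comedy"),
]

accessed = {"Slasher"}
for candidate in ("Slasher", "Gothic", "Romantic comedy"):
    print(candidate, score_of_novelty(TAXONOMY, accessed, candidate))
```

In this toy taxonomy, an item in a sibling class of an accessed class is two hops away, while an item under a different genre is four hops away, which matches the intuition that items further from the user's present interests are more novel.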

We also try to identify items of interest in domains that the active user has not accessed before. This will be a key tool for content providers that want to offer items across multiple interrelated domains, especially when they have large rating datasets in some domains but can collect only limited rating datasets in others. Our approach is based on the observation that users who share similar items or who share social connections can provide recommendation chains (sequences of transitively associated edges) to items in other domains. It first builds domain-specific-user graphs (DSUGs) whose nodes, users, are linked by weighted edges that reflect the similarity of users. It then connects the DSUGs via users who rated items in several domains, or via users who share social connections, to create a cross-domain-user graph (CDUG). It performs Random Walk with Restarts on the CDUG to extract user nodes that are related to the starting user node on the CDUG even though they are not present in the DSUG of the starting user node. Then it incorporates the items possessed by those users into the recommendations for the starting node user.

1.3 Thesis Outline

This thesis consists of seven chapters, including this introductory chapter.

Chapter 2 introduces the background of this thesis and describes existing studies of user modeling and novel topic identification using collaborative filtering. We also introduce Random Walk with Restarts (RWR), which measures the relatedness of two nodes in a graph, because we use RWR to measure the relatedness between users. First, we give an overview of collaborative filtering. Second, we review related work on collaborative filtering, which can be categorized into three parts: accurate item recommendation, novel item recommendation, and cross-domain recommendation. Third, we look at uses of RWR in information retrieval and recommendation studies.

Chapter 3 introduces the notion of novel topics: topics that include new concepts that are likely to be interesting to the user even though those concepts are not present in the user profile. We try to expand user interests significantly by letting the user browse those topics. We introduce a new measure, the score of novelty, to understand how novel the recommended items are for the user, and we try to identify items with high novelty for the user while also guaranteeing highly accurate recommendation results. We first build the interests of a user as a hierarchy of classes in which a rating value of the user is assigned to each class and item. Next, we measure the similarity of users using user ratings of items as well as of classes, and generate a user group that has high similarity to the user. The novel topics for the user are then identified with the score of novelty, by determining a suitable size for the user group and analyzing the items possessed by the users in that group. We perform an online experiment to analyze user reactions to the topics recommended based on our assessments. By analyzing the frequency of user access to the novel items output by our recommendation scheme over time, we confirmed the effectiveness of our novel topic recommendation. We found that the novel topics recommended by our technique were used for creating new communication links between users; this was confirmed by evaluating the frequency of comments between users who came to know each other through our online recommendations.

Chapter 4 analyzes our taxonomy-based recommendation method from the viewpoint of the accuracy and novelty of its prediction results. Our method takes two approaches. First, it measures the similarity of users using the items rated by users and a taxonomy of items, which lets it accurately identify many items for the user. Second, it creates a user graph whose nodes are users; weighted edges are set between users according to their similarity.
It performs Random Walk with Restarts over the user graph and extracts user nodes that the walk passes frequently, even though the weights on the edges from the starting node to those nodes are not high. The users so extracted are likely to have items with high novelty for the starting node user. An evaluation conducted on several datasets finds that our method identifies more novel items with higher accuracy than previous methods.
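
The Random Walk with Restarts step described above can be sketched with a simple power iteration. This is an illustrative sketch rather than the thesis's implementation: the function name, the restart probability of 0.15, the iteration count, and the four-user graph are all assumptions. Every user that appears as a neighbor is assumed to also appear as a key of `edge_weights`. The toy graph is built so that user c has a weaker direct edge to the starting user a than user d does, yet c is reachable through the strongly connected user b.

```python
def random_walk_with_restarts(edge_weights, start, restart=0.15, iterations=100):
    """Approximate RWR visiting probabilities by power iteration.

    edge_weights maps each user to a dict {neighbor: similarity weight}.
    """
    users = list(edge_weights)
    total_out = {u: sum(edge_weights[u].values()) for u in users}
    scores = {u: 1.0 if u == start else 0.0 for u in users}
    for _ in range(iterations):
        # With probability `restart` the walker jumps back to the start node.
        updated = {u: restart if u == start else 0.0 for u in users}
        for u in users:
            if total_out[u] == 0:
                # Assumption: a user without edges sends the walker home.
                updated[start] += (1.0 - restart) * scores[u]
                continue
            for v, weight in edge_weights[u].items():
                # Move along an edge in proportion to its normalized weight.
                updated[v] += (1.0 - restart) * scores[u] * weight / total_out[u]
        scores = updated
    return scores


# Hypothetical symmetric user graph; weights stand in for user similarity.
USER_GRAPH = {
    "a": {"b": 0.9, "c": 0.1, "d": 0.2},
    "b": {"a": 0.9, "c": 0.9},
    "c": {"a": 0.1, "b": 0.9},
    "d": {"a": 0.2},
}

rwr_scores = random_walk_with_restarts(USER_GRAPH, start="a")
for user, score in sorted(rwr_scores.items(), key=lambda kv: -kv[1]):
    print(user, round(score, 3))
```

In this toy graph the walk visits c more often than d even though the direct edge from a to c is the weaker one, which is the kind of user node the method aims to extract.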

Chapter 5 extends our taxonomy-based recommendation to identify items that may be interesting to the user but that lie in domains the user has not accessed before. Content providers want to make recommendations across multiple interrelated domains such as music and movies. However, existing collaborative filtering methods fail to accurately identify such items. Our method is based on the observation that users who share similar items or who share social connections can provide recommendation chains (sequences of transitively associated edges) to items in other domains. It first builds domain-specific-user graphs (DSUGs) whose nodes, users, are linked by weighted edges that reflect the similarity of users. It then connects the DSUGs via users who rated items in several domains, or via users who share social connections, to create a cross-domain-user graph (CDUG). It performs Random Walk with Restarts on the CDUG to extract user nodes that are related to the starting user node on the CDUG even though they are not present in the DSUG of the starting user node. Then it incorporates the items possessed by those users into the recommendations for the starting node user. Furthermore, to extract many more user nodes, we employ a taxonomy-based similarity measure, which states that users are similar if they share the same items and/or the same classes. Thus we can set many suitable routes from the starting user node to other user nodes in the CDUG. An evaluation of implicit user ratings of items in two interrelated domains and of users' social connection histories, as extracted from a blog portal, indicates that our method identifies potentially interesting items in other domains with higher accuracy than is possible with existing CF methods.

Chapter 6 applies our previously proposed method, which extracts user interests from blogs, to extracting developers' knowledge from the mail messages accumulated in mailing lists.
The problem is that there is no existing taxonomy related to system development; thus we semi-automatically enrich the taxonomy from the mail accumulated in the mailing list for the system development project. Using this expanded taxonomy, we can analyze user knowledge accurately. This provides a concrete application example that drives forward taxonomy-based knowledge management.

Chapter 7 summarizes the main contributions of the thesis and concludes it by summarizing the results obtained through this research. We also address the prospects for future research.

Chapter 2 Background

Our method extends CF and uses RWR to identify items with high novelty. Thus, we explain both in this chapter.

2.1 Collaborative filtering

CF methods can be classified into two approaches: memory-based CF and model-based CF. Memory-based CF is based on the assumption that each user belongs to a larger group of similarly behaving users. This method is referred to as user-oriented memory-based CF; an analogous method that builds item similarity groups using co-purchase history is known as item-oriented[89]. On the other hand, model-based CF generates predictions by using a model that is optimized on training data. Clustering[75, 94] and Bayesian network models are examples of the model-based approach[76, 95].

In computing the similarity of users, basic CF methods often use the Pearson correlation approach[88] or the cosine-based approach[68]. If we define $M$ as the number of items rated by both user $a$, who is to receive the recommendation, and user $u$, $r_{a,I_i}$ as the rating value of user $a$ for item $I_i$, and $\bar{r}_a$ as the average value of the item ratings given by $a$, the Pearson correlation coefficient measures the similarity $S(a,u)$ between $a$ and $u$ according to equation (2.1).
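
The similarity and prediction formulas of basic memory-based CF, numbered (2.1) to (2.3) in this section, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names and the dictionary-based rating layout are hypothetical, sums run over the items rated by both users, each user's mean is taken over all items that user rated, and a similarity of 0.0 is returned when a denominator vanishes.

```python
from math import sqrt


def mean_rating(ratings):
    """Average of all ratings given by one user ({item: rating})."""
    return sum(ratings.values()) / len(ratings)


def pearson_similarity(ratings_a, ratings_u):
    """Pearson correlation over co-rated items, as in equation (2.1)."""
    common = ratings_a.keys() & ratings_u.keys()
    mean_a, mean_u = mean_rating(ratings_a), mean_rating(ratings_u)
    numerator = sum((ratings_a[i] - mean_a) * (ratings_u[i] - mean_u) for i in common)
    norm_a = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    norm_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    return numerator / (norm_a * norm_u) if norm_a and norm_u else 0.0


def cosine_similarity(ratings_a, ratings_u):
    """Cosine similarity over co-rated items, as in equation (2.2)."""
    common = ratings_a.keys() & ratings_u.keys()
    numerator = sum(ratings_a[i] * ratings_u[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    return numerator / (norm_a * norm_u) if norm_a and norm_u else 0.0


def predict_rating(all_ratings, active, item, neighbors):
    """Predicted rating of `active` on `item`, as in equation (2.3)."""
    mean_a = mean_rating(all_ratings[active])
    weighted_sum = 0.0
    similarity_sum = 0.0
    for u in neighbors:
        if item not in all_ratings[u]:
            continue  # only neighbors who rated the item contribute
        s = pearson_similarity(all_ratings[active], all_ratings[u])
        weighted_sum += (all_ratings[u][item] - mean_rating(all_ratings[u])) * s
        similarity_sum += s
    if similarity_sum == 0:
        return mean_a  # assumption: fall back to the user's own mean
    return mean_a + weighted_sum / similarity_sum


# Hypothetical ratings for two users on a 1-to-8 scale.
RATINGS = {
    "a": {"x": 1, "y": 2, "z": 3},
    "u": {"x": 2, "y": 4, "z": 6, "w": 8},
}
print(pearson_similarity(RATINGS["a"], RATINGS["u"]))
print(predict_rating(RATINGS, "a", "w", neighbors=["u"]))
```

The prediction adds the neighbor's deviation from their own mean to the active user's mean, which is how the Pearson-based scheme accounts for users who rate on different scales.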

\[ S(a,u) = \frac{\sum_{i=1}^{M} (r_{a,I_i} - \bar{r}_a)(r_{u,I_i} - \bar{r}_u)}{\sqrt{\sum_{i=1}^{M} (r_{a,I_i} - \bar{r}_a)^2}\,\sqrt{\sum_{i=1}^{M} (r_{u,I_i} - \bar{r}_u)^2}} \tag{2.1} \]

When we use the cosine-based approach, we compute the similarity $S(a,u)$ between $a$ and $u$ according to equation (2.2).

\[ S(a,u) = \frac{\sum_{i=1}^{M} r_{a,I_i}\, r_{u,I_i}}{\sqrt{\sum_{i=1}^{M} r_{a,I_i}^2}\,\sqrt{\sum_{i=1}^{M} r_{u,I_i}^2}} \tag{2.2} \]

The advantage of the Pearson correlation approach is that it takes into account that different users might have different rating schemes. If we assume $N$ is the set of users that are most similar to the active user $a$, the predicted rating of $a$ on item $I_i$, $p_{a,I_i}$, is obtained by equation (2.3).

\[ p_{a,I_i} = \bar{r}_a + \frac{\sum_{u \in N} (r_{u,I_i} - \bar{r}_u)\, S(a,u)}{\sum_{u \in N} S(a,u)} \tag{2.3} \]

Below, we review related work on collaborative filtering, which can be categorized into three parts: accurate item recommendation, novel item recommendation, and cross-domain recommendation.

2.1.1 Accurate Item Recommendation

In most CF studies, researchers focus on improving the accuracy of the prediction results. Here, we explain some of those works, such as those using matrix factorization, taxonomy-based techniques, and graph mining techniques.

Koren et al. proposed a method that uses matrix factorization, which characterizes both items and users by vectors of factors inferred from item rating patterns[77, 78]. High correspondence between item and user factors leads to a recommendation. These methods have become popular in recent years by combining good scalability with predictive accuracy. However, their method does not aim to suggest novel items for the user. Furthermore, the matrix factorization technique usually analyzes latent factors inferred from item rating patterns, so it is not easy for a user to understand why the identified items are recommended. We consider that presenting semantic reasons to a user is important, especially when recommending novel items. Thus, we use the taxonomy of items to explain why the presented items are novel for the active user.

Some researchers use a taxonomy of items to raise the accuracy of prediction results[96]. Their method was shown to be useful when the transaction data of users is sparse. However, in measuring user similarity, their method focuses only on the classes that include items rated by both users, and on their super classes. As a result, this method naively assumes that users who share many items with the active user are highly similar to that user; those users may have many good items, but also many not so good items, for the user. Our method differs from the previous taxonomy-based method because it focuses on the width of user interests according to the taxonomy. The width of user interests is computed by checking, for each class in the taxonomy, how many of its sub-classes the user is interested in. This is based on our observation that users' interests are not always of the same type even when the users share many sub-classes. For example, users who love only the rock genre are a somewhat different type of user from those who love both the rock and classical genres. By carefully capturing such nuances when measuring the similarity of users, we can accurately identify many items for the active user by analyzing the interests of users who share the same items and/or the same classes with the active user.

The authors of [90, 91] assign a priori scores to the classes in the taxonomy of items, and compute the relationships between the scores assigned to different classes. Then they propagate those scores for a specific user to predict that user's preferences.
Their method was also shown to be useful when the transaction data of users is sparse. More recently, they learn item taxonomies autonomously by using clustering algorithms, and improve the prediction accuracy[92]. Their method is outside the scope of CF methods (they call their approach ontology filtering) because it does not compute similarities of users. We consider that the taxonomy can be used to explain what types of items are recommended to the active user and why users are computed as similar even if they do not share any items. Thus, we measure the similarity of users using the taxonomy of items and recommend each item to the active user together with its score of novelty.

Some researchers have started to use random walks or RWR on a graph to compute recommendations[95, 70, 76, 69, 87]. Yildirim and Krishnamoorthy perform random walks on an item graph whose nodes are items and whose weighted edges are set between item nodes according to item similarity; they confirmed that their approach overcomes the sparsity problem, which decreases the accuracy of prediction results when the transaction dataset is sparse. Some researchers also use graph analysis techniques to study recommendations[72, 83, 67, 73]. For example, to solve the sparsity problem using a graph-based method, Huang proposed a method that follows transitively associated edges on a graph whose nodes are items and users[72]. However, to the best of our knowledge, no study has identified items with higher novelty using random walks on the user graph or using other graph-based methods.

2.1.2 Novel Item Recommendation

The notion of "novel" is defined in different ways in related papers. To the best of our knowledge, several studies define novel items as items that are not known to the user but are interesting for the user[71]. However, this definition is very abstract, which makes it difficult to evaluate item novelty in detail. Onuma et al. proposed a method that identifies novel items (which they call "surprising") as items that are accessed by users similar to the active user but also accessed by users not so similar to the active user[87]. Their idea is to treat the problem of identifying novel items as node selection on a graph, giving high scores to nodes that are well connected to the older choices and, at the same time, well connected to unrelated choices.
Their evaluation example shows that their method generates more diverse recommendation results, such as recommending surprising items in comedy,

horror, and SF movies to a user who only likes comedy movies. We consider the score of novelty in our paper to be a more natural definition because it uses the taxonomy of items to let the active user understand why the recommended items are novel. The authors of [90, 91] define novel items as those identified by a certain method but not by other methods. Their evaluation shows that their proposed method can identify items that cannot be identified by a method that only ranks items by popularity. We consider that such a metric evaluates popularity rather than novelty. As explained above, we consider that the definitions in previous works do not help the active user understand why the recommended items are judged to be novel or how novel they are. In contrast to these works, Nakatsuji et al. proposed a taxonomy-based algorithm that finds novel items, defined as items included in classes that the active user has not accessed before (in their paper, they call these innovative items). Their online evaluation shows that user clicks tend to concentrate on novel items[85]. Unfortunately, they focused on the application of novel item recommendation and did not investigate how accurate and how novel the items predicted by their method are, nor did they compare the accuracy and novelty of the identified items with those predicted by other CF methods. We improve and confirm the prediction accuracy by measuring the similarity of users while considering the width of user interests according to the taxonomy. Furthermore, our method identifies items with higher novelty for the active user by applying a graph-based approach. Herlocker and his co-workers also noted that novel items and serendipitous items are different, though both are unknown to the user but interesting for the user[71].
The difference is that the former is more easily found by the user than the latter. Our method does not distinguish novel items from serendipitous items. Instead, it makes users aware of how far the recommended items are from their present interests through our proposed measure, the score of novelty. However, as the reader can naturally imagine, items with

high novelty for the user cannot be easily discovered by the user. For example, a user who has so far shown interest only in music items in Classic cannot easily discover interesting items in Jazz on his own. Our evaluation, described later, also shows that existing CF methods have difficulty in accurately identifying items with high novelty for the user. Our evaluation did not explicitly treat serendipitous items because the evaluation dataset was taken from user access histories. However, the previous online evaluation by Nakatsuji et al.[85] presented novel (or serendipitous) items that were not included in the users' access histories, and confirmed that actual users were excited by those items.

Cross-domain Recommendation

Related to the studies of novel item identification, a few recent works address cross-domain recommendation[79, 80, 84], which predicts items located in domains in which the active user has not explicitly shown interest before. Bin and his co-workers analyze users who exhibit similar rating behaviors against items across several item domains[79, 80]. Their method shares the knowledge learned from the rating datasets of multiple item domains even when the users and items of these datasets do not overlap. Nakatsuji et al. also proposed cross-domain recommendation based on the observation that users who share similar items or social connections can provide recommendation chains (sequences of transitively associated edges) to items in other domains[84]. We consider, however, that novel item identification within a domain is still important because a user who accesses items in a domain has already expressed interest in that domain. Thus, it is natural to present novel items within the domain to expand his interests, by analyzing user interests in detail based on the domain-specific taxonomy.

2.2 Random Walk with Restarts

A graph is a natural representation of data that have some inherent relational structure. In a graph, objects and their relationships can be represented as nodes and weighted edges respectively, where weights denote the strength of a relationship. Measuring the relatedness of two nodes in the graph can be achieved by using RWR theory[60]. Starting from node $a$, an RWR is performed by following a randomly selected link to another node at each step. Additionally, at every step there is a probability, $\alpha$, that the walk restarts at $a$. Let $p^{(t)}$ be a column vector where $p^{(t)}_u$ denotes the probability that the random walk at step $t$ is at node $u$. $q$ is a column vector whose elements are set to zero, except for the element corresponding to $a$, which is set to one, i.e. $q(a) = 1$. Also let $A$ be the column-normalized adjacency matrix of the graph. In other words, $A$ is the transition probability table whose element $A(u,v)$ gives the probability of $v$ being the next node given that the current node is $u$. The stationary probabilities for each node can be obtained by recursively applying equation (2.4) until convergence, and they give us the long-term visit rate of each node with a bias towards a particular starting node.

$$p^{(t+1)} = (1 - \alpha) A p^{(t)} + \alpha q \qquad (2.4)$$

Therefore, $p^{(l)}_u$, where $l$ is the state after convergence, can be considered as a measure of relatedness between nodes $a$ and $u$.
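The iteration in equation (2.4) can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary-based graph format, the function name `random_walk_with_restart`, the default value of `alpha`, the convergence tolerance, and the handling of nodes without outgoing edges are all assumptions made for the example and are not taken from the thesis.

```python
def random_walk_with_restart(adjacency, start, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - alpha) * A p + alpha * q until the update is below tol.

    adjacency[u][v] is the weight of edge u -> v (hypothetical input format).
    Returns a dict mapping each node u to its stationary probability.
    """
    # Collect every node that appears as a source or as a target.
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Total outgoing weight of each node, used to normalise transition probabilities.
    out_weight = {u: sum(adjacency.get(u, {}).values()) for u in nodes}

    q = {u: 1.0 if u == start else 0.0 for u in nodes}  # restart vector, q(a) = 1
    p = dict(q)                                         # the walk starts at node a

    for _ in range(max_iter):
        nxt = {u: alpha * q[u] for u in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                # Assumption: a node without outgoing edges sends the walk back to the start.
                nxt[start] += (1 - alpha) * p[u]
                continue
            for v, w in adjacency[u].items():
                nxt[v] += (1 - alpha) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical three-node chain a - b - c with symmetric unit weights.
chain = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
scores = random_walk_with_restart(chain, "a")
```

In this toy chain, node c can only be reached through b, so its score comes out lower than that of both a and b, which matches the reading of the stationary probability as relatedness to the start node.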

Chapter 3 Identifying novel topics based on user interests

3.1 Introduction

In this section, we first describe the background and purpose of our study. We then explain our approach to identifying novel topics and describe the impact of applications based on our method.

Background and our purpose

Blogs are becoming more popular for publishing and discussing shared interests among users. Information sharing systems for blogs could enable users to expand their interests by browsing the collections of blog entries published by other users. However, to retrieve information from blog entries, current blog services simply employ keyword searches of blogs using Google or simple metadata attached to blog entries, i.e. RSS metadata such as titles, creators, and dates. Unfortunately, neither approach offers detailed semantics about the content described in blog entries. Moreover, there is no function for easily generating personalized searches, so users are restricted by their own knowledge or imagination when entering search keywords. Such keyword searches are time consuming and troublesome.

For example, users cannot perform a keyword search unless they understand beforehand, at least to some degree, what they want to search for. Thus, when keywords cannot be specified, information retrieval from blog entries often cannot be performed even if the database contains topics that the user might become interested in. To counteract the above problems, studies on Adaptive Information Filtering (AIF)[35] cooperate with the user in constructing a user profile; recommendations are offered based on the profile. Making a user profile interactively beforehand is good for offering recommendations to users, as indicated by the high accuracy of AIF. Unfortunately, common complaints about AIF are that users need to make their own profiles and that already known information is encountered many times. This is because recommendation systems with conventional AIF only check whether the user is likely to be interested in the document and fail to identify whether the information has already been presented to the user. To filter out these redundant documents, novelty-detection researchers[49] define a novel document as a document that includes new information that is relevant according to the user profile. They extract relevant documents from a document stream and then classify the documents as novel or not; novel documents are provided to the user. Novelty detection, however, is limited to providing documents that offer new information about concepts that are already present in the user profile. In our study[42, 85], we define a novel topic as a topic that includes new concepts that are likely to be interesting to the user even though those concepts are not present in the user profile. The goal is to expand the user's interests significantly by identifying novel topics and recommending them to the user.
In particular, we first focus on novel topic identification in blogs because blogs have become a popular means of publishing and searching for information that can appeal to users.

Approaches

To achieve the above-mentioned goal, we use the following approaches.

We start with a proposal to build user interests according to a taxonomy of items. We consider that users who like items may also like the classes that include those items. Our method thus reflects the rating of the user on an item in the rating of the class that includes that item. In the taxonomy, items are objects that the user is interested in, such as music artists, songs, and movie titles. The classes, on the other hand, are defined using a taxonomy of items in a service domain. For example, we can set classes as genres, which are defined by item sets. By classifying blog entries into each class and item in the taxonomy, we can automatically generate user interests according to the taxonomy. In classifying user entries according to the taxonomy of items, we remove classification mistakes automatically by using the taxonomy of items and the continuity of descriptions about user interests, as explained in our previous papers[85, 42]. Of course, we can also build user interests according to the taxonomy by using the buying and listening histories of users. Next, we measure the similarity of users by considering the degree of interest agreement for each class and item. Most previous techniques for measuring the similarity of users apply the Pearson correlation coefficient or cosine-based similarity to items rated by both users, as we will explain in Section 3.3. In this paper, we build user interests according to the taxonomy of items and measure the similarity of users by using not only co-rating behaviors against items but also those against classes in the taxonomy. By considering the degree of interest agreement for each class and item, we can measure the similarity of users while taking into account the width and depth of a user's interests through the taxonomy of items. As a result, we can identify many items accurately for the user by analyzing the items of users who share the same items and/or the same classes with the user.
We also establish a new evaluation method that determines a suitable size of user group $G_U$, whose users are similar to the active user $a$, who receives recommendations, by

observing the difference between the interests of user $a$ and the interests of users in $G_U$ while changing the size of $G_U$. Finally, novel topics for the active user $a$ are identified by analyzing the classes, $C$, in which users in user group $G_U$ are interested even though $a$ did not explicitly show interest in $C$. We introduce a measure, the score of novelty, to understand how novel the recommended items are for the user, and try to identify items of high novelty for the user while also guaranteeing highly accurate recommendation results[42, 85]. Accuracy is also important because users trust accurate recommendation results and tend to use such services[82]. We define the score of novelty as the smallest number of hops, over the taxonomy, from a class the user accessed before to the class that includes a candidate item. By accurately identifying items that are highly novel to the user and recommending them to him, we give him the chance to accept those items and widen his interests. We show two evaluation steps based on users' implicit ratings of music items extracted from a large number of blog entries collected by the blog portal Doblog. The taxonomy of music artists is provided by ListenJapan. The first step is an offline experiment that evaluates the accuracy in predicting users' hidden interests using our implicit rating dataset and investigates the distribution of user interests extracted from blogs according to the score of novelty. The results show that our method can identify items with higher accuracy than previous methods, including a previous taxonomy-based method[50]. They also show that our method can identify items with higher novelty than the recommendations manually created by the designers at the service provider.

Footnote: Doblog was one of the biggest blog portals in Japan. Unfortunately, Doblog terminated services on May

The second step is an online experiment that analyzes user reactions to the topics recommended by an online experimental service. Most prior works used only offline synthetic data to evaluate their recommendation techniques. However, analyzing the reactions of actual users to recommendations is very important for confirming whether the recommended novel topics are actually effective. By analyzing the frequency of user access to the novel items output by our recommendation scheme over time, we confirmed the effectiveness of our novel topic recommendation. We also found that the novel topics recommended by our technique were used for creating new communication links between users; this was confirmed by evaluating the frequency of comments between users who came to know each other through our online recommendations.

Impact of applications based on our method

Most recommendation schemes fail to consider the semantic relationships between a user and the items that are recommended to the user. Thus, the user cannot easily understand why particular items were recommended. These semantic relationships are necessary if the user is to accept recommended items, especially when the user has not thought of those items before. Our method can attach the score of novelty, which indicates how novel the recommended items are to the user. The score expresses the relationship between the present interests of a user and the items that are recommended to the user, by using a taxonomy of items defined by service designers. That is, our method can recommend to user $a$ content items that belong to a concept that user $a$ does not know of, together with their score of novelty. Here, we consider a concept to be a class defined in the taxonomy created in each service domain.
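The score of novelty, defined above as the smallest number of hops over the taxonomy between a class the user has accessed and the class of a candidate item, can be illustrated with a breadth-first search. The sketch below is a minimal illustration under stated assumptions: the toy taxonomy `PARENT`, the function names, and the choice to count every edge on the path between two classes as one hop are assumptions for this example, not the thesis implementation.

```python
from collections import deque

# Hypothetical toy taxonomy given as child -> parent links.
PARENT = {
    "Rock": "Music",
    "Classic": "Music",
    "Alternative": "Rock",
    "Madchester": "Alternative",
    "Shoegaze": "Alternative",
}


def class_hops(parent, src, dst):
    """Number of edges on the taxonomy path between two classes (assumed hop definition)."""
    # Build an undirected adjacency list from the child -> parent links.
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)

    queue, seen = deque([(src, 0)]), {src}
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("classes are not connected in the taxonomy")


def novelty_score(parent, accessed_classes, item_class):
    """Smallest hop count from any class the user accessed to the item's class."""
    return min(class_hops(parent, c, item_class) for c in accessed_classes)
```

Under these assumptions, an item in a class the user already accessed scores zero, an item in a sibling class scores two, and an item in a distant class such as Classic scores higher for a user who has only accessed Madchester.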
Footnote: We provided an experimental service, DoblogMusic, for Doblog users from August to December

When items are presented with supporting information such as their novelty, the user can more readily become interested in

items not stored in his/her profile and so acquire new interests. An example may help. Consider user A, who has items $I_1$ and $I_2$ under the class Rock in her interests, and suppose that we extract the set of users X whose interests are similar to those of A according to the results of similarity measurements between user A and other users. If there are many users in X who are interested in item $I_3$ under the class Classic, we can recommend item $I_3$ of class Classic to user A together with information indicating its score of novelty, because Classic and Rock are not semantically similar under the definition in the taxonomy of items in the music domain. Thus, we can recommend items to user A with a phrase such as "you may not have heard about item $I_3$ in the Classic genre, but users whose interests are similar to yours are interested in item $I_3$". When unknown items are presented together with the score of novelty, or with phrases like the one above, user A may develop an interest in $I_3$ even though its class is not known to user A, i.e. not stored in the profile of A. User A thus has a chance to expand his/her interests significantly by accessing novel item $I_3$. The paper is organized as follows. Section 3.2 introduces related works and Section 3.3 explains the technical background of the paper. Section 3.4 describes our model of user interests according to the taxonomy of items. The subsequent sections describe our similarity measurement of users using the taxonomy of items and explain how novel topics are identified based on the similarity measurement results. Sections 6.4 and 3.7 describe our offline and online experimental studies, respectively. Section 3.8 concludes this paper.

3.2 Related works

In [21], the authors classify web pages and place them in a topic directory by using pages in the directory and hyperlink relationships among pages. In contrast, we extract interest ontologies and use them for innovative blog entry detection.
Therefore, we do not need a huge volume of web

pages and hyperlink relationships. We classify blog entries by using only a service-domain ontology, and remove classification mistakes by using class characteristics and the continuity of descriptions about user interests. In [22, 20], the authors extract blog community web pages by adapting an existing extraction technique that is similar to the technique in [19] and the PageRank algorithm[24]. The problem in applying the technique in [22, 20] to creating and activating a blog community is that it cannot provide innovative information to users, because pages are only extracted if they already have link relationships. Many online content providers, such as Amazon, offer recommendations based on collaborative filtering[45, 33, 88], which is a broad term for the process of recommending items to users based on the intuition that users within a particular group tend to behave similarly under similar circumstances. One advantage of collaborative filtering techniques is that they can recommend relevant items that are different from those in a user's profile. However, the existing collaborative filtering techniques do not use the taxonomies attached to content items to consider the semantic relationships between user A and the content items recommended to A. As a result, the user cannot understand the semantic reasons why those items are recommended or how innovative the recommended items are, and so is less likely to access the recommendation. To apply a semantic approach to retrieving information from a blog, semblog[23] tries to construct a user profile using a personal ontology, which is a manually constructed classification of a user's blog entries into a category directory of the ontology according to the user's interests. The category directory is built by users beforehand to construct an ontology-mapping-based search framework.
However, manual ontology creation is a time-consuming and troublesome task for users, and applying a semantic ontology to a blog community is difficult. We automatically extract a user-interest ontology; thus, creating and updating ontologies is easy for users. Among research studies of ontology mapping[44, 34, 41], similarity measurements that consider the approximation of classes and class topologies are proposed in [41]. In addition to class topology, we consider each user's weighted interest in each class and instance. Furthermore, by analyzing conjunctions in the class topologies of ontologies with high similarity scores, we detect innovative instances, i.e. those that other users have in their ontologies but the user does not.

3.3 Collaborative Filtering

Our method extends CF to identify novel topics. Thus, we explain CF in this section. CF methods can be classified into two approaches: memory-based CF and model-based CF. Memory-based CF is based on the assumption that each user belongs to a larger group of similarly behaving users. This method is referred to as user-oriented memory-based CF[36]; an analogous method that builds item similarity groups using co-purchase history is known as item-oriented[45]. Model-based CF, on the other hand, generates predictions by using a model that is optimized on training data. Clustering[75, 94] and Bayesian network models[76, 95] are examples of the model-based approach. In computing the similarity of users, basic CF methods often use the Pearson correlation approach[46, 88] or the cosine-based approach[33]. If we define $M$ as the number of items rated by both user $a$ and user $u$, $r_{a,I_i}$ as the rating value of user $a$ for item $I_i$, and $\bar{r}_a$ as the average value of the item ratings given by $a$, the Pearson correlation coefficient measures the similarity $S(a,u)$ between $a$ and $u$ according to equation (3.1).

$$S(a,u) = \frac{\sum_{i=1}^{M} (r_{a,I_i} - \bar{r}_a)(r_{u,I_i} - \bar{r}_u)}{\sqrt{\sum_{i=1}^{M} (r_{a,I_i} - \bar{r}_a)^2}\,\sqrt{\sum_{i=1}^{M} (r_{u,I_i} - \bar{r}_u)^2}} \qquad (3.1)$$

When we use the cosine-based approach, we set $\bar{r}_a$ and $\bar{r}_u$ to zero in equation (3.1). The advantage of the Pearson correlation approach is that it takes into account that different users might have different rating schemes.
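As a concrete illustration of equation (3.1), the following sketch computes the Pearson and cosine-based similarities for two users whose ratings are stored as dictionaries. The function name `user_similarity`, the dictionary input format, and the convention of returning zero when a denominator vanishes are assumptions made for this example.

```python
from math import sqrt


def user_similarity(ratings_a, ratings_u, centered=True):
    """Similarity of two users over their co-rated items.

    ratings_a and ratings_u map item -> rating (hypothetical input format).
    centered=True gives the Pearson form; centered=False sets both averages
    to zero, which gives the cosine-based form.
    """
    common = set(ratings_a) & set(ratings_u)  # the M co-rated items
    if not common:
        return 0.0

    # Averages are taken over all items rated by each user.
    mean_a = sum(ratings_a.values()) / len(ratings_a) if centered else 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u) if centered else 0.0

    numerator = sum((ratings_a[i] - mean_a) * (ratings_u[i] - mean_u) for i in common)
    norm_a = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    norm_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    if norm_a == 0 or norm_u == 0:
        return 0.0  # assumption: undefined similarity is reported as zero
    return numerator / (norm_a * norm_u)


# Hypothetical ratings: u rates on a narrower scale than a but agrees on the ordering.
user_a = {"i1": 5, "i2": 3, "i3": 1}
user_u = {"i1": 5, "i2": 4, "i3": 3}
user_w = {"i1": 1, "i2": 3, "i3": 5}
```

Here `user_a` and `user_u` use different rating scales but order the items identically, so the centered (Pearson) form treats them as fully similar, while `user_w`, who orders the items in reverse, receives a negative similarity.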

If we assume $N$ is the set of users that are most similar to the active user $a$, the predicted rating of $a$ on item $I_i$, $p_{a,I_i}$, is obtained by the following equation (3.2).

$$p_{a,I_i} = \bar{r}_a + \frac{\sum_{u \in N} (r_{u,I_i} - \bar{r}_u)\, S(a,u)}{\sum_{u \in N} S(a,u)} \qquad (3.2)$$

3.4 Interest ontology extraction

We first explain how to design the service-domain ontology of a service domain, with examples for the content delivery services of music and movies, and then describe a method that can automatically extract interest ontologies.

Designing service-domain ontology

We describe the design procedure that supports the generation of interest ontologies. We use OWL (Web Ontology Language)[25] for describing a service-domain ontology in detail. The problem is that most users find it very difficult to design detailed ontologies. Our solution is to permit the use of simple ontologies. These ontologies require only a hierarchical relationship among the classes (subclassof description) and a property description that specifies the enumeration of the instances (oneof description); they restrict the succession condition in the class hierarchy. Our method, described in Section 3.4.2, can automatically extract an interest ontology by classifying user blog entries into service-domain ontologies without user intervention. As shown in Fig. 3.1, the ontology designer first chooses the target service domain for extracting user interests. The designer then chooses metadata that reflects user interests by analyzing the activity of an existing community such as a Bulletin Board System (BBS). In the music domain, the designer chooses metadata of genres or artists, considering that the community is founded on this metadata. Finally, the designer chooses the metadata that represents the restriction properties of a class hierarchy and classifies other metadata into classes. For example, the designer chooses genres as a property and classifies artists as instances of classes.

[Figure 3.1: Procedure for designing service-domain ontology. (1) Designer chooses music domain for creating blog community. (2) Selecting metadata for extracting user interests. (3) For example, classifying artists (instances) by genre (class).]

Service designers need only construct a service-domain ontology for the intended domains and gradually increase the number of ontologies as the service is expanded. Designers should also adjust the granularity of end classes so that user interests are reflected in detail. Fortunately, the designers of many content directories, such as All Media Guide (AMG) and listen Japan, have developed content taxonomies with fine granularity to support users when they browse and buy content according to their interests. Therefore, we construct service-domain ontologies according to these directories.

Interest ontology generation algorithm

We explain the interest ontology generation algorithm by analyzing the interest distribution of users, as shown in Fig

[Figure 3.2: Procedure for generating interest ontologies. (1) Creating index for all entries. (2) Classifying entries into service-domain ontology. (3) Analyzing user's interest distribution based on user ID of classified blog entry. (4) Extracting interest ontology by arranging entries based on user ID. (5) User modifies interest ontology.]

Basic ontology generation algorithm

First, we describe the merit of generating user interests according to a service-domain ontology. We use the service-domain ontology as defined by the experts in each service domain. By using the accurate and detailed knowledge included in the service-domain ontology, we can extract the user-interest ontologies accurately. We note that many service providers assign various name attributes to their content items with the idea of assisting users in locating content items via keyword search. The current version of our method uses exact keyword matching to extract the user interests described in his/her blog entries. The polysemy problem can be eased by applying the maintenance knowledge of service-domain ontologies. The basic ontology generation algorithm (BOGA) is described below.

(1) BOGA makes index files for all blog entries (which can be collected through the ping server). For example, our experiments in Section 6.4 and

Section 3.7 used all Doblog blog entries stored over a roughly four-year period. Here, we assume that each collected blog entry has a unique user ID.

(2) BOGA classifies all collected blog entries into a service-domain ontology. BOGA classifies blog entry $E_i$ into instance $I_i$ (of class $C_i$) if there is a name attribute of $I_i$ in $E_i$. BOGA permits each blog entry to be classified into two or more classes. For example, consider the service-domain ontology in Fig. 3.2. BOGA classifies a blog entry into instance Happy Mondays of class Madchester when the character string "Happy Mondays" appears in the description of the blog entry.

(3) BOGA measures the number of users interested in each instance of $C_e$, which is one of the end classes in the service-domain ontology. In calculating the number of interested users, BOGA counts a user only once, even if the same user describes the same instance or class in two or more blog entries. BOGA calculates the number of users interested in class $C_e$ by obtaining the number of users interested in all instances in $C_e$. Thus, the distribution of interested users in the domain can be measured by recursively counting the number of users from $C_e$ up to the root class $C_r$.

(4) BOGA extracts only the classification results for one user ID from all classification results in order to develop an interest ontology for this user ID. In Fig. 3.2, BOGA can extract an interest ontology of user A when the blog entries of this user describe the instances Stone Temple Pilots, New Order, and Farm.

(5) Finally, our method allows the user to inspect his/her interest ontology and delete instances that he/she considers not to be his/her actual interests.

Ontology-filtering algorithms

BOGA can make classification mistakes. For example, BOGA classifies blog entries that describe "Farm", where the actual reference is to an agricultural farm, into the instance Farm of class Madchester.
To filter out the mistakes caused by words with several meanings, we make use of characteristics such as the taxonomy of instances

in ontologies and the durability of user interests as expressed in the user's blog.

- Instances that belong to the same class have the same characteristics.
- Adjacent classes have similar characteristics. Instances of those classes also have similar characteristics.
- User interests continue for a certain period, and an interest is described over two or more days.

We propose two filtering algorithms: FA1 and FA2. First, we explain FA1.

Filtering algorithm 1

We subdivide procedure (2) of BOGA to permit FA1 to be applied.

(2-1) When the name attribute $n(I_i)$ of instance $I_i$ ($\in C_i$) is described in blog entry $E_i$, FA1 checks whether a name attribute of another instance of the same class, $I_k$ with $I_k \in C_i$ and $I_k \neq I_i$, is described anywhere in the blog entries that the user has accumulated. We call the instances $I_k$ classification decision elements (CDEs).

(2-2) Blog entry $E_i$ is classified as mentioning instance $I_i$ when there is a description of CDEs, and is not classified as mentioning instance $I_i$ when there is no such description. In Fig. 3.2, when the description of Farm exists in $E_i$, and New Order is described somewhere among all the accumulated blog entries of the user, $E_i$ is assumed to be a blog entry about instance Farm of Madchester and is classified accordingly. We can filter classification mistakes accurately by using the many CDEs created from the accurate and comprehensive knowledge contained in the service-domain ontology maintained by expert domain designers.

Filtering algorithm 2

In addition, we propose filtering algorithm 2 (FA2), which provides a more restrictive classification than FA1. In place of procedure (2-1) of FA1, FA2 checks whether CDEs are described in the blog entry $E_i$ itself. Blog entry $E_i$ is classified into $I_i$ if CDEs are described, and not otherwise.
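The difference between BOGA step (2), FA1, and FA2 can be made concrete with a small sketch. Everything named here is hypothetical: the ontology fragment `ONTOLOGY`, the helper `mentioned`, the `classify` function and its `mode` argument are assumptions for illustration, and case-insensitive substring matching is used as a simple stand-in for the exact keyword matching described above (so a name such as "Ride" would also match inside longer words).

```python
# Hypothetical fragment of a service-domain ontology: end class -> instance names.
ONTOLOGY = {
    "Madchester": ["Farm", "Happy Mondays", "New Order", "Stone Roses"],
    "Shoegaze": ["My Bloody Valentine", "Ride"],
}


def mentioned(entry, ontology):
    """BOGA step (2): find (class, instance) pairs whose name appears in the entry."""
    text = entry.lower()
    return {
        (cls, inst)
        for cls, instances in ontology.items()
        for inst in instances
        if inst.lower() in text
    }


def classify(entries, ontology, mode="boga"):
    """Classify one user's entries; returns a list of sets of (class, instance).

    mode="boga": keep every keyword match.
    mode="fa1":  keep a match only if another instance of the same class (a CDE)
                 appears somewhere in the user's accumulated entries.
    mode="fa2":  keep a match only if a CDE appears in the same entry.
    """
    per_entry = [mentioned(entry, ontology) for entry in entries]
    all_mentions = set().union(*per_entry)

    results = []
    for hits in per_entry:
        if mode == "boga":
            results.append(set(hits))
            continue
        scope = all_mentions if mode == "fa1" else hits
        kept = {
            (cls, inst)
            for cls, inst in hits
            if any(c == cls and i != inst for c, i in scope)
        }
        results.append(kept)
    return results


entries = ["We visited a farm last weekend.", "New Order played last night."]
```

With these two entries, plain BOGA and FA1 both classify the first entry into Farm of Madchester (FA1 because New Order appears elsewhere in the user's entries), whereas FA2 rejects it because no other Madchester instance appears in the same entry.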

[Figure 3.3: Hops in filtering algorithms. The figure marks the ranges reached with zero hops, one hop, and two hops from an end class in a Rock taxonomy.]

Adjusting range of CDEs

In addition, we introduce a mechanism that adjusts the range of CDEs by using the class hierarchy. We consider that descriptions of classes and instances of interest often appear together with instances of the same class and those of neighboring classes. We add a new adjustment parameter, the hop limit, which defines the range of CDEs. In Fig. 3.3, we assume that the CDEs include instances of sibling classes and those of the grandfather class when two hops from end classes are permitted.

Introducing interest weight to ontology

In addition, we introduce the interest weight as a parameter that indicates the degree of a user's interest in each class and instance of an interest ontology. By using this parameter, we can create a virtual community of those users who have almost the same degree of interest in the same classes or instances. Here, we explain the idea of calculating the interest weight using Fig. 3.4. In this paper, we extract the interest weight of a user for item $I_i$ by analyzing the number of times $I_i$ is present in all of his/her blog entries. Note that the user's interest means more than just the simple number of instances, since $I_i$ might simply be part of a list.

[Figure 3.4: Applying interest weight to ontology. The example uses two entries of user A. Entry 1: "I like stone roses and my bloody valentine recently. I like Nirvana and new order much more." Entry 2: "As for shoegaze, I think Ride and My bloody valentine are best." The figure annotates each instance with its interest weight and each class with the interest weight under the class.]

Our proposal is to apply the following ideas to extract the interest weight from blog entries.

(1) The interest weight of every blog entry is one.

(2) If $N(E_i)$ kinds of name attributes of interest instances appear in blog entry $E_i$, the interest weight of each instance in $E_i$ becomes $1/N(E_i)$.

(3) When we define the set of all accumulated blog entries of a user as $E$, the interest weight $S(I_i)$ of each instance $I_i$ is $S(I_i) = \sum_{E_i \in E,\ I_i \in E_i} 1/N(E_i)$, and the interest weight $S(C_i)$ of each class $C_i$ is $S(C_i) = \sum_{I_i \in C_i} S(I_i)$.

We also consider that a user who is interested in $I_i$ is also interested in class $C$ if $I_i$ lies under class $C$. In the same way, we consider that a user who is interested in $C$ is also interested in the super class of $C$. Thus, we give the following definition. For a user who has an interest in instances deep in the class hierarchy, the upper classes in the hierarchy tend to have larger interest weight values.

(4) The interest weight of the instances is reflected in that of the class

50 that includes the instance. The interest weight of the classes is reflected in that of the super class. For example, in Fig. 3.4, we give the interest weight of instance Stone Roses as 1/4, that of instance My Bloody Valentine as 1/4 + 1/2 = 3/4, that of class Shoegaze as 3/4 + 1/2 = 5/4, and that of class Alternative as 1/2 + 5/4 + 1/4 = Detecting novelty by similarity measurements In this section, we propose to measure the similarity between ontologies through a consideration of interest weights. We detect innovative topics for user u by measuring the similarity between the user-interest ontology of u and those other users. Next, we determine a group of users, of appropriate size, whose interest ontologies are similar to that of u Interest-weight-based similarity measurement We now explain our similarity measurement in detail by using Fig We use Table 3.1 that gives definitions of terms used in this Section, and examples based on Fig We first define the terms interest ontology O A of user A and O B of user B, topology T 1, which is composed of a class and subclass relationship, and topology T 2, which is composed of a class and instance relationship. Furthermore, we define common classes C i as classes that both ontologies have, and common instances I i as instances that both ontologies have. For example, there are five common classes, a1, b1, b2, c3, and c4, in Fig In particular, we define a common class set that formalizes topology T 1 as C(T 1 ), and a common class set that formalizes topology T 2 as C(T 2 ). For example, in Fig. 3.5, C(T 1 ) has common classes a1 and b2, and C(T 2 ) has common classes b2, c3, and c4. We also give the degree of interest agreement of common instance I i as I(I i ), that of common class C i as I(C i ), 32

51 Interest ontology of user A: OA Interest ontology of user B: OB 3 5 b1 m k a1 0 1 c3 a b c b n 2 3 c4 2 g h c1 l b1 a b2 n c3 c4 a e p 3 c j 0 b3 d Class Instance Interest weight of instance Interest weight under the class. Topology T1 Topology T2 Figure 3.5: Measuring similarity based on degree of interest agreement. and that of common topology created by common class C i as I t (C i ). In [41], the authors calculate the similarity between ontologies considering the degree of similarity between class topologies T 1. In addition, we apply the following ideas to create user-interest-based virtual communities. Evaluating the degree of interest agreement between C i s and I i s from the interest weight with smaller value. This filters users who simply enumerate a lot of instances in an blog entry and creates a virtual community among users who have similar or larger interest weight values with respect to that of each user. Treating topologies T 1 and T 2 separately because we consider that T 1 reflects the width and depth of a user s interests while T 2 reflects the objects in which users are interested. Decreasing the computational complexity by generating the class schema of user-interest ontologies according to that of service-domain ontologies. Accessing a large number of blog entries, as is done in our experiments in Section 6.4, is important for useful ontology mapping. 33
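The interest-weight rules (1) to (4) and the worked example of Fig. 3.4 can be reproduced with a short Python sketch. The input format (one list of matched instance names per blog entry), the function name `interest_weights`, and the placement of New Order under Madchester are assumptions made for illustration; they are read off the figure rather than taken from the original system.

```python
from collections import defaultdict
from fractions import Fraction


def interest_weights(entries, parent_of):
    """Return {node: weight} for instances and classes.

    entries   -- one collection of instance names per blog entry (assumed input format)
    parent_of -- maps an instance or class to the class directly above it
    """
    weights = defaultdict(Fraction)
    for mentioned in entries:
        kinds = set(mentioned)            # rule (2): count each kind of instance once
        if not kinds:
            continue
        share = Fraction(1, len(kinds))   # rules (1)-(2): each entry carries weight 1
        for instance in kinds:
            node = instance
            while node is not None:       # rule (4): propagate up to every super class
                weights[node] += share
                node = parent_of.get(node)
    return dict(weights)


# Hierarchy and entries of user A as read from Fig. 3.4 (an assumption of this sketch).
parent_of = {
    "Stone Roses": "Madchester", "New Order": "Madchester",
    "My Bloody Valentine": "Shoegaze", "Ride": "Shoegaze",
    "Nirvana": "Alternative",
    "Madchester": "Alternative", "Shoegaze": "Alternative",
}
entries = [
    ["Stone Roses", "My Bloody Valentine", "Nirvana", "New Order"],  # entry 1
    ["Ride", "My Bloody Valentine"],                                 # entry 2
]
weights = interest_weights(entries, parent_of)
print(weights["My Bloody Valentine"], weights["Shoegaze"], weights["Alternative"])
```

Under these assumptions the sketch yields the values quoted in the text: 1/4 for Stone Roses, 3/4 for My Bloody Valentine, 5/4 for Shoegaze, and 2 for Alternative, which equals the total number of entries of user A.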

Table 3.1: Definitions of terms and examples.

(1) We analyze the classes common to $O_A$ and $O_B$ and extract the common classes that belong to $C(T_1)$ and $C(T_2)$.

(2) When common class $C_i$ has common instance $I_i$ between the ontologies, we assign the smaller of the two interest weights of common instance $I_i$ to $I(I_i)$. For example, $I(a)$ is 2.

(3) Similarly, we assign the smaller of the two interest weights of common class $C_i$ to $I(C_i)$. For example, $I(b1)$ is 3.

(4) For each $C_i \in C(T_1)$, we define the intersection of the sets of subclasses of $C_i$ in the two ontologies as $N(C_i)$, and the union of those sets as $U(C_i)$. For example, for common class $a1 \in C(T_1)$, $N(a1) = \{b1, b2\}$ and $U(a1) = \{b1, b2, b3\}$. We then give $I_t(C_i)$ as $I_t(C_i) = \sum_{C_j \in N(C_i)} I(C_j) / |U(C_i)|$. For example, $I_t(a1)$ is given by $(I(b1) + I(b2))/3 = 7$. Thus, we obtain the degree of interest agreement $S(T_1)$ of $C(T_1)$ as $S(T_1) = \sum_{C_i \in C(T_1)} I_t(C_i)$. In Fig. 3.5, $S(T_1) = I_t(a1) + I_t(b2) = 7 + (9 + 3)/2$.

(5) For each $C_i \in C(T_2)$, we also define the instance set of $C_i$ in ontology $O_A$ as $I_A(C_i)$, and the instance set of $C_i$ in ontology $O_B$ as $I_B(C_i)$. We then give $I_t(C_i)$ as $I_t(C_i) = \sum_{I_i \in C_i} I(I_i) / |I_A(C_i) \cup I_B(C_i)|$. For example, $I_t(c3)$ has a denominator of 4 and evaluates to 5/4. Thus, we obtain the degree of interest agreement $S(T_2)$ of $C(T_2)$ as $S(T_2) = \sum_{C_i \in C(T_2)} I_t(C_i)$. In Fig. 3.5, $S(T_2) = I_t(b2) + I_t(c3) + I_t(c4)$, whose terms include 2/1 and 5/4.

(6) By using an evaluation function $f(X)$, which corresponds to the relative degree of importance of a topology, we finally determine the similarity score between ontologies $S_O(AB)$ as $S_O(AB) = S(T_1) + f(S(T_2))$. For example, if $f(X)$ equals $X$, then in Fig. 3.5 $S_O(AB) = S(T_1) + S(T_2)$.

As explained in procedures (4) and (5), our algorithm judges interest ontologies to be more similar when they share topology $T_1$, which expresses the depth and width of user interest. Our similarity measurement returns higher values if there are more common classes $C(T_1)$ that form topology $T_1$ in both ontologies (in other words, if more classes appear in both ontologies). When it calculates similarities between topologies $T_2$ in different ontologies, it has almost the same effect as calculating the similarities between $T_1$. Thus, our method identifies two different ontologies as being similar if they have common instances in common classes at deeper or wider levels of their hierarchies.

3.5.2 Innovative blog-entry detection

We use our similarity measurement for innovative blog-entry detection and user-oriented community creation.

(1) We calculate the similarity between the ontology of user A and the ontologies of the other users in set $U$. By using the heuristic threshold $X$, we derive the $X$ users who have the highest similarity to user A as the interest-sharing virtual community $G_U$.

(2) We then analyze the difference instances between the ontology of user A and the ontologies of $G_U$, that is, instances that appear in the ontologies of $G_U$ but not in that of user A. We also define a parameter called the score of novelty, which indicates how many hops are needed to get from a difference instance of an ontology of $G_U$ to a class of the ontology of user A. In Fig. 3.6, we need three hops to go from difference instance Elf Power of the ontology of user B to class Rock of the ontology of user A. By recommending blog entries with a high score of novelty, the interests of users may be significantly expanded. Lowering the level of novelty may produce more comfortable new concepts, but these will prove to be less satisfying.

(3) Finally, we extract innovative instances $G_I$, which are unknown to user A but well known to users in $G_U$; the innovative blog entries about $G_I$ are recommended to user A together with the score of novelty. As we defined in Section 5.3, innovative topics are concepts that are new and interesting to the user; thus the score of novelty of an innovative instance is more than one.

Fig 3.6: Community creation service of recommending innovative blog entries. The figure shows the blog entries and interest ontologies of user a and user b, and four steps: (1) extracting user interests; (2) measuring similarity; (3) recommending items via other users' entries, through which user a can become interested in artist Elf Power; (4) creating a community by browsing recommended entries.

Here, determining the most suitable size of $G_U$ is very important for detecting attractive and innovative instances. If the size of $G_U$ is reduced, the difference between user-interest ontologies is smaller, and instances in $G_I$ may be close to the user-interest ontology of each user. However, there may be few novel instances in $G_I$. On the other hand, if the size of $G_U$ is increased, the difference between user-interest ontologies is larger, and instances in $G_I$ may be too novel for the user. Thus, we observe the difference between the user-interest ontology of user u and those of $G_U$ while changing the size of $G_U$. The most suitable size of $G_U$ is the point at which there is a rapid increase in the number of instances in $G_I$. Details of this process are given in Section 3.6.5.

An example of community creation is depicted in Fig. 3.6. User B is included in user group $G_U$, whose interest ontologies are measured as similar to the interest ontology of user A. If users in $G_U$ often have an interest in Elf Power, user A has the potential to be interested in Elf Power even though the class Elephant 6, which includes Elf Power, is many hops from the class Rock, in which user A has a known interest. Furthermore, by browsing blog entries concerning these novel instances, users may expand their interests and share interests with each other.

3.6 Offline Experimental results

We now present the results of offline experiments and simulation studies that demonstrate the performance of interest ontology extraction and novel blog-entry detection.

3.6.1 Datasets and methodology

The proposed methods were tested using the large-scale blog portal Doblog, which holds 1,600,000 blog entries from 55,000 users. We also used the service-domain ontology of the music domain, as shown in Fig. 3.2, which was created by referring to public information on listen Japan, a web portal storing music artist and genre information. Our experimental service-domain ontology contains 114 classes as genres, covering a wide range of genres in the music domain such as Rock, Classic, Jazz, and Soul, and 4,300 artists as instances. Its class hierarchies have four levels on average, and the deepest has five levels. Furthermore, each class and instance of the service-domain ontology has two or more name attributes. For example, the instance R.E.M. has the name attributes R.E.M. and REM. Overall, the 4,300 instances were given 7,600 name attributes. A genre hierarchy very similar to our service-domain ontology is published on the listen Japan website.
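Before turning to the evaluation, the similarity score $S_O(AB)$ and the score of novelty defined above can be sketched in Python. This is a minimal illustration under stated assumptions, not the original implementation: the dictionary layout of an ontology, the function names, the toy weights (which are not the values of Fig. 3.5), and the hierarchy fragment placing Elephant 6 under US Indie under Rock are all hypothetical.

```python
def agreement(weights_a, weights_b):
    """Sum of the smaller weights over shared keys, divided by the size of the key union."""
    union = weights_a.keys() | weights_b.keys()
    if not union:
        return 0.0
    common = weights_a.keys() & weights_b.keys()
    return sum(min(weights_a[k], weights_b[k]) for k in common) / len(union)


def similarity(onto_a, onto_b, f=lambda x: x):
    """S_O(AB) = S(T1) + f(S(T2)), summed over classes present in both ontologies."""
    sub_a, sub_b = onto_a["subclasses"], onto_b["subclasses"]
    ins_a, ins_b = onto_a["instances"], onto_b["instances"]
    s_t1 = sum(agreement(sub_a[c], sub_b[c]) for c in sub_a.keys() & sub_b.keys())
    s_t2 = sum(agreement(ins_a[c], ins_b[c]) for c in ins_a.keys() & ins_b.keys())
    return s_t1 + f(s_t2)


def score_of_novelty(instance, parent_of, classes_of_user):
    """Hops from an instance up to the nearest class the user already has, or None."""
    hops, node = 0, instance
    while node not in classes_of_user:
        node = parent_of.get(node)
        if node is None:
            return None
        hops += 1
    return hops


# Toy ontologies (hypothetical weights): class -> {subclass: weight} and
# class -> {instance: weight}.
user_a = {
    "subclasses": {"a1": {"b1": 3, "b2": 3}},
    "instances": {"b1": {"x": 2, "y": 1}, "b2": {"z": 3}},
}
user_b = {
    "subclasses": {"a1": {"b1": 1, "b2": 2, "b3": 2}},
    "instances": {"b1": {"x": 1}, "b2": {"z": 2}, "b3": {"w": 2}},
}
score = similarity(user_a, user_b)

# Assumed fragment of the music hierarchy for the Elf Power example.
parent_of = {"Elf Power": "Elephant 6", "Elephant 6": "US Indie", "US Indie": "Rock"}
novelty = score_of_novelty("Elf Power", parent_of, {"Rock", "Alternative", "Shoegaze"})
print(score, novelty)
```

In the toy data, class a1 contributes (1 + 2)/3 to $S(T_1)$, while b1 and b2 contribute 1/2 and 2/1 to $S(T_2)$. Taking the smaller weight in `agreement` is what keeps a user who merely lists many instances from appearing similar to a user with concentrated interests.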

To evaluate accuracy, we defined correct answers as blog entries that contain descriptions of the classes or instances into which they were classified, and we evaluated the generated interest ontology by using the precision and recall of the classification results. In this paper, precision means the proportion of correct answers among the classification results, and recall means the proportion of correct answers among all blog entries that should have been classified. When recall is high, the extracted interest ontologies better cover user interests. However, when precision is low, the created interest ontologies include classification mistakes, and the novel topics detected for the user are unreliable. Thus, achieving high precision is indispensable. In the evaluation, we used filtering algorithms to eliminate instances that consist of just one word, such as police, because such instances have a high probability of having several meanings. We used Namazu to generate index files of blog entries.

3.6.2 Measuring interest distributions of blog users

The user distributions in the music domain are depicted in Fig. 3.7-(a). This figure shows the number of users in each class at each level of the class hierarchy in the ontology. Each class has about 200 users, even the end classes. By checking the blog entries classified into end classes, we confirmed that these blog entries frequently contain unique words that describe the features of these classes. For example, blog entries classified into the end class Death Metal contain the phrase death voice with high probability. This is because the end classes in our service-domain ontology have a granularity that is appropriate for extracting the uniqueness of the blog entries classified into them. End class granularity is important because it controls whether we can determine if a user is interested in end class instances or not.
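The precision and recall used throughout the evaluation can be written as a small Python sketch. It assumes the usual reading of the definitions above, namely that recall is taken over all entries that should have been classified; the function name and the entry identifiers are hypothetical.

```python
def precision_recall(classified, correct):
    """Return (precision, recall) for a set of classified entries.

    classified -- entries the system assigned to a class or instance
    correct    -- entries that really describe that class or instance
    """
    classified, correct = set(classified), set(correct)
    hits = len(classified & correct)
    precision = hits / len(classified) if classified else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Four entries were classified, three of them correctly; five entries were relevant.
precision, recall = precision_recall(
    classified={"e1", "e2", "e3", "e4"},
    correct={"e1", "e2", "e3", "e5", "e6"},
)
print(precision, recall)
```

A stricter filter shrinks `classified`, which tends to raise precision and lower recall; this is the trade-off reported for the filtering algorithms in the following sections.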

Fig 3.7: Experimental results of user distributions and ontology extraction. (a) Interest distribution of blog users in the experimental service-domain ontology (number of users in each class at the 2nd, 3rd, and 4th levels of the hierarchy). (b) Comparing the precision of instances with one word among BOGA, FA1 and FA2, by genre/artist.

3.6.3 Measuring performance of extracted interest ontology

We evaluated the accuracy of FA2 by checking a randomly selected quarter of the classified blog entries. As shown in Table 3.2-(a), the achieved precision is higher than 90% with a high recall of 80%. Thus, our filtering algorithm is effective for generating suitable user-interest ontologies. We achieved this performance because we used the comprehensive knowledge contained in the service-domain ontology, as explained earlier: the many CDEs defined in the service-domain ontology let us filter out classification mistakes accurately.

We also found that negative comments constituted only about 5 percent of all comments. It seems that a user who has a negative opinion related to a hobby most often does not express it, whereas a positive opinion leads to many comments.

Table 3.2: Experimental results of our ontology extraction and detection of novel topics.
Percentages of instances by score of novelty, as far as recoverable (the score-of-novelty column headers and the first percentage of the first row are missing):
First row: -, 15.2%, 23.2%, 4.0%
Second row: 23.4%, 23.1%, 44.3%, 9.2%

From those results, we assume that the user's comment is positive if the blog entry repeatedly refers to something about their hobbies. The interest weight reflects how many times the user described a certain instance in a certain class, as explained earlier. We also note that there are a few events and people, such as Michael Jackson, that are described in blog entries as a form of gossip. Extracting and manually checking these instances may improve system accuracy if they are frequent in blog entries over a certain period of time.

3.6.4 Comparing filtering algorithms

We compared BOGA and the filtering algorithms by randomly checking a quarter of the blog entries that were classified into instances with one word. The precisions achieved by BOGA, FA1, and FA2 for 83 instances, randomly selected from among the 827 instances with one word, are shown in Fig. 3.7-(b). The accuracies of BOGA and the filtering algorithms are shown in Table 3.2-(b). These results indicate that precision improves in the order of BOGA, FA1, and FA2, while recall decreases significantly with FA2, even though it decreases only slightly with FA1 compared to BOGA. To improve recall while keeping the high precision of FA2, we plan to add a method that also checks for CDEs in blog entries where they are likely to appear, such as entries linked by trackbacks or entries close to each other in time.

Analyzing Fig. 3.7-(b) in more detail, there are eight instances for which precision cannot be improved even with FA2, and those instances lower the overall precision. We thus extracted the instances for which the number of classified entries increased by ten times or more when FA2 was replaced by FA1. This yielded 28 instances, and 5 of those instances had a precision of 0. The reason is that they do not co-occur with CDEs in the same blog entry, even though the user was interested in them and often wrote the name attributes of these instances. Thus, precision can be effectively improved by deleting these instances from the service-domain ontology.

We also evaluated the accuracy of FA2 while changing the hop limit. Two hops were better than zero hops with respect to both the number of correct answers and precision, as shown in Table 3.2-(c). However, four hops yielded worse precision than two hops, although the number of correct answers was slightly better. This is because our service-domain ontology has a large number of instances in end classes, and the relationship between end classes and super classes is closer than the relationship between super classes and grandfather classes. For example, end class Acid Metal has the super class Metal and the grandfather class Rock. In this case, the relationship between Acid Metal and Metal is closer than the relationship between Metal and Rock. Thus, two hops offer better precision than zero hops because two hops include many CDEs, whereas four hops have lower precision than two hops because the resulting instances are far from the end classes.

Furthermore, we analyzed the cases in which the number of classified blog entries changed by a large factor when four hops were used instead of two hops. Such cases represent classification mistakes. For example, with four hops, the CDEs of the instance Europe in class Northern Metal included instances in class Adult Contemporary under the class Rock. In this case, blog entries with the description Europe tour were also classified into Europe in Northern Metal. Therefore, the number of correct answers with high precision can be effectively increased by changing the hop limit and deleting these mistakenly classified blog entries from the classified instances.
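The effect of the hop limit can be illustrated with a short Python sketch that collects the classes whose instances may serve as CDEs for a given end class. It is a sketch under assumptions: the function name is hypothetical, and the hierarchy fragment (Northern Metal and Acid Metal under Metal, Metal and Adult Contemporary under Rock) is inferred from the examples in the text rather than copied from the service-domain ontology.

```python
from collections import deque


def classes_within_hops(parent_of, start, hop_limit):
    """Return {class: hops} for every class reachable from start within hop_limit hops.

    Hops are counted along class-subclass edges in either direction, so a brother
    class and the grandfather class are both two hops away from an end class.
    """
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == hop_limit:
            continue                      # do not expand beyond the hop limit
        neighbours = list(children.get(node, []))
        if node in parent_of:
            neighbours.append(parent_of[node])
        for nxt in neighbours:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


# Assumed fragment of the genre hierarchy.
parent_of = {
    "Acid Metal": "Metal",
    "Northern Metal": "Metal",
    "Metal": "Rock",
    "Adult Contemporary": "Rock",
}
two_hops = classes_within_hops(parent_of, "Northern Metal", 2)
four_hops = classes_within_hops(parent_of, "Northern Metal", 4)
print(sorted(two_hops), sorted(four_hops))
```

With this fragment, Adult Contemporary lies three hops from Northern Metal, so it is excluded at a hop limit of two and admitted at four, which mirrors the Europe tour misclassification described above.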

Fig 3.8: (a) Number of users obtained by changing X. (b) Number of users that have high interest weight after changing X. In both panels, the horizontal axis is X, the number of users in the group $G_U$, and the vertical axis is the number of users who are interested in artists of each group (famous group, moderately famous group, and group with a small number of fans).

3.6.5 Analyzing size of user-oriented community

We determined the suitable size of $G_U$, as described in Section 3.5.2, by observing the difference between the user-interest ontology of each user u and those of $G_U$ while changing the size of $G_U$.

First, we selected user A from among all users extracted by our service-domain ontology and analyzed the suitable size of $G_U$ by changing parameter X, which represents the number of users who have high similarity to user A in the interest-sharing community $G_U$. In this evaluation, we divided the novel instances $G_I$ into three instance groups in order of the appearance rate of the instances when X was set to 70: a very popular instance group, a moderately popular instance group, and an instance group with a small number of fans. We then calculated the number of users who were interested in the artists of each instance group while changing X from 10 to 70 in steps of 1. The number of users who were interested in each instance group for each value of X is shown in Fig. 3.8-(a). Next, we focused on users who had high interest weights in their interest ontologies. The number of such users for each value of X is shown in Fig. 3.8-(b).

The very popular instance group was recommended to users regardless of the value of X (see Fig. 3.8-(a)). The instance group with a small number of fans, on the other hand, was recommended most often when X was ten (Fig. 3.8-(b)); the moderately popular instance group was recommended more often as X was increased. This is because users with high interest weights tend to discuss instances in the instance group with a small number of fans rather than instances in the famous instance group. Furthermore, the number of users in each instance group increased suddenly when X was greater than 60. This is because the difference between a user's ontology and those of $G_U$ is larger when X is greater than 60, so instances with a low probability of being interesting come to be recommended more often. From this result, novel topics are effectively detected with respect to detailed user interests when X is smaller than 60, given the datasets used in our experiment. This result also suggests that the suitable size of $G_U$ is given by X = 60, because the number of instances of each group increased radically when X exceeded that point.

3.6.6 Measuring performance of detecting novel topics

We next evaluated novel blog-entry detection. In the evaluation, we compared the proportion of novel instances in the manually defined recommendation lists, created by the "you might like these artists" feature of the music portal listen Japan, with the proportion of novel instances in the recommendation lists created by our methods. The designers of listen Japan have manually defined artists ($A_n$) that are considered to be relevant to artist ($A_i$). We checked 75 users, out of the total of 1503 users, who were judged to be interested in the music domain of our service-domain ontology. First, we identified the X users who had high similarity to user A, as described earlier (following Section 3.6.5, we set X to 60). From the recommendation lists created by our method, we took the top 150 instances that appeared most frequently in the interests of those X users. The manually defined recommendation lists were generated by passing the user interests, extracted by our algorithm, to the portal's recommendation system. The manual recommendation lists included, on average, 23 instances. Table 3.2-(d) and (e) show the percentages of recommended instances and their score of novelty for the manually generated recommendation lists and our lists, respectively. These results indicate that our technique recommends more instances with a higher score of novelty than the manually created recommendation lists. Another conclusion that can be drawn is that users actually have a much wider range of interests than predicted by the music portal experts.

Fig 3.9: Snapshot of online experimental service DoblogMusic. The snapshot shows a user's blog site and the recommendation page of DoblogMusic: (1) genres and artists of the user, for example Genre: Alternative rock, Emo, Lo-fi, and Artists: Jimmy eat World, Get up kids; (2) recommendations; (3) automatic tagging of artists or genres to each entry by classifying blog entries to items/classes in the taxonomy. The length of a bar indicates the strength of the prediction value, and the color of a bar indicates the score of novelty, from small to large. For example, novel artists with red bars have a large score of novelty.

Fig 3.10: User accesses to DoblogMusic. (a) Number of users accessing DoblogMusic. (b) Number of accesses of DoblogMusic.

3.7 Online Experimental results

To evaluate the effectiveness of novel topic detection, we offered an experimental service DoblogMusic to Doblog users. We used a larger service-
