A logistic regression model for Semantic Web service matchmaking


BRIEF REPORT. SCIENCE CHINA Information Sciences, July 2012, Vol. 55, No. 7: doi: /s x

WEI DengPing 1*, WANG Ting 1 & WANG Ji 2
1 School of Computer, National University of Defense Technology, Changsha , China;
2 National Laboratory for Parallel and Distributed Processing, Changsha , China

Received March 24, 2011; accepted February 21, 2012; published online May 17, 2012

Abstract Semantic Web service matchmaking, one of the most challenging problems in Semantic Web services (SWS), aims to filter and rank a set of services with respect to a service query by using a certain matching strategy. In this paper, we propose a logistic regression based method that aggregates several matching strategies for SWS matchmaking, instead of relying on a fixed integration (e.g., the weighted sum). The logistic regression model is trained on data derived from the binary relevance assessments of existing test collections, and is then used to predict the probability of relevance between a new pair of query and service from the matching values obtained with the various matching strategies. Services are then ranked by their probabilities of relevance with respect to each query. Our method is evaluated on two main test collections, SAWSDL-TC2 and Jena Geography Dataset (JGD). Experimental results show that the logistic regression model can effectively predict the relevance between a query and a service, and hence can improve the effectiveness of service matchmaking.

Keywords Semantic Web service, matchmaking, logistic regression

Citation Wei D P, Wang T, Wang J. A logistic regression model for Semantic Web service matchmaking. Sci China Inf Sci, 2012, 55: , doi: /s x

1 Introduction

Semantic Web services (SWS), an application of the ideas of the Semantic Web to service-oriented computing, have attracted much attention recently [1].
SWS matchmaking, one of the most challenging problems in SWS [2], aims to filter and rank a set of services with respect to a query by using a matching strategy that measures the similarity between the query and a service. A variety of competing matching strategies have been proposed recently [3,4]. Among them, integrated matching strategies, which combine the matching results obtained from different matching strategies, have proved promising in the intensive comparisons of various service matchmaking contests 1). Integration provides a comprehensive and complementary way to measure the similarity between a query and a service by considering different descriptions of Web services. Thus, how to effectively integrate the individual similarity values obtained from useful matching strategies into an overall score becomes an important issue.

*Corresponding author ( dpwei@nudt.edu.cn)
1) klusch/s3/index.html
© Science China Press and Springer-Verlag Berlin Heidelberg 2012 info.scichina.com

An intuitive way to integrate is to use empirical values as the weights of the different matching strategies. For example, URBE [5] uses a weighted sum to integrate several similarity values into an overall score. However, such empirical weights are difficult to set correctly in practice, because applications differ in their characteristics. To alleviate this problem, several machine learning based methods have been used to learn these weights for service discovery. Christoph et al. [6] proposed the SWS matchmaker iMatcher, which integrates various text similarity measures using different machine learning algorithms. Klusch et al. [7] proposed the SAWSDL service matchmaker SAWSDL-MX2, which uses a support vector machine (SVM) to integrate three matching variants: logic-based matching, text similarity based matching of semantic annotations, and structural matching.

The logistic regression model is a popular model for binary data prediction, regression and classification [8], and it has been successfully applied in applications such as text retrieval [9]. Essentially, the service matchmaking problem can be viewed as a binary data prediction problem of judging whether a service is relevant to a query or not. In addition, logistic regression provides a standard way to analyze the contribution of each matching strategy to service matchmaking in a specific domain according to the estimates of the coefficients, which helps domain experts select appropriate matching strategies for their specific applications. Based on this insight, we propose a method that exploits the logistic regression model to integrate various matching strategies and to predict the probability of relevance between a query and a service based on their individual matching scores.
Following our previous work [10,11], we adopt several matching strategies to compute the individual similarity values, and then integrate them into an overall similarity using the trained logistic regression model. Experimental results show that the logistic regression model outperforms all basic matching strategies in terms of recall and precision, and also outperforms well-known integrated matchmakers.

2 The approach

Let $x = \{x_1, x_2, \ldots, x_k\}$ denote the $k$ matching strategies used to calculate the similarity values between a pair of query and service $(q_j, s_i)$. Applying these $k$ matching strategies to each pair $(q_j, s_i)$ yields a set of similarity values $\{x_1(q_j, s_i), x_2(q_j, s_i), \ldots, x_k(q_j, s_i)\}$, where $x_l(q_j, s_i)$ is the similarity value between $q_j$ and $s_i$ calculated by the matching strategy $x_l$. Our aim is to establish a function that integrates these individual similarity scores into an overall score, which is used to rank services.

Logistic regression is a variation of ordinary regression that is used when the response variable is binary (occurrence or non-occurrence of the outcome event) and the input variables are continuous, categorical, or both. The service matchmaking problem can essentially be viewed as a binary prediction problem that judges whether a service is relevant to a query according to the similarity values obtained from various matching strategies. Therefore, we are interested in predicting the probability of relevance between a pair of query $q_j$ and service $s_i$. Let $R$ denote the relevance between a pair of query and service, where $R = 1$ indicates that the service is relevant to the query, and $R = 0$ indicates that it is irrelevant to the query.
We model the conditional mean of $R$ given a pair of query and service $(q_j, s_i)$ and the set of matching strategies $x = \{x_1, x_2, \ldots, x_k\}$, i.e., $E(R = 1 \mid x, q_j, s_i)$, via the following logistic regression function:

$$E(R = 1 \mid x, q_j, s_i) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}. \qquad (1)$$

This function produces $E(R = 1 \mid x, q_j, s_i)$ between 0 and 1, and the terms $\beta = \{\beta_0, \ldots, \beta_k\}$ are unknown parameters (called regression coefficients) to be estimated from the available observations. A logit transformation of $E(R = 1 \mid x, q_j, s_i)$ is defined as

$$g(x_1, x_2, \ldots, x_k) = \ln\left[\frac{E(R = 1 \mid x, q_j, s_i)}{1 - E(R = 1 \mid x, q_j, s_i)}\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k. \qquad (2)$$
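As a minimal sketch of Eqs. (1) and (2), the following Python snippet computes the probability of relevance from a coefficient vector and a vector of similarity values, and checks that the logit transformation recovers the linear predictor. The function names and all numeric values are hypothetical illustrations; no coefficient estimates are given in this text.

```python
import math


def relevance_probability(beta, x):
    """Eq. (1): beta = [beta_0, ..., beta_k], x = [x_1, ..., x_k]."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    # Numerically stable evaluation of e^z / (1 + e^z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Eq. (2): the log-odds ln[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and similarity values for k = 4 strategies.
beta = [-3.0, 1.0, 2.0, 0.5, 2.5]
x = [0.8, 0.6, 0.4, 1.0]

p = relevance_probability(beta, x)
linear_predictor = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
print(p, logit(p), linear_predictor)
```

Because the logit is linear in the similarity values, the sign and size of each coefficient indicate how strongly the corresponding matching strategy pushes a pair toward relevance.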

Table 1 The basic matching strategies

| Matching strategy | Similarity measure | Function | Description |
| --- | --- | --- | --- |
| Name | Dice's coefficient | $Sim(n_r, n_s) = \frac{2\,\lvert B(n_r) \cap B(n_s)\rvert}{\lvert B(n_r)\rvert + \lvert B(n_s)\rvert}$ | $B(x)$: set of bigrams in string $x$; $n_x$: name of $x$ |
| Description text | Cosine similarity | $Sim(v_r, v_s) = \frac{\sum_{i=1}^{\lvert v_r\rvert} v_{r_i} v_{s_i}}{\sqrt{\sum_{i=1}^{\lvert v_r\rvert} v_{r_i}^2}\,\sqrt{\sum_{i=1}^{\lvert v_s\rvert} v_{s_i}^2}}$ | $v_x$: classic vector space model of the description text in $x$ |
| Syntactic IO | Euclidean distance | $Sim(v_r, v_s) = \frac{1}{1 + \sqrt{\sum_{i=1}^{\lvert v_r\rvert} (v_{r_i} - v_{s_i})^2}}$ | $v_x$: boolean vector for the unfolded concept expressions of the I/O concepts of $x$ |
| Semantic IO | Logic based matching | The algorithm described in [10] | |

To obtain the estimates of the unknown parameters $\{\beta_i\}$, the commonly used maximum likelihood method is employed, which maximizes the probability of obtaining the observed service test collections. Let the matrix $S$ denote the matching values, according to the $k$ matching strategies, of the observed pairs of query and service whose relevance is known in advance. Let $s_{i\cdot} = [s_{i,1}, s_{i,2}, \ldots, s_{i,k}]$ denote the $i$th row of the matrix $S$, and let $s_{i,j}$ denote the similarity value between the $i$th pair of query and service according to the $j$th matching strategy, so that the columns of $S$ correspond to $x_1, x_2, \ldots, x_k$:

$$S = \begin{bmatrix} s_{1,1} & s_{1,2} & \cdots & s_{1,k} \\ s_{2,1} & s_{2,2} & \cdots & s_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{m,1} & s_{m,2} & \cdots & s_{m,k} \end{bmatrix}. \qquad (3)$$

In this paper, we select the basic matching strategies listed in Table 1 as $x = \{x_1, x_2, x_3, x_4\}$ in Eq. (1), i.e., the name based matching strategy (Name), the description text based matching strategy (Description text), the semantic annotation based syntactic matching strategy (Syntactic IO) and the semantic matching strategy (Semantic IO). The selected matching strategies are based on the description components most commonly used among several Semantic Web service ontologies/specifications, which may be complementary in describing the functional properties of Web services.

In order to apply maximum likelihood, the likelihood function is constructed as follows, where $R_i$ is the known relevance of the $i$th pair:

$$L(\beta) = \prod_{i=1}^{m} \left[\frac{1}{1 + e^{-s_{i\cdot}\beta}}\right]^{R_i} \left[\frac{1}{1 + e^{s_{i\cdot}\beta}}\right]^{1 - R_i}. \qquad (4)$$

The maximum likelihood estimators (MLE) of the parameters $\{\beta_i\}$ are calculated with the well-known quasi-Newton method based on the samples in matrix $S$. Finally, the probability of relevance of a new pair of query and service can be calculated by Eq. (1).

Generally, the services relevant to a query are far fewer than the advertised services; thus the number of irrelevant pairs of query and service is much larger than the number of relevant pairs. Such an unbalanced training set biases conventional machine learning methods toward the larger class. To overcome this problem, a cost-sensitive model is developed by defining a penalty for each kind of sample. Our goal is to use the learned model to predict the probability that a service is relevant to the query, so that the matchmaker ranks services according to these probabilities. Normally, users want to find their desired services at the top of the ranking list, without caring whether all the relevant services are returned. From this point of view, a false negative prediction is, therefore, considered to have more serious consequences than a false positive prediction in this work. Thus, misclassifying an irrelevant pair of query and service is set to be 40 times as expensive as misclassifying a relevant pair in this cost-sensitive model, since relevant services are in general about 1/40 as numerous as non-relevant services for each query.
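The estimation step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it maximizes a cost-weighted version of the log of Eq. (4) by plain gradient ascent rather than the quasi-Newton method named above, it adds an explicit intercept $\beta_0$ to match Eq. (1), and it realizes the cost-sensitive model by weighting each sample's contribution, which is only one common way to do so. The function names, step size, epoch count and toy data are all hypothetical; the 40:1 cost ratio follows the text.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, equal to e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(beta, row):
    """Probability of relevance for one row of similarity values (Eq. (1))."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))


def fit_weighted_logistic(S, R, cost_relevant=1.0, cost_irrelevant=1.0,
                          step=0.5, epochs=5000):
    """Gradient ascent on the cost-weighted log-likelihood.

    S: list of rows [s_i1, ..., s_ik]; R: list of 0/1 relevance labels.
    Returns beta = [beta_0, beta_1, ..., beta_k].
    """
    k = len(S[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept (an assumption)
    costs = [cost_relevant if r == 1 else cost_irrelevant for r in R]
    total_cost = sum(costs)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, r, c in zip(S, R, costs):
            err = c * (r - predict(beta, row))  # weighted residual
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        # Normalizing by the total cost keeps the step size well scaled.
        beta = [b + step * g / total_cost for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data: two matching strategies, two relevant pairs,
# four irrelevant pairs.
S = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2], [0.1, 0.1]]
R = [1, 1, 0, 0, 0, 0]

beta = fit_weighted_logistic(S, R, cost_relevant=1.0, cost_irrelevant=40.0)
probabilities = [predict(beta, row) for row in S]
ranked = sorted(range(len(S)), key=lambda i: probabilities[i], reverse=True)
print(beta)
print(ranked)
```

The cost weights shift the predicted probabilities but leave the ranking driven by the learned coefficients, which is what the matchmaker ultimately uses.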

Figure 1 Performance comparison of statistical model based strategies. (a) SAWSDL-TC2; (b) JGD.

3 Experimental results

In this evaluation, we use two test collections from the Semantic Service Selection (S3) contest 2): SAWSDL-TC2 and Jena Geography Dataset (JGD). Each test collection is represented by a set of vectors with cardinality $|Q| \times |P|$ in matrix $S$, in which $Q$ and $P$ represent the sets of queries and services respectively in the test collection. Each row in matrix $S$ corresponds to a sample, which holds the similarity values of a pair of query and service under the respective matching strategies. The set of samples is divided into $|Q|$ folds, and each fold consists of all the samples related to one query. Each time, we take one fold as the test set (related to one query), learn the logistic regression model on the remaining $|Q| - 1$ folds, and then measure the effectiveness on the test query. Finally, the macro-average of the results of the $|Q|$ runs is taken as the performance of the statistical model based matching strategies on the whole test collection. This approach follows the standard N-fold cross validation in machine learning.

To show the performance of our method, we also implement other machine learning based matchmaking methods on the same matching strategies by using WEKA [12], such as ε-SVR, linear regression, J48 decision tree, Adaboosting based J48, etc. Figure 1 shows a performance comparison between the logistic regression model and other statistical model based matchmakers. On SAWSDL-TC2 (Figure 1(a)), the logistic regression based matchmaker (logistic) outperforms the other statistical model based matchmakers with a mean average precision (MAP) of 0.749, although it is slightly outperformed at the beginning by the ε-SVR based matchmaker (MAP = 0.723).
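The per-query folding and the MAP measure used in this evaluation can be sketched as below. This is a hedged illustration: it assumes the standard uninterpolated definition of average precision, which the text does not spell out, and the function names, the sample layout `(query_id, features, relevance)` and the toy values are hypothetical.

```python
from collections import defaultdict


def leave_one_query_out(samples):
    """Yield (query_id, train, test) with one fold per query.

    samples: list of (query_id, features, relevance) tuples.
    """
    by_query = defaultdict(list)
    for sample in samples:
        by_query[sample[0]].append(sample)
    for query_id, test in by_query.items():
        train = [s for other, group in by_query.items()
                 if other != query_id for s in group]
        yield query_id, train, test


def average_precision(scores, labels):
    """Uninterpolated average precision of one ranked list."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query):
    """Macro-average of average precision over queries."""
    values = [average_precision(scores, labels) for scores, labels in per_query]
    return sum(values) / len(values)


# Hypothetical toy collection: two queries, three services each.
toy = [("q1", [0.9], 1), ("q1", [0.4], 0), ("q1", [0.2], 0),
       ("q2", [0.8], 1), ("q2", [0.7], 1), ("q2", [0.1], 0)]
folds = list(leave_one_query_out(toy))

# Relevant services at ranks 1 and 3 of four returned services.
ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(len(folds), ap)
```

In a full run, a model would be trained on `train` in each fold, used to score the services in `test`, and the resulting per-query average precisions macro-averaged into the MAP.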
On JGD, the logistic regression based matchmaker (MAP = 0.67) outperforms the other matchmakers before half of the relevant services are returned. On the whole, the ε-SVR based matchmaker performs best, with a MAP of [value missing]. The logistic regression based matchmaker outperforms the J48, Adaboosting J48 and linear regression based matchmakers.

Figure 2 shows a performance comparison between the logistic regression model and the basic matchmaking strategies that are used to learn the model in this paper. Figure 2(a) indicates that, on SAWSDL-TC2, the logistic regression based matchmaker outperforms each basic matchmaking strategy. On JGD (Figure 2(b)), the same conclusion can be drawn, although it is slightly outperformed at the very beginning by single matching strategies such as the semantic annotations based matchmaking strategy (semantic IO) and the description text based matchmaking strategy (text).

Figure 2 The performance comparison between logistic regression model and basic matching strategies. (a) SAWSDL-TC2; (b) JGD.

In addition, we also compare our method with the well-known SVM based matchmaker SAWSDL-MX2, which integrates matching strategies different from those used in this paper. The mean average precision of our method is 0.749 on SAWSDL-TC2 and 0.67 on JGD, while the MAP of SAWSDL-MX2 is [value missing] on SAWSDL-TC2 and 0.45 on JGD.

In summary, our logistic regression model can effectively integrate the commonly used matching strategies shown in Table 1, and it improves the effectiveness of service matchmaking by drawing on the strengths of some strategies to offset the weaknesses of others. The results also indicate that selecting proper basic matching strategies is very important to integrated service matchmaking, since each matching strategy may contribute differently. This is another advantage of our method, since logistic regression can help us select proper matching strategies according to the estimates of the coefficients.

2) klusch/s3/html/2009.html

4 Conclusions

This paper proposes a novel method for Semantic Web service matchmaking, which employs logistic regression to aggregate the results obtained from several basic matching strategies into an overall similarity value. Experiments show that the logistic regression model performs strongly overall: it outperforms each basic matching strategy and, except for ε-SVR on JGD, the other statistical model based matchmakers. We can, therefore, conclude that the logistic regression model used in this paper is effective and appropriate for integrating individual similarity values obtained from various matching strategies on different description components.

Acknowledgements

The research was supported by the National Grand Fundamental Research Program of China (Grant No. 2011CB ) and the National Natural Science Foundation of China (Grant No ).

References

1 Wang H B, Huang J Z X, Qu Y Z, et al. Web services: problems and future directions. Web Semant Sci Serv Agent World Wide Web, 2004, 1:
2 Klusch M. Semantic service coordination. In: Schumacher M, Helin H, Schuldt H, eds. CASCOM: Intelligent Service Coordination in the Semantic Web. Berlin: Springer,
3 Deng S G, Wu Z H, Wu J, et al. An efficient service discovery method and its application. Int J Web Serv Res, 2009, 6:
4 Cai S B, Zou Y Z, Xie B, et al. Mining the Web of trust for Web services selection. In: Proceedings of 2008 IEEE International Conference on Web Services (ICWS 2008). Washington: IEEE Computer Society,
5 Plebani P, Pernici B. URBE: web service retrieval based on similarity evaluation. IEEE Trans Knowl Data Eng, 2009, 21:
6 Christoph K, Abraham B. The creation and evaluation of iSPARQL strategies for matchmaking. In: Proceedings of the 5th European Semantic Web Conference (ESWC). Berlin: Springer,
7 Klusch M, Kapahnke P, Zinnikus I. Adaptive hybrid semantic selection of SAWSDL services with SAWSDL-MX2. Int J Semant Web Inf Syst, 2010, 6: 1–26
8 Hosmer D W, Lemeshow S. Applied logistic regression. 2nd ed. New York: Wiley Inc,
9 Gey F C. Inferring probability of relevance using the method of logistic regression. In: Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: Springer-Verlag,
10 Wei D P, Wang T, Tang J T, et al. SAWSDL-iMatcher: A customizable and effective Semantic Web service matchmaker. Web Semant Sci Serv Agent World Wide Web, 2011, 9:
11 Wei D P, Wang T, Wang J, et al. Extracting semantic constraint from description text for Semantic Web service discovery. In: Proceedings of the 7th International Semantic Web Conference. Berlin: Springer,
12 Hall M, Frank E, Holmes G, et al. The WEKA data mining software: An update. SIGKDD Explor, 2009, 11: 10–18
