Improving Web Service Clustering through Ontology Learning and Context Awareness


Improving Web Service Clustering through Ontology Learning and Context Awareness

Banage Thenne Gedara Samantha Kumara

A DISSERTATION SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AND ENGINEERING

Graduate Department of Computer and Information Systems
The University of Aizu
2015

Copyright by Banage Thenne Gedara Samantha Kumara 2015
All Rights Reserved

The thesis titled Improving Web Service Clustering through Ontology Learning and Context Awareness by Banage Thenne Gedara Samantha Kumara is reviewed and approved by:

Chief Referee: Senior Associate Professor Incheon Paik
Professor Qiangfu Zhao
Professor Vitaly Klyuev
Associate Professor Neil Yen


Table of Contents

Chapter 1 Introduction
  1.1 Improving Web Service Clustering through Ontology Learning
  1.2 Improving Web Service Clustering through Context Awareness
  1.3 Original Contributions
  1.4 Thesis Organization
Chapter 2 Background and Related Works
  2.1 Overview of Web Services
    2.1.1 Web Service Architecture
    2.1.2 Web Services Stack
    2.1.3 WSDL Structure
  2.2 Web Service Clustering
    2.2.1 Functionally Based Web Service Clustering
    2.2.2 Non-Functionally Based Web Service Clustering
    2.2.3 Social-Criteria-Based Web Service Clustering
  2.3 Context-Aware Web Services
  2.4 Top-Down Flow from Service Composition to Proposed Approaches
Chapter 3 Ontology Learning Based Clustering
  3.1 Motivation for Ontology Learning Method and Proposed Clustering Approach
    3.1.1 Motivating Example for Ontology Learning
    3.1.2 Proposed Ontology Learning Based Clustering Approach
  3.2 Feature Extraction
  3.3 Ontology Generation and Feature Similarity Calculation
    3.3.1 Ontology Learning Method
    3.3.2 IR Based Term Similarity
    3.3.3 Matching Filters and Similarity Calculation
  3.4 Feature Integration
  3.5 Clustering Algorithm
    3.5.1 Proposed Cluster Center Identification Approach
Chapter 4 CAS Based Clustering Approach
  4.1 Motivation for Context Awareness and Proposed Clustering Approach
    4.1.1 Motivating Scenarios for Context Awareness
    4.1.2 Proposed CAS Based Clustering Approach
  4.2 CAS Method
    4.2.1 Outline of CAS Method
    4.2.2 Overview of SVMs
    4.2.3 Generating Context Vectors for Domains
    4.2.4 Training the SVMs
    4.2.5 Calculating Term Similarity from Model
  4.3 Spatial Clustering
    4.3.1 Calculating Affinity of Services
    4.3.2 ASKS Algorithm
    4.3.3 Spherical ASKS
Chapter 5 Experiments and Evaluation
  5.1 Evaluation of Ontology Learning Based Clustering Approach
    5.1.1 IR Based Term-Similarity Methods Evaluation
    5.1.2 Feature Strength Evaluation
    5.1.3 Ontology Evaluation
    5.1.4 Cluster Evaluation
  5.2 Evaluation of CAS Based Clustering Approach
    5.2.1 Experimental Setup
    5.2.2 SVM Kernel Performance
    5.2.3 Term Similarity Methods Evaluation
    5.2.4 Visualization of Web Service Clusters
    5.2.5 Comparison of Clustering Approach
Chapter 6 Conclusion and Future Work
Acknowledgments
References
Publications

List of Figures

Figure 1.1 Thesis organization
Figure 2.1 Service Oriented Architecture
Figure 2.2 Web services stack
Figure 2.3 Structure of WSDL file
Figure 2.4 Part of WSDL file
Figure 2.5 Top-down flow from service composition to proposed approaches
Figure 3.1 Generated ontology for motivating example
Figure 3.2 Overview of the ontology learning based clustering approach
Figure 3.3 Phases of the ontology learning based clustering approach
Figure 3.4 Part of WSDL file that shows the structure of the service element
Figure 3.5 Part of WSDL file that shows the structure of the message element
Figure 3.6 Example ontology concepts and relationships with rules
Figure 3.7 Subsumption hierarchy of the complex term
Figure 3.8 Sample ontology
Figure 3.9 Sample ontology with three ontology classes
Figure 3.10 Agglomerative clustering algorithm
Figure 4.1 Steps of the CAS based clustering approach
Figure 4.2 Effect of domain context in choosing clusters
Figure 4.3 Service clustering example with current approaches
Figure 4.4 Architecture of the CAS-based clustering approach
Figure 4.5 Snippets for the queries from the Google search engine
Figure 4.6 Term similarity calculation by model
Figure 4.7 Process of extracting frequently used terms
Figure 4.8 Interface of Wikipedia
Figure 4.9 Sample query in Google
Figure 4.10 Distance function f(θ) used in ASKS
Figure 4.11 Uniformalization process of SASKS
Figure 5.1 Feature strength evaluations
Figure 5.2 Ontology evaluation
Figure 5.3 Contribution of term-similarity methods
Figure 5.4 Cluster center identification approach evaluation
Figure 5.5 Cluster performance with the HTS approach, which uses ontology learning
Figure 5.6 Service sphere
Figure 5.7 Visualization results for the CAS method with a Vehicle filter
Figure 5.8 Visualization results for the CAS method (non-vehicle cluster) with a Vehicle filter
Figure 5.9 Visualization results for the ontology learning method
Figure 5.10 Visualization results for the ontology learning method (Medicine cluster)
Figure 5.11 Visualization results for the edge-count-based method (Vehicle cluster)
Figure 5.12 Part of cluster result with affinity values calculated using WordNet
Figure 5.13 Visualization results for the CAS method (Medicine filter)
Figure 5.14 Visualization results for the CAS method (Book filter)
Figure 5.15 Minimum distance between one service and the others
Figure 5.16 Average precision of the clusters

List of Tables

Table 1.1 Summary of issues in current clustering approaches
Table 2.1 Roles in SOA
Table 3.1 Matching filters for concepts in the Figure 3.8 ontology
Table 4.1 Frequently used terms in the computer domain
Table 5.1 Correlation between human rating and term-similarity methods
Table 5.2 Accuracy measures of clusters
Table 5.3 Term similarity with the CAS method for different domain filters
Table 5.4 Comparison of similarity calculation approaches
Table 5.5 Comparison for the Vehicle cluster
Table 5.6 Comparison of clustering approaches

Preface

This thesis presents my work in fulfillment of the requirements for the Doctor of Philosophy in Computer Science and Engineering, Graduate School of Computer Science and Engineering, the University of Aizu, Japan. The study was carried out in the period from April 2012 to March 2015.

Abstract

With the large number of Web services now available via the Internet, service discovery has become a challenging and time-consuming task. Organizing Web services into similar clusters is a very efficient approach to reducing the search space. A principal issue for clustering is computing the semantic similarity between services; this thesis addresses two aspects of that issue, improving clustering accuracy and achieving context awareness, and presents a solution for each. Current approaches use similarity-distance measurement methods such as keyword-based, information-retrieval (IR)-based or ontology-based methods. These approaches have problems that include difficulty in discovering semantic characteristics, loss of semantic information and a shortage of high-quality ontologies. First, we present a method that adopts ontology learning to generate ontologies via the hidden semantic patterns existing within complex terms. If calculating similarity using the generated ontology fails, the method then applies an IR-based method. Further, in the ontology-based approach, we propose an approach to identifying the cluster center by combining service similarity with the term frequency-inverse document frequency values of service names. Second, because current approaches do not consider the domain-specific context in measuring similarity, which affects their clustering performance, we propose a context-aware similarity (CAS) method that learns domain context by machine learning, producing models of context for terms retrieved from the Web. The CAS method analyzes the hidden semantics of services within a particular domain, and this awareness of service context helps to find cluster tensors that characterize the cluster elements. To analyze visually the effect of domain context on the clustering results, the CAS based clustering approach applies a spherical associated-keyword-space algorithm.
Experimental results show that the ontology based approach outperforms comparable existing approaches and that the CAS based approach works efficiently for the domain-context-aware clustering of services.


Chapter 1 Introduction

Service Oriented Architecture [1] has been a widely accepted paradigm for facilitating distributed application integration and interoperability. Web services are loosely coupled software components that are a popular implementation of the service-oriented architecture. Existing technologies for Web services have been extended to give value-added customized services to users through service composition [2]. Developers and users can then solve complex problems, such as building travel planners, by combining available basic services. Web service discovery, which aims to match the user request against multiple service advertisements and provides a set of substitutable and compatible services by maintaining the relationships among services, is a crucial part of service composition. Most business organizations are now moving towards Web services, and hence the number of Web services published on the Internet has increased in recent years [3]. With this proliferation of Web services, service discovery is becoming a challenging and time-consuming task because of unnecessary similarity calculations in the matchmaking process within repositories such as Universal Description, Discovery and Integration (UDDI) registries and Web portals. Clustering Web services into similar groups, which can greatly reduce the search space for service discovery, is an efficient approach to improving discovery performance. Clustering the Web services enables the user to identify appropriate and interesting services according to his or her requirements while excluding potential candidate services outside the relevant cluster, thereby limiting the search space to that cluster alone. Further, it enables efficient browsing for similar services within the

same cluster. Current clustering approaches can be classified by the properties used in the clustering process: (i) functionally based clustering, (ii) non-functionally based clustering and (iii) social-criteria-based clustering. Most previous works focus on functionally based clustering approaches, considering the semantics of functional properties such as operations and their input, output, precondition, and effect [4]-[6]. Non-functionally based clustering approaches reduce the computational time and complexity of Web service processes by considering quality-of-service properties such as cost and reliability. The user can use non-functionally based clusters to identify good, moderate, or poor instances from a collection of functionally equal services [7, 8]. There is little work related to social-criteria-based clustering, where social properties of services such as sociability [9] are considered. Usually, service clusters are created using functionality as the first factor, with other properties being considered as secondary factors. A principal issue for clustering is computing the semantic similarity between services. Current similarity computing approaches have problems that include difficulty in discovering semantic characteristics, loss of semantic information and a shortage of high-quality ontologies. Further, current similarity computing approaches do not consider the domain-specific context in measuring similarity. Thus, current clustering approaches have problems with both accuracy and context. This thesis proposes two clustering approaches for improving clustering in terms of accuracy and context.

1.1 Improving Web Service Clustering through Ontology Learning

As mentioned above, a clustering approach requires similarity calculation methods to compute the similarity of services. First, such a method computes the similarity of the individual features of the services.
Then, the similarity of services is computed as an aggregate of the individual feature-similarity values. Recent studies have proposed several approaches to calculating functional similarity. Simple approaches include checking the one-to-one matching of features such as the service name and checking the matching of service signatures such as the messages [6]. In some studies, information retrieval (IR) techniques are used. These include similarity-measuring methods such as search-engine-based (SEB) methods [10] and cosine similarity [11]. Some researchers have used logical relationships such as exact and plug-in [12] or edge-counting-based techniques [13] to increase the semantics in the similarity calculations via ontologies. However, one-to-one matching, structure matching and vector-space models may not accurately identify the semantic similarity among terms because of the heterogeneity and independence of service sources. These methods consider terms only at the syntactic level, whereas different service providers may use the same term to represent different concepts or may use different terms for the same concept. Furthermore, IR techniques such as cosine similarity usually focus on plain text, whereas Web services contain much more complex structures, often with very little textual description. This means that depending on IR techniques is very problematic. Moreover, there can be a loss of the machine-interpretable semantics found in service descriptions when converting the data provided in service descriptions into vectors for IR techniques. In SEB similarity-measuring methods such as normalized Google distance (NGD), there is no guarantee that all the information needed to measure the semantic similarity between a given pair of words is contained in the top-ranking snippets. On the other hand, although ontologies help to improve semantic similarity, defining high-quality ontologies is a major challenge. Several methods have been used to develop ontologies in current approaches, including obtaining assistance from domain experts, using resources such as WordNet [14] and using ontologies already available via the Internet [13].
Developing an ontology with the assistance of domain experts is a time-consuming task that requires considerable human effort. In addition, the lack of up-to-date information in a resource might mean that it fails to capture the latest concepts and relationships in a domain. Further, the lack of standards for integrating and reusing existing ontologies also hampers ontology-based (OB) semantics matching. Table 1.1 summarizes the issues that affect existing clustering approaches.
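As an illustration of the SEB category discussed above, the normalized Google distance between two terms can be computed from search-engine hit counts. The sketch below is a minimal illustration only, not the implementation used in this thesis; the hit counts are hypothetical values standing in for real search-engine results.

```python
import math

def ngd(hits_x, hits_y, hits_xy, total_pages):
    """Normalized Google distance between terms x and y.

    hits_x, hits_y -- number of pages containing each term alone
    hits_xy        -- number of pages containing both terms
    total_pages    -- (estimated) total number of indexed pages
    """
    fx, fy, fxy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    n = math.log(total_pages)
    return (max(fx, fy) - fxy) / (n - min(fx, fy))

# Hypothetical hit counts: terms that co-occur often get a distance near 0.
print(ngd(120000, 90000, 60000, 10**10))  # small value -> strongly related
print(ngd(120000, 90000, 5, 10**10))      # larger value -> weakly related
```

As the text notes, such counts are derived from top-ranking snippets and page statistics, so the measure cannot guarantee that all the information needed for a similarity judgment is captured.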

Table 1.1 Summary of issues in current clustering approaches

One-to-one and structure matching:
  - Consider terms only at the syntactic level
  - Loss of the machine-interpretable semantics found in service descriptions
  - Lack of up-to-date knowledge
  - Fail to identify synonyms or variations of terms

IR-based methods (e.g., cosine similarity):
  - Usually focus on plain text, whereas Web services contain much more complex structures, often with very little textual description
  - Loss of the machine-interpretable semantics found in service descriptions
  - Lack of up-to-date knowledge
  - Fail to identify synonyms or variations of terms

IR-based methods (e.g., WordNet):
  - Fixed; lack of up-to-date knowledge

OB methods:
  - Shortage of high-quality ontologies (defining high-quality ontologies is a major challenge)
  - Lack of up-to-date knowledge

SEB methods (e.g., NGD):
  - Do not encode fine-grained information

Another issue in clustering is how to create dense clusters that are well separated. To generate clusters, current approaches use traditional algorithms that include partition clustering methods such as k-medoids [15], bottom-up hierarchical clustering methods [16] and graph-based methods [17]. However, current approaches simply apply these algorithms without any fine-grained improvements. Identification of a suitable cluster representative for a service cluster is an important issue in creating dense, well-separated clusters. For example, in the agglomerative algorithm, merging

clusters involves finding the distance between clusters. Here, we need to select cluster representatives when measuring distances. However, there is no fine-grained method for identifying a suitable cluster representative. It is therefore possible for a false-positive cluster member to become the cluster representative, which will affect the clustering performance. In this thesis, as the first approach, we propose an ontology-learning method for calculating the semantic similarity of Web services to improve Web service clustering. Our ontology learning uses Web Services Description Language (WSDL) documents to generate ontologies by examining the hidden semantic patterns that exist within the complex terms used in service features (e.g., AuthorizePhysician as a subclass of Physician). If this fails to calculate the similarities, we then use an IR-based method. In the IR method, we use both thesaurus-based and SEB term similarities. To address the second issue in service clustering, we propose an approach that identifies the cluster center as the cluster representative by combining the service-similarity value with the term frequency-inverse document frequency (TF-IDF) value of the service name, which reflects the importance of a service to its cluster.

1.2 Improving Web Service Clustering through Context Awareness

As discussed above, current clustering approaches use similarity computing categories such as string-based approaches like cosine similarity, corpus-based approaches like NGD, knowledge-based approaches like OB methods, and hybrid approaches. These similarity computing methods compute the similarity of features as global values without considering specific domains. Therefore, one common issue in current functionally based clustering approaches is that they fail to identify changes in features across different domains. The semantic similarity of features can change according to the domain.
For example, the AmbulanceLocationInformation service will have a strong semantic similarity with Medicine domain services within the Medicine domain

and with Vehicle domain services within the Vehicle domain. However, it is remote from other domains such as Food or Film. Current clustering approaches fail to identify the semantic relationships between services that exist within a particular domain. As a result, using these approaches, some services may be placed in clusters that the user had not expected. In our example, suppose the user seeks information about an ambulance service by searching for an Ambulance Information service in the Medicine cluster. This may fail because the service was placed in the Vehicle cluster instead of the Medicine cluster. Therefore, we should cluster services according to the user's application domain to discover the advertisements that satisfy the user's requirements. To capture this semantic relationship, we need to analyze domain knowledge. Although ontology-based clustering approaches do use domain knowledge through ontologies [12, 13], the ontologies involve a shared model for domains, and even these approaches may fail to identify semantic relationships for a specific domain. It is very natural to match or retrieve information not only within a general context, but also within the context of a specific domain. That is, the clustering of services for specific domains will give well-clustered information for service discovery and composition. Awareness of the domain-specific context in services will help to find tensors among the clusters, which characterize their elements. Context is any information that can be used to characterize the situation of an entity [18]. In the literature, context awareness has been used to solve major problems in service computing such as recommendation [19], [20], discovery [21], [22] and composition [23], [24]. The various approaches have used different categories of context.
These categories include user properties such as location and time [20], [25], the environmental context such as the operating system and user devices [26], [27], user preferences [25], and the environment affecting Web-service delivery and execution, such as service profiles and network bandwidth [26], [27]. Contexts such as these are used as extra information in the matchmaking process to obtain the desired and most relevant services according to user requests. However, these contexts are not used to increase the semantics of the functional similarity of services or to cluster services,

because these approaches consider nonfunctional properties as the context. In contrast, we propose to employ context to capture the hidden semantics of services for particular domains when calculating functional similarity. In this thesis, we propose the context-aware similarity (CAS) method, our second similarity computing approach, to compute the similarity of services for different domains. We extend the definition of context [18] to refer to the semantically related set of terms that are used frequently in a given domain. Context is created using snippets that are extracted from the real Web using search engines. Support vector machines (SVMs) are trained to produce a model for computing the service similarity for different domains. We are able to compute reasonable service-similarity values by capturing the semantic relationships between services within a particular domain through the extracted context and the trained SVMs. In addition, the approach overcomes other limitations of current similarity calculation methods, including the lack of up-to-date information, the lack of high-quality ontologies, and the loss of machine-interpretable semantics. The CAS method obtains up-to-date knowledge from the Web and addresses the semantic issues of current methods by capturing the semantics between terms within a particular domain. Efficient visualization contributes greatly to identifying clustering situations. Therefore, we apply a spherical associated-keyword-space (SASKS) algorithm [28] to visualize the service clusters in this clustering approach. Our objective is to analyze the effect of domain context on the clustering results in terms of visual output. The clustering algorithms used in current clustering approaches, such as K-means [11], [13], hierarchical agglomeration [5], [29] and neural-network-based algorithms [30], [31], output abstract cluster information with lists of cluster members, but do not output visualized results.
Conceptual clusters are useful for machine-readable purposes, but visualization will help human manipulation of the service clusters, and visual feedback may inspire the identification of specific domains. SASKS can project the clustering results from a three-dimensional (3D) sphere to a two-dimensional (2D) spherical surface for 2D visualization. The algorithm provides visual output of the service placement on a sphere according to the similarity of services. We are then able

to analyze the clusters by changing the domain context, and to observe the changes of features for the various domains through this visualization.

1.3 Original Contributions

We propose two clustering approaches by introducing two novel similarity calculation methods. The main original contributions of this work toward increasing the performance of service clusters are as follows.

1. An ontology learning method is proposed to calculate service similarity and thereby improve the accuracy of service clusters.
2. The CAS method is proposed to capture the semantics of services under a particular domain when calculating service similarity, thereby improving the semantics of clusters.

Our additional contributions are as follows.

1. An approach is proposed to identify the cluster center to act as the cluster representative by combining the service-similarity value with the TF-IDF value of the service name in the ontology learning based clustering approach.
2. The SASKS algorithm is applied to produce visual output of service clusters in the CAS based clustering approach.

1.4 Thesis Organization

The thesis consists of five parts, as shown in Figure 1.1.

Part I: Background of the study

In Chapter 2, the background of the study is presented. First, we give an explanation of Web services and the WSDL structure. Then, we discuss work related to Web service clustering. Categories of Web service clustering are pointed out in

the discussion. Next, we focus on existing context-aware Web service approaches; research areas of context-aware Web services are analyzed in this chapter.

Part II: Ontology Learning Based Clustering Approach

Chapter 3.1 presents motivating scenarios for the ontology learning based method and explains the architecture of the ontology learning based clustering approach. Then, we explain the steps of the clustering approach, which include feature extraction, feature similarity calculation, feature integration and clustering. Here, we describe the proposed ontology learning method by explaining the procedure of ontology construction, the rules and the similarity calculation filters. In addition, we explain the IR based similarity calculation method that we use alongside the ontology learning method. Further, a cluster center identification approach is proposed to improve the performance of service clustering in the clustering step.

Part III: CAS Based Clustering Approach

Chapter 4.1 gives our research motivation for context awareness and describes the architecture of the CAS based clustering approach. Further, in this chapter, we present the proposed CAS based clustering approach. First, an outline of the CAS method is provided to give a clear idea of the importance of context awareness in service similarity computation. Then, we describe the steps of the method, which include context vector generation, training the SVMs and term similarity calculation. In the CAS method, we use the SASKS algorithm as the clustering algorithm to analyze the effect of context awareness visually. Thus, Chapter 4.3 provides the steps of the SASKS algorithm. First, we focus on the service affinity calculation needed to generate the affinity matrix. Then, the ASKS algorithm is presented. Finally, we show the changes required to convert the ASKS algorithm into the SASKS algorithm.

Part IV: Implementation and Evaluation

In Chapter 5, the implementation and evaluation of our two proposed clustering approaches are presented. Experimental results in Chapter 5.1 show that our proposed ontology learning approach can improve Web service clustering by addressing the shortage of ontologies in existing approaches. Further, experimental results in Chapter 5.2 show that the proposed CAS based clustering approach can improve the semantics of service clusters.

Part V: Conclusion and Future Work

In Chapter 6, the thesis is concluded and future work is presented.

Figure 1.1 Thesis organization

Chapter 2 Background and Related Works

In this chapter, we first present an overview of Web services. Then, we describe Web service clustering and give a brief description of existing clustering approaches. Chapter 2.3 discusses context-aware Web services. Finally, Chapter 2.4 discusses the top-down flow from the overall problem in service computing to our proposed approaches.

2.1 Overview of Web Services

A Web service is an interface that describes a collection of operations that are network accessible through standardized XML messaging. XML is used to encode all communications to a Web service. For example, a client invokes a Web service by sending an XML message and then waits for a corresponding XML response. Because all communication is in XML, Web services are not tied to any one operating system or programming language, and they hide the implementation details of the service. The platform-independent nature of the services allows and encourages Web services-based applications to be loosely coupled, component-oriented, cross-technology implementations. Further, we can describe a Web service as a collection of open protocols and standards used for exchanging data between applications or systems. Thus, software applications written in various programming languages and running on various platforms can use Web services to exchange data over computer networks like the Internet in a manner similar to inter-process communication on a single computer.
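To make the XML-messaging interaction above concrete, the sketch below builds a minimal SOAP-style request envelope with Python's standard library. The envelope layout, operation name, parameters and target namespace are illustrative assumptions rather than details from this thesis; in practice, a client would POST the resulting XML to the service endpoint and wait for the XML response.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_request(operation, params, target_ns="http://example.com/vehicle"):
    """Build a minimal SOAP-style XML request for a Web service operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{target_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{target_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# A hypothetical price-lookup request; an HTTP client would send this XML
# to the service endpoint, and the reply would arrive as XML as well.
request = build_request("GetVehiclePrice", {"make": "Toyota", "model": "Corolla"})
print(request)
```

Because both sides exchange only XML, neither the client nor the service needs to know the other's implementation language or platform, which is precisely the loose coupling described above.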

A Web service is described using a standard, formal XML notation, called its service description. The service description document describes all the details necessary to interact with the service, including message formats, transport protocols and location. Web services fulfill a specific task or a set of tasks. Web services share business logic, data and processes through a programmatic interface, and represent an important way for businesses to communicate with each other and with clients. They can be used alone or with other Web services to carry out a complex aggregation or a business transaction, as discussed in the introduction. The concept of Web services has therefore become a widely applied paradigm in research and industry.

Definition of Web services: Web services are loosely coupled software components that are built on top of open standards such as TCP/IP, HTTP, Java, HTML, and XML. The applications are self-contained, modular, distributed, dynamic applications that can be described, published, located, or invoked over the network to create products, processes, and supply chains. Web services have an interface described in a machine-processable format. These applications can be local, distributed, or Web-based.

2.1.1 Web Service Architecture

Web services are a popular implementation of Service Oriented Architecture (SOA). Figure 2.1 shows the architecture. In this architecture, a Web service is an interface described by a service description; its implementation is the service, and the service description contains the details of the interface and implementation of the service. This includes its data types, operations, binding information and network location. Further, the architecture includes three roles, namely service providers, service brokers and service requesters. The roles are summarized in Table 2.1. A service provider can publish services, which can be found and invoked by a service requester.
A service broker provides information about registered services, acting as a bridge between provider and requester. The operations between the three roles, such as publishing, locating

and binding, are based on standard protocols (e.g., WSDL, UDDI and SOAP [32]). In detail, these operations can be described as follows:

Publishing: To be accessible, a service description needs to be published in a service registry. Once the service is available in the registry, a service requestor can find it.

Locating: In this operation, the service requestor retrieves a service description from the service registry for the type of service required.

Binding: In this operation, the service requestor invokes or initiates an interaction with the service at runtime using the binding details in the service description.

Figure 2.1 Service Oriented Architecture

Table 2.1 Roles in SOA

Role              | Business perspective                                      | Architectural perspective
Service Provider  | Owner of the service                                      | Platform that hosts access to the service
Service Requester | Business that requires certain functions to be satisfied  | Application that is looking for and invoking or initiating an interaction with a service
Service Broker    |                                                           | Searchable registry of service descriptions where service providers publish their service descriptions

2.1.2 Web Services Stack

To perform the three operations of publish, find and bind, there must be a Web services stack that embraces standards at each level. Figure 2.2 shows the conceptual stack of Web services.

Figure 2.2 Web services stack

The lowest level of the Web services stack is the network: Web services must be network accessible to be invoked by a service requestor. The next layer, XML-based messaging, represents the use of XML as the basis for the messaging protocol. The service description layer is actually a stack of description documents; WSDL is one of the standards for XML-based service description. Above that sit the upper-layer processes. There are many ways that a requester entity might engage and use a Web service. In general, as the first step the requester and provider entities become known to each other. Then the requester and provider entities agree on the service description and semantics. Next, the service description and semantics are realized by the requester and provider agents; finally, the requester and provider agents exchange messages. From these steps, we can see that the requester first deals with the service description file and not with the real Web service: the requester needs to discover the relevant service description document from the collection of service description documents. Thus, in our research, we cluster service description documents. We used

WSDL documents to cluster the services. We provide a brief description of the WSDL document in the next sub section.

2.1.3 WSDL Structure

The service description is key to making the Web Services architecture loosely coupled and to reducing the amount of shared understanding, custom programming and integration required between the service provider and the service requestor. There are many industrial and academic standards for Web service descriptions, including WSDL [33], OWL-S [34] and the Web Service Modeling Ontology [35]. WSDL is the most fundamental standard form for describing service APIs. WSDL is an XML document for describing Web Services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. We use the structure of WSDL to cluster services, and we translate material retrieved in other formats into WSDL. Figure 2.3 represents the structure of a WSDL file and Fig. 2.4 shows part of the WSDL file of the VehiclePriceService. A WSDL file provides definitions grouped into the following sections:

<definitions> The root element of a WSDL document. The attributes of this element specify the name of the WSDL document, the document's target namespace and the shorthand definitions for the namespaces referenced in the WSDL document.

<types> The XML Schema definitions for the data units that form the building blocks of the messages used by a service.

<message> The section that contains the description of the messages exchanged during invocation of a service operation.

<portType> The most important WSDL element. It defines a Web service, the operations that can be performed and the messages that are involved.

<binding> The section that defines the message format and communication protocol details for each port. It links the port type to a transport method.

<service> The section that defines port elements that specify where requests should be sent.
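To make the section structure concrete, the sketch below parses a toy WSDL fragment with Python's standard `xml.etree` module and pulls out the service name, the operation names from `<portType>` and the message part names. The VehiclePriceService content shown is invented for illustration; it is not the thesis's actual file.

```python
import xml.etree.ElementTree as ET

# A toy WSDL fragment (hypothetical content, for illustration only).
WSDL = """<definitions name="VehiclePriceService"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    targetNamespace="http://example.org/vehicle">
  <message name="GetPriceInput"><part name="vehicle" type="xsd:string"/></message>
  <message name="GetPriceOutput"><part name="price" type="xsd:double"/></message>
  <portType name="VehiclePricePortType">
    <operation name="GetVehiclePrice">
      <input message="tns:GetPriceInput"/>
      <output message="tns:GetPriceOutput"/>
    </operation>
  </portType>
</definitions>"""

NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

root = ET.fromstring(WSDL)
service_name = root.attrib["name"]
operations = [op.attrib["name"]
              for op in root.findall(".//wsdl:portType/wsdl:operation", NS)]
parts = [part.attrib["name"]
         for part in root.findall(".//wsdl:message/wsdl:part", NS)]

print(service_name)  # VehiclePriceService
print(operations)    # ['GetVehiclePrice']
print(parts)         # ['vehicle', 'price']
```

These extracted names correspond to the features (service name, operations, message parts) used later in the clustering pipeline.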

Figure 2.3 Structure of WSDL file

Figure 2.4 Part of WSDL file

2.2 Web Service Clustering

Service clustering, which can greatly reduce the search space of service discovery, is an efficient way to increase discovery performance. The idea is to organize semantically similar services into one group. As mentioned in the Introduction, service clustering can be categorized as functionally based, non-functionally based and social criteria based clustering.

2.2.1 Functionally Based Web Service Clustering

Functionally based clustering approaches use functional attributes of Web services, such as service name, operation name, input and output, in the clustering process. Calculating the semantic similarity between services has been a critical issue for functionally based service clustering. Over recent decades, several approaches have been developed for the improved measurement of service similarity. The similarity methods, such as cosine similarity, SEB methods and ontology methods, fall into four categories: (i) string-based approaches; (ii) corpus-based approaches; (iii) knowledge-based approaches; and (iv) hybrid approaches. String-based approaches operate on string sequences and character composition. They measure the similarity or dissimilarity between two text strings for approximate string matching or comparison. Similarity methods such as one-to-one matching and cosine similarity belong to this category. Corpus-based similarity is a semantic similarity measure that determines the similarity between terms according to information gained from large corpora. SEB methods such as NGD can be included in this category. Knowledge-based similarity is a semantic similarity measure that determines the degree of similarity between terms using information derived from semantic networks such as ontologies and knowledge bases such as WordNet. Hybrid approaches use a combination of the above approaches. The clustering approach of [11] uses string-based methods such as cosine similarity to measure the similarity of services.
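As a reference point for the string-based category, a bag-of-words cosine similarity can be sketched in a few lines. This is a minimal illustration, not necessarily the exact variant used in [11]:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Plain bag-of-words cosine similarity over whitespace-split tokens."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Two of three tokens overlap, so the score is 2/3.
print(round(cosine_similarity("vehicle price service", "car price service"), 2))
```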
Cosine similarity usually focuses on plain text, whereas Web services can contain much more complex structures, often with very

little textual description. This makes the method very problematic. Further, the method cannot perform fine-grained measurements when calculating the semantic similarity of services because of the absence of machine-interpretable semantics. Liu and Wong [10], and Elgazzar et al. [6], combined a string-based similarity method, such as structure matching, with a corpus-based method based on NGD to measure the similarity of Web service features and to cluster them appropriately. However, structure matching may not accurately identify the semantic similarity among terms because of the heterogeneity and independence of service sources. These methods consider terms only at the syntactic level. Further, NGD does not take into account the context in which the terms co-occur, and, although the method uses up-to-date knowledge and information from the Internet, it does not encode fine-grained information, leading to low precision in the clustering results. The research in [30] represented the service name, operations and messages using a WordNet-VSM model and calculated the feature similarity by generating vectors. Instead of using traditional clustering algorithms, they used unsupervised self-organizing maps of neural networks based on a kernel cosine-similarity measure. However, this method has some problems with clustering, such as linking the output node to a weight and requiring much computation. Lee and Kim [36] presented an approach to clustering parameter names from a collection of Web application description-language documents into meaningful concepts. This research utilized a heuristic as the basis for clustering, in that parameters tend to express the same concept if they occur together frequently. Therefore, without using any traditional semantic-similarity measuring methods, they used an association rule to identify relationships between parameters.
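For reference, the NGD score criticized above is computed from search-engine page counts. The sketch below assumes the standard Cilibrasi and Vitanyi formulation, with fx and fy the page counts of the two terms, fxy their co-occurrence count, and n the number of indexed pages:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw page counts.
    Smaller values mean semantically closer terms; 0 means the terms
    always co-occur."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

print(ngd(1000, 1000, 1000, 10**9))  # 0.0: the terms always co-occur
```

Note that the formula uses only co-occurrence counts, which is exactly why it cannot capture the context in which the terms co-occur.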
Further, to increase the precision of the clusters, they captured the relationships between the terms, using the patterns existing within complex terms, and saved them in an ontology. Nayak and Lee [5] proposed a Web service discovery approach with additional semantics and clustering. They took advantage of the OWL-S ontology and the WordNet lexicon to enhance the descriptions with semantics. Each of the terms extracted from the service documents was expanded to enhance its semantics by using WordNet to

identify synonyms. They used the Jaccard coefficient in computing the service similarity. Wen et al. [15] presented a Web service discovery method based on semantics and clustering. Similarities among services were computed by using knowledge-based methods based on WordNet and ontologies. Wagner et al. [12] arranged Web services in a functionality graph based on the Exact and Plug-in relationships. Logic-based filters were used to calculate the service similarity, and connected components in the graph were considered as clusters. Xie et al. [13] measured the similarity based on two aspects (functional and process), with the aim of clustering services. A weighted-domain ontology method was used to calculate the functional similarity using input and output parameters, and the domain ontology was developed using a semantic dictionary and existing ontologies from the Internet. Chifu et al. [37] proposed an approach to service clustering inspired by the behavior of ants. They defined a set of metrics to evaluate the semantic similarity between services; the proposed metrics considered ontology hierarchies of concepts and properties. In addition, [38] presented an algorithm for Web service clustering that utilizes graph theory, together with a corresponding algorithm for Web service discovery. The proposed algorithms are designed for semantic Web services; thus, ontologies are used to describe the input and output parameters of the services. However, by using fixed ontologies and fixed knowledge bases, these approaches fail to capture reasonable similarity values for different domains. Further, developing an ontology with assistance from domain experts is a time-consuming task that requires considerable human effort. In our ontology learning approach, we automatically generate ontologies by examining the service description files. Furthermore, knowledge-based methods lack up-to-date information and suffer from a shortage of high-quality ontologies.
The research in [39] investigated the ranking and clustering of Web service search results based on the notion of dominance. The proposed methods perform the matching processes by employing multiple criteria and do not aggregate the matching scores of each service parameter. For clustering, they evaluate two algorithms, namely Approximate Skyline Clustering and Heuristic Skyline Clustering.

The experiments are conducted in order to find the most representative services for clustering. Consequently, the clusters are able to reveal the trade-offs that exist between the matched parameters. Similar work [40] presented an improved Web service clustering method that uses a Peano space filling curve. The technique is compared with previous work that employs a Hilbert space filling curve; the results show that the proposed technique is better than the previous work in terms of fairness, scalability and irregularity. As we discussed in the Introduction, one common drawback of all of the above similarity methods is that they do not consider the domain context, thereby losing the semantic relationships that exist between terms within particular domains.

2.2.2 Non-Functionally Based Web Service Clustering

Non-functionally based clustering approaches use the QoS attributes of Web services in the clustering process. The research in [41] proposed Web service clustering based on QoS properties using a genetic algorithm to improve the efficiency of service discovery. In addition, [7] presented an algorithm that used a clustering technique to group a huge number of Web services according to their QoS properties. The algorithm was able to reduce computational time and produce a near-optimal Web service selection process. Further, Zhu et al. [42] proposed a clustering-based QoS prediction solution for a Web service recommendation system. They argued that real-time QoS values are needed to ensure the accuracy of the prediction. However, in our approaches, we do not consider QoS properties; we focus on improving functionally based clustering.

2.2.3 Social Criteria Based Web Service Clustering

There are only limited works on social criteria based clustering. The works [9] and [43] proposed a social criteria based discovery method and a service composition approach, respectively.
The works connected isolated service islands into a global social service network to enhance the services' sociability on a

global scale to improve discovery and composition. They considered social properties such as sociability preference in generating the global social service network.

2.3 Context-Aware Web Services

In our CAS based clustering approach, we use domain-specific context to calculate the semantic similarity between services. As mentioned in the Introduction chapter, context awareness can be used to increase the performance of service discovery, recommendation, and composition. Instead of a formal definition of context [18], we can find enumeration definitions of context in the literature. In an enumeration definition, researchers define the context from three aspects, namely where a service is, what other services are present, and what resources are nearby. They argue that the context should include items such as location, lighting, noise level, network connectivity, and communication costs [44]. Zhang et al. [26] proposed a novel approach to modeling the dynamic behavior of interacting context-aware Web services, aiming to process and take advantage of context effectively in realizing the behavior adaptation of Web services. Moreover, for service recommendation, some researchers have used both personal-usage history-oriented contexts and group-usage history-oriented contexts, such as collaborative filters that assess users' ratings of Web services [45], [46]. However, although these context-aware services help to identify services that match a user's preference, user properties, usage history, or environment properties in the match-making process, they do not bootstrap to increase the functional semantic similarity of Web services by considering the domain-specific context. Further, these approaches consider only non-functional aspects. In our CAS based clustering approach, we define a context that can capture the hidden semantics of terms within particular domains in the similarity calculation process by considering the functional aspect.
To match and rank Web services for service composition, Segev and Toch [24] extracted sets of terms from Web documents that were considered as context by being

semantically related to the terms used in service description files. Zhang et al. [47] proposed a context based clustering approach by introducing service usage context. However, these two works also did not consider domain-specific contextual information.

2.4 Top-Down Flow from Service Composition to Proposed Approaches

Figure 2.5 shows the flow of our research from Web service composition to the two proposed approaches. As mentioned, two service clustering approaches are proposed in this thesis. To address the issues in current similarity calculation methods, we propose two methods, namely an ontology learning based method and a CAS method. Our objective is to improve service clustering through the proposed similarity methods. Service clustering is an efficient approach for service discovery. In the figure, we also show the relationship between service discovery and service composition.

Figure 2.5 Top-down flow from service composition to proposed approaches

Chapter 3 Ontology Learning Based Clustering

In this chapter, we first present motivating scenarios for our ontology learning based clustering approach and the architecture of the proposed approach. Chapter 3.1 presents motivating scenarios for the ontology learning based method and explains the architecture of the ontology learning based clustering approach. Chapter 3.2 discusses the feature extraction process. Then, we describe the proposed ontology learning method by explaining the procedure of ontology construction, the rules and the similarity calculation filters in chapter 3.3. In addition, we explain the IR based similarity calculation method that we use together with the ontology learning method in the same sub chapter. Chapter 3.4 presents the feature integration process. Finally, chapter 3.5 discusses the clustering algorithm; in the same sub chapter, we explain the proposed cluster center identification approach, which we adopt in the clustering algorithm to improve the performance of service clustering.

3.1 Motivation for Ontology Learning Method and Proposed Clustering Approach

The similarity calculation approach used affects service clustering performance. Current similarity computing approaches have problems that include a lack of semantic characteristics, resulting in a loss of semantic information caused by the shortage of proper ontologies. For example, a description of service features in

WSDL or OWL-S usually consists of complex terms. Current approaches simply split them into token terms and enter the analysis phase directly. This results in a simple mechanical analysis of the terms and hinders the accurate calculation of service similarity. In a real situation, complex terms contain ontological relationships that should be utilized. Our research analyzes these ontological relationships by ontology learning from a service data set. Capturing the ontological concepts in complex terms improves the performance of the similarity calculation significantly. If the similarity calculation in terms of the generated ontologies fails, the calculation is handed over to an IR-based method. Therefore, although the ontologies may fail in calculating similarities, our ontology learning based clustering approach can still calculate a reasonable semantic similarity for services via the IR-based method. This hybrid of ontology learning and an IR-based method optimizes the similarity calculation in a natural fashion. As another contribution to improving clustering performance, we suggest a new method for calculating cluster centers, which has not been studied previously in the service clustering literature. If an invalid cluster member is selected as a cluster center in an intermediate step of a clustering algorithm such as agglomerative clustering, it will affect the final clustering solution. As mentioned above, there are no fine-grained methods for finding the center in existing approaches. In this research, we find the center by using the TF-IDF value of the service name and the service-similarity values.

3.1.1 Motivating Example for Ontology Learning

Consider the calculation of the similarity between the ScienceFictionNovel and RomanticNovel services. To calculate the similarity, we first tokenize the complex terms and calculate pair similarities (e.g., (Science, Romantic), (Science, Novel), (Fiction, Romantic)).
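The tokenize-and-pair step in this example can be sketched as follows, assuming the camel-case split (a capital letter starts a new word) used for complex service terms:

```python
import re

def tokens(term):
    # Assumption: a capital letter marks the start of a new word.
    return re.findall(r"[A-Z][a-z]+", term)

def token_pairs(a, b):
    """All cross pairs of tokens, as compared by existing approaches."""
    return [(x, y) for x in tokens(a) for y in tokens(b)]

print(token_pairs("ScienceFictionNovel", "RomanticNovel"))
# [('Science', 'Romantic'), ('Science', 'Novel'), ('Fiction', 'Romantic'),
#  ('Fiction', 'Novel'), ('Novel', 'Romantic'), ('Novel', 'Novel')]
```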
Existing approaches consider only the distance between pairs of tokenized terms and cannot completely capture the semantics of the complex term. When we analyze the complex terms, we can identify hidden semantic patterns that may exist between tokenized terms in complex terms (e.g., RomanticNovel is a

subclass of Novel). We can use this semantic pattern by generating ontologies for the service domain. Figure 3.1 (a) shows the generated ontology for the above two services, and Figure 3.1 (b) shows the extended ontology with more services. We can then measure the similarity of features by considering the whole complex terms. This helps to preserve the semantics existing in the features.

(a) Ontology for two Web services (b) Extended ontology
Figure 3.1 Generated ontology for motivating example

3.1.2 Proposed Ontology Learning Based Clustering Approach

The architecture of the proposed ontology learning based clustering approach is illustrated in Figure 3.2, with Figure 3.3 showing the five phases used by our clustering approach. We use WSDL files to cluster the services. First, we mine the WSDL documents to extract features that describe the functionality of services in the feature-extraction phase (FE). In the ontology-learning phase (OL), we use the ontology learning method to generate ontologies for all of the extracted features. We then compute the similarity of individual features in the similarity-calculation phase (HTSC) by using the hybrid term similarity (HTS) method, which uses ontology learning and IR based term similarity. Next, features are integrated in the feature-integration phase (FI). Finally, in the clustering phase (CL), an agglomerative clustering algorithm is used to cluster the services. To identify the cluster centers in its

intermediate steps, the algorithm uses a cluster-center identification approach based on similarity values and the TF-IDF of the service name.

Figure 3.2 Overview of the ontology learning based clustering approach
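The clustering phase (CL) can be illustrated with a minimal average-linkage agglomerative loop over precomputed pairwise similarities. The service names and similarity values below are invented, and this sketch omits the cluster-center identification that the proposed algorithm performs at each merge step:

```python
def agglomerative(names, sim, threshold):
    """Merge the most similar pair of clusters until no pair's average
    similarity reaches the threshold. sim maps frozenset pairs to scores."""
    clusters = [[n] for n in names]

    def link(a, b):  # average similarity between two clusters
        return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [((i, j), link(clusters[i], clusters[j]))
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        (i, j), best = max(pairs, key=lambda t: t[1])
        if best < threshold:
            break  # no pair is similar enough to merge
        clusters[i] += clusters.pop(j)
    return clusters

sim = {frozenset(("A", "B")): 0.9,
       frozenset(("A", "C")): 0.2,
       frozenset(("B", "C")): 0.1}
print(agglomerative(["A", "B", "C"], sim, threshold=0.5))  # [['A', 'B'], ['C']]
```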

Figure 3.3 Phases of the ontology learning based clustering approach

3.2 Feature Extraction

As the first step of the functionally based clustering procedure, we need to extract service features from the Web service description documents. As mentioned, we use WSDL documents to cluster the Web services. As is usual in the literature [6, 10], we use the service name, operations, domain name, and input and output messages as the service features. The selected features of a WSDL file describe and reveal the functionality of its Web service. We extract the service name and domain name from the service and definitions elements of the WSDL documents. Figure 3.4 shows the location of the service name in the WSDL file; the figure is part of the ScienceFictionNovelPublisher service's WSDL document. Operations give an abstract description of the actions supported by the service, which are listed in the main element <portType>. An operation has several parameters, such as its name, and other optional attributes specifying the order of the

Figure 3.4 Part of WSDL file that shows the structure of the service element

parameters used in that operation. As an example, we can define an operation called GetSymbol that takes the GetSymbolInput message as input and produces the GetSymbolOutput message as output. We extract the operation name as a feature. Message elements describe the data being exchanged between the Web service providers and consumers. Messages are composed of part elements, one for each parameter of the Web service's function. In this thesis, we consider the part elements when measuring the similarity of input and output messages. Multiple part names are used in a message when the message has multiple logic units. For example, a get_book_priceResponse output message in an AuthorBook-price service would have Book and Price part elements (Figure 3.5). The average similarity value for the input or output messages is then calculated as follows:

$Sim_m(S_i, S_j) = \frac{\sum_{p=1}^{k}\sum_{q=1}^{n} max\_sim(s_p, r_q)}{k \times n}$   (3.1)

Here, s_p and r_q denote the individual part elements of the input or output messages in services S_i and S_j, respectively. Parameters k and n are the numbers of part elements in the input or output messages. Further, the similarity value Sim_m(S_i, S_j) is between 0 and 1.
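Eq. (3.1) can be sketched as follows, treating max_sim as any pairwise term-similarity function and averaging its scores over the k x n part pairs. The exact-match scorer below is purely illustrative:

```python
def avg_pair_similarity(parts_i, parts_j, max_sim):
    """Average of max_sim(s_p, r_q) over all k x n part-element pairs,
    in the spirit of Eq. (3.1)."""
    k, n = len(parts_i), len(parts_j)
    return sum(max_sim(p, q) for p in parts_i for q in parts_j) / (k * n)

# Toy scorer: 1.0 for identical part names, 0.0 otherwise (illustrative only).
toy = lambda p, q: 1.0 if p == q else 0.0
print(avg_pair_similarity(["Book", "Price"], ["Book", "Author"], toy))  # 0.25
```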

Figure 3.5 Part of WSDL file that shows the structure of the message element

3.3 Ontology Generation and Feature Similarity Calculation

As the second step, we need to generate the ontologies from the service documents. Here, ontologies are generated separately for each service feature type (e.g., service name). We use the ontology learning algorithm explained in the next sub chapter to generate the ontologies. After generating the ontologies, the service feature similarity values are computed using the HTS approach, which combines the proposed ontology learning method with IR based term similarity. In calculating the similarity of the relevant feature (e.g., service name) of two Web services, we do not split the complex terms as is usual in the literature, but instead consult the ontologies generated by the ontology learning method for those complex terms to check for the existence of concepts. If there are any concepts that relate the features of the two Web services within the same ontology, then we compute the degree of semantic matching for the given pair of service features by applying the different filters explained in a later sub chapter. Otherwise, we use the IR method to calculate the similarity.

3.3.1 Ontology Learning Method

As mentioned, developing a high-quality ontology is a difficult and time-consuming task, and IR methods lose the semantics found in service descriptions. As a result, the clustering accuracy of existing approaches has reached a saturation point and cannot be

improved further. We therefore propose an ontology learning method that analyzes service features to recognize their semantics more precisely. We use the complex terms in service features and their underlying semantics to generate the ontologies automatically. First, we extract the relevant feature (e.g., service name) from the service data set. If the feature is a complex term, we then split it into individual terms based on several assumptions. For example, the ComedyFilm name would be divided into two parts (Comedy, Film) based on the assumption that capitalized characters indicate the start of a new word. The Author-of-Novel name would be divided into three parts (Author, of, Novel) based on the assumption that a hyphen (-) is used to join two words. Stop-word filtering is then performed to remove any stop words (e.g., of). In the next step, after this preprocessing, we find the TF-IDF value of all the tokenized words. The terms are ranked according to their TF-IDF values, with the highest-ranking word having the highest TF-IDF value, and a threshold TF-IDF value is defined. This is because we need to identify only service-specific terms relevant to the service domain and the more meaningful terms for generating the upper-level concepts. An ontology is an explicit specification of a conceptualization. Relations describe the interactions between concepts or a concept's properties. We consider two types of relations, namely the concept hierarchy (Subclass-Superclass) and triples (Subject-Predicate-Object). Let C be a set of concepts {C_1, C_2, ..., C_n} in the ontology. Here, C_i represents S_i^F, which is a feature F (e.g., service name) of service S_i. LSC(C_i) is the set of least specific concepts (direct children) C_x of C_i; that is, C_x is an immediate sub-concept of C_i in the concept hierarchy. LGC(C_x) is the set of least generic concepts (direct parents) C_i of C_x. PROP(C_i) is the set of properties of concept C_i.
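The tokenization and TF-IDF ranking steps described above can be sketched as follows; the exact weighting scheme is an assumption, and the toy service names are invented:

```python
import math
from collections import Counter

def rank_terms(tokenized_services, threshold=0.0):
    """Rank tokenized terms by a simple TF-IDF score (term frequency times
    inverse document frequency) and keep those scoring above the threshold."""
    n = len(tokenized_services)
    tf = Counter(t for doc in tokenized_services for t in doc)
    df = Counter(t for doc in tokenized_services for t in set(doc))
    score = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return [t for t in sorted(score, key=score.get, reverse=True)
            if score[t] > threshold]

docs = [["Comedy", "Film"], ["Action", "Film"], ["Comedy", "Novel"]]
print(rank_terms(docs))  # rarer terms (Action, Novel) rank first
```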
Definition 1 (Subclass-Superclass relationship): If C_i ∈ LSC(C_j) ∧ C_j ∈ LGC(C_i), then there exists a Subclass-Superclass relationship between concepts C_i and C_j. Concept C_i can be an individual term (Employee) or a complex term (OrganizationEmployee). If a concept is a complex term, then its rightmost term is the

head of the concept (Employee) and the element to the left is the modifier term of the concept (Organization).

Rule 1 (Head-Modifier relation rule): Heads and modifiers express Subclass-Superclass relations between lexical items. This identifies a set of terms related through hyponymy, with the head of the compound constituting the hypernym [48].

Example 1 (Subclass-Superclass relationship): Consider the complex term RomanticNovel. Here, Novel is modified by the term Romantic. Therefore, RomanticNovel is a subclass of Novel, as shown in Figure 3.6 (a).

We consider two types of properties in this research, namely data and object. A data property refers to data in a concept (e.g., Organization name). An object property is used to relate a concept to another concept (e.g., Organization has OrganizationEmployee).

Definition 2 (Property relationship): If there exists C_j ∈ PROP(C_i), then C_j has a property relation (triple relation) with C_i. Here, the target entity of the property can be either an object or data.

Definition 2.1 (Data property relationship): If p_i ∈ PROP(C_j) and p_i is data in concept C_j, then there exists a data property relationship between p_i and concept C_j.

Rule 2 (Compound noun rule): If the individual terms in complex term t are nouns, and if there is no concept in the ontology that is equal to head term H_t, and if there is a concept that is equal to modifier term M_t, then there exists a data property relationship between concept M_t and data t.

Example 2 (Data property relationship): Consider the complex term PhysicianName. If Name is not a concept and Physician is a concept, then

PhysicianName is a data property of Physician, as shown in Figure 3.6 (b).

Definition 2.2 (Object property relationship): If (C_i ∈ PROP(C_j)) ∨ (C_j ∈ PROP(C_i)), then there exists an object property relationship between concepts C_i and C_j.

Rule 3 (Concept and modifier rule): If concept C_i is equal to a modifier term of concept C_j, then there exists an object property relationship between C_i and C_j.

Example 3 (Object property relationship): If concept C_i is Hospital and concept C_j is HospitalEmployee, then the relationship can be expressed as HospitalEmployee has Hospital and Hospital has HospitalEmployee, as shown in Figure 3.6 (c).

Rule 4 (Modifier only rule): If a modifier term of concept C_i is equal to a modifier term of concept C_j, and if there is no concept in the ontology that is equal to that modifier term, then there exists an object property relationship between C_i and C_j.

(a) Head-Modifier relation rule (b) Compound noun rule (c) Concept and modifier rule (d) Modifier only rule
Figure 3.6 Example ontology concepts and relationships with rules
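Rule 3 can be sketched as a check over a concept set: if one concept equals the modifier part of another, the two are linked by mutual "has" object properties. The concept names below are taken from Example 3; the helper names are illustrative:

```python
import re

def split_camel(term):
    # Assumption: a capital letter starts a new word in a complex term.
    return re.findall(r"[A-Z][a-z]*", term)

def rule3_object_properties(concepts):
    """For each complex concept whose modifier part is itself a concept,
    emit the two mutual 'has' object-property triples (Rule 3 sketch)."""
    relations = []
    for cj in sorted(concepts):
        parts = split_camel(cj)
        if len(parts) < 2:
            continue
        modifier = "".join(parts[:-1])  # everything left of the head term
        if modifier in concepts:
            relations.append((modifier, "has", cj))
            relations.append((cj, "has", modifier))
    return relations

print(rule3_object_properties({"Hospital", "HospitalEmployee"}))
# [('Hospital', 'has', 'HospitalEmployee'), ('HospitalEmployee', 'has', 'Hospital')]
```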

Example 4 (Object property relationship): Consider the two concepts MedicalEmployee and MedicalOrganization. If the term Medical is not a concept, then the relationship can be expressed as MedicalEmployee has MedicalOrganization and MedicalOrganization has MedicalEmployee, as shown in Figure 3.6 (d).

Ontology Construction Algorithm: Algorithm 1 describes the ontology-construction process for complex service terms. We generate the concepts and the relationships between concepts using the TF-IDF value ranking and the rules. We choose the word of the highest rank and generate a concept for that term. We then select all complex terms that use that word as their head.

Algorithm 1: Ontology Construction
Input T_c: Array of complex terms
Input T_t: Array of tokenized terms
Input θ: Threshold TF-IDF value
Output O: Ontology
1:  for each tokenized term t_t, where TF-IDF value > θ, in T_t do
2:    generateConcept(t_t);
3:    for each complex term t in T_c do
4:      H_t = getHeadTerm(t);
5:      if (t_t.equals(H_t))
6:        generateConceptsForAllLevelComplexTerms(t);
7:      end
8:      generateSubSuperRelationship(); // By Rule 1.
9:    end-for
10: end-for
11: for each complex term t in T_c do
12:   H_t = getHeadTerm(t);
13:   M_t = getModifierTerm(t);
14:   if (H_t is not a concept and M_t is a concept)
15:     generateDataProperty(); // By Rule 2.
16:   end
17: end-for
18: for each concept C_i do
19:   for each concept C_j do
20:     generateObjectPropertyForConceptModifier(); // By Rule 3.
21:     generateObjectPropertyForModifierOnly(); // By Rule 4.
22:   end-for
23: end-for
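The Rule 1 portion of the algorithm (concept generation for every level of a complex term, lines 3-8) can be sketched as a runnable fragment; the function names are illustrative, not the thesis's implementation:

```python
import re

def tokenize(term):
    # Assumptions from the thesis: capitals start new words; hyphens join words.
    return re.findall(r"[A-Z][a-z]*|[a-z]+", term.replace("-", " "))

def rule1_subclass_relations(complex_term):
    """Make every suffix of the complex term a concept, with each longer
    suffix a subclass of the next shorter one (Rule 1 sketch)."""
    parts = tokenize(complex_term)
    relations = []
    for i in range(len(parts) - 1, 0, -1):
        child, parent = "".join(parts[i - 1:]), "".join(parts[i:])
        relations.append((child, "subClassOf", parent))
    return relations

print(rule1_subclass_relations("HigherEducationalOrganization"))
# [('EducationalOrganization', 'subClassOf', 'Organization'),
#  ('HigherEducationalOrganization', 'subClassOf', 'EducationalOrganization')]
```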

We consider all levels in the subsumption hierarchy of the complex term. (For example, as shown in Figure 3.7, the first level of the complex term HigherEducationalOrganization is Organization, the second level is EducationalOrganization and the final level is HigherEducationalOrganization.) We generate concepts for all levels of the particular complex term (Lines 3-7). We then apply Rule 1 to generate a Subclass-Superclass relation (Line 8). This process is repeated for all tokenized words that have TF-IDF values greater than the defined threshold value. Next, we generate data property relations by applying Rule 2 (Lines 11-17). Finally, we apply Rules 3 and 4 to generate object property relations (Lines 18-23).

Figure 3.7 Subsumption hierarchy of the complex term

3.3.2 IR Based Term Similarity

As IR-based term-similarity methods, we use two approaches, namely thesaurus-based term similarity and SEB term similarity.

Thesaurus-based term similarity: This method can be considered a knowledge-rich similarity-measuring technique, which requires a semantic network or a semantically tagged corpus to define the concept of a term in relation to other concepts or within the surrounding context. We use WordNet as the knowledge base. To calculate the semantic similarity of two terms, we use an edge-count-based approach [49], which is a natural and direct way of evaluating semantic similarity in the taxonomy field.

SEB term similarity: One main issue with the above method is that some terms used in

Web services may not be included in the thesaurus. We may therefore fail to obtain a reasonable similarity value for features (e.g., IphonePrice and NokiaPrice). However, the SEB method can overcome this problem because it analyzes Web-based documents. Further, it can identify the latent semantics in the terms (e.g., the semantic similarity between Apple and Computer). We consider three algorithms, called Web-Jaccard, Web-Dice and Web-PMI, as described in [50]:

$\mathrm{Web\_Jaccard}(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \dfrac{H(P \cap Q)}{H(P) + H(Q) - H(P \cap Q)}, & \text{otherwise} \end{cases}$   (3.2)

$\mathrm{Web\_Dice}(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \dfrac{2\,H(P \cap Q)}{H(P) + H(Q)}, & \text{otherwise} \end{cases}$   (3.3)

$\mathrm{Web\_PMI}(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \log_2 \dfrac{H(P \cap Q)/N}{\bigl(H(P)/N\bigr)\bigl(H(Q)/N\bigr)}, & \text{otherwise} \end{cases}$   (3.4)

Here, H(P) and H(Q) are the page counts for the queries P and Q, respectively. The value H(P ∩ Q) is the page count for the conjunction query P AND Q. All the coefficients are set to zero if H(P ∩ Q) is less than a threshold c, because two terms may appear by accident on the same page. N is the number of documents indexed by the search engine. Further, all similarity values are between 0 and 1. First, we compute the pair similarity of the individual terms used in complex terms to calculate the feature similarity value as follows:

Sim(T_1, T_2) = \alpha \, Sim_T(T_1, T_2) + \beta \, Sim_{SE}(T_1, T_2)    (3.5)

Here, Sim_T(T_1, T_2) is the thesaurus-based term-similarity score and Sim_{SE}(T_1, T_2) is the SEB similarity score. The parameters α and β are real values between 0 and 1, with α + β = 1; they represent the weights for the thesaurus-based and SEB similarities. The final similarity value is between 0 and 1. We then calculate the feature similarity value:

Sim_F(S_i, S_j) = \frac{\sum_{p=1}^{l} \sum_{q=1}^{m} max\_sim(x_p, y_q)}{l \cdot m}    (3.6)

where x_p and y_q denote the individual terms, with l and m being the numbers of individual terms in a particular feature (the service name) of services S_i and S_j, respectively. The feature similarity value is between 0 and 1.

Matching Filters and Similarity Calculation

To compute the similarity of services using the ontology generated by the proposed ontology learning method, we use the Exact, Plug-in, and Subsumes filters defined in [51]. Moreover, if one concept is a property of another concept, then those concepts are semantically closer to each other. We therefore introduce three new filters in this research, namely Property-&-Concept, Property-&-Property, and Sibling.

Exact: If C_i ≡ C_j, then S_i^F perfectly matches S_j^F.
Property-&-Concept: If C_i ∈ PROP(C_j), then S_i^F property-&-concept matches S_j^F.
Property-&-Property: If C_i ∈ PROP(C_k) ∧ C_j ∈ PROP(C_k), then S_i^F property-&-property matches S_j^F.
Plug-in match: If C_i ∈ LSC(C_j), then S_i^F plugs into S_j^F.
Sibling match: If C_i ∈ LSC(C_k) ∧ C_j ∈ LSC(C_k), then S_i^F sibling-matches S_j^F.
Subsumes match: If C_j > C_i (C_i is more specific than C_j), then S_j^F subsumes

S_i^F.
Logic Fail and Fail: If C_i and C_j are in the same ontology O_p but fail in the above six matches, then S_i^F logic-fails to match S_j^F. If the two concepts are in heterogeneous ontologies, then S_i^F fails to match S_j^F.

We apply the filters in the following order, based on the degree of strength for logic-based matching: Exact > Property-&-Concept > Property-&-Property > Plug-in > Sibling > Subsumes > Logic Fail > Fail. As an example, consider the ontology in Figure 3.8. Table 3.1 shows the logical relationships between services in the ontology.

Table 3.1 Matching filters for concepts in the Figure 3.8 ontology

Service S_1 | Service S_2 | Matching Filter
EducationalEmployee | EducationalEmployee | Exact
HospitalEmployee | HospitalName | Property-&-Concept
HospitalName | HospitalAddress | Property-&-Property
HospitalEmployee | Employee | Plug-in
HospitalEmployee | EducationalEmployee | Sibling
HigherEducationalEmployee | Employee | Subsumes
HospitalEmployee | HigherEducationalEmployee | Logic Fail
HigherEducationalEmployee | RomanticNovel | Fail

If there is an Exact match between two concepts, then the similarity is equal to the highest value, 1. If the matching filter is Property-&-Concept, Property-&-Property, Sibling, Plug-in, Subsumes, or Logic Fail, then we calculate the similarity as follows:

Figure 3.8 Sample ontology

Sim(C_i, C_j) = W_m + W_e \cdot Sim_E(C_i, C_j)    (3.7)

Here, W_m and W_e are the weights for the matching filter and the edge-based similarity, respectively, with W_m + W_e = 1. Sim_E(C_i, C_j) is the edge-based similarity, calculated using the following equation [52]:

Sim_E(C_i, C_j) = -\log \frac{d(C_i, C_j)}{2D}    (3.8)

Here, d(C_i, C_j) is the shortest distance between concepts C_i and C_j, and the parameter D is the maximum depth of the ontology. All similarity values of the above equations are between 0 and 1. If two concepts are in heterogeneous ontologies (the services fail to match using any matching filter except Fail), then the IR-based term-similarity method is used to calculate the feature similarity.

In calculating the similarity of the relevant feature (service name) of two Web services, we do not split the complex terms, as is usual in the literature; instead, we consult the ontologies generated by the ontology learning method for those complex terms to check for the existence of concepts. If there are concepts that relate to the service features of the two Web services in the same ontology, then we compute the degree of semantic matching for the given pair of service features by applying the different filters. As an example, suppose we need to calculate the similarity of the service names HospitalEmployee and Employee. As discussed, current approaches split the complex terms and compute the pair similarities ((Hospital, Employee), (Employee, Employee)), so they cannot obtain the real semantic similarity. Our method instead checks for concepts without splitting and computes the similarity using the logic filters. According to Table 3.1, there is a Plug-in relationship between these two service names; thus we do not lose the semantics that exist between the service names.

In this research, we do not generate and train from labeled data; as shown in the ontology generation section, by defining the rules we simply generate an ontology, to support more exact term-similarity calculation, from the existing Web

service set. Our method is analogous to preparing standard answers from existing problem sets before an examination. Let us consider the following ontologies generated by ontology learning, with three ontology classes (Figure 3.9).

Figure 3.9 Sample ontology with three ontology classes

Assume we need to compute the similarity between the service names FastCar and ExpensiveCar. To compute the similarity using the ontology, we can apply Equation 3.7. The services are in the same ontology class and they are children of the same concept, Car; the matching filter is therefore the Sibling filter, and we obtain a high semantic similarity (0.90). In contrast, as mentioned, current approaches calculate the similarity of the individual term pairs and take the average value as the final similarity value. In this case, the WordNet-based method obtains a low similarity value (0.50) compared with the ontology-learning-based method, even though, from a general point of view, the services have high semantic similarity. Further, consider the services ThreeWheeledCar and FastCar. Here also, the ontology learning method obtains a higher similarity value (0.89) through the Logic Fail filter compared with the WordNet method (0.56). Thus, we can show that using ontology learning we can improve the semantic similarity. This in turn affects the

clustering accuracy. Now assume we need to calculate the similarity between the Novel service and the NovelAuthor service. In this case, we can calculate the similarity using the Property-&-Concept filter. However, if we want to calculate the similarity between the ComedyFilm service and the RomanticNovel service, then we need to use the IR-based method, because they are in two different ontology classes.

3.4 Feature Integration

Various approaches exist for combining service-feature similarity values. One method is to assign weights to individual features, determined through user feedback. Appropriate weights are chosen either by assuming a priori knowledge of the user's preferences or by applying machine learning techniques. Here also, the final service similarity value Sim_S(S_i, S_j) to be used for service clustering is calculated by integrating the feature similarity values for Web services S_i and S_j, assigning weights to the individual features as follows:

Sim_S(S_i, S_j) = W_N Sim(Name_i, Name_j) + W_{OP} Sim(Op_i, Op_j) + W_D Sim(Domain_i, Domain_j) + W_O Sim(Out_i, Out_j) + W_I Sim(In_i, In_j)    (3.9)

Here, the weights for the feature elements, W_N, W_OP, W_D, W_O, and W_I, are real values between 0 and 1, with W_N + W_OP + W_D + W_O + W_I = 1. We define the weight values of the features by measuring the strength of each feature in the service similarity calculation; the procedure for assigning the weight values is explained in a later chapter. The final similarity value is between 0 and 1.

3.5 Clustering Algorithm

After calculating the service similarity, we use an agglomerative clustering algorithm (Algorithm 2), which can handle any form of similarity or distance easily, can capture the main structure of the data, and has a low computation cost. This bottom-up

hierarchical clustering method starts by assigning each service to its own cluster (Line 1).

Algorithm 2 Clustering Algorithm
Input S : Array of service similarity values
Input n : Number of required clusters
Output C : Service clusters
1: Let each service be a cluster;
2: ComputeProximityMatrix(S);
3: k = NoOfServices;
4: while k != n do
5:   Merge two closest clusters;
6:   k = GetNoOfCurrentClusters();
7:   Calculate center value of all services in all clusters; // using the center value CV(S_{i,c}) formula
8:   Select the service with the highest value in each cluster as the cluster center;
9:   UpdateProximityMatrix();
10: end while

It then merges the most similar clusters, based on the proximity of the clusters at each iteration, until the stopping criterion is met (e.g., the number of clusters) (Lines 4-10). Several methods have been used to merge clusters, such as single-link and complete-link [53]. We use a centroid-based method in which, for the proximity value, we use Sim_S(S_i, S_j) between cluster centers. Figures 3.10 (a) and 3.10 (b) show an example of the clustering steps, with Figure 3.10 (c) showing a tree representation.

Proposed Cluster Center Identification Approach

As mentioned, it is very important to identify the most suitable cluster center to optimize cluster performance. In this subsection, we propose a new method to calculate the cluster center. To calculate the center, we presume the following condition: a center service of a service cluster has the highest value of the summation of the TF-IDF value of its service name and the average relative similarity among the services in the cluster. According to this condition, we defined the following equation. First, we calculate the center value CV for all services in the cluster (see Line 7 in Algorithm 2) and then choose the service with the highest value as the cluster center.
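As a sketch, the merging loop of Algorithm 2 can be written as follows. For brevity this minimal version uses average pairwise similarity as the cluster proximity, in place of the centroid-based Sim_S between cluster centers, and the `sim` dictionary of pairwise service similarities is assumed to be precomputed.

```python
def agglomerative_cluster(sim, n_clusters):
    """Bottom-up clustering in the spirit of Algorithm 2.

    sim        : dict mapping frozenset({a, b}) -> similarity in [0, 1]
    n_clusters : desired number of clusters (the stopping criterion)
    """
    services = {s for pair in sim for s in pair}
    clusters = [[s] for s in sorted(services)]  # each service starts as a cluster

    def proximity(c1, c2):
        # average pairwise similarity stands in for the centroid proximity
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # merge the two closest clusters
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: proximity(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```

This sketch recomputes proximities from scratch at each iteration rather than maintaining a proximity matrix, which is simpler but less efficient than the update step of Algorithm 2.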

55 (a)web services (b) Clustering (c) Tree representation Figure 3.10 Agglomerative clustering algorithm m Sim ( S, S ) S i j CV( S ) ( ) i, c tfidf Si, c j1 mm ( 1) (3.10) Here, CV(S i,c ) is center value of service S i in cluster C. Parameter m is the number of services in the cluster. Sim S (S i,, S j, ) is similarity values between service S i and S j in cluster C. Average similarity value is between 0 and 1. Value tfidf(s i,c ) is TF-IDF value of service name S i within the cluster C. It reflects how important a service is to a collection of services. To calculate the TF IDF value, first we tokenize all the service names in the clusters into individual terms and then calculate the TF IDF value of the individual terms as: 43

tfidf_{x,c} = tf_{x,c} \cdot \log \frac{n}{Cn_x}    (3.11)

Here, tfidf_{x,c} is the TF-IDF value for term x in cluster C, tf_{x,c} is the term frequency of term x in cluster C, Cn_x is the number of clusters that contain term x, and the parameter n is the number of clusters. Finally, we calculate the average of the TF-IDF values for the terms used in a service name as the TF-IDF value of the service name:

tfidf(S_{i,c}) = \frac{\sum_{x=1}^{k} tfidf_{x,c}}{k}    (3.12)

Here, k is the number of individual terms used in the service name, and the average TF-IDF value is between 0 and 1.
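The center-value computation of Eqs. 3.10-3.12 can be sketched as below; the tokenization and the per-cluster term statistics are assumed to be precomputed, and the function names are illustrative.

```python
import math

def name_tfidf(name_terms, term_tf, cluster_df, n_clusters):
    """Eqs. 3.11-3.12: average per-term TF-IDF of a service name within a cluster.

    name_terms : tokenized terms of the service name
    term_tf    : term -> frequency of that term inside this cluster
    cluster_df : term -> number of clusters containing the term
    """
    scores = [term_tf[t] * math.log(n_clusters / cluster_df[t])
              for t in name_terms]
    return sum(scores) / len(name_terms)

def center_value(name_tfidf_value, sims_to_others, m):
    """Eq. 3.10: name TF-IDF plus the average relative similarity in the cluster.

    sims_to_others : Sim_S values between this service and the others in the cluster
    m              : number of services in the cluster
    """
    return name_tfidf_value + sum(sims_to_others) / (m * (m - 1))
```

The service with the highest `center_value` in each cluster would then be chosen as the cluster center (Line 8 of Algorithm 2).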

Chapter 4

CAS Based Clustering Approach

Our second clustering approach uses the context-aware similarity (CAS) method as the similarity calculation method in the clustering process. The CAS method learns domain context by machine learning to produce models of context for terms retrieved from the Web. Through these models, the method captures the hidden semantics of services within a particular domain. Figure 4.1 shows the steps of the clustering approach. In this chapter, we first present our motivating examples and the architecture of the proposed CAS clustering approach. Next, we explain the CAS method, beginning with its outline and then describing its steps, which include context vector generation, training the SVM, and term similarity calculation. Section 4.3 then discusses the spatial clustering method, which uses the CAS method to calculate the service similarity based on the domain context.

Figure 4.1 Steps of the CAS based clustering approach

4.1 Motivation for Context Awareness and Proposed Clustering Approach

The similarity computing methods used in current clustering approaches, such as string- and knowledge-based methods, do not use domain context in measuring similarity, which hinders the accurate calculation of service similarity. In a real situation, there will be semantic relationships between services within a particular domain that should be utilized. In this research, we focus on term similarity according to context, learned within a domain. The CAS method can optimize the similarity calculations in a natural fashion, and consideration of the domain context in computing similarities can have a significant impact on the formation of service clusters.

Motivating Scenarios for Context Awareness

To better motivate and illustrate the importance of context awareness, we refer to a simple example involving three Web services: getAmbulanceLocationInformation (S_1), getCarLocation (S_2), and HospitalPhysician (S_3). Assume that we have three clusters, namely Medicine, Vehicle, and Location. If we use traditional similarity computing metrics such as string- and knowledge-based methods, then the placement of the three services among the clusters will be as shown in Figure 4.2. The S_1 service shares characteristics with the Medicine, Location, and Vehicle domains because of the terms used in its service name, and does not identify easily with one cluster among these three domains. The S_2 service represents terms in the Vehicle and Location

Figure 4.2 Effect of domain context in choosing clusters

domains. It is therefore placed between the Vehicle and Location clusters. However, the S_3 service shows characteristics of the Medicine domain alone, and is therefore placed inside the Medicine cluster. In contrast, if we calculate similarities using the domain context, then the services will move to new coordinates according to their context. For example, if we calculate the similarity for the Vehicle domain, then both the S_1 and S_2 services move towards the Vehicle cluster, as shown by the arrows in Figure 4.2, and become remote from the other two clusters. If we use the Location domain context, then the services will move towards the Location cluster. In both cases, S_3 will remain unchanged because it does not share its characteristics with multiple domains. If we use the Medicine domain context, then service S_1 will move towards the Medicine cluster, but the context does not influence the position of S_2.

To illustrate further the importance of context awareness in service clustering, we refer to another example with 15 services from five different domains (Medicine, Computer, Vehicle, Location, and Food). Current clustering approaches will cluster the services as shown in Figure 4.3. These approaches have problems, including the failure to capture semantic relationships between services within a particular domain. For example, assume the AppleProduct service in Figure 4.3 is used to obtain product

Figure 4.3 Service clustering example with current approaches

details about the Apple Company. However, it has been incorrectly placed in the Food cluster instead of the Computer cluster. The method used did not consider the term Apple in the AppleProduct service as a computer manufacturer, because of the lack of surrounding domain information, and this reduced the similarity between the AppleProduct service and the other Computer-domain services. In contrast, if we compute the similarity value between the service and another service in the Computer domain, such as ComputerPrice, then the similarity value is greater than the value obtained without considering the domain. Here, the term Apple is considered not only as a food, but also as a computer manufacturer. Although corpus-based methods such as NGD do reason in this way, they do not encode fine-grained information. Furthermore, we cannot guarantee that fixed ontologies would contain concepts for services such as AppleProduct, which relates to the products of the Apple Company. Moreover, the AmbulanceLocation service could be a member of any of the Medicine, Vehicle, or Location clusters according to the domain context, as in the previous scenario. However, current clustering approaches can insert this service into only one cluster, according to the highest similarity.

Proposed CAS Based Clustering Approach

We propose an approach to the functional clustering of Web services using the CAS method. The architecture of the proposed approach is illustrated in Figure 4.4. As in the previous approach, here also we use the structure of WSDL files to cluster services, and material retrieved in other formats is translated into WSDL. First, WSDL documents are mined to extract features that describe the functionality of services. Then, we use a CAS module that uses models, learned from real Web and domain datasets retrieved from Web search engines as the context, to compute service similarity.
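The overall pipeline of the proposed approach can be outlined as below; all component functions are placeholders for the feature extraction, CAS similarity, and SASKS clustering steps described in this chapter, so this is a structural sketch rather than an implementation.

```python
def cluster_services(wsdl_docs, extract_features, cas_similarity, sasks_cluster):
    """Pipeline sketch: mine WSDL features, build the affinity matrix with the
    CAS similarity, then hand the matrix to the SASKS clustering step.

    All three callables are assumed/injected components, not a fixed API.
    """
    features = [extract_features(doc) for doc in wsdl_docs]
    n = len(features)
    # pairwise service similarities form the affinity matrix
    affinity = [[cas_similarity(features[i], features[j]) for j in range(n)]
                for i in range(n)]
    return sasks_cluster(affinity)
```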
We adopt an SVM technique to compute the similarity of features within a particular domain in the CAS module, because machine learning can learn term similarities from terms in several domains. Next, service similarity values are computed as an aggregation of the individual service-feature similarity values. We can then create an affinity matrix using the computed service similarity values to provide

the input for the SASKS algorithm, which we use as the clustering algorithm in this approach. Finally, services are clustered and plotted onto a sphere using the SASKS algorithm.

Figure 4.4 Architecture of the CAS-based clustering approach

4.2 CAS Method

Outline of CAS Method

In our research, we first analyzed snippets obtained from search engines for term pairs to identify significant information that helps capture the semantics of terms within a domain. For example, consider the following snippets (see Figure 4.5) from the Google search engine for the queries apple computer, apple banana, and software hardware. There are domain-related terms that are frequently used in each domain (e.g., terms such as hardware, software, and desktop are frequently used in the

Figure 4.5 Snippets for the queries from the Google search engine

Computer domain, and terms such as fruit, vegetable, and recipe are frequently used in the Food domain). According to snippets sn1 and sn2, we see that the terms apple and computer are associated with frequently used terms in the Computer domain, such as laptop and desktop. According to snippets sn3 and sn4, the terms apple and banana are associated with frequently used terms in the Food domain, such as fruit and drink. The terms software and hardware are also associated with frequently used terms in the Computer domain, according to snippets sn5 and sn6. Using these snippets, we can determine that the terms apple and computer are semantically related to each other in the Computer domain, and that apple and banana are semantically related to each other in the Food domain. Therefore, after careful examination, we can make the following important hypothesis: if two particular terms are associated with frequently used terms in a particular domain, then there exists a semantic relationship between those two terms for that domain; the more frequently used terms the two terms are associated with, the stronger the semantic relationship between them.

Consider Table 4.1, which is constructed from the snippets in Figure 4.5 by considering frequently used terms in the Computer domain. According to the table, we can see that the term combination hardware-software is associated with more frequently used terms in the Computer domain than are the other two combinations. Therefore, that term combination has a stronger semantic relationship for the Computer domain, in comparison with the other two combinations.
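The association-counting idea behind this hypothesis can be sketched as follows; the snippets are assumed to have been fetched from a search engine already, and the word-boundary matching is a simplification.

```python
import re

def association_count(snippets, context_terms):
    """Count how many of a domain's frequently used terms appear in the snippets
    returned for a term-pair query (the association signal behind Table 4.1)."""
    text = " ".join(snippets).lower()
    return sum(1 for term in context_terms
               if re.search(r"\b%s\b" % re.escape(term.lower()), text))
```

Comparing this count across term pairs yields orderings such as the hardware-software > apple-computer > apple-banana sequence discussed below for the Computer domain.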

Table 4.1 Frequently used terms in the Computer domain

Term pair | Associated frequently used terms in the Computer domain
apple computer | Computer, Laptop, Desktop
apple banana | (none)
hardware software | Hardware, Software, Peripherals, Data, Instruction, Devices, Storage

Further, the variation in the semantic relationships between the term pairs can be expressed as the following ordering for the Computer domain:

Hardware-Software > Apple-Computer > Apple-Banana

If we apply the same procedure for the Food domain, the sequence becomes:

Apple-Banana > Apple-Computer > Hardware-Software

To use our hypothesis in this research, we extended the definition of context beyond that used in ubiquitous computing. Here, we define the context as C_{d_i} = {T_1, T_2, ..., T_n}, where T_i is a term that is frequently used in domain d_i. The context is based on the domain, and it varies from domain to domain. We extract the context using Web search engines.

To measure the semantic similarity of two terms within a particular domain, we implement a domain filter by training an SVM for that domain. For example, to measure the similarity between car and vehicle for the Vehicle domain, we implement a Vehicle domain filter. Figure 4.6 shows the process of term similarity calculation by the model. The output of the SVM is converted into a posterior probability, and we define the semantic similarity between the two terms as this posterior probability. In previous research, Bollegala et al. [50] used an SVM to measure the similarity of two terms. However, they used fixed patterns extracted from WordNet to identify the semantic relationships between terms, and used those patterns together with search-engine-based similarity values to generate feature vectors.
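The per-domain filter described above can be outlined as a thin wrapper; the `predict_proba` interface here is an assumption standing in for whatever calibrated classifier is trained (the SVM and its probability conversion are described later in this chapter).

```python
class DomainFilter:
    """Sketch of a per-domain filter: wraps a trained binary classifier whose
    positive-class posterior probability is read as the in-domain similarity."""

    def __init__(self, model):
        self.model = model  # assumed to expose predict_proba(feature_vector)

    def similarity(self, term_a, term_b, make_feature_vector):
        # Build the term pair's feature vector, then read the calibrated posterior.
        fv = make_feature_vector(term_a, term_b)
        return self.model.predict_proba(fv)  # value in [0, 1]
```

A separate `DomainFilter` instance would be built for each domain (Vehicle, Medicine, and so on), each holding the SVM trained for that domain.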

Figure 4.6 Term similarity calculation by model

Overview of SVMs

In this subsection, we describe the SVM used in machine learning to generate the models in the CAS-based clustering approach. Service descriptions are free-text documents that include a variety of terms, and the existing approaches to calculating the similarity of terms in services are based on general concepts. In the CAS-based clustering approach, we want to devise a new approach that calculates term similarity according to the domain context, using a model produced by an SVM.

The SVM is a state-of-the-art classification method that uses a learning algorithm based on structural risk minimization [54]. The classifier can be used in many disciplines because of its high accuracy, ability to deal with high dimensions, and flexibility in modeling diverse sources of data [55]. Zhang et al. [56] proposed an approach to service categorization involving novel feature-vector elements for service characteristics and an extension of the SVM-based text classification technique to enhance the accuracy of the service categorization. This approach incrementally established domain knowledge and leveraged that knowledge to automatically verify and enhance the service categorization. In addition, a Web-search-engine-based term similarity measure and SVM-based results integration

were used by Paik and Fujikawa [57] to discover matched services in the trip domain.

In classification, the SVM first transforms the input space to a higher-dimensional feature space through a nonlinear mapping function, and then constructs a separating hyperplane that has the maximum distance from the closest points of the training set. The optimal hyperplane is constructed by maximizing the distance between samples in different classes and the hyperplane. Consider a binary classification problem with a set of linearly separable samples P = {(x_i, y_i)}_{i=1}^{N}, where x_i is an n-dimensional vector and y_i ∈ {-1, +1}. Here, we use the domain context to construct the vector (we discuss the formation of the vector in detail in Chapter 6). The label of the class that the vector belongs to is y_i. The objective of the SVM is to find an optimal hyperplane satisfying the following conditions:

for y_i = +1, W^T x_i + b ≥ +1
for y_i = -1, W^T x_i + b ≤ -1

Here, W is a weight vector and b is a bias. By introducing Lagrange multipliers α_i, we arrive at the following optimization problem:

minimize: W(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

subject to: \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \forall i: \alpha_i \ge 0    (4.1)

Here, K(x_i, x_j) = x_i · x_j is a kernel function. The hyperplane decision function can then be expressed as:

q(x) = sign\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)    (4.2)

The training process determines α_i and b, and the criteria for optimization are to maximize the margin as well as to minimize the error.
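As an illustrative stand-in for solving the dual problem of Eq. 4.1, the following trains a linear soft-margin SVM with a Pegasos-style stochastic sub-gradient method; this is not the solver used in the thesis, and it covers only the linear kernel.

```python
import random

def train_linear_svm(X, y, lam=0.1, epochs=300):
    """Pegasos-style stochastic sub-gradient training of a linear soft-margin SVM.
    X : list of feature vectors, y : labels in {-1, +1}."""
    dim = len(X[0])
    w, b, t = [0.0] * dim, 0.0, 0
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (sum(wk * xk for wk, xk in zip(w, X[i])) + b)
            if margin < 1:  # hinge-loss violation: shrink and step toward x_i
                w = [(1 - eta * lam) * wk + eta * y[i] * xk
                     for wk, xk in zip(w, X[i])]
                b += eta * y[i]
            else:           # no violation: regularization shrink only
                w = [(1 - eta * lam) * wk for wk in w]
    return w, b

def decision(w, b, x):
    """q(x) of Eq. 4.2 for the linear kernel: the signed score of sample x."""
    return sum(wk * xk for wk, xk in zip(w, x)) + b
```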

The output q(x) indicates the distance between the testing data and the optimal hyperplane. However, this output cannot be used directly to measure the semantic similarity of services. Constructing an SVM to produce the posterior probability P(class | input) that the feature vector belongs to the positive class is a fine-grained solution to this problem. In the literature, several approaches have been proposed to convert the uncalibrated output q(x) of the SVM into a probability [58], [59].

Generating Context Vectors for Domains

We generated context vectors for each required domain to implement the domain filters (see Algorithm 3). The context vector for domain d_i contains all the terms in the domain-specific context C_{d_i} of d_i. To extract the domain-specific context, we used the Google and Wikipedia search engines, extracting frequently used terms in domain d_i as the context. Figure 4.7 shows this process, and Figure 4.8 shows the Wikipedia interface. First, we gave the domain name as the search query to the two search engines and retrieved the top 100 snippets from each (SnippetG(d_i) and SnippetW(d_i)), giving 200 snippets for each domain (Lines 1-3 of Algorithm 3). We then performed stop-word filtering and computed the TF-IDF value of all the terms in the 200 snippets for each domain (Lines 4-13) using (4.3). The advantage of using TF-IDF is that it reflects the importance of the term T_i for domain d_i from among the collection of domains d_1, d_2, d_3, ..., d_n. Even if term T_i is frequently used in domain d_i, it may not be a domain-specific term, but may instead be a common term across other domains; if so, we gain no advantage by selecting that term as a context term.

Figure 4.7 Process of extracting frequently used terms

Algorithm 3: Context vector generation for domains
Input D : Array of domains
Output C : Context vectors for each domain
1: for each domain d_i in D do
2:   Sd_i = getSnippets(d_i); // get a total of 200 snippets from the Google and Wikipedia search engines
3: end-for
4: for each domain d_i in D do
5:   for each snippet s_j in Sd_i do
6:     StopWord_filtering(Sd_i);
7:   end-for
8:   Td_i = Calculate_term_frequency(); // 2D array with term and term frequency
9: end-for
10: for each domain d_i in D do
11:   for each term T_x in Td_i do
12:     y_x = calculate_TF-IDF(T_x);
13:   end-for
14:   Cd_i = generateContextVector(); // select the 200 terms with the highest TF-IDF values
15: end-for

Figure 4.8 Interface of Wikipedia
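A minimal sketch of the TF-IDF selection step of Algorithm 3 (Lines 4-14) follows; the snippet fetching is omitted, the stop-word list is abbreviated, and the cross-domain document-frequency map is assumed to be precomputed.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for"}  # abbreviated

def context_vector(domain_snippets, domain_freq, n_domains, top_k=200):
    """Pick the top-k domain-specific terms from one domain's snippets by TF-IDF.

    domain_snippets : snippet strings retrieved for the domain
    domain_freq     : term -> number of domains whose snippets contain the term
    n_domains       : total number of domains
    """
    tokens = [t for s in domain_snippets
              for t in re.findall(r"[a-z]+", s.lower()) if t not in STOP_WORDS]
    tf = Counter(tokens)
    # TF-IDF in the spirit of Eq. 4.3: term frequency times log(n / df)
    tfidf = {t: f * math.log(n_domains / domain_freq.get(t, 1))
             for t, f in tf.items()}
    return [t for t, _ in sorted(tfidf.items(), key=lambda kv: -kv[1])[:top_k]]
```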

We therefore use the TF-IDF values to identify the preferred important domain-specific terms:

T_{i,j}^{TFIDF} = tf_{i,j} \cdot \log \frac{n}{df_i}    (4.3)

Here, T_{i,j}^{TFIDF} is the TF-IDF value for term i in domain j, tf_{i,j} is the term frequency for term i in domain j, df_i is the number of domains that contain term i, and n is the total number of domains. After computing the TF-IDF values, we selected the 200 terms with the highest TF-IDF values from domain d_i as the context for that domain. We then generated the context vector for each domain d_i, where each element is a frequently used term in that domain (Line 14).

Training the SVMs

After generating the context vectors for each domain, we need to train a separate SVM for each domain to implement the domain filters. For example, an SVM has to be trained within a medicine context to implement the Medicine domain filter. Each SVM was trained to separate same-domain term pairs, which belong to the domain under consideration, from term pairs that do not belong to that domain. We extracted terms for five domains (Book, Medicine, Vehicle, Food, and Film) from dictionaries, thesauruses, and service documents. We prepared term pairs from the terms belonging to the domain under consideration as the positive training dataset; negative training term pairs were prepared by taking terms from other domains. For example, in implementing the filter for the Vehicle domain, we prepared positive term pairs by taking terms from the Vehicle domain, such as car-vehicle or automobile-bus, and extended the set of term pairs as in the following example.

Example: positive training dataset for the Vehicle domain:
car-vehicle / automobile-bus / etc. (primary term pairs)
We search Google for (T_1 T_2) to find snippets for each term pair.

The top 10 snippets for each query are extracted.
The TF-IDF value of each term is calculated.
Further term pairs are prepared using the terms with the highest TF-IDF values.

For the negative term pairs, we used terms from other domains, such as Food or Book (meal-juice, author-publisher, or bus-meat). Consider training an SVM to implement the domain filter for domain d_i. We need to prepare a training dataset P = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is the feature vector and y_i is the expected class label for the ith instance. For each training pair of terms (a b), we create a feature vector x_i, as shown in Algorithm 4. We first search Google for (a b) to find snippets (Line 1); Figure 4.9 shows a sample query in the Google search engine. We count the frequency of each term T_x in the snippets, where T_x is a member of the context vector for domain d_i (Lines 3-9). We then normalize the count of each term T_x by dividing it by the total count over the frequently used terms (Line 11). Then, we compute the similarity of the term pair using the SEB method; here, we used Web-Jaccard, Web-PMI, Web-Overlap, and Web-Dice as the similarity computing methods. Finally, we have a 204-dimensional feature vector in which each element is the normalized frequency of the corresponding frequently used term, up to

Algorithm 4: Feature vector generation for term pairs
Input T : Array of 200 terms // frequently used terms in domain d_i (context vector)
Output x_i : Feature vector of term pair (a b)
1: S = getSnippets(a b);
2: total_count = 0;
3: for each term T_x in T do
4:   count_T_x = 0;
5:   for each snippet s ∈ S do
6:     count_T_x = count_T_x + count(T_x, s);
7:   end-for
8:   total_count = total_count + count_T_x;
9: end-for
10: for each term T_x in T do
11:   NT_x = normalized(count_T_x, total_count);
12: end-for
13: Sim_a = ComputeSimValue((a b), SEB_y); // calculate SEB values
14: generateFeatureVector(NT_x, Sim_a);

200, plus the four SEB similarity values of the term pair:

Feature vector: f((nftf_i), Web-Jaccard, Web-Overlap, Web-Dice, Web-PMI)
nftf: normalized frequently-used-term frequency

We produce feature vectors for all positive and negative term pairs in this manner. The SVM is then trained with the labeled feature vectors to implement the filter for domain d_i. We train the SVM using two-fold cross-validation. In this way, we trained an SVM for each domain that required a filter.

Figure 4.9 Sample query in Google

Calculating Term Similarity from Model

Assume we need to compute the similarity of two terms within the Vehicle domain. We use the SVM that was trained for the Vehicle domain. First, we generate the feature vector FV for the selected term pair using the same method that we used to generate the feature vectors for the training term pairs. As mentioned above, an SVM is normally used to classify a given dataset; we adopt it here to calculate semantic similarities. We define the semantic similarity between terms as the posterior probability that the feature vector belongs to the same-domain term-pair (positive) class for the domain under consideration. However, the output of an

SVM (q(x)) is both uncalibrated and not in the form of a probability. To convert the SVM output to a calibrated posterior probability, we follow a previous approach [59] that uses a sigmoid function. Because the parameters of the model are adapted to give the best probability output, this method uses a parametric model to fit the posterior probability P(y = 1 | q) directly, as follows:

P(y = 1 \mid q) = \frac{1}{1 + \exp(Aq + B)}    (4.4)

The parameters A and B are fitted using maximum likelihood estimation from the training set (x_i, y_i). We define a training set (x_i, v_i), where v_i is the target probability, defined as:

v_i = \frac{y_i + 1}{2}    (4.5)

The parameters A and B are then found by minimizing the negative log likelihood of the training data:

\min \; -\sum_i \left[ v_i \log(p_i) + (1 - v_i) \log(1 - p_i) \right]    (4.6)

where

p_i = \frac{1}{1 + \exp(A q_i + B)}    (4.7)

Further, the similarity between terms a and b can be expressed as:

context\_sim(a, b) = Prob(FV / domain)    (4.8)

where FV is the feature vector and Prob(FV/domain) is the probability that FV belongs to the same-domain (positive) class. context_sim(a, b) is between 0 and 1.
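A minimal sketch of the sigmoid fitting of Eqs. 4.4-4.7 follows, using plain gradient descent on the negative log likelihood as a simple stand-in for a full maximum-likelihood optimizer.

```python
import math

def fit_platt(q_vals, y_vals, lr=0.05, epochs=2000):
    """Fit the sigmoid parameters A and B of Eq. 4.4 from SVM outputs q and
    labels y in {-1, +1}."""
    A, B = 0.0, 0.0
    targets = [(y + 1) / 2 for y in y_vals]          # Eq. 4.5
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for q, v in zip(q_vals, targets):
            p = 1.0 / (1.0 + math.exp(A * q + B))    # Eq. 4.7
            # gradients of the negative log likelihood (Eq. 4.6)
            grad_a += -(p - v) * q
            grad_b += -(p - v)
        A -= lr * grad_a
        B -= lr * grad_b
    return A, B

def posterior(q, A, B):
    """Eq. 4.4: calibrated posterior probability of the positive class."""
    return 1.0 / (1.0 + math.exp(A * q + B))
```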

4.3 Spatial Clustering

To analyze the effect of domain context on the clustering results in terms of visual output, the SASKS algorithm is applied as the clustering algorithm instead of a traditional clustering algorithm such as k-means or agglomerative clustering. SASKS is a modified version of the associated keyword space (ASKS) algorithm. For the spherical visualization of the clustering space, we need to create a 3D space using affinity values for pairs of services. This is the same as the nonmetric multidimensional scaling (MDS) problem [60]. The ASKS algorithm is an extended MDS algorithm that can represent services in 3D space by using service similarity. In addition, ASKS can achieve high precision through noise isolation [61]. In this chapter, we first describe the service-affinity calculation needed to generate the affinity matrix. We then discuss the ASKS algorithm. Finally, we show the changes required to convert the ASKS algorithm into the SASKS algorithm.

Calculating the Affinity of Services

We use the semantic similarity between services calculated via the CAS method as the affinity values. In this approach, we use the service name, operation name, input, and output. These selected features of a WSDL file describe and reveal the functionality of its Web services. As in the ontology-based approach, here we also extract the operation name as a feature and consider the part elements when measuring the similarity of input and output messages.

Calculating Service Feature Similarity

If a service feature is a complex term, then we tokenize it into individual terms, using the same assumptions as in our previous approach. For example, we tokenize the service name AuthorizePhysician into two parts (Authorize, Physician), based on the assumption that a capitalized character indicates the start of a new word. After tokenizing the name, stop-word filtering is performed to remove any stop words. We then calculate the pairwise similarities of terms using the trained SVM, as discussed in

the above section. Next, the similarity of the relevant feature is calculated using (4.9):

context_sim_F(S_i, S_j) = W_DF × [ Σ_{x=1}^{m} max_y Sim(s_x, r_y) + Σ_{y=1}^{n} max_x Sim(s_x, r_y) ] / (m + n)    (4.9)

Here, s_x and r_y denote the individual terms in feature F of services S_i and S_j, respectively. The parameters m and n are the numbers of individual terms in the features. The feature-similarity value lies between 0 and 1. There may be some term pairs that are only accidentally present in the domain under consideration. Therefore, it is important to determine the strength with which the two services belong to the domain. We introduce a weight W_DF as a domain factor, defined by (4.10):

W_DF = 1 / (1 + log_10(T_P / D_P))    (4.10)

Here, T_P is the number of term pairs for the two service features and D_P is the number of term pairs that belong to the domain under consideration. If the two service features contain several term pairs belonging to the domain, then there is a high probability that the two services belong to it. The value of the weight is between 0 and 1. In the case of input and output, multiple part names are used in messages when a message has multiple logic units, as we explain in Chapter 3.2. The average similarity value for the input or output messages is calculated as in equation (3.1). If there are multiple operations in the Web service, then we calculate the average similarity value using the same equation (3.1).

Calculating Service Similarity

We calculate the feature-similarity values using (4.9), and then calculate the final service-similarity value by integrating the similarity values of the features using (4.11):

74 Sim ( S, S ) Sim( N, N ) Sim( O, O ) Sim( Out, Out ) Sim( In, In ) S i j i J i J i J i J (4.11) Herer Sim(N i, N j ), Sim(O i, O j ), Sim(Out i, Out j ) and Sim(In i, In j ) are the similarity values for service name, operation name, output, and input, respectively. Parameters μ, β, χ and δ are weights for each feature, where μ + β + χ + δ = 1. Further, final similarity value is between 0 and 1. Using these service similarity values, we can then generate the affinity matrix. The matrix is given as the input for SASKS algorithm ASKS Algorithm Distance Measure of ASKS Let k denote the dimension of the space in which the services are located. The distance between two services D ij is given by (4.12): ( k) ( k) Dij f ( x x ) j i (4.12) Here, x i and x j are locations of the services L and M, respectively. Distance between two services is between 0 and 1. f has a parameter a and is defined using (4.13): f ( ) 2, a 2 2, a a a (4.13) Here, and parameter a is a density control parameter. Clustering ( k) ( k) ( x x ) j i efficiency and the calculation load are both strongly influenced by the parameter. Figure 4.10 shows nonlinear distance function f( ) used in ASKS. 62

Figure 4.10 Distance function f(θ) used in ASKS

Uniformalization of the Distribution in ASKS

There are three types of constraints on the distribution of services, which are used to decide the amount of space to be allocated to similar services in distinguishable clusters: making the origin the center of gravity of the services, obtaining covariance matrices such that the dispersion in any direction has the same value, and uniformalizing the services radially from the origin. Uniformalization is useful for clustering noisy data, but otherwise tends to distribute the connections too evenly across the data.

Iterative Solution of the Nonlinear Optimization

The criterion function of ASKS is given by (4.14):

Φ(x_1, x_2, ..., x_n) = Σ_{k=1}^{p} Σ_{ij} M_ij f(x_j^(k) − x_i^(k)) → max    (4.14)

Here, M_ij is the affinity value between services i and j. The partial derivative of Φ with respect to x_i^(k) provides the formula for determining the values of x_i^(k) that maximize Φ:

∂Φ/∂x_i^(k) = ∂/∂x_i^(k) [ Σ_{k=1}^{p} Σ_{ij} M_ij f(x_j^(k) − x_i^(k)) ] = 0    (4.15)

Σ_{j=1}^{n} M_ij f′(x_j^(k) − x_i^(k)) = 0    (4.16)

The derivative of f is given by (4.17):

f′(θ) = 2θ,  for |θ| ≤ a
f′(θ) = 2a·θ/|θ|,  for |θ| > a    (4.17)

Here, the parameter a is a conjunction of the linear and nonlinear distance measures for controlling density. We define D as in (4.18):

D(θ) = 2,  for |θ| ≤ a
D(θ) = 2a/|θ|,  for |θ| > a    (4.18)

from which we derive the expression:

f′(x_j^(k) − x_i^(k)) = D(x_j^(k) − x_i^(k)) (x_j^(k) − x_i^(k))    (4.19)

We can then obtain (4.20) using (4.16) and (4.19):

Σ_{j=1}^{n} M_ij D(x_j^(k) − x_i^(k)) (x_j^(k) − x_i^(k)) = 0    (4.20)

x_i^(k) = [ Σ_{j=1}^{n} M_ij D(x_j^(k) − x_i^(k)) x_j^(k) ] / [ Σ_{j=1}^{n} M_ij D(x_j^(k) − x_i^(k)) ]    (4.21)

The iterative computation (4.22) converges to the solution x_i, for i = 1, 2, ..., n; k = 1, 2, ..., p; and t = 1, 2, ...:

x_i^(k)(t+1) = [ Σ_{j=1}^{n} M_ij D(x_j^(k)(t) − x_i^(k)(t)) x_j^(k)(t) ] / [ Σ_{j=1}^{n} M_ij D(x_j^(k)(t) − x_i^(k)(t)) ]    (4.22)

The three constraints must be enforced at each step of the iterative computation for all service locations x_i (i = 1, 2, ..., n). Here, x_i is the new location of service i in the sphere.

Spherical ASKS

The ASKS algorithm plots services onto a 3D sphere. However, this 3D form is difficult to visualize on a 2D screen. We therefore apply the SASKS technique proposed in [28], modifying the uniformalization part of ASKS [61] for our clustering approach. The SASKS technique plots services onto a 2D spherical surface for easy visualization on a 2D screen. The affinity-calculation part of SASKS is the same as for ASKS. Figure 4.11 shows the uniformalization. In SASKS, after calculating the service positions using the affinity calculations, a KL transform is used to fit the origin and distribution of the service positions. Each service position is then fitted to a spherical surface using a diagonal from the center. At this stage, the service distribution is a temporary fit to the spherical surface, and it may involve deviations. After several iterations of recalculating via the KL transform and fitting to the spherical surface, SASKS achieves a stable distribution.
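One step of the fixed-point iteration (4.22), using D from (4.18), can be sketched as below. The three distribution constraints and SASKS's KL-transform and spherical fitting are omitted; M is assumed to be a precomputed affinity matrix with a zero diagonal, and the value of the density-control parameter a is illustrative:

```python
def asks_update(positions, M, a=0.5):
    # One fixed-point step of (4.22): each coordinate of service i moves
    # to the affinity-weighted mean of the other services' coordinates,
    # with per-pair weights D from (4.18).
    def D(theta):
        t = abs(theta)
        return 2.0 if t <= a else 2.0 * a / t

    n, p = len(positions), len(positions[0])
    new_positions = []
    for i in range(n):
        coords = []
        for k in range(p):
            num = den = 0.0
            for j in range(n):
                if j == i:          # self term: M[i][i] is zero anyway
                    continue
                w = M[i][j] * D(positions[j][k] - positions[i][k])
                num += w * positions[j][k]
                den += w
            coords.append(num / den if den > 0 else positions[i][k])
        new_positions.append(coords)
    return new_positions
```

With two services of affinity 1, one update moves each service onto the other's location, which is the fixed-point behaviour that, iterated with the constraints enforced, pulls highly affine services together.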

78 a)asks calculation b) Rotation and Centering by KL- Transform spherical surface e) Rotating until stabilizing distribution c)fitting spherical surface d) Spherical distribution Figure 4.11 Uniformalization process of SASKS 66

Chapter 5

Experiments and Evaluation

In this chapter, we evaluate the proposed approaches to analyze the effect of the two similarity-calculation methods on the clustering results. The chapter is divided into two main sections, one for each approach. Chapter 5.1 gives the evaluation of the ontology learning based clustering approach. Chapter 5.2 describes the evaluation of the CAS based clustering approach.

5.1 Evaluation of Ontology Learning Based Clustering Approach

The experimental platform used Microsoft Windows 7, an Intel Core i processor at 3.40 GHz, and 4 GB of RAM. Java was used as the programming language, and the Jena Framework was used to build the ontologies. The Jena Framework provides a collection of tools and Java libraries for developing ontologies. WSDL documents related to the Book, Medicine, Food, Film, and Vehicle domains were gathered from real-world Web service repositories and the OWL-S test collection [62] to act as the services dataset. Our dataset is available at [63].

IR Based Term-Similarity Methods Evaluation

In computing the service similarity, if the services fail to match using any matching

filter except Fail, then the IR-based term-similarity method is used to calculate the feature similarity. We considered SEB-based methods and the edge-count-based method in the experiments, with Web-PMI, Web-Dice and Web-Jaccard as the SEB methods. In this step, we evaluated the methods to select the best ones for our implementation. The three SEB equations used page-count results from the Google search engine API. We selected term pairs from the WordSimilarity-353 test collection published by Finkelstein et al. [64] as the test dataset. The Pearson correlation was used to compute the correlation between the human ratings and the term-similarity methods, calculated as:

r(HR, IRT) = [ N Σ ts_1 ts_2 − Σ ts_1 Σ ts_2 ] / sqrt( [ N Σ ts_1² − (Σ ts_1)² ] [ N Σ ts_2² − (Σ ts_2)² ] )    (5.1)

Here, HR is the human rating and IRT is the IR-based term-similarity method. Parameters ts_1 and ts_2 are the human-rating similarity and the term similarity for a pair of terms, respectively. Parameter N is the number of term pairs. The correlation value lies between 0 and 1.

The experimental results (Table 5.1) show that the edge-count-based term-similarity method that uses WordNet has the highest correlation value. Here, we show the similarity values for the term pairs in the data set and the final correlation values. The average term-similarity value over Web-Jaccard, Web-Dice and Web-PMI obtained a higher correlation than any of the individual SEB methods. Note that average SEB refers to the average of the similarity values of the three SEB methods for a particular pair of terms in the data set (e.g., if the similarity values between the terms book and library with Web-Jaccard, Web-Dice and Web-PMI are 0.54, 0.78 and 0.60 respectively, then the average SEB is 0.64), and not the average of the final correlation values. Therefore, we used the average SEB term similarity as the SEB method in our implementation, together with the WordNet-based edge-count method.
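Equation (5.1) is the standard Pearson coefficient; a minimal sketch:

```python
import math

def pearson(xs, ys):
    # Pearson correlation (5.1) between the human ratings and a
    # term-similarity method's scores over the same N term pairs.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den if den else 0.0
```

Perfectly aligned ratings give a correlation of 1; the closer a method's value is to 1, the better it tracks the human judgments.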

Table 5.1 Correlation between human rating and term-similarity methods
(Columns: Word 1, Word 2, Human, WordNet, Web-Jaccard, Web-Dice, Web-PMI, Average SEB. Term pairs: car–flight, cell–phone, television–film, phone–equipment, bread–butter, student–professor, bank–money, book–library, money–cash, gem–jewel, computer–news, arrival–hotel, street–avenue, stock–company, physics–chemistry, deployment–withdrawal, cup–drink, hospital–infrastructure, video–archive, media–trading, book–paper, credit–card, food–rooster, news–report, journey–car, glass–metal. The final row gives each method's correlation with the human ratings; the numeric values are not legible in this transcription.)

Feature Strength Evaluation

In this clustering approach, we extracted the service name, domain name, operation name, input and output as the service features. Not all features are equally helpful in the clustering process; some may contribute little to computing the service similarity. Thus, we need to identify the strength of each feature in computing the similarity. Our objective is to assign values to the feature weights in equation

(3.9) based on the strength of each feature, rather than assigning equal values. We implemented a program that returns a list of similar Web services for a given input Web service. Benchmarks of three different data sets with 100 services were selected from the test data sets, with each set containing services from five different domains. We executed the experiment three times for each data set, changing the input Web service, and calculated the average top-k precisions. Figure 5.1 shows the results for the top-k precisions. The precision value lies between 0 and 1. According to these results, service name obtained the highest precision values overall, whereas operation name, output message and domain name obtained lower values. When analyzing the operations and output messages in the WSDL documents, we observed that some Web services from different domains had similar operation names and output messages. For example, the BookTaxPrice service in the Book domain and the VehicleTaxPrice service in the Vehicle domain both have the GetTaxPrice operation name. Furthermore, the FilmPrice service in the Film domain and the ButterPrice service in the Food domain have the same output, Price. As a result, operation name and output message obtained lower precision values. In addition, we note that some providers advertise services through their own websites, which means they may publish different Web services on the same host. As a result, domain name obtained a lower precision in comparison to service name and input.

Figure 5.1 Feature strength evaluations (average top-2, top-5, top-10 and top-15 precision for service name, operation name, input, output and domain)

By analyzing the results, we

conclude that service name and input contribute more to measuring Web-service similarity than the other features. We therefore assigned the values (0.35, 0.25, 0.15, 0.15 and 0.10) to the constants (W_N, W_I, W_D, W_O and W_OP) used in the feature-integration formula (3.9), according to their contributions.

Ontology Evaluation

The idea here is to measure how far our ontology learning helps to improve the results of the similarity calculations, and to measure the functional dimension of the ontology. We used a task-based approach to evaluate the generated ontologies, with the same evaluation procedure as for the feature-strength evaluation. In this evaluation, we integrated the features and measured the similarity of Web services using the ontology. For comparison, we used the edge-count-based term-similarity method that employs WordNet. According to the results shown in Figure 5.2, using the ontology learning method improves the performance of the Web-service similarity calculations. In fact, the ontology method obtained higher precision values than the edge-count-based method throughout.

Figure 5.2 Ontology evaluation (top-2, top-5, top-10 and top-15 precision for the ontology method and the edge-count-based method)

5.1.4 Cluster Evaluation

We selected 500 Web services related to the Book, Medicine, Food, Film and Vehicle domains from the service data set [63] for comparison purposes. For the evaluation of cluster quality, we first used purity and entropy, which are external evaluation criteria. Purity determines how pure each of the clusters is, and is defined as:

Purity = (1/n) Σ_{j=1}^{k} max_i { n_j^i }    (5.2)

Here, n is the total number of services and n_j^i is the number of services in cluster j belonging to domain class i. The purity value lies between 0 and 1. The entropy of a single cluster C_r is:

E(C_r) = − (1/log q) Σ_{i=1}^{q} (n_r^i / n_r) log(n_r^i / n_r)    (5.3)

Here, q is the number of domain classes in the data set, n_r^i is the number of services of the i-th domain class that were assigned to the r-th cluster, and n_r is the number of services in cluster r. The entropy of the entire clustering is:

Entropy = Σ_{r=1}^{k} (n_r / n) E(C_r)    (5.4)

The entropy value lies between 0 and 1.

First, we evaluated the cluster performance by changing the weight values in the IR-based term-similarity equation (Equation 3.5) to measure the effect of the term-similarity methods. We used two methods. For Method 1, if two terms were in the WordNet database, we assigned 1 to α and 0 to β (otherwise 0 to α and 1 to β). For Method 2, we assigned the same value of 0.5 to both α and β. Figure 5.3 shows the variation in purity when the number of Web services is increased. According to these results, Method 1 obtained higher purity values than Method 2 at every stage. We can therefore determine that, by using Method 1, we can improve the

performance of the clustering process.

In the next step of the evaluation procedure, the proposed cluster-center identification approach was evaluated. We used three methods. For Method 1, we used our proposed approach. For Method 2, we used only the similarity values (S_i,c Sim). For Method 3, we used only the TF-IDF (tfidf_{S_i,c}) values of the service names. According to the results in Figure 5.4, the cluster-center identification approach that uses both the similarity values and the TF-IDF values of the service names obtained higher purity values at all stages. The implication is that, by using our proposed method, we can improve the clustering performance.

Figure 5.3 Contribution of term-similarity methods (purity versus number of Web services for Method 1 and Method 2)

Figure 5.4 Cluster center identification approach evaluation (purity versus number of Web services for Method 1 (TF-IDF & similarity), Method 2 (similarity only) and Method 3 (TF-IDF only))

Then, we evaluated our HTS approach, which uses both the proposed ontology

learning and the IR-based term similarity. For comparison, we implemented a clustering approach using only the edge-count-based method that uses WordNet to calculate similarities. Figures 5.5(a) and 5.5(b) show the purity and entropy values for the two approaches with respect to the number of services. According to the results, purity decreases and entropy increases as the number of services grows in both approaches. However, our approach obtained lower entropy and higher purity values throughout. Moreover, the rate of entropy increase is greater for the edge-count-based method, and the rate of purity decrease is smaller for the HTS approach. From these results, we can see that our ontology learning based approach improves the clustering performance.

Figure 5.5 Cluster performance with the HTS approach, which uses ontology learning: (a) purity variation of the two approaches; (b) entropy variation of the two approaches

As additional evaluation criteria for our ontology learning clustering approach, we

used precision, recall and the F-measure. Precision is the fraction of a cluster that comprises services of a specified class. Recall is the fraction of all services of a specified class that the cluster comprises. The F-measure measures the extent to which a cluster contains only services of a particular class and all services of that class. Equations (5.5), (5.6) and (5.7) are used to calculate these three criteria:

Precision(i, j) = NM_ij / NM_j    (5.5)

Recall(i, j) = NM_ij / NM_i    (5.6)

Here, NM_ij is the number of members of class i in cluster j, NM_j is the number of members of cluster j, and NM_i is the number of members of class i.

F(i, j) = 2 · Precision(i, j) · Recall(i, j) / (Precision(i, j) + Recall(i, j))    (5.7)

The values of precision, recall and the F-measure are between 0 and 1. We convert the values into percentages; the results are given in Table 5.2. According to the experimental results (Table 5.2), there are no false positives for the Medicine cluster in either approach, the precision values for both approaches being 100%. For all other clusters, however, our approach obtained higher precision values. For example, our approach improved the precision value for the Food cluster by 41%. However, the Medicine cluster obtained the lowest recall. When we analyzed the WSDL documents, we observed that some extracted features failed to identify their ontology. For example, the CheckRoomAvailability service belonging to the Medicine domain was not successfully placed in the Medicine cluster. In this case, the service failed to join with other services in the Medicine domain, such as MedicalOrganization, HospitalClinic and many others, in generating the ontology.
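The external criteria of this section, purity (5.2), entropy (5.3)-(5.4), precision (5.5), recall (5.6) and the F-measure (5.7), can be sketched directly from the cluster assignments; representing each cluster as a list of domain labels is an assumed encoding, not one from the thesis:

```python
import math
from collections import Counter

def purity(clusters):
    # (5.2): fraction of all services carried by each cluster's majority class.
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters if c) / n

def entropy(clusters, q):
    # (5.3)-(5.4): per-cluster entropy normalized by log(q), then
    # weighted by cluster size; q is the number of domain classes.
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        nr = len(c)
        if nr == 0:
            continue
        e = -sum((cnt / nr) * math.log(cnt / nr)
                 for cnt in Counter(c).values()) / math.log(q)
        total += (nr / n) * e
    return total

def precision(nm_ij, nm_j):
    return nm_ij / nm_j                              # (5.5)

def recall(nm_ij, nm_i):
    return nm_ij / nm_i                              # (5.6)

def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0     # (5.7)
```

A perfect clustering gives purity 1 and entropy 0; fully mixed clusters give purity 1/q and entropy 1.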

Moreover, the IR-based method in the HTS approach also failed to identify the correct domain from the terms used in the CheckRoomAvailability service. As with the precision values, our approach obtained higher values for both recall and the F-measure. The recall value for the Book cluster improved by 38.7%, reaching 100% by placing all of its services correctly.

Table 5.2 Accuracy measures of clusters
(Precision, recall and F-measure percentages for the Book, Medicine, Food, Film and Vehicle clusters, under the ontology learning based clustering approach and the edge-count-based (WordNet) clustering approach; the numeric values are not legible in this transcription.)

5.2 Evaluation of CAS Based Clustering Approach

Experimental Setup

We used the same experimental platform as in the ontology learning based clustering evaluation. Java was used for CAS, the service-affinity calculations, and the SVM implementation. The SASKS algorithm was implemented using MATLAB. We implemented five domain filters, namely Book, Medicine, Food, Film, and Vehicle, by training an SVM for each domain. First, we evaluated the performance of different SVM kernels. We considered linear kernels, polynomial kernels, and the radial basis function, and measured the accuracy of each kernel type to identify the best kernel for our implementation. Next, to investigate the effect of domain context in measuring term similarity, we calculated the similarity of term pairs by implementing and using the five domain filters, and analyzed the effect of domain context in terms of changes in the similarity values. We selected the term pairs from the same test collection published by Finkelstein et al. [64] as the test dataset. We then evaluated the CAS method in comparison to existing methods, computing

term similarities via the NGD method and the edge-count-based method that uses WordNet. NGD is a corpus-based method and edge count is a knowledge-based method. Here, we used the Pearson correlation (5.1) to check the performance of each term-similarity method. In the next step, we evaluated the effectiveness of the CAS method for service clustering. We computed the service similarity with our domain filters, and the services were clustered using the SASKS algorithm. We analyzed the visual output of the clusters for different domains. WSDL documents related to the Book, Medicine, Food, Film, and Vehicle domains from the service data set were used as the test collection. To provide comparative results, we used the ontology learning based clustering approach (HTS) and an edge-count-based clustering approach. Finally, to evaluate the CAS-based clustering approach further, we computed the purity and cohesion of the clusters for each approach.

SVM Kernel Performance

As described above, we experimented with different kernel types to select the best kernel for our implementations. We used term pairs from the selected data set [64] and computed the Pearson correlation with the human ratings. The correlation values were 0.82 for the linear kernel, 0.78 for the radial basis function, 0.70 for the polynomial kernel of degree 3, and 0.43 for the polynomial kernel of degree 2. According to these results, the best performance was obtained using a linear kernel and the lowest performance was obtained using the higher-degree kernels (polynomial degree 2 or 3). We therefore used a linear kernel in our implementation.

Term Similarity Methods Evaluation

We calculated the similarities of term pairs for the five domains using an SVM, taking the highest value as the best value. The results in Table 5.3 show that the similarity value of a term pair differs from domain to domain. For example, the similarity value between jaguar and car in the Vehicle domain context was the highest

Table 5.3 Term similarity with the CAS method for different domain filters
(Columns: Word 1, Word 2, Medicine, Book, Film, Vehicle, Food. Term pairs: book–paper, automobile–car, psychology–doctor, jaguar–car, emergency–victim, liquid–water, video–archive, doctor–nurse; the similarity values are not legible in this transcription.)

value (0.98), and was 0.00 for the other domains. The similarity value between emergency and victim was 0.99 for the Medicine domain, 0.20 for the Vehicle domain, and 0.00 for all other domains. When we analyze these similarity values, we can see that the domain context affects the similarity values greatly. We can therefore determine that domain context can play a significant role in the values obtained for term similarity in a particular domain.

Next, we compared the CAS method with existing similarity-calculation methods. Table 5.4 shows the correlation of each method with the human ratings, using term pairs from the data set [64]. The results show that the CAS method, with a representative domain context, gave a higher correlation with the human ratings than the other two methods. We note that the CAS method computes the similarity by considering the semantic relationships between terms in a domain. For example, the method gave the highest similarity value for the term pair jaguar and car (CAS = 0.98, NGD = 0.79, and edge count = 0.47). The edge-count method does not consider jaguar as a car model, jaguar being more associated with an animal than a vehicle, and therefore obtained a lower similarity value. We can thus determine that domain-specific contexts help to improve the semantic similarity values for term pairs, and that context plays a significant role in measuring the semantic similarity of terms.
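Taking "the highest value as the best value" across the five domain filters can be sketched as below; `domain_models`, mapping a domain name to its trained similarity function, is a hypothetical interface standing in for the per-domain SVMs:

```python
def best_domain_similarity(term_a, term_b, domain_models):
    # Score the term pair under every trained domain filter and keep
    # the winning domain together with its similarity value.
    scores = {d: model(term_a, term_b) for d, model in domain_models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Stub models echoing the jaguar/car example: high only under Vehicle.
models = {"Vehicle": lambda a, b: 0.98,
          "Book": lambda a, b: 0.10,
          "Medicine": lambda a, b: 0.00}
best_domain_similarity("jaguar", "car", models)
```

This mirrors how a pair like jaguar and car is scored near zero by most filters but very highly by the Vehicle filter, so the Vehicle context wins.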

Table 5.4 Comparison of similarity calculation approaches
(Columns: Term A, Term B, Human, NGD, Edge-Count, CAS. Term pairs: doctor–nurse, psychology–clinic, psychology–doctor, psychology–health, stroke–hospital, treatment–recovery, doctor–personnel, doctor–liability, book–paper, journal–association, car–automobile, journey–voyage, jaguar–car, cucumber–potato, vodka–gin, vodka–brandy, food–fruit, cup–food, seafood–food, movie–star, movie–popcorn, television–film; the final row gives each method's correlation with the human ratings. The numeric values are not legible in this transcription.)

Visualization of Web Service Clusters

In this step, we analyze the clustering results using the visual output. We clustered the Web services by applying our domain filters, using WSDL files from our service dataset [63]. First, we applied the Vehicle domain filter, which was trained to filter

Vehicle domain services. The SASKS algorithm plotted the services on the sphere according to their similarity values. Figure 5.6 shows a sample output of the service sphere, and Figure 5.7 shows the visual output for the Vehicle cluster. When we analyze the spherical surface, we can observe that most of the services that belong to the Vehicle domain, such as CarPriceColor, ExpensivecarPrice, and JaguarCarPrice, were placed in the same region, and we can consider that region to be a Vehicle cluster. Because we used the Vehicle filter, other services belonging to a domain such as Food were not clustered separately (Figure 5.8). Instead, all other services were placed without reference to a domain. We therefore observe the dataset as two clusters, namely Vehicle domain services and non-Vehicle domain services, with a clear separation between the Vehicle domain services and the other services. Analyzing the Vehicle cluster further, we observe that some services such as Patienttransport, selectmedicalflight and AmbulanceService (the highlighted services in Figure 5.7), which show characteristics of both the Medicine and Vehicle domains, were also placed inside the Vehicle cluster. Furthermore, we see that very similar services are placed close to each other within the cluster.

Figure 5.6 Service sphere

Figure 5.7 Visualization results for the CAS method with a Vehicle filter

For example, the Amount-of-moneycarPrice service is placed closer to the Amount-of-money3wheeledcarRecommendedPrice service than to the JaguarCarPrice or VehiclePrice services.

Next, we clustered the Web services using the ontology learning based approach to compare with the results of the CAS method. When we analyzed the spherical surface, we observed that similar services in the same domain were placed into one region, and there were five main separated regions, one for each domain. Figure 5.9(a) shows the visualization results for part of the Vehicle cluster. In that region, we can see services that belong to the Vehicle domain, such as JaguarCarPrice and ThreeWheeledCarPrice. Figure 5.9(c) shows part of the Medicine cluster. In that

region, we can see services that belong to the Medicine domain, such as InformHospital and HospitalPhysian. We observed that AmbulanceService (the highlighted service in Figure 5.9(b)) is moving towards the Medicine cluster, and we see the service on the sphere between the Vehicle and Medicine clusters.

Figure 5.8 Visualization results for the CAS method (non-Vehicle cluster) with a Vehicle filter, showing Medicine, Book and Food domain services

Figure 5.9 Visualization results for the ontology learning method: (a) part of the Vehicle cluster; (b) service sphere; (c) part of the Medicine cluster

The system finds it difficult to choose between these two clusters. Further, according to Figure 5.9(c), the Patienttransport service was placed within the Medicine cluster by the ontology learning based clustering method. With this method, the Patienttransport service obtained higher similarity values with Medicine domain services, such as InformHospital, than with Vehicle domain services such as CarPrice. The service therefore became a member of the Medicine cluster. According to Figure 5.7, however, with the CAS method both of the above services are members of the Vehicle cluster. Moreover, we observed the same result for other services, such as selectmedicalflight and provide_medical_flight_information (Figure 5.10).

Figure 5.10 Visualization results for the ontology learning method (Medicine cluster)

We then applied the edge-count-based clustering approach. Figure 5.11 shows the

visualization of the Vehicle cluster for this method. As with the ontology learning based method, we observe that similar services in the same domain were placed into one region. However, we observed more false-positive members in the clusters with this method than with the other two methods. For example, the highlighted services in Figure 5.11 (HospitalExperimenting and EmergencyPhysian), which belong to the Medicine domain, were incorrectly placed in the Vehicle cluster. In Table 5.5, we list some cluster members of the Vehicle cluster for each method. According to these results, some services, such as Patienttransport and Provide_nonmedical_flightinformation, were not placed in the Vehicle cluster by the

Figure 5.11 Visualization results for the edge-count-based method (Vehicle cluster)

edge-count-based or ontology learning based methods. However, these two services were identified as Vehicle domain services when we applied the CAS method with a model trained for the Vehicle domain (the other two methods placed these two services in the Medicine cluster). In addition, with the edge-count-based method, some services that should have been in the Vehicle cluster were not placed correctly, such as the CarTaxedpricereport service. Moreover, with this method, some services were placed incorrectly in the Vehicle cluster; the bottom of Table 5.5 lists some such incorrectly placed members. Note also that the VehicleTechnologyBookPrice service was filtered with the Vehicle domain services by the CAS method. If we analyze its WSDL document, we observe that the extracted features include Vehicle domain related terms such as Vehicle and Technology. Figure 5.12(a) shows part of the Book cluster for the edge-count-based approach.

Table 5.5 Comparison for the Vehicle cluster

Members                                 Edge-Count-Based   Ontology Learning Based   CAS (Vehicle Filter)
FourwheeledcarYearprice                 ✓                  ✓                         ✓
VehiclePrice                            ✓                  ✓                         ✓
OnepersonbicyclecarPrice                ✓                  ✓                         ✓
BicycleCarPriceService                  ✓                  ✓                         ✓
ExpensivecarPrice                       ✓                  ✓                         ✓
Amount-of-moneycarPrice                 ✓                  ✓                         ✓
CheapcarsTechnology                     ✓                  ✓                         ✓
JaguarCarPrice                          ✓                  ✓                         ✓
CarPrice                                ✓                  ✓                         ✓
AmbulanceService                        ✓                  ?                         ✓
RedFerrariprice_service                 ✓                  ✓                         ✓
ThreewheeledcarPrice                    ✓                  ✓                         ✓
CarPricereport                          ✓                  ✓                         ✓
CarTaxedpricereport                     X                  ✓                         ✓
Patienttransport                        X                  X                         ✓
VehicleTechnologyBookPrice              X                  X                         ✓
Provide_nonmedical_flightInformation    X                  X                         ✓
Invalid members:
HospitalExperimenting                   ✓                  X                         X
MedicalclinicPredicting                 ✓                  X                         X
Sendemaphonenumber                      ✓                  X                         X

✓ – Member / X – Not a member / ? – Cannot define clearly

Highlighted areas in the figure show some incorrectly placed services. For example, the GrocerystorFood service is placed in the Book cluster, although it belongs to the Food cluster. Further, according to Figure 5.12 (b), the CorparationApple and BookRecommendedpriceindollar services are incorrectly placed in the Medicine cluster by the edge-count-based clustering method. In the next step of our evaluation, we clustered services using the CAS method with different filters applied, to evaluate the effect of the domain context. The highlighted area of Figure 5.13 shows the visualization of the Medicine cluster after applying the Medicine filter. Most of the services that belong to the Medicine domain are placed inside this cluster. An important feature to note is that some services, such as Patienttransport, selectmedicalflight, and AmbulanceService, were inside or close to

Figure 5.12 Part of the clustering result, with affinity values calculated using WordNet: (a) Book cluster; (b) Medicine cluster

the Medicine cluster. However, according to the results in Table 5.5 and Figure 5.9, these services were also members of the Vehicle cluster when we applied the Vehicle filter. Therefore, these services show characteristics of both the Vehicle and Medicine domains. Here, CAS was able to identify the semantic relationship between these services and other services in the Medicine or Vehicle domains via the domain context, and clustered the services according to the domain. However, using the other two methods, each service was placed in only one cluster according to its similarity values with other services. These approaches could not identify the multi-domain nature of such services.

Figure 5.13 Visualization results for the CAS method (Medicine filter)
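The difference between single-cluster assignment and the CAS filters can be sketched as follows. The similarity scores below are invented for illustration; in the actual method they come from the trained domain models.

```python
# Hypothetical similarity of one service to each domain model.
# The numbers are invented; Patienttransport scores highly for both
# Vehicle and Medicine, reflecting its multi-domain nature.
scores = {"Patienttransport": {"Vehicle": 0.61, "Medicine": 0.58, "Book": 0.12}}

def hard_assign(service):
    """Classic clustering: the service ends up in exactly one cluster,
    the one with the highest similarity."""
    s = scores[service]
    return max(s, key=s.get)

def filter_assign(service, threshold=0.5):
    """CAS-style filtering: the service is retained by every domain
    filter whose similarity exceeds the threshold."""
    return sorted(d for d, v in scores[service].items() if v >= threshold)

print(hard_assign("Patienttransport"))    # a single domain only
print(filter_assign("Patienttransport"))  # both Vehicle and Medicine
```

Under hard assignment the Medicine affinity of Patienttransport is simply discarded, whereas the filter-based view keeps both domain memberships.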

Figure 5.14 shows the visualization of the Book cluster after applying the Book filter. Services that belong to the Book domain were placed into the same area. We observe that the VehicleTechnologyBookPrice service, which was a member of the Vehicle cluster after applying the Vehicle filter, was placed very close to the Book cluster. According to these results, we can determine that awareness of the context of services helps to find relationships among the clusters, which can characterize their elements. Figure 5.15 shows the average minimum distance from one service to the other services within a clustering area. In the graph, the distance values lie between 0 and 1. This distance increases with an increasing number of services for all three approaches. However, the CAS method, for which we applied the Vehicle filter, obtained the minimum distance values in all cases. This indicates that services inside the cluster for the CAS method were more tightly clustered than when the ontology learning- and edge count-based methods were used.

Figure 5.14 Visualization results for the CAS method (Book filter)
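The metric plotted in Figure 5.15 can be written down explicitly. The sketch below is one plausible reading of "average minimum distance" (for each service, take the minimum distance to any other service in the cluster, then average); the pairwise distances are hypothetical, and this is not necessarily the exact implementation used in the evaluation.

```python
def average_minimum_distance(dist, cluster):
    """Mean, over all services in the cluster, of the minimum distance
    from a service to any other cluster member. `dist` is a symmetric
    pairwise-distance table with values in [0, 1]."""
    total = 0.0
    for s in cluster:
        total += min(dist[s][t] for t in cluster if t != s)
    return total / len(cluster)

# Tiny hypothetical cluster of three services with distances in [0, 1].
dist = {
    "a": {"b": 0.2, "c": 0.4},
    "b": {"a": 0.2, "c": 0.3},
    "c": {"a": 0.4, "b": 0.3},
}
print(average_minimum_distance(dist, ["a", "b", "c"]))
```

A lower value means that every service sits close to at least one neighbour in its cluster, which is why the tighter CAS clusters score lowest in Figure 5.15.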


More information

cxl ontology system, Cyscom Biznet Services work flow and business applications Get to know the power of cxl

cxl ontology system, Cyscom Biznet Services work flow and business applications Get to know the power of cxl cxl ontology system, Cyscom Biznet Services work flow and business applications Get to know the power of cxl cxl structures, enriches and semantically maps product data into a repository for work flow

More information

Design and Implementation of Office Automation System based on Web Service Framework and Data Mining Techniques. He Huang1, a

Design and Implementation of Office Automation System based on Web Service Framework and Data Mining Techniques. He Huang1, a 3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC 2016) Design and Implementation of Office Automation System based on Web Service Framework and Data

More information

Goya Inference Platform & Performance Benchmarks. Rev January 2019

Goya Inference Platform & Performance Benchmarks. Rev January 2019 Goya Inference Platform & Performance Benchmarks Rev. 1.6.1 January 2019 Habana Goya Inference Platform Table of Contents 1. Introduction 2. Deep Learning Workflows Training and Inference 3. Goya Deep

More information

SOA Analyst Certification Self-Study Kit Bundle

SOA Analyst Certification Self-Study Kit Bundle SOA Analyst Certification Bundle A Certified SOA Analyst specializes in carrying out the analysis and definition of service inventory blueprints and the modeling and definition of service candidates, service

More information

Classification of DNA Sequences Using Convolutional Neural Network Approach

Classification of DNA Sequences Using Convolutional Neural Network Approach UTM Computing Proceedings Innovations in Computing Technology and Applications Volume 2 Year: 2017 ISBN: 978-967-0194-95-0 1 Classification of DNA Sequences Using Convolutional Neural Network Approach

More information

Building Cognitive applications with Watson services on IBM Bluemix

Building Cognitive applications with Watson services on IBM Bluemix BusinessConnect A New Era of Thinking Building Cognitive applications with services on Bluemix Bert Waltniel Cloud 1 2016 Corporation A New Era of Thinking What is Bluemix? Your Own Hosted Apps / Services

More information

Enterprise IT Architectures SOA Part 3

Enterprise IT Architectures SOA Part 3 Enterprise IT Architectures SOA Part 3 Hans-Peter Hoidn hans-peter.hoidn@ch.ibm.com November 26, 2007 SOA Because Innovation Requires Change and SOA Makes Change Easier a service? A repeatable business

More information

Silvia Calegari, Marco Comerio, Andrea Maurino,

Silvia Calegari, Marco Comerio, Andrea Maurino, A Semantic and Information Retrieval based Approach to Service Contract Selection Silvia Calegari, Marco Comerio, Andrea Maurino, Emanuele Panzeri, and Gabriella Pasi Department of Informatics, Systems

More information

SOA-Based Next Generation OSS Architecture

SOA-Based Next Generation OSS Architecture SOA-Based Next Generation OSS Architecture Young-Wook Woo, Daniel W. Hong, Seong-Il Kim, and Byung-Soo Chang Network Technology Lab., KT, 463-1 Jeonmin-Dong,Yuseong-Gu,Daejeon 305-811, Korea {ywwoo,wkhong,sikim,bschang}@kt.co.kr

More information

Dynamic Parameterized States Tracking for Reusable Workflow Routing

Dynamic Parameterized States Tracking for Reusable Workflow Routing International Journal of Innovative Computing 4: 1(2014) 9-13 International Journal of Innovative Computing Journal Homepage: http://se.cs.utm.my/ijic Dynamic Parameterized States Tracking for Reusable

More information

Service-Oriented Modeling (SOA): Service Analysis, Design, and Architecture

Service-Oriented Modeling (SOA): Service Analysis, Design, and Architecture Service-Oriented Modeling (SOA): Service Analysis, Design, and Architecture Preface. Chapter 1. Introduction. Service-Oriented Modelling: What Is It About? Driving Principles Of Service-Oriented Modelling.

More information

Exception Handling in Service-Oriented Architecture

Exception Handling in Service-Oriented Architecture Exception Handling in Service-Oriented Architecture Applies to: Business Experts Summary Centrally managing exceptions has been a difficult task for many enterprises. This article describes the importance

More information

PERFORMANCE ANALYSIS TO SUPPORT B2C SYSTEM IN AIRLINE INDONESIA BASED ON SOA USING ENTERPRISE SERVICE BUS

PERFORMANCE ANALYSIS TO SUPPORT B2C SYSTEM IN AIRLINE INDONESIA BASED ON SOA USING ENTERPRISE SERVICE BUS International Journal of Civil Engineering and Technology (IJCIET) Volume 10, Issue 02, February 2019, pp. 865 873, Article ID: IJCIET_10_02_083 Available online at http://www.iaeme.com/ijciet/issues.asp?jtype=ijciet&vtype=10&itype=2

More information