Improving SIM-based Annotation Method of Protein Sequence Using Support Vector Machine

SU-E2-3, Tokyo, Japan (September 20-24, 2006)

Jung-Ying Wang (1,2), Cheng-Kang Liu (1) and Hahn-Ming Lee (1,3)
(1) Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan
(2) Department of Multimedia and Game Science, Lunghwa University of Science and Technology, Taoyuan 333, Taiwan
(3) Institute of Information Science, Academia Sinica, Taipei 115, Taiwan

Abstract - In this paper, we present a protein sequence annotation system (Multiple Annotation for Protein Sequences) that provides a mechanism to extract multiple annotations from various types of biological data, including SwissProt keywords, InterPro signatures and GO terms. The system also eliminates erroneous annotations automatically with a pre-trained SVM classifier. It assigns an annotation to the input protein sequence by considering all hit proteins that carry the annotation, not just a single hit protein. The experimental results show that erroneous annotations are eliminated effectively while accuracy remains high across the different annotation types.

I. INTRODUCTION

With the completion of many genome sequencing projects, the gap between the number of newly published protein sequences and the number of reliable function annotations in public databases is growing. The actual function of a protein can only be determined by "wet" experiments, which cannot keep up with the rapid growth of new protein sequences. This creates a demand for computational approaches that predict the functions of protein sequences [1, 2]. Many approaches have been proposed, based on different biological concepts and various computational techniques [3, 4, 5]. These methods can simplify experimental determination: it is clearly more efficient to test a high-probability hypothesis than to test possible functions at random.
Currently, the most commonly used approach is the sequence similarity (SIM) method, which searches for homology relationships between sequences with tools such as FASTA [6], BLAST [7] and PSI-BLAST [8]. However, only a few sequences can be annotated with high similarity, and annotations inferred from weakly similar sequences might be erroneous [9]. Thus, manual reconfirmation of annotations inferred by similarity searches is still in great demand. Moreover, a single sentence describing some properties of an unknown protein sequence is not regarded as an optimal annotation. A comprehensive description of a protein sequence requires various types of biological data, such as protein functions, domains, families, functional sites and subcellular location. These data are usually generated by individual groups around the world, each with its own data types, so integrating biological data from different sources is difficult but vital [10].

In this paper, we propose an automatic annotation system for protein sequences (Multiple Annotation for Protein Sequences), which provides a method to extract multiple annotations from various types of biological data. These annotations include SwissProt keywords [11], InterPro signatures [12] and GO terms [13]. The idea is inspired by our previous experience with an EST annotation project [14]. The system performs sequence alignment with BLAST to obtain annotations from hit protein sequences and automatically eliminates erroneous annotations with a two-class SVM classifier [15, 16] to obtain precise annotation results. Furthermore, it assigns an annotation to the input protein sequence by considering all similar proteins that carry the annotation, not just a single similar protein. It can therefore reduce the erroneous annotations inferred from weak sequence similarities and from sequence identities in non-functional segments.
The goals of the system are to annotate proteins directly from their sequences, to integrate various data sources to provide multiple annotations, and to reduce the need for human intervention so that annotation becomes faster. The experimental results show that erroneous annotations are eliminated effectively while accuracy remains high across the different annotation types.

II. METHODS

This section presents the concept and the system architecture in detail. First, we describe the system architecture. Then each component of the architecture is described. Finally, we explain the features used in the decision and evaluation processes.

A. System architecture

The system architecture is shown in Figure 1. It consists of four modules and a database. The first module is the similar protein sequence searching unit, which looks for similar protein sequences. Next, the annotation-based protein clustering unit groups the similar protein sequences that share a common annotation into protein clusters. Then the protein cluster selecting and evaluating unit selects usable protein clusters and evaluates the reliability of the selected clusters. The final module is the GO annotation searching unit, which looks up the GO annotations of the usable protein clusters by searching GO mapping tables. The outputs of the system include protein signatures, SwissProt keywords and GO annotations. The main database is the data collector database, which collects the similar protein sequences and the protein cluster data. It also supplies data to the protein cluster selecting and evaluating unit.

Figure 1. System architecture.
Figure 2. Architecture of the similar protein sequence searching unit.
Figure 3. Annotation-based protein clustering unit.

B. Similar protein sequence searching unit

The main purpose of this module is to find annotated protein sequences that are similar to the input protein sequence. The architecture of this module is shown in Figure 2. First, it performs sequence alignment with BLAST [7] against the SwissProt protein database [11] to get a list of hit protein sequences. SwissProt is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. The BLAST alignment result is then parsed by the BLAST parser to get the list of hit proteins and the alignment data. The alignment data include the SwissProt id of each hit protein sequence, its bit score, and the starting and ending positions of the similar protein segments. The alignment data extracted by the BLAST parser are stored in the data collector database. Finally, all hit proteins are passed to the annotation-based protein clustering unit.

C. Annotation-based protein clustering unit

This module groups the hit proteins into protein clusters. The architecture of this module is shown in Figure 3. Hit proteins with a common annotation are grouped into one protein cluster. The annotation, which is either a keyword or a signature, is obtained by searching the SwissProt and InterPro databases, respectively. The keyword is an important field of protein function annotation in the SwissProt database. The protein signature is a functional unit of a protein, such as a protein family, domain or functional site. Each protein cluster can therefore be treated as one keyword or one InterPro signature. The GO annotation of the input protein is obtained from the GO annotations of the protein clusters found. The protein cluster data are also stored in the data collector database.

Figure 4. Architecture of the protein cluster selecting and evaluating unit.

D. Protein cluster selecting and evaluating unit

This module automatically selects the protein clusters that are usable for annotating the input protein sequence. Here, usable means that a protein cluster is supported by a sufficient number of hit protein sequences and that the similarities of those hit protein sequences are considered significant. Two-class SVM classifiers [15, 16] automatically decide whether a protein cluster is selected for function annotation. In addition, we measure the reliability of selected signature protein clusters by calculating their domain matching scores, which are explained in Section G. All processes in this module are shown in Figure 4. First, the features of each protein cluster are extracted by the protein cluster feature extractor. These features consist of a supporting score and a similarity score, which are explained in Section F. A protein cluster is selected if it is supported by more similar protein sequences and these sequences are significant. The extracted features are encoded as vectors and sent to the SVM classifiers as input to select usable protein clusters. If a usable protein cluster is a signature protein cluster, it is sent to the domain matching score calculator to evaluate its reliability. Finally, all usable protein clusters are passed to the GO annotation searching unit.
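
The grouping step of Section C, which produces the clusters that Section D then filters, can be sketched as follows. This is only an illustrative sketch in Python: the function name `cluster_hits_by_annotation`, the protein ids and the annotation labels are hypothetical and are not taken from the system described here.

```python
from collections import defaultdict


def cluster_hits_by_annotation(hit_annotations):
    """Group hit proteins into clusters, one cluster per shared annotation.

    hit_annotations maps a hit protein id to the set of annotations
    (SwissProt keywords or InterPro signatures) attached to that protein.
    A protein with several annotations joins several clusters.
    """
    clusters = defaultdict(set)
    for protein_id, annotations in hit_annotations.items():
        for annotation in annotations:
            clusters[annotation].add(protein_id)
    return dict(clusters)


# Hypothetical hit list: ids and annotation labels are made up for illustration.
hits = {
    "P1": {"KW:Kinase", "IPR:A"},
    "P2": {"KW:Kinase"},
    "P3": {"IPR:A", "IPR:B"},
}
clusters = cluster_hits_by_annotation(hits)
for annotation in sorted(clusters):
    print(annotation, sorted(clusters[annotation]))
```

Each key of the resulting dictionary plays the role of one protein cluster, so a cluster can be read as one keyword or one signature together with the hit proteins that support it.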

Figure 5. Architecture of the GO annotation searching and ranking unit.
Figure 6. An example of the protein voters of one protein cluster.

E. GO annotation searching unit

The GO annotations of usable protein clusters are obtained and ranked in this module. The architecture of this module is shown in Figure 5. The function searcher looks up the GO annotations of usable protein clusters by searching GO mapping tables. The GO annotations found are then ranked by the GO ranker. Since the GO annotations are inferred from two annotation systems, SwissProt keywords and InterPro signatures, they are ranked and displayed separately according to their source annotation system. GO annotations inferred from InterPro signatures are ranked according to the domain matching score and the GO structural relationship, where the domain matching score is taken from the corresponding usable protein cluster. The GO structural relationship means that, if two GO annotations are related, the child annotation is a subclass of its parent annotation. This relationship can make the parent annotation more reliable. Thus, annotations inferred from protein clusters with higher domain matching scores are considered more reliable, and the reliability of annotations with low domain matching scores is raised when they are supported by child annotations with high domain matching scores. For GO annotations inferred from SwissProt keywords, we initially give each annotation a score of 1, and the GO structural relationship is applied as discussed above.

Figure 7. An example of the bit scores of proteins in a protein cluster.

F. Cluster features

The extracted features of protein clusters are divided into two parts: the supporting score and the similarity score. The supporting score shows the support ratio of a protein cluster among all hit proteins and the support ratio of a protein cluster within an annotation.
The similarity score shows the significance of each protein cluster. These features aim to select the protein clusters that are supported by more similar protein sequences whose similarities are significant.

There are two kinds of supporting scores: HitPercent_voter and HitPercent_annotation. HitPercent_voter is a vote by the hit proteins of the input protein sequence. HitPercent_annotation is a vote by the hit proteins within the InterPro signature or SwissProt keyword. As Figure 6 shows, 5 proteins of the hit protein list are included in the protein cluster. With 8 proteins in total in the hit protein list, the HitPercent_voter of the protein cluster is 5/8*100% = 62.5%. With 7 proteins in total in the InterPro node, the HitPercent_annotation is 5/7*100% = 71.4%. HitPercent_voter and HitPercent_annotation thus give the supporting ratio among the hit proteins and the supporting ratio within an InterPro signature or SwissProt keyword, respectively.

There are two ways to measure the similarity score: the sum bit score and the max bit score. SumBitScoreRatio is the normalized sum of the bit scores of the hit proteins. MaxBitScoreRatio is the normalized maximum bit score of the hit proteins. As Figure 7 shows, the max bit scores of protein cluster A and protein cluster B are 270 and 80, respectively, and their sum bit scores are 520 and 155, respectively. To measure the significance of one protein cluster relative to the other clusters, the max bit score and the sum bit score need to be normalized. Thus, the normalized max bit score (MaxBitScoreRatio) of protein cluster A is 270/80 = 3.375, and that of protein cluster B is 1. Likewise, the normalized sum bit score (SumBitScoreRatio) of cluster A is 520/155 = 3.35, and that of cluster B is 1. MaxBitScoreRatio and SumBitScoreRatio indicate how much more significant protein cluster A is than protein cluster B.

G. Domain matching score

The domain matching score is used to measure the reliability of signature protein clusters. Each protein cluster includes a group of hit proteins found by the similar protein sequence searching unit. For these hit proteins, whether the similar sub-segment agrees with the functional segment influences the reliability of the signature protein cluster, so we take this into account when evaluating reliability. First, the DomainMatchingRatioScore of every hit protein in the signature protein cluster is calculated to measure how well the similar segment coincides with the functional segment. As Figure 8 shows, protein 1 is one of the hit proteins in protein cluster A. The lengths of its similar segment and functional segment are 14 and 17, respectively, and the length of the matching segment is 11. Thus, the DomainMatchingRatioScore of protein 1 is 11/17 ≈ 0.65. After the DomainMatchingRatioScores of all hit proteins are calculated, the domain matching score of a protein cluster is:

$$\text{DomainMatchingScore} = \sum_{i=1}^{n} \text{DomainMatchingRatioScore}(\text{protein}_i) \qquad (1)$$

where n is the number of hit protein sequences in the protein cluster.

Figure 8. Illustration of DomainMatchingRatioScore.

III. EXPERIMENT RESULTS

This section discusses the performance and the feasibility of the system for annotating protein sequences.

A. Experimental setup

The protein sequences used in our experiment are the protein-coding ORFs in the S. cerevisiae genome [17]. S. cerevisiae is one of the most important model organisms and has been studied for over a century. We randomly select 500 ORF sequences from the Saccharomyces Genome Database (SGD) [18] as our data set. Each sequence has known SwissProt keywords, InterPro signatures and GO terms. Similar sequence searching and protein clustering are applied to these sequences to find their protein clusters, which are then used as training and testing data for the SVM classifier.
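
To make concrete what is computed for each protein cluster before it reaches the classifier, the sketch below reproduces the worked numbers of Sections II.F and II.G in Python. It is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the individual bit scores and segment coordinates are made up so that they reproduce the totals quoted for Figures 6 to 8, and normalizing by the smallest value among the clusters is an assumption inferred from the two-cluster example (270/80 and 520/155).

```python
def hit_percent(cluster_hits, reference):
    """Percentage of the proteins in `reference` that belong to the cluster."""
    return 100.0 * len(cluster_hits & reference) / len(reference)


def bit_score_ratios(bit_scores_by_cluster):
    """Return {cluster: (SumBitScoreRatio, MaxBitScoreRatio)}.

    Assumption: each value is normalized by the smallest value among all
    clusters, which matches the two-cluster example in the text.
    """
    sums = {c: sum(v) for c, v in bit_scores_by_cluster.items()}
    maxes = {c: max(v) for c, v in bit_scores_by_cluster.items()}
    min_sum, min_max = min(sums.values()), min(maxes.values())
    return {c: (sums[c] / min_sum, maxes[c] / min_max) for c in sums}


def domain_matching_ratio(similar, functional):
    """Overlap length divided by functional segment length.

    Segments are inclusive (start, end) positions on the hit protein.
    """
    overlap = min(similar[1], functional[1]) - max(similar[0], functional[0]) + 1
    functional_length = functional[1] - functional[0] + 1
    return max(0, overlap) / functional_length


def domain_matching_score(segment_pairs):
    """Equation (1): sum of the per-protein ratios over the cluster."""
    return sum(domain_matching_ratio(s, f) for s, f in segment_pairs)


# Figure 6 counts: 8 hits, 5 of them in the cluster, 7 proteins in the annotation.
hit_list = {f"h{i}" for i in range(1, 9)}
cluster = {f"h{i}" for i in range(1, 6)}
annotation_members = cluster | {"x1", "x2"}
voter = hit_percent(cluster, hit_list)
annotation = hit_percent(cluster, annotation_members)

# Figure 7 totals: A has max 270 and sum 520, B has max 80 and sum 155.
ratios = bit_score_ratios({"A": [270, 250], "B": [80, 75]})

# Figure 8 lengths: similar segment 14, functional segment 17, overlap 11.
ratio_protein_1 = domain_matching_ratio((1, 14), (4, 20))

print(voter, round(annotation, 1), ratios["A"], round(ratio_protein_1, 3))
```

The two hit percentages and the two bit score ratios would form the four-element feature vector of a cluster, while the domain matching score is kept apart because it is used to judge reliability after selection.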
These protein clusters are treated as positive instances if they correspond to the known InterPro signatures or SwissProt keywords of the protein sequences; otherwise, they are treated as negative instances.

To evaluate the annotation results, the experiment is divided into two parts. The first part evaluates the ability to select protein clusters. Since there are two kinds of protein clusters in our system (InterPro signature and SwissProt keyword), they are examined and discussed separately. We also examine the influence of similarity searches under different E values, using BLAST to generate multiple sequence alignments of the experimental sequences with E values of 1.0E-21, 1.0E-6 and 1.0, respectively. Moreover, we show that the domain matching score can be used to evaluate the reliability of signature protein clusters by analyzing the distribution of domain matching scores in positive and negative instances. In the second part, the GO annotations inferred from the selected protein clusters are compared with the GO annotations in SGD.

The SVM software used for protein cluster selection is LIBSVM [19], an integrated software package for support vector classification, regression and distribution estimation. We choose the Gaussian radial basis function (RBF) as the kernel function and select the best parameters C and gamma by grid search [19].

B. Signature protein cluster selection

After the experimental sequences are searched with BLAST under the different E values and the hits are clustered according to InterPro signatures, a set of signature protein clusters is obtained. The distribution of positive and negative instances in the protein clusters is listed in Table I. The numbers of both positive and negative instances increase as the E value increases, especially the negative instances.

Table I.
Numbers of positive and negative instances in signature protein clusters under different E values (rows: Positive, Negative; columns: E value 1E-21, 1E-6, 1).

The performance is evaluated by three-fold cross validation. The precision rate decreases slightly from 93.03% to 90.21% as the E value increases from 1E-21 to 1. The recall rate decreases from 93.90% to 88.10% over the same range. The average precision rate is 91.76% and the average recall rate is 90.74%. This shows that the E value does not strongly influence the precision and recall rates. Since the number of negative instances grows far faster than the number of positive instances, the positive and negative instances become unbalanced. Thus, both precision and recall decrease slightly as the E value becomes larger.

Table II. Numbers of positive and negative instances in keyword protein clusters under different E values (rows: Positive, Negative; columns: E value 1E-21, 1E-6, 1).

C. Keyword protein cluster selection

When the hit proteins are clustered according to SwissProt keywords, the distributions of positive and negative instances in the protein clusters are as shown in Table II. The number of negative instances increases as the E value rises. However, the number of positive instances remains unchanged between E values 1E-21 and 1E-6 and increases slightly at E value 1. The precision rate of keyword protein cluster selection stays at or above 89.83% and the recall rate stays at or above 82.88% under the different E values. The average precision rate is 91.74% and the average recall rate is 85.52%. The lowest precision rate is obtained when the E value is 1E-6, because the number of positive instances remains unchanged while the number of negative instances increases with the E value.

D. Feasibility of domain matching score

The domain matching score is used to evaluate the reliability of selected signature protein clusters. The ratios of positive instances at different domain matching scores are plotted in Figure 9. The ratio of positive instances increases as the domain matching score increases, which implies that a protein cluster with a higher domain matching score can be regarded as a positive instance with more confidence. In addition, the ratio of positive instances increases more markedly when the E value becomes smaller. This is reasonable, since the protein sequences found under a small E value are more similar than those found under a large E value. Therefore, the matching scores that come from strongly similar protein sequences are more reliable.

Figure 9. Ratio of positive instances with different domain matching scores.

Table III. Numbers of annotations inferred from BLAST and from the proposed system (TP and FP counts for signature and keyword annotations at E values 1E-21, 1E-6 and 1).

Table IV. Comparison of time complexity (minutes per sequence for InterProScan and the proposed system, by sequence length).

E. Comparison with BLAST

There are 1043 known protein signatures and 2379 known keywords in our data set. Table III shows the numbers of annotations (signature and keyword) inferred from BLAST and from the proposed system, respectively. TP denotes the number of true positive annotations and FP denotes the number of false positive annotations. Under the different E values, BLAST finds more true positive annotations than the proposed system in both annotation systems. However, the number of false positive annotations found by BLAST increases rapidly as the E value rises. The proposed system shows a good ability to eliminate erroneous annotations. It provides more precise and more stable annotation results, at the cost of losing some annotations.

F. Comparison with InterProScan

In this section, we compare the time complexity and the ability to recognize signature protein clusters of the proposed system with those of InterProScan [20]. First, we test the time complexity of both systems under different sequence lengths. The experimental results are shown in Table IV. Clearly, the proposed system is much faster than InterProScan. In addition, we run InterProScan on the same data set. Its precision and recall rates are 95.48% and 97.12%, respectively. When the E value is set to 1E-21, the precision and recall rates of the proposed system are 93.03% and 93.90%, respectively. InterProScan is more accurate than the proposed system, but the precision and recall rates of both systems stay above 93%. This shows a tradeoff between the speed and the accuracy of the prediction results.

Figure 10. Numbers of GO annotations that agree with SGD and that are the same as in SGD.

G. Agreement of GO annotations with SGD

The GO annotations of input protein sequences are obtained from usable protein clusters by searching GO mapping tables. The GO annotations produced by the proposed system do not agree completely with SGD. Moreover, the GO annotations inferred from the different annotation systems also show a certain degree of discrepancy. Here, agree means that the GO annotations from the proposed system and from SGD are the same, or that the GO function annotations from the proposed system are ancestor functions of the GO annotations in SGD. Figure 10 shows the numbers of GO annotations from the proposed system under E value 1E-21 that agree with SGD and that are the same as in SGD. Most of the GO annotations from the proposed system agree with SGD. This implies that the proposed system annotates at a higher (more general) level in GO. This is reasonable because InterPro and SwissProt try to annotate a group of proteins in the same cluster. The common function within one InterPro signature or SwissProt keyword is usually on an upper level of GO, i.e. more general functions are obtained. In addition, more GO annotations can be predicted when both signatures and keywords are applied. Since different annotation systems have their own concepts for classifying proteins, we believe that some lost functions can be found when more annotation systems are included.

IV. CONCLUSION

The proposed system provides a fast and automatic annotation method for protein sequences. It integrates many types of biological data to provide multiple annotations and maintains high precision and recall under different E values. Most of the erroneous annotations inferred from the original sequence similarity searches can be eliminated by our method. The system effectively reduces the need for human reconfirmation and accelerates annotation. Some annotations might be lost in our system as a side effect of the protein cluster selection process. However, since the annotation systems complement each other, these lost annotations might still be discovered through other annotation systems. The system is well suited to assigning functions to new protein sequences when experimental data are not available. It can provide users with a comprehensive function description of given protein sequences. Furthermore, since the system is flexible and extensible, it can be fine-tuned according to the characteristics of each annotation system.

ACKNOWLEDGMENT

This work was supported in part by the National Digital Archive Program-Research & Development of Technology Division (NDAP-R&DTD), the National Science Council of Taiwan under grant NSC E , NSC H , NSC H , and also by the Taiwan Information Security Center (TWISC), the National Science Council under grant NSC P , NSC P Y.

REFERENCES

[1] T.F.
Smith, Functional genomics bioinformatics is ready for the challenge, Trends Genet., Vol. 14, 1998, pp
[2] S. Lewis, M. Ashburner and M.G. Reese, Annotating eukaryote genomes, Curr. Opin. Struct. Biol., Vol. 10, 2000, pp
[3] M. Kanehisa and P. Bork, Bioinformatics in post-sequence era, Nature Genetics, Vol. 33, 2003, pp
[4] D. Eisenberg, E.M. Marcotte, I. Xenarios and T.O. Yeates, Protein function in the post-genomic era, Nature, Vol. 405, 2000, pp
[5] D.B. Kell and R.D. King, On the optimization of classes for the assignment of unidentified reading frames in functional genomics programmes: the need for machine learning, Trends Biotechnol., Vol. 18, 2000, pp
[6] W.R. Pearson and D.J. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. USA, Vol. 85, 1988, pp
[7] S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, Basic local alignment search tool, J. Mol. Biol., Vol. 215, 1990, pp
[8] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., Vol. 25, 1997, pp
[9] D. Devos and A. Valencia, Intrinsic errors in genome annotation, Trends in Genetics, Vol. 17, 2001, pp
[10] M. Chicurel, Bioinformatics: Bringing it all together, Nature, Vol. 419, 2002, pp
[11] B. Boeckmann, A. Bairoch, R. Apweiler, M.C. Blatter, A. Estreicher, E. Gasteiger, M.J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout and M. Schneider, The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003, Nucleic Acids Res., Vol. 31, 2003, pp
[12] The InterPro Consortium, InterPro - an integrated documentation resource for protein families, domains and functional sites, Bioinformatics, Vol. 16, 2000, pp
[13] The Gene Ontology Consortium, The gene ontology (GO) database and informatics resource, Nucleic Acids Res., Vol. 32, 2004, pp
[14] N.Y. Chuang and H.M. Lee, ESTFastAnnotator: EST function annotation by protein cluster selection, Proc. of 9th TAAI Conference, Taipei, Taiwan, December
[15] C.J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. 2, 1998, pp
[16] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York,
[17] H.W. Mewes, K. Albermann, M. Bähr, D. Frishman, A. Gleissner, J. Hani, K. Heumann, K. Kleine, A. Maierl, S.G. Oliver, F. Pfeiffer and A. Zollner, Overview of the yeast genome, Nature, Vol. 387, 1997, pp 7-8.
[18] S.S. Dwight, R. Balakrishnan, K.R. Christie, M.C. Costanzo, K. Dolinski, S.R. Engel, B. Feierbach, D.G. Fisk, J. Hirschman, E.L. Hong, L. Issel-Tarver, R.S. Nash, A. Sethuraman, B. Starr, C.L. Theesfeld, R. Andrada, G. Binkley, Q. Dong, C. Lane, M. Schroeder, S. Weng, D. Botstein and J.M. Cherry, Saccharomyces genome database: underlying principles and organization, Brief. Bioinform., Vol. 5, 2004, pp
[19] C.C. Chang and C.J. Lin, LIBSVM: a library for support vector machines. Software available at /~cjlin/libsvm,
[20] E.M. Zdobnov and R. Apweiler, InterProScan - an integration platform for the signature-recognition method in InterPro, Bioinformatics, Vol. 17, 2001, pp