Viewing the Proteome from Oligopeptides and Prediction of Protein Function

Genome Informatics 16(2) (2005)

Hisayuki Horai (1,2), Kouichi Doi (1), Hirofumi Doi (1,2)
(1) Nara Institute of Science and Technology, Takayama-cho, Ikoma, Nara 63-92, Japan
(2) Celestar Lexico-Sciences, Inc., MTG D7, -3 Nakase, Mihama-ku, Chiba 26-85, Japan

Abstract

Building a lexicon of relatively short oligopeptides has been one of our first steps toward viewing the proteome from the perspective of oligopeptides. As an application of the lexicon, we propose a new method for predicting protein function, specifically Gene Ontology terms (GO terms), based on statistical characteristics of oligopeptides. In the lexicon, each known function of a protein is inherited by its oligopeptides, and the correspondence between oligopeptides and functions is calculated over all proteins. Our method then uses this correspondence to predict unknown functions of proteins automatically. We measured the prediction performance with recall-precision graphs for several GO terms, using the 28,52 whole human proteins registered in RefSeq. The GO terms are membrane, nucleus, ATP binding, hydrolase activity, GTP binding, intracellular signaling cascade and ubiquitin cycle. In most cases, the method scores 7% recall with 8% precision. The prediction for ATP binding and GTP binding performs especially well, scoring 8% recall with 8% precision. Even in the worst case (ubiquitin cycle), it scores 62.6% recall with 8% precision. These results suggest that the proposed method is quite efficient for predicting GO terms.

Keywords: oligopeptide, prediction, protein function, proteome

1 Introduction

Proteomics is the large-scale study of proteins, particularly their structures, functions and existence. Identifying the function of each newly determined sequence by means of bioinformatics techniques is one of the most important problems in proteomics [5, 6].
Although the function of a protein can be described across a wide spectrum and predicted from different properties, we focus on the prediction of Gene Ontology (GO) terms [1] from sequence. Existing methods for this problem are based on homology search [6] or pattern matching [5]. Methods based on homology search focus on the similarity of a relatively long subsequence or of the full-length sequence. In many cases, a protein is annotated with several GO terms. When a new protein is homologous to such a multifunctional protein, it is difficult to decide which of those GO terms should be annotated to it. Consequently, homology search has to be followed by further investigation of the protein at the level of shorter subsequences. Methods based on pattern matching focus on the similarity of a relatively short subsequence. They are conservative: they rely on similarity to clearly defined protein families whose members are annotated with GO terms. It is difficult to predict all GO terms in this way because many GO terms have not yet been related to any family.

An oligopeptide is a subsequence of fixed length. For example, in the 28,52 whole human proteins registered in RefSeq (Reference Sequence of the National Center for Biotechnology Information, dated 3-May-25), there exist 2,36,75 kinds of oligopeptides of length 5. The occurrence of oligopeptides shows quite interesting characteristics [2]: (1) some oligopeptides occur commonly in many proteins while others occur unevenly; (2) some oligopeptides occur far more often than expected from the occurrence probability of their component amino acids; and (3) many oligopeptides do not occur in any protein (specificity of oligopeptide). Viewing the world of proteins from the perspective of oligopeptides will therefore provide a new computational science of proteomics.

[Figure 1: Characteristic of oligopeptide. A protein sequence is cut into overlapping oligopeptides of length 5; an oligopeptide that occurs twice (ADIKT in the example) is kept only once in the set of Characteristic Oligopeptides.]

As one of the first steps of such computational proteomics, we propose a new method based on the lexicon of oligopeptides that we have put much effort into constructing [2]. In our method, each protein is characterised by the oligopeptides it contains. In this paper we use oligopeptides of length 5, but this length is not mandatory. Our method relies on the co-occurrence of each oligopeptide among proteins and on the specificity of oligopeptides. The longer the oligopeptide, the less often each oligopeptide co-occurs, whereas, according to [5], the specificity of oligopeptides is not strongly observed for lengths below 5.

Our method predicts the GO terms annotated to a protein from a set of proteins that are already annotated, called Annotated Proteins here. Every Annotated Protein is divided into the set of its oligopeptides, and each GO term annotated to the protein is regarded as related to all of its oligopeptides.
Finally, the correspondence between oligopeptides and GO terms in the Annotated Proteins is calculated. The correspondence between an oligopeptide and a GO term is the number of proteins that contain the oligopeptide and are annotated with the GO term. This correspondence is uniquely defined for each set of Annotated Proteins and is stored in a matrix, the PepGO Matrix. The correspondence between a new protein and each GO term is then calculated from all oligopeptides in the protein and the PepGO Matrix.

To evaluate the prediction performance, we carried out several experiments. In the evaluation we use measures from information retrieval research, namely recall, precision and f-measure [4]. These measures are well suited to evaluating a score-based prediction method, and our method can be regarded as score-based, with the correspondence as the score. For a score-based method, the global behaviour of the performance over varying score thresholds is more important than the best performance at one specific threshold. This global behaviour is usually shown in a recall-precision graph.

2 Method

2.1 Characteristic Oligopeptide

In our method, each protein is characterised by a set of oligopeptides, called its Characteristic Oligopeptides. The oligopeptide length is an arbitrary fixed number n. The Characteristic Oligopeptides of a protein are the set (without duplication) of all oligopeptides that occur in the protein. When the length of a protein is m, the number of Characteristic Oligopeptides is less than or equal to m - n + 1. If no oligopeptide occurs more than once in the protein, the number of its Characteristic Oligopeptides is equal to m - n + 1. Figure 1 shows a simplified example. In this example, n is 5, m is 2, oligopeptides duplicate once, and 6 Characteristic Oligopeptides are obtained.

2.2 PepGO Matrix

The PepGO Matrix is a non-negative integer matrix. Each row corresponds to an oligopeptide and each column (called a line in the following) corresponds to a GO term. Each cell holds the number of Annotated Proteins that contain the corresponding oligopeptide and are annotated with the corresponding GO term. The order of the rows and lines is arbitrary, because our method does not consider relations among oligopeptides or relations among GO terms, but only the correspondence between oligopeptides and GO terms. In this subsection, we explain how to generate the PepGO Matrix MAT(Ps) from a set of Annotated Proteins Ps.

[Figure 2: Matrix generated from a protein. Protein P, annotated with GO terms T1, T2 and T3 in RefSeq, yields a matrix whose rows are its Characteristic Oligopeptides and whose lines are T1, T2 and T3.]

[Figure 3: PepGO matrix. The matrices generated from protein P (GO terms T1, T2, T3) and protein P2 (GO terms T2, T4) are accumulated into a single PepGO Matrix with lines T1, T2, T3 and T4.]
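The definition of Characteristic Oligopeptides in Section 2.1 can be sketched in a few lines of Python. This is only an illustration: the function name and the toy sequence are assumptions (the sequence was chosen to be consistent with the oligopeptides listed for Figure 1, not copied from the paper), and the sketch is not the authors' implementation.

```python
def characteristic_oligopeptides(sequence: str, n: int = 5) -> set[str]:
    """Return the set of distinct length-n windows of `sequence`.

    A sequence of length m has m - n + 1 windows; the set can be smaller
    when an oligopeptide occurs more than once.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # range() is empty when the sequence is shorter than n, giving an empty set.
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


# Hypothetical toy sequence (an assumption made for this sketch).
toy = "MADIKTGIFADIKTKRLNRA"
ops = characteristic_oligopeptides(toy)

# Number of windows versus number of distinct oligopeptides.
print(len(toy) - 5 + 1, len(ops))
```

Using a set makes the "without duplication" requirement automatic, and the upper bound m - n + 1 follows from the number of window start positions.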
We start by generating a PepGO Matrix from each Annotated Protein. Consider an Annotated Protein P, an element of Ps, annotated with the GO terms GO(P). Let the Characteristic Oligopeptides of P and the PepGO Matrix generated from P be OP(P) and Mat(P), respectively. The rows and the lines (columns) of Mat(P) correspond to OP(P) and GO(P), respectively. The cells of Mat(P) are all one. Mat(P) thus states that every oligopeptide in P corresponds to every GO term annotated to P. Figure 2 shows a simplified example using the protein in Figure 1. In this example, P is annotated with 3 GO terms, T1, T2 and T3.

After generating a PepGO Matrix from every Annotated Protein one by one, we accumulate them into a single PepGO Matrix MAT(Ps). The rows and the lines of MAT(Ps) are the set unions of the rows and of the lines of all the individual PepGO Matrices, respectively. The orders of the rows and the lines are arbitrary. The value of every cell of every individual PepGO Matrix, i.e. one, is added to the corresponding cell of MAT(Ps). As a result, a cell of MAT(Ps) holds the number of proteins in Ps that contain the corresponding oligopeptide and are annotated with the corresponding GO term. When an oligopeptide is not related to a GO term in any individual PepGO Matrix, the corresponding cell of MAT(Ps) is zero. Figure 3 shows a simplified example. In this example, Ps consists of only two proteins, P and P2. P is the protein in Figures 1 and 2. The numbers of OP(P), GO(P), OP(P2) and GO(P2) are 5, 3, 5 and 2, respectively. Some oligopeptides and GO terms occur in more than one Annotated Protein, e.g. oligopeptide DIKIG and GO term T2 in Figure 3, and others occur only in one specific Annotated Protein.

We define some notation concerning the PepGO Matrix for the rest of this paper. MAT(Ps).cell(op, t) denotes the cell of MAT(Ps) corresponding to oligopeptide op and GO term t. MAT(Ps).row(op) denotes the sum of the cells in the row of MAT(Ps) corresponding to oligopeptide op.

2.3 Prediction of GO Term

In our method, predicting a GO term means calculating the correspondence between each protein and each GO term by means of MAT(Ps). For a given protein X and GO term T, let their correspondence calculated by means of MAT(Ps) be Cor(X, T, Ps). First, for every element of OP(X), i.e. every Characteristic Oligopeptide of X, the correspondence between that oligopeptide and T is calculated. Let the correspondence between oligopeptide op and GO term T be cor(op, T, Ps). We define cor(op, T, Ps) as MAT(Ps).cell(op, T) / MAT(Ps).row(op), because op can be related to several GO terms and the specificity of op for T should be taken into account. When op is related to many GO terms, cor(op, T, Ps) becomes relatively smaller than when op is related to only a few GO terms, even if MAT(Ps).cell(op, T) is large. Figure 4 shows a simplified example of calculating cor(op, T, Ps). In this example, MAT(Ps) is the PepGO Matrix in Figure 3, op is the oligopeptide marked in Figure 4 and T is T2. MAT(Ps).cell(op, T) and MAT(Ps).row(op) are 2 and 5, respectively, so cor(op, T, Ps) becomes 2/5. After the correspondence has been calculated for all Characteristic Oligopeptides of X one by one, Cor(X, T, Ps) is calculated from the results.
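Written out with the notation of Section 2.2, the oligopeptide-level correspondence is the cell value normalised by the row sum. The symbol \mathcal{T} for the set of GO terms (lines) of MAT(Ps) is introduced here only for illustration and does not appear in the original text.

```latex
\[
\mathrm{cor}(op, T, Ps)
  = \frac{MAT(Ps).\mathrm{cell}(op, T)}{MAT(Ps).\mathrm{row}(op)},
\qquad
MAT(Ps).\mathrm{row}(op)
  = \sum_{t \in \mathcal{T}} MAT(Ps).\mathrm{cell}(op, t).
\]
```

With the values quoted for the Figure 4 example (cell value 2, row sum 5), this gives cor(op, T, Ps) = 2/5 = 0.4.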
Fundamentally, we consider the correspondence between a protein and a GO term to be positively correlated with the sum of the correspondences between the GO term and all Characteristic Oligopeptides of the protein. However, protein lengths differ, and the length has a strong positive effect on this sum. Because we also intend to compare the correspondence of a GO term among proteins, we define Cor(X, T, Ps) as the sum of cor(op, T, Ps) over every element op of OP(X), divided by the number of OP(X). Figure 5 shows a simplified example. In this example, MAT(Ps) is the PepGO Matrix in Figures 3 and 4. The number of OP(X) is 5. Four of the oligopeptides of OP(X), all except KTGIA, appear in MAT(Ps). Cor(X, T, Ps) becomes 7.

[Figure 4: Correspondence between oligopeptide and GO term. For GO term T2 and one oligopeptide op, cor(op, T, Ps) is computed from the corresponding row of the PepGO Matrix MAT(Ps).]

[Figure 5: Correspondence between protein and GO term. For protein X and GO term T3, the values cor(op, T, Ps) of the Characteristic Oligopeptides of X are summed and divided by 5.]
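The construction of the PepGO Matrix and the two correspondence scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are invented for the sketch, the three toy sequences are hypothetical (chosen to be consistent with the oligopeptides listed for Figures 3 to 5), and an oligopeptide that is absent from the matrix is assumed to contribute 0.

```python
from collections import defaultdict


def oligopeptides(sequence, n=5):
    """Characteristic Oligopeptides: distinct length-n windows."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def build_pepgo(annotated, n=5):
    """Accumulate a PepGO Matrix from (sequence, go_terms) pairs.

    matrix[op][term] counts the proteins that contain `op` and are
    annotated with `term`; each protein adds at most 1 per cell.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for sequence, terms in annotated:
        for op in oligopeptides(sequence, n):
            for term in terms:
                matrix[op][term] += 1
    return matrix


def oligo_cor(matrix, op, term):
    """cor(op, T, Ps): cell value divided by the row sum."""
    row = matrix.get(op)
    if not row:
        return 0.0  # assumption: an unseen oligopeptide contributes 0
    return row.get(term, 0) / sum(row.values())


def protein_cor(matrix, sequence, term, n=5):
    """Cor(X, T, Ps): mean of cor over the Characteristic Oligopeptides of X."""
    ops = oligopeptides(sequence, n)
    if not ops:
        return 0.0
    return sum(oligo_cor(matrix, op, term) for op in ops) / len(ops)


# Hypothetical toy data (assumptions made for this sketch).
annotated = [
    ("MADIKTGIFADIKTKRLNRA", {"T1", "T2", "T3"}),
    ("MCDIKTGAA", {"T2", "T4"}),
]
mat = build_pepgo(annotated)
print(oligo_cor(mat, "DIKTG", "T2"))
print(protein_cor(mat, "MCDIKTGIA", "T3"))
```

Dividing by the row sum down-weights oligopeptides shared by many GO terms, and dividing by the number of Characteristic Oligopeptides keeps scores comparable between short and long proteins, as argued in the text.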

[Figure 6: Protein, oligopeptide and GO term in experiments. The Whole Proteins are the RefSeq human proteins without the letters B, J, O, U, X and Z (28,52); 2,36,75 kinds of oligopeptide and 4,488 kinds of GO term (,924 protein-GO term pairs) are extracted from their database entries.]

3 Results

3.1 Outline of Experiment

We carried out several experiments to evaluate our method. First, we built a set of proteins from the Reference Sequence (RefSeq) database [3] of the National Center for Biotechnology Information dated 3-May-25. This set includes all human proteins in RefSeq whose sequences are completely known, i.e. which contain none of B, J, O, U, X and Z. The number of these human proteins, called the Whole Proteins, is 28,52. Figure 6 shows the Whole Proteins, the oligopeptides of length 5 and the GO terms used in our experiments. 2,36,75 kinds of oligopeptide of length 5 are extracted from the Whole Proteins. The GO terms of each RefSeq protein are given in the database record of the protein. 4,488 kinds of GO term and ,924 pairs of protein and GO term were extracted from the Whole Proteins.

Each experiment evaluates the prediction performance for a given GO term T. We chose GO terms that are annotated to relatively many proteins. For every protein X of the 28,52 Whole Proteins, we divide the Whole Proteins into X and the remainder Ps (28,59 proteins) and calculate Cor(X, T, Ps), i.e. the correspondence between X and T calculated by means of the other proteins. The resulting correspondence depends only on the proteins other than X; X itself does not affect it. After this calculation for all Whole Proteins, we characterise the prediction performance with measures from information retrieval research, namely precision (or accuracy), recall (or sensitivity) and f-measure (i.e. the harmonic mean of precision and recall), as follows.
P = TP / (TP + FP)
R = TP / (TP + FN)
F-measure = 2PR / (P + R)

where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.

To draw the recall-precision graph for a GO term, we first sort the Whole Proteins in descending order of Cor(X, T, Ps). For each i from 1 to the number of Whole Proteins, i.e. 28,52 in the experiments, we select the top i proteins. The top i proteins are regarded as a tentative prediction that uses Cor(the i-th protein, T, Ps) as the threshold. For each tentative prediction, we calculate precision, recall and f-measure, and plot a point in the recall-precision graph. In the end, 28,52 points are plotted in the recall-precision graph, and we connect these points. We also obtain the maximum f-measure.

3.2 Prediction of GO Component

We carried out experiments for two GO terms: membrane [goid 62] and nucleus [goid 5634]. The numbers of proteins annotated with membrane and nucleus are 2,549 and 4,28, respectively.

The recall-precision graph of the prediction for membrane is shown in Figure 7. [...] (6.7%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 95.4%, 86.% and 6.%, respectively. At 8% and 5% precision, the recall is 7.7% and 83.4%, respectively. The maximum f-measure is 57.

The recall-precision graph of the prediction for nucleus is shown in Figure 8. 4 (6.5%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 9%, 85.2% and 6%, respectively. At 8% and 5% precision, the recall is 7 and 84.8%, respectively. The maximum f-measure is [...].

[Figure 7: Recall-precision graph for membrane.]

[Figure 8: Recall-precision graph for nucleus.]

3.3 Prediction of GO Function

We carried out experiments for three GO terms: ATP binding [goid 5524], hydrolase activity [goid 6787] and GTP binding [goid 5525]. The numbers of proteins annotated with ATP binding, hydrolase activity and GTP binding are ,655, ,2 and 47, respectively.

The recall-precision graph of the prediction for ATP binding is shown in Figure 9. [...] (2.%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 97.%, 95.5% and 79.4%, respectively.
For ATP binding, at 8% and 5% precision the recall is 8.% and 62.2%, respectively. The maximum f-measure is 4.

The recall-precision graph of the prediction for hydrolase activity is shown in Figure 10. 44 (3.%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 95.5%, 86.3% and 35.%, respectively. At 8% and 5% precision, the recall is 69.6% and 7.%, respectively. The maximum f-measure is 5.

The recall-precision graph of the prediction for GTP binding is shown in Figure 11. 5 (2.5%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 87.9%, 85.6% and 83.8%, respectively. At 8% and 5% precision, the recall is 82.% and 89.9%, respectively. The maximum f-measure is [...].

[Figure 9: Recall-precision graph for ATP binding.]

[Figure 10: Recall-precision graph for hydrolase activity.]

[Figure 11: Recall-precision graph for GTP binding.]

3.4 Prediction of GO Process

We carried out experiments for two GO terms: intracellular signaling cascade [goid 7242] and ubiquitin cycle [goid 652]. The numbers of proteins annotated with intracellular signaling cascade and ubiquitin cycle are 492 and 334, respectively.

The recall-precision graph of the prediction for intracellular signaling cascade is shown in Figure 12. [...] (8.7%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 95.%, 87.5% and 63.5%, respectively. At 8% and 5% precision, the recall is 73.6% and 8.3%, respectively. The maximum f-measure is 69.

The recall-precision graph of the prediction for ubiquitin cycle is shown in Figure 13. [...] (9.8%) proteins annotated with the GO term are predictable without false prediction. At 5%, 66% and 8% recall, the precision is 95.5%, 85.5% and 35.%, respectively. At 8% and 5% precision, the recall is 62.6% and 76.%, respectively. The maximum f-measure is 5.
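The threshold sweep of Section 3.1, which produces the recall-precision graphs and the maximum f-measure reported above, can be sketched as below. The function name and the four-protein toy input are assumptions made for illustration; in the experiments the scores would be the leave-one-out values Cor(X, T, Ps). Tied scores are simply kept in input order here, which is a simplification of using the i-th score as a threshold.

```python
def recall_precision_points(scores, labels):
    """Sweep the cut-off over the ranked list of proteins.

    scores: one prediction score per protein for a fixed GO term.
    labels: True where the protein is really annotated with the term.
    Returns one (recall, precision, f_measure) tuple per cut-off i = 1..N,
    where the top i proteins are treated as the tentative prediction.
    """
    total_positive = sum(1 for flag in labels if flag)
    if total_positive == 0:
        raise ValueError("at least one positive protein is required")

    # Rank proteins by descending score (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    points = []
    true_positive = 0
    for rank, index in enumerate(order, start=1):
        if labels[index]:
            true_positive += 1
        precision = true_positive / rank            # TP / (TP + FP)
        recall = true_positive / total_positive     # TP / (TP + FN)
        if true_positive:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        points.append((recall, precision, f_measure))
    return points


# Hypothetical toy input (an assumption made for this sketch).
points = recall_precision_points(
    [0.9, 0.8, 0.7, 0.6],
    [True, False, True, False],
)
max_f = max(f for _, _, f in points)
print(points[0], max_f)
```

Connecting the returned (recall, precision) pairs in order gives the recall-precision graph, and the largest third component is the maximum f-measure.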

[Figure 12: Recall-precision graph for intracellular signaling cascade.]

[Figure 13: Recall-precision graph for ubiquitin cycle.]

4 Conclusion

We proposed a new method that predicts functions, specifically GO terms, of a protein from statistical characteristics of oligopeptides. To evaluate the method, we carried out several experiments using the known annotation in RefSeq.

In the experiments for GO component terms, the method scores 7% recall with 8% precision, and the maximum f-measure is 5. The results suggest that our method is efficient for predicting GO component terms. In the experiments for GO function terms, ATP binding and GTP binding score 8% recall with 8% precision, and the maximum f-measure is greater than [...]. The results suggest that our method is quite efficient for predicting these GO function terms. In contrast, the prediction performance for GO process terms is less clear-cut. The results for intracellular signaling cascade are almost as favourable, while ubiquitin cycle scores lower than the others (62.6% recall with 8% precision and f-measure = 5). We consider one of the reasons to be that far fewer proteins are annotated with this term than with the other GO terms.

Across all experiments, every recall-precision graph has the following shape: it holds horizontally at relatively high precision from zero recall up to a certain value of recall, and then slants toward the lower right corner. The prediction performance mainly depends on the length of the horizontal part. GTP binding (see Figure 11) is an excellent instance. In a usual information retrieval system, this horizontal part is so short that the prediction performance is quite low. By the standards common in information retrieval, the experiments therefore suggest excellent performance of our method.

Furthermore, the difference between ATP binding and GTP binding can be explained by the way correspondence is calculated in our method.
Because the correspondence calculation is normalised by the number of oligopeptides in the protein, which is proportional to its length, each oligopeptide has a larger impact in a short protein than in a long one and makes the characteristics of the prediction performance appear more clearly. Because proteins annotated with GTP binding (approx. 46 amino acids on average) are much shorter than those annotated with ATP binding (approx. 9 amino acids on average), GTP binding gives better results than ATP binding.

We mentioned in the Introduction that existing methods for this problem are based on homology search and pattern matching. Comparing our method with these methods remains future work. The results in this paper suggest that our method is fundamentally quite effective for predicting a variety of GO terms, while they also show that its applicability differs among GO terms. Further investigation of oligopeptides and improvement of the method are needed. Future work includes investigating the individual oligopeptides that have a predominant positive or negative impact on prediction.

Acknowledgments

The authors would like to thank Dr. Tomohiro Mitsumori, a postdoctoral fellow of NAIST, for expert computational assistance and useful discussion.

References

[1] Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., Harris, M., Hill, D., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., Richardson, J., Ringwald, M., Rubin, G., and Sherlock, G., Gene ontology: Tool for the unification of biology, Nat. Genet., 25:25-29, 2000.
[2] Doi, H., Kitajima, M., Watanabe, I., Kikuchi, Y., Matsuzawa, F., Aikawa, S., Takiguchi, K., and Ohno, S., Diverse incidences of individual oligopeptides (dipeptidic to hexapeptidic) in proteins of human, baker's yeast, and Escherichia coli origin registered in the Swiss-Prot data base, Proc. Natl. Acad. Sci. USA, 92(7), 1995.
[3] Pruitt, K., Tatusova, T., and Maglott, D., NCBI Reference Sequence (RefSeq): A curated nonredundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res., 33:D501-D504, 2005.
[4] Salton, G., Automatic Text Processing - The Transformation, Analysis, and Retrieval of Information by Computer, Addison Wesley, 1989.
[5] Schug, J., Diskin, S., Mazzarelli, J., Brunk, B., and Stoeckert, C., Predicting gene ontology functions from ProDom and CDD protein domains, Genome Res., 12(4), 2002.
[6] Zehetner, G., OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms, Nucleic Acids Res., 31, 2003.
