SVM-based Method for Predicting Enzyme Function in a Hierarchical Context

Size: px
Start display at page:

Download "SVM-based Method for Predicting Enzyme Function in a Hierarchical Context"

Transcription

1 The Fourth Internationa Conference on Computationa Systems Bioogy (ISB200) Suzhou, China, September 9, 200 Copyright 200 ORSC & APORC, pp SVM-based Method for Predicting Enzyme Function in a Hierarchica Context Yong-Cui Wang Zhi-Xia Yang 2 Nai-Yang Deng, Coege of Science, China Agricutura University, Beijing 00083, China 2 Coege of Mathematics and System Science, Xinjiang University, Urumuchi , China Abstract Automaticay categorizing enzyme into the Enzyme Commission (EC) hierarchy is crucia to understand its specific moecuar mechanism. Standard machine earning methods ike support vector machine (SVM) and naïve bayesian cassifier have been successfuy appied for this task. However, they treat each functiona cass independenty, and ignore the inter-cass reationships. In this paper, we deveop a SVM-based method for prediction of enzyme function into the EC hierarchica context. Our method with ow computationa compexity is a modified version of a structured predictive mode Hierarchica Max-Margin Markov agorithm (HM 3 ). HM 3, which is speciay designed for the hierarchica muti-abe cassification, has been successfuy used in many structured pattern recognition probems, such as document categorization, web contend cassification, and enzyme function prediction. As input features for our predictive mode, we use the conjoint triad feature (CTF). Our method has been vaidated on an enzyme benchmark dataset, the proteins in this benchmark dataset have ess than 40% sequence identity to any other in a same functiona cass. Finay, for the first three EC digits, the predictive accuracy and the Matthew s correation coefficient (MCC) of our method range from 78% to 00% and 0.76 to respectivey. Therefore we think our new method wi be usefu suppementary toos for the future studies in enzyme function prediction. Keywords: Enzyme function prediction; Conjoint triad feature; Structured hierarchica output; Support vector machine. Introduction About haf of a the proteins have been characterized as function of enzymatic activity by various biochemica experiments. In addition, accurate assignment of enzyme function is a prerequisite of high-quaity metaboic reconstruction and the anaysis of metaboic fuxes []. It shoud be noted that, the Enzyme Commission (EC) number is comprised of four digits separated by periods to cassify the functions of enzymes [2]. The first three digits describe the overa type of an enzymatic reaction, whie the ast digit represents the substrate specificity of the catayzed reaction. Since the first three digits have the direct reationship with enzyme function, it is in pressing need to deveop computationa methods to accuratey output the first three EC digits for a given protein. Corresponding author. E-mai: dengnaiyang@cau.edu.cn

2 20 The 4th Internationa Conference on Computationa Systems Bioogy The straightforward idea is to transfer an EC number between two gobay aigned protein sequences. The accuracy of the direct inference has been reported to significanty drop under 60% sequence identity [3]. One of the remedies is to deveop computationa methods to automaticay categorize enzyme into EC hierarchy. Some works concentrate in predicting the top ever of the EC taxonomy, and then outputting the second and third digits of EC number. For exampe, Chou et.a [4] deveoped a top-down approach for predicting enzyme functiona casses and sub-casses, and the overa accuracy for the first two ayers is higher than %. However, they treated functiona cass independenty, and ignore the inter-cass reationships. To address this imitation, some works aim at the structured prediction of enzyme functions in the whoe EC hierarchica taxonomy. For exampe, Astikainen et.a. [5] introduced the Hierarchica Max-Margin Markov (HM 3 ) to output the hierarchica enzyme function, and obtained over 85% accuracy on the four digits EC number. HM 3, which is speciay designed for the hierarchica muti-abe cassification, generaizes support vector machine (SVM) earning and that is based on discriminant functions that are structured in a way that mirrors the cass hierarchy [7]. It has been successfuy used in many structured pattern recognition probems, such as document categorization [7], web contend cassification [8] and so on. However, the HM 3 is time consuming, and unsuitabe for the arge predictive task. For deaing with the computationa chaenge of the HM 3, in this artice, we reformuate the mode of HM 3, and make the number of the variabe decrease to the number of the training sampes. Another key probem in predicting enzyme function is to encode a protein as a reavaue vector. In previous studies, the amino acid composition (AAC) representation has been widey utiized in predicting enzyme famiy and subfamiy cass [9]. Owing to ack of the sequence order information, some modified versions of AAC, such as pseudo amino acid composition (Pse-AAC) [0] and amphiphiic pseudo-amino acid composition (Am- Pse-AAC) [] have been deveoped. However, both Pse-AAC and Am-Pse-AAC have some parameters to be determined, and need the properties of physio-chemistry of amino acids. Recenty, a much simpe feature encoding method for protein-protein interactions (PPIs) prediction was proposed [2]. The authors have shown that SVM with the conjoint triad feature (CTF) outperformed other sequence-based PPI prediction methods. The CTF considers not ony properties of one amino acid but aso its vicina amino acids and treats any three continuous amino acids as an unit. That is, it contains not ony the composition of amino acids but aso sequence-order effect. It has aso successfuy been used in prediction of DNA- and RNA-binding proteins [3]. Inspired by these, in this paper, we introduce the CTF into our structured predictive mode, and the better resuts are expected. The paper is structured as foows. In Materias and Methods section, we give the depiction for the CTF and benchmark dataset, and discuss the procedure for formuating the structure-based predictive mode In Resuts section, we compare our method with the existing methods. Lasty, we concude this paper. 2 Materias and Methods 2. Materias The benchmark dataset used to vaidate the performance of our method is coected from the iterature [4]. The sequences in this dataset have ess than 40% sequence identity to any other in a same functiona cass. The detai information of this dataset can be

3 SVM-based Method for Predicting Enzyme Function in a Hierarchica Context 2 Figure : The toy exampe for functiona hierarchica tree found in [4]. In addition, for avoiding the extreme subfamiy bias, those subfamiies which contain ess than 40 proteins are excuded in our vaidation. Finay there are six main functiona casses and thirty-four subfamiy casses (i.e., tweve for oxidoreductases, seven for transferases, five for hydroases, four for yases, four for isomerases and two for igases) in the benchmark dataset. 2.2 Methods 2.2. Input feature: CTF Now et us construct the CTF. Based on the dipoes and voumes of the side chains, the 20 amino acids can be cassified into seven casses: {A,G,V }, {I,L,F,P}, {Y,M,T,S}, {H,N,Q,W}, {R,K}, {D,E}, {C}. Thus, a (7 7 7 =)343-dimension vector is used to represent a given protein, where each eement of this vector is the frequency of the corresponding conjoint triad appearing in the protein sequence. The detai description for the CTF can be found in [2] Structured SVM Now we give the detai representation for our structured predictive mode. And we ca it as the structured SVM. Given exampes {x i,y i } for i =,,, where x i is a vector in the input space R n and y i denotes the corresponding cass category taking a vaue in the output space {,,q}, where q is the tota number of categories. In addition, the functiona hierarchica tree, representing the reationships among the categories, is aso known, where the number of eaf nodes in the tree is q. A tree with q = 3 is shown in Figure. Suppose that, there are a tota of s nodes in the hierarchica functiona tree (In Figure, s = 5). To predict enzyme functions in the whoe EC hierarchica taxonomy, we introduce a structured predictive mode HM 3 [6] firsty, and then modified it to reduce its computationa compexity. Introduce Λ(y) for output y by Λ(y) = (λ (y),,λ s (y)) T, ()

4 22 The 4th Internationa Conference on Computationa Systems Bioogy where {, i f j path(y); λ j (y) = 0, otherwise, where path(y) is the category tags for the path from the root to the eaf node y in the functiona hierarchica tree. For exampe, in Figure, path(2) = {2,5}. The prima and Lagrange dua optimization probem for HM 3 are and min α 2 min w,b,ξ 2 w 2 +C (2) ξ i, (3) s.t. (w δφ i (y)) ξ i, i =,,,y y i, (4) j= ξ i 0,δΦ i (y) = Φ(x i,y i ) Φ(x i,y), i =,,, (5) Y Y i,y Y j α iy α jy ((Λ(y i ) Λ(y)) (Λ(y j ) Λ(y )))K(x i,x j ) y y i α iy, s.t. α iy 0, i =,,,y y i, (7) α iy C, i =,,,y y i, y y i (8) respectivey, where w = (w T,,wT s ) T, φ : R n H is a mapping from the input space R n to a Hibert space H, and Φ(x i,y i ) = λ (y i ) φ(x i ). (9) λ s (y i ) φ(x i ). Noticing that the number of the variabes equas (q ), so the probem (6) (8) wi be quite arge when and q is arge. However, if we can find a abe y i y i, such that y i = arg y yi min{(w Φ(x i,y i )) (w Φ(x i,y))}, (0) the inequaity (4) wi become (w δφ i (y i )) ξ i, i =,,. So the Lagrange dua probem can be reformuated as foows: min α 2 j= α i α j ((Λ(y i ) Λ(y i )) (Λ(y j ) Λ(y j)))k(x i,x j ) α i, () s.t. α i 0, i =,,, (2) α i C, i =,,. (3) The number of the variabes for the probem () (3) decreases to. Note that, this probem is reduced to the standard SVM mode by repacing the kerne function with ((Λ(y i ) Λ(y i )) (Λ(y j) Λ(y j )))K(x i,x j ). How to choose a suitabe y i for each input x i? The foowing theorem may provide some information for this task. (6)

5 SVM-based Method for Predicting Enzyme Function in a Hierarchica Context 23 Theorem. Let k, is the arbitrary two cass abes in eaf nodes, x is a test datapoint, α iy,i =,,,y y i are the optima soutions of the probem (6) (8). Then for every pair of abe tags k and, we have that, where (w Φ(x,k)) (w Φ(x,)) 2 Λ(k) Λ() 2 (w j φ(x)) = s (w j φ(x)) 2, (4) j= y y i (λ j (y i ) λ j (y))α iy K(x i,x). (5) Proof: The concusion can be proved by using Cauchy inequaity. From the Theorem, we can see that the cass abes which beong to the same subfamiy of the test abe have the owest upper bound of (w Φ(x,y)) (w Φ(x,y )), compared to the other abe y y. So in this paper, y i is set to be one of the cass beonging to the same subfamiy of y i. 2.3 Impementing predictive mode and evauation criterions By repacing the kerne function with ((Λ(y i ) Λ(y i )) (Λ(y j) Λ(y j )))K(x i,x j ), the freey avaiabe software for SVM impementation: LIBSVM (v.2.88) [4], can be used for soving our structured SVM mode. The RBF kerne function is used here, and the penaty parameter C and the RBF kerne parameter γ are optimized by grid search approach with 3-fod cross-vaidation. To evauate the performance of proposed methods, we run the 0 fod cross-vaidation procedure. Besides for the accuracy, the Matthew s correation coefficient (MCC)[5], which aows us to overcome the shortcoming of accuracy on imbaanced data [6], is used to further evauate the performance of our method. 3 Resuts 3. The resuts on the benchmark dataset Here we woud ike to dispay the promising performance of the structured SVM by vaidating it on a benchmark dataset (see Materias and Methods section). For comparison, we aso train OET-KNN and AMSVM on the same benchmark dataset. OET-KNN (Optimized evidence-theoretic k nearest neighbor) cassifier was successfuy used in topdown predicting enzyme famiy and subfamiy cass [4], whie AMSVM (SVM with arithmetic mean offset) is a modified version of the standard SVM, which is speciay designed for imbaance cassification probem [7]. It shoud be noted that, predicting enzyme function is an imbaance muti-cass cassification probem due to the fact that the number of proteins in each enzyme famiy and subfamiy makes a great difference. Both OET-KNN and AMSVM predict the top eve of EC hierarchy firsty, and then dea with the second and third eve. OET-KNN ony reports the predictive accuracy on the first two EC digits, so the MCC and the corresponding resuts on the third EC digit for OET-KNN are not shown in the foowing subsection. The accuracies and MCCs for OET-KNN, AMSVM and the structured SVM in prediction of enzyme famiy casses are isted in Tabe. From Tabe, we can see that

6 24 The 4th Internationa Conference on Computationa Systems Bioogy Tabe : The accuracy and MCC of the current methods on the top eve cass in the EC hierarchy Famiy name OET-KNN AMSVM Structured SVM Accuracy MCC Accuracy MCC Accuracy MCC EC : Oxidoreductases EC2 : Trans f erases EC3 : Hydroases EC4 : Lyases EC5 : Isomerase EC6 : Ligases AMSVM speciay designed for imbaance probem performs we than OET-KNN, and the structured SVM outperforms AMSVM not ony on the accuracy but aso on the MCC for a six enzyme main famiies. These resuts suggest that the more properties of dataset itsef are incorporated into the predictive mode, the more better resuts can be expected. We use box pots to exhibit the variabiity of predictive accuracy for comparabe methods among enzyme subfamiy casses and sub-subfamiy casses (Figure 2,3). The box pots for the variabiity of MCC are shown in Figure 4. For the second EC digit (enzyme subfamiy cass), the accuracy of OET-KNN ranges from 50% to 00%, and AMSVM makes the accuracy range from 75% to 95%. Athough the median of accuracy for OET- KNN is arger than that for AMSVM, the difference between them is about 3% and the variation range of AMSVM is ess than that of OET-KNN. It impies that AMSVM is more stabe than OET-KNN, and have competitive performance of OET-KNN. Whie the minimum vaue of accuracy for the structured SVM is about 80%, and the maximum vaue reaches 00%. Furthermore, the structured SVM outperforms AMSVM and OET- KNN not ony on the range of accuracy waved but aso on the median of accuracy. For the third EC digit (enzyme sub-subfamiy cass), except for EC6 sub-subfamiy cass, the structured SVM outperforms AMSVM and OET-KNN not ony on the range of accuracy waved but aso on the median of accuracy. For both the second and the third EC digits, athough the ength of the MCC box for the structured SVM seems onger than AMSVM, but the structured SVM has a much higher MCC than AMSVM. A these resuts impy that the performance of the structured SVM is superior to those of the existing methods, which do not take the hierarchica structure of the dataset into account. 3.2 Comparison with other structure-based methods In [5], besides HM 3, the authors have been introduced another structure-based predictive mode Maximum Margin Regression agorithm (MMR) [8] to output enzyme function. The MMR generats one-cass SVM to structured output domain. It has the same the number of variabes as our structured SVM and can be impemented easiy. On a god standard enzyme dataset containing 30 proteins, for the a individua EC digits, the MMR achieved the best accuracy, whie HM 3 achieved the neary same resuts, and the HM 3 obtains the best F scores and MMR comes cose second. When predicting sub-subfamiies, the HM 3 obtained the over 79% F score, and obtained over 89%

7 SVM-based Method for Predicting Enzyme Function in a Hierarchica Context 25 The variance on accuracy for three method in EC The variance on accuracy for three method in EC2 The variance on accuracy for three methods in EC The variance on accuracy for three methods in EC4 The variance on accuracy for three methods in EC5 The variance on accuracy for three methods in EC Figure 2: The box pots for accuracy on enzyme subfamiy in EC,EC2,EC3,EC4,EC5,EC6 respectivey. The variance on accuracy for AMSVM in enzyme sub subfamiy Accuracy(\%) The variance on accuracy for Structured SVM in enzyme sub subfamiy EC EC2 EC3 EC4 EC5 EC6 70 EC EC2 EC3 EC4 EC5 EC6 Figure 3: The box pots for accuracy on enzyme sub-subfamiy. F score when predicting the subfamiies and the main famiies [5]. Like the MCC, F score can avoid the bias due to the imbaance probem. In this work, our structured SVM has obtained the over 80% F score in predicting enzyme sub-subfamiies, and the over % F score in predicting subfamiies and the main famiies. These resuts suggest that our structured SVM has the competitive performance with the HM 3, and may outperform other existing ow costy structure-based method, the MMR. 4 Concusion In this paper, we propose a structure-based method the structured SVM for predicting enzyme function in EC hierarchy. The structured SVM is a modified version of HM 3 and can be easiy impemented for the rea-word arge scae probem. The structured SVM has been vaidated on a benchmark enzyme dataset, where the best MCC and accuracy of 9% and % are obtained in predicting the main famiies. Furthermore, we obtain over 95% and % MCC in predicting subfamiies and sub-subfamiies respectivey.

8 26 The 4th Internationa Conference on Computationa Systems Bioogy The variance on MCC for AMSVM in enzyme subfamiy The varince on MCC for Structured SVM in enzyme subfamiy MCC MCC EC EC2 EC3 EC4 EC5 EC EC EC2 EC3 EC4 EC5 EC6 MCC The variance on MCC for AMSVM in enzyme sub subfamiy EC! EC2 EC3 EC4 EC5 EC6 MCC The vaiance on MCC for Structured SVM in enzyme sub subfamiy EC EC2 EC3 EC4 EC5 EC6 Figure 4: The box pots for MCC on enzyme subfamiy (the top) and sub-subfamiy (the down) respectivey. The predictive resuts suggest that for predicting the enzyme function in EC hierarchica taxonomy, the structured SVM performs much better than existing methods which do not take account of hierarchica reationship among enzyme categories. In addition, the structured SVM has exhibited the competitive performance with the HM 3, and the better predictive resuts than other existing ow costy structured-based method (MMR) in structured output prediction of enzyme function. Therefore we think our new method wi be usefu suppementary toos for the future studies in enzyme function prediction. Acknowedgments This work is supported by the Key Project of the Nationa Natura Science Foundation of China (No ), the Nationa Natura Science Foundation of China (No. 0802, No ) and the Ph.D Graduate Start Research Foundation of Xinjiang University Funded Project (No.BS0800). References [] Passon, B. Systems Bioogy: Properties of Reconstructed Networks Cambridge University Press New York, NY, USA, [2] Bairoch, A.The ENZYME database in Nuceic Acids Research, 2000, 28, [3] Tian, W., Skonick, J. How we is enzyme function conserved as a function of pairwise sequence identity? Journa of Moecuar Bioogy, 2003, 333,

9 SVM-based Method for Predicting Enzyme Function in a Hierarchica Context 27 [4] Shen, H.B., Chou, K.C. EzyPred: A top-down approach for predicting enzyme functiona casses and subcasses. Biochemica and Biophysica Research Communications, 2007, 364, [5] Astikainen, K., Hom, L., Pitkänen, E., Szedmak, S., Rousu, J. Towards structured output prediction of enzyme function. BMC Proceedings, 2008, 2:S2. [6] Rousu, J., Saunders, C., Szedmak, S., Shawe-Tayor, J. Kerne-Based Learning of Hierarchica Mutiabe Cassification Modes. The Journa of Machine Learning Research, 2006, 7, [7] Cai, L, J., Hofmann, T. (2004) Hierarchica document categorization with support vector machines. Proceedings of the thirteenth ACM internationa conference on Information and knowedge management, Washington, D.C., USA. [8] Dumais, S., Chen, H. Hierarchica Cassification of Web Content, 2000, In SIGIR. [9] Chou, K.C., Erod, D.W. Prediction of enzyme famiy casses. Journa of Proteome Research, 2003, 2, 83. [0] Chou, K.C. Prediction of protein ceuar attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Genetics, 200, 43, [] Chou, K.C. Using amphiphiic pseudo amino acid composition to predict enzyme subfamiy casses. Bioinformatics, 2005, 2, 0 9. [2] Shen, J.W., Zhang, J., Luo, X. M., Zhu, W.L., Yu, K.Q., Chen, K.X., Li, Y. X., Jiang, H.L. Predicting protein-protein interactions based ony on sequences information. Proceedings of the Nationa Academy of Sciences, 2007, 04, [3] Shao, X. J., Tian, Y. J., Wu, L. Y., Wang, Y., Deng, N. Y. Predicting DNA- and RNA-binding proteins from sequences with kerne methods. Journa of Theoretica Bioogy, 2009, 258, [4] Cheng, A.C., Coeman, R.G., Smyth, K.T., Cao, Q., Souard, P., Caffrey, D.R., Sazberg, A.C., Huang, E.S. Structure-based maxima affinity mode predicts smamoecue druggabiity. Nature Biotechnoogy, 2007, 25, [5] Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage ysozyme. Biochimica et Biophysica Acta, 975, 405, [6] Pu, X., Guo, J., Leunga, H., Lin, Y.L.Prediction of membrane protein types from sequences and position-specific scoring matrices. Journa of Theoretica Bioogy, 2007, 247, [7] Li, B. Hu, J. Hirasawa, K. Sun, P., Marko, K. Support vector machine with fuzzy decisionmaking for rea-word data cassification. In IEEE Word Congress on Computationa Inteigence, Int. Joint Conf. on Neura Networks, 2006, Canada. [8] Szedmak, S., Shawe-Tayor, J, Parado-Hernandez, E. Learning via Linear Operators: Maximum Margin Regression. Tech. rep., 2005, Pasca Research Reports. [9] Wang, X.B., Wu, L.Y., Wang, Y.C., Deng, N.Y. Prediction of pamitoyation sites using the composition of k-spaced amino acid pairs. Protein Engineering Design and Seection, 2009, 22(),