A Combination of a Functional Motif Model and a Structural Motif Model for a Database Validation


Minoru Asogawa, Yukiko Fujiwara, Akihiko Konagaya
Massively Parallel Systems NEC Laboratory, RWCP *
4-1-1, Miyazaki, Miyamaeku, Kawasaki, Kanagawa 216, Japan
asogawa@csl.cl.nec.co.jp, yukiko@csl.cl.nec.co.jp, konagaya@csl.cl.nec.co.jp

Abstract

This paper reports results from a study on database validation for the leucine zipper motif, using both a functional motif model and a structural motif model. The leucine zipper motif, chosen as the example for this method, is a subsequence that consists of two twisted alpha helix sequences and is preceded by a DNA binding site. For the functional motif model, an HMM (Hidden Markov Model) trained with leucine zipper subsequences is employed. For the structural motif model, a neural network trained to classify alpha helix regions is employed. Because only 122 such leucine zipper sequences are in the Swiss Protein database (R.22), there is a possibility that the HMM could not learn the general mechanism for helical structures completely. Therefore, a structural motif model for an alpha helix is used to eliminate non-helical sequences. Fortunately, numerous secondary structure examples are available in the PDB database. All polypeptides in the Swiss Protein database v.22 are examined with the combination of the HMM and the neural network. For predicting a leucine zipper region, the HMM achieved percent, which improved up to percent when combined with the neural network.

1 Introduction

Predicting a motif from protein sequences is an important problem. Motifs, which are the sites preserved in the evolution process, are considered to represent the function or structure of proteins. The motif prediction problem grows in importance as more protein sequences are revealed, because the rate of sequencing far exceeds the rate at which structures are understood.
* RWCP: Real World Computing Partnership

Until recently, a symbolic pattern representation was used to represent a functional motif. For example, the pattern of the leucine zipper motif, a well-known motif for DNA binding proteins, is L-X(6)-L-X(6)-L-X(6)-L-X(6)-L, representing a repetition of leucine followed by any six residues. One of the issues in motif representation is the handling of exceptions caused by the variety of amino acid sequences. Konagaya [1] employed a stochastic decision predicate, which consists of conjunctive and disjunctive patterns and their probability parameters, to represent the exceptions in the motif. However, a pattern representation cannot achieve satisfactory classification accuracy. For example, the accuracy for leucine zipper motifs is percent. This is because proteins usually have various sequences corresponding to different species, even around motifs. In leucine zipper motifs, the repeated L's (Leu) tend to change to other amino acids, such as V (Val), A (Ala) and M (Met). Such variations are considered to be related to the evolution process of organisms. Thus, these variations might follow some systematic relationships; i.e., the variation of the amino acid at a residue is related to the neighboring residues. These systematic relationships represent biological characteristics, and an HMM can represent them.

Another aspect of motifs is that they contain specific structural motifs, that is, secondary structures such as an alpha helix or a beta strand. In the leucine zipper motif, there are two twisted alpha helix sequences bonded by leucines (or perhaps other similar amino acids). Although an HMM is implicitly trained to represent structural motifs, there is not enough motif data available for structural motif learning. Consequently, an HMM might accept a sequence that is similar to the leucine zipper motif but does not form a helical structure.
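As an illustration of the symbolic representation discussed above, the pattern L-X(6)-L-X(6)-L-X(6)-L-X(6)-L can be written as a regular expression. This is only a minimal sketch: the function name `find_zipper_candidates` and the use of a lookahead to report overlapping hits are assumptions made for the example, not part of the original study.

```python
import re

# L-X(6)-L-X(6)-L-X(6)-L-X(6)-L as a regular expression: five leucines,
# each pair separated by any six residues (29 residues in total).
# The lookahead is an illustrative choice that also reports overlapping hits.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L.{6}L))")

def find_zipper_candidates(sequence: str) -> list[tuple[int, str]]:
    """Return (start index, matched 29-residue window) for every pattern hit."""
    return [(m.start(), m.group(1)) for m in LEUCINE_ZIPPER.finditer(sequence)]

if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any database.
    toy = "MK" + "LAAAAAA" * 4 + "L" + "GG"
    print(find_zipper_candidates(toy))
```

A sequence in which one of the repeated leucines has changed to V, A or M is rejected by this pattern, which is exactly the kind of exception the text describes.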
A neural network is utilized to predict structural motifs; i.e., secondary structure. A large amount of data is available for secondary structures and can be used for training a neural network. To achieve high classification accuracy, a functional motif, modeled with an HMM, and a structural motif, modeled with a neural network, are combined as one system.

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95), IEEE

Figure 1: Motif prediction outline

It is desirable to extract the biological characteristics of motifs from the training data only. When the training sequences come from two different subgroups or families, the resulting HMM topology is expected to branch into two parts. For this purpose, general HMMs containing global loops are needed, instead of the left-to-right models commonly used in speech recognition. Accordingly, one of the problems to solve is determining the HMM topology, because general HMMs admit many candidate topologies.

One method of determining the HMM topology is to train a large fully-connected HMM and delete negligible transitions. However, the HMM resulting from a fully-connected one may be very complex and difficult to interpret. Moreover, optimizing the numerous free parameters takes a huge amount of training time.

Thus, the iterative duplication method is utilized for generating an HMM [2] [3]. The method enables us to obtain an optimal HMM topology for the given training sequences, as well as optimal HMM parameters for the network. It starts from a small fully-connected network and iterates network generation and parameter optimization. The network generation prunes transitions and adds a state according to the previous topology. This method obtains a simpler HMM topology in less time than training from a fully-connected model.

This paper is organized as follows. First, the authors explain HMMs, followed by an explanation of the iterative duplication method for HMM learning. After that, the authors explain the neural network. Then, experimental results are given for leucine zipper motif prediction employing only an HMM. Finally, the performance improvement achieved by combining an HMM and a neural network is shown.

2 HMMs

2.1 Overview

An HMM is a nondeterministic finite state automaton that represents a Markov process. HMMs are commonly used in speech recognition [4], and recently have been applied to protein structure grammar estimation [5] and protein modeling [7], [6]. An HMM is characterized by a network with a finite number of states and parameters (see Fig. 2). The parameters represent initial probabilities, transition probabilities, and observation probabilities. At discrete instants of time, the process is assumed to be in one state, and an observation (or output symbol) is generated according to the observation probability of the current state. The state then changes according to its transition probabilities.

Figure 2: An example of an HMM (left-to-right)

A special type of HMM, called a left-to-right model (Fig. 2), is commonly used in speech recognition. In this model, states are linearly connected with self-loop transitions; a state, once left, is never revisited at a later time. This is because there is little need to deal with periodic structures in speech recognition. However, such periodic structures are rather common in amino acid sequences and have great significance for constructing a geometric structure. Therefore, the authors adopt a general HMM containing global loops.

The correspondence between motifs and HMMs is as follows. The training set involves the portions of amino acid sequences that have the same structure or function. An HMM is expected to model the training proteins in terms of discrimination. The alphabet used for the output symbols corresponds to the 20 amino acids. The test sequence is the portion of an amino acid sequence which might have the target structure or function. The result is the likelihood of the test sequence, calculated by tracing all possible transition paths that observe the test sequence in the HMM.

To use a trained HMM as a classifier, the authors define a threshold value determined by the Z score [7], generated from both positive examples and negative examples. The probability generated by a given sequence is compared with the threshold value, which is obtained by approximating both the positive and negative likelihood distributions as normal distributions. A threshold value for the trained neural network is also determined from both positive examples and negative examples. One of the great advantages of using HMMs and neural networks is that it is possible to quantify the similarity between the test sequence and the training set by comparing their likelihood on the HMM.

3 Motif Prediction using an HMM

3.1 Learning Algorithm

In order to obtain the optimal HMM topology for the given training sequences, the iterative duplication method [2] is used. This method also produces the optimal HMM parameters for the network.
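Before the details of the learning method, the likelihood computation described in Section 2.1 (summing over all transition paths that observe the test sequence) can be sketched with the standard forward algorithm. The sketch below is an assumption-laden illustration, not the authors' code: the function name, the list/dict data layout and the per-step rescaling are choices made for this example, and the sequence is assumed to be non-empty.

```python
import math

def forward_log_likelihood(seq, init, trans, emit):
    """Log-likelihood of seq under a discrete HMM, summed over all state paths.

    init[i]      : probability of starting in state i
    trans[i][j]  : probability of moving from state i to state j
    emit[i][sym] : probability that state i outputs symbol sym (dict per state)
    """
    n = len(init)
    log_like = 0.0
    # alpha[i] = P(symbols so far, current state = i), rescaled at every step
    # so that long sequences do not underflow.
    alpha = [init[i] * emit[i].get(seq[0], 0.0) for i in range(n)]
    for t in range(len(seq)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(seq[t], 0.0)
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_like
```

Because the recursion places no restriction on `trans`, it applies equally to left-to-right models and to general HMMs with global loops.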
The method includes transition network generation and parameter optimization, and is summarized in Fig. 4. It starts from a small fully-connected network. In order to avoid converging to a local maximum, many initial HMMs with random parameters are prepared. The Baum-Welch algorithm is used for parameter optimization. Network generation is implemented by copying one node selected from the current network. The method iterates the network generation and parameter optimization phases until sufficient discrimination accuracy is obtained.

The details of network generation follow. First, delete the transitions with negligible transition probability, that is, less than δ = max(c_1, r), where c_1 is a smoothing value and r is a convergence radius. Next, for each state S_i except the final state, count the number of incoming and outgoing transitions of the state, that is, the number of transitions from the state S_i plus the number of transitions to the state S_i. Then, select the state with the largest number (denoted as S_c) and make a copy of it (denoted as S_new) so that S_new has the same transitions as S_c. If the state S_c has a self-loop, S_new has a self-loop as well as the transitions from S_c to S_new and from S_new to S_c (see Fig. 3).

The purpose of deleting the negligible transitions is to restrict the network topology space and eventually to reduce the training cost of parameter optimization. The reason for splitting the most connected node is that it might represent an overlap of independent states. In this case, the network topology may become simpler by splitting the state. Fig. 3 shows an example of such a case. In Fig. 3 (a), the most connected state is a hatched state which outputs E (Glu) with probability 0.26, Q (Gln) with probability 0.15 and so on. By splitting the state into two states, a new network is obtained which has additional transitions represented by bold lines (see Fig. 3 (b)).
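The network generation step described above (prune negligible transitions, pick the most connected state, copy it) can be sketched as follows. This is a hypothetical illustration under stated assumptions: transitions are stored as a dict of dicts, state labels are integers, and the name `generate_network` is invented for the example. The copied probabilities are only starting values; in the method they would be re-estimated by the subsequent parameter optimization.

```python
def generate_network(trans, final_state, delta):
    """One network-generation step: prune weak transitions, then copy the busiest state.

    trans[i][j] is the transition probability from state i to state j.
    Returns (new transition dict, copied state, label of the new state).
    """
    # 1. Delete transitions whose probability is below the threshold delta.
    pruned = {i: {j: p for j, p in row.items() if p >= delta}
              for i, row in trans.items()}

    # 2. Count outgoing plus incoming transitions of a state.
    def degree(s):
        incoming = sum(1 for row in pruned.values() if s in row)
        return len(pruned[s]) + incoming

    # The final state is excluded from the candidates.
    s_c = max((s for s in pruned if s != final_state), key=degree)

    # 3. Copy s_c so that the new state has the same transitions.
    s_new = max(pruned) + 1                # assumes integer state labels
    had_self_loop = s_c in pruned[s_c]
    pruned[s_new] = dict(pruned[s_c])      # same outgoing (gives s_new -> s_c if self-loop)
    for i in list(pruned):
        if i != s_new and s_c in pruned[i]:
            pruned[i][s_new] = pruned[i][s_c]  # same incoming (gives s_c -> s_new if self-loop)
    if had_self_loop:
        pruned[s_new][s_new] = pruned[s_c][s_c]
    return pruned, s_c, s_new
```

After this step the rows no longer sum to one, so a renormalization or a Baum-Welch pass is assumed to follow before the model is used.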
After parameter optimization, however, the network can be simplified if most of these transitions become negligible (see Fig. 3 (c)). In each epoch, this algorithm produces an optimal HMM for the training data with each number of states. By selecting the HMM with the highest prediction accuracy, the optimal number of states for the given data is obtained.

Figure 3: A part of learning. (Left) A general rule for a duplication. (Right) An example of the step from six to seven. (a) The resulting HMM with 6 states after parameter optimization and negligible transition deletion. (b) A new network by a hatched state copy. (c) An obtained HMM with 7 states after parameter optimization and negligible transition deletion. Transitions unrelated to the explanation are omitted from Fig. 3.

Figure 4: Iterative Duplication Method

input: (protein) sequences and a small fully-connected HMM.
initialization:
    optimize parameters for the HMM.
    choose the best HMM on likelihood as a seed.
repeat
    network generation:
        delete negligible transitions.
        copy the most connected state.
    parameter optimization:
        optimize parameters for the new topology.
        choose the best HMM on likelihood.
until sufficient accuracy is obtained.
output: the resulting HMM.

3.2 Prediction using an HMM

Prediction is carried out by comparing data with a Z score, which is a normalized likelihood. If a sequence achieves a higher likelihood than the threshold value, it is predicted to have the target motif, that is, the same structure and/or function. In the current implementation, the threshold value, determined by Z score [7], is generated from both positive examples and negative examples. Accuracy E_acc is defined as follows:

\[ E_{acc} = \frac{1}{2}\left(\frac{C^{+}}{N^{+}} + \frac{C^{-}}{N^{-}}\right) \tag{1} \]

where N^+ is the number of positive examples, N^- is the number of negative examples, C^+ is the number of correctly classified positive examples, and C^- is the number of correctly classified negative examples. Both the positive and negative accuracy values are normalized with respect to the number of examples in each class. Since there are far more negative examples than positive examples, the threshold value would be determined only by the negative examples without this normalization.

The threshold value for classification is determined by the following method. In this method, it is assumed that the likelihood distributions, both positive and negative, are normal distributions. In fact, Fig. 5 shows the resulting likelihood for the close data, and the likelihood distribution is almost a normal distribution. Using this assumption, Eq. (1) is approximated as

\[ E_{acc} \approx \frac{1}{2}\left(\int_{\alpha}^{\infty} P(x; \mu_{pos}, \sigma_{pos})\,dx + \int_{-\infty}^{\alpha} P(x; \mu_{neg}, \sigma_{neg})\,dx\right) \tag{2} \]

where α is the threshold value, P(·; μ, σ) is the normal distribution probability function with mean μ and standard deviation σ, μ_neg and σ_neg are the mean and standard deviation of the negative examples, and μ_pos and σ_pos are those of the positive examples. To obtain the value of α which produces the maximum E_acc, the partial derivative of Eq. (2) with respect to α is taken and the equation which sets this derivative to 0 is solved. Consequently, the threshold value α is determined by the condition

\[ P(\alpha; \mu_{pos}, \sigma_{pos}) = P(\alpha; \mu_{neg}, \sigma_{neg}) \tag{3} \]

Figure 5: Z Score Histogram for Close Data

4 Secondary Structure Prediction using Neural Network

For the neural network, a multi-layered perceptron is utilized. The neural network consists of 3 layers: an input layer, a hidden layer and an output layer. It is designed to observe 15 residues simultaneously [12]. Each residue is coded as a binary string of length 20. When the residue is unknown, 0.05 is applied to all 20 input cells. Thus, there are 300 input cells in the input layer. There are 10 cells in the hidden layer. There are two cells in the output layer, each corresponding to a prediction category: one for a helix region and the other for a non-helix region (Fig. 6). This classification corresponds to the secondary structure of the residue at the center of the input window.

Figure 6: Neural Network Architecture

Learning is implemented with the backpropagation algorithm, until sufficient discrimination accuracy is obtained. After 300 epochs the classification performance reaches percent for the learning data and percent for the testing data. In this performance evaluation, the cell which yields the higher activation is considered the answer category. Note that, in this experiment, the neural network is designed to predict alpha helix regions longer than 15 residues. This differs from the usual application of a neural network, which predicts 3 classes of secondary structure, and is the reason why this neural network shows such high performance.

To apply the trained neural network as an alpha helix region detector, the output activations are compared with the helix region teacher signal and a normalized squared error is computed. A small normalized squared error therefore implies that the current subsequence might have an alpha helix structure.

5 Experiments

5.1 Training Data and Test Data

For predicting a leucine zipper motif with an HMM, 112 positive examples, which are the collection of subsequences annotated as leucine zipper (like), were chosen from the Swiss Protein database Release 22 [10]. Positive examples contain short sequences of length 15, 22, 29 and 36, in proportions of 7.14, 56.25, and 4.46 percent, respectively. Negative examples were randomly selected from the Swiss Protein database Release 22. These negative examples were chosen only from proteins which do not have a leucine zipper annotation. The ratio of sequence lengths is controlled to coincide with that of the positive examples. A randomly selected 80 percent of the positive subsequences are used for training, and the remaining positive examples are used for evaluating prediction performance. To employ a trained HMM as a classifier, a Z score threshold value is determined from both positive and negative examples. For determining the threshold value, a randomly selected 20 percent of the negative examples were used. The remaining 80 percent of the negative examples are used for testing.

To train the neural network, alpha helix subsequences are chosen from the PDB database of August 1992. In practice, the DSSP program is used to determine secondary structures from the PDB data. In the training data, alpha helix subsequences with lengths of more than 15 are chosen as the positive data. Neither 2,3 nor 5 helix is included in the positive data. All subsequences which do not contain any helix region in the input window range (15 residues) are chosen as the negative data. A randomly selected 70 percent of the training subsequences are used for learning, and the remaining data are used for evaluating prediction performance.

To evaluate the performance of the symbolic representation, sequences which satisfy the L-X(6)-L-X(6)-L-X(6)-L-X(6)-L pattern, a repetition of leucine and any six residues [11], are selected. Although those selected sequences are similar to positive examples in terms of symbolic representation, only some of them are true leucine zipper sequences. Since there are far more negative data (non-leucine zipper sequences) than leucine zipper sequences, the symbolic representation yields an accuracy as high as percent.
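The threshold determination used above (fit a normal distribution to the positive and to the negative Z scores, then maximize the normalized accuracy) can be sketched with the Python standard library. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical, each class is assumed to have at least two scores with non-zero spread, and the closed form comes from equating the two fitted densities, which is a quadratic equation in the threshold.

```python
import math
from statistics import NormalDist

def approx_accuracy(alpha, pos, neg):
    """Mean of the positive mass above alpha and the negative mass below alpha."""
    return 0.5 * ((1.0 - pos.cdf(alpha)) + neg.cdf(alpha))

def best_threshold(pos_scores, neg_scores):
    """Fit one normal distribution per class and return the accuracy-maximizing threshold."""
    pos = NormalDist.from_samples(pos_scores)
    neg = NormalDist.from_samples(neg_scores)
    a = pos.variance - neg.variance
    if abs(a) < 1e-12:
        # Equal variances: the two densities cross midway between the means.
        return 0.5 * (pos.mean + neg.mean)
    # The derivative of the accuracy vanishes where both densities are equal.
    centre = neg.mean * pos.variance - pos.mean * neg.variance
    disc = (pos.mean - neg.mean) ** 2 + 2.0 * a * math.log(pos.stdev / neg.stdev)
    half_width = pos.stdev * neg.stdev * math.sqrt(disc)
    roots = [(centre - half_width) / a, (centre + half_width) / a]
    # Of the two crossing points, keep the one with the higher accuracy.
    return max(roots, key=lambda r: approx_accuracy(r, pos, neg))
```

Because both class terms are weighted equally, the large number of negative examples does not pull the threshold toward the negative distribution.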
5.2 Evaluation with a Symbolic Pattern

To measure the performance of the symbolic motif representation, subsequences with lengths of 15, 22, 29 and 36 are collected from the Swiss Protein database and tested to determine whether they satisfy the leucine zipper representation, such as L-X(6)-L-X(6)-L-X(6)-L-X(6)-L. This classification is validated against the database annotation comments. The result is shown in Table 1. To make the evaluated performance acceptable, partial matching is counted separately as a "partial match"; i.e., as long as a selected subsequence is included in the correct region, which is determined by the database annotation, it is counted as a partial match.

Table 1: Prediction Accuracy with a Symbolic Pattern (rows: Perfect Match, Partial Match; columns: positive, total, percent; negative, total, percent; average)

5.3 Experimental Results with HMM

Table 2: Prediction accuracy (leucine zipper)

        training   test       test       test
        pos.data   pos.data   neg.data   average
test0   98.9%      81.8%      83.0%      82.4%
test1              91.3%      65.2%      78.2%
test2   98.9%      68.2%      83.9%      76.1%
test3   98.9%      87.0%      78.6%      82.8%
test4   98.9%      77.3%      75.9%      76.6%
Ave.    99.1%      81.3%      77.3%      79.3%

Table 2 shows the result of cross-validation for leucine zipper motifs. To contrast the ability of an HMM, 112 carefully selected negative examples are used. These are sequences which contain the leucine zipper pattern L-X(6)-L-X(6)-L-X(6)-L-X(6)-L. The positive data is divided into 5 groups and tested with both negative and positive data. The average prediction accuracy is 99.1 percent for training data and 79.3 percent for test data; 81.3 percent for positive data and 77.3 percent for negative data. To emphasize HMM performance, the negative test set is chosen from the sequences satisfying the L-X(6)-L-X(6)-L-X(6)-L-X(6)-L pattern; for the symbolic pattern, the average prediction accuracy on this test data is just 14.8 percent: 29.5 percent for the positive data and 0.0 percent for the negative data.

Figure 7: Extracted HMM for a leucine zipper motif. (Left) An HMM (leucine zipper). (Right) The helical wheel, i.e., helices observed from the top at HMM paths. Hydrophobic, hydrophilic and the other amino acids are described by bold, hatched and broken letters, respectively. The characters on the circles are the most frequently observed amino acids at each state.

Figure 8: (Left) Biological structure of a leucine zipper motif. (Right) The helical wheel.

Fig. 7 shows an HMM for leucine zipper motifs obtained using the iterative duplication method. This HMM contains global loops corresponding to the helix structure in the leucine zipper motif. Such a helical structure, as shown in Fig. 8, is caused by the existence of seven amino acids per every two periods. This is because a pair of aligned leucines forms a zipper-like structure. On the right in Fig. 8, this characteristic is shown with a helical wheel viewed from the top. These circles depict that there are many hydrophobic amino acids on one side around the combined leucines and many hydrophilic amino acids on the other side. This tendency of hydrophilic and hydrophobic amino acids is a key to forming two twisted helices.

In Fig. 7, each circle at the right, corresponding to an HMM path, has a similar characteristic to the previous helical wheel. The characters on the circles are the most frequently observed amino acids in each state. In order to see the helical wheel, hydrophobic amino acids, such as I (Ile), V (Val), L (Leu), F (Phe) and C (Cys), are described by bold letters. On the other hand, hydrophilic amino acids, such as R (Arg), K (Lys), N (Asn), D (Asp), Q (Gln), E (Glu) and H (His), are described by hatched letters. The others, M (Met), A (Ala), G (Gly), T (Thr), S (Ser), W (Trp), Y (Tyr) and P (Pro), are described by broken letters. These circles show three kinds of helical wheels.
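The residue grouping used for the helical wheels above, together with the seven-residue periodicity, can be illustrated with a small sketch. It only tabulates which residue class falls on each of the seven wheel positions of a sequence; the function names and the choice to start counting positions at the first residue are assumptions for this example, not part of the original analysis.

```python
# Residue groups as listed in the text; all remaining residues count as "other".
HYDROPHOBIC = set("IVLFC")
HYDROPHILIC = set("RKNDQEH")

def residue_class(residue: str) -> str:
    """Classify a one-letter amino acid code using the grouping given in the text."""
    if residue in HYDROPHOBIC:
        return "hydrophobic"
    if residue in HYDROPHILIC:
        return "hydrophilic"
    return "other"

def heptad_profile(sequence: str) -> list[dict[str, int]]:
    """Count residue classes at each of the seven positions of the repeat."""
    profile = [{"hydrophobic": 0, "hydrophilic": 0, "other": 0} for _ in range(7)]
    for index, residue in enumerate(sequence):
        profile[index % 7][residue_class(residue)] += 1
    return profile

if __name__ == "__main__":
    # Hypothetical toy sequence with a leucine at every seventh position.
    for position, counts in enumerate(heptad_profile("LEKKAEE" * 3)):
        print(position, counts)
```

A position dominated by hydrophobic counts corresponds to the side of the wheel around the combined leucines, while hydrophilic counts accumulate on the opposite side.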
The iterative duplication method thus automatically extracted the helical structures and characteristics from the positive data.

Fig. 5 indicates the Z score for both positive and negative data on the close data. By using the method described in 3.2, the threshold value α is determined as , which is indicated with a dotted line in Fig. 5. With this threshold value, the HMM achieved accuracy E_acc as much as percent for the close data. The close data consists of positive data and negative data; the positive data is 80 percent of the leucine zipper subsequences and is used for HMM learning, and the negative data is 20 percent of the non-leucine zipper subsequences and is used for determining the threshold value. Details are shown in Table 3.

This HMM and the threshold value α are examined with the open data, which are used for neither HMM learning nor the threshold value determination. The accuracy E_acc is percent for the open data. Details are shown in Table 3. Fig. 9 indicates the Z score for the open data.

Table 3: Prediction accuracy with an HMM (rows: Close, Open; columns: pos.data correct, total, percentage; neg.data correct, total, percentage; average)

Figure 9: Z Score Histogram for Open Data. For clarity, all negative Z scores less than the threshold value are omitted from this figure.

6 Experimental Result with an HMM and a Neural Network

Fig. 10 indicates the normalized squared error distribution for the close data. The normalized squared error is closely correlated with the class of the positive and negative data. Thus, the normalized squared error could be utilized for segregating the positive data from the negative data. Since some positive data show a large squared error, it is much better to combine this measurement with another orthogonal measurement, such as the HMM likelihood.

Figure 10: Normalized Squared Error Histogram for the Close Data

Fig. 11 indicates a scatter plot of the Z score and the normalized squared error. A dotted line corresponds to the threshold value α for the Z score. Careful examination of Fig. 11 shows that the normalized squared error increases when the Z score decreases for the positive data. This indicates that leucine zipper subsequences tend to show a high squared error when their likelihood is low. A solid line in this figure is drawn based on this fact. Therefore, the data below this line is interpreted as negative data.

Figure 11: Normalized Squared Error Distribution for Close Data

By utilizing this method, the number of incorrect classifications for negative data decreases from 4394 to 3190 for the close data. Consequently, accuracy E_acc is improved from percent to percent. For the open data, the number of incorrect classifications for negative data decreases from to 6274, and accuracy E_acc is improved from percent to percent.

Figure 12: Normalized Squared Error Distribution for Open Data. For clarity, all negative Z scores less than the threshold value are omitted from this figure.

Table 4: Prediction accuracy with an HMM and a Neural Network (rows: Close, Open; columns: pos.data correct, total, percentage; neg.data correct, total, percentage; average)

7 Conclusion

An HMM is capable of representing a stochastic motif well. Since an HMM is trained with only a small number of subsequences, there is a possibility that the HMM could not learn the general mechanism completely. Usually, the secondary structure of a motif is well known, especially for the leucine zipper motif. In the leucine zipper motif, there are two twisted alpha helix sequences bonded by leucines (or perhaps other similar amino acids). In the Swiss Protein database, the lengths of alpha helix sequences are 15, 22, 29 and 36. Therefore, a neural network is trained with alpha helix subsequences more than 15 residues long, and it is used to distinguish alpha helix regions from non-alpha helix regions.

According to the experience gained from leucine zipper motif prediction, the HMM shows higher discrimination performance than a symbolic motif representation. By comparing Tables 3 and 4, it is shown that the prediction performance is improved by combining a neural network. Since a large amount of negative data is used in this experiment and since most of the negative data is very easy to classify, the performance improvement looks small, such as percent. The ease of classifying the negative data is shown by the prediction accuracy of the symbolic pattern method. This is because the amount of negative data, which is , far exceeds that of positive data, which is 112. Therefore, it is difficult to decrease the number of misclassifications of the negative data from . By combining the neural network, the number of misclassifications of negative data is decreased from to . Moreover, since a motif usually contains specific secondary structures, this combined method is widely applicable.

Acknowledgment

The authors would like to thank the genetic information processing group and Mr. Mamitsuka in NEC for meaningful discussion and valuable help.

References

[1] A. Konagaya and M. Kondo: Stochastic Motif Extraction using a Genetic Algorithm with the MDL principle, HICSS26 (1993).
[2] Y. Fujiwara and A. Konagaya: Protein Motif Extraction using Hidden Markov Model, pp. 56-64, Proceedings Genome Informatics Workshop IV (1993).
[3] Y. Fujiwara, M. Asogawa and A. Konagaya: Protein Motif Extraction using Hidden Markov Model, to appear at ISMB 94 (1994).
[4] S. Nakagawa: Speech Recognition Using Stochastic Models, pp. 29-108, Electronic Society of Information Communication (1988).
[5] K. Asai, S. Hayamizu and K. Onizuka: HMM with Protein Structure Grammar, HICSS26 (1993).
[6] P. Baldi, Y. Chauvin, T. Hunkapiller and M. A. McClure: Hidden Markov models of biological primary sequence information, Neural Computation (1994).
[7] D. Haussler, A. Krogh, I. Mian and K. Sjolander: Protein Modeling using Hidden Markov Models: Analysis of Globins, HICSS26 (1993).
[8] J. Takami and S. Sagayama: Automatic Generation of the Hidden Markov Network by Successive State Splitting, Proceedings of ICASSP (1991).
[9] A. Bairoch: PROSITE database, SwissProt Release 22 (1992).
[10] A. Bairoch: Sequence database, SwissProt Release 22 (1992).
[11] A. Aitken: Identification of Protein Consensus Sequences, Ellis Horwood Limited (1990).
[12] N. Qian and T. Sejnowski: Predicting the Secondary Structure of Globular Proteins Using Neural Network Models, Journal of Molecular Biology, 202 (1988).
