A Hidden Markov Model for Identification of Helix-Turn-Helix Motifs


CHANGHUI YAN and JING HU
Department of Computer Science, Utah State University, Logan, UT, USA

Abstract: This paper presents a hidden Markov model (referred to as HMM_AA_SA) that combines amino acid sequence with solvent accessibility to identify Helix-Turn-Helix (HTH) motifs, a structure that plays a pivotal role in protein-DNA interactions. The solvent accessibility of amino acid residues is discretized into several categories, each denoted by one letter. At each state, HMM_AA_SA emits one amino acid letter and one solvent accessibility letter. We apply the method to identifying HTH motifs. The results show that adding solvent accessibility to the model increases its sensitivity in identifying HTH motifs. We explore different thresholds for discretizing solvent accessibility. The results show that dividing solvent accessibility into three discrete categories (B: buried, M: medium, and E: exposed) achieves better performance than dividing it into two categories (B: buried and E: exposed).

Key-Words: Hidden Markov model, helix-turn-helix, solvent accessibility

1 Introduction

Protein-DNA interactions play a pivotal role in gene regulation. Helix-Turn-Helix (HTH) is an important protein structure (motif) through which proteins bind to DNA. The ability to identify HTH motifs is critical for deciphering the mechanisms of gene regulation. Both sequence-based methods [1, 2] and structure-based methods [3-5] have been used to detect HTH motifs. Sequence-based methods identify HTH motifs by detecting the occurrence of a sequence pattern. Structure-based methods look for structures that fit structural templates developed from known HTH motifs. Because the sequence and structure of HTH motifs vary widely, neither sequence-based nor structure-based methods alone can identify HTH motifs with perfect performance.
Methods that explore both sequence and structural data have shown promising results [6]. Because the solvent accessibility of residues has been shown to be an effective factor in identifying HTH motifs [4, 7], in this study we combine solvent accessibility and amino acid sequence to identify HTH motifs.

HTH motifs can be divided into families based on their sequence and structure [8]. Sequence similarity among the families is low. Most previous HTH-identifying methods aim to identify HTH motifs for each family separately. However, given the large variations among the families, it is likely that some HTH motifs do not fall into any currently defined family. To identify these new motifs, we need methods that capture the common features shared by different families of HTH motifs and can recognize HTH motifs across families. In this study, we aim to build a generic model that can be used to identify HTH motifs from all families.

We use hidden Markov models (HMMs) to combine amino acid sequence and solvent accessibility. Hidden Markov models are a powerful tool for modeling sequence data and have achieved many successes in computational biology [9]. Krogh [10] developed a hidden Markov architecture to represent multiple sequence alignments. Eddy [11] extended the architecture to develop profile hidden Markov models for protein families. The resulting database (known as Pfam) has been widely used in protein function annotation [12]. Although originally used for sequence data, hidden Markov models have been applied to protein structure prediction in many studies [13-15]. Some studies encode the structure using one-dimensional symbols [16], while others explicitly model 3D coordinates [17]. Hargbo and Elofsson [18] developed a hidden Markov model for fold recognition using amino acid sequence and predicted secondary structure.
In their model, each state emits one letter of secondary structure in addition to one letter of amino acid residue. We adopt the approach of Hargbo and Elofsson [18] to develop a hidden Markov model (referred to as HMM_AA_SA) that models both amino acid sequence and solvent accessibility. In each state, HMM_AA_SA emits one letter of solvent accessibility in addition to one letter of amino acid. Since solvent accessibility takes real values, a critical step in the HMM_AA_SA approach is to discretize solvent accessibility into categories. In this study, we explore various discretization criteria and find one that greatly increases the sensitivity in identifying HTH motifs, with only a small increase in false positive rate.

2 Methods and Materials

2.1 Data Sets

Nineteen families containing the Helix-Turn-Helix (HTH) motif were extracted from the Pfam database [12]. The structures of these families were visually inspected to ensure that the proteins contain an HTH motif of standard shape. The full alignment of the whole Pfam database was obtained, and sequences without structure data were removed. A sequence is put into the negative set if it does not belong to any of the nineteen families, and into the positive set otherwise. The negative set has 4,687 sequences. The positive set consists of the sequences from the nineteen families. Among them, one family has many more sequences than the others. To avoid bias, we chose only 9 sequences from this family for the positive data set. In total, the positive data set contains 70 sequences.

2.2 Solvent Accessibility and Relative Solvent Accessibility

Solvent accessibility measures the surface area of a residue that is accessible to solvent molecules. Relative solvent accessibility (the solvent accessibility divided by the total surface area of the residue) measures the fraction of the residue's surface that is accessible to solvent. Relative solvent accessibility falls in the range [0, 1]. We discretize solvent accessibility based on relative solvent accessibility.
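The discretization of relative solvent accessibility can be illustrated with a minimal Python sketch. It follows the rule stated later in Sections 3.2 and 3.3 (a residue is buried if its value is below the first threshold, exposed if it reaches the last one, and medium in between). The function names `discretize_rsa` and `discretize_sequence` are hypothetical and are not part of the authors' software.

```python
from typing import Sequence


def discretize_rsa(rsa: float, thresholds: Sequence[float]) -> str:
    """Map a relative solvent accessibility in [0, 1] to a category letter.

    One threshold gives two categories (B, E); two thresholds give three
    categories (B, M, E). A value equal to a threshold falls in the higher
    category, matching the "less than" rule in the text.
    """
    if not 0.0 <= rsa <= 1.0:
        raise ValueError("relative solvent accessibility must lie in [0, 1]")
    cuts = list(thresholds)
    if len(cuts) not in (1, 2):
        raise ValueError("expected one or two thresholds")
    if len(cuts) == 2 and not cuts[0] < cuts[1]:
        raise ValueError("thresholds must satisfy alpha1 < alpha2")
    letters = "BE" if len(cuts) == 1 else "BME"
    # Count how many thresholds the value reaches; that count is the category index.
    index = sum(rsa >= cut for cut in cuts)
    return letters[index]


def discretize_sequence(rsa_values: Sequence[float], thresholds: Sequence[float]) -> str:
    """Discretize a per-residue list of values into a string of category letters."""
    return "".join(discretize_rsa(value, thresholds) for value in rsa_values)
```

For example, `discretize_sequence([0.0, 0.15, 0.9], [0.1, 0.2])` labels the three residues as buried, medium, and exposed.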
The relative solvent accessibility of each residue was obtained together with the sequence from Pfam [12].

2.3 HMM_AA: hidden Markov model that emits only amino acids

A hidden Markov model (HMM) consists of a set of states. It can be viewed as a machine that generates sequences of letters by following paths through the states. At each state, it emits observable letters according to the emission probabilities. The transitions among the states are controlled by the transition probabilities. Figure 1 shows the structure of the HMM used by Krogh [10] to represent multiple sequence alignments. At its heart is a set of match (M), insertion (I), and deletion (D) states. One M state corresponds to one consensus column in the multiple alignment. I and D states correspond to insertions and deletions in the alignment. D states only produce gaps. Each M or I state emits one amino acid residue. Therefore, each M or I state is associated with 20 emission probabilities corresponding to the 20 amino acids. The emission probabilities are determined by the frequency with which residues are observed in the corresponding column of the multiple alignment. Transition probabilities between states are determined by the observed frequency of the corresponding transitions in the alignment.

Figure 1. Hidden Markov model that emits only amino acid residues (referred to as HMM_AA). M: Match states, I: Insertion states, D: Deletion states. Arrows show the state transitions. At each state, the model emits one amino acid residue.

2.4 HMM_AA_SA: hidden Markov model that emits solvent accessibility in addition to the identity of amino acids

In this study, we modify Krogh's HMM to combine amino acid sequence and solvent accessibility. Figure 2 shows the core structure of the hidden Markov model (referred to as HMM_AA_SA) used in this study.
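The frequency-based estimate of emission probabilities described in Section 2.3 can be sketched as follows. This is only an illustration under stated assumptions: the gap character `-`, the function name `emission_probabilities`, and the absence of pseudocounts or priors are choices made here, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable

GAP = "-"  # assumed gap character in the alignment column


def emission_probabilities(column: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Estimate emission probabilities for one match column.

    Each probability is the observed frequency of a symbol among the
    non-gap entries of the column. No pseudocounts are added, so symbols
    that never occur in the column get no entry (probability zero).
    """
    counts = Counter(symbol for symbol in column if symbol != GAP)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains only gaps")
    return {symbol: count / total for symbol, count in counts.items()}
```

The function is indifferent to what a symbol is: it works unchanged if each entry is an (amino acid, solvent accessibility) pair such as `("A", "B")` rather than a single amino acid letter.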
The difference between the models in Figure 2 and Figure 1 is that the emission in Figure 2 includes both the identity of amino acid residues and their solvent accessibility. When solvent accessibility is divided into two categories (i.e., buried (B) and exposed (E)), each M or I state in Figure 2 is associated with 40 emission probabilities, corresponding to the 40 combinations of 20 amino acid letters and 2 solvent accessibility letters. When solvent accessibility is divided into three categories (i.e., buried (B), medium (M), and exposed (E)), each M or I state in Figure 2 is associated with 60 emission probabilities, corresponding to the 60 combinations of 20 amino acid letters and 3 solvent accessibility letters.

When hidden Markov models were used to scan a protein sequence, a NULL model, which states the background occurrence of the query sequence, was used to calculate the significance of the hit. The E-value was used as the measure. The E-value is the expected number of false positives that fit the model at least as well as the hit. Thus, the lower the E-value, the more significant the hit. In this study, we chose E = 0.01 as the cutoff to identify significant hits, which means the expected number of false positives is 0.01.

Figure 2. Hidden Markov model that emits both amino acids and solvent accessibility (referred to as HMM_AA_SA). M: Match states, I: Insertion states, D: Deletion states. Arrows show the state transitions. At each state, the model emits one amino acid residue and its solvent accessibility.

2.5 Software Implementation

The software for building and searching HMMs used in this study was implemented by modifying the HMMER package to allow multiple emissions in a state.

2.6 Measurements

Let TP be the number of true positives (the examples that are positive and are predicted as such), P be the total number of positive examples, FP be the number of false positives (the examples that are negative but are predicted as positive), and N be the total number of negative examples. The measurements are defined as follows.

Sensitivity = TP/P: the fraction of positive examples that are correctly identified.
False positive rate = FP/N: the fraction of negative examples that are classified as positive examples.

3 RESULTS

3.1 HMM_AA achieves only very low performance

We first evaluated the performance of the hidden Markov model that emits only amino acids (referred to as HMM_AA). At each state, HMM_AA emits one letter of amino acid. HMM_AA was tested on the positive data set using three-fold cross-validation. The sensitivity is only 2.8%.
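The measurements of Section 2.6, combined with the E-value cutoff of Section 2.4, can be sketched in a few lines of Python. The sketch assumes that each sequence has been reduced to a single E-value (for instance that of its best hit) and that an E-value equal to the cutoff counts as significant; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

E_VALUE_CUTOFF = 0.01  # cutoff for significant hits, as chosen in Section 2.4


def is_significant(e_value: float, cutoff: float = E_VALUE_CUTOFF) -> bool:
    """A hit is called significant if its E-value does not exceed the cutoff.

    Treating equality as significant is an assumption of this sketch.
    """
    return e_value <= cutoff


def sensitivity_and_fpr(
    positive_e_values: Sequence[float],
    negative_e_values: Sequence[float],
    cutoff: float = E_VALUE_CUTOFF,
) -> Tuple[float, float]:
    """Return (sensitivity, false positive rate) = (TP/P, FP/N)."""
    p = len(positive_e_values)
    n = len(negative_e_values)
    if p == 0 or n == 0:
        raise ValueError("both the positive and the negative set must be non-empty")
    tp = sum(is_significant(e, cutoff) for e in positive_e_values)
    fp = sum(is_significant(e, cutoff) for e in negative_e_values)
    return tp / p, fp / n
```

Lower E-values are more significant, so tightening the cutoff can only lower both numbers returned by `sensitivity_and_fpr`.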
The false positive rate was examined by building HMM_AA using the positive data set and then testing it on the negative data set. The results show that the false positive rate of HMM_AA is 0%. Thus, HMM_AA identifies HTH motifs with very low sensitivity. In the following sections, we improve the performance of the method by adding solvent accessibility to the hidden Markov model. The resulting model is referred to as HMM_AA_SA. At each state, HMM_AA_SA emits not only an amino acid but also its solvent accessibility. Since solvent accessibility takes real values, we need to discretize it into categories. This raises the question of how to discretize the solvent accessibility. Our goal is to find a discretization that greatly increases the sensitivity of the method with little or no increase in false positive rate.

Figure 3. The performance of HMM_AA_SA with solvent accessibility divided into two categories (plot of sensitivity and false positive rate against the threshold). Solvent accessibility is discretized into two categories (buried (B) and exposed (E)) using one threshold α. A residue is in the buried (B) category if its relative solvent accessibility is less than α, and exposed (E) otherwise.

3.2 HMM_AA_SA achieves better performance than HMM_AA: Discretizing solvent accessibility into two categories

We first discretized solvent accessibility into two categories (buried (B) and exposed (E)) using one threshold α. A residue is in the buried (B) category if its relative solvent accessibility is less than α, and exposed (E) otherwise. Using this discretization, every state of the hidden Markov model (HMM_AA_SA) emits one solvent accessibility letter (B or E) in addition to one letter of amino acid. Thus, each state is associated with 40 emission probabilities corresponding to the 40 combinations of 20 amino acid letters and 2 solvent accessibility letters. We tried various values of the threshold α, ranging from 0.1 to 0.9 in increments of 0.1. For each threshold, we evaluated HMM_AA_SA on the positive data set using three-fold cross-validation. The sensitivity from the three-fold cross-validation is reported in Figure 3. We also examined the false positive rate of HMM_AA_SA by building an HMM_AA_SA using the positive set and then testing it on the negative data set. The results are also shown in Figure 3.

Figure 3 shows that when the threshold increases from 0.1 to 0.6, the sensitivity first decreases and then slightly increases, while the false positive rate remains at a very low level. When the threshold continues to increase, the sensitivity increases quickly and the false positive rate also increases dramatically. An ideal threshold should give a high sensitivity and a low false positive rate. The results show that thresholds larger than 0.6 are not a good choice because they have a very high false positive rate. We therefore focus on the range from 0.1 to 0.6, where the false positive rate remains low. In that range, HMM_AA_SA achieves the highest sensitivity (11.4%) when the threshold is 0.1. Thus, a threshold of 0.1 is the best choice for discretizing solvent accessibility into two categories for the HMM_AA_SA method. Table 1 compares the performance of HMM_AA with that of the HMM_AA_SA that uses 0.1 as the threshold to discretize solvent accessibility.
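The selection rule used above (restrict attention to thresholds whose false positive rate stays low, then take the one with the highest sensitivity) can be written as a short Python sketch. The paper does not give a numeric bound for a "low" false positive rate, so the `max_fpr` parameter is an assumption of this sketch, as is the function name `select_threshold`.

```python
from typing import Dict, Hashable, Tuple


def select_threshold(
    results: Dict[Hashable, Tuple[float, float]],
    max_fpr: float,
) -> Hashable:
    """Pick the threshold with the highest sensitivity among low-FPR candidates.

    `results` maps each candidate threshold (a single value, or a pair for
    the three-category case) to (sensitivity, false positive rate).
    `max_fpr` is an assumed tolerance: candidates above it are discarded.
    Ties are broken in favour of the candidate listed first.
    """
    admissible = {t: sf for t, sf in results.items() if sf[1] <= max_fpr}
    if not admissible:
        raise ValueError("no threshold satisfies the false positive rate bound")
    return max(admissible, key=lambda t: admissible[t][0])
```

Because the keys are arbitrary hashable values, the same function also covers a sweep over threshold pairs such as `(0.1, 0.2)`.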
The results show that HMM_AA_SA achieves much higher sensitivity than HMM_AA, while the false positive rate remains at a low level.

Table 1. HMM_AA_SA achieves better performance than HMM_AA by dividing solvent accessibility into two discrete categories

Method                   Sensitivity   False positive rate
HMM_AA                   2.8%          0%
HMM_AA_SA (α = 0.1) a    11.4%         (missing)

a The threshold used to discretize solvent accessibility. A residue is in the buried (B) category if its relative solvent accessibility is less than α, and exposed (E) otherwise.

3.3 The performance of HMM_AA_SA can be further improved: Discretizing solvent accessibility into three categories

In the previous section, we evaluated the performance of HMM_AA_SA with residue solvent accessibility divided into two categories: buried (B) and exposed (E). The results show that 0.1 is the best threshold for that discretization. In this section, we explore the discretization that divides residue solvent accessibility into three categories: buried (B), medium (M), and exposed (E). To divide solvent accessibility into three discrete categories, we need two thresholds α₁ and α₂, with α₁ < α₂. A residue is in the buried (B) category if its relative solvent accessibility (RSA) is less than α₁, medium (M) if α₁ ≤ RSA < α₂, and exposed (E) if RSA ≥ α₂. Using this discretization, every state of the hidden Markov model (HMM_AA_SA) emits one solvent accessibility letter (B, M, or E) in addition to one letter of amino acid. Thus, each state is associated with 60 emission probabilities corresponding to the 60 combinations of 20 amino acid letters and 3 solvent accessibility letters. Since 0.1 is the best threshold for dividing solvent accessibility into two categories, we fix α₁ = 0.1 and try various values of α₂, ranging from 0.2 to 0.9 in increments of 0.1. For each combination (α₁, α₂), we examined the sensitivity and false positive rate of HMM_AA_SA.
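The evaluation over threshold pairs can be sketched as below. The sketch makes several assumptions that the text does not specify: sequences are assigned to folds in an interleaved fashion, sensitivity is pooled over the three folds rather than averaged per fold, and model building and searching are hidden behind a hypothetical callback `is_detected(train, query)` that stands in for the modified HMMER software.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 3) -> List[Tuple[List[T], List[T]]]:
    """Split items into k (train, test) pairs; every item is tested exactly once."""
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [list(items[i::k]) for i in range(k)]  # interleaved assignment (assumed)
    splits = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, folds[i]))
    return splits


def cross_validated_sensitivity(
    positives: Sequence[T],
    is_detected: Callable[[List[T], T], bool],
    k: int = 3,
) -> float:
    """Pooled sensitivity: fraction of held-out positives that are detected.

    `is_detected(train, query)` is a hypothetical callback that builds a
    model from `train` and reports whether `query` is a significant hit.
    """
    hits = 0
    for train, test in k_fold_splits(positives, k):
        hits += sum(is_detected(train, query) for query in test)
    return hits / len(positives)


def alpha2_grid(alpha1: float = 0.1) -> List[Tuple[float, float]]:
    """Threshold pairs with alpha1 fixed and alpha2 from 0.2 to 0.9 in steps of 0.1."""
    return [(alpha1, round(0.1 * i, 1)) for i in range(2, 10)]
```

In use, each pair from `alpha2_grid()` would determine the discretization applied before training, and `cross_validated_sensitivity` would be called once per pair.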
The results (see Figure 4) show that the false positive rate remains at a low level when α₂ is within the range [0.2, 0.7]. The false positive rate increases dramatically when α₂ exceeds 0.7. Since we would like a method that achieves high sensitivity with a low false positive rate, we focus on the range where the false positive rate remains low, i.e., α₂ in the range [0.2, 0.7]. Figure 4 shows that in this range, HMM_AA_SA achieves the highest sensitivity (29.9%) when α₂ = 0.2. Table 2 compares HMM_AA (row 2), the HMM_AA_SA that uses 0.1 as the threshold to divide solvent accessibility into two categories (row 3), and the HMM_AA_SA that uses (0.1, 0.2) as the thresholds to divide solvent accessibility into three categories (row 4). Comparing rows 3 and 4 with row 2, we can see that adding solvent accessibility to the HMM improves its performance by greatly increasing sensitivity, with only a small increase in false positive rate. Comparing row 3 with row 4, we can see that dividing solvent accessibility into three categories improves the performance more than dividing it into two categories. We note that the HMM_AA_SA that divides solvent accessibility into three categories using (0.1, 0.2) as thresholds achieves a sensitivity of 29.9%, a dramatic increase compared with the sensitivity (2.8%) achieved by HMM_AA, while its false positive rate is still low (1.3%).

Figure 4. The performance of HMM_AA_SA with solvent accessibility divided into three categories (plot of sensitivity and false positive rate against the threshold pairs (0.1, 0.2) through (0.1, 0.9)). Solvent accessibility is discretized into three categories (buried (B), medium (M), and exposed (E)) using thresholds (α₁, α₂), with α₁ < α₂. A residue is in the buried (B) category if its relative solvent accessibility (RSA) is less than α₁, in the medium (M) category if α₁ ≤ RSA < α₂, and exposed (E) if RSA ≥ α₂.

Table 2. The performance of HMM_AA_SA can be improved by dividing solvent accessibility into three discrete categories

Method                              Sensitivity   False positive rate
HMM_AA                              2.8%          0%
HMM_AA_SA (α = 0.1) a               11.4%         (missing)
HMM_AA_SA (α₁ = 0.1, α₂ = 0.2) b    29.9%         1.3%

a Solvent accessibility is discretized into two categories (buried (B) and exposed (E)) using one threshold α. A residue is in the buried (B) category if its relative solvent accessibility is less than α, and exposed (E) otherwise.
b Solvent accessibility is discretized into three categories (buried (B), medium (M), and exposed (E)) using thresholds α₁ and α₂, with α₁ < α₂. A residue is in the buried (B) category if its relative solvent accessibility (RSA) is less than α₁, in the medium (M) category if α₁ ≤ RSA < α₂, and exposed (E) if RSA ≥ α₂.

4 CONCLUSION

In summary, we present a hidden Markov model method (referred to as HMM_AA_SA) for identification of Helix-Turn-Helix motifs. The method models both amino acid sequence and solvent accessibility. In each state, the model emits one letter of amino acid and one letter of solvent accessibility. The results show that adding solvent accessibility to the model can dramatically increase its sensitivity in identifying HTH motifs, with only a small increase in false positive rate.
By adding solvent accessibility to the model and using (0.1, 0.2) as the thresholds to divide solvent accessibility into three categories, the sensitivity is improved from 2.8% to 29.9%, with an increase of only 1.3% in false positive rate.

The results show that the hidden Markov model emitting only amino acids (HMM_AA) identifies HTH motifs with a sensitivity of only 2.8%. This low performance is not surprising, since the positive data set contains nineteen families of HTH motifs and the sequence similarity among these families is quite low. One approach to circumvent this obstacle of low similarity is to build a model for each HTH motif family. However, this approach requires assembling, for each family, a data set large enough to train the model. Furthermore, given the large variations in the sequence and structure of HTH motifs, it is likely that some HTH motifs do not fall into any currently defined family. To identify these new motifs, we need methods that capture the common features shared by different families of HTH motifs and can recognize HTH motifs across families. In this study, we aim to build a generic model that can be used to identify HTH motifs from all families. By adding solvent accessibility to the model and dividing solvent accessibility into three discrete categories, we are able to improve the sensitivity from 2.8% to 29.9%, with a false positive rate of only 1.3%. Although a sensitivity of 29.9% still seems low, it is a significant improvement considering the large variations in the data sets. It is worth noting that the value of the proposed method (HMM_AA_SA) lies in its ability to identify HTH motifs from different families. Using it, we have a better chance of recognizing HTH motifs that do not fall into any of the currently defined families. Combining such a generic model with the specific models built for each family will provide a useful tool for identifying HTH motifs.
We explored different criteria for the discretization of solvent accessibility. The results show that 0.1 is the best threshold for discretizing solvent accessibility into two categories, and (0.1, 0.2) are the best thresholds for three categories. The results also show that dividing solvent accessibility into three categories brings a greater improvement in performance than dividing it into two categories. It will be interesting to explore discretizations that divide solvent accessibility into more than three categories. However, each time the number of categories increases by 1, the number of parameters (emission probabilities) in each state increases by 20. Current data sets, especially the positive data set, are not large enough to estimate a large number of parameters. One direction for future study is to assemble larger data sets to explore the discretization of solvent accessibility into more than three categories. Another possible direction is to modify the hidden Markov model to incorporate more data (e.g., secondary structure, hydrophobicity, and electrostatics) in addition to protein sequence and solvent accessibility.

REFERENCES:

[1] K. Mathee and G. Narasimhan, "Detection of DNA-binding Helix-Turn-Helix motifs in proteins using the pattern dictionary method," in Methods Enzymol, S. A. a. S. Garges, Ed., vol. 370, Academic Press, 2003.
[2] G. Narasimhan, C. Bu, Y. Gao, X. Wang, N. Xu, and K. Mathee, "Mining protein sequences for motifs," J Comput Biol, vol. 9.
[3] S. Jones, J. A. Barker, I. Nobeli, and J. M. Thornton, "Using structural motif templates to identify proteins with DNA binding function," Nucl. Acids Res., vol. 31.
[4] W. A. McLaughlin and H. M. Berman, "Statistical models for discerning protein structures containing the DNA-binding Helix-Turn-Helix motif," Journal of Molecular Biology, vol. 330.
[5] H. P. Shanahan, M. A. Garcia, S. Jones, and J. M. Thornton, "Identifying DNA-binding proteins using structural motifs and the electrostatic potential," Nucleic Acids Res, vol. 32.
[6] M. Pellegrini-Calace and J. M. Thornton, "Detecting DNA-binding Helix-Turn-Helix structural motifs using sequence and structure information," Nucl. Acids Res., vol. 33.
[7] C. Ferrer-Costa, H. P. Shanahan, S. Jones, and J. M. Thornton, "HTHquery: a method for detecting DNA-binding proteins with a Helix-Turn-Helix structural motif," Bioinformatics, vol. 21.
[8] R. Wintjens and M. Rooman, "Structural classification of HTH DNA-binding domains and protein-DNA interaction modes," Journal of Molecular Biology, vol. 262.
[9] K. H. Choo, J. C. Tong, and L. Zhang, "Recent applications of hidden Markov models in computational biology," Genomics Proteomics Bioinformatics, vol. 2.
[10] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov models in computational biology: Applications to protein modeling," Journal of Molecular Biology, vol. 235.
[11] S. Eddy, "Profile hidden Markov models," Bioinformatics, vol. 14.
[12] R. D. Finn, J. Mistry, B. Schuster-Bockler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. R. Eddy, E. L. L. Sonnhammer, and A. Bateman, "Pfam: clans, web tools and services," Nucl. Acids Res., vol. 34.
[13] P. Bagos, T. Liakopoulos, I. Spyropoulos, and S. Hamodrakas, "A hidden Markov model method, capable of predicting and discriminating beta-barrel outer membrane proteins," BMC Bioinformatics, vol. 5, p. 29.
[14] K. Karplus, K. Sjolander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, and C. Sander, "Predicting protein structure using hidden Markov models," Proteins, vol. Suppl 1.
[15] M. Delorenzi and T. Speed, "An HMM model for coiled-coil domains and a comparison with PSSM-based predictions," Bioinformatics, vol. 18.
[16] A. C. Camproux and P. Tuffery, "Hidden Markov model-derived structural alphabet for proteins: The learning of protein local shapes captures sequence specificity," Biochim Biophys Acta, vol. 1724.
[17] V. Alexandrov and M. Gerstein, "Using 3D hidden Markov models that explicitly represent spatial coordinates to model and compare protein structures," BMC Bioinformatics, vol. 9, p. 2.
[18] J. Hargbo and A. Elofsson, "Hidden Markov models that use predicted secondary structures for fold recognition," Proteins, vol. 36.