Local structure prediction with local structure-based sequence profiles


BIOINFORMATICS Vol. 19 no. 10 2003
DOI: /bioinformatics/btg151

An-Suei Yang and Lu-yong Wang

Department of Pharmacology, Columbia Genome Center, and Center for Computational Biology and Bioinformatics, Columbia University, 630 West 168th Street, PH 7W Room 318, New York, NY 10032, USA

Received on November 5, 2002; revised on January 27, 2003; accepted on January 29, 2003

ABSTRACT

Motivation: A large body of experimental and theoretical evidence suggests that local structural determinants are frequently encoded in short segments of protein sequence. Although local structural information, once recognized, is particularly useful in protein structural and functional analyses, identifying embedded local structural codes based solely on sequence information remains a difficult problem.

Results: In this paper, we describe a local structure prediction method that predicts the backbone structures of nine-residue sequence segments. Two elements are key to this procedure. The first is the LSBSP1 database, which contains a large number of non-redundant local structure-based sequence profiles for nine-residue structure segments. The second is the consensus approach, which identifies a consensus structure from a set of hit structures. The procedure starts by matching a query sequence segment of nine consecutive amino acid residues to all the sequence profiles in the local structure-based sequence profile database (LSBSP1). The consensus structure, which is at the center of the largest structural cluster of the hit structures, is predicted to be the native state structure adopted by the query sequence segment. This local structure prediction method is assessed with a large set of random test protein structures that have not been used in constructing the LSBSP1 database.
The benchmark results indicate that the prediction capacities of the novel local structure prediction procedure exceed those of the local backbone structure prediction methods based on the I-sites library by a significant margin.

Availability: All the computational and assessment procedures have been implemented in the integrated computational system PrISM.1 (Protein Informatics System for Modeling). The system and associated databases for LINUX systems can be downloaded from the website: ay1/.

Contact: ay1@columbia.edu

To whom correspondence should be addressed.

INTRODUCTION

A large body of experimental evidence shows that well-formed substructures distributed in protein folding pathways are mostly native-like local structures (see for example Alm et al., 2002 and the references therein). Moreover, studies on the denatured state under non-denaturing conditions indicate that local structures observed in this state are frequently native-like and remain in the folding nuclei leading to the native structures (Choy and Forman-Kay, 2001; Gillespie and Shortle, 1997; Wong et al., 2000). These results suggest that local structure formation in some regions of a protein sequence must carry more weight in guiding the folding process to the native structure; that is, structural information encoded in some sequence segments should be recognizable independent of interactions involving residues that are distant in sequence.

Computational work on protein structure prediction has shown that local structure predictions are useful in predicting protein structures from protein sequences (Baldwin and Rose, 1999; Bonneau et al., 2001; Fidelis et al., 1994; Rooman et al., 1991; Simon et al., 1991; Unger et al., 1989; Yang and Wang, 2002). A recent prediction methodology based on the I-sites library was used in local structure prediction with significant success (Bystroff and Baker, 1998; Bystroff et al., 2000).
The methodology was the only local backbone structure prediction procedure that had been extensively assessed with a large set of test proteins and with three-dimensional structural comparison measures (Bystroff et al., 2000).

Bioinformatics 19(10) © Oxford University Press 2003; all rights reserved.

In this paper, we describe a new local structure prediction method. The goal of this method is to predict the backbone conformation of a nine-residue sequence segment based solely on the sequence information in the query sequence. We focused on nine-residue sequence segments

in the local structure prediction methodology because this specific sequence length has been reported to contain optimal local structural information (Bystroff et al., 1996). The details of the local structure prediction method are described in the Methods section. The prediction accuracy and coverage are assessed with a large set of random test proteins. The prediction capacities of the local structure prediction procedure are compared with the prediction capacities of the I-sites data library in the Results and Discussion section.

METHODS

Local structure-based sequence profile database LSBSP1

The construction of the LSBSP1 database is summarized in a flow chart available from our ftp server (ftp://ps7ayang.cpmc.columbia.edu/pub/lsbsp1flow1.pdf). A seed nine-residue segment from a protein structure is used as a probe to search through a pool of nine-residue segments and select all the segments that are similar in structure to the seed segment. A total of nine-residue segments are excised from a set of non-redundant protein structures (PDB SELECT 25 (Hobohm et al., 1992) version Feb/2001, with no pair-wise sequence identity greater than 25%) to form the segment pool. No two segments in the pool are identical, but the segments can overlap by up to eight residues.

The structural comparison is carried out by assigning a backbone conformational state, represented by a character (such as a, b, p, l...), to each residue in the segments. Each character represents a range of the φ- and ψ-angles defined by Oliva et al. (1997). The ranges of the φ- and ψ-angles of the conformational states are also shown in the Ramachandran plot in Figure 1. The conformational states defined by Oliva et al. essentially encompass all major clusters of the φ-ψ distribution observed in the known protein structural space.
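The segment excision and state-based structural comparison described above can be sketched as follows. This is a minimal illustration, not the PrISM.1 implementation: it assumes each chain has already been converted to a string of per-residue conformational-state characters, and the names `excise_segments`, `build_pool` and `structural_matches` are hypothetical. It does not remove duplicate sequence segments from the pool.

```python
from collections import defaultdict

SEGMENT_LENGTH = 9  # nine-residue segments, as in the text


def excise_segments(states, length=SEGMENT_LENGTH):
    """Return (start, window) for every overlapping window of `length` residues."""
    return [(start, states[start:start + length])
            for start in range(len(states) - length + 1)]


def build_pool(chains):
    """Index all segments of all chains by their conformational-state string.

    `chains` maps a chain identifier to its per-residue state string
    (hypothetical input format).
    """
    pool = defaultdict(list)
    for chain_id, states in chains.items():
        for start, window in excise_segments(states):
            pool[window].append((chain_id, start))
    return pool


def structural_matches(pool, seed_states):
    """Segments whose state string is identical to that of the seed segment."""
    return pool.get(seed_states, [])


# Toy example with made-up state strings.
chains = {"x": "aaaaaaaaaabb", "y": "aaaaaaaaa"}
pool = build_pool(chains)
matches = structural_matches(pool, "a" * 9)
```

Indexing the pool by state string makes the lookup for each seed a single dictionary access, which matters when every segment in the pool is used as a seed in turn.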
All the nine-residue segments that are identical in backbone conformational state to the seed segment form a subset of segments with a similar backbone structure. These segments are then compared with the seed segment in sequence similarity. The sequence similarity is calculated with the structure-specific amino acid substitution matrices that we developed to align distantly related protein pairs (Yang, 2002). Sequence segments that are identical in backbone conformational states and have amino acid replacement scores above a threshold (>0) in comparison with the seed sequence are aligned to construct a preliminary local structure-based sequence profile for the seed segment. The preliminary local structure-based sequence profile is then converted to a position specific score matrix (pre-pssm) in half-bit units with the Bayesian prediction pseudo-count method (Tatusov et al., 1994):

$$W_{Ji} = 2 \log_2 \left( \frac{q_{Ji}}{p_i} \right) \qquad (1)$$

where $p_i$ is the background probability (Tatusov et al., 1994) for amino acid type $i$, and

$$q_{Ji} = \frac{C_{Ji} + \left( B + M - \sum_{k=1}^{20} C_{Jk} \right) p_i}{M + B} \qquad (2)$$

where $C_{Ji}$ is the number of times amino acid type $i$ appears in column $J$ of the sequence profile and $M$ is the number of rows in the sequence profile. The term $\left( B + M - \sum_{k=1}^{20} C_{Jk} \right)$ in the numerator is the pseudo-count, where $B = M^{0.5}$ is considered adequate (Tatusov et al., 1994).

We take the approximation of assigning equal weights to the sequences in the profile because the sequences are all distant in sequence similarity (<25% in sequence ID). Even in the very rare cases where conserved local sequence motifs might bias a few sequence profiles by slightly over-representing the conserved sequence features under the equal-weight approximation, such rare over-representing profiles in the database are not expected to have a negative effect on the local structure prediction procedure that follows.
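As an illustration of Equations (1) and (2), the sketch below converts a set of aligned nine-residue segments into a half-bit score matrix and scores a query segment against it. It is a minimal sketch under stated assumptions: the reading of Equation (2) used here has the pseudo-count term B + M minus the column total, multiplied by the background probability; the uniform background `UNIFORM_BACKGROUND` is a placeholder rather than the background of Tatusov et al. (1994); and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder background probabilities (assumption: uniform, for illustration only).
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def build_pssm(segments, background=UNIFORM_BACKGROUND):
    """Build a position specific score matrix in half-bit units.

    `segments` is a list of equal-length, equally weighted sequence segments.
    Returns one dict of scores (amino acid -> W) per profile column.
    """
    m = len(segments)        # M: number of rows in the profile
    b = math.sqrt(m)         # B = M^0.5
    pssm = []
    for j in range(len(segments[0])):
        column = [segment[j] for segment in segments]
        counts = {aa: column.count(aa) for aa in AMINO_ACIDS}   # C_Ji
        pseudo = b + m - sum(counts.values())                   # pseudo-count term
        scores = {}
        for aa in AMINO_ACIDS:
            q = (counts[aa] + pseudo * background[aa]) / (m + b)  # Equation (2)
            scores[aa] = 2.0 * math.log2(q / background[aa])      # Equation (1)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sequence-profile match score: sum of the column scores for the segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


profile = ["ACDEFGHIK"] * 4          # toy profile of four identical segments
pssm = build_pssm(profile)
good = score_segment(pssm, "ACDEFGHIK")
poor = score_segment(pssm, "KKKKKKKKK")
```

With this form of Equation (2) the probabilities in each column sum to one, which is a useful sanity check on any implementation. The match score returned by `score_segment` is the quantity that would be compared against the score thresholds quoted in the text.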
This pre-pssm ($[W_{Ji}]$, a 20 × 9 matrix) is used to search the pool for nine-residue segments with backbone conformational states identical to those of the seed segment and with sequence-profile match scores higher than a threshold (>15). This set of segments forms a refined local structure-based sequence profile for the seed segment. The reason underlying this procedure is that a local structure can usually be adopted by many different sequences; if all the sequences with the same backbone structure were included in one sequence profile, the information content of the sequence profile would become so diluted that its prediction capacity would diminish. This is particularly true for the all-α and all-β nine-residue segments. The local structure-based sequence profile for the seed segment is designed to reflect only the sequence variations within the family of sequence segments that are closely related not only in structure but also in sequence; that is, the sequence profile needs to have high information content related to the preference in the sequence space near the seed structure.

The procedure described above was applied to all the sequence segments with nine consecutive residues in the non-redundant protein structures to construct the LSBSP1 data library in the PrISM.1 system. Not all the seed segments have corresponding local structure-based sequence profiles in LSBSP1; some of the seed segments are unique in sequence and/or structure such that few sequence segments are selected for these seeds

to construct the refined local structure-based sequence profiles. Overall, local structure-based sequence profiles are stored in the LSBSP1 data library; each of the sequence profiles has more than 10 sequences. The information content in some of the sequence profiles in the database is redundant. The concept behind the construction of the LSBSP1 database is not to throw away any information during database construction, but to merge the related information in the prediction procedure according to the query sequence and the prediction results. This coupled prediction procedure is described in what follows.

Fig. 1. Local structure prediction profile and the accuracy/coverage measures for the local structure predictions. The PDB code names and residue ranges shown on the left-hand side of the local structure prediction profile indicate the structural segments that are predicted to relate to the query sequence, which is shown in the first row of the local structure prediction profile. The local structure prediction profile shows the backbone conformational states of the structural segments that are predicted to relate to the query sequence. The backbone torsion angle ranges of the backbone conformational states (A, B, G and E) in the local structure prediction profile are defined on the right-hand side of the Ramachandran plot below the local structure prediction profile. The definitions of the conformational states shown in the Ramachandran plot were obtained from Oliva et al. (1997). This classification of the conformational states essentially encompasses all major clusters of backbone conformation in known protein structural space (Creighton, 1993).

Prediction procedure coupled with LSBSP1 database: the consensus approach

The local structure prediction procedure coupled with the LSBSP1 database is summarized in a flow chart available from our ftp server (ftp://ps7ayang.cpmc.columbia.edu/pub/lsbsp1flow2.pdf).
The secondary structure of a query sequence is first predicted with the PSI-PRED program (Jones, 1999). The PSI-PRED secondary structure prediction method has recently become available in the public domain and has been assessed independently with a large set of test proteins that were newly released and are

not expected to be part of the training set of the prediction method. Our test showed that the three-state accuracy rate is 78% for PSI-PRED, which is close to the released prediction accuracy rate. Based on this test and the availability of the program, we use PSI-PRED as an integral part of our local structure prediction method.

After the secondary structure prediction, the query sequence is parsed into overlapping sequence segments of nine consecutive residues. Each of the sequence segments is used as a query sequence segment to match against all the local structure-based sequence profiles in LSBSP1. These sequence-profile matches are scored with the PSSM derived for each of the local structure-based sequence profiles according to Equations (1) and (2). For each of the query segments, the sequence profiles in LSBSP1 with a sequence-profile matching score above a threshold (>20) and with at least 60% of the secondary structure consistent with the PSI-PRED prediction are selected. Anecdotal evidence suggests that this secondary structure matching level (60%) is optimal in coupling the two prediction procedures: a lower level would reduce the bias power of the correct PSI-PRED predictions, while a higher level would not allow the incorrect PSI-PRED predictions to be corrected in the local structure prediction procedure.

The consensus score of the seed structure of the k-th selected sequence profile is calculated with Equation (3) (Yang and Wang, 2002):

$$\text{consensus score}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq k}}^{m} M(q_{k,i}, q_{j,i}) \qquad (3)$$

where $m$ is the number of selected sequence profiles and $n$ is the length of the multi-alignment ($n = 9$). $q_{i,j}$ is the backbone conformational state (a, b, g, e, etc.), as defined by the range of φ- and ψ-angles (see Fig. 1), for residue $j$ in sequence $i$. $M(q_{k,i}, q_{j,i})$ equals 1 only when $q_{k,i}$ and $q_{j,i}$ are the same conformational state; otherwise, $M(q_{k,i}, q_{j,i}) = -1$.
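The consensus score of Equation (3) can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the hit structures are assumed to be given as nine-character conformational-state strings, the mismatch value of the match function is taken to be -1 (an assumption about the sign in the printed equation), and the function names are hypothetical.

```python
def match(state_a, state_b):
    """M(q_k, q_j): +1 for identical conformational states, -1 otherwise (assumed)."""
    return 1 if state_a == state_b else -1


def consensus_score(hits, k):
    """Consensus score of hit `k` against all other hits, following Equation (3)."""
    n = len(hits[k])  # length of the multi-alignment (9 in the text)
    total = 0
    for i in range(n):
        for j, other in enumerate(hits):
            if j != k:
                total += match(hits[k][i], other[i])
    return total / n


def predict_structure(hits):
    """Return the hit with the highest consensus score (first one on ties)."""
    best = max(range(len(hits)), key=lambda k: consensus_score(hits, k))
    return hits[best]


# Toy example: two helical hits outvote one extended hit.
hits = ["aaaaaaaaa", "aaaaaaaaa", "bbbbbbbbb"]
predicted = predict_structure(hits)
```

Because every hit is compared with every other hit, the cost grows quadratically with the number of selected profiles; for the small hit sets produced by a score cutoff this is unlikely to matter.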
The structure that has the highest consensus score is used as the predicted structure for the query segment. The predicted structures for the overlapping query sequence segments are compiled in a local structure prediction profile as shown in Figure 1.

One of the major principles underlying the local structure prediction methodology is to use the consensus approach to reach a final prediction for the structure of a query sequence segment. The rationale is based on the finding that the largest cluster of similar structures sampled by a protein conformational search procedure for a polypeptide chain is frequently the closest to the native state structure of the polypeptide chain (Shortle et al., 1998; Xiang et al., 2002). If we assume, as a first approximation, that the structures selected by a certain sequence-profile matching score cutoff represent the accessible conformations of the query sequence segment under native conditions, the consensus structure, which is by definition the center of the largest cluster of similar structures, is expected to be the native state structure of the query sequence.

Benchmark the accuracy and coverage of the local structure prediction method

The local structure predictions are assessed with accuracy and coverage rates:

PCPcoverage: Position-wise consensus prediction (PCP) is the prediction based on the majority of the backbone conformational state predictions in a column (position) of the local structure prediction profile. Positions with PCP are the columns of the local structure prediction profile that show consensus in backbone conformational state predictions, with the majority prediction count higher than or equal to a threshold. In the example of Figure 1, the consensus level threshold is set at 5 and the columns with majority prediction counts ≥ 5 are shown in bold characters. The PCPcoverage is the percentage of residue positions (columns) with PCP over the total number of positions in the test proteins.
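The position-wise consensus prediction can be sketched as a column-wise majority vote over the overlapping predicted segments. The sketch below is illustrative only: the prediction profile is assumed to be a list of (start position, state string) pairs, columns in which two states tie for the majority are treated as inconclusive, and all names are hypothetical. It also computes the accuracy over the positions with PCP, which the text defines as PCPaccuracy.

```python
from collections import Counter


def position_consensus(predictions, length, threshold):
    """Majority state per query position, or None where there is no PCP.

    `predictions` is a list of (start, states) pairs, one per predicted segment;
    `length` is the query length; `threshold` is the consensus level threshold.
    """
    columns = [Counter() for _ in range(length)]
    for start, states in predictions:
        for offset, state in enumerate(states):
            columns[start + offset][state] += 1
    pcp = []
    for counts in columns:
        ranked = counts.most_common(2)
        if not ranked or ranked[0][1] < threshold:
            pcp.append(None)      # majority count below the threshold
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            pcp.append(None)      # tie between two states: inconclusive
        else:
            pcp.append(ranked[0][0])
    return pcp


def pcp_coverage_and_accuracy(pcp, native):
    """Fraction of positions with PCP, and fraction of those predicted correctly."""
    called = [(p, t) for p, t in zip(pcp, native) if p is not None]
    coverage = len(called) / len(native)
    accuracy = sum(p == t for p, t in called) / len(called) if called else 0.0
    return coverage, accuracy


# Toy example with four-residue segments to keep the profile short.
predictions = [(0, "AAAB"), (1, "AABB"), (2, "ABBB")]
pcp = position_consensus(predictions, length=6, threshold=2)
coverage, accuracy = pcp_coverage_and_accuracy(pcp, native="AABBBB")
```

Raising the threshold leaves fewer positions with a consensus call, which is the trade-off between coverage and accuracy that the benchmark explores.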
PCPaccuracy: This is the percentage of the positions with PCP that are predicted correctly (shown in italic in Fig. 1) over the total positions with PCP.

SCPcoverage: The segment-wise consensus prediction (SCP) is based on the number of residues with PCP in a segment (row) of a local structure prediction profile. A segment with SCP is a segment (row) of the local structure prediction profile for which the number of residue positions with PCP is higher than or equal to the set consensus level threshold. The underlined rows in the center box of the local structure prediction profile in Figure 1 are the segments for each of which the number of PCP (bold characters) is higher than or equal to the consensus level threshold (5). The SCPcoverage is the percentage of the query residues that are predicted at least once in a segment with SCP.

SCPaccuracy: This is the percentage of the correct backbone conformational state predictions (shown in italic in Fig. 1) over all the predictions in the segments with SCP.

RMSDaccuracy: This prediction accuracy measure follows the definition of the rmsd measure by Bystroff and Baker (1998). The SCPaccuracy measure (see above) counts predictions for the same position multiple times (see Fig. 1). An alternative that avoids the multiple counting is to count the positions for which at least one of the overlapping nine-residue (eight-residue in the Bystroff-Baker rmsd measure; Bystroff and Baker, 1998) segments is predicted correctly based on the RMSD < 1.4 Å threshold. The correctly predicted segments based on this RMSD criterion are shaded in gray in the local structure prediction profile shown in Figure 1. The

RMSDaccuracy is the percentage of correctly predicted positions (shaded residues in the second row of the local structure prediction profile) over the positions with SCP (underlined residues in the second row of the local structure prediction profile). Figure 1 shows one example for each of the accuracy/coverage measures described in this section.

RESULTS AND DISCUSSION

A test set of 241 protein structures was obtained from a newer version of the PDB SELECT25 list (version Sep/2001), in which no protein pair has pair-wise sequence identity greater than 25%. All the PDB SELECT25 (Sep/2001) proteins that are not related in sequence (pair-wise Smith-Waterman alignment p-value > $10^{-6}$, or sequence ID < 18% on average; Yang, 2002) to the sequences in LSBSP1 (from PDB SELECT25 version Feb/2001) were used as test proteins. This p-value criterion essentially eliminates detectable sequence relationships between the training set and the test set (Yang, 2002). Each of the test proteins was used as a query sequence for the prediction procedure (see the Methods section). There was no pre-treatment to eliminate low complexity regions (for example with the SEG program; Wootton and Federhen, 1996) from the test sequences. The prediction results were summarized in a local structure prediction profile similar to the example shown in Figure 1. The prediction accuracy and coverage were then benchmarked with the accuracy and coverage measures described in the previous section and in Figure 1. The results are summarized in Table 1.

There are residues with backbone conformational state assigned in the test proteins. The distribution of these test residues in backbone conformational state is shown in the right margin of Table 1a.
At the lowest consensus level (consensus level = 1), 94.6% of the test residues are predicted with PCP; the remaining 5.4% have equal consensus counts for two or more backbone conformational states and are thus not conclusive in PCP. Of the 94.6% of test residues with PCP, 79.0% (see Table 1b) are predicted correctly, as judged by comparing the predicted backbone conformational states with the native state structures of the test proteins. The distributions of the correct predictions in the A, B, G and E backbone conformational states are shown in Table 1.

To isolate the effect of the secondary structure predictions from PSI-PRED, we switched off the secondary structure prediction in the prediction procedure (see the Methods section) and reassessed the local structure prediction method. The PCPaccuracy at the lowest consensus level was reduced to 71.9% (from 79.0%, see above) while the PCPcoverage remained similar (95.1% in comparison with 94.6%), indicating that the bias introduced by the PSI-PRED secondary structure predictions improved the accuracy of the local structure prediction method.

The HMMSTR predictions based on the I-sites data library were tested on residues (data from Table 5 of Bystroff et al., 2000). To facilitate the comparison of the benchmark results of the local structure prediction methods, we grouped the backbone conformational states in the work of Bystroff et al. into four major conformational states as well: A′ = H + G, B′ = B + E + d + b + e, G′ = L + l, and E′ = x; the backbone conformational states on the right-hand sides of the equations were defined by Bystroff et al. (2000). The A′, B′, G′ and E′ states are approximately equivalent to the A, B, G and E states defined in Figure 1: the residue population ratio for A′ : B′ : G′ : E′ is 1 : 0.81 : : for the test residues in the HMMSTR predictions (data from Table 5 of Bystroff et al., 2000).
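The grouping of the HMMSTR backbone states into four major states amounts to a simple lookup table. The sketch below encodes only the grouping stated in the text; the dictionary and function names are hypothetical, and the HMMSTR state symbols are assumed to be supplied one character per residue.

```python
# Grouping from the text: A = H + G, B = B + E + d + b + e, G = L + l, E = x,
# where the right-hand sides are HMMSTR states (Bystroff et al., 2000).
HMMSTR_TO_MAJOR = {
    "H": "A", "G": "A",
    "B": "B", "E": "B", "d": "B", "b": "B", "e": "B",
    "L": "G", "l": "G",
    "x": "E",
}


def group_states(hmmstr_states):
    """Translate a string of HMMSTR states into the four major states."""
    return "".join(HMMSTR_TO_MAJOR[state] for state in hmmstr_states)


grouped = group_states("HGBEdbeLlx")
```

Note that the HMMSTR symbols G and E are not the same as the major states G and E, so the translation has to be applied before any per-state comparison between the two methods.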
This is similar to the residue population ratio for A : B : G : E = 1 : 0.73 : : for the test residues shown in Table 1. The benchmark results for the HMMSTR backbone conformation predictions showed that 82.0% of the A′ residues, 71.6% of the B′ residues, 15.5% of the G′ residues and 22.6% of the E′ residues were predicted correctly (data from Table 5 of Bystroff et al., 2000). Overall, 74.0% of the test residues were predicted correctly in these four major backbone conformational states (A′, B′, G′ and E′). The prediction results for consensus level = 1 in Table 1 are comparable to the HMMSTR prediction results shown above; 86.0% of the A residues, 76.2% of the B residues, 36.3% of the G residues and 7.6% of the E residues were predicted correctly. Overall, 79.0% of the test residues were predicted correctly in the four backbone conformational states (A, B, G and E). Note that our prediction rate is based on 94.6% of the test residues because 5.4% of the predictions were not conclusive in the consensus prediction (see above). In the worst-case scenario, in which all of the 5.4% non-conclusive predictions are considered incorrect, the correct rate would drop from 79.0 to 74.7% (79.0% × 94.6%). The comparison indicates that the local structure prediction method in this work is at least comparable to, if not better than, the HMMSTR prediction method in backbone torsion angle predictions.

We assessed the local structure prediction results with the RMSD calculated from the match of the predicted structures with the correct structures. Figure 2a shows the distribution of the RMSD for the predicted structures superimposed on the native state structures of the query segments with SCP at consensus level = 8. The true-negative distribution of the RMSD shown in Figure 2a was calculated with random matches of nine-residue segments from all-α proteins to nine-residue segments from all-β proteins. The distribution of the RMSD of

the true-negative nine-residue segment pairs shows that the probability (or p-value) for a random match with RMSD < 1.4 Å is $10^{-4}$ and the probability (or p-value) for a random match with RMSD < 2.4 Å is $10^{-2}$. Figure 2a shows that 89% of the query segments with SCP (consensus level = 8) are predicted correctly in overall structure prediction with RMSD < 2.4 Å or with p-value < $10^{-2}$. This accuracy percentage decreases as the SCPcoverage increases. The accuracy percentage is plotted as a function of SCPcoverage in Figure 2b (see solid triangles in the Figure).

The RMSDaccuracy shown in Table 1e is plotted as a function of SCPcoverage (see Table 1c) in Figure 2b (see solid squares in the Figure). This accuracy/coverage plot is compared with the benchmark results of the I-sites local structure predictions on a total of test residues (see solid circles in Fig. 2b, data obtained from Table 1 of the work by Bystroff and Baker, 1998). The I-sites local structure predictions were assessed with the rmsd measure (Bystroff and Baker, 1998). This rmsd measure is comparable to the RMSDaccuracy measure in this work (see the Methods section).

Table 1. Assessment of the prediction capacities of the LSBSP1 database and the consensus approach for local structure predictions. In each panel the values are listed in order of increasing consensus level threshold from left to right.

(a) PCPcoverage^a
State                                                           Test residue number
A      96.2%  93.7%  89.0%  82.4%  73.7%  63.1%  51.7%  39.0%
B      93.4%  89.4%  83.0%  74.1%  62.0%  48.5%  35.3%  22.5%
G      90.3%  87.9%  82.6%  72.5%  56.5%  40.8%  28.8%  19.7%   1491
E      85.2%  81.3%  73.5%  58.1%  40.3%  25.2%  15.0%   7.8%   461
All    94.6%  91.5%  86.1%  78.3%  67.8%  55.7%  43.6%  31.2%

(b) PCPaccuracy^b
State                                                           Predictions
A      86.0%  86.4%  87.2%  88.5%  90.4%  92.6%  94.6%  96.4%
B      76.2%  76.5%  77.3%  78.5%  80.6%  83.5%  86.2%  88.0%
G      36.3%  37.3%  38.9%  42.5%  47.9%  54.3%  56.9%  61.4%
E       7.6%   8.0%   8.6%   8.6%   6.5%   5.2%   5.8%  11.1%
All    79.0%  79.4%  80.3%  81.9%  84.5%  87.6%  90.3%  92.7%

(c) SCPcoverage^c
State
A      99.0%  98.0%  95.8%  90.4%  77.2%  55.7%  30.7%
B      97.6%  96.1%  91.2%  81.1%  61.3%  34.2%  10.7%
G      98.6%  96.7%  93.3%  84.2%  64.7%  36.8%  15.2%
E      97.8%  94.8%  87.6%  77.4%  52.3%  23.4%   5.9%
All    98.4%  97.1%  93.7%  86.2%  69.9%  45.9%  21.7%

(d) SCPaccuracy^d
State                                                    Predictions
A      84.2%  84.5%  85.5%  87.7%  91.2%  94.8%  98.2%
B      73.0%  73.2%  73.8%  74.7%  76.6%  78.5%  79.5%
G      34.9%  35.2%  35.9%  38.2%  42.8%  50.0%  60.8%
E      10.2%  10.1%  10.5%  10.6%  10.2%   9.9%   2.7%
All    76.9%  77.2%  78.2%  80.3%  84.2%  88.9%  94.3%

(e) RMSDaccuracy^e
                                                         Predictions
%correct  62.1%  62.4%  63.5%  65.7%  68.6%  73.7%  82.6%

a The PCPcoverage rates are shown in percentage (%) as the consensus level threshold for predictions (first row) increases from left to right. Rows 1-4 show the PCPcoverage for each of the conformational states (A, B, G and E, see Fig. 1 for definition) at various consensus level thresholds. The right margin shows the total number of residues in the test proteins.
b The Table shows the prediction results in PCPaccuracy measure as the consensus level for the predictions increases from left to right.
c Same as in PCPcoverage except that the Table shows the results in SCPcoverage measure.
d Same as in PCPaccuracy except that the Table shows the results in SCPaccuracy measure.
e The prediction results are shown in RMSDaccuracy measure. See the Methods section and Figure 1 for the definition of the accuracy/coverage measures.

The comparison of the two

accuracy/coverage plots (solid squares and solid circles in Fig. 2b) indicates that the prediction accuracy of the local structure prediction method described in this work is higher than the prediction accuracy of the I-sites library by more than 10% at all coverage levels. At the same level of prediction accuracy, the local structure prediction method in this work covers two to three times as many test residues as the predictions based on the I-sites library.

Fig. 2. (a) Distribution of pair-wise RMSD. The gray histogram shows the distribution of the RMSD calculated from the superimposition of the predicted structure on the correct structure. The predicted structures are predicted with SCP at consensus level = 8. The white histogram shows the distribution of RMSD calculated from random matches of nine-residue segments from all-α proteins with nine-residue segments from all-β proteins. (b) Prediction accuracy plotted against coverage. The solid triangles are the percentage of correctly predicted nine-residue segments for segments with SCP based on the RMSD < 2.4 Å or p-value < $10^{-2}$ criterion. The data for the solid squares are obtained from Table 1c for coverage and Table 1e for accuracy. The solid circles show the I-sites prediction accuracy and coverage rates based on the confidence level of >0.2, >0.4, and >0.6 for the data points from right to left respectively (data obtained from Table 1 of the work by Bystroff and Baker, 1998).

CONCLUSIONS

We developed a new local structure prediction method in PrISM.1 for nine-residue sequence segments. The novelty of the local structure prediction procedure resides in the LSBSP1 data library, which contains a large set of structure-based sequence profiles derived from multiple structural alignments of nine-residue sequence segments. The concept is in contrast to that behind the construction of the I-sites library, where sequence profile segments were clustered based on conserved sequence features and then merged to form fewer than 100 sequence profiles based on local structural similarity (Bystroff and Baker, 1998). Moreover, one major advantage of the PrISM local structure prediction procedure is that it predicts local structures for single sequences. In contrast, prediction methods based on the I-sites library predict local structures for sequence profile segments derived from multiple sequence alignments of protein families.

Many aspects of the local structure prediction procedure in this work remain to be optimized to further enhance the prediction capacities of the methodology. In particular, we are just beginning to understand the structural roles of the length and the distribution of the sequence segments that are highly specific in local structural determinants. In this work, we have developed a general approach to identifying local structural codes based on sequence information in the query protein.

ACKNOWLEDGEMENT

This work was supported by the William J. Matheson foundation. We would like to thank David Jones for making the PSI-PRED program available in the public domain.

REFERENCES

Alm,E., Morozov,A.V., Kortemme,T. and Baker,D. (2002) Simple physical models connect theory and experiment in protein folding kinetics. J. Mol. Biol., 322,
Baldwin,R.L. and Rose,G.D. (1999) Is protein folding hierarchic? II. Folding intermediates and transition states. Trends Biochem. Sci., 24,
Bonneau,R., Tsai,J., Ruczinski,I., Chivian,D., Rohl,C., Strauss,C.E. and Baker,D. (2001) Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins (Suppl. 5),
Bystroff,C. and Baker,D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J. Mol. Biol., 281,
Bystroff,C., Simons,K.T., Han,K.F. and Baker,D. (1996) Local sequence-structure correlations in proteins. Curr. Opin. Biotechnol., 7,
Bystroff,C., Thorsson,V. and Baker,D. (2000) HMMSTR: a hidden Markov model for local sequence-structure correlation in proteins. J. Mol. Biol., 301,
Choy,W.Y. and Forman-Kay,J.D. (2001) Calculation of ensembles of structures representing the unfolded state of an SH3 domain. J. Mol. Biol., 308,
Creighton,T.E. (1993) Proteins: Structures and Molecular Properties, second edn, W.H. Freeman and Company, New York.
Fidelis,K., Stern,P.S., Bacon,D. and Moult,J. (1994) Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng., 7,
Gillespie,J.R. and Shortle,D. (1997) Characterization of long-range structure in the denatured state of staphylococcal nuclease. II. Distance restraints from paramagnetic relaxation and calculation of an ensemble of structures. J. Mol. Biol., 268,
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1,
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrix. J. Mol. Biol., 292,
Oliva,B., Bates,P.A., Querol,E., Aviles,F.X. and Sternberg,M.J. (1997) An automated classification of the structure of protein loops. J. Mol. Biol., 266,
Rooman,M.J., Kocher,J.P. and Wodak,S.J. (1991) Prediction of protein backbone conformation based on seven structure assignments. Influence of local interactions. J. Mol. Biol., 221,
Shortle,D., Simons,K.T. and Baker,D. (1998) Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl Acad. Sci. USA, 95,
Simon,I., Glasser,L. and Scheraga,H.A. (1991) Calculation of protein conformation as an assembly of stable overlapping segments: application to bovine pancreatic trypsin inhibitor. Proc. Natl Acad. Sci. USA, 88,
Tatusov,R.L., Altschul,S.F. and Koonin,E.V. (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl Acad. Sci. USA, 91,
Unger,R., Harel,D., Wherland,S. and Sussman,J.L. (1989) A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins, 5,
Wong,K.B., Clarke,J., Bond,C.J., Neira,J.L., Freund,S.M., Fersht,A.R. and Daggett,V. (2000) Towards a complete description of the structural and dynamic properties of the denatured state of barnase and the role of residual structure in folding. J. Mol. Biol., 296,
Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266,
Xiang,Z., Soto,C.S. and Honig,B. (2002) Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc. Natl Acad. Sci. USA, 99,
Yang,A.S. (2002) Structure-dependent sequence alignment for remotely related proteins. Bioinformatics, 18,
Yang,A.S. and Wang,L. (2002) Local structure-based sequence profile database for local and global protein structure predictions. Bioinformatics, 18,