proteins PREDICTION REPORT
Template-based and free modeling by RAPTOR11 in CASP8
Jinbo Xu,* Jian Peng, and Feng Zhao


Toyota Technological Institute at Chicago, Illinois

ABSTRACT

We developed and tested RAPTOR11 in CASP8 for protein structure prediction. RAPTOR11 contains four modules: threading, model quality assessment, multiple protein alignment, and template-free modeling. RAPTOR11 first threads a target protein to all the templates using three methods and then predicts the quality of the 3D model implied by each alignment using a model quality assessment method. Based on the predicted quality, RAPTOR11 chooses among three strategies. If multiple alignments have good quality, RAPTOR11 builds a multiple protein alignment between the target and the top templates and then generates a 3D model using MODELLER. If all the alignments have very low quality, RAPTOR11 uses template-free modeling. Otherwise, RAPTOR11 submits the threading-generated 3D model with the best predicted quality. RAPTOR11 was not ready for the first third of the targets and was under development during the whole CASP8 season. The template-based and template-free modeling modules in RAPTOR11 are not closely integrated. We are using our template-free modeling technique to refine template-based models.

Proteins 2009; 77(Suppl 9): © 2009 Wiley-Liss, Inc.

Key words: CASP; template-based modeling; template-free modeling; protein threading; model quality assessment.

Additional Supporting Information may be found in the online version of this article.
The authors state no conflict of interest.
*Correspondence to: Jinbo Xu, Toyota Technological Institute at Chicago, IL j3xu@tti-c.org
Received 10 March 2009; Revised 4 July 2009; Accepted 22 July 2009
Published online 5 August 2009 in Wiley InterScience ( DOI: /prot

INTRODUCTION

Computational methods for protein structure prediction can be broadly classified into two categories: template-based modeling and template-free modeling. Although progress has been made in template-based modeling, several challenges remain, including identifying correct templates and generating accurate alignments. Template-based modeling becomes unreliable when a target protein has a low sequence identity (<30%) with its best templates [1]. Pieper et al. [2] have shown that 76% of the models in MODBASE are from alignments in which the sequence and template share less than 30% sequence identity.

One of the major bottlenecks in template-free modeling is that the conformation space of even a small protein is too big to be explored efficiently. To overcome this, a number of methods have been proposed, including fragment assembly [3,4] and lattice models [5,6]. These methods reduce the search space by using a discrete representation of a protein conformation, which may lead to a loss of prediction accuracy regardless of the sampling algorithm and energy function. This discrete nature may exclude native-like conformations from the search space because even a small change in a single backbone angle could result in a totally different fold. Efficient sampling in a continuous space of protein-like conformations is still an important unsolved problem.

We have developed RAPTOR11, a new protein structure prediction method, to address these issues. RAPTOR11 is much more powerful than our threading program RAPTOR [7,8]. In RAPTOR11, we generate sequence-template alignments using three different threading methods and rank them using a model quality assessment method. Then, we employ multiple templates to model an easy target. To deal with targets without identifiable templates, we developed a novel template-free modeling method that can efficiently sample protein conformations in a continuous space. In this article, we briefly describe RAPTOR11, summarize its predictions in CASP8, present some specific examples, and discuss its strengths and weaknesses.

METHODS

Threading

RAPTOR11 has three threading methods with different scoring functions and alignment algorithms. Two of the three methods are core-based, whereas the third is not. As in the old RAPTOR, a core-based method does not allow gaps in core regions [5]. The two core-based methods differ in whether pairwise statistical potentials are used in their scoring functions. The noncore-based method does not use pairwise statistical potentials in its scoring function. In particular, the core-based pairwise threading method uses a scoring function consisting of gap penalty, mutation score, secondary structure score, singleton score, and pairwise score. The pairwise statistical potential is derived by McConkey et al. [6], and the other scoring items are taken from RAPTOR [5]. The two nonpairwise methods use a similar scoring function without the pairwise potential. We trained three different sets of weight factors for these scoring functions using the method in Ref. 5.

The major reason for using the McConkey potential is to introduce diversity. The McConkey potential and the RAPTOR pairwise interaction potential differ mainly in how they define an inter-residue interaction. The McConkey potential also has its own parameters for the singleton score. Our original plan was to use these two potentials separately to generate alternative alignments. However, because of limited computing power, we used only the McConkey potential for CASP8. Some very preliminary studies indicate that the McConkey potential has alignment accuracy similar to that of RAPTOR, but the two can generate alternative alignments for a given protein pair. Tested on the Prosup benchmark [7], the reference-dependent alignment accuracy of a single threading method is 61.0%.
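As a rough illustration of the scoring function described above, the sketch below combines the five named terms as a weighted sum. The linear form, the `Terms` and `alignment_score` names, the sign convention, and every numeric value are assumptions made for illustration only; the text states which terms are used and that separate weight sets were trained, not how they are combined.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Terms:
    """One value per scoring item named in the text (hypothetical container)."""
    gap: float                  # gap penalty (negative values penalize, by assumption)
    mutation: float             # mutation score
    secondary_structure: float  # secondary structure score
    singleton: float            # singleton score
    pairwise: float = 0.0       # pairwise score; left at 0 for nonpairwise methods


def alignment_score(terms: Terms, weights: Terms) -> float:
    """Weighted sum of the scoring items (the linear combination is an assumption)."""
    return sum(w * t for w, t in zip(astuple(weights), astuple(terms)))


# Made-up term values for one alignment and two made-up weight sets.
terms = Terms(gap=-2.0, mutation=5.0, secondary_structure=1.0,
              singleton=0.5, pairwise=3.0)
pairwise_weights = Terms(gap=1.0, mutation=1.0, secondary_structure=0.5,
                         singleton=2.0, pairwise=1.0)
nonpairwise_weights = Terms(gap=1.0, mutation=1.0, secondary_structure=0.5,
                            singleton=2.0, pairwise=0.0)

print(alignment_score(terms, pairwise_weights))
print(alignment_score(terms, nonpairwise_weights))
```

Setting the pairwise weight to zero is one simple way to express the nonpairwise variants with the same code path; the actual methods also differ in their alignment algorithms, which this sketch does not cover.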
The single-method accuracy improves to 68.0% when the three threading methods are combined using our model quality assessment method described below.

Model quality assessment

Unlike many methods that directly evaluate the quality of a 3D model [9-17], our model assessment method evaluates the absolute and global quality, measured by GDT-TS or TM-score [8], of the 3D model implied by an alignment without actually building that model with MODELLER. To the best of our knowledge, our method is the first to exploit only the evolutionary information in an alignment for model assessment. Because we do not need to build a 3D model to assess its quality, we save a lot of model-building time. Trained on the RAPTOR-generated CASP6 data and tested on the CASP7 data, the MAE (mean of absolute errors) of predicted GDT-TS is and the Pearson correlation coefficient of predicted GDT-TS with the real one is

This model assessment method is built upon our previous work [18], which uses Support Vector Machines to predict the number of correctly aligned positions in an alignment. To assess model quality, our method uses a set of alignment-based features: the distribution of per-position sequence similarity score, contact capacity score, and environmental fitness score; the distribution of gap lengths in the alignment; secondary structure score; solvent accessibility score; and sequence identity.

Multiple-template method

If there are at least two very good templates for a target protein, we generate a multiple protein alignment and then build a 3D model from this alignment using MODELLER [9]. The multiple-template method has been exploited by several groups, such as Joo et al. [10] and Cheng [11], in recent CASP events. The major challenge is to choose good templates and to generate multiple protein alignments. We always use the top two templates and then enumerate all the possible combinations of the remaining top templates.
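The enumeration described here, together with the cap of five templates per combination that the method applies, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `template_sets`, the choice to include the top-two-only set, and the template identifiers are all hypothetical, and the text does not say how many "remaining top templates" are considered.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def template_sets(ranked: Sequence[str], max_size: int = 5) -> List[Tuple[str, ...]]:
    """Enumerate candidate template sets from a quality-ranked template list.

    The two top-ranked templates are always included; every subset of the
    remaining templates is added to them as long as the set holds at most
    `max_size` templates.
    """
    if len(ranked) < 2:
        return []  # the multiple-template method needs at least two templates
    core = tuple(ranked[:2])
    rest = ranked[2:]
    max_extra = min(len(rest), max_size - 2)
    sets = []
    for k in range(max_extra + 1):
        for extra in combinations(rest, k):
            sets.append(core + extra)
    return sets


# Hypothetical template identifiers, ordered by predicted model quality.
for candidate in template_sets(["tmplA", "tmplB", "tmplC", "tmplD"]):
    print(candidate)
```

With n remaining templates the number of sets grows as the sum of C(n, k) for k up to 3, which is why a cap on the combination size keeps the number of model-building runs manageable.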
To save computing time, at most five templates are used in any combination. For a given set of multiple templates, TM-align [8] is used to generate a structure alignment between any two templates. Then, T-Coffee [12] is used to combine all the sequence-template alignments and structure alignments into a single multiple protein alignment. We used a very conservative strategy to rank models built from multiple templates because using multiple templates sometimes generates worse models. A multiple-template-based model is assumed to be better than another multiple-template-based model or a single-template-based model if and only if the former has both better ProQ [13] and better DFIRE [14] values. However, this ranking method sometimes failed to identify the best models in CASP8. We chose TM-align, T-Coffee, ProQ, and DFIRE because they are easily accessible. In the future, we will systematically compare our method with other similar methods.

Template-free modeling

We have developed a template-free modeling method, as detailed in Refs. 15, 16. Our method employs conditional (Markov) random fields (CRFs) and directional statistics to model the protein sequence-structure relationship. It models the backbone angle distribution at each residue using an FB5 distribution [17] and samples backbone angles from sequence information using the CRF. Unlike the widely used fragment assembly and lattice model methods, which explore protein conformations in a discrete space, our method explores protein conformations in a continuous space according to their probability. The probability of a protein conformation reflects its stability and is estimated from the PSI-BLAST sequence profile and PSIPRED-predicted secondary structure. Our template-free modeling module drives conformation optimization with a simple energy function consisting of Sali's DOPE [19,20], Baker's KMBhbond [21], and, later, a simplified solvent accessibility potential [22]. Our experimental results in Ref. 16 indicate that, although it samples in a continuous space and uses a very simple energy function, our new method compares favorably with the fragment assembly method (e.g., Robetta) and the lattice model (i.e., TOUCHSTONE II).

Multidomain proteins

When a target protein is large and may contain multiple domains, we first parse it into several possible domains by searching the Pfam database [23] using HMMER [24,25]. If the whole target can be aligned to a single template, domain parsing is skipped. If a big chunk of the target is not aligned to any top template, we treat this unaligned chunk as a separate target and model it separately. Except for the last several CASP8 targets, the models for multiple domains are not assembled into a single coordinate system. This explains why Zhang's assessment indicates that our models for multidomain targets may contain atomic clashes when our domain boundary is different from Zhang's.

RESULTS AND DISCUSSION

Summary

Table I summarizes the results of RAPTOR11 in CASP8. CASP8 defined 164 effective domains and classified them into three categories, whereas Shi et al. [26] defined 146 domains and classified them into five categories.
As shown in Columns 2-4, for TBM-HA targets, the difference between the first and the best models by RAPTOR11 is small. In contrast, the best models generated by RAPTOR11 for TBM and FM targets are much better than the first models. This indicates that we still need to improve our model selection method for TBM and FM targets. As shown in Columns 4-6, for TBM-HA targets, the best models generated by RAPTOR11 are not far from the best models submitted by all the CASP8 servers. However, for TBM and FM targets, the best models submitted by all CASP8 servers are much better than the best generated by RAPTOR11. This means that, in addition to improving model selection, we also need to further improve our model generation method for TBM and FM targets. Similar observations hold when Grishin's domain definition and classification are used.

Table I. Summarized Results of RAPTOR11 Predictions in CASP8

Category (#)    R1    RB    RBAll    S1    SB
CASP8 official domain definition
TBM-HA (50)
TBM (104)
FM (13)
Grishin's domain definition and classification
CM easy (36)
CM medium (45)
CM hard (30)
FR (30)
FM (5)

The upper half of the table contains the results for the 164 CASP8 official domains and the lower half contains the results for the 146 domains by Grishin's definition (prodata.swmed.edu/casp8/evaluation/casp8home.htm). R1 represents the GDT-TS score sum of the first-ranked models by RAPTOR; RB represents the GDT-TS score sum of the best models submitted by RAPTOR; RBAll represents the GDT-TS score sum of the best models generated by RAPTOR; S1 represents the GDT-TS score sum of the best first models submitted by all servers; SB represents the GDT-TS score sum of the best models submitted by all servers.

What went right?

The model quality assessment method helps considerably in improving RAPTOR's performance on the TBM targets, in contrast to RAPTOR in CASP7, which did not perform well in this category.
In fact, Randall and Baldi demonstrated that the performance of RAPTOR in CASP7 could be greatly improved by simply re-ranking the top five models using SELECTpro [27]. A typical example is T0429. The third model of RAPTOR11 for this target is much better than other server models, but RAPTOR's old template selection method failed to rank the third model first, although RAPTOR's first model is still pretty good. Using our new model quality assessment method, we can rank the third model first. See Figure S1 in Supporting Information for these two models of T0429.

The multiple-template method sometimes helps improve the modeling of easy targets. This method is likely to improve model quality when two conditions are satisfied. One is that some gapped regions in the alignment to one template can be covered by the alignment to another template. The other is that the multiple templates are structurally very similar. If either condition is not satisfied, the multiple-template method may produce models of worse quality. For example, RAPTOR11 generated the best model for T0486 using four similar templates: 2ppyA, 1q52A, 2hw5A, and 2pbpA. The GDT-TS of this model is around higher than the single-template (2ppyA) based model. Using these four templates, we can cover more of T0486 than with any single template. See Figures S2-1, S2-2, and S2-3 in Supporting Information for alignments and 3D models for T0486.

Our template-free modeling method samples protein conformations in a continuous space without using fragments from the PDB. It aims to overcome two major issues with the currently popular fragment assembly and lattice model methods. One issue is that, by sampling in a discrete space, a lattice method may exclude the native structure from the search space, since a small change in a backbone angle may result in a totally different fold. The other issue is that there is no guarantee that the local structure of a protein with a new fold can be covered by even medium-sized fragments, as a new fold may be composed of rarely occurring supersecondary structure motifs (Andras Fisher, CASP8 talk). Compared with the Robetta server (see Table III in Ref. 16), our method performs very well on mainly alpha proteins, e.g., T0460, T0496_D1, and T0496_D2, as shown in Figures S3, S4-1, S4-2, and S4-3 in Supporting Information, respectively. This is not surprising, as our CRF model captures the local sequence-structure relationship well. Our method also works well on small, mainly beta proteins. For example, our method is better than Robetta on T0480 and T0510_D3, as shown in Figures S5 and S6 in Supporting Information, respectively. However, our method does not fare well on a relatively large protein (>100 residues) with a few beta strands, e.g., T0482 and T0513_D2. This is probably because our CRF method can model only the local sequence-structure relationship, whereas a beta sheet is stabilized by nonlocal hydrogen bonding. Although it samples in a continuous space, our method can still efficiently search the conformation space of a small beta protein. However, for a large protein with a few beta sheets, the search space is too big to be explored by our continuous conformation sampling algorithm.
It is also worth noting that, compared with Robetta, our method works well on T0397_D1 (see Fig. S7 in Supporting Information) and T0496_D1, which, according to Nick Grishin, are the only two CASP8 targets with really new folds.

What went wrong?

RAPTOR11 contains both template-based and template-free modeling modules, so it needs a rule to decide when to use template-free modeling and when to use template-based modeling. RAPTOR11 used the predicted GDT-TS to do so, but this sometimes misled RAPTOR11 because the predicted GDT-TS is not accurate enough. RAPTOR11 used template-free modeling if the best predicted GDT-TS is less than For some targets, such as T0496_D1 and T0510_D3, RAPTOR11 correctly submitted their template-free models, which are much better than their template-based models. However, RAPTOR11 incorrectly submitted template-free models for some targets (e.g., T0480 and T0496_D2) although they have better template-based models.

When the multiple-template method is used, RAPTOR11 sometimes failed to identify the best 3D models using ProQ and DFIRE. A better model quality assessment method is urgently needed for this purpose.

Another issue is that RAPTOR11 did not update its template database during the whole CASP8 season, so RAPTOR11 missed the best template (2zf8A) for T0514, which was deposited to the PDB in July

ACKNOWLEDGMENTS

The authors thank Xin Gao for his work in setting up the RAPTOR11 web server and running RAPTOR11 for the first 20 CASP8 targets, and Tobin Sosnick, Karl Freed, Joe DeBartolo, and Brendan McConkey for their help with the development of RAPTOR11. This work is supported by the TTI-C internal research funding and NIH grant R01GM This work was made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca), the Open Science Grid Engagement VO, and the University of Chicago Computation Institute. The authors are also grateful to Dr. Ming Li, Dr. Ian Foster, Dr. John McGee, and Mats Rynge for their help with computational resources.

REFERENCES

1. Baker D, Sali A. Protein structure prediction and structural genomics. Science 2001;294:
2. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D, Shen MY, Kelly L, Melo F, Sali A. MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 2006;34:D291 D
3. Kim DE, Chivian D, Baker D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 2004;32:W526 W
4. Zhou HY, Skolnick J. Ab initio protein structure prediction using Chunk-TASSER. Biophys J 2007;93:
5. Xu J, Li M, Kim D, Xu Y. RAPTOR: optimal protein threading by linear programming. J Bioinform Comput Biol 2003;1:
6. McConkey BJ, Sobolev V, Edelman M. Discrimination of native protein structures using atom-atom contact scoring. Proc Natl Acad Sci USA 2003;100:
7. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein structure alignment. Protein Eng 2000;13:
8. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005;33:
9. Sali A. Comparative protein modeling by satisfaction of spatial restraints. Mol Med Today 1995;1:
10. Joo K, Lee J, Lee S, Seo JH, Lee SJ, Lee J. High accuracy template based modeling by global optimization. Proteins 2007;69:
11. Cheng J. A multi-template combination algorithm for protein comparative modeling. BMC Struct Biol 2008;8:
12. Poirot O, O'Toole E, Notredame C. Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res 2003;31:
13. Wallner B, Elofsson A. Can correct protein models be identified? Protein Sci 2003;12:

14. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002;11: ; 2003;12:
15. Zhao F, Li SC, Sterner BW, Xu J. Discriminative learning for protein conformation sampling. Proteins 2008;73:
16. Zhao F, Peng J, DeBartolo J, Freed KF, Sosnick TR, Xu J. A probabilistic graphical model for ab initio folding. Proc. 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB), Lecture Notes in Computer Science, Vol. 5541, pp , Springer.
17. Kent JT. The Fisher-Bingham distribution on the sphere. J Roy Stat Soc B 1982;44:
18. Xu J. Protein fold recognition by predicted alignment accuracy. IEEE/ACM Trans Comput Biol Bioinform 2005;2:
19. Fitzgerald JE, Jha AK, Colubri A, Sosnick TR, Freed KF. Reduced C-beta statistical potentials can outperform all-atom potentials in decoy identification. Protein Sci 2007;16:
20. Shen M, Sali A. Statistical potential for assessment and prediction of protein structures. Protein Sci 2006;15:
21. Morozov AV, Kortemme T, Tsemekhman K, Baker D. Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proc Natl Acad Sci USA 2004;101:
22. Fernandez A, Sosnick TR, Colubri A. Dynamics of hydrogen bond desolvation in protein folding. J Mol Biol 2002;321:
23. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. The Pfam protein families database. Nucleic Acids Res 2008;36:D281 D
24. Eddy SR. Profile hidden Markov models. Bioinformatics 1998;14:
25. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 1994;235:
26. Shi S, Pei J, Sadreyev RI, Kinch LN, Majumdar I, Tong J, Cheng H, Kim B-H, Grishin NV. Analysis of CASP8 targets, predictions and assessment methods. Database, vol. 2009, no. 0, pp. bap003+, April [Online]. Available:
27. Randall A, Baldi P. SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs. BMC Struct Biol 2008;8:52.
