Exploring Suboptimal Sequence Alignments and Scoring Functions in Comparative Protein Structural Modeling


1 Exploring Suboptimal Sequence Alignments and Scoring Functions in Comparative Protein Structural Modeling
Presented by Kate Stafford (1,2)
Research Mentor: Troy Wymore (3)
(1) Bioengineering and Bioinformatics Summer Institute, Center for Computational Biology and Bioinformatics, University of Pittsburgh, Pittsburgh, PA
(2) Department of Biology, Massachusetts Institute of Technology, Cambridge, MA
(3) Pittsburgh Supercomputing Center, Biomedical Initiative Group, Pittsburgh, PA

2 The Protein Folding Problem
- Knowledge of the native three-dimensional conformation of a protein provides clues to its chemical reactivity and cellular function.
- Currently, structural information comes from experimental sources, typically X-ray crystallography or NMR spectroscopy, both of which are time-consuming and labor-intensive to perform.
- Experimental structure determination of all possible proteins is both impractical and unrealistic.
- Ideally, we could use computational methods to predict three-dimensional protein structure given only the primary amino acid sequence.

3 Predicting Protein Structure
There are two main approaches to computational prediction of protein structure (Fiser et al. 2002):
- Ab initio protein folding
  - Limited to relatively small peptides
  - Computationally expensive and of relatively low accuracy
  - Permits the exploration of structural space not covered by known, solved structures
  - Can be performed on small regions of models largely constructed by comparative modeling
- Comparative modeling based on a known structural template
  - Requires sequence alignment with a suitable template whose structure has already been solved
  - With sufficiently high sequence identity and a good alignment, can be performed with high accuracy
  - Accuracy declines significantly when the template and target have less than 40% sequence identity (Jaroszewski et al. 2002)
  - Will grow both more common and more useful as structural genomics projects gain momentum in the structure determination community
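The 40% threshold above depends on how sequence identity is counted. As a minimal sketch, assuming identity is defined as the fraction of identical residues among columns where both sequences have a residue (other definitions divide by the shorter sequence length or the full alignment length), it could be computed as follows. The function names and the example sequences are hypothetical and are not taken from the study.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # gap columns do not count toward identity
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def below_threshold(identity_pct: float, threshold: float = 40.0) -> bool:
    """True when a target/template pair falls under the identity threshold."""
    return identity_pct < threshold


# Made-up toy alignment, for illustration only.
toy_identity = percent_identity("MKT-AYIAK", "MRTLAY-AK")
```

Under this definition, a pair with 26% or 18% identity falls well below the 40% mark where accuracy is reported to decline.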

4 Comparative Modeling and Sequence Alignments
- One major bottleneck in comparative modeling is the process of properly aligning target and template sequences.
- Alignment errors can have a significant impact on the quality of the models.
- Our approach to this problem: create an ensemble of models from suboptimal alignments and use a scoring function to identify the best model in the ensemble.
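To illustrate what an ensemble of suboptimal alignments can look like, here is a simplified sketch of stochastic alignment sampling: a partition function over all global alignments is filled in by dynamic programming, and alignments are then drawn by stochastic backtracking in proportion to their Boltzmann weight. This is only an illustration of the general idea, not the implementation used in the study or in any particular program. The identity-based scores, the linear gap penalty, the temperature parameter, and all function names are assumptions, and adjacent insertions and deletions in different orders are counted as distinct alignments.

```python
import math
import random

# Illustrative scoring parameters (assumptions, not any program's defaults).
MATCH, MISMATCH, GAP = 2.0, -1.0, -2.0


def pair_weight(x, y, temperature):
    """Boltzmann weight of aligning residue x with residue y."""
    score = MATCH if x == y else MISMATCH
    return math.exp(score / temperature)


def gap_weight(temperature):
    """Boltzmann weight of one gap column (linear gap penalty)."""
    return math.exp(GAP / temperature)


def partition_table(a, b, temperature=1.0):
    """Z[i][j] = summed Boltzmann weight of all alignments of a[:i] with b[:j]."""
    w_gap = gap_weight(temperature)
    Z = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    Z[0][0] = 1.0  # the empty alignment
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i > 0 and j > 0:
                Z[i][j] += Z[i - 1][j - 1] * pair_weight(a[i - 1], b[j - 1], temperature)
            if i > 0:
                Z[i][j] += Z[i - 1][j] * w_gap
            if j > 0:
                Z[i][j] += Z[i][j - 1] * w_gap
    return Z


def sample_alignment(a, b, Z, temperature=1.0, rng=random):
    """Draw one alignment by stochastic backtracking from the corner of Z."""
    w_gap = gap_weight(temperature)
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0 and j > 0:
            moves.append("pair")
            weights.append(Z[i - 1][j - 1] * pair_weight(a[i - 1], b[j - 1], temperature))
        if i > 0:
            moves.append("gap_in_b")
            weights.append(Z[i - 1][j] * w_gap)
        if j > 0:
            moves.append("gap_in_a")
            weights.append(Z[i][j - 1] * w_gap)
        # Pick the last column with probability proportional to its weight.
        move = rng.choices(moves, weights=weights)[0]
        if move == "pair":
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "gap_in_b":
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```

In this sketch a higher temperature spreads the samples over more diverse alignments, while a lower one concentrates them near the optimal alignment. For sequences of realistic length the weights would need to be handled in log space to avoid overflow.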

5 Project Overview
- We selected four CASP5 target sequences on which to perform our procedure and compared the resulting models to those generated by the popular alignment program T-coffee (Notredame et al. 2000) and the combinatorial extension (CE) structural alignment (Shindyalov and Bourne 1998).
- We aimed to determine whether our comparative modeling procedure could produce models comparable to or better than those submitted in CASP5, and whether our selected scoring function would accurately identify the best model in the ensemble.

6 Suboptimal Alignment Procedure
1. Multiple Sequence Alignments: A BLAST search (Altschul et al. 1997) is run with the target sequence to be modeled as the query. From the resulting sequences, MEME profiles are extracted and a MAST profile-sequence search (Bailey and Gribskov 1998) of the PDB database (Berman et al. 2000) is conducted to obtain known structural templates from among the selected sequences.
2. Construction of Alternative Alignments: A set of suboptimal alignments is created using ProbA (Mückstein et al. 2002).
3. Construction of Alternative Alignment Models: The program MODELLER (Šali and Blundell 1993) constructs a set of structural models based on the alignments.
4. Model Scoring: The statistical potential from Prosa II (Sippl 1993) is used to calculate the models' energy scores.
5. Refinement: The Multiscale Modeling Tools for Structural Biology (MMTSB) toolset (Feig et al. 2004) is used to refine and simulate poorly aligned or unaligned regions for which homology-based modeling is impossible.
6. Assessment: The accuracy of these procedures is assessed by correlating Prosa scores with RMSD for the ensemble. The lowest-scoring and lowest-RMSD models are compared to T-coffee and CE models for reference.
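The assessment step can be sketched in a few lines. The example below assumes each model is summarized as a (name, score, RMSD) tuple, that lower scores are better, and that RMSD is computed on coordinates already superposed onto the crystal structure. The function names are hypothetical, and a Pearson coefficient is used here as one possible correlation measure; the source does not say which was used.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superposed; no fitting is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def assess(models):
    """Summarize an ensemble given as (name, score, rmsd) tuples."""
    best_by_score = min(models, key=lambda m: m[1])
    best_by_rmsd = min(models, key=lambda m: m[2])
    return {
        "correlation": pearson([m[1] for m in models], [m[2] for m in models]),
        "best_by_score": best_by_score[0],
        "best_by_rmsd": best_by_rmsd[0],
        # How much accuracy is lost by trusting the score instead of the RMSD.
        "rmsd_gap": best_by_score[2] - best_by_rmsd[2],
    }


# Made-up toy ensemble, for illustration only (not results from the study).
toy = [("m1", -5.0, 3.0), ("m2", -7.0, 4.0), ("m3", -3.0, 6.0)]
summary = assess(toy)
```

A correlation near 1 would mean that lower scores track lower RMSD, and an `rmsd_gap` of zero would mean the scoring function picked the most accurate model in the ensemble.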

7 Meet the Targets
Target Number | PDB Code | Sequence Identity with Template | Number of Residues | Function | Interesting Features
T0141 | 1J3G | 26% with 1LBA | 187 | Acetylmuramyl-L-alanine amidase | Unusual active site structure
T0153 | 1MQ7 | 35% with 1EO5 | 154 | dUTPase | Beta-clip fold
T0165 | 1L7A | 18% with 1A8S | 318 | Cephalosporin C deacetylase | Alpha/beta hydrolase
T0183 | 1O0Y | 30% with 1JCL | 248 | Deoxyribose phosphate aldolase | TIM barrel
More information, and the full list of CASP5 targets, can be found at:

8 T0141: Summary
Neither Prosa II scores nor all-atom energy scores correlate well with RMSD.
[Table: RMSD and number of residues for the CE, best RMSD, best Prosa, and T-coffee models; values not preserved]
Although Prosa did not pick out the absolute best model from the 300-structure ensemble, the difference between its lowest-scoring model and the lowest-RMSD model is relatively small, and the RMSD values are close to the ideal CE value.

9 T0141: Pictures
Superposition of the best Prosa (green), best RMSD (blue), T-coffee (red), and crystal structure (colored by secondary structure). The largest errors here lie in the orientations of loops and helices.
Catalytically essential residues. Models are colored as above; the crystal structure is shown with purple residues and a gray backbone for reference.

10 T0141: Pictures The active-site residues shown superimposed on the surface of the protein. The gold sphere is a catalytically required zinc atom. The shape of the binding pocket (purple surface) is determined by the active-site loop (cyan). The models follow the template.

11 T0153: Summary
There is essentially no correlation between Prosa scores and RMSD for this target, and the lowest-scoring models have appreciably higher RMSDs than do the best models in the 100-structure ensemble.
[Table: RMSD and number of residues for the CE, best RMSD, best Prosa, and T-coffee models; values not preserved]
CASP5 performance on this target was generally quite good, making the difference between the RMSD of the lowest-scoring model and the best-RMSD model more significant than it first appears. T-coffee badly misaligned the N-terminus, leading to its high RMSD here.

12 T0153: Pictures
The crystal structure of T0153, a primarily β-sheet protein. The long extension at the C-terminus is unusual in a crystal structure and is probably floppy in solution.
Overlay representation of the crystal structure (purple), best Prosa (green), best RMSD (blue), and T-coffee (red). T-coffee badly misaligned the N-terminal region, resulting in a long trailing tail typical of MODELLER's response to unaligned regions. The best suboptimals missed the CE alignment by only one or two residues in this region.

13 T0153: Pictures
The C-terminal 35 residues are highlighted, colored as before. This unusual region is difficult to model accurately. The rest of the crystal structure is shown in gray for reference.
The N-terminal 33 residues, colored as before. The crystal structure is shown in cartoon format. The T-coffee model (red) does not wrap around at the terminus due to its misalignment.

14 T0165: Summary
Prosa assigns the T-coffee model, which is dramatically misaligned, a very high score. Notably, the quality of the models for this target is relatively low, and the Prosa scores are correspondingly high: a majority are positive.
[Table: RMSD and number of residues for the CE, best RMSD, best Prosa, and T-coffee models; values not preserved]
Prosa missed the mark badly for this low-sequence-identity target. Although Prosa's lowest-scoring model is an improvement over the T-coffee model, the best RMSD in the 300-structure ensemble is significantly lower than that of Prosa's best-scoring model.

15 T0165: Pictures Left: Best Prosa (green), best RMSD (blue), and the crystal structure (colored by secondary structure) superimposed. Errors in helix placement and orientation are evident. Middle: Helical regions of the T-coffee (red) and the crystal structure superimposed. The T-coffee alignment is skewed by about 60 residues, so there is very little correspondence between the two models. Right: Helical regions of the best Prosa, best RMSD, and crystal structures.

16 T0165: Pictures The β-sheet core of the protein, colored as before. The best suboptimals reproduce the location, if not the orientation, of the sheets; the T-coffee model does not. Left: The crystal structure alone. Right: The T-coffee model alone. The T-coffee model is a mess of spaghetti!

17 T0183: Summary
The quality of the models produced for this target is relatively high, and the Prosa scores are correspondingly low: only one of the 200 structures received a positive energy score.
[Table: RMSD and number of residues for the CE, best RMSD, best Prosa, and T-coffee models; values not preserved]
Although Prosa's best-scoring model did not have the lowest RMSD in the 200-structure ensemble, the difference is relatively small. The best suboptimals offered a significant improvement over the T-coffee model.

18 T0183: Pictures
The crystal structure is shown in cartoon format, colored by secondary structure. Best Prosa (green), best RMSD (blue), and T-coffee (red) models are shown as ribbons.
A 0.5-nanosecond molecular dynamics simulation with GB solvation was intended to improve the packing of the models, but resulted in a partially unfolded structure.
The helix on the far right has no counterpart in the template, and thus could not be modeled by homology. Lattice-based simulations were performed on this region.

19 T0183: Pictures
The best model after a lattice-based simulation in which all but the first 25 residues (corresponding to the unaligned helix) were highly restrained. The model (red) does not achieve the helical conformation; in fact, the residues that make up the helix appear to be forming a β-sheet-like structure that might become more pronounced with longer simulation times.

20 Conclusions
- Our suboptimal alignment procedure is capable of producing models whose quality routinely exceeds that of models generated from T-coffee alignments.
- Identifying the best model in the ensemble remains the weak point of our procedure: neither all-atom energy scoring nor the statistical potential function in Prosa II reliably identified the best models.
- However, especially for higher sequence identity, Prosa typically identified a good model, if not the absolute best.
- Ensembles of generally higher quality receive generally lower scores and vice versa. Crystal structures themselves also routinely score extremely low. Thus the statistical potential function in Prosa II lacks sufficient resolution to differentiate between closely related models.
- Ways to improve our comparative modeling procedure include exploring a larger set of suboptimal alignments and identifying and implementing a more robust scoring function.

21 References
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:
Bailey TL, Gribskov M. (1998). Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14:
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. (2000). The Protein Data Bank. Nucleic Acids Res 28:
Feig M, Karanicolas J, Brooks CL. (2004). MMTSB tool set: enhanced sampling and multiscale modeling methods for applications in structural biology. J Mol Graph Model 22:
Fiser A, Feig M, Brooks CL, Šali A. (2002). Evolution and physics in comparative protein structure modeling. Acc Chem Res 35:
Jaroszewski L, Li W, Godzik A. (2002). In search for more accurate alignments in the twilight zone. Prot Sci 11:
Mückstein U, Hofacker IL, Stadler PF. (2002). Stochastic pairwise alignments. Bioinformatics 18(s2): S
Norvell JC, Machalek AZ. (2000). Structural genomics programs at the US National Institute of General Medical Sciences. Nat Struct Biol 7(11s): 931.
Notredame C, Higgins D, Heringa J. (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:
Šali A, Blundell TL. (1993). Comparative protein modeling by satisfaction of spatial restraints. J Mol Biol 234:
Shindyalov IN, Bourne PE. (1998). Protein structural alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9):
Sippl MJ. (1993). Recognition of errors in three-dimensional structures of proteins. Proteins 17:

22 Acknowledgements
Research Mentor Dr. Troy Wymore
Adam Marko
Pittsburgh Supercomputing Center
Rajan Munshi
BBSI program and students
NIH/NSF