3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid

MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSYRKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFALVYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDLEDERVVGKEQGQNLARQWCNCAFLESSAKSKINVNEIFYDLVRQINR
Given a new sequence that we want to know as much as possible about, one of the fundamental questions is "how does it fold?"

Homology Modelling vs Fold Recognition
[Figure: sequence identity scale marked at 0, 30 and 100 % seq. ID]
Application: Fold Recognition. Target sequence: any sequence. Model quality: fold level.
Application: Homology Modelling. Target sequence: >= 30-50% ID with template. Model quality: atomic level.
If the sequence is similar to a known structure (>30-50% identity) you can usually generate an all-atom model by homology modelling.

The Motivation for Fold Recognition. What if the sequence shares very little evolutionary similarity with other known structures? Is there anything that can be done then? Pairwise sequence search methods can detect folds when sequence similarity is high, but are very poor at detecting relationships that have less than 20% identity. This is where fold recognition comes in.

Sequence Space is Much More Diverse than Structure Space
[Figure labels: sequence space; structural space; homology modelling targets; fold recognition targets]
The development of prediction methods that used structural information came from the observation that many proteins with apparently unrelated sequences had very similar 3-dimensional structures.

In this case the problem to be solved is the inverse of the protein folding problem: given a protein structure, what amino acid sequences are likely to fold into that structure?
[Figure: candidate sequences shown against the structure]
MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSYRKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFALVYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDLEDERVVGKEQGQNLARQWCNCAFLESSAKSKINVNEIFYDLVRQINR
ASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
MREYPVKKGFPTDYDSIKRKISELGFDVKSEGDLIIASIPGISRIEIKPDKRKILVNTGDYDSDADKLAVVRTYNDFIEKLTGYSAKERKKMMTKD

The Reach of Fold Recognition. In fact, studies show that no method can reliably detect proteins that share the same basic structure (or fold) but have no common ancestor. However, fold recognition methods can often detect proteins that are too distantly related to be detected by sequence-based methods. Sequence search methods have improved greatly (e.g. PSI-BLAST), but fold recognition methods can find folds that PSI-BLAST cannot, because they evaluate the structural fit of the detected fold as well as the evolutionary fit.

The Quality Difference Between Fold Recognition Predictions and Homology Modelling Models

The General Principle of Fold Recognition Algorithms. Fold recognition methods are based on the idea of superimposing a target sequence on a database of known folds and evaluating the sequence-fold alignments. Fold recognition alignments are quite different from ordinary sequence alignments, since they are evaluated from a structural perspective. Each fold recognition algorithm has its own method of scoring the alignments.

Threading Algorithms in Detail I. Fold recognition methods employ libraries of known protein folds that usually contain a representative subset of all known structures. The target sequences, or the multiple sequence profiles built up around the sequences, are aligned against the fold library, or against representations of the folds in the library. These representations can be strings of descriptors that represent 3D structural features or secondary structural features.

Threading Algorithms in Detail II Alignments are usually generated by dynamic programming using scoring matrices that relate each amino acid to structural features or to 3D structural environments. Each alignment between target sequence and fold is scored for a range of factors, some evolutionary, some secondary structure-based and some based on physical potentials. The final score is the goodness of fit of the target sequence to each fold and is usually reported as a probability.
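
The dynamic programming step described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular server: the function name `thread_sequence`, the linear gap penalty, and the toy buried/exposed scoring rule are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple


def thread_sequence(
    sequence: str,
    n_positions: int,
    match_score: Callable[[str, int], float],
    gap_penalty: float = 2.0,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Globally align a sequence to a fold with n_positions positions.

    match_score(aa, j) scores amino acid aa placed at fold position j.
    Returns the best score and the aligned (sequence, fold) index pairs.
    """
    n, m = len(sequence), n_positions
    # dp[i][j]: best score aligning the first i residues to the first j positions
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[str]]] = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], back[i][0] = -gap_penalty * i, "up"
    for j in range(1, m + 1):
        dp[0][j], back[0][j] = -gap_penalty * j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (dp[i - 1][j - 1] + match_score(sequence[i - 1], j - 1), "diag"),
                (dp[i - 1][j] - gap_penalty, "up"),    # residue aligned to a gap
                (dp[i][j - 1] - gap_penalty, "left"),  # fold position left empty
            ]
            dp[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the alignment.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


# Toy fold: B = buried position, E = exposed position (hypothetical labels).
fold = "BBEEB"
hydrophobic = set("AVLIMFWC")


def toy_score(aa: str, j: int) -> float:
    # +1 if a hydrophobic residue sits in a buried position (or a
    # non-hydrophobic one in an exposed position), otherwise -1.
    return 1.0 if (aa in hydrophobic) == (fold[j] == "B") else -1.0


score, pairs = thread_sequence("LVKDI", len(fold), toy_score)
```

Real threading programs replace `toy_score` with terms such as environment scores, solvation and secondary structure agreement, and usually use more elaborate gap penalties.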

Scoring Functions for Fold Recognition. Scoring functions measure one or more of the following: how similar the structural environment of the residue is to those in which the aligned amino acid is usually found; pair potentials; solvation energy; coincidence of real and predicted secondary structure and accessibility; evolutionary information (from aligned structures and sequences).

Structural Environments. Bowie et al. (1991) created a fold recognition approach that described each position of a fold template as being in one of eighteen environments. These environments were defined by measuring the area of side chain that was buried, the fraction of the side chain area that was exposed to polar atoms, and the local secondary structure. Other researchers have developed similar methods, where the structural environments described include exposed atomic areas and type of residue-residue contacts.

How Structural Environment Scores are Used. First a scoring matrix is generated from the probabilities of finding each of the twenty amino acids in each of the environment classes. Probabilities are drawn from databases of known structures. A 3D profile matrix is created for each fold in the fold library using these probabilities. The matrix defines the probability of finding a certain amino acid in a certain position in each fold. The target sequence is aligned with the 3D profile and the fit evaluated from the resultant 3D alignment score.
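
As a rough illustration of how such a 3D profile could be built, the sketch below turns (environment, amino acid) counts into log-odds scores and then lays them out along a fold. The log-odds form, the pseudocount, the two-letter alphabet and the two environment names are assumptions chosen to keep the example small; a real method would use all twenty amino acids and the full set of environment classes.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def environment_scores(
    observations: Sequence[Tuple[str, str]], pseudocount: float = 1.0
) -> Dict[Tuple[str, str], float]:
    """Log-odds score for each (environment, amino acid) pair.

    observations: (environment, amino acid) pairs counted in known structures.
    score = ln( P(aa | environment) / P(aa) )
    """
    pair_counts = Counter(observations)
    env_counts = Counter(env for env, _ in observations)
    aa_counts = Counter(aa for _, aa in observations)
    amino_acids = sorted(aa_counts)
    total = len(observations)
    scores = {}
    for env in env_counts:
        for aa in amino_acids:
            p_aa_given_env = (pair_counts[(env, aa)] + pseudocount) / (
                env_counts[env] + pseudocount * len(amino_acids)
            )
            p_aa = aa_counts[aa] / total
            scores[(env, aa)] = math.log(p_aa_given_env / p_aa)
    return scores


def build_3d_profile(
    fold_environments: Sequence[str],
    scores: Dict[Tuple[str, str], float],
    amino_acids: Sequence[str],
) -> List[Dict[str, float]]:
    """One row per fold position, one score per amino acid."""
    return [{aa: scores[(env, aa)] for aa in amino_acids} for env in fold_environments]


def ungapped_score(sequence: str, profile: List[Dict[str, float]]) -> float:
    """Sum of profile scores for a gap-free placement of the sequence."""
    return sum(row[aa] for aa, row in zip(sequence, profile))


# Toy counts: L prefers buried positions, K prefers exposed ones.
observations = (
    [("buried", "L")] * 8 + [("buried", "K")] * 2
    + [("exposed", "L")] * 2 + [("exposed", "K")] * 8
)
scores = environment_scores(observations)
profile = build_3d_profile(["buried", "exposed", "buried"], scores, ["K", "L"])
good_fit = ungapped_score("LKL", profile)
poor_fit = ungapped_score("KLK", profile)
```

In practice the profile is used as the scoring matrix for a gapped dynamic programming alignment rather than the gap-free sum shown here.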

Solvation Energy Solvation potential is a term used to describe the preference of an amino acid for a specific level of residue burial. It is derived by comparing the frequency of occurrence of each amino acid at a specific degree of residue burial to the frequency of occurrence of all other amino acid types with this degree of burial. The degree of burial of a residue is defined as the ratio between its solvent accessible surface area and its overall surface area.
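
A solvation potential of this kind can be derived from counts roughly as follows. This is a sketch under stated assumptions: the reference frequency is taken over all residues in a burial class, the result is expressed as -ln of the frequency ratio (in units of kT), and the bin count and pseudocount are illustrative choices.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def solvation_potential(
    observations: Sequence[Tuple[str, float]],
    n_bins: int = 5,
    pseudocount: float = 1.0,
) -> Dict[Tuple[str, int], float]:
    """Potential for each (amino acid, burial bin).

    observations: (amino acid, relative accessibility in [0, 1]) pairs.
    potential = -ln( f_aa(bin) / f_all(bin) ); negative values are favourable.
    """

    def bin_of(accessibility: float) -> int:
        return min(int(accessibility * n_bins), n_bins - 1)

    aa_bin = Counter((aa, bin_of(acc)) for aa, acc in observations)
    aa_total = Counter(aa for aa, _ in observations)
    bin_total = Counter(bin_of(acc) for _, acc in observations)
    total = len(observations)

    potential = {}
    for aa in aa_total:
        for b in range(n_bins):
            f_aa = (aa_bin[(aa, b)] + pseudocount) / (
                aa_total[aa] + pseudocount * n_bins
            )
            f_all = (bin_total[b] + pseudocount) / (total + pseudocount * n_bins)
            potential[(aa, b)] = -math.log(f_aa / f_all)
    return potential


# Toy data: leucine is mostly buried, lysine mostly exposed.
observations = (
    [("L", 0.05)] * 8 + [("L", 0.95)] * 2
    + [("K", 0.05)] * 2 + [("K", 0.95)] * 8
)
potential = solvation_potential(observations, n_bins=2)
```

With these toy counts, `potential[("L", 0)]` is negative (leucine is favoured in the buried bin) and `potential[("K", 0)]` is positive.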

Pair or Contact Potentials - the Tendency of Residues to be in Contact. Make a count of interacting pairs of each residue type at different distance separations. Counts become propensities (frequency at each distance separation) or energies (Boltzmann principle, -kT ln). [Figure: plots of counts against distance d and of energy E against distance d]
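
The conversion from counts to energies can be sketched as below. The slide gives only the -kT ln form, so the reference state used here (the distance distribution of all pairs pooled together), the bin width, the distance cutoff and the pseudocount are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pair_potentials(
    contacts: Iterable[Tuple[str, str, float]],
    bin_width: float = 2.0,
    max_dist: float = 10.0,
    kT: float = 1.0,
    pseudocount: float = 1.0,
) -> Dict[Tuple[str, str], List[float]]:
    """Distance-dependent energies for each residue-type pair.

    contacts: (residue type 1, residue type 2, distance) observations.
    E_ab(d) = -kT * ln( f_ab(d) / f_all(d) ), computed per distance bin.
    """
    n_bins = int(max_dist / bin_width)
    pair_counts: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0] * n_bins)
    all_counts = [0] * n_bins
    for a, b, d in contacts:
        if d < 0 or d >= max_dist:
            continue  # ignore pairs outside the distance range
        k = int(d / bin_width)
        pair_counts[tuple(sorted((a, b)))][k] += 1  # (a, b) and (b, a) are the same pair
        all_counts[k] += 1

    all_total = sum(all_counts) + pseudocount * n_bins
    energies = {}
    for pair, counts in pair_counts.items():
        pair_total = sum(counts) + pseudocount * n_bins
        energies[pair] = [
            -kT * math.log(
                ((counts[k] + pseudocount) / pair_total)
                / ((all_counts[k] + pseudocount) / all_total)
            )
            for k in range(n_bins)
        ]
    return energies


# Toy data: L-V pairs are mostly close together, K-K pairs mostly far apart.
contacts = (
    [("L", "V", 3.0)] * 3 + [("V", "L", 3.0)] * 3 + [("L", "V", 8.0)] * 2
    + [("K", "K", 3.0)] * 1 + [("K", "K", 8.0)] * 7
)
energies = pair_potentials(contacts, bin_width=5.0, max_dist=10.0)
```

Here `energies[("L", "V")][0]` comes out negative (close L-V contacts are favoured) while `energies[("K", "K")][0]` is positive.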

Pair Potentials in Fold Recognition. The energy that results from aligning a certain target sequence residue at a certain position depends on its interactions with other residues. This creates problems when pair potentials are used to create sequence-structure alignments, since you do not know the position of all the residues in the model before threading them. Threading methods that use pair potentials in this way, such as THREADER (Jones et al., 1992), have to use clever programming methods to get round this problem.

Coincidence Between Secondary Structure and Accessibility. TOPITS uses predicted secondary structure and accessibilities for the target sequence and compares them with the known values of the template. http://cubic.bioc.columbia.edu/predictprotein

A Fold Recognition Example - 3D-PSSM. 3D-PSSM combines: target sequence profiles, template sequence profiles, residue equivalence, secondary structure matching and solvation potentials. Sequences are aligned to folds using dynamic programming, with the alignments scored by a range of 1D and 3D profiles.

Preparing the Query Sequence. Query sequences have their secondary structure predicted by PSIPRED. PSI-BLAST profiles are also generated for the query sequence to allow bi-directional scoring. The 3D-PSSM dynamic programming algorithm is used to scan the fold library with the query sequence.

3D-PSSM Fold Profile Library. PSI-BLAST and a non-redundant database are used to create profiles for each of the folds in the library. Each fold is aligned with members of the same superfamily using the structural alignment program SSAP. Those folds from SCOP with sufficient structural similarity are then also used to create profiles using PSI-BLAST in the same way. All the related profiles are merged.

Secondary Structure and Solvation Potentials. Secondary structure is assigned to each fold based on the annotation in the STRIDE database. Each residue in the fold is also assigned a solvation potential. The degree of burial of each residue is defined as the ratio between its solvent accessible surface area and its overall surface area. Solvation potential is divided into 21 bins, ranging from 0% (buried) to 100% (exposed).
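
The burial ratio and its binning can be illustrated with a few lines of code. The slide states only that there are 21 bins from 0% to 100%; the sketch assumes they are evenly spaced at 5% steps (0%, 5%, ..., 100%), and both function names are hypothetical.

```python
def relative_accessibility(sasa: float, total_area: float) -> float:
    """Degree of burial: solvent accessible area divided by overall surface area."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    # Clamp to [0, 1] in case of small numerical overshoots.
    return max(0.0, min(1.0, sasa / total_area))


def burial_bin(rel_acc: float, n_bins: int = 21) -> int:
    """Index of the nearest bin; bin 0 is fully buried, bin n_bins - 1 fully exposed."""
    return round(rel_acc * (n_bins - 1))


buried = burial_bin(relative_accessibility(0.0, 150.0))
partly = burial_bin(relative_accessibility(39.0, 150.0))
exposed = burial_bin(relative_accessibility(150.0, 150.0))
```

The surface areas in the example are made-up numbers in arbitrary units; only their ratio matters.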

Sequence and Secondary Structure Profiles 3D-PSSM also uses the coincidence of predicted secondary structure (target sequence) and known secondary structure (fold). Here a simple scoring scheme is used for matching secondary structure types, +1 for a match, otherwise -1.

3D-PSSM - Dynamic Programming. Three passes of dynamic programming are performed for each query-template alignment. Each pass uses a different matrix to score the alignment, but secondary structure and solvation potential are used in each pass. The score for a match between a query residue and a fold residue is calculated as the sum of the secondary structure, solvation potential and profile scores. The final score is simply the maximum of the scores from the three passes.
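
The three-pass scheme can be sketched as follows, using the +1/-1 secondary structure score described earlier. This is not the 3D-PSSM implementation: the global alignment, the linear gap penalty, the function names and the toy matrices are assumptions, and the three profile matrices are simply passed in as precomputed query-by-fold score tables.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def ss_score(predicted: str, known: str) -> float:
    """+1 if the predicted and known secondary structure types match, else -1."""
    return 1.0 if predicted == known else -1.0


def global_alignment_score(
    n: int, m: int, match: Callable[[int, int], float], gap: float = 1.0
) -> float:
    """Score of the best global alignment (score only, no traceback)."""
    prev = [-gap * j for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + match(i - 1, j - 1),  # align residue i with position j
                prev[j] - gap,                      # gap in the fold
                cur[j - 1] - gap,                   # gap in the query
            )
        prev = cur
    return prev[m]


def three_pass_score(
    query_ss: str,
    fold_ss: str,
    solvation: Matrix,
    profiles: Sequence[Matrix],
    gap: float = 1.0,
) -> Tuple[float, List[float]]:
    """Run one pass per profile matrix and keep the maximum score.

    solvation[i][j] and profile[i][j] score query residue i at fold position j.
    """
    n, m = len(query_ss), len(fold_ss)
    pass_scores = []
    for profile in profiles:
        def match(i: int, j: int, profile: Matrix = profile) -> float:
            # Secondary structure and solvation are used in every pass;
            # only the profile term changes between passes.
            return ss_score(query_ss[i], fold_ss[j]) + solvation[i][j] + profile[i][j]

        pass_scores.append(global_alignment_score(n, m, match, gap))
    return max(pass_scores), pass_scores


# Toy example: 3 query residues, 3 fold positions, no solvation contribution.
zeros = [[0.0] * 3 for _ in range(3)]
profile_a = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
profile_b = zeros
profile_c = [[-1.0] * 3 for _ in range(3)]
final, per_pass = three_pass_score("HHE", "HHE", zeros, [profile_a, profile_b, profile_c])
```

For this toy input the three passes score 6.0, 3.0 and 0.0, so the final score is 6.0.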

Fold Recognition Servers I
3D-PSSM - www.sbg.bio.ic.ac.uk/~3dpssm/ Based on sequence profiles, solvation potentials and secondary structure.
TOPITS - www.embl-heidelberg.de/predictprotein/ Employs the coincidence of secondary structure and accessibility.
GenTHREADER - www.psipred.net/ Combines profiles and sequence-structure alignments. A neural network-based jury system calculates the final score based on solvation and pair potentials.
ORFeus - grdb.bioinfo.pl/ ORFeus is a sensitive sequence similarity search tool. It aligns two profiles and also uses predicted secondary

Fold Recognition Servers II
SAM-T02 - www.cse.ucsc.edu/research/compbio/hmmapps/t02-query.html The query is checked against a library of hidden Markov models. This is NOT a threading technique; it is sequence-based.
FUGUE - www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html FUGUE uses environment-based fold profiles that are created from structural alignments.
RAPTOR - www.cs.uwaterloo.ca/~j3xu/raptor_form.htm Best-scoring server in the CAFASP competition; strangely, it is out of action at the moment.

Consensus Fold Recognition In the CASP series of world-wide fold recognition conferences it had long been recognised that human experts were better at fold prediction than the methods these same experts had developed. Human experts usually use several different fold recognition methods and predict folds after evaluating the results (not just the top hits) from a range of methods. So why not produce an algorithm that does at least some of what a human expert could do?

Consensus Fold Recognition - PCONS. The first consensus server, Pcons, did the following:
The target sequence was sent to six publicly available fold recognition web servers.
The scores from each method were re-scaled so they were equivalent.
Models were built from all the predictions.
The models were then structurally superimposed and evaluated for their similarity.
The quality of each model was predicted from the rescaled score and from its similarity to other predicted models.
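
The consensus idea in the steps above can be illustrated with a much simplified sketch. It is not the Pcons method: the z-score rescaling, the equal-weight linear combination, the function names and the precomputed similarity values (which stand in for a structural superposition of the models) are all assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import Dict, FrozenSet, List, Sequence, Tuple


def rescale(scores: Sequence[float]) -> List[float]:
    """Put one server's raw scores on a common scale (z-scores)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [0.0 if sigma == 0 else (s - mu) / sigma for s in scores]


def consensus_quality(
    predictions: Dict[str, List[Tuple[str, float]]],
    similarity: Dict[FrozenSet[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank models by rescaled score plus agreement with the other models.

    predictions: server name -> list of (model id, raw score).
    similarity: frozenset({id1, id2}) -> structural similarity in [0, 1].
    """
    rescaled: Dict[str, float] = {}
    for hits in predictions.values():
        z_scores = rescale([score for _, score in hits])
        for (model_id, _), z in zip(hits, z_scores):
            rescaled[model_id] = z

    quality = {}
    for model_id, z in rescaled.items():
        # Average similarity of this model to every other predicted model.
        support = mean(
            [similarity[frozenset((model_id, other))]
             for other in rescaled if other != model_id]
        )
        quality[model_id] = weight * z + (1 - weight) * support
    return sorted(quality.items(), key=lambda item: item[1], reverse=True)


# Toy input: two servers, two models each; A1 and B1 resemble each other.
predictions = {
    "serverA": [("A1", 10.0), ("A2", 5.0)],
    "serverB": [("B1", 0.9), ("B2", 0.1)],
}
ids = ["A1", "A2", "B1", "B2"]
similarity = {
    frozenset((a, b)): 0.1 for i, a in enumerate(ids) for b in ids[i + 1:]
}
similarity[frozenset(("A1", "B1"))] = 0.9
ranking = consensus_quality(predictions, similarity)
```

In this toy case the two mutually similar top hits, A1 and B1, end up at the head of the ranking, which is the behaviour a consensus method is looking for.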

Consensus Servers - Libellula De Juan et al., 2001 http://www.pdg.cnb.uam.es:8081/libellula.html

Consensus Fold Recognition Servers
INBGU - www.cs.bgu.ac.il/~bioinbgu/form.html This produces a consensus prediction based on five methods that exploit sequence and structure information in different ways.
3D Shotgun - http://bioinfo.pl/meta/ ShotGun is a consensus predictor that utilizes the results of fold recognition servers, such as FFAS, 3D-PSSM, FUGUE and mGenTHREADER, and uses a jury system to select structures.
Pcons - www.sbc.su.se/~arne/pcons/ Pcons was the first consensus server for fold recognition. It selects the best prediction from several servers. PMOD can also generate models using the alignment, template and MODELLER.

Structure Prediction (Re: Pazos)
[Flowchart, flattened in extraction; the boxes are listed in approximate order]
Target sequence; BLAST search.
Are there domains? PFAM/ProDom/InterPro, BLAST results: dom1, dom2...
Multiple alignment (ClustalW, PSI-BLAST): conserved residues, correlated mutations.
Secondary structure, accessibility, trans-membrane segments: PHD, PSIPRED.
Text mining for biological information (MedLine, Swissprot): active sites, domains, cofactors etc.
Similarity to known structure? BLAST against PDB.
NO: fold recognition servers (e.g. 3D-PSSM, GenTHREADER): model 1, model 2, model 3, model 4... Fold visualisation and model comparison: Threadlize. Structural classification: FSSP, SCOP. Model 1, model 2.
YES: homology modelling servers (SWISSMODEL): align with template core, loops... 3D model.
3D backbone; side chains, canonical loops (MaxSprout); complete 3D model.
Model evaluation: ProSa, Biotech suite.

Acknowledgements: Florencio Pazos, Lawrence Kelley, Arne Elofsson and anyone else whose figures I used...