Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification Read intro to HMM on blackboard


Solomon Leonard
5 years ago

1 Syllabus Sep 30 Oct 2 Oct 7 Oct 9 Oct 14 Oct 16 Oct 21 Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification Read intro to HMM on blackboard OCTOBER BREAK Comparative Homology and alignments Oct 23 Protein Function & Motifs Oct 27 Structural / Protein Structure Prediction Oct 30 Protein Modeling & Dynamics Nov 4 Protein-Protein Interactions & Networks Gribskov@purdue.edu Lilly G-233 Gribskov 8.1

2 Protein Function Expectation Maximization - Using MEME/MAST Use to characterize known family and create a classifier Learn motifs Use MAST to decide which are general - this is the classifier Can convert to a Profile Find homologs for groups of unknowns Learn motifs on unknown group Use MAST on training set to decide which are general Search database for distant relatives Motif order is very important Server - Input - Multiple sequences in FASTA format Output MEME - Motif descriptions MAST - P-values, motif diagrams, alignments Gribskov 8.2

3 Protein Function Expectation Maximization - MEME Motif 1 Alignment Gribskov 8.3

4 Protein Function Expectation Maximization - MEME Motif 1 - Log-odds Weight Matrix (columns: A C D E F G H I K L M N P Q R S T V W Y) [Figure: information content per motif position in bits, total 28.2 bits] Multilevel consensus sequence: EDLVQETFIR (lower levels: D VA DA; T) Gribskov 8.4
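The log-odds weight matrix and the per-column information content shown on this slide can be sketched in a few lines. This is a minimal illustration, not MEME's actual procedure: the aligned sites are hypothetical, the background is assumed uniform, and the pseudocount value is an arbitrary choice.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def weight_matrix(sites, pseudocount=1.0):
    """Return (log_odds, info) for a list of equal-length aligned sites.

    log_odds[i][aa] is log2(p/q) for residue aa in column i;
    info[i] is the information content of column i in bits.
    """
    q = 1.0 / len(ALPHABET)  # assumption: uniform background frequency
    total = len(sites) + pseudocount * len(ALPHABET)
    log_odds, info = [], []
    for col in range(len(sites[0])):
        counts = Counter(site[col] for site in sites)
        probs = {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
        scores = {aa: math.log2(p / q) for aa, p in probs.items()}
        log_odds.append(scores)
        # Relative entropy of the column against the background.
        info.append(sum(probs[aa] * scores[aa] for aa in ALPHABET))
    return log_odds, info

# Hypothetical aligned sites, loosely modeled on the consensus EDLVQETFIR.
sites = ["EDLVQ", "EDLAQ", "EDLVQ", "DDLVQ"]
log_odds, info = weight_matrix(sites)
best = "".join(max(col, key=col.get) for col in log_odds)
print(best, [round(x, 2) for x in info])
```

Fully conserved columns carry the most information, which is what the starred histogram on the slide depicts.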

5 Protein Function Expectation Maximization - MAST - Motif diagrams Gribskov 8.5

6 Protein Function MEME motifs vs structure Gribskov 8.6

7 Protein Function MEME PSSM for protein kinase Gribskov 8.7

8 Protein Function MEME motifs protein kinase Gribskov 8.8

9 Protein Function Expectation Maximization - Gibbs Sampler Evaluating all probabilities is time consuming Sample based on probability instead Select a random position in each sequence, add into weight matrix Calculate pattern (P) and background (Q) probabilities For one sequence, select a new site randomly weighted by P/Q Iterate until convergence, typically thousands of iterations Must specify number of motifs and widths No servers currently available Gribskov 8.9

10 Protein Function Expectation Maximization - Gibbs Sampler Calculate PSSM (model & background) Remove site from 1 sequence Choose a new site weighted by P(model)/P(background) Gribskov 8.10
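The sampling loop described on these two slides can be sketched as follows. This is a simplified single-motif sketch under stated assumptions, not a reference implementation: the function name, the pseudocount, the fixed iteration count (instead of a convergence test), and the example sequences are all hypothetical, and every sequence is assumed to contain only the 20 standard residues.

```python
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def gibbs_motif_search(seqs, width, iterations=2000, pseudocount=0.5, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Background Q: residue frequencies over all sequences.
    counts = Counter("".join(seqs))
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    q = {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
    # Start from a random site in each sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        held_out = rng.randrange(len(seqs))
        # Model P: column frequencies from the sites of all other sequences.
        others = [s[st:st + width]
                  for i, (s, st) in enumerate(zip(seqs, starts)) if i != held_out]
        denom = len(others) + pseudocount * len(ALPHABET)
        model = []
        for col in range(width):
            col_counts = Counter(site[col] for site in others)
            model.append({aa: (col_counts[aa] + pseudocount) / denom
                          for aa in ALPHABET})
        # Weight every candidate site in the held-out sequence by P/Q.
        seq = seqs[held_out]
        weights = []
        for st in range(len(seq) - width + 1):
            ratio = 1.0
            for col, aa in enumerate(seq[st:st + width]):
                ratio *= model[col][aa] / q[aa]
            weights.append(ratio)
        # Sample the new site in proportion to its weight.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical sequences that share the segment EDLVQ.
seqs = ["MKTAYEDLVQAAGH", "GGSEDLVQPLKRT", "PLEDLVQSSTW", "AKREDLVQGG"]
starts = gibbs_motif_search(seqs, width=5, iterations=200)
print([s[st:st + 5] for s, st in zip(seqs, starts)])
```

Because sites are sampled rather than chosen greedily, different seeds can give different answers on short runs, which is why the slide recommends thousands of iterations.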

11 Protein Function Expectation Maximization - Gibbs Sampler Gribskov 8.11

12 Protein Families Protein families - groups of homologous molecules superfamily, family, subfamily classification introduced by Dayhoff families are seen both across and within species Structural classes / Folds - similar structures based on 3-dimensional coordinates may not be homologous - not clear to what extent certain structures are preferred by chance only recently becoming populated Domain Sequence or structure based independently folding unit Families are important for information mapping because they give a guide to how much variation is expected between homologous proteins that maintain similar (or have different) function. Gribskov 8.12

13 Protein Families Dayhoff Protein Classification Hierarchical classification Folds: Structural similarity Superfamilies: P < 10^-3 Highly probable homology Superfamilies generally are entire sequences (homeomorphic family) Newer concept is homology domain - only part of sequence Families: > 50% identical (~E < 10^-30) Clear homology Similar function Substrates and function similar but not identical Subfamilies: > 80% identical (~E < 10^-80) Identical function Probably bind nearly identical substrates Gribskov 8.13
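The identity thresholds in this hierarchy translate into a simple lookup. The sketch below is only illustrative: the function name is hypothetical, and because the slide defines superfamilies by a probability cutoff rather than by identity, anything at or below 50% identity is left undecided.

```python
def dayhoff_level(percent_identity):
    """Map pairwise percent identity to a Dayhoff level using the slide's cutoffs."""
    if percent_identity > 80:
        return "subfamily"
    if percent_identity > 50:
        return "family"
    # Superfamily membership is defined by P < 10^-3, not by identity,
    # so identity alone cannot settle it.
    return "superfamily or unrelated (needs a significance test)"

for identity in (92, 63, 24):
    print(identity, dayhoff_level(identity))
```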

14 Protein Families PIR Superfamilies superfamilies As little as 15% sequence identity Generally, entire chain is used in definition (homeomorphic superfamily) Globins - hemoglobin, myoglobin, leghemoglobin, monomeric globins, erythrocruorin Homeodomain - Homeodomain, Recombinase DNA-binding domain, c-myb, DNA-binding domain, Paired domain, DNA-binding domain of rap1 Current PIR release has 283,177 sequences, 3508 superfamily alignments (including 994 homeomorphic superfamily and 386 homology domain) Gribskov 8.14

15 Protein Families Structural classifications SCOP Heuristic classification according to traditional crystallographic ideas Recently used as a standard for sequence comparisons v1.73, Sep PDB Entries Domains. CATH Systematic semi-automatic procedure with more clearly defined process Version 3.2.0, July ,215 domains Gribskov 8.15

16 Protein Families SCOP Primarily manually curated according to traditional crystallographic ideas Family: Clear evolutionary relationship Generally, pairwise residue identities greater than 30%. In some cases, similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%. Superfamily: Probable common evolutionary origin Low sequence identity, but structural and functional features suggest a common evolutionary origin. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily. Fold: Major structural similarity Major secondary structures in same arrangement and topology. Proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. Proteins with a common fold may not have a common evolutionary origin: the structural similarities could arise from physical-chemical properties of proteins that favor certain packing arrangements and chain topologies. Gribskov 8.16

17 Protein Families SCOP - Gribskov 8.17

18 Protein Families SCOP Class - All Alpha Proteins Globin-like (2) (Globins and Phycocyanins) core: 6 helices; folded leaf, partly opened; Long alpha-hairpin (11) 2 helices; antiparallel hairpin, left-handed twist Cytochrome c (1) core: 3 helices; folded leaf, opened; DNA-binding 3-helical bundle (10) core: 3 helices; bundle, closed or partly opened, right-handed twist; up-and-down Many more... Gribskov 8.18

19 Protein Families CATH Classification http://www.cathdb.info/ v 3.2.0, July 2008 CATH is more formally specified and less reliant on human intervention than SCOP Gribskov 8.19

20 Protein Families CATH Classification Class: Determined according to the secondary structure composition and packing within the structure. Assigned automatically using the method of Michie et al. (1996). Architecture: The overall shape of the domain structure as determined by the orientations of the secondary structures; ignores the connectivity between the secondary structures. Assigned manually. Topology: Fold families at this level depend on both the overall shape and connectivity of the secondary structures. This is done using the structure comparison algorithm SSAP (Taylor & Orengo, 1989). Homologous Superfamily: Similarities are identified first by sequence comparisons and subsequently by structure comparison using SSAP. Criteria: (a) sequence identity >= 35%, 60% of larger structure equivalent to smaller; (b) SSAP score >= ... and sequence identity >= 20%, 60% of larger structure equivalent to smaller; (c) SSAP score >= 80.0, 60% of larger structure equivalent to smaller, and domains have related functions. Sequence Families: Domains clustered in the same sequence families have sequence identities > 35% (with at least 60% of the larger domain equivalent to the smaller). Gribskov 8.20

21 Protein Families CATH Classification Gribskov

22 Protein Families CATH Classification Gribskov 8.22

23 Protein Families CATH classification Gribskov 8.23

24 Protein Families Protein Family Efforts Active efforts are underway to classify all proteins by family, superfamily, and fold Uniprot/PIRSF Prodom / Pfam PROSITE ProtoMap DOMO SBASE HOVERGEN SCOP CATH etc. Gribskov 8.24

25 Protein Families Clusters of Orthologous Groups COGs & KOGs genomes, 38 orders, 28 classes 14 phyla (192,987 proteins) prokaryotic (COGs) 5666 eukaryotic (KOGs) 4852 Originally (1997), 3307 COGs were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. ...% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. Gribskov 8.25

26 Protein Families COGs How well do COGs cover complete Genomes? Gribskov 8.26


27 Protein Families COGs Distribution of COG function J, translation, including ribosome structure and biogenesis L, replication, recombination and repair K, transcription O, molecular chaperones and related functions M, cell wall structure and biogenesis and outermembrane N, secretion, motility and chemotaxis T, signal transduction P, inorganic ion transport and metabolism C, energy production and conversion G, carbohydrate metabolism and transport E, amino acid metabolism and transport F, nucleotide metabolism and transport H, coenzyme metabolism I, lipid metabolism D, cell division and chromosome partitioning R, general functional prediction only S, no functional prediction. Gribskov 8.27

28 Protein Families COGs 1. Perform the all-against-all protein sequence comparison. 2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species. 3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2. 4. Merge triangles with a common side to form COGs. 5. A case-by-case analysis of each COG. This analysis serves to eliminate false-positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1-4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities. 6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs. Gribskov 8.28
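The triangle detection and merging steps can be sketched as below. This is a toy illustration rather than the published COG pipeline: it assumes paralogs are already collapsed, treats "mutually consistent" as reciprocal best hits, and uses hypothetical data structures (`best_hit`, `genome_of`) and protein names.

```python
from itertools import combinations

def find_triangles(best_hit, genome_of):
    """Triangles of proteins from three genomes that are all reciprocal best hits."""
    def reciprocal(a, b):
        return (best_hit.get((a, genome_of[b])) == b
                and best_hit.get((b, genome_of[a])) == a)
    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        distinct = len({genome_of[a], genome_of[b], genome_of[c]}) == 3
        if distinct and reciprocal(a, b) and reciprocal(b, c) and reciprocal(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) using union-find."""
    parent = list(range(len(triangles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)
    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical data: one ortholog set across genomes A-D plus an unrelated protein.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D", "a2": "A"}
orthologs = ["a1", "b1", "c1", "d1"]
# best_hit[(query, target_genome)] is the query's best hit in that genome.
best_hit = {(x, genome_of[y]): y for x in orthologs for y in orthologs if x != y}
triangles = find_triangles(best_hit, genome_of)
cogs = merge_triangles(triangles)
print(cogs)
```

Requiring a shared side, rather than a single shared protein, keeps multidomain proteins from gluing unrelated groups together as easily, which is also why step 5 inspects each COG by hand.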

29 Protein Families COGs Growth with number of genomes Gribskov 8.29

30 Structural (2006) Gribskov 8.30

31 Secondary structure prediction History and Context Chou and Fasman Garnier-Osguthorpe-Robson Comparison of Methods Newer Approaches Gribskov 8.31

32 Secondary Structure Prediction Structures Predicted Alpha helix (Pauling, 1951) Beta strand and beta sheet (Pauling, 1951) Turn (reverse turn) Coil or random coil: irregular structure, basically everything else; note that it is generally as well organized as the rest of the protein and not random in the normal sense Conclusions drawn from early crystallographic structures and other experiments Structure is encoded by the sequence Amino acid residues prefer certain structures Proteins are largely composed of regular secondary structures Gribskov 8.32

33 Secondary Structure Prediction Anfinsen experiments Proteins can be unfolded (using denaturants) and spontaneously refolded to their native structure Structural information is therefore completely encoded in the sequence Early studies of homopolymers showed some residues tend to form helices, some do not Zimm-Bragg model Helix-coil transition is cooperative Can be described by two parameters - initiation and extension Fairly accurate model for homopolymers Gribskov 8.33

34 Secondary Structure Prediction Based on early structures, it appeared that most if not all proteins might have a regular 3-dimensional structure composed of simple secondary structure elements Early crystal structures were small, globular, helix rich proteins such as hemoglobin, myoglobin, and cytochrome Secondary structure prediction methods try to use the statistical preference of residues for secondary structures with the sequence to predict the secondary structure of each residue Modern methods generally predict 4 states: helix, beta strand, turn, and (random) coil Gribskov 8.34

35 Secondary Structure Prediction Each amino acid contains an "amine" group (NH2) and a "carboxy" group (COOH) (black in diagram). The amino acids vary in their side chains (blue in the diagram). The eight amino acids in the orange area are nonpolar and hydrophobic. The other amino acids are polar and hydrophilic ("water loving"). The two amino acids in the magenta box are acidic ("carboxy" group in the side chain). The three amino acids in the light blue box are basic ("amine" group in the side chain). Gribskov 8.35

36 Alpha Helix An ideal alpha helix consists of 3.6 residues per complete turn. It is stabilized by hydrogen bonds between the backbone carbonyl (C=O) group of amino acid n and the backbone amide (N-H) group of amino acid n+4. The mean phi angle is -62 degrees and the mean psi angle is -41 degrees Gribskov 8.36

37 Beta Sheet Beta sheets are created when extended chains hydrogen bond to each other; they may consist of parallel strands, antiparallel strands, or a mixture of parallel and antiparallel strands Gribskov 8.37


38 Secondary Structure Prediction Basics of Protein structure: The four levels of protein structure are 1. Primary structure is the sequence of amino acids that compose the protein. 2. Different regions of the sequence form local secondary structures, such as alpha helices and beta strands. 3. Tertiary structure is formed by packing secondary structural elements into one or several compact globular units called domains. 4. Final protein may contain several polypeptide chains arranged in quaternary structure. Gribskov 8.38

39 Favored peptide conformations 3(10)helix Gribskov 8.39 fig

40 Secondary Structure Prediction Gribskov 8.40

41 Secondary Structure Prediction Chou/Fasman Combines the Zimm-Bragg physical idea (initiation and extension) with a statistical idea. First widely used method, first three-state prediction (alpha, beta, turn). Parameters: structural propensities $P_\alpha = f_\alpha / \langle f_\alpha \rangle$ and $P_\beta = f_\beta / \langle f_\beta \rangle$; window average $P_{struct} = \sum p_{struct} / 4$ over a 4-residue window; turn propensities $P_{t,i} = f_{t,i} / \langle f_{t,i} \rangle$ for $i = 1, \dots, 4$; where $f_{structure}$ is the frequency of a particular residue in a secondary structure and $\langle f_\alpha \rangle = N_\alpha / N_{total}$ is the average frequency of a structure. Structural propensities are odds ratios. Gribskov 8.41
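As a rough sketch of how a propensity is obtained from counts, the snippet below divides a residue's frequency in a structure by the average frequency of that structure, then averages propensities over a 4-residue window. The function names are hypothetical, and the alanine counts are borrowed from the GOR worked example later in these slides purely for illustration.

```python
def propensity(n_res_in_struct, n_res_total, n_struct_total, n_total):
    """P = f / <f>: residue frequency in a structure over the average frequency."""
    f = n_res_in_struct / n_res_total   # e.g. fraction of Ala found in helix
    mean_f = n_struct_total / n_total   # fraction of all residues in helix
    return f / mean_f

def window_average(propensities):
    """Average propensity over a window (4 residues in the slide's formula)."""
    return sum(propensities) / len(propensities)

# Illustrative counts: 240 of 390 Ala in helix, 780 of 1830 residues in helix.
p_ala = propensity(240, 390, 780, 1830)
print(round(p_ala, 2))

# Window average over four tabulated helix propensities (Glu, Ala, Leu, Met).
p_window = window_average([1.51, 1.42, 1.21, 1.45])
print(round(p_window, 4))
```

A value above 1 means the residue occurs in that structure more often than an average residue does, which is the sense in which propensities are odds ratios.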

42 Secondary Structure Prediction Chou/Fasman Conformational Parameters (structural propensities) The Chou and Fasman method was developed before the wide availability of computers and was thus designed to be calculated by hand. Each residue was assigned to a class: forming residues favor a structure, breaking residues stop the extension of a structure, and indifferent residues do neither. Turn propensity was originally calculated for the four residues of a beta turn, but was replaced by position specific propensities.
P α          P β          P t
Glu 1.51     Val 1.70     Asn 1.56
Met 1.45     Ile 1.60     Gly 1.56
Ala 1.42     Tyr 1.47     Pro 1.52
Leu 1.21     Phe 1.38     Asp 1.46
Lys 1.16     Trp 1.37     Ser 1.43
Phe 1.13     Leu 1.30     Cys 1.19
Gln 1.11     Cys 1.19     Tyr 1.14
Trp 1.08     Thr 1.19     Lys 1.01
Ile 1.08     Gln 1.10     Gln 0.98
Val 1.06     Met 1.05     Thr 0.96
Asp 1.01     Arg 0.93     Trp 0.96
His 1.00     Asn 0.89     Arg 0.95
Arg 0.98     His 0.87     His 0.95
Thr 0.83     Ala 0.83     Glu 0.74
Ser 0.77     Ser 0.75     Ala 0.66
Cys 0.70     Gly 0.75     Met 0.60
Tyr 0.69     Lys 0.74     Phe 0.60
Asn 0.67     Pro 0.55     Leu 0.59
Pro 0.57     Asp 0.54     Val 0.50
Gly 0.57     Glu 0.37     Ile 0.47
Gribskov 8.42

43 Secondary Structure Prediction - Chou Fasman Position specific turn parameters Each position of a turn has distinct preferences, for instance: turn position 2 ($f_{i+1}$), Proline (more than double any other residue), Trp strongly avoided (0.013); turn position 3 ($f_{i+2}$), Asn (0.191), Gly (0.190), and Asp (0.179) strongly preferred; turn position 4 ($f_{i+3}$), Trp (0.162) and Gly (0.152) strongly preferred. [Table: frequencies at turn positions i, i+1, i+2, i+3 for Ala Arg Asp Asn Cys Glu Gln Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val] Gribskov 8.43

44 Secondary Structure Prediction - Chou/Fasman Chou/Fasman procedure Find helical initiation regions Extend helices until they reach tetrapeptide breakers Find beta initiation regions Extend until they reach tetrapeptide breakers Find turns Resolve conflicts between alpha and beta Somewhat subjective Chou and Fasman suggest using additional information: alpha-beta pattern, i.e. does this look like a β α β structure; end probabilities. Chou and Fasman in later papers also tabulated the preferences for the residues to occur at the amino and carboxyl terminal ends of α and β structures. These can be used to resolve overlaps. Chou and Fasman did not provide an explicit algorithm for this conflict resolution, relying on their expert judgment. This meant that each person's prediction could be different. Most people are not experts. Gribskov 8.44

45 Secondary Structure Prediction - Chou/Fasman Chou-Fasman Rules Helix: four out of 6 helical residues initiate a helix; the helix is extended in both directions to "tetrapeptide breaker" segments; segments >6 residues with $P_\alpha > 1.03$ and $P_\alpha > P_\beta$ are helical. Note that a helix must be 4 residues long to form the first hydrogen bonds that make it a helix. Strand: 3 out of 5 beta forming residues initiate a beta strand; the strand extends in both directions to a tetrapeptide breaker; segments with $P_\beta > 1.05$ and $P_\beta > P_\alpha$ are beta. Turn: the probability of a turn, $p_t$, is a product over the four turn positions, $p_t = \prod_{i=1}^{4} f_{t,i}$; tetrapeptides with $p_t > 0.75 \times 10^{-4}$, average turn propensity $P_t > 1.0$, and $P_\alpha < P_t > P_\beta$ are turns. Gribskov 8.45
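The segment thresholds in these rules can be illustrated with a deliberately simplified window scan. This sketch is not the full Chou-Fasman procedure: it skips nucleation counting, tetrapeptide breakers, and turns, and its tie-breaking rule for overlapping helix and strand calls is an assumption of this example. The propensity values are the tabulated P α and P β values from the conformational parameter slide.

```python
P_ALPHA = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13,
           "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
           "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
           "P": 0.57, "G": 0.57}
P_BETA = {"V": 1.70, "I": 1.60, "Y": 1.47, "F": 1.38, "W": 1.37, "L": 1.30,
          "C": 1.19, "T": 1.19, "Q": 1.10, "M": 1.05, "R": 0.93, "N": 0.89,
          "H": 0.87, "A": 0.83, "S": 0.75, "G": 0.75, "K": 0.74, "P": 0.55,
          "D": 0.54, "E": 0.37}

def window_calls(seq, width, primary, other, threshold):
    """Mark residues covered by any window whose mean propensity passes the rule."""
    calls = [False] * len(seq)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        mean_primary = sum(primary[aa] for aa in window) / width
        mean_other = sum(other[aa] for aa in window) / width
        if mean_primary > threshold and mean_primary > mean_other:
            for i in range(start, start + width):
                calls[i] = True
    return calls

def predict(seq):
    """Return a string of H (helix), E (strand), or - per residue."""
    helix = window_calls(seq, 6, P_ALPHA, P_BETA, 1.03)
    strand = window_calls(seq, 5, P_BETA, P_ALPHA, 1.05)
    out = []
    for i, aa in enumerate(seq):
        if helix[i] and strand[i]:
            # Assumed tie-break: prefer the larger single-residue propensity.
            out.append("H" if P_ALPHA[aa] >= P_BETA[aa] else "E")
        elif helix[i]:
            out.append("H")
        elif strand[i]:
            out.append("E")
        else:
            out.append("-")
    return "".join(out)

print(predict("EEEEEE"), predict("VVVVV"), predict("GGGGGG"))
```

Making the conflict rule explicit is exactly what the original method lacked, so any such rule should be read as one choice among many.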

46 Secondary Structure Prediction - Chou/Fasman Example structure - Pig Adenylate Kinase KAD1_PIG Feature table: FT HELIX 2 6; FT TURN 7 7; FT STRAND 10 15; FT TURN 17 18; FT HELIX 21 31; FT TURN 32 32; FT STRAND 35 38; FT HELIX ...; ... 90 93; FT HELIX 99 108; FT TURN 109 109; FT STRAND 114 119; FT HELIX 122 136; FT TURN 137 ... Gribskov 8.46

47 Secondary Structure Prediction - GOR Garnier-Osguthorpe-Robson Based on an information theoretic approach Formulation: $I(S=X:{\sim}X; y) = \ln\left( \frac{P(S=X \mid y)}{P(S={\sim}X \mid y)} \right) - \ln\left( \frac{P(S=X)}{P(S={\sim}X)} \right)$ This time a log-odds ratio First term is a foreground model, second term is a background model GOR uses a completely specified algorithm with little subjectivity Gribskov 8.47

48 Secondary Structure Prediction - GOR Calculation of information parameters For example, for alanine: 240 found in helix, 150 not in helix, total 390 residues; for all residues: 780 in helix (H), 1050 not in helix (~H), total 1830. $P(S=H \mid A) = 240/390 = 0.615$; $P(S={\sim}H \mid A) = 150/390 = 0.385$; $P(S=H) = 780/1830 = 0.426$; $P(S={\sim}H) = 1050/1830 = 0.574$. $I(S=H:{\sim}H; A) = \ln(0.615/0.385) - \ln(0.426/0.574) \approx 0.468 - (-0.298) \approx 0.77$ nats $\approx 77$ centinats Gribskov 8.48
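The alanine calculation can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name; it works directly from the counts, so the ratio of conditional probabilities reduces to a ratio of counts. With unrounded counts the result is about 0.767 nats, which can differ slightly from a hand calculation that uses rounded probabilities.

```python
import math

def gor_information(n_res_in, n_res_out, n_all_in, n_all_out):
    """I(S=X:~X; y) in nats: log-odds given residue y minus background log-odds."""
    foreground = math.log(n_res_in / n_res_out)   # ln P(S=X|y) / P(S=~X|y)
    background = math.log(n_all_in / n_all_out)   # ln P(S=X) / P(S=~X)
    return foreground - background

# Alanine: 240 in helix, 150 not; all residues: 780 in helix, 1050 not.
info_nats = gor_information(240, 150, 780, 1050)
print(round(info_nats, 3), "nats =", round(100 * info_nats, 1), "centinats")
```

A positive value means alanine is found in helix more often than the background rate would predict.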

49 Secondary Structure Prediction GOR Positional preference is strong and amazingly unique, compare Ser-Thr or Leu-Ile, pairs of residues often considered very similar Information is concentrated in region around the residue being predicted (position 0) Large effects can be seen greater than 5 residues away, e.g. Trp Note the directional preference of Glu, Lys, and Arg due to helix dipole helix beta or turn Gribskov 8.49

50 Secondary Structure Prediction GOR Gribskov 8.50 helix beta or turn

51 Secondary Structure Prediction GOR helix beta or turn Gribskov 8.51

52 Secondary Structure Prediction GOR Four states predicted - predicted structure is highest value summed over a window from -8 to +8 around the position being predicted (17 positions) Minor ambiguities arise when helices shorter than 4 residues are predicted Decision constants used to bias the result Calculation assumes that the subject protein has proportions of alpha/beta/turn/coil same as "training" data, so it makes sense to adjust the prediction if you have some idea of the α/β content. GOR suggested using an initial prediction to estimate the α/β content and choosing an appropriate decision constant for the final prediction Gribskov 8.52

53 PEPPLOT (Garnier prediction) of: kad1_pig.embl check: 362 from: 1 to: 194 June 2, :37
ID KAD1_PIG STANDARD; PRT; 194 AA.
AC P00571;
DT 21-JUL-1986 (Rel. 01, Created)
DT 21-JUL-1986 (Rel. 01, Last sequence update)
DT 15-FEB-2000 (Rel. 39, Last annotation update)
DE ADENYLATE KINASE ISOENZYME 1 (EC ) (ATP-AMP TRANSPHOSPHORYLASE)
Structural composition for no decision constant: alpha = 39.7% beta = 16.5%
%Alpha No DC <20 <20 < >50 >50
% Beta No DC < >50 < >50 <
Pos
  ? A B B B A A A A A
 10 B B B B B B B B B
 11 B B B B B B B B B
 12 B B B B B B B B B
 13 B B B B B B B B B
 14 B B B B B B B B B
 15 C C B B C B B C B
 16 C C C C C C C C C
 17 C C C C C C C C C
 18 T T T T T T T T T
 19 T T T T T T T T T
 20 T T T T T T T T T
 21 T T T T T T T T T
 22 C C C C C C C C C
 23 C C B B C B B C B
 24 A C B B A B B A A
 25 B B B B A B B A B
 26 B B B B A B B A B
 27 B B B B A B B A B
 28 B B B B A B B A B
 29 B B B B A B B A B
 30 T T B B A B B A B
 31 T T B B T B B T B
 32 T T B B T B B T B
 33 T T B B T B B T B
 34 T T B B T B B T B
 35 B B B B B B B B B
 36 B B B B B B B B B
 37 B C B B C B B C B
 38 C C B B C B B C B
 39 T T B B T B B T B
 40 C C C C C C C C C
 41 C C C C A A A A A
 42 C C B B A A A A A
 43 B B B B A B B A A
 44 B B B B A B B A B
 45 A B B B A B B A A
 46 A B B B A B B A A
 47 A C B B A A A A A
 48 C C C C A A A A A
 49 C C C C A A A A A
 50 C C C C A A A A A
 51 C C C C A A A A A
 52 A C C C A A A A A
 53 A C C C A A A A A
 54 C C C C A A A A A
 55 A C B B A A A A A
 56 A C B B A A A A A
 57 A C B B A A A A A
 58 A A B B A A A A A
 59 A A B B A A A A A
 60 A A B B A A A A A
 61 A A A A A A A A A
 62 A A A A A A A A A
 63 A A A A A A A A A
 64 A T T T A A A A A
 65 A C C C A A A A A
 66 C C C C A A A A A
 67 C C B B A B B A A
 68 A B B B A B B A A
 69 A B B B A A A A A
 70 A B B B A A A A A
 71 A B B B A A A A A
 72 A A B B A A A A A
 73 A A B B A A A A A
 74 A A B B A A A A A
 75 A A A A A A A A A
Gribskov 8.53

54 Secondary Structure Prediction Neural Network Neural network - Typical Topology α β Output Layer Weighted connections Hidden Layer Input Layer Gribskov 8.54

55 Secondary Structure Prediction - Neural Networks Input is often binary - therefore must use multiple inputs for each amino acid residue. Trainable parameters: connections between layers are weighted. Some transmit full signal, others less. This is where the intelligence of the neural net is encoded. Each hidden node contains a response function: how much signal does it pass on to the next layer for how much input [Figure: response function, output versus input] Gribskov 8.55

56 Secondary Structure Prediction - Neural Network Binary coding of amino acid residues 20 residues require 5 bits, for instance, ala = ..., cys = 00010, asp = ..., trp = 10100 Could alternatively encode 5 properties, e.g., hydrophobicity, side chain size,... So long as you can uniquely specify the residues (or as uniquely as you want to) Gribskov 8.56
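Two common ways to turn residues into binary network inputs are sketched below: a compact 5-bit code and a one-per-residue (one-hot) code over a sliding window. The residue numbering (alphabetical by one-letter code, starting at 1) and the 13-residue window are assumptions of this example, so the codes need not match the ones on the slide.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def five_bit_code(aa):
    """5-bit binary code for a residue (assumed numbering: A=1 ... Y=20)."""
    return format(ALPHABET.index(aa) + 1, "05b")

def one_hot(aa):
    """20 binary inputs, exactly one of which is set."""
    return [1 if aa == x else 0 for x in ALPHABET]

def encode_window(seq, center, half_width=6):
    """Concatenate one-hot blocks for a window centered on one residue.

    Positions that fall off either end of the sequence are padded with zeros.
    """
    bits = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(seq):
            bits.extend(one_hot(seq[pos]))
        else:
            bits.extend([0] * len(ALPHABET))
    return bits

print(five_bit_code("A"), five_bit_code("C"))
window = encode_window("ACDEFGHIK", center=0)
print(len(window), sum(window))
```

The 5-bit code is compact but makes some residues look numerically closer than others; the one-hot code avoids that at the cost of 20 inputs per window position.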

57 Analysis Secondary Structure Prediction - Neural Network Neural networks are trained by back-propagation or back-prop Connections that agree with truth are increased, those that disagree decreased Iterate over many training examples Models may have very many parameters and hence memorize training data. Cross validation very important [Figure: network with outputs α and β and activated connections for inputs A, D, C] Gribskov 8.57

58 Secondary Structure Prediction Neural network Structure 13 to 17 input blocks 0 to 40 hidden units Two or three output units Neural nets are trained Requires training set and testing set training and testing sets must be independent Gribskov 8.58

59 Neural Networks layered networks First network is a sequence-to-structure network: predicts structure of single residues. Second network is a structure-to-structure network: predicts structure based on the structural context of single residue states, without reference to sequence information. The overall accuracy is increased by a simple majority vote of a set of networks with different architectures Gribskov 8.59
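The majority vote over several networks is easy to sketch. The example assumes each network has already produced a per-residue state string of equal length; the function name and the predictions are hypothetical, and ties go to the state seen first at that position.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine equal-length per-residue state strings by majority vote."""
    consensus = []
    for states in zip(*predictions):
        # most_common keeps first-seen order among equal counts.
        consensus.append(Counter(states).most_common(1)[0][0])
    return "".join(consensus)

# Hypothetical outputs from three networks (H = helix, E = strand, C = coil).
combined = majority_vote(["HHEC", "HEEC", "CHEC"])
print(combined)
```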

60 Secondary Structure Prediction Extensions: Structure type: all alpha, all beta, alpha/beta, alpha+beta Doublet or triplet information Homology / Consensus Multiple sequence - averages or consensus over several proteins Supersecondary/ Tertiary structure Gribskov 8.60

61 Secondary Structure Prediction Definitions: N = p + q + u + v; p = number predicted in structure and observed in structure; q = number predicted not in structure and not observed in structure; u = number predicted not in structure but observed in structure (underpredicted); v = number predicted in structure but not observed in structure (overpredicted). Fraction correct: $F = p / (p + u)$ Underweights overprediction errors - makes method look good. Correlation: 1 = perfect, 0 = random, -1 = perfectly wrong. $C = (p/N - RS) / \sqrt{RS(1-R)(1-S)}$, where $R = (p + v)/N$ is the fraction predicted to be in structure and $S = (p + u)/N$ is the fraction observed in structure. Gribskov 8.61
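The two accuracy measures can be compared on small examples. In this sketch u counts residues observed in the structure but missed by the prediction and v counts residues predicted in the structure but not observed there, which is the reading implied by the formulas for R and S. Returning 0 when the correlation is undefined, and modeling a one-residue helix shift as one missed plus one overpredicted residue per helix, are assumptions of this example.

```python
import math

def fraction_correct(p, u):
    """F = p / (p + u): share of observed-structure residues that were predicted."""
    return p / (p + u)

def correlation(p, q, u, v):
    """C = (p/N - R*S) / sqrt(R*S*(1-R)*(1-S))."""
    n = p + q + u + v
    r = (p + v) / n  # fraction predicted to be in structure
    s = (p + u) / n  # fraction observed in structure
    denom = math.sqrt(r * s * (1 - r) * (1 - s))
    if denom == 0:
        return 0.0  # assumed convention when the correlation is undefined
    return (p / n - r * s) / denom

# A 100-residue protein with 80 residues observed in helix.
everything_helix = (80, 0, 0, 20)   # p, q, u, v: all residues predicted helix
perfect = (80, 20, 0, 0)
shifted = (72, 12, 8, 8)            # 8 helices, each shifted by one residue

for name, (p, q, u, v) in [("all helix", everything_helix),
                           ("perfect", perfect),
                           ("shifted", shifted)]:
    print(name, round(fraction_correct(p, u), 2), round(correlation(p, q, u, v), 2))
```

Predicting helix everywhere scores a perfect F but no correlation, which is the sense in which F underweights overprediction.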

62 Secondary Structure Prediction Example: an all helical protein (80% of residues in helix) prediction 1: 100% of residues predicted in helix prediction 2: 80% in helix, 20% non-helix (in correct positions) prediction 3: 80%/20% but with helix off by 1 residue, (8 helices) F C Gribskov 8.62

63 Secondary Structure Prediction Accuracy - Is this good? 40-90% depending on protein 50-60% for simple statistical models (i.e. GORI, Chou/Fasman) 60-85% for extended models Must be evaluated on test set not included in training Test set should have homologs removed Test set should cover various structural families all alpha, all beta, alpha/beta, alpha+beta Gribskov 8.63

64 Secondary Structure Prediction Problems When you are done you still don't know the structure It is difficult to go from secondary structure to three dimensional structure Solution Homology based modeling Use a known structure as a template Use molecular dynamics/minimization approaches to adjust for differences Gribskov 8.64