Lecture: Algorithmic Bioinformatics (19710), WS 2013/2014, Week 3 - Wednesday


1 Lecture: Algorithmic Bioinformatics (19710), WS 2013/2014, Week 3 - Wednesday. Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin

2 Lecture topics
Part 1: Background Basics (4): 1. The Nucleic Acid World; 2. Protein Structure; 3. Dealing with Databases
Part 2: Sequence Alignments (2): 4. Producing and Analyzing Sequence Alignments; 5. Pairwise Sequence Alignment and Database Searching; 6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3): 7. Recovering Evolutionary History; 8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4): 9. Revealing Genome Features; 10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4): 11. Obtaining Secondary Structure from Sequence; 12. Predicting Secondary Structures
Part 6: Tertiary Structures (4): 13. Modeling Protein Structure; 14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8): 15. Proteome and Gene Expression Analysis; 16. Clustering Methods and Statistics; 17. Systems Biology

3 3rd semester (WS 12/13): DP; pairwise sequence alignment (Needleman/Wunsch, Smith-Waterman); FastA; BLAST; multiple sequence alignment; HMMs. Today: last part of the alignment block (review). Book: 6.1, 6.2, 6.6

4 Alignment scoring matrix. Protein matrix:

5 Use of a scoring matrix P L S - - C F G G L T - A C H L Score = 3 5

6 Multiple sequence alignment

7 Sequence logo

8 Profiles and sequence logos

9 Biological Motifs. A large number of biological units with common functions tend to exhibit similarities at the sequence level. These range from very short motifs, such as gene splice sites and DNA regulatory binding sites recognized by transcription factors (proteins that bind to the promoter and control gene expression), through microRNAs, all the way to protein families. It is often desirable to model such motifs in order to search for new instances. Probabilistic models are very useful for this task.

10 Promoter

11 Regulation of Genes. Figure labels: transcription factor (protein), RNA polymerase (protein), DNA, regulatory element, gene.

12 Regulation of Genes. Figure labels: transcription factor (protein), RNA polymerase, DNA, regulatory element, gene.

13 Regulation of Genes. Figure labels: transcription factor, new protein, RNA polymerase, DNA, regulatory element, gene.

14 Motif Logo. Motifs can mutate at less important bases. The five motif instances listed here differ at positions 3 and 5. Representations called motif logos illustrate the conserved regions of a motif.
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
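The claim that the five instances differ only at positions 3 and 5 can be checked by counting bases per column. The following is a minimal sketch; the function names `column_counts` and `consensus` are illustrative and do not come from the lecture.

```python
from collections import Counter

# The five motif instances listed on the slide.
motifs = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]

def column_counts(seqs):
    """Return one Counter of base counts per alignment column."""
    return [Counter(column) for column in zip(*seqs)]

def consensus(seqs):
    """Pick the most frequent base in each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(seqs))

counts = column_counts(motifs)
# 1-based positions at which the instances do not all agree.
variable_positions = [i + 1 for i, c in enumerate(counts) if len(c) > 1]

print(variable_positions)
print(consensus(motifs))
```

The per-column counts are exactly the raw material of a frequency matrix, and a sequence logo is a graphical rendering of the same information.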

15 Example: Calmodulin-Binding Motif (calcium-binding proteins)

16 PSSM Starting Point: a gap-less MSA of known instances of a given motif. The motif can be represented by either: 1. a consensus; 2. a Position Specific Scoring Matrix (PSSM).

17 Sequence logos: Visualizing PSSMs

18 Frequency matrix

19 Frequency matrices. Three uses of frequency matrices:
- Describe a sequence feature
- Calculate the probability of occurrence of the feature in a random sequence
- Calculate the degree of match between a new sequence and a feature

20 Frequency Matrices, PSSMs, and Profiles. A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores. PSSMs are also called Position Weight Matrices (PWMs) or profiles.

21 Methods for converting frequency matrices to PSSMs
1. Using the log ratio of observed to expected frequencies,
$$\mathrm{score}(j, i) = \log \frac{m(j, i)}{f(j)}$$
where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences).
2. Using an amino acid substitution matrix (Dayhoff similarity matrix).

22 Pseudo-counts. How do we get a score for a position with zero counts for a particular character? We can't take log(0). Solution: add a small number to all positions with zero frequency.
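A minimal sketch of the log-ratio conversion with pseudo-counts for a DNA motif is given here. Several choices are assumptions rather than part of the slides: the pseudo-count is added to every base (not only to zero counts), the background is uniform, the logarithm is base 2, and the function name `log_odds_pssm` is illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def log_odds_pssm(seqs, background=None, pseudocount=1.0):
    """Build a log2-odds PSSM: a list with one {base: score} dict per column."""
    if background is None:
        # Assumed uniform background; real applications estimate f(j) from data.
        background = {b: 1.0 / len(ALPHABET) for b in ALPHABET}
    n = len(seqs)
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        scores = {}
        for base in ALPHABET:
            # The pseudo-count keeps the frequency above zero, so log() is defined.
            freq = (counts[base] + pseudocount) / (n + pseudocount * len(ALPHABET))
            scores[base] = math.log2(freq / background[base])
        pssm.append(scores)
    return pssm

motifs = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
pssm = log_odds_pssm(motifs)
for position, scores in enumerate(pssm, start=1):
    print(position, {base: round(score, 2) for base, score in scores.items()})
```

Bases never seen in a column still receive a finite negative score, which is the purpose of the pseudo-count.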

23 Consensus sequences. Different ways to describe a consensus, from crude to refined: consensus site, sequence logo, Position Specific Score Matrix (PSSM), Hidden Markov Model (HMM).

24 Constructing a consensus
1. Collect sequences.
2. Align sequences (consensus sites are descriptions of the alignment).
3. Condense the set of sequences into a compact description (consensus, PSSM, or HMM).
4. Apply the scoring matrix in alignments/searches.

25 Position Specific Score Matrix (PSSM). A position specific scoring matrix (PSSM) is built from the amino acid (or nucleic acid) frequencies at every position of a multiple alignment. The scores calculated from these frequencies are higher for residues that appear at a given position more often than expected by chance.

26 Creating a PSSM: Example. Alignment:
NTEGEWI
NITRGEW
NIAGECC
Amino acid frequencies at every position of the alignment:

27 Creating a PSSM: Example. Amino acids that do not appear at a specific position of a multiple alignment must also be considered in order to model every possible sequence and have calculable log-odds scores. A simple procedure called pseudo-counts assigns minimal scores to residues that do not appear at a certain position of the alignment according to the following equation:
$$\mathrm{Score}(i, j) = \frac{\mathrm{Frequency}(i, j) + \mathrm{pseudocount}}{N + 20 \cdot \mathrm{pseudocount}}$$
where Frequency(i, j) is the frequency of residue i in column j (the count of occurrences), pseudocount is a number greater than or equal to 1, and N is the number of sequences in the multiple alignment.

28 Creating a PSSM: Example. In this example, N = 3, and let's use pseudocount = 1. Raw frequencies: Score(N) at position 1 = 3/3 = 1; Score(I) at position 1 = 0/3 = 0. Readjusted with pseudo-counts: Score(I) at position 1 -> (0+1) / (3+20) = 1/23 ≈ 0.043; Score(N) at position 1 -> (3+1) / (3+20) = 4/23 ≈ 0.174. The PSSM is obtained by taking the logarithm of these values divided by the background frequency of the residues. To simplify, this example assumes that every amino acid appears equally often in protein sequences, i.e. f_i = 0.05 for every i. PSSM Score(I) at position 1 = log(0.043 / 0.05), which is negative because 0.043 < 0.05. PSSM Score(N) at position 1 = log(0.174 / 0.05), which is positive because 0.174 > 0.05.

29 Creating a PSSM: Example. The matrix assigns positive scores to residues that appear more often than expected by chance and negative scores to residues that appear less often than expected by chance.
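The worked example for the alignment NTEGEWI / NITRGEW / NIAGECC can be reproduced with a few lines of code. This is a sketch under stated assumptions: the slides do not name the base of the logarithm, so base 10 is assumed here, and the helper name `pssm_score` is illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]

def pssm_score(residue, position, seqs, pseudocount=1, background=0.05):
    """Log-odds score of `residue` at the 1-based `position`.

    Uses (count + pseudocount) / (N + 20 * pseudocount) as the adjusted
    frequency and a uniform background of 0.05, as in the example.
    The base-10 logarithm is an assumption.
    """
    n = len(seqs)
    count = sum(1 for s in seqs if s[position - 1] == residue)
    adjusted = (count + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
    return math.log10(adjusted / background)

score_n = pssm_score("N", 1, alignment)  # N occurs in all three sequences
score_i = pssm_score("I", 1, alignment)  # I never occurs at position 1

print(f"N at position 1: {score_n:.3f}")
print(f"I at position 1: {score_i:.3f}")
```

Whatever the base of the logarithm, the sign of each score is the same: positive for the over-represented N, negative for the absent I.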

30 Using a PSSM. To search for matches to a PSSM, scan along the sequence using a window of the length (L) of the PSSM. The matrix is slid along the sequence one residue at a time, and the scores of the residues in every region of length L are added. Scores higher than an empirically predetermined threshold are reported.
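The sliding-window scan described above can be sketched as follows. The three-column PSSM and its scores are made up for illustration (they favour the word TGA), and the function name `scan` and the threshold value are assumptions.

```python
def scan(sequence, pssm, threshold):
    """Slide a PSSM (list of {residue: score} dicts) along `sequence`.

    Returns a list of (start_index, score) for every window of length
    len(pssm) whose summed score is higher than `threshold`.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[residue] for column, residue in zip(pssm, window))
        if score > threshold:
            hits.append((start, score))
    return hits

# Toy PSSM with invented scores; a real one would come from an alignment.
toy_pssm = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": 2, "C": -1, "G": -1, "T": -1},
]

hits = scan("CCTGACTGATT", toy_pssm, threshold=5)
print(hits)  # 0-based start positions with their scores
```

In practice the threshold is chosen empirically, for example from the score distribution on sequences known not to contain the motif.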

31 Searching with a PSSM. Most approaches use dynamic programming, usually the Smith-Waterman variant. This is an excellent method for finding distantly related sequences. The gap model is affine, with gap-open and gap-extend penalties that depend on the position in the alignment. It can be used to locate a motif in an alignment and then edit the alignment.

32 PSI-BLAST

33 Position-Specific Iterated BLAST
Intuition: substitution matrices should be specific to a particular site, e.g. penalize an alanine-to-glycine substitution more in a helix.
Idea: use BLAST with high stringency to get a set of closely related sequences. Align those sequences to create a new substitution matrix for each position. Then use that matrix to find additional sequences.
- Cycling/iterative method
- Gives increased sensitivity for detecting distantly related proteins
- Can give insight into functional relationships
- Very refined statistical methods
- Fast: still based on BLAST methods
- Simple to use

34 PSI-BLAST Principle
1. First, a standard blastp is performed.
2. The highest scoring hits are used to generate a multiple alignment.
3. A PSSM is generated from the multiple alignment. Highly conserved residues get high scores; less conserved residues get lower scores. Sequences >98% similar are not included (to avoid biasing the PSSM).
4. Another similarity search is performed, this time using the new PSSM.
5. Steps 2-4 can be repeated until convergence, i.e. until no new sequences appear after an iteration.
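The iteration in steps 1-5 is essentially a loop that stops when a search round returns nothing new. The sketch below shows only this control flow. All three callables are hypothetical stand-ins, not BLAST functions, and the toy "sequences" are plain integers; the exclusion of sequences >98% similar is not modelled.

```python
def iterate_profile_search(query, initial_search, build_pssm, pssm_search,
                           max_iterations=10):
    """Repeat build-and-search until no new sequences appear.

    Hypothetical stand-ins supplied by the caller:
      initial_search(query)   -> iterable of hit identifiers (step 1)
      build_pssm(query, hits) -> any profile object          (steps 2-3)
      pssm_search(pssm)       -> iterable of hit identifiers (step 4)
    Returns the final hit set and the number of search rounds performed.
    """
    hits = set(initial_search(query))
    for iteration in range(1, max_iterations + 1):
        pssm = build_pssm(query, hits)
        new_hits = set(pssm_search(pssm)) - hits
        if not new_hits:  # step 5: convergence
            return hits, iteration
        hits |= new_hits
    return hits, max_iterations


# Toy stand-ins that only exercise the control flow: a profile built from
# a hit set "reaches" the next integer, up to a maximum of 3.
def toy_initial_search(query):
    return {query}

def toy_build_pssm(query, hits):
    return frozenset(hits)

def toy_pssm_search(pssm):
    return {h + 1 for h in pssm if h + 1 <= 3} | set(pssm)

found, rounds = iterate_profile_search(
    1, toy_initial_search, toy_build_pssm, toy_pssm_search)
print(found, rounds)
```

The `max_iterations` cap is a design choice for the sketch: it guards against a profile that keeps drifting and never converges.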


35 Example: Aminoacyl-tRNA Synthetases. 20 enzymes for 20 amino acids. Each is very different: big, small, monomers, tetramers. All bind to their appropriate tRNAs and amino acids with high specificity. TrpRS and TyrRS share only 13% sequence identity, BUT the overall structures of TrpRS and TyrRS are similar (structure-function relationship). TrpRS = tryptophanyl-tRNA synthetase; TyrRS = tyrosyl-tRNA synthetase.

36 Same SCOP family, based on the catalytic domain. Overall structural similarity noted.

37 So is there sequence similarity between TyrRS and TrpRS? Given the structural similarities, we would expect to find sequence similarity. BUT: a blastp search of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity to TrpRS at an e-value cutoff of 10.

38 No TrpRS!?

39 Try Using PSI-BLAST. PSI-BLAST is available from the BLAST main page. The query form is just like the one for blastp, BUT one extra formatting option must be used: activate the "Format for PSI-BLAST" tick box! A second e-value cutoff, the threshold for inclusion, determines which alignments will be used to build the PSSM. First search using TyrRS as query: Db = SwissProt; limit = Bacteria [ORGN]; Threshold for inclusion =


42 After A Few Iterations

43 TyrRS Similarity to TrpRS!

44 Power of PSI-BLAST. We knew that TyrRS and TrpRS were similar, both functionally and structurally. BLASTP gave no indication of this, but PSI-BLAST was able to detect their weak sequence similarity. Words of caution:
- Be sure to inspect and think about the results included in the PSSM build.
- Include/exclude sequences on the basis of biological knowledge: you are in the driving seat!
- PSI-BLAST performance varies according to the choice of matrix, filter, statistics, etc., just like BLASTP.

45 Why (not) PSI-BLAST. If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are all homologous, the sensitivity at a given specificity improves significantly. However, if non-homologous sequences are included in the PSSMs, the PSSMs are corrupted. They then pull in more non-homologous sequences and become worse than generic matrices.

46 Does the query really have a relationship with the results? One way to check is to run the search in the opposite direction, using a result as the query. Such searches are often not reversible, however, even when the homology is true.

47 PSI-BLAST caveats
- Increased ability to find distant homologues, at the cost of the additional care required to prevent non-homologous sequences from being included in the PSSM calculation.
- When in doubt, leave it out!
- Examine sequences with moderate similarity carefully.
- Be particularly cautious about matches to sequences with highly biased amino acid content.
- Low-complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology. Screen them out of your query sequences!

48 Profile HMMs (Hidden Markov Models)

49 Markov Chains (weather example with the states rain, sunny, cloudy). States: three states - sunny, cloudy, rainy. State transition matrix: the probability of the weather given the previous day's weather. Initial distribution: the probability of the system being in each of the states at time 0.
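The three components of the weather chain translate directly into data structures. In the sketch below all probabilities are invented for illustration, since the slide gives none; the function name `sequence_probability` is likewise an assumption.

```python
states = ["sunny", "cloudy", "rainy"]

# Illustrative numbers only; the slide does not specify probabilities.
initial = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
transition = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}

def sequence_probability(days):
    """Probability of observing the given sequence of daily states."""
    p = initial[days[0]]
    # Each day depends only on the previous day (the Markov property).
    for previous, current in zip(days, days[1:]):
        p *= transition[previous][current]
    return p

p = sequence_probability(["sunny", "sunny", "rainy"])
print(p)  # initial[sunny] * P(sunny -> sunny) * P(sunny -> rainy)
```

Each row of the transition matrix sums to 1, because the chain must move to some state on the next day.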

50 Hidden Markov Models. Hidden states: the (TRUE) states of a system that may be described by a Markov process (e.g., the weather). Observable states: the states of the process that are "visible" (e.g., seaweed dampness).

51 Components of an HMM. Output matrix: contains the probability of observing a particular observable state given that the hidden model is in a particular hidden state. Initial distribution: contains the probability of the (hidden) model being in a particular hidden state at time t = 1. State transition matrix: holds the probability of a hidden state given the previous hidden state.
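With these three components, the probability of a sequence of observations can be computed by the forward algorithm, which sums over all hidden-state paths without enumerating them. The sketch below uses a reduced two-state weather model with seaweed observations; every number in it is invented for illustration.

```python
hidden_states = ["sunny", "rainy"]
observables = ["dry", "damp"]

# Illustrative parameters only; none of these values come from the slides.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
output = {
    "sunny": {"dry": 0.8, "damp": 0.2},
    "rainy": {"dry": 0.1, "damp": 0.9},
}

def forward(observations):
    """Total probability of the observation sequence, summed over all paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: initial[s] * output[s][observations[0]] for s in hidden_states}
    for obs in observations[1:]:
        alpha = {
            s: output[s][obs] * sum(alpha[r] * transition[r][s] for r in hidden_states)
            for s in hidden_states
        }
    return sum(alpha.values())

p = forward(["dry", "damp"])
print(p)
```

The cost grows linearly with the number of observations, whereas enumerating all hidden paths would grow exponentially.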

52 Building from an existing alignment
ACA - - - ATG
TCA ACT ATC
ACA C - - AGC
AGA - - - ATC
ACC G - - ATC
Figure labels: output probabilities, insertion, transition probabilities. An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.

53 Query a new sequence. Suppose I have a query protein sequence and want to know which family it belongs to. Many paths through the model can generate this sequence; to obtain its total probability, all these paths must be found and their probabilities summed. Consensus sequence: ACAC - - ATC. The probability along the consensus path is P(ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x 0.8x1 x 0.8 ≈ 4.7 x 10^-2
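The product for P(ACACATC) can be checked numerically. The sketch reads each pair in the product as an emission probability times the following transition probability, with the last state contributing only an emission; this reading is an assumption, since the model figure is not reproduced in the text. The log-space variant is included because products of many probabilities underflow for long sequences.

```python
import math

# (emission, transition) pairs as they appear in the slide's product.
# The final state has no outgoing transition, so 1.0 is used as a neutral factor.
path = [
    (0.8, 1.0),
    (0.8, 1.0),
    (0.8, 0.6),
    (0.4, 0.6),
    (1.0, 1.0),
    (0.8, 1.0),
    (0.8, 1.0),
]

# Direct product of all factors.
probability = math.prod(e * t for e, t in path)

# Same quantity in log space, the usual form in practice.
log_probability = sum(math.log(e) + math.log(t) for e, t in path)

print(f"{probability:.4f}")
print(f"{log_probability:.4f}")
```

Working with log-probabilities turns the product into a sum, which is also why PSSM and HMM scores are usually reported as log-odds.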

54 Profile Hidden Markov Models
- Statistical models of multiple sequence alignments
- Capture position-specific information about how conserved each column of the alignment is and which residues are likely
- Use position-specific scores for amino acids (or nucleotides) and position-specific penalties for opening and extending an insertion or deletion

55 Advantages of using HMMs
- HMMs have a formal probabilistic basis: probability theory guides how all the scoring parameters should be set.
- They can do things that more heuristic methods cannot do easily. For example, a profile HMM can be trained from unaligned sequences if a trusted alignment isn't yet known.
- HMMs have a consistent theory behind gap and insertion scores.

56 Advantages of using HMMs. In most details, profile HMMs are only a slight improvement over a carefully constructed profile, but less skill and manual intervention are necessary to use them. HMMs can produce true global alignments, unlike BLAST.

57 Limitations of HMMs
- They do not capture any higher-order correlations: an HMM assumes that the identity of a particular position is independent of the identity of all other positions.
- They make poor models of RNAs, because an HMM cannot describe base pairs.
- This contrasts with protein threading methods, which usually include scoring terms for nearby amino acids in a three-dimensional protein structure.
- They are slower and less user-friendly than PSI-BLAST.

58 Applications of profile HMMs. Database searching for weak homologies (an alternative to PSI-BLAST). Automated annotation of the domain structure of proteins.

59 Applications of profile HMMs. Useful for organizing sequences into evolutionarily related families. Databases like Pfam are constructed by distinguishing between a stable, curated seed alignment of a small number of representative sequences and full alignments of all detectable homologs. HMMER is used to make a model of the seed, search the database for homologs, and automatically produce the full alignment by aligning every sequence to the seed consensus.

60 Constructing a profile HMM
1. A multiple sequence alignment is made of known members of a given protein family. The quality of the alignment and the number and diversity of the sequences are crucial for success.
2. A profile HMM of the family is built from the alignment. The model-building program uses the alignment together with its prior knowledge of the general nature of proteins.
3. A model-scoring program is used to assign a score with respect to the model to any sequence of interest. The better the score, the higher the chance that the query sequence is homologous to the protein family in the model.
4. Each sequence in a database is scored to find the members of the family present in the database.

61 HMMER structure/topology. M = match state; I = insertion (w.r.t. profile - insert gap characters in profile); D = deletion (w.r.t. sequence - insert gap characters in sequence); N = N-terminal un-aligned; C = C-terminal un-aligned; J = joining segment, un-aligned

62 Profile HMM programs: HMMER
- Developed by Sean Eddy
- Freely available under the GNU General Public License
- Includes model-building and model-scoring programs relevant to homology detection
- Contains a program that calibrates a model by scoring it against a set of random sequences and fitting an extreme value distribution to the resulting raw scores; the parameters of this distribution are then used to calculate accurate E-values for sequences of interest

63 Programs in the HMMER 2 package
- hmmalign: Align sequences to an existing model.
- hmmbuild: Build a model from a multiple sequence alignment.
- hmmcalibrate: Takes an HMM and empirically determines parameters used to make searches more sensitive by calculating more accurate E-values.
- hmmconvert: Convert a model file into different formats, including a compact HMMER 2 binary format and best-effort emulation of GCG profiles.
- hmmemit: Emit sequences probabilistically from a profile HMM.
- hmmfetch: Get a single model from an HMM database.
- hmmindex: Index an HMM database.
- hmmpfam: Search an HMM database for matches to a query sequence.
- hmmsearch: Search a sequence database for matches to an HMM.

64 PSI-BLAST vs. profile HMMs
PSI-BLAST: Input: sequence. Database: sequences. Algorithm: constructs a PSSM from an initial pass and uses it in the next pass. Output: distantly related sequences. Plus: sensitive; minus: less specific.
Profile HMMs: more sensitive, but slower and less user-friendly than PSI-BLAST.

65 Summary

66 More information online at medicalbioinformaticsgroup.de/teaching. Thank you! Tim Conrad, AG Medical Bioinformatics. Further questions
