Dynamic Programming Algorithms


Dynamic Programming Algorithms
Sequence alignments, scores, and significance
Luce Skrabanek, ICB, WMC
February 7, 2012

Sequence alignment
Compare two (or more) sequences to:
- Find regions of conservation (or variability)
  - Common motifs
- Find similar sequences in a database
- Assess likelihood of common evolutionary history
Once we have constructed the best alignment between two sequences, we can assess how similar they are to each other.

Methods of sequence alignment
- Dot-matrix plots
- Dynamic programming algorithms (DPA)
  - Used to determine evolutionary homology
  - Needleman-Wunsch (global)
  - Smith-Waterman (local)
- Heuristic methods
  - Used to determine similarity to sequences in a database
  - BLAST
  - Combination of DPA and heuristic (FASTA)

Dot-matrix plot

What is a DPA?
A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix.
The basic ideas behind dynamic programming are:
- Organizing the work so that work already done is not repeated.
- Breaking a large problem into smaller incremental steps, each of which is solved optimally.
DPAs are typically used when a problem has many possible solutions and an optimal one has to be found.

Example alignment
ELVIS
LIVES?

Scoring alignments
- Associate a score with every possible alignment
  - A higher score indicates a better alignment
- Need a scoring scheme that will give us
  a. Values for residue substitution
  b. Penalties for introducing gaps
- The score for each alignment is then the sum of all the residue substitution scores minus any penalties for gaps introduced
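The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the small substitution table holds the BLOSUM62 values for the residue pairs used in the ALLISH / ALIGN example from these slides, and the linear gap penalty of 2 is an assumption taken from the arithmetic shown with that example. The names `SUBSTITUTION`, `GAP_PENALTY`, `pair_score` and `score_alignment` are hypothetical.

```python
# Substitution scores for the few residue pairs needed here (BLOSUM62 values).
SUBSTITUTION = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("S", "G"): 0, ("H", "N"): 1,
}
GAP_PENALTY = 2  # assumed linear penalty per gap column


def pair_score(a, b):
    """Look up a substitution score, treating the table as symmetric."""
    if (a, b) in SUBSTITUTION:
        return SUBSTITUTION[(a, b)]
    return SUBSTITUTION[(b, a)]


def score_alignment(top, bottom, gap="-"):
    """Sum substitution scores and subtract a penalty for each gap column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            total -= GAP_PENALTY
        else:
            total += pair_score(a, b)
    return total


print(score_alignment("ALLISH", "AL-IGN"))  # 4+4-2+4+0+1
print(score_alignment("ALLISH", "A-LIGN"))  # 4-2+4+4+0+1
```

Both calls give the same total, which shows that two different alignments can tie under one scoring scheme.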

BLOSUM62 substitution matrix

N-W mathematical formulation
Global alignment: used when the sequences are of a comparable length and show similarity along their whole lengths.
\[
F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}
\]
where F(i,j) is the value in cell (i,j); s(x_i, y_j) is the score for that pair of residues in the substitution matrix; d is the gap penalty.
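A minimal sketch of this recurrence in Python follows. It makes several assumptions that are not in the slides: a simple match/mismatch score stands in for a substitution matrix, the first row and column are initialised to -i*d and -j*d, and ties in the traceback are broken in favour of the diagonal move. The function name `needleman_wunsch` and its parameters are illustrative.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x and y with a linear gap penalty d (a sketch)."""
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d  # assumed boundary: leading gaps are penalised
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                          # gap in y
                F[i][j - 1] - d,                          # gap in x
            )

    # Traceback from the bottom-right cell to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("AGC", "AC")
print(score)
print(top)
print(bottom)
```

Each cell is filled once from three already-computed neighbours, which is the "compute once and save in a table" idea behind dynamic programming.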

Simple global alignment
[Figure: dynamic programming matrix with ELVIS across the top and LIVES down the side; the final cell holds the score 12.]

Traceback step
[Figure: the same matrix with the traceback path marked; the path pairs E/-, L/L, V/I, I/V, -/E, S/S.]

Resulting alignment
It is not correct to simply align the highest number of identical residues!

EL-VIS
-LIVES

ELVI-S
-LIVES
Score: 0+4+3+3-2+4 = 12

Isoleucine (I) and valine (V) are hydrophobic; glutamic acid (E) [glutamate] is acidic.

Local alignment
Differences from global alignments:
- Minimum value allowed = 0
  - Corresponds to starting a new alignment
- The alignment can end anywhere in the matrix, not just the bottom right corner
  - To find the best alignment, find the best score and trace back from there
- Expected score for a random match MUST be negative

S-W local alignment formula
Local alignment: used when the sequences are of very different lengths, or when they show only short patches of similarity.
\[
F(i,j) = \max \begin{cases} 0 \\ F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}
\]
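The local recurrence differs from the global one only in the extra 0 option and in where the traceback starts and stops. The sketch below assumes a simple match/mismatch score in place of a substitution matrix and a linear gap penalty; `smith_waterman` and its parameters are illustrative names, not part of the slides.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment of x and y with a linear gap penalty d (a sketch)."""
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    # First row and column stay at 0: an alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                        # start a new alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best-scoring cell until a cell with value 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i -= 1
            j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

With a scheme whose expected score for random pairs is negative, unrelated flanking regions fall back to 0 and only the shared patch is reported.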

[Figure: local alignment scoring matrix for the sequences SMALLISH and ALIGNMENT.]

Resulting alignments

ALLISH
AL-IGN
Score: 4+4-2+4+0+1 = 11

ALLISH
A-LIGN
Score: 4-2+4+4+0+1 = 11

Alignment with affine gap scores
- Affine means that the penalty for a gap is computed as a linear function of its length
  - Gap opening penalty
  - Less prohibitive gap extension penalty
- Have to keep track of multiple values for each pair of residue indices i and j, in place of the single value F(i,j)
  - We keep two matrices, M and I

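Affine gap score formula
\[
M(i,j) = \max \begin{cases} M(i-1, j-1) + s(x_i, y_j) \\ I(i-1, j-1) + s(x_i, y_j) \end{cases}
\]
\[
I(i,j) = \max \begin{cases} M(i, j-1) - d \\ I(i, j-1) - e \\ M(i-1, j) - d \\ I(i-1, j) - e \end{cases}
\]
where d is the gap opening penalty and e is the gap extension penalty.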
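The two-matrix recurrence can be sketched as below for a global alignment score. The boundary conditions are assumptions not stated on the slide: M(0,0) = 0, cells that cannot be reached are set to minus infinity, and a leading gap of length g costs d + (g-1)e. A match/mismatch score stands in for a substitution matrix, and the name `affine_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with gap opening d and gap extension e (a sketch)."""
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    # M[i][j]: best score with x_i aligned to y_j.
    # I[i][j]: best score with the alignment ending in a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        I[i][0] = -d - (i - 1) * e  # assumed cost of a leading gap
    for j in range(1, m + 1):
        I[0][j] = -d - (j - 1) * e

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1]) + sub
            I[i][j] = max(
                M[i][j - 1] - d,  # open a gap
                I[i][j - 1] - e,  # extend a gap
                M[i - 1][j] - d,
                I[i - 1][j] - e,
            )
    return max(M[n][m], I[n][m])


print(affine_global_score("AAAA", "AA"))  # two matches, one gap of length 2
```

Because extending a gap is cheaper than opening one, a single gap of length 2 is preferred here over two separate gaps of length 1.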

Typical gap penalties (proteins)
- BLOSUM62
  - Gap opening penalty: 11
  - Gap extension penalty: 1
- BLOSUM80 (closely related)
  - Gap opening penalty: 10
  - Gap extension penalty: 1
- BLOSUM45 (distantly related)
  - Gap opening penalty: 15
  - Gap extension penalty: 2

Improvements
- Gotoh, 1982
  - Improved affine gapped alignment performance (from M^2 N to MN steps)
- Myers and Miller, 1988
  - Divide-and-conquer (Hirschberg algorithm)

Conclusions
- DPA is mathematically proven to yield the optimal alignment between any two given sequences under the chosen scoring scheme.
- The final alignment depends on the scoring matrix for similar residues and on the gap penalties.

Scoring matrices
- Strongly influence the outcome of the sequence comparison
- Implicitly represent a particular theory of evolution
Frommlet et al., Comp Stat and Data Analysis, 2006

Examples of scoring matrices
- Identity
- BLOSUM
- PAM
- Gonnet (also called GCB: Gonnet, Cohen and Benner)
- JTT (Jones Thornton Taylor)
- Distance matrix (minimum number of mutations needed to convert one residue to another)
  - e.g., I -> V = 1 (ATT -> GTT)
- Physicochemical characteristics
  - e.g., hydrophobicity

DNA matrices

Identity
     A   T   C   G
A    1   0   0   0
T    0   1   0   0
C    0   0   1   0
G    0   0   0   1

BLAST
     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Transition/transversion
     A   T   C   G
A    0  -5  -5  -1
T   -5   0  -1  -5
C   -5  -1   0  -5
G   -1  -5  -5   0

Purines: A, G
Pyrimidines: C, T
Transition: e.g., A -> G
Transversion: e.g., C -> G

PAM matrices
- PAM (Percent/Point Accepted Mutation)
- One PAM unit corresponds to one amino acid change per 100 residues, or 0.01 changes/residue, or roughly 1% divergence
  - PAM30: 0.3 changes/residue
  - PAM250: 2.5 substitutions/residue
- Each matrix gives the changes expected for a given period of evolutionary time
- Assumes that amino acid substitutions observed over short periods of evolutionary history can be extrapolated to longer distances
- Uses a Markov model of evolution
  - Each amino acid site in a protein can change to any of the other amino acids with the probabilities given in the matrix
  - Changes that occur at each site are independent of one another

PAM matrix construction
- Align sequences
- Infer the ancestral sequence
- Count the number of substitution events
- Estimate the propensity of an amino acid to be replaced
- Construct a mutation probability matrix
- Construct a relatedness odds matrix
- Construct a log odds matrix

PAM matrix construction
Aligned sequences are 85% similar:
- To reduce the possibility that observed changes are due to two substitution events rather than one
- To facilitate the reconstruction of phylogenetic trees and the inference of ancestral sequences

Ancestral sequences
Organism A   A W T V A A A V R T S I
Organism B   A Y T V A A A V R T S I
Organism C   A W T V A A A V L T S I
[Figure: tree from an inferred ancestral sequence to A, B and C, with the substitutions W to Y and L to R marked on its branches. One of the possible relationships between organisms A, B and C.]
- Assume that each position has undergone a maximum of one substitution (high similarity between sequences)
- Implies that the number of changes between sequences is proportional to the evolutionary distance between them

Frequency of substitution
- From the phylogenetic trees, the number of changes of each amino acid into every other amino acid is counted
- In 71 groups of proteins, they counted 1,572 changes (Dayhoff, 1979).

Relative mutability
- Calculate the likelihood that an amino acid will be replaced
- Estimate relative mutability (relative number of changes)

Aligned sequences:
A B C
A B A

Amino acids                A      B      C
Observed changes           1      0      1
Frequency of occurrence    3      2      1
Relative mutability        0.33   0      1
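The toy table above can be reproduced with a short Python sketch. It counts, for each amino acid, how many aligned positions show it changed and how often it occurs, then divides the two. The function name `relative_mutability` is hypothetical, and the sketch leaves out the scaling across many alignments that a real PAM construction applies.

```python
from collections import Counter


def relative_mutability(aligned_pairs):
    """Observed changes divided by occurrences, per amino acid (toy version)."""
    changes = Counter()
    occurrences = Counter()
    for seq1, seq2 in aligned_pairs:
        for a, b in zip(seq1, seq2):
            occurrences[a] += 1
            occurrences[b] += 1
            if a != b:
                # A substitution counts as a change for both residues involved.
                changes[a] += 1
                changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in sorted(occurrences)}


print(relative_mutability([("ABC", "ABA")]))
```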

Scale mutabilities
- Mutabilities were scaled to the number of replacements per occurrence of the given amino acid per 100 residues in each alignment
  - Normalizes data for variations in amino acid composition, mutation rate, sequence length
- Count the total number of changes of each amino acid and divide by the total exposure to mutation
  - Exposure to mutation is the frequency of occurrence of each amino acid multiplied by the total number of amino acid changes per 100 sites, summed for all alignments.

Relative mutabilities (*Ala has arbitrarily been set to 100):
Ser 149    Met 122    Asn 111
Ile 110    Glu 102    Ala 100*
Gln  98    Asp  90    Thr  90
Gap  84    Val  80    Lys  57
Pro  56    His  50    Gly  48
Phe  45    Arg  44    Leu  38
Tyr  34    Cys  27    Trp  22

Mutation probability matrix
Mutation probability matrix for one PAM
- Each element of this matrix gives the probability that the amino acid in column i will be replaced by the amino acid in row j after a given evolutionary interval
  - 1 PAM: 1 mutation/100 residues/10 MY
  - 1 PAM: 0.01 changes/aligned residue
- Non-diagonal and diagonal elements are computed separately, where
  - M_ij = mutation probability
  - m_j = relative mutability
  - A_ij = number of times i was replaced by j
- The proportionality constant (λ) is chosen such that there is an average of one mutation every hundred residues.

[Figure: PAM1 mutation probability matrix. Atlas of Protein Sequence and Structure, Dayhoff et al., 1978.]

PAM1, PAM10, PAM250
- To get different PAM matrices, the mutation probability matrix is multiplied by itself some integer number of times
  - PAM1 ^ 10 -> PAM10
  - PAM1 ^ 250 -> PAM250
- PAM10 applies to proteins that differ by 0.1 changes per aligned residue
  - 10 amino acid changes per 100 residues
- PAM250 reflects evolution of sequences with an average of 2.5 substitutions per position
  - 250% change in 2500 MY
  - Approximately 20% similarity
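The extrapolation step is plain repeated matrix multiplication. The sketch below uses a made-up two-letter alphabet so that the numbers stay readable; the matrix `TOY_PAM1` and the helper names `mat_mul` and `mat_pow` are hypothetical and are not Dayhoff's data.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(M, power):
    """Multiply M by itself `power` times, starting from the identity."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, M)
    return result


# Toy one-step matrix: each residue has a 1% chance of changing per step.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

toy_pam2 = mat_pow(TOY_PAM1, 2)
toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam2[0][0], toy_pam2[1][0])
print(toy_pam250[0][0], toy_pam250[1][0])
```

Each column still sums to 1 after powering, and the diagonal shrinks toward the background frequency as the number of steps grows. Any error in the one-step matrix is carried through every multiplication.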

[Figure: PAM250 mutation probability matrix. Atlas of Protein Sequence and Structure, Dayhoff et al., 1978.]

Relatedness odds matrix
- Calculate a relatedness odds matrix:
  - Divide each element of the mutation probability matrix by the frequency of occurrence of each residue
- Gives the odds that the substitution represents a real evolutionary event as opposed to random sequence variation

Frequency of occurrence:
Ala 0.096    Gly 0.090    Lys 0.085    Leu 0.085
Val 0.078    Thr 0.062    Ser 0.057    Asp 0.053
Glu 0.053    Phe 0.045    Asn 0.042    Pro 0.041
Ile 0.035    His 0.034    Arg 0.034    Gln 0.032
Tyr 0.030    Cys 0.025    Met 0.012    Trp 0.012

Log odds matrix
- Calculate the log odds matrix
  - Take the log of each element
- More convenient when using the matrix to score an alignment, because instead of multiplying the odds scores, we can just add the log scores.
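A small sketch of why the log form is convenient: multiplying odds along an alignment gives the same result as adding their logs. The probabilities and the background frequency below are invented for illustration, the base-10 logarithm is an assumption (the slide only says "the log"), and the function names `odds` and `log_odds` are hypothetical.

```python
import math


def odds(mutation_probability, background_frequency):
    """Relatedness odds: mutation probability over background frequency."""
    return mutation_probability / background_frequency


def log_odds(mutation_probability, background_frequency):
    """Log of the relatedness odds (base 10 assumed here)."""
    return math.log10(odds(mutation_probability, background_frequency))


# Three aligned positions with made-up probabilities and a 0.1 background.
positions = [(0.2, 0.1), (0.05, 0.1), (0.4, 0.1)]

product_of_odds = math.prod(odds(p, f) for p, f in positions)
total_log_odds = sum(log_odds(p, f) for p, f in positions)

print(product_of_odds)
print(total_log_odds)
print(10 ** total_log_odds)  # recovers the product of the odds
```

A positive log odds score marks a substitution seen more often than chance, and a negative one marks a substitution seen less often.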

Problems with PAM
Assumptions:
- Replacement at a site depends only on the amino acid at that site and the probability given by the matrix
- Sequences that are being compared have average amino acid composition
- All amino acids mutate at the same rate

Problems with PAM contd.
Sources of error:
- Many sequences do not have average amino acid composition
- Rare replacements were observed too infrequently to resolve relative probabilities accurately
- Errors in PAM1 are magnified in the extrapolation to PAM250
- Replacement is not equally probable over the whole sequence

BLOSUM matrices (BLOCKS substitution matrix)
Henikoff and Henikoff, 1992
- Matrix values are based on the observed amino acid substitutions in a large set of conserved amino acid blocks (signatures of protein families)
- Frequently occurring highly conserved sequences are collapsed into one consensus sequence to avoid bias
- Patterns that show similarity above a threshold are grouped into a set
- Find the frequencies of each residue pair in the dataset
- 62% threshold = BLOSUM62 matrix

BLOSUM vs. PAM

BLOSUM:
- Not based on an explicit evolutionary model
- Consider all changes in a family
- Matrices to compare more distant sequences are calculated separately
- Based on substitutions of conserved blocks
- Designed to find conserved domains in proteins

PAM:
- Markov-based model of evolution
- Based on prediction of first changes in each family
- Matrices to compare more distant sequences are derived
- Based on substitutions over the whole sequence
- Tracks evolutionary origins of proteins

Other alignment methods
- Heuristic
  - BLAST (Basic Local Alignment Search Tool)
- Combination of heuristic and DPA
  - FASTA

BLAST - heuristic strategy
Altschul et al., NAR, 1997
- Break the query and all database sequences into words
- Look for matches that have at least a given score, given a particular scoring matrix
  - Seed alignment
- Extend the seed alignment in both directions to find an alignment with a score above a threshold or an E-value below a threshold
- Refinements: use DPAs for extension; 2-hit strategy
- The word size and threshold score determine the speed of the search
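The word-matching stage can be sketched as follows. This is a deliberate simplification: it reports only exact word matches, whereas protein BLAST also accepts similar words that score above a threshold under the scoring matrix, and it stops before the extension step. The function names `words` and `find_seeds` and the word size of 3 are assumptions for illustration.

```python
from collections import defaultdict


def words(seq, w):
    """All overlapping words of length w in seq, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index the subject once so each query word is a dictionary lookup.
    index = defaultdict(list)
    for pos, word in words(subject, w):
        index[word].append(pos)

    seeds = []
    for qpos, word in words(query, w):
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds


print(find_seeds("ALLISH", "SMALLISH"))
```

All seeds in this example lie on the same diagonal (subject position minus query position is constant), which is the kind of cluster the extension step would then grow into a full alignment. A larger word size produces fewer seeds and a faster but less sensitive search.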

PSI-BLAST (Position-Specific Iterated BLAST)
- Uses a PSSM (position-specific scoring matrix) derived from BLAST results
  - Generated by calculating position-specific scores for each position in the alignment
  - Highly conserved positions receive high scores
  - Weakly conserved positions receive low scores
- PSSM used as the scoring matrix for the next BLAST search
  - Iterative searching leads to increased sensitivity

Z-score
Calculation of statistical significance:
- Shuffle one of the sequences, preserving its amino acid or nucleotide composition and local sequence features
- Align the shuffled sequences to the first sequence
- Generate scores for each of these alignments
  - Generate an extreme value distribution
- Calculate the mean score and standard deviation of the distribution
- Get the Z-score

Z = (RealScore - MeanScore) / StandardDeviation

Statistical significance: many other methods
- Many random sequences (with no resemblance to either of the original sequences) are generated and aligned with each other
- Random sequences of different lengths are generated; scores increase as the log of the length
- A sequence may be aligned with all the sequences in a database library; the related sequences are discarded and the remainder are used to calculate the statistics
- Variations: shuffle the sequence or shuffle the library