Creation of a PAM matrix

Similar documents
Dynamic Programming Algorithms

Basic Local Alignment Search Tool

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous

Single alignment: FASTA. 17 march 2017

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Match the Hash Scores

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Scoring Alignments. Genome 373 Genomic Informatics Elhanan Borenstein

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

Optimization of Process Parameters of Global Sequence Alignment Based Dynamic Program - an Approach to Enhance the Sensitivity.

BLAST Basics. ... Elements of Bioinformatics Spring, Tom Carter. tom/

Bioinformatic analysis of similarity to allergens. Mgr. Jan Pačes, Ph.D. Institute of Molecular Genetics, Academy of Sciences, CR

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Textbook Reading Guidelines

Database Searching and BLAST Dannie Durand

What I hope you ll learn. Introduction to NCBI & Ensembl tools including BLAST and database searching!

CHAPTER 4 PATTERN CLASSIFICATION, SEARCHING AND SEQUENCE ALIGNMENT

Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized

Making Sense of DNA and Protein Sequences. Lily Wang, PhD Department of Biostatistics Vanderbilt University

MATH 5610, Computational Biology

03-511/711 Computational Genomics and Molecular Biology, Fall

Important points from last time

Evolutionary Genetics. LV Lecture with exercises 6KP

Bioinformatics for Biologists. Comparative Protein Analysis

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

03-511/711 Computational Genomics and Molecular Biology, Fall

BSC 4934: Q BIC Capstone Workshop. Giri Narasimhan. ECS 254A; Phone: x3748

Bioinformatics Practical Course. 80 Practical Hours

Structural Bioinformatics (C3210) Conformational Analysis Protein Folding Protein Structure Prediction

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

G4120: Introduction to Computational Biology

Deleterious mutations

Introduction to Bioinformatics

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Data Mining for Biological Data Analysis

Analysis of Biological Sequences SPH

MOL204 Exam Fall 2015

CS100 B Fall Professor David I. Schwartz Programming Assignment 4. Due: Thursday, November 4

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

Exploring Similarities of Conserved Domains/Motifs

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid

Improving Remote Homology Detection using a Sequence Property Approach

Germ-line vs somatic-variation theories

AN IMPROVED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM

Genome 373: Genomic Informatics. Elhanan Borenstein

In silico identification of transcriptional regulatory regions

Data Retrieval from GenBank

Computational Methods for Protein Structure Prediction

Simulation-based statistical inference in computational biology

VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004

Sequence Analysis in Immunology

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Why learn sequence database searching? Searching Molecular Databases with BLAST

An introduction to multiple alignments

Comparative Modeling Part 1. Jaroslaw Pillardy Computational Biology Service Unit Cornell Theory Center

Introduction to bioinformatics

Introduction to Transcription Factor Binding Sites (TFBS) Cells control the expression of genes using Transcription Factors.

Minor Introns vs Major Introns

Bioinformation by Biomedical Informatics Publishing Group

Computational Investigation of Gene Regulatory Elements. Ryan Weddle Computational Biosciences Internship Presentation 12/15/2004

Genetic Algorithms for Optimizations

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

Why do we need statistics to study genetics and evolution?

Molecular Modeling Lecture 8. Local structure Database search Multiple alignment Automated homology modeling

Challenging algorithms in bioinformatics

CAP 5510/CGS 5166: Bioinformatics & Bioinformatic Tools GIRI NARASIMHAN, SCIS, FIU

Bioinformatics : Gene Expression Data Analysis

Insight II. Homology. March Scranton Road San Diego, CA / Fax: 858/

ECS 129: Structural Bioinformatics March 15, 2016

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

Homework 4. Due in class, Wednesday, November 10, 2004

Advisors: Prof. Louis T. Oliphant Computer Science Department, Hiram College.

Baum-Welch and HMM applications. November 16, 2017

Theory and Application of Multiple Sequence Alignments

G4120: Introduction to Computational Biology

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

COMPUTER SIMULATIONS AND PROBLEMS

Integrated UG/PG Biotechnology (Fifth Semester) End Semester Examination, 2013 LBTC 504: Bioinformatics. Model Answers

Determining Error Biases in Second Generation DNA Sequencing Data

Evolutionary Genetics: Part 1 Polymorphism in DNA

Genetic Algorithm: An Optimization Technique Concept

Classification and Learning Using Genetic Algorithms

Content Planner Construction via Evolutionary Algorithms and a Corpus-based Fitness Function

Template Based Modeling and Structural Refinement of Protein-Protein Interactions:

Outline. 1. Introduction. 2. Exon Chaining Problem. 3. Spliced Alignment. 4. Gene Prediction Tools

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

Sequence Based Function Annotation

BIOINFORMATICS Introduction

CS273: Algorithms for Structure Handout # 5 and Motion in Biology Stanford University Tuesday, 13 April 2004

Basic Local Alignment Search Tool

2. Genetic Algorithms - An Overview

Article A Teaching Approach From the Exhaustive Search Method to the Needleman Wunsch Algorithm

Gene Prediction in Eukaryotes

CABIOS. A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

Transcription:

Rationale for substitution matrices Substitution matrices are a way of keeping track of the structural, physical and chemical properties of the amino acids in proteins, in such a fashion that less detrimental mutations are penalized less than those mutations which completely destroy the function of the protein. Instead of the simple scoring scheme of +1 for a match, -1 for a mismatch, and -2 for a gap, a variety of scores are assigned, depending on which pair of amino acids is being switched. February 15, 2001 1 Creation of a PAM matrix 1. Align sequences that are highly homologous (I.e., 85% identical). 1. Minimize ambiguity in alignments. 2. Minimize the number of coincident mutations. 2. Reconstruct phylogenetic trees and infer ancestral sequences. 3. Tally replacements "accepted" by natural selection, in all pair-wise comparisons (each is the number of times amino acid j was replaced by amino acid i in all comparisons). 4. Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced). 5. Combine data from 3 & 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance. 6. Calculate Log Odds Matrix for similarity scoring 1. Divide each element of the Mutation Data Matrix, M, by the frequency of occurrence of each residue. 7. The Log Odds Matrix, is calculated from the relatedness odds matrix by taking the log of each R ij. 8. Different protein families manifest different PAM rates. February 15, 2001 7

Properties of PAM matrices 1. The probability that an amino acid will change is on the order of 1% for each amino acid. The probability that it will stay the same is on the order of 99% for each amino acid. 2. The Mutation Probability Matrix, M1, defines a unit of evolutionary change: specifically, 1 PAM (Accepted Point Mutation per 100 residues). 1. The matrix can be used to simulate evolution by using a random number generator to select fate of each residue in the sequence according to the probabilit y given in the table. 2. Exposing a 100 residue protein sequence of average composition to the evolutionary change represented by M1 results in one amino acid change, on average. 3. Successive application of M1 on a sequence yields 2, 3, 4... PAMs of evolutionary change. 4. The following operations are equivalent: 1. Successive application of M1 on a sequence. 2. Matrix multiplication of M1 by itself, M1*M1, followed by operation on a sequence. 3. Scaling the elements of M1 by a constant of proportionality, = 1,2,3... (this enables the direct calculation of a matrix for any desired PAM distance). February 15, 2001 8 Assumptions in the PAM model 1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model). 2. Sequences that are being compared have average amino acid composition. February 15, 2001 9

Selecting an optimal PAM matrix 1. As the evolutionary distance increases, the information content of the corresponding PAM matrix decreases. 2. As the information content of the PAM matrix decreases a longer region of similarity is required to generate a sufficiently high score to be distinguishable from chance. 3. Regions of similarity in real proteins are found in narrower "blocks" as the evolutionary distance increases. February 15, 2001 10 Sources of error in the PAM model Many sequences depart from average composition. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed). Errors in 1PAM are magnified in the extrapolation to 250PAM. The Markov process is an imperfect representation of evolution: Distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over entire sequence. February 15, 2001 11

The BLOSUM family of matrices There are three principal differences between the Blosum and PAM matrices. 1. The PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit rather than explicit model of evolution. 2. The sequence variability in the alignments used to count replacements. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The Blosum matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps. 3. The Blosum procedure uses groups of sequences within which not all mutations are counted the same. A small block of lipid binding proteins, taken from the Blocks database, is shown below and used to illustrate the process of counting replacements that underlies the Blosum similarity matrices. February 15, 2001 12 BLOSUM: Blocks Substitution Matrix 1. Starting data is conserved blocks from the Blocks database 1. Aligned, ungapped sequences 2. Widely varying similarity, but measures are taken to avoid biasing the sample with frequently occurring highly related sequences. 2. Tallies of replacements are made by straight forward tallying of all pairs of aligned residues. 1. The observed frequency of each pair is: q ij = f ij /(total number of residue pairs) 2. This includes cases of i=j (i.e. no replacement observed). 3. The expected frequency of each pair is essentially the product of the frequencies of each residue in the data set. 3. Similar sequences in a block, above a threshold percent similarity are clustered and members of the cluster count fractionally toward the final tally. 1. Reduces the number of identical pairs (AA, SS, TT, etc., matches ) in the final tallies. 2. Somewhat analogous to increasing the PAM distance. 3. If clustering threshold is 80%, final matrix is BLOSUM 80. 4. Clustering at 62% reduces the number of blocks contributing to the table by 25% (still 1.25e6 pairs contributed) 5. Least frequent amino acid pair replacement was observed 2369 times. February 15, 2001 13

BLOSUM matrix efficiency February 15, 2001 16 The Struct matrix 32 three-dimensional structures from 11 families Only substitutions whose alpha carbon atoms are no more than 1.2 Å apart after superposition were considered February 15, 2001 18

Scoring insertions and deletions The selection of appropriate scores for insertions and deletions, the gap penalties, is as important as selecting the similarity scores for the success of a database search. Gap penalties are critical in determining the expected maximum score for random sequences being compared with your query sequence. While we want to force alignments to have relatively few gaps we need to allow them to be fairly long. February 15, 2001 20 Measuring statistical significance Using the Database Search to Create a Reference Distribution How often will an event at least as extreme as the one just observed happen if these events are the result of a well defined, specific, random process? Using Information Theory and the Length of the Alignment February 15, 2001 21

Characteristic of lengths of observed insertions and deletions. February 15, 2001 22 Characteristic of lengths of observed insertions and deletions vs. percent identity February 15, 2001 23

Number of residues observed between insertions and deletions vs. percent identity February 15, 2001 24 Characteristic of lengths of observed insertions and deletions in gene-pseudogene comparisons February 15, 2001 25

Substitution matrix conclusions Expect to find many small gaps in your alignment Small open gap penalty (on the same scale as the worst residue mismatch score) Gap extend penalty of approximately half this size for the majority of the cases The sequences that you are aligning will dictate the proper penalties to use February 15, 2001 26 Similarity searching and large databases As we have seen, the time for searching large databases is an important consideration, and it becomes necessary to balance speed and the sensitivity of the search. Exhaustive methods, such as Needleman-Wunsch and Smith-Waterman, generally guarantee an optimal answer but take far too long to compare a sequence with a database. In an attempt to gain speed with an acceptable loss of sensitivity, approximate methods have been derived from these algorithms. February 15, 2001 27