Making Sense of DNA and Protein Sequences. Lily Wang, PhD Department of Biostatistics Vanderbilt University

Outline
- Biological background
- Major biological sequence databanks
- Basic concepts in sequence comparison methods
- Substitution matrices and gap penalties
- Methods for aligning two sequences
- Statistical distributions of alignment scores
- Software: BLAST

DNA
- The genetic material that is physically transmitted from parent to offspring
- Double-stranded helix
- Within a single chain, each nucleotide consists of a base, a sugar, and a phosphate group
- A, G, C, T refer to the bases
- A pairs with T; G pairs with C
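The pairing rule on this slide can be sketched in a few lines of Python. The function names are illustrative only, and the sketch assumes a plain DNA string with no ambiguity codes.

```python
# Watson-Crick pairing as stated on the slide: A with T, G with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, read in the same direction."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read 5' to 3'."""
    return complement(strand)[::-1]


print(complement("GAATGC"))          # CTTACG
print(reverse_complement("GAATGC"))  # GCATTC
```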

Proteins
- The ultimate cellular activities are influenced through DNA-encoded proteins

The Central Dogma of Molecular Biology

Splicing, Introns and Exons

The Growth of Biological Data

Goals of Functional Genomics
- Understand the functions of genes and their interplay with proteins and the environment to create complex, dynamic living systems

Sequence Databases: Primary Nucleotide Databases
- Contain sequences derived from sequencing a biological molecule that exists in a test tube, somewhere in a lab.
- They do not represent sequences that are a consensus of a population.
- Examples: GenBank, DDBJ, EMBL

Sequence Databases: An Example Record
http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html

Secondary Databases: RefSeq
- These are curated databases.
- Many sequences are represented more than once in GenBank, which leads to a high degree of redundancy.
- Goal: provide a reference for each molecule in the central dogma (DNA, mRNA, and protein).
- RefSeq accession numbers use a 2+6 format: a two-letter prefix, an underscore, and six digits.
- Experimentally determined sequence data:
  - NT_123456: genomic contigs (DNA)
  - NM_123456: mRNAs
  - NP_123456: proteins
- Computational predictions from raw DNA sequences:
  - XM_123456: model mRNAs
  - XP_123456: model proteins
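As a rough illustration of the 2+6 accession format, the sketch below classifies an accession by its prefix. The prefix table is copied from this slide; the function name and the strict six-digit pattern are assumptions made for the example, and real records may carry version suffixes or prefixes that this sketch does not handle.

```python
import re

# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NT": ("genomic contig (DNA)", "experimentally determined"),
    "NM": ("mRNA", "experimentally determined"),
    "NP": ("protein", "experimentally determined"),
    "XM": ("model mRNA", "computational prediction"),
    "XP": ("model protein", "computational prediction"),
}

# Two uppercase letters, an underscore, six digits (the slide's 2+6 format).
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d{6})$")


def describe_accession(accession):
    """Return (molecule, evidence) for a known prefix, or None otherwise."""
    match = _ACCESSION.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


print(describe_accession("NM_123456"))
print(describe_accession("XP_123456"))
```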

Protein Sequences: UniProt Knowledgebase (http://www.pir.uniprot.org/)
- Swiss-Prot: manually annotated records and curator-evaluated computational analysis
- TrEMBL: computationally analyzed records awaiting manual annotation
- Source: translation of all coding sequences (CDS) found in DDBJ/EMBL/GenBank, PDB entries, sequences submitted directly to UniProt, and PIR-PSD
- Non-redundancy: all protein products derived from a given gene are described in a single record
- Extensive cross-references: to GenBank, 2D-PAGE data, protein structure databases, protein domain and family characterization databases, species-specific data collections, and disease databases

Amino Acid Codes
A sample record: http://ca.expasy.org/cgi-bin/get-sprot-entry?p09651

Sequence Comparisons
- It's much easier to determine the sequences of genes and proteins than to determine their structure or function.
- New sequences are adapted from pre-existing sequences rather than invented anew.
- The more conserved amino acids in similar proteins from different species are the ones that play an essential role in structure and function.
- Significant similarities between sequences often give important clues about phylogeny, structure, and function.

Origins of Genes Having a Similar Sequence

Sequence Comparisons
- Human DNA has about 30,000 genes; the functions of more than 50% of them are still unknown.
- Many human proteins share similarities with proteins from other organisms.
- By experimentation, and by comparing genes and proteins with those already known in the databases, our goal is to determine the functions of newly discovered genes and proteins.

Pairwise Comparisons
- Decide on a scoring system
- Align a given set of sequences to find the best matching region(s)
- Assign a score for the comparison between sequences
- Evaluate the statistical significance of the score

Substitution Matrices
- A substitution matrix, or score matrix, is a matrix in which $s(a_i, a_j)$, the score for aligning residue $a_i$ with residue $a_j$, is in position $i, j$.
- Example: the BLOSUM 62 matrix

Substitution Matrices
- Not all amino acids are equal
  - Some are more easily substituted than others
  - Some mutations occur more often
  - Some substitutions are kept more often
- Mutations tend to favor some substitutions
  - Some amino acids have similar codons (for example, TTT and TTC for Phe, TTA and TTG for Leu)
  - They are more likely to be changed by DNA mutation
- Selection tends to favor some substitutions
  - Some amino acids have similar properties or structure
  - They are more likely to be kept

Log Odds Score
- Given a pair of aligned sequences, we want to assign a score to the alignment that measures the relative likelihood that the sequences are related as opposed to unrelated.
- Consider the likelihood ratio statistic (LRS) for two sequences $x$ and $y$:

$$\frac{P(x, y \mid \text{match model})}{P(x, y \mid \text{random model})} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$

- The log-odds ratio is $\sum_i s(x_i, y_i)$, where

$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$

  is the log likelihood ratio of the residue pair $(a, b)$ occurring as an aligned pair, as opposed to an unaligned pair.
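A small numeric sketch of the log-odds score may help here. The two-letter alphabet and all frequencies below are made up for illustration (they are not taken from any real substitution matrix), and base-2 logarithms are an arbitrary choice.

```python
import math

# Hypothetical background frequencies q_a for a toy two-letter alphabet.
background = {"A": 0.5, "B": 0.5}

# Hypothetical aligned-pair frequencies p_ab under the match model (sum to 1).
pair_freq = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def s(a, b):
    """Log-odds score s(a, b) = log2(p_ab / (q_a * q_b))."""
    return math.log2(pair_freq[(a, b)] / (background[a] * background[b]))


def alignment_score(x, y):
    """Sum of per-position log-odds scores for two aligned, equal-length strings."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    return sum(s(a, b) for a, b in zip(x, y))


def likelihood_ratio(x, y):
    """P(x, y | match model) / P(x, y | random model), as a plain product."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= pair_freq[(a, b)] / (background[a] * background[b])
    return ratio


# The summed score equals the log of the likelihood ratio.
print(alignment_score("AAB", "ABB"), math.log2(likelihood_ratio("AAB", "ABB")))
```

With these toy numbers, identical pairs score positive and mismatched pairs score negative, which is the behaviour expected of a log-odds matrix.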

Scoring an Alignment
- The score of an alignment is the sum of the scores of all pairs of residues in the alignment.

sequence 1:   T   C   C   P   S   I   V   A   R   S   N
sequence 2:   S   C   C   P   S   I   S   A   R   N   T
pair score:   1  12  12   6   2   5  -1   2   6   1   0   => alignment score = 46

- Maximal Segment Pair (MSP): given two protein sequences, the pair of equal-length segments that, when aligned, have the greatest aggregate score is called the Maximal Segment Pair. An MSP may be of any length; its score is the MSP score.
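The example on this slide can be reproduced with a short sketch. The score table below contains only the ten residue pairs that appear in the example, with the values read off the slide; it is not a complete substitution matrix. The `best_segment` helper is an illustrative assumption: it scans only this one fixed alignment, whereas a true MSP search would also consider every relative offset of the two sequences.

```python
# Pair scores read from the slide's example (only the pairs that occur there).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2, ("I", "I"): 5,
    ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6, ("S", "N"): 1, ("N", "T"): 0,
}


def score(a, b):
    """Symmetric lookup in the partial score table."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(x, y):
    """Sum of the scores of all aligned residue pairs."""
    return sum(score(a, b) for a, b in zip(x, y))


def best_segment(x, y):
    """Highest-scoring contiguous segment of this fixed alignment.

    Returns (score, start, end) with end exclusive.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (a, b) in enumerate(zip(x, y)):
        if running <= 0:
            running, start = 0, i  # restart the segment here
        running += score(a, b)
        if running > best[0]:
            best = (running, start, i + 1)
    return best


seq1, seq2 = "TCCPSIVARSN", "SCCPSISARNT"
print(alignment_score(seq1, seq2))  # 46
print(best_segment(seq1, seq2))     # (46, 0, 10)
```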

Theories on Substitution Matrices
- Among the MSPs from the comparison of random sequences, the amino acids $a_i$ and $a_j$ are aligned with target frequency (Altschul, 1991)

$$q_{ij} = p_i p_j e^{\lambda s_{ij}}$$

  where $p_i$ is the background frequency of $a_i$ and $s_{ij}$ is the score for the pair. Note that here $q$ denotes pair frequencies and $p$ background frequencies, the reverse of the notation on the Log Odds Score slide.
- Among alignments representing distant homologies, the amino acids are paired with certain characteristic frequencies. Only if these correspond to a matrix's target frequencies, it has been argued, can the matrix be optimal for distinguishing distant local homologies from similarities due to chance (Karlin & Altschul, 1990).
- Any substitution matrix is implicitly a log-odds matrix, with a specific target distribution for aligned pairs of amino acid residues.

Substitution Matrices
- The PAM family
  - PAM matrices are based on global alignments of closely related proteins.
  - PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
  - Other PAM matrices are extrapolated from PAM1.
- The BLOSUM family
  - BLOSUM matrices are based on local alignments.
  - BLOSUM 62 is calculated from alignment blocks in which sequences sharing 62% identity or more are clustered and counted as a single sequence, so it reflects comparisons of sequences with less than 62% identity.
  - All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
  - Though BLOSUM 62 is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.

Selecting Optimal Matrices (Altschul, 1991; Henikoff & Henikoff 1993; Wheeler, 2003)

| Matrix   | Best use                                             | Similarity (%) |
|----------|------------------------------------------------------|----------------|
| PAM40    | Short alignments that are highly similar             | 70-90          |
| PAM160   | Detecting members of a protein family                | 50-60          |
| PAM250   | Longer alignments of more divergent sequences        | ~30            |
| BLOSUM90 | Short alignments that are highly similar             | 70-90          |
| BLOSUM80 | Detecting members of a protein family                | 50-60          |
| BLOSUM62 | Most effective in finding all potential similarities | 30-40          |
| BLOSUM30 | Longer alignments of more divergent sequences        | <30            |

Global vs. Local Alignment
- Global alignment: aligns the entire sequences, using all characters up to both ends of each sequence
- Local alignment: aligns only the best-matching parts of the sequences, those that give the highest scores
  - Rationale: distantly related proteins may share only isolated regions of similarity
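The difference between the two modes shows up clearly in a small dynamic-programming sketch. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for readability, not something specified in these slides, and the function returns only the optimal scores rather than the alignments themselves.

```python
def align_scores(x, y, match=1, mismatch=-1, gap=-1):
    """Return (global_score, local_score) for sequences x and y.

    Global scoring fills the whole table and reads the bottom-right cell.
    Local scoring never lets a cell go below zero and takes the table maximum.
    """
    n, m = len(x), len(y)
    glob = [[0] * (m + 1) for _ in range(n + 1)]
    loc = [[0] * (m + 1) for _ in range(n + 1)]

    # A global alignment must pay for leading gaps; a local one does not.
    for i in range(1, n + 1):
        glob[i][0] = i * gap
    for j in range(1, m + 1):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + sub,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + sub,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[n][m], best_local


# A short motif embedded in a longer sequence: poor global score, good local score.
print(align_scores("ACGT", "TTTTACGTTTTT"))  # (-4, 4)
```

In the usage line, the global score is dragged down by the eight unavoidable gap positions, while the local score reflects only the shared four-residue region.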

Methods for Pairwise Alignment
- Dot matrix: reveals the presence of insertions/deletions and direct/inverted repeats; should be considered the first choice
- Dynamic programming: guarantees the optimal alignment, but can be slow
- k-tuple heuristic method: does not guarantee the optimal alignment, but is fast; implemented in BLAST and FASTA

Dot Matrix Method
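A dot matrix is simple enough to sketch directly. In the version below, rows follow the first sequence and columns follow the second; the `window` parameter (require a run of identical residues before placing a dot) is an illustrative assumption, commonly used to reduce noise.

```python
def dot_matrix(x, y, window=1):
    """Return the dot matrix as a list of strings.

    Cell (i, j) is '*' when x[i:i+window] equals y[j:j+window], else '.'.
    """
    rows = []
    for i in range(len(x) - window + 1):
        row = ""
        for j in range(len(y) - window + 1):
            row += "*" if x[i:i + window] == y[j:j + window] else "."
        rows.append(row)
    return rows


# Similar regions show up as diagonal runs of dots; an insertion or deletion
# shifts the run onto a neighbouring diagonal.
for line in dot_matrix("GATTACA", "GATCTACA"):
    print(line)
```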

Heuristic Method: Basic Local Alignment Search Tool (BLAST)
- Question: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence?
- BLAST finds the highest-scoring locally optimal alignments between a query sequence and a database.
- Very fast algorithm, but does not guarantee the optimal alignment
- Can be used to search extremely large databases
- Sufficiently sensitive and selective for most purposes

BLAST
- For a given word length w (usually 3 for proteins) and a given score matrix: create a list of all words (w-mers) that can score > T when compared to w-mers from the query.
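The word-list step can be sketched as a brute-force enumeration. Everything concrete here is an assumption for illustration: a reduced four-letter alphabet, a toy scoring rule (+5 match, -1 mismatch) in place of a real score matrix, and a threshold of 8. This is not how BLAST itself is implemented, only the idea the slide describes.

```python
from itertools import product

ALPHABET = "ACDE"  # hypothetical reduced alphabet to keep the example small


def toy_score(a, b):
    """Stand-in for a substitution matrix: +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


def neighborhood(query, w=3, threshold=8, alphabet=ALPHABET, score=toy_score):
    """Map each query offset to the set of w-mers scoring > threshold there."""
    result = {}
    for pos in range(len(query) - w + 1):
        query_word = query[pos:pos + w]
        words = set()
        for candidate in product(alphabet, repeat=w):
            total = sum(score(a, b) for a, b in zip(query_word, candidate))
            if total > threshold:
                words.add("".join(candidate))
        result[pos] = words
    return result


words = neighborhood("ACDE")
print(sorted(words[0]))  # the exact word ACD plus its one-mismatch variants
```

With these toy numbers an exact match scores 15 and a single mismatch scores 9, so each query word contributes itself and its nine one-mismatch neighbours.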

BLAST
- Each neighborhood word gives all positions in the database where it is found (hit list).
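One way to picture the hit list is as a lookup in a word index of the database. The sketch below is an assumption about how such an index could be organised (a plain dictionary from word to positions); the sequence names and contents are invented.

```python
from collections import defaultdict


def index_words(database, w=3):
    """Build word -> list of (sequence name, offset) for every w-mer.

    database: dict mapping a sequence name to its sequence string.
    """
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((name, i))
    return index


def hit_list(neighborhood_words, index):
    """Collect every database position where any neighborhood word occurs."""
    return sorted(hit for word in neighborhood_words for hit in index.get(word, []))


db = {"s1": "AACDA", "s2": "CDACD"}  # hypothetical two-sequence database
idx = index_words(db)
print(hit_list({"ACD", "CDA"}, idx))
```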

BLAST
- The program tries to extend matching segments (seeds) out in both directions by adding pairs of residues.
- Residues are added until the incremental score drops below a threshold.
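The extension step can be sketched for one direction. The stopping rule used here, stop once the running score has fallen more than `drop` below the best score seen so far, is one common reading of the slide's "drops below a threshold" and is labelled as an assumption; the toy scoring function and the example sequences are also invented.

```python
def toy_score(a, b):
    """Stand-in scoring rule: +2 for identity, -2 otherwise."""
    return 2 if a == b else -2


def extend_right(query, subject, q_end, s_end, seed_score, score=toy_score, drop=3):
    """Extend an ungapped seed to the right.

    q_end / s_end: index just past the seed in query and subject.
    Returns (best_score, residues_added) for the best extension found.
    """
    best = running = seed_score
    best_len = 0
    k = 0
    while q_end + k < len(query) and s_end + k < len(subject):
        running += score(query[q_end + k], subject[s_end + k])
        k += 1
        if running > best:
            best, best_len = running, k
        elif best - running > drop:
            break  # score has dropped too far below the best; give up
    return best, best_len


# Seed "ACD" at offset 0 in both sequences, seed score 3 * 2 = 6.
print(extend_right("ACDEFGHIK", "ACDEFWWWK", 3, 3, 6))  # (10, 2)
```

Extension to the left works the same way with the indices running downward, and the two results are combined with the seed to give the full ungapped segment.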

The Five BLAST Programs

| Program | Database | Query | Typical uses |
|---------|----------|-------|--------------|
| BLASTN | Nucleotide | Nucleotide | Mapping oligonucleotides, cDNAs, and PCR products to a genome; screening repetitive elements; cross-species sequence exploration; annotating genomic DNA; clustering sequencing reads; vector clipping. |
| BLASTP | Protein | Protein | Identifying common regions between proteins; collecting related proteins for phylogenetic analyses. |
| BLASTX | Protein | Nucleotide translated into protein | Finding protein-coding genes in genomic DNA; determining if a cDNA corresponds to a known protein. |
| TBLASTN | Nucleotide translated into protein | Protein | Identifying transcripts, potentially from multiple organisms, similar to a given protein; mapping a protein to genomic DNA. |
| TBLASTX | Nucleotide translated into protein | Nucleotide translated into protein | Cross-species gene prediction at the genome or transcript level; searching for genes missed by traditional methods or not yet in protein databases. |

Distribution of Pairwise Local Alignment Scores (Karlin-Altschul Statistics): Assumptions
- There is at least one positive score
- The expected score must be negative
- The letters of the sequences are i.i.d.
- The sequences are infinitely long
- The alignment doesn't contain gaps

Statistical Distribution of Pairwise Local Alignment Scores
- Given two random protein sequences, the number of distinct, or "locally optimal", MSPs with scores of at least $S$ expected to occur simply by chance is

$$E = K N e^{-\lambda S}$$

  where
  - $N$ = product of the sequence lengths
  - $K$ = an explicitly calculable parameter
  - $\lambda$ = the unique positive solution to $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$
- This is the E value reported by BLAST.
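To make the role of λ concrete, the sketch below solves the defining equation by bisection for a toy scoring system and then evaluates the expected count. The uniform four-letter alphabet and the +1/-1 scores are assumptions chosen because the answer can be checked by hand (λ = ln 3); the value of K passed in the usage line is a placeholder, not a computed constant.

```python
import math


def solve_lambda(p, s, iterations=200):
    """Find the positive root of sum_ij p_i p_j exp(lam * s_ij) = 1 by bisection.

    p: dict letter -> background frequency; s: function (a, b) -> score.
    Assumes a negative expected score and at least one positive score.
    """
    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * s(a, b)) for a in p for b in p) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:  # grow the bracket until f is positive
        hi *= 2.0
    lo = 0.0             # f is negative just above 0 when the expected score is negative
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def e_value(S, N, lam, K):
    """Expected number of chance MSPs with score >= S: K * N * exp(-lam * S)."""
    return K * N * math.exp(-lam * S)


p = {base: 0.25 for base in "ACGT"}      # toy uniform background


def s(a, b):
    return 1 if a == b else -1           # toy scores


lam = solve_lambda(p, s)
print(lam, math.log(3))                  # both approximately 1.0986
print(e_value(S=10, N=100 * 1000, lam=lam, K=0.1))  # K = 0.1 is a placeholder
```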

Statistical Significance of Pairwise Alignment: Analysis of Headruns
- Two sequences $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_n$ of the same length
- The letters are i.i.d.
- Consider a fixed alignment, that is, no shifts
- Consider exact matching, that is, no mismatches or indels

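Alignment Scores
- The highest local alignment score

$$H(A, B) = \max\{ s(I, J) : I \subset A,\ J \subset B \}$$

  indicates the best matching region along sequences $A$ and $B$.
- With a fixed alignment and exact matching, this is the length of the longest run of matches:

$$H(A, B) = R_n = \max\{ m : A_{i+k} = B_{i+k} \text{ for } k = 1 \text{ to } m,\ 0 \le i \le n - m \}$$

- Flip a coin $n$ times with $p = \Pr(A_i = B_i)$ for heads each time; the longest run of heads corresponds to $R_n$.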

Analysis of Headruns (Waterman, 1995)
- Heuristics: a headrun of length $m$ has probability $p^m$, and there are about $n$ possible headruns, so

$$E(\text{number of headruns of length } m) \approx n p^m$$

- If the largest run is unique, its length $R_n$ should satisfy $1 = n p^{R_n}$, which has the solution $R_n = \log_{1/p} n$.
- Theorem 1: Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be independent and identically distributed with $0 < p \equiv \Pr(A_1 = B_1) < 1$. Then

$$\Pr\left( \lim_{n \to \infty} \frac{R_n}{\log_{1/p} n} = 1 \right) = 1$$
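The heuristic can be checked informally by simulation. The sketch below compares two random sequences at a fixed alignment and measures the longest run of matches; the uniform four-letter alphabet (so p = 0.25), the sequence length, and the seed are arbitrary assumptions, and a single simulated run will only be near the predicted value, not equal to it.

```python
import math
import random


def longest_match_run(a, b):
    """Length of the longest run of positions where a and b agree (R_n)."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


def predicted_run(n, p):
    """Heuristic prediction R_n = log base 1/p of n."""
    return math.log(n) / math.log(1.0 / p)


rng = random.Random(0)                       # fixed seed for repeatability
n = 100_000
a = "".join(rng.choice("ACGT") for _ in range(n))
b = "".join(rng.choice("ACGT") for _ in range(n))
print(longest_match_run(a, b), predicted_run(n, 0.25))
```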

Chen-Stein Method of Poisson Approximation

Analysis of Headruns

Distribution of Pairwise Local Alignment Scores (Karlin and Altschul, 1990)
- Of course, BLAST score theory is much more complicated:
  - Unequal sequence lengths
  - Shifting
  - Score matrix
- Karlin-Altschul distribution of local alignment scores (ungapped case), for sequences of lengths $m$ and $n$:

$$\Pr(S > x) \approx 1 - \exp\left(-K m n\, e^{-\lambda x}\right)$$

E Value and p-value
- E value: the number of different alignments with scores equivalent to or better than $S$ that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
- p-value: the probability of an alignment occurring with the score in question or better.

E-value or p-value
- It is more appropriate to rank the importance of an alignment score by p-values, since matches with long sequences can yield larger scores simply due to sequence length.
- The BLAST programs report E-values rather than p-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than p-values of 0.993 and 0.99995.
- When E < 0.01, p-values and E-values are nearly identical.
- Use p-values to evaluate cases on the boundary of statistical significance.
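The numbers quoted on this slide follow from the Karlin-Altschul tail formula: writing E for $K m n e^{-\lambda x}$, the p-value is $1 - e^{-E}$. A minimal sketch (the function name is illustrative):

```python
import math


def p_from_e(e):
    """p-value corresponding to an E-value: 1 - exp(-E)."""
    return -math.expm1(-e)  # expm1 keeps precision when E is very small


for e in (10, 5, 0.01, 0.001):
    print(e, p_from_e(e))
# E = 5 gives about 0.993 and E = 10 about 0.99995, as on the slide;
# for E below 0.01 the two values agree to within about half a percent.
```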

References: BLAST
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410
- Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389-3402

References: Substitution Matrices
- Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, 5, 345-352
- Henikoff, S., Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89, 10915-10919
- Altschul, S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 219, 555-565

References: Distributions of Pairwise Sequence Alignment Scores
- Karlin, S., Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87, 2264-2268
- Arratia, R., Goldstein, L., Gordon, L. (1989) Two moments suffice for Poisson approximation: the Chen-Stein method. Annals of Probability, 17, 9-25
- Waterman, M.S., Vingron, M. (1994) Sequence comparison significance and Poisson approximation. Statistical Science, 9, 367-381
- Siegmund, D., Yakir, B. (2000) Approximate p-values for local sequence alignments. Annals of Statistics, 28(3), 657-680

Thank You!