The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

Similar documents
Match the Hash Scores

Data Retrieval from GenBank

Evolutionary Genetics. LV Lecture with exercises 6KP

Why learn sequence database searching? Searching Molecular Databases with BLAST

Making Sense of DNA and Protein Sequences. Lily Wang, PhD Department of Biostatistics Vanderbilt University

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Bioinformatic analysis of similarity to allergens. Mgr. Jan Pačes, Ph.D. Institute of Molecular Genetics, Academy of Sciences, CR

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Single alignment: FASTA. 17 march 2017

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

Creation of a PAM matrix

Basic Local Alignment Search Tool

CAP 5510/CGS 5166: Bioinformatics & Bioinformatic Tools GIRI NARASIMHAN, SCIS, FIU

Sequence Based Function Annotation

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

Dynamic Programming Algorithms

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Database Searching and BLAST Dannie Durand

Textbook Reading Guidelines

Sequence Analysis. BBSI 2006: Lecture #(χ+3) Takis Benos (2006) BBSI MAY P. Benos 1

MATH 5610, Computational Biology

Annotation and the analysis of annotation terms. Brian J. Knaus USDA Forest Service Pacific Northwest Research Station

FUNCTIONAL BIOINFORMATICS

Challenging algorithms in bioinformatics

Sequence Databases and database scanning

03-511/711 Computational Genomics and Molecular Biology, Fall

Optimization of Process Parameters of Global Sequence Alignment Based Dynamic Program - an Approach to Enhance the Sensitivity.

03-511/711 Computational Genomics and Molecular Biology, Fall

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous

ELE4120 Bioinformatics. Tutorial 5

Basic Bioinformatics: Homology, Sequence Alignment,

G4120: Introduction to Computational Biology

BME 110 Midterm Examination

Application for Automating Database Storage of EST to Blast Results. Vikas Sharma Shrividya Shivkumar Nathan Helmick

What I hope you ll learn. Introduction to NCBI & Ensembl tools including BLAST and database searching!

Why study sequence similarity?

Exercise I, Sequence Analysis

Genome Sequence Assembly

Alignment to a database. November 3, 2016

COMPUTER RESOURCES II:

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

A Prac'cal Guide to NCBI BLAST

Tutorial for Stop codon reassignment in the wild

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Annotating Fosmid 14p24 of D. Virilis chromosome 4

Scoring Alignments. Genome 373 Genomic Informatics Elhanan Borenstein

NCBI Molecular Biology Resources

Bacterial Genome Annotation

Genomics I. Organization of the Genome

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

MOL204 Exam Fall 2015

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

Gene Identification in silico

Introduction to sequence similarity searches and sequence alignment

Bioinformatics Databases

Basic Local Alignment Search Tool

Integrated UG/PG Biotechnology (Fifth Semester) End Semester Examination, 2013 LBTC 504: Bioinformatics. Model Answers

VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch

Introduction to Bioinformatics Problem Set 7: Genome Analysis

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Small Exon Finder User Guide

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 2. Bioinformatics 1: Biology, Sequences, Phylogenetics

B L A S T! BLAST: Basic local alignment search tool 11/23/2010. Copyright notice. November 29, Outline of today s lecture BLAST. Why use BLAST?

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

DNA Sequencing. ribose with two OH groups at positions 2, 3 & a tri-phosphate at position 5

Why Use BLAST? David Form - August 15,

Modern BLAST Programs

Lecture 17: Heuris.c methods for sequence alignment: BLAST and FASTA. Spring 2017 April 11, 2017

Aaditya Khatri. Abstract

Bioinformatics with basic local alignment search tool (BLAST) and fast alignment (FASTA)

Lecture 10. Ab initio gene finding

Computational gene finding

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

NEXT GENERATION SEQUENCING. Farhat Habib

Tools and Opportunities to Enhance Risk Analysis. Nathan J. Hillson

Introduction to 'Omics and Bioinformatics

BLAST Basics. ... Elements of Bioinformatics Spring, Tom Carter. tom/

Bioinformation by Biomedical Informatics Publishing Group

Data Mining for Biological Data Analysis

Compiled by Mr. Nitin Swamy Asst. Prof. Department of Biotechnology

Exploring the Genetic Basis for Behavior. Instructor s Notes

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

researcharticle ASTHA DUHAN, MANOJ KUMAR SHARMA AND TAVINDERJEET KAUR

Biology 4100 Minor Assignment 1 January 19, 2007

Methods and tools for exploring functional genomics data

The use of bioinformatic analysis in support of HGT from plants to microorganisms. Meeting with applicants Parma, 26 November 2015

(a) (3 points) Which of these plants (use number) show e/e pattern? Which show E/E Pattern and which showed heterozygous e/e pattern?

Computational Molecular Biology. Lecture Notes. by A.P. Gultyaev

Gene Prediction in Eukaryotes

Two Mark question and Answers

G4120: Introduction to Computational Biology

Identifying Regulatory Regions using Multiple Sequence Alignments

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Bioinformatics for Biologists. Comparative Protein Analysis

Transcription:

Dec-82 Oct-84 Aug-86 Jun-88 Apr-90 Feb-92 Nov-93 Sep-95 Jul-97 May-99 Mar-01 Jan-03 Nov-04 Sep-06 Jul-08 May-10 Mar-12 Growth of GenBank 160,000,000,000 180,000,000 Introduction to Bioinformatics Iosif Vaisman 140,000,000,000 120,000,000,000 100,000,000,000 80,000,000,000 60,000,000,000 40,000,000,000 Base pairs Entries 160,000,000 140,000,000 120,000,000 100,000,000 80,000,000 60,000,000 40,000,000 20,000,000,000 20,000,000 Email: ivaisman@gmu.edu 0 0 December 1982 680,338 bp 606 seq December 2012 148,390,863,904 bp 161,140,325 seq Comparative Sequence Sizes Smallest bacterial gene 54 Smallest human gene (IGF-2) 252 Yeast chromosome 3 350,000 Longest human gene (dystrophin) 2,220,223 Escherichia coli (bacterium) genome 4,600,000 Largest yeast chromosome now mapped 5,800,000 Entire yeast genome 15,000,000 Smallest human chromosome (Y) 50,000,000 Largest human chromosome (1) 250,000,000 Entire human genome 3,000,000,000 string - a sequence of characters from some alphabet scoring function: exact match +2 mismatch -1 insertion -1 two strings acbcdb and cbabd can be aligned in different ways a c b c d b c b a b d - a c b c d b - c b a b d a c b c d b - - c b a - b d given: two strings CTCATG and TACTTG C T C A T G T A C T T G 3. (2) + 3. (-1) = 3 scoring function: exact match +2 mismatch -1 insertion -1 3. (2) + 4. (-1) = 2 C T C A - T - G. T - A C T T G 4. (2) + 4. (-1) = 4

Entropy and Redundancy of Language ** CUR**** F*****W******* D***** DIS*****AND P*** **BLES****FR*****B*******BR*****AND ***** AG*** The sequences are 65% identical A CURSED FIEND WROUGHT DEATH DISEASE AND PAIN A BLESSED FRIEND BROUGHT BREATH AND EASE AGAIN THATSEQUENCE THIS-S--EQUENCE THISISASEQUENCE THIS---SEQUENCE THISISASEQUENCE Scoring Alignments Scoring Alignments Substitution Matrices Dayhoff (or MDM, or PAM) - Derived from global alignments of closely related sequences PAM100 - number referes to evolutionary distance (Percentage of Acceptable point Mutations per 10 8 years) 300 million years 200 million years 100 million years WYFGKITRRESERLLLNPENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL WYFGKITRRESERLLLNPENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL PAM100 PAM100 PAM100 PAM100 PAM200 PAM150 Substitution Matrices BLOSUM (BLOcks SUbstitution Matrix) - Derived from local, ungapped alignments of distantly related sequences BLOSUM62 - number refers to the minimum percent identity Specialized Substitution Matrices SLIM (ScoreMatrix Leading to Intra-Membrane) PHAT (Predicted Hydrophobic and Transmembrane Matrix) STROMA (Score Matrix for Known Distant Homologs) HSDM (Homologous Structure Derived Matrix) WAC (Amino Acid Comparative Profiles) VTML (Maximum Likelihood Estimation) Reference: Henikoff & Henikoff Proteins 17:49, 1993 Comprehensive list: Amino Acid Matrices in AAindex (http://www.genome.jp/aaindex/aaindex/list_of_matrices)

Sequence A Sequence A Selecting a Matrix Compared sequences are related: 200 PAM or 250 PAM Database scanning: 120 PAM Local alignment search: 40 PAM, 120 PAM, 250 PAM Detection of related sequences using BLAST: BLOSUM 62 Low PAM: short segments, high similarity High PAM: long segments, low similarity Matrix Examples PAM120 THERE IS NO ONE SIZE FITS ALL MATRIX! BLOSUM-62 Zvelebil & Baum, 2007 Search and alignment entropy Information content per position: pam10-3.43 bits pam120-0.98 bits pam160-0.70 bits pam250-0.38 bits blosum62-0.70 bits Search and alignment entropy Recommended matrices for different query length Query length Substitution matrix Gap costs <35 PAM-30 ( 9,1) 35-50 PAM-70 (10,1) 50-85 BLOSUM-80 (10,1) >85 BLOSUM-62 (11,1) Information requirements: for search - 30 bits for alignment - 16 bit 1 2 First run (identities) Rescoring using PAM matrix high score low score The score of the highest scoring initial region is saved as the init1 score.

Sequence A Sequence A 3 Joining threshold - eliminates disjointed segments Non-overlapping regions are joined. The score equals sum of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score. 4 Alignment optimization using dynamic programming The score for this alignment is the opt score. FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. Using the distribution of the z-score, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score. FASTA Results When init1=init0=opt: 100 % homology over the matched stretch. When initn > init1: more than 1 matching region in the database with poorly matching separating regions. When opt > initn: the matching regions are greatly improved by adding gaps in one or both of the sequences. BLAST - Basic Local Alignment Search Tool Blast programs use a heuristic search algorithm. The programs use the statistical methods of Karlin and Altschul (1990,1993). Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distant related sequences. BLAST Algorithm 1 Query sequence of length L Maximium of L-w+1 words (typically w = 3 for proteins) For each word from the query sequence find the list of words with high score using a substitution matrix (PAM or BLOSUM) Word list

BLAST Algorithm BLAST Algorithm 2 Database sequences 3 Word list Maximal Segment Pairs (MSPs) Exact matches of words from the word list to the database sequences For each exact word match, alignment is extended in both directions to find high score segments Gapped BLAST The Gapped Blast algorithm allows gaps to be introduces into the alignments. That means that similar regions are not broken into several segments. This method reflects biological relationships much better. BLAST family of programs blastp - amino acid query sequence against a protein sequence database blastn - nucleotide query sequence against a nucleotide sequence database blastx - nucleotide query sequence translated in all reading frames against a protein database tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Database Searches Run Blast first, then depending on your results run a finer tool (Fasta, Smith-Waterman, etc.) Where possible use translated sequence. E() < 0.05 is statistically significant, usually biologically interesting. Check also 0.05 < E() <10 because you might find interesting hits. Pay attention to abnormal composition of the query sequence, it usually causes biased scoring. Split large query sequence ( if >1000 for DNA, >200 for protein). If the query has repeated segments, remove them and repeat the search. Documenting the Search Algorithm(s) Substitution matrix Gap penalty (FASTA) Name of database Version of database Computer used