BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

BLAST

Basic Local Alignment Search Tool.
Optimized for finding local alignments between two sequences. An example is aligning an mRNA sequence to genomic DNA.
Proteins are frequently composed of functional domains that recur in many different proteins. These parts are the most likely to be conserved.

BLAST

Since DNA databases can be very large, searching for the optimal alignment against every sequence can take too long.
BLAST is a heuristic algorithm. Heuristic algorithms find a match reasonably close to the optimal one in a much shorter time than full dynamic programming.
The alignment can then be verified or refined separately using dynamic programming.

How BLAST works

The query sequence is divided into subsequences (words) of a given length: word size 3 for proteins, 11 for nucleotides.
These words are used to look for exact or nearly exact matches in the sequence database. This is fast to do, i.e. computationally inexpensive.
When a match is found, it is extended further.

[Figure: the query KRISTIAN split into overlapping words of size W=3. Stages shown: query, seeding, search space (database), word hits, alignment, gapped alignment.]
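The seeding step can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation: the function names and the toy database are made up for the example, and only exact word matches are considered.

```python
from collections import defaultdict

def words(seq, w):
    """Return (offset, word) pairs for every overlapping word of length w."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def build_index(database, w):
    """Map each word to the (sequence name, offset) positions where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for offset, word in words(seq, w):
            index[word].append((name, offset))
    return index

def word_hits(query, index, w):
    """Exact word hits as (query offset, sequence name, database offset)."""
    return [(q_off, name, d_off)
            for q_off, word in words(query, w)
            for name, d_off in index.get(word, [])]

# Toy database: the names and sequences are hypothetical.
database = {"seq1": "MKRISTQ", "seq2": "AAIANAA"}
index = build_index(database, 3)
hits = word_hits("KRISTIAN", index, 3)
print(hits)
```

Indexing the database once and looking up each query word is what makes the seeding stage cheap compared with aligning the query against every sequence.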

Threshold in seeding

Word hit
In blastn, a hit is two identical words, one in the database and the other in the query sequence.
In protein-related searches, a hit is a match to a neighborhood word.
The neighborhood of a word contains the word itself and all other words whose score is at least T (the threshold) when compared with it via the scoring matrix.
For example, if T=13, word=PQG and matrix=BLOSUM62, only words scoring at least 13 count as hits: PQG-PEG (15) is accepted, but PQG-PQA (12) is not.

Setting T higher removes more word hits, making BLAST run faster, but increases the chance of missing an interesting alignment.
Setting W (word size) higher decreases sensitivity (the chance of finding the alignment), but increases the speed of the search.
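The PQG example can be checked with a short sketch. It assumes only the handful of BLOSUM62 entries needed here (P-P=7, Q-Q=5, G-G=6, Q-E=2, G-A=0); the function names are illustrative, and a real search would use the full matrix and enumerate all possible words.

```python
# Partial BLOSUM62: only the pairs needed for this example are listed.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("G", "A"): 0,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(word, other):
    """Sum the per-position substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, other))

def neighborhood(word, candidates, threshold):
    """Keep the candidate words whose score against `word` is at least T."""
    return [c for c in candidates if word_score(word, c) >= threshold]

print(word_score("PQG", "PEG"))                       # 7 + 2 + 6
print(word_score("PQG", "PQA"))                       # 7 + 5 + 0
print(neighborhood("PQG", ["PQG", "PEG", "PQA"], 13))
```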

Extension

Word hits found during seeding are extended from their ends.
Extension is stopped when the alignment score drops or, in newer implementations, when the alignment score has dropped far enough (the drop-off score X) below its previous maximum.

[Figure: a word hit within an alignment, with extension proceeding from both ends.]

Extension, example drop-off score

gap=0, X=2, matrix BLOSUM62

Query            K  R  I  S  T  I  A  N  -  -
Database         -  R  I  S  T  I  S  A  N  A
BLOSUM62 value   0  5  4  4  5  4  1 -2  0  0
Score            0  5  9 13 18 22 23 21 21 21
Drop-off         0  0  0  0  0  0  0  2

Extension terminates when the drop-off reaches X.
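A one-directional extension with a drop-off rule might look like the sketch below. The per-column values are those of the KRISTIAN example, read with BLOSUM62; the sign of the A/N mismatch (-2) is an assumption taken from the matrix. The function name is illustrative, and stopping as soon as the drop reaches X follows the slide's example rather than any particular BLAST release.

```python
def extend(column_scores, x_drop):
    """Extend in one direction.

    Returns (best score, number of columns in the best-scoring extension).
    Stops once the running score has fallen x_drop or more below its maximum.
    """
    score = best = 0
    best_len = 0
    for i, s in enumerate(column_scores, start=1):
        score += s
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break
    return best, best_len

# Column scores from the KRISTIAN example (gap = 0, X = 2).
column_scores = [0, 5, 4, 4, 5, 4, 1, -2, 0, 0]
best, length = extend(column_scores, x_drop=2)
print(best, length)
```

The alignment reported is the prefix that reached the maximum score, not the point where extension stopped.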

Evaluation

Once the extension stage has produced the alignments, they are evaluated to determine whether they are statistically significant.
Statistical significance is determined using Karlin-Altschul statistics (the E-score).
Some simplifying assumptions are made (such as infinitely long sequences and no gaps), but in practice Karlin-Altschul statistics generalize well.

E-score

The lower the E-score, the more significant the alignment.
The E-score depends on both the database size and the scoring system (substitution matrix, gap penalties). If these are changed, the E-score for a specific alignment also changes.

Karlin-Altschul statistics

E value: $E = K m n e^{-\lambda S}$

E = number of alignments expected to reach score S just by chance
K = minor constant
m = the length of the query sequence
n = the length of the database
e = Napier's constant (Euler's number), approximately 2.718
λS = normalized alignment score (S is the score, λ is the normalization factor)

NOTE: When E is very small, it can be interpreted equivalently to a p-value!

Karlin-Altschul, example

What is the chance that, when two equally long (250 residues) amino acid sequences are aligned using the PAM250 matrix, the alignment score reaches 75?

$E = K m n e^{-\lambda S} = 0.1 \cdot 250 \cdot 250 \cdot e^{-(0.229 \cdot 75)} \approx 0.000217$

http://www.ncbi.nlm.nih.gov/blast/tutorial/altschul-1.html
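The worked example can be reproduced numerically. The sketch below uses the constants from the example (K = 0.1, λ = 0.229, m = n = 250, S = 75); the function names are illustrative. The relation P = 1 - e^(-E) between the E value and the probability of at least one chance hit shows why the two are nearly equal when E is small.

```python
import math

def e_value(k, m, n, lam, score):
    """Karlin-Altschul expected number of chance alignments with score >= S."""
    return k * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment: P = 1 - exp(-E)."""
    return -math.expm1(-e)

e = e_value(k=0.1, m=250, n=250, lam=0.229, score=75)
print(f"E = {e:.6f}")
print(f"P = {p_value(e):.6f}")
```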

Disadvantages of BLAST

When the expected sequence similarity drops below 80%, nucleotide-nucleotide BLAST no longer performs well.
Many significant homologies are missed because of the initial word size requirement.
If the initial words are allowed to be discontinuous, matching improves.

Discontinuous initial words

For instance, require 11 positions out of 21 consecutive nucleotides to match.
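The idea of a discontinuous word can be illustrated with a template that marks which of the 21 positions must agree. The alternating template below is purely hypothetical, chosen only because it has 11 required positions out of 21; it is not the template any BLAST program uses, and the function name is made up for the example.

```python
# Hypothetical template: '1' = position must match, '0' = position is ignored.
TEMPLATE = "10" * 10 + "1"   # 21 positions, 11 of them required

def seed_matches(a, b, template=TEMPLATE):
    """True if two windows agree at every position the template marks '1'."""
    if len(a) != len(template) or len(b) != len(template):
        raise ValueError("windows must have the same length as the template")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")

query_window = "ACGTACGTACGTACGTACGTA"

# Differences at ignored positions (indices 1 and 3) do not break the seed.
diverged = list(query_window)
diverged[1], diverged[3] = "T", "A"
diverged = "".join(diverged)

# A difference at a required position (index 0) does.
broken = "C" + query_window[1:]

print(seed_matches(query_window, diverged))
print(seed_matches(query_window, broken))
```

A contiguous word of length 11 would be broken by either mismatch in `diverged`; the discontinuous word tolerates them, which is why it finds more diverged matches.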

Filtering out repeats

The human genome (like most others) contains large amounts of repetitive DNA (LINE, SINE, Alu, etc.).
If the query sequence contains repeats, many of the homologies identified will be to other sequences containing repeats.
Repeats should in most instances be masked out. A masked region is usually represented by a run of Ns, e.g. AATAGNNNNCGC.

Different varieties of BLAST

DNA query against a database of DNA sequences (blastn).
Protein query against protein sequences (blastp).
DNA query translated in six reading frames against a protein database (blastx).
Megablast, for large and closely related sequences.
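Returning to repeat masking: once the repeat coordinates are known, replacing them with N is straightforward. The sketch below assumes the intervals have already been found by some repeat-detection step, which is not shown; the function name and the unmasked input sequence are made up so that the output matches the masked example.

```python
def mask(seq, intervals):
    """Replace each half-open [start, end) interval of seq with N characters."""
    chars = list(seq)
    for start, end in intervals:
        start = max(0, start)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)

masked = mask("AATAGACGTCGC", [(5, 9)])
print(masked)
```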

Blastn and Megablast

Typically used for identifying your sequence.
Megablast is a fast alternative for finding nearly exact matches.
Blastn is better at finding somewhat diverged sequences (e.g. from a related species).

Blastx and tblastx

Blastx translates the query sequence in all six reading frames and compares it to a protein database. Aggregate statistics are provided for all reading frames.
Tblastx compares a translated DNA query against a database of translated DNA sequences. It also produces aggregate statistics for all reading frames.
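The six-frame translation that blastx and tblastx rely on can be sketched as follows. This assumes the standard genetic code and an input containing only A, C, G and T; the function names are illustrative, and codons containing any other character are translated as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frames("ATGGCC")
print(frames)
```

Each of the six translations is then searched as if it were a protein query, which is why these programs cost several times as much as a single protein search.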

BLAST programs

Query            Database         Program    Typical uses
DNA              DNA              blastn     Annotation, mapping oligonucleotides to genome
protein          protein          blastp     Identifying common regions in proteins
translated DNA   protein          blastx     Finding protein-coding genes in genomic DNA
protein          translated DNA   tblastn    Identifying transcripts, possibly from multiple organisms
translated DNA   translated DNA   tblastx    Cross-species gene prediction, searching for genes not yet in protein databases
                                  megablast  Large and closely related sequences

Specialized BLAST

Choose a type of specialized search (or database name in parentheses):
Make specific primers with Primer-BLAST (finding primers specific to your PCR template, http://www.ncbi.nlm.nih.gov/tools/primerblast/index.cgi?link_loc=blasthome)
Find conserved domains in your sequence (cds)
Find sequences with similar conserved domain architecture (cdart)
Search sequences that have gene expression profiles (GEO)

...
Search immunoglobulins (IgBLAST)
Screen sequence for vector contamination (vecscreen)
Align two (or more) sequences using BLAST (bl2seq)
Search protein or nucleotide targets in PubChem BioAssay
Search SRA transcript and genomic libraries
Constraint Based Protein Multiple Alignment Tool
Needleman-Wunsch Global Sequence Alignment Tool
Search RefSeqGene
Search WGS sequences grouped by organism

Yet more varieties

PSI-Blast (Position Specific Iterated Blast) for very sensitive searches of a protein sequence against a protein database (purpose: finding proteins that belong to the same protein family).
PHI-Blast (Pattern-Hit Initiated Blast): a user-supplied pattern is first located in the query sequence and then searched for in the database...

...how do you choose the BLAST version best suited to your own purpose?!

Help with choosing a program: http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab
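As a first approximation, the choice among the five basic programs follows from the query type, the database type, and whether DNA should be compared at the protein level. The lookup below is only a sketch of the program table from the slides; the function and parameter names are hypothetical, and it does not cover megablast or the specialized searches.

```python
def choose_program(query, database, translated=False):
    """Pick a basic BLAST program.

    query, database: "DNA" or "protein".
    translated: for DNA against DNA, compare six-frame translations (tblastx)
    instead of the nucleotides themselves (blastn).
    """
    if query == "DNA" and database == "DNA":
        return "tblastx" if translated else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("DNA", "protein"): "blastx",    # query is translated
        ("protein", "DNA"): "tblastn",   # database is translated
    }
    try:
        return table[(query, database)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query!r} vs {database!r}")

print(choose_program("DNA", "protein"))
print(choose_program("DNA", "DNA", translated=True))
```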