BLAST (Basic Local Alignment Search Tool)

BLAST
About 100 times faster than dynamic programming. Good for database searches.
Derive a list of words of length w from the query (e.g., w = 3 for protein, 11 for DNA).
High-scoring words are compared with database sequences.
Sequences with many matches to high-scoring words are used for final alignments.
For coding DNA, protein-based searches are always more powerful than nucleotide-based searches in determining similarity and inferring homology.
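The word-list step above can be sketched in a few lines. This is a minimal illustration, not BLAST's actual implementation; the function name query_words is hypothetical.

```python
def query_words(query, w):
    """Return every overlapping word of length w in the query, in order."""
    if w <= 0:
        raise ValueError("word length must be positive")
    # A query of length L yields L - w + 1 words (none if L < w).
    return [query[i:i + w] for i in range(len(query) - w + 1)]


if __name__ == "__main__":
    # Protein example with w = 3, the word length quoted on the slide.
    print(query_words("MKTAY", 3))
```

A real search would index these words (for example in a hash table) so that database positions matching any word can be found quickly.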

BLAST (Basic Local Alignment Search Tool)
Figure example: a three-letter word scored position by position, P = 7 + Q = 5 + G = 6.
In addition to the exact word, BLAST considers related words based on BLOSUM62: the neighborhood.
Once a word is aligned, gapped and ungapped extensions are initiated, tallying the cumulative score.
When the score drops more than X, the extension is terminated.
The extension is trimmed back to the maximum.
HSP = high-scoring segment pair
Produces local alignments.
X = significance decay
S = minimum score to return a BLAST hit
T = neighborhood score threshold
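The extension rule (stop when the score drops more than X, then trim back to the maximum) can be sketched as follows. This is a simplified, ungapped, rightward-only illustration. The match/mismatch scores of +1/-1 and the function name extend_right are assumptions chosen for brevity; BLAST itself scores proteins with a substitution matrix such as BLOSUM62 and also extends to the left.

```python
def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 match=1, mismatch=-1):
    """Extend an exact word hit to the right without gaps.

    Returns (best_score, best_length): the maximum cumulative score seen
    and the length of the segment that achieves it (the trimmed HSP).
    """
    score = word_len * match          # the seed word is assumed to match exactly
    best_score, best_len = score, word_len
    length = word_len
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        elif best_score - score > x_drop:
            break                     # score fell more than X below the maximum
        i += 1
        j += 1
    return best_score, best_len


if __name__ == "__main__":
    print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, 4, x_drop=2))
```

In the example the seed ACGT extends over three more matches, then three mismatches push the score more than X = 2 below the maximum, so the segment is trimmed back to the first seven positions.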

BLAST home page http://blast.ncbi.nlm.nih.gov/blast.cgi

BLASTP

BLAST databases
Peptide Sequence Databases
nr: non-redundant GenBank CDS translations+pdb+swissprot+pir+prf
RefSeq_protein: reference proteins
Swissprot: SWISS-PROT protein sequence database
pdb: sequences derived from 3-dimensional structures
Nucleotide Sequence Databases
nr: GenBank+EMBL+DDBJ+PDB (no EST, STS, GSS, WGS, or PAT)
est: Expressed Sequence Tags. 34 billion seq.!
htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2
gss: Genome Survey Sequences
wgs: Whole Genome Shotgun Sequences. 148 billion sequences

BLAST Advanced options
-G  Cost to open a gap [Integer]; default = 11 (10 10 8 9)
-E  Cost to extend a gap [Integer]; default = 1 (1 2 2 2)
-e  Expectation value (E) [Real]; default = 10.00
-W  Word size; default is 11 for blastn, 3 for other programs
-b  Number of alignments to show (B) [Integer]; default = 100

Settings for the default search and for special cases:
                 Default     Short Query        Large Sequence Family   Ungapped BLAST
Filter           on          off                on                      on
Scoring Matrix   BLOSUM62    PAM30-35           BLOSUM62                BLOSUM62
Word Size        3           3-2, 7 for DNA     3, 11 for DNA           3, 11 for DNA
E value          10          1000 or more       10                      10
Gap costs        11, 1       9, 1               11, 1                   4
Alignments       50          50                 2000                    50

Report by species
Database: All nr GenBank CDS translations+pdb+swissprot+pir+prf
2,794,673 sequences; 957,836,323 total letters
Taxonomy reports
Query= Apetala1 P35631 (255 letters)
"+" indicates conservative amino acid substitution
"-" indicates gap/insertion
"XXXX" shows areas of low complexity
CONSIDER TAXONOMIC RELATIONSHIP WHEN INTERPRETING SIMILARITY VALUES!

Format BLAST output
All sequences that pass the E-value threshold are aligned beneath the query.
In the "with identities" views, identical residues are shown as dots.
Views shown: Flat Query-Anchored; Query-Anchored with identities

Statistical significance
Chance alignments have no biological significance.
Statistical significance implies a low probability of generating a chance alignment.
The probability of long alignments increases with longer sequences.
The extreme-value distribution
Used to calculate the probability of a chance alignment.
Generated by calculating the scores that result from repeatedly scrambling one of the sequences being compared.
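The scrambling procedure can be illustrated with a small simulation. This is a sketch under stated assumptions: a plain Smith-Waterman score with match = +1, mismatch = -1 and a linear gap cost of -2 stands in for BLAST's scoring, and the function names are hypothetical. It yields an empirical null distribution of scores rather than a fitted extreme-value distribution.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled_scores(a, b, n, seed=0):
    """Scores of a against n scrambled copies of b (the chance distribution)."""
    rng = random.Random(seed)
    letters = list(b)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(local_score(a, "".join(letters)))
    return scores


def empirical_p(observed, scores):
    """Fraction of chance scores at least as high as the observed score."""
    hits = sum(1 for s in scores if s >= observed)
    return (hits + 1) / (len(scores) + 1)   # +1 avoids reporting exactly zero


if __name__ == "__main__":
    a, b = "ACGTACGTAC", "TTACGTACGG"
    null = shuffled_scores(a, b, 200, seed=1)
    print(local_score(a, b), empirical_p(local_score(a, b), null))
```

Scrambling keeps the length and composition of the sequence fixed, so the scores it produces reflect what those two properties alone can generate by chance.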

BLAST statistics
S' (bit score): calculated from the raw score S (sum of BLOSUM62 scores) by normalizing with the statistical variables that define a scoring system (K and λ). Bit scores from different alignments, even ones employing different scoring matrices, can be compared.
$S' = \frac{\lambda S - \ln K}{\ln 2}$
K = minor constant
λ = constant to adjust for the scoring matrix
S = raw score of the high-scoring segment pair (HSP)
E (expect) value: number of alignments with scores equivalent to or better than S' that are expected to occur by chance in a database search.
$E = m n \, 2^{-S'}$
m = query size
n = database size
S' = bit score
m·n = search space
The E-value decreases exponentially as the score assigned to a match between two sequences increases.
The E-value depends on the size of the database and the scoring system in use.
When the E-value threshold is increased from the default value of 10, more hits can be reported. When it is reduced, only more significant hits are reported.
The lower the E-value (or the higher the bit score), the more significant the hit.
Because the product mn defines the search space, the same HSP may come out statistically significant in a small database and not significant in a large database.
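The two formulas on this slide, bit score S' = (λS - ln K) / ln 2 and E = m n 2^(-S'), can be checked numerically. This is a minimal sketch; the values λ = 0.267 and K = 0.041 used in the example are illustrative assumptions, not parameters taken from the slide.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bit, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit)


if __name__ == "__main__":
    lam, k = 0.267, 0.041                 # assumed, for illustration only
    s_bits = bit_score(100, lam, k)
    # Same HSP, same query, two database sizes: E grows with the search space.
    for db_len in (1_000_000, 1_000_000_000):
        print(db_len, e_value(s_bits, 255, db_len))
```

Running the loop shows the point made at the end of the slide: the E-value of an unchanged HSP scales linearly with database size.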

P values
P: probability of finding at least one HSP with bit score S' or higher by chance.
Since it can be shown that the number of random HSPs with score S' or higher is described by a Poisson distribution, the probability of finding at least one HSP with bit score S' or higher is
$P = 1 - e^{-E}$
where E is the expect value.

E        P
10       0.99995
1        0.63
0.1      0.095
0.01     0.01
0.001    0.001
0.0001   0.0001

P-values vary from 0 to 1, whereas E-values can be much greater than 1.
The BLAST programs report E-values, rather than P-values, because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995.
However, for E < 0.01, the P-value and E-value are nearly identical.
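The conversion P = 1 - e^(-E) is easy to reproduce. The sketch below uses math.expm1 so that very small E-values do not lose precision; the function name p_from_e is hypothetical.

```python
import math


def p_from_e(e):
    """Probability of at least one chance HSP, given the expect value E."""
    # 1 - exp(-E), computed as -expm1(-E) for accuracy when E is tiny.
    return -math.expm1(-e)


if __name__ == "__main__":
    for e in (10, 5, 1, 0.1, 0.01, 0.001, 0.0001):
        print(e, p_from_e(e))
```

For small E the output is almost equal to E itself, which is why the two measures are interchangeable below about 0.01.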

BLAST Tips
Suggested BLAST cutoffs:
DNA: the book suggests E-values < 10^-6 (I use E < 10^-10)
Protein: the book suggests E-values < 10^-3
Consider evolutionary divergence in your results!
DNA mutation rate without selection = 5.5 × 10^-9 per site per year. So 10 million years (10^7) of divergence gives 5.5 × 10^-2 ≈ 0.05 changes per site, i.e. about 95% identity.
BLAST search artifacts:
Repeated amino acid stretches (e.g. polyglutamine) or nucleotide repeats (e.g. ATATATATATATAT) result in meaningless positives with significant E-values.
Use BLAST filters to mask low-complexity regions: the programs SEG for proteins and DUST for DNA.
Or customize masking using the lower-case letter option.
RepeatMasker can be used to mask repeats in lower-case letters: http://www.repeatmasker.org/cgi-bin/webrepeatmasker
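The divergence estimate on this slide is a one-line calculation. The sketch below simply multiplies rate by time, as the slide does; it ignores multiple substitutions at the same site, and the function name expected_identity is hypothetical.

```python
def expected_identity(rate_per_site_per_year, years):
    """Approximate fraction of identical sites after the given divergence time."""
    changed = rate_per_site_per_year * years   # expected changes per site
    return max(0.0, 1.0 - changed)


if __name__ == "__main__":
    # Slide values: 5.5e-9 per site per year over 10 million years.
    print(expected_identity(5.5e-9, 1e7))
```

With the slide's numbers the result is about 0.945, which the slide rounds to roughly 95% identity.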

MEGABLAST
Variation of BLASTN, 10 times faster.
Optimized for long or highly similar (>95%) sequences.
Ideal for finding whether a large sequence is part of a large contig or chromosome, finding sequencing errors, and comparing large similar sequences.
Uses a longer default word length (28 instead of 11).
Faster non-affine gap penalty: gap opening penalty = 0, gap extension penalty E = r/2 - q (r = match reward, q = mismatch penalty).
Non-affine gapping tends to yield more gaps of shorter length.
Accepts multiple consecutive FASTA files as input.

Discontinuous MEGABLAST
Ideal for comparing divergent sequences from different organisms (<80% identity).
Uses a discontiguous word approach, different from other BLAST programs.
Nonconsecutive positions are examined over longer segments.

PSI-BLAST (Position Specific Iterative BLAST)
Designed to detect weak relationships.
The added sensitivity comes from the use of a profile that is constructed (automatically) from a multiple alignment.
The profile is generated by calculating a Position-Specific Scoring Matrix (PSSM) for every position in the alignment.
PSSMs are also called profiles; profile Hidden Markov Models are a related representation.
PSSMs are numerical representations of a multiple alignment.
A highly conserved position receives a high score.
The profile is used to perform additional searches (iterations), and the results of each iteration are used to refine the profile.
Each iteration uses a PSSM built from the previous iteration.
Continue searching iteratively until no new matches are identified: "convergence".

PSI-BLAST steps
BLASTP, multiple alignment, construct PSSM, use PSSM to search.

Construction of a PSSM
Each column in the alignment is a row in the PSSM.
Frequency of occurrence of a residue at each position.
Calculate the probability of each amino acid at each position.
Figure notes: T at position 8 is conserved (highest score, 150); P at position 9 is less conserved (score 89). Note the low scores of the aromatic residues F, Y and W relative to A in the P row.
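The PSSM construction can be sketched as follows: count residues in each alignment column, convert the counts to probabilities, and take log-odds against a background. This is a simplified illustration, not PSI-BLAST's actual procedure. The uniform background, the pseudocount of 1, and the function names are assumptions, and the resulting scores are on a different scale from the 150 and 89 in the figure.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one row per alignment column: {amino acid: log2-odds score}."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    pssm = []
    for pos in range(len(alignment[0])):
        # Gap characters and unknown letters are ignored when counting.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(row)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(pssm[i][aa] for i, aa in enumerate(seq))


if __name__ == "__main__":
    aligned = ["ATP", "ATG", "ATP", "STA"]
    pssm = build_pssm(aligned)
    print(round(pssm[1]["T"], 2), round(pssm[2]["P"], 2))
    print(round(score_sequence(pssm, "ATP"), 2))
```

In the toy alignment the fully conserved T in the second column scores higher than the partly conserved P in the third, mirroring the conserved versus less conserved contrast in the figure notes.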

PHI-BLAST (Pattern Hit Initiated BLAST)
PHI-BLAST searches for particular patterns in protein queries.
It combines matching of regular expressions with local alignments surrounding the match.
PHI-BLAST is preferable to just searching for pattern occurrences because it filters out cases where the pattern occurrence is probably random and not indicative of homology.
PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence.
PHI-BLAST limits alignments to those that match the provided pattern.
Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different.
PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used for PSI-BLAST.
Pattern: [C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]
Syntax for pattern at http://www.ncbi.nlm.nih.gov/blast/html/phisyntax.html
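To see what the example pattern matches, it can be translated into an ordinary regular expression. The sketch below handles only the elements that appear in the pattern above (bracketed residue sets, single letters, x, x(n) and x(n,m)); it is not a full implementation of the PHI-BLAST pattern syntax, and the function names and test sequence are made up for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as [C]-x(2)-[C] into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")                       # x: any one residue
            elif high is None:
                parts.append(".{%s}" % low)             # x(n): exactly n residues
            else:
                parts.append(".{%s,%s}" % (low, high))  # x(n,m): n to m residues
        elif re.fullmatch(r"\[[A-Z]+\]|[A-Z]", element):
            parts.append(element)                       # residue set or single residue
        else:
            raise ValueError("unsupported pattern element: %r" % element)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (start, end) positions of non-overlapping pattern matches."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start(), m.end()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    pat = "[C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]"
    seq = "MK" + "CAAC" + "G" * 12 + "HAAAH" + "LL"    # made-up test sequence
    print(pattern_to_regex(pat))
    print(find_pattern(pat, seq))
```

A plain pattern scan like this reports every occurrence, including chance ones; PHI-BLAST's added value, as the slide notes, is the local alignment and E-value computed around each match.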

Specialized BLAST
http://www.ncbi.nlm.nih.gov/blast/
Multiple sequence alignment: COBALT. Great tool!