Database Searching and BLAST Dannie Durand

Similar documents
Basic Local Alignment Search Tool

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Match the Hash Scores

Data Retrieval from GenBank

Alignment to a database. November 3, 2016

Making Sense of DNA and Protein Sequences. Lily Wang, PhD Department of Biostatistics Vanderbilt University

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

Creation of a PAM matrix

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

Why learn sequence database searching? Searching Molecular Databases with BLAST

Dynamic Programming Algorithms

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

Annotation and the analysis of annotation terms. Brian J. Knaus USDA Forest Service Pacific Northwest Research Station

Bioinformatic analysis of similarity to allergens. Mgr. Jan Pačes, Ph.D. Institute of Molecular Genetics, Academy of Sciences, CR

CAP 5510/CGS 5166: Bioinformatics & Bioinformatic Tools GIRI NARASIMHAN, SCIS, FIU

Sequence Based Function Annotation

Textbook Reading Guidelines

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

03-511/711 Computational Genomics and Molecular Biology, Fall

03-511/711 Computational Genomics and Molecular Biology, Fall

MOL204 Exam Fall 2015

Evolutionary Genetics. LV Lecture with exercises 6KP

What I hope you ll learn. Introduction to NCBI & Ensembl tools including BLAST and database searching!

Methods and tools for exploring functional genomics data

B L A S T! BLAST: Basic local alignment search tool 11/23/2010. Copyright notice. November 29, Outline of today s lecture BLAST. Why use BLAST?

Basic Local Alignment Search Tool

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Exploring Similarities of Conserved Domains/Motifs

Protein Structure Prediction. christian studer , EPFL

Statistical Methods for Network Analysis of Biological Data

Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized

Genomics I. Organization of the Genome

MetaGO: Predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

Optimizing Genetic Algorithm Parameters for Multiple Sequence Alignment Based on Structural Information

Bioinformation by Biomedical Informatics Publishing Group

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004

Identifying Regulatory Regions using Multiple Sequence Alignments

Lecture 17: Heuris.c methods for sequence alignment: BLAST and FASTA. Spring 2017 April 11, 2017

NCBI Molecular Biology Resources

Outline. 1. Introduction. 2. Exon Chaining Problem. 3. Spliced Alignment. 4. Gene Prediction Tools

Data Mining for Biological Data Analysis

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

G4120: Introduction to Computational Biology

Single alignment: FASTA. 17 march 2017

Introduction to sequence similarity searches and sequence alignment

Introduction to Bioinformatics

Optimizing multiple spaced seeds for homology search

Modern BLAST Programs

Challenging algorithms in bioinformatics

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous

Bioinformatics Practical Course. 80 Practical Hours

BIOINFORMATICS IN BIOCHEMISTRY

Article A Teaching Approach From the Exhaustive Search Method to the Needleman Wunsch Algorithm

An introduction to multiple alignments

Structural Bioinformatics (C3210) Conformational Analysis Protein Folding Protein Structure Prediction

Gene Prediction: Similarity-Based Approaches Spliced Alignment

Representation in Supervised Machine Learning Application to Biological Problems

Improving Remote Homology Detection using a Sequence Property Approach

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

Sequence Analysis. BBSI 2006: Lecture #(χ+3) Takis Benos (2006) BBSI MAY P. Benos 1

A History of Bioinformatics: Development of in silico Approaches to Evaluate Food Proteins

Designing Filters for Fast Protein and RNA Annotation. Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler

G4120: Introduction to Computational Biology

Finding Compensatory Pathways in Yeast Genome

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Molecular Modeling Lecture 8. Local structure Database search Multiple alignment Automated homology modeling

Two Mark question and Answers

Iterated Conditional Modes for Cross-Hybridization Compensation in DNA Microarray Data

Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm

Genetic Algorithm for Predicting Protein Folding in the 2D HP Model

Feature Selection in Pharmacogenetics

Bioinformatics : Gene Expression Data Analysis

Accelerating Genomic Computations 1000X with Hardware

Optimization of Process Parameters of Global Sequence Alignment Based Dynamic Program - an Approach to Enhance the Sensitivity.

Christian Sigrist. January 27 SIB Protein Bioinformatics course 2016 Basel 1

Bio5488 Practice Midterm (2018) 1. Next-gen sequencing

1.1 What is bioinformatics? What is computational biology?

CS273: Algorithms for Structure Handout # 5 and Motion in Biology Stanford University Tuesday, 13 April 2004

Charles Girardot, Furlong Lab. MACS, CisGenome, SISSRs and other peak calling algorithms: differences and practical use

Small Exon Finder User Guide

Cascaded walks in protein sequence space: Use of artificial sequences in remote homology detection between natural proteins

ELE4120 Bioinformatics. Tutorial 5

BIOINFORMATICS Introduction

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

MATH 5610, Computational Biology

Protein design. CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror

RNA Structure Prediction. Algorithms in Bioinformatics. SIGCSE 2009 RNA Secondary Structure Prediction. Transfer RNA. RNA Structure Prediction

Bioinformatics Translation Exercise

File S1. Program overview and features

Introduction to Bioinformatics

Bioinformatics Course AA 2017/2018 Tutorial 2

Determining Error Biases in Second Generation DNA Sequencing Data

Sequence Analysis Lab Protocol

Advanced topics in bioinformatics

Sequence Motif Analysis

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

Identification of Ancient Greek Papyrus Fragments Using Genetic Sequence Alignment Algorithms

Transcription:

Computational Genomics and Molecular Biology, Fall 2013 1 Database Searching and BLAST Dannie Durand Tuesday, October 8th Review: Karlin-Altschul Statistics Recall that a Maximal Segment Pair (MSP) is an ungapped local alignment that is locally optimal; that is, the score of this alignment cannot be improved by extending it or making it shorter. A High scoring Segment Pair (HSP) is a maximal segment pair with score S S T, where S T is a similarity score threshold. A hit is an short, ungapped alignment of length w with score at least T. Given a query sequence, Q, of length m and a database, D, of length n, BLAST attempts to find all HSP s with statistically significant scores. The assessment of significance is based on BLAST statistics (Karlin and Altschul, 1990), which are used to estimate the expected number of HSP s with score at least S T under a random sequence model. The significance of a matching sequence with score S retrieved in a BLAST search is expressed as an E value. E is defined as the expected number of HSPs with score at least S, given a random query sequence of length m and a random data base sequence of length n. Here, random means a sequence randomly sampled according to the amino acid frequencies found in typical proteins sequences; for example, the amino acid frequencies in GenBank. Informally, we can think of the E value as the number of false positives with score at least S that we would expect to see if we searched a database of size n with a query of length m. Under these assumptions, Karlin and Altschul showed that the expected number of HSP s under the null hypothesis is E = Kmne λs, (1) where λ is specified by the equation 1 = i,j p i p j e λs[i,j] and K is a constant that can be computed analytically for various substitution matrices from the theory. Having set up this framework, Karlin and Altschul applied known theory about excursions of length l S to drive the statistics of HSPs of score at least S. This theoretical development required the following assumptions: 1. The scoring system allows for at least one positive step and one negative step; i.e., there exists some i and j, such that S[i, j] 0 and there exists some k and l such that S[k, l] < 0. 2. The expected step size is negative; i.e., i,j p ip j S[i, j] < 0.

Computational Genomics and Molecular Biology, Fall 2013 2 Note that these are very similar to the requirements for local alignment scoring that we proposed based on intuitive arguments at the beginning of the semester. Length adjustment Given a query sequence of length m and database of length n, to a first approximation, the size of the search space is mn. However, the database is not a single sequence but a set of concatenated sequences. An alignment cannot bridge the boundary between two database sequences. Furthermore, an alignment cannot start to close to the end of the query sequence. Both of these reduce the actual size of the search space. The effective query length, effective database size, and effective search space all include corrections for these edge effects. Blast reports these quantities under Results Statistics in the Search Summary report. Blast 90 The original BLAST90 heuristic, as described in Altschul s 1990 paper 1, has three main steps: 1. Construct L, a list of high scoring words derived from the query sequence. For proteins, high scoring words are w-mers with a similarity score T, when aligned with a subsequence of length w in the query sequence using the PAM120 substitution matrix. (I used BLOSUM62 in class.) The values of T and w are prespecified parameters. For DNA, high scoring words are the m w + 1 subsequences of Q of length w. 2. Compare each w-mer in the database sequence with the words in L to find hits. This can be done by hashing L or using a Mealy machine. Note that one could also make a list of the high scoring words in the database and compare each w-mer in the query sequence with all words in that list. Intuitively this might seem like a better option since a lot of words based on the databases could be reused for multiple queries. However, this would result in a much bigger hash table. In addition, this approach incurs a disk access performance penalty because it requires that the database be accessed randomly rather than scanned sequentially. 3. Extend hits to obtain HSP s with scores at least S T. The time spent on this step is reduced by using a score cutoff. If the score of the extended alignment is lower than the best score seen so far by the amount of the cutoff, then BLAST stops extending the alignment in that direction. The rationale of this heuristic is to restrict the search for high scoring ungapped alignments to regions of the data base that are promising ; i.e., that are likely to contain an HSP. Regions that 1 Basic Local Alignment Search Tool. Altschul, et al. J. Mol. Biol., 1990, vol.215, 403-410.

Computational Genomics and Molecular Biology, Fall 2013 3 contain a hit are considered promising. The underlying assumption is that most HSPs will contain a hit and that hits that are not contained in an HSP are rare. If a region contains an ungapped alignment with score at least S T, but there is no word in that alignment with score at least T, then BLAST will not report this HSP, resulting in a false negative. On the other hand, if extending a hit does not lead to an ungapped alignment with score at least S T, then the time spend extending an alignment in the region of the hit is wasted. The trick is to select w and T to obtain a good balance between false negatives and unnecessary extensions. These scenarios are shown graphically here: The speed and accuracy of this procedure depend on the parameters S T, w, and T. How should the values of these parameters be chosen? The value of S T is chosen by user indirectly, through the selection of the E value threshold. Increasing the value of S T (making the E value threshold more stringent) will result in fewer false positives and more false negatives. Given S T, the values of w and T are selected to minimize the number of false negatives and the search time. Steps 1 and 2 in the heuristic are relatively fast. Step 3 is slow. Therefore, the goal is to select values for the parameters w and T that limit the number of hits that must be extended in Step 3 without incurring too many false negatives. If the hit threshold, T, is increased, the number of hits - and therefore the number of extensions - will decrease. However, the number of regions that contain a local, ungapped alignment with score greater than S T, but do not contain a hit with score at least T, will also increase, resulting in more false negatives. Altschul and his colleagues used simulation studies to estimate the probability, for a given set of parameter values, that hits found in the data base would in fact be contained in local, ungapped alignments with score at least S T. This is discussed in detail in Altschul et al., 1990, on electronic reserve. Briefly, they used a statistical approach to minimize the probability of unnecessarily attempting to extend a hit. They determined empirically that a choice of w = 4 and T = 17 offered a good compromise between maximizing this probability and an excessive running time. Note that

Computational Genomics and Molecular Biology, Fall 2013 4 the values of w and T have been changed since 1990. Different values are used today. Gapped, two-hit BLAST In 1997, three innovations to the BLAST algorithm were introduced to address issues in sensitivity and running time 2 1. Gapped extensions 2. Two hit BLAST 3. Position-Specific Iterative (PSI) BLAST We discussed the first two innovations in class. PSI-BLAST, the third innovation, yields improved sensitivity by constructing a Position Specific Scoring Matrix model (PSSM) of sequences with significant similarity to the query sequence. In the first iteration, PSI-BLAST compares candidate matches to the query sequence. In subsequent iterations, PSI-BLAST compares candidate matching sequences to the PSSM from the previous iteration. Improved sensitivity is obtained because the resulting PSSM captures the distribution of amino acid sequences observed at each conserved position in the protein family represented by the query sequence. PSI-BLAST was not covered in class and will not be discussed further here. Despite the early success of the BLAST heuristic as a sequence database search tool, better runtime performance was needed to allow efficient searching as sequence data bases grew exponentially. By 1997, parameters had been reduced to w = 3 and T = 13, resulting in many more attempted extension in Step 3 of the heuristic. Since ninety percent of the running time was expended in the third step in the procedure, an approach was needed for reducing the number of extensions without loss of sensitivity. A second difficulty with the BLAST 90 heuristic was that it only finds ungapped alignments. A simple solution might be to find several ungapped alignments (HSP s) and merge them. For this to work, we need to find both ungapped alignments to identify a significant match. This in turn requires that there be a hit (a word with score at least T ) in each of them. One way to increase the probability that both are found is to decrease the word threshold from T = 13 to T = 11. But, this will increase the number of hits found in Step 2 and the number of unnecessary extensions in Step 3, resulting in slower running times. Instead, to obtain gapped alignments without further decreasing the running time, Gapped BLAST uses a two phase protocol for selecting regions for a full, gapped alignment: first, ungapped extensions are attempted only in regions containing a word with score at least T = 13. To limit the computational cost, the ungapped extension is terminated if the alignment score drops below 2 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Altschul, et al. Nucleic Acids Res., 1997, vol.25, n17, 3389-3402

Computational Genomics and Molecular Biology, Fall 2013 5 Figure 1: In Two-Hit BLAST, an extension is triggered if a pair of hits is found on the same diagonal within a distance of A. The current parameter values are w=3, t=11, and A=40. X u, the ungapped extension cutoff. Second, Gapped extensions are only attempted only in regions containing an ungapped extension with a preset minimum score, S g. If the score of the resulting HSP exceeds S g, then a gapped extension (using dynamic programming). Again, to limit the computational cost of this step, the extension is terminated if the alignment score drops below X g, the gapped extension cutoff. If the score of this gapped alignment exceeds the bit score threshold, S, the resulting match is reported. This innovation delivered gapped alignments and higher sensitivity, yet still achieved an improvment in running time. By increasing T from 11 to 13, the number of ungapped extensions was reduced by two thirds. Using the ungapped extensions as a filter for identifying candidates for gapped extension, resulted in one gapped extension per 4000 ungapped extensions. Although the computational gapped extensions is 500 times the cost of ungapped extensions, the total running time was reduced by more than a factor of 2. The second innovation, Two-Hit BLAST, delivered further performance improvement without unduly compromising sensitivity. The underlying rationale is that an MSP will typically contain more than one hit. Better specificity, resulting in fewer extensions, can be obtained by reducing the threshold, T, to obtain more hits, but requiring two hits on the same diagonal in close proximity in order to trigger an ungapped extension (Fig. 1). In Two-Hit Blast, the hit score threshold was again reduced to T = 11. Ungapped alignments are attemped only when two hits are found that are separated by a distance no greater than A = 40. This modification resulted in 3.2 times as many hits, but decreasedthe number of extensions by 0.14, yielding an additional two-fold speedup. An example showing the reduction in the number of extenstions with Two-Hit Blast is shown in Fig. 2.

Computational Genomics and Molecular Biology, Fall 2013 6 Figure 2: Hits with T=11 (.) and T=13 (+) in an alignment of Broad bean leghemoglobin and Horse beta globin, reproduced from Altschul et al. (1997). This alignment contains 37 hits when T=11, but only two pairs satisfy the requirements for an extension. In contrast, 15 hits are obtained when T=13, which would result in 15 extensions with the original 1990 BLAST algorithm. Putting it all together 1. Choose tolerated false positive rate, E. 2. Find hits of length w with similarity threshold T. 3. If there are two word pairs on same diagonal separated by a distance of at most A, perform an ungapped extension to obtain an HSP using cutoff, X u. 4. If HSP score exceeds S g, perform a gapped extension using dynamic programming with cutoff X g. 5. If gapped local alignment score exceeds S, report the match, the score, and the significance expressed as an E-value.