Alignment to a database. November 3, 2016

How do you create a database?
- 1982: GenBank (at LANL, 2000 sequences)
- 1988: A way to search GenBank (FASTA)

FASTA
- Find regions of identity (SW)
- Score and save the best
- Choose regions for banded alignment
- Optimal realignment with gaps

Genome Project
- 1982: GenBank (at LANL, 2000 sequences)
- 1988: A way to search GenBank (FASTA)
- 1988: Try to give GenBank to the librarians (NLM)
- 1988: NCBI established

Genome Project
- 1990: Basic Local Alignment Search Tool published
- 1992: NCBI gets GenBank and LANL wants it back
- 1992-2007: GenBank size doubles every 18 months
- 2007-present: GenBank growing frighteningly quickly
- October 2016, release 216: 220,731,315,250 bases in 197,390,691 sequences, plus 1,676,238,489,250 bases in 363,213,315 WGS records

Why align to a database?
- Align an unknown sequence to an annotated genome to discover function
- Search RNA and EST databases to see if a sequence is expressed
- mRNA-to-genomic alignment for gene and isoform structure
- Search for unexpected conservation between sequences

BLAST
- Basic Local Alignment Search Tool
- Rapid comparison of a query sequence against a database of nucleotide or protein sequences
- Why not use dynamic programming? It's guaranteed to find the optimal answer!
- Dynamic programming takes way too long and requires too much memory on even a moderately sized database
- BLAST is an efficient and effective alternative to dynamic programming
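To put a rough number on "way too long": full dynamic programming fills one matrix cell per (query position, database position) pair. The sketch below estimates that cost against the GenBank release 216 size quoted earlier. The query length and the cells-per-second throughput are assumptions chosen only for illustration, and `dp_cells` is a hypothetical helper name.

```python
def dp_cells(query_len: int, database_len: int) -> int:
    """Number of matrix cells a full dynamic-programming alignment must fill."""
    return query_len * database_len


QUERY_LEN = 1_000                # assumed query length (hypothetical)
GENBANK_BASES = 220_731_315_250  # GenBank release 216, non-WGS bases
CELLS_PER_SECOND = 1e9           # assumed throughput (hypothetical)

cells = dp_cells(QUERY_LEN, GENBANK_BASES)
days = cells / CELLS_PER_SECOND / 86_400  # 86,400 seconds per day
print(f"{cells:.2e} cells, about {days:.1f} days for a single query")
```

Under these assumptions a single 1 kb query costs on the order of 10^14 cell updates, which is between two and three days of compute before the WGS records are even considered.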

BLAST: how does it work?
- Looks for small, high-scoring sequence matches to an indexed database
- Extends the matches when it finds them, to create longer high-scoring matches
- Alignment scores are based on PAM/BLOSUM or gap/match/mismatch

BLAST: how does it really work?
- Begin with a matrix of similarity scores for all possible residues; compile a list of high-scoring words in the query
- Scan the indexed database for exact word hits (word length is a parameter)
query: ACTTGTGAACAT
words: ACTTGTG CTTGTGA TTGTGAA TGTGAAC GTGAACA TGAACAT
database match: TGTGAAC in TAGGCTTGTGAACAGT
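A minimal sketch of the word-generation and scanning step, using the query and database sequence from the slide. It handles only exact nucleotide words; the scoring-matrix step that builds a neighborhood of high-scoring protein words is left out. The function names (`query_words`, `build_index`, `find_hits`) are hypothetical and not part of BLAST itself.

```python
from collections import defaultdict


def query_words(query: str, w: int) -> list[str]:
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def build_index(subject: str, w: int) -> dict[str, list[int]]:
    """Map each word of length w in the database sequence to its start positions."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return index


def find_hits(query: str, subject: str, w: int) -> list[tuple[int, int, str]]:
    """Exact word hits as (query position, database position, word)."""
    index = build_index(subject, w)
    hits = []
    for i, word in enumerate(query_words(query, w)):
        for j in index.get(word, []):
            hits.append((i, j, word))
    return hits


words = query_words("ACTTGTGAACAT", 7)
hits = find_hits("ACTTGTGAACAT", "TAGGCTTGTGAACAGT", 7)
for hit in hits:
    print(hit)
```

On this example several query words hit the database sequence, and all of the hits lie on the same diagonal (database position minus query position is constant), which is what makes them candidates for a single extension.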

BLAST: how does it really work?
- Extend the match to create a maximal scoring pair (MSP)
- Stop extending when the score drops below a threshold; trim backward to get the maximal score
- Scoring: match +1, mismatch -1
query:       ACTTGTGAACAT
database: TAGGCTTGTGAACAGT
Scores at successive stages of the extension: 7, 8, 10, 9
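The extension step can be sketched as an ungapped walk outward from the word hit in both directions, keeping the best score seen and giving up once the running score falls too far below it. The `x_drop` cutoff and the helper names are assumptions made for this illustration; the scoring (match +1, mismatch -1) is the one on the slide.

```python
def _extend(query, subject, qi, si, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, length at which it was reached)."""
    best = run = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        run += match if query[qi] == subject[si] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run >= x_drop:
            break  # score dropped too far below the best seen so far
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, w,
               match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit of length w without gaps, trimmed to the maximal score."""
    seed = w * match
    right, right_len = _extend(query, subject, q_start + w, s_start + w,
                               +1, match, mismatch, x_drop)
    left, left_len = _extend(query, subject, q_start - 1, s_start - 1,
                             -1, match, mismatch, x_drop)
    q_begin, q_end = q_start - left_len, q_start + w + right_len
    return seed + left + right, query[q_begin:q_end]


score, segment = extend_hit("ACTTGTGAACAT", "TAGGCTTGTGAACAGT", 3, 6, 7)
print(score, segment)
```

Starting from the word TGTGAAC (score 7), the extension picks up one match on the right and two on the left for a score of 10; the flanking mismatches would lower it to 9, so they are trimmed away.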

BLAST: how does it really work?
- BLAST avoids low-complexity regions: it tabulates all k-tuples in the database DNA (k is usually around 8) and filters those that occur more frequently than some parameter
- BLAST has a "mask at hash" option that allows you to extend through the filtered regions
- Later versions of BLAST require two neighboring word hits to extend, which reduces the number of extensions sevenfold
CAGCCTCTTACCAGCTTAGCTACAGTTGATTTCTCGGTCAGGCTCTTACCAGCT
CAGGCTATTATTAGCTTAGCTACAGTAGATTTCTCGGTCAGGCTGGTACCATCT
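The k-tuple filter can be sketched as a simple count-and-threshold pass. This is an illustration of the idea only, not BLAST's actual filtering code; `overrepresented_kmers`, the toy sequence, and the `max_count` value are all assumptions.

```python
from collections import Counter


def overrepresented_kmers(sequence: str, k: int, max_count: int) -> set[str]:
    """Return the k-tuples that occur more than max_count times in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer for kmer, count in counts.items() if count > max_count}


# Toy sequence: a low-complexity AT run followed by ordinary sequence.
toy = "ATATATATATATGCGTACGTTAGC"
filtered = overrepresented_kmers(toy, k=4, max_count=2)
print(sorted(filtered))
```

Words in the returned set would be left out of the lookup table, so the AT run never seeds a hit, while the rest of the sequence still can.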

Choice of parameters
- Time required = time to compile list of words + time to scan database + time to extend all hits
- You can modify both the word size and the threshold
- Increased word size = fewer hits, but greater number of words
- Initial word score threshold T will pare down the number of hits to be extended

BLAST statistics: Karlin-Altschul statistics
- We don't know what the a priori score distribution looks like
- In fact, we're looking for the maximum of a bunch of independently and identically distributed variables, which is more like an extreme value distribution

BLAST statistics: Karlin-Altschul statistics
- The expected number of high-scoring segment pairs (HSPs) with score at least S is: $E = K m n e^{-\lambda S}$
- This is the E-value for the score S
- K and λ are the Karlin-Altschul parameters
- m and n are the lengths of the sequences
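A short sketch of the E-value calculation in its standard Karlin-Altschul form, E = K·m·n·e^(-λS), together with the usual conversion to a bit score. The K and λ values, the sequence lengths, and the raw score below are placeholders assumed for illustration; real values depend on the scoring system and are reported by the search program.

```python
import math


def e_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of HSPs with score >= score: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score: float, K: float, lam: float) -> float:
    """Normalized score S' = (lam * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


K, LAM = 0.13, 0.32          # assumed placeholder parameters
M, N = 300, 50_000_000       # assumed query length and database length
S = 80                       # assumed raw alignment score

print(f"E-value   = {e_value(S, M, N, K, LAM):.3g}")
print(f"bit score = {bit_score(S, K, LAM):.1f}")
```

Because the score sits in the exponent, a modest increase in S shrinks E very quickly, while doubling the database size only doubles E.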

BLAST statistics
[Figure: probability (0 to 0.40) plotted against x (-5 to 5) for a normal distribution and an extreme value distribution]

Gapped BLAST
- We have talked about ungapped BLAST so far
- The statistics for gapped BLAST are trickier, and they are not mathematically complete
- Affine gapped BLAST score = #matches * match score + #mismatches * mismatch penalty + #gaps * gap opening penalty + total gap length * extension penalty
ACTTGTGCATT
ACAT-TG--TT
Things to consider when choosing a gap penalty:
- Both the opening (g) and extension (r) penalties should be nonzero
- g + r should be greater than the max score for a match if you want gaps to be rarer than substitutions
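The affine score can be worked through on the example alignment from the slide. The sketch below counts matches, mismatches, gap openings, and total gap length, then applies the formula as written, with every gap position paying the extension penalty. The penalty values are assumptions for illustration, and `affine_score` is a hypothetical helper name.

```python
def affine_score(top: str, bottom: str, match=1, mismatch=-1,
                 gap_open=-2, gap_extend=-1):
    """Score a finished pairwise alignment under an affine gap model."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gap_opens = gap_length = 0
    previous_gap_row = None  # row that held the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != previous_gap_row:
                gap_opens += 1  # a new gap starts here
            gap_length += 1
            previous_gap_row = row
        else:
            previous_gap_row = None
            if a == b:
                matches += 1
            else:
                mismatches += 1
    score = (matches * match + mismatches * mismatch
             + gap_opens * gap_open + gap_length * gap_extend)
    return score, (matches, mismatches, gap_opens, gap_length)


score, counts = affine_score("ACTTGTGCATT", "ACAT-TG--TT")
print(score, counts)
```

With these assumed penalties the alignment has 7 matches, 1 mismatch, 2 gap openings, and a total gap length of 3, for a score of 7 - 1 - 4 - 3 = -1.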

PSI-BLAST: Position-specific iterated BLAST
1. Database search with query
2. Look to see if the newest hits are significantly related to the query
- If yes, repeat #1 and 2
- If no, finish
- Creates a PSSM (position-specific scoring matrix)

PSI-BLAST and PSSMs
PSSM
- Gapless alignment matrix
- Add pseudocounts to avoid tuning to the most closely related sequences
- Align to database with very high gap penalties
- Generally use dynamic programming to align
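A minimal sketch of building a PSSM from a gapless alignment with pseudocounts. It uses a DNA alphabet, a uniform background, and a flat pseudocount to stay short; PSI-BLAST itself works on proteins and uses sequence weighting and more elaborate pseudocounts, so everything here (`build_pssm`, `score_sequence`, the toy alignment) is an assumption made for illustration.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Log-odds (base 2) score per position and residue from a gapless alignment."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2((column.count(residue) + pseudocount) / total / background)
            for residue in alphabet
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of seq against the PSSM (no gaps)."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


pssm = build_pssm(["ACGT", "ACGA", "ACGT", "TCGT"])
print(round(score_sequence(pssm, "ACGT"), 2), round(score_sequence(pssm, "TTTT"), 2))
```

The pseudocount keeps residues that were never observed in a column from scoring negative infinity, which is what would otherwise tune the matrix too tightly to the sequences already found.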

PSI-BLAST and PSSMs
- PSI-BLAST performs well compared to other motif-finding programs
- More sensitive to weak but biologically relevant similarities
- Can use the resulting PSSMs to score other alignments, or in PHI-BLAST, rpsblast (finding conserved domains), etc.

PHI-BLAST: Pattern hit initiated BLAST
- Investigator supplies a complex pattern to be searched against the database of interest
- Can use PSSMs created by PSI-BLAST
- Very sensitive
- Very fast
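To illustrate what "supplies a complex pattern" can look like, the sketch below converts a pattern written in PROSITE-like syntax into a regular expression and finds where it occurs in a sequence. It covers only a subset of the syntax (single residues, `x`, `[...]`, `{...}`, and repeat counts), and it shows only the pattern-matching seed, not the alignment that is built around each occurrence. The converter, the pattern, and the sequence are all assumptions for illustration.

```python
import re

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-like pattern such as C-x(2,4)-C into a regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        token, low, high = m.groups()
        if token == "x":
            core = "."                       # any residue
        elif token.startswith("{"):
            core = "[^" + token[1:-1] + "]"  # any residue except these
        else:
            core = token                     # single residue or [allowed set]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]")
match = re.search(regex, "AAACGGCKLMVAAA")
print(regex, match.span(), match.group())
```

Each occurrence of the pattern in a database sequence would then serve as the anchor from which an alignment to the query is built.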

BLAT
- Designed to find DNA sequences 30+ bp long with > 95% identity, or protein sequences with greater than 80% similarity over 20 amino acids or more
- DNA searches work best between primates, protein searches among land vertebrates
- Keeps an index of all non-overlapping 11-mers of the entire genome in memory (not repeats though)
- Takes up < 1 GB RAM
- DNA word size 11, protein 4
- Written by Jim Kent, free
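The indexing idea can be sketched as follows: the genome is cut into non-overlapping k-mers that are stored in memory, and every overlapping k-mer of the query is looked up against that index. This is a toy illustration with k = 4 rather than 11, without repeat exclusion, and with hypothetical function names; it is not BLAT's implementation.

```python
from collections import defaultdict


def index_genome(genome: str, k: int) -> dict[str, list[int]]:
    """Index non-overlapping k-mers of the genome (positions 0, k, 2k, ...)."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def lookup(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Look up every overlapping k-mer of the query; return (query pos, genome pos)."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


genome = "ACGTTGCAGGATCCTAAGCT"
index = index_genome(genome, k=4)
hits = lookup("CAGGATCC", index, k=4)
print(len(index), hits)
```

Indexing only non-overlapping k-mers shrinks the index by roughly a factor of k, which is what lets a whole genome fit in memory; the query still finds the match because its own k-mers are taken at every offset.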

Repeats

The repeat problem
- Genomes, especially those of vertebrates (not pufferfish though) and plants, are highly repetitive
- Transposons (DNA transposons and retrotransposons)
- Simple sequence, centromeres, telomeres
- Other semicomplex repeats of uncertain purpose
- If a large sequence is searched against a repeat-laden database, you'll just get the repeats
- Solution: pre-mask known repeats. Is this a good idea?
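Pre-masking can be sketched in two common flavors: soft masking, which lowercases the repeat so a program can skip it when seeding but still extend through it, and hard masking, which replaces it with N and discards the sequence. The interval coordinates below are made up for illustration; in practice they would come from a repeat-finding program.

```python
def soft_mask(sequence: str, intervals) -> str:
    """Lowercase the half-open [start, end) intervals; everything else is uppercase."""
    chars = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(sequence: str, intervals, mask_char: str = "N") -> str:
    """Replace the half-open [start, end) intervals with mask_char."""
    chars = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


repeats = [(2, 5)]  # hypothetical repeat coordinates
print(soft_mask("ACGTACGTAC", repeats))
print(hard_mask("ACGTACGTAC", repeats))
```

Hard masking answers the slide's question one way: anything hidden under the N's, including a real gene that overlaps a repeat, can no longer be found, which is the cost to weigh against the cleaner search results.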

>sequence1
gcgttgctggcgtttttccataggctccgcccccctgacgagcatcacaaaaatcgacgc
ggtggcgaaacccgacaggactataaagataccaggcgtttccccctggaagctccctcg
tgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcgggaagcgtggc
tgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctgggctgtgtg
ccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtccaacccggtaa
agtaggacaggtgccggcagcgctctgggtcattttcggcgaggaccgctttcgctggag
atcggcctgtcgcttgcggtattcggaatcttgcacgccctcgctcaagccttcgtcact
ccaaacgtttcggcgagaagcaggccattatcgccggcatggcggccgacgcgctgggct
ggcgttcgcgacgcgaggctggatggccttccccattatgattcttctcgcttccggcgg
cccgcgttgcaggccatgctgtccaggcaggtagatgacgaccatcagggacagcttcaa
cggctcttaccagcctaacttcgatcactggaccgctgatcgtcacggcgatttatgccg
caagtcagaggtggcgaaacccgacaaggactataaagataccaggcgtttcccctggaa
gcgctctcctgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcggg
ctttctcattgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctg
acgaaccccccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtcca
acacgacttaacgggttggcatggattgtaggcgccgccctataccttgtctgcctcccc
gcggtgcatggagccgggccacctcgacctgaatggaagccggcggcacctcgctaacgg
ccaagaattggagccaatcaattcttgcggagaactgtgaatgcgcaaaccaacccttgg
ccatcgcgtccgccatctccagcagccgcacgcggcgcatctcgggcagcgttgggtcct
gcgcatgatcgtgctagcctgtcgttgaggacccggctaggctggcggggttgccttact
atgaatcaccgatacgcgagcgaacgtgaagcgactgctgctgcaaaacgtctgcgacct
atgaatggtcttcggtttccgtgtttcgtaaagtctggaaacgcggaagtcagcgccctg

>sequence2
gaattccggaagcgagcaagagataagtcctggcatcagatacagttggagataaggacg
gacgtgtggcagctcccgcagaggattcactggaagtgcattacctatcccatgggagcc
atggagttcgtggcgctgggggggccggatgcgggctcccccactccgttccctgatgaa
gccggagccttcctggggctgggggggggcgagaggacggaggcgggggggctgctggcc
tcctaccccccctcaggccgcgtgtccctggtgccgtgggcagacacgggtactttgggg
accccccagtgggtgccgcccgccacccaaatggagcccccccactacctggagctgctg
caacccccccggggcagccccccccatccctcctccgggcccctactgccactcagcagc
gggcccccaccctgcgaggcccgtgagtgcgtcatggccaggaagaactgcggagcgacg
gcaacgccgctgtggcgccgggacggcaccgggcattacctgtgcaactgggcctcagcc
tgcgggctctaccaccgcctcaacggccagaaccgcccgctcatccgccccaaaaagcgc
ctgcgggtgagtaagcgcgcaggcacagtgtgcagccacgagcgtgaaaactgccagaca
tccaccaccactctgtggcgtcgcagccccatgggggaccccgtctgcaacaacattcac
gcctgcggcctctactacaaactgcaccaagtgaaccgccccctcacgatgcgcaaagac
ggaatccaaacccgaaaccgcaaagtttcctccaagggtaaaaagcggcgccccccgggg
gggggaaacccctccgccaccgcgggagggggcgctcctatggggggagggggggacccc
tctatgccccccccgccgccccccccggccgccgccccccctcaaagcgacgctctgtac
gctctcggccccgtggtcctttcgggccattttctgccctttggaaactccggagggttt
tttggggggggggcggggggttacacggcccccccggggctgagcccgcagatttaaata
ataactctgacgtgggcaagtgggccttgctgagaagacagtgtaacataataatttgca
cctcggcaattgcagagggtcgatctccactttggacacaacagggctactcggtaggac
cagataagcactttgctccctggactgaaaaagaaaggatttatctgtttgcttcttgct
gacaaatccctgtgaaaggtaaaagtcggacacagcaatcgattatttctcgcctgtgtg
aaattactgtgaatattgtaaatatatatatatatatatatatatctgtatagaacagcc
tcggaggcggcatggacccagcgtagatcatgctggatttgtactgccggaattc