Why learn sequence database searching? Searching Molecular Databases with BLAST


Searching Molecular Databases with BLAST
  Basic Local Alignment Search Tool
  How BLAST works
  Interpreting search results
  The NCBI Web BLAST interface
  Demonstration and exercises

Why learn sequence database searching?
  What have I cloned?
  Is this really "my gene"?
  Has someone else already found it?
  What is this protein's function?
  What is it related to?
  Can I get more sequence easily?

Search programs are sequence alignment programs
  They try to find the best alignment between your probe sequence and every target sequence in the database.
  Finding optimal alignments is computationally very resource-intensive.
  It is usually not necessary to find optimal alignments, particularly for large databases.
  Alignments are ranked and only the top scores are reported.

Practical database search methods incorporate shortcuts
  The fastest sequence database searching programs use heuristic algorithms.
  The basic concept is to break the search and alignment process down into several steps.
  At each step, only a best-scoring subset is retained for further analysis.

What does "heuristic" mean?
  Heuristic programs find approximate alignments "using a problem solving technique in which the most appropriate solution of several found by alternative methods is selected at successive stages of a program for use in the next step of the program".
  Why consider every possible alignment once a reasonably good alignment is found?
  They are less sensitive than "dynamic programming" algorithms such as Smith-Waterman for detecting weak similarity.
  In practice, they run much faster and are usually adequate.
  The BLAST program, developed by Stephen Altschul and coworkers at the NCBI, is the most widely used heuristic program.
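The idea of keeping only a best-scoring subset at each step can be illustrated with a toy filter. This is a simplified sketch, not BLAST's actual algorithm: the function names `words` and `shortlist`, the word length of 4, and the tiny three-sequence database are all hypothetical choices made for illustration. It counts the short words each database sequence shares with the probe and passes only the top few candidates on to a (not shown) alignment step.

```python
from collections import Counter


def words(seq, k):
    """Return the set of all overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shortlist(query, database, k=4, keep=2):
    """Rank database entries by shared k-letter words; keep only the best few.

    Illustrative stand-in for one 'retain the best-scoring subset' step.
    """
    query_words = words(query, k)
    shared = Counter()
    for name, target in database.items():
        shared[name] = len(query_words & words(target, k))
    # Only candidates that share at least one word survive to the next step.
    return [name for name, count in shared.most_common(keep) if count > 0]


# Hypothetical miniature database, for illustration only.
database = {
    "seqA": "ttgacgcatgcgcaggcgacaa",
    "seqB": "aaaaaaaaaaaaaaaaaaaa",
    "seqC": "ggcatgcgcatt",
}
print(shortlist("cgcatgcgcagg", database))
```

A full alignment would then be attempted only for the shortlisted names, which is where the speed-up over exhaustive dynamic programming comes from, at the cost of possibly missing weak similarities.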

BLAST is a collection of five programs for different combinations of query and database sequences

  Program   Probe            Database
  blastn    DNA              DNA
  blastp    protein          protein
  blastx    translated DNA   protein
  tblastn   protein          translated DNA
  tblastx   translated DNA   translated DNA

BLAST features
  Very fast and can be used to search extremely large databases
  Sufficiently sensitive and selective for most purposes
  Robust - the default parameters can usually be used

Typical BLAST Output

Scores are reported in various ways
  Raw values based on the specific scoring matrix employed
  As bits, which are matrix-independent normalized values
  Significance as represented by E values

The EXPECT (E) threshold is used to control score reporting
  A match will only be reported if its E value falls below the threshold set.
  The default value for E is 10, which means that 10 matches with scores this high are expected to be found by chance.
  Lower EXPECT thresholds are more stringent and report fewer matches.

Probabilities reported are summations of the probabilities of multiple HSPs (High-scoring Segment Pairs)
  For HSPs to be included in a sum statistic or gapped alignment they must exhibit consistency:
    Same orientation
    Consistent order
    Don't overlap
  Repeated motifs will result in multiple, independent alignments between query and subject sequences.
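The relationship between raw scores, bit scores and E values can be made concrete with the standard Karlin-Altschul formulas, which are not spelled out on the slides: bit score S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m and n are the query and database lengths. The sketch below is illustrative only. The function names are hypothetical, and the lambda and K values in the usage lines are placeholder numbers, not parameters taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits (matrix-independent)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Number of chance matches expected with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def reported(hits, threshold=10.0):
    """Keep only (name, E) pairs whose E value falls below the threshold."""
    return [(name, e) for name, e in hits if e < threshold]


# Placeholder statistical parameters, assumed for illustration only.
LAMBDA, K = 0.318, 0.134
bits = bit_score(100, LAMBDA, K)
hits = [("hit1", expect_value(bits, 1000, 10**9)), ("hit2", 25.0)]
print(reported(hits))
```

Two consequences follow directly from the formula: each additional bit halves E, and doubling the size of the database doubles E for the same alignment, which is one reason the same match can look more or less significant in different databases.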

Interpreting scores
  Score interpretation is based on context
    What is the question?
    What else do you know about the sequences?
  Scoring is highly dependent on probe length
  Exact matches will usually have the highest scores (and lowest E values)
  Short exact matches may score lower than longer partial matches

Interpreting scores (continued)
  Short exact matches are expected to occur at random.
  Partial matches over the entire length of a query are stronger evidence for homology than are short exact matches.
  Read the sequence descriptions!

Homology vs Identity
  Homologous sequences are derived from a common ancestral sequence.
  Homology is either true or false. It can never be partial!
  Saying two sequences are 45% homologous is a misuse of the term.
  Sequence identity and similarity can be described as a percentage and are used as evidence of homology.

BLAST Example
  Is this sequence known? What does it encode?

>clone 14b
cgcatgcgcaggcgacagctcatggcgttcagggcctgacggttgctagggtgacagggacacaacatggcg
gcgggatctctaacgctctccttcgagggaccaccacggagatcctagtgcgggaccccgcctcagggaagt
ggaaagcagggggacaaccttcctgcttccttcttttccgtccagtgtcggcaaggggttgtcaccggcttc
cgcatccaagatgaagaactataaagcaattggcaaaataggagagggaacgttttctgaagttatgaagat
gcaaagcctgagagatggaaactactatgcatgtaaacaaatgaagcagcgctttgaaagtattgagcaagt
caacaacctacgagagatccaagcactgaggcgcctgaatccgcacccaaacattcttatgttgcatgaagt
ggtttttgacagaaaatctggttctcttgcactaatatgtgaacttatggacatgaatatttatgagctaat
acgagggagaagatacccattatcagaaaaaaaaattatgcactatatgtaccagttatgtaagtccctgga
tcatattcacagaaatggaatatttcacagagatgtaaaaccagaaaatatactaataaagcaggatgtcct
gaaattaggggactttggctcctgccggagtgtctattccaagcagccgtacacggaatacatctccacccg
ctggtaccgggccccggagtgtctcctcactgatgggttctacacgtacaagatggacctgtggagcgccgg
ctgtgtgttctacgagatcgccagtctgcagcccctctttcctggagtaaatgaactggaccaaatctcaaa
aatccacgatgtcatcggcacacccgctcagaagatcctcaccaagttcaaacagtcgagagctatgaattt
tgattttccttttaaaaagggatcaggaatacctctactaacaaccaatttgtccccacaatgcctctccct
cctgcacgcaatggtggcctatgatcccgatgagagaatcgccgcccaccaggccctgcagcacccctactt
ccaagaacagaggaaaacagagaagcgggctctgggcagccacagaaaagctggctttccggagcaccctgt
ggcaccggaaccactcagtaacagctgccagatttccaaggagggcagaaagcagaaacagtccctaaagca
agaggaggaccgtcccaagagacgaggaccggcctatgtcatggaactgcccaaactaaagctttcgggagt
ggtcagactgtcgtcttactccagccccacgctgcagtccgtgcttggatctggaacaaatggaagagtgcc
ggtgctgagacccttgaagtgcatccctgcgagcaagaagacagatccgcagaaggaccttaagcctgcccc
gcagcagtgtcgcctgcccaccatagtgcggaaaggcggaagataactgagcagcaccgtcgtctcgacttc
ggaggcaacaccaagcccgaccgggccaggcctgggtgatctgctgctgagacgccacggagggctggggat
gcgcctgcgtccgtttcgcgctggccggggctctgggtgctgccctgcgccctgccgcacccgcggcccgcg
cagctgcctaggatgttctgggctaatatacttgtaaaaccaccgcattctagggttttctttcattttcgt
taagaatttggggcaggaaatactttgtaactttgtatatgaatcaaaacaaacgagcaggcatttctgtga
tgtgttgggcgtggttggaaggtgggttctgcgtgtcccttcccagcgctgctggtcagtcgtggagcgcca
tcatgtcttaccagtgacgctgctgacacccctgacttttattaaagaataagctgtcgttaaaaaaaaaaa
aaaaaaaaaa

Search Strategy
  BLAST program = blastn (nucleotide query vs. nucleotide db)
  Database = nr (non-redundant)
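The earlier point that short exact matches are expected to occur at random can be checked with a back-of-the-envelope calculation. This is a rough sketch under a strong assumption that is not from the slides: all four bases are equally likely and independent, so any given word of length k matches at a given position with probability 1/4^k. The function name and the example lengths are hypothetical.

```python
def expected_exact_matches(word_len, query_len, db_len, alphabet_size=4):
    """Expected number of chance exact matches of a given length.

    Assumes uniformly random, independent letters (an idealization).
    """
    if word_len > query_len or word_len > db_len:
        return 0.0
    # Every pairing of a start position in the query with one in the database.
    start_pairs = (query_len - word_len + 1) * (db_len - word_len + 1)
    return start_pairs / alphabet_size ** word_len


# A 1,000-base probe against a hypothetical database of one billion bases.
for k in (10, 15, 20, 30):
    print(k, expected_exact_matches(k, 1000, 10**9))
```

Under these assumptions the expected count falls by a factor of four for every extra matching base, so a short exact match says little by itself, while a long match is very unlikely to be a coincidence.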

Search Summary

Graphical View of BLAST Results
  [Results-page screenshots; callouts: link to GenBank file, link to alignment, link to UniGene, link to Gene Expression Omnibus]

Homologs = Shared Evolutionary Ancestry = Conserved Function
  Orthologs are homologs that perform the same function in different species.
    Example: mouse globin and human globin
  Paralogs are homologs that are diverged members of a family.
    Example: human globin and human myoglobin

Statistical significance of scores
  Orthologs will have extremely significant scores: DNA 10^-100, protein 10^-30
  Closely related paralogs will have significant scores: protein 10^-15
  Distantly related homologs may be hard to identify: protein 10^-4

Basic BLAST form
  Choice of program
  Choice of database
  Filters on or off
  Sequence input
    Paste in as text or FASTA format
    Read in using gi or accession number
  Output format options

BLASTP Example

>Unknown protein
MWVTKLLPALLLQHVLLHLLLLPIAIPYAEGQRKRRNTIHEFKKSAKTTL
IKIDPALKIKTKKVNTADQCANRCTRNKGLPFTCKAFVFDKARKQCLWFP
FNSMSSGVKKEFGHEFDLYENKDYIRNCIIGKGRSYKGTVSITKSGIKCQ
PWSSMIPHEHSFLPSSYRGKDLQENYCRNPRGEEGGPWCFTSNPEVRYEV
CDIPQCSEVECMTCNGESYRGLMDHTESGKICQRWDHQTPHRHKFLPERY
PDKGFDDNYCRNPDGQPRPWCYTLDPHTRWEYCAIKTCADNTMNDTDVPL
ETTECIQGQGEGYRGTVNTIWNGIPCQRWDSQYPHEHDMTPENFKCKDLR
ENYCRNPDGSESPWCFTTDPNIRVGYCSQIPNCDMSHGQDCYRGNGKNYM
GNLSQTRSGLTCSMWDKNMEDLHRHIFWEPDASKLNENYCRNPDDDAHGP
WCYTGNPLIPWDYCPISRCEGDTTPTIVNLDHPVISCAKTKQLRVVNGIP
TRTNIGWMVSLRYRNKHICGGSLIKESWVLTARQCFPSRDLKDYEAWLGI
HDVHGRGDEKCKQVLNVSQLVYGPEGSDLVLMKLARPAVLDDFVSTIDLP
NYGCTIPEKTSCSVYGWGYTGLINYDGLLRVAHLYIMGNEKCSQHHRGKV
TLNESEICAGAEKIGSGPCEGDYGGPLVCEQHKMRMVLGVIVPGRGCAIP
NRPGIFVRVAYYAKWIHKIILTYKVPQS

BLASTP databases
  nr - All non-redundant GenBank CDS translations + PDB + SwissProt + PIR
  swissprot - The last major release of the SWISS-PROT protein sequence database
  pat - Patented sequences
  pdb - Sequences derived from the 3-dimensional structure Protein Data Bank
  month - All new or revised GenBank CDS translations + PDB + SwissProt + PIR released in the last 30 days

BLAST can be slow during peak hours (9-5 EST)

Conserved Domains

Request ID

Protein Scoring Matrices
  BLOSUM62 is the default BLASTP scoring matrix
  Different matrices produce slightly different alignments

BLOSUM62
Query: 80  EDFKFGKILGEGSFSTVVLARELATSREYAIKILEKRHIIKENKVPYVTRERDVMSRLDH 139
           +DFKFG ++G+G++STV+LA  + T + YA K+L K ++I++ KV YV+ E+  + +L++
Sbjct: 177 KDFKFGSVIGDGAYSTVMLATSIDTKKRYAAKVLNKEYLIRQKKVKYVSIEKTALQKLNN 236

PAM30
Query: 81  DFKFGKILGEGSFSTVVLARELATS-----REYAIKILEKRHIIKENKVPYVTRERDVMS 135
           DFKFG ++G+G++STV+    LATS     R YA K+L K ++I++ KV YV+ E+  +
Sbjct: 178 DFKFGSVIGDGAYSTVM----LATSIDTKKR-YAAKVLNKEYLIRQKKVKYVSIEKTALQ 232

DNA Databases
  nr - Non-redundant GenBank + EMBL + DDBJ + PDB sequences
  month - All new or revised nr
  dbest - GenBank + EMBL + DDBJ EST divisions
  dbsts - GenBank + EMBL + DDBJ STS divisions
  htgs - High Throughput Genomic Sequences
  Others - Bacterial and yeast genomes

  EST = expressed sequence tag
  GSS = Genome Survey Sequence
  HTGS = High Throughput Genome Sequence
  PAT = patented sequences
  PDB = sequences with known structures

Sequence filters
  Low complexity sequences can be filtered out.
  Since only a limited number of matches are reported, hits to simple repeats and other low complexity sequences can obscure other, more biologically meaningful similarities.
  Filters are used to remove low complexity sequences from the probe.

Low Complexity, human repeats (blastn)
Query: 1681 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1740
Sbjct: 1852 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1911

Query: 1741 ctggatacgtccacgattctgcaaagaagnnnnnnngttcaagaaaatgacgtagggcct 1800
Sbjct: 1912 ctggatacgtccacgattctgcaaagaagaaaaaaagttcaagaaaatgacgtagggcct 1971

Query: 1801 acaattccaataagcgccactatcagggaatag 1833
Sbjct: 1972 acaattccaataagcgccactatcagggaatag 2004
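The masked run of n's in the query above can be reproduced with a very crude filter. The sketch below masks only single-base runs (homopolymers) with a regular expression. It is an illustrative stand-in, not the low-complexity filter BLAST itself applies, and the function name and the minimum run length of 7 are assumptions chosen to match the example.

```python
import re


def mask_homopolymers(seq, min_run=7, mask_char="n"):
    """Replace runs of one repeated base (length >= min_run) with mask_char."""
    # ([acgt]) captures one base; \1{min_run - 1,} requires that many repeats.
    pattern = re.compile(r"([acgt])\1{%d,}" % (min_run - 1), re.IGNORECASE)
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


# Fragment of the subject line shown above, around the run of seven a's.
fragment = "ctgcaaagaag" + "a" * 7 + "gttcaag"
print(mask_homopolymers(fragment))
```

Masking the probe rather than the database means the masked positions simply cannot seed a match, so hits that rest only on the repeat drop out of the report.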

Output Options
  Pairwise output is the default

Query: 1681 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1740
Sbjct: 1852 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1911

Query: 1741 ctggatacgtccacgattctgcaaagaagnnnnnnngttcaagaaaatgacgtagggcct 1800
Sbjct: 1912 ctggatacgtccacgattctgcaaagaagaaaaaaagttcaagaaaatgacgtagggcct 1971

Query: 1801 acaattccaataagcgccactatcagggaatag 1833
Sbjct: 1972 acaattccaataagcgccactatcagggaatag 2004

Query Anchored without Identities

BLASTN vs BLASTP
  Protein sequences have much higher information content than nucleotide sequences.
  To find evidence for sequence homology, use BLASTP and search protein sequences.
  Is my sequence already in the database? To find identical sequences, search nucleotide databases.

Translated BLAST Searches
  Translations use all 6 frames
  Computationally intensive
  tblastx searches are not allowed for some large databases
  Must specify genetic code

Alternate Genetic Codes

Translated BLAST Searches

>clone 14b
cctccccacccatttcaccaccaccatgacaccgggcacccagtctcctttcttcctgctgctgctcctcacagtgctta
cagttgttacaggttctggtcatgcaagctctaccccaggtggagaaaaggagacttcggctacccagagaagttcagtg
cccagctctactgagaagaatgctttgtctactggggtctctttctttttcctgtcttttcacatttcaaacctccagtt

>Frame 1
PPHPFHHHHDTGHPVSFLPAAAPHSAYSCYRFWSCKLYPRWRKGDFGYPEKFSAQLY*EECFVYWGLFLFPVFSHFKPPV
>Frame 2
LPTHFTTTMTPGTQSPFFLLLLLTVLTVVTGSGHASSTPGGEKETSATQRSSVPSSTEKNALSTGVSFFFLSFHISNLQ
>Frame 3
SPPISPPP*HRAPSLLSSCCCSSQCLQLLQVLVMQALPQVEKRRLRLPREVQCPALLRRMLCLLGSLSFSCLFTFQTSS
>Frame -1
NWRFEM*KDRKKKETPVDKAFFSVELGTELLWVAEVSFSPPGVELA*PEPVTTVSTVRSSSRKKGDWVPGVMVVVKWVGR
>Frame -2
TGGLKCEKTGKRKRPQ*TKHSSQ*SWALNFSG*PKSPFLHLG*SLHDQNL*QL*AL*GAAAGRKETGCPVSWWW*NGWG
>Frame -3
LEV*NVKRQEKERDPSRQSILLSRAGH*TSLGSRSLLFSTWGRACMTRTCNNCKHCEEQQQEERRLGARCHGGGEMGGE

Taxonomy Reports

More BLAST Options

BLAST from ORF Finder
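The six reading frames listed above can be regenerated with a few lines of code. This is a minimal sketch that assumes the standard genetic code (the slides note that the genetic code must be specified, so other codes would need a different table). The function names are hypothetical, and incomplete trailing codons are simply dropped.

```python
# Standard genetic code, written in the conventional t/c/a/g base order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (lower case)."""
    return dna.lower().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons give X."""
    dna = dna.lower()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate all three forward and all three reverse-strand frames."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# First 30 bases of the clone shown above.
for name, protein in six_frames("cctccccacccatttcaccaccaccatgac").items():
    print(name, protein)
```

For this fragment, frame +1 gives PPHPFHHHHD, which agrees with the start of Frame 1 above. A translated search has to repeat this for every sequence involved, which is part of why tblastx in particular is computationally intensive.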

BLAST Tutorial
  BLAST tutorial on Biocomp Web page
  Goal: demonstrate the utility of, and the difference between, BLASTN and BLASTP searches
  BLASTN: is my DNA sequence in the database?
  BLASTP: are there related (homologous) proteins in the database?