Chapter 7: Similarity searches on sequence databases

"All science is either physics or stamp collecting." (Ernest Rutherford)

Outline
Why is similarity important
BLAST: protein and DNA
Interpreting BLAST
Individualizing BLAST
PSI-BLAST

Evolution
Example: evolution of the forelimbs of vertebrates.
Evolution has duplicated and shuffled bits and pieces of molecules to produce new linear arrangements that combine function in novel ways.
Regions of similarity often suggest an evolutionary tie and/or common functional properties between very different molecules.

Adaptive convergence
Shared morphology does NOT necessarily imply common ancestry.
When similarity is due to common ancestry, we call it homology.

Common similarity problems
Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query.
Start with a small set of sequences and identify similarities and differences among them.
In many sequences or very long sequences, detect commonly occurring patterns.

Common similarity problems (rephrased)
One against many.
Common among several.
Common part of many.

How homology helps
Given molecular sequences X and Y:
X ~ Y AND INFO(Y) => INFO(X)   ("~" means "is similar to")
Are the sequences similar?

Why is similarity important
Similar sequences (homologues) often derive from the same ancestor, share the same structure, and have similar biological function.
This allows findings about one sequence to be extrapolated to the other.
Similarity judgements should be based on:
  The types of changes or mutations that occur within sequences.
  The characteristics of those different types of mutations.
  The frequency of those mutations.

Crude similarity thresholds
Proteins: 25% similarity.
Nucleic acids: 75% similarity.
Below 25%/75% lies the twilight zone, where everything is possible.
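The crude thresholds above can be tried out on a pair of already aligned sequences. The following is a minimal sketch, not a method given on the slides: the function names are hypothetical, percent identity is used as a stand-in for "similarity", and columns containing a gap are skipped, which is only one of several possible conventions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are ignored
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def crude_call(identity: float, is_protein: bool) -> str:
    """Apply the crude 25% (protein) / 75% (nucleic acid) rule of thumb."""
    threshold = 25.0 if is_protein else 75.0
    return "likely similar" if identity >= threshold else "twilight zone"


if __name__ == "__main__":
    identity = percent_identity("ACGT-ACGA", "ACGTTACGT")
    print(round(identity, 1), crude_call(identity, is_protein=False))
```

A call in the twilight zone does not mean the sequences are unrelated; it only means that this crude measure cannot decide.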

Refined similarity thresholds
E-value (expectation value): how many matches at least this good are expected by chance alone.
Length of the segments that are similar between the two sequences.
Patterns of amino acid (aa) conservation.
Number of indels.

BLAST
Basic Local Alignment Search Tool.
BLAST at NCBI (http://www.ncbi.nlm.nih.gov/blast) and BLAST at EMBnet (http://www.ch.embnet.org/software/ablast.html) use different databases and yield slightly different results.
Standard BLAST uses a substitution matrix (e.g. PAM or BLOSUM) that rewards identical matches, gives positive points for similar amino acids, and penalizes dissimilar ones.

Different BLASTs
blastp: compares your protein with a protein database.
tblastn: compares your protein with a translated nucleotide database (t is for translated).

Protein vs. nucleotide database
There are six ways to translate DNA to protein: the direct and the reverse strand, with 3 reading frames each.
tblastn runs all 6 possibilities.

BLASTing protein at NCBI
Input your sequence 5' to 3' (N- to C-terminus for a protein).
You run the query sequence against target databases to get hits (matches).
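The six translations mentioned above (two strands, three reading frames each) can be written out in a few lines. This is a sketch for illustration, not how tblastn is implemented; the function names are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT character is translated as X.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Translate both strands in all three reading frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Only one of the six frames usually encodes the real protein, which is why a translated search has to try all of them.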

blastp input by accession no.

blastp input by sequence
Sequence in FASTA format; CD (conserved domain) search deselected.

Intermediate result

Waiting for results
If the page indicates that the search will take more than 10 minutes, then use another BLAST server.
European server: http://www.ebi.ac.uk/blast2/
In the morning, use the USA server: http://genome.wustl.edu/blast/client.pl
In the afternoon, use the Japan server: http://www.ddbj.nig.ac.jp/search/blaste.html
Click just once.

BLAST output
Graphics: shows where your query is similar to others.
Hit list: ranked names of similar sequences.
Alignments: one to one.
Parameters used for the search.

Graphics part
Pass the mouse over a bar to see more.

Hit list
Accession number and description.
Score (bits): must be >50 to be reliable.
E-value: expectation of a match by chance (given the database); must be <0.001 to be reliable.

Alignment
Alignments do NOT lie if you know how to look at them.
x means masking (low-complexity segment).
+ in the consensus line means similarity.
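The two rules of thumb in the hit list (bit score above 50, E-value below 0.001) are linked by the standard relation E = m * n * 2^(-S), where S is the bit score, m the query length and n the total database length. The sketch below only evaluates that relation; the function names and the example lengths are assumptions, and real BLAST uses edge-corrected "effective" lengths, so its reported values will differ somewhat.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def bits_needed(e_value: float, query_len: int, db_len: int) -> float:
    """Bit score required to reach a given E-value: S = log2(m * n / E)."""
    return math.log2(query_len * db_len / e_value)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against a database of 10**9 residues.
    print(expected_hits(50, 300, 10**9))    # about 0.0003
    print(bits_needed(1e-3, 300, 10**9))    # about 48 bits
```

For these assumed sizes, the two thresholds roughly agree: a score near 50 bits corresponds to an E-value a little below 0.001. Because E grows with database size, the same bit score is less convincing in a larger database.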

Saving BLAST results
Reproducibility over time is low because the database, the BLAST program, and the default program parameters all change.
Convert to PDF. Save as Complete Webpage. Save Picture as.

BLASTing nucleic acid

Common mistake
"Friends of my friends are my friends." NOT necessarily.
BLAST runs local alignments; hits are NOT transitive unless the alignments overlap.
Sequence 1: AAAAATTTTTT
Sequence 2: AAAAA
Sequence 3: TTTTTT

BLASTs for DNA
blastn: DNA against DNA; for noncoding DNA.
tblastx: translated DNA against translated DNA; for protein discovery.
blastx: translated DNA against protein; for proteins encoded in your query DNA and for DNA sequence of unknown quality.

Using filters
Correct database (nucleotide/protein).
Organism database.
Repetitions.
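The three sequences from the "Common mistake" slide can be checked directly. In this sketch the longest exact common substring is used as a simplified stand-in for a local alignment (an assumption for illustration; BLAST also scores mismatches and gaps), and the function name is hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest exact substring shared by a and b (dynamic programming)."""
    best = ""
    prev = [0] * (len(b) + 1)  # match lengths ending at the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = a[i - curr[j]:i]
        prev = curr
    return best


if __name__ == "__main__":
    seq1, seq2, seq3 = "AAAAATTTTTT", "AAAAA", "TTTTTT"
    print(repr(longest_common_substring(seq1, seq2)))
    print(repr(longest_common_substring(seq1, seq3)))
    print(repr(longest_common_substring(seq2, seq3)))
```

Sequence 1 matches both of the others, but through different, non-overlapping regions, so Sequences 2 and 3 share nothing with each other.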

Use of BLAST
Finding genes in a genome.
Predicting a protein function.
Predicting a protein 3-D structure.
Finding protein family members.

Finding genes in a genome
Quick and dirty BLAST way: cut your genome into overlapping 5 kb sequences and run blastx against the nonredundant (NR) protein database for every piece.
Proper way: run gene prediction software.

Predicting a protein function
Quick and dirty BLAST way: use blastp against Swiss-Prot. If there is >25% identity over the whole protein length, then you know the probable function of your protein.
Proper way: conduct domain analysis and wet-lab (bluefingers) experiments.

Predicting a protein 3-D structure
Quick and dirty BLAST way: use blastp against PDB. If there is >25% identity over the whole protein length, then you know the probable structure of your protein.
Proper way: conduct homology modelling, X-ray, and NMR experiments.

Finding protein family members
Quick and dirty BLAST way: use blastp (or PSI-BLAST) against the nonredundant protein database. Make a multiple sequence alignment of all members of the family and draw a phylogenetic tree.
Proper way: clone new family members using PCR.

BLAST parameters
Power is nothing without control.
Reasons for changing the default parameters:
  The sequence has a biased composition (use masking).
  NO results (change the substitution matrix and gap penalties).
  Too many results (change the NR database to Swiss-Prot, use an Entrez keyword with Boolean operators, and make the E-value threshold more stringent, i.e. lower it).
  Testing the robustness of findings.
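The "quick and dirty" gene-finding recipe above starts by cutting the genome into overlapping 5 kb pieces. A minimal sketch of that step follows; the function name is hypothetical, and the 500-base overlap is an assumed value, since the slides do not say how much the pieces should overlap.

```python
def overlapping_chunks(genome: str, size: int = 5000, overlap: int = 500) -> list:
    """Cut a sequence into (start, piece) tuples; consecutive pieces share `overlap` bases."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not genome:
        return []
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, genome[start:start + size]))
        if start + size >= len(genome):
            break  # the last piece reaches the end of the sequence
        start += step
    return chunks


if __name__ == "__main__":
    toy_genome = "ACGT" * 3000  # 12,000 bases of made-up sequence
    for start, piece in overlapping_chunks(toy_genome):
        print(start, len(piece))
```

The overlap matters because a gene that straddles a cut point would otherwise be split between two queries; keeping the start coordinate makes it possible to map blastx hits back onto the genome.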

BLAST protein masking
Low-complexity regions (many prolines, many glutamic acids) produce false matches.
Masking is done by replacement with X.
Use InterPro, CD search, or Pfscan to find and mask common domains (e.g. the Zn finger domain and the fibronectin domain).

BLAST DNA masking
60% of human DNA consists of repeats.
Large-scale genome sequencing brings errors: remains of vectors in the human database.

BLAST output

PSI-BLAST
Position Specific Iterated BLAST.
For distantly related sequences.
1st iteration: finds relatives by blastp with the BLOSUM62 matrix.
2nd iteration: uses the results of the 1st run to generate a new substitution matrix (the same amino acid is penalized differently at different positions) and looks for more relatives.
3rd
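The idea behind the second PSI-BLAST iteration, a matrix in which the same amino acid scores differently at different positions, can be illustrated with a toy position-specific scoring matrix. This is a simplified sketch under stated assumptions and not the PSI-BLAST procedure itself: the function names are hypothetical, the background frequencies are uniform, and a flat pseudocount replaces the sequence weighting and matrix-derived pseudocounts that the real program uses.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """One dict of log-odds scores (in bits) per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gaps and unknown characters are simply not counted.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm: list, segment: str) -> float:
    """Score an ungapped segment against the matrix, position by position."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["ACD", "ACE", "ACD"]  # made-up first-iteration hits, already aligned
    pssm = build_pssm(hits)
    print(round(pssm[0]["A"], 2), round(pssm[2]["A"], 2))
    print(round(score(pssm, "ACD"), 2), round(score(pssm, "WWW"), 2))
```

In this toy matrix, alanine scores positively in the first column, where every hit has it, and negatively in the third column, where none does. A fixed matrix such as BLOSUM62 cannot express that difference.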

PSI-BLASTing protein
http://www.ncbi.nlm.nih.gov/blast/blast.cgi?cmd=web&layout=twowindows&auto_format=semiauto&ALIGNMENTS=250&ALIGNMENT_VIEW=Pairwise&client=web&COMPOSITION_BASED_STATISTICS=on&DATABASE=nr&CDD_SEARCH=on&DESCRIPTIONS=500&ENTREZ_QUERY=(none)&EXPECT=10&FORMA

PSI-BLASTing format

PSI-BLAST output
Check box: the hit will be used for the next iteration; the selection can be edited.
Green dot: the hit was used in previous iterations.
New: the hit is reported for the first time.

Avoiding mistakes with PSI-BLAST
When we look for hemoglobin and alcohol dehydrogenase appears among the hits after the 2nd iteration, it is time to stop.
Read the annotation to distinguish between an interesting finding and a false finding.
Check domains with InterPro, the CD server, or Pfscan, and cut proteins into pieces of about 200 aa with one domain each.

BLAST alternatives
Smith and Waterman (ssearch): the slowest, most accurate. http://ori.nibb.ac.jp/sit/ssearch.html
FASTA: slower, good for DNA (the name originally stood for "fast all"). http://www.ebi.ac.uk/fasta33/
BLAT: for locating cDNA in a genome; keeps an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. http://genome.ucsc.edu/cgi-bin/hgblat
FLASH: Fast alignment Algorithm for finding Structural Homology. http://140.109.42.177/flash/
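The BLAT description above, an in-memory index of non-overlapping 11-mers, can be sketched as follows. This only illustrates the indexing idea under simplifying assumptions: the function names are hypothetical, the filtering of repeat-heavy k-mers is left out, and only exact seed matches are reported, without the extension and stitching steps that a real aligner performs.

```python
from collections import defaultdict


def build_index(genome: str, k: int = 11) -> dict:
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):  # step by k: tiles do not overlap
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(index: dict, query: str, k: int = 11) -> list:
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, gpos))
    return hits


if __name__ == "__main__":
    # Toy example with k=3 so that the tiles are easy to read.
    genome = "ACGTTGCAAGGC"
    index = build_index(genome, k=3)
    print(sorted(index))
    print(seed_hits(index, "GTTGCA", k=3))
```

Indexing only non-overlapping k-mers keeps the index about k times smaller than one over every position, while sliding over all overlapping k-mers of the query still finds a seed wherever the query contains a complete genome tile.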

Exercise 1
We compared 4 homologs of papain by structural comparison: kiwi actinidin, human procathepsins L and B, and Staphylococcus aureus staphopain. Run papaya papain through BLAST and PSI-BLAST. Which of the 4 homologs mentioned above are hit by BLAST, and which by PSI-BLAST?

Exercise 2
How many cytokinin dehydrogenase sequences are in the databases?