Evolutionary Genetics. LV Lecture with exercises 6KP


Evolutionary Genetics LV 25600-01 Lecture with exercises 6KP HS2017

>What_is_it?
AATGATACGGCGACCACCGAGATCTACACNNNTC
GTCGGCAGCGTC

NCBI MegaBlast search (09/14)

Submitted (10-SEP-2014) CHINESE ACADEMY OF FISHERY SCIENCE, No. 150, Yong Ding Road, Beijing, China

AGTACT

Figure: querying NCBI. Query: a search term (Daphnia magna) or a sequence (ATGCGGTCACAACATG...). Tools: PubMed, Nucleotide, BLAST. Result: subject.

BLAST is the acronym for "Basic Local Alignment Search Tool", a local alignment search tool first described by Altschul et al. (1990). NCBI began offering a public sequence alignment service based on BLAST in 1992, first through its BLAST e-mail server (decommissioned in 2002) and later through the web (1997).

Book: BLAST. Authors: I Korf, M Yandell, J Bedell. Publisher: O'Reilly Media. Release Date: July 2003. Pages: 362.


BLAST finds high-scoring local alignments with a heuristic "word matching" algorithm; the result is not guaranteed to be the optimal alignment. The search runs in three distinct phases: 1) generating overlapping words from the input query, 2) scanning the database for word matches (hits), and 3) extending word hits to produce (local) alignments through multiple steps of extension. During the first phase, BLAST breaks the input query into short overlapping segments (words, also called seeds). In the second phase, BLAST takes those query words and scans the target database for initial matches. The nucleotide BLAST algorithm looks for any single exact word match. The protein BLAST algorithm uses a scoring threshold cutoff to identify matches, and it also requires two word hits within a certain distance of each other before proceeding to the next step.

ATGCGGTCACGTCACG > query sequence
ATGCG > word 1
TGCGG > word 2
GCGGT > word 3
CGGTC > word 4
GGTCA > word 5
GTCAC > word 6
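The first two phases can be illustrated with a minimal Python sketch. It is not BLAST's actual implementation: the function names `generate_words` and `find_exact_hits` are hypothetical, and only the exact-match rule described for nucleotide searches is modelled.

```python
def generate_words(query, word_size):
    """Phase 1: cut the query into overlapping words of length word_size."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_exact_hits(query, subject, word_size):
    """Phase 2 (nucleotide-style): list exact word matches.

    Returns (query_pos, subject_pos) pairs, both 0-based.
    """
    # Index every word of the subject by its start position.
    index = {}
    for s_pos in range(len(subject) - word_size + 1):
        index.setdefault(subject[s_pos:s_pos + word_size], []).append(s_pos)

    hits = []
    for q_pos, word in enumerate(generate_words(query, word_size)):
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


words = generate_words("ATGCGGTCACGTCACG", 5)
print(words[:6])  # the six words listed on the slide
print(find_exact_hits("TCC", "ATCGATCCATC", 3))
```

Indexing the subject once and looking each query word up in that index keeps the scan fast, which is the point of seeding with short words before any extension is attempted.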

TCC? AAAAAAAAAAA AGAGAGAGAGA TTTTCTTTTTT TTTTTTACCCC CCCCCCCCCCC ATCGATCCATC 13

TCC AAAAAAAAAAA AGAGAGAGAGA TTTTCTTTTTT TTTTTTACCCC CCCCCCCCCCC ATCGATCCATC 14

A AAAAAAAAAAA A,G AGAGAGAGAGA T,C TTTTCTTTTTT T,C,A TTTTTTACCCC C CCCCCCCCCCC A,T,C,G ATCGATCCATC 15

TCC > T A AAAAAAAAAAA A,G AGAGAGAGAGA T,C TTTTCTTTTTT T,C,A TTTTTTACCCC C CCCCCCCCCCC A,T,C,G ATCGATCCATC 16

TCC > T A AAAAAAAAAAA A,G AGAGAGAGAGA T,C TTTTCTTTTTT T,A TTTTTTACCCC C CCCCCCCCCCC A,T,C,G ATCGATCCATC 17

TCC > TC A AAAAAAAAAAA A,G AGAGAGAGAGA T,C TTTTCTTTTTT T,A TTTTTTACCCC C CCCCCCCCCCC A,T,C,G ATCGATCCATC 18

TCC AAAAAAAAAAA AGAGAGAGAGA TTTTCTTTTTT TTTTTTACCCC CCCCCCCCCCC ATCGATCCATC 19

Global Alignment
1 -----TCC--- 11
1 ATCGATCCATC 11

Local Alignment
Identities 3/3 (100%)
Query   1 TCC 3
Subject 6 TCC 8
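The local alignment of TCC against ATCGATCCATC can be reproduced with a small Smith-Waterman sketch. This is the classic dynamic programming method, not the heuristic BLAST itself uses, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(query, subject, match=1, mismatch=-1, gap=-2):
    """Return (score, (q_start, q_end), (s_start, s_end)), 1-based inclusive."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    q_end, s_end = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if query[i - 1] == subject[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i + 1, q_end), (j + 1, s_end)


score, q_range, s_range = smith_waterman("TCC", "ATCGATCCATC")
print(score, q_range, s_range)  # 3 (1, 3) (6, 8)
```

The reported subject range, positions 6 to 8, matches the column positions of TCC in the global alignment shown above.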


Figure labels: DNA, DNA, Protein; NCBI BLAST

Nucleotide-nucleotide searches are beneficial because no information is lost in the alignment. When a codon is translated from nucleotides to amino acids, approximately 69% of the complexity is lost (4^3 = 64 possible nucleotide combinations mapped to 20 amino acids). On the other hand, the true physical relationship between two coding sequences is best captured in the translated view. Matrices that take physical properties into account, such as PAM and BLOSUM, can be used to add power to the search. Additionally, a nucleotide search has only four possible character states, compared with 20 in an amino acid search. The probability that a match is due to chance rather than common ancestry (identical in state versus identical by descent) is therefore higher in a nucleotide search.
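The arithmetic behind these two statements can be checked in a few lines of Python. The chance-match probabilities assume that all character states are equally frequent, which is a simplification and not a claim about real sequence composition.

```python
codons = 4 ** 3            # 64 possible nucleotide triplets
amino_acids = 20

# Fraction of distinct states that disappears on translation.
complexity_lost = 1 - amino_acids / codons
print(f"complexity lost: {complexity_lost:.1%}")   # 68.8%, i.e. roughly 69%

# Chance that two random positions match, assuming uniform frequencies.
p_match_nucleotide = 1 / 4
p_match_amino_acid = 1 / 20
print(p_match_nucleotide / p_match_amino_acid)     # chance matches are about 5x more likely in DNA
```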

search    input   query    database
blastn    nt      nt       nt
blastp    pr      pr       pr
blastx    nt      pr (6)   pr
tblastn   pr      pr       pr (6)
tblastx   nt      pr (6)   pr (6)

nt = nucleotide, pr = protein, (6) = translated in all six reading frames

blastn compares nucleotide queries to a nucleotide database
blastp compares protein queries to a protein database
blastx compares a nucleotide query translated in all six reading frames against a protein database
tblastn compares a protein query against a nucleotide sequence database dynamically translated in all six reading frames
tblastx compares the six-frame translations of a nucleotide query against the six-frame translations of a nucleotide sequence database
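The "(6)" entries refer to six-frame translation: three reading frames on the given strand and three on its reverse complement. The sketch below illustrates the idea in Python, assuming the standard genetic code; the function names are hypothetical and the code does not reflect how the BLAST programs are implemented internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

A stop codon is written as `*`, so open reading frames show up as stretches without an asterisk.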

Fasta header:
>Dmag_B24_ORF0007_contigh23_2356_3466

Sequence:
ATGTGAACAAGTCTGAGAGATTCATCAACCGAATGTATATGAAAGCGGTGGTCCAGTCTG
ATCGGGCGGGCTTTCGATCAACAACAACAACAACAACAACAACAAGTACGATCGATCTAA
CTAGCTGACTAGCTGGACTGACTAGCTACTACGTACACGATCATATAATCGCGCGCGGCC
CCCTATATAGCTACGATGCATCGTATATAAATATTCTTATCTCCCTTA

NCBI fasta headers:
>gi|224922792|ref|NM_000860.4| Homo sapiens hydroxyprostaglandin dehydrogenase 15-(NAD) (HPGD), transcript variant 1, mRNA

Your header:
>Code_Species_Location/Gene/Coordinates
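A FASTA file is simple enough to read without a library. The following minimal parser is a sketch under the assumption that every record starts with a `>` header line and that the sequence may be wrapped over several lines; `parse_fasta` is a hypothetical helper name, and the example record is shortened to the first 60 bases of the sequence above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>Dmag_B24_ORF0007_contigh23_2356_3466
ATGTGAACAAGTCTGAGAGATTCATCAACC
GAATGTATATGAAAGCGGTGGTCCAGTCTG
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```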


Figure: BLAST results page with the Graphic Summary and the Descriptions table. Callouts: "click here to see the nr entry" and "click here to see the corresponding alignment".

E value: Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

Bit score: The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
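The relationship between the two quantities can be sketched with the standard Karlin-Altschul formulas, S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m is the query length and n the database length. The values used for lambda and K below are illustrative assumptions; in a real search they depend on the scoring matrix and gap penalties, and BLAST additionally applies length corrections that are not modelled here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters only (assumed, not taken from the lecture).
LAMBDA, K = 0.318, 0.134

bits = bit_score(raw_score=80, lam=LAMBDA, k=K)
print(f"bit score: {bits:.1f}")
print(f"E value:   {e_value(bits, query_len=250, db_len=50_000_000):.2e}")
```

Each additional bit halves the E value, while doubling the size of the database doubles it, which is why the same alignment can look less significant in a larger database.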

Figure: Nucleotide Alignment. Labels: query sequence, database sequence.

Figure: Protein Alignment. Labels: query sequence, database sequence, matching sequence.

HS17 HS15 UniBas JCW

Glu (E) - Asp (D)


Bioinformatics - Databases Self-Study Guide


When choosing a matrix, it is important to consider the alternatives. Do not simply accept the default setting without some initial consideration.

Suggested uses for common substitution matrices. The matrices highlighted in bold are available through NCBI's BLAST web interface.

BLOSUM62 has been shown to provide the best results in BLAST searches overall because it can detect a wide range of similarity. Nevertheless, the other matrices have their strengths. For example, if your goal is only to detect sequences of high similarity, in order to infer homology within a species, the PAM30, BLOSUM90, and PAM70 matrices would provide the best results.

Percent Accepted Mutation (PAM): a unit introduced by Margaret Dayhoff et al. (1978) to quantify the amount of evolutionary change in a protein sequence. One PAM unit is the amount of evolution that will change, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which the score for each amino acid substitution has been calculated from the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. The PAM matrices imply a Markov chain model of protein mutation. They are normalized so that, for instance, the PAM1 matrix gives substitution probabilities for sequences that have experienced one point mutation for every hundred amino acids. Mutations may hit the same position more than once, so the sequences reflected in the PAM250 matrix have experienced 250 mutation events for every 100 amino acids, yet only 80 out of every 100 amino acids have been affected.
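Why 250 mutation events per 100 sites leave some sites unchanged can be seen in a toy Markov chain. The sketch below assumes a uniform model in which every amino acid is replaced by any of the other 19 with equal probability; this is NOT the Dayhoff PAM1 matrix, whose substitution probabilities are unequal, so the toy model gives roughly 88% changed sites after 250 steps instead of the 80% quoted above. It only demonstrates the saturation effect.

```python
def make_uniform_pam1(n_states=20, change=0.01):
    """Toy 1-PAM transition matrix: 1% chance of change, spread evenly."""
    off_diagonal = change / (n_states - 1)
    return [[1 - change if i == j else off_diagonal for j in range(n_states)]
            for i in range(n_states)]


def step(distribution, matrix):
    """One Markov step: the next state depends only on the current state."""
    n = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def fraction_changed(n_steps, matrix):
    """Probability that a site differs from its starting residue after n_steps."""
    distribution = [0.0] * len(matrix)
    distribution[0] = 1.0            # start in state 0 with certainty
    for _ in range(n_steps):
        distribution = step(distribution, matrix)
    return 1 - distribution[0]


pam1 = make_uniform_pam1()
for n in (1, 30, 70, 250):
    print(f"PAM{n}: {fraction_changed(n, pam1):.1%} of sites differ")
```

Applying the one-step matrix n times is the same as raising it to the n-th power, which is how a PAM(n) matrix relates to PAM1 in the Markov chain view.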

Figure labels: A, G

A Markov chain, named for Andrey Markov, is a mathematical system that undergoes transitions from one state to another in a chain-like manner. It is a random process characterized as memoryless: the next state depends only on the current state and not on the entire past. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.

Blocks Substitution Matrix (BLOSUM). A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences that are more than 62% identical are represented by a single sequence in the alignment, so as to avoid over-weighting closely related family members. (Henikoff and Henikoff 1992)

The BLOSUM62 matrix

$$S_{ij} = \frac{1}{\lambda} \log\left(\frac{p_{ij}}{q_i \, q_j}\right)$$

Here $p_{ij}$ is the probability of two amino acids $i$ and $j$ replacing each other in a homologous sequence, and $q_i$ and $q_j$ are the background probabilities of finding the amino acids $i$ and $j$ in any protein sequence at random. The factor $\lambda$ is a scaling factor, set such that the matrix contains easily computable integer values.
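The log-odds formula can be evaluated directly. In the sketch below the probabilities are made-up illustration values, not frequencies from the BLOCKS database, and the scaling factor is assumed to be ln(2)/2, which expresses scores in half-bit units.

```python
import math

# Assumed scaling: half-bit units, i.e. score = 2 * log2(odds ratio).
HALF_BIT_LAMBDA = math.log(2) / 2


def log_odds_score(p_ij, q_i, q_j, lam=HALF_BIT_LAMBDA):
    """Rounded log-odds score for aligning residues i and j."""
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Hypothetical probabilities for three situations:
print(log_odds_score(0.04, 0.1, 0.1))    # pair seen 4x more often than chance -> 4
print(log_odds_score(0.01, 0.1, 0.1))    # pair seen exactly as often as chance -> 0
print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen 4x less often than chance -> -4
```

Positive scores therefore mark substitutions that occur more often in related proteins than expected by chance, and negative scores mark substitutions that are under-represented.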
