Basic Bioinformatics: Homology, Sequence Alignment,

Similar documents
Two Mark question and Answers

NGS sequence preprocessing. José Carbonell Caballero

Francisco García Quality Control for NGS Raw Data

NUCLEIC ACIDS. DNA (Deoxyribonucleic Acid) and RNA (Ribonucleic Acid): information storage molecules made up of nucleotides.

Why learn sequence database searching? Searching Molecular Databases with BLAST

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

Genome Resources. Genome Resources. Maj Gen (R) Suhaib Ahmed, HI (M)

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

What Are the Chemical Structures and Functions of Nucleic Acids?

Why study sequence similarity?

FUNCTIONAL BIOINFORMATICS

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

ELE4120 Bioinformatics. Tutorial 5

Sequence Based Function Annotation

Chapter 2: Access to Information

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming

COMPUTER RESOURCES II:

What I hope you ll learn. Introduction to NCBI & Ensembl tools including BLAST and database searching!

Bioinformatics Course AA 2017/2018 Tutorial 2

Gene Identification in silico

Basic concepts of molecular biology

Data Retrieval from GenBank

The University of California, Santa Cruz (UCSC) Genome Browser

Textbook Reading Guidelines

Protein Bioinformatics Part I: Access to information

What happens after DNA Replication??? Transcription, translation, gene expression/protein synthesis!!!!

Chapter 12 DNA & RNA

RNA & PROTEIN SYNTHESIS

HC70AL Spring An Introduction to Bioinformatics -- Part I. Brandon Le. April 6, What is a Gene? An ordered sequence of nucleotides

Algorithms in Bioinformatics

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

(Very) Basic Molecular Biology

DNA Structure DNA Nucleotide 3 Parts: 1. Phosphate Group 2. Sugar 3. Nitrogen Base

Genomics and Database Mining (HCS 604.3) April 2005

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

HC70AL Spring 2011! An Introduction to Bioinformatics! By!! Brandon Le! April 7, 2011!

Molecular Biology. IMBB 2017 RAB, Kigali - Rwanda May 02 13, Francesca Stomeo

MATH 5610, Computational Biology

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Exercise I, Sequence Analysis

What is a Gene? HC70AL Spring An Introduction to Bioinformatics -- Part I. What are the 4 Nucleotides By in DNA?

DNA and RNA 2/14/2017. What is a Nucleic Acid? Parts of Nucleic Acid. DNA Structure. RNA Structure. DNA vs RNA. Nitrogen bases.

O C. 5 th C. 3 rd C. the national health museum

CHAPTER 11 DNA NOTES PT. 4: PROTEIN SYNTHESIS TRANSCRIPTION & TRANSLATION

Data Mining for Biological Data Analysis

Match the Hash Scores

Jay McTighe and Grant Wiggins,

translation The building blocks of proteins are? amino acids nitrogen containing bases like A, G, T, C, and U Complementary base pairing links

DNA is normally found in pairs, held together by hydrogen bonds between the bases

PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein

Sequence Databases and database scanning

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

From Gene to Protein

BIOINFORMATICS FOR DUMMIES MB&C2017 WORKSHOP

Evolutionary Genetics. LV Lecture with exercises 6KP

Basic concepts of molecular biology

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

DNA RNA PROTEIN. Professor Andrea Garrison Biology 11 Illustrations 2010 Pearson Education, Inc. unless otherwise noted

What is necessary for life?

Adv Biology: DNA and RNA Study Guide

What is necessary for life?

Replication Transcription Translation

BME 110 Midterm Examination

Introduction to 'Omics and Bioinformatics

Nucleic acids. How DNA works. DNA RNA Protein. DNA (deoxyribonucleic acid) RNA (ribonucleic acid) Central Dogma of Molecular Biology

DNA & THE GENETIC CODE DON T PANIC! THIS SECTION OF SLIDES IS AVAILABLE AT CLASS WEBSITE

UNIT 3 GENETICS LESSON #41: Transcription

Download the Lectin sequence output from

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

Chapter 13 - Concept Mapping

II. DNA Deoxyribonucleic Acid Located in the nucleus of the cell Codes for your genes Frank Griffith- discovered DNA in 1928

13.1 RNA Lesson Objectives Contrast RNA and DNA. Explain the process of transcription.

Tutorial for Stop codon reassignment in the wild

The study of the structure, function, and interaction of cellular proteins is called. A) bioinformatics B) haplotypics C) genomics D) proteomics

Roadmap. The Cell. Introduction to Molecular Biology. DNA RNA Protein Central dogma Genetic code Gene structure Human Genome

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

Biotechnology Explorer

BIOINFORMATICS IN BIOCHEMISTRY

Introduction to DNA and RNA

Dr. Ramesh U2 L2. Basic structure of DNA and RNA Introduction to Genetics

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

G4120: Introduction to Computational Biology

Videos. Lesson Overview. Fermentation

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

Biology. Biology. Slide 1 of 39. End Show. Copyright Pearson Prentice Hall

Biology. Biology. Slide 1 of 39. End Show. Copyright Pearson Prentice Hall

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

Resources. How to Use This Presentation. Chapter 10. Objectives. Table of Contents. Griffith s Discovery of Transformation. Griffith s Experiments

DNA and RNA Structure. Unit 7 Lesson 1

BIOLOGY 111. CHAPTER 6: DNA: The Molecule of Life

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Sections 12.3, 13.1, 13.2

DNA vs. RNA B-4.1. Compare DNA and RNA in terms of structure, nucleotides and base pairs.

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

Lecture 7. Next-generation sequencing technologies

Textbook Reading Guidelines

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

user s guide Question 1

Transcription:

Basic Bioinformatics: Homology, Sequence Alignment, and BLAST William S. Sanders Institute for Genomics, Biocomputing, and Biotechnology (IGBB) High Performance Computing Collaboratory (HPC 2 ) Mississippi State University wss2@igbb.misstate.edu June 3, 2015 Biology Review: The genome is the genetic material of an organism it is primarily responsible for heredity and variation within and among species Genomes can be: DNA (deoxyribonucleic acid) a double helical molecule made up of 4 nucleotides Adenine, Guanine, Thymine, and Cytosine RNA (ribonucleic acid) a single stranded molecule also made up of 4 nucleotides Adenine, Guanine, Uracil, and Cytosine Nucleic acids are translated into: Proteins made up of one or more long chains of amino acids (20 standard amino acids) The Central Dogma of Molecular Biology 1

Biology Review: For our purposes, DNA & RNA can be thought of as a character string with an alphabet of ~4 characters: 5 -GATTACATGTTTCGGGTACGATGC-3 3 -CTAATGTACAAAGCCCATGCTACG-5 Since the 3 5 strand of DNA is the reverse compliment of the 5 3 strand, standard convention is to only store the 5 3 strand: 5 -GATTACATGTTTCGGGTACGATGC-3 or GATTACATGTTTCGGGTACGATGC Biology Review: Proteins can be thought of as character strings with an alphabet of ~21 characters: MLLITMATAFMGYVLPWGQMSFWGATV Protein sequences are represented from the N terminus (aminoterminus) to their C terminus (carboxyl terminus) 2

Standard Files Types: GenBank ( *.gb *.genbank) National Center for Biotechnology s (NCBI) Flat File Format (text) Provides a large amount of information about a given sequence record http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html FASTA (*.fasta *.fa ) Pronounced FAST A Simple text file format for storing nucleotide or peptide sequences Each record begins with a single line description starting with > and is followed by one or more lines of sequence FASTQ (*.fastq *.fq ) Pronounced FAST Q Text based file format for storing nucleotide sequences and their corresponding quality scores Quality scores are generated as the nucleotide is sequenced and correspond to a probability that a given nucleotide has been correctly sequenced by the sequencer Biological Sequence Representation: FASTA format Can represent nucleotide sequences or peptide sequences using single letter codes FASTQ format Represents nucleotide sequences and their corresponding quality scores >gi 5524211 gb AAD44166.1 cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT +!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY 3

Biological Sequence Representation: FASTQ Format @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT +!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 Line 1 = @ and Sequence identifier and description (optional) Line 2 = Raw sequence Line 3 = + optionally followed by same sequence id and description Line 4 = Encoded quality values for Line 2 sequence (same number of symbols) Biological Sequence Representation: FASTQ Format Quality Scores The symbols in Line 4 of a FASTQ file represent the quality of the base at a given position in the sequence There are a few different possible encodings of the quality score, dependent on sequencing platform Sanger format quality scores range from 0 to 93 A higher score is better, and the scale is logarithmic: Q sanger = 10 log 10 p p= 10 * ( Q / 10) 4

Biological Sequence Representation: FASTQ Quality Scores Quality Score Values from Lowest to Highest:!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{ }~ Phred Quality Score Probability of incorrect Base call accuracy base call 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1,000 99.9% 40 1 in 10,000 99.99% 50 1 in 100,000 99.999% 60 1 in 1,000,000 99.9999% Homology: Homologous Sequences (Homologs) Agene related to a second gene by descent from a common ancestral DNA sequence. 5

Homology: Orthologous Sequences (Orthologs) genes in different species that evolved from a common ancestral gene. Gene in Common Ancestor Gene in Species A Gene in Species B Paralogous Sequences (Paralogs) genes related by duplication within a genome. Gene 1 Gene 1a Gene 1b Duplication Event Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original function. Homology: Other logs Homoeologs homologs resulting from the duplication of genes due to a whole genome duplication (WGD) event more common in plant species Ohnologs very ancestral homoeologs Xenologs homologs resulting from horizontal gene transfer between two organisms 6

Sequence Alignment: Sequence alignment is the procedure of comparing two (pairwise) or more (multiple) sequences and searching for a series of individual characters or character patterns that are the same in the set of sequences. Global alignment find matches along the entire sequence (use for sequences that are quite similar) Local alignment finds regions or islands of strong similarity (use for comparing less similar regions [finding conserved regions]) Sequence Alignment: Sequence 1 = GARVEY Sequence 2 = AVERY Global Alignment: GARVEY- -A-VERY 7

Sequence Alignment Example: A A G C G G G G T G T A A G A C G T T Sequence 1 = TGTAAGACGTT Sequence 2 = AAGCGGGG Why do researchers focus on sequence alignment? 1. To determine possible functional similarity. 2. For 2 sequences: a. If they re the same length, are they almost the same sequence? (global alignment) 3. For 2 sequences: a. Is the prefix of one string the suffix of another? (contig assembly) 4. Given a sequence, has anyone else found a similar sequence? 5. To identify the evolutionary history of a gene or protein. 6. To identify genes or proteins. 8

BLAST (Basic Local Alignment Search Tool): A tool for determining sequence similarity Originated at the National Center for Biotechnology Information (NCBI) Sequence similarity is a powerful tool for identifying unknown sequences BLAST is fast and reliable BLAST is flexible http://blast.ncbi.nlm.nih.gov/ Standard BLAST Versions: blastn searches a nucleotide database using a nucleotide query DNA/RNA sequence searched against DNA/RNA database blastp searches a protein database using a protein query Protein sequence searched against a Protein database blastx search a protein database using a translated nucleotide query DNA/RNA sequence > Protein sequence searched against a Protein database tblastn search a translated nucleotide database using a protein query Protein sequence searched against a DNA/RNA sequence database > Protein sequence database tblastx search a translated nucleotide database using a translated nucleotide query DNA/RNA sequence > Protein sequence searched against a DNA/RNA sequence database > Protein sequence database 9

NCBI BLAST Main Page: Sequence Input Databases to Search Against Program Selection Click to Run! 10

Same Page Organization Today s Example >unknown_sequence_1 TGATGTCAAGACCCTCTATGAGACTGAAGTCTTTTCTACCGACTTCTCCAACATTTCTGCAGCCAAGCAG GAGATTAACAGTCATGTGGAGATGCAAACCAAAGGGAAAGTTGTGGGTCTAATTCAAGACCTCAAGCCAA ACACCATCATGGTCTTAGTGAACTATATTCACTTTAAAGCCCAGTGGGCAAATCCTTTTGATCCATCCAA GACAGAAGACAGTTCCAGCTTCTTAATAGACAAGACCACCACTGTTCAAGTGCCCATGATGCACCAGATG GAACAATACTATCACCTAGTGGATATGGAATTGAACTGCACAGTTCTGCAAATGGACTACAGCAAGAATG CTCTGGCACTCTTTGTTCTTCCCAAGGAGGGACAGATGGAGTCAGTGGAAGCTGCCATGTCATCTAAAAC ACTGAAGAAGTGGAACCGCTTACTACAGAAGGGATGGGTTGACTTGTTTGTTCCAAAGTTTTCCATTTCT GCCACATATGACCTTGGAGCCACACTTTTGAAGATGGGCATTCAGCATGCCTATTCTGAAAATGCTGATT TTTCTGGACTCACAGAGGACAATGGTCTGAAACTTTCCAATGCTGCCCATAAGGCTGTGCTGCACATTGG TGAAAAGGGAACTGAAGCTGCAGCTGTCCCTGAAGTTGAACTTTCGGATCAGCCTGAAAACACTTTCCTA CACCCTATTATCCAAATTGATAGATCTTTCATGTTGTTGATTTTGGAGAGAAGCACAAGGAGTATTCTCT TTCTAGGGAAAGTTGTGAACCCAACGGAAGCGTAGTTGGGAAAAAGGCCATTGGCTAATTGCACGTGTGT ATTGCAATGGGAAATAAATAAATAATATAGCCTGGTGTGATTGATGTGAGCTTGGACTTGCATTCCCTTA TGATGGGATGAAGATTGAACCCTGGCTGAACTTTGTTGGCTGTGGAAGAGGCCAATCCTATGGCAGAGCA TTCAGAATGTCAATGAGTAATTCATTATTATCCAAAGCATAGGAAGGCTCTATGTTTGTATATTTCTCTT TGTCAGAATACCCCTCAACTCATTTGCTCTAATAAATTTGACTGGGTTGAAAAATTAAAA Sequence available for download at: ftp://ftp.hpc.msstate.edu/outgoing/wss2/ 11

Results of Our BLAST: Analysis of BLAST Results: Max Score how well the sequences match Total Score includes scores from non contiguous portions of the subject sequence that match the query Bit Score Alog scaled version of a score Ex. If the bit score is 30, you would have to score on average, about 230 = 1 billion independent segment pairs to find a score matching this score by chance. Each additional bit doubles the size of the search space. Query Coverage fraction of the query sequence that matches a subject sequence Evalue how likely an alignment can arise by chance Max ident the match to a subject sequence with the highest percentage of identical bases 12

Local BLAST Installation: Executables and documentation available at: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/latest/ Documentation: http://www.ncbi.nlm.nih.gov/books/nbk1762/ Windows Versions: 32 bit ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/latest/ncbi blast 2.2.30+ win32.exe 64 bit ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/latest/ncbi blast 2.2.30+ win64.exe Linux Version: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/latest/ncbi blast 2.2.30+ x64 linux.tar.gz Other Online Resources: UCSC Genome Browser (genome.ucsc.edu) 13

Other Online Resources: Protein Data Bank www.rcsb.org/pdb/home/home.do Other Resources: EMBL EBI (European Bioinformatics Institute) http://www.ebi.ac.uk/ DDBJ (DNA Data Bank of Japan) http://www.ddbj.nig.ac.jp/ NCBI s Sequence Read Archive (SRA) http://www.ncbi.nlm.nih.gov/sra IGBB s Useful Links Page http://www.igbb.msstate.edu/links.php Many, many more available online, just search. 14

Questions?? Contact Info E mail: wss2@igbb.misstate.edu Phone: (662) 325 2839 Office: A209 Portera HPC2 Building (across Hwy. 182 in the Research Park) 15