Bacterial Genome Annotation


For an annotation you want to predict, from the sequence, all of:
- protein-coding genes, their start and stop, the resulting protein and its function
- the control elements
- variants
- structural RNAs, tRNAs and small RNAs
- pseudogenes and control regions
- direct/inverted repeats
- insertion sequences, transposons and other mobile elements

And it must be done quickly, without a huge use of resources.

Fortunately, the protein-coding sequences of bacteria have no introns (well, mostly none). But this does not mean that you can simply identify all open reading frames and call the job done. For example, how small can a protein be? What do you do if open reading frames overlap? What about read-through? What about start codons?

You have mostly been taught that a bacterial coding sequence begins with ATG. True, most do! But you can also have GTG, TTG, ATT, CTG or ATC. Similarly, the stop codons TGA and TAG may instead encode selenocysteine and pyrrolysine, respectively.
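To make the start-codon and minimum-size questions concrete, here is a minimal sketch of a naive six-frame ORF scan. It is not how any production gene caller works; the names `find_orfs`, `Orf` and `min_codons` are made up for illustration, and the default threshold of 30 codons is an arbitrary assumption. It simply shows that the set of allowed start codons and the length cut-off change which ORFs are reported.

```python
from typing import Iterator, NamedTuple

STOPS = {"TAA", "TAG", "TGA"}
# Start codons listed in the text; ATG is the most common.
STARTS = {"ATG", "GTG", "TTG", "ATT", "CTG", "ATC"}


class Orf(NamedTuple):
    strand: str  # "+" or "-"
    start: int   # 0-based index of the start codon on the scanned strand
    end: int     # index just past the stop codon on the scanned strand


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, starts=STARTS, min_codons: int = 30) -> Iterator[Orf]:
    """Yield the longest ORF ending at each stop codon, in all six frames.

    min_codons counts codons from the start codon up to, but not
    including, the stop codon. The default is an arbitrary assumption.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None and (i - first_start) // 3 >= min_codons:
                        yield Orf(strand, first_start, i + 3)
                    first_start = None
                elif first_start is None and codon in starts:
                    # Keep the most upstream start, i.e. the longest ORF.
                    first_start = i


toy = "GTGAAAATGCCCTAA"
print(list(find_orfs(toy, min_codons=1)))                  # GTG allowed as a start
print(list(find_orfs(toy, starts={"ATG"}, min_codons=1)))  # ATG only
```

Coordinates for the "-" strand are given on the reverse complement, not on the original sequence. Overlapping ORFs, read-through and recoded stop codons are not handled at all, which is exactly why a plain ORF scan is not an annotation.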

Do it fast! Another genome is coming off the sequencer. Do it without my attention! I have other things to do.

5S, 16S and 23S rRNAs are found using BLASTn against RefSeq; ncRNAs are found using BLASTn against Rfam families. Both 5S rRNA and ncRNA hits are refined with cmsearch. tRNAs are found using tRNAscan-SE.

cmsearch (covariance model search): a dynamic programming approach that includes known sequence and 2D structural information to find a match in sequence alone.

tRNAscan-SE: less than one false positive per 15 gigabases. A multi-pass algorithm that in the final pass uses a probabilistic secondary structure profile.

For proteins, the NCBI pipeline uses the pan-genome and target sets of proteins expected to be found (e.g. ribosomal proteins), and compares these using ProSplign, a frameshift- and error-aware aligner. Non-perfect alignments are passed to GeneMarkS+, which is an HMM gene caller (NCBI also hosts GLIMMER for secondary gene identification). If the hit is poor, it might be searched against a wider set of proteins and then passed through ProSplign/GeneMarkS+ again. Rapidly evolving regions such as phage-related proteins and CRISPRs are treated separately.

NCBI's Prokaryotic Genome Annotation Pipeline combines a computational gene prediction algorithm with a similarity-based gene detection approach. The pipeline currently predicts protein-coding genes, structural RNAs (5S, 16S, 23S), tRNAs, and small non-coding RNAs. The flowchart describes the major components of the pipeline. From http://www.ncbi.nlm.nih.gov/books/nbk174280/

In a GenBank file you will often find a metadata entry about the annotation. For example, for E. coli strain JJ1897 you will find:

##Genome-Assembly-Data-START##
Assembly Date :: SEP-2015
Assembly Method :: HGAP 3 (PacBio) v. SEPT-2015
Genome Representation :: Full
Expected Final Version :: Yes
Genome Coverage :: 100.0x
Sequencing Technology :: PacBio
##Genome-Assembly-Data-END##
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI
Annotation Date :: 01/05/2016 15:37:32
Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline
Annotation Method :: Best-placed reference protein set; GeneMarkS+
Annotation Software revision :: 3.0
Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region
Genes :: 5,336
CDS :: 4,695
Pseudo Genes :: 532
rRNAs :: 8, 7, 7 (5S, 16S, 23S)
complete rRNAs :: 8, 7, 7 (5S, 16S, 23S)
tRNAs :: 87
ncRNAs :: 0
##Genome-Annotation-Data-END##
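Because this metadata block is plain "key :: value" text between START and END markers, it is easy to read programmatically. The sketch below uses only the standard library and assumes the block has already been extracted as text with one field per line; the function name `parse_structured_comment` is made up for illustration.

```python
def parse_structured_comment(text: str, name: str = "Genome-Annotation-Data") -> dict:
    """Collect 'key :: value' pairs between ##<name>-START## and ##<name>-END##."""
    fields = {}
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == f"##{name}-START##":
            inside = True
        elif line == f"##{name}-END##":
            break
        elif inside and "::" in line:
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


# A shortened copy of the E. coli JJ1897 record quoted in the text.
record = """
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI
Annotation Date :: 01/05/2016 15:37:32
Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline
Annotation Software revision :: 3.0
Genes :: 5,336
CDS :: 4,695
Pseudo Genes :: 532
##Genome-Annotation-Data-END##
"""

info = parse_structured_comment(record)
print(info["Annotation Pipeline"], info["Annotation Software revision"])
print(int(info["CDS"].replace(",", "")))  # counts use thousands separators
```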

Core components of NCBI's eukaryotic annotation pipeline are the alignment programs Splign (a splicing-aware alignment algorithm) and ProSplign (a protein-to-genomic sequence alignment method), and Gnomon, a gene prediction program combining information from alignments of experimental evidence and from models produced ab initio with an HMM-based algorithm. From http://www.ncbi.nlm.nih.gov/books/nbk169439/

The annotation pipeline produces comprehensive sets of genes, transcripts, and proteins derived from multiple sources, depending on the data available. In order of preference, the following sources are used:
1. RefSeq curated annotated genomic sequences
2. Known RefSeq transcripts
3. Gnomon-predicted models: Gnomon generates gene models using mRNA, EST and protein alignments as evidence, supplemented by ab initio models.

The problem with eukaryotes is splicing. Which is the correct annotation? The correct answer could be YES: with alternative splicing, more than one of them can be right.

Splice sites make coding bits potentially very small. Below is a segment of human chr 4:

cgcctcccct ggccccgtgc acacacacgc ccaccgcggc tcgggctggc tgagcgcggg
cgagtgtgag cgcgagtgtg cgcacgccgc gggagcctct ctgccctctc ctcgcaccct
gctcagggca tctgaagagc ctggaaacgt gaacaggctt gaagtatggc atgttgcaaa
gatggtttct gccaagaagg tacccgcgat cgctctgtcc gccggggtca gtttcgccct
cctgcgcttc ctgtgcctgg cggtttggtg agttctctgg ggtgcgagga ggtgggcaa

Open reading frames for this region are... They are all wrong!

The same segment of human chr 4, with the exon in upper case:

cgcctcccct ggccccgtgc acacacacgc ccaccgcggc tcgggctggc tgagcgcggg
cgagtgtgag cgcgagtgtg cgcacgccgc gggagcctct CTGCCCTCTC CTCGCACCCT
GCTCAGGGCA TCTGAAGAGC CTGGAAACGT GAACAGGCTT GAAGTATGGC ATGTTGCAAA
GATGGTTTCT GCCAAGAAGg tacccgcgat cgctctgtcc gccggggtca gtttcgccct
cctgcgcttc ctgtgcctgg cggtttggtg agttctctgg ggtgcgagga ggtgggcaa

It codes for just 9 amino acids (MLQRWFLPR) and is the first exon of a GABA receptor isoform.
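The 9-residue peptide can be checked directly from the upper-case exon. The sketch below builds the standard genetic code table and translates from the ATG that matches the peptide given in the text; that start position is taken from the text, not predicted, and the variable names are made up for illustration.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The upper-case exon from the chr 4 segment above.
EXON = (
    "CTGCCCTCTCCTCGCACCCTGCTCAGGGCATCTGAAGAGCCTGGAAACGTGAACAGGCTT"
    "GAAGTATGGCATGTTGCAAAGATGGTTTCTGCCAAGAAG"
)


def translate(seq: str) -> str:
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Assumption: translation starts at the ATG that begins ATGTTG, chosen
# because it reproduces the peptide quoted in the text.
cds_start = EXON.find("ATGTTG")
peptide = translate(EXON[cds_start:])
print(peptide)  # MLQRWFLPR
```

The exon ends two bases into the next codon (AG), so the tenth residue is only completed by the following exon. Note also that the intron begins with gt, the consensus splice donor.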

Nor do exons have to be coding. These are splice variants of the human aldolase A gene. For just the first four variants there are 14, 14, 16 and 12 exons of which the first 6, 6, 7 and 4 are non-coding. How do you find these?

Currently the pipeline incorporates RNA-Seq data where available (as of 2013).

Both Splign and ProSplign are global alignment tools that enable alignment of transcripts and proteins with high resolution of splice sites. The computational cost of these algorithms requires that approximate placements of the query sequences (transcripts or proteins) on the target (genome) be first identified with a local alignment tool, such as BLAST. Since a query often aligns at multiple locations, the BLAST hits are analyzed by the Compart algorithm to identify compartments prior to running Splign or ProSplign. http://www.ncbi.nlm.nih.gov/books/nbk169439/

The Compart algorithm sorts through the BLAST hits to find a compatible set that makes sense relative to the query sequence and to the genome. For example, the hits cannot be in the order ABC in the query and then in the order CAB in the genome. The resulting compartments of hits are then given to the global alignment tools Splign and ProSplign.
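The ordering constraint can be illustrated with a small sketch. This is not the Compart algorithm itself; it only shows the idea of keeping the largest set of hits whose order in the query agrees with their order in the genome, using a simple quadratic longest-increasing-subsequence search. The `Hit` type and `colinear_hits` function are hypothetical, and real hits would also carry end coordinates, strand and scores.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    name: str
    query_start: int
    genome_start: int


def colinear_hits(hits: List[Hit]) -> List[Hit]:
    """Return a largest subset of hits ordered the same way in query and genome."""
    ordered = sorted(hits, key=lambda h: h.query_start)
    n = len(ordered)
    if n == 0:
        return []
    best = [1] * n    # best[i]: length of the best chain ending at hit i
    prev = [-1] * n   # back-pointers for recovering the chain
    for i in range(n):
        for j in range(i):
            if ordered[j].genome_start < ordered[i].genome_start and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# Query order A, B, C; genome order C, A, B (the incompatible case in the text).
hits = [Hit("A", 0, 2000), Hit("B", 100, 3000), Hit("C", 200, 1000)]
print([h.name for h in colinear_hits(hits)])  # ['A', 'B']
```

Hit C is dropped here because it cannot be placed consistently with A and B; in practice it might seed a separate compartment elsewhere in the genome.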

The global alignment tools Splign and ProSplign use a modified Needleman-Wunsch algorithm whose score distinguishes between alignment (or evolutionary) gaps and intronic gaps. Thus the score is constructed from the number of matches, the number of mismatches, and normal gap penalties, plus gap penalties for intronic gaps. The intron gap opening cost differs between the most frequent consensus splice sites (GT/AG), less frequent consensus splice sites (GC/AG, AT/AC), and non-consensus splice sites. These alignments are then given to a chainer algorithm.
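As a rough illustration of such a score, the sketch below evaluates an alignment that is already given, rather than running the dynamic programming that finds it. All numeric values are made-up placeholders, not the parameters used by Splign or ProSplign, and `splice_class` and `score_spliced_alignment` are hypothetical names.

```python
# Hypothetical scoring parameters, for illustration only.
MATCH, MISMATCH, GAP = 1, -1, -2
INTRON_OPEN = {"consensus": -8, "minor": -12, "other": -20}


def splice_class(intron: str) -> str:
    """Classify an intron by its first two (donor) and last two (acceptor) bases."""
    donor, acceptor = intron[:2].upper(), intron[-2:].upper()
    if (donor, acceptor) == ("GT", "AG"):
        return "consensus"
    if (donor, acceptor) in {("GC", "AG"), ("AT", "AC")}:
        return "minor"
    return "other"


def score_spliced_alignment(columns, introns) -> int:
    """Score a given spliced alignment.

    columns: (query_base, genome_base) pairs for aligned exon columns,
             with '-' marking an ordinary alignment gap.
    introns: genomic intron sequences skipped by the alignment.
    """
    score = 0
    for q, g in columns:
        if q == "-" or g == "-":
            score += GAP
        elif q == g:
            score += MATCH
        else:
            score += MISMATCH
    for intron in introns:
        score += INTRON_OPEN[splice_class(intron)]
    return score


columns = [("A", "A"), ("C", "C"), ("G", "T"), ("-", "A")]
print(score_spliced_alignment(columns, ["GTAAGTTTTCAG"]))  # GT...AG intron
print(score_spliced_alignment(columns, ["CTAAGTTTTCAA"]))  # non-consensus intron
```

With these placeholder values the same exon columns score worse when the intron lacks consensus splice sites, which is the behavior the text describes.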

The chainer combines RNA-Seq alignments together in order to reduce the complexity of the problem. The chainer also takes into account intron structure and coding potential.

The Gnomon algorithm takes the results from chainer and tries to predict coding sequence and possible isoforms. Possible coding regions are predicted and scored using a 3-periodic fifth-order Hidden Markov Model (HMM) for coding propensity and Weight Matrix Method (WMM) models for splice signals and translation initiation and termination signals.
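The "3-periodic" part of the coding model means that the probability of a base depends on the preceding bases and on the codon position. The sketch below shows only that emission idea as a plain periodic Markov chain; it is not Gnomon's HMM, has no hidden states, and uses a uniform background and add-one smoothing as simplifying assumptions. The class name `PeriodicMarkov` is made up, and the toy example uses order 2 where the text describes order 5.

```python
import math
from collections import defaultdict


class PeriodicMarkov:
    """Markov chain over bases with separate statistics per codon position."""

    def __init__(self, order: int = 5):
        self.order = order
        # counts[phase][context][base], with phase = position in codon (0, 1, 2)
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]

    def train(self, coding_seqs) -> None:
        """Count bases in in-frame coding sequences (first base = codon position 0)."""
        k = self.order
        for seq in coding_seqs:
            seq = seq.upper()
            for i in range(k, len(seq)):
                self.counts[i % 3][seq[i - k:i]][seq[i]] += 1

    def log_odds(self, seq: str, frame: int = 0) -> float:
        """Log-odds of seq being coding in the given frame vs. uniform background."""
        k = self.order
        seq = seq.upper()
        total = 0.0
        for i in range(k, len(seq)):
            phase = (i - frame) % 3
            ctx = self.counts[phase].get(seq[i - k:i], {})
            n = sum(ctx.values())
            p = (ctx.get(seq[i], 0) + 1) / (n + 4)  # add-one smoothing over 4 bases
            total += math.log(p / 0.25)
        return total


model = PeriodicMarkov(order=2)   # toy order; the text describes fifth order
model.train(["ATGGCC" * 20])      # toy training data, not real genes
query = "ATGGCC" * 3
print(model.log_odds(query, frame=0) > model.log_odds(query, frame=1))  # True
```

Scoring the same sequence in each frame and comparing the results is what lets a model of this kind prefer one reading frame over another.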

The available cDNAs and proteins are used to build the first predictions. These gene models (or candidates) are compared with a broad set of proteins. Good matches continue in the pipeline. Compart finds the positions of the target sequences, Splign and ProSplign build spliced alignments, Chainer combines these into longer models, and Gnomon extends and finalizes the gene model. This then goes into all of the databases.

- Example entries from GRCh38.p7 chr 21

LOCUS       NC_000021 46709983 bp DNA linear CON 06-JUN-2016
DEFINITION  Homo sapiens chromosome 21, GRCh38.p7 Primary Assembly.
ACCESSION   NC_000021 GPC_000001313
VERSION     NC_000021.9 GI:568815577
DBLINK      BioProject: PRJNA168
            Assembly: GCF_000001405.33
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
...
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI
Annotation Status :: Full annotation
Annotation Version :: Homo sapiens Annotation Release 108
Annotation Pipeline :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 7.0
Annotation Method :: Best-placed RefSeq; Gnomon
Features Annotated :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
...

...
ncRNA   complement(join(5011976..5012370,5012556..5012684))
        /ncRNA_class="lncRNA"
        /gene="LOC105379484"
        /product="uncharacterized LOC105379484"
        /note="Derived by automated computational analysis using gene prediction method: Gnomon. Supporting evidence includes similarity to: 1 EST, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 37 samples with support for all annotated introns"
        /transcript_id="XR_951076.1"
        /db_xref="GI:768019556"
        /db_xref="GeneID:105379484"
...
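Feature locations such as the ncRNA location above encode the exon structure directly. A minimal sketch for pulling out exons and intron lengths with a regular expression follows; it assumes only simple locations made of `start..end` ranges, optionally wrapped in `join(...)` and `complement(...)`, and does not handle partial-feature markers or references to other records. The function names are made up for illustration.

```python
import re


def parse_location(location: str):
    """Return (strand, [(start, end), ...]) for a simple feature location string."""
    strand = "-" if location.startswith("complement(") else "+"
    exons = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return strand, exons


def intron_lengths(exons):
    """Lengths of the gaps between consecutive exons (1-based inclusive coordinates)."""
    exons = sorted(exons)
    return [nxt_start - prev_end - 1
            for (_, prev_end), (nxt_start, _) in zip(exons, exons[1:])]


strand, exons = parse_location("complement(join(5011976..5012370,5012556..5012684))")
print(strand, exons)
print([end - start + 1 for start, end in exons])  # exon lengths
print(intron_lengths(exons))                       # intron lengths
```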

...
mRNA    join(5022044..5022693,5025009..5025049,5026280..5026630,
        5027935..5028225,5032053..5032217,5033408..5033443,
        5045419..5046678)
        /gene="LOC102723996"
        /product="ICOS ligand, transcript variant X2"
        /note="Derived by automated computational analysis using gene prediction method: Gnomon. Supporting evidence includes similarity to: 65 ESTs, 5 long SRA reads, 4 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 13 samples with support for all annotated introns"
        /transcript_id="XM_011546078.2"
        /db_xref="GI:1034627737"
        /db_xref="GeneID:102723996"
mRNA    join(5022044..5022693,5025009..5025049,5026280..5026630,
        5027935..5028225,5032053..5032217,5033408..5033443,
        5034582..5036775)
        /gene="LOC102723996"
        /product="ICOS ligand, transcript variant X4"
        /note="Derived by automated computational analysis using gene prediction method: Gnomon. Supporting evidence includes similarity to: 8 mRNAs, 78 ESTs, 108 long SRA reads, 4 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 101 samples with support for all annotated introns"
        /transcript_id="XM_006723900.2"
        /db_xref="GI:768019569"
        /db_xref="GeneID:102723996"
...

...
gene    5553637..5592004
        /gene="LOC102724219"
        /note="uncharacterized LOC102724219; Derived by automated computational analysis using gene prediction method: BestRefSeq,Gnomon."
        /db_xref="GeneID:102724219"
mRNA    join(5553637..5553838,5585823..5585987,5590019..5592004)
        /gene="LOC102724219"
        /product="uncharacterized LOC102724219"
        /note="Derived by automated computational analysis using gene prediction method: BestRefSeq."
        /transcript_id="NM_001322044.1"
        /db_xref="GI:1015130023"
        /db_xref="GeneID:102724219"
...

...
gene    complement(17454789..17454859)
        /gene="TRG-GCC1-5"
        /note="Derived by automated computational analysis using gene prediction method: tRNAscan-SE."
        /db_xref="GeneID:100189284"
        /db_xref="HGNC:HGNC:34850"
tRNA    complement(17454789..17454859)
        /gene="TRG-GCC1-5"
        /product="tRNA-Gly"
        /inference="COORDINATES: profile:tRNAscan-SE:1.23"
        /note="transfer RNA-Gly (GCC) 1-5; tRNA features were annotated by tRNAscan-SE; Derived by automated computational analysis using gene prediction method: tRNAscan-SE."
        /anticodon=(pos:complement(17454825..17454827),aa:Gly,seq:gcc)
        /db_xref="GeneID:100189284"
        /db_xref="HGNC:HGNC:34850"
...

...
gene    19620752..19631774
        /gene="NIPA2P3"
        /note="non imprinted in Prader-Willi/Angelman syndrome 2 pseudogene 3; Derived by automated computational analysis using gene prediction method: Curated Genomic."
        /pseudo
        /db_xref="GeneID:100128057"
        /db_xref="HGNC:HGNC:42043"
...

Not the end

Finally, note that this is not the end. Genomes are constantly being re-annotated as new information is gathered and as algorithms are improved. Hence, record the annotation version if your work uses it.