Gene Finding Genome Annotation


Gene finding is a cornerstone of genomic analysis
- Genome content and organization
- Differential expression analysis
- Epigenomics
- Population biology & evolution
- Medical genomics

Basic Approaches
Computational
- Absolute rules: start and stop codons
- Statistical probabilities:
  - Which codon is a true start?
  - Intron splice junctions
  - Codon usage
Experimental
- Comparison with known genes/proteins (BLAST)
- Expressed sequence tags
- RNAseq data
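The "absolute rules" approach can be illustrated with a minimal open reading frame scan. This is only a sketch: the function name `find_orfs`, the `min_codons` parameter and the choice to scan the forward strand only are assumptions made for illustration, not part of SNAP, AUGUSTUS or MAKER.

```python
# Hypothetical helper: report ATG...stop open reading frames on the forward strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples; start is 0-based, end is exclusive
    and includes the stop codon. min_codons counts codons before the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAACCCGGGTAACC"))
```

A rule like this finds every candidate, true or not, which is why the statistical and experimental approaches listed above are needed to decide which candidates are real genes.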

Computational Gene Prediction
Statistical properties of protein-coding genes differ from those of non-coding sequence
- Long ORFs: in random sequence, stop codons should occur on average 3 times in every 64 codons (~1/21)
- Codon bias (human):

Codon  Amino acid  %
ACA    Thr         28
ACC    Thr         36
ACG    Thr         12
ACU    Thr         24
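The two statistics on this slide can be worked through in a few lines. The sketch below assumes uniform, independent bases for the random-sequence model, and the function names `prob_open` and `relative_usage` are hypothetical; the threonine numbers are the ones from the table above.

```python
def prob_open(n_codons, p_stop=3 / 64):
    """Probability that n_codons random codons contain no stop codon,
    assuming each codon is a stop with probability p_stop (3 of 64 codons)."""
    return (1 - p_stop) ** n_codons

def relative_usage(counts):
    """Convert codon counts for one amino acid into percentages."""
    total = sum(counts.values())
    return {codon: 100 * n / total for codon, n in counts.items()}

# Mean spacing between stop codons in random sequence: 64 / 3 codons.
print(round(1 / (3 / 64), 1))

# Chance that a 100-codon stretch of random sequence stays open.
print(prob_open(100))

# Threonine codon usage, using the percentages from the slide as counts.
print(relative_usage({"ACA": 28, "ACC": 36, "ACG": 12, "ACU": 24}))
```

Under this model an open stretch of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as a signal of coding sequence.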

Gene features tend to occur in specific sequence contexts
[Figure panels: a. Splice acceptor sites; b. Splice donor sites; c. Translation starts; d. Splice acceptor sites for A. thaliana genes predicted using C. elegans parameters. From Korf (2004).]
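One common way to capture a sequence context is a position weight matrix (PWM). The sketch below is illustrative only: the aligned "donor site" sequences are made up, the uniform background of 0.25 and the pseudocount are assumptions, and the helper names `build_pwm` and `score` are hypothetical. It is not the model used in Korf (2004).

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores for a window of the same length."""
    return sum(col[base] for col, base in zip(pwm, window))

# Made-up example sites, for illustration only (not real training data).
example_sites = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGT"]
pwm = build_pwm(example_sites)
print(score(pwm, "AGGTAAGT"), score(pwm, "CCCCCCCC"))
```

A window that resembles the training sites scores above zero, while an unrelated window scores below zero; a gene finder would use many more training sites taken from the genome of interest.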

Many of the ab initio gene finders use Hidden Markov Models (HMMs)
HMMs contain parameters defining the probabilities that specific gene features occur in different sequence contexts
They can be used to predict:
- Transcription start sites
- Intron splice junctions
- Poly-A addition sites
- Promoters
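To give a feel for how an HMM labels a sequence, here is a toy two-state model decoded with the Viterbi algorithm. Everything in it is an assumption for illustration: the two states, all probabilities, and the idea that "coding" sequence is simply GC-rich. Real gene finders such as SNAP and AUGUSTUS use far richer models trained on real genes.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are invented for this sketch.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # Log-probabilities avoid numerical underflow on long sequences.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

print(viterbi("ATATATATGCGCGCGCGC"))
```

With these made-up parameters the AT-rich prefix is labelled noncoding and the GC-rich suffix coding; the transition probabilities discourage switching state at every base.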

Standard practice is to perform gene predictions with multiple programs
We will run two programs in today's exercise:
- SNAP: Korf (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59
- AUGUSTUS: Stanke et al. (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Research 32:W309

Gene validation
Independent evidence that our candidate gene is, in fact, a gene:
- Conserved protein motifs
- BLAST matches
- Expressed sequence tags
- RNAseq reads

For today's exercise
We will use the following evidence:
- Genes/proteins already identified in M. oryzae (many being well supported by BLAST, EST and other transcriptomic data)
- Splice junction information from the RNAseq mapping that we performed yesterday

Information overload!!!
Results from:
- SNAP
- AUGUSTUS
- Magnaporthe genes
- Magnaporthe proteins
- RNAseq mapping data
How are we going to make sense of these highly redundant datasets?

Enter MAKER
- Synthesizes multiple forms of gene prediction data (predictions and evidence)
- Outputs a single, consistent set of genes and gene models, including quality values
- Uses a standard gene annotation format: GFF3 (related to the GTF format used yesterday)
- Results can be imported into a genome browser

GFF3 format
Column:  1      2       3     4      5    6      7       8      9
Field:   seqid  source  type  start  end  score  strand  phase  attributes

##gff-version 3
##date Wed Jul 18 22:38:03 2012
##source gbrowse gbgff gff3 dumper
##sequence-region contig00001:11699..16698
contig00001 maker gene 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164;ID=215076
contig00001 maker mRNA 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164-mRNA-1;Parent=215076;ID=215077;_QI=0%7C0%7C0%7C0%7C1%7C1%7C2%7C0%7C1128;_AED=1.00
contig00001 maker exon 10234 13073 114.575 + . Parent=215077;ID=215078
contig00001 maker exon 13152 13698 67.862 + . Parent=215077;ID=215079
contig00001 maker CDS 10234 13073 . + 0 Parent=215077;ID=215080
contig00001 maker CDS 13152 13698 . + 1 Parent=215077;ID=215081
contig00001 maker mRNA 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164-mRNA-1;ID=215077;_QI=0%7C0%7C0%7C0%7C1%7C1%7C2%7C0%7C1128;_AED=1.00
contig00001 maker exon 10234 13073 114.575 + . Parent=215077;ID=215078
contig00001 maker exon 13152 13698 67.862 + . Parent=215077;ID=215079
contig00001 maker CDS 10234 13073 . + 0 Parent=215077;ID=215080
contig00001 maker CDS 13152 13698 . + 1 Parent=215077;ID=215081
contig00001 maker gene 14925 15925 . - . Name=maker-contig00001-snap-gene-0.100;ID=215008
contig00001 maker mRNA 14925 15925 . - . Name=maker-contig00001-snap-gene-0.100-mRNA-1;Parent=215008;ID=215009;_QI=0%7C0.5%7C0.33%7C1%7C0%7C0.33%7C3%7C0%7C285;_AED=0.06
contig00001 maker exon 14925 15172 62.114 - . Parent=215009;ID=215010
contig00001 maker exon 15201 15445 49.667 - . Parent=215009;ID=215011
contig00001 maker exon 15561 15925 85.814 - . Parent=215009;ID=215012
contig00001 maker CDS 14925 15172 . - 2 Parent=215009;ID=215013
contig00001 maker CDS 15201 15445 . - 1 Parent=215009;ID=215014
contig00001 maker CDS 15561 15925 . - 0 Parent=215009;ID=215015
contig00001 maker mRNA 14925 15925 . - . Name=maker-contig00001-snap-gene-0.100-mRNA-1;ID=215009;_QI=0%7C0.5%7C0.33%7C1%7C0%7C0.33%7C3%7C0%7C285;_AED=0.06
contig00001 maker exon 14925 15172 62.114 - . Parent=215009;ID=215010
contig00001 maker exon 15201 15445 49.667 - . Parent=215009;ID=215011
contig00001 maker exon 15561 15925 85.814 - . Parent=215009;ID=215012
contig00001 maker CDS 14925 15172 . - 2 Parent=215009;ID=215013
contig00001 maker CDS 15201 15445 . - 1 Parent=215009;ID=215014
contig00001 maker CDS 15561 15925 . - 0 Parent=215009;ID=215015
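The nine-column layout can be read programmatically. The sketch below is a minimal parser, assuming tab-separated columns as in the GFF3 specification; the function name `parse_gff3_line` is hypothetical and the example record is one of the CDS lines from the listing above. Percent-encoded attribute values (such as `%7C` for `|` in the `_QI` field) are decoded with the standard library.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for
    blank lines and for directive/comment lines starting with '#'."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields[:8]))
    record["start"] = int(record["start"])  # coordinates are 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for pair in fields[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # e.g. %7C becomes |
    record["attributes"] = attributes
    return record

example = "contig00001\tmaker\tCDS\t13152\t13698\t.\t+\t1\tParent=215077;ID=215081"
cds = parse_gff3_line(example)
print(cds["type"], cds["end"] - cds["start"] + 1, cds["attributes"]["Parent"])
```

The `Parent` attribute is what links each exon and CDS back to its mRNA, and each mRNA back to its gene, so a parser like this is the first step toward rebuilding gene models from the flat file.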

Gene finding is an iterative process
[Diagram labels: HMM; SNAP; AUGUSTUS; gene models; MAKER; BLAST matches; ESTs]

Genome Browsers

Genome Browser
- Combines a genome database with interactive web pages
- Allows the user to retrieve and manipulate database records through a graphical user interface (GUI)
- Different types of information are displayed in an intuitive fashion in user-configurable tracks

GFF3 files are hard to interpret
[The slide repeats the raw GFF3 listing shown under "GFF3 format" above.]

MAKER genes & RNAseq reads in GBrowse

Genome Browsers for repeat definition
Shown is a track displaying the results of a genome BLASTed against itself

A plethora of genome browsers
- Annmap
- Apollo Genome Annotation Curation Tool
- Argo Genome Browser
- Avadis NGS
- BugView
- Celera Genome Browser
- Dalliance
- DiProGB
- DNAnexus
- Ensembl
- Gaggle Genome Browser
- GBrowse
- The Genomic HyperBrowser
- Genostar
- GenoBrowser
- GenPlay
- Integrated Genome Browser (IGB)
- Integrative Genomics Viewer (IGV)
- Integrated Microbial Genomes (IMG)
- JBrowse (a JavaScript browser)
- MGV - Microbial Genome Viewer
- MochiView Genome Browser
- NextBio Genome Browser
- Pathway Tools Genome Browser
- Savant Genome Browser
- SEED viewer
- UCSC Genome Bioinformatics Genome Browser
- Viral Genome Organizer (VGO)
- VISTA genome browser

Today's activity
- Learn how to use the Integrated Genome Browser
- Populate the browser with data:
  - A Magnaporthe sequence contig
  - MAKER annotations
  - Mapped RNAseq reads
  - RNAseq read heatmaps
- Explore the browser to get an idea of how it works and how the tracks can be manipulated/activated/deactivated