Gene Finding and Genome Annotation
Gene finding is a cornerstone of genomic analysis
  - Genome content and organization
  - Differential expression analysis
  - Epigenomics
  - Population biology & evolution
  - Medical genomics
Basic Approaches
  Computational
    - Absolute rules: start and stop codons
    - Statistical probabilities: which codon is a true start?
        - Introns
        - Splice junctions
        - Codon usage
  Experimental
    - Comparison with known genes/proteins (BLAST)
    - Expressed sequence tags
    - RNAseq data
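Computational Gene Prediction
  Statistical properties of protein-coding genes differ from those of non-coding sequence
  - Long ORFs: in random sequence with equal base frequencies, stop codons should occur on average 3 times in every 64 codons (~1 in 21), so long stop-free stretches are unlikely to arise by chance
  - Codon bias (human), shown here for the threonine codons:
      Codon  Amino acid  %
      ACA    Thr         28
      ACC    Thr         36
      ACG    Thr         12
      ACU    Thr         24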
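The long-ORF argument can be sketched in a few lines of Python. This is only an illustration, assuming equal base frequencies and scanning a single forward reading frame; the function names and the example sequence are made up for this sketch and do not come from SNAP or AUGUSTUS.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def expected_stop_fraction():
    """Fraction of the 64 possible codons that are stops (equal base frequencies assumed)."""
    return len(STOP_CODONS) / 64

def longest_orf(seq, frame=0):
    """Return (start, end) of the longest stop-free run of codons in one forward frame.

    Coordinates are 0-based and end-exclusive. The run is not required to start with ATG.
    """
    seq = seq.upper()
    best = (frame, frame)
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            # A stop codon closes the current run; keep it if it is the longest so far.
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 3
    # The final run may reach the end of the sequence without meeting a stop.
    last = frame + ((len(seq) - frame) // 3) * 3
    if last - run_start > best[1] - best[0]:
        best = (run_start, last)
    return best

if __name__ == "__main__":
    print(expected_stop_fraction())                  # 0.046875, i.e. about 1 in 21
    print(longest_orf("ATGAAATAGATGCCCGGGTTTTAA"))   # (9, 21)
```

A real gene finder scans all six frames and combines ORF length with other signals such as codon usage; this sketch covers only the stop-codon part.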
Gene features tend to occur in specific sequence contexts
  [Figure panels]
  a. Splice acceptor sites
  b. Splice donor sites
  c. Translation starts
  d. Splice acceptor sites for A. thaliana genes predicted using C. elegans parameters, from Korf (2004)
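One common way to capture a sequence context is a position weight matrix built from aligned example sites. The sketch below is a minimal illustration, not the method used in Korf (2004): the four "donor site" strings are invented toy data, and the pseudocount and uniform background are assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases do not give log(0).
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pwm

def score(pwm, window):
    """Sum the log-odds of each base in a window the same length as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))

if __name__ == "__main__":
    # Hypothetical aligned donor-site examples (toy data, not from a real genome).
    toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
    pwm = build_pwm(toy_sites)
    print(score(pwm, "AGGTAAGT") > score(pwm, "AGCCAAGT"))  # True
```

Windows that resemble the training sites score higher than windows that do not, which is the sense in which a feature "occurs in a specific sequence context".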
Many of the ab initio gene finders use Hidden Markov Models (HMMs)
  HMMs
  - Contain parameters defining the probabilities that specific gene features occur in different sequence contexts
  - Can be used to predict:
      - Transcription start sites
      - Intron splice junctions
      - Poly-A addition sites
      - Promoters
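To make the HMM idea concrete, here is a toy two-state model decoded with the Viterbi algorithm. It is a sketch only: the states, and every start, transition and emission probability, are invented for illustration and are far simpler than the models inside SNAP or AUGUSTUS.

```python
import math

# All parameters below are hypothetical values chosen for illustration.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string (log space)."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        column, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            column[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    path = viterbi("GCGCGCGCGCATATATATAT")
    print(path[0], path[-1])  # coding noncoding
```

With these toy parameters the GC-rich half is labelled "coding" and the AT-rich half "noncoding". Real gene finders use many more states (exons, introns, splice sites, and so on) with parameters trained on known genes.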
Standard practice is to perform gene predictions with multiple programs
  We will run two programs in today's exercise:
  - SNAP: Korf (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59
  - AUGUSTUS: Stanke et al. (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Research 32:W309
Gene validation
  Independent evidence that our candidate gene is, in fact, a gene:
  - Conserved protein motifs
  - BLAST matches
  - Expressed sequence tags
  - RNAseq reads
For today's exercise
  We will use the following evidence:
  - Genes/proteins already identified in M. oryzae (many are well supported by BLAST, EST and other transcriptomic data)
  - Splice junction information from the RNAseq mapping that we performed yesterday
Information overload!
  Results from:
  - SNAP
  - AUGUSTUS
  - Magnaporthe genes
  - Magnaporthe proteins
  - RNAseq mapping data
  How are we going to make sense of these highly redundant datasets?
Enter MAKER
  - Synthesizes multiple forms of gene prediction data (predictions and evidence)
  - Outputs a single, consistent set of genes and gene models, including quality values
  - Uses a standard gene annotation format, GFF3 (related to the GTF format used yesterday)
  - Results can be imported into a genome browser
GFF3 format
  Nine columns (tab-separated in the file):
    1 seqid  2 source  3 type  4 start  5 end  6 score  7 strand  8 phase  9 attributes
  Example:
  ##gff-version 3
  ##date Wed Jul 18 22:38:03 2012
  ##source gbrowse gbgff gff3 dumper
  ##sequence-region contig00001:11699..16698
  contig00001 maker gene 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164;ID=215076
  contig00001 maker mRNA 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164-mRNA-1;Parent=215076;ID=215077;_QI=0%7C0%7C0%7C0%7C1%7C1%7C2%7C0%7C1128;_AED=1.00
  contig00001 maker exon 10234 13073 114.575 + . Parent=215077;ID=215078
  contig00001 maker exon 13152 13698 67.862 + . Parent=215077;ID=215079
  contig00001 maker CDS 10234 13073 . + 0 Parent=215077;ID=215080
  contig00001 maker CDS 13152 13698 . + 1 Parent=215077;ID=215081
  contig00001 maker mRNA 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164-mRNA-1;ID=215077;_QI=0%7C0%7C0%7C0%7C1%7C1%7C2%7C0%7C1128;_AED=1.00
  contig00001 maker exon 10234 13073 114.575 + . Parent=215077;ID=215078
  contig00001 maker exon 13152 13698 67.862 + . Parent=215077;ID=215079
  contig00001 maker CDS 10234 13073 . + 0 Parent=215077;ID=215080
  contig00001 maker CDS 13152 13698 . + 1 Parent=215077;ID=215081
  contig00001 maker gene 14925 15925 . - . Name=maker-contig00001-snap-gene-0.100;ID=215008
  contig00001 maker mRNA 14925 15925 . - . Name=maker-contig00001-snap-gene-0.100-mRNA-1;Parent=215008;ID=215009;_QI=0%7C0.5%7C0.33%7C1%7C0%7C0.33%7C3%7C0%7C285;_AED=0.06
  contig00001 maker exon 14925 15172 62.114 - . Parent=215009;ID=215010
  contig00001 maker exon 15201 15445 49.667 - . Parent=215009;ID=215011
  contig00001 maker exon 15561 15925 85.814 - . Parent=215009;ID=215012
  contig00001 maker CDS 14925 15172 . - 2 Parent=215009;ID=215013
  contig00001 maker CDS 15201 15445 . - 1 Parent=215009;ID=215014
  contig00001 maker CDS 15561 15925 . - 0 Parent=215009;ID=215015
  contig00001 maker mRNA 14925 15925 . - . Name=maker-contig00001-snap-gene-0.100-mRNA-1;ID=215009;_QI=0%7C0.5%7C0.33%7C1%7C0%7C0.33%7C3%7C0%7C285;_AED=0.06
  contig00001 maker exon 14925 15172 62.114 - . Parent=215009;ID=215010
  contig00001 maker exon 15201 15445 49.667 - . Parent=215009;ID=215011
  contig00001 maker exon 15561 15925 85.814 - . Parent=215009;ID=215012
  contig00001 maker CDS 14925 15172 . - 2 Parent=215009;ID=215013
  contig00001 maker CDS 15201 15445 . - 1 Parent=215009;ID=215014
  contig00001 maker CDS 15561 15925 . - 0 Parent=215009;ID=215015
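The nine-column layout can be read with a few lines of Python. This is a minimal sketch rather than a full GFF3 parser: it assumes tab-delimited feature lines, single-valued attributes, and percent-encoding in attribute values (for example %7C for "|" in MAKER's _QI tag). The function name parse_gff3_line is made up for this example.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comment or blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # "." means "no value" in the score and phase columns.
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["phase"] = None if record["phase"] == "." else int(record["phase"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # decode %7C and similar escapes
    record["attributes"] = attributes
    return record

if __name__ == "__main__":
    line = ("contig00001\tmaker\tmRNA\t10234\t13698\t.\t+\t.\t"
            "Parent=215076;ID=215077;_QI=0%7C0%7C0%7C0%7C1%7C1%7C2%7C0%7C1128;_AED=1.00")
    rec = parse_gff3_line(line)
    print(rec["type"], rec["start"], rec["end"], rec["attributes"]["_QI"])
    # mRNA 10234 13698 0|0|0|0|1|1|2|0|1128
```

The Parent and ID attributes are what link exon and CDS records back to their mRNA, and the mRNA back to its gene.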
Gene finding is an iterative process
  [Diagram: HMM, SNAP, AUGUSTUS, gene models, MAKER, BLAST matches, ESTs]
Genome Browsers
Genome Browser
  - Combines a genome database with interactive web pages
  - Allows the user to retrieve and manipulate database records through a graphical user interface (GUI)
  - Different types of information are displayed intuitively in user-configurable tracks
GFF3 files are hard to interpret
  The raw text is the same MAKER output for contig00001 shown on the GFF3 format slide, for example:
  ##gff-version 3
  ##date Wed Jul 18 22:38:03 2012
  ##source gbrowse gbgff gff3 dumper
  ##sequence-region contig00001:11699..16698
  contig00001 maker gene 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164;ID=215076
  contig00001 maker mRNA 10234 13698 . + . Name=snap_masked-contig00001-abinitgene-0.164-mRNA-1;Parent=215076;ID=215077;_QI=0%7C0%7C0%7C0%7C1%7C1%7C2%7C0%7C1128;_AED=1.00
  contig00001 maker exon 10234 13073 114.575 + . Parent=215077;ID=215078
  contig00001 maker exon 13152 13698 67.862 + . Parent=215077;ID=215079
  contig00001 maker CDS 10234 13073 . + 0 Parent=215077;ID=215080
  contig00001 maker CDS 13152 13698 . + 1 Parent=215077;ID=215081
  (the remaining records for contig00001 continue as on the GFF3 format slide)
MAKER genes & RNAseq reads in GBrowse
Genome Browsers for repeat definition
  Shown is a track displaying the results of a genome BLASTed against itself
A plethora of genome browsers
  - Annmap
  - Apollo Genome Annotation Curation Tool
  - Argo Genome Browser
  - Avadis NGS
  - BugView
  - Celera Genome Browser
  - Dalliance
  - DiProGB
  - DNAnexus
  - Ensembl
  - Gaggle Genome Browser
  - GBrowse
  - The Genomic HyperBrowser
  - Genostar GenoBrowser
  - GenPlay
  - Integrated Genome Browser (IGB)
  - Integrative Genomics Viewer (IGV)
  - Integrated Microbial Genomes (IMG)
  - JBrowse (a JavaScript browser)
  - MGV - Microbial Genome Viewer
  - MochiView Genome Browser
  - NextBio Genome Browser
  - Pathway Tools Genome Browser
  - Savant Genome Browser
  - SEED viewer
  - UCSC Genome Bioinformatics Genome Browser
  - Viral Genome Organizer (VGO)
  - VISTA genome browser
Today's activity
  - Learn how to use the Integrated Genome Browser
  - Populate the browser with data:
      - A Magnaporthe sequence contig
      - MAKER annotations
      - Mapped RNAseq reads
      - RNAseq read heatmaps
  - Explore the browser to see how it works and how the tracks can be manipulated, activated and deactivated