Gene Prediction Background & Strategy Faction 2 February 22, 2017


1 Gene Prediction Background & Strategy Faction 2 February 22, 2017 Group Members: Michelle Kim Khushbu Patel Krithika Xinrui Zhou Chen Lin Sujun Zhao Hannah Hatchell Rohini Mopuri Jack Cartee

2 Introduction. Gene prediction is the second step in a computational genomics pipeline; it follows genome assembly. Ultimate goal: computationally describe all of the genes of a given genome with near 100% accuracy. Open Reading Frame (ORF): a segment of DNA that begins with a start codon and ends with an in-frame stop codon. Every gene is an ORF, but not every ORF is a gene. Our goal: decide which ORFs are true genes.
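The ORF definition above can be made concrete with a small scan. This is a minimal sketch, not any tool's method: it assumes ATG as the only start codon, the three standard stop codons, and the forward strand only; the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
# Simplified forward-strand ORF scan (illustrative assumptions: ATG is the
# only start codon, reverse strand is ignored).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) index pairs; end is exclusive and
    includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

A scan like this returns every ORF, which is why a scoring model is still needed to separate true genes from chance ORFs.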

3 Prokaryotic vs Eukaryotic Gene Prediction. Simpler in prokaryotes than in eukaryotes: prokaryotic genomes are gene-rich, typically with around 90% of the genome made up of coding regions, and they have no introns. Prokaryotic genes are usually arranged into operons with a known operator, promoter, start site, coding region, and then a stop codon. One of the biggest problems in prokaryotic gene prediction is determining which of two or more overlapping ORFs are true genes, that is, separating true positives from false positives.

4 Ab initio vs. Homology Based Gene Prediction. Ab initio gene prediction: a method for predicting genes based on intrinsic factors. The structure of the gene, such as start and stop sites, is used as a prediction signal. Looks for promoter sequences, ribosome binding sites (the Shine-Dalgarno sequence) and other functional signals. Homology gene prediction: a method for predicting genes based on sequence similarity between the sequence in question and a database.

5 Prokaryotic Gene Prediction Pipeline. A typical gene prediction pipeline has two major goals. Predict the coding regions of a genome: protein-coding genes. Genes encoding functional proteins, that is, genes that are fully transcribed into mRNA and translated into proteins, make up the set of genes predicted in the coding regions.

6 Prokaryotic Gene Prediction Pipeline. Predict the non-coding regions of a genome: non-coding RNAs are transcribed but not translated into proteins. rRNAs: ribosomal RNAs used during translation. tRNAs: transfer RNAs used during translation. siRNAs and other regulatory RNAs.

7 Gene prediction pipeline (diagram). Ab-initio Prediction: GeneMark, Glimmer, Prodigal, FGenesB. RNA Prediction: Rfam, rrnascan, RNAmmer, RNAcon. EUGENE-PP. BLAST Validation.

8 Markov Models. A stochastic model in which the future state depends on the current state but is independent of any previous state (memorylessness).

9 Markov Chain. Simplest form of Markov model. A system that moves sequentially from one state to another. The state is directly visible to the observer. The numbers in the diagram represent the probability of transition from one state to another. Can be described by a transition matrix.
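As an illustration of a transition matrix, the sketch below scores a DNA sequence under a first-order Markov chain. The transition and initial probabilities are made-up numbers chosen only so that each row sums to 1, and `sequence_log_prob` is a hypothetical helper name.

```python
import math

# Illustrative first-order transition probabilities over DNA bases
# (made-up values; TRANSITIONS[prev][cur] = P(cur | prev)).
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sequence_log_prob(seq):
    """Log-probability of seq: initial state, then one transition per base."""
    logp = math.log(INITIAL[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(TRANSITIONS[prev][cur])
    return logp
```

Working in log space avoids numerical underflow when the sequence is long.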

10 Hidden Markov Models (HMMs). Commonly used by gene prediction programs. The system being modeled is a Markov process with hidden states. The hidden state determines the emission probabilities. The most likely hidden states can be predicted from the observed sequence using the model parameters. Model parameters: number of states, transitions, learning material and time.

11 Hidden Markov Models (HMMs)
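To show how hidden states can be recovered from an observed sequence, here is a minimal Viterbi sketch on a toy two-state HMM. The states, the probabilities and the function name `viterbi` are all assumptions made for illustration; real gene finders use far richer models.

```python
import math

# Toy HMM (all numbers are illustrative assumptions).
STATES = ("coding", "noncoding")
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT_P = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(obs):
    """Return the most likely hidden-state path for the observed bases."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START_P[s]) + math.log(EMIT_P[s][obs[0]])
             for s in STATES}
    backptrs = []
    for symbol in obs[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + math.log(TRANS_P[p][s]))
            ptr[s] = best_prev
            new_score[s] = (score[best_prev]
                            + math.log(TRANS_P[best_prev][s])
                            + math.log(EMIT_P[s][symbol]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With these toy numbers, a GC-rich stretch is labeled coding and an AT-rich stretch noncoding, with the switch placed where the emission evidence outweighs the transition penalty.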

12 Interpolated Markov Models (IMMs). Ideally, the highest-order Markov model would be used for a better prediction. However, moving up to higher-order models, the number of probabilities that must be estimated increases exponentially, which requires a larger training set. For a DNA sequence, 4^(k+1) probabilities need to be learned in a kth-order Markov model. How many previous bases should we consider in order to predict the next one?
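The exponential growth is easy to check numerically. The helper below is a hypothetical illustration that counts one probability per (context of length k, next base) pair.

```python
def num_probabilities(k, alphabet_size=4):
    """Number of probabilities in a kth-order Markov model: one per
    (k-base context, next base) combination."""
    return alphabet_size ** (k + 1)

# Print the count for orders 0 through 8.
for k in range(9):
    print(k, num_probabilities(k))
```

An order-5 model already needs 4096 probabilities and an order-8 model needs 262144, which is why a fixed high order is hard to train on a single small genome.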

13 Interpolated Markov Models (IMMs). Guess the word that comes after "you": you (are, too, see, ??); can you; say can you; oh say can you. An IMM uses a combination of all the probabilities based on 0, 1, 2, ..., n previous bases:
$$P_{\mathrm{IMM}}(x_i \mid x_{i-n}, \ldots, x_{i-1}) = \lambda_0 P(x_i) + \lambda_1 P(x_i \mid x_{i-1}) + \cdots + \lambda_{n-1} P(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) + \lambda_n P(x_i \mid x_{i-n}, \ldots, x_{i-1})$$
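The interpolation can be sketched in a few lines. This is an illustration under stated assumptions, not Glimmer's implementation: the weights `lambdas` are supplied by hand and assumed to sum to 1 (how a real IMM chooses them is not covered here), add-one smoothing is used for unseen contexts, and all function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def train_counts(seq, max_order):
    """Count every (context, next_base) pair for context lengths 0..max_order."""
    counts = Counter()
    for i in range(len(seq)):
        for k in range(max_order + 1):
            if i - k >= 0:
                counts[(seq[i - k:i], seq[i])] += 1
    return counts

def cond_prob(counts, context, base):
    """P(base | context) with add-one smoothing (an assumption of this sketch)."""
    total = sum(counts[(context, b)] for b in BASES)
    return (counts[(context, base)] + 1) / (total + len(BASES))

def imm_prob(counts, history, base, lambdas):
    """Weighted sum of the order-0..n estimates; lambdas[k] weights order k."""
    prob = 0.0
    for k, weight in enumerate(lambdas):
        context = history[-k:] if k else ""  # last k bases of the history
        prob += weight * cond_prob(counts, context, base)
    return prob
```

For example, after training on `"ACGTACGTACGT"` with `max_order=2`, the call `imm_prob(counts, "AC", "G", [0.2, 0.3, 0.5])` is larger than the same call for `"T"`, because G always follows AC in the training sequence.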

14 Ab-initio Prediction

15 Glimmer. Developed by the Johns Hopkins Center for Computational Biology. Gene prediction tool for finding genes in microbial DNA. First system to use interpolated Markov models to predict coding regions. A conservative estimate, based on experiments with H. influenzae and H. pylori, is that the system finds 97%-98% of all genes (Salzberg et al. 1998).

16 Glimmer1 and Glimmer2. The first iteration of Glimmer was developed and published in 1998. It was tested on finding genes in the microbial genomes of Haemophilus influenzae and Helicobacter pylori and found nearly 98% of the genes. Uses the interpolated Markov model (IMM) to distinguish coding regions from non-coding regions based on a variable context given to each nucleotide. The context changes based on the local composition of the sequence.

17 Glimmer3. Significantly improved compared to its previous versions (99% or higher sensitivity). Improved method for finding translation start sites and improved overall architecture. Dramatically reduces the rate of false-positive predictions while maintaining Glimmer's 99% sensitivity.

18 Glimmer3. Glimmer3 uses an interpolated context model (ICM) approach. It identifies ORFs, then scores each one in reverse, reading from the stop codon back toward the start codon. The probability of each base is conditioned on a context window on its 3' side, and the ORF score is the sum of the log-likelihoods of the bases contained in the ORF. The score is computed incrementally as a cumulative sum at every codon position in a given ORF.
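The incremental, stop-to-start accumulation can be illustrated as follows. This is only a sketch of the cumulative-sum idea: the per-codon log-likelihood values are assumed to be given, picking the candidate start with the highest running total is an assumption of this example, and both function names are hypothetical rather than part of Glimmer3.

```python
def reverse_cumulative_scores(codon_scores):
    """codon_scores are ordered 5' to 3' (start side first, stop side last).
    Returns running totals accumulated from the 3' end, aligned with the
    input positions."""
    totals = [0.0] * len(codon_scores)
    running = 0.0
    for i in range(len(codon_scores) - 1, -1, -1):
        running += codon_scores[i]
        totals[i] = running
    return totals

def best_start(codon_scores, candidate_starts):
    """Pick the candidate start codon index with the highest running total
    (an illustrative selection rule, not the published one)."""
    totals = reverse_cumulative_scores(codon_scores)
    return max(candidate_starts, key=lambda i: totals[i])
```

Because the totals are built once from the stop codon, every candidate start in the same ORF can be compared without rescoring the shared downstream codons.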

19 Glimmer3. Pros: high specificity and accuracy when genes predicted by Glimmer3 are compared with a previously annotated genome; accurate at predicting translation initiation sites (RBS). Cons: can overpredict, producing some false positives.
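A comparison against an annotated genome can be reduced to counting true positives, false positives and false negatives. The sketch below assumes that a prediction matches an annotated gene when both share the same stop position and strand; that matching rule and the name `evaluate` are assumptions for illustration. Note that some gene-finding papers report TP / (TP + FP) under the name specificity, which is called precision here.

```python
def evaluate(predicted, annotated):
    """Both arguments are sets of (stop_position, strand) keys."""
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}
```

Matching on the stop codon alone ignores start-site errors, so start accuracy would need a separate check on the 5' coordinate.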

20 Glimmer3 Comparisons

21 GeneMark. A family of gene prediction programs developed at Georgia Tech (GT) by Borodovsky's lab. Web software for gene finding in prokaryotes, eukaryotes, and viruses. Two methods for gene prediction in bacteria: GeneMark.hmm (heuristic models) and GeneMarkS (self-training program).

22 GeneMark.hmm. Based on the Viterbi algorithm for an HMM with variable duration of hidden states. Incorporates a ribosome binding site (RBS) prediction model to improve prediction of the translation start position.

23 GeneMark

24 GeneMarkS Procedure

25 Comparison

26 Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm). Prokaryotic gene recognition and translation initiation site identification software. Uses dynamic programming with a log-likelihood function to obtain coding scores. Goals: improve gene structure prediction; increase the number of correctly identified genes and translation initiation sites; reduce the overall number of false positives.

27 Prodigal Algorithm: follows the basic principle of KISS (Keep It Simple, Stupid). Basic steps: construct a training set for protein coding; build log-likelihood coding statistics from the training data; sharpen coding scores; add a length factor to the coding score; iterative start training; final dynamic programming.

28 Dynamic Programming Connections in Prodigal The red arrows represent gene connections, and the black arrows represent intergenic connections.
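The idea of chaining scored genes with dynamic programming can be sketched as a weighted interval selection. This is a deliberately simplified stand-in, not Prodigal's algorithm: it ignores strands, intergenic connection scores and permitted overlaps, and simply picks the non-overlapping set of candidate genes with the highest total score. The name `best_gene_set` and the candidate tuples are hypothetical.

```python
from bisect import bisect_right

def best_gene_set(candidates):
    """candidates: list of (start, end, score) with end exclusive.
    Returns (total_score, chosen_genes) for the best non-overlapping set."""
    genes = sorted(candidates, key=lambda g: g[1])  # order by end position
    ends = [g[1] for g in genes]
    # best[i] = best (score, chosen list) using only the first i genes
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(genes):
        # j = number of earlier genes that end at or before this start
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

Each gene is considered once and the lookup of the compatible prefix is a binary search, so the whole pass runs in O(n log n) time for n candidates.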

29 Prodigal Features: fast; easy to use; high accuracy; specificity (false positive rate < 5%); indifferent to GC content.

30 Prodigal Shortcomings: recognition of short genes, atypical genes, translation initiation mechanisms, and genomes. Has difficulty identifying laterally transferred genes, genes in phage regions, proteins with signal peptides, and any other genes that do not match the typical GC frame bias for the organism.

31 FGenesB. FGenesB is an ab initio, Markov chain-based algorithm. It integrates model-based gene prediction with homology-based annotation, accompanied by operon, promoter and terminator prediction in bacterial sequences. FGenesB also finds tRNA and rRNA genes.

32 FgenesB pipeline main steps: compute parameters for gene prediction; sequence-based gene prediction; predict protein-coding genes; predict operons based on distance between predicted genes.

33 FgenesB pipeline main steps (continued): improve operons using the COG database; homology-based annotation; BLAST predicted proteins against NR/KEGG/custom protein database; include mapped ribosomal proteins in annotation.
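The distance-based operon step in the pipeline can be illustrated with a simple grouping rule. This is a sketch under assumptions, not FgenesB's actual procedure: adjacent genes on the same strand are grouped when the gap between them is at most `max_gap` bases, and both the function name and the 100 bp default are hypothetical.

```python
def group_into_operons(genes, max_gap=100):
    """genes: list of (start, end, strand) tuples sorted by start.
    Adjacent same-strand genes separated by at most max_gap bases are
    placed in the same candidate operon."""
    operons = []
    for gene in genes:
        if operons:
            prev = operons[-1][-1]  # last gene of the current operon
            if gene[2] == prev[2] and gene[0] - prev[1] <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])  # start a new candidate operon
    return operons
```

For example, two plus-strand genes 50 bases apart end up in one group, while a following minus-strand gene starts a new group even if it is close by.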

34 Generalized HMM Adding more states or transitions between states to HMM-based models so that multiple genes, partial genes and genes on both strands can be predicted together. Advantages: Avoids the prediction of genes that overlap on the two strands as being two separate genes. Avoids the prediction of shadow exons.

35 Comparison

36 RNA based tools

37 RNA: The Communication Layer. RNA genes do not code for proteins, hence they are called non-coding genes. Both types of RNA (tRNAs and rRNAs) play pivotal roles in protein synthesis, are crucial for defining protein functions, are highly conserved across bacteria, archaea and eukaryotes, and are structurally and functionally conserved. Issue in identifying ncRNA genes: the diversity of such genes and their lack of consensus patterns.

38 RNA based tools. Ab initio based tools: RNAmmer, tRNAscan-SE. Homology based tools: Rfam. Machine-learning based tools: RNAcon.

39 RNAmmer (ab initio). Uses hidden Markov models (HMMs) for annotation of rRNA genes. Predicts 5S/8S, 16S/18S and 23S/28S rRNAs in full genome sequences. Limitation: most rRNA genes in metagenomic sequencing reads are fragmentary and will be overlooked by RNAmmer, which focuses on full-length rRNAs.

40 tRNAscan-SE. tRNAscan 1.3: identifies 97.5% of true tRNA genes; its error rate (0.37 false positives per Mbp) is unsuitable for larger genomes. Pavesi algorithm: searches for linear sequence signals; the combined sensitivities exceed 99%, but the combined false positive rate is about five times that of tRNAscan alone. Covariance models: identify >99.98% of true tRNAs with a false positive rate of <0.2/Mbp; high sensitivity and specificity, but prohibitively CPU intensive. tRNAscan-SE: identifies % of transfer RNA genes in DNA sequence and gives less than one false positive per 15 gigabases. The primary limitation is speed.

41 Rfam. The Rfam database aims to catalogue non-coding RNAs through the use of sequence alignments and statistical profile models known as covariance models. Contains about 2450 RNA families, including non-coding RNAs, mRNA cis-regulatory elements and self-splicing RNAs.

42

43 RNAcon. Distinguishes coding and non-coding transcripts and also classifies ncRNA transcripts. A support vector machine based prediction method that uses a single feature, trinucleotide composition. Graph-property-based features are used with a random forest algorithm to classify ncRNAs into different classes.
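A trinucleotide composition feature can be sketched as a 64-element frequency vector. The details here are assumptions for illustration (overlapping windows, U treated as T, windows with ambiguous symbols skipped) and may differ from what RNAcon computes; `trinucleotide_composition` is a hypothetical name.

```python
from itertools import product

# All 64 trinucleotides in a fixed order, so every vector is comparable.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

def trinucleotide_composition(seq):
    """Return the fraction of overlapping 3-base windows equal to each
    trinucleotide, in TRINUCLEOTIDES order."""
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing N or other symbols
            counts[tri] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
```

A vector like this could then be passed to any classifier; the choice of classifier and its training are outside this sketch.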

44 Integrative Tool

45 EuGene-PP. EuGene is an open integrative gene finder for eukaryotic and prokaryotic genomes. It integrates arbitrary sources of information: RNA-Seq, protein similarities, homologies, statistical information. EuGene-PP (Prokaryote Pipeline) facilitates the application of EuGene to prokaryotic genomes.

46 EuGene PP

47

48 Gene prediction pipeline (diagram, repeated from slide 7). Ab-initio Prediction: GeneMark, Glimmer, Prodigal, FGenesB. RNA Prediction: Rfam, rrnascan, RNAmmer, RNAcon. EUGENE-PP. BLAST Validation.

49 Benchmark

50 References
Salzberg, Steven L., et al. "Microbial gene identification using interpolated Markov models." Nucleic Acids Research 26.2 (1998).
Sallet, Erika, Jérôme Gouzy, and Thomas Schiex. "EuGene-PP: a next generation automated annotation pipeline for prokaryotic genomes." Bioinformatics (2014): btu366.
Sallet, Erika, et al. "Next-generation annotation of prokaryotic genomes with EuGene-P: application to Sinorhizobium meliloti 2011." DNA Research 20.4 (2013).
Lowe, T. M., & Eddy, S. R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25(5).
Lagesen, K., Hallin, P., Rødland, E. A., Stærfeldt, H.-H., Rognes, T., & Ussery, D. W. (2007). RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Research, 35(9).
Panwar B, Arora A, Raghava GP. Prediction and classification of ncRNAs using structural information. BMC Genomics. Feb 13;15:127.
Burge, Sarah W., et al. "Rfam 11.0: 10 Years of RNA Families." Nucleic Acids Research 41.Database issue (2013): D226-D232. PMC. Web. 21 Feb.
Lukashin, Alexander V., and Mark Borodovsky. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research. Oxford University Press, 01 Feb. Web. 17 Feb.
Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research (2001). Web.

51 Thanks for listening.