Gene Prediction Background & Strategy. February 24, 2016

Dylan Burns

1 Gene Prediction Background & Strategy February 24, 2016

2 Overview
- background
- ab initio prediction tools
- RNA prediction tools
- homology-based prediction tools
- combo tools
- final statements

3 Gene Prediction
- the next step after genome assembly
- the process of identifying regions of genomic DNA that encode genes
- one of the most important steps in understanding the genome of a species once it has been fully sequenced

4 Eukaryotic vs. Prokaryotic Gene Prediction
Prokaryotes:
- Overall small genome size
- High density of CDSs
- Low numbers of repeated noncoding sequences
- Regulatory regions are close to the protein coding sequences
- CDS regions are short
Eukaryotes:
- Opposite of prokaryotes

5 Different types of algorithms for gene prediction
1. Ab initio
   a. Identify genes based on intrinsic factors: sequences look different depending on whether they are in coding or non-coding regions
   b. Use statistical models (e.g. Markov models)
2. RNA prediction
   a. Non-coding RNAs (ncRNAs) are transcribed but not translated into protein
   b. ncRNAs have been shown to be major players in prokaryotic cellular processes
   c. Prediction methods include sets of utilities to assess predicted ncRNA genes with respect to their context, annotation, conservation, and secondary structure
3. Homology
   a. Novel sequences are compared to known sequences in a database

6 Prokaryotic Gene structure and characteristics - Central Dogma
A prokaryotic gene can be divided into:
- regulatory elements (promoter and operator)
- structural elements
Polycistronic (operons) and monocistronic genes
Figure by Thomas Shafee [CC BY-SA 4.0]

7 Important features to be considered during gene prediction
- Stop codons: 3 out of 64 codons, so in random sequence a stop codon is expected about once every 21 codons (64/3 ≈ 21.3, roughly 1 in 20)
- GC content differs between coding and noncoding regions
- Reading frames and frameshifts
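
The two numeric features on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the standard stop codons TAA, TAG and TGA and equal base frequencies; the function names are illustrative and do not come from any of the tools discussed here.

```python
from itertools import product

# Standard stop codons (assumption: standard/bacterial genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codon_fraction():
    """Fraction of the 64 possible codons that are stop codons."""
    codons = ["".join(c) for c in product("ACGT", repeat=3)]
    return sum(c in STOP_CODONS for c in codons) / len(codons)


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


fraction = stop_codon_fraction()        # 3/64
codons_between_stops = 1 / fraction     # about 21.3 codons in random sequence
print(f"stop fraction = {fraction:.4f}, one stop per {codons_between_stops:.1f} codons")
print(f"GC content of GGCCAT = {gc_content('GGCCAT'):.2f}")
```

A long stretch without a stop codon in one frame is therefore unlikely by chance, which is why long ORFs are a useful first signal for coding regions.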

8 Markov Models A Markov chain is a discrete random process that undergoes transitions from one state to another on a state space. "Memorylessness" - next state depends only on the current state

9 [Figures: two-state Markov chain; three-state Markov chain]
The numbers represent the probability of transition from one state to another state.

10 Weather prediction Markov Chain example
Given: today is sunny. Find the probability that it will be sunny tomorrow and rainy the day after.
P(Day2 = Sunny, Day3 = Rainy | Day1 = Sunny)
= P(Day2 = Sunny | Day1 = Sunny) * P(Day3 = Rainy | Day2 = Sunny)
= 0.8 * 0.05 = 0.04
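
The same calculation can be sketched in Python. Only the two transitions used on the slide (Sunny to Sunny = 0.8, Sunny to Rainy = 0.05) come from the example; the third state name "Cloudy" and every other number in the table are assumptions added so that each row sums to 1.

```python
# Transition table: TRANSITIONS[today][tomorrow] = probability.
# Only the Sunny->Sunny and Sunny->Rainy values come from the slide;
# "Cloudy" and the remaining rows are hypothetical placeholders.
TRANSITIONS = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy": {"Sunny": 0.2, "Rainy": 0.6, "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3, "Cloudy": 0.5},
}


def path_probability(start, path):
    """Probability of visiting `path` in order, given the chain starts in `start`.

    Memorylessness means each step depends only on the current state,
    so the path probability is a product of one-step transitions.
    """
    probability = 1.0
    current = start
    for nxt in path:
        probability *= TRANSITIONS[current][nxt]
        current = nxt
    return probability


p = path_probability("Sunny", ["Sunny", "Rainy"])
print(f"P(sunny tomorrow, rainy the day after | sunny today) = {p:.2f}")
```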

11 Hidden Markov Models (HMMs)
- Utilized by many gene prediction tools
- There is a hidden state, which must be inferred from emissions
- When using HMMs you must specify a model
  - Hidden states: coding sequence, non-coding sequence
  - Observed emissions: A, C, T, G
  - Number of states
  - Possible transitions
  - Learning material/time
- The most probable sequence of hidden states can be predicted from the model parameters using dynamic programming
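
The dynamic programming step is usually the Viterbi algorithm. Below is a minimal sketch for the two hidden states named on this slide. All start, transition and emission probabilities are made-up illustrative values (GC-rich emissions for "coding", AT-rich for "noncoding"); they are not parameters from any real gene finder.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical model parameters, chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a DNA string."""
    if not seq:
        return []
    # scores[i][s] = best log-probability of any path ending in state s at position i
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][base])
            ptr[s] = best_prev
        scores.append(row)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("ATATATATGCGCGCGCGC"))
```

Log-probabilities are used instead of raw products so that long sequences do not underflow to zero.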

12 HMM for 5' Splice Site Recognition
- States: Begin, Exon, Donor, Intron
- Observations: A, C, G, T

13 IMM Example
Guess the word that comes after "his":
- his (favorite, little, ???)
- with his (hands, ???)
- off with his (probably head?)
Problem: how many bases do you look at when you're trying to predict the next one?
- Looking at more bases requires you to have a larger training set
- if you observe k-mers in a training set of genes, you will expect to observe instances of that k-mer
- Once k-mers begin to grow long, your training set must also grow substantially to give accurate probability estimates for your transition states
- If k-mers are too short, then their predictive power is not as strong

14 Interpolated Markov Models
- Sometimes you might not be sure how much memory to give your regular Markov model
- That's where IMMs come in handy
- An IMM estimates the probability of the next base as a weighted combination (interpolation) of the probabilities predicted from the most recent 1-mer through k-mer contexts, trusting longer contexts more when they were seen often in training
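
The interpolation idea can be sketched as follows. This is a simplified illustration, not GLIMMER's implementation: the weight given to each context length is assumed to grow linearly with how often that context was seen in training, up to a hypothetical `full_trust` count, and the order-0 estimate uses add-one smoothing.

```python
from collections import defaultdict


def train_counts(sequences, max_order):
    """Count, for every context of length 0..max_order, which base follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(max_order + 1):
                if i - k >= 0:
                    counts[seq[i - k:i]][seq[i]] += 1
    return counts


def imm_probability(counts, context, base, max_order, full_trust=400):
    """Interpolated estimate of P(base | context).

    Starts from a smoothed order-0 estimate and blends in each longer
    context with weight lam = min(1, times_seen / full_trust) (an assumption).
    """
    zero = counts.get("", {})
    prob = (zero.get(base, 0) + 1) / (sum(zero.values()) + 4)
    for k in range(1, max_order + 1):
        if len(context) < k:
            break
        seen = counts.get(context[-k:], {})
        total = sum(seen.values())
        if total == 0:
            continue  # context never observed: keep the shorter-context estimate
        lam = min(1.0, total / full_trust)
        prob = lam * (seen.get(base, 0) / total) + (1 - lam) * prob
    return prob


counts = train_counts(["ACGACGACGACG"], max_order=2)
for b in "ACGT":
    print(b, round(imm_probability(counts, "AC", b, max_order=2), 4))
```

Because every step is a convex combination of probability distributions, the four base probabilities for a given context still sum to 1.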

16 GeneMark
Developed by: Dr. Borodovsky's group, Georgia Tech
- Utilizes Markov models of coding and non-coding regions together with a Bayes decision-making function
- Deals simultaneously with direct and reverse DNA strands
- Statistical patterns in functional regions of the genome were used to calculate the transition probability matrices
Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry, 1993, Vol. 17, No. 19, pp

17 GeneMark.hmm
- Improvement to find exact gene starts
- GeneMark models embedded into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states
- Viterbi algorithm to find the most likely sequence of hidden states
- Ribosome binding site pattern to refine predictions of translation initiation codons
Lukashin, A. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research 26, no. 4 (1998):

18 GeneMarkS
- GeneMark.hmm with heuristic models
- Non-supervised training procedure
- Any sequence > 400 nt
- Gibbs sampling to align upstream sequences
- Predicts the correct translation initiation site
Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29, no. 12 (2001):

19 FGenesB
- An accurate ab initio prokaryotic gene prediction program package based on hidden Markov models
- Predicts operons by promoter and terminator sequence identification
- Additionally annotates genes based on homology
- Finds and masks rRNA/tRNA genes
  - rRNA found by BLAST against an rRNA database
  - tRNA found by the tRNAscan-SE program
- Initial predictions of long ORFs are used as a starting point for calculating parameters for gene prediction; iterates until the parameters stabilize
- Uses 5th-order in-frame Markov chains for coding regions and 2nd-order Markov models for translation and termination sites

20 FGenesB
1. Predicts operons based only on distances between predicted genes
2. Runs BLASTP for predicted proteins against the COG database, cog.pro
3. Improves operon prediction based on conservation of neighboring gene pairs in known genomes
4. Runs BLASTP against NR for proteins having no COG hits
5. Predicts potential promoters or terminators in the upstream and downstream regions, respectively, of predicted genes
6. Refines operon predictions using the predicted promoters and terminators as additional evidence

21 FGenesB
- Accurate
- Light web-based version, along with a downloadable version with more functionality
- All you need for input is genomic DNA

22 Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
Developed by: Oak Ridge National Laboratory
- Prodigal's algorithm for gene prediction follows the basic principle of KISS (Keep It Simple, Stupid)
- No hidden Markov model, no interpolated Markov model
- Dynamic programming with log-likelihood functions
Algorithm steps:
1. Constructing a training set for protein coding: GC frame plot based training
2. Building log-likelihood coding statistics from the training data: every potential gene is scored
3. Sharpening coding scores: penalizes all potential start candidates that lie downstream from a higher-scoring start
4. Length factor to coding scores: adds a static score to long ORFs with negative coding scores
5. Iterative start training: for every ORF containing a gene with a coding score above a certain threshold, the translation initiation site with the highest coding score is recorded. These starts are rescored by ATG/GTG/TTG frequency and are then used for training
6. Final dynamic programming: gene calling

23 Dynamic Programming Connections in Prodigal
[Figure: forward strand (5' to 3') and reverse strand (3' to 5')]
- Red arrows: gene connections
- Black arrows: intergenic connections
- Blue pieces: potential genes
The score of a "gene" connection is the precalculated coding score for that gene, whereas the score for an intergenic connection is a small bonus or penalty based on the distance between the two genes.
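
A toy version of this kind of dynamic programming is sketched below. It is not Prodigal's code: it handles a single strand, forbids overlaps entirely, and uses a hypothetical `intergenic_score` with invented thresholds. Candidate genes are `(start, end, coding_score)` tuples with precalculated scores.

```python
def intergenic_score(distance):
    """Hypothetical bonus/penalty for the gap between two consecutive genes."""
    if distance < 50:
        return 0.5    # closely spaced genes (operon-like) get a small bonus
    if distance > 500:
        return -0.5   # long gaps get a small penalty
    return 0.0


def call_genes(candidates):
    """Pick the highest-scoring chain of non-overlapping candidate genes."""
    genes = sorted(candidates, key=lambda g: g[1])  # order by end coordinate
    best = []  # best[i]: best total score of a chain that ends with genes[i]
    prev = []  # prev[i]: index of the previous gene in that chain, or None
    for i, (start, _end, score) in enumerate(genes):
        best_i, prev_i = score, None
        for j in range(i):
            if genes[j][1] < start:  # genes[j] ends before genes[i] starts
                candidate = best[j] + intergenic_score(start - genes[j][1]) + score
                if candidate > best_i:
                    best_i, prev_i = candidate, j
        best.append(best_i)
        prev.append(prev_i)
    if not best:
        return [], 0.0
    i = max(range(len(best)), key=best.__getitem__)
    total = best[i]
    chain = []
    while i is not None:
        chain.append(genes[i])
        i = prev[i]
    chain.reverse()
    return chain, total


candidates = [(0, 300, 10.0), (100, 400, 4.0), (320, 900, 8.0), (1500, 2000, -3.0)]
print(call_genes(candidates))
```

With these inputs the chain keeps the first and third candidates: the second overlaps the first with a lower score, and the fourth has a negative coding score.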

24 Prodigal
Key features:
- Speed: can analyze an entire microbial genome in 30 seconds
- Accuracy: when tested on curated data sets, Prodigal's accuracy is similar to that of the other top gene prediction tools (GeneMarkS, Glimmer)
- Specificity: under 5% false-positive discovery rate
- GC content: unlike other tools, Prodigal works well with high-GC genomes because it implements GC frame plot based training and changes parameters based on GC content
- Easy to use

25 GLIMMER3 (Gene Locator and Interpolated Markov ModelER 3)
Developed by: Center for Computational Biology, Johns Hopkins University
- Interpolated context model (ICM) based approach
- GLIMMER1 used an interpolated Markov model; GLIMMER2 and GLIMMER3 use ICMs
Algorithm steps:
1. Identify open reading frames (ORFs)
2. Starting from the stop codon and working backwards (3' to 5'), calculate the probability of each nucleotide being part of a coding region
   a. Probability is calculated based on context (the bases preceding the current position) using the ICM
3. Calculate the cumulative log-likelihood sum; the peak is the likely location of the start codon
4. Select the set of ORFs that maximizes the total score with no overlaps greater than a specified maximum
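
Steps 2 and 3 can be illustrated with a small sketch that walks backwards from the stop codon, accumulates a per-codon score, and reports the start codon at which the running sum peaks. The scoring function here is a deliberately crude placeholder (GC count minus 1.5), standing in for the ICM log-likelihood; it is an assumption for illustration and not GLIMMER's model.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def placeholder_codon_score(codon):
    """Stand-in for an ICM log-likelihood score; NOT GLIMMER's actual model."""
    return sum(base in "GC" for base in codon) - 1.5


def best_start(orf, codon_score=placeholder_codon_score):
    """Find the start codon where the backward cumulative score peaks.

    `orf` must be in frame and end with a stop codon. Returns
    (start_index, cumulative_score), or None if no start codon is present.
    """
    if len(orf) % 3 != 0 or orf[-3:] not in STOP_CODONS:
        raise ValueError("ORF must be in frame and end with a stop codon")
    total = 0.0
    best = None
    # Walk from the codon just before the stop back towards the 5' end.
    for i in range(len(orf) - 6, -1, -3):
        codon = orf[i:i + 3]
        total += codon_score(codon)
        if codon in START_CODONS and (best is None or total > best[1]):
            best = (i, total)
    return best


orf = "ATG" + "AAA" + "ATG" + "GCC" * 3 + "TAA"
print(best_start(orf))
```

In this example the internal ATG at index 6 wins over the upstream ATG at index 0, because the low-scoring AAA codon between them pulls the cumulative sum down.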

26 GLIMMER
- IMM: looks at the bases immediately preceding the target
  - Bases immediately preceding the target aren't always the most informative (e.g. third nucleotide in codon)
- ICM: finds the bases in the context region that most strongly correlate with the target

27 GLIMMER3
Pros
1. Higher number of unique gene calls than Prodigal and GeneMarkS
2. Better performance than fixed-order Markov models
Cons
1. Higher error rate than Prodigal and GeneMarkS
   a. False positives

28 GenePRIMP
GENE PRediction IMprovement Pipeline for Prokaryotic genomes
Takes gene calls in EMBL or GenBank format as input and outputs a report of gene prediction anomalies
1. CRISPR finder
2. Overlaps between features
3. BLASTing and filtering proteins
4. Classify into long/short, broken and interrupted genes
5. Intergenic regions
Pati A, Ivanova NN Nat Methods Jun;7(6):455-7

29 GenePRIMP
GENE PRediction IMprovement Pipeline for Prokaryotic genomes
- Short and long genes classified through alignment quality score: α = (cq - ch)/(cq + ch)
- Broken genes
- Interrupted genes
  - Frameshifts
  - Pseudogenes
Pati A, Ivanova NN Nat Methods Jun;7(6):455-7

30 GenePRIMP Pros/Cons
Pros:
- Can significantly decrease incorrectly predicted genes and be used as a quality control step
- Attempts to detect and fix interrupted genes (e.g. when a spurious stop codon is present)
Cons:
- Classifies genes with interrupted translation frames as pseudogenes, which results in a higher rate of missed gene calls
H. James Tripp Standards in Genomic Sciences /s

32 tRNAscan-SE
- tRNAscan 1.4 (optimized version of 1.3): uses a hierarchical, rule-based system in which potential tRNAs must exceed empirically determined similarity thresholds and have the ability to form the base pairing present in tRNA stem-loop structures.
- EufindtRNA (Pavesi's algorithm): searches exclusively for linear sequence signals in the form of eukaryotic RNA polymerase III promoters and terminators. Effectively identifies prokaryotic tRNAs with an adjustment to the cutoff score (implemented through command line option -P).
- Covels: takes candidate tRNA sequences plus 7 flanking nucleotides and applies a tRNA covariance model made by structurally aligning 1415 tRNAs from the 1993 Sprinzl database.
- Coves: takes predicted tRNAs confirmed with Covels log odds scores > 20.0 bits, trims tRNA bounds, and predicts secondary structure through global structure alignment to the tRNA covariance model.

33 tRNAscan-SE
- Can be used via the online web server or downloaded and run locally
- Input: FASTA
- Output: tabular, ACeDB, or secondary structure format

34 tRNAscan-SE
- Detects % of true tRNAs with less than 1 false positive per 15 billion nucleotides
- ~ times the speed of tRNA covariance models (~30,000 bp/s)
- Additional extensions allow for detection of unusual tRNA species including selenocysteine tRNA genes, tRNA-derived repetitive elements and pseudogenes.

35 Rfam 12.0
- Developed by the Wellcome Trust Sanger Institute
- Currently hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI)
- Database of 2450 RNA families represented by manually curated multiple sequence alignments (MSAs), consensus secondary structures and covariance models (CMs)
- CMs, or profile stochastic context-free grammars, are probabilistic models of the conserved sequence and secondary structure of an RNA family
  - Analogous to HMMs, but rather than each position of the model being independent, CM base-paired positions are dependent on one another
  - The added complexity allows for the modeling of secondary structures, which are often more conserved than primary sequences in functional RNA
- Families broken down into three functional groups:
  - non-coding RNA genes
  - structured cis-regulatory elements
  - self-splicing RNAs

36 Rfam 12.0 and Infernal 1.1
Process for RNA prediction: Rfam pipeline with Infernal 1.1
- Infernal 1.1 is a software package that searches DNA sequence databases for RNA structure and sequence similarities.
1. Install Infernal 1.1 and download the Rfam 12.0 library of CMs
2. Run Infernal's cmscan
   - Takes a query sequence and a CM database as input parameters
   - Returns the known/detectable structural RNAs in the given sequence, as well as information about whether the sequence contains homologies to any known RNA families in the library
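
A cmscan run can be driven from Python roughly as follows. This is a sketch under assumptions: the file names `Rfam.cm`, `genome.fasta` and `genome.tblout` are hypothetical, the CM library is assumed to have been prepared once with Infernal's `cmpress`, and the option set (`--cut_ga`, `--rfam`, `--nohmmonly`, `--tblout`) is one common choice for Rfam annotation rather than a prescribed pipeline.

```python
import shutil
import subprocess


def build_cmscan_command(cm_db, query_fasta, tblout):
    """Build a cmscan command line: cmscan [options] <cmdb> <seqfile>."""
    return [
        "cmscan",
        "--cut_ga",         # use each family's curated gathering threshold
        "--rfam",           # filter settings intended for large Rfam-style searches
        "--nohmmonly",      # always score hits with the full covariance model
        "--tblout", tblout, # write a tabular summary of hits to this file
        cm_db,
        query_fasta,
    ]


def run_cmscan(cm_db, query_fasta, tblout):
    """Run cmscan if it is installed; raises if the executable is missing."""
    if shutil.which("cmscan") is None:
        raise RuntimeError("cmscan not found on PATH; install Infernal 1.1 first")
    command = build_cmscan_command(cm_db, query_fasta, tblout)
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)


# Hypothetical file names, shown only to illustrate the argument order.
print(" ".join(build_cmscan_command("Rfam.cm", "genome.fasta", "genome.tblout")))
```

Keeping command construction separate from execution makes it easy to log or inspect the exact command before running it on a full genome.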

37 RNAmmer
- Predicts ribosomal RNA
- Accepts prokaryotic and eukaryotic inputs
- Uses hidden Markov models (HMMs) at 2 levels
  - Spotter model
    - Detects the approximate gene position
    - Flanking regions are extracted and sent to the full model
  - Full model
    - Matches the entire gene

38 RNAmmer
RNAmmer has 2 components:
- rnammer: wrapper
  - Initializes/configures the search of the input sequence
- core-rnammer: core Perl program
  - Searches both strands (in parallel)

39 RNAmmer
- Cited by 1631 (Lagesen et al. 2007)
- Released in 2007
- Webserver or download
- Length limit: 10,000,000 nucleotides
- Pre-screens the sample, resulting in quick analysis but possible loss of sensitivity
- The pre-screening step also makes it a useful tool for large datasets
- Runs in parallel

40 RNAcon - Classification of non-coding RNAs
Developed by the Bioinformatics Center at the Institute of Microbial Technology
Two-step process:
1) Predict whether a sequence is coding or non-coding RNA
   - Uses a Support Vector Machine (SVM) based machine learning model with a tri-nucleotide composition (TNC) model
   - SVM: pattern recognition based on the TNC model
   - Learns the different nucleotide compositions of coding vs non-coding RNA, i.e. heavy GC in coding RNA and heavy uracil in ncRNA
2) Classify ncRNAs into their respective classes
   - Predicts secondary structures of the ncRNA using the IPknot software
   - The structures are then used to calculate 20 different graph properties using the igraph R package
   - The numerical values are then fed into RandomForest in WEKA (a Java collection of visualization tools and algorithms for data analysis and predictive modeling)
   - A RandomForest based model classifies the ncRNA into 18 different classes
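
The tri-nucleotide composition feature vector used in step 1 can be sketched as below. This illustrates the general idea only: it assumes overlapping 3-base windows, converts T to U, and skips windows containing ambiguous bases, none of which is confirmed here as RNAcon's exact procedure. The resulting 64 numbers are the kind of input an SVM would be trained on.

```python
from itertools import product

# All 64 possible trinucleotides over the RNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]


def trinucleotide_composition(seq):
    """Return a 64-element vector of trinucleotide frequencies.

    Assumptions: overlapping windows, T treated as U, and windows with
    characters outside A/C/G/U ignored.
    """
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    windows = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            windows += 1
    if windows == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[t] / windows for t in TRINUCLEOTIDES]


vector = trinucleotide_composition("AUGGCUAUGGC")
print(len(vector), round(vector[TRINUCLEOTIDES.index("AUG")], 3))
```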

41 RNAcon Algorithm

42 RNAcon
Pros:
- Better performance than AUGUSTUS, GeneMark.hmm, and Glimmer.hmm
- Computationally simpler than other SVM based methods such as CONC and CPC
- Highest MCC score (0.76 MCC)
- Much quicker and computationally less expensive
- Web GUI and application as well as a stand-alone version
- Provides the predicted structure of the ncRNA and its classification
Cons:
- Machine learning algorithms are intrinsically dependent on many factors
- Prediction accuracy is only as good as the training and testing data
- Over-optimization issues

44 ho·mol·o·gy (\hō-ˈmä-lə-jē, hə-\) n. the existence of shared ancestry between a pair of structures, or genes, in different species

45 ho·mol·o·gy_terminology

46 ho·mol·o·gy
Since coding sequences are conserved over evolutionary time, homology-based gene prediction can search a database of known genes for significant similarity between novel and known gene sequences.

47 ho·mol·o·gy_tools
- BLAST: Basic Local Alignment Search Tool
- BLAT: BLAST-Like Alignment Tool

48 ho·mol·o·gy_blast
- method for rapid searching of nucleotide and protein databases
- detects similarities that may provide important clues to the function of uncharacterized proteins
- faster than FASTA and the original Smith-Waterman implementation

49 ho·mol·o·gy_blast
how it works:

50 ho·mol·o·gy_blat
- similar to BLAST, but not as flexible
- finds similarities quickly, but it needs an exact or nearly exact match to find a hit
- faster than BLAST and much more memory efficient because of indexing
Kent WJ. BLAT - The BLAST-like alignment tool. Genome Res. 2002;12(4): doi: /gr
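
The indexing idea can be sketched in a few lines: build a lookup table of non-overlapping k-mers from the target sequence once, then slide over the query and look up every overlapping k-mer to find exact seed matches. This is a toy illustration of seed finding only; it does not extend or score alignments, and the function names are made up for this example.

```python
from collections import defaultdict


def build_index(target, k):
    """Index the non-overlapping k-mers of the target: k-mer -> list of positions."""
    index = defaultdict(list)
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seeds(query, index, k):
    """Return (query_pos, target_pos) pairs where a query k-mer matches exactly."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, tpos))
    return hits


target = "ACGTACGTTTGACCA"
index = build_index(target, k=4)
print(find_seeds("GGTTGAC", index, k=4))
```

Indexing only non-overlapping k-mers keeps the table small, at the cost of requiring an exact k-mer match to seed a hit, which matches the trade-off described on the slide.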

51 ho·mol·o·gy_pros/cons
Pros:
- fast implementation
- high accuracy
- web version available, no download/installation necessary
- de facto standard
Cons:
- does not guarantee optimal alignment
- returns only one best alignment
- produces only ungapped local alignments

53 EuGene-PP
Open integrative gene finder
- Compared to most existing gene finders, EuGene is characterized by its ability to simply integrate arbitrary sources of information in its prediction process, including RNA-Seq, protein similarities, homologies and various statistical sources of information.
- Based on all the available information, EuGene outputs a prediction of maximal score, i.e. one that is maximally consistent with the information provided.

54 Features
Data integrated:
- Markov models of coding regions, trained on regions with strong similarities to a reference protein databank.
- Regions of similarity with different protein databanks.
- A set of CDS predictions produced by a reliable self-training ab initio gene finder. Prodigal is used.
- A set of predicted non-coding RNA genes (ncRNA). tRNAscan-SE, rfam_scan and RNAmmer are used.
- A set of profiles of measured expression on each strand along the genome (RNA-Seq data) that shows transcription.
- A set of potential transcription start sites, defined as points of sudden increase in expression.

55 Advantages
- Predicts many smaller genes
- It can run using just FASTA genomic sequences and expression data, and has no parameters to tune
- Prediction is performed independently on each strand, allowing for the prediction of antisense genes.

56 Maker2
- Genome annotation and data management tool
- Can be executed with different ab initio programs (e.g. GeneMark, Augustus, SNAP)
- Supposedly, ab initio programs give better results when included in the pipeline
- Gives good results even if the training data is of poor quality
- Tests were done only on eukaryotes, but it works with prokaryotes
- Runs fast

58 Proposed Workflow

59 Questions?

60 References
- Lukashin, A. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research 26, no. 4 (1998):
- Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29, no. 12 (2001):
- Lagesen, K. "RNAmmer: Consistent and rapid annotation of ribosomal RNA genes." Nucleic Acids Research (2007)
- RNAcon: net/raghava/rnacon/index.html
- Pati, A. "GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes." Nat Methods Jun;7(6):455-7
- Yandell, M. "MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects." BMC Bioinformatics. 2011; 12: 491.
- Prodigal -
- Lowe TM, Eddy SR. "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence." Nucleic Acids Res. 1997;25(5):
- Rfam - Nawrocki, Eric P., et al. "Rfam 12.0: updates to the RNA families database." Nucleic Acids Research (2014): gku1063.
- Nawrocki EP, Eddy SR. "Infernal 1.1: 100-fold faster RNA homology searches." Bioinformatics. 2013;29(22):

61 References
- Infernal - Eddy, Sean R., and Richard Durbin. "RNA sequence analysis using covariance models." Nucleic Acids Research (1994):
- FGenesB -
- HMMs -
