Gene Prediction Background & Strategy. February 24, 2016
- Dylan Burns
Transcription
1 Gene Prediction Background & Strategy February 24, 2016
2 Overview: background, ab initio prediction tools, RNA prediction tools, homology-based prediction tools, combo tools, final statements
3 Gene Prediction
- the next step after genome assembly
- the process of identifying regions of genomic DNA that encode genes
- one of the most important steps in understanding the genome of a species once it has been fully sequenced
4 Eukaryotic vs. Prokaryotic Gene Prediction
Prokaryotes:
- Overall small genome size
- High density of CDSs
- Low numbers of repeated noncoding sequences
- Regulatory regions are close to the protein coding sequences
- CDS regions are short
Eukaryotes:
- Opposite of prokaryotes
5 Different types of algorithms for gene prediction
1. Ab initio
   a. Identify genes based on intrinsic factors: sequences look different depending on whether they are in coding or non-coding regions
   b. Use statistical models (e.g. Markov models)
2. RNA prediction
   a. Non-coding RNAs (ncRNAs) are transcribed, not translated into protein
   b. ncRNAs have been shown to be major players in prokaryotic cellular processes
   c. Prediction methods include sets of utilities to assess predicted ncRNA genes relative to their context, annotation, conservation, and secondary structure
3. Homology
   a. Novel sequences are compared to known sequences in a database
6 Prokaryotic Gene Structure and Characteristics
- Central Dogma
- A prokaryotic gene can be divided into:
  - regulatory elements (promoter and operator)
  - structural elements
- Polycistronic (operons) and monocistronic genes
[Figure credit: Thomas Shafee, CC BY-SA 4.0]
7 Important features to be considered during gene prediction
- Stop codons: 3 out of 64 codons, so in random sequence a stop codon is expected about once every 21 codons (64/3)
- GC content differs between coding and noncoding regions
- Reading frames and frameshifts
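The features on this slide can be computed directly from sequence. The following is a minimal sketch, assuming the standard genetic code's three stop codons (TAA, TAG, TGA) and only the forward strand; the function names are illustrative and not taken from any gene finder.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def stop_codon_fraction():
    """Fraction of the 64 possible codons that are stop codons."""
    codons = ["".join(c) for c in product("ACGT", repeat=3)]
    return sum(c in STOP_CODONS for c in codons) / len(codons)


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def stops_per_frame(seq):
    """Count stop codons in each of the three forward reading frames."""
    seq = seq.upper()
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(c in STOP_CODONS for c in codons))
    return counts
```

A frame that runs for far more than about 21 codons without a stop is unlikely to arise by chance, which is why long open reading frames are a useful first signal.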
8 Markov Models
- A Markov chain is a discrete random process that undergoes transitions from one state to another on a state space.
- "Memorylessness": the next state depends only on the current state
9 [Figures: two-state Markov chain; three-state Markov chain] The numbers represent the probability of transition from one state to another.
10 Weather prediction Markov chain example
Given: today is sunny. Find the probability that it will be sunny tomorrow and rainy the day after.
P(Day2 = Sunny, Day3 = Rainy | Day1 = Sunny)
= P(Day2 = Sunny | Day1 = Sunny) * P(Day3 = Rainy | Day2 = Sunny)
= 0.8 * 0.05 = 0.04
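The same calculation can be sketched in a few lines of code. Only P(Sunny to Sunny) = 0.8 and P(Sunny to Rainy) = 0.05 come from the slide; the third state ("Cloudy") and every other number in the table are placeholders assumed here so that each row sums to 1.

```python
# Transition table. Only the two Sunny-row values 0.8 and 0.05 are from the
# slide; the "Cloudy" state and all remaining numbers are assumed placeholders.
TRANSITIONS = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy": {"Sunny": 0.2, "Rainy": 0.6, "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3, "Cloudy": 0.5},
}


def path_probability(start, path, transitions=TRANSITIONS):
    """Probability of visiting `path` in order, given the starting state.

    Because of memorylessness, this is just the product of the
    one-step transition probabilities along the path.
    """
    prob = 1.0
    current = start
    for nxt in path:
        prob *= transitions[current][nxt]
        current = nxt
    return prob


# Sunny today, then Sunny tomorrow, then Rainy the day after: 0.8 * 0.05
answer = path_probability("Sunny", ["Sunny", "Rainy"])
```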
11 Hidden Markov Models (HMMs)
- Utilized by many gene prediction tools
- There is a hidden state, which must be derived from emissions
  - Hidden states: coding sequence, non-coding sequence
  - Observed emissions: A, C, T, G
- When using HMMs you must specify a model:
  - Number of states
  - Possible transitions
  - Learning material/time
- The most probable hidden state can be predicted based on model parameters and by using dynamic programming
12 HMM Model for 5′ Splice Site Recognition
- States: Begin, Exon, Donor, Intron
- Observations: A, C, G, T
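The dynamic programming step mentioned above is commonly the Viterbi algorithm. Below is a minimal sketch for the two hidden states named on the slide. All start, transition, and emission probabilities are made-up illustrative values (coding assumed slightly GC-rich), not parameters from any real gene finder, and the input is assumed to contain only A, C, G, T.

```python
import math

STATES = ("coding", "noncoding")

# All probabilities below are illustrative assumptions, not trained values.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable hidden-state path for an A/C/G/T string."""
    if not seq:
        return []
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Working in log space avoids numerical underflow on long sequences, which matters once the input is a whole genome rather than a toy string.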
13 IMM Example
Guess the word that comes after "his":
- his (favorite, little, ???)
- with his (hands, ???)
- off with his (probably "head"?)
Problem: how many bases do you look at when you're trying to predict the next one?
- Looking at more bases requires you to have a larger training set
- if you observe k-mers in a training set of genes, you will expect to observe instances of that k-mer
- Once k-mers begin to grow long, your training set must also grow substantially to gain accurate probability estimates for your transition states
- If k-mers are too short, then their predictive power is not as strong
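The training-set problem can be made concrete with a little arithmetic. This sketch assumes a four-letter DNA alphabet and, purely for illustration, that contexts are uniformly distributed; the function names are hypothetical.

```python
def contexts_for_order(k):
    """Number of distinct length-k contexts over the alphabet A, C, G, T."""
    return 4 ** k


def mean_observations_per_context(training_bases, k):
    """Average number of times each length-k context is seen,
    assuming (unrealistically) that all contexts are equally likely."""
    return training_bases / contexts_for_order(k)


# With one million training bases, a 5-mer context is seen about 977 times
# on average, while a 12-mer context is seen less than once.
short_context = mean_observations_per_context(1_000_000, 5)
long_context = mean_observations_per_context(1_000_000, 12)
```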
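14 Interpolated Markov Models
- Sometimes you might not be sure how much memory to give your regular Markov model
- That's where IMMs come in handy
- An IMM computes the probability of the next base by combining (interpolating, as a weighted combination) the probabilities predicted from the most recent 1-mer through k-mer contexts, giving more weight to contexts that were seen often enough in training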
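A minimal sketch of the interpolation idea follows. The weighting rule used here (weight grows linearly with how often a context was seen, capped at 1) is a simplified stand-in chosen for illustration; it is not the weighting scheme of GLIMMER or any other published tool, and the `full_weight_at` parameter is hypothetical.

```python
from collections import defaultdict


def train_counts(sequences, max_order):
    """counts[k][context][base] for every context length k in 0..max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(max_order + 1):
                if i - k >= 0:
                    counts[k][seq[i - k:i]][base] += 1
    return counts


def imm_probability(counts, context, base, full_weight_at=40):
    """Interpolated estimate of P(base | context).

    Starts from a uniform guess and folds in each longer context in turn,
    trusting a context more the more often it appeared in training.
    """
    max_order = len(counts) - 1
    prob = 0.25  # uniform fallback over A, C, G, T
    for k in range(max_order + 1):
        if k > len(context):
            break
        ctx = context[len(context) - k:]
        seen = counts[k].get(ctx)
        if not seen:
            break  # context never observed: keep the shorter-context estimate
        total = sum(seen.values())
        weight = min(1.0, total / full_weight_at)
        prob = weight * (seen.get(base, 0) / total) + (1.0 - weight) * prob
    return prob
```

Because each step is a convex combination of two probability distributions, the estimates over the four bases still sum to 1.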
15 Overview: background, ab initio prediction tools, RNA prediction tools, homology-based prediction tools, combo tools, final statements
16 GeneMark
- Developed by: Dr. Borodovsky's group, Georgia Tech
- Utilizes Markov models of coding and non-coding regions together with a Bayes decision-making function
- Deals simultaneously with direct and reverse DNA strands
- Statistical patterns in functional regions of the genome were used to calculate probability transition matrices.
Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry, 1993, Vol. 17, No. 19, pp
17 GeneMark.hmm
- Improvement to find exact gene starts
- GeneMark models embedded into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states
- Viterbi algorithm to find the most likely sequence of hidden states
- Ribosome binding site pattern to refine predictions of translation initiation codons.
Lukashin, A. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research 26, no. 4 (1998):
18 GeneMarkS
- GeneMark.hmm with heuristic models
- Non-supervised training procedure
- Any sequence > 400 nt
- Gibbs sampling to align upstream sequences
- Predicts the correct translation initiation site
Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29, no. 12 (2001):
19 FGenesB
- An accurate ab initio prokaryotic gene prediction program package based on hidden Markov models
- Predicts operons by promoter and terminator sequence identification
- Additionally annotates genes based on homology
- Finds and masks rRNA/tRNA genes:
  - rRNA found by BLAST against an rRNA database
  - tRNA found by the tRNAscan-SE program
- Initial predictions of long ORFs are used as a starting point for calculating parameters for gene prediction; iterates until the parameters stabilize
- Uses 5th-order in-frame Markov chains for coding regions and 2nd-order Markov models for translation and termination sites
20 FGenesB
- Predicts operons based only on distances between predicted genes
- Runs BLASTP for predicted proteins against the COG database, cog.pro
- Improves operon prediction based on conservation of neighboring gene pairs in known genomes
- Runs BLASTP against NR for proteins having no COG hits
- Predicts potential promoters or terminators in upstream and downstream regions, correspondingly, of predicted genes
- Refines operon predictions using predicted promoters and terminators as additional evidence
21 FGenesB
- Accurate
- Light web-based version, along with a downloadable version with more functionality
- All you need for input is genomic DNA
22 Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
- Developed by: Oak Ridge National Laboratory
- Prodigal's algorithm for gene prediction follows the basic principle of KISS (Keep It Simple, Stupid).
- No hidden Markov model, no interpolated Markov model
- Dynamic programming with log-likelihood functions
Algorithm steps:
- Constructing a training set for protein coding: GC frame plot based training
- Building log-likelihood coding statistics from the training data: every potential gene is scored
- Sharpening coding scores: penalizes all potential start candidates that lie downstream from a higher-scoring start
- Length factor to coding scores: adds a static score to long ORFs with negative coding scores
- Iterative start training: for all ORFs with a gene whose coding score is above a certain threshold, the translation initiation site with the highest coding score is recorded. These coding starts are rescored by ATG/GTG/TTG frequency and are then used for training
- Final dynamic programming: gene calling
23 Dynamic Programming Connections in Prodigal
[Figure: forward strand 5′ to 3′, reverse strand 3′ to 5′. Red arrows: gene connections. Black arrows: intergenic connections. Blue pieces: potential genes.]
The score of a "gene" connection is the precalculated coding score for that gene, whereas the score for an intergenic connection is a small bonus or penalty based on the distance between the two genes
24 Prodigal
Key features:
- Speed: can analyze an entire microbial genome in 30 seconds
- Accuracy: when tested on curated data sets, Prodigal's accuracy is similar to the other top gene prediction tools (GeneMarkS, Glimmer)
- Specificity: under 5% false-positive discovery rate
- GC content: unlike other tools, Prodigal works well with high GC-content genomes because it implements GC frame plot based training and changes parameters based on GC content
- Easy to use
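The connection scoring described on this slide can be illustrated with a heavily simplified dynamic program. This sketch is not Prodigal's implementation: it handles a single strand, forbids any overlap between genes, and uses a hypothetical `intergenic_score` with made-up bonus and penalty values.

```python
def intergenic_score(distance):
    """Hypothetical rule: small bonus for tightly packed genes, small penalty otherwise."""
    return 1.0 if distance <= 50 else -0.5


def best_gene_chain(candidates):
    """Pick the highest-scoring chain of non-overlapping candidate genes.

    candidates: list of (start, end, coding_score), end exclusive.
    Returns (total_score, chain), where chain is a list of the chosen tuples.
    """
    genes = sorted(candidates, key=lambda g: g[1])
    best = []  # best[i] = (score of best chain ending with gene i, previous index)
    for i, (start, _end, score) in enumerate(genes):
        best_score, best_prev = score, None
        for j in range(i):
            prev_end = genes[j][1]
            if prev_end <= start:  # gene j finishes before gene i begins
                candidate = best[j][0] + intergenic_score(start - prev_end) + score
                if candidate > best_score:
                    best_score, best_prev = candidate, j
        best.append((best_score, best_prev))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda idx: best[idx][0])
    total = best[i][0]
    chain = []
    while i is not None:  # follow back-pointers to recover the chain
        chain.append(genes[i])
        i = best[i][1]
    chain.reverse()
    return total, chain
```

In the example used for checking this sketch, two compatible genes scoring 10 and 8 beat a single overlapping gene scoring 12 once the intergenic bonus is added.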
25 GLIMMER3 (Gene Locator and Interpolated Markov ModelER 3)
- Developed by: Center for Computational Biology, Johns Hopkins University
- Interpolated context model (ICM) based approach
- GLIMMER1 used an interpolated Markov model; GLIMMER2 and GLIMMER3 use ICMs
Algorithm steps:
- Identify open reading frames (ORFs)
- Starting from the stop codon and working backwards (3′ to 5′), calculate the probability of each nucleotide being part of a coding region
  - Probability is calculated based on context (bases preceding the current position) using the ICM
- Calculate the cumulative log-likelihood sum; the peak is the likely location of the start codon
- Select the set of ORFs that maximizes total score with no overlaps greater than a specified maximum
26 GLIMMER
- IMM: look at the bases immediately preceding the target
  - Bases immediately preceding the target aren't always the most informative (e.g. third nucleotide in a codon)
- ICM: find the bases in the context region that most strongly correlate with the target
27 GLIMMER3
Pros:
1. Higher number of unique gene calls than Prodigal and GeneMarkS
2. Better performance than fixed-order Markov models
Cons:
1. Higher error rate than Prodigal and GeneMarkS
   a. False positives
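The first algorithm step, identifying ORFs, can be sketched as a scan of all six reading frames. This is a generic ORF finder written for illustration, not GLIMMER's code; it assumes the start codons ATG/GTG/TTG and the standard stop codons, and the `min_length` default is an arbitrary choice.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=90):
    """Find start-to-stop ORFs in all six frames.

    Returns (strand, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and measured on the strand that was scanned.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in STARTS:
                    orf_start = i  # first start codon after the previous stop
                elif codon in STOPS:
                    if orf_start is not None and i + 3 - orf_start >= min_length:
                        orfs.append((strand, orf_start, i + 3))
                    orf_start = None
    return orfs
```

Taking the first start codon after the previous stop gives the longest possible ORF; choosing the true start among the candidates is the job of the scoring steps that follow.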
28 GenePRIMP
GENE PRediction IMprovement Pipeline for Prokaryotic genomes
Takes gene calls in EMBL or GenBank format as input and outputs a report of gene prediction anomalies:
1. CRISPR finder
2. Overlaps between features
3. BLASTing and filtering proteins
4. Classifying into long/short, broken, and interrupted genes
5. Intergenic regions
Pati A, Ivanova NN Nat Methods Jun;7(6):455-7
29 GenePRIMP
GENE PRediction IMprovement Pipeline for Prokaryotic genomes
- Short and long genes are classified through an alignment quality score: α = (cq - ch)/(cq + ch)
- Broken genes
- Interrupted genes
  - Frameshifts
  - Pseudogenes
Pati A, Ivanova NN Nat Methods Jun;7(6):455-7
30 GenePRIMP Pros/Cons
Pros:
- Can significantly decrease incorrectly predicted genes and be used as a quality control step
- Attempts to detect and fix interrupted genes (e.g. when a spurious stop codon is present)
Cons:
- Classifies genes with interrupted translation frames as pseudogenes, which results in a higher rate of missed gene calls
H. James Tripp Standards in Genomic Sciences /s
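The alignment quality score is simple to compute. The slide does not define cq and ch; this sketch assumes they are non-negative coverage values for the query and the database hit, which is an interpretation rather than something stated in the source.

```python
def alignment_quality(cq, ch):
    """Alignment quality score alpha = (cq - ch) / (cq + ch).

    cq and ch are assumed here to be non-negative coverage values for the
    query and the hit; the slide itself does not define them.
    """
    if cq + ch == 0:
        raise ValueError("cq + ch must be non-zero")
    return (cq - ch) / (cq + ch)
```

For non-negative inputs the score lies between -1 and 1 and equals 0 when the two values match, so its sign indicates which of the two is larger.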
31 Overview: background, ab initio prediction tools, RNA prediction tools, homology-based prediction tools, combo tools, final statements
32 tRNAscan-SE
- tRNAscan 1.4 (optimized version of 1.3): uses a hierarchical, rule-based system in which potential tRNAs must exceed empirically determined similarity thresholds and have the ability to form the base pairing present in tRNA stem-loop structures.
- EufindtRNA (Pavesi's algorithm): searches exclusively for linear sequence signals in the form of eukaryotic RNA polymerase III promoters and terminators. Effectively identifies prokaryotic tRNAs with an adjustment to the cutoff score (implemented through command line option -P)
- Covels: takes candidate tRNA sequences plus 7 flanking nucleotides and applies a tRNA covariance model made by structurally aligning 1415 tRNAs from the 1993 Sprinzl database.
- Coves: takes predicted tRNAs confirmed with Covels log odds scores > 20.0 bits, trims tRNA bounds, and predicts secondary structure through global structure alignment to the tRNA covariance model.
33 tRNAscan-SE
- Can be used via the online web server or downloaded and run locally
- Input: FASTA
- Output: tabular, ACeDB, or secondary structure format
34 tRNAscan-SE
- Detects % of true tRNAs with less than 1 false positive per 15 billion nucleotides
- ~ times the speed of tRNA covariance models (~30,000 bp/s)
- Additional extensions allow for detection of unusual tRNA species, including selenocysteine tRNA genes, tRNA-derived repetitive elements, and pseudogenes.
35 Rfam 12.0
- Developed by the Wellcome Trust Sanger Institute
- Currently hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI)
- Database of 2450 RNA families represented by manually curated multiple sequence alignments (MSAs), consensus secondary structures, and covariance models (CMs)
- CMs, or profile stochastic context-free grammars, are probabilistic models of the conserved sequence and secondary structure of an RNA family
  - Analogous to HMMs, but rather than each position of the model being independent, CM base-paired positions are dependent on one another
  - The added complexity allows for the modeling of secondary structures, which are often more conserved than primary sequences in functional RNA
- Families are broken down into three functional groups:
  - non-coding RNA genes
  - structured cis-regulatory elements
  - self-splicing RNAs
36 Rfam 12.0 and Infernal 1.1: Process for RNA Prediction
Rfam pipeline with Infernal 1.1
- Infernal 1.1 is a software package that searches DNA sequence databases for RNA structure and sequence similarities.
- Install Infernal 1.1 and download the Rfam 12.0 library of CMs
- Run Infernal's cmscan
  - Takes a query sequence and a CM database as input parameters
  - Returns known/detectable structural RNAs in the given sequence, as well as information about whether the sequence contains homologies to any known RNA families in the library
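The cmscan step can be scripted. The sketch below assumes Infernal is installed and on the PATH, and that the Rfam CM file has already been indexed with Infernal's cmpress. The file names are placeholders, and only the basic invocation with a tabular output file (--tblout) is shown; any score thresholds or Rfam-specific options are left out.

```python
import shutil
import subprocess


def build_cmscan_command(cm_database, query_fasta, table_out):
    """Build a basic cmscan invocation: cmscan --tblout <table> <cmdb> <seqfile>."""
    return ["cmscan", "--tblout", table_out, cm_database, query_fasta]


def run_cmscan(cm_database, query_fasta, table_out):
    """Run cmscan, failing early with a clear message if it is not installed."""
    if shutil.which("cmscan") is None:
        raise RuntimeError("cmscan not found on PATH; install Infernal first")
    command = build_cmscan_command(cm_database, query_fasta, table_out)
    subprocess.run(command, check=True)
    return table_out


# Placeholder file names, shown for illustration only:
# run_cmscan("Rfam.cm", "assembly.fasta", "rfam_hits.tbl")
```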
37 RNAmmer
- Predicts ribosomal RNA
- Accepts prokaryotic and eukaryotic inputs
- Uses hidden Markov models (HMMs) at 2 levels:
  - Spotter model
    - Detects approximate gene position
    - Flanking regions are extracted and sent to the full model
  - Full model
    - Matches the entire gene
38 RNAmmer
RNAmmer has 2 components:
- rnammer: wrapper
  - Initializes/configures the search of the input sequence
- core-rnammer: core Perl program
  - Searches both strands (in parallel)
39 RNAmmer
- Cited by 1631 (Lagesen et al. 2007)
- Released in 2007
- Web server or download
- Length limit: 10,000,000 nucleotides
- Pre-screens the sample, resulting in quick analysis but possible loss of sensitivity
- The pre-screening step also makes it a useful tool for large datasets
- Runs in parallel
40 RNAcon - Classification of non-coding RNAs
- Developed by the Bioinformatics Center at the Institute of Microbial Technology
- Two-step process:
  1) Predict whether a sequence is coding or non-coding RNA
  2) Classify ncRNAs into their respective classes
- Step 1 uses a Support Vector Machine (SVM) based machine learning model with a tri-nucleotide composition (TNC) model to predict whether a sequence is ncRNA
  - SVM: pattern-based recognition built on the TNC model
  - Learns the different types of nucleotide composition in coding vs non-coding RNA, i.e. heavy GC in coding RNA and heavy uracil in ncRNA
- Step 2 predicts secondary structures of the ncRNA using the IPknot software
  - The structures are then used to calculate 20 different graph properties using the igraph R package
  - The numerical values are then funneled into RandomForest in WEKA
  - WEKA: a collection of visualization tools and algorithms for data analysis and predictive modeling (Java)
  - A RandomForest-based model classifies the ncRNA into 18 different classes.
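The tri-nucleotide composition features fed to the SVM can be sketched as a 64-value frequency vector. This is an illustration of the general idea, not RNAcon's code: it assumes overlapping windows, converts T to U, and skips windows containing ambiguous bases.

```python
from itertools import product

# All 64 trinucleotides over the RNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]


def trinucleotide_composition(seq):
    """Return a 64-element list of trinucleotide frequencies.

    Overlapping windows are used (an assumption); windows containing
    characters other than A, C, G, U are ignored.
    """
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    valid = 0
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if window in counts:
            counts[window] += 1
            valid += 1
    if valid == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[t] / valid for t in TRINUCLEOTIDES]
```

A vector like this could then be passed to any SVM implementation as the feature row for one sequence.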
41 RNAcon Algorithm
42 RNAcon
Pros:
- Better performance than AUGUSTUS, GeneMark.hmm, and Glimmer.hmm
- Computationally simpler than other SVM-based methods such as CONC and CPC
- Highest MCC score (.76 MCC)
- Much quicker and computationally less expensive
- Web GUI and application as well as a stand-alone version
- Provides predicted structure of ncRNA and classification
Cons:
- Machine learning algorithms are intrinsically dependent on many factors
- Prediction accuracy is only as good as the learning/testing data
- Over-optimization issues
43 Overview: background, ab initio prediction tools, RNA prediction tools, homology-based prediction tools, combo tools, final statements
44 ho·mol·o·gy (\hō-ˈmä-lə-jē, hə-\) n. the existence of shared ancestry between a pair of structures, or genes, in different species
45 ho·mol·o·gy_terminology
46 ho·mol·o·gy
Since coding sequences are conserved over evolutionary time, homology-based gene prediction can use a database to find significant homology between novel and known gene sequences.
47 ho·mol·o·gy_tools
- BLAST: Basic Local Alignment Search Tool
- BLAT: BLAST-Like Alignment Tool
48 ho·mol·o·gy_blast
- method for rapid searching of nucleotide and protein databases
- detects similarities that may provide important clues to the function of uncharacterized proteins
- faster than FASTA and the original Smith-Waterman implementation
49 ho·mol·o·gy_blast
how it works:
50 ho·mol·o·gy_blat
- similar to BLAST, but not as flexible
- finds similarities quickly, but it needs an exact or nearly exact match to find a hit
- faster than BLAST and much more memory efficient because of indexing
Kent WJ. BLAT - The BLAST-like alignment tool. Genome Res. 2002;12(4): doi: /gr
51 ho·mol·o·gy_pros/cons
Pros:
- fast implementation
- high accuracy
- web version available, no download/installation necessary
- de facto standard
Cons:
- does not guarantee optimal alignment
- returns only one best alignment
- produces only ungapped local alignments
52 Overview: background, ab initio prediction tools, RNA prediction tools, homology-based prediction tools, combo tools, final statements
53 EuGene-PP
- Open integrative gene finder
- Compared to most existing gene finders, EuGene is characterized by its ability to simply integrate arbitrary sources of information in its prediction process, including RNA-Seq, protein similarities, homologies, and various statistical sources of information.
- Based on all the available information, EuGene will output a prediction of maximal score, i.e. one maximally consistent with the information provided.
54 Features
Data integrated:
- Markov models of coding regions, trained on regions with strong similarities to a reference protein databank.
- Regions of similarity with different protein databanks.
- A set of CDS predictions produced by a reliable self-training ab initio gene finder. Prodigal is used.
- A set of predicted non-coding RNA genes (ncRNA). tRNAscan-SE, rfam_scan, and RNAmmer are used.
- A set of profiles of measured expression on each strand along the genome (RNA-Seq data) that shows transcription.
- A set of potential transcription start sites, defined as points of sudden increase in expression.
55 Advantages
- Predicts many smaller genes
- It can run using just FASTA genomic sequences and expression data, and has no parameter to tune
- Prediction is performed independently on each strand, allowing for the prediction of antisense genes.
56 Maker2
- Genome annotation and data management tool
- Can be executed with different ab initio programs (e.g. GeneMark, Augustus, SNAP)
- Supposedly, ab initio programs give better results when included in the pipeline
- Gives good results even if the training data is of poor quality
- Tests were done only on eukaryotes, but it works with prokaryotes
- Runs fast
57 Overview: background, ab initio prediction tools, RNA prediction tools, homology-based prediction tools, combo tools, final statements
58 Proposed Workflow
59 Questions?
60 References
- Lukashin, A. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research 26, no. 4 (1998):
- Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29, no. 12 (2001):
- Lagesen, K. "RNAmmer: Consistent and rapid annotation of ribosomal RNA genes." Nucleic Acids Research (2007)
- RNAcon: net/raghava/rnacon/index.html
- Pati, A. "GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes." Nat Methods Jun;7(6):455-7
- Yandell, M. "MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects." BMC Bioinformatics. 2011; 12: 491.
- Prodigal -
- Lowe TM, Eddy SR. "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence." Nucleic Acids Res. 1997;25(5):
- Rfam - Nawrocki, Eric P., et al. "Rfam 12.0: updates to the RNA families database." Nucleic Acids Research (2014): gku1063.
- Nawrocki EP, Eddy SR. "Infernal 1.1: 100-fold faster RNA homology searches." Bioinformatics. 2013;29(22):
61 References
- Infernal - Eddy, Sean R., and Richard Durbin. "RNA sequence analysis using covariance models." Nucleic Acids Research (1994):
- FGenesB -
- HMMs -
Videos Bozeman Transcription and Translation: https://youtu.be/h3b9arupxzg Drawing transcription and translation: https://youtu.be/6yqplgnjr4q Objectives 29a) I can contrast RNA and DNA. 29b) I can explain
More informationGenomic region (ENCODE) Gene definitions
DNA From genes to proteins Bioinformatics Methods RNA PROMOTER ELEMENTS TRANSCRIPTION Iosif Vaisman mrna SPLICE SITES SPLICING Email: ivaisman@gmu.edu START CODON STOP CODON TRANSLATION PROTEIN From genes
More informationAC Algorithms for Mining Biological Sequences (COMP 680)
AC-04-18 Algorithms for Mining Biological Sequences (COMP 680) Instructor: Mathieu Blanchette School of Computer Science and McGill Centre for Bioinformatics, 332 Duff Building McGill University, Montreal,
More informationLeonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015
Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A Computational Genomics 2/18/2015 The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn Bioinformatics bottleneck
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 08: Gene finding aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggc tatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt
More informationMachine Learning. HMM applications in computational biology
10-601 Machine Learning HMM applications in computational biology Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 Biological data is rapidly
More informationPositional Preference of Rho-Independent Transcriptional Terminators in E. Coli
Positional Preference of Rho-Independent Transcriptional Terminators in E. Coli Annie Vo Introduction Gene expression can be regulated at the transcriptional level through the activities of terminators.
More informationEnsembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets
Ensembl workshop Thomas Randall, PhD tarandal@email.unc.edu bioinformatics.unc.edu www.unc.edu/~tarandal/ensembl handouts, papers, datasets Ensembl is a joint project between EMBL - EBI and the Sanger
More informationYear III Pharm.D Dr. V. Chitra
Year III Pharm.D Dr. V. Chitra 1 Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Only one strand of DNA serves
More informationRNA Genomics. BME 110: CompBio Tools Todd Lowe May 14, 2010
RNA Genomics BME 110: CompBio Tools Todd Lowe May 14, 2010 Admin WebCT quiz on Tuesday cover reading, using Jalview & Pfam Homework #3 assigned today due next Friday (8 days) In Genomes, Two Types of Genes
More informationAnalysis of Biological Sequences SPH
Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu nuts and bolts meet Tuesdays & Thursdays, 3:30-4:50 no exam; grade derived from 3-4 homework assignments plus a final project (open book,
More informationConcepts and methods in genome assembly and annotation
BCM-2002 Concepts and methods in genome assembly and annotation B. Franz LANG, Département de Biochimie Bureau: H307-15 Courrier électronique: Franz.Lang@Umontreal.ca Outline 1. What is genome assembly?
More informationWorkflows and Pipelines for NGS analysis: Lessons from proteomics
Workflows and Pipelines for NGS analysis: Lessons from proteomics Conference on Applying NGS in Basic research Health care and Agriculture 11 th Sep 2014 Debasis Dash Where are the protein coding genes
More informationI. Gene Expression Figure 1: Central Dogma of Molecular Biology
I. Gene Expression Figure 1: Central Dogma of Molecular Biology Central Dogma: Gene Expression: RNA Structure RNA nucleotides contain the pentose sugar Ribose instead of deoxyribose. Contain the bases
More informationThe Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica
The Ensembl Database Dott.ssa Inga Prokopenko Corso di Genomica 1 www.ensembl.org Lecture 7.1 2 What is Ensembl? Public annotation of mammalian and other genomes Open source software Relational database
More informationGene Finding Genome Annotation
Gene Finding Genome Annotation Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics Population biology & evolution Medical genomics
More informationQuestion 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.
Bio4342 Exercise 1 Answers: Detecting and Interpreting Genetic Homology (Answers prepared by Wilson Leung) Question 1: Low complexity DNA can be described as sequences that consist primarily of one or
More informationRNA Genomics II. BME 110: CompBio Tools Todd Lowe & Andrew Uzilov May 17, 2011
RNA Genomics II BME 110: CompBio Tools Todd Lowe & Andrew Uzilov May 17, 2011 1 TIME Why RNA? An evolutionary perspective The RNA World hypotheses: life arose as self-replicating non-coding RNA (ncrna)
More informationApplied bioinformatics in genomics
Applied bioinformatics in genomics Productive bioinformatics in a genome sequencing center Heiko Liesegang Warschau 2005 The omics pyramid: 1. 2. 3. 4. 5. Genome sequencing Genome annotation Transcriptomics
More informationBioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University
Bioinformatics: Sequence Analysis COMP 571 Luay Nakhleh, Rice University Course Information Instructor: Luay Nakhleh (nakhleh@rice.edu); office hours by appointment (office: DH 3119) TA: Leo Elworth (DH
More informationSequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing
Sequence Analysis II: Sequence Patterns and Matrices George Bell, Ph.D. WIBR Bioinformatics and Research Computing Sequence Patterns and Matrices Multiple sequence alignments Sequence patterns Sequence
More informationRNA folding & ncrna discovery
I519 Introduction to Bioinformatics RNA folding & ncrna discovery Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents Non-coding RNAs and their functions RNA structures RNA folding
More informationSmall Genome Annotation and Data Management at TIGR
Small Genome Annotation and Data Management at TIGR Michelle Gwinn, William Nelson, Robert Dodson, Steven Salzberg, Owen White Abstract TIGR has developed, and continues to refine, a comprehensive, efficient
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationBiotechnology Unit 3: DNA to Proteins. From DNA to RNA
From DNA to RNA Biotechnology Unit 3: DNA to Proteins I. After the discovery of the structure of DNA, the major question remaining was how does the stored in the 4 letter code of DNA direct the and of
More informationI AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador
I AM NOT A METAGENOMIC EXPERT I am merely the MESSENGER Blaise T.F. Alako, PhD EBI Ambassador blaise@ebi.ac.uk Hubert Denise Alex Mitchell Peter Sterk Sarah Hunter http://www.ebi.ac.uk/metagenomics Blaise
More informationChimp Sequence Annotation: Region 2_3
Chimp Sequence Annotation: Region 2_3 Jeff Howenstein March 30, 2007 BIO434W Genomics 1 Introduction We received region 2_3 of the ChimpChunk sequence, and the first step we performed was to run RepeatMasker
More informationDesigning Filters for Fast Protein and RNA Annotation. Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler
Designing Filters for Fast Protein and RNA Annotation Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler 1 Outline Background on sequence annotation Protein annotation acceleration
More informationIdentifying Regulatory Regions using Multiple Sequence Alignments
Identifying Regulatory Regions using Multiple Sequence Alignments Prerequisites: BLAST Exercise: Detecting and Interpreting Genetic Homology. Resources: ClustalW is available at http://www.ebi.ac.uk/tools/clustalw2/index.html
More informationDNAFSMiner: A Web-Based Software Toolbox to Recognize Two Types of Functional Sites in DNA Sequences
DNAFSMiner: A Web-Based Software Toolbox to Recognize Two Types of Functional Sites in DNA Sequences Huiqing Liu Hao Han Jinyan Li Limsoon Wong Institute for Infocomm Research, 21 Heng Mui Keng Terrace,
More information132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:
132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, 214 1 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel
More informationa small viral insert or the like). Both types of breaks are seen in the pyrobaculum snornas.
multigenome-snoscan: A comparative approach to snorna annotation Christoph Rau Introduction and Background The mechanisms which govern the proper functioning of an organism or cell are diverse. However,
More informationGrundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:
Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 155 12 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel
More informationMODULE 5: TRANSLATION
MODULE 5: TRANSLATION Lesson Plan: CARINA ENDRES HOWELL, LEOCADIA PALIULIS Title Translation Objectives Determine the codons for specific amino acids and identify reading frames by looking at the Base
More informationGenscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,
Genscan The Genscan HMM model Training Genscan Validating Genscan (c) Devika Subramanian, 2009 96 Gene structure assumed by Genscan donor site acceptor site (c) Devika Subramanian, 2009 97 A simple model
More informationBIOLOGY - CLUTCH CH.17 - GENE EXPRESSION.
!! www.clutchprep.com CONCEPT: GENES Beadle and Tatum develop the one gene one enzyme hypothesis through their work with Neurospora (bread mold). This idea was later revised as the one gene one polypeptide
More informationTranscription is the first stage of gene expression
Transcription is the first stage of gene expression RNA synthesis is catalyzed by RNA polymerase, which pries the DNA strands apart and hooks together the RNA nucleotides The RNA is complementary to the
More informationKlinisk kemisk diagnostik BIOINFORMATICS
Klinisk kemisk diagnostik - 2017 BIOINFORMATICS What is bioinformatics? Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological,
More informationMatch the Hash Scores
Sort the hash scores of the database sequence February 22, 2001 1 Match the Hash Scores February 22, 2001 2 Lookup method for finding an alignment position 1 2 3 4 5 6 7 8 9 10 11 protein 1 n c s p t a.....
More informationGenes and How They Work. Chapter 15
Genes and How They Work Chapter 15 The Nature of Genes They proposed the one gene one enzyme hypothesis. Today we know this as the one gene one polypeptide hypothesis. 2 The Nature of Genes The central
More informationCOMPUTER RESOURCES II:
COMPUTER RESOURCES II: Using the computer to analyze data, using the internet, and accessing online databases Bio 210, Fall 2006 Linda S. Huang, Ph.D. University of Massachusetts Boston In the first computer
More informationEukaryotic Gene Prediction. Wei Zhu May 2007
Eukaryotic Gene Prediction Wei Zhu May 2007 In nature, nothing is perfect... - Alice Walker Gene Structure What is Gene Prediction? Gene prediction is the problem of parsing a sequence into nonoverlapping
More informationHow to design an HMM for a new problem. HMM model structure. Inherent limitation of HMMs. Duration modeling. Duration modeling
How to design an HMM for a new problem Architecture/topology design: What are the states, observation symbols, and the topology of the state transition graph? Learning/Training: Fully annotated or partially
More informationThe Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot
The Genetic Code and Transcription Chapter 12 Honors Genetics Ms. Susan Chabot TRANSCRIPTION Copy SAME language DNA to RNA Nucleic Acid to Nucleic Acid TRANSLATION Copy DIFFERENT language RNA to Amino
More informationGenes and gene finding
Genes and gene finding Ben Langmead Department of Computer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com)
More informationHUMAN GENOME BIOINFORMATICS. Tore Samuelsson, Dec 2009
HUMAN GENOME BIOINFORMATICS Tore Samuelsson, Dec 2009 The sequenced (gray filled) and unsequenced (white) portions of the human genome. Peter F.R. Little Genome Res. 2005; 15: 1759-1766 Human genome organisation
More informationBIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology
BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology Jeremy Buhler March 15, 2004 In this lab, we ll annotate an interesting piece of the D. melanogaster genome. Along the way, you ll get
More informationBLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments
BLAST 100 times faster than dynamic programming. Good for database searches. Derive a list of words of length w from query (e.g., 3 for protein, 11 for DNA) High-scoring words are compared with database
More informationComparative Bioinformatics. BSCI348S Fall 2003 Midterm 1
BSCI348S Fall 2003 Midterm 1 Multiple Choice: select the single best answer to the question or completion of the phrase. (5 points each) 1. The field of bioinformatics a. uses biomimetic algorithms to
More informationHow to Use This Presentation
How to Use This Presentation To View the presentation as a slideshow with effects select View on the menu bar and click on Slide Show. To advance through the presentation, click the right-arrow key or
More informationAnnotating Fosmid 14p24 of D. Virilis chromosome 4
Lo 1 Annotating Fosmid 14p24 of D. Virilis chromosome 4 Lo, Louis April 20, 2006 Annotation Report Introduction In the first half of Research Explorations in Genomics I finished a 38kb fragment of chromosome
More informationOutline. 1. Introduction. 2. Exon Chaining Problem. 3. Spliced Alignment. 4. Gene Prediction Tools
Outline 1. Introduction 2. Exon Chaining Problem 3. Spliced Alignment 4. Gene Prediction Tools Section 1: Introduction Similarity-Based Approach to Gene Prediction Some genomes may be well-studied, with
More information