Gene Prediction Background & Strategy. February 24, 2016

1 Gene Prediction Background & Strategy February 24, 2016

2 Overview
- background
- ab initio prediction tools
- RNA prediction tools
- homology-based prediction tools
- combo tools
- final statements

3 Gene Prediction
- the next step after genome assembly
- the process of identifying regions of genomic DNA that encode genes
- one of the most important steps in understanding the genome of a species once it has been fully sequenced

4 Eukaryotic vs. Prokaryotic Gene Prediction
Prokaryotes:
- Overall small genome size
- High density of CDSs
- Low numbers of repeated noncoding sequences
- Regulatory regions are close to the protein-coding sequences
- CDS regions are short
Eukaryotes:
- Opposite of prokaryotes

5 Different types of algorithms for gene prediction
1. Ab initio
  a. Identify genes based on intrinsic factors: sequences look different depending on whether they are in coding or non-coding regions
  b. Use statistical models (e.g. Markov models)
2. RNA prediction
  a. Non-coding RNAs (ncRNAs) are transcribed but not translated into protein
  b. ncRNAs have been shown to be major players in prokaryotic cellular processes
  c. Prediction methods include sets of utilities to assess predicted ncRNA genes relevant to their context, annotation, conservation, and secondary structure
3. Homology
  a. Novel sequences are compared to known sequences in a database

6 Prokaryotic Gene Structure and Characteristics
- Central Dogma
- A prokaryotic gene can be divided into:
  - regulatory elements (promoter and operator)
  - structural elements
- Polycistronic (operons) and monocistronic genes
Figure credit: Thomas Shafee, CC BY-SA 4.0

7 Important features to be considered during gene prediction
- Stop codons: 3 out of 64 codons, so the expected random occurrence is about 1 in 21 (3/64 ≈ 0.047)
- GC content differs between coding and noncoding regions
- Reading frames and frameshifts
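
The stop-codon arithmetic and the GC-content feature above can be checked with a few lines of Python. This is only a sketch: it assumes all four bases are equally likely and independent, which real genomes violate, and the helper names are hypothetical.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_codon_fraction():
    """Fraction of the 64 possible codons that are stop codons."""
    codons = ["".join(c) for c in product("ACGT", repeat=3)]
    return sum(c in STOP_CODONS for c in codons) / len(codons)

def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

p_stop = stop_codon_fraction()   # 3/64 under the uniform-base assumption
expected_gap = 1 / p_stop        # mean number of codons per stop in random sequence
print(f"P(stop) = {p_stop:.4f}, about 1 in {expected_gap:.1f} codons")
print(f"GC content of ATGCGC: {gc_content('ATGCGC'):.2f}")
```

Long stretches without a stop codon are therefore unlikely by chance, which is why long open reading frames are treated as evidence for genes.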

8 Markov Models A Markov chain is a discrete random process that undergoes transitions from one state to another on a state space. "Memorylessness" - next state depends only on the current state

9 Figures: two-state Markov chain; three-state Markov chain. The numbers represent the probability of transition from one state to another state.

10 Weather prediction Markov chain example
Given: today is sunny. Find the probability that it will be sunny tomorrow and rainy the day after.
P(Day2 = Sunny, Day3 = Rainy | Day1 = Sunny) = P(Day2 = Sunny | Day1 = Sunny) × P(Day3 = Rainy | Day2 = Sunny) = 0.8 × 0.05 = 0.04
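
The same calculation can be written as a short Python sketch. Only the two Sunny-row values 0.8 and 0.05 come from the slide; the third state ("Cloudy") and every other number in the table are assumed placeholders chosen so that each row sums to 1.

```python
# Transition probabilities P(next | current).
# Only Sunny->Sunny (0.8) and Sunny->Rainy (0.05) are from the slide;
# all other entries, including the "Cloudy" state, are assumptions.
TRANSITIONS = {
    "Sunny":  {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.6,  "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3,  "Cloudy": 0.5},
}

def path_probability(start, path, transitions=TRANSITIONS):
    """Probability of visiting `path` in order, given the chain starts in `start`."""
    prob = 1.0
    current = start
    for nxt in path:
        # Memorylessness: each factor depends only on the current state.
        prob *= transitions[current][nxt]
        current = nxt
    return prob

result = path_probability("Sunny", ["Sunny", "Rainy"])
print(f"{result:.2f}")
```

Because the chain is memoryless, the probability of a whole path is just the product of one-step transition probabilities.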

11 Hidden Markov Models (HMMs)
- Utilized by many gene prediction tools
- There is a hidden state, which must be derived from emissions
- When using HMMs you must specify a model:
  - Hidden states: coding sequence, non-coding sequence
  - Observed emissions: A, C, T, G
  - Number of states
  - Possible transitions
  - Learning material/time
- The most probable hidden states can be predicted from the model parameters by using dynamic programming
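
The dynamic programming step mentioned above is usually the Viterbi algorithm. Below is a minimal sketch for the two hidden states on this slide; all start, transition, and emission probabilities are made-up illustration values, not parameters from any real gene finder.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities below are assumed values for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Return the most probable hidden-state path for a non-empty DNA string."""
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("GCGCGCGCGCATATATATAT")
print(path)
```

With these assumed parameters the GC-rich half is labeled coding and the AT-rich half non-coding, since ten bases of emission evidence outweigh the cost of one state switch.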

12 HMM for 5′ Splice Site Recognition
- States: Begin, Exon, Donor, Intron
- Observations: A, C, G, T

13 IMM Example
Guess the word that comes after "his":
- his (favorite, little, ???)
- with his (hands, ???)
- off with his (probably "head"?)
Problem: how many bases do you look at when you're trying to predict the next one?
- Looking at more bases requires a larger training set
- There are 4^k possible k-mers, so in a fixed training set of genes each individual k-mer is observed less often as k grows
- Once k-mers grow long, your training set must also grow substantially to give accurate estimates of the transition probabilities
- If k-mers are too short, their predictive power is not as strong

14 Interpolated Markov Models (IMMs)
- Sometimes you might not be sure how much memory to give your regular Markov model
- That's where IMMs come in handy
- Computes the probability of the next base by combining (interpolating) the probabilities conditioned on the most recent 1 through k bases, giving more weight to context lengths that are well supported in the training data
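
A simplified sketch of the interpolation idea follows. It is not GLIMMER's actual weighting scheme: the weight `lam = min(1, count / threshold)` and the default `threshold` are assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def train_counts(sequences, max_order):
    """counts[k][context][base] = times `base` followed the k-base `context`."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][base] += 1
    return counts

def imm_probability(counts, context, base, threshold=20):
    """Interpolate next-base estimates from order 0 up to len(context)."""
    # Order 0 with add-one smoothing so the estimate is never zero.
    zero = counts[0][""]
    prob = (zero.get(base, 0) + 1) / (sum(zero.values()) + 4)
    for k in range(1, min(len(context), len(counts) - 1) + 1):
        seen = counts[k].get(context[-k:])
        if not seen:
            break  # this context never occurred; keep the shorter-context estimate
        total = sum(seen.values())
        lam = min(1.0, total / threshold)  # assumed confidence weight
        prob = lam * seen.get(base, 0) / total + (1 - lam) * prob
    return prob

counts = train_counts(["ACGTACGTACGT"], max_order=2)
probs = {b: imm_probability(counts, "AC", b) for b in "ACGT"}
print(probs)
```

Rarely seen long contexts get a small weight and fall back on shorter ones, so the model uses long memory only where the training data supports it.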

16 GeneMark
- Developed by: Dr. Borodovsky's group, Georgia Tech
- Utilizes Markov models of coding and non-coding regions together with a Bayes decision-making function
- Deals simultaneously with direct and reverse DNA strands
- Statistical patterns in functional regions of the genome were used to calculate transition probability matrices
Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry, 1993, Vol. 17, No. 19, pp

17 GeneMark.hmm
- Improvement to find exact gene starts
- GeneMark models embedded into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states
- Viterbi algorithm to find the most likely sequence of hidden states
- Ribosome binding site pattern to refine predictions of translation initiation codons
Lukashin, A. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research 26, no. 4 (1998):

18 GeneMarkS
- GeneMark.hmm with heuristic models
- Non-supervised training procedure
- Any sequence > 400 nt
- Gibbs sampling to align upstream sequences
- Predicts the correct translation initiation site
Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29, no. 12 (2001):

19 FGenesB
- An accurate ab initio prokaryotic gene prediction program package based on hidden Markov models
- Predicts operons by promoter and terminator sequence identification
- Additionally annotates genes based on homology
- Finds and masks rRNA/tRNA genes
  - rRNA found by BLAST against an rRNA database
  - tRNA found by the tRNAscan-SE program
- Initial predictions of long ORFs are used as a starting point for calculating parameters for gene prediction; iterates until the parameters stabilize
- Uses 5th-order in-frame Markov chains for coding regions and 2nd-order Markov models for translation and termination sites

20 FGenesB
- Predicts operons based only on distances between predicted genes
- Runs BLASTP for predicted proteins against the COG database, cog.pro
- Improves operon prediction based on conservation of neighboring gene pairs in known genomes
- Runs BLASTP against NR for proteins having no COG hits
- Predicts potential promoters or terminators in upstream and downstream regions, correspondingly, of predicted genes
- Refines operon predictions using predicted promoters and terminators as additional evidence

21 FGenesB
- Accurate
- Light web-based version along with a downloadable version with more functionality
- All you need for input is genomic DNA

22 Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
- Developed by: Oak Ridge National Laboratory
- Prodigal's algorithm for gene prediction follows the basic principle of KISS (Keep It Simple, Stupid)
- No hidden Markov model, no interpolated Markov model
- Dynamic programming with log-likelihood functions
Algorithm steps:
1. Constructing a training set for protein coding: GC frame plot based training
2. Building log-likelihood coding statistics from the training data: every potential gene is scored
3. Sharpening coding scores: penalizes all potential start candidates that lie downstream from a higher-scoring start
4. Length factor to coding scores: adds a static score to long ORFs with negative coding scores
5. Iterative start training: for all ORFs containing a gene with a coding score above a certain threshold, the translation initiation site with the highest coding score is recorded. These starts are rescored by ATG/GTG/TTG frequency and are then used for training
6. Final dynamic programming: gene calling

23 Dynamic Programming Connections in Prodigal
Figure: forward strand (5′ to 3′) and reverse strand (3′ to 5′). Red arrows: gene connections. Black arrows: intergenic connections. Blue pieces: potential genes.
The score of a "gene" connection is the precalculated coding score for that gene, whereas the score for an intergenic connection is a small bonus or penalty based on the distance between the two genes.
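
The scoring idea above can be sketched as a small dynamic program over candidate genes. This is a strong simplification and not Prodigal's implementation: it handles a single strand, forbids any overlap, and the `intergenic_score` function and the example scores are hypothetical.

```python
def best_gene_chain(genes, intergenic_score):
    """Pick the highest-scoring chain of non-overlapping genes.

    genes: list of (start, end, coding_score) with end exclusive.
    intergenic_score: function mapping a gap length to a bonus or penalty.
    Returns (total_score, chosen_genes).
    """
    genes = sorted(genes, key=lambda g: g[1])
    best = []  # best[i] = (score, chain) for the best chain ending with genes[i]
    for i, (start, end, score) in enumerate(genes):
        top = (score, [genes[i]])
        for j in range(i):
            prev_score, prev_chain = best[j]
            gap = start - genes[j][1]
            if gap >= 0:  # overlaps are not allowed in this sketch
                cand = prev_score + intergenic_score(gap) + score
                if cand > top[0]:
                    top = (cand, prev_chain + [genes[i]])
        best.append(top)
    return max(best, key=lambda b: b[0]) if best else (0.0, [])

# Hypothetical candidates: the first two overlap, so only one can be chosen.
candidates = [(0, 100, 5.0), (50, 200, 8.0), (210, 300, 4.0)]
total, chain = best_gene_chain(candidates, lambda gap: 0.5 if gap < 50 else -0.5)
print(total, chain)
```

Each gene contributes its precomputed coding score and each connection contributes a distance-based bonus or penalty, mirroring the two arrow types in the figure.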

24 Prodigal
Key features:
- Speed: can analyze an entire microbial genome in 30 seconds
- Accuracy: when tested on curated data sets, Prodigal's accuracy is similar to the other top gene prediction tools (GeneMarkS, Glimmer)
- Specificity: under 5% false-positive discovery rate
- GC content: unlike other tools, Prodigal works well with high-GC genomes because it implements GC frame plot based training and changes parameters based on GC content
- Easy to use

25 GLIMMER3 (Gene Locator and Interpolated Markov ModelER 3)
- Developed by: Center for Computational Biology, Johns Hopkins University
- Interpolated context model (ICM) based approach
- GLIMMER1 used an interpolated Markov model; GLIMMER2 and GLIMMER3 use an ICM
Algorithm steps:
1. Identify open reading frames (ORFs)
2. Starting from the stop codon and working backwards (3′ to 5′), calculate the probability of each nucleotide being part of a coding region
  a. Probability is calculated based on context (bases preceding the current position) using the ICM
3. Calculate the cumulative log-likelihood sum; the peak is the likely location of the start codon
4. Select the set of ORFs that maximizes total score with no overlaps greater than a specified maximum
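
The first step, identifying ORFs, can be sketched as a simple scan. This is a generic illustration rather than GLIMMER's code: it scans only the forward strand, reads left to right instead of backwards from the stop, and takes the first start codon in each frame; the start codon set and `min_codons` default are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # common prokaryotic start codons

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; coordinates are 0-based.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

orfs = find_orfs("CCATGAAATTTTAGCC")
print(orfs)
```

A real gene finder would also scan the reverse complement and then score each ORF with its coding model before resolving overlaps.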

26 GLIMMER
- IMM: looks at the bases immediately preceding the target
  - Bases immediately preceding the target aren't always the most informative (e.g. third nucleotide in a codon)
- ICM: finds the bases in the context region that most strongly correlate with the target

27 GLIMMER3
Pros:
1. Higher number of unique gene calls than Prodigal and GeneMarkS
2. Better performance than fixed-order Markov models
Cons:
1. Higher error rate than Prodigal and GeneMarkS
  a. False positives

28 GenePRIMP: GENE PRediction IMprovement Pipeline for Prokaryotic genomes
Takes gene calls in EMBL or GenBank format as input and outputs a report of gene prediction anomalies:
1. CRISPR finder
2. Overlaps between features
3. BLASTing and filtering proteins
4. Classify into long/short, broken and interrupted genes
5. Intergenic regions
Pati A, Ivanova NN Nat Methods Jun;7(6):455-7

29 GenePRIMP: GENE PRediction IMprovement Pipeline for Prokaryotic genomes
- Short and long genes are classified through an alignment quality score: α = (cq - ch)/(cq + ch)
- Broken genes
- Interrupted genes
  - Frameshifts
  - Pseudogenes
Pati A, Ivanova NN Nat Methods Jun;7(6):455-7

30 GenePRIMP Pros/Cons
- Can significantly decrease the number of incorrectly predicted genes and be used as a quality control step
- Attempts to detect and fix interrupted genes (e.g. when a spurious stop codon is present)
- Classifies genes with interrupted translation frames as pseudogenes, which results in a higher rate of missed gene calls
H. James Tripp Standards in Genomic Sciences /s

32 tRNAscan-SE
- tRNAscan 1.4 (optimized version of 1.3): uses a hierarchical, rule-based system in which potential tRNAs must exceed empirically determined similarity thresholds and be able to form the base pairing present in tRNA stem-loop structures.
- EufindtRNA (Pavesi's algorithm): searches exclusively for linear sequence signals in the form of eukaryotic RNA polymerase III promoters and terminators. Effectively identifies prokaryotic tRNAs with an adjustment to the cutoff score (implemented through the command-line option -P).
- Covels: takes candidate tRNA sequences plus 7 flanking nucleotides and applies a tRNA covariance model made by structurally aligning 1415 tRNAs from the 1993 Sprinzl database.
- Coves: takes predicted tRNAs confirmed with Covels log-odds scores > 20.0 bits, trims tRNA bounds, and predicts secondary structure through global structure alignment to the tRNA covariance model.

33 tRNAscan-SE
- Can be run via the online web server or downloaded and run locally
- Input: FASTA
- Output: tabular, ACeDB, or secondary structure format

34 tRNAscan-SE
- Detects % of true tRNAs with less than 1 false positive per 15 billion nucleotides
- ~ times the speed of tRNA covariance models (~30,000 bp/s)
- Additional extensions allow for detection of unusual tRNA species, including selenocysteine tRNA genes, tRNA-derived repetitive elements and pseudogenes

35 Rfam 12.0
- Developed by the Wellcome Trust Sanger Institute
- Currently hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI)
- Database of 2450 RNA families represented by manually curated multiple sequence alignments (MSAs), consensus secondary structures and covariance models (CMs)
- CMs, or profile stochastic context-free grammars, are probabilistic models of the conserved sequence and secondary structure of an RNA family
  - Analogous to HMMs, but rather than each position of the model being independent, CM base-paired positions depend on one another
  - The added complexity allows for the modeling of secondary structures, which are often more conserved than primary sequences in functional RNA
- Families are broken down into three functional groups:
  - non-coding RNA genes
  - structured cis-regulatory elements
  - self-splicing RNAs

36 Rfam 12.0 and Infernal 1.1: Process for RNA Prediction
Rfam pipeline with Infernal 1.1
- Infernal 1.1 is a software package that searches DNA sequence databases for RNA structure and sequence similarities
- Install Infernal 1.1 and download the Rfam 12.0 library of CMs
- Run Infernal's cmscan
  - Takes a query sequence and a CM database as input parameters
  - Returns known/detectable structural RNAs in the given sequence, as well as information about whether the sequence contains homologies to any known RNA families in the library

37 RNAmmer
- Predicts ribosomal RNA
- Accepts prokaryotic and eukaryotic inputs
- Uses hidden Markov models (HMMs) at 2 levels:
  - Spotter model: detects the approximate gene position; flanking regions are extracted and sent to the full model
  - Full model: matches the entire gene

38 RNAmmer
RNAmmer has 2 components:
- rnammer: wrapper that initializes/configures the search of the input sequence
- core-rnammer: core Perl program that searches both strands (in parallel)

39 RNAmmer
- Cited by 1631 (Lagesen et al. 2007)
- Released in 2007
- Webserver or download
- Length limit: 10,000,000 nucleotides
- Pre-screens the sample, resulting in quick analysis but possible loss of sensitivity
- The pre-screening step also makes it a useful tool for large datasets
- Runs in parallel

40 RNAcon - Classification of non-coding RNAs
- Developed by the Bioinformatics Center at the Institute of Microbial Technology
- Two-step process:
  1) Predict whether a sequence is coding or non-coding RNA
  2) Classify ncRNAs into their respective classes
- Step 1 uses a Support Vector Machine (SVM) based machine learning model with a tri-nucleotide composition (TNC) model to predict whether a sequence is ncRNA
  - SVM: pattern recognition based on the TNC model
  - Learns the different types of nucleotide composition in coding vs non-coding RNA, e.g. heavy GC in coding RNA and heavy uracil in ncRNA
- Step 2 predicts secondary structures of the ncRNA using IPknot software
  - The structures are then used to calculate 20 different graph properties using the igraph R package
  - The numerical values are then fed into RandomForest in WEKA
  - WEKA: a collection of visualization tools and algorithms for data analysis and predictive modeling (Java)
  - A RandomForest-based model classifies the ncRNA into 18 different classes
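
As an illustration of the tri-nucleotide composition feature, the sketch below turns a sequence into a 64-value vector that a classifier such as an SVM could take as input. Counting overlapping windows and normalizing by the number of windows are assumptions; RNAcon's exact feature definition may differ.

```python
from itertools import product

# All 64 possible RNA trinucleotides in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]

def trinucleotide_composition(rna):
    """Return 64 frequencies, one per trinucleotide, from overlapping windows."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    windows = [rna[i:i + 3] for i in range(len(rna) - 2)]
    total = len(windows)
    return [windows.count(t) / total if total else 0.0 for t in TRINUCLEOTIDES]

features = trinucleotide_composition("AUGGCUAUG")
aug = features[TRINUCLEOTIDES.index("AUG")]
print(len(features), round(aug, 3))
```

Every sequence maps to a vector of the same length regardless of its own length, which is what makes composition features convenient for machine learning models.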

41 RNAcon Algorithm

42 RNAcon
Pros:
- Better performance than AUGUSTUS, GeneMark.hmm, and Glimmer.hmm
- Computationally simpler than other SVM-based methods such as CONC and CPC
- Highest MCC score (0.76 MCC)
- Much quicker and computationally less expensive
- Web GUI and application as well as a stand-alone version
- Provides predicted structure of ncRNA and classification
Cons:
- Machine learning algorithms are intrinsically dependent on many factors
- Prediction accuracy is only as good as the training and testing data
- Over-optimization issues

44 ho mol o gy (\hō-ˈmä-lə-jē, hə-\) n. the existence of shared ancestry between a pair of structures, or genes, in different species

45 ho mol o gy_terminology

46 ho mol o gy Since coding sequences are conserved over evolutionary time, homology based gene prediction can use the database to find significant homology between novel and known gene sequences.

47 ho mol o gy_tools
- BLAST: Basic Local Alignment Search Tool
- BLAT: BLAST-Like Alignment Tool

48 ho mol o gy_blast - method for rapid searching of nucleotide and protein databases - detects similarities that may provide important clues to the function of uncharacterized proteins. - faster than FASTA and the original Smith-Waterman implementation

49 ho mol o gy_blast how it works:
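
The core seed-and-extend heuristic behind BLAST can be sketched as follows. This is a toy, ungapped, nucleotide-only version with assumed scoring values (`word`, `match`, `mismatch`, `x_drop`); it omits neighborhood words, gapped extension, and the statistical significance estimates that the real program provides.

```python
from collections import defaultdict

def seed_and_extend(query, subject, word=4, match=1, mismatch=-1, x_drop=3):
    """Return the best ungapped hit as (score, query_start, subject_start, length)."""
    # Index every word of the subject so exact seed hits are found quickly.
    index = defaultdict(list)
    for j in range(len(subject) - word + 1):
        index[subject[j:j + word]].append(j)

    def extend(qi, sj, step):
        """Extend from (qi, sj) in direction `step`; return (best_gain, best_len)."""
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score fell too far below the best seen: stop extending
            qi += step
            sj += step
        return best, best_len

    top = (0, 0, 0, 0)
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            right, r_len = extend(i + word, j + word, 1)
            left, l_len = extend(i - 1, j - 1, -1)
            hit = (word * match + left + right, i - l_len, j - l_len,
                   word + l_len + r_len)
            top = max(top, hit)
    return top

best_hit = seed_and_extend("TTACGTACGGA", "CCCACGTACGGTT")
print(best_hit)
```

Restricting the expensive extension step to positions that share an exact word is what makes this approach much faster than a full dynamic programming alignment, at the cost of possibly missing the optimal alignment.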

50 ho mol o gy_blat
- similar to BLAST, but not as flexible
- finds similarities quickly, but it needs an exact or nearly exact match to find a hit
- faster than BLAST and much more memory efficient because of indexing
Kent WJ. BLAT - The BLAST-like alignment tool. Genome Res. 2002;12(4): doi: /gr

51 ho mol o gy_pros/cons
Pros:
- fast implementation
- high accuracy
- web version available, no download/installation necessary
- de facto standard
Cons:
- does not guarantee optimal alignment
- returns only one best alignment
- produces only ungapped local alignments

53 EuGene-PP Open integrative gene finder Compared to most existing gene finders, EuGene is characterized by its ability to simply integrate arbitrary sources of information in its prediction process, including RNA-Seq, protein similarities, homologies and various statistical sources of information. Based on all the available information, EuGene will output a prediction of maximal score i.e., maximally consistent with the information provided.

54 Features
Data integrated:
- Markov models of coding regions trained on regions with strong similarities to a reference protein databank
- Regions of similarity with different protein databanks
- A set of CDS predictions produced by a reliable self-training ab initio gene finder; Prodigal is used
- A set of predicted non-coding RNA (ncRNA) genes; tRNAscan-SE, rfam_scan and RNAmmer are used
- A set of profiles of measured expression on each strand along the genome (RNA-Seq data) that shows transcription
- A set of potential transcription start sites, defined as points of sudden increase in expression

55 Advantages
- Predicts many smaller genes
- It can run using just FASTA genomic sequences and expression data, and has no parameters to tune
- Prediction is performed independently on each strand, allowing for the prediction of antisense genes

56 Maker2
- Genome annotation and data management tool
- Can be executed with different ab initio programs (e.g. GeneMark, Augustus, SNAP)
- Supposedly, ab initio programs give better results when included in the pipeline
- Gives good results even if the training data is of poor quality
- Tests were done only on eukaryotes, but it works with prokaryotes
- Runs fast

58 Proposed Workflow

59 Questions?

60 References
- Lukashin, A. "GeneMark.hmm: New Solutions for Gene Finding." Nucleic Acids Research 26, no. 4 (1998):
- Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29, no. 12 (2001):
- Lagesen, K. "RNAmmer: Consistent and rapid annotation of ribosomal RNA genes." Nucleic Acids Research (2007)
- RNAcon - net/raghava/rnacon/index.html
- Pati, A. "GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes." Nat Methods Jun;7(6):455-7
- Yandell, M. "MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects." BMC Bioinformatics. 2011; 12: 491.
- Prodigal -
- Lowe TM, Eddy SR. "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence." Nucleic Acids Res. 1997;25(5):
- Rfam - Nawrocki, Eric P., et al. "Rfam 12.0: updates to the RNA families database." Nucleic Acids Research (2014): gku1063.
- Nawrocki EP, Eddy SR. "Infernal 1.1: 100-fold faster RNA homology searches." Bioinformatics. 2013;29(22):

61 References
- Infernal - Eddy, Sean R., and Richard Durbin. "RNA sequence analysis using covariance models." Nucleic Acids Research (1994):
- FGenesB -
- HMMs -


Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans. David Wang Bio 434W 4/27/15 Annotation of contig27 in the Muller F Element of D. elegans Abstract Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans. Genscan predicted six

More information

Data Mining for Biological Data Analysis

Data Mining for Biological Data Analysis Data Mining for Biological Data Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Data Mining Course by Gregory-Platesky Shapiro available at www.kdnuggets.com Jiawei Han

More information

Prediction of noncoding RNAs with RNAz

Prediction of noncoding RNAs with RNAz Prediction of noncoding RNAs with RNAz John Dzmil, III Steve Griesmer Philip Murillo April 4, 2007 What is non-coding RNA (ncrna)? RNA molecules that are not translated into proteins Size range from 20

More information

Genome annotation & EST

Genome annotation & EST Genome annotation & EST What is genome annotation? The process of taking the raw DNA sequence produced by the genome sequence projects and adding the layers of analysis and interpretation necessary

More information

Videos. Lesson Overview. Fermentation

Videos. Lesson Overview. Fermentation Lesson Overview Fermentation Videos Bozeman Transcription and Translation: https://youtu.be/h3b9arupxzg Drawing transcription and translation: https://youtu.be/6yqplgnjr4q Objectives 29a) I can contrast

More information

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey Applications of HMMs in Computational Biology BMI/CS 576 www.biostat.wisc.edu/bmi576.html Colin Dewey cdewey@biostat.wisc.edu Fall 2008 The Gene Finding Task Given: an uncharacterized DNA sequence Do:

More information

Regulation of bacterial gene expression

Regulation of bacterial gene expression Regulation of bacterial gene expression Gene Expression Gene Expression: RNA and protein synthesis DNA ----------> RNA ----------> Protein transcription translation! DNA replication only occurs in cells

More information

Why learn sequence database searching? Searching Molecular Databases with BLAST

Why learn sequence database searching? Searching Molecular Databases with BLAST Why learn sequence database searching? Searching Molecular Databases with BLAST What have I cloned? Is this really!my gene"? Basic Local Alignment Search Tool How BLAST works Interpreting search results

More information

Computational analysis of non-coding RNA. Andrew Uzilov BME110 Tue, Nov 16, 2010

Computational analysis of non-coding RNA. Andrew Uzilov BME110 Tue, Nov 16, 2010 Computational analysis of non-coding RNA Andrew Uzilov auzilov@ucsc.edu BME110 Tue, Nov 16, 2010 1 Corrected/updated talk slides are here: http://tinyurl.com/uzilovrna redirects to: http://users.soe.ucsc.edu/~auzilov/bme110/fall2010/

More information

File S1. Program overview and features

File S1. Program overview and features File S1 Program overview and features Query list filtering. Further filtering may be applied through user selected query lists (Figure. 2B, Table S3) that restrict the results and/or report specifically

More information

Videos. Bozeman Transcription and Translation: Drawing transcription and translation:

Videos. Bozeman Transcription and Translation:   Drawing transcription and translation: Videos Bozeman Transcription and Translation: https://youtu.be/h3b9arupxzg Drawing transcription and translation: https://youtu.be/6yqplgnjr4q Objectives 29a) I can contrast RNA and DNA. 29b) I can explain

More information

Genomic region (ENCODE) Gene definitions

Genomic region (ENCODE) Gene definitions DNA From genes to proteins Bioinformatics Methods RNA PROMOTER ELEMENTS TRANSCRIPTION Iosif Vaisman mrna SPLICE SITES SPLICING Email: ivaisman@gmu.edu START CODON STOP CODON TRANSLATION PROTEIN From genes

More information

AC Algorithms for Mining Biological Sequences (COMP 680)

AC Algorithms for Mining Biological Sequences (COMP 680) AC-04-18 Algorithms for Mining Biological Sequences (COMP 680) Instructor: Mathieu Blanchette School of Computer Science and McGill Centre for Bioinformatics, 332 Duff Building McGill University, Montreal,

More information

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015 Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A Computational Genomics 2/18/2015 The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn Bioinformatics bottleneck

More information
