Methods and Algorithms for Gene Prediction

Similar documents
Using Expressing Sequence Tags to Improve Gene Structure Annotation

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Gene Signal Estimates from Exon Arrays

Eukaryotic Gene Prediction. Wei Zhu May 2007

Gene Structure & Gene Finding Part II

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

user s guide Question 3

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Baum-Welch and HMM applications. November 16, 2017

Bioinformatics for Proteomics. Ann Loraine

Genes and gene finding

Hands-On Four Investigating Inherited Diseases

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

Model. David Kulp, David Haussler. Baskin Center for Computer Engineering and Computer Science.

user s guide Question 3

Lecture 11: Gene Prediction

Chimp Sequence Annotation: Region 2_3

Gene Prediction: Statistical Approaches

Agenda. Annotation of Drosophila. Muller element nomenclature. Annotation: Adding labels to a sequence. GEP Drosophila annotation projects 01/03/2018

RNA-Sequencing analysis

Genie Gene Finding in Drosophila melanogaster

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Genome Annotation. Stefan Prost 1. May 27th, States of America. Genome Annotation

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

measuring gene expression December 5, 2017

Analysis of Biological Sequences SPH

Molecular Biology Primer. CptS 580, Computational Genomics, Spring 09

Open Access. Abstract

Guided tour to Ensembl

Gene Prediction: Statistical Approaches

RNA-Seq with the Tuxedo Suite

Bio 101 Sample questions: Chapter 10

Research AceView: a comprehensive cdna-supported gene and transcripts annotation Danielle Thierry-Mieg and Jean Thierry-Mieg

Gene Expression Technology

Gene-centered resources at NCBI

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

ELE4120 Bioinformatics. Tutorial 5

MATH 5610, Computational Biology

Biotechnology Explorer

7.2 Protein Synthesis. From DNA to Protein Animation

Computational aspects of ncrna research. Mihaela Zavolan Biozentrum, Basel Swiss Institute of Bioinformatics

RNA-Seq Software, Tools, and Workflows

Training materials.

Review Article A Review of Soft Computing Techniques for Gene Prediction

Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training

Eukaryotic Gene Structure


RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

Lecture #1. Introduction to microarray technology

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Interpretation of sequence results

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

Alignment to a database. November 3, 2016

Introduction to genome biology

Genome and DNA Sequence Databases. BME 110: CompBio Tools Todd Lowe April 5, 2007

Make the protein through the genetic dogma process.

Host : Dr. Nobuyuki Nukina Tutor : Dr. Fumitaka Oyama

Synthetic Biology. Sustainable Energy. Therapeutics Industrial Enzymes. Agriculture. Accelerating Discoveries, Expanding Possibilities. Design.

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Regulation of eukaryotic transcription:

Introduction to the UCSC genome browser

Analysis of large deletions in human-chimp genomic alignments. Erika Kvikstad BioInformatics I December 14, 2004

Types of Databases - By Scope

Chapter 15 Gene Technologies and Human Applications

Mapping strategies for sequence reads

#28 - Promoter Prediction 10/29/07

Measuring transcriptomes with RNA-Seq

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory.

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

TRANSCRIPTION AND TRANSLATION

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Contents 1 Introduction Information ow in biology Probabilistic Finite State Machines PFS

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Genomics AGRY Michael Gribskov Hock 331

Tree Depth in a Forest

Microarray Gene Expression Analysis at CNIO

Fig Ch 17: From Gene to Protein

Recommendations from the BCB Graduate Curriculum Committee 1

Hybrid Error Correction and De Novo Assembly with Oxford Nanopore

PrimePCR Assay Validation Report

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Gene Expression: Transcription

DNA makes RNA makes Proteins. The Central Dogma

Gene Prediction. Srivani Narra Indian Institute of Technology Kanpur

Assessing De-Novo Transcriptome Assemblies

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Serial Analysis of Gene Expression

BABELOMICS: Microarray Data Analysis

Ensembl gene annotation project

PrimePCR Assay Validation Report

Gene mutation and DNA polymorphism

La traduzione. Il codice genetico

Bio-Reagent Services. Custom Gene Services. Gateway to Smooth Molecular Biology! Your Innovation Partner in Drug Discovery!

Transcription:

Methods and Algorithms for Gene Prediction Chaochun Wei 韦朝春 Sc.D. ccwei@sjtu.edu.cn http://cbb.sjtu.edu.cn/~ccwei Shanghai Jiao Tong University Shanghai Center for Bioinformation Technology 5/12/2011 K-J-C Bioinformatics Training Course 1

Outline 1. Introduction 2. Gene prediction methods Gene prediction methods HMM TWINSCAN and N-SCAN Using ESTs for gene prediction Resources Latest progress 3. Gene Prediction FAQs 2

1. Introduction: DNA DNA contains genetic information. DNA can be expressed as a sequence of letters A,C,G and T. Eg: ACGTTTCGAGGT 3

DNARNAProtein & processing 4

RNA Processing Primary mrna 5 5

Introduction: Gene Structure Initial Exon Intron Internal Exon Intron Intron Terminal Exon 5 UTR ATG GT AG TAA TAG TGA 3 UTR A gene is a highly structured region of DNA, it is a functional unit of inheritance. 6

Patterns in Splice Sites Donor Sites Acceptor Sites Josep F. Abril et al. Genome Res. 2005; 15: 111-119 Sequence data from RefSeq of human, mouse, rat and chicken. 7

A Typical Human Gene Structure 8

Genes in a Genome 9

In a Mammalian Genome Finding all the genes is hard Mammalian genomes are large 8,000 km of 10pt type Only about 1% protein coding 10

DNA, mrna, cdna and EST DNA Primary mrna mrna UTR Coding Region UTR 5 3 Transcription 5 3 RNA Processing 5 3 Reverse Transcription 5 Truncated cdna cdna Full-length cdna 5 3 5 3 5 EST 3 EST 5 EST 3 EST EST: Short (~650bs) High error rate (~1-5%) Contains only UTRs or coding regions 11

The Challenge and Opportunity ~3000 genomes 222 animals 93 plants 3000 2000 1000 0 Genome numbers Better Gene Structure Annotation Numbers from http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html 12

2. Gene Prediction Method History Gener ation 1 st Early 1980s 2 nd Early 1990s 3 rd Mid- Date Feature Systems Information or Methods used 1990s Approximate ends of protein coding regions and non-coding regions A complete single gene in a short sequence Multiple complete or partial genes in a long sequence 4 th 2000s Complete gene structures in whole genomes TestCode, Fickett 1982 GRAIL, Uberbacher and Mural 1991 GeneID,Guigo et al. 1992 GeneParser, Snyder and Stormo 1993 FGENSH, Solovyev et al. 1994 Genscan, Burge and Karlin, 1997 Twinscan, Korf, et al.2001 N-SCAN, Brown,et al. 2006 Twinscan_EST, N-SCAN_EST, Wei and Brent, 2006 splice sites promoters Codon usage bias Neuro-network methods + Translation start sites Stop sites Method:HMM UTR Method: Generalized HMM Multiple-genomes Transcript products Method:Generalized HMM Bayesian approach 13

Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input GENSCAN (Burge 1997; Burge and Karlin 1997) GeneFinder (Green, unpublished) Fgenesh (Solovyev and Salamov 1997) Can predict novel genes 2. Transcript-alignment-based methods Use cdna, mrna or Protein similarity as major clues ENSEMBL (Birney, Clamp, et. al. 2004) Highly accurate Can only find genes with transcript evidences cdna coverage 50-60% + EST coverage up to 85% 14

Gene Prediction Methods (2) Categorization: by input information 3. Hybrid Methods Integrate cdna, mrna, protein and EST alignments into ab initio methods Genie (Kulp, Haussler et al. 1996) Fgenesh+ (Solovyev and Salamov 1997) Genomescan (Yeh, Lim et al. 2001) GAZE (Howe, Chothiea et al. 2002) AUGUSTUS+ (Stanke, Schoffmann et al. 2006) 15

Gene Prediction Methods(3) Comparative-Genomics-Based Methods TWINSCAN and N-SCAN De novo Assumption: Coding regions are more conserved. No transcript similarity information (like EST, cdna, mrna, or protein) is used TWINSCAN-EST and N-SCAN_EST Hybrid Use EST to improve prediction accuracy 16

TWINSCAN: A Novel Gene Prediction System Using Dual Genomes Conservation sequences represent the conservation patterns between two genomes No transcript similarity information (like EST, cdna, mrna, and protein) is used 17

Hidden Markov Model: Model behind gene predictors HMM for two biased coins flipping 0.9 0.2 0.1 1 2 0.7 0.1 End e1 ( H) 0.8, e1 ( T) 0.2, e2( H) 0.3, e2( T) 0.7 TTHHTTHTTTTTHTHHHHHTHTH arg max Observed sequence x 11221111122221111112222 Hidden state sequence P( x, ) 18

states Most Probable Path and Viterbi Algorithm 1 2 0 1 2 i-1 i L-1 L max Let f Recursion (i=1 L) Time complexity l ( i) max (Pr( x0,..., xi 1, 0,..., i1, i fl ( i) e ptr ( l) i { 0,..., i1} l ( xi ) max( f k arg max( f k k k ( i 1) a ( i 1) a kl kl ); ). O ( N 2 L) space complexity O(NL) l)) 19

states Probability of All the Possible Paths and Forward Algorithm 1 2 0 1 2 i-1 i L-1 L + Let f l ( i) Pr( x,..., x, l) 0 i1 i Recursion (i=1 L) Probability of all the probable paths P ( x) f k ( L) P( x, ) f ( i) e ( x ) ( f ( i 1) a ) l l i k k k kl 20

states Posterior Probability and Forward and Backward Algorithm 1 2 0 1 2 i-1 i L-1 L + Posterior Probability P( i k x) P( i k, P( x) x) 21

Posterior Probability and Forward and Backward Algorithm l l l l b x e a x P x P ) (1 ) ( ), ( ) ( 1 0 l l i l kl k i b x e a i b ) 1 ( ) ( ) ( 1 ) ( ) ( ) ( ) ( ), ( ) ( x P i b i f x P x k P x k P k k i i 22 states Recursion (i=l-1,,1) Posterior Probability ),..., Pr( ) ( 1 k x x L b i L i k Let 1 2 + 0 1 2 i-1 i L-1 L

TWINSCAN Model Generalized HMM Each feature in a gene structure corresponds to one state. State-specific length models. State-specific sequence models Use Conservation information 23

Conservation Sequence Generated by projecting local alignments to the target sequence human mouse CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC......... : :: :... : CTAGAG---------AGACAGGTACCATAGGGCTCTCCT Pair each nucleotide of the target with if it is aligned and identical : if it is aligned to mismatch. if it is unaligned 24

N-SCAN: A Novel Gene Prediction System Using Multiple Genomes Uses Bayesian model to include phylogenetic tree information Predicts introns in 5 UTR Has Conserved non-coding regions (Brown, Gross and Brent, Genome Res. 2005) 25

Using ESTs for Gene Prediction: TWINSCAN_EST Integrating EST alignment information into TWINSCAN to improve its accuracy where EST evidence exits and not to compromise its ability to predict novel genes. 26

Sequence Representation of EST Alignments 1. Use EST-to-genome alignment programs BLAT (Kent 2002) 2. Project the top alignment for each EST to the target genomic sequence EST Alignments ESTseq NNN EEE III EEE NNN III NNN EEE NNN 27

Accuracy Measurement Annotated data sets for training/testing RefSeq (http://www.ncbi.nlm.nih.gov/refseq/) CCDS (http://www.ncbi.nlm.nih.gov/ccds/) Accuracy in different levels Nucleotide level Exon level Gene level Transcript level Sensitivity and specificity 28

Accuracy Measurement (continue) Annotation Prediction Correct Prediction Sensitivity Correct _ Prediction Total _ Annotation Specificit y Correct _ Prediction Total _ Prediction 29

percen TWINSCAN_EST and N-SCAN_EST on the Whole Human Genome 100 90 80 70 60 TWINSCAN2.03 TWINSCAN_EST N-SCAN N-SCAN_EST 67 81 85 88 56 58 59 60 50 40 34 38 44 30 20 24 14 17 22 23 10 0 exact gene sensitivity exact gene specificity exact exon sensitivity exact exon specificity 30

An Example of N-SCAN_EST Prediction (Hg17, chr21:33,459,500-33,465,411) 31

An Example of N-SCAN_EST Prediction 32

Experimental Validation of Predictions Siepel, Genome Research, 2007 33

Experimental Validation of Predictions See The MGC Project Team, The Completion of the Mammalian Gene Collection (MGC), Genome Research, 2009, 19:2324-2333 Wei, C., et al., Closing in on the C.elegans ORFeome by Cloning TWINSCAN predictions, Genome Research, 2005, 15:577-582. Tenney, A. E. et al., Gene prediction and verification in a compact genome with numerous small introns, Genome Research, 2004 34

N-SCAN/TWINSCAN Webserver: http://mblab.wustl.edu/nscan/submit 35

Resources for Gene Prediction Sequence data sets Nucleotide Sequences (NCBI) dbest mrna cdna Annotations RefSeq (http://www.ncbi.nlm.nih.gov/refseq/) CCDS (http://www.ncbi.nlm.nih.gov/ccds/) Genome Browser UCSC Genome Browser(http://genome.ucsc.edu/) 36

Latest Progress in Gene Prediction New Methods Conrad: gene prediction using conditional random fields. Decaprio et al., Genome Res. 2007 Sep;17(9):1389-98. Not working for vertebrate genomes SVM for splice site Sonneburg et al., BMC Bioinformatics. 2007;8 Suppl 10:S7. CONTRAST: Gross et al., Genome Biology 2007, 8:R269 Best de novo gene predictor for human (gene level accuracy ~50%) Used SVM and conditional random field 37

3. Gene Prediction FAQs (from Ian Korf) Algorithms vs. experts Q: are expert biologists better than computer programs? A: Yes and no. Next-generation sequencing Q: Will next-gen transcript sequencing replace gene prediction? A: No. Rare transcripts may require directed experiments to validate. Prediction accuracy Q: Why are gene prediction programs inaccurate? A: We don t always know why. 38

Gene Prediction FAQs (continue) Genes in my favorate genome Q: There is no gene predictor for it, what should I do? 39

Gene prediction for algal genomes Microalgae Macroalgae 40

A Typical Alga Gene Structure A Typical Human Gene Structure 41

Gene Prediction FAQs (continue) Genes in my favorate genome Q: There is no gene predictor for it, what should I do? A: Training a gene predictor or use one that is for another organism that is close to this genome. But it may be inaccurate. Difficult genes Q: why some genes are not predicted by any program? A: They are statistical outliers. 42

Gene Prediction FAQs (continue) Just coding exons Q: why other parts are not predicted, such as noncoding exons, alternative isoforms, non-canonical splice sites, gene within genes? A: There are trade offs. Pseudogenes Q: why do some gene predictions have tiny introns? A: Retro-pseudo genes often have very strong coding signals, because they are derived from highly expressed genes. 43

Gene Prediction FAQs (end) How can I tell a good gene prediction from a bad one? Scores have been assigned to every exon and intron of a gene. People can tell if a gene prediction is good or not by the scores of exons and introns of this gene. You may have to run the program on your own computer to figure them out! 44

Acknowledgement Training program organizers Former lab members (WashU) Dr. Michael Brent Dr. Paul Flicek Dr. Ian Korf Samuel Gross 45