Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Similar documents
Gene Structure & Gene Finding Part II

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Evaluation of Gene-Finding Programs on Mammalian Sequences

Eukaryotic Gene Prediction. Wei Zhu May 2007

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

Methods and Algorithms for Gene Prediction

MATH 5610, Computational Biology

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Baum-Welch and HMM applications. November 16, 2017

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Using Expressing Sequence Tags to Improve Gene Structure Annotation

Eukaryotic Gene Structure

Model. David Kulp, David Haussler. Baskin Center for Computer Engineering and Computer Science.

Integrating Genomic Homology into Gene Structure Prediction

Year III Pharm.D Dr. V. Chitra

Transcription in Eukaryotes

Analysis of Biological Sequences SPH

Genomics and Gene Recognition Genes and Blue Genes

#28 - Promoter Prediction 10/29/07

Genie Gene Finding in Drosophila melanogaster

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

Fig Ch 17: From Gene to Protein

SSA Signal Search Analysis II

Gene Signal Estimates from Exon Arrays

Make the protein through the genetic dogma process.

Chapter 13. From DNA to Protein

Ch. 10 Notes DNA: Transcription and Translation

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

The Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot

DNA makes RNA makes Proteins. The Central Dogma

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Chapter 12. DNA TRANSCRIPTION and TRANSLATION

Genes and gene finding

Chimp Sequence Annotation: Region 2_3

RNA-Seq with the Tuxedo Suite

A Markovian Approach for the Analysis of the Gene Structure

Transcription Eukaryotic Cells

Chapter 14 Active Reading Guide From Gene to Protein

Problem Set 8. Answer Key

8/21/2014. From Gene to Protein

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

Protein Synthesis. DNA to RNA to Protein

Discovering Common Sequence Variation in A. thaliana. Gunnar Rätsch

Chapter 17: From Gene to Protein

Transcription. DNA to RNA

measuring gene expression December 5, 2017

CHAPTERS , 17: Eukaryotic Genetics

Chapter 14: Gene Expression: From Gene to Protein

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Lecture 11: Gene Prediction

Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Genomics and Gene Recognition Genes and Blue Genes

Computational Biology I LSM5191 (2003/4)

GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition

RNA-Sequencing analysis

Chromosomes. Chromosomes. Genes. Strands of DNA that contain all of the genes an organism needs to survive and reproduce

O C. 5 th C. 3 rd C. the national health museum

RNA-Seq Software, Tools, and Workflows

Motifs in Protein Sequences

BEADLE & TATUM EXPERIMENT

Protein Synthesis & Gene Expression

DNA Transcription. Dr Aliwaini

Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION. Comparisons of related genes in different species show 4.3

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

CHAPTER 21 LECTURE SLIDES

MOLECULAR GENETICS PROTEIN SYNTHESIS. Molecular Genetics Activity #2 page 1

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

Genomes summary. Bacterial genome sizes

Independent Study Guide The Blueprint of Life, from DNA to Protein (Chapter 7)

Solutions to Quiz II

Protein Synthesis Notes

Gene Expression - Transcription

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

Self-test Quiz for Chapter 12 (From DNA to Protein: Genotype to Phenotype)

Synthetic Biology. Sustainable Energy. Therapeutics Industrial Enzymes. Agriculture. Accelerating Discoveries, Expanding Possibilities. Design.

Biology 201 (Genetics) Exam #3 120 points 20 November Read the question carefully before answering. Think before you write.

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

Introduction to Bioinformatics

Mapping strategies for sequence reads

DNA Transcription. Visualizing Transcription. The Transcription Process

Bundle 5 Test Review

Name: Class: Date: ID: A

DNA Structure and Replication, and Virus Structure and Replication Test Review

REGULATION OF PROTEIN SYNTHESIS. II. Eukaryotes

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Bio 101 Sample questions: Chapter 10

Gene Prediction: Statistical Approaches

Analysis of large deletions in human-chimp genomic alignments. Erika Kvikstad BioInformatics I December 14, 2004

Nucleic acids deoxyribonucleic acid (DNA) ribonucleic acid (RNA) nucleotide

M I C R O B I O L O G Y WITH DISEASES BY TAXONOMY, THIRD EDITION

Introduction to genome biology

AP Biology Gene Expression/Biotechnology REVIEW

Zool 3200: Cell Biology Exam 2 2/20/15

PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein

Bundle 6 Test Review

ALSO: look at figure 5-11 showing exonintron structure of the beta globin gene

Molecular Genetics Student Objectives

Transcription:

Tues, Nov 29: Gene Finding 1 Online FCE s: Thru Dec 12 Thurs, Dec 1: Gene Finding 2 Tues, Dec 6: PS5 due Project presentations 1 (see course web site for schedule) Thurs, Dec 8 Final papers due Project presentations 2 Monday Dec 19 1pm - 4pm Final Exam, Room: HH B131 Multiple sequence alignment global local Evolutionary tree reconstruction Pairwise sequence alignment (global and local) Substitution matrices Prokaryotic Gene Finding Eukaryotic Gene Finding Database searching BLAST Sequence statistics Outline Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Gene Finding Questions Identify protein coding region Identify Open Reading Frame Predict mrna (including UTR s) Predict intron/exon structure Eukaryotes only Regulatory signals Protein sequence 1

Gene criteria Open Reading Frames(ORFs) Sequence features Sequence conservation Snyder and Gerstein,Science2003 Computational Computational Computational Evidence for transcription Experimental Gene inactivation induces a phenotype Experimental Sequence features Coding statistics (e.g. codon bias) Gene structure 5 Open Reading Frame Promoter region Ribosome binding site Termination sequence Start codon/stop codon Repressor site 3 An HMM that finds genes in E. coli Krogh et al,1995 observed frequencies for E. coli genes stop codons A A A A A C T T T intergene model 61 triplet models start codons (Electronicreserves) Outstanding Problems Model cannot account for drift in CG content Does not take position dependencies into account Solution: kth order Markov chain looks back k positions Glimmer (Salzberg et al, 1998) Finds 98% of all genes in a bacterial genome. 2

Some Problems Snyder and Gerstein,Science2003 Some Problems Snyder and Gerstein,Science2003 Overlapping genes aggcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgct Overlapping genes Alternate splicing exon1 exon2 exon3 exon4 exon5 exon6 exon1 exon2 exon3 exon4 exon1 exon2 exon3 exon5 exon6 Some Problems Overlapping genes Alternate splicing Pseudogenes Snyder and Gerstein,Science2003 gcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgctcc gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc Gene Finding Challenges Small protein-coding genes (<100 aa s) Non-protein-coding RNA genes Regulatory regions Salzberg, Nature, 2003 Genes with sparse conserved positions and little sequence similarity; e.g., betadefensins Schutte et al., PNAS, 2001 gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc 3

Outline Prokaryotic vs. Eukaryotic Genes Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Prokaryotes small genomes (0.5Mb to 10Mb) high gene density (90%) no introns (or splicing) no RNA processing simple regulatory regions most long ORF s are genes Eukaryotes large genomes low gene density (3% - 50%) intron/exon structure splicing complex regulatory regions Genomic data: Must handle multiple genes and/or gene fragments in input sequence. Source: http://www.nslij-genetics.org/gene 4

Typical human gene sizes Average gene length 30kb Coding region 1-2kb Exon length 150-200 bp Exon count 5-6 Single exon genes 8% Genome statistics Size Gene number Density (1 gene per) Human 3300Mb 30K 100,000 Fly 180Mb 13.6K 9000 C. elegans 97Mb 19.1K 5000 Yeast 12Mb 6.3K 2000 E. coli 4.8Mb 3.2K 1400 H. influenzae 1.8Mb 1.7K 1000 Architecture: Genscan Individual modules: intergenic region, promoter, 5 UTR, exon/intron, post-translation region Semi Hidden Markov Model various length distributions Different statistical models for each module: weight matrices + extensions, 3-periodic 5 th order Markov chains Incorporates: Burge and Karlin, 1997 Descriptions of transcriptional, translational and splicing signals Compositional features of exons, introns, intergenic, C+G regions http://www.ornl.gov/techresources/human_genome/faq/compgen.html Genscan Larger predictive scope than previous models Partial genes Multiple genes separated by intergenic DNA Genes on either/both DNA strands Proposed pipeline Burge and Karlin, 1997 Screen for repetitive elements Predict protein sequences with GENSCAN BLAST predictions to find homologs Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96) Verify experimentally GenScan States N: intergenic region P: promoter F: 5 untranslated region E sngl: single exon (intronless) (translation start -> stop codon) E init: initial exon (translation start -> donor splice site) E k: phase k internal exon (acceptor splice site -> donor splice site) E term: terminal exon (acceptor splice site -> stop codon) I k: phase k intron: T: 3 untranslated region A: poly-a signal Fig. 3, Burge and Karlin 1997 5

GenScan States How to model sequences with lengths that are not geometrically distributed? N: intergenic region P: promoter F: 5 untranslated region E sngl: single exon (intronless) (translation start -> stop codon) E init: initial exon (translation start -> donor splice site) E k: phase k internal exon (acceptor splice site -> donor splice site) E term: terminal exon (acceptor splice site -> stop codon) I k: phase k intron: T: 3 untranslated region A: poly-a signal Fig. 3, Burge and Karlin 1997 1- p stop codons p CODON model Resulting exon length distribution: l 1 P( l) = (1 p) p Semi-hidden Markov model Semi-hidden Markov model cont d Set of states: Q 1, Q 2, Transition matrix P(Q(t) Q(t-1)) Initial distribution P(Q(0)) Each state has a length distribution a sequence generating model Emission: Each state emits a sequence, according to a particular distribution, of length, d, according to a particular length frequency distribution A parse φ of length L is A state sequence: Q 1, Q 2, A sequence of lengths: d 1,d 2,d 3,.. An observed sequence, s, is scored using a modified Viterbi algorithm ϕ = opt arg max P( ϕ s) 6

GenScan Training Set Initial and transition probabilities 2.5M base pairs 142 Single Exon Genes (SEGs) 238 multi-exon gene 1492 Exons 1254 Introns An additional 1619 coding sequences (no introns) Promoter model based on published sources. Trained separately for four categories of G+C content < 43% (G+C) 43% - 51% (G+C) 51% - 57% (G+C) > 57% (G+C) Gene density varies with G+C content Genes in A+T rich regions had fewer introns 3 1 Single Gene Exons 2 3-periodic 5 th -order Markov chain Length distribution: taken empirically from single exon lengths Sequence model Trained with coding sequences GenScan States N: intergenic region P: promoter F: 5 untranslated region E sngl: single exon (intronless) (translation start -> stop codon) E init: initial exon (translation start -> donor splice site) E k: phase k internal exon (acceptor splice site -> donor splice site) E term: terminal exon (acceptor splice site -> stop codon) I k: phase k intron: T: 3 untranslated region A: poly-a signal Fig. 3, Burge and Karlin 1997 7

Intron/Exon model phase 0 intron: GAGCACTxxxxxxxxxxxxxxGAGGTGCACCTGGTGACTTAGAC.. phase 1 intron: GAGCACTGxxxxxxxxxxxxxxAGGTGCACCTGGTGACTTAGAC.. phase 2 intron: GAGCACTGAxxxxxxxxxxxxxxGGTGCACCTGGTGACTTAGAC.. Exon phase = phase of previous intron Donor and acceptor models incorporated in intron models Length and sequence distribution determined empirically GenScan States N: intergenic region P: promoter F: 5 untranslated region Esngl: single exon (intronless) (translation start -> stop codon) Einit: initial exon (translation start -> donor splice site) Ek: phase k internal exon (acceptor splice site -> donor splice site) Eterm: terminal exon (acceptor splice site -> stop codon) I k : phase k intron: T: 3 untranslated region A: poly-a signal Fig. 3, Burge and Karlin 1997 Intergenic models Lengths are geometrically distributed with mean: Sequence model 9 3 10 60,000 5 th order Markov model (highest order trainable with data available). Similar models used for untranslated regions * Estimated human gene number in 1997 * GenScan States N: intergenic region P: promoter F: 5 untranslated region Esngl: single exon (intronless) (translation start -> stop codon) Einit: initial exon (translation start -> donor splice site) Ek: phase k internal exon (acceptor splice site -> donor splice site) Eterm: terminal exon (acceptor splice site -> stop codon) Ik: phase k intron: T: 3 untranslated region A: poly-a signal Fig. 3, Burge and Karlin 1997 8

Signal Models TATA box, polya signal, 5 UTR Acceptor Model Fixed length models PSSMs No positional dependence Fixed length model Weight array model Models dependence on the previous position Donor Model Performance measures Burset & Guigo, 1996 Fixed length model Maximal Dependence Decomposition Models dependencies between nonadjacent elements reality prediction true negative true positive false negative false positive Nucleotide Level S TP = n TP + FN 9

Performance measures Burset & Guigo, 1996 Genscan Performance Nucl S n Exon S n Exon S p Missing Exons Wrong Exons reality prediction both edges must be correctly aligned missed exon correct exon wrong exon Genscan FGENEH GENEID+ GENEPARSER3 0.93 0.77 0.91 0.86 0.78 0.61 0.73 0.56 0.81 0.64 0.70 0.58 0.09 0.15 0.07 0.14 0.05 0.12 0.13 0.09 Exon Level S TP = n TP + FN S TN = p TN + FP Proportion of genes with all exons correctly predicted: 243 = 0.43 570 Outline Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Human Gene Count (2001) International Human Genome Consortium, Nature 2001 Initial Gene Index (IGI) Ensembl GENSCAN predictions confirmed by homology to EST s, mrnas, proteins and protein motifs Genie HMM-based Extends homology with EST s/mrnas using ab initio approach Previously known genes in data bases 14882 4057 12,839 31,778 Known genes Ensembl + Genie Ensembl alone Total Predictions 10

Validation Problems: False positives Fragmentation: 1 true gene corresponds to >1 prediction Partial prediction: only part of gene is correctly predicted Comparison with 31 newly discovered genes 28/31 in draft human genome sequence 19/28 were detected by gene prediction IGI contains ~60% of novel genes in human genome (19/31) Comparison with mouse cdna s 81% of mouse cdna s similar to draft human sequence 69% of mouse cdna s similar to predicted genes Predicting Human Genome Count IGI contains: 15,000 known genes 17,000 predicted genes yields 32,000 genes Assuming 20% false positives fragmentation rate of 1.4 yields 24,500 genes Assuming 40% of novel genes are not predicted yields 31,000 genes Human Gene Count (2004) International Human Genome Consortium, Nature 2004 Human Gene Count (2004) International Human Genome Consortium, Nature 2004 2004: 20,000 25,000 genes 1. 19,599 known genes 2. 2188 additional predictions Average exons in predicted genes: 4.7 Average exons in known genes: 9.7 Predictions are likely to overestimate due to fragmentation 3. Possible false negatives Human Genome Sequence 2001 2004 Euchromatic Coverage 90% 99% Gaps 150,00 341 Error rate 1/10,000 1/100,000 Very short genes (< 100amino acids), single exon genes, rapidly evolving genes 2004: 20,000 25,000 genes 2001: ~31,000 genes Human Genome Sequence 2001 2004 Euchromatic Coverage 90% 99% Gaps 150,00 341 Error rate 1/10,000 1/100,000 Fragmentation rates were underestimated in 2001 Number of known genes has increased by 33% 11

Outline Regulatory regions Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation A B C Polymerase Polymerase D A B C Examples of binding site motifs Finding regulatory motifs Look for conserved sequences in regions upstream from genes that are co-regulated Look for conserved sequences in regions upstream from genes that are homologous 12

2 1 0-1 -2-2 -1 0 1 2 Finding binding sites with expression data Transcriptional co-regulation Identify co-expressed genes Compare the upstream regions Identify common sequence motifs Experimental validation Tavazoie et al, Nat Gen, 99 Expression cluster Common DNA sequences Expression patterns DNA sequence patterns -2-1 0 1 2 Promoter... Gibbs Sampling (AlignACE) Transcription factor binding site position weight matrix 13

Finding regulatory motifs Identifying binding sites Comparative genomic approach Look for conserved sequences in regions upstream from genes that are co-regulated Look for conserved sequences in regions upstream from genes that are homologous Organism 1 Organism 2 Conserved genes Identifying binding sites Comparative genomic approach Global alignment of upstream sequences Organism 1 Conserved genes Organism 2 Conserved regulatory elements RRPE PAC Kellis et al., Nature, 2003 14

Comparative genomics for gene finding Comparative genomics for gene finding Salzberg, Nature, 2003 Using genomic sequence from S. paradoxus, S. bayanus, S. mikatae Kellis et al. (Nature, 2003) found 503 false predictions 43 new small genes 42 new regulatory motifs Requires sequences in species that are close but not too close. Will this approach work in higher eukaryotes? One of the first examples of this approach: Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity. Mulligan ME, Hawley DK, Entriken R, McClure WR. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):789-800. We describe a simple algorithm for computing a homology score for Escherichia coli promoters based on DNA sequence alone. The homology score was related to 31 values, measured in vitro, of RNA polymerase selectivity, which we define as the product KBk2, the apparent second order rate constant for open complex formation. We found that promoter strength could be predicted to within a factor of +/-4.1 in KBk2 over a range of 10(4) in the same parameter. The quantitative evaluation was linked to an automated (Apple II) procedure for searching and evaluating possible promoters in DNA sequence files. 1992 1995 1997 1998 2000 2001 2002 Whole genome sequencing H. influenzae, 1 st whole genome sequence Yeast, 1 st eukaryote C. elegans 1 st multicellular organism Fly, Arabidopsis thaliana (mustard weed) 1 st plant Human, 1 st mammal Mouse Gene finding Assessment of protein coding measures (Fickett & Tung) Sequence features in coding, non-coding and intergenic DNA (Guigo & Fickett) HMM gene finder for E. Coli (Krogh et al) Glimmer, Higher order Markov models for prokaryotes (Salzburg et al) Genscan, Prediction of complete gene structures in human genomic DNA (Burge & Karlin) Kellis et al, 2003, comparative genomics for finding regulatory regions 15

Pairwise sequence alignment (global and local) Multiple sequence alignment Substitution matrices Database searching global local BLAST Sequence statistics Evolutionary tree reconstruction Prokaryotic Gene Finding Eukaryotic Gene Finding 16