Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 2011


12 Gene Prediction Using HMMs

This exposition is based on the following source, which is recommended reading:

1. Chris Burge and Samuel Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94 (1997).

12.1 Introduction

In the 1960s, it was discovered that a gene and its protein product are colinear structures, with a direct correspondence between the triplets of nucleotides in the gene and the amino acids in the protein.

It soon became clear that genes can be difficult to determine because of overlapping genes, genes within genes, and similar complications. Moreover, a paradox arose: the genome size of many eukaryotes does not correspond to their genetic complexity. For example, the salamander genome is 10 times the size of the human genome.

In 1977, the surprising discovery of split genes was made: genes that consist of multiple pieces of coding DNA called exons, separated by stretches of non-coding DNA called introns.

[Figure: flow of genetic information. Prokaryote: DNA, transcription, mRNA, translation, protein. Eukaryote: DNA, transcription to RNA in the nucleus, splicing to mRNA, translation, protein.]

The existence of split genes and a large proportion of non-coding DNA makes the following problem challenging in eukaryotes:

The gene finding problem: Given a DNA sequence, correctly predict the structure of every gene contained in the sequence.

12.1.1 Types of genes

Until quite recently, the word gene was usually understood to mean a coding gene that is translated into a protein sequence. We now distinguish between protein-coding genes and non-coding genes, also known as RNA genes, which code for RNAs such as:

- rRNA
- tRNA
- snRNA (small nuclear RNA)
- snoRNA (small nucleolar RNA)

- miRNAs (microRNAs)
- ...

In this chapter, we focus on the prediction of protein-coding genes.

12.2 ORF prediction in prokaryotes

The simplest way to detect potential coding regions in prokaryotes is to look for so-called open reading frames:

Definition 12.2.1 (Open reading frame (ORF)) An ORF is a sequence of codons in DNA that starts with a start codon (ATG), ends with a stop codon (TAA, TAG or TGA) and has no other in-frame stop codon inside.

Assume that we are at a start codon. In random DNA, each codon is a stop codon with probability

$$p = \frac{\#\,\text{stop codons}}{\#\,\text{all codons}} = \frac{3}{64},$$

so the probability $P(k)$ that the next stop codon is exactly $k$ codons away follows a geometric distribution:

$$P(k) = (1-p)^{k-1}\,p.$$

The mean value is $1/p = 64/3 \approx 21$ codons. This is much smaller than the number of codons in an average protein ($\approx 300$). Essentially, long ORFs indicate genes, whereas short ORFs may or may not indicate genes or short exons.

12.2.1 Codon usage

Additionally, codon usage can be taken into account:

Definition 12.2.2 (Codon usage) The codon usage of a stretch of DNA sequence is given by a 64-component vector that counts how many times each codon is present in the sequence.

These counts differ substantially between coding and non-coding regions.

Example: part of the codon usage vector for coding regions of E. coli:

Amino acid            Codon   Freq. per 1000 codons
Gly (glycine)         GGG      1.89
Gly                   GGA      0.44
Gly                   GGT     52.99
Gly                   GGC     34.55
Glu (glutamic acid)   GAG     15.68
Glu                   GAA     57.2
Asp (aspartic acid)   GAT     21.63
Asp                   GAC     43.26

12.3 Eukaryotic gene structure

Here is a simple model of the structure of a eukaryotic gene:

[Figure: components of a eukaryotic gene in 5' to 3' order: promoter (TATA), 5' UTR, start site (ATG), initial exon, donor site (GT), intron, acceptor site (AG), internal exon(s), intron (GT ... AG), terminal exon, stop site (TAA, TAG or TGA), 3' UTR, poly-A signal (AAATAAAA).]
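Returning to Section 12.2, the ORF definition, the geometric stop-codon model and the codon usage vector can be sketched in a few lines of Python. This is a minimal illustration rather than a production gene finder: the function names `find_orfs` and `codon_usage` are hypothetical, only the forward strand is scanned, and for each stop codon only the first upstream in-frame ATG is reported.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=1):
    """Return (start, end) pairs of forward-strand ORFs in all three frames.

    Each ORF runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close the ORF
    return orfs


def codon_usage(dna):
    """64-component codon count vector (as a dict), reading in frame 0."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    all_codons = ("".join(c) for c in product("ACGT", repeat=3))
    return {codon: counts.get(codon, 0) for codon in all_codons}


# Geometric model for random DNA: p = 3/64, mean distance to a stop = 1/p.
p_stop = len(STOP_CODONS) / 64
mean_distance = 1 / p_stop

if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGCC"))
    print(round(mean_distance, 1), "codons to the next stop on average")
```

In practice one would pass a larger `min_codons` so that only ORFs much longer than the roughly 21 codons expected by chance are reported.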

Given a long segment of DNA sequence, how can we find such genes in it? Statistical properties of the different components of such a gene model can help to predict genes in unannotated DNA. For example, the observed frequencies of the bases around the start site can be described using a position weight matrix:

Pos.   -8    -7    -6    -5    -4    -3    -2    -1    +1    +2    +3    +4    +5    +6    +7
A     0.16  0.29  0.20  0.25  0.22  0.66  0.27  0.15  1     0     0     0.28  0.24  0.11  0.26
C     0.48  0.31  0.21  0.33  0.56  0.05  0.50  0.58  0     0     0     0.16  0.29  0.24  0.40
G     0.18  0.16  0.46  0.21  0.17  0.27  0.12  0.22  0     0     1     0.48  0.20  0.45  0.21
T     0.19  0.24  0.14  0.21  0.06  0.02  0.11  0.05  0     1     0     0.09  0.26  0.21  0.21

This type of matrix is obtained by counting the frequencies of the different nucleotides near the start sites in a training set of annotated sequences.

12.4 GENSCAN's model

In the remainder of this chapter, we will discuss the details of GENSCAN, a classic program for predicting genes in eukaryotes, which is based on an HMM:

[Figure: state diagram of the GENSCAN HMM. Forward (+) strand states: E0+, E1+, E2+ (internal exons), I0+, I1+, I2+ (introns), Einit+ (initial exon), Eterm+ (terminal exon), Esngl+ (single-exon gene), F+ (5' UTR), T+ (3' UTR), P+ (promoter), A+ (poly-A signal). Reverse (-) strand: the corresponding states E0-, E1-, E2-, I0-, I1-, I2-, Einit-, Eterm-, Esngl-, F-, T-, P-, A-. N: intergenic region.]

12.5 GENSCAN's model

GENSCAN has 27 states:

- The N state models intergenic regions.
- The P and A states model a promoter region and a poly-A attachment site.
- The F and T states model the 5' and 3' untranslated regions.
- The Einit, Eterm and Esngl states model initial, terminal and single exons, respectively.
- There are three intron states I0, I1 and I2, representing introns of phase 0, 1 and 2, respectively.

GENSCAN predicts genes on both strands of the DNA simultaneously, which is why there are two copies of each state (except the intergenic one).
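As a quick consistency check of the state count, the state set can be enumerated programmatically. The sketch below assumes the labels used in the state diagram, with the three internal exon states written E0, E1, E2 and the three intron states written I0, I1, I2; the helper name `genscan_states` is hypothetical.

```python
# State types that exist once per strand (labels follow the state diagram;
# the E0..E2 / I0..I2 naming is an assumption made for this sketch).
STRANDED_STATE_TYPES = [
    "P",      # promoter
    "F",      # 5' UTR
    "Einit",  # initial exon
    "Esngl",  # single-exon gene
    "Eterm",  # terminal exon
    "T",      # 3' UTR
    "A",      # poly-A signal
    "E0", "E1", "E2",  # internal exons
    "I0", "I1", "I2",  # introns of phase 0, 1, 2
]


def genscan_states():
    """Return the list of state labels: N plus one copy of each type per strand."""
    states = ["N"]  # the intergenic state is shared by both strands
    for strand in "+-":
        states.extend(f"{name}{strand}" for name in STRANDED_STATE_TYPES)
    return states


if __name__ == "__main__":
    states = genscan_states()
    print(len(states), "states:", " ".join(states))
```

Thirteen stranded state types on two strands plus the single intergenic state give 2 x 13 + 1 = 27 states, matching the count stated in the text.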

GENSCAN uses an explicit state duration HMM. This is an HMM in which a duration period is explicitly modeled for each state, using a probability distribution. The model is thought of as generating a parse $\pi$, consisting of:

- a sequence of states $q = (q_1, q_2, \dots, q_n)$, and
- an associated sequence of durations $d = (d_1, d_2, \dots, d_n)$,

which, using probabilistic models for each of the state types, generates a DNA sequence $S$ of length $L = \sum_{i=1}^{n} d_i$.

The generation of a parse of a given sequence length $L$ proceeds as follows:

1. An initial state $q_1$ is chosen according to an initial distribution $p$ on the states, i.e. $p_i = P(q_1 = Q^{(i)})$, where $Q^{(j)}$ ($j = 1, \dots, 27$) is an indexing of the states of the model.
2. A state duration or length $d_1$ is generated conditional on the value of $q_1 = Q^{(i)}$ from the duration distribution $f_{Q^{(i)}}$.
3. A sequence segment $s_1$ of length $d_1$ is generated, conditional on $d_1$ and $q_1$, according to an appropriate sequence-generating model for state type $q_1$.
4. The subsequent state $q_2$ is generated, conditional on the value of $q_1$, from the (first-order Markov) state transition matrix $T$, i.e. $T_{i,j} = P(q_{k+1} = Q^{(j)} \mid q_k = Q^{(i)})$.

This process is repeated until the sum $\sum_{i=1}^{n} d_i$ of the state durations first equals or exceeds $L$, at which point the last state duration is appropriately truncated, the final stretch of sequence is generated and the process stops. The resulting sequence is simply the concatenation of the sequence segments, $S = s_1 s_2 \dots s_n$.

In addition to its topology involving the 27 states and 46 transitions depicted above, the model has four main components:

- a vector of initial probabilities $p$,
- a matrix of state transition probabilities $T$,
- a set of length distributions $f$, and
- a set of sequence-generating models $P$.

(Recall that an HMM has initial, transition and emission probabilities.)
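The four generation steps above can be sketched as a short simulation. The toy model below is purely illustrative and is not GENSCAN's parameterization: it has only two hypothetical states ("N" and "E"), made-up duration distributions and simple independent-nucleotide emitters.

```python
import random


def generate_parse(length, init, trans, durations, emitters, rng):
    """Generate (states, durations, sequence) with total length `length`.

    init:      dict state -> initial probability (the vector p)
    trans:     dict state -> dict next_state -> probability (the matrix T)
    durations: dict state -> function(rng) returning a length >= 1 (the f's)
    emitters:  dict state -> function(d, rng) returning a string of length d
    """
    states, durs, segments = [], [], []
    # Step 1: choose the initial state from p.
    state = rng.choices(list(init), weights=list(init.values()))[0]
    total = 0
    while total < length:
        # Step 2: draw a duration; truncate the last one so the total is exact.
        d = min(durations[state](rng), length - total)
        # Step 3: generate a segment of that length for this state type.
        segments.append(emitters[state](d, rng))
        states.append(state)
        durs.append(d)
        total += d
        # Step 4: choose the next state from the transition matrix T.
        nxt = trans[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return states, durs, "".join(segments)


# Toy two-state model (all numbers are assumptions chosen for illustration).
TOY_INIT = {"N": 0.8, "E": 0.2}
TOY_TRANS = {"N": {"E": 1.0}, "E": {"N": 1.0}}
TOY_DURATIONS = {
    "N": lambda rng: rng.randint(1, 10),
    "E": lambda rng: 3 * rng.randint(1, 5),   # whole codons only
}
TOY_EMITTERS = {
    "N": lambda d, rng: "".join(rng.choices("ACGT", k=d)),
    "E": lambda d, rng: "".join(rng.choices("ACGT", weights=[1, 3, 3, 1], k=d)),
}

if __name__ == "__main__":
    rng = random.Random(0)
    q, d, s = generate_parse(50, TOY_INIT, TOY_TRANS, TOY_DURATIONS, TOY_EMITTERS, rng)
    print(list(zip(q, d)))
    print(s)
```

Because durations are drawn explicitly, the transition matrix needs no self-transitions; in an ordinary HMM the time spent in a state would implicitly be geometric.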
12.6 Computation of a gene prediction

As the GENSCAN model is an HMM augmented by explicit durations, the algorithms discussed in the HMMs chapter can all be applied (after appropriate modifications to take the durations into account). In particular, a gene prediction can be performed by using the Viterbi algorithm to compute a most probable path through the HMM.

Note that the model is set up in such a way that a gene prediction is performed on both strands of the DNA simultaneously.
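To illustrate what "appropriate modifications" might look like, here is a minimal sketch of a Viterbi recursion for an explicit-duration HMM. It is not GENSCAN's implementation: the two-state model, the bounded maximum duration `max_dur`, the independent per-symbol emissions and the omission of final-segment truncation are all simplifying assumptions, and the function name `duration_viterbi` is hypothetical.

```python
import math


def duration_viterbi(seq, states, init, trans, dur, emit, max_dur):
    """Most probable parse [(state, duration), ...] of `seq`.

    init[s], trans[r][s], dur[s][d] and emit[s][symbol] are probabilities.
    Symbols within a segment are emitted independently (a simplification).
    """
    def log(x):
        return math.log(x) if x > 0 else -math.inf

    n = len(seq)
    # best[t][s]: best log-probability of a parse of seq[:t] ending in state s.
    best = [{s: -math.inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in states:
            seg = 0.0  # log-probability of emitting seq[t-d:t] from state s
            for d in range(1, min(t, max_dur) + 1):
                seg += log(emit[s][seq[t - d]])
                score_d = seg + log(dur[s].get(d, 0.0))
                if d == t:   # this segment starts the sequence
                    cands = [(log(init[s]) + score_d, None)]
                else:        # this segment follows a segment in state r
                    cands = [(best[t - d][r] + log(trans[r].get(s, 0.0)) + score_d, r)
                             for r in states]
                for score, r in cands:
                    if score > best[t][s]:
                        best[t][s] = score
                        back[t][s] = (d, r)
    # Trace back from the best final state.
    s = max(states, key=lambda q: best[n][q])
    t, parse = n, []
    while t > 0:
        d, r = back[t][s]
        parse.append((s, d))
        t, s = t - d, r
    return parse[::-1]


# Toy model: state X prefers A, state Y prefers C; durations uniform on 1..6.
STATES = ["X", "Y"]
INIT = {"X": 0.5, "Y": 0.5}
TRANS = {"X": {"Y": 1.0}, "Y": {"X": 1.0}}
DUR = {s: {d: 1 / 6 for d in range(1, 7)} for s in STATES}
EMIT = {"X": {"A": 0.9, "C": 0.1}, "Y": {"A": 0.1, "C": 0.9}}

if __name__ == "__main__":
    print(duration_viterbi("AAAAAACCCCCC", STATES, INIT, TRANS, DUR, EMIT, max_dur=6))
```

Compared with the standard Viterbi algorithm, the inner maximization runs over the previous state and the duration of the current segment, which makes the running time grow with the maximum duration considered.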

12.7 Details of the model

So far, we have discussed the topology and the other main components of the GENSCAN model in general terms. The following details still need to be discussed:

- different intron phases,
- the initial and transition probabilities,
- the state length distributions,
- transcriptional and translational signals,
- splice signals, and
- reverse-strand states.

Due to time constraints, we will only consider a few of these issues.

12.8 Intron phase

Splicing in genes does not respect codon boundaries. Hence, any exon (except the terminal one) may end with an overhang of 0, 1 or 2 nucleotides, and the HMM tracks this by transitioning to an intron state of phase 0, 1 or 2, respectively.

When we enter an exon state from an intron state of phase i, we must first generate the appropriate number of nucleotides to complete the overhanging codon of the previous exon before generating further codons.

[Figure: an exon TA TGT GTT ACT CGC GCT CGC TT lying between a phase 1 intron and a phase 2 intron. The leading TA completes the codon begun in the previous exon; the trailing TT is an overhang of 2 nucleotides.]

12.9 State length distributions for exons and introns

The different states of the model correspond to sequence segments with different length distributions. For some states, especially the internal exon states $E_k$, length is important for proper biological function, i.e. proper splicing and inclusion in the final processed mRNA.

It has been shown in vivo that internal deletions of exons to sizes below about 50 bp may often lead to exon skipping, and steric interference between factors recognizing splice sites may make splicing of small exons more difficult. Spliceosomal assembly may be inhibited if internal exons are expanded beyond 300 bp.

In summary, these arguments support the observation that internal exons are usually 120-150 bp long, with only a few of length less than 50 bp or more than 300 bp. Constraints for initial and terminal exons are slightly different.
The duration in initial, internal and terminal exon states is modeled by a different empirical distribution for each of the types of states.
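The phase bookkeeping described in Section 12.8 amounts to arithmetic modulo 3. The sketch below uses hypothetical helper names (`next_intron_phase`, `split_exon`) and the exon from the intron-phase figure; it illustrates the idea only and is not taken from GENSCAN.

```python
def next_intron_phase(incoming_phase, exon_length):
    """Phase of the intron that follows an exon.

    incoming_phase: number of nucleotides (0, 1 or 2) of an unfinished codon
    carried over from the previous exon (0 for an exon that starts a codon).
    """
    return (incoming_phase + exon_length) % 3


def split_exon(exon, incoming_phase):
    """Split an exon into (codon completion, whole codons, overhang)."""
    head_len = (3 - incoming_phase) % 3       # nucleotides needed to finish the codon
    head, rest = exon[:head_len], exon[head_len:]
    n_full = len(rest) // 3 * 3               # length covered by whole codons
    codons = [rest[i:i + 3] for i in range(0, n_full, 3)]
    return head, codons, rest[n_full:]


if __name__ == "__main__":
    # Exon from the figure: preceded by a phase 1 intron.
    exon = "".join(["TA", "TGT", "GTT", "ACT", "CGC", "GCT", "CGC", "TT"])
    print(split_exon(exon, incoming_phase=1))
    print("next intron phase:", next_intron_phase(1, len(exon)))
```

For the figure's exon, one nucleotide is carried in, two nucleotides (TA) complete that codon, six whole codons follow, and the two remaining nucleotides (TT) make the next intron a phase 2 intron.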

[Figure: length histograms (number of exons against length in bp) for initial exons, internal exons and terminal exons.]

In contrast to exons, the length of introns does not seem biologically critical, although a minimum length of 70-80 bp may be preferred. The length distribution for introns appears to be approximately geometric (exponential), with the average length depending on the C+G content of the sequence (the higher the content, the shorter the introns, on average).

Hence, the duration in intron states is modeled by a geometric distribution with parameter q that depends on the C+G content.

[Figure: length histogram (number of introns against length in bp, up to about 8k) for introns.]

12.10 Simple signal models

There are a number of different models of biological signal sequences, such as donor and acceptor sites, promoters, etc. One simple approach is the weight matrix method (WMM). Here, the frequency $p_a^{(i)}$ of each nucleotide $a$ at position $i$ of a signal of length $n$ is derived from a collection of aligned signal sequences. The product

$$P(A) = \prod_{i=1}^{n} p_{a_i}^{(i)}$$

is used to estimate the probability of generating a particular sequence $A = a_1 a_2 \dots a_n$.

Here is a WMM for recognition of a start site:

Pos.   -8    -7    -6    -5    -4    -3    -2    -1    +1    +2    +3    +4    +5    +6    +7
A     0.16  0.29  0.20  0.25  0.22  0.66  0.27  0.15  1     0     0     0.28  0.24  0.11  0.26
C     0.48  0.31  0.21  0.33  0.56  0.05  0.50  0.58  0     0     0     0.16  0.29  0.24  0.40
G     0.18  0.16  0.46  0.21  0.17  0.27  0.12  0.22  0     0     1     0.48  0.20  0.45  0.21
T     0.19  0.24  0.14  0.21  0.06  0.02  0.11  0.05  0     1     0     0.09  0.26  0.21  0.21

Under this model, the sequence ...ccgccacc ATG GCGC... has the highest probability of containing a start site, namely:

$$P = 0.48 \cdot 0.31 \cdot 0.46 \cdot 0.33 \cdot 0.56 \cdot 0.66 \cdot 0.50 \cdot 0.58 \cdot 1 \cdot 1 \cdot 1 \cdot 0.48 \cdot 0.29 \cdot 0.45 \cdot 0.40 \approx 6 \times 10^{-5}.$$

By contrast, the sequence ...agtttttt ATG TAAT... has a very low (but non-zero) probability of containing a start site at the indicated position, namely:

$$P = 0.16 \cdot 0.16 \cdot 0.14 \cdot 0.21 \cdot 0.06 \cdot 0.02 \cdot 0.11 \cdot 0.05 \cdot 1 \cdot 1 \cdot 1 \cdot 0.09 \cdot 0.24 \cdot 0.11 \cdot 0.21 \approx 2.5 \times 10^{-12}.$$
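The weight matrix product is easy to check numerically. The sketch below hard-codes the start-site matrix as read from the table, with leading zeros restored on the assumption that each column should sum to approximately 1; the function name `wmm_probability` is hypothetical.

```python
# Start-site weight matrix, positions -8..-1, +1..+3 (ATG), +4..+7.
# Values are transcribed from the table in the text (an assumption of this
# sketch: entries such as ".5" are read as 0.50 or 0.05 so columns sum to ~1).
WMM = {
    "A": [0.16, 0.29, 0.20, 0.25, 0.22, 0.66, 0.27, 0.15, 1, 0, 0, 0.28, 0.24, 0.11, 0.26],
    "C": [0.48, 0.31, 0.21, 0.33, 0.56, 0.05, 0.50, 0.58, 0, 0, 0, 0.16, 0.29, 0.24, 0.40],
    "G": [0.18, 0.16, 0.46, 0.21, 0.17, 0.27, 0.12, 0.22, 0, 0, 1, 0.48, 0.20, 0.45, 0.21],
    "T": [0.19, 0.24, 0.14, 0.21, 0.06, 0.02, 0.11, 0.05, 0, 1, 0, 0.09, 0.26, 0.21, 0.21],
}
WIDTH = 15


def wmm_probability(window):
    """Probability of a 15-nucleotide window under the weight matrix model."""
    window = window.upper()
    if len(window) != WIDTH:
        raise ValueError(f"window must have length {WIDTH}")
    prob = 1.0
    for i, base in enumerate(window):
        prob *= WMM[base][i]   # multiply the per-position frequencies
    return prob


if __name__ == "__main__":
    for window in ("ccgccaccATGGCGC", "agttttttATGTAAT", "ccgccaccCTGGCGC"):
        print(window, wmm_probability(window))
```

Any window that does not carry ATG at positions +1 to +3 receives probability 0, as the third example shows. For long signals it is usually safer to sum log-probabilities instead of multiplying, to avoid numerical underflow.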

12.11 Splice signals

The donor and acceptor splice signals are probably the most important signals, as the majority of exons are internal ones. The consensus region of the donor splice sites covers the last 3 bp of the exon (positions -3 to -1) and the first 6 bp of the succeeding intron (positions +1 to +6):

              ... exon | intron ...
Position      -3    -2    -1    +1    +2    +3    +4    +5    +6
Consensus     c/a   A     G     G     T     a/g   A     G     t

WMM:
A            0.33  0.60  0.08  0     0     0.49  0.71  0.06  0.15
C            0.37  0.13  0.04  0     0     0.03  0.07  0.05  0.19
G            0.18  0.14  0.81  1     0     0.45  0.12  0.84  0.20
T            0.12  0.13  0.07  0     1     0.03  0.09  0.05  0.46

12.12 Example of GENSCAN summary output

GENSCAN 1.0 Date run: 28-Apr-14 Time: 2:56:56
Sequence HUMAN DNA : 36741 bp : 52.9% C+G : Isochore 3 (51-57 C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
1.1 Intr + 342 3538 119 1 2 14 66 32.16 3.29
1.2 Intr + 395 463 114 1 22 61 133.276 5.15
1.3 Term + 431 4426 117 1 83 44 64.725.34
1.4 PlyA + 529 534 6 1.5
2.5 PlyA - 5519 5514 6 1.5
2.4 Term - 7619 7455 165 2 36 47 124.792 1.33
2.3 Intr - 9364 939 56 2 2 9 84 48.63 3.79
2.2 Intr - 9557 9478 8 1 2 7 53 49.94 -.71
2.1 Init - 9986 991 77 2 2 9 9 58.71 6.79
2. Prom - 13449 1341 4-3.91
3.2 PlyA - 14233 14228 6 -.45
3.1 Sngl - 14748 14287 462 13 51 222.425 7.51
3. Prom - 1719 177 4-7.
4. Prom + 19924 19963 4.29
...

12.13 Performance of GENSCAN

GENSCAN was run on a test set of 570 vertebrate sequences, and the forward-strand exons in the optimal GENSCAN parse of each sequence were compared to the annotated exons. The following table shows the results and compares them with results obtained using other programs:

[Table: accuracy of GENSCAN compared with other gene prediction programs.]

(Source: Burge and Karlin 1997)

GENSCAN performs very well here and is a widely used gene finding method.

12.14 Summary

A straightforward approach to finding genes in prokaryotes is to search for long open reading frames that have a plausible codon usage.

Gene finding in eukaryotes is more complicated because of splicing. GENSCAN is a popular tool for solving this problem. It uses an explicit-duration HMM to model genes, and the Viterbi algorithm is used to produce a parse.

GENSCAN can be run at http://genes.mit.edu/genscan.html.