Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 155 12 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94 (1997). 12.1 Introduction In the 196s, it was discovered that a gene and its protein product are colinear structures with a direct correlation between the triplets of nucleotides in the gene and the amino acids in the protein. It soon became clear that genes can be difficult to determine, due to the existence of overlapping genes, and genes within genes etc. Moreover, the paradox arose that the genome size of many eukaryotes does not correspond to genetic complexity, for example, the salamander genome is 1 times the size of that of human. In 1977, the surprising discovery of split genes was made: genes that consist of multiple pieces of coding DNA called exons, separated by stretches of non-coding DNA called introns. DNA DNA Transcription RNA mrna Translation nucleus mrna splicing Protein Protein Prokaryote Eukaryote The existence of split genes and a large proportion of non-coding DNA makes the following problem challenging in eukaryotes: The gene finding problem: Given a DNA sequence, correctly predict the structure of every gene contained in the sequence. 12.1.1 Types of genes Until quite recently, the word gene was usually understood to mean a coding gene that is translated into a protein sequence. We now distinguish between: Protein-coding genes and non-coding genes, also known as RNA genes, that code for RNAs such as: rrna trna snrna (small nuclear RNA) snorna (small nucleolar RNA)

156 Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 mirnas (micro RNAs).... In this chapter, we focus on the prediction of protein-coding genes. 12.2 ORF prediction in prokaryotes The simplest way to detect potential coding regions in prokaryotes is to look for so-called open reading frames: Definition 12.2.1 (Open reading frame (ORF)) An ORF is a sequence of codons in DNA that starts with a start codon (ATG), ends with a stop codon (TAA, TAG or TGA) and has no other (in-frame) stop codons inside. Assume that we are at a start codon. In random DNA, the probability P (k) that we see the next stop codon exactly k codons away is geometrically distributed with p =. It is calculated # stop codons # all codons = 3 64 as P (k) = (1 p) k 1 p. The mean value is 1 p = 64 3 21 codons. This is much smaller than the number of codons in an average protein ( 3). Essentially, long ORFs indicate genes, whereas short ORF may or may not indicate genes or short exons. 12.2.1 Codon usage Additionally, codon usage can be taken into account: Definition 12.2.2 (Codon usage) The codon usage of a stretch of DNA sequence is given by a 64-component vector that counts how many times each codon is present in the sequence. These counts differ substantially between coding and non-coding regions. Example: part of the codon usage vector for coding regions of E. coli: amino acid Codon Freq. per 1 codons Gly (glycine) GGG 1.89 Gly GGA.44 Gly GGT 52.99 Gly GGC 34.55 Glu (glutamic acid) GAG 15.68 Glu GAA 57.2 Aps (aspartic acid) GAT 21.63 Asp GAC 43.26 12.3 Eukaryotic gene structure Here is a simple model of the structure of a eukaryotic gene: Promotor TATA 5 UTR Start site ATG Initial exon Donor site Acceptor site Intron GT AG internal exon(s) Intron GT AG Terminal exon Stop site TAA TAG TGA 3 UTR Poly A AAATAAAA

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 157 Given a long segment of DNA sequence, how to find such genes in it? Statistical properties of the different components of such a gene model can help to predict genes in unannotated DNA. For example, for the bases around the start site the observed frequencies can be described using a position weight matrix: Pos. -8-7 -6-5 -4-3 -2-1 +1 +2 +3 +4 +5 +6 +7 A.16.29.2.25.22.66.27.15 1.28.24.11.26 C.48.31.21.33.56.5.5.58.16.29.24.4 G.18.16.46.21.17.27.12.22 1.48.2.45.21 T.19.24.14.21.6.2.11.5 1.9.26.21.21 This type of matrix is obtained by counting the frequencies of different nucleotides near the start sites in a training set of annotated sequences. 12.4 GENSCAN s model In the remainder of this chapter, we will discuss the details of GENSCAN, a classic program for predicting genes in eukaryotes, which is based on an HMM: E+ E1+ E2+ I+ I1+ I2+ P (promoter) A (poly A signal) F (5 UTR) Esngl (single exon gene) T (3 UTR) Einit+ (initial exon) Eterm+ (terminal exon) Einit (initial exon) Eterm (terminal exon) F+ (5 UTR) Esngl+ (single exon gene) T+ (3 UTR) P+ (promoter) A+ (poly A signal) I I1 I2 Forward (+) strand Reverse ( ) strand N (intergenic region) E E1 E2 12.5 GENSCAN s model Genscan has 27 states: The N state models intergenic regions. the P and A states model a promotor region and a poly-a attachment site. The F and T states model the 5 and 3 untranslated regions. The Einit, Eterm and Esngl model initial-, terminal- and single exons, respectively. There are three intron states I, I1 and I2, representing Genscan predicts genes in both strands of DNA simulatenously, that is why there are two copies of each state (except the intergenic one).

158 Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 GENSCAN uses an explicit state duration HMM. This is an HMM in which a duration period is explicitly modeled for each state, using a probability distribution. The model is thought of generating a parse π, consisting of: a sequence of states q = (q 1, q 2,..., q n ), and an associated sequence of durations d = (d 1, d 2,..., d n ), which, using probabilistic models for each of the state types, generates a DNA sequence S of length L = n i=1 d i. The generation of a parse of a given sequence length L proceeds as follows: 1. An initial state q 1 is chosen according to an initial distribution p on the states, i.e. p i = P (q 1 = Q (i) ), where Q (j) (j = 1,..., 27) is an indexing of the states of the model. 2. A state duration or length d 1 is generated conditional on the value of q 1 = Q (i) from the duration distribution f Q (i). 3. A sequence segment s 1 of length d 1 is generated, conditional on d 1 and q 1, according to an appropriate sequence-generating model for state type q 1. 4. The subsequent state q 2 is generated, conditional on the value of q 1, from the (first-order Markov) state transition matrix T, i.e. T i,j = P (q k+1 = Q (j) q k = Q (i) ). This process is repeated until the sum n i=1 d i of the state durations first equals or exceeds L, at which point the last state duration is appropriately truncated, the final stretch of sequence is generated and the process stops. The resulting sequence is simply the concatenation of the sequence segments, S = s 1 s 2... s n. In addition to its topology involving the 27 states and 46 transitions depicted above, the model has four main components: a vector of initial probabilities p, a matrix of state transition probabilities T, a set of length distributions f, and a set of sequence generating models P. (Recall that an HMM has initial-, transition- and emission probabilities). 12.6 Computation of a gene prediction As the GENSCAN model is an HMM, augmented by explicit durations, the algorithms discussed in the HMMs chapter can all be applied (after appropriate modifications to take the durations into account). In particular, a gene prediction can be performed by using the Viterbi algorithm to compute a most probable path through the HMM. Note that the model is set up in such a way that a gene prediction is performed in both strands of the DNA simultaneously.

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 159 12.7 Details of the model So far, we have discussed the topology and the other main components of the GENSCAN model in general terms. The following details still need to be discussed: different intron phases, the initial and transition probabilities, the state length distributions, transcriptional and translational signals, splice signals, and reverse-strand states. Due to time constraints, we will only consider a few of these issues. 12.8 Intron phase Splicing in genes does not respect codon boundaries. Hence, any exon (except the terminal one) may end with an overhang of, 1 or 2 nucleotides and this is tracked by the HMM by transitioning to an intron state of phase, 1 or 2, respectively. When we enter an exon state from on intron state of phase i, then we must first generate the appropriate number of nucleotides to complete the overhanging codon of the previous exon before generating further codons. phase 1 intron TA TGT GTT ACT CGC GCT CGC TT exon phase 2 intron 12.9 State length distributions for exons and introns The different states of the model correspond to sequence segments with different length distributions. For some states, especially internal exon states E k, length is important for proper biological function, i.e. proper splicing and inclusion in the final processed mrna. It has been shown in vivo that internal deletions of exons to sizes below about 5 bp may often lead to exon skipping, and steric interference between factors recognizing splice sites, may make splicing of small exons more difficult. Spliceosomal assembly may be inhibited if internal exons are expanded beyond 3 bp. In summary, these arguments support the observation that internal exons are usually 12 15 bp long, with only a few of length less that 5 bp or more than 3 bp. Constraints for initial and terminal exons are slightly different. The duration in initial, internal and terminal exon states is modeled by a different empirical distribution for each of the types of states.

16 Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 75 25 4 6 2 Number of exons 3 Number of exons 1 Number of exons 2 2 4 2 4 2 4 Length (bp) Length (bp) Length (bp) Initial exons Internal exons Terminal exons In contrast to exons, the length of introns does not seem biologically critical, although a minimum length of 7 8 may be preferred. The length distribution for introns appears to be approximately geometric (exponential), with the average length depending on the C+G content of the sequence (the higher the content, the shorter the introns, on average). Hence, the duration in intron states is modeled by a geometric distribution with parameter q that depends on the C+G content. 3 Number of introns 2 1 1k 2k 3k 4k 5k 6k 7k 8k Length (bp) Introns 12.1 Simple signal models There are a number of different models of biological signal sequences, such as donor and acceptor sites, promoters, etc. One simple approach is the weight matrix method (WMM). Here, the frequency p a (i) of each nucleotide a at position i of a signal of length n is derived from a collection of aligned signal sequences. The product P (A) = n A = a 1 a 2... a n. i=1 P a (i) i Here is a WMM for recognition of a start site: is used to estimate the probability of generating a particular sequence Pos. -8-7 -6-5 -4-3 -2-1 +1 +2 +3 +4 +5 +6 +7 A.16.29.2.25.22.66.27.15 1.28.24.11.26 C.48.31.21.33.56.5.5.58.16.29.24.4 G.18.16.46.21.17.27.12.22 1.48.2.45.21 T.19.24.14.21.6.2.11.5 1.9.26.21.21 Under this model, the sequence...ccgccacc ATG GCGC... has the highest probability of containing a start site, namely: P =.48.31 46.33.56.66.5.58 1 1 1.48.29.45.4 =.6. The sequence...agtttttt ATG TAAT... has the lowest non-zero probability of containing a start site at the indicated position, namely: P =.16.16.14.21.6.2.11.5 1 1 1.9.24.11.21 = 2.4 1 11.

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 161 12.11 Splice signals The donor and acceptor splice signals are probably the most important signals, as the majority of exons are internal ones. The consensus region of the donor splice sites covers the last 3 bp of the exon (positions -3 to -1) and the first 6 bp of the succeeding intron (positions 1 to 6):... exon intron... Position -3-2 -1 +1 +2 +3 +4 +5 +6 Consensus c/a A G G T a/g A G t WMM: A.33.6.8.49.71.6.15 C.37.13.4.3.7.5.19 G.18.14.81 1.45.12.84.2 T.12.13.7 1.3.9.5.46 12.12 Example of GENSCAN summary output GENSCAN 1. Date run: 28-Apr-14 Time: 2:56:56 Sequence HUMAN DNA : 36741 bp : 52.9% C+G : Isochore 3 (51-57 C+G%) Parameter matrix: HumanIso.smat Predicted genes/exons: Gn.Ex Type S.Begin...End.Len Fr Ph I/Ac Do/T CodRg P... Tscr.. ----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------ 1.1 Intr + 342 3538 119 1 2 14 66 32.16 3.29 1.2 Intr + 395 463 114 1 22 61 133.276 5.15 1.3 Term + 431 4426 117 1 83 44 64.725.34 1.4 PlyA + 529 534 6 1.5 2.5 PlyA - 5519 5514 6 1.5 2.4 Term - 7619 7455 165 2 36 47 124.792 1.33 2.3 Intr - 9364 939 56 2 2 9 84 48.63 3.79 2.2 Intr - 9557 9478 8 1 2 7 53 49.94 -.71 2.1 Init - 9986 991 77 2 2 9 9 58.71 6.79 2. Prom - 13449 1341 4-3.91 3.2 PlyA - 14233 14228 6 -.45 3.1 Sngl - 14748 14287 462 13 51 222.425 7.51 3. Prom - 1719 177 4-7. 4. Prom + 19924 19963 4.29... 12.13 Performance of GENSCAN GENSCAN was run on a test set of 57 vertebrate sequences and the forward strand exons in the optimal GENSCAN parse of the sequence were compared to the annotated exons. The following table shows the results and compares them with results obtained using other programs:

162 Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 (Source: Burge and Karlin 1997) GENSCAN performs very well here and is a widely-used gene finding method. 12.14 Summary A straight-forward approach to finding genes in prokaryotes is to search for long open reading frames that have a plausible codon usage. Gene finding in eukaryotes is more complicated, due of splicing. GENSCAN is a popular tool for solving this problem. It uses an explicit-duration HMM to model genes and the Viterbi algorithm is used to produce a parse. GENSCAN can be run at http://genes.mit.edu/genscan.html.