Improved Splice Site Detection in Genie

Martin G. Reese, Frank H. Eeckman
Human Genome Informatics Group
Lawrence Berkeley National Laboratory
1 Cyclotron Road, Berkeley, CA
mgreese@lbl.gov, fheeckman@lbl.gov

David Kulp and David Haussler
Baskin Center for Computer Engineering and Computer Science
University of California, Santa Cruz, CA 95064, USA
dkulp@cse.ucsc.edu, haussler@cse.ucsc.edu

Abstract

We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 82% of coding nucleotides correctly with a specificity of 81%, versus 74% and 81% in the older system. In further splice site experiments, we also looked at correlations between splice site scores and intron and exon lengths, as well as at the effect of distance to the nearest splice site on false positive rates.

1 Introduction

Current estimates for sequence output from the Human Genome Project are two Mbases per day, every day for the next seven years. Such a high throughput makes it important to develop new tools for annotation and analysis of genomic sequence.
It is particularly difficult to identify exactly the coding regions in genomic DNA, from which one can deduce the structure of genes and gene products. Genefinding research over the past 15 years has concentrated on the recognition of short signal sequences and codon usage statistics. Signal sequences include promoters, start codons, splice sites, stop codons, etc. Of these, splice sites are especially important since they define the boundaries between exons and introns, and hence define the exact extents of coding regions. Fickett [1] provides an overview and evaluation of the statistical measures for signal and content sensors.

(RECOMB 97, Santa Fe, New Mexico, USA. 0-89791-882-8/97/01)

More recently, genefinding systems have been developed that employ many of the known recognition techniques in concert (examples include FGENEH [2], GenLang [3], and GENEMARK [4]). Current state-of-the-art genefinding methods combine multiple statistical measures with database homology searching to identify gene features. (See, for example, GRAIL II [5, 6], GeneID [7, 8], and GeneParser [9].) We have designed a new genefinder of this type that we call Genie [10]. Our system is similar in design to GeneParser [9, 11] but is based on a rigorous probabilistic framework throughout, even where the homology matching methods are employed.

While Genie has performed well in comparison to other systems, all the best systems to date are still deficient in accurate prediction of the exon-intron boundaries, so any improvement in this area is likely to have a significant impact on the overall prediction process. Here we describe some recent experiments with Genie's splice site prediction method that resulted in significant improvements in overall performance.

Genie is an implementation of a generalized hidden Markov model (GHMM): a hidden Markov model [12] whose states are arbitrary sub-models emitting variable length sequences, rather than single letters, as in a standard HMM.
In the next section we give a brief introduction to GHMMs, using as an example a GHMM that defines a simple gene structure syntax. We then discuss how signal sensors are used to identify transitions, such as intron-exon boundaries, and how content sensors are used to score candidate regions, such as proposed exons, in the parse of a DNA sequence. Next, we describe our splice site recognition methodology and how it has been enhanced by using neural networks that rely on dinucleotide frequencies. Other experiments using our splice site predictors are described here as well.

Figure 1: A simple GHMM for a sequence containing a multiple exon gene. The arcs represent states that emit strings of bases and nodes represent transitions between states. The state labels are J5': 5' UTR, EI: Initial Exon, E: Exon, I: Intron, E: Internal Exon, EF: Final Exon, ES: Single Exon, and J3': 3' UTR. The node labels are B: Begin, S: Start Translation, D: Donor, A: Acceptor, T: Stop Translation, F: Final. The arrows imply a generation of bases from 5' to 3'.

Figure 2: A GHMM including frame constraints. The additional acceptor and donor transition nodes ensure that only syntactically correct parses are considered.

Figure 3: A sample content sensor combines evidence from multiple components to derive a maximum likelihood of the sequence. The arrow shows the combination of component features corresponding to the maximum likelihood.

2 Methodology

2.1 Basic System Framework

A generalized hidden Markov model is an enhancement of the standard hidden Markov model often used in time series pattern recognition in speech and computational biology. (See, among others, the tutorial from Rabiner and Juang [13] and the introduction to HMMs in biosequence analysis by Krogh et al. [12].) In a standard hidden Markov model, viewed as a generator, each state emits a single symbol. A GHMM describes a more general model in which each state can emit one or more symbols according to an arbitrary distribution. Each state represents an independent sub-model which may, itself, be a hidden Markov model or any statistical model.

Figure 1 shows a simple GHMM that models eukaryotic gene structure. The GHMM is represented as a graph. The states in the model are shown as the arcs of the graph. Nodes in the graph represent transitions between states. (This is different from typical graphical representations of regular HMMs.) Each state corresponds to a sub-model of an abstract gene feature such as an Internal Exon (E) or an Intron (I). For any sequence of bases, x, and state, q, the sub-model associated with the state q defines a likelihood for the sequence x. This likelihood is denoted P(x|q). When the GHMM is viewed as a generative statistical model, this is the probability that the sequence x is emitted when the hidden Markov process is in state q. These likelihood functions, one for each state, are part of the definition of the GHMM.

The graph of a GHMM has a unique source node B (for Begin) and a unique sink node F (for Final). The process of generating a string from a GHMM can be viewed as taking a random walk in the graph for the GHMM from the source to the sink. For any state q, the node that the arc for state q leads to is denoted node(q). Once in this node, a next state is chosen at random from among the outgoing arcs from this node, independent of any previous choices made.
The probability of choosing the next state r is denoted P(r|node(q)). For example, in Figure 1, the state I (Intron) leads to the node A (Acceptor). After the acceptor can come either the internal exon state (E) or the final exon state (EF). The former is chosen with probability P(E|A) and the latter with probability P(EF|A), where P(E|A) + P(EF|A) = 1. These parameters are part of the definition of the GHMM, and are in practice determined from training data, as are the parameters defining the likelihood functions P(x|q) defined above.

The full process of generating a string from a GHMM consists of a sequence of random choices. First a state q_1 is chosen from among the outgoing arcs of the source node B. Then a substring x_1 is generated according to the probability distribution P(·|q_1). Then a next state q_2 is selected from among the outgoing arcs from node(q_1). Then a substring x_2 is generated according to the probability distribution P(·|q_2), etc., continuing like this until a state q_k that leads to the sink node is selected. This state emits the last substring x_k. The full string emitted by the GHMM is the concatenation X = x_1 ... x_k of all the substrings that are emitted. All random choices made in the process of generating the string X are independent, except for the dependencies in the sequence q_1, ..., q_k of states, which form a Markov chain. In applications of GHMMs, this sequence of states is not observed; only the sequence X is observed. Therefore they are called hidden Markov models.

We define a parse φ of the sequence X to be a pair consisting of a sequence of states q_1, ..., q_k and a corresponding sequence of substrings x_1, ..., x_k, where X = x_1 ... x_k, q_1 is a state arc coming out of the unique source node (B), and q_k is a state arc leading to the unique sink node (F). The GHMM defines a joint likelihood of the sequence X = x_1, ..., x_k and the parse φ = (q_1, ..., q_k; x_1, ..., x_k), according to the generative model described above.
It is the product of the independent probabilities of the subsequences given the corresponding states and the probabilities of the transitions between states. That is,

$$P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right). \qquad (1)$$

Given only the observed sequence X, using a variant of the Viterbi algorithm [13], we can calculate the parse φ that maximizes equation 1, i.e. the most likely parse of X. In a GHMM that represents gene structure, such as the one in Figure 1, this most likely parse represents the model's prediction of the most likely gene structure within the sequence X. This variant of the Viterbi algorithm used to find the most likely parse is a dynamic programming algorithm that is essentially the same as the one defined by Auger and Lawrence [14] to identify segment neighborhoods, by Sankoff [15] to optimally decompose a sequence into disjoint regions with particular properties, and by Gelfand and Roytberg [16], Snyder and Stormo [11], Stormo and Haussler [17], and many others to do genefinding. So we do not elaborate on it here. GHMMs place these previous approaches within a convenient and general probabilistic framework.

The GHMM in Figure 1 represents only the basic ordering of gene features, and fails to fully capture the syntactic restrictions of a legal gene parse. In an ideal DNA sequence, the parse is frame consistent, i.e., the total number of coding nucleotides is a multiple of three and the reading frame is consistent from exon to exon. We can add additional states to the model graph such that only frame consistent parses are allowed. Figure 2 shows the model graph representing the resulting frame consistent GHMM. The three levels represent the three frames. Exon lengths can be restricted in the likelihood functions P(x|q) to equal 0, 1 or 2 modulo 3 for the various exon states in this GHMM in such a way as to enforce frame consistency (see Kulp et al. [18]). This more complex state structure is used by Genie.
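As an illustration of equation 1, the following minimal Python sketch evaluates the log-likelihood of a given parse. It is not Genie's implementation: the function name `parse_log_likelihood`, the dictionary layout, the toy transition probabilities, and the placeholder emission model (0.25 per base, with no length model) are all assumptions made for this example.

```python
import math

def parse_log_likelihood(states, substrings, emission, transition, node_of, source="B"):
    """Log of equation (1) for a parse (q_1..q_k; x_1..x_k).

    emission[q](x)     -> P(x | q), a hypothetical per-state likelihood function
    transition[n][q]   -> P(q | n), probability of taking arc q out of node n
    node_of[q]         -> node(q), the node that the arc for state q leads to
    """
    if not states or len(states) != len(substrings):
        raise ValueError("need one substring per state")
    # First factor: P(q_1 | B)
    logp = math.log(transition[source][states[0]])
    # Second factor: product of emission likelihoods P(x_i | q_i)
    for q, x in zip(states, substrings):
        logp += math.log(emission[q](x))
    # Third factor: product of transition probabilities P(q_{i+1} | node(q_i))
    for q, q_next in zip(states, states[1:]):
        logp += math.log(transition[node_of[q]][q_next])
    return logp

# Toy model (assumed values, for illustration only).
uniform = lambda x: 0.25 ** len(x)          # placeholder emission likelihood
emission = {"EI": uniform, "I": uniform, "E": uniform, "EF": uniform}
transition = {"B": {"EI": 1.0}, "D": {"I": 1.0}, "A": {"E": 0.5, "EF": 0.5}}
node_of = {"EI": "D", "I": "A", "E": "D", "EF": "F"}

logp = parse_log_likelihood(["EI", "I", "EF"], ["ATG", "GTAAG", "TAA"],
                            emission, transition, node_of)
print(logp)
```

Working in log space avoids numerical underflow on long sequences; a Viterbi-style search would maximize this same quantity over all parses rather than score a single one.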
Further extensions to the GHMM graph can also be added to make the model more realistic. For example, an arc leading back from node T to node S labeled with a state that generates non-coding bases between genes would allow the GHMM to model sequences that have multiple genes within them.

2.2 Sensors

A sensor is a mechanism for recognizing or scoring a subsequence according to a model of an abstract gene feature. There are two types of sensors used in the Genie system: signal sensors and content sensors.

2.3 Signal Sensor Models

Signal sensors are used to recognize transitions between states in a GHMM. This type of sensor is used in a pre-processing step to identify candidate sites where state transitions can occur. When only a limited number of sites are considered as possible locations of state transitions, the dynamic programming method used to find the most likely parse runs much faster. However, care must be taken here, because if an important site, such as a splice site needed in the correct or optimal parse, is not included, then the dynamic programming method will no longer find the correct solution. In the GHMMs shown in Figures 1 and 2, the nodes correspond to gene features such as acceptor sites, donor sites, and positions of start and stop translation. A typical signal sensor might be a neural network to recognize an acceptor site, as described further below.

2.4 Content Sensor Models

Content sensors are used to estimate the likelihood of a subsequence given a particular state in the GHMM. Some basic content sensors used by Genie were described in [18]. Since that paper, a more sophisticated type of content sensor has been developed for Genie. This new type of content sensor integrates evidence contributed from multiple sources and estimates a likelihood of a subsequence from the combined information [10]. In the new Genie content sensor, each source of evidence is called a component; a component is trained to recognize a specific feature.

Figure 3 shows an example of a fictitious subsequence whose likelihood is being evaluated by an internal exon content sensor. The internal exon content sensor is composed of several components: a nucleotide component, a codon component, end-region components representing the regions adjacent to the acceptor and donor sites, and a database homology match component. A component returns a likelihood for each potential feature occurrence, called an extent. In the figure, the maximum likelihood is determined by the joint probability of the extents shown at the bottom of the figure, i.e. an acceptor extent, followed by two nucleotide extents, a database match extent, and three codon extents. Again, we use dynamic programming to decompose the subsequence into a series of extents in such a way that the joint probability of all extents is maximized.
This decomposition is then used to calculate the likelihood. This simple, efficient method encourages a modular approach to developing an effective genefinding system because components can be easily added to or removed from a content sensor.

2.5 Identifying Splice Sites

The biological process of splicing is quite complex and involves various proteins. The mechanism responsible for the exact deletion of introns is probably related to gene conversion using a cDNA copy of the mRNA of a partially spliced intermediate RNA [19]. Clearly the intron-exon structure of genes has been very important in the generation of new genes during evolution (for an overview see Sharp [20]).

The problem of recognizing signals in genomic sequences by computer analysis was pioneered by Staden [21], and the recognition of splice sites using neural networks was first addressed by Brunak et al. [22]. They trained a backpropagation feedforward neural network with one layer of hidden units to recognize donor and acceptor sites, respectively. The best results were obtained by combining a neural network to recognize the consensus signal at the splice site with another one that predicted coding regions based on the statistical properties of the codon usage and preference. Solovyev and Lawrence [23] described a prediction program for splice sites based on oligonucleotide composition around the actual splice site location combined with discriminant analysis. The splice site recognizer used in the genefinding program GRAIL [24] also uses a standard neural network approach. All the above mentioned programs have a very high rate of false positives.

In the first Genie version [18] we implemented a feedforward neural network similar to the one described in Brunak et al. [22]. The sequence was encoded using 4 input units for each nucleotide, one hidden unit layer was used, and there was one yes/no output unit. For training we used standard backpropagation.
We trained one network to recognize the donor sites, using a hidden layer of 10 units, and another one to recognize acceptor sites, using 40 hidden units. In contrast with Brunak et al., we trained our networks only on positive and negative examples that have consensus splice sites, i.e., GT for the donor and AG for the acceptor site. A positive example for a donor site is a window of 15 residues of DNA from -7 to +8 around the GT in an actual human donor splice site, while a negative example is a window of the same size around a GT that is in a neighborhood of plus or minus 40 nucleotides around a real splice site, but is not itself a real splice site. The training examples for the net that learns to recognize acceptor sites are similar, except that the window size was larger. Experiments showed that 41 residues from -21 to +20 around the consensus AG were optimal for this application. We made no attempt to recognize the rare non-consensus splice sites that have been documented [25, 26].

Recently Henderson et al. [27] showed that neighboring nucleotides are very strongly correlated in the splice site consensus pattern. Based on these results, we have changed our input representation from a 4-bit code per base to a 16-bit code per nucleotide pair. Hence a window of 15 nucleotides is encoded as 14 pairs of adjacent nucleotides, and each pair is represented by 16 inputs that are all set to zero except the one in the position representing the letter pair in question, which is set to 1. This allows the network to easily model pairwise correlations between adjacent nucleotides. We also reduced the number of hidden units to 2 in the donor net and 10 in the acceptor net. This new network shows significantly improved prediction performance. We describe our experimental results below.

3 Experiments on splice site detection

Figure 4 and Figure 5 show the performance improvement using the 16-bit input coding for nucleotide pairs.
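The 16-bit pair encoding described in Section 2.5 can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in Genie: the function names, the alphabetical ordering of the 16 pairs, the all-zero treatment of pairs containing a non-ACGT symbol, and the exact alignment of the -7 to +8 window (7 bases before the G of the GT, 8 bases starting at the G) are assumptions.

```python
from itertools import product

BASES = "ACGT"
# Hypothetical ordering of the 16 dinucleotides: AA, AC, AG, AT, CA, ..., TT
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate(product(BASES, repeat=2))}

def donor_window(seq, gt_pos, left=7, right=8):
    """Return the 15-residue window around a GT whose G is at index gt_pos.

    Assumed alignment: `left` bases upstream of the G, then `right` bases
    starting at the G itself. Returns None if the window runs off the sequence.
    """
    start, end = gt_pos - left, gt_pos + right
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]

def encode_pairs(window):
    """Encode a window of n bases as (n - 1) one-hot vectors of 16 inputs each."""
    window = window.upper()
    vec = []
    for i in range(len(window) - 1):
        one_hot = [0] * 16
        idx = PAIR_INDEX.get(window[i:i + 2])   # None for pairs with N, etc.
        if idx is not None:
            one_hot[idx] = 1
        vec.extend(one_hot)
    return vec

seq = "TTCCAAGGTAAGTATTCC"
window = donor_window(seq, seq.index("GT"))
inputs = encode_pairs(window)
print(window, len(inputs))
```

For a 15-residue donor window this yields 14 × 16 = 224 network inputs, of which exactly 14 are set to 1; a 41-residue acceptor window would give 40 × 16 = 640 inputs under the same scheme.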
The results shown are obtained by testing the neural network performance on an independent test set of 50 genes that were not included in the training set, and not closely homologous to genes in the training set. In testing, all false splice sites with the consensus dinucleotide that could be found in a neighborhood of plus or minus 40 nucleotides around a real splice site were included as negative examples. One sees that in the interesting region where the threshold of the neural network is set so that the algorithm produces 1-10% false positive predictions, the new networks are much more sensitive. In particular, suppose one is able to tolerate a rate of 7% false positives in donor site prediction, i.e. in 7% of the cases where the network is given a window of 15 nucleotides centered around a GT that is not a real donor site it nevertheless classifies it as a real donor site. Then the net can achieve a rate of 98.67% true positives, with only 1.33% false negatives. The previous network had a corresponding false negative rate of 3.85%. Reducing the number of false negative splice site predictions is especially critical in a system that tries to find whole genes, because, as mentioned above, one false negative splice site prevents the system from ever constructing an entirely correct set of exons for the whole gene, whereas even in the presence of many false positives, so long as all the true positives are there, it is still possible for the dynamic programming/Viterbi optimization to select the correct set of exons, if there is enough gene content signal.

Figure 4: Performance of the new donor-sensing neural network versus the old donor sensor ("Donor: correct positives vs false positives"). Percentage false positive predictions are plotted on the x-axis and the corresponding percentage true positive predictions on the y-axis.

Figure 5: Performance of the new acceptor neural network versus the old acceptor sensor ("Acceptor: correct positives vs false positives").

A number of additional experiments were done comparing the scores of a false splice site close to a true splice site versus scores of a false splice site far away from a true splice site. There appears to be a tendency for a false splice site close to the true splice site to have a lower score than a similar false splice site far away from any true site, reducing the chance that the genefinder will mistakenly pick a wrong splice site very close to the true splice site. This is very helpful. In particular, the scores for false splice sites are atypically low when they occur in a window of about 50 bases on both sides of a true donor site (Figure 6). While scores for false acceptor sites in a window of about 100 bp on the intron side are atypically low, false acceptor sites on the exon side show surprisingly high scores. We believe that some of these high scores are due to annotation errors in the genomic DNA database.

Figure 6: Prediction scores for false GT / AG sites in the neighboring regions of true splice sites. The scores for each position in the sequence relative to the true splice site position are averaged over the entire data set.
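The trade-off plotted in Figures 4 and 5 (true positive rate at a tolerated false positive rate) can be read off from raw network scores as in the following sketch. The function name and the toy scores are hypothetical, and the rule "predict a site when score >= threshold" is an assumption about how the network output is thresholded.

```python
def true_positive_rate_at(pos_scores, neg_scores, max_fpr):
    """Best true positive rate achievable while keeping the false positive
    rate at or below max_fpr, using the rule: predict 'site' if score >= t."""
    best_tpr = 0.0
    # Lowering the threshold can only increase both rates, so scan from the top.
    for t in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        if fpr > max_fpr:
            break
        best_tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
    return best_tpr

# Toy scores (invented for illustration): real sites vs. false GT sites.
real_sites = [0.9, 0.8, 0.7, 0.4]
false_sites = [0.85, 0.3, 0.2, 0.1, 0.05]
print(true_positive_rate_at(real_sites, false_sites, max_fpr=0.25))
```

The false negative rate at the same operating point is one minus the returned value, which is how a figure such as 98.67% true positives corresponds to 1.33% false negatives.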
In our entire data set we find splice sites that cut the codon frame in all three positions: 41.3% of the splice sites are cut in frame 0, 38.7% in frame 1 and only 20.0% in frame 2. We believe that this significantly lower number of splice sites in frame 2 is due to the low coding value of the nucleotide in this codon position, which has little effect on the translated amino acid sequence. We also did a series of experiments where we trained three separate neural nets for donor site prediction, and similarly for acceptors, one for each frame. These experiments showed that splice sites in frames 0 and 2 are somewhat easier to predict than those in frame 1 (results not shown). We have not yet tried incorporating these networks into Genie.

Finally, we also looked at the correlations between exon length and the score of the flanking splice sites. Figure 7 and Figure 8 show the distribution of exon length versus the prediction score for donor and acceptor sites. Careful analysis shows that the exon length correlates with the score and therefore the strength of the consensus pattern. This result on our new gene dataset from GenBank version 95 from June 1996 confirms an earlier hypothesis [25]. It is particularly helpful in genefinding that short exons tend to have higher scoring splice sites.

Figure 7: Exon length versus donor site score (Z-score) from the neural network.

Figure 8: Exon length versus acceptor site score (Z-score) from the neural network.

3.1 Cross-validation Experiments on Whole Gene Prediction

We did a set of experiments on whole gene prediction using the new splice site sensors. The data set used during training and testing was a collection of 288 sanitized, annotated, multiple-exon human DNA sequences from the GenBank sequence database. The data set was randomly partitioned into seven test sets of uniform size to be used in cross-validation experiments. For each test set, the content sensors were trained on the remaining training data and predictions were recorded for the sequences in the test set.

Additional tests were performed on a data set of 570 vertebrate genes. This data set was used by Burset and Guigo as a benchmark for the comparison of many different genefinders [8]. The data set is available at ftp://genome.lbl.gov/pub/genesets/.

4 Results

Table 1 shows the results of running Genie using the new splice site detector on all seven test sets and the average results over the entire data set. We also tested Genie against the Burset/Guigo data set; results on this set comparing our genefinder with other genefinding systems are shown in Table 2. In addition to the version of Genie based only on statistical properties trained from existing genes, Table 3 shows the results using our new scheme to incorporate homology information from a database, as discussed in Section 2.4.

In accordance with the testing scheme established by Burset and Guigo [8], we report sensitivity and specificity with respect to per-base prediction of coding/non-coding and with respect to exact prediction of exons.
The per-base sensitivity is the fraction of true coding bases predicted as coding, and the specificity is the fraction of all predicted coding bases that were correct. Similarly, the exon sensitivity is the fraction of true exons predicted exactly, and the specificity is the fraction of predicted exons that were correct. In these tests, correct exon prediction requires identification of the exact position of splice sites. Fully or partially overlapping predictions are not accepted. The approximate correlation (AC) is described by Burset and Guigo as a preferred alternative to the correlation coefficient and defined by

$$ACP = \frac{1}{4}\left[\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right], \qquad AC = 2\,(ACP - 0.5),$$

where TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives. In addition, we also report the fraction of true exons that were not identified either exactly or overlapping (Missing Exons) and the fraction of predicted exons that did not overlap any true exon (Wrong Exons).

5 Discussion

The work presented here extends the work and results reported in Kulp et al. [18] by adding two novel neural networks for donor and acceptor splice site predictions. Other experiments exploring the properties of the new splice site detectors are also reported. Our approach was motivated by the work of Henderson et al. [27], showing strong correlations in neighboring nucleotides at the splice site.

The addition of the new networks increased the overall prediction accuracy by approximately 5%, and caused the number of missed exons to drop significantly. Per-base sensitivity increased about 8%. These results show the importance of correct splice site predictions for the overall genefinding process used by Genie. The new input encoding using dinucleotides, which allows the net to easily exploit correlations between neighboring bases, resulted in more sensitive splice site detectors.
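As an illustration of the per-base measures defined in Section 4, the following sketch computes sensitivity, specificity, and the approximate correlation of Burset and Guigo [8] from two per-base labelings. The function name and the 0/1 input representation are assumptions; the AC formula used is the average of the four conditional probabilities, rescaled as AC = 2(ACP - 0.5), and this simplified version requires all four denominators to be non-zero.

```python
def per_base_metrics(true_coding, predicted_coding):
    """Per-base Sn, Sp and AC from two equal-length sequences of 0/1 labels
    (1 = coding, 0 = non-coding)."""
    if len(true_coding) != len(predicted_coding):
        raise ValueError("label sequences must have equal length")
    tp = fp = tn = fn = 0
    for t, p in zip(true_coding, predicted_coding):
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    if 0 in (tp + fn, tp + fp, tn + fp, tn + fn):
        raise ValueError("a measure is undefined for these labels")
    sn = tp / (tp + fn)   # fraction of true coding bases predicted as coding
    sp = tp / (tp + fp)   # fraction of predicted coding bases that are correct
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn))
    return {"Sn": sn, "Sp": sp, "AC": 2 * (acp - 0.5)}

metrics = per_base_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                           [1, 1, 1, 0, 1, 0, 0, 0])
print(metrics)
```

Note that "specificity" here follows the genefinding convention TP / (TP + FP), which differs from the TN / (TN + FP) definition common in other fields.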
Another interesting observation was made: non-functional GT and AG sites close to real splice sites have significantly lower scores than GT and AG sites far removed from real splice junctions. This phenomenon, which is not well understood, improves the performance of our genefinding methods. We also studied the length distributions of exons versus the scores of the flanking splice sites and found that shorter exons have stronger splice site consensus signals than average length exons. The longest exons also have very strong splice sites.

The overall performance of the new Genie compares quite favorably with the other genefinders on the Burset and Guigo dataset. It should also be noted that this dataset includes sequences from all vertebrates, whereas Genie was trained only on human DNA.

Table 1: Prediction results on seven test sets using the new splice site predictions and with the old ones. Per base statistics refer to the ability to predict whether a nucleotide is coding or non-coding. Per exon statistics refer to the ability to predict a complete exon exactly. B/G indicates results for the Burset/Guigo data set.
Columns: Data Set; Per Base (Sn, Sp, AC); Exact Exon (Sn, Sp, Avg, ME, WE).
Rows: the seven test-set parts, Average (Old, New), B/G data set (Old, New).

Table 2: A comparison of Genie with other genefinding systems. Tests were run on a set of 570 annotated sequences from different organisms (Burset/Guigo data set).
Columns: Genefinder; Per Base (Sn, Sp); Exact Exon.
Rows: Genie, FGENEH, GeneID, GeneParser, GenLang, GRAIL, SORFIND, Xpound.

Table 3: Prediction results on the entire data set and the Burset/Guigo data set using the old and the new splice site recognizers and homology matches. In this table, we also include, for comparison, predictive results of GeneID+ and GeneParser as reported in Burset and Guigo. Only sequences of length less than 8000 were tested in the latter data set to provide comparable results with the other genefinders.
Columns: Data Set; Per Base (Sn, Sp, AC); Exact Exon (Sn, Sp, Avg, ME, WE).
Rows: Genie Data Set (Average Old, Average New), B/G Data Set.

We have developed a WWW interface for Genie. Researchers can submit sequences to our server and receive predictions by . The URL for Genie is The splice site predictors are separately accessible at

In future work we plan to extend Genie so that it can reliably find multiple genes in a single DNA sequence. We also plan to improve the statistical model used in the intron state of Genie, as well as the model for intergenic DNA. This can be accomplished by incorporating sensors for promoters, the transcription start site, DNA repeat sequences, and the overall structure of 5' and 3' untranslated regions. We are also planning to extend Genie so that it can incorporate homology hits to cDNA databases when these are available.

6 Acknowledgments

We would like to especially thank Kevin Karplus for his contributions to this work, particularly for the splice site profiles he built, and his valuable discussion. We would also like to extend our gratitude to Gary Stormo, Nomi Harris, Melissa Cline, and Richard Hughey for their assistance in the development of Genie. This work was supported in part by DOE grant no. DE-FG03-95ER62112 and DE-AC03-76SF M.G. Reese and D. Haussler acknowledge support of the Aspen Center for Physics, Biosequence Analysis Workshop.

References

[1] J. W. Fickett and C.-S. Tung. Assessment of protein coding measures. Nucl. Acids Res., 20: ,
[2] V. Solovyev, A. Salamov, and C. Lawrence. Predicting internal exons by oligonucleotide composition and discriminant analysis of splicable open reading frames. Nucl. Acids Res., 22: ,
[3] S. Dong and D. B. Searls. Gene structure prediction by linguistic methods. Genomics, 162: ,
[4] M. Borodovsky and J. McIninch. Genmark: Parallel gene recognition for both DNA strands. Computers and Chemistry, 17(2): ,
[5] Y. Xu, J. R. Einstein, M. Shah, and E. C. Uberbacher.
An improved system for exon recognition and gene modeling in human DNA sequences. In ISMB-94, Menlo Park, CA, AAAI/MIT Press.
[6] Y. Xu and E. Uberbacher. Gene prediction by pattern recognition and homology search. In ISMB-96, St. Louis, June AAAI Press.
[7] R. Guigo, S. Knudsen, N. Drake, and T. Smith. Prediction of gene structure. J. Mol. Biol., 226: ,
[8] M. Burset and R. Guigo. Evaluation of gene structure prediction programs. Genomics, 34(3): ,
[9] E. Snyder and G. Stormo. Identification of protein coding regions in genomic DNA. JMB, 248,
[10] D. Kulp, D. Haussler, M. Reese, and F. Eeckman. Integrating database homology in a probabilistic gene structure model. In Biocomputing: Proceedings of the 1997 Pacific Symposium. World Scientific Publishing, January, to appear.
[11] E. E. Snyder and G. D. Stormo. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucl. Acids Res., 21: ,
[12] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235: , February
[13] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, January
[14] I. E. Auger and C. E. Lawrence. Algorithms for the optimal identification of segment neighborhoods. Bull. Math. Biol., 51:39-54,
[15] D. Sankoff. Efficient optimal decomposition of a sequence into disjoint regions, each matched to some template in an inventory. Math. Biosci., 111: ,
[16] M. S. Gelfand and M. A. Roytberg. Prediction of the exon-intron structure by a dynamic programming approach. BioSystems, 30: ,
[17] G. D. Stormo and D. Haussler. Optimally parsing a sequence into different classes based on multiple types of information. In ISMB-94, Menlo Park, CA, August AAAI/MIT Press.
[18] D. Kulp, D. Haussler, M. Reese, and F. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA.
In ISMB-96, St. Louis, June AAAI Press.
[19] G. R. Fink. Cell, 49: ,
[20] P. A. Sharp. Split genes and RNA splicing. Cell, 77: ,
[21] R. Staden. Computer methods to locate signals in nucleic acid sequences. NAR, 12: ,
[22] S. Brunak, J. Engelbrecht, and S. Knudsen. Prediction of human mRNA donor and acceptor sites from the DNA sequence. JMB, 220:49-65,
[23] V. V. Solovyev and C. B. Lawrence. Identification of human gene functional regions based on oligonucleotide composition. In Proceedings, 1st International Conference on Intelligent Systems for Molecular Biology, Menlo Park, AAAI Press.
[24] E. Uberbacher and R. Mural. Locating protein coding regions in human DNA sequences by a multiple sensor - neural network approach. Proceedings of the National Academy of Sciences of the United States of America, 88: ,
[25] S. M. Mount, X. Peng, and E. Meier. Some nasty facts to bear in mind when predicting splice sites. In Gene-Finding and Gene Structure Prediction Workshop, Philadelphia, PA, October

[26] P. Senapathy, M. B. Shapiro, and N. L. Harris. Splice junctions, branch point sites and exons: sequence statistics, identification, and applications to genome project. Meth. Enzymol., 183: ,
[27] J. Henderson, S. Salzberg, and K. Fasman. Finding genes in human DNA with a hidden Markov model. In Proceedings, 4th International Conference on Intelligent Systems for Molecular Biology, St. Louis, June AAAI Press.