Gene Structure Prediction From Many Attributes


Gene Structure Prediction From Many Attributes

Adam A. Deaton and Rocco A. Servedio
Engineering Sciences Lab 102, Division of Engineering and Applied Sciences
Harvard University, Cambridge, MA
Phone: (617) Fax: (617)

January 30, 1998

Abstract

Considerable research effort has been directed in recent years toward the problem of computationally identifying genes in DNA sequences. A fundamental component of a gene-finding system is a predictor which, when given a window of DNA sequence data, predicts whether or not it codes for protein product. In this paper we propose that mistake-driven, multiplicative-weight-update learning algorithms operating over a large feature set are well suited to this prediction problem, and describe a system we have built which takes this approach. Our system is fast, simple, and produces more accurate classifiers than have previously been obtained for a range of different sequence lengths. We conclude that a system of this type will be a useful component in larger gene-finding programs.

Keywords: exon prediction; gene identification; coding sequence; machine learning; Winnow; multiplicative-weight-update learning algorithm

1 Introduction

A major thrust of biological research over the past decade has been the acquisition of massive amounts of DNA sequence data. Along with an ever-larger supply of raw data, the need for accurate, automatic systems for analyzing DNA sequences has grown correspondingly more acute. Perhaps the most widely desired tool for sequence analysis would be a reliable method for identifying genes in DNA. Numerous gene-finding systems have been developed and implemented in recent years, including FGENEH [26], Genie [15], GenLang [8], GENSCAN [4], GRAIL II [31], MORGAN [22] and VEIL [23]. Many of these systems are still under development, and some encouraging results have been obtained, but the general problem is far from solved.
In a comprehensive survey of gene-finding systems, Burset and Guigo [5] found that most systems had average accuracy (for predicting protein coding/noncoding status at the individual nucleotide level) between 62% and 71%; they concluded that overall "the accuracy of the programs was severely affected by relatively high rates of sequence errors". Better techniques for predicting coding/noncoding status of DNA subsequences would thus be of considerable use for the next generation of gene-finding programs. In this paper we focus on the following prediction problem: given a short window of DNA, predict whether it is all-coding or all-noncoding. Devising an accurate solution to this prediction problem is an important step toward building a successful gene-finder. Previous researchers have identified numerous features that can be efficiently computed from short DNA windows and have used techniques such as linear discriminant analysis, neural nets, and decision trees in attempts to solve this problem; see Section 2 for more details on related work. Our study takes a different approach, one that is motivated by the philosophy outlined by Valiant in his recent work on cognitive computation [27], [28]. As Valiant points out, two broad lessons that can be drawn from the past decade of research in computational learning theory are the following. First, despite intensive efforts, no general techniques are known for directly constructing complex features such as the monomials in a DNF representation of an unknown target function. On the other hand, some partial techniques are known for tolerating large numbers of irrelevant attributes in certain contexts. One goal of the current research is to demonstrate that these insights can be directly applied to the construction of systems for real-world problems of significant practical interest. In keeping with this approach, our system uses Littlestone's Winnow algorithm [18], [19] for learning linear threshold functions as its main predictive component. Mistake-driven multiplicative-weight-update learning algorithms such as Winnow have been the subject of much study in the machine learning community over the past decade (e.g. [13], [14], [18], [19], [20]). These algorithms are known to perform exceptionally well in the presence of many irrelevant attributes and have provable resistance to noisy data. Recent applied research has exploited these properties by using these algorithms for problems such as context-sensitive spelling correction [11], library book use prediction [24], text categorization [17], [7] and portfolio selection [12].
Our results indicate that the Winnow algorithm can be used successfully to find protein-coding regions in DNA. In the experiments we carried out, our Winnow-based system achieved higher levels of performance on Fickett and Tung's benchmark human DNA data sets [9] than any previous program. The difference in accuracy between our classifiers and Salzberg's decision trees [21] on these benchmark data sets is approximately the same as the difference between Salzberg's decision trees and the best single features. Our algorithm also performs well on a newer benchmark set of human DNA data which has been assembled by David Kulp and Martin Reese. Based on the successful performance of our algorithm on this single-window prediction problem, we are currently working on incorporating it into a larger gene-finding system in which the goal is to identify the coding regions (if any are present) within a long stretch of DNA. The remainder of this paper is structured as follows. Section 2 gives background on previous research on this problem. Section 3 contains an introduction to the Winnow algorithm, a brief overview of some of its properties, and a discussion of why it is likely to be useful in this context. Section 4 describes data and methods. We present and discuss our results in Section 5, and give conclusions and suggestions for future work in Section 6.

2 Related Work

Many researchers have considered the problem of computationally distinguishing between coding and non-coding regions of DNA. In a comparison and evaluation of published techniques, Fickett and Tung [9] noted that more than twenty different coding measures were present in the literature. As a benchmark for comparison of different techniques, they proposed three publicly available sets of human DNA data, consisting of sequences of 54, 108, and 162 base pairs respectively.
Drawing from the available literature, Fickett and Tung created a unified list of 21 coding measures and, using Linear Discriminant Analysis, separately assessed the accuracy of each of these 21 measures on each of their three benchmark data sets. It stands to reason that higher accuracy could be achieved by combining individual coding measures. Indeed, Fickett and Tung report that in a simple test of the GRAIL system's Coding Recognition Module [30] on their 108-bp benchmark data set, average accuracy of 79% was achieved, which is superior to any single coding measure. In a more systematic approach to using multiple coding measures, Salzberg [21] experimented with learning decision tree classifiers for the Fickett and Tung benchmark data. The features employed by his trees were a subset of the 21 coding measures which Fickett and Tung had reviewed. Salzberg's OC1 system for building decision trees can be set to allow either univariate or multivariate tests at each tree node. His study employed several different splitting and pruning criteria in the construction of the decision trees. For each of the three data sets, the classifiers he obtained were significantly more accurate than the best individual coding measure (see Figure 1 in Section 5).

3 Winnow

3.1 The Winnow algorithm

Many variants of the Winnow algorithm have been described and studied over the past ten years ([1], [2], [18], [19]); however, these variants all share the same underlying structure. Like the well-known Perceptron algorithm, Winnow works by attempting to construct a linear separator for the sequence of examples which it is given. The algorithm processes examples one-by-one as it receives them, and updates its current hypothesis (i.e. weight vector) only when it makes a mistake. The main difference between Winnow and the Perceptron algorithm is that Winnow adjusts its hypothesis by making multiplicative rather than additive changes to the coefficients of the weight vector; it turns out that this seemingly minor change in fact makes a substantial difference in terms of the mistake bound which can be proved for the algorithm. More specifically, the version of Winnow which we use, called Balanced Winnow, maintains two weight vectors, a positive weight vector w+ = (w+_1, ..., w+_n) ∈ (0, ∞)^n and a negative weight vector w− = (w−_1, ..., w−_n) ∈ (0, ∞)^n. Initially, w+ = w− = (1, 1, ..., 1). At each step, the algorithm is given a labelled data point (x, y), where x ∈ [0, 1]^n and y ∈ {+1, −1}; the data point x can be thought of as a vector of feature values for some fixed feature set.
The label y for each point is either positive or negative, depending on whether the point is a positive or negative example of the concept which is being learned. Upon being given (x, y), Winnow performs two phases, a prediction phase and an update phase. In the prediction phase, the algorithm computes (w+ − w−) · x. It predicts ŷ = 1 if this quantity is nonnegative, and predicts ŷ = −1 otherwise. In the update phase, for all i = 1, 2, ..., n the algorithm sets

    w+_i ← w+_i · α^((y − ŷ) x_i)    and    w−_i ← w−_i · α^(−(y − ŷ) x_i),

where α > 1 is some fixed update parameter. Note that if the algorithm's prediction was correct, no update is performed. When the algorithm makes a false positive prediction, some coordinates of w+ will decrease and the corresponding coordinates of w− will increase; the opposite occurs when it makes a false negative prediction.

3.2 Properties of the Winnow algorithm

In [18], Littlestone proved a convergence theorem for Winnow which is similar to the classical Perceptron Convergence Theorem; he showed that if the data set on which Winnow is trained is linearly separable, then the Winnow algorithm will eventually find a linear threshold function which classifies all examples correctly. However, he also proved that Winnow is attribute efficient in the following sense: if the target linear threshold function depends on only k out of the n attributes (i.e. has k nonzero coefficients), then the maximum number of mistakes which Winnow can make depends only logarithmically on n (and polynomially on k, given some further assumptions). The attribute efficiency of Winnow is a useful property when the feature set (i.e. n) is large but the target concept is simple in that it involves few features. Another useful property of Winnow is that both the prediction and update steps can be performed efficiently. If m of the n components of an example x are nonzero, then prediction and updates each take O(m) time instead of O(n); this can represent a considerable savings.
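The prediction and update phases described above can be rendered in a few lines of code. The sketch below is our own illustrative Python (sparse examples as index-to-value dicts, `alpha` for the update parameter α), not the authors' implementation:

```python
def predict(w_pos, w_neg, x):
    """Predict +1 if (w+ - w-) . x >= 0, else -1; x is a sparse dict {index: value}."""
    score = sum((w_pos[i] - w_neg[i]) * v for i, v in x.items())
    return 1 if score >= 0 else -1

def update(w_pos, w_neg, x, y, alpha=1.5):
    """Multiplicative update, performed only when the prediction is a mistake."""
    y_hat = predict(w_pos, w_neg, x)
    if y_hat != y:
        for i, v in x.items():  # only the nonzero coordinates of x change, O(m) work
            w_pos[i] *= alpha ** ((y - y_hat) * v)
            w_neg[i] *= alpha ** (-(y - y_hat) * v)
    return y_hat

def train(examples, n, alpha=1.5):
    """One mistake-driven pass over the example stream; weights start at all ones."""
    w_pos, w_neg = [1.0] * n, [1.0] * n
    for x, y in examples:
        update(w_pos, w_neg, x, y, alpha)
    return w_pos, w_neg
```

Because the update touches only the nonzero coordinates, the per-example cost is proportional to the sparsity of x, which is the property exploited later in the paper.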

Table 1: Coding and Noncoding Frequencies in Fickett and Tung Benchmark Data Sets

                              54-bp      108-bp     162-bp
    Training set, coding      20,456      7,086      3,512
    Training set, noncoding  125,132     58,118     36,502
    Training set total       145,588     65,204     40,014
    Test set, coding          22,902      8,192      4,226
    Test set, noncoding      122,138     57,032     35,602
    Test set total           145,040     65,224     39,868

As we will see, in our application to protein coding status prediction, the examples are all very sparse vectors, allowing us to execute the algorithm quickly. One natural objection to using the Winnow algorithm for a problem such as protein coding status prediction is that the hypothesis class which it employs, the class of linear threshold functions, is too simple to be of any use. It is true that the representational power of this hypothesis class is limited; for instance, there are Boolean functions which can be succinctly represented as decision trees but are not expressible as linear threshold functions. However, there are compensatory advantages to using linear threshold functions. For one thing, as we will see, the Winnow algorithm can be feasibly applied to data points which lie in a high dimensional space; the class of threshold functions over this high dimensional space may well be sufficiently expressive to represent the target concept. For instance, consider a situation in which data points are given as vectors (x_1, ..., x_100). Instead of using this representation, one can represent points as lying in the 5,150-dimensional space obtained by additionally including all 5,050 product terms x_i x_j. (Transforming to this representation will be computationally feasible as long as the original vector x is sparse.) The Winnow algorithm applied to points in this representation will be capable of learning any quadratic threshold function over {x_1, ..., x_100}.
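The quadratic expansion just described can be carried out directly on sparse points, since only products of nonzero coordinates need to be computed. The following sketch uses our own indexing scheme for the product terms (the paper does not specify one):

```python
from itertools import combinations_with_replacement

def quadratic_expand(x, n):
    """Map a sparse point {i: x_i} over n variables to a sparse point that also
    contains every product x_i * x_j with i <= j (squares included), giving a
    linear-threshold learner the power of a quadratic threshold function.
    The product term (i, j) is stored at index n + (its rank in a fixed
    enumeration of unordered pairs), so it never collides with a linear term."""
    expanded = dict(x)
    for (i, vi), (j, vj) in combinations_with_replacement(sorted(x.items()), 2):
        pair_index = n + i * n - i * (i - 1) // 2 + (j - i)  # rank of pair (i, j)
        expanded[pair_index] = vi * vj
    return expanded
```

If x has m nonzero entries, only m(m + 1)/2 product terms are generated, regardless of n, which is why the transformation stays feasible for sparse data.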
A second advantage of using linear threshold functions as a hypothesis class is simply that powerful and elegant learning algorithms with provable performance guarantees (such as Winnow) are known for this class; this is not necessarily the case for more complex hypothesis classes. While decision trees are more expressive than linear threshold functions, their learnability is an (important) open question; currently, only heuristic algorithms are known which have no performance guarantees. A final advantage of using the Winnow algorithm is that when the data set is not perfectly linearly separable, its performance degrades in a controlled way. Tight bounds can be given for the performance of Winnow relative to the performance of the best linear separator for any sequence of labelled examples (see [19]).

4 Data and Methods

4.1 Fickett and Tung benchmark data sets

Fickett and Tung [9] proposed three data sets of human DNA sequences as benchmarks for comparison of protein coding prediction algorithms. The sequences comprising these data sets were taken from Genbank in May. The three data sets contain sequences of length 54, 108, and 162 base pairs, respectively. Each of these three sets is in turn divided into a training set and a testing set of approximately equal size. Every data sequence is labelled as either entirely coding (completely within some exon), entirely noncoding (completely outside of all exons), or part coding and part noncoding. As in Salzberg's study [21], we extracted all data points that are either entirely coding or entirely noncoding and used only these in our study. A summary of the data which we used is given in Table 1; this is the same data that was used in [21]. The entire data sets, including the mixed coding/noncoding windows, can be obtained via anonymous FTP from atlas.lanl.gov.

Table 2: Coding and Noncoding Frequencies in Kulp-Reese 108-Base-Pair Data Sets

               Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Set 7   Set 8   Set 9
    Coding       452     498     491     499     533     374     383     419     489
    Noncoding  2,169   3,351   2,599   2,232   1,713   2,511   5,029   2,121   3,788
    Total      2,621   3,849   3,090   2,731   2,246   2,885   5,412   2,540   4,277

Note that within each of the three benchmark data sets, no two sequences overlap. However, the reverse complement of every sequence is also present in the data set; each reverse complement sequence has the same label as the original sequence which it matches. Also, it should be noted that no information about the correct reading frame for data sequences was used; each data point consists simply of a string of 54 (or 108 or 162) nucleotides.

4.2 Kulp and Reese data set

We also tested our system on sequences derived from a newer benchmark data set of human genes. This data set has been assembled by David Kulp and Martin Reese and is available online; it consists of human genes which were taken from Genbank 95.0 (June 1996). Various filters were used to ensure that only correct, complete, known genes were included. The data set contains 269 single-exon genes and 353 genes that include introns, for a total of 622 genes. These genes are divided into nine different sets for cross-validation. For our study, we divided each gene into nonoverlapping windows of length 108, discarding any shorter pieces which might occur at the end. We then extracted from this collection of windows all sequences which were either entirely coding or entirely noncoding. This resulted in a total of 4,138 coding sequences and 25,513 noncoding sequences, for a grand total of 29,651 sequences. Table 2 gives the composition of each of the nine sets. Note that for this data set the reverse complement sequences were not included. No information about the correct reading frame was used.

4.3 Feature Set

All of the features which we used are derived from the 21 protein coding measures which were listed by Fickett and Tung in [9] and used in [21].
However, due to the nature of the Winnow algorithm we are able to use these coding measures in a substantially different way than [9], [21]. Most of the features described below are vectors of measurements which are computed for each data point. While [9], [21] use Linear Discriminant Analysis to map each of these vectors to a single value (the coding measure for that feature), our algorithm uses these vectors directly.

A dicodon is a subsequence of 6 consecutive nucleotides such as TAGGAC. The dicodon frequency feature is a vector of the 4096 dicodon frequencies across the input sequence, where the dicodon counts are accumulated only at locations whose starting point is a multiple of 3 (i.e. starting at the 0th, 3rd, 6th, ... nucleotides in the sequence). The hexamer-1 frequency feature is likewise a vector of 4096 dicodon frequencies, except that the counts are accumulated at positions 1, 4, 7, ...; the hexamer-2 frequency feature is defined analogously. The diamino acid frequency is a vector of the 441 diamino acid frequencies which are obtained by translating from the nucleotide sequence to an amino acid string (stop codons are treated as a 21st amino acid); like the dicodon frequency feature, counts are accumulated only at locations 0, 3, .... The codon usage feature is a vector of the 64 codon frequencies; again, counts are accumulated only at locations 0, 3, .... The open reading frame feature is the length, in codons, of the longest sequence of codons (aligned with locations 0, 3, ...) in the data string which does not contain a stop codon. The run feature is a vector of length 14; for each nontrivial subset S of {A, C, G, T}, the run feature contains an entry which gives the length of the longest contiguous subsequence which has all entries from S. For example, if the entry of the run feature which corresponds to {C, G} is 4, it means that the longest consecutive substring containing only C and G is of length 4. The asymmetry feature is a vector of length four which measures, for each nucleotide A, C, G, T, the extent to which the nucleotide is asymmetrically distributed over the three codon positions. The fourier feature is a vector of length eight; its coefficients measure the periodicity of the example string for periods 2, 3, ..., 9. For precise definitions of the asymmetry and fourier features, see [9].

In addition to the above features, which were used in [9] and [21], we also included the following additional features, which are simple modifications of the features listed above. The generalized dicodon frequency feature consists of 4096 dicodon frequencies; the difference between this and the regular dicodon frequency is that the counts are accumulated across all possible locations, not just multiples of 3. Similarly, the generalized diamino acid frequency feature consists of 441 diamino acid frequencies accumulated across all possible locations; the generalized amino acid frequency is a vector of 21 amino acid frequencies accumulated across all possible locations; and the generalized codon usage feature is a vector of 64 codon frequencies accumulated across all possible locations. Finally, the generalized open reading frame feature is the average of the following three values: the open reading frame feature's value, the value of the open reading frame computed on codons aligned with 1, 4, 7, ..., and the value of the open reading frame computed on codons aligned with 2, 5, 8, .... The motivation for using these generalized features is that the reading frame of each example sequence is unknown.
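As an illustration of how cheaply features of this kind can be computed, here is a sketch of two of them, the frame-aligned dicodon counts and the run feature. The function names and the sparse-dict representation are our own, not the paper's:

```python
from itertools import combinations

def dicodon_counts(seq, offset=0):
    """Count 6-mers whose start position is offset, offset+3, offset+6, ...
    Offset 0 gives the dicodon frequency feature; offsets 1 and 2 give the
    hexamer-1 and hexamer-2 features.  Returns a sparse dict of counts."""
    counts = {}
    for p in range(offset, len(seq) - 5, 3):
        counts[seq[p:p + 6]] = counts.get(seq[p:p + 6], 0) + 1
    return counts

def run_lengths(seq):
    """For each of the 14 nontrivial subsets S of {A, C, G, T}, the length of
    the longest contiguous substring all of whose characters lie in S."""
    feature = {}
    subsets = [frozenset(c) for r in range(1, 4) for c in combinations("ACGT", r)]
    for s in subsets:
        best = cur = 0
        for ch in seq:
            cur = cur + 1 if ch in s else 0  # extend or reset the current run
            best = max(best, cur)
        feature[s] = best
    return feature
```

Only 6-mers that actually occur in the window get a nonzero entry, so a 54-bp window yields at most 17 nonzero components out of the 4096 possible dicodons.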
Consequently, any of the three possible frames may be correct for a given example point. Since many of these features consist of long vectors, the total number of variables n is very large (more than 17,000). Even if many of these variables are irrelevant to the target concept, though, the performance of Winnow will not be seriously affected. We note also that even though the total number of variables is large, for any given example sequence the vast majority of all variables have value 0, so predictions and updates can be executed very efficiently. For instance, at most 17 components of the 4096-element dicodon frequency vector can be nonzero on an example sequence of length 54.

4.4 Normalizing and Voting

In each of the data sets which we used, only a small fraction of the examples are coding sequences. The Winnow algorithm tends towards hypotheses which maximize overall accuracy on the training set. Consequently, when training on this data Winnow arrives at hypotheses which almost always predict "noncoding", since this prediction is usually correct. It is desirable, though, to have a predictor which performs well on both coding and noncoding sequences. To accomplish this, throughout this study we "normalized" the training sets in the following way: whenever we trained the algorithm, instead of iterating through all of the examples in the actual training set, we iterated through all of the examples in a multiset derived from the training set. Every noncoding sequence in the training set occurred once in the multiset, and every coding sequence occurred, on average, the number of times needed to equalize the number of coding and noncoding instances in the multiset. Intuitively, this makes the Winnow algorithm attach equal importance to correct prediction of both coding and noncoding sequences. Even when using the normalized training sets, Winnow displayed a high degree of instability in the hypotheses which it would generate.
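One plausible reading of this normalization step is resampling the minority class with replacement, as sketched below; the function name and the use of random resampling are our assumptions, not the paper's exact procedure:

```python
import random

def normalize_training_set(coding, noncoding, rng=None):
    """Build a balanced training multiset: every noncoding example appears
    exactly once, and coding examples are resampled with replacement until the
    two classes are equally represented, so each coding example occurs
    len(noncoding) / len(coding) times on average."""
    rng = rng or random.Random(0)
    multiset = list(noncoding) + [rng.choice(coding) for _ in range(len(noncoding))]
    rng.shuffle(multiset)  # interleave the classes for online training
    return multiset
```

For the 54-bp training set (20,456 coding vs. roughly 125,000 noncoding examples) this would replicate each coding window about six times on average, matching the multiset sizes reported in Section 5.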
Running the same algorithm on two slightly different training sets would frequently result in two hypotheses which differed widely in their predictions; often one hypothesis would err in favor of exons (many false positives) and the other would err in favor of introns (many false negatives). A common technique in situations such as this, where large changes in the predictor result from small changes in the training set, is to generate multiple predictors and combine them using some form of voting ([3], [6], [10], [16]). The method which we adopted has two parameters: r is the number of hypotheses and s is the interval (in examples) between hypotheses. Given a multiset of training examples for Winnow, we record the current Winnow hypothesis periodically (r times) at the end of the training phase, at intervals of s examples apart. For instance, if the multiset contained 10,000 examples and r = 3 and s = 500, then we would record the Winnow hypotheses after 9,000, 9,500, and 10,000 examples. In the testing phase, given a DNA sequence we evaluate each of the r Winnow hypotheses on the sequence, and take as the final prediction the majority vote of the r hypotheses. In all of the experiments which we performed, r was set to 31.

4.5 Tuning the algorithm

Our system has two parameters which can be adjusted: the Winnow update parameter α and the number of examples s which are processed between hypotheses. (As mentioned earlier, we used the same feature set and the same number r = 31 of hypotheses throughout all experiments.) In order to find a suitable setting for these parameters for each of the Fickett and Tung data sets, we employed the following procedure. We randomly split the training set into two disjoint subsets, a tuning set (80%) and an evaluation set (20%). We chose initial settings for α and s, trained the system on the tuning set using those parameter settings, and observed its predictive ability on the evaluation set. Based on the results obtained on the evaluation set, we modified α and s and repeated the procedure with the new parameter settings on the tuning and evaluation sets. We repeated this procedure until satisfactory parameter settings were obtained. In general, relatively little tuning was required to obtain good results, and the results obtained seem to be fairly robust over a range of values for α and s. Once we had chosen values for α and s, we trained our system on the entire training set using these values and then tested the resulting hypotheses on the entire test set. We emphasize that the real test set was never used in the tuning process.
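The record-and-vote scheme can be sketched as follows; `update` and `predict` stand for any Winnow-style single-example routines, and all names here are illustrative rather than taken from the paper:

```python
def train_with_voting(examples, n, alpha, r, s, update):
    """Train on the example stream and snapshot the hypothesis r times,
    s examples apart, at the end of training (the final snapshot is taken
    after the last example, e.g. after 9000, 9500, 10000 of 10000 examples
    when r = 3 and s = 500)."""
    w_pos, w_neg = [1.0] * n, [1.0] * n
    snapshots = []
    snap_points = {len(examples) - k * s for k in range(r)}
    for t, (x, y) in enumerate(examples, start=1):
        update(w_pos, w_neg, x, y, alpha)
        if t in snap_points:
            snapshots.append((list(w_pos), list(w_neg)))  # copy current weights
    return snapshots

def vote(snapshots, x, predict):
    """Majority vote of the recorded hypotheses (with r odd there are no ties)."""
    total = sum(predict(wp, wn, x) for wp, wn in snapshots)
    return 1 if total > 0 else -1
```

Taking r odd (the paper uses r = 31) guarantees the vote is never tied.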
The test set was only used for gathering final results after the three best parameter settings had already been selected (based on their performance on the evaluation set).

5 Experimental Results

In this section we describe our results on each data set. Values are given for the percentage accuracy on test set coding sequences, the percentage accuracy on test set noncoding sequences, the average accuracy (the average of the previous two values), and the overall accuracy on the test set with the original ratio of noncoding and coding examples. Note that since there are more noncoding than coding examples, the overall accuracy is different from the average accuracy. We also give the correlation coefficient for our predictors. This is the correlation between the actual and predicted {−1, 1} labels on the (unnormalized) test set. As noted in [21], average accuracy is arguably the most interesting statistic; it is simple to achieve high accuracy on only one type of sequence (either coding or noncoding) at the cost of low accuracy on the other kind, and the overall accuracy is always close to the accuracy on noncoding sequences due to the preponderance of noncoding examples in all of our data sets. The challenge is to simultaneously achieve high accuracy on both coding and noncoding sequences. Our experimental results are summarized in Tables 3, 4, 5 and 6. Our system is called AEPP (for Attribute Efficient Protein Predictor, pronounced "ape"); along with AEPP's performance under various parameter settings, we give the performance of the single best (highest average accuracy) decision tree from Salzberg's study [21] on each of the Fickett and Tung data sets. For these data sets we also give the performance of the single best feature from Fickett and Tung's list of 21 coding measures [9]. The average accuracy for each algorithm is given in boldface.
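All of the reported statistics can be computed from the four cells of a confusion matrix. The sketch below uses our own names; the correlation coefficient is computed as the Pearson correlation of the ±1 labels, which for binary ±1 variables coincides with the Matthews correlation coefficient:

```python
from math import sqrt

def evaluate(y_true, y_pred):
    """Coding accuracy, noncoding accuracy, their average, overall accuracy,
    and the correlation coefficient between actual and predicted {-1,+1} labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    coding_acc = tp / (tp + fn)         # accuracy on coding sequences
    noncoding_acc = tn / (tn + fp)      # accuracy on noncoding sequences
    average = (coding_acc + noncoding_acc) / 2
    overall = (tp + tn) / len(y_true)   # dominated by the majority class
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return coding_acc, noncoding_acc, average, overall, correlation
```

With heavily imbalanced test sets, `overall` tracks `noncoding_acc` closely, which is exactly why the paper emphasizes the average accuracy instead.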
Table 3: Predictive Accuracy on Fickett-Tung 54-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: AEPP-1, AEPP-2, AEPP-3, OC1-C, Hexamer; the numeric entries have not survived in this copy)
The OC1-C results are taken from [21], and the Hexamer results are from [9].

Table 4: Predictive Accuracy on Fickett-Tung 108-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: AEPP-4, AEPP-5, AEPP-6, OC1-D, Position Asymmetry; the numeric entries have not survived in this copy)
The OC1-D results are taken from [21], and the Position Asymmetry results are from [9].

5.1 Fickett and Tung 54-bp data

This data set consists of 290,628 nonoverlapping human DNA sequences, each of which is 54 base pairs long and is either entirely coding, entirely noncoding, or the reverse complement of a sequence which is entirely coding or entirely noncoding. The training portion of this data set contains 20,456 coding sequences and 125,100 noncoding sequences. As described in Section 4.4, we normalized this training set to obtain a multiset consisting of 125,145 instances of coding sequences and 125,100 noncoding sequences. Only one pass through the training multiset was used, so AEPP processed 250,245 sequences in each training session. Each version of AEPP took less than 5 minutes to train on this multiset using an SGI with an R10000 processor. AEPP-1 results were obtained by setting α = 1.6 and s = 4000, while AEPP-2 had α = 1.7 and s = 4000 and AEPP-3 had α = 1.5. As indicated in Table 3, these three systems all achieved significantly higher average accuracy than previous systems on this data set. For all three versions of AEPP the accuracy on coding and noncoding windows is roughly equal, with AEPP-1 coming closest to true equality. We note that the higher overall accuracy of AEPP-2, like that of OC1-C, is due to the somewhat higher accuracy of these algorithms on noncoding windows, which make up more than 84% of the test set. Each version of AEPP took approximately 30 minutes to process the 145,040 examples in the test set. We believe that this rate could be significantly increased by pruning the coefficients of the hypotheses which are used for prediction; however, we did not explore this.

5.2 Fickett and Tung 108-bp data

This data set consists of 130,428 nonoverlapping human DNA sequences of 108 nucleotides each. After normalizing, the training multiset which we used contained 58,065 instances of coding sequences and 58,118 noncoding sequences.
We made one pass through the training multiset for each version of AEPP. Training time for each version of AEPP was less than 4 minutes to process the multiset of 116,183 examples. The settings for α and s which we used are as follows: AEPP-4 had α = 1.5 and s = 2000, AEPP-5 had α = 1.6 and s = 4000, and AEPP-6 had α = 1.5. Here too the AEPP systems significantly outperformed previous systems (see Table 4). As on the 54-bp data, each AEPP predictor achieves roughly equal levels of accuracy on coding and noncoding windows. Each AEPP predictor took approximately 25 minutes of computation time to process the 65,224 sequences in the test set.

Table 5: Predictive Accuracy on Fickett-Tung 162-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: AEPP-7, AEPP-8, AEPP-9, OC1-G, Fourier; the numeric entries have not survived in this copy)
The OC1-G results are taken from [21], and the Fourier results are from [9].

Table 6: Predictive Accuracy on Kulp-Reese 108-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: three AEPP variants; the numeric entries have not survived in this copy)
Each figure in this table is an average of nine values obtained by nine-fold cross-validation.

5.3 Fickett and Tung 162-bp data

This data set contains 79,882 sequences, each of which is 162 nucleotides long. Since the training set contains more than 10 times as many noncoding windows as coding windows, our normalizing procedure was particularly useful here. After normalizing, the multiset which we used as input for AEPP consisted of 36,456 coding sequences and 36,502 noncoding sequences. Here too we cycled through the multiset only once for each version of AEPP. Training time was less than 3 minutes for the 72,958 examples in the training multiset. AEPP-7 was trained using α = 1.6 and s = 1000, AEPP-8 using α = 1.2 and s = 2000, and AEPP-9 using α = 1.2. While the best OC1 decision tree achieved an average accuracy which was 2.0% higher than the best single feature (the fourier measure) for this data set, our best AEPP version had average accuracy 2.9% higher than this best OC1 classifier. The correlation coefficient also shows a significant increase. Each AEPP version took about 25 minutes for testing the 39,868 sequences in the test set.

5.4 Kulp and Reese 108-bp data

This data set contains 29,651 human DNA sequences, of which 4,138 are coding sequences and 25,513 are noncoding sequences. The data set was divided into nine portions as indicated in Table 2 and our experiments were run using nine-fold cross-validation.
In each of the nine runs, we held out one portion of the data as the test set, constructed a multiset of examples from the other eight portions of the data, and trained using this multiset. Since the Kulp-Reese data set is relatively small (each multiset contained approximately 45,000 examples), we cycled through the training multiset four times for each experiment. Each figure in Table 6 is an average of the nine figures which were obtained on the nine test sets. The parameter settings for α and s which we used were identical to those which we used for the Fickett and Tung 108 base pair data, so no tuning was done using any of the Kulp-Reese data. Had we done such tuning, it is likely that even higher accuracy could have been obtained. Despite this, the AEPP predictors had extremely high accuracy on this data, as can be seen in Table 6.

Figure 1. Average percentage accuracy of various predictors (AEPP, OC1, Position Asymmetry, Fourier, Hexamer, Dicodon) on Fickett and Tung's 54 bp, 108 bp, and 162 bp data sets.

5.5 Comparison to previous results

Figure 1 shows the performance of several classifiers on Fickett and Tung's benchmark data sets. For each data set, the values plotted are the average accuracies of the best AEPP classifier, the best OC1 decision tree, and the best single features. Like the OC1 system, our AEPP system consistently had higher accuracy than the best single feature. However, AEPP also consistently outperformed the best OC1 decision tree classifier. It is encouraging to note that AEPP's performance advantage seems to increase with the length of the DNA sequences being classified. This indicates that a system like AEPP may be especially useful for identifying coding regions in long DNA sequences, since these coding regions are frequently longer than 162 base pairs (particularly in single-exon genes).

6 Conclusions & Future Work

Our study uses Balanced Winnow, a mistake-driven multiplicative-update learning algorithm, in combination with voting methods to obtain classifiers for the prediction problem of determining whether or not a given sequence of human DNA codes for protein. Although our classifiers employ a great many variables (more than 17,000), the attribute efficiency of the underlying learning algorithm and the inherent sparseness of each individual data point make our method very fast. The predictors which our system learns are significantly more accurate than any previously obtained classifiers. Besides the improvement in accuracy, another advantage of this approach is that it is extremely easy to augment our current system with new features; they simply become new Winnow variables, and Winnow itself will determine which of them (if any) are useful.
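A minimal sketch of a Balanced Winnow learner of the kind described above, assuming a sparse encoding where each example is the set of its active feature indices. The variable names and the default threshold are our assumptions; the paper's actual feature encoding and voting scheme are not reproduced here.

```python
def train_balanced_winnow(examples, n_features, alpha=1.2, theta=None):
    """examples: list of (active_indices, label) pairs, label in {+1, -1}.
    alpha is the multiplicative update parameter; theta the threshold."""
    if theta is None:
        theta = float(n_features)  # a common default choice
    w_pos = [1.0] * n_features  # "positive" weight vector
    w_neg = [1.0] * n_features  # "negative" weight vector
    for active, label in examples:
        # Score only the active features: cost per example is
        # proportional to the sparsity, not to n_features.
        score = sum(w_pos[i] - w_neg[i] for i in active)
        prediction = 1 if score >= theta else -1
        if prediction != label:      # mistake-driven: update only on errors
            for i in active:         # only active features change
                if label == 1:       # promote positive, demote negative
                    w_pos[i] *= alpha
                    w_neg[i] /= alpha
                else:                # demote positive, promote negative
                    w_pos[i] /= alpha
                    w_neg[i] *= alpha
    return w_pos, w_neg, theta
```

With an update parameter in the 1.2-1.6 range and a single pass over the training multiset, this corresponds to the training regime described in Section 5.3. Because irrelevant features are simply never promoted, adding new variables is cheap, which is the augmentation property noted above.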
The nature of the Winnow algorithm is such that even if the new variables are irrelevant to the problem at hand, they cannot have a significant adverse effect on the system's performance. Aside from the inherent interest and usefulness of developing good classifiers for this particular problem, our results are of interest because of the new approaches they suggest for developing predictors for a wide range of bioinformatics problems. Our results indicate that, as suggested in [27], using a simple attribute-efficient learning algorithm over a large feature space can be an effective practical learning technique. It would be interesting to determine whether this approach is similarly successful for other prediction problems in computational biology. Many interesting avenues for future research also exist within the gene-finding arena; we are currently working on developing a full-scale gene-finder which uses an AEPP-like system to identify coding regions within long sequences of DNA. Another direction for future work is to improve the AEPP system itself; this could be done in a variety of ways. Judicious selection of compound variables is one promising approach for increasing the expressiveness of the hypotheses which AEPP can generate. Another intriguing approach is to use an attribute-efficient "projection learning" algorithm in place of Winnow, as described in [29].

7 Acknowledgments

This research was supported by NASA GSRP grant NGT-51706, by NSF grant CCR, and by an NSF Graduate Fellowship. We would like to thank Bonnie Berger, Steven Salzberg, and Les Valiant for their advice and encouragement.

References

[1] Auer, P., and Warmuth, M.K. Tracking the Best Disjunction. In Proc. 36th Symposium on the Foundations of Computer Science. Los Alamitos, CA: IEEE Computer Society Press.
[2] Blum, A. Empirical support for Winnow and weighted-majority algorithms: results on a calendar scheduling domain. Machine Learning 26.
[3] Breiman, L. Bagging Predictors. Machine Learning 24.
[4] Burge, C., and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268.
[5] Burset, M., and Guigo, R. Evaluation of Gene Structure Prediction Programs. Genomics 34.
[6] Chan, P.K., and Stolfo, S.J. A comparative evaluation of voting and meta-learning on partitioned data. In Proceedings 12th International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
[7] Dagan, I., Karov, Y., and Roth, D. Mistake-Driven Learning in Text Categorization. In Second Conference on Empirical Methods in Natural Language Processing.
[8] Dong, S., and Searls, D.B. Gene structure prediction by linguistic methods. Genomics 162.
[9] Fickett, J., and Tung, C.-S. Assessment of protein coding measures. Nucleic Acids Res. 20.
[10] Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation 121.
[11] Golding, A.R., and Roth, D. Applying Winnow to Context-Sensitive Spelling Correction. In 13th International Conference on Machine Learning.
[12] Helmbold, D.P., Schapire, R.E., Singer, Y., and Warmuth, M.K. On-Line Portfolio Selection Using Multiplicative Updates. In Machine Learning: Proceedings of the Thirteenth International Conference.
[13] Kivinen, J., and Warmuth, M.K. Exponentiated Gradient Versus Gradient Descent for Linear Predictors. In Proceedings of the 27th Annual ACM Symposium on the Theory of Computing. New York: ACM Press.
[14] Kivinen, J., Warmuth, M.K., and Auer, P. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. In Proceedings, 8th Annual Conference on Computational Learning Theory. New York: ACM Press.
[15] Kulp, D., Haussler, D., Reese, M.G., and Eeckman, F.H. A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA Sequences. In ISMB-96. Menlo Park, CA: AAAI/MIT Press.
[16] Kwok, S., and Carter, C. Multiple Decision Trees. In Uncertainty in Artificial Intelligence 4, ed. Schacter, R., Levitt, T., Kanal, L., and Lemmer, J. North-Holland.

[17] Lewis, D.D., Schapire, R.E., Callan, J.P., and Papka, R. Training Algorithms for Linear Text Classifiers. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[18] Littlestone, N. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2.
[19] Littlestone, N. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In COLT 91.
[20] Littlestone, N., and Warmuth, M.K. The Weighted Majority Algorithm. Information and Computation 108.
[21] Salzberg, S. Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm. J. Comp. Biol. 2:3.
[22] Salzberg, S., Delcher, A., Fasman, K., and Henderson, J. A Decision Tree System for Finding Genes in DNA. Technical Report, Department of Computer Science, Johns Hopkins University.
[23] Salzberg, S., Fasman, K., and Henderson, J. Finding Genes in DNA with a Hidden Markov Model. J. Comp. Biol. 4(2).
[24] Servedio, R. Improved prediction of individual book use for off-site storage using a simple attribute-efficient learning system. Unpublished manuscript.
[25] Snyder, E.E., and Stormo, G.D. Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucleic Acids Res. 21(3).
[26] Solovyev, V.V., Salamov, A.A., and Lawrence, C.B. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22.
[27] Valiant, L.G. A Neuroidal Architecture for Cognitive Computation. Harvard University Computer Science Technical Reports, TR.
[28] Valiant, L.G. Circuits of the Mind. Oxford University Press.
[29] Valiant, L.G. Projection Learning. Unpublished manuscript.
[30] Uberbacher, E.C., and Mural, R.J. Locating protein coding regions in human DNA sequences using a multiple sensor-neural network approach. Proc. Nat. Acad. Sci. USA 88.
[31] Xu, Y., Einstein, J.R., Shah, M., and Uberbacher, E.C. An improved system for exon recognition and gene modeling in human DNA sequences. In ISMB-94. Menlo Park, CA: AAAI/MIT Press.

More information

Evaluation of Gene-Finding Programs on Mammalian Sequences

Evaluation of Gene-Finding Programs on Mammalian Sequences Letter Evaluation of Gene-Finding Programs on Mammalian Sequences Sanja Rogic, 1 Alan K. Mackworth, 2 and Francis B.F. Ouellette 3 1 Computer Science Department, The University of California at Santa Cruz,

More information

Computer Science Technical Report

Computer Science Technical Report Computer Science Technical Report Characterizing Domain Specific Effects in Flaw Selection for Partial Order Planners AdeleE.Howe EricDahlman Computer Science Department Colorado State University Fort

More information

Gene Signal Estimates from Exon Arrays

Gene Signal Estimates from Exon Arrays Gene Signal Estimates from Exon Arrays I. Introduction: With exon arrays like the GeneChip Human Exon 1.0 ST Array, researchers can examine the transcriptional profile of an entire gene (Figure 1). Being

More information

Data mining and machine learning algorithms for soft sensor development. Tibor Kulcsár. Theses of the doctoral (PhD) dissertation

Data mining and machine learning algorithms for soft sensor development. Tibor Kulcsár. Theses of the doctoral (PhD) dissertation Theses of the doctoral (PhD) dissertation Data mining and machine learning algorithms for soft sensor development Tibor Kulcsár University of Pannonia Doctoral School in Chemical Engineering and Material

More information

Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian Neural Networks

Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian Neural Networks Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian s DAVID SACHEZ SPIROS H. COURELLIS Department of Computer Science Department of Computer Science California State University

More information

Genie Gene Finding in Drosophila melanogaster

Genie Gene Finding in Drosophila melanogaster Methods Gene Finding in Drosophila melanogaster Martin G. Reese, 1,2,4 David Kulp, 2 Hari Tammana, 2 and David Haussler 2,3 1 Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology,

More information

Extraction of Hidden Markov Model Representations of Signal Patterns in. DNA Sequences

Extraction of Hidden Markov Model Representations of Signal Patterns in. DNA Sequences 686 Extraction of Hidden Markov Model Representations of Signal Patterns in. DNA Sequences Tetsushi Yada The Japan Information Center of Science and Technology (JICST) 5-3 YonbancllO, Clliyoda-ku, Tokyo

More information

ab initio and Evidence-Based Gene Finding

ab initio and Evidence-Based Gene Finding ab initio and Evidence-Based Gene Finding A basic introduction to annotation Outline What is annotation? ab initio gene finding Genome databases on the web Basics of the UCSC browser Evidence-based gene

More information

Boosting Trees for Anti-Spam Filtering. [Extended Version] TALP Research Center. LSI Department. Universitat Politecnica de Catalunya (UPC)

Boosting Trees for Anti-Spam  Filtering. [Extended Version] TALP Research Center. LSI Department. Universitat Politecnica de Catalunya (UPC) Boosting Trees for Anti-Spam Email Filtering [Extended Version] Xavier Carreras Llus Marquez TALP Research Center LSI Department Universitat Politecnica de Catalunya (UPC) Jordi Girona Salgado 1{3 Barcelona

More information

Department of Computer Science and Engineering, University of

Department of Computer Science and Engineering, University of Optimizing the BAC-End Strategy for Sequencing the Human Genome Richard M. Karp Ron Shamir y April 4, 999 Abstract The rapid increase in human genome sequencing eort and the emergence of several alternative

More information

COMP 555 Bioalgorithms. Fall Lecture 1: Course Introduction

COMP 555 Bioalgorithms. Fall Lecture 1: Course Introduction COMP 555 Bioalgorithms Fall 2014 Jan Prins Lecture 1: Course Introduction Read Chapter 1 and Chapter 3, Secns 3.1-3.7 1 Intended Audience COMP 555: Bioalgorithms Suitable for both undergraduate and graduate

More information

Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets

Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets Kang Peng Slobodan Vucetic Zoran Obradovic Abstract In this study we proposed an iterative procedure

More information

Bank Card Usage Prediction Exploiting Geolocation Information

Bank Card Usage Prediction Exploiting Geolocation Information Bank Card Usage Prediction Exploiting Geolocation Information Martin Wistuba, Nghia Duong-Trung, Nicolas Schilling, and Lars Schmidt-Thieme Information Systems and Machine Learning Lab University of Hildesheim

More information

September 11, Abstract. of an optimization problem and rewards instances that have uniform landscapes,

September 11, Abstract. of an optimization problem and rewards instances that have uniform landscapes, On the Evolution of Easy Instances Christos H. Papadimitriou christos@cs.berkeley.edu Martha Sideri sideri@aueb.gr September 11, 1998 Abstract We present experimental evidence, based on the traveling salesman

More information

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions Near-Balanced Incomplete Block Designs with An Application to Poster Competitions arxiv:1806.00034v1 [stat.ap] 31 May 2018 Xiaoyue Niu and James L. Rosenberger Department of Statistics, The Pennsylvania

More information

A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang

A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang Abstract Our team intends to develop a recommendation system for job seekers based on the information of current

More information

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY GUY HASKIN FERNALD 1, DORNA KASHEF 2, NICHOLAS P. TATONETTI 1 Center for Biomedical Informatics Research 1, Department of Computer

More information

New Methods for Splice Site Recognition

New Methods for Splice Site Recognition New Methods for Splice Site Recognition S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller Fraunhofer FIRST, Kekuléstr. 7, 12489 Berlin, Germany Australian National University, Canberra, ACT 0200, Australia

More information

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs. Page 1 REMINDER: BMI 214 Industry Night Comparative Genomics Russ B. Altman BMI 214 CS 274 Location: Here (Thornton 102), on TV too. Time: 7:30-9:00 PM (May 21, 2002) Speakers: Francisco De La Vega, Applied

More information

Predictive Analytics Using Support Vector Machine

Predictive Analytics Using Support Vector Machine International Journal for Modern Trends in Science and Technology Volume: 03, Special Issue No: 02, March 2017 ISSN: 2455-3778 http://www.ijmtst.com Predictive Analytics Using Support Vector Machine Ch.Sai

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

C. Wohlin, P. Runeson and A. Wesslén, "Software Reliability Estimations through Usage Analysis of Software Specifications and Designs", International

C. Wohlin, P. Runeson and A. Wesslén, Software Reliability Estimations through Usage Analysis of Software Specifications and Designs, International C. Wohlin, P. Runeson and A. Wesslén, "Software Reliability Estimations through Usage Analysis of Software Specifications and Designs", International Journal of Reliability, Quality and Safety Engineering,

More information

May 16. Gene Finding

May 16. Gene Finding Gene Finding j T[j,k] k i Q is a set of states T is a matrix of transition probabilities T[j,k]: probability of moving from state j to state k Σ is a set of symbols e j (S) is the probability of emitting

More information

Molecular Biology and Pooling Design

Molecular Biology and Pooling Design Molecular Biology and Pooling Design Weili Wu 1, Yingshu Li 2, Chih-hao Huang 2, and Ding-Zhu Du 2 1 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA weiliwu@cs.utdallas.edu

More information

Methodology for the Design and Evaluation of Ontologies. Michael Gruninger and Mark S. Fox. University oftoronto. f gruninger, msf

Methodology for the Design and Evaluation of Ontologies. Michael Gruninger and Mark S. Fox. University oftoronto. f gruninger, msf Methodology for the Design and Evaluation of Ontologies Michael Gruninger and Mark S. Fox Department of Industrial Engineering University oftoronto Toronto, Canada M5S 1A4 f gruninger, msf g@ie.utoronto.ca

More information

100 Training Error Testing Error second 1 Intersection 1 Turn. Distance Exchange Value in Feet 500. Percentage Correct

100 Training Error Testing Error second 1 Intersection 1 Turn. Distance Exchange Value in Feet 500. Percentage Correct Interactive Renement of Route Preferences for Driving Seth Rogers and Pat Langley (rogers, langley)@rtna.daimlerbenz.com Daimler-Benz Research and Technology Center 1510 Page Mill Road, Palo Alto, CA 94304-1135

More information

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations An Analytical Upper Bound on the Minimum Number of Recombinations in the History of SNP Sequences in Populations Yufeng Wu Department of Computer Science and Engineering University of Connecticut Storrs,

More information

Methods and Algorithms for Gene Prediction

Methods and Algorithms for Gene Prediction Methods and Algorithms for Gene Prediction Chaochun Wei 韦朝春 Sc.D. ccwei@sjtu.edu.cn http://cbb.sjtu.edu.cn/~ccwei Shanghai Jiao Tong University Shanghai Center for Bioinformation Technology 5/12/2011 K-J-C

More information

Private Surveys for the Public Good

Private Surveys for the Public Good Private Surveys for the Public Good Bianca Pham, Emmanuel Genene, Ethan Abramson, and Gastón P Montemayor Olaizola Abstract Protecting the privacy of individuals data collected while taking a survey has

More information

Exploring Long DNA Sequences by Information Content

Exploring Long DNA Sequences by Information Content Exploring Long DNA Sequences by Information Content Trevor I. Dix 1,2, David R. Powell 1,2, Lloyd Allison 1, Samira Jaeger 1, Julie Bernal 1, and Linda Stern 3 1 Faculty of I.T., Monash University, 2 Victorian

More information

Ranking Potential Customers based on GroupEnsemble method

Ranking Potential Customers based on GroupEnsemble method Ranking Potential Customers based on GroupEnsemble method The ExceedTech Team South China University Of Technology 1. Background understanding Both of the products have been on the market for many years,

More information

Bis2A 12.2 Eukaryotic Transcription

Bis2A 12.2 Eukaryotic Transcription OpenStax-CNX module: m56061 1 Bis2A 12.2 Eukaryotic Transcription Mitch Singer Based on Eukaryotic Transcription by OpenStax College This work is produced by OpenStax-CNX and licensed under the Creative

More information

Linking maintenance strategies to performance

Linking maintenance strategies to performance Int. J. Production Economics 70 (2001) 237}244 Linking maintenance strategies to performance Laura Swanson* Department of Management, Southern Illinois University Edwardsville, Edwardsville, IL 62026-1100,

More information

Textbook Reading Guidelines

Textbook Reading Guidelines Understanding Bioinformatics by Marketa Zvelebil and Jeremy Baum Last updated: May 1, 2009 Textbook Reading Guidelines Preface: Read the whole preface, and especially: For the students with Life Science

More information