Gene Structure Prediction From Many Attributes


Gene Structure Prediction From Many Attributes

Adam A. Deaton and Rocco A. Servedio
Engineering Sciences Lab 102, Division of Engineering and Applied Sciences
Harvard University, Cambridge, MA
Phone: (617) Fax: (617)

January 30, 1998

Abstract

Considerable research effort has been directed in recent years toward the problem of computationally identifying genes in DNA sequences. A fundamental component of a gene-finding system is a predictor which, when given a window of DNA sequence data, predicts whether or not it codes for protein product. In this paper we propose that mistake-driven, multiplicative-weight-update learning algorithms operating over a large feature set are well suited to this prediction problem, and describe a system we have built which takes this approach. Our system is fast, simple, and produces more accurate classifiers than have previously been obtained for a range of different sequence lengths. We conclude that a system of this type will be a useful component in larger gene-finding programs.

Keywords: exon prediction; gene identification; coding sequence; machine learning; Winnow; multiplicative-weight-update learning algorithm

1 Introduction

A major thrust of biological research over the past decade has been the acquisition of massive amounts of DNA sequence data. Along with an ever-larger supply of raw data, the need for accurate, automatic systems for analyzing DNA sequences has grown correspondingly more acute. Perhaps the most widely desired tool for sequence analysis would be a reliable method for identifying genes in DNA. Numerous gene-finding systems have been developed and implemented in recent years, including FGENEH [26], Genie [15], GenLang [8], GENSCAN [4], GRAIL II [31], MORGAN [22] and VEIL [23]. Many of these systems are still under development, and some encouraging results have been obtained, but the general problem is far from solved.
In a comprehensive survey of gene-finding systems, Burset and Guigo [5] found that most systems had average accuracy (for predicting protein coding/noncoding status at the individual nucleotide level) between 62% and 71%; they concluded that overall "the accuracy of the programs was severely affected by relatively high rates of sequence errors". Better techniques for predicting coding/noncoding status of DNA subsequences would thus be of considerable use for the next generation of gene-finding programs. In this paper we focus on the following prediction problem: given a short window of DNA, predict whether it is all-coding or all-noncoding. Devising an accurate solution to this prediction problem is an important step toward building a successful gene-finder. Previous researchers have identified numerous features that can be efficiently computed from short DNA windows and have used techniques such as linear discriminant analysis, neural nets, and decision trees in attempts to solve this problem; see Section 2 for more details on related work. Our study takes a different approach, one that is motivated by the philosophy outlined by Valiant in his recent work on cognitive computation [27], [28]. As Valiant points out, two broad lessons that can be drawn from the past decade of research in computational learning theory are the following. First, despite intensive efforts, no general techniques are known for directly constructing complex features such as the monomials in a DNF representation of an unknown target function. On the other hand, some partial techniques are known for tolerating large numbers of irrelevant attributes in certain contexts. One goal of the current research is to demonstrate that these insights can be directly applied to the construction of systems for real-world problems of significant practical interest. In keeping with this approach, our system uses Littlestone's Winnow algorithm [18], [19] for learning linear threshold functions as its main predictive component. Mistake-driven multiplicative-weight-update learning algorithms such as Winnow have been the subject of much study in the machine learning community over the past decade (e.g. [13], [14], [18], [19], [20]). These algorithms are known to perform exceptionally well in the presence of many irrelevant attributes and have provable resistance to noisy data. Recent applied research has exploited these properties by using these algorithms for problems such as context-sensitive spelling correction [11], library book use prediction [24], text categorization [17], [7] and portfolio selection [12].
Our results indicate that the Winnow algorithm can be used successfully to find protein-coding regions in DNA. In the experiments we carried out, our Winnow-based system achieved higher levels of performance on Fickett and Tung's benchmark human DNA data sets [9] than any previous program. The difference in accuracy between our classifiers and Salzberg's decision trees [21] on these benchmark data sets is approximately the same as the difference between Salzberg's decision trees and the best single features. Our algorithm also performs well on a newer benchmark set of human DNA data which has been assembled by David Kulp and Martin Reese. Based on the successful performance of our algorithm on this single-window prediction problem, we are currently working on incorporating it into a larger gene-finding system in which the goal is to identify the coding regions (if any are present) within a long stretch of DNA. The remainder of this paper is structured as follows. Section 2 gives background on previous research on this problem. Section 3 contains an introduction to the Winnow algorithm, a brief overview of some of its properties, and a discussion of why it is likely to be useful in this context. Section 4 describes data and methods. We present and discuss our results in Section 5, and give conclusions and suggestions for future work in Section 6.

2 Related Work

Many researchers have considered the problem of computationally distinguishing between coding and non-coding regions of DNA. In a comparison and evaluation of published techniques, Fickett and Tung [9] noted that more than twenty different coding measures were present in the literature. As a benchmark for comparison of different techniques, they proposed three publicly available sets of human DNA data, consisting of sequences of 54, 108, and 162 base pairs respectively.
Drawing from the available literature, Fickett and Tung created a unified list of 21 coding measures and, using Linear Discriminant Analysis, separately assessed the accuracy of each of these 21 measures on each of their three benchmark data sets. It stands to reason that higher accuracy could be achieved by combining individual coding measures. Indeed, Fickett and Tung report that in a simple test of the GRAIL system's Coding Recognition Module [30] on their 108-bp benchmark data set, average accuracy of 79% was achieved, which is superior to any single coding measure. In a more systematic approach to using multiple coding measures, Salzberg [21] experimented with learning decision tree classifiers for the Fickett and Tung benchmark data. The features employed by his trees were a subset of the 21 coding measures which Fickett and Tung had reviewed. Salzberg's OC1 system for building decision trees can be set to allow either univariate or multivariate tests at each tree node. His study employed several different splitting and pruning criteria in the construction of the decision trees. For each of the three data sets, the classifiers he obtained were significantly more accurate than the best individual coding measure (see Figure 1 in Section 5).

3 Winnow

3.1 The Winnow algorithm

Many variants of the Winnow algorithm have been described and studied over the past ten years ([1], [2], [18], [19]); however, these variants all share the same underlying structure. Like the well-known Perceptron algorithm, Winnow works by attempting to construct a linear separator for the sequence of examples which it is given. The algorithm processes examples one-by-one as it receives them, and updates its current hypothesis (i.e. weight vector) only when it makes a mistake. The main difference between Winnow and the Perceptron algorithm is that Winnow adjusts its hypothesis by making multiplicative rather than additive changes to the coefficients of the weight vector; it turns out that this seemingly minor change in fact makes a substantial difference in terms of the mistake bound which can be proved for the algorithm. More specifically, the version of Winnow which we use, called Balanced Winnow, maintains two weight vectors, a positive weight vector w+ = (w+_1, ..., w+_n) ∈ (0, ∞)^n and a negative weight vector w− = (w−_1, ..., w−_n) ∈ (0, ∞)^n. Initially, w+ = w− = (1, 1, ..., 1). At each step, the algorithm is given a labelled data point (x, y), where x ∈ [0, 1]^n and y ∈ {+1, −1}; the data point x can be thought of as a vector of feature values for some fixed feature set.
The label y for each point is either positive or negative, depending on whether the point is a positive or negative example of the concept which is being learned. Upon being given (x, y), Winnow performs two phases, a prediction phase and an update phase. In the prediction phase, the algorithm computes (w+ − w−) · x. It predicts ŷ = 1 if this quantity is nonnegative, and predicts ŷ = −1 otherwise. In the update phase, for all i = 1, 2, ..., n the algorithm sets

    w+_i ← w+_i · α^((y − ŷ) x_i)    and    w−_i ← w−_i · α^(−(y − ŷ) x_i),

where α > 1 is some fixed update parameter. Note that if the algorithm's prediction was correct, no update is performed. When the algorithm makes a false positive prediction, some coordinates of w+ will decrease and the corresponding coordinates of w− will increase; the opposite occurs when it makes a false negative prediction.

3.2 Properties of the Winnow algorithm

In [18], Littlestone proved a convergence theorem for Winnow which is similar to the classical Perceptron Convergence Theorem; he showed that if the data set on which Winnow is trained is linearly separable, then the Winnow algorithm will eventually find a linear threshold function which classifies all examples correctly. However, he also proved that Winnow is attribute efficient in the following sense: if the target linear threshold function depends on only k out of the n attributes (i.e. has k nonzero coefficients), then the maximum number of mistakes which Winnow can make depends only logarithmically on n (and polynomially on k, given some further assumptions). The attribute efficiency of Winnow is a useful property when the feature set (i.e. n) is large but the target concept is simple in that it involves few features. Another useful property of Winnow is that both the prediction and update steps can be performed efficiently. If m of the n components of an example x are nonzero, then prediction and updates each take O(m) time instead of O(n); this can represent a considerable savings.
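The prediction and update phases described above can be rendered in a few lines of code. The sketch below is our own illustrative Python (sparse examples as index-to-value dicts, `alpha` for the update parameter α), not the authors' implementation:

```python
def predict(w_pos, w_neg, x):
    """Predict +1 if (w+ - w-) . x >= 0, else -1; x is a sparse dict {index: value}."""
    score = sum((w_pos[i] - w_neg[i]) * v for i, v in x.items())
    return 1 if score >= 0 else -1

def update(w_pos, w_neg, x, y, alpha=1.5):
    """Multiplicative update, performed only when the prediction is a mistake."""
    y_hat = predict(w_pos, w_neg, x)
    if y_hat != y:
        for i, v in x.items():  # only the nonzero coordinates of x change, O(m) work
            w_pos[i] *= alpha ** ((y - y_hat) * v)
            w_neg[i] *= alpha ** (-(y - y_hat) * v)
    return y_hat

def train(examples, n, alpha=1.5):
    """One mistake-driven pass over the example stream; weights start at all ones."""
    w_pos, w_neg = [1.0] * n, [1.0] * n
    for x, y in examples:
        update(w_pos, w_neg, x, y, alpha)
    return w_pos, w_neg
```

Because the update touches only the nonzero coordinates, the per-example cost is proportional to the sparsity of x, which is the property exploited later in the paper.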

Table 1: Coding and Noncoding Frequencies in Fickett and Tung Benchmark Data Sets

                              54-bp      108-bp     162-bp
    Training set, coding      20,456      7,086      3,512
    Training set, noncoding  125,132     58,118     36,502
    Training set total       145,588     65,204     40,014
    Test set, coding          22,902      8,192      4,226
    Test set, noncoding      122,138     57,032     35,602
    Test set total           145,040     65,224     39,868

As we will see, in our application to protein coding status prediction, the examples are all very sparse vectors, allowing us to execute the algorithm quickly. One natural objection to using the Winnow algorithm for a problem such as protein coding status prediction is that the hypothesis class which it employs, the class of linear threshold functions, is too simple to be of any use. It is true that the representational power of this hypothesis class is limited; for instance, there are Boolean functions which can be succinctly represented as decision trees but are not expressible as linear threshold functions. However, there are compensatory advantages to using linear threshold functions. For one thing, as we will see, the Winnow algorithm can be feasibly applied to data points which lie in a high dimensional space; the class of threshold functions over this high dimensional space may well be sufficiently expressive to represent the target concept. For instance, consider a situation in which data points are given as vectors (x_1, ..., x_100). Instead of using this representation, one can represent points as lying in the 5,150-dimensional space obtained by additionally including all 5,050 product terms x_i x_j. (Transforming to this representation will be computationally feasible as long as the original vector x is sparse.) The Winnow algorithm applied to points in this representation will be capable of learning any quadratic threshold function over {x_1, ..., x_100}.
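The quadratic expansion just described can be carried out directly on sparse points, since only products of nonzero coordinates need to be computed. The following sketch uses our own indexing scheme for the product terms (the paper does not specify one):

```python
from itertools import combinations_with_replacement

def quadratic_expand(x, n):
    """Map a sparse point {i: x_i} over n variables to a sparse point that also
    contains every product x_i * x_j with i <= j (squares included), giving a
    linear-threshold learner the power of a quadratic threshold function.
    The product term (i, j) is stored at index n + (its rank in a fixed
    enumeration of unordered pairs), so it never collides with a linear term."""
    expanded = dict(x)
    for (i, vi), (j, vj) in combinations_with_replacement(sorted(x.items()), 2):
        pair_index = n + i * n - i * (i - 1) // 2 + (j - i)  # rank of pair (i, j)
        expanded[pair_index] = vi * vj
    return expanded
```

If x has m nonzero entries, only m(m + 1)/2 product terms are generated, regardless of n, which is why the transformation stays feasible for sparse data.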
A second advantage of using linear threshold functions as a hypothesis class is simply that powerful and elegant learning algorithms with provable performance guarantees (such as Winnow) are known for this class; this is not necessarily the case for more complex hypothesis classes. While decision trees are more expressive than linear threshold functions, their learnability is an (important) open question; currently, only heuristic algorithms are known which have no performance guarantees. A final advantage of using the Winnow algorithm is that when the data set is not perfectly linearly separable, its performance degrades in a controlled way. Tight bounds can be given for the performance of Winnow relative to the performance of the best linear separator for any sequence of labelled examples (see [19]).

4 Data and Methods

4.1 Fickett and Tung benchmark data sets

Fickett and Tung [9] proposed three data sets of human DNA sequences as benchmarks for comparison of protein coding prediction algorithms. The sequences comprising these data sets were taken from Genbank in May. The three data sets contain sequences of length 54, 108, and 162 base pairs, respectively. Each of these three sets is in turn divided into a training set and a testing set of approximately equal size. Every data sequence is labelled as either entirely coding (completely within some exon), entirely noncoding (completely outside of all exons), or part coding and part noncoding. As in Salzberg's study [21], we extracted all data points that are either entirely coding or entirely noncoding and used only these in our study. A summary of the data which we used is given in Table 1; this is the same data that was used in [21]. The entire data sets, including the mixed coding/noncoding windows, can be obtained via anonymous FTP from atlas.lanl.gov.

Table 2: Coding and Noncoding Frequencies in Kulp-Reese 108-Base-Pair Data Sets

               Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Set 7   Set 8   Set 9
    Coding       452     498     491     499     533     374     383     419     489
    Noncoding  2,169   3,351   2,599   2,232   1,713   2,511   5,029   2,121   3,788
    Total      2,621   3,849   3,090   2,731   2,246   2,885   5,412   2,540   4,277

Note that within each of the three benchmark data sets, no two sequences overlap. However, the reverse complement of every sequence is also present in the data set; each reverse complement sequence has the same label as the original sequence which it matches. Also, it should be noted that no information about the correct reading frame for data sequences was used; each data point consists simply of a string of 54 (or 108 or 162) nucleotides.

4.2 Kulp and Reese data set

We also tested our system on sequences derived from a newer benchmark data set of human genes. This data set has been assembled by David Kulp and Martin Reese and is available online; it consists of human genes which were taken from Genbank 95.0 (June 1996). Various filters were used to ensure that only correct, complete, known genes were included. The data set contains 269 single-exon genes and 353 genes that include introns, for a total of 622 genes. These genes are divided into nine different sets for cross-validation. For our study, we divided each gene into nonoverlapping windows of length 108, discarding any shorter pieces which might occur at the end. We then extracted from this collection of windows all sequences which were either entirely coding or entirely noncoding. This resulted in a total of 4,138 coding sequences and 25,513 noncoding sequences, for a grand total of 29,651 sequences. Table 2 gives the composition of each of the nine sets. Note that for this data set the reverse complement sequences were not included. No information about the correct reading frame was used.

4.3 Feature Set

All of the features which we used are derived from the 21 protein coding measures which were listed by Fickett and Tung in [9] and used in [21].
However, due to the nature of the Winnow algorithm we are able to use these coding measures in a substantially different way than [9], [21]. Most of the features described below are vectors of measurements which are computed for each data point. While [9], [21] use Linear Discriminant Analysis to map each of these vectors to a single value (the coding measure for that feature), our algorithm uses these vectors directly.

A dicodon is a subsequence of 6 consecutive nucleotides such as TAGGAC. The dicodon frequency feature is a vector of the 4096 dicodon frequencies across the input sequence, where the dicodon counts are accumulated only at locations whose starting point is a multiple of 3 (i.e. starting at the 0th, 3rd, 6th, ... nucleotides in the sequence). The hexamer-1 frequency feature is likewise a vector of 4096 dicodon frequencies, except that the counts are accumulated at positions 1, 4, 7, ...; the hexamer-2 frequency feature is defined analogously. The diamino acid frequency is a vector of the 441 diamino acid frequencies which are obtained by translating from the nucleotide sequence to an amino acid string (stop codons are treated as a 21st amino acid); like the dicodon frequency feature, counts are accumulated only at locations 0, 3, .... The codon usage feature is a vector of the 64 codon frequencies; again, counts are accumulated only at locations 0, 3, .... The open reading frame feature is the length, in codons, of the longest sequence of codons (aligned with locations 0, 3, ...) in the data string which does not contain a stop codon. The run feature is a vector of length 14; for each nontrivial subset S of {A, C, G, T}, the run feature contains an entry which gives the length of the longest contiguous subsequence which has all entries from S. For example, if the entry of the run feature which corresponds to {C, G} is 4, it means that the longest consecutive substring containing only C and G is of length 4. The asymmetry feature is a vector of length four which measures, for each nucleotide A, C, G, T, the extent to which the nucleotide is asymmetrically distributed over the three codon positions. The fourier feature is a vector of length eight; its coefficients measure the periodicity of the example string for periods 2, 3, ..., 9. For precise definitions of the asymmetry and fourier features, see [9].

In addition to the above features, which were used in [9] and [21], we also included the following additional features, which are simple modifications of the features listed above. The generalized dicodon frequency feature consists of 4096 dicodon frequencies; the difference between this and the regular dicodon frequency is that the counts are accumulated across all possible locations, not just multiples of 3. Similarly, the generalized diamino acid frequency feature consists of 441 diamino acid frequencies accumulated across all possible locations; the generalized amino acid frequency is a vector of 21 amino acid frequencies accumulated across all possible locations; and the generalized codon usage feature is a vector of 64 codon frequencies accumulated across all possible locations. Finally, the generalized open reading frame feature is the average of the following three values: the open reading frame feature's value, the value of the open reading frame computed on codons aligned with 1, 4, 7, ..., and the value of the open reading frame computed on codons aligned with 2, 5, 8, .... The motivation for using these generalized features is that the reading frame of each example sequence is unknown.
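As an illustration of how cheaply features of this kind can be computed, here is a sketch of two of them, the frame-aligned dicodon counts and the run feature. The function names and the sparse-dict representation are our own, not the paper's:

```python
from itertools import combinations

def dicodon_counts(seq, offset=0):
    """Count 6-mers whose start position is offset, offset+3, offset+6, ...
    Offset 0 gives the dicodon frequency feature; offsets 1 and 2 give the
    hexamer-1 and hexamer-2 features.  Returns a sparse dict of counts."""
    counts = {}
    for p in range(offset, len(seq) - 5, 3):
        counts[seq[p:p + 6]] = counts.get(seq[p:p + 6], 0) + 1
    return counts

def run_lengths(seq):
    """For each of the 14 nontrivial subsets S of {A, C, G, T}, the length of
    the longest contiguous substring all of whose characters lie in S."""
    feature = {}
    subsets = [frozenset(c) for r in range(1, 4) for c in combinations("ACGT", r)]
    for s in subsets:
        best = cur = 0
        for ch in seq:
            cur = cur + 1 if ch in s else 0  # extend or reset the current run
            best = max(best, cur)
        feature[s] = best
    return feature
```

Only 6-mers that actually occur in the window get a nonzero entry, so a 54-bp window yields at most 17 nonzero components out of the 4096 possible dicodons.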
Consequently, any of the three possible frames may be correct for a given example point. Since many of these features consist of long vectors, the total number of variables n is very large (more than 17,000). Even if many of these variables are irrelevant to the target concept, though, the performance of Winnow will not be seriously affected. We note also that even though the total number of variables is large, for any given example sequence the vast majority of all variables have value 0, so predictions and updates can be executed very efficiently. For instance, at most 17 components of the 4096-element dicodon frequency vector can be nonzero on an example sequence of length 54.

4.4 Normalizing and Voting

In each of the data sets which we used, only a small fraction of the examples are coding sequences. The Winnow algorithm tends towards hypotheses which maximize overall accuracy on the training set. Consequently, when training on this data Winnow arrives at hypotheses which almost always predict "noncoding", since this prediction is usually correct. It is desirable, though, to have a predictor which performs well on both coding and noncoding sequences. To accomplish this, throughout this study we "normalized" the training sets in the following way: whenever we trained the algorithm, instead of iterating through all of the examples in the actual training set, we iterated through all of the examples in a multiset derived from the training set. Every noncoding sequence in the training set occurred once in the multiset, and every coding sequence occurred, on average, the number of times needed to equalize the number of coding and noncoding instances in the multiset. Intuitively, this makes the Winnow algorithm attach equal importance to correct prediction of both coding and noncoding sequences. Even when using the normalized training sets, Winnow displayed a high degree of instability in the hypotheses which it would generate.
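One plausible reading of this normalization step is resampling the minority class with replacement, as sketched below; the function name and the use of random resampling are our assumptions, not the paper's exact procedure:

```python
import random

def normalize_training_set(coding, noncoding, rng=None):
    """Build a balanced training multiset: every noncoding example appears
    exactly once, and coding examples are resampled with replacement until the
    two classes are equally represented, so each coding example occurs
    len(noncoding) / len(coding) times on average."""
    rng = rng or random.Random(0)
    multiset = list(noncoding) + [rng.choice(coding) for _ in range(len(noncoding))]
    rng.shuffle(multiset)  # interleave the classes for online training
    return multiset
```

For the 54-bp training set (20,456 coding vs. roughly 125,000 noncoding examples) this would replicate each coding window about six times on average, matching the multiset sizes reported in Section 5.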
Running the same algorithm on two slightly different training sets would frequently result in two hypotheses which differed widely in their predictions; often one hypothesis would err in favor of exons (many false positives) and the other would err in favor of introns (many false negatives). A common technique in situations such as this, where large changes in the predictor result from small changes in the training set, is to generate multiple predictors and combine them using some form of voting ([3], [6], [10], [16]). The method which we adopted has two parameters: r is the number of hypotheses and s is the interval (in examples) between hypotheses. Given a multiset of training examples for Winnow, we record the current Winnow hypothesis periodically (r times) at the end of the training phase, at intervals of s examples apart. For instance, if the multiset contained 10,000 examples and r = 3 and s = 500, then we would record the Winnow hypotheses after 9,000, 9,500, and 10,000 examples. In the testing phase, given a DNA sequence we evaluate each of the r Winnow hypotheses on the sequence, and take as the final prediction the majority vote of the r hypotheses. In all of the experiments which we performed, r was set to 31.

4.5 Tuning the algorithm

Our system has two parameters which can be adjusted: the Winnow update parameter α and the number of examples s which are processed between hypotheses. (As mentioned earlier, we used the same feature set and the same number r = 31 of hypotheses throughout all experiments.) In order to find a suitable setting for these parameters for each of the Fickett and Tung data sets, we employed the following procedure. We randomly split the training set into two disjoint subsets, a tuning set (80%) and an evaluation set (20%). We chose initial settings for α and s, trained the system on the tuning set using those parameter settings, and observed its predictive ability on the evaluation set. Based on the results obtained on the evaluation set, we modified α and s and repeated the procedure with the new parameter settings on the tuning and evaluation sets. We repeated this procedure until satisfactory parameter settings were obtained. In general, relatively little tuning was required to obtain good results, and the results obtained seem to be fairly robust over a range of values for α and s. Once we had chosen values for α and s, we trained our system on the entire training set using these values and then tested the resulting hypotheses on the entire test set. We emphasize that the real test set was never used in the tuning process.
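The record-and-vote scheme can be sketched as follows; `update` and `predict` stand for any Winnow-style single-example routines, and all names here are illustrative rather than taken from the paper:

```python
def train_with_voting(examples, n, alpha, r, s, update):
    """Train on the example stream and snapshot the hypothesis r times,
    s examples apart, at the end of training (the final snapshot is taken
    after the last example, e.g. after 9000, 9500, 10000 of 10000 examples
    when r = 3 and s = 500)."""
    w_pos, w_neg = [1.0] * n, [1.0] * n
    snapshots = []
    snap_points = {len(examples) - k * s for k in range(r)}
    for t, (x, y) in enumerate(examples, start=1):
        update(w_pos, w_neg, x, y, alpha)
        if t in snap_points:
            snapshots.append((list(w_pos), list(w_neg)))  # copy current weights
    return snapshots

def vote(snapshots, x, predict):
    """Majority vote of the recorded hypotheses (with r odd there are no ties)."""
    total = sum(predict(wp, wn, x) for wp, wn in snapshots)
    return 1 if total > 0 else -1
```

Taking r odd (the paper uses r = 31) guarantees the vote is never tied.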
The test set was only used for gathering final results after the three best parameter settings had already been selected (based on their performance on the evaluation set).

5 Experimental Results

In this section we describe our results on each data set. Values are given for the percentage accuracy on test set coding sequences, the percentage accuracy on test set noncoding sequences, the average accuracy (the average of the previous two values), and the overall accuracy on the test set with the original ratio of noncoding and coding examples. Note that since there are more noncoding than coding examples, the overall accuracy is different from the average accuracy. We also give the correlation coefficient for our predictors. This is the correlation between the actual and predicted {−1, 1} labels on the (unnormalized) test set. As noted in [21], average accuracy is arguably the most interesting statistic; it is simple to achieve high accuracy on only one type of sequence (either coding or noncoding) at the cost of low accuracy on the other kind, and the overall accuracy is always close to the accuracy on noncoding sequences due to the preponderance of noncoding examples in all of our data sets. The challenge is to simultaneously achieve high accuracy on both coding and noncoding sequences. Our experimental results are summarized in Tables 3, 4, 5 and 6. Our system is called AEPP (for Attribute Efficient Protein Predictor, pronounced "ape"); along with AEPP's performance under various parameter settings, we give the performance of the single best (highest average accuracy) decision tree from Salzberg's study [21] on each of the Fickett and Tung data sets. For these data sets we also give the performance of the single best feature from Fickett and Tung's list of 21 coding measures [9]. The average accuracy for each algorithm is given in boldface.
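All of the reported statistics can be computed from the four cells of a confusion matrix. The sketch below uses our own names; the correlation coefficient is computed as the Pearson correlation of the ±1 labels, which for binary ±1 variables coincides with the Matthews correlation coefficient:

```python
from math import sqrt

def evaluate(y_true, y_pred):
    """Coding accuracy, noncoding accuracy, their average, overall accuracy,
    and the correlation coefficient between actual and predicted {-1,+1} labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    coding_acc = tp / (tp + fn)         # accuracy on coding sequences
    noncoding_acc = tn / (tn + fp)      # accuracy on noncoding sequences
    average = (coding_acc + noncoding_acc) / 2
    overall = (tp + tn) / len(y_true)   # dominated by the majority class
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return coding_acc, noncoding_acc, average, overall, correlation
```

With heavily imbalanced test sets, `overall` tracks `noncoding_acc` closely, which is exactly why the paper emphasizes the average accuracy instead.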
Table 3: Predictive Accuracy on Fickett-Tung 54-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: AEPP-1, AEPP-2, AEPP-3, OC1-C, Hexamer; the numeric entries have not survived in this copy)
The OC1-C results are taken from [21], and the Hexamer results are from [9].

Table 4: Predictive Accuracy on Fickett-Tung 108-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: AEPP-4, AEPP-5, AEPP-6, OC1-D, Position Asymmetry; the numeric entries have not survived in this copy)
The OC1-D results are taken from [21], and the Position Asymmetry results are from [9].

5.1 Fickett and Tung 54-bp data

This data set consists of 290,628 nonoverlapping human DNA sequences, each of which is 54 base pairs long and is either entirely coding, entirely noncoding, or the reverse complement of a sequence which is entirely coding or entirely noncoding. The training portion of this data set contains 20,456 coding sequences and 125,100 noncoding sequences. As described in Section 4.4, we normalized this training set to obtain a multiset consisting of 125,145 instances of coding sequences and 125,100 noncoding sequences. Only one pass through the training multiset was used, so AEPP processed 250,245 sequences in each training session. Each version of AEPP took less than 5 minutes to train on this multiset using an SGI with an R10000 processor. AEPP-1 results were obtained by setting α = 1.6 and s = 4000, while AEPP-2 had α = 1.7 and s = 4000 and AEPP-3 had α = 1.5. As indicated in Table 3, these three systems all achieved significantly higher average accuracy than previous systems on this data set. For all three versions of AEPP the accuracy on coding and noncoding windows is roughly equal, with AEPP-1 coming closest to true equality. We note that the higher overall accuracy of AEPP-2, like that of OC1-C, is due to the somewhat higher accuracy of these algorithms on noncoding windows, which make up more than 84% of the test set. Each version of AEPP took approximately 30 minutes to process the 145,040 examples in the test set. We believe that this rate could be significantly increased by pruning the coefficients of the hypotheses which are used for prediction; however, we did not explore this.

5.2 Fickett and Tung 108-bp data

This data set consists of 130,428 nonoverlapping human DNA sequences of 108 nucleotides each. After normalizing, the training multiset which we used contained 58,065 instances of coding sequences and 58,118 noncoding sequences.
We made one pass through the training multiset for each version of AEPP. Training time for each version of AEPP was less than 4 minutes to process the multiset of 116,183 examples. The settings for α and s which we used are as follows: AEPP-4 had α = 1.5 and s = 2000, AEPP-5 had α = 1.6 and s = 4000, and AEPP-6 had α = 1.5. Here too the AEPP systems significantly outperformed previous systems (see Table 4). As on the 54-bp data, each AEPP predictor achieves roughly equal levels of accuracy on coding and noncoding windows. Each AEPP predictor took approximately 25 minutes of computation time to process the 65,224 sequences in the test set.

Table 5: Predictive Accuracy on Fickett-Tung 162-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: AEPP-7, AEPP-8, AEPP-9, OC1-G, Fourier; the numeric entries have not survived in this copy)
The OC1-G results are taken from [21], and the Fourier results are from [9].

Table 6: Predictive Accuracy on Kulp-Reese 108-BP Human DNA Test Set
(columns: Coding %, Noncoding %, Average %, Overall %, correlation coefficient; rows: three AEPP variants; the numeric entries have not survived in this copy)
Each figure in this table is an average of nine values obtained by nine-fold cross-validation.

5.3 Fickett and Tung 162-bp data

This data set contains 79,882 sequences, each of which is 162 nucleotides long. Since the training set contains more than 10 times as many noncoding windows as coding windows, our normalizing procedure was particularly useful here. After normalizing, the multiset which we used as input for AEPP consisted of 36,456 coding sequences and 36,502 noncoding sequences. Here too we cycled through the multiset only once for each version of AEPP. Training time was less than 3 minutes for the 72,958 examples in the training multiset. AEPP-7 was trained using α = 1.6 and s = 1000, AEPP-8 using α = 1.2 and s = 2000, and AEPP-9 using α = 1.2. While the best OC1 decision tree achieved an average accuracy which was 2.0% higher than the best single feature (the fourier measure) for this data set, our best AEPP version had average accuracy 2.9% higher than this best OC1 classifier. The correlation coefficient also shows a significant increase. Each AEPP version took about 25 minutes for testing the 39,868 sequences in the test set.

5.4 Kulp and Reese 108-bp data

This data set contains 29,651 human DNA sequences, of which 4,138 are coding sequences and 25,513 are noncoding sequences. The data set was divided into nine portions as indicated in Table 2 and our experiments were run using nine-fold cross-validation.
In each of the nine runs, we held out one portion of the data as the test set, constructed a multiset of examples from the other eight portions of the data, and trained using this multiset. Since the Kulp-Reese data set is relatively small (each multiset contained approximately 45,000 examples), we cycled through the training multiset four times for each experiment. Each figure in Table 6 is an average of the nine figures which were obtained on the nine test sets. The parameter settings for α and s which we used were identical to those which we used for the Fickett and Tung 108 base pair data, so no tuning was done using any of the Kulp-Reese data. Had we done such tuning, it is likely that even higher accuracy could have been obtained. Despite this, the AEPP predictors had extremely high accuracy on this data, as can be seen in Table 6.

Figure 1. Average percentage accuracy of various predictors (AEPP, OC1, Position Asymmetry, Fourier, Hexamer, Dicodon) on Fickett and Tung's 54 bp, 108 bp, and 162 bp data sets.

5.5 Comparison to previous results

Figure 1 shows the performance of several classifiers on Fickett and Tung's benchmark data sets. For each data set, the values plotted are the average accuracies of the best AEPP classifier, the best OC1 decision tree, and the best single features. Like the OC1 system, our AEPP system consistently had higher accuracy than the best single feature. However, AEPP also consistently outperformed the best OC1 decision tree classifier. It is encouraging to note that AEPP's performance advantage seems to increase with the length of the DNA sequences being classified. This indicates that a system like AEPP may be especially useful for identifying coding regions in long DNA sequences, since these coding regions are frequently longer than 162 base pairs (particularly in single-exon genes).

6 Conclusions & Future Work

Our study uses Balanced Winnow, a mistake-driven multiplicative-update learning algorithm, in combination with voting methods to obtain classifiers for the prediction problem of determining whether or not a given sequence of human DNA codes for protein. Although our classifiers employ a great many variables (more than 17,000), the attribute efficiency of the underlying learning algorithm and the inherent sparseness of each individual data point make our method very fast. The predictors which our system learns are significantly more accurate than any previously obtained classifiers. Besides the improvement in accuracy, another advantage of this approach is that it is extremely easy to augment our current system with new features; they simply become new Winnow variables, and Winnow itself will determine which of them (if any) are useful.
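A minimal sketch of a Balanced Winnow learner of the kind described above, assuming a sparse encoding where each example is the set of its active feature indices. The variable names and the default threshold are our assumptions; the paper's actual feature encoding and voting scheme are not reproduced here.

```python
def train_balanced_winnow(examples, n_features, alpha=1.2, theta=None):
    """examples: list of (active_indices, label) pairs, label in {+1, -1}.
    alpha is the multiplicative update parameter; theta the threshold."""
    if theta is None:
        theta = float(n_features)  # a common default choice
    w_pos = [1.0] * n_features  # "positive" weight vector
    w_neg = [1.0] * n_features  # "negative" weight vector
    for active, label in examples:
        # Score only the active features: cost per example is
        # proportional to the sparsity, not to n_features.
        score = sum(w_pos[i] - w_neg[i] for i in active)
        prediction = 1 if score >= theta else -1
        if prediction != label:      # mistake-driven: update only on errors
            for i in active:         # only active features change
                if label == 1:       # promote positive, demote negative
                    w_pos[i] *= alpha
                    w_neg[i] /= alpha
                else:                # demote positive, promote negative
                    w_pos[i] /= alpha
                    w_neg[i] *= alpha
    return w_pos, w_neg, theta
```

With an update parameter in the 1.2-1.6 range and a single pass over the training multiset, this corresponds to the training regime described in Section 5.3. Because irrelevant features are simply never promoted, adding new variables is cheap, which is the augmentation property noted above.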
The nature of the Winnow algorithm is such that even if the new variables are irrelevant to the problem at hand, they cannot have a significant adverse effect on the system's performance. Aside from the inherent interest and usefulness of developing good classifiers for this particular problem, our results are of interest because of the new approaches they suggest for developing predictors for a wide range of bioinformatics problems. Our results indicate that, as suggested in [27], using a simple attribute-efficient learning algorithm over a large feature space can be an effective practical learning technique. It would be interesting to determine whether this approach is similarly successful for other prediction problems in computational biology. Many interesting avenues for future research also exist within the gene-finding arena; we are currently working on developing a full-scale gene-finder which uses an AEPP-like system to identify coding regions within long sequences of DNA. Another direction for future work is to improve the AEPP system itself; this could be done in a variety of ways. Judicious selection of compound variables is one promising approach for increasing the expressiveness of the hypotheses which AEPP can generate. Another intriguing approach is to use an attribute-efficient "projection learning" algorithm in place of Winnow, as described in [29].

7 Acknowledgments

This research was supported by NASA GSRP grant NGT-51706, by NSF grant CCR, and by an NSF Graduate Fellowship. We would like to thank Bonnie Berger, Steven Salzberg, and Les Valiant for their advice and encouragement.

References

[1] Auer, P., and Warmuth, M.K. Tracking the Best Disjunction. In Proc. 36th Symposium on the Foundations of Computer Science. Los Alamitos, CA: IEEE Computer Society Press.
[2] Blum, A. Empirical support for Winnow and weighted-majority algorithms: results on a calendar scheduling domain. Machine Learning 26.
[3] Breiman, L. Bagging Predictors. Machine Learning 24.
[4] Burge, C., and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268.
[5] Burset, M., and Guigo, R. Evaluation of Gene Structure Prediction Programs. Genomics 34.
[6] Chan, P.K., and Stolfo, S.J. A comparative evaluation of voting and meta-learning on partitioned data. In Proceedings 12th International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
[7] Dagan, I., Karov, Y., and Roth, D. Mistake-Driven Learning in Text Categorization. In Second Conference on Empirical Methods in Natural Language Processing.
[8] Dong, S., and Searls, D.B. Gene structure prediction by linguistic methods. Genomics 162.
[9] Fickett, J., and Tung, C.-S. Assessment of protein coding measures. Nucleic Acids Res. 20.
[10] Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation 121.
[11] Golding, A.R., and Roth, D. Applying Winnow to Context-Sensitive Spelling Correction. In 13th International Conference on Machine Learning.
[12] Helmbold, D.P., Schapire, R.E., Singer, Y., and Warmuth, M.K. On-Line Portfolio Selection Using Multiplicative Updates. In Machine Learning: Proceedings of the Thirteenth International Conference.
[13] Kivinen, J., and Warmuth, M.K. Exponentiated Gradient Versus Gradient Descent for Linear Predictors. In Proceedings of the 27th Annual ACM Symposium on the Theory of Computing. New York: ACM Press.
[14] Kivinen, J., Warmuth, M.K., and Auer, P. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. In Proceedings, 8th Annual Conference on Computational Learning Theory. New York: ACM Press.
[15] Kulp, D., Haussler, D., Reese, M.G., and Eeckman, F.H. A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA Sequences. In ISMB-96. Menlo Park, CA: AAAI/MIT Press.
[16] Kwok, S., and Carter, C. Multiple Decision Trees. In Uncertainty in Artificial Intelligence 4, ed. Schacter, R., Levitt, T., Kanal, L., and Lemmer, J. North-Holland.

[17] Lewis, D.D., Schapire, R.E., Callan, J.P., and Papka, R. Training Algorithms for Linear Text Classifiers. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[18] Littlestone, N. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2.
[19] Littlestone, N. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In COLT 91.
[20] Littlestone, N., and Warmuth, M.K. The Weighted Majority Algorithm. Information and Computation 108.
[21] Salzberg, S. Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm. J. Comp. Biol. 2:3.
[22] Salzberg, S., Delcher, A., Fasman, K., and Henderson, J. A Decision Tree System for Finding Genes in DNA. Technical Report, Department of Computer Science, Johns Hopkins University.
[23] Salzberg, S., Fasman, K., and Henderson, J. Finding Genes in DNA with a Hidden Markov Model. J. Comp. Biol. 4(2).
[24] Servedio, R. Improved prediction of individual book use for off-site storage using a simple attribute-efficient learning system. Unpublished manuscript.
[25] Snyder, E.E., and Stormo, G.D. Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucleic Acids Res. 21(3).
[26] Solovyev, V.V., Salamov, A.A., and Lawrence, C.B. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22.
[27] Valiant, L.G. A Neuroidal Architecture for Cognitive Computation. Harvard University Computer Science Technical Reports, TR.
[28] Valiant, L.G. Circuits of the Mind. Oxford University Press.
[29] Valiant, L.G. Projection Learning. Unpublished manuscript.
[30] Uberbacher, E.C., and Mural, R.J. Locating protein coding regions in human DNA sequences using a multiple sensor-neural network approach. Proc. Nat. Acad. Sci. USA 88.
[31] Xu, Y., Einstein, J.R., Shah, M., and Uberbacher, E.C. An improved system for exon recognition and gene modeling in human DNA sequences. In ISMB-94. Menlo Park, CA: AAAI/MIT Press.

More information

Evaluation of Gene-Finding Programs on Mammalian Sequences

Evaluation of Gene-Finding Programs on Mammalian Sequences Letter Evaluation of Gene-Finding Programs on Mammalian Sequences Sanja Rogic, 1 Alan K. Mackworth, 2 and Francis B.F. Ouellette 3 1 Computer Science Department, The University of California at Santa Cruz,

More information

Computer Science Technical Report

Computer Science Technical Report Computer Science Technical Report Characterizing Domain Specific Effects in Flaw Selection for Partial Order Planners AdeleE.Howe EricDahlman Computer Science Department Colorado State University Fort

More information

Gene Signal Estimates from Exon Arrays

Gene Signal Estimates from Exon Arrays Gene Signal Estimates from Exon Arrays I. Introduction: With exon arrays like the GeneChip Human Exon 1.0 ST Array, researchers can examine the transcriptional profile of an entire gene (Figure 1). Being

More information

Data mining and machine learning algorithms for soft sensor development. Tibor Kulcsár. Theses of the doctoral (PhD) dissertation

Data mining and machine learning algorithms for soft sensor development. Tibor Kulcsár. Theses of the doctoral (PhD) dissertation Theses of the doctoral (PhD) dissertation Data mining and machine learning algorithms for soft sensor development Tibor Kulcsár University of Pannonia Doctoral School in Chemical Engineering and Material

More information

Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian Neural Networks

Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian Neural Networks Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian s DAVID SACHEZ SPIROS H. COURELLIS Department of Computer Science Department of Computer Science California State University

More information

Genie Gene Finding in Drosophila melanogaster

Genie Gene Finding in Drosophila melanogaster Methods Gene Finding in Drosophila melanogaster Martin G. Reese, 1,2,4 David Kulp, 2 Hari Tammana, 2 and David Haussler 2,3 1 Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology,

More information

Extraction of Hidden Markov Model Representations of Signal Patterns in. DNA Sequences

Extraction of Hidden Markov Model Representations of Signal Patterns in. DNA Sequences 686 Extraction of Hidden Markov Model Representations of Signal Patterns in. DNA Sequences Tetsushi Yada The Japan Information Center of Science and Technology (JICST) 5-3 YonbancllO, Clliyoda-ku, Tokyo

More information

ab initio and Evidence-Based Gene Finding

ab initio and Evidence-Based Gene Finding ab initio and Evidence-Based Gene Finding A basic introduction to annotation Outline What is annotation? ab initio gene finding Genome databases on the web Basics of the UCSC browser Evidence-based gene

More information

Boosting Trees for Anti-Spam Filtering. [Extended Version] TALP Research Center. LSI Department. Universitat Politecnica de Catalunya (UPC)

Boosting Trees for Anti-Spam  Filtering. [Extended Version] TALP Research Center. LSI Department. Universitat Politecnica de Catalunya (UPC) Boosting Trees for Anti-Spam Email Filtering [Extended Version] Xavier Carreras Llus Marquez TALP Research Center LSI Department Universitat Politecnica de Catalunya (UPC) Jordi Girona Salgado 1{3 Barcelona

More information

Department of Computer Science and Engineering, University of

Department of Computer Science and Engineering, University of Optimizing the BAC-End Strategy for Sequencing the Human Genome Richard M. Karp Ron Shamir y April 4, 999 Abstract The rapid increase in human genome sequencing eort and the emergence of several alternative

More information

COMP 555 Bioalgorithms. Fall Lecture 1: Course Introduction

COMP 555 Bioalgorithms. Fall Lecture 1: Course Introduction COMP 555 Bioalgorithms Fall 2014 Jan Prins Lecture 1: Course Introduction Read Chapter 1 and Chapter 3, Secns 3.1-3.7 1 Intended Audience COMP 555: Bioalgorithms Suitable for both undergraduate and graduate

More information

Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets

Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets Kang Peng Slobodan Vucetic Zoran Obradovic Abstract In this study we proposed an iterative procedure

More information

Bank Card Usage Prediction Exploiting Geolocation Information

Bank Card Usage Prediction Exploiting Geolocation Information Bank Card Usage Prediction Exploiting Geolocation Information Martin Wistuba, Nghia Duong-Trung, Nicolas Schilling, and Lars Schmidt-Thieme Information Systems and Machine Learning Lab University of Hildesheim

More information

September 11, Abstract. of an optimization problem and rewards instances that have uniform landscapes,

September 11, Abstract. of an optimization problem and rewards instances that have uniform landscapes, On the Evolution of Easy Instances Christos H. Papadimitriou christos@cs.berkeley.edu Martha Sideri sideri@aueb.gr September 11, 1998 Abstract We present experimental evidence, based on the traveling salesman

More information

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions Near-Balanced Incomplete Block Designs with An Application to Poster Competitions arxiv:1806.00034v1 [stat.ap] 31 May 2018 Xiaoyue Niu and James L. Rosenberger Department of Statistics, The Pennsylvania

More information

A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang

A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang Abstract Our team intends to develop a recommendation system for job seekers based on the information of current

More information

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY GUY HASKIN FERNALD 1, DORNA KASHEF 2, NICHOLAS P. TATONETTI 1 Center for Biomedical Informatics Research 1, Department of Computer

More information

New Methods for Splice Site Recognition

New Methods for Splice Site Recognition New Methods for Splice Site Recognition S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller Fraunhofer FIRST, Kekuléstr. 7, 12489 Berlin, Germany Australian National University, Canberra, ACT 0200, Australia

More information

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs. Page 1 REMINDER: BMI 214 Industry Night Comparative Genomics Russ B. Altman BMI 214 CS 274 Location: Here (Thornton 102), on TV too. Time: 7:30-9:00 PM (May 21, 2002) Speakers: Francisco De La Vega, Applied

More information

Predictive Analytics Using Support Vector Machine

Predictive Analytics Using Support Vector Machine International Journal for Modern Trends in Science and Technology Volume: 03, Special Issue No: 02, March 2017 ISSN: 2455-3778 http://www.ijmtst.com Predictive Analytics Using Support Vector Machine Ch.Sai

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

C. Wohlin, P. Runeson and A. Wesslén, "Software Reliability Estimations through Usage Analysis of Software Specifications and Designs", International

C. Wohlin, P. Runeson and A. Wesslén, Software Reliability Estimations through Usage Analysis of Software Specifications and Designs, International C. Wohlin, P. Runeson and A. Wesslén, "Software Reliability Estimations through Usage Analysis of Software Specifications and Designs", International Journal of Reliability, Quality and Safety Engineering,

More information

May 16. Gene Finding

May 16. Gene Finding Gene Finding j T[j,k] k i Q is a set of states T is a matrix of transition probabilities T[j,k]: probability of moving from state j to state k Σ is a set of symbols e j (S) is the probability of emitting

More information

Molecular Biology and Pooling Design

Molecular Biology and Pooling Design Molecular Biology and Pooling Design Weili Wu 1, Yingshu Li 2, Chih-hao Huang 2, and Ding-Zhu Du 2 1 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA weiliwu@cs.utdallas.edu

More information

Methodology for the Design and Evaluation of Ontologies. Michael Gruninger and Mark S. Fox. University oftoronto. f gruninger, msf

Methodology for the Design and Evaluation of Ontologies. Michael Gruninger and Mark S. Fox. University oftoronto. f gruninger, msf Methodology for the Design and Evaluation of Ontologies Michael Gruninger and Mark S. Fox Department of Industrial Engineering University oftoronto Toronto, Canada M5S 1A4 f gruninger, msf g@ie.utoronto.ca

More information

100 Training Error Testing Error second 1 Intersection 1 Turn. Distance Exchange Value in Feet 500. Percentage Correct

100 Training Error Testing Error second 1 Intersection 1 Turn. Distance Exchange Value in Feet 500. Percentage Correct Interactive Renement of Route Preferences for Driving Seth Rogers and Pat Langley (rogers, langley)@rtna.daimlerbenz.com Daimler-Benz Research and Technology Center 1510 Page Mill Road, Palo Alto, CA 94304-1135

More information

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations An Analytical Upper Bound on the Minimum Number of Recombinations in the History of SNP Sequences in Populations Yufeng Wu Department of Computer Science and Engineering University of Connecticut Storrs,

More information

Methods and Algorithms for Gene Prediction

Methods and Algorithms for Gene Prediction Methods and Algorithms for Gene Prediction Chaochun Wei 韦朝春 Sc.D. ccwei@sjtu.edu.cn http://cbb.sjtu.edu.cn/~ccwei Shanghai Jiao Tong University Shanghai Center for Bioinformation Technology 5/12/2011 K-J-C

More information

Private Surveys for the Public Good

Private Surveys for the Public Good Private Surveys for the Public Good Bianca Pham, Emmanuel Genene, Ethan Abramson, and Gastón P Montemayor Olaizola Abstract Protecting the privacy of individuals data collected while taking a survey has

More information

Exploring Long DNA Sequences by Information Content

Exploring Long DNA Sequences by Information Content Exploring Long DNA Sequences by Information Content Trevor I. Dix 1,2, David R. Powell 1,2, Lloyd Allison 1, Samira Jaeger 1, Julie Bernal 1, and Linda Stern 3 1 Faculty of I.T., Monash University, 2 Victorian

More information

Ranking Potential Customers based on GroupEnsemble method

Ranking Potential Customers based on GroupEnsemble method Ranking Potential Customers based on GroupEnsemble method The ExceedTech Team South China University Of Technology 1. Background understanding Both of the products have been on the market for many years,

More information

Bis2A 12.2 Eukaryotic Transcription

Bis2A 12.2 Eukaryotic Transcription OpenStax-CNX module: m56061 1 Bis2A 12.2 Eukaryotic Transcription Mitch Singer Based on Eukaryotic Transcription by OpenStax College This work is produced by OpenStax-CNX and licensed under the Creative

More information

Linking maintenance strategies to performance

Linking maintenance strategies to performance Int. J. Production Economics 70 (2001) 237}244 Linking maintenance strategies to performance Laura Swanson* Department of Management, Southern Illinois University Edwardsville, Edwardsville, IL 62026-1100,

More information

Textbook Reading Guidelines

Textbook Reading Guidelines Understanding Bioinformatics by Marketa Zvelebil and Jeremy Baum Last updated: May 1, 2009 Textbook Reading Guidelines Preface: Read the whole preface, and especially: For the students with Life Science

More information