Evaluation of Gene-Finding Programs on Mammalian Sequences

Size: px

Start display at page:

Download "Evaluation of Gene-Finding Programs on Mammalian Sequences"

Morgan Fowler
6 years ago
Views:

1 Letter Evaluation of Gene-Finding Programs on Mammalian Sequences Sanja Rogic, 1 Alan K. Mackworth, 2 and Francis B.F. Ouellette 3 1 Computer Science Department, The University of California at Santa Cruz, Santa Cruz 95064, California; 2 Computer Science Department, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada; 3 Centre for Molecular Medicine and Therapeutics, Vancouver, BC V5Z 4H4, Canada We present an independent comparative analysis of seven recently developed gene-finding programs: FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF. For evaluation purposes we developed a new, thoroughly filtered, and biologically validated dataset of mammalian genomic sequences that does not overlap with the training sets of the programs analyzed. Our analysis shows that the new generation of programs has substantially better results than the programs analyzed in previous studies. The accuracy of the programs was also examined as a function of various sequence and prediction features, suchas G + C content of the sequence, length and type of exons, signal type, and score of the exon prediction. This approach pinpoints the strengths and weaknesses of each individual program as well as those of computational gene-finding in general. The dataset used in this analysis (HMR195) as well as the tables with the complete results are available at rogic/evaluation/. Currently, in genome centers around the world, millions of bases of genomic DNA from different organisms are sequenced every day. With the recently assembled draft sequence of the human genome in hand and the completed sequence to follow in a couple of years, we need to re-evaluate our methods for deciphering such an enormous amount of data. We present here the results of a comprehensive evaluation of recent computer programs used for the identification of protein coding genes in eukaryotic genomic sequences. Because we expect that such an analysis will be of interest to both biologists and computer scientists, we will first provide an overview of gene structure and computational methods for genefinding. Gene Structure The genes of most eukaryotic organisms are neither continuous nor contiguous. They are separated by long stretches of intergenic DNA and their coding sequences are interrupted by noncoding introns. Coding sequences occupy just a small fraction of a typical higher eukaryotic genome; the extreme example is the human genome, where an estimate of that fraction at 3% (Duret et al. 1995) was recently confirmed for chromosome 22 (Dunham et al. 1999). To obtain a continuous coding sequence which will be translated into a protein sequence, genes are transcribed into long premrna molecules that subsequently undergo complex 5 Corresponding author. rogic@cse.ucsc.edu; FAX: (831) Article and publication are at /cgi/doi/ / gr processing to remove intronic sequences and assemble exons to form mrna. However, assembly of the gene exons in the mature mrna is not always the same; Mironov et al. (1999) found that at least 35% of human genes are alternatively spliced having more than one possible exon assembly. The arrangement of genes in genomes is also prone to exceptions. Although usually separated with an intergenic region, there are examples of genes nested within each other (Dunham et al. 1999); that is, one gene located in an intron of another gene or overlapping genes on the same (Schulz and Butler 1989; Ashburner et al. 1999) or opposite (Cooper et al. 1998) DNA strands. The presence of pseudogenes (nonfunctional sequences resembling real genes) which are distributed in numerous copies throughout the genome further complicates the identification of true protein coding genes. Regulatory regions play a crucial role in gene expression, and their identification is needed to fully comprehend a gene s function, activity, and role in cellular processes. The location of regulatory regions relative to their target gene is not uniquely determined; the basic regulatory elements, such as the TATA and CAT boxes, are usually found in the upstream proximity of the transcription start site, while the other elements such as enhancers and silencers, can be located in distant upstream and downstream regions of a gene and sometimes even within the introns of the gene. This brief overview of genome organization and gene architecture highlights the complexity of gene identification in the sequences of uncharacterized DNA. 11: by Cold Spring Harbor Laboratory Press ISSN /01 $5.00; Genome Research 817

2 Rogic et al. Computational Methods for Identification of Genes There are several methods used for the experimental discovery of genes, but they are time-consuming and costly. Accordingly, for the last 15 years researchers have been developing computational methods for gene-finding that can automate, or facilitate, the identification of genes. Two basic approaches have been established for computational gene-finding: the sequence similarity search, or lookup (Fickett 1996), method and the integrated compositional and signal search, or template (Fickett 1996), method. The latter method is also commonly referred to as ab initio gene finding. Sequence similarity search is a well-established computational method for gene discovery which has been used extensively with considerable success. It is based on sequence conservation due to the functional constraints and is used to search for regions of similarity between an uncharacterized sequence of interest and already characterized sequences in a public sequence database. Significant similarity between two sequences suggests that they are homologous, that is, they have common evolutionary origin. A query sequence can be compared with DNA, protein, or expressed sequence tag (EST) sequences or it can be searched for known sequence motifs. If a query sequence is found to be significantly similar to an already annotated sequence (DNA or protein), we can use the information from the annotated sequence to possibly infer gene structure or function of the query sequence. Comparison with an EST database can provide information if the sequence of interest is transcribed, that is, contains an expressed gene, but will only give incomplete clues about the structure of the whole gene or its function. Although sequence similarity search has been proven useful in many cases, it has been shown that only a fraction of newly discovered sequences have identifiable homologs in the current databases (Oliver et al. 1992; Wilson et al. 1994; Dunham et al. 1999). Furthermore, Green et al. (1993) suggested that currently known proteins may already include representatives of most ancient conserved regions (ACRs, regions of protein sequences showing highly significant similarity across phyla) and that new sequences not similar to any database sequence are unlikely to contain ACRs. The proportion of vertebrate genes with no detectable similarity in other phyla is estimated to be 50% (Claverie 1997). This is supported by a recent analysis of human chromosome 22 (Dunham et al. 1999) where only 50% of the proteins are found to be similar to previously known proteins. These results suggest that, even today, only one half of all new vertebrate genes may be discovered by sequence similarity search across phyla. Considering that a complete vertebrate genome is still not available and that the most prominent vertebrate organisms in GenBank (Benson et al. 2000), Homo sapiens, and Mus musculus, have only 25% and 0.6% of their genomes present in finished sequences, respectively (data from September 2000), it is obvious that sequence similarity search within vertebrates is currently limited. When more vertebrate sequences become available in Gen- Bank (such as mouse, zebrafish, or pufferfish), matches within phyla will be more likely and this will facilitate the detection of genes coding for non-acr-containing proteins. The second computational approach for the prediction of genes structures in the genomic DNA sequences, termed the template approach, integrates coding statistics with signal detection into one framework. Coding statistics behave differently on coding and noncoding regions and they are measures indicative of protein coding functions. A number of these measures have been evaluated by Fickett and Tung (1992), and it has been concluded that the in-phase hexamer measure, which measures the frequency of occurrence of oligonucleotides of length six in a specific reading frame, is the most effective. Indeed, this measure was used successfully in many recently developed programs such as GeneMark.hmm (Lukashin and Borodovski 1998), Genscan (Burge 1997), and HMMgene (Krogh 1997). This coding statistic is usually implemented as a 5th-order hidden Markov model (HMM) (for a review of the theory of HMMs, see Rabiner 1989). Signal sensors attempt to mimic closely processes occurring within the cell. They are intended to identify sequence signals, usually just several nucleotides-long subsequences, which are recognized by cell machinery and are initiators of certain processes. The signals that are usually modeled by gene-finding programs are promoter elements, start and stop codons, splice sites, and poly-a sites. Many different pattern recognition methods have been used as signal detectors, including simple consensus sequences, weight matrices, weight arrays, neural networks, and decision trees. DNA sequence signals have low information content; they are usually degenerate and highly unspecific because it is almost impossible to distinguish the signals truly processed by the cell from those that are apparently nonfunctional. Therefore, signal sensors are not sufficient to elucidate gene structure, and it is necessary to combine them with coding statistics methods in order to obtain satisfactory predictive power. Both codon statistics and signal models are learned from a training set: Frequencies of oligonucleotide occurrence in different regions of the genes are calculated from sequences in the training set, and the signal models are constructed using the alignment of the signal sequences from the training set. 818 Genome Research

3 Gene-Finding Programs on Mammalian Sequences There is also a group of programs that integrate a third component in their systems: similarity with an annotated sequence. Examples of such programs are GeneID+ (Guigo et al. 1992), GeneParser3 (Snyder and Stormo 1995), Procrustes (Gelfand et al. 1996), and AAT (Huang et al. 1997). Existing gene-finding programs have been designed to identify simpler gene structure and then the intricate structure described above. Most of the programs, especially the older ones, are trained to identify just one gene in a sequence, rarely predicting any promoter elements. Some progress has been made with recently developed programs which are capable of identifying more complex genomic structure: any number of genes with either complete or partial structure. This is the case with Genie (Kulp et al. 1996), GeneMark.hmm, Genscan, and HMMgene. Still, regulatory regions and poly-a sites usually remain unidentified, 5 and 3 untranslated regions are not specified, alternative splice variants are not considered, and overlapping or nested genes are not detected. Nevertheless, the prediction of the coding sequence of typical genes is an important first step in deciphering the content of any genome, and the programs for gene-finding are used extensively for this task with considerable success. An excellent Web resource for computational gene recognition, including the URLs of gene-finding programs and a comprehensive bibliography on gene recognition and related subjects, is maintained by Wentian Li at Evaluation of Gene-Finders Because in many cases there is no additional evidence to support the gene predictions provided by ab initio gene-finding programs, it is very important to know the accuracy level of these programs. The reliability of the programs concerns both users and developers. Lab bench experiments are often based on the gene/exon predictions, and they usually require a substantial investment in time and resources. This is why it is important for a user to know how well a certain algorithm performs, what its strengths and weaknesses are, and how to interpret a particular score given by the program. For developers it is valuable to know the current state of the art, to relate the program s efficiency and reliability to the methods used, and to recognize the weaknesses that need to be addressed. To evaluate gene-finding programs meaningfully it is necessary to do it uniformly on one test set of sequences. It is also important to avoid using sequences used for the training of programs analyzed, otherwise the accuracy of the programs may be overestimated. Previous comparative analyses of gene-finding programs have been performed by Snyder and Stormo in 1995 and Burset and Guigo in Snyder and Stormo (1995) analyzed three gene-finding programs GeneID, GRAIL (Xu et al. 1994) (two versions), and GeneParser (three versions of the program) on rather limited test sets containing 28 and 34 sequences. A more comprehensive evaluation of gene structure prediction programs was done by Burset and Guigo (1996). These authors tested nine programs on a test set of 570 sequences and introduced a number of performance metrics to measure the accuracy of prediction on the nucleotide, exon, and gene levels. Some of these measures were known and had been used previously (sensitivity, specificity, and correlation coefficient at the nucleotide level) and some were newly introduced (approximate correlation, sensitivity, and specificity at the exon level). Those authors also investigated the behavior of the programs on sequences with errors (frameshift mutations), sequences with differing G + C content and sequences from different phylogenic groups within the vertebrates. This comprehensive analysis has been a valuable resource for both users and developers of gene prediction programs, and the Burset/Guigo dataset has been used extensively as a benchmark dataset for testing new generations of programs. In the four years since the Burset/Guigo analysis was published, many new programs have been developed. The accuracy measures for most of them have been reported for the Burset/Guigo dataset. A reason for concern is not just that authors tested their programs themselves, but also in many cases it is not clear how the sequences from a program s training set overlap with the Burset/Guigo test set. It is realistic to assume that, in many cases, the training sets of these programs do overlap with the Burset/Guigo dataset because at one time it contained the vast majority of available vertebrate genomic sequences. Here we present a new independent comparative analysis of seven recently developed programs that use only coding and signal information for the prediction of gene structure: FGENES (V. Solovyev, unpubl.), GeneMark.hmm, Genie, Genscan, HMMgene, Morgan (Salzberg et al. 1998), and MZEF (Zhang 1997). The programs were tested on genomic sequences containing one single- or multi-exon gene of human, mouse or rat origin. In order to avoid overlap with the training sets of the programs, we selected only sequences entered in GenBank after the programs were developed and trained. The initial dataset was further filtered to exclude any anomalous sequences (e.g., containing atypical start codon or splice site dinucleotide, inframe stop codon), and then we subjected it to nonredundancy testing to eliminate groups or pairs of very similar sequences. Finally, we kept only sequences for which we could confirm the annotated exon/intron boundaries by aligning them with a corresponding Genome Research 819

4 Rogic et al. mrna sequence. The final set consisted of 195 sequences as described below. For all the programs tested, the basic accuracy measures introduced by Burset and Guigo (1996) were calculated. We chose to compute only nucleotide and exon level accuracy measures, because the prediction of the entire gene structure is still unreliable and seldom used. As an illustration, Dunham et al. (1999) identified 94% at least partially predicted genes on human chromosome 22 using Genscan, but only 20% of genes had all exons predicted exactly. We also examined the programs accuracy as a function of various sequence or prediction features, such as G + C content of the sequence, length and type of the annotated and predicted exons, signal type of the annotated and predicted exons, and the score/probability of the predicted exon, motivated by a similar detailed analysis done by Burge (1997). RESULTS The accuracy measures for five of the seven programs analyzed in this study have already been reported for the Burset/Guigo dataset. However, the results should be considered with some care because it is not clear how this test dataset and the training sets used overlapped. This problem motivated us to build a dataset that does not contain any of the sequences that were part of any training set of these programs. This was accomplished simply by choosing sequences that were entered into GenBank after these programs were trained, but there are no guarantees that there are no similarities between these newly selected sequences and sequences in the programs training datasets. Our opinion is that an independent dataset should not necessarily exclude the sequences that are similar to the sequences in the training set, because, realistically, the unannotated sequence submitted to the gene-finding program might also be similar to an already known and characterized sequence that has been use in the program s training set. The same approach was advocated by Snyder and Stormo (1995). Sequences from the test dataset HMR195 were run through the seven gene prediction programs. For each sequence, the exons predicted on the forward strand (predictions for the reverse strand were ignored) were compared to the actual coding exons, as annotated in the GenBank CDS feature. Although all of the programs tested, except Morgan, can predict genes and exons on both DNA strands simultaneously, the Gen- Bank records for most of the sequences in HMR195 contain only annotation for the Watson/plus strand, and consequently only prediction for that strand could be confirmed. From this comparison, accuracy measures at the nucleotide and exon levels were computed and then averaged. We chose to use averaging by sequence, where measures are first calculated for each gene and then averaged over all genes, as opposed to averaging by base, where measures are summed for all sequences and then averaged by nucleotide or by exon, depending on the measure type. The former is thought to give a better indication of the success rate for the individual sequence entry. For a discussion of this topic, see Dong and Searls (1995) and Burset and Guigo (1996). The measures are averaged only over sequences for which they are defined. This might overestimate the values for sensitivity (Sn), specificity (Sp), approximate correlation (AC), correlation coefficient (CC), exon level sensitivity (ESn), exon level specificity (ESp), and the average of the latter two on the sets that have many sequences without prediction. So, in order to obtain a realistic estimate of the gene-finders performance, one must also look at the number of sequences where no genes were predicted. The accuracy measures for all of the programs analyzed, averaged for the entire HMR195 dataset, are presented in Table 1. The next step was to examine accuracy as a function of various sequence or prediction features, such as G + C content of the sequence, length and type of the annotated exons and predicted exons, signal type for both annotated and predicted genes, and the score/ probability of the exon prediction. For each of these characteristics, the dataset was divided into the subsets exhibiting different value ranges or types of the characteristic examined. For each subset, the accuracy measures were calculated and averaged over all of the sequences belonging to it. These results are presented in the tables below; the seventh table shows the accuracy results for human and murine sequences separately. Because of space limitations we show only some of the accuracy measures calculated. The full results for each program can be found at rogic/evaluation/results.html. DISCUSSION Comparing the results presented by Burset and Guigo (1996) with the results obtained in the present study (Table 1), it is apparent that the new generation of programs has, overall, substantially higher prediction accuracy than the programs analyzed by Burset and Guigo in At that time, the program with the best approximate correlation (among the programs not using any database similarity search) was FGENEH, with AC = 0.78, while the highest AC in 1999 is 0.91, exhibited by both Genscan and HMMgene. On the exon level, (ESn + ESp)/2 has increased from 0.64 for FGENEH to 0.76 for HMMgene. Also, earlier gene-finders were programmed to have low false-positive rates at the expense of losing valid predictions, which resulted in, on average, 20% higher specificity than sensitivity. Programs of the new generation are tuned to have equally 820 Genome Research

5 Gene-Finding Programs on Mammalian Sequences Table 1. Nucleotide and Exon Level Accuracy Programs No. of sequences Nucleotide accuracy Exon accuracy Sn Sp AC CC ESn ESp (ESn+ESp)/2 ME WE PCa PCp OL FGENES 195 (5) GeneMark.hmm 195 (0) Genie 195 (15) Genscan 195 (3) HMMgene 195 (5) Morgan 127 (0) MZEF 119 (8) For each sequence in the HMR195 dataset, the exons predicted on the forward (+) strand were compared to the annotated exons. The standard measures of predictive accuracy on nucleotide and exon level were calculated for each sequence and averaged over all sequences for which they were defined. This was done separately for each of the programs tested. (No. of sequences) number of sequences effectively analyzed by each program; in parentheses is the number of sequences where the absence of gene was predicted; (Sn) nucleotide level sensitivity; (Sp) nucleotide level specificity; (AC) approximate correlation; (CC) correlation coefficient; (ESn) exon level sensitivity; (ESp) exon level specificity; (ME) missed exons; (WE) wrong exons; (PCa) proportion of real exons that were partially predicted (only one exon boundary correct); (PCp) proportion of predicted exons that were only partially correct; (OL) proportion of predicted exons that overlap an actual exon. AC and (ESn+ESp)/2 are given with standard deviation. high sensitivity and specificity, which is more desirable. These improvements have come about as a result of the development of more accurate models for gene structure that are capable of recognizing many different gene features in the sequence. Most of the genefinders analyzed use explicit duration HMM with associated length distribution for each state. These models of genomic structure are hierarchical, with generalized HMM modeling the overall gene structure, where states of the model are independent probabilistic models themselves, such as HMMs and neural networks. Also, new methods have been developed for signal recognition, such as maximal dependence decomposition used for the recognition of donor site in Genscan, neural networks in Genie, and ribosomal binding sites in GeneMark.hmm. The training sets are carefully selected and the average number of training sequences in the dataset has increased, allowing for more diversity in genomic content of the training sequences. With the accuracy measures at the nucleotide level as high as 0.91 for the AC value for both Genscan and HMMgene, we might conclude that the problem of computational gene-finding is almost solved. However, looking at the results for exon sensitivity and specificity and their average, we see that the goal is still far away. Why is there such a gap between AC and (ESn + ESp)/2? Since ESn and ESp are defined as true exons (TE) divided by annotated exons (AE) and predicted exons (PE), respectively, this means that in calculating these two measures only exons with both boundaries predicted correctly will be considered. An almost perfectly predicted exon, covering the whole sequence of an actual exon but exceeding the splicing site by just a few nucleotides will not be counted in TE. In order to predict the exact boundaries of an exon, a program has to have a strong search by signal component signal sensors for identifying start and stop codons and splicing sites. However, signal detection, especially of start and stop codons, is probably the weakest component of current gene-finding programs, as can be observed in Table 2. Although discrimination among coding and noncoding regions, most often done by measuring the hexamer frequencies in these Table 2. Programs Accuracy versus Signal Type start codon (195) Signal type acceptor site (753) donor site (753) stop codon (195) FGENES (0.63) (0.77) (0.82) (0.72) GeneMark.hmm (0.60) (0.75) (0.78) (0.64) Genie (0.57) (0.82) (0.83) (0.73) Genscan (0.78) (0.80) (0.84) (0.86) HMMgene (0.78) (0.85) (0.87) (0.81) Morgan (0.43) (0.57) (0.56) (0.39) MZEF (0.65) (0.73) For each program, the proportion of actual signals identified correctly (the upper number) and the proportion of predicted signals that are correct (the lower number) are averaged over all signals belonging to a particular type. The number in parenthesis in the header of each column represents the number of signals of each type in the HMR195 dataset. Genome Research 821

6 Rogic et al. regions, has shown to be quite successful, signal recognition could still be improved. There has been significant effort to improve the prediction of acceptor and donor splice, and many different methods have been used for this task, such as neural networks and maximal dependence decomposition (the methods for splice site detection are not known to us for all of the programs analyzed). The success of these methods is apparent in Table 2. On the other hand, we are not aware of any systematic effort to tackle the problem of start and stop codon detection. These signals are considered to have low information content, and they are usually detected by using weight (positional) matrices, weight arrays that capture dependencies between adjacent nucleotides or, in the case of Genie, neural networks for translation initiation site. The tendency to miss actual signals can also be observed from the proportion of partially predicted exons (PCa) that ranges from 0.08 for Genie to 0.29 for GeneMark.hmm (Table 1). GeneMark.hmm, and Morgan, besides having high proportions of PCa, also have a relatively high proportion of OL, the proportion of predicted exons that overlap an actual exon (0.09 and 0.07, respectively, Table 1), and according to these results they are the programs with the poorest signal detection. If we add the number of the partially predicted exons to the number of correctly predicted exons and use this number for calculating ESn and ESp, then AC and (ESn + ESp)/2 would have similar values. G + C Content The human genome (and the genomes of other warmblooded vertebrates) is not a structurally homogenous sequence of nucleotides. Instead, it is a mosaic of isochores, long (>300 kb, on average) DNA regions whose base composition is locally homogenous, but varies significantly between disjoint regions. The genome is usually divided into five different compositional categories: L1, L2 (A + T-rich regions), H1, H2, and H3 (G + C-rich regions) in increasing order of G + C percentage. It has been observed (Bernardi 1993) that L1 + L2 constitute 60% of the human genome, H1 + H2 30%, and H3 only 5%. These compositional regions vary widely in gene density: Zoubak et al. (1996) calculated that L1 + L2 regions have a relative gene concentration of 4%, H1 + H2, 20%, and H3 76%. This means that the gene density in very G + C-rich DNA segments is almost 20 times higher than in A + T-rich regions. Important structural properties of genes are found to be strongly correlated with G + C content (Duret et al. 1995): genes from G + C-poor isochores code for proteins that are on average longer then those from G + C-rich isochores, intronic DNA is on average three times longer in L1 + L2 than in H3, and the number of introns per gene is higher in L1 + L2 than in H3. How does compositional variability in genomic sequences affect the performance of gene-finding programs? Burset and Guigo (1996), Snyder and Stormo (1995), and Lopez et al. (1994) have shown in their analyses that gene-finding programs usually perform worse when the G + C content is low. The proposed reasons for this anomaly are that G + C-rich genes have stronger codon bias that makes them easier to identify and that they are more frequent than the genes in A + T-rich isochores. Guigo and Fickett (1995) showed that coding statistics used by gene-finding programs (codon, dicodon, and hexamer frequency) are strongly dependent on G + C content. It is obvious that if a program has only one set of parameters intended to model gene structure (oligonucleotide frequency, length of coding and intergenic region, exon and intron length and number), it will not be able to perform equally well in both A + T- and G + C-rich sequences due to the significant structural differences between genes in these sequences. The reason why programs perform better for G + C-rich sequences could also be because they are trained on the sequence subset of GenBank, which is biased towards G + C-rich sequences. According to Duret et al. (1995), genes from G + C-rich isochores are much more frequently sequenced than those from G + C-poor isochores. Some programs, such as Genscan, HMMgene, and MZEF tested in this survey, recently adopted the approach of using distinct, empirically derived model parameters for distinct G + C compositional regions. Table 3 presents the programs accuracy measures on the sequences with different G + C content. We partitioned our dataset into four groups according to the G + C content of the sequences. These groups are closely related to previously defined isochores except that the very G + C-rich isochore was split into two groups because it was heavily populated. Seven percent (14/195) of the sequences came from L1 + L2 isochores (more precisely with G + C% 40%), 35% (69/195) of sequences from H1 + H2 (40% <G + C% 50%) and 57% (112/195) of sequences from H3 (G + C% >50%), which were subsequently split into two groups (50% <G+C% 60% and G + C% >60%), containing 93 and 19 sequences, respectively. These percentages are significantly inconsistent with the results from Bernardi (1993), which points out the huge bias for the G + C- rich sequences in GenBank. Consistent with the observations made in Burset and Guigo (1996), it seems that some programs are sensitive to the G + C content of a sequence, performing better when the sequence is G + C-rich. The programs which exhibited this in our analysis are FGENES on the nucleotide level, GeneMark.hmm and Genie on both levels, and HMMgene marginally on the exon level. Among programs that are known to use different parameter sets for different G + C content, Genscan s 822 Genome Research

7 Gene-Finding Programs on Mammalian Sequences Table 3. Accuracy versus G + C Content <40%(14) 40 50%(69) 50 60%(93) >60%(19) C + G content AC (Esn+Esp)/2 AC (Esn+Esp)/2 AC (Esn+Esp)/2 AC (Esn+Esp)/2 FGENES GeneMark.hmm Genie Genscan HMMgene Morgan MZEF The HMR195 dataset was partitioned according to the G + C% content of the sequences. The number in parenthesis in the header of each column represents the number of sequences belonging to each partition. For each program, AC and (ESn+ESp)/2 are averaged over all sequences belonging to the particular partition for which they are defined. and HMMgene s prediction accuracy is relatively independent of the base composition, but MZEF still has very variable results, especially on the exon level, that are not proportional to the G + C content of a sequence. The situation is similar for Morgan. There is one peculiarity noted in Table 3: All of the programs except Morgan have the lowest accuracy measures averaged on the sequences with G + C content between 40% and 50%. Because this is not the region with the lowest G + C composition, it is not clear whether the programs really do perform most poorly for this type of sequences or whether there is some characteristic of our test set that causes this slight drop in prediction accuracy. Exon Length The length distributions of different gene elements differ considerably from each other. Introns seem to have an approximate geometric length distribution (Hawkins 1988; Burge 1997), which is a characteristic of a discrete stochastic process with the memoryless property (Karlin and Taylor 1975). This supports the idea that introns do not have any significant constraints on their length, except that the minimal number of nucleotides (70 80) is required (Wieringa et al. 1984). Exons, in contrast, have significant functional constraints. The exon length plays an important role in proper splicing and inclusion in the mature mrna (Dominski and Kole 1991). These constraints have shaped the exon length distribution quite differently from geometric distribution. The length distribution depends on the exon type. Internal exons have length distribution close to a Gaussian distribution with a broad peak between 100 and 170 bp (Hawkins 1988). Hawkins calculated the mean internal exon length to be 137 bp (in our dataset 136 bp), and he observed very few exons shorter than 50 or longer than 300 nucleotides. Length distributions for initial and terminal coding exons are not recognizable statistical distributions. They are still substantially peaked around 60 and 160 bp, respectively (Hawkins 1988; Burge 1997), but do not have a steep drop-off in density after 300 bp. Both types of exons are more variable in length than internal exons, and their calculated means are 134 bp for initial exons and 198 bp for terminal exons (Hawkins 1988) (in our dataset 207 bp for initial and 265 bp for terminal). For the fourth class of exons, single-exons, the length distribution is not known, but in general they are much longer than any other type of exons and their mean length is calculated to be 1300 bp (Hawkins 1988) (in our dataset 1010 bp). In our analysis, we grouped exons by both their annotated length and their predicted length and averaged the accuracy measures in each group. Because many of the programs tested in this analysis (Genie, GeneMark.hmm, Genscan, and HMMgene) use explicit duration HMM, which has associated length distribution to each state of the model, it is interesting to see how these distributions influence the accuracy of their exon prediction. From Table 4 it can be observed that the general trend of all seven of the programs is to have a very low proportion of correctly predicted short exons, which then slowly but monotonically rises with the length of annotated exons. For almost all of the programs, exons are most accurately predicted when their length ranges between 75 and 200 nucleotides (these exons were the most common: 560 of 839). The exons longer than 200 nucleotides (our dataset contained 131 of these exons) seem more difficult to predict correctly, and the accuracy measures drop further as the length increases. The exception is HMMgene, which predicts longer exons with the same accuracy as the more common mediumlength exons. The exons shorter then 25 bases (there were only 17) are missed in 41% of cases for FGENES, up to 88% for MZEF. The most plausible explanation for this Genome Research 823

8 Rogic et al. Table 4. Accuracy versus Exon Length Length range of exons in bp Programs 0 24 (22) (49) (91) (130) (440) (91) 300+ (125) FGENES (0.33) (0.42) (0.64) (0.75) (0.81) (0.61) (0.66) GeneMark.hmm (0.12) (0.51) (0.58) (0.72) (0.73) (0.62) (0.45) Genie (0.18) (0.47) (0.66) (0.81) (0.83) (0.68) (0.69) Genscan (0.29) (0.81) (0.79) (0.85) (0.76) (0.71) (0.65) HMMgene (0.42) (0.76) (0.75) (0.77) (0.85) (0.72) (0.74) Morgan (0.14) (0.14) (0.31) (0.57) (0.57) (0.41) (0.35) MZEF (0.00) (0.44) (0.45) (0.58) (0.73) (0.53) (0.26) The HMR195 dataset was partitioned according to the length of the annotated exons. The number in parenthesis in the header of each column represents the number of actual exons belonging to each partition. For each program, CRa (the proportion of real exons that are correctly predicted [the upper number]) and CRp (the proportion of predicted exons that are correct [the number in parentheses]) are averaged over all sequences belonging to that particular partition. phenomenon is that the length of the coding region is too short to be clearly distinguished from surrounding noncoding regions. In addition, there is biochemical evidence that this type of exon is inefficiently spliced in vivo without the presence of special splicing activating sequences (Dominski and Kole 1991). Finally, the associated length distributions used by some programs do not favor very short exons, and depending how these distributions are used by the systems, this may cause poor prediction for this type of exon. Although very long exons are less likely to be predicted correctly than medium-length exons, they are most unlikely to be completely missed. The number of partially predicted exons longer than 300 nucleotides is relatively large (data not shown), and only <7% of them are completely missed (the exception is MZEF with 33% of exons missed). Lastly, it can be noted from Table 4 that there is usually a significant difference between the proportion of annotated exons that are correctly predicted (CRa) and the proportion of predicted exons that are exactly correct (CRp) for very short exons. The reason for this is that whereas programs FGENES, Genie, and Morgan overpredict short exons, the rest of the programs underpredict them the total number of short exons predicted by each of these programs is much lower than the actual number of exons of the same size. Again, this may be the consequence of exon length distribution built into the gene-finding programs. This discrepancy in the numbers of real and predicted exons is much smaller for the longer exons. Exon Type and Signal Prediction Table 5 summarizes accuracy measures for different exon types. What can be observed is a striking difference between the proportion of correctly predicted internal exons on one side and the proportion of correctly predicted initial and terminal exons on the other side. This difference is partially eroded with a high number of partially predicted initial and terminal ex- Table 5. Programs Accuracy versus Exon Type initial (152) internal (601) Exon type terminal (152) single (43) FGENES (0.55) (0.78) (0.58) (0.83) GeneMark.hmm (0.48) (0.72) (0.51) (0.65) Genie (0.45) (0.82) (0.57) (0.68) Genscan (0.71) (0.76) (0.73) (0.83) HMMgene (0.72) (0.83) (0.73) (0.79) Morgan (0.35) (0.46) (0.36) MZEF The HMR195 dataset was partitioned according to the type of the annotated exons. The number in parenthesis in the header of each column represents the number of actual exons belonging to each partition. For each program, CRa (the upper number) and CRp (the number in parentheses), are averaged over all sequences belonging to that particular partition. 824 Genome Research

9 Gene-Finding Programs on Mammalian Sequences ons (data not shown), especially if we allow the predicted exon to be of any type; but still initial and terminal exons are more likely to be completely missed than internal exons. For single-exon genes, the situation is similar in the sense that the proportion of correctly predicted single exons (CRa) is usually significantly lower than CRa for internal exons (the exceptions are HMMgene and Genie), but they have very high PCa, the proportion of partially predicted exons (the extreme case is GeneMark.hmm with CRa = 0.30 and PCa = 0.56 for single exons). The difference is that single-exon genes are very rarely missed, and the proportion of missed exons of this type is the lowest among all exon types and all programs. The only program that is almost equally successful in predicting exons of any type is HMMgene, which also has the highest proportion of correctly predicted exons (CRp) for initial, terminal and single exons among all of the programs. This HMMgene characteristic surely contributes to its excellent results on our dataset. Why are initial, terminal, and single exons more difficult to identify? The only obvious structural differences among different types of exons are the signals bordering them; there are no studies showing that codon usage (hexamer frequency) fluctuates among different exon types. The difference in exon length could be a reason, since internal exons (136 bp in our dataset) belong to a group of exons more likely to be identified correctly than exons longer than 200 bp, which is the case with initial and terminal exons (207 bp and 265 bp, respectively) (Table 4). However, the differences in accuracy level shown in Table 4 do not compensate for the large differences shown in Table 5. Our hypothesis that signal prediction is mainly responsible for the difference in accuracy levels is supported by the results given in Table 2: The detection of start and stop codons is much less accurate than that of acceptor and donor sites (again, the exception is HMMgene), and the difference in accuracy level is proportional to the accuracy level difference for initial and terminal exons versus internal exons shown in Table 5. As noted above, during the assembly of HMR195 we were not able to validate the locations of annotated start and stop codons. Consequently, prediction accuracy measures calculated for these signals as well as subsequent analysis and discussion strongly rely on the correctness of their annotation in GenBank. Burge and Karlin (1998) presented similar results for Genscan signal detection on a different dataset. The situation is a bit more complex for single-exon genes: on the one hand they contain both start and terminal codons which should complicate their identification even further, but on the other their average length in our dataset is 1022 bp, which according to the analysis described above in the section entitled Exon Length makes them hard to predict exactly, but difficult to miss. This directly corresponds to the results shown in Table 5. What can also be observed in Table 5 is that for the programs FGENES, GeneMark.hmm, and Genscan there is a significant difference between CRa and CRp for the single exons. Analogous to the case of very short exons, the cause of this phenomenon is that these programs are conservative in predicting single-exon genes: the number of single exons predicted by any of these programs is much lower then the number of real ones in our dataset. The consequence of this is that single exons predicted by these programs have a very good chance of being correct, while many real such exons remain unidentified. Exon Probabilities and Scores Each of the programs evaluated in this study except GeneMark.hmm has a scoring scheme for its exon prediction. Genscan, HMMgene and MZEF have a probability score for each exon predicted that is supposed to be a quantitative measure of the likelihood that the given exon is correct. Morgan s scores were originally intended to be probabilities, but that intention was not followed through subsequent upgrades, and what is left is a scale with no formal meaning except that very high scores result from motifs that Morgan has seen Table 6. Accuracy versus Probability Probability range of predicted exons Programs Genscan (112) (159) (132) (93) (481) HMMgene (91) (173) (136) (96) (406) MZEF (111) (104) (72) (258) The HMR195 dataset was partitioned according to probability of the predicted exons. For each program, CRp (proportion of predicted exons that are correct) is averaged over all sequences belonging to that particular partition. The number in parenthesis is the number of exons belonging to each partition. Genome Research 825

10 Rogic et al. before. Genie uses the bit score against a background distribution that is dependent on the length of the predicted exon and thus the scores cannot be meaningfully compared to other exon scores. The way FGENES calculates exon scores is not known to us. Since the nature of Morgan s and Genie s scores makes them uninformative for a user, we tested the reliability of exon score for the other four programs: FGENES, Genscan, HMMgene, and MZEF. The results for FGENES (data not shown) appear to show that its scores are not directly useful. Most of the exons predicted have a score <10, and CR values average to the similar levels for any subregion on the scale from 0 to10. The only informative exon score is above 10, since, at least in our dataset, these exons are correctly predicted in 90% of cases, which is significantly higher than the CR for exons with lower scores. However, scores this high are rarely assigned to an exon prediction. The accuracy measures for different regions of probability scores for Genscan, HMMgene, and MZEF are displayed in Table 6. What can be observed is that CR values are monotonically rising with the increase in exon probability. For Genscan and HMMgene, these values are usually close to the lower boundary of a probability range (the exception is the probability region , where CR values are lower than probabilities). MZEF, on the other hand, significantly overestimates probabilities for its exon predictions. If we relax our criterion and include partially predicted exons (the results for PC values are not shown), CR values will reach and sometimes overreach the upper boundary of a probability region (CR values for MZEF will correspond to the probability region average). This analysis shows that in the case of Genscan and HMMgene, the exon probability score can be a very useful guide to the reliability of the exon prediction. Phylogenetic Specificity All of the programs analyzed in this survey were trained on human sequences except for Morgan, which was trained on the dataset of vertebrate sequences collected by Burset and Guigo (1996). Since the dataset we used to test the programs was composed of 103 human and 92 murine (82 Mus musculus and 10 Rattus norvegicus) sequences, we wanted to investigate whether such a phylogenetic mix can corrupt the performance of the gene-finding programs, especially those calibrated for human sequences. The AC values on the nucleotide level and for (ESn + ESp)/2 on the exon level for each of the programs, but separately for human and murine sequences, are given in Table 7. It can be observed that the difference in accuracy measures between human and mouse/rat are marginal. Even more interesting is that in most cases the values for murine sequences are higher than the values for the human sequence, even though the model parameters of the programs were learned from the set of human sequences. It is likely that such differences are not statistically significant and that they would also be observed if the results on two different human sequence sets were compared. This hypothesis is also supported by the comparison of the human and mouse grammars constructed by Dong and Searls (1994), where no significant differences were found. Conclusions The results obtained in this analysis indicate that the new generation of programs has significantly higher accuracy then the programs analyzed by Burset and Guigo in Comparing the programs with the highest approximation correlation in their study and our study, we find that this value has improved from 0.78 (FGENEH) to 0.91 (Genscan and HMMgene), a 17% increase, while the highest averaged exon sensitivity and specificity has improved from 0.64 (FGENEH) to 0.76 (HMMgene), a 19% increase. The behavior of the programs on the sequences with different G + C content is not systematic. Some programs accuracy appears to be slightly dependent Table 7. Phylogenetic Specificity Nucleotide accuracy-ac Exon accuracy (ESn+ESp)/2 Programs Trained on human murine human murine FGENES human GeneMark.hmm human Genie human Genscan human HMMgene human Morgan vertebrate MZEF human The HMR195 dataset was split into two species subsets containing 103 human and 92 murine sequences. For each subset and each program, AC and (ESn+ESp)/2 were averaged over all sequences belonging to the particular subset. 826 Genome Research

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Tues, Nov 29: Gene Finding 1 Online FCE s: Thru Dec 12 Thurs, Dec 1: Gene Finding 2 Tues, Dec 6: PS5 due Project presentations 1 (see course web site for schedule) Thurs, Dec 8 Final papers due Project